
Site Reliability Engineer
Experience Level: Mid Level
Full Job Description
We are seeking a skilled and motivated Site Reliability Engineer (SRE) to join our team in Hyderabad / Secunderabad, Telangana, India. As an SRE, you will be instrumental in ensuring the reliability, availability, and performance of our systems and applications. You will leverage your technical expertise and knowledge of SRE practices to collaborate with cross-functional teams, drive automation initiatives, and implement best practices to enhance system resilience. If you are a dedicated and detail-oriented SRE professional passionate about maintaining highly reliable systems, we encourage you to apply for this position.
Essential Duties and Responsibilities:
- System Monitoring and Incident Response: Monitor system health, proactively detect issues, and respond to incidents promptly. Participate in incident response activities, including triage, troubleshooting, and resolution, ensuring minimal service disruption.
- Automation and Tooling: Develop and maintain automation scripts, tools, and utilities to streamline operational tasks, reduce manual effort, and improve system efficiency. Utilize scripting languages and configuration management tools for automating routine tasks.
- Performance Optimization: Identify performance bottlenecks, analyze system metrics, and optimize system performance. Collaborate with Development and Operations teams to implement performance tuning measures and ensure optimal resource utilization.
- Infrastructure and Configuration Management: Manage infrastructure resources, including cloud platforms, servers, and network devices. Implement and maintain configuration management practices for consistency and reliability across environments.
- Capacity Planning: Conduct capacity planning exercises to forecast resource requirements and support scalability. Analyze usage patterns, monitor system performance, and recommend infrastructure adjustments to meet demand.
- Incident Analysis and Post-Mortems: Perform root cause analysis for incidents and contribute to post-incident reviews. Identify areas for improvement, implement preventive measures, and update documentation and runbooks.
- System Documentation: Contribute to the development and maintenance of system documentation, runbooks, and standard operating procedures (SOPs), ensuring accuracy and accessibility.
- Collaboration and Communication: Collaborate effectively with cross-functional teams (Development, Operations, Support) to address system issues, implement changes, and improve system reliability. Communicate updates, findings, and recommendations clearly to stakeholders.
- Continuous Improvement: Identify opportunities for automation, process enhancements, and tooling improvements. Drive initiatives to optimize system reliability, streamline workflows, and improve operational efficiency.
- Security and Compliance: Collaborate with Security and Compliance teams to ensure adherence to security best practices, regulations, and standards. Participate in security assessments, vulnerability management, and risk mitigation efforts.
- Perform other duties as assigned.
- Comply with all policies and standards.
Qualifications:
Education:
- Bachelor's Degree or equivalent experience.
Work Experience:
- Typically 2+ years of relevant work experience in Site Reliability Engineering, system administration, or infrastructure management.
Knowledge, Skills, and Abilities:
- Strong understanding of SRE principles, practices, and methodologies.
- Proficiency in scripting languages such as Python, Bash, or PowerShell.
- Familiarity with configuration management tools like Ansible, Puppet, or Chef.
- Experience with cloud platforms such as AWS, Azure, or GCP.
- Knowledge of containerization technologies like Docker and orchestration tools like Kubernetes is a plus.
- Understanding of networking concepts, load balancing, and distributed systems.
- Experience with monitoring and observability tools like Prometheus, Grafana, or the ELK stack.
- Excellent problem-solving and troubleshooting skills.
- Strong attention to detail and the ability to work efficiently in a fast-paced environment.
- Effective communication and collaboration skills, with the ability to work well in a team.
Work Environment:
- Work in a clean, pleasant, and comfortable office setting. Reasonable accommodations may be made for individuals with disabilities.
- This position is 100% in-office.
Company
TriNet
TriNet is a leading provider of comprehensive human resources solutions for small to midsize businesses (SMBs) across India, including the bustling regions of Hyderabad and Secunderabad in Telangana. ...