
Operations Engineer
Qualifications
Experience Level: Senior Level
- Expertise in Python or JavaScript for automation and tooling.
- Hands-on with cloud environments (AWS, GCP) and orchestration tools like Kubernetes and Terraform.
- Deep understanding of Linux systems, networking, and distributed architectures.
- Experience with observability solutions (Prometheus, Grafana, Datadog, CloudWatch, or New Relic).
- Familiarity with incident management and alerting platforms (PagerDuty, xMatters).
- Proficiency in CI/CD frameworks such as Jenkins.
Full Job Description
About the Role
As an Operations Engineer at Accenture, you will be instrumental in ensuring the stability, scalability, and high availability of our production systems and services. This role requires a strong foundation in Site Reliability Engineering (SRE) principles and serves as a bridge between business application development and IT operations.
You will be responsible for supporting operations and/or managing delivery for production systems, adhering strictly to operational requirements and service level agreements. A key aspect of this role is to leverage automation, observability, and incident response to maintain continuous service reliability while simultaneously accelerating delivery velocity.
As a Site Reliability Engineer, you will design, implement, and maintain robust production systems that consistently meet defined Service Level Objectives (SLOs) and error budgets. By applying software engineering best practices, you will prevent downtime, automate complex operational tasks, and enhance platform performance through advanced observability techniques, fault tolerance, and system resilience.
Key Responsibilities:
- Reliability and Performance: Monitor and optimize system uptime, latency, and throughput to ensure adherence to SLOs and Service Level Indicators (SLIs).
- Incident Management: Lead incident response efforts, manage escalations effectively, conduct thorough root cause analyses (RCAs), and drive postmortem reviews to prevent recurrence.
- Automation and Tooling: Develop and maintain CI/CD pipelines, automate infrastructure management processes, and eliminate manual toil through efficient scripting and orchestration.
- Monitoring and Observability: Implement and manage metrics, logging, and tracing frameworks (e.g., Prometheus, Grafana, ELK, Datadog) to provide real-time visibility into distributed systems.
- Capacity Planning: Conduct comprehensive resource forecasting, design scalable infrastructure solutions, and ensure robust performance under surge conditions.
- Change & Release Management: Collaborate closely with development teams to ensure safe and reliable rollouts of new features, incorporating automated testing and rollback mechanisms.
- Disaster Recovery & Resilience Engineering: Implement multi-region resilience strategies, conduct chaos testing, and automate failover processes for business continuity.
- Process Improvement: Use post-incident analytics to refine operational practices and make data-driven improvements to system reliability.
- Collaboration: Work with product, design, ML, and DevOps teams to build intelligent workflows and enhance user experiences.
- Infrastructure as Code (IaC): Implement IaC using tools such as Terraform, CloudFormation, Azure DevOps, or Pulumi.
- Cloud Expertise: Demonstrate expert knowledge of Cloud IaaS and PaaS services.
Professional & Technical Skills:
- Expertise in Python, Go, Bash, or JavaScript for automation and tooling.
- Hands-on experience with cloud environments including AWS, Azure, and GCP.
- Proficiency with orchestration tools like Kubernetes and Terraform.
- Deep understanding of Linux systems, networking concepts, and distributed architectures.
- Experience with observability solutions such as Prometheus, Grafana, Datadog, CloudWatch, or New Relic.
- Familiarity with incident management and alerting platforms (e.g., PagerDuty, xMatters).
- Proficiency in CI/CD frameworks like Jenkins, GitHub Actions, or GitLab CI.
- Working knowledge of security, compliance, and performance optimization for highly available systems.
This role requires a minimum of 5 years of experience and 15 years of full-time education.
Location: While the position is advertised for Hyderabad, it is based at our Bengaluru office.