About The Role
We are seeking an experienced Operations Engineer to join our team in Noida. This role is critical for supporting the operations and managing the delivery of production systems and services, ensuring they meet all operational requirements and service level agreements.
Project Role Description
As an Operations Engineer, you will be responsible for maintaining the stability, scalability, and high availability of our production systems. You will work at the intersection of business application development and IT operations, leveraging automation, observability, incident response, and performance engineering to ensure continuous service reliability and accelerate delivery velocity.
You will design and maintain production systems that adhere to defined Service Level Objectives (SLOs) and error budgets. Utilizing software engineering principles, your goal will be to prevent downtime, automate operational tasks, and enhance platform performance through robust observability, fault tolerance, and system resilience.
Roles & Responsibilities:
- Reliability and Performance: Monitor and optimize system uptime, latency, and throughput to meet SLOs, as measured by the corresponding SLIs.
- Incident Management: Lead incident response, manage escalations, perform root cause analysis (RCA), and drive postmortem reviews.
- Automation and Tooling: Develop CI/CD pipelines, automate infrastructure management, and eliminate manual toil through scripting and orchestration.
- Monitoring and Observability: Implement metrics, logging, and tracing frameworks (Prometheus, Grafana, ELK, Datadog) for real-time visibility into distributed systems.
- Capacity Planning: Conduct resource forecasting, design scalable infrastructure, and manage performance under surge conditions.
- Change & Release Management: Collaborate with developers to ensure safe, reliable rollout of new features with automated testing and rollback mechanisms.
- Disaster Recovery & Resilience Engineering: Implement multi-region resilience strategies, chaos tests, and failover automation for business continuity.
- Process Improvement: Use post-incident analytics to refine operational practices and drive data-driven improvements in reliability.
- Collaborate with product, design, ML, and DevOps teams to build intelligent workflows and user experiences.
- Implement Infrastructure as Code (IaC) using tools like Terraform, CloudFormation, Azure DevOps, or Pulumi.
- Demonstrate expertise in cloud IaaS and PaaS services.
Professional & Technical Skills:
- Expertise in Python, Go, Bash, or JavaScript for automation and tooling.
- Hands-on experience with cloud environments (Azure, GCP) and orchestration tools like Kubernetes and Terraform.
- Deep understanding of Linux systems, networking, and distributed architectures.
- Experience with observability solutions such as Prometheus, Grafana, Datadog, CloudWatch, or New Relic.
- Familiarity with incident management and alerting platforms (PagerDuty, xMatters).
- Proficiency in CI/CD frameworks like Jenkins, GitHub Actions, or GitLab CI.
- Working knowledge of security, compliance, and performance optimization for highly available systems.
Qualifications:
A minimum of 15 years of full-time education is required. Preferred certifications include AWS Certified Solutions Architect Professional, Microsoft Certified: Azure Solutions Architect Expert, Google Professional Cloud Architect, Certified Kubernetes Administrator (CKA), HashiCorp Certified: Terraform Associate, or Certified DevOps Engineer (AWS, Azure, or Google).
Additional Information:
This position is based at our Bengaluru office, supporting operations in Noida.
