Cloud Site Reliability Engineer - Mumbai
Cornerstone is looking for a skilled Cloud Site Reliability Engineer (SRE) in Mumbai. This role focuses on ensuring the reliability and performance of mission-critical systems through expertise in Kubernetes, incident response, and observability.
Key Responsibilities
- Troubleshoot and resolve issues in Kubernetes clusters, including failed deployments, pod failures, networking problems, and autoscaling.
- Lead incident management, including on-call duties, root cause analysis, and enhancing incident response procedures.
- Develop and maintain robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, and the ELK stack.
- Manage Kibana dashboards and operate the ELK stack so that the logging infrastructure stays highly available and performs well.
- Integrate metrics, logs, and traces into a unified observability platform.
- Build and refine alerting pipelines to minimize noise and maximize signal for production incidents.
- Contribute to infrastructure automation using tools such as Terraform and Helm.
- Implement and support CI/CD pipelines for automated testing, deployment, and rollback across various environments.
- Participate in shift rotations and drive continuous improvements in observability and response systems.
Qualifications
- Minimum of 2 years of experience in an SRE, DevOps, or Infrastructure Engineer role.
- Bachelor's degree in Computer Science, IT, or a related technical field.
- Hands-on experience with the AWS and GCP cloud platforms.
- Deep practical experience with Kubernetes (EKS, AKS, GKE).
- Strong understanding of Linux internals, container orchestration, and microservice architectures.
- Proficiency with monitoring and logging tools including Prometheus, Grafana, InfluxDB, and the ELK stack (Elasticsearch, Logstash, Kibana).
- Experience with incident response and alerting tools like PagerDuty.
- Familiarity with Kafka (topic monitoring, consumer health), ElastiCache/Redis (caching patterns, troubleshooting), and InfluxDB (time-series metrics).
- Ability to write and maintain automation scripts in Bash, Python, or Go.
