
Site Reliability Engineer
Qualifications & Requirements
Experience Level: Mid Level
Full Job Description
We are seeking skilled Site Reliability Engineers, DevOps Engineers, or Linux & Cloud Engineers with a minimum of 2 years of hands-on production experience to support our critical nightly operations and infrastructure. The ideal candidate will possess strong expertise in Linux systems, cloud platforms (AWS, Azure, GCP), Kubernetes environments, containerization, automation, deployment strategies, monitoring, troubleshooting production workloads, and fundamental networking concepts. This role is crucial for maintaining the high availability, reliability, and performance of our services.
Key Responsibilities
- Manage, monitor, and support Linux-based production and staging environments.
- Oversee deployments, environment maintenance, patching, and system updates.
- Operate and troubleshoot production Kubernetes (K8s) clusters, containerized workloads, and cloud infrastructure.
- Deploy, manage, and troubleshoot Docker containers, including a solid understanding of Docker networking concepts.
- Utilize monitoring and observability tools to ensure system health and respond effectively to production incidents.
- Configure, maintain, and optimize CI/CD pipelines for automated deployments.
- Ensure the high availability, reliability, and performance of production services.
- Collaborate closely with development teams to resolve application, infrastructure, and deployment challenges.
- Maintain comprehensive documentation for production configurations, deployment procedures, and operational guidelines.
- Troubleshoot networking issues across applications, servers, containers, and cloud environments in a production setting.
Required Skills and Experience
- At least 2 years of hands-on experience handling production systems in DevOps, Linux, or Cloud environments.
- Strong Linux administration skills, including experience with Ubuntu, CentOS, or RHEL.
- Proven experience with major cloud platforms such as AWS, Azure, or GCP.
- Hands-on experience with Kubernetes (K8s), including deploying, monitoring, and troubleshooting workloads (EKS/AKS/GKE experience is a plus).
- Proficiency with Docker and container networking concepts (e.g., bridge/overlay networking, ingress, load balancing).
- Experience with monitoring and logging tools like Prometheus, Grafana, CloudWatch, or the ELK stack.
- Experience with MongoDB operations, including deployment, basic administration, backups, and monitoring in production or containerized environments.
- Solid understanding of TCP/IP protocols and networking fundamentals in a DevOps production context (DNS, routing, subnetting, load balancing, firewall rules, security groups).
- Good understanding of system monitoring and log analysis.
- Proficiency in Git for version control and scripting languages such as Shell or Python.
Nice to Have
- Experience with Infrastructure as Code (IaC) tools like Terraform, Ansible, or CloudFormation.
- Familiarity with advanced Kubernetes concepts (e.g., Helm, scaling strategies, security best practices, upgrades).
- Exposure to Site Reliability Engineering (SRE) practices, including alerting strategies and incident response management.
This is a permanent position based in Navi Mumbai, India. Immediate joiners are preferred.
Company
ResourceDekho
ResourceDekho is a global leader in business solutions, IT services, and resource outsourcing, dedicated to enabling organizations to achieve digital transformation. Our comprehensive offerings span infr...