
Yashasvini Recuriter Services Pune•2h ago
Naukri
Site Reliability Engineer
Remote
Mid Level
2000000-3000000
Full Job Description
Role & Responsibilities
We are seeking a skilled Site Reliability Engineer (SRE) with 2-4 years of experience to ensure the reliability, availability, performance, and scalability of our production systems. The role emphasizes operational excellence, proactive incident management, robust monitoring, and in-depth infrastructure debugging, and it requires a strong foundation in IT systems, networking, and Linux environments.
Required Technical Skills
SRE & Reliability
- Demonstrated strong understanding of SRE principles, including reliability, scalability, and fault tolerance.
- Proven experience with incident response, effective escalation procedures, and conducting thorough postmortems.
- Solid knowledge of capacity planning and performance tuning methodologies.
Cloud & Infrastructure
- Hands-on expertise with Amazon Web Services (AWS), specifically EC2, EKS, VPC, IAM, ALB/NLB, RDS, S3, and CloudWatch.
- Experience operating Kubernetes in a production setting, managing pods, services, ingress, and autoscaling.
- Proficiency in containerization using Docker.
Monitoring & Observability
Practical experience with a variety of monitoring and observability tools:
- Prometheus, Grafana
- CloudWatch
- ELK Stack / OpenSearch / Loki
- SigNoz / Datadog / New Relic
- Ability to design meaningful alerts that balance low noise with high signal accuracy.
IT & Systems Fundamentals
- Advanced Linux administration skills, covering processes, memory, disk, CPU, and system limits.
- Comprehensive understanding of networking fundamentals, including TCP/IP, DNS, HTTP/HTTPS, load balancing, firewalls, and SSL/TLS.
- Knowledge of storage concepts such as block vs. object storage, IOPS, and latency.
- Experience troubleshooting complex OS-level and network-level issues.
Automation & Tooling
- Proficiency in scripting languages like Bash and Python.
- Experience with Infrastructure as Code using Terraform or CloudFormation.
- Familiarity with supporting CI/CD pipelines built on tools such as Jenkins, GitHub Actions, and GitLab CI.
Professional Attributes
- High-energy, positive attitude with a demonstrated ability to learn quickly.
- Strong analytical and problem-solving skills.
- Embraces AI-powered development as a significant productivity multiplier.
- Excellent communication skills, essential for effective collaboration across distributed teams.
- A true team player with a proactive mindset and a commitment to long-term success.
- Excellent time-management skills and a strong ownership mindset.
- Full Software Development Life Cycle (SDLC) experience, from design and deployment to ongoing maintenance.
What You Will Be Doing
- Ensure the high availability, reliability, and optimal performance of production environments.
- Operate and provide comprehensive support for large-scale systems running on AWS and Kubernetes.
- Define and meticulously monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
- Build and maintain robust monitoring, alerting, and observability platforms.
- Manage and respond to production incidents, participate in on-call rotations, and conduct post-incident Root Cause Analysis (RCA).
- Debug complex infrastructure, OS, network, and application issues.
- Actively reduce toil through automation and the implementation of standard operating procedures (SOPs).
- Collaborate closely with engineering teams to enhance system reliability and resilience.
- Plan and rigorously test disaster recovery (DR) and failover strategies.
- Maintain up-to-date operational documentation and comprehensive runbooks.