What experience is required for this Site Reliability Engineer - Remote Night Shift - Immediate Joiner role?

This position requires mid-level experience.

Site Reliability Engineer - Remote Night Shift - Immediate Joiner at Yashasvini Recuriter Services Pune | Remote

Role & Responsibilities

We are seeking a skilled Site Reliability Engineer (SRE) with 2-4 years of experience to ensure the utmost reliability, availability, performance, and scalability of our production systems. This pivotal role emphasizes operational excellence, proactive incident management, robust monitoring, and in-depth infrastructure debugging, built upon a strong foundation in IT systems, networking, and Linux environments.

Required Technical Skills

SRE & Reliability

Demonstrated strong understanding of SRE principles, including reliability, scalability, and fault tolerance.
Proven experience with incident response, effective escalation procedures, and conducting thorough postmortems.
Solid knowledge of capacity planning and performance tuning methodologies.

Cloud & Infrastructure

Hands-on expertise with Amazon Web Services (AWS), specifically EC2, EKS, VPC, IAM, ALB/NLB, RDS, S3, and CloudWatch.
Experience operating Kubernetes in a production setting, managing pods, services, ingress, and autoscaling.
Proficiency in containerization using Docker.

Monitoring & Observability

Practical experience with a variety of monitoring and observability tools:

Prometheus, Grafana
CloudWatch
ELK Stack / OpenSearch / Loki
SigNoz / Datadog / New Relic
Ability to design meaningful alerts that balance low noise with high signal accuracy.
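
One common way to balance low noise against high signal is multi-window burn-rate alerting: page only when the error budget is being spent too fast over both a long and a short window. The sketch below is illustrative only. The function names, the 99.9% SLO target, and the 14.4 threshold are assumptions chosen for the example, not requirements of this role.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is spent.

    A burn rate of 1.0 means the budget would be exactly used up by the end
    of the SLO period; higher values mean it runs out sooner.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_ratio = 1 - slo_target
    return error_ratio / allowed_error_ratio


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float = 0.999,   # assumed SLO: 99.9% of requests succeed
    threshold: float = 14.4,     # assumed burn-rate threshold for paging
) -> bool:
    """Page only if BOTH windows exceed the burn-rate threshold.

    The long window filters out brief blips (less noise); the short window
    confirms the problem is still happening right now (more signal).
    """
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    short_burn = burn_rate(short_window_error_ratio, slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Sustained failure: 2% errors over the long window, 3% over the short one.
print(should_page(0.02, 0.03))
# Already recovered: the long window is still elevated, the short one is clean.
print(should_page(0.02, 0.0005))
```

Requiring both windows means an incident that has already recovered stops paging quickly, while a short spike never pages at all.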

IT & Systems Fundamentals

Advanced Linux administration skills, covering processes, memory, disk, CPU, and system limits.
Comprehensive understanding of networking fundamentals, including TCP/IP, DNS, HTTP/HTTPS, load balancing, firewalls, and SSL/TLS.
Knowledge of storage concepts such as block vs. object storage, IOPS, and latency.
Experience troubleshooting complex OS-level and network-level issues.

Automation & Tooling

Proficiency in scripting languages like Bash and Python.
Experience with Infrastructure as Code using Terraform or CloudFormation.
Familiarity with CI/CD pipeline support, including Jenkins, GitHub Actions, and GitLab CI.

Professional Attributes

High-energy, positive attitude with a demonstrated ability to learn quickly.
Strong analytical and problem-solving skills.
Embraces AI-powered development as a significant productivity multiplier.
Excellent communication skills, essential for effective collaboration across a distributed team.
A true team player with a proactive mindset and a commitment to long-term success.
Excellent time-management skills and a strong ownership mindset.
Full Software Development Life Cycle (SDLC) experience, from design and deployment to ongoing maintenance.

What You Will Be Doing

Ensure the high availability, reliability, and optimal performance of production environments.
Operate and provide comprehensive support for large-scale systems running on AWS and Kubernetes.
Define and meticulously monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Build and maintain robust monitoring, alerting, and observability platforms.
Manage and respond to production incidents, participate in on-call rotations, and conduct post-incident Root Cause Analysis (RCA).
Debug complex infrastructure, OS, network, and application issues.
Actively reduce toil through automation and the implementation of standard operating procedures (SOPs).
Collaborate closely with engineering teams to enhance system reliability and resilience.
Plan and rigorously test disaster recovery (DR) and failover strategies.
Maintain up-to-date operational documentation and comprehensive runbooks.
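
For candidates less familiar with the SLI, SLO, and error-budget terms used above, here is a minimal sketch of how the three relate for a request-based availability objective. The SloWindow class, its field names, and the sample numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO evaluation window (hypothetical structure)."""

    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def sli(self) -> float:
        """Availability SLI: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing failed
        return 1 - self.failed_requests / self.total_requests

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        budget = self.error_budget()
        if budget == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / budget


# Assumed sample numbers: one million requests, 400 failures, 99.9% target.
window = SloWindow(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(f"SLI: {window.sli():.4%}")
print(f"SLO met: {window.sli() >= window.slo_target}")
print(f"Error budget remaining: {window.budget_remaining():.1%}")
```

In this example the SLO tolerates about 1,000 failed requests, so 400 failures leave roughly 60% of the budget unspent. A negative remainder would mean the objective was missed for the window.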
