Software Engineer - Infrastructure (Site Reliability Engineer)
Company: Weavings Manpower Solutions (Client: US MNC in Solar Industry)
Location: Gurugram
Work Model: Hybrid (in-office presence required twice a week)
About the Role
Our client, a prominent US multinational in the solar sector with global reach, is seeking a skilled Site Reliability Engineer to join their team in Gurugram. This is a critical hybrid role requiring collaboration across engineering, automation, and data teams to address diverse infrastructure needs.
Key Responsibilities
- Develop and maintain modular, platform-agnostic GitOps CI/CD pipelines.
- Manage AWS services for multiple engineering teams.
- Deploy and manage custom data store solutions, including sharded MongoDB clusters and Elasticsearch clusters.
- Oversee the deployment and management of Kubernetes resources.
- Implement and manage an end-to-end observability stack, including custom metrics exporters, trace data, application metrics, dashboard design, and querying data from various sources.
- Establish incident response services and design efficient processes.
- Deploy and manage essential platform services such as OPA and Keycloak for Identity and Access Management (IAM).
- Champion best practices for high availability, scalability, Infrastructure as Code (IaC), Kubernetes deployments, and GitOps CI/CD pipeline design within AWS environments.
What You Bring
- Hands-on experience with container runtimes such as Docker, and with Linux system administration.
- Experience with web servers (e.g., Nginx, Apache) and cloud platforms, with a preference for AWS.
- Strong scripting and automation skills (Python and/or Bash) with proven ability to debug and troubleshoot Linux and cloud-native environments.
- Experience building CI/CD pipelines and familiarity with monitoring and alerting tools (e.g., Grafana, Prometheus, exporters).
- Solid understanding of web architecture, distributed systems, and potential single points of failure.
- Familiarity with cloud-native deployment concepts such as high availability, scalability, and bottleneck identification.
- Sound knowledge of networking fundamentals: protocols (SSH, DNS, TCP/IP, HTTP, SSL/TLS) and concepts such as load balancing, reverse proxies, and firewalls.
Good to Have
- Experience in backend development, database setup, and performance tuning.
- Hands-on experience with Kubernetes cluster administration and deployments.
- Experience collaborating with SecOps engineers.
- Basic knowledge of Envoy, service meshes (e.g., Istio), and SRE practices such as distributed tracing.
- Experience with setting up and using OpenTelemetry, centralized logging, and monitoring systems.
Experience and Notice Period
- Experience: 2-5 years
- Notice Period: 2 months (candidates with a notice period of less than 15 days are preferred, given the critical nature of the role).
Technical Skills
- Core: Site Reliability Engineering (SRE), Kubernetes (K8s), AWS at scale, Networking, Linux System Administration, Scripting (Python highly preferred; Bash or similar also accepted), Good Communication Skills.
- Good-to-Have: Proficiency in Data Structures and Algorithms (DSA), System Architecture, Bare Metal experience.
Eligibility Criteria
- Hard Filter: Graduates from IITs, NITs, and BITs only.
