Senior Site Reliability Engineer
Nexthink is at the forefront of digital employee experience management, providing IT leaders with unparalleled insight to see, diagnose, and fix issues at scale that impact employees anywhere, with any application or network, before they are even noticed. We are seeking a proactive and innovative Senior Site Reliability Engineer to join our team in Bengaluru, Karnataka, India. The SRE team at Nexthink plays a crucial role in strengthening our infrastructure and enhancing our ability to deploy, monitor, and scale systems effectively and reliably. You will collaborate with over 50 Product Engineering teams, as well as Technical Platform Engineering, Security, and Architecture teams, to understand reliability requirements, design and implement solutions, and promote their adoption. Join our vibrant and diverse team, where cutting-edge technology meets innovation, and contribute to Nexthink's mission of delivering a seamless digital experience to our global customers.
Key Responsibilities:
- Implement and manage cloud-native systems on AWS using automation and best-in-class tools.
- Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support rapid delivery.
- Design, build, and maintain the infrastructure for our multi-tenant SaaS platform with a focus on reliability, security, and scalability.
- Define and maintain SLOs, SLAs, and error budgets, proactively addressing availability and performance issues.
- Develop infrastructure-as-code (e.g., Terraform) for repeatable and auditable provisioning.
- Build internal platform tools and automation for provisioning, monitoring, and operational efficiency.
- Monitor infrastructure and applications to ensure high-quality user experiences.
- Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
- Act as an Incident Commander during on-call duties, coordinating cross-team responses to maintain SLAs.
- Drive and refine incident response processes to reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
- Independently diagnose and resolve complex issues, minimizing escalations.
- Collaborate with software engineers to embed observability, fault tolerance, and reliability principles into service design.
- Automate runbooks, health checks, and alerting for reliable operations.
- Support automated testing, canary deployments, and rollback strategies for safe, fast, and reliable releases.
- Contribute to security best practices, compliance automation, and cost optimization.
Qualifications:
- Bachelor’s degree in Computer Science or equivalent practical experience.
- 5+ years of experience as a Site Reliability Engineer or Platform Engineer, with a strong grounding in software development best practices.
- Hands-on experience with public cloud services (AWS, GCP, Azure) and supporting SaaS products.
- Strong programming or scripting skills (e.g., Python, Go, Bash) and experience with infrastructure-as-code (e.g., Terraform).
- Proficiency with Kubernetes, container-based deployment (e.g., Docker), and related ecosystems (e.g., Helm).
- Experience supporting multi-tenant microservices architectures.
- Experience with CI/CD pipelines and tools (e.g., Jenkins, GitHub Actions, GitLab CI, FluxCD, Crossplane).
- Experience managing monitoring solutions (e.g., Datadog).
- Comfortable with a rotating on-call schedule, managing critical incidents, and leading post-incident reviews.
- Proficiency in operating and managing production systems, balancing urgency with a methodical approach.
- Strong system-level troubleshooting skills and a proactive approach to incident prevention.
- Deep understanding of Linux systems, networking, and common troubleshooting practices.
- Solid understanding of the network stack (TCP/IP, VPN), cloud architectures (VPC, subnets, firewalls, load balancers), service mesh (e.g., Istio), and storage (e.g., S3, EBS).
- Knowledge of zero-downtime deployment strategies, such as blue/green and canary releases.
- Exposure to compliance standards (SOC 2, ISO 27001, HIPAA) is a plus; FedRAMP experience is a significant advantage.
- Experience with chaos engineering or resilience testing practices.
- Excellent problem-solving skills, a collaborative mindset, and a strong grasp of agile, iterative development.
- Self-driven, highly organized, and capable of managing priorities independently.
- Curiosity to learn new technologies.
- Strong communication, presentation, and team collaboration skills in English.
We encourage you to apply even if you don't meet every single requirement. We are interested in candidates with diverse backgrounds and experiences.
Company
Nexthink
Nexthink is a global leader in Digital Employee Experience (DEX) management software. We empower IT teams with AI-driven, user-centric insights to proactively optimize technology performance, enhance ...