
Senior Site Reliability Engineer I
Experience Level: Senior Level
Full Job Description
Sumo Logic is seeking a Senior Site Reliability Engineer I with a Product Area Focus to join their team in Noida, India. This role is crucial for ensuring the availability and operational excellence of Sumo Logic's planet-scale observability and security products. As part of a global SRE team, you will execute on reliability roadmaps for specific product areas, focusing on optimizing operations, enhancing cloud resource efficiency, strengthening security posture, and increasing developer velocity. You will collaborate closely with multiple engineering teams to optimize their microservices and improve the overall experience for engineers within your supported product areas.
Responsibilities
- Support engineering teams by maintaining and executing a reliability roadmap focused on improving reliability, maintainability, security, efficiency, and velocity.
- Collaborate with developer infrastructure, Global SRE, and product area engineering teams to define and refine the reliability roadmap.
- Define, evolve, and manage Service Level Objectives (SLOs) for teams within your product area.
- Participate in on-call rotations to understand operational workloads, improve the on-call experience, and reduce operational burden.
- Develop and implement projects to optimize the on-call experience for engineering teams.
- Enhance the lifecycle management of microservices and architectural components, from design through operation and refinement.
- Write code and automation to reduce operational workload, increase efficiency, improve security, eliminate toil, and accelerate feature delivery.
- Partner with developer infrastructure teams to accelerate the adoption of tools that advance your reliability roadmap, identifying needs and contributing fixes/features.
- Drive sustainable system scaling through automation and system evolution for improved reliability and velocity.
- Facilitate blame-free root cause analysis meetings for incidents to drive continuous improvement.
- Lead root cause identification and issue resolution processes.
- Operate effectively within a fast-paced, iterative development environment.
Required Qualifications and Skills
- Experience in cloud-native application development, applying best practices and design patterns.
- Strong debugging and troubleshooting skills across diverse technology stacks.
- In-depth knowledge of AWS networking, compute, storage, and managed services.
- Proficiency with modern CI/CD and infrastructure automation tools such as Kubernetes, Terraform, Ansible, and Jenkins.
- Experience in full lifecycle support for services, from creation to production.
- Proficiency in Infrastructure as Code (IaC) using technologies like Terraform or CloudFormation.
- Ability to write production-ready code in Java, Scala, or Go.
- Experience with Linux systems and command-line operations.
- Understanding and application of modern cloud-native software security approaches.
- Experience with agile frameworks like Scrum and Kanban.
- Flexibility and willingness to take on new responsibilities.
- Eagerness to learn and utilize Sumo Logic products for reliability and security challenges.
- Bachelor's or Master's Degree in Computer Science, Electrical Engineering, or a related scientific/technical field.
- 4-6 years of relevant industry experience.
Desirable Skills
- Experience using Sumo Logic products or similar observability tools for reliability and security.
- Experience with planet-scale product development.
- Expert-level experience running and operating SaaS products on AWS Cloud.
- Familiarity with streaming technologies such as Kafka, Kafka Streams, or KSQL.
- Expert-level coding experience in Java, Go, Scala, or Python.
- Expert-level experience with Terraform, Jenkins, or Kubernetes.
- Extensive experience running and tuning JVM workloads at scale.
Company
Sumo Logic
Sumo Logic is a leading provider of cloud-native security and observability solutions, empowering organizations to secure, accelerate, and improve the reliability of their digital operations. Their In...