
Staff Site Reliability Engineer
Experience Level: Senior Level
Full Job Description
Staff Site Reliability Engineer - Product Area Focus
Sumo Logic is seeking a Staff Site Reliability Engineer with a Product Area Focus to join our team in Noida, Uttar Pradesh, India. This hybrid role is crucial for maintaining and enhancing the availability of our planet-scale observability and security products.
As a Staff SRE, you will own the availability of our most critical product features, driving sustained operational excellence. You will collaborate with a global SRE team, executing on a reliability roadmap tailored to specific product areas. Your work will involve optimizing operations, increasing efficiency in cloud resource utilization and developer time, strengthening security, and accelerating feature velocity for our developers. You will also work closely with multiple teams to optimize their microservices and improve the daily experience of engineers within your supported product areas.
Responsibilities
- Support engineering teams by maintaining and executing a reliability roadmap that improves reliability, maintainability, security, efficiency, and velocity, and by helping those teams deliver the improvements.
- Collaborate with developer infrastructure, Global SRE, and product area engineering teams to define and continuously refine the reliability roadmap.
- Participate in defining, evolving, and managing Service Level Objectives (SLOs) for various teams within your product area.
- Engage in on-call rotations to understand operational workloads, with the goal of improving the on-call experience and reducing operational burden for microservices and related components.
- Implement projects to optimize and refine the on-call experience for your engineering teams.
- Continuously enhance the lifecycle of microservices and architectural components from design through operation and refinement.
- Develop code and automation to reduce operational workload, boost efficiency, enhance security posture, eliminate toil, and enable faster feature delivery by Sumo Logic developers.
- Partner with developer infrastructure teams to accelerate the adoption of tools that advance your reliability roadmap, identifying the needs of the engineering teams you support and contributing features and bug fixes.
- Scale systems sustainably through automation and evolve systems by advocating for changes that improve reliability and velocity.
- Facilitate blame-free root cause analysis meetings for incidents to drive learning and improvement.
- Contribute to and continuously improve our global Incident Response Coordination (IRC) for all products.
- Drive root cause identification and issue resolution with relevant teams.
- Operate effectively within a fast-paced, iterative development environment.
- Participate in hiring and mentoring new team members.
Required Qualifications and Skills
- Experience in cloud-native application development, applying best practices and design patterns.
- Strong debugging and troubleshooting skills across the entire technology stack.
- Deep understanding of AWS Networking, Compute, Storage, and managed services.
- Proficiency with modern CI/CD and infrastructure tooling, including Kubernetes, Terraform, Ansible, and Jenkins.
- Experience with the full lifecycle support of services, from creation to production support.
- Familiarity with Infrastructure as Code (IaC) practices using technologies like Terraform or CloudFormation.
- Ability to write production-ready code in at least one of the following languages: Java, Scala, or Go.
- Experience with Linux systems and comfort with command-line operations.
- Understanding and application of modern approaches to cloud-native software security.
- Experience delivering value within agile frameworks such as Scrum and Kanban.
- Flexibility and willingness to take on new roles and responsibilities.
- Eagerness to learn and utilize Sumo Logic products for solving reliability and security challenges.
- Bachelor’s or Master’s Degree in Computer Science, Electrical Engineering, or a related scientific or technical discipline.
- 8+ years of professional experience in applied software security roles.
Desirable Skills
- Experience using Sumo Logic products or other observability products for reliability and security initiatives.
- Experience with planet-scale product development.
- Expert-level proficiency in running and operating SaaS products on AWS Cloud.
- Experience with streaming technologies such as Kafka, Kafka Streams, or KSQL.
- Expert-level experience in one or more of the following programming languages: Java, Go, Scala, or Python.
- Expert-level experience with one or more of the following technologies: Terraform, Jenkins, Kubernetes.
- Extensive experience running and tuning JVM workloads at scale.
Company
Sumo Logic
Sumo Logic is a leader in providing an Intelligent Operations Platform that unifies critical security and operational data. This platform is designed to address the complex challenges of modern cybers...