
Senior Production Reliability Engin...
Experience Level: Senior Level
Full Job Description
Join Saviynt's SaaS Operations team as a Senior Production Reliability Engineer in Bengaluru. This role is part of the Monitoring and Alerting team, which merges operational excellence with development expertise to deliver highly available, resilient services through automation and Infrastructure as Code. You will build reliability into the ecosystem by applying best practices in Resiliency Engineering, Automation, Observability, and Chaos Testing. The team thrives on diverse technical backgrounds and offers challenges in software and systems engineering, with a strong emphasis on building and managing Monitoring and Alerting systems. We seek a systems-thinking Principal Engineer with a track record of scaling teams through production insights, operational automation, observability programs, developer guidance, and real-time metrics.
As a Senior Site Reliability Engineer on the Product SRE team, you will report to the Senior Director, Site Reliability Engineering.
WHAT YOU WILL BE DOING
- Create and maintain infrastructure and tools to ensure service reliability and enhance customer experience.
- Collaborate with teams to improve observability, automation, deployment processes, and system reliability.
- Develop, deploy, and manage scalable, dependable infrastructure solutions to support global cloud services.
- Partner with product, operations, and security teams for seamless implementation of features, tools, and updates across the platform.
- Develop and deploy AI-powered tools to increase operational efficiency and drive engineering excellence.
WHAT WE ARE LOOKING FOR
- Implement comprehensive observability for microservices and Kubernetes clusters using tools like OpenTelemetry.
- Build and manage automation tools to streamline deployment, patching, scaling, and infrastructure management.
- Develop scalable portals for SRE dashboards, SLI/SLO/SLA tracking, error budgets, and executive metrics to facilitate data-driven decisions.
- Proficiency in programming and scripting languages such as Java, Python, Go, or Shell.
- Experience with OpenStack cloud, Linux, Kafka, RabbitMQ, Prometheus, Terraform, Kubernetes, Ansible, MLOps, Generative AI, PostgreSQL, and analytics databases.
- Familiarity with AWS solutions; Azure experience is a plus.
- Experience with containerized workloads, particularly Helm, AKS & EKS, other K8s distributions, Docker, and JFrog.
- Hands-on experience with logging and monitoring tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, AWS CloudWatch, Azure Monitor, Log Analytics, and Fluentd.
- Knowledge of Network Security concepts including AWS/Azure Policy, VPN, Active Directory/RBAC, ACLs, NSG rules, and private endpoints.
- Proven experience in implementing advanced observability practices and techniques at scale.
- Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.).
WHAT YOU BRING
- Bachelor’s degree in Computer Science or a related field, or equivalent experience, with 4+ years in Cloud-SRE, DevOps, or Systems Engineering.
- Strong problem-solving abilities, excellent collaboration and communication skills, and a proactive approach to teamwork.
- Knowledge of testing tools and frameworks.
Company
Saviynt
Saviynt is a leading provider of AI-powered identity governance solutions. Their platform manages and governs human and non-human access to an organization's applications, data, and business processes...