
Site Reliability Engineer II
Experience Level: Mid Level
Full Job Description
Backblaze is seeking a talented Site Reliability Engineer II (SRE II) to join our team in Bengaluru / Bangalore, India. This role is crucial for ensuring the stability, scalability, and reliability of our services and infrastructure. You will focus on developing automation, enhancing observability, and supporting incident response to maintain peak performance for our customer-facing systems.
You will collaborate closely with engineering, product, and operations teams to integrate reliability best practices into daily development and operational workflows. You will also help build tools and processes that improve efficiency and reduce manual work.
Key Responsibilities
Service Reliability & Operations
- Ensure the availability and durability of critical services in production environments.
- Monitor service health using SLIs, SLOs, and error budgets, escalating issues when thresholds are at risk.
- Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
- Adhere to ITIL/OSS processes, including incident, change, problem, and capacity management.
Automation & Tooling
- Develop automation for routine operational tasks to reduce manual intervention and toil.
- Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
- Work with CI/CD pipelines, configuration management, and infrastructure as code tools such as Terraform, Ansible, and Jenkins.
- Write scripts in languages like Bash, Python, or Go to enhance system reliability and efficiency.
Collaboration
- Partner with engineering, product, and operations teams to support resilient system design and operations.
- Assist in capacity planning and disaster recovery exercises.
- Collaborate with vendors and service providers to troubleshoot issues and track SLA performance.
- Document systems, share knowledge, and foster a reliability-focused engineering culture.
Continuous Improvement
- Contribute to playbooks, runbooks, and operational documentation.
- Identify recurring issues and propose long-term solutions.
- Promote reliability-focused practices within development and operations teams.
Qualifications
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 2-4 years of experience in site reliability, systems engineering, or operations.
- Experience with large-scale, production-grade systems.
Technical Skills
- Strong Linux systems administration and troubleshooting skills.
- Familiarity with service reliability concepts including monitoring, alerting, incident response, and root cause analysis.
- Proficiency in at least one scripting language (Python, Bash, or Go).
- Understanding of containerization technologies (Kubernetes, Docker) and microservices.
- Knowledge of incident response and operational best practices.
Preferred Attributes
- Experience in a SaaS, service provider, or distributed systems environment.
- Familiarity with ITIL/OSS practices and SLO/SLA concepts.
- Excellent problem-solving abilities and a strong desire to learn new technologies.
- Experience with cloud platforms such as AWS, GCP, or Azure.
- Ability to work independently, take ownership, and drive projects from identification to resolution.
At Backblaze, we are committed to creating a workplace where everyone feels valued and empowered. We encourage applications from individuals of all backgrounds and experiences. We believe in fairness, in treating our employees well, and in fostering diversity, equity, and inclusion at all levels.
Company
Backblaze
Backblaze is a leading provider of object storage solutions within the open cloud movement, empowering businesses to unlock their budgets, simplify administration, and foster innovation. We enable cus...