Site Reliability Engineer
Experience Level: Senior Level
Full Job Description
About the Role
We are seeking a skilled Site Reliability Engineer to join our team. This role supports complex, multi-tier Java (J2EE/Spring Boot) applications with intricate upstream and downstream dependencies. You will trace application request flows, analyze logs to troubleshoot issues, and resolve application failures. The ideal candidate thrives in a fast-paced environment, self-organizes, and prioritizes effectively amid competing demands.
Key responsibilities include developing and implementing automated CI/CD pipelines, improving our continuous integration and delivery processes, and maturing DevOps practices. You will take part in infrastructure design discussions, working toward a fully automated, robust, and secure environment. You will collaborate with internal SRE and Development teams and with business users, both local and remote, on investigations, testing, and deployments. You will own release management, raise Change Requests, and schedule implementations of fixes and enhancements. A primary focus is keeping application availability high through robust observability, and supporting the production environment with performance tuning, end-to-end troubleshooting, and solid networking fundamentals.
This position requires participation in rotational shifts and on-call rosters to support our critical applications on a 24x7 basis.
Requirements
- 5-7 years of experience as a Site Reliability Engineer, supporting diverse applications and infrastructure in a hybrid-cloud environment spanning on-premises and AWS/GCP platforms.
- Proficiency in supporting Java (J2EE/Spring Boot) or .NET applications, with experience managing incidents, facilitating application recovery, driving root cause analysis, communicating effectively, and managing client relationships alongside Infrastructure Service Support team members.
- Ability to ensure that all production changes adhere to lifecycle methodologies and risk guidelines.
- Experience in application support, including deploying releases, patches, and fixes on the platform.
- Ability to analyze application performance, perform tuning, and ensure high availability and stability of the platform.
- Knowledge of Batch Processing systems and tools.
- Familiarity with Unix/Linux systems, containerization (Docker), and container platforms such as Cloud Foundry, OpenShift, and Kubernetes.
- Strong scripting skills (Shell, Python, or PowerShell) to automate manual tasks.
- Experience with observability tools like Grafana, Kibana, and AppDynamics.
- Hands-on experience with AWS/GCP public cloud services.
- Practical experience with CI/CD tools such as Jenkins, CircleCI, or GitHub Actions, and the ability to understand and define various deployment strategies.
- Hands-on experience with Git, including branching strategies and managing deployments.