
Site Reliability Engineer_Director_...
Experience Level: Senior Level
Full Job Description
As a Site Reliability Engineer focused on Software Production Management and Reliability Engineering in Mumbai, India, you will be instrumental in ensuring the stability and performance of production applications. Your responsibilities will include proactively detecting, troubleshooting, and resolving production issues in collaboration with development and external teams. You will own each issue until it is resolved or a viable workaround is in place, and you will maintain clear, timely communication during outages. You will be responsible for the overall stability of the production environment, developing and refining policies and procedures for application development standards, and enforcing Change Implementation Management guidelines. The role involves servicing requests for data and activities that require production system access, and partnering with development teams early in the application lifecycle to ensure new systems meet production standards. A key aspect is building and maintaining a knowledge base that makes the team more self-reliant in troubleshooting. You will provide deep analytical triage and subject matter expertise in debugging and issue analysis, and recommend ways to prevent future application issues. As a seasoned technical resource, you will contribute expertise in outage management and proactive solutions to enhance user experience.
We expect at least 4 years of relevant experience and a minimum of 7 years in developing/supporting Enterprise Applications. Embracing Agile and DevOps/SRE concepts is essential. Strong analytical skills and experience with problem determination and recovery processes are required. Experience with observability tools such as Prometheus, Grafana, Loki, Kibana, and Splunk is necessary. You should be adept at building relationships with technology teams, business analysts, and vendors. Administrative competence in at least one major programming language or platform (e.g., Perl, PowerShell, Python, Java) is required.
The ideal candidate is a fast learner with strong organizational skills who can manage multiple tasks and high-pressure situations and is driven to learn new technologies. Hands-on experience administering large-scale, high-availability systems and monitoring tools is essential. A BS/MS or equivalent degree is preferred, ideally in a quantitative discipline (Computer Science, Computer Engineering, EE, Math, Physics). Experience with incident "on call" rotations and 24/7 emergency response is required. Experience in the Financial Services sector is a plus.
Company
Morgan Stanley
Morgan Stanley is a leading global financial services firm. The company is dedicated to putting clients first, doing the right thing, leading with exceptional ideas, committing to diversity and inclus...