Morgan Stanley
Career Pages

Site Reliability Engineer on AI Pla...

Bengaluru, KA, IN
Full Time
Senior Level

Full Job Description

Morgan Stanley is seeking a Director-level Site Reliability Engineer (SRE) to join its AI Platform team in Bengaluru, India. The role covers operating, monitoring, and maintaining the infrastructure that GenAI applications depend on, including training, inference, feature stores, data ingestion, and model serving. You will design and build automation to reduce manual toil, and develop infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, and Kubernetes.

Key responsibilities include establishing and enforcing SLOs/SLIs/SLAs, error budgets, alerting, and dashboards, with a focus on Grafana and Prometheus for metric visualization. You will lead incident response, root cause analysis (RCA), postmortems, and systemic remediation. Capacity planning, scaling strategies, workload scheduling, and resource forecasting are essential, as is optimizing cost-performance tradeoffs in large-scale compute environments. You will also harden systems for security, compliance, auditability, and data governance, and define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms.

You are expected to collaborate with cloud engineers, data engineers, infrastructure, and security teams to deploy and integrate new systems safely. The role includes maintaining runbooks, operational playbooks, documentation, and training materials, and requires participation in 24/7 on-call rotations. Continuous evaluation and integration of new tools and technologies that improve platform reliability is encouraged.

The ideal candidate will have at least 6 years of relevant experience in SRE, infrastructure, or operations for large-scale systems. Strong programming/scripting skills in Python, Go, Java, or equivalent are required, along with deep experience in containerization (Docker) and orchestration (Kubernetes). Proficiency with monitoring, observability, logging, and alerting tools such as Prometheus, Grafana, ELK/EFK, Datadog, and PagerDuty is essential. A solid understanding of SRE techniques and of infrastructure-as-code tools such as Terraform, Helm, CloudFormation, and Ansible is necessary. Familiarity with GPU/AI compute clusters, high-performance data storage, and distributed architectures is expected. Knowledge of networking and systems engineering (TCP/IP, DNS, routing, load balancing, distributed storage) is crucial, as is proven experience in capacity planning, performance tuning, scaling, and incident response. Candidates should be able to demonstrate that they have led RCAs, deployed fixes, and driven reliability improvements. Excellent communication, documentation, and cross-team collaboration skills are required.

Also required are knowledge of various DB engines (SQL, Redis, Kafka, Snowflake) for cloud app storage, experience working with Generative AI development (embeddings, fine-tuning), high-performance computing (HPC), and distributed GPU cluster scheduling (e.g., Slurm, Kubernetes GPU scheduling), and an understanding of ModelOps/MLOps/LLMOps.

Experience in regulated environments such as financial services, including compliance, audit, and security work, is a strong plus. A proven track record of reducing operational toil via automation is highly valued. Proficiency with OpenTelemetry and related observability tools, including Grafana, Loki, Prometheus, and Cortex, is beneficial, as is good knowledge of microservice-based architecture, industry standards for public and private cloud, and data pipeline technologies (Kafka, Spark, Flink). Experience with chaos engineering, canary deployments, and blue/green rollouts is also desirable.

Company

Morgan Stanley

Morgan Stanley is a leading global financial services firm that operates in 1,200 offices across 42 countries, employing over 80,000 people. The company is dedicated to putting clients first, acting w...

Posted on Career Pages