
Site Reliability Engineer on AI Platform
Experience Level: Senior Level
Full Job Description
Morgan Stanley is seeking a Director-level Site Reliability Engineer (SRE) to join its AI Platform team in Bengaluru, India. The role involves operating, monitoring, and maintaining the infrastructure critical for GenAI applications, including training, inference, feature stores, data ingestion, and model serving. Key responsibilities include:

- Designing and building automation to reduce manual toil, and developing infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, and Kubernetes.
- Establishing and enforcing SLOs/SLIs/SLAs, error budgets, alerting, and dashboards, with a focus on Grafana and Prometheus for metric visualization.
- Leading incident response, root cause analysis (RCA), postmortems, and systemic remediation.
- Capacity planning, scaling strategies, workload scheduling, and resource forecasting, as well as optimizing cost-performance tradeoffs in large-scale compute environments.
- Hardening systems for security, compliance, auditability, and data governance.
- Collaborating with cloud engineers, data engineers, infrastructure, and security teams to safely deploy and integrate new systems.
- Defining disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms.
- Maintaining runbooks, operational playbooks, documentation, and training materials.
- Participating in 24/7 on-call rotations.
- Continuously evaluating and integrating new tools and technologies to enhance platform reliability.
The ideal candidate will have 6+ years of relevant experience in SRE, infrastructure, or operations for large-scale systems. Requirements include:

- Strong programming/scripting skills in Python, Go, Java, or equivalent.
- Deep experience in containerization (Docker) and orchestration (Kubernetes).
- Proficiency with monitoring, observability, logging, and alerting tools such as Prometheus, Grafana, ELK/EFK, Datadog, and PagerDuty.
- A solid understanding of SRE techniques and infrastructure-as-code tools such as Terraform, Helm, CloudFormation, and Ansible.
- Familiarity with GPU/AI compute clusters, high-performance data storage, and distributed architectures.
- Knowledge of networking and systems engineering (TCP/IP, DNS, routing, load balancing, distributed storage).
- Proven experience in capacity planning, performance tuning, scaling, and incident response, with a demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements.
- Excellent communication, documentation, and cross-team collaboration skills, along with a proven track record of reducing operational toil via automation.
- Knowledge of various database engines (SQL, Redis, Kafka, Snowflake) for cloud application storage.
- Experience with Generative AI development (embeddings, fine-tuning), high-performance computing (HPC), and distributed GPU cluster scheduling (e.g., Slurm, Kubernetes GPU scheduling), plus an understanding of ModelOps/MLOps/LLMOps.

The following are strong pluses:

- Experience in regulated environments such as financial services, including compliance, audit, and security.
- Proficiency with OpenTelemetry-compatible tooling including Grafana, Loki, Prometheus, and Cortex.
- Good knowledge of microservice-based architecture, industry standards for public and private cloud, and data pipeline technologies (Kafka, Spark, Flink).
- Experience with chaos engineering, canary deployments, and blue/green rollouts.
Company
Morgan Stanley
Morgan Stanley is a leading global financial services firm that operates in 1,200 offices across 42 countries, employing over 80,000 people. The company is dedicated to putting clients first.