
Site Reliability Engineer on AI Platform
Experience Level: Senior Level
Full Job Description
Morgan Stanley is seeking a Director-level Site Reliability Engineer (SRE) to join its AI Platform team in Bengaluru, India. The role involves operating, monitoring, and maintaining the infrastructure critical for GenAI applications, including training, inference, feature stores, data ingestion, and model serving. Key responsibilities include:

- Designing and building automation to reduce manual toil, and developing infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, and Kubernetes.
- Establishing and enforcing SLOs/SLIs/SLAs, error budgets, alerting, and dashboards, with a focus on Grafana and Prometheus for metric visualization.
- Leading incident response, root cause analysis (RCA), postmortems, and systemic remediation.
- Capacity planning, scaling strategies, workload scheduling, and resource forecasting, as well as optimizing cost-performance tradeoffs in large-scale compute environments.
- Hardening systems for security, compliance, auditability, and data governance.
- Collaborating with cloud engineers, data engineers, infrastructure, and security teams to safely deploy and integrate new systems.
- Defining disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms.
- Maintaining runbooks, operational playbooks, documentation, and training materials.
- Participating in 24/7 on-call rotations.
- Continuously evaluating and integrating new tools and technologies to enhance platform reliability.
The ideal candidate will have 6+ years of relevant experience in SRE, infrastructure, or operations for large-scale systems. Requirements include:

- Strong programming/scripting skills in Python, Go, Java, or equivalent.
- Deep experience in containerization (Docker) and orchestration (Kubernetes).
- Proficiency with monitoring, observability, logging, and alerting tools such as Prometheus, Grafana, ELK/EFK, Datadog, and PagerDuty.
- A solid understanding of SRE techniques and infrastructure-as-code tools such as Terraform, Helm, CloudFormation, and Ansible.
- Familiarity with GPU/AI compute clusters, high-performance data storage, and distributed architectures.
- Knowledge of networking and systems engineering (TCP/IP, DNS, routing, load balancing, distributed storage).
- Proven experience in capacity planning, performance tuning, scaling, and incident response, with a demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements.
- Excellent communication, documentation, and cross-team collaboration skills, along with a proven track record of reducing operational toil via automation.
- Knowledge of various database engines (SQL, Redis, Kafka, Snowflake) for cloud application storage.
- Experience with Generative AI development (embeddings, fine-tuning), high-performance computing (HPC), and distributed GPU cluster scheduling (e.g., Slurm, Kubernetes GPU scheduling), plus an understanding of ModelOps/MLOps/LLMOps.

The following are strong pluses:

- Experience in regulated environments such as financial services, including compliance, audit, and security.
- Proficiency with OpenTelemetry-compatible tooling including Grafana, Loki, Prometheus, and Cortex.
- Good knowledge of microservice-based architecture, industry standards for public and private cloud, and data pipeline technologies (Kafka, Spark, Flink).
- Experience with chaos engineering, canary deployments, and blue/green rollouts.
Company
Morgan Stanley
Morgan Stanley is a leading global financial services firm that operates in 1,200 offices across 42 countries, employing over 80,000 people. The company is dedicated to putting clients first.