Site Reliability Engineer, HPC / AI Infrastructure at Tesla Asia
Tesla's Supercomputing/AI infrastructure team is seeking a Site Reliability Engineer to support its high-performance computing and machine learning infrastructure, including virtual simulations and Autopilot hardware & silicon design. This role is crucial for managing and optimizing the rapidly growing compute resources necessary for the Optimus, Full Self-Driving (FSD), and Robotaxi efforts.
As an SRE, you will be responsible for maintaining and improving the platform to ensure the FSD and Optimus engineering teams have the necessary tools and resources. This includes AI infrastructure management, monitoring compute/GPU/network metrics, Linux troubleshooting & performance tuning, and security. Your work will directly facilitate neural network training at scale & streamline FSD development.
Responsibilities:
- Support the AI/ML cluster infrastructure on GPU platforms, focusing on systems automation, configuration management and deployment at scale.
- Improve monitoring & self-healing pipelines, as well as security posture.
- Optimize server, storage and network performance.
- Develop new tools in Python, Golang or Bash/Shell.
- Use Infrastructure as Code best practices.
- Participate in a 24x7 on-call rotation.
Qualifications:
- Proficiency with Linux fundamentals and performance optimizations.
- Experience with workload schedulers (Slurm, LSF) and storage management of parallel file systems.
- Proficiency in Python, Golang and/or Bash.
- Experience with configuration management software (Ansible, etc.) and systems monitoring & alerting (Prometheus, Grafana, Telegraf, Splunk, etc.).
- Experience with containers and container orchestration technologies such as Kubernetes.
- Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus.
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or Physics, or proof of exceptional skills in a related field.
- 3+ years of additional equivalent experience, or evidence of exceptional ability related to the position.
