
Staff Specialist IT
Full Job Description
Site Reliability Engineer - High Performance Computing (HPC)
Infineon is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our High Performance Computing (HPC) team in Bengaluru. In this role, you will be instrumental in ensuring the reliability, performance, and scalability of our critical HPC systems and infrastructure. You will collaborate closely with engineering, infrastructure, and operations teams to design, implement, and manage systems that support compute-intensive workloads, driving cutting-edge research, simulations, and data processing.
This position offers a unique opportunity to combine software engineering expertise with system administration skills. You will focus on continuously improving the reliability and performance of HPC environments, reducing operational toil, and responding effectively to incidents. If you thrive in high-performance, mission-critical environments and want to make a significant impact, this role is for you.
Key Responsibilities:
System Reliability and Performance
- Ensure the reliability, availability, and performance of High Performance Computing systems.
- Identify and mitigate bottlenecks within HPC clusters, interconnects, and storage systems.
- Proactively develop and implement advanced monitoring and alerting systems to anticipate and minimize downtime.
Automation and Infrastructure as Code (IaC)
- Automate system deployment, configuration, and maintenance processes for HPC clusters.
- Implement and manage Infrastructure as Code (IaC) using tools like Terraform, Ansible, or similar technologies.
- Develop self-healing and automated recovery mechanisms to reduce manual intervention.
Incident Management and Troubleshooting
- Respond to HPC system incidents, conduct thorough root cause analysis, and implement effective preventive measures.
- Create and maintain comprehensive runbooks and playbooks for efficiently handling predictable issues.
System and Software Optimization
- Collaborate with engineering teams to optimize workloads, schedulers (e.g., LSF), and resource allocations for maximum efficiency.
- Test, benchmark, and optimize hardware and software configurations in partnership with vendors.
Collaboration and Communication
- Serve as a key liaison between software development and operations teams to ensure seamless deployment of HPC workloads.
- Provide essential training, documentation, and guidance to users and stakeholders.
Research and Continuous Improvement
- Stay abreast of the latest HPC technologies and trends, including GPUs, accelerators, and interconnects like InfiniBand.
- Propose and implement innovative solutions to enhance the efficiency and scalability of HPC environments.
Candidate Profile:
The ideal candidate will possess:
- A Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field. Equivalent practical experience will also be considered.
- Proven experience in managing and optimizing HPC clusters and associated resources (compute nodes, storage, and interconnects).
- Expertise with workload managers and job schedulers, such as LSF.
- Strong programming or scripting skills in languages like Python, Bash, or Go.
- Proficiency in Linux system administration (RHEL, CentOS, Ubuntu) and networking fundamentals.
- Familiarity with containerization technologies like Docker and Kubernetes, particularly in HPC contexts.
- Experience with monitoring tools (e.g., Prometheus, Grafana, Nagios) and log management tools (e.g., ELK stack).
- Excellent problem-solving abilities with a keen eye for detail.
- Strong communication and teamwork skills, enabling effective collaboration across multi-disciplinary teams.
- The ability to prioritize tasks effectively in a dynamic and fast-paced environment.
- A solid understanding of DevOps principles and practices, including CI/CD pipelines.
- Knowledge of security principles and best practices relevant to HPC environments.
- Relevant certifications such as RHCE are a plus.