What is the salary for this Staff Specialist IT position?

Salary information for this Staff Specialist IT position is available upon application.

What experience is required for this Staff Specialist IT role?

This Staff Specialist IT position requires mid-level experience.

Where is this Staff Specialist IT job located?

This Staff Specialist IT position is a hybrid role based in Bengaluru.

How do I apply for this Staff Specialist IT position at Infineon?

You can apply for this Staff Specialist IT position by clicking the 'Apply Now' button on this page, which will direct you to the official application portal.

Site Reliability Engineer - High Performance Computing (HPC)

Infineon is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our High Performance Computing (HPC) team in Bengaluru. In this role, you will be instrumental in ensuring the reliability, performance, and scalability of our critical HPC systems and infrastructure. You will collaborate closely with engineering, infrastructure, and operations teams to design, implement, and manage systems that support compute-intensive workloads, driving cutting-edge research, simulations, and data processing.

This position offers a unique opportunity to combine software engineering expertise with system administration skills. You will focus on continuously improving the reliability and performance of HPC environments, reducing operational toil, and responding effectively to incidents. If you thrive in high-performance, mission-critical environments and want to make a significant impact, this role is for you.

Key Responsibilities:

System Reliability and Performance

Ensure the reliability, availability, and performance of High Performance Computing systems.
Identify and mitigate bottlenecks within HPC clusters, interconnects, and storage systems.
Proactively develop and implement advanced monitoring and alerting systems to anticipate and minimize downtime.

Automation and Infrastructure as Code (IaC)

Automate system deployment, configuration, and maintenance processes for HPC clusters.
Implement and manage Infrastructure as Code (IaC) using tools like Terraform, Ansible, or similar technologies.
Develop self-healing and automated recovery mechanisms to reduce manual intervention.

Incident Management and Troubleshooting

Respond to HPC system incidents, conduct thorough root cause analysis, and implement effective preventive measures.
Create and maintain comprehensive runbooks and playbooks for efficiently handling predictable issues.

System and Software Optimization

Collaborate with engineering teams to optimize workloads, schedulers (e.g., LSF), and resource allocations for maximum efficiency.
Test, benchmark, and optimize hardware and software configurations in partnership with vendors.

Collaboration and Communication

Serve as a key liaison between software development and operations teams to ensure seamless deployment of HPC workloads.
Provide essential training, documentation, and guidance to users and stakeholders.

Research and Continuous Improvement

Stay abreast of the latest HPC technologies and trends, including GPUs, accelerators, and interconnects like InfiniBand.
Propose and implement innovative solutions to enhance the efficiency and scalability of HPC environments.

Candidate Profile:

The ideal candidate will possess:

A Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field. Equivalent practical experience will also be considered.
Proven experience in managing and optimizing HPC clusters and associated resources (compute nodes, storage, and interconnects).
Expertise with workload managers and job schedulers, such as LSF.
Strong programming or scripting skills in languages like Python, Bash, or Go.
Proficiency in Linux system administration (RHEL, CentOS, Ubuntu) and networking fundamentals.
Familiarity with containerization technologies like Docker and Kubernetes, particularly in HPC contexts.
Experience with monitoring tools (e.g., Prometheus, Grafana, Nagios) and log management tools (e.g., ELK stack).
Excellent problem-solving abilities with a keen eye for detail.
Strong communication and teamwork skills, enabling effective collaboration across multi-disciplinary teams.
The ability to prioritize tasks effectively in a dynamic and fast-paced environment.
A solid understanding of DevOps principles and practices, including CI/CD pipelines.
Knowledge of security principles and best practices relevant to HPC environments.
Relevant certifications such as RHCE are a plus.

