
HPC Engineer
HPC Engineer - Network
Join World Wide Technology in Mumbai as an HPC Engineer focused on network infrastructure for AI factories. You will be the primary engineer responsible for configuring, stabilizing, and tuning high-speed interconnects so that AI compute nodes communicate reliably. This role demands hands-on execution of Low-Level Designs (LLDs) and a deep understanding of high-performance networking, including InfiniBand and AI Ethernet (RoCEv2).
You will work with cutting-edge technologies such as NVIDIA SuperPOD, NVIDIA BasePOD, and Cisco AI Factory environments, moving beyond standard enterprise networking. This position requires shift work aligned with international client time zones.
Key Responsibilities:
- Fabric Configuration and Provisioning: Apply configurations to high-performance switches (NVIDIA Quantum InfiniBand, NVIDIA Spectrum-X Ethernet, Cisco Nexus) using templates and automation.
- NetDevOps Execution: Use Ansible playbooks for configuration management, firmware updates, and compliance enforcement across the network fabric. Maintain NVIDIA Unified Fabric Manager (UFM) for optimal routing and fault tolerance.
- Host Networking: Collaborate with the Compute team to configure host-side adapters (ConnectX SuperNICs, BlueField DPUs), ensuring correct IP addressing, MTU, and driver parameters.
- Validation & Performance Tuning: Verify physical connectivity and link health using specialized tools. Execute network-specific benchmarks to validate fabric performance, ensuring full bisection bandwidth and low latency.
- Congestion Control: Implement and tune Quality of Service (QoS) settings, including PFC and ECN, to prevent packet loss and optimize throughput in RoCEv2 environments.
- Operations Support: Configure monitoring agents for fabric telemetry and traffic flow visualization. Handle L2 support tickets for network issues and execute firmware upgrades during maintenance windows.
Technical Competencies:
Essential Skills:
- High-Performance Networking: Deep operational knowledge of NVIDIA Quantum InfiniBand switches and AI Ethernet (RoCEv2).
- Fabric Management: Experience with NVIDIA UFM (Unified Fabric Manager).
- Automation Tools: Proficiency in Ansible for network automation. Strong Linux CLI skills for network troubleshooting. Practical implementation skills in BGP, EVPN, and VXLAN.
Desirable Experience:
- Cisco AI Integration (Cisco Nexus Dashboard, Cisco 8000 series).
- DPU Configuration (NVIDIA BlueField DPUs, DOCA framework).
- Optical Networking (interpreting transceiver signal levels).
- AI Fabric Orchestration (Netris, Cisco Nexus Hyperfabric AI).
Certifications:
- Highly desirable: NVIDIA-Certified Professional: AI Networking (NCP-AIN), NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO), Cisco Certified Network Professional (CCNP) Data Center.
Success Metrics (KPIs):
- Achieve >95% bandwidth efficiency on nccl-tests benchmarks.
- Zero unstable connections handed over to the Compute team.
- Consistently meet SLAs for network-related support tickets.
Company
World Wide Technology
World Wide Technology is a global technology solutions provider specializing in system integration for AI factories. It designs and delivers bespoke, high-scale AI infrastructure.