E2E Networks

Accelerated Computing Engineer

Kanchipuram
Full Time
Entry Level
₹450,000–₹700,000

Full Job Description

Accelerated Computing Specialist Opportunity

E2E Networks is seeking an Accelerated Computing Specialist to join our dynamic Cloud Platform team in Kanchipuram, Tamil Nadu. This role is perfect for passionate engineering graduates with 0-4 years of experience looking to dive into the world of cloud infrastructure, Linux systems, and GPU computing. You will gain invaluable hands-on experience with real production workloads in a rapidly growing cloud environment.

About the Role

This position offers a unique opportunity to work at the intersection of Cloud Infrastructure, Linux Systems, and GPU Computing. You will learn how to build, manage, and troubleshoot cutting-edge AI/ML clusters essential for large-scale model training and inference. Ideal for individuals eager to explore Linux, cloud computing, GPUs, or AI.

Responsibilities

In this role, you will gradually take ownership of the following areas:

Cloud Platform & Infrastructure

  • Provide L1-L2 operational support for cloud compute, storage, and networking services.
  • Monitor Virtual Machines (VMs), containers, and GPU instances for optimal availability and performance.
  • Troubleshoot complex issues including connectivity failures, storage mount problems, and GPU driver errors.
  • Assist in configuring Application Load Balancers (ALBs) and Ingress Controllers within Kubernetes clusters.
  • Contribute to infrastructure automation using Bash, Python, and Terraform.
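As a flavour of the monitoring and Python automation mentioned above, here is a minimal availability-check sketch; the endpoint list is a hypothetical stand-in for whatever inventory the platform actually uses.

```python
import socket

def check_endpoint(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical service inventory; a real check would read this from the
# platform's monitoring or inventory system, not a hard-coded list.
endpoints = [("localhost", 22), ("localhost", 443)]
for host, port in endpoints:
    status = "UP" if check_endpoint(host, port, timeout=0.5) else "DOWN"
    print(f"{host}:{port} {status}")
```

In practice such probes would feed a tool like Prometheus rather than print to stdout, but the shape of the check is the same.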

GPU and AI Workload Support

  • Learn to efficiently launch and validate GPU clusters utilized for AI/ML workloads.
  • Gain a deep understanding of Slurm job scheduling for distributed training tasks.
  • Support the configuration of clusters for Large Language Model (LLM) training (e.g., Llama 3 models) using specialized tools like DGCX Bench.
  • Monitor and maintain vLLM inference endpoints to ensure high availability and rapid restart capabilities.
  • Verify cluster health, ensuring all worker nodes are accessible, GPUs are detected via nvidia-smi, and InfiniBand connections are active (verified with ibstat).
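The health checks above can be scripted. As a rough sketch, the snippet below parses the CSV output of `nvidia-smi --query-gpu=index,name --format=csv,noheader` to confirm GPUs are detected; the wrapper that shells out would only work on a node with the NVIDIA driver installed.

```python
import subprocess

def parse_gpu_list(smi_output: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=index,name --format=csv,noheader` output
    into a list of {"index": int, "name": str} records."""
    gpus = []
    for line in smi_output.strip().splitlines():
        index, name = [field.strip() for field in line.split(",", 1)]
        gpus.append({"index": int(index), "name": name})
    return gpus

def detected_gpus() -> list[dict]:
    """Run nvidia-smi on this node; return [] if the tool is absent or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_list(out)

# Sample output as it would appear on an H100 node.
sample = "0, NVIDIA H100 80GB HBM3\n1, NVIDIA H100 80GB HBM3\n"
print(parse_gpu_list(sample))
```

A fuller node check would also shell out to `ibstat` and assert the InfiniBand port state, following the same run-and-parse pattern.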

Reliability & Automation

  • Maintain comprehensive documentation and contribute to insightful Root Cause Analysis (RCA) reports.
  • Actively support incident response and resolution processes for GPU-based workloads.
  • Collaborate with the team to automate deployments and validation checks for: training cluster launches (e.g., 8xH100, 8xH200), notebook availability and restarts, and inference readiness and autoscaling.
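Validation checks like "notebook availability and restarts" or "inference readiness" usually reduce to polling with backoff. A minimal sketch, assuming the actual probe (an HTTP readiness call, a Slurm query, etc.) is passed in as a callable:

```python
import time

def wait_until_ready(probe, attempts: int = 5, base_delay: float = 1.0) -> bool:
    """Poll `probe()` (a zero-argument callable returning bool) with
    exponential backoff; True as soon as it passes, False after all attempts."""
    for attempt in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False

# Demo with a probe that only passes on its third call.
calls = {"n": 0}
def flaky_probe():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(flaky_probe, attempts=5, base_delay=0.01))  # True
```

Keeping the probe injectable is what makes the same loop reusable across cluster launches, notebook restarts, and inference endpoints.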

Continuous Learning

  • Stay abreast of the latest trends in GPU computing, including H100, H200, InfiniBand, CUDA, and AI inference technologies.
  • Explore the methodologies behind training and deploying LLMs such as Llama, Mistral, and Falcon on GPU clusters.
  • Understand the integration of cloud orchestration, MLOps, and DevOps principles in production AI environments.

Technical Foundation (Preferred Skills)

While we don't expect expertise from day one, a strong foundation and curiosity in at least one of the following areas are highly valued:

Core Cloud Skills

  • Operating Systems: Proficiency in Linux distributions such as Ubuntu, Debian, or CentOS.
  • Networking: Understanding of DNS, NAT, VPN, and basic Load Balancer concepts.
  • Containers: Experience with Docker, Kubernetes, and Helm.
  • Storage: Familiarity with Block and Object storage, including S3 APIs.
  • Monitoring: Exposure to tools like Prometheus, Grafana, and the ELK stack.
  • Automation: Skills in Bash, Python, Git, Ansible, and Terraform.

GPU / AI Computing Concepts (Good to Know)

  • NVIDIA GPU tools: Familiarity with CUDA, nvidia-smi, and basic GPU scheduling principles.
  • ML frameworks: Exposure to TensorFlow, PyTorch, ONNX, or Hugging Face.
  • Cluster Scheduling: An introduction to Slurm is beneficial.
  • LLM Workloads: A basic understanding of how inference endpoints serve AI models.
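To make "how inference endpoints serve AI models" concrete: vLLM exposes an OpenAI-compatible HTTP API, so serving a completion is just a JSON POST. A minimal request-building sketch; the base URL and model name below are placeholders, not a real deployment.

```python
import json
import urllib.request

def completion_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint,
    the interface vLLM serves. base_url and model are deployment-specific."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint and model id for illustration only.
req = completion_request("http://localhost:8000", "meta-llama/Llama-3-8B", "Hello")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return a JSON body containing the generated text, which is what an endpoint health check would inspect.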

Preferred Background

  • Education: B.E., B.Tech., or MCA in Computer Science, IT, ECE, or related fields.
  • Projects: Academic or internship experience in Linux, Cloud, ML, or GPU computing.
  • Community Involvement: Participation in hackathons, open-source projects, or AI community events like Kaggle.
  • DevOps Familiarity: Experience with GitHub or any DevOps pipeline tools.

Company

E2E Networks

Posted on Naukri