E2E Networks

Accelerated Computing Engineer

Kanchipuram
Full Time
Entry Level
₹450,000–₹700,000

Full Job Description

Accelerated Computing Specialist Opportunity

E2E Networks is seeking an Accelerated Computing Specialist to join our dynamic Cloud Platform team in Kanchipuram, Tamil Nadu. This role is perfect for passionate engineering graduates with 0-4 years of experience looking to dive into the world of cloud infrastructure, Linux systems, and GPU computing. You will gain invaluable hands-on experience with real production workloads in a rapidly growing cloud environment.

About the Role

This position offers a unique opportunity to work at the intersection of Cloud Infrastructure, Linux Systems, and GPU Computing. You will learn how to build, manage, and troubleshoot cutting-edge AI/ML clusters essential for large-scale model training and inference. Ideal for individuals eager to explore Linux, cloud computing, GPUs, or AI.

Responsibilities

In this role, you will gradually take ownership of the following areas:

Cloud Platform & Infrastructure

  • Provide L1-L2 operational support for cloud compute, storage, and networking services.
  • Monitor Virtual Machines (VMs), containers, and GPU instances for optimal availability and performance.
  • Troubleshoot complex issues including connectivity failures, storage mount problems, and GPU driver errors.
  • Assist in configuring Application Load Balancers (ALBs) and Ingress Controllers within Kubernetes clusters.
  • Contribute to infrastructure automation using Bash, Python, and Terraform.
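As a flavour of the monitoring and Python automation mentioned above, here is a minimal availability-check sketch; the endpoint list is a hypothetical stand-in for whatever inventory the platform actually uses.

```python
import socket

def check_endpoint(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical service inventory; a real check would read this from the
# platform's monitoring or inventory system, not a hard-coded list.
endpoints = [("localhost", 22), ("localhost", 443)]
for host, port in endpoints:
    status = "UP" if check_endpoint(host, port, timeout=0.5) else "DOWN"
    print(f"{host}:{port} {status}")
```

In practice such probes would feed a tool like Prometheus rather than print to stdout, but the shape of the check is the same.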

GPU and AI Workload Support

  • Learn to efficiently launch and validate GPU clusters utilized for AI/ML workloads.
  • Gain a deep understanding of Slurm job scheduling for distributed training tasks.
  • Support the configuration of clusters for Large Language Model (LLM) training (e.g., Llama 3 models) using specialized tools like DGCX Bench.
  • Monitor and maintain vLLM inference endpoints to ensure high availability and rapid restart capabilities.
  • Verify cluster health, ensuring all worker nodes are accessible, GPUs are detected via nvidia-smi, and InfiniBand connections are active (verified with ibstat).
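The health checks above can be scripted. As a rough sketch, the snippet below parses the CSV output of `nvidia-smi --query-gpu=index,name --format=csv,noheader` to confirm GPUs are detected; the wrapper that shells out would only work on a node with the NVIDIA driver installed.

```python
import subprocess

def parse_gpu_list(smi_output: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=index,name --format=csv,noheader` output
    into a list of {"index": int, "name": str} records."""
    gpus = []
    for line in smi_output.strip().splitlines():
        index, name = [field.strip() for field in line.split(",", 1)]
        gpus.append({"index": int(index), "name": name})
    return gpus

def detected_gpus() -> list[dict]:
    """Run nvidia-smi on this node; return [] if the tool is absent or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_list(out)

# Sample output as it would appear on an H100 node.
sample = "0, NVIDIA H100 80GB HBM3\n1, NVIDIA H100 80GB HBM3\n"
print(parse_gpu_list(sample))
```

A fuller node check would also shell out to `ibstat` and assert the InfiniBand port state, following the same run-and-parse pattern.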

Reliability & Automation

  • Maintain comprehensive documentation and contribute to insightful Root Cause Analysis (RCA) reports.
  • Actively support incident response and resolution processes for GPU-based workloads.
  • Collaborate with the team to automate deployments and validation checks for: training cluster launches (e.g., 8xH100, 8xH200), notebook availability and restarts, and inference readiness and autoscaling.
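Validation checks like "notebook availability and restarts" or "inference readiness" usually reduce to polling with backoff. A minimal sketch, assuming the actual probe (an HTTP readiness call, a Slurm query, etc.) is passed in as a callable:

```python
import time

def wait_until_ready(probe, attempts: int = 5, base_delay: float = 1.0) -> bool:
    """Poll `probe()` (a zero-argument callable returning bool) with
    exponential backoff; True as soon as it passes, False after all attempts."""
    for attempt in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False

# Demo with a probe that only passes on its third call.
calls = {"n": 0}
def flaky_probe():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(flaky_probe, attempts=5, base_delay=0.01))  # True
```

Keeping the probe injectable is what makes the same loop reusable across cluster launches, notebook restarts, and inference endpoints.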

Continuous Learning

  • Stay abreast of the latest trends in GPU computing, including H100, H200, InfiniBand, CUDA, and AI inference technologies.
  • Explore the methodologies behind training and deploying LLMs such as Llama, Mistral, and Falcon on GPU clusters.
  • Understand the integration of cloud orchestration, MLOps, and DevOps principles in production AI environments.

Technical Foundation (Preferred Skills)

While we don't expect expertise from day one, a strong foundation and curiosity in at least one of the following areas are highly valued:

Core Cloud Skills

  • Operating Systems: Proficiency in Linux distributions such as Ubuntu, Debian, or CentOS.
  • Networking: Understanding of DNS, NAT, VPN, and basic Load Balancer concepts.
  • Containers: Experience with Docker, Kubernetes, and Helm.
  • Storage: Familiarity with Block and Object storage, including S3 APIs.
  • Monitoring: Exposure to tools like Prometheus, Grafana, and the ELK stack.
  • Automation: Skills in Bash, Python, Git, Ansible, and Terraform.

GPU / AI Computing Concepts (Good to Know)

  • NVIDIA GPU tools: Familiarity with CUDA, nvidia-smi, and basic GPU scheduling principles.
  • ML frameworks: Exposure to TensorFlow, PyTorch, ONNX, or Hugging Face.
  • Cluster Scheduling: An introduction to Slurm is beneficial.
  • LLM Workloads: A basic understanding of how inference endpoints serve AI models.
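To make "how inference endpoints serve AI models" concrete: vLLM exposes an OpenAI-compatible HTTP API, so serving a completion is just a JSON POST. A minimal request-building sketch; the base URL and model name below are placeholders, not a real deployment.

```python
import json
import urllib.request

def completion_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint,
    the interface vLLM serves. base_url and model are deployment-specific."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint and model id for illustration only.
req = completion_request("http://localhost:8000", "meta-llama/Llama-3-8B", "Hello")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return a JSON body containing the generated text, which is what an endpoint health check would inspect.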

Preferred Background

  • Education: B.E., B.Tech., or MCA in Computer Science, IT, ECE, or related fields.
  • Projects: Academic or internship experience in Linux, Cloud, ML, or GPU computing.
  • Community Involvement: Participation in hackathons, open-source projects, or AI community events like Kaggle.
  • DevOps Familiarity: Experience with GitHub or any DevOps pipeline tools.

Company

E2E Networks

Posted on Naukri