Lead DevOps / Cloud Engineer — Multimodal Search Platform
We are seeking a Lead DevOps/Cloud Engineer to own and scale the production infrastructure behind our high-availability AI workloads and real-time search experiences. The role involves leading the design, reliability, and automation of cloud infrastructure, CI/CD systems, and production operations for a high-traffic platform supporting enterprise-grade deployments. The ideal candidate combines strong cloud architecture expertise with deep operational ownership, enabling reliable GPU-backed AI model serving and low-latency distributed systems at scale.
You will collaborate with backend, platform, and ML engineering teams to ensure resilient, observable, and cost-efficient infrastructure capable of supporting rapid experimentation and mission-critical production workloads.
Key Responsibilities:
- End-to-end infrastructure ownership: Design, deploy, and operate scalable cloud infrastructure supporting a high-scale multimodal search platform handling production traffic across text, image, and voice workloads.
- Kubernetes platform engineering: Architect and manage Kubernetes-based production environments (EKS/GKE/AKS) with robust autoscaling, failover mechanisms, and zero-downtime deployment practices.
- Infrastructure as Code (IaC): Design and maintain reproducible infrastructure using Terraform or equivalent tools to ensure secure, version-controlled environments across development, staging, and production.
- Reliability & performance engineering: Implement high-availability and low-latency architectures capable of handling traffic spikes through intelligent caching, queuing, rate limiting, load balancing, and graceful degradation strategies.
- CI/CD & release automation: Build and maintain automated CI/CD pipelines enabling safe, repeatable, and rapid deployments with rollback capabilities and environment parity.
- Disaster recovery & incident management: Establish redundancy, backup, and disaster recovery systems aligned with defined SLOs/SLAs; lead incident response practices including runbooks and post-incident reviews.
- Observability & monitoring: Implement comprehensive observability covering metrics, logs, distributed tracing, alerting, and performance monitoring; drive continuous reliability improvements through data-driven insights.
- AI infrastructure & model serving: Support GPU-based inference infrastructure and AI model deployment pipelines (e.g., Triton, vLLM, TGI), collaborating closely with MLOps and ML teams to ensure reliable, scalable model serving under production workloads.
- Security & operational excellence: Enforce cloud security best practices, access controls, secrets management, and infrastructure governance aligned with enterprise deployment standards.
- Cost optimization: Continuously monitor and optimize cloud resource utilization, GPU workloads, and infrastructure spend without compromising performance or reliability.
Required Skills & Experience:
- 6–10 years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Platform Engineering roles within high-scale technology environments.
- Strong hands-on expertise with at least one major cloud platform — AWS, GCP, or Azure — including networking, compute, storage, and managed Kubernetes services.
- Deep experience operating production Kubernetes environments at scale, including autoscaling, cluster upgrades, workload orchestration, and resilience design.
- Proven experience implementing Infrastructure as Code using Terraform (preferred) or equivalent tooling.
- Strong understanding of distributed systems reliability, including load balancing, caching strategies, asynchronous queues, and failure recovery patterns.
- Experience designing and managing CI/CD pipelines using modern tooling (GitHub Actions, GitLab CI, ArgoCD, Jenkins, or equivalent).
- Hands-on experience building observability stacks using tools such as Prometheus, Grafana, ELK/OpenSearch, Datadog, or OpenTelemetry.
- Experience supporting GPU workloads and AI inference systems, including containerized model deployment and performance optimization for production ML systems.
- Familiarity with AI model serving frameworks such as Triton Inference Server, vLLM, TGI, or similar platforms is strongly preferred.
- Strong scripting and automation skills (Python, Bash, or Go preferred).
- Solid understanding of networking, security best practices, secrets management, and cloud cost optimization strategies.
- Experience working in fast-moving startup or scale-up environments with high ownership expectations.
Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical discipline.
- Relevant cloud certifications (AWS/GCP/Azure, Kubernetes CKA/CKAD) are preferred but not mandatory.
- Demonstrated experience supporting production systems with defined uptime, latency, and reliability targets.
Why Join KnowDis.ai
You will help build foundational infrastructure powering advanced multimodal AI systems used in real-world production environments. This role offers deep technical ownership, exposure to cutting-edge AI workloads, and the opportunity to shape platform reliability from the ground up within a fast-growing AI company.
Selection Process:
- Interested candidates are required to apply through this listing on Jigya; only applications received through this posting will be evaluated further.
- Shortlisted candidates may be required to appear for a screening interview administered by Jigya.
- Candidates selected after the Jigya screening rounds will be interviewed by KnowDis.
Company
KnowDis AI
KnowDis AI is a leading AI company headquartered in New Delhi, India. We specialize in providing cutting-edge solutions, including generative AI-based search for e-commerce platforms. Our focus is on ...