KnowDis AI
Posted 2d ago
LinkedIn

Lead DevOps / Cloud Engineer — Multimodal Search Platform

New Delhi, Delhi, India
Mid Level

Full Job Description

We are seeking a Lead DevOps/Cloud Engineer to own and scale the production infrastructure for our high-availability AI workloads and real-time search experiences. This role involves leading the design, reliability, and automation of cloud infrastructure, CI/CD systems, and production operations for a high-traffic platform serving enterprise-grade deployments. The ideal candidate will possess strong cloud architecture expertise combined with deep operational ownership, enabling reliable GPU-backed AI model serving and low-latency distributed systems at scale.

You will collaborate with backend, platform, and ML engineering teams to ensure resilient, observable, and cost-efficient infrastructure capable of supporting rapid experimentation and mission-critical production workloads.

Key Responsibilities:

  • End-to-end infrastructure ownership: Design, deploy, and operate scalable cloud infrastructure supporting a high-scale multimodal search platform handling production traffic across text, image, and voice workloads.
  • Kubernetes platform engineering: Architect and manage Kubernetes-based production environments (EKS/GKE/AKS) with robust autoscaling, failover mechanisms, and zero-downtime deployment practices.
  • Infrastructure as Code (IaC): Design and maintain reproducible infrastructure using Terraform or equivalent tools to ensure secure, version-controlled environments across development, staging, and production.
  • Reliability & performance engineering: Implement high-availability and low-latency architectures capable of handling traffic spikes through intelligent caching, queuing, rate limiting, load balancing, and graceful degradation strategies.
  • CI/CD & release automation: Build and maintain automated CI/CD pipelines enabling safe, repeatable, and rapid deployments with rollback capabilities and environment parity.
  • Disaster recovery & incident management: Establish redundancy, backup, and disaster recovery systems aligned with defined SLOs/SLAs; lead incident response practices including runbooks and post-incident reviews.
  • Observability & monitoring: Implement comprehensive observability covering metrics, logs, distributed tracing, alerting, and performance monitoring; drive continuous reliability improvements through data-driven insights.
  • AI infrastructure & model serving: Support GPU-based inference infrastructure and AI model deployment pipelines (e.g., Triton, vLLM, TGI), collaborating closely with MLOps and ML teams to ensure reliable, scalable model serving under production workloads.
  • Security & operational excellence: Enforce cloud security best practices, access controls, secrets management, and infrastructure governance aligned with enterprise deployment standards.
  • Cost optimization: Continuously monitor and optimize cloud resource utilization, GPU workloads, and infrastructure spend without compromising performance or reliability.

Required Skills & Experience:

  • 6–10 years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Platform Engineering roles within high-scale technology environments.
  • Strong hands-on expertise with at least one major cloud platform — AWS, GCP, or Azure — including networking, compute, storage, and managed Kubernetes services.
  • Deep experience operating production Kubernetes environments at scale, including autoscaling, cluster upgrades, workload orchestration, and resilience design.
  • Proven experience implementing Infrastructure as Code using Terraform (preferred) or equivalent tooling.
  • Strong understanding of distributed systems reliability, including load balancing, caching strategies, asynchronous queues, and failure recovery patterns.
  • Experience designing and managing CI/CD pipelines using modern tooling (GitHub Actions, GitLab CI, ArgoCD, Jenkins, or equivalent).
  • Hands-on experience building observability stacks using tools such as Prometheus, Grafana, ELK/OpenSearch, Datadog, or OpenTelemetry.
  • Experience supporting GPU workloads and AI inference systems, including containerized model deployment and performance optimization for production ML systems.
  • Familiarity with AI model serving frameworks such as Triton Inference Server, vLLM, TGI, or similar platforms is strongly preferred.
  • Strong scripting and automation skills (Python, Bash, or Go preferred).
  • Solid understanding of networking, security best practices, secrets management, and cloud cost optimization strategies.
  • Experience working in fast-moving startup or scale-up environments with high ownership expectations.

Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical discipline.
  • Relevant cloud certifications (AWS/GCP/Azure, Kubernetes CKA/CKAD) are preferred but not mandatory.
  • Demonstrated experience supporting production systems with defined uptime, latency, and reliability targets.

Why Join KnowDis.ai

You will help build foundational infrastructure powering advanced multimodal AI systems used in real-world production environments. This role offers deep technical ownership, exposure to cutting-edge AI workloads, and the opportunity to shape platform reliability from the ground up within a fast-growing AI company.

Selection Process:

  • Interested candidates must apply through this listing on Jigya. Only applications received through this posting will be evaluated further.
  • Shortlisted candidates may be required to appear for a screening interview administered by Jigya.
  • Candidates selected after the Jigya screening rounds will be interviewed by KnowDis.

Company

KnowDis AI

KnowDis AI is a leading AI company headquartered in New Delhi, India. We specialize in providing cutting-edge solutions, including generative AI-based search for e-commerce platforms. Our focus is on ...

New Delhi, Delhi, India
Posted on LinkedIn