Lead DevOps / Cloud Engineer — Multimodal Search Platform
We are seeking a Lead DevOps/Cloud Engineer to own and scale the production infrastructure behind our high-availability AI workloads and real-time search experiences. The role involves leading the design, reliability, and automation of cloud infrastructure, CI/CD systems, and production operations for a high-traffic platform supporting enterprise-grade deployments. The ideal candidate combines strong cloud architecture expertise with deep operational ownership, enabling reliable GPU-backed AI model serving and low-latency distributed systems at scale.
You will collaborate with backend, platform, and ML engineering teams to ensure resilient, observable, and cost-efficient infrastructure capable of supporting rapid experimentation and mission-critical production workloads.
Key Responsibilities:
- End-to-end infrastructure ownership: Design, deploy, and operate scalable cloud infrastructure supporting a high-scale multimodal search platform handling production traffic across text, image, and voice workloads.
- Kubernetes platform engineering: Architect and manage Kubernetes-based production environments (EKS/GKE/AKS) with robust autoscaling, failover mechanisms, and zero-downtime deployment practices.
- Infrastructure as Code (IaC): Design and maintain reproducible infrastructure using Terraform or equivalent tools to ensure secure, version-controlled environments across development, staging, and production.
- Reliability & performance engineering: Implement high-availability and low-latency architectures capable of handling traffic spikes through intelligent caching, queuing, rate limiting, load balancing, and graceful degradation strategies.
- CI/CD & release automation: Build and maintain automated CI/CD pipelines enabling safe, repeatable, and rapid deployments with rollback capabilities and environment parity.
- Disaster recovery & incident management: Establish redundancy, backup, and disaster recovery systems aligned with defined SLOs/SLAs; lead incident response practices including runbooks and post-incident reviews.
- Observability & monitoring: Implement comprehensive observability covering metrics, logs, distributed tracing, alerting, and performance monitoring; drive continuous reliability improvements through data-driven insights.
- AI infrastructure & model serving: Support GPU-based inference infrastructure and AI model deployment pipelines (e.g., Triton, vLLM, TGI), collaborating closely with MLOps and ML teams to ensure reliable, scalable model serving under production workloads.
- Security & operational excellence: Enforce cloud security best practices, access controls, secrets management, and infrastructure governance aligned with enterprise deployment standards.
- Cost optimization: Continuously monitor and optimize cloud resource utilization, GPU workloads, and infrastructure spend without compromising performance or reliability.
Required Skills & Experience:
- 6–10 years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Platform Engineering roles within high-scale technology environments.
- Strong hands-on expertise with at least one major cloud platform — AWS, GCP, or Azure — including networking, compute, storage, and managed Kubernetes services.
- Deep experience operating production Kubernetes environments at scale, including autoscaling, cluster upgrades, workload orchestration, and resilience design.
- Proven experience implementing Infrastructure as Code using Terraform (preferred) or equivalent tooling.
- Strong understanding of distributed systems reliability, including load balancing, caching strategies, asynchronous queues, and failure recovery patterns.
- Experience designing and managing CI/CD pipelines using modern tooling (GitHub Actions, GitLab CI, ArgoCD, Jenkins, or equivalent).
- Hands-on experience building observability stacks using tools such as Prometheus, Grafana, ELK/OpenSearch, Datadog, or OpenTelemetry.
- Experience supporting GPU workloads and AI inference systems, including containerized model deployment and performance optimization for production ML systems.
- Familiarity with AI model serving frameworks such as Triton Inference Server, vLLM, TGI, or similar platforms is strongly preferred.
- Strong scripting and automation skills (Python, Bash, or Go preferred).
- Solid understanding of networking, security best practices, secrets management, and cloud cost optimization strategies.
- Experience working in fast-moving startup or scale-up environments with high ownership expectations.
Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical discipline.
- Relevant cloud certifications (AWS/GCP/Azure, Kubernetes CKA/CKAD) are preferred but not mandatory.
- Demonstrated experience supporting production systems with defined uptime, latency, and reliability targets.
Why Join KnowDis.ai
You will help build foundational infrastructure powering advanced multimodal AI systems used in real-world production environments. This role offers deep technical ownership, exposure to cutting-edge AI workloads, and the opportunity to shape platform reliability from the ground up within a fast-growing AI company.
Selection Process:
- Interested candidates are required to apply through this listing on Jigya; only applications received through this posting will be evaluated further.
- Shortlisted candidates may be required to appear for a screening interview administered by Jigya.
- Candidates selected after the Jigya screening rounds will be interviewed by KnowDis.
Company
KnowDis AI
KnowDis AI is a leading AI company headquartered in New Delhi, India. We specialize in providing cutting-edge solutions, including generative AI-based search for e-commerce platforms. Our focus is on ...