
LLM & ML Ops Engineer
Full Job Description
Gainwell is recruiting experienced LLM Ops Engineers and ML Ops Engineers to join our expanding AI/ML team in Bengaluru (Bangalore). The role covers the development, deployment, and maintenance of scalable infrastructure and pipelines for both Machine Learning (ML) models and Large Language Models (LLMs). You will be responsible for model lifecycle management, performance monitoring, version control, and compliance, working closely with Data Scientists and DevOps teams.
Key Responsibilities:
- LLM Operations:
  - Design and implement scalable LLM deployment strategies for models like GPT, Llama, and Claude.
  - Enhance LLM inference performance through techniques such as model parallelization, quantization, pruning, and fine-tuning.
  - Integrate sophisticated prompt management, version control, and Retrieval-Augmented Generation (RAG) pipelines.
  - Manage and maintain vector databases, embedding stores, and document stores integral to LLM applications.
  - Monitor and optimize LLM API usage, token consumption, and overall cost efficiency for both cloud and on-premise deployments.
  - Implement continuous model performance monitoring and establish proactive alert systems.
  - Ensure LLM workflows strictly adhere to ethical AI practices, privacy regulations, and responsible AI guidelines.
- ML Operations:
  - Architect, build, and maintain resilient CI/CD pipelines for the complete ML model lifecycle, including training, validation, deployment, and monitoring.
  - Establish robust version control, model registry, and reproducibility frameworks for ML models.
  - Automate data ingestion, feature engineering, and model retraining processes.
  - Monitor ML model performance, detect drift, and ensure the effectiveness of alerting systems.
  - Implement comprehensive security, compliance, and governance protocols for model deployments.
  - Collaborate closely with Data Scientists to accelerate model development and experimentation cycles.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- Proven experience with ML Ops tools such as Kubeflow, MLflow, TFX, or SageMaker.
- Demonstrated experience with LLM-specific tools and frameworks including LangChain, LangGraph, LlamaIndex, Hugging Face, OpenAI APIs, and various vector databases (Pinecone, FAISS, Weaviate, Chroma, etc.).
- Strong track record of deploying models in cloud environments (AWS, Azure, GCP) and on-premise infrastructure.
- Proficiency in containerization technologies like Docker and Kubernetes, coupled with extensive CI/CD experience.
- Familiarity with monitoring tools like Prometheus, Grafana, and specialized ML observability platforms.
- Advanced Python and Bash scripting skills, with experience in infrastructure-as-code tools (Terraform, Helm, etc.).
- Knowledge of healthcare AI applications and regulatory compliance (HIPAA, CMS) is highly advantageous.
- Proficiency with tools such as Giskard and DeepEval is required.
What to Expect:
- Fully Remote: Work from any location within India.
- Minimal Travel: Occasional travel (0-10%).
- Impactful Work: Engage with cutting-edge AI solutions within a mission-driven healthcare technology organization.
Company
Gainwell Technologies LLC
Gainwell Technologies LLC is a premier provider of critical technology solutions for health and human services program administration and operations. As a significant player in the Medicaid sector, Ga...