Gen AI Data Engineer
As a Generative AI Data Engineer, you will build the data infrastructure critical for powering generative AI applications. You will collaborate with software engineers, ML engineers, and product leaders to facilitate rapid experimentation and deploy scalable, production-ready GenAI systems. Your responsibilities include designing flexible and scalable data pipelines, supporting retrieval systems, and ensuring high-quality data flow throughout the GenAI lifecycle, from initial experimentation to production deployment.
Core Responsibilities
Data Engineering & Pipeline Development
- Design and build scalable batch and near real-time data pipelines using modern data processing frameworks.
- Develop robust ETL/ELT workflows for ingesting and transforming structured and unstructured data (documents, PDFs, APIs, logs, etc.).
- Use cloud-native orchestration and data processing solutions to ensure reliability and scalability.
- Implement reusable data frameworks to support rapid experimentation and iteration cycles.
- Ensure data quality through validation, schema enforcement, and automated checks.
GenAI Data Preparation & Experimentation
- Prepare and curate datasets for GenAI use cases including RAG, embeddings, and fine-tuning workflows.
- Implement data processing steps such as chunking, tokenization, metadata enrichment, and semantic structuring.
- Enable fast experimentation loops by supporting dynamic datasets and evaluation pipelines.
- Collaborate with engineering and product teams to iterate quickly on features and experiments.
- Transition experimental pipelines into production-ready, robust workflows.
Vector Databases & Retrieval Systems
- Build and maintain embedding pipelines using LLM providers and open-source models.
- Design and optimize retrieval systems using cloud-native vector databases and hybrid storage solutions.
- Work with relational databases that support vector capabilities, such as PostgreSQL with vector extensions.
- Implement and optimize RAG pipelines, including indexing, retrieval, ranking, and refresh strategies.
- Manage the lifecycle of embeddings, vector indexes, and retrieval datasets.
Data Storage & Platform Engineering
- Work with cloud-native data platforms and storage solutions, including data lakes, lakehouses, and object storage.
- Design efficient storage schemas for both analytical and retrieval workloads.
- Optimize relational and hybrid data stores for low-latency, high-throughput access patterns.
- Ensure cost-effective and scalable data storage strategies.
Productionization & Scalability
- Convert experimental workflows into scalable, reliable production pipelines.
- Optimize pipelines for performance, cost, and reliability.
- Implement incremental processing, caching, and efficient refresh strategies.
Monitoring & Data Observability
- Implement monitoring for pipeline health, data freshness, and quality.
- Track dataset drift, embedding drift, and retrieval effectiveness.
- Build logging, alerting, and observability frameworks for data systems.
Collaboration & Cross-Functional Work
- Partner closely with engineering teams, product leadership, and data scientists to define and deliver data solutions.
- Act as a bridge between rapid experimentation and production engineering.
- Contribute to architecture decisions and GenAI data best practices.
- Document pipelines, architectures, and data models clearly.
Nice-to-Have / Growth Areas
- Experience with GenAI frameworks such as LangChain, LlamaIndex, or similar.
- Exposure to knowledge graphs and graph-based retrieval approaches.
- Understanding of data governance, lineage, and cataloging.
- Experience with experiment tracking and dataset versioning.
- Experience working with multi-modal datasets (text, image, audio).
Qualifications
- 2-3 years of experience in Data Engineering or related roles.
- Strong proficiency in Python and SQL.
- Hands-on experience with modern data processing frameworks and orchestration tools.
- Experience working with cloud-native data platforms on Azure, AWS, or GCP.
- Experience with relational databases such as PostgreSQL, including extensions for advanced workloads like vector storage.
- Strong understanding of building scalable data pipelines for both structured and unstructured data.
- Familiarity with GenAI concepts such as LLMs, embeddings, and RAG architectures.
Soft Skills
- Strong collaborator, comfortable working with engineers, product managers, and leadership.
- Ability to balance rapid experimentation with production rigor.
- Strong problem-solving and debugging capabilities across data systems.
- Clear communicator with strong documentation practices.
- Adaptable, and able to thrive in fast-moving, GenAI-driven environments.
Additional Information
- Enjoy a flexible and rewarding work environment with peer-to-peer recognition platforms.
- Recharge and revitalize with wellness plans for you and your family.
- Plan your future with financial wellness tools.
- Stay relevant and upskill with career development opportunities.
Our Benefits
- Flexible working environment
- Volunteer time off
- LinkedIn Learning
- Employee Assistance Program (EAP)
NIQ may use AI tools in recruitment for tasks like resume screening, assessments, scheduling, job matching, and communication support to enhance efficiency and ensure consistent evaluation based on job-related criteria. All AI use adheres to NIQ's principles of fairness, transparency, human oversight, and inclusion. Final hiring decisions are made by humans. NIQ regularly reviews AI tools to mitigate bias and ensure compliance. For questions, accommodations, or to request human review where legally permitted, contact your local HR representative. Learn more about NIQ's AI Safety Policies and Guiding Principles at https://www.nielseniq.com/global/en/ai-safety-policies.
Company
NielsenIQ
NIQ, a global leader in consumer intelligence, provides comprehensive insights into consumer buying behavior and identifies growth opportunities. Following its 2023 merger with GfK, NIQ offers unparal...