DEV Community

Nolan Vale

Building a CI/CD Pipeline for Your Enterprise AI System

If you are running AI in production without a deployment pipeline, you are operating in a state that would be unacceptable for any other production system. The AI model updates, the prompts change, the retrieval configuration evolves — and those changes go to production through a process that amounts to "someone edited a config file and restarted the service."

This post is a practical guide to building a CI/CD pipeline for an enterprise AI system. It assumes you have a RAG-based deployment with a self-hosted or API-backed LLM, a vector database, and a set of prompts and retrieval configurations that need to be managed as versioned artifacts.

What Needs to Be Under Version Control

Before you can build a pipeline, you need to define what the deployable artifacts are. For an enterprise AI system, the answer is broader than most teams initially assume.

Prompts are code. System prompts, few-shot examples, and retrieval instruction templates should be in version control alongside application code. Changes to prompts should go through code review. Prompt history should be auditable.

Retrieval configuration is code. Chunk size, overlap, top-k, similarity threshold, re-ranking configuration — these parameters significantly affect retrieval quality and should be versioned, reviewed, and deployed with the same rigor as application code.

Embedding model version is a dependency. If you are using an embedding model — self-hosted or via API — the version of that model is a dependency of your vector index. An embedding model upgrade requires re-indexing, and that re-indexing must be managed as a deployment, not as an ad-hoc operational task.
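One way to make that dependency explicit is to record the embedding model in the index's metadata and compare it against the deployed configuration. The sketch below is a minimal illustration under that assumption; `IndexMetadata` and both function names are hypothetical, not part of any particular vector database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    """Hypothetical record written next to a vector index when it is built."""
    embedding_model: str
    embedding_model_version: str
    dimensions: int


def index_matches_config(meta: IndexMetadata, model: str, version: str) -> bool:
    """Return True only if the index was built with the configured model."""
    return meta.embedding_model == model and meta.embedding_model_version == version


def reindex_required(meta: IndexMetadata, model: str, version: str) -> bool:
    """A mismatch means queries and stored documents would be embedded differently."""
    return not index_matches_config(meta, model, version)
```

A check like this can run at service startup or in the pipeline, so a model bump without a matching re-index fails loudly instead of quietly degrading retrieval.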

Evaluation sets are test fixtures. The set of query-answer pairs you use to validate retrieval and generation quality is a testing artifact that belongs in version control, maintained with the same care as application tests.
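To tie a deployment to the exact set of artifacts it shipped with, one option is a single content fingerprint over all of them. This is a rough sketch; the artifact names in the usage example are placeholders, and the 12-character truncation is an arbitrary choice.

```python
import hashlib
import json


def artifact_fingerprint(artifacts: dict[str, str]) -> str:
    """Hash the content of every versioned artifact into one short ID.

    `artifacts` maps an artifact name to its full text content, for example
    a system prompt, a serialized retrieval config, or an evaluation set.
    """
    # sort_keys makes the result independent of dictionary ordering.
    canonical = json.dumps(artifacts, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

For example, `artifact_fingerprint({"system_prompt": prompt_text, "retrieval_config": config_text})` could be logged with every deployment record, so two deployments with the same fingerprint are known to have used identical prompts and configuration.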

The Pipeline Structure

A minimal viable CI/CD pipeline for an enterprise AI system has four stages.

Stage 1: Validation

When a change is proposed — to prompts, retrieval configuration, application code, or model version — automated validation runs before any human review.

Prompt validation checks for schema compliance, token budget violations, and injection vulnerabilities. A change that pushes the system prompt past the intended token budget should fail validation before it ever reaches review.
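As a rough sketch of the first two checks, the validator below verifies that a prompt template contains the placeholders the application expects and stays within a token budget. The budget, the required field names, and the whitespace-based token count are all assumptions for illustration; a real check should use the tokenizer of the model you deploy. Injection checks are left out of this sketch.

```python
import string

MAX_PROMPT_TOKENS = 1500                    # assumed budget
REQUIRED_FIELDS = {"context", "question"}   # assumed template placeholders


def approx_token_count(text: str) -> int:
    """Crude proxy: whitespace-separated words. Swap in your model's tokenizer."""
    return len(text.split())


def template_fields(template: str) -> set[str]:
    """Collect the named {placeholders} in a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def validate_prompt(template: str) -> list[str]:
    """Return a list of validation errors; an empty list means the prompt passes."""
    errors = []
    missing = REQUIRED_FIELDS - template_fields(template)
    if missing:
        errors.append(f"missing placeholders: {sorted(missing)}")
    tokens = approx_token_count(template)
    if tokens > MAX_PROMPT_TOKENS:
        errors.append(f"token budget exceeded: {tokens} > {MAX_PROMPT_TOKENS}")
    return errors
```

Returning a list of errors rather than raising on the first one lets the pipeline report every problem in a single run.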

Configuration validation checks that retrieval parameters are within acceptable ranges and that configuration changes don't create inconsistencies — for example, a chunk size larger than the embedding model's maximum input length.
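A configuration check along these lines might look like the following sketch. The field names and the accepted ranges (for instance, `top_k` between 1 and 100 and a similarity threshold between 0 and 1) are assumptions; pick bounds that match your own retriever and similarity metric.

```python
from dataclasses import dataclass


@dataclass
class RetrievalConfig:
    chunk_size: int              # tokens per chunk
    chunk_overlap: int           # tokens shared between adjacent chunks
    top_k: int                   # number of chunks returned per query
    similarity_threshold: float  # minimum score for a chunk to be kept


def validate_retrieval_config(cfg: RetrievalConfig, embedding_max_input_tokens: int) -> list[str]:
    """Return a list of problems; an empty list means the config is acceptable."""
    errors = []
    if cfg.chunk_size <= 0:
        errors.append("chunk_size must be positive")
    if cfg.chunk_size > embedding_max_input_tokens:
        errors.append(
            f"chunk_size {cfg.chunk_size} exceeds embedding model max input "
            f"of {embedding_max_input_tokens} tokens"
        )
    if not 0 <= cfg.chunk_overlap < cfg.chunk_size:
        errors.append("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not 1 <= cfg.top_k <= 100:  # assumed acceptable range
        errors.append("top_k must be between 1 and 100")
    if not 0.0 <= cfg.similarity_threshold <= 1.0:  # assumes scores normalized to [0, 1]
        errors.append("similarity_threshold must be between 0.0 and 1.0")
    return errors
```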

Static analysis covers the application code, following whatever standards your engineering team already uses.

Stage 2: Automated Evaluation

This is the stage most teams skip and the stage that provides the most value.

Against your versioned evaluation set, run automated quality metrics for any change that could affect retrieval or generation quality. At minimum: retrieval recall at k for a sample of evaluation queries, answer correctness for a sample of reference Q&A pairs, and latency percentiles for standard query types.
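For reference, recall at k is the fraction of a query's known-relevant documents that appear in the top k retrieved results. A minimal sketch, assuming each evaluation query comes with a set of relevant document IDs (the function names are illustrative):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each evaluation query needs at least one relevant document")
    hits = relevant_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not results:
        raise ValueError("no evaluation results to average")
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in results]
    return sum(scores) / len(scores)
```

With `retrieved_ids = ["a", "b", "c"]`, `relevant_ids = {"a", "c"}`, and `k = 2`, only `"a"` is found in the top two, so recall at 2 is 0.5.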

The evaluation should run in a sandboxed environment that mirrors production configuration but uses a separate index seeded with a representative subset of the document corpus. Running full re-indexing on every PR is expensive; running against a curated representative subset is fast enough to be practical.
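One possible starting point for that subset is a stable, hash-based sample of document IDs, so every pipeline run indexes the same documents. This is only a sketch: a hash sample is not curated, so you would likely add hand-picked documents that your evaluation queries depend on. The function name and the default sample size are assumptions.

```python
import hashlib


def in_eval_subset(doc_id: str, sample_percent: int = 10) -> bool:
    """Deterministically decide whether a document belongs to the eval subset.

    The same doc_id always lands in the same bucket, so the subset is
    reproducible across pipeline runs and machines.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < sample_percent
```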

If any metric regresses beyond a defined threshold (for example, retrieval recall drops by more than 5%, or p95 latency increases by more than 200ms), the pipeline fails and the change requires an explicit override to proceed.
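The gate itself can be a small comparison between baseline and candidate metrics. The sketch below reads the 5% recall threshold as an absolute drop of 0.05 in recall, which is an assumption; a relative drop is an equally valid reading. The `Metrics` type and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    recall_at_k: float
    p95_latency_ms: float


MAX_RECALL_DROP = 0.05        # assumed: absolute drop in recall
MAX_P95_INCREASE_MS = 200.0   # allowed p95 latency increase


def regression_failures(baseline: Metrics, candidate: Metrics) -> list[str]:
    """List every metric that regressed past its threshold."""
    failures = []
    recall_drop = baseline.recall_at_k - candidate.recall_at_k
    if recall_drop > MAX_RECALL_DROP:
        failures.append(f"recall dropped by {recall_drop:.3f}")
    latency_increase = candidate.p95_latency_ms - baseline.p95_latency_ms
    if latency_increase > MAX_P95_INCREASE_MS:
        failures.append(f"p95 latency increased by {latency_increase:.0f}ms")
    return failures


def gate_passes(baseline: Metrics, candidate: Metrics, override: bool = False) -> bool:
    """Pass if nothing regressed, or if a reviewer explicitly overrode the gate."""
    return override or not regression_failures(baseline, candidate)
```

In practice the override should be recorded with who approved it and why, so it shows up in the audit trail.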

Setting the thresholds requires a calibration period: measure your baseline metrics, determine what magnitude of change would indicate a real problem versus normal variance, and set the thresholds accordingly. Start conservative and adjust based on false positive rates.
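One simple way to separate a real regression from normal variance is to run the unchanged baseline several times and derive the threshold from the spread of the results. The sketch below uses a multiple of the sample standard deviation; the three-sigma default is an assumption, not a recommendation from any standard.

```python
import statistics


def calibrated_drop_threshold(baseline_scores: list[float], sigmas: float = 3.0) -> float:
    """Allowed metric drop, derived from run-to-run variance of the baseline.

    `baseline_scores` holds the same metric measured over repeated runs of
    the unchanged system during the calibration period.
    """
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to estimate variance")
    return sigmas * statistics.stdev(baseline_scores)
```

If the gate fires too often on changes that turn out to be harmless, raise `sigmas`; if real regressions slip through, lower it.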

Stage 3: Staging Deployment and Human Review

Changes that pass automated evaluation are deployed to a staging environment that mirrors production as closely as possible, including connecting to a staging instance of the vector index seeded with a representative document set.

Human review at this stage focuses on things automated evaluation cannot catch: subjective quality assessment of AI responses for a sample of realistic queries, verification that UI/UX behavior is correct for edge cases, and confirmation that the change achieves its intended purpose.

For prompt changes specifically, the reviewer should compare responses in staging against responses from the current production version for the same inputs. Regression in subjective quality that doesn't show in automated metrics is common and requires human judgment to catch.
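To focus the reviewer's attention, it can help to rank the sampled queries by how much the response changed between production and staging. This sketch uses text similarity from Python's `difflib` as a rough proxy for "changed"; it does not judge quality, and the function name and input shape (query mapped to response text) are assumptions.

```python
import difflib


def rank_changed_responses(prod: dict[str, str], staging: dict[str, str]) -> list[tuple[str, float]]:
    """Return (query, similarity) pairs for shared queries, least similar first.

    Similarity is a ratio in [0, 1]; 1.0 means the two responses are identical.
    """
    shared = prod.keys() & staging.keys()
    scored = [
        (query, difflib.SequenceMatcher(None, prod[query], staging[query]).ratio())
        for query in shared
    ]
    return sorted(scored, key=lambda item: item[1])
```

The reviewer still reads the responses side by side; the ranking only decides which pairs to read first.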

The staging review should have a defined SLA: changes should not sit in staging review indefinitely, as this creates a backlog that discourages small, incremental improvements in favor of large batched changes that are harder to review and harder to roll back.

Stage 4: Production Deployment and Monitoring

Deployment to production should be automated following staging approval, with the ability to roll back to the previous version within minutes.

For changes that involve embedding model upgrades — which require re-indexing — the deployment process is more complex. The standard pattern is blue-green deployment: maintain the current index in production while building the new index in parallel, validate the new index quality against the old index before cutover, and cut over with the ability to revert if post-deployment monitoring shows regression.
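The cutover logic can be sketched as a pointer that is only moved once the new index validates, and that remembers the previous index for a revert. `IndexRouter` below is an in-memory stand-in written for illustration; in a real system the pointer would be whatever alias or routing mechanism your vector database or service layer provides.

```python
from typing import Optional


class IndexRouter:
    """In-memory stand-in for an alias that routes queries to one index."""

    def __init__(self, live: str):
        self.live = live
        self.previous: Optional[str] = None

    def cut_over(self, candidate: str, candidate_recall: float,
                 live_recall: float, max_drop: float = 0.0) -> bool:
        """Switch to the candidate index only if it validates against the live one."""
        if candidate_recall < live_recall - max_drop:
            return False  # keep serving from the current index
        self.previous, self.live = self.live, candidate
        return True

    def revert(self) -> None:
        """Point traffic back at the index that was live before the cutover."""
        if self.previous is None:
            raise RuntimeError("no previous index to revert to")
        self.live, self.previous = self.previous, self.live
```

Keeping the old index intact until post-deployment monitoring has had time to run is what makes the revert cheap.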

Post-deployment monitoring should track the same metrics used in the automated evaluation stage, but against real production traffic. A change that passes evaluation against your test set but degrades on the distribution of real production queries indicates a gap in your evaluation set that should be addressed.
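As an illustration of reusing an evaluation metric in production, the sketch below computes a nearest-rank p95 over observed request latencies and flags a rollback when it exceeds the pre-deployment baseline by more than an allowed margin. The 200ms default mirrors the example threshold from the evaluation stage; the function names are assumptions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_roll_back(latencies_ms: list[float], baseline_p95_ms: float,
                     max_increase_ms: float = 200.0) -> bool:
    """True when production p95 latency regressed past the allowed margin."""
    return percentile(latencies_ms, 95) - baseline_p95_ms > max_increase_ms
```

The same pattern applies to quality metrics, although those usually need sampled and labeled production queries rather than raw request logs.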

The Operational Investment and Why It Pays Back

Setting up this pipeline requires real investment. A rough estimate for a team that hasn't done this before: 3 to 5 engineer-weeks to build the pipeline infrastructure, plus ongoing maintenance.

The payback comes from multiple directions.

Incident reduction: the most common production AI incidents are caused by unreviewed changes — a prompt edit that introduced an unintended behavior, a configuration change that silently degraded retrieval quality, an embedding model update that changed the semantic space in ways that broke downstream assumptions. A deployment pipeline catches these before they reach production.

Faster iteration: counterintuitively, a deployment pipeline enables faster iteration, not slower. Without a pipeline, teams are conservative about changes because they can't validate them without deploying to production. With a pipeline, small changes can be evaluated safely and deployed frequently, enabling the rapid iteration that AI systems require.

Auditability: every change to the AI system is documented, reviewed, and linked to a deployment record. When something goes wrong in production — and something always eventually does — the investigation starts from a clear record of what changed and when, rather than from forensic reconstruction.

The organizations running AI in production without deployment pipelines are accumulating technical debt that will eventually surface as an incident. The organizations that build the pipeline upfront are investing in a capability that compounds over time.
