klement Gunndu
Ship AI Agents Like Software: 5 CI/CD Patterns That Prevent Silent Failures

Your CI/CD pipeline ships code. But an AI agent is not just code.

An agent is code + a model + a prompt + tool configurations + retrieval context. Change any one of those and the agent behaves differently. Your pipeline tests one of them. The other four ship untested.

This is why teams deploy agents that pass every unit test on Monday and hallucinate in production by Wednesday. The code didn't change. The model did.

Here are 5 CI/CD patterns that close the gap between "tests pass" and "agent works."

1. Version the Full Agent Stack, Not Just the Code

Traditional CI/CD versions code with git. That covers about 20% of what determines an AI agent's behavior.

An agent's output depends on:

  • Code (orchestration logic, tool definitions)
  • Model (provider, model ID, temperature, max tokens)
  • Prompt (system prompt, few-shot examples)
  • Tools (API endpoints, schemas, auth)
  • Context (retrieval pipeline, vector store version)

Change the model from claude-3-5-sonnet-20241022 to claude-3-7-sonnet-20250219 and every output changes. But git sees zero diff.

The fix: Pin every component in a manifest file and version it alongside your code.

# agent_manifest.py
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class AgentManifest:
    code_version: str          # git SHA
    model_id: str              # e.g., "claude-3-5-sonnet-20241022"
    model_provider: str        # e.g., "anthropic"
    prompt_version: str        # hash of system prompt file
    tool_versions: dict        # {"search": "v2.1", "calculator": "v1.0"}
    retrieval_index: str       # vector store snapshot ID
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Unique hash of the full agent configuration."""
        content = json.dumps({
            "code": self.code_version,
            "model": self.model_id,
            "prompt": self.prompt_version,
            "tools": self.tool_versions,
            "retrieval": self.retrieval_index,
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]

Now every deployment gets a fingerprint like a3f9c2e1b0d4. Two deployments with different prompts but identical code get different fingerprints. Your rollback target is a fingerprint, not a git SHA.

Why this matters: When production quality degrades, you can diff fingerprints to find which component changed. "Same code, different model" is no longer invisible.
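To make that diff concrete, here's a minimal sketch of comparing two deployments. The `diff_manifests` helper is illustrative, not part of any library; the dict keys mirror the `fingerprint()` payload above.

```python
# Given two manifest dicts in the shape that fingerprint() serializes,
# report which components changed between deployments.
def diff_manifests(old: dict, new: dict) -> list[str]:
    """Return the component names whose values differ between two manifests."""
    keys = ("code", "model", "prompt", "tools", "retrieval")
    return [k for k in keys if old.get(k) != new.get(k)]


old = {"code": "9f1e", "model": "claude-3-5-sonnet-20241022",
       "prompt": "ab12", "tools": {"search": "v2.1"}, "retrieval": "snap-01"}
new = dict(old, model="claude-3-7-sonnet-20250219")

print(diff_manifests(old, new))  # ['model'] — same code, different model
```

When production quality drops, run this against the current and previous manifests and you know which component to investigate first.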

2. Add Eval Gates to Your Pipeline

Unit tests check if your code runs. Eval gates check if your agent's output is correct.

These are different things. An agent can run without errors and still produce wrong answers. Standard test suites miss this entirely.

DeepEval integrates with pytest so eval gates fit your existing CI workflow. Install it with pip install deepeval.

# tests/test_agent_evals.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
)

from my_agent import run_agent


def build_test_case(query: str, expected_output: str, context: list[str]):
    actual = run_agent(query)
    return LLMTestCase(
        input=query,
        actual_output=actual,
        expected_output=expected_output,
        retrieval_context=context,
    )


# Eval 1: Does the agent answer the question?
@pytest.mark.parametrize("query,expected,context", [
    (
        "What is the refund policy?",
        "Full refund within 30 days of purchase.",
        ["Refund policy: Full refund within 30 days."],
    ),
    (
        "How do I reset my password?",
        "Go to Settings > Security > Reset Password.",
        ["Password reset: Settings > Security > Reset Password."],
    ),
])
def test_answer_relevancy(query, expected, context):
    test_case = build_test_case(query, expected, context)
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])


# Eval 2: Does the agent hallucinate?
@pytest.mark.parametrize("query,expected,context", [
    (
        "What programming languages do you support?",
        "Python and JavaScript.",
        ["Supported languages: Python, JavaScript."],
    ),
])
def test_no_hallucination(query, expected, context):
    actual = run_agent(query)
    test_case = LLMTestCase(
        input=query,
        actual_output=actual,
        context=context,  # HallucinationMetric uses 'context', not 'retrieval_context'
    )
    assert_test(test_case, [HallucinationMetric(threshold=0.5)])

Add this to your GitHub Actions workflow:

# .github/workflows/agent-ci.yml
name: Agent CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests/ -k "not test_agent_evals"

  eval-gate:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: deepeval test run tests/test_agent_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The eval gate runs after unit tests pass. If the agent hallucinates or returns irrelevant answers, the pipeline fails before deployment. No manual review needed.

Key detail: deepeval test run is the recommended CLI command for running evals. It wraps pytest with additional reporting and creates a test run you can inspect later.

3. Build Deterministic Docker Images for Agents

Standard Python Docker images pull latest dependencies on every build. For an AI agent, "latest" can change your agent's behavior silently.

A multi-stage Dockerfile locks dependencies in the build stage and copies only the runtime into production. Compared to a single-stage build, this typically produces a much smaller image — and, more importantly, it eliminates dependency drift.

# Dockerfile
# Stage 1: Build dependencies
FROM python:3.12-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc && \
    rm -rf /var/lib/apt/lists/*

# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install pinned dependencies
COPY requirements.lock .
RUN pip install --no-cache-dir -r requirements.lock

# Stage 2: Production runtime
FROM python:3.12-slim AS runtime

WORKDIR /app

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Create non-root user
RUN groupadd -r agent && useradd -r -g agent agent

# Copy application code
COPY src/ ./src/
COPY prompts/ ./prompts/
COPY agent_manifest.py .

# Copy the agent manifest (versioning)
COPY manifest.json .

USER agent

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "from src.agent import health_check; health_check()"

ENTRYPOINT ["python", "-m", "src.agent"]

Three things to notice:

  1. requirements.lock instead of requirements.txt. Use pip-compile (from pip-tools) to generate a fully pinned lock file. Every dependency, including transitive ones, is frozen.

  2. Non-root user. The agent runs as agent:agent, not root. If a prompt injection attack compromises your agent, it can't escalate to the host.

  3. Health check. The container self-reports agent health. Kubernetes or your orchestrator can restart unhealthy containers automatically.
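The Dockerfile's HEALTHCHECK calls a `health_check` function in `src/agent.py` whose body isn't shown above. Here is one hedged sketch of what it might verify — the specific checks (manifest parses, prompts present) and file paths are assumptions, not the article's actual implementation:

```python
# Hypothetical health_check for src/agent.py: verify the files the agent
# needs at runtime, and exit nonzero on failure so Docker marks the
# container unhealthy.
import json
import sys
from pathlib import Path


def health_check() -> None:
    problems = []

    # The manifest baked into the image must exist and parse as JSON.
    manifest_path = Path("manifest.json")
    if not manifest_path.exists():
        problems.append("manifest.json missing")
    else:
        try:
            json.loads(manifest_path.read_text())
        except json.JSONDecodeError:
            problems.append("manifest.json is not valid JSON")

    # The prompt directory copied into the image must not be empty.
    prompts = Path("prompts")
    if not prompts.is_dir() or not any(prompts.iterdir()):
        problems.append("prompts/ missing or empty")

    if problems:
        print("; ".join(problems), file=sys.stderr)
        sys.exit(1)  # nonzero exit -> Docker marks the container unhealthy
```

A deeper check could also ping the model provider, but keep the health check cheap — it runs every 30 seconds.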

Generate the lock file:

pip install pip-tools
pip-compile requirements.in -o requirements.lock

Now docker build produces the same image regardless of when or where you run it.

4. Deploy With Shadow Testing Before Cutover

Canary deploys send 5% of real traffic to a new version. That works for stateless APIs. Agents are different — a bad response to one user can cascade (wrong tool call triggers downstream actions).

Shadow deployment is safer for agents. The new version processes real requests but its responses are not returned to users. You compare outputs offline.

# shadow_deploy.py
import asyncio
import json
from datetime import datetime, timezone


async def shadow_test(
    request: dict,
    production_agent,
    shadow_agent,
    log_file: str = "shadow_results.jsonl",
):
    """Run both agents on the same request. Return production result.
    Log shadow result for offline comparison."""

    prod_task = asyncio.create_task(production_agent.run(request))
    shadow_task = asyncio.create_task(shadow_agent.run(request))

    # Always return production result to the user
    prod_result = await prod_task

    try:
        shadow_output = await asyncio.wait_for(shadow_task, timeout=30.0)
    except Exception as e:  # timeouts and agent errors alike
        shadow_output = {"error": str(e)}

    # Log both results for comparison
    comparison = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request": request,
        "production": prod_result,
        "shadow": shadow_output,
        "match": prod_result == shadow_output,
    }

    with open(log_file, "a") as f:
        f.write(json.dumps(comparison) + "\n")

    return prod_result


async def analyze_shadow_results(log_file: str = "shadow_results.jsonl"):
    """Analyze divergence between production and shadow agents."""
    results = []
    with open(log_file) as f:
        for line in f:
            results.append(json.loads(line))

    total = len(results)
    if total == 0:
        print("No shadow results logged yet")
        return {"match_rate": 0, "error_rate": 0}

    matches = sum(1 for r in results if r["match"])
    # An error entry is a dict with an "error" key; successful shadow
    # outputs may be plain strings, so check the type first.
    errors = sum(
        1 for r in results
        if isinstance(r.get("shadow"), dict) and "error" in r["shadow"]
    )

    print(f"Total requests: {total}")
    print(f"Output match rate: {matches/total:.1%}")
    print(f"Shadow errors: {errors}")
    print(f"Divergent responses: {total - matches - errors}")

    return {
        "match_rate": matches / total,
        "error_rate": errors / total,
    }

Run the shadow deploy for 24-48 hours. If the match rate drops below your threshold (we use 85%), the new version stays in shadow until you investigate the divergence.
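The threshold check itself can live in a small gate script that CI runs before cutover. This is a sketch assuming the shadow_results.jsonl format written by shadow_test above; the `promotion_gate` name and the 0.85 default are illustrative.

```python
# Promotion gate: read the shadow log and decide whether the new version
# may be promoted. CI would exit nonzero on False to block the cutover.
import json


def promotion_gate(log_file: str, min_match_rate: float = 0.85) -> bool:
    """Return True only if the shadow match rate clears the threshold."""
    with open(log_file) as f:
        results = [json.loads(line) for line in f if line.strip()]
    if not results:
        return False  # no shadow data yet: do not promote
    match_rate = sum(1 for r in results if r["match"]) / len(results)
    print(f"match rate: {match_rate:.1%} (minimum {min_match_rate:.0%})")
    return match_rate >= min_match_rate
```

In a GitHub Actions job this would run as a step after the shadow window closes, with a nonzero exit failing the promotion job.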

When to use canary vs. shadow:

  • Shadow: Model change, prompt change, retrieval pipeline change. Output quality might differ.
  • Canary: Code refactor, performance optimization. Same logic, different implementation.

5. Automate Rollback on Output Quality Degradation

Traditional rollback triggers: HTTP 500 errors, latency spikes, crash loops. These catch infrastructure failures. They don't catch an agent that responds successfully but gives wrong answers.

Agent rollback needs an output quality signal. Here's how to wire it into your deployment:

# quality_monitor.py
import json
from dataclasses import dataclass
from collections import deque


@dataclass
class QualityThresholds:
    min_relevancy_score: float = 0.7
    max_hallucination_rate: float = 0.1
    max_tool_error_rate: float = 0.05
    window_size: int = 100  # evaluate over last N requests


class QualityMonitor:
    def __init__(self, thresholds: QualityThresholds):
        self.thresholds = thresholds
        self.scores = deque(maxlen=thresholds.window_size)

    def record(self, relevancy: float, hallucinated: bool, tool_error: bool):
        self.scores.append({
            "relevancy": relevancy,
            "hallucinated": hallucinated,
            "tool_error": tool_error,
        })

    def should_rollback(self) -> tuple[bool, str]:
        if len(self.scores) < 10:
            return False, "insufficient_data"

        scores = list(self.scores)
        avg_relevancy = sum(s["relevancy"] for s in scores) / len(scores)
        hallucination_rate = sum(
            1 for s in scores if s["hallucinated"]
        ) / len(scores)
        tool_error_rate = sum(
            1 for s in scores if s["tool_error"]
        ) / len(scores)

        if avg_relevancy < self.thresholds.min_relevancy_score:
            return True, f"relevancy={avg_relevancy:.2f}"

        if hallucination_rate > self.thresholds.max_hallucination_rate:
            return True, f"hallucination_rate={hallucination_rate:.2f}"

        if tool_error_rate > self.thresholds.max_tool_error_rate:
            return True, f"tool_error_rate={tool_error_rate:.2f}"

        return False, "healthy"

Wire this into your deployment. When should_rollback() returns True, your orchestrator (Kubernetes, ECS, or even a cron job) reverts to the previous agent fingerprint from Pattern 1.
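As a sketch of that wiring — with a trimmed-down rolling window standing in for QualityMonitor, and a placeholder `trigger_rollback` in place of a real orchestrator call:

```python
# Minimal wiring sketch: record each response's quality signal and trigger
# a rollback when the rolling hallucination rate breaches the threshold.
# trigger_rollback is a placeholder for your orchestrator's API.
from collections import deque


def trigger_rollback(reason: str) -> None:
    # Placeholder: a real implementation would tell Kubernetes/ECS to
    # restore the previous agent fingerprint from the manifest.
    print(f"rolling back: {reason}")


window: deque = deque(maxlen=100)  # True = the response hallucinated


def record_and_check(hallucinated: bool, max_rate: float = 0.1) -> bool:
    """Record one response; return True if a rollback was triggered."""
    window.append(hallucinated)
    if len(window) < 10:
        return False  # not enough data to judge yet
    rate = sum(window) / len(window)
    if rate > max_rate:
        trigger_rollback(f"hallucination_rate={rate:.2f}")
        return True
    return False
```

In production you would call `record_and_check` (or the full `QualityMonitor.record` plus `should_rollback`) from the same code path that logs each agent response.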

# Kubernetes rollback annotation example
metadata:
  annotations:
    agent.deploy/fingerprint: "a3f9c2e1b0d4"
    agent.deploy/previous-fingerprint: "7b2d4e6f8a1c"
    agent.deploy/rollback-trigger: "quality_monitor"

The fingerprint from your agent manifest (Pattern 1) tells the rollback system exactly which full configuration to restore — not just which code version, but which model, prompt, tools, and retrieval index.

Putting It All Together

Here's what the full pipeline looks like:

git push
  → unit tests (pytest)
  → eval gate (deepeval)
  → build Docker image (multi-stage, pinned deps)
  → generate agent fingerprint (manifest)
  → shadow deploy (24h, compare outputs)
  → promote to production (if match rate > 85%)
  → quality monitor (rolling window, auto-rollback)

Each stage catches a different failure mode:

Stage             What it catches
Unit tests        Broken code, import errors
Eval gate         Hallucinations, wrong answers, bad tool calls
Docker build      Dependency drift, environment differences
Shadow deploy     Output quality regression before users see it
Quality monitor   Production degradation after deployment

The pipeline costs about 30 minutes of additional CI time per deployment. The alternative is debugging hallucinations in production at 2 AM.

What This Doesn't Cover

This article focuses on deployment patterns. Three areas need separate treatment:

  1. Cost management — Running evals in CI adds API costs. Batch your eval requests and cache responses.
  2. Multi-agent systems — When agents call other agents, versioning and rollback affect the entire graph.
  3. Compliance and audit — Regulated industries need agent decision logs tied to specific fingerprints.

Each of these deserves its own deep dive.


Follow @klement_gunndu for more DevOps and AI engineering content. We're building in public.
