You deploy an AI agent. It passes your manual tests. It looks good in the demo.
Three weeks later, someone edits the system prompt to make the output "cleaner." The agent starts behaving differently on edge cases. No error. No alert. Just subtly wrong output — until someone notices.
This post is about the CI/CD and prompt regression setup that prevents this. Everything here is practical and works today on AWS.
## The Problem With AI Agents in CI/CD
Traditional software has a clear contract: given input X, function F returns output Y. Tests verify Y. If Y changes, the test fails, the build breaks, you investigate.
LLM-based agents break this model. The "function" is a language model. The same input can produce slightly different outputs on every run. And the failure mode isn't an exception — it's a plausible-looking wrong answer.
Three things make this worse in serverless AI pipelines:
1. Prompts aren't versioned like code. Engineers edit them as a string in a Python file, or worse, in a config file outside version control. Nobody reviews a prompt change the way they'd review a code change.
2. Retries mask failures. Lambda retries on error. Your retry logic retries on low-confidence responses. By the time a bad output surfaces, it's hard to trace it back to the prompt change that caused it.
3. Silent degradation. A classification agent that's 95% accurate and drops to 80% accurate won't throw an error. It'll just be wrong more often. You'll find out from downstream effects, not logs.
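Pinning what you can doesn't fix any of this, but it shrinks the noise a test suite has to tolerate. A minimal sketch, assuming the Anthropic-on-Bedrock message format (the model ID shown is an assumption — pin whichever exact version you actually deploy):

```python
# Sketch: pin the model version and set temperature to 0 so regression
# runs see as little run-to-run variance as the API allows.
MODEL_ID = "anthropic.claude-haiku-4-5-20251001"  # assumed; pin your real version

def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build an Anthropic-on-Bedrock request body with deterministic settings."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "temperature": 0,  # as deterministic as the API allows; not a guarantee
        "messages": [{"role": "user", "content": prompt}],
    }

# Invocation (requires AWS credentials):
#   import boto3, json
#   client = boto3.client("bedrock-runtime")
#   resp = client.invoke_model(modelId=MODEL_ID,
#                              body=json.dumps(build_request("...")))
```

Temperature 0 narrows, but does not eliminate, output variance — which is exactly why the regression suite below checks fields and thresholds rather than exact strings.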
## The Fix: A Prompt Regression Test Suite
The idea is simple. Lock a set of golden fixtures — known inputs with known correct outputs. Run your agent against them on every deploy. Fail the build if accuracy drops below a threshold.
Here's the full setup.
### Step 1: Golden Fixture Format
Each fixture is a JSON file in `tests/fixtures/`. Structure:
```json
{
  "document_id": "fixture_001",
  "input": {
    "document_text": "Policy holder: Jane Smith. Coverage: accidental damage. Item: MacBook Pro 16-inch. Purchase date: 2023-08-15. Claim date: 2025-11-03. Damage description: Screen cracked after drop.",
    "tenant_id": "test-tenant"
  },
  "expected": {
    "risk_level": "MEDIUM",
    "reminder_eligible": true,
    "confidence_min": 0.70
  }
}
```
Keep 20–30 fixtures. Cover your edge cases: borderline risk levels, ambiguous descriptions, missing fields, very old claims. These are the documents your agent gets wrong.
Never auto-generate fixtures. Write them manually. The point is that you, a human, have decided what the correct output is.
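Because fixtures are hand-written, it's easy for one to drift from the schema. A small validation helper can run as its own test — a sketch, where the required fields and the LOW/MEDIUM/HIGH label set are assumptions based on the example above:

```python
import glob
import json

REQUIRED_INPUT = {"document_text", "tenant_id"}
REQUIRED_EXPECTED = {"risk_level"}
VALID_RISK = {"LOW", "MEDIUM", "HIGH"}  # assumed label set

def validate_fixture(fixture: dict) -> list[str]:
    """Return a list of problems; an empty list means the fixture is well-formed."""
    problems = []
    if "document_id" not in fixture:
        problems.append("missing document_id")
    missing_in = REQUIRED_INPUT - set(fixture.get("input", {}))
    if missing_in:
        problems.append(f"input missing {sorted(missing_in)}")
    expected = fixture.get("expected", {})
    missing_exp = REQUIRED_EXPECTED - set(expected)
    if missing_exp:
        problems.append(f"expected missing {sorted(missing_exp)}")
    if expected.get("risk_level") not in VALID_RISK:
        problems.append(f"bad risk_level: {expected.get('risk_level')!r}")
    return problems

def validate_all(fixture_dir: str = "tests/fixtures") -> dict[str, list[str]]:
    """Validate every fixture file; returns {path: problems} for bad ones only."""
    report = {}
    for path in glob.glob(f"{fixture_dir}/*.json"):
        with open(path) as f:
            problems = validate_fixture(json.load(f))
        if problems:
            report[path] = problems
    return report
```

A malformed fixture that silently loads is worse than a failing one, so this check costs nothing and runs without any model calls.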
### Step 2: The Test Runner
```python
# tests/test_regression.py
import glob
import json

import pytest

from agents.classifier import run_classifier

FIXTURE_DIR = "tests/fixtures"
MIN_ACCURACY = 0.90  # Fail the build if accuracy drops below this


def load_fixtures():
    paths = glob.glob(f"{FIXTURE_DIR}/*.json")
    fixtures = []
    for p in paths:
        with open(p) as f:
            fixtures.append(json.load(f))
    return fixtures


@pytest.mark.parametrize("fixture", load_fixtures())
def test_classifier_regression(fixture):
    result = run_classifier(
        document_text=fixture["input"]["document_text"],
        tenant_id=fixture["input"]["tenant_id"],
    )
    expected = fixture["expected"]
    assert result["risk_level"] == expected["risk_level"], (
        f"[{fixture['document_id']}] "
        f"Expected risk_level={expected['risk_level']}, "
        f"got {result['risk_level']}"
    )
    if "confidence_min" in expected:
        assert result["confidence"] >= expected["confidence_min"], (
            f"[{fixture['document_id']}] "
            f"Confidence {result['confidence']:.2f} below minimum "
            f"{expected['confidence_min']}"
        )


def test_overall_accuracy():
    """
    Separate test: fail the whole suite if aggregate accuracy < MIN_ACCURACY.
    This catches regression even when individual tests pass on edge cases.
    """
    fixtures = load_fixtures()
    passed = 0
    for fixture in fixtures:
        result = run_classifier(
            document_text=fixture["input"]["document_text"],
            tenant_id=fixture["input"]["tenant_id"],
        )
        if result["risk_level"] == fixture["expected"]["risk_level"]:
            passed += 1
    accuracy = passed / len(fixtures)
    assert accuracy >= MIN_ACCURACY, (
        f"Accuracy {accuracy:.0%} below threshold {MIN_ACCURACY:.0%}. "
        f"Passed {passed}/{len(fixtures)} fixtures."
    )
```
Run locally with `pytest tests/test_regression.py -v`. You'll see per-fixture pass/fail and the aggregate accuracy check.
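One refinement worth considering: aggregate accuracy can hide a regression concentrated in a single class — every HIGH misclassified while LOW and MEDIUM stay perfect. A small helper, sketched here rather than part of the suite above, breaks results down per expected label:

```python
from collections import Counter

def accuracy_by_label(results):
    """results: iterable of (expected_label, predicted_label) pairs.
    Returns {label: (passed, total)} so you can see which risk level regressed."""
    passed, total = Counter(), Counter()
    for expected, predicted in results:
        total[expected] += 1
        if expected == predicted:
            passed[expected] += 1
    return {label: (passed[label], total[label]) for label in total}
```

Printing this breakdown in `test_overall_accuracy`'s failure message turns "accuracy dropped" into "HIGH-risk claims are the ones regressing," which is usually the first question you'd ask anyway.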
### Step 3: GitHub Actions Pipeline
```yaml
# .github/workflows/deploy.yml
name: warrantyAI CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  AWS_REGION: ap-south-1
  ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
  ECR_REPOSITORY: warrantyai-pipeline
  LAMBDA_FUNCTION: warrantyai-processor

jobs:
  regression-tests:
    name: Prompt Regression Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Configure AWS credentials (for Bedrock)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Run prompt regression tests
        run: pytest tests/test_regression.py -v --tb=short
        env:
          BEDROCK_MODEL_ID: anthropic.claude-haiku-4-5-20251001

  build-and-deploy:
    name: Build → ECR → Lambda
    runs-on: ubuntu-latest
    needs: regression-tests  # Only runs if tests pass
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Log in to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Docker image
        run: |
          IMAGE_TAG=$(git rev-parse --short HEAD)
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "IMAGE_TAG=$IMAGE_TAG" >> $GITHUB_ENV

      - name: Deploy to Lambda
        run: |
          aws lambda update-function-code \
            --function-name $LAMBDA_FUNCTION \
            --image-uri $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
            --region $AWS_REGION
          aws lambda wait function-updated \
            --function-name $LAMBDA_FUNCTION
          echo "Deployed image $IMAGE_TAG to $LAMBDA_FUNCTION"
```
Key decisions:
- `needs: regression-tests` — the deploy job won't start if tests fail
- OIDC role assumption — no long-lived keys in secrets
- `aws lambda wait function-updated` — ensures the function is actually updated before the job completes
### Step 4: IAM OIDC Setup for GitHub Actions (No Long-Lived Keys)
The cleanest way to give GitHub Actions access to AWS is OIDC — a temporary credential that's scoped to your repo and expires after the job.
```hcl
# infra/oidc.tf
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_role" "github_actions" {
  name = "github-actions-warrantyai"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:YOUR_ORG/warrantyai:ref:refs/heads/main"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy" "github_actions_policy" {
  name = "github-actions-policy"
  role = aws_iam_role.github_actions.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability",
          "ecr:PutImage", "ecr:InitiateLayerUpload",
          "ecr:UploadLayerPart", "ecr:CompleteLayerUpload"
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["lambda:UpdateFunctionCode", "lambda:GetFunction"]
        Resource = aws_lambda_function.pipeline_processor.arn
      },
      {
        Effect   = "Allow"
        Action   = ["bedrock:InvokeModel"]
        Resource = "*"
      }
    ]
  })
}
```
Replace `YOUR_ORG/warrantyai` with your actual GitHub org and repo name. The `StringLike` condition locks the role to your main branch, so merges to main get deploy permissions but nothing else does. Note that as written, PR runs can't assume the role at all; if the regression job needs Bedrock access on pull requests, add a second `sub` pattern such as `repo:YOUR_ORG/warrantyai:pull_request` (ideally on a separate, test-only role without the Lambda and ECR permissions).
## What This Catches (and What It Doesn't)
It catches:
- Prompt edits that shift classification behaviour on known edge cases
- Model version changes that affect output structure
- Output parser changes that break field extraction
- Accidental removal of instructions that were doing real work
It doesn't catch:
- Brand-new edge cases you haven't added to fixtures yet
- Latency regressions (add a separate latency benchmark for this)
- Cost regressions from prompt bloat (add token counting)
The fixture set is a living document. Every time a production bug surfaces from a new edge case, add a fixture for it. The test suite gets more valuable over time, not less.
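Capturing a production bug as a fixture can be one function. A hypothetical helper (the record field names are assumptions about your logging format, and the `expected` block must come from human review of the record, never copied from the model's own output):

```python
import json
from pathlib import Path

def record_to_fixture(record: dict, expected: dict,
                      fixture_dir: str = "tests/fixtures") -> Path:
    """Save a production record as a golden fixture file.

    `record` is assumed to carry document_id, document_text, and tenant_id;
    `expected` is the human-verified correct output for this record.
    """
    fixture_id = f"fixture_{record['document_id']}"
    fixture = {
        "document_id": fixture_id,
        "input": {
            "document_text": record["document_text"],
            "tenant_id": record["tenant_id"],
        },
        "expected": expected,
    }
    path = Path(fixture_dir) / f"{fixture_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(fixture, indent=2))
    return path
```

Commit the resulting file with the bug fix, and the regression that prompted it can never silently return.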
## The One Thing Worth Knowing
The first time you run this on an existing project, it will probably fail. Not because your agent is bad — but because you'll discover that your "obvious" classifications aren't as consistent as you thought.
That's the test suite doing its job. Fix the fixtures (or fix the agent), and you now have a baseline. Every future change is measured against that baseline.
That's the whole point.
If you're wondering what this has to do with warrantyAI: it's a project I'm building to learn and implement AI systems.