DEV Community

Harish Aravindan

Your AI Agent Will Lie to You in Production — Here's How to Catch It Before It Ships

You deploy an AI agent. It passes your manual tests. It looks good in the demo.

Three weeks later, someone edits the system prompt to make the output "cleaner." The agent starts behaving differently on edge cases. No error. No alert. Just subtly wrong output — until someone notices.

This post is about the CI/CD and prompt regression setup that prevents this. Everything here is practical and works today on AWS.


The Problem With AI Agents in CI/CD

Traditional software has a clear contract: given input X, function F returns output Y. Tests verify Y. If Y changes, the test fails, the build breaks, you investigate.

LLM-based agents break this model. The "function" is a language model. The same input can produce slightly different outputs on every run. And the failure mode isn't an exception — it's a plausible-looking wrong answer.

Three things make this worse in serverless AI pipelines:

1. Prompts aren't versioned like code. Engineers edit them as inline strings in a Python file, or worse, in a config file outside version control. Nobody reviews a prompt change the way they'd review a code change.

2. Retries mask failures. Lambda retries on error. Your retry logic retries on low-confidence responses. By the time a bad output surfaces, it's hard to trace it back to the prompt change that caused it.

3. Silent degradation. A classification agent that's 95% accurate and drops to 80% accurate won't throw an error. It'll just be wrong more often. You'll find out from downstream effects, not logs.
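One way to make that silent degradation visible is to have the agent publish its own quality signal. A minimal sketch using a custom CloudWatch metric; the namespace, metric, and dimension names here are illustrative, and the `confidence` field is assumed to come from the agent's output:

```python
# Sketch: publish the agent's self-reported confidence as a custom
# CloudWatch metric, so an alarm on the rolling average catches slow drift.

def build_confidence_metric(confidence: float, tenant_id: str) -> dict:
    """Shape a single metric datum for cloudwatch.put_metric_data."""
    return {
        "MetricName": "ClassifierConfidence",   # illustrative name
        "Dimensions": [{"Name": "TenantId", "Value": tenant_id}],
        "Value": confidence,
        "Unit": "None",
    }

def emit_confidence_metric(confidence: float, tenant_id: str) -> None:
    import boto3  # local import so the payload builder stays testable without AWS
    boto3.client("cloudwatch").put_metric_data(
        Namespace="WarrantyAI/Classifier",      # illustrative namespace
        MetricData=[build_confidence_metric(confidence, tenant_id)],
    )
```

An alarm on the average of this metric over a day won't tell you which prompt change caused a drop, but it turns "wrong more often" into a page instead of a downstream surprise.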


The Fix: A Prompt Regression Test Suite

The idea is simple. Lock a set of golden fixtures — known inputs with known correct outputs. Run your agent against them on every deploy. Fail the build if accuracy drops below a threshold.

Here's the full setup.


Step 1: Golden Fixture Format

Each fixture is a JSON file in tests/fixtures/. Structure:

{
  "document_id": "fixture_001",
  "input": {
    "document_text": "Policy holder: Jane Smith. Coverage: accidental damage. Item: MacBook Pro 16-inch. Purchase date: 2023-08-15. Claim date: 2025-11-03. Damage description: Screen cracked after drop.",
    "tenant_id": "test-tenant"
  },
  "expected": {
    "risk_level": "MEDIUM",
    "reminder_eligible": true,
    "confidence_min": 0.70
  }
}

Keep 20–30 fixtures. Cover your edge cases: borderline risk levels, ambiguous descriptions, missing fields, very old claims. These are the documents your agent gets wrong.

Never auto-generate fixtures. Write them by hand. The point is that you, a human, have decided what the correct output is.
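Hand-written JSON invites typos, and a misspelled key silently skips an assertion rather than failing it. A cheap schema check run before the suite catches this; a sketch assuming the fixture shape shown above (the `LOW`/`MEDIUM`/`HIGH` label set is an assumption — swap in your own):

```python
REQUIRED_INPUT_KEYS = {"document_text", "tenant_id"}
VALID_RISK_LEVELS   = {"LOW", "MEDIUM", "HIGH"}   # assumed label set

def validate_fixture(fixture: dict) -> list[str]:
    """Return a list of problems; an empty list means the fixture is well-formed."""
    problems = []
    if "document_id" not in fixture:
        problems.append("missing document_id")
    missing = REQUIRED_INPUT_KEYS - set(fixture.get("input", {}))
    if missing:
        problems.append(f"input missing keys: {sorted(missing)}")
    expected = fixture.get("expected", {})
    if expected.get("risk_level") not in VALID_RISK_LEVELS:
        problems.append(f"bad risk_level: {expected.get('risk_level')!r}")
    conf = expected.get("confidence_min")
    if conf is not None and not (0.0 <= conf <= 1.0):
        problems.append(f"confidence_min out of range: {conf}")
    return problems
```

Run it over every file in tests/fixtures/ as its own pytest test, so a malformed fixture fails loudly instead of weakening the suite.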


Step 2: The Test Runner

# tests/test_regression.py
import json
import os
import glob
import pytest
from agents.classifier import run_classifier

FIXTURE_DIR  = "tests/fixtures"
MIN_ACCURACY = 0.90   # Fail the build if accuracy drops below this

def load_fixtures():
    paths = glob.glob(f"{FIXTURE_DIR}/*.json")
    fixtures = []
    for p in paths:
        with open(p) as f:
            fixtures.append(json.load(f))
    return fixtures

@pytest.mark.parametrize("fixture", load_fixtures())
def test_classifier_regression(fixture):
    result = run_classifier(
        document_text=fixture["input"]["document_text"],
        tenant_id=fixture["input"]["tenant_id"]
    )

    expected = fixture["expected"]

    assert result["risk_level"] == expected["risk_level"], (
        f"[{fixture['document_id']}] "
        f"Expected risk_level={expected['risk_level']}, "
        f"got {result['risk_level']}"
    )

    if "confidence_min" in expected:
        assert result["confidence"] >= expected["confidence_min"], (
            f"[{fixture['document_id']}] "
            f"Confidence {result['confidence']:.2f} below minimum "
            f"{expected['confidence_min']}"
        )


def test_overall_accuracy():
    """
    Separate test: fail the whole suite if aggregate accuracy < MIN_ACCURACY.
    This catches regression even when individual tests pass on edge cases.
    """
    fixtures = load_fixtures()
    passed   = 0

    for fixture in fixtures:
        result = run_classifier(
            document_text=fixture["input"]["document_text"],
            tenant_id=fixture["input"]["tenant_id"]
        )
        if result["risk_level"] == fixture["expected"]["risk_level"]:
            passed += 1

    accuracy = passed / len(fixtures)
    assert accuracy >= MIN_ACCURACY, (
        f"Accuracy {accuracy:.0%} below threshold {MIN_ACCURACY:.0%}. "
        f"Passed {passed}/{len(fixtures)} fixtures."
    )

Run locally with pytest tests/test_regression.py -v. You'll see per-fixture pass/fail and the aggregate accuracy check.
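Because the same input can produce different outputs run to run, borderline fixtures can flap. One way to stabilise them is to classify a few times and take the majority label; `run_classifier` from the suite above is what `classify` stands in for here:

```python
from collections import Counter
from typing import Callable

def majority_risk_level(classify: Callable[[], dict], runs: int = 3) -> str:
    """Call the classifier `runs` times and return the most common
    risk_level; ties fall to whichever label was seen first."""
    labels = [classify()["risk_level"] for _ in range(runs)]
    return Counter(labels).most_common(1)[0][0]
```

In the regression test, wrap the call: `majority_risk_level(lambda: run_classifier(...))`. Three runs triples the Bedrock cost per fixture, so reserve this for fixtures that have actually flapped rather than applying it everywhere.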


Step 3: GitHub Actions Pipeline

# .github/workflows/deploy.yml
name: warrantyAI CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  AWS_REGION:     ap-south-1
  ECR_REGISTRY:   ${{ secrets.ECR_REGISTRY }}
  ECR_REPOSITORY: warrantyai-pipeline
  LAMBDA_FUNCTION: warrantyai-processor

permissions:
  id-token: write      # required for OIDC role assumption via configure-aws-credentials
  contents: read       # required for actions/checkout

jobs:
  regression-tests:
    name: Prompt Regression Tests
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Configure AWS credentials (for Bedrock)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region:     ${{ env.AWS_REGION }}

      - name: Run prompt regression tests
        run: pytest tests/test_regression.py -v --tb=short
        env:
          BEDROCK_MODEL_ID: anthropic.claude-haiku-4-5-20251001

  build-and-deploy:
    name: Build → ECR → Lambda
    runs-on: ubuntu-latest
    needs: regression-tests        # Only runs if tests pass
    if: github.ref == 'refs/heads/main'

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region:     ${{ env.AWS_REGION }}

      - name: Log in to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Docker image
        run: |
          IMAGE_TAG=$(git rev-parse --short HEAD)
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push    $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "IMAGE_TAG=$IMAGE_TAG" >> $GITHUB_ENV

      - name: Deploy to Lambda
        run: |
          aws lambda update-function-code \
            --function-name $LAMBDA_FUNCTION \
            --image-uri     $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
            --region        $AWS_REGION

          aws lambda wait function-updated \
            --function-name $LAMBDA_FUNCTION

          echo "Deployed image $IMAGE_TAG to $LAMBDA_FUNCTION"

Key decisions:

  • needs: regression-tests — deploy job won't start if tests fail
  • OIDC role assumption (no long-lived keys in secrets)
  • lambda wait function-updated — ensures the function is actually updated before the job completes

Step 4: IAM OIDC Setup for GitHub Actions (No Long-Lived Keys)

The cleanest way to give GitHub Actions access to AWS is OIDC: GitHub's identity token is exchanged for temporary AWS credentials that are scoped to your repo and expire when the job ends.

# infra/oidc.tf

data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_role" "github_actions" {
  name = "github-actions-warrantyai"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          # Standard audience check used by aws-actions/configure-aws-credentials
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:YOUR_ORG/warrantyai:ref:refs/heads/main"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy" "github_actions_policy" {
  name = "github-actions-policy"
  role = aws_iam_role.github_actions.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability",
                    "ecr:PutImage", "ecr:InitiateLayerUpload",
                    "ecr:UploadLayerPart", "ecr:CompleteLayerUpload"]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["lambda:UpdateFunctionCode", "lambda:GetFunction"]
        Resource = aws_lambda_function.pipeline_processor.arn
      },
      {
        Effect   = "Allow"
        Action   = ["bedrock:InvokeModel"]
        Resource = "*"
      }
    ]
  })
}

Replace YOUR_ORG/warrantyai with your actual GitHub org and repo name. The StringLike condition locks role assumption to the main branch. Note that as written, pull_request runs cannot assume the role at all; if the regression job needs Bedrock access on PRs, add a second subject pattern such as repo:YOUR_ORG/warrantyai:pull_request, or use a separate test-only role without deploy permissions.


What This Catches (and What It Doesn't)

It catches:

  • Prompt edits that shift classification behaviour on known edge cases
  • Model version changes that affect output structure
  • Output parser changes that break field extraction
  • Accidental removal of instructions that were doing real work

It doesn't catch:

  • Brand-new edge cases you haven't added to fixtures yet
  • Latency regressions (add a separate latency benchmark for this)
  • Cost regressions from prompt bloat (add token counting)

The fixture set is a living document. Every time a production bug surfaces from a new edge case, add a fixture for it. The test suite gets more valuable over time, not less.


The One Thing Worth Knowing

The first time you run this on an existing project, it will probably fail. Not because your agent is bad — but because you'll discover that your "obvious" classifications aren't as consistent as you thought.

That's the test suite doing its job. Fix the fixtures (or fix the agent), and you now have a baseline. Every future change is measured against that baseline.

That's the whole point.


What this has to do with WarrantyAI

WarrantyAI is a system I'm building to learn and practise production AI engineering end to end.

Building WarrantyAI: AI Platform Engineer's 2026 North-Star Goal | Harish Aravindan posted on the topic | LinkedIn
