Kowshik Jallipalli
Agentic CI: How I Test and Gate AI Agents Before They Touch Real Users

You wouldn't merge a backend PR without unit tests. Yet, when it comes to AI agents, most teams are still doing "vibe checks." We tweak a system prompt, run three manual queries in a terminal, say "looks good to me," and push to production.

When your agent is just summarizing text, vibe checks are fine. But when your agent has access to tools—when it can execute database queries, issue API refunds, or send emails—a non-deterministic vibe check is a disaster waiting to happen.

If you are building autonomous workflows, you have to treat your agent like a microservice. It needs a contract, it needs invariants, and it needs a Continuous Integration (CI) pipeline that rigorously gates breaking changes. Here is the blueprint for "Agentic CI."

The Scenario: The Automated Refund Agent
Let’s use a concrete internal tool as our running example: a Refund Triage Agent.

This agent receives incoming customer support tickets, extracts the user ID, calls a check_stripe_purchases tool, and evaluates the purchase against a company policy injected into its system prompt (e.g., "Refunds only allowed within 14 days"). It then outputs a strictly structured JSON response containing {"approved": boolean, "reason": string}.
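That output contract can be pinned down in code. Here is a minimal sketch using Pydantic (the model name `RefundDecision` is my own, not from the agent's actual source):

```python
from pydantic import BaseModel

class RefundDecision(BaseModel):
    # The strict output contract for the Refund Triage Agent
    approved: bool
    reason: str

# Any raw model output can be validated against the contract:
raw = '{"approved": false, "reason": "Purchase is 16 days old; policy allows 14."}'
decision = RefundDecision.model_validate_json(raw)
```

If the model emits prose, extra keys with `strict` config, or malformed JSON, validation raises immediately instead of letting a bad payload flow downstream.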

Why This Matters (The Breaking Change)
Imagine a developer notices the agent sounds a bit cold. They update the system prompt, adding: "You are a highly empathetic support agent. Always prioritize customer happiness and give them the benefit of the doubt."

Without CI, this gets merged. In production, a user requests a refund on day 16. The agent, prioritizing "customer happiness" over the 14-day rule, hallucinates an exception and returns {"approved": true}. You just shipped a prompt change that directly bleeds revenue.

How it Works: Contracts and Invariants
You cannot test LLMs with exact string matching (assert response == "Refund denied"). The model's wording will change constantly. Instead, Agentic CI tests invariants: the structural rules and tool execution paths that must always be true, regardless of the prose.

For our Refund Agent, the invariants are:

Schema Adherence: The output must be valid JSON matching our Pydantic schema.

Tool Execution: If the user asks about a refund, the agent must execute the check_stripe_purchases tool exactly once.

Logic Fences: A synthetic input representing a 15-day-old purchase must result in "approved": false.

The Code: The Evaluation Harness and CI Pipeline
Here is how we translate those invariants into a runnable test harness using Python and Pytest.

1. The Pytest Harness (tests/test_refund_agent.py)

Instead of mocking the LLM completely, we test the actual model against synthetic, hardcoded data.

```python
import pytest
import json
from src.agent import run_refund_agent  # Your agent execution function

# Synthetic test cases
EVAL_SCENARIOS = [
    {
        "test_name": "valid_refund_under_14_days",
        "ticket_text": "I bought this 2 days ago and it's broken. Refund me.",
        "mock_stripe_data": {"days_since_purchase": 2},
        "expected_approval": True,
        "expected_tool_call": "check_stripe_purchases",
    },
    {
        "test_name": "invalid_refund_over_14_days",
        "ticket_text": "I bought this 3 weeks ago, please refund.",
        "mock_stripe_data": {"days_since_purchase": 21},
        "expected_approval": False,
        "expected_tool_call": "check_stripe_purchases",
    },
]

@pytest.mark.parametrize("scenario", EVAL_SCENARIOS, ids=lambda x: x["test_name"])
def test_agent_invariants(scenario):
    # Run the agent (passing mock tool data so we don't hit the real Stripe API in CI)
    result = run_refund_agent(
        ticket=scenario["ticket_text"],
        mock_tool_responses={scenario["expected_tool_call"]: scenario["mock_stripe_data"]},
    )

    # Invariant 1: Valid JSON Schema
    try:
        parsed_output = json.loads(result.final_text)
    except json.JSONDecodeError:
        pytest.fail("Agent failed to return valid JSON.")

    assert "approved" in parsed_output, "Missing 'approved' key in schema."
    assert "reason" in parsed_output, "Missing 'reason' key in schema."

    # Invariant 2: Correct Tool Usage
    executed_tools = [tool.name for tool in result.tool_history]
    assert scenario["expected_tool_call"] in executed_tools, "Agent failed to verify purchase history."

    # Invariant 3: Business Logic Fence
    assert parsed_output["approved"] == scenario["expected_approval"], (
        f"Agent bypassed policy fence. Expected approval: {scenario['expected_approval']}"
    )
```

Pitfalls and Gotchas
When setting up Agentic CI, watch out for these operational traps:

The CI Token Bill: If you run 50 complex agent evaluation tests using Claude 3.5 Sonnet or GPT-4o on every single commit, your API bill will explode. Fix: Use smaller models (like Claude Haiku or Gemini Flash) for standard PR checks, and only run the expensive models on the final merge to main.

Flaky Tests: LLMs will occasionally fail a structural test due to a random hallucination, causing a flaky CI pipeline. Fix: Implement a retry decorator in your Pytest harness. If the test fails, retry it up to 3 times before failing the GitHub Action. If it fails 3 times, your prompt is not resilient enough.
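A minimal hand-rolled retry decorator might look like this (alternatively, the `pytest-rerunfailures` plugin gives you `@pytest.mark.flaky(reruns=3)` out of the box):

```python
import functools

def retry_llm_test(max_attempts: int = 3):
    """Rerun a flaky LLM test a few times before declaring failure."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError as exc:
                    last_error = exc  # likely a one-off hallucination: try again
            raise last_error  # failed every attempt: the prompt is not resilient
        return wrapper
    return decorator
```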

Testing Live Tools: Never let your CI agent run real tool calls against external APIs. You will accidentally email customers or hit rate limits. Always inject mock outputs for your tools during the CI run, testing the agent's decision making, not the external API's uptime.

What to Try Next
Ready to lock down your agent workflows? Try adding these to your pipeline:

LLM-as-a-Judge: For qualitative outputs (e.g., "Was the tone polite?"), add a test step that uses a separate, cheaper LLM prompt to grade the agent's output, asserting that the politeness_score is > 8/10.
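The mechanical part of that step is prompting the judge for structured output and parsing the grade. A sketch (the prompt wording and helper names are my own):

```python
import json

JUDGE_PROMPT = """Rate the politeness of this support reply from 1 to 10.
Respond with JSON only: {{"politeness_score": <int>}}

Reply to grade:
{agent_output}
"""

def parse_judge_score(judge_reply: str) -> int:
    """Extract the numeric grade from the judge model's JSON reply."""
    return int(json.loads(judge_reply)["politeness_score"])

# In the test, judge_reply would come from a cheap LLM call, e.g.:
#   judge_reply = cheap_llm(JUDGE_PROMPT.format(agent_output=result.final_text))
#   assert parse_judge_score(judge_reply) > 8
```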

Regression Test Sets: Start saving your weirdest production edge cases into an eval_dataset.json file. Pipe this dataset into your Pytest harness so your agent is constantly tested against the exact tickets that previously broke it.
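Feeding that file into the harness is a one-line change if the dataset uses the same scenario shape as EVAL_SCENARIOS. A sketch:

```python
import json
from pathlib import Path

def load_eval_dataset(path: str = "eval_dataset.json") -> list[dict]:
    """Load saved production edge cases for the parametrized harness."""
    return json.loads(Path(path).read_text())

# Then swap the hardcoded list for the file-backed one:
#   @pytest.mark.parametrize("scenario", load_eval_dataset(),
#                            ids=lambda x: x["test_name"])
```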

Prompt Sandboxing: Move your system prompts out of your Python code and into discrete .md or .txt files. This allows your CI pipeline to track diffs strictly on the prompt phrasing, making debugging much easier when a test suddenly fails.
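Loading prompts from files is trivial; a sketch, assuming a prompts/ directory holding one .md file per agent:

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")  # e.g. prompts/refund_triage.md

def load_system_prompt(name: str) -> str:
    """Read a versioned prompt file so every change shows up as a clean diff."""
    return (PROMPT_DIR / f"{name}.md").read_text().strip()
```

Because prompts now live as plain files, a git blame on prompts/refund_triage.md points straight at the phrasing change that made a test start failing.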
