Yui Kato

Gate your LLM app in CI: prompt regression testing + agent trace policies with llm-canary

The silent regression problem

If you ship an LLM-powered app, you've probably lived this:

  • A one-line system prompt tweak silently breaks your JSON output format
  • A RAG pipeline change makes the bot answer questions it should refuse
  • A model swap keeps answers correct but doubles your token bill

Nothing in a normal CI pipeline catches any of this. The code compiles, the types check, the unit tests pass — and the behavior of your AI has changed. You find out from a customer complaint.

I built llm-canary to fix that: a regression canary that fails your build when your LLM app drifts. Like the canary in a coal mine, it falls over before your users do.

pip install llm-canary
llm-canary init && llm-canary run canary.yaml   # works immediately, zero API keys

Declarative test suites

name: support-bot
providers:
  - name: openai
    model: gpt-4o-mini
cases:
  - name: refund-policy
    prompt: "A customer asks: can I get a refund for a keyboard bought 2 weeks ago?"
    assertions:
      - type: contains
        value: "30 days"
      - type: json_schema
        value: {type: object, required: [eligible]}
      - type: judge
        value: "Politely explains the refund policy"
      - type: max_cost_usd
        value: 0.01

llm-canary run suite.yaml    # exit 0 on green, 1 on failures; drop it in CI

There are 11 assertion types, covering substrings, regex, JSON Schema, semantic similarity, LLM-as-judge, and latency, cost, and token budgets. A matrix: key expands one case into the Cartesian product of its variables (for example, an angry customer × 3 languages).
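
To make the matrix idea concrete, here is a minimal Python sketch of a Cartesian expansion. It is not llm-canary's implementation: the expand_matrix function, the {tone} and {language} placeholders, and the bracketed naming scheme are all assumptions made for illustration.

```python
from itertools import product


def expand_matrix(case: dict) -> list[dict]:
    """Expand one case with a `matrix` mapping into one concrete case per combination."""
    matrix = case.get("matrix", {})
    if not matrix:
        return [case]
    keys = list(matrix)
    expanded = []
    for values in product(*(matrix[key] for key in keys)):
        variables = dict(zip(keys, values))
        concrete = {k: v for k, v in case.items() if k != "matrix"}
        # Hypothetical naming scheme: append the chosen values to the case name.
        concrete["name"] = case["name"] + "[" + ",".join(map(str, values)) + "]"
        # Assumed placeholder syntax: {variable} inside the prompt.
        concrete["prompt"] = case["prompt"].format(**variables)
        expanded.append(concrete)
    return expanded


case = {
    "name": "refund-policy",
    "prompt": "Reply in {language} to a customer who sounds {tone}: can I get a refund?",
    "matrix": {
        "tone": ["angry", "calm"],
        "language": ["English", "Japanese", "German"],
    },
}
cases = expand_matrix(case)
print(len(cases))        # 2 tones x 3 languages = 6
print(cases[0]["name"])  # refund-policy[angry,English]
```

Two variables with two and three values yield six concrete cases, which is why a matrix is cheap to write but can multiply the API calls a suite makes.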

Regression detection without golden answers

LLM output isn't byte-stable, so exact-match snapshot tests fail even when nothing meaningful has changed. Instead:

llm-canary record suite.yaml   # snapshot today's outputs as the baseline
llm-canary check suite.yaml    # fail when meaning drifts or cost jumps

The check command compares each new output against the recorded baseline using semantic similarity and a cost-drift threshold. There are no hand-written expected answers, just "tell me when it changed more than I allowed."
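
The comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's code: the record fields (output, cost_usd), the threshold names and defaults, and the check function are hypothetical, and the word-count cosine below is only a crude, dependency-free stand-in for real embedding-based semantic similarity.

```python
import math
import re
from collections import Counter


def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (a stand-in for embedding similarity)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def check(baseline: dict, current: dict,
          min_similarity: float = 0.8, max_cost_drift: float = 0.25) -> list[str]:
    """Return drift findings; an empty list means the case still passes."""
    findings = []
    score = similarity(baseline["output"], current["output"])
    if score < min_similarity:
        findings.append(f"meaning drifted: similarity {score:.2f} < {min_similarity}")
    if baseline["cost_usd"] > 0:
        # Relative cost change against the recorded baseline.
        drift = (current["cost_usd"] - baseline["cost_usd"]) / baseline["cost_usd"]
        if drift > max_cost_drift:
            findings.append(f"cost jumped: +{drift:.0%} > +{max_cost_drift:.0%}")
    return findings


baseline = {"output": "Refunds are available within 30 days of purchase.", "cost_usd": 0.002}
reworded = {"output": "Refunds are available within 30 days of your purchase.", "cost_usd": 0.002}
pricey = {"output": "Refunds are available within 30 days of purchase.", "cost_usd": 0.004}

print(check(baseline, reworded))  # minor rewording stays above the threshold
print(check(baseline, pricey))    # same text, doubled cost
```

The point of the design is that both thresholds are tolerances you choose, so harmless rewording passes while a change in meaning or a jump in spend fails the build.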

Agent traces: test what the agent did, not just what it said

LLM apps act now — they call tools, query databases, post to Slack. The risk moved from "what did the model say" to "what did the agent do". llm-canary gates a JSONL action log against a policy:

# policy.yaml
max_steps: 10
max_cost_usd: 0.05
forbidden_tools: [delete_records, send_email]
required_order: [query_sales_db, post_slack]   # read before you post
max_tool_repeats: 3                            # catch runaway loops

llm-canary trace trace.jsonl --policy policy.yaml

Emit one JSON line per agent step from whatever framework you use, and the canary enforces the contract in CI.
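
As a rough illustration of what such a gate involves, the Python sketch below checks a list of JSONL lines against the policy keys shown above. The trace schema (a tool field and an optional cost_usd field per step) is an assumption, since the post does not specify it, and so are the exact semantics: required_order is treated as "these tools appear in this relative order" and max_tool_repeats as "longest run of consecutive calls to one tool". llm-canary itself may define them differently.

```python
import json

INF = float("inf")


def check_trace(lines: list[str], policy: dict) -> list[str]:
    """Return policy violations for a JSONL trace; an empty list means it passes."""
    steps = [json.loads(line) for line in lines if line.strip()]
    tools = [step["tool"] for step in steps]  # assumed field name
    violations = []

    if len(steps) > policy.get("max_steps", INF):
        violations.append(f"max_steps: {len(steps)} steps")

    total_cost = sum(step.get("cost_usd", 0.0) for step in steps)  # assumed field name
    if total_cost > policy.get("max_cost_usd", INF):
        violations.append(f"max_cost_usd: spent {total_cost:.4f}")

    for tool in sorted(set(tools) & set(policy.get("forbidden_tools", []))):
        violations.append(f"forbidden_tools: {tool}")

    # Assumed semantics: each required tool must appear after the previous one.
    position = 0
    for required in policy.get("required_order", []):
        try:
            position = tools.index(required, position) + 1
        except ValueError:
            violations.append(f"required_order: {required} missing or out of order")
            break

    # Assumed semantics: count consecutive calls to the same tool.
    run, previous = 0, None
    for tool in tools:
        run = run + 1 if tool == previous else 1
        previous = tool
        if run > policy.get("max_tool_repeats", INF):
            violations.append(f"max_tool_repeats: {tool} called {run} times in a row")
            break
    return violations


policy = {
    "max_steps": 10,
    "max_cost_usd": 0.05,
    "forbidden_tools": ["delete_records", "send_email"],
    "required_order": ["query_sales_db", "post_slack"],
    "max_tool_repeats": 3,
}
good = ['{"tool": "query_sales_db", "cost_usd": 0.01}', '{"tool": "post_slack"}']
bad = ['{"tool": "post_slack"}', '{"tool": "query_sales_db"}', '{"tool": "send_email"}']

print(check_trace(good, policy))  # passes: no violations
print(check_trace(bad, policy))   # posts before reading and sends an email
```

Because the checker only reads a flat log, any agent framework can be gated as long as it writes one JSON object per step.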

Test YOUR bot, not the raw model

A canary is only meaningful if the thing you change — your system prompt, your RAG pipeline, your pre/post-processing — is on the tested execution path. Sending test prompts straight to the OpenAI API tests the model, not your app.

So llm-canary can put your real application under test, however it's built:

providers:
  # anything executable — stdout is the reply
  - name: command
    options:
      cmd: "python my_bot.py --ask {prompt}"

  # anything with an HTTP API
  - name: http
    options:
      url: http://localhost:8000/chat
      body: {message: "{prompt}"}
      response_path: reply.text
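
For local experiments, a stand-in bot that matches the http provider config above might look like the following Python sketch. Everything here is hypothetical: the answer stub takes the place of your real pipeline, and reading response_path as a dotted path through nested JSON is an assumption about how the setting is interpreted.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer(message: str) -> str:
    """Placeholder for your real pipeline (system prompt, RAG, post-processing)."""
    return f"[stub reply to: {message}]"


def chat(body: dict) -> dict:
    """Build the response shape that a response_path of 'reply.text' would read."""
    return {"reply": {"text": answer(body["message"])}}


def resolve_path(data, path: str):
    """Follow a dotted path such as 'reply.text' through nested dicts (assumed semantics)."""
    for key in path.split("."):
        data = data[key]
    return data


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/chat":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        payload = json.dumps(chat(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Blocks until interrupted; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), ChatHandler).serve_forever()


# The same request/response contract, exercised without starting the server.
print(resolve_path(chat({"message": "can I get a refund?"}), "reply.text"))
```

Calling serve() exposes the stub at http://localhost:8000/chat. Replacing answer with a call into your actual application is what puts your prompt and pipeline changes on the tested path.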

In CI: boot your bot, point the canary at it.

- run: docker compose up -d my-chatbot
- run: llm-canary run suite.yaml

Self-hosted eval server

llm-canary serve runs a small FastAPI service inside your own infra: run history, a dashboard, and team-shared baselines in a local SQLite file. Your prompts and agent logs never leave your network — useful if you can't ship eval data to a SaaS.
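
To give a feel for why a single SQLite file is enough for shared run history, here is a small Python sketch using the standard sqlite3 module. The table layout and function names are invented for illustration and say nothing about the schema llm-canary serve actually uses.

```python
import sqlite3
from datetime import datetime, timezone


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a run-history database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               id INTEGER PRIMARY KEY,
               suite TEXT NOT NULL,
               started_at TEXT NOT NULL,
               passed INTEGER NOT NULL,
               failed INTEGER NOT NULL
           )"""
    )
    return conn


def record_run(conn: sqlite3.Connection, suite: str, passed: int, failed: int) -> int:
    """Store one suite run and return its row id."""
    cursor = conn.execute(
        "INSERT INTO runs (suite, started_at, passed, failed) VALUES (?, ?, ?, ?)",
        (suite, datetime.now(timezone.utc).isoformat(), passed, failed),
    )
    conn.commit()
    return cursor.lastrowid


def latest_runs(conn: sqlite3.Connection, suite: str, limit: int = 5) -> list[tuple]:
    """Most recent runs first, as (id, passed, failed) tuples."""
    return conn.execute(
        "SELECT id, passed, failed FROM runs WHERE suite = ? ORDER BY id DESC LIMIT ?",
        (suite, limit),
    ).fetchall()


conn = open_history()  # pass a file path instead of ":memory:" to persist
record_run(conn, "support-bot", passed=4, failed=0)
record_run(conn, "support-bot", passed=3, failed=1)
print(latest_runs(conn, "support-bot"))  # [(2, 3, 1), (1, 4, 0)]
```

A file-backed database like this needs no separate service, which fits the goal of keeping prompts and agent logs inside your own network.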

How it compares

promptfoo and DeepEval are excellent and more mature for prompt evaluation. llm-canary's niche is the combination of agent-trace policy gates, baseline regression without golden answers, and a fully self-hosted history server — all MIT-licensed, no SaaS upsell.

Repo: https://github.com/okssusucha/llm-canary — issues and PRs welcome, in English or Japanese.