Gate your LLM app in CI: prompt regression testing + agent trace policies with llm-canary

The silent regression problem

If you ship an LLM-powered app, you've probably lived this:

A one-line system prompt tweak silently breaks your JSON output format
A RAG pipeline change makes the bot answer questions it should refuse
A model swap keeps answers correct but doubles your token bill

Nothing in a normal CI pipeline catches any of this. The code compiles, the types check, the unit tests pass, and yet the behavior of your AI has changed. You find out from a customer complaint.

I built llm-canary to fix that: a regression canary that fails your build when your LLM app drifts. Like the canary in a coal mine, it falls over before your users do.

pip install llm-canary
llm-canary init && llm-canary run canary.yaml   # works immediately, zero API keys

Declarative test suites

name: support-bot
providers:
  - name: openai
    model: gpt-4o-mini
cases:
  - name: refund-policy
    prompt: "A customer asks: can I get a refund for a keyboard bought 2 weeks ago?"
    assertions:
      - type: contains
        value: "30 days"
      - type: json_schema
        value: {type: object, required: [eligible]}
      - type: judge
        value: "Politely explains the refund policy"
      - type: max_cost_usd
        value: 0.01

llm-canary run suite.yaml    # exit 0 on green, 1 on failures — drop it in CI
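
To make the suite above concrete, here is a rough Python sketch of what the contains and json_schema assertions amount to for a single reply. It is not llm-canary's implementation: `check_contains` and `check_required_keys` are hypothetical helpers, and the schema check only covers the `type: object` / `required` subset used in the example, not full JSON Schema.

```python
import json

def check_contains(reply: str, value: str) -> bool:
    """Pass when the expected substring appears anywhere in the reply."""
    return value in reply

def check_required_keys(reply: str, required: list[str]) -> bool:
    """Pass when the reply parses as a JSON object that has every required key."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)

# Two made-up replies: one structured, one free text.
good = '{"eligible": true, "reason": "Purchases can be refunded within 30 days."}'
bad = "Sure! You can get a refund within 30 days."

results = {
    name: check_contains(reply, "30 days") and check_required_keys(reply, ["eligible"])
    for name, reply in {"good": good, "bad": bad}.items()
}
print(results)
```

The free-text reply mentions "30 days" but is not a JSON object, so it fails the case. That is the kind of silent format break a one-line prompt tweak can cause.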

There are 11 assertion types, covering substrings, regex, JSON Schema, semantic similarity, LLM-as-judge, and latency/cost/token budgets. A matrix: key expands one case into the Cartesian product of its values (angry customer × 3 languages, and so on).
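
As an illustration of what a Cartesian-product expansion means, here is a small Python sketch. The `expand_matrix` helper, the `{tone}` and `{language}` placeholders, and the example values are all assumptions made for this sketch; the real matrix syntax may differ.

```python
from itertools import product

def expand_matrix(template: str, matrix: dict[str, list[str]]) -> list[str]:
    """Build one prompt per combination of matrix values."""
    keys = list(matrix)
    prompts = []
    for combo in product(*(matrix[key] for key in keys)):
        prompts.append(template.format(**dict(zip(keys, combo))))
    return prompts

# Hypothetical matrix: 2 tones x 3 languages = 6 generated cases.
matrix = {
    "tone": ["angry", "calm"],
    "language": ["English", "Japanese", "German"],
}
template = "Reply in {language} to a customer whose tone is {tone}."

prompts = expand_matrix(template, matrix)
for prompt in prompts:
    print(prompt)
```

The number of generated cases is the product of the list lengths, so a matrix grows quickly. That is worth keeping in mind when each case costs a model call.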

Regression detection without golden answers

LLM output isn't byte-stable, so exact-match snapshot tests don't work. Instead, record a baseline and compare against it by meaning:

llm-canary record suite.yaml   # snapshot today's outputs as the baseline
llm-canary check suite.yaml    # fail when meaning drifts or cost jumps

check compares against the baseline with semantic similarity and a cost-drift threshold. No hand-written expected answers — just "tell me when it changed more than I allowed."
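
The idea behind that comparison can be sketched in a few lines of Python. Everything here is an assumption for illustration: the record fields (`output`, `cost_usd`), the threshold names and defaults, and above all the similarity function, which is a bag-of-words cosine standing in for a real embedding-based semantic similarity.

```python
import math
import re
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a crude stand-in for embedding similarity."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def check_drift(baseline: dict, current: dict,
                min_similarity: float = 0.8, max_cost_increase: float = 0.25) -> list[str]:
    """Return failure messages; an empty list means no drift beyond the thresholds."""
    failures = []
    similarity = cosine_similarity(baseline["output"], current["output"])
    if similarity < min_similarity:
        failures.append(f"meaning drifted: similarity {similarity:.2f} < {min_similarity}")
    if baseline["cost_usd"] > 0:
        increase = (current["cost_usd"] - baseline["cost_usd"]) / baseline["cost_usd"]
        if increase > max_cost_increase:
            failures.append(f"cost drifted: +{increase:.0%} > +{max_cost_increase:.0%}")
    return failures

baseline = {"output": "Refunds are accepted within 30 days of purchase.", "cost_usd": 0.002}
same_meaning_pricier = {"output": "Refunds are accepted within 30 days of purchase.", "cost_usd": 0.004}
different_meaning = {"output": "I cannot help with that.", "cost_usd": 0.002}

print(check_drift(baseline, same_meaning_pricier))
print(check_drift(baseline, different_meaning))
```

The first comparison fails on cost alone and the second on meaning alone, which mirrors the "answers stay correct but the token bill doubles" case from the introduction.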

Agent traces: test what the agent did, not just what it said

LLM apps act now — they call tools, query databases, post to Slack. The risk moved from "what did the model say" to "what did the agent do". llm-canary gates a JSONL action log against a policy:

# policy.yaml
max_steps: 10
max_cost_usd: 0.05
forbidden_tools: [delete_records, send_email]
required_order: [query_sales_db, post_slack]   # read before you post
max_tool_repeats: 3                            # catch runaway loops

llm-canary trace trace.jsonl --policy policy.yaml
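
To show how such a policy could be read, here is a minimal Python sketch of a checker. It is an interpretation, not llm-canary's code: the step fields (`tool`, `cost_usd`) are assumed, `required_order` is treated as "these tools appear in this order somewhere in the trace", and `max_tool_repeats` is treated as a cap on total calls per tool.

```python
from collections import Counter

def check_policy(steps: list[dict], policy: dict) -> list[str]:
    """Return a list of violation messages; an empty list means the trace passes."""
    violations = []
    tools = [step["tool"] for step in steps]

    if "max_steps" in policy and len(steps) > policy["max_steps"]:
        violations.append(f"max_steps: {len(steps)} > {policy['max_steps']}")

    total_cost = sum(step.get("cost_usd", 0.0) for step in steps)
    if "max_cost_usd" in policy and total_cost > policy["max_cost_usd"]:
        violations.append(f"max_cost_usd: {total_cost:.4f} > {policy['max_cost_usd']}")

    for tool in sorted(set(tools) & set(policy.get("forbidden_tools", []))):
        violations.append(f"forbidden_tools: {tool} was called")

    # Subsequence check: each `in` consumes the iterator up to the match,
    # so later tools must appear after earlier ones.
    remaining = iter(tools)
    if not all(required in remaining for required in policy.get("required_order", [])):
        violations.append("required_order: tools missing or out of order")

    if "max_tool_repeats" in policy:
        for tool, count in Counter(tools).items():
            if count > policy["max_tool_repeats"]:
                violations.append(f"max_tool_repeats: {tool} called {count} times")

    return violations

policy = {
    "max_steps": 10,
    "max_cost_usd": 0.05,
    "forbidden_tools": ["delete_records", "send_email"],
    "required_order": ["query_sales_db", "post_slack"],
    "max_tool_repeats": 3,
}
good_trace = [{"tool": "query_sales_db", "cost_usd": 0.01}, {"tool": "post_slack"}]
bad_trace = [{"tool": "post_slack"}, {"tool": "query_sales_db"}, {"tool": "send_email"}]

print(check_policy(good_trace, policy))
print(check_policy(bad_trace, policy))
```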

Emit one JSON line per agent step from whatever framework you use, and the canary enforces the contract in CI.
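
Emitting that log takes very little code in any framework. The Python sketch below shows the general shape, with the caveat that the field names (`tool`, `args`, `cost_usd`) and the `log_step` helper are assumptions for illustration; check the repo for the schema the trace command actually expects.

```python
import io
import json
from typing import TextIO

def log_step(out: TextIO, tool: str, args: dict, cost_usd: float = 0.0) -> None:
    """Append one agent step as a single JSON line."""
    record = {"tool": tool, "args": args, "cost_usd": cost_usd}
    out.write(json.dumps(record) + "\n")

# In a real agent this would be open("trace.jsonl", "a"); a buffer keeps the demo side-effect free.
buffer = io.StringIO()
log_step(buffer, "query_sales_db", {"quarter": "Q3"}, cost_usd=0.004)
log_step(buffer, "post_slack", {"channel": "#sales"})

lines = buffer.getvalue().splitlines()
print(lines)
```

Calling a helper like this from the place where your framework dispatches tool calls keeps the log independent of any particular agent library.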

Test YOUR bot, not the raw model

A canary is only meaningful if the thing you change — your system prompt, your RAG pipeline, your pre/post-processing — is on the tested execution path. Sending test prompts straight to the OpenAI API tests the model, not your app.

So llm-canary can put your real application under test, however it's built:

providers:
  # anything executable — stdout is the reply
  - name: command
    options:
      cmd: "python my_bot.py --ask {prompt}"

  # anything with an HTTP API
  - name: http
    options:
      url: http://localhost:8000/chat
      body: {message: "{prompt}"}
      response_path: reply.text
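
For the http provider, the contract your endpoint has to satisfy can be sketched in Python as follows. This assumes `{prompt}` is substituted into string values of the body and that `response_path` is a dot-separated path into the JSON response; both helper names are hypothetical and neither behavior is confirmed here.

```python
def render_body(template, prompt: str):
    """Replace the {prompt} placeholder in every string value of the body template."""
    if isinstance(template, str):
        return template.replace("{prompt}", prompt)
    if isinstance(template, dict):
        return {key: render_body(value, prompt) for key, value in template.items()}
    if isinstance(template, list):
        return [render_body(item, prompt) for item in template]
    return template

def extract_reply(response: dict, response_path: str):
    """Walk a dot-separated path such as 'reply.text' through nested dicts."""
    value = response
    for key in response_path.split("."):
        value = value[key]
    return value

# What would be POSTed to http://localhost:8000/chat for one test prompt.
request_body = render_body({"message": "{prompt}"}, "Can I get a refund?")

# A made-up response your bot might return, and the text that would be asserted on.
response = {"reply": {"text": "Yes, within 30 days.", "tokens": 12}}
reply = extract_reply(response, "reply.text")

print(request_body)
print(reply)
```

Under these assumptions, your /chat handler needs to accept a JSON body with a message field and return JSON with the reply nested under reply.text.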

In CI: boot your bot, point the canary at it.

- run: docker compose up -d my-chatbot
- run: llm-canary run suite.yaml

Self-hosted eval server

llm-canary serve runs a small FastAPI service inside your own infra: run history, a dashboard, and team-shared baselines in a local SQLite file. Your prompts and agent logs never leave your network — useful if you can't ship eval data to a SaaS.

How it compares

promptfoo and DeepEval are excellent and more mature for prompt evaluation. llm-canary's niche is the combination of agent-trace policy gates, baseline regression without golden answers, and a fully self-hosted history server — all MIT-licensed, no SaaS upsell.

Repo: https://github.com/okssusucha/llm-canary — issues and PRs welcome, in English or Japanese.