The silent regression problem
If you ship an LLM-powered app, you've probably lived this:
- A one-line system prompt tweak silently breaks your JSON output format
- A RAG pipeline change makes the bot answer questions it should refuse
- A model swap keeps answers correct but doubles your token bill
Nothing in a normal CI pipeline catches any of this. The code compiles, the types check, the unit tests pass — and the behavior of your AI has changed. You find out from a customer complaint.
I built llm-canary to fix that: a regression canary that fails your build when your LLM app drifts. Like the canary in a coal mine, it falls over before your users do.
pip install llm-canary
llm-canary init && llm-canary run canary.yaml # works immediately, zero API keys
Declarative test suites
name: support-bot
providers:
  - name: openai
    model: gpt-4o-mini
cases:
  - name: refund-policy
    prompt: "A customer asks: can I get a refund for a keyboard bought 2 weeks ago?"
    assertions:
      - type: contains
        value: "30 days"
      - type: json_schema
        value: {type: object, required: [eligible]}
      - type: judge
        value: "Politely explains the refund policy"
      - type: max_cost_usd
        value: 0.01
llm-canary run suite.yaml # exit 0 on green, 1 on failures — drop it in CI
There are 11 assertion types, covering substrings, regex, JSON Schema, semantic similarity, LLM-as-judge, and latency/cost/token budgets. A matrix: key expands one case into the cartesian product of its values (an angry customer × 3 languages becomes three cases, and so on).
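To make the matrix idea concrete, here is a rough sketch of what a cartesian expansion could look like. This is not llm-canary's code: the expand_matrix helper, the {tone}/{language} placeholders, and the bracketed case names are all assumptions made up for illustration.

```python
from itertools import product


def expand_matrix(case: dict) -> list[dict]:
    """Return one concrete case per combination of matrix values (hypothetical helper)."""
    matrix = case.get("matrix", {})
    if not matrix:
        return [dict(case)]
    keys = list(matrix)
    expanded = []
    for combo in product(*(matrix[k] for k in keys)):
        variables = dict(zip(keys, combo))
        # Copy everything except the matrix itself, then fill in the placeholders.
        concrete = {k: v for k, v in case.items() if k != "matrix"}
        concrete["name"] = case["name"] + "[" + ",".join(str(v) for v in combo) + "]"
        concrete["prompt"] = case["prompt"].format(**variables)
        expanded.append(concrete)
    return expanded


# Assumed case shape: placeholder syntax is illustrative only.
case = {
    "name": "refund",
    "prompt": "A {tone} customer asks in {language}: can I get a refund?",
    "matrix": {"tone": ["angry"], "language": ["English", "Japanese", "German"]},
}
cases = expand_matrix(case)
for c in cases:
    print(c["name"], "->", c["prompt"])
```

One tone times three languages gives three concrete cases; adding a second tone would give six.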
Regression detection without golden answers
LLM output isn't byte-stable, so exact-match snapshot tests don't work. Instead:
llm-canary record suite.yaml # snapshot today's outputs as the baseline
llm-canary check suite.yaml # fail when meaning drifts or cost jumps
check compares against the baseline with semantic similarity and a cost-drift threshold. No hand-written expected answers — just "tell me when it changed more than I allowed."
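If you want a feel for the idea, here is a minimal sketch of a baseline comparison, assuming a record with output and cost_usd fields. The field names, the thresholds, and the function names are my assumptions, and the bag-of-words cosine is a crude stdlib stand-in for real embedding similarity; it is not how llm-canary computes it.

```python
import math
import re
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(n * n for n in va.values())) * math.sqrt(
        sum(n * n for n in vb.values())
    )
    return dot / norm if norm else 0.0


def check_against_baseline(baseline, current, min_similarity=0.8, max_cost_increase=0.25):
    """Return a list of failure messages; an empty list means no drift detected."""
    failures = []
    sim = cosine_similarity(baseline["output"], current["output"])
    if sim < min_similarity:
        failures.append(f"meaning drifted: similarity {sim:.2f} < {min_similarity}")
    if baseline["cost_usd"] > 0:
        increase = (current["cost_usd"] - baseline["cost_usd"]) / baseline["cost_usd"]
        if increase > max_cost_increase:
            failures.append(f"cost drifted: +{increase:.0%} > +{max_cost_increase:.0%}")
    return failures


# Hypothetical records: same meaning, but the new run costs twice as much.
baseline = {"output": "Refunds are available within 30 days of purchase.", "cost_usd": 0.002}
pricier = {
    "output": "Refunds are available within 30 days of purchase, so you are eligible.",
    "cost_usd": 0.004,
}
refusal = {"output": "I cannot help with that.", "cost_usd": 0.002}

print(check_against_baseline(baseline, pricier))
print(check_against_baseline(baseline, refusal))
```

The point of the design is that nobody writes an expected answer: the first recorded run is the reference, and the thresholds say how much change you tolerate.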
Agent traces: test what the agent did, not just what it said
LLM apps act now — they call tools, query databases, post to Slack. The risk moved from "what did the model say" to "what did the agent do". llm-canary gates a JSONL action log against a policy:
# policy.yaml
max_steps: 10
max_cost_usd: 0.05
forbidden_tools: [delete_records, send_email]
required_order: [query_sales_db, post_slack] # read before you post
max_tool_repeats: 3 # catch runaway loops
llm-canary trace trace.jsonl --policy policy.yaml
Emit one JSON line per agent step from whatever framework you use, and the canary enforces the contract in CI.
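Here is a rough sketch of what enforcing that contract could look like. Everything about the trace shape is an assumption: I'm pretending each JSONL line has a tool name and a cost_usd field, that required_order means "these tools appear in this relative order", and that max_tool_repeats counts total calls per tool. The real schema and semantics may differ.

```python
import json
from collections import Counter


def check_trace(lines, policy):
    """Check a JSONL action log against a policy dict; return a list of violations."""
    steps = [json.loads(line) for line in lines if line.strip()]
    tools = [s["tool"] for s in steps if s.get("tool")]
    violations = []

    if len(steps) > policy.get("max_steps", float("inf")):
        violations.append(f"max_steps: {len(steps)} steps")

    total_cost = sum(s.get("cost_usd", 0.0) for s in steps)
    if total_cost > policy.get("max_cost_usd", float("inf")):
        violations.append(f"max_cost_usd: spent {total_cost:.4f}")

    for tool in sorted(set(tools) & set(policy.get("forbidden_tools", []))):
        violations.append(f"forbidden_tools: {tool} was called")

    # Assumed semantics: the required tools must occur as an ordered subsequence.
    position = -1
    for tool in policy.get("required_order", []):
        try:
            position = tools.index(tool, position + 1)
        except ValueError:
            violations.append(f"required_order: {tool} missing or out of order")
            break

    # Assumed semantics: total calls per tool, not consecutive repeats.
    for tool, count in Counter(tools).items():
        if count > policy.get("max_tool_repeats", float("inf")):
            violations.append(f"max_tool_repeats: {tool} called {count} times")

    return violations


policy = {
    "max_steps": 10,
    "max_cost_usd": 0.05,
    "forbidden_tools": ["delete_records", "send_email"],
    "required_order": ["query_sales_db", "post_slack"],
    "max_tool_repeats": 3,
}
good_trace = [
    '{"tool": "query_sales_db", "cost_usd": 0.01}',
    '{"tool": "post_slack", "cost_usd": 0.0}',
]
bad_trace = [
    '{"tool": "post_slack", "cost_usd": 0.0}',
    '{"tool": "query_sales_db", "cost_usd": 0.01}',
    '{"tool": "send_email", "cost_usd": 0.0}',
]
print(check_trace(good_trace, policy))
print(check_trace(bad_trace, policy))
```

The bad trace posts to Slack before reading the database and sends an email, so it trips both the ordering rule and the forbidden-tool rule.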
Test YOUR bot, not the raw model
A canary is only meaningful if the thing you change — your system prompt, your RAG pipeline, your pre/post-processing — is on the tested execution path. Sending test prompts straight to the OpenAI API tests the model, not your app.
So llm-canary can put your real application under test, however it's built:
providers:
  # anything executable: stdout is the reply
  - name: command
    options:
      cmd: "python my_bot.py --ask {prompt}"
  # anything with an HTTP API
  - name: http
    options:
      url: http://localhost:8000/chat
      body: {message: "{prompt}"}
      response_path: reply.text
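For the HTTP provider, the two moving parts are filling {prompt} into the request body and pulling the reply out of the JSON response. The sketch below shows one plausible reading, assuming response_path is a dotted key lookup; fill_template and extract are hypothetical helpers, not llm-canary functions, and no network call is made.

```python
import json


def fill_template(template, prompt):
    """Recursively replace the literal "{prompt}" placeholder inside a body template."""
    if isinstance(template, str):
        return template.replace("{prompt}", prompt)
    if isinstance(template, dict):
        return {k: fill_template(v, prompt) for k, v in template.items()}
    if isinstance(template, list):
        return [fill_template(v, prompt) for v in template]
    return template


def extract(response, path):
    """Walk a dotted path such as "reply.text" through nested dicts and lists."""
    value = response
    for key in path.split("."):
        value = value[int(key)] if isinstance(value, list) else value[key]
    return value


# The request body the bot would receive for one test prompt.
body = fill_template({"message": "{prompt}"}, "Can I get a refund?")

# A made-up response standing in for what http://localhost:8000/chat might return.
fake_response = json.loads('{"reply": {"text": "Yes, within 30 days."}}')
reply = extract(fake_response, "reply.text")

print(body)
print(reply)
```

Whatever the exact mechanics, the reply that reaches the assertions is the one your own service produced, so prompt, retrieval, and post-processing changes are all on the tested path.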
In CI: boot your bot, point the canary at it.
- run: docker compose up -d my-chatbot
- run: llm-canary run suite.yaml
Self-hosted eval server
llm-canary serve runs a small FastAPI service inside your own infra: run history, a dashboard, and team-shared baselines in a local SQLite file. Your prompts and agent logs never leave your network — useful if you can't ship eval data to a SaaS.
How it compares
promptfoo and DeepEval are excellent and more mature for prompt evaluation. llm-canary's niche is the combination of agent-trace policy gates, baseline regression without golden answers, and a fully self-hosted history server — all MIT-licensed, no SaaS upsell.
Repo: https://github.com/okssusucha/llm-canary — issues and PRs welcome, in English or Japanese.