<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yui Kato</title>
    <description>The latest articles on DEV Community by Yui Kato (@okssusucha).</description>
    <link>https://dev.to/okssusucha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3976784%2F93f6921b-71ae-4db7-8b6e-2db4b2b53afb.jpg</url>
      <title>DEV Community: Yui Kato</title>
      <link>https://dev.to/okssusucha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/okssusucha"/>
    <language>en</language>
    <item>
      <title>Gate your LLM app in CI: prompt regression testing + agent trace policies with llm-canary</title>
      <dc:creator>Yui Kato</dc:creator>
      <pubDate>Wed, 10 Jun 2026 02:13:28 +0000</pubDate>
      <link>https://dev.to/okssusucha/gate-your-llm-app-in-ci-prompt-regression-testing-agent-trace-policies-with-llm-canary-33fn</link>
      <guid>https://dev.to/okssusucha/gate-your-llm-app-in-ci-prompt-regression-testing-agent-trace-policies-with-llm-canary-33fn</guid>
      <description>&lt;h2&gt;
  
  
  The silent regression problem
&lt;/h2&gt;

&lt;p&gt;If you ship an LLM-powered app, you've probably lived this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A one-line system prompt tweak silently breaks your JSON output format&lt;/li&gt;
&lt;li&gt;A RAG pipeline change makes the bot answer questions it should refuse&lt;/li&gt;
&lt;li&gt;A model swap keeps answers correct but doubles your token bill&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing in a normal CI pipeline catches any of this. The code compiles, the types check, the unit tests pass — and the &lt;em&gt;behavior&lt;/em&gt; of your AI has changed. You find out from a customer complaint.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;&lt;a href="https://github.com/okssusucha/llm-canary" rel="noopener noreferrer"&gt;llm-canary&lt;/a&gt;&lt;/strong&gt; to fix that: a regression canary that fails your build when your LLM app drifts. Like the canary in a coal mine, it falls over before your users do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;llm-canary
llm-canary init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; llm-canary run canary.yaml   &lt;span class="c"&gt;# works immediately, zero API keys&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Declarative test suites
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;support-bot&lt;/span&gt;
&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;
&lt;span class="na"&gt;cases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;refund-policy&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;asks:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;can&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;keyboard&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;bought&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;weeks&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ago?"&lt;/span&gt;
    &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json_schema&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;object&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;eligible&lt;/span&gt;&lt;span class="pi"&gt;]}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;judge&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Politely&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;explains&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;policy"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;max_cost_usd&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.01&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-canary run suite.yaml    &lt;span class="c"&gt;# exit 0 on green, 1 on failures — drop it in CI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 11 assertion types, covering substrings, regex, JSON Schema, semantic similarity, LLM-as-judge, and latency/cost/token budgets. A &lt;code&gt;matrix:&lt;/code&gt; key expands one case into the cartesian product of its values (for example, an angry customer × 3 languages).&lt;/p&gt;
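&lt;p&gt;To make the cartesian expansion concrete, here is a rough Python sketch of the idea. It is not llm-canary's actual implementation: the &lt;code&gt;expand_matrix&lt;/code&gt; helper, the &lt;code&gt;{tone}&lt;/code&gt; and &lt;code&gt;{language}&lt;/code&gt; placeholders, and the generated case names are all assumptions made for illustration.&lt;/p&gt;

```python
from itertools import product


def expand_matrix(case, matrix):
    """Expand one templated case into one concrete case per combination.

    Hypothetical helper: `matrix` maps a placeholder name to its values.
    """
    keys = list(matrix)
    expanded = []
    for combo in product(*(matrix[key] for key in keys)):
        variables = dict(zip(keys, combo))
        expanded.append({
            "name": case["name"] + "-" + "-".join(combo),
            "prompt": case["prompt"].format(**variables),
        })
    return expanded


case = {
    "name": "refund",
    "prompt": "Reply in {language} to a customer who sounds {tone}.",
}
matrix = {"tone": ["angry"], "language": ["English", "Japanese", "Spanish"]}

cases = expand_matrix(case, matrix)
for expanded_case in cases:
    print(expanded_case["name"], "|", expanded_case["prompt"])
```

&lt;p&gt;One tone times three languages gives three concrete cases; adding a second tone would double that.&lt;/p&gt;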

&lt;h2&gt;
  
  
  Regression detection without golden answers
&lt;/h2&gt;

&lt;p&gt;LLM output isn't byte-stable, so exact-match snapshot tests don't work. Instead, record a baseline and compare later runs against it by meaning and cost:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-canary record suite.yaml   &lt;span class="c"&gt;# snapshot today's outputs as the baseline&lt;/span&gt;
llm-canary check suite.yaml    &lt;span class="c"&gt;# fail when meaning drifts or cost jumps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;check&lt;/code&gt; compares against the baseline with semantic similarity and a cost-drift threshold. No hand-written expected answers — just "tell me when it changed more than I allowed."&lt;/p&gt;
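&lt;p&gt;The shape of that comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's real code: a toy word-count cosine stands in for real embedding similarity, and the &lt;code&gt;check_drift&lt;/code&gt; function, the record fields, and both thresholds are hypothetical.&lt;/p&gt;

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Toy similarity: cosine over word counts (a stand-in for embeddings)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_drift(baseline, current, min_similarity=0.8, max_cost_ratio=1.5):
    """Return a list of failure messages; an empty list means no drift."""
    failures = []
    similarity = cosine_similarity(baseline["output"], current["output"])
    if min_similarity > similarity:
        failures.append(f"meaning drifted: similarity {similarity:.2f}")
    if current["cost_usd"] > baseline["cost_usd"] * max_cost_ratio:
        failures.append("cost jumped beyond the allowed ratio")
    return failures


baseline = {
    "output": "Refunds are available within 30 days of purchase.",
    "cost_usd": 0.002,
}
reworded = {
    "output": "Refunds are available within 30 days of your purchase.",
    "cost_usd": 0.002,
}
pricey = {"output": baseline["output"], "cost_usd": 0.005}
drifted = {"output": "We never offer refunds.", "cost_usd": 0.002}

for label, run in [("reworded", reworded), ("pricey", pricey), ("drifted", drifted)]:
    print(label, check_drift(baseline, run))
```

&lt;p&gt;A light rewording passes, while a changed answer or a cost jump fails, without anyone writing an expected answer by hand.&lt;/p&gt;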

&lt;h2&gt;
  
  
  Agent traces: test what the agent &lt;em&gt;did&lt;/em&gt;, not just what it said
&lt;/h2&gt;

&lt;p&gt;LLM apps now act: they call tools, query databases, and post to Slack. The risk has moved from "what did the model say" to "what did the agent do". llm-canary gates a JSONL action log against a policy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# policy.yaml&lt;/span&gt;
&lt;span class="na"&gt;max_steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="na"&gt;max_cost_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;
&lt;span class="na"&gt;forbidden_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;delete_records&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;send_email&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;required_order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;query_sales_db&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;post_slack&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# read before you post&lt;/span&gt;
&lt;span class="na"&gt;max_tool_repeats&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;                            &lt;span class="c1"&gt;# catch runaway loops&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-canary trace trace.jsonl &lt;span class="nt"&gt;--policy&lt;/span&gt; policy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Emit one JSON line per agent step from whatever framework you use, and the canary enforces the contract in CI.&lt;/p&gt;
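&lt;p&gt;For a feel of what such a gate checks, here is a minimal Python sketch. It is not llm-canary's implementation, and several details are assumptions: the step schema (a &lt;code&gt;tool&lt;/code&gt; name plus an optional &lt;code&gt;cost_usd&lt;/code&gt;), reading &lt;code&gt;required_order&lt;/code&gt; as "these tools must appear in this relative order", and reading &lt;code&gt;max_tool_repeats&lt;/code&gt; as a cap on total calls per tool.&lt;/p&gt;

```python
import json
from collections import Counter


def check_trace(steps, policy):
    """Check a list of agent steps against a policy; return violations.

    Assumes each step is a dict with a "tool" name and an optional
    "cost_usd" (a hypothetical schema for this sketch).
    """
    violations = []
    tools = [step["tool"] for step in steps]

    if len(steps) > policy.get("max_steps", float("inf")):
        violations.append("too many steps")

    total_cost = sum(step.get("cost_usd", 0.0) for step in steps)
    if total_cost > policy.get("max_cost_usd", float("inf")):
        violations.append("cost budget exceeded")

    for tool in tools:
        if tool in policy.get("forbidden_tools", []):
            violations.append(f"forbidden tool called: {tool}")

    # required_order: each listed tool must appear, in this relative order.
    position = 0
    for required in policy.get("required_order", []):
        if required in tools[position:]:
            position = tools.index(required, position) + 1
        else:
            violations.append(f"missing or out of order: {required}")
            break

    # max_tool_repeats: interpreted here as a cap on total calls per tool.
    max_repeats = policy.get("max_tool_repeats")
    if max_repeats is not None:
        for tool, count in Counter(tools).items():
            if count > max_repeats:
                violations.append(f"tool repeated {count} times: {tool}")

    return violations


def parse_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


policy = {
    "max_steps": 10,
    "max_cost_usd": 0.05,
    "forbidden_tools": ["delete_records", "send_email"],
    "required_order": ["query_sales_db", "post_slack"],
    "max_tool_repeats": 3,
}

# One JSON object per line, as an agent might log its steps.
good_log = "\n".join([
    json.dumps({"tool": "query_sales_db", "cost_usd": 0.01}),
    json.dumps({"tool": "post_slack", "cost_usd": 0.0}),
])
bad_log = "\n".join([
    json.dumps({"tool": "post_slack", "cost_usd": 0.0}),
    json.dumps({"tool": "send_email", "cost_usd": 0.0}),
])

good_violations = check_trace(parse_jsonl(good_log), policy)
bad_violations = check_trace(parse_jsonl(bad_log), policy)
print(good_violations)
print(bad_violations)
```

&lt;p&gt;In a CI job, a non-empty violation list would translate into a non-zero exit code.&lt;/p&gt;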

&lt;h2&gt;
  
  
  Test YOUR bot, not the raw model
&lt;/h2&gt;

&lt;p&gt;A canary is only meaningful if the thing you change — your system prompt, your RAG pipeline, your pre/post-processing — is on the tested execution path. Sending test prompts straight to the OpenAI API tests &lt;em&gt;the model&lt;/em&gt;, not &lt;em&gt;your app&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So llm-canary can put your real application under test, however it's built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# anything executable — stdout is the reply&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;command&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;my_bot.py&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--ask&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{prompt}"&lt;/span&gt;

  &lt;span class="c1"&gt;# anything with an HTTP API&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8000/chat&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{prompt}"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;response_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reply.text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
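&lt;p&gt;The &lt;code&gt;body&lt;/code&gt; and &lt;code&gt;response_path&lt;/code&gt; options suggest two small steps: substitute the prompt into the request body, then pull the reply out of the JSON response by a dotted path. A Python sketch of those two steps follows. How llm-canary does this internally is not shown here, so the &lt;code&gt;fill_template&lt;/code&gt; and &lt;code&gt;extract_path&lt;/code&gt; helpers and the sample response are assumptions.&lt;/p&gt;

```python
import json


def fill_template(value, prompt):
    """Recursively substitute the literal {prompt} placeholder in a body."""
    if isinstance(value, str):
        return value.replace("{prompt}", prompt)
    if isinstance(value, dict):
        return {key: fill_template(item, prompt) for key, item in value.items()}
    if isinstance(value, list):
        return [fill_template(item, prompt) for item in value]
    return value


def extract_path(payload, dotted_path):
    """Walk a dotted path such as "reply.text" through nested dicts."""
    current = payload
    for key in dotted_path.split("."):
        current = current[key]
    return current


body = fill_template({"message": "{prompt}"}, "Can I get a refund?")
print(json.dumps(body))

# A made-up response, shaped so that response_path "reply.text" resolves.
response = {"reply": {"text": "Yes, within 30 days.", "tokens": 12}}
print(extract_path(response, "reply.text"))
```

&lt;p&gt;The sketch uses plain string replacement rather than &lt;code&gt;str.format&lt;/code&gt; so that other braces in a JSON body are left alone.&lt;/p&gt;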



&lt;p&gt;In CI, boot your bot, then point the canary at it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker compose up -d my-chatbot&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-canary run suite.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Self-hosted eval server
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;llm-canary serve&lt;/code&gt; runs a small FastAPI service inside your own infra: run history, a dashboard, and team-shared baselines in a local SQLite file. Your prompts and agent logs never leave your network — useful if you can't ship eval data to a SaaS.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;p&gt;promptfoo and DeepEval are excellent and more mature for prompt evaluation. llm-canary's niche is the combination of &lt;strong&gt;agent-trace policy gates&lt;/strong&gt;, &lt;strong&gt;baseline regression without golden answers&lt;/strong&gt;, and a &lt;strong&gt;fully self-hosted history server&lt;/strong&gt; — all MIT-licensed, no SaaS upsell.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/okssusucha/llm-canary" rel="noopener noreferrer"&gt;https://github.com/okssusucha/llm-canary&lt;/a&gt; — issues and PRs welcome, in English or Japanese.&lt;/p&gt;

</description>
      <category>opensource</category>
    </item>
  </channel>
</rss>
