<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniel Kissel</title>
    <description>The latest articles on DEV Community by Daniel Kissel (@dpk890).</description>
    <link>https://dev.to/dpk890</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4010300%2Fac0c9e4f-c0c0-4766-b058-5ad514a3bde2.png</url>
      <title>DEV Community: Daniel Kissel</title>
      <link>https://dev.to/dpk890</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dpk890"/>
    <language>en</language>
    <item>
      <title>DProvenanceKit — regression testing and observability for the reasoning of AI agents (Python, zero deps)</title>
      <dc:creator>Daniel Kissel</dc:creator>
      <pubDate>Wed, 01 Jul 2026 04:31:31 +0000</pubDate>
      <link>https://dev.to/dpk890/dprovenancekit-regression-testing-and-observability-for-the-reasoning-of-ai-agents-python-zero-3ckd</link>
      <guid>https://dev.to/dpk890/dprovenancekit-regression-testing-and-observability-for-the-reasoning-of-ai-agents-python-zero-3ckd</guid>
      <description>

&lt;p&gt;Every agent framework will happily tell you what happened — here are the tokens, here are the tool calls, here's a trace waterfall. What almost none of them tell you is what changed between two runs, and whether that change is a regression.&lt;/p&gt;

&lt;p&gt;That's the gap DProvenanceKit fills. It turns each execution of an agent into a queryable, diffable trace, then gives you the surfaces to actually act on drift:&lt;/p&gt;

&lt;p&gt;Run → Record → Query → Diff → Detect regressions → Gate in CI&lt;/p&gt;

&lt;p&gt;The core uses only the Python standard library (sqlite3, contextvars, threading, json, hashlib), with zero third-party dependencies. It's a faithful port of a Swift library, and the two are kept behaviorally identical by a frozen cross-language conformance spec, not by hope.&lt;/p&gt;

&lt;p&gt;The pitch in one example. A research agent answers a support question. The healthy "golden" run does plan → search → rank → verify → decide. A later PR quietly breaks it — the agent loops its search tool and skips verification. Here's every layer catching it, from one runnable script (python examples/end_to_end_demo.py):&lt;/p&gt;

&lt;p&gt;── 1. Record two runs of the agent ───────────────────────&lt;br&gt;
   golden:    5 steps  ['plan', 'search', 'rank', 'verify', 'decide']&lt;br&gt;
   candidate: 8 steps  ['plan', 'search', 'search', 'search', 'search', 'search', 'search', 'decide']&lt;br&gt;
── 2. Query for a suspicious pattern (searched but never verified) ──&lt;br&gt;
   matched 1 run(s): ['research-agent · PR-42']&lt;br&gt;
── 3. Gate the candidate against the golden run ──────────&lt;br&gt;
   verdict: REGRESSION  (severity high, strength 0.95)&lt;br&gt;
   removed critical steps: ['rank', 'verify']&lt;br&gt;
── 4. Out-of-the-box anomaly rules ───────────────────────&lt;br&gt;
   ! [tool_drop:verify] required step 'verify' was never recorded&lt;br&gt;
   ! [looping:search] step 'search' repeated 6 times (&amp;gt; 3 allowed)&lt;br&gt;
── 5. Structural diff ────────────────────────────────────&lt;br&gt;
   removed rank    (Retriever) @seq 2&lt;br&gt;
   removed verify  (Verifier)  @seq 3&lt;br&gt;
   added   search  (Retriever) @seq 2..6&lt;br&gt;
OK — every layer agreed: the candidate regressed (dropped verify, looped search).&lt;/p&gt;
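
&lt;p&gt;The two anomaly rules in step 4 are easy to reason about in isolation. Here is a minimal sketch in plain Python, assuming a run is just a list of step names. The function name check_anomalies, its parameters, and the choice to count total (rather than consecutive) repeats are my assumptions for illustration, not DProvenanceKit's API:&lt;/p&gt;

```python
from collections import Counter

def check_anomalies(steps, required=("verify",), max_repeats=3):
    """Return (rule, step, message) tuples for a list of step names.

    Hypothetical helper for illustration; not part of DProvenanceKit.
    """
    findings = []
    counts = Counter(steps)
    # tool_drop: a required step never appears in the run
    for name in required:
        if counts[name] == 0:
            findings.append(("tool_drop", name, f"required step '{name}' was never recorded"))
    # looping: a step appears more often than allowed (total count, not consecutive)
    for name, n in counts.items():
        if n > max_repeats:
            findings.append(("looping", name, f"step '{name}' repeated {n} times (> {max_repeats} allowed)"))
    return findings

golden = ["plan", "search", "rank", "verify", "decide"]
candidate = ["plan"] + ["search"] * 6 + ["decide"]
```

&lt;p&gt;Calling check_anomalies(golden) returns an empty list, while check_anomalies(candidate) reports one tool_drop finding for verify and one looping finding for search.&lt;/p&gt;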

&lt;p&gt;Why this works. Each run has a fingerprint: the structural identity of the agent's execution path. Two runs that diverge (a tool called in a different order, a retrieval step skipped) produce different fingerprints. That's a cheap, deterministic regression signal you can compute without an LLM in the loop. When you want more than "same/different," a semantic alignment engine grades the divergence by severity: dropping a CRITICAL step is a HIGH regression, and so is a reordering that inverts a dependency.&lt;/p&gt;
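
&lt;p&gt;To make the fingerprint idea concrete, here is a rough sketch, assuming a fingerprint is a SHA-256 hash over the ordered (step name, role) pairs with payloads ignored. This is my illustration of the general technique, not necessarily how DProvenanceKit computes it; the fingerprint function and the Planner and Decider role names are hypothetical:&lt;/p&gt;

```python
import hashlib
import json

def fingerprint(steps):
    """Hash the ordered (name, role) pairs of a run; payloads are ignored.

    Hypothetical helper for illustration; not part of DProvenanceKit.
    """
    # Compact JSON gives a stable, canonical byte string for the same structure
    canonical = json.dumps([[name, role] for name, role in steps], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

golden = [("plan", "Planner"), ("search", "Retriever"), ("rank", "Retriever"),
          ("verify", "Verifier"), ("decide", "Decider")]
# Same steps, but rank now runs before search
reordered = [golden[0], golden[2], golden[1], golden[3], golden[4]]
# Same steps, but verify is skipped
skipped = [step for step in golden if step[0] != "verify"]
```

&lt;p&gt;Under these assumptions, two runs with the same structure hash to the same value, while the reordered and skipped runs each produce a different one. No model call is involved, so the comparison is deterministic.&lt;/p&gt;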

&lt;p&gt;The part I actually like: it gates PRs. No server required. Point the CLI at a SQLite trace DB and two run IDs; the exit code is 0 on pass and 1 on regression:&lt;/p&gt;

&lt;p&gt;dprovenancekit gate --db traces.sqlite --golden "$GOLDEN" --candidate "$CANDIDATE"&lt;/p&gt;

&lt;p&gt;There's a drop-in GitHub Action and GitLab CI template that wrap this and comment the diff on the PR/MR. So "my agent silently stopped verifying its sources" becomes a red check instead of a customer ticket three weeks later.&lt;/p&gt;

&lt;p&gt;Instrumentation is basically free:&lt;/p&gt;

&lt;p&gt;LangChain / LangGraph adapter — callbacks become a span tree automatically.&lt;br&gt;
OpenAI Agents SDK — one register(store) and every run is captured.&lt;/p&gt;

&lt;p&gt;No framework? Decorate plain functions with @traced (sync functions, async functions, and generators all work). Capture is failure-proof: it never changes your code's behavior or swallows exceptions.&lt;/p&gt;

&lt;p&gt;from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event&lt;/p&gt;

&lt;p&gt;@traced&lt;br&gt;
def search(query): ...&lt;/p&gt;

&lt;p&gt;@traced&lt;br&gt;
def answer(question, sources): ...&lt;/p&gt;

&lt;p&gt;store = InMemoryTraceStore()  # assuming the constructor takes no arguments&lt;br&gt;
question = "How do I reset my password?"  # example input&lt;/p&gt;

&lt;p&gt;with traced_run(store, context_id="ticket-42"):&lt;br&gt;
    sources = search(question)&lt;br&gt;
    record_event("plan.chosen", {"strategy": "rag"})&lt;br&gt;
    reply = answer(question, sources)&lt;/p&gt;

&lt;p&gt;Other things in the box: one query DSL over two backends (in-memory + WAL SQLite) held in lockstep by a parity suite; deterministic replay; priority-aware backpressure so load-shedding is never silent; shareable HTML regression reports; and a validation corpus that scores Precision/Recall/F1 = 1.000 across 8 standard + 5 adversarial scenarios, matching the Swift implementation case-for-case. The package has 168 tests.&lt;/p&gt;

&lt;p&gt;Apache-2.0. The SDK and all the CI tooling are open; there's a separate hosted visualizer (span tree, payload inspector, side-by-side semantic diff) if you want the dashboard, but you never need it — everything above runs offline from your terminal.&lt;/p&gt;

&lt;p&gt;pip install dprovenancekit&lt;br&gt;
python examples/end_to_end_demo.py   # the whole arc, self-asserting&lt;/p&gt;

&lt;p&gt;Happy to answer questions on the fingerprinting/alignment model or the cross-language conformance approach — that part was the most interesting to build.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>git</category>
    </item>
  </channel>
</rss>
