DProvenanceKit — regression testing and observability for the reasoning of AI agents (Python, zero deps)

Daniel Kissel — Wed, 01 Jul 2026 04:31:31 +0000

Every agent framework will happily tell you what happened — here are the tokens, here are the tool calls, here's a trace waterfall. What almost none of them tell you is what changed between two runs, and whether that change is a regression.

That's the gap DProvenanceKit fills. It turns each execution of an agent into a queryable, diffable trace, then gives you the surfaces to actually act on drift:

Run → Record → Query → Diff → Detect regressions → Gate in CI

The core is pure Python standard library (sqlite3, contextvars, threading, json, hashlib), with zero third-party dependencies. It's a faithful port of a Swift library, and the two are kept behaviorally identical by a frozen cross-language conformance spec, not by hope.

The pitch in one example. A research agent answers a support question. The healthy "golden" run does plan → search → rank → verify → decide. A later PR quietly breaks it — the agent loops its search tool and skips verification. Here's every layer catching it, from one runnable script (python examples/end_to_end_demo.py):

── 1. Record two runs of the agent ───────────────────────
golden: 5 steps ['plan', 'search', 'rank', 'verify', 'decide']
candidate: 8 steps ['plan', 'search', 'search', 'search', 'search', 'search', 'search', 'decide']
── 2. Query for a suspicious pattern (searched but never verified) ──
matched 1 run(s): ['research-agent · PR-42']
── 3. Gate the candidate against the golden run ──────────
verdict: REGRESSION (severity high, strength 0.95)
removed critical steps: ['rank', 'verify']
── 4. Out-of-the-box anomaly rules ───────────────────────
! [tool_drop:verify] required step 'verify' was never recorded
! [looping:search] step 'search' repeated 6 times (> 3 allowed)
── 5. Structural diff ────────────────────────────────────
removed rank (Retriever) @seq 2
removed verify (Verifier) @seq 3
added search (Retriever) @seq 2..6
OK — every layer agreed: the candidate regressed (dropped verify, looped search).
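
To make step 4 concrete, here is a minimal sketch of what rules like tool_drop and looping could look like over a plain list of step names. This is not DProvenanceKit's implementation: the function name check_rules and its parameters are hypothetical, and the threshold of 3 is taken from the demo output above.

```python
from collections import Counter

def check_rules(steps, required=("verify",), max_repeats=3):
    """Return anomaly messages for an ordered list of step names (illustrative only)."""
    findings = []
    counts = Counter(steps)
    # tool_drop: a required step never appears in the run
    for name in required:
        if counts[name] == 0:
            findings.append(
                f"[tool_drop:{name}] required step '{name}' was never recorded"
            )
    # looping: a step appears more often than the allowed maximum
    for name, n in counts.items():
        if n > max_repeats:
            findings.append(
                f"[looping:{name}] step '{name}' repeated {n} times (> {max_repeats} allowed)"
            )
    return findings

golden = ["plan", "search", "rank", "verify", "decide"]
candidate = ["plan", "search", "search", "search", "search", "search", "search", "decide"]

for finding in check_rules(candidate):
    print("!", finding)
```

Run against the golden list, the same function returns no findings, which is the property you want from a rule that gates PRs.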

Why this works. Each run has a fingerprint: the structural identity of the agent's execution path. Two runs that diverge (a tool called in a different order, a retrieval step skipped) produce different fingerprints. That's a cheap, deterministic regression signal you can compute without an LLM in the loop. When you want more than "same/different," a semantic alignment engine grades the divergence by severity: dropping a CRITICAL step is a HIGH regression, and so is reordering steps in a way that inverts a dependency.
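
The fingerprint idea can be sketched with nothing but hashlib and json. This is an assumption about the general technique, not the library's actual algorithm: the helpers fingerprint and removed_critical are hypothetical, and the sketch hashes only the ordered step names.

```python
import hashlib
import json

def fingerprint(steps):
    """Hash the ordered step names of a run; payloads and timings are ignored."""
    canonical = json.dumps(list(steps), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def removed_critical(golden, candidate, critical):
    """Critical steps present in the golden run but absent from the candidate."""
    present = set(candidate)
    return [s for s in golden if s in critical and s not in present]

golden = ["plan", "search", "rank", "verify", "decide"]
candidate = ["plan", "search", "search", "search", "search", "search", "search", "decide"]
reordered = ["plan", "rank", "search", "verify", "decide"]  # same steps, different order

print(fingerprint(golden) == fingerprint(list(golden)))  # True: same path, same hash
print(fingerprint(golden) == fingerprint(candidate))     # False: steps dropped and looped
print(fingerprint(golden) == fingerprint(reordered))     # False: order is part of identity
print(removed_critical(golden, candidate, {"rank", "verify"}))  # ['rank', 'verify']
```

The hash only answers "same or different"; deciding how bad a difference is needs extra knowledge, such as which steps are critical, and that is the job the alignment engine takes on.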

The part I actually like: it gates PRs. No server required. Point the CLI at a SQLite trace DB and two run IDs; the exit code is 0 on pass and 1 on regression:

dprovenancekit gate --db traces.sqlite --golden "$GOLDEN" --candidate "$CANDIDATE"

There's a drop-in GitHub Action and GitLab CI template that wrap this and comment the diff on the PR/MR. So "my agent silently stopped verifying its sources" becomes a red check instead of a customer ticket three weeks later.
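
If you'd rather call the gate from your own CI script than use the templates, a thin wrapper is enough. The sketch below uses only the command and flags shown above; the helper names gate_argv and run_gate are hypothetical, and the runner parameter exists only so the wrapper can be exercised without the CLI installed.

```python
import subprocess

def gate_argv(db, golden, candidate):
    """Build the argument list for the gate command shown above."""
    return [
        "dprovenancekit", "gate",
        "--db", db,
        "--golden", golden,
        "--candidate", candidate,
    ]

def run_gate(db, golden, candidate, runner=subprocess.run):
    """Run the gate and return its exit code (0 pass, 1 regression per the post)."""
    result = runner(gate_argv(db, golden, candidate))
    return result.returncode

# Typical use in a CI step (requires the CLI on PATH):
#   import sys
#   sys.exit(run_gate("traces.sqlite", golden_id, candidate_id))
```

Passing the exit code straight through keeps the CI job's red/green status tied to the gate's verdict rather than to the wrapper.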

Instrumentation is basically free:

LangChain / LangGraph adapter — callbacks become a span tree automatically.
OpenAI Agents SDK — one register(store) and every run is captured.

No framework? Decorate plain functions with @traced (sync, async, generators all work). Capture is failure-proof: it never changes your code's behavior or swallows exceptions.

    from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event

    # Assumption: the in-memory store needs no constructor arguments.
    store = InMemoryTraceStore()
    question = "How do I reset my password?"  # hypothetical input

    @traced
    def search(query): ...

    @traced
    def answer(question, sources): ...

    # Everything recorded inside this block belongs to one run.
    with traced_run(store, context_id="ticket-42"):
        sources = search(question)
        record_event("plan.chosen", {"strategy": "rag"})
        reply = answer(question, sources)

Other things in the box: one query DSL over two backends (in-memory + WAL SQLite) held in lockstep by a parity suite; deterministic replay; priority-aware backpressure so load-shedding is never silent; shareable HTML regression reports; and a validation corpus that scores Precision/Recall/F1 = 1.000 across 8 standard + 5 adversarial scenarios, matching the Swift implementation case-for-case. 168 tests.

Apache-2.0. The SDK and all the CI tooling are open; there's a separate hosted visualizer (span tree, payload inspector, side-by-side semantic diff) if you want the dashboard, but you never need it — everything above runs offline from your terminal.

pip install dprovenancekit
python examples/end_to_end_demo.py # the whole arc, self-asserting

Happy to answer questions on the fingerprinting/alignment model or the cross-language conformance approach — that part was the most interesting to build.

DEV Community: Daniel Kissel
