DProvenanceKit — regression testing and observability for the reasoning of AI agents (Python, zero deps)
Every agent framework will happily tell you what happened — here are the tokens, here are the tool calls, here's a trace waterfall. What almost none of them tell you is what changed between two runs, and whether that change is a regression.
That's the gap DProvenanceKit fills. It turns each execution of an agent into a queryable, diffable trace, then gives you the surfaces to actually act on drift:
Run → Record → Query → Diff → Detect regressions → Gate in CI
The core uses only the Python standard library (sqlite3, contextvars, threading, json, hashlib), with zero third-party dependencies. It's a faithful port of a Swift library, and the two are kept behaviorally identical by a frozen cross-language conformance spec, not by hope.
The pitch in one example. A research agent answers a support question. The healthy "golden" run does plan → search → rank → verify → decide. A later PR quietly breaks it — the agent loops its search tool and skips verification. Here's every layer catching it, from one runnable script (python examples/end_to_end_demo.py):
── 1. Record two runs of the agent ───────────────────────
golden: 5 steps ['plan', 'search', 'rank', 'verify', 'decide']
candidate: 8 steps ['plan', 'search', 'search', 'search', 'search', 'search', 'search', 'decide']
── 2. Query for a suspicious pattern (searched but never verified) ──
matched 1 run(s): ['research-agent · PR-42']
── 3. Gate the candidate against the golden run ──────────
verdict: REGRESSION (severity high, strength 0.95)
removed critical steps: ['rank', 'verify']
── 4. Out-of-the-box anomaly rules ───────────────────────
! [tool_drop:verify] required step 'verify' was never recorded
! [looping:search] step 'search' repeated 6 times (> 3 allowed)
── 5. Structural diff ────────────────────────────────────
removed rank (Retriever) @seq 2
removed verify (Verifier) @seq 3
added search (Retriever) @seq 2..6
OK — every layer agreed: the candidate regressed (dropped verify, looped search).
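To make step 4 concrete, here is a minimal sketch of what two such rules could look like over a plain list of step names. It is not DProvenanceKit's implementation: the function name `find_anomalies`, its parameters, and the choice to count total occurrences (rather than consecutive repeats) are all assumptions made for illustration.

```python
from collections import Counter


def find_anomalies(steps, required=("verify",), max_repeats=3):
    """Return rule findings for one run, given its ordered step names.

    Hypothetical helper: the rule names mirror the demo output, but the
    logic here is an illustrative guess, not the library's code.
    """
    findings = []
    counts = Counter(steps)

    # tool_drop: a required step never appears in the run.
    for name in required:
        if counts[name] == 0:
            findings.append(f"tool_drop:{name}")

    # looping: a step appears more often than the allowed maximum.
    for name, n in counts.items():
        if n > max_repeats:
            findings.append(f"looping:{name}")

    return findings


golden = ["plan", "search", "rank", "verify", "decide"]
candidate = ["plan"] + ["search"] * 6 + ["decide"]

print(find_anomalies(golden))
print(find_anomalies(candidate))
```

The golden run produces no findings, while the candidate trips both rules, which is the same pair the demo reports.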
Why this works. Each run has a fingerprint: the structural identity of the agent's execution path. Two runs that diverge (a tool called in a different order, a retrieval step skipped) produce different fingerprints. That's a cheap, deterministic regression signal you can compute without an LLM in the loop. When you want more than "same/different," a semantic alignment engine grades the divergence by severity: dropping a CRITICAL step is a HIGH regression, and so is a reordering that inverts a dependency.
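One way to picture a structural fingerprint is as a hash of the ordered step names. The sketch below assumes exactly that and nothing more; the real fingerprint may cover more structure (span nesting, roles, and so on), and the `fingerprint` function here is a hypothetical stand-in, not the library's API.

```python
import hashlib
import json


def fingerprint(steps):
    """Hash an ordered list of step names into a short hex digest.

    Illustrative only: payloads are deliberately ignored, so two runs with
    the same path but different text produce the same fingerprint.
    """
    # Compact JSON gives a stable, unambiguous encoding of the sequence.
    canonical = json.dumps(list(steps), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


golden = ["plan", "search", "rank", "verify", "decide"]
candidate = ["plan"] + ["search"] * 6 + ["decide"]
reordered = ["plan", "rank", "search", "verify", "decide"]

print(fingerprint(golden) == fingerprint(list(golden)))  # same path, same hash
print(fingerprint(golden) == fingerprint(candidate))     # skipped steps differ
print(fingerprint(golden) == fingerprint(reordered))     # reordering differs
```

Because the hash depends only on the path, comparing two runs is a string comparison, with no model call involved.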
The part I actually like: it gates PRs. No server required. Point the CLI at a SQLite trace DB and two run IDs; it exits 0 on a pass and 1 on a regression:
dprovenancekit gate --db traces.sqlite --golden "$GOLDEN" --candidate "$CANDIDATE"
There's a drop-in GitHub Action and GitLab CI template that wrap this and comment the diff on the PR/MR. So "my agent silently stopped verifying its sources" becomes a red check instead of a customer ticket three weeks later.
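The exit-code contract is what makes this usable in CI. As a rough sketch of the idea (not the CLI's actual logic), a gate can compare the candidate against the golden run and return 1 when a critical step has disappeared. The `gate` function and the `CRITICAL` set are assumptions; the set simply reuses the two steps the demo reports as removed critical steps.

```python
# Assumption: which steps count as critical would normally come from
# configuration; these two are taken from the demo output above.
CRITICAL = frozenset({"rank", "verify"})


def gate(golden, candidate, critical=CRITICAL):
    """Return (exit_code, removed_critical_steps) for a candidate run.

    Hypothetical helper: 0 means pass, 1 means regression, matching the
    exit codes described for the CLI.
    """
    present = set(candidate)
    removed = [s for s in golden if s in critical and s not in present]
    return (1 if removed else 0), removed


golden = ["plan", "search", "rank", "verify", "decide"]
candidate = ["plan"] + ["search"] * 6 + ["decide"]

code, removed = gate(golden, candidate)
print(code, removed)
```

In a CI script you would pass the code to `sys.exit`, so a regression fails the job the same way a failing test would.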
Instrumentation is basically free:
LangChain / LangGraph adapter — callbacks become a span tree automatically.
OpenAI Agents SDK — one register(store) and every run is captured.
No framework? Decorate plain functions with @traced (sync, async, generators all work). Capture is failure-proof — it never changes your code's behavior or swallows exceptions.
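from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event

store = InMemoryTraceStore()  # assumed: constructed with no arguments
question = "..."  # placeholder: the user's support question

@traced
def search(query): ...

@traced
def answer(question, sources): ...

# Everything recorded inside this block belongs to one run.
with traced_run(store, context_id="ticket-42"):
    sources = search(question)
    record_event("plan.chosen", {"strategy": "rag"})
    reply = answer(question, sources)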
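The "failure-proof" property is worth spelling out. Here is a minimal sketch of a decorator with the same guarantees, assuming a simple callback recorder: return values pass through untouched, exceptions are re-raised, and a broken recorder cannot break the traced function. `traced_sketch` and its `record` callback are hypothetical names, and generators are left out for brevity, so this is not how `@traced` itself is written.

```python
import functools
import inspect


def traced_sketch(record):
    """Build a decorator that reports calls to `record` without altering them."""

    def safe_record(event):
        try:
            record(event)
        except Exception:
            pass  # a failing recorder must never affect the traced call

    def decorator(fn):
        if inspect.iscoroutinefunction(fn):
            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                try:
                    result = await fn(*args, **kwargs)
                except Exception as exc:
                    safe_record((fn.__name__, "error", type(exc).__name__))
                    raise  # the caller still sees the original exception
                safe_record((fn.__name__, "ok", None))
                return result
            return async_wrapper

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                safe_record((fn.__name__, "error", type(exc).__name__))
                raise  # the caller still sees the original exception
            safe_record((fn.__name__, "ok", None))
            return result
        return wrapper

    return decorator


events = []


@traced_sketch(events.append)
def add(a, b):
    return a + b


print(add(1, 2), events)
```

The key design choice is that recording happens in its own `try`/`except`, separate from the call itself, so tracing can fail quietly while your code's behavior stays the same.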
Other things in the box: one query DSL over two backends (in-memory + WAL SQLite) held in lockstep by a parity suite; deterministic replay; priority-aware backpressure so load-shedding is never silent; shareable HTML regression reports; and a validation corpus that scores Precision/Recall/F1 = 1.000 across 8 standard + 5 adversarial scenarios, matching the Swift implementation case-for-case. 168 tests.
Apache-2.0. The SDK and all the CI tooling are open; there's a separate hosted visualizer (span tree, payload inspector, side-by-side semantic diff) if you want the dashboard, but you never need it — everything above runs offline from your terminal.
pip install dprovenancekit
python examples/end_to_end_demo.py # the whole arc, self-asserting
Happy to answer questions on the fingerprinting/alignment model or the cross-language conformance approach — that part was the most interesting to build.