Most RAG demos answer "what's the right chunk?" Very few can answer the
two questions a regulator or an auditor will actually ask:
- Replay this decision — show me the exact, complete record of how this answer was produced.
- Reconstruct the past — what did your system know at the moment it answered, not what it knows now?
I got tired of hand-waving at both, so I shipped two pre-registered,
deterministic benchmarks alongside JAMES,
my local-first, audit-native Graph-RAG. Pre-registered means the metrics,
scenarios, and decision rules were locked before the numbers came in —
no post-hoc story-fitting.
RAB — Replayable-Audit Benchmark
RAB measures whether your audit trail is good enough to replay a
decision, with three deterministic metrics:
| Metric | What it checks | EU AI Act |
|---|---|---|
| AC — Audit Completeness | Is every decision-relevant event logged? | Art. 10 |
| RF — Replay Fidelity | Can you re-derive the answer from the log alone? | Art. 12 |
| PC — Provenance Coverage | Does every claim trace to a source? | Art. 19 |
The three metrics map verbatim to EU AI Act Articles 10, 12, and 19 —
record-keeping obligations that apply from 2026-08-02 (per Article 113).
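To make the two coverage metrics concrete, here is a minimal sketch of AC and PC as plain ratios, following the one-line definitions in the table. It is not the JAMES implementation: the event names, the `Claim` type, and the convention that an empty denominator scores 1.0 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    source_ids: tuple[str, ...]  # empty tuple = no provenance recorded


def audit_completeness(required_events: set[str], logged_events: set[str]) -> float:
    """Fraction of decision-relevant events that appear in the log."""
    if not required_events:
        return 1.0  # assumed convention: nothing required, nothing missing
    return len(required_events & logged_events) / len(required_events)


def provenance_coverage(claims: list[Claim]) -> float:
    """Fraction of claims that trace to at least one source."""
    if not claims:
        return 1.0  # assumed convention: no claims, nothing untraced
    return sum(1 for c in claims if c.source_ids) / len(claims)


# Hypothetical pipeline events; a default logger captures only two of five.
required = {"query_received", "retrieval", "rerank", "prompt_built", "generation"}
logged = {"query_received", "generation"}

claims = [
    Claim("Policy X applies", ("doc-17",)),
    Claim("The fee is 2%", ()),  # no source recorded
]

ac = audit_completeness(required, logged)  # 2 of 5 events logged
pc = provenance_coverage(claims)           # 1 of 2 claims sourced
print(f"AC={ac:.3f} PC={pc:.3f}")
```

Both functions are deterministic set arithmetic, which is what makes a metric like this reproducible across runs.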
Scenario S1 result:
| System | AC | RF | PC |
|---|---|---|---|
| JAMES | 1.000 | 1.000 | 1.000 |
| Baseline-0 (vanilla default logging) | 0.275 | 0.000 | 0.000 |
The gap is the whole point. "We have logs" (AC 0.275) is not the same as
"we can replay the decision" (RF 0). Default application logging gets you
a partial event trail and zero replay/provenance — which is exactly the
failure mode an Article 12 audit would surface.
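The difference between "logged" and "replayable" can be sketched in a few lines. The example below is a toy under stated assumptions, not how RAB or JAMES computes RF: `toy_generate` is a hypothetical stand-in for a deterministic generation step, and the record fields (`query`, `chunks`, `answer_sha256`) are invented for illustration. A record counts as replayable only if the answer can be re-derived from the log alone and its hash matches the recorded one.

```python
import hashlib
import json


def digest(obj) -> str:
    """Stable SHA-256 over a JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def toy_generate(query: str, chunks: list[str]) -> str:
    # Hypothetical stand-in for a deterministic generation step.
    return f"{query} -> {' | '.join(chunks)}"


def replay_fidelity(records: list[dict]) -> float:
    """Share of records whose answer can be re-derived from the log alone."""
    if not records:
        return 0.0
    replayed = 0
    for rec in records:
        try:
            answer = toy_generate(rec["query"], rec["chunks"])
        except KeyError:
            continue  # the log lacks an input needed to replay
        if digest(answer) == rec.get("answer_sha256"):
            replayed += 1
    return replayed / len(records)


# An audit-grade record keeps every input the answer depended on.
full = {"query": "refund window?", "chunks": ["30 days from delivery"]}
full["answer_sha256"] = digest(toy_generate(full["query"], full["chunks"]))

# A default application log keeps the query and the answer hash, not the context.
default_log = {"query": "refund window?", "answer_sha256": full["answer_sha256"]}

rf_full = replay_fidelity([full])
rf_default = replay_fidelity([default_log])
print(rf_full, rf_default)
```

The default-style record is not empty, yet it scores zero because the retrieved context was never written down, so there is nothing to replay from.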
LRB — Lifecycle Retrieval Benchmark
RAG facts go stale. A policy is superseded, a price changes, a spec is
revised. LRB asks: when you query as of a point in time, do you
retrieve the fact that was valid then, or whatever overwrote it?
Three systems compared:
- V — Vanilla: no time handling.
- N — Naive-supersede: newest fact wins.
- J — JAMES: validity-window retrieval (reconstruct_graph_at(t)).
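The three strategies above can be sketched on a toy fact store. This is an illustration only: the `Fact` type, the half-open validity window, and the function names are assumptions, and `valid_at` is not the real `reconstruct_graph_at(t)`, whose signature and internals are not shown here. Vanilla is modelled as "first stored match", which stands in for a retriever with no notion of time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    name: str
    value: str
    valid_from: int          # inclusive (toy integer timestamps)
    valid_to: Optional[int]  # exclusive; None = still valid


# Three versions of one policy, deliberately stored out of order.
FACTS = [
    Fact("refund_window", "21 days", 100, 200),
    Fact("refund_window", "14 days", 0, 100),
    Fact("refund_window", "30 days", 200, None),
]


def vanilla(facts, name, t):
    """V: ignores t and returns the first stored match."""
    return next((f.value for f in facts if f.name == name), None)


def naive_supersede(facts, name, t):
    """N: ignores t and returns the most recently started version."""
    matches = [f for f in facts if f.name == name]
    return max(matches, key=lambda f: f.valid_from).value if matches else None


def valid_at(facts, name, t):
    """J-style: returns the version whose validity window contains t."""
    for f in facts:
        if f.name == name and f.valid_from <= t and (f.valid_to is None or t < f.valid_to):
            return f.value
    return None


results = {
    t: (
        vanilla(FACTS, "refund_window", t),
        naive_supersede(FACTS, "refund_window", t),
        valid_at(FACTS, "refund_window", t),
    )
    for t in (50, 250)
}
for t, row in results.items():
    print(t, row)
```

Only the windowed lookup changes its answer with `t`. Newest-wins is right for present-day queries and wrong for any query about the past, which is the case LRB is built to expose.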
The R@1 ordering V < N < J holds across 4 model families × 4 scale
points (a 12.5× scale span) — time-aware retrieval beats both naive
overwrite and no time-handling at every scale, not just one lucky cell.
At publication scale (S3):
| System | R@1 |
|---|---|
| V | 0.502 |
| N | 0.721 |
| J | 0.845 |
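For readers unfamiliar with the metric, R@1 can be read as "how often is the top-ranked result the right one". Here is a minimal sketch, assuming exactly one gold fact per query (in which case R@1 reduces to top-1 accuracy). The preprint's exact definition may differ, and the IDs below are made up.

```python
def recall_at_1(ranked_results: list[list[str]], gold: list[str]) -> float:
    """Fraction of queries whose top-ranked result is the gold item."""
    if len(ranked_results) != len(gold):
        raise ValueError("need one ranked list per gold label")
    if not gold:
        return 0.0
    hits = sum(
        1 for ranked, g in zip(ranked_results, gold) if ranked and ranked[0] == g
    )
    return hits / len(gold)


# Hypothetical fact IDs: four time-scoped queries, three answered correctly at rank 1.
ranked = [["f1", "f2"], ["f7", "f3"], ["f4"], ["f9", "f8"]]
gold = ["f1", "f3", "f4", "f9"]

score = recall_at_1(ranked, gold)
print(f"R@1 = {score:.3f}")
```

Note that the second query does retrieve the gold fact, but at rank 2, so it does not count.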
How to run it yourself
Everything runs locally: Ollama (gemma4:e4b by default), BAAI/bge-m3
embeddings, and ChromaDB. No cloud LLM account is needed.
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)
Honest framing
These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a
scenario I designed is a starting line, not proof of general
superiority — the value is that the scenarios, metrics, and baselines are
public and deterministic, so you can run them, disagree, and beat the
numbers.
- 📄 Preprints (RAB 10pg, LRB 11pg) + Zenodo DOI: 10.5281/zenodo.20652679
- 💻 Code (MIT, OpenSSF Best Practices passing): https://github.com/Hashevolution/James-RAG-Evol
Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping
hold up under your reading of the text? (b) is "newest wins" the right
Naive-supersede baseline for LRB, or is there a stronger one I should add?