Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

#rag #llm #aiact #audit

Most RAG demos answer "what's the right chunk?" Very few can answer the
two questions a regulator or an auditor will actually ask:

1. Replay this decision: show me the exact, complete record of how this answer was produced.
2. Reconstruct the past: what did your system know at the moment it answered, not what it knows now?

I got tired of hand-waving at both, so I shipped two pre-registered,
deterministic benchmarks alongside JAMES,
my local-first, audit-native Graph-RAG. Pre-registered means the metrics,
scenarios, and decision rules were locked before the numbers came in —
no post-hoc story-fitting.

RAB — Replayable-Audit Benchmark

RAB measures whether your audit trail is good enough to replay a
decision, with three deterministic metrics:

Metric                      What it checks                                      EU AI Act
AC (Audit Completeness)     Is every decision-relevant event logged?            Art. 10
RF (Replay Fidelity)        Can you re-derive the answer from the log alone?    Art. 12
PC (Provenance Coverage)    Does every claim trace to a source?                 Art. 19

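Each metric is mapped one-to-one to an EU AI Act article: Article 10 (data
and data governance), Article 12 (record-keeping), and Article 19
(automatically generated logs). These are obligations for high-risk AI
systems that apply from 2026-08-02 (per Article 113).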
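
To make the three ratios concrete, here is a minimal sketch of how AC, RF, and PC could be computed from a list of decision records. It is my reading of the one-line descriptions in the table, not the definitions from the preprint: the `DecisionRecord` type, its field names, and the exact-match replay check are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical audit record for one answered query (not the JAMES schema)."""
    required_events: set[str]      # events the scenario says must be logged
    logged_events: set[str]        # events actually present in the audit log
    original_answer: str           # answer returned at decision time
    replayed_answer: str | None    # answer re-derived from the log; None if replay failed
    claims: dict[str, str | None] = field(default_factory=dict)  # claim -> source id or None


def audit_completeness(records: list[DecisionRecord]) -> float:
    """AC: share of required events that were actually logged."""
    required = sum(len(r.required_events) for r in records)
    logged = sum(len(r.required_events & r.logged_events) for r in records)
    return logged / required if required else 0.0


def replay_fidelity(records: list[DecisionRecord]) -> float:
    """RF: share of decisions whose replayed answer equals the original (exact match assumed)."""
    matches = sum(r.replayed_answer == r.original_answer for r in records)
    return matches / len(records) if records else 0.0


def provenance_coverage(records: list[DecisionRecord]) -> float:
    """PC: share of claims that carry a source reference."""
    sources = [src for r in records for src in r.claims.values()]
    traced = sum(src is not None for src in sources)
    return traced / len(sources) if sources else 0.0


all_events = {"query", "retrieval", "prompt", "answer"}
full = DecisionRecord(
    required_events=all_events,
    logged_events=all_events,
    original_answer="42 days",
    replayed_answer="42 days",
    claims={"notice period is 42 days": "policy.pdf#p3"},
)
partial = DecisionRecord(
    required_events=all_events,
    logged_events={"query"},       # default logging captured only the request
    original_answer="42 days",
    replayed_answer=None,          # nothing to replay from
    claims={"notice period is 42 days": None},
)

print(audit_completeness([partial]), replay_fidelity([partial]), provenance_coverage([partial]))
# 0.25 0.0 0.0
```

The `partial` record shows the same pattern as a default-logging baseline: some events are present, so AC is above zero, but replay and provenance both score zero because the log lacks the inputs needed to re-derive or trace anything.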

Scenario S1 result:

                 AC      RF      PC
JAMES          1.000   1.000   1.000
Baseline-0     0.275   0.000   0.000   (vanilla default-logging)

The gap is the whole point. "We have logs" (AC 0.275) is not the same as
"we can replay the decision" (RF 0). Default application logging gets you
a partial event trail and zero replay/provenance — which is exactly the
failure mode an Article 12 audit would surface.

LRB — Lifecycle Retrieval Benchmark

RAG facts go stale. A policy is superseded, a price changes, a spec is
revised. LRB asks: when you query as of a point in time, do you
retrieve the fact that was valid then, or whatever overwrote it?

Three systems compared:

V — Vanilla: no time handling.
N — Naive-supersede: newest fact wins.
J — JAMES: validity-window retrieval (reconstruct_graph_at(t)).
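
The difference between the three strategies is easiest to see on a toy fact store. The sketch below is an assumption-laden illustration, not the JAMES implementation: the `Fact` type and the three functions are hypothetical names, "first stored match" stands in for a time-blind similarity search, and the half-open window `valid_from <= t < valid_to` is one plausible convention that `reconstruct_graph_at(t)` may or may not share.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None   # None means the window is still open


def vanilla(facts: List[Fact], key: str) -> Optional[Fact]:
    """V: no time handling; returns the first stored match (stand-in for similarity search)."""
    matches = [f for f in facts if f.key == key]
    return matches[0] if matches else None


def naive_supersede(facts: List[Fact], key: str) -> Optional[Fact]:
    """N: the newest fact wins, regardless of the as-of time."""
    matches = [f for f in facts if f.key == key]
    return max(matches, key=lambda f: f.valid_from, default=None)


def valid_at(facts: List[Fact], key: str, t: date) -> Optional[Fact]:
    """J-style: keep only facts whose validity window contains t."""
    matches = [
        f for f in facts
        if f.key == key
        and f.valid_from <= t
        and (f.valid_to is None or t < f.valid_to)
    ]
    return max(matches, key=lambda f: f.valid_from, default=None)


facts = [
    Fact("refund_window", "30 days", date(2022, 1, 1), date(2023, 1, 1)),
    Fact("refund_window", "21 days", date(2023, 1, 1), date(2024, 1, 1)),
    Fact("refund_window", "14 days", date(2024, 1, 1)),
]
as_of = date(2023, 6, 1)

print(vanilla(facts, "refund_window").value)          # 30 days (stale)
print(naive_supersede(facts, "refund_window").value)  # 14 days (from the future)
print(valid_at(facts, "refund_window", as_of).value)  # 21 days (valid on 2023-06-01)
```

Only the windowed lookup returns the fact that was in force on the query date. Naive-supersede is correct for "now" queries but wrong for any as-of query that predates the latest revision.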

The R@1 ordering V < N < J holds across 4 model families × 4 scale
points (a 12.5× scale span) — time-aware retrieval beats both naive
overwrite and no time-handling at every scale, not just one lucky cell.

At publication scale (S3):

        R@1
V       0.502
N       0.721
J       0.845
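
For readers unfamiliar with the metric, here is a short sketch of R@1, assuming the standard definition (the fraction of queries whose top-ranked result is the gold fact). The function name, the query ids, and the toy data are hypothetical and are not taken from the LRB runners.

```python
from typing import Dict, List


def recall_at_1(predictions: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of queries whose top-ranked result equals the gold answer."""
    if not gold:
        raise ValueError("need at least one gold query")
    hits = 0
    for query_id, expected in gold.items():
        ranked = predictions.get(query_id) or [None]  # missing or empty ranking counts as a miss
        if ranked[0] == expected:
            hits += 1
    return hits / len(gold)


gold = {"q1": "21 days", "q2": "EUR 40", "q3": "v2.1", "q4": "policy-B"}
predictions = {
    "q1": ["21 days", "14 days"],
    "q2": ["EUR 40"],
    "q3": ["v3.0", "v2.1"],      # right fact retrieved, but not at rank 1
    "q4": ["policy-B"],
}

print(recall_at_1(predictions, gold))  # 0.75
```

Note that q3 counts as a miss even though the correct fact appears at rank 2; R@1 only rewards the top hit, which is what matters when a single retrieved fact feeds the answer.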

How to run it yourself

Everything is local — Ollama (gemma4:e4b default) + BAAI/bge-m3
embeddings + ChromaDB. No cloud LLM account.

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)

Honest framing

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a
scenario I designed is a starting line, not proof of general
superiority — the value is that the scenarios, metrics, and baselines are
public and deterministic, so you can run them, disagree, and beat the
numbers.

📄 Preprints (RAB 10pg, LRB 11pg) + Zenodo DOI: 10.5281/zenodo.20652679
💻 Code (MIT, OpenSSF Best Practices passing): https://github.com/Hashevolution/James-RAG-Evol

Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping
hold up under your reading of the text? (b) is "newest wins" the right
Naive-supersede baseline for LRB, or is there a stronger one I should add?