DEV Community

Hashevolution

Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

Most RAG demos answer "what's the right chunk?" Very few can answer the
two questions a regulator or an auditor will actually ask:

  1. Replay this decision — show me the exact, complete record of how this answer was produced.
  2. Reconstruct the past — what did your system know at the moment it answered, not what it knows now?

I got tired of hand-waving at both, so I shipped two pre-registered,
deterministic benchmarks alongside JAMES, my local-first, audit-native
Graph-RAG. Pre-registered means the metrics, scenarios, and decision
rules were locked before the numbers came in, so there is no post-hoc
story-fitting.
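
One way to make "locked before the numbers came in" checkable is to publish a digest of the benchmark spec before any run and verify it at run time. The sketch below is only an illustration of that idea, not how JAMES does it: the `prereg` fields and both function names are hypothetical.

```python
import hashlib
import json


def spec_digest(spec: dict) -> str:
    """Return a SHA-256 digest of a canonical JSON encoding of the spec."""
    # sort_keys and fixed separators make the encoding independent of dict order
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_unchanged(spec: dict, locked_digest: str) -> bool:
    """True if the spec used at run time matches the digest locked earlier."""
    return spec_digest(spec) == locked_digest


# Hypothetical pre-registration record; the field names are illustrative only.
prereg = {
    "metrics": ["AC", "RF", "PC"],
    "scenarios": ["S1"],
    "decision_rule": "report every metric for every registered system",
}

locked = spec_digest(prereg)  # would be published before the first run

# Adding a scenario after the fact changes the digest and is detected.
tampered = dict(prereg, scenarios=["S1", "S2"])
print(check_unchanged(prereg, locked), check_unchanged(tampered, locked))
```

Committing the digest somewhere timestamped (a tagged commit, for instance) is what would turn this from a checksum into evidence of ordering.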

RAB — Replayable-Audit Benchmark

RAB measures whether your audit trail is good enough to replay a
decision, with three deterministic metrics:

Metric                        What it checks                                     EU AI Act
AC: Audit Completeness        Is every decision-relevant event logged?           Art. 10
RF: Replay Fidelity           Can you re-derive the answer from the log alone?   Art. 12
PC: Provenance Coverage       Does every claim trace to a source?                Art. 19

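Each metric is mapped one-to-one to an EU AI Act article: Article 10
(data and data governance), Article 12 (record-keeping), and Article 19
(automatically generated logs). These are obligations for high-risk AI
systems that apply from 2026-08-02 (per Article 113).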
Scenario S1 result:

                 AC      RF      PC
JAMES          1.000   1.000   1.000
Baseline-0     0.275   0.000   0.000   (vanilla default-logging)

The gap is the whole point. "We have logs" (AC 0.275) is not the same as
"we can replay the decision" (RF 0). Default application logging gets you
a partial event trail and zero replay/provenance — which is exactly the
failure mode an Article 12 audit would surface.

LRB — Lifecycle Retrieval Benchmark

RAG facts go stale. A policy is superseded, a price changes, a spec is
revised. LRB asks: when you query as of a point in time, do you
retrieve the fact that was valid then, or whatever overwrote it?

Three systems compared:

  • V — Vanilla: no time handling.
  • N — Naive-supersede: newest fact wins.
  • J — JAMES: validity-window retrieval (reconstruct_graph_at(t)).
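
The difference between the three strategies can be sketched on a toy fact store. This is a simplified illustration under stated assumptions, not the JAMES implementation of reconstruct_graph_at(t): the `Fact` fields, the half-open validity window, and the three retriever functions are hypothetical, and "vanilla" is modeled as returning the first stored match.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: int           # inclusive
    valid_to: Optional[int]   # exclusive; None means still valid


def vanilla(facts: List[Fact], key: str, t: int) -> Optional[str]:
    """V: no time handling; return the first stored match and ignore t."""
    return next((f.value for f in facts if f.key == key), None)


def naive_supersede(facts: List[Fact], key: str, t: int) -> Optional[str]:
    """N: the newest fact wins, regardless of the as-of time t."""
    matches = [f for f in facts if f.key == key]
    return max(matches, key=lambda f: f.valid_from).value if matches else None


def validity_window(facts: List[Fact], key: str, t: int) -> Optional[str]:
    """J-style: return the fact whose validity window contains t."""
    for f in facts:
        if f.key == key and f.valid_from <= t and (f.valid_to is None or t < f.valid_to):
            return f.value
    return None


Retriever = Callable[[List[Fact], str, int], Optional[str]]


def r_at_1(retrieve: Retriever, facts: List[Fact],
           queries: List[Tuple[str, int, str]]) -> float:
    """Share of (key, as_of, expected) queries answered correctly at rank 1."""
    hits = sum(1 for key, t, expected in queries if retrieve(facts, key, t) == expected)
    return hits / len(queries)


# Toy data: a policy value that was superseded at t=100.
FACTS = [
    Fact("refund_window", "14 days", 0, 100),
    Fact("refund_window", "30 days", 100, None),
]
QUERIES = [
    ("refund_window", 50, "14 days"),
    ("refund_window", 150, "30 days"),
    ("refund_window", 200, "30 days"),
]

for name, fn in [("V", vanilla), ("N", naive_supersede), ("J", validity_window)]:
    print(name, round(r_at_1(fn, FACTS, QUERIES), 3))
```

On this toy set only the validity-window retriever answers the as-of-50 query and the later queries correctly; the scores depend entirely on the hand-picked queries and say nothing about the benchmark results.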

The R@1 ordering V < N < J holds across 4 model families × 4 scale
points (a 12.5× scale span): time-aware retrieval beats both naive
overwrite and no time handling at every scale, not just one lucky cell.

At publication scale (S3):

        R@1
V       0.502
N       0.721
J       0.845

How to run it yourself

Everything is local — Ollama (gemma4:e4b default) + BAAI/bge-m3
embeddings + ChromaDB. No cloud LLM account.

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol   # the remaining commands run from the repo root
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)

Honest framing

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a
scenario I designed is a starting line, not proof of general
superiority — the value is that the scenarios, metrics, and baselines are
public and deterministic, so you can run them, disagree, and beat the
numbers.

Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping
hold up under your reading of the text? (b) is "newest wins" the right
Naive-supersede baseline for LRB, or is there a stronger one I should add?
