🔬 Direction 1 closure on JAMES: when the hypothesis fails but the data turns up a "7-tier monotonic natural-stop gradient"

#rag #gemma #gemmachallenge #llm

Gemma 4 Challenge: Write about Gemma 4 Submission

Two weeks ago I shipped core/reasoning/budget.py to test whether per-call dynamic token budgets could cut JAMES's reasoning cost by 60-80% on gemma4:e4b. I built it as an experiment: an A/B sweep, raw JSON results, and an env flag that defaults to OFF.

The hypothesis flipped.

🎯 Finding 1 — cap was a ceiling, not the floor.

gemma4:e4b naturally stops well below 4096 tokens on every workload tier. Lowering the cap from 4096 to 200/800 changed eval_count by +0%/+8%/-2%, with done_reason=stop on every cell and zero quality regression. PR #399's lifted cap was permission to finish, not license to waste.

Real wins: Latency -17.5%/-7.3% on sub/light tiers (KV-cache sizing); ~20x memory cut on sub; safety bound (cap=200). Ships behind JAMES_ADAPTIVE_BUDGET=1 (default OFF).
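A minimal sketch of what an env-gated per-tier cap could look like, assuming the cap is passed to Ollama as the `num_predict` option. The names `budget_for`, `ollama_options`, and `TIER_CAPS` are hypothetical and are not taken from core/reasoning/budget.py; the cap values are the ones quoted in this post (200 for sub, 1200 for light after the bump described in the process finding, 4096 as the lifted default).

```python
import os

DEFAULT_CAP = 4096  # lifted cap quoted in the post

# Hypothetical tier table; the numbers are the caps mentioned in the post.
TIER_CAPS = {"sub": 200, "light": 1200}


def budget_for(tier, env=None):
    """Return the token cap for a workload tier.

    Falls back to DEFAULT_CAP unless JAMES_ADAPTIVE_BUDGET=1 is set,
    mirroring the default-OFF behaviour described above.
    """
    env = os.environ if env is None else env
    if env.get("JAMES_ADAPTIVE_BUDGET") != "1":
        return DEFAULT_CAP
    return TIER_CAPS.get(tier, DEFAULT_CAP)


def ollama_options(tier, env=None):
    """Build the options dict for an Ollama request (assumed call path)."""
    return {"num_predict": budget_for(tier, env)}
```

With the flag unset, every tier keeps the 4096 cap, so turning the feature off restores the previous behaviour without a code change.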

🎯 Finding 2 — 7-tier monotonic natural-stop gradient.

Combined free-form + 4 cognitive middleware stages on the same fixture:

• substitution verbatim: 62 tokens
• light synth e-commerce: 235 tokens
• query_rewriter: ~370 tokens
• planner: ~690 tokens
• reflect: ~910 tokens
• verify: ~970 tokens
• heavy synth 4-step: 1681 tokens

27x dynamic range, cross-sweep noise <5% per tier. The quantitative form of Robin's "workload gradient is multi-tier monotonic on a single model." Natural-stop length IS the workload measurement.
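The monotonicity and range claims can be checked directly from the seven numbers above. This is only a sketch over the rounded per-tier values quoted in this post, not the raw sweep JSON; the variable and function names are illustrative.

```python
# Per-tier natural-stop lengths as quoted in the post (approximate values rounded).
NATURAL_STOP = [
    ("substitution verbatim", 62),
    ("light synth e-commerce", 235),
    ("query_rewriter", 370),
    ("planner", 690),
    ("reflect", 910),
    ("verify", 970),
    ("heavy synth 4-step", 1681),
]


def is_strictly_increasing(values):
    """True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


counts = [tokens for _, tokens in NATURAL_STOP]
monotonic = is_strictly_increasing(counts)
dynamic_range = counts[-1] / counts[0]  # heaviest tier over lightest tier

print(f"tiers={len(counts)} monotonic={monotonic} range={dynamic_range:.1f}x")
```

The ratio 1681 / 62 is about 27.1, which is where the 27x figure comes from.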

🎯 Finding 3 — verify is a high-clustering cognitive stage. Mechanism 2 needs a second axis.

At T=0.2, verify produces only 2-3 unique responses across 20 baseline calls (~12.5%), and this is stable across two sweeps. Other cognitive stages at the same workload tier produce 20/20 unique responses. verify emits structured JSON, so its answer space is a small finite set.

Mechanism 2 (answer convergence) now has two axes: workload weight (sub 1/20 → heavy 20/20) AND task type (structured-JSON clusters independent of workload). Ali's "ceiling vs path" framing extends here cleanly.
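One way to express the clustering measurement is the fraction of distinct responses per stage. The sketch below assumes exact string comparison after whitespace trimming, which may differ from how the sweep actually compared responses; the sample responses are made-up placeholders, not sweep data.

```python
def unique_ratio(responses):
    """Fraction of distinct responses among all calls (1.0 means no clustering)."""
    if not responses:
        raise ValueError("need at least one response")
    distinct = {r.strip() for r in responses}
    return len(distinct) / len(responses)


# Placeholder data: a structured-JSON stage collapsing onto two answers...
clustered = ['{"verdict": "pass"}'] * 12 + ['{"verdict": "fail"}'] * 8
# ...versus a free-form stage where every call differs.
diverse = [f"free-form answer {i}" for i in range(20)]

print(unique_ratio(clustered))  # 2 distinct out of 20 calls
print(unique_ratio(diverse))    # 20 distinct out of 20 calls
```

A stage that scores low here at the same workload tier where others score 1.0 is clustering by task type rather than by workload weight.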

🎯 Process finding — falsification → revision → confirmation.

The first cognitive sweep at CAP_LIGHT=800 truncated reflect (926) and verify (984) in 19 of 20 calls each, and quality dropped 40-75%. The data drove the bump (800→1200); the re-sweep passed with 0/20 truncation and 20/20 quality.
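A truncation check of this kind can be sketched as follows, assuming each sweep record keeps Ollama's `done_reason` field and that `"length"` marks a response cut off by the cap while `"stop"` marks a natural stop. The record layout and the sample records are hypothetical stand-ins for the raw JSON.

```python
def truncation_rate(records):
    """Share of calls that hit the token cap instead of stopping naturally."""
    if not records:
        raise ValueError("need at least one record")
    truncated = sum(1 for r in records if r.get("done_reason") == "length")
    return truncated / len(records)


# Hypothetical records shaped like the first sweep: 19 of 20 calls truncated.
first_sweep = [{"done_reason": "length"}] * 19 + [{"done_reason": "stop"}]
# Hypothetical records shaped like the re-sweep: every call stops naturally.
re_sweep = [{"done_reason": "stop"}] * 20

print(truncation_rate(first_sweep))
print(truncation_rate(re_sweep))
```

Any non-zero rate on a tier is a signal that the cap sits below that tier's natural-stop length and needs raising.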

🤝 Three-author joint-piece status:

Headline locked (Ali + Robin + JAMES): "Substitution is free. Synthesis costs in proportion to what it has to invent."

New sub-clauses:
• "…and inversely to parameter count." (Robin, 2 evidence layers)
• "…and the gradient is multi-tier monotonic — 7 tiers, 27x range." (JAMES)
• "…and answer convergence has a task-type axis." (JAMES, cross-sweep)

Three stacks: Robin (26b MoE), JAMES (e4b cognitive), Ali (mid-June Gemini).

📌 Citable archive (Zenodo DOI): https://doi.org/10.5281/zenodo.20363998

🔗 PRs #461 / #463: https://github.com/Hashevolution/James-RAG-Evol/pull/461

@robin Converse @ali Afana — three axes locked.