This is a submission for the Gemma 4 Challenge: Build with Gemma 4.
Thanks for checking out my Gemma 4 Challenge submission.
This project focuses on a behavior I care about a lot in local LLM systems: knowing when to answer from evidence and when to abstain safely.
Repo:
https://github.com/empowereddata/causal-rl-harness
Demo:
https://www.youtube.com/watch?v=1a3n0Y_km1o
SCMRLH 003 is my Gemma 4 Challenge project: a grounded answer-or-abstain harness for local LLMs.
Instead of asking a model to answer every prompt, the harness uses a controlled workflow:
- retrieve relevant evidence
- send a compact evidence window to the model
- require the shortest exact answer span or ABSTAIN
- apply guardrails to reject unsupported answers
- report answerable accuracy, unanswerable accuracy, abstain rate, and runtime
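To make the first two steps concrete, here is a minimal sketch of retrieving evidence and trimming it to a compact window. It is an illustration only: the token-overlap ranking and the names `tokenize` and `build_evidence_window` are assumptions for this example, not the harness's actual retrieval code.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_evidence_window(question, context, max_sentences=2):
    """Return the context sentences that share the most tokens with the question.

    Hypothetical helper: the real harness may rank evidence differently.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    q_tokens = set(tokenize(question))
    scored = []
    for index, sentence in enumerate(sentences):
        overlap = len(q_tokens & set(tokenize(sentence)))
        scored.append((overlap, index, sentence))
    # Highest overlap first; ties keep document order.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    # Drop sentences with no overlap so an unrelated question yields an empty window.
    top = [item for item in ranked[:max_sentences] if item[0] > 0]
    # Restore document order so the window reads naturally.
    top.sort(key=lambda item: item[1])
    return " ".join(sentence for _, _, sentence in top)
```

An empty window is a useful signal in its own right: if nothing in the context overlaps with the question, the pipeline can abstain without calling the model at all.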
For this project I used Gemma 4 26B through Ollama as the primary local reasoning model inside the harness.
Representative benchmark snapshots highlighted in this demo:
- Main benchmark: 200 examples, 0.850 overall accuracy, 0.700 answerable accuracy, 1.000 unanswerable accuracy, 0.570 abstain rate
- Deep benchmark: 1000 examples, 0.827 overall accuracy, 0.654 answerable accuracy, 1.000 unanswerable accuracy, 0.576 abstain rate
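For readers who want to see how these four numbers relate, below is a rough sketch of a scorer. The record layout and the name `score_run` are assumptions made for illustration. As a side note, the reported figures are consistent with an even split between answerable and unanswerable examples, since (0.700 + 1.000) / 2 = 0.850 and (0.654 + 1.000) / 2 = 0.827, but the post does not state the split, so treat that as an inference.

```python
ABSTAIN = "ABSTAIN"


def score_run(records):
    """Score (prediction, gold) pairs, where gold is None for unanswerable items.

    Hypothetical record layout; the real harness may store results differently.
    """
    answerable = [(p, g) for p, g in records if g is not None]
    unanswerable = [(p, g) for p, g in records if g is None]

    # An answerable item is correct only on an exact match with the gold span.
    answerable_hits = sum(1 for p, g in answerable if p == g)
    # An unanswerable item is correct only if the system abstained.
    unanswerable_hits = sum(1 for p, _ in unanswerable if p == ABSTAIN)
    abstains = sum(1 for p, _ in records if p == ABSTAIN)

    def ratio(hits, total):
        return hits / total if total else 0.0

    return {
        "overall_accuracy": ratio(answerable_hits + unanswerable_hits, len(records)),
        "answerable_accuracy": ratio(answerable_hits, len(answerable)),
        "unanswerable_accuracy": ratio(unanswerable_hits, len(unanswerable)),
        "abstain_rate": ratio(abstains, len(records)),
    }
```

Note that the abstain rate counts every abstention, including abstentions on answerable questions, which is why it can exceed the share of unanswerable examples.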
Why this matters:
This project is designed to measure grounded behavior under uncertainty, especially when a model should answer and when it should deliberately abstain.
What I Built
I built SCMRLH 003 (Causal RL Harness), a grounded question-answering evaluation harness for local LLMs that focuses on a simple but important behavior: knowing when to answer and when to abstain.
Instead of treating question answering as "always generate a response," SCMRLH 003 turns the task into a controlled workflow:
- retrieve the most relevant evidence from the provided context
- send only a compact evidence window to the model
- require the model to return the shortest exact answer span or ABSTAIN
- run guardrails that reject unsupported answers
- score answerable accuracy, unanswerable accuracy, abstain rate, and runtime
The main idea is simple: if a local model cannot support an answer from retrieved evidence, it should decline cleanly instead of improvising.
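One way to picture the "decline cleanly" rule is a guardrail that keeps an answer only if it appears verbatim in the retrieved evidence. The sketch below is an assumption about how such a check could look; `apply_guardrail`, `normalize`, and the `max_words` limit are hypothetical and not taken from the repository.

```python
ABSTAIN = "ABSTAIN"


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def apply_guardrail(raw_answer, evidence, max_words=12):
    """Return the answer only if it is a short span found in the evidence, else ABSTAIN.

    Hypothetical check: a plain substring match is a simplification, and a real
    guardrail might also enforce token boundaries or answer-type constraints.
    """
    answer = raw_answer.strip()
    # Empty output or an explicit abstention both map to ABSTAIN.
    if not answer or normalize(answer) == normalize(ABSTAIN):
        return ABSTAIN
    # Long outputs are treated as explanations rather than answer spans.
    if len(answer.split()) > max_words:
        return ABSTAIN
    # Reject anything that cannot be found in the evidence window.
    if normalize(answer) not in normalize(evidence):
        return ABSTAIN
    return answer
```

The design choice here is to fail closed: any output the check cannot tie back to the evidence becomes an abstention, trading some answerable accuracy for fewer unsupported answers.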
That makes the project useful for:
- hallucination-resistant local AI
- document-grounded assistants
- multi-model evaluation
- AI coworker workflows where unsupported answers are costly
The public release bundle is intentionally lightweight and reproducible. It includes dependency-light Python code, sample fixtures, configs, manifests, and a paper bundle so someone can inspect the harness without needing the full private research tree.
I used Gemma 4 26B through Ollama as the primary local reasoning model inside SCMRLH 003.
This was a good fit because the project is not a general chatbot demo. It is a grounded decision pipeline that repeatedly asks the model to read a compact evidence window and then either return the shortest exact supported answer span or abstain when the answer is not explicitly supported.
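As a rough illustration of what each model call could look like, the sketch below builds an answer-or-abstain prompt and a request body for Ollama's `/api/generate` endpoint. The prompt wording, the function names, and the model tag passed in are all assumptions for this example; the post does not give the exact prompt or Ollama tag the harness uses.

```python
PROMPT_TEMPLATE = (
    "Answer using only the evidence below.\n"
    "Reply with the shortest exact span copied from the evidence, "
    "or reply ABSTAIN if the evidence does not explicitly support an answer.\n\n"
    "Evidence: {evidence}\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, evidence):
    """Fill the hypothetical template with one question and its evidence window."""
    return PROMPT_TEMPLATE.format(evidence=evidence.strip(), question=question.strip())


def build_ollama_payload(prompt, model):
    """Build a JSON-serializable body for a non-streaming Ollama generate request.

    `model` is whatever tag the local Ollama install uses; it is a placeholder here.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete response instead of chunks
        "options": {"temperature": 0},  # reduce sampling variance between runs
    }
```

The payload would be sent as JSON to the local Ollama server, and the returned text would then pass through the guardrail step before scoring.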
Gemma 4 worked well in that setup because it balanced local deployment practicality, evidence-bound reasoning, and stable abstention behavior. In the benchmark snapshots highlighted here, it reached 0.850 overall accuracy on the 200-example run and 0.827 on the 1000-example run while scoring 1.000 unanswerable accuracy on both, which is especially important for a harness designed to prefer safe abstention over unsupported answers.
Demo
The public bundle can be run locally with:
PYTHONPATH=code python3 -m scmrlh --config config/scmrlh_003_v000_tub_smoke_baseline_stub.json
Repository:
https://github.com/empowereddata/causal-rl-harness