This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
PROJECT JAMES — a security-focused, locally-runnable Graph-RAG knowledge engine in Python, MIT-licensed. Think "Mini Palantir Foundry, but MIT, runs on a laptop, no cloud":
- Graph-RAG with 12-type ontology — relations carry semantic meaning, not just vector similarity
- 3-stage access control — RBAC + ABAC + instruction isolation (vector → graph → output)
- Self-evolution scaffold — feedback → patch → 4-Gate validation → auto-rollback on bench regression, with `approver_username` audit
- 100% local via Ollama — no cloud LLM dependency
- Explicit reasoning paths surfaced in every response
The problem it solves: most local RAG projects pick one of "ontology-aware retrieval", "role-based security", or "self-evolution audit logs". JAMES combines all three because for a 1-person knowledge engine, security and reasoning have to be the same pipeline, not two pipelines glued together — the graph traversal is the security boundary. A confidential entity is never visited for an employee role, so the model never sees it. No jailbreak prompt can leak content it never had in the context.
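To make the "traversal is the boundary" claim concrete, here is a minimal sketch of the pattern — illustrative names and a toy in-memory graph, not the actual JAMES internals. The role check happens inside the walk, so a pruned entity's text (and everything behind it) never reaches the prompt:

```python
from collections import deque

# Hypothetical toy graph: nodes carry an allowed_roles set, edges carry
# a semantic relation label, mirroring the 12-type ontology idea.
NODES = {
    "A": {"text": "public fact",  "allowed_roles": {"employee", "admin"}},
    "X": {"text": "confidential", "allowed_roles": {"admin"}},
    "B": {"text": "public fact",  "allowed_roles": {"employee", "admin"}},
}
EDGES = {"A": [("CAUSES", "X")], "X": [("REQUIRES", "B")], "B": []}

def traverse(start, role, max_depth=3):
    """Yield (node_id, path) pairs visible to `role`; the ACL prunes the walk."""
    seen, queue = {start}, deque([(start, [], 0)])
    while queue:
        node, path, depth = queue.popleft()
        if role not in NODES[node]["allowed_roles"]:
            continue  # confidential entity: its subtree is never expanded,
                      # so its text can never enter the LLM context
        yield node, path
        if depth < max_depth:
            for label, target in EDGES[node]:
                if target not in seen:
                    seen.add(target)
                    queue.append((target, path + [f"--[{label}]--> {target}"], depth + 1))

# For role="employee", X is pruned and B is never reached through it.
print(list(traverse("A", "employee")))
```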
Palantir® is a registered trademark of Palantir Technologies Inc. PROJECT JAMES is not affiliated. "Mini Palantir" is a descriptive comparison of the ontology-and-audit-log design pattern.
Demo
Self-hosted, alpha v0.2.0. Quick-start (≈ 5 minutes on a laptop):
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env # set JAMES_API_KEY, JAMES_JWT_SECRET (32+ char)
pip install -r requirements.txt
ollama pull gemma4:e4b # ~2.5 GB
python server_llmwiki.py
→ http://localhost:8000 — chat UI + admin dashboard + 3D ontology graph visualizer.
Key endpoints worth poking at:
- `POST /query/` — natural-language query, returns answer + traversed `graph_paths` strings
- `GET /admin/graph` — 3D force-directed ontology visualizer (Three.js)
- `GET /admin/patch/audit` — operator-facing audit log over the patch lifecycle
- `GET /admin/trace/{trace_id}` — full pipeline replay for any query (auth → retrieve → graph → tool → answer → complete stages, with per-stage latency)
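A first query against the running server might look like this — the payload and response field names here are illustrative, so check the repo's API docs for the exact schema and auth scheme:

```python
# Hypothetical client call; adapt the token to your .env (JAMES_API_KEY / JWT).
import requests

resp = requests.post(
    "http://localhost:8000/query/",
    headers={"Authorization": "Bearer <your-token>"},
    json={"query": "What blocks the rollout of feature X?"},
    timeout=60,
)
data = resp.json()
print(data["answer"])
for p in data.get("graph_paths", []):
    print("path:", p)  # e.g. "A --[CAUSES]--> X --[BLOCKED_BY]--> B"
```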
Background article with architecture diagram and limitations: dev.to write-up.
Code
Repository: Hashevolution/James-RAG-Evol — MIT
Code-quality and security signals for reviewers:
- OpenSSF Best Practices passing badge (Tiered 111%, awarded 2026-05-11)
- 7 published GitHub Releases through v0.2.0 (Foundation Hardening, 5/6 axes engineering-complete)
- `ruff.toml` enforces F821 / F541 / F401 / F841 on every PR via GitHub Actions
- 83-item security regression suite (`james_security_test.py`): injection, path traversal, prompt injection, unsafe deserialization
- 17-item password regression suite (`tests/test_password_bcrypt.py`)
- bcrypt password storage with transparent SHA-256 → bcrypt migration on first login (PR #173, see the sketch after this list)
- GitHub Private Vulnerability Reporting enabled
- Module-size gate: no file under `core/` exceeds 20 KB
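The transparent-migration pattern, in sketch form — this is the generic technique, not the exact JAMES code, and `verify_and_migrate`/`save` are hypothetical names:

```python
# Verify against the legacy SHA-256 hash; on success, re-store as bcrypt.
import hashlib
import hmac

import bcrypt  # pip install bcrypt

def verify_and_migrate(password: str, stored: str, save) -> bool:
    """Return True if the password matches; upgrade legacy hashes in place."""
    if stored.startswith("$2b$"):  # already a bcrypt hash
        return bcrypt.checkpw(password.encode(), stored.encode())
    legacy = hashlib.sha256(password.encode()).hexdigest()
    if hmac.compare_digest(legacy, stored):  # legacy SHA-256 match
        save(bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode())
        return True  # migrated transparently on first successful login
    return False
```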
Where Gemma 4 lives in the codebase:
- `config.py:139` — `GEMMA_MODEL = os.environ.get("JAMES_LLM_MODEL", "gemma4:e4b")`
- `llm/router.py` — task-aware dispatch (`task_type=extract / classify / general / coding / vision`); every production call site declares its task
- `core/reasoning/pipeline.py` — RAG retrieval pipeline with explicit `graph_paths` argument carried to the model
- `core/security_layer.py::pre_check` — risky-coding hard-refuse, byte-identical to prompt-injection block
How I Used Gemma 4
Model choice: E4B (gemma4:e4b)
Four Gemma 4 variants were considered. The choice was forced by single-user, laptop-class constraints:
| Variant | Considered for | Outcome |
|---|---|---|
| 31B Dense | Server-grade reasoning depth | ❌ Doesn't fit 16 GB RAM; single-user means no throughput need |
| 26B MoE | Long-context advanced reasoning | ❌ Expert-routing overhead helps batch workloads; single-user has batch size 1 |
| E4B (4B effective) ⭐ | Edge variant: 4B params, native multimodal, 128K context | ✅ Fits 8 GB GPU or CPU-only laptop, gives the 128K window I need, supports vision for v0.3 multimodal track |
| E2B (2B effective) | Smaller still | ⚠️ Tested as fallback; reasoning depth too low for graph synthesis at depth 3+ |
The deciding factor was the 128K context window — not parameter count. Here's why.
Why 128K context matters for Graph-RAG specifically
A typical RAG pipeline retrieves top-k chunks and stuffs them into the prompt. Graph-RAG retrieves chunks and the relations between them — and the relations carry semantic meaning I want the model to reason over, not see as decoration.
A depth-3 query against my 161-entity wiki produces a context like:
[retrieved chunk 1] (entity A, sensitivity=public)
[retrieved chunk 2] (entity B, sensitivity=public)
...
[graph_path] A --[CAUSES]--> X --[REQUIRES]--> Y --[BLOCKED_BY]--> B
[graph_path] A --[KNOWN_AS]--> A' --[REFERENCES]--> C
[ontology] relation 'CAUSES' directed=true weight=0.85
[ontology] relation 'BLOCKED_BY' directed=true weight=0.92
[instruction] use graph_paths to constrain the answer
For real queries at depth 3 this routinely hits ~40K tokens. With a 32K-window model (most older OSS LLMs), I'd be silently truncating the graph paths — meaning the model defaults to vector-only reasoning and the ontology becomes decoration. With Gemma 4's 128K window the full retrieval result fits in one shot and the model actually reasons over the relation labels.
This is the property I designed the rest of the system around. Without 128K, the "Graph-RAG with ontology" claim collapses into "RAG with extra metadata".
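Here is a rough sketch of that assembly step — illustrative only, with `build_context` as a hypothetical name (the real pipeline lives in `core/reasoning/pipeline.py`):

```python
# Chunks plus serialized graph paths go into ONE prompt; with a 128K window
# we only check the budget instead of chunking or map-reducing.
def build_context(chunks, paths, ontology, budget_tokens=128_000):
    parts = [f"[retrieved chunk {i + 1}] {c}" for i, c in enumerate(chunks)]
    parts += [f"[graph_path] {p}" for p in paths]
    parts += [f"[ontology] relation '{name}' directed={d} weight={w}"
              for name, d, w in ontology]
    parts.append("[instruction] use graph_paths to constrain the answer")
    context = "\n".join(parts)
    # Rough estimate (~4 chars/token): a ~40K-token depth-3 retrieval fits
    # comfortably; a 32K-window model would silently drop the graph paths.
    assert len(context) / 4 < budget_tokens, "would truncate graph paths"
    return context
```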
Native function calling → router task_type
Gemma 4's native function calling underpins llm/router.py::call_router, which makes every call declare its purpose:
call_router(prompt, task_type="extract", **kwargs) # entity extraction
call_router(prompt, task_type="classify", **kwargs) # intent classification
call_router(prompt, task_type="general", **kwargs) # chat answer
call_router(prompt, task_type="coding", **kwargs) # code generation
The same router can route task_type=coding to a 32B Coder and task_type=general to Gemma 4 — but the default for general reasoning is gemma4:e4b, because the 128K window dominates everything else at this scale.
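A minimal sketch of the dispatch idea, assuming the `ollama` Python client and a local Ollama daemon — the real `llm/router.py` adds tracing, retries, and more task types:

```python
# Hypothetical simplification of llm/router.py::call_router.
import os

import ollama  # pip install ollama

DEFAULT = os.environ.get("JAMES_LLM_MODEL", "gemma4:e4b")
TASK_MODELS = {
    "extract": DEFAULT,
    "classify": DEFAULT,
    "general": DEFAULT,
    "coding": os.environ.get("JAMES_CODER_MODEL", DEFAULT),  # swap in a Coder here
}

def call_router(prompt: str, task_type: str, **options) -> str:
    model = TASK_MODELS[task_type]  # every call site declares its purpose
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options=options or None,
    )
    return resp["message"]["content"]
```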
Reasoning mode → security policy
Gemma 4's chain-of-thought reasoning is what makes the risky-coding hard-refuse policy actually usable. For clear destructive patterns the block fires before the LLM is ever called (a regex match in `core/security_layer.py::RISKY_CODING_REGEX`); borderline queries pass `pre_check`, and the model itself classifies them:
- Query asks how to perform destructive command on a target → refuse, byte-identical block message
- Query asks about command syntax (documentation) → answer normally
Without a reasoning-capable model, this distinction collapses into "block everything" (false positives) or "answer everything" (security holes). Gemma 4's reasoning is what threads the needle.
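In sketch form, reusing the `call_router` sketch above — the regex and block message here are illustrative stand-ins, not the real list in `core/security_layer.py`:

```python
# Two-layer check: a regex hard-refuses unambiguous destructive requests
# before any LLM call; borderline queries go to the model for intent.
import re

RISKY_CODING_REGEX = re.compile(r"rm\s+-rf\s+/|DROP\s+TABLE|mkfs\.", re.I)
BLOCK_MESSAGE = "Request refused by security policy."  # byte-identical reply

def pre_check(query: str) -> str | None:
    """Return the block message, or None to let the query proceed."""
    if RISKY_CODING_REGEX.search(query):
        return BLOCK_MESSAGE  # hard refuse, no LLM involved
    verdict = call_router(  # borderline: let the model classify intent
        "Is this asking to PERFORM a destructive action, or asking about "
        f"SYNTAX/documentation? Answer PERFORM or SYNTAX.\n\n{query}",
        task_type="classify",
    )
    return BLOCK_MESSAGE if "PERFORM" in verdict.upper() else None
```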
What I didn't use yet
- Native multimodal retrieval — Image OCR (Tesseract, EasyOCR) and video ASR (Whisper) are wired as ingestion paths, but treating images/audio as first-class graph citizens during retrieval is the v0.3 deliverable. Gemma 4's native vision is ready and waiting.
- 31B Dense or MoE for server deployment — JAMES stays single-machine until v1.0 by design (`docs/PLATFORM_READINESS.md`). When multi-tenancy lands, swapping `JAMES_LLM_MODEL=gemma4:31b` is a one-env-var change — the router already abstracts it.
One-line summary of the model fit
128K context is what lets Graph-RAG be graph-RAG instead of "RAG with extra metadata". gemma4:e4b is the smallest variant that ships it at a footprint a laptop can hold.
Looking for: adversarial review of the security model, a second user willing to run scripts/bench.py --suite=step7 on their own corpus (that's the v0.2 → v0.3 gate), and critiques of the self-evolution 4-Gate.
GitHub: https://github.com/Hashevolution/James-RAG-Evol
OpenSSF: https://www.bestpractices.dev/projects/12806
🤖 Honest disclosure: this submission was drafted with AI assistance and edited by the author. The codebase, design decisions, model-choice rationale, and limitations described above are real and verifiable in the linked repository.
A few specific questions I'd love adversarial pushback on:
Was E4B the right call, or should I have gone with the 26B MoE?
My reasoning was "single-user has batch size 1, so the MoE routing overhead doesn't pay back." Counter-arguments welcome — especially if you've measured MoE vs Dense vs Edge on a similar Graph-RAG workload.

Does the 128K context really beat chunking for Graph-RAG?
I'm dropping the entire depth-3 traversal (~40K tokens) into one shot. The alternative is map-reduce over chunks. Has anyone measured retrieval-precision-at-K with both?

Self-evolution 4-Gate — enough or not enough?
Gate 4 is human approval. Should there be a Gate 5 — a cooling-off period before the patch can actually apply, even after approval?
Also actively recruiting a second user willing to run `scripts/bench.py --suite=step7` on their own corpus — that's the v0.2 → v0.3 gate, not a code task.