I built an AI résumé tool that refuses to lie about your experience

jaberoma — Mon, 25 May 2026 17:53:13 +0000

Most AI résumé tools have the same flaw: they hallucinate. Ask them to tailor your résumé for a job requiring "Rust experience" and they'll happily invent a Rust project you never worked on. It reads great — until the technical interview.

I wanted the opposite. So I built Citevault: a local-first résumé tailoring tool where every claim is either grounded in your own evidence, or refused and flagged as a gap.

No fabrication. No API keys. Runs entirely on your laptop. (Model weights are pulled from Hugging Face once on first boot; after that, no outbound connections.)

The core idea: claim-level grounding

Every bullet in your résumé starts as a claim. Citevault processes each one through a pipeline:

Retrieve — hybrid BM25 + dense embedding search over your indexed evidence (master résumé, project READMEs, blog posts, anything you upload)
Re-rank — BGE cross-encoder scores the top candidates for relevance
Verify — Gemma 4 reads the claim alongside the retrieved span and gives a verdict: SUPPORTS, PARTIAL, UNCLEAR, or CONTRADICTS
Rewrite or refuse — SUPPORTS → the claim is verified and cited; PARTIAL → rewritten to match only what the evidence actually says; UNCLEAR → a rewrite is attempted, and if it still can't be grounded, refused and gap-reported; CONTRADICTS → refused immediately and gap-reported
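
The verdict-to-action rules above can be sketched as a small routing function. This is an illustrative sketch, not Citevault's actual code: the names `Verdict`, `Outcome`, and `route_claim` are hypothetical, and the `rewrite` and `verify` callables stand in for the model calls. It also assumes that "still can't be grounded" means the rewritten claim fails to earn a SUPPORTS verdict on re-verification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    SUPPORTS = "SUPPORTS"
    PARTIAL = "PARTIAL"
    UNCLEAR = "UNCLEAR"
    CONTRADICTS = "CONTRADICTS"


@dataclass
class Outcome:
    status: str           # "cited", "rewritten", or "refused"
    text: Optional[str]   # final bullet text; None when the claim is refused
    gap_reported: bool    # True when the claim lands in the gap report


def route_claim(
    claim: str,
    evidence: str,
    verdict: Verdict,
    rewrite: Callable[[str, str], str],      # hypothetical: narrows a claim to the evidence
    verify: Callable[[str, str], Verdict],   # hypothetical: re-runs the verifier
) -> Outcome:
    if verdict is Verdict.SUPPORTS:
        return Outcome("cited", claim, False)
    if verdict is Verdict.CONTRADICTS:
        return Outcome("refused", None, True)

    # PARTIAL and UNCLEAR both trigger a rewrite against the evidence span.
    rewritten = rewrite(claim, evidence)
    if verdict is Verdict.PARTIAL:
        return Outcome("rewritten", rewritten, False)

    # UNCLEAR: keep the rewrite only if it now verifies (assumed criterion).
    if verify(rewritten, evidence) is Verdict.SUPPORTS:
        return Outcome("rewritten", rewritten, False)
    return Outcome("refused", None, True)
```

Passing the model calls in as plain callables keeps the routing logic testable without a running model.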

The result is a résumé where every bullet has a [^sp-...] footnote traceable back to a specific span in your source material.

The wow demo: Naive Comparison Mode

Toggle "Compare with naive AI" before starting a tailoring run. Citevault then runs its grounded pipeline plus a second, single-pass run that uses the same model, the same evidence, and the same task description but skips the verification loop. The only difference is that the grounded pipeline checks every claim against its source before including it.

The diff is striking:

Grounded résumé: seven bullets, every one backed by a citation footnote traceable to a source span
Naive résumé: longer, more confident-sounding — and full of placeholders like [Candidate Name] and invented achievements that never appeared in the evidence
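
The two failure modes in that diff (bullets with no citation, and template placeholders left in the text) are easy to check mechanically. Here is a rough sketch of such an audit. It is not part of Citevault as far as the post states: `audit_bullets` is a hypothetical helper, and the regular expressions assume only that citations look like `[^sp-` followed by some identifier and a closing bracket, and that placeholders are capitalised words in square brackets.

```python
import re
from typing import Dict, List

# Assumed footnote shape: "[^sp-" + any identifier + "]".
FOOTNOTE = re.compile(r"\[\^sp-[^\]\s]+\]")
# Assumed placeholder shape: "[Capitalised Words]", excluding footnotes ("[^...").
PLACEHOLDER = re.compile(r"\[(?!\^)[A-Z][A-Za-z ]*\]")


def audit_bullets(bullets: List[str]) -> Dict[str, List[str]]:
    """Return the bullets that lack a citation or contain a placeholder."""
    return {
        "uncited": [b for b in bullets if not FOOTNOTE.search(b)],
        "placeholders": [b for b in bullets if PLACEHOLDER.search(b)],
    }


if __name__ == "__main__":
    sample = [
        "Built a command-line importer in Go [^sp-example]",  # made-up footnote id
        "Led platform strategy as [Candidate Name]",
    ]
    print(audit_bullets(sample))
```

Run against both outputs, a check like this would give a quick count of unsupported bullets on each side of the comparison.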

The AI stack (all local, no API keys)

Component	Role
Gemma 4 E4B (`gemma4:e4b`) via Ollama	Claim drafting, verification, cover letter composition
BGE-small-en-v1.5	Dense embeddings for semantic retrieval
BGE cross-encoder	Re-ranking retrieved candidates
BM25 + SQLite FTS5	Keyword retrieval (hybrid RAG)
sqlite-vec	Vector store — no external database required

Gemma 4 E4B was chosen specifically for this role: it is instruction-tuned well enough to return consistent structured JSON verdicts, small enough to run on CPU without a GPU, and open-weight so no API key or data exposure is involved. The e4b tag is the Q4_K_M quantised build — the best size/quality tradeoff for local inference via Ollama.

The entire stack runs on CPU. Measured on a 4-core/8-thread laptop with 32 GB RAM and no discrete GPU: 3–8 tokens/second generation speed, 20–30 minutes per tailoring run; add another 10–20 minutes if naive comparison is enabled. Slower than a cloud API, but zero cost, zero data exposure, and no dependency on an upstream service staying alive.

What I learned building this

Structured generation is the hard part. Getting Gemma 4 to consistently return structured JSON verdicts from the verifier took more prompt iteration than anything else. The final verifier prompt is tightly constrained: it gives the model a specific rubric, a strict output format, and a worked example. It still occasionally returns malformed output — those claims are logged and omitted from the output rather than silently passed through.
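
The "log and omit" behaviour for malformed verifier output might look something like the sketch below. The JSON shape is an assumption: the post does not show the verifier's output format, so the `verdict` field name and the `parse_verdict` helper are hypothetical.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("verifier")

ALLOWED_VERDICTS = {"SUPPORTS", "PARTIAL", "UNCLEAR", "CONTRADICTS"}


def parse_verdict(raw: str) -> Optional[str]:
    """Return a valid verdict label, or None if the output is unusable.

    Assumes the model was asked for a JSON object with a "verdict" key
    (a hypothetical field name). Callers drop the claim when this is None.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("verifier returned non-JSON output: %r", raw[:200])
        return None

    verdict = data.get("verdict") if isinstance(data, dict) else None
    # The isinstance check also guards against unhashable values such as lists.
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        logger.warning("verifier returned an unexpected verdict: %r", verdict)
        return None
    return verdict
```

Returning `None` instead of guessing a default verdict is what keeps a malformed response from being silently passed through.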

Hybrid RAG matters. Pure dense search misses exact keyword matches. Pure BM25 misses semantic similarity. On the five-case golden eval set, the hybrid combination recovered ~15 percentage points in first-pass grounding rate over either retrieval strategy alone — enough to tip borderline claims from UNCLEAR to SUPPORTS.
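
The post does not say how the BM25 and dense result lists are merged. One common option is reciprocal rank fusion (RRF), sketched here purely as an illustration; the function name, the span identifiers, and the constant `k = 60` are assumptions, not details from Citevault.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of ids into one list.

    Each id scores 1 / (k + rank) per list it appears in (rank starts at 1),
    so an id ranked well by both retrievers beats one ranked well by only one.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["span_a", "span_b", "span_c"]    # keyword ranking (made-up ids)
    dense_hits = ["span_c", "span_a", "span_d"]   # embedding ranking (made-up ids)
    print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

RRF works on ranks alone, which sidesteps the fact that BM25 scores and cosine similarities live on different scales.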

Eval-driven development pays off. I built a golden evaluation set of five synthetic candidates and ran the pipeline against it after every significant change. The final first-pass grounding rate is 98.2% — but more importantly, I caught two regressions that looked fine in manual testing.
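
A regression check of this kind can be very small. The sketch below assumes "first-pass grounding rate" means the share of claims that receive SUPPORTS on the first verification pass; the post does not define the metric, so treat that definition, the function names, and the tolerance value as assumptions.

```python
from typing import Dict, List


def first_pass_grounding_rate(verdicts: List[str]) -> float:
    """Fraction of first-pass verdicts equal to SUPPORTS (assumed definition)."""
    if not verdicts:
        return 0.0
    return sum(v == "SUPPORTS" for v in verdicts) / len(verdicts)


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return eval cases whose rate dropped by more than `tolerance`.

    A case missing from `current` counts as a rate of 0.0, so a crashed
    run shows up as a regression instead of disappearing.
    """
    return sorted(
        case
        for case, base_rate in baseline.items()
        if current.get(case, 0.0) < base_rate - tolerance
    )


if __name__ == "__main__":
    baseline = {"candidate_1": 1.0, "candidate_2": 0.9}   # made-up numbers
    current = {"candidate_1": 1.0, "candidate_2": 0.8}
    print(find_regressions(baseline, current))
```

Comparing per-case rates, not just the aggregate, is what surfaces a regression that one strong case would otherwise mask.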

Local-first is a real constraint, not a marketing line. Your career data is sensitive. Résumés contain salary history, reasons for leaving, private project details. I didn't want to be a data controller. Building local-first forced specific architectural decisions — no cloud storage, no async job queue, no third-party embedding API.

Try it

```bash
docker compose up -d ollama
docker compose exec ollama ollama pull gemma4:e4b
docker compose up -d
# Then open http://localhost:5173/admin in your browser
```

Upload your evidence, paste a job posting, and watch the grounding happen in real time over a Server-Sent Events (SSE) stream.
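
If you would rather follow a run from a script than from the browser, the SSE wire format is simple enough to parse by hand. The sketch below handles only the `event:` and `data:` fields and takes any iterable of text lines; the event names and payloads in the example are made up, since the post does not document Citevault's stream.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {"event", "data"} dicts from raw Server-Sent Events lines."""
    event = "message"          # default event type per the SSE format
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; skip it if no data arrived.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue           # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


if __name__ == "__main__":
    sample = [
        "event: verdict\n",                # hypothetical event name
        'data: {"claim": 1}\n',            # hypothetical payload
        "\n",
        ": keep-alive\n",
        "data: done\n",
        "\n",
    ]
    for evt in iter_sse_events(sample):
        print(evt)
```

In practice the lines would come from a streaming HTTP response; the parser only needs an iterable of decoded text lines.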

Heads up — this runs on CPU. On a 4-core laptop without a GPU, expect 20–30 minutes per tailoring run. With naive comparison enabled, add another 10–20 minutes for the second pass. It is slow by cloud-API standards, but fully offline and costs nothing after the first model pull.

The best test: pick a role where you have a genuine skill gap — that is where the gap report is most useful.

The full architecture (hexagonal layout, RAG pipeline, Docker Compose stack) is documented in docs/architecture.md in the repo.

The code is on GitHub: github.com/jaberoma/citevault — MIT licensed, no account required, runs on any laptop with Docker.