Yash Kumar Saini

Posted on May 24

I built a local document Q&A tool around Gemma 4 E4B's 128K context — five days, no RAG, no cloud

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

Five days, an 8 GB laptop GPU, and a stubborn belief that for the kind of documents I actually read — research papers, internal memos, the API docs of one project — RAG is over-engineering. DeepRead loads PDFs into Gemma 4 E4B's 128K context as page images and answers questions with footnote citations pointing to the exact page. No vector DB. No chunker. No retriever. ~500 lines of Python.

This is my submission for the Build with Gemma 4 prompt. The full repo is at github.com/yashksaini-coder/DeepRead. The model is gemma4:e4b (the 4-billion-parameter Gemma 4) served by Ollama. Everything runs offline.

What it does

You start with an empty chat. The right sidebar has a Papers picker — five classic CS papers ship bundled (Attention, GFS, MapReduce, Raft, Bitcoin), or you upload your own PDF. Click one and it ingests in about half a second; the sidebar's Plotly bar fills green to show how much of the 128K window is now in your prompt.

Then you ask whatever you want. The answer streams back with footnote markers [^1], [^2] that resolve to specific pages of specific papers at the bottom of the message. The prompt tells the model to use only page IDs from a known list, and the formatter turns only those IDs into footnotes, so a made-up page can never render as a citation.

Diagnostics live in the same chat as slash commands. /bench show renders the latest needle-in-a-haystack sweep as three Plotly charts (pass rate, tok/s, time-to-first-token). /bench run --ctx 5000 20000 60000 --needles 5 kicks off a fresh sweep. No tab switching, no separate session, no chat clearing.
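For readers curious how slash commands like these can share a chat with ordinary questions, here is a minimal sketch using the standard library's argparse. It is not the code from the repo: the function name parse_bench_command, the choice to make --ctx required, and the default of 5 needles are assumptions made for illustration.

```python
import argparse


def _build_parser() -> argparse.ArgumentParser:
    # Hypothetical grammar: "/bench show" or "/bench run --ctx N [N ...] [--needles K]".
    parser = argparse.ArgumentParser(prog="/bench", add_help=False)
    actions = parser.add_subparsers(dest="action", required=True)
    actions.add_parser("show", add_help=False)
    run = actions.add_parser("run", add_help=False)
    run.add_argument("--ctx", type=int, nargs="+", required=True)
    run.add_argument("--needles", type=int, default=5)  # assumed default
    return parser


_PARSER = _build_parser()


def parse_bench_command(text: str):
    """Return parsed arguments for a /bench message, or None for a normal question."""
    words = text.split()
    if not words or words[0] != "/bench":
        return None  # not a slash command: treat it as a document question
    try:
        return _PARSER.parse_args(words[1:])
    except SystemExit:
        # argparse exits on bad input; a chat handler would rather get an exception.
        raise ValueError(f"malformed /bench command: {text!r}") from None
```

A chat handler could call parse_bench_command on every incoming message, fall through to document Q&A when it returns None, and show the ValueError text as a reply when the command is malformed.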

Why E4B specifically

I'm going to quote the model-selection paragraph straight from the README, because the rubric explicitly weights this and I don't want to bury it:

E4B is the only model in the Gemma 4 family — and arguably the only open model at this size today — that combines four properties at once: a 128K context window wide enough to hold a complete research paper plus supplementary material in a single call; native vision that handles PDF pages rendered at 150 DPI without an OCR pipeline; native audio input (held in reserve for the next iteration); and a ~9.6 GB on-disk footprint that runs on an 8 GB laptop GPU. The 26B and 31B variants would push reasoning quality up, but they would kill the laptop story — and the whole point of DeepRead is that nothing leaves the machine. E2B was tempting for portability but loses fidelity on multi-step reasoning across long context. E4B is the precise sweet spot.

Four sentences, one decision; the rest of the build defends it.

The decision I'm most proud of: no RAG

Every "AI document assistant" I've used in the past two years has the same shape — chunk, embed into a vector DB, retrieve top-k, prompt a hosted LLM with the retrieved chunks. RAG. It works. It's also a small mountain of moving parts that all need to stay aligned: chunk size, overlap, embedding model version, top-k value, rerank threshold. And at the end of all that, your documents have been shipped to someone else's machine.

DeepRead is a bet that for a non-trivial class of documents — research papers, internal memos, a few weeks of meeting notes — none of that is necessary. Drop the PDFs into the prompt as rendered page images, ask the question, get an answer with page citations. The whole tool is gemma4:e4b running locally via Ollama, plus about 500 lines of Python.

I did the full math on this comparison, which I'll post in a later article. The TL;DR: DeepRead spends ~28% more tokens per page than a text-embedding pipeline, in exchange for zero offline preprocessing, no retrieval step that can miss the relevant chunk, and the ability to reason about figures, tables, equations, and handwritten margin notes the same way a human reader does.
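To make "rendered page images" concrete, here is a rough sketch of the rasterization step using PyMuPDF at the 150 DPI mentioned earlier. It is an illustration under stated assumptions, not the repo's ingest.py: the PageShard type, the make_cite_id helper, and the "<file-stem>-p<page>" ID format are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PageShard:
    cite_id: str  # hypothetical ID format, e.g. "bitcoin-p3"
    png: bytes    # the page rendered as a PNG image


def make_cite_id(pdf_path: str, page_number: int) -> str:
    """Build a stable, human-readable ID for one page (1-based page number)."""
    return f"{Path(pdf_path).stem}-p{page_number}"


def rasterize(pdf_path: str, dpi: int = 150) -> list[PageShard]:
    """Render every page of a PDF to PNG bytes, one shard per page."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    shards = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pixmap = page.get_pixmap(dpi=dpi)
            shards.append(
                PageShard(make_cite_id(pdf_path, index + 1), pixmap.tobytes("png"))
            )
    return shards
```

Each shard's PNG bytes can then go into the model call as an image, with the cite_id listed in the prompt so answers can point back at the page.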

What 100K tokens actually costs on a laptop GPU

I built benchmarks/run_context_sweep.py to answer the question honestly. It runs a needle-in-a-haystack test: five 4-character codes seeded at fixed positions (5/25/50/75/95%) inside a long synthetic document, and the model has to recover each in isolation. From an RTX 5050 Laptop, 8 GB VRAM:

Context (tokens)	Needles recovered	Generation speed (tok/s)	Time to first token
20K	5/5 ✓	8.6	15 s
60K	5/5 ✓	7.6	38 s
100K	5/5 ✓	6.8	72 s

The recall result genuinely surprised me. I expected E4B to degrade past 60K and it didn't: recall held at 100K, about 78% of the 128K window. What degraded was latency. TTFT grew roughly in proportion to context size, from 15 s at 20K to 72 s at 100K, while generation throughput slipped only modestly, from 8.6 to 6.8 tok/s. The consumer-GPU tax shows up almost entirely in the prefill phase.
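The test setup described above (five 4-character codes at 5/25/50/75/95% of a long synthetic document) could be built along these lines. This is a simplified sketch, not benchmarks/run_context_sweep.py itself: the function names, the wording of the needle sentence, and the sentence-level placement are assumptions.

```python
import random
import string

NEEDLE_POSITIONS = (0.05, 0.25, 0.50, 0.75, 0.95)


def make_code(rng: random.Random, length: int = 4) -> str:
    """Generate one random code from uppercase letters and digits."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_haystack(filler_sentences, positions=NEEDLE_POSITIONS, seed=0):
    """Insert one needle sentence per position; return (document, codes)."""
    rng = random.Random(seed)
    codes = []
    while len(codes) < len(positions):
        code = make_code(rng)
        if code not in codes:  # keep the codes distinct
            codes.append(code)

    sentences = list(filler_sentences)
    total = len(sentences)
    # Insert from the back so earlier insertion points are not shifted.
    for slot in reversed(range(len(positions))):
        index = int(positions[slot] * total)
        sentences.insert(index, f"The secret code for slot {slot + 1} is {codes[slot]}.")
    return " ".join(sentences), codes


def score(answers, codes) -> int:
    """Count needles recovered, given one model answer per needle."""
    return sum(code in answer for code, answer in zip(codes, answers))
```

Asking one question per slot ("What is the secret code for slot 3?") and passing the answers to score gives the 0 to 5 result reported in the table.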

The practical mental model for someone building on this hardware:

< 20K is the interactive zone. Answers start within 15 seconds; conversation feels alive.
20K – 60K is the research-assistant zone. Drop in a whole paper, go make coffee, come back to the answer.
60K – 100K is the batch zone. Load a codebase, kick off a query, accept that you'll come back to a notification.

The Plotly chart in the right sidebar surfaces these zones live as you load papers, so you know what your context choices will feel like before you ask.
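As a rough sketch (not the sidebar's actual code), the three zones and the measured TTFT points can be turned into a small lookup. The function names are made up for illustration, the handling of the exact 20K and 60K boundaries is a guess, and the estimate assumes TTFT changes linearly between the measured points.

```python
# (context tokens, seconds to first token) from the RTX 5050 Laptop sweep above.
MEASURED_TTFT = [(20_000, 15.0), (60_000, 38.0), (100_000, 72.0)]


def latency_zone(tokens: int) -> str:
    """Map a working-set size to the zone names used in this post."""
    if tokens < 20_000:
        return "interactive"
    if tokens <= 60_000:
        return "research-assistant"
    if tokens <= 100_000:
        return "batch"
    return "untested"  # the sweep stopped at 100K


def estimate_ttft(tokens: int):
    """Linearly interpolate TTFT between measured points; None outside them."""
    for (x0, y0), (x1, y1) in zip(MEASURED_TTFT, MEASURED_TTFT[1:]):
        if x0 <= tokens <= x1:
            return y0 + (tokens - x0) * (y1 - y0) / (x1 - x0)
    return None
```

For example, a 40K-token working set lands in the research-assistant zone with an estimated wait of about 26.5 seconds on this hardware. Other GPUs would need their own measured points.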

Five-day build log

Tuesday. Scaffolded the project, wrote the deepread/ package: ingest.py (PyMuPDF rasterization), budget.py (token estimator), citations.py ([[id]] grammar), llm.py (Ollama wrapper). 16 unit tests. End of day, I could ingest a PDF and stream an answer from the terminal.

Wednesday. Built the first Gradio UI. Got immediately bitten by how Gradio renders gr.HTML(value=...): it sets .innerHTML, and browsers don't execute <script> tags inserted that way. I shipped three different "fixes" for the same paper-click bug before I realized I was reading the wrong stack trace. Lost most of the afternoon.

Thursday morning. Got the UI working with an <img onerror="..."> trick. Looked at it. Decided I hated it. The chat-shaped product was wearing document-tool clothes. Migrated to Chainlit. The migration was about three hours because deepread/ was UI-independent from day one — only app.py got rewritten.

Thursday night. Wrote benchmarks/run_context_sweep.py and let it run overnight on E4B at 5K / 20K / 60K / 100K. Spoiler: the numbers above are the result. The 100% recall at 100K was a relief — it meant the whole no-RAG thesis was actually defensible.

Friday. Polish day. Moved the context-budget chart from chat into the right ElementSidebar. Built a cl.CustomElement React component for the paper picker so the buttons live in the sidebar (Chainlit's ElementSidebar accepts Elements but not Actions — the picker bridges that gap). Pinned a floating "Context" toggle to the top-right of the chrome with :has(img[src*="/avatars/sidebar_toggle"]).

Saturday morning. Killed the cl.ChatProfile second tab. The profile-switch dialog ("This will clear your chat history") was constant friction every time I wanted to check a benchmark. Replaced it with /bench slash commands inside the same chat. One session, no clearing.

Saturday afternoon. Wrote the posts. (You're reading them.)

The code I'd point a reviewer at first

deepread/llm.py is the whole model contract:

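import ollama

# MODEL and _encode_media come from elsewhere in the module (not shown here).
def stream_chat(question, images=(), *, history=None, num_ctx=24_000, model=MODEL):
    """Stream the model's answer to `question` as text deltas."""
    payload = [_encode_media(i) for i in images]
    user_msg = {"role": "user", "content": question}
    if payload:
        # Page images ride along on the user message.
        user_msg["images"] = payload
    stream = ollama.chat(
        model=model,
        messages=list(history or []) + [user_msg],
        stream=True,
        options={"num_ctx": num_ctx},  # context window for this call, in tokens
    )
    for chunk in stream:
        delta = chunk.get("message", {}).get("content", "")
        if delta:  # skip chunks that carry no text
            yield delta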

That's it. There's a health_check(model) next to it that returns a typed HealthReport(ok, reason, hint) instead of raising — the caller decides how to surface it. The Chainlit handler runs it once per session and caches the result.
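The "return a report instead of raising" pattern could look roughly like this. Only the HealthReport(ok, reason, hint) shape comes from the post; the list_models parameter is a hypothetical injection point (any callable returning the names of locally pulled models) so the sketch runs without a live Ollama server, and the wording of the messages is invented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class HealthReport:
    ok: bool
    reason: str = ""  # what went wrong, if anything
    hint: str = ""    # what the user can do about it


def health_check(model: str, list_models: Callable[[], Iterable[str]]) -> HealthReport:
    """Check that the server answers and the model is available, without raising."""
    try:
        available = set(list_models())
    except Exception as exc:  # e.g. connection refused or timeout
        return HealthReport(
            False, f"Ollama is not reachable: {exc}", "Start it with `ollama serve`."
        )
    if model not in available:
        return HealthReport(
            False, f"Model {model!r} is not pulled.", f"Run `ollama pull {model}`."
        )
    return HealthReport(True)
```

Because the function never raises, the UI layer can decide whether a failed check becomes a banner, a chat message, or a log line.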

deepread/citations.py is the part I'm sneakiest about:

def citation_prompt(shards):
    catalog = "\n".join(f"- {s.cite_id}" for s in shards)
    return (
        "You are a research assistant. When you make a factual claim that "
        "comes from a specific page or image, cite it inline using the "
        "format [[cite_id]]. Use ONLY these citation ids:\n"
        f"{catalog}\n"
        "If a fact isn't supported by the provided material, say so."
    )

Then _format_answer(raw, known_cite_ids) regex-replaces [[id]] markers with numbered [^N] footnotes, but only if id is in the catalog. So even if the model emits [[some-fake-id]], the formatter leaves it as literal text. A citation to a page that doesn't exist can't reach the rendered footnotes. (The model can still cite the wrong real page; the guarantee covers invented ones.)
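A minimal sketch of that formatter idea follows. It mirrors the behaviour described above but is not the repo's _format_answer: the regex, the footnote layout, and numbering by first appearance are assumptions.

```python
import re

# Matches [[some-id]] where the id contains no square brackets.
CITE_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


def format_answer(raw: str, known_cite_ids) -> str:
    """Turn known [[id]] markers into [^N] footnotes; leave unknown ones as text."""
    known = set(known_cite_ids)
    numbering = {}  # cite_id -> footnote number, in order of first appearance

    def replace(match: re.Match) -> str:
        cite_id = match.group(1).strip()
        if cite_id not in known:
            return match.group(0)  # unknown id: keep the literal [[...]] text
        number = numbering.setdefault(cite_id, len(numbering) + 1)
        return f"[^{number}]"

    body = CITE_RE.sub(replace, raw)
    if not numbering:
        return body
    footnotes = "\n".join(f"[^{n}]: {cid}" for cid, n in numbering.items())
    return f"{body}\n\n{footnotes}"
```

Citing the same page twice reuses its footnote number, and an unknown ID such as [[some-fake-id]] passes through untouched so it is visible as a mistake instead of posing as a real reference.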

What I cut for time

Voice input. Gemma 4 E4B accepts raw WAV bytes in the same images: [...] field, but I never got the Chainlit cl.Audio plumbing solid enough to ship. Two days of work I didn't have.
Per-paper exclude/remove from the library. EXCLUDED_KEY exists in session state and the working-set logic respects it — there's just no UI button to flip it. A 30-minute add I'll do after the deadline.
Conversation export. Save / share a Q&A session as Markdown. Easy to add, not on the rubric.
Multi-language UI. Chainlit ships 20+ locale files; English-only for now.

What I learned

Framework choice matters more than you think. Gradio is more flexible; Chainlit is more aligned with the chat-shaped problem I had. Five UI iterations to recognize it. Pick the framework whose defaults match your shape.

The model-selection paragraph is the highest-leverage paragraph in the submission. Judges read it. Don't bury the reasoning.

Benchmark first, blog second. I'd written the article-1 stress-test draft before I had the numbers. When the numbers came in, the article got better, sharper, and shorter. Publishing the draft as it stood would have left me defending claims that the data didn't support.

Try it

git clone https://github.com/yashksaini-coder/DeepRead.git
cd DeepRead
ollama pull gemma4:e4b           # ~9.6 GB, one-time
make install                      # uv sync
make run                          # http://127.0.0.1:8000

Pick Bitcoin · 2008 for the fastest demo (smallest paper) and ask "What problem does proof-of-work solve in this paper?" The answer streams back with citations resolving to specific pages.

Companion posts

I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — the benchmark numbers in long form, plus the reproducible test rig




