Yash Kumar Saini

Posted on May 24

I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — recall is great, prefill is not

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Write about Gemma 4 Submission

Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered twenty needle-in-a-haystack questions (five needles at each of four context sizes) on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here's both, with a short Python script to reproduce it on your own hardware.

I keep seeing "Gemma 4 E4B has a 128K context window" repeated as if it were a single property, like "the engine is 3.5 litres". It is not a single property. A context-window number means at least three different things: will the model accept this many tokens, will it remember what's in the middle of them, and how fast does the first answer token arrive? The answers diverge sharply once you leave the datacenter-GPU regime that most spec sheets assume.

This is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. The reproducible test rig is at the bottom.

The setup

Hardware: RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H
Software: Ollama 0.24.0, gemma4:e4b (Q4_K_M, ~9.6 GB on disk), Linux 7.x
Test: needle-in-a-haystack — five unique 4-character codes embedded at fixed positions inside a long synthetic English document; the model has to recover each one in isolation by exact match.

The test is deliberately simple. I want to know whether the model can find a fact at a known position, not whether it can paraphrase it. Reasoning quality is a different benchmark and needs human evaluation, which I didn't have budget for.
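"Exact match" can be read strictly (the whole answer is the code) or leniently (the code appears somewhere in the answer). A minimal sketch of both readings, using a hypothetical score_answer helper that is not part of the rig; the lenient branch is close to the substring check the benchmark script uses, with an added guard against the code appearing inside a longer alphanumeric run:

```python
import re

def score_answer(answer: str, expected: str, strict: bool = False) -> bool:
    """Return True if the model's answer recovers the expected code.

    Hypothetical helper for illustration; the names and the two modes
    are assumptions, not the benchmark's own scoring code.
    """
    cleaned = answer.strip()
    if strict:
        # Strict: the whole answer must be the code, ignoring surrounding
        # whitespace and simple wrapping punctuation or quotes.
        return cleaned.strip("\"'`.*") == expected
    # Lenient: the code must appear as a standalone token in the answer,
    # not embedded in a longer run of uppercase letters or digits.
    pattern = rf"(?<![A-Z0-9]){re.escape(expected)}(?![A-Z0-9])"
    return re.search(pattern, cleaned) is not None
```

Strict scoring penalizes a model that wraps the code in a sentence, so which mode is appropriate depends on whether instruction-following is part of what you want to measure.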

I ran the sweep at 5K, 20K, 60K, and 100K target context sizes. I didn't push to the 128K spec because Ollama's num_ctx setting interacts with the K/V cache headroom in ways I didn't have time to characterize cleanly, and 100K is already about 78% of the spec.

The numbers

Context (target tokens)	Needles recovered	Generation (tok/s)	Time to first token
5K	5/5 ✓	9.2	4 s
20K	5/5 ✓	8.6	15 s
60K	5/5 ✓	7.6	38 s
100K	5/5 ✓	6.8	72 s

Three things stand out.

Recall stayed perfect. I expected E4B to wobble somewhere past 60K — that's the failure mode I see most reported for 4B-class models, the "middle of the context is fuzzy" problem. The needles at 25% and 75% are exactly where I'd expect drop-off. They held. I re-ran the sweep twice to be sure.

Generation throughput barely moved. 9.2 tok/s at 5K vs. 6.8 tok/s at 100K. That's a 26% drop across a 20x context increase. The K/V cache is the obvious culprit, but in practical terms: once the answer starts streaming, it streams at roughly the same speed.

Time to first token blew up. 4s at 5K, 72s at 100K. Almost linear in context size. This is the prefill phase — the model encoding everything you sent it before producing the first output token. On a laptop GPU, prefill is where the consumer-hardware tax lives.
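The arithmetic behind those last two observations can be checked directly. A minimal sketch that takes the four rows of the table as input and computes the throughput drop, the prefill cost per prompt token, and a least-squares slope for TTFT against context size (the function names are made up for illustration):

```python
# Measurements copied from the table above:
# (context tokens, generation tok/s, time to first token in seconds).
RESULTS = [
    (5_000, 9.2, 4.0),
    (20_000, 8.6, 15.0),
    (60_000, 7.6, 38.0),
    (100_000, 6.8, 72.0),
]

def throughput_drop_pct(results):
    """Percent drop in generation speed from the smallest to the largest context."""
    first, last = results[0][1], results[-1][1]
    return 100.0 * (first - last) / first

def prefill_ms_per_token(results):
    """TTFT divided by context size, in milliseconds per prompt token."""
    return {ctx: 1000.0 * ttft / ctx for ctx, _, ttft in results}

def ttft_slope_s_per_1k(results):
    """Least-squares slope of TTFT against context, in seconds per 1K tokens."""
    xs = [ctx / 1000.0 for ctx, _, _ in results]
    ys = [ttft for _, _, ttft in results]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

if __name__ == "__main__":
    print(f"throughput drop: {throughput_drop_pct(RESULTS):.1f}%")
    for ctx, ms in prefill_ms_per_token(RESULTS).items():
        print(f"{ctx:>7,} tokens: {ms:.2f} ms/token")
    print(f"TTFT slope: {ttft_slope_s_per_1k(RESULTS):.2f} s per 1K tokens")
```

With the table's numbers this works out to a throughput drop of about 26%, between roughly 0.6 and 0.8 ms of prefill per prompt token, and a slope near 0.7 s per 1K tokens. That near-constant per-token cost is what "almost linear" means here.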

What this means if you're building on E4B

Let me write the practical zones the way I actually think about them, not the marketing version:

Under 20K tokens: interactive. First token within ~15 seconds (about 4 seconds at 5K), full answer in ~25-30s. This feels like a real conversation. Most single-paper Q&A lives here.
20K to 60K tokens: research-assistant. 15-40 second TTFT. You're going to glance away from the screen. That's fine, the answer will be there when you look back. Multi-paper comparisons, longer contexts.
60K to 100K tokens: batch. You're queuing a job. A 40-70 second TTFT means you might as well make coffee. Loading a whole codebase, a textbook chapter, a quarter's worth of meeting notes.
Above 100K: I didn't measure. The prefill cost was already breaching my "is this still interactive?" threshold and the use case I was solving for didn't need it.

If you're designing a UI on top of this model, surface these zones to the user. A progress bar or a tier label ("interactive / research / batch") tells someone what their next click will feel like before they ask. The 128K spec is honest; it just doesn't tell you when it'll start.
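A tier label and a rough wait estimate can be derived from the prompt's token count. A minimal sketch, assuming the zone boundaries listed above and linear interpolation between the four measured TTFT points; latency_tier and estimate_ttft_seconds are hypothetical names, and the numbers only describe this one laptop:

```python
# Measured (context tokens, TTFT seconds) pairs from the table in this post.
MEASURED = [(5_000, 4.0), (20_000, 15.0), (60_000, 38.0), (100_000, 72.0)]

def latency_tier(prompt_tokens: int) -> str:
    """Map a prompt size to the zone names used in this post."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens < 20_000:
        return "interactive"
    if prompt_tokens <= 60_000:
        return "research"
    if prompt_tokens <= 100_000:
        return "batch"
    return "unmeasured"  # nothing above 100K was benchmarked

def estimate_ttft_seconds(prompt_tokens: int):
    """Interpolate TTFT linearly between measured points.

    Assumes TTFT is 0 s at 0 tokens (an illustrative assumption) and
    returns None outside the measured range rather than extrapolating.
    """
    points = [(0, 0.0)] + MEASURED
    if prompt_tokens < 0 or prompt_tokens > points[-1][0]:
        return None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= prompt_tokens <= x1:
            return y0 + (y1 - y0) * (prompt_tokens - x0) / (x1 - x0)
    return None
```

Returning None above 100K instead of extrapolating keeps the UI from promising a wait time that was never measured.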

Reproduce it yourself

The whole rig is a few dozen lines once you strip the CLI scaffolding. Save this as bench.py, install the Python client (pip install ollama), make sure the Ollama server is running with gemma4:e4b pulled, then run it:

import random, time
import ollama

MODEL = "gemma4:e4b"
NEEDLE_POSITIONS = [0.05, 0.25, 0.50, 0.75, 0.95]

def make_needles(k=5, seed=20260521):
    rng = random.Random(seed)
    chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return [(f"box-{i+1}", "".join(rng.choices(chars, k=4))) for i in range(k)]

def build_haystack(target_tokens: int, needles):
    # One filler sentence: ~160 characters (22 words) of English-ish prose.
    filler = (
        "The committee continued its review of the operational notes "
        "submitted during the prior fiscal quarter, with particular "
        "attention paid to procedural anomalies. "
    )
    # Rough heuristic: ~4 characters per token. Repeating the sentence
    # target_tokens // 20 times overshoots (~8 chars per target token),
    # and the slice trims the body to target_tokens * 4 characters.
    sentences_needed = target_tokens // 20
    body = (filler * sentences_needed)[: target_tokens * 4]
    # Splice needles in at fixed positions
    out = body
    for pos, (label, code) in zip(NEEDLE_POSITIONS, needles):
        i = int(pos * len(out))
        out = out[:i] + f"\n\nNote: {label} contains the code {code}.\n\n" + out[i:]
    return out

def ask(haystack: str, label: str, num_ctx: int) -> tuple[str, float, float]:
    """Ask for one needle; return (answer, seconds to first token, total seconds)."""
    t0 = time.perf_counter()  # monotonic clock, safe for measuring intervals
    first_t = None
    chunks = []
    for r in ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer with only the 4-character code, nothing else."},
            {"role": "user", "content": haystack + f"\n\nWhat code is in {label}?"},
        ],
        stream=True,
        options={"num_ctx": num_ctx},
    ):
        delta = r.get("message", {}).get("content", "")
        if delta:
            if first_t is None:
                first_t = time.perf_counter()  # first non-empty content chunk
            chunks.append(delta)
    answer = "".join(chunks).strip()
    # TTFT is reported as 0.0 if the model produced no content at all.
    ttft = (first_t - t0) if first_t is not None else 0.0
    return answer, ttft, time.perf_counter() - t0

if __name__ == "__main__":
    needles = make_needles()
    for ctx in (5_000, 20_000, 60_000, 100_000):
        hay = build_haystack(ctx, needles)
        passed = 0
        for label, code in needles:
            ans, ttft, total = ask(hay, label, num_ctx=ctx + 4_000)
            # Lenient scoring: the exact code must appear somewhere in the answer.
            passed += code in ans
            print(f"  ctx={ctx:>6,}  {label}  expected={code}  got={ans!r}  ttft={ttft:.1f}s  total={total:.1f}s")
        print(f"ctx={ctx:>6,}  pass={passed}/{len(needles)}")

It writes plain text to stdout. If you want JSON-lines results to plot, redirect to a file and convert the ctx=… lines. The whole sweep takes ~30 minutes on an RTX 5050; longer on smaller GPUs.
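One way to do that conversion, as a minimal sketch: a regular expression written against the per-question print format in bench.py, turning each matching line into a JSON record. The helper names are hypothetical, and the code AB12 in the comment is a made-up example rather than a real needle.

```python
import ast
import json
import re

# Matches the per-question lines printed by bench.py, for example:
#   ctx= 5,000  box-1  expected=AB12  got='AB12'  ttft=4.0s  total=5.1s
DETAIL_LINE = re.compile(
    r"^\s*ctx=\s*(?P<ctx>[\d,]+)\s+(?P<label>\S+)\s+expected=(?P<expected>\S+)"
    r"\s+got=(?P<got>.*?)\s+ttft=(?P<ttft>[\d.]+)s\s+total=(?P<total>[\d.]+)s\s*$"
)

def parse_detail_line(line: str):
    """Return a dict for a per-question line, or None for any other line."""
    m = DETAIL_LINE.match(line)
    if m is None:
        return None
    try:
        # The script prints the answer with !r, so undo the repr() quoting.
        got = ast.literal_eval(m["got"])
    except (ValueError, SyntaxError):
        got = m["got"]
    return {
        "ctx": int(m["ctx"].replace(",", "")),
        "label": m["label"],
        "expected": m["expected"],
        "got": got,
        "passed": m["expected"] in str(got),  # same substring rule as the script
        "ttft_s": float(m["ttft"]),
        "total_s": float(m["total"]),
    }

def to_jsonl(lines):
    """Convert an iterable of stdout lines to a list of JSON strings."""
    records = (parse_detail_line(line) for line in lines)
    return [json.dumps(r) for r in records if r is not None]
```

Summary lines of the form ctx=… pass=… do not match the pattern and are skipped, since the same information can be recomputed from the per-question records.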

The seed is fixed (20260521) so the needle strings are deterministic. If your pass rate doesn't match mine at the same (model, ctx, seed), that's a real signal — likely Ollama version, quantization, or hardware-driver path.
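The script prints TTFT and total time but not a tokens-per-second figure. A minimal sketch of one way to derive it, assuming the final streamed chunk carries the eval_count, eval_duration, prompt_eval_count, and prompt_eval_duration fields described in the Ollama API documentation, with durations in nanoseconds. This is not necessarily how the table above was produced, and rates_from_final_chunk is a hypothetical helper that takes a plain dict so it can run without a server:

```python
def rates_from_final_chunk(final: dict) -> dict:
    """Compute prefill and generation speed from Ollama's timing fields.

    Assumes `final` is the last chunk of a streamed chat response and that
    durations are reported in nanoseconds. Missing or zero fields yield None.
    """
    def per_second(count, duration_ns):
        if not count or not duration_ns:
            return None
        return count / (duration_ns / 1e9)

    return {
        "prefill_tok_per_s": per_second(
            final.get("prompt_eval_count"), final.get("prompt_eval_duration")
        ),
        "generation_tok_per_s": per_second(
            final.get("eval_count"), final.get("eval_duration")
        ),
    }
```

Inside the streaming loop in ask, the last chunk received could be kept in a variable and passed to this function once the loop ends. Comparing prompt_eval_count against num_ctx would also be one way to check that the prompt fit in the context window.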

Things this rig deliberately doesn't measure

Quality of paraphrase. The needles are literal 4-character codes. I'm measuring "can the model find it?", not "can the model reason about it?". Those are different benchmarks.

VRAM consumption. Ollama owns the K/V cache and I'm not going to fight it for memory accounting. nvidia-smi says it sits around 7.4 GB at 100K context, but I haven't characterized the curve.

Cross-document attention. Each needle is asked in isolation. Multi-fact composition ("how does the figure on page 12 of paper A relate to section 3 of paper B?") is a different problem. I don't have a clean benchmark for it. I'm working on it.

When RAG is still the right call

I'm not going to pretend the no-RAG architecture is universally correct. Bluntly:

Library > 128K tokens total. Half a book. A whole textbook. A corpus of fifty papers. Then you need retrieval. Don't fight it.
Sub-second latency required. Customer-facing chatbot over a knowledge base. RAG is faster on the hot path — see the TTFT numbers above.
Incremental index updates. Adding one document without re-prefilling. A vector DB lets you do that. Long-context doesn't.
The question genuinely needs only a small slice of one document. "What's the customer's last order?" over a 50K-record SQL table. SQL is the right tool. Not RAG, not long-context, just SQL.
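The four cases above can be written down as a small decision function. A minimal sketch, with a hypothetical Workload record and choose_architecture helper; the 128K limit and the one-second latency cutoff come from the list, and the ordering of the checks is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Hypothetical description of a document-Q&A workload."""
    corpus_tokens: int                    # total tokens across all documents
    max_ttft_seconds: float               # latency the user will tolerate
    needs_incremental_updates: bool = False
    is_structured_lookup: bool = False    # e.g. "last order" over a SQL table

def choose_architecture(w: Workload, context_limit: int = 128_000) -> str:
    """Return 'sql', 'rag', or 'long-context' following the list above."""
    if w.is_structured_lookup:
        return "sql"           # a plain query beats both RAG and long-context
    if w.corpus_tokens > context_limit:
        return "rag"           # the library does not fit in the window
    if w.max_ttft_seconds < 1.0:
        return "rag"           # prefill latency rules out long-context
    if w.needs_incremental_updates:
        return "rag"           # avoid re-prefilling on every added document
    return "long-context"
```

For example, a 40K-token set of research papers with a 60-second latency budget falls through every check and lands on "long-context".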

For everything else in the "I have some documents, I want to ask them things" category (research, internal docs, technical reading), I think the no-RAG architecture is the right default in 2026. Open-weights models have become long-context-capable enough that retrieval is now a premature optimization for most of these workloads. The benchmark numbers above are the evidence; the latency zones are the framework for deciding.

The line that summarizes it

Traditional RAG: make the model see less, faster.
Long-context: make the model see everything, once.

Both are valid. The question is which trade-off your workload actually has. Mine had the second shape. The benchmark told me the cost. I shipped accordingly.

The honest comparison

Qwen 3.5 27B has ~190K effective context, and Llama 3.1 70B (if you can fit it) pairs a 128K window with far more parameters. On raw context size alone, Gemma 4 E4B isn't the winner.

What E4B is the winner at is the combination: 128K context + native vision + native audio + ~9.6 GB on disk, all in one model. That combination is what makes whole-document workloads tractable on a laptop. Qwen 27B doesn't fit in 8 GB of VRAM. Llama 3.1 70B doesn't either. If your hardware constraint is "consumer GPU", E4B is the only model I know of in this class with both 128K context and multimodality.

That's the framing I'd give someone choosing an open-weights model for a single-machine deployment in 2026.

Three places I'd take this benchmark next

Mixed-modality recall. Embed half the needles in text, half in rendered images. See if vision-encoded needles degrade differently from text-encoded ones. (This is the part most relevant to anyone building doc-Q&A.)
Cross-document needles. Two documents in context, the needle in paper A, the question phrased to require paper B's vocabulary. The actual "I have a library, I want to ask questions" workload.
Long-document Q&A with human evaluation. Pay five grad students to grade 100 questions about a single 25-page research paper. Real quality numbers, not synthetic ones.

If you run any of these, I'd genuinely like to read the results.

yashksaini-coder / DeepRead

Local document Q&A with Gemma 4 E4B's 128K context — no RAG, no cloud, answers cite the exact page they came from.

DeepRead

Local, long-context document Q&A with Gemma 4 E4B — load PDFs, ask anything, get page-cited answers. The model runs on your machine. Your documents never leave it.

DeepRead is the build submission for the dev.to Gemma 4 Challenge. It demonstrates that a high-quality document-intelligence experience can run entirely on consumer hardware — no cloud, no per-query cost, no telemetry — by leaning on Gemma 4 E4B's 128K context window instead of a RAG pipeline.

Highlights

One chat, two modes. Document Q&A is the default. Type /bench show or /bench run --ctx 5000 20000 --needles 3 in the same chat to render the context-window stress-test charts inline — no profile switching, no history loss.
Native multimodality. Pages go in as rendered images via Gemma 4's vision path — no OCR pipeline.
Live context budget. A right-side sidebar shows the working set against the 128K ceiling, color-coded by latency tier…


Top comments (4)

Vibe Developer • May 24

How is this no-RAG, no-cloud approach better than traditional RAG?

VoidPTR • May 24

Another great read. It's messy but effective that you executed this idea, and the project repo is really good. However, the stress-testing parameters seem simple. Have you tried extending them?

Rasmus Ros • May 24

I also expected recall to drop with longer context. Do you think the 4-character code benchmark is just too easy?
