Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered twenty needle-in-a-haystack questions (five needles at each of four context sizes) on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here's both, with the Python script to reproduce it on your own hardware.
I keep seeing "Gemma 4 E4B has a 128K context window" repeated as if it were a single property, like "the engine is 3.5 litres". It is not a single property. A context-window number answers at least three different questions: will the model accept this many tokens, will it remember what's in the middle of them, and how fast does the first answer token arrive? The answers diverge sharply once you leave the datacenter-GPU regime that most spec sheets assume and run on a laptop GPU.
This is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. The reproducible test rig is at the bottom.
The setup
- Hardware: RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H
- Software: Ollama 0.24.0, gemma4:e4b (Q4_K_M, ~9.6 GB on disk), Linux 7.x
- Test: needle-in-a-haystack. Five unique 4-character codes are embedded at fixed positions inside a long synthetic English document; the model has to recover each one in isolation by exact match.
The test is deliberately simple. I want to know whether the model can find a fact at a known position, not whether it can paraphrase it. Reasoning quality is a different benchmark and needs human evaluation, which I didn't have budget for.
I ran the sweep at 5K, 20K, 60K, and 100K target context sizes. I didn't push to the 128K spec because Ollama's num_ctx setting interacts with the K/V cache headroom in ways I didn't have time to characterize cleanly, and 100K is already roughly 78% of the spec.
The numbers
| Context | Pass rate (5/5) | Tokens/sec | Time to first token |
|---|---|---|---|
| 5K | 5/5 ✓ | 9.2 | 4 s |
| 20K | 5/5 ✓ | 8.6 | 15 s |
| 60K | 5/5 ✓ | 7.6 | 38 s |
| 100K | 5/5 ✓ | 6.8 | 72 s |
Three things stand out.
Recall stayed perfect. I expected E4B to wobble somewhere past 60K — that's the failure mode I see most reported for 4B-class models, the "middle of the context is fuzzy" problem. The needles at 25% and 75% are exactly where I'd expect drop-off. They held. I re-ran the sweep twice to be sure.
Generation throughput barely moved. 9.2 tok/s at 5K vs. 6.8 tok/s at 100K. That's a 26% drop across a 20x context increase. The K/V cache is the obvious culprit, but in practical terms: once the answer starts streaming, it streams at roughly the same speed.
Time to first token blew up. 4s at 5K, 72s at 100K. Almost linear in context size. This is the prefill phase — the model encoding everything you sent it before producing the first output token. On a laptop GPU, prefill is where the consumer-hardware tax lives.
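The arithmetic behind those last two observations can be checked directly from the table. This is a minimal sketch that only uses the four measured rows; the helper names are made up for illustration, and the "context size" is the target size, not a tokenizer count, so the implied prefill rates are approximate.

```python
# Measurements copied from the table above: context size -> (tok/s, TTFT seconds).
MEASURED = {
    5_000: (9.2, 4),
    20_000: (8.6, 15),
    60_000: (7.6, 38),
    100_000: (6.8, 72),
}


def throughput_drop(results):
    """Relative drop in generation speed from the smallest to the largest context."""
    sizes = sorted(results)
    first, last = results[sizes[0]][0], results[sizes[-1]][0]
    return (first - last) / first


def prefill_rates(results):
    """Implied prefill speed (prompt tokens per second of TTFT) at each size."""
    return {ctx: ctx / ttft for ctx, (_, ttft) in results.items()}


def fit_ttft_slope(results):
    """Least-squares slope through the origin: seconds of TTFT per 1K tokens."""
    num = sum((ctx / 1000) * ttft for ctx, (_, ttft) in results.items())
    den = sum((ctx / 1000) ** 2 for ctx in results)
    return num / den


if __name__ == "__main__":
    print(f"throughput drop: {throughput_drop(MEASURED):.0%}")
    for ctx, rate in prefill_rates(MEASURED).items():
        print(f"{ctx:>7,} tokens: ~{rate:,.0f} prompt tok/s")
    print(f"TTFT slope: ~{fit_ttft_slope(MEASURED):.2f} s per 1K tokens")
```

The implied prefill rate stays in the same ballpark at every size (roughly 1,250 to 1,600 prompt tokens per second), which is what "almost linear" looks like in numbers: about 0.7 seconds of waiting per 1K tokens of context.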
What this means if you're building on E4B
Let me write the practical zones the way I actually think about them, not the marketing version:
- Under 20K tokens: interactive. First token within ~15 seconds (4 s at 5K), full answer in ~25-30s. This feels like a real conversation. Most single-paper Q&A lives here.
- 20K to 60K tokens: research-assistant. 15-40 second TTFT. You're going to glance away from the screen. That's fine, the answer will be there when you look back. Multi-paper comparisons and longer contexts live here.
- 60K to 100K tokens: batch. You're queuing a job. A 40-70 second TTFT means you might as well make coffee. This is the zone for loading a whole codebase, a textbook chapter, or a quarter's worth of meeting notes.
- Above 100K: I didn't measure. The prefill cost was already breaching my "is this still interactive?" threshold and the use case I was solving for didn't need it.
If you're designing a UI on top of this model, surface these zones to the user. A progress bar or a tier label ("interactive / research / batch") tells someone what their next click will feel like before they ask. The 128K spec is honest; it just doesn't tell you when it'll start.
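One way to surface the zones is a small helper that maps a token count to a tier label and a rough TTFT estimate. This is a sketch, not part of the benchmark rig: the function names are hypothetical, the boundary handling (20K counts as research, 60K as batch) is my assumption, and the estimate is plain linear interpolation between the four points measured on this one laptop.

```python
# (context tokens, TTFT seconds) from the measurements in this post.
TTFT_POINTS = [(5_000, 4.0), (20_000, 15.0), (60_000, 38.0), (100_000, 72.0)]


def tier_for(tokens: int) -> str:
    """Map a prompt size to the tier label shown to the user."""
    if tokens < 20_000:
        return "interactive"
    if tokens < 60_000:
        return "research"
    if tokens <= 100_000:
        return "batch"
    return "unmeasured"  # nothing above 100K was benchmarked


def estimate_ttft(tokens: int):
    """Interpolate TTFT in seconds; returns None above the measured range."""
    if tokens > TTFT_POINTS[-1][0]:
        return None
    # Assumption: TTFT shrinks toward zero for an empty prompt.
    prev_x, prev_y = 0, 0.0
    for x, y in TTFT_POINTS:
        if tokens <= x:
            return prev_y + (y - prev_y) * (tokens - prev_x) / (x - prev_x)
        prev_x, prev_y = x, y
    return None


if __name__ == "__main__":
    for n in (3_000, 40_000, 90_000, 120_000):
        print(n, tier_for(n), estimate_ttft(n))
```

The estimates only hold for hardware close to the setup above; on a different GPU you would replace TTFT_POINTS with your own measurements.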
Reproduce it yourself
The whole rig is about 30 lines once you strip the CLI scaffolding. Save this as bench.py, install ollama (pip install ollama), then run it:
```python
import random
import time

import ollama

MODEL = "gemma4:e4b"
# Relative depths (fraction of the document) at which each needle is spliced in.
NEEDLE_POSITIONS = [0.05, 0.25, 0.50, 0.75, 0.95]


def make_needles(k=5, seed=20260521):
    """Return k (label, code) pairs; the fixed seed makes the codes reproducible."""
    rng = random.Random(seed)
    chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return [(f"box-{i+1}", "".join(rng.choices(chars, k=4))) for i in range(k)]


def build_haystack(target_tokens: int, needles):
    """Build filler text of roughly target_tokens tokens with the needles spliced in."""
    # One filler sentence: ~160 characters of English-ish prose.
    filler = (
        "The committee continued its review of the operational notes "
        "submitted during the prior fiscal quarter, with particular "
        "attention paid to procedural anomalies. "
    )
    # Over-generate, then cut to the target size at ~4 characters per token.
    # This is a character-count estimate, not a real tokenizer count.
    sentences_needed = target_tokens // 20
    body = (filler * sentences_needed)[: target_tokens * 4]
    # Splice needles in at fixed relative positions.
    out = body
    for pos, (label, code) in zip(NEEDLE_POSITIONS, needles):
        i = int(pos * len(out))
        out = out[:i] + f"\n\nNote: {label} contains the code {code}.\n\n" + out[i:]
    return out


def ask(haystack: str, label: str, num_ctx: int) -> tuple[str, float, float]:
    """Ask for one needle; return (answer, time to first token, total time) in seconds."""
    t0 = time.time()
    first_t = None
    chunks = []
    for r in ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer with only the 4-character code, nothing else."},
            {"role": "user", "content": haystack + f"\n\nWhat code is in {label}?"},
        ],
        stream=True,
        options={"num_ctx": num_ctx},
    ):
        delta = r.get("message", {}).get("content", "")
        if delta:
            # Record the arrival time of the first non-empty chunk only.
            first_t = first_t or time.time()
            chunks.append(delta)
    answer = "".join(chunks).strip()
    return answer, (first_t - t0) if first_t else 0.0, time.time() - t0


if __name__ == "__main__":
    needles = make_needles()
    for ctx in (5_000, 20_000, 60_000, 100_000):
        hay = build_haystack(ctx, needles)
        passed = 0
        for label, code in needles:
            # 4K of headroom over the estimated haystack size for the question and answer.
            ans, ttft, total = ask(hay, label, num_ctx=ctx + 4_000)
            # Substring check: the code must appear verbatim somewhere in the answer.
            passed += int(code in ans)
            print(f"  ctx={ctx:>6,} {label} expected={code} got={ans!r} ttft={ttft:.1f}s total={total:.1f}s")
        print(f"ctx={ctx:>6,} pass={passed}/{len(needles)}")
```
It writes to stdout. If you want JSON-lines results to plot, redirect to a file and parse the ctx=… pass=… lines. The whole sweep takes ~30 minutes on an RTX 5050; longer on smaller GPUs.
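For the redirect-and-parse route, here is a minimal sketch of a parser that turns the script's stdout into JSON lines. The regular expressions assume the exact print formats used in the rig above; the record keys (kind, hit, ttft_s, and so on) are names I made up for illustration.

```python
import ast
import json
import re

# Matches the per-context summary line: "ctx= 5,000 pass=5/5"
SUMMARY = re.compile(r"^\s*ctx=\s*([\d,]+)\s+pass=(\d+)/(\d+)\s*$")
# Matches the per-question line: "ctx=... box-1 expected=... got='...' ttft=...s total=...s"
DETAIL = re.compile(
    r"^\s*ctx=\s*([\d,]+)\s+(\S+)\s+expected=(\S+)\s+got=(.*)\s+"
    r"ttft=([\d.]+)s\s+total=([\d.]+)s\s*$"
)


def parse_line(line):
    """Return a dict for a recognised line, or None for anything else."""
    m = SUMMARY.match(line)
    if m:
        return {"kind": "summary", "ctx": int(m[1].replace(",", "")),
                "passed": int(m[2]), "asked": int(m[3])}
    m = DETAIL.match(line)
    if m:
        try:
            got = ast.literal_eval(m[4])  # undo the !r formatting
        except (ValueError, SyntaxError):
            got = m[4]  # keep the raw text if it is not a valid literal
        return {"kind": "detail", "ctx": int(m[1].replace(",", "")),
                "label": m[2], "expected": m[3], "got": got,
                "ttft_s": float(m[5]), "total_s": float(m[6]),
                "hit": m[3] in str(got)}
    return None


def to_jsonl(text):
    """Convert captured stdout into one JSON object per recognised line."""
    records = (parse_line(line) for line in text.splitlines())
    return "\n".join(json.dumps(r) for r in records if r is not None)


if __name__ == "__main__":
    import sys
    print(to_jsonl(sys.stdin.read()))
```

Assuming you save this as parse_bench.py (a hypothetical filename), `python bench.py > run.txt` followed by `python parse_bench.py < run.txt > run.jsonl` gives one record per question plus one summary record per context size.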
The seed is fixed (20260521) so the needle strings are deterministic. If your pass rate doesn't match mine at the same (model, ctx, seed), that's a real signal — likely Ollama version, quantization, or hardware-driver path.
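The script above prints TTFT and total time but not tokens per second, and it sizes the haystack by character count rather than by real tokens. One way to check both, sketched here under assumptions: Ollama's API reports prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration (durations in nanoseconds) on the final streamed chunk, so a helper can turn that chunk into rates. The function name and the returned keys are hypothetical, and I haven't confirmed this is how the table's tok/s column was produced.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_metrics(final_chunk, target_tokens):
    """Turn the final streamed chunk's counters into rates.

    final_chunk is any mapping with Ollama's metric fields; missing fields
    are treated as zero so a partial response does not raise.
    """
    prompt_tokens = final_chunk.get("prompt_eval_count", 0)
    prompt_ns = final_chunk.get("prompt_eval_duration", 0)
    eval_tokens = final_chunk.get("eval_count", 0)
    eval_ns = final_chunk.get("eval_duration", 0)
    return {
        # How many tokens the model actually saw in the prompt.
        "prompt_tokens": prompt_tokens,
        # Actual prompt size relative to the character-based target.
        "target_ratio": prompt_tokens / target_tokens if target_tokens else None,
        # Prefill speed: prompt tokens per second.
        "prefill_tok_s": prompt_tokens / (prompt_ns / NS_PER_S) if prompt_ns else None,
        # Decode speed: generated tokens per second.
        "decode_tok_s": eval_tokens / (eval_ns / NS_PER_S) if eval_ns else None,
    }


if __name__ == "__main__":
    # Made-up counters, only to show the shape of the output.
    fake = {
        "prompt_eval_count": 5_200,
        "prompt_eval_duration": 4 * NS_PER_S,
        "eval_count": 46,
        "eval_duration": 5 * NS_PER_S,
    }
    print(summarize_metrics(fake, target_tokens=5_000))
```

In the streaming loop you would keep a reference to the last chunk and pass it in after the loop ends. Two caveats: if target_ratio is well above 1 and close to num_ctx / target, the prompt may have been truncated to fit the window, and because all five questions share the same haystack prefix, prompt caching may make the counters for the second and later questions look smaller than the first.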
Things this rig deliberately doesn't measure
Quality of paraphrase. The needles are literal 4-character codes. I'm measuring "can the model find it?", not "can the model reason about it?". Those are different benchmarks.
VRAM consumption. Ollama owns the K/V cache and I'm not going to fight it for memory accounting. nvidia-smi says it sits around 7.4 GB at 100K context, but I haven't characterized the curve.
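If you do want to characterize that curve, a sketch of one approach is to poll nvidia-smi while the sweep runs. The `--query-gpu=memory.used` and `--format=csv,noheader,nounits` flags are standard nvidia-smi options; the function names are hypothetical, and this reports whole-GPU usage, so anything else on the GPU (desktop compositor, browser) is included.

```python
import subprocess


def parse_memory_used(output):
    """Parse nvidia-smi's one-number-per-GPU output into a list of MiB values."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blanks and placeholders such as "[N/A]"
            values.append(int(line))
    return values


def sample_vram_mib():
    """Return the memory currently used on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)
```

Calling sample_vram_mib() right after each answer finishes, while the model is still loaded, would give one point per context size.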
Cross-document attention. Each needle is asked in isolation. Multi-fact composition ("how does the figure on page 12 of paper A relate to section 3 of paper B?") is a different problem. I don't have a clean benchmark for it. I'm working on it.
The honest comparison
Qwen 3.5 27B has ~190K effective context on similar hardware. Llama 3.1 70B (if you can fit it) goes further. On raw context size alone, Gemma 4 E4B isn't the winner.
Where E4B wins is the combination: 128K context + native vision + native audio + ~9.6 GB on disk, all in one model. That combination is what makes whole-document workloads tractable on a laptop. Qwen 27B doesn't fit in 8 GB of VRAM. Llama 3.1 70B doesn't either. If your hardware constraint is "consumer GPU", E4B is the only model in this class with 128K context and multimodality.
That's the framing I'd give someone choosing an open-weights model for a single-machine deployment in 2026.
Three places I'd take this benchmark next
- Mixed-modality recall. Embed half the needles in text, half in rendered images. See if vision-encoded needles degrade differently from text-encoded ones. (This is the part most relevant to anyone building doc-Q&A.)
- Cross-document needles. Two documents in context, the needle in paper A, the question phrased to require paper B's vocabulary. The actual "I have a library, I want to ask questions" workload.
- Long-document Q&A with human evaluation. Pay five grad students to grade 100 questions about a single 25-page research paper. Real quality numbers, not synthetic ones.
If you run any of these, I'd genuinely like to read the results.