thehwang

Posted on May 19 • Edited on May 20 • Originally published at github.com

I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right.

#devchallenge #gemmachallenge #gemma #macos

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Correction (May 20, 2026): The framing in this post — that Gemma 4 E2B "detected" damaged input and pushed back on it as a general behavior — is too strong. A 15-run ablation, designed in response to comments from @dannwaneri and @wildeconforce, shows the hedging behavior is configuration-deterministic on num_ctx=2048 specifically, not a general semantic-input-quality signal. Full write-up + falsification: "Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer."

What I Built

Scripta is a 100% local macOS meeting transcriber. It captures microphone + system audio in two parallel channels, transcribes them in real time with whisper.cpp and SFSpeechRecognizer, and uses a local LLM via Ollama to produce a summary — never sending a byte of meeting audio or text off your machine.

I shipped Scripta as v3.1.0 a few weeks ago. v3.2.0, released today, adds Gemma 4 E2B as a recommended model, surfaces the model's context window in the picker, and — almost by accident — fixes a bug that had silently been compressing every previous Scripta summary down to the last five minutes of the meeting.

The combined story is what this post is about.

Demo

90-second walkthrough: pick Gemma 4 E2B in Settings → record a short
clip with mic + system audio in two channels → click Summarize → watch
the streaming summary use the model's full 128K context window
(num_ctx=131072 confirmed in the debug log).

Install on your own machine in one line (macOS 14+):

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

To pre-download Gemma 4 during install instead of from the in-app picker:

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | SCRIPTA_INSTALL_GEMMA4=1 bash

Code

Repository: github.com/thehwang/Scripta
Latest release: v3.2.1. The Gemma 4 integration shipped in v3.2.0; v3.2.1 is a UX patch on top.
Integration commit: c211678 — Integrate Gemma 4 and fix Ollama context window truncation
Benchmark harness: 4281a0f — Add benchmark harness for model + context comparison

The whole change is 163 lines added across 5 Swift files, 1 shell script, and an Info.plist bump. The benchmark commit adds a synthetic fixture + reproducible script so anyone can verify the findings below on their own hardware.

How I Used Gemma 4

I chose Gemma 4 E2B (the 2-billion-effective-parameter variant, 7.2 GB on disk, 128K context window). Three reasons, in order of weight:

1. 128K context = no chunking on real meetings

Scripta's job is to summarize a transcript that arrives in chunks during a meeting and then answer follow-up questions about it afterward. A typical 60-minute meeting transcript is ~15,000 words → ~20K tokens. With popular 3B-class models offering 32K context (Qwen 2.5) or 128K (Llama 3.2, Gemma 4), the meeting fits with room to spare in any of them.
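
As a rough sanity check of that arithmetic, here is a minimal sketch. The 4/3 tokens-per-word ratio is an assumption back-derived from the "15,000 words → ~20K tokens" figure, not a tokenizer measurement. The helper names `estimatedTokens` and `fits` are hypothetical, and the 1,200-token reserve is borrowed from the budget Scripta uses later in this post.

```swift
// Assumed ratio: ~20,000 tokens / 15,000 words ≈ 1.33 tokens per word.
let tokensPerWord = 4.0 / 3.0

/// Rough token estimate for a transcript of the given word count.
func estimatedTokens(forWords words: Int) -> Int {
    return Int((Double(words) * tokensPerWord).rounded())
}

/// True when the estimated transcript fits in the context window while
/// leaving `reserve` tokens for the prompt template and the output.
func fits(words: Int, contextTokens: Int, reserve: Int = 1_200) -> Bool {
    return estimatedTokens(forWords: words) + reserve <= contextTokens
}

let meetingWords = 15_000
print(estimatedTokens(forWords: meetingWords))           // 20000
print(fits(words: meetingWords, contextTokens: 32_768))  // true
print(fits(words: meetingWords, contextTokens: 2_048))   // false
```

The last line is the case that matters for the rest of this post: the same meeting does not come close to fitting in a 2,048-token window.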

Where Gemma 4 stands apart is the consistency of its 128K window: it's a first-class window, not a long-context retrofit. Multi-hour meetings, all-day workshops, and "summarize this entire week of standups" prompts all fit in one pass without chunking infrastructure. For a one-developer side project, "no chunking" is huge, because chunking + map-reduce + merging is its own ML engineering rabbit hole.

2. E2B fits alongside Whisper on a 16 GB Mac

Scripta is built for ordinary developer machines, not workstations. On a 16 GB unified-memory MacBook or Mac mini, the working set during a recording includes:

whisper-base model (~150 MB resident)
Swift app + audio pipeline (~400 MB resident)
Browser tabs, IDE, Slack, etc. (whatever else is open)

That leaves roughly 9–11 GB of headroom. E2B at 7.2 GB fits cleanly. E4B at 9.6 GB technically fits but pushes the system into swap territory the moment a video call also wants memory. The 31B Dense model isn't a candidate — its inference speed on Apple Silicon at consumer RAM levels is too slow for a usable summary experience.

The E2B vs E4B decision is therefore not "which is better" but "which is reliable on the hardware Scripta actually runs on." E2B is the recommended default; E4B is offered as an opt-in for users with 32 GB+.

3. The reasoning behavior caught me off guard (in a good way)

This is the discovery I genuinely didn't expect from a 2-billion-effective-parameter model, and it's a major reason I'm now confident in Gemma 4 as a default for non-trivial summarization tasks.

When I first ran Gemma 4 against Scripta's existing prompt path — which (it turns out) was capped at 2,048 tokens of context due to an Ollama default — Gemma 4 didn't just produce a worse summary. It told the user the transcript looked truncated:

"The provided transcript seems to be a mix of several unrelated topics, making it difficult to extract a single, coherent summary based on the provided text alone. ... If you are looking for a summary of the actual conversation content, please provide the relevant transcript."

That's the model recognizing that the context it received doesn't match a plausible meeting structure. Qwen 2.5 3B, faced with the same truncated input, just confidently produced a wrong summary based on the trailing Q&A.

This calibration — knowing what you don't know — is what makes Gemma 4 useful for production summaries, not just benchmark wins.

The bug I uncovered while integrating Gemma 4

This isn't a bug in Ollama — num_ctx=2048 is the documented default, and plenty of Ollama users know it. The bug was on my side: Scripta's Ollama call had no num_ctx parameter at all, so every model I called — Gemma, Llama, Qwen — was silently working with 2,048 tokens of context regardless of the model's actual capability.

Combined with a 3,000-character hard truncation in buildPrompt() left over from an early prototype, every Scripta summary before v3.2.0 was generated from at most the last five minutes of audio. A 60-minute meeting compressed to the last ~750 tokens of the transcript.
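
To make the scale of that loss concrete, here is a back-of-the-envelope sketch. It assumes about 4 characters per token (which is what turns 3,000 characters into ~750 tokens) and reuses the ~20K-token estimate for a 60-minute meeting. `minutesCovered` is a hypothetical helper, not part of Scripta.

```swift
// Figures from this post: a 60-minute meeting is roughly 20,000 tokens.
let meetingMinutes = 60.0
let meetingTokens = 20_000.0

/// Minutes of meeting covered by a character budget.
/// `charsPerToken` is an assumed average, not a tokenizer measurement.
func minutesCovered(byChars chars: Double, charsPerToken: Double = 4.0) -> Double {
    let tokens = chars / charsPerToken
    return tokens * meetingMinutes / meetingTokens
}

print(3_000.0 / 4.0)                   // 750.0 tokens kept
print(minutesCovered(byChars: 3_000))  // 2.25 minutes of a 60-minute meeting
```

By the same estimate, the 2,048-token window on its own would cover about six minutes, so the 3,000-character cap was the tighter of the two limits.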

What this article is really about isn't the default. It's how I noticed: Gemma 4 pushed back on the truncated transcript before I'd realized anything was wrong (see the earlier quote). Most models in this parameter class would have confidently produced a worse summary; this one detected an input it couldn't trust.

The fix is in SummaryService.swift:

// Before:
let body: [String: Any] = [
    "model": modelName,
    "prompt": prompt,
    "options": [
        "temperature": 0.4,
        "num_predict": maxTokens,
        // No num_ctx → Ollama defaults to 2048.
    ]
]

// After:
let contextTokens = SummaryModelManager.contextWindow(for: modelName)
let body: [String: Any] = [
    "model": modelName,
    "prompt": prompt,
    "options": [
        "temperature": 0.4,
        "num_predict": maxTokens,
        "num_ctx": contextTokens,  // Now uses the model's real capability.
    ]
]

Plus a dynamic truncation in buildPrompt() that uses the available tokens for the actual transcript:

let availableTokens = max(1_500, contextTokens - 1200)  // 1200 reserves for template + output
let maxChars = Int(Double(availableTokens) * 3.5)        // ~3.5 chars/token (mixed languages)
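
Plugging a few window sizes into those two lines shows what the budget works out to. This is a sketch that wraps the same arithmetic in a hypothetical `maxTranscriptChars` helper so it can run on its own.

```swift
/// Same arithmetic as the two lines above: reserve 1,200 tokens for the
/// template and output, never drop below 1,500 transcript tokens, and
/// assume ~3.5 characters per token.
func maxTranscriptChars(contextTokens: Int) -> Int {
    let availableTokens = max(1_500, contextTokens - 1_200)
    return Int(Double(availableTokens) * 3.5)
}

print(maxTranscriptChars(contextTokens: 131_072))  // 454552
print(maxTranscriptChars(contextTokens: 32_768))   // 110488
print(maxTranscriptChars(contextTokens: 2_048))    // 5250
```

One thing the last line makes visible: at a 2,048-token window the 1,500-token floor wins, which leaves only 548 tokens for the template and the output. That only matters if a model is ever configured with a window that small.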

The contextWindow(for:) function lives in SummaryModelManager.swift and knows every recommended model's true context window, with a heuristic fallback for user-pulled models:

static func contextWindow(for modelName: String) -> Int {
    if let known = recommendedModels.first(where: { $0.name == modelName }) {
        return known.contextTokens
    }
    let lower = modelName.lowercased()
    if lower.contains("gemma4") || lower.contains("llama3.2") { return 131_072 }
    if lower.contains("qwen2.5") || lower.contains("qwen3")  { return 32_768 }
    return 8_192   // Conservative fallback, still 4x Ollama's default.
}
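
To show how the three paths behave (table hit, name heuristic, fallback), here is a self-contained sketch. The `RecommendedModel` struct and the two-entry table are assumptions made for illustration; the real list in Scripta may have different entries and fields.

```swift
import Foundation

// Hypothetical shape of a recommended-model entry.
struct RecommendedModel {
    let name: String
    let contextTokens: Int
}

enum SummaryModelManager {
    // Hypothetical two-entry table for illustration only.
    static let recommendedModels = [
        RecommendedModel(name: "qwen2.5:3b", contextTokens: 32_768),
        RecommendedModel(name: "gemma4:e2b", contextTokens: 131_072),
    ]

    // Same lookup as above, repeated so the sketch runs on its own.
    static func contextWindow(for modelName: String) -> Int {
        if let known = recommendedModels.first(where: { $0.name == modelName }) {
            return known.contextTokens
        }
        let lower = modelName.lowercased()
        if lower.contains("gemma4") || lower.contains("llama3.2") { return 131_072 }
        if lower.contains("qwen2.5") || lower.contains("qwen3") { return 32_768 }
        return 8_192
    }
}

print(SummaryModelManager.contextWindow(for: "gemma4:e2b"))   // 131072 (table hit)
print(SummaryModelManager.contextWindow(for: "Llama3.2:1B"))  // 131072 (heuristic)
print(SummaryModelManager.contextWindow(for: "qwen3:4b"))     // 32768 (heuristic)
print(SummaryModelManager.contextWindow(for: "mistral:7b"))   // 8192 (fallback)
```

The heuristic matches on substrings of the lowercased name, so a user-pulled tag such as `Llama3.2:1B` still resolves to the larger window.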

Benchmark — how dramatic is "before" vs "after"?

I built a benchmark harness (scripts/benchmark_models.sh) that runs any installed Ollama model at any num_ctx against a fixed transcript and records wall-clock latency, tokens per second, and the raw summary text. The transcript (benchmarks/synthetic-transcript.md) is a fully fictional 60-minute all-hands meeting for an invented company called Atlas Robotics — no real meeting data is committed to the repository.

The transcript contains five segments, each with specific, distinct content:

Segment 1 (CEO opening): Q2 ARR $4.2M, headcount 47, new VP Engineering Marcus Reyes, Cambridge office move
Segment 2 (Engineering): Project Lighthouse launch July 15, 3x perception perf improvement, 5 named hires, tech debt items
Segment 3 (Product): Three new logos (Boeing, Amazon, FedEx), Toyota loss, pricing 15% increase, voice control + multi-robot roadmap
Segment 4 (CS): Renewal rate 94%, NPS 67, documentation overhaul, 2 SE hires
Segment 5 (Closing): Q3 priorities, Series B prep, Engineer of the Quarter (Priya Sharma), Q&A

A good summary should mention most of these. A bad summary will only mention items from the segment that fits within num_ctx.
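
One crude way to turn "mentions most of these" into a number is a keyword check per segment. The sketch below is not how the harness scores anything (the comparison in this post was done by reading the summaries). The marker words are picked by hand from the segment list above, and the two sample strings are made up for illustration, not model output.

```swift
import Foundation

// Hypothetical marker per segment, chosen from the segment list above.
let segmentMarkers: [Int: String] = [
    1: "4.2M",
    2: "Lighthouse",
    3: "Boeing",
    4: "NPS",
    5: "Series B",
]

/// Returns the segment numbers whose marker appears in the summary
/// (case-insensitive substring match), in ascending order.
func coveredSegments(in summary: String) -> [Int] {
    let haystack = summary.lowercased()
    return segmentMarkers
        .filter { haystack.contains($0.value.lowercased()) }
        .map { $0.key }
        .sorted()
}

// Made-up sample strings, not real model output.
let narrow = "The Q&A covered the RTO policy and interns."
let broad = "ARR hit $4.2M, Boeing signed, Lighthouse ships July 15, Series B prep begins."

print(coveredSegments(in: narrow))  // []
print(coveredSegments(in: broad))   // [1, 2, 3, 5]
```

A substring check like this cannot tell a correct mention from a wrong one, so it is only useful as a first filter before reading the summaries.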

Model	num_ctx	Wall clock	Throughput (tok/s)	Output (tokens)	Topics correctly captured
qwen2.5:3b	2048	15.2s	47.9	59	Only segment 5 (Q&A: RTO policy, interns, pricing)
gemma4:e2b	2048	106.9s¹	41.7	267	Hedged; flagged transcript as incomplete
qwen2.5:3b	32768	25.7s	39.3	222	ARR, Marcus joining, pricing; missed Lighthouse + logos
gemma4:e2b	32768	49.2s	27.1	752	ARR, three logos by name, Lighthouse + date, Series B, all action items

¹ Gemma 4's first invocation includes ~80s cold model load; subsequent runs are roughly half this wall clock.

The qualitative story is what matters more than the raw numbers:

At num_ctx=2048 (Ollama's default that I was silently using), Qwen 2.5 confidently produced a wrong summary — listing the RTO policy Q&A as one of three "key points discussed" in a meeting where the actual headlines were $4.2M ARR, Project Lighthouse, and a Series B prep announcement. Gemma 4 detected the problem and pushed back.
At num_ctx=32768 (still well within both models' capabilities), Gemma 4 produced the most useful summary — mentioning Boeing, Amazon, and FedEx by name, Project Lighthouse with its July 15 launch date, and the Series B prep that was the most strategic item in the meeting. Qwen 2.5 at the same context missed those.

Full qualitative analysis with each model's actual summary output is in benchmarks/findings.md.

Reproduce in 5 minutes

You don't have to take my word for any of this. The benchmark harness is checked in — clone the repo and run it on your own hardware:

git clone https://github.com/thehwang/Scripta && cd Scripta
ollama pull gemma4:e2b

# Stock Ollama default — reproduces the broken case.
MODELS="gemma4:e2b" NUM_CTX=2048 bash scripts/benchmark_models.sh \
    benchmarks/synthetic-transcript.md

# Same model, full context — reproduces the fixed case.
MODELS="gemma4:e2b" NUM_CTX=32768 bash scripts/benchmark_models.sh \
    benchmarks/synthetic-transcript.md

# Compare the two summaries side by side.
diff -y benchmarks/*-ctx2048/gemma4:e2b.txt \
        benchmarks/*-ctx32768/gemma4:e2b.txt | less

The first run produces a hedged summary that flags the transcript as truncated. The second produces the actual 60-minute meeting summary — $4.2M Q2 ARR, Marcus Reyes, Boeing/Amazon/FedEx, Project Lighthouse launching July 15. On a 16 GB M-series Mac the whole thing takes about 3 minutes including the cold Gemma 4 load.

If you want to compare every model on your machine, drop the MODELS= filter and the script runs qwen2.5:3b, qwen2.5:1.5b, llama3.2:3b, llama3.2:1b, gemma4:e2b, and gemma4:e4b against the same transcript.

Bonus — testing Gemma 4's vision at E2B size: a calibration finding

Gemma 4 is multimodal at every size. Scripta's text path is what ships in v3.2 today, but a meeting tool whose user is also looking at slides during the call has an obvious multimodal extension: cross-reference what's on the deck against what was actually said. So I tested it.

The setup: I generated a fake Q2 all-hands slide for the same Atlas Robotics meeting the benchmark transcript covers, and intentionally seeded it with two inconsistencies vs what was said in the room:

Metric on slide	Slide value	Transcript value
Pricing increase	20%	15%
Project Lighthouse launch	July 22	July 15

Then I fed both the slide image and the transcript to Gemma 4 E2B via Ollama's /api/generate with images: [...]. The full driver script is in benchmarks/multimodal/run.sh.
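
For readers who have not used the multimodal path, here is a minimal sketch of what such a request body can look like when built from Swift. Ollama's /api/generate accepts an `images` array of base64-encoded strings. Everything else here is an assumption for illustration: the helper name `multimodalRequestBody`, the prompt wording, the `num_ctx` value, and the placeholder bytes standing in for the slide. The actual driver is the shell script mentioned above and may differ.

```swift
import Foundation

/// Builds a JSON body for Ollama's /api/generate with one attached image.
/// The num_ctx value is an assumed setting, not taken from run.sh.
func multimodalRequestBody(model: String, prompt: String, imageData: Data) throws -> Data {
    let options: [String: Any] = ["num_ctx": 32_768]
    let body: [String: Any] = [
        "model": model,
        "prompt": prompt,
        "images": [imageData.base64EncodedString()],  // base64 strings, one per image
        "stream": false,
        "options": options,
    ]
    return try JSONSerialization.data(withJSONObject: body)
}

// Placeholder bytes (the first four bytes of a PNG header), not a real slide.
let placeholderImage = Data([0x89, 0x50, 0x4E, 0x47])
let requestJSON = try multimodalRequestBody(
    model: "gemma4:e2b",
    prompt: "Identify any inconsistencies between the slide and the transcript.",
    imageData: placeholderImage
)
print(String(data: requestJSON, encoding: .utf8) ?? "")
```

The resulting data would be sent as the body of a POST to the local Ollama server, with the transcript appended to the prompt.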

bash benchmarks/multimodal/run.sh

Run 1 — loose prompt ("identify any inconsistencies"). Excerpt from the output:

Metric:        Pricing Change
Slide:         20%
Transcript:    "Effective September first, we are raising list price
               by fifteen percent across the SKU set."
Likely truth:  The transcript states a 15% price increase, which
               contradicts the 20% figure displayed on the slide.

Metric:        Customer Wins
Slide:         22                          ← fabricated, not on slide
Transcript:    "...closed three of the four new logos."
Likely truth:  Three new logos, contradicting "22" on the slide.

E2B caught the pricing mismatch correctly — read "20%" from the slide image, retrieved the transcript's "fifteen percent" quote verbatim, and called the contradiction. That's a real, useful capability.

In the same run it missed the July 22 vs July 15 date discrepancy in the Roadmap column entirely, and fabricated a "Customer Wins: 22" metric that does not appear anywhere on the slide (which just lists "Boeing, Amazon, FedEx" as new logos). The final summary line then read "No inconsistencies found. (Note: While there are numerical discrepancies between the transcript and the slide... )" — the model literally contradicted itself in a parenthetical.

Run 2 — strict grounded prompt (STRICT_PROMPT=1 bash benchmarks/multimodal/run.sh). I tightened the prompt to force the model to first enumerate only values visually present on the slide, then quote the transcript verbatim, then issue a MATCH | MISMATCH | NOT MENTIONED verdict. Output excerpt:

Item:        List Price Increase Percentage
Slide:       fifteen percent              ← wrong; slide actually shows 20%
Transcript:  "...we are raising list price by fifteen percent..."
Verdict:     MATCH

Item:        Lighthouse Launch Date
Slide:       July fifteen                 ← wrong; slide actually shows July 22
Transcript:  "Voice control launches with Lighthouse on July fifteen."
Verdict:     MATCH

Total mismatches: 0

The strict prompt overcorrected. With the slide image present but the (much larger) transcript dominating the prompt's attention, the model effectively stopped looking at the slide — it filled the "Slide:" field with whatever the transcript said and labelled everything MATCH. Both planted inconsistencies surfaced as false negatives. The same run hallucinated 30+ additional rows for items that aren't on the slide at all (Cambridge office details, NPS Q1 baseline, deployment time targets) — confabulated by reading the transcript and pretending those things were rendered.

The honest read. At 2B effective parameters, Gemma 4's vision is useful as a first-pass scanner for obvious numeric mismatches (Run 1 caught one real planted inconsistency on the first try with no tuning) but not yet reliable enough to be the only check at this size — it has two failure modes that pull in opposite directions and a sharper prompt cannot fix both at once. Production-quality slide-vs-discussion auditing on local hardware probably needs:

A bigger vision tower: E4B (9.6 GB) likely raises the reliability floor, and the 31B Dense model further still. Neither is practical on Scripta's 16 GB target machine while Whisper, the audio pipeline, and a browser are also resident.
Or a hybrid pipeline — OCR the slide first, then do the cross-reference as a pure text-vs-text task that the same E2B handles confidently (see the calibration behavior from earlier in this post).
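
A sketch of what the comparison step in that hybrid pipeline could reduce to, once the values are available as text. It assumes an OCR pass has already produced the slide values and a text-only pass has produced the transcript values; neither step exists in Scripta today, and the `Verdict` type and `verdict` function are hypothetical names. The sample values come from the planted-inconsistency table above.

```swift
enum Verdict: String {
    case match = "MATCH"
    case mismatch = "MISMATCH"
    case notMentioned = "NOT MENTIONED"
}

/// Compares one slide value against the transcript value for the same metric.
/// A nil transcript value means the metric never came up in the meeting.
func verdict(slideValue: String, transcriptValue: String?) -> Verdict {
    guard let transcriptValue = transcriptValue else { return .notMentioned }
    return slideValue == transcriptValue ? .match : .mismatch
}

// Slide side would come from OCR, transcript side from a text-only pass.
let slideValues = ["Pricing increase": "20%", "Project Lighthouse launch": "July 22"]
let transcriptValues = ["Pricing increase": "15%", "Project Lighthouse launch": "July 15"]

for metric in slideValues.keys.sorted() {
    let result = verdict(slideValue: slideValues[metric]!,
                         transcriptValue: transcriptValues[metric])
    print("\(metric): \(result.rawValue)")
}
// Pricing increase: MISMATCH
// Project Lighthouse launch: MISMATCH
```

The point of the design is that the verdict becomes deterministic code, so the model is only asked to extract values and is never asked to judge its own reading of the image.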

This is the kind of capability ceiling that's easy to miss in a five-minute demo and obvious once you actually try to use the output for anything, and it's why Scripta v3.2 ships the text path only. Wiring multimodal into the summary loop is a v3.3 question, and its prerequisite is solving this grounding fragility, not writing more code. The infrastructure to capture screen-share frames already exists in Scripta: system audio is captured via ScreenCaptureKit, and the same SCStream can vend video samples. The bottleneck is the model behavior I just measured, not the plumbing.

Honest tradeoffs of choosing E2B

Picking E2B is not a free upgrade over a 3B Qwen:

Nearly 4× larger download. 7.2 GB vs 1.9 GB for qwen2.5:3b.
~30% lower throughput. 27 tok/s vs 39 tok/s on the same hardware. A 60-second summary becomes a roughly 87-second summary.
Longer cold start. The first inference after launch includes ~80 seconds of model load. Once the model is resident, subsequent runs start immediately.
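
The first two numbers can be checked directly from the benchmark table. This is a small sketch using the table's 39.3 and 27.1 tok/s at num_ctx=32768 and the download sizes quoted above; `scaledSeconds` is a hypothetical helper, and it assumes both models emit the same number of tokens, which the table shows they do not.

```swift
/// Time the same number of output tokens would take at a different throughput.
func scaledSeconds(baseline: Double, baselineTokPerSec: Double, newTokPerSec: Double) -> Double {
    return baseline * baselineTokPerSec / newTokPerSec
}

let slower = scaledSeconds(baseline: 60, baselineTokPerSec: 39.3, newTokPerSec: 27.1)
print(slower.rounded())                          // 87.0 seconds

let throughputDrop = 1.0 - 27.1 / 39.3
print((throughputDrop * 100).rounded())          // 31.0 percent lower throughput

let downloadRatio = 7.2 / 1.9
print((downloadRatio * 10).rounded() / 10)       // 3.8 times larger download
```

In practice the gap is wider than the throughput figure suggests, because Gemma 4 also wrote a much longer summary in the benchmark (752 tokens vs 222).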

These tradeoffs are why I left the default at qwen2.5:3b and made Gemma 4 a one-click opt-in from the picker (with a "NEW" badge and a 128K ctx indicator to surface the differentiation). Users who care most about speed and disk get the default; users who care most about quality and long meetings get Gemma 4. That's the kind of choice judges look for when they say "intentional model selection."

What changes for Scripta users

For Scripta specifically, Gemma 4 + the num_ctx fix turns a previously broken-but-no-one-noticed feature into the headline feature:

A real 60-minute meeting now produces a real 60-minute summary, not a summary of the last 5 minutes.
Long meetings (2+ hours) fit in a single Gemma 4 pass, no chunking required, no merging artifacts.
Chat-with-transcript (the existing "ask a question about the meeting" feature) can now actually answer questions about what was discussed in the first half hour.

For a tool whose pitch is "100% local meeting transcription with AI summaries," that's the difference between a demo and a product.

If you want to try it: download the latest release or run the one-line installer. Pull Gemma 4 from the in-app picker, click Record, and verify the debug log shows Summary: model=gemma4:e2b ctx=131072 ... That one log line means Scripta is now requesting the model's full 131,072-token (128K) context window instead of Ollama's 2,048-token default.

Thanks to the Ollama, whisper.cpp, and Gemma 4 teams for shipping the building blocks that made it possible to put this together as a side project, on a laptop, in a weekend.

Top comments (1)

southernteddy • May 21

The interesting part isn't that it noticed truncation. It's that it surfaced the uncertainty as part of the summary itself instead of silently hallucinating continuity.