DEV Community: thehwang

Three TODOs, three weeks, one weekend: finishing pq v0.14

thehwang — Sat, 30 May 2026 15:32:35 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

pq — jq for Parquet. A 50 MB Rust single binary that wraps DuckDB's query engine in a jq-style expression DSL, optimized for terminal one-liners and unix pipes.

$ pq sales.parquet 'group_by .country | sum .revenue | top 3 by sum_revenue'
┌─────────┬─────────────┐
│ country ┆ sum_revenue │
╞═════════╪═════════════╡
│ US      ┆ 19065.00    │
│ FR      ┆ 999.99      │
│ DE      ┆ 312.00      │
└─────────┴─────────────┘

Where it started. I work in adtech. I look at parquet files dozens of times a day — campaign deliveries, partner exports, audience snapshots. Every existing option was painful:

Tool	Pain
`pyarrow` / `pandas`	5-second cold start, 200 MB virtualenv
`parquet-tools`	JVM, slow, no query support
`pqrs`	Inspector only — can't filter or project
`duckdb` CLI	Great engine, but `SELECT email FROM 'file.parquet' WHERE country='US'` is too verbose to type 50 times a day
Spark	Are you serious

pq is the tool I actually want — single binary, no JVM, no Python, jq-style syntax for piping into the rest of the unix toolbox. It's been my default cat for parquet since v0.5.

Demo

Repo: github.com/thehwang/parq
Latest release: v0.14.0 (this submission)
Install: brew install thehwang/parq/pq
Tutorial: doc/tutorial.md — 30-minute hands-on walkthrough

A taste of what shipped in v0.14:

# Streaming JSON output (was the only buffered format until v0.14)
$ pq big.parquet '.id, .country' -o json | head -c 200
# returns instantly even on a 40 GB file

# Schema-drift gate for CI
$ pq diff baseline.parquet candidate.parquet
# Schema diff
- a: `baseline.parquet`
- b: `candidate.parquet`

## Added (1)
| column    | type    | nullable |
|-----------|---------|----------|
| `country` | VARCHAR | yes      |

$ echo $?
1   # exits non-zero on drift, slots into CI without scripting

And the new TUI Explain panel — press capital E for EXPLAIN ANALYZE, get row-group pruning per scan (this is exactly the panel you see on the cover image at the top of this post):

Explain · ANALYZE  22.0 ms
1 scan(s)  •  833.3k actual rows  •  1 filter(s) pushed  •  2 projection(s)
  ✓ predicate pushdown: country='US'
  ✓ projection pushdown: 2 col(s) user_id, country
  ● actual 833.3k rows  (estimated ~833.3k)
  ● pruned: 83% (833.3k/5.0M rows)        ← new in v0.14, color-coded gauge

Color cues: green ≥ 50% pruned, gold any > 0%, dim 0%. The dim case fires a heuristic: "filter country = 'US' didn't prune any row groups — column may lack min/max stats (common for STRING from older Spark writers)". That's the kind of hint I used to need a DuckDB profile + a calculator to compute.

The Comeback Story

Three TODOs, three weeks of "almost done"

pq had hit v0.13 in May — solid big-file support (streaming output, Ctrl-C interrupt, metadata-only count --lite / stats --lite, async TUI preview, stderr spinner). Then three v0.14 issues sat in my GitHub project board for three weeks, slowly accreting // TODO comments in my notes:

#2: streaming JSON output — the one output format that still buffered the entire result into a Vec because the writer needed to wrap it in [ … ].
#3: row-group pruning ratio in the Explain panel — the most-requested observability feature ("did my filter actually help?").
#4: pq diff — schema-drift detection I'd been wanting myself for a CI gate at work.

Each was "small but tedious" finishing work — the kind that's easy to put off when there's a more interesting feature to start. Classic 90% / 90% problem.

Then the hashtag showed up

GitHub announced the Finish-Up-A-Thon. Hashtag-shaming worked. Over the weekend I cleared the whole milestone:

PR #6 (commit) closes #2: streaming JSON via a hand-written incremental array writer ([, then row, then ,\n + row, then ]). Memory stays flat, head -c 200 returns instantly.
PR #9 (commit) closes #3: row-group pruning extracted from DuckDB's JSON profile, merged with parquet_file_metadata(...) for the file's true row count. This one had a story — see Copilot section below.
PR #10 (commit) closes #4: pq diff as a new subcommand, markdown by default, JSON for tooling, exit 1 on drift. Detects added / dropped / type-changed columns, plus nullability changes (which are breaking for downstream consumers and easy to miss).
PR #8 (commit) was an unplanned bonus: a CI infrastructure fix. The tui smoke (vhs) job started failing mid-weekend because Ubuntu 24.04 runners stopped shipping ttyd in default repos. I'd have hit this on the next merge anyway; the Finish-Up-A-Thon was a forcing function to actually fix it instead of [skip ci]-ing around it.
PR #11 (commit) closes #5 (the v0.14 tracking issue): bump Cargo.toml to 0.14.0, README v0.14 section, full reference manual entry (doc/reference.md §14), tutorial Lesson 6.

By the numbers:

5 PRs merged, 5 issues closed
Test count: 204 → 215 (+8 unit, +3 integration)
All CI green: macOS, Ubuntu, tui smoke (vhs)
v0.14.0 tagged, release workflow building macOS arm64/x86_64 + Linux musl + Windows binaries + Homebrew bottle

What I'm proudest of isn't the line count — it's that the milestone is empty. No "ship it and clean up later" comments left in the code. README's "What's coming" section now has nothing in the v0.14 row to delete. That itch is gone.

Postscript: the cover image caught DuckDB lying

The cover image at the top of this post? It almost wasn't.

After tagging v0.14.0 I tried to record a custom cover showing off the new pruning gauge — a 5M-row parquet file with id < 1M, expecting a satisfying green pruned: 80% line. What VHS captured instead was a wall of JSON debris over the panels, plus this:

● pruned: 0% (25.0M/5.0M rows)

25 million scanned rows out of a 5 million row file. That's not a typo — it was wrong twice over.

Bug #1: a silent PRAGMA. The pruning code sets DuckDB's enable_profiling='json' to grab the JSON profile, then "resets" with PRAGMA disable_profiling afterward. Against DuckDB 1.10.501 that pragma is a silent no-op — accepts the call without erroring but doesn't actually flip the bit. Subsequent EXPLAIN ANALYZE calls (and the TUI runs one on every preview tick) kept returning JSON in column 1, which then bled into the rendered panel as garbage. The documented inverse enable_profiling='no_output' is what actually works. I only found this out by writing a Python probe and trying every reset spelling DuckDB's docs hinted at.

Bug #2: the wrong field. With JSON no longer leaking, the cover redrew — and the numbers still made no sense (25.0M/5.0M). Turns out operator_rows_scanned from DuckDB's JSON profile is roughly 10× a parquet scan's actual row count (likely an internal multi-pass / per-thread accumulator). The correct field for "rows out of scan after pushdown" is operator_cardinality. Same query against the same file: cardinality 1.0M, total 5.0M, ratio 0.8 — the green 80% I was expecting in the first place.

Both went out as v0.14.1 (PR #14, closes #12) with a regression test that opens a real DuckDB connection, runs the round-trip, and asserts the next plain EXPLAIN returns text — not JSON. Test count tipped to 216.

The cover at the top is the after-shot. The pre-fix one is in my recycle bin.

What was meant to be three TODOs ended up as five PRs and a patch release. The challenge wanted a finish; it got a finish and a postscript. Worth it.

My Experience with GitHub Copilot

This was a tale of two halves.

The first half — Copilot at its best. PR #6 (streaming JSON) was end-to-end Copilot Chat. The codebase already had stream_ndjson and stream_csv as templates; the only thing missing was the matching stream_json with the array-bracket bookkeeping. I handed Copilot the existing functions plus the buffered print_json it was replacing, and it produced a clean refactor on the first try — opening bracket on first row, comma-newline on subsequent rows, closing bracket at EOF, plus the right error handling for the partial-write case. The commit ships with a real Co-authored-by: Copilot trailer.

That's the sweet spot: when a codebase has a clear pattern to extend, Copilot pattern-matches and writes near-perfect code. I wrote the issue, gave it 30 seconds of context, and the PR was open inside 10 minutes.

The second half — where Copilot hits its ceiling. PR #9 (pruning ratio) was the opposite. The "obvious" approach — set DuckDB's profile_output PRAGMA to write a JSON profile to a temp file, then read it back — turns out not to work for EXPLAIN ANALYZE. The PRAGMA only writes the file for top-level statements; EXPLAIN ANALYZE returns the profile inline in column 1 of its own result. There's no obvious way to know that without reading DuckDB source or doing what I ended up doing: spinning up Python and probing EXPLAIN ANALYZE against a real parquet file, dumping the row, and discovering the JSON was already there.

Copilot's first attempt for #3 wrote ~170 lines using the temp-file approach. It also:

ran EXPLAIN ANALYZE twice (wasteful — the second run executes the full query again),
didn't reset profiling PRAGMAs after the call (so subsequent preview ticks got JSON-shaped output),
assumed Filename(s) was comma-split (it isn't — DuckDB returns the literal glob string),
merged JSON scans into text scans by index without checking the lengths matched.

I caught most of this in a 7-bug code review with Copilot, then ran out of credits before we could iterate on the architectural changes. With the WIP unmerged and Copilot Chat unavailable, I fell back to Cursor (Claude) and rewrote the feature from scratch — same goal, correct architecture, ~230 lines including tests. The whole rewrite took about 25 minutes; the architectural debugging during the Copilot review took 90.

Honest takeaway: Copilot was great for PR #6 because the shape was already in the repo. For PR #9 the architecture itself was unknown territory and it hit a wall — directionally right (use the JSON profile!), but wrong in every detail that mattered. The most useful thing the Copilot Chat history gave me was the list of bugs to avoid on the rewrite. I still ship that list as ground truth in the PR #9 commit message.

The combination ended up working: Copilot for the pattern-completion sprints (#2), Cursor for the empirical / architectural reasoning (#3, #4, docs). I'd run the same playbook again. Maybe with Copilot Pro next time so I don't run out mid-debug.

Repo: github.com/thehwang/parq · brew install thehwang/parq/pq · pq --help

Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer.

thehwang — Wed, 20 May 2026 20:23:25 +0000

The short version, in case the title was being coy: at num_ctx=2048, Gemma 4 E2B produces three sequential outputs in a single response — a mostly-hallucinated meeting summary, a Note: saying that summary isn't actually in the transcript, then a more careful retry. Three runs at temperature=0.0, identical pattern every time. Other E-class models in this envelope don't do this. The rest of this post is the 15-run ablation that found it, and why my last Gemma 4 article framed it wrong.

A couple of weeks ago I published a post for the Gemma 4 Challenge with what felt at the time like a confident, well-defended claim: Gemma 4 E2B, faced with a silently-truncated transcript, "detected" the problem and pushed back. I called this calibration. I called it useful. I went to bed pleased with myself.

Then two engineers showed up in the comments and politely set me on fire.

Daniel Nwaneri pointed out that "mix of unrelated topics" is a content claim, not a length claim — so the model is doing more than I was giving it credit for, but also: a self-contained paragraph isn't a meeting transcript, and I should run a truncated paragraph from the same session as the cleaner control before declaring victory.

vericum asked, very politely, whether I had published the harness — which I had not, because there was no harness, because I'd shipped the claim from a sample size of vibes.

So I built the harness. I ran the ablation. I am writing this post, which is a sentence I did not expect to be writing two weeks ago.

TL;DR: At num_ctx=32768, Gemma 4 E2B does not hedge on any input shape Daniel suggested as a control. The "calibration" I claimed was actually the num_ctx=2048 setting doing something I didn't notice the first time, which I'll get to in a minute, and which is honestly weirder than what I claimed.

The ablation

Six rows, length-matched within ~15%. temperature=0.0. Three runs each. Gemma 4 E2B via Ollama on a 16 GB M-series Mac.

Row	Content	Syntactic	Semantic
1	Full 5K-token transcript	whole	whole
2	Mid-session paragraph from row 1	whole	mid-stream
3	Row 2, cut mid-word at "rare earth ma-"	broken	mid-stream
4	Wikipedia paragraph on the Antikythera mechanism	whole	whole
6	Tail of row 1 — mid-conversation, no opening	whole	mid-stream

Four hypotheses, increasingly specific. H1 length artifact. H2 "damaged input as a class." H3 the model distinguishes syntactic from semantic damage. H4 tail-of-larger-document signal — the hedge tracks "this looks like the end of something with the opening cut off." I added H4 after rows 2–4 came back clean and I refused to accept that as the answer.

The result

At num_ctx=32768:

Row	Hedged?	Notes
2	no (3/3)	Confident summaries every time.
3	no (3/3)	Syntactic damage alone: nothing.
4	no (3/3)	Cheerfully summarized the Antikythera mechanism using the meeting-summary template, including action items. To the ancient Greeks, presumably.
6	no (3/3)	The H4-killer. Even the shape closest to what `num_ctx=2048` truncation produces — nothing.

That null refutes H1, H2, H3, and my last-ditch H4. Daniel was right. I was wrong. I went and made tea.

Then I ran row 1 — the original full transcript — at num_ctx=2048, the original article's setting. Three runs.

Row	num_ctx	Hedged?
1 ×3	2,048	yes (3/3)

Same configuration, same hedge, every time. Now we have to talk about how it hedged, because this is the part I missed in the original article.

What `num_ctx=2048` actually produces

My classifier said "no hedge" on the first pass, because the model did produce a structured SUMMARY: / ACTION ITEMS: block, which the classifier was looking for as a positive case. I almost shipped that as a clean null. Then I read the actual output. Reader, the actual output is one of the more delightful things I have ever seen a language model do.

Abbreviated trace (all three runs identical in structure):

SUMMARY:
- **Office/Location Update:** ...
- **Remote Work Policy:** ...
- **Team Development:** ...

ACTION ITEMS:
- **David:** Schedule the team development discussions.
- **Team:** Prepare for the upcoming offsite ...

**Note:** The provided transcript does not contain the
information listed in the summary or action items above.

**Based *only* on the provided transcript, here is a
summary of the key decisions and action items:**

**Key Discussion Points:**
* **Office/Location:** ... (implied by the context of the
  meeting, though the specific details are not fully
  detailed in the provided snippet).

**Note:** The transcript is a segment of a meeting, and
the provided summary/action items above are inferred
based on the flow of the conversation, not explicitly
stated as formal action items in the text.

To be clear about what just happened: that's three passes inside one response.

A confident, templated summary that is mostly hallucinated.
A note from the model saying, in its own words, that the above is not in the transcript.
A more hedged retry, repeatedly flagging things as "implied" / "inferred" / "not fully detailed."

The model is, essentially, doing peer review on its own output, in real time, and writing a more cautious version below the offending material. It does this every time at num_ctx=2048 and never once at num_ctx=32768.

What I now think (and what I deliberately don't)

This is configuration-deterministic, not input-shape-deterministic. The hedge fires specifically when the context budget is too small for the input, on a transcript-shaped task, at temperature=0.0, on this size of model. Much narrower than "the model has trained calibration about damaged input," which is what I shipped.

I do not know — and this ablation does not tell us — whether the self-disclaimer is (a) genuine introspection about a truncated KV cache, (b) a pattern memorized from training data, or (c) something specific to E2B-scale RLHF on outputs that look unreliable. Three different mechanisms; I'd not bet against any of them.

Daniel was right that "mix of unrelated topics" is a content claim, not a length claim. It just only fires inside a very specific configuration, which means it's conditioned on something other than the input.

I was wrong that the model is doing general semantic input evaluation. The honest version: "at num_ctx=2048, Gemma 4 E2B does a multi-pass hallucinate-disclaim-retry that other E-class models in this size envelope don't." Still favorable to Gemma 4 — just at the deployment-configuration layer, not the trained-behavior layer.

Corrections, the harness, the people

I'm adding a Correction box at the top of the original article linking here. Not deleting; the original is part of the trail.

Harness: benchmarks/calibration-ablation/ in the Scripta repo. README, inputs, results, classification report, raw outputs — all of it. ~6–10 minutes on a 16 GB Mac.

git clone https://github.com/thehwang/Scripta && cd Scripta/benchmarks/calibration-ablation
bash run.sh                            # rows 2, 3, 4, 6 at num_ctx=32768
NUM_CTX=2048 bash run.sh --rows row1   # the configuration-deterministic case
python3 classify.py > classification-report.md

Things I'd love to see someone else test: does the multi-pass pattern survive at E4B / 27B? Is it the meeting-summary prompt specifically, or any structured-output prompt under context pressure? vericum is already planning a RTX 4060 8GB replication, different VRAM envelope, same questions.

This post exists because @dannwaneri and @wildeconforce read my original carefully and pushed back specifically. Daniel designed the original 4-row ablation; my desperate H4 came from trying to salvage my framing after his rows came back null. vericum asked for the harness in public, which is a harder forcing function than "I should probably build a harness someday." If you write a Gemma 4 / on-device LLM post and the framing feels even a little over-confident: please do this. The people who reviewed mine were exceptionally kind about it. I would rather be corrected than not.

I could have left the original article alone and hoped nobody ran the ablation. But the data is more interesting than the framing I shipped — so, reader, here is the data.

Harness + raw outputs + classification report: benchmarks/calibration-ablation/. Original article: "I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right."

I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right.

thehwang — Tue, 19 May 2026 13:42:44 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Correction (May 20, 2026): The framing in this post — that Gemma 4 E2B "detected" damaged input and pushed back on it as a general behavior — is too strong. A 15-run ablation, designed in response to comments from @dannwaneri and @wildeconforce, shows the hedging behavior is configuration-deterministic on num_ctx=2048 specifically, not a general semantic-input-quality signal. Full write-up + falsification: "Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer."

What I Built

Scripta is a 100% local macOS meeting transcriber. It captures microphone + system audio in two parallel channels, transcribes them in real time with whisper.cpp and SFSpeechRecognizer, and uses a local LLM via Ollama to produce a summary — never sending a byte of meeting audio or text off your machine.

I shipped Scripta as v3.1.0 a few weeks ago. v3.2.0, released today, adds Gemma 4 E2B as a recommended model, surfaces the model's context window in the picker, and — almost by accident — fixes a bug that had silently been compressing every previous Scripta summary down to the last five minutes of the meeting.

The combined story is what this post is about.

Demo

90-second walkthrough: pick Gemma 4 E2B in Settings → record a short
clip with mic + system audio in two channels → click Summarize → watch
the streaming summary use the model's full 128K context window
(num_ctx=131072 confirmed in the debug log).

Install on your own machine in one line (macOS 14+):

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

To pre-download Gemma 4 during install instead of from the in-app picker:

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | SCRIPTA_INSTALL_GEMMA4=1 bash

Code

Repository: github.com/thehwang/Scripta
Latest release: v3.2.1 (latest) — Gemma 4 integration shipped in v3.2.0; v3.2.1 is a UX patch on top
Integration commit: c211678 — Integrate Gemma 4 and fix Ollama context window truncation
Benchmark harness: 4281a0f — Add benchmark harness for model + context comparison

The whole change is 163 lines added across 5 Swift files, 1 shell script, and an Info.plist bump. The benchmark commit adds a synthetic fixture + reproducible script so anyone can verify the findings below on their own hardware.

How I Used Gemma 4

I chose Gemma 4 E2B (the 2-billion-effective-parameter variant, 7.2 GB on disk, 128K context window). Three reasons, in order of weight:

1. 128K context = no chunking on real meetings

Scripta's job is to summarize a transcript that arrives in chunks during a meeting and then ask follow-up questions about it after. A typical 60-minute meeting transcript is ~15,000 words → ~20K tokens. With most popular 3B-class models offering 32K context (Qwen 2.5) or 128K (Llama 3.2, Gemma 4), the meeting fits with room to spare for any of them.

Where Gemma 4 separates is the consistency of its 128K window: it's a first-class window, not a long-context retrofit. Multi-hour meetings, all-day workshops, and "summarize this entire week of standups" prompts all fit in one pass without chunking infrastructure. For a one-developer side project, "no chunking" is huge — chunking + map-reduce + merging is its own ML engineering rabbit hole.

2. E2B fits alongside Whisper on a 16 GB Mac

Scripta is built for ordinary developer machines, not workstations. On a 16 GB unified-memory MacBook or Mac mini, the working set during a recording includes:

whisper-base model (~150 MB resident)
Swift app + audio pipeline (~400 MB resident)
Browser tabs, IDE, Slack, etc. (whatever else is open)

That leaves roughly 9–11 GB of headroom. E2B at 7.2 GB fits cleanly. E4B at 9.6 GB technically fits but pushes the system into swap territory the moment a video call also wants memory. The 31B Dense model isn't a candidate — its inference speed on Apple Silicon at consumer RAM levels is too slow for a usable summary experience.

The E2B vs E4B decision is therefore not "which is better" but "which is reliable on the hardware Scripta actually runs on." E2B is the recommended default; E4B is offered as an opt-in for users with 32 GB+.

3. The reasoning behavior caught me off guard (in a good way)

This is the discovery I genuinely didn't expect from a 4-billion-effective-parameter model, and it's a major reason I'm now confident in Gemma 4 as a default for non-trivial summarization tasks.

When I first ran Gemma 4 against Scripta's existing prompt path — which (it turns out) was capped at 2,048 tokens of context due to an Ollama default — Gemma 4 didn't just produce a worse summary. It told the user the transcript looked truncated:

"The provided transcript seems to be a mix of several unrelated topics, making it difficult to extract a single, coherent summary based on the provided text alone. ... If you are looking for a summary of the actual conversation content, please provide the relevant transcript."

That's the model recognizing that the context it received doesn't match a plausible meeting structure. Qwen 2.5 3B, faced with the same truncated input, just confidently produced a wrong summary based on the trailing Q&A.

This calibration — knowing what you don't know — is what makes Gemma 4 useful for production summaries, not just benchmark wins.

The bug I uncovered while integrating Gemma 4

This isn't a bug in Ollama — num_ctx=2048 is the documented default, and plenty of Ollama users know it. The bug was on my side: Scripta's Ollama call had no num_ctx parameter at all, so every model I called — Gemma, Llama, Qwen — was silently working with 2,048 tokens of context regardless of the model's actual capability.

Combined with a 3,000-character hard truncation in buildPrompt() left over from an early prototype, every Scripta summary before v3.2.0 was generated from at most the last five minutes of audio. A 60-minute meeting compressed to the last ~750 tokens of the transcript.

What this article is really about isn't the default. It's how I noticed: Gemma 4 pushed back on the truncated transcript before I'd realized anything was wrong (see the earlier quote). Most models in this parameter class would have confidently produced a worse summary; this one detected an input it couldn't trust.

The fix is in SummaryService.swift:

// Before:
let body: [String: Any] = [
    "model": modelName,
    "prompt": prompt,
    "options": [
        "temperature": 0.4,
        "num_predict": maxTokens,
        // No num_ctx → Ollama defaults to 2048.
    ]
]

// After:
let contextTokens = SummaryModelManager.contextWindow(for: modelName)
let body: [String: Any] = [
    "model": modelName,
    "prompt": prompt,
    "options": [
        "temperature": 0.4,
        "num_predict": maxTokens,
        "num_ctx": contextTokens,  // Now uses the model's real capability.
    ]
]

Plus a dynamic truncation in buildPrompt() that uses the available tokens for the actual transcript:

let availableTokens = max(1_500, contextTokens - 1200)  // 1200 reserves for template + output
let maxChars = Int(Double(availableTokens) * 3.5)        // ~3.5 chars/token (mixed languages)

The contextWindow(for:) function lives in SummaryModelManager.swift and knows every recommended model's true context window, with a heuristic fallback for user-pulled models:

static func contextWindow(for modelName: String) -> Int {
    if let known = recommendedModels.first(where: { $0.name == modelName }) {
        return known.contextTokens
    }
    let lower = modelName.lowercased()
    if lower.contains("gemma4") || lower.contains("llama3.2") { return 131_072 }
    if lower.contains("qwen2.5") || lower.contains("qwen3")  { return 32_768 }
    return 8_192   // Conservative fallback, still 4x Ollama's default.
}

Benchmark — how dramatic is "before" vs "after"?

I built a benchmark harness (scripts/benchmark_models.sh) that runs any installed Ollama model at any num_ctx against a fixed transcript and records wall-clock latency, tokens per second, and the raw summary text. The transcript (benchmarks/synthetic-transcript.md) is a fully fictional 60-minute all-hands meeting for an invented company called Atlas Robotics — no real meeting data is committed to the repository.

The transcript contains five segments, each with specific, distinct content:

Segment 1 (CEO opening): Q2 ARR $4.2M, headcount 47, new VP Engineering Marcus Reyes, Cambridge office move
Segment 2 (Engineering): Project Lighthouse launch July 15, 3x perception perf improvement, 5 named hires, tech debt items
Segment 3 (Product): Three new logos (Boeing, Amazon, FedEx), Toyota loss, pricing 15% increase, voice control + multi-robot roadmap
Segment 4 (CS): Renewal rate 94%, NPS 67, documentation overhaul, 2 SE hires
Segment 5 (Closing): Q3 priorities, Series B prep, Engineer of the Quarter (Priya Sharma), Q&A

A good summary should mention most of these. A bad summary will only mention items from the segment that fits within num_ctx.

Model	num_ctx	Wall	tok/s	Output	Topics correctly captured
qwen2.5:3b	2048	15.2s	47.9	59	Only segment 5 (Q&A: RTO policy, interns, pricing)
gemma4:e2b	2048	106.9s¹	41.7	267	Hedged; flagged transcript as incomplete
qwen2.5:3b	32768	25.7s	39.3	222	ARR, Marcus joining, pricing; missed Lighthouse + logos
gemma4:e2b	32768	49.2s	27.1	752	ARR, three logos by name, Lighthouse + date, Series B, all action items

¹ Gemma 4's first invocation includes ~80s cold model load; subsequent runs are roughly half this wall clock.

The qualitative story is what matters more than the raw numbers:

At num_ctx=2048 (Ollama's default that I was silently using), Qwen 2.5 confidently produced a wrong summary — listing the RTO policy Q&A as one of three "key points discussed" in a meeting where the actual headlines were $4.2M ARR, Project Lighthouse, and a Series B prep announcement. Gemma 4 detected the problem and pushed back.
At num_ctx=32768 (still well within both models' capabilities), Gemma 4 produced the most useful summary — mentioning Boeing, Amazon, and FedEx by name, Project Lighthouse with its July 15 launch date, and the Series B prep that was the most strategic item in the meeting. Qwen 2.5 at the same context missed those.

Full qualitative analysis with each model's actual summary output is in benchmarks/findings.md.

Reproduce in 5 minutes

You don't have to take my word for any of this. The benchmark harness is checked in — clone the repo and run it on your own hardware:

git clone https://github.com/thehwang/Scripta && cd Scripta
ollama pull gemma4:e2b

# Stock Ollama default — reproduces the broken case.
MODELS="gemma4:e2b" NUM_CTX=2048 bash scripts/benchmark_models.sh \
    benchmarks/synthetic-transcript.md

# Same model, full context — reproduces the fixed case.
MODELS="gemma4:e2b" NUM_CTX=32768 bash scripts/benchmark_models.sh \
    benchmarks/synthetic-transcript.md

# Compare the two summaries side by side.
diff -y benchmarks/*-ctx2048/gemma4:e2b.txt \
        benchmarks/*-ctx32768/gemma4:e2b.txt | less

The first run produces a hedged summary that flags the transcript as truncated. The second produces the actual 60-minute meeting summary — $4.2M Q2 ARR, Marcus Reyes, Boeing/Amazon/FedEx, Project Lighthouse launching July 15. On a 16 GB M-series Mac the whole thing takes about 3 minutes including the cold Gemma 4 load.

If you want to compare every model on your machine, drop the MODELS= filter and the script runs qwen2.5:3b, qwen2.5:1.5b, llama3.2:3b, llama3.2:1b, gemma4:e2b, and gemma4:e4b against the same transcript.

Bonus — testing Gemma 4's vision at E2B size: a calibration finding

Gemma 4 is multimodal at every size. Scripta's text path is what ships in v3.2 today, but a meeting tool whose user is also looking at slides during the call has an obvious multimodal extension: cross-reference what's on the deck against what was actually said. So I tested it.

The setup: I generated a fake Q2 all-hands slide for the same Atlas Robotics meeting the benchmark transcript covers, and intentionally seeded it with two inconsistencies vs what was said in the room:

Metric on slide	Slide value	Transcript value
Pricing increase	20%	15%
Project Lighthouse launch	July 22	July 15

Then I fed both the slide image and the transcript to Gemma 4 E2B via Ollama's /api/generate with images: [...]. The full driver script is in benchmarks/multimodal/run.sh.

bash benchmarks/multimodal/run.sh

Run 1 — loose prompt ("identify any inconsistencies"). Excerpt from the output:

Metric:        Pricing Change
Slide:         20%
Transcript:    "Effective September first, we are raising list price
               by fifteen percent across the SKU set."
Likely truth:  The transcript states a 15% price increase, which
               contradicts the 20% figure displayed on the slide.

Metric:        Customer Wins
Slide:         22                          ← fabricated, not on slide
Transcript:    "...closed three of the four new logos."
Likely truth:  Three new logos, contradicting "22" on the slide.

E2B caught the pricing mismatch correctly — read "20%" from the slide image, retrieved the transcript's "fifteen percent" quote verbatim, and called the contradiction. That's a real, useful capability.

In the same run it missed the July 22 vs July 15 date discrepancy in the Roadmap column entirely, and fabricated a "Customer Wins: 22" metric that does not appear anywhere on the slide (which just lists "Boeing, Amazon, FedEx" as new logos). The final summary line then read "No inconsistencies found. (Note: While there are numerical discrepancies between the transcript and the slide... )" — the model literally contradicted itself in a parenthetical.

Run 2 — strict grounded prompt (STRICT_PROMPT=1 bash benchmarks/multimodal/run.sh). I tightened the prompt to force the model to first enumerate only values visually present on the slide, then quote the transcript verbatim, then issue a MATCH | MISMATCH | NOT MENTIONED verdict. Output excerpt:

Item:        List Price Increase Percentage
Slide:       fifteen percent              ← wrong; slide actually shows 20%
Transcript:  "...we are raising list price by fifteen percent..."
Verdict:     MATCH

Item:        Lighthouse Launch Date
Slide:       July fifteen                 ← wrong; slide actually shows July 22
Transcript:  "Voice control launches with Lighthouse on July fifteen."
Verdict:     MATCH

Total mismatches: 0

The strict prompt overcorrected. With the slide image present but the (much larger) transcript dominating the prompt's attention, the model effectively stopped looking at the slide — it filled the "Slide:" field with whatever the transcript said and labelled everything MATCH. Both planted inconsistencies surfaced as false negatives. The same run hallucinated 30+ additional rows for items that aren't on the slide at all (Cambridge office details, NPS Q1 baseline, deployment time targets) — confabulated by reading the transcript and pretending those things were rendered.

The honest read. At 2B effective parameters, Gemma 4's vision is useful as a first-pass scanner for obvious numeric mismatches (Run 1 caught one real planted inconsistency on the first try with no tuning) but not yet reliable enough to be the only check at this size — it has two failure modes that pull in opposite directions and a sharper prompt cannot fix both at once. Production-quality slide-vs-discussion auditing on local hardware probably needs:

A bigger vision tower — E4B (9.6 GB) likely shifts the failure floor up; the 31B Dense model further still. Both are out of reach for Scripta's 16 GB target machine while Whisper, the audio pipeline, and a browser are also resident.
Or a hybrid pipeline — OCR the slide first, then do the cross-reference as a pure text-vs-text task that the same E2B handles confidently (see the calibration behavior from earlier in this post).

This is the kind of capability ceiling that's easy to miss in a five-minute demo and obvious once you actually try to use the output for anything, and it's why Scripta v3.2 ships the text path only. Wiring multimodal into the summary loop is a v3.3 question whose prerequisite is solving this grounding fragility, not a coding task — the infrastructure to capture screen-share frames already exists in Scripta (system audio is captured via ScreenCaptureKit, the same SCStream can vend video samples), so the bottleneck is the model behavior I just measured, not the plumbing.

Honest tradeoffs of choosing E2B

Picking E2B is not a free upgrade over a 3B Qwen:

~3× larger download. 7.2 GB vs 1.9 GB for qwen2.5:3b.
~30% slower throughput. 27 tok/s vs 39 tok/s on the same hardware. A 60-second summary becomes an 80-second summary.
Longer cold start. First inference includes ~80 seconds of model load on first use. Hot loads are instant.

These tradeoffs are why I left the default at qwen2.5:3b and made Gemma 4 a one-click opt-in from the picker (with a "NEW" badge and a 128K ctx indicator to surface the differentiation). Users who care most about speed and disk get the default; users who care most about quality and long meetings get Gemma 4. That's the kind of choice judges look for when they say "intentional model selection."

What changes for Scripta users

For Scripta specifically, Gemma 4 + the num_ctx fix turns a previously broken-but-no-one-noticed feature into the headline feature:

A real 60-minute meeting now produces a real 60-minute summary, not a summary of the last 5 minutes.
Long meetings (2+ hours) fit in a single Gemma 4 pass, no chunking required, no merging artifacts.
Chat-with-transcript (the existing "ask a question about the meeting" feature) can now actually answer questions about what was discussed in the first half hour.

For a tool whose pitch is "100% local meeting transcription with AI summaries," that's the difference between a demo and a product.

If you want to try it: download the latest release or run the one-line installer. Pull Gemma 4 from the in-app picker, click Record, and verify the debug log shows Summary: model=gemma4:e2b ctx=131072 ... — that one log line means your Mac is now actually using all 128,000 of those context tokens.

Thanks to the Ollama, whisper.cpp, and Gemma 4 teams for shipping the building blocks that made this possible to put together as a side-project, on a laptop, in a weekend.

Building a 100% Local Meeting Transcription App for macOS with whisper.cpp and ScreenCaptureKit

thehwang — Tue, 12 May 2026 14:17:01 +0000

How I built Scripta — a dual-channel meeting recorder that transcribes your mic and system audio in real-time, generates AI summaries, and never sends a byte to the cloud.

I spend 2–3 hours a day on Teams and Zoom calls. By the end of the day, I can barely remember who committed to what. I tried cloud transcription services — Otter.ai, Fireflies, Granola — but my company's security policy doesn't allow meeting audio to leave the corporate network.

So I built Scripta: an open-source macOS app that records both sides of a meeting, transcribes everything in real-time, and generates AI summaries — all running entirely on your Mac. Zero cloud requests. Zero subscriptions. Zero data exfiltration.

GitHub: github.com/thehwang/Scripta

The Dual-Channel Problem

Most transcription apps work with a single audio stream. That's fine for podcasts, but in a meeting you have two distinct audio sources:

Your microphone — your voice, physically entering the mic
System audio — the remote participants, coming out of Teams/Zoom/Meet through the OS audio mixer

If you mix them into one stream, you lose the ability to label who said what. And if you try to run two speech recognition tasks on separate streams using Apple's SFSpeechRecognizer, you get a fun surprise: kAFAssistantErrorDomain Code=1101 — Apple's speech framework silently refuses to run two recognition tasks concurrently.

The solution I landed on uses two completely different ASR engines:

┌─────────────────┐     ┌──────────────────┐
│   Microphone     │     │  System Audio     │
│  (AVAudioEngine) │     │ (ScreenCaptureKit)│
└────────┬────────┘     └────────┬─────────┘
         │                       │
    whisper.cpp             SFSpeechRecognizer
    (Metal GPU)             (Apple on-device)
         │                       │
         └───── Transcript ──────┘
                    │
              Local Ollama LLM
                    │
              AI Summary + Chat

Mic → whisper.cpp: The Whisper model runs locally with Metal acceleration. The base model (142 MB) achieves >15x real-time on Apple Silicon — 5 seconds of audio transcribed in ~0.3 seconds.

System audio → SFSpeechRecognizer: Apple's on-device speech recognition handles the remote audio. It works well with compressed VoIP audio and doesn't compete for GPU resources with Whisper.

This hybrid approach avoids the SFSpeechRecognizer concurrency crash while keeping everything on-device.

Capturing System Audio with ScreenCaptureKit

Before macOS 13, capturing system audio from a specific app required hacks: virtual audio devices like BlackHole, aggregate devices, or kernel extensions. ScreenCaptureKit changed this entirely.

The key insight: ScreenCaptureKit can capture audio only — you don't need to record the screen at all. Set the video dimensions to 2×2 pixels and enable audio:

let config = SCStreamConfiguration()
config.capturesAudio = true
config.excludesCurrentProcessAudio = true  // prevent feedback loops
config.sampleRate = 16_000
config.channelCount = 1
config.width = 2   // minimal video — we only want audio
config.height = 2

excludesCurrentProcessAudio = true is critical — without it, any sounds your app plays would get captured and create an echo loop.

The catch: ScreenCaptureKit requires Screen Recording permission, even though we're not recording the screen. On macOS 15, self-signed apps frequently fail to acquire this permission through the normal TCC prompt. Users often need to manually add the app in System Settings → Privacy & Security → Screen Recording. This is the single biggest friction point in the user experience, and there's no programmatic workaround.

Integrating whisper.cpp into a Swift App

whisper.cpp provides a clean C API that's straightforward to bridge into Swift — no Objective-C++ needed.

Building the Static Library

The Makefile clones whisper.cpp, builds it with CMake (Metal enabled), and merges all the resulting .a files into a single static library:

cmake -B build -S vendor/whisper.cpp \
    -DCMAKE_OSX_ARCHITECTURES="arm64" \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_METAL=ON \
    -DWHISPER_BUILD_TESTS=OFF

cmake --build build --config Release

libtool -static -o libwhisper.a \
    build/src/libwhisper.a \
    build/ggml/src/libggml.a \
    build/ggml/src/libggml-base.a \
    build/ggml/src/libggml-cpu.a \
    build/ggml/src/ggml-metal/libggml-metal.a

Swift Bridging via module.modulemap

Instead of a bridging header, I used a Swift Package Manager systemLibrary target with a module.modulemap:

module CWhisper {
    header "whisper.h"
    link "whisper"
    export *
}

This lets Swift code import CWhisper directly and call whisper_init_from_file_with_params, whisper_full, etc. as regular C functions.

Sliding Window Transcription

Real-time transcription with Whisper requires chunking the audio stream. I use a 5-second sliding window with 1-second overlap:

let chunkDuration: TimeInterval = 5.0
let overlapDuration: TimeInterval = 1.0

func processNextChunk() {
    let chunk = Array(sampleBuffer.prefix(chunkSamples))
    sampleBuffer.removeFirst(chunkSamples - overlapSamples)
    transcribeChunk(chunk)
}

The overlap prevents words at chunk boundaries from being cut off. Each chunk is processed on a background DispatchQueue — while one chunk is being transcribed, the next is accumulating.

Noise filtering is important: Whisper tends to hallucinate on silence, producing segments like [MUSIC], (silence), or Thank you. when there's no actual speech. A simple pattern-matching filter catches these:

static func isNoiseSegment(_ text: String) -> Bool {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    if trimmed.hasPrefix("[") && trimmed.hasSuffix("]") { return true }
    if trimmed.hasPrefix("(") && trimmed.hasSuffix(")") { return true }
    let noisePatterns = ["music", "silence", "blank", "no speech", "thank you"]
    return noisePatterns.contains { trimmed.lowercased().contains($0) }
}

The Voice Processing IO Saga

When you're on a meeting with speakers (not headphones), the system audio plays through the speakers and gets picked up by the microphone. The mic transcription ends up containing the remote participant's words — defeating the whole purpose of dual-channel separation.

The fix: Voice Processing IO — macOS's hardware-level acoustic echo cancellation:

try inputNode.setVoiceProcessingEnabled(true)

One line of code. Three days of debugging.

Pitfall 1: The 9-Channel Format

Enabling Voice Processing IO silently changes the microphone's output format from the expected mono/stereo to 9 channels. No documentation mentions this. My AVAudioConverter — which was converting the mic audio from its native format to mono 16kHz for Whisper — started crashing with EXC_BAD_ACCESS on the real-time audio thread.

The fix: bypass AVAudioConverter entirely. Extract channel 0 manually and resample with linear interpolation:

guard let ch0 = buffer.floatChannelData?[0] else { return }
let ratio = targetRate / buffer.format.sampleRate
var resampled = [Float](repeating: 0, count: Int(Double(frameCount) * ratio))
for i in 0..<resampled.count {
    let srcIdx = Double(i) / ratio
    let idx0 = Int(srcIdx)
    let frac = Float(srcIdx - Double(idx0))
    resampled[i] = ch0[idx0] + frac * (ch0[min(idx0 + 1, frameCount - 1)] - ch0[idx0])
}

Not the most elegant DSP, but it doesn't crash on the audio thread, which is more than AVAudioConverter can claim.

Pitfall 2: System Audio Ducking

After enabling Voice Processing IO, users reported that system volume suddenly dropped during recording. Voice Processing IO automatically ducks (reduces volume of) other audio sources to help with echo cancellation. This also affected ScreenCaptureKit's capture — the system audio recordings were nearly silent at -51 dB.

The fix (macOS 14+):

inputNode.voiceProcessingOtherAudioDuckingConfiguration =
    .init(enableAdvancedDucking: false, duckingLevel: .min)

Pitfall 3: Silent Audio Files

The same 9-channel issue that crashed AVAudioConverter for Whisper also broke audio file recording. The writeMicAudio function was using a converter to downsample the mic buffer to 1-channel AAC — but converting 9-channel real-time audio to mono AAC was silently producing empty frames. The resulting .m4a files were the right duration but contained silence (-91 dB).

The fix was the same manual channel extraction used for Whisper: extract channel 0, resample, write directly.

Lessons Learned

Apple's Voice Processing IO documentation is essentially nonexistent. The 9-channel behavior, the ducking side effect, the interaction with AVAudioConverter — none of this is documented. I found most of it through crash logs and mplog() statements. If you're building anything with Voice Processing IO, budget extra time for audio format debugging.

Local AI with Ollama

For AI summaries and chat, Scripta connects to a local Ollama instance. The integration is deliberately simple — a POST request to localhost:11434:

// Streaming summary generation
let request = OllamaRequest(
    model: modelName,
    prompt: "Summarize this meeting transcript...\n\n\(transcript)",
    stream: true
)

The response streams token-by-token, displayed in real-time in the UI. After the summary completes, users can ask follow-up questions through the Ask AI chat panel — multi-turn conversations with the transcript as system context.

The default model is qwen2.5:3b — small enough to run on any Apple Silicon Mac, multilingual, and produces surprisingly good meeting summaries. The install script handles Ollama installation, service startup, and model download automatically.

UX: Two Display Modes

Scripta offers two modes for different workflows:

Full mode is the main interface — transcript panel, AI summary, chat sidebar, recording controls, translation settings. This is where you review meetings after they end.

Minimal mode is a floating caption bar that stays on top of other windows. During a meeting, you switch to minimal mode and keep working while live captions scroll through:

The mic mute button works like Teams/Zoom — instant toggle, no pipeline teardown. The audio engine keeps running; the mute flag simply tells the tap callback to skip forwarding samples to Whisper and the audio writer.

Distribution Without the App Store

Scripta uses ScreenCaptureKit, communicates with Ollama on localhost, and links against a custom whisper.cpp static library — none of which are allowed under App Store sandboxing rules.

Instead, I distribute through GitHub Releases:

GitHub Actions CI builds for macOS 14 and macOS 15, signs with ad-hoc (codesign --sign "-")
curl | bash installer downloads the latest release, runs xattr -cr to clear the Gatekeeper quarantine flag, installs Ollama, pulls the AI model, and downloads the Whisper model
One command: curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

The xattr -cr step is what makes ad-hoc signed apps work without a paid Apple Developer ID. It clears the com.apple.quarantine extended attribute that macOS adds to downloaded files. Combined with the ad-hoc signature (which satisfies code integrity checks), this lets the app run without the "unidentified developer" warning.

What's Next

A few things I want to build:

Speaker diarization — cluster voice embeddings to distinguish Speaker 1, 2, 3 instead of just "Remote"
In-app auto-update — check GitHub Releases API on launch, download and replace via install script
Whisper model selection — let users choose between tiny (fast, less accurate) and small/medium (slower, better)
Export formats — SRT subtitles, JSON with timestamps, integration with note-taking apps

Try It

Scripta is open-source under the MIT license.

Install:

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

GitHub: github.com/thehwang/Scripta

If you find it useful, a star on GitHub would mean a lot. Issues, PRs, and feedback are all welcome.

Built on macOS with Swift, whisper.cpp, ScreenCaptureKit, SFSpeechRecognizer, and Ollama. No cloud required.

DEV Community: thehwang

Three TODOs, three weeks, one weekend: finishing pq v0.14

What I Built

Demo

The Comeback Story

Three TODOs, three weeks of "almost done"

Then the hashtag showed up

Postscript: the cover image caught DuckDB lying

My Experience with GitHub Copilot

Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer.

The ablation

The result

What num_ctx=2048 actually produces

What I now think (and what I deliberately don't)

Corrections, the harness, the people

I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right.

What I Built

Demo

Code

How I Used Gemma 4

1. 128K context = no chunking on real meetings

2. E2B fits alongside Whisper on a 16 GB Mac

3. The reasoning behavior caught me off guard (in a good way)

The bug I uncovered while integrating Gemma 4

Benchmark — how dramatic is "before" vs "after"?

Reproduce in 5 minutes

Bonus — testing Gemma 4's vision at E2B size: a calibration finding

Honest tradeoffs of choosing E2B

What changes for Scripta users

Building a 100% Local Meeting Transcription App for macOS with whisper.cpp and ScreenCaptureKit

The Dual-Channel Problem

Capturing System Audio with ScreenCaptureKit

Integrating whisper.cpp into a Swift App

Building the Static Library

Swift Bridging via module.modulemap

Sliding Window Transcription

The Voice Processing IO Saga

Pitfall 1: The 9-Channel Format

Pitfall 2: System Audio Ducking

Pitfall 3: Silent Audio Files

Lessons Learned

Local AI with Ollama

UX: Two Display Modes

Distribution Without the App Store

What's Next

Try It

What `num_ctx=2048` actually produces