This is a submission for the Gemma 4 Challenge: Build with Gemma 4
## What I Built
Mnemonic is a macOS menu-bar app and CLI for voice notes that go straight into your daily journal.
Press a hotkey, speak, release. One bullet appears in today's `YYYY-MM-DD.md`:
```
- 14:35 This is a new node. Let me try to see if it'll work. [audio](../audio/2026-05-10/143500.wav)
- 15:12 I want to email Sarah tomorrow about the migration plan. [audio](../audio/2026-05-10/151200.wav)
- 16:08 The bug is in how we handle the empty array case in `merge_chunks`. [audio](../audio/2026-05-10/160800.wav)
```
That's the whole product. No titles, no summaries, no auto-generated TODO lists, no extracted entities. No cloud, no telemetry. Transcribed thoughts dropped into a Markdown file you already control.
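The append itself is tiny. A minimal sketch of that step, assuming the `chrono` crate for timestamps (`append_bullet` is a hypothetical name, and the real code in `mnemonic-core` may differ):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

// Sketch only: append one bullet to today's daily note.
fn append_bullet(notes_dir: &Path, cleaned: &str, audio_rel: &str) -> std::io::Result<()> {
    let now = chrono::Local::now();
    let note = notes_dir.join(format!("{}.md", now.format("%Y-%m-%d")));
    let mut f = OpenOptions::new().create(true).append(true).open(note)?;
    writeln!(f, "- {} {} [audio]({})", now.format("%H:%M"), cleaned, audio_rel)
}
```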
Early versions over-structured short voice memos - every 30-second thought came back with a title, a "summary" that restated what you said, and an "Actions" list that invented TODOs. v0.2 cut all of that. The model's job is now narrow: transcribe and lightly clean.
v0.3 added three things on top, each either opt-in or invisible by default:
- Image attachments. Take a screenshot, then speak - or use `Ctrl+Option+Cmd+Space` to drag a region and start recording in one motion. Gemma 4 reads the WAV and the PNG together and produces one bullet referencing both. The PNG saves next to the audio.
- Recording queue. Recording is decoupled from structuring. Release the hotkey, the tray goes idle, fire the next one immediately. A background worker drains an on-disk inbox serially; quitting mid-job is safe, the inbox survives (a sketch follows this list). The tray is now binary - gray idle, red recording.
- Intent routing (opt-in, off by default). A second narrow Gemma 4 call decides whether your note is asking the OS to do something - "remind me to call Sarah at 3 PM" - and if it is, fires a macOS Shortcut you've whitelisted. Undoable for 5 seconds. No AppleScript, no shell interpolation.
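To make the crash-safety claim concrete, here is a minimal sketch of a serial inbox drain; every name here is hypothetical, and the real worker in `mnemonic-core` surely differs:

```rust
use std::path::Path;

// Stand-in for the real structuring step (llama-server call + markdown append).
fn process_job(job: &Path) -> std::io::Result<()> {
    println!("structuring {}", job.display());
    Ok(())
}

// Drain the on-disk inbox serially. Deleting a job only after it succeeds
// is what makes quitting mid-job safe: the file is still there on relaunch.
fn drain_inbox(inbox: &Path) -> std::io::Result<()> {
    let mut jobs: Vec<_> = std::fs::read_dir(inbox)?
        .filter_map(Result::ok)
        .map(|e| e.path())
        .collect();
    jobs.sort(); // timestamped filenames give FIFO order
    for job in jobs {
        process_job(&job)?;
        std::fs::remove_file(&job)?;
    }
    Ok(())
}
```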
The file format (`YYYY-MM-DD.md` at the vault root) matches Obsidian's Daily Notes plugin, so pointing `notes_dir` at an Obsidian vault makes bullets land in today's daily note. Graph view, backlinks, search - all free.
Everything runs locally. No network call leaves the loopback interface. The DMG is signed and notarized; no telemetry crates are linked into the binary.
## Demo

## Code
- Repository: github.com/EduardMaghakyan/mnemonic
- Latest release (signed + notarized DMG): v0.3.1
- Install via Homebrew:

```
brew tap EduardMaghakyan/tap
brew install --cask mnemonic
```
Rust workspace: Tauri 2 for the menu-bar app, `clap` for the CLI, a shared `mnemonic-core` crate for audio + markdown + the llama-server client. Single Apple Silicon DMG, code-signed with a Developer ID, notarized, and stapled. MIT licensed.
## How I Used Gemma 4
Mnemonic uses three Gemma 4 capabilities through a single local model: native audio, native vision, and lightweight reasoning. They all run against the same llama-server on 127.0.0.1 - no second model, no external API.
### Why E4B
Gemma 4 ships in four sizes. Only two are audio-capable, and only one fits a 16 GB laptop. Numbers from the official Gemma 4 model card:
| Size | Audio? | MMLU Pro | BBEH | CoVoST | FLEURS (↓) |
|---|---|---|---|---|---|
| E2B | ✓ | 60.0% | 21.9% | 33.47 | 0.09 |
| E4B | ✓ | 69.4% | 33.1% | 35.54 | 0.08 |
| 26B A4B | ✗ | higher | higher | - | - |
| 31B Dense | ✗ | highest | highest | - | - |
E2B and E4B are the only sizes with the audio encoder (~300M params). Both also ship with a ~150M vision encoder. The 26B and 31B are vision-only - no ears. "Use the biggest model that fits" is a non-starter for this product.
Between E2B and E4B, the deltas matter:
- MMLU Pro 60.0 → 69.4 (+9.4). The difference between a model that fumbles unfamiliar technical vocabulary in voice notes and one that doesn't.
- BBEH 21.9 → 33.1 (+11.2). Reasoning quality matters for self-correction ("actually, scratch that…") and for intent routing - one misclassification fires the wrong Shortcut.
- CoVoST 33.47 → 35.54 and FLEURS 0.09 → 0.08. Modest audio-recognition wins.
At Q4_K_M the E4B GGUF is 4.98 GB (per Hugging Face), plus the audio and vision mmprojs (~1 GB combined) - small enough to stay co-resident with an IDE and a browser on 16 GB.
### One model, one pass - for both audio and vision
The conventional architecture for this product is two stages:
- ASR (Whisper, Parakeet, etc.) → raw transcript
- Text LLM → clean and structure
Mnemonic does both in one Gemma 4 forward pass. The audio goes into the model with a system prompt that says, in effect: "transcribe this and write it the way the speaker would write it themselves." Why it works better than two stages:
- One model in memory, one HTTP round-trip per recording. A 2-stage version means two model downloads, two warm processes, two failure modes.
- The cleaning prompt operates on the audio, not on a flat transcript. The model can hear pauses, hesitation, restarts - the difference between "I think" as filler and "I think" as opinion. A downstream LLM working from a transcript has already lost that.
- Lower end-to-end latency.
The same approach works for vision. A two-stage screenshot-with-voice product would be OCR (Tesseract, Apple Vision) → LLM merge. Mnemonic sends the WAV and the PNG to Gemma 4 in one multipart request, and the model produces a bullet that references both. The image isn't OCR'd in isolation - it's grounded by what the user said while taking it. Captions come out as "the panic the speaker mentioned in line 42" rather than generic "code editor with red error text."
### Intent routing - a second narrow call, same model
Letting a voice note fire a macOS Shortcut took some thought. I didn't want to bolt on a tools/function-calling framework, an MCP server, or anything that added attack surface for a side-effect feature running on a user's machine.
What works is a second Gemma 4 call to the same llama-server, on the already-cleaned transcript, with one job - output a single JSON object:
{ "tool": "create-reminder", "input": "call Sarah at 3 PM" }
…or `{ "tool": "none" }` if the transcript isn't a request. No tools registry, no plugin protocol - same model, same server.
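Parsing that decision takes a few lines of serde. A sketch with hypothetical names - the property that matters is that anything unparseable degrades to "no action":

```rust
use serde::Deserialize;

// Mirrors the one-object reply shown above.
#[derive(Deserialize)]
struct IntentDecision {
    tool: String,
    #[serde(default)]
    input: String,
}

// None for {"tool": "none"} and for malformed output alike: a reply the
// app can't parse must behave like "not a request", never like a fire.
fn parse_intent(raw: &str) -> Option<(String, String)> {
    let d: IntentDecision = serde_json::from_str(raw).ok()?;
    if d.tool == "none" { None } else { Some((d.tool, d.input)) }
}
```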
Most of the work is around what doesn't fire:
- Whitelist required. Mnemonic only runs Shortcuts named in `allowed_shortcuts`. A hallucinated name is refused before it reaches the OS.
- No AppleScript, no shell interpolation. Input is piped to the Shortcut via stdin (`shortcuts run NAME --stdin`); a sketch of this path follows the list.
- Undoable for 5 seconds. The tray menu shows `Undo: <name>` for the configured window. Click it to run a paired `undo-<name>` Shortcut.
- Thought-dumps don't fire. "I was thinking about reminding Sarah, but maybe she already knows" → `{ "tool": "none" }`. Validated at 30/30 on hand-labelled transcripts, including 15 hedged/observational cases that must not fire (`docs/spike/intent/PHASE-0-INTENT-FINDINGS.md`).
- Notes are the source of truth. A fire writes a `↳ Ran shortcut "<name>": <input>` continuation line under the bullet. Greppable, auditable.
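A sketch of the execution path those rules add up to; only the `shortcuts run NAME --stdin` invocation is from the post, the rest is hypothetical:

```rust
use std::io::{Error, ErrorKind, Write};
use std::process::{Command, Stdio};

fn run_whitelisted_shortcut(name: &str, input: &str, allowed: &[String]) -> std::io::Result<()> {
    // Whitelist check first: a hallucinated name never reaches the OS.
    if !allowed.iter().any(|a| a == name) {
        return Err(Error::new(ErrorKind::PermissionDenied, "shortcut not in allowed_shortcuts"));
    }
    // Input travels over stdin - no shell, so nothing to interpolate.
    let mut child = Command::new("shortcuts")
        .args(["run", name, "--stdin"])
        .stdin(Stdio::piped())
        .spawn()?;
    child.stdin.take().expect("stdin was piped").write_all(input.as_bytes())?;
    let status = child.wait()?;
    if status.success() {
        Ok(())
    } else {
        Err(Error::new(ErrorKind::Other, format!("shortcut {name:?} exited nonzero")))
    }
}
```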
Cost is ~1.7s of warm latency per recording when enabled, off the user's critical path because the structuring queue is already async.
### Implementation
Users run llama-server themselves, on loopback:
```
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --port 5809 --mmproj-auto -ngl 99 -c 8192
```
The app posts one multipart request per recording (WAV + optional PNG + system prompt) to 127.0.0.1:5809, parses the JSON response, and appends a bullet. With intent routing on, a second JSON-mode call follows.
A few things that matter:
- JSON mode + thinking mode. `response_format: { type: "json_object" }` for structure, `chat_template_kwargs: { thinking: true }` for chain-of-thought. The thinking tokens cost a few hundred milliseconds but improve handling of technical terms.
- The schema shrank over three versions, then stayed put. v1 had seven fields (title, tags, summary, cleaned, actions, questions, entities). v1.5 had two. v2 has one: `{ cleaned: String }`. Each prune was a UX win - the simpler the schema, the less the model felt the urge to narrate, summarize, or invent. The system prompt explicitly bans third-person narration ("the speaker", "the user", "the recording") and includes two calibration examples. v0.3's vision and intent work added new inputs and a new second call, but didn't grow that schema.
- Failure is visible, not silent. llama-server unreachable, malformed JSON twice, silent audio - each produces a stub bullet with the timestamp and a redacted error. No recording is ever lost without a trace; the audio is preserved on disk, and the inbox queue means a job in flight survives a quit or crash.
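For illustration, here is what that payload can look like in llama-server's OpenAI-style JSON form. The `response_format` and `chat_template_kwargs` flags are the ones from the list above; the content-part shapes are an assumption about the wire format (the post describes the actual request as multipart), and the `base64`/`serde_json` crates are assumed:

```rust
use base64::Engine;
use serde_json::{json, Value};

// Sketch only: one request carrying the WAV, an optional PNG, and the
// system prompt, with the JSON-mode and thinking flags described above.
fn build_request_body(wav: &[u8], png: Option<&[u8]>, system_prompt: &str) -> Value {
    let b64 = base64::engine::general_purpose::STANDARD;
    let mut parts = vec![json!({
        "type": "input_audio",
        "input_audio": { "data": b64.encode(wav), "format": "wav" }
    })];
    if let Some(bytes) = png {
        parts.push(json!({
            "type": "image_url",
            "image_url": { "url": format!("data:image/png;base64,{}", b64.encode(bytes)) }
        }));
    }
    json!({
        "messages": [
            { "role": "system", "content": system_prompt },
            { "role": "user", "content": parts }
        ],
        "response_format": { "type": "json_object" },
        "chat_template_kwargs": { "thinking": true }
    })
}
```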
## What works well
- Audio transcription and cleaning. Better than the Whisper + LLM pipeline I started with. Audio-grounded cleaning preserves intent across self-correction and hesitation in a way a downstream LLM on a flat transcript can't.
- Vision captions are grounded. They answer "what was the speaker talking about" rather than describing everything in the image. Short, useful captions.
- Intent routing is conservative. It refuses to fire more often than it fires. That's the right error direction for a feature that can run OS-level actions.
- The queue makes the app feel instant. You stop waiting on the model. Tray returns to idle the moment you release the hotkey.
- Everything is local. Loopback only. No keys to manage, no quota to worry about, no data leaves the machine.
## Limits
- Processing is batched after the fact, not streamed. Fine for journaling; wouldn't work for live captioning.
- 16 GB unified memory is the floor. With a heavy IDE + browser open, memory pressure shows.
- I haven't tested non-English voice notes systematically. Gemma 4 is multilingual, but I work in English.
- No speaker diarization, no noise suppression. Both are omitted by design; most voice memos are solo.
- Intent routing requires building macOS Shortcuts by hand. Powerful if you set it up, but most users won't.
- Image OCR is good for screenshots, not for dense document scans. Short captions and inline text work well; multi-column papers don't.
## UX
The whole interaction is one keystroke. Default is `Ctrl+Option+Space` hold-to-record (push-to-talk). Both the combo and the mode (hold or toggle) are configurable in `~/.config/mnemonic/config.toml`, and the config hot-reloads - change a hotkey, save, the app re-registers without a restart. Tray transitions are perceptible within 100 ms.
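For orientation, a hypothetical config might look like the following - only `notes_dir`, `allowed_shortcuts`, and the hotkey/mode settings are described in this post, and the exact key names may differ:

```toml
# Hypothetical key names; check the repo for the real schema.
notes_dir = "~/Documents/vault"          # point at an Obsidian vault to land in daily notes
hotkey = "Ctrl+Option+Space"             # the push-to-talk combo
mode = "hold"                            # "hold" or "toggle"
allowed_shortcuts = ["create-reminder"]  # intent routing stays off unless whitelisted
```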
The CLI ships a `mnemonic doctor` command with a green/red checklist for every common failure mode (mic permission, llama-server reachable, model loaded, mmproj loaded, paths writable). `brew install` creates the `mnemonic` symlink on `PATH` automatically - no admin password prompt anywhere in the install path.
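One of those checks is simple enough to sketch - a hypothetical loopback reachability probe (the real `doctor` checks more than a TCP connect):

```rust
// Hypothetical sketch of the "llama-server reachable" check:
// a TCP connect to the loopback port configured above.
fn llama_server_reachable() -> bool {
    std::net::TcpStream::connect(("127.0.0.1", 5809)).is_ok()
}
```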
Built solo (well... not quite, Claude was involved). Source under MIT. Thanks to the Gemma team for shipping audio, vision, and chain-of-thought in a model that fits a laptop.