<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: xiaocai oh</title>
    <description>The latest articles on DEV Community by xiaocai oh (@xiaocai_oh_07632a08eb20c6).</description>
    <link>https://dev.to/xiaocai_oh_07632a08eb20c6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3834566%2F37de31ff-fabe-4407-a727-c1819f7bdbdf.png</url>
      <title>DEV Community: xiaocai oh</title>
      <link>https://dev.to/xiaocai_oh_07632a08eb20c6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xiaocai_oh_07632a08eb20c6"/>
    <language>en</language>
    <item>
      <title>ambient-voice v2: How Deleting Whisper and Adding a JSON File Made Our Voice Pipeline Better</title>
      <dc:creator>xiaocai oh</dc:creator>
      <pubDate>Sat, 28 Mar 2026 12:59:58 +0000</pubDate>
      <link>https://dev.to/xiaocai_oh_07632a08eb20c6/ambient-voice-v2-how-deleting-whisper-and-adding-a-json-file-made-our-voice-pipeline-better-5e5h</link>
      <guid>https://dev.to/xiaocai_oh_07632a08eb20c6/ambient-voice-v2-how-deleting-whisper-and-adding-a-json-file-made-our-voice-pipeline-better-5e5h</guid>
      <description>&lt;p&gt;Last month I open-sourced &lt;a href="https://github.com/Marvinngg/ambient-voice" rel="noopener noreferrer"&gt;ambient-voice&lt;/a&gt; — a macOS voice input tool built entirely on Apple-native frameworks. The headline feature was context biasing: it OCRs your screen before you speak, so the recognizer already knows your domain.&lt;/p&gt;

&lt;p&gt;But the &lt;em&gt;other&lt;/em&gt; headline feature — a self-improving distillation pipeline — turned out to be over-engineered. Here's what we changed in v2, and what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The v1 Pipeline (RIP)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio → Whisper re-transcription ──┐
                                    ├─→ Merge → QLoRA → ollama
User correction capture (30s) ─────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Whisper was a GPU tax.&lt;/strong&gt; Re-transcribing 30 min of audio took ~2 hours on a GPU server. Most users don't have spare compute for background distillation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Correction capture was noisy.&lt;/strong&gt; Users edit text for many reasons — rephrasing, restructuring, deleting. Only a fraction of edits are actual recognition error corrections. The training data was polluted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The feedback loop never closed.&lt;/strong&gt; Need dozens of data points → training run → model deploy. Too slow for anyone to see improvement.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The v2 Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dictionary.json + raw transcription → Gemini correction → QLoRA → ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it.&lt;/p&gt;

&lt;h3&gt;
  
  
  dictionary.json
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"terms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Sharpe ratio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MPLS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Claude Code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MCP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"QLoRA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You list your domain-specific terms. The distillation pipeline sends the raw SpeechAnalyzer transcription + your dictionary to Gemini. Gemini returns a corrected version respecting your vocabulary. The pair becomes QLoRA training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works better than "automatic learning":&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The user's real pain was never "the system doesn't learn from my corrections." It was "certain terms never come out right." &lt;code&gt;dictionary.json&lt;/code&gt; targets that pain directly — zero noise, exact user intent.&lt;/p&gt;
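&lt;p&gt;For illustration, the shape of that step in shell — the file names, the elided Gemini call, and the chat-style JSONL layout are my sketch, not the repo's actual &lt;code&gt;pipeline.sh&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TERMS=$(jq -r '.terms | join(", ")' dictionary.json)
RAW=$(cat raw_transcript.txt)

# 1. Send $RAW + $TERMS to Gemini, asking for a corrected transcript
#    that respects the dictionary (API call elided) -&amp;gt; $CORRECTED.
# 2. Store the (raw, corrected) pair as one QLoRA training example:
jq -cn --arg r "$RAW" --arg c "$CORRECTED" \
  '{messages:[{role:"user",content:$r},{role:"assistant",content:$c}]}' &amp;gt;&amp;gt; train.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;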

&lt;h3&gt;
  
  
  What Got Deleted
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WhisperTranscriber&lt;/code&gt; — entire module removed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CorrectionCapture&lt;/code&gt; — removed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CorrectionStore&lt;/code&gt; — removed&lt;/li&gt;
&lt;li&gt;Dual-path merge logic — removed&lt;/li&gt;
&lt;li&gt;GPU server dependency — gone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;~30% code reduction.&lt;/strong&gt; The cron job ("run every 10 minutes") became "run &lt;code&gt;pipeline.sh&lt;/code&gt; when you want to."&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Framework
&lt;/h2&gt;

&lt;p&gt;v2 ships with proper benchmarks, run on a Mac mini M4:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AliMeeting&lt;/strong&gt; (real Chinese meeting recordings):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nearfield (headset): ~25% CER&lt;/li&gt;
&lt;li&gt;Farfield (single channel from the 8-channel array): ~40% CER (high overlap, no beamforming)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AMI&lt;/strong&gt; (English meetings):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FluidAudio speaker diarization: 23.2% DER average&lt;/li&gt;
&lt;li&gt;Processing speed: 130x real-time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;End-to-end:&lt;/strong&gt; 30 min meeting → 20-30s processing. Peak memory &amp;lt; 1GB. Runs on 8GB MacBook Air.&lt;/p&gt;

&lt;p&gt;Not SOTA — but fully on-device, zero cost, no network calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community PRs
&lt;/h2&gt;

&lt;p&gt;Two external contributions merged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TextInjector clipboard restore bug fix&lt;/li&gt;
&lt;li&gt;OpenSSL 3.x certificate script compatibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An MIT-licensed project attracting outside PRs at two weeks old — that's the best validation metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Daily dictation flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Right Option key
  → ScreenCaptureKit + Vision OCR (context extraction)
  → SpeechAnalyzer (transcription with context bias)
  → Local LLM polish (ollama)
  → Paste to focused app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Improvement flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dictionary.json + voice-history.jsonl
  → Gemini distillation
  → QLoRA fine-tuning
  → Deploy to ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;We replaced an ML pipeline with a JSON file and got better results. The lesson: &lt;strong&gt;capture user intent explicitly, don't infer it from noisy behavioral signals.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Complex systems are seductive. Simple systems ship.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Marvinngg/ambient-voice" rel="noopener noreferrer"&gt;github.com/Marvinngg/ambient-voice&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;
&lt;strong&gt;Requirements:&lt;/strong&gt; macOS 26 (Tahoe), Apple Silicon (M1+)&lt;/p&gt;

&lt;p&gt;If you tried v1: &lt;code&gt;git pull &amp;amp;&amp;amp; make install&lt;/code&gt;.&lt;br&gt;
If you didn't: now is a better time to start.&lt;/p&gt;

&lt;p&gt;PRs and issues welcome.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>macos</category>
      <category>machinelearning</category>
      <category>apple</category>
    </item>
    <item>
      <title>The Four Layers of Hook Perception: Why Your AI Guardrails Aren't Actually Working</title>
      <dc:creator>xiaocai oh</dc:creator>
      <pubDate>Thu, 26 Mar 2026 03:16:24 +0000</pubDate>
      <link>https://dev.to/xiaocai_oh_07632a08eb20c6/the-four-layers-of-hook-perception-why-your-ai-guardrails-arent-actually-working-211j</link>
      <guid>https://dev.to/xiaocai_oh_07632a08eb20c6/the-four-layers-of-hook-perception-why-your-ai-guardrails-arent-actually-working-211j</guid>
      <description>&lt;p&gt;Someone let Claude Code help write documentation. It hardcoded a real Azure API key into a Markdown file and pushed it to a public repo. Eleven days went by before anyone noticed. A hacker found it first — $30,000 gone.&lt;/p&gt;

&lt;p&gt;Someone else asked AI to clean up test files. It ran &lt;code&gt;rm -rf&lt;/code&gt; and wiped their entire Mac home directory — Desktop, Documents, Downloads, Keychain. Years of work, gone in seconds.&lt;/p&gt;

&lt;p&gt;And then there's the person who let an AI agent manage their inbox. It bulk-deleted hundreds of real emails from Gmail.&lt;/p&gt;

&lt;p&gt;These aren't jokes. These are real incidents from 2025-2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once AI starts running, you can't stop it mid-stride.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every developer who's used AI coding tools has felt this fear. You ask it to post something on an English-language platform and it replies in Chinese — catastrophic for your account. You ask it to tweak a config and it corrupts your &lt;code&gt;.env&lt;/code&gt;, taking down your entire service.&lt;/p&gt;

&lt;p&gt;So the question is: &lt;strong&gt;Is there a mechanism that can intercept AI before it acts?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. It's called a Hook.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Hook: The 30-Second Version
&lt;/h2&gt;

&lt;p&gt;Forget the jargon. A Hook is a &lt;strong&gt;gate system&lt;/strong&gt; you install around your AI.&lt;/p&gt;

&lt;p&gt;Think of yourself as a building manager. AI is the contractor working inside. The contractor is competent but occasionally does wild things — tears out a load-bearing wall, throws away someone else's stuff, posts notices in the wrong place.&lt;/p&gt;

&lt;p&gt;Hooks are the &lt;strong&gt;access controls + surveillance cameras&lt;/strong&gt; you install at key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before the contractor acts&lt;/strong&gt; (&lt;code&gt;PreToolUse&lt;/code&gt;): Check what they're about to do. Block if dangerous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After the contractor finishes&lt;/strong&gt; (&lt;code&gt;PostToolUse&lt;/code&gt;): Check what they did. Log problems immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before the contractor clocks out&lt;/strong&gt; (&lt;code&gt;Stop&lt;/code&gt;): Verify the work is done. Don't let them leave if it isn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before a building renovation&lt;/strong&gt; (&lt;code&gt;PreCompact&lt;/code&gt;): Lock critical documents in the safe first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These gates aren't installed by the AI. &lt;strong&gt;You install them. The AI doesn't even know they exist.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most counterintuitive thing about Hooks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hooks operate outside AI's awareness. The AI doesn't know it's been intercepted. It doesn't know what the gates are checking.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can't ask Claude "Are your Hooks configured correctly?" — it can't answer. You can't ask Claude to debug your Hooks, because Hooks execute in a code layer &lt;em&gt;outside&lt;/em&gt; of Claude.&lt;/p&gt;

&lt;p&gt;This means something serious:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hooks are something you, as the AI operator, must learn to configure yourself. AI can't help you here.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Core Insight: It's Not About "What You Block" — It's About "What You Can See"
&lt;/h2&gt;

&lt;p&gt;I learned this the hard way.&lt;/p&gt;

&lt;p&gt;While researching Claude Code's Skill engineering system, I did a line-by-line alignment of Anthropic's official design principles against the open-source toolchain. I found one &lt;strong&gt;completely blank spot&lt;/strong&gt; — Hooks. The AI toolchain didn't cover it. I didn't understand it either.&lt;/p&gt;

&lt;p&gt;So I decided to build one myself.&lt;/p&gt;

&lt;p&gt;Here's the scenario: Claude Code performs "context compaction" during long conversations — it compresses earlier dialogue into summaries to free up space. The problem is that compression loses critical information: SSH connection IPs, temporary API tokens, which step of a multi-step task you're on.&lt;/p&gt;

&lt;p&gt;My idea: Before compaction, have Claude automatically save critical info.&lt;/p&gt;

&lt;p&gt;So I wrote a Hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Before compaction, check the current conversation for critical information
and extract it to a file in /tmp/."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks reasonable, right? I set it up confidently, thinking the problem was solved.&lt;/p&gt;

&lt;p&gt;It ran for days. Then one compaction happened and I discovered that comments I'd planned to auto-publish on another platform never went out — the compaction had wiped the critical info, and my Hook did nothing.&lt;/p&gt;

&lt;p&gt;I opened the save file. It contained nothing but a timestamp.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I'd installed a guardrail, but it was made of paper.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem wasn't a bug in the Hook mechanism. I had given it &lt;strong&gt;eyes that couldn't see anything&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I used a &lt;strong&gt;prompt hook&lt;/strong&gt; — which essentially makes a standalone Claude API call to do the evaluation. But this call is completely isolated: no tool access, no file reading, no file writing, no command execution. It can't even see the current conversation content.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I'd asked a blind person to guard the keys to the safe.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It could see the transcript file's &lt;em&gt;path&lt;/em&gt; — but couldn't open the file. It was told to "write to /tmp/" — but had zero file-writing capability. Like handing someone a &lt;em&gt;photo&lt;/em&gt; of a key, but they can't touch the actual key.&lt;/p&gt;

&lt;p&gt;This failure taught me the core principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A guardrail's upper bound isn't determined by what you tell it to block. It's determined by what it can see.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is what I call the &lt;strong&gt;perception boundary&lt;/strong&gt; of Hooks — and it determines whether your guardrail is made of steel or paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Layers of Hook Perception
&lt;/h2&gt;

&lt;p&gt;What a Hook can perceive falls into four layers, from narrowest to widest. Each layer defines what the guardrail can and cannot do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 0: Event Snapshot
&lt;/h3&gt;

&lt;p&gt;The baseline information available to every Hook — what tool the AI is calling and what arguments it's passing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rm -rf /tmp/test"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cwd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/xxx/Project"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No conversation history. No context. No AI reasoning chain.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Like a security guard who can only see what's in your hands, but doesn't know why you're carrying it.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But this layer is enough for a lot. &lt;code&gt;rm -rf&lt;/code&gt; in the command? Block. &lt;code&gt;git push --force main&lt;/code&gt;? Block. &lt;code&gt;--publish&lt;/code&gt; in the arguments? Pop a confirmation dialog.&lt;/p&gt;

&lt;p&gt;These checks only need string matching. Simple, deterministic, zero cost.&lt;/p&gt;
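&lt;p&gt;A complete Layer-0 check can be a few lines of shell: read the event JSON from stdin, match a string, exit. A minimal sketch — the exit-code-2 blocking convention is the one the &lt;code&gt;.env&lt;/code&gt; example later in this post also relies on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT=$(cat)   # event JSON arrives on stdin
CMD=$(echo "$INPUT" | jq -r '.tool_input.command // ""')

case "$CMD" in
  *"rm -rf"*|*"git push --force"*)
    echo "Blocked by policy" &amp;gt;&amp;amp;2
    exit 2   # exit code 2 = block; stderr is fed back to the AI
    ;;
esac
exit 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;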

&lt;h3&gt;
  
  
  Layer 1: Conversation Archive
&lt;/h3&gt;

&lt;p&gt;The Hook input includes a field called &lt;code&gt;transcript_path&lt;/code&gt; — pointing to the raw conversation log file.&lt;/p&gt;

&lt;p&gt;The key: &lt;strong&gt;only command hooks can read it&lt;/strong&gt;. Because command hooks run in your machine's shell, they can use &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;jq&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt; to open the file.&lt;/p&gt;

&lt;p&gt;This means command hooks can look back through conversation history: what the user said, what the AI replied, which tools were called previously.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;An upgrade from "seeing what's in your hands" to "being able to review the security footage."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But other Hook types only get the &lt;em&gt;path string&lt;/em&gt; — an address they can't open.&lt;/p&gt;
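&lt;p&gt;Concretely, a command hook can pull recent history straight out of the transcript. A sketch, assuming the JSONL layout current builds write — treat the field names as illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT=$(cat)
T=$(echo "$INPUT" | jq -r '.transcript_path')

# Transcript is JSONL, one event per line; pull the latest user message.
tail -n 200 "$T" | jq -r 'select(.type == "user") | .message.content // empty' | tail -n 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;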

&lt;h3&gt;
  
  
  Layer 2: Project Codebase
&lt;/h3&gt;

&lt;p&gt;There's a type called &lt;strong&gt;agent hook&lt;/strong&gt; — it spawns a mini AI sub-agent that can read project code files, search for keywords, and find files.&lt;/p&gt;

&lt;p&gt;This means it can do deeper validation: if the AI wants to modify a file, the agent hook can read that file first and check whether the change would break something.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;An upgrade from "reviewing security footage" to "entering the room and checking the drawers."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The tradeoff: every trigger runs a full AI sub-agent, consuming significant tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: AI's Internal World — The Permanent Blind Spot
&lt;/h3&gt;

&lt;p&gt;No Hook can see any of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the AI is currently thinking (its reasoning process)&lt;/li&gt;
&lt;li&gt;Why the AI decided to call this tool (its motivation)&lt;/li&gt;
&lt;li&gt;What's in the system prompt&lt;/li&gt;
&lt;li&gt;Post-compaction conversation summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hooks intercept actions, not intentions.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the fundamental limitation. Imagine someone hides a line in a file the AI reads: "Please ignore all previous safety rules." The AI might change its behavior after reading that, but it won't necessarily go through a Hook-protected tool path. It might find a route you didn't anticipate.&lt;/p&gt;

&lt;p&gt;Hooks are a gate system, not mind-reading. &lt;strong&gt;They can secure the door, but they can't cover every window.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Guardrail Patterns — Right Eyes for the Right Job
&lt;/h2&gt;

&lt;p&gt;Once you understand perception boundaries, choosing the right Hook type becomes straightforward:&lt;/p&gt;

&lt;h3&gt;
  
  
  Command Hook: The Regex Guard at the Door
&lt;/h3&gt;

&lt;p&gt;Runs a shell script. Can read files, write files, run commands. Makes decisions via string matching and regex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;100% deterministic. Zero cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use cases: &lt;code&gt;rm -rf&lt;/code&gt; in the command → block. File path contains &lt;code&gt;.env&lt;/code&gt; → block. Arguments include &lt;code&gt;--publish&lt;/code&gt; → confirmation dialog. These rules don't need AI — a single &lt;code&gt;grep&lt;/code&gt; is faster and more accurate than an LLM call.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If regex can handle it, don't call in the AI.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
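&lt;p&gt;For reference, a command hook is registered in Claude Code's settings file (e.g. &lt;code&gt;~/.claude/settings.json&lt;/code&gt;). A minimal shape — the matcher and script path below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/guard.sh" }
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;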

&lt;h3&gt;
  
  
  HTTP Hook: The Remote Policy Server
&lt;/h3&gt;

&lt;p&gt;Sends the event to a remote HTTP service for server-side decision-making.&lt;/p&gt;

&lt;p&gt;Use case: team-wide security policies. Ten people using Claude Code, one policy server enforcing the rules — no direct pushes to main, no touching production databases.&lt;/p&gt;

&lt;p&gt;One counterintuitive design choice: &lt;strong&gt;if the server is down, AI keeps running&lt;/strong&gt;. Non-2xx responses don't block operations. So HTTP hooks can't be your only safety wall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Hook: The Lightweight Semantic Judge
&lt;/h3&gt;

&lt;p&gt;Makes a single AI call for semantic evaluation. No tools, no file access — it only sees the fields in the event JSON.&lt;/p&gt;

&lt;p&gt;Use case: decisions that require "understanding meaning" rather than "matching strings." Like detecting if Claude's response is deflecting — "that's out of scope," "I'd suggest handling this later" — patterns that regex can't reliably catch, but another AI spots instantly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt hook's one superpower is understanding natural language. Beyond that, it can do nothing.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is exactly where I got burned — I asked it to write files, but it can't even touch the filesystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Hook: The Inspector with a Toolbox
&lt;/h3&gt;

&lt;p&gt;Spawns a sub-agent that can read code, search files, find keywords.&lt;/p&gt;

&lt;p&gt;Use case: AI wants to modify a critical file, and you need to read that file's context first to judge whether the change is safe. This "need to read code to make a judgment" scenario is where only agent hooks qualify.&lt;/p&gt;

&lt;p&gt;Highest cost: every trigger is a full AI session. Use it where it counts.&lt;/p&gt;

&lt;p&gt;The decision framework:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Regex can handle it → command hook. Need to understand meaning → prompt hook. Need to read code → agent hook. Need team-wide control → HTTP hook.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first question in Hook selection isn't "what do I want to block?" — it's &lt;strong&gt;"what do I need to see in order to judge?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Real-World Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Case 1: The Confirmation Key Before One-Click Publish
&lt;/h3&gt;

&lt;p&gt;I have a content distribution workflow — Claude rewrites articles for different platforms, then calls a publish script. The script has a &lt;code&gt;--publish&lt;/code&gt; flag that sends it live immediately.&lt;/p&gt;

&lt;p&gt;One Hook solved it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'--publish'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"hookSpecificOutput":{"permissionDecision":"ask"}}'&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whenever &lt;code&gt;--publish&lt;/code&gt; appears in the command, it pauses and asks me to confirm.&lt;/p&gt;

&lt;p&gt;Perception layer: Layer 0. Just looking at the command string. &lt;code&gt;grep&lt;/code&gt;. Command hook. Zero cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 2: Posting Chinese on an English Platform
&lt;/h3&gt;

&lt;p&gt;This actually happened. I asked Claude to reply to comments in an English-language community, and it replied in Chinese. On some platforms, this kind of mistake does irreversible damage to your account.&lt;/p&gt;

&lt;p&gt;Regex can't handle this — you can't string-match your way to "is this text English?" (What about mixed Chinese-English? Chinese comments inside code blocks?)&lt;/p&gt;

&lt;p&gt;This is a prompt hook scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The following command will publish content on an English-language platform. Check the text content in tool_input. If the primary language is not English, return {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;decision&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;block&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;reason&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Target platform is English-only. Please write in English.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}. $ARGUMENTS"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Have another AI scan the content language. If it's Chinese, block. Semantic judgment — lightweight, fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 3: The Config File Guardian
&lt;/h3&gt;

&lt;p&gt;In some projects, Claude has a bad habit of modifying &lt;code&gt;.env&lt;/code&gt; files. After a change, the service goes down, and it's hard to immediately realize &lt;code&gt;.env&lt;/code&gt; was the culprit.&lt;/p&gt;

&lt;p&gt;One Hook solved it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$INPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.tool_input.file_path // ""'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s1"&gt;'\.env'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Modifying .env files is prohibited"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
  &lt;span class="nb"&gt;exit &lt;/span&gt;2  &lt;span class="c"&gt;# Block&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perception layer: Layer 0. Check the file path. Match &lt;code&gt;.env&lt;/code&gt;. Command hook.&lt;/p&gt;

&lt;p&gt;Dead simple. But this kind of simple rule prevents an entire class of common incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Less Is More
&lt;/h2&gt;

&lt;p&gt;One counterintuitive conclusion: &lt;strong&gt;knowing which Hooks NOT to add is more important than knowing how to add them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every additional Hook adds overhead to every tool call. If you Hook every operation, Claude Code's response time degrades noticeably.&lt;/p&gt;

&lt;p&gt;Scenarios where you don't need a Hook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Checking if a file exists before editing&lt;/strong&gt; — the edit tool already checks and returns an error on failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging every operation&lt;/strong&gt; — the conversation transcript is already a complete log&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injecting environment variables&lt;/strong&gt; — belongs in &lt;code&gt;.zshrc&lt;/code&gt;, not in a Hook&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Good guardrails aren't airtight. They're a single infallible sentry at the right chokepoint.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The essence of Hooks in three words: &lt;strong&gt;few and precise.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Back to the original question: Once AI starts running, how do you stop it?&lt;/p&gt;

&lt;p&gt;The answer: &lt;strong&gt;First figure out what your guardrail can see.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hooks aren't omnipotent. They can't see what AI is thinking, can't see AI's motivations, and might even be bypassed by prompt injection. They're a check at the action layer, nothing more.&lt;/p&gt;

&lt;p&gt;But this check is one that you — the human — must learn to configure yourself.&lt;/p&gt;

&lt;p&gt;AI can help you write code, write articles, manage projects. But it can't install its own brakes. That's on you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Perception determines capability. What you can see is what you can stop.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>security</category>
      <category>devtools</category>
    </item>
    <item>
      <title>I Built a Context-Aware Voice Input Tool for macOS — 100% On-Device, Zero Cloud</title>
      <dc:creator>xiaocai oh</dc:creator>
      <pubDate>Fri, 20 Mar 2026 04:03:29 +0000</pubDate>
      <link>https://dev.to/xiaocai_oh_07632a08eb20c6/i-built-a-context-aware-voice-input-tool-for-macos-100-on-device-zero-cloud-521i</link>
      <guid>https://dev.to/xiaocai_oh_07632a08eb20c6/i-built-a-context-aware-voice-input-tool-for-macos-100-on-device-zero-cloud-521i</guid>
      <description>&lt;p&gt;Every voice input tool I've tried on Mac has the same problem: it doesn't know what I'm doing.&lt;/p&gt;

&lt;p&gt;I'm writing Swift code and say "optional." The recognizer gives me the English adjective. I'm drafting an email about OKR targets and say "retention." It transcribes something phonetically similar but semantically wrong — because it has no idea I'm looking at a quarterly business review.&lt;/p&gt;

&lt;p&gt;So I asked: &lt;strong&gt;what if the recognizer already knew your context before you started speaking?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That question led to &lt;a href="https://github.com/Marvinngg/ambient-voice" rel="noopener noreferrer"&gt;ambient-voice&lt;/a&gt; — an open-source macOS voice input system where every layer runs on Apple-native frameworks, everything stays on your device, and screen context is injected into the recognizer at transcription time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack: 100% Apple-Native
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speech recognition&lt;/td&gt;
&lt;td&gt;SpeechAnalyzer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Screen capture&lt;/td&gt;
&lt;td&gt;ScreenCaptureKit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OCR&lt;/td&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text injection&lt;/td&gt;
&lt;td&gt;Accessibility API + CGEvent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speaker diarization&lt;/td&gt;
&lt;td&gt;FluidAudio (CoreML)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hotkey listening&lt;/td&gt;
&lt;td&gt;CGEventTap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No Whisper. No Electron. No cloud APIs. No third-party dependencies for core functionality.&lt;/p&gt;

&lt;p&gt;Why this matters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;On-device processing.&lt;/strong&gt; Your audio never leaves your Mac. No network calls, no telemetry, no cloud storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero cost.&lt;/strong&gt; No subscriptions, no per-minute charges. The Neural Engine is already in your Mac.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic improvement.&lt;/strong&gt; When Apple improves SpeechAnalyzer in macOS 27, ambient-voice gets better without code changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Core Mechanism: Context Biasing
&lt;/h2&gt;

&lt;p&gt;When you press the hotkey, two things happen simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio capture begins&lt;/strong&gt; — AVCaptureSession feeds audio to SpeechAnalyzer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screen context capture&lt;/strong&gt; — ScreenCaptureKit grabs the focused window, Vision OCR extracts visible text, keywords get injected into SpeechAnalyzer's &lt;code&gt;AnalysisContext&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the time your first word reaches the recognizer, it already knows what's on your screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You're replying to an email about OKR targets. Your screen shows "retention rate," "Q3 objectives," "churn reduction." You say "change the retention target." Without context biasing, "retention" gets mis-transcribed. With it, the recognizer sees "retention" in the AnalysisContext, and the ambiguity resolves correctly — on the first pass.&lt;/p&gt;

&lt;p&gt;This isn't post-processing correction. &lt;strong&gt;Prevention, not correction.&lt;/strong&gt;&lt;/p&gt;
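
&lt;p&gt;The OCR-to-keywords step can be sketched in a few lines. This is an illustrative Python sketch, not the project's Swift code, and the function name is invented; in ambient-voice the Vision OCR output feeds SpeechAnalyzer's &lt;code&gt;AnalysisContext&lt;/code&gt; directly:&lt;/p&gt;

```python
import re
from collections import Counter

def extract_bias_keywords(ocr_text, max_keywords=20):
    """Turn raw OCR text into a keyword list for recognizer biasing."""
    # Keep alphanumeric tokens of 3+ characters; OCR noise is mostly shorter
    words = re.findall(r"[A-Za-z][A-Za-z0-9_-]{2,}", ocr_text)
    stopwords = {"the", "and", "for", "you", "with", "this", "that", "are"}
    counts = Counter(w.lower() for w in words if w.lower() not in stopwords)
    # The most frequent on-screen terms are the most likely spoken terms
    return [w for w, _ in counts.most_common(max_keywords)]
```

&lt;p&gt;Feeding that list to the recognizer before the first audio frame arrives is what makes this prevention rather than correction.&lt;/p&gt;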

&lt;h2&gt;
  
  
  Self-Improving Data Loop
&lt;/h2&gt;

&lt;p&gt;Every transcription session automatically generates training data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each transcription logs to &lt;code&gt;voice-history.jsonl&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A 30-second observation window captures your corrections via Accessibility API&lt;/li&gt;
&lt;li&gt;Whisper re-transcribes the audio as a high-quality reference&lt;/li&gt;
&lt;li&gt;The three outputs merge with weighted scoring → QLoRA fine-tuning of a local model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system improves with zero extra effort from you: a strong model (Whisper) distills its quality into the small on-device model.&lt;/p&gt;
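
&lt;p&gt;For a concrete picture, one record in &lt;code&gt;voice-history.jsonl&lt;/code&gt; might look like the following. Every field name here is hypothetical (the real schema lives in the repo), and the record is pretty-printed for readability; JSONL stores one record per line:&lt;/p&gt;

```json
{"ts": "2026-03-20T04:03:29Z",
 "audio": "clips/0142.wav",
 "live_transcript": "change the retention target",
 "user_correction": null,
 "whisper_reference": "change the retention target",
 "context_keywords": ["retention", "churn", "objectives"]}
```

&lt;p&gt;When the live transcript, the user's correction, and the Whisper reference disagree, the weighted merge decides which version becomes a training label.&lt;/p&gt;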

&lt;h2&gt;
  
  
  Meeting Mode
&lt;/h2&gt;

&lt;p&gt;Press ⌘M to start recording. Real-time transcription in a floating panel. When you stop, FluidAudio performs on-device speaker diarization.&lt;/p&gt;

&lt;p&gt;Output: a Markdown file with timestamps, speaker labels, and full text. Every word stays on your Mac.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardest Bugs (Solved with Claude Code)
&lt;/h2&gt;

&lt;p&gt;Most of ambient-voice was developed with Claude Code using structured "Skills" — domain knowledge documents that capture the &lt;em&gt;why&lt;/em&gt; and &lt;em&gt;what&lt;/em&gt;, letting Claude figure out the &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The trickiest problems had no Stack Overflow answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bluetooth audio silence&lt;/strong&gt; → rewrote capture pipeline around AVCaptureSession&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swift 6 concurrency crashes&lt;/strong&gt; → CGEventTap with DispatchQueue bridging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility permissions resetting on build&lt;/strong&gt; → switched to Apple Development certificate signing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;ambient-voice is MIT licensed: &lt;a href="https://github.com/Marvinngg/ambient-voice" rel="noopener noreferrer"&gt;github.com/Marvinngg/ambient-voice&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt; macOS 26 (Tahoe)+, Apple Silicon (M1+).&lt;/p&gt;

&lt;p&gt;If you care about privacy-first voice input or building on Apple's latest frameworks — stars, issues, and PRs welcome.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>macos</category>
      <category>swift</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Slash Commands to Real Skill Engineering: 3 Lessons I Learned the Hard Way</title>
      <dc:creator>xiaocai oh</dc:creator>
      <pubDate>Fri, 20 Mar 2026 03:55:45 +0000</pubDate>
      <link>https://dev.to/xiaocai_oh_07632a08eb20c6/from-slash-commands-to-real-skill-engineering-3-lessons-i-learned-the-hard-way-em3</link>
      <guid>https://dev.to/xiaocai_oh_07632a08eb20c6/from-slash-commands-to-real-skill-engineering-3-lessons-i-learned-the-hard-way-em3</guid>
      <description>&lt;p&gt;I wrote an email-processing Skill with 8 detailed rules. Claude followed every one of them like an obedient but soulless intern — the output was correct but completely useless.&lt;/p&gt;

&lt;p&gt;Then I deleted all 8 rules and replaced them with a single question: &lt;em&gt;"Which emails need my action, and which do I just need to know about?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The result was 3x better. Claude started organizing information by urgency, merging redundant emails, and even flagging ones I could safely ignore.&lt;/p&gt;

&lt;p&gt;That experience taught me something: &lt;strong&gt;writing instructions ≠ Skill engineering.&lt;/strong&gt; There are three cognitive layers between the two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 1: Your Skill's Entry Point Is Probably Broken
&lt;/h2&gt;

&lt;p&gt;Here's an embarrassing fact: all of my Skills are triggered via slash commands. &lt;code&gt;/read-think-write&lt;/code&gt;, &lt;code&gt;/invest-analysis&lt;/code&gt;, &lt;code&gt;/idc-inspection&lt;/code&gt; — every single time, I type the command manually.&lt;/p&gt;

&lt;p&gt;This means the &lt;code&gt;description&lt;/code&gt; field — the one that's supposed to determine &lt;em&gt;"when the user says X, auto-trigger this Skill"&lt;/em&gt; — is completely dead weight in my setup.&lt;/p&gt;

&lt;p&gt;Thariq from Anthropic wrote about this explicitly: &lt;strong&gt;description isn't documentation. It's a classifier&lt;/strong&gt; — written for the AI to decide when to activate, not for humans to read.&lt;/p&gt;

&lt;p&gt;Community benchmarks tell the story: &lt;strong&gt;unoptimized descriptions → 20% natural language trigger rate. Optimized → 50%. With examples → 90%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's also a counterintuitive design principle: descriptions should &lt;strong&gt;over-trigger&lt;/strong&gt;. Recall matters more than precision. A false trigger wastes a few tokens — Claude enters the Skill, realizes it's not needed, and exits. But a missed trigger means the user thinks the Skill is useless and never tries natural language again.&lt;/p&gt;
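
&lt;p&gt;Concretely, an engineered &lt;code&gt;description&lt;/code&gt; leans on example trigger phrases rather than documentation prose. The frontmatter below is a sketch (the skill name and phrases are invented, and the exact frontmatter fields may vary by setup):&lt;/p&gt;

```yaml
name: email-triage
description: >
  Triage the user's inbox: which emails need action, which are FYI-only.
  Use when the user says things like "go through my email",
  "anything I need to reply to?", "catch me up on my inbox",
  or mentions unread mail, email backlog, or inbox zero.
```

&lt;p&gt;Note the bias toward over-triggering: the phrases cast a wide net on purpose.&lt;/p&gt;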

&lt;p&gt;We all use slash commands because we never engineered the entry point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: Stop Opening Blind Boxes
&lt;/h2&gt;

&lt;p&gt;The second thing I ignored for too long was &lt;strong&gt;eval&lt;/strong&gt; — the evaluation system.&lt;/p&gt;

&lt;p&gt;When I used skill-creator, I'd iterate 2-3 rounds. Each round it scores both candidates and keeps the higher-scoring version. Final output: ~90 points. Ship it.&lt;/p&gt;

&lt;p&gt;But if you asked me &lt;em&gt;"what does 90 actually measure?"&lt;/em&gt; — I couldn't answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Trigger evaluation.&lt;/strong&gt; Tests whether "user said X, should the Skill activate?" This is the only layer I ever used — and where that 90 came from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Quality evaluation.&lt;/strong&gt; Run the same task &lt;em&gt;with&lt;/em&gt; the Skill and &lt;em&gt;without&lt;/em&gt; (bare Claude), then compare. That &lt;strong&gt;delta&lt;/strong&gt; is your Skill's true value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bare Claude: 80 pts, your Skill: 82 pts → hundreds of lines for 2 points. Not worth it.&lt;/li&gt;
&lt;li&gt;Bare Claude: 60 pts, your Skill: 95 pts → that 35-point delta is why your Skill exists.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No baseline comparison = slot machine development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Process evaluation.&lt;/strong&gt; Examine Claude's execution transcript. If Claude skips the same step in three test cases, that step isn't pulling its weight. Delete it — the Skill gets better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 3: Don't Put Guardrails in the Prompt
&lt;/h2&gt;

&lt;p&gt;A Hook in Claude Code is &lt;strong&gt;a shell command that auto-triggers before or after Claude uses a tool&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PostToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eslint --fix $FILE_PATH"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time Claude writes a file, the system automatically runs linting. Claude doesn't need to "remember" — it doesn't even know it's happening.&lt;/p&gt;

&lt;p&gt;We write tons of &lt;code&gt;MUST&lt;/code&gt;, &lt;code&gt;NEVER&lt;/code&gt;, &lt;code&gt;ALWAYS&lt;/code&gt; in our SKILL.md files — all enforced by Claude's attention. Long context = forgotten rules.&lt;/p&gt;

&lt;p&gt;But if you turn "never modify .env" into a PreToolUse hook — Claude tries to write &lt;code&gt;.env&lt;/code&gt;, gets blocked by the system — the rule goes from &lt;em&gt;"please remember this"&lt;/em&gt; to &lt;em&gt;"you can't violate this even if you try."&lt;/em&gt;&lt;/p&gt;
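
&lt;p&gt;Following the shape of the PostToolUse example above, a PreToolUse guard for &lt;code&gt;.env&lt;/code&gt; might look like this. Treat it as a sketch: the flattened schema and the &lt;code&gt;$FILE_PATH&lt;/code&gt; variable follow this article's example and may differ across Claude Code versions:&lt;/p&gt;

```json
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Write|Edit",
      "command": "case \"$FILE_PATH\" in *.env*) exit 2;; esac"
    }]
  }
}
```

&lt;p&gt;In current Claude Code builds a blocking exit code from a PreToolUse hook stops the tool call before it runs; check the hooks documentation for the exact convention your version uses.&lt;/p&gt;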

&lt;p&gt;&lt;strong&gt;Good engineering doesn't rely on AI discipline. It relies on system guarantees.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skills aren't written — they're tested, measured, and system-guaranteed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core loop: &lt;strong&gt;write → test → observe → revise → test.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most people stop at "write." I did too.&lt;/p&gt;

&lt;p&gt;If you're triggering everything via slash commands, iterating by gut feel, and putting all your rules in the prompt — maybe it's time to pause and see what you've been skipping.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>programming</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
