Last weekend I built Duckmouth — a macOS speech-to-text app with LLM post-processing, global hotkeys, Accessibility API integration, and Homebrew distribution. From first commit to shipping DMG: 26 hours.
```shell
brew tap nesquikm/duckmouth
brew install duckmouth
```
The interesting part isn't the app. It's how the process worked — and specifically, how much I was not hands-off.
## The Numbers
| Metric | Value |
|---|---|
| Milestones completed | 31 |
| Dart files | 96 |
| Lines of code | ~12,700 |
| Native Swift files | 2 (platform channels) |
| Tests | 409 (unit, widget, integration, e2e) |
| Distribution | DMG + Homebrew cask |
## What Duckmouth Does
Record speech → transcribe via OpenAI-compatible API (OpenAI, Groq, or custom) → optionally post-process with LLM (fix grammar, translate, summarize) → paste at cursor or copy to clipboard. Lives in the menu bar, responds to global hotkeys, keeps history. Standard Flutter/Dart on macOS, with Swift platform channels for the Accessibility API and system sounds.
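The Swift side of that pipeline is thin: a platform channel that ferries text from Dart to native code. A minimal sketch of what such a handler could look like — the channel name, method name, and `insertText` helper are my own illustration, not Duckmouth's actual API:

```swift
import Cocoa
import FlutterMacOS

// Hypothetical stub; the real work would live in the Accessibility layer.
private func insertText(_ text: String) -> Bool { true }

class PasteChannel {
    static func register(with registrar: FlutterPluginRegistrar) {
        let channel = FlutterMethodChannel(
            name: "duckmouth/paste",            // assumed channel name
            binaryMessenger: registrar.messenger)
        channel.setMethodCallHandler { call, result in
            switch call.method {
            case "pasteAtCursor":
                // Dart sends the transcribed (and post-processed) text.
                let text = call.arguments as? String ?? ""
                result(insertText(text))
            default:
                result(FlutterMethodNotImplemented)
            }
        }
    }
}
```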
Nothing exotic. But it touches enough surface area — audio capture, HTTP APIs, Accessibility framework, clipboard, system tray, hotkeys, persistent storage — that doing it manually in a weekend would be ambitious.
Oh, and during the same weekend I also shipped the_logger_viewer_widget — a companion package for the_logger that embeds a log viewer directly in your app. Built with the same dev-process-toolkit workflow, published to pub.dev, and integrated into Duckmouth's debug screen. Side quest completed before Sunday dinner.
## Human on the Loop, Not Out of It
There's a popular framing: "AI built my app while I slept." That's not what happened. At all.
I used dev-process-toolkit, a Claude Code plugin I built specifically for this kind of work. It enforces a spec-driven development workflow: write specs → TDD → deterministic gate checks → bounded self-review → human approval.
Here's what "human on the loop" looked like in practice:
**I wrote the specs upfront.** Four files in a `specs/` directory — requirements, technical spec, testing spec, implementation plan. Every functional requirement had acceptance criteria. Every milestone had a gate. The agent didn't decide what to build — I did. But once the specs existed, I tried to stay out of the way.
**I let it run.** Most milestones, I wasn't watching. The agent would pick up the next milestone, run the TDD cycle, pass the gate check (`flutter analyze && flutter test`), and move on. I'd check in periodically, skim the diffs, and keep going. The specs and gates were doing the supervision, not me.
**I stepped in when things broke.** The Accessibility API for paste-at-cursor? That took real investigation — `AXUIElement`, `CGEvent` fallback chains, entitlement flags. The hotkey system crashed three times before we got USB HID key code translation right. These weren't "tell the agent to fix it" moments. These were "read the Apple docs and figure out what's actually wrong" moments. But between those moments — long stretches of autopilot.
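For context, the fallback chain described above roughly amounts to: try Accessibility-based text insertion on the focused element, and if that fails, put the text on the pasteboard and synthesize Cmd+V. A hedged sketch — the function name and structure are mine, not the app's actual code, and it assumes Accessibility permission is already granted:

```swift
import Cocoa
import ApplicationServices

func pasteAtCursor(_ text: String) {
    // 1. Try the Accessibility API: setting the selected text of the
    //    focused element inserts at the cursor.
    let systemWide = AXUIElementCreateSystemWide()
    var focused: CFTypeRef?
    let copyErr = AXUIElementCopyAttributeValue(
        systemWide, kAXFocusedUIElementAttribute as CFString, &focused)
    if copyErr == .success, let focused = focused {
        let setErr = AXUIElementSetAttributeValue(
            focused as! AXUIElement,
            kAXSelectedTextAttribute as CFString,
            text as CFTypeRef)
        // Some apps report .success yet insert nothing — exactly the kind
        // of runtime behavior no model predicts from the headers alone.
        if setErr == .success { return }
    }

    // 2. Fall back to a synthetic Cmd+V against the pasteboard.
    NSPasteboard.general.clearContents()
    NSPasteboard.general.setString(text, forType: .string)
    let src = CGEventSource(stateID: .combinedSessionState)
    let vKey: CGKeyCode = 9 // kVK_ANSI_V
    let down = CGEvent(keyboardEventSource: src, virtualKey: vKey, keyDown: true)
    let up = CGEvent(keyboardEventSource: src, virtualKey: vKey, keyDown: false)
    down?.flags = .maskCommand
    up?.flags = .maskCommand
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)
}
```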
**I made the calls the agent couldn't.** Architecture decisions (BLoC/Cubit, feature-first structure, repository pattern). Priority calls when the agent wanted to gold-plate a settings page while the core pipeline had a race condition. "This is fine, move on" — the most useful sentence in human-on-the-loop development.
## What the Agent Did Well
**The grunt work.** Scaffolding 96 files with consistent architecture. Writing the boilerplate for BLoC states, repository interfaces, DI registration. Generating test files that mirror the `lib` structure. Wiring up HTTP clients to multiple provider APIs.
The agent was also good at following the spec once it existed. With acceptance criteria spelled out as binary pass/fail checks, it could methodically work through a list and not skip items. The TDD cycle (write test → watch it fail → implement → watch it pass → run all gates) kept each milestone clean.
And the gate checks caught real issues. Every milestone, `flutter analyze && flutter test` had to pass before I'd see a review. The agent couldn't hand-wave past a type error. It had to actually fix it.
## What the Agent Did Poorly
**Anything involving platform-specific behavior.** The agent has no mental model of how macOS Accessibility APIs actually behave at runtime. It can write the code, but it can't predict that `AXUIElementSetAttributeValue` will silently fail without the right entitlement. I spent real debugging time on platform channel issues that the agent confidently declared "should work."
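This is the kind of precondition a human has to know to make explicit: macOS Accessibility calls quietly no-op unless the process is trusted. A minimal sketch of surfacing that check up front — this is a generic macOS pattern, not Duckmouth's actual code:

```swift
import ApplicationServices

// Returns true if the app is trusted for Accessibility; otherwise
// prompts the user once via the system dialog. Without this trust,
// AX mutation calls fail silently rather than throwing.
func ensureAccessibilityPermission() -> Bool {
    let options = [
        kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: true
    ]
    return AXIsProcessTrustedWithOptions(options as CFDictionary)
}
```

Note that sandboxing and hardened-runtime entitlements interact with this too, which is part of why "should work" from an agent is not evidence of anything.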
**UI polish.** The agent can implement a design, but it has no taste. Every UI decision that involved "does this feel right" was mine.
## The dev-process-toolkit Difference
I've done AI-assisted weekend projects before, without the toolkit. The difference is stark:
**Without process:** The agent races ahead, skips tests, introduces subtle bugs, and produces code that works on the happy path but falls apart at edges. You spend Monday debugging what the agent shipped on Sunday.
**With process:** Each milestone is gated. Tests exist before implementation. The agent can't skip phases. When something breaks, the spec tells you what should be true, so you can pinpoint where it diverged. Monday is for polish, not triage.
The overhead of writing specs upfront felt like a tax on Saturday afternoon. By Sunday morning, when milestone 20 needed to touch code from milestone 4, those specs were the only reason the agent didn't break things it had forgotten about.
## The Takeaway
"Human on the loop" is not a weaker claim than "human out of the loop." It's a more honest one.
The agent was a force multiplier. It turned a month of evenings into a weekend. But the multiplication only works if you invest upfront — specs, architecture decisions, quality gates — so the agent can run on autopilot most of the time, and you only step in when something actually needs a human.
If you want to try this workflow: dev-process-toolkit is open source. Install it, run `/dev-process-toolkit:setup`, and start with `gate-check` on your existing project. The agent doesn't need to be autonomous. It needs to be accountable.
This is the third article in a series on engineering discipline for AI agents. Previously: Your Agents Run Forever (bounded loops) and I Built a Claude Code Plugin That Stops It from Shipping Broken Code (dev-process-toolkit).