<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nathan Schram</title>
    <description>The latest articles on DEV Community by Nathan Schram (@nathanschram).</description>
    <link>https://dev.to/nathanschram</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3407497%2F7010b19e-8815-427b-9c52-de6b1c7bcf66.png</url>
      <title>DEV Community: Nathan Schram</title>
      <link>https://dev.to/nathanschram</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nathanschram"/>
    <language>en</language>
    <item>
      <title>I voice-code from my phone while walking my dog</title>
      <dc:creator>Nathan Schram</dc:creator>
      <pubDate>Thu, 02 Apr 2026 05:22:14 +0000</pubDate>
      <link>https://dev.to/nathanschram/i-voice-code-from-my-phone-while-walking-my-dog-3d8g</link>
      <guid>https://dev.to/nathanschram/i-voice-code-from-my-phone-while-walking-my-dog-3d8g</guid>
      <description>&lt;p&gt;Last Wednesday afternoon I was at the oval with Normi, my 13-year-old dog, playing tug of war with his favourite rope ball. Between rounds I pulled out my phone, recorded a voice note asking Claude Code to run the full engine test suite across six Telegram chats, and went back to playing. Twenty minutes later, Normi and I were both sitting on the grass, absolutely pooped. I checked Telegram. Claude Code had finished testing, logged the bugs it found, and created GitHub issues for each one. I hadn't typed a single character.&lt;/p&gt;

&lt;p&gt;That's most of my afternoons now.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I spend 2-4 hours a day walking my 13-year-old dog Normi. During those walks, I dictate coding tasks to Claude Code via Telegram voice notes using &lt;a href="https://github.com/littlebearapps/untether" rel="noopener noreferrer"&gt;Untether&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Voice input is roughly 4x faster than typing on a phone (150 WPM speaking vs 40 WPM typing). The walks themselves boost creative output by 60% compared to sitting (Stanford, &lt;a href="https://news.stanford.edu/stories/2014/04/walking-vs-sitting-042414" rel="noopener noreferrer"&gt;176 participants&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;This isn't a novelty. It's how I work every day. Honestly, I get more done on walks than I do at my desk.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simon Willison &lt;a href="https://x.com/simonw/status/1853872615922012186" rel="noopener noreferrer"&gt;put it well&lt;/a&gt; back in November 2024: "Coding while walking the dog is an underrated benefit of AI tooling." He never wrote the detailed post. So here it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: AI coding agents need you at a terminal
&lt;/h2&gt;

&lt;p&gt;AI coding agents are genuinely useful. 71% of developers who regularly use AI agents use Claude Code (&lt;a href="https://www.getpanto.ai/blog/ai-coding-assistant-statistics" rel="noopener noreferrer"&gt;Pragmatic Engineer survey&lt;/a&gt;, 15,000 developers, Feb 2026). 4% of all public GitHub commits are now authored by Claude Code (&lt;a href="https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point" rel="noopener noreferrer"&gt;SemiAnalysis&lt;/a&gt;, Feb 2026).&lt;/p&gt;

&lt;p&gt;They all share one assumption: you're sitting at a terminal.&lt;/p&gt;

&lt;p&gt;Claude Code runs in a terminal. You watch it work. It asks permission to edit files or run commands. You type &lt;code&gt;y&lt;/code&gt; or &lt;code&gt;n&lt;/code&gt;. If you walk away, it stalls. The session just sits there waiting for you to come back and press a key.&lt;/p&gt;

&lt;p&gt;I'm a vibe coder. Not CS-trained, background in sales and ops, 13 years in tech. I run Little Bear Apps in Melbourne and I &lt;a href="https://littlebearapps.com/blog/dogfooding-bugs-tests-cant-find/" rel="noopener noreferrer"&gt;build tools to scratch my own itch&lt;/a&gt;. And I kept finding myself mid-session with Claude Code on my MacBook, absolutely in the zone, when I had to leave. Walk the dog. Go to the shops. Whatever. I hated it. Every time I walked away, the session died.&lt;/p&gt;

&lt;p&gt;I tried the Blink Shell iOS app with tmux and mosh connecting to my VPS. That worked okay - I could at least see the terminal from my phone - but typing on a tiny screen while holding a leash isn't great.&lt;/p&gt;

&lt;p&gt;There are official solutions now. Claude Code Remote Control (February 2026) lets you scan a QR code from the Claude mobile app. Claude Code Channels (March 20, 2026) adds Telegram and Discord support through MCP. Both are Claude-only, text-only, and Channels still pauses at the terminal when it needs permission.&lt;/p&gt;

&lt;p&gt;As of March 2026, none of them support voice input, multiple AI engines, or interactive permission buttons from a phone. I needed a proper remote coding workflow - one where I could speak a task into my phone while walking my dog and have it just... work. Including the permission prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I found takopi and why I rebuilt it
&lt;/h2&gt;

&lt;p&gt;I found &lt;a href="https://github.com/banteg/takopi" rel="noopener noreferrer"&gt;banteg/takopi&lt;/a&gt; in late December 2025. It's a Telegram notifier for AI coding agents, and at first it was an absolute godsend. I could hook up voice-to-text transcription via Telegram, record a voice note, and it would send the task to Claude Code. Brilliant.&lt;/p&gt;

&lt;p&gt;Then I hit the wall.&lt;/p&gt;

&lt;p&gt;Takopi doesn't handle Claude Code's interactive bits. When Claude Code needs permission to run a command, or wants to exit plan mode to implement something, or asks you a question, Takopi just... freezes. The agent sits there waiting for input that never comes. Your Telegram chat goes silent. You don't even know it's stuck unless you check.&lt;/p&gt;

&lt;p&gt;I opened issues. I waited three, maybe four weeks for a response from banteg on the repo. Nothing. The bugs are still there today.&lt;/p&gt;

&lt;p&gt;So I forked it and rebuilt it. &lt;a href="https://github.com/littlebearapps/untether" rel="noopener noreferrer"&gt;Untether&lt;/a&gt; launched in February 2026, and it's been my primary development tool since.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup: what connects to what
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx9dqtyfi9qzy9w5dmxw.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx9dqtyfi9qzy9w5dmxw.webp" alt="Architecture diagram showing the Untether pipeline: Telegram voice note to speech-to-text transcription via API, then to Untether and Claude Code (plus Codex, Gemini, OpenCode, Pi, and Amp) running on a Hetzner VPS, with interactive permissions, live streaming, and two-way file transfer flowing back" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chain looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iPhone&lt;/strong&gt; (Telegram app) -&amp;gt; &lt;strong&gt;Telegram Bot API&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Untether&lt;/strong&gt; (Python, running on my VPS) -&amp;gt; &lt;strong&gt;Claude Code&lt;/strong&gt; (also on my VPS)&lt;/p&gt;

&lt;p&gt;The VPS (virtual private server) matters. I run Untether and Claude Code on a Hetzner server in Germany. Not on my MacBook, not on my home network. This means I don't care if the power's on at home, if my MacBook is sleeping, or if my home internet drops. The VPS is always on. Even if my phone dies mid-walk, the coding agent keeps working. I'll see the results when I get back. (The VPS also runs the infrastructure I wrote about in &lt;a href="https://littlebearapps.com/blog/d1-billing-disaster-circuit-breakers/" rel="noopener noreferrer"&gt;how a D1 billing disaster taught me to build circuit breakers&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/littlebearapps/untether" rel="noopener noreferrer"&gt;Untether&lt;/a&gt; is open source, Python 3.12+, and installs with one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;untether
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key pieces:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice transcription.&lt;/strong&gt; When I record a voice note in Telegram, Untether sends it to a Whisper-compatible endpoint via &lt;a href="https://groq.com" rel="noopener noreferrer"&gt;Groq&lt;/a&gt; for transcription, then passes the text to Claude Code as a task. I don't type on my phone. I talk.&lt;/p&gt;
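&lt;p&gt;For the curious, the transcription call is a standard multipart POST. This is a minimal sketch, not Untether's actual code - the endpoint and model name assume Groq's OpenAI-compatible Whisper API:&lt;/p&gt;

```python
# Sketch of sending a Telegram voice note to a Whisper-compatible endpoint.
# Untether's internals may differ; the URL and model name follow Groq's
# OpenAI-compatible transcription API.
import os

GROQ_TRANSCRIBE_URL = "https://api.groq.com/openai/v1/audio/transcriptions"

def build_transcription_request(audio_path, model="whisper-large-v3"):
    """Return the URL, form fields, and headers for a transcription call."""
    headers = {"Authorization": "Bearer " + os.environ.get("GROQ_API_KEY", "")}
    fields = {"model": model, "response_format": "text"}
    return GROQ_TRANSCRIBE_URL, fields, headers

if __name__ == "__main__":
    # The actual upload is a multipart POST with the .ogg voice note attached,
    # e.g. with the third-party "requests" library:
    #   requests.post(url, headers=headers, data=fields,
    #                 files={"file": open("note.ogg", "rb")})
    url, fields, headers = build_transcription_request("note.ogg")
    print(url, fields["model"])
```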

&lt;p&gt;&lt;strong&gt;Progress streaming.&lt;/strong&gt; As Claude Code works, Untether streams updates to my Telegram chat. Tool calls, file changes, elapsed time. I can watch it think in real time or just check back later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nlj6xbvabwxi8hul59s.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nlj6xbvabwxi8hul59s.webp" alt="Untether streaming progress in Telegram - tool calls, file changes, and working status visible in real time" width="590" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactive permissions.&lt;/strong&gt; This is the part that makes it actually usable away from a terminal. When Claude Code needs to run a command, edit a file, or exit plan mode, Untether shows me inline Telegram buttons. Approve, Deny, or reply with instructions. No terminal required.&lt;/p&gt;

&lt;p&gt;I leave plan mode on and I leave permissions on. I prefer to have some control rather than letting Claude Code just go wild. I built a custom button called "Pause and outline plan" that forces Claude Code to write out a detailed plan before it does anything. In the version I'm about to ship (v0.35.0), I've added a second step after that: Approve, Deny, and a new "Stop and let's discuss" button. Sometimes you want to talk it through before committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple engines.&lt;/strong&gt; Untether isn't locked to Claude Code. It supports Codex, OpenCode, and Pi today, with Gemini CLI and Amp coming. I mostly use Claude Code, but the multi-engine support matters for testing. I have one Telegram chat per engine, and Claude Code can actually switch between them during automated test runs using a &lt;a href="https://github.com/chigwell/telegram-mcp" rel="noopener noreferrer"&gt;Telegram MCP server&lt;/a&gt; I helped fix (we submitted &lt;a href="https://github.com/chigwell/telegram-mcp/pull/77" rel="noopener noreferrer"&gt;a PR&lt;/a&gt; fixing an entity cache bug that broke 87% of operations for session-based users).&lt;/p&gt;

&lt;p&gt;One thing worth knowing if you use multiple engines: each one has its own context file format. Claude Code reads CLAUDE.md, Codex wants AGENTS.md, Gemini has its own thing. If you've only set up context for one engine, the others will still work but they'll take longer to get oriented. Your directory-level context, global context, working directory structure - all of it matters. Get your infrastructure right and Untether works perfectly regardless of which engine you're talking to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does talking beat typing on a phone?
&lt;/h2&gt;

&lt;p&gt;Speaking is roughly 4x faster than typing on a phone screen. 150 words per minute speaking versus about 40 WPM thumb-typing (&lt;a href="https://wisprflow.ai/vibe-coding" rel="noopener noreferrer"&gt;Wispr Flow&lt;/a&gt;). On a walk, with a leash in one hand, that difference matters.&lt;/p&gt;

&lt;p&gt;Speed isn't the real advantage, though. The real advantage is that talking forces you to think out loud, and thinking out loud produces better prompts.&lt;/p&gt;

&lt;p&gt;When I type a task for Claude Code at my desk, I tend to be terse: "refactor the auth middleware." When I'm walking and talking, I naturally add context: "Hey, the auth middleware in Viewpo is getting messy - the session validation is mixed in with the role checking. Can you split those into separate middleware functions? Keep the existing tests passing."&lt;/p&gt;

&lt;p&gt;The voice prompt is longer, more specific, and gives Claude Code more to work with. I'm not trying to be thorough. I'm just talking the way people talk.&lt;/p&gt;

&lt;p&gt;I'm a waffler. I love to talk things out, talk things through, often just to crystallise something for myself as I say it. Claude Code takes that waffle and rearranges it into something structured. It's a surprisingly good loop: I ramble with context, Claude Code extracts the actual task.&lt;/p&gt;

&lt;h3&gt;
  
  
  What voice transcription gets wrong
&lt;/h3&gt;

&lt;p&gt;Honestly? Not much. As long as you speak reasonably loudly and clearly, Groq handles it well. I've only had a couple of times where words genuinely got mangled beyond recognition.&lt;/p&gt;

&lt;p&gt;If I'm mumbling, or doing my neurodiverse ADHD waffle thing where I'm jumping between thoughts mid-sentence, yeah, it can struggle a bit. But Claude Code is pretty good at inferring intent even from imperfect transcription. Most of the time, close enough is close enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you handle AI permissions from your phone?
&lt;/h2&gt;

&lt;p&gt;Running AI coding agents from your phone has a problem: the agent is going to ask you questions. It's going to want permission to delete files, run tests, push code. If you can't respond to those prompts, the session stalls.&lt;/p&gt;

&lt;p&gt;Takopi didn't handle this. You could send a task, but when Claude Code hit a permission prompt, everything just stopped until you got back to a terminal.&lt;/p&gt;

&lt;p&gt;Untether solves this with inline Telegram buttons:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9jbk63jjybkofryei2u.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9jbk63jjybkofryei2u.webp" alt="Plan mode approval buttons in Telegram - Approve, Deny, and Pause and Outline Plan options appear inline" width="590" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan mode&lt;/strong&gt; toggles per-chat. I leave it on. When Claude Code wants to implement a plan, I get buttons: Approve, Deny, or my custom "Pause and outline plan"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approve/Deny buttons&lt;/strong&gt; appear inline when Claude Code needs permission for destructive operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive cooldown&lt;/strong&gt; reduces prompt frequency for repeated similar actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask mode&lt;/strong&gt; lets Claude Code ask me questions through Telegram. I can reply with text or another voice note&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost controls&lt;/strong&gt; with per-run and daily budgets, &lt;code&gt;/usage&lt;/code&gt; breakdowns. Important when you're kicking off tasks and walking away&lt;/li&gt;
&lt;/ul&gt;
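&lt;p&gt;The budget logic is simple to reason about. Here's a hypothetical sketch of the idea - per-run and daily caps checked before a task starts - not Untether's actual implementation:&lt;/p&gt;

```python
# Hypothetical sketch of per-run and daily budget guards, the kind of
# numbers a /usage command would report on. Not Untether's real code.
from dataclasses import dataclass

@dataclass
class BudgetGuard:
    per_run_limit: float   # max USD for a single task
    daily_limit: float     # max USD across all tasks today
    daily_spent: float = 0.0

    def check(self, run_cost: float) -> bool:
        """Return True if the run may proceed under both budgets."""
        if run_cost > self.per_run_limit:
            return False
        if self.daily_spent + run_cost > self.daily_limit:
            return False
        return True

    def record(self, run_cost: float) -> None:
        self.daily_spent += run_cost

guard = BudgetGuard(per_run_limit=2.0, daily_limit=10.0)
print(guard.check(1.5))   # True: within both budgets
guard.record(9.0)
print(guard.check(1.5))   # False: would exceed the daily budget
```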

&lt;p&gt;The "Pause and outline plan" button is one I built for my own workflow. Claude Code in plan mode is a life saver. I'd rather read a plan and approve it than have the agent just start editing files. And in v0.35.0, after Claude Code writes the outline, you get three choices: approve it, deny it, or hit "Stop and let's discuss" if you want to talk it through first.&lt;/p&gt;

&lt;p&gt;This is the part that makes the workflow real rather than theoretical. Without interactive permissions, "code from your phone" means "start a task and hope for the best." With them, I have the same control I'd have at my desk. Just through buttons instead of keystrokes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a real walk looks like
&lt;/h2&gt;

&lt;p&gt;Normi is 13. He's a 16-kilo staffy cross pug cross French bulldog. A little fun-sized potato. Super friendly with everyone, loves people, tolerates cats, and sounds absolutely vicious when he plays. He isn't. He's having the time of his life.&lt;/p&gt;

&lt;p&gt;We go out two or three times a day. Sometimes we do the same tracks and parks we always do, sometimes we explore new ones. I'm outside for probably two to four hours total, depending on the weather and what we're up to. We'll often stop at an oval so Normi can play.&lt;/p&gt;

&lt;p&gt;I call the game "grrrrr" - which is basically the noise Normi makes while playing it. It's tug of war combined with chasey. I got these nearly indestructible dog balls with tug of war ropes on them, and Normi goes absolutely feral for them. He grabs one end, I grab the other, and he growls and shakes his head like he's fighting a crocodile. Then he bolts and wants me to chase him. Then he comes back and wants to do it again. For years I'd try to get him to drop the ball and he'd just stand there growling. "Norman. Come on." Only took me about a decade to realise he didn't want to drop it - he wanted the fight. He's an expert at grrrrr. Arguably he's never lost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nurl1w2oqzrcn2nzwxy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nurl1w2oqzrcn2nzwxy.webp" alt="Normi standing at the oval, looking directly at the camera, ready to go" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Between rounds of grrrrr, while Normi's catching his breath (or more often, while he's pretending he can't hear me calling him back), I pull out my phone and check Telegram. There's usually a response from Claude Code waiting in one of my working directory chats. I read it, record a quick voice note with the next task, put my phone away, and go back to playing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp908lgx2wzs4ma9i72nu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp908lgx2wzs4ma9i72nu.webp" alt="Voice note transcription in Telegram - a 7-second voice note transcribed by Groq and sent to Claude Code as a task" width="590" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five or ten minutes later I check again. Claude Code has been working the whole time. I might have three or four different working directory chats going - one time I had five or six running in parallel, testing bug fixes in the Untether repo while making website updates and working on various other projects at the same time.&lt;/p&gt;

&lt;p&gt;After thirty or forty minutes, Normi and I are both sitting on the grass, absolutely pooped, having a drink. And Claude Code is still working away on the VPS, finishing up the last task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitab41ldmyto70uf0ijh.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitab41ldmyto70uf0ijh.webp" alt="Normi resting on the grass at the oval after playing - tongue out, absolutely pooped" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  This is not work-life balance
&lt;/h3&gt;

&lt;p&gt;Look, I think work-life balance is bullshit. I've never been able to find it. And I don't think most solo devs have either.&lt;/p&gt;

&lt;p&gt;But I can go for two or three walks a day with Normi, in the sun or the rain, and be outside for hours. I don't have to sit in traffic. I don't have to be in some office. I don't have to sit through meetings that could have been emails. I can play grrrrr at the oval and check in on my coding agents between rounds. I can be at Coles, on the bus, in bed at 6am. The VPS doesn't care. Telegram doesn't care. Claude Code keeps running whether I'm watching or not.&lt;/p&gt;

&lt;p&gt;That's living, I guess. To me, anyway.&lt;/p&gt;

&lt;h3&gt;
  
  
  The thinking loop
&lt;/h3&gt;

&lt;p&gt;There's a Stanford study that found walking increases creative output by 60% compared to sitting (&lt;a href="https://news.stanford.edu/stories/2014/04/walking-vs-sitting-042414" rel="noopener noreferrer"&gt;Oppezzo &amp;amp; Schwartz, 2014&lt;/a&gt;, 176 college students across four experiments). I'm not claiming causation for my own work. But I notice it.&lt;/p&gt;

&lt;p&gt;Something about walking with a reasonably clear mind, being outside with Normi, not staring at code - my best task descriptions come out on walks, not at my desk. I think it's because I'm not lost in implementation details. I'm thinking about what I actually want.&lt;/p&gt;

&lt;p&gt;Voice-to-text amplifies this. I'm a talker. I process by talking things through, often just to crystallise something for myself. The walk gives me space to think clearly, I talk it through as a voice note, and Claude Code rearranges my waffle into something structured. The loop works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What doesn't work well?
&lt;/h2&gt;

&lt;p&gt;Honestly, most of it works. But there are a few things worth knowing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice notes are great for intent and convenience, but not always good for precision.&lt;/strong&gt; If I say "the auth middleware needs splitting into two separate functions, keep the tests passing" - that works brilliantly. Dictating actual code syntax is painful no matter how good the transcription is. The trick is prompting the same way you would at your laptop - be descriptive, give context, explain what you want and why. As long as you do that, voice works just as well as typing. The issue is never the voice input. It's being vague.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screen size.&lt;/strong&gt; Reading a 200-line diff on a phone screen isn't great, I'll be honest. I'll skim the progress updates on a walk, approve or deny the obvious stuff, and do a proper review when I get home to a real screen. The agent handles the straightforward decisions - formatting, renames, clear-cut logic changes - and I handle the ones that need thought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mobile signal.&lt;/strong&gt; This is actually one of the best bits about the whole setup. Your AI agents run on the VPS, not your phone. If you lose mobile coverage walking through a dead zone or duck into a building with no signal, the agents keep working. They don't care that your phone went quiet - they're on a server in Germany. When you find coverage again, all the updates are sitting there in Telegram waiting for you. Nothing stalls, nothing breaks. Telegram queues messages beautifully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep architecture sessions.&lt;/strong&gt; If I need to trace through a complicated chain of files or make big architectural decisions, I'll sometimes save that for home with a proper screen. But even then, I've been surprised how far I can get by just being clear in my voice prompts: "Create a plan and save it. Don't implement yet. Let's discuss first." Going back and forth on plans through voice notes genuinely works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transcription and mumbling.&lt;/strong&gt; If I'm not speaking clearly, or doing my ADHD thing where I jump between thoughts mid-sentence, transcription quality drops. Speak clearly and you'll be fine. Mumble and you'll confuse everyone, including the AI.&lt;/p&gt;

&lt;p&gt;The big thing for me is that using Untether means I actually get to enjoy the walks more, not less. I'm not hunched over a tiny keyboard slowly typing out messages. A voice note takes seven seconds, then I'm back to playing with Normi. The rest of the time I'm genuinely present - outside, moving, not staring at a screen. That's the whole point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;If this workflow sounds useful, here's how to try it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install Untether&lt;/strong&gt;: &lt;code&gt;uv tool install untether&lt;/code&gt; (requires Python 3.12+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a Telegram bot&lt;/strong&gt; via &lt;a href="https://t.me/BotFather" rel="noopener noreferrer"&gt;@BotFather&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure&lt;/strong&gt; &lt;code&gt;untether.toml&lt;/code&gt; with your bot token and Claude Code path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register your projects&lt;/strong&gt;: &lt;code&gt;untether init &amp;lt;shortname&amp;gt;&lt;/code&gt; in each repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send your first task&lt;/strong&gt; as a text message or voice note&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need a VPS. You can run Untether on your laptop. But if you want the "my phone can die and work continues" setup, a cheap VPS does the trick. I use Hetzner.&lt;/p&gt;

&lt;p&gt;Untether is free and open source: &lt;a href="https://github.com/littlebearapps/untether" rel="noopener noreferrer"&gt;github.com/littlebearapps/untether&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Normi and I will be at the oval either way. Might as well ship something while we're there.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can I use Untether with AI coding agents other than Claude Code?&lt;/strong&gt;&lt;br&gt;
Yes. Untether supports Claude Code, Codex, OpenCode, and Pi today. Gemini CLI and Amp support are coming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Untether work on Android?&lt;/strong&gt;&lt;br&gt;
Yes. It works through Telegram, which runs on iOS, Android, desktop, and web. The phone doesn't matter, only the Telegram app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Untether free?&lt;/strong&gt;&lt;br&gt;
Yes. It's open source (MIT licence), free to install and use. You'll need your own API keys for the AI coding agent you connect to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How accurate is voice transcription for coding tasks?&lt;/strong&gt;&lt;br&gt;
Good enough for natural language task descriptions. Groq's Whisper-compatible transcription handles conversational English well. Technical terms occasionally get mangled, but Claude Code usually infers the correct intent from context. Speak clearly and you'll be fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the AI agent keep running if my phone dies?&lt;/strong&gt;&lt;br&gt;
Yes, if you're running on a VPS. The agent runs on the server, not your phone. Telegram just delivers the messages. When your phone comes back online, you'll see everything that happened while you were offline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Untether compare to Claude Code Channels?&lt;/strong&gt;&lt;br&gt;
Channels launched in March 2026 and adds Telegram and Discord support for Claude Code through MCP. It's Claude-only and text-only. It still pauses at the terminal for permission prompts. Untether supports four engines today (two more coming), accepts voice notes, and handles permissions with inline Telegram buttons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use voice notes to write code with an AI agent?&lt;/strong&gt;&lt;br&gt;
Yes. Untether transcribes Telegram voice notes using &lt;a href="https://groq.com" rel="noopener noreferrer"&gt;Groq&lt;/a&gt;'s Whisper-compatible endpoint, then passes the text to the AI coding agent as a task. Speaking is roughly 4x faster than typing on a phone (150 WPM vs 40 WPM).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What server specs does Untether need?&lt;/strong&gt;&lt;br&gt;
Minimal. Untether itself is lightweight Python. The AI agent does the heavy lifting. A basic VPS like a Hetzner CX22 is more than enough. You can also run it on your laptop if you don't need the always-on setup.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>telegram</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>My $5/month Cloudflare bill hit $4,868 because of an infinite loop</title>
      <dc:creator>Nathan Schram</dc:creator>
      <pubDate>Tue, 31 Mar 2026 05:11:24 +0000</pubDate>
      <link>https://dev.to/nathanschram/my-5month-cloudflare-bill-hit-4868-because-of-an-infinite-loop-13g8</link>
      <guid>https://dev.to/nathanschram/my-5month-cloudflare-bill-hit-4868-because-of-an-infinite-loop-13g8</guid>
      <description>&lt;p&gt;The invoice said $4,868.00. My Cloudflare account usually costs $5 a month.&lt;/p&gt;

&lt;p&gt;In January 2026, two bugs in two different workers wrote billions of rows to D1. I'm a &lt;a href="https://littlebearapps.com/about" rel="noopener noreferrer"&gt;solo developer&lt;/a&gt; on the Workers Paid plan. I don't have a billing department. I have a credit card and a vague hope that nothing goes catastrophically wrong. That hope cost me 18 days of stress, a near-suspension of my entire account, and a spam folder I should have been checking more carefully.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Two code bugs wrote 4.83 billion rows to Cloudflare D1 in January 2026, generating a ~$4,868 overage on a $5/month account. After 18 days and four escalation channels, Cloudflare waived the full $4,586.64 invoice. I then built a three-layer circuit breaker system so it can't happen again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi22ra5px1zmvkrak32px.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi22ra5px1zmvkrak32px.webp" alt="Bar chart showing D1 write operations spiking to 1.42 billion in January 2026, with two colour-coded bug periods highlighted" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What went wrong with D1?
&lt;/h2&gt;

&lt;p&gt;Two separate bugs, two separate projects, both writing to D1 without anything to stop them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The embedding worker that couldn't stop writing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://littlebearapps.com/builds/semantic-librarian/" rel="noopener noreferrer"&gt;Semantic Librarian&lt;/a&gt; is my Australian heritage records project. 1.4 million historical records from the National Library of Australia's Trove archive, searchable via Workers AI embeddings stored in Vectorize, backed by a D1 database. The worker runs on a cron schedule, processing documents in batches: fetch a batch of records, generate embeddings through Workers AI, write the vectors and metadata to D1, move to the next batch.&lt;/p&gt;

&lt;p&gt;The bug was in the "move to the next batch" part. There was no deduplication check. The worker would process a batch of documents, write the embeddings, and on the next cron tick, process the exact same batch again. No offset tracking. No "already processed" flag. Every cycle wrote the same records. And the next cycle wrote them again.&lt;/p&gt;
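&lt;p&gt;The missing piece was embarrassingly small. A minimal sketch of the fix, with hypothetical names and in-memory stand-ins for KV and D1 rather than the actual worker code: persist a cursor between cron ticks so each run resumes where the last one stopped.&lt;/p&gt;

```typescript
type Doc = { id: number };

// Hypothetical in-memory stand-ins: cursorStore plays the role of KV,
// writtenIds the role of rows landing in D1.
const cursorStore = new Map<string, number>();
const writtenIds: number[] = [];

function runCronTick(records: Doc[], batchSize: number): number {
  // Read the cursor; a missing cursor means "start from the beginning".
  const offset = cursorStore.get("embed-cursor") ?? 0;
  const batch = records.slice(offset, offset + batchSize);
  for (const doc of batch) writtenIds.push(doc.id); // write embedding + metadata
  // Advance the cursor so the NEXT tick starts after this batch.
  cursorStore.set("embed-cursor", offset + batch.length);
  return batch.length;
}
```

&lt;p&gt;Run it over five records in batches of two and the third tick writes one row, the fourth writes zero. An "already processed" flag per record works just as well; the point is that some state has to survive between ticks.&lt;/p&gt;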

&lt;p&gt;For four days, from January 11 to 14, the worker ran on autopilot while I was focused on building other things. I wasn't watching the Cloudflare dashboard. Why would I? The worker was deployed, running on a cron, no errors in the logs.&lt;/p&gt;

&lt;p&gt;3.45 billion D1 writes in four days. Here's how that breaks down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;D1 Writes&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jan 11&lt;/td&gt;
&lt;td&gt;479,873,853&lt;/td&gt;
&lt;td&gt;$479.91&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jan 12&lt;/td&gt;
&lt;td&gt;1,335,107,674&lt;/td&gt;
&lt;td&gt;$1,259.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jan 13&lt;/td&gt;
&lt;td&gt;1,424,638,592&lt;/td&gt;
&lt;td&gt;$1,411.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jan 14&lt;/td&gt;
&lt;td&gt;282,900,856&lt;/td&gt;
&lt;td&gt;$282.65&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Peak day was January 13: 1.42 billion writes in 24 hours. Storage spiked to 10 GB. After I killed the worker and cleaned up, it dropped to 2.4 GB, confirming most of it was duplicate data.&lt;/p&gt;

&lt;p&gt;I didn't notice for four days because the worker was running silently. No errors. No alerts from Cloudflare. No email saying "hey, your D1 writes are 7,000x above normal." Just a worker doing exactly what I told it to do, over and over and over.&lt;/p&gt;

&lt;h3&gt;
  
  
  The harvester without ON CONFLICT
&lt;/h3&gt;

&lt;p&gt;A second project, a GitHub data harvesting tool I was deploying for the first time, had a different version of the same problem. During the initial data seeding phase in early January (Jan 1-4), each scan cycle re-inserted existing records instead of updating them. The INSERT statements had no &lt;code&gt;ON CONFLICT&lt;/code&gt; clause. So every time the harvester ran, it tried to insert records that already existed, and D1 happily accepted every one. About 910 million redundant writes in four days.&lt;/p&gt;

&lt;p&gt;I found this one faster and fixed it on January 5 with proper &lt;code&gt;ON CONFLICT DO UPDATE&lt;/code&gt; clauses. The Semantic Librarian bug started six days later.&lt;/p&gt;
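&lt;p&gt;The shape of that fix, using a hypothetical table rather than the harvester's actual schema: the plain INSERT becomes an upsert, so re-scanning an existing record updates it in place instead of appending a duplicate row.&lt;/p&gt;

```typescript
// Illustrative schema only - the harvester's real tables differ.
const upsertSql = `
  INSERT INTO repos (id, stars, updated_at)
  VALUES (?1, ?2, ?3)
  ON CONFLICT (id) DO UPDATE SET
    stars = excluded.stars,
    updated_at = excluded.updated_at
`;
// In a Worker this would run as something like:
//   await env.DB.prepare(upsertSql).bind(id, stars, now).run();
```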

&lt;p&gt;Between the two bugs: 4.83 billion D1 writes in January. To put that in perspective, my normal usage across all 9 databases is maybe 200 writes per hour. The D1 pricing page says $1 per million rows written beyond the 50 million included. 4.83 billion rows at that rate is $4,779 in write charges alone, plus storage, requests, and AI inference costs that pushed the total to $4,868.&lt;/p&gt;
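&lt;p&gt;The arithmetic is easy to sanity-check. A sketch using D1's posted write rate (with the rounded 4.83 billion figure this gives $4,780; the $4,779 above comes from the exact row count):&lt;/p&gt;

```typescript
// D1 write pricing on the Workers Paid plan: 50M rows/month included,
// then $1 per million rows written. Storage, reads, and AI are extra.
const INCLUDED_ROWS = 50_000_000;
const USD_PER_MILLION_ROWS = 1;

function d1WriteOverageUsd(rowsWritten: number): number {
  const billable = Math.max(0, rowsWritten - INCLUDED_ROWS);
  return (billable / 1_000_000) * USD_PER_MILLION_ROWS;
}
```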

&lt;h2&gt;
  
  
  How do you fight a $4,868 bill on a $5/month account?
&lt;/h2&gt;

&lt;p&gt;Slowly. Across multiple channels. With a detailed audit document and more patience than I thought I had.&lt;/p&gt;

&lt;h3&gt;
  
  
  7 days of silence
&lt;/h3&gt;

&lt;p&gt;On February 1, I submitted support ticket #01953111. I didn't just write "please waive this." I attached a full usage audit as a PDF: daily D1 write counts broken down by project, spike period analysis with exact dates and row counts, root cause analysis for each bug, and a list of every fix and architectural improvement I'd deployed to prevent recurrence.&lt;/p&gt;

&lt;p&gt;I wanted to make it easy for whoever reviewed it. Here's exactly what happened, here's exactly why, and here's what I built to make sure it doesn't happen again. If you're going to ask a company to waive $4,868, you should come prepared.&lt;/p&gt;

&lt;p&gt;No response by February 7. Six days. I sent a follow-up asking if it had been assigned to the billing team. Nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding the right human
&lt;/h3&gt;

&lt;p&gt;On February 7, I posted to the &lt;a href="https://community.cloudflare.com/t/billing-ticket-01953111-d1-write-overage-from-code-bug-seeking-courtesy-credit/889957" rel="noopener noreferrer"&gt;Cloudflare Community Forum&lt;/a&gt;. A CF Community MVP called neiljay responded quickly and pointed me to a post by cherryjimbo (CF MVP '23-'26) who had shared a direct email for the Head of Billing.&lt;/p&gt;

&lt;p&gt;On February 8, I emailed Dmitry Alexeenko (Head of Billing) directly, referencing my ticket number and cherryjimbo's referral.&lt;/p&gt;

&lt;p&gt;Then I waited.&lt;/p&gt;

&lt;p&gt;Marta from support had actually replied on February 11. The case had been raised with Engineering and was on temporary hold. That should have been reassuring.&lt;/p&gt;

&lt;h3&gt;
  
  
  The spam folder that almost killed my account
&lt;/h3&gt;

&lt;p&gt;On February 18, I found an automated email in my junk folder. It was dated February 17. Cloudflare's billing system had sent a suspension warning: paid services would be disabled for the unpaid invoice. I ran a full account audit. R2 object storage and Analytics Engine were already disabled. My 34 workers were still running, the 8 D1 databases were still accessible, KV and Queues were fine. Partial suspension, not full. Not yet.&lt;/p&gt;

&lt;p&gt;The human support team had my case on hold with Engineering, actively working on it. The automated billing system operated on its own timeline and didn't check whether a human being was already handling the dispute. Two parallel systems, zero coordination between them.&lt;/p&gt;

&lt;p&gt;I sent urgent follow-ups to both Marta and Dmitry. Dmitry's autoresponder came back: the Portugal office was closed for Carnival, and he'd included his mobile number for urgent matters. I texted him. That same day, I &lt;a href="https://www.reddit.com/r/CloudFlare/comments/1r7skeq/support_said_my_48k_billing_dispute_was_on_hold/" rel="noopener noreferrer"&gt;posted to Reddit r/CloudFlare&lt;/a&gt;: "Support said my $4.8k billing dispute was on hold, but the automated system just suspended me anyway."&lt;/p&gt;

&lt;p&gt;By that evening, four escalation channels were active: the original support ticket, the community forum post, the direct email to Dmitry, and the Reddit post. Within hours, things moved. Dmitry responded despite the holiday. Akash Das, Director of Customer Support, personally took the case. He'd read my audit document and accurately identified both root causes: the infinite write loop in the heritage records worker and the missing conflict handling in the data harvesting tool. The case was upgraded to urgent priority.&lt;/p&gt;

&lt;h2&gt;
  
  
  Did Cloudflare do the right thing?
&lt;/h2&gt;

&lt;p&gt;Yes. Daniel Anselmo (Technical Support Shift Engineer) confirmed the full waiver on February 19: $4,586.64, invoice IN 56608827. Account unlocked. All services restored. I re-subscribed to the Workers Paid plan and verified everything: 34 workers running, 8 D1 databases accessible, KV, Queues, R2, Analytics Engine all back online.&lt;/p&gt;

&lt;p&gt;The $4,586.64 was the actual invoice total, slightly different from my $4,868 estimate because of how Cloudflare calculates final billing. Either way, the full amount was waived as a one-time courtesy.&lt;/p&gt;

&lt;p&gt;18 days from first ticket to resolution. That feels long when you're living it, and fair when you look back at it. I want to credit the specific people who made the resolution happen: Akash Das (Director of Customer Support) for personally reviewing the case and identifying both technical root causes accurately from my audit. Dmitry Alexeenko (Head of Billing) for responding and escalating despite a public holiday in Portugal. neiljay and cherryjimbo on the Cloudflare Community Forum for pointing me to the right contact when the ticket queue was silent. And Daniel Anselmo for closing it out cleanly.&lt;/p&gt;

&lt;p&gt;My critique isn't of the people. The people were good. It's of the gap between the human support process (which was thorough once it engaged) and the automated billing system (which nearly suspended my entire account while that support was actively investigating my case). Those two systems don't talk to each other fast enough. A billing dispute that's actively being reviewed by the Director of Support should probably not trigger an automated suspension at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why doesn't D1 have write rate limits?
&lt;/h2&gt;

&lt;p&gt;Mine isn't an isolated case.&lt;/p&gt;

&lt;p&gt;In 2025, &lt;a href="https://www.ofsecman.io/post/postmortem-5-000-incident-in-10-seconds-due-to-cloudflare-d1" rel="noopener noreferrer"&gt;ofsecman.io documented a $5,000+ D1 overage&lt;/a&gt; caused by a missing &lt;code&gt;WHERE&lt;/code&gt; clause in an update statement. A single-row update became a full-table update on every incoming request. Over $5,000 in under 10 seconds. Their conclusion was blunt: "Don't ever use Cloudflare D1 as a Database."&lt;/p&gt;

&lt;p&gt;On the Cloudflare Community Forum, &lt;a href="https://community.cloudflare.com/t/d1-database-index-problem-cost-3200/780753" rel="noopener noreferrer"&gt;a first-time database user reported a $3,200 bill&lt;/a&gt; because they didn't set up an index. Their credit card was overdrawn before they noticed anything. "Cloudflare did not give me any notice or reminder."&lt;/p&gt;

&lt;p&gt;Three different bugs, three different accounts, same outcome. D1 charges per row written with no caps, no write rate limits, and no billing alerts granular enough for D1 write operations specifically. Cloudflare's billing notifications exist, but they're not designed to catch a worker writing a billion rows in 24 hours.&lt;/p&gt;

&lt;p&gt;Scale-to-zero billing is D1's selling point. You pay nothing when your database is idle. That's genuinely great for solo developers and small projects, and it's why I chose Cloudflare's stack in the first place. Scale-to-zero also means scale-to-infinity when a bug amplifies, because the same billing model that charges you nothing at rest charges you per operation at scale with no ceiling.&lt;/p&gt;

&lt;p&gt;D1 hit general availability in &lt;a href="https://blog.cloudflare.com/d1-ga-update/" rel="noopener noreferrer"&gt;September 2024&lt;/a&gt;. The billing model shipped before the billing safeguards did. This is a platform maturity gap, not malice, and I expect Cloudflare will address it. Still, it's a gap that has cost at least three people real, documented money, and probably more who paid the bill without writing about it.&lt;/p&gt;

&lt;p&gt;I'm not saying don't use D1. I still use it across &lt;a href="https://littlebearapps.com/projects" rel="noopener noreferrer"&gt;9 databases for multiple projects&lt;/a&gt;. I'm saying don't use it without your own circuit breakers, because the platform doesn't have them yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What did I build to prevent this from happening again?
&lt;/h2&gt;

&lt;p&gt;After the invoice was waived, I spent a week doing nothing except cost safety. I had been building features. Now I was building guardrails. Nine improvements across three tiers of priority, all shipped and deployed. Tier 1 was critical one-line fixes I could push immediately. Tier 2 was the anomaly detection that would have caught the January incident within an hour. Tier 3 was the longer-term monitoring improvements.&lt;/p&gt;

&lt;p&gt;Everything described here lives in an infrastructure SDK I built after the incident. Two TypeScript packages: a consumer SDK that goes into each worker, and an admin backend with a monitoring dashboard and the telemetry pipeline that feeds it. The admin side runs on Cloudflare Pages, so I can check budget state from my phone - a meaningful upgrade from my previous approach of noticing the invoice a month later.&lt;/p&gt;

&lt;p&gt;I open-sourced it because three people hitting the same $4,000+ wall suggests this isn't just my problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update (March 2026):&lt;/strong&gt; The original circuit breaker infrastructure described above (Platform SDKs) worked but was too complex - 10+ workers, 61 D1 migrations, cross-account HMAC forwarding. I've since replaced it with &lt;a href="https://littlebearapps.com/builds/cf-monitor/" rel="noopener noreferrer"&gt;CF Monitor&lt;/a&gt;, a much simpler rewrite: one worker per account, Analytics Engine + KV only, zero D1. It's open source and available as an npm package (&lt;code&gt;@littlebearapps/cf-monitor&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Each worker imports the consumer SDK, wraps its environment bindings on startup, and the tracking happens automatically. No per-project instrumentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three layers of circuit breakers
&lt;/h3&gt;

&lt;p&gt;The circuit breakers work at three levels of granularity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature-level&lt;/strong&gt; is the most precise. Each distinct function in each project gets its own budget. A GitHub scanner, a document embedder, an API endpoint - each has a defined daily limit for D1 writes, KV operations, Workers AI neurons, whatever resources it consumes. If the document embedder goes haywire, it gets disabled. The GitHub scanner keeps running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project-level&lt;/strong&gt; aggregates all features for a project. Individual features might stay within their budgets while the project total is too high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global emergency stop&lt;/strong&gt; is the nuclear option. It kills everything across all projects immediately. I haven't had to use it. I hope I never do.&lt;/p&gt;

&lt;p&gt;Each level enforces its budget progressively: 70% triggers a Slack warning, 90% a critical alert, and 100% auto-disables the feature. The breakers auto-reset after 1 hour via KV TTL, so a tripped feature doesn't sit dead until I manually re-enable it at 3am.&lt;/p&gt;
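&lt;p&gt;The enforcement rule at any one level is deliberately boring. A sketch with illustrative names - the real SDK reads budgets from a config file and persists breaker state in KV:&lt;/p&gt;

```typescript
type BreakerAction = "ok" | "warn" | "critical" | "trip";

// Compare usage against budget and pick the progressive response.
function evaluateBudget(used: number, budget: number): BreakerAction {
  const ratio = used / budget;
  if (ratio >= 1.0) return "trip";     // auto-disable the feature
  if (ratio >= 0.9) return "critical"; // critical alert
  if (ratio >= 0.7) return "warn";     // Slack warning
  return "ok";
}
```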

&lt;h3&gt;
  
  
  Counting writes before they become a bill
&lt;/h3&gt;

&lt;p&gt;The tracking can't use D1 writes to count D1 writes. That would be self-defeating. If my monitoring system writes usage data to D1, it's consuming the exact resource it's trying to protect. The January incident itself proved this: my original monitoring infrastructure was writing ~200 rows per hour to D1 just to track usage across all projects. That's not a lot in isolation, but the principle is wrong.&lt;/p&gt;

&lt;p&gt;The consumer SDK wraps your Cloudflare environment bindings with proxies that automatically count every operation. D1 reads and writes, KV gets and puts, R2 uploads, Workers AI inference calls, Vectorize queries - all tracked transparently. When your worker calls &lt;code&gt;env.DB.prepare(...).run()&lt;/code&gt;, the proxy intercepts it, increments a counter, and forwards the call. Your code doesn't change. You call &lt;code&gt;createTrackedEnv(env)&lt;/code&gt; at startup and the counting happens behind the scenes.&lt;/p&gt;
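&lt;p&gt;The mechanism is a plain JavaScript Proxy. A minimal sketch - the real createTrackedEnv is more thorough (per-binding budgets, queue flushing), and the names here are illustrative:&lt;/p&gt;

```typescript
// Operation counts keyed by "BINDING.method", e.g. "DB.prepare".
const opCounts = new Map<string, number>();

function track<T extends object>(binding: T, name: string): T {
  return new Proxy(binding, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value !== "function") return value;
      return (...args: unknown[]) => {
        // Increment the counter, then forward the original call untouched.
        const key = `${name}.${String(prop)}`;
        opCounts.set(key, (opCounts.get(key) ?? 0) + 1);
        return value.apply(target, args);
      };
    },
  });
}
```

&lt;p&gt;Wrap a binding once at startup and every method call on it is counted before being forwarded; the calling code never changes.&lt;/p&gt;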

&lt;p&gt;The counters get flushed to &lt;a href="https://developers.cloudflare.com/analytics/analytics-engine/" rel="noopener noreferrer"&gt;Analytics Engine&lt;/a&gt; via a Cloudflare Queue. Analytics Engine is free for the first 25 million data points per month and it's designed for exactly this kind of high-volume telemetry. Zero D1 write overhead for the tracking itself.&lt;/p&gt;

&lt;p&gt;The budget checker queries Analytics Engine roughly every 30 seconds, sums up recent writes per feature, and compares them against the budget defined in a YAML config file. If a feature crosses 70%, Slack warning. 90%, critical alert. 100%, the feature's circuit breaker trips and the SDK starts rejecting operations for that feature until the breaker resets.&lt;/p&gt;

&lt;p&gt;Detection latency: about 30 seconds from a write happening to the circuit breaker evaluating it. In January, my runaway worker ran for four days. Now it would run for about 30 seconds before getting shut down automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  The monitoring that ties it together
&lt;/h3&gt;

&lt;p&gt;On top of the circuit breakers sit the nine hardening improvements, spread across three tiers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Auto-reset circuit breakers via KV TTL (1-hour expiry, no manual intervention needed)&lt;/li&gt;
&lt;li&gt;Workers AI cost monitoring added to the sentinel (previously untracked)&lt;/li&gt;
&lt;li&gt;Investigation SQL column fix (the monitoring was querying the wrong column name)&lt;/li&gt;
&lt;li&gt;Hourly D1 write anomaly detection using a 168-hour rolling window with 3-sigma threshold&lt;/li&gt;
&lt;li&gt;Per-project anomaly detection, not just account-wide (previously only checked totals)&lt;/li&gt;
&lt;li&gt;Budget warning thresholds at 70% and 90% with Slack alerts and 1-hour deduplication&lt;/li&gt;
&lt;li&gt;Monthly budget tracking with progressive alerts at 70%, 90%, and exceeded&lt;/li&gt;
&lt;li&gt;Batch resource snapshot inserts, reduced from ~200 individual D1 writes per hour to ~8 batch transactions&lt;/li&gt;
&lt;li&gt;Six missing budget overrides for features that were falling back to overly generous defaults&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All nine shipped and deployed within a week. The batch insert change alone (item 8) cut monitoring D1 write overhead by 96%, from ~200 individual writes per hour down to ~8 batch transactions. The hourly anomaly detection (item 4) would have caught the January spike within its first hour: 480 million writes in a single day is roughly 15,000 standard deviations above my normal baseline of ~200 writes per hour. The 3-sigma threshold would have tripped before the first hour was up.&lt;/p&gt;
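&lt;p&gt;The anomaly check itself (item 4) is a few lines of statistics. A sketch, assuming the window holds 168 hourly write counts:&lt;/p&gt;

```typescript
// Flag the current hour if it sits more than 3 standard deviations
// above the mean of a 168-hour (7-day) rolling window of per-hour
// D1 write counts.
function isAnomalous(hourly: number[], currentHour: number): boolean {
  const mean = hourly.reduce((a, b) => a + b, 0) / hourly.length;
  const variance =
    hourly.reduce((a, b) => a + (b - mean) ** 2, 0) / hourly.length;
  const sigma = Math.sqrt(variance);
  // A perfectly flat window has sigma 0; treat any increase as anomalous.
  if (sigma === 0) return currentHour > mean;
  return (currentHour - mean) / sigma > 3;
}
```

&lt;p&gt;With a baseline in the low hundreds of writes per hour, a spike in the tens of millions per hour clears a 3-sigma threshold instantly; the week-long window exists so normal daily rhythms don't trip it.&lt;/p&gt;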

&lt;h2&gt;
  
  
  What pattern do serverless billing failures share?
&lt;/h2&gt;

&lt;p&gt;Three cases. Same billing model. Same outcome.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;My infinite loop&lt;/th&gt;
&lt;th&gt;ofsecman.io&lt;/th&gt;
&lt;th&gt;CF Community&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Root cause&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Missing deduplication&lt;/td&gt;
&lt;td&gt;Missing WHERE clause&lt;/td&gt;
&lt;td&gt;Missing index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 days&lt;/td&gt;
&lt;td&gt;10 seconds&lt;/td&gt;
&lt;td&gt;Days (unclear)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$4,868&lt;/td&gt;
&lt;td&gt;$5,000+&lt;/td&gt;
&lt;td&gt;$3,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warning from CF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common denominator isn't the code. Bugs happen. I wrote about this in &lt;a href="https://littlebearapps.com/blog/dogfooding-bugs-tests-cant-find" rel="noopener noreferrer"&gt;my last post on dogfooding&lt;/a&gt;: the bugs that matter most are the ones that live in the gaps between states, not in the states themselves. A worker that runs correctly once will also run correctly a billion times. The bug isn't in the execution; it's in the assumption that anything would stop it.&lt;/p&gt;

&lt;p&gt;The common denominator is that D1's billing model has no safety net between "working correctly" and "catastrophic overage." No write rate limit. No anomaly detection. No automatic pause when usage spikes 10,000x above normal. The billing system faithfully counts every row, generates an invoice, and sends it to your credit card.&lt;/p&gt;

&lt;p&gt;Every serverless database with per-operation billing has this exposure. D1 isn't unique in charging per write. Most managed databases give you some combination of connection pools, query timeouts, billing caps, or at minimum a usage alert that fires before you hit four figures. D1 currently offers none of those for write operations.&lt;/p&gt;

&lt;p&gt;If you're building on D1 in production, build your own circuit breakers. I did. The infrastructure described in this post took about a week to build and deploy. The January invoice would have taken me considerably longer to pay off. That's a pretty clear cost-benefit calculation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common questions about D1 billing and cost protection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can you set a billing cap on Cloudflare D1?
&lt;/h3&gt;

&lt;p&gt;No. As of March 2026, Cloudflare doesn't offer a hard billing cap for D1 write operations. You can set up billing notifications, but they're not granular enough to catch a worker writing a billion rows overnight. Application-level circuit breakers are currently the only option.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you detect a runaway worker before the bill arrives?
&lt;/h3&gt;

&lt;p&gt;Monitor D1 write counts at sub-hourly intervals. I use Analytics Engine (free tier, 25 million events per month) to track every D1 write via proxied environment bindings, with a budget checker that evaluates every 30 seconds. Anomaly detection with a 168-hour rolling window catches spikes that exceed 3 standard deviations from normal. The whole system adds zero D1 write overhead because telemetry goes through Analytics Engine, not D1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is this just a D1 problem?
&lt;/h3&gt;

&lt;p&gt;The billing exposure exists on any serverless platform with per-operation pricing and no rate limits. D1 is the most visible example right now because it's relatively new (GA 2024) and write-heavy workloads can accumulate cost fast. DynamoDB, Firestore, and PlanetScale all have their own versions of this risk, though most offer billing alerts or auto-scaling limits that D1 currently lacks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The timeline, dollar amounts, and technical details in this post are reconstructed from &lt;a href="https://community.cloudflare.com/t/billing-ticket-01953111-d1-write-overage-from-code-bug-seeking-courtesy-credit/889957" rel="noopener noreferrer"&gt;support ticket #01953111&lt;/a&gt;, &lt;a href="https://www.reddit.com/r/CloudFlare/comments/1r7skeq/" rel="noopener noreferrer"&gt;Reddit r/CloudFlare&lt;/a&gt;, Cloudflare dashboard data, and internal audit documents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloudflare</category>
      <category>serverless</category>
      <category>buildinpublic</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Dogfooding found 22 bugs my 1,548 tests missed</title>
      <dc:creator>Nathan Schram</dc:creator>
      <pubDate>Thu, 19 Mar 2026 04:28:20 +0000</pubDate>
      <link>https://dev.to/nathanschram/dogfooding-found-22-bugs-my-1548-tests-missed-31m4</link>
      <guid>https://dev.to/nathanschram/dogfooding-found-22-bugs-my-1548-tests-missed-31m4</guid>
      <description>&lt;p&gt;Last week I found 86 orphaned processes eating 10.3 GB of RAM on my VPS. The week before that, my stall monitor fired because I went for a walk. And my own documentation tool told me my docs were stale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real use of three open-source tools found 22 bugs that 1,548 automated tests missed.&lt;/li&gt;
&lt;li&gt;Bugs cluster in two categories: resource accumulation over time, and gaps between "works" and "works for me".&lt;/li&gt;
&lt;li&gt;Test suites check states. Dogfooding finds the transitions between them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;None of these would show up in a test suite. I found them because I actually use my own tools - not as a testing practice, just because they solve problems I have. Test suites tell you if something works. Using your own product tells you if it's any good. Those are different questions with different answers. Joel Spolsky &lt;a href="https://www.joelonsoftware.com/2001/05/05/what-is-the-work-of-dogs-in-this-country/" rel="noopener noreferrer"&gt;described this gap&lt;/a&gt; twenty-five years ago - he found 45 bugs in one Sunday afternoon of actually using CityDesk to run his blog. "All the testing we did, meticulously pulling down every menu and seeing if it worked right, didn't uncover the showstoppers."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food" rel="noopener noreferrer"&gt;Dogfooding&lt;/a&gt;&lt;/strong&gt; is the practice of using your own products as your primary tools - not as a scheduled testing exercise, but as part of how you work. &lt;strong&gt;&lt;a href="https://github.com/littlebearapps/untether" rel="noopener noreferrer"&gt;Untether&lt;/a&gt;&lt;/strong&gt; is a Telegram bridge for AI coding agents. &lt;strong&gt;&lt;a href="https://github.com/littlebearapps/pitchdocs" rel="noopener noreferrer"&gt;PitchDocs&lt;/a&gt;&lt;/strong&gt; is a documentation generator for code repositories. &lt;strong&gt;&lt;a href="https://github.com/littlebearapps/outlook-assistant" rel="noopener noreferrer"&gt;Outlook Assistant&lt;/a&gt;&lt;/strong&gt; is an &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; that gives AI assistants access to Outlook email, calendar, and contacts.&lt;/p&gt;

&lt;p&gt;I run three open-source tools that I built for myself. Untether and PitchDocs I use every day. Outlook Assistant I pull out when the job calls for it - digging through inbox, sent, archived, and deleted folders to find invoices and receipts for tax time, or trawling through calendar events across linked calendars. Not daily, but when I do use it, I use it hard. And honestly, the bugs I find through real use are the ones that matter most - the ones your users would hit first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskpdpeihwdev2d0bfk6d.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskpdpeihwdev2d0bfk6d.webp" alt="Terminal listing MCP server processes across Claude Code sessions - each session spawns about 14 servers, consuming over 100 MB each" width="800" height="643"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What does daily Untether use actually find?
&lt;/h2&gt;

&lt;p&gt;In 3 days of daily use, I shipped 8 releases and found bugs that 1,548 automated tests missed. The bugs that matter live in the transitions between states - sleeping and waking, busy and stuck, present and away.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://github.com/littlebearapps/untether" rel="noopener noreferrer"&gt;Untether&lt;/a&gt; for basically everything. Voice notes from the couch, approving file changes while making coffee, kicking off test runs from my phone. It's not a tool I built and then test occasionally - it's how I do my job.&lt;/p&gt;

&lt;p&gt;Last week I noticed a 41-minute stall in one of my chats had gone completely undetected. A wrangler tail command got stuck, no events were flowing, and Untether just sat there silently. No warning, nothing. I only caught it because I was actually waiting for a result and it never came.&lt;/p&gt;

&lt;p&gt;So I &lt;a href="https://github.com/littlebearapps/untether/issues/92" rel="noopener noreferrer"&gt;built a stall monitor&lt;/a&gt;. Seemed simple enough - if no events arrive for 5 minutes, send me a Telegram warning. v0.34.0, shipped, done.&lt;/p&gt;

&lt;p&gt;Then the real education started.&lt;/p&gt;

&lt;h3&gt;
  
  
  The stall monitor that couldn't tell stuck from busy
&lt;/h3&gt;

&lt;p&gt;I ran pytest through Untether and the stall monitor fired 3 times in 10 minutes. The process was alive and working fine, but it just wasn't emitting progress events during tool execution. From the monitor's perspective, silence meant "stuck". In reality, silence meant "busy running your tests".&lt;/p&gt;

&lt;p&gt;I had to add /proc diagnostics - CPU usage, memory, TCP connections, file descriptors, child processes - so the monitor could tell the difference between "stuck" and "busy doing something useful." That became v0.34.1. Along with a liveness watchdog, progressive warnings, and a JsonlStreamState tracker that remembers recent events in a ring buffer. The first version couldn't tell silence from activity.&lt;/p&gt;

&lt;p&gt;Then I closed my laptop overnight. Came back the next morning to find the &lt;a href="https://github.com/littlebearapps/untether/issues/99" rel="noopener noreferrer"&gt;stall monitor stuck in an infinite loop&lt;/a&gt;. The subprocess had died when the laptop went to sleep, but the monitor kept firing "No progress" warnings every 3 minutes - 7 of them stacked up by the time I opened the lid. Each one showing &lt;code&gt;pid=None, process_alive=None&lt;/code&gt; because it couldn't even find the process. It just kept warning about a ghost.&lt;/p&gt;

&lt;p&gt;So I built dead process detection, a zombie warning cap (3 warnings before auto-cancel, absolute cap at 10), and early PID threading so the monitor knows about the subprocess from spawn, not from the first event. I also made /cancel work as a standalone command without having to reply to the progress message - because on mobile, finding a specific message to reply to when your screen is full of stall warnings is not fun. v0.34.2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnke4seos74b55utjggh2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnke4seos74b55utjggh2.webp" alt="Telegram chat showing 6 stacked stall monitor warnings escalating from 5 to 20 minutes, with a failed /cancel attempt" width="590" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When "idle" means walking Normi
&lt;/h3&gt;

&lt;p&gt;Then I went for a walk.&lt;/p&gt;

&lt;p&gt;Claude was waiting for approval on a file change. I had the inline keyboard showing on my phone - Approve, Deny, Skip - but I was out walking Normi and didn't reply for about 6 minutes. Stall monitor fired. "No progress for 6 min."&lt;/p&gt;

&lt;p&gt;That's not a stall. I was just away. The difference between "the process is stuck" and "the human hasn't replied yet" is obvious to a person but invisible to a monitor that only watches event timestamps. Added approval-aware thresholds - 30 minutes when there's an inline keyboard showing, 5 minutes normally.&lt;/p&gt;

&lt;p&gt;A long pytest run triggered it again. A 10-minute test suite is not a stall. Built a &lt;a href="https://github.com/littlebearapps/untether/issues/105" rel="noopener noreferrer"&gt;three-tier threshold system&lt;/a&gt;: 5 minutes for normal operation, 10 minutes during active tool execution, 30 minutes during approval waits. v0.34.3.&lt;/p&gt;
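&lt;p&gt;The selection rule is trivial once the monitor can see what the session is doing. Untether itself is Python; this sketch uses illustrative names and just shows the logic:&lt;/p&gt;

```typescript
type SessionState = "idle" | "tool_running" | "awaiting_approval";

// Stall threshold depends on session state, not on silence alone.
function stallThresholdMinutes(state: SessionState): number {
  switch (state) {
    case "awaiting_approval":
      return 30; // a human hasn't replied yet - not a stall
    case "tool_running":
      return 10; // a long pytest run is not a stall either
    default:
      return 5; // normal operation
  }
}
```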

&lt;p&gt;Four releases. One "simple" feature. Each release driven by a real moment where I was actually using the tool and it got something wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  86 orphaned processes and 10.3 GB of RAM
&lt;/h3&gt;

&lt;p&gt;While chasing all of this down, I noticed my VPS was getting sluggish. Telegram messages were slow, progress updates felt laggy. I found &lt;a href="https://github.com/littlebearapps/untether/issues/88" rel="noopener noreferrer"&gt;86 orphaned MCP server processes&lt;/a&gt; eating 10.3 GB of RAM.&lt;/p&gt;

&lt;p&gt;Here's what happened: each Claude Code session spawns about 14 MCP server processes - brave-search, context7, apify, jina, github, trello, pal, and so on. My systemd unit file was using &lt;a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html" rel="noopener noreferrer"&gt;&lt;code&gt;KillMode=process&lt;/code&gt;&lt;/a&gt;, which means when Untether restarts, systemd kills the main Python process but leaves all the children alive. They get reparented to systemd and just sit there, holding memory, doing nothing. I'd been iterating fast - 64 service restarts in one day during the v0.30-v0.33 development cycle. Each restart leaked another 14 processes. They accumulated silently.&lt;/p&gt;

&lt;p&gt;One config change to &lt;code&gt;KillMode=control-group&lt;/code&gt; and all 10.3 GB came back.&lt;/p&gt;

&lt;p&gt;Then I built a &lt;a href="https://github.com/littlebearapps/untether/issues/91" rel="noopener noreferrer"&gt;subprocess watchdog&lt;/a&gt; to catch a related problem: when a runner subprocess exits but its MCP server children keep stdout pipes open, &lt;code&gt;proc.wait()&lt;/code&gt; blocks forever because anyio waits for both process exit and pipe drain. The session just hangs with no completion event. The watchdog polls process liveness with &lt;code&gt;os.kill(pid, 0)&lt;/code&gt; instead, gives a 5-second grace period, then kills the orphan process group.&lt;/p&gt;

&lt;p&gt;None of this shows up in a test suite. The laptop sleep bug requires an actual laptop going to actual sleep. The "went for a walk" edge case requires a human being away from their phone. The orphaned process leak requires 64 restarts in one day of real development. The subprocess pipe deadlock requires actual MCP servers holding actual file descriptors. You can't mock this stuff. You can only find it by living with the tool.&lt;/p&gt;

&lt;p&gt;Eight releases in 3 days. 545 new tests (1,003 to 1,548 total). And a stall monitor that actually works now, because it got tested by my life, not just my test suite. Michael Bolton &lt;a href="https://developsense.com/blog/2009/08/testing-vs-checking" rel="noopener noreferrer"&gt;calls this the difference&lt;/a&gt; between &lt;em&gt;testing&lt;/em&gt; and &lt;em&gt;checking&lt;/em&gt; - automated tests check what you already know to look for, but they can't discover the things you never thought to test.&lt;/p&gt;

&lt;p&gt;Every one of these bugs lived in a gap between states. Sleeping and waking. Busy and stuck. Present and away. Tests verify that individual states work. Dogfooding finds the transitions between them - the seams where things actually break.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when your docs tool says your docs are bad?
&lt;/h2&gt;

&lt;p&gt;Running my own documentation tool on its own repo exposed context drift, content filter blocks, and stale docs that test fixtures would never catch. The most embarrassing moment was when PitchDocs told me my own docs were stale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/littlebearapps/pitchdocs" rel="noopener noreferrer"&gt;PitchDocs&lt;/a&gt; generates documentation for repos. READMEs, changelogs, roadmaps, security policies, user guides, AI context files. I use it on all my repos. Including PitchDocs itself.&lt;/p&gt;

&lt;p&gt;That's where it gets interesting.&lt;/p&gt;

&lt;p&gt;I ran &lt;code&gt;/docs-audit&lt;/code&gt; on PitchDocs one morning and got a score of... not great. My own documentation tool was telling me my docs were stale. The irony was not lost on me. But that's the point - I wouldn't have noticed without actually running the tool on my own work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context drift across 7 files
&lt;/h3&gt;

&lt;p&gt;The bigger discovery came from context file drift. PitchDocs generates AI context files for 7 different tools: CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, .windsurfrules, .clinerules, and GEMINI.md. When I added the &lt;code&gt;platform-profiles&lt;/code&gt; skill and the &lt;code&gt;/pitchdocs:platform&lt;/code&gt; command, I had to manually update counts in 6 of those files plus llms.txt. "15 skills" became "16 skills", "12 commands" became "13 commands", across 7 files.&lt;/p&gt;

&lt;p&gt;I did it. Then I added another feature and had to do it again. And again.&lt;/p&gt;

&lt;p&gt;That friction became Context Guard. First version was a post-commit hook that warns you when AI context files have drifted from the codebase. Then I upgraded it to a two-tier system - a gentle nudge after commits, plus a pre-commit guard that blocks the commit entirely if context files are stale. The whole thing exists because I kept getting bitten by my own documentation going out of sync while I was actively building the tool that's supposed to prevent exactly that problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  When the content filter blocks your own docs
&lt;/h3&gt;

&lt;p&gt;Then Claude Code's &lt;a href="https://github.com/littlebearapps/pitchdocs/pull/13" rel="noopener noreferrer"&gt;content filter blocked me&lt;/a&gt; from generating a CODE_OF_CONDUCT file. PitchDocs was trying to write standard open-source community documents, and the API returned HTTP 400 errors because the content triggered safety filters. The same thing happened with SECURITY.md. I had to build chunked writing workarounds and add a content-filter.md rule with risk levels and mitigations. This only surfaced because I was actually generating these files for real repos, not test fixtures.&lt;/p&gt;

&lt;p&gt;The cross-tool compatibility matrix came from real testing. PitchDocs claims to work with 9 AI tools. That claim exists because I actually installed it in Cursor, Windsurf, Codex CLI, and Gemini CLI and watched what happened. Each tool had its own quirks. The compatibility docs aren't theoretical - they're field notes from running the same skill files across different environments and documenting where they broke.&lt;/p&gt;

&lt;p&gt;The README went through 11 revisions in 11 days. I kept applying PitchDocs to its own README, reading the output, and thinking "no, that's not right". The 4-question test, the lobby principle, the feature benefits extraction with persona inference - all of it came from repeatedly failing to describe my own product well and building features to fix the specific ways it failed.&lt;/p&gt;

&lt;p&gt;Look, a documentation tool that doesn't use itself to generate its own docs is just a theory. Running &lt;code&gt;/docs-audit&lt;/code&gt; on PitchDocs and getting a mediocre score was embarrassing, but it showed me exactly what to fix. I'd rather be embarrassed than wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when you let AI manage your email?
&lt;/h2&gt;

&lt;p&gt;Using AI for intensive email tasks uncovered 12 bugs in one release cycle, plus two real security vulnerabilities. The safety controls - dry-run mode, rate limiting, recipient allowlist - all came from production failures that slipped through every other layer, not threat modelling.&lt;/p&gt;

&lt;p&gt;I don't manage my email through Claude Code every day. But when I need to find something specific - tax receipts scattered across inbox and sent and archived, invoices from three months ago, calendar events linked from shared calendars - that's when I pull out &lt;a href="https://github.com/littlebearapps/outlook-assistant" rel="noopener noreferrer"&gt;Outlook Assistant&lt;/a&gt;. Claude Code can programmatically search across every folder, collate the results, and export what I need. It turns a full afternoon of manual searching into a 10-minute conversation.&lt;/p&gt;

&lt;p&gt;The first version had 55 tools. That's what happens when you map every Microsoft Graph API endpoint to its own MCP tool. Read email, search email, list email, get attachment, list attachments, send email, update email, move email, flag email. Then repeat for calendar, contacts, folders, rules, and categories. It worked. The API coverage was thorough.&lt;/p&gt;

&lt;p&gt;Then I actually used it in conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  55 tools and the context window
&lt;/h3&gt;

&lt;p&gt;55 tools consume a lot of tokens. Each tool has a name, description, and parameter schema that gets loaded into context. I hit token limits in real conversations - not contrived long conversations, just normal "search for that email from last week, read it, draft a reply" workflows. The context window was getting eaten by tool definitions before I could do real work.&lt;/p&gt;

&lt;p&gt;I consolidated 55 tools down to 20 using what I called the STRAP pattern - action-parameter consolidation where one tool handles multiple operations through an action parameter. &lt;code&gt;manage-emails&lt;/code&gt; with actions like flag, move, categorise, export. &lt;code&gt;manage-calendar&lt;/code&gt; with create, update, delete. 64% reduction. About 11,000 tokens saved per turn. I only knew the limit was a problem because I was the one hitting it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz3zidlew73decjre0ya.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz3zidlew73decjre0ya.webp" alt="Diagram comparing 55 individual MCP tools consolidated to 20 using the STRAP action-parameter pattern" width="792" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;
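&lt;p&gt;The shape of the pattern, reduced to a sketch - handler names and payloads are made up for illustration, not Outlook Assistant's real tool surface:&lt;/p&gt;

```python
from typing import Any, Callable

def _flag(message_id: str) -> dict:
    return {"flagged": message_id}

def _move(message_id: str, folder: str) -> dict:
    return {"moved": message_id, "to": folder}

# One tool schema loaded into context instead of one schema per
# operation - that's where the per-turn token savings come from.
_EMAIL_ACTIONS: dict[str, Callable[..., Any]] = {"flag": _flag, "move": _move}

def manage_emails(action: str, **params: Any) -> Any:
    try:
        handler = _EMAIL_ACTIONS[action]
    except KeyError:
        raise ValueError(f"unknown action {action!r}; expected one of {sorted(_EMAIL_ACTIONS)}")
    return handler(**params)
```

&lt;p&gt;The trade-off is a fatter parameter schema for that one tool, but it's far cheaper than 55 separate name/description/schema blocks on every turn.&lt;/p&gt;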

&lt;h3&gt;
  
  
  Silent API failures and progressive search
&lt;/h3&gt;

&lt;p&gt;Microsoft's &lt;code&gt;$search&lt;/code&gt; API &lt;a href="https://github.com/littlebearapps/outlook-assistant/issues/35" rel="noopener noreferrer"&gt;silently fails on personal Outlook.com accounts&lt;/a&gt;. Not an error - it just returns no results. I found out because I searched for an email I knew existed and got nothing back. Built progressive search: try &lt;code&gt;$search&lt;/code&gt; first, fall back to &lt;code&gt;$filter&lt;/code&gt; with subject/from matching, then try broader date-range filtering, then full scan. Four strategies, automatic fallback, with a warning message so you know which strategy actually found your email.&lt;/p&gt;
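&lt;p&gt;The fallback chain is simple to express. A sketch of the idea only - the real strategies call Microsoft Graph, but here they're plain callables:&lt;/p&gt;

```python
from typing import Callable

def progressive_search(strategies: list[tuple[str, Callable[[], list]]]) -> tuple[str, list]:
    """Run each (name, strategy) pair in order and return the first
    non-empty result, tagged with the strategy that produced it, so
    the caller can warn the user when a fallback was needed."""
    for name, run in strategies:
        results = run()
        if results:
            return name, results
    return "none", []
```

&lt;p&gt;Returning the strategy name alongside the hits is what makes the warning message possible: an empty &lt;code&gt;$search&lt;/code&gt; is indistinguishable from "no such email" unless you say which tier found it.&lt;/p&gt;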

&lt;p&gt;Twelve bugs came from real use in the v3.1.0 cycle. A folder that doesn't exist would &lt;a href="https://github.com/littlebearapps/outlook-assistant/issues/46" rel="noopener noreferrer"&gt;silently fall back to the inbox&lt;/a&gt; instead of telling you it couldn't find it. Asking for &lt;code&gt;count=0&lt;/code&gt; emails would return everything instead of nothing. The "minimal" view mode said "No content" instead of showing a body preview. HTML email bodies weren't detected correctly for certain Content-Types. Conversation export crashed on personal accounts. Calendar events showed &lt;a href="https://github.com/littlebearapps/outlook-assistant/issues/51" rel="noopener noreferrer"&gt;UTC timestamps instead of local time&lt;/a&gt;. Inbox rule sequences showed internal IDs instead of readable order.&lt;/p&gt;

&lt;p&gt;Each of these is a small thing. None of them would fail a test that checks "does the API return a 200?" But they only matter when you're a person trying to get through your email.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety controls born from production failures, not documentation
&lt;/h3&gt;

&lt;p&gt;The safety controls tell their own story. Dry-run mode for sending emails. Session rate limiting. Recipient allowlist. None of these came from a threat model document. They came from failures that got past the test suite, past AI agent verification, and past CI checks - and only surfaced when I was actually using the tool on real email.&lt;/p&gt;

&lt;p&gt;The first time Claude drafted a reply and I realised it was one approval button away from sending it to the wrong person, I built dry-run mode that afternoon. Not the next sprint. That afternoon. Rate limiting came after a loop scenario almost happened in production - the kind of thing where the AI gets confused and tries to send 50 replies. The test suite didn't catch it because the tests mock the send endpoint. The AI agent review didn't catch it because it looked correct in isolation. CI passed. It only became obvious when I was sitting there watching it happen in real time. The allowlist lets me restrict which addresses Claude can actually send to during testing, because "oops, sent a test email to a client" is not a recoverable mistake.&lt;/p&gt;

&lt;p&gt;That's what dogfooding adds as a layer. You run the test suites. You get the AI agent to stress-test it. You run CI. And then you spend days or weeks actually using it in production, pushing it to its limits, and you find the edge cases that none of those layers caught. The guardrails in Outlook Assistant didn't come from security best practices or compliance requirements. They came from real production use where things went wrong after every automated check had passed.&lt;/p&gt;

&lt;p&gt;I also found two real security vulnerabilities through production use - an &lt;a href="https://github.com/littlebearapps/outlook-assistant/issues/60" rel="noopener noreferrer"&gt;XSS issue&lt;/a&gt; and an &lt;a href="https://github.com/littlebearapps/outlook-assistant/issues/63" rel="noopener noreferrer"&gt;information exposure bug&lt;/a&gt; - that I fixed and then added &lt;a href="https://github.com/littlebearapps/outlook-assistant/issues/65" rel="noopener noreferrer"&gt;CodeQL SAST scanning&lt;/a&gt; to catch that class of problem earlier. Those bugs shouldn't have made it as far as they did, and they wouldn't have been found as quickly without me actually using the tool on real email.&lt;/p&gt;

&lt;h2&gt;
  
  
  What pattern do these bugs share?
&lt;/h2&gt;

&lt;p&gt;Across all three products, the bugs I find through real use cluster into two categories that tests can't reach.&lt;/p&gt;

&lt;p&gt;Resource accumulation over time. 86 orphaned processes eating 10.3 GB of RAM. Documentation counts going stale across 7 files. Token budgets getting consumed by tool definitions before the real work starts. These problems are invisible in short test runs. They only appear after hours or days of real use.&lt;/p&gt;

&lt;p&gt;The gap between "works" and "works for me." A stall monitor that can't tell the difference between a stuck process and a person walking their dog. An email search that returns nothing because Microsoft's API silently fails on personal accounts. A documentation tool that can't generate a CODE_OF_CONDUCT because the content filter blocks it. These aren't bugs in the traditional sense. They're mismatches between what the code does and what the person using it needs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug type&lt;/th&gt;
&lt;th&gt;What tests check&lt;/th&gt;
&lt;th&gt;What dogfooding found&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Resource leaks&lt;/td&gt;
&lt;td&gt;Memory per isolated test run&lt;/td&gt;
&lt;td&gt;86 processes accumulating over 64 restarts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State transitions&lt;/td&gt;
&lt;td&gt;Each state in isolation&lt;/td&gt;
&lt;td&gt;Gaps between sleeping/waking, busy/stuck, present/away.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API quirks&lt;/td&gt;
&lt;td&gt;Mocked API returns 200 OK&lt;/td&gt;
&lt;td&gt;Microsoft search silently returns nothing on personal accounts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UX friction&lt;/td&gt;
&lt;td&gt;Feature exists and works&lt;/td&gt;
&lt;td&gt;Finding a reply button while stall warnings fill your screen.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safety gaps&lt;/td&gt;
&lt;td&gt;Permissions check passes&lt;/td&gt;
&lt;td&gt;Nearly sending email to the wrong person in production.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common thread across all of these bugs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They only show up after sustained real use, not isolated test runs.&lt;/li&gt;
&lt;li&gt;They live in the transitions between states, not in the states themselves. (Nancy Leveson &lt;a href="http://sunnyday.mit.edu/papers/jsr.pdf" rel="noopener noreferrer"&gt;found the same pattern&lt;/a&gt; in spacecraft accidents - the software worked per spec, but failed at state transitions under real conditions.)&lt;/li&gt;
&lt;li&gt;They're invisible to automated tests because you can't mock a human walking their dog.&lt;/li&gt;
&lt;li&gt;They matter more to users than the logic errors your test suite catches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think test suites give you a kind of false confidence. You get to 90% coverage and you feel good about it. Martin Fowler &lt;a href="https://martinfowler.com/bliki/TestCoverage.html" rel="noopener noreferrer"&gt;made this point years ago&lt;/a&gt; - high coverage is useful for finding untested code, but it's "of little use as a numeric statement of how good your tests are." The bugs that actually matter - the ones your users would hit on day one - don't live in test cases. They live in the space between your code and someone's actual life.&lt;/p&gt;

&lt;p&gt;That said, I don't dogfood as a deliberate practice. I don't schedule "dogfooding sessions" or maintain a testing protocol. I use these tools because they solve problems I have. The bugs get found as a side effect of genuine use, not deliberate testing. The stall monitor saga happened because I actually rely on Untether to work. The context drift problem surfaced because I actually use PitchDocs on my repos. The dry-run mode exists because I actually use Claude to handle real email tasks.&lt;/p&gt;

&lt;p&gt;If you're building something and you don't use it yourself, you're shipping based on hope rather than experience. Jason Fried &lt;a href="https://37signals.com/podcast/eat-your-own-dogfood/" rel="noopener noreferrer"&gt;put it simply&lt;/a&gt;: "A good chef is tasting their food as they go." And experience, it turns out, is a much better debugger than pytest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common questions about dogfooding
&lt;/h2&gt;

&lt;p&gt;Some things I get asked when I talk about this approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between dogfooding and automated testing?
&lt;/h3&gt;

&lt;p&gt;Testing checks that code produces expected outputs from known inputs. Dogfooding exposes code to conditions you can't mock - laptop sleep, distracted humans, resources accumulating over days. Eight of the bugs I found in Untether exist in the gaps between states that tests treat as isolated. Both matter, but they catch different classes of problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many bugs does real use catch that tests miss?
&lt;/h3&gt;

&lt;p&gt;Across three products in one month, real use surfaced 22 bugs that automated tests missed entirely. They split into resource accumulation (86 orphaned processes, 10.3 GB of leaked RAM) and works-vs-works-for-me gaps. Tests catch logic errors. Dogfooding catches experience errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do you schedule dedicated dogfooding sessions?
&lt;/h3&gt;

&lt;p&gt;No. I use these tools because they solve real problems - Untether for mobile coding, PitchDocs for repo docs, Outlook Assistant for email. The bugs surface as a side effect of genuine use. Scheduled testing would never reproduce "walked Normi for 6 minutes" as an edge case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can dogfooding replace automated testing?
&lt;/h3&gt;

&lt;p&gt;No. Untether has 1,548 automated tests and I run them constantly. Automated tests catch regressions and logic errors reliably. Dogfooding catches a different category - state transitions, resource leaks, UX friction that only appears in real workflows. You need both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does dogfooding work differently for solo developers?
&lt;/h3&gt;

&lt;p&gt;Solo developers are their own most demanding user. I put Untether through 64 service restarts in a single day, which revealed 86 orphaned processes. Real development patterns create edge cases that no QA team testing "normal usage" would ever reproduce.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All numbers in this post are verified from GitHub issues, PRs, and commit history. &lt;a href="https://github.com/littlebearapps/untether" rel="noopener noreferrer"&gt;Untether&lt;/a&gt;, &lt;a href="https://github.com/littlebearapps/pitchdocs" rel="noopener noreferrer"&gt;PitchDocs&lt;/a&gt;, and &lt;a href="https://github.com/littlebearapps/outlook-assistant" rel="noopener noreferrer"&gt;Outlook Assistant&lt;/a&gt; are all open source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>buildinpublic</category>
      <category>testing</category>
      <category>opensource</category>
      <category>devlog</category>
    </item>
  </channel>
</rss>
