
Nathan Schram

Posted on • Originally published at littlebearapps.com

Dogfooding found 22 bugs my 1,548 tests missed

Last week I found 86 orphaned processes eating 10.3 GB of RAM on my VPS. The week before that, my stall monitor fired because I went for a walk. And my own documentation tool told me my docs were stale.

TL;DR:

  • Real use of three open-source tools found 22 bugs that 1,548 automated tests missed.
  • Bugs cluster in two categories: resource accumulation over time, and gaps between "works" and "works for me".
  • Test suites check states. Dogfooding finds the transitions between them.

None of these would show up in a test suite. I found them because I actually use my own tools - not as a testing practice, just because they solve problems I have. Test suites tell you if something works. Using your own product tells you if it's any good. Those are different questions with different answers. Joel Spolsky described this gap twenty-five years ago - he found 45 bugs in one Sunday afternoon of actually using CityDesk to run his blog. "All the testing we did, meticulously pulling down every menu and seeing if it worked right, didn't uncover the showstoppers."

Dogfooding is the practice of using your own products as your primary tools - not as a scheduled testing exercise, but as part of how you work. In my case that means three tools: Untether, a Telegram bridge for AI coding agents; PitchDocs, a documentation generator for code repositories; and Outlook Assistant, an MCP server that gives AI assistants access to Outlook email, calendar, and contacts.

All three are open source, and all three exist because I built them for myself. Untether and PitchDocs I use every day. Outlook Assistant I pull out when the job calls for it - digging through inbox, sent, archived, and deleted folders to find invoices and receipts at tax time, or trawling through calendar events across linked calendars. Not daily, but when I do use it, I use it hard. And honestly, the bugs I find through real use are the ones that matter most - the ones your users would hit first.

Terminal listing MCP server processes across Claude Code sessions - each session spawns about 14 servers, consuming over 100 MB each

What does daily Untether use actually find?

In 3 days of daily use, I shipped 8 releases and found bugs that 1,548 automated tests missed. The bugs that matter live in the transitions between states - sleeping and waking, busy and stuck, present and away.

I use Untether for basically everything. Voice notes from the couch, approving file changes while making coffee, kicking off test runs from my phone. It's not a tool I built and then test occasionally - it's how I do my job.

Last week I noticed a 41-minute stall in one of my chats had gone completely undetected. A wrangler tail command got stuck, no events were flowing, and Untether just sat there silently. No warning, nothing. I only caught it because I was actually waiting for a result and it never came.

So I built a stall monitor. Seemed simple enough - if no events arrive for 5 minutes, send me a Telegram warning. v0.34.0, shipped, done.

Then the real education started.

The stall monitor that couldn't tell stuck from busy

I ran pytest through Untether and the stall monitor fired 3 times in 10 minutes. The process was alive and working fine, but it just wasn't emitting progress events during tool execution. From the monitor's perspective, silence meant "stuck". In reality, silence meant "busy running your tests".

I had to add /proc diagnostics - CPU usage, memory, TCP connections, file descriptors, child processes - so the monitor could tell the difference between "stuck" and "busy doing something useful". That became v0.34.1, along with a liveness watchdog, progressive warnings, and a JsonlStreamState tracker that remembers recent events in a ring buffer.

Then I closed my laptop overnight. Came back the next morning to find the stall monitor stuck in an infinite loop. The subprocess had died when the laptop went to sleep, but the monitor kept firing "No progress" warnings every 3 minutes - 7 of them stacked up by the time I opened the lid. Each one showing pid=None, process_alive=None because it couldn't even find the process. It just kept warning about a ghost.

So I built dead process detection, a zombie warning cap (3 warnings before auto-cancel, absolute cap at 10), and early PID threading so the monitor knows about the subprocess from spawn, not from the first event. I also made /cancel work as a standalone command without having to reply to the progress message - because on mobile, finding a specific message to reply to when your screen is full of stall warnings is not fun. v0.34.2.

Telegram chat showing 6 stacked stall monitor warnings escalating from 5 to 20 minutes, with a failed /cancel attempt

When "idle" means walking Normi

Then I went for a walk.

Claude was waiting for approval on a file change. I had the inline keyboard showing on my phone - Approve, Deny, Skip - but I was out walking Normi and didn't reply for about 6 minutes. Stall monitor fired. "No progress for 6 min."

That's not a stall. I was just away. The difference between "the process is stuck" and "the human hasn't replied yet" is obvious to a person but invisible to a monitor that only watches event timestamps. Added approval-aware thresholds - 30 minutes when there's an inline keyboard showing, 5 minutes normally.

A long pytest run triggered it again. A 10-minute test suite is not a stall. Built a three-tier threshold system: 5 minutes for normal operation, 10 minutes during active tool execution, 30 minutes during approval waits. v0.34.3.
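The three-tier idea reduces to a small lookup. A hypothetical sketch of the shape of it - the states and the 5/10/30 minute thresholds mirror the post, everything else is invented:

```python
from enum import Enum

class AgentState(Enum):
    IDLE = "idle"                    # normal operation
    TOOL_RUNNING = "tool"            # e.g. a long pytest run
    AWAITING_APPROVAL = "approval"   # inline keyboard shown, human away

# Thresholds from the post: 5 minutes normally, 10 during tool
# execution, 30 while waiting on a human to tap a button.
STALL_THRESHOLDS_SECONDS = {
    AgentState.IDLE: 5 * 60,
    AgentState.TOOL_RUNNING: 10 * 60,
    AgentState.AWAITING_APPROVAL: 30 * 60,
}

def is_stalled(seconds_since_last_event: float, state: AgentState) -> bool:
    """A stall is only a stall relative to what the agent is doing."""
    return seconds_since_last_event > STALL_THRESHOLDS_SECONDS[state]
```

Six minutes of silence is a stall while idle, but perfectly normal while a human is out walking the dog with an approval keyboard open.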

Four releases. One "simple" feature. Each release driven by a real moment where I was actually using the tool and it got something wrong.

86 orphaned processes and 10.3 GB of RAM

While chasing all of this down, I noticed my VPS was getting sluggish. Telegram messages were slow, progress updates felt laggy. I found 86 orphaned MCP server processes eating 10.3 GB of RAM.

Here's what happened: each Claude Code session spawns about 14 MCP server processes - brave-search, context7, apify, jina, github, trello, pal, and so on. My systemd unit file was using KillMode=process, which means when Untether restarts, systemd kills the main Python process but leaves all the children alive. They get reparented to systemd and just sit there, holding memory, doing nothing. I'd been iterating fast - 64 service restarts in one day during the v0.30-v0.33 development cycle. Each restart leaked another 14 processes. They accumulated silently.

One config change to KillMode=control-group and all 10.3 GB came back.
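For context, here's the shape of that change in the unit file - the unit name and ExecStart line are placeholders:

```ini
# /etc/systemd/system/untether.service (name illustrative)
[Service]
ExecStart=/usr/bin/python3 -m untether
# KillMode=process kills only the main process on restart, orphaning
# every MCP server child. control-group (systemd's default) kills the
# whole cgroup, children included.
KillMode=control-group
```

The sharp edge is that `KillMode=process` is easy to cargo-cult into a unit file, and nothing complains - the children just quietly pile up with every restart.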

Then I built a subprocess watchdog to catch a related problem: when a runner subprocess exits but its MCP server children keep stdout pipes open, proc.wait() blocks forever because anyio waits for both process exit and pipe drain. The session just hangs with no completion event. The watchdog polls process liveness with os.kill(pid, 0) instead, gives a 5-second grace period, then kills the orphan process group.
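A sketch of that liveness check. `os.kill(pid, 0)` and `os.killpg` are the real POSIX mechanisms; the function names and grace-period handling are illustrative, not Untether's actual watchdog:

```python
import os
import signal
import time

def process_alive(pid: int) -> bool:
    """Signal 0 performs the existence check without delivering anything."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user

def reap_orphans(pgid: int, grace_seconds: float = 5.0) -> None:
    """If the group leader is gone but children linger, kill the group."""
    if process_alive(pgid):
        return
    time.sleep(grace_seconds)  # give pipes a chance to drain
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # group already gone
```

Polling this instead of `proc.wait()` sidesteps the deadlock: the check doesn't care whether a child is still holding a stdout pipe open.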

None of this shows up in a test suite. The laptop sleep bug requires an actual laptop going to actual sleep. The "went for a walk" edge case requires a human being away from their phone. The orphaned process leak requires 64 restarts in one day of real development. The subprocess pipe deadlock requires actual MCP servers holding actual file descriptors. You can't mock this stuff. You can only find it by living with the tool.

Eight releases in 3 days. 545 new tests (1003 to 1548 total). And a stall monitor that actually works now, because it got tested by my life, not just my test suite. Michael Bolton calls this the difference between testing and checking - automated tests check what you already know to look for, but they can't discover the things you never thought to test.

Every one of these bugs lived in a gap between states. Sleeping and waking. Busy and stuck. Present and away. Tests verify that individual states work. Dogfooding finds the transitions between them - the seams where things actually break.

What happens when your docs tool says your docs are bad?

Running my own documentation tool on its own repo exposed context drift, content filter blocks, and stale docs that test fixtures would never catch. The most embarrassing moment was when PitchDocs told me my own docs were stale.

PitchDocs generates documentation for repos. READMEs, changelogs, roadmaps, security policies, user guides, AI context files. I use it on all my repos. Including PitchDocs itself.

That's where it gets interesting.

I ran /docs-audit on PitchDocs one morning and got a score of... not great. My own documentation tool was telling me my docs were stale. The irony was not lost on me. But that's the point - I wouldn't have noticed without actually running the tool on my own work.

Context drift across 7 files

The bigger discovery came from context file drift. PitchDocs generates AI context files for 7 different tools: CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, .windsurfrules, .clinerules, and GEMINI.md. When I added the platform-profiles skill and the /pitchdocs:platform command, I had to manually update counts in six of those files plus llms.txt. "15 skills" became "16 skills", "12 commands" became "13 commands", across 7 files.

I did it. Then I added another feature and had to do it again. And again.

That friction became Context Guard. First version was a post-commit hook that warns you when AI context files have drifted from the codebase. Then I upgraded it to a two-tier system - a gentle nudge after commits, plus a pre-commit guard that blocks the commit entirely if context files are stale. The whole thing exists because I kept getting bitten by my own documentation going out of sync while I was actively building the tool that's supposed to prevent exactly that problem.

When the content filter blocks your own docs

Then Claude Code's content filter blocked me from generating a CODE_OF_CONDUCT file. PitchDocs was trying to write standard open-source community documents, and the API returned HTTP 400 errors because the content triggered safety filters. The same thing happened with SECURITY.md. I had to build chunked writing workarounds and add a content-filter.md rule with risk levels and mitigations. This only surfaced because I was actually generating these files for real repos, not test fixtures.

The cross-tool compatibility matrix came from real testing. PitchDocs claims to work with 9 AI tools. That claim exists because I actually installed it in Cursor, Windsurf, Codex CLI, and Gemini CLI and watched what happened. Each tool had its own quirks. The compatibility docs aren't theoretical - they're field notes from running the same skill files across different environments and documenting where they broke.

The README went through 11 revisions in 11 days. I kept applying PitchDocs to its own README, reading the output, and thinking "no, that's not right". The 4-question test, the lobby principle, the feature benefits extraction with persona inference - all of it came from repeatedly failing to describe my own product well and building features to fix the specific ways it failed.

Look, a documentation tool that doesn't use itself to generate its own docs is just a theory. Running /docs-audit on PitchDocs and getting a mediocre score was embarrassing, but it showed me exactly what to fix. I'd rather be embarrassed than wrong.

What happens when you let AI manage your email?

Using AI for intensive email tasks uncovered 12 bugs in one release cycle, plus two real security vulnerabilities. The safety controls - dry-run mode, rate limiting, recipient allowlist - all came from production failures that slipped through every other layer, not threat modelling.

I don't manage my email through Claude Code every day. But when I need to find something specific - tax receipts scattered across inbox and sent and archived, invoices from three months ago, calendar events linked from shared calendars - that's when I pull out Outlook Assistant. Claude Code can programmatically search across every folder, collate the results, and export what I need. It turns a full afternoon of manual searching into a 10-minute conversation.

The first version had 55 tools. That's what happens when you map every Microsoft Graph API endpoint to its own MCP tool. Read email, search email, list email, get attachment, list attachments, send email, update email, move email, flag email. Then repeat for calendar, contacts, folders, rules, and categories. It worked. The API coverage was thorough.

Then I actually used it in conversation.

55 tools and the context window

55 tools consume a lot of tokens. Each tool has a name, description, and parameter schema that gets loaded into context. I hit token limits in real conversations - not contrived long conversations, just normal "search for that email from last week, read it, draft a reply" workflows. The context window was getting eaten by tool definitions before I could do real work.

I consolidated 55 tools down to 20 using what I called the STRAP pattern - action-parameter consolidation where one tool handles multiple operations through an action parameter. manage-emails with actions like flag, move, categorise, export. manage-calendar with create, update, delete. 64% reduction. About 11,000 tokens saved per turn. I only knew the limit was a problem because I was the one hitting it.

Diagram comparing 55 individual MCP tools consolidated to 20 using the STRAP action-parameter pattern
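The pattern itself is just dispatch on an action parameter. A toy sketch, with hypothetical handlers standing in for the real Graph API calls:

```python
from typing import Any, Callable

# Stand-ins for real Microsoft Graph operations.
def flag_email(message_id: str) -> str:
    return f"flagged {message_id}"

def move_email(message_id: str, folder: str = "Archive") -> str:
    return f"moved {message_id} to {folder}"

ACTIONS: dict[str, Callable[..., str]] = {
    "flag": flag_email,
    "move": move_email,
}

def manage_emails(action: str, **params: Any) -> str:
    """One tool entry point: one schema in context instead of N."""
    try:
        handler = ACTIONS[action]
    except KeyError:
        raise ValueError(
            f"unknown action {action!r}; expected one of {sorted(ACTIONS)}"
        )
    return handler(**params)
```

The token saving comes from the schema, not the code: the model loads one tool definition with an `action` enum instead of a separate name, description, and parameter block per operation.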

Silent API failures and progressive search

Microsoft's $search API silently fails on personal Outlook.com accounts. Not an error - it just returns no results. I found out because I searched for an email I knew existed and got nothing back. Built progressive search: try $search first, fall back to $filter with subject/from matching, then try broader date-range filtering, then full scan. Four strategies, automatic fallback, with a warning message so you know which strategy actually found your email.

Twelve bugs came from real use in the v3.1.0 cycle. A folder that doesn't exist would silently fall back to the inbox instead of telling you it couldn't find it. Asking for count=0 emails would return everything instead of nothing. The "minimal" view mode said "No content" instead of showing a body preview. HTML email bodies weren't detected correctly for certain Content-Types. Conversation export crashed on personal accounts. Calendar events showed UTC timestamps instead of local time. Inbox rule sequences showed internal IDs instead of readable order.

Each of these is a small thing. None of them would fail a test that checks "does the API return a 200?" But they only matter when you're a person trying to get through your email.

Safety controls born from production failures, not documentation

The safety controls tell their own story. Dry-run mode for sending emails. Session rate limiting. Recipient allowlist. None of these came from a threat model document. They came from failures that got past the test suite, past AI agent verification, and past CI checks - and only surfaced when I was actually using the tool on real email.

The first time Claude drafted a reply and I realised it was one approval button away from sending it to the wrong person, I built dry-run mode that afternoon. Not the next sprint. That afternoon. Rate limiting came after a loop scenario almost happened in production - the kind of thing where the AI gets confused and tries to send 50 replies. The test suite didn't catch it because the tests mock the send endpoint. The AI agent review didn't catch it because it looked correct in isolation. CI passed. It only became obvious when I was sitting there watching it happen in real time. The allowlist lets me restrict which addresses Claude can actually send to during testing, because "oops, sent a test email to a client" is not a recoverable mistake.
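Stacked together, the three controls make one small guard that runs before any send. This is a hypothetical sketch of the idea, not Outlook Assistant's actual implementation:

```python
class SendGuard:
    """Dry-run mode, recipient allowlist, and a per-session rate limit."""

    def __init__(self, allowlist: set[str], dry_run: bool = True,
                 max_sends: int = 10):
        self.allowlist = allowlist
        self.dry_run = dry_run
        self.max_sends = max_sends
        self.sent = 0

    def check(self, recipient: str) -> str:
        # Allowlist: "oops, sent a test email to a client" is not recoverable.
        if recipient not in self.allowlist:
            raise PermissionError(f"{recipient} not in allowlist")
        # Rate limit: stops the confused-AI "send 50 replies" loop.
        if self.sent >= self.max_sends:
            raise RuntimeError("session rate limit reached")
        # Dry run: show what would happen instead of doing it.
        if self.dry_run:
            return f"DRY RUN: would send to {recipient}"
        self.sent += 1
        return f"sent to {recipient}"
```

The ordering matters: the allowlist and rate limit run even in dry-run mode, so a misconfiguration surfaces before you ever flip `dry_run=False`.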

That's what dogfooding adds as a layer. You run the test suites. You get the AI agent to stress-test it. You run CI. And then you spend days or weeks actually using it in production, pushing it to its limits, and you find the edge cases that none of those layers caught. The guardrails in Outlook Assistant didn't come from security best practices or compliance requirements. They came from real production use where things went wrong after every automated check had passed.

I also found two real security vulnerabilities through production use - an XSS issue and an information exposure bug - that I fixed and then added CodeQL SAST scanning to catch that class of problem earlier. Those bugs shouldn't have made it as far as they did, and they wouldn't have been found as quickly without me actually using the tool on real email.

What pattern do these bugs share?

Across all three products, the bugs I find through real use cluster into two categories that tests can't reach.

Resource accumulation over time. 86 orphaned processes eating 10.3 GB of RAM. Documentation counts going stale across 7 files. Token budgets getting consumed by tool definitions before the real work starts. These problems are invisible in short test runs. They only appear after hours or days of real use.

The gap between "works" and "works for me." A stall monitor that can't tell the difference between a stuck process and a person walking their dog. An email search that returns nothing because Microsoft's API silently fails on personal accounts. A documentation tool that can't generate a CODE_OF_CONDUCT because the content filter blocks it. These aren't bugs in the traditional sense. They're mismatches between what the code does and what the person using it needs.

| Bug type | What tests check | What dogfooding found |
| --- | --- | --- |
| Resource leaks | Memory per isolated test run | 86 processes accumulating over 64 restarts |
| State transitions | Each state in isolation | Gaps between sleeping/waking, busy/stuck, present/away |
| API quirks | Mocked API returns 200 OK | Microsoft search silently returns nothing on personal accounts |
| UX friction | Feature exists and works | Finding a reply button while stall warnings fill your screen |
| Safety gaps | Permissions check passes | Nearly sending email to the wrong person in production |

The common thread across all of these bugs:

  • They only show up after sustained real use, not isolated test runs.
  • They live in the transitions between states, not in the states themselves. (Nancy Leveson found the same pattern in spacecraft accidents - the software worked per spec, but failed at state transitions under real conditions.)
  • They're invisible to automated tests because you can't mock a human walking their dog.
  • They matter more to users than the logic errors your test suite catches.

I think test suites give you a kind of false confidence. You get to 90% coverage and you feel good about it. Martin Fowler made this point years ago - high coverage is useful for finding untested code, but it's "of little use as a numeric statement of how good your tests are." The bugs that actually matter - the ones your users would hit on day one - don't live in test cases. They live in the space between your code and someone's actual life.

That said, I don't dogfood as a deliberate practice. I don't schedule "dogfooding sessions" or maintain a testing protocol. I use these tools because they solve problems I have. The bugs get found as a side effect of genuine use, not deliberate testing. The stall monitor saga happened because I actually rely on Untether to work. The context drift problem surfaced because I actually use PitchDocs on my repos. The dry-run mode exists because I actually use Claude to handle real email tasks.

If you're building something and you don't use it yourself, you're shipping based on hope rather than experience. Jason Fried put it simply: "A good chef is tasting their food as they go." And experience, it turns out, is a much better debugger than pytest.


Common questions about dogfooding

Some things I get asked when I talk about this approach.

What is the difference between dogfooding and automated testing?

Testing checks that code produces expected outputs from known inputs. Dogfooding exposes code to conditions you can't mock - laptop sleep, distracted humans, resources accumulating over days. 8 of the bugs I found in Untether exist in the gaps between states that tests treat as isolated. Both matter, but they catch different classes of problems.

How many bugs does real use catch that tests miss?

Across three products in one month, real use surfaced 22 bugs that automated tests missed entirely. They split into resource accumulation (86 orphaned processes, 10.3 GB of leaked RAM) and works-vs-works-for-me gaps. Tests catch logic errors. Dogfooding catches experience errors.

Do you schedule dedicated dogfooding sessions?

No. I use these tools because they solve real problems - Untether for mobile coding, PitchDocs for repo docs, Outlook Assistant for email. The bugs surface as a side effect of genuine use. Scheduled testing would never reproduce "walked Normi for 6 minutes" as an edge case.

Can dogfooding replace automated testing?

No. Untether has 1,548 automated tests and I run them constantly. Automated tests catch regressions and logic errors reliably. Dogfooding catches a different category - state transitions, resource leaks, UX friction that only appears in real workflows. You need both.

Does dogfooding work differently for solo developers?

Solo developers are their own most demanding user. I put Untether through 64 service restarts in a single day, which revealed 86 orphaned processes. Real development patterns create edge cases that no QA team testing "normal usage" would ever reproduce.


All numbers in this post are verified from GitHub issues, PRs, and commit history. Untether, PitchDocs, and Outlook Assistant are all open source.
