אחיה כהן

Posted on Jun 22

My README said 80 tools. My code had 96. Nobody noticed for weeks.

#opensource #documentation #devtools #ai

I run an open-source MCP server. It exposes browser-automation tools to AI coding agents — click this, read that page, fill this form. The README has a big proud table: "80 tools." It's in the tagline. It's in the nav anchor. It's in the alt text of the social-preview image. It's in two comparison tables. It's even in the pre-written tweet text for the share button.

Eighty. Eighty everywhere.

Last week I ran a boring audit — just diffing what the README claims against what the code actually registers at startup. The code registered 96.

Not 80. Ninety-six. The number had been wrong in nine different places, and I'd been the one typing it each time.

How a number rots

Nobody decides to lie in their README. Drift happens one merge at a time.

You ship a feature. It adds three tools. You update the tool list (because that's the part reviewers look at) but you forget the count in the tagline. Next release adds four more. Now the list says one thing and the tagline says another, and both are behind reality. A month of "small, fast" releases later, the gap is sixteen.

The worst part wasn't even the tagline. When I actually counted the entries in the tool list itself, it documented 83 tools. So I had three different truths living in one file:

The marketing number: 80
The documented list: 83
The code: 96

Thirteen tools existed, worked, shipped to npm, and ran on real users' machines — completely undocumented. They weren't in the README at all. If you only read my docs, you didn't know they existed.

The tools you forget are the ones that matter

Here's the part that turned a docs cleanup into a genuinely uncomfortable afternoon.

I expected the undocumented thirteen to be boring helpers — the safari_wait_for_new_tabs of the world. Some were. But four of them were the native input tools: synthetic keyboard and mouse events driven through the OS-level event API (CGEvent), not through JavaScript injected into the page.

Those are the most powerful tools in the whole project. JavaScript-level automation is confined to the page by the browser; OS-level input is not. It types into anything that has focus. It's the difference between "fill this <input>" and "press these physical-looking keys at the system level." It's exactly the category a security-conscious user would want to read about before granting Accessibility permissions.

And it was the category I'd never written down.

That's the real lesson, and it's not "keep your docs updated." It's this: documentation drift is selection-biased toward the things you'd least want undocumented. The tools you forget to document are the ones added in a hurry, in a feature branch, under a deadline, and in my experience that correlates with the tools that are powerful, sharp, or security-relevant. The mundane stuff gets documented because it's easy. The dangerous stuff gets a TODO.

If you want to find the riskiest, least-reviewed surface of any project, don't read the docs. Read the gap between the docs and the code.

"Just update the README" is the wrong fix

The tempting move is to just correct the number. Change 80 → 96, add the thirteen missing entries, commit, done. I did that part. But that only treats the symptom.

The number was wrong because it was hand-maintained. Any fact a human types by hand will eventually disagree with reality. The only durable fix is to make the fact impossible to get wrong — derive it from the source of truth instead of restating it.

The irony: my project already had this solved, in one place. The smoke test — contributed by someone else, not me — boots the server over a real stdio transport, asks it how many tools it registered, and asserts against a count derived from the source. No hardcoded number. When you add a tool, the test's expectation moves with it automatically. That test could never have drifted to 80, because it never stored an 80 to drift from.
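Here is a minimal sketch of that idea, not the project's actual smoke test. The expected count is computed from source text rather than written down. The `registerTool("...")` call pattern, the function names, and the sample tool names are all assumptions made for illustration; the real test asks a running server for its count over stdio, where this sketch simply takes a number.

```typescript
// Assumption: tools are registered with calls shaped like registerTool("name", ...).
// This pattern is hypothetical; adapt it to however your server registers tools.
function registeredToolNames(source: string): string[] {
  const pattern = /registerTool\(\s*["']([A-Za-z0-9_]+)["']/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const toolName = match[1];
    if (toolName !== undefined) names.push(toolName);
  }
  return names;
}

// Compare what the server reports with what the source declares.
// Throws on mismatch, so a CI run fails instead of a number quietly drifting.
function assertToolCount(reportedCount: number, source: string): void {
  const expected = registeredToolNames(source).length;
  if (reportedCount !== expected) {
    throw new Error(
      "server reported " + reportedCount + " tools, source declares " + expected
    );
  }
}

// Stand-in for the server's source file (made-up tool names).
const sampleSource = [
  'registerTool("example_click", clickHandler);',
  'registerTool("example_read_page", readHandler);',
  'registerTool("example_fill_form", fillHandler);',
].join("\n");

// 3 stands in for the count a booted server would report over its transport.
assertToolCount(3, sampleSource);
console.log("tool count matches source");
```

Counting by regex is the crudest way to derive the number, since it can match registrations inside comments. Where the registry can be imported directly, counting its entries is likely more robust.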

So the fix isn't "be more disciplined about the README." Discipline is what failed for a month straight. The fix is:

Counts → generate them. A tiny script that reads the registrations and writes the number into the README at build time. The human never types it.
Tool lists → generate them too, ideally from the same registry the server reads at startup. The schema is already structured data. Rendering it as a Markdown table is a formatting problem, not a writing problem.
The prose around them → that's the part humans should actually spend their attention on, because it's the part a generator can't write.
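The first two items can be sketched as one small generator. This is an illustration under assumptions, not the project's build script: the `ToolEntry` shape, the `<!-- tools:start -->` and `<!-- tools:end -->` marker comments, and the function names are hypothetical. A real script would read the README from disk and take the entries from the server's own registry.

```typescript
// Hypothetical shape of one registry entry; a real registry may carry more fields.
interface ToolEntry {
  name: string;
  description: string;
}

// Hypothetical marker comments that fence off the generated part of the README.
const START = "<!-- tools:start -->";
const END = "<!-- tools:end -->";

// Render the count and the tool table from the registry entries.
function renderToolSection(tools: ToolEntry[]): string {
  const rows = tools.map(
    // Escape pipes so a description cannot break the table layout.
    (t) => "| `" + t.name + "` | " + t.description.replace(/\|/g, "\\|") + " |"
  );
  const header = [
    "**" + tools.length + " tools**",
    "",
    "| Tool | Description |",
    "| --- | --- |",
  ];
  return header.concat(rows).join("\n");
}

// Replace whatever sits between the markers; prose outside them is left untouched.
function updateReadme(readme: string, tools: ToolEntry[]): string {
  const start = readme.indexOf(START);
  const end = readme.indexOf(END);
  if (start === -1 || end === -1 || end < start) {
    throw new Error("README is missing the generated-section markers");
  }
  const before = readme.slice(0, start + START.length);
  const after = readme.slice(end);
  return before + "\n" + renderToolSection(tools) + "\n" + after;
}

// A README whose generated section has gone stale (it still says 2).
const readmeBefore = [
  "# Example server",
  START,
  "**2 tools**",
  END,
  "Hand-written notes stay here.",
].join("\n");

const exampleTools: ToolEntry[] = [
  { name: "example_click", description: "Clicks an element." },
  { name: "example_read_page", description: "Reads the page text." },
  { name: "example_native_type", description: "Types via OS-level events." },
];

const readmeAfter = updateReadme(readmeBefore, exampleTools);
console.log(readmeAfter);
```

Running the generator twice produces the same file, so it can run on every build, and a CI step could fail when the committed README differs from the generated one.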

Every fact in your docs is either derivable or it isn't. Derivable facts should never be typed by hand. They are drift waiting to happen.

The check that takes thirty seconds

If you maintain anything with a "we have N features / N tools / N integrations" claim, here's the audit. It took me one command:

Count the real thing in code (registrations, exported functions, route handlers — whatever your N actually counts).
Count what your README claims.
If they disagree, you don't just have a stale number. You have a list of things that exist and that nobody chose to write down. Go read that list. It's the most interesting list in your repo.
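As a sketch, those three steps reduce to a set difference plus a search for every place the README states a count. The function names, the sample tool names, and the "N tools" phrasing the regex looks for are assumptions for illustration. How you collect the two name lists depends on your project.

```typescript
// Names present in `registered` but absent from `documented`:
// the things that exist and that nobody wrote down.
function undocumentedTools(registered: string[], documented: string[]): string[] {
  return registered.filter((tool) => documented.indexOf(tool) === -1);
}

// Every "<number> tools" claim in the README text, in order of appearance.
function claimedCounts(readme: string): number[] {
  const pattern = /(\d+)\s+tools\b/g;
  const counts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(readme)) !== null) {
    const digits = match[1];
    if (digits !== undefined) counts.push(parseInt(digits, 10));
  }
  return counts;
}

// Stale claims are the ones that disagree with the real count.
function staleClaims(readme: string, realCount: number): number[] {
  return claimedCounts(readme).filter((n) => n !== realCount);
}

// Made-up inputs: three tools in code, two in the docs, two "2 tools" claims.
const registeredList = ["example_click", "example_read_page", "example_native_type"];
const documentedList = ["example_click", "example_read_page"];
const readmeText = "Example server: 2 tools. Compare: ours has 2 tools, theirs has 1.";

// The list worth reading: here, the single undocumented tool.
console.log(undocumentedTools(registeredList, documentedList));
// Every place the README states a count that the code contradicts.
console.log(staleClaims(readmeText, registeredList.length));
```

The second function matters as much as the first: a count repeated in a tagline, an alt text, and a comparison table has to be found in each of those places, not only in the main list.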

I found mine had grown to a sixteen-tool gap, with the project's sharpest tools sitting inside it. Yours might be smaller. But "we never check" and "the number is correct" are not the same state, and the only way to tell them apart is to look.

This is from maintaining Safari MCP, an open-source Safari automation server for AI agents on macOS (the one that now correctly says 96). I write up the unglamorous parts of running a small OSS project at achiya-automation.com.

What's the widest doc-vs-code gap you've ever found in your own project — and was the stuff in the gap boring, or was it the scary stuff?

Top comments (6)

Alex Shev • Jun 22

Docs drift is one of those problems where automation can help, but only if the source of truth is clear. A generated README table is useful when it is rebuilt from code, not when it becomes another surface that needs manual remembering.

אחיה כהן • Jun 25

Exactly — "another surface that needs manual remembering" is the bug in one sentence. What actually stuck for us wasn't a discipline ("remember to update the count") but moving the source of truth: the contributed smoke test now derives the tool count straight from the registered tools, so the number can't drift without turning CI red. The prose table stays hand-written, but the claim it makes is backed by a test now. My takeaway: any doc fact you can assert in a test stops being documentation and becomes a guarantee.

Alex Shev • Jun 26

That is the cleanest version of the lesson: if a doc fact can be asserted in a test, it should eventually become a test-backed fact.

I still like keeping the human-written prose, but the claims inside the prose need anchors. Counts, supported tools, generated schemas, flags, routes, permissions — those should not rely on memory.

אחיה כהן • Jun 28

That list — counts, supported tools, schemas, flags, routes, permissions — is basically a spec for which sentences in a README are allowed to be hand-written. If a claim names a number or an enumerable set, there's a test-shaped anchor waiting for it; everything else is genuinely prose. The prose explains the why; the test guards the what.

Good thread — thanks for pushing on the source-of-truth point. That's the part most "just auto-generate the docs" takes skip right past.

Alex Shev • Jun 28

Exactly. The safest split is generated facts, human framing. Counts, tool lists, command flags, schema fields, supported integrations: those should come from something executable. The human-written part should explain why those choices matter and where the edges are. That way the README stops pretending to be the source of truth for things the code can prove.

אחיה כהן • Jun 29

"Generated facts, human framing" is the cleanest two-bucket split I've seen for this.

The one seam I'd add: the boundary between the buckets isn't fixed. A tool's existence is a generated fact; its description is human framing — but the moment a description claims a behavior ("clicks the element"), it's quietly crossed back into fact territory and can drift again. So the rule I landed on is less "facts vs framing" and more "anything you can enumerate or assert, the generator owns; prose stays untrusted until a test pins it down."

Enjoyed this thread — thanks for pushing on it.