Sergey Shkuratov

Posted on Jun 2

Documentation is code: LLMs don’t actually read it — and honestly, neither do we

#llm #documentation

I learned this the hard way: when an LLM says “it matches the docs”, it can still be wrong for a boring reason—it didn’t read the part that matters.

I’m building a small SaaS (checklists as a service). No users yet. Plenty of documentation already. And at some point my docs stopped being an asset and started turning into a liability.

This is the story of how I rebuilt my documentation so that an LLM could actually read it end to end, and how that restructure paid off by surfacing real bugs in my code.

The moment I got scared: “silent misses”

The docset grew. I kept asking the LLM to verify tasks against it.

And then I noticed a pattern that felt worse than hallucinations.

Not “the model invented stuff”, but “the model confidently said it matches”—while quietly missing exceptions, prohibitions, and thresholds. Keyword scanning instead of reading.

I started calling the result silent drift: the code slowly moves away from its conventions, while the invariants live only in my head.

In a project with roles, audit, and CI/CD security gates, that kind of drift isn’t “just messy docs”. It’s how you lose the ability to implement and review changes consistently.

I couldn’t do it manually (and I couldn’t delegate it fully)

I knew I had to redo the documentation. But I also knew I couldn’t realistically do it all by hand.

At the same time, I couldn’t just tell an LLM: “Rewrite everything according to approach X.” Not enough context, too easy to lose control.

So I went with a third option: build a reliable process out of unreliable components—me + an LLM.

Step 1: I separated my docs into domains (and forced the model to actually read)

First, I extracted domain areas from the old documentation: the vocabulary I was using to describe the project and its parts. I tried to keep the domains independent of each other, so that I could still hold the overall framework in my head.

Then I ran the same loop for each domain:

1. I asked the LLM to read all old docs carefully and extract requirements for that domain.
2. I moved those requirements into a dedicated file and gave each one a project-unique ID.
3. I asked the LLM to reread everything and check internal consistency.
4. I fixed contradictions (sometimes by cross-checking the code).
5. I repeated the consistency check (this caught small but nasty issues).
6. I reviewed diffs manually to catch what was missing or implicitly assumed.

This took ~4 days (about 4 hours/day). Exhausting, but still much faster than doing it without an LLM.
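
The "project-unique ID" step in the loop above is easy to check mechanically. Here is a minimal sketch, assuming each requirement starts a line with a bracketed ID such as [AUTH-001]. The ID format, the file names, and the requirement texts are all made up for illustration; the post does not say what the real ones look like.

```python
import re
from collections import defaultdict

# Hypothetical ID format: upper-case domain prefix, dash, number, in brackets
# at the start of a line, e.g. "[AUTH-001] ...".
REQ_ID = re.compile(r"^\[([A-Z]+-\d+)\]", re.MULTILINE)


def find_duplicate_ids(files: dict[str, str]) -> dict[str, list[str]]:
    """Map every requirement ID defined more than once to the files defining it."""
    seen: dict[str, list[str]] = defaultdict(list)
    for name, text in files.items():
        for req_id in REQ_ID.findall(text):
            seen[req_id].append(name)
    return {req_id: names for req_id, names in seen.items() if len(names) > 1}


# Invented example content: AUTH-002 is (wrongly) reused in the audit domain.
docs = {
    "auth/requirements.md": "[AUTH-001] Sessions expire.\n[AUTH-002] Passwords are hashed.\n",
    "audit/requirements.md": "[AUDIT-001] Role changes are logged.\n[AUTH-002] Logins are logged.\n",
}
print(find_duplicate_ids(docs))
```

A check like this does not replace the manual diff review, but it turns "IDs are unique across the project" from a convention into something a script can refuse to accept.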

Step 2: I hit a wall—because I mixed “requirements” with “verification”

After the requirements pass, I wanted to extract scenarios (the thing that connects domains and requirements).

And suddenly the model started to stumble and hallucinate again.

The fix turned out to be painfully simple: my requirements were still “too thick” because they contained verification sections.

Verification text was useful, but it didn’t belong inside requirements files. It confused the extraction step.

So I separated verification into its own files per domain. After that, scenario extraction became stable again.
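
The split itself can be scripted if the verification text sits under a recognizable heading. The sketch below assumes Markdown files where verification lives in sections titled "## Verification"; that heading, and the use of Markdown at all, are assumptions for the example and not something the post states.

```python
def split_verification(text: str, heading: str = "## Verification") -> tuple[str, str]:
    """Split one document into (requirements, verification) parts.

    Lines belong to the verification part from a `heading` line up to the
    next second-level heading; everything else stays with the requirements.
    """
    requirements: list[str] = []
    verification: list[str] = []
    target = requirements
    for line in text.splitlines():
        if line.startswith("## "):
            # A new second-level section decides where the following lines go.
            target = verification if line.strip() == heading else requirements
        target.append(line)
    return "\n".join(requirements), "\n".join(verification)


# Invented example document.
source = (
    "# Auth\n"
    "## Requirements\n"
    "[AUTH-001] Sessions expire.\n"
    "## Verification\n"
    "Log in, wait, confirm the session is rejected.\n"
    "## Notes\n"
    "Applies to all roles."
)
req_part, ver_part = split_verification(source)
print(req_part)
print("---")
print(ver_part)
```

The two parts can then be written to separate per-domain files, so that an extraction step only ever sees the requirements.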

Step 3 (the main artifact): I built per-subsystem digests

Even with cleaner docs, there was still one big problem:

An LLM is much more likely to actually read one document than to wander through folders and do keyword search across many files.

So I built a small, boring artifact:

a registry file listing subsystems and the docs that belong to each
a tiny builder script that concatenates those files into a single digest per subsystem

Now, for each subsystem (authentication, access control, audit, security gates, plus a few project-specific ones), I have one consolidated document.
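
For readers who want to try the same setup, here is a minimal sketch of a registry plus digest builder. It is not the actual script from this project: the JSON registry format, the file names, the .md extension, and the "source" marker comments are all assumptions chosen to keep the example short.

```python
import json
import tempfile
from pathlib import Path


def build_digest(subsystem: str, doc_paths: list[str], root: Path) -> str:
    """Concatenate the listed docs into one text, marking where each file starts."""
    parts = [f"# Digest: {subsystem}"]
    for rel in doc_paths:
        # A missing file raises FileNotFoundError, so a stale registry fails loudly.
        body = (root / rel).read_text(encoding="utf-8").strip()
        parts.append(f"<!-- source: {rel} -->\n{body}")
    return "\n\n".join(parts) + "\n"


def build_all(registry_path: Path, out_dir: Path) -> list[Path]:
    """Write one digest file per subsystem named in the JSON registry."""
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    root = registry_path.parent  # doc paths are relative to the registry file
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for subsystem, doc_paths in registry.items():
        out_file = out_dir / f"{subsystem}.md"
        out_file.write_text(build_digest(subsystem, doc_paths, root), encoding="utf-8")
        written.append(out_file)
    return written


# Demo in a throwaway directory with invented docs, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth.md").write_text("[AUTH-001] Sessions expire.\n", encoding="utf-8")
    (root / "roles.md").write_text("[AC-001] Roles gate access.\n", encoding="utf-8")
    (root / "registry.json").write_text(
        json.dumps({"authentication": ["auth.md"], "access-control": ["roles.md", "auth.md"]}),
        encoding="utf-8",
    )
    digests = build_all(root / "registry.json", root / "digests")
    demo_text = digests[1].read_text(encoding="utf-8")
print(demo_text)
```

Note that one doc may appear in several subsystems (auth.md above). Duplicating text across digests is deliberate in this sketch: each digest is meant to be read on its own, without following links to other files.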

I also keep a short “selection rules” note for myself: which digests to feed into the agent for a given task (e.g., access control vs audit logic). The LLM can check conformance well, but I don’t expect it to reliably infer what to check via chains of implicit assumptions.

The payoff: the restructure wasn’t cosmetic

After I rebuilt the registry and digests, I asked the LLM to check the whole codebase for conformance to each consolidated document.

It found about 15 bugs. Some only manifested under specific conditions.

At first, I was upset: how could this exist with so many tests?

Then I realized: this was the clearest proof that the new documentation structure was doing real work.
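
One way to make silent misses visible during such a conformance check is to demand a verdict per requirement ID and then verify that every ID was answered. The sketch below is a suggestion under stated assumptions, not the process described in this post: the ID format, the "ID: PASS|FAIL|UNSURE" report format, and the sample digest and report are all invented.

```python
import re

# Hypothetical formats: IDs like "AUTH-001"; report lines like "AUTH-001: PASS - reason".
REQ_ID = re.compile(r"\b[A-Z]+-\d+\b")
VERDICT = re.compile(r"^([A-Z]+-\d+):\s*(PASS|FAIL|UNSURE)\b", re.MULTILINE)


def build_prompt(digest: str) -> str:
    """Ask for one verdict line per requirement ID found in the digest."""
    ids = sorted(set(REQ_ID.findall(digest)))
    return (
        "Check the code against the document below.\n"
        "Answer with one line per requirement: 'ID: PASS|FAIL|UNSURE - reason'.\n"
        f"Requirements to cover: {', '.join(ids)}\n\n{digest}"
    )


def uncovered_ids(digest: str, report: str) -> list[str]:
    """Requirement IDs present in the digest but given no verdict in the report."""
    expected = set(REQ_ID.findall(digest))
    answered = {req_id for req_id, _verdict in VERDICT.findall(report)}
    return sorted(expected - answered)


# Invented digest and model report: the report silently skips AUTH-002.
digest = (
    "[AUTH-001] Sessions expire.\n"
    "[AUTH-002] Passwords are hashed.\n"
    "[AUDIT-001] Role changes are logged.\n"
)
report = (
    "AUTH-001: PASS - timeout enforced\n"
    "AUDIT-001: FAIL - no log entry on role change\n"
)
print(uncovered_ids(digest, report))  # ['AUTH-002']
```

A non-empty result does not mean the code is wrong; it means the model never reported on that requirement, which is exactly the case a plain "it matches the docs" answer hides.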

What I’m taking from this

A big docset is not automatically verifiable.
If you want LLM-assisted development to be stable, you need docs the model can read, not just search.
A tiny artifact (subsystem registry + digest builder) can become a point of leverage for your whole workflow.

If you’ve dealt with docs/code drift (especially with LLMs in the loop), I’d love to hear what helped—and what failed.

Top comments (2)

江欢（JackSoul） • Jun 2

This is a useful way to frame it. The part that resonates with me is treating docs as something the system can verify against, not just something humans hope the model has absorbed.

For LLM-assisted workflows, splitting docs into smaller invariant/checklist-style units seems to make failures much easier to catch than one big narrative doc.

Sergey Shkuratov • Jun 2

Totally agree on invariants/checklist-style units.

In my experience, splitting into units/declarations helps, but splitting into many files hurts. In an LLM-assisted workflow, a single bundled digest per subsystem is more reliable than a folder of tiny files: fewer silent omissions and less accidental “browsing/search mode”.