Engineering Documentation Series — Article 1 of 50
Every team has some version of this story, so start with the composite.
A senior engineer gives two weeks' notice after six years. She designed the billing reconciliation system, chose its consistency model, negotiated the three exceptions that keep a large customer from churning, and decided — for a reason she could explain in thirty seconds — why a nightly job retries exactly four times and not three or five.
None of this is secret. All of it is in production, running, observable. The code compiles. The tests pass. The system works.
On her last Friday, the team takes her to lunch. On Monday, the billing system runs exactly as before. For three months, nothing breaks.
Then a regulatory change requires a modification to how partial refunds are reconciled. An engineer opens the module. The code is clear enough — he can read what it does. What he cannot recover is why it does it that way. Why four retries. Why that customer is special. Why the consistency model tolerates a window that, on paper, looks like a bug. He spends three weeks reconstructing reasoning that once lived, complete and confident, in a single person's mind. He gets most of it right. The part he gets wrong causes an incident the following quarter.
The instinct is to file this under "she should have written more documentation." That instinct is correct. The reason we usually give for it is wrong — and the wrongness is the subject of this article, and in a quieter way the subject of this entire series.
The industry has quietly agreed on the wrong definition
Ask a room of engineers what documentation is for, and the answers cluster around one theme: explaining. Documentation explains how to use the API. It explains how to set up the project. It explains what the function does. Under this definition, documentation is a communication tool — a translation of code into prose for someone who has not yet read the code.
This definition is so widespread it has become invisible. It shapes how teams budget time ("we'll document it after the feature ships"), how they measure quality ("is the README current?"), and how they assign ownership ("whoever wrote the code writes the docs"). It is the definition embedded in nearly every documentation tutorial, which is why nearly every tutorial teaches Markdown, headings, and tools — the mechanics of explaining — rather than the thing that actually matters.
The explaining definition is not wrong so much as it is small. It captures the least durable, least valuable function of documentation and mistakes it for the whole. The engineer who left did not fail to explain her code. Her code was perfectly explained — by itself. What she failed to preserve was something code can never contain.
What actually disappeared
Look again at what was lost. It was not knowledge of what the system does; that survived her departure intact, because it is encoded in the running system and recoverable by anyone willing to read it. What disappeared was the knowledge of why — the decisions, the rejected alternatives, the constraints that were live at the time, the context that made a strange-looking choice the correct one.
This distinction has a name. Knowledge that can be written down, transferred, and read is explicit knowledge. Knowledge that lives in a person's judgment and intuition — the kind you demonstrate but struggle to articulate — is tacit knowledge. The terms come from the philosopher Michael Polanyi, whose formulation, "we can know more than we can tell," predates software by decades and survives every change of technology since.
Here is the engineering reframing, and it is the first load-bearing claim of this article: source code is the explicit residue of a large body of tacit reasoning, and it preserves almost none of the reasoning. We have good evidence that this gap is where engineering effort actually goes. In a large-scale field study of professional developers across seven industrial projects, totaling more than 3,000 working hours, developers spent roughly 58% of their time not writing code but trying to comprehend existing code — and the single most common driver of the long, difficult comprehension sessions was insufficient or missing explanation of intent, present in 46% of them (Xia et al., IEEE Transactions on Software Engineering, 2018). The code was right there, readable, and still the reasoning had to be reconstructed. That reconstruction is the tax an organization pays when the why was never preserved.
The tax is not evenly distributed, either. In the same study, junior developers spent about 66% of their time on comprehension against 44% for seniors — which is to say the cost of missing knowledge falls heaviest on exactly the people who just arrived and have no one left to ask.
So documentation, properly understood, is not the act of explaining code. It is the act of capturing the tacit reasoning that code cannot hold, before the only copy walks out the door.
The root cause is a category error, not a discipline problem
It is tempting to explain missing documentation as a discipline failure — engineers are busy, writing is tedious, deadlines win. That explanation is comforting because it implies an easy fix: more discipline, better tooling, a documentation item in the Definition of Done. Teams have tried all of these for years. The knowledge keeps leaving anyway.
The deeper cause is that the explaining definition tells engineers to document the wrong things. If documentation exists to explain what code does, the natural thing to write is a description of what the code does — and that is precisely the information the code already provides and the system already enforces. Engineers correctly sense this kind of documentation is low-value, so they skip it, and they are not wrong to skip it. The tragedy is that in skipping the worthless documentation, they also skip the valuable kind, because the definition never distinguished between them.
The root cause, then, is a category error. We treat documentation as a writing problem — producing clear prose about code — when it is a knowledge problem: which knowledge an organization cannot afford to lose, and how to move it out of fragile human memory into a durable, shared form. Once the category is corrected, the discipline problem largely dissolves. Engineers are rarely reluctant to preserve knowledge they can see is irreplaceable. They are reluctant to transcribe knowledge the compiler already has.
The engineering principle: a single knowledge-holder is a single point of failure
Engineers already possess the principle they need. They simply apply it to servers and not to people.
No competent engineer designs a critical system to depend on a single machine with no redundancy. The reason is not that the machine is unreliable on any given day — it is that the consequence of its failure is unbounded and the failure is certain on a long enough timeline. We replicate and remove single points of failure not because we expect failure tomorrow, but because we refuse to let the system's survival depend on a component that will, eventually, fail.
A person who is the sole holder of critical knowledge is a single point of failure with a guaranteed eventual outage called resigning. And this is not a hypothetical risk — it is the measured norm. When researchers computed the "truck factor" of 133 popular GitHub projects — the minimum number of developers who would have to leave before the project is incapacitated — 65% had a truck factor of two or fewer, and about a third had a truck factor of one (Avelino et al., International Conference on Program Comprehension, 2016). Most real systems are one or two departures away from losing the people who understand them. A larger 2024 study of more than 36,000 open-source projects found that 89% lost their entire core team at least once, and only about 27% of those recovered (Nourry et al., 2024). Knowledge concentration is not an edge case. It is the default condition of software.
What makes this worse is that the risk is widely felt and rarely managed. In a survey of 269 professional engineers, 75% rated bus factor as a serious concern and 63% had worked on a high-risk, low-bus-factor project within the past year — yet only 19% worked somewhere the risk was actually tracked and communicated (JetBrains study, ICSE-SEIP, 2022). Engineers know the building has one exit. Almost no one has mapped it.
The principle generalizes beyond any technology: knowledge concentrated in a single holder is an architectural risk, and the mechanism for removing it is the same as for any single point of failure — replication into a durable, independent store. Documentation is that replication. Notably, when the truck-factor researchers asked developers how they would actually mitigate the risk, the single most-cited practice was not pairing, not tests, not "readable code" — it was documentation. The people closest to the problem name the same fix.
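The truck-factor idea can be made concrete with a few lines of Python. What follows is a minimal sketch, not the estimation method used in the cited study. It assumes a hypothetical mapping from file paths to the set of people who understand each file, and it greedily removes the most widely needed person until more than half the files have no one left who understands them. The file names, the people, and the 50% threshold are all illustrative assumptions.

```python
from collections import Counter


def estimate_truck_factor(file_authors, orphan_threshold=0.5):
    """Greedy sketch: return the people whose departure orphans the project.

    file_authors maps a file path to the set of people who understand it
    (a hypothetical input; real tools infer this from commit history).
    People are removed, most widely needed first, until the share of files
    with nobody left exceeds orphan_threshold.
    """
    remaining = {path: set(people) for path, people in file_authors.items()}
    total = len(remaining)
    removed = []

    def orphan_share():
        return sum(1 for people in remaining.values() if not people) / total

    while total and orphan_share() <= orphan_threshold:
        counts = Counter(p for people in remaining.values() for p in people)
        if not counts:
            break  # nobody left to remove
        # Most files covered wins; the name breaks ties deterministically.
        top = max(counts, key=lambda person: (counts[person], person))
        removed.append(top)
        for people in remaining.values():
            people.discard(top)
    return removed


# Hypothetical ownership data for a small billing codebase.
ownership = {
    "billing/reconcile.py": {"dana"},
    "billing/retry.py": {"dana"},
    "billing/exceptions.py": {"dana"},
    "billing/refunds.py": {"dana", "lee"},
    "api/routes.py": {"lee", "sam"},
}

departures = estimate_truck_factor(ownership)
print(f"truck factor: {len(departures)} ({', '.join(departures)})")
```

With this made-up data, one departure is enough to leave three of the five files with no one who understands them, which is the single-holder situation the paragraph above describes.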
The mental model: documentation is organizational memory
If you take one idea from this article, take this one, because the remaining forty-nine articles lean on it.
A human mind has two kinds of memory. Working memory holds what you are thinking about right now; it is fast, rich, and almost entirely lost within seconds unless something transfers it. Long-term memory is slower to write and less vivid, but it persists — it is what lets you be, tomorrow, a continuation of who you were today rather than starting from nothing each morning.
An organization has the same two layers, and documentation is its long-term memory. The working memory of an organization is the live knowledge in the heads of its current engineers — vivid, detailed, and constantly overwritten and lost as people move and forget and leave. Without a mechanism to consolidate that working memory into something durable, an organization is amnesiac: it knows only what its current members happen to remember, and it forgets a little more with every departure. Documentation is the consolidation mechanism — how an organization's fleeting working memory becomes durable long-term memory that outlives the individuals who formed it.
The organizational-memory research bears this out as a measurable phenomenon rather than a metaphor. A 2023 systematic review of 91 empirical studies, covering 2000–2022, found employee turnover to be among the most frequently reported causes of organizational knowledge loss, and found that tacit knowledge, the hard-to-write-down kind, causes disproportionately greater harm when it leaves. The reframing also changes who you are writing for. The true reader of documentation is not the colleague at the next desk who could have asked you in person. It is the engineer who has not yet joined the company, the one who will inherit this system after everyone who built it is gone.
This model survives any change of technology. It said nothing about Markdown, wikis, or repositories, because none of those are the point. The point is the transfer of fragile working memory into durable organizational memory, and that need will outlast every tool we currently use to satisfy it.
A framework for deciding what to preserve: the Preservation Matrix
The organizational-memory model tells you why to document. It does not yet tell you what — and "document everything" fails as surely as "document nothing," because when everything is preserved the irreplaceable knowledge drowns in the trivial.
Two properties decide whether any piece of knowledge is worth preserving. The first is its cost to re-derive: how much effort, time, and risk it would take to reconstruct from scratch if lost. The second is its probability of loss: how likely it is to leave the organization. Cross them and you get a simple decision tool — call it the Preservation Matrix:
                 LOW cost to re-derive        HIGH cost to re-derive
             ┌────────────────────────────┬────────────────────────────┐
HIGH         │ Skim / auto-generate       │ PRESERVE FIRST             │
probability  │ (changelogs, obvious       │ the "why": decisions,      │
of loss      │ how-tos; cheap to          │ rationale, constraints,    │
             │ regenerate)                │ the four-retries reason    │
             ├────────────────────────────┼────────────────────────────┤
LOW          │ Ignore                     │ Preserve if you can        │
probability  │ (the system already        │ (deep tacit expertise:     │
of loss      │ enforces it; the code      │ expensive, but mentorship  │
             │ is the documentation)      │ may beat written docs)     │
             └────────────────────────────┴────────────────────────────┘
The top-right quadrant is your organizational memory's most precious cargo: knowledge both expensive to re-derive and likely to be lost. That is, almost always, the knowledge of why — the decisions and their rationale, the alternatives considered and rejected, the constraints that were live at the time. The four-retries choice lives here. So does the consistency-model decision and the reason that one customer is special.
The bottom-left quadrant is its opposite: knowledge cheap to recover and unlikely to be lost — the description of what the code does, which the code already holds and the system already enforces. This is exactly the documentation the explaining definition tells us to write, and exactly the documentation that goes stale fastest and matters least.
The matrix collapses to a single question an engineer can ask in the moment: if the people who hold this knowledge left tomorrow, how expensive would it be to recover, and how likely is it that no one could? The knowledge that scores high on both is what you write down. The rest can wait, or be derived, or be left to the system.
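The matrix is small enough to write down as a function. The sketch below is one possible encoding, not a prescribed tool: it assumes both properties are scored on a 0-to-1 scale and that 0.5 separates "low" from "high". Those scales, the threshold, and the names are illustrative assumptions; only the four quadrant outcomes come from the matrix itself.

```python
from enum import Enum


class Action(Enum):
    PRESERVE_FIRST = "preserve first"
    PRESERVE_IF_YOU_CAN = "preserve if you can"
    SKIM_OR_AUTOGENERATE = "skim / auto-generate"
    IGNORE = "ignore"


def preservation_action(cost_to_rederive, probability_of_loss, threshold=0.5):
    """Map two scores in [0, 1] to a quadrant of the Preservation Matrix.

    The 0-to-1 scale and the 0.5 cut-off are assumptions of this sketch.
    """
    high_cost = cost_to_rederive >= threshold
    high_loss = probability_of_loss >= threshold
    if high_cost and high_loss:
        return Action.PRESERVE_FIRST        # top-right: the "why"
    if high_cost:
        return Action.PRESERVE_IF_YOU_CAN   # bottom-right: deep tacit expertise
    if high_loss:
        return Action.SKIM_OR_AUTOGENERATE  # top-left: cheap to regenerate
    return Action.IGNORE                    # bottom-left: the code already holds it


# Illustrative scores only; a team would estimate these by judgment.
examples = {
    "why the nightly job retries four times": (0.9, 0.9),
    "what the reconcile function does": (0.1, 0.1),
    "changelog for last release": (0.2, 0.8),
}
for item, (cost, loss) in examples.items():
    print(f"{item}: {preservation_action(cost, loss).value}")
```

The scores are judgment calls, not measurements. The value of writing the rule down is that it forces the two questions to be asked separately.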
What this looks like in practice
Made concrete, the matrix redirects effort away from the documentation most teams produce and toward the documentation almost no team produces.
Take the choice to use eventual consistency in the billing system rather than strong consistency. The explaining definition produces, at best, a description of how the eventually-consistent system behaves. The Preservation Matrix produces something more durable: a short record of the decision itself — the problem that forced it, the alternatives on the table, the constraints (latency budgets, throughput targets, the cost of distributed transactions) that ruled the alternatives out, and the assumptions the decision depends on, so a future engineer can recognize when those assumptions stop holding and the decision should be revisited.
That record is what the industry calls an Architecture Decision Record, and a later article is devoted to it. The mechanics are not the point here. The point is that the matrix tells you the ADR is worth writing and why (it is top-right knowledge), while the explaining definition cannot even see it, because no one is "explaining" anything. They are preserving a decision. The same lens applies to the four-retries choice: nothing to write under the explaining definition, one irreplaceable sentence to write under the matrix. It is the sentence the engineer who inherited the module spent three weeks trying to reconstruct.
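The shape of such a record can be sketched as a small data structure. This is an illustration of the fields named above (problem, alternatives, constraints, assumptions), not a standard ADR template. The class name, field names, and date are hypothetical, and the angle-bracket placeholders mark content only the original decision-maker could supply.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DecisionRecord:
    """One preserved decision: the 'why' that the code itself cannot hold."""

    title: str
    decided_on: date
    problem: str                   # what forced a decision at all
    decision: str                  # what was chosen
    rejected: tuple[str, ...]      # alternatives that were on the table
    constraints: tuple[str, ...]   # what ruled the alternatives out
    assumptions: tuple[str, ...]   # if any stops holding, revisit

    def render(self) -> str:
        """Return the record as plain text suitable for a repository file."""

        def bullets(items):
            return [f"  - {item}" for item in items]

        lines = [
            f"Decision: {self.title} ({self.decided_on.isoformat()})",
            f"Problem: {self.problem}",
            f"Chosen: {self.decision}",
            "Rejected alternatives:",
            *bullets(self.rejected),
            "Constraints at the time:",
            *bullets(self.constraints),
            "Revisit if any assumption stops holding:",
            *bullets(self.assumptions),
        ]
        return "\n".join(lines)


record = DecisionRecord(
    title="Billing reconciliation consistency model",
    decided_on=date(2020, 1, 1),  # placeholder date, not from the story
    problem="<the problem that forced the decision>",
    decision="Eventual consistency",
    rejected=("Strong consistency",),
    constraints=(
        "Latency budgets",
        "Throughput targets",
        "Cost of distributed transactions",
    ),
    assumptions=("<an assumption the decision depends on>",),
)
print(record.render())
```

The assumptions field does the work the paragraph above asks for: it tells a future reader when the decision should be reopened, without requiring them to ask anyone.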
The strongest objections, taken seriously
A skeptical senior engineer has real counterarguments, and an honest case has to meet them rather than pretend they don't exist.
"Good code is self-documenting — write readable code instead of prose." This is partly right, and the right part matters: a well-named function does make the what self-evident, which is exactly why the bottom-left quadrant of the matrix says don't bother documenting it. But even careful advocates of self-documenting code concede the limit. As one puts it, readable code "is a great goal, sometimes achievable, but it doesn't remove the need to comment your code" — because the why, the business rule, the non-obvious quirk, is precisely what clean naming cannot express. Self-documenting code wins the bottom-left quadrant and leaves the top-right untouched. The two positions are not actually in conflict; they are talking about different quadrants.
"Documentation always rots, so it's worse than useless — stale docs mislead." This is the most serious objection, and it is true for the wrong documentation. Documentation that restates what the code does rots fast, because the code changes underneath it; that is another reason the bottom-left quadrant is a trap. But decision rationale — why a choice was made, against which alternatives, under which constraints — does not rot the same way, because the historical fact of the decision doesn't change even when the code does. You can read a five-year-old ADR and it remains true about the moment it describes. The rot objection is an argument against preserving the what, which the matrix already tells you to skip. It is not an argument against preserving the why.
"AI will just generate the documentation now." Returned to below — but briefly: a model can generate a fluent description of what code does, which is the quadrant that was never the problem. It cannot recover a decision rationale that was never recorded anywhere, because that knowledge exists in no text for it to read. AI lowers the cost of the cheap quadrant and does nothing for the expensive one.
The honest synthesis is that the objections are mostly right about the documentation the matrix already tells you not to write, and mostly silent about the documentation it tells you to write.
The trade-offs are real
It would be dishonest to present knowledge preservation as free. Preserving knowledge costs time at the moment of writing, when the pressure to ship is highest and the value is least visible. Preserved knowledge can itself decay if a decision changes and its record is never updated. And some tacit expertise is so deeply tacit that any written form loses what mattered — the bottom-right quadrant of the matrix — and the time is better spent on mentorship or pairing.
These costs are why the matrix matters rather than a blanket "document more." A blanket instruction ignores the trade-offs and collapses under them; teams told to document everything learn the cost is real and the value diffuse, and they stop. The matrix concentrates the cost where the return is highest. The decision to not document something cheap-to-recover is not negligence — it is correct engineering, the same way declining to add redundancy to a non-critical component is correct.
So the recommendation is conditional, as engineering recommendations should be. Preserve aggressively when knowledge is expensive to re-derive and likely to be lost. Preserve sparingly, or not at all, when the system already holds and enforces it. The skill is not writing more. It is knowing which knowledge is irreplaceable.
Why prevention has to happen at the moment of the work
There is a reason the opening story ends in an incident rather than a near-miss, and it is a property of knowledge, not of that particular team. Knowledge decays in a way that hides the decay until the moment it is needed.
A system whose knowledge has left looks identical, from the outside, to a system whose knowledge is fully preserved. It runs the same and passes the same tests. The loss is invisible right up until someone needs the missing knowledge — and by then the people who held it are gone, the context is unrecoverable, and the cost of reconstruction has multiplied. This is the defining feature of the knowledge problem: the cheapest moment to preserve knowledge is exactly when it feels least necessary, and the moment it becomes obviously necessary is exactly when preservation has become impossible.
Prevention therefore has a specific shape. Knowledge must be captured while it is live — when the decision is being made and the context is still in the room. Captured then, it costs a sentence. Recovered later, it costs weeks, and the recovery is often wrong. The preventive practice is not "schedule a documentation sprint" — by the sprint, the knowledge has already decayed — but "capture decisions as they are made." The highest-leverage documentation habits in this series will all turn out to happen at the moment of the work, not after it.
This is an organizational property, not an individual virtue
It is easy to read all of this as advice to individual engineers: be the person who writes things down. That reading is not wrong, but it is incomplete, and the incompleteness matters.
Preservation only works if the organization treats it as infrastructure rather than personality. If it depends on which individuals happen to be conscientious, then the practice itself has a bus factor — it survives only as long as the diligent people stay, and collapses exactly when they leave, which is when it is needed most. And the payoff of getting this right shows up at the organizational level, not the individual one: DORA's State of DevOps research, the largest ongoing study of software delivery, has found a clear link between internal documentation quality and organizational performance, and found that documentation quality amplified the effectiveness of every technical capability it studied — teams with strong documentation saw dramatically larger performance gains from the same practices than teams with poor documentation (DORA, 2021–2022). Documentation, at the organizational scale, behaves less like a chore and more like a multiplier on everything else the engineering organization does.
This is the difference between an organization that knows things and one that remembers them. The first depends on its current members and forgets a little with every departure. The second accumulates knowledge over time, gets smarter as it ages, and treats the loss of any individual as survivable. Every later article in this series is, in some sense, about how to become the second kind of organization.
Where this is heading
This reframing is about to matter more than it has, for a reason that has nothing to do with human readers. Organizations are beginning to connect automated systems — including AI — to their own knowledge, asking machines to reason over what the organization knows. Those systems can only reason over knowledge that has been preserved in a durable, accessible form. They are blind to whatever still lives only in human heads.
This is the real answer to "AI will just write the docs." A model can generate a competent description of what code does — the cheap quadrant, which was never the problem. It cannot hand a new engineer, or a new agent, the reasoning behind a decision that was recorded nowhere, because that knowledge exists in no text it can read. An organization whose critical knowledge was never externalized cannot give that knowledge to a machine any more than it could give it to a new hire. The knowledge problem is the same problem, now with a second kind of reader that also cannot ask a colleague over lunch. The organizations whose memory is intact will find they have been preparing for this without knowing it. The rest will discover the problem they postponed has compounded.
What should change after reading this
If this article has done its work, one belief should feel different than it did a few minutes ago. Documentation was, at the start, a writing task — clear prose about code, done after the real work when time allowed. It should now look like something else: the mechanism by which an organization preserves the knowledge it cannot afford to lose, the long-term memory that lets it survive the departure of any individual mind.
A few things follow, worth carrying into the rest of the series and into your own work. The most valuable documentation captures why, not what, because the why is the knowledge both expensive to re-derive and likely to be lost: the top-right quadrant of the Preservation Matrix. The cheapest moment to preserve knowledge is while it is live, and that moment does not return. And the question that should become reflexive, the entire practice in miniature, is this: if the people who hold this knowledge left tomorrow, what would the organization be unable to recover?
The answer to that question is what you write down. Everything else in this series is an elaboration of how.
Next in the series — Article 2: The Six-Month Half-Life of a Wiki.
Sources and evidence notes
Per the editorial standard, confidence is flagged where it matters: peer-reviewed studies are treated as strongest; large-scale industry research (DORA) is strong but correlational; practitioner cases and estimates are labeled as such.
- Program comprehension (peer-reviewed). Xia, Bao, Lo, Xing, Hassan, Li, "Measuring Program Comprehension: A Large-Scale Field Study with Professionals," IEEE Transactions on Software Engineering, 2018. ~58% of developer time on comprehension; missing/insufficient explanation the top driver of long comprehension sessions (46%); juniors 66% vs seniors 44%.
- Truck/bus factor (peer-reviewed). Avelino, Passos, Hora, Valente, "A Novel Approach for Estimating Truck Factors," ICPC 2016. 133 GitHub systems; 65% truck factor ≤ 2; ~34% = 1; developers ranked documentation the #1 mitigation. Nourry et al., 2024 — 36,000+ projects; 89% lost their core team at least once, ~27% recovered.
- Bus factor as a felt-but-untracked risk (peer-reviewed). JetBrains study, ICSE-SEIP 2022. 269 engineers; 75% rate it important, 63% recently on a high-risk project, only 19% where it was tracked.
- Knowledge loss from turnover (peer-reviewed systematic review). Synthesis of 91 empirical studies (2000–2022): turnover a leading cause of organizational knowledge loss; tacit knowledge disproportionately harmful when lost.
- Documentation ↔ organizational performance (industry research, correlational). DORA / Accelerate State of DevOps, 2021–2022. Clear link between documentation quality and organizational performance; documentation quality amplified the impact of every technical capability studied. Strong but survey-based and correlational, not a controlled experiment.
- Real-world cost cases (secondary / illustrative). David DeLong, Lost Knowledge, and related reporting: Boeing's veteran-knowledge loss contributing to a multi-week 737/747 line shutdown and a $1.6B charge; a single veteran engineer's retirement projected at >$400K first-year disruption. Use as illustration, not as precise engineering metrics.
- Onboarding time (practitioner estimate). Developer-experience practitioners report new hires taking 2–3 months longer to reach productivity under poor documentation, and 3–10 hours/week lost to searching for undocumented information. Directionally consistent with the peer-reviewed comprehension data; treat the specific figures as estimates.
- Tacit vs. explicit knowledge (foundational). Michael Polanyi, The Tacit Dimension (1966): "we can know more than we can tell."
- Dissenting view (practitioner). Self-documenting-code advocates concede readable code "doesn't remove the need to comment" the why; documentation-rot critics correctly note that stale descriptive docs can mislead — both addressed directly in the objections section.