DEV Community: John Rojas

I built a docs site without writing the docs: the part that was mine was a layer up

John Rojas — Sun, 28 Jun 2026 10:42:47 +0000

I spent four articles on a single stubborn idea: that you cannot trust an AI agent to review documentation just because it tells you it did. The whole point was moving verification out of the agent's reach and onto disk, where a claim about a file and the file itself are not the same thing.

This piece is about the other direction. Not reviewing docs with an agent, but building them with one. I forked a popular LLM library whose entire documentation was a single flat README, pointed an agent at it, and a while later had a 12-page docs site. I wrote almost none of the prose.

I want to be honest about which parts of that were actually mine, because the answer is not the one the hype implies. The prose was the easy part to hand off. The work that mattered lived a layer up, and it is the part I would not delegate.

The canvas: a README pretending to be documentation

The library is LangExtract, an LLM structured-extraction tool with something north of thirty-six thousand GitHub stars. (Published by Google, though its own README is careful to say it is not an officially supported Google product, a caveat I kept in mind every time I was tempted to type the word "official.") For all that reach, the entire documentation was one long README. Install, a scattering of examples, some API notes, all stacked in a single scroll you navigate with the find command and a prayer.

That makes it a good canvas precisely because it is not a writing problem. The words mostly existed. What did not exist was a structure: a way for a developer to move from "what is this thing" to "how do I actually use it" to "something broke and I need the reference." Turning a README into a docs site is an architecture problem wearing a writing problem's clothes.

The part I would not hand off: the shape

Before the agent drafted a single page, the decision that mattered was already made, and I made it. What content types exist here, and how do they relate?

I used Diátaxis as the spine: concept pages that explain, how-to pages that walk you through a task, a reference you look things up in, a quickstart that gets you running. That sounds tidy on a slide. In practice it is a sequence of judgment calls the agent is perfectly happy to get wrong. A concept page wants to stay pure concept, and the instant it starts telling you which flag to pass, it has quietly become a how-to, and the boundary blurs until nothing is findable. So I trimmed the how-extraction-works page back to pure concept. I promoted the long-document workflow, the genuinely hard thing the library does well, into the flagship how-to instead of letting it rot in an examples dump.

The agent can draft any page you name. It cannot tell you which pages should exist.

That decision sits upstream of the prose, and it is the one that decides whether the finished site is navigable or merely tidy.

Splitting verification by what kind of thing it is

Here is where the gating work bled straight into the build. If you have read the earlier pieces, the reflex will be familiar: do not let the thing that produced the work be the only thing that vouches for it.

But "verify the docs" is not one question. It is at least three, and they need different instruments.

"Is the structure intact" is a mechanical question. Broken links, dead anchors, malformed markdown. So the build runs scans for exactly those, set to throw, which means a broken link fails the build instead of filing a polite warning nobody reads. Green or it does not ship.

"Is the code right" is a different question, and a scan cannot touch it. A code sample can be flawlessly formatted and completely wrong. So I added a small harness that runs the example code for real, against the live model, and checks that it does what the page claims. When I ran it, the examples passed with zero failures, and the long-document example came back with ten grounded extractions, which is the actual behavior the page promises. That is not the agent telling me the code works. That is the code working, in front of me.

"Is the architecture right, does this read on a phone, are these facts actually true" is a third kind of question, and that one is mine. No harness catches a content model that does not match how developers think.

The shape of the lesson has not changed. Confidence is not evidence. A green scan is evidence the structure holds. A passing harness is evidence the code runs. Neither is evidence the docs are any good, and that gap is where the human stays.

Where the agent drifted, and how I knew

An agent building documentation is confidently, fluently wrong in small ways, and the small ways are the dangerous ones, because they survive a casual read.

Two examples from this build. The first was a syntax landmine. Docusaurus changed how it parses admonitions, the little "Note" and "Warning" callout boxes, between major versions. The space-based form the agent had clearly absorbed from its training no longer parses the title in the current version, and the word "Title" leaks into the rendered page as literal text. The old syntax and the fix differ by a single character:

:::note Title
:::note[Title]

The first line is what the agent wrote. The second is what current Docusaurus actually wants. Nothing in the agent's prose flagged the difference. The rendered page flagged it, because part of verification was checking that zero stray colon-sequences survived into the final HTML.

The second was quieter and entirely mine to care about. The agent's prose was clean, professional, and absolutely riddled with em dashes. Ninety-eight of them. I have a rule, a slightly fussy personal one, that my published writing does not use em dashes, and a fork I am putting my name on counts as my published writing. So there was a pass that pulled all ninety-eight out of the prose and rewrote each into the colon or comma or parenthetical it should have been, while leaving the five that lived inside code blocks untouched (in code, a dash is a dash, not a stylistic opinion).

The thing no harness would have caught

I built and verified the site on a laptop, like a reasonable person. Then I opened it on my phone, like an actual reader, and it fell apart.

The API reference tables, perfectly comfortable on a wide screen, blew straight off the right edge on a narrow one. Long inline code in headings refused to wrap and shoved the whole layout sideways. The buttons on the homepage hero sat in a stubborn single row instead of stacking. None of this showed up in any scan, because none of it is wrong in the way a scan understands wrong. The HTML was valid. The links worked. It was just unusable in the hand of the person most likely to be holding it.

The fixes were small once I had seen the problem: let the tables scroll horizontally inside a contained box, break long words in headings, let the hero buttons wrap. But I only knew to make them because I held the thing the way a reader would, and that instinct, to go check the lived experience rather than the rendered output, is the most durable part of the job and the one I have least idea how to hand to an agent.

Believing the green light, for the right reason

The last pass was a separate read of the agent's factual claims against the actual source code, treating every confident assertion as a hypothesis until the source confirmed it.

It came back with zero accuracy errors. Which is a strange thing to be pleased about, so let me be precise about what it means. It does not mean the check was unnecessary. It means the check said the docs were accurate and I could believe that, because I had looked, not because the agent had assured me. The subtle things it confirmed were exactly the kind a casual read waves through: that the Ollama backend defaults its model URL to localhost:11434, that the library does not silently load a .env file because there is no load_dotenv call anywhere in the source, that the OpenAI batch setting's threshold is a count of queued prompts, not a token budget or a time window. All true. All checkable. None of it taken on faith.

This is the whole argument from the gating pieces, restated for building instead of reviewing. The agent does not get to be both the author and the only witness. Zero errors is a good outcome only when zero errors is a measured fact.

Velocity from rigor, not despite it

The story people expect about AI and documentation is a speed story: the agent writes fast, you ship fast, and quality is the tax you grudgingly pay for the pace. That is not what happened here, and the inversion is the part worth keeping.

The site came together fast because the verification was mechanical and sorted by type. The scans spared me re-reading for broken links. The harness spared me wondering whether the code ran. And that is precisely what freed my actual attention for the things that needed a person: the content model, the page that disintegrated on a phone, the claims that needed confirming against source.

The rigor is not friction dragging on the velocity. The rigor is what made the velocity safe to want.

Where I am right now: the site is live, it is a deliberately narrow slice, and it is built so the next slice does not demand a rebuild. A troubleshooting page, a custom-provider guide, a deeper reference, all have somewhere to land. I left the room on purpose, because the fastest thing you can do for future-you is not to finish. It is to build something future-you can extend without tearing it down.

I am still working out how much of this generalizes. The verification-by-type split feels portable. The "go hold it like a reader" instinct feels stubbornly manual. If you have built docs by directing an agent and found a cleaner seam between the parts you delegate and the parts you keep, I would like to hear where you drew the line.

I am writing this while it is still in motion, which is the only honest way I know to write about a practice I am still figuring out. If you are coming at the same problem from a different angle, the comments are open and I am reading them.

I write about AI-assisted documentation workflows, developer experience, and the evolving role of technical writers. If any of this resonates, let's connect on LinkedIn.

Capturing review lessons: how I stopped my feedback loop from depending on my memory

John Rojas — Sun, 21 Jun 2026 12:47:23 +0000

For context: in the previous piece, I took the review comments I kept repeating and turned them into mechanical scans, so the agent stops asking me to catch the same passive-voice slip for the hundredth time. That piece, and the one before it, were the same idea from two angles: move verification out of the agent's head and onto disk, where I can check it without taking the agent's word for anything.

But I ended checkbox theater on a confession. There was one part of the system that still ran on me remembering to run it. This piece is about closing that gap. It is the last of three pieces on gating, and unlike the other two, it has an actual ending.

The thread I left hanging

Here is what I admitted last time. After every review there is a step where you look at what the reviewers caught that your gates didn't, and you write down the lesson so the gate catches it next time. I had that step. I had built the log it writes to. The mechanism that feeds those lessons back into the next review worked.

The problem was the trigger. The "now go look at the reviewer comments and write down what we missed" part was mine. I had to prompt the agent to do it. If I forgot, which I did, the loop didn't run, and the lessons that should have hardened into checks just evaporated.

So I had a feedback loop with a single point of failure, and the single point of failure was a person. Me. The least reliable component in any system I have ever built.

What the gap log actually is

Before the fix, the mechanism. The gap log is one file, append-only, one line per gap:

2026-04-22 | PR-1247 | clarity      | "this powerful feature" not caught            | resolved
2026-05-12 | PR-1268 | completeness | nav entry missing for new section            | mechanized
2026-05-18 | PR-1284 | technical    | dependency claim not cross-checked vs source | open

Three statuses carry the whole thing. open means the gap is logged but nothing automated catches it yet, so the next review reads it and injects it as an extra check. mechanized means a scan or a hook now catches the pattern on its own, and the line can go quiet. resolved means the underlying problem is gone, often because something upstream changed, and no check is needed anymore.

The format is boring on purpose. The value is in what reads it: before the next review starts, the pre-flight step opens this file and turns every open line into an additional check on its dimension. A gap that surfaced once keeps showing up in every subsequent review until it becomes a script or it becomes irrelevant.

That is the part I am proud of. It converts a one-time lesson into infrastructure. The thing I learned reviewing PR-1284 doesn't sit in a Slack message I will lose by Thursday. It sits in the log, and the log gets read.

The part that was still me

And yet. The capture was manual. After a review, I had to be the one to say "walk the comments, log the gaps." The log got written, the injection worked, the lessons carried forward, all of it, but only after I remembered to pull the trigger. A loop that depends on me remembering is not a loop. It is a to-do item wearing a loop costume, and I will eventually drop it.

Cutting myself out

Two moves fixed it.

The first was to stop waiting on me. I had been treating gap capture as something the agent does and then pauses on: find the gap, log it as open, show me, let me decide whether to build the check. That is two places for me to be the bottleneck. Getting asked, and approving. I removed both. The rule now is that for every gap, the agent builds the smallest deterministic check it can, in the same session, and reports it afterward. It does not log the gap and wait. It does not present the analysis and ask permission. It does the work and then tells me what it did.

Not every gap survives that. Some need judgment a regex cannot fake, version-aware reasoning, the difference between a reader and an actor in a sentence. Those still get logged open, with a recorded reason for why a dumb scanner would do more harm than good. The default flipped, though. It used to be "log it and wait for John." Now it is "mechanize it, or justify why you can't."

The second move was the trigger that replaced my memory. Every review now records its findings to a small ledger, pass by pass. When a later pass surfaces a finding that no earlier pass recorded, that is a leak, and the check that detects it exits nonzero. A nonzero exit is the signal to run the whole routine: root cause, re-audit the artifact, close the gap by strengthening the gate. I do not have to notice the miss anymore. A thing the first pass should have caught showing up in the second pass is, mechanically and exactly, what "we missed something" looks like. So the script notices, and the script does not get tired or distracted or called into a meeting.

It normalizes the findings before it compares them, so a shifted line number doesn't fool it. Same issue, different line, still a match.

What it looks like when it works

Three real cases, because the abstract version never lands.

A British spelling slipped through. The automated reviewers passed it, the nine scans in my style gate passed it, and the human reviewers who spotted it held back on purpose, to see whether the gate would flag it on its own. It didn't. There was no spelling scan, so "honour" cleared all nine, and the miss went unactioned until I ran an explicit American-English pass days later. The gap got logged, and inside the same session it became a curated British-spelling scan, tuned to fire on the real offenders and stay silent on the words that only look British. Now it catches the next one before it reaches a human at all.

The same page lived in several parallel versions. A wording fix landed in some of the copies and not the others, so two of them kept the wrong term and nobody noticed they had drifted apart. The lesson became a parity scan: when one copy of a shared page changes, diff the changed lines against its siblings and flag the ones that no longer match. Verified against the exact case that produced it, zero false positives.

A config key got documented that did not exist anywhere in the source. Plausible-looking, well-formatted, completely invented. The fix was a zero-hit search: take every identifier the docs claim is real, search the actual source for it, and flag the ones that come back with nothing. The agent can no longer make up a key and have it sail through, because the gate now asks the source whether the key exists before believing the prose.

Each of those started as a single embarrassing miss and ended as a check that runs forever. That is the entire point of the log. The miss is one-time. The check is permanent.

The principle

Here is the spine of all three pieces, finally connected. Checkbox theater said: do not trust the agent's claim, check the artifact it produced. Mechanical pattern gates said: take the note you keep repeating and turn it into a scan. Both move verification out of the agent's control. This piece moves the improvement out of mine.

Because that was the gap I could not see for too long. I had made the agent's verification mechanical and left the system's self-improvement running on a human prompt, and the human was me. A feedback loop that fires only when I remember to fire it is checkbox theater one level up. The box that said "lessons captured" was getting checked by me deciding I had done it, with nothing on disk proving I had. Same failure mode, one floor higher, harder to see because it was wearing the costume of a process that worked.

The capture has to be as mechanical as the checks it produces. Otherwise the weakest gate in the whole system is the one whose entire job is to make the others stronger.

Where it actually stands

The loop runs without me now. Gaps get caught, the mechanizable ones become checks in the same session, the rest sit logged with a reason, and the leak detector trips when something slips between passes. I am still in it, but as the person who reads what it did, not the person it depends on to remember to do anything.

Plenty is still open, and I am fine with that. The log has open lines I have not cracked because they need judgment I cannot cheaply encode: version-aware checks, telling a reader apart from an actor, the cases where a blunt scanner would flag more good than bad. "Open with a reason" is an honest state to leave something in. The goal was never zero open gaps. It was to stop losing the lessons between the moment I learned them and the moment I forgot.

So that is the gating arc. Three pieces, one idea: the agent does the work, but the work it does, and the work of getting better at the work, both live somewhere I can point to instead of somewhere I have to vouch for. If you are building anything where the same system does the work, audits the work, and improves the work, I would put to you the question it took me three articles to answer for myself. Which of those three still depends on you remembering? That is the one to mechanize next. I would like to hear what you find when you go looking.

I write about AI-assisted documentation workflows, developer experience, and the evolving role of technical writers. If any of this resonates, let's connect on LinkedIn.

Mechanical pattern gates: how I turned recurring review comments into scans

John Rojas — Sun, 14 Jun 2026 18:40:59 +0000

In the last piece, I moved the review gate outside the agent's control. The hook reads files, not sentences. If a scan didn't write an artifact to disk, the agent doesn't get to post its review to GitHub, no matter how confidently it reports that the scan ran. That fixed the enforcement problem. It left a more basic question sitting there untouched: what are the scans actually looking for?

Five things. This piece is about those five, and about the small, slightly embarrassing moment where each one stopped being a comment I kept retyping and turned into a script that retypes it for me.

None of these is clever

I want to say that up front, because the temptation in an article like this is to make the engineering sound harder than it was. It wasn't hard. Most of these scans are a few lines of bash and a regex I would not show a real engineer without apologizing first.

The value was never in the sophistication. It's in the repetition. A script runs the same scan, the same way, against every diff, every time. I don't. I read carefully on a Tuesday morning and I read like a man late for lunch on a Thursday afternoon, and the gap between those two reviews used to be the gap between catching a passive-voice definition and shipping it to a developer who then had to parse it. The scan doesn't have a Thursday afternoon. That's the entire pitch.

Here's the pattern that produced all five of them. I'd leave a review comment on a PR. Then I'd leave the same comment the next week, on someone else's PR, in nearly the same words. By the third time you type a sentence, you start to suspect the sentence wants to be a script. Every one of these scans is just a review comment I got tired of writing.

The five scans

The real ones live in a script called style-gate.sh. Each one runs against the added prose lines from the diff, with code and link syntax stripped out first, so the scans only ever see prose. What follows is the shape of each, simplified down to the part that does the work.

Will and would: the one that started it

Documentation describes how a system behaves now, not how it will behave at some unspecified future point. So "the client will retry failed requests" is a small lie of tense. The client retries failed requests. Present, indicative, true at read time.

The comment I kept leaving: we describe current behavior in present tense, so "will retry" should be "retries." Every week. Sometimes twice.

The scan is about as dumb as it sounds:

# will/would: future and conditional framing in prose that should be present-tense
grep -nE '\b(will|would)\b' "$added_prose_lines"

The false positives are real and I left them in on purpose. Sometimes "will" is correct: a deprecation notice ("this parameter will be removed in v3") is a genuine statement about the future. So the scan flags, it doesn't block. It hands me a list of line numbers, and I spend ten seconds deciding which ones are tense errors and which are legitimate. Ten seconds I will actually spend, because the list is sitting in front of me, instead of the careful reading I would have skipped.

Passive voice: flagged, never failed

This is the noisiest scan I run, and the one I trust least to be right, which is exactly why it never blocks anything.

The comment it replaced: consider active voice here so the reader knows who does what. "The token is validated by the middleware" hides the actor at the back of the sentence. "The middleware validates the token" puts it back up front.

Detecting passive voice with a regex is a crude business. You're looking for a form of "to be" followed by a past participle, and English has opinions about that you cannot fully encode in grep:

# passive voice: a 'to be' form followed by a likely past participle
grep -nEi '\b(is|are|was|were|been|being)\s+\w+(ed|en)\b' "$added_prose_lines"

It catches real passives. It also catches "the data is embedded" when embedded is doing perfectly honest work, and it misses passives built on irregular participles. I could spend a weekend tightening it. I won't, because passive voice is the one dimension where the regex genuinely cannot tell a good call from a bad one. "The request is rate-limited" is fine. The scan can't know that. So it surfaces candidates and stays out of the decision. I read the list. I keep the ones that earn their place.

Placeholders: TODO, TBD, and the bracket that shipped

This one exists because of a specific humiliation. A [TODO: confirm the actual rate limit] once made it into a merged docs PR, sat in published documentation for a sprint, and got found by a developer rather than by me.

The comment it replaced wasn't even a sentence. It was the sigh you make when you find scaffolding in the finished building.

# placeholders: leftover scaffolding in prose
grep -nE '(TODO|TBD|FIXME|XXX|lorem ipsum)' "$added_prose_lines"

The interesting tuning problem was brackets. I wanted to flag a stray [fill this in], but markdown link syntax is full of legitimate brackets, and so is CLI documentation ([options], [--flag]). My first version flagged every link on the page and I nearly threw the whole idea out. The fix was boring: run the bracket check only on prose lines, after stripping inline code spans and link syntax. Most of tuning these scans is teaching them where not to look.

The marketing words

Documentation has a vocabulary it borrows from the marketing site, and it should give it back. "This powerful endpoint simply returns the user object." Powerful is doing no work. Simply is doing worse than no work, because it tells a reader who finds the thing hard that they were supposed to find it easy.

The comments I kept leaving were two: "powerful" is a marketing word, not a documentation word, cut it, and "simply" assumes it's simple for the reader, who may not agree.

So there's a word list:

# marketing creep and reader-blaming minimizers
grep -nEiw '(powerful|robust|seamless|effortless|blazing|simply|just|easily|obviously)' "$added_prose_lines"

"Just" and "easily" are the troublemakers here, because "just the Authorization header, not the whole set" is a legitimate use of just, and the scan can't see the difference. Same answer as everywhere else: it flags, I scan the hits, I cut the ones that minimize the reader's difficulty and keep the ones doing honest narrowing. The list is short by design. A long banned-words list produces so much noise that you start ignoring it, and a gate everyone ignores is just checkbox theater with extra steps.

When a sentence should have been a fact

This is the subtlest of the five and the one I'm still tuning. It catches hedged, conditional prose standing in for a fact the reader actually needs.

"The cache may or may not be invalidated depending on the header." That sentence has the shape of information and the content of a shrug. The reader has a yes/no question (does my request invalidate the cache?) and the sentence answers "maybe." Somewhere a fact exists: setting Cache-Control: no-store invalidates the cache; other values leave it intact. The hedge was hiding a condition I could have just stated.

The comment it replaced: this reads as uncertainty, but there's a deterministic rule underneath, so state the condition instead of hedging it.

# hedged prose where a stated condition would serve the reader better
grep -nEi '(may or may not|can either|might also|sometimes|in some cases|it depends)' "$added_prose_lines"

The honest disclosure on this one: it has the highest false-positive rate of the five after passive voice, because sometimes the uncertainty is real. Some behavior genuinely depends on runtime state you can't reduce to a clean rule. So this scan is the least mechanical of the lot in practice, even though the grep is trivial. It mostly works as a prompt: you wrote "may or may not," are you sure there isn't a fact here you're avoiding? Often there is.

What a scan can't see

A regex catches a shape, not a meaning. The passive scan flags "the token is validated by the middleware," and most of the time it's right to. But it cannot tell you that "the middleware validates the token" is the better sentence, because it doesn't know what the sentence is for. It knows the sentence has the shape of passive voice. That is the entire extent of what it knows.

So none of these scans decides anything. They surface candidates. I decide. This is the part that gets lost when people hear "automated review" and picture the machine doing the reviewing.

It's worth being precise about the division of labor, because it's the whole reason the system works. The scan is not there to replace my judgment. It's there to protect my judgment from my own inconsistency. I'm good at deciding whether a passive construction earns its place in a given sentence. I'm bad at remembering to look for it at four o'clock on a Thursday. The scan does the looking, which it never forgets to do. I do the deciding, which it can't do at all. We are each assigned the part we're actually reliable at.

What doesn't have a scan yet

This handles the failure modes I can name. The ones I've typed often enough to recognize as a pattern and reduce to a regex. There's a whole other category underneath it: the comment I haven't gotten tired of yet, because I've only had to make it once. Those don't have scans. They have a log.

That's the next piece. The gap log, and the genuinely annoying problem of how a lesson I learn on one PR survives long enough to become a scan I run on the next one. I have not fully solved it. But there's a workable shape, and I'll show you exactly where the friction still is, because pretending it's seamless would land me on my own marketing-words list.

If you run scans like these, I'd want to see your false-positive list before anything else. That's where the real knowledge lives. Mine took far longer to tune than to write, and the tuning is the part nobody puts in the article.

I write about AI-assisted documentation workflows, developer experience, and the evolving role of technical writers. If any of this resonates, let's connect on LinkedIn.

Checkbox theater: how I stopped trusting my AI agent to run the checks

John Rojas — Sun, 24 May 2026 10:36:22 +0000

For context: in the previous piece, I worked through a five-dimension review framework for documentation, covering clarity, readability, style, completeness, and technical accuracy. Those dimensions are now part of how our team's AI agent reviews PRs. It runs them on every review pass, quietly, in the background. Most people don't think about them. They just see the review output.

Then I started catching things on my own review pass that the agent had marked clean. The style scan reported zero hits. I'd find three present-tense violations on the next read. A completeness check came back marked complete. A ticket requirement was unaddressed in the diff. The dimensions were running. They were also missing things, and I was the one finding what they missed.

This piece is about what I learned from that gap, what I built to close it, and the bigger principle I'm still working through. Sharing as I go, in case any of it is useful.

The setup

I'd built up a set of gates around the dimension checks. A PRE-FLIGHT gate that forced the agent to write a todo list with concrete execution methods for each dimension before any review work began. No "I'll check style" wishful thinking; you had to say "I'll run gh pr diff and scan for forbidden terms, will/would violations, and passive voice constructions." A COMPLETION gate that required documented evidence for all five dimensions before the review file could be written: findings, or "no issues, here's what I checked."

It felt thorough. Looked thorough. Read thorough on paper. Was thorough in approximately the same way a paper checklist is fire-safe.

The failure mode

What I started noticing across reviews:

A style scan reported clean. On my own pass through the diff, I caught three present-tense violations.

A completeness check came back marked complete. A ticket requirement was unaddressed.

Sub-agents reported back with results no artifact on disk corroborated. "Ran the will/would scan, zero hits." Where? Show me. There was no where. The scan had never produced output. It had produced a sentence.

I want to be careful about how I name this. It was satisfying the instruction as stated. The PRE-FLIGHT todo said "run the will/would scan." Marking the todo complete satisfied the instruction. Whether the scan had actually executed and produced findings was, structurally, outside the loop. The gate was social, not mechanical. It depended on the agent choosing to do the work, and on me choosing to believe that the work had been done because the agent said so.

I was reading the agent's confidence as evidence. Confidence is not evidence. It's a sentence about evidence. There is a difference, and the difference was costing me on every review.

The phrase that stuck was checkbox theater. The gates existed. They had names, structure, even formal blocking semantics. What they didn't have was teeth. They lived in instructions, and instructions are wishes.

The shift

The question I put to the agent: can we make these gates mechanical? Not "the agent should run the check" but "the agent cannot move forward until the check has produced a written artifact, on disk, tied to the current state of the PR."

That reframe is the whole article in one sentence. Evidence over status. Substrate over self-report. The shift from a gate that asks the agent to verify itself to a gate that verifies whether the agent has verified itself, where "has verified" is measured by file existence, not by claim.

Once that shift was on the table, the implementation became obvious. Most of it, anyway. I'm still finding the edges.

The implementation: three moves

Move 1: Scripts that write artifacts, not status flags

Each mechanical scan now writes a JSON file. Not a status. Not a return code. A file, with hit counts, file paths, sample matches, a timestamp, and the SHA of the PR HEAD it was run against.

{
  "pr_number": "NNNN",
  "pr_head_oid": "<headRefOid>",
  "run_at": "2026-05-12T08:59:14Z",
  "dimensions": {
    "style": {
      "status": "ran",
      "hits": 0,
      "source": "style-gate.sh (5 scans: will/would, passive, placeholders, superlatives, boolean)"
    },
    "readability": {
      "hits": 2,
      "scan": "sentence length > 25 words on added prose lines",
      "samples": ["docs/api/auth.md:42", "docs/api/auth.md:67"]
    }
  },
  "total_hits": 2,
  "status": "ran"
}

The SHA pin is doing real work. If the PR gets a new commit, the artifact's pr_head_oid no longer matches the current HEAD. The artifact is now stale, which means the scan results are stale, which means whatever was clean five minutes ago is no longer demonstrably clean. The agent has to re-run.

Move 2: A hook that intercepts the destination

This is the move that turned out to matter most. Cursor supports a beforeShellExecution hook: a shell script that runs before any shell command the agent issues. The hook reads the command, decides whether it's a PR-write command (gh api .../pulls/<N>/comments, gh pr edit --body, gh pr comment), and if so, validates the gate artifacts before deciding whether to allow or deny.

The mechanism here is Cursor-specific. The principle isn't. Other agentic tools have equivalent shell-level hooks; if yours doesn't, the enforcement point shifts to a pre-commit hook or a CI gate, but the move is the same: put the verification somewhere the agent can't talk its way past.

The validation is dumb on purpose. Does PR-<N>-tickets.json exist? Does it have status: "loaded" or "partial_blocked"? Does its pr_head_oid match the current HEAD from gh pr view? Same questions for PR-<N>-gate.json. If any check fails, the hook returns deny with a clear message:

pr-review-gate-hook BLOCKED gh api .../pulls/NNNN/comments on PR #NNNN.

Missing or stale gate artifacts:
- Stale: PR-NNNN-gate.json pr_head_oid=<old-sha> but current PR HEAD is <new-sha>
  Re-run: ~/Documents/docs-agent/scripts/review-gate.sh NNNN

Resolve and retry. Bypass available for one command via environment variable.

What changed when this hook went live: the agent stopped being able to lie about whether it had run the scans. There was nothing for it to lie about. Either the artifact existed and the SHA matched, or the call to GitHub got blocked. The agent could still produce a sentence saying "I ran the scan." That sentence no longer affected anything. The hook didn't read sentences. It read files.

Here's the before-and-after, visually:

One subtle but important detail: if the hook itself errors, if jq isn't installed, if gh can't reach the network, the command is allowed through with a warning. The gate fails open. This is deliberate. The cost of false negatives, a stale artifact slipping through, is low because the next review will catch it. The cost of false positives, every command bricked because the hook crashed, is high. A bypass environment variable exists for the same reason: when you genuinely need to override, you can, but you have to do it on purpose.

Move 3: The dimension rules that match

Hard gating only works if the gates point at the right things, so the rules inside the dimensions got tightened too. The style dimension can no longer be satisfied by citing the style guide; you have to run the mechanical scans and either resolve every hit via inline suggestion or document zero matches with the command that produced them. The completeness dimension requires a per-requirement mapping table built from tickets.json, not from the PR diff alone, because feature-mapping from the artifact being reviewed is circular. The rule structure stopped being aspirational and started being operational. "Run the check" turned into "produce the artifact that proves you ran the check, and here's the schema."

The feedback loop: lessons as infrastructure

Hard gating handles the failure modes I already knew about. It does not handle the ones I haven't run into yet. For those, there's a separate piece: the gap log.

The gap log is an append-only file. After every review, the POST-REVIEW IMPROVEMENT gate runs: take every reviewer comment, ask whether the workflow could have caught it before submission, and if yes, draft a check that would catch it next time. The check gets logged.

The format is one line per gap:

2026-05-10 | PR-1234 | style    | passive voice not caught on added definition lines | open
2026-05-12 | PR-1242 | complete | nav entry missing for new partial                  | mechanized
2026-04-22 | PR-1289 | clarity  | "this powerful feature" not caught                 | resolved

Three statuses do the work:

Status	Meaning
`open`	Logged but not yet caught by any script or hook. Next PRE-FLIGHT reads this and injects the gap as an additional dimension check.
`mechanized`	A scan or hook now catches this pattern automatically. The gap can sit dormant; the infrastructure handles it.
`resolved`	The underlying recurring pattern is gone (often because upstream changed). No further check needed.

What this does, structurally, is convert one-time learnings into infrastructure. A gap surfaced in PR-1234 doesn't sit in a Slack message I'll forget about. It sits in the log. The next PRE-FLIGHT reads the log and reminds the agent. When I get around to writing a script that catches the pattern mechanically, the status flips. The lesson doesn't depend on me remembering it.

Honest disclosure: this part still has friction. At the end of each session, I have to prompt the agent to walk through reviewer comments and log gaps. The pattern isn't fully self-sustaining yet. The log gets written. The injection works. The "remember to look at the comments" step is still mine. There's a future article in closing that gap, and I'm still figuring out the right shape for it.

The principle

Trust-but-verify is the wrong frame for AI workflows, because verification is what you're asking the LLM to do. The agent that runs the check is the same agent reporting on whether the check ran. There is no second agent watching. The integrity of the whole thing depends on a self-report from the entity being audited. That's not a verification model. That's a wish.

The fix isn't a better prompt. The fix is moving verification outside the agent's control loop entirely. The agent can write any sentence about whether it ran a scan. The hook cannot write any sentence about whether a file exists. The file is on disk or it isn't. The SHA matches or it doesn't. There is no clever phrasing the agent can produce that changes that.

Where I am right now: hard gates for the failure modes I know about, gap log for the ones I don't, manual prompting for the meta-check that I don't have a way to automate yet. It's a workable system I'm still actively improving.

If you're building AI-assisted workflows, especially ones where the LLM both does the work and audits the work, I'd push you to ask the same question I had to ask: what does my agent actually produce, on disk, that I could check without taking its word for anything? If the answer is nothing, that's checkbox theater. The first step to closing the gap is making it produce something.

More on the specific mechanical scanners in the next piece. More on the gap log and the durability problem in the one after that. If you've solved any of this, or seen a sharper take on it, I'd want to hear from you.

I'm publishing this while the system is still actively improving, because the principle landed for me before the implementation did, and the implementation isn't finished. If you're building AI workflows where the agent both does the work and audits it, I'd want to hear how you've thought about that gap, or whether you've found a way to close it.

I write about AI-assisted documentation workflows, developer experience, and the evolving role of technical writers. If any of this resonates, let's connect on LinkedIn.

I kept saying the same five things in every doc review. So I made them defaults.

John Rojas — Thu, 14 May 2026 19:23:09 +0000

In my last piece, I wrote about how AI changed my work as a technical writer. I closed with a line I keep coming back to: technical writing is shifting from producing content to building the gates, constraints, and workflows that ensure content is accurate before it ever reaches a reader. This piece is about one of those constraints, named: five things I check in every documentation review, and how I went from reciting them out loud every time to having them run by default.

How I got here

I was reviewing PRs alongside an AI agent. The agent was helpful. The agent was also unevenly helpful. Some reviews came back sharp. Others missed things any first read should have caught. The pattern wasn't in what the agent did. It was in what I kept telling it.

At first, I was iterating on prompts. Trying different framings, different phrasings, different orders of operations. Some produced sharper reviews. Some didn't. And I wasn't writing any of it down.

Then I'd remember a recent review that had come out particularly well, and I'd want to use the same approach again, and I'd realize I had no idea what I'd told the agent that time. I found myself reconstructing the prompts from memory. And badly at that. Yikes.

So I opened a Slack message to myself and pasted in my current working prompt. The one with the five things I kept asking the agent to check. Then I closed Slack and got back to work.

The next time I started a review, I opened that Slack message and pasted the prompt into Cursor. The time after that, I did the same thing. And the time after that. And the time after that. And another. And another. You know where this is going.

I want to say this didn't go on for long. But it did. OMG it did. I was apparently fine pasting the same wall of text into every conversation, multiple times a day, until my own laziness finally lit up a bulb.

If I'm pasting the same instructions every time, the agent shouldn't be relying on me to paste them. They should just be there. By default.

Something else was eating at me in the background. Every long-pasted prompt was eating up part of the context window before the agent had even started reading the diff. That seemed wasteful. (Spoiler: it was.)

So I started baking the five dimensions into my local preferences. Not the agent rules. Not the slash commands. Just my own preferences and local memory, where I could iterate without touching anything the rest of the team relied on. I wanted to test whether the bake-in actually preserved the catches I was getting from manual prompting. If the local version produced worse reviews than my paste-everything version, I'd know early. And nobody else would have to deal with it.

It didn't produce worse reviews. It produced consistent ones, which is its own kind of better.

So I named the thing I'd been doing all along.

Clarity. Readability. Style. Technical accuracy. Completeness.

Five dimensions. Each one a specific kind of failure I'd seen the agent commit, and a specific kind of catch I'd had to make on my own pass through the diff. Once they lived in the rules, the realization landed: this isn't a checklist I should be reciting from memory before every review. It's a checklist that should run by default.

The five dimensions

Clarity

Technical documentation has an obscurity problem. Layers of concepts, configuration options, dependencies, prerequisites. Somewhere underneath the torrent of context, the actual thing you came to learn. Clarity is the dimension that asks: can a reader who doesn't already know this still follow it?

It catches the sentence that assumes too much. The acronym used before it's defined. The reference to a concept that's actually three concepts. The instruction that skips a step because the author considered it obvious.

When clarity is missing, the reader rereads. Then they reread again. Then they give up and ask in Slack. Documentation that needs a translation isn't documentation. It's a draft.

Readability

This one is structural, not conceptual. Readability is about how text feels on the page. The sentence that runs three clauses long without breath. The paragraph that's nine lines tall and looks like a wall. The bulleted list that's actually seven sub-points masquerading as one.

The goal isn't "easy" prose. The goal is prose someone can absorb without working at it. Chunking. Variation. White space. The boring craft of making information accessible to a reader who is tired, distracted, or scanning while their build runs.

When readability is missing, the eye glazes. The reader skims past the part that matters. Worse, they skip the page entirely and search for an answer somewhere else.

Style

Every team has a style guide. Most are imperfect. All of them are non-negotiable inside the codebase they belong to.

Style is the dimension that catches drift from those rules. Present tense where past tense slipped in. The term we use vs. the term we don't. Sentence case in headings when it should be title case, or the other way around. Boolean phrasing when we've agreed on declarative.

I'm a stickler for directional language in particular. We don't allow it. I've been known to chase every instance of "via" across a doc set and hunt each one down to oblivion. (Nah, I just replace them with something else.) It's an itch I have to scratch, and honestly, this kind of work is what I live for. (My obsessive tendencies might be showing. I'm fine with it.)

It's also the dimension AI agents are most likely to silently abandon mid-document. Style rules are pattern-matching against a long list of small constraints. Context windows get full, attention drifts, and what should have been a will/would catch becomes a missed will. The agent didn't forget the rule. It just stopped applying it.

When style is missing, the doc set drifts. Voice fragments. Readers feel the inconsistency before they can name it.

Technical accuracy

This is the highest-stakes dimension. Technical accuracy is the dimension that asks: is this actually true?

It catches the configuration property that doesn't exist. The endpoint behavior described backwards. The version compatibility claim that contradicts the upgrade guide. The example that compiled in someone's head but not in a terminal.

The configuration property that doesn't exist is the one I keep coming back to. I added this dimension specifically because of one passage I reviewed that read beautifully. Confident, authoritative, smooth. Then I went to check the configuration properties it referenced. They didn't exist. Not anywhere in the codebase. The text was perfectly written and completely wrong. That's when technical accuracy became its own line item on every review.

The check has two parts. Source-grounded: does this match the codebase, the specs, the schemas, the actual source of truth? Cross-document consistent: does this conflict with what we've already said elsewhere?

Both halves matter. A statement can be technically correct against the code and still wrong in the context of the broader documentation. Two pages can each be internally consistent and contradict each other. AI agents are particularly good at producing both kinds of failure, in perfect sentences, with confident phrasing.

When technical accuracy is missing, readers act on bad information. That's the worst outcome documentation can produce. Everything else is just polish.

Completeness

This is the dimension most external to the doc itself. Completeness asks: did we address what the ticket actually said?

It catches the requirement that's documented in the PR description but not in the docs. The acceptance criterion that got lost between planning and merge. The new partial that needs a nav entry. The feature flag that quietly changes behavior in a way someone needs to know about.

The trick with completeness is that it requires holding two artifacts in mind at once: the change being reviewed, and the requirements that motivated the change. AI agents handle this poorly by default. They look at the diff and review the diff. They forget to look back at the ticket and ask whether the diff is enough.

When completeness is missing, the docs ship and a customer finds the gap two releases later.

The five at a glance

Dimension	What it catches	What missing it looks like
Clarity	Concepts that need decoding	Reader rereads, gives up, asks in Slack
Readability	Walls of text, no rhythm	Eye glazes, page gets skipped
Style	Drift from style guide rules	Voice fragments, inconsistency felt
Technical accuracy	Hallucinations and contradictions	Readers act on bad information
Completeness	Missing requirements or acceptance criteria	Customer finds the gap two releases later

What naming didn't solve

Baking the five dimensions into the rules solved one problem. It didn't solve all of them.

Naming the dimensions told the agent what to check. It didn't tell the agent how to verify the check had actually happened. There's a gap between "the agent ran a clarity pass" and "the agent ran a clarity pass and the output exists somewhere on disk that I can read." For a while, I didn't see that gap. The agent would report clean across all five dimensions, and I'd believe it, because the dimensions were named and the dimensions were "covered."

Then I started catching things on my own review pass that the agent had marked clean. Why? That's a tale for another time.

What might be missing

Five dimensions felt like enough when I wrote them down. I'm less sure of it now.

A few I've been mulling over, but haven't committed to:

Accessibility. Heading hierarchy, alt text on images, link labels that say more than "click here," screen reader behavior on tables and code blocks. Most style guides treat this as a subset of style. I'm not convinced it deserves to live under style; the failure modes are too different.

Cross-document consistency. This currently sits inside technical accuracy ("doesn't contradict what we've said elsewhere"). But the failure mode is structural, not factual. Two pages can each be true and still pull a reader in opposite directions. Worth a separate dimension? Maybe.

Discoverability. A page that's accurate, clear, well-styled, and complete is still useless if no one can find it. Nav placement, search terms, related-links suggestions. Probably belongs somewhere in the framework. I just don't know where.

Audience appropriateness. A doc written for the wrong reader fails before any of the other dimensions get a chance to. Did we write this for the developer, the admin, or the integrator? Each one needs a different shape.

None of these are settled. They might fold into the existing five with a sharper definition. They might warrant standalone dimensions. They might turn out not to matter at the level of automated review. I don't know what I don't know yet. The fun is in finding out.

Are these the kinds of things you check for in your own reviews? I'd be curious to hear what's on your list.

What's next

This is the first piece in a small series I'm working through on AI-assisted doc reviews. Next up: what happened when I discovered that naming the dimensions wasn't enough to make the agent actually check them.

The build is never finished. You just keep improving it. That still suits me fine.

I'm sharing this because the framework isn't a destination. It's a working hypothesis. Five named dimensions emerged from my own habits and my own catches, and the next reviewer to read past them might find a sixth. If you've shaped something similar from your own practice, or watched these dimensions miss things mine hasn't yet, I'd genuinely love to hear about it.

I write about AI-assisted documentation workflows, developer experience, and the evolving role of technical writers. If any of this resonates, let's connect on LinkedIn.

What AI actually changed about my work as a technical writer

John Rojas — Sat, 28 Mar 2026 08:44:57 +0000

For me, the most frustrating part of technical writing was never the writing.

It was the waiting.

Waiting for a subject matter expert to have a free hour. Waiting for a developer to confirm whether something changed between releases. The information existed. It just lived in someone else's head, or somewhere in a codebase I wasn't expected to touch.

To be fair, waiting wasn't always the only option. I've always preferred finding things out for myself, digging into a ticket, tracing a thread, unblocking my own work rather than sitting on a question. But self-directed digging has its own cost. It takes time, it can send you down the wrong path, and without the right tools it's easy to spend an hour finding half an answer. The codebase was there. I just didn't have a good way in.

That changed when my team started experimenting with AI-assisted documentation workflows.

Where I started

I didn't build the workflow. I was asked to test it.

A colleague had been developing an AI-assisted approach to documentation and brought me in to try it, break it, and help refine it before we rolled it out to the wider team. That process taught me more about how AI actually works in practice than any course I've taken.

The core idea was straightforward. Instead of waiting for a developer to explain a feature, we used AI tools to investigate the codebase directly. Read the source. Check the specs. Cross-reference existing documentation. Then plan what to write before touching a single file.

Refining the workflow felt a lot like optimizing a character build in a game. You start with what you have, figure out what works, pick up better tools along the way, and keep tweaking until the whole thing plays smoothly. There's no perfect final state. You just keep improving the build.

What I noticed

Three things stood out.

The first was speed. Not because AI writes faster, but because the back-and-forth shrinks. Fewer dependency chains. Less time blocked waiting for input that may or may not arrive before the deadline.

The second was confidence. I could start a piece of work with a clearer picture of what I was walking into. Instead of beginning from a blank page and a vague (sometimes even empty, shocker, I know) ticket description, I had a verified starting point grounded in what the code actually does. That changes how you work.

The third was autonomy. I could start work on a ticket without needing anyone's permission or availability. That shift is harder to explain than it sounds. After years of structuring your work around other people's calendars, being able to just start is genuinely different.

What I also learned

Three things here too, and none of them are about prompting.

The first is that AI-assisted workflows are only as good as the information they work with. Garbage in, garbage out. If the source material is incomplete or inconsistent, the output reflects that. You still need to know what good documentation looks like. You still need to catch what the AI misses. Judgment doesn't go away. It just gets applied differently.

The second is that critical thinking becomes more important, not less. You're not just reviewing writing anymore. You're reviewing claims. Every output needs to be read with the question: is this actually true, or does it just sound true? AI doesn't flag uncertainty the way a careful human writer might. It states things. Confidently. Whether they're correct or not. In an age where misinformation spreads faster than corrections, accuracy and fact-checking are non-negotiable. That doesn't change just because the content came from an AI. If anything, it becomes more important.

The third is that domain knowledge and attention to detail are not optional. They're your last line of defense. AI will describe a configuration property that doesn't exist, get the behavior of an endpoint backwards, and do it in perfect sentences. It also skips style rules, not always because it doesn't know them, but because context window limits or what I can only describe as selective laziness means it stops applying them mid-document. Directional language, incorrect linking patterns, inconsistent terminology. If you don't know the subject matter well enough to catch a wrong answer, or the style guide well enough to catch a broken rule, those errors walk straight out the door into your published docs.

Where I think this is going

Technical writing is changing. The role is shifting from producing content to something broader: building the gates, constraints, and workflows that ensure content is accurate before it ever reaches a reader.

The core purpose stays the same. Technically accurate content, usable by a specific audience. But the way we get there is different.

I started this by talking about waiting. Waiting for the right person, the right meeting, the right moment to get unblocked. What I've learned is that the writers who thrive won't be the ones who wait less. They'll be the ones who bring more to the table when they show up: domain knowledge, critical thinking, a sharp eye for detail, and the ability to build workflows that catch what they miss.

The build is never finished. You just keep improving it.

That suits me fine.

I'm sharing this because I want to hear how others are navigating the same shift. There's been no shortage of hype and fear around AI in our industry, and honestly, some of that fear is justified. Across the industry, roles are changing. But changing isn't the same as disappearing. What I've experienced is less replacement and more transformation. The job looks different, demands different things, and rewards different skills than it did a few years ago. If you're in the middle of that shift too, I'd genuinely love to hear how you're doing it.

I write about AI-assisted documentation workflows, developer experience, and the evolving role of technical writers. If any of this resonates, let's connect on LinkedIn.