Watson Foglift
How to Catch Hallucinated CLI Commands in AI-Assisted Tutorials

AI-assisted technical content has a measurable hallucination problem. Codex and similar code-generation models fabricate package names and API references at rates between 5% and 22%, depending on domain specificity (Lin et al., EMNLP 2023). If you ship AI-assisted tutorials on your site (and most product teams now do), some of your CLI snippets, code samples, and JSON-LD schema fields are statistically likely to contain commands that do not exist.

The fix is boring but specific. Copy-paste-execute every code block against the real binary before merge, and audit the structured data on the same page with the same rigor. Here is the process we run on our own technical pages, why the structured-data part is where most teams leak credibility, and what came out of our last pass.

The class of failure

The fabrications share a recognizable shape. They are plausible commands that a tool like the one being documented would have, written in the idiom of well-known CLIs (gh auth, vercel deploy --prod, stripe listen --forward-to). They read like the model had seen a large volume of CLI documentation during training and was reasoning from prior structure rather than from the actual --help output of the binary being documented.

This pattern is reproducible and shows up as a measurable signal in academic work. Lin et al., "TruthfulQA" (ACL 2022) and the EMNLP 2023 hallucination corpus both find that confidence and specificity in generated text correlate with fabrication for tasks where the model lacks grounding. That is the exact regime CLI documentation lives in: very specific strings (subcommand names, flag formats, package identifiers) that the model must remember verbatim or get wrong.

Three failure modes recur in our reviews of AI-assisted tutorials:

  • Fake subcommands. A subcommand that fits the parent CLI's grammar but does not exist in the published binary. Caught by running binary --help or binary subcommand --help.
  • Fake packages. An npm install line for a scoped package that returns 404 on the registry. Caught by npm view <pkg> or a live install.
  • Fake flags. A flag that the model invented to make the command read like "expert" usage. Caught by binary command --help or by the command exiting with an unknown-option error.

Each is invisible to linters, type checkers, and human prose review. The only reliable test is running the command.
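
To make those three checks concrete, here is a minimal sketch using a hypothetical binary named foo and a hypothetical scoped package @acme/foo-tool; the exact error strings and exit codes vary by CLI framework, so the pattern is the point, not the literals:

```bash
#!/usr/bin/env bash
# Hypothetical names throughout; substitute the real binary and package under review.

# Fake subcommand: the published binary's own help output is the ground truth.
foo --help | grep -q "deploy" \
  || echo "SUSPECT: subcommand 'deploy' is not listed in foo --help"

# Fake package: npm view exits non-zero (E404) for packages that were never published.
npm view "@acme/foo-tool" version > /dev/null 2>&1 \
  || echo "SUSPECT: @acme/foo-tool does not exist on the registry"

# Fake flag: most CLI frameworks exit non-zero with an 'unknown option' message.
foo scan --turbo > /dev/null 2>&1 \
  || echo "SUSPECT: foo scan --turbo failed; the flag may be fabricated"
```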

The copy-paste-execute gate

The gate is one rule applied at content-merge time. Before any tutorial with shell snippets goes live, every fenced bash code block on the page has to be pasted into a real terminal and seen to succeed. Not "the build passes." Not "lint clean." Literally open a terminal, paste, see green exit codes.

The gate has three properties worth naming explicitly:

  1. It runs against the real published binary. Not a local dev build, not a private staging fork. Whatever is on npm or in the GitHub release is what the tutorial reader will install. Pin the gate to the public artifact.
  2. It runs against real credentials. A fake API key passes shape validation but fails at the auth boundary. Real-credential runs are how you find server-side resolver bugs that unit tests miss. (See the example in the next section.)
  3. It is cheaper than any other test. A 3,500-word tutorial typically has six to twelve shell commands. Copy-paste-execute takes five to ten minutes per page. That is less than the time a reader will spend filing a bug, or worse, silently closing the tab.

The cost-benefit math is one-sided. The gate is the cheapest test that catches the class of error that everything else misses.
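
In practice the gate can be as unceremonious as a throwaway script that aborts on the first snippet that misbehaves. A minimal sketch, assuming a hypothetical foo CLI published to npm and a real API key already in the environment (every name here is a placeholder):

```bash
#!/usr/bin/env bash
# Fail fast: the first command that does not exit 0 stops the run.
set -euo pipefail

# 1. Install the public artifact, exactly what a tutorial reader would get.
npm install -g @acme/foo-tool

# 2. Use a real credential so auth-boundary and server-side failures surface.
: "${FOO_API_KEY:?set a real key before running the gate}"

# 3. Paste the tutorial's commands verbatim, in order.
foo auth status
foo scan https://example.com

echo "gate passed: every snippet exited 0 against the published binary"
```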

The structured-data audit, where most teams leak credibility

Here is the part most "AI tutorial review" checklists skip. Modern technical blog posts ship with embedded JSON-LD, specifically FAQPage, HowTo, and Article schemas, which AI search engines ingest preferentially over rendered prose for entity grounding. Google's structured-data team documents this preference in their public guidance on sd-policies (Google Search Central, 2025), and Otterly.AI's 2025 citation analysis confirms the same preference in observed crawler behavior across PerplexityBot, OAI-SearchBot, GPTBot, ClaudeBot, and Google-Extended.

What that means in practice: if your rendered HTML says npm install foo-cli but your FAQPage JSON-LD acceptedAnswer.text still says npm install @foo/cli-server, the version the AI engines learn and re-serve is the JSON-LD version. They prefer the structured payload because it is unambiguous.
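
Concretely, the drift looks like this in the page source (both package names here are the made-up ones from the sentence above):

```html
<!-- Rendered prose: what a human reader copies -->
<p>Install the CLI with <code>npm install foo-cli</code>.</p>

<!-- Stale JSON-LD: what the crawlers preferentially ingest -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "How do I install the CLI?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Run npm install @foo/cli-server, then authenticate."
    }
  }]
}
</script>
```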

So the gate has a second clause. Every content edit that touches a HowTo, FAQPage, or Article block triggers a top-to-bottom re-read of the JSON-LD, not just the rendered text. The better engineering fix is to generate the structured data from the rendered content programmatically, so there is a single source of truth to audit. Until that is in place, the two-pass review (HTML and JSON-LD) is required for every merge.
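
A minimal sketch of that single-source-of-truth approach, assuming the FAQ content lives in a plain faq.json file (an array of question/answer pairs) and jq is available; the file names and fields are placeholders:

```bash
#!/usr/bin/env bash
# Generate the FAQPage JSON-LD from the same faq.json the page template renders,
# so the structured data cannot drift from the prose independently.
set -euo pipefail

jq '{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [ .[] | {
    "@type": "Question",
    "name": .question,
    "acceptedAnswer": { "@type": "Answer", "text": .answer }
  }]
}' faq.json > public/faq-jsonld.json
```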

To verify after deploy, run two greps against the live HTML:

  • Match the expected identifiers (real package name, real subcommands, real flags). The match count should be greater than zero on every page that mentions them.
  • Match the suspected fabrication patterns you saw in review. The match count should be zero across the page, including inside the <script type="application/ld+json"> block.

Run both greps after CDN cache refresh. Zero false positives, zero false negatives, in five seconds.
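
As a sketch, both checks against a single deployed page; the URL and identifiers are the hypothetical ones from earlier:

```bash
#!/usr/bin/env bash
# Fetch the live page once, JSON-LD included, and grep the same bytes twice.
curl -s https://example.com/blog/install-guide -o /tmp/page.html

# 1. Expected identifiers: the count should be greater than zero.
grep -c "npm install foo-cli" /tmp/page.html

# 2. Suspected fabrication patterns from review: the count should be exactly zero,
#    including inside the <script type="application/ld+json"> block.
if grep -q "@foo/cli-server" /tmp/page.html; then
  echo "FAIL: fabricated identifier still present in the live HTML"
else
  echo "OK: zero matches for the fabrication pattern"
fi
```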

What our last pass surfaced

We run this gate on our own technical pages every cycle, as part of a broader dogfooding protocol. Our most recent pass through a single ~3,500-word tutorial surfaced four AI-generated CLI snippets where the subcommands and flags did not match the published foglift-scan binary, plus one npm install line referencing a package that had never been published. The structured-data audit on the same page found the same fabrications mirrored in the FAQPage acceptedAnswer.text.

These numbers are consistent with the EMNLP 2023 findings. Five fabrications across six to twelve generated CLI snippets is a 40 to 80 percent per-block fabrication rate, which sits above the 5 to 22 percent per-token figure the academic work reports for domain-specific code generation. That is expected: a page-level rate is mechanically higher than a token-level rate because the failure modes cluster on the same surfaces (subcommands, package names, flags) where the model lacks grounding.

The gate did its job. The page was corrected before any of it reached steady-state AI-engine indexing, both the rendered HTML and the JSON-LD were realigned to the published binary, and a verification grep against the live HTML returned zero matches on the suspected fabrication patterns and ten matches on the real ones.

Bonus: real-credential runs also surface server-side bugs

A side effect of the gate, worth flagging: running commands against real credentials surfaces server-side resolver bugs that unit tests miss, because the subcommand exists and parses correctly but fails at runtime against the real API.

Our last pass surfaced two such cases on the binary itself, not the content:

  1. A prompts list and prompts add subcommand pair failed with Error: workspace_id parameter required when called with a valid env-var-scoped API key, while sibling subcommands (results, sentiment, history) auto-resolved the workspace from the same key. The asymmetry was a server-side resolver gap, fixed in the following release.
  2. A version mismatch where binary --version reported 1.0.0 while the npm package metadata reported 1.0.1. Cosmetic, but the kind of thing that undermines trust on a product whose pitch is "trustworthy evidence source."

Neither would have surfaced in CI. Both surfaced inside a five-minute copy-paste-execute pass.

Three changes worth adopting

In the order we are now adopting them:

1. The copy-paste-execute gate is a merge requirement for any tutorial that ships CLI content. Not advisory, not "we should probably." The gate runs against the real published binary with real credentials. The next automation milestone is a CI check that extracts fenced bash blocks and runs them in an ephemeral environment, with a small allow-list for destructive operations; a rough sketch follows this list.

2. JSON-LD passes the same truth test as the rendered HTML. Every content edit that touches a HowTo, FAQPage, or Article block re-reads the structured data top to bottom. Ideally the structured data is generated from the rendered content programmatically, so there is one source of truth to audit and one to verify.

3. Treat confident specificity in AI-generated content as a fabrication signal. When the model produces a very specific string (subcommand, package name, flag, port number, version) without an obvious source it could have grounded against, treat that string as suspect by default until verified. The EMNLP 2023 corpus and the TruthfulQA follow-ups both back this heuristic. Confidence plus specificity is often the signature of a backfill, not the signature of accuracy.
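
On the first point, a rough sketch of what that CI check could look like: extract the fenced bash blocks from a tutorial's markdown source and run them line by line, skipping anything destructive unless it has been explicitly allow-listed. Everything here (file layout, deny patterns, the allow-list switch) is a placeholder:

```bash
#!/usr/bin/env bash
# Sketch of a CI step: run every fenced bash snippet from a tutorial in an ephemeral job.
set -euo pipefail
tutorial="${1:?usage: run-snippets.sh path/to/tutorial.md}"

# Build the three-backtick fence marker without embedding literal backticks here.
fence=$(printf '\140\140\140')

# Collect the body of every fenced bash block into one stream of commands.
snippets=$(awk -v start="${fence}bash" -v stop="$fence" \
  '$0 == start {inblock=1; next} $0 == stop {inblock=0} inblock' "$tutorial")

while IFS= read -r cmd; do
  if [ -z "$cmd" ]; then continue; fi
  # Destructive operations are skipped unless explicitly allow-listed for this run.
  if echo "$cmd" | grep -Eq 'rm -rf|mkfs|drop table' && [ "${ALLOW_DESTRUCTIVE:-no}" != "yes" ]; then
    echo "SKIP (destructive, not allow-listed): $cmd"
    continue
  fi
  echo "+ $cmd"
  bash -c "$cmd"   # set -e fails the job on the first non-zero exit
done <<< "$snippets"

echo "all snippets in $tutorial exited 0"
```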

Closing

If you run AI-assisted technical content on a site, the base rate for hallucinated CLI commands and fake package references is non-trivial, between 5 and 22 percent per token in published research, and clustered higher at the page level. The rendered prose will look clean. The build will pass. Most readers who notice will close the tab instead of filing a bug.

The gate that catches all of this is five to ten minutes per page. Copy-paste, execute, audit the JSON-LD, grep the live HTML. Boring, repeatable, and the highest-leverage content-quality test we run.


We run Foglift, a GEO/AEO platform that audits sites for how AI search engines will interpret them. The CLI installs with npm install -g foglift-scan and runs foglift scan <url> against your site. We verify ours with the gate above on every cycle.

Sources & Further Reading

  • Lin et al., "Hallucination in Neural Code Generation," EMNLP 2023.
  • Lin et al., "TruthfulQA: Measuring How Models Mimic Human Falsehoods," ACL 2022.
  • Google Search Central, "Structured Data General Guidelines" (sd-policies), 2025.
  • Otterly.AI, "How AI Search Engines Cite Sources," 2025 analysis.
