<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nlai</title>
    <description>The latest articles on DEV Community by Nlai (@nick_lai_6fa3f77d7dcf98ce).</description>
    <link>https://dev.to/nick_lai_6fa3f77d7dcf98ce</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960055%2Fdca0edba-6812-4a21-83e0-ae109249d3b7.png</url>
      <title>DEV Community: Nlai</title>
      <link>https://dev.to/nick_lai_6fa3f77d7dcf98ce</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nick_lai_6fa3f77d7dcf98ce"/>
    <language>en</language>
    <item>
      <title>The 5 Systematic Failure Modes of AI Research Reports (and How to Catch Them)</title>
      <dc:creator>Nlai</dc:creator>
      <pubDate>Sat, 30 May 2026 14:06:30 +0000</pubDate>
      <link>https://dev.to/nick_lai_6fa3f77d7dcf98ce/the-5-systematic-failure-modes-of-ai-research-reports-and-how-to-catch-them-o9h</link>
      <guid>https://dev.to/nick_lai_6fa3f77d7dcf98ce/the-5-systematic-failure-modes-of-ai-research-reports-and-how-to-catch-them-o9h</guid>
      <description>&lt;p&gt;AI research reports look authoritative. The numbers line up, the charts are clean, and every claim has a source citation.&lt;br&gt;
But when you actually open those sources, things fall apart.&lt;/p&gt;
&lt;p&gt;After analyzing dozens of AI-generated research reports, I found that LLMs don't fail randomly when doing research at scale. They fail in 5 predictable, repeatable ways, and once you know the patterns, you can catch them systematically.&lt;/p&gt;

&lt;p&gt;Failure Mode #1: Unit and Scale Errors (HIGHEST PRIORITY)&lt;br&gt;
What happens: Numbers lose or gain zeros because a unit was misread.&lt;br&gt;
A report says "revenue was $4,200B." The source says $4.2B. Somewhere between reading the source and writing the report, the AI mishandled a unit conversion and inflated the figure 1,000x.&lt;br&gt;
This is extremely common in cross-language research:&lt;br&gt;
Chinese "亿" (100 million) translated as "billion": off by 10x&lt;br&gt;
"万" (10,000) dropped entirely: off by 10,000x&lt;br&gt;
Axis label on a chart misread: $4.2B becomes $4,200B, off by 1,000x&lt;/p&gt;
&lt;p&gt;How to catch it: For every financial figure, trace it back to the original source and confirm the unit. Then sanity-check it: does the number make sense given the entity's known scale? A startup with $50B in revenue would rank in the Fortune 100, so that figure is almost certainly wrong.&lt;/p&gt;
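&lt;p&gt;As a rough illustration of the unit check, here is a minimal Python sketch. The multiplier table and the helper names &lt;code&gt;to_base_units&lt;/code&gt; and &lt;code&gt;scale_gap&lt;/code&gt; are hypothetical, made up for this example; they are not taken from any tool.&lt;/p&gt;

```python
# Hypothetical multiplier table; extend it for the units your sources use.
UNIT_MULTIPLIERS = {
    "万": 10**4,   # Chinese "ten thousand"
    "亿": 10**8,   # Chinese "hundred million"
    "M": 10**6,
    "B": 10**9,
}


def to_base_units(value, unit):
    """Convert a (value, unit) pair such as (42, "亿") into a plain number."""
    return value * UNIT_MULTIPLIERS[unit]


def scale_gap(reported, source, tolerance=0.05):
    """Return reported/source when the two figures disagree, else None."""
    ratio = reported / source
    if tolerance >= abs(ratio - 1):
        return None
    return ratio


# The source says 42 亿, but the report copied the digits and wrote "42B".
gap = scale_gap(to_base_units(42, "B"), to_base_units(42, "亿"))
print(gap)  # 10.0
```

&lt;p&gt;A gap close to a clean power of ten (10, 1,000, 10,000) usually points to a unit slip rather than a genuine disagreement between sources.&lt;/p&gt;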

&lt;p&gt;Failure Mode #2: Fabricated Interpolation&lt;br&gt;
What happens: When exact data is unavailable, the AI fills in the gaps.&lt;br&gt;
Your report shows a clean 6-year revenue trend:&lt;br&gt;
Year    Revenue&lt;br&gt;
2019    $0.9B&lt;br&gt;
2020    $1.4B&lt;br&gt;
2021    $1.9B&lt;br&gt;
2022    $2.4B&lt;br&gt;
2023    $3.1B&lt;br&gt;
2024    $4.2B&lt;br&gt;
Looks great. But only FY2024 has a cited source. The other 5 points? The AI interpolated a smooth curve.&lt;br&gt;
Real financial data has noise, acquisitions, and currency effects. A perfectly smooth trend line is a red flag.&lt;/p&gt;
&lt;p&gt;How to catch it: For every data series, ask: "Was each data point explicitly found in a source, or was it derived?" Compare totals against components: do the sub-items actually sum to the reported total?&lt;/p&gt;
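&lt;p&gt;Both questions can be sketched in a few lines of Python. This is only an illustration: the &lt;code&gt;DataPoint&lt;/code&gt; class, the helper names, the 1% tolerance, and the placeholder URL are all assumptions made for this example.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataPoint:
    year: int
    value: float                      # revenue in $B
    source_url: Optional[str] = None  # None means no citation was found


def uncited_years(series):
    """Return the years whose values have no explicit source."""
    return [p.year for p in series if p.source_url is None]


def components_match_total(components, total, tolerance=0.01):
    """True when the components sum to the total within a relative tolerance."""
    return tolerance * abs(total) >= abs(sum(components) - total)


# The 6-year trend from the example; the URL is a placeholder.
series = [
    DataPoint(2019, 0.9),
    DataPoint(2020, 1.4),
    DataPoint(2021, 1.9),
    DataPoint(2022, 2.4),
    DataPoint(2023, 3.1),
    DataPoint(2024, 4.2, source_url="https://example.com/fy2024-filing"),
]
print(uncited_years(series))  # [2019, 2020, 2021, 2022, 2023]
```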

&lt;p&gt;Failure Mode #3: Source Conflation&lt;br&gt;
What happens: Different metrics from different sources are merged as if they measure the same thing.&lt;br&gt;
"The Acme app generated $1.2B in revenue" — but the source described marketplace GMV (Gross Merchandise Value), not revenue. For marketplace businesses, GMV is typically 5-20x revenue.&lt;br&gt;
Other examples I've seen:&lt;br&gt;
"Cosmetics trade" (imports + exports) cited as "cosmetics exports"&lt;br&gt;
"Analyst consensus" treated as "filed figures"&lt;br&gt;
"Retail sales" confused with "wholesale revenue"&lt;/p&gt;
&lt;p&gt;How to catch it: For every cited figure, verify the source explicitly uses the same metric name with the same definition and geographic scope.&lt;/p&gt;
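&lt;p&gt;To see how much a GMV-for-revenue mix-up can distort a figure, here is a small Python sketch that applies the 5-20x rule of thumb to the $1.2B example. The function names and the dictionary keys are hypothetical, and the multiples are a rough heuristic rather than a fixed ratio.&lt;/p&gt;

```python
def implied_revenue_range(gmv, low_multiple=5, high_multiple=20):
    """Return the (min, max) revenue implied by a GMV figure.

    The default multiples follow the 5-20x rule of thumb.
    """
    return gmv / high_multiple, gmv / low_multiple


def same_metric(claim, source):
    """True only when metric name and geographic scope both match exactly."""
    return (claim["metric"], claim["scope"]) == (source["metric"], source["scope"])


low, high = implied_revenue_range(1.2)  # $1.2B of GMV
print(f"Implied revenue: ${low:.2f}B to ${high:.2f}B")
# Implied revenue: $0.06B to $0.24B
```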

&lt;p&gt;Failure Mode #4: Stale Data as Current&lt;br&gt;
What happens: Data from an earlier period is presented as the latest, or forecasts are presented as actual results.&lt;br&gt;
A report in 2026 cites "2025 revenue" from a source published in February 2025 — months before the fiscal year ended. That's an estimate, not a filing.&lt;br&gt;
Or worse: 2023 data presented as "the latest available" when 2024 filings are already public.&lt;/p&gt;
&lt;p&gt;How to catch it: Check the source date against the period it describes. If a source reports results for a period that had not yet ended, or could not yet have been filed, when it was published, the figures are estimates.&lt;/p&gt;
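&lt;p&gt;The date comparison is simple enough to sketch in Python. The function name, the &lt;code&gt;filing_lag_days&lt;/code&gt; parameter, and the exact dates are assumptions for illustration; real filing deadlines vary by company and jurisdiction.&lt;/p&gt;

```python
from datetime import date, timedelta


def is_estimate(published, period_end, filing_lag_days=0):
    """True when the source predates the earliest possible actual results.

    filing_lag_days is an assumed gap between period end and first filing.
    """
    earliest_actual = period_end + timedelta(days=filing_lag_days)
    return earliest_actual > published


# A source from February 2025 describing fiscal year 2025
# (a calendar fiscal year and the exact day are assumed).
print(is_estimate(date(2025, 2, 1), date(2025, 12, 31)))  # True
```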

&lt;p&gt;Failure Mode #5: Attribution Laundering&lt;br&gt;
What happens: A fact found in a media article is cited as if it came from an official filing.&lt;br&gt;
The report says "per SEC filings" but the actual source is a TechCrunch summary that itself cited a second-hand analyst note. Two levels of telephone game.&lt;br&gt;
Or: a company press release cited as "industry data." Press releases are company statements, not independent verification.&lt;/p&gt;
&lt;p&gt;How to catch it: Trace every claim to its earliest cited source. Primary = official filing/dataset. Secondary = analyst report. Tertiary = media article. A figure appearing only in media is unverified.&lt;/p&gt;
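&lt;p&gt;One way to make the tiers concrete is a small Python sketch like the one below. The &lt;code&gt;Tier&lt;/code&gt; enum and both helper functions are hypothetical names chosen for this example, and classifying each link in the chain is still a manual step.&lt;/p&gt;

```python
from enum import IntEnum


class Tier(IntEnum):
    PRIMARY = 1    # official filing or dataset
    SECONDARY = 2  # analyst report
    TERTIARY = 3   # media article


def best_tier(chain):
    """Return the strongest tier actually present in a citation chain."""
    return min(chain)


def is_laundered(claimed, chain):
    """True when the report claims a stronger tier than the chain supports."""
    return best_tier(chain) > claimed


# The report says "per SEC filings", but the chain is
# a media summary that cites an analyst note.
chain = [Tier.TERTIARY, Tier.SECONDARY]
print(is_laundered(Tier.PRIMARY, chain))  # True
```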

&lt;p&gt;I Built a Tool to Catch These&lt;br&gt;
After seeing these patterns repeat, I built EFC (Everything Fact-Checked) — a fact-checking tool that knows these 5 failure modes and checks for them systematically.&lt;br&gt;
It's available in three formats:&lt;br&gt;
CLI (&lt;code&gt;efc&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;everything-fact-checked

&lt;span class="c"&gt;# Full audit&lt;/span&gt;
efc audit report.md

&lt;span class="c"&gt;# Verify source content (fetches URLs, checks if claims appear)&lt;/span&gt;
efc verify evidence.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub Action&lt;br&gt;
Auto fact-check markdown reports in every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Nlai741533/EFC-Plugin@v0.2.2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standalone SKILL.md (for any AI agent)&lt;br&gt;
One Markdown file, zero dependencies. Drop it in your agent's skill directory and it gets a structured 6-step fact-check workflow. Works with Claude, Cursor, Pi, or any agent.&lt;/p&gt;

&lt;p&gt;Example output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;efc audit report.md
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;# Audit: report.md&lt;/span&gt;
&lt;span class="go"&gt;Claims found:   18 (P0: 8, P1: 2)
Source URLs:    1 ok, 2 broken
  ❌ [not_found] 404 https://...
  ❌ [unreachable] ERR https://...
Reliability: Low
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;efc verify evidence.json
&lt;span class="go"&gt;✅ C002: found — Source contains 5 key terms from claim
🔌 C003: fetch_failed — source unreachable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The Meta Part&lt;br&gt;
The first version of this tool shipped with a hallucinated install command — a &lt;code&gt;claude skill add&lt;/code&gt; command that doesn't exist. The AI made it up with complete confidence.&lt;br&gt;
That's literally the failure mode this tool exists to catch.&lt;/p&gt;

&lt;p&gt;Now the repo fact-checks itself before every release (see FACTCHECK.md).&lt;/p&gt;

&lt;p&gt;Links:&lt;br&gt;
Full repo (CLI + Action + Claude plugin): EFC-Plugin&lt;br&gt;
Standalone skill (one file, any agent): EFC-Standalone&lt;br&gt;
Both are MIT licensed, stdlib-only Python (no dependencies), 72 tests.&lt;br&gt;
If you've seen other systematic failure modes in AI research output, I'd love to hear about them in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
