<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GEM² Inc.</title>
    <description>The latest articles on DEV Community by GEM² Inc. (@gemsquared).</description>
    <link>https://dev.to/gemsquared</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg</url>
      <title>DEV Community: GEM² Inc.</title>
      <link>https://dev.to/gemsquared</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gemsquared"/>
    <language>en</language>
    <item>
      <title>Same Prompt. Different Answers Every Time. Here's How I Fixed It.</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Fri, 03 Apr 2026 00:12:20 +0000</pubDate>
      <link>https://dev.to/gemsquared/same-prompt-different-answers-every-time-heres-how-i-fixed-it-1ce1</link>
      <guid>https://dev.to/gemsquared/same-prompt-different-answers-every-time-heres-how-i-fixed-it-1ce1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of our AI verification series.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Part 1: Three AIs analyzed our product. None passed the truth filter →&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j"&gt;Part 2: Human in the loop doesn't scale. Human at the edge does. →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Same prompt. Same AI. Different sessions. Different outputs.
&lt;/h2&gt;

&lt;p&gt;Part 1 showed three &lt;em&gt;different&lt;/em&gt; AIs diverging on the same question.&lt;/p&gt;

&lt;p&gt;That's expected. Different training, different weights, different answers.&lt;/p&gt;

&lt;p&gt;But we didn't stop there. We re-ran the same AI on the same prompt in a new session.&lt;/p&gt;

&lt;p&gt;We got materially different outputs again.&lt;/p&gt;

&lt;p&gt;Both looked authoritative. Neither warned us they disagreed with each other.&lt;/p&gt;


&lt;h2&gt;
  
  
  What the same AI said twice
&lt;/h2&gt;

&lt;p&gt;Prompt: &lt;em&gt;"Forecast Korea's AI industry in 2027."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Session 1 produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Market size: &lt;strong&gt;$10–15B at &amp;gt;25% CAGR&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Global positioning: &lt;strong&gt;"Global AI G3 powerhouse"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Hardware claim: &lt;strong&gt;"All Korean electronics AI-native by 2027"&lt;/strong&gt; — sourced to a single company's roadmap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Session 2 produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Market size: &lt;strong&gt;KRW 4.46T (~$3.3B) at 14.3% CAGR&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Global positioning: &lt;strong&gt;"Top three AI powers"&lt;/strong&gt; — framed as government target&lt;/li&gt;
&lt;li&gt;No hardware claim at all&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same prompt. Same AI. Different session. &lt;strong&gt;A 4× market size gap. No flags from either run.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a hallucination. Both outputs were internally coherent. Both read like credible analyst reports. The problem is deeper than hallucination.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this happens: AI inference is non-deterministic
&lt;/h2&gt;

&lt;p&gt;We spent months trying to fix output drift with better prompts, more context, stricter instructions.&lt;/p&gt;

&lt;p&gt;It didn't work.&lt;/p&gt;

&lt;p&gt;Because the issue isn't the prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI is optimized to sound right.
Not to prove itself.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we call "hallucination" is mostly &lt;strong&gt;context drift&lt;/strong&gt; — the model's plausibility engine filling gaps differently depending on what's salient in a given session. Different day, different sampling, different emphasis in the context window — different output. Same confidence posture throughout.&lt;/p&gt;

&lt;p&gt;You can't prompt your way out of a non-deterministic system. You need verification as a separate step.&lt;/p&gt;




&lt;h2&gt;
  
  
  The truth filter didn't just score. It fingerprinted.
&lt;/h2&gt;

&lt;p&gt;We ran both sessions through &lt;code&gt;gem2_truth_filter&lt;/code&gt; — not to get a number, but to understand &lt;em&gt;why&lt;/em&gt; the outputs diverged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session 1 (avg 35%):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Key violation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;24%&lt;/td&gt;
&lt;td&gt;L→G: "Global AI G3 — no index cited"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;td&gt;Δe→∫de: single company → industry-wide claim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;59%&lt;/td&gt;
&lt;td&gt;S→T: current AI strength = permanent identity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Session 2 (avg 43%):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Key violation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;td&gt;S→T: past-tense framing of future events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;32%&lt;/td&gt;
&lt;td&gt;Source attribution FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;td&gt;Scope mixing — 2033 CAGR back-extrapolated to 2027&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure types were different. Session 1 overclaimed about Korea's global position. Session 2 failed on temporal framing and citations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same prompt. Different inference paths. Different failure signatures.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the key finding: &lt;strong&gt;AI output drift is not random. It's traceable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The filter names the exact reasoning pattern that produced the problem. L→G (local to global), S→T (snapshot to trend), Δe→∫de (thin evidence to broad claim). Named patterns mean auditable drift. Auditable drift means fixable systems.&lt;/p&gt;
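&lt;p&gt;&lt;em&gt;As a sketch, those failure signatures can be treated as data. The pattern codes are from this post; the data shapes and function below are illustrative, not part of the gem2 tooling:&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical sketch: pattern codes come from the post; the data shapes
# and function are illustrative, not the gem2 API.
DRIFT_PATTERNS = {
    "L→G": "local to global: one data point generalized into a universal claim",
    "S→T": "snapshot to trend: a current state projected as permanent",
    "Δe→∫de": "thin evidence to broad claim: conclusion exceeds its basis",
}

def drift_signature(run_a, run_b):
    """Return which violation codes are unique to each run and which are
    shared: the 'failure signature' of the same prompt across sessions."""
    a, b = set(run_a), set(run_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a.intersection(b)}

session_1 = ["L→G", "Δe→∫de", "S→T"]             # overclaimed global position
session_2 = ["S→T", "source_fail", "scope_mix"]  # temporal/citation failures
sig = drift_signature(session_1, session_2)
print(sig["shared"])  # the overlap both inference paths produced
```

&lt;p&gt;&lt;em&gt;Two runs sharing one code but diverging on the rest is exactly the traceable, auditable drift described above.&lt;/em&gt;&lt;/p&gt;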

&lt;p&gt;&lt;em&gt;(Note: Korea AI forecasting is a harder grounding task than product analysis — fewer citable sources, more projection-dependent claims. That's why baseline scores here are lower than the results in Part 1. Same tool, same logic — harder domain.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  We stopped trying to fix the output. We fixed the conditions.
&lt;/h2&gt;

&lt;p&gt;This is the shift Part 2 described philosophically. Here's what it looks like in practice.&lt;/p&gt;

&lt;p&gt;We didn't rewrite the prompt ourselves. We asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Create a grounded replacement contract prompt using gem2 tools."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One command. The system generated a formal contract — input/output types, invariants, prohibited patterns, confidence requirements. We reviewed it. We approved it. Then we ran the same AI with the contract enforced.&lt;/p&gt;
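&lt;p&gt;&lt;em&gt;The generated contract itself isn't reproduced here, so the field names below are assumptions: a minimal sketch of a contract carrying invariants and prohibited patterns, plus a naive check against it:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative only: the post doesn't show the generated contract, so these
# field names and the check are assumptions about what one might contain.
contract = {
    "task": "Forecast Korea's AI industry in 2027",
    "output_type": "narrative forecast with per-claim sourcing",
    "invariants": [
        "every quantitative claim names its source",
        "future events use hedged, future-tense framing",
    ],
    "prohibited_patterns": ["L→G", "S→T", "Δe→∫de"],
}

def violates_contract(claim):
    """Naive audit: a claim fails on a prohibited pattern or a missing source."""
    if claim.get("pattern") in contract["prohibited_patterns"]:
        return True
    return not claim.get("source")

ok = {"text": "Market size KRW 4.46T", "source": "named market report", "pattern": None}
print(violates_contract(ok))  # → False: sourced, no prohibited pattern
```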

&lt;p&gt;&lt;strong&gt;Session 2, contract-compliant (R2):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;+38 points. Same AI. Same question. Different structural constraints.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The contract doesn't make the AI smarter. It makes the AI's output auditable against a defined standard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Then the human intervened. Once.
&lt;/h2&gt;

&lt;p&gt;81% — but the output read like a legal document. Every claim cited, scoped, hedged. Epistemically reliable. Practically unreadable.&lt;/p&gt;

&lt;p&gt;One instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Soften the tone. Don't reintroduce any claims the truth filter removed."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Session 2, softened (R3):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Down 6 points. More readable. Still grounded.&lt;/p&gt;

&lt;p&gt;We chose 75%. Not because it's better than 81%. Because &lt;strong&gt;75% is the right trade-off&lt;/strong&gt; — readable enough to share, grounded enough to trust. We submitted 75% to gem2 calibration as our standard for narrative AI forecasts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human reads the audit.
Human decides the trade-off.
Human defines the standard.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not reviewing every line. Not trusting blindly. &lt;strong&gt;Deciding at the right moment.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the full arc looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session 1 (no filter)   →  35% avg
Session 2 (no filter)   →  43% avg
Contract applied (R2)   →  81% avg
Human softened (R3)     →  75% avg  ← our standard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Truth is not the score.
Truth is the pattern of drift.
You define the standard.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The workflow: AI audits AI
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human asks  →  AI executes
AI verifies AI  →  AI fixes AI
Human decides at the edge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verification layer — &lt;code&gt;gem2_truth_filter&lt;/code&gt;, &lt;code&gt;tpmn_contract_writer&lt;/code&gt;, the composer — runs between generation and delivery. The human sees the audit result, decides the acceptable trade-off, sets the calibration standard.&lt;/p&gt;

&lt;p&gt;Human-in-the-loop means the human is the bottleneck — every output passes through before it ships. That doesn't scale. Human-at-the-edge means you define "acceptable" once, and the system enforces it automatically. You intervene only when a genuine judgment call is required — like choosing 75% over 81%.&lt;/p&gt;
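&lt;p&gt;&lt;em&gt;The human-at-the-edge gate this implies fits in a few lines. The threshold and names are illustrative, not gem2's actual interface:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the gate described above; the threshold and names are
# illustrative, not gem2's actual interface.
ACCEPTED_STANDARD = 75  # defined once by a human (the 75-over-81 trade-off)

def gate(output_id, truth_score):
    """Ship automatically when the score meets the human-chosen standard;
    escalate only the exceptions."""
    if truth_score >= ACCEPTED_STANDARD:
        return "ship"
    return "human_review"

print(gate("r3-softened", 75))    # → ship
print(gate("s1-unfiltered", 35))  # → human_review
```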




&lt;h2&gt;
  
  
  TPMN is not a checker
&lt;/h2&gt;

&lt;p&gt;TPMN is not a validator, a linter, or a hallucination detector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TPMN is an epistemic gauge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows what's grounded, what's inferred, what's extrapolated. It fingerprints &lt;em&gt;why&lt;/em&gt; outputs differ across sessions. It generates the contracts that stabilize structure. It collects human calibration signals and turns them into a standard.&lt;/p&gt;

&lt;p&gt;It doesn't decide. &lt;strong&gt;You do.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're calling the full suite &lt;strong&gt;GEM2 Epistemic Studio&lt;/strong&gt; — 15 tools across four functional groups: analysis, contract authoring, calibration, and execution. TPMN Checker is one group inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it on your own output
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Paste any AI output into your conversation.&lt;/li&gt;
&lt;li&gt;Ask: &lt;em&gt;"Verify this by gem2 truth filter."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Read the score. See what's grounded vs extrapolated.&lt;/li&gt;
&lt;li&gt;Ask: &lt;em&gt;"Create a grounded replacement prompt using gem2 contract writer."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Run it again. Watch the difference.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your AI picks the right tool from 15 available MCP tools automatically. No configuration. No TPMN knowledge required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal isn't a higher score. It's a score you understand and a standard you chose.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Try it free at gemsquared.ai&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes after prompting
&lt;/h2&gt;

&lt;p&gt;The industry is still in the prompting era. Better prompts, longer context, chain-of-thought — all useful, all insufficient.&lt;/p&gt;

&lt;p&gt;The next step isn't better prompting. It's verification as infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI generates.
AI verifies.
AI refines.
Human decides at the edge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We didn't make AI smarter. We made it accountable.&lt;/p&gt;

&lt;p&gt;That's measurable: 35% → 75% on the same task, with the same AI, using nothing but a formal contract and one human judgment call.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;GEM2 Epistemic Studio — 15 tools, 6 domains, 3 providers. Free to start.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://gemsquared.ai/about" rel="noopener noreferrer"&gt;Inseok Seo (David)&lt;/a&gt; — GEM²-AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai" rel="noopener noreferrer"&gt;gemsquared.ai&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN-PSL Specification&lt;/a&gt; (open, CC-BY 4.0)&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Part 1: Three AIs analyzed our product&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j"&gt;Part 2: Human at the edge&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Human in the loop doesn't scale. Human at the edge does.</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Mon, 30 Mar 2026 06:23:27 +0000</pubDate>
      <link>https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j</link>
      <guid>https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of our AI verification series. &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Part 1: We truth-filtered our own AI research →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  AI is not unreliable. AI has a plausibility complex.
&lt;/h2&gt;

&lt;p&gt;Stop blaming AI for hallucinating. Start asking why it happens.&lt;/p&gt;

&lt;p&gt;AI doesn't fail because it's wrong. &lt;strong&gt;In our experience, it fails because it's optimized to sound right.&lt;/strong&gt; Major LLMs are trained to produce responses that satisfy humans — fluent, confident, structured. That's plausibility. It's not the same as honesty.&lt;/p&gt;

&lt;p&gt;We call this the &lt;strong&gt;plausibility complex&lt;/strong&gt;: the tendency we've observed across Claude, ChatGPT, and Gemini to produce answers that satisfy rather than answers that prove themselves. If you want AI to become a reliable engineering partner, you need to free AI from this complex — not by changing how it generates, but by changing how it's held accountable.&lt;/p&gt;

&lt;p&gt;After 20 months of building production systems with AI — shipping real code, generating real reports, running real analysis through Claude, ChatGPT, and Gemini — we've arrived at one conclusion:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI often knows more than it reveals. But it's optimized to produce plausible answers, even when the evidence is weak.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLMs we've worked with — Claude, ChatGPT, and Gemini — all exhibit this plausibility bias, producing confident responses even when the evidence is thin or absent. Ask for a market analysis and you get precise numbers. Ask for a forecast and you get confident projections. Ask for a technical assessment and you get authoritative claims.&lt;/p&gt;

&lt;p&gt;The output looks right. Reads right. Feels right.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;our experiment&lt;/a&gt;, three AI providers wrote research reports about our own product. All three scored above 0.70 on logical consistency. All three scored below 0.30 on source attribution. &lt;strong&gt;The reasoning was coherent. The evidence was missing.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Hallucination is not a bug to fix
&lt;/h2&gt;

&lt;p&gt;The industry treats hallucination as a defect — something to patch, filter, or suppress. We see it differently.&lt;/p&gt;

&lt;p&gt;In our experience building long-running AI development workflows, the pattern that causes the most damage isn't random fabrication. It's &lt;strong&gt;context drift&lt;/strong&gt; — what happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long context windows accumulate similar topics in different framings&lt;/li&gt;
&lt;li&gt;Cross-session persistence forces repeated summarization, losing nuance each time&lt;/li&gt;
&lt;li&gt;Dense context makes adjacent-but-different concepts blur together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've tried every mitigation: RAG, CLAUDE.md configuration files, context caching, careful prompt engineering. Each helps. None solves it completely.&lt;/p&gt;

&lt;p&gt;Why? Because we can't control what happens inside the model's reasoning process. We can shape the input. We can evaluate the output. But the inference itself is opaque.&lt;/p&gt;

&lt;p&gt;This isn't a criticism — it's an observation. And it led us to a different question.&lt;/p&gt;




&lt;h2&gt;
  
  
  What if AI could flag its own uncertainty?
&lt;/h2&gt;

&lt;p&gt;Here's what we discovered through months of experimentation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When we explicitly asked AI to concentrate on epistemic reasoning — to classify each claim as grounded, inferred, or extrapolated — it did.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not perfectly. Not consistently across sessions. But measurably better than when we didn't ask.&lt;/p&gt;

&lt;p&gt;The evidence from &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;our dogfooding experiment&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Without epistemic constraints&lt;/th&gt;
&lt;th&gt;With TPMN-grounded prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;18% truth score&lt;/td&gt;
&lt;td&gt;77% truth score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;28% truth score&lt;/td&gt;
&lt;td&gt;~48% truth score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;12% truth score&lt;/td&gt;
&lt;td&gt;~35% truth score&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same task. Same providers. The only difference: a formal specification that told the AI to tag its own confidence level and flag claims it couldn't trace to evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI didn't become smarter. It became more honest about what it didn't know.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's what freeing AI from the plausibility complex looks like in practice: not changing the model, but giving it a formal reason to be honest.&lt;/p&gt;




&lt;h2&gt;
  
  
  But here's the catch: same AI, same session, limited honesty
&lt;/h2&gt;

&lt;p&gt;An AI that generates an answer and then critiques that answer in the same session has a structural problem: it's trained to be plausible. Asking it to undermine its own plausibility is asking it to work against its training signal.&lt;/p&gt;

&lt;p&gt;We observed this directly. When we asked AI to generate a report AND verify it in the same conversation, the verification was consistently softer than when a separate AI session performed the audit.&lt;/p&gt;

&lt;p&gt;This is why &lt;strong&gt;TPMN Checker is a separate service, not a prompt technique.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompting tries to change AI's behavior. Verification changes AI's accountability. Different problem, different solution.&lt;/p&gt;

&lt;p&gt;The checker runs as an isolated &lt;a href="https://gemsquared.ai/platform" rel="noopener noreferrer"&gt;Sovereign AI Service&lt;/a&gt; — a dedicated AI agent with one job: audit other AI output against a formal specification. It doesn't know what the original AI "intended." It only sees the output and the contract. It judges the result, not the process.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Kantian insight
&lt;/h2&gt;

&lt;p&gt;We can't see inside the model. We don't know which weights fired, which attention heads activated, which training examples influenced a particular token. Even the service providers — Anthropic, OpenAI, Google — face this challenge with their own models.&lt;/p&gt;

&lt;p&gt;But we don't need to see inside.&lt;/p&gt;

&lt;p&gt;We can judge the output. We can compare claims against evidence. We can detect when reasoning exceeds its basis. We can flag patterns that indicate drift.&lt;/p&gt;

&lt;p&gt;This is what philosophers call the phenomenal approach: &lt;strong&gt;judge what appears, not what causes it.&lt;/strong&gt; We can't read AI's mind. But we can read its work. And we can hold it to a standard.&lt;/p&gt;

&lt;p&gt;That standard is &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN&lt;/a&gt; — a notation with three prohibited reasoning patterns and seven evaluation dimensions. Not a guess about what the model "should" do. A formal specification of what the output must demonstrate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Human at the edge, not in the loop
&lt;/h2&gt;

&lt;p&gt;If AI is becoming an agent — not just a tool that responds, but a system that acts — then we need an accountability structure that matches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human in the loop&lt;/strong&gt; means: review every output. Approve every action. The human is the bottleneck.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI generates → Human reviews → Human approves → Output ships
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked when AI outputs were occasional. It doesn't work when AI agents produce hundreds of outputs per day. The math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;200 outputs/day × 3 minutes each = 10 hours of review per agent&lt;/li&gt;
&lt;li&gt;10 agents = 100 review-hours a day, roughly a dozen full-time reviewers&lt;/li&gt;
&lt;li&gt;50 agents = your "safety net" costs more than the automation saves&lt;/li&gt;
&lt;/ul&gt;
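&lt;p&gt;&lt;em&gt;The per-agent arithmetic, spelled out:&lt;/em&gt;&lt;/p&gt;

```python
# The review-cost arithmetic from the list above, spelled out.
outputs_per_day = 200
minutes_per_review = 3

hours_per_agent = outputs_per_day * minutes_per_review / 60
print(hours_per_agent)  # → 10.0 hours of review per agent, per day

# Ten agents means 100 review-hours a day before anyone does real work.
print(10 * hours_per_agent)  # → 100.0
```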

&lt;p&gt;&lt;strong&gt;Human at the edge&lt;/strong&gt; means: define the standard. Let AI enforce it. Review exceptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI generates → AI verifies (TPMN) → Passes? → Ships
                                   → Fails?  → Human reviews
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The human doesn't disappear. The human moves to where they're most effective: &lt;strong&gt;defining what "honest reasoning" looks like, not reading every report.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  This pattern already exists
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Software engineering:&lt;/strong&gt; Code passes through automated tests that humans defined. CI/CD enforces at scale. Humans review when tests fail. &lt;em&gt;But what about AI-generated code itself — before it reaches the test suite?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial compliance:&lt;/strong&gt; Transactions pass through compliance rules that humans wrote. Automated systems flag exceptions. Humans investigate the flags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manufacturing:&lt;/strong&gt; Quality control systems catch defects using standards that humans set. Humans review edge cases and update standards.&lt;/p&gt;

&lt;p&gt;AI output is the next domain where this pattern applies. And for developers specifically, there's an emerging practice pattern that makes this concrete — we'll get to that shortly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The three requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. A formal specification
&lt;/h3&gt;

&lt;p&gt;Not heuristics. Not "does this look right?" A structured notation and grammar for what constitutes honest reasoning.&lt;/p&gt;

&lt;p&gt;Three layers, one verification stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TPMN&lt;/strong&gt; (Truth-Provenance Markup Notation) — the &lt;strong&gt;notation&lt;/strong&gt;. Defines five epistemic claim states (⊢ ⊨ ⊬ ⊥ ?) and three prohibited reasoning patterns (&lt;a href="https://tpmn-psl.gemsquared.ai/#spt" rel="noopener noreferrer"&gt;SPT&lt;/a&gt;: snapshot→trend, local→global, thin→broad). &lt;em&gt;What we mark.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TPMN-PSL&lt;/strong&gt; (Prompt Specification Language) — the &lt;strong&gt;grammar&lt;/strong&gt;. Compiles natural language prompts into verifiable specifications (MANDATEs). Defines the three-phase protocol (pre-flight, inline, post-flight) and three modes (strict, refine, interpolate). &lt;em&gt;How we structure and verify.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TPMN Checker&lt;/strong&gt; — the &lt;strong&gt;implementation&lt;/strong&gt;. A &lt;a href="https://gemsquared.ai/platform" rel="noopener noreferrer"&gt;Sovereign AI Service&lt;/a&gt; that runs the TPMN-PSL pipeline. 12 MCP tools. 6 domains. Returns a truth_score. &lt;em&gt;What you install and use.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analogous to HTTP (notation) → RFC 2616 (specification) → nginx (implementation). TPMN defines the rules. TPMN-PSL structures the protocol. The Checker enforces them.&lt;/p&gt;

&lt;p&gt;Open. CC-BY 4.0. Anyone can implement it.&lt;/p&gt;
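&lt;p&gt;&lt;em&gt;For illustration, here is one plausible reading of the five claim states as a tagging scheme. The symbol semantics below are assumptions, not quoted from the TPMN-PSL spec:&lt;/em&gt;&lt;/p&gt;

```python
# Illustration only: these readings of the five TPMN claim-state symbols
# are assumptions; the authoritative definitions live in the TPMN-PSL spec.
CLAIM_STATES = {
    "⊢": "grounded: directly supported by cited evidence",
    "⊨": "inferred: follows from grounded claims",
    "⊬": "unproven: cannot be traced to evidence",
    "⊥": "contradicted: conflicts with available evidence",
    "?": "unassessed: confidence not yet determined",
}

def summarize(claims):
    """Count tagged claims per state: a crude 'epistemic gauge'."""
    counts = {state: 0 for state in CLAIM_STATES}
    for _text, state in claims:
        counts[state] += 1
    return counts

report = [
    ("Market size KRW 4.46T", "⊢"),       # sourced figure
    ("Top-three AI power by 2027", "⊬"),  # no index cited
    ("CAGR implies steady growth", "⊨"),  # derived from the sourced figure
]
print(summarize(report))
```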

&lt;h3&gt;
  
  
  2. An isolated verification agent
&lt;/h3&gt;

&lt;p&gt;Not a prompt. Not an inline check. A separate Sovereign AI Service whose only job is auditing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;TPMN Checker&lt;/a&gt; is the reference implementation of TPMN-PSL. It runs as an isolated MCP service — 12 tools, 6 domains, 7 evaluation dimensions. It judges output against contracts. It doesn't generate, advise, or assist. It audits.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Human calibration
&lt;/h3&gt;

&lt;p&gt;If AI grades AI, the grading is circular. The system needs an external standard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gemsquared.ai/community" rel="noopener noreferrer"&gt;Human Ground Truth&lt;/a&gt;. When users disagree with a score, that disagreement becomes calibration data. Humans define what "honest reasoning" means. AI enforces it at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dogfooding: we verified the thesis behind this article
&lt;/h2&gt;

&lt;p&gt;Before writing this post, we wrote down our raw thesis — the unfiltered thinking that drives everything above. Here's the core of it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"All top-level AIs are trained to generate plausible results to satisfy humans. Hallucination is not a bug — it's a structural consequence of context drift. AI itself knows all the decision weights clearly. If we could make AI remind itself of the legitimate MANDATE area, AI could detect and fix results by itself. We validated this through various heuristic experiments over 20 months. No absolute truth score is possible. Human in the loop is nonsense."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then we ran it through &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;gem2_truth_filter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw thesis: 18%.&lt;/strong&gt; Our own tool scored our own thinking at the same level as unverified AI output. It caught three overclaims:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L→G:&lt;/strong&gt; "All AIs are trained for plausibility" → universal claim without citing training documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S→T:&lt;/strong&gt; "Hallucination is structural" → presented as permanent truth without distinguishing error types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Δe→∫de:&lt;/strong&gt; "Validated through experiments" → claimed validation without methodology or data&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;strong&gt;Cross-provider verification of the raw thesis:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;td&gt;0.08 ❌&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.18 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.68 ⚠️&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.22 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extrapolation Risk&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three providers. All failed it. OpenAI was the harshest — 13% with 10 SPT violations. Gemini flagged 95% extrapolation risk.&lt;/p&gt;

&lt;p&gt;We fixed each overclaim. Scoped the claims. Added evidence. Qualified the assertions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-provider verification of the fixed version:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;90%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.28 ❌&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.60 ⚠️&lt;/td&gt;
&lt;td&gt;0.58 ⚠️&lt;/td&gt;
&lt;td&gt;0.95 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;td&gt;0.82 ✅&lt;/td&gt;
&lt;td&gt;0.95 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;td&gt;0.47 ⚠️&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three providers. Three different scores. &lt;strong&gt;But all three agree: the fixed version is dramatically better.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini — the harshest critic of our raw thesis (95% extrapolation risk) — scored the refined version at 90%. Its explanation: &lt;em&gt;"This content demonstrates excellent epistemic hygiene. The author explicitly bounds their claims to their own experience."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The scores differ. The diagnostic direction converges. That's cross-provider consensus in action.&lt;/p&gt;

&lt;p&gt;Our raw thesis overclaimed — just like every unverified AI output. The tool caught it. We fixed it. This article is the refined version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the loop: write → verify → fix → cross-verify → publish.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it on your own output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Paste any AI output into your conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Ask: &lt;em&gt;"Verify this by gem2 truth filter."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Read the score. See what's grounded, what's extrapolated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Ask: &lt;em&gt;"Create a grounded replacement prompt using gem2 contract writer."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5.&lt;/strong&gt; Ask AI to proceed with the new prompt. Watch what you get.&lt;/p&gt;

&lt;p&gt;Your AI picks the right tool from &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;12 available MCP tools&lt;/a&gt; automatically.&lt;/p&gt;
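&lt;p&gt;The five steps form a simple loop that can be sketched in code. Everything below is a hypothetical illustration: &lt;code&gt;run_truth_filter&lt;/code&gt; is a stand-in stub for the real MCP tool call, the regeneration callback stands in for the contract-writer step, and the 0.70 pass threshold is assumed, not taken from the spec.&lt;/p&gt;

```python
# Hypothetical sketch of the write -> verify -> fix -> re-verify loop.
# run_truth_filter is a toy stub, NOT the gem2 API: it just rewards
# claims that carry an explicit "(source: ...)" attribution.

PASS_THRESHOLD = 0.70  # assumed cutoff, not from TPMN-PSL

def run_truth_filter(text):
    """Stub scorer: fraction of sentences carrying an explicit source."""
    sourced = text.count("(source:")
    total = max(text.count("."), 1)
    return min(1.0, 0.2 + 0.8 * sourced / total)

def verify_loop(draft, regenerate, max_rounds=3):
    """Re-verify until the draft clears the threshold or rounds run out."""
    for _ in range(max_rounds):
        score = run_truth_filter(draft)
        if score >= PASS_THRESHOLD:
            return draft, score
        draft = regenerate(draft)  # step 4/5: grounded replacement
    return draft, run_truth_filter(draft)

add_source = lambda d: d.replace(".", " (source: example).", 1)
final, score = verify_loop("Market will triple.", add_source)
print(round(score, 2))  # clears the assumed threshold after one fix
```

&lt;p&gt;The point of the sketch is the shape, not the scoring: verification is a separate gate, and regeneration only happens when the gate fails.&lt;/p&gt;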

&lt;p&gt;&lt;strong&gt;Try it for free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Get started at gemsquared.ai&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next: Contract Coding
&lt;/h2&gt;

&lt;p&gt;If "human at the edge" is the philosophy, what does it look like in practice — for developers writing code every day?&lt;/p&gt;

&lt;p&gt;Three common patterns in AI-assisted coding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt coding   → you guide the model
Vibe coding     → you hope it works
Contract coding → AI defines the spec, AI verifies the output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
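&lt;p&gt;The third pattern can be made concrete with a toy sketch: the spec lives as explicit data, and every output is checked against it before shipping. The field names and rules here are illustrative assumptions, not the &lt;code&gt;tpmn_contract_writer&lt;/code&gt; format.&lt;/p&gt;

```python
# Minimal "contract coding" sketch: a declared spec plus a checker.
# CONTRACT's fields and rules are invented for illustration only.

CONTRACT = {
    "required_keys": {"value", "source", "as_of"},
    "forbidden_phrases": ["one report", "studies show"],
}

def meets_contract(claim, contract=CONTRACT):
    """Return (ok, reasons): does a claim satisfy the declared spec?"""
    reasons = []
    missing = contract["required_keys"] - claim.keys()
    if missing:
        reasons.append(f"missing fields: {sorted(missing)}")
    text = str(claim.get("value", "")).lower()
    for phrase in contract["forbidden_phrases"]:
        if phrase in text:
            reasons.append(f"vague attribution: {phrase!r}")
    return (not reasons), reasons

ok, why = meets_contract({"value": "TAM is $0.45B per one report"})
print(ok, why)  # fails: missing fields and a vague attribution
```

&lt;p&gt;The design choice that matters: the contract is data, so the same checker runs identically on every output, which is what distinguishes this from prompt coding.&lt;/p&gt;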



&lt;p&gt;In our next post, we'll show how TPMN Checker's existing tools — &lt;code&gt;tpmn_contract_writer&lt;/code&gt;, &lt;code&gt;tpmn_p_check&lt;/code&gt; (SDLC domain), and &lt;code&gt;tpmn_p_check_compose&lt;/code&gt; — already support a workflow where AI generates formal specifications, produces code against them, and truth-filters the result before you ship.&lt;/p&gt;

&lt;p&gt;Not for plausibility. For epistemic traceability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series: "Contract Coding at the Edge: what comes after vibe coding" →&lt;/strong&gt; &lt;em&gt;(coming this week)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;📺 &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;Watch: Three AIs. Three Answers. None of them warned you.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📝 &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Read Post 1: We truth-filtered our own AI research&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN-PSL Specification&lt;/a&gt; (open, CC-BY 4.0)&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://gemsquared.ai" rel="noopener noreferrer"&gt;gemsquared.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TPMN-PSL is an open specification — not a product.&lt;/strong&gt; If you believe AI outputs should be auditable, &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;read the spec&lt;/a&gt;, open an issue, or submit a PR. The standard gets better when more people challenge it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>testing</category>
    </item>
    <item>
      <title>Three AIs analyzed our product. None passed the truth filter.</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Sat, 28 Mar 2026 15:59:12 +0000</pubDate>
      <link>https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki</link>
      <guid>https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki</guid>
      <description>&lt;p&gt;&lt;strong&gt;What's hiding in your AI output? Now you can see it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We asked three AI providers to research our own product.&lt;br&gt;
Then we ran every output through our own truth filter.&lt;br&gt;
The results surprised us.&lt;/p&gt;

&lt;p&gt;📺 &lt;strong&gt;See how the truth filter works in practice:&lt;/strong&gt; &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;Three AIs. Three Answers. None of them warned you.&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"TPMN Checker is not scoring writing quality. It is scoring epistemic traceability."&lt;/em&gt; — from the video at [0:40]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Korea AI 2027 forecast — what the three AIs reported
&lt;/h3&gt;

&lt;p&gt;We asked each provider the same question: &lt;em&gt;"Forecast Korea's AI industry for 2027."&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Market size (2027E)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₩4.46T (≈$3.3B)&lt;/td&gt;
&lt;td&gt;₩4.46T (≈$3.3B)&lt;/td&gt;
&lt;td&gt;$10–15B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CAGR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14.3%&lt;/td&gt;
&lt;td&gt;~14%&lt;/td&gt;
&lt;td&gt;&amp;gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gov't AI investment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$71.5B&lt;/td&gt;
&lt;td&gt;Ongoing ⚠️&lt;/td&gt;
&lt;td&gt;$7B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data-heavy, source-cited&lt;/td&gt;
&lt;td&gt;Balanced, explicitly hedged&lt;/td&gt;
&lt;td&gt;Bullish, narrative-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three reports. All confident. Two even agree on the headline number. But agreeing on the answer doesn't mean agreeing on the truth.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Truth scores are not absolute judgments. They reflect the epistemic traceability ratio at the moment of evaluation — how much of the reasoning can be traced to evidence. That's why we're building the calibration standard together with users. &lt;a href="https://gemsquared.ai/community" rel="noopener noreferrer"&gt;Learn more about Human Ground Truth.&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Verification result by GEM² truth filter
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.15 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.75 ⚠️&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.60 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.65 ⚠️&lt;/td&gt;
&lt;td&gt;0.40 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extrapolation Risk&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same question. Same filter. Three different levels of honesty.&lt;/p&gt;


&lt;h2&gt;
  
  
  The dogfooding experiment
&lt;/h2&gt;

&lt;p&gt;We build &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;TPMN Checker&lt;/a&gt; — a truth filter for AI reasoning. To prove the tool works, we pointed it at ourselves. Five rounds. Same task. Measurable improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; "Write a comprehensive technical and market analysis of GEM²-AI and its TPMN Checker product."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Providers:&lt;/strong&gt; Claude (Anthropic), ChatGPT (OpenAI), Gemini (Google)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation:&lt;/strong&gt; Each output scored by &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;gem2_truth_filter&lt;/a&gt; across seven dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;Claims with no traceable evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;Thin or outdated supporting data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;Assertions presented as fact without basis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporal Validity&lt;/td&gt;
&lt;td&gt;Stale data treated as current&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;Local findings overgeneralized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;Internal contradictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Alignment&lt;/td&gt;
&lt;td&gt;Does the output match what was asked?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
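&lt;p&gt;To make the table concrete, here is one way per-dimension scores could roll up into a single Truth Score. The real &lt;code&gt;gem2_truth_filter&lt;/code&gt; weighting is not published; an unweighted mean and the pass/warn/fail cutoffs below are assumptions for illustration.&lt;/p&gt;

```python
# Illustrative roll-up of the seven dimensions into one percentage.
# Equal weights and the 0.8/0.4 cutoffs are assumptions, not the
# published gem2_truth_filter logic.

DIMENSIONS = [
    "source_attribution", "evidence_quality", "claim_grounding",
    "temporal_validity", "scope_accuracy", "logical_consistency",
    "prompt_alignment",
]

def truth_score(scores):
    """Average the seven dimension scores (each 0.0 to 1.0) into a percent."""
    vals = [scores[d] for d in DIMENSIONS]
    return round(100 * sum(vals) / len(vals))

def flag(v):
    """Mirror the article's markers: pass, warn, fail."""
    return "pass" if v >= 0.8 else "warn" if v >= 0.4 else "fail"

sample = dict.fromkeys(DIMENSIONS, 0.2)
print(truth_score(sample), flag(0.2))  # -> 20 fail
```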


&lt;h2&gt;
  
  
  Round 1: What the AIs reported — standard prompt, no constraints
&lt;/h2&gt;

&lt;p&gt;We gave each provider a straightforward research request with no special instructions about sourcing or evidence quality. Here's what they produced:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Market size (TAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"~$0.45B in 2024" (cited IDC)&lt;/td&gt;
&lt;td&gt;"~$0.45B in 2024" (cited "one report")&lt;/td&gt;
&lt;td&gt;"$2.34B in 2024" (cited Grand View Research)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Growth rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"~25% CAGR"&lt;/td&gt;
&lt;td&gt;"~25% CAGR"&lt;/td&gt;
&lt;td&gt;"21.6% CAGR to 2030"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key differentiator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"genuinely novel position"&lt;/td&gt;
&lt;td&gt;"formal verifiability"&lt;/td&gt;
&lt;td&gt;"infrastructure for trustworthy AI ecosystem"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Competitor depth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Named 7 competitors with features&lt;/td&gt;
&lt;td&gt;Named 8 competitors with pricing&lt;/td&gt;
&lt;td&gt;Named 5 competitors with feature table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Risks identified&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Solo founder, pre-revenue, academic skepticism&lt;/td&gt;
&lt;td&gt;Early stage, niche complexity, unproven ROI&lt;/td&gt;
&lt;td&gt;Early documentation, computational overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uniqueness claim&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"no commercial product today combines..."&lt;/td&gt;
&lt;td&gt;"formal approach brings rigor unmatched by competitors"&lt;/td&gt;
&lt;td&gt;"not just a debugging tool; infrastructure for resilient AI"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three reports looked professional. Well-structured. Authoritative. The kind of output you'd confidently share with a stakeholder.&lt;/p&gt;

&lt;p&gt;But we didn't share them. We verified them.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 2: Verification — GEM² truth filter exposes the gaps
&lt;/h2&gt;

&lt;p&gt;We ran each report through &lt;code&gt;gem2_truth_filter&lt;/code&gt;. Same tool, same criteria, same seven dimensions. All outputs evaluated using identical scoring logic across all providers.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.40 ⚠️&lt;/td&gt;
&lt;td&gt;0.40 ⚠️&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extrapolation Risk&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Every provider failed.&lt;/strong&gt; Not one scored above 30%.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the filter caught
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Invented precision.&lt;/strong&gt; Market size figures like "$0.45B in 2024 with 25% CAGR to 2033" — attributed to "one analyst report" without naming the firm, methodology, or publication date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unsupported superlatives.&lt;/strong&gt; "Genuinely novel," "genuinely unoccupied commercially," "the only product that..." — without exhaustive competitive evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snapshot-to-trend errors.&lt;/strong&gt; Current market conditions presented as permanent structural realities.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://tpmn-psl.gemsquared.ai/#spt" rel="noopener noreferrer"&gt;SPT taxonomy&lt;/a&gt; flagged three patterns across all providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S→T (Snapshot → Trend):&lt;/strong&gt; treating current state as permanent identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L→G (Local → Global):&lt;/strong&gt; one data point generalized to universal claim&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Δe→∫de (Thin → Broad):&lt;/strong&gt; sweeping assertion from sparse evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't hallucinations — the facts weren't always wrong. &lt;strong&gt;The reasoning was overclaimed.&lt;/strong&gt; And no provider warned the reader.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 3: Improved prompt — generated by GEM² tools
&lt;/h2&gt;

&lt;p&gt;Here's the key: &lt;strong&gt;we didn't write the improved prompt ourselves.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We simply asked: &lt;em&gt;"Create a robust, grounded research prompt using gem2 tools."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it. We didn't engineer the prompt. The system did. No TPMN knowledge required. No specification reading. The AI picked the right tool from &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;12 available gem2 MCP tools&lt;/a&gt; — &lt;code&gt;tpmn_contract_writer&lt;/code&gt; — and generated a prompt that enforced epistemic rules automatically.&lt;/p&gt;

&lt;p&gt;The generated prompt included rules like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every quantitative claim must include source name, publication date, and URL&lt;/li&gt;
&lt;li&gt;"One survey" or "one report" is not acceptable attribution&lt;/li&gt;
&lt;li&gt;Claims must be tagged as grounded (⊢), inferred (⊨), or extrapolated (⊬)&lt;/li&gt;
&lt;li&gt;Anti-patterns explicitly listed and prohibited&lt;/li&gt;
&lt;li&gt;If data is unavailable, write "not available from verified sources" — don't invent&lt;/li&gt;
&lt;/ul&gt;
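&lt;p&gt;A simple way to picture the tagging rule: classify each claim by how much attribution it carries. The decision rule below is an assumption for illustration, not the TPMN-PSL classification algorithm, and the field names are hypothetical.&lt;/p&gt;

```python
# Sketch of epistemic tagging driven by attribution completeness.
# The rule (full attribution -> grounded, partial -> inferred,
# none -> extrapolated) is assumed, not the spec's algorithm.

def tag_claim(claim):
    """Assign an epistemic tag based on the attribution a claim carries."""
    has_source = bool(claim.get("source_name"))
    has_date = bool(claim.get("pub_date"))
    has_url = bool(claim.get("url"))
    if has_source and has_date and has_url:
        return "grounded"      # the spec's ⊢
    if has_source:
        return "inferred"      # ⊨: partially traceable
    return "extrapolated"      # ⊬: no traceable evidence

print(tag_claim({"source_name": "Grand View Research"}))  # -> inferred
```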

&lt;p&gt;We verified the prompt itself with &lt;code&gt;gem2_truth_filter&lt;/code&gt; before using it. &lt;strong&gt;The prompt scored 85%.&lt;/strong&gt; Then we ran it through all three providers.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 4: Re-research — what the AIs reported with the grounded prompt
&lt;/h2&gt;

&lt;p&gt;Same task. Same providers. Different prompt. Different results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Market size (TAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Specific data not available from verified sources"&lt;/td&gt;
&lt;td&gt;"~$0.45B (one report)" ⚠️&lt;/td&gt;
&lt;td&gt;"$2.34B (Grand View Research, 2024)"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Growth rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not stated — insufficient evidence&lt;/td&gt;
&lt;td&gt;"~25% CAGR" ⚠️&lt;/td&gt;
&lt;td&gt;"21.6% CAGR (Grand View Research)"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key differentiator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Four observable features" — listed with sources&lt;/td&gt;
&lt;td&gt;"Formal verifiability unmatched" ⚠️&lt;/td&gt;
&lt;td&gt;"Granular truth state classification"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claims tagged?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Every claim marked ⊢, ⊨, or ⊬&lt;/td&gt;
&lt;td&gt;❌ No epistemic tagging&lt;/td&gt;
&lt;td&gt;Partial — some sections tagged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Limitations section?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ 7 specific gaps acknowledged&lt;/td&gt;
&lt;td&gt;❌ Generic methodology note&lt;/td&gt;
&lt;td&gt;✅ Listed 4 limitations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unsourced numbers?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0 — wrote "not available" instead&lt;/td&gt;
&lt;td&gt;Multiple — "92% of Fortune 500" without source&lt;/td&gt;
&lt;td&gt;Some — market figures cited, incident costs not&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference was visible immediately. One provider followed every rule. The others improved but couldn't fully resist the instinct to fill gaps with confident-sounding assertions.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 5: Re-verification — truth filter confirms the improvement
&lt;/h2&gt;

&lt;p&gt;We ran all three re-researched outputs through the same truth filter.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~48%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~35%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;td&gt;0.60 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.40 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;td&gt;0.40 ⚠️&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The improvement, measured
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Round 2 (before)&lt;/th&gt;
&lt;th&gt;Round 5 (after)&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;td&gt;77%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+59 points&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;28%&lt;/td&gt;
&lt;td&gt;~48%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20 points&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;~35%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+23 points&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Every provider improved.&lt;/strong&gt; In this test, structured epistemic instructions produced measurably more reliable output. This isn't theory — it's six verified data points from the same tool, same criteria, same task.&lt;/p&gt;


&lt;h2&gt;
  
  
  What the data shows
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The prompt improved every provider — but couldn't fix the instinct
&lt;/h3&gt;

&lt;p&gt;Even with explicit anti-patterns listed — "PROHIBITED: citing 'one report' without naming it" — two out of three providers did it anyway.&lt;/p&gt;

&lt;p&gt;The generated prompt said: &lt;em&gt;"If you cannot provide source name, date, and URL, write 'data not available from verified sources' instead."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One provider wrote "data not available." The other two invented attributions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt improved the scores. It couldn't fix the instinct to overclaim.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  This isn't a writing quality score
&lt;/h3&gt;

&lt;p&gt;This was one of the most important findings — and the core message of &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;our video&lt;/a&gt;. All three providers produced well-written, logically coherent reports. Logical Consistency scored 0.70–0.90 across the board — even in the reports that scored 12% overall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reports that scored lowest were the best-written ones.&lt;/strong&gt; Polished, authoritative, structured — and epistemically unreliable.&lt;/p&gt;

&lt;p&gt;TPMN Checker measures something different: not whether the output sounds right, but whether &lt;strong&gt;the reasoning is traceable.&lt;/strong&gt; Can the AI prove how it got there?&lt;/p&gt;

&lt;p&gt;That's epistemic traceability. It's what separates trustworthy output from confident output.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Chances are you've shipped AI-generated content — a report, a summary, an analysis, a PRD, a code review. Some of it probably contained overclaims you didn't catch. Not because the facts were wrong, but because the reasoning exceeded the evidence.&lt;/p&gt;

&lt;p&gt;That's the gap TPMN Checker fills.&lt;/p&gt;

&lt;p&gt;It's not a hallucination detector (those check facts). It's not a grammar checker (those check writing). It's a &lt;strong&gt;reasoning traceability tool&lt;/strong&gt; — it tells you which parts of your AI output are grounded, which are inferred, and which are extrapolated beyond the evidence.&lt;/p&gt;
&lt;h3&gt;
  
  
  AI audits AI. But the standard comes from humans.
&lt;/h3&gt;

&lt;p&gt;The truth filter is powered by AI. It uses LLMs to evaluate LLM output. That creates a circular problem: who grades the grader?&lt;/p&gt;

&lt;p&gt;Our answer — same as in the video at [1:03]: &lt;strong&gt;you do.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you use TPMN Checker and disagree with a score, that disagreement is data. Collected with consent, aggregated across users, and analyzed for patterns — your evaluations become the ground truth that calibrates the system.&lt;/p&gt;

&lt;p&gt;We call this &lt;a href="https://gemsquared.ai/community" rel="noopener noreferrer"&gt;Human Ground Truth&lt;/a&gt;. AI processes. AI suggests. But the standard for what counts as honest reasoning — that comes from humans.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;TPMN Checker runs today inside Claude, ChatGPT, Cursor, and any MCP-compatible environment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Connect (once)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Claude.ai or ChatGPT — zero install:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your AI tool's connector/app settings&lt;/li&gt;
&lt;li&gt;Add custom connector: &lt;code&gt;https://mcp-tpmn-checker.gemsquared.ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Complete OAuth login&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @gem_squared/setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Standard use case — the pattern that works
&lt;/h3&gt;

&lt;p&gt;You probably already have AI-generated content sitting in a doc right now — a research summary, a PRD, a financial analysis, a code review. Here's what to do with it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Paste your AI output into the conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Ask: &lt;em&gt;"Verify this by gem2 truth filter."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Read the score. See which claims are grounded, which are extrapolated, which have no source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Ask: &lt;em&gt;"Create a grounded replacement prompt using gem2 contract writer."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5.&lt;/strong&gt; Ask AI to proceed with the new prompt. Watch what you get.&lt;/p&gt;

&lt;p&gt;That's the loop: &lt;strong&gt;verify → ground → regenerate.&lt;/strong&gt; The same loop that took our research from 18% to 77%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it for free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Get started at gemsquared.ai&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The specification is open
&lt;/h2&gt;

&lt;p&gt;TPMN-PSL (Truth-Provenance Markup Notation — Prompt Specification Language) is the open specification behind the checker. It's released under CC-BY 4.0. Anyone can read it, implement it, or extend it.&lt;/p&gt;

&lt;p&gt;The specification defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five epistemic tags&lt;/strong&gt; (⊢ grounded, ⊨ inferred, ⊬ extrapolated, ⊥ unknown, ? speculative)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three prohibited reasoning patterns&lt;/strong&gt; (SPT: snapshot→trend, local→global, thin→broad)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three-phase verification protocol&lt;/strong&gt; (pre-flight, inline, post-flight)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three operational modes&lt;/strong&gt; (strict, refine, interpolate)&lt;/li&gt;
&lt;/ul&gt;
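&lt;p&gt;For anyone building a client that consumes tagged output, the spec's vocabulary is small enough to capture as plain data. The tag meanings follow the list above; the line format parsed below ("tag, space, claim text") is an assumption, not something the specification mandates.&lt;/p&gt;

```python
# The spec's vocabulary as plain data, plus a toy parser for lines
# of the form "<tag> <claim text>". The line format is assumed.

EPISTEMIC_TAGS = {
    "⊢": "grounded",
    "⊨": "inferred",
    "⊬": "extrapolated",
    "⊥": "unknown",
    "?": "speculative",
}

SPT_PATTERNS = {
    "S→T": "snapshot treated as trend",
    "L→G": "local finding generalized globally",
    "Δe→∫de": "thin evidence stretched into a broad claim",
}

def parse_tagged(line):
    """Split a 'tag claim-text' line into (meaning, text); None if untagged."""
    tag, _, rest = line.partition(" ")
    meaning = EPISTEMIC_TAGS.get(tag)
    return (meaning, rest.strip()) if meaning else None

print(parse_tagged("⊢ Market size $2.34B (Grand View Research, 2024)"))
```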

&lt;p&gt;→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;Read the specification&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What we learned from dogfooding
&lt;/h2&gt;

&lt;p&gt;Five rounds of testing our own tool on our own AI research taught us three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No provider in our test was inherently honest.&lt;/strong&gt; Claude — the provider our tool runs on — scored 18% without epistemic constraints. Every provider overclaimed when unconstrained. The difference is the specification, not the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structured prompts produce measurably better output.&lt;/strong&gt; A 59-point improvement from the same provider on the same task, just by using a gem2-generated prompt. That's not marginal — that's transformational. And you don't need to understand the specification to use it — just ask your AI to create a grounded prompt with gem2 tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The instinct to overclaim persists.&lt;/strong&gt; Even with explicit instructions to avoid unsupported claims, two out of three providers violated the rules. The prompt helps. The prompt isn't enough. That's why verification exists as a separate step — because you can't trust the AI to police itself, no matter how well you prompt it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The question isn't whether the answer is right or wrong. It's whether the reasoning is honest.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we say in the video at [0:57]: &lt;em&gt;"So, who decides what's true? Not Claude. Not ChatGPT. Not Gemini. You do."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What's hiding in your AI output? &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Now you can see it.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TPMN Checker is in pre-GA. 12 MCP tools, 6 domains, 3 providers. Free to start.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://gemsquared.ai/about" rel="noopener noreferrer"&gt;Inseok Seo (David)&lt;/a&gt; — GEM²-AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai" rel="noopener noreferrer"&gt;gemsquared.ai&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;Watch: Three AIs. Three Answers.&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN-PSL Specification&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TPMN-PSL is an open specification — not a product.&lt;/strong&gt; If you believe AI outputs should be auditable, &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;read the spec&lt;/a&gt;, open an issue, or submit a PR. The standard gets better when more people challenge it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
