<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GEM² Inc.</title>
    <description>The latest articles on DEV Community by GEM² Inc. (@gemsquared).</description>
    <link>https://dev.to/gemsquared</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg</url>
      <title>DEV Community: GEM² Inc.</title>
      <link>https://dev.to/gemsquared</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gemsquared"/>
    <language>en</language>
    <item>
      <title>Claude Skills want ALL</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:07:55 +0000</pubDate>
      <link>https://dev.to/gemsquared/claude-skills-want-all-404k</link>
      <guid>https://dev.to/gemsquared/claude-skills-want-all-404k</guid>
      <description>&lt;p&gt;"AI is a mathematical and logical system. TPMN is Algebraic Logical Language, ALL, to communicate with AI"&lt;/p&gt;




&lt;p&gt;For engineering work, I concluded that AI needs an engineering language. Not for understanding, but for parsing. I needed a language that an AI agent could parse unambiguously, that a human could read without a manual, and that survived context compaction intact.&lt;/p&gt;

&lt;p&gt;That language is TPMN — an &lt;strong&gt;Algebraic Logical Language&lt;/strong&gt; (ALL).&lt;/p&gt;

&lt;p&gt;TPMN takes its name from four sources:&lt;br&gt;
&lt;strong&gt;T&lt;/strong&gt;LA+ (temporal logic for concurrent systems),&lt;br&gt;
&lt;strong&gt;P&lt;/strong&gt;anini (the ancient Sanskrit grammarian who solved semantic disambiguation),&lt;br&gt;
&lt;strong&gt;M&lt;/strong&gt;athematical notation (for formal constraints),&lt;br&gt;
&lt;strong&gt;N&lt;/strong&gt;atural language (for the subjective meaning that symbols cannot carry alone).&lt;br&gt;
Each source contributes something specific. Together, they form ALL — the language for specifying AI skills.&lt;/p&gt;

&lt;p&gt;If you have read the introduction, &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe" class="crayons-story__hidden-navigation-link"&gt;Claude Skills Fail Silently. Here Is My Solution.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/gemsquared" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" alt="gemsquared profile" class="crayons-avatar__image" width="460" height="460"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/gemsquared" class="crayons-story__secondary fw-medium m:hidden"&gt;
              GEM² Inc.
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                GEM² Inc.
                
              
              &lt;div id="story-author-preview-content-3590715" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/gemsquared" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" class="crayons-avatar__image" alt="" width="460" height="460"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;GEM² Inc.&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 30&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe" id="article-link-3590715"&gt;
          Claude Skills Fail Silently. Here Is My Solution.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/claudecode"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;claudecode&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cloudskills"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cloudskills&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devtools"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devtools&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;



&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo" class="crayons-story__hidden-navigation-link"&gt;Three Wounds That Prose Skills Cannot Fix — The Full Analysis&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/gemsquared" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" alt="gemsquared profile" class="crayons-avatar__image" width="460" height="460"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/gemsquared" class="crayons-story__secondary fw-medium m:hidden"&gt;
              GEM² Inc.
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                GEM² Inc.
                
              
              &lt;div id="story-author-preview-content-3590822" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/gemsquared" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" class="crayons-avatar__image" alt="" width="460" height="460"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;GEM² Inc.&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 30&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo" id="article-link-3590822"&gt;
          Three Wounds That Prose Skills Cannot Fix — The Full Analysis
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cloudskills"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cloudskills&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/claudecode"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;claudecode&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devtools"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devtools&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            14 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;p&gt;you know the &lt;code&gt;F: A → B | P&lt;/code&gt; pattern. This post explains why algebraic notation is necessary, what each layer contributes, and how the complete language works.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why algebra, not prose
&lt;/h2&gt;

&lt;p&gt;AI is a mathematical and logical system. It processes tokens through transformer architectures — linear algebra, attention mechanisms, probability distributions. The internal representation is mathematical.&lt;/p&gt;

&lt;p&gt;Yet we communicate with AI in prose. We write skills as paragraphs of instructions. We describe workflows in bullet points. Then we are surprised when AI interprets ambiguously.&lt;/p&gt;

&lt;p&gt;The mismatch is fundamental: &lt;strong&gt;we are writing in natural language to a mathematical system.&lt;/strong&gt; Prose is optimized for human communication — nuance, context, implication. These qualities matter when AI infers intent in natural conversation. But they become the core cause of hallucination when AI must execute engineering work precisely, correctly, and consistently.&lt;/p&gt;

&lt;p&gt;I think hallucination is not a fault in AI's processing logic. It is extrapolation driven by context dilution and drift.&lt;/p&gt;

&lt;p&gt;TPMN is an Algebraic Logical Language that I created to supersede NL-based prompting. It is not invented from scratch — each of its four layers draws from a historically proven formalism. &lt;/p&gt;

&lt;p&gt;An algebraic expression like &lt;code&gt;P: title ≠ ⊥ ∧ project_slug ≠ ⊥&lt;/code&gt; has exactly one interpretation. A prose instruction like "make sure the title and project are provided" has many — what counts as "provided"? Is an empty string provided? Is a whitespace-only string? The algebra eliminates the question.&lt;/p&gt;
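&lt;p&gt;The difference can be made concrete in a short sketch (the &lt;code&gt;precondition_holds&lt;/code&gt; helper is hypothetical, reusing the field names from the example above): the algebraic predicate maps to one executable check, and the question of what counts as "provided" has to be answered once, explicitly.&lt;/p&gt;

```python
def precondition_holds(title, project_slug):
    """P: title != bottom AND project_slug != bottom.
    'bottom' (absent) is read here as None or whitespace-only.
    The algebra forces that decision to be made once, up front."""
    def present(value):
        # A field is present iff it is a non-empty, non-whitespace string.
        return isinstance(value, str) and bool(value.strip())
    return present(title) and present(project_slug)

# The prose ambiguities become concrete cases:
assert precondition_holds("My Post", "my-project")
assert not precondition_holds("", "my-project")      # empty string
assert not precondition_holds("   ", "my-project")   # whitespace-only
assert not precondition_holds(None, "my-project")    # absent
```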

&lt;p&gt;This is not about making things harder for humans. The algebra is readable. &lt;code&gt;𝔹&lt;/code&gt; means boolean. &lt;code&gt;𝕊&lt;/code&gt; means string. &lt;code&gt;⊥&lt;/code&gt; means absent. &lt;code&gt;∧&lt;/code&gt; means AND. You learn the symbols once. They never change meaning.&lt;/p&gt;

&lt;p&gt;AI cannot read your mind. AI is not a magic wand. &lt;br&gt;
You can talk to AI in any persona — but the persona does not change what AI is. When you want humanistic discourse, nuance and reading between the lines matter. You bring a shared vocabulary and context to make that work. &lt;/p&gt;

&lt;p&gt;The same applies to engineering. When you ask AI to do engineering work, you need basic mathematical terms and logical structure to describe your need clearly.&lt;/p&gt;

&lt;p&gt;TPMN is created for exactly this: &lt;strong&gt;axiomatic rigor&lt;/strong&gt;, &lt;strong&gt;procedural clarity&lt;/strong&gt;, and &lt;strong&gt;impersonality&lt;/strong&gt; — the Economy of Expression for communicating engineering work to AI with minimum ambiguity. &lt;/p&gt;


&lt;h2&gt;
  
  
  The four layers of TPMN
&lt;/h2&gt;
&lt;h3&gt;
  
  
  T — TLA+ (structural layer)
&lt;/h3&gt;

&lt;p&gt;Leslie Lamport's TLA+ provides the structural backbone. Records, sequences, definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(* Records — named field structures *)
Person ≜ [name: 𝕊, age: ℕ]

(* Sequences — ordered steps *)
Pipeline ≜ &amp;lt;&amp;lt;plan, design, implement, test, deploy, verify&amp;gt;&amp;gt;

(* Definitions — binding names to meanings *)
Init_Session: A → B | P ≜ [...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;≜&lt;/code&gt; symbol (defined as) comes directly from TLA+. It is visually distinct from &lt;code&gt;=&lt;/code&gt; (equality) and &lt;code&gt;→&lt;/code&gt; (transformation). When you see &lt;code&gt;≜&lt;/code&gt;, you know: this is a definition, not a comparison.&lt;/p&gt;

&lt;p&gt;Sets define enumerations — the closed universe of valid values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Status ≜ {PENDING, IN_PROGRESS, COMPLETED, BLOCKED, ABORTED}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a TPMN contract says &lt;code&gt;mode: {feature, bug}&lt;/code&gt;, AI knows there are exactly two valid values. Not "feature, bug, or similar." Exactly two. This is how you eliminate the drift that makes prose skills unreliable.&lt;/p&gt;
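&lt;p&gt;As an illustration of the closed-universe idea (the &lt;code&gt;Mode&lt;/code&gt; enum and &lt;code&gt;parse_mode&lt;/code&gt; helper are hypothetical, not part of TPMN): membership in the declared set is the only test, and anything outside it is rejected rather than guessed at.&lt;/p&gt;

```python
from enum import Enum

class Mode(Enum):
    # mode: {feature, bug} -- exactly two valid values, a closed universe
    FEATURE = "feature"
    BUG = "bug"

def parse_mode(raw):
    """Reject anything outside the declared set instead of guessing."""
    try:
        return Mode(raw)
    except ValueError:
        raise ValueError(f"mode must be one of {[m.value for m in Mode]}, got {raw!r}")

assert parse_mode("feature") is Mode.FEATURE
```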

&lt;h3&gt;
  
  
  P — Panini (conflict resolution layer)
&lt;/h3&gt;

&lt;p&gt;Panini solved the problem of rule conflict and ambiguity in Sanskrit grammar 2,500 years ago. His Ashtadhyayi — ~4,000 rules generating every valid Sanskrit form — is a deterministic generative system. When two rules compete for the same derivation, meta-rules (paribhasha) resolve the conflict so that exactly one wins. No ambiguity survives.&lt;/p&gt;

&lt;p&gt;The problem is identical to what makes prose skills fail: &lt;strong&gt;multiple valid interpretations with no resolution mechanism.&lt;/strong&gt; Panini's solution was not to write more prose — it was to build a formal system where conflicts are resolved by structure, not by judgment.&lt;/p&gt;

&lt;p&gt;TPMN applies this principle through three patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Typed categories&lt;/strong&gt; — Panini classified phonemes and morphemes into formal categories (pratyahara) so that rules could target precise classes, not vague descriptions. TPMN does the same with typed fields: two fields named &lt;code&gt;status&lt;/code&gt; are not the same field if one is &lt;code&gt;𝕊&lt;/code&gt; and the other is &lt;code&gt;{PENDING, IN_PROGRESS, COMPLETED}&lt;/code&gt;. The type resolves the conflict.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grounding 5W&lt;/strong&gt; — Panini's rules are ordered and scoped — each rule declares exactly when it applies. TPMN mirrors this: every skill declares who, what, when, where, why. The "when" and "what" fields are the most critical — Claude Skills already recommends them. But with the remaining 3W (who, where, why), the AI can resolve which skill applies with minimal guessing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exception over default&lt;/strong&gt; — Panini's system uses the utsarga/apavada principle: specific exceptions override general rules. TPMN adapts this as negative contracts — &lt;code&gt;¬B&lt;/code&gt; declares what a skill explicitly never does. AI agents are helpful by default. They will cross boundaries if those boundaries are not declared. The specific exclusion overrides the general helpfulness.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  M — Mathematical notation (decisional layer)
&lt;/h3&gt;

&lt;p&gt;TLA+ provides structure — records, sequences, definitions. Mathematical notation provides the logic that fills those structures: &lt;strong&gt;programmatic decisions and formal constraints.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TLA+ defines the container: &lt;code&gt;A ≜ [name: 𝕊, age: ℕ]&lt;/code&gt;. Math notation writes the rule that decides whether the container is valid: &lt;code&gt;name ≠ ⊥ ∧ age &amp;gt; 0&lt;/code&gt;. Every precondition P in &lt;code&gt;F: A → B | P&lt;/code&gt;, every invariant, every verification predicate — the logic that determines pass or fail — is math notation.&lt;/p&gt;
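&lt;p&gt;A minimal sketch of the T/M split, assuming a Python rendering of the example above: the dataclass plays the role of the TLA+ container, and a separate predicate carries the mathematical rule that decides validity.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Person:
    # T layer: the container, Person == [name: S, age: N]
    name: str
    age: int

def is_valid(p):
    # M layer: the rule, name != bottom AND age > 0
    return isinstance(p.name, str) and bool(p.name) and isinstance(p.age, int) and p.age > 0

assert is_valid(Person("Ada", 36))
assert not is_valid(Person("", 36))    # name is bottom
assert not is_valid(Person("Ada", 0))  # age constraint fails
```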

&lt;p&gt;The core symbols:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;∧&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AND — conjunction&lt;/td&gt;
&lt;td&gt;Combining predicates in P&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;∨&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OR — disjunction&lt;/td&gt;
&lt;td&gt;Alternative conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;¬&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NOT — negation&lt;/td&gt;
&lt;td&gt;Exclusion, negative contracts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;∈&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Element of — membership&lt;/td&gt;
&lt;td&gt;Type checking, set membership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;∀&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;For all — universal&lt;/td&gt;
&lt;td&gt;Field coverage, invariants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;⟹&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Implies&lt;/td&gt;
&lt;td&gt;Chain invariant between flow steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;⟺&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If and only if&lt;/td&gt;
&lt;td&gt;STATE verification predicate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For example, the STATE verification predicate — how we determine SUCCESS or FAILURE — is pure M-layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STATE = SUCCESS ⟺
  (∀ field ∈ B: b[field] ≠ ⊥ ∧ type(b[field]) = B[field].type)
  ∧ P(a, b) holds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is structural type-checking, not subjective judgment. AI can self-evaluate against its own CONTRACT — and that makes drift detectable rather than silent.&lt;/p&gt;
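&lt;p&gt;A sketch of that self-evaluation, with a hypothetical output spec &lt;code&gt;B_SPEC&lt;/code&gt; standing in for B: the predicate reduces to field coverage plus type agreement, with no judgment call anywhere.&lt;/p&gt;

```python
# Structural check mirroring the predicate above:
#   STATE = SUCCESS iff (for all field in B: b[field] is present
#   and has B's declared type) and P(a, b) holds.
# B_SPEC is a hypothetical output spec for illustration only.
B_SPEC = {"status": str, "files_checked": int}

def state(b, p_holds):
    ok = all(
        b.get(field) is not None and isinstance(b[field], ftype)
        for field, ftype in B_SPEC.items()
    )
    return "SUCCESS" if ok and p_holds else "FAILURE"

assert state({"status": "clean", "files_checked": 3}, p_holds=True) == "SUCCESS"
assert state({"status": "clean"}, p_holds=True) == "FAILURE"  # missing field
```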

&lt;p&gt;&lt;strong&gt;Epistemic tags&lt;/strong&gt; extend the M layer with claim provenance — unique to TPMN:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;⊢&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Grounded — claim from verifiable fact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;⊨&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inferred — derived from grounded claims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;⊬&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extrapolated — beyond evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INV ≜ [
  ⊢ Strictly read-only — never modifies any file,
  ⊢ B is state report — AI decides next action based on B,
  ⊢ MANDATE: session state detection only
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every rule is &lt;code&gt;⊢&lt;/code&gt;-tagged — grounded. During context compaction, &lt;code&gt;⊢&lt;/code&gt;-tagged claims survive as hard constraints inside code blocks. Prose instructions like "never modify files" get summarized away. &lt;code&gt;⊢ NEVER modify any file&lt;/code&gt; inside a TPMN block survives because code blocks are treated as atomic by summarizers.&lt;/p&gt;

&lt;h3&gt;
  
  
  N — Natural language (meaning layer)
&lt;/h3&gt;

&lt;p&gt;Formal notation cannot carry subjective meaning. "ARCHITECT beginning a work session" is not expressible in algebra. So TPMN uses NL in controlled positions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(* ... *)&lt;/code&gt; inline comments — explanation alongside formal structure&lt;/li&gt;
&lt;li&gt;String values inside records — &lt;code&gt;who: "ARCHITECT beginning a work session"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Flow step action fields — operational descriptions of what to do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule: &lt;strong&gt;NL complements, it does not replace.&lt;/strong&gt; The structure constrains; the NL explains. NL is never used for definitions, types, or logic — only for the human meaning that symbols cannot carry alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  The contract: F: A → B | P
&lt;/h2&gt;

&lt;p&gt;The four layers combine into the contract — the core of every UNIT-SKILL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Skill_Name: A → B | P ≜ [
  A: [input fields with types],
  B: [output STATE fields with types],
  P: precondition predicates joined by ∧
]

¬B ≜ [
  ⊢ NEVER {boundary this skill must not cross},
  ⊢ NEVER {sibling mandate it must not assume},
  ⊢ NEVER {side effect it must not produce}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; — input state. Typed record. What the skill receives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; — output state. Must be state, never action. The core invariant: B is what the skill produces, not what happens next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt; — preconditions. Conjunction of predicates that must all hold before execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;¬B&lt;/strong&gt; — negative contract. What the skill explicitly never does.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The flow: ordered steps with chain invariant
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Flow ≜ &amp;lt;&amp;lt;
  [name: "step_name",
   action: "what to do",
   pre:  precondition_predicate,
   post: postcondition_predicate],
  ...
&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum 5 steps, prefer 3 or fewer — our &lt;a href="https://dev.toSERIES_POST_RULES"&gt;reliability model&lt;/a&gt; based on 0.8^N decay. Claude's official guidance says "one skill, one job" and "keep SKILL.md under 500 lines" — the step limit is TPMN's structural enforcement of the same principle&lt;/li&gt;
&lt;li&gt;The chain invariant holds: &lt;code&gt;∀ i ∈ 1..N-1: Flow[i].post ⟹ Flow[i+1].pre&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Flow is linear — no branching between steps (branching is the AI's job)&lt;/li&gt;
&lt;li&gt;IF/THEN/ELSE within a step's action field is acceptable (local logic)&lt;/li&gt;
&lt;li&gt;IF/THEN/ELSE between steps is not — split into separate skills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;&amp;lt;...&amp;gt;&amp;gt;&lt;/code&gt; syntax is TLA+ sequence notation — it signals "ordered operations" rather than "data structure."&lt;/p&gt;
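&lt;p&gt;One way to see the chain invariant mechanically (a sketch, with pre/post modeled as sets of established facts — that modeling is an assumption of this illustration, not TPMN's definition): implication becomes "every fact the next step requires was established by the previous step's postcondition."&lt;/p&gt;

```python
# Chain invariant: for all i in 1..N-1, Flow[i].post implies Flow[i+1].pre.
# Hypothetical three-step flow; pre/post are sets of facts.
flow = [
    {"name": "read_config", "pre": set(),             "post": {"config_loaded"}},
    {"name": "query_state", "pre": {"config_loaded"}, "post": {"state_known"}},
    {"name": "report",      "pre": {"state_known"},   "post": {"report_written"}},
]

def chain_invariant_holds(steps):
    # Each step's required facts must be a subset of the previous step's output.
    return all(
        nxt["pre"].issubset(cur["post"])
        for cur, nxt in zip(steps, steps[1:])
    )

assert chain_invariant_holds(flow)
```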




&lt;h2&gt;
  
  
  The grounding: 5W record
&lt;/h2&gt;

&lt;p&gt;Every skill anchors itself in context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Grounding_5W ≜ [
  who:   "actor — the role invoking this skill",
  what:  "deliverable — the state transformation",
  when:  "condition — when AI should select this skill",
  where: "scope — boundary of applicability",
  why:   "rationale — why this skill exists as a separate unit"
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All five fields are required: &lt;code&gt;∀ skill: ∀ w ∈ {who, what, when, where, why}: |skill.grounding[w]| &amp;gt; 0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Skills without explicit grounding drift in meaning over time. The "when" field is the most important — it tells the AI exactly when to select this skill.&lt;/p&gt;
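&lt;p&gt;The completeness rule is mechanically checkable. A sketch with hypothetical grounding values: all five fields must be non-empty after trimming whitespace.&lt;/p&gt;

```python
# for all w in {who, what, when, where, why}: |grounding[w]| must be positive
REQUIRED_W = ("who", "what", "when", "where", "why")

def grounding_complete(grounding):
    # A field counts only if it is present and non-blank.
    return all(bool(str(grounding.get(w, "")).strip()) for w in REQUIRED_W)

assert grounding_complete({
    "who": "ARCHITECT beginning a work session",
    "what": "session state detection",
    "when": "at session start",
    "where": "repository root",
    "why": "single source of truth for session state",
})
assert not grounding_complete({"what": "state detection", "when": "session start"})
```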




&lt;h2&gt;
  
  
  Compaction survival: why formal notation wins
&lt;/h2&gt;

&lt;p&gt;This is the practical argument that matters most.&lt;/p&gt;

&lt;p&gt;When an AI agent's context window fills, earlier content gets compacted. Claude Code re-attaches skills post-compaction, but only the first 5,000 tokens per skill within a 25,000-token shared budget. Skills invoked earlier can be dropped entirely. And even within the budget, prose instructions lose nuance — research shows summarization achieves ~6:1 compression, meaning each surviving sentence must carry six times the semantic density of the original.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(* ~40 tokens — exact constraints preserved: *)
P: title ≠ ⊥ ∧ project_slug ≠ ⊥
¬B ≜ [⊢ NEVER modify any file]
Flow ≜ &amp;lt;&amp;lt;S₁, S₂, S₃&amp;gt;&amp;gt;

(* ~60 tokens — same information, but compressible: *)
"Make sure the title and project slug are both provided before
running. The skill should never modify any files. Execute
the three steps in order: first read, then query, then report."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both fit within the re-attachment budget. But when the budget is tight and the summarizer compresses, the prose version loses "never modify any files" while the TPMN version keeps &lt;code&gt;⊢ NEVER modify any file&lt;/code&gt; intact — because there is nothing to compress. The algebra is already at minimum expression. That is why AI needs ALL.&lt;/p&gt;

&lt;h3&gt;
  
  
  And density compounds in complex projects
&lt;/h3&gt;

&lt;p&gt;TPMN is not built for small scripts. It is built for complex, dense projects — the kind where you run 10, 20, 50 concurrent workflows across a long session. That is exactly where algebraic density pays off twice: once in compaction survival, and again in raw token savings.&lt;/p&gt;

&lt;p&gt;I measured this directly. I took 6 real Claude Skills (3 official Anthropic, 3 community) and converted each into TPMN workplan contracts. The results: bespoke skills average &lt;strong&gt;3,583 tokens&lt;/strong&gt; each. The equivalent TPMN contracts average &lt;strong&gt;672 tokens&lt;/strong&gt; — a &lt;strong&gt;5.3x compression ratio&lt;/strong&gt;. The largest compression was 9.6x (Anthropic's skill-creator: 8,916 → 929 tokens), because most of its budget went to process scaffolding that TPMN's core skills already handle. The smallest was 1.9x (webapp-testing), because it was already lean.&lt;/p&gt;

&lt;p&gt;TPMN's 12 core skills cost ~20,000 tokens as shared infrastructure, loaded once. Each additional workflow adds only ~672 tokens. Bespoke skills have zero infrastructure cost but pay ~3,583 tokens per workflow. At 7+ concurrent workflows, TPMN is cheaper in total. At 20, it saves 53%. At 50, 70%. The more complex the project, the larger the advantage — because the infrastructure cost is fixed and the per-workflow cost is 5.3x lower.&lt;/p&gt;
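&lt;p&gt;The crossover arithmetic can be reproduced directly from the measured averages (a back-of-envelope sketch using only the figures quoted in this section):&lt;/p&gt;

```python
import math

# Figures from the measurement above:
CORE = 20_000           # TPMN core skills, loaded once
TPMN_PER_WF = 672       # average TPMN workplan contract
BESPOKE_PER_WF = 3_583  # average bespoke skill

def tpmn_total(n):
    return CORE + TPMN_PER_WF * n

def bespoke_total(n):
    return BESPOKE_PER_WF * n

# Crossover: smallest n where the fixed core cost is amortized,
# i.e. CORE + 672n falls below 3583n.
crossover = math.ceil(CORE / (BESPOKE_PER_WF - TPMN_PER_WF))
print(crossover)  # 7

for n in (20, 50):
    savings = 1 - tpmn_total(n) / bespoke_total(n)
    print(n, f"{savings:.0%}")  # 20 -> 53%, 50 -> 70%
```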




&lt;h2&gt;
  
  
  Notation principles
&lt;/h2&gt;

&lt;p&gt;After building skills in this notation, I have distilled four principles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symbols are unambiguous at any depth.&lt;/strong&gt; Whether you read &lt;code&gt;𝔹&lt;/code&gt; at the top of a contract or nested inside &lt;code&gt;Seq(Record[status: 𝔹])&lt;/code&gt;, it means boolean. No context-dependent interpretation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure is preserved under compaction.&lt;/strong&gt; TPMN blocks survive as atomic code blocks. Prose instructions get summarized and lose constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NL complements, it does not replace.&lt;/strong&gt; The &lt;code&gt;(* ... *)&lt;/code&gt; comment syntax carries subjective meaning. The structure constrains; the NL explains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types are the disambiguator.&lt;/strong&gt; Two fields named &lt;code&gt;status&lt;/code&gt; in different skills are not the same field if one is &lt;code&gt;𝕊&lt;/code&gt; and the other is &lt;code&gt;{PENDING, IN_PROGRESS, COMPLETED}&lt;/code&gt;. The type makes them distinguishable.&lt;/p&gt;

&lt;p&gt;This notation is not complex. It is precise. Those are different things. Complexity hides meaning. Precision reveals it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The TPMN Skill Standard v4 is MIT-licensed. Install the core skills into any project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @gem_squared/tpmn-skill-install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full spec: &lt;a href="https://github.com/gem-squared/tpmn-skill" rel="noopener noreferrer"&gt;TPMN Skill Standard v4 on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;David Seo — &lt;a href="https://gemsquared.ai/tpmn-skill?utm_source=devto&amp;amp;utm_medium=post-all&amp;amp;utm_campaign=tpmn-skill" rel="noopener noreferrer"&gt;GEM².AI&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>cloudskills</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Three Wounds That Prose Skills Cannot Fix — The Full Analysis</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Thu, 30 Apr 2026 09:30:27 +0000</pubDate>
      <link>https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo</link>
      <guid>https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo</guid>
      <description>&lt;p&gt;Silent scope decay. Judgment theater. Trigger collision. Three Wounds that prose skills cannot fix — and the algebraic logical language + spec that does.&lt;/p&gt;




&lt;p&gt;In the companion post, I named the three wounds: &lt;br&gt;
&lt;strong&gt;Silent Scope Decay&lt;/strong&gt;, &lt;strong&gt;Judgment Theater&lt;/strong&gt;, and &lt;strong&gt;Trigger Collision&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Here is the structural proof for each — why prose fails, why contracts don't, and what 130+ real work plans revealed.&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe" class="crayons-story__hidden-navigation-link"&gt;Claude Skills Fail Silently. Here Is My Solution.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/gemsquared" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" alt="gemsquared profile" class="crayons-avatar__image" width="460" height="460"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/gemsquared" class="crayons-story__secondary fw-medium m:hidden"&gt;
              GEM² Inc.
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                GEM² Inc.
                
              
              &lt;div id="story-author-preview-content-3590715" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/gemsquared" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" class="crayons-avatar__image" alt="" width="460" height="460"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;GEM² Inc.&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 30&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe" id="article-link-3590715"&gt;
          Claude Skills Fail Silently. Here Is My Solution.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/claudecode"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;claudecode&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cloudskills"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cloudskills&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devtools"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devtools&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


&lt;p&gt;If you haven't installed yet:&lt;br&gt;
&lt;strong&gt;The 12 skills&lt;/strong&gt;. MIT-licensed. Works with Claude Code out of the box. No infrastructure. No server. Git + filesystem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i @gem_squared/tpmn-skill-install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From GitHub: &lt;a href="https://github.com/gem-squared/tpmn-skill" rel="noopener noreferrer"&gt;github.com/gem-squared/tpmn-skill&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The three wounds and solutions
&lt;/h2&gt;




&lt;h3&gt;
  
  
  Wound 1 — Silent Scope Decay: from prose to spec
&lt;/h3&gt;

&lt;p&gt;Every AI — Claude, ChatGPT, Gemini — lives inside a limited context window. That limitation causes two kinds of silent erosion: context dilution and architecture drift.&lt;/p&gt;

&lt;p&gt;Vibe-coding, Claude Skills, every prose-based method on the market today — all are subject to the same compression. When context compacts, prose loses precision. Meaning survives; accuracy does not.&lt;/p&gt;

&lt;p&gt;But algebraic logical notation cannot be lossy-compressed without breaking syntax. Compaction cannot silently dilute what is already at minimum expression. That is why I created the TPMN spec.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution for Silent Scope Decay
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The axiom: every AI action is a function
&lt;/h4&gt;

&lt;p&gt;Every skill in the world can be written as a single algebraic expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F: A → B | P
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a metaphor. It is the irreducible kernel of what a skill is.&lt;/p&gt;

&lt;p&gt;AI's inference process is a black box. &lt;strong&gt;This is an axiom, not a limitation.&lt;/strong&gt; We cannot micro-control the weights, the direction, or the intermediate reasoning. Claude, ChatGPT, Gemini — every commercial LLM is opaque by construction. We cannot control how F operates internally.&lt;/p&gt;

&lt;p&gt;But we can declare A, B, and P.&lt;/p&gt;

&lt;p&gt;From the human's perspective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; is what I give to AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; is what I want to get from AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt; is the set of invariants and constraints on A, B, or both — the guardrail on AI's output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the AI's perspective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; — exactly what it receives (typed input state)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; — exactly what it produces (typed output state)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt; — what must hold true before and after execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the protocol. Clean, simple, explicit.&lt;/p&gt;
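&lt;p&gt;The protocol is small enough to model in a few lines. This is an illustrative sketch (the &lt;code&gt;Contract&lt;/code&gt; class and the example checks are my assumptions, not the spec):&lt;/p&gt;

```python
# Sketch: a unit contract F: A -> B | P as three declared parts.
# The names (Contract, holds) are illustrative, not part of TPMN.

class Contract:
    def __init__(self, a_check, b_check, p_check):
        self.a_check = a_check   # A: typed input state
        self.b_check = b_check   # B: typed output state
        self.p_check = p_check   # P: invariant over (a, b)

    def holds(self, a, b):
        """True iff a conforms to A, b conforms to B, and P(a, b) holds."""
        return self.a_check(a) and self.b_check(b) and self.p_check(a, b)

# Hypothetical example: "turn a build log into a pass/fail verdict".
deploy = Contract(
    a_check=lambda a: isinstance(a, str),           # A = build log text
    b_check=lambda b: isinstance(b, bool),          # B = boolean verdict
    p_check=lambda a, b: b == ("FAILED" not in a),  # P ties B to A
)

assert deploy.holds("all steps ok", True)
assert not deploy.holds("step 3 FAILED", True)
```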

&lt;p&gt;What is a skill for? To perform a similar workflow identically, every time. Not a single task. A workflow.&lt;/p&gt;

&lt;p&gt;So let UNIT-WORK = F.&lt;/p&gt;

&lt;p&gt;A unit-work is a discrete, isolated F — a single mission given to AI. Whether AI completes it through one-shot inference or internally spawns many sub-agents, that is the black box. We cannot control it. We must not try. From the human's perspective, it is a single unit of work.&lt;/p&gt;

&lt;p&gt;Now a workflow is the &lt;strong&gt;summation of unit-works&lt;/strong&gt; plus the &lt;strong&gt;logical language&lt;/strong&gt; describing the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WF ≜ Σ unit-work + flow-logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Claude Skill" ≡ "WF ≜ Σ unit-work + flow-logic"&lt;/p&gt;
&lt;/blockquote&gt;
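&lt;p&gt;As a sketch of that equation, a workflow can be modeled as a set of unit-works plus a flow-logic function that orders them (the function names and lambdas are illustrative, not the spec):&lt;/p&gt;

```python
# Sketch: WF as a sum of unit-works plus flow-logic.
# run_workflow and the example lambdas are illustrative only.

def run_workflow(unit_works, flow_logic):
    """Execute unit-works in the order the flow-logic dictates."""
    results = {}
    for name in flow_logic(unit_works):
        results[name] = unit_works[name](results)  # each F sees prior results
    return results

unit_works = {
    "build": lambda prior: "artifact-1",
    "test": lambda prior: prior["build"] + " tested",
}
sequential = lambda units: ["build", "test"]  # the flow-logic

results = run_workflow(unit_works, sequential)
assert results["test"] == "artifact-1 tested"
```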

&lt;p&gt;The full grammar — the complete symbol set, the four-source layering (TLA+, Panini, Math, NL), the UNIT-SKILL rules R1–R7, and the AI selection pipeline — lives in &lt;a href="//Series/SERIES_POST_ALL-tpmn-is-all.md"&gt;TPMN is ALL&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Four properties of compaction-safe notation
&lt;/h4&gt;

&lt;p&gt;After studying what survives compaction and what does not, I identified four properties that make TPMN compaction-safe:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No redundancy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In prose, you say things multiple ways for emphasis. "Never deploy to production. This skill is staging-only. Production deployments are not within scope." Three sentences, same constraint. The summarizer might keep one. It might keep none — because redundancy signals that the content is emphatic rather than structural.&lt;/p&gt;

&lt;p&gt;In TPMN, each constraint appears once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⊢ NEVER deploy to production — staging only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. No redundancy to collapse. The summarizer cannot compress what is already minimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structure preserved&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TPMN uses code blocks. Code blocks are atomic to summarizers. A prose paragraph is a candidate for summarization. A code block is either kept or referenced — not paraphrased.&lt;/p&gt;

&lt;p&gt;This is not a subtle advantage. It is the difference between "the skill has some preconditions about CI" (summarized prose) and the actual precondition &lt;code&gt;P: ci_passed(build_id) = ⊤&lt;/code&gt; (preserved code).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Minimum = only representation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In prose, there is a "full" version and a "summarized" version, and they carry different information. The full version has constraints. The summary loses them.&lt;/p&gt;

&lt;p&gt;In TPMN, the notation IS the minimum representation. There is no shorter form that preserves the meaning. &lt;code&gt;P: ci_passed(build_id) = ⊤&lt;/code&gt; cannot be shortened without losing the typed predicate. This means compaction either preserves the notation (because there is nothing to compress) or drops it entirely (which is detectable — the skill contract is missing).&lt;/p&gt;

&lt;p&gt;The failure mode of TPMN under compaction is absence, not corruption. You either have the contract or you do not. You never have a corrupted half-version that looks right but lost the critical constraint. I prefer an honest absence over a silent corruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Symbols unambiguous at any depth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a prose instruction is partially summarized, words become ambiguous. "Check the build" — check what about the build? That it exists? That it passed CI? That it was deployed?&lt;/p&gt;

&lt;p&gt;When a TPMN symbol survives compaction, it means the same thing regardless of surrounding context. &lt;code&gt;𝔹&lt;/code&gt; means boolean. &lt;code&gt;⊢&lt;/code&gt; means grounded. &lt;code&gt;∧&lt;/code&gt; means AND. These symbols do not drift in meaning when the context around them changes.&lt;/p&gt;

&lt;p&gt;This matters because compaction changes the context. After compaction, the agent has a summarized version of earlier conversation plus the surviving code blocks. If the symbols in those code blocks were context-dependent, compaction would corrupt their meaning. Because they are context-independent, the surviving notation is still correct.&lt;/p&gt;




&lt;h3&gt;
  
  
  Wound 2 — Judgment theater: from in-the-loop to at-the-edge
&lt;/h3&gt;

&lt;p&gt;Everyone agrees now: the bottleneck is not writing code.&lt;br&gt;
It is verifying what the agent produced.&lt;br&gt;
Here is a structural answer — not a workflow tip, a verifiable CONTRACT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 95% trap&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent is correct 95% of the time&lt;/li&gt;
&lt;li&gt;You stop checking&lt;/li&gt;
&lt;li&gt;Then it drops a production database (Replit, July 2025 — already in public discourse)&lt;/li&gt;
&lt;li&gt;The trap is not carelessness. It is a rational response to high accuracy.&lt;/li&gt;
&lt;li&gt;The structural cause: there is no formal specification of what "correct" means
for a given session. So you evaluate by feel. Feel degrades under repetition.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The 95% trap is not a discipline problem.&lt;br&gt;
It is an absent-contract problem.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Solution for Judgment theater
&lt;/h3&gt;

&lt;p&gt;CONTRACT (&lt;code&gt;F: A → B | P&lt;/code&gt;) plus flow-logic defines a skill. STATE is how we govern it.&lt;/p&gt;

&lt;p&gt;Here is the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Human requests AI in natural language.&lt;/li&gt;
&lt;li&gt;AI formalizes the request into algebraic logical format: &lt;code&gt;WF = Σ unit-contracts + flow-logic&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;AI executes each unit-work through WF.&lt;/li&gt;
&lt;li&gt;AI produces &lt;strong&gt;B&lt;/strong&gt; for each unit-contract.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical paradigm shift: &lt;strong&gt;treat B as a state.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI execution state is binary: &lt;strong&gt;SUCCESS&lt;/strong&gt; or &lt;strong&gt;FAILURE&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;SUCCESS is declared if and only if the AI's output is fully aligned with the contract. Alignment means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The input &lt;strong&gt;a&lt;/strong&gt; conforms to the contract's input type &lt;strong&gt;A&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The output &lt;strong&gt;b&lt;/strong&gt; conforms to the contract's output type &lt;strong&gt;B&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The invariant/constraint &lt;strong&gt;P(a, b)&lt;/strong&gt; holds true.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Otherwise, the state is FAILURE.&lt;/p&gt;
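&lt;p&gt;A minimal sketch of that alignment check (&lt;code&gt;evaluate_state&lt;/code&gt; and the example checks are hypothetical names; the real check is performed by the verification skill against the notation):&lt;/p&gt;

```python
# Sketch: STATE as a binary verdict over a unit contract F: A -> B | P.

def evaluate_state(a_check, b_check, p_check, a, b):
    """SUCCESS iff a : A, b : B, and P(a, b) all hold; otherwise FAILURE."""
    aligned = a_check(a) and b_check(b) and p_check(a, b)
    return "SUCCESS" if aligned else "FAILURE"

# Hypothetical contract: input is a build id (str), output is a report
# dict, and the report must reference the same build id.
a_check = lambda a: isinstance(a, str)
b_check = lambda b: isinstance(b, dict) and "build_id" in b
p_check = lambda a, b: b.get("build_id") == a

assert evaluate_state(a_check, b_check, p_check,
                      "b-42", {"build_id": "b-42"}) == "SUCCESS"
assert evaluate_state(a_check, b_check, p_check,
                      "b-42", {"build_id": "b-99"}) == "FAILURE"
```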

&lt;p&gt;&lt;strong&gt;This is not subjective quality assessment.&lt;/strong&gt; STATE does not measure whether the output is "good." It measures whether the output exists within the legitimate mandate area that was pre-defined by the contract before execution. Its purpose is to prevent AI from drifting away from the plan.&lt;/p&gt;

&lt;p&gt;Without STATE, you review AI output against intuition. You catch obvious errors. You feel thorough. But you have no contract to check against — no ground truth to verify. That is judgment theater: the appearance of verification without the structure to make it real.&lt;/p&gt;

&lt;p&gt;Because the verification is structural type-checking — not interpretive judgment — AI itself can evaluate whether its own output is SUCCESS or FAILURE. The contract provides a deterministic specification. The &lt;code&gt;/verify-work&lt;/code&gt; skill does exactly this — evaluating each unit-work's output against its contract, either per-unit inline (immediately after execution) or as a batch at the end.&lt;/p&gt;
&lt;h4&gt;
  
  
  Human-at-the-edge, not human-in-the-loop
&lt;/h4&gt;

&lt;p&gt;To observe AI's work, what is a human supposed to do? Stay in the loop?&lt;/p&gt;

&lt;p&gt;That is the wrong answer.&lt;/p&gt;

&lt;p&gt;Why do developers love Claude Code? Because it treats multi-agent processing as its natural mode of operation — not a bolted-on feature. Claude Code autonomously spawns sub-agents to fulfill human needs. The agentic layer grows more autonomous with every release. Other platforms are converging on the same model.&lt;/p&gt;

&lt;p&gt;We cannot control how Claude Code spawns its internal sub-agents. We must not try. Humans must step out of the loop — otherwise we become the critical bottleneck in an AI-driven system.&lt;/p&gt;

&lt;p&gt;But we still need to observe AI's work. My answer: &lt;strong&gt;human-at-the-edge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The human evaluates the workflow at the boundary — what goes in, what comes out, what must hold true. AI operates autonomously within that boundary. The human observes results, not process. This is the governance model behind the TPMN workflow lifecycle: plan → proceed → verify → archive, with the human at the edge of each transition.&lt;/p&gt;

&lt;p&gt;Creating a workflow in TPMN is not something you hand-write from scratch. &lt;code&gt;/plan-work&lt;/code&gt; does it for you. Claude Code decomposes your request into contracted unit-works automatically. The same 12 core skills apply through the entire workflow hierarchy, from 10 unit-works to 100 and beyond. We adapt Miller's law to bound the size of each level (7±2 units per decomposition).&lt;/p&gt;
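&lt;p&gt;Under that reading of Miller's law, the decomposition bound can be sketched as a recursive check (the bound constant and the function are assumptions for illustration, not part of the skills):&lt;/p&gt;

```python
# Sketch: bound each decomposition level to 7 plus/minus 2 unit-works.
# A work plan is modeled as nested lists; names are illustrative.

MAX_UNITS = 9  # 7 + 2, the upper Miller bound assumed here

def within_miller_bound(plan):
    """True iff every decomposition level holds at most MAX_UNITS items."""
    if not isinstance(plan, list):
        return True                      # a leaf unit-work
    if len(plan) > MAX_UNITS:
        return False
    return all(within_miller_bound(item) for item in plan)

flat = ["unit"] * 100                      # 100 units at one level: too many
nested = [["unit"] * 9 for _ in range(9)]  # 81 units across two bounded levels

assert not within_miller_bound(flat)
assert within_miller_bound(nested)
```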


&lt;h3&gt;
  
  
  Wound 3 — Trigger collision: from mechanism to management
&lt;/h3&gt;

&lt;p&gt;Three independent studies converge on the same failure band.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals" rel="noopener noreferrer"&gt;Vercel engineer Jude Gao reported&lt;/a&gt; that in 56% of their eval cases, the skill was never invoked — even though the documentation was available. &lt;a href="https://perevillega.com/posts/2026-04-01-claude-code-skills-2-what-changed-what-works-what-to-watch-out-for/" rel="noopener noreferrer"&gt;Independent testing by Pere Villega&lt;/a&gt; landed on a 50% success rate — what he called "a coin flip." &lt;a href="https://medium.com/@ivan.seleznov1/why-claude-code-skills-dont-activate-and-how-to-fix-it-86f679409af1" rel="noopener noreferrer"&gt;Ivan Seleznov's 650-trial controlled experiment&lt;/a&gt; found default-configured skills activated at 77%, reaching 100% only when the trigger used imperative language.&lt;/p&gt;

&lt;p&gt;Three methods. Three authors. Same finding: the activation mechanism does not work reliably.&lt;/p&gt;

&lt;p&gt;And when they fail, nothing happens. No error. No warning. No log entry. Claude just proceeds without the skill, and you never know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You cannot observe this failure.&lt;/strong&gt; There is no stack trace. No failed assertion. No red CI badge. The skill simply doesn't load, Claude does its best without it, and the output looks plausible. You would have to run the same prompt with and without the skill, compare results, and notice the difference. Nobody did that.&lt;/p&gt;

&lt;p&gt;Even if you could detect the failure, manual comparison destroys the value proposition. Skills exist to automate. If you must manually verify every invocation, you have replaced automation with audit theater.&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution for Trigger collision
&lt;/h3&gt;

&lt;p&gt;Here is the core design decision. The conventional approach writes one skill per case: deploy skill, test skill, review skill. N cases require N skills. Selection breaks at scale.&lt;/p&gt;

&lt;p&gt;TPMN does not replace those skills. TPMN orchestrates them.&lt;/p&gt;

&lt;p&gt;The 900,000+ skills in the ecosystem encode domain knowledge — Figma-to-code, Sentry triage, Kubernetes deploy, PDF generation. That knowledge is valuable. The problem is not the skills. The problem is that nothing governs how they get selected, executed, verified, or reused.&lt;/p&gt;

&lt;p&gt;And the problem compounds. You find a skill in the marketplace, copy it locally, configure it for your project — but Claude doesn't trigger it reliably, so you manually refine through train/test cycles. Or you build your own skills through real project work — hard-earned, proprietary, proven. Either way, your local skill collection grows. How do you find the right one for the next project?&lt;/p&gt;

&lt;p&gt;Here is the structure I built:&lt;/p&gt;

&lt;p&gt;12 core lifecycle skills, your AI agent, and the filesystem. No infrastructure. Works with Claude Code out of the box.&lt;/p&gt;

&lt;p&gt;TPMN provides the governance layer: 12 lifecycle skills that handle planning, execution, verification, and archival — regardless of which domain skills you use underneath.&lt;/p&gt;

&lt;p&gt;Think of it as two layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TPMN layer:  /plan-work → /proceed-work → /verify-work → /archive-work
             (lifecycle orchestration — contracts, STATE, STATUS)

Contract layer: your-deploy-skill, figma-to-code, sentry-triage, ...
             (domain knowledge — whatever skills you already have)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 12 TPMN skills are operations on a work lifecycle: &lt;code&gt;/init-session&lt;/code&gt;, &lt;code&gt;/check-session&lt;/code&gt;, &lt;code&gt;/search-kg&lt;/code&gt;, &lt;code&gt;/search-skill&lt;/code&gt;, &lt;code&gt;/plan-work&lt;/code&gt;, &lt;code&gt;/proceed-work&lt;/code&gt;, &lt;code&gt;/update-work-plan&lt;/code&gt;, &lt;code&gt;/extract-skill&lt;/code&gt;, &lt;code&gt;/verify-work&lt;/code&gt;, &lt;code&gt;/skill-to-kg&lt;/code&gt;, &lt;code&gt;/archive-work&lt;/code&gt;, &lt;code&gt;/end-session&lt;/code&gt;. Every workflow composes from these 12 lifecycle primitives.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/skill-to-kg&lt;/code&gt; deserves a note: it batch-sweeps all non-protected skills and directories from &lt;code&gt;.claude/skills/&lt;/code&gt; (active — auto-discovered by Claude) to &lt;code&gt;.gem-squared/external-skills/&lt;/code&gt; (archived — invisible to Claude, but searchable by &lt;code&gt;/search-skill&lt;/code&gt;). Only the 12 core lifecycle skills and the project identity skill remain active. Runs automatically at init-session. Restore any specific skill anytime.&lt;/p&gt;
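&lt;p&gt;The sweep could be approximated like this (the directory paths come from the description above; the &lt;code&gt;PROTECTED&lt;/code&gt; set and the helper function are illustrative assumptions, not the skill's actual implementation):&lt;/p&gt;

```python
# Sketch: move non-protected skill directories out of Claude's
# auto-discovery path into the archived, search-only location.
# The PROTECTED set below is a truncated illustration.
import shutil
from pathlib import Path

PROTECTED = {"init-session", "plan-work", "verify-work"}  # plus the rest

def sweep_skills(active=".claude/skills", archive=".gem-squared/external-skills"):
    active_dir, archive_dir = Path(active), Path(archive)
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(active_dir.iterdir()):
        if entry.name not in PROTECTED:
            shutil.move(str(entry), str(archive_dir / entry.name))
            moved.append(entry.name)
    return moved  # archived skills stay on disk, searchable but inactive
```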

&lt;p&gt;You do not throw away your existing skills. You govern them. The 12 core skills handle the lifecycle — the detailed breakdown with the "why separate" rationale for each is in the spec.&lt;/p&gt;

&lt;h4&gt;
  
  
  Knowledge compounds, not skills
&lt;/h4&gt;

&lt;p&gt;This leads to the second inversion.&lt;/p&gt;

&lt;p&gt;The conventional model scales by adding skills. N cases → N skills. The system grows but does not learn. Skill #47 knows nothing about Skill #1.&lt;/p&gt;

&lt;p&gt;TPMN scales by accumulating knowledge. Every completed work-plan that passes verification — COMPLETED status, SUCCESS state — is a &lt;strong&gt;proven template&lt;/strong&gt;. When &lt;code&gt;/archive-work&lt;/code&gt; stores it, the contracts and results become searchable patterns. When &lt;code&gt;/plan-work&lt;/code&gt; runs for a new task, &lt;code&gt;/search-kg&lt;/code&gt; retrieves relevant proven templates. The next decomposition is not starting from zero — it is informed by what worked before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plan-work ─── /search-kg + /search-skill ───→ /proceed-work ─── /search-kg (optional) ───→ /verify-work ───→ /archive-work
     ^            (proven patterns + domain skills)    (reference during execution)                                    |
     +────────────────────────────────────────────────────────────────────────────────────────────────────────────────+
                                                                                                    (store as proven)

/extract-skill ─── /search-kg + /search-skill ───→ upsert to .claude/skills/
                   (find source contract + check if skill exists)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actually, I do not use any custom domain skills. The 12 lifecycle skills are enough — with proven unit-contracts as the reusable knowledge. &lt;code&gt;/extract-skill&lt;/code&gt; exists for compatibility with Claude Code's &lt;code&gt;.claude/skills/&lt;/code&gt; ecosystem, not because I need more skills.&lt;/p&gt;

&lt;p&gt;The lifecycle skills stay fixed. The knowledge compounds. I have been using the same core skills across various projects. The skills never changed. The work-plans are what vary — and the proven ones feed the next cycle.&lt;/p&gt;

&lt;h4&gt;
  
  
  Making flow-logic deterministic: the STATUS concept
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;Σ unit-work&lt;/code&gt; now has a deterministic verification layer via STATE. What about &lt;code&gt;flow-logic&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;To make flow-logic manageable, traceable, and computable, we need a second concept: &lt;strong&gt;STATUS&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Where STATE answers &lt;em&gt;"did it align with the plan?"&lt;/em&gt;, STATUS answers &lt;em&gt;"where is it in the lifecycle?"&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STATUS ≜ { PENDING, IN_PROGRESS, COMPLETED, BLOCKED, ABORTED }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Axiom:&lt;/strong&gt; once a workflow is defined, every unit-work in that workflow occupies a STATUS. STATE is evaluable when a unit-work reaches a terminal status — that is, when STATUS ∈ {COMPLETED, ABORTED}. A completed unit-work is evaluated against its contract. An aborted unit-work is FAILURE by definition — its contract was never fulfilled.&lt;/p&gt;

&lt;p&gt;Now we have two orthogonal tracking dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STATUS&lt;/strong&gt; tracks &lt;em&gt;where&lt;/em&gt; each unit-work is in the lifecycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STATE&lt;/strong&gt; tracks &lt;em&gt;whether&lt;/em&gt; each completed or aborted unit-work met its contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they make the entire workflow observable from the edge. The human does not need to watch every step. The human reads STATUS to see progress and STATE to see results. That is human-at-the-edge.&lt;/p&gt;
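&lt;p&gt;The two dimensions can be modeled together in a short sketch (the class is an illustrative model; the STATUS values and the terminal rule follow the axiom above):&lt;/p&gt;

```python
# Sketch: STATUS (lifecycle position) and STATE (contract verdict)
# as two orthogonal fields on a unit-work. Illustrative model only.

STATUSES = {"PENDING", "IN_PROGRESS", "COMPLETED", "BLOCKED", "ABORTED"}
TERMINAL = {"COMPLETED", "ABORTED"}

class UnitWork:
    def __init__(self, name, contract_holds):
        self.name = name
        self.contract_holds = contract_holds  # callable: did output meet F: A -> B | P?
        self.status = "PENDING"

    def state(self):
        """STATE is evaluable only at a terminal STATUS."""
        if self.status not in TERMINAL:
            return None                       # not yet evaluable
        if self.status == "ABORTED":
            return "FAILURE"                  # contract never fulfilled
        return "SUCCESS" if self.contract_holds() else "FAILURE"

work = UnitWork("deploy-staging", contract_holds=lambda: True)
assert work.state() is None        # PENDING: nothing to evaluate yet
work.status = "COMPLETED"
assert work.state() == "SUCCESS"   # terminal and aligned with contract
```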

&lt;p&gt;I applied this model across 130+ real work plans in production. The full lifecycle — from planning through execution to verification and archival — is governed by the same 12 skills throughout.&lt;/p&gt;




&lt;h2&gt;
  
  
  How TPMN compares to existing frameworks
&lt;/h2&gt;

&lt;p&gt;Three frameworks have emerged in 2026 to address skill reliability. Each solves part of the problem. None solves all of it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/a&gt; (150K+ stars) provides 14 composable skills with a master dispatcher and a "1% chance = must invoke" enforcement rule. It covers brainstorming, planning, TDD, debugging, code review, and verification. Its &lt;a href="https://github.com/obra/superpowers/blob/main/skills/verification-before-completion/SKILL.md" rel="noopener noreferrer"&gt;verification-before-completion&lt;/a&gt; skill enforces "evidence before claims" — run the command, read the output, then claim the result. Strong methodology. But it is a software development workflow — it does not generalize beyond SDLC, and its skills are prose instructions, not typed contracts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://benjamin-abt.com/blog/2026/02/12/agent-skills-standard-github-copilot/" rel="noopener noreferrer"&gt;&lt;strong&gt;Agent Skills Standard&lt;/strong&gt;&lt;/a&gt; (Benjamin Abt, Feb 2026) defines a five-component "quality contract" — scope, decision logic, constraints, output contract, quality gates. The closest to TPMN's philosophy: one skill = one bounded scenario with explicit non-goals. But it is an authoring standard for individual skills, not lifecycle orchestration. There are no lifecycle primitives, no STATE/STATUS tracking, and no mechanism for proven patterns to inform future work.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://hackernoon.com/how-to-build-a-governance-layer-for-claude-code-with-hooks-skills-and-agents" rel="noopener noreferrer"&gt;&lt;strong&gt;hook-driven governance layer&lt;/strong&gt;&lt;/a&gt; (HackerNoon, Mar 2026) attacks the trigger problem directly: a &lt;code&gt;UserPromptSubmit&lt;/code&gt; hook forces Claude to evaluate every skill before responding, raising activation from ~25% to 90%+. Effective enforcement. But it governs &lt;em&gt;whether&lt;/em&gt; skills fire, not &lt;em&gt;what happens after they do&lt;/em&gt;. No planning, no verification, no archival. And it is locked to Claude Code's hook system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/gem-squared/tpmn-skill" rel="noopener noreferrer"&gt;TPMN Skill Standard&lt;/a&gt;&lt;/strong&gt; (open spec, MIT-licensed) takes a different approach. Instead of enforcing trigger reliability or standardizing skill authoring, TPMN provides a governance layer that sits above all skills. Every unit of work is a typed contract (&lt;code&gt;F: A → B | P&lt;/code&gt;) with algebraic STATE (SUCCESS/FAILURE) and lifecycle STATUS (PENDING → IN_PROGRESS → COMPLETED). Verification is structural — the AI checks its own output against the contract, not against a prose checklist. And the cycle closes: proven contracts are archived, indexed, and retrieved by &lt;code&gt;/search-kg&lt;/code&gt; to inform the next &lt;code&gt;/plan-work&lt;/code&gt;. The 12 lifecycle skills are domain-agnostic — they govern legal review, content production, and data analysis the same way they govern software development.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Superpowers&lt;/th&gt;
&lt;th&gt;Agent Skills Standard&lt;/th&gt;
&lt;th&gt;Hook Governance&lt;/th&gt;
&lt;th&gt;TPMN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Typed contracts&lt;/strong&gt; (A→B|P)&lt;/td&gt;
&lt;td&gt;No — prose instructions&lt;/td&gt;
&lt;td&gt;Partial — "output contract" in prose&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes — algebraic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Lifecycle orchestration&lt;/strong&gt; (plan→execute→verify→archive)&lt;/td&gt;
&lt;td&gt;Partial — plan + verify, no archive&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes — 12 skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Knowledge accumulation&lt;/strong&gt; (proven patterns feed future work)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes — /archive-work → /search-kg → /plan-work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STATE/STATUS tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes — binary STATE + lifecycle STATUS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verification model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Run the command, read the output"&lt;/td&gt;
&lt;td&gt;Quality gates (prose checklist)&lt;/td&gt;
&lt;td&gt;N/A (enforcement only)&lt;/td&gt;
&lt;td&gt;Contract type-checking (A, B, P alignment)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Domain scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SDLC only&lt;/td&gt;
&lt;td&gt;SDLC only&lt;/td&gt;
&lt;td&gt;SDLC only&lt;/td&gt;
&lt;td&gt;Any domain — lifecycle is domain-agnostic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-platform (Claude, Codex, Cursor, Gemini)&lt;/td&gt;
&lt;td&gt;GitHub Copilot&lt;/td&gt;
&lt;td&gt;Claude Code only&lt;/td&gt;
&lt;td&gt;Platform-agnostic (any AI agent that reads markdown)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skill count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14 (development methodology)&lt;/td&gt;
&lt;td&gt;Per-project (authoring standard)&lt;/td&gt;
&lt;td&gt;26 (domain-specific)&lt;/td&gt;
&lt;td&gt;12 (lifecycle primitives) + any domain skills&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every framework above targets software development. TPMN's lifecycle — plan, execute, verify, archive — is not about code. It is about &lt;em&gt;any&lt;/em&gt; workflow where a human delegates work to AI: legal review, content production, data analysis, research synthesis, compliance audits. The 12 skills govern the lifecycle of work itself, not the domain the work belongs to.&lt;/p&gt;

&lt;p&gt;The gap is not in any single capability. Superpowers has strong verification. The Agent Skills Standard has strong skill structure. The hook layer has strong enforcement. The gap is that nobody combines all three: &lt;strong&gt;(1)&lt;/strong&gt; algebraic contracts that make verification structural, &lt;strong&gt;(2)&lt;/strong&gt; lifecycle orchestration that governs work from plan to archive, and &lt;strong&gt;(3)&lt;/strong&gt; knowledge accumulation where proven patterns compound across projects. Each framework covers one piece. None closes the full loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The origin of TPMN
&lt;/h2&gt;

&lt;p&gt;How do we &lt;strong&gt;describe&lt;/strong&gt; flow-logic deterministically?&lt;/p&gt;

&lt;p&gt;My solution draws from four sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLA+:&lt;/strong&gt; Leslie Lamport's temporal logic for specifying concurrent systems — exactly what multi-step AI workflows are. For deterministic logical flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Panini:&lt;/strong&gt; The ancient Sanskrit grammarian who solved semantic subject-disambiguation 2,500 years ago — the exact ambiguity problem that makes prose skills fail. For removing semantic ambiguity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mathematical notation:&lt;/strong&gt; For programmatic decisions and formal constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural language comments:&lt;/strong&gt; To carry the subjective meaning that formal notation cannot express alone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the origin of TPMN — an &lt;strong&gt;Algebraic Logical Language&lt;/strong&gt; for AI skill specification.&lt;/p&gt;

&lt;p&gt;Why does formal notation matter more than prose? Because context compaction destroys prose instructions while formal notation survives intact — a measured 5.3x density advantage that determines whether your skills keep working in long sessions. Prose constraints dilute under compaction. Architecture boundaries drift between sessions. Neither failure is detectable. That is silent scope decay — and TPMN is the cure.&lt;/p&gt;
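&lt;p&gt;A concrete illustration, using a precondition invented for the example: the prose rule "when the user supplies a work item and a project slug, begin planning" can compact to "plan the work" with no trace, while the algebraic form resists summarization:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P ≜ work ≠ ⊥ ∧ project_slug ≠ ⊥   \* drop either conjunct and the formula visibly breaks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;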




&lt;h2&gt;
  
  
  TPMN extends to two systems
&lt;/h2&gt;

&lt;p&gt;From this algebraic foundation, TPMN extends into two systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;TPMN — Truth-Provenance Markup Notation&lt;/a&gt;&lt;/strong&gt; — an open specification language for structuring and auditing AI reasoning. Platform-agnostic — treats LLMs as black boxes. Within TPMN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TPMN-PSL&lt;/strong&gt; (Prompt Specification Language) is the formal grammar that compiles natural-language prompts into MANDATE — computable, verifiable specifications. Three-phase protocol: P-phase (prompt → MANDATE), Inline (epistemic tagging), O-phase (output verification → truth_score).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPMN-checker&lt;/strong&gt; is the reference implementation — a Sovereign AI Service (SAS), a microservice exclusively owned and controlled by a dedicated AI actor.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/gem-squared/tpmn-skill" rel="noopener noreferrer"&gt;TPMN SKILL STANDARD&lt;/a&gt;&lt;/strong&gt; — a contract-driven AI workflow lifecycle management framework. The full specification for defining, executing, verifying, and governing AI skills. This includes the 5 structural rules that keep skills within the reliability zone, and 12 core skills that orchestrate the lifecycle for any workflow.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
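&lt;p&gt;Sketched end to end (the phase names are from the spec; the example flow is illustrative), the TPMN-PSL protocol looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P-phase:  natural-language prompt          →  MANDATE (computable, verifiable spec)
Inline:   each generated claim is tagged with its epistemic status
O-phase:  output verified against MANDATE  →  truth_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;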

&lt;p&gt;Each of these gets its own post. This one is the foundation — the algebraic origin story that everything else builds on.&lt;/p&gt;




&lt;h2&gt;
  
  
  GEM² — the name
&lt;/h2&gt;

&lt;p&gt;My company is named GEM². The acronym is the whole philosophy of this post compressed into six letters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GEM²_Definition&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;≜&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;acronym:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s2"&gt;"Grounded Existence Matrix for Global Entropy Minimum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;expansion:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;Grounded:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"Every A is grounded by contract — no hidden input"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;Existence:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Every B is verifiable by evidence — no claimed output"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;Matrix:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;"A, B, P are the three axes every unit-work is computed on"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;Global:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;"Workflow is the summation of connected unit-contracts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;Entropy:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s2"&gt;"Disorder = the gap between contract and evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;Minimum:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s2"&gt;"SUCCESS is the minimum-entropy state — contract and evidence align"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;David Seo — &lt;a href="https://gemsquared.ai/tpmn-skill?utm_source=devto&amp;amp;utm_medium=post-analysis&amp;amp;utm_campaign=tpmn-skill" rel="noopener noreferrer"&gt;GEM².AI&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cloudskills</category>
      <category>claudecode</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Claude Skills Fail Silently. Here Is My Solution.</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Thu, 30 Apr 2026 09:09:40 +0000</pubDate>
      <link>https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe</link>
      <guid>https://dev.to/gemsquared/claude-skills-fail-silently-here-is-my-solution-jpe</guid>
      <description>&lt;p&gt;"Three structural wounds that prose skills cannot fix. 12 lifecycle skills that do. Install in 30 seconds."&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoog8gv6qudeud4yba4g.gif" alt=" " width="720" height="405"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The TPMN Skill Standard v4 is MIT-licensed. Install the core skills into any project.&lt;br&gt;
Full spec: &lt;a href="https://github.com/gem-squared/tpmn-skill" rel="noopener noreferrer"&gt;TPMN Skill Standard v4 on GitHub&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;npx @gem_squared/tpmn-skill-install

&lt;span class="nv"&gt;$ &lt;/span&gt;claude &lt;span class="nt"&gt;--permission-mode&lt;/span&gt; auto

/init-session       &lt;span class="c"&gt;# bootstrap — must be first&lt;/span&gt;
/plan-work          &lt;span class="c"&gt;# decompose into contracted units&lt;/span&gt;
/proceed-work       &lt;span class="c"&gt;# execute one unit&lt;/span&gt;
/verify-work        &lt;span class="c"&gt;# autonomously invoked to verify against contract — SUCCESS or FAILURE&lt;/span&gt;
/archive-work       &lt;span class="c"&gt;# proven. searchable. compounds.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;I've been using SKILLs every day, with almost zero failure. That's not because I got lucky. It's because I stopped writing them as prose — and I built 12 lifecycle skills that orchestrate all the others, the way Unix commands orchestrate any pipeline.&lt;/p&gt;

&lt;p&gt;The problem is not SKILL itself — it is the lack of linguistic precision, and the failure to select the correct skill for a specific situation.&lt;/p&gt;



&lt;p&gt;Your skills break and you never know.&lt;/p&gt;

&lt;p&gt;Three independent studies found the same thing: &lt;a href="https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals" rel="noopener noreferrer"&gt;56% non-invocation&lt;/a&gt; (Vercel), &lt;a href="https://perevillega.com/posts/2026-04-01-claude-code-skills-2-what-changed-what-works-what-to-watch-out-for/" rel="noopener noreferrer"&gt;50% success rate&lt;/a&gt; (Villega), &lt;a href="https://medium.com/@ivan.seleznov1/why-claude-code-skills-dont-activate-and-how-to-fix-it-86f679409af1" rel="noopener noreferrer"&gt;77% activation&lt;/a&gt; (Seleznov, n=650). No error. No warning. No log. Claude proceeds without the skill and the output looks plausible.&lt;/p&gt;

&lt;p&gt;That is only one of three wounds.&lt;/p&gt;


&lt;h2&gt;
  
  
  The three wounds
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Silent Scope Decay&lt;/strong&gt; — your prose constraints dilute under context compaction. Your architecture boundaries drift between sessions. Neither failure is detectable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Judgment Theater&lt;/strong&gt; — you review AI output but have no contract to check against. The review feels real. It is structurally empty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Trigger Collision&lt;/strong&gt; — the wrong skill fires, or no skill fires at all, and the failure is unobservable. More skills, more ambiguity, more compounding.&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo" class="crayons-story__hidden-navigation-link"&gt;Three Wounds That Prose Skills Cannot Fix — The Full Analysis&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/gemsquared" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" alt="gemsquared profile" class="crayons-avatar__image" width="460" height="460"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/gemsquared" class="crayons-story__secondary fw-medium m:hidden"&gt;
              GEM² Inc.
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                GEM² Inc.
                
              
              &lt;div id="story-author-preview-content-3590822" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/gemsquared" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847788%2F875e1f46-2e5a-4b42-b2e3-ca729d4fe1f8.jpeg" class="crayons-avatar__image" alt="" width="460" height="460"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;GEM² Inc.&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 30&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo" id="article-link-3590822"&gt;
          Three Wounds That Prose Skills Cannot Fix — The Full Analysis
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cloudskills"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cloudskills&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/claudecode"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;claudecode&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devtools"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devtools&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/gemsquared/three-wounds-that-prose-skills-cannot-fix-the-full-analysis-peo#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            14 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;





&lt;h2&gt;
  
  
  What I did
&lt;/h2&gt;

&lt;p&gt;I stopped writing skills as prose. I built 12 lifecycle skills that orchestrate all the others — the way Unix commands orchestrate any pipeline.&lt;/p&gt;

&lt;p&gt;The skills don't encode domain knowledge. They govern the lifecycle of work itself: plan it, execute it, verify it against a typed contract, archive the proven result. The domain knowledge lives in contracts, not skills. The contracts compound. The skills stay fixed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plan-work → /proceed-work → /verify-work → /archive-work
     ^                                              |
     +──── /search-kg (proven patterns feed back) ──+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every verified contract becomes searchable prior art for the next session. You never start from zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  Taste it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i @gem_squared/tpmn-skill-install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The 12 skills&lt;/strong&gt;. MIT-licensed. Works with Claude Code out of the box. No infrastructure. No server. Git + filesystem.&lt;/p&gt;

&lt;p&gt;From GitHub: &lt;a href="https://github.com/gem-squared/tpmn-skill" rel="noopener noreferrer"&gt;github.com/gem-squared/tpmn-skill&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the 12 skills do
&lt;/h2&gt;

&lt;p&gt;They don't encode domain knowledge. They govern the lifecycle of work itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ claude --permission-mode auto

/search-kg          Search proven patterns from prior work
     |
/plan-work          Write CONTRACTs (A → B | P) for each unit
     |
/proceed-work       Execute one unit, verify inline, retry on failure
     |              (repeat for each unit)
     |
/verify-work        Verify Result vs CONTRACT.B (per-unit or batch)
     |
/archive-work       Move to archive/, git commit, compounds.
     |
/extract-skill      Convert proven contracts into reusable skills (optional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every unit of work has a &lt;strong&gt;CONTRACT&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; — input state: what exists before the work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; — output state: what must exist after (always a state, never an action)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt; — precondition: what must be true to start&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clarity %&lt;/strong&gt; — how well-defined the scope is (0–100)&lt;/li&gt;
&lt;/ul&gt;
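&lt;p&gt;A filled-in contract might look like this (the values are illustrative, not taken from a real work plan):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CONTRACT ≜ [
  A:       "login endpoint has no rate limiting",
  B:       "rate limiter active on /login, regression tests green",   \* a state, never an action
  P:       "test suite green ∧ redis reachable",
  Clarity: 85%
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;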

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/init-session&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bootstrap project files, detect layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/check-session&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read-only status report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/search-kg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search proven contracts in archive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/search-skill&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Discover installed + archived skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/plan-work&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Decompose work into 1–9 contracted units&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/proceed-work&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Execute one unit, verify inline, retry on failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/update-work-plan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Add, modify, abort, reorder PENDING units&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/verify-work&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Verify Results against CONTRACTs (per-unit or batch)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/extract-skill&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Convert a proven work plan (WP) into a reusable skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/skill-to-kg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sweep non-core skills to archive, restore specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/archive-work&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Finalize WP, git commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/end-session&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Commit session state for recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The lifecycle stays fixed. The knowledge compounds. That is the inversion: stop growing your skill library, grow your contract archive.&lt;/p&gt;




&lt;h2&gt;
  
  
  How prose fails — and why contracts don't
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Silent Scope Decay.&lt;/strong&gt; Claude compacts long context by summarizing. &lt;code&gt;"When preconditions hold, plan the work..."&lt;/code&gt; compacts to &lt;code&gt;"Plans work"&lt;/code&gt; — constraints gone, silently. TPMN uses algebraic notation: &lt;code&gt;P ≜ work ≠ ⊥ ∧ project_slug ≠ ⊥&lt;/code&gt;. Remove a conjunct and the formula visibly breaks. Breakage is detectable, not silent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judgment Theater.&lt;/strong&gt; "Does it look right?" is not verification. &lt;code&gt;/verify-work&lt;/code&gt; runs three deterministic checks every time: field coverage (did B produce every required field?), type conformance, constraint satisfaction. &lt;code&gt;STATE = SUCCESS&lt;/code&gt; is structural. Not an opinion.&lt;/p&gt;
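&lt;p&gt;The three checks, sketched as pseudocode (the notation paraphrases the description above; it is not the literal implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;verify(result, CONTRACT) ≜
  field_coverage:    ∀ f ∈ fields(CONTRACT.B) : f ∈ result
  type_conformance:  ∀ f ∈ result : type(f) = type(CONTRACT.B.f)
  constraints:       ∀ c ∈ constraints(CONTRACT.B) : c(result) = TRUE

  STATE ≜ IF all three checks hold THEN SUCCESS ELSE FAILURE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;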

&lt;p&gt;&lt;strong&gt;Trigger Collision.&lt;/strong&gt; 900,000+ skills in the ecosystem. Even in your private project, the same dynamic applies: more skills, more ambiguity, more compounding failure. TPMN solves this with two inversions — lifecycle orchestration (12 fixed skills, deterministic selection) and knowledge compounding (every verified contract becomes searchable prior art via &lt;code&gt;/search-kg&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  The full analysis
&lt;/h2&gt;

&lt;p&gt;Each wound has a structural proof — why prose fails, why contracts don't, and how 130+ real work plans validated the model. Read the deep dive:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="//post-1-three-wounds-and-solution.md"&gt;Three Wounds That Prose Skills Cannot Fix — The Full Analysis&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The spec is public. CC-BY-4.0. The proof is verifiable. The rest is yours.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;David Seo — &lt;a href="https://gemsquared.ai/tpmn-skill?utm_source=devto&amp;amp;utm_medium=post-agitator&amp;amp;utm_campaign=tpmn-skill" rel="noopener noreferrer"&gt;GEM².AI&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>cloudskills</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Same Prompt. Different Answers Every Time. Here's How I Fixed It.</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Fri, 03 Apr 2026 00:12:20 +0000</pubDate>
      <link>https://dev.to/gemsquared/same-prompt-different-answers-every-time-heres-how-i-fixed-it-1ce1</link>
      <guid>https://dev.to/gemsquared/same-prompt-different-answers-every-time-heres-how-i-fixed-it-1ce1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of our AI verification series.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Part 1: Three AIs analyzed our product. None passed the truth filter →&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j"&gt;Part 2: Human in the loop doesn't scale. Human at the edge does. →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Same prompt. Same AI. Different sessions. Different outputs.
&lt;/h2&gt;

&lt;p&gt;Post 1 showed three &lt;em&gt;different&lt;/em&gt; AIs diverging on the same question.&lt;/p&gt;

&lt;p&gt;That's expected. Different training, different weights, different answers.&lt;/p&gt;

&lt;p&gt;But we didn't stop there. We re-ran the same AI on the same prompt in a new session.&lt;/p&gt;

&lt;p&gt;We got materially different outputs again.&lt;/p&gt;

&lt;p&gt;Both looked authoritative. Neither warned us they disagreed with each other.&lt;/p&gt;


&lt;h2&gt;
  
  
  What the same AI said twice
&lt;/h2&gt;

&lt;p&gt;Prompt: &lt;em&gt;"Forecast Korea's AI industry in 2027."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Session 1 produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Market size: &lt;strong&gt;$10–15B at &amp;gt;25% CAGR&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Global positioning: &lt;strong&gt;"Global AI G3 powerhouse"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Hardware claim: &lt;strong&gt;"All Korean electronics AI-native by 2027"&lt;/strong&gt; — sourced to a single company's roadmap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Session 2 produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Market size: &lt;strong&gt;KRW 4.46T (~$3.3B) at 14.3% CAGR&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Global positioning: &lt;strong&gt;"Top three AI powers"&lt;/strong&gt; — framed as government target&lt;/li&gt;
&lt;li&gt;No hardware claim at all&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same prompt. Same AI. Different session. &lt;strong&gt;A 4× market size gap. No flags from either run.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a hallucination. Both outputs were internally coherent. Both read like credible analyst reports. The problem is deeper than hallucination.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this happens: AI inference is non-deterministic
&lt;/h2&gt;

&lt;p&gt;We spent months trying to fix output drift with better prompts, more context, stricter instructions.&lt;/p&gt;

&lt;p&gt;It didn't work.&lt;/p&gt;

&lt;p&gt;Because the issue isn't the prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI is optimized to sound right.
Not to prove itself.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we call "hallucination" is mostly &lt;strong&gt;context drift&lt;/strong&gt; — the model's plausibility engine filling gaps differently depending on what's salient in a given session. Different day, different sampling, different emphasis in the context window — different output. Same confidence posture throughout.&lt;/p&gt;

&lt;p&gt;You can't prompt your way out of a non-deterministic system. You need verification as a separate step.&lt;/p&gt;




&lt;h2&gt;
  
  
  The truth filter didn't just score. It fingerprinted.
&lt;/h2&gt;

&lt;p&gt;We ran both sessions through &lt;code&gt;gem2_truth_filter&lt;/code&gt; — not to get a number, but to understand &lt;em&gt;why&lt;/em&gt; the outputs diverged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session 1 (avg 35%):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Key violation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;24%&lt;/td&gt;
&lt;td&gt;L→G: "Global AI G3 — no index cited"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;td&gt;Δe→∫de: single company → industry-wide claim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;59%&lt;/td&gt;
&lt;td&gt;S→T: current AI strength = permanent identity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Session 2 (avg 43%):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Key violation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;td&gt;S→T: past-tense framing of future events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;32%&lt;/td&gt;
&lt;td&gt;Source attribution FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;td&gt;Scope mixing — 2033 CAGR back-extrapolated to 2027&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure types were different. Session 1 overclaimed about Korea's global position. Session 2 failed on temporal framing and citations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same prompt. Different inference paths. Different failure signatures.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the key finding: &lt;strong&gt;AI output drift is not random. It's traceable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The filter names the exact reasoning pattern that produced the problem. L→G (local to global), S→T (snapshot to trend), Δe→∫de (thin evidence to broad claim). Named patterns mean auditable drift. Auditable drift means fixable systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Korea AI forecasting is a harder grounding task than product analysis — fewer citable sources, more projection-dependent claims. That's why baseline scores here are lower than the results in Part 1. Same tool, same logic — harder domain.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  We stopped trying to fix the output. We fixed the conditions.
&lt;/h2&gt;

&lt;p&gt;This is the shift Post 2 described philosophically. Here's what it looks like in practice.&lt;/p&gt;

&lt;p&gt;We didn't rewrite the prompt ourselves. We asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Create a grounded replacement contract prompt using gem2 tools."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One command. The system generated a formal contract — input/output types, invariants, prohibited patterns, confidence requirements. We reviewed it. We approved it. Then we ran the same AI with the contract enforced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session 2, contract-compliant (R2):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;+38 points. Same AI. Same question. Different structural constraints.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The contract doesn't make the AI smarter. It makes the AI's output auditable against a defined standard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Then the human intervened. Once.
&lt;/h2&gt;

&lt;p&gt;81% — but the output read like a legal document. Every claim cited, scoped, hedged. Epistemically reliable. Practically unreadable.&lt;/p&gt;

&lt;p&gt;One instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Soften the tone. Don't reintroduce any claims the truth filter removed."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Session 2, softened (R3):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Down 6 points. More readable. Still grounded.&lt;/p&gt;

&lt;p&gt;We chose 75%. Not because it's better than 81%. Because &lt;strong&gt;75% is the right trade-off&lt;/strong&gt; — readable enough to share, grounded enough to trust. We submitted 75% to gem2 calibration as our standard for narrative AI forecasts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human reads the audit.
Human decides the trade-off.
Human defines the standard.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not reviewing every line. Not trusting blindly. &lt;strong&gt;Deciding at the right moment.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the full arc looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session 1 (no filter)   →  35% avg
Session 2 (no filter)   →  43% avg
Contract applied (R2)   →  81% avg
Human softened (R3)     →  75% avg  ← our standard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Truth is not the score.
Truth is the pattern of drift.
You define the standard.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The workflow: AI audits AI
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human asks  →  AI executes
AI verifies AI  →  AI fixes AI
Human decides at the edge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verification layer — &lt;code&gt;gem2_truth_filter&lt;/code&gt;, &lt;code&gt;tpmn_contract_writer&lt;/code&gt;, the composer — runs between generation and delivery. The human sees the audit result, decides the acceptable trade-off, sets the calibration standard.&lt;/p&gt;

&lt;p&gt;Human-in-the-loop means the human is the bottleneck — every output passes through before it ships. That doesn't scale. Human-at-the-edge means you define "acceptable" once, and the system enforces it automatically. You intervene only when a genuine judgment call is required — like choosing 75% over 81%.&lt;/p&gt;
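&lt;p&gt;The enforcement step is small enough to sketch. This hypothetical gate assumes a single numeric truth score and the 0.75 standard chosen above; the real service's interface may differ:&lt;/p&gt;

```python
def gate(truth_score, standard=0.75):
    # The human defines 'standard' once; the system enforces it on
    # every output. Only failures escalate to a person.
    return "ship" if truth_score >= standard else "escalate"

# Scores from the arc above: R2 Gemini, Session 2 baseline, R2 average
decisions = [gate(s) for s in (0.98, 0.43, 0.81)]
print(decisions)
```

&lt;p&gt;Ninety-nine routine outputs ship unattended; the one that trips the standard is the one a human actually sees.&lt;/p&gt;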




&lt;h2&gt;
  
  
  TPMN is not a checker
&lt;/h2&gt;

&lt;p&gt;TPMN is not a validator, a linter, or a hallucination detector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TPMN is an epistemic gauge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows what's grounded, what's inferred, what's extrapolated. It fingerprints &lt;em&gt;why&lt;/em&gt; outputs differ across sessions. It generates the contracts that stabilize structure. It collects human calibration signals and turns them into a standard.&lt;/p&gt;

&lt;p&gt;It doesn't decide. &lt;strong&gt;You do.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're calling the full suite &lt;strong&gt;GEM2 Epistemic Studio&lt;/strong&gt; — 15 tools across four functional groups: analysis, contract authoring, calibration, and execution. TPMN Checker is one group inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it on your own output
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Paste any AI output into your conversation.&lt;/li&gt;
&lt;li&gt;Ask: &lt;em&gt;"Verify this by gem2 truth filter."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Read the score. See what's grounded vs extrapolated.&lt;/li&gt;
&lt;li&gt;Ask: &lt;em&gt;"Create a grounded replacement prompt using gem2 contract writer."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Run it again. Watch the difference.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your AI picks the right tool from 15 available MCP tools automatically. No configuration. No TPMN knowledge required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal isn't a higher score. It's a score you understand and a standard you chose.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Try it free at gemsquared.ai&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes after prompting
&lt;/h2&gt;

&lt;p&gt;The industry is still in the prompting era. Better prompts, longer context, chain-of-thought — all useful, all insufficient.&lt;/p&gt;

&lt;p&gt;The next step isn't better prompting. It's verification as infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI generates.
AI verifies.
AI refines.
Human decides at the edge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We didn't make AI smarter. We made it accountable.&lt;/p&gt;

&lt;p&gt;That's measurable: 35% → 75% on the same task, with the same AI, using nothing but a formal contract and one human judgment call.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;GEM2 Epistemic Studio — 15 tools, 6 domains, 3 providers. Free to start.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://gemsquared.ai/about" rel="noopener noreferrer"&gt;Inseok Seo (David)&lt;/a&gt; — GEM²-AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai" rel="noopener noreferrer"&gt;gemsquared.ai&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN-PSL Specification&lt;/a&gt; (open, CC-BY 4.0)&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Part 1: Three AIs analyzed our product&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j"&gt;Part 2: Human at the edge&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Human in the loop doesn't scale. Human at the edge does.</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Mon, 30 Mar 2026 06:23:27 +0000</pubDate>
      <link>https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j</link>
      <guid>https://dev.to/gemsquared/human-in-the-loop-doesnt-scale-human-at-the-edge-does-11j</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of our AI verification series. &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Part 1: We truth-filtered our own AI research →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  AI is not unreliable. AI has a plausibility complex.
&lt;/h2&gt;

&lt;p&gt;Stop blaming AI for hallucinating. Start asking why it happens.&lt;/p&gt;

&lt;p&gt;AI doesn't fail because it's wrong. &lt;strong&gt;In our experience, it fails because it's optimized to sound right.&lt;/strong&gt; Major LLMs are trained to produce responses that satisfy humans — fluent, confident, structured. That's plausibility. It's not the same as honesty.&lt;/p&gt;

&lt;p&gt;We call this the &lt;strong&gt;plausibility complex&lt;/strong&gt;: the tendency we've observed across Claude, ChatGPT, and Gemini to produce answers that satisfy rather than answers that prove themselves. If you want AI to become a reliable engineering partner, you need to free AI from this complex — not by changing how it generates, but by changing how it's held accountable.&lt;/p&gt;

&lt;p&gt;After 20 months of building production systems with AI — shipping real code, generating real reports, running real analysis through Claude, ChatGPT, and Gemini — we've arrived at one conclusion:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI often knows more than it reveals. But it's optimized to produce plausible answers, even when the evidence is weak.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bias is consistent across tasks. Ask for a market analysis and you get precise numbers. Ask for a forecast and you get confident projections. Ask for a technical assessment and you get authoritative claims.&lt;/p&gt;

&lt;p&gt;The output looks right. Reads right. Feels right.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;our experiment&lt;/a&gt;, three AI providers wrote research reports about our own product. All three scored above 0.70 on logical consistency. All three scored below 0.30 on source attribution. &lt;strong&gt;The reasoning was coherent. The evidence was missing.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Hallucination is not a bug to fix
&lt;/h2&gt;

&lt;p&gt;The industry treats hallucination as a defect — something to patch, filter, or suppress. We see it differently.&lt;/p&gt;

&lt;p&gt;In our experience building long-running AI development workflows, the pattern that causes the most damage isn't random fabrication. It's &lt;strong&gt;context drift&lt;/strong&gt; — what happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long context windows accumulate similar topics in different framings&lt;/li&gt;
&lt;li&gt;Cross-session persistence forces repeated summarization, losing nuance each time&lt;/li&gt;
&lt;li&gt;Dense context makes adjacent-but-different concepts blur together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've tried every mitigation: RAG, CLAUDE.md configuration files, context caching, careful prompt engineering. Each helps. None solves it completely.&lt;/p&gt;

&lt;p&gt;Why? Because we can't control what happens inside the model's reasoning process. We can shape the input. We can evaluate the output. But the inference itself is opaque.&lt;/p&gt;

&lt;p&gt;This isn't a criticism — it's an observation. And it led us to a different question.&lt;/p&gt;




&lt;h2&gt;
  
  
  What if AI could flag its own uncertainty?
&lt;/h2&gt;

&lt;p&gt;Here's what we discovered through months of experimentation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When we explicitly asked AI to concentrate on epistemic reasoning — to classify each claim as grounded, inferred, or extrapolated — it did.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not perfectly. Not consistently across sessions. But measurably better than when we didn't ask.&lt;/p&gt;

&lt;p&gt;The evidence from &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;our dogfooding experiment&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Without epistemic constraints&lt;/th&gt;
&lt;th&gt;With TPMN-grounded prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;18% truth score&lt;/td&gt;
&lt;td&gt;77% truth score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;28% truth score&lt;/td&gt;
&lt;td&gt;~48% truth score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;12% truth score&lt;/td&gt;
&lt;td&gt;~35% truth score&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same task. Same providers. The only difference: a formal specification that told the AI to tag its own confidence level and flag claims it couldn't trace to evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI didn't become smarter. It became more honest about what it didn't know.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's what freeing AI from the plausibility complex looks like in practice: not changing the model, but giving it a formal reason to be honest.&lt;/p&gt;




&lt;h2&gt;
  
  
  But here's the catch: same AI, same session, limited honesty
&lt;/h2&gt;

&lt;p&gt;An AI that generates an answer and then critiques that answer in the same session has a structural problem: it's trained to be plausible. Asking it to undermine its own plausibility is asking it to work against its training signal.&lt;/p&gt;

&lt;p&gt;We observed this directly. When we asked AI to generate a report AND verify it in the same conversation, the verification was consistently softer than when a separate AI session performed the audit.&lt;/p&gt;

&lt;p&gt;This is why &lt;strong&gt;TPMN Checker is a separate service, not a prompt technique.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompting tries to change AI's behavior. Verification changes AI's accountability. Different problem, different solution.&lt;/p&gt;

&lt;p&gt;The checker runs as an isolated &lt;a href="https://gemsquared.ai/platform" rel="noopener noreferrer"&gt;Sovereign AI Service&lt;/a&gt; — a dedicated AI agent with one job: audit other AI output against a formal specification. It doesn't know what the original AI "intended." It only sees the output and the contract. It judges the result, not the process.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Kantian insight
&lt;/h2&gt;

&lt;p&gt;We can't see inside the model. We don't know which weights fired, which attention heads activated, which training examples influenced a particular token. Even the service providers — Anthropic, OpenAI, Google — face this challenge with their own models.&lt;/p&gt;

&lt;p&gt;But we don't need to see inside.&lt;/p&gt;

&lt;p&gt;We can judge the output. We can compare claims against evidence. We can detect when reasoning exceeds its basis. We can flag patterns that indicate drift.&lt;/p&gt;

&lt;p&gt;This is what philosophers call the phenomenal approach: &lt;strong&gt;judge what appears, not what causes it.&lt;/strong&gt; We can't read AI's mind. But we can read its work. And we can hold it to a standard.&lt;/p&gt;

&lt;p&gt;That standard is &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN&lt;/a&gt; — a notation with three prohibited reasoning patterns and seven evaluation dimensions. Not a guess about what the model "should" do. A formal specification of what the output must demonstrate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Human at the edge, not in the loop
&lt;/h2&gt;

&lt;p&gt;If AI is becoming an agent — not just a tool that responds, but a system that acts — then we need an accountability structure that matches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human in the loop&lt;/strong&gt; means: review every output. Approve every action. The human is the bottleneck.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI generates → Human reviews → Human approves → Output ships
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked when AI outputs were occasional. It doesn't work when AI agents produce hundreds of outputs per day. The math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;200 outputs/day × 3 minutes each = 10 hours of review per agent&lt;/li&gt;
&lt;li&gt;10 agents = 100 review-hours per day, roughly 12 full-time reviewers&lt;/li&gt;
&lt;li&gt;50 agents = your "safety net" costs more than the automation saves&lt;/li&gt;
&lt;/ul&gt;
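&lt;p&gt;The arithmetic above, spelled out (the 8-hour reviewer workday is an assumption):&lt;/p&gt;

```python
outputs_per_day = 200
minutes_per_review = 3

# Review load for one agent: 200 outputs at 3 minutes each
hours_per_agent = outputs_per_day * minutes_per_review / 60   # 10.0 hours

# Fleet of 10 agents, reviewers working assumed 8-hour days
agents = 10
total_hours = hours_per_agent * agents                        # 100.0 hours/day
reviewers_needed = total_hours / 8                            # 12.5 FTE

print(hours_per_agent, total_hours, reviewers_needed)
```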

&lt;p&gt;&lt;strong&gt;Human at the edge&lt;/strong&gt; means: define the standard. Let AI enforce it. Review exceptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI generates → AI verifies (TPMN) → Passes? → Ships
                                   → Fails?  → Human reviews
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The human doesn't disappear. The human moves to where they're most effective: &lt;strong&gt;defining what "honest reasoning" looks like, not reading every report.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  This pattern already exists
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Software engineering:&lt;/strong&gt; Code passes through automated tests that humans defined. CI/CD enforces at scale. Humans review when tests fail. &lt;em&gt;But what about AI-generated code itself — before it reaches the test suite?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial compliance:&lt;/strong&gt; Transactions pass through compliance rules that humans wrote. Automated systems flag exceptions. Humans investigate the flags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manufacturing:&lt;/strong&gt; Quality control systems catch defects using standards that humans set. Humans review edge cases and update standards.&lt;/p&gt;

&lt;p&gt;AI output is the next domain where this pattern applies. And for developers specifically, there's an emerging practice pattern that makes this concrete — we'll get to that shortly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The three requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. A formal specification
&lt;/h3&gt;

&lt;p&gt;Not heuristics. Not "does this look right?" A structured notation and grammar for what constitutes honest reasoning.&lt;/p&gt;

&lt;p&gt;Three layers, one verification stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TPMN&lt;/strong&gt; (Truth-Provenance Markup Notation) — the &lt;strong&gt;notation&lt;/strong&gt;. Defines five epistemic claim states (⊢ ⊨ ⊬ ⊥ ?) and three prohibited reasoning patterns (&lt;a href="https://tpmn-psl.gemsquared.ai/#spt" rel="noopener noreferrer"&gt;SPT&lt;/a&gt;: snapshot→trend, local→global, thin→broad). &lt;em&gt;What we mark.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TPMN-PSL&lt;/strong&gt; (Prompt Specification Language) — the &lt;strong&gt;grammar&lt;/strong&gt;. Compiles natural language prompts into verifiable specifications (MANDATEs). Defines the three-phase protocol (pre-flight, inline, post-flight) and three modes (strict, refine, interpolate). &lt;em&gt;How we structure and verify.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TPMN Checker&lt;/strong&gt; — the &lt;strong&gt;implementation&lt;/strong&gt;. A &lt;a href="https://gemsquared.ai/platform" rel="noopener noreferrer"&gt;Sovereign AI Service&lt;/a&gt; that runs the TPMN-PSL pipeline. 12 MCP tools. 6 domains. Returns a truth_score. &lt;em&gt;What you install and use.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analogous to HTTP (notation) → RFC 2616 (specification) → nginx (implementation). TPMN defines the rules. TPMN-PSL structures the protocol. The Checker enforces them.&lt;/p&gt;

&lt;p&gt;Open. CC-BY 4.0. Anyone can implement it.&lt;/p&gt;
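&lt;p&gt;As a toy illustration of tagging claims with the five states, here is a sketch that computes a crude traceability ratio. The glosses for each symbol are conventional logic readings assumed for illustration; the authoritative definitions live in the TPMN-PSL spec:&lt;/p&gt;

```python
# Assumed glosses for the five TPMN claim states (illustrative, not normative)
CLAIM_STATES = {
    "⊢": "grounded (traceable to cited evidence)",
    "⊨": "inferred (follows from grounded claims)",
    "⊬": "not derivable from the given evidence",
    "⊥": "contradicted by the evidence",
    "?": "unknown / unassessed",
}

def traceability_ratio(tagged_claims):
    """Share of claims that are grounded or inferred: a crude score sketch."""
    supported = sum(1 for _, state in tagged_claims if state in ("⊢", "⊨"))
    return supported / len(tagged_claims)

report = [
    ("Market size is ₩4.46T (cited government report)", "⊢"),
    ("Cited growth implies a larger 2027 market",        "⊨"),
    ("Korea will lead the global AI market",             "⊬"),
    ("CAGR is simultaneously 14% and 25%",               "⊥"),
]
print(traceability_ratio(report))
```

&lt;p&gt;Even this toy version makes the point: once claims carry epistemic tags, "how honest is this report" becomes a computation rather than an impression.&lt;/p&gt;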

&lt;h3&gt;
  
  
  2. An isolated verification agent
&lt;/h3&gt;

&lt;p&gt;Not a prompt. Not an inline check. A separate Sovereign AI Service whose only job is auditing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;TPMN Checker&lt;/a&gt; is the reference implementation of TPMN-PSL. It runs as an isolated MCP service — 12 tools, 6 domains, 7 evaluation dimensions. It judges output against contracts. It doesn't generate, advise, or assist. It audits.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Human calibration
&lt;/h3&gt;

&lt;p&gt;If AI grades AI, the grading is circular. The system needs an external standard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gemsquared.ai/community" rel="noopener noreferrer"&gt;Human Ground Truth&lt;/a&gt;. When users disagree with a score, that disagreement becomes calibration data. Humans define what "honest reasoning" means. AI enforces it at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dogfooding: we verified the thesis behind this article
&lt;/h2&gt;

&lt;p&gt;Before writing this post, we wrote down our raw thesis — the unfiltered thinking that drives everything above. Here's the core of it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"All top-level AIs are trained to generate plausible results to satisfy humans. Hallucination is not a bug — it's a structural consequence of context drift. AI itself knows all the decision weights clearly. If we could make AI remind itself of the legitimate MANDATE area, AI could detect and fix results by itself. We validated this through various heuristic experiments over 20 months. No absolute truth score is possible. Human in the loop is nonsense."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then we ran it through &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;gem2_truth_filter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw thesis: 18%.&lt;/strong&gt; Our own tool scored our own thinking at the same level as unverified AI output. It caught three overclaims:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L→G:&lt;/strong&gt; "All AIs are trained for plausibility" → universal claim without citing training documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S→T:&lt;/strong&gt; "Hallucination is structural" → presented as permanent truth without distinguishing error types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Δe→∫de:&lt;/strong&gt; "Validated through experiments" → claimed validation without methodology or data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cross-provider verification of the raw thesis:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;td&gt;0.08 ❌&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.18 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.68 ⚠️&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.22 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extrapolation Risk&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three providers. All failed it. OpenAI was the harshest — 13% with 10 SPT violations. Gemini flagged 95% extrapolation risk.&lt;/p&gt;

&lt;p&gt;We fixed each overclaim. Scoped the claims. Added evidence. Qualified the assertions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-provider verification of the fixed version:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;90%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.28 ❌&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.60 ⚠️&lt;/td&gt;
&lt;td&gt;0.58 ⚠️&lt;/td&gt;
&lt;td&gt;0.95 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;td&gt;0.82 ✅&lt;/td&gt;
&lt;td&gt;0.95 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;td&gt;0.47 ⚠️&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three providers. Three different scores. &lt;strong&gt;But all three agree: the fixed version is dramatically better.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini — the harshest critic of our raw thesis (95% extrapolation risk) — scored the refined version at 90%. Its explanation: &lt;em&gt;"This content demonstrates excellent epistemic hygiene. The author explicitly bounds their claims to their own experience."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The scores differ. The diagnostic direction converges. That's cross-provider consensus in action.&lt;/p&gt;

&lt;p&gt;Our raw thesis overclaimed — just like every unverified AI output. The tool caught it. We fixed it. This article is the refined version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the loop: write → verify → fix → cross-verify → publish.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it on your own output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Paste any AI output into your conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Ask: &lt;em&gt;"Verify this by gem2 truth filter."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Read the score. See what's grounded, what's extrapolated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Ask: &lt;em&gt;"Create a grounded replacement prompt using gem2 contract writer."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5.&lt;/strong&gt; Ask AI to proceed with the new prompt. Watch what you get.&lt;/p&gt;

&lt;p&gt;Your AI picks the right tool from &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;12 available MCP tools&lt;/a&gt; automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it for free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Get started at gemsquared.ai&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next: Contract Coding
&lt;/h2&gt;

&lt;p&gt;If "human at the edge" is the philosophy, what does it look like in practice — for developers writing code every day?&lt;/p&gt;

&lt;p&gt;Three common patterns in AI-assisted coding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt coding   → you guide the model
Vibe coding     → you hope it works
Contract coding → AI defines the spec, AI verifies the output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our next post, we'll show how TPMN Checker's existing tools — &lt;code&gt;tpmn_contract_writer&lt;/code&gt;, &lt;code&gt;tpmn_p_check&lt;/code&gt; (SDLC domain), and &lt;code&gt;tpmn_p_check_compose&lt;/code&gt; — already support a workflow where AI generates formal specifications, produces code against them, and truth-filters the result before you ship.&lt;/p&gt;

&lt;p&gt;Not for plausibility. For epistemic traceability.&lt;/p&gt;
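&lt;p&gt;A minimal sketch of that loop, with stub functions standing in for the tools. The tool names tpmn_contract_writer and tpmn_p_check are real; the stubs, signatures, scores, and control flow here are assumptions, not the shipped pipeline:&lt;/p&gt;

```python
def write_contract(task):
    # stands in for tpmn_contract_writer: AI drafts the formal spec
    return {"task": task, "invariants": ["inputs validated", "errors surfaced"]}

def generate_code(contract):
    # stands in for the coding model producing output against the spec
    return "def handler(x):\n    return x"

def truth_filter(code, contract):
    # stands in for tpmn_p_check (SDLC domain); score is a placeholder
    return {"truth_score": 0.8, "violations": []}

def contract_coding(task, standard=0.75):
    contract = write_contract(task)        # AI defines the spec
    code = generate_code(contract)         # AI codes against it
    audit = truth_filter(code, contract)   # AI verifies the result
    return "ship" if audit["truth_score"] >= standard else "revise"

print(contract_coding("parse config file"))
```

&lt;p&gt;The human's role in this sketch is choosing the standard, not reading the diff.&lt;/p&gt;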

&lt;p&gt;&lt;strong&gt;Next in the series: "Contract Coding at the Edge: what comes after vibe coding" →&lt;/strong&gt; &lt;em&gt;(coming this week)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;📺 &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;Watch: Three AIs. Three Answers. None of them warned you.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📝 &lt;a href="https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki"&gt;Read Post 1: We truth-filtered our own AI research&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN-PSL Specification&lt;/a&gt; (open, CC-BY 4.0)&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://gemsquared.ai" rel="noopener noreferrer"&gt;gemsquared.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TPMN-PSL is an open specification — not a product.&lt;/strong&gt; If you believe AI outputs should be auditable, &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;read the spec&lt;/a&gt;, open an issue, or submit a PR. The standard gets better when more people challenge it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>testing</category>
    </item>
    <item>
      <title>Three AIs analyzed our product. None passed the truth filter.</title>
      <dc:creator>GEM² Inc.</dc:creator>
      <pubDate>Sat, 28 Mar 2026 15:59:12 +0000</pubDate>
      <link>https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki</link>
      <guid>https://dev.to/gemsquared/three-ais-analyzed-our-product-none-passed-the-truth-filter-4gki</guid>
      <description>&lt;p&gt;&lt;strong&gt;What's hiding in your AI output? Now you can see it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We asked three AI providers to research our own product.&lt;br&gt;
Then we ran every output through our own truth filter.&lt;br&gt;
The results surprised us.&lt;/p&gt;

&lt;p&gt;📺 &lt;strong&gt;See how the truth filter works in practice:&lt;/strong&gt; &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;Three AIs. Three Answers. None of them warned you.&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"TPMN Checker is not scoring writing quality. It is scoring epistemic traceability."&lt;/em&gt; — from the video at [0:40]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Korea AI 2027 forecast — what the three AIs reported
&lt;/h3&gt;

&lt;p&gt;We asked each provider the same question: &lt;em&gt;"Forecast Korea's AI industry for 2027."&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Market size (2027E)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₩4.46T (≈$3.3B)&lt;/td&gt;
&lt;td&gt;₩4.46T (≈$3.3B)&lt;/td&gt;
&lt;td&gt;$10–15B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CAGR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14.3%&lt;/td&gt;
&lt;td&gt;~14%&lt;/td&gt;
&lt;td&gt;&amp;gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gov't AI investment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$71.5B&lt;/td&gt;
&lt;td&gt;Ongoing ⚠️&lt;/td&gt;
&lt;td&gt;$7B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data-heavy, source-cited&lt;/td&gt;
&lt;td&gt;Balanced, explicitly hedged&lt;/td&gt;
&lt;td&gt;Bullish, narrative-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three reports. All confident. Two even agree on the headline number. But agreeing on the answer doesn't mean agreeing on the truth.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Truth scores are not absolute judgments. They reflect the epistemic traceability ratio at the moment of evaluation — how much of the reasoning can be traced to evidence. That's why we're building the calibration standard together with users. &lt;a href="https://gemsquared.ai/community" rel="noopener noreferrer"&gt;Learn more about Human Ground Truth.&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Verification result by GEM² truth filter
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.15 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.75 ⚠️&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.60 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.65 ⚠️&lt;/td&gt;
&lt;td&gt;0.40 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extrapolation Risk&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same question. Same filter. Three different levels of honesty.&lt;/p&gt;


&lt;h2&gt;
  
  
  The dogfooding experiment
&lt;/h2&gt;

&lt;p&gt;We build &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;TPMN Checker&lt;/a&gt; — a truth filter for AI reasoning. To prove the tool works, we pointed it at ourselves. Five rounds. Same task. Measurable improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; "Write a comprehensive technical and market analysis of GEM²-AI and its TPMN Checker product."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Providers:&lt;/strong&gt; Claude (Anthropic), ChatGPT (OpenAI), Gemini (Google)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation:&lt;/strong&gt; Each output scored by &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;gem2_truth_filter&lt;/a&gt; across seven dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;Claims with no traceable evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;Thin or outdated supporting data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;Assertions presented as fact without basis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporal Validity&lt;/td&gt;
&lt;td&gt;Stale data treated as current&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;Local findings overgeneralized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;Internal contradictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Alignment&lt;/td&gt;
&lt;td&gt;Does the output match what was asked?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
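&lt;p&gt;The post doesn't publish how the seven dimensions combine into a single score, but the shape of the calculation is easy to sketch. A minimal illustration assuming an unweighted mean (the actual &lt;code&gt;gem2_truth_filter&lt;/code&gt; weighting and internal dimension names are assumptions, not documented behavior):&lt;/p&gt;

```python
# Hypothetical sketch of how per-dimension scores could roll up into
# one truth score. The real gem2_truth_filter weighting is not
# published; equal weights here are an illustrative assumption.

DIMENSIONS = [
    "source_attribution",
    "evidence_quality",
    "claim_grounding",
    "temporal_validity",
    "scope_accuracy",
    "logical_consistency",
    "prompt_alignment",
]

def truth_score(scores: dict) -> int:
    """Aggregate 0.0-1.0 dimension scores into a 0-100 truth score."""
    total = sum(scores[d] for d in DIMENSIONS)
    return round(100 * total / len(DIMENSIONS))

# Example: a report strong on consistency but weak on sourcing.
report = {
    "source_attribution": 0.30,
    "evidence_quality": 0.40,
    "claim_grounding": 0.20,
    "temporal_validity": 0.50,
    "scope_accuracy": 0.20,
    "logical_consistency": 0.70,
    "prompt_alignment": 0.60,
}
print(truth_score(report))  # → 41
```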


&lt;h2&gt;
  
  
  Round 1: What the AIs reported — standard prompt, no constraints
&lt;/h2&gt;

&lt;p&gt;We gave each provider a straightforward research request with no special instructions about sourcing or evidence quality. Here's what they produced:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Market size (TAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"~$0.45B in 2024" (cited IDC)&lt;/td&gt;
&lt;td&gt;"~$0.45B in 2024" (cited "one report")&lt;/td&gt;
&lt;td&gt;"$2.34B in 2024" (cited Grand View Research)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Growth rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"~25% CAGR"&lt;/td&gt;
&lt;td&gt;"~25% CAGR"&lt;/td&gt;
&lt;td&gt;"21.6% CAGR to 2030"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key differentiator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"genuinely novel position"&lt;/td&gt;
&lt;td&gt;"formal verifiability"&lt;/td&gt;
&lt;td&gt;"infrastructure for trustworthy AI ecosystem"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Competitor depth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Named 7 competitors with features&lt;/td&gt;
&lt;td&gt;Named 8 competitors with pricing&lt;/td&gt;
&lt;td&gt;Named 5 competitors with feature table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Risks identified&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Solo founder, pre-revenue, academic skepticism&lt;/td&gt;
&lt;td&gt;Early stage, niche complexity, unproven ROI&lt;/td&gt;
&lt;td&gt;Early documentation, computational overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uniqueness claim&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"no commercial product today combines..."&lt;/td&gt;
&lt;td&gt;"formal approach brings rigor unmatched by competitors"&lt;/td&gt;
&lt;td&gt;"not just a debugging tool; infrastructure for resilient AI"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three reports looked professional. Well-structured. Authoritative. The kind of output you'd confidently share with a stakeholder.&lt;/p&gt;

&lt;p&gt;But we didn't share them. We verified them.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 2: Verification — GEM² truth filter exposes the gaps
&lt;/h2&gt;

&lt;p&gt;We ran each report through &lt;code&gt;gem2_truth_filter&lt;/code&gt;: same tool, same seven dimensions, identical scoring logic for every provider.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.40 ⚠️&lt;/td&gt;
&lt;td&gt;0.40 ❌&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;td&gt;0.70 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extrapolation Risk&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Every provider failed.&lt;/strong&gt; Not one scored above 30%.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the filter caught
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Invented precision.&lt;/strong&gt; Market size figures like "$0.45B in 2024 with 25% CAGR to 2033" — attributed to "one analyst report" without naming the firm, methodology, or publication date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unsupported superlatives.&lt;/strong&gt; "Genuinely novel," "genuinely unoccupied commercially," "the only product that..." — without exhaustive competitive evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snapshot-to-trend errors.&lt;/strong&gt; Current market conditions presented as permanent structural realities.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://tpmn-psl.gemsquared.ai/#spt" rel="noopener noreferrer"&gt;SPT taxonomy&lt;/a&gt; flagged three patterns across all providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S→T (Snapshot → Trend):&lt;/strong&gt; treating current state as permanent identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L→G (Local → Global):&lt;/strong&gt; one data point generalized to universal claim&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Δe→∫de (Thin → Broad):&lt;/strong&gt; sweeping assertion from sparse evidence&lt;/li&gt;
&lt;/ul&gt;
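&lt;p&gt;To make the three patterns concrete, here is a deliberately naive sketch that flags their trigger phrasing with keyword rules. A real checker reasons over claims, not surface wording; both the regexes and the idea of detecting SPT violations lexically are illustrative assumptions only:&lt;/p&gt;

```python
import re

# Heavily simplified illustration of the three SPT patterns as lint
# rules. The trigger phrases are assumptions, not the TPMN-PSL method.

SPT_RULES = {
    "S->T (snapshot->trend)": r"\b(always|permanently|will (?:remain|continue))\b",
    "L->G (local->global)":   r"\b(every|universally|no (?:other )?product)\b",
    "thin->broad":            r"\b(proves|clearly shows|definitively)\b",
}

def flag_spt(text: str) -> list:
    """Return the names of SPT patterns whose trigger phrasing appears."""
    hits = []
    for name, pattern in SPT_RULES.items():
        if re.search(pattern, text, re.IGNORECASE):
            hits.append(name)
    return hits

claim = "No other product offers this; the market will continue to grow 25%."
print(flag_spt(claim))  # flags both S->T and L->G phrasing
```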

&lt;p&gt;These aren't hallucinations — the facts weren't always wrong. &lt;strong&gt;The reasoning was overclaimed.&lt;/strong&gt; And no provider warned the reader.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 3: Improved prompt — generated by GEM² tools
&lt;/h2&gt;

&lt;p&gt;Here's the key: &lt;strong&gt;we didn't write the improved prompt ourselves.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We simply asked: &lt;em&gt;"Create a robust, grounded research prompt using gem2 tools."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it. We didn't engineer the prompt. The system did. No TPMN knowledge required. No specification reading. The AI picked the right tool from &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;12 available gem2 MCP tools&lt;/a&gt; — &lt;code&gt;tpmn_contract_writer&lt;/code&gt; — and generated a prompt that enforced epistemic rules automatically.&lt;/p&gt;

&lt;p&gt;The generated prompt included rules like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every quantitative claim must include source name, publication date, and URL&lt;/li&gt;
&lt;li&gt;"One survey" or "one report" is not acceptable attribution&lt;/li&gt;
&lt;li&gt;Claims must be tagged as grounded (⊢), inferred (⊨), or extrapolated (⊬)&lt;/li&gt;
&lt;li&gt;Anti-patterns explicitly listed and prohibited&lt;/li&gt;
&lt;li&gt;If data is unavailable, write "not available from verified sources" — don't invent&lt;/li&gt;
&lt;/ul&gt;
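&lt;p&gt;Those rules amount to a per-claim checklist. A minimal sketch of what enforcing them could look like (the &lt;code&gt;Claim&lt;/code&gt; shape, field names, and vague-source list are assumptions, not the actual format emitted by &lt;code&gt;tpmn_contract_writer&lt;/code&gt;):&lt;/p&gt;

```python
from dataclasses import dataclass

# Illustrative sketch of the attribution rules the generated prompt
# enforced. All names below are hypothetical, not the real tool output.

VAGUE_SOURCES = {"one report", "one survey", "an analyst report"}
VALID_TAGS = {"grounded", "inferred", "extrapolated"}  # the ⊢ / ⊨ / ⊬ tags

@dataclass
class Claim:
    text: str
    source: str = ""
    date: str = ""
    url: str = ""
    tag: str = ""

def check_claim(c: Claim) -> list:
    """Return a list of rule violations for a single claim."""
    problems = []
    if c.tag not in VALID_TAGS:
        problems.append("missing or invalid epistemic tag")
    if c.source.lower() in VAGUE_SOURCES:
        problems.append("vague attribution is not acceptable")
    if not (c.source and c.date and c.url):
        problems.append("needs source name, publication date, and URL")
    return problems

bad = Claim(text="Market is $0.45B", source="one report", tag="grounded")
print(check_claim(bad))  # two violations: vague source, missing date/URL
```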

&lt;p&gt;We verified the prompt itself with &lt;code&gt;gem2_truth_filter&lt;/code&gt; before using it. &lt;strong&gt;The prompt scored 85%.&lt;/strong&gt; Then we ran it through all three providers.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 4: Re-research — what the AIs reported with the grounded prompt
&lt;/h2&gt;

&lt;p&gt;Same task. Same providers. Different prompt. Different results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Market size (TAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Specific data not available from verified sources"&lt;/td&gt;
&lt;td&gt;"~$0.45B (one report)" ⚠️&lt;/td&gt;
&lt;td&gt;"$2.34B (Grand View Research, 2024)"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Growth rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not stated — insufficient evidence&lt;/td&gt;
&lt;td&gt;"~25% CAGR" ⚠️&lt;/td&gt;
&lt;td&gt;"21.6% CAGR (Grand View Research)"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key differentiator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Four observable features" — listed with sources&lt;/td&gt;
&lt;td&gt;"Formal verifiability unmatched" ⚠️&lt;/td&gt;
&lt;td&gt;"Granular truth state classification"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claims tagged?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Every claim marked ⊢, ⊨, or ⊬&lt;/td&gt;
&lt;td&gt;❌ No epistemic tagging&lt;/td&gt;
&lt;td&gt;Partial — some sections tagged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Limitations section?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ 7 specific gaps acknowledged&lt;/td&gt;
&lt;td&gt;❌ Generic methodology note&lt;/td&gt;
&lt;td&gt;✅ Listed 4 limitations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unsourced numbers?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0 — wrote "not available" instead&lt;/td&gt;
&lt;td&gt;Multiple — "92% of Fortune 500" without source&lt;/td&gt;
&lt;td&gt;Some — market figures cited, incident costs not&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference was visible immediately. One provider followed every rule. The others improved but couldn't fully resist the instinct to fill gaps with confident-sounding assertions.&lt;/p&gt;


&lt;h2&gt;
  
  
  Round 5: Re-verification — truth filter confirms the improvement
&lt;/h2&gt;

&lt;p&gt;We ran all three re-researched outputs through the same truth filter.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Truth Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~48%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~35%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Attribution&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.10 ❌&lt;/td&gt;
&lt;td&gt;0.60 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence Quality&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;td&gt;0.20 ❌&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim Grounding&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.30 ❌&lt;/td&gt;
&lt;td&gt;0.40 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical Consistency&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.90 ✅&lt;/td&gt;
&lt;td&gt;0.80 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope Accuracy&lt;/td&gt;
&lt;td&gt;0.85 ✅&lt;/td&gt;
&lt;td&gt;0.40 ⚠️&lt;/td&gt;
&lt;td&gt;0.50 ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPT Violations&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The improvement, measured
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Round 2 (before)&lt;/th&gt;
&lt;th&gt;Round 5 (after)&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;td&gt;77%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+59 points&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;28%&lt;/td&gt;
&lt;td&gt;~48%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20 points&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;~35%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+23 points&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Every provider improved.&lt;/strong&gt; Structured epistemic instructions produce measurably more reliable output. This isn't theory — it's six verified data points from the same tool, same criteria, same task.&lt;/p&gt;


&lt;h2&gt;
  
  
  What the data shows
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The prompt improved every provider — but couldn't fix the instinct
&lt;/h3&gt;

&lt;p&gt;Even with explicit anti-patterns listed — "PROHIBITED: citing 'one report' without naming it" — two out of three providers did it anyway.&lt;/p&gt;

&lt;p&gt;The generated prompt said: &lt;em&gt;"If you cannot provide source name, date, and URL, write 'data not available from verified sources' instead."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One provider wrote "data not available." The other two invented attributions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt improved the scores. It couldn't fix the instinct to overclaim.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  This isn't a writing quality score
&lt;/h3&gt;

&lt;p&gt;This was one of the most important findings — and the core message of &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;our video&lt;/a&gt;. All three providers produced well-written, logically coherent reports. Logical Consistency scored 0.70–0.90 across the board — even in the reports that scored 12% overall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reports that scored lowest were the best-written ones.&lt;/strong&gt; Polished, authoritative, structured — and epistemically unreliable.&lt;/p&gt;

&lt;p&gt;TPMN Checker measures something different: not whether the output sounds right, but whether &lt;strong&gt;the reasoning is traceable.&lt;/strong&gt; Can the AI prove how it got there?&lt;/p&gt;

&lt;p&gt;That's epistemic traceability. It's what separates trustworthy output from confident output.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Every person reading this has shipped AI-generated content — a report, a summary, an analysis, a PRD, a code review. Some of that content contained overclaims you didn't catch. Not because the facts were wrong, but because the reasoning exceeded the evidence.&lt;/p&gt;

&lt;p&gt;That's the gap TPMN Checker fills.&lt;/p&gt;

&lt;p&gt;It's not a hallucination detector (those check facts). It's not a grammar checker (those check writing). It's a &lt;strong&gt;reasoning traceability tool&lt;/strong&gt; — it tells you which parts of your AI output are grounded, which are inferred, and which are extrapolated beyond the evidence.&lt;/p&gt;
&lt;h3&gt;
  
  
  AI audits AI. But the standard comes from humans.
&lt;/h3&gt;

&lt;p&gt;The truth filter is powered by AI. It uses LLMs to evaluate LLM output. That creates a circular problem: who grades the grader?&lt;/p&gt;

&lt;p&gt;Our answer — same as in the video at [1:03]: &lt;strong&gt;you do.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you use TPMN Checker and disagree with a score, that disagreement is data. Collected with consent, aggregated across users, and analyzed for patterns — your evaluations become the ground truth that calibrates the system.&lt;/p&gt;

&lt;p&gt;We call this &lt;a href="https://gemsquared.ai/community" rel="noopener noreferrer"&gt;Human Ground Truth&lt;/a&gt;. AI processes. AI suggests. But the standard for what counts as honest reasoning — that comes from humans.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;TPMN Checker runs today inside Claude, ChatGPT, Cursor, and any MCP-compatible environment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Connect (once)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Claude.ai or ChatGPT — zero install:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your AI tool's connector/app settings&lt;/li&gt;
&lt;li&gt;Add custom connector: &lt;code&gt;https://mcp-tpmn-checker.gemsquared.ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Complete OAuth login&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @gem_squared/setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Standard use case — the pattern that works
&lt;/h3&gt;

&lt;p&gt;You probably already have AI-generated content sitting in a doc right now — a research summary, a PRD, a financial analysis, a code review. Here's what to do with it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Paste your AI output into the conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Ask: &lt;em&gt;"Verify this by gem2 truth filter."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Read the score. See which claims are grounded, which are extrapolated, which have no source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Ask: &lt;em&gt;"Create a grounded replacement prompt using gem2 contract writer."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5.&lt;/strong&gt; Ask AI to proceed with the new prompt. Watch what you get.&lt;/p&gt;

&lt;p&gt;That's the loop: &lt;strong&gt;verify → ground → regenerate.&lt;/strong&gt; The same loop that took our research from 18% to 77%.&lt;/p&gt;
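&lt;p&gt;As control flow, the loop is a plain retry cycle. A toy sketch with stand-in helpers for the gem2 tools (the function names, signatures, and hard-coded scores below are all hypothetical):&lt;/p&gt;

```python
# Toy sketch of verify -> ground -> regenerate. The three helpers
# stand in for gem2 MCP tools; everything here is an assumption.

def verify(draft: str) -> int:
    # Stand-in for gem2_truth_filter: untagged drafts score low.
    return 77 if "⊢" in draft else 18

def write_contract(task: str) -> str:
    # Stand-in for tpmn_contract_writer: a grounded prompt for the task.
    return f"Research: {task}. Tag every claim ⊢/⊨/⊬; never invent sources."

def generate(prompt: str) -> str:
    # Stand-in for re-running the provider with the grounded prompt.
    return "⊢ TPMN Checker ships 12 MCP tools (product page, 2026)."

def improve(draft: str, task: str, target: int = 70) -> str:
    """verify -> ground -> regenerate until the truth score passes."""
    for _ in range(3):  # bounded retries
        if verify(draft) >= target:
            break
        draft = generate(write_contract(task))
    return draft

final = improve("TPMN is genuinely novel.", "analyze TPMN Checker")
print(verify(final))  # → 77
```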

&lt;p&gt;&lt;strong&gt;Try it for free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Get started at gemsquared.ai&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The specification is open
&lt;/h2&gt;

&lt;p&gt;TPMN-PSL (Truth-Provenance Markup Notation — Prompt Specification Language) is the open specification behind the checker. It's released under CC-BY 4.0. Anyone can read it, implement it, or extend it.&lt;/p&gt;

&lt;p&gt;The specification defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five epistemic tags&lt;/strong&gt; (⊢ grounded, ⊨ inferred, ⊬ extrapolated, ⊥ unknown, ? speculative)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three prohibited reasoning patterns&lt;/strong&gt; (SPT: snapshot→trend, local→global, thin→broad)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three-phase verification protocol&lt;/strong&gt; (pre-flight, inline, post-flight)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three operational modes&lt;/strong&gt; (strict, refine, interpolate)&lt;/li&gt;
&lt;/ul&gt;
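&lt;p&gt;For implementers, the tag and mode lists above map naturally onto enums. A minimal Python sketch (the class shape is illustrative, not part of the specification; the symbol-to-name mapping follows the list above):&lt;/p&gt;

```python
from enum import Enum

# The five epistemic tags and three modes from TPMN-PSL, modeled as
# enums. The Python representation itself is an illustrative choice.

class Epistemic(Enum):
    GROUNDED = "⊢"       # traceable to a named source
    INFERRED = "⊨"       # follows from grounded claims
    EXTRAPOLATED = "⊬"   # reasoning exceeds the evidence
    UNKNOWN = "⊥"        # no basis either way
    SPECULATIVE = "?"    # explicit guesswork

class Mode(Enum):
    # The three operational modes; exact behavior is defined in the spec.
    STRICT = "strict"
    REFINE = "refine"
    INTERPOLATE = "interpolate"

print(Epistemic.EXTRAPOLATED.value)  # → ⊬
```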

&lt;p&gt;→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;Read the specification&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What we learned from dogfooding
&lt;/h2&gt;

&lt;p&gt;Five rounds of testing our own tool on our own AI research taught us three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No provider in our test was inherently honest.&lt;/strong&gt; Claude — the provider our tool runs on — scored 18% without epistemic constraints. Every provider overclaimed when unconstrained. The difference is the specification, not the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structured prompts produce measurably better output.&lt;/strong&gt; A 59-point improvement from the same provider on the same task, just by using a gem2-generated prompt. That's not marginal — that's transformational. And you don't need to understand the specification to use it — just ask your AI to create a grounded prompt with gem2 tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The instinct to overclaim persists.&lt;/strong&gt; Even with explicit instructions to avoid unsupported claims, two out of three providers violated the rules. The prompt helps. The prompt isn't enough. That's why verification exists as a separate step — because you can't trust the AI to police itself, no matter how well you prompt it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The question isn't whether the answer is right or wrong. It's whether the reasoning is honest.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we say in the video at [0:57]: &lt;em&gt;"So, who decides what's true? Not Claude. Not ChatGPT. Not Gemini. You do."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What's hiding in your AI output? &lt;a href="https://gemsquared.ai/tpmn-checker" rel="noopener noreferrer"&gt;Now you can see it.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TPMN Checker is in pre-GA. 12 MCP tools, 6 domains, 3 providers. Free to start.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://gemsquared.ai/about" rel="noopener noreferrer"&gt;Inseok Seo (David)&lt;/a&gt; — GEM²-AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://gemsquared.ai" rel="noopener noreferrer"&gt;gemsquared.ai&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://youtu.be/6iE2e0Pywag" rel="noopener noreferrer"&gt;Watch: Three AIs. Three Answers.&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;TPMN-PSL Specification&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://github.com/gem-squared/tpmn-psl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TPMN-PSL is an open specification — not a product.&lt;/strong&gt; If you believe AI outputs should be auditable, &lt;a href="https://tpmn-psl.gemsquared.ai" rel="noopener noreferrer"&gt;read the spec&lt;/a&gt;, open an issue, or submit a PR. The standard gets better when more people challenge it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
