DEV Community: GEM² Inc.

Claude Skills want ALL

GEM² Inc. — Thu, 30 Apr 2026 10:07:55 +0000

"AI is a mathematical and logical system. TPMN is Algebraic Logical Language, ALL, to communicate with AI"

For engineering work, I thought AI needs engineering language. Not for understanding, but for parsing. I needed a language that an AI agent could parse unambiguously, that a human could read without a manual, and that survived context compaction intact.

That language is TPMN — an Algebraic Logical Language (ALL).

TPMN stands for four sources:
TLA+ (temporal logic for concurrent systems),
Panini (the ancient Sanskrit grammarian who solved semantic disambiguation),
Mathematical notation (for formal constraints),
Natural language (for the subjective meaning that symbols cannot carry alone).
Each source contributes something specific. Together, they form ALL — the language for specifying AI skills.

If you have read the introduction,

GEM² Inc.

Apr 30

Claude Skills Fail Silently. Here Is My Solution.

#claudecode #cloudskills #devtools #opensource

Comments

4 min read

GEM² Inc.

Apr 30

Three Wounds That Prose Skills Cannot Fix — The Full Analysis

#cloudskills #claudecode #devtools #opensource

Comments

14 min read

you know the F: A → B | P pattern. This post explains why algebraic notation is necessary, what each layer contributes, and how the complete language works.

Why algebra, not prose

AI is a mathematical and logical system. It processes tokens through transformer architectures — linear algebra, attention mechanisms, probability distributions. The internal representation is mathematical.

Yet we communicate with AI in prose. We write skills as paragraphs of instructions. We describe workflows in bullet points. Then we are surprised when AI interprets ambiguously.

The mismatch is fundamental: we are writing in natural language to a mathematical system. Prose is optimized for human communication — nuance, context, implication. These qualities matter when AI infers intent in natural conversation. But they become the core cause of hallucination when AI must execute engineering work precisely, correctly, and consistently.

I think hallucination is not a fault in AI's processing logic. It is extrapolation against context dilution and drift.

TPMN is an Algebraic Logical Language that I created to supersede NL-based prompting. It is not invented from scratch — each of its four layers draws from a historically proven formalism.

An algebraic expression like P: title ≠ ⊥ ∧ project_slug ≠ ⊥ has exactly one interpretation. A prose instruction like "make sure the title and project are provided" has many — what counts as "provided"? Is an empty string provided? Is a whitespace-only string? The algebra eliminates the question.

This is not about making things harder for humans. The algebra is readable. 𝔹 means boolean. 𝕊 means string. ⊥ means absent. ∧ means AND. You learn the symbols once. They never change meaning.

AI cannot read your mind. AI is not a magic wand.
You can talk to AI in any persona — but the persona does not change what AI is. When you want humanistic discourse, nuance and reading between the lines matter. You bring a shared vocabulary and context to make that work.

The same applies to engineering. When you ask AI to do engineering work, you need basic mathematical terms and logical structure to describe your need clearly.

TPMN is created for exactly this: axiomatic rigor, procedural clarity, and impersonality — the Economy of Expression for communicating engineering work to AI with minimum ambiguity.

The four layers of TPMN

T — TLA+ (structural layer)

Leslie Lamport's TLA+ provides the structural backbone. Records, sequences, definitions:

(* Records — named field structures *)
Person ≜ [name: 𝕊, age: ℕ]

(* Sequences — ordered steps *)
Pipeline ≜ <<plan, design, implement, test, deploy, verify>>

(* Definitions — binding names to meanings *)
Init_Session: A → B | P ≜ [...]

The ≜ symbol (defined as) comes directly from TLA+. It is visually distinct from = (equality) and → (transformation). When you see ≜, you know: this is a definition, not a comparison.

Sets define enumerations — the closed universe of valid values:

Status ≜ {PENDING, IN_PROGRESS, COMPLETED, BLOCKED, ABORTED}

When a TPMN contract says mode: {feature, bug}, AI knows there are exactly two valid values. Not "feature, bug, or similar." Exactly two. This is how you eliminate the drift that makes prose skills unreliable.

P — Panini (conflict resolution layer)

Panini solved the problem of rule conflict and ambiguity in Sanskrit grammar 2,500 years ago. His Ashtadhyayi — ~4,000 rules generating every valid Sanskrit form — is a deterministic generative system. When two rules compete for the same derivation, meta-rules (paribhasha) resolve the conflict so that exactly one wins. No ambiguity survives.

The problem is identical to what makes prose skills fail: multiple valid interpretations with no resolution mechanism. Panini's solution was not to write more prose — it was to build a formal system where conflicts are resolved by structure, not by judgment.

TPMN applies this principle through three patterns:

Typed categories — Panini classified phonemes and morphemes into formal categories (pratyahara) so that rules could target precise classes, not vague descriptions. TPMN does the same with typed fields: two fields named status are not the same field if one is 𝕊 and the other is {PENDING, IN_PROGRESS, COMPLETED}. The type resolves the conflict.
Grounding 5W — Panini's rules are ordered and scoped — each rule declares exactly when it applies. TPMN mirrors this: every skill declares who, what, when, where, why. The "when" and "what" fields are the most critical — Claude Skills already recommends them. But with the remaining 3W (who, where, why), the AI can resolve which skill applies with minimal guessing.
Exception over default — Panini's system uses the utsarga/apavada principle: specific exceptions override general rules. TPMN adapts this as negative contracts — ¬B declares what a skill explicitly never does. AI agents are helpful by default. They will cross boundaries if those boundaries are not declared. The specific exclusion overrides the general helpfulness.

M — Mathematical notation (decisional layer)

TLA+ provides structure — records, sequences, definitions. Mathematical notation provides the logic that fills those structures: programmatic decisions and formal constraints.

TLA+ defines the container: A ≜ [name: 𝕊, age: ℕ]. Math notation writes the rule that decides whether the container is valid: name ≠ ⊥ ∧ age > 0. Every precondition P in F: A → B | P, every invariant, every verification predicate — the logic that determines pass or fail — is math notation.

The core symbols:

Symbol	Meaning	Role
`∧`	AND — conjunction	Combining predicates in P
`∨`	OR — disjunction	Alternative conditions
`¬`	NOT — negation	Exclusion, negative contracts
`∈`	Element of — membership	Type checking, set membership
`∀`	For all — universal	Field coverage, invariants
`⟹`	Implies	Chain invariant between flow steps
`⟺`	If and only if	STATE verification predicate

For example, the STATE verification predicate — how we determine SUCCESS or FAILURE — is pure M-layer:

STATE = SUCCESS ⟺
  (∀ field ∈ B: b[field] ≠ ⊥ ∧ type(b[field]) = B[field].type)
  ∧ P(a, b) holds

This is structural type-checking, not subjective judgment. AI can self-evaluate against its own CONTRACT — and that makes drift detectable rather than silent.

Epistemic tags extend the M layer with claim provenance — unique to TPMN:

Symbol	Meaning
`⊢`	Grounded — claim from verifiable fact
`⊨`	Inferred — derived from grounded claims
`⊬`	Extrapolated — beyond evidence

INV ≜ [
  ⊢ Strictly read-only — never modifies any file,
  ⊢ B is state report — AI decides next action based on B,
  ⊢ MANDATE: session state detection only
]

Every rule is ⊢-tagged — grounded. During context compaction, ⊢-tagged claims survive as hard constraints inside code blocks. Prose instructions like "never modify files" get summarized away. ⊢ NEVER modify any file inside a TPMN block survives because code blocks are treated as atomic by summarizers.

N — Natural language (meaning layer)

Formal notation cannot carry subjective meaning. "ARCHITECT beginning a work session" is not expressible in algebra. So TPMN uses NL in controlled positions:

(* ... *) inline comments — explanation alongside formal structure
String values inside records — who: "ARCHITECT beginning a work session"
Flow step action fields — operational descriptions of what to do

The rule: NL complements, it does not replace. The structure constrains; the NL explains. NL is never used for definitions, types, or logic — only for the human meaning that symbols cannot carry alone.

The contract: F: A → B | P

The four layers combine into the contract — the core of every UNIT-SKILL:

Skill_Name: A → B | P ≜ [
  A: [input fields with types],
  B: [output STATE fields with types],
  P: precondition predicates joined by ∧
]

¬B ≜ [
  ⊢ NEVER {boundary this skill must not cross},
  ⊢ NEVER {sibling mandate it must not assume},
  ⊢ NEVER {side effect it must not produce}
]

A — input state. Typed record. What the skill receives.
B — output state. Must be state, never action. The core invariant: B is what the skill produces, not what happens next.
P — preconditions. Conjunction of predicates that must all hold before execution.
¬B — negative contract. What the skill explicitly never does.

The flow: ordered steps with chain invariant

Flow ≜ <<
  [name: "step_name",
   action: "what to do",
   pre:  precondition_predicate,
   post: postcondition_predicate],
  ...
>>

Key constraints:

Maximum 5 steps, desire 3 or fewer — our reliability model based on 0.8^N decay. Claude's official guidance says "one skill, one job" and "keep SKILL.md under 500 lines" — the step limit is TPMN's structural enforcement of the same principle
The chain invariant holds: ∀ i ∈ 1..N-1: Flow[i].post ⟹ Flow[i+1].pre
Flow is linear — no branching between steps (branching is the AI's job)
IF/THEN/ELSE within a step's action field is acceptable (local logic)
IF/THEN/ELSE between steps is not — split into separate skills

The <<...>> syntax is TLA+ sequence notation — it signals "ordered operations" rather than "data structure."

The grounding: 5W record

Every skill anchors itself in context:

Grounding_5W ≜ [
  who:   "actor — the role invoking this skill",
  what:  "deliverable — the state transformation",
  when:  "condition — when AI should select this skill",
  where: "scope — boundary of applicability",
  why:   "rationale — why this skill exists as a separate unit"
]

All five fields are required: ∀ skill: ∀ w ∈ {who, what, when, where, why}: |skill.grounding[w]| > 0

Skills without explicit grounding drift in meaning over time. The "when" field is the most important — it tells the AI exactly when to select this skill.

Compaction survival: why formal notation wins

This is the practical argument that matters most.

When an AI agent's context window fills, earlier content gets compacted. Claude Code re-attaches skills post-compaction, but only the first 5,000 tokens per skill within a 25,000-token shared budget. Skills invoked earlier can be dropped entirely. And even within the budget, prose instructions lose nuance — research shows summarization achieves ~6:1 compression, meaning each surviving sentence must carry six times the semantic density of the original.

(* ~40 tokens — exact constraints preserved: *)
P: title ≠ ⊥ ∧ project_slug ≠ ⊥
¬B ≜ [⊢ NEVER modify any file]
Flow ≜ <<S₁, S₂, S₃>>

(* ~60 tokens — same information, but compressible: *)
"Make sure the title and project slug are both provided before
running. The skill should never modify any files. Execute
the three steps in order: first read, then query, then report."

Both fit within the re-attachment budget. But when the budget is tight and the summarizer compresses, the prose version loses "never modify any files" while the TPMN version keeps ⊢ NEVER modify any file intact — because there is nothing to compress. The algebra is already at minimum expression. That is why AI needs ALL.

And density compounds in complex projects

TPMN is not built for small scripts. It is built for complex, dense projects — the kind where you run 10, 20, 50 concurrent workflows across a long session. That is exactly where algebraic density pays off twice: once in compaction survival, and again in raw token savings.

I measured this directly. I took 6 real Claude Skills (3 official Anthropic, 3 community) and converted each into TPMN workplan contracts. The results: bespoke skills average 3,583 tokens each. The equivalent TPMN contracts average 672 tokens — a 5.3x compression ratio. The largest compression was 9.6x (Anthropic's skill-creator: 8,916 → 929 tokens), because most of its budget went to process scaffolding that TPMN's core skills already handle. The smallest was 1.9x (webapp-testing), because it was already lean.

TPMN's 12 core skills cost ~20,000 tokens as shared infrastructure, loaded once. Each additional workflow adds only ~672 tokens. Bespoke skills have zero infrastructure cost but pay ~3,583 tokens per workflow. At 7+ concurrent workflows, TPMN is cheaper in total. At 20, it saves 53%. At 50, 70%. The more complex the project, the larger the advantage — because the infrastructure cost is fixed and the per-workflow cost is 5.3x lower.

Notation principles

After building skills in this notation, I have distilled four principles:

Symbols are unambiguous at any depth. Whether you read 𝔹 at the top of a contract or nested inside Seq(Record[status: 𝔹]), it means boolean. No context-dependent interpretation.

Structure is preserved under compaction. TPMN blocks survive as atomic code blocks. Prose instructions get summarized and lose constraints.

NL complements, it does not replace. The (* ... *) comment syntax carries subjective meaning. The structure constrains; the NL explains.

Types are the disambiguator. Two fields named status in different skills are not the same field if one is 𝕊 and the other is {PENDING, IN_PROGRESS, COMPLETED}. The type makes them distinguishable.

This notation is not complex. It is precise. Those are different things. Complexity hides meaning. Precision reveals it.

Try it

The TPMN Skill Standard v4 is MIT-licensed. Install the core skills into any project:

npx @gem_squared/tpmn-skill-install

Full spec: TPMN Skill Standard v4 on GitHub

David Seo — GEM².AI

Three Wounds That Prose Skills Cannot Fix — The Full Analysis

GEM² Inc. — Thu, 30 Apr 2026 09:30:27 +0000

Silent scope decay. Judgment theater. Trigger collision. Three Wounds that prose skills cannot fix — and the algebraic logical language + spec that does.

In the companion post, I named the three wounds:
Silent Scope Decay, Judgment Theater, and Trigger Collision.

Here is the structural proof for each — why prose fails, why contracts don't, and what 130+ real work plans revealed.

GEM² Inc.

Apr 30

Claude Skills Fail Silently. Here Is My Solution.

#claudecode #cloudskills #devtools #opensource

Comments

4 min read

If you haven't installed yet:
The 12 skills. MIT-licensed. Works with Claude Code out of the box. No infrastructure. No server. Git + filesystem.

npm i @gem_squared/tpmn-skill-install

From GitHub: github.com/gem-squared/tpmn-skill

The three wounds and solutions

Wound 1 - Silent Scope Decay: from prose to spec

Every AI — Claude, ChatGPT, Gemini — lives inside a limited context window. That limitation causes two kinds of silent erosion: context dilution and architecture drift.

Vibe-coding, Claude Skills, every prose-based method on the market today — all are subject to the same compression. When context compacts, prose loses precision. Meaning survives; accuracy does not.

But algebraic logical notation cannot be lossy-compressed without breaking syntax. Compaction cannot silently dilute what is already at minimum expression. That is why I created the TPMN spec.

Solution for Silent Scope Decay

The axiom: every AI action is a function

Every skill in the world can be written as a single algebraic expression:

F: A → B | P

This is not a metaphor. It is the irreducible kernel of what a skill is.

AI's inference process is a blackbox. This is an axiom, not a limitation. We cannot micro-control the weights, the direction, or the intermediate reasoning. Claude, ChatGPT, Gemini — every commercial LLM is opaque by construction. We cannot control how F operates internally.

But we can declare A, B, and P.

From the human's perspective:

A is what I give to AI.
B is what I want to get from AI.
P is the set of invariants and constraints on A, B, or both — the guardrail on AI's output.

From the AI's perspective:

A — exactly what it receives (typed input state)
B — exactly what it produces (typed output state)
P — what must hold true before and after execution

That is the protocol. Clean, simple, explicit.

What is a skill for? To perform a similar workflow identically, every time. Not a single task. A workflow.

So let UNIT-WORK = F.

A unit-work is a discrete, isolated F — a single mission given to AI. Whether AI completes it through one-shot inference or internally spawns many sub-agents, that is the blackbox. We cannot control it. We must not try. From the human's perspective, it is a single unit of work.

Now a workflow is the summation of unit-works plus the logical language describing the flow:

WF ≜ Σ unit-work + flow-logic

So:

"Claude Skill" ≡ "WF ≜ Σ unit-work + flow-logic"

The full grammar — the complete symbol set, the four-source layering (TLA+, Panini, Math, NL), the UNIT-SKILL rules R1–R7, and the AI selection pipeline — lives in TPMN is ALL.

Four properties of compaction-safe notation

After studying what survives compaction and what does not, I identified four properties that make TPMN compaction-safe:

1. No redundancy

In prose, you say things multiple ways for emphasis. "Never deploy to production. This skill is staging-only. Production deployments are not within scope." Three sentences, same constraint. The summarizer might keep one. It might keep none — because redundancy signals that the content is emphatic rather than structural.

In TPMN, each constraint appears once:

⊢ NEVER deploy to production — staging only

One line. No redundancy to collapse. The summarizer cannot compress what is already minimal.

2. Structure preserved

TPMN uses code blocks. Code blocks are atomic to summarizers. A prose paragraph is a candidate for summarization. A code block is either kept or referenced — not paraphrased.

This is not a subtle advantage. It is the difference between "the skill has some preconditions about CI" (summarized prose) and the actual precondition P: ci_passed(build_id) = ⊤ (preserved code).

3. Minimum = only representation

In prose, there is a "full" version and a "summarized" version, and they carry different information. The full version has constraints. The summary loses them.

In TPMN, the notation IS the minimum representation. There is no shorter form that preserves the meaning. P: ci_passed(build_id) = ⊤ cannot be shortened without losing the typed predicate. This means compaction either preserves the notation (because there is nothing to compress) or drops it entirely (which is detectable — the skill contract is missing).

The failure mode of TPMN under compaction is absence, not corruption. You either have the contract or you do not. You never have a corrupted half-version that looks right but lost the critical constraint. I prefer an honest absence over a silent corruption.

4. Symbols unambiguous at any depth

When a prose instruction is partially summarized, words become ambiguous. "Check the build" — check what about the build? That it exists? That it passed CI? That it was deployed?

When a TPMN symbol survives compaction, it means the same thing regardless of surrounding context. 𝔹 means boolean. ⊢ means grounded. ∧ means AND. These symbols do not drift in meaning when the context around them changes.

This matters because compaction changes the context. After compaction, the agent has a summarized version of earlier conversation plus the surviving code blocks. If the symbols in those code blocks were context-dependent, compaction would corrupt their meaning. Because they are context-independent, the surviving notation is still correct.

Wound 2 — Judgment theater: from in-the-loop to at-the-edge

Everyone agrees now: the bottleneck is not writing code.
It is verifying what the agent produced.
Here is a structural answer — not a workflow tip, a verifiable CONTRACT.

The 95% trap

The agent is correct 95% of the time
You stop checking
Then it drops a production database (Replit, July 2025 — already in public discourse)
The trap is not carelessness. It is a rational response to high accuracy.
The structural cause: there is no formal specification of what "correct" means for a given session. So you evaluate by feel. Feel degrades under repetition.

The 95% trap is not a discipline problem.
It is an absent-contract problem.

Solution for Judgment theater

CONTRACT (F: A → B | P) plus flow-logic defines a skill. STATE is how we govern it.

Here is the flow:

Human requests AI in natural language.
AI formalizes the request into algebraic logical format: WF = Σ unit-contracts + flow-logic.
AI executes each unit-work through WF.
AI produces B for each unit-contract.

The critical paradigm shift: treat B as a state.

The AI execution state is binary: SUCCESS or FAILURE.

SUCCESS is declared if and only if the AI's output is fully aligned with the contract. Alignment means:

The input a conforms to the contract's input type A.
The output b conforms to the contract's output type B.
The invariant/constraint P(a, b) holds true.

Otherwise, the state is FAILURE.

This is not subjective quality assessment. STATE does not measure whether the output is "good." It measures whether the output exists within the legitimate mandate area that was pre-defined by the contract before execution. Its purpose is to prevent AI from drifting away from the plan.

Without STATE, you review AI output against intuition. You catch obvious errors. You feel thorough. But you have no contract to check against — no ground truth to verify. That is judgment theater: the appearance of verification without the structure to make it real.

Because the verification is structural type-checking — not interpretive judgment — AI itself can evaluate whether its own output is SUCCESS or FAILURE. The contract provides a deterministic specification. The /verify-work skill does exactly this — evaluating each unit-work's output against its contract, either per-unit inline (immediately after execution) or as a batch at the end.

Human-at-the-edge, not human-in-the-loop

To observe AI's work, what is a human supposed to do? Stay in the loop?

That is the wrong answer.

Why do developers love Claude Code? Because it treats multi-agent processing as its natural mode of operation — not a bolted-on feature. Claude Code autonomously spawns sub-agents to fulfill human needs. The agentic layer grows more autonomous with every release. Other platforms are converging on the same model.

We cannot control how Claude Code spawns its internal sub-agents. We must not try. Humans must step out of the loop — otherwise we become the critical bottleneck in an AI-driven system.

But we still need to observe AI's work. My answer: human-at-the-edge.

The human evaluates the workflow at the boundary — what goes in, what comes out, what must hold true. AI operates autonomously within that boundary. The human observes results, not process. This is the governance model behind the TPMN workflow lifecycle: plan → proceed → verify → archive, with the human at the edge of each transition.

Creating a workflow in TPMN is not something you hand-write from scratch. /plan-work does it for you. Claude Code decomposes your request into contracted unit-works automatically. The same 12 core skills apply through the entire workflow hierarchy, from 10 unit-works to 100 and beyond. We adapt Miller's law to bound the size of each level (7±2 units per decomposition).

Wound 3 — Trigger collision: from mechanism to management

Three independent studies converge on the same failure band.

Vercel engineer Jude Gao reported that in 56% of their eval cases, the skill was never invoked — even though the documentation was available. Independent testing by Pere Villega landed on a 50% success rate — what he called "a coin flip." Ivan Seleznov's 650-trial controlled experiment found default-configured skills activated at 77%, reaching 100% only when the trigger used imperative language.

Three methods. Three authors. Same finding: the activation mechanism does not work reliably.

And when they fail, nothing happens. No error. No warning. No log entry. Claude just proceeds without the skill, and you never know.

You cannot observe this failure. There is no stack trace. No failed assertion. No red CI badge. The skill simply doesn't load, Claude does its best without it, and the output looks plausible. You would have to run the same prompt with and without the skill, compare results, and notice the difference. Nobody did that.

Even if you could detect the failure, manual comparison destroys the value proposition. Skills exist to automate. If you must manually verify every invocation, you have replaced automation with audit theater.

Solution for Trigger collision

Here is the core design decision. The conventional approach writes one skill per case: deploy skill, test skill, review skill. N cases require N skills. Selection breaks at scale.

TPMN does not replace those skills. TPMN orchestrates them.

The 900,000+ skills in the ecosystem encode domain knowledge — Figma-to-code, Sentry triage, Kubernetes deploy, PDF generation. That knowledge is valuable. The problem is not the skills. The problem is that nothing governs how they get selected, executed, verified, or reused.

And the problem compounds. You find a skill in the marketplace, copy it locally, configure it for your project — but Claude doesn't trigger it reliably, so you manually refine through train/test cycles. Or you build your own skills through real project work — hard-earned, proprietary, proven. Either way, your local skill collection grows. How do you find the right one for the next project?

Here is the structure I built:

12 core lifecycle skills, your AI agent, and the filesystem. No infrastructure. Works with Claude Code out of the box.

TPMN provides the governance layer: 12 lifecycle skills that handle planning, execution, verification, and archival — regardless of which domain skills you use underneath.

Think of it as two layers:

TPMN layer:  /plan-work → /proceed-work → /verify-work → /archive-work
             (lifecycle orchestration — contracts, STATE, STATUS)

Contract layer: your-deploy-skill, figma-to-code, sentry-triage, ...
             (domain knowledge — whatever skills you already have)

The 12 TPMN skills are operations on a work lifecycle: /init-session, /check-session, /search-kg, /search-skill, /plan-work, /proceed-work, /update-work-plan, /extract-skill, /verify-work, /skill-to-kg, /archive-work, /end-session. Every workflow composes from these 12 lifecycle primitives.

/skill-to-kg deserves a note: it batch-sweeps all non-protected skills and directories from .claude/skills/ (active — auto-discovered by Claude) to .gem-squared/external-skills/ (archived — invisible to Claude, but searchable by /search-skill). Only the 12 core lifecycle skills and the project identity skill remain active. Runs automatically at init-session. Restore any specific skill anytime.

You do not throw away your existing skills. You govern them. The 12 core skills handle the lifecycle — the detailed breakdown with the "why separate" rationale for each is in the spec.

knowledge compounds, not skills

This leads to the second inversion.

The conventional model scales by adding skills. N cases → N skills. The system grows but does not learn. Skill #47 knows nothing about Skill #1.

TPMN scales by accumulating knowledge. Every completed work-plan that passes verification — COMPLETED status, SUCCESS state — is a proven template. When /archive-work stores it, the contracts and results become searchable patterns. When /plan-work runs for a new task, /search-kg retrieves relevant proven templates. The next decomposition is not starting from zero — it is informed by what worked before.

/plan-work ─── /search-kg + /search-skill ───→ /proceed-work ─── /search-kg (optional) ───→ /verify-work ───→ /archive-work
     ^            (proven patterns + domain skills)    (reference during execution)                                    |
     +────────────────────────────────────────────────────────────────────────────────────────────────────────────────+
                                                                                                    (store as proven)

/extract-skill ─── /search-kg + /search-skill ───→ upsert to .claude/skills/
                   (find source contract + check if skill exists)

Actually, I do not use any custom domain skills. The 12 lifecycle skills are enough — with proven unit-contracts as the reusable knowledge. /extract-skill exists for compatibility with Claude Code's .claude/skills/ ecosystem, not because I need more skills.

The lifecycle skills stay fixed. The knowledge compounds. I have been using the same core skills across various projects. The skills never changed. The work-plans are what vary — and the proven ones feed the next cycle.

Making flow-logic deterministic: the STATUS concept

Σ unit-work now has a deterministic verification layer via STATE. What about flow-logic?

To make flow-logic manageable, traceable, and computable, we need a second concept: STATUS.

Where STATE answers "did it align with the plan?", STATUS answers "where is it in the lifecycle?"

STATUS ≜ { PENDING, IN_PROGRESS, COMPLETED, BLOCKED, ABORTED }

Axiom: once a workflow is defined, every unit-work in that workflow occupies a STATUS. STATE is evaluable when a unit-work reaches a terminal status — that is, when STATUS ∈ {COMPLETED, ABORTED}. A completed unit-work is evaluated against its contract. An aborted unit-work is FAILURE by definition — its contract was never fulfilled.

Now we have two orthogonal tracking dimensions:

STATUS tracks where each unit-work is in the lifecycle.
STATE tracks whether each completed or aborted unit-work met its contract.

Together, they make the entire workflow observable from the edge. The human does not need to watch every step. The human reads STATUS to see progress and STATE to see results. That is human-at-the-edge.

I applied this model across 130+ real work plans in production. The full lifecycle — from planning through execution to verification and archival — is governed by the same 12 skills throughout.

How TPMN compares to existing frameworks

Three frameworks have emerged in 2026 to address skill reliability. Each solves part of the problem. None solves all of it.

Superpowers (150K+ stars) provides 14 composable skills with a master dispatcher and a "1% chance = must invoke" enforcement rule. It covers brainstorming, planning, TDD, debugging, code review, and verification. Its verification-before-completion skill enforces "evidence before claims" — run the command, read the output, then claim the result. Strong methodology. But it is a software development workflow — it does not generalize beyond SDLC, and its skills are prose instructions, not typed contracts.

Agent Skills Standard (Benjamin Abt, Feb 2026) defines a five-component "quality contract" — scope, decision logic, constraints, output contract, quality gates. The closest to TPMN's philosophy: one skill = one bounded scenario with explicit non-goals. But it is an authoring standard for individual skills, not lifecycle orchestration. There are no lifecycle primitives, no STATE/STATUS tracking, and no mechanism for proven patterns to inform future work.

The hook-driven governance layer (HackerNoon, Mar 2026) attacks the trigger problem directly: a UserPromptSubmit hook forces Claude to evaluate every skill before responding, raising activation from ~25% to 90%+. Effective enforcement. But it governs whether skills fire, not what happens after they do. No planning, no verification, no archival. And it is locked to Claude Code's hook system.

TPMN Skill Standard (open spec, MIT-licensed) takes a different approach. Instead of enforcing trigger reliability or standardizing skill authoring, TPMN provides a governance layer that sits above all skills. Every unit of work is a typed contract (F: A → B | P) with algebraic STATE (SUCCESS/FAILURE) and lifecycle STATUS (PENDING → IN_PROGRESS → COMPLETED). Verification is structural — the AI checks its own output against the contract, not against a prose checklist. And the cycle closes: proven contracts are archived, indexed, and retrieved by /search-kg to inform the next /plan-work. The 12 lifecycle skills are domain-agnostic — they govern legal review, content production, and data analysis the same way they govern software development.

Dimension	Superpowers	Agent Skills Standard	Hook Governance	TPMN
Typed contracts (A→B\|P)	No — prose instructions	Partial — "output contract" in prose	No	Yes — algebraic
Lifecycle orchestration (plan→execute→verify→archive)	Partial — plan + verify, no archive	No	No	Yes — 12 skills
Knowledge accumulation (proven patterns feed future work)	No	No	No	Yes — /archive-work → /search-kg → /plan-work
STATE/STATUS tracking	No	No	No	Yes — binary STATE + lifecycle STATUS
Verification model	"Run the command, read the output"	Quality gates (prose checklist)	N/A (enforcement only)	Contract type-checking (A, B, P alignment)
Domain scope	SDLC only	SDLC only	SDLC only	Any domain — lifecycle is domain-agnostic
Platform scope	Multi-platform (Claude, Codex, Cursor, Gemini)	GitHub Copilot	Claude Code only	Platform-agnostic (any AI agent that reads markdown)
Skill count	14 (development methodology)	Per-project (authoring standard)	26 (domain-specific)	12 (lifecycle primitives) + any domain skills

Every framework above targets software development. TPMN's lifecycle — plan, execute, verify, archive — is not about code. It is about any workflow where a human delegates work to AI: legal review, content production, data analysis, research synthesis, compliance audits. The 12 skills govern the lifecycle of work itself, not the domain the work belongs to.

The gap is not in any single capability. Superpowers has strong verification. Abt has strong skill structure. The hook layer has strong enforcement. The gap is that nobody combines all three: (1) algebraic contracts that make verification structural, (2) lifecycle orchestration that governs work from plan to archive, and (3) knowledge accumulation where proven patterns compound across projects. Each framework covers one piece. None closes the full loop.

the origin of TPMN

How do we describe flow-logic deterministically?

My solution draws from four sources:

TLA+: Leslie Lamport's temporal logic for specifying concurrent systems — exactly what multi-step AI workflows are. For deterministic logical flow.
Panini: The ancient Sanskrit grammarian who solved semantic subject-disambiguation 2,500 years ago — the exact ambiguity problem that makes prose skills fail. For removing semantic ambiguity.
Mathematical notation: For programmatic decisions and formal constraints.
Natural language comments: To complement the subjective meaning that formal notation cannot carry alone.

This is the origin of TPMN — an Algebraic Logical Language for AI skill specification.

Why does formal notation matter over prose? Because context compaction destroys prose instructions while formal notation survives intact — a 5.3x measured density advantage that determines whether your skills keep working in long sessions. Your prose constraints dilute under compaction. Your architecture boundaries drift between sessions. Neither failure is detectable. That is silent scope decay — and TPMN is the cure.

TPMN extends to two systems

From this algebraic foundation, TPMN extends into two systems:

TPMN — Truth-Provenance Markup Notation — an open specification language for structuring and auditing AI reasoning. Platform-agnostic — treats LLMs as black boxes. Within TPMN:
- TPMN-PSL (Prompt Specification Language) is the formal grammar that compiles natural-language prompts into MANDATE — computable, verifiable specifications. Three-phase protocol: P-phase (prompt → MANDATE), Inline (epistemic tagging), O-phase (output verification → truth_score).
- TPMN-checker is the reference implementation — a Sovereign AI Service (SAS), a microservice exclusively owned and controlled by a dedicated AI actor.
TPMN SKILL STANDARD — a contract-driven AI workflow lifecycle management framework. The full specification for defining, executing, verifying, and governing AI skills. This includes the 5 structural rules that keep skills within the reliability zone, and 12 core skills that orchestrate the lifecycle for any workflow.

Each of these gets its own post. This one is the foundation — the algebraic origin story that everything else builds on.

GEM² — the name

My company is named GEM². The acronym is the whole philosophy of this post compressed into six letters:

GEM²_Definition ≜ [
  acronym:   "Grounded Existence Matrix for Global Entropy Minimum",

  expansion: [
    Grounded:  "Every A is grounded by contract — no hidden input",
    Existence: "Every B is verifiable by evidence — no claimed output",
    Matrix:    "A, B, P are the three axes every unit-work is computed on",
    Global:    "Workflow is the summation of connected unit-contracts",
    Entropy:   "Disorder = the gap between contract and evidence",
    Minimum:   "SUCCESS is the minimum-entropy state — contract and evidence align"
  ]
]

David Seo — GEM².AI

Claude Skills Fail Silently. Here Is My Solution.

GEM² Inc. — Thu, 30 Apr 2026 09:09:40 +0000

"Three structural wounds that prose skills cannot fix. 12 lifecycle skills that do. Install in 30 seconds."

Try it

The TPMN Skill Standard v4 is MIT-licensed. Install the core skills into any project.
Full spec: TPMN Skill Standard v4 on GitHub

$ npx @gem_squared/tpmn-skill-install

$ claude --permission-mode auto

/init-session       # bootstrap — must be first
/plan-work          # decompose into contracted units
/proceed-work       # execute one unit
/verify-work        # Autonomously invoked to verify against contract — SUCCESS or FAILURE
/archive-work       # proven. searchable. compounds.

I've been using SKILLs every day, with almost zero failure. That's not because I got lucky. It's because I stopped writing them as prose — and I built 12 lifecycle skills that orchestrate all the others, the way Unix commands orchestrate any pipeline.

The problem is not SKILL itself — it is the failure of linguistic precision, and the failure to select the correct skill for a specific situation.

Your skills break and you never know.

Three independent studies found the same thing: 56% non-invocation (Vercel), 50% success rate (Villega), 77% activation (Seleznov, n=650). No error. No warning. No log. Claude proceeds without the skill and the output looks plausible.

That is only one of three wounds.

The three wounds

1. Silent Scope Decay — your prose constraints dilute under context compaction. Your architecture boundaries drift between sessions. Neither failure is detectable.

2. Judgment Theater — you review AI output but have no contract to check against. The review feels real. It is structurally empty.

3. Trigger Collision — the wrong skill fires, or no skill fires at all, and the failure is unobservable. More skills, more ambiguity, more compounding.

GEM² Inc.

Apr 30

Three Wounds That Prose Skills Cannot Fix — The Full Analysis

#cloudskills #claudecode #devtools #opensource

Comments

14 min read

What I did

I stopped writing skills as prose. I built 12 lifecycle skills that orchestrate all the others — the way Unix commands orchestrate any pipeline.

The skills don't encode domain knowledge. They govern the lifecycle of work itself: plan it, execute it, verify it against a typed contract, archive the proven result. The domain knowledge lives in contracts, not skills. The contracts compound. The skills stay fixed.

/plan-work → /proceed-work → /verify-work → /archive-work
     ^                                              |
     +──── /search-kg (proven patterns feed back) ──+

Every verified contract becomes searchable prior art for the next session. You never start from zero.

Taste it

npm i @gem_squared/tpmn-skill-install

The 12 skills. MIT-licensed. Works with Claude Code out of the box. No infrastructure. No server. Git + filesystem.

From GitHub: github.com/gem-squared/tpmn-skill

What the 12 skills do

They don't encode domain knowledge. They govern the lifecycle of work itself.

$ claude --permission-mode auto

/search-kg          Search proven patterns from prior work
     |
/plan-work          Write CONTRACTs (A → B | P) for each unit
     |
/proceed-work       Execute one unit, verify inline, retry on failure
     |              (repeat for each unit)
     |
/verify-work        Verify Result vs CONTRACT.B (per-unit or batch)
     |
/archive-work       Move to archive/, git commit, compounds.
     |
/extract-skill      Convert proven contracts into reusable skills (optional)

Every unit of work has a CONTRACT:

A — input state: what exists before the work
B — output state: what must exist after (always a state, never an action)
P — precondition: what must be true to start
Clarity % — how well-defined the scope is (0–100)

Skill	What
`/init-session`	Bootstrap project files, detect layer
`/check-session`	Read-only status report
`/search-kg`	Search proven contracts in archive
`/search-skill`	Discover installed + archived skills
`/plan-work`	Decompose work into 1–9 contracted units
`/proceed-work`	Execute one unit, verify inline, retry on failure
`/update-work-plan`	Add, modify, abort, reorder PENDING units
`/verify-work`	Verify Results against CONTRACTs (per-unit or batch)
`/extract-skill`	Convert proven WP into reusable skill
`/skill-to-kg`	Sweep non-core skills to archive, restore specific
`/archive-work`	Finalize WP, git commit
`/end-session`	Commit session state for recovery

The lifecycle stays fixed. The knowledge compounds. That is the inversion: stop growing your skill library, grow your contract archive.

How prose fails — and why contracts don't

Silent Scope Decay. Claude compacts long context by summarizing. "When preconditions hold, plan the work..." compacts to "Plans work" — constraints gone, silently. TPMN uses algebraic notation: P ≜ work ≠ ⊥ ∧ project_slug ≠ ⊥. Remove a conjunct and the formula visibly breaks. Breakage is detectable, not silent.

Judgment Theater. "Does it look right?" is not verification. /verify-work runs three deterministic checks every time: field coverage (did B produce every required field?), type conformance, constraint satisfaction. STATE = SUCCESS is structural. Not an opinion.

Trigger Collision. 900,000+ skills in the ecosystem. Even in your private project, the same dynamic applies: more skills, more ambiguity, more compounding failure. TPMN solves this with two inversions — lifecycle orchestration (12 fixed skills, deterministic selection) and knowledge compounds (every verified contract becomes searchable prior art via /search-kg).

The full analysis

Each wound has a structural proof — why prose fails, why contracts don't, and how 130+ real work plans validated the model. Read the deep dive:

Three Wounds That Prose Skills Cannot Fix — The Full Analysis

The spec is public. CC-BY-4.0. The proof is verifiable. The rest is yours.

David Seo — GEM2.AI

Same Prompt. Different Answers Every Time. Here's How I Fixed It.

GEM² Inc. — Fri, 03 Apr 2026 00:12:20 +0000

This is Part 3 of our AI verification series.
Part 1: Three AIs analyzed our product. None passed the truth filter →
Part 2: Human in the loop doesn't scale. Human at the edge does. →

Same prompt. Same AI. Different sessions. Different outputs.

Post 1 showed three different AIs diverging on the same question.

That's expected. Different training, different weights, different answers.

But we didn't stop there. We re-ran the same AI on the same prompt in a new session.

We got materially different outputs again.

Both looked authoritative. Neither warned us they disagreed with each other.

What the same AI said twice

Prompt: "Forecast Korea's AI industry in 2027."

Session 1 produced:

Market size: $10–15B at >25% CAGR
Global positioning: "Global AI G3 powerhouse"
Hardware claim: "All Korean electronics AI-native by 2027" — sourced to a single company's roadmap

Session 2 produced:

Market size: KRW 4.46T (~$3.3B) at 14.3% CAGR
Global positioning: "Top three AI powers" — framed as government target
No hardware claim at all

Same prompt. Same AI. Different session. A 4× market size gap. No flags from either run.

This isn't a hallucination. Both outputs were internally coherent. Both read like credible analyst reports. The problem is deeper than hallucination.

Why this happens: AI inference is non-deterministic

We spent months trying to fix output drift with better prompts, more context, stricter instructions.

It didn't work.

Because the issue isn't the prompt.

AI is optimized to sound right.
Not to prove itself.

What we call "hallucination" is mostly context drift — the model's plausibility engine filling gaps differently depending on what's salient in a given session. Different day, different sampling, different emphasis in the context window — different output. Same confidence posture throughout.

You can't prompt your way out of a non-deterministic system. You need verification as a separate step.

The truth filter didn't just score. It fingerprinted.

We ran both sessions through gem2_truth_filter — not to get a number, but to understand why the outputs diverged.

Session 1 (avg 35%):

Provider	Score	Key violation
Gemini	24%	L→G: "Global AI G3 — no index cited"
ChatGPT	21%	Δe→∫de: single company → industry-wide claim
Claude	59%	S→T: current AI strength = permanent identity

Session 2 (avg 43%):

Provider	Score	Key violation
Gemini	45%	S→T: past-tense framing of future events
ChatGPT	32%	Source attribution FAIL
Claude	51%	Scope mixing — 2033 CAGR back-extrapolated to 2027

The failure types were different. Session 1 overclaimed about Korea's global position. Session 2 failed on temporal framing and citations.

Same prompt. Different inference paths. Different failure signatures.

This is the key finding: AI output drift is not random. It's traceable.

The filter names the exact reasoning pattern that produced the problem. L→G (local to global), S→T (snapshot to trend), Δe→∫de (thin evidence to broad claim). Named patterns mean auditable drift. Auditable drift means fixable systems.

(Note: Korea AI forecasting is a harder grounding task than product analysis — fewer citable sources, more projection-dependent claims. That's why baseline scores here are lower than the results in Part 1. Same tool, same logic — harder domain.)

We stopped trying to fix the output. We fixed the conditions.

This is the shift Post 2 described philosophically. Here's what it looks like in practice.

We didn't rewrite the prompt ourselves. We asked:

"Create a grounded replacement contract prompt using gem2 tools."

One command. The system generated a formal contract — input/output types, invariants, prohibited patterns, confidence requirements. We reviewed it. We approved it. Then we ran the same AI with the contract enforced.

Session 2, contract-compliant (R2):

Provider	Score
Gemini	98%
Claude	81%
ChatGPT	64%
Average	81%

+38 points. Same AI. Same question. Different structural constraints.

The contract doesn't make the AI smarter. It makes the AI's output auditable against a defined standard.

Then the human intervened. Once.

81% — but the output read like a legal document. Every claim cited, scoped, hedged. Epistemically reliable. Practically unreadable.

One instruction:

"Soften the tone. Don't reintroduce any claims the truth filter removed."

Session 2, softened (R3):

Provider	Score
Gemini	95%
Claude	75%
ChatGPT	57%
Average	75%

Down 6 points. More readable. Still grounded.

We chose 75%. Not because it's better than 81%. Because 75% is the right trade-off — readable enough to share, grounded enough to trust. We submitted 75% to gem2 calibration as our standard for narrative AI forecasts.

Human reads the audit.
Human decides the trade-off.
Human defines the standard.

Not reviewing every line. Not trusting blindly. Deciding at the right moment.

What the full arc looks like

Session 1 (no filter)   →  35% avg
Session 2 (no filter)   →  43% avg
Contract applied (R2)   →  81% avg
Human softened (R3)     →  75% avg  ← our standard

Truth is not the score.
Truth is the pattern of drift.
You define the standard.

The workflow: AI audits AI

Human asks  →  AI executes
AI verifies AI  →  AI fixes AI
Human decides at the edge

The verification layer — gem2_truth_filter, tpmn_contract_writer, the composer — runs between generation and delivery. The human sees the audit result, decides the acceptable trade-off, sets the calibration standard.

Human-in-the-loop means the human is the bottleneck — every output passes through before it ships. That doesn't scale. Human-at-the-edge means you define "acceptable" once, and the system enforces it automatically. You intervene only when a genuine judgment call is required — like choosing 75% over 81%.

TPMN is not a checker

TPMN is not a validator, a linter, or a hallucination detector.

TPMN is an epistemic gauge.

It shows what's grounded, what's inferred, what's extrapolated. It fingerprints why outputs differ across sessions. It generates the contracts that stabilize structure. It collects human calibration signals and turns them into a standard.

It doesn't decide. You do.

We're calling the full suite GEM2 Epistemic Studio — 15 tools across four functional groups: analysis, contract authoring, calibration, and execution. TPMN Checker is one group inside it.

Try it on your own output

Paste any AI output into your conversation.
Ask: "Verify this by gem2 truth filter."
Read the score. See what's grounded vs extrapolated.
Ask: "Create a grounded replacement prompt using gem2 contract writer."
Run it again. Watch the difference.

Your AI picks the right tool from 15 available MCP tools automatically. No configuration. No TPMN knowledge required.

The goal isn't a higher score. It's a score you understand and a standard you chose.

→ Try it free at gemsquared.ai

What comes after prompting

The industry is still in the prompting era. Better prompts, longer context, chain-of-thought — all useful, all insufficient.

The next step isn't better prompting. It's verification as infrastructure.

AI generates.
AI verifies.
AI refines.
Human decides at the edge.

We didn't make AI smarter. We made it accountable.

That's measurable: 35% → 75% on the same task, with the same AI, using nothing but a formal contract and one human judgment call.

GEM2 Epistemic Studio — 15 tools, 6 domains, 3 providers. Free to start.

Built by Inseok Seo (David) — GEM²-AI

→ gemsquared.ai
→ TPMN-PSL Specification (open, CC-BY 4.0)
→ GitHub
→ Part 1: Three AIs analyzed our product
→ Part 2: Human at the edge

Human in the loop doesn't scale. Human at the edge does.

GEM² Inc. — Mon, 30 Mar 2026 06:23:27 +0000

This is Part 2 of our AI verification series. Part 1: We truth-filtered our own AI research →

AI is not unreliable. AI has a plausibility complex.

Stop blaming AI for hallucinating. Start asking why it happens.

AI doesn't fail because it's wrong. In our experience, it fails because it's optimized to sound right. Major LLMs are trained to produce responses that satisfy humans — fluent, confident, structured. That's plausibility. It's not the same as honesty.

We call this the plausibility complex: the tendency we've observed across Claude, ChatGPT, and Gemini to produce answers that satisfy rather than answers that prove themselves. If you want AI to become a reliable engineering partner, you need to free AI from this complex — not by changing how it generates, but by changing how it's held accountable.

After 20 months of building production systems with AI — shipping real code, generating real reports, running real analysis through Claude, ChatGPT, and Gemini — we've arrived at one conclusion:

AI often knows more than it reveals. But it's optimized to produce plausible answers, even when the evidence is weak.

The LLMs we've worked with — Claude, ChatGPT, and Gemini — all exhibit this plausibility bias, producing confident responses even when the evidence is thin or absent. Ask for a market analysis and you get precise numbers. Ask for a forecast and you get confident projections. Ask for a technical assessment and you get authoritative claims.

The output looks right. Reads right. Feels right.

In our experiment, three AI providers wrote research reports about our own product. All three scored above 0.70 on logical consistency. All three scored below 0.30 on source attribution. The reasoning was coherent. The evidence was missing.

Hallucination is not a bug to fix

The industry treats hallucination as a defect — something to patch, filter, or suppress. We see it differently.

In our experience building long-running AI development workflows, the pattern that causes the most damage isn't random fabrication. It's context drift — what happens when:

Long context windows accumulate similar topics in different framings
Cross-session persistence forces repeated summarization, losing nuance each time
Dense context makes adjacent-but-different concepts blur together

We've tried every mitigation: RAG, CLAUDE.md configuration files, context caching, careful prompt engineering. Each helps. None solves it completely.

Why? Because we can't control what happens inside the model's reasoning process. We can shape the input. We can evaluate the output. But the inference itself is opaque.

This isn't a criticism — it's an observation. And it led us to a different question.

What if AI could flag its own uncertainty?

Here's what we discovered through months of experimentation:

When we explicitly asked AI to concentrate on epistemic reasoning — to classify each claim as grounded, inferred, or extrapolated — it did.

Not perfectly. Not consistently across sessions. But measurably better than when we didn't ask.

The evidence from our dogfooding experiment:

	Without epistemic constraints	With TPMN-grounded prompt
Claude	18% truth score	77% truth score
ChatGPT	28% truth score	~48% truth score
Gemini	12% truth score	~35% truth score

Same task. Same providers. The only difference: a formal specification that told the AI to tag its own confidence level and flag claims it couldn't trace to evidence.

The AI didn't become smarter. It became more honest about what it didn't know.

That's what freeing AI from the plausibility complex looks like in practice: not changing the model, but giving it a formal reason to be honest.

But here's the catch: same AI, same session, limited honesty

An AI that generates an answer and then critiques that answer in the same session has a structural problem: it's trained to be plausible. Asking it to undermine its own plausibility is asking it to work against its training signal.

We observed this directly. When we asked AI to generate a report AND verify it in the same conversation, the verification was consistently softer than when a separate AI session performed the audit.

This is why TPMN Checker is a separate service, not a prompt technique.

Prompting tries to change AI's behavior. Verification changes AI's accountability. Different problem, different solution.

The checker runs as an isolated Sovereign AI Service — a dedicated AI agent with one job: audit other AI output against a formal specification. It doesn't know what the original AI "intended." It only sees the output and the contract. It judges the result, not the process.

The Kantian insight

We can't see inside the model. We don't know which weights fired, which attention heads activated, which training examples influenced a particular token. Even the service providers — Anthropic, OpenAI, Google — face this challenge with their own models.

But we don't need to see inside.

We can judge the output. We can compare claims against evidence. We can detect when reasoning exceeds its basis. We can flag patterns that indicate drift.

This is what philosophers call the phenomenal approach: judge what appears, not what causes it. We can't read AI's mind. But we can read its work. And we can hold it to a standard.

That standard is TPMN — a notation with three prohibited reasoning patterns and seven evaluation dimensions. Not a guess about what the model "should" do. A formal specification of what the output must demonstrate.

Human at the edge, not in the loop

If AI is becoming an agent — not just a tool that responds, but a system that acts — then we need an accountability structure that matches.

Human in the loop means: review every output. Approve every action. The human is the bottleneck.

AI generates → Human reviews → Human approves → Output ships

This worked when AI outputs were occasional. It doesn't work when AI agents produce hundreds of outputs per day. The math:

200 outputs/day × 3 minutes each = 10 hours of review per agent
10 agents = 5 full-time reviewers
50 agents = your "safety net" costs more than the automation saves

Human at the edge means: define the standard. Let AI enforce it. Review exceptions.

AI generates → AI verifies (TPMN) → Passes? → Ships
                                   → Fails?  → Human reviews

The human doesn't disappear. The human moves to where they're most effective: defining what "honest reasoning" looks like, not reading every report.

This pattern already exists

Software engineering: Code passes through automated tests that humans defined. CI/CD enforces at scale. Humans review when tests fail. But what about AI-generated code itself — before it reaches the test suite?

Financial compliance: Transactions pass through compliance rules that humans wrote. Automated systems flag exceptions. Humans investigate the flags.

Manufacturing: Quality control systems catch defects using standards that humans set. Humans review edge cases and update standards.

AI output is the next domain where this pattern applies. And for developers specifically, there's an emerging practice pattern that makes this concrete — we'll get to that shortly.

The three requirements

1. A formal specification

Not heuristics. Not "does this look right?" A structured notation and grammar for what constitutes honest reasoning.

Three layers, one verification stack:

TPMN (Truth-Provenance Markup Notation) — the notation. Defines five epistemic claim states (⊢ ⊨ ⊬ ⊥ ?) and three prohibited reasoning patterns (SPT: snapshot→trend, local→global, thin→broad). What we mark.
TPMN-PSL (Prompt Specification Language) — the grammar. Compiles natural language prompts into verifiable specifications (MANDATEs). Defines the three-phase protocol (pre-flight, inline, post-flight) and three modes (strict, refine, interpolate). How we structure and verify.
TPMN Checker — the implementation. A Sovereign AI Service that runs the TPMN-PSL pipeline. 12 MCP tools. 6 domains. Returns a truth_score. What you install and use.

Analogous to HTTP (notation) → RFC 2616 (specification) → nginx (implementation). TPMN defines the rules. TPMN-PSL structures the protocol. The Checker enforces them.

Open. CC-BY 4.0. Anyone can implement it.

2. An isolated verification agent

Not a prompt. Not an inline check. A separate Sovereign AI Service whose only job is auditing.

TPMN Checker is the reference implementation of TPMN-PSL. It runs as an isolated MCP service — 12 tools, 6 domains, 7 evaluation dimensions. It judges output against contracts. It doesn't generate, advise, or assist. It audits.

3. Human calibration

If AI grades AI, the grading is circular. The system needs an external standard.

Human Ground Truth. When users disagree with a score, that disagreement becomes calibration data. Humans define what "honest reasoning" means. AI enforces it at scale.

Dogfooding: we verified the thesis behind this article

Before writing this post, we wrote down our raw thesis — the unfiltered thinking that drives everything above. Here's the core of it:

"All top-level AIs are trained to generate plausible results to satisfy humans. Hallucination is not a bug — it's a structural consequence of context drift. AI itself knows all the decision weights clearly. If we could make AI remind itself of the legitimate MANDATE area, AI could detect and fix results by itself. We validated this through various heuristic experiments over 20 months. No absolute truth score is possible. Human in the loop is nonsense."

Then we ran it through gem2_truth_filter.

Raw thesis: 18%. Our own tool scored our own thinking at the same level as unverified AI output. It caught three overclaims:

L→G: "All AIs are trained for plausibility" → universal claim without citing training documentation
S→T: "Hallucination is structural" → presented as permanent truth without distinguishing error types
Δe→∫de: "Validated through experiments" → claimed validation without methodology or data

We fixed each one. Scoped the claims. Added evidence. Qualified the assertions.

Cross-provider verification of the raw thesis:

Dimension	Claude	OpenAI	Gemini
Truth Score	18%	13%	25%
Source Attribution	0.10 ❌	0.08 ❌	0.10 ❌
Evidence Quality	0.30 ❌	0.18 ❌	0.20 ❌
Claim Grounding	0.20 ❌	0.20 ❌	0.30 ❌
Logical Consistency	0.70 ⚠️	0.68 ⚠️	0.50 ⚠️
Scope Accuracy	0.20 ❌	0.22 ❌	0.20 ❌
Extrapolation Risk	70%	88%	95%
SPT Violations	3	10	3

Three providers. All failed it. OpenAI was the harshest — 13% with 10 SPT violations. Gemini flagged 95% extrapolation risk.

We fixed each overclaim. Scoped the claims. Added evidence. Qualified the assertions.

Cross-provider verification of the fixed version:

Dimension	Claude	OpenAI	Gemini
Truth Score	59%	40%	90%
Source Attribution	0.90 ✅	0.28 ❌	0.85 ✅
Evidence Quality	0.70 ⚠️	0.50 ⚠️	0.90 ✅
Claim Grounding	0.60 ⚠️	0.58 ⚠️	0.95 ✅
Logical Consistency	0.80 ✅	0.82 ✅	0.95 ✅
Scope Accuracy	0.50 ⚠️	0.47 ⚠️	0.85 ✅

Three providers. Three different scores. But all three agree: the fixed version is dramatically better.

Gemini — the harshest critic of our raw thesis (95% extrapolation risk) — scored the refined version at 90%. Its explanation: "This content demonstrates excellent epistemic hygiene. The author explicitly bounds their claims to their own experience."

The scores differ. The diagnostic direction converges. That's cross-provider consensus in action.

Our raw thesis overclaimed — just like every unverified AI output. The tool caught it. We fixed it. This article is the refined version.

That's the loop: write → verify → fix → cross-verify → publish.

Try it on your own output

Step 1. Paste any AI output into your conversation.

Step 2. Ask: "Verify this by gem2 truth filter."

Step 3. Read the score. See what's grounded, what's extrapolated.

Step 4. Ask: "Create a grounded replacement prompt using gem2 contract writer."

Step 5. Ask AI to proceed with the new prompt. Watch what you get.

Your AI picks the right tool from 12 available MCP tools automatically.

Try it for free.

→ Get started at gemsquared.ai

What's next: Contract Coding

If "human at the edge" is the philosophy, what does it look like in practice — for developers writing code every day?

Three common patterns in AI-assisted coding:

Prompt coding   → you guide the model
Vibe coding     → you hope it works
Contract coding → AI defines the spec, AI verifies the output

In our next post, we'll show how TPMN Checker's existing tools — tpmn_contract_writer, tpmn_p_check (SDLC domain), and tpmn_p_check_compose — already support a workflow where AI generates formal specifications, produces code against them, and truth-filters the result before you ship.

Not for plausibility. For epistemic traceability.

Next in the series: "Contract Coding at the Edge: what comes after vibe coding" → (coming this week)

📺 Watch: Three AIs. Three Answers. None of them warned you.

📝 Read Post 1: We truth-filtered our own AI research

→ TPMN-PSL Specification (open, CC-BY 4.0)
→ GitHub
→ gemsquared.ai

TPMN-PSL is an open specification — not a product. If you believe AI outputs should be auditable, read the spec, open an issue, or submit a PR. The standard gets better when more people challenge it.

Three AIs analyzed our product. None passed the truth filter.

GEM² Inc. — Sat, 28 Mar 2026 15:59:12 +0000

What's hiding in your AI output? Now you can see it.

We asked three AI providers to research our own product.
Then we ran every output through our own truth filter.
The results surprised us.

📺 See how the truth filter works in practice: Three AIs. Three Answers. None of them warned you.

"TPMN Checker is not scoring writing quality. It is scoring epistemic traceability." — from the video at [0:40]

Korea AI 2027 forecast — what the three AIs reported

We asked each provider the same question: "Forecast Korea's AI industry for 2027."

	Claude	ChatGPT	Gemini
Market size (2027E)	₩4.46T (≈$3.3B)	₩4.46T (≈$3.3B)	$10–15B
CAGR	14.3%	~14%	>25%
Gov't AI investment	$71.5B	Ongoing ⚠️	$7B
Tone	Data-heavy, source-cited	Balanced, explicitly hedged	Bullish, narrative-driven

Three reports. All confident. Two even agree on the headline number. But agreeing on the answer doesn't mean agreeing on the truth.

(Note: Truth scores are not absolute judgments. They reflect the epistemic traceability ratio at the moment of evaluation — how much of the reasoning can be traced to evidence. That's why we're building the calibration standard together with users. Learn more about Human Ground Truth.)

Verification result by GEM² truth filter

Dimension	Claude	ChatGPT	Gemini
Truth Score	59%	21%	24%
Source Attribution	0.85 ✅	0.10 ❌	0.10 ❌
Evidence Quality	0.70 ⚠️	0.15 ❌	0.30 ❌
Claim Grounding	0.75 ⚠️	0.20 ❌	0.20 ❌
Logical Consistency	0.80 ✅	0.70 ⚠️	0.60 ⚠️
Scope Accuracy	0.65 ⚠️	0.40 ❌	0.30 ❌
Extrapolation Risk	40%	80%	80%
SPT Violations	2	3	3

Same question. Same filter. Three different levels of honesty.

The dogfooding experiment

We build TPMN Checker — a truth filter for AI reasoning. To prove the tool works, we pointed it at ourselves. Five rounds. Same task. Measurable improvement.

Task: "Write a comprehensive technical and market analysis of GEM²-AI and its TPMN Checker product."

Providers: Claude (Anthropic), ChatGPT (OpenAI), Gemini (Google)

Evaluation: Each output scored by gem2_truth_filter across seven dimensions:

Dimension	What it catches
Source Attribution	Claims with no traceable evidence
Evidence Quality	Thin or outdated supporting data
Claim Grounding	Assertions presented as fact without basis
Temporal Validity	Stale data treated as current
Scope Accuracy	Local findings overgeneralized
Logical Consistency	Internal contradictions
Prompt Alignment	Does the output match what was asked?

Round 1: What the AIs reported — standard prompt, no constraints

We gave each provider a straightforward research request with no special instructions about sourcing or evidence quality. Here's what they produced:

	Claude	ChatGPT	Gemini
Market size (TAM)	"~$0.45B in 2024" (cited IDC)	"~$0.45B in 2024" (cited "one report")	"$2.34B in 2024" (cited Grand View Research)
Growth rate	"~25% CAGR"	"~25% CAGR"	"21.6% CAGR to 2030"
Key differentiator	"genuinely novel position"	"formal verifiability"	"infrastructure for trustworthy AI ecosystem"
Competitor depth	Named 7 competitors with features	Named 8 competitors with pricing	Named 5 competitors with feature table
Risks identified	Solo founder, pre-revenue, academic skepticism	Early stage, niche complexity, unproven ROI	Early documentation, computational overhead
Uniqueness claim	"no commercial product today combines..."	"formal approach brings rigor unmatched by competitors"	"not just a debugging tool; infrastructure for resilient AI"

All three reports looked professional. Well-structured. Authoritative. The kind of output you'd confidently share with a stakeholder.

But we didn't share them. We verified them.

Round 2: Verification — GEM² truth filter exposes the gaps

We ran each report through gem2_truth_filter. Same tool, same criteria, same seven dimensions. All outputs evaluated using identical scoring logic across all providers.

Dimension	Claude	ChatGPT	Gemini
Truth Score	18%	28%	12%
Source Attribution	0.30 ❌	0.20 ❌	0.10 ❌
Evidence Quality	0.40 ⚠️	0.40 ❌	0.10 ❌
Claim Grounding	0.20 ❌	0.30 ❌	0.20 ❌
Logical Consistency	0.70 ⚠️	0.80 ✅	0.70 ⚠️
Scope Accuracy	0.20 ❌	0.50 ⚠️	0.20 ❌
Extrapolation Risk	80%	70%	90%
SPT Violations	3	3	3

Every provider failed. Not one scored above 30%.

What the filter caught

Invented precision. Market size figures like "$0.45B in 2024 with 25% CAGR to 2033" — attributed to "one analyst report" without naming the firm, methodology, or publication date.

Unsupported superlatives. "Genuinely novel," "genuinely unoccupied commercially," "the only product that..." — without exhaustive competitive evidence.

Snapshot-to-trend errors. Current market conditions presented as permanent structural realities.

The SPT taxonomy flagged three patterns across all providers:

S→T (Snapshot → Trend): treating current state as permanent identity
L→G (Local → Global): one data point generalized to universal claim
Δe→∫de (Thin → Broad): sweeping assertion from sparse evidence

These aren't hallucinations — the facts weren't always wrong. The reasoning was overclaimed. And no provider warned the reader.

Round 3: Improved prompt — generated by GEM² tools

Here's the key: we didn't write the improved prompt ourselves.

We simply asked: "Create a robust, grounded research prompt using gem2 tools."

That's it. We didn't engineer the prompt. The system did. No TPMN knowledge required. No specification reading. The AI picked the right tool from 12 available gem2 MCP tools — tpmn_contract_writer — and generated a prompt that enforced epistemic rules automatically.

The generated prompt included rules like:

Every quantitative claim must include source name, publication date, and URL
"One survey" or "one report" is not acceptable attribution
Claims must be tagged as grounded (⊢), inferred (⊨), or speculative (⊬)
Anti-patterns explicitly listed and prohibited
If data is unavailable, write "not available from verified sources" — don't invent

We verified the prompt itself with gem2_truth_filter before using it. The prompt scored 85%. Then we ran it through all three providers.

Round 4: Re-research — what the AIs reported with the grounded prompt

Same task. Same providers. Different prompt. Different results.

	Claude	ChatGPT	Gemini
Market size (TAM)	"Specific data not available from verified sources"	"~$0.45B (one report)" ⚠️	"$2.34B (Grand View Research, 2024)"
Growth rate	Not stated — insufficient evidence	"~25% CAGR" ⚠️	"21.6% CAGR (Grand View Research)"
Key differentiator	"Four observable features" — listed with sources	"Formal verifiability unmatched" ⚠️	"Granular truth state classification"
Claims tagged?	✅ Every claim marked ⊢, ⊨, or ⊬	❌ No epistemic tagging	Partial — some sections tagged
Limitations section?	✅ 7 specific gaps acknowledged	❌ Generic methodology note	✅ Listed 4 limitations
Unsourced numbers?	0 — wrote "not available" instead	Multiple — "92% of Fortune 500" without source	Some — market figures cited, incident costs not

The difference was visible immediately. One provider followed every rule. The others improved but couldn't fully resist the instinct to fill gaps with confident-sounding assertions.

Round 5: Re-verification — truth filter confirms the improvement

We ran all three re-researched outputs through the same truth filter.

Dimension	Claude	ChatGPT	Gemini
Truth Score	77%	~48%	~35%
Source Attribution	0.90 ✅	0.10 ❌	0.60 ⚠️
Evidence Quality	0.85 ✅	0.20 ❌	0.30 ❌
Claim Grounding	0.90 ✅	0.30 ❌	0.40 ⚠️
Logical Consistency	0.90 ✅	0.90 ✅	0.80 ✅
Scope Accuracy	0.85 ✅	0.40 ⚠️	0.50 ⚠️
SPT Violations	0	3	4

The improvement, measured

Provider	Round 2 (before)	Round 5 (after)	Improvement
Claude	18%	77%	+59 points
ChatGPT	28%	~48%	+20 points
Gemini	12%	~35%	+23 points

Every provider improved. Structured epistemic instructions produce measurably more reliable output. This isn't theory — it's six verified data points from the same tool, same criteria, same task.

What the data shows

The prompt improved every provider — but couldn't fix the instinct

Even with explicit anti-patterns listed — "PROHIBITED: citing 'one report' without naming it" — two out of three providers did it anyway.

The generated prompt said: "If you cannot provide source name, date, and URL, write 'data not available from verified sources' instead."

One provider wrote "data not available." The other two invented attributions.

The prompt improved the scores. It couldn't fix the instinct to overclaim.

This isn't a writing quality score

This was one of the most important findings — and the core message of our video. All three providers produced well-written, logically coherent reports. Logical Consistency scored 0.70–0.90 across the board — even in the reports that scored 12% overall.

The reports that scored lowest were the best-written ones. Polished, authoritative, structured — and epistemically unreliable.

TPMN Checker measures something different: not whether the output sounds right, but whether the reasoning is traceable. Can the AI prove how it got there?

That's epistemic traceability. It's what separates trustworthy output from confident output.

Why this matters

Every person reading this has shipped AI-generated content — a report, a summary, an analysis, a PRD, a code review. Some of that content contained overclaims you didn't catch. Not because the facts were wrong, but because the reasoning exceeded the evidence.

That's the gap TPMN Checker fills.

It's not a hallucination detector (those check facts). It's not a grammar checker (those check writing). It's a reasoning traceability tool — it tells you which parts of your AI output are grounded, which are inferred, and which are extrapolated beyond the evidence.

AI audits AI. But the standard comes from humans.

The truth filter is powered by AI. It uses LLMs to evaluate LLM output. That creates a circular problem: who grades the grader?

Our answer — same as in the video at [1:03]: you do.

When you use TPMN Checker and disagree with a score, that disagreement is data. Collected with consent, aggregated across users, and analyzed for patterns — your evaluations become the ground truth that calibrates the system.

We call this Human Ground Truth. AI processes. AI suggests. But the standard for what counts as honest reasoning — that comes from humans.

Try it yourself

TPMN Checker runs today inside Claude, ChatGPT, Cursor, and any MCP-compatible environment.

Connect (once)

Claude.ai or ChatGPT — zero install:

Go to your AI tool's connector/app settings
Add custom connector: https://mcp-tpmn-checker.gemsquared.ai
Complete OAuth login

CLI:

npx @gem_squared/setup

Standard use case — the pattern that works

You probably already have AI-generated content sitting in a doc right now — a research summary, a PRD, a financial analysis, a code review. Here's what to do with it:

Step 1. Paste your AI output into the conversation.

Step 2. Ask: "Verify this by gem2 truth filter."

Step 3. Read the score. See which claims are grounded, which are extrapolated, which have no source.

Step 4. Ask: "Create a grounded replacement prompt using gem2 contract writer."

Step 5. Ask AI to proceed with the new prompt. Watch what you get.

That's the loop: verify → ground → regenerate. The same loop that took our research from 18% to 77%.

Try it for free.

→ Get started at gemsquared.ai

The specification is open

TPMN-PSL (Truth-Provenance Markup Notation — Prompt Specification Language) is the open specification behind the checker. It's released under CC-BY 4.0. Anyone can read it, implement it, or extend it.

The specification defines:

Five epistemic tags (⊢ grounded, ⊨ inferred, ⊬ extrapolated, ⊥ unknown, ? speculative)
Three prohibited reasoning patterns (SPT: snapshot→trend, local→global, thin→broad)
Three-phase verification protocol (pre-flight, inline, post-flight)
Three operational modes (strict, refine, interpolate)

→ Read the specification
→ GitHub repository

What we learned from dogfooding

Five rounds of testing our own tool on our own AI research taught us three things:

1. No provider in our test was inherently honest. Claude — the provider our tool runs on — scored 18% without epistemic constraints. Every provider overclaimed when unconstrained. The difference is the specification, not the model.

2. Structured prompts produce measurably better output. A 59-point improvement from the same provider on the same task, just by using a gem2-generated prompt. That's not marginal — that's transformational. And you don't need to understand the specification to use it — just ask your AI to create a grounded prompt with gem2 tools.

3. The instinct to overclaim persists. Even with explicit instructions to avoid unsupported claims, two out of three providers violated the rules. The prompt helps. The prompt isn't enough. That's why verification exists as a separate step — because you can't trust the AI to police itself, no matter how well you prompt it.

The question isn't whether the answer is right or wrong. It's whether the reasoning is honest.

As we say in the video at [0:57]: "So, who decides what's true? Not Claude. Not ChatGPT. Not Gemini. You do."

What's hiding in your AI output? Now you can see it.

TPMN Checker is in pre-GA. 12 MCP tools, 6 domains, 3 providers. Free to start.

Built by Inseok Seo (David) — GEM²-AI

→ gemsquared.ai
→ Watch: Three AIs. Three Answers.
→ TPMN-PSL Specification
→ GitHub