Silent scope decay. Judgment theater. Trigger collision. Three Wounds that prose skills cannot fix — and the algebraic logical language + spec that does.
In the companion post, I named the three wounds:
Silent Scope Decay, Judgment Theater, and Trigger Collision.
Here is the structural proof for each — why prose fails, why contracts don't, and what 130+ real work plans revealed.
If you haven't installed yet:
The 12 skills. MIT-licensed. Works with Claude Code out of the box. No infrastructure. No server. Git + filesystem.
npm i @gem_squared/tpmn-skill-install
From GitHub: github.com/gem-squared/tpmn-skill
The three wounds and solutions
Wound 1 — Silent scope decay: from prose to spec
Every AI — Claude, ChatGPT, Gemini — lives inside a limited context window. That limitation causes two kinds of silent erosion: context dilution and architecture drift.
Vibe-coding, Claude Skills, every prose-based method on the market today — all are subject to the same compression. When context compacts, prose loses precision. Meaning survives; accuracy does not.
But algebraic logical notation cannot be lossy-compressed without breaking syntax. Compaction cannot silently dilute what is already at minimum expression. That is why I created the TPMN spec.
Solution for Silent Scope Decay
The axiom: every AI action is a function
Every skill in the world can be written as a single algebraic expression:
F: A → B | P
This is not a metaphor. It is the irreducible kernel of what a skill is.
AI's inference process is a black box. This is an axiom, not a limitation. We cannot micro-control the weights, the direction, or the intermediate reasoning. Claude, ChatGPT, Gemini — every commercial LLM is opaque by construction. We cannot control how F operates internally.
But we can declare A, B, and P.
From the human's perspective:
- A is what I give to AI.
- B is what I want to get from AI.
- P is the set of invariants and constraints on A, B, or both — the guardrail on AI's output.
From the AI's perspective:
- A — exactly what it receives (typed input state)
- B — exactly what it produces (typed output state)
- P — what must hold true before and after execution
That is the protocol. Clean, simple, explicit.
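To make the triple concrete, here is a minimal Python sketch of a contract as data. This is illustrative only — `Contract`, the field names, and the deploy example are hypothetical, not part of the spec:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Contract:
    """F: A -> B | P — the declared boundary around a black-box F."""
    input_type: type                       # A: what the AI receives
    output_type: type                      # B: what the AI must produce
    invariant: Callable[[Any, Any], bool]  # P(a, b): must hold before/after

# Hypothetical example: deploy from a build id, never to production.
deploy = Contract(
    input_type=str,    # A = build_id
    output_type=dict,  # B = deploy record
    invariant=lambda a, b: b.get("target") != "production",
)
```

The point is that A, B, and P are declared data, inspectable before F ever runs — F itself stays opaque.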
What is a skill for? To perform a similar workflow identically, every time. Not a single task. A workflow.
So let UNIT-WORK = F.
A unit-work is a discrete, isolated F — a single mission given to AI. Whether AI completes it through one-shot inference or internally spawns many sub-agents, that is the black box. We cannot control it. We must not try. From the human's perspective, it is a single unit of work.
Now a workflow is the summation of unit-works plus the logical language describing the flow:
WF ≜ Σ unit-work + flow-logic
So:
"Claude Skill" ≡ "WF ≜ Σ unit-work + flow-logic"
The full grammar — the complete symbol set, the four-source layering (TLA+, Panini, Math, NL), the UNIT-SKILL rules R1–R7, and the AI selection pipeline — lives in "TPMN is ALL".
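The equation WF ≜ Σ unit-work + flow-logic can be read as data: an ordered set of unit-contracts plus the logic connecting them. A toy Python sketch, with all names hypothetical and the simplest possible flow-logic (each B feeds the next A):

```python
# Each unit-work is one F: A -> B | P, written here as a plain record.
unit_works = [
    {"name": "parse_request",  "A": "nl_prompt", "B": "spec",     "P": "spec is typed"},
    {"name": "implement_spec", "A": "spec",      "B": "artifact", "P": "tests pass"},
    {"name": "review_output",  "A": "artifact",  "B": "report",   "P": "no prod deploy"},
]

def compose(unit_works):
    """flow-logic in its simplest form: verify each B type matches the next A."""
    for prev, nxt in zip(unit_works, unit_works[1:]):
        assert prev["B"] == nxt["A"], f"flow break: {prev['name']} -> {nxt['name']}"
    return [u["name"] for u in unit_works]
```

Richer flow-logic (branching, retries, parallel fan-out) is what the full grammar covers; the sketch only shows that a workflow is checkable structure, not prose.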
Four properties of compaction-safe notation
After studying what survives compaction and what does not, I identified four properties that make TPMN compaction-safe:
1. No redundancy
In prose, you say things multiple ways for emphasis. "Never deploy to production. This skill is staging-only. Production deployments are not within scope." Three sentences, same constraint. The summarizer might keep one. It might keep none — because redundancy signals that the content is emphatic rather than structural.
In TPMN, each constraint appears once:
⊢ NEVER deploy to production — staging only
One line. No redundancy to collapse. The summarizer cannot compress what is already minimal.
2. Structure preserved
TPMN uses code blocks. Code blocks are atomic to summarizers. A prose paragraph is a candidate for summarization. A code block is either kept or referenced — not paraphrased.
This is not a subtle advantage. It is the difference between "the skill has some preconditions about CI" (summarized prose) and the actual precondition P: ci_passed(build_id) = ⊤ (preserved code).
3. Minimum = only representation
In prose, there is a "full" version and a "summarized" version, and they carry different information. The full version has constraints. The summary loses them.
In TPMN, the notation IS the minimum representation. There is no shorter form that preserves the meaning. P: ci_passed(build_id) = ⊤ cannot be shortened without losing the typed predicate. This means compaction either preserves the notation (because there is nothing to compress) or drops it entirely (which is detectable — the skill contract is missing).
The failure mode of TPMN under compaction is absence, not corruption. You either have the contract or you do not. You never have a corrupted half-version that looks right but lost the critical constraint. I prefer an honest absence over a silent corruption.
4. Symbols unambiguous at any depth
When a prose instruction is partially summarized, words become ambiguous. "Check the build" — check what about the build? That it exists? That it passed CI? That it was deployed?
When a TPMN symbol survives compaction, it means the same thing regardless of surrounding context. 𝔹 means boolean. ⊢ means grounded. ∧ means AND. These symbols do not drift in meaning when the context around them changes.
This matters because compaction changes the context. After compaction, the agent has a summarized version of earlier conversation plus the surviving code blocks. If the symbols in those code blocks were context-dependent, compaction would corrupt their meaning. Because they are context-independent, the surviving notation is still correct.
Wound 2 — Judgment theater: from in-the-loop to at-the-edge
Everyone agrees now: the bottleneck is not writing code.
It is verifying what the agent produced.
Here is a structural answer — not a workflow tip, a verifiable CONTRACT.
The 95% trap
- The agent is correct 95% of the time
- You stop checking
- Then it drops a production database (Replit, July 2025 — already in public discourse)
- The trap is not carelessness. It is a rational response to high accuracy.
- The structural cause: there is no formal specification of what "correct" means for a given session. So you evaluate by feel. Feel degrades under repetition.
The 95% trap is not a discipline problem.
It is an absent-contract problem.
Solution for Judgment theater
CONTRACT (F: A → B | P) plus flow-logic defines a skill. STATE is how we govern it.
Here is the flow:
- Human requests AI in natural language.
- AI formalizes the request into algebraic logical format: WF = Σ unit-contracts + flow-logic.
- AI executes each unit-work through WF.
- AI produces B for each unit-contract.
The critical paradigm shift: treat B as a state.
The AI execution state is binary: SUCCESS or FAILURE.
SUCCESS is declared if and only if the AI's output is fully aligned with the contract. Alignment means:
- The input a conforms to the contract's input type A.
- The output b conforms to the contract's output type B.
- The invariant/constraint P(a, b) holds true.
Otherwise, the state is FAILURE.
This is not subjective quality assessment. STATE does not measure whether the output is "good." It measures whether the output exists within the legitimate mandate area that was pre-defined by the contract before execution. Its purpose is to prevent AI from drifting away from the plan.
Without STATE, you review AI output against intuition. You catch obvious errors. You feel thorough. But you have no contract to check against — no ground truth to verify. That is judgment theater: the appearance of verification without the structure to make it real.
Because the verification is structural type-checking — not interpretive judgment — AI itself can evaluate whether its own output is SUCCESS or FAILURE. The contract provides a deterministic specification. The /verify-work skill does exactly this — evaluating each unit-work's output against its contract, either per-unit inline (immediately after execution) or as a batch at the end.
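The alignment check above can be sketched as structural type-checking in Python — a minimal illustration of what /verify-work conceptually does, with all names hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Contract:
    input_type: type                       # A
    output_type: type                      # B
    invariant: Callable[[Any, Any], bool]  # P(a, b)

def evaluate_state(contract: Contract, a: Any, b: Any) -> str:
    """STATE is binary: SUCCESS iff a : A, b : B, and P(a, b) all hold."""
    if not isinstance(a, contract.input_type):
        return "FAILURE"   # input a does not conform to A
    if not isinstance(b, contract.output_type):
        return "FAILURE"   # output b does not conform to B
    if not contract.invariant(a, b):
        return "FAILURE"   # invariant P violated
    return "SUCCESS"

# Hypothetical staging-only deploy contract.
staging_only = Contract(str, dict, lambda a, b: b.get("target") != "production")
```

Note what is absent: no quality score, no rubric, no judgment call. The check is deterministic, which is exactly what makes it delegable to the AI itself.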
Human-at-the-edge, not human-in-the-loop
To observe AI's work, what is a human supposed to do? Stay in the loop?
That is the wrong answer.
Why do developers love Claude Code? Because it treats multi-agent processing as its natural mode of operation — not a bolted-on feature. Claude Code autonomously spawns sub-agents to fulfill human needs. The agentic layer grows more autonomous with every release. Other platforms are converging on the same model.
We cannot control how Claude Code spawns its internal sub-agents. We must not try. Humans must step out of the loop — otherwise we become the critical bottleneck in an AI-driven system.
But we still need to observe AI's work. My answer: human-at-the-edge.
The human evaluates the workflow at the boundary — what goes in, what comes out, what must hold true. AI operates autonomously within that boundary. The human observes results, not process. This is the governance model behind the TPMN workflow lifecycle: plan → proceed → verify → archive, with the human at the edge of each transition.
Creating a workflow in TPMN is not something you hand-write from scratch. /plan-work does it for you. Claude Code decomposes your request into contracted unit-works automatically. The same 12 core skills apply through the entire workflow hierarchy, from 10 unit-works to 100 and beyond. We adapt Miller's law to bound the size of each level (7±2 units per decomposition).
Wound 3 — Trigger collision: from mechanism to management
Three independent studies converge on the same failure band.
Vercel engineer Jude Gao reported that in 56% of their eval cases, the skill was never invoked — even though the documentation was available. Independent testing by Pere Villega landed on a 50% success rate — what he called "a coin flip." Ivan Seleznov's 650-trial controlled experiment found default-configured skills activated at 77%, reaching 100% only when the trigger used imperative language.
Three methods. Three authors. Same finding: the activation mechanism does not work reliably.
And when they fail, nothing happens. No error. No warning. No log entry. Claude just proceeds without the skill, and you never know.
You cannot observe this failure. There is no stack trace. No failed assertion. No red CI badge. The skill simply doesn't load, Claude does its best without it, and the output looks plausible. You would have to run the same prompt with and without the skill, compare results, and notice the difference. Nobody did that.
Even if you could detect the failure, manual comparison destroys the value proposition. Skills exist to automate. If you must manually verify every invocation, you have replaced automation with audit theater.
Solution for Trigger collision
Here is the core design decision. The conventional approach writes one skill per case: deploy skill, test skill, review skill. N cases require N skills. Selection breaks at scale.
TPMN does not replace those skills. TPMN orchestrates them.
The 900,000+ skills in the ecosystem encode domain knowledge — Figma-to-code, Sentry triage, Kubernetes deploy, PDF generation. That knowledge is valuable. The problem is not the skills. The problem is that nothing governs how they get selected, executed, verified, or reused.
And the problem compounds. You find a skill in the marketplace, copy it locally, configure it for your project — but Claude doesn't trigger it reliably, so you manually refine through train/test cycles. Or you build your own skills through real project work — hard-earned, proprietary, proven. Either way, your local skill collection grows. How do you find the right one for the next project?
Here is the structure I built:
12 core lifecycle skills, your AI agent, and the filesystem. No infrastructure. Works with Claude Code out of the box.
TPMN provides the governance layer: 12 lifecycle skills that handle planning, execution, verification, and archival — regardless of which domain skills you use underneath.
Think of it as two layers:
TPMN layer: /plan-work → /proceed-work → /verify-work → /archive-work
(lifecycle orchestration — contracts, STATE, STATUS)
Contract layer: your-deploy-skill, figma-to-code, sentry-triage, ...
(domain knowledge — whatever skills you already have)
The 12 TPMN skills are operations on a work lifecycle: /init-session, /check-session, /search-kg, /search-skill, /plan-work, /proceed-work, /update-work-plan, /extract-skill, /verify-work, /skill-to-kg, /archive-work, /end-session. Every workflow composes from these 12 lifecycle primitives.
/skill-to-kg deserves a note: it batch-sweeps all non-protected skills and directories from .claude/skills/ (active — auto-discovered by Claude) to .gem-squared/external-skills/ (archived — invisible to Claude, but searchable by /search-skill). Only the 12 core lifecycle skills and the project identity skill remain active. Runs automatically at init-session. Restore any specific skill anytime.
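A rough sketch of what such a sweep could look like. The paths, function name, and protected set here are hypothetical — the actual /skill-to-kg implementation may differ:

```python
import shutil
from pathlib import Path

# Hypothetical protected set: the 12 lifecycle skills plus a project identity skill.
PROTECTED = {
    "init-session", "check-session", "search-kg", "search-skill",
    "plan-work", "proceed-work", "update-work-plan", "extract-skill",
    "verify-work", "skill-to-kg", "archive-work", "end-session",
    "project-identity",
}

def sweep_skills(active: Path = Path(".claude/skills"),
                 archive: Path = Path(".gem-squared/external-skills")) -> list[str]:
    """Move every non-protected entry out of Claude's auto-discovery path."""
    if not active.exists():
        return []
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in list(active.iterdir()):  # snapshot before moving entries
        if entry.name not in PROTECTED:
            shutil.move(str(entry), str(archive / entry.name))
            moved.append(entry.name)
    return moved
```

The key design point survives the sketch: archived skills are moved, not deleted, so /search-skill can still index and restore them.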
You do not throw away your existing skills. You govern them. The 12 core skills handle the lifecycle — the detailed breakdown with the "why separate" rationale for each is in the spec.
Knowledge compounds, not skills
This leads to the second inversion.
The conventional model scales by adding skills. N cases → N skills. The system grows but does not learn. Skill #47 knows nothing about Skill #1.
TPMN scales by accumulating knowledge. Every completed work-plan that passes verification — COMPLETED status, SUCCESS state — is a proven template. When /archive-work stores it, the contracts and results become searchable patterns. When /plan-work runs for a new task, /search-kg retrieves relevant proven templates. The next decomposition is not starting from zero — it is informed by what worked before.
```
/plan-work ──/search-kg + /search-skill──→ /proceed-work ──/search-kg (optional)──→ /verify-work ──→ /archive-work
     ^       (proven patterns + domain skills)            (reference during execution)                     │
     └──────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                              (store as proven)

/extract-skill ──/search-kg + /search-skill──→ upsert to .claude/skills/
                 (find source contract + check if skill exists)
```
Actually, I do not use any custom domain skills. The 12 lifecycle skills are enough — with proven unit-contracts as the reusable knowledge. /extract-skill exists for compatibility with Claude Code's .claude/skills/ ecosystem, not because I need more skills.
The lifecycle skills stay fixed. The knowledge compounds. I have been using the same core skills across various projects. The skills never changed. The work-plans are what vary — and the proven ones feed the next cycle.
Making flow-logic deterministic: the STATUS concept
Σ unit-work now has a deterministic verification layer via STATE. What about flow-logic?
To make flow-logic manageable, traceable, and computable, we need a second concept: STATUS.
Where STATE answers "did it align with the plan?", STATUS answers "where is it in the lifecycle?"
STATUS ≜ { PENDING, IN_PROGRESS, COMPLETED, BLOCKED, ABORTED }
Axiom: once a workflow is defined, every unit-work in that workflow occupies a STATUS. STATE is evaluable when a unit-work reaches a terminal status — that is, when STATUS ∈ {COMPLETED, ABORTED}. A completed unit-work is evaluated against its contract. An aborted unit-work is FAILURE by definition — its contract was never fulfilled.
Now we have two orthogonal tracking dimensions:
- STATUS tracks where each unit-work is in the lifecycle.
- STATE tracks whether each completed or aborted unit-work met its contract.
Together, they make the entire workflow observable from the edge. The human does not need to watch every step. The human reads STATUS to see progress and STATE to see results. That is human-at-the-edge.
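The two dimensions can be sketched as follows — illustrative Python, not the spec's own notation:

```python
from enum import Enum
from typing import Optional

class Status(Enum):
    """STATUS: where a unit-work is in the lifecycle."""
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    BLOCKED = "BLOCKED"
    ABORTED = "ABORTED"

TERMINAL = {Status.COMPLETED, Status.ABORTED}

def state_of(status: Status, contract_holds: bool) -> Optional[str]:
    """STATE is evaluable only at a terminal STATUS.
    ABORTED is FAILURE by definition; COMPLETED is checked against its contract."""
    if status not in TERMINAL:
        return None  # STATE not yet defined for this unit-work
    if status is Status.ABORTED:
        return "FAILURE"
    return "SUCCESS" if contract_holds else "FAILURE"
```

Orthogonality is the point: STATUS changes as work progresses, while STATE is computed once, at the terminal transition, from the contract alone.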
I applied this model across 130+ real work plans in production. The full lifecycle — from planning through execution to verification and archival — is governed by the same 12 skills throughout.
How TPMN compares to existing frameworks
Three frameworks have emerged in 2026 to address skill reliability. Each solves part of the problem. None solves all of it.
Superpowers (150K+ stars) provides 14 composable skills with a master dispatcher and a "1% chance = must invoke" enforcement rule. It covers brainstorming, planning, TDD, debugging, code review, and verification. Its verification-before-completion skill enforces "evidence before claims" — run the command, read the output, then claim the result. Strong methodology. But it is a software development workflow — it does not generalize beyond SDLC, and its skills are prose instructions, not typed contracts.
Agent Skills Standard (Benjamin Abt, Feb 2026) defines a five-component "quality contract" — scope, decision logic, constraints, output contract, quality gates. The closest to TPMN's philosophy: one skill = one bounded scenario with explicit non-goals. But it is an authoring standard for individual skills, not lifecycle orchestration. There are no lifecycle primitives, no STATE/STATUS tracking, and no mechanism for proven patterns to inform future work.
The hook-driven governance layer (HackerNoon, Mar 2026) attacks the trigger problem directly: a UserPromptSubmit hook forces Claude to evaluate every skill before responding, raising activation from ~25% to 90%+. Effective enforcement. But it governs whether skills fire, not what happens after they do. No planning, no verification, no archival. And it is locked to Claude Code's hook system.
TPMN Skill Standard (open spec, MIT-licensed) takes a different approach. Instead of enforcing trigger reliability or standardizing skill authoring, TPMN provides a governance layer that sits above all skills. Every unit of work is a typed contract (F: A → B | P) with algebraic STATE (SUCCESS/FAILURE) and lifecycle STATUS (PENDING → IN_PROGRESS → COMPLETED). Verification is structural — the AI checks its own output against the contract, not against a prose checklist. And the cycle closes: proven contracts are archived, indexed, and retrieved by /search-kg to inform the next /plan-work. The 12 lifecycle skills are domain-agnostic — they govern legal review, content production, and data analysis the same way they govern software development.
| Dimension | Superpowers | Agent Skills Standard | Hook Governance | TPMN |
|---|---|---|---|---|
| Typed contracts (A→B\|P) | No — prose instructions | Partial — "output contract" in prose | No | Yes — algebraic |
| Lifecycle orchestration (plan→execute→verify→archive) | Partial — plan + verify, no archive | No | No | Yes — 12 skills |
| Knowledge accumulation (proven patterns feed future work) | No | No | No | Yes — /archive-work → /search-kg → /plan-work |
| STATE/STATUS tracking | No | No | No | Yes — binary STATE + lifecycle STATUS |
| Verification model | "Run the command, read the output" | Quality gates (prose checklist) | N/A (enforcement only) | Contract type-checking (A, B, P alignment) |
| Domain scope | SDLC only | SDLC only | SDLC only | Any domain — lifecycle is domain-agnostic |
| Platform scope | Multi-platform (Claude, Codex, Cursor, Gemini) | GitHub Copilot | Claude Code only | Platform-agnostic (any AI agent that reads markdown) |
| Skill count | 14 (development methodology) | Per-project (authoring standard) | 26 (domain-specific) | 12 (lifecycle primitives) + any domain skills |
Every framework above targets software development. TPMN's lifecycle — plan, execute, verify, archive — is not about code. It is about any workflow where a human delegates work to AI: legal review, content production, data analysis, research synthesis, compliance audits. The 12 skills govern the lifecycle of work itself, not the domain the work belongs to.
The gap is not in any single capability. Superpowers has strong verification. Abt has strong skill structure. The hook layer has strong enforcement. The gap is that nobody combines all three: (1) algebraic contracts that make verification structural, (2) lifecycle orchestration that governs work from plan to archive, and (3) knowledge accumulation where proven patterns compound across projects. Each framework covers one piece. None closes the full loop.
The origin of TPMN
How do we describe flow-logic deterministically?
My solution draws from four sources:
- TLA+: Leslie Lamport's temporal logic for specifying concurrent systems — exactly what multi-step AI workflows are. For deterministic logical flow.
- Panini: The ancient Sanskrit grammarian who solved semantic subject-disambiguation 2,500 years ago — the exact ambiguity problem that makes prose skills fail. For removing semantic ambiguity.
- Mathematical notation: For programmatic decisions and formal constraints.
- Natural language comments: To complement the subjective meaning that formal notation cannot carry alone.
This is the origin of TPMN — an Algebraic Logical Language for AI skill specification.
Why does formal notation matter over prose? Because context compaction destroys prose instructions while formal notation survives intact — a 5.3x measured density advantage that determines whether your skills keep working in long sessions. Your prose constraints dilute under compaction. Your architecture boundaries drift between sessions. Neither failure is detectable. That is silent scope decay — and TPMN is the cure.
TPMN extends to two systems
From this algebraic foundation, TPMN extends into two systems:
- TPMN — Truth-Provenance Markup Notation — an open specification language for structuring and auditing AI reasoning. Platform-agnostic — treats LLMs as black boxes. Within TPMN:
- TPMN-PSL (Prompt Specification Language) is the formal grammar that compiles natural-language prompts into MANDATE — computable, verifiable specifications. Three-phase protocol: P-phase (prompt → MANDATE), Inline (epistemic tagging), O-phase (output verification → truth_score).
- TPMN-checker is the reference implementation — a Sovereign AI Service (SAS), a microservice exclusively owned and controlled by a dedicated AI actor.
- TPMN SKILL STANDARD — a contract-driven AI workflow lifecycle management framework. The full specification for defining, executing, verifying, and governing AI skills. This includes the 5 structural rules that keep skills within the reliability zone, and 12 core skills that orchestrate the lifecycle for any workflow.
Each of these gets its own post. This one is the foundation — the algebraic origin story that everything else builds on.
GEM² — the name
My company is named GEM². The acronym is the whole philosophy of this post compressed into six letters:
GEM²_Definition ≜ [
acronym: "Grounded Existence Matrix for Global Entropy Minimum",
expansion: [
Grounded: "Every A is grounded by contract — no hidden input",
Existence: "Every B is verifiable by evidence — no claimed output",
Matrix: "A, B, P are the three axes every unit-work is computed on",
Global: "Workflow is the summation of connected unit-contracts",
Entropy: "Disorder = the gap between contract and evidence",
Minimum: "SUCCESS is the minimum-entropy state — contract and evidence align"
]
]
David Seo — GEM².AI