Anthropic Skills Is Quietly Killing the Prompt Management Category

Gabriel Anhaia

An engineer I work with was a week from signing a prompt-management renewal in the mid-five figures. The procurement form was already on the way back when they pasted a single folder into the Claude Agent SDK and replaced a chunk of what the platform was doing. The folder had a SKILL.md, two scripts, and a references/ directory. That was the whole feature.

Anthropic announced Agent Skills in October 2025, and the open standard followed in December 2025. Through Q1 2026 Skills shipped across four surfaces: Claude.ai, Claude Code, the Agent SDK, and the Messages API. The official docs describe them as modular capabilities that extend Claude's functionality. That undersells what is actually happening. Skills moved the things prompt-management SaaS was charging for into the model runtime.

What a Skill actually is

A Skill is a directory. The minimum viable version looks like this on disk:

my-skill/
├── SKILL.md
├── scripts/
│   └── extract_invoice.py
└── references/
    └── tax_rules.md

SKILL.md is YAML frontmatter plus markdown:

---
name: invoice-extractor
description: Extracts line items, tax, and totals from
  invoice PDFs in EU and US formats. Use when the user
  provides an invoice file or asks about extracted
  invoice data.
---

# Invoice Extractor

When the user provides an invoice, run
`scripts/extract_invoice.py` against the file. Return
the JSON output. For tax handling, consult
`references/tax_rules.md` before answering questions
about applicable rates.

The description field is load-bearing. At runtime, Claude pre-loads only the name and description of every installed skill into its system prompt. The body of SKILL.md and any scripts or references are pulled in lazily, when the description matches the user's request. Anthropic calls this progressive disclosure. It is the entire architectural trick.
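
To make the trick concrete, here is a minimal sketch of the loading model. This is my illustration, not Anthropic's loader; the only assumption is a local skills/ directory laid out like the example above.

# Illustration of progressive disclosure, not Anthropic's code.
# Eagerly loaded: frontmatter only. Lazily loaded: the body.
from pathlib import Path
import yaml

def preload_index(skills_dir: str) -> list[dict]:
    """Only name + description from each SKILL.md reach the system prompt."""
    index = []
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        meta = yaml.safe_load(skill_md.read_text().split("---")[1])
        index.append({"name": meta["name"],
                      "description": meta["description"]})
    return index

def load_body(skills_dir: str, name: str) -> str:
    """Pulled in only after the description matches the request."""
    text = (Path(skills_dir) / name / "SKILL.md").read_text()
    return text.split("---", 2)[2]  # everything after the frontmatter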

If you have ever paid for a "prompt registry," you have already paid for half of this.

A Skills call in 30 lines

Here is a Claude-with-Skills call against the Messages API, via the Python SDK:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    # Skills are attached by ID; their bodies never enter this file.
    container={
        "skills": [
            {"type": "anthropic", "skill_id": "pdf"},
            {"type": "custom", "skill_id": "invoice-extractor"},
        ]
    },
    # Skill scripts run in the code-execution sandbox, so the tool
    # has to be enabled on the request.
    tools=[{"type": "code_execution_20250825",
            "name": "code_execution"}],
    messages=[{
        "role": "user",
        "content": "Extract totals from invoice_q1.pdf",
    }],
)

print(response.content)

That is the whole thing. The Skill is referenced by ID. The model decides whether to invoke it by reading the description. The body of the skill never enters your application code, never sits in a vendor's database, and never round-trips through your prompt-management dashboard. Versioning happens in the same place the code lives: git.

Compare it to the traditional shape:

from anthropic import Anthropic
from prompt_vendor import Client as PromptClient

client = Anthropic()
pm = PromptClient(api_key="...")

template = pm.get_prompt(
    name="invoice_extractor",
    label="production",
    version=14,
)

system = template.compile(
    tax_rules=load_tax_rules(),
    output_format="json",
)

response = client.messages.create(
    model="claude-opus-4-7",
    system=system,
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": "Extract totals from invoice_q1.pdf",
    }],
)

Two services. A version tag. A label promotion workflow. A compile step. A latency hit on every call to fetch the prompt. A mental model where the prompt is a string template that lives somewhere else, gets pulled in, and gets stitched into the request.

Skills replace the entire middle column: one folder in your repo, behind whatever review process your code already has, holding the prompt, the tools, the references, and the version.
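
Concretely, the middle column collapses into a folder that lives next to the code that consumes its output (layout illustrative):

app/
├── src/
│   └── billing.py
├── skills/
│   └── invoice-extractor/
│       ├── SKILL.md
│       ├── scripts/
│       └── references/
└── tests/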

Why the SaaS pitch deflates

Look at the four things prompt-management vendors typically sell. PromptLayer pitches version control and registries. Helicone bundles prompts on top of LLM observability. Pezzo, Langfuse prompts, and the long tail repeat the same shape. Every one of them lists roughly:

  1. Prompt versioning with labels (staging, production).
  2. Prompt + variable templating.
  3. Prompt-attached metadata (model, tools, parameters).
  4. A way to swap prompts without redeploying.

Skills do (1) through (3) at the model layer. Versioning is git. Templating is markdown plus optional scripts. Metadata is the SKILL.md frontmatter plus the container parameter on the API call. The fourth pitch is swap-without-redeploy. It was always the weakest, because most teams who said they wanted it ended up wanting the opposite when an outage hit and they couldn't tell which prompt had been live.
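
That last failure mode is worth dwelling on, because git answers it directly. A sketch of the idea, assuming the skill folders live under skills/ in the repo; the stamping step is illustrative:

# Answer "which prompt was live during the outage" with a git
# lookup instead of a dashboard: stamp the skill's last commit
# into the build artifact at deploy time.
import subprocess

def skill_revision(path: str = "skills/invoice-extractor") -> str:
    """SHA of the last commit that touched this skill's folder."""
    return subprocess.check_output(
        ["git", "log", "-1", "--format=%H", "--", path],
        text=True,
    ).strip()

# Write alongside the deployment, e.g. into a VERSION file.
print(skill_revision())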

There is also a quieter shift. Skills can ship scripts, not just text. That collapses the boundary between "prompt template" and "tool definition." A folder that can include extract_invoice.py carries more than a typed text box can. Anthropic has been explicit about this direction: Skills are positioned as the unit of agent capability.

The vendors aren't wrong about the problem. Teams genuinely need versioning, governance, and a way to roll back. They are wrong about the layer where that should live. The prompt-management thesis was: prompts are external assets, manage them like SQL migrations. The Skills thesis is: prompts are code, manage them like code.

Code wins this argument every time.

A specific scenario where the SaaS still loses

Imagine a fintech with an invoice-extraction prompt running across 14 customer-specific configurations. In the old shape, each customer gets a labeled prompt version in PromptLayer. The QA team promotes them. The platform maintains a dashboard showing which version is live for which tenant.

In the Skills shape, each customer gets a Skill folder in a customers/ directory. CI builds them into the deployment. Tenant routing picks the skill_id. Rollbacks are a git revert. Diff review is the same review your engineers already do for the tax-calculation code that uses the extracted output.
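
A sketch of that routing, reusing the API shape from earlier; the tenant map and skill IDs are invented for illustration:

from anthropic import Anthropic

# Hypothetical tenant -> skill mapping; in this scenario it would
# be generated from the customers/ directory at build time.
TENANT_SKILLS = {
    "acme": "invoice-extractor-acme",
    "globex": "invoice-extractor-globex",
}

client = Anthropic()

def extract_for(tenant: str, prompt: str):
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        container={"skills": [
            {"type": "custom", "skill_id": TENANT_SKILLS[tenant]},
        ]},
        tools=[{"type": "code_execution_20250825",
                "name": "code_execution"}],
        messages=[{"role": "user", "content": prompt}],
    )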

The fintech does not save money on the line item the SaaS bills for. They save it on the cognitive load of running a parallel content-management system for prompts. That is the cost the SaaS pitch never put on the slide.

What survives

The prompt-management SaaS pitch is collapsing. The categories adjacent to it are not.

Eval harnesses. Skills give you a unit to ship. They do not tell you whether the new version is better than the old one. You still need a labeled dataset, a regression suite, and a way to run both versions against the same inputs. Tools like Promptfoo, Inspect, and the eval features in Braintrust sit downstream of where Skills end. If anything, Skills make the eval layer more important: you now ship prompts as fast as you ship code, and prompts without evals break the same way code without tests does.
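
A bare-bones version of that regression loop, assuming two skill IDs and a hand-labeled dataset (both invented here); a real harness does far more, but the shape is the point:

from anthropic import Anthropic

client = Anthropic()

# Labeled (input, expected substring) pairs -- illustrative.
DATASET = [("Extract totals from invoice_q1.pdf", "1042.50")]

def run(skill_id: str, prompt: str) -> str:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        container={"skills": [{"type": "custom", "skill_id": skill_id}]},
        tools=[{"type": "code_execution_20250825",
                "name": "code_execution"}],
        messages=[{"role": "user", "content": prompt}],
    )
    return "".join(b.text for b in resp.content if b.type == "text")

for prompt, expected in DATASET:
    old = run("invoice-extractor", prompt)
    new = run("invoice-extractor-v2", prompt)
    if expected in old and expected not in new:
        print(f"REGRESSION: {prompt!r}")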

Observability. Tracing what Claude actually did with the Skills you gave it is its own problem. Helicone, Langfuse, Arize Phoenix, and Honeycomb's LLM tracing all survive. The question "which skill fired and what did it do" is a runtime question, separate from versioning. Based on public marketing, the vendors that pivoted hard into observability instead of betting their growth on prompt registries appear to be in the stronger seat now.

Governance for regulated industries. A bank cannot put production-customer prompts in a public GitHub repo. They can put them in an internal monorepo with a SOC 2 review, but they will still want approval workflows, audit logs of who changed which Skill, and red-team sign-off before deployment. That is a real product surface, and it is the one most prompt-management vendors should be aiming at.

Skill catalogs and discovery. As Skills proliferate inside a company, finding them, deduping them, and seeing which ones are actually used becomes a problem. This is closer to package management than to prompt management. The vendor who builds the npm of Skills has a real business: internal registry, search, dependency tracking, vulnerability scanning of skill scripts. It is not the same business as the one PromptLayer was running.

What gets commoditized

Versioned prompt storage. Label-based promotion (staging/production). Variable templating. Retrieving a prompt by name. Storing a prompt's metadata next to its text. The basic shape of "Postman for prompts."

If a vendor's home page demo stops at "edit your prompt, save a version, deploy with a label," that vendor is selling you something the model platform now ships in its SDK. The conversation a buyer should have is whether the rest of the product (evals, observability, governance) is good enough to justify the bundle.

Some vendors have already made the move. On my read, Helicone's current marketing leads with observability; Langfuse has historically led with tracing rather than prompts. The ones still leading with version-controlled prompt registries on the marketing site are reading the room badly.

What to do at renewal

If you are building a new system in 2026, start with Skills. Put the folders in your repo. Wire them into CI. Use git for versioning, code review for governance, and a real eval harness to gate releases. Pay for observability and evals; do not pay for a hosted prompt registry unless the rest of the product is doing work the model platform doesn't.

If you are already paying for prompt management, run the math at renewal. Pull one of your prompts out, turn it into a Skill, ship it through the Agent SDK, and see what part of the SaaS contract you actually still need. In conversations I've had, the answer is often "the dashboard, sometimes." That is not a five-figure-a-year problem.
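
The pull-one-out experiment is a small script, not a project. A sketch, reusing the hypothetical prompt_vendor client from the earlier example; template.text stands in for however your vendor exposes the raw body:

# One-off migration: fetch the production template and write it
# out as a Skill folder to review and commit.
from pathlib import Path
from prompt_vendor import Client as PromptClient

pm = PromptClient(api_key="...")
template = pm.get_prompt(name="invoice_extractor", label="production")

skill = Path("skills/invoice-extractor")
skill.mkdir(parents=True, exist_ok=True)
(skill / "SKILL.md").write_text(
    "---\n"
    "name: invoice-extractor\n"
    "description: Extracts line items, tax, and totals from invoice PDFs.\n"
    "---\n\n"
    + template.text  # hypothetical field for the raw template body
)
# From here: commit, review, and let the label workflow retire.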

The category isn't dead. The middle of it is hollowing out, and the vendors that don't move toward evals, observability, or true governance are going to look like CMS companies after WordPress shipped Gutenberg.

If this was useful

The Prompt Engineering Pocket Guide covers the patterns underneath all of this — how to write a prompt that's worth versioning, how to design system messages that survive a model upgrade, and how to structure a prompt so a Skill or a SaaS or a plain string call all behave the same. If you're sorting through the Skills migration and want a reference that's about the prompts themselves, not the tooling around them, it's the one I'd hand you.

Prompt Engineering Pocket Guide
