Everyone gives agents skills. I made skills hatch their own agents.

EternalRights — Fri, 19 Jun 2026 12:43:20 +0000

I spent two weeks writing a SKILL.md for our internal agent stack at work. Every API endpoint, every MCP config, every token-budget rule I'd bled for over three months, written down in clean markdown. I handed it in as a deliverable.

Then I watched Claude Code run it.

My strict requirements, the ones in bold, the ones that said "check environment variables first" and "exit immediately on type X errors"? The model skimmed past them like they were terms of service. It treated the skill as a reference book, not a contract. Every run, the interpretation drifted a little. One run it drifted far enough that a bug which should have died in staging almost made it to production.

That was the night I decided skills shouldn't be interpreted. They should be compiled.

I'm an agent developer intern at DiDi and a junior CS student. I build this stuff for a living. And I'm tired of watching LLMs selectively ignore prose that took me weeks to write. So I built agenthatch. It turns a SKILL.md into a standalone, pip-installable Python agent with typed tools, a state machine, and its own runtime. Not a prompt wrapper. Actual generated code.

pip install agenthatch
agenthatch init
agenthatch skills add ./my-skill/SKILL.md
agenthatch hatch my-skill
agenthatch run my-skill

After install and init, three commands take you from markdown to a running agent: add the skill, hatch it, run it. The rest of this post is the why and the how, with source file references so you can verify I'm not making this up.

The problem isn't your skill. It's the format.

Let me be precise about what's broken.

A SKILL.md file is prose. Human-written prose for humans to read. You then paste it into a system prompt and ask an LLM to figure out what you meant, at runtime, every single turn.

One skill, fine. Three skills, manageable. Five or more? Things break.

| What happens | Why |
| --- | --- |
| Skills leak into each other | Every skill shares one context window. A file-organizer skill and a git-ops skill cross-contaminate. The agent applies logic from one to the other. |
| The agent skim-reads | Long skills get treated as loose suggestions. The model picks the parts that look relevant and ignores the rest. |
| Token waste | Every skill lives in the system prompt. 5 skills at 3KB each means 15KB burned before the conversation starts. Long tasks compound this. |
| No validation | A typo in a tool name, a missing parameter, an ambiguous instruction. None of it gets caught until runtime, and by then you're 20 turns deep. |
| Scale decays | 1 to 3 skills works. At 10+ it's chaos. No dependency graph, no conflict detection, no way to know which skill overrides which. |

This isn't an Anthropic bug or an OpenAI bug. It's not your skill being bad either. It's architectural decay. You're asking one LLM to interpret seven pieces of zero-isolation prose in the same context window, and the interpretation shifts every run.

Think about it this way: you hand someone seven operating manuals and ask them a question. They have to flip through all seven books every time and stitch an answer together. That person would lose their mind. The LLM does too, just quietly.

The core issue is that SKILL.md is prompt engineering, not software engineering. There's no compile step, no type checking, no contract between you and the model.

The thesis: skills are source code, agents are binaries

This is the part I want you to actually think about, even if you skip the rest.

Right now everyone is writing skills. For Claude Code, for Codex CLI, for OpenClaw, for whatever comes next. And in all of these, the skill's role is the same: it's prose stuffed into a system prompt, interpreted at runtime by a host agent. The skill is a prompt accessory. It can't exist on its own.

That's the wrong paradigm.

Skills shouldn't be prompt accessories. Skills should be agent source code.

Java compiles to bytecode for the JVM. TypeScript compiles to JavaScript for the browser. The compile step exists because it converts human expression into a format the machine can execute deterministically, before runtime. Typos, type errors, ambiguity, all of it gets caught at compile time, before the program ever runs.

SKILL.md has no compile step. It hands raw prose to an LLM and hopes for the best, every turn, forever.

So the endgame for skills isn't "write them better." It's "compile them into agents." A skill should be the hatching input for an agent. You write the skill, a compiler turns it into an independent runtime with its own tools, its own state machine, its own config. The skill stops being a prompt fragment and becomes a program.

That's what agenthatch does. You still write skills in markdown, that part doesn't change. The generated agent runs independently, no host required. agenthatch is the step in between, the compiler. javac is to .java what agenthatch is to SKILL.md.

Once you internalize this, a bunch of things click into place:

Skills stop burning tokens in the system prompt. Compiled agents carry about 150 bytes of runtime config.
Each skill becomes an isolated agent. No more cross-contamination.
Schema validation happens at compile time. Typos and ambiguity die before runtime.
The output is a real Python package. pip install, import, run it anywhere, no host required.
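
To make the compile-time validation point concrete, here is a minimal sketch. It is not agenthatch's validator: the spec shape and the names `validate_spec`, `SpecError`, `tools`, and `steps` are hypothetical, chosen only to show a misspelled tool name and a missing parameter failing before anything runs.

```python
from dataclasses import dataclass


@dataclass
class SpecError:
    """One problem found while checking a spec (hypothetical type)."""
    location: str
    message: str


def validate_spec(spec: dict) -> list[SpecError]:
    """Check that every step names a declared tool and passes its required params.

    `spec` is an illustrative shape, not agenthatch's real AHSSPEC:
    {"tools": {name: [required params]}, "steps": [{"tool": ..., "args": {...}}]}
    """
    errors: list[SpecError] = []
    tools = spec.get("tools", {})
    for i, step in enumerate(spec.get("steps", [])):
        name = step.get("tool", "")
        if name not in tools:
            errors.append(SpecError(f"steps[{i}]", f"unknown tool '{name}'"))
            continue
        missing = [p for p in tools[name] if p not in step.get("args", {})]
        if missing:
            errors.append(SpecError(f"steps[{i}]", f"missing params: {', '.join(missing)}"))
    return errors


spec = {
    "tools": {"read_file": ["path"], "git_commit": ["message"]},
    "steps": [
        {"tool": "read_file", "args": {"path": "README.md"}},
        {"tool": "git_comit", "args": {"message": "fix"}},  # typo on purpose
        {"tool": "git_commit", "args": {}},                 # missing parameter
    ],
}
problems = validate_spec(spec)
for p in problems:
    print(p.location, p.message)
```

Both problems surface in one pass over the spec, with no model call involved.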

The pipeline: three phases, six harnesses

Here's the architecture. Every file path I mention is real, you can open it in the repo.

SKILL.md  →  Parse  →  6-Harness LLM Pipeline  →  Code Generation  →  Runnable Agent
  (input)   (Phase 1)    (Phase 2: AI inference)     (Phase 3: Jinja2)     (output)

Phase 1: deterministic parse, zero AI

Phase 1 doesn't use AI. It reads the SKILL.md, pulls out the frontmatter, the body, and every file in the skill directory. Pure filesystem operations. The entry point is assemble_context() in parser.py:

def assemble_context(skill_path: str | Path) -> ContextPack:
    skill_dir = _resolve_skill_directory(Path(skill_path))
    dir_name = skill_dir.name
    manifest = _discover_files(skill_dir)
    frontmatter, body, warnings = _best_effort_parse_yaml(skill_dir)
    return ContextPack(frontmatter, body, manifest, dir_name, warnings, skill_dir)

The key design decision here: Phase 1 makes no semantic judgment. It doesn't try to guess whether a file is a script, a doc, or a config. That's Phase 2's job. Phase 1 just reads bytes, computes SHA-256 hashes, and does YAML parsing.
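
As a rough illustration of "reads bytes, computes SHA-256 hashes", a manifest builder could look like the sketch below. The function name `build_manifest` and the returned dict shape are assumptions for this example; the real `_discover_files` may differ.

```python
import hashlib
from pathlib import Path


def build_manifest(skill_dir: Path) -> dict[str, dict[str, object]]:
    """Map each file's relative path to its size and SHA-256 digest.

    Hypothetical stand-in for a manifest builder. No semantic
    classification happens here, only bytes and hashes.
    """
    manifest: dict[str, dict[str, object]] = {}
    for path in sorted(skill_dir.rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        manifest[path.relative_to(skill_dir).as_posix()] = {
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        }
    return manifest
```

Because the output depends only on file contents, two runs over an unchanged skill directory produce the same manifest, which is what makes this phase safe to cache.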

There's a small detail I like in the file reader. It checks binary magic numbers to skip PNGs, JPEGs, PDFs, ZIPs, and friends:

_BIN_SIGS: list[bytes] = [
    b"\x89PNG\r\n\x1a\n",
    b"\xff\xd8\xff",
    b"GIF89a",
    b"%PDF",
    b"PK\x03\x04",
    b"Rar!\x1a\x07",
    b"\x1f\x8b",
    b"BZh",
    b"\xca\xfe\xba\xbe",
]

Files over 1MB get skipped. Files with null bytes in the header get skipped. If something can be handled deterministically, don't ask an LLM. That principle shows up over and over in this codebase.
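
Put together, the three skip rules fit in a few lines. This is a sketch under assumptions, not the repo's reader: `should_skip` is a made-up name, the signature tuple is a subset of the list above, and the 512-byte header window is a guess since the post doesn't state the real value.

```python
from pathlib import Path

# A subset of the signatures listed above; extend as needed.
BIN_SIGS: tuple[bytes, ...] = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"%PDF", b"PK\x03\x04")
MAX_BYTES = 1024 * 1024  # the 1MB cap from the text
HEADER_BYTES = 512       # assumed header window; the real value is not stated


def should_skip(path: Path) -> bool:
    """Return True when a file is too large or looks binary."""
    if path.stat().st_size > MAX_BYTES:
        return True
    with path.open("rb") as fh:
        header = fh.read(HEADER_BYTES)
    if header.startswith(BIN_SIGS):  # bytes.startswith accepts a tuple of prefixes
        return True
    return b"\x00" in header
```

The size check runs first so a huge file is rejected from its metadata alone, without opening it.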

Phase 1.5: AST signature extraction

This was added in v0.8 and I think it's the most underrated part of the whole project.

Phase 1.5 uses Python's built-in ast module to parse Python scripts and regex to parse shell scripts, extracting function signatures. Deterministic, zero LLM. This feeds into Harness C for precise interface inference.

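import ast as _ast  # assumed: the excerpt's _ast is the stdlib ast module under an alias
from pathlib import Path

def extract_python_signatures(file_path: Path) -> list[ToolSchema]: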
    """AST-parse a Python script, extract public function signatures.
    Deterministic, zero LLM. Uses Python's built-in ast module.
    Skips private functions (those starting with _).
    """
    try:
        tree = _ast.parse(file_path.read_text(encoding="utf-8"))
    except (SyntaxError, UnicodeDecodeError, OSError):
        return []

    functions: list[ToolSchema] = []
    for node in _ast.walk(tree):
        if isinstance(node, _ast.FunctionDef) and not node.name.startswith("_"):
            args: list[dict[str, str | None]] = []
            for arg in node.args.args:
                arg_type: str | None = None
                if arg.annotation:
                    try:
                        arg_type = _ast.unparse(arg.annotation)
                    except Exception:
                        arg_type = None
                args.append({"name": arg.arg, "type": arg_type})
            functions.append(ToolSchema(...))
    return functions

Why bother? Because Harness C has to design tool signatures. If it reads raw script contents, it burns tokens and hallucinates. Hand it a 1KB compact signature summary extracted via AST, and the inference quality jumps. This is compiler thinking: extract deterministically whatever you can, leave only the genuinely ambiguous stuff for the LLM.
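
The shell side of Phase 1.5 isn't shown in this post, so here is a minimal sketch of what regex-based extraction could look like. The pattern and the name `extract_shell_functions` are assumptions; it only recovers function names, and it mirrors the Python extractor by skipping names that start with an underscore.

```python
import re

# Matches `name() {`, `function name {`, and `function name() {` at line start.
_SHELL_FUNC = re.compile(
    r"""
    ^\s*
    (?:
        function\s+(?P<kw>[A-Za-z_][A-Za-z0-9_]*)\s*(?:\(\s*\))?   # function name [()]
      |
        (?P<plain>[A-Za-z_][A-Za-z0-9_]*)\s*\(\s*\)                # name()
    )
    """,
    re.MULTILINE | re.VERBOSE,
)


def extract_shell_functions(source: str) -> list[str]:
    """Return public shell function names in definition order."""
    names: list[str] = []
    for match in _SHELL_FUNC.finditer(source):
        name = match.group("kw") or match.group("plain")
        if not name.startswith("_"):
            names.append(name)
    return names


script = """#!/bin/bash
# deploy() { commented out
build() {
  echo building
}
function deploy {
  build
}
function  clean_up() {
  rm -rf out
}
_helper() { :; }
"""
print(extract_shell_functions(script))
```

A regex is a blunt tool for shell (heredocs and quoted strings can fool it), but it is deterministic and cheap, which is the trade this phase is making.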

Phase 2: six AI harnesses

This is the heart. Six specialized harnesses process the skill, each with its own persona and temperature. The config is hardcoded in engine.py:

HARNESS_CONFIG: dict[str, dict[str, Any]] = {
    "A": {"thinking": True, "temperature": 0.1,
          "reason": "Identity extraction is deterministic — low temp for consistency"},
    "B": {"thinking": True, "temperature": 0.5,
          "reason": "Intent inference requires creativity for long-tail triggers"},
    "C": {"thinking": True, "temperature": 0.5,
          "reason": "Interface inference is complex — needs SKILL.md + ScriptManifest"},
    "D": {"thinking": True, "temperature": 0.3,
          "reason": "Base detection needs precision — moderate temp"},
    "E": {"thinking": True, "temperature": 0.2,
          "reason": "Assembly validation is structured — low temp for consistency"},
    "F": {"thinking": True, "temperature": 0.3,
          "reason": "MCP config extraction needs exact matching — moderate temp"},
}

Every temperature has a reason. Identity extraction is deterministic, so temp drops to 0.1. Intent inference needs to cover long-tail triggers, so it gets 0.5 for creativity. Assembly validation is structured, so 0.2.

| Harness | Job | Model tier | Temp |
| --- | --- | --- | --- |
| A: Identity | Extract name, version, description from frontmatter | small | 0.1 |
| B: Intent | Infer trigger phrases and user intents | small | 0.5 |
| C: Interface | Design tool signatures, parameters, return types | large | 0.5 |
| D: Base | Detect runtime base class and instruction structure | large | 0.3 |
| E: Assembly | Cross-validate all harness outputs, produce AHSSPEC | small | 0.2 |
| F: MCP | Detect and configure MCP server connections | small | 0.3 |
Why six? Because I tried one giant prompt that did everything, and the output was a lottery. Once I split it up so each harness does one thing, quality went up significantly. It's the same reason compilers split the frontend into lexer, parser, and semantic analysis. Single responsibility.

Each harness runs an Analyze, Infer, Self-Validate, Correct loop with up to two internal retries. Every harness has its own validate_output(). Harness A checks that identity.id is kebab-case:

def validate_output(self, result: dict[str, Any]) -> tuple[bool, str]:
    identity = result.get("identity", {})
    identity_id = identity.get("id", "")
    if not identity_id:
        return False, "identity.id is empty"
    if not re.match(r"^[a-z0-9]+(-[a-z0-9]+)*$", identity_id):
        return False, f"identity.id '{identity_id}' is not kebab-case"
    if not identity.get("display_name"):
        return False, "identity.display_name is empty"
    return True, ""

Harness B checks that the number of triggers is between 5 and 15, that satisfies has between 3 and 8 entries, and that summary is at least 20 characters long. These constraints aren't the LLM's call. They're enforced by code. If the LLM produces non-compliant output, it gets sent back for a redo.
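
The validate-and-retry loop is easy to picture in code. The sketch below is illustrative only: `run_with_retries`, `validate_intent`, and `fake_infer` are hypothetical names, the LLM call is replaced by a stub, and only two of the Harness B bounds are checked.

```python
from typing import Any, Callable

MAX_RETRIES = 2  # "up to two internal retries"


def validate_intent(result: dict[str, Any]) -> tuple[bool, str]:
    """Code-enforced bounds in the spirit of Harness B's checks."""
    triggers = result.get("triggers", [])
    if not 5 <= len(triggers) <= 15:
        return False, f"expected 5-15 triggers, got {len(triggers)}"
    if len(result.get("summary", "")) < 20:
        return False, "summary shorter than 20 characters"
    return True, ""


def run_with_retries(
    infer: Callable[[str], dict[str, Any]],
    validate: Callable[[dict[str, Any]], tuple[bool, str]],
) -> dict[str, Any]:
    """Call `infer`, validate in code, and feed the failure reason back on retry."""
    feedback = ""
    for _attempt in range(1 + MAX_RETRIES):
        result = infer(feedback)
        ok, reason = validate(result)
        if ok:
            return result
        feedback = reason
    raise ValueError(f"still invalid after {MAX_RETRIES} retries: {feedback}")


def fake_infer(feedback: str) -> dict[str, Any]:
    """Stand-in for an LLM call: too few triggers at first, compliant after feedback."""
    count = 6 if feedback else 2
    return {
        "triggers": [f"trigger {i}" for i in range(count)],
        "summary": "Organizes files in a directory by type.",
    }


result = run_with_retries(fake_infer, validate_intent)
print(len(result["triggers"]))
```

The important property is that the pass/fail decision lives in `validate`, so the model never gets to grade its own output.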

Harness E is the critical one. It cross-validates the other five and produces a unified AHSSPEC (Agent Hatch Standard Specification). E also computes a structural confidence score, and this is important: it's not LLM self-assessment, it's code counting fields:

def _compute_structural_confidence(self, ahs_dict: dict[str, Any]) -> float:
    """Compute confidence based on structural checks, not LLM self-assessment."""
    checks = 0
    passed = 0
    id_ = ahs_dict.get("identity", {})
    for f in ("id", "display_name", "version"):
        checks += 1
        if id_.get(f): passed += 1
    iface = ahs_dict.get("interface", {})
    for f in ("provides", "requires"):
        checks += 1
        if iface.get(f): passed += 1
    score = round(passed / max(checks, 1), 2)
    return score

I don't trust LLM self-reported confidence. The model will cheerfully tell you "0.95 confidence" while missing three required fields. Code counting fields doesn't lie.

There's also a pre-flight classifier that picks model tiers per skill type. A pure-instruction skill skips Harness D entirely (no base class to detect). An integration skill with API calls and scripts upgrades B and E to large models on top of C and D; only A and F stay small. Not every skill deserves the expensive model. The classifier saves real money.

MODEL_TIER_MAP: dict[str, dict[str, str]] = {
    "pure_instruction": {
        "A": "small", "B": "small", "C": "large", "D": "skip", "E": "small", "F": "small",
    },
    "script_driven": {
        "A": "small", "B": "small", "C": "large", "D": "large", "E": "small", "F": "small",
    },
    "integration": {
        "A": "small", "B": "large", "C": "large", "D": "large", "E": "large", "F": "small",
    },
    "knowledge": {
        "A": "small", "B": "large", "C": "large", "D": "small", "E": "small", "F": "small",
    },
}
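
One way a map like this could be consumed is sketched below. This is an assumption about usage, not code from the repo: `plan_harnesses` and `TIER_MODELS` are hypothetical, the model names are placeholders, and only two of the four skill types are copied in to keep it short.

```python
MODEL_TIER_MAP: dict[str, dict[str, str]] = {
    "pure_instruction": {
        "A": "small", "B": "small", "C": "large", "D": "skip", "E": "small", "F": "small",
    },
    "integration": {
        "A": "small", "B": "large", "C": "large", "D": "large", "E": "large", "F": "small",
    },
}

# Hypothetical tier-to-model mapping; substitute whatever your providers offer.
TIER_MODELS = {"small": "small-model-name", "large": "large-model-name"}


def plan_harnesses(skill_type: str) -> list[tuple[str, str]]:
    """Return (harness, model) pairs to run, dropping harnesses marked 'skip'."""
    tiers = MODEL_TIER_MAP[skill_type]
    return [(h, TIER_MODELS[t]) for h, t in sorted(tiers.items()) if t != "skip"]


print(plan_harnesses("pure_instruction"))
```

For a pure-instruction skill the plan has five entries, because Harness D never gets scheduled.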

Phase 3: code generation

Phase 3 renders the AHSSPEC into a complete Python package via Jinja2 templates. The engine is GenerateEngine in generate/engine.py:

TEMPLATE_MAP: dict[str, str] = {
    "pyproject.toml.j2": "pyproject.toml",
    "agent.py.j2": "src/{package_name}/agent.py",
    "tools.py.j2": "src/{package_name}/tools.py",
    "references.py.j2": "src/{package_name}/references.py",
    "runtime.toml.j2": "runtime.toml",
    "README.md.j2": "README.md",
}

But Phase 3 isn't just template rendering. There's an AI code generation step. _ai_generate_tool_impls() reads the full skill directory context (SKILL.md, reference files, script files, templates) and has the LLM generate real Python function bodies for each tool. Not stubs.

The critical part is validation. AI-generated code doesn't get written to disk directly. It gets compile()-checked first. If it fails, the engine tries to auto-fix indentation:

valid = False
wrapper = "def _validate():\n" + indented + "\n"
try:
    compile(wrapper, f"<tool:{func_name}>", "exec")
    valid = True  # compiled cleanly on the first try
except SyntaxError:
    # indented_lines and error_lines come from surrounding code not shown here;
    # this assumes _normalize_indentation returns the repaired source as a string
    try:
        fixed_str = GenerateEngine._normalize_indentation(indented_lines, error_lines)
        fixed_wrapper = "def _validate():\n" + fixed_str + "\n"
        compile(fixed_wrapper, f"<tool:{func_name}>", "exec")
        indented = fixed_str
        valid = True
    except SyntaxError:
        pass  # still broken; fall through to the template stub

if valid:
    valid_tools[func_name] = indented
else:
    logger.warning(
        "AI-generated code for tool '%s' has syntax errors, skipping. "
        "Tool will use template fallback.", func_name,
    )

This is the compiler's attitude. Generated code must compile. If it doesn't, fix it. If it can't be fixed, fall back to a template stub. Never let syntactically broken code reach runtime.

There's also a _validate_generated_python() pass that scans every generated .py file for JavaScript keywords that leaked in (null, undefined, true, false). Added in v0.7.15 after I watched an LLM try to write Python like it was TypeScript one too many times.
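
The post doesn't show how that scan works, so the following is one possible approach rather than the actual `_validate_generated_python()`. It uses the standard library's `tokenize` module so that the words are only flagged when they appear as bare identifiers, not inside strings or comments. The names `find_js_leaks` and `JS_KEYWORDS` are made up for the example.

```python
import io
import tokenize

JS_KEYWORDS = {"null", "undefined", "true", "false"}


def find_js_leaks(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for bare identifiers that look like JavaScript literals.

    Tokenizing first means occurrences inside strings and comments are ignored.
    """
    leaks: list[tuple[int, str]] = []
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.NAME and tok.string in JS_KEYWORDS:
            leaks.append((tok.start[0], tok.string))
    return leaks


code = 'x = null\nflag = True\nmsg = "false alarm"  # true story\ny = undefined\n'
print(find_js_leaks(code))
```

A plain substring search would flag the string and the comment on line 3 as well; working on tokens avoids those false positives at the cost of raising on source that can't be tokenized.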

What comes out

The output is a real Python package:

hatched-agent/
├── pyproject.toml          # pip-installable
├── runtime.toml            # LLM provider, model, API keys
├── README.md               # generated usage docs
├── agenthatch.yaml         # AHSSPEC manifest
└── src/{package_name}/
    ├── __init__.py
    ├── agent.py            # Agent class (extends AHCoreAgent)
    ├── tools.py            # type-annotated tool implementations
    └── references.py       # AI-extracted structured data

You can pip install it. You can import it. You can run it as a CLI. You can wrap it as an MCP server. It doesn't depend on Claude Code, Codex, or any host agent. It's a program.

The generated agent also carries a full copy of its source skill directory. It can read its own SKILL.md at runtime for self-reference, execute its own scripts, and self-repair when things go sideways. Like a compiled binary with debug symbols, not a stripped binary.

The runtime: PlanLayer state machine

The generated agent doesn't run a naive ReAct loop. It uses a 6-state PlanLayer state machine, defined in plan.py:

class AgentState(str, Enum):
    STARTING = "starting"       # initial state, waiting for plan
    PLANNING = "planning"       # generating/updating plan
    EXECUTING = "executing"     # executing plan steps
    VERIFYING = "verifying"     # checking results
    REPLANNING = "replanning"   # hit a blockage, revising plan
    DONE = "done"               # terminal state

The agent generates a structured plan via a virtual plan tool at session start, then executes steps with explicit state tracking. Three consecutive tool failures trigger REPLANNING. State transitions are managed by the loop, not the LLM. The LLM is unreliable. State machines are reliable.

class PlanLayer:
    MAX_CONSECUTIVE_FAILURES = 3
    VERIFY_EVERY_N_STEPS = 5
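
Here is a minimal sketch of how those two constants could drive transitions. It is not the real PlanLayer: `StepTracker` is a hypothetical class, only three of the six states are modeled, and counting successful tool calls as "steps" for the verify interval is an assumption.

```python
from enum import Enum


class AgentState(str, Enum):
    EXECUTING = "executing"
    VERIFYING = "verifying"
    REPLANNING = "replanning"


class StepTracker:
    """Loop-owned transition logic; the LLM never sets the state directly."""

    MAX_CONSECUTIVE_FAILURES = 3
    VERIFY_EVERY_N_STEPS = 5

    def __init__(self) -> None:
        self.state = AgentState.EXECUTING
        self.consecutive_failures = 0
        self.steps_done = 0

    def record(self, tool_ok: bool) -> AgentState:
        """Update counters after one tool call and return the next state."""
        if tool_ok:
            self.consecutive_failures = 0
            self.steps_done += 1
            due = self.steps_done % self.VERIFY_EVERY_N_STEPS == 0
            self.state = AgentState.VERIFYING if due else AgentState.EXECUTING
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
                self.state = AgentState.REPLANNING
                self.consecutive_failures = 0  # start fresh under the revised plan
            else:
                self.state = AgentState.EXECUTING
        return self.state


tracker = StepTracker()
print([tracker.record(False).value for _ in range(3)])
```

A success in between resets the failure counter, so only an unbroken run of three failures forces a replan.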

The plan renders into the system prompt so the agent can see its own progress:

## Plan: Add i18n to the project
  ☑ Step 1: Install next-intl
  ▶ Step 2: Create language packs
  ☐ Step 3: Configure middleware
Progress: 1/3 steps done
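
Rendering a plan like this is a small formatting job. The sketch below is a guess at the shape, not the repo's renderer: `Step`, `MARKS`, and `render_plan` are hypothetical names, and the glyph used for a finished step is an assumption.

```python
from dataclasses import dataclass

# Status glyphs; the one for "done" is an assumption.
MARKS = {"pending": "☐", "active": "▶", "done": "☑"}


@dataclass
class Step:
    title: str
    status: str = "pending"  # one of the MARKS keys


def render_plan(goal: str, steps: list[Step]) -> str:
    """Format a plan as text suitable for embedding in a system prompt."""
    lines = [f"## Plan: {goal}"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"  {MARKS[step.status]} Step {i}: {step.title}")
    done = sum(s.status == "done" for s in steps)
    lines.append(f"Progress: {done}/{len(steps)} steps done")
    return "\n".join(lines)


plan_text = render_plan(
    "Add i18n to the project",
    [
        Step("Install next-intl", "done"),
        Step("Create language packs", "active"),
        Step("Configure middleware"),
    ],
)
print(plan_text)
```

Since the text is regenerated from structured state on every turn, the progress line can't drift from what the loop actually recorded.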

Model support: pick your poison

agenthatch supports OpenAI, Anthropic, DeepSeek, and any OpenAI-compatible endpoint. The harness system picks model tiers per task, so you can mix providers.

I've been testing with Claude Opus 4.5 for the large-tier harnesses (C and D, the ones doing interface and base detection) and GPT-5.2 Codex for comparison runs. Opus 4.5 is genuinely good at structured interface inference, the kind of task where you want the model to design a clean tool signature without inventing parameters. GPT-5.2 Codex is faster on the small-tier harnesses (A, B, E, F) where the job is extraction and validation, not design.

The point is: you're not locked in. The harness config is just a dict. Swap models per harness, swap providers, run the same SKILL.md through different stacks and diff the output. That's the kind of thing you can't do when your skill is glued to a host agent's system prompt.
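
For the "diff the output" part, a few lines of standard library are enough once each run's spec is available as a dict. This is a sketch under that assumption; `diff_specs` is a made-up helper, and the two sample dicts stand in for real hatch outputs.

```python
import difflib
import json


def diff_specs(a: dict, b: dict, a_name: str = "stack-a", b_name: str = "stack-b") -> str:
    """Unified diff of two spec dicts after stable JSON serialization."""
    a_lines = json.dumps(a, indent=2, sort_keys=True).splitlines()
    b_lines = json.dumps(b, indent=2, sort_keys=True).splitlines()
    return "\n".join(difflib.unified_diff(a_lines, b_lines, a_name, b_name, lineterm=""))


# Placeholder outputs from two hypothetical model stacks.
spec_a = {"tools": ["read_file"]}
spec_b = {"tools": ["read_file", "write_file"]}
print(diff_specs(spec_a, spec_b))
```

Sorting keys before serializing keeps the diff about content, so a change in key order between providers doesn't show up as noise.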

What's still broken

I'm not going to pretend this is finished.

Python only. JS/TS support is in progress. If you want a Node agent, not today.
You need an LLM API key. Phase 2 is AI inference. No key, no hatch.
Single-file skills work. Multi-file directory skills are in development.
It's v0.9.x. There are bugs. I find new ones every week.
Windows is untested. I develop on macOS and Linux. Windows users, tell me what breaks.

I've shipped 8 PRs to pytest and 1 to agent-browser. agenthatch is my first from-scratch project, built nights and weekends. "Ship beats perfect" is the most useful thing I learned from open source. This tool isn't perfect, but it runs, and the rest gets fixed in flight.

Try it

If you maintain more than three SKILL.md files and feel the friction, this is for you.

pip install agenthatch
agenthatch init
agenthatch skills add ./my-skill/SKILL.md
agenthatch hatch my-skill
agenthatch run my-skill

Repo: github.com/agenthatch/agenthatch

PyPI: pypi.org/project/agenthatch

If you hit bugs, file an issue. If you want to argue about the paradigm, the discussions tab is open. And if you think skills should stay as prompt fragments, I genuinely want to hear why, because I might be wrong and I'd rather find out now than after writing another 2,000 lines of compiler.

Skills are source code. Agents are binaries, and the compiler is the part everyone's been skipping.

DEV Community: EternalRights