Rudson Kiyoshi Souza Carvalho

Posted on Jun 10

Agent skills load on a guess (and can't inherit). Here's the fix

#agents #ai #architecture #llm

Your agent skill was never loaded. And you have no way of knowing.

Not "loaded the wrong version." Not "loaded late." Never loaded at all. The model read a one-line summary of it, decided it didn't need the details, and generated a confident, plausible, wrong artifact instead. No error. No log line. No stack trace. Just output that looks right and isn't.

I kept running into this while building agents for regulated workflows, so I want to walk through why it happens — it's structural, not a bug — and a small fix you can paste into your own stack today.

How skills actually load

Most agent frameworks load skills the same way. The model never sees your skills up front. It sees a menu — a list of names and short descriptions — and decides for itself, mid-task, whether to open any of them.

Here's roughly the context the model wakes up to:

AVAILABLE SKILLS
- rtm-format    : How to produce a requirements traceability matrix
- pii-redaction : Redact personal data before export
- audit-trail   : Log generated artifacts for compliance

TASK
Generate a requirements traceability matrix for the payments module.

Then, invisibly, it runs something like: "Do I need to open rtm-format? I already know what an RTM is — columns for requirements, sources, tests. I've got this." And it proceeds without ever opening the skill.

That's the whole mechanism. It's a semantic trigger: a probabilistic, model-driven pull. The skill body only enters the context if the model first decides it's needed. There's no guarantee, and — this is the part that hurts in production — no observability. You can't tell from the output whether the skill fired.
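To make the pull concrete, here is a toy simulation, not a real framework's loader. `toy_model_wants_skill` is a hypothetical stand-in for the model's private decision, and the skill bodies are invented placeholders. The shape is what matters: the body only reaches the context if that decision comes back true.

```python
# Toy simulation of pull-based ("semantic trigger") skill loading.
# Everything here is illustrative; no real agent framework is being modeled.

SKILL_MENU = {
    "rtm-format": "How to produce a requirements traceability matrix",
    "pii-redaction": "Redact personal data before export",
}
SKILL_BODIES = {  # placeholder bodies, assumed for the example
    "rtm-format": "columns must be [ReqID, Source, Verification, Status]",
    "pii-redaction": "mask names, emails, and document numbers",
}

def toy_model_wants_skill(name, task):
    # Hypothetical stand-in for the model's judgment: it skips any skill
    # it believes it already knows, however relevant the skill is.
    already_confident = {"rtm-format"}
    topic = name.split("-")[0]
    return name not in already_confident and topic in task.lower()

def build_context_pull(task):
    context = [f"- {name}: {desc}" for name, desc in SKILL_MENU.items()]  # menu only
    loaded = []
    for name in SKILL_MENU:
        if toy_model_wants_skill(name, task):
            context.append(SKILL_BODIES[name])  # body enters only on a "yes"
            loaded.append(name)
    context.append(f"TASK: {task}")
    return "\n".join(context), loaded

context, loaded = build_context_pull("Generate an RTM for the payments module")
print(loaded)
```

In this run `loaded` comes back empty: the task is squarely about RTMs, yet the rtm-format body never enters the context, and nothing in the returned text says so.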

The TV-manual problem

Think about a TV manual. Nobody opens it to turn the TV on — you already know how. You only reach for the manual when you recognize you don't know something: pairing a soundbar, fixing some weird HDMI handshake.

The whole system depends on one assumption: you know what you don't know.

An LLM breaks that assumption. It doesn't know what it doesn't know. It "knows" how to generate an RTM in the generic sense, so it never recognizes that it should open your RTM skill — the one that says your column order is fixed, your IDs follow SYS-REQ-####, and there's one row per requirement, no exceptions. From the model's point of view, it already knows how to turn on the TV. So it never opens the manual. And it hands you a perfectly formatted RTM that's wrong in every way that matters to your auditor.

This is why the failure is structural. The model can only choose to load a skill after recognizing it lacks the knowledge — and the cases where it's most confidently wrong are exactly the cases where it feels no need to check.

Why this quietly wrecks critical workflows

A loud failure is a gift. A crash, a 500, a validation error — these tell you exactly where to look.

The skipped-skill failure is the opposite. The agent produces a clean RTM with the wrong column order and non-conforming IDs. It passes a glance. It might pass review. It surfaces three weeks later when a compliance tool rejects the export, or worse, when nobody catches it at all. The cost of a silent failure isn't the failure — it's the false confidence it travels with.

For ad-hoc help ("brainstorm some test cases"), probabilistic loading is fine, even good. For workflows where a specific artifact format is non-negotiable — regulated reporting, audit trails, anything with a downstream machine consumer — "the model will probably load the right skill" is not a foundation you want to stand on.

The fix: a Skill Resolver

If a skill is mandatory for a task type, the decision to load it shouldn't belong to the model at all. Make it a property of the pipeline.

A Skill Resolver is a tiny pre-dispatch step. It runs before the LLM, looks at the task type, and injects the full body of every required skill straight into the context. No menu, no model discretion — push instead of pull.

SKILL_STORE = {
    "rtm-format": "RTM SKILL: columns must be [ReqID, Source, Verification, "
                  "Status]; ReqID format SYS-REQ-####; one row per requirement.",
    "audit-trail": "AUDIT SKILL: log every generated artifact with author, "
                   "timestamp, and source skill version.",
}
REQUIRED = {"compliance": ["rtm-format", "audit-trail"]}

def resolve_skills(task_type, store):
    # Runs BEFORE the LLM. Returns full skill bodies, not just summaries.
    return "\n\n".join(store[name] for name in REQUIRED.get(task_type, []))

def build_prompt(task_type, user_msg, store):
    injected = resolve_skills(task_type, store)
    return f"<required_skills>\n{injected}\n</required_skills>\n\nTask: {user_msg}"

print(build_prompt("compliance", "Generate an RTM for the payments module", SKILL_STORE))

That's the whole idea. The key property isn't the line count — it's where it runs. The injection happens outside the model's decision loop. By the time the LLM is called, rtm-format is already in the context whether the model thought it needed it or not. The pull became a push.
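Because the injection now happens in code, you can also record it. The sketch below is one possible extension, not part of the resolver above: `resolve_with_manifest` and the manifest shape are assumptions made for illustration. It returns the injected text plus a small record of which skills went in, with a short content hash you can write to your logs.

```python
import hashlib

# Same shapes as the post's SKILL_STORE and REQUIRED.
SKILL_STORE = {
    "rtm-format": "RTM SKILL: columns must be [ReqID, Source, Verification, "
                  "Status]; ReqID format SYS-REQ-####; one row per requirement.",
    "audit-trail": "AUDIT SKILL: log every generated artifact with author, "
                   "timestamp, and source skill version.",
}
REQUIRED = {"compliance": ["rtm-format", "audit-trail"]}

def resolve_with_manifest(task_type, store, required):
    """Return (injected_text, manifest). The manifest is what you log."""
    names = required.get(task_type, [])
    missing = [n for n in names if n not in store]
    if missing:
        # Fail loudly: a required skill that can't be injected is a pipeline error.
        raise KeyError(f"required skills missing from store: {missing}")
    manifest = [
        {"skill": n,
         "sha256": hashlib.sha256(store[n].encode("utf-8")).hexdigest()[:12]}
        for n in names
    ]
    injected = "\n\n".join(store[n] for n in names)
    return injected, manifest

injected, manifest = resolve_with_manifest("compliance", SKILL_STORE, REQUIRED)
for entry in manifest:
    print(entry)
```

Logging the manifest next to each generated artifact answers the question the pull model can't: which skill text was in the context for this call.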

"Can't I just write it in AGENTS.md?" You can, and it helps with prioritization — but it doesn't guarantee anything. A line in AGENTS.md is still an instruction the model interprets at inference time; it lives in the same probabilistic layer as the skill menu. The resolver lives one layer below inference, in code, where "always" actually means always.

Leveling up: skill inheritance for multinationals

Now the real-world version. You're not running one agent; you're running one platform for a company with offices in a dozen countries. The audit format is global. The data-retention rules are European (GDPR) or Brazilian (LGPD). The reporting template is set by each local central bank. And a single business unit has its own quirks on top.

The naive answer is copy-and-modify: fork the global skill set per country, tweak as needed. That falls apart fast. The forks drift — a fix to the global skill never reaches the copies. You lose lineage — six months later nobody can say which rule came from HQ and which a local team invented. And every global change becomes N update points instead of one.

What you actually want is inheritance: a scope chain from global down to the business unit, where more specific scopes override less specific ones, except for invariants that HQ locks and no local scope can touch. If you've ever debugged CSS specificity, this is roughly the same cascade: the most specific rule wins, and !important plays the role of your invariant. The analogy is loose, since in CSS a competing !important can still win, while a locked skill here cannot be overridden at all.

Here's a resolver that walks that chain and keeps the lineage:

REGISTRY = {
    "global":       {"rtm-format": "v2.1", "audit-trail": "v4.0",
                     "_invariant": {"audit-trail"}},
    "country:BR":   {"rtm-format": "v1.3"},
    "bu:BR/retail": {"rtm-format": "v1.0", "audit-trail": "v1.0"},  # tries to weaken
}

def resolve(scope_chain, registry):
    resolved, lineage, locked = {}, {}, set()
    for scope in scope_chain:                 # walk least -> most specific
        layer = registry.get(scope, {})
        locked |= layer.get("_invariant", set())
        for key, version in layer.items():
            if key.startswith("_"):
                continue
            if key in locked and key in resolved:
                continue                       # invariant: can't be overridden
            resolved[key] = version            # most-specific wins
            lineage[key] = scope               # who set it -> auditable
    return resolved, lineage

chain = ["global", "country:BR", "bu:BR/retail"]
resolved, lineage = resolve(chain, REGISTRY)
for k in resolved:
    print(f"{k:12} -> {resolved[k]:6} (set by {lineage[k]})")

Output:

rtm-format   -> v1.0   (set by bu:BR/retail)
audit-trail  -> v4.0   (set by global)

rtm-format cascades down to the business unit's v1.0. But audit-trail is locked at the global scope: the BU's attempt to swap in a weaker v1.0 is ignored, and the lineage map tells you exactly which scope set each final value. One global change, one update point, full lineage. That's Hierarchical Skill Resolution (HSR).
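One gap in that resolver: the blocked override is dropped without a trace, which is its own small silent failure. The variant below is a sketch that keeps the same locking rule and also returns the attempts it refused. The name `resolve_with_rejections` and the shape of the `rejected` list are assumptions for illustration.

```python
REGISTRY = {
    "global":       {"rtm-format": "v2.1", "audit-trail": "v4.0",
                     "_invariant": {"audit-trail"}},
    "country:BR":   {"rtm-format": "v1.3"},
    "bu:BR/retail": {"rtm-format": "v1.0", "audit-trail": "v1.0"},
}

def resolve_with_rejections(scope_chain, registry):
    resolved, lineage, locked, rejected = {}, {}, set(), []
    for scope in scope_chain:                  # walk least -> most specific
        layer = registry.get(scope, {})
        locked |= layer.get("_invariant", set())
        for key, version in layer.items():
            if key.startswith("_"):
                continue
            if key in locked and key in resolved:
                # Same rule as before, but the refused attempt is recorded.
                rejected.append((scope, key, version))
                continue
            resolved[key] = version
            lineage[key] = scope
    return resolved, lineage, rejected

chain = ["global", "country:BR", "bu:BR/retail"]
resolved, lineage, rejected = resolve_with_rejections(chain, REGISTRY)
for scope, key, version in rejected:
    print(f"blocked: {scope} tried to set {key} to {version}")
```

With the registry above this reports a single blocked attempt, the business unit's audit-trail v1.0. Whether a blocked override should only be logged or should fail the deployment is a policy choice this sketch leaves open.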

Where Microsoft APM fits

None of this competes with Microsoft's APM — it composes with it. There are three separate planes here, and it's worth keeping them straight:

APM is the distribution plane: how skills are versioned, locked, and pulled from registries. It is the package-manager layer. The Skill Resolver is the consumption plane: what's deterministically in the context when the model runs. HSR, the scope-chain resolver above, is the governance plane: who controls which skill at which scope, and what can't be overridden. APM ships the v1.0; the resolver guarantees it's actually present at inference; HSR decides whether the BU was allowed to set it in the first place. None replaces the others.
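A rough sketch of how the three planes could meet in code, under loud assumptions: `VERSIONED_STORE` is a plain dict standing in for whatever the distribution layer has shipped (it does not use or imitate APM's actual interface), the skill bodies are placeholders, and `build_governed_prompt` is a hypothetical name. The `resolved` and `lineage` inputs are the values printed by the scope-chain example.

```python
# Hypothetical stand-in for distributed skills, keyed by (name, version).
VERSIONED_STORE = {
    ("rtm-format", "v1.0"): "RTM SKILL v1.0: columns must be "
                            "[ReqID, Source, Verification, Status].",
    ("audit-trail", "v4.0"): "AUDIT SKILL v4.0: log author, timestamp, "
                             "and source skill version.",
}

def build_governed_prompt(resolved, lineage, store, user_msg):
    blocks = []
    for name, version in resolved.items():
        key = (name, version)
        if key not in store:
            # Governance picked a version that distribution never shipped.
            raise KeyError(f"{name} {version} was resolved but is not in the store")
        blocks.append(
            f'<skill name="{name}" version="{version}" scope="{lineage[name]}">\n'
            f"{store[key]}\n</skill>"
        )
    skills = "\n".join(blocks)
    return f"<required_skills>\n{skills}\n</required_skills>\n\nTask: {user_msg}"

# Governance output, as printed by the scope-chain resolver.
resolved = {"rtm-format": "v1.0", "audit-trail": "v4.0"}
lineage = {"rtm-format": "bu:BR/retail", "audit-trail": "global"}

print(build_governed_prompt(resolved, lineage, VERSIONED_STORE,
                            "Generate an RTM for the payments module"))
```

Tagging each injected block with its version and originating scope carries the lineage into the prompt itself, so a logged prompt shows which scope supplied each skill.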

If this maps to a problem you're staring at, I opened a discussion in the APM community to push on the governance side — discussion #1722. Feedback and counter-arguments welcome.

Honest limitations

Injection guarantees what enters the context. It does not guarantee what the model does with it. Stuff a skill into a 100k-token prompt and "lost in the middle" still applies — present isn't the same as attended-to. There's a token cost, too; injecting every required skill on every call adds up, so scope your REQUIRED map tightly. And for genuinely open-ended, ad-hoc assistance, don't bother — probabilistic loading is the right tool there. The resolver earns its keep specifically where an output format is non-negotiable.
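Since injection can't guarantee the model followed the skill, one complement is a check on the output side. The sketch below validates an RTM against the rules in the rtm-format skill. It makes several assumptions: the RTM arrives as pipe-delimited text with a header row, `####` means exactly four digits, and `validate_rtm` is a hypothetical helper, not part of any library.

```python
import re

EXPECTED_COLUMNS = ["ReqID", "Source", "Verification", "Status"]
REQ_ID = re.compile(r"SYS-REQ-\d{4}")  # assumes #### means four digits

def validate_rtm(text):
    """Return a list of rule violations; an empty list means the RTM conforms."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return ["empty output"]
    errors = []
    header = [cell.strip() for cell in lines[0].split("|")]
    if header != EXPECTED_COLUMNS:
        errors.append(f"header {header} != {EXPECTED_COLUMNS}")
    seen = set()
    for line_no, line in enumerate(lines[1:], start=2):
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != len(EXPECTED_COLUMNS):
            errors.append(f"line {line_no}: expected {len(EXPECTED_COLUMNS)} cells, got {len(cells)}")
            continue
        req_id = cells[0]
        if not REQ_ID.fullmatch(req_id):
            errors.append(f"line {line_no}: bad ReqID {req_id!r}")
        if req_id in seen:
            errors.append(f"line {line_no}: duplicate ReqID {req_id!r}")  # one row per requirement
        seen.add(req_id)
    return errors

good = "ReqID | Source | Verification | Status\nSYS-REQ-0001 | PRD 4.2 | test_refund | Pass"
bad = "Requirement | Test | Source | Status\nREQ-1 | t1 | PRD | Pass\nREQ-1 | t2 | PRD | Pass"
print(validate_rtm(good))
print(len(validate_rtm(bad)), "violations")
```

Run after generation, a check like this turns the wrong column order or a non-conforming ID from a silent failure into a loud one, whether or not the skill was attended to.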

Wrap-up

Semantic skill loading is a pull, and the model decides whether to pull. That's perfect for exploration and quietly dangerous for compliance, because the model can't recognize a gap it doesn't know it has. A Skill Resolver flips it to a push — moving the load decision out of the model and into your pipeline, in about fifteen lines. Add scope inheritance and you get governance for an org of any size, with lineage you can hand to an auditor.

If you want to go deeper, the full write-up is in the paper, Hierarchical Skill Resolution: Enabling Skill Inheritance and Deterministic Knowledge Injection for AI Agents (DOI: 10.5281/zenodo.20619456), and the governance discussion is over at APM #1722.

What's the worst silent skill-skip you've shipped? I'd love to hear it.

DEV Community