Guatu

Posted on May 29 • Originally published at guatulabs.dev

Building Agent Skills: A Pattern for Discoverable Capabilities

#aiagents #llmorchestration #architecture #mcpservers

I spent three weeks building a set of "tools" for a custom agent that could manage my infrastructure, only to realize the agent had no idea how to actually use them in combination. I'd give it a read_file tool and a grep_search tool, and it would repeatedly try to read a 50MB log file into its context window instead of grepping for the error first. The tools existed, but the "skill" of knowing when and how to sequence them was missing.

If you're building AI agents, you've probably hit this. Most frameworks treat tools as a flat list of functions. You dump 20 Python functions into the system prompt and hope the LLM's reasoning is strong enough to pick the right one. It usually isn't.

The False Start: The "Tool Soup" Approach

My first instinct was to just write better descriptions. I spent hours tweaking the docstrings of my functions, adding phrases like "Use this tool ONLY when the file is larger than 10KB." I was treating the LLM like a junior dev who just needed better instructions.

The problem is that tool-calling is fundamentally different from skill execution. A tool is an atomic action (e.g., GET /api/v1/status). A skill is a capability (e.g., "Diagnose why the Kubernetes ingress is returning 502").

I tried to solve this by creating "orchestrator" tools—basically giant functions that wrapped other functions. This just moved the complexity into my Python code. I ended up with a monolithic diagnose_k8s_issue() function that was 300 lines long and impossible to test. I had created a rigid script, not a flexible agent. I'd effectively turned my AI agent back into a bash script with a fancy interface.

The Solution: Discoverable Skill Definitions

The shift happened when I stopped defining tools and started defining skills as discoverable metadata. Instead of just exposing a function, I created a registry where skills are defined by their intent, the tools they require, and a suggested execution pattern.

I implemented this using a structured manifest. Instead of the LLM guessing which tool to use, the agent first queries a "Skill Registry" to find a capability that matches the user's intent.

Here is the pattern I'm using now. Each skill is a standalone definition that explicitly maps the capability to the underlying tools it depends on.

# skill-registry.yaml
skills:
  - id: "log-error-search"
    name: "Search Logs for Errors"
    description: "Finds specific error patterns in system logs without loading entire files."
    required_tools: ["grep", "ls"]
    execution_pattern: |
      1. Use 'ls' to identify the relevant log file in /var/log.
      2. Use 'grep' with the --context flag to find the error and surrounding lines.
      3. If no results, try searching for 'FATAL' or 'CRITICAL'.
    usage_example: "/skill:search --tool=grep --pattern='timeout' --files='/var/log/syslog'"
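To make the lookup step concrete, here is a minimal Python sketch of a registry query. It is not my production code: the `Skill` class, the `REGISTRY` list, and the `find_skill` function are hypothetical names, the manifest is inlined rather than parsed from YAML so the example runs on the standard library alone, and plain word overlap stands in for whatever matching you actually use.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Skill:
    id: str
    name: str
    description: str
    required_tools: Tuple[str, ...]
    execution_pattern: str


# Inlined copy of the manifest entry above (assumption: a real agent would
# parse skill-registry.yaml instead of hard-coding this).
REGISTRY = [
    Skill(
        id="log-error-search",
        name="Search Logs for Errors",
        description="Finds specific error patterns in system logs without loading entire files.",
        required_tools=("grep", "ls"),
        execution_pattern=(
            "1. Use 'ls' to identify the relevant log file in /var/log.\n"
            "2. Use 'grep' with the --context flag to find the error and surrounding lines.\n"
            "3. If no results, try searching for 'FATAL' or 'CRITICAL'."
        ),
    ),
]


def find_skill(intent: str, registry: List[Skill]) -> Optional[Skill]:
    """Return the skill whose name and description share the most words with the intent."""
    intent_words = set(intent.lower().split())
    best, best_score = None, 0
    for skill in registry:
        skill_words = set(f"{skill.name} {skill.description}".lower().split())
        score = len(intent_words & skill_words)
        if score > best_score:
            best, best_score = skill, score
    return best  # None when nothing overlaps at all
```

Returning `None` on zero overlap matters more than the scoring itself: the agent gets an explicit "no matching skill" signal instead of a forced best guess.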

To make this work in practice, I changed the agent's loop. Instead of User -> LLM -> Tool, the flow became User -> LLM -> Skill Lookup -> LLM -> Tool Sequence.

When the agent identifies it needs to search logs, it doesn't just call grep. It retrieves the log-error-search skill definition. This gives the LLM a "recipe" for the task. It's the difference between giving someone a pile of ingredients and giving them a recipe book.
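The two-pass flow can be sketched in a few lines of Python. Everything here is an illustrative assumption: `run_agent` is a hypothetical name, `llm` is any callable that maps a prompt string to a text reply, and the "one tool call per line" plan format is a simplification of real structured tool calling.

```python
from typing import Callable, Dict, List


def run_agent(
    user_request: str,
    llm: Callable[[str], str],
    skills: Dict[str, str],                      # skill id -> execution pattern
    tools: Dict[str, Callable[[str], str]],      # tool name -> callable
) -> List[str]:
    # Pass 1: the LLM only picks a skill id from a short menu.
    menu = ", ".join(sorted(skills))
    skill_id = llm(f"Request: {user_request}\nPick one skill id from: {menu}").strip()
    if skill_id not in skills:
        raise ValueError(f"unknown skill: {skill_id!r}")

    # Pass 2: the LLM sees the recipe for that one skill and plans tool calls,
    # one "tool argument" pair per line (hypothetical plan format).
    plan = llm(
        f"Request: {user_request}\nRecipe:\n{skills[skill_id]}\nPlan the tool calls."
    )

    results = []
    for line in plan.splitlines():
        if not line.strip():
            continue
        tool_name, _, argument = line.strip().partition(" ")
        results.append(tools[tool_name](argument))
    return results
```

Rejecting an unknown skill id before the second pass keeps a hallucinated skill name from ever reaching the tool layer.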

If you're building these as MCP servers, you can implement this by creating a specific "discovery" tool that returns these manifests. I've written about building MCP servers with FastMCP, and applying this skill pattern there makes the tools significantly more reliable across different IDEs like Antigravity or Kiro.

Handling the "Dirty Work" of Execution

One of the biggest gaps in agent documentation is how to handle the actual execution of these skills when they hit real-world infrastructure. For example, if a skill requires searching through Kubernetes volumes, you can't just assume the agent has the right permissions or that the volume is healthy.

I hit a wall where my "Log Search" skill would fail because the underlying Longhorn volumes were hitting snapshot limits, causing the filesystem to go read-only. The agent would just report "Permission Denied," which is useless.

I had to build "pre-flight" checks into the skill execution layer. If a skill involves storage, it first checks the volume health. If I see a bunch of stale snapshots, I have the agent run a cleanup before attempting the search.

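# Example of a cleanup command the agent can trigger via a 'maintenance' skill.
# Assumes Longhorn's default namespace and an illustrative label; adjust both for your cluster.
kubectl -n longhorn-system delete snapshots.longhorn.io -l "snapshot-name=old-snapshot-2025"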
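The pre-flight idea described above can be sketched as a small wrapper around skill execution. This is a simplified illustration, not my actual layer: `Preflight` and `run_with_preflight` are hypothetical names, and the check and fix callables are placeholders for whatever actually inspects and repairs your storage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Preflight:
    name: str
    check: Callable[[], bool]                   # True means healthy
    fix: Optional[Callable[[], None]] = None    # optional remediation step


def run_with_preflight(checks: List[Preflight], action: Callable[[], str]) -> str:
    """Run every check before the action; try the fix once if a check fails."""
    for pf in checks:
        if pf.check():
            continue
        if pf.fix is None:
            return f"pre-flight failed: {pf.name}"
        pf.fix()
        if not pf.check():  # re-check so a failed fix is reported, not ignored
            return f"pre-flight failed: {pf.name}"
    return action()
```

The payoff is the error message: the agent reports the named check that failed rather than a bare "Permission Denied" from the tool.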

This is where the gap between "it works in the playground" and "it works in production" becomes obvious. If you're running these agents on bare metal, you need to account for the infrastructure failures I've detailed in my posts on Longhorn volume health.

Why This Pattern Works

The reason this beats a flat list of tools is cognitive load. LLMs have a limited context window, and more importantly, a limited "attention" span (the lost-in-the-middle phenomenon). When you provide 50 tools, the probability of the LLM picking a suboptimal tool increases.

By using a skill registry, you're implementing a form of "just-in-time" prompting. The agent only sees the detailed instructions for the specific skill it needs for the current step.
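Here is one way to picture "just-in-time" prompting in Python. The registry shape (a dict with `summary` and `pattern` keys) and the `build_prompt` function are assumptions made for this sketch: every skill contributes a one-line summary, but only the active skill's full pattern enters the prompt.

```python
from typing import Dict, Optional


def build_prompt(
    task: str,
    skills: Dict[str, Dict[str, str]],   # id -> {"summary": ..., "pattern": ...}
    active_id: Optional[str] = None,
) -> str:
    lines = [f"Task: {task}", "Available skills:"]
    for skill_id, skill in sorted(skills.items()):
        # One short line per skill keeps the menu cheap.
        lines.append(f"- {skill_id}: {skill['summary']}")
    if active_id is not None:
        # Only the selected skill's detailed pattern is loaded into context.
        lines.append(f"Follow this pattern for {active_id}:")
        lines.append(skills[active_id]["pattern"])
    return "\n".join(lines)
```

With 50 skills the menu still grows linearly, but each entry costs one line instead of a full recipe.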

Feature	Tool-Based Approach	Skill-Based Approach
Discovery	LLM scans all tool descriptions	Agent queries registry for specific intent
Execution	LLM guesses the sequence	Agent follows a proven execution pattern
Maintenance	Change docstrings and hope for the best	Update the skill manifest in one place
Reliability	High variance in output	Consistent, repeatable workflows
Scalability	Context window fills up quickly	Only relevant skills are loaded into context

This approach also helps with security. I don't give the agent a blanket "Admin" token. Instead, I map each skill to one of two service-account tiers. A "Read-Only Log Search" skill uses a restricted token, while a "Restart Pod" skill requires a higher-privilege token and a manual approval gate.
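A rough Python sketch of that mapping, with the approval gate included. The tier names, skill ids, and the `authorize` function are hypothetical; in a real setup the tokens would come from a secret store and the approval from an out-of-band workflow, neither of which is shown here.

```python
from typing import Dict

# Hypothetical mapping from skill id to service-account tier.
SKILL_TIERS: Dict[str, str] = {
    "log-error-search": "read-only",
    "restart-pod": "privileged",
}


def authorize(skill_id: str, approved_by_human: bool = False) -> str:
    """Return the tier a skill may use, or raise PermissionError."""
    tier = SKILL_TIERS.get(skill_id)
    if tier is None:
        # Unmapped skills get no credentials at all (deny by default).
        raise PermissionError(f"skill {skill_id!r} has no service account mapping")
    if tier == "privileged" and not approved_by_human:
        raise PermissionError(f"skill {skill_id!r} needs manual approval")
    return tier
```

Denying unmapped skills by default means a newly added skill cannot run with any credentials until someone assigns it a tier.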

Lessons Learned and Gotchas

The biggest surprise was that the LLM performs better when told how to use a tool than when told only what the tool does. A tool description like "Greps a file" gives it almost nothing to act on. A skill pattern that says "First list the files, then grep the most recent one" gives it a sequence to follow.

I also learned that you can't trust the LLM to always follow the registry. Sometimes it tries to be "clever" and skip a step. I had to implement a validation layer that checks the output of each step against the skill's expected state. If the ls step fails, the agent isn't allowed to attempt the grep step.
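The gating logic is simple enough to sketch. This is an illustration under assumptions, not my validation layer: `Step` and `execute_steps` are hypothetical names, and each step carries its own validator that decides whether the next step is allowed to run.

```python
from typing import Callable, List, Tuple

# Each step is (name, run, is_valid). `run` receives the previous step's output.
Step = Tuple[str, Callable[[str], str], Callable[[str], bool]]


def execute_steps(steps: List[Step]) -> Tuple[List[str], str]:
    """Run steps in order; return (names of completed steps, status)."""
    completed: List[str] = []
    previous = ""
    for name, run, is_valid in steps:
        output = run(previous)
        if not is_valid(output):
            # Stop here: later steps never see an invalid intermediate state.
            return completed, f"halted at {name}"
        completed.append(name)
        previous = output
    return completed, "ok"
```

Because the executor, not the LLM, owns the loop, a "clever" plan that skips a step has nowhere to run.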

If I were to do this over again, I'd move the skill registry into a vector database from the start. As the number of skills grows, a flat YAML file becomes a bottleneck, because the agent has to scan every entry to find a match. Using a vector search to retrieve the top 3 most relevant skills for the user's query is the approach I'd expect to scale to hundreds of capabilities.

The most important takeaway is this: stop trying to make your agents "smarter" by using a larger model. Instead, make your capabilities more discoverable. The intelligence should live in the architecture of the skills, not just in the weights of the LLM.

For those building these systems for industrial or production use, I highly recommend looking into how these patterns fit into a broader multi-agent architecture. One agent can act as the "Librarian" (managing the skill registry), while another acts as the "Executor" (following the recipes). This separation of concerns prevents the executor from getting distracted by the discovery process.

Top comments (3)

Alex Shev • May 30

I like the focus on discoverability. That is the part that separates a useful capability system from a folder of random scripts.

The pattern that keeps paying off for me is treating each skill as a contract: when to use it, what inputs it expects, what it should never do, and how to verify the result. Without that, agents either ignore the capability or use it in the wrong context.

The boring metadata around the skill is often what makes the actual automation safe to reuse.

Harjot Singh • May 31

Discoverability is the part of agent-skill design that quietly decides whether the whole thing scales. With a handful of skills you can stuff them all in the prompt; past a few dozen you can't, and the agent needs to find the right capability for the task without you hand-wiring it. So the pattern matters: good metadata (what does this skill do, when should it be used, what does it need), a selection step that narrows the candidate set before the model commits, and descriptions written for the agent's decision, not for a human reader. A skill the agent can't reliably discover at the right moment may as well not exist.

The failure mode I'd guard against is the model confidently picking the wrong skill because two descriptions overlap - which makes discovery a verification problem too, not just a search one. That "narrow, then verify the choice" instinct is core to how I build Moonshift, the thing I work on - a multi-agent pipeline that takes a prompt to a deployed SaaS, where capability selection is structured and the choice is checked rather than blindly trusted. Multi-model routing keeps a build ~$3 flat, first run's free no card. Solid pattern. How are you doing the discovery step - semantic match on skill descriptions, an explicit registry the agent queries, or a routing model that picks? And do you confirm the selected skill actually fits before invoking it? That confirm step is what stops confident mis-selection.

sanreds • Jun 5

Discoverability is the part that gets harder fast as the skill registry grows. Once you're past ~30 skills, the model starts picking by name similarity rather than by what the skill actually does, and the symptom is silent: the wrong skill runs, returns a plausible-looking result, and nobody catches it.