AI Agent Skills: The Emerging Architecture for Composable, Evolvable Agent Capabilities
The tools abstraction that powered the first wave of production agents is hitting its ceiling. When your agent needs to "review code," it doesn't just call a function—it reads previous review comments from memory, applies learned heuristics about the codebase, adapts its critique style to the author, and improves its approach based on whether past suggestions were accepted. This isn't a stateless function call. It's a skill. And in the past four months, the entire agent framework ecosystem has converged on this distinction with remarkable speed.
Introduction: From Tools to Skills — A Paradigm Shift in Agent Design
The research community's pivot to skills has been dramatic. Since February 2026, we've seen over 20 papers explicitly addressing skill architectures, skill learning, and skill evaluation—a signal that the field has identified a fundamental gap in how we build agents. The agentic AI architectures survey published in January laid the theoretical groundwork, distinguishing between "reactive tool use" and "proactive capability development." By March, the major frameworks had taken notice.
The distinction between tools and skills is more than semantic. Tools are stateless function calls: search_web(query) → results. Skills are learned, versioned, composable capabilities with memory and context: a "web research" skill knows which sources proved reliable in past investigations, adapts search strategies based on domain, and can delegate to sub-skills for fact verification. LangChain's March 2026 newsletter announced their Deep Agents Skills system, explicitly framing it as "the next layer of abstraction above tools." CrewAI followed with self-healing skills in their enterprise multi-agent builder. Microsoft Foundry introduced skill primitives for multi-agent coordination.
Why the sudden convergence? The emergence of reasoning models—o1, R1, Gemini 2.5—finally gave agents the cognitive horsepower for genuine skill acquisition and composition. Earlier models could use tools when instructed; reasoning models can learn when and how to combine capabilities, recognize when a skill is failing, and propose refinements. Research on agentic reasoning shows these models achieving 40-60% better performance on multi-step tasks when given skill-level abstractions rather than raw tool access.
My thesis: Skills represent the "package manager" moment for agentic AI. Just as npm made JavaScript code genuinely reusable and composable, skill architectures make agent capabilities genuinely shareable and evolvable. We're moving from "agents that can do things" to "agents that can learn to do things better."
The Skill Architecture Stack: Anatomy of a Modern Agent Skill
Understanding the skill architecture requires thinking in three layers: definition, runtime, and lifecycle. Each layer addresses a distinct concern that tools-based approaches left unresolved.
Skill Definition encompasses the schema and metadata that describe what a skill does, what it requires, and what it produces. Unlike tool schemas (which specify only function signatures), skill schemas include capability declarations, memory access patterns, and composition rules. LangChain's approach defines skills as first-class graph nodes with typed state, while CrewAI binds skills to agent roles with explicit permission scopes. The emerging SkillNet interchange format (referenced in multiple 2026 framework comparisons) aims to make skills portable across frameworks, though adoption remains early.
Skill Runtime handles execution context and memory access. This is where the tools/skills distinction matters most. A skill runtime provides: (1) access to episodic memory for retrieving relevant past experiences, (2) working memory for multi-step reasoning within the skill, and (3) tool delegation for invoking lower-level capabilities. AutoGen's shared state discussions reveal the complexity here—agents need fine-grained control over which memories a skill can read versus modify.
Skill Lifecycle manages versioning, evaluation, and deprecation. Research on multi-agent system development found that 34% of production agent failures traced to skill version mismatches or unevaluated skill changes. Modern skill architectures treat skills like software packages: semantic versioning, dependency declarations, and explicit deprecation policies.
The capability declaration pattern deserves special attention. Drawing from deterministic pre-action authorization research, skills now declare required permissions upfront. A "code review" skill might declare: requires: [read:repository, read:pull_request, write:comments]. The runtime enforces these boundaries, preventing skill drift into unauthorized behaviors. This least-privilege approach—termed SkillScope in the authorization literature—is essential for enterprise deployments where audit requirements are strict.
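As a rough illustration of declaring permissions upfront, the sketch below checks a skill's declared scopes against what a deployment has granted before the skill is loaded. The `SkillDeclaration` class and the `authorize` helper are hypothetical names invented for this example and do not belong to any framework; only the scope strings come from the code review example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillDeclaration:
    """Hypothetical skill manifest: a name plus the scopes the skill requires."""
    name: str
    requires: frozenset[str]


def missing_permissions(skill: SkillDeclaration, granted: set[str]) -> set[str]:
    """Return the declared scopes that the deployment has not granted."""
    return set(skill.requires) - granted


def authorize(skill: SkillDeclaration, granted: set[str]) -> None:
    """Refuse to load a skill whose declared scopes exceed the grant."""
    missing = missing_permissions(skill, granted)
    if missing:
        raise PermissionError(f"{skill.name} lacks: {sorted(missing)}")


code_review = SkillDeclaration(
    name="code_review",
    requires=frozenset({"read:repository", "read:pull_request", "write:comments"}),
)

# Passes silently because every declared scope is granted.
authorize(code_review, {"read:repository", "read:pull_request", "write:comments"})
```

Checking at load time keeps the failure early and auditable: a skill that asks for more than the deployment allows never starts.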
Composition primitives enable skills to work together. Skill chaining sequences capabilities (research → summarize → cite). Skill delegation allows one skill to invoke another (a "write report" skill delegating to "generate chart" skill). Skill fallback hierarchies provide graceful degradation (try "semantic search" skill, fall back to "keyword search" skill). These patterns are now supported natively in LangGraph's agent framework.
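The chaining and fallback primitives can be sketched in plain Python, independent of any framework. Everything here is an assumption made for illustration: a skill is reduced to a function from text to text, and the four skills are stand-ins that return canned strings. Delegation is not shown separately, since in this simplified model it is just one skill calling another.

```python
from typing import Callable

Skill = Callable[[str], str]  # simplified: a skill maps text to text


def chain(*skills: Skill) -> Skill:
    """Skill chaining: feed each skill's output into the next."""
    def run(text: str) -> str:
        for skill in skills:
            text = skill(text)
        return text
    return run


def with_fallback(primary: Skill, fallback: Skill) -> Skill:
    """Fallback hierarchy: use `fallback` only if `primary` raises."""
    def run(text: str) -> str:
        try:
            return primary(text)
        except Exception:
            return fallback(text)
    return run


# Hypothetical stand-ins for real skills.
def semantic_search(query: str) -> str:
    raise RuntimeError("vector index unavailable")


def keyword_search(query: str) -> str:
    return f"results for '{query}'"


def summarize(text: str) -> str:
    return f"summary of {text}"


def cite(text: str) -> str:
    return f"{text} [1]"


research = chain(with_fallback(semantic_search, keyword_search), summarize, cite)
print(research("agent skills"))  # summary of results for 'agent skills' [1]
```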
Skill Acquisition: How Agents Learn New Capabilities
The most profound shift isn't just having skills—it's how agents acquire them. Three distinct pathways have emerged in 2026 research, each with different tradeoffs for production systems.
Human-authored skills remain the foundation. A developer writes skill code, defines the schema, and registers it with the agent. This approach offers maximum control and reliability but scales poorly. Framework comparison analyses note that human-authored skills typically require 2-4 hours of engineering time per skill, including testing and documentation. For core business logic, this investment makes sense. For long-tail capabilities, it's prohibitive.
Demonstration-learned skills represent the middle ground. The agent observes a human performing a task—watching tool invocations, reading decisions made, noting outcomes—and extracts a reusable skill representation. Research on tool use capabilities shows demonstration learning achieving 70-80% of human-authored skill quality with 10x less human effort. The key insight: demonstrations should capture not just what was done, but why—the decision points, the alternatives considered, the success criteria applied.
Self-evolved skills push further into autonomy. The agent generates skill candidates, tests them against task outcomes, and refines them through reinforcement learning. Research on agentic reinforcement learning applies GRPO (Group Relative Policy Optimization) to skill training, providing step-wise rewards for skill invocation decisions rather than only final task success. This enables agents to learn nuanced skill selection: when to use "precise search" vs. "exploratory search," when to delegate vs. handle directly.
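The "group relative" part of GRPO is easy to show in isolation: several candidate decisions are sampled for the same task, and each reward is normalized against the mean and standard deviation of its own group. The sketch below covers only that normalization step. The reward values are made up, the use of the population standard deviation is a choice made for this example, and the step-wise reward assignment described above is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled skill-selection decisions on one task.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Decisions that beat the group average get a positive advantage,
# below-average decisions get a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because the baseline is the group itself, no separate value model is needed to decide whether a given skill invocation was better or worse than the alternatives.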
Research on the challenges ahead for agentic AI emphasizes that self-evolved skills require robust evaluation infrastructure. Without it, agents can develop confidently wrong skills: capabilities that appear to work in training but fail catastrophically in production. The paper recommends a "skill quarantine" pattern: newly evolved skills run in shadow mode, their outputs logged but not acted upon, until evaluation metrics clear predetermined thresholds.
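One way to picture shadow mode is a wrapper that always returns the trusted skill's output while quietly running the candidate alongside it. This is a minimal sketch, not the paper's design: the `QuarantinedSkill` class, the agreement-with-trusted-output metric, and both threshold values are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QuarantinedSkill:
    """Run a candidate in shadow mode next to the trusted skill (hypothetical design)."""
    trusted: Callable[[str], str]
    candidate: Callable[[str], str]
    min_runs: int = 50           # assumed threshold
    min_agreement: float = 0.95  # assumed threshold
    log: list[bool] = field(default_factory=list)

    def __call__(self, task: str) -> str:
        result = self.trusted(task)
        try:
            shadow = self.candidate(task)
            self.log.append(shadow == result)
        except Exception:
            self.log.append(False)  # a crash counts against the candidate
        return result  # callers only ever see the trusted output

    def ready_for_promotion(self) -> bool:
        """True once enough shadow runs have been logged and agreement is high enough."""
        if len(self.log) < self.min_runs:
            return False
        return sum(self.log) / len(self.log) >= self.min_agreement
```

In practice the comparison would be a task-specific evaluation metric rather than exact equality, but the control flow stays the same: log, never act, promote only after the thresholds clear.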
A practical production pattern is emerging: start with human-authored core skills for critical paths, enable demonstration-learning for domain adaptation (letting power users teach the agent their workflows), and restrict self-evolution to well-bounded capability improvements. XAgen's explainability work provides tools for understanding why a skill evolved in a particular direction, essential for maintaining trust in self-improving systems.
The experience compression spectrum offers a useful mental model. Not every learning should become a skill. Some belong as episodic memories (specific instances to retrieve when relevant). Others crystallize into skills (reusable capabilities worth naming and versioning). A few should codify as rules (invariants that must always hold). The AI agent software architecture evolution paper provides decision heuristics: if you'd invoke the capability >100 times and it requires multi-step reasoning, it's a skill candidate.
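The decision heuristic can be written down as a small classifier. This is a sketch under stated assumptions: the function name, the `Destination` enum, and the choice to check for invariants first are mine; only the "more than 100 invocations and multi-step reasoning" threshold comes from the heuristic quoted above.

```python
from enum import Enum


class Destination(Enum):
    EPISODIC_MEMORY = "episodic_memory"
    SKILL = "skill"
    RULE = "rule"


def classify_learning(expected_invocations: int, multi_step: bool, is_invariant: bool) -> Destination:
    """Decide where a learned behavior should live on the compression spectrum."""
    if is_invariant:
        # Something that must always hold belongs in a rule, however rarely it fires.
        return Destination.RULE
    if expected_invocations > 100 and multi_step:
        return Destination.SKILL
    # Everything else stays a retrievable instance.
    return Destination.EPISODIC_MEMORY


print(classify_learning(500, multi_step=True, is_invariant=False))  # Destination.SKILL
```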
Hands-On: Code Walkthrough
Let's build a skill-enabled research agent using current APIs. This example demonstrates the complete skill lifecycle: definition, registration, invocation, memory integration, and basic self-improvement.
"""
Skill-enabled research agent using LangGraph and LangChain patterns.
Demonstrates: skill definition, composition, memory coupling, and evaluation.
Requires: langgraph>=0.5.0, langchain-core>=0.3.0, langchain-anthropic>=0.2.0
"""
import json
import time
from dataclasses import dataclass
from datetime import datetime
from typing import Awaitable, Callable, Literal, Optional, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AIMessage, HumanMessage
from langgraph.graph import END, StateGraph

# --- Skill Schema Definition ---
# Skills are more than tools: they declare capabilities, memory access, and composition rules

@dataclass
class SkillMetadata:
    """Metadata for skill versioning and governance."""
    name: str
    version: str  # Semver: breaking changes in skills break agents
    author: str
    required_permissions: list[str]  # SkillScope-style declarations
    memory_access: Literal["none", "read", "read_write"]
    delegatable: bool = True  # Can this skill invoke other skills?


@dataclass
class SkillExecutionContext:
    """Runtime context provided to skill during execution."""
    episodic_memory: list[dict]  # Retrieved relevant past experiences
    working_memory: dict  # Scratch space for multi-step reasoning
    available_skills: list[str]  # Skills this skill can delegate to
    trace_id: str  # For evaluation and audit


@dataclass
class SkillResult:
    """Standardized skill output with provenance."""
    output: dict
    confidence: float
    sources_used: list[str]
    delegated_to: list[str]  # Which skills were invoked
    memory_writes: list[dict]  # What should be persisted
    execution_time_ms: int
    tokens_used: int


# --- Concrete Skill Implementation ---
# A WebResearchSkill that demonstrates memory coupling and tool delegation

class WebResearchSkill:
    """
    A skill for conducting web research with memory and learning.
    Unlike a simple search tool, this skill:
    - Retrieves past research on similar topics from memory
    - Adapts search strategy based on what worked before
    - Persists successful research patterns for future use
    """
    metadata = SkillMetadata(
        name="web_research",
        version="1.2.0",
        author="research_team",
        required_permissions=["search:web", "read:memory", "write:memory"],
        memory_access="read_write",
        delegatable=True,
    )

    def __init__(self, llm: ChatAnthropic, search_tool: Callable[..., Awaitable[list[dict]]]):
        self.llm = llm
        # search_tool must be an async callable returning a list of result dicts with a "url" key
        self.search_tool = search_tool

    async def execute(
        self,
        query: str,
        depth: Literal["quick", "standard", "comprehensive"],
        context: SkillExecutionContext,
    ) -> SkillResult:
        """Execute research with memory-informed strategy selection."""
        start_time = time.perf_counter()  # monotonic clock for timing
        tokens = 0
        sources = []

        # Step 1: Retrieve relevant past research from episodic memory
        # This is what distinguishes skills from tools
        past_research = [
            mem for mem in context.episodic_memory
            if mem.get("skill") == "web_research"
            and self._topic_similarity(query, mem.get("query", "")) > 0.7
        ]

        # Step 2: Adapt strategy based on past outcomes
        # If previous research on similar topics found certain sources reliable,
        # prioritize those sources
        reliable_sources = self._extract_reliable_sources(past_research)
        search_strategy = self._select_strategy(depth, reliable_sources)

        # Step 3: Execute search with adapted strategy
        search_queries = self._generate_queries(query, search_strategy)
        results = []
        for sq in search_queries:
            result = await self.search_tool(sq, sources=reliable_sources)
            results.extend(result)
            sources.extend([r["url"] for r in result])

        # Step 4: Synthesize results using LLM
        synthesis_prompt = self._build_synthesis_prompt(query, results, past_research)
        response = await self.llm.ainvoke([HumanMessage(content=synthesis_prompt)])
        # usage_metadata can be None when the provider reports no usage
        tokens += (response.usage_metadata or {}).get("total_tokens", 0)

        # Step 5: Prepare memory writes for future skill invocations
        # This is skill learning: recording what worked for future use
        memory_writes = [{
            "skill": "web_research",
            "query": query,
            "strategy_used": search_strategy,
            "sources_found_useful": self._identify_useful_sources(results, response),
            "timestamp": datetime.now().isoformat(),
            "outcome_pending": True,  # Will be updated based on user feedback
        }]

        execution_time = int((time.perf_counter() - start_time) * 1000)
        return SkillResult(
            output={"synthesis": response.content, "sources": sources[:10]},
            confidence=self._compute_confidence(results, response),
            sources_used=sources,
            delegated_to=[],
            memory_writes=memory_writes,
            execution_time_ms=execution_time,
            tokens_used=tokens,
        )

    def _topic_similarity(self, q1: str, q2: str) -> float:
        """Compute word-overlap (Jaccard) similarity between queries. Simplified for example."""
        # In production: use embedding similarity
        words1 = set(q1.lower().split())
        words2 = set(q2.lower().split())
        all_words = words1 | words2
        return len(words1 & words2) / len(all_words) if all_words else 0.0

    def _extract_reliable_sources(self, past_research: list[dict]) -> list[str]:
        """Identify sources that proved useful in past research."""
        source_scores = {}
        for research in past_research:
            for source in research.get("sources_found_useful", []):
                source_scores[source] = source_scores.get(source, 0) + 1
        return sorted(source_scores, key=source_scores.get, reverse=True)[:5]

    def _select_strategy(self, depth: str, reliable_sources: list[str]) -> dict:
        """Select search strategy based on depth and past learning."""
        base_strategies = {
            "quick": {"max_queries": 2, "max_results_per_query": 5},
            "standard": {"max_queries": 5, "max_results_per_query": 10},
            "comprehensive": {"max_queries": 10, "max_results_per_query": 20},
        }
        # Copy so the base strategy table is never mutated
        return {**base_strategies[depth], "prioritized_sources": reliable_sources}

    # Additional helper methods omitted for brevity: _generate_queries,
    # _build_synthesis_prompt, _identify_useful_sources, _compute_confidence
# --- Skill Registration and Agent Assembly ---

class SkillRegistry:
    """
    Registry for managing skill versions and dependencies.
    Implements the Fleet SkillAttachment pattern for version constraints.
    """

    def __init__(self):
        # name -> version -> {"skill", "config", "registered_at"}
        self._skills: dict[str, dict[str, dict]] = {}
        self._active_versions: dict[str, str] = {}  # name -> active version

    @staticmethod
    def _version_key(version: str) -> tuple[int, ...]:
        """Sort key for plain MAJOR.MINOR.PATCH strings (no pre-release tags)."""
        return tuple(int(part) for part in version.split("."))

    def register(self, skill: object, config_overrides: Optional[dict] = None):
        """Register a skill with optional configuration."""
        meta = skill.metadata
        self._skills.setdefault(meta.name, {})[meta.version] = {
            "skill": skill,
            "config": config_overrides or {},
            "registered_at": datetime.now().isoformat(),
        }
        # Set as active if no version is active or this one is newer
        active = self._active_versions.get(meta.name)
        if active is None or self._version_key(meta.version) > self._version_key(active):
            self._active_versions[meta.name] = meta.version

    def get_skill(self, name: str, version: Optional[str] = None) -> object:
        """Retrieve skill by name, optionally pinning version."""
        versions = self._skills.get(name)
        if not versions:
            raise ValueError(f"Skill {name} not found")
        target_version = version or self._active_versions[name]
        if target_version not in versions:
            raise ValueError(f"Skill {name} has no version {target_version}")
        return versions[target_version]["skill"]
# --- Skill-Based Agent State ---

class ResearchAgentState(TypedDict):
    """State for the research agent with skill-aware fields."""
    messages: list
    current_task: Optional[str]
    skill_invocations: list[dict]  # Track which skills were used
    episodic_memory: list[dict]  # Retrieved memories for context
    pending_memory_writes: list[dict]  # Memories to persist after completion
# --- Agent Construction ---

def build_research_agent(skill_registry: SkillRegistry, llm: ChatAnthropic):
    """
    Build a LangGraph agent that selects and invokes skills.
    The agent decides WHICH skill to use; skills handle HOW to execute.
    """

    async def skill_selector(state: ResearchAgentState) -> dict:
        """Agent decides which skill to invoke based on task and context."""
        task = state["current_task"]
        # Every name listed here must also be registered in skill_registry
        available_skills = ["web_research", "code_analysis", "summarization"]

        # LLM decides which skill(s) to invoke
        selection_prompt = (
            f"Given this task: {task}\n"
            f"Available skills: {available_skills}\n"
            f"Recent skill invocations: {state['skill_invocations'][-3:]}\n"
            "Which skill should be invoked? "
            'Respond with JSON: {"skill": "name", "params": {...}}'
        )
        response = await llm.ainvoke([HumanMessage(content=selection_prompt)])
        try:
            selection = json.loads(response.content)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Skill selection was not valid JSON: {response.content!r}") from exc

        # Get and execute the selected skill
        skill = skill_registry.get_skill(selection["skill"])
        context = SkillExecutionContext(
            episodic_memory=state["episodic_memory"],
            working_memory={},
            available_skills=available_skills,
            trace_id=f"trace_{datetime.now().timestamp()}",
        )
        result = await skill.execute(**selection["params"], context=context)

        # Return state updates as new lists instead of mutating the input state
        invocation = {
            "skill": selection["skill"],
            "params": selection["params"],
            "result_summary": result.output,
            "tokens": result.tokens_used,
        }
        return {
            "skill_invocations": state["skill_invocations"] + [invocation],
            "pending_memory_writes": state["pending_memory_writes"] + result.memory_writes,
            "messages": state["messages"] + [AIMessage(content=str(result.output))],
        }

    # Build the graph
    workflow = StateGraph(ResearchAgentState)
    workflow.add_node("select_and_invoke_skill", skill_selector)
    workflow.set_entry_point("select_and_invoke_skill")
    workflow.add_edge("select_and_invoke_skill", END)
    return workflow.compile()
This code demonstrates several key patterns from the skill architecture: typed skill metadata with version and permission declarations, memory coupling where skills read from and write to episodic memory, and the separation between skill selection (agent's job) and skill execution (skill's job). The SkillResult type ensures every skill invocation produces traceable, auditable output.
Evaluation and Governance: Making Skills Production-Ready
Skills without evaluation are liabilities. The 2026 research landscape has produced several benchmarking frameworks that address different aspects of skill quality.
Framework evaluations have converged on five evaluation axes for production skills. Correctness measures whether skill outputs meet acceptance criteria. Efficiency tracks token and compute costs relative to output quality. Generalization tests whether skills transfer to novel inputs within their intended domain. Composability verifies that skills work correctly when chained with others. Safety ensures skills operate within declared permission boundaries.
Research on agentic frameworks introduced SkillGenBench, specifically measuring whether agents can create useful new skills. This matters for systems with self-evolution enabled: if your agent proposes skill refinements, you need automated evaluation of those proposals before promotion to production. SkillGenBench tests include held-out task sets, adversarial inputs designed to expose skill boundaries, and composition stress tests.
For agents with continual learning, skill regression becomes a concern. Multi-agent system studies found that 23% of skill updates introduced regressions in other skills—a new research skill that searches more thoroughly might break a summarization skill's token budget assumptions. SkillLearnBench provides regression testing protocols: after any skill change, re-evaluate not just that skill but all skills that compose with it.
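Working out which skills to re-evaluate after a change is a graph traversal. The sketch below assumes you maintain a map from each skill to the skills that directly consume its output; the map, the skill names, and the decision to follow the edges transitively rather than stopping at direct neighbors are all assumptions for illustration.

```python
from collections import deque


def skills_to_reevaluate(changed: str, composes_with: dict[str, set[str]]) -> set[str]:
    """Return the changed skill plus everything downstream of it (breadth-first)."""
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for downstream in composes_with.get(current, set()):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected


# Hypothetical composition graph: skill -> skills that consume its output.
graph = {
    "web_research": {"summarization"},
    "summarization": {"report_writer"},
    "code_analysis": {"report_writer"},
}

print(sorted(skills_to_reevaluate("web_research", graph)))
# ['report_writer', 'summarization', 'web_research']
```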
Governance primitives are equally important for enterprise deployments. Research on explainability introduced Counterfactual Trace Auditing: given a skill execution trace, determine what would have happened with different inputs or different skill versions. This supports both debugging ("why did the research skill produce wrong results?") and compliance ("can we prove the skill never accessed unauthorized data?").
The least-privilege enforcement pattern from authorization research deserves implementation from day one. Skills declare permissions; runtime enforces them. A skill claiming read:memory cannot write to memory, even if it contains code attempting to do so. The enforcement layer intercepts all memory and tool access, checking against declared permissions. This prevents both accidental scope creep and adversarial prompt injection attacks that try to escalate skill privileges.
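A minimal sketch of the interception idea: the skill never touches the memory store directly, only a proxy that checks each call against the scopes the skill declared. `GuardedMemory` is a hypothetical class written for this example; the `read:memory` and `write:memory` scope names follow the ones used in the walkthrough above.

```python
class GuardedMemory:
    """Memory proxy that checks every access against a skill's declared scopes."""

    def __init__(self, store: dict, granted: set[str]):
        self._store = store
        self._granted = granted

    def _require(self, scope: str) -> None:
        if scope not in self._granted:
            raise PermissionError(f"skill did not declare {scope}")

    def read(self, key: str):
        self._require("read:memory")
        return self._store.get(key)

    def write(self, key: str, value) -> None:
        self._require("write:memory")
        self._store[key] = value


store = {"last_query": "agent skills"}
read_only = GuardedMemory(store, granted={"read:memory"})

print(read_only.read("last_query"))  # agent skills
try:
    read_only.write("last_query", "overwritten")
except PermissionError as err:
    print(f"blocked: {err}")  # blocked: skill did not declare write:memory
```

The same wrapper shape applies to tool access: the runtime hands the skill a guarded handle, so code inside the skill cannot reach past its declaration even if a prompt injection asks it to.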
Cost attribution often gets overlooked until bills arrive. Skills should report token usage, and the orchestration layer should aggregate costs per skill per task type. Enterprise platform discussions emphasize that skill-level cost visibility enables optimization: if your research skill costs 10x your analysis skill but delivers only 2x the value, that's actionable intelligence.
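Aggregating cost per skill per task type needs very little code once each invocation is logged. In this sketch the `skill` and `tokens` fields mirror the invocation records in the walkthrough above, while `task_type` and the sample numbers are assumptions added for illustration.

```python
from collections import defaultdict


def aggregate_costs(invocations: list[dict]) -> dict[tuple[str, str], int]:
    """Sum token usage per (skill, task_type) pair."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for inv in invocations:
        totals[(inv["skill"], inv["task_type"])] += inv["tokens"]
    return dict(totals)


# Hypothetical invocation log.
invocations = [
    {"skill": "web_research", "task_type": "market_scan", "tokens": 1200},
    {"skill": "web_research", "task_type": "market_scan", "tokens": 800},
    {"skill": "summarization", "task_type": "market_scan", "tokens": 300},
]

for (skill, task_type), tokens in aggregate_costs(invocations).items():
    print(f"{skill:15} {task_type:12} {tokens}")
```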
What This Means for Your Stack
If you're starting a new agent project, choose a framework with first-class skill support. LangChain's Deep Agents, CrewAI Enterprise, and Microsoft Foundry all offer skill primitives. Retrofitting skill abstractions onto tool-based agents requires rearchitecting memory access patterns and state management—it's substantially harder than building with skills from the start.
If you have existing tool-based agents, begin migrating high-value tool chains to skill abstractions incrementally. Start with tools that have implicit memory dependencies: anything that benefits from "remembering" past invocations. A search tool becomes a research skill when it tracks which sources proved reliable. A code generation tool becomes a coding skill when it learns from past review feedback. The evolution of agent architectures provides migration patterns for this transition.
Skill versioning strategy requires treating skills like npm packages. Use semantic versioning: patch versions for bug fixes, minor versions for backward-compatible capability additions, major versions for breaking changes. Maintain lockfiles that pin skill versions per deployment. Establish deprecation policies—how long do you support old skill versions? Production rankings show that teams with explicit versioning policies experience 60% fewer production incidents from skill changes.
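A lockfile check along these lines might look like the following. It is a simplified sketch: versions are assumed to be plain MAJOR.MINOR.PATCH strings with no pre-release tags, the lockfile is modeled as a dict, and the function names are invented for the example.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(pinned: str, candidate: str) -> bool:
    """A candidate is safe if it shares the major version and is not older."""
    p, c = parse_version(pinned), parse_version(candidate)
    return c[0] == p[0] and c >= p


def safe_upgrades(lockfile: dict[str, str], available: dict[str, list[str]]) -> dict[str, str]:
    """For each pinned skill, pick the newest compatible version, if any is newer."""
    upgrades = {}
    for name, pinned in lockfile.items():
        candidates = [v for v in available.get(name, []) if is_compatible(pinned, v)]
        if candidates:
            best = max(candidates, key=parse_version)
            if best != pinned:
                upgrades[name] = best
    return upgrades


# Hypothetical lockfile and registry contents.
lockfile = {"web_research": "1.2.0", "summarization": "2.0.1"}
available = {"web_research": ["1.2.0", "1.3.1", "2.0.0"], "summarization": ["2.0.1"]}

print(safe_upgrades(lockfile, available))  # {'web_research': '1.3.1'}
```

The major bump to 2.0.0 is deliberately skipped: under semantic versioning it signals a breaking change, so it should be adopted by editing the lockfile on purpose, not by an automatic upgrade.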
Evaluation investment scales with skill complexity. Skill-based agents require skill-level evaluation, not just end-to-end task success. If your research agent fails, you need to know whether the research skill failed, the synthesis skill failed, or the composition logic failed. Budget for evaluation infrastructure—expect 15-20% of agent development effort to go toward testing and benchmarking.
Security implications are substantial. Skills with memory access and tool delegation are powerful attack surfaces. A compromised skill can exfiltrate data through memory writes, escalate privileges through delegation, or persist malicious patterns for future invocations. Implement permission enforcement from day one; adding it later requires auditing every existing skill.
Team structure may need adjustment. Skills create natural ownership boundaries. Consider a skill ownership model similar to microservice ownership: designated maintainers, explicit SLOs, documented interfaces. Developer tool discussions suggest that teams with clear skill ownership see faster iteration and fewer cross-cutting bugs.
Timeline expectations: Skill architectures are production-ready now, but expect significant API churn through 2026. Abstract your skill interfaces—depend on your own skill protocols, not framework-specific implementations directly. The SkillNet interchange format may stabilize by Q4 2026, at which point portability across frameworks becomes practical.
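One way to depend on your own skill protocol instead of a framework class is Python's structural typing. The sketch below is an assumption about what such an interface could look like: `SkillProtocol`, its three members, and `EchoSkill` are all hypothetical, and a real adapter would wrap a framework-specific skill behind the same `run` method.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SkillProtocol(Protocol):
    """The only interface application code is allowed to depend on."""
    name: str
    version: str

    def run(self, payload: dict[str, Any]) -> dict[str, Any]: ...


class EchoSkill:
    """Stand-in implementation; note that it does not inherit from SkillProtocol."""
    name = "echo"
    version = "0.1.0"

    def run(self, payload: dict[str, Any]) -> dict[str, Any]:
        return {"echo": payload}


def invoke(skill: SkillProtocol, payload: dict[str, Any]) -> dict[str, Any]:
    """Call any object that structurally satisfies the protocol."""
    if not isinstance(skill, SkillProtocol):
        raise TypeError(f"{skill!r} does not implement SkillProtocol")
    return skill.run(payload)


print(invoke(EchoSkill(), {"q": "agent skills"}))  # {'echo': {'q': 'agent skills'}}
```

When a framework API changes, only the adapter that implements the protocol needs to move; callers of `invoke` stay untouched.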
What to Build This Week
Build a skill-enabled personal research assistant that demonstrates the tools-to-skills evolution:
- Start with a basic research agent using standard tools (web search, document reading)
- Add a SkillRegistry and migrate web search to a WebResearchSkill with memory coupling
- Implement episodic memory that tracks: which sources proved useful, which search strategies worked for different query types, which results the user marked as helpful
- Add skill-level evaluation: track correctness (did the user accept the research?), efficiency (tokens per useful result), and generalization (does the skill work on new topic domains?)
- Implement one round of demonstration learning: record yourself researching a topic, have the agent extract a skill refinement, evaluate whether the refinement improves outcomes
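For the skill-level evaluation step, two of the metrics can be computed directly from a log of invocations. This is a minimal sketch: the `Invocation` record and its fields are assumptions about what you would choose to log, and generalization is left out because it needs a held-out set of topics rather than a formula.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    """Hypothetical log record for one research-skill run."""
    accepted: bool       # did the user accept the research?
    tokens: int          # total tokens spent on the run
    useful_results: int  # results the user marked as helpful


def correctness(runs: list[Invocation]) -> float:
    """Fraction of runs the user accepted."""
    return sum(r.accepted for r in runs) / len(runs) if runs else 0.0


def tokens_per_useful_result(runs: list[Invocation]) -> float:
    """Efficiency: tokens spent per result marked helpful (inf if none were)."""
    useful = sum(r.useful_results for r in runs)
    total = sum(r.tokens for r in runs)
    return total / useful if useful else float("inf")


runs = [Invocation(True, 1000, 4), Invocation(False, 500, 0), Invocation(True, 1500, 6)]
print(f"correctness={correctness(runs):.2f}")                   # correctness=0.67
print(f"tokens/useful={tokens_per_useful_result(runs):.1f}")    # tokens/useful=300.0
```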
The complete implementation should take 8-12 hours. By the end, you'll have hands-on experience with skill schemas, memory coupling patterns, and the evaluation infrastructure that makes skills production-ready. More importantly, you'll understand why the industry is converging on this abstraction—and be ready to apply it to your production systems.
Sources
- March 2026: LangChain Newsletter
- LangChain Announces Enterprise Agentic AI Platform Built with NVIDIA
- A Developer's Guide to Agentic Frameworks in 2026 - Towards AI
- LangChain vs LangGraph 2026: Which AI Agent Framework?
- AI Agent Frameworks 2026: Production-Tested Ranking by Alice Labs
- AI Agent Frameworks Comparison 2026: Complete Guide
- Handling shared state across multi-agent conversations in AutoGen
- CrewAI now lets you build fleets of enterprise AI agents | VentureBeat
- XAgen: An Explainability Tool for Identifying and Correcting Failures in Multi-Agent Workflows
- A Large-Scale Study on the Development and Issues of Multi-Agent AI Systems
- Deterministic Pre-Action Authorization for Autonomous AI Agents
- Advanced Function Calling and Multi-Agent Systems with Small Language Models in Foundry Local
- ai-agent-papers/capability-papers/tool-use.md at main - GitHub
- Agentic Frameworks for Reasoning Tasks: An Empirical Study
- Awesome Agentic Reasoning Papers - GitHub
- The Path Ahead for Agentic AI: Challenges and Opportunities
- Rethinking Agentic Reinforcement Learning In Large Language Models
- Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Applications
- The Evolution of Agentic AI Software Architecture - arXiv
- Best AI Tools for Developers in 2026: What Are Your Must-Have...
*This is part of the **Agentic Engineering Weekly** series, a deep-dive every Monday into the frameworks, patterns, and techniques shaping the next generation of AI systems.*
Follow the Agentic Engineering Weekly series on Dev.to to catch every edition.
Building something agentic? Drop a comment — I'd love to feature reader projects.