Mininglamp

Agent vs Skill vs MCP vs Tool: The 4-Layer Stack Every AI Developer Should Know

The Terminology Problem

The AI agent ecosystem has a vocabulary collision. "Tool" means one thing in LangChain, another in AutoGPT, and something else entirely in Claude's function-calling docs. "Skill" and "agent" are similarly overloaded—an "agent" might be a simple prompt wrapper or a fully autonomous system that books flights and deploys code. "MCP" arrived in late 2024 and added yet another term to the mix.

This matters architecturally. When layers are conflated, testing becomes harder, reuse drops, and swapping a model means rewriting half the system. A function that orchestrates 15 steps gets called a "tool." A prompt that strings together API calls gets called an "agent." The result is codebases where nothing is composable.

A 4-layer mental model resolves most of the confusion—similar to how the OSI model gave networking a shared vocabulary, or how MVC clarified web application structure. It's not a rigid specification, but a framework for making architectural discussions more productive.

The 4-Layer Stack

From bottom to top:

[Architecture diagram showing the 4-layer stack]

Layer 1: Tools — The Atoms

A tool is a single, stateless function that performs one atomic operation. It clicks a button, reads a file, calls an API, or captures a screenshot. Tools have no memory, no planning capability, and no awareness of why they're being called.

Key properties:

  • Deterministic (or close to it)
  • Testable in isolation
  • Composable — designed to be called by higher layers
  • Environment-specific — a click() on macOS differs in implementation from click() on Android, even if the interface is identical

Examples:

  • screenshot() — captures the current screen
  • click(x, y) — clicks at coordinates
  • read_file(path) — returns file contents
  • http_get(url) — fetches a URL

Tools are the smallest composable unit. They accept input, perform one action, and return a result. No side quests. The web analogy: individual HTTP endpoints. A GET /users/:id doesn't know about business logic—it fetches a row from a database and returns it.
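The tool layer can be sketched as plain stateless functions. This is a minimal illustration, assuming a Python codebase; the function names mirror the examples above, and the click() body is stubbed since its implementation is platform-specific:

```python
import urllib.request

# Each tool is one stateless function with a typed signature and no
# knowledge of why it is being called.

def read_file(path: str) -> str:
    """Return the contents of a file — one atomic operation."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def http_get(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def click(x: int, y: int) -> None:
    """Click at screen coordinates (implementation is environment-specific:
    Quartz events on macOS, xdotool on Linux — the interface stays identical)."""
    ...
```

Because each function is stateless and does exactly one thing, each can be unit-tested with nothing but its inputs and outputs.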

Layer 2: MCP (Model Context Protocol) — The Connectors

MCP is a standardized transport layer for tool discovery and invocation across process boundaries. Think of it as GraphQL or gRPC for AI systems—it defines how tools are discovered, described, and called, not what they do.

Before MCP, every agent framework had its own tool integration spec. Building a tool for LangChain meant rebuilding it for AutoGPT. Building it for CrewAI meant doing it again. MCP standardizes three things:

  • Discovery: "What tools are available on this server?"
  • Schema: "What parameters does this tool accept? What does it return?"
  • Transport: stdio, HTTP, or WebSocket—the calling code picks the transport

MCP is about interoperability, not intelligence. An MCP server exposes tools; it never decides when to use them. The calling agent makes all decisions. An MCP server is a waiter that presents the menu and takes orders—it doesn't choose the meal.
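Concretely, the menu-and-orders exchange is a pair of JSON-RPC 2.0 methods, tools/list and tools/call. The messages below are an illustrative sketch (written as Python dicts; the read_file tool and its schema are hypothetical), not a complete protocol trace:

```python
# Discovery: the client asks what tools the server exposes.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Schema: the server describes each tool with a name, a description,
# and a JSON Schema for its parameters.
list_response = {
    "jsonrpc": "2.0", "id": 1,
    "result": {
        "tools": [{
            "name": "read_file",
            "description": "Return the contents of a file",
            "inputSchema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        }]
    },
}

# Invocation: the *caller* decides when to invoke; the server only executes.
call_request = {
    "jsonrpc": "2.0", "id": 2,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"path": "/tmp/notes.txt"}},
}
```

Note what is absent: nothing in these messages reasons or plans. The server answers "what can you do?" and "do this one thing" — every decision stays on the client side.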

When MCP adds value: Tools living in different processes or machines. Multiple agents or frameworks sharing the same tool set. Tool authors who want to write once and have it work across LangChain, Claude, OpenAI Assistants, and others.

When MCP adds overhead without benefit: Everything runs in-process and only one agent consumes the tools. In that case, direct function calls are simpler.

Layer 3: Skills — The Playbooks

A skill is a reusable, multi-step procedure that combines tools to accomplish a meaningful task. The web analogy: a service-layer module. A PlaceOrderUseCase orchestrates inventory checks, payment processing, and notifications—it's not a single endpoint but a choreography of endpoints.

"Fill out a web form" is a skill: it involves locating fields, typing values, handling dropdowns, scrolling, and clicking submit. Each step invokes tools, but the sequence, branching logic, and error recovery are the skill's contribution.

Examples:

  • "Navigate to Settings > Privacy > Clear Cache" (UI navigation)
  • "Search for a flight, compare prices, select the cheapest" (multi-step research)
  • "Read an Excel file, extract key metrics, generate a summary" (data analysis)
  • "Log into a service, check account status, export a report" (multi-app workflow)

Skills are portable when the underlying tool layer provides the required primitives. A "fill web form" skill works on any OS as long as click, type, and screenshot tools are available underneath.
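That portability falls out naturally when a skill is written against a tool interface rather than a platform. A minimal sketch, with hypothetical names (UITools, fill_form) and a deliberately simplified field model:

```python
from typing import Protocol

class UITools(Protocol):
    """The tool interface a skill depends on — any OS can satisfy it."""
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...

def fill_form(tools: UITools, fields: dict[tuple[int, int], str]) -> int:
    """Skill: click each field and type its value; returns fields filled.

    The sequencing lives here; the atomic actions live in whatever
    tool layer is passed in (macOS, Linux, Android...)."""
    filled = 0
    for (x, y), value in fields.items():
        tools.click(x, y)
        tools.type_text(value)
        filled += 1
    return filled
```

The same fill_form runs unchanged on any platform whose tool layer implements the two methods — and in tests, a fake recorder object can stand in for the real tools, which is exactly the "integration-test the skill, unit-test the tools" split described below.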

The skill is the natural unit of reuse. A 3-line function and a 300-line multi-step workflow serve fundamentally different purposes; separating them clarifies what's testable in isolation (tools) versus what requires integration testing (skills). Skills can also be shared across agents—one agent might use a "file analysis" skill in a data pipeline context, another in a customer support context.

Layer 4: Agent — The Decision-Maker

An agent is the autonomous reasoning entity that decides what to do, when, and why. It observes the environment (via tools), reasons about the next action (via its language model), selects the appropriate skill, monitors execution, and adapts when things fail.

An agent owns:

  • Goal decomposition — breaking "book me a flight to Tokyo" into subtasks
  • Skill selection — choosing which playbook fits the current subtask
  • Error recovery — detecting failures and trying alternatives
  • Memory — tracking what's been done across a session
  • Termination judgment — knowing when the goal is achieved

Agents are model-powered. Replace the model, and the agent's capability ceiling changes. But in a well-layered architecture, skills and tools remain valid regardless of which model drives the agent. This is the key insight: the agent is the most volatile layer (models improve quarterly), while tools and skills are the most stable (click is still click).
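The responsibilities above can be compressed into a single loop. This is a deliberately minimal sketch — the model, skills, and observe callables are hypothetical interfaces, not any particular framework's API:

```python
def run_agent(goal, model, skills, observe, max_steps=20):
    """Observe -> reason -> select skill -> execute -> check termination.

    `model(goal, observation, history)` returns a (skill_name, args)
    pair, or None when it judges the goal achieved."""
    history = []                                    # memory across the session
    for _ in range(max_steps):
        observation = observe()                     # e.g. a screenshot, via tools
        decision = model(goal, observation, history)
        if decision is None:                        # termination judgment
            return history
        skill_name, args = decision                 # skill selection
        try:
            result = skills[skill_name](**args)     # skill invokes tools
        except Exception as err:                    # error recovery hook
            result = f"failed: {err}"
        history.append((skill_name, args, result))
    return history
```

Everything model-specific is confined to the `model` callable: swap in a stronger LLM and the loop, the skills, and the tools are untouched.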

How the Layers Compose

Agent (decides what to do)
  ↓ selects
Skill (knows how to do it)
  ↓ invokes via
MCP (discovers and routes)
  ↓ calls
Tool (executes one atomic action)

This separation enables:

  1. Swappable models — upgrade the agent's LLM without touching skills or tools
  2. Portable skills — move a skill from cloud to edge by swapping the tool layer
  3. Testable tools — unit-test each tool independently, integration-test each skill
  4. Interoperable infrastructure — MCP means tools work with any compliant agent

A Real-World Example: Mano-P

Mano-P is Mininglamp Technology's open-source on-device GUI agent for macOS. It illustrates how the Agent and Skill layers work together in a local-first, privacy-preserving architecture.

It is purely vision-driven—understanding screens via screenshots, with no dependency on DOM trees, accessibility APIs, or HTML scraping. A local 4B-parameter model runs the entire inference loop on-device.

At the Tool layer: Screen capture, mouse click, keyboard input, scroll—all native macOS operations. No cloud calls for any action primitive.

At the Skill layer: Multi-step workflows for desktop tasks—form filling, app navigation, data extraction—compose the native tools into reliable sequences. These are packaged as mano-skill, a format callable by external orchestrators like Claude Code or OpenClaw agents.

At the Agent layer: The vision-language model observes screenshots and decides the next action autonomously. On Apple M4 + 32GB RAM, it runs at 76 tok/s using the Cider SDK (MLX inference acceleration with W8A8 activation quantization). Data never leaves the device—no screenshots uploaded to cloud APIs, no keystrokes logged remotely.

On the OSWorld benchmark, Mano-P ranked #1 in the proprietary model category with 58.2% accuracy—demonstrating that smaller local models with well-separated architecture can compete with cloud-dependent systems on real desktop tasks.

Installation:

brew tap Mininglamp-AI/tap && brew install mano-cua

Apache 2.0 licensed. Hardware requirement: Apple M4 chip + 32GB RAM.

When to Use What

Not every project needs all four layers:

Tools alone — deterministic automation with fixed sequences (cron jobs, CI pipelines, simple scripts).

Tools + MCP — tools live in different processes or machines; multiple agents share the same tool set.

Tools + MCP + Skills — multi-step workflows with conditional logic and error recovery; reusable procedures across different agents.

Full stack (Agent + Skill + MCP + Tool) — goals are ambiguous or user-specified at runtime; the environment is dynamic; autonomous operation over extended sessions is needed.

Building from the bottom up tends to work well. Get tools right first. Add MCP when interop is needed. Compose skills when workflows emerge. Add an agent when autonomous reasoning becomes necessary.

Common Architecture Smells

Patterns worth recognizing early:

  • Monolithic prompts — tools, skills, and orchestration logic all in one system message. Hard to test or debug individual pieces. Hard to reuse across projects.
  • "Tools" that maintain state — a function doing 15 things with internal state is a skill in disguise. Recognizing this improves testability and makes the codebase legible.
  • MCP everywhere — wrapping every in-process function call in MCP transport adds complexity without interoperability gains. MCP shines at boundaries, not within a single process.
  • Platform logic in skills — skills containing OS-specific code instead of delegating to tools lose portability. The fix: push platform specifics down into the tool layer where they belong.
  • Agent without skills — putting all multi-step logic directly in the agent's prompt creates a brittle system that breaks when the model changes or the prompt grows too long.
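The second smell is worth seeing in code. Below is a hypothetical before/after: a "tool" that logs in, fetches status, and summarizes while holding session state between calls, then the same logic refactored into stateless tools plus a skill (all bodies are stubs for illustration):

```python
# Before (smell): one "tool", three steps, hidden state across calls —
# a skill in disguise, and untestable in isolation.
class StatusReporter:
    def __init__(self):
        self.session = None                     # hidden state

    def run(self, service: str) -> str:
        if self.session is None:                # step 1: login
            self.session = f"session:{service}"
        status = {"ok": True}                   # step 2: fetch status
        return f"report ok={status['ok']}"      # step 3: summarize

# After: three stateless tools, each unit-testable on its own...
def login(service: str) -> str:
    return f"session:{service}"

def fetch_status(session: str) -> dict:
    return {"session": session, "ok": True}

def export_report(status: dict) -> str:
    return f"report ok={status['ok']}"

# ...plus a skill that owns the sequencing and error handling.
def check_and_export(service: str) -> str:
    return export_report(fetch_status(login(service)))
```

After the split, each tool has a single input/output contract, and the multi-step logic sits where it belongs — in a skill the integration tests can target.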

Summary

The 4-layer model—Tool, MCP, Skill, Agent—provides a vocabulary for answering recurring design questions:

  • Where does this logic belong?
  • What's reusable vs. environment-specific?
  • What can be tested in isolation?
  • What changes when the model is swapped?
  • What survives a model upgrade without modification?

These are the same separation-of-concerns questions that web development answered with MVC, service layers, and API gateways. The AI agent stack is working through equivalent patterns now. The projects that age well will be the ones with clean boundaries between layers—where upgrading the LLM doesn't require rewriting the skill library, and swapping from macOS to Linux only means changing the tool implementations.


Mano-P is open-source at github.com/Mininglamp-AI/Mano-P. If you find this useful, a ⭐ on GitHub helps the project reach more developers.
