Ken W Alger

Posted on • Originally published at kenwalger.com on

From Cloud to Laptop: Running MCP Agents with Small Language Models

Large Models Build Systems. Small Models Run Them.

For most developers, modern AI systems feel locked behind massive infrastructure.

We’ve been conditioned to believe that “Intelligence” is a service we rent from a data center—a luxury that requires GPU clusters, $10,000 hardware, and ever-climbing cloud inference bills.

Last week, when we built our Multi-Agent Forensic Team, you likely assumed that coordinating a Supervisor, a Librarian, and an Analyst required the reasoning horsepower of a 400B+ parameter model.

Today, we’re cutting the cord. We are moving the entire Forensic Team—the agents, the orchestration, and the data—onto a standard laptop. No cloud. No API costs. No data leaving your local network.

This is the power of Edge AI combined with the Model Context Protocol (MCP).

The Pivot: The “Forensic Clean-Room”

In the world of rare book forensics, data sovereignty isn’t a “nice-to-have.” When you are auditing high-value archival records or sensitive provenance data, the “Clean-Room” approach is the gold standard. You want the data isolated.

By moving our stack to the Edge, we transform a laptop into a portable forensic lab.

The Edge Architecture

Architecture diagram showing an MCP-based multi-agent system running locally with small language models where a supervisor and specialist agents interact with an MCP server and local archive database on a laptop.
Running MCP agents locally: small language models power the supervisor and specialist agents while the MCP server provides structured tool access to local data.

Notice that the architecture we built in Post 2 doesn’t change. Because we used MCP as our “USB-C” interface, we don’t have to rewrite our tools or our agents. We only swap the Inference Engine.

Why SLMs Love MCP

Small language models struggle when tasks are open-ended.

However, MCP dramatically reduces the search space.

Instead of inventing answers, the model interacts with structured primitives:

  • tools
  • resources
  • prompts

each defined with a strict schema.
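To see why strict schemas help a small model, consider what a tool definition actually constrains. The sketch below is illustrative, not the series repo's actual code: the `lookup_provenance` tool name and its fields are hypothetical, and the validator is a deliberately minimal stand-in for a real JSON Schema library.

```python
# Hypothetical tool definition in the style of an MCP tool schema.
# The name and fields are illustrative, not the actual repo's tools.
LOOKUP_PROVENANCE = {
    "name": "lookup_provenance",
    "description": "Fetch provenance records for a catalogued rare book.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "catalog_id": {"type": "string"},
            "max_records": {"type": "integer", "minimum": 1, "maximum": 50},
        },
        "required": ["catalog_id"],
    },
}

def validate_call(schema: dict, args: dict) -> list[str]:
    """Minimal structural check: required keys present, basic types match."""
    errors = []
    props = schema["inputSchema"]["properties"]
    for key in schema["inputSchema"].get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    type_map = {"string": str, "integer": int}
    for key, value in args.items():
        expected = props.get(key, {}).get("type")
        if expected in type_map and not isinstance(value, type_map[expected]):
            errors.append(f"{key}: expected {expected}")
    return errors

print(validate_call(LOOKUP_PROVENANCE, {"catalog_id": "INC-0042"}))  # []
print(validate_call(LOOKUP_PROVENANCE, {"max_records": "ten"}))
# ['missing required argument: catalog_id', 'max_records: expected integer']
```

A 14B model doesn't have to invent an interface; it only has to fill in a few typed slots, and malformed calls can be rejected before they ever touch the data.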

The Thesis: Large models are great for designing the system and writing the initial code. Small models are the perfect runtime engines for executing those standardized tasks.

The “How-To”: Swapping the Engine

In our updated orchestrator.py, we’ve introduced a provider flag. Instead of hitting a remote API, the Python supervisor now talks to a local inference server (like Ollama or LM Studio).

```python
# [Post 3 - Edge AI] Swapping the Inference Provider
if args.provider == "ollama":
    # Point to the local SLM engine
    client = OllamaClient(base_url="http://localhost:11434")
    model = "phi4"
else:
    # Standard cloud provider
    client = AnthropicClient()
    model = "claude-3-5-sonnet"
```

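The `OllamaClient` in that snippet is not a standard library class. Here is one way such a wrapper might look, as a hedged sketch using only the standard library and Ollama's documented local `/api/chat` endpoint; the series repo's real implementation may differ.

```python
import json
import urllib.request

class OllamaClient:
    """Illustrative sketch of a local-inference client for Ollama.
    Talks to the /api/chat endpoint on the default Ollama port."""

    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url.rstrip("/")

    def build_request(self, model: str, messages: list[dict]) -> urllib.request.Request:
        # Non-streaming chat request, per Ollama's REST API.
        payload = {"model": model, "messages": messages, "stream": False}
        return urllib.request.Request(
            f"{self.base_url}/api/chat",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )

    def chat(self, model: str, messages: list[dict]) -> str:
        with urllib.request.urlopen(self.build_request(model, messages)) as resp:
            return json.loads(resp.read())["message"]["content"]

# Usage (requires a running Ollama instance with the model pulled):
# client = OllamaClient()
# print(client.chat("phi4", [{"role": "user", "content": "Summarize MCP in one line."}]))
```

Because the supervisor only needs "send messages, get text back," the same interface shape works for a cloud SDK or a local server, which is what makes the one-flag swap possible.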

Because our TypeScript MCP Server is running locally via stdio, the latency is nearly zero. The “Librarian” fetches metadata from the local database, and the “Analyst” runs the audit—all without a single packet hitting the open web.
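For the curious, "running locally via stdio" means the orchestrator launches the server as a child process and exchanges JSON-RPC messages over its stdin/stdout, framed one JSON object per line. The sketch below assumes the TypeScript server is launchable as `node dist/server.js` (a placeholder path) and shows the shape of an MCP `initialize` request; consult the MCP spec for the full handshake.

```python
import json
import subprocess

def encode_message(msg: dict) -> bytes:
    # MCP's stdio transport frames each JSON-RPC message as one line of JSON.
    return (json.dumps(msg) + "\n").encode("utf-8")

INITIALIZE = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "forensic-orchestrator", "version": "0.1"},
    },
}

# Illustrative launch of a local stdio MCP server (path is a placeholder):
# proc = subprocess.Popen(["node", "dist/server.js"],
#                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# proc.stdin.write(encode_message(INITIALIZE)); proc.stdin.flush()
# response = json.loads(proc.stdout.readline())
```

No sockets, no TLS, no network hop at all: the round trip is a pipe write and a pipe read, which is where the near-zero latency comes from.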

Benchmarking the Forensic Team: Cloud vs. Edge

Does a 14B model perform as well as a 400B model for forensics? When constrained by MCP schemas, the gap is surprisingly small.

| Criteria | Cloud (Claude/GPT-4) | Edge (Phi-4/Mistral) |
| --- | --- | --- |
| Reasoning depth | Extremely high | High (with MCP tool constraints) |
| Latency | 1.5s – 3s (network dependent) | < 500ms (local inference) |
| Cost | Per-token billing | $0.00 |
| Privacy | Data processed externally | 100% data sovereignty |
| Scalability | Effectively unlimited | Limited by local RAM/NPU |
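The latency column is easy to reproduce on your own hardware. This is a generic timing harness, not part of the repo; the commented-out `client` calls are placeholders for whatever local and cloud clients you actually use.

```python
import statistics
import time

def time_call(fn, runs: int = 5) -> float:
    """Median wall-clock latency of fn() in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Illustrative usage (client objects are placeholders):
# local_ms = time_call(lambda: client.chat("phi4", messages))
# cloud_ms = time_call(lambda: cloud_client.send(messages))
# print(f"local: {local_ms:.0f}ms  cloud: {cloud_ms:.0f}ms")
```

The median is used rather than the mean so a single cold-start or network hiccup doesn't skew the comparison.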

The Reveal: Same System, New Home

If you look at the latest update to the repository, you’ll see that the orchestration logic is nearly identical. The architecture stack from earlier posts remains unchanged.

Comparison diagram showing cloud-based AI architecture using large models and remote inference versus edge AI architecture using small language models and local MCP tool servers.
Edge AI architecture replaces cloud inference with local small language models while retaining MCP-based tool access.

Nothing about the agents changed.

Nothing about the tools changed.

Only the inference engine moved.

The “Zero-Glue” promise is realized here.

We didn’t build a cloud app; we built a protocol-driven system. The fact that it can live on a server or a laptop is simply a deployment choice.

What’s Next?

We’ve built the server. We’ve orchestrated the team. We’ve moved it to the edge.

In the final post of this series, we tackle the “Final Boss” of AI systems: Enterprise Governance. We’ll explore how to take this forensic lab and scale it across an organization using Oracle 26ai, ensuring that every audit is secure, permissioned, and defensible.

Ready to go local?

Check out the orchestrator.py update and try running the Forensic Team on your own machine.

👉 MCP Forensic Analyzer – Edge AI Example


