A few weeks ago, Pinterest engineering published numbers from their production MCP deployment: ~66,000 monthly tool invocations, 844 active users, an estimated 7,000 engineering hours saved per month. (source)
What stood out wasn't the scale. It was what they didn't skip — every server passed security, legal, privacy, and compliance review before going to production. Sensitive operations required human approval.
That's the part most MCP tutorials skip. Most guides show you how to wrap an API in five minutes. Almost none show you how to design a tool that's actually safe and useful for an AI agent to call autonomously.
This post walks through that gap — using a generic incident/log-analysis domain as the example — with working FastMCP 3.x code.
The wrapper instinct
The fastest way to build an MCP server is to take an existing API and expose every endpoint as a tool, 1:1. It looks something like this:
from fastmcp import FastMCP

mcp = FastMCP("incidents")

# `db` stands in for your own database client; it is not defined in this snippet.

@mcp.tool
def get_incident(incident_id: str) -> dict:
    """Fetch a single incident by ID."""
    return db.query("SELECT * FROM incidents WHERE id = ?", incident_id)

@mcp.tool
def list_incidents() -> list[dict]:
    """List all incidents."""
    return db.query("SELECT * FROM incidents")

@mcp.tool
def update_incident(incident_id: str, status: str) -> dict:
    """Update an incident's status."""
    return db.execute("UPDATE incidents SET status = ? WHERE id = ?", status, incident_id)
This works. It'll pass a demo. It's also exactly the pattern that causes problems in production:
- No domain semantics. The model has to infer what "status" values are valid, what counts as a duplicate incident, what severity actually means in your system — none of that is encoded anywhere.
- No governance. update_incident can be called with zero friction. An agent that misreads a request can silently change production state.
- Context bloat. One developer building on GitHub's MCP server described it as dumping over 40 tools into context before doing anything, which measurably degrades agent performance, since the model has to reason over every tool description on every turn.
A thin 1:1 wrapper is a fast way to get a demo. It is rarely the right shape for a tool an agent calls unsupervised.
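To make the "no domain semantics" point concrete, here is a minimal sketch of what encoding status rules could look like. The status values and the transition table are hypothetical, chosen only for illustration; the point is that a bare `status: str` parameter carries none of this.

```python
from enum import Enum


class IncidentStatus(str, Enum):
    # Hypothetical status values; substitute whatever your system uses.
    OPEN = "open"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


# Hypothetical transition rules: which status changes are considered legal.
ALLOWED_TRANSITIONS = {
    IncidentStatus.OPEN: {IncidentStatus.INVESTIGATING, IncidentStatus.RESOLVED},
    IncidentStatus.INVESTIGATING: {IncidentStatus.OPEN, IncidentStatus.RESOLVED},
    IncidentStatus.RESOLVED: {IncidentStatus.OPEN},
}


def validate_transition(current: str, requested: str) -> IncidentStatus:
    """Return the requested status as an enum, or raise ValueError with a readable reason."""
    try:
        cur = IncidentStatus(current)
        new = IncidentStatus(requested)
    except ValueError:
        valid = ", ".join(s.value for s in IncidentStatus)
        raise ValueError(f"Unknown status; valid values are: {valid}") from None
    if new not in ALLOWED_TRANSITIONS[cur]:
        raise ValueError(f"Cannot move an incident from {cur.value} to {new.value}")
    return new
```

An error message that lists the valid values gives the calling model something to correct itself with, instead of a failed write it has to guess about.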
Designing a domain-specific tool instead
The alternative isn't more tools — it's fewer, smarter tools that encode the judgment a domain expert would apply by hand.
Here's the same incident-analysis domain, redesigned:
from datetime import datetime, timedelta, timezone
from enum import Enum

from fastmcp import FastMCP
from pydantic import BaseModel, Field

mcp = FastMCP("incident-analysis")

# `db` and the underscore-prefixed helpers are placeholders for your own code.

class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

# Enum members iterate in definition order, which here is lowest to highest.
_SEVERITY_ORDER = [s.value for s in Severity]

class IncidentSummary(BaseModel):
    total: int
    by_severity: dict[str, int]
    likely_duplicates: list[str] = Field(
        description="Incident IDs that may be duplicates based on time/error-signature clustering"
    )
    needs_human_review: bool

@mcp.tool
def summarize_recent_incidents(
    hours: int = 24,
    min_severity: Severity = Severity.MEDIUM,
) -> IncidentSummary:
    """
    Summarize incidents from the last N hours at or above a severity threshold.
    Applies domain logic to flag likely duplicates (same error signature within
    a 10-minute window) rather than returning raw rows for the model to interpret.
    Returns needs_human_review=True if any CRITICAL incidents are unresolved,
    signaling the caller should not auto-close or auto-triage without a human.
    """
    # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    # Severity is stored as text, so ">=" would compare alphabetically
    # ("critical" < "high" < "low"). Filter on an explicit list instead.
    allowed = _SEVERITY_ORDER[_SEVERITY_ORDER.index(min_severity.value):]
    placeholders = ", ".join("?" for _ in allowed)
    rows = db.query(
        f"SELECT * FROM incidents WHERE created_at >= ? AND severity IN ({placeholders})",
        cutoff, *allowed,
    )
    duplicates = _cluster_by_signature(rows, window_minutes=10)
    has_unresolved_critical = any(
        r["severity"] == "critical" and r["status"] != "resolved" for r in rows
    )
    return IncidentSummary(
        total=len(rows),
        by_severity=_count_by_severity(rows),
        likely_duplicates=duplicates,
        needs_human_review=has_unresolved_critical,
    )

@mcp.tool
def propose_incident_resolution(incident_id: str) -> dict:
    """
    Propose (but do not apply) a resolution for an incident based on
    similar past incidents and their resolutions.
    This tool is read-only by design. It never mutates incident state.
    Use apply_resolution() separately, which requires explicit human
    confirmation, to actually close anything out.
    """
    incident = db.query_one("SELECT * FROM incidents WHERE id = ?", incident_id)
    similar = _find_similar_resolved(incident, limit=3)
    return {
        "incident_id": incident_id,
        "proposed_resolution": _summarize_resolution_pattern(similar),
        "based_on": [s["id"] for s in similar],
        "confidence": _resolution_confidence(similar),
    }

@mcp.tool
def apply_resolution(incident_id: str, resolution_note: str, confirmed_by_human: bool) -> dict:
    """
    Apply a resolution to an incident. Requires confirmed_by_human=True.
    This is the only tool in this server that mutates incident state.
    It exists separately from propose_incident_resolution() so that an
    agent can never go from "read" to "write" in a single unsupervised call.
    """
    if not confirmed_by_human:
        raise ValueError(
            "apply_resolution requires confirmed_by_human=True. "
            "Use propose_incident_resolution() first and surface it to a person."
        )
    return db.execute(
        "UPDATE incidents SET status='resolved', resolution=? WHERE id=?",
        resolution_note, incident_id
    )
A few things changed, deliberately:
Domain logic moved into the tool, not the prompt. "What counts as a likely duplicate" is a judgment call specific to this domain — clustering by error signature within a time window. Encoding it in the tool means every caller gets the same correct answer instead of the model re-deriving (or guessing at) the logic every time.
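The post leaves the clustering helper undefined, so here is one possible sketch of the idea, not the actual implementation behind `_cluster_by_signature`. It assumes each row is a dict with `id`, `created_at` (a `datetime`), and a hypothetical `error_signature` field, and it flags any incident that repeats a signature within the window of the previous occurrence.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cluster_by_signature(rows: list[dict], window_minutes: int = 10) -> list[str]:
    """Return IDs of incidents that repeat an error signature within the window.

    The earliest incident in each burst is treated as the original; each later
    one is flagged if it arrives within `window_minutes` of the one before it.
    """
    window = timedelta(minutes=window_minutes)

    # Group rows by signature so only like-for-like incidents are compared.
    by_signature: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_signature[row["error_signature"]].append(row)

    duplicates: list[str] = []
    for group in by_signature.values():
        group.sort(key=lambda r: r["created_at"])
        # Compare each incident with the previous one of the same signature.
        for previous, current in zip(group, group[1:]):
            if current["created_at"] - previous["created_at"] <= window:
                duplicates.append(current["id"])
    return duplicates
```

Comparing against the previous occurrence rather than the first means a slow drip of the same error every few minutes is flagged as one long burst. Whether that is the right call is exactly the kind of domain judgment that belongs in the tool.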
Read and write are different tools. propose_incident_resolution can never mutate anything — it's structurally read-only. apply_resolution is the single narrow path to a state change, and it refuses to run without an explicit human-confirmation flag. This is a simplified version of the human-in-the-loop pattern Pinterest's writeup describes for sensitive operations.
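One caveat worth sketching: a boolean argument is filled in by the same agent it is meant to constrain, so it works as a speed bump rather than as enforcement. A stricter variant, shown here under stated assumptions, has a person issue a one-time approval token through a channel the agent cannot call. `ApprovalStore` and `apply_resolution_checked` are hypothetical names, and the in-memory dict stands in for wherever approvals would really be recorded.

```python
import secrets


class ApprovalStore:
    """In-memory stand-in for wherever human approvals are recorded."""

    def __init__(self) -> None:
        self._pending: dict[str, str] = {}  # token -> incident_id

    def approve(self, incident_id: str) -> str:
        """Issue a one-time token. Meant to be called from a human-facing UI,
        never exposed as an agent tool."""
        token = secrets.token_urlsafe(16)
        self._pending[token] = incident_id
        return token

    def consume(self, token: str, incident_id: str) -> bool:
        """Return True exactly once if the token was issued for this incident."""
        if self._pending.get(token) != incident_id:
            return False
        del self._pending[token]
        return True


def apply_resolution_checked(
    store: ApprovalStore, incident_id: str, resolution_note: str, approval_token: str
) -> dict:
    """Apply a resolution only when a matching human approval exists."""
    if not store.consume(approval_token, incident_id):
        raise PermissionError("No valid human approval for this incident.")
    # A real server would run the UPDATE here; this sketch only echoes the intent.
    return {"incident_id": incident_id, "status": "resolved", "resolution": resolution_note}
```

The agent can still call the write tool, but it cannot manufacture the token, and a token approved for one incident cannot be replayed against another.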
Return types are structured, not raw rows. The IncidentSummary Pydantic model gives the agent (and you) a typed, predictable contract — and the needs_human_review flag does some of the model's reasoning for it, rather than hoping it infers urgency correctly from a list of dicts.
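As a rough illustration of what feeds that contract, here is a stdlib-only sketch of the two derived fields. These are assumptions about what `_count_by_severity` and the review flag might do, not the author's helpers; rows are taken to be dicts with `severity` and `status` keys, as in the tool code.

```python
from collections import Counter

# Lowest to highest, matching the Severity enum values used in the post.
SEVERITIES = ("low", "medium", "high", "critical")


def count_by_severity(rows: list[dict]) -> dict[str, int]:
    """Count incidents per severity, always returning every level (zero-filled)."""
    counts = Counter(row["severity"] for row in rows)
    return {level: counts.get(level, 0) for level in SEVERITIES}


def needs_human_review(rows: list[dict]) -> bool:
    """True if any critical incident is not yet resolved."""
    return any(
        row["severity"] == "critical" and row["status"] != "resolved" for row in rows
    )
```

Zero-filling every level keeps the shape of by_severity identical from call to call, so the agent never has to reason about a missing key.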
Three tools, not three-times-N. Instead of exposing every table operation, this server exposes three tools that map to how a human on-call engineer actually thinks: "what's going on," "what would I do about it," "okay, do that." Fewer tools in context, each carrying more domain weight.
Why this matters more than the protocol choice
MCP itself is becoming infrastructure — it went from an internal Anthropic spec to a Linux Foundation standard backed by OpenAI, Google, and Microsoft in about 16 months. (source) That convergence means the protocol layer is rapidly commoditizing. Every team will have access to the same transport, the same SDKs, the same client support.
What won't commoditize is tool design. The difference between a server that's safe to hand to an autonomous agent and one that isn't comes down to exactly the choices above — what's read-only, what requires confirmation, what domain logic gets encoded once instead of re-derived every call.
If you're building an MCP server right now, the protocol decisions are mostly made for you. The judgment calls aren't.
Quick start, if you want to try this yourself
uv add fastmcp
# server.py: minimal runnable version
from fastmcp import FastMCP

mcp = FastMCP("incident-analysis")

@mcp.tool
def ping() -> str:
    """Health check."""
    return "pong"

if __name__ == "__main__":
    mcp.run()
fastmcp dev server.py
That spins up the MCP Inspector so you can see raw tool calls and responses before wiring it into Claude Desktop, Cursor, or any other MCP client.
This is a companion piece to a short series I've been writing on LinkedIn about what's actually missing in how organizations adopt AI — from training-time limitations to why "owning" your AI capabilities means doing the unglamorous design work, not just picking a model.