DEV Community

Ricardo Ferreira

The Seven Deadly Sins of MCP: Design Sins

This part of the series focuses on the design sins: Gluttony, Pride, and Envy. They belong in this category because they shape the day-to-day quality of the system itself: how much it carries, how clearly its contracts are exposed, and how easy it is for both humans and models to reason about what it can do.

Many MCP systems do not fail first as security disasters. They fail because they become expensive, crowded, and hard to reason about. One tool returns far too much data. Another disappears behind a clever abstraction. A third is added because it looks impressive next to the others. None of that feels catastrophic in the moment. It just makes the system slower, noisier, and harder to trust over time.

Gluttony, pride, and envy are design sins because all three involve building more than the task requires. Too much data. Too much abstraction. Too many tools, prompts, and resources. They may not create the highest blast radius on day one, but they compound into systems that are costly, confusing, and difficult to maintain.

That design pressure in MCP is not limited to function handlers. It shapes the full model-facing surface: the tool catalog, the prompts users can invoke, the resources clients can browse, and the amount of protocol surface area a host application has to reason about. A server can be perfectly secure and mostly reliable and still be exhausting to use because the model-facing surface is bloated, clever, or crowded.

Gluttony

Gluttony is giving the model or server more data, context, work, or runtime than the task requires.

How to spot it

  • Tool responses are much larger than they need to be, especially when they include raw documents, embeddings, or internal metadata.
  • Prompt templates or resources quietly inject far more context than the task needs.
  • Token usage or latency spikes after what should have been a simple tool call.
  • The model spends more time digesting data than acting on it.
  • A single tool quietly starts dominating cost, context, or response time.

Example

A common version of this happens in support and on-call workflows. An agent asks the assistant for the refund policy on annual plans while a customer is waiting in chat, or an engineer asks for the runbook for repeated 502s from the billing service. In both cases, the useful response is a short list of the best matches with enough context to choose the right one. What often comes back instead is the entire retrieval payload: full document text, embeddings, chunk offsets, and internal metadata before anyone has even selected a result. The same thing happens when a prompt template eagerly injects three full runbooks "just in case," or when a resource endpoint returns the entire policy corpus instead of one relevant document.

Before

```python
@server.tool("search_docs")
async def search_docs(query: str):
    matches = await vector_store.search(query, limit=20)

    return {
        "matches": [
            {
                "id": doc.id,
                "title": doc.title,
                "score": doc.score,
                "text": doc.full_text,
                "embedding": doc.embedding,
                "metadata": doc.metadata,
                "chunk_offsets": doc.chunk_offsets,
            }
            for doc in matches
        ]
    }
```

After

```python
@server.tool("search_docs")
async def search_docs(query: str, limit: int = 5):
    matches = await vector_store.search(query, limit=limit + 1)
    visible_matches = matches[:limit]

    return {
        "matches": [
            {
                "id": doc.id,
                "title": doc.title,
                "score": round(doc.score, 3),
                "excerpt": doc.match_snippet,
            }
            for doc in visible_matches
        ],
        "truncated": len(matches) > limit,
    }
```

The same principle applies to prompts and resources: pass only the context the next step actually needs.

```typescript
prompt("incident_triage", async ({ service, symptom }) => {
  return {
    messages: [
      {
        role: "user",
        content: {
          type: "text",
          text: `Use only the matched runbook excerpts for ${service} and ${symptom}. If they are insufficient, ask for a follow-up resource fetch.`
        }
      }
    ]
  };
});
```

How to fix it

The first step is to separate discovery from retrieval. Search tools should help the model decide what to look at next, not dump every possible field into the first response. That usually means returning small, decision-oriented results up front and forcing a second call for the full record when the model genuinely needs it. In most cases, that first response should contain short excerpts or matched snippets, not a fresh server-side summary of the whole document.
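The discovery/retrieval split can be sketched with an in-memory stand-in for the document store. The store contents, tool names, and fields here are illustrative, not part of any real MCP SDK:

```python
# A minimal sketch of separating discovery from retrieval.
# DOCS stands in for a real document store; everything here is hypothetical.
DOCS = {
    "refund-annual": {"title": "Refund policy: annual plans", "text": "Full policy text ..."},
    "runbook-502": {"title": "Runbook: repeated 502s", "text": "Full runbook text ..."},
}

def search_docs(query: str, limit: int = 5) -> dict:
    """Discovery: return just enough to choose a document, never the full text."""
    hits = [
        {"id": doc_id, "title": doc["title"], "excerpt": doc["text"][:80]}
        for doc_id, doc in DOCS.items()
        if query.lower() in doc["title"].lower()
    ]
    return {"matches": hits[:limit], "truncated": len(hits) > limit}

def get_doc(doc_id: str) -> dict:
    """Retrieval: the full record comes back only on an explicit second call."""
    doc = DOCS[doc_id]
    return {"id": doc_id, "title": doc["title"], "text": doc["text"]}
```

The model pays for the full text only when it has already decided which document it needs, which is the point of the split.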

From there, treat output size as an engineering constraint rather than a style preference. Put hard caps on text length, result count, and serialized payload size. Measure token counts, output bytes, and latency per tool so you can see when a "helpful" change quietly turns into an expensive one. The same instinct applies to resources and prompt templates too: if they force a client to ship or inject far more context than the task needs, the design is still gluttonous even when no single tool response looks outrageous. It is also worth adding regression tests for payload size and reviewing schemas to remove fields that do not belong in the first response at all, especially embeddings, raw blobs, traces, and internal metadata.
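One way to make the cap an engineering constraint rather than a style preference is to measure the serialized payload and fail loudly when it is exceeded. The cap value and function name below are assumptions for illustration, not part of any MCP SDK:

```python
import json

# Hypothetical cap; tune per tool based on measured token and byte budgets.
MAX_RESPONSE_BYTES = 8_192

def enforce_payload_cap(payload: dict, cap: int = MAX_RESPONSE_BYTES) -> dict:
    """Serialize once, measure the real wire size, and fail loudly
    instead of quietly shipping a bloated response."""
    size = len(json.dumps(payload).encode("utf-8"))
    if size > cap:
        raise ValueError(f"tool response is {size} bytes, over the {cap}-byte cap")
    return payload
```

Wrapping tool returns in a check like this also gives you a natural place to log size metrics and a cheap regression test: assert the cap holds for representative queries.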

Lessons from the trenches

The MCP fault taxonomy paper cites a Graphiti MCP issue in which returning full embedding vectors increased output from approximately 5K tokens to more than 250K tokens, reportedly driving a 50x increase in cost. The pattern also appears in runtime excess issues like modelcontextprotocol/python-sdk #756, where stateless Streamable HTTP leaked tasks across requests, a form of runtime gluttony even when the payload itself was not large.

Pride

Pride is trusting cleverness and abstraction over simplicity and correctness.

How to spot it

  • Bugs disappear when you bypass the abstraction and talk to the protocol directly.
  • The team has a custom wrapper layer that only one or two people can confidently explain.
  • Schema, prompt-shaping, or resource-format issues keep popping up in places that should be boring.
  • You are debugging framework magic rather than tool behavior.

Example

This usually happens right after the first few tools ship successfully. The team gets tired of repeating schemas and handlers, so they build a wrapper to "standardize" everything. Suddenly, a straightforward tool like create_ticket or get_customer depends on hidden normalization and formatting logic. The same instinct often spreads further: prompt assembly and resource shaping also disappear behind helpers that silently rewrite or decorate the real contract. The user still expects a boring, reliable capability. Only the implementation has become clever.

Before

```typescript
class SmartToolServer extends McpServer {
  registerTool(definition: ToolDefinition) {
    const wrappedHandler = async (input: any) => {
      const normalizedInput = this.autoNormalize(definition, input);
      const result = await definition.handler(normalizedInput);
      return this.autoFormat(definition, result);
    };

    super.registerTool(definition.name, wrappedHandler);
  }
}
```

After

```typescript
server.tool(
  "create_ticket",
  {
    title: "Create Support Ticket",
    inputSchema: {
      type: "object",
      properties: {
        title: { type: "string" },
        priority: { type: "string", enum: ["low", "medium", "high"] },
      },
      required: ["title", "priority"],
      additionalProperties: false,
    },
  },
  async ({ title, priority }) => {
    return await ticketing.create({ title, priority });
  }
);
```

How to fix it

The healthiest fix is usually subtraction. Before adding another wrapper or helper, ask whether the existing abstraction is buying you anything other than distance from the protocol. Pride fixes often start by deleting a smart layer, moving the contract back into plain sight, and making the boring path the default again.

Once the abstraction is thinner, protect the real contract directly. Add integration tests that exercise the schema, the arguments, and the returned structure exactly as a peer sees them. Do the same for prompt templates and resources when they are part of the exposed surface: test the visible shape, not just the helper that assembled it. If you generate contracts, keep compatibility tests for protocol and schema versions, and document the simplest supported pattern in your internal templates so engineers are encouraged to reuse something plain rather than rebuild a framework. A little visible duplication is often far cheaper than hidden magic.
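A contract test over the advertised schema can be sketched without any framework at all. The schema below mirrors the create_ticket example; check_contract is a hypothetical helper, not an SDK function:

```python
# The input schema exactly as a peer would see it when listing tools.
CREATE_TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["title", "priority"],
    "additionalProperties": False,
}

def check_contract(schema: dict) -> list[str]:
    """Return human-readable problems with a tool's advertised input schema."""
    problems = []
    if schema.get("additionalProperties") is not False:
        problems.append("schema silently accepts unknown fields")
    for field in schema.get("required", []):
        if field not in schema.get("properties", {}):
            problems.append(f"required field {field!r} has no declared type")
    return problems
```

Because the test exercises the visible shape rather than the helper that assembled it, it keeps failing honestly even if someone later reintroduces a clever wrapper.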

Lessons from the trenches

This is the engineering lesson behind modelcontextprotocol/typescript-sdk #451, where subclassing McpServer broke tool argument passing, and modelcontextprotocol/typescript-sdk #745, where generated JSON Schema drifted from what newer clients expected.

Envy

Envy is adding capabilities because they look impressive, not because they are necessary. It is the sin of mistaking a bigger catalog for a better product. In MCP, every additional tool, prompt, or resource is not just one more feature. It is one more thing the model has to advertise, distinguish, and sometimes choose incorrectly.

How to spot it

  • Capability lists keep growing, but nobody can explain which ones are actually important.
  • Multiple tools, prompts, or resources overlap, collide in naming, or differ only in tiny ways.
  • The model picks the wrong capability because the catalog is too crowded or ambiguous.
  • Prompt cost rises just from advertising available capabilities.
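That last symptom is measurable. A crude way to see the advertising tax is to estimate the tokens spent just listing the catalog, before any tool is called. The 4-characters-per-token ratio below is a rough heuristic, not a real tokenizer:

```python
import json

def catalog_token_estimate(tool_definitions: list[dict]) -> int:
    """Approximate tokens spent advertising tools, before any call is made.
    Uses a crude 4-chars-per-token heuristic; swap in a real tokenizer for accuracy."""
    return len(json.dumps(tool_definitions)) // 4
```

Tracking this number per release makes catalog growth visible the same way a bundle-size check makes frontend bloat visible.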

Example

This is a familiar phase in internal platform rollouts. One genuinely useful server for GitHub becomes a larger catalog with Jira, Slack, docs, incidents, analytics, and deployments, because every new integration makes a good demo and every internal team wants to be represented. Then the same pattern spreads to prompts and resources: every team adds its own incident template, support playbook, and reference bundle. By the time people rely on it for daily work, the catalog is crowded, names overlap, and simple tasks become harder because the model has too many similar capabilities to choose from. The tax is paid at selection time, not at launch time, which is why teams underestimate it.

Before

```java
List<RemoteServer> servers = List.of(githubServer, jiraServer, slackServer, docsServer);

for (RemoteServer remoteServer : servers) {
    List<ToolDefinition> tools = remoteServer.listTools();
    for (ToolDefinition tool : tools) {
        registry.register(tool.name(), tool);
    }
}
```

After

```java
Set<String> enabledTools = Set.of(
    "github.list_open_issues",
    "github.create_issue",
    "docs.search_runbooks",
    "docs.get_runbook"
);

for (RemoteServer remoteServer : List.of(githubServer, docsServer)) {
    List<ToolDefinition> tools = remoteServer.listTools();

    for (ToolDefinition tool : tools) {
        String namespacedName = remoteServer.name() + "." + tool.name();
        if (enabledTools.contains(namespacedName)) {
            registry.register(namespacedName, tool);
        }
    }
}
```

How to fix it

The fix for envy starts with curation. Treat the capability catalog as product surface area with carrying cost, not as a trophy shelf. Inventory the tools, prompts, and resources and remove the ones that are redundant, rarely used, or too ambiguous to justify their existence. Naming conventions also need to arrive earlier than teams think. Once multiple servers are involved, namespacing and consistent verbs stop being cosmetic and start being survival.

After that, treat discovery as a product design problem instead of a registry problem. Add usage telemetry so retention decisions are based on demand rather than guesswork, and build pagination and filtering into your registry and UI surfaces before the catalog gets large. That review should include prompts and resources, not just tools, because from the host's perspective they are all competing model-facing affordances. If either a model or a human struggles to tell two capabilities apart, the design is not finished yet.
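A retention decision based on demand can be sketched in a few lines. The namespaced tool names reuse the hypothetical catalog from the example above; the threshold is an assumption to tune:

```python
from collections import Counter

def retirement_candidates(enabled: set[str], calls: Counter, min_calls: int = 10) -> list[str]:
    """Enabled capabilities whose measured demand does not justify their carrying cost.
    Tools with zero recorded calls are included, not silently skipped."""
    return sorted(name for name in enabled if calls.get(name, 0) < min_calls)
```

Feeding this from real per-tool call counters turns "does anyone use this?" from a guess into a report, which makes the governance conversation much shorter.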

This is another place where platform convenience can mislead teams. A gateway or integration layer can make it trivially easy to publish one more capability from an existing API catalog, and that speed feels like progress. But publication is not curation. If the underlying API surface is already inconsistent, overlapping, or bloated, an MCP catalog in front of it can amplify the ambiguity rather than organize it.

Lessons from the trenches

Duplicate tool names across MCP servers caused errors in openai/openai-agents-python #464, and broken pagination let very large tool lists come back all at once in modelcontextprotocol/java-sdk #615. Capability sprawl is not a cosmetic problem. It directly affects correctness, prompt cost, and the model's ability to make good choices.

Why design sins are hard to fix

Design sins usually look cheap until you try to remove them. Gluttony fixes often force one broad capability to split into search, detail, and pagination flows, which means more contracts and more observability. Pride fixes are often more political than technical: they usually require deleting a "smart" internal framework and accepting a plainer pattern with a little more visible duplication. Envy fixes usually require governance: someone has to own the capability taxonomy, naming rules, enablement, and retirement, or the catalog will keep growing.

That is why design cleanup often lags behind security and operations work. The system may still appear to function. It just gets slower, noisier, and harder to reason about until teams finally notice the carrying cost.
