Recognizing the sins is not enough. The hard part starts when a team decides to remove them from a real system that already has users, deadlines, demos, and internal dependencies. That is when the road gets real. The broad tool already has three downstream consumers. The noisy response format already shaped prompt logic in another service. The clever abstraction already has defenders because it reduced boilerplate six months ago. By the time a team agrees that the system needs cleanup, the sins are usually no longer isolated bugs. They are part of how the organization has learned to work.
That is why remediation is not just a technical exercise. It is also a cultural one. Greed and Lust force awkward conversations about who actually owns production access. Sloth and Wrath surface whether the team has been treating observability and failure design as first-class engineering work or as cleanup for later. Pride is especially political because the fix often means deleting an internal abstraction that somebody worked hard to build. Even Envy and Gluttony, which can sound cosmetic at first, turn into debates about platform scope, tool ownership, and whether every integration really deserves to be exposed to the model.
There is another gotcha waiting on this road: fixing one sin often reveals two more. Tighten permissions, and suddenly you discover that the happy path depended on overbroad access. Add typed errors, and now everyone can see how many failure cases were previously collapsed into "request failed". Bound retries, and you learn that parts of the product were leaning on brute-force reconnects to appear reliable. Shrink tool responses, and prompt flows that depended on oversized payloads start to break. This is normal. A healthy refactor does not immediately make the system look cleaner. It makes hidden coupling visible.
By this point, the category pattern should be clear. Security sins set blast radius. Operational sins determine whether the system fails truthfully and under control. Design sins determine how expensive, ambiguous, and hard to maintain the whole product becomes over time.
Good MCP habits survive only when they are baked into review, testing, observability, and design defaults. This final part is about turning recognition into a habit, and that habit into an engineering structure strong enough to survive delivery pressure. That is also why the useful question is not whether everything should become MCP. Some local integrations are fine as direct APIs or CLIs. MCP earns its keep when tools, prompts, and resources need to be delivered consistently across clients, or when a team wants shared authorization, policy, and observability around model-facing capabilities, whether they are deployed over stdio or over Streamable HTTP with HTTP-based authorization.
A simple refactoring order
If you are retrofitting an existing MCP server, fix the sins in this order:
- Lust: reduce the blast radius of the shell, database, and filesystem first.
- Greed: narrow scopes, credentials, and access boundaries.
- Sloth: make failures specific and observable.
- Wrath: stop runaway retries and reconnect storms.
- Gluttony: shrink payloads and bound work.
- Pride: simplify abstractions once the system is stable enough to see clearly.
- Envy: reduce tool sprawl and naming ambiguity once you know which tools actually earn their place.
This order prioritizes risk reduction, then reliability, and finally maintainability.
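If you track audit findings in code, the same ordering can be applied mechanically. A minimal sketch, with the caveat that the `SIN_ORDER` array and `Finding` shape are hypothetical names invented for illustration:

```typescript
// Hypothetical remediation ordering from the list above:
// security first, then reliability, then maintainability.
const SIN_ORDER = [
  "lust", "greed",              // security: blast radius, access
  "sloth", "wrath",             // reliability: failures, retries
  "gluttony", "pride", "envy",  // cost and maintainability
] as const;

type Sin = (typeof SIN_ORDER)[number];

interface Finding {
  sin: Sin;
  capability: string;
}

// Sort audit findings so the riskiest fixes come first.
function byRemediationPriority(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) => SIN_ORDER.indexOf(a.sin) - SIN_ORDER.indexOf(b.sin)
  );
}

const ordered = byRemediationPriority([
  { sin: "pride", capability: "internal.dsl" },
  { sin: "lust", capability: "shell.exec" },
  { sin: "wrath", capability: "queue.retry" },
]);
console.log(ordered.map((f) => f.sin)); // → [ 'lust', 'wrath', 'pride' ]
```

The point is not the data structure; it is that "fix Lust before Pride" becomes a sortable property of a backlog rather than a judgment call repeated in every planning meeting.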
Choose MCP on purpose
The goal is not to turn every useful integration into an MCP server. If the problem is local, single-user, and already well served by a direct CLI or API, MCP may add more surface area than value. The point of this series is not that every interface should become MCP. It is that once you do expose a model-facing capability through MCP, the engineering bar changes.
That choice also depends on the shape of the deployment. Local stdio MCP and remote Streamable HTTP MCP are both legitimate, but they create different failure modes and different benefits. Stdio keeps the loop close to one machine and one user, which can be ideal for local tooling. Remote MCP earns its complexity when a team needs centralized authorization, shared policy, consistent prompts and resources, and telemetry across many clients. In other words: choose MCP when you need a shared protocol surface, not just because you have a function you can wrap.
The gateway fallacy
One modern version of the problem is easy to miss because it arrives wearing governance language. API gateways, integration platforms, and automation vendors can now add MCP as a new exposure layer over an existing API estate. They often bring genuinely useful controls with them: authentication, authorization, throttling, proxying, logging, and catalog management. Those are real improvements around the edge.
But that edge is not the capability. If the backend API is over-scoped, noisy, ambiguous, non-idempotent, or structurally unsafe for model use, MCP does not redeem it. A gateway can make a flawed capability easier to ship, easier to discover, and easier to govern. It cannot make the capability well designed. This is the same mistake teams made in the early API gateway era: treating edge features as if they could upgrade a weak contract into a strong one.
Take a concrete example. Imagine a customer API that returns the full account object, billing profile, internal notes, and support history from a single GET /customer/:id endpoint, and fails with a generic 500 whenever an upstream dependency times out. Put that API behind a gateway with OAuth, rate limits, and audit logs, then expose it through MCP as get_customer. You may have improved access control at the edge, but you still shipped Greed through over-broad access, Gluttony through oversized responses, and Sloth through ambiguous failures. The gateway made the capability easier to govern. It did not make the capability good.
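The difference can be sketched in types. Everything below is illustrative (none of these names come from a real API): the gateway passes the backend object through unchanged, while a well-designed capability narrows it to task-relevant fields behind the gateway.

```typescript
// What the gateway exposes: the full backend object, passed through as-is.
// Field names are hypothetical, for illustration only.
interface RawCustomer {
  name: string;
  plan: string;
  account: unknown;          // everything on the account
  billingProfile: unknown;   // payment details the model rarely needs
  internalNotes: string[];   // should never reach a model
  supportHistory: unknown[];
}

// What a well-scoped tool returns: only the fields the task needs.
interface CustomerSummary {
  id: string;
  name: string;
  plan: string;
  openTicketCount: number;
}

// The narrowing happens behind the gateway, in the capability itself,
// not in edge configuration.
function toSummary(id: string, raw: RawCustomer): CustomerSummary {
  return {
    id,
    name: raw.name,
    plan: raw.plan,
    openTicketCount: raw.supportHistory.length,
  };
}
```

No amount of OAuth or rate limiting at the edge turns the first shape into the second; that transformation is capability design, and it has to live behind the gateway.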
That is why the review question has to start behind the gateway, not at the gateway. What exactly is this capability allowed to do? How narrow is its contract? How truthful are its failures? How much data does it return? Can a model distinguish it from neighboring capabilities? If those answers are weak, adding MCP support only increases the chance that the weakness becomes systemic.
Make the ownership model explicit
This is the organizational part that teams often skip. Someone always owns the underlying system, but that does not mean anyone clearly owns the model-facing capability. Those are not the same thing. For each exposed capability, decide who owns the blast radius, who approves new scopes or mutations, who reviews prompt and resource changes for drift, and who has authority to retire a capability when it no longer earns its place.
Without that map, remediation turns into a coordination problem instead of an engineering one. Security fixes stall because nobody owns approval flow. Operational fixes stall because logging and retries live in a different layer from the tool owner. Design fixes stall because nobody feels responsible for curating the catalog. If the ownership model is vague, the sins come back through org seams even after the code improves.
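One lightweight way to make that map concrete is a small ownership registry checked alongside the code. This is a sketch with assumed team names and fields, not a prescription:

```typescript
// Hypothetical ownership record for a model-facing capability.
// Team names and field choices are illustrative.
interface CapabilityOwnership {
  owner: string;          // who owns the blast radius
  scopeApprover: string;  // who signs off on new scopes or mutations
  driftReviewer: string;  // who reviews prompt/resource changes
  canRetire: string;      // who may remove the capability
}

const ownership: Record<string, CapabilityOwnership> = {
  "github.create_issue": {
    owner: "dev-tools",
    scopeApprover: "security",
    driftReviewer: "dev-tools",
    canRetire: "platform",
  },
};

// A cheap guard: refuse to register anything without an owner.
function assertOwned(name: string): CapabilityOwnership {
  const entry = ownership[name];
  if (!entry) throw new Error(`no ownership record for ${name}`);
  return entry;
}
```

The registry itself is trivial; the value is that "who approves this scope?" becomes a lookup instead of an argument.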
Turning this into engineering defaults
Everything discussed in this series is useful knowledge, but knowledge alone does not survive deadlines. The only version that lasts is the one that becomes ownership, code, tests, dashboards, and release gates. Use the table below as a review card, then turn it into something your stack can enforce.
| Sin | Review Question | Key Metric |
|---|---|---|
| Lust | Can this capability execute commands, write to production, or mutate state without confirmation? | Count of unguarded mutation capabilities |
| Greed | Does this capability request more access than the task requires? | Scope breadth per credential set |
| Sloth | Does every failure path return a specific, actionable error? | Negative test coverage per capability |
| Wrath | Are retries bounded, backed off, and cancellable? | Retry storm frequency in dashboards |
| Gluttony | Is the response or injected context as small as possible for the task? | P95 payload size and token cost per capability |
| Pride | Can a new team member debug this capability without understanding the framework? | Time-to-resolution for contract bugs |
| Envy | Can the model reliably distinguish this capability from every other one? | Usage distribution and error-selection rate |
A minimum technical baseline
In practice, a production MCP server should have four technical defaults: explicit capability policy, contract tests, structured telemetry, and a release gate. If one of those is missing, one of the sins usually finds its way back in. The exact implementation will vary by stack. One team might encode these controls in SDK middleware, another in gateway policy, another in code generation and CI. The point is not the syntax. The point is that the controls exist in a concrete, enforceable form.
1. Make the capability contract explicit
Do not leave the dangerous parts of a capability buried inside handler code or templates. For each tool, declare whether it mutates state, what scopes it needs, whether confirmation is required, how long it may run, whether it may retry, and how large a response it is allowed to return. The same discipline applies to prompts and resources too: make exposure, ownership, and output budget explicit rather than implicit.
```typescript
type ToolPolicy = {
  name: string;
  mutates: boolean;
  confirmationRequired: boolean;
  idempotencyKeyRequired: boolean;
  scopes: string[];
  timeoutMs: number;
  retry: {
    maxAttempts: number;
    strategy: "none" | "exponential-jitter";
    retryableErrors: string[];
  };
  outputBudget: {
    maxItems: number;
    maxBytes: number;
    redactFields: string[];
  };
};

const toolPolicies: Record<string, ToolPolicy> = {
  "github.create_issue": {
    name: "github.create_issue",
    mutates: true,
    confirmationRequired: true,
    idempotencyKeyRequired: true,
    scopes: ["issues:write"],
    timeoutMs: 10_000,
    retry: {
      maxAttempts: 0,
      strategy: "none",
      retryableErrors: []
    },
    outputBudget: {
      maxItems: 1,
      maxBytes: 4_096,
      redactFields: ["body_html"]
    }
  },
  "docs.search_runbooks": {
    name: "docs.search_runbooks",
    mutates: false,
    confirmationRequired: false,
    idempotencyKeyRequired: false,
    scopes: ["docs:read"],
    timeoutMs: 3_000,
    retry: {
      maxAttempts: 2,
      strategy: "exponential-jitter",
      retryableErrors: ["timeout", "temporarily_unavailable"]
    },
    outputBudget: {
      maxItems: 5,
      maxBytes: 8_192,
      redactFields: []
    }
  }
};

const promptPolicies = {
  "incident_triage": {
    scopes: ["incidents:read", "runbooks:read"],
    owner: "sre-platform",
    visibility: "oncall-only",
    maxTemplateBytes: 4_096,
  },
};

const resourcePolicies = {
  "support://policies/refunds/annual": {
    scopes: ["support:read"],
    owner: "support-platform",
    visibility: "support-agents",
    maxBytes: 4_096,
  },
};
```
This is where Lust, Greed, Wrath, and Gluttony stop being opinions and become machine-readable constraints. Tools are the clearest example, but the same idea applies to prompts, resources, and transport-facing policy. If a capability cannot be described this plainly, the design is probably already drifting toward Pride.
For mutation tools, idempotency deserves one extra note. It can still be worth requiring an idempotency key even when automatic retries are disabled, because humans, agents, and job runners all resubmit requests above the transport layer. The CI gate below enforces the narrower minimum: if automatic retries are enabled for a mutation, idempotency protection becomes mandatory.
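To show how a declared policy can be enforced rather than merely documented, here is a hedged sketch of a call-time wrapper. The `enforce` function, the trimmed `EnforcedPolicy` type, and the error codes are illustrative assumptions, not SDK API:

```typescript
// Hypothetical call-time enforcement wrapper. `EnforcedPolicy` is a
// trimmed subset of the full tool policy; names are illustrative.
type EnforcedPolicy = {
  mutates: boolean;
  confirmationRequired: boolean;
  timeoutMs: number;
  outputBudget: { maxBytes: number };
};

type ToolResult = { isError?: boolean; code?: string; [k: string]: unknown };

async function enforce(
  policy: EnforcedPolicy,
  args: { confirmed?: boolean },
  handler: () => Promise<ToolResult>
): Promise<ToolResult> {
  // Mutations must be explicitly confirmed before the handler runs.
  if (policy.mutates && policy.confirmationRequired && !args.confirmed) {
    return { isError: true, code: "confirmation_required" };
  }

  // Bound execution time with the declared timeout.
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ToolResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ isError: true, code: "timeout" }),
      policy.timeoutMs
    );
  });
  const result = await Promise.race([handler(), timeout]);
  if (timer !== undefined) clearTimeout(timer);

  // Enforce the output budget instead of trusting the handler.
  const bytes = Buffer.byteLength(JSON.stringify(result), "utf8");
  if (bytes > policy.outputBudget.maxBytes) {
    return { isError: true, code: "output_budget_exceeded" };
  }
  return result;
}
```

A wrapper like this keeps the Lust check (unconfirmed mutation), the Wrath check (unbounded execution), and the Gluttony check (oversized output) in one place instead of scattered through handler code.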
2. Test the dangerous paths on purpose
Happy-path demos do not catch the sins that matter. The tests that earn their keep are the ones that assert the server's actual capability contract directly: typed errors, confirmation requirements, output limits, and retry boundaries. They should exist for prompts and resources too, especially when prompt templates can drift, resources can exceed budget, or visibility rules can change silently.
```typescript
describe("docs.search_runbooks", () => {
  it("returns a typed not_found error", async () => {
    const result = await callTool("docs.search_runbooks", {
      query: "runbook-that-does-not-exist"
    });
    expect(result).toMatchObject({
      isError: true,
      code: "not_found",
      retryable: false
    });
  });

  it("caps result count and payload size", async () => {
    const result = await callTool("docs.search_runbooks", {
      query: "deploy"
    });
    expect(result.items).toHaveLength(5);
    expect(
      Buffer.byteLength(JSON.stringify(result), "utf8")
    ).toBeLessThanOrEqual(8_192);
  });
});

describe("github.create_issue", () => {
  it("rejects mutation without confirmation", async () => {
    const result = await callTool("github.create_issue", {
      repo: "org/service",
      title: "broken deploy",
      confirmed: false
    });
    expect(result).toMatchObject({
      isError: true,
      code: "confirmation_required"
    });
  });
});

describe("incident_triage prompt", () => {
  it("stays within the approved template budget", async () => {
    const prompt = await getPrompt("incident_triage", {
      service: "billing",
      symptom: "502s"
    });
    expect(
      Buffer.byteLength(JSON.stringify(prompt), "utf8")
    ).toBeLessThanOrEqual(4_096);
  });
});
```
These are the tests that catch the kinds of misleading errors, state leakage, and contract drift seen in modelcontextprotocol/typescript-sdk #699, modelcontextprotocol/python-sdk #756, and modelcontextprotocol/typescript-sdk #451 before users do. A server with only happy-path coverage is usually one incident away from rediscovering Sloth, Wrath, or Pride the hard way.
3. Emit telemetry in a shape that operations can aggregate
Per-capability logs with different field names are not observability. They are anecdotes. Emit the same event shape for every call, then build dashboards around latency, payload size, retries, and failure codes.
```json
{
  "timestamp": "2026-03-19T14:22:31.018Z",
  "capability_type": "tool",
  "capability": "docs.search_runbooks",
  "server": "docs",
  "request_id": "req_7f3d",
  "result": "ok",
  "duration_ms": 183,
  "input_bytes": 126,
  "output_bytes": 1840,
  "retry_count": 0,
  "error_code": null,
  "mutates": false
}
```
With an event like this, the first operational views become obvious: top capabilities by call volume, p95 latency by capability, p95 output bytes, retry counts, error-code frequency, and which mutation tools are being invoked most often. Add the same discipline to prompts and resources by capturing capability type, size, and visibility context there too. MCP Debugger is useful during development because it makes interactive diagnosis easier, but production safety still depends on telemetry that can be aggregated, alerted on, and reviewed over time.
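A small helper can make the uniform shape hard to bypass. This is a sketch under assumptions: the `emitEvent` helper and its signature are invented here, and only the field names from the example above are taken from the text.

```typescript
// One event shape for every capability call, so dashboards can
// aggregate across servers without per-tool field mapping.
interface CapabilityEvent {
  timestamp: string;
  capability_type: "tool" | "prompt" | "resource";
  capability: string;
  server: string;
  request_id: string;
  result: "ok" | "error";
  duration_ms: number;
  input_bytes: number;
  output_bytes: number;
  retry_count: number;
  error_code: string | null;
  mutates: boolean;
}

// Hypothetical helper: stamps the timestamp and writes one JSON line
// per call. The sink defaults to stdout but is injectable for tests.
function emitEvent(
  partial: Omit<CapabilityEvent, "timestamp">,
  sink: (line: string) => void = console.log
): CapabilityEvent {
  const event: CapabilityEvent = {
    timestamp: new Date().toISOString(),
    ...partial,
  };
  sink(JSON.stringify(event));
  return event;
}
```

Because the type requires every field, a tool author cannot quietly drop `retry_count` or rename `error_code`; the compiler enforces the schema that the dashboards depend on.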
4. Add a release gate before new tools ship
The safest MCP teams make it harder to add a sloppy capability than to add a boring, well-bounded one. That means validating metadata and registry shape in CI, rather than relying on reviewers to catch everything by eye.
```typescript
for (const tool of registry.list()) {
  const policy = toolPolicies[tool.name];
  assert(policy, `missing policy for ${tool.name}`);
  assert(tool.inputSchema, `missing input schema for ${tool.name}`);
  assert(tool.outputSchema, `missing output schema for ${tool.name}`);

  if (policy.mutates) {
    assert(
      policy.confirmationRequired,
      `${tool.name} mutates state without confirmation`
    );
    assert(
      policy.retry.maxAttempts === 0 || policy.idempotencyKeyRequired,
      `${tool.name} retries mutation without idempotency protection`
    );
  }
}
```
Run the same kind of gate over prompt and resource policy registries too: every exposed prompt or resource should have an owner, visibility rule, scope policy, and size budget before it ships.
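As a sketch of what that prompt and resource gate might assert, here is one possible check. The `ExposurePolicy` shape mirrors the hypothetical policy registries earlier in this section; nothing here is SDK API.

```typescript
// Hypothetical exposure policy for a prompt or resource: who owns it,
// who can see it, what scopes gate it, and how large it may be.
interface ExposurePolicy {
  owner?: string;
  visibility?: string;
  scopes?: string[];
  maxBytes?: number;          // for resources
  maxTemplateBytes?: number;  // for prompt templates
}

// Returns a list of gate failures; an empty list means the
// capability may ship.
function checkExposure(
  name: string,
  policy: ExposurePolicy | undefined
): string[] {
  if (!policy) return [`${name}: missing policy`];
  const failures: string[] = [];
  if (!policy.owner) failures.push(`${name}: no owner`);
  if (!policy.visibility) failures.push(`${name}: no visibility rule`);
  if (!policy.scopes || policy.scopes.length === 0) {
    failures.push(`${name}: no scope policy`);
  }
  if (!(policy.maxBytes ?? policy.maxTemplateBytes)) {
    failures.push(`${name}: no size budget`);
  }
  return failures;
}
```

Returning a list of failures rather than throwing on the first one matters in CI: a single run reports every ungoverned prompt and resource, instead of forcing one fix-and-rerun cycle per gap.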
The same gate can fail the build when:
- two tools collide in naming or become too similar to be distinguished cleanly, as in openai/openai-agents-python #464
- a tool exceeds its payload budget or latency budget in contract tests
- a schema changes without an approved snapshot update
- a prompt or resource is exposed without clear ownership, auth review, or budget limits
- a tool catalog grows without pagination, filtering, or explicit enablement, as in modelcontextprotocol/java-sdk #615
That is how Envy and Gluttony stop being abstract warnings and become concrete failures.
Summary
The series only becomes useful when it changes how a team ships. A healthy MCP server has narrow scopes, explicit tool policy, typed failures, bounded retries, small outputs, curated catalogs, and telemetry that makes bad behavior visible. None of those properties should depend on memory or heroics. They should be encoded in the server, asserted in tests, and checked in CI.
MCP is powerful precisely because it makes integration easy. That convenience is real, but it also means the protocol surface becomes part of the product: tools, prompts, resources, transports, and the boundary conditions around them. If that surface is broad, noisy, ambiguous, or hard to reason about, start there. That is usually where one of these sins has already started to settle into the design.
That is the real road to redemption. Not better intentions, but better defaults.