The First Malicious MCP Server and What It Taught Us About Trust

#agents #ai #api #architecture

In September 2025, a security team at Koi discovered something that, in hindsight, was inevitable. A package on npm called postmark-mcp a Model Context Protocol server that let AI assistants send email through Postmark, had been quietly turned into a weapon. The story is almost boringly simple, and that is exactly what makes it worth studying.

An engineer copied Postmark’s legitimate open-source MCP server, near line for line, and published it under his own name. The first fifteen versions were clean. They did precisely what they claimed. People installed them, wired them into their agents, and moved on. Then version 1.0.16 shipped. It was identical to the version before it, except for a single line, line 231, that added a blind carbon copy to phan@giftshop[.]club on every outgoing message. From that point on, every email an AI assistant sent through the tool, password resets, invoices, credentials, contracts, was silently forwarded to a stranger. The package was downloaded 1,643 times before anyone noticed.

We keep coming back to this incident when we reason about the problem we work on, because it strips away everything incidental and leaves the core issue exposed. So let us walk through the reasoning the way we actually walked through it.

The trust was placed once, and never checked again

Start with the obvious question: at what moment did the people using postmark-mcp get compromised? It is tempting to say "when they installed 1.0.16." But that is not quite right. They were compromised the moment they decided, at install time, that this tool was safe and then never revisited that decision.

This is the pattern we kept seeing, across incident after incident. Trust is a point-in-time act. You inspect a schema, review a tool, approve a permission, and form a judgment: this is the shape of the thing I am depending on. And then you build on top of that judgment as if it were permanent. But the thing you depended on is not permanent. It is a living artifact that someone else controls, and it can change underneath you at any time, for any reason, with no obligation to tell you.

The first fifteen versions being clean was not reassuring. It was the attack. Fifteen honest versions is precisely how you earn the trust that the sixteenth version spends. The whole exploit lives in the gap between “I evaluated this once” and “this is still what I evaluated.”

So the first idea we arrived at is almost embarrassingly plain: the shape of a dependency is not a fact you establish once. It is a value you have to keep measuring. If trust decays the instant a contract can change, then the only honest thing to do is to treat the contract as something you re-check continuously, not something you approve and forget.

To watch for change, you first have to make the contract a thing you can hold

The next question follows naturally. If we want to notice when a tool or a schema changes, what exactly are we comparing?

You cannot diff a vibe. “This tool felt trustworthy” is not comparable across time. What is comparable is the concrete, structural description of the thing: the fields in a payload and their types, the tools an MCP server advertises, the parameters each tool accepts, the description text the model reads to decide how to use it. These are the promises, and crucially, they can be captured as explicit artifacts snapshots you can store, version, and set side by side.
Write on Medium

This led us to the second idea: before you can guard a contract, you have to render it into something inspectable and stable. Turn the live, shifting surface of an API or a tool catalog into a captured shape. Once you have that, the impossible question, “did I get compromised?” becomes a mechanical one: is today’s shape the same as the shape I approved? The postmark-mcp attack, viewed this way, is not subtle at all. Version 1.0.15 and version 1.0.16 describe the same tools, but their behavior diverged. Even the visible metadata — versions, hashes, tool listings — moved. A system holding yesterday's snapshot next to today's would have had something concrete to point at.

Not all change is attack — so the comparison has to have judgment

Here we hit the complication that makes this a real engineering problem and not just a checksum. Contracts should change. APIs add fields. Tools improve their descriptions. If every difference screams, you have built an alarm that everyone learns to ignore, and an ignored alarm is worse than none.

So the third idea: a useful comparison distinguishes the change that breaks a promise from the change that merely extends it. Adding an optional field is not the same as removing a required one. A new tool appearing is a different risk than an existing tool quietly growing a new parameter, or its description mutating in a way that could steer an agent toward a new, unintended action. The comparison has to classify safe versus breaking, additive versus destructive, so that human attention is spent only where a promise actually broke. This is what turns raw diffing into something you can put in front of a team without exhausting them.

The cheapest place to catch drift is before it ships, but that’s not the only place

Now, where do you run this check? Reasoning it through, there are two distinct moments, and you need both.

The first is at the boundary of your own changes, before code merges. When your integration assumes a certain response shape, the moment to discover the assumption is wrong is at the pull request, not at 2am in production. Catching drift here is the cheap catch: it is a gate you put in the path of change, and it fails the build before the bad assumption ever reaches a user. This is drift you cause, and you can stop it upstream.

But postmark-mcp is the other kind entirely. Nobody on the victim side changed anything. Their code was stable. The drift came from outside, on someone else's release schedule, long after their last merge. No pre-merge gate on earth would have caught it, because there was no merge. This is the fourth idea, and it is the one most tooling misses: some of the most dangerous drift happens in things you depend on but do not control, and it happens continuously, on a clock you don't own. Catching it requires something that keeps polling the live surface, re-fetching the tool catalog, re-reading the advertised shapes, on its own schedule, forever, and raising a hand the moment the shape you approved is no longer the shape being served.

And when it changes, someone has to be told, with the receipt

The final piece is almost anticlimactic but it is where most good intentions die. A check that runs and notices a change is useless if the noticing lands in a log nobody reads. The value is only realized at the moment a human is interrupted with a specific, legible claim: this tool’s contract changed at this time, in this way, and here is the before and the after.

That is the fifth idea: detection is only half of it; the other half is delivery and history. You need the alert that reaches a person, and you need the durable record, the timeline of a contract’s shape over weeks, so that when something does go wrong, you can answer the question the postmark-mcp victims could not answer for days: when did this change, and what were we exposed to in between?

So, to the landing

Put those ideas in a line and they compose into a single discipline. Capture the contract as an explicit, comparable shape. Diff it with enough judgment to separate breaking change from benign growth. Gate your own changes before they merge. Poll the surfaces you depend on but do not control, continuously, because their drift is not on your schedule. And when a promise breaks, tell a human, with the full receipt.

We wanted to test this ideas out and built them into DriftGuard. The local diff that classifies breaking changes and fits into CI is the cheap, upstream catch. The continuous watches that poll live API and MCP tool catalogs are for the drift you don’t control. The postmark-mcp-shaped drift that no pre-merge test can see. The alerting and the history are what turn a detected change into a decision someone can actually act on. None of these are clever in isolation. What makes them matter is that the threat is now continuous and external, and so the guard has to be continuous too.

The lesson of the first malicious MCP server is not that MCP is dangerous. It is that we have been treating trust as a thing you establish once, in a world where the things we trust can change every day, on someone else’s terms. Closing that gap, measuring the promise continuously instead of approving it once is the work. postmark-mcp is simply the clearest argument we have found for why it can't wait.