Kioi

Posted on Jun 1 • Edited on Jul 5 • Originally published at driftguard.eddy-d55.workers.dev

Postmortem: MCP tool removed over the weekend, detected on scheduled poll (not prod traffic)

#mcp #devops #postmortem #api

Subtitle: How a DriftGuard customer closed a gap their uptime stack and CI never covered

Summary

On 2026-05-12, a B2B SaaS team's Cursor-based workflow started failing intermittently: agents could still read context but could no longer create tasks through the MCP server behind the team's internal automations. Customer-facing APIs were healthy, and Stripe webhooks and GitHub App installs showed no errors. The failure was isolated to a third-party MCP contract the team depended on but did not operate.

After the team adopted DriftGuard, a similar change was caught about 35 minutes after the vendor's live tools/list changed. A drift event classified as breaking triggered a Slack alert before support volume moved.

This post walks through the original incident, why existing tooling missed it, and which DriftGuard capabilities map to each gap.

Impact (original incident)

| Metric | Value |
| --- | --- |
| Duration | ~4h from first user report to root cause |
| Severity | SEV-2 (degraded agent workflows, API OK) |
| Affected | Internal ops automations + one customer-facing "AI assistant" feature |
| Data loss | None |
| Revenue | No direct billing impact; support load + delayed ship |

Symptom: MCP tools/call errors and empty agent turns. Logs showed tool names that no longer existed in tools/list.

Not affected: Application HTTP error rates, p95 latency, Stripe charge success rate.

Timeline (original incident, UTC)

| Time | Event |
| --- | --- |
| Sat 02:14 | Vendor deploys MCP server; `create_task` removed from catalog (no public changelog) |
| Sat–Mon | No production traffic hits `create_task` (low weekend usage) |
| Mon 13:40 | Support ticket: "AI can't file tasks" |
| Mon 14:10 | On-call checks service dashboards: all green |
| Mon 15:05 | Engineer manually runs `curl` + inspects MCP JSON; tool missing |
| Mon 16:00 | Hotfix: agent config updated to new tool name; incident closed |

Detection gap: ~62 hours from contract change to human discovery. Monitoring that only reflects your traffic or your release pipeline will not see this class of failure.

Root cause

Primary: Undocumented removal of MCP tool create_task from a server the team does not own.
Contributing: No baseline or diff on tools/list / inputSchema outside ad-hoc debugging.
Contributing: CI validates only the team's own OpenAPI spec, and their contract tests use frozen fixtures for vendor JSON, so neither sees a live vendor change.

This is not an uptime problem. Endpoints returned 200. It is a consumer contract drift problem.
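The missing baseline is simple to picture. The sketch below compares two saved tools/list results by tool name and reports what disappeared. It assumes the standard MCP result shape (a `tools` array of objects with a `name`). The snapshots are hypothetical, and only `create_task` comes from the incident.

```python
from typing import Any


def tool_names(tools_list_result: dict[str, Any]) -> set[str]:
    """Collect tool names from a tools/list result ({"tools": [{"name": ...}, ...]})."""
    return {tool["name"] for tool in tools_list_result.get("tools", [])}


def diff_tool_catalog(baseline: dict[str, Any], current: dict[str, Any]) -> dict[str, list[str]]:
    """Return tools removed from and added to the catalog since the baseline."""
    before, after = tool_names(baseline), tool_names(current)
    return {
        "removed": sorted(before - after),
        "added": sorted(after - before),
    }


# Hypothetical snapshots; list_tasks is a made-up neighbour of create_task.
baseline = {"tools": [{"name": "create_task"}, {"name": "list_tasks"}]}
current = {"tools": [{"name": "list_tasks"}]}

drift = diff_tool_catalog(baseline, current)
print(drift)
```

Run on a schedule against the live server, a check like this would have flagged the removal on Saturday, independent of whether any production traffic touched the tool.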

Why their stack didn't catch it

| Tool | What it was doing | Why it wasn't enough |
| --- | --- | --- |
| APM / synthetics | Latency and 5xx on their API | MCP schema changes don't show up as HTTP status codes |
| oasdiff in CI | Diff their spec at merge | Vendor MCP has no spec in repo |
| Cron `fetch` + `jq` (planned, never shipped) | One URL, unstructured diff | No MCP handshake, no breaking semantics, unmaintained |
| Agent retries | Masked failures as "empty results" | No alert; degraded UX |

The team needed continuous observation of external contracts with breaking vs noise classification—not another dashboard on their own service.
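The "no MCP handshake" gap in the cron idea is worth spelling out: an MCP client is expected to initialize the session before it asks for the tool catalog, so a single unauthenticated GET plus `jq` is not enough. The sketch below only builds the JSON-RPC messages in order and leaves the transport out. The protocol revision string and the client name are assumptions for illustration.

```python
import json
from typing import Any

# Assumed protocol revision; a real client uses whichever revision it negotiates.
PROTOCOL_VERSION = "2025-03-26"


def handshake_messages(client_name: str = "contract-watch-sketch") -> list[dict[str, Any]]:
    """Build the JSON-RPC messages a client sends before it can list tools."""
    return [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": PROTOCOL_VERSION,
                "capabilities": {},
                "clientInfo": {"name": client_name, "version": "0.0.1"},
            },
        },
        # Notifications carry no id and expect no response.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]


messages = handshake_messages()
for message in messages:
    print(json.dumps(message))
```

A real poller would send these over the server's transport, wait for the `initialize` response before continuing, and follow pagination cursors on `tools/list` if the server returns them.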

What they changed after the incident (DriftGuard mapping)

Below is what the customer actually configured, and what decision each piece supported.

1. Inventory dependencies (day 0)

They pasted the repo's mcp.json and two OpenAPI URLs into the console's import/suggest flow.

Decision: Which URLs are worth watching first?
Outcome: Four watches proposed (2 MCP, Stripe OpenAPI, GitHub REST spec), which spared the team a spreadsheet debate over a priority matrix.

2. Watch types and intervals

MCP servers: watchType: mcp, 30-minute interval
Vendor OpenAPI specs: specFormat: openapi, daily interval

Decision: Where do we need fast feedback vs slow spec churn?
Outcome: MCP watches on the shorter interval; vendor OpenAPI specs on a daily one (semantic operation-level diff, not raw JSON tree noise).

3. Baseline + fingerprint

The first manual check on each watch stored a snapshot and a schema fingerprint on the watch row.

Decision: Has this contract changed since we last looked?
Outcome: Fleet view shows stable hash; drift events are diffs against a known baseline, not one-off curls.
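To make the fingerprint idea concrete, here is a minimal sketch of one common way to get a stable hash: serialize the snapshot canonically, then hash it. This is an illustration, not DriftGuard's actual fingerprint algorithm, which the customer did not describe. The snapshots are hypothetical.

```python
import hashlib
import json
from typing import Any


def schema_fingerprint(snapshot: Any) -> str:
    """Hash a JSON-compatible snapshot in a canonical form (sorted keys, no whitespace)."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


baseline = {"tools": [{"name": "create_task", "inputSchema": {"type": "object"}}]}
# Same content with keys in a different order: should hash identically.
reordered = {"tools": [{"inputSchema": {"type": "object"}, "name": "create_task"}]}
# A renamed tool: should hash differently.
changed = {"tools": [{"name": "add_task", "inputSchema": {"type": "object"}}]}

same = schema_fingerprint(baseline) == schema_fingerprint(reordered)
differs = schema_fingerprint(baseline) != schema_fingerprint(changed)
print(same, differs)
```

Sorting keys keeps the hash stable when a server reorders object fields. Array order still affects the result, so a real implementation would likely sort the tools by name first if catalog order carries no meaning.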

4. Alert routing

Slack incoming webhook on MCP watches; breaking-only policy initially.

Decision: Who gets paged for what?
Outcome: #integrations channel; test ping confirmed delivery (webhook_last_status visible in console—important for trusting alerts).
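A breaking-only policy amounts to a filter in front of the webhook call. The sketch below shows that filter and the request body a Slack incoming webhook accepts (a JSON object with a `text` field). The change records, their `severity` and `summary` keys, and the watch name are hypothetical, and this is not DriftGuard's internal routing code.

```python
import json
import urllib.request
from typing import Any


def should_alert(changes: list[dict[str, Any]], policy: str = "breaking-only") -> bool:
    """Decide whether a drift event is worth a Slack message under the given policy."""
    if policy == "breaking-only":
        return any(change["severity"] == "breaking" for change in changes)
    return bool(changes)


def slack_payload(watch_name: str, changes: list[dict[str, Any]]) -> dict[str, str]:
    """Build the JSON body for a Slack incoming webhook (a single "text" field)."""
    lines = [f"[{c['severity']}] {c['summary']}" for c in changes]
    return {"text": f"Contract drift on {watch_name}\n" + "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict[str, str]) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical drift events; nothing is sent in this example.
warnings_only = [{"severity": "warning", "summary": "tool added: archive_task"}]
with_breaking = warnings_only + [
    {"severity": "breaking", "summary": "tool removed: create_task"}
]
print(should_alert(warnings_only), should_alert(with_breaking))
print(slack_payload("tasks-mcp", with_breaking)["text"])
```

Recording the status returned by `post_to_slack` is what makes a field like webhook_last_status possible: an alert path that silently fails is worse than none.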

5. Ignore paths on Stripe watch

Ignored $.info.version after a noisy warning.

Decision: Is this alert actionable?
Outcome: Team kept breaking alerts on operations; suppressed metadata churn.
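One way to think about an ignore path is as a pre-processing step: strip the ignored key from both snapshots, then compare. The sketch below handles only simple dotted paths such as `$.info.version` and is not a full JSONPath engine. The two spec fragments are hypothetical.

```python
import copy
from typing import Any


def drop_path(document: dict[str, Any], path: str) -> dict[str, Any]:
    """Return a copy of document without the key at a simple "$.a.b" path.

    Only dotted object keys are handled; this is not a full JSONPath engine.
    """
    result = copy.deepcopy(document)
    keys = path.removeprefix("$.").split(".")
    node: Any = result
    for key in keys[:-1]:
        if not isinstance(node, dict) or key not in node:
            return result  # Path is absent, nothing to drop.
        node = node[key]
    if isinstance(node, dict):
        node.pop(keys[-1], None)
    return result


# Hypothetical spec fragments that differ only in info.version.
old = {"info": {"title": "Example API", "version": "2026-05-01"}, "paths": {"/v1/charges": {}}}
new = {"info": {"title": "Example API", "version": "2026-05-08"}, "paths": {"/v1/charges": {}}}

noise_only = drop_path(old, "$.info.version") == drop_path(new, "$.info.version")
print(noise_only)
```

Because the function works on a copy, the stored snapshot keeps the original version string for later inspection.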

6. CI gate

GitHub Action calling /api/coverage/assert on mcp.json in the repo.

Decision: Can we prevent new unwatched deps from merging?
Outcome: PR adding a third MCP URL failed CI until a watch existed—addresses repeat of "we added a dep but forgot to monitor it."
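The logic behind such a gate is a set difference between what the repo declares and what is being watched. The sketch below runs that check locally and does not call /api/coverage/assert, whose request format is not covered here. It assumes an mcp.json with a top-level `mcpServers` object whose remote entries carry a `url`. The example.com hosts are placeholders.

```python
from typing import Any


def remote_urls(mcp_config: dict[str, Any]) -> set[str]:
    """Collect server URLs from an mcp.json-style config, skipping entries without one."""
    servers = mcp_config.get("mcpServers", {})
    return {entry["url"] for entry in servers.values() if "url" in entry}


def unwatched_urls(mcp_config: dict[str, Any], watched: set[str]) -> list[str]:
    """Return configured URLs that have no matching watch."""
    return sorted(remote_urls(mcp_config) - watched)


# Hypothetical config and watch list.
config = {
    "mcpServers": {
        "tasks": {"url": "https://mcp.example.com/tasks"},
        "docs": {"url": "https://mcp.example.com/docs"},
        "search": {"url": "https://mcp.example.com/search"},  # newly added in the PR
        "local-tools": {"command": "node", "args": ["server.js"]},  # stdio, no URL
    }
}
watched = {"https://mcp.example.com/tasks", "https://mcp.example.com/docs"}

missing = unwatched_urls(config, watched)
exit_code = 1 if missing else 0  # A CI step would exit with this code.
print(missing, exit_code)
```

Failing the build on a non-empty list is what turns "remember to add a watch" into something a reviewer cannot skip.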

Second event (after DriftGuard)—how it played out

Two weeks later, the same MCP server changed again: one tool was renamed, which surfaced as a warning-level addition, and another tool gained a required field, which is breaking. The change landed in staging rather than prod, but on a live URL that the watch polled.

| Time | Event |
| --- | --- |
| 09:12 | DriftGuard scheduled check runs |
| 09:12 | Drift event: 1 breaking, 2 warnings on MCP watch |
| 09:13 | Slack alert delivered |
| 09:20 | Engineer opens drift timeline → watch detail |

The drift payload included agentAction strings (e.g. update client calls for required field on tools.sync_tasks.inputSchema). That text went straight into the Jira ticket—no separate "write up what changed" step.

Time to detect: ~35 minutes (poll interval + cron), not 62 hours.
Time to understand: minutes (classified diff + explain), not half a day of manual JSON comparison.
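The reason a new required field counts as breaking is that existing callers omit it and start failing validation. The sketch below shows how that rule might be applied to two tools/list snapshots by comparing the `required` lists in each tool's inputSchema. It is an illustration of the classification idea, not DriftGuard's classifier. `sync_tasks` comes from the drift payload above, while the field names are hypothetical.

```python
from typing import Any


def required_fields(tool: dict[str, Any]) -> set[str]:
    """Read the "required" list from a tool's JSON Schema inputSchema."""
    return set(tool.get("inputSchema", {}).get("required", []))


def new_required_fields(before: dict[str, Any], after: dict[str, Any]) -> list[dict[str, str]]:
    """Flag required fields added to tools that exist in both snapshots."""
    old = {tool["name"]: tool for tool in before.get("tools", [])}
    new = {tool["name"]: tool for tool in after.get("tools", [])}
    changes: list[dict[str, str]] = []
    for name in sorted(old.keys() & new.keys()):
        for field in sorted(required_fields(new[name]) - required_fields(old[name])):
            # Existing callers that omit the field will now fail validation.
            changes.append(
                {"severity": "breaking", "path": f"tools.{name}.inputSchema", "field": field}
            )
    return changes


# Hypothetical snapshots; list_id and workspace_id are made-up field names.
before = {"tools": [{"name": "sync_tasks", "inputSchema": {"required": ["list_id"]}}]}
after = {
    "tools": [{"name": "sync_tasks", "inputSchema": {"required": ["list_id", "workspace_id"]}}]
}

changes = new_required_fields(before, after)
print(changes)
```

A structured record like this is what lets the alert name the exact schema path, which is the part that went straight into the ticket.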

Lessons learned (customer's words, paraphrased)

Consumer contracts need consumer monitoring. CI on your repo cannot substitute for watches on URLs you call but don't control.
MCP failures look like agent bugs. Without tools/list diffs, on-call burns time in the wrong layer.
Classification matters. "JSON changed" alerts get ignored; breaking tool removal gets fixed.
Coverage is a process problem. Assert-on-merge turned monitoring from heroics into a default.

When this pattern applies to you

Consider the same approach if:

You run agents against MCP servers you don't operate
You integrate with Stripe, GitHub, or partner APIs based on their live behavior rather than specs you pin in CI
You have low-traffic code paths that won't trip synthetics until Monday
You've said "we should cron those URLs someday" and never did

Not a fit if you only need to gate your own OpenAPI spec at release. In that case, use the OSS diff / GitHub Action locally; DriftGuard's hosted value is in external watches.
