DEV Community

Kioi

Posted on • Originally published at driftguard.eddy-d55.workers.dev

Postmortem: MCP tool removed over the weekend, detected on scheduled poll (not prod traffic)

Subtitle: How a DriftGuard customer closed a gap their uptime stack and CI never covered


Summary

On 2026-05-12, a B2B SaaS team's Cursor-based workflow started failing intermittently: agents could read context but stopped creating tasks in their internal MCP server. Customer-facing APIs were healthy. Stripe webhooks and GitHub App installs showed no errors. The failure was isolated to a third-party MCP contract the team depended on but did not operate.

After adopting DriftGuard, a similar change was caught ~35 minutes after the vendor's live tools/list changed, via a breaking-classified drift event and Slack alert—before support volume moved.

This post walks through the original incident, why existing tooling missed it, and which DriftGuard capabilities map to each gap.


Impact (original incident)

Metric | Value
Duration | ~4h from first user report to root cause
Severity | SEV-2 (degraded agent workflows, API OK)
Affected | Internal ops automations + one customer-facing "AI assistant" feature
Data loss | None
Revenue | No direct billing impact; support load + delayed ship

Symptom: MCP tools/call errors and empty agent turns. Logs showed tool names that no longer existed in tools/list.

Not affected: Application HTTP error rates, p95 latency, Stripe charge success rate.


Timeline (original incident, UTC)

Time | Event
Sat 02:14 | Vendor deploys MCP server; create_task removed from catalog (no public changelog)
Sat–Mon | No production traffic hits create_task (low weekend usage)
Mon 13:40 | Support ticket: "AI can't file tasks"
Mon 14:10 | On-call checks service dashboards: all green
Mon 15:05 | Engineer manually runs curl + inspects MCP JSON; tool missing
Mon 16:00 | Hotfix: agent config updated to new tool name; incident closed

Detection gap: ~62 hours from contract change to human discovery. Monitoring that only reflects your traffic or your release pipeline will not see this class of failure.


Root cause

  1. Primary: Undocumented removal of MCP tool create_task from a server the team does not own.
  2. Contributing: No baseline or diff on tools/list / inputSchema outside ad-hoc debugging.
  3. Contributing: CI validates only the team's own OpenAPI spec, and contract tests use frozen fixtures for vendor JSON, so neither sees a live vendor change.

This is not an uptime problem. Endpoints returned 200. It is a consumer contract drift problem.
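
To make the missing baseline concrete, here is a minimal sketch of diffing two saved tools/list results by tool name. It assumes each snapshot is the `result` object of an MCP tools/list response (a `tools` array whose entries have a `name`). The helper names and the sample tools other than create_task are hypothetical, and this is not DriftGuard's implementation.

```python
from typing import Any


def tool_names(tools_list_result: dict[str, Any]) -> set[str]:
    """Collect tool names from the `result` object of an MCP tools/list response."""
    return {tool["name"] for tool in tools_list_result.get("tools", [])}


def diff_tool_names(
    baseline: dict[str, Any], current: dict[str, Any]
) -> dict[str, list[str]]:
    """Compare two tools/list results and report removed and added tool names."""
    before, after = tool_names(baseline), tool_names(current)
    return {
        "removed": sorted(before - after),
        "added": sorted(after - before),
    }


# Hypothetical snapshots: only create_task comes from the incident above.
baseline = {"tools": [{"name": "create_task"}, {"name": "list_tasks"}]}
current = {"tools": [{"name": "list_tasks"}, {"name": "add_task"}]}

print(diff_tool_names(baseline, current))
# {'removed': ['create_task'], 'added': ['add_task']}
```

Even a check this small, run on a schedule against a stored baseline, would have flagged the removal on Saturday rather than Monday.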


Why their stack didn't catch it

Tool | What it was doing | Why it wasn't enough
APM / synthetics | Latency and 5xx on their API | MCP schema isn't HTTP status
oasdiff in CI | Diff their spec at merge | Vendor MCP has no spec in repo
Cron fetch + jq (planned, never shipped) | One URL, unstructured diff | No MCP handshake, no breaking semantics, unmaintained
Agent retries | Masked failures as "empty results" | No alert; degraded UX
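
The "no MCP handshake" point is why a plain GET on a single URL falls short: an MCP client has to initialize a session before it can ask for the tool catalog. The sketch below only builds the JSON-RPC 2.0 messages in the order a client would send them. The protocol version string and client name are assumptions, and transport details (stdio vs HTTP, session headers, waiting for the initialize response) are left out.

```python
import json
from typing import Any

# Assumption: a protocol revision the target server accepts; adjust per vendor.
PROTOCOL_VERSION = "2025-03-26"


def handshake_messages(client_name: str, client_version: str) -> list[dict[str, Any]]:
    """Build the messages a client sends before it can read the tool catalog."""
    return [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": PROTOCOL_VERSION,
                "capabilities": {},
                "clientInfo": {"name": client_name, "version": client_version},
            },
        },
        # Sent after the initialize response arrives; notifications carry no id.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        # Only now is the catalog request valid.
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]


# Hypothetical client identity, for illustration only.
for message in handshake_messages("contract-watcher", "0.1.0"):
    print(json.dumps(message))
```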

The team needed continuous observation of external contracts with breaking vs noise classification—not another dashboard on their own service.
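
As an illustration of breaking vs noise classification, the sketch below assigns a severity to each difference between two tools/list results. The rules (removed tool and new required field are breaking, added tool is a warning, description change is info) are assumptions chosen to match the examples in this post, not DriftGuard's actual rule set. Tool and field names other than create_task and sync_tasks are hypothetical.

```python
from typing import Any


def index_tools(result: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Map tool name to tool definition for one tools/list result."""
    return {tool["name"]: tool for tool in result.get("tools", [])}


def required_fields(tool: dict[str, Any]) -> set[str]:
    """Required input fields declared in the tool's JSON Schema."""
    return set(tool.get("inputSchema", {}).get("required", []))


def classify(baseline: dict[str, Any], current: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (severity, summary) pairs for differences between two snapshots."""
    old, new = index_tools(baseline), index_tools(current)
    findings: list[tuple[str, str]] = []
    for name in sorted(old.keys() - new.keys()):
        findings.append(("breaking", f"tool removed: {name}"))
    for name in sorted(new.keys() - old.keys()):
        findings.append(("warning", f"tool added: {name}"))
    for name in sorted(old.keys() & new.keys()):
        for field in sorted(required_fields(new[name]) - required_fields(old[name])):
            findings.append(("breaking", f"new required field: {name}.{field}"))
        if old[name].get("description") != new[name].get("description"):
            findings.append(("info", f"description changed: {name}"))
    return findings


baseline = {"tools": [
    {"name": "sync_tasks", "inputSchema": {"type": "object", "required": ["project"]}},
    {"name": "create_task", "inputSchema": {"type": "object"}},
]}
current = {"tools": [
    {"name": "sync_tasks", "inputSchema": {"type": "object", "required": ["project", "owner"]}},
    {"name": "add_task", "inputSchema": {"type": "object"}},
]}

for severity, summary in classify(baseline, current):
    print(severity, summary)
```

Note that a rename shows up as one removal plus one add, so under these rules it is still reported as breaking for callers of the old name.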


What they changed after the incident (DriftGuard mapping)

Below is what the customer actually configured, and what decision each piece supported.

1. Inventory dependencies (day 0)

They pasted repo mcp.json and two OpenAPI URLs into the console import / suggest flow.

  • Decision: Which URLs are worth watching first?
  • Outcome: Four watches proposed (2 MCP, Stripe OpenAPI, GitHub REST spec), so the team skipped debating a coverage matrix in a spreadsheet.

2. Watch types and intervals

  • MCP servers: watchType: mcp, 30-minute interval
  • Vendor OpenAPI specs: specFormat: openapi, daily interval

  • Decision: Where do we need fast feedback vs slow spec churn?

  • Outcome: MCP on shorter interval; OpenAPI vendors on daily (semantic op-level diff, not raw JSON tree noise).
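
The interval split above can be sketched as a small scheduling check. This is a hypothetical model rather than DriftGuard's schema: the `Watch` class, the `kind` field (standing in for watchType / specFormat), and the example URLs are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Watch:
    url: str
    kind: str  # "mcp" or "openapi" (hypothetical stand-in for watchType / specFormat)
    interval: timedelta


def is_due(watch: Watch, last_checked: Optional[datetime], now: datetime) -> bool:
    """A watch is due if it was never checked or its interval has elapsed."""
    return last_checked is None or now - last_checked >= watch.interval


# Hypothetical URLs; the intervals mirror the 30-minute and daily settings above.
watches = [
    Watch("https://mcp.vendor.example/mcp", "mcp", timedelta(minutes=30)),
    Watch("https://api.vendor.example/openapi.json", "openapi", timedelta(days=1)),
]

now = datetime(2026, 1, 1, 9, 12, tzinfo=timezone.utc)
last = now - timedelta(minutes=45)
for watch in watches:
    print(watch.kind, is_due(watch, last, now))
# mcp True
# openapi False
```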

3. Baseline + fingerprint

First manual check on each watch stored a snapshot and schema fingerprint on the watch row.

  • Decision: Has this contract changed since we last looked?
  • Outcome: Fleet view shows stable hash; drift events are diffs against a known baseline, not one-off curls.
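
One way to get a stable fingerprint is to hash a canonical serialization of the catalog, so that key order and tool order do not register as change. The sketch below assumes SHA-256 over sorted JSON; the post does not say how DriftGuard computes its fingerprint, so treat the function name and approach as illustrative.

```python
import hashlib
import json
from typing import Any


def schema_fingerprint(tools_list_result: dict[str, Any]) -> str:
    """Hash a canonical form of a tools/list result (order-insensitive)."""
    tools = sorted(tools_list_result.get("tools", []), key=lambda tool: tool["name"])
    canonical = json.dumps({"tools": tools}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical snapshots: same catalog in a different order, then one tool removed.
snapshot_a = {"tools": [{"name": "create_task"}, {"name": "list_tasks"}]}
snapshot_b = {"tools": [{"name": "list_tasks"}, {"name": "create_task"}]}
snapshot_c = {"tools": [{"name": "list_tasks"}]}

print(schema_fingerprint(snapshot_a) == schema_fingerprint(snapshot_b))  # True
print(schema_fingerprint(snapshot_a) == schema_fingerprint(snapshot_c))  # False
```

A matching hash means no diff is needed; a mismatch is the trigger to compute and classify one.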

4. Alert routing

Slack incoming webhook on MCP watches; breaking-only policy initially.

  • Decision: Who gets paged for what?
  • Outcome: #integrations channel; test ping confirmed delivery (webhook_last_status visible in console—important for trusting alerts).
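
A breaking-only policy with Slack delivery might look like the sketch below. Slack incoming webhooks accept a JSON body with a `text` field; everything else here (the finding shape, function names, the policy string) is a hypothetical illustration, and `post_to_slack` is defined but not called.

```python
import json
import urllib.request
from typing import Any


def should_alert(findings: list[dict[str, str]], policy: str = "breaking-only") -> bool:
    """Under breaking-only, alert only if at least one finding is breaking."""
    if policy == "breaking-only":
        return any(f["severity"] == "breaking" for f in findings)
    return bool(findings)


def slack_payload(watch_name: str, findings: list[dict[str, str]]) -> dict[str, Any]:
    """Build the JSON body for a Slack incoming webhook."""
    breaking = [f for f in findings if f["severity"] == "breaking"]
    lines = [f"*{watch_name}*: {len(breaking)} breaking change(s) of {len(findings)} total"]
    lines += [f"- [{f['severity']}] {f['summary']}" for f in findings]
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict[str, Any]) -> int:
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical findings and watch name for illustration.
findings = [
    {"severity": "breaking", "summary": "tool removed: create_task"},
    {"severity": "warning", "summary": "tool added: add_task"},
]
if should_alert(findings):
    print(slack_payload("tasks-mcp", findings)["text"])
```

Recording the status code returned by the webhook on each send is what makes a "last delivery status" field possible.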

5. Ignore paths on Stripe watch

Ignored $.info.version after a noisy warning.

  • Decision: Is this alert actionable?
  • Outcome: Team kept breaking alerts on operations; suppressed metadata churn.
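
Ignore paths amount to deleting known-noisy fields from both snapshots before comparing them. The sketch below handles only simple `$.a.b` paths (no wildcards, no array indexes, no keys containing dots), which is enough for `$.info.version`. It is an assumption about the general technique, not DriftGuard's path syntax.

```python
import copy
from typing import Any


def drop_path(document: dict[str, Any], path: str) -> dict[str, Any]:
    """Return a copy of document without the value at a simple `$.a.b` path."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    keys = path[2:].split(".")
    result = copy.deepcopy(document)
    node: Any = result
    for key in keys[:-1]:
        if not isinstance(node, dict) or key not in node:
            return result  # path absent; nothing to drop
        node = node[key]
    if isinstance(node, dict):
        node.pop(keys[-1], None)
    return result


def equal_ignoring(old: dict[str, Any], new: dict[str, Any], ignore_paths: list[str]) -> bool:
    """Compare two documents after removing every ignored path from both."""
    for path in ignore_paths:
        old, new = drop_path(old, path), drop_path(new, path)
    return old == new


# Hypothetical spec fragments: only the version string differs.
old_spec = {"info": {"title": "API", "version": "1.0.0"}, "paths": {"/charges": {}}}
new_spec = {"info": {"title": "API", "version": "1.0.1"}, "paths": {"/charges": {}}}

print(equal_ignoring(old_spec, new_spec, []))                  # False
print(equal_ignoring(old_spec, new_spec, ["$.info.version"]))  # True
```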

6. CI gate

GitHub Action calling /api/coverage/assert on mcp.json in the repo.

  • Decision: Can we prevent new unwatched deps from merging?
  • Outcome: PR adding a third MCP URL failed CI until a watch existed—addresses repeat of "we added a dep but forgot to monitor it."
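
The logic behind such a gate is a set difference between the URLs declared in the repo and the URLs that have a watch. The sketch below runs that check locally. It assumes the common `mcpServers.<name>.url` layout for mcp.json and a hypothetical set of watched URLs; it does not call /api/coverage/assert, whose request and response shape is not described in this post.

```python
import json


def mcp_urls(mcp_config_text: str) -> set[str]:
    """Extract remote server URLs from an mcp.json document."""
    config = json.loads(mcp_config_text)
    servers = config.get("mcpServers", {})
    # Entries launched via a local command have no url and are skipped.
    return {entry["url"] for entry in servers.values() if "url" in entry}


def unwatched(mcp_config_text: str, watched_urls: set[str]) -> list[str]:
    """Return declared URLs that have no watch, sorted for stable CI output."""
    return sorted(mcp_urls(mcp_config_text) - watched_urls)


# Hypothetical config and watch list for illustration.
config_text = json.dumps({
    "mcpServers": {
        "tasks": {"url": "https://tasks.vendor.example/mcp"},
        "docs": {"url": "https://docs.vendor.example/mcp"},
        "search": {"url": "https://search.vendor.example/mcp"},
        "local": {"command": "npx", "args": ["some-local-server"]},
    }
})
watched = {"https://tasks.vendor.example/mcp", "https://docs.vendor.example/mcp"}

missing = unwatched(config_text, watched)
exit_code = 1 if missing else 0  # a CI step would exit with this code
print(missing, exit_code)
# ['https://search.vendor.example/mcp'] 1
```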

Second event (after DriftGuard)—how it played out

Two weeks later, the same internal MCP server renamed a tool (surfaced as a warning-level add) and added a required field to another tool (breaking-level). The change was in staging, not prod yet, but it was served from a live URL.

Time | Event
09:12 | DriftGuard scheduled check runs
09:12 | Drift event: 1 breaking, 2 warnings on MCP watch
09:13 | Slack alert delivered
09:20 | Engineer opens drift timeline → watch detail

The drift payload included agentAction strings (e.g. update client calls for required field on tools.sync_tasks.inputSchema). That text went straight into the Jira ticket—no separate "write up what changed" step.

Time to detect: ~35 minutes (poll interval + cron), not 62 hours.
Time to understand: minutes (classified diff + explain), not half a day of manual JSON comparison.


Lessons learned (customer's words, paraphrased)

  1. Consumer contracts need consumer monitoring. CI on your repo cannot substitute for watches on URLs you call but don't control.
  2. MCP failures look like agent bugs. Without tools/list diffs, on-call burns time in the wrong layer.
  3. Classification matters. "JSON changed" alerts get ignored; breaking tool removal gets fixed.
  4. Coverage is a process problem. Assert-on-merge turned monitoring from heroics into a default.

When this pattern applies to you

Consider the same approach if:

  • You run agents against MCP servers you don't operate
  • You integrate against the live behavior of Stripe/GitHub/partner APIs rather than specs you pin in CI
  • You have low-traffic code paths that won't trip synthetics until Monday
  • You've said "we should cron those URLs someday" and never did

Not a fit if: you only need to gate your own OpenAPI at release—use the OSS diff / GitHub Action locally; DriftGuard's hosted value is external watches.


