Subtitle: How a DriftGuard customer closed a gap their uptime stack and CI never covered
Summary
On 2026-05-12, a B2B SaaS team's Cursor-based workflow started failing intermittently: agents could read context but stopped creating tasks in their internal MCP server. Customer-facing APIs were healthy. Stripe webhooks and GitHub App installs showed no errors. The failure was isolated to a third-party MCP contract the team depended on but did not operate.
After adopting DriftGuard, a similar change was caught ~35 minutes after the vendor's live tools/list changed, via a breaking-classified drift event and Slack alert—before support volume moved.
This post walks through the original incident, why existing tooling missed it, and which DriftGuard capabilities map to each gap.
Impact (original incident)
| Metric | Value |
|---|---|
| Duration | ~4h from first user report to root cause |
| Severity | SEV-2 (degraded agent workflows, API OK) |
| Affected | Internal ops automations + one customer-facing "AI assistant" feature |
| Data loss | None |
| Revenue | No direct billing impact; support load + delayed ship |
Symptom: MCP tools/call errors and empty agent turns. Logs showed tool names that no longer existed in tools/list.
Not affected: Application HTTP error rates, p95 latency, Stripe charge success rate.
Timeline (original incident, UTC)
| Time | Event |
|---|---|
| Sat 02:14 | Vendor deploys MCP server; create_task removed from catalog (no public changelog) |
| Sat–Mon | No production traffic hits create_task (low weekend usage) |
| Mon 13:40 | Support ticket: "AI can't file tasks" |
| Mon 14:10 | On-call checks service dashboards — green |
| Mon 15:05 | Engineer manually runs curl + inspects MCP JSON; tool missing |
| Mon 16:00 | Hotfix: agent config updated to new tool name; incident closed |
Detection gap: ~62 hours from contract change to human discovery. Monitoring that only reflects your traffic or your release pipeline will not see this class of failure.
Root cause
- Primary: Undocumented removal of MCP tool `create_task` from a server the team does not own.
- Contributing: No baseline or diff on `tools/list` / `inputSchema` outside ad-hoc debugging.
- Contributing: CI validates their OpenAPI, and contract tests use frozen fixtures for vendor JSON.
This is not an uptime problem. Endpoints returned 200. It is a consumer contract drift problem.
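To make the "200 but broken" point concrete, here is a minimal sketch that compares the tool names an agent config still references against a live `tools/list` result. The sample catalog, the `add_task` name, and the `missing_tools` helper are all hypothetical; only `create_task` and `tools/list` come from the incident.

```python
import json

# Hypothetical tools/list result, trimmed to the fields used here.
LIVE_TOOLS_LIST = json.loads("""
{"tools": [
  {"name": "list_tasks", "inputSchema": {"type": "object"}},
  {"name": "add_task", "inputSchema": {"type": "object"}}
]}
""")

# Tool names the agent configuration still refers to (assumed for illustration).
EXPECTED_TOOLS = {"list_tasks", "create_task"}


def missing_tools(expected: set[str], tools_list_result: dict) -> set[str]:
    """Return expected tool names that the live catalog no longer offers."""
    live = {tool["name"] for tool in tools_list_result.get("tools", [])}
    return expected - live


if __name__ == "__main__":
    # The HTTP layer can be perfectly healthy while this set is non-empty.
    print(sorted(missing_tools(EXPECTED_TOOLS, LIVE_TOOLS_LIST)))
```

A check like this only helps if it runs on a schedule against the live catalog, which is exactly what the frozen fixtures could not do.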
Why their stack didn't catch it
| Tool | What it was doing | Why it wasn't enough |
|---|---|---|
| APM / synthetics | Latency and 5xx on their API | MCP schema isn't HTTP status |
| oasdiff in CI | Diff their spec at merge | Vendor MCP has no spec in repo |
| Cron fetch + jq (planned, never shipped) | One URL, unstructured diff | No MCP handshake, no breaking semantics, unmaintained |
| Agent retries | Masked failures as "empty results" | No alert; degraded UX |
The team needed continuous observation of external contracts with breaking vs noise classification—not another dashboard on their own service.
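The "no MCP handshake" gap in the cron row is worth spelling out: a client is expected to initialize a session before it asks for the catalog, so a bare GET on one URL is not enough. The sketch below builds the JSON-RPC messages in order. It assumes the `2025-06-18` protocol revision and leaves out the transport (stdio or HTTP) entirely; `handshake_messages` is a hypothetical helper, not part of any SDK.

```python
PROTOCOL_VERSION = "2025-06-18"  # assumption: use the revision your server supports


def handshake_messages(client_name: str, client_version: str) -> list[dict]:
    """Build the JSON-RPC messages a client sends before reading the tool catalog."""
    return [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": PROTOCOL_VERSION,
                "capabilities": {},
                "clientInfo": {"name": client_name, "version": client_version},
            },
        },
        # A notification carries no "id", so no response is expected.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        # Only now is the catalog request valid.
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]


if __name__ == "__main__":
    for message in handshake_messages("contract-probe", "0.1.0"):
        print(message["method"])
```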
What they changed after the incident (DriftGuard mapping)
Below is what the customer actually configured, and what decision each piece supported.
1. Inventory dependencies (day 0)
They pasted repo mcp.json and two OpenAPI URLs into the console import / suggest flow.
- Decision: Which URLs are worth watching first?
- Outcome: Four watches proposed (2 MCP, Stripe OpenAPI, GitHub REST spec)—skipped debating a matrix in a spreadsheet.
2. Watch types and intervals
- MCP servers: `watchType: mcp`, 30-minute interval
- Vendor OpenAPI specs: `specFormat: openapi`, daily interval
- Decision: Where do we need fast feedback vs slow spec churn?
- Outcome: MCP on shorter interval; OpenAPI vendors on daily (semantic op-level diff, not raw JSON tree noise).
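As a rough illustration of the two cadences, the sketch below models a watch with an interval and a due check. The `Watch` class, its field names, and the example URLs are hypothetical; this is not DriftGuard's data model, only one way to picture "30 minutes for MCP, daily for OpenAPI".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Watch:
    url: str
    kind: str                      # "mcp" or "openapi"
    interval: timedelta            # hypothetical field name
    last_checked: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        """A watch is due if it was never checked or its interval has elapsed."""
        return self.last_checked is None or now - self.last_checked >= self.interval


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    checked = now - timedelta(minutes=45)
    watches = [
        Watch("https://mcp.example.com/mcp", "mcp", timedelta(minutes=30), checked),
        Watch("https://api.example.com/openapi.json", "openapi", timedelta(days=1), checked),
    ]
    for watch in watches:
        print(watch.kind, "due" if watch.is_due(now) else "not due")
```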
3. Baseline + fingerprint
First manual check on each watch stored a snapshot and schema fingerprint on the watch row.
- Decision: Has this contract changed since we last looked?
- Outcome: Fleet view shows stable hash; drift events are diffs against a known baseline, not one-off curls.
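One plausible way to get a stable hash is to canonicalize the catalog before hashing, so that key order and tool order do not register as drift. The sketch below is an assumption about the general technique, not DriftGuard's actual fingerprint; `schema_fingerprint` is a hypothetical name.

```python
import hashlib
import json


def schema_fingerprint(tools_list_result: dict) -> str:
    """Hash only the contract-relevant parts of a tools/list result, in canonical form."""
    tools = sorted(tools_list_result.get("tools", []), key=lambda t: t["name"])
    canonical = json.dumps(
        [{"name": t["name"], "inputSchema": t.get("inputSchema", {})} for t in tools],
        sort_keys=True,            # key order must not affect the hash
        separators=(",", ":"),     # no whitespace differences either
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    baseline = {"tools": [{"name": "create_task", "inputSchema": {"type": "object"}}]}
    print(schema_fingerprint(baseline))
```

If the fingerprint is unchanged, a full diff can be skipped; if it differs, the stored snapshot gives the diff something to compare against.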
4. Alert routing
Slack incoming webhook on MCP watches; breaking-only policy initially.
- Decision: Who gets paged for what?
- Outcome: `#integrations` channel; test ping confirmed delivery (`webhook_last_status` visible in console, important for trusting alerts).
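A breaking-only policy can be pictured as a filter in front of the webhook call. The sketch below builds (but does not send) a Slack incoming-webhook request, which accepts a JSON body with a `text` field. The event shape and `build_slack_request` are assumptions for illustration, not DriftGuard's payload format.

```python
import json
import urllib.request
from typing import Optional


def build_slack_request(webhook_url: str, event: dict) -> Optional[urllib.request.Request]:
    """Return a POST request for Slack, or None when the event has nothing breaking."""
    breaking = [c for c in event["changes"] if c["severity"] == "breaking"]
    if not breaking:
        return None  # breaking-only policy: warnings do not reach the channel
    lines = [f"Breaking drift on {event['watch']}:"]
    lines += [f"- {c['path']}: {c['change']}" for c in breaking]
    body = json.dumps({"text": "\n".join(lines)}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical drift event with one breaking change and one warning.
EXAMPLE_EVENT = {
    "watch": "vendor-mcp",
    "changes": [
        {"severity": "breaking", "path": "tools.sync_tasks.inputSchema",
         "change": "required field added"},
        {"severity": "warning", "path": "tools.archive_task", "change": "tool added"},
    ],
}
```

Passing the returned request to `urllib.request.urlopen` would deliver it; keeping construction separate from sending makes the routing rule easy to test.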
5. Ignore paths on Stripe watch
Ignored $.info.version after a noisy warning.
- Decision: Is this alert actionable?
- Outcome: Team kept breaking alerts on operations; suppressed metadata churn.
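The effect of an ignore path can be sketched as stripping those keys from both snapshots before comparing them. The helper below handles only simple dotted paths such as `$.info.version`; it is a simplification, not a JSONPath implementation, and `drop_ignored` is a hypothetical name.

```python
import copy


def drop_ignored(doc: dict, ignore_paths: list[str]) -> dict:
    """Return a copy of doc with each simple dotted path (e.g. "$.info.version") removed."""
    cleaned = copy.deepcopy(doc)
    for path in ignore_paths:
        keys = path.removeprefix("$.").split(".")
        node = cleaned
        for key in keys[:-1]:
            node = node.get(key)
            if not isinstance(node, dict):
                break  # path does not exist in this document; nothing to remove
        else:
            node.pop(keys[-1], None)
    return cleaned


if __name__ == "__main__":
    old = {"info": {"title": "API", "version": "2026-01-01"}, "paths": {"/v1/charges": {}}}
    new = {"info": {"title": "API", "version": "2026-01-02"}, "paths": {"/v1/charges": {}}}
    ignore = ["$.info.version"]
    # Only the version string moved, so the cleaned snapshots compare equal.
    print(drop_ignored(old, ignore) == drop_ignored(new, ignore))
```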
6. CI gate
GitHub Action calling /api/coverage/assert on mcp.json in the repo.
- Decision: Can we prevent new unwatched deps from merging?
- Outcome: PR adding a third MCP URL failed CI until a watch existed—addresses repeat of "we added a dep but forgot to monitor it."
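The core of such a gate is a set difference between URLs in the repo config and URLs that already have a watch. The sketch below runs that check locally and does not call `/api/coverage/assert`, whose request shape is not shown here. It assumes the `mcpServers` layout used by Cursor-style `mcp.json` files; the server names and URLs are made up.

```python
import json

# Hypothetical repo mcp.json; stdio servers (no "url") are skipped by the check.
MCP_JSON = """
{"mcpServers": {
  "tasks":   {"url": "https://mcp.vendor-a.example/mcp"},
  "search":  {"url": "https://mcp.vendor-b.example/mcp"},
  "billing": {"url": "https://mcp.vendor-c.example/mcp"},
  "local":   {"command": "node", "args": ["server.js"]}
}}
"""

# URLs that already have a watch (assumed to come from the monitoring service).
WATCHED = {"https://mcp.vendor-a.example/mcp", "https://mcp.vendor-b.example/mcp"}


def unwatched_urls(mcp_config: dict, watched: set[str]) -> list[str]:
    """List remote MCP URLs in the config that have no watch yet."""
    servers = mcp_config.get("mcpServers", {})
    urls = {server["url"] for server in servers.values() if "url" in server}
    return sorted(urls - watched)


if __name__ == "__main__":
    missing = unwatched_urls(json.loads(MCP_JSON), WATCHED)
    for url in missing:
        print(f"no watch for {url}")
    raise SystemExit(1 if missing else 0)  # non-zero exit fails the CI job
```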
Second event (after DriftGuard)—how it played out
Two weeks later, the same MCP server renamed a tool (a warning-level add plus a breaking-level required field on another tool; in staging, not prod yet, but on the live URL).
| Time | Event |
|---|---|
| 09:12 | DriftGuard scheduled check runs |
| 09:12 | Drift event: 1 breaking, 2 warnings on MCP watch |
| 09:13 | Slack alert delivered |
| 09:20 | Engineer opens drift timeline → watch detail |
The drift payload included agentAction strings (e.g. update client calls for required field on tools.sync_tasks.inputSchema). That text went straight into the Jira ticket—no separate "write up what changed" step.
Time to detect: ~35 minutes (poll interval + cron), not 62 hours.
Time to understand: minutes (classified diff + explain), not half a day of manual JSON comparison.
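To show what a classified diff buys over a raw JSON diff, here is a minimal sketch that compares two `tools/list` snapshots and labels each change. The rules (removed tool and new required field are breaking, added tool is a warning) are assumptions for illustration and may not match DriftGuard's classifier; `archive_task`, `project`, and `workspace` are hypothetical names.

```python
def classify_drift(old: dict, new: dict) -> list[dict]:
    """Compare two tools/list results and label each change by severity."""
    old_tools = {t["name"]: t for t in old.get("tools", [])}
    new_tools = {t["name"]: t for t in new.get("tools", [])}
    changes = []
    for name in sorted(old_tools.keys() - new_tools.keys()):
        changes.append({"severity": "breaking", "path": f"tools.{name}",
                        "change": "tool removed"})
    for name in sorted(new_tools.keys() - old_tools.keys()):
        changes.append({"severity": "warning", "path": f"tools.{name}",
                        "change": "tool added"})
    for name in sorted(old_tools.keys() & new_tools.keys()):
        old_req = set(old_tools[name].get("inputSchema", {}).get("required", []))
        new_req = set(new_tools[name].get("inputSchema", {}).get("required", []))
        for field in sorted(new_req - old_req):
            # Existing callers that omit the field will start failing.
            changes.append({"severity": "breaking", "path": f"tools.{name}.inputSchema",
                            "change": f"required field added: {field}"})
    return changes


BASELINE = {"tools": [
    {"name": "sync_tasks", "inputSchema": {"type": "object", "required": ["project"]}},
]}
CURRENT = {"tools": [
    {"name": "sync_tasks",
     "inputSchema": {"type": "object", "required": ["project", "workspace"]}},
    {"name": "archive_task", "inputSchema": {"type": "object"}},
]}

if __name__ == "__main__":
    for change in classify_drift(BASELINE, CURRENT):
        print(change["severity"], change["path"], change["change"])
```

Each entry carries a path and a plain-language change description, which is the kind of text that can be pasted into a ticket without further write-up.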
Lessons learned (customer's words, paraphrased)
- Consumer contracts need consumer monitoring. CI on your repo cannot substitute for watches on URLs you call but don't control.
- MCP failures look like agent bugs. Without `tools/list` diffs, on-call burns time in the wrong layer.
- Classification matters. "JSON changed" alerts get ignored; breaking tool removal gets fixed.
- Coverage is a process problem. Assert-on-merge turned monitoring from heroics into a default.
When this pattern applies to you
Consider the same approach if:
- You run agents against MCP servers you don't operate
- You integrate Stripe/GitHub/partner APIs from live behavior, not specs you pin in CI
- You have low-traffic code paths that won't trip synthetics until Monday
- You've said "we should cron those URLs someday" and never did
Not a fit if: you only need to gate your own OpenAPI at release—use the OSS diff / GitHub Action locally; DriftGuard's hosted value is external watches.
Links
- Console trial (2 real watches): https://driftguard.eddy-d55.workers.dev/console
- OSS + MCP client: https://github.com/kioie/driftguard
- API / agent docs: `/openapi.json`, `/agents.md`, `/llms.txt`