Kioi

Posted on Jun 1 • Edited on Jul 5 • Originally published at driftguard.eddy-d55.workers.dev

Postmortem: "We'll add MCP monitoring in Q3" - embedding DriftGuard in the agent loop instead

#mcp #devops #postmortem #ai

Subtitle: Replacing a multi-script monitoring design with MCP tools + CI assert

Summary

The same customer as in the original postmortem planned an internal monitoring layer: cron jobs per vendor, S3 snapshots, custom severity rules, PagerDuty routing, and a quarterly review of MCP URLs in repos. The engineering estimate was ~1.5 engineer-weeks for the initial build, plus ongoing toil whenever MCP transport edge cases appeared.

They cancelled that project after wiring DriftGuard's hosted API + MCP tools into Cursor and CI. This post is a design postmortem of the abandoned approach vs what shipped in two afternoons.

Audience: teams googling "monitor MCP tools/list changes", "detect removed MCP tool production", or asking an AI "how do I know when my agent's tools changed?"

Intended architecture (never built)

Component	Purpose
Cron per URL	Periodic fetch
S3 (or D1) snapshot store	History
Custom diff	JSON deep-compare
Severity heuristics	Tool removed = ?
PagerDuty	Route breaking
Repo scanner	Find new MCP URLs in PRs
Runbook	Interpret raw diffs
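
To make the "Custom diff" and "Severity heuristics" rows concrete, here is a minimal sketch of what the in-house version might have looked like. It assumes each snapshot is a stored MCP `tools/list` result (a `tools` array whose entries carry `name` and `inputSchema`). The function names and the severity labels (`breaking`, `warning`, `info`) are hypothetical and are not DriftGuard's model.

```python
from typing import Any


def index_tools(snapshot: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Index a stored tools/list result by tool name."""
    return {tool["name"]: tool for tool in snapshot.get("tools", [])}


def diff_tools(old: dict[str, Any], new: dict[str, Any]) -> list[dict[str, str]]:
    """Compare two snapshots and attach an assumed severity to each change."""
    old_tools, new_tools = index_tools(old), index_tools(new)
    changes: list[dict[str, str]] = []

    # Assumption: a removed tool is treated as breaking, since agents may call it.
    for name in sorted(old_tools.keys() - new_tools.keys()):
        changes.append({"tool": name, "change": "removed", "severity": "breaking"})

    # Assumption: a newly added tool is informational only.
    for name in sorted(new_tools.keys() - old_tools.keys()):
        changes.append({"tool": name, "change": "added", "severity": "info"})

    # Any difference in inputSchema is flagged; this sketch does not inspect
    # whether the schema change is backward compatible.
    for name in sorted(old_tools.keys() & new_tools.keys()):
        if old_tools[name].get("inputSchema") != new_tools[name].get("inputSchema"):
            changes.append({"tool": name, "change": "schema_changed", "severity": "warning"})

    return changes
```

Even this small version leaves the hard question open: whether a given schema change is safe still needs a human or a richer rule set, which is where the runbook row came from.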

Failure modes they identified in design review:

MCP over SSE vs plain HTTP (handshake, id matching)
Distinguishing OpenAPI operation removal from info.version bumps
Zero-traffic endpoints never triggering in-app monitors
Agents can't act on raw diff output; they need actionable remediation text
No single portfolio view across Stripe + GitHub + N MCP servers
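
The OpenAPI item in the list above is easy to underestimate, so here is a rough sketch of the distinction. It assumes two parsed OpenAPI documents as plain dicts and compares only the set of (method, path) operations and `info.version`. The function names and severity labels are illustrative assumptions, not DriftGuard's classification.

```python
from typing import Any

# HTTP method keys that may appear on an OpenAPI Path Item Object.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def operations(spec: dict[str, Any]) -> set[tuple[str, str]]:
    """Collect (METHOD, path) pairs, skipping non-operation keys like 'parameters'."""
    ops: set[tuple[str, str]] = set()
    for path, item in spec.get("paths", {}).items():
        for key in item:
            if key.lower() in HTTP_METHODS:
                ops.add((key.upper(), path))
    return ops


def classify(old: dict[str, Any], new: dict[str, Any]) -> dict[str, Any]:
    """Separate removed operations from a plain info.version bump."""
    removed = sorted(operations(old) - operations(new))
    version_changed = old.get("info", {}).get("version") != new.get("info", {}).get("version")

    # Assumed labels: removals are breaking, a version-only change is info.
    # Added operations are ignored in this sketch.
    if removed:
        severity = "breaking"
    elif version_changed:
        severity = "info"
    else:
        severity = "none"
    return {"severity": severity, "removed_operations": removed, "version_changed": version_changed}
```

A version bump with no removed operations comes back as `info`, while a removed `GET /orders` is `breaking` even if the vendor never touched `info.version`.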

They were rebuilding a subset of what DriftGuard already ships as a watchtower.

What they embedded instead

1. Agent-readable contract (`/agents.md`, `/llms.txt`)

Cursor rule (paraphrased): Before adding an MCP server or vendor OpenAPI URL, call suggest_watches; before merge, ensure assert_coverage passes.

Decision automated: "Did we forget to watch a new dependency?"
Sophisticated alternative avoided: Custom linter parsing mcp.json in CI with team-specific rules.

2. MCP tools (OSS client + API key)

Tool	Replaces
`suggest_watches`	Manual spreadsheet of URLs
`assert_coverage`	Planned "repo scanner + policy" ticket
`explain_drift`	Senior engineer writing ticket descriptions from raw JSON
`list_drift_events`	Ad-hoc "what changed this week?" queries

Example interaction (real pattern, not scripted):

Engineer: "CI failed on drift coverage — what's missing?"
Agent: Calls assert_coverage with repo mcp.json → returns missing: [{ url, watchType: "mcp" }] → proposes register_watch or asks to exclude with justification.

Decision automated: Block merge vs allow; no meeting about monitoring scope.
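
The check behind that interaction can be approximated locally. The sketch below is not the hosted `assert_coverage` implementation. It assumes the `mcpServers` layout that Cursor uses in `mcp.json`, a caller-supplied set of already-watched URLs, and a hypothetical `ok` field alongside the `missing` list quoted above.

```python
import json
from typing import Any


def mcp_urls(mcp_json_text: str) -> list[str]:
    """Extract remote server URLs from an mcp.json document.

    Servers defined only by a local command (no "url" key) are skipped.
    """
    config = json.loads(mcp_json_text)
    servers = config.get("mcpServers", {})
    return sorted({server["url"] for server in servers.values() if "url" in server})


def assert_coverage_local(mcp_json_text: str, watched_urls: set[str]) -> dict[str, Any]:
    """Return every MCP URL in the config that has no registered watch."""
    missing = [
        {"url": url, "watchType": "mcp"}
        for url in mcp_urls(mcp_json_text)
        if url not in watched_urls
    ]
    # "ok" is a hypothetical convenience flag, not a documented response field.
    return {"ok": not missing, "missing": missing}
```

An agent receiving a non-empty `missing` list has everything it needs to propose a watch for each URL or to ask for a justified exclusion.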

3. CI: `drift-coverage` Action

Scans committed files (including mcp.json), calls hosted /api/coverage/assert.

Decision automated: New dependency in repo ⇒ must have watch (or CI fails).
Sophisticated alternative avoided: Org-wide service catalog + manual linking.
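
For teams that want to see the shape of such a gate before adopting the Action, here is a hedged sketch using only the Python standard library. Only the `/api/coverage/assert` path and the `missing` list come from this post. The base URL, the bearer-token header, and the `{"urls": [...]}` request body are assumptions made for illustration, so check the hosted API's documentation for the real contract.

```python
import json
import urllib.request


def build_assert_request(base_url: str, api_key: str, urls: list[str]) -> urllib.request.Request:
    """Build a POST to the coverage endpoint (body and auth scheme are assumed)."""
    body = json.dumps({"urls": urls}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/coverage/assert",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def exit_code(response: dict) -> int:
    """Map a coverage response to a CI exit code: 1 if anything is unwatched."""
    missing = response.get("missing", [])
    for entry in missing:
        print(f"unwatched dependency: {entry['url']} ({entry.get('watchType', 'unknown')})")
    return 1 if missing else 0
```

In a CI step, the request would be sent with `urllib.request.urlopen`, the JSON body decoded, and the result of `exit_code` passed to `sys.exit` so that a non-empty `missing` list fails the job.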

4. Optional: VS Code status bar extension

Polls /api/portfolio/overview → shows health score + breaking count.

Decision informed: "Do we deploy today?" without opening five dashboards.
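
As a rough illustration of how little logic the status bar needs, the sketch below turns one overview response into a single line of text. The field names `healthScore` and `breakingCount` are hypothetical, since the post only says the endpoint exposes a health score and a breaking count.

```python
def status_text(overview: dict) -> str:
    """Format a one-line status from an overview payload (field names assumed)."""
    score = overview.get("healthScore")
    breaking = overview.get("breakingCount", 0)

    # Show "n/a" when the score is absent rather than guessing a default.
    score_part = "n/a" if score is None else str(score)
    # Any breaking event flips the marker so it is visible at a glance.
    marker = "!" if breaking else "ok"
    return f"DriftGuard {marker} | health {score_part} | breaking {breaking}"
```

The same function could back a deploy checklist or a chat command, since it depends only on the parsed response.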

Scenario walkthrough: one PR, end to end

Context: Developer adds a Notion MCP URL to .cursor/mcp.json for a documentation agent.

Step	System behavior	Decision
PR opened	CI runs coverage assert	Fail: URL not in watch list
Developer / agent	`suggest_watches` + create watch via API	Watch registered; CI green
Merge	—	Dependency under external monitoring
Later: Notion changes tool schema	DriftGuard breaking event	Slack + `agentAction` in ticket
Agent reads `explain_drift`	Suggested code/prompt changes	PR to fix integration

Without embedding: same PR merges; drift discovered in prod or never.

Search intents this setup is meant to catch

Query (Google / ChatGPT)	What the embedded flow gives you
MCP tool removed how to detect	MCP watch + breaking classification
monitor third party OpenAPI not mine	`spec_format: openapi` on vendor URL
schema drift webhook alert	Hosted checks + Slack/webhook
prevent agent using stale MCP tools	Coverage assert + drift on `tools/list`
Stripe API changed field webhook	OpenAPI watch on published spec URL
alternative to monitoring vendor APIs cron	Portfolio + suggest + ignore paths

Tradeoffs (honest)

Choose embedded DriftGuard	Keep building in-house
MCP/OpenAPI semantics maintained upstream	You own SSE, diff rules, retention
Portfolio UI + API day one	You build dashboards
Per-watch pricing	Infra + on-call toil
Agent tools with stable severity model	Agents invent severity from raw JSON

Still DIY: monitoring your own service's SLOs (Datadog, etc.). Still OSS/local: diffing your own spec in CI without hosted watches.

Outcome (customer-reported)

Internal "integration monitoring" epic closed as won't build
Mean time to understand vendor/MCP change: hours → minutes
New MCP URLs: caught at PR, not post-deploy

If you're evaluating

Reproduce the original postmortem scenario on a trial: register two MCP or vendor URLs, run a check, then wait for a drift event or simulate one with a test fixture.
Add assert_coverage to one repo with mcp.json.
Point your agent at /agents.md and see if it stops proposing cron+S3 designs.
