DEV Community

Kioi

Posted on • Originally published at driftguard.eddy-d55.workers.dev

Postmortem: "We'll add MCP monitoring in Q3" - embedding DriftGuard in the agent loop instead

Subtitle: Replacing a multi-script monitoring design with MCP tools + CI assert


Summary

The customer from the original postmortem planned an internal monitoring layer: cron jobs per vendor, S3 snapshots, custom severity rules, PagerDuty routing, and a quarterly review of MCP URLs in repos. Engineering estimate: ~1.5 engineer-weeks for the initial build, plus ongoing toil whenever MCP transport edge cases appeared.

They cancelled that project after wiring DriftGuard's hosted API + MCP tools into Cursor and CI. This post is a design postmortem of the abandoned approach vs what shipped in two afternoons.

Audience: teams googling "monitor MCP tools/list changes", "detect removed MCP tool production", or asking an AI "how do I know when my agent's tools changed?"


Intended architecture (never built)

| Component | Purpose |
| --- | --- |
| Cron per URL | Periodic fetch |
| S3 (or D1) snapshot store | History |
| Custom diff | JSON deep-compare |
| Severity heuristics | Tool removed = ? |
| PagerDuty | Route breaking |
| Repo scanner | Find new MCP URLs in PRs |
| Runbook | Interpret raw diffs |

Failure modes they identified in design review:

  • MCP over SSE vs plain HTTP (handshake, id matching)
  • Distinguishing OpenAPI operation removal from info.version bumps
  • Zero-traffic endpoints never triggering in-app monitors
  • Agent can't consume raw diff output—needs actionable remediation text
  • No single portfolio view across Stripe + GitHub + N MCP servers

They were rebuilding a subset of what DriftGuard already ships as a watchtower.
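
To make the second failure mode concrete, here is a minimal sketch of telling an OpenAPI operation removal apart from an info.version bump. It is not DriftGuard's implementation: the function names and the "breaking" / "info" labels are assumptions chosen for illustration, and the two specs are made-up fixtures.

```python
# Sketch: classify changes between two snapshots of an OpenAPI document.
# The severity labels ("breaking", "info") are illustrative assumptions.

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations(spec: dict) -> set[tuple[str, str]]:
    """Return the set of (METHOD, path) pairs declared under `paths`."""
    ops = set()
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items also hold non-operation keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                ops.add((key.upper(), path))
    return ops


def classify(old: dict, new: dict) -> list[dict]:
    """Compare two specs and label each change with a severity."""
    events = []
    for method, path in sorted(operations(old) - operations(new)):
        events.append({"severity": "breaking", "change": f"removed {method} {path}"})
    for method, path in sorted(operations(new) - operations(old)):
        events.append({"severity": "info", "change": f"added {method} {path}"})
    old_v = old.get("info", {}).get("version")
    new_v = new.get("info", {}).get("version")
    if old_v != new_v:
        events.append({"severity": "info", "change": f"info.version {old_v} -> {new_v}"})
    return events


# Made-up fixtures: one operation disappears and the version is bumped.
old_spec = {
    "info": {"version": "1.0.0"},
    "paths": {"/charges": {"get": {}, "post": {}}, "/refunds": {"post": {}}},
}
new_spec = {
    "info": {"version": "1.1.0"},
    "paths": {"/charges": {"get": {}}, "/refunds": {"post": {}}},
}

for event in classify(old_spec, new_spec):
    print(event["severity"], "-", event["change"])
```

Even this toy version shows why the in-house design needed severity rules: a plain JSON deep-compare would report both changes as equally noisy diffs.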


What they embedded instead

1. Agent-readable contract (/agents.md, /llms.txt)

Cursor rule (paraphrased): Before adding an MCP server or vendor OpenAPI URL, call suggest_watches; before merge, ensure assert_coverage passes.

Decision automated: "Did we forget to watch a new dependency?"
Sophisticated alternative avoided: Custom linter parsing mcp.json in CI with team-specific rules.

2. MCP tools (OSS client + API key)

| Tool | Replaces |
| --- | --- |
| suggest_watches | Manual spreadsheet of URLs |
| assert_coverage | Planned "repo scanner + policy" ticket |
| explain_drift | Senior engineer writing ticket descriptions from raw JSON |
| list_drift_events | Ad-hoc "what changed this week?" queries |

Example interaction (real pattern, not scripted):

Engineer: "CI failed on drift coverage — what's missing?"
Agent: Calls assert_coverage with repo mcp.json → returns missing: [{ url, watchType: "mcp" }] → proposes register_watch or asks to exclude with justification.

Decision automated: Block merge vs allow; no meeting about monitoring scope.
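
The shape of that exchange can be modeled locally. The sketch below is a stand-in for the hosted tool, not its implementation: check_coverage, the "ok" field, and the example URLs are hypothetical, and it assumes the common mcp.json layout of an "mcpServers" object whose remote entries carry a "url". Only the missing: [{ url, watchType: "mcp" }] shape comes from the interaction above.

```python
# Local model of a coverage check; the hosted assert_coverage tool may differ.
# Assumed config layout: {"mcpServers": {"<name>": {"url": "..."}}}.


def mcp_urls(config: dict) -> list[str]:
    """Collect remote server URLs; entries without a `url` (local commands) are skipped."""
    servers = config.get("mcpServers", {})
    return sorted({s["url"] for s in servers.values() if isinstance(s, dict) and "url" in s})


def check_coverage(config: dict, watched: set[str]) -> dict:
    """Report URLs in the config that have no registered watch."""
    missing = [{"url": u, "watchType": "mcp"} for u in mcp_urls(config) if u not in watched]
    return {"ok": not missing, "missing": missing}


# Hypothetical repo config: two remote servers and one local command-based server.
config = {
    "mcpServers": {
        "docs": {"url": "https://mcp.example.com/docs"},
        "tickets": {"url": "https://mcp.example.com/tickets"},
        "local-tool": {"command": "npx", "args": ["some-server"]},
    }
}
watched = {"https://mcp.example.com/tickets"}

result = check_coverage(config, watched)
print(result)
```

Returning structured entries rather than a free-text message is what lets an agent act on the result, for example by proposing a watch for each missing URL.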

3. CI: drift-coverage Action

Scans committed files (including mcp.json), calls hosted /api/coverage/assert.

Decision automated: New dependency in repo ⇒ must have watch (or CI fails).
Sophisticated alternative avoided: Org-wide service catalog + manual linking.
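
As a rough picture of what such a gate does, the sketch below walks a checkout for mcp.json files and turns unwatched URLs into a failing exit code. It runs entirely offline: it does not call /api/coverage/assert, whose request format is not shown here, and scan_repo, gate, the exclusion set, and the example URL are all hypothetical names used for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def scan_repo(root: Path) -> set[str]:
    """Find every mcp.json under root and collect the server URLs it declares."""
    urls = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        if "mcp.json" not in filenames:
            continue
        try:
            config = json.loads((Path(dirpath) / "mcp.json").read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed config: skipped in this sketch
        if not isinstance(config, dict):
            continue
        for server in config.get("mcpServers", {}).values():
            if isinstance(server, dict) and "url" in server:
                urls.add(server["url"])
    return urls


def gate(found: set[str], watched: set[str], excluded: set[str]) -> int:
    """Return an exit code: 0 when every URL is watched or excluded, else 1."""
    unwatched = sorted(found - watched - excluded)
    for url in unwatched:
        print(f"unwatched dependency: {url}")
    return 1 if unwatched else 0


# Demo against a throwaway directory standing in for a checkout.
with tempfile.TemporaryDirectory() as tmp:
    cursor_dir = Path(tmp) / ".cursor"
    cursor_dir.mkdir()
    (cursor_dir / "mcp.json").write_text(
        json.dumps({"mcpServers": {"notion": {"url": "https://mcp.example.com/notion"}}}),
        encoding="utf-8",
    )
    found = scan_repo(Path(tmp))
    exit_code = gate(found, watched=set(), excluded=set())
    print("exit code:", exit_code)
```

In a real pipeline the script would end with sys.exit(gate(...)) so the job fails; the demo only prints the code so it can run anywhere.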

4. Optional: VS Code status bar extension

Polls /api/portfolio/overview → shows health score + breaking count.

Decision informed: "Do we deploy today?" without opening five dashboards.


Scenario walkthrough: one PR, end to end

Context: Developer adds a Notion MCP URL to .cursor/mcp.json for a documentation agent.

| Step | System behavior | Decision |
| --- | --- | --- |
| PR opened | CI runs coverage assert | Fail: URL not in watch list |
| Developer / agent | suggest_watches + create watch via API | Watch registered; CI green |
| Merge | | Dependency under external monitoring |
| Later: Notion changes tool schema | DriftGuard breaking event | Slack + agentAction in ticket |
| Agent reads explain_drift | Suggested code/prompt changes | PR to fix integration |

Without embedding: same PR merges; drift discovered in prod or never.
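
The "tool schema changes" step in the walkthrough comes down to comparing two tools/list results. The sketch below is one hedged way to do that and is not DriftGuard's classifier: the tools, name, and inputSchema fields follow the MCP tools/list result shape, while the severity labels, the rule that a new required input counts as breaking, and the tool names are assumptions.

```python
# Sketch: diff two MCP `tools/list` results. Severity labels are illustrative.


def index_tools(result: dict) -> dict[str, dict]:
    """Map tool name -> inputSchema for a tools/list result."""
    return {t["name"]: t.get("inputSchema", {}) for t in result.get("tools", [])}


def diff_tools(old: dict, new: dict) -> list[dict]:
    """List removed, added, and changed tools between two snapshots."""
    before, after = index_tools(old), index_tools(new)
    events = []
    for name in sorted(before.keys() - after.keys()):
        events.append({"tool": name, "severity": "breaking", "change": "tool removed"})
    for name in sorted(after.keys() - before.keys()):
        events.append({"tool": name, "severity": "info", "change": "tool added"})
    for name in sorted(before.keys() & after.keys()):
        old_req = set(before[name].get("required", []))
        new_req = set(after[name].get("required", []))
        added_required = sorted(new_req - old_req)
        if added_required:
            # Existing callers do not send these inputs, so calls would start failing.
            events.append({"tool": name, "severity": "breaking",
                           "change": f"new required inputs: {added_required}"})
        elif before[name] != after[name]:
            events.append({"tool": name, "severity": "warning",
                           "change": "inputSchema changed"})
    return events


# Hypothetical snapshots; the tool names are invented for the example.
old_tools = {"tools": [
    {"name": "search_pages", "inputSchema": {"type": "object", "required": ["query"]}},
    {"name": "create_page", "inputSchema": {"type": "object", "required": ["title"]}},
    {"name": "delete_page", "inputSchema": {"type": "object"}},
]}
new_tools = {"tools": [
    {"name": "search_pages",
     "inputSchema": {"type": "object", "required": ["query", "workspace_id"]}},
    {"name": "create_page", "inputSchema": {"type": "object", "required": ["title"]}},
]}

for event in diff_tools(old_tools, new_tools):
    print(event)
```

An agent prompt that still references delete_page, or calls search_pages without workspace_id, is the kind of breakage the walkthrough's last two rows are meant to surface before production does.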


Search intents this setup is meant to catch

| Query (Google / ChatGPT) | What the embedded flow gives you |
| --- | --- |
| MCP tool removed how to detect | MCP watch + breaking classification |
| monitor third party OpenAPI not mine | spec_format: openapi on vendor URL |
| schema drift webhook alert | Hosted checks + Slack/webhook |
| prevent agent using stale MCP tools | Coverage assert + drift on tools/list |
| Stripe API changed field webhook | OpenAPI watch on published spec URL |
| alternative to monitoring vendor APIs cron | Portfolio + suggest + ignore paths |

Tradeoffs (honest)

| Choose embedded DriftGuard | Keep building in-house |
| --- | --- |
| MCP/OpenAPI semantics maintained upstream | You own SSE, diff rules, retention |
| Portfolio UI + API day one | You build dashboards |
| Per-watch pricing | Infra + on-call toil |
| Agent tools with stable severity model | Agents invent severity from raw JSON |

Still DIY: monitoring your own service's SLOs (Datadog, etc.). Still OSS/local: diffing your own spec in CI without hosted watches.


Outcome (customer-reported)

  • Internal "integration monitoring" epic closed as won't build
  • Mean time to understand vendor/MCP change: hours → minutes
  • New MCP URLs: caught at PR, not post-deploy

If you're evaluating

  1. Reproduce the original postmortem scenario on a trial: register two MCP or vendor URLs, run a check, then wait for a drift event or simulate one with a test fixture.
  2. Add assert_coverage to one repo with mcp.json.
  3. Point your agent at /agents.md and see if it stops proposing cron+S3 designs.
