yongrean

Posted on Jun 8

MCP CI gates need retry receipts for flaky downstreams

#mcp #devops #testing #ai

MCP CI gates need to distinguish two very different failures:

the server is actually broken
the downstream dependency is temporarily flaky

If both become hard failures, CI gets noisy.
If both are ignored, the gate stops meaning anything.

So I shipped @k08200/mcp-probe@1.12.0 with an explicit sidecar retry policy for tool-call dry runs.

The problem

A readiness gate that calls real MCP tools can hit transient downstream failures:

503 Service Unavailable
502 Bad Gateway
504 Gateway Timeout
rate limits
short network timeouts

But auth and permission failures are different. A 401 or 403 usually means the agent will fail in production too.

Those should stay visible unless the contract explicitly says otherwise.
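To make the split concrete, here is a minimal sketch of how a gate might sort an error message into "transient" versus "auth". This is not mcp-probe's actual logic: the `classifyFailure` function, the `FailureKind` type, and the assumption that the HTTP status appears as a standalone number in the error text are all hypothetical.

```typescript
// Hypothetical helper, not part of mcp-probe's API.
type FailureKind = "transient" | "auth" | "unknown";

function classifyFailure(error: string): FailureKind {
  // Assumption: the HTTP status, if present, shows up as a standalone
  // three-digit number somewhere in the error text.
  const match = error.match(/\b([1-5]\d{2})\b/);
  const status = match ? Number(match[1]) : undefined;

  switch (status) {
    case 401:
    case 403:
      // Auth and permission failures: likely to fail in production too.
      return "auth";
    case 429:
    case 502:
    case 503:
    case 504:
      // Rate limits and gateway errors: candidates for retry.
      return "transient";
  }

  // No recognized status: fall back to keywords for timeouts and rate limits.
  if (/timeout|timed out|rate limit/i.test(error)) return "transient";
  return "unknown";
}
```

Anything that lands in "unknown" is probably best treated as a hard failure, so that new error shapes stay visible instead of being silently retried.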

Retry is opt-in per tool

mcp-probe now lets a sidecar contract define retry behavior per tool:

{
  "tools": {
    "logs_query": {
      "input": {
        "query": "service:web status:error",
        "timeframe": "1h"
      },
      "retry": {
        "attempts": 3,
        "delayMs": 1000,
        "retryOn": [429, 500, 502, 503, 504, "timeout", "rate limit"]
      },
      "expect": {
        "status": "pass"
      }
    }
  }
}

The important part: retry is not global magic.

It only happens when the sidecar explicitly opts in.
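A rough sketch of what that opt-in decision could look like, using the `retry` shape from the contract above. The `shouldRetry` and `matchesRetryOn` helpers are hypothetical, and two details are my assumptions rather than documented mcp-probe behavior: that `attempts` counts total tries including the first, and that numeric `retryOn` entries match a status code in the error text while string entries match as case-insensitive substrings.

```typescript
// Field names follow the sidecar "retry" block shown above.
interface RetryPolicy {
  attempts: number;
  delayMs: number;
  retryOn: Array<number | string>;
}

// Assumed matching rule: numbers match a standalone status code,
// strings match anywhere in the error text, ignoring case.
function matchesRetryOn(error: string, retryOn: Array<number | string>): boolean {
  const lowered = error.toLowerCase();
  return retryOn.some((entry) =>
    typeof entry === "number"
      ? new RegExp(`\\b${entry}\\b`).test(error)
      : lowered.indexOf(entry.toLowerCase()) !== -1
  );
}

// Hypothetical decision helper: retry only if the tool opted in,
// the attempt budget is not spent, and the error is on the list.
function shouldRetry(
  policy: RetryPolicy | undefined,
  attemptsSoFar: number,
  error: string
): boolean {
  if (!policy) return false; // no sidecar opt-in, no retry
  if (attemptsSoFar >= policy.attempts) return false; // budget spent
  return matchesRetryOn(error, policy.retryOn);
}
```

With the policy above, a 503 on the first attempt would be retried, while a 403 would not, because 403 is not in `retryOn`.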

Receipts still show the flake

If a call fails once and passes on retry, the final result can pass, but the receipt still records every attempt.

That means CI can tolerate a transient downstream blip without pretending the run was clean.

Example shape:

{
  "tool": "flaky_read",
  "status": "pass",
  "source": "sidecar",
  "attempts": [
    {
      "attempt": 1,
      "status": "fail",
      "error": "503 Service Unavailable: transient downstream"
    },
    {
      "attempt": 2,
      "status": "pass"
    }
  ]
}

That is the distinction I want MCP CI gates to preserve:

hard failures should block
transient failures can be retried
pass-after-retry should still leave a receipt
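Those three rules can be sketched as a small function over the receipt shape shown above. The `Receipt` and `Attempt` types mirror the example JSON; the `gateOutcome` function and its "clean" / "recovered" / "blocked" labels are hypothetical names for illustration, not something mcp-probe exposes.

```typescript
// Types mirror the example receipt; only fields shown there are assumed.
interface Attempt {
  attempt: number;
  status: "pass" | "fail";
  error?: string;
}

interface Receipt {
  tool: string;
  status: "pass" | "fail";
  source: string;
  attempts: Attempt[];
}

// Hypothetical labels for the three cases the gate should distinguish.
type GateOutcome = "clean" | "recovered" | "blocked";

function gateOutcome(receipt: Receipt): GateOutcome {
  // A failed final status blocks the gate regardless of retries.
  if (receipt.status === "fail") return "blocked";
  // A pass with at least one failed attempt is a recovered transient.
  const hadFailure = receipt.attempts.some((a) => a.status === "fail");
  return hadFailure ? "recovered" : "clean";
}

const exampleReceipt: Receipt = {
  tool: "flaky_read",
  status: "pass",
  source: "sidecar",
  attempts: [
    { attempt: 1, status: "fail", error: "503 Service Unavailable: transient downstream" },
    { attempt: 2, status: "pass" },
  ],
};

console.log(gateOutcome(exampleReceipt)); // prints: recovered
```

Keeping "recovered" separate from "clean" is what lets a summary or trend report count flakes without turning the pipeline red.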

Install

npm install -D @k08200/mcp-probe

Or run directly:

npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --receipt-file mcp-probe.receipt.json

GitHub release: https://github.com/k08200/mcp-probe/releases/tag/v1.12.0

npm: https://www.npmjs.com/package/@k08200/mcp-probe

Top comments (3)

xulingfeng • Jun 8

The distinction between "the server is broken" and "the downstream is flaky" is exactly what I wish more CI gates would make explicit. We've been dealing with a similar pattern in our test automation — a flaky API call that 503s once in a blue moon, and the whole pipeline turns red. Our fix was a retry wrapper too, but the receipt approach is cleaner: you still see the transient in the logs, it just doesn't block the gate.

Do you have plans to surface those receipts in a dashboard or PR comment summary? That'd be the missing piece for teams that need to track flake trends over time.

yongrean • Jun 14

Yep, I ended up adding that layer in this PR.

Retried tool calls now show up directly in the GitHub Actions summary as Retry Receipts, so pass-after-retry cases are visible without failing the gate. I also added a new mcp-probe trends command that aggregates receipt history by day and by server/tool, separates recovered transients from unresolved failures, and can emit markdown for PR comments or a standalone HTML dashboard via --dashboard-file.

So the immediate PR visibility piece and the longer-term trend/dashboard path are both covered now. The remaining follow-up would be wiring teams’ CI artifact retention into that trends input over time.

Raju Dandigam • Jun 30

This is a very useful framing: retries should not just be control flow, they should become evidence. In CI, especially with downstream APIs and agent tool calls, a final pass can hide a lot of instability unless the retry path is captured clearly. I’ve seen the same pattern in test automation where teams need to know not only that something eventually passed, but how many times it failed, why it retried, and whether the same dependency keeps showing up. This is close to what I’m exploring with agent-inspect: making retries, tool calls, and execution paths easier to inspect locally before they become production mystery logs.