DEV Community

yongrean

MCP CI gates need retry receipts for flaky downstreams

MCP CI gates need to distinguish two very different failures:

  1. the server is actually broken
  2. the downstream dependency is temporarily flaky

If both become hard failures, CI gets noisy.
If both are ignored, the gate stops meaning anything.

So I shipped @k08200/mcp-probe@1.12.0 with an explicit, per-tool retry policy in the sidecar contract for tool-call dry-runs.

The problem

A readiness gate that calls real MCP tools can hit transient downstream failures:

  • 503 Service Unavailable
  • 502 Bad Gateway
  • 504 Gateway Timeout
  • rate limits
  • short network timeouts

But auth and permission failures are different. A 401 or 403 usually means the agent will fail in production too.

Those should stay visible unless the contract explicitly says otherwise.
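To make the split concrete, here is a minimal sketch of how a gate might sort failures into the two buckets. The function name, the status set, and the message patterns are all hypothetical and chosen for illustration; this is not mcp-probe's internal logic.

```typescript
// Hypothetical classifier; names and rules are illustrative, not mcp-probe's API.
type FailureKind = "transient" | "hard";

// Status codes from the list above that usually clear up on their own.
const TRANSIENT_STATUS = new Set([429, 502, 503, 504]);

function classifyFailure(status: number | undefined, message: string): FailureKind {
  // Auth and permission failures are checked first so they are never retried away.
  if (status === 401 || status === 403) return "hard";
  if (status !== undefined && TRANSIENT_STATUS.has(status)) return "transient";
  // No recognized status code: fall back to the message text.
  if (/timeout|timed out|rate limit/i.test(message)) return "transient";
  // Anything unrecognized blocks the gate by default.
  return "hard";
}
```

Defaulting to "hard" is a deliberate choice in this sketch: an unknown failure stays visible until someone decides it is safe to retry.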

Retry is opt-in per tool

mcp-probe now lets a sidecar contract define retry behavior per tool:

{
  "tools": {
    "logs_query": {
      "input": {
        "query": "service:web status:error",
        "timeframe": "1h"
      },
      "retry": {
        "attempts": 3,
        "delayMs": 1000,
        "retryOn": [429, 500, 502, 503, 504, "timeout", "rate limit"]
      },
      "expect": {
        "status": "pass"
      }
    }
  }
}

The important part: retry is not global magic.

It only happens when the sidecar explicitly opts in.
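As a rough illustration of what an opt-in policy like this implies, the sketch below runs a call with an optional policy and records every attempt. The field names (attempts, delayMs, retryOn) come from the sidecar example above; everything else is an assumption, including the function names and the matching rule (numbers match a status code in the error message, strings match as case-insensitive substrings). It is not the package's actual implementation.

```typescript
// Sketch only: field names mirror the sidecar example, but the matching rules
// and function names are assumptions, not mcp-probe's implementation.
type RetryMatcher = number | string;

interface RetryPolicy {
  attempts: number;
  delayMs: number;
  retryOn: RetryMatcher[];
}

interface AttemptRecord {
  attempt: number;
  status: "pass" | "fail";
  error?: string;
}

// Assumed rule: numbers match a standalone status code in the message,
// strings match as case-insensitive substrings.
function shouldRetry(message: string, retryOn: RetryMatcher[]): boolean {
  const lower = message.toLowerCase();
  return retryOn.some((m) =>
    typeof m === "number"
      ? new RegExp(`\\b${m}\\b`).test(message)
      : lower.includes(m.toLowerCase()),
  );
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function callWithRetry(
  call: () => Promise<void>,
  policy?: RetryPolicy,
): Promise<{ status: "pass" | "fail"; attempts: AttemptRecord[] }> {
  // No policy means exactly one attempt: retry is opt-in.
  const maxAttempts = policy?.attempts ?? 1;
  const attempts: AttemptRecord[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await call();
      attempts.push({ attempt, status: "pass" });
      return { status: "pass", attempts };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      attempts.push({ attempt, status: "fail", error });
      // Stop on the last attempt or on any error the policy does not list.
      if (!policy || attempt === maxAttempts || !shouldRetry(error, policy.retryOn)) break;
      await sleep(policy.delayMs);
    }
  }
  return { status: "fail", attempts };
}
```

With no policy the call runs once, and an error that matches nothing in retryOn (a 401, say) ends the loop immediately, so hard failures still surface on the first attempt.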

Receipts still show the flake

If a call fails once and passes on retry, the final result can pass, but the receipt still records every attempt.

That means CI can tolerate a transient downstream blip without pretending the run was clean.

Example shape:

{
  "tool": "flaky_read",
  "status": "pass",
  "source": "sidecar",
  "attempts": [
    {
      "attempt": 1,
      "status": "fail",
      "error": "503 Service Unavailable: transient downstream"
    },
    {
      "attempt": 2,
      "status": "pass"
    }
  ]
}

That is the distinction I want MCP CI gates to preserve:

  • hard failures should block
  • transient failures can be retried
  • pass-after-retry should still leave a receipt
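One way a pipeline could act on these three rules is to read each tool entry and label it. The sketch below assumes a per-tool entry shaped like the example receipt above; the verdict labels and the function name are hypothetical, and the layout of the full receipt file is not assumed here.

```typescript
// Interfaces follow the example receipt; the verdict labels and function name
// are hypothetical.
interface Attempt {
  attempt: number;
  status: "pass" | "fail";
  error?: string;
}

interface ToolReceipt {
  tool: string;
  status: "pass" | "fail";
  attempts: Attempt[];
}

type Verdict = "clean" | "flaky-pass" | "blocked";

function verdictFor(receipt: ToolReceipt): Verdict {
  // A final failure blocks the gate no matter how many attempts were made.
  if (receipt.status !== "pass") return "blocked";
  // A pass that needed a retry is reported separately from a clean pass.
  const sawFailure = receipt.attempts.some((a) => a.status === "fail");
  return sawFailure ? "flaky-pass" : "clean";
}

const example: ToolReceipt = {
  tool: "flaky_read",
  status: "pass",
  attempts: [
    { attempt: 1, status: "fail", error: "503 Service Unavailable: transient downstream" },
    { attempt: 2, status: "pass" },
  ],
};

console.log(verdictFor(example)); // prints "flaky-pass"
```

Counting "flaky-pass" verdicts per tool across runs would give a simple flake trend without changing whether the gate passes.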

Install

npm install -D @k08200/mcp-probe

Or run directly:

npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --receipt-file mcp-probe.receipt.json

GitHub release: https://github.com/k08200/mcp-probe/releases/tag/v1.12.0

npm: https://www.npmjs.com/package/@k08200/mcp-probe

Top comments (1)

xulingfeng

The distinction between "the server is broken" and "the downstream is flaky" is exactly what I wish more CI gates would make explicit. We've been dealing with a similar pattern in our test automation — a flaky API call that 503s once in a blue moon, and the whole pipeline turns red. Our fix was a retry wrapper too, but the receipt approach is cleaner: you still see the transient in the logs, it just doesn't block the gate.

Do you have plans to surface those receipts in a dashboard or PR comment summary? That'd be the missing piece for teams that need to track flake trends over time.