DEV Community

yongrean

MCP CI gates need receipts: tools/list is not enough

MCP servers are starting to look like normal infrastructure.

That means they need boring infrastructure checks.

The mistake I kept seeing is this:

"The server starts, and tools/list returns a clean schema. Therefore it works."

That is not enough.

An MCP server can pass initialize, advertise every expected tool, and still fail every real call because auth, scopes, tenant boundaries, environment variables, downstream permissions, or read-only roles are broken.

So I pushed mcp-probe@1.8.0 further toward being a real CI readiness gate for MCP servers.

npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn

What changed

1. Warnings can now fail CI

By default, a run that only produces warnings still exits 0, so existing users do not get surprise CI failures.

But production gates often need stricter behavior:

mcp-probe --config mcp-probe.config.json --fail-on-warn

With --fail-on-warn, auth handoff issues, permission warnings, or incomplete readiness receipts can block the workflow.

That matters because many MCP failures are not hard crashes. They are degraded states:

  • OAuth flow requires a browser redirect the agent cannot complete
  • a server starts but every tool call returns 401
  • a database tool works with admin credentials but fails with the intended read-only role
  • the workflow mentions a probe but does not actually run the production boundary check
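A minimal sketch of the exit-code decision described above. The `Status` names and the use of exit code 1 for a blocked run are assumptions for illustration, not mcp-probe's actual internals:

```python
from enum import Enum


class Status(Enum):
    # Hypothetical per-check outcomes; the real tool may use other names.
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


def exit_code(statuses: list[Status], fail_on_warn: bool = False) -> int:
    """Return a process exit code for a set of per-check statuses."""
    if Status.FAIL in statuses:
        return 1  # hard failures always block the workflow
    if fail_on_warn and Status.WARN in statuses:
        return 1  # strict mode: degraded states also block
    return 0  # warnings alone pass unless strict mode is on


checks = [Status.PASS, Status.WARN]
print(exit_code(checks))                     # default behavior
print(exit_code(checks, fail_on_warn=True))  # strict gate
```

The point of keeping the default lenient is that strictness becomes an explicit decision in the workflow file, where reviewers can see it.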

2. Doctor now checks the actual workflow receipt

mcp-probe doctor already checked whether a GitHub Actions workflow existed.

But that is not enough either.

The new behavior is stricter: every required flag must appear together on a single mcp-probe run step.

This should pass:

- run: npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn

This should not count as a complete gate:

- run: npx @k08200/mcp-probe --config mcp-probe.config.json
- run: npx @k08200/mcp-probe ./server.js --github-summary --fail-on-warn

The flags are present somewhere in the workflow, but no single run step proves the intended config is actually being checked with CI summaries and strict warning handling.

That is the difference between "we have a gate" and "the gate is enforcing the thing we trust."
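The same-step rule can be sketched roughly like this. It assumes the workflow's `run:` commands have already been extracted as single-line strings, and `REQUIRED_FLAGS`, `is_complete_gate`, and `workflow_has_gate` are hypothetical names, not the doctor command's real implementation:

```python
import shlex

# Flags that must all appear on one mcp-probe invocation (assumed set).
REQUIRED_FLAGS = {"--config", "--github-summary", "--fail-on-warn"}


def is_complete_gate(run_command: str) -> bool:
    """True if one command invokes mcp-probe with every required flag."""
    tokens = shlex.split(run_command)
    invokes_probe = any("mcp-probe" in token for token in tokens)
    return invokes_probe and REQUIRED_FLAGS.issubset(tokens)


def workflow_has_gate(run_steps: list[str]) -> bool:
    """A workflow passes only if a single step is a complete gate."""
    return any(is_complete_gate(step) for step in run_steps)


split_steps = [
    "npx @k08200/mcp-probe --config mcp-probe.config.json",
    "npx @k08200/mcp-probe ./server.js --github-summary --fail-on-warn",
]
print(workflow_has_gate(split_steps))  # flags are spread across two steps
```

Checking each step on its own, instead of searching the whole file for the flags, is what keeps two half-configured steps from adding up to a false pass.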

3. Tool call coverage is now tied to expected tools

For config-based checks, you can declare the expected tool catalog:

{
  "servers": [
    {
      "name": "datadog",
      "target": "https://mcp.example.com/mcp",
      "transport": "http",
      "headers": {
        "Authorization": "Bearer ${DATADOG_MCP_TOKEN}"
      },
      "expectedTools": ["logs_query"],
      "forbiddenTools": ["delete_dashboard", "rotate_api_key"],
      "toolsFile": "./datadog.tools.json"
    }
  ]
}

If expectedTools and toolsFile are both set, every expected tool needs a sample input in the sidecar file.

That means CI checks not just "is the tool advertised?" but "did we actually provide a meaningful dry-run sample for the tool an agent depends on?"
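As a rough sketch, the coverage check boils down to three set comparisons. The `coverage_gaps` function and its result keys are hypothetical, and `advertised` stands in for the tool names returned by tools/list:

```python
def coverage_gaps(server: dict, sidecar: dict, advertised: list[str]) -> dict:
    """Compare the declared tool catalog with tools/list and the sidecar."""
    expected = server.get("expectedTools", [])
    forbidden = server.get("forbiddenTools", [])
    samples = sidecar.get("tools", {})
    return {
        # Expected tools the server never advertised.
        "missing_from_catalog": [t for t in expected if t not in advertised],
        # Expected tools with no dry-run sample input in the sidecar.
        "missing_sample_input": [
            t for t in expected if "input" not in samples.get(t, {})
        ],
        # Dangerous tools that should not be reachable at all.
        "forbidden_but_advertised": [t for t in forbidden if t in advertised],
    }


server = {
    "expectedTools": ["logs_query"],
    "forbiddenTools": ["delete_dashboard", "rotate_api_key"],
}
sidecar = {"tools": {}}  # no sample provided yet
gaps = coverage_gaps(server, sidecar, advertised=["logs_query", "rotate_api_key"])
print(gaps)
```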

4. Sidecar inputs are the real contract

Auto-generated inputs are useful for smoke tests, but they mostly exercise schema validation.

Real readiness checks need meaningful inputs:

{
  "tools": {
    "logs_query": {
      "input": {
        "query": "service:web status:error",
        "timeframe": "1h"
      },
      "expect": {
        "status": "pass",
        "not_error_code": [401, 403],
        "requiredFields": ["source", "freshness"],
        "maxRows": 100
      }
    }
  }
}

For database-backed MCP servers, these assertions are the interesting part:

  • does the read-only role work?
  • are row limits enforced?
  • are broad exports/admin actions absent or gated?
  • are denied writes structured enough for agents to recover?
  • do results include provenance fields like source and freshness?
  • does the response avoid leaking secrets, stack traces, or raw internals?
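To make the `expect` block concrete, here is a minimal sketch of evaluating it against one observed result. The shape of `result` (top-level fields plus optional "status", "errorCode", and "rows" keys) is an assumption for illustration, and `check_expectations` is a hypothetical helper:

```python
def check_expectations(result: dict, expect: dict) -> list[str]:
    """Compare an observed tool result with a sidecar `expect` block."""
    failures = []
    if "status" in expect and result.get("status") != expect["status"]:
        failures.append(f"status: got {result.get('status')!r}")
    if result.get("errorCode") in expect.get("not_error_code", []):
        failures.append(f"errorCode: {result['errorCode']} is not allowed")
    for field in expect.get("requiredFields", []):
        if field not in result:
            failures.append(f"requiredFields: missing {field!r}")
    max_rows = expect.get("maxRows")
    if max_rows is not None and len(result.get("rows", [])) > max_rows:
        failures.append(f"maxRows: got {len(result['rows'])}, limit {max_rows}")
    return failures


EXPECT = {
    "status": "pass",
    "not_error_code": [401, 403],
    "requiredFields": ["source", "freshness"],
    "maxRows": 100,
}
# A server that starts fine but rejects the intended credentials.
degraded = {"status": "fail", "errorCode": 401}
for failure in check_expectations(degraded, EXPECT):
    print(failure)
```

Returning every failure instead of stopping at the first one makes the CI summary more useful when several parts of the contract break at once.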

Install

npm install -D @k08200/mcp-probe

Or run directly:

npx @k08200/mcp-probe@latest doctor
npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn

GitHub: https://github.com/k08200/mcp-probe
npm: https://www.npmjs.com/package/@k08200/mcp-probe

The goal is simple: CI for MCP should test the contract an agent will actually depend on, not just whether the process starts.

Top comments (9)

xulingfeng

The JSON Schema response validation idea is exactly where the gap is — tools/list tells you shape, but says nothing about whether the response is actionable.

We hit this with Hermes: a weather MCP tool returned valid JSON every time, but the units field was Celsius in some responses and Kelvin in others depending on which downstream API it queried. Schema validation would've caught that in 1 run instead of 3 debug sessions.

Are you thinking of validating against the declared schema from tools/list, or a separate contract file? The former stays self-documenting, but the latter lets you define acceptable ranges rather than just types.

yongrean

I went with the separate contract file approach.

tools/list remains useful advertised metadata, but for CI I want the deployment/agent owner to define the operational contract independently.

The Hermes weather example is exactly why. units: string is not enough; the contract needs enum/range checks.

Shipped in @k08200/mcp-probe@1.10.1: expect.jsonSchema now supports:

  • type
  • required
  • properties
  • items
  • enum
  • additionalProperties
  • minimum
  • maximum
  • minLength
  • maxLength
  • pattern

So a weather contract can now say:


{
  "tools": {
    "get_weather": {
      "input": { "city": "Seoul" },
      "expect": {
        "jsonSchema": {
          "type": "object",
          "required": ["temperature", "units", "freshness"],
          "properties": {
            "temperature": { "type": "number", "minimum": -80, "maximum": 80 },
            "units": { "enum": ["celsius", "fahrenheit"] },
            "freshness": { "type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}T" }
          }
        }
      }
    }
  }
}
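For readers who want to see how a contract like this catches the Celsius/Kelvin drift, here is a minimal validator sketch covering a subset of the keywords listed above (type, required, properties, enum, minimum, maximum, pattern). It is an illustration of the idea, not mcp-probe's validator, and `validate` and `WEATHER_SCHEMA` are hypothetical names:

```python
import re

TYPES = {"object": dict, "array": list, "string": str,
         "number": (int, float), "boolean": bool}


def validate(value, schema: dict, path: str = "$") -> list[str]:
    """Return a list of contract violations for `value`."""
    expected = schema.get("type")
    if expected in TYPES:
        ok = isinstance(value, TYPES[expected])
        if expected == "number" and isinstance(value, bool):
            ok = False  # bool is an int subclass in Python
        if not ok:
            return [f"{path}: expected {expected}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in enum")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in schema and value < schema["minimum"]:
            errors.append(f"{path}: below minimum")
        if "maximum" in schema and value > schema["maximum"]:
            errors.append(f"{path}: above maximum")
    if isinstance(value, str) and "pattern" in schema:
        if not re.search(schema["pattern"], value):
            errors.append(f"{path}: pattern mismatch")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    return errors


WEATHER_SCHEMA = {
    "type": "object",
    "required": ["temperature", "units", "freshness"],
    "properties": {
        "temperature": {"type": "number", "minimum": -80, "maximum": 80},
        "units": {"enum": ["celsius", "fahrenheit"]},
        "freshness": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}T"},
    },
}
kelvin_response = {"temperature": 294.65, "units": "kelvin"}
for error in validate(kelvin_response, WEATHER_SCHEMA):
    print(error)
```

A plain `units: string` type check would accept the Kelvin response. The enum and range checks are what turn it into a failure on the first run.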
 
yongrean

Follow-up: I opened real-world recipe issues for the exact failure modes discussed here.

The goal is to collect safe, read-only sidecar samples from people running these MCP servers in real agent workflows. No secrets, only placeholders and reproducible CI checks.

Darya Belaya

The distinction between "advertised" and "actually happened" showed up for me while building coverage tracking for Playwright MCP sessions.
The obvious approach was to ask the agent which routes and endpoints it visited. The problem: in longer sessions, the self-report wasn't reliable. Steps could be omitted or described differently from what actually happened.
The solution was an independent observer. @playwright/mcp can connect to an existing browser via --cdp-endpoint, so a separate tracker process attached to the same browser and collected visited routes and API calls directly from network events.
Feels like the same principle you're describing: whether it's an MCP server claiming readiness through tools/list or an agent claiming what it did, self-report is useful, but the thing you trust is the receipt generated independently.

yongrean

This framing was useful enough that I turned it into the next release.

mcp-probe@1.9.0 now has --receipt-file, which writes a redacted JSON readiness receipt containing the observed handshake, tool catalog, dry-run calls, contract assertions, and final status.

I also made doctor check that CI is actually producing and uploading the receipt artifact, not just running the probe:


npx @k08200/mcp-probe@latest \
  --config mcp-probe.config.json \
  --github-summary \
  --fail-on-warn \
  --receipt-file mcp-probe.receipt.json
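As an illustration of what "redacted" can mean for a receipt like this, here is a small sketch that masks sensitive-looking keys before serializing. The key list, the `[REDACTED]` marker, and the receipt field names are all assumptions for the example, not the actual receipt format:

```python
import json

# Substrings that mark a key as sensitive (hypothetical list).
SENSITIVE_KEY_PARTS = ("authorization", "token", "secret", "password", "api_key")


def redact(value):
    """Return a copy of `value` with sensitive-looking keys masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if any(part in key.lower() for part in SENSITIVE_KEY_PARTS)
            else redact(inner)
            for key, inner in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Hypothetical receipt shape, for illustration only.
receipt = {
    "handshake": {"headers": {"Authorization": "Bearer abc123"}},
    "calls": [{"tool": "logs_query", "status": "pass"}],
    "status": "pass",
}
print(json.dumps(redact(receipt), indent=2))
```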
 
Šimon Cmar

The "same run step" check is smart. I've definitely seen workflows where --fail-on-warn and the real config live on different steps and everyone just assumes the gate works.

Curious how you handle flaky downstreams though - if a tool throws a transient 401/503 in CI, does that hard-fail the gate or is there some retry tolerance?

yongrean

Good question. I changed this so retry tolerance is explicit, not hidden.

Flaky downstream retries are now sidecar-level policy. A tool only retries when the contract says it should:


{
  "tools": {
    "logs_query": {
      "input": { "query": "service:web status:error", "timeframe": "1h" },
      "retry": {
        "attempts": 3,
        "delayMs": 1000,
        "retryOn": [429, 500, 502, 503, 504, "timeout", "rate limit"]
      },
      "expect": { "status": "pass" }
    }
  }
}
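A rough sketch of how a policy like this could be applied. The `outcome` shape (`ok`, `code`, `message`) and the helper names are hypothetical, and this is not the tool's actual retry loop:

```python
import time


def should_retry(outcome: dict, retry_on: list) -> bool:
    """Match an outcome against numeric codes or message substrings."""
    code = outcome.get("code")
    message = outcome.get("message", "").lower()
    for rule in retry_on:
        if isinstance(rule, int) and rule == code:
            return True
        if isinstance(rule, str) and rule.lower() in message:
            return True
    return False


def call_with_retry(call, policy=None, sleep=time.sleep) -> dict:
    """Run `call` once, retrying only when the policy explicitly allows it."""
    policy = policy or {}
    attempts = max(1, policy.get("attempts", 1))
    delay_s = policy.get("delayMs", 0) / 1000
    retry_on = policy.get("retryOn", [])
    for attempt in range(1, attempts + 1):
        outcome = call()
        last_try = attempt == attempts
        if outcome.get("ok") or last_try or not should_retry(outcome, retry_on):
            return outcome
        sleep(delay_s)


POLICY = {
    "attempts": 3,
    "delayMs": 1000,
    "retryOn": [429, 500, 502, 503, 504, "timeout", "rate limit"],
}
denied = call_with_retry(
    lambda: {"ok": False, "code": 401, "message": "unauthorized"}, POLICY
)
print(denied)  # 401 is not in retryOn, so this returns after one attempt
```

Under this policy a 401 fails immediately, because it is not listed in retryOn, while a 503 gets up to three attempts. Without a retry block the call runs exactly once.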
 
xulingfeng

This hits a real pain point — tools/list gives you schema, but tells you nothing about runtime behavior. We run something similar on our MCP servers: a lightweight CI step that spins up the server, calls each tool with dummy params, and asserts the response shape matches. Caught two silent failures already where the schema said 'string' but the implementation crashed on non-empty input. Do you enforce response schema contracts at the MCP transport layer or just at the tool level?

yongrean

Right now mcp-probe enforces this at the tool-contract level, not as a generic MCP transport-layer response schema validator.

The MCP transport layer tells me whether initialize/tools/list/tools/call completed and whether the server stayed protocol-compatible. But the useful failure mode is usually one layer above that: the tool call returns something that is technically valid MCP, but unusable for the agent.

So the sidecar contract currently sits on observed tool results:

  • expect.status
  • expect.requiredFields
  • expect.maxRows
  • expect.errorCode
  • expect.contains
  • expect.notContains

That catches cases like “the call succeeded but omitted rowCount/source/freshness”, “the result exceeded the configured limit”, or “a denied write did not return the stable error code agents expect.”
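For the denied-write case, a sketch of how `expect.errorCode`, `expect.contains`, and `expect.notContains` could be applied to one observed result. It assumes `contains` and `notContains` are lists of substrings and that the result is available as text plus an error code, and the `READ_ONLY` code is a made-up example value:

```python
def check_text_contract(result_text: str, error_code, expect: dict) -> list[str]:
    """Check a tool result's error code and text against an expect block."""
    failures = []
    if "errorCode" in expect and error_code != expect["errorCode"]:
        failures.append(f"errorCode: got {error_code!r}")
    for needle in expect.get("contains", []):
        if needle not in result_text:
            failures.append(f"contains: missing {needle!r}")
    for needle in expect.get("notContains", []):
        if needle in result_text:
            failures.append(f"notContains: found {needle!r}")
    return failures


# Hypothetical contract for a write attempted with a read-only role.
EXPECT_DENIED = {
    "errorCode": "READ_ONLY",
    "contains": ["read-only"],
    "notContains": ["Traceback", "password"],
}
leaky = "Traceback (most recent call last): permission denied for table users"
for failure in check_text_contract(leaky, "INTERNAL", EXPECT_DENIED):
    print(failure)
```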

I haven’t added full JSON Schema response validation yet. That’s probably the next useful contract primitive: something like expect.jsonSchema or expect.outputSchema, applied to the observed tool result. I’d still keep it tool-level rather than transport-level, because a response can be MCP-valid and still violate the application contract the agent depends on.