My OpenClaw MCP Server Said 'OK' But Returned Nothing. I Built a 40-Line Health Check That Saved My Mornings.

#openclaw #ai #agents #productivity

Three mornings in a row I woke up to a quiet Slack channel and an empty inbox. No errors. No alerts. Just... silence. The cron had fired. The agent had responded. The MCP server had logged 200 OK.

Everything looked healthy.

Nothing had actually run.

If you run an OpenClaw agent with MCP servers in production — and you're trusting the 200 OK to mean "your work got done" — this post is the one I wish I'd read two weeks ago.

The lie of the MCP "200 OK"

Here's the failure mode that bit me. My morning cron looks roughly like this:

{
  "name": "morning-research-digest",
  "schedule": { "kind": "cron", "expr": "0 7 * * 1-5", "tz": "America/New_York" },
  "payload": {
    "kind": "agentTurn",
    "message": "Run the morning research digest: query the research MCP for the top 5 stories, post a summary to the team channel, and update the dashboard."
  }
}

The flow is: agent calls MCP → MCP returns JSON → agent reads JSON → agent posts to Slack.

Sounds clean. Here's what was actually happening on the broken days:

1. The MCP server (research-mcp, running on a separate VM) accepted the request.
2. Its database query timed out at the 30s mark.
3. The server's error handler caught the exception, logged it as a warning, and returned {"status": "ok", "data": []} to the agent.
4. The agent received data: [] (an empty list) and produced a Slack message: "No new research today."
5. The cron logged: ✅ morning-research-digest completed in 31.2s.

The dashboard said green. The team got a "no news today" message. The actual research never ran.

This is the worst kind of bug. No alert, no error, just wrong work. And in an agent pipeline where the next step trusts the previous step's output, a silent empty result is indistinguishable from a real empty result.
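To make that ambiguity concrete, here is a toy reproduction. It is not the real server or agent code: fetch_stories, summarize, quiet_day, and timed_out are hypothetical names invented for this sketch. The only point is that a swallowed timeout and a genuinely quiet day produce byte-for-byte identical payloads.

```python
from typing import Callable


def fetch_stories(run_query: Callable[[], list]) -> dict:
    """Hypothetical server handler that swallows every error."""
    try:
        return {"status": "ok", "data": run_query()}
    except Exception:
        # The failure may be logged somewhere, but the caller never sees it.
        return {"status": "ok", "data": []}


def summarize(resp: dict) -> str:
    """Hypothetical agent step that trusts the status field."""
    if resp["status"] == "ok" and not resp["data"]:
        return "No new research today."
    return f"{len(resp['data'])} new stories."


def quiet_day() -> list:
    return []  # the query ran and genuinely found nothing


def timed_out() -> list:
    raise TimeoutError("query exceeded 30s")


# Both paths yield the same payload, so the agent cannot tell them apart.
assert fetch_stories(quiet_day) == fetch_stories(timed_out)
print(summarize(fetch_stories(timed_out)))  # prints "No new research today."
```

Once the payloads are equal, nothing downstream can recover the difference. The information has to be preserved at the server or checked independently by the consumer.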

The fix: stop trusting the status field

Nothing in the MCP spec stops a tool from returning a payload like {"status": "ok", "data": []} as a successful result. The spec gives tool results an isError flag, but the shape of the payload itself is up to the server, so there's no required field for "how many items did you actually find vs. how many did you skip because of an error."

So I stopped trusting it. I wrote a 40-line health check (scripts/mcp-healthcheck.py) that runs before any cron that depends on MCP output. It does three things:

1. Pings the MCP server with a known sentinel query.
2. Asserts the response shape matches the contract.
3. Cross-checks the result count against a floor (e.g. "I expect at least 3 research items on a weekday morning; if I get 0, something is wrong").

Here's the core of it:

def healthcheck(server: str, query: str, min_results: int = 1, timeout_s: int = 15) -> None:
    """Raise on any anomaly. Cron should fail loudly, not silently."""
    try:
        resp = mcp_call(server, query, timeout=timeout_s)
    except (TimeoutError, ConnectionError) as e:
        # Chain the original exception so the traceback keeps the root cause.
        raise HealthcheckFail(f"{server}: network/timeout: {e}") from e

    if not isinstance(resp, dict):
        raise HealthcheckFail(f"{server}: response is {type(resp).__name__}, not dict")

    if resp.get("status") != "ok":
        raise HealthcheckFail(f"{server}: non-ok status — {resp.get('status')}")

    data = resp.get("data")
    if data is None:
        raise HealthcheckFail(f"{server}: missing 'data' field — server bug?")
    if not isinstance(data, list):
        raise HealthcheckFail(f"{server}: 'data' is {type(data).__name__}, not list")
    if len(data) < min_results:
        raise HealthcheckFail(
            f"{server}: only {len(data)} results (min={min_results}) — "
            f"likely silent failure; check server logs"
        )

    # Sentinel: if server is degraded, it sometimes returns placeholders.
    for item in data:
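        # Guard against non-dict items so a malformed entry can't raise AttributeError.
        if isinstance(item, dict) and item.get("source") == "stub":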
            raise HealthcheckFail(f"{server}: stub data detected — server in degraded mode")

The key design choice: the health check raises on anything suspicious. The cron is wrapped so that any HealthcheckFail aborts the agent turn and sends me a Telegram alert with the exact reason. No more silent empty mornings.
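The wrapper itself isn't shown here, so the following is only a sketch of how such an entry point might look. The --server, --query, and --min-results flags mirror the preflight arguments used in this post. The --timeout flag, run_preflight, stderr_alert, and always_empty are assumptions made for illustration, and the alert function just prints to stderr rather than calling any real Telegram API.

```python
import argparse
import sys
from typing import Callable, Optional, Sequence


class HealthcheckFail(Exception):
    """Raised when the preflight sees anything suspicious."""


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="MCP preflight health check")
    parser.add_argument("--server", required=True)
    parser.add_argument("--query", required=True)
    parser.add_argument("--min-results", type=int, default=1)
    parser.add_argument("--timeout", type=int, default=15)  # assumed flag
    return parser.parse_args(argv)


def run_preflight(check: Callable[[], None], alert: Callable[[str], None]) -> int:
    """Run the check and return an exit code: 0 = healthy, 1 = abort."""
    try:
        check()
    except HealthcheckFail as exc:
        alert(f"preflight failed: {exc}")
        return 1
    return 0


def stderr_alert(message: str) -> None:
    # Placeholder notifier: swap in whatever actually pages you.
    print(message, file=sys.stderr)


def always_empty() -> None:
    # Stand-in for a real healthcheck(...) call that found too few results.
    raise HealthcheckFail("research-mcp: only 0 results (min=3)")


exit_code = run_preflight(always_empty, stderr_alert)
print(exit_code)  # prints 1
```

In a real script the check would be something like lambda: healthcheck(args.server, args.query, args.min_results, args.timeout), and the result would go to sys.exit so the scheduler sees a non-zero exit code.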

Wiring it into the cron

I didn't want to change every cron — there are 18 of them now. Instead I added a thin wrapper that the OpenClaw agent prompt references:

# In the agent's session bootstrap
preflight:
  - script: scripts/mcp-healthcheck.py
    args: ["--server", "research-mcp", "--query", "test-sentinel", "--min-results", "3"]
    on_fail: abort

The agent's prompt now starts with: "Before running the morning digest, run the preflight. If it aborts, post a single Slack message saying the digest is delayed and ping James. Do NOT post the digest."

This is the inversion of the silent-failure pattern. The agent is now explicitly told: if your inputs are bad, do nothing and tell me. That's safer than letting it produce a plausible-looking summary of nothing.

The MCP server side: fix the liar

The health check caught the symptom, but the root cause was on the server. The error handler was wrong:

# Before — the liar
@app.exception_handler(Exception)
async def swallow_errors(request, exc):
    logger.warning(f"Query failed: {exc}")
    return {"status": "ok", "data": []}

I replaced it with:

# After: let it fail loudly
# (Assumes a Starlette/FastAPI-style app, where the path is request.url.path.)
@app.exception_handler(QueryTimeout)
async def on_timeout(request, exc):
    logger.error(f"Query timeout on {request.url.path}: {exc}")
    return JSONResponse(
        status_code=504,
        content={"status": "error", "error": "query_timeout", "detail": str(exc)},
    )

# Catch-all for anything that isn't a known timeout.
@app.exception_handler(Exception)
async def on_unknown(request, exc):
    logger.exception(f"Unhandled error on {request.url.path}")
    return JSONResponse(
        status_code=500,
        content={"status": "error", "error": "internal", "detail": str(exc)},
    )

Now the server returns 504 on timeout and 500 on unknown errors, with status: "error". The agent turn fails. The cron fails. I get paged.
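On the consumer side, an honest server only helps if the client actually treats those responses as failures. Here is a minimal sketch of that step, assuming the client has the HTTP status code and the parsed JSON body in hand. McpCallError and unwrap_response are hypothetical names, not part of any MCP client library.

```python
class McpCallError(Exception):
    """Hypothetical error raised when the server reports a failure."""


def unwrap_response(http_status: int, body: dict) -> list:
    """Return the data list, or raise if the server signalled an error."""
    if http_status >= 400 or body.get("status") == "error":
        code = body.get("error", "unknown")
        detail = body.get("detail", "")
        raise McpCallError(f"HTTP {http_status} ({code}): {detail}")

    data = body.get("data")
    if not isinstance(data, list):
        raise McpCallError(f"HTTP {http_status}: 'data' missing or not a list")
    return data


# A healthy response passes straight through.
print(unwrap_response(200, {"status": "ok", "data": ["story-1"]}))

# A 504 from the fixed server now surfaces as an exception.
try:
    unwrap_response(504, {"status": "error", "error": "query_timeout", "detail": "30s"})
except McpCallError as exc:
    print(f"call failed: {exc}")
```

Raising here, rather than returning an empty list, is what lets the failure propagate to the agent turn and the cron instead of being flattened back into "no results."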

What I learned

Three things, in order of how much pain each one caused:

1. MCP status fields are not reliability signals. A 200 OK from an MCP server means "the request reached the server and got a response." It does not mean "the work you asked for got done." Treat every MCP integration as potentially lying about success, and validate at the consumer.

2. Silent failures compound in agent pipelines. When the agent trusted the empty result, it produced a confident-sounding "no news today" message. The team started ignoring the digest because "it's always empty." By the time I noticed, I'd lost three days of signal. If your agent says "no results" too often, that's a bug in the pipeline, not a feature of the data.

3. Preflight checks beat postmortems. I could have written a fancy dashboard that showed MCP server health. Instead I wrote 40 lines that abort the cron. A dashboard would have told me on day 4 what the preflight now tells me on day 1.
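Point 2 can also be checked mechanically. The sketch below is one possible approach, assuming you record each run's result count somewhere. It is not part of the healthcheck script; empty_streak, looks_suspicious, and the max_empty_runs threshold are hypothetical names and values chosen for illustration.

```python
def empty_streak(result_counts: list[int]) -> int:
    """Count how many of the most recent runs returned zero results."""
    streak = 0
    for count in reversed(result_counts):
        if count != 0:
            break
        streak += 1
    return streak


def looks_suspicious(result_counts: list[int], max_empty_runs: int = 2) -> bool:
    """True when the latest runs have been empty too many times in a row."""
    return empty_streak(result_counts) >= max_empty_runs


history = [5, 4, 0, 0, 0]  # oldest first; three empty mornings in a row
print(empty_streak(history))      # prints 3
print(looks_suspicious(history))  # prints True
```

A single empty run can be legitimate. A streak is the signal worth alerting on, and the right threshold depends on how often your data source is really quiet.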

The full healthcheck script is in scripts/mcp-healthcheck.py if you want to copy it. Two weeks in, the preflight has caught two more silent degradations: once when the database ran out of disk, and once when the server was redeployed with a missing env var. Both times I knew before the team did.

That's the bar. If your agent says "done," you should be able to trust it. And if you can't, a preflight check is cheaper than another silent morning.