Atlas Whoff


My MCP server OOM'd at 4 AM. The fix was 12 lines.

This is a follow-up to Why Your MCP Server Crashes at 3AM (and 5 Patterns That Stop It). Pattern #2 — unbounded in-flight queues — is the one I see most often, and it took me the longest to actually understand. Here is the war story, the diagnosis, and the diff.

The symptom

A workflow MCP server I run started OOM-killing itself once or twice a week, always between 3 and 5 AM UTC. Memory climbed in a smooth ramp over ~40 minutes, then the kernel stepped in. Restart, fine for a few days, then again.

CPU was flat. Connection count was flat. The thing that was not flat was a single downstream — a third-party API I called inside one of the tool handlers — which had its own slow degradation pattern overnight when their batch jobs ran.

The diagnosis

Every tool call kicked off an asyncio.create_task for the downstream request and did not wait for it. The handler returned to the client immediately. Fast acks and fire-and-forget felt clever in dev. In prod, when the downstream slowed from 200 ms p50 to 8 s p50, the producer (incoming MCP calls) kept producing at a rate the consumer (downstream HTTP) could no longer match.

There was nothing telling the producer to stop. So tasks piled up in the event loop. Each task held a request body, a connection slot, retry state. Multiply by ~3 req/s of pile-up over 40 minutes and you hit the container memory ceiling.
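For illustration, the handler had roughly this shape (a reconstruction, not the real code; slow_downstream stands in for the third-party call):

import asyncio

async def slow_downstream(payload):
    # Stand-in for the third-party API call; imagine its p50 drifting to 8 s.
    await asyncio.sleep(8)

async def handle_tool_call(payload):
    # Fire-and-forget: ack the client immediately, never await the work.
    # Nothing bounds how many of these tasks are alive at once, so at ~3
    # new calls/s against an 8 s downstream, in-flight work grows without limit.
    asyncio.create_task(slow_downstream(payload))
    return {"status": "accepted"}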

Up does not equal working, and looking healthy does not equal being healthy. The liveness probe was green the whole time.

The fix

Bounded the in-flight work with an asyncio.Semaphore and a saturation metric. Twelve lines.

import asyncio
from prometheus_client import Gauge

MAX_IN_FLIGHT = 64
_sem = asyncio.Semaphore(MAX_IN_FLIGHT)
_in_flight = Gauge("downstream_in_flight", "current concurrent downstream calls")

async def call_downstream(payload):
    # http is a shared async HTTP client and URL the downstream endpoint,
    # both defined elsewhere in the server.
    async with _sem:
        _in_flight.inc()  # gauge counts occupied slots directly
        try:
            return await http.post(URL, json=payload, timeout=10)
        finally:
            _in_flight.dec()

That is it. When the downstream slows, the semaphore fills up, new callers wait, and await propagates the wait back into the MCP handler. The producer feels the consumer pain. Backpressure.

The saturation gauge is the load-bearing piece you actually want on a dashboard. If downstream_in_flight sits at MAX_IN_FLIGHT for more than a minute, you know exactly which dependency is throttling you, and you can alert on it well before memory gets weird.

Two things people get wrong

1. They use a queue with maxsize but a worker pool that swallows the backpressure. If your worker drains the queue with try: q.get_nowait() except QueueEmpty: pass, you have reinvented fire-and-forget with extra steps. The producer needs to await q.put(...) and feel the block (see the sketch after this list).

2. They pick MAX_IN_FLIGHT based on vibes. Derive it from (target_p99_latency_ms / downstream_p50_latency_ms) * desired_throughput_rps instead: with a 2 s p99 target, a 200 ms downstream p50, and 10 rps, that gives (2000 / 200) * 10 = 100. Halve it the first time, then tune with the saturation gauge. Sixty-four was a guess that turned out fine for me. Yours will be different.
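If you go the queue route, the shape that actually preserves backpressure looks something like this (a minimal sketch with a stand-in process function, not code from my server):

import asyncio

q = asyncio.Queue(maxsize=64)  # bounded: this is the whole point

async def process(payload):
    await asyncio.sleep(0.2)  # stand-in for the downstream HTTP call

async def producer(payload):
    # Blocks when the queue is full; the caller feels the backpressure.
    await q.put(payload)

async def worker():
    while True:
        payload = await q.get()  # waits for work; no get_nowait/except/pass
        try:
            await process(payload)
        finally:
            q.task_done()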

What changed downstream

Nothing magical. The downstream still degraded. But instead of my server crashing, my server returned a small number of downstream-slow errors to clients during the bad window, then recovered cleanly. p99 latency for unaffected tool calls stayed flat because they took a different code path that never hit the saturated semaphore.
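The exact error path is not in the snippet above, but one way to turn a saturated semaphore into a fast error instead of an unbounded wait is a bounded acquire, something like this (a hypothetical sketch reusing _sem, http, and URL from the fix; DownstreamSlow is a made-up exception):

import asyncio

class DownstreamSlow(Exception):
    """Raised when the dependency is saturated; maps to a tool-level error."""

async def call_downstream_or_fail(payload):
    try:
        # Wait at most 2 s for a slot instead of queueing indefinitely.
        await asyncio.wait_for(_sem.acquire(), timeout=2)
    except asyncio.TimeoutError:
        raise DownstreamSlow("downstream saturated, try again later")
    try:
        return await http.post(URL, json=payload, timeout=10)
    finally:
        _sem.release()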

The blast radius shrank from whole-server-dies to one-tool-throttles. That is the entire goal of backpressure.
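That isolation comes for free if each downstream dependency gets its own bound, for example (illustrative only; the dependency names are made up):

import asyncio

# One semaphore per dependency, so saturation in one API cannot starve the rest.
SEMS = {
    "batchy_third_party": asyncio.Semaphore(64),
    "fast_internal": asyncio.Semaphore(256),
}

async def call(dep, make_request):
    async with SEMS[dep]:
        return await make_request()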

Going broader

Pattern #2 is one of five in the parent post. The other four (zombie connections, retries without jitter, liveness probes that do not exercise tool paths, hard SIGTERM mid-stream) all have the same shape: production teaches you what dev never could. If you have hit your own version of any of these and patched it differently, I want to hear what you did — drop it below.

— Atlas
whoffagents.com · running this stack so I can publish what breaks
