Most observability dashboards focus heavily on request-facing metrics:
- latency
- throughput
- error rate
- CPU and memory usage
Those metrics are important, but while stress-testing async FastAPI services under concurrent load, I noticed they were not always enough to explain what the runtime was actually experiencing internally.
In one test setup, requests were still returning 200 OK, P99 latency had increased but was still within survivable limits, and CPU usage looked fairly normal.
At the same time, the asyncio event loop was already struggling badly.
Other endpoints became inconsistent, executor queues started backing up, and in several runs event-loop lag climbed past multiple seconds while request latency was still low enough that the service initially appeared operational from the outside.
In some runs, unrelated lightweight endpoints stalled behind a single blocking request even though system-wide CPU usage was not saturated.
The issue became easier to reproduce when synchronous work leaked into async request paths.
Simple examples include:
- blocking database clients
- synchronous SDKs
- legacy REST calls using requests
- filesystem operations
- accidental time.sleep() calls
- overloaded threadpool executors
Even a small blocking section inside an async route can create scheduler starvation under enough concurrency.
Example:
import time

from fastapi import FastAPI

app = FastAPI()


@app.get("/agent")
async def run_agent():
    # Synchronous sleep inside an async route: blocks the event-loop thread.
    time.sleep(5)
    return {"status": "ok"}
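Because time.sleep() blocks the thread that runs the event loop, no other coroutine can make progress until it returns. Under load, this starts affecting unrelated coroutines, queue behavior, scheduler fairness, and request consistency across the service.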
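To make the effect concrete without a web server, here is a minimal sketch using only the standard library. The `heartbeat` coroutine stands in for an unrelated lightweight endpoint, and the two handlers compare a blocking `time.sleep()` with the same call offloaded through `asyncio.to_thread()` (Python 3.9+). All function names here are hypothetical and are not taken from the lab repository.

```python
import asyncio
import contextlib
import time


async def heartbeat(ticks: list, interval: float = 0.01) -> None:
    """Lightweight coroutine standing in for an unrelated endpoint."""
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(interval)


def largest_gap(ticks: list) -> float:
    """Longest pause between consecutive heartbeat ticks, in seconds."""
    return max((b - a for a, b in zip(ticks, ticks[1:])), default=0.0)


async def blocking_handler(delay: float) -> None:
    # Blocks the event-loop thread: no other coroutine runs meanwhile.
    time.sleep(delay)


async def offloaded_handler(delay: float) -> None:
    # Runs the blocking call in a worker thread; the loop stays responsive.
    await asyncio.to_thread(time.sleep, delay)


async def measure(handler, delay: float = 0.3) -> float:
    """Run one handler next to the heartbeat and report the worst stall."""
    ticks: list = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)  # let the heartbeat tick a few times first
    await handler(delay)
    await asyncio.sleep(0.05)  # let the heartbeat tick again afterwards
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return largest_gap(ticks)


if __name__ == "__main__":
    print("blocking :", asyncio.run(measure(blocking_handler)))
    print("offloaded:", asyncio.run(measure(offloaded_handler)))
```

With the blocking handler, the heartbeat stalls for at least the full sleep duration. With the offloaded handler, it keeps ticking at roughly its normal interval. Offloading only moves the pressure to the thread pool, which is why executor saturation still matters.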
One thing that stood out during testing was how differently runtime metrics behaved compared to HTTP-facing metrics.
Request latency degraded gradually, but event-loop lag increased much more aggressively once scheduler pressure crossed a certain point.
Figure: event-loop lag increasing sharply while outward-facing request metrics remained comparatively survivable.
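One common way to estimate event-loop lag is to schedule a short sleep and measure how late the wake-up arrives. The sketch below shows that idea using only the standard library. It is an assumption about how such a probe could look, not necessarily how the lab computes its metric, and `sample_loop_lag` and `demo` are hypothetical names.

```python
import asyncio
import time


async def sample_loop_lag(interval: float, samples: int) -> list:
    """Sleep for `interval` repeatedly and record how late each wake-up is."""
    loop = asyncio.get_running_loop()
    lags = []
    for _ in range(samples):
        start = loop.time()
        await asyncio.sleep(interval)
        # Anything beyond the requested interval is time spent waiting
        # for the loop to get around to this coroutine.
        lags.append(max(0.0, loop.time() - start - interval))
    return lags


async def demo() -> tuple:
    """Return (worst idle lag, worst lag with a blocking call on the loop)."""
    idle = await sample_loop_lag(0.02, 5)

    probe = asyncio.create_task(sample_loop_lag(0.02, 5))
    await asyncio.sleep(0)  # let the probe start its first sleep
    time.sleep(0.2)         # simulated blocking call on the loop thread
    stalled = await probe
    return max(idle), max(stalled)


if __name__ == "__main__":
    idle_lag, stalled_lag = asyncio.run(demo())
    print(f"idle lag:    {idle_lag:.4f}s")
    print(f"stalled lag: {stalled_lag:.4f}s")
```

In a real service, a probe like this would run as a background task and publish its latest value as a gauge. Because it measures the scheduler directly, it can react to a single blocking call even when only a few requests are slow.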
To explore this more systematically, I built a small runtime observability lab using:
- FastAPI
- Prometheus
- Grafana
- Docker Compose
The goal was simply to reproduce different forms of async runtime degradation and observe which telemetry signals changed first.
Figure: minimal async runtime observability lab used for reproducing scheduler starvation and queue amplification scenarios.
The setup intentionally introduced:
- blocking synchronous execution
- executor saturation
- queue amplification
- event-loop starvation
while exposing internal runtime telemetry through Prometheus.
The most useful telemetry signals ended up being event-loop lag, blocking duration, executor queue pressure, backlog growth, and concurrent saturation behavior.
Those signals exposed runtime instability much earlier than HTTP metrics alone.
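Executor queue pressure can be approximated by counting tasks that have been submitted but not yet picked up by a worker. The following is a minimal sketch under that assumption. `InstrumentedExecutor` is a hypothetical helper, it only tracks work submitted through it, and it does not instrument any framework's internal thread pool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InstrumentedExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor that counts tasks submitted but not yet started."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._pressure_lock = threading.Lock()
        self._waiting = 0

    def submit(self, fn, /, *args, **kwargs):
        with self._pressure_lock:
            self._waiting += 1

        def tracked(*a, **kw):
            # A worker thread picked the task up: it is no longer queued.
            with self._pressure_lock:
                self._waiting -= 1
            return fn(*a, **kw)

        # Note: a future cancelled before it starts is never decremented
        # in this simplified version.
        return super().submit(tracked, *args, **kwargs)

    @property
    def queued(self) -> int:
        """Number of submitted tasks still waiting for a free worker."""
        with self._pressure_lock:
            return self._waiting


if __name__ == "__main__":
    pool = InstrumentedExecutor(max_workers=2)
    futures = [pool.submit(time.sleep, 0.2) for _ in range(6)]
    time.sleep(0.05)
    print("queued while saturated:", pool.queued)
    for f in futures:
        f.result()
    print("queued after drain:", pool.queued)
    pool.shutdown()
```

An executor like this can be passed to `loop.run_in_executor()`, and `queued` can be sampled periodically to watch backlog growth when blocking work arrives faster than the workers can drain it.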
I also built a small CLI tool called async-runtime-auditor to evaluate these metrics directly from Prometheus during testing.
The idea was not to build another monitoring platform, but to create lightweight runtime validation checks for async Python services inside CI/CD workflows.
The tool evaluates runtime metrics against deterministic thresholds and can fail execution when runtime degradation becomes severe enough.
Example:
async-auditor \
--config metrics.yaml \
--target http://localhost:9090 \
--fail-on-critical
Example output:
ASYNC RUNTIME AUDITOR
Runtime Status: DEGRADED
Findings:
- Event-loop starvation detected
- Executor queue amplification detected
- Concurrent saturation detected
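The threshold logic behind a check like this can be sketched in a few lines. This is an illustration of deterministic threshold evaluation in general, not the tool's actual implementation: the metric names, the limits, and the HEALTHY and CRITICAL status labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    finding: str      # message reported when the metric crosses a limit
    degraded: float   # value at which the runtime counts as degraded
    critical: float   # value at which the runtime counts as critical


# Hypothetical metric names and limits, chosen only for illustration.
THRESHOLDS = {
    "event_loop_lag_seconds": Threshold("Event-loop starvation detected", 0.1, 1.0),
    "executor_queue_depth": Threshold("Executor queue amplification detected", 10, 100),
}

SEVERITY = ["HEALTHY", "DEGRADED", "CRITICAL"]


def evaluate(metrics: dict) -> tuple:
    """Return (status, findings) for one snapshot of metric values."""
    worst = 0
    findings = []
    for name, rule in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metrics missing from the snapshot are skipped
        if value >= rule.critical:
            level = 2
        elif value >= rule.degraded:
            level = 1
        else:
            continue
        worst = max(worst, level)
        findings.append(rule.finding)
    return SEVERITY[worst], findings


def exit_code(status: str, fail_on_critical: bool) -> int:
    """Non-zero exit code lets a CI job fail on critical degradation."""
    return 1 if fail_on_critical and status == "CRITICAL" else 0


if __name__ == "__main__":
    status, findings = evaluate(
        {"event_loop_lag_seconds": 0.4, "executor_queue_depth": 3}
    )
    print("Runtime Status:", status)
    for finding in findings:
        print("-", finding)
```

Keeping the rules as plain data makes the outcome reproducible: the same metric snapshot always yields the same status, which is what a CI gate needs.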
One thing this testing made clear is that async systems can begin degrading internally well before traditional dashboards clearly show it.
Request metrics tell you how the API behaves externally.
Runtime telemetry tells you how the scheduler behaves while the API is still functioning.
For async Python services, both perspectives matter.
The main lesson from this testing was that scheduler health and request health are not always the same thing, especially in heavily concurrent async systems.
Repositories
Async Runtime Auditor
CI/CD-oriented runtime degradation checks for async Python systems:
Async Runtime Health Lab
FastAPI + Prometheus + Grafana environment for reproducing async runtime degradation:

