Learning to model failure properly while building a monitoring tool in Python
I’m currently building TrustMonitor, a small website and API monitoring tool using FastAPI, asyncio, and httpx.
One thing that surprised me early was how vague the word "failure" becomes if you're not careful.
At first, any unsuccessful check was treated the same: if the request didn't succeed within a timeout, it failed. That worked, but it hid important differences and made retries noisy.
What I changed
Instead of treating all failures equally, I started separating them into two broad groups:
- transport-level failures
- application-level failures
Transport-level failures happen before an HTTP response exists. Examples include DNS resolution errors, connection timeouts, TLS issues, and read timeouts.
Application-level failures are valid HTTP responses that still indicate a problem, such as 4xx or 5xx status codes.
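These two groups can be made explicit in code. Here is a minimal sketch (the enum and the `domain_for` helper are hypothetical names, not from TrustMonitor itself) that maps the failure-type strings used later in this post onto the two domains:

```python
from __future__ import annotations

from enum import Enum


class FailureDomain(Enum):
    TRANSPORT = "transport"      # no HTTP response ever existed (DNS, TLS, timeouts)
    APPLICATION = "application"  # a valid HTTP response that still indicates a problem


def domain_for(failure_type: str | None) -> FailureDomain | None:
    """Classify a named failure into one of the two broad domains."""
    if failure_type is None:
        return None  # the check succeeded
    if failure_type in ("server_error", "client_error"):
        return FailureDomain.APPLICATION
    # everything else (connect_timeout, read_timeout, request_error:*) happened
    # before a response existed
    return FailureDomain.TRANSPORT
```

Keeping the domain separate from the specific failure name means retry logic can key on the coarse domain while alerts can use the finer-grained name.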
A simplified example
```python
import httpx


async def run_check(client: httpx.AsyncClient, url: str, timeout: float) -> str | None:
    try:
        response = await client.get(url, timeout=timeout)
    except httpx.ConnectTimeout:
        failure_type = "connect_timeout"
    except httpx.ReadTimeout:
        failure_type = "read_timeout"
    except httpx.RequestError as exc:
        # catch-all for the remaining transport-level failures (DNS, TLS, ...)
        failure_type = f"request_error:{type(exc).__name__}"
    else:
        # a response exists, so any failure from here on is application-level
        if response.status_code >= 500:
            failure_type = "server_error"
        elif response.status_code >= 400:
            failure_type = "client_error"
        else:
            failure_type = None
    return failure_type
```
This isn’t final or elegant, but it’s explicit. Naming the failure before reacting to it made retries and alerts easier to reason about.
Why this matters
- Some failures justify retries
- Others should alert immediately
- Aggressive retries can hide real outages
Without clear failure modeling, retries just add noise.
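One way to turn those bullet points into a policy, sketched with hypothetical sets and thresholds (which failures are retryable is a judgment call per deployment):

```python
# Assumed policy, not TrustMonitor's actual behavior:
# timeouts are worth retrying, 5xx should alert, 4xx is just recorded.
RETRYABLE = {"connect_timeout", "read_timeout"}
ALERT_IMMEDIATELY = {"server_error"}


def decide(failure_type, attempt, max_retries=2):
    """Map a named failure (and how many attempts were made) to an action."""
    if failure_type is None:
        return "ok"
    if failure_type in RETRYABLE and attempt < max_retries:
        return "retry"
    if failure_type in ALERT_IMMEDIATELY:
        return "alert"
    # log it, but don't retry or page anyone (e.g. client_error)
    return "record"
```

Because the failure was named first, this policy is a small pure function that is trivial to test, instead of retry logic scattered through the request path.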
Closing thought
Even in a small project, thinking about time budgets, failure domains, and observability early makes a big difference.
If a monitoring system can’t explain why something failed, it’s hard to trust it when things go wrong.