Learning to model failure properly while building a monitoring tool in Python
I’m currently building TrustMonitor, a small website and API monitoring tool using FastAPI, asyncio, and httpx.
One thing that surprised me early was how vague the word "failure" becomes if you're not careful.
At first, any unsuccessful check was treated the same: if the request didn't succeed within a timeout, it failed. That worked, but it hid important differences and made retries noisy.
What I changed
Instead of treating all failures equally, I started separating them into two broad groups:
- transport-level failures
- application-level failures
Transport-level failures happen before an HTTP response exists. Examples include DNS resolution errors, connection timeouts, TLS issues, and read timeouts.
Application-level failures are valid HTTP responses that still indicate a problem, such as 4xx or 5xx status codes.
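These two groups can be made explicit in code. Here is a minimal sketch (the enum and the `domain_for` helper are hypothetical names, not from TrustMonitor itself) that maps the failure-type strings used later in this post onto the two domains:

```python
from __future__ import annotations

from enum import Enum


class FailureDomain(Enum):
    TRANSPORT = "transport"      # no HTTP response ever existed (DNS, TLS, timeouts)
    APPLICATION = "application"  # a valid HTTP response that still indicates a problem


def domain_for(failure_type: str | None) -> FailureDomain | None:
    """Classify a named failure into one of the two broad domains."""
    if failure_type is None:
        return None  # the check succeeded
    if failure_type in ("server_error", "client_error"):
        return FailureDomain.APPLICATION
    # everything else (connect_timeout, read_timeout, request_error:*) happened
    # before a response existed
    return FailureDomain.TRANSPORT
```

Keeping the domain separate from the specific failure name means retry logic can key on the coarse domain while alerts can use the finer-grained name.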
A simplified example
```python
import httpx


async def run_check(client: httpx.AsyncClient, url: str, timeout: float) -> str | None:
    try:
        response = await client.get(url, timeout=timeout)
    except httpx.ConnectTimeout:
        failure_type = "connect_timeout"
    except httpx.ReadTimeout:
        failure_type = "read_timeout"
    except httpx.RequestError as exc:
        # catch-all for the remaining transport-level failures (DNS, TLS, ...)
        failure_type = f"request_error:{type(exc).__name__}"
    else:
        # a response exists, so any failure from here on is application-level
        if response.status_code >= 500:
            failure_type = "server_error"
        elif response.status_code >= 400:
            failure_type = "client_error"
        else:
            failure_type = None
    return failure_type
```
This isn’t final or elegant, but it’s explicit. Naming the failure before reacting to it made retries and alerts easier to reason about.
Why this matters
- Some failures justify retries
- Others should alert immediately
- Aggressive retries can hide real outages
Without clear failure modeling, retries just add noise.
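One way to turn those bullet points into a policy, sketched with hypothetical sets and thresholds (which failures are retryable is a judgment call per deployment):

```python
# Assumed policy, not TrustMonitor's actual behavior:
# timeouts are worth retrying, 5xx should alert, 4xx is just recorded.
RETRYABLE = {"connect_timeout", "read_timeout"}
ALERT_IMMEDIATELY = {"server_error"}


def decide(failure_type, attempt, max_retries=2):
    """Map a named failure (and how many attempts were made) to an action."""
    if failure_type is None:
        return "ok"
    if failure_type in RETRYABLE and attempt < max_retries:
        return "retry"
    if failure_type in ALERT_IMMEDIATELY:
        return "alert"
    # log it, but don't retry or page anyone (e.g. client_error)
    return "record"
```

Because the failure was named first, this policy is a small pure function that is trivial to test, instead of retry logic scattered through the request path.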
Closing thought
Even in a small project, thinking about time budgets, failure domains, and observability early makes a big difference.
If a monitoring system can’t explain why something failed, it’s hard to trust it when things go wrong.