The Real Problem With Uptime Monitoring
Most uptime monitoring tools work like this:
A single server sends a request to your endpoint every few minutes.
If the request fails, the system declares downtime.
Simple.
Also very wrong.
A single monitor cannot reliably determine whether an application is actually down. Network routing issues, DNS delays, or temporary congestion can produce false downtime alerts even when the service is functioning normally.
In production environments, false positives create a serious problem.
Engineers lose trust in the monitoring system.
Once that happens, alerts stop being useful.
So when I started building TrustMonitor, the first design constraint was simple:
- The monitoring system itself must be reliable enough to be trusted.
Architecture Overview
Instead of relying on a single monitor, the system uses a distributed verification approach.
The monitoring flow looks like this:
```
Scheduler
    ↓
Primary Monitor
    ↓
Secondary Verification
    ↓
Incident Recording
    ↓
Signed Incident Report
```
Each stage reduces the probability of false alerts and ensures that incidents cannot be modified after they are recorded.
Monitor Scheduling
The system uses a scheduler responsible for dispatching monitoring jobs at defined intervals.
Each job contains:
- endpoint URL
- expected status code
- timeout threshold
- verification rules

Example structure:
```json
{
  "endpoint": "https://api.example.com/health",
  "expected_status": 200,
  "timeout": 5
}
```
The scheduler pushes these jobs into a queue where worker nodes perform the actual checks.
Separating scheduling from execution prevents monitoring delays if a worker becomes slow or temporarily unavailable.
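Here is a minimal sketch of that separation. The in-process asyncio.Queue stands in for whatever durable queue the real system uses, and the names JOBS, scheduler, and worker are illustrative:

```python
import asyncio

# Illustrative job list, mirroring the structure above.
JOBS = [
    {"endpoint": "https://api.example.com/health",
     "expected_status": 200,
     "timeout": 5},
]

async def scheduler(queue: asyncio.Queue, interval: float = 60.0) -> None:
    # Dispatch every job into the queue at a fixed interval.
    while True:
        for job in JOBS:
            await queue.put(job)
        await asyncio.sleep(interval)

async def worker(queue: asyncio.Queue) -> None:
    # Pull jobs and run checks independently of the scheduler.
    while True:
        job = await queue.get()
        print(f"checking {job['endpoint']}")  # the actual HTTP check goes here
        queue.task_done()
```

Because the scheduler only enqueues and workers only consume, a stalled worker delays its own check, never the schedule itself.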
Primary Monitor
The primary monitor sends the initial request to the target endpoint.
In the current implementation, this is handled using FastAPI workers running asynchronous HTTP checks.
Example simplified check:
```python
import httpx

async def check_endpoint(url: str) -> int | None:
    # Return the status code, or None if the request itself failed.
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            response = await client.get(url)
            return response.status_code
    except httpx.RequestError:  # timeout, DNS failure, connection error
        return None
```
If the response matches the expected conditions, the monitor records a successful check.
If not, the system does not immediately declare downtime.
This is where most monitoring tools fail.
Secondary Verification
Before an incident is recorded, a secondary verification monitor repeats the check.
This step confirms whether the failure is real or caused by temporary network conditions.
Verification logic:
```
Primary Monitor detects failure
    ↓
Secondary Monitor runs verification check
    ↓
If failure confirmed → incident recorded
If success → ignore false positive
```
This simple mechanism significantly reduces false downtime alerts.
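A sketch of that control flow, reusing check_endpoint from the previous section. In TrustMonitor the re-check runs on a separate verification monitor; here both checks share one function to keep the sketch short, and the 10-second delay and the name verified_check are illustrative:

```python
import asyncio

async def verified_check(job: dict) -> bool:
    # Return True only if the endpoint should be considered down.
    status = await check_endpoint(job["endpoint"])
    if status == job["expected_status"]:
        return False  # healthy, nothing to do
    # Primary check failed: wait briefly, then re-verify before alerting.
    await asyncio.sleep(10)
    status = await check_endpoint(job["endpoint"])
    if status == job["expected_status"]:
        return False  # transient failure, ignore the false positive
    return True  # failure confirmed, record an incident
```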
Incident Recording
Once the failure is verified, the system records an incident containing:
- timestamp
- endpoint
- failure reason
- verification results
- response data

Example incident structure:
```json
{
  "endpoint": "api.example.com",
  "status": "DOWN",
  "timestamp": "2026-03-05T10:20:15Z",
  "verified": true
}
```
However, recording incidents alone is not enough.
Monitoring systems must also guarantee data integrity.
Cryptographic Incident Signing
A key design decision in TrustMonitor is that incident records are cryptographically signed.
This prevents incidents from being altered later.
Each incident is serialized, hashed with a cryptographic digest, and the digest is signed with a key held by the monitoring system. A bare hash alone would not be enough, since anyone who altered the record could simply recompute it.
Example concept:
incident_data → SHA256 digest → sign with key → incident_signature
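A minimal sketch of this idea, assuming an HMAC-SHA256 over a canonical JSON encoding. The key handling and the function name are illustrative, not TrustMonitor's actual implementation:

```python
import hashlib
import hmac
import json

# Illustrative key; in practice this comes from a secrets manager.
SIGNING_KEY = b"replace-with-a-real-secret"

def sign_incident(incident: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) so the same
    # incident always produces the same bytes and the same signature.
    payload = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()

incident = {
    "endpoint": "api.example.com",
    "status": "DOWN",
    "timestamp": "2026-03-05T10:20:15Z",
    "verified": True,
}
incident["signature"] = sign_incident(incident)
```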
The signature proves that the incident existed at a specific time and has not been modified.
This becomes useful for:
- post-incident audits
- SLA verification
- infrastructure debugging
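For example, continuing the sketch above, an auditor who holds the key can recompute the signature and confirm a record is untouched:

```python
def verify_incident(record: dict) -> bool:
    # Recompute the signature over the record minus its signature field.
    claimed = record.get("signature", "")
    body = {k: v for k, v in record.items() if k != "signature"}
    return hmac.compare_digest(claimed, sign_incident(body))
```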
Lessons Learned
Building monitoring systems revealed a few important realities.
Single-location monitoring is unreliable
Network issues happen constantly. A single monitor cannot determine service health with certainty. Verification layers are essential.
Monitoring systems must be trustworthy
If alerts generate too many false positives, engineers eventually ignore them.
A monitoring system that isn’t trusted is worse than having none at all.
Incident integrity matters
Monitoring data should be tamper-resistant. Signed incidents create verifiable records of infrastructure events.
Final Thoughts
Monitoring infrastructure sounds simple on paper.
In practice, reliability requires careful design around:
- verification
- distributed checks
- incident integrity
TrustMonitor is still evolving, but building it has already surfaced interesting engineering challenges around monitoring accuracy and system trust.
Future improvements will focus on:
- multi-region verification
- anomaly detection
- improved alert reliability
Because in monitoring systems, trust is everything.