The Real Problem With Uptime Monitoring
Most uptime monitoring tools work like this:
A single server sends a request to your endpoint every few minutes.
If the request fails, the system declares downtime.
Simple.
Also very wrong.
A single monitor cannot reliably determine whether an application is actually down. Network routing issues, DNS delays, or temporary congestion can produce false downtime alerts even when the service is functioning normally.
In production environments, false positives create a serious problem.
Engineers lose trust in the monitoring system.
Once that happens, alerts stop being useful.
So when I started building TrustMonitor, the first design constraint was simple:
- The monitoring system itself must be reliable enough to be trusted.
Architecture Overview
Instead of relying on a single monitor, the system uses a distributed verification approach.
The monitoring flow looks like this:
```
Scheduler
    ↓
Primary Monitor
    ↓
Secondary Verification
    ↓
Incident Recording
    ↓
Signed Incident Report
```
Each stage reduces the probability of false alerts and ensures that incidents cannot be modified after they are recorded.
Monitor Scheduling
The system uses a scheduler responsible for dispatching monitoring jobs at defined intervals.
Each job contains:
- endpoint URL
- expected status code
- timeout threshold
- verification rules

Example structure:
```json
{
  "endpoint": "https://api.example.com/health",
  "expected_status": 200,
  "timeout": 5
}
```
The scheduler pushes these jobs into a queue where worker nodes perform the actual checks.
Separating scheduling from execution prevents monitoring delays if a worker becomes slow or temporarily unavailable.
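Here is a minimal sketch of that separation. The in-process asyncio.Queue stands in for whatever durable queue the real system uses, and the names JOBS, scheduler, and worker are illustrative:

```python
import asyncio

# Illustrative job list, mirroring the structure above.
JOBS = [
    {"endpoint": "https://api.example.com/health",
     "expected_status": 200,
     "timeout": 5},
]

async def scheduler(queue: asyncio.Queue, interval: float = 60.0) -> None:
    # Dispatch every job into the queue at a fixed interval.
    while True:
        for job in JOBS:
            await queue.put(job)
        await asyncio.sleep(interval)

async def worker(queue: asyncio.Queue) -> None:
    # Pull jobs and run checks independently of the scheduler.
    while True:
        job = await queue.get()
        print(f"checking {job['endpoint']}")  # the actual HTTP check goes here
        queue.task_done()
```

Because the scheduler only enqueues and workers only consume, a stalled worker delays its own check, never the schedule itself.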
Primary Monitor
The primary monitor sends the initial request to the target endpoint.
In the current implementation, this is handled using FastAPI workers running asynchronous HTTP checks.
Example simplified check:
```python
import httpx

async def check_endpoint(url: str) -> int | None:
    # Return the status code, or None if the request itself failed.
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            response = await client.get(url)
            return response.status_code
    except httpx.RequestError:  # timeout, DNS failure, connection error
        return None
```
If the response matches the expected conditions, the monitor records a successful check.
If not, the system does not immediately declare downtime.
This is where most monitoring tools fail.
Secondary Verification
Before an incident is recorded, a secondary verification monitor repeats the check.
This step confirms whether the failure is real or caused by temporary network conditions.
Verification logic:
```
Primary Monitor detects failure
    ↓
Secondary Monitor runs verification check
    ↓
If failure confirmed → incident recorded
If success → ignore false positive
```
This simple mechanism significantly reduces false downtime alerts.
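A sketch of that control flow, reusing check_endpoint from the previous section. In TrustMonitor the re-check runs on a separate verification monitor; here both checks share one function to keep the sketch short, and the 10-second delay and the name verified_check are illustrative:

```python
import asyncio

async def verified_check(job: dict) -> bool:
    # Return True only if the endpoint should be considered down.
    status = await check_endpoint(job["endpoint"])
    if status == job["expected_status"]:
        return False  # healthy, nothing to do
    # Primary check failed: wait briefly, then re-verify before alerting.
    await asyncio.sleep(10)
    status = await check_endpoint(job["endpoint"])
    if status == job["expected_status"]:
        return False  # transient failure, ignore the false positive
    return True  # failure confirmed, record an incident
```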
Incident Recording
Once the failure is verified, the system records an incident containing:
- timestamp
- endpoint
- failure reason
- verification results
- response data

Example incident structure:
```json
{
  "endpoint": "api.example.com",
  "status": "DOWN",
  "timestamp": "2026-03-05T10:20:15Z",
  "verified": true
}
```
However, recording incidents alone is not enough.
Monitoring systems must also guarantee data integrity.
Cryptographic Incident Signing
A key design decision in TrustMonitor is that incident records are cryptographically signed.
This prevents incidents from being altered later.
Each incident is serialized, hashed with a cryptographic digest, and the digest is signed with a key held by the monitoring system. A bare hash alone would not be enough, since anyone who altered the record could simply recompute it.
Example concept:
incident_data → SHA256 digest → sign with key → incident_signature
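A minimal sketch of this idea, assuming an HMAC-SHA256 over a canonical JSON encoding. The key handling and the function name are illustrative, not TrustMonitor's actual implementation:

```python
import hashlib
import hmac
import json

# Illustrative key; in practice this comes from a secrets manager.
SIGNING_KEY = b"replace-with-a-real-secret"

def sign_incident(incident: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) so the same
    # incident always produces the same bytes and the same signature.
    payload = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()

incident = {
    "endpoint": "api.example.com",
    "status": "DOWN",
    "timestamp": "2026-03-05T10:20:15Z",
    "verified": True,
}
incident["signature"] = sign_incident(incident)
```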
The signature proves that the incident existed at a specific time and has not been modified.
This becomes useful for:
- post-incident audits
- SLA verification
- infrastructure debugging
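For example, continuing the sketch above, an auditor who holds the key can recompute the signature and confirm a record is untouched:

```python
def verify_incident(record: dict) -> bool:
    # Recompute the signature over the record minus its signature field.
    claimed = record.get("signature", "")
    body = {k: v for k, v in record.items() if k != "signature"}
    return hmac.compare_digest(claimed, sign_incident(body))
```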
Lessons Learned
Building monitoring systems revealed a few important realities.
Single-location monitoring is unreliable
Network issues happen constantly. A single monitor cannot determine service health with certainty. Verification layers are essential.
Monitoring systems must be trustworthy
If alerts generate too many false positives, engineers eventually ignore them.
A monitoring system that isn’t trusted is worse than having none at all.
Incident integrity matters
Monitoring data should be tamper-resistant. Signed incidents create verifiable records of infrastructure events.
Final Thoughts
Monitoring infrastructure sounds simple on paper.
In practice, reliability requires careful design around:
- verification
- distributed checks
- incident integrity
TrustMonitor is still evolving, but building it has already surfaced interesting engineering challenges around monitoring accuracy and system trust.
Future improvements will focus on:
- multi-region verification
- anomaly detection
- improved alert reliability
Because in monitoring systems, trust is everything.