
thesonicstar
Why I Stopped Trusting Status Pages (and Built My Own Monitor)

Modern software depends on dozens of SaaS platforms.

Messaging systems. Identity providers. Payment services. Collaboration tools.

When one of them fails, everything downstream feels it.

Yet if you’ve ever checked a vendor status page during an outage, you’ve probably noticed something:

Everything is still marked as “Operational”.

Even when it clearly isn’t.

That observation led me to build Trusted Status — a system that measures whether services are actually reachable, from the outside, in real time.

The Problem With Status Pages

Most vendor status pages rely on:

  • internal monitoring
  • crowdsourced reports
  • manual incident updates

Each has trade-offs.

Internal monitoring often reflects system health, not user experience.
Crowdsourced data introduces noise and regional bias.
Manual updates are slow and sometimes overly cautious.

In short, status pages often describe systems:

from the inside looking out

But users experience them:

from the outside looking in

That difference matters.

The Core Idea

Trusted Status takes a simple approach:

Measure what a real user would experience.

Instead of deep internal diagnostics, the system asks:

Can I reach the service front door from the UK?

If yes → Operational
If inconsistent → Degraded
If consistently unreachable → Outage

This intentionally focuses on observability at the edge, not inside the vendor’s infrastructure.
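The front-door rule above can be sketched as a tiny classifier. This is an illustrative sketch, not the actual Sentinel code: the function name and the exact status-code cut-offs are my assumptions.

```python
# Evidence labels used throughout this post.
HEALTHY, DEGRADED, OUTAGE = "HEALTHY", "DEGRADED", "OUTAGE"

def evidence_from_probe(status_code, timed_out):
    """Map a single front-door HTTP attempt to an evidence label (sketch)."""
    if timed_out or status_code is None:
        return OUTAGE        # nothing answered at the front door
    if status_code < 500:
        return HEALTHY       # 2xx/3xx/4xx (incl. 429): the service answered
    return DEGRADED          # 5xx: reachable, but unhealthy
```

Note that a 4xx such as 429 still counts as reachable here; only a timeout or connection failure looks like an outage from the edge.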

Trusted Status Sentinel Architecture

At the core of the system is the monitoring engine: Trusted Status Sentinel.

Sentinel is responsible for:

  • running probes
  • collecting evidence
  • applying classification logic
  • publishing the final signal

The architecture is deliberately minimal and serverless.

How Sentinel Produces a Signal

Every minute:

  1. EventBridge triggers the Sentinel Lambda
  2. Probes run in parallel across services
  3. Each probe produces evidence
  4. Evidence is compared with previous runs
  5. State transitions are applied (if thresholds met)
  6. A status.json file is written to S3

The frontend simply reads this file via CloudFront.
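For illustration, the published file might look something like this. This is a hypothetical shape; the post does not show the real status.json schema, so every field name here is an assumption:

```json
{
  "generated_at": "2025-01-01T12:00:00Z",
  "region": "uk",
  "services": [
    { "name": "ExampleChat", "state": "OPERATIONAL", "streak": 0 },
    { "name": "ExamplePay",  "state": "DEGRADED",    "streak": 2 }
  ]
}
```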

This keeps the system:

  • fast
  • resilient
  • loosely coupled

Evidence vs State (The Key Design Decision)

One of the most important design choices in Sentinel is this:

Evidence and state are not the same thing.

Each probe produces evidence:

  • HEALTHY
  • DEGRADED
  • OUTAGE
  • INTERNAL_ERROR

But the public state only changes after consistent evidence across multiple runs.

This avoids a common monitoring problem:

*flapping* (rapid state switching due to transient failures)

Instead, Sentinel behaves more like a scientist:

observe repeatedly → then conclude

The result is a much calmer, more trustworthy signal.
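The streak logic can be sketched as a small pure function. The threshold value is my assumption, and for simplicity this sketch uses the same labels for evidence and public state:

```python
THRESHOLD = 3  # consecutive disagreeing runs before the public state flips (assumed)

def next_state(public_state, evidence, streak):
    """Return (new_state, new_streak) after one probe run (sketch)."""
    if evidence == public_state:
        return public_state, 0       # evidence agrees: reset the streak
    streak += 1                      # evidence disagrees: extend the streak
    if streak >= THRESHOLD:
        return evidence, 0           # consistent evidence: transition
    return public_state, streak      # not yet: hold the public state
```

A single failed run nudges the streak but leaves the public state alone; only a run of consistent contrary evidence changes what users see.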

Handling Real-World Edge Cases

A few practical problems had to be solved.

1. Rate limiting (HTTP 429)

Many platforms throttle requests.

Treating a 429 as an outage would be wrong: the service answered, it is just throttling requests.

So Sentinel treats front-door HTTP errors carefully:

  • reachable service ≠ outage
  • only repeated failures trigger degradation

2. Transient network failures

Single failures happen constantly on the internet.

Without protection, these create noise.

Solution:

  • streak-based transitions
  • require consecutive failures before changing state

3. Internal vs external failures

Sometimes:

  • The status API fails
  • but the service front door still works

Sentinel classifies this as:

INTERNAL_ERROR (non-impacting)

This prevents false outages when a supporting system fails but users are unaffected.
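Combining the two probe families can be sketched as a simple precedence rule. Again, the function name and the two-probe simplification are assumptions for illustration:

```python
def classify_run(front_door_ok, status_api_ok):
    """Combine front-door and status-API probes into one evidence label (sketch)."""
    if front_door_ok and not status_api_ok:
        return "INTERNAL_ERROR"   # supporting system failed; users unaffected
    if front_door_ok:
        return "HEALTHY"          # the thing users touch is reachable
    return "OUTAGE"               # front door unreachable: user-impacting
```

The front door wins: a broken status API never outranks a working service.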

Why UK-Based Monitoring?

Most monitoring platforms aggregate global signals.

That’s useful, but it hides regional issues.

A service might be:

  • fully operational in the US
  • partially unreachable in the UK

Trusted Status Sentinel answers a different question:

Can users in the UK actually reach this service right now?

This makes the signal more relevant for UK-based users and organisations.

Why This Architecture Works

The system intentionally avoids complexity.

Serverless compute
No infrastructure to manage.

Stateful logic where needed
DynamoDB enables streak tracking and transitions.

Static output
S3 + CloudFront makes the public signal fast and resilient.

Loose coupling
The frontend is completely decoupled from the monitoring engine.

This keeps both cost and operational overhead low.

What Comes Next

Sentinel, the monitoring engine described above, powers Trusted Status today.

Sentinel currently operates from a UK node and monitors a growing set of communication platforms.

The next phase is to expand Sentinel into a broader, more capable system.

Planned improvements include:

Multi-region monitoring
Run probes from multiple geographic locations to detect regional failures.

Historical insights
Capture probe history to analyse trends and incident patterns.

Expanded SaaS coverage
Extend beyond communication platforms into identity, payments, and developer tooling.

Alerting subscriptions
Allow users to subscribe to real-time notifications when service states change.

The goal is not to replace vendor status pages.

It is to provide something different:

an independent observation layer

Trusted Status is simply the first public window into that system.

Final Thought

Most systems tell you what they believe is happening.

Trusted Status tries to answer a simpler question:

Is it actually working?

Sometimes, that’s the only signal that matters.

