Adeshina Emmanuel

Posted on May 26 • Originally published at eadeshina.hashnode.dev

I Refused to Let My System Auto-Tweet Outages

#buildinpublic #infrastructure #monitoring #showdev

This is my first time writing publicly about a system I’m building, so I wanted to start with something that genuinely changed how I think about infrastructure engineering.

I’m currently building Reliastra — an independent uptime verification and reliability intelligence system.

One of the earliest ideas I had sounded incredibly smart at first.

“If Reliastra detects that AWS, Stripe, or Cloudflare is down while their status page still says operational, automatically tweet it immediately.”

It felt like a strong differentiator:

Fast truth

Public accountability

Real-time contradiction detection

But the more I thought about it, the more dangerous it became.

Eventually, I realized something important:

This single feature could destroy the entire credibility of the system.

The Real Goal Was Never Just Monitoring

Building Trust Matters More Than Building Features

Reliastra is not just another monitoring tool.

The real goal is credibility.

The system independently measures infrastructure health across multiple regions and compares those measurements against what vendors publicly claim on their status pages.

If the system detects something like this:

Vendor status page says: Operational

Reliastra measurements say: Degraded or Down

…it marks that as a contradiction.
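As a rough illustration, the comparison could be sketched like this. It is a minimal sketch, not Reliastra's actual code: the `Status`, `Observation`, and `is_contradiction` names are hypothetical, and it assumes both the vendor claim and the measurement have already been reduced to a single status value.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass(frozen=True)
class Observation:
    """Hypothetical record pairing a vendor claim with an independent measurement."""
    vendor: str
    claimed: Status   # what the vendor status page says
    measured: Status  # what the independent checks observed


def is_contradiction(obs: Observation) -> bool:
    """True when the vendor claims Operational but measurements say otherwise."""
    return (
        obs.claimed is Status.OPERATIONAL
        and obs.measured is not Status.OPERATIONAL
    )
```

Under this definition, a vendor that already reports Degraded is not flagged, even if measurements say Down. Only the "claims operational, measures otherwise" case counts.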

Originally, I wanted those contradictions to be published instantly to social media.

That was the mistake.

The Failure Chain I Had To Think Through

What Happens When the System Is Wrong for Two Minutes?

Once I started analyzing the operational consequences, the risks became obvious.

Imagine scenarios like these:

A temporary DNS issue affects one monitoring node

A regional routing problem creates false failures

A short-lived network partition causes inconsistent measurements

Now imagine the system automatically posting this publicly:

“AWS is DOWN while claiming operational.”

Even if the system were wrong for only two minutes, the consequences would still be serious.

Potential outcomes:

Public misinformation

Loss of credibility

Legal exposure

Permanent trust damage

And for a system whose entire value depends on trust, one false public contradiction could be catastrophic.

Not theoretically.

Actually catastrophic.

What I Built Instead

Replacing Instant Reactions With Controlled Verification

Instead of instant auto-publication, I redesigned the system around a staged contradiction model.

That decision completely changed the architecture.

Step 1 — Immediate Dashboard Publication

The Dashboard Remains the Source of Truth

Contradictions still appear instantly on the public Truth Dashboard.

There is:

No delay

No censorship

No hidden filtering

The dashboard remains the primary source of truth.

Step 2 — Human Suppression Window

Adding a Safety Layer Before Public Amplification

Social media publication is delayed for 10 minutes.

During that window:

The system alerts the Admin Room

Confidence scores are reviewed

False positives can be suppressed before publication

This creates a safety layer without hiding operational data.
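The window logic can be sketched in a few lines. This is an illustrative sketch under assumptions: `PendingPost`, `suppress`, and `ready_to_publish` are hypothetical names, and only the 10-minute delay comes from the design described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

SUPPRESSION_WINDOW = timedelta(minutes=10)  # delay stated in the post


@dataclass
class PendingPost:
    """Hypothetical queue entry for a contradiction awaiting social publication."""
    contradiction_id: str
    detected_at: datetime
    suppressed_reason: Optional[str] = None

    def suppress(self, reason: str) -> None:
        """An admin marks this contradiction as a false positive."""
        self.suppressed_reason = reason

    def ready_to_publish(self, now: datetime) -> bool:
        """Publishable only if not suppressed and the full window has elapsed."""
        if self.suppressed_reason is not None:
            return False
        return now - self.detected_at >= SUPPRESSION_WINDOW
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the rule easy to test. The dashboard is not involved here at all; this gate applies only to social media publication.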

Step 3 — Confidence Thresholds

Public Claims Require Strong Validation

The system only allows social media publication if strict validation conditions are met.

Requirements include:

Confidence score ≥ 0.95

At least 5 consecutive failed checks

Contradiction sustained across validation gates

This reduces the chance of noisy or unstable measurements becoming public claims.
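Expressed as code, the gate is a simple conjunction. The sketch below is hypothetical: the two numeric thresholds come from the list above, while `ContradictionState`, `may_publish`, and the `still_contradicting` flag (a stand-in for "sustained across validation gates") are assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.95         # threshold stated in the post
MIN_CONSECUTIVE_FAILURES = 5  # threshold stated in the post


@dataclass(frozen=True)
class ContradictionState:
    """Hypothetical snapshot of a contradiction at publication time."""
    confidence: float
    consecutive_failures: int
    still_contradicting: bool  # assumed stand-in for the validation gates


def may_publish(state: ContradictionState) -> bool:
    """All conditions must hold; failing any one blocks publication."""
    return (
        state.confidence >= MIN_CONFIDENCE
        and state.consecutive_failures >= MIN_CONSECUTIVE_FAILURES
        and state.still_contradicting
    )
```

Because every condition must pass, a single noisy check or a brief confidence dip is enough to keep a claim off social media.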

Step 4 — Immutable Audit Logging

No Silent Intervention Allowed

I also didn’t want silent intervention.

So if an admin suppresses publication, the suppression itself becomes an audit event.

That event includes:

Timestamp

Reason for suppression

Associated contradiction data

Nothing disappears silently.
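One way to make such a log tamper-evident is to chain each event to the previous one with a hash. The sketch below is an assumption about how this could be done, not a description of Reliastra's storage: `SuppressionEvent`, `AuditLog`, and the SHA-256 chaining scheme are all hypothetical. Only the three recorded fields come from the list above.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first event


def _digest(timestamp: str, reason: str, contradiction: dict, prev_hash: str) -> str:
    """Hash the event fields together with the previous event's hash."""
    payload = json.dumps(
        {
            "timestamp": timestamp,
            "reason": reason,
            "contradiction": contradiction,
            "prev_hash": prev_hash,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class SuppressionEvent:
    timestamp: str        # ISO 8601 string
    reason: str           # why the admin suppressed publication
    contradiction: dict   # associated contradiction data
    prev_hash: str
    hash: str


class AuditLog:
    """Append-only, hash-chained log (in memory, for illustration only)."""

    def __init__(self) -> None:
        self._events: List[SuppressionEvent] = []

    def record(self, timestamp: str, reason: str, contradiction: dict) -> SuppressionEvent:
        prev_hash = self._events[-1].hash if self._events else GENESIS_HASH
        event = SuppressionEvent(
            timestamp, reason, contradiction, prev_hash,
            _digest(timestamp, reason, contradiction, prev_hash),
        )
        self._events.append(event)
        return event

    def verify(self) -> bool:
        """Recompute every hash; any edited or removed event breaks the chain."""
        prev = GENESIS_HASH
        for e in self._events:
            if e.prev_hash != prev:
                return False
            if e.hash != _digest(e.timestamp, e.reason, e.contradiction, e.prev_hash):
                return False
            prev = e.hash
        return True
```

A chain like this does not prevent tampering by itself, but it makes later edits detectable, which is the property that matters for "nothing disappears silently."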

What This Changed In My Thinking

Infrastructure Engineering Is Often About Refusal

While designing this system, I realized something important about infrastructure engineering.

A lot of engineering is not about adding features.

It’s about refusing dangerous ones.

Some technical ideas look impressive in demos but become extremely risky under real operational conditions.

Especially when systems involve:

Public trust

Reputation

Financial consequences

Legal exposure

That changed how I think about reliability systems entirely.

The Bigger Lesson

Reliability Systems Need Restraint, Not Just Speed

Most monitoring systems optimize for speed.

But systems that influence public trust need something equally important:

Restraint.

Sometimes the most important engineering question is not:

“What can this system do?”

But instead:

“What should this system never be allowed to do automatically?”

That distinction completely changed my approach to system design.

Why I’m Writing About This Publicly

Documenting the Reasoning Behind the Architecture

Reliastra is still being built.

It’s not finished.

But I’ve started documenting these architecture decisions publicly because I think the reasoning behind systems matters just as much as the implementation itself.

This is my first post, and hopefully the first of many engineering notes as I continue learning and building.
