DEV Community

Furozq

How I Detect Website Failures 60 Seconds Before They Happen (Without Heavy ML)

Most monitoring tools answer one question:

“Is it up?”

I wanted to answer a different one:

“Is it about to go down?”

The Problem with Traditional Uptime Checks
Downtime rarely happens instantly.

In real-world systems, failure usually looks like this:

  • T-5 minutes → response time slowly climbs (200ms → 400ms)
  • T-2 minutes → latency spikes, occasional timeouts
  • T-1 minute → error rate increases sharply
  • T-0 → service crash

Traditional uptime monitoring only checks availability.

It misses these degradation patterns entirely.

The Core Idea: Trend + Volatility > Status
Instead of checking:

isAlive = true / false

I started tracking:

  • Response time trend
  • Slope direction
  • Variance (volatility)
  • Consecutive instability signals

Because instability is usually visible before failure.

The Lightweight Prediction Model

No heavy ML.
No TensorFlow.
No GPU.

Just math.

1️⃣ Exponential Moving Average (EMA)

EMA smooths out noise while preserving trend.

A single spike doesn’t trigger an alert.
But a gradual climb does.
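
A minimal sketch of the smoothing step, assuming latency samples arrive as plain millisecond values. The function name `ema` and the smoothing factor `alpha = 0.2` are illustrative assumptions; the post does not state which values are used.

```typescript
// Exponential moving average over latency samples (milliseconds).
// alpha is an ASSUMED smoothing factor: higher values react faster
// to new samples, lower values smooth more aggressively.
function ema(samples: number[], alpha = 0.2): number[] {
  if (alpha <= 0 || alpha > 1) {
    throw new RangeError("alpha must be in (0, 1]");
  }
  const smoothed: number[] = [];
  let prev: number | undefined;
  for (const x of samples) {
    // Seed the average with the first sample, then blend each new one in.
    prev = prev === undefined ? x : alpha * x + (1 - alpha) * prev;
    smoothed.push(prev);
  }
  return smoothed;
}

// A lone 2000ms spike is damped rather than passed through at full size.
const spikeSeries = ema([200, 200, 2000, 200]);
console.log(spikeSeries);
```

With `alpha = 0.2`, only a fifth of each new sample reaches the average, so a single outlier moves it far less than a sustained climb does over several samples.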

2️⃣ Linear Regression (Slope Detection)

If latency is trending upward, regression tells me:

  • How fast it’s increasing
  • Where it will likely be in 5–15 minutes

If projected latency crosses a risk threshold → risk score increases.
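
The slope and projection can be sketched with an ordinary least-squares fit. The names `Sample`, `fitTrend`, and `projectLatency` are hypothetical, and the one-sample-per-minute cadence in the usage lines is an assumption for illustration only.

```typescript
// One latency observation: t in seconds, ms in milliseconds.
interface Sample { t: number; ms: number }
interface Trend { slope: number; intercept: number }

// Ordinary least-squares line through (t, ms) points.
// slope is in ms per second; a positive slope means latency is climbing.
function fitTrend(samples: Sample[]): Trend {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  const meanT = samples.reduce((sum, p) => sum + p.t, 0) / n;
  const meanMs = samples.reduce((sum, p) => sum + p.ms, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const p of samples) {
    const dt = p.t - meanT;
    sxy += dt * (p.ms - meanMs);
    sxx += dt * dt;
  }
  if (sxx === 0) throw new Error("timestamps must not all be identical");
  const slope = sxy / sxx;
  return { slope, intercept: meanMs - slope * meanT };
}

// Extrapolate the fitted line to a future timestamp (seconds).
function projectLatency(trend: Trend, atTime: number): number {
  return trend.intercept + trend.slope * atTime;
}

// Latency climbing 50ms per minute, projected 5 minutes past the last sample.
const history: Sample[] = [200, 250, 300, 350, 400].map((ms, i) => ({ t: i * 60, ms }));
const trend = fitTrend(history);
const inFiveMinutes = projectLatency(trend, 240 + 300);
console.log(trend.slope, inFiveMinutes);
```

A straight-line projection is only trustworthy over short horizons, which fits the 5–15 minute window mentioned above.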

3️⃣ Variance Analysis

A stable 200ms ± 20ms system is healthy.

A 200ms average swinging between 50ms and 2000ms is unstable.

Variance exposes hidden risk that averages hide.
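
To make the contrast concrete, here is a small sketch that computes mean, variance, and standard deviation for a window of samples. The name `latencyStats` and the use of population variance (dividing by n) are assumptions, not details from the post.

```typescript
interface LatencyStats { mean: number; variance: number; stdDev: number }

// Population variance over a window of latency samples (milliseconds).
function latencyStats(samples: number[]): LatencyStats {
  const n = samples.length;
  if (n === 0) throw new Error("need at least one sample");
  const mean = samples.reduce((sum, x) => sum + x, 0) / n;
  const variance = samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  return { mean, variance, stdDev: Math.sqrt(variance) };
}

// Both windows average 200ms, but their spread is very different.
const steady = latencyStats([180, 220, 200, 190, 210]);
const spiky = latencyStats([...new Array<number>(12).fill(50), 2000]);
console.log(steady.stdDev, spiky.stdDev);
```

Comparing the standard deviation to the mean gives a unit-free measure of volatility that can be compared across endpoints with different baseline latencies.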

Risk Scoring
All signals combine into a 0–100 instability score.

Instead of binary alerts, I get graded warning levels:

  • 0–29 → stable
  • 30–59 → degrading
  • 60–100 → likely incident

This allows earlier, smarter alerts.
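
One way the signals could be combined is a weighted sum clamped to 0–100. Everything specific here is an assumption for illustration: the `Signals` shape, the 0.5 / 0.3 / 0.2 weights, the 1000ms risk threshold, and the 5-check streak cap are not taken from the post. The band boundaries treat 30 and 60 as the start of the higher level.

```typescript
// Inputs to the score. All field names are hypothetical.
interface Signals {
  projectedMs: number;    // latency projected by the regression step
  meanMs: number;         // smoothed (EMA) latency
  stdDevMs: number;       // spread of recent samples
  unstableStreak: number; // consecutive checks flagged as unstable
}

type Level = "stable" | "degrading" | "likely incident";

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// ASSUMED weights and limits; tune against real incident data.
function instabilityScore(s: Signals, thresholdMs = 1000): number {
  const trendRisk = clamp01(s.projectedMs / thresholdMs);
  // Coefficient of variation, capped at 1; guarded against a zero mean.
  const volatilityRisk = s.meanMs > 0 ? clamp01(s.stdDevMs / s.meanMs) : 0;
  const streakRisk = clamp01(s.unstableStreak / 5);
  return Math.round(100 * (0.5 * trendRisk + 0.3 * volatilityRisk + 0.2 * streakRisk));
}

function levelFor(score: number): Level {
  if (score < 30) return "stable";
  if (score < 60) return "degrading";
  return "likely incident";
}

const score = instabilityScore({ projectedMs: 650, meanMs: 400, stdDevMs: 80, unstableStreak: 1 });
console.log(score, levelFor(score));
```

Keeping each signal normalized to 0–1 before weighting makes it easy to see which one is driving an alert.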

The Result
In controlled stress tests, the system flagged instability 60–90 seconds before actual downtime, while the service was still technically “up.”

That window is enough to:

  • Scale horizontally
  • Trigger failover
  • Enable CDN fallback
  • Alert on-call engineers

Why Not Machine Learning?

I initially experimented with ML models.

They were:

  • Slower
  • Harder to tune
  • Resource heavy
  • Not meaningfully more accurate

For this workload, well-tuned statistical methods matched them on accuracy while being faster and cheaper to run.

Sometimes simple math beats complex AI.

Built with:

  • Node.js + TypeScript
  • SQLite
  • Single VPS (~$6/month)
  • No Kubernetes

If you're building infrastructure tools, you don't always need complexity.

Sometimes you just need the right signal.

I’m building this as ORVO AI. Feedback from fellow builders is always welcome.
