Most monitoring tools answer one question:
“Is it up?”
I wanted to answer a different one:
“Is it about to go down?”
The Problem with Traditional Uptime Checks
Downtime rarely happens instantly.
In real-world systems, failure usually looks like this:
- T-5 minutes → response time slowly climbs (200ms → 400ms)
- T-2 minutes → latency spikes, occasional timeouts
- T-1 minute → error rate increases sharply
- T-0 → service crash
A traditional uptime check only tests availability at the moment of the probe.
It ignores the degradation pattern that leads up to the crash.
The Core Idea: Trend + Volatility > Status
Instead of checking:
isAlive = true / false
I started tracking:
- Response time trend
- Slope direction
- Variance (volatility)
- Consecutive instability signals
Because instability is usually visible before failure.
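As a rough sketch of what tracking these signals could involve, the snippet below keeps a fixed-size window of recent checks and counts the trailing run of unstable ones. The names `Check`, `RollingWindow`, and `unstableStreak`, the window size of 30, and the 500 ms "slow" cutoff are all illustrative assumptions, not values from the actual tool.

```typescript
// Hypothetical shape of one health check result.
interface Check {
  t: number;          // timestamp in seconds
  latencyMs: number;  // measured response time
  ok: boolean;        // false on timeout or error response
}

// Fixed-size window of the most recent checks (capacity is an assumed default).
class RollingWindow {
  private checks: Check[] = [];
  private readonly capacity: number;

  constructor(capacity = 30) {
    this.capacity = capacity;
  }

  push(check: Check): void {
    this.checks.push(check);
    if (this.checks.length > this.capacity) this.checks.shift(); // drop oldest
  }

  latencies(): number[] {
    return this.checks.map((c) => c.latencyMs);
  }

  // Number of most recent checks in a row that failed or were slower than slowMs.
  unstableStreak(slowMs = 500): number {
    let streak = 0;
    for (const c of [...this.checks].reverse()) {
      if (c.ok && c.latencyMs <= slowMs) break;
      streak++;
    }
    return streak;
  }
}

const recentChecks = new RollingWindow(3);
recentChecks.push({ t: 0, latencyMs: 200, ok: true });
recentChecks.push({ t: 60, latencyMs: 250, ok: true });
recentChecks.push({ t: 120, latencyMs: 600, ok: true });
recentChecks.push({ t: 180, latencyMs: 0, ok: false });
```

The window feeds the trend and variance calculations, while the streak captures the "consecutive instability" signal on its own.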
The Lightweight Prediction Model
No heavy ML.
No TensorFlow.
No GPU.
Just math.
1️⃣ Exponential Moving Average (EMA)
EMA smooths out noise while preserving trend.
A single spike doesn’t trigger an alert.
But a gradual climb does.
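A minimal sketch of the smoothing step, assuming a smoothing factor `alpha` of 0.2 (an illustrative choice; the function name `ema` is also hypothetical):

```typescript
// Exponential moving average over latency samples (ms).
// alpha in (0, 1]: higher values react faster, lower values smooth more.
function ema(samples: number[], alpha = 0.2): number[] {
  if (alpha <= 0 || alpha > 1) throw new RangeError("alpha must be in (0, 1]");
  const smoothed: number[] = [];
  let prev: number | undefined;
  for (const x of samples) {
    // Seed with the first sample, then blend each new sample into the running value.
    prev = prev === undefined ? x : alpha * x + (1 - alpha) * prev;
    smoothed.push(prev);
  }
  return smoothed;
}

// A single 1000 ms spike is damped to 360 ms and then decays back toward 200 ms.
const spike = ema([200, 200, 1000, 200, 200]);

// A steady climb keeps pushing the smoothed value upward at every step.
const climb = ema([200, 240, 280, 320, 360, 400]);
```

Because the smoothed value lags the raw samples, an alert rule built on it trades a little reaction time for far fewer false positives.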
2️⃣ Linear Regression (Slope Detection)
If latency is trending upward, regression tells me:
- How fast it’s increasing
- Where it will likely be in 5–15 minutes
If the projected latency crosses a risk threshold, the risk score increases.
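The projection can be sketched with an ordinary least-squares fit over the recent window. The `Sample` type and the names `fitLine` and `projectLatency` are assumptions for illustration, and the sample data is made up.

```typescript
interface Sample {
  t: number;          // seconds since the start of the window
  latencyMs: number;
}

// Ordinary least-squares fit of latency against time.
function fitLine(samples: Sample[]): { slope: number; intercept: number } {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  const meanT = samples.reduce((sum, p) => sum + p.t, 0) / n;
  const meanY = samples.reduce((sum, p) => sum + p.latencyMs, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const p of samples) {
    const dt = p.t - meanT;
    sxy += dt * (p.latencyMs - meanY);
    sxx += dt * dt;
  }
  if (sxx === 0) throw new Error("all samples share the same timestamp");
  const slope = sxy / sxx; // ms of latency gained per second
  return { slope, intercept: meanY - slope * meanT };
}

// Extrapolate the fitted line horizonSec past the newest sample.
function projectLatency(samples: Sample[], horizonSec: number): number {
  const { slope, intercept } = fitLine(samples);
  const lastT = Math.max(...samples.map((p) => p.t));
  return intercept + slope * (lastT + horizonSec);
}

// Illustrative data: latency climbing 50 ms per minute.
const recent: Sample[] = [
  { t: 0, latencyMs: 200 },
  { t: 60, latencyMs: 250 },
  { t: 120, latencyMs: 300 },
  { t: 180, latencyMs: 350 },
  { t: 240, latencyMs: 400 },
];
const inFiveMinutes = projectLatency(recent, 300); // about 650 ms
```

A straight-line extrapolation is only trustworthy over short horizons, which fits the 5–15 minute range above.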
3️⃣ Variance Analysis
A stable 200ms ± 20ms system is healthy.
A 200ms average swinging between 50ms and 2000ms is unstable.
Variance exposes the risk that averages hide.
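To make the contrast concrete, here is a small sketch with two made-up sample sets that share the same 200 ms mean. The helper names are hypothetical, and population variance is used for simplicity.

```typescript
function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error("no samples");
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population variance: average squared distance from the mean.
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) * (x - m), 0) / xs.length;
}

function stdDev(xs: number[]): number {
  return Math.sqrt(variance(xs));
}

// Steady service: samples stay within ±20 ms of 200 ms.
const steady: number[] = [180, 220, 200, 190, 210];

// Erratic service: twelve fast 50 ms responses and one 2000 ms stall.
const erratic: number[] = [...new Array<number>(12).fill(50), 2000];

console.log(mean(steady), stdDev(steady).toFixed(1));   // 200 "14.1"
console.log(mean(erratic), stdDev(erratic).toFixed(1)); // 200 "519.6"
```

Both services report the same average, but the standard deviation differs by more than an order of magnitude.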
Risk Scoring
All signals combine into a 0–100 instability score.
Instead of binary alerts, I get graded warning levels:
- 0–29 → stable
- 30–59 → degrading
- 60–100 → likely incident
This allows earlier, smarter alerts.
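The exact weighting is not spelled out here, so the following is only one plausible way to combine the signals. Every threshold and weight in it (`LATENCY_RISK_MS`, `STDDEV_RISK_MS`, `BAD_CHECKS_RISK`, and the 0.4 / 0.35 / 0.25 split) is an illustrative assumption, and a score that lands exactly on a boundary is assigned to the higher level.

```typescript
interface Signals {
  projectedLatencyMs: number;   // e.g. from a regression projection
  stdDevMs: number;             // latency volatility over the window
  consecutiveBadChecks: number; // trailing run of slow or failed checks
}

// Assumed saturation points: at or above these, a signal contributes its full weight.
const LATENCY_RISK_MS = 1000;
const STDDEV_RISK_MS = 500;
const BAD_CHECKS_RISK = 5;

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Weighted sum of normalized signals, scaled to 0–100. Weights are assumptions.
function instabilityScore(s: Signals): number {
  const trend = clamp01(s.projectedLatencyMs / LATENCY_RISK_MS);
  const volatility = clamp01(s.stdDevMs / STDDEV_RISK_MS);
  const streak = clamp01(s.consecutiveBadChecks / BAD_CHECKS_RISK);
  return Math.round(100 * (0.4 * trend + 0.35 * volatility + 0.25 * streak));
}

type Level = "stable" | "degrading" | "likely incident";

function level(score: number): Level {
  if (score < 30) return "stable";
  if (score < 60) return "degrading";
  return "likely incident";
}

const calm = instabilityScore({ projectedLatencyMs: 220, stdDevMs: 20, consecutiveBadChecks: 0 });
const shaky = instabilityScore({ projectedLatencyMs: 650, stdDevMs: 300, consecutiveBadChecks: 2 });
console.log(calm, level(calm));   // 10 "stable"
console.log(shaky, level(shaky)); // 57 "degrading"
```

Clamping each signal before weighting keeps one extreme reading from pushing the score past 100.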
The Result
In controlled stress tests, the system flagged instability 60–90 seconds before actual downtime, while the service was still technically “up.”
That window is enough to:
- Scale horizontally
- Trigger failover
- Enable CDN fallback
- Alert on-call engineers
Why Not Machine Learning?
I initially experimented with ML models.
They were:
- Slower
- Harder to tune
- Resource heavy
- Not meaningfully more accurate
In my tests, well-tuned statistical methods outperformed them overall.
Sometimes simple math beats complex AI.
Built with:
- Node.js + TypeScript
- SQLite
- Single VPS (~$6/month)
- No Kubernetes
If you're building infrastructure tools, you don't always need complexity.
Sometimes you just need the right signal.
I’m building this as ORVO AI. Feedback from fellow builders is always welcome.