Most uptime monitors tell you your API is down. Mine tells you why.
I got tired of waking up to a vague "monitor failed" alert with zero context. Is it a DNS issue? Did the server crash? Is it a TLS problem? You have no idea until you log in, dig through logs, and piece it together yourself.
So when I built Pulse — my own API monitoring tool — I made root cause analysis the core feature. Here's how I implemented it.
Most monitors do something like this:
const response = await axios.get(url, { validateStatus: () => true }); // don't throw on non-2xx
if (response.status !== 200) {
  sendAlert('monitor is down');
}
That tells you nothing. You know the request failed. You don't know where.
An HTTP request isn't a single operation — it's a pipeline of stages. DNS lookup, TCP connection, TLS handshake, time to first byte. Each stage can fail independently and each failure means something completely different.
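To make that concrete, here's the failure taxonomy as a plain mapping. The stage labels and meanings are my summary of the idea, not Pulse's actual schema or output:

```javascript
// Each stage of an HTTP request, and what a failure there typically means.
// Illustrative mapping only — the labels are mine.
const stages = {
  dnsLookup:    'resolver down, expired domain, broken DNS records',
  tcpConnect:   'host unreachable, port closed, firewall drop',
  tlsHandshake: 'expired certificate, protocol mismatch',
  ttfb:         'app crashed, overloaded server, slow database query',
};

for (const [stage, meaning] of Object.entries(stages)) {
  console.log(`${stage}: ${meaning}`);
}
```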
Switching to native http with timing hooks
Axios doesn't expose per-stage timing. Node's built-in http/https module does via socket events. I rewrote the ping function to capture each stage separately:
const https = require('https');
const http = require('http');

const timings = {};
const startTime = Date.now();
const transport = options.protocol === 'https:' ? https : http;

const req = transport.request(options, (res) => {
  timings.ttfb = Date.now() - startTime;
  res.resume(); // drain the body so 'end' actually fires
  res.on('end', () => {
    timings.total = Date.now() - startTime;
  });
});

req.on('socket', (socket) => {
  socket.on('lookup', () => {
    timings.dnsLookup = Date.now() - startTime;
  });
  socket.on('connect', () => {
    // 'lookup' doesn't fire for IP targets or reused sockets, so guard it
    timings.tcpConnect = Date.now() - startTime - (timings.dnsLookup || 0);
  });
  // HTTPS only — never fires for plain HTTP
  socket.on('secureConnect', () => {
    timings.tlsHandshake =
      Date.now() - startTime - (timings.dnsLookup || 0) - timings.tcpConnect;
  });
});

req.on('error', () => {}); // a failed stage still leaves the earlier timings
req.end();
Now every ping stores dns_lookup_ms, tcp_connect_ms, tls_handshake_ms, and ttfb_ms separately in the database, alongside the usual status_code and response_time_ms.
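A sketch of what one stored row ends up looking like. The object shape is my own illustration; only the column names come from the post, and the values are made up:

```javascript
// Hypothetical shape of one ping row — field names mirror the columns
// mentioned above; monitor_id and all values are illustrative.
const pingRow = {
  monitor_id: 42,
  status_code: 200,
  response_time_ms: 184,
  dns_lookup_ms: 31,
  tcp_connect_ms: 27,
  tls_handshake_ms: 68, // null for plain-HTTP monitors
  ttfb_ms: 142,
};

console.log(pingRow);
```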
The inference logic
With per-stage timings stored, I wrote a pure function that compares the failed ping against the historical baseline for that monitor and infers the likely cause:
function inferRootCause({ dnsRatio, tcpConnectMs, ttfbRatio, statusCode }) {
  // DNS spiked but TCP was fine — DNS issue
  if (dnsRatio > 3) {
    return { cause: 'DNS resolution failure', confidence: 75 }
  }
  // TCP failed entirely — server unreachable
  if (!tcpConnectMs) {
    return { cause: 'Server unreachable — connection refused', confidence: 85 }
  }
  // Everything fine until TTFB — server-side problem
  if (ttfbRatio > 5) {
    return { cause: 'Upstream server overload or slow database query', confidence: 78 }
  }
  // Status code tells us exactly what happened
  if (statusCode === 503) {
    return { cause: 'Service unavailable — server overloaded or in maintenance', confidence: 92 }
  }
  // (more rules elided) fall through to a generic answer
  return { cause: 'Unknown', confidence: 0 }
}
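The ratio inputs (dnsRatio, ttfbRatio) are just the failed ping's timings divided by the monitor's baseline. A minimal sketch, with the field names assumed rather than taken from Pulse:

```javascript
// Divide the failed ping's per-stage timings by the monitor's historical
// baseline, guarding against missing or zero baseline values.
// Field names here are my own assumption.
function stageRatios(ping, baseline) {
  const ratio = (value, base) => (value && base ? value / base : null);
  return {
    dnsRatio: ratio(ping.dnsLookupMs, baseline.dnsLookupMs),
    tcpRatio: ratio(ping.tcpConnectMs, baseline.tcpConnectMs),
    tlsRatio: ratio(ping.tlsHandshakeMs, baseline.tlsHandshakeMs),
    ttfbRatio: ratio(ping.ttfbMs, baseline.ttfbMs),
  };
}

const ratios = stageRatios(
  { dnsLookupMs: 34, tcpConnectMs: 28, tlsHandshakeMs: 71, ttfbMs: 8432 },
  { dnsLookupMs: 35, tcpConnectMs: 30, tlsHandshakeMs: 70, ttfbMs: 150 },
);
// ratios.ttfbRatio works out to roughly 56 — every other stage stays near 1
```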
No ML, no black box. Just rule-based inference against a baseline. A 503 with normal DNS/TCP/TLS timings but a spiked TTFB looks completely different from a connection timeout with no TCP at all.
What it looks like in practice
When a monitor goes down, instead of just logging the failure, Pulse shows:
Root Cause Analysis
Likely cause: Upstream server overload (78% confidence)
DNS Lookup → 34ms normal
TCP Connect → 28ms normal
TLS Handshake → 71ms normal
Time to First Byte → 8432ms CRITICAL (56x baseline)
Suggestion: Server is responding but very slowly —
check database queries and server load
That's immediately actionable. You know it's not a network problem. It's not DNS. The server is reachable but something on the backend is choking.
The baseline problem
The tricky part was making the comparisons meaningful. A 200ms TTFB is great for one endpoint and terrible for another. I compute a rolling baseline from the last 20 successful pings for each monitor individually, so the thresholds adapt to the normal behavior of that specific endpoint.
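A sketch of that rolling baseline. I'm assuming a simple mean over the last 20 successful pings; Pulse may well use a median or something more robust:

```javascript
// Per-monitor baseline: mean of each timing field over the last N
// successful pings. Assumes `pings` is ordered newest-first and each
// ping has an `ok` flag — both are my assumptions, not Pulse's schema.
function rollingBaseline(pings, n = 20) {
  const recent = pings.filter((p) => p.ok).slice(0, n);
  if (recent.length === 0) return null; // no history yet — skip inference

  const fields = ['dnsLookupMs', 'tcpConnectMs', 'tlsHandshakeMs', 'ttfbMs'];
  const baseline = {};
  for (const field of fields) {
    const sum = recent.reduce((acc, p) => acc + (p[field] || 0), 0);
    baseline[field] = sum / recent.length;
  }
  return baseline;
}
```

Because the window is per monitor, a chronically slow endpoint raises its own bar instead of tripping a global threshold.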
What I learned
The biggest insight was that most of the value isn't in the ML or the fancy inference — it's just in capturing the right data at ping time. Once you have per-stage timings stored, the analysis is mostly pattern matching. The hard part was switching from axios to raw http and making sure the timing hooks fired reliably across both HTTP and HTTPS endpoints.
The second thing I learned: storing this data costs almost nothing. Four extra integer columns per ping row. The diagnostic value is completely disproportionate to the storage cost.
Pulse is free — 5 monitors, no credit card. If you want to see the root cause analysis in action or poke around the implementation: Pulse
Happy to hear opinions!
Top comments (2)
The insight isn't the timing data; it's that you built the baseline into the inference. Most monitoring tools compare against static thresholds. Yours compares the failure against itself. That's the difference between an alert and a diagnosis.
Might be off-topic but how do I get people to actually try out what I've built? I need user feedback which I'm unable to receive.