Mixa_Dev

Posted on May 19

My Alert System Fired for the First Time. The Engine Was Working Perfectly.

#monitoring #devops #postmortem #webdev

I built an alert engine for my own product last week. Yesterday afternoon, it fired for the first time.

This is the story of what happened, why "no signal" turned out to be the correct answer, and why I'd build the same system again tomorrow.

The Alert

It was 16:57 on a Sunday afternoon when my phone buzzed. The subject line read:

🚨 LeadEdge: No signal for 29.7 hours

The body was unambiguous:

The LeadEdge signal engine has not fired in 29.7 hours.
This is during US market hours when signal frequency should be high.
Last signal: sig_e263b43b2bc351fa at 2026-05-16T09:13:25.737Z

For context: LeadEdge is a cross-exchange signal API that detects price moves on Binance Futures ETH and flags them as predictive signals for Coinbase Spot. The engine normally fires around 14 signals per day. 29.7 hours of silence wasn't normal.
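The check behind an alert like this can be very small. What follows is a minimal sketch, not LeadEdge's actual code: the function names and the 24-hour `max_silence_hours` default are assumptions made for illustration, since the real cutoff isn't stated here. Only the last-signal timestamp and the 29.7-hour figure come from the alert above.

```python
from datetime import datetime, timezone
from typing import Optional


def hours_since(last_signal_iso: str, now: datetime) -> float:
    """Hours elapsed between an ISO-8601 UTC timestamp and `now`."""
    # Older Python versions do not parse a trailing "Z", so normalize it.
    last = datetime.fromisoformat(last_signal_iso.replace("Z", "+00:00"))
    return (now - last).total_seconds() / 3600.0


def silence_alert(last_signal_iso: str, now: datetime,
                  max_silence_hours: float = 24.0) -> Optional[str]:
    """Return an alert subject if the silence exceeds the cutoff, else None.

    max_silence_hours is a hypothetical default, not the real setting.
    """
    silent_for = hours_since(last_signal_iso, now)
    if silent_for > max_silence_hours:
        return f"LeadEdge: No signal for {silent_for:.1f} hours"
    return None


last_signal = "2026-05-16T09:13:25.737Z"  # taken from the alert body
now = datetime(2026, 5, 17, 14, 55, 25, 737000, tzinfo=timezone.utc)
print(silence_alert(last_signal, now))  # LeadEdge: No signal for 29.7 hours
```

Note that a check like this knows nothing about why the engine is silent. It only knows that silence is unusual, which is all it needs to know to be useful.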

I was outside, ten minutes from my laptop.

The Panic

I'd deployed major changes that morning — adding first-tick lag tracking to the signal detector. A non-trivial refactor that introduced new code paths into the price update handler.

The first thought was the obvious one: I broke it.

The second thought was worse: it's been silent for 30 hours, not 5. The break happened before my deploy.

Either way, the engine had stopped producing signals, and signals were the product I was actively pitching to potential customers via DM at that exact moment. Not great.

The Diagnostic

I opened Railway logs first. What I expected to find: stack traces, uncaught exceptions, restart loops.

What I actually found:

[HEARTBEAT] Uptime: 95min | Memory: 15MB
[STORAGE] Total inserted: 26,481 | Errors: 0 | Buffer: 12
[STORAGE] Total inserted: 26,708 | Errors: 0 | Buffer: 4
[WS] Clients: 0 (free=0, pro=0) | Signals: 0 | Outcomes: 0 | Auth rejections: 0

Clean. The storage layer was inserting tens of thousands of price updates with zero errors. WebSocket server was running. Memory was stable. No exceptions in the logs.

But the most suspicious line: Signals: 0 | Outcomes: 0. The engine was running, storing data, but not generating signals.

Time to check the database.

Was Data Actually Flowing?

The first hypothesis: silent WebSocket failure on Binance Futures. I'd written a Dev.to article a few weeks back about exactly this failure mode: WebSocket connections that look healthy at the TCP level but have stopped delivering data. It happens. It's hard to detect. It nearly killed me on a separate project once.

So I ran this query in Supabase:

SELECT 
  exchange, 
  market_type,
  MAX(server_timestamp) AS last_update_ms,
  to_timestamp(MAX(server_timestamp)/1000.0) AS last_update_time,
  COUNT(*) FILTER (
    WHERE server_timestamp > (EXTRACT(EPOCH FROM NOW() - INTERVAL '1 hour') * 1000)
  ) AS updates_last_hour,
  COUNT(*) FILTER (
    WHERE server_timestamp > (EXTRACT(EPOCH FROM NOW() - INTERVAL '5 minutes') * 1000)
  ) AS updates_last_5min
FROM price_updates
WHERE server_timestamp > (EXTRACT(EPOCH FROM NOW() - INTERVAL '24 hours') * 1000)
GROUP BY exchange, market_type
ORDER BY exchange, market_type;

The result killed my hypothesis:

exchange	market_type	updates_last_hour	updates_last_5min
binance	futures	19,764	941
bybit	futures	3,672	150
bybit	spot	5,209	224
coinbase	spot	4,936	470

Binance Futures was firing at ~5.5 ticks per second. All exchanges were healthy. The data was flowing fine.

So why no signals?

The Reveal

If data was flowing and the engine was running, the only remaining possibility was that no tick-to-tick price change had crossed the 0.1% threshold that defines a signal.

For a signal to fire on LeadEdge, the price on Binance Futures ETH must move at least 0.1% between two consecutive ticks. At ~5 ticks per second, consecutive ticks are about 200 ms apart on average. A 0.1% move over that interval is a sudden jump, exactly the kind of dislocation that has predictive power for the follower exchange.
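To make the rule concrete, here is a minimal sketch of a tick-to-tick threshold check. It assumes prices arrive as a plain list in tick order. The function names and sample prices are invented for illustration, and the real detector (with its lag tracking and outcome handling) is certainly more involved.

```python
from typing import List, Tuple


def pct_change(prev_price: float, price: float) -> float:
    """Signed percentage change from prev_price to price."""
    return (price - prev_price) / prev_price * 100.0


def detect_signals(prices: List[float],
                   threshold_pct: float = 0.1) -> List[Tuple[int, float]]:
    """Return (index, pct_change) for each tick that moved at least
    threshold_pct relative to the tick immediately before it."""
    signals = []
    for i in range(1, len(prices)):
        change = pct_change(prices[i - 1], prices[i])
        if abs(change) >= threshold_pct:
            signals.append((i, change))
    return signals


# Illustrative prices, not real market data.
ticks = [2000.00, 2000.50, 2003.00, 2003.10, 2000.90]
for index, change in detect_signals(ticks):
    print(f"tick {index}: {change:+.3f}%")
```

With these sample prices only the third and fifth ticks qualify. The 0.50 and 0.10 moves are far too small, which is the normal case in a ranging market.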

But what if there were no such jumps?

I ran a second query, looking at the actual tick-to-tick changes:

SELECT 
  ROUND(MAX(ABS(price_change_pct))::numeric, 4) AS max_change_pct,
  ROUND(AVG(ABS(price_change_pct))::numeric, 4) AS avg_change_pct,
  ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY ABS(price_change_pct))::numeric, 4) AS p99_change_pct,
  COUNT(*) FILTER (WHERE ABS(price_change_pct) >= 0.1) AS ticks_over_threshold,
  COUNT(*) AS total_ticks
FROM price_updates
WHERE exchange = 'binance' 
  AND market_type = 'futures' 
  AND base_asset = 'ETH'
  AND server_timestamp > (EXTRACT(EPOCH FROM NOW() - INTERVAL '3 hours') * 1000)
  AND price_change_pct IS NOT NULL;

The result was definitive:

max_change_pct	avg_change_pct	p99_change_pct	ticks_over_threshold	total_ticks
0.0597	0.0011	0.0071	0	18,974

In three hours, across 18,974 tick-to-tick changes, the largest single move was 0.06%, about 60% of the threshold. The 99th percentile was 0.007%. The average was 0.001%, roughly a hundredth of the threshold.

ETH had been ranging in a $4 band on a quiet Sunday afternoon. There was nothing for the engine to fire on.

The engine wasn't broken. The market was just unusually calm.
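The same summary can be computed in application code, for instance to attach it to the alert itself. This is a sketch under assumptions: `summarize` and `percentile_cont` are hypothetical helpers, and the input is assumed to be a list of tick-to-tick percentage changes already loaded into memory. The percentile uses linear interpolation between neighbouring values, which is how SQL's `PERCENTILE_CONT` is defined.

```python
from typing import Dict, List


def percentile_cont(sorted_values: List[float], fraction: float) -> float:
    """Linearly interpolated percentile over pre-sorted values."""
    if not sorted_values:
        raise ValueError("no values")
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * weight


def summarize(changes_pct: List[float], threshold_pct: float = 0.1) -> Dict[str, float]:
    """Max, mean, p99, and over-threshold count of absolute changes."""
    if not changes_pct:
        raise ValueError("no tick changes in window")
    magnitudes = sorted(abs(c) for c in changes_pct)
    return {
        "max_change_pct": magnitudes[-1],
        "avg_change_pct": sum(magnitudes) / len(magnitudes),
        "p99_change_pct": percentile_cont(magnitudes, 0.99),
        "ticks_over_threshold": sum(1 for m in magnitudes if m >= threshold_pct),
        "total_ticks": len(magnitudes),
    }


# Illustrative values, not the real tick data.
print(summarize([0.01, -0.02, 0.03, -0.04, 0.05]))
```

A `ticks_over_threshold` of zero alongside a healthy `total_ticks` is the "calm market" signature from the query result above.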

The Lesson

This is exactly what the 0.1% threshold is supposed to do.

During pre-launch validation, I measured 90.7% follow-through on signals at that threshold across 9.4M price updates and 7 days of live data. The threshold is the product. Lowering it to "catch more signals" during quiet periods would tank the accuracy that justifies the API existing in the first place.

Selectivity isn't a bug. It's the entire value proposition.

The right behavior during a calm market is silence. Customers paying for cross-exchange signals don't want noise — they want the engine to fire only on moves that historically predict the follower exchange's response. If ETH is range-bound, the engine is doing its job by not firing.

The wrong response would have been to panic-lower the threshold to "fix" the silence. That would be a self-inflicted regression: converting a working signal stream into a noise stream to satisfy a vanity metric.

The Meta-Win

Here's the part that made the whole experience worth writing about.

The alert engine that emailed me yesterday was built two days earlier, in direct response to my own Dev.to article about WebSocket silent failures. I'd written about how TCP keepalive and process-level health checks can't catch a connection that's "alive" but no longer delivering data. The fix is application-aware monitoring — code that knows what the system is supposed to be doing, not just whether the process is running.

After writing that article, I realized something uncomfortable: I had no application-aware monitoring on my own system. If LeadEdge silently failed, I'd find out from a customer email, not a system alert.

So I built it. Two days later, it fired for the first time — not on a real failure, but on a market calm I'd otherwise have noticed days later, if at all.

The alert was technically a false positive (the system was healthy). But it caught something I hadn't been thinking about: my own implicit assumption that "signals every few hours" was a guaranteed property of the system. It isn't. It depends on market behavior. And now I have an early-warning channel for unusual silence, calibrated to expected behavior.

I'd rather be alerted falsely about a calm market than miss a real silent failure. The asymmetric cost is what makes monitoring like this worth building.

What I'd Tell Anyone Building Similar Systems

A few things this experience reinforced:

Build monitoring that knows your system's expected behavior, not just whether the process is alive. TCP keepalive, container health checks, and "is the process running" pings all said LeadEdge was fine. They couldn't tell the difference between "running and healthy" and "running and silently broken." Application-aware monitoring is the only thing that catches the latter.

Build it before you need it. I built the alert engine while LeadEdge's customer base was still small. If I'd waited until I had paying customers complaining about silence, the customers would have been the alert system. That's a bad place to be.

Trust validated parameters during unusual conditions. It would have been easy to interpret "30 hours of silence" as evidence the threshold was wrong. The right answer was the opposite: 30 hours of silence over a quiet weekend is evidence the threshold is working correctly.

False positives during calm markets are tolerable. False negatives during real failures are not. When calibrating an alert system, accept that you'll get woken up occasionally for nothing. The alternative — missing the alerts that actually matter — is much worse.
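One way to apply these points is to fold the manual diagnosis from this incident into the alert, so the email already says which kind of silence it is. The sketch below is hypothetical and is not something LeadEdge is described as doing: the `Snapshot` fields, the labels, and the 60-second and 24-hour cutoffs are all assumptions chosen for illustration. Only the 0.1% threshold comes from the product.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    seconds_since_last_tick: float  # freshness of the leader-exchange feed
    hours_since_last_signal: float  # the value the silence alert reports
    max_abs_change_pct: float       # largest tick-to-tick move in the lookback window


def classify_silence(s: Snapshot,
                     max_tick_gap_s: float = 60.0,      # assumed cutoff
                     max_silence_hours: float = 24.0,   # assumed cutoff
                     threshold_pct: float = 0.1) -> str:
    """Label a period of signal silence using feed freshness and volatility."""
    if s.seconds_since_last_tick > max_tick_gap_s:
        return "feed_stale"        # process may be alive, but data stopped arriving
    if s.hours_since_last_signal <= max_silence_hours:
        return "healthy"           # silence is still within the expected range
    if s.max_abs_change_pct < threshold_pct:
        return "calm_market"       # data is flowing, nothing crossed the threshold
    return "detector_suspect"      # qualifying moves arrived but no signal fired


# Figures from this incident: fresh ticks, 29.7 h of silence, max move 0.0597%.
print(classify_silence(Snapshot(0.2, 29.7, 0.0597)))  # calm_market
```

The classification is only as good as its lookback window. The manual query here covered three hours, while a stricter version would measure the largest move across the whole silent period before calling it a calm market.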

The signal stream resumed firing about an hour after I started writing this. ETH had moved.

If you want to inspect signals in real time, the free tier is at leadedge.dev. The methodology and raw validation data are in my earlier thread. The original article on silent WebSocket failures (the one that prompted me to build the alert engine in the first place) is here on Dev.to.