Two weeks ago, my crypto signal API silently failed for 22 hours.
No errors. No exceptions. No crash. The service kept running, logs continued to flow, my deployment dashboard showed everything green. I only noticed when I happened to check the database and realized no new data had been written for almost a full day.
The culprit? My WebSocket connection to Binance. It was "connected" — but it hadn't received a message in hours.
This is the silent staleness problem. And TCP keepalive can't catch it.
If you've ever built a system that consumes a long-lived WebSocket feed (price data, chat messages, IoT telemetry, log streams), you're vulnerable to this exact failure mode. Here's what's happening and how to fix it.
The illusion of "connected"
When your client opens a WebSocket connection, the underlying TCP socket goes through a handshake. From then on, "connected" really means: there's an open TCP socket between you and the server, and TCP believes the route is alive.
That's it.
TCP keepalive (when enabled) sends periodic zero-payload probe packets to verify the route is still reachable. The OS handles this for you. If the peer stops acknowledging the probes, you'll eventually get a connection-closed error.
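For reference, this is roughly what enabling keepalive looks like on a raw socket in Python. The tuning values are illustrative, and the `TCP_KEEP*` constants are platform-specific (they exist on Linux):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning knobs (values here are illustrative):
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before the OS gives up
```

Note that all of this lives below the application layer — which is exactly why it can't catch the failure described next.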
But here's what TCP can't see:
- Whether the application on the other end is still pushing messages
- Whether a proxy or load balancer between you and the server has dropped your subscription
- Whether a backend bug stopped emitting events while keeping the connection open
Your WebSocket can look perfectly healthy at the TCP layer while application data has stopped flowing entirely.
In my case, Binance's WebSocket gateway accepted my connection, accepted my subscriptions, and then stopped pushing ticker updates. The TCP socket was fine. The OS was fine. My code was fine. The data was gone.
Why naive fixes don't work
The first instinct is: "I'll just reconnect on error." But the application never errors. No exception fires. The connection is perfectly alive — there's just nothing coming through.
The second instinct: "I'll add a watchdog timer that pings the server." This is closer to right but has a flaw — many services (including most exchange feeds) don't respond to client pings on data WebSockets. Your ping goes out, returns silence, and you can't distinguish "server doesn't ping back" from "server is broken."
The third instinct: "I'll send a subscribe message and check for confirmation." This catches startup failures but not mid-stream failures.
What actually works is much simpler:
Track the time of the last message received. If it exceeds a threshold, the stream is stale — regardless of what TCP thinks.
Implementing message-level staleness detection
Here's the pattern in Python:
```python
import asyncio
import json
import time

import websockets

STALENESS_TIMEOUT_SECONDS = 60  # tune to your feed's expected frequency


class StaleStreamError(Exception):
    pass


async def consume_stream(url, subscribe_message):
    while True:
        try:
            async with websockets.connect(url) as ws:
                await ws.send(json.dumps(subscribe_message))
                last_message_at = time.time()
                stale = False

                async def monitor_staleness():
                    # Closes the socket when the feed goes quiet, which
                    # breaks the consumer out of its async-for loop below.
                    nonlocal stale
                    while True:
                        await asyncio.sleep(STALENESS_TIMEOUT_SECONDS)
                        if time.time() - last_message_at > STALENESS_TIMEOUT_SECONDS:
                            stale = True
                            await ws.close()
                            return

                staleness_task = asyncio.create_task(monitor_staleness())
                try:
                    async for message in ws:
                        last_message_at = time.time()
                        await handle_message(message)  # your application's handler
                finally:
                    staleness_task.cancel()

                if stale:
                    # Raising inside the watchdog task would be swallowed by
                    # the task; raise here so the except clause below sees it.
                    raise StaleStreamError(
                        f"No message for over {STALENESS_TIMEOUT_SECONDS}s"
                    )
        except websockets.exceptions.ConnectionClosed:
            print("Connection closed, reconnecting...")
        except StaleStreamError as e:
            print(f"Staleness detected: {e}, reconnecting...")
        await asyncio.sleep(1)  # backoff before retry
```
The key insight: define "alive" at your application level, not the OS level.
Your feed might have natural quiet periods (markets close, low-traffic hours), so tune the threshold. A 60-second timeout might be too aggressive for IoT telemetry; a 5-minute timeout might be too lenient for a high-frequency ticker.
A good heuristic: set your timeout to 3-5x the expected gap between messages during your slowest periods.
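That heuristic can be computed instead of guessed. A minimal sketch (the function name and defaults are my own, not from any library): derive the threshold from the gaps you actually observe during a quiet period.

```python
def staleness_threshold(message_timestamps, multiplier=4.0, floor_seconds=5.0):
    """Derive a staleness threshold from observed inter-message gaps.

    Takes recent message arrival times (seconds) and returns
    multiplier * the largest observed gap, with a sanity floor.
    """
    if len(message_timestamps) < 2:
        return floor_seconds
    gaps = [b - a for a, b in zip(message_timestamps, message_timestamps[1:])]
    return max(floor_seconds, multiplier * max(gaps))
```

Feed it timestamps recorded during your slowest hours, not your busiest, or the threshold will fire on every lull.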
What about exchange-provided heartbeats?
Some WebSocket protocols include explicit heartbeats — small periodic messages that confirm both parties are alive at the application layer. Binance Futures, for example, sends a ping every few minutes; you respond with a pong.
These help. But they don't solve the staleness problem on their own, because:
- Heartbeats might keep working while data subscription has died (different code paths on the server)
- Some feeds don't include heartbeats at all
- Even with heartbeats, you still need staleness logic for the data stream specifically
Treat heartbeats as one input, not the source of truth. Your real signal is: "Am I getting the kind of message I subscribed to?"
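One way to keep those two signals separate is to track heartbeat freshness and data freshness on different clocks. A sketch — the JSON message shapes here are hypothetical, so adapt the heartbeat check to your feed's actual protocol:

```python
import json
import time


class StreamHealth:
    """Tracks heartbeat freshness and data freshness separately."""

    def __init__(self):
        now = time.time()
        self.last_heartbeat_at = now
        self.last_data_at = now

    def record(self, raw_message):
        msg = json.loads(raw_message)
        if msg.get("type") == "heartbeat":  # hypothetical heartbeat shape
            self.last_heartbeat_at = time.time()
        else:
            # Only the messages you actually subscribed to reset this clock.
            self.last_data_at = time.time()

    def data_is_stale(self, threshold_seconds):
        # A live heartbeat does NOT save you here: data has its own clock.
        return time.time() - self.last_data_at > threshold_seconds
```

With this split, a server whose heartbeat path is alive but whose subscription path has died still trips `data_is_stale`.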
Reconnect logic that doesn't make things worse
When you detect staleness and reconnect, consider:
- Exponential backoff: if the server is genuinely down, don't hammer it with reconnect attempts
- Jitter: if 1000 clients all detect staleness at the same instant (after a server outage), randomized retry intervals prevent a thundering herd
- State recovery: for stateful feeds (order books, subscription channels), you might need to resync state after reconnect
- Alerting: if you've had to reconnect more than N times in M minutes, something deeper is broken — page yourself
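The first two points combine into a few lines. This is a sketch of the standard "exponential backoff with full jitter" scheme, with base and cap values chosen for illustration:

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter.

    attempt 0 -> up to 1s, attempt 1 -> up to 2s, ..., capped at `cap`.
    Randomizing over the full interval spreads reconnects from many
    clients so they don't all hit the server at the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Replace the fixed `asyncio.sleep(1)` in a reconnect loop with `asyncio.sleep(backoff_delay(attempt))`, resetting `attempt` to zero after a healthy stretch of messages.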
The 22-hour lesson
The bug that hit me wasn't subtle — it's a known failure mode in long-lived streaming systems. But I'd built my service assuming "WebSocket connected = data flowing," and when that assumption broke, it broke silently.
What fixed it for good:
- Message-level staleness detection (the pattern above)
- External health monitoring — a small endpoint that returns `last_signal_age_seconds` so UptimeRobot can alert me when it crosses a threshold
- Application-level alerting — a separate cron that emails me if no events fire for N hours during peak hours
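The health endpoint is a few lines of stdlib Python. A minimal sketch — the `last_signal_at` global stands in for wherever your consumer records its last message, and the field name matches what your external monitor expects:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Updated by the consumer whenever a real signal arrives (placeholder here).
last_signal_at = time.time()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Report how long ago the last signal arrived; an external monitor
        # alerts when this number crosses a threshold.
        body = json.dumps(
            {"last_signal_age_seconds": round(time.time() - last_signal_at, 1)}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


def serve_health(port=8080):
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

The point of putting this behind HTTP is that the check runs from *outside* your process — a monitor that lives inside the same process can die with it.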
If you're consuming a long-lived WebSocket today and you don't have all three of these, you're vulnerable to the same silent failure. The fix is not expensive. The bug, when it hits, is.
About me
I'm building LeadEdge — a cross-exchange crypto signal API for trading bots. The WebSocket consumer pattern above ships in our open-source integration examples as a drop-in for anyone building on top of similar streaming feeds.
The full validation methodology with 9.4M live price updates and 90.7% follow-through on ETH cross-exchange signals is documented here.