<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mixa_Dev</title>
    <description>The latest articles on DEV Community by Mixa_Dev (@mixa_dev).</description>
    <link>https://dev.to/mixa_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935396%2Fa2afa3fd-51cf-4c40-8693-a4d3051dbeda.png</url>
      <title>DEV Community: Mixa_Dev</title>
      <link>https://dev.to/mixa_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mixa_dev"/>
    <language>en</language>
    <item>
      <title>Your WebSocket says "connected" but stopped sending data. Here's the bug TCP keepalive can't catch.</title>
      <dc:creator>Mixa_Dev</dc:creator>
      <pubDate>Sat, 16 May 2026 19:23:32 +0000</pubDate>
      <link>https://dev.to/mixa_dev/your-websocket-says-connected-but-stopped-sending-data-heres-the-bug-tcp-keepalive-cant-catch-5424</link>
      <guid>https://dev.to/mixa_dev/your-websocket-says-connected-but-stopped-sending-data-heres-the-bug-tcp-keepalive-cant-catch-5424</guid>
      <description>&lt;p&gt;Two weeks ago, my crypto signal API silently failed for 22 hours.&lt;/p&gt;

&lt;p&gt;No errors. No exceptions. No crash. The service kept running, logs continued to flow, my deployment dashboard showed everything green. I only noticed when I happened to check the database and realized no new data had been written for almost a full day.&lt;/p&gt;

&lt;p&gt;The culprit? My WebSocket connection to Binance. It was "connected" — but it hadn't received a message in hours.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;silent staleness problem&lt;/strong&gt;. And TCP keepalive can't catch it.&lt;/p&gt;

&lt;p&gt;If you've ever built a system that consumes a long-lived WebSocket feed (price data, chat messages, IoT telemetry, log streams), you're vulnerable to this exact failure mode. Here's what's happening and how to fix it.&lt;/p&gt;

&lt;h2&gt;The illusion of "connected"&lt;/h2&gt;

&lt;p&gt;When your client opens a WebSocket connection, it completes a TCP handshake and then an HTTP upgrade over that socket. From then on, "connected" really means: there's an open TCP socket between you and the server, and neither side has seen an error on it.&lt;/p&gt;

&lt;p&gt;That's it.&lt;/p&gt;

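&lt;p&gt;TCP keepalive (when enabled) has the OS send periodic empty probe packets on an idle connection to check that the peer still answers. If the probes go unanswered, the socket eventually fails with a timeout or reset error, although with default settings (two hours of idle time on Linux) that can take a long while.&lt;/p&gt;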
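&lt;p&gt;For reference, here is a minimal sketch of switching keepalive on for a raw socket with Python's standard library. The helper name &lt;code&gt;enable_keepalive&lt;/code&gt; and the timing values are illustrative assumptions, and the three &lt;code&gt;TCP_KEEP*&lt;/code&gt; options are platform-specific, so the code only sets the ones it finds. Everything it configures concerns the transport; none of it looks at application messages.&lt;/p&gt;

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive; tune probe timing where the platform allows it.

    The name and default values are illustrative assumptions, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These three options are platform-specific (available on Linux),
    # so only set the ones this Python build exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the socket errors
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print("keepalive enabled:", keepalive_on)
```

&lt;p&gt;Even with aggressive values like these, a probe that gets answered only tells you the peer's TCP stack is reachable.&lt;/p&gt;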

&lt;p&gt;But here's what TCP can't see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether the application on the other end is still pushing messages&lt;/li&gt;
&lt;li&gt;Whether a proxy or load balancer between you and the server has dropped your subscription&lt;/li&gt;
&lt;li&gt;Whether a backend bug stopped emitting events while keeping the connection open&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Your WebSocket can look perfectly healthy at the TCP layer while application data has stopped flowing entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my case, Binance's WebSocket gateway accepted my connection, accepted my subscriptions, and then stopped pushing ticker updates. The TCP socket was fine. The OS was fine. My code was fine. The data was gone.&lt;/p&gt;

&lt;h2&gt;Why naive fixes don't work&lt;/h2&gt;

&lt;p&gt;The first instinct is: "I'll just reconnect on error." But the application never errors. No exception fires. The connection is perfectly alive — there's just nothing coming through.&lt;/p&gt;

&lt;p&gt;The second instinct: "I'll add a watchdog timer that pings the server." This is closer to right but has two flaws. Some services don't answer application-level ping messages on data WebSockets, so when your ping meets silence you can't distinguish "server doesn't answer pings" from "server is broken." And when a pong does come back, it only proves the gateway is up, not that your subscription is still delivering data.&lt;/p&gt;

&lt;p&gt;The third instinct: "I'll send a subscribe message and check for confirmation." This catches startup failures but not mid-stream failures.&lt;/p&gt;

&lt;p&gt;What actually works is much simpler:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track the time of the last message received. If it exceeds a threshold, the stream is stale — regardless of what TCP thinks.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Implementing message-level staleness detection&lt;/h2&gt;

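&lt;p&gt;Here's the pattern in Python, using the &lt;code&gt;websockets&lt;/code&gt; library (&lt;code&gt;handle_message&lt;/code&gt; stands for your own coroutine):&lt;/p&gt;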

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;

&lt;span class="n"&gt;STALENESS_TIMEOUT_SECONDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;  &lt;span class="c1"&gt;# tune to your feed's expected frequency
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StaleStreamError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consume_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscribe_message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscribe_message&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

                &lt;span class="n"&gt;last_message_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;stale_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# set by the monitor just before it closes the socket
&lt;/span&gt;
                &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;monitor_staleness&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="k"&gt;nonlocal&lt;/span&gt; &lt;span class="n"&gt;stale_age&lt;/span&gt;
                    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STALENESS_TIMEOUT_SECONDS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_message_at&lt;/span&gt;
                        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;STALENESS_TIMEOUT_SECONDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="c1"&gt;# An exception raised in this background task would never
&lt;/span&gt;                            &lt;span class="c1"&gt;# reach the caller, so record the age and close the socket.
&lt;/span&gt;                            &lt;span class="n"&gt;stale_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;
                            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                            &lt;span class="k"&gt;return&lt;/span&gt;

                &lt;span class="n"&gt;staleness_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;monitor_staleness&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;last_message_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handle_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;staleness_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

                &lt;span class="c1"&gt;# A clean close ends the loop above without an error,
&lt;/span&gt;                &lt;span class="c1"&gt;# so surface staleness to the reconnect logic here.
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stale_age&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;StaleStreamError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No message for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stale_age&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(threshold: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;STALENESS_TIMEOUT_SECONDS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionClosed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection closed, reconnecting...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;StaleStreamError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Staleness detected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, reconnecting...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# brief fixed pause before retry
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;define "alive" at your application level, not the OS level.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your feed might have natural quiet periods (markets close, low-traffic hours), so tune the threshold. A 60-second timeout might be too aggressive for IoT telemetry; a 5-minute timeout might be too lenient for a high-frequency ticker.&lt;/p&gt;

&lt;p&gt;A good heuristic: set your timeout to 3-5x the expected gap between messages during your slowest periods.&lt;/p&gt;
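&lt;p&gt;As a rough sketch, that heuristic can be turned into a small helper. The name &lt;code&gt;suggest_timeout&lt;/code&gt;, the default multiplier of 4, the 5-second floor, and the sample gaps are all assumptions made for illustration; measure real gaps from your own feed before trusting any number.&lt;/p&gt;

```python
def suggest_timeout(observed_gaps, multiplier=4.0, floor=5.0):
    """Suggest a staleness timeout from gaps (in seconds) seen during quiet periods.

    Hypothetical helper: multiplier=4.0 sits inside the 3-5x range, and floor
    is an assumed lower bound so a bursty feed never yields a tiny timeout.
    """
    if not observed_gaps:
        raise ValueError("need at least one observed gap")
    slowest_gap = max(observed_gaps)
    return max(floor, slowest_gap * multiplier)


# Gaps between consecutive messages during the slowest hour (made-up numbers).
quiet_hour_gaps = [2.0, 7.5, 12.0, 4.0]
print(suggest_timeout(quiet_hour_gaps))  # 48.0
```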

&lt;h2&gt;What about exchange-provided heartbeats?&lt;/h2&gt;

&lt;p&gt;Some WebSocket protocols include explicit heartbeats — small periodic messages that confirm both parties are alive at the application layer. Binance Futures, for example, sends a ping every few minutes; you respond with a pong.&lt;/p&gt;

&lt;p&gt;These help. But they don't solve the staleness problem on their own, because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Heartbeats might keep working while the data subscription has died (different code paths on the server)&lt;/li&gt;
&lt;li&gt;Some feeds don't include heartbeats at all&lt;/li&gt;
&lt;li&gt;Even with heartbeats, you still need staleness logic for the data stream specifically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Treat heartbeats as one input, not the source of truth. Your real signal is: "Am I getting the kind of message I subscribed to?"&lt;/strong&gt;&lt;/p&gt;
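&lt;p&gt;One way to apply this is to refresh the staleness timestamp only for the message types you actually subscribed to. The sketch below assumes JSON messages whose &lt;code&gt;"e"&lt;/code&gt; field names the event type; the class name, the &lt;code&gt;"heartbeat"&lt;/code&gt; event, and the injected clock are hypothetical, so adapt the check to your feed's real schema.&lt;/p&gt;

```python
import json
import time


class DataStalenessTracker:
    """Tracks when the last *data* message arrived, ignoring heartbeats.

    Assumption for illustration: messages are JSON objects whose "e" field
    names the event type.
    """

    def __init__(self, data_event_types, timeout_seconds, clock=time.monotonic):
        self.data_event_types = set(data_event_types)
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_data_at = clock()

    def is_data_message(self, raw_message):
        try:
            payload = json.loads(raw_message)
        except (TypeError, ValueError):
            return False
        return isinstance(payload, dict) and payload.get("e") in self.data_event_types

    def observe(self, raw_message):
        # Only subscribed data refreshes the timestamp; heartbeats and acks do not.
        if self.is_data_message(raw_message):
            self.last_data_at = self.clock()

    def is_stale(self):
        return self.clock() - self.last_data_at > self.timeout_seconds


# A fake clock keeps the example deterministic.
fake_now = [0.0]
tracker = DataStalenessTracker({"24hrTicker"}, timeout_seconds=60, clock=lambda: fake_now[0])
fake_now[0] = 50.0
tracker.observe('{"e": "24hrTicker", "c": "1.23"}')  # data: refreshes the timestamp
fake_now[0] = 100.0
tracker.observe('{"e": "heartbeat"}')                # heartbeat: ignored
print(tracker.is_stale())  # False (50s since the last data message)
fake_now[0] = 111.0
print(tracker.is_stale())  # True (61s since the last data message)
```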

&lt;h2&gt;Reconnect logic that doesn't make things worse&lt;/h2&gt;

&lt;p&gt;When you detect staleness and reconnect, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exponential backoff:&lt;/strong&gt; if the server is genuinely down, don't hammer it with reconnect attempts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jitter:&lt;/strong&gt; if 1000 clients all detect staleness at the same instant (after a server outage), randomized retry intervals prevent a thundering herd&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State recovery:&lt;/strong&gt; for stateful feeds (order books, subscription channels), you might need to resync state after reconnect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting:&lt;/strong&gt; if you've had to reconnect more than N times in M minutes, something deeper is broken — page yourself&lt;/li&gt;
&lt;/ul&gt;
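&lt;p&gt;The backoff, jitter, and alerting points can be sketched with the standard library alone. This uses the "full jitter" variant (a uniformly random delay up to an exponentially growing ceiling); &lt;code&gt;backoff_delay&lt;/code&gt;, &lt;code&gt;ReconnectAlarm&lt;/code&gt;, and every default value are hypothetical choices, not settings taken from a real deployment.&lt;/p&gt;

```python
import random
from collections import deque


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    # Clamp the exponent so a very long outage cannot overflow the float math.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    return rng.uniform(0.0, ceiling)


class ReconnectAlarm:
    """Flags when more than max_reconnects happen within window_seconds."""

    def __init__(self, max_reconnects, window_seconds):
        self.max_reconnects = max_reconnects
        self.window_seconds = window_seconds
        self.events = deque()

    def record(self, now):
        """Record a reconnect at time `now`; return True if it is time to alert."""
        self.events.append(now)
        # Drop reconnects that fell out of the sliding window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.max_reconnects


rng = random.Random(42)  # seeded only to make the demo repeatable
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, rng=rng), 2))

alarm = ReconnectAlarm(max_reconnects=3, window_seconds=300)
print([alarm.record(t) for t in (0, 10, 20, 30)])  # [False, False, False, True]
```

&lt;p&gt;In a reconnect loop you would reset &lt;code&gt;attempt&lt;/code&gt; to zero once data messages are flowing again, so that a single blip doesn't leave you with long delays.&lt;/p&gt;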

&lt;h2&gt;The 22-hour lesson&lt;/h2&gt;

&lt;p&gt;The bug that hit me wasn't exotic; it's a known failure mode in long-lived streaming systems. But I'd built my service assuming "WebSocket connected = data flowing," and nothing in my code noticed when that stopped being true.&lt;/p&gt;

&lt;p&gt;What fixed it for good:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Message-level staleness detection&lt;/strong&gt; (the pattern above)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External health monitoring&lt;/strong&gt; — a small endpoint that returns &lt;code&gt;last_signal_age_seconds&lt;/code&gt; so UptimeRobot can alert me when it crosses a threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application-level alerting&lt;/strong&gt; — a separate cron that emails me if no events fire for N hours during peak hours&lt;/li&gt;
&lt;/ol&gt;
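&lt;p&gt;A health endpoint of that kind can be very small. The sketch below uses only Python's standard library; the 300-second threshold, the shared &lt;code&gt;state&lt;/code&gt; dict, and the names &lt;code&gt;health_report&lt;/code&gt; and &lt;code&gt;HealthHandler&lt;/code&gt; are assumptions for illustration. It answers with HTTP 503 once the last signal is too old, on the assumption that your uptime monitor alerts on non-200 responses.&lt;/p&gt;

```python
import json
import time
from http.server import BaseHTTPRequestHandler

MAX_SIGNAL_AGE_SECONDS = 300  # assumed alert threshold; tune per feed

# The stream consumer would update this each time it handles a data message.
state = {"last_signal_at": time.time()}


def health_report(last_signal_at, now, max_age=MAX_SIGNAL_AGE_SECONDS):
    """Return (http_status, body_dict); 503 once the last signal is too old."""
    age = max(0.0, now - last_signal_at)
    status = 503 if age > max_age else 200
    return status, {"last_signal_age_seconds": round(age, 1), "healthy": status == 200}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_report(state["last_signal_at"], time.time())
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


print(health_report(last_signal_at=1000.0, now=1042.0))
# (200, {'last_signal_age_seconds': 42.0, 'healthy': True})
print(health_report(last_signal_at=1000.0, now=1400.0))
# (503, {'last_signal_age_seconds': 400.0, 'healthy': False})
```

&lt;p&gt;To serve it, you could pass &lt;code&gt;HealthHandler&lt;/code&gt; to &lt;code&gt;http.server.HTTPServer&lt;/code&gt; and run &lt;code&gt;serve_forever()&lt;/code&gt; in a background thread next to the stream consumer.&lt;/p&gt;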

&lt;p&gt;If you're consuming a long-lived WebSocket today and you don't have all three of these, you're vulnerable to the same silent failure. The fix is not expensive. The bug, when it hits, is.&lt;/p&gt;




&lt;h2&gt;About me&lt;/h2&gt;

&lt;p&gt;I'm building &lt;a href="https://leadedge.dev" rel="noopener noreferrer"&gt;LeadEdge&lt;/a&gt; — a cross-exchange crypto signal API for trading bots. The WebSocket consumer pattern above ships in our open-source &lt;a href="https://github.com/mihalismacura7-blip/leadedge-examples" rel="noopener noreferrer"&gt;integration examples&lt;/a&gt; as a drop-in for anyone building on top of similar streaming feeds.&lt;/p&gt;

&lt;p&gt;The full validation methodology with 9.4M live price updates and 90.7% follow-through on ETH cross-exchange signals is documented &lt;a href="https://leadedge.dev/blog/validation" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>websocket</category>
      <category>debugging</category>
    </item>
  </channel>
</rss>
