Your streaming setup works perfectly in development. Load tests pass. Everything looks solid.
Then you deploy to production. Within an hour, users report stale prices. Your monitoring shows connections dropping. The ops team is confused because nothing changed in your code.
I've seen teams debug these exact problems at 3 AM more times than I want to count. This guide covers what actually breaks when real-time feeds hit production, and how to fix it before users notice.
In this guide:
- The six ways streaming connections fail in production
- Real incident metrics from companies at scale
- Solutions that work (and ones that don't)
- Prevention checklists for each failure mode
- When to fall back vs when to stay up
Reconnection apocalypse
This one kills production launches. One team had 10,000 users streaming prices. Everything worked fine for three weeks. Then they deployed a routine update. Five minutes later, the entire service was down.
What happened? All 10,000 clients disconnected when servers restarted. They all tried to reconnect at the exact same moment. The load balancer saw 10,000 connection requests in under 100 milliseconds. It spawned new servers. Those crashed from the load. Auto-scaling spawned more. Those crashed too.
The team called it the reconnection apocalypse.
Most developers implement reconnection like this:
// This will destroy your infrastructure
ws.onclose = () => {
  connect(); // Immediate reconnect
};
Looks fine. In development with 5 users, it IS fine. But in production:
- Server restarts → 10,000 disconnects
- All clients reconnect instantly → 10,000 requests hit at once
- Server can't handle it → crashes
- Clients reconnect again immediately → crashes again
- Loop continues until someone manually stops it
Slack hit this at scale. 2 million concurrent clients. Server hiccup caused a disconnect. All 2 million tried to reconnect in under 10 seconds. Their ops team watched CPU spike to 100% and stay there. Took 15 minutes to recover.
How to fix it:
Exponential backoff with random jitter:
let reconnectDelay = 1000; // Start at 1 second

ws.onclose = () => {
  // Random jitter prevents sync
  const jitter = Math.random() * 1000;
  setTimeout(() => {
    connect();
    // Double delay, cap at 30 seconds
    reconnectDelay = Math.min(reconnectDelay * 2, 30000);
  }, reconnectDelay + jitter);
};

ws.onopen = () => {
  reconnectDelay = 1000; // Reset on successful connection
};
This spreads 10,000 reconnections over 30 seconds instead of 100ms. Servers survive. Users reconnect smoothly.
After implementing this pattern, teams report they haven't had mass reconnection incidents.
What to do:
- Exponential backoff (1s → 2s → 4s → 8s → 16s → 30s max)
- Random jitter (0-1000ms variance)
- Reset delay on successful connection
- Load test mass disconnections (kill all servers, watch recovery)
- Monitor reconnection rate (alert on spikes > 100/second)
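For that last item, here's a minimal sketch of a server-side reconnection-rate alarm. It assumes a Node backend where you can call onClientConnected() from your connection handler; notifyOps() is a hypothetical hook into whatever paging system you use:
// Sliding one-second window of new connections; alert on reconnection storms
const WINDOW_MS = 1000;
const ALERT_THRESHOLD = 100; // connections per second

let timestamps = [];

function onClientConnected() {
  const now = Date.now();
  timestamps.push(now);
  // Keep only events from the last second
  timestamps = timestamps.filter((t) => now - t < WINDOW_MS);

  if (timestamps.length > ALERT_THRESHOLD) {
    notifyOps(`Reconnection spike: ${timestamps.length} connections in the last second`); // hypothetical pager hook
  }
}
In production you'd push this count to your metrics system instead of alerting inline, but the sliding-window idea is the same.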
Nginx timeout trap
Second most common failure. Your SSE stream works locally. Works on staging. Deploy to production behind nginx. Connections drop after exactly 60 seconds.
Users refresh. Connection drops again at 60 seconds. Every. Single. Time.
I've seen teams debug this for hours before finding the culprit: nginx's default proxy_read_timeout is 60 seconds. If the backend doesn't send data in 60 seconds, nginx kills the connection silently. Your server thinks the connection is still alive. Your client thinks it's connected. Nginx drops everything between them.
One team launched a crypto dashboard with SSE price feeds. Some tokens don't update frequently. Low-volume tokens might go 2-3 minutes without price changes. Nginx killed those connections at 60 seconds. Users watching those tokens saw loading spinners. High-volume tokens (BTC, ETH) worked fine because they updated constantly.
It took them hours to figure out why some feeds worked and others didn't. The pattern? Only feeds with >1 update per minute worked. Others died at exactly 60 seconds.
Configure nginx properly:
location /stream {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    # These lines will save you hours of debugging
    proxy_read_timeout 3600s;   # 1 hour
    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;

    # SSE-specific: disable buffering (critical!)
    proxy_buffering off;
    proxy_cache off;
    chunked_transfer_encoding off;

    # Note: X-Accel-Buffering is a *response* header. If you can't change nginx
    # config, have the backend send "X-Accel-Buffering: no" instead.
}
Or add heartbeat messages every 30 seconds from your backend:
// Server-side heartbeat
setInterval(() => {
  response.write(': heartbeat\n\n'); // SSE comment
}, 30000);
The heartbeat keeps nginx happy. Even low-volume feeds stay connected.
Some APIs solve this differently - they send data frequently enough that timeouts never happen. DexPaprika's streaming endpoint updates roughly every second for active tokens, keeping the connection alive without explicit heartbeat messages. For low-volume tokens, you'll still want the heartbeat implementation above.
Checklist:
- Set proxy_read_timeout to 3600s or higher
- Add 30-second heartbeat messages
- Test with connections that receive zero data for 5 minutes (a test sketch follows this list)
- Monitor connection duration (alert if max < 10 minutes)
- Disable proxy buffering (proxy_buffering off)
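For the zero-data test, a minimal sketch using Node 18+'s built-in fetch, run as an ES module; the STREAM_URL is a placeholder for your own SSE endpoint:
// Hold an SSE connection open with no traffic and report when (or if) it dies
const STREAM_URL = 'https://your-api.example.com/stream'; // placeholder
const start = Date.now();

const res = await fetch(STREAM_URL, { headers: { Accept: 'text/event-stream' } });
const reader = res.body.getReader();

try {
  while (true) {
    const { done } = await reader.read(); // blocks until data arrives or the stream closes
    if (done) break;
  }
} catch (err) {
  // A network error also counts as a dropped connection
}

const seconds = Math.round((Date.now() - start) / 1000);
console.log(`Connection lasted ${seconds}s`);
if (seconds < 300) {
  console.error('Dropped before 5 minutes: check proxy_read_timeout and heartbeats');
}
Point it at a token or channel that you know goes quiet, and you'll catch the 60-second proxy kill before your users do.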
Memory leaks from unbounded buffers (and how they sneak up on you)
This one's sneaky. Your memory usage climbs slowly over days. Eventually, the process crashes with OOM (out of memory). Restart it, and the problem comes back.
One team saw this pattern: 20K concurrent connections. Memory usage started at 500MB. After 24 hours: 2GB. After 48 hours: 4GB. At 72 hours, the process crashed.
Most streaming implementations buffer messages for slow clients:
// This leaks memory at scale
const messageBuffers = new Map();

connection.send = (message) => {
  if (!messageBuffers.has(connection.id)) {
    messageBuffers.set(connection.id, []);
  }
  messageBuffers.get(connection.id).push(message);
  // Buffer grows unbounded if client is slow
};
If a client's network is slow, messages pile up in the buffer. If the client never catches up, the buffer grows forever. Multiply by 20,000 connections, and you've got a serious leak.
The WebSocket ws library had this exact issue. A GitHub issue showed memory growing from 200MB to 4GB without ever being freed. Every reconnection added to the leak.
Bounded circular buffers solve this:
const MAX_BUFFER_SIZE = 100; // Cap at 100 messages

class BoundedBuffer {
  constructor() {
    this.buffer = [];
    this.maxSize = MAX_BUFFER_SIZE;
  }

  push(message) {
    this.buffer.push(message);
    // Drop oldest if over limit
    if (this.buffer.length > this.maxSize) {
      this.buffer.shift();
    }
  }

  getAll() {
    return this.buffer;
  }
}
When the buffer hits 100 messages, drop the oldest. If a client is that far behind, they're probably disconnected anyway. Better to lose old messages than crash the server.
They also added connection eviction:
// Periodically force-disconnect clients that haven't read in 30 seconds
setInterval(() => {
  for (const connection of connections) {
    if (connection.lastRead < Date.now() - 30000) {
      connection.close(); // Client reconnects with backoff; memory stays flat
    }
  }
}, 10000);
Brutal but effective. Slow clients get disconnected. They reconnect automatically. Your memory stays stable.
How to avoid this:
- Cap buffers at 100-1000 messages maximum
- Drop oldest messages when buffer fills
- Monitor per-connection memory usage
- Force-close connections idle > 30 seconds
- Set up memory leak detection (heap snapshots)
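For that last item, one low-effort approach in Node is to watch heap growth and write a snapshot when it crosses a threshold; a sketch where the 1.5 GB limit and the 60-second check interval are arbitrary examples:
import { writeHeapSnapshot } from 'node:v8';

const HEAP_LIMIT_BYTES = 1.5 * 1024 * 1024 * 1024; // example threshold
let snapshotTaken = false;

setInterval(() => {
  const { heapUsed } = process.memoryUsage();
  console.log(`heapUsed: ${(heapUsed / 1024 / 1024).toFixed(0)} MB`);

  if (heapUsed > HEAP_LIMIT_BYTES && !snapshotTaken) {
    // Writes a .heapsnapshot file you can open in Chrome DevTools
    const file = writeHeapSnapshot();
    console.error(`Heap over limit, snapshot written to ${file}`);
    snapshotTaken = true; // snapshots are expensive: take one, then investigate
  }
}, 60000);
Comparing two snapshots taken a few hours apart usually points straight at the unbounded buffer.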
Mobile network hell
Mobile connections are brutal for streaming. Users ride trains through tunnels. Switch between WiFi and cellular. Put phones in pockets where the radio powers down.
I've seen this pattern repeatedly. One team's desktop users had 99.9% connection stability. Mobile users? 60%. Same code, different environment.
Mobile networks cycle radio power states:
- Active: Full power, millisecond latency
- Idle: Low power, 50-100ms latency
- Dormant: Radio off, 2-3 second wake-up time
When a user puts their phone in their pocket for 10 seconds, the radio goes dormant. Your WebSocket or SSE connection appears connected, but packets aren't flowing. When they pull out the phone, it takes 2-3 seconds to wake the radio. Meanwhile, your server is still sending data to a dead connection.
Worse: 3G/4G networks drop idle connections after 30-60 seconds. Your connection breaks even if the user is actively watching. The app thinks it's connected. The network thinks it's dead.
One team tried aggressive heartbeats every 5 seconds. Seemed logical: keep the connection alive, right? Disaster. Mobile batteries drained in 4 hours and users complained loudly. The constant heartbeats never let the radio power down, which is exactly what eats battery.
What actually works:
30-second heartbeats plus aggressive reconnection detection:
class MobileOptimizedFeed {
  constructor() {
    this.lastMessage = Date.now();
    this.heartbeatInterval = 30000; // 30 seconds
    this.timeoutThreshold = 45000;  // 45 seconds
  }

  startHeartbeat() {
    this.pingInterval = setInterval(() => {
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ type: 'ping' }));

        // If no message in 45s, connection is dead
        if (Date.now() - this.lastMessage > this.timeoutThreshold) {
          this.ws.close();
          this.reconnect();
        }
      }
    }, this.heartbeatInterval);
  }

  onMessage(message) {
    this.lastMessage = Date.now();
    // Handle message...
  }
}
Also added visibility API detection:
document.addEventListener('visibilitychange', () => {
if (document.visibilityState === 'visible') {
// Page became visible, force reconnect
this.ws.close();
this.reconnect();
}
});
When the user switches back to your app, reconnect immediately. Don't wait for the timeout. This cut perceived latency by 80%.
Battery tests showed:
- SSE with 30s heartbeat: 8+ hours active streaming
- WebSocket with 30s heartbeat: 6-8 hours
- WebSocket with 5s heartbeat: 4 hours (don't do this)
The 30-second interval lets the radio power down between heartbeats. 5 seconds keeps it active constantly.
Solutions:
- 30-second heartbeat interval (not faster)
- Detect page visibility changes, reconnect immediately
- Set connection timeout at 45 seconds (1.5× heartbeat)
- Test on real 3G/4G networks (not just WiFi)
- Monitor mobile vs desktop connection stability separately
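For that last item, the key is tagging every connection event with a platform label so mobile and desktop show up as separate series in your dashboards. A minimal client-side sketch; the user-agent check is rough on purpose and the /metrics endpoint is a hypothetical collector:
// Rough platform tag from the user agent (good enough for dashboards)
const platform = /Mobi|Android/i.test(navigator.userAgent) ? 'mobile' : 'desktop';

function reportConnectionEvent(event) {
  // event is 'open', 'close', or 'reconnect'
  navigator.sendBeacon(
    '/metrics/connection-events', // hypothetical collection endpoint
    JSON.stringify({ platform, event, ts: Date.now() })
  );
}

ws.addEventListener('open', () => reportConnectionEvent('open'));
ws.addEventListener('close', () => reportConnectionEvent('close'));
Once the events are split by platform, the 99.9% vs 60% stability gap described above becomes visible within a day of traffic.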
Corporate proxy blackholes
Enterprise customers are lucrative. They're also a nightmare for WebSocket connections.
One team lost a $100K/year contract because their WebSocket feeds didn't work through the client's corporate proxy. The client's security team refused to whitelist WebSocket connections. Policy was HTTP/HTTPS only.
Corporate proxies like Sophos XG, WatchGuard, and Fortinet often block WebSocket protocol upgrades. The connection fails silently. From your client's perspective, it just never connects. No error message. No indication why.
SSE works because it's standard HTTP. No special protocol. No upgrade handshake. Just a long-lived HTTP response. Proxies understand it.
In their customer base:
- 30% of enterprise customers block WebSocket
- 5% block SSE (usually due to custom proxy configs)
- 1% block both (extreme lockdown environments)
If you're targeting enterprise customers and only offer WebSocket, you're losing 30% of potential revenue.
Offer SSE as a fallback:
class ResilientPriceFeed {
  connect() {
    // Try WebSocket first (better performance)
    this.tryWebSocket().catch(error => {
      console.warn('WebSocket failed, falling back to SSE');
      this.trySSE().catch(error => {
        console.error('Both WebSocket and SSE failed, using polling');
        this.startPolling();
      });
    });
  }

  tryWebSocket() {
    return new Promise((resolve, reject) => {
      const ws = new WebSocket(this.wsUrl);
      const timeout = setTimeout(() => {
        ws.close();
        reject(new Error('WebSocket timeout'));
      }, 5000);

      ws.onopen = () => {
        clearTimeout(timeout);
        resolve(ws);
      };
      ws.onerror = reject;
    });
  }

  trySSE() {
    return new Promise((resolve, reject) => {
      // Example with DexPaprika's streaming endpoint (WETH price)
      const events = new EventSource(
        'https://streaming.dexpaprika.com/stream?method=t_p&chain=ethereum&address=0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2'
      );
      const timeout = setTimeout(() => {
        events.close();
        reject(new Error('SSE timeout'));
      }, 5000);

      events.onopen = () => {
        clearTimeout(timeout);
        resolve(events);
      };
      events.onerror = reject;
    });
  }
}
This saved their enterprise deal. They tested with DexPaprika's endpoint because it's public (no API keys needed for testing) and works through corporate proxies (standard HTTPS). WebSocket worked for 70% of offices. SSE worked for the other 30%. Everyone happy.
What you need:
- Implement SSE fallback (don't go WebSocket-only)
- Test through real corporate proxies
- Add connection type detection to analytics (a sketch follows this list)
- Monitor which transports clients use
- Provide docs for security teams (what ports/protocols needed)
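For the analytics item above, one option is to record which transport finally succeeded inside connect(). This is a hedged variation of the ResilientPriceFeed class shown earlier; recordTransport() is a hypothetical helper you'd wire to your own analytics:
// Sketch: remember which transport actually worked for this client
connect() {
  this.tryWebSocket()
    .then(() => this.recordTransport('websocket'))
    .catch(() =>
      this.trySSE()
        .then(() => this.recordTransport('sse'))
        .catch(() => {
          this.recordTransport('polling');
          this.startPolling();
        })
    );
}

recordTransport(transport) {
  this.transport = transport;
  console.info(`price feed transport: ${transport}`); // swap for your analytics call
}
Aggregating this one field tells you what share of your customers are stuck behind WebSocket-blocking proxies.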
AWS API Gateway costs (they will surprise you)
AWS API Gateway supports WebSocket. Seems convenient. Until you see the bill.
Recall.ai built real-time video streaming with API Gateway WebSocket. Their bill: $1 million per year. For WebSocket connections. They migrated to CloudFront and cut costs to $200K.
API Gateway charges per message and per connection minute. Sounds reasonable. But the math destroys you:
For 10,000 concurrent users streaming prices:
- Each user: roughly 1 update/second
- Messages per month: 10K users × 1 msg/sec × 2.6M sec/month = 26 billion messages
- API Gateway cost: 26B messages × $0.00000125 = $32,500/month just for messages
- Connection minutes: 10K users × 43,800 min/month = 438M minutes
- Connection cost: 438M minutes × $0.00000025 ≈ $110/month
Total: roughly $32,600/month, and the per-message charges dominate
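If you want to rerun this arithmetic with your own traffic, a quick sketch using the same per-message and per-connection-minute rates as above (and the same 2.6M seconds / 43,800 minutes per month):
// Back-of-envelope API Gateway WebSocket cost estimate
function apiGatewayMonthlyCost({ users, messagesPerSecond }) {
  const SECONDS_PER_MONTH = 2_600_000;
  const MINUTES_PER_MONTH = 43_800;
  const MSG_PRICE = 1.25 / 1_000_000;    // $ per message
  const MINUTE_PRICE = 0.25 / 1_000_000; // $ per connection-minute

  const messageCost = users * messagesPerSecond * SECONDS_PER_MONTH * MSG_PRICE;
  const connectionCost = users * MINUTES_PER_MONTH * MINUTE_PRICE;
  return { messageCost, connectionCost, total: messageCost + connectionCost };
}

console.log(apiGatewayMonthlyCost({ users: 10_000, messagesPerSecond: 1 }));
// => { messageCost: 32500, connectionCost: 109.5, total: 32609.5 }
Plug in your real update frequency before committing; message volume, not connection time, is what drives the bill.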
For comparison, CloudFront with Lambda@Edge for the same load: around $2,000/month.
API Gateway charges for:
- Connection minutes ($0.25 per million)
- Messages sent ($1.25 per million)
- Data transfer (standard AWS rates)
WebSocket is chatty. Each message triggers a charge. For real-time price feeds sending updates every second, you're paying for billions of tiny messages.
CloudFront plus Lambda@Edge for SSE:
10,000 concurrent users × 30 days = 300K connection-days
CloudFront: $0.085/GB for data transfer
Lambda@Edge: $0.00000625 per request
Approximate monthly cost: $2,000-3,000
vs API Gateway: roughly $32,600
After migrating from API Gateway to CloudFront (a migration that took one weekend), teams report cost reductions of roughly 90% with the same functionality.
Before you commit:
- Calculate API Gateway costs BEFORE using it for WebSocket
- Consider CloudFront + Lambda@Edge for SSE
- Use EC2 + ALB for WebSocket if you need bidirectional
- Monitor per-message costs in billing dashboard
- Set up billing alerts (seriously, do this first)
When to stay up vs when to fail
Some failures you should gracefully handle. Others should hard-fail. Knowing which is which separates good engineering from production disasters.
Stay up (handle gracefully):
Individual connection failures: If one user's connection drops, don't alert. Reconnect silently. Log metrics. 99% of the time, it's their network, not your system.
Partial backend degradation: If your backend is slow but responsive, keep connections alive. Buffer messages. Deliver when possible. Users prefer slow data to no data.
Proxy timeouts: Heartbeat and reconnect. This is normal behavior in production.
Fail hard (don't try to stay up):
All connections failing: If your reconnection success rate drops below 50%, something's fundamentally broken. Alert loudly. Don't silently retry in a loop.
Memory exhaustion: If memory usage is climbing and close to limits, restart gracefully BEFORE you OOM crash. Controlled restart beats hard crash.
Thundering herd detected: If you see reconnection rate exceeding 1000/second, you're in a reconnection storm. Stop accepting new connections temporarily. Let the wave pass.
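A minimal sketch of that last behavior for a Node WebSocket server, assuming the ws library's verifyClient hook; the 1000/second threshold and 10-second pause are illustrative, and in a real deployment you might prefer handling the HTTP upgrade event directly:
import { WebSocketServer } from 'ws';

const STORM_THRESHOLD = 1000; // new connections per second
const PAUSE_MS = 10000;       // stop accepting for 10 seconds, let the wave pass

let recentConnects = [];
let pausedUntil = 0;

const wss = new WebSocketServer({
  port: 8080,
  verifyClient: (info, done) => {
    const now = Date.now();
    recentConnects = recentConnects.filter((t) => now - t < 1000);
    recentConnects.push(now);

    if (now < pausedUntil || recentConnects.length > STORM_THRESHOLD) {
      pausedUntil = Math.max(pausedUntil, now + PAUSE_MS);
      // 503 tells well-behaved clients to back off and retry later
      return done(false, 503, 'Reconnection storm, retry later');
    }
    done(true);
  },
});

wss.on('connection', (socket) => {
  // normal streaming logic
});
Combined with client-side exponential backoff, this turns a storm into a gentle ramp instead of a crash loop.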
Summary
Live price feeds fail in six main ways:
- Reconnection storms - All clients reconnect simultaneously after disruption
- Proxy timeouts - Nginx kills idle connections at 60 seconds
- Memory leaks - Unbounded buffers grow until OOM
- Mobile disconnections - Radio power states and network switches
- Corporate firewalls - WebSocket blocked, SSE works
- Cloud costs - API Gateway WebSocket charges add up fast
Each has known solutions. Exponential backoff stops reconnection storms. Heartbeats prevent timeouts. Bounded buffers prevent leaks. SSE fallback handles corporate networks.
Test for these failures before production. Load test mass disconnections. Run through corporate proxies. Simulate slow mobile networks. Calculate cloud costs with real traffic numbers.
The feeds that survive production are the ones built expecting these failures from the start.
Frequently asked questions
How do I test for reconnection storms before production?
Use k6 or Gatling to simulate 10,000 concurrent connections. Kill all backend servers simultaneously. Measure how long it takes all clients to successfully reconnect. Target: under 60 seconds with no server crashes.
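A minimal k6 sketch for that scenario; the wss://your-api.example.com/prices URL and the VU count are placeholders. Start the test, kill the backend mid-run, and watch how long it takes the upgrade checks to go green again:
import ws from 'k6/ws';
import { check } from 'k6';

export const options = {
  vus: 1000,      // scale toward 10,000 on real load-test hardware
  duration: '5m',
};

export default function () {
  // Each iteration holds one connection; when it closes, the next iteration
  // reconnects, so killing the backend mid-test reproduces the storm.
  const res = ws.connect('wss://your-api.example.com/prices', null, (socket) => {
    socket.on('message', () => {});                 // consume price updates
    socket.setTimeout(() => socket.close(), 60000); // recycle after a minute
  });

  check(res, { 'upgraded to websocket': (r) => r && r.status === 101 });
}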
What's a safe heartbeat interval that won't drain mobile batteries?
30 seconds. This allows mobile radios to power down between heartbeats while keeping connections alive through most proxies. Testing with 15s, 30s, and 60s intervals showed that 30 seconds gives the best balance of connection stability and battery life.
Should I always implement both SSE and WebSocket?
For public-facing apps targeting enterprises: yes. 30% of corporate networks block WebSocket. For consumer apps where you control the infrastructure: WebSocket-only works fine if you test mobile networks thoroughly.
How do I detect which failure mode is happening in production?
Add structured logging for each failure type. Track reconnection patterns (storm = many at once), connection duration (proxy = dies at 60s exactly), memory growth (leak = slow increase), and connection success rate by client type (corporate = WebSocket fails, SSE works).
When should I give up and fallback to polling?
When streaming failure rate exceeds 25% for more than 5 minutes. At that point, something's fundamentally broken. Polling at 30-second intervals keeps users functional while you debug the streaming infrastructure.
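As a reference point, the startPolling() fallback mentioned earlier can be as small as this sketch. These are methods you'd add to the ResilientPriceFeed class from the corporate-proxy section; the /api/prices/latest endpoint is a placeholder:
// Degraded mode: poll every 30 seconds until streaming recovers
startPolling() {
  this.pollTimer = setInterval(async () => {
    try {
      const res = await fetch('/api/prices/latest'); // placeholder endpoint
      if (res.ok) this.onMessage(await res.json());
    } catch (err) {
      // Ignore individual poll failures; the next tick retries
    }
  }, 30000);
}

stopPolling() {
  clearInterval(this.pollTimer);
}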
Related articles
- Why 1-second polling doesn't scale - When to migrate from polling to streaming
- SSE vs WebSockets: choosing the right transport - Protocol comparison and decision framework
- Server-Sent Events (SSE) explained for crypto apps - SSE implementation guide