A production incident at 5PM - how a 502 error led me to discover we were creating new HTTP connections for every Firebase request, and the simple fix that improved performance 3-5x
5:15 PM. Friday afternoon. My Slack explodes with messages:
"Push notification admin page down - 502 Bad Gateway"
"Can't send any notifications"
I opened my AWS CloudWatch dashboard. Every metric had spiked simultaneously:
- CPU usage: 15% → 95%
- Network packets out: 50/sec → 12,000/sec
- Network bytes out: 1 MB/min → 800 MB/min
My first thought: "Did we get DDoS'd?"
Turns out, I had been creating a new HTTP connection for every single Firebase API call for 6 months without realizing it. Here's how I discovered the issue and how a 3-line code change fixed everything.
The smoking gun
I checked the system logs:
sudo tail -f /var/log/syslog
Jan 15 17:40:02 kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Jan 15 17:40:02 kernel: nf_conntrack: nf_conntrack: table full, dropping packet
(repeating thousands of times)
That one line explained everything. The Linux kernel's connection tracking table had filled up, causing it to drop packets for any new connection - including health checks, SSH attempts, and API requests.
What actually happened
Without Keep-Alive, each Firebase API call creates a brand new TCP connection:
Request 1: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
Request 2: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
Request 3: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
...
With retry logic (up to 3 retries per chunk):
50,000 users = 100 chunks of 500
Worst case: 100 chunks × 4 attempts = 400 connections
In practice that meant 130+ connections sitting in TIME_WAIT simultaneously, on top of the server's other traffic.
When the conntrack table - which counts every connection the kernel is tracking, not just these - hit the nf_conntrack_max limit (16,384 on my EC2), the kernel started dropping packets for new connections → 502 Bad Gateway.
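For context, the send path looked roughly like this. It's a simplified sketch, not my exact production code - the helper name, the sendEachForMulticast call, and the exact retry shape are illustrative:
// Simplified sketch of the send path (illustrative - not the exact production code)
import * as admin from 'firebase-admin';

const CHUNK_SIZE = 500;      // FCM multicast limit per request
const MAX_RETRIES = 3;       // up to 3 retries per chunk
const CHUNK_DELAY_MS = 2000; // the 2-second delay between chunks

async function sendToAll(tokens: string[], notification: { title: string; body: string }) {
  for (let i = 0; i < tokens.length; i += CHUNK_SIZE) {
    const chunk = tokens.slice(i, i + CHUNK_SIZE);
    for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
      try {
        // Without a keep-alive agent, every HTTPS request made here
        // opened a brand-new TCP + TLS connection.
        await admin.messaging().sendEachForMulticast({ tokens: chunk, notification });
        break;
      } catch (err) {
        if (attempt === MAX_RETRIES) throw err; // give up on this chunk
      }
    }
    await new Promise((resolve) => setTimeout(resolve, CHUNK_DELAY_MS));
  }
}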
Why it worked fine for 6 months
Three factors kept me safe:
1. Chunk delays
const chunkDelay = 2000; // 2 second delay between chunks
This gave connections time to clear from TIME_WAIT.
2. Small campaigns
- Daily devotionals: 10,000 users (20 chunks)
- Event announcements: 5,000 users (10 chunks)
- All well below threshold
3. Peak hour spacing
Notifications sent at 9 AM, 12 PM, 9 PM - hours apart.
What changed on Jan 15th?
- 50,000 users (100 chunks)
- 2 campaigns sent simultaneously
- FCM API errors triggered retry storm
- Result: 400+ connections in 5 minutes → crash
The immediate fix: Increase conntrack limit
# Check current limit
sudo sysctl net.netfilter.nf_conntrack_max
# Output: 16384
# Increase to 262,144
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
# Make permanent
echo "net.netfilter.nf_conntrack_max = 262144" | sudo tee /etc/sysctl.d/99-custom.conf
sudo sysctl --system  # sysctl -p alone only reloads /etc/sysctl.conf, not /etc/sysctl.d/
This fixed the immediate crisis but didn't address the root cause.
The root cause: No HTTP Keep-Alive
My Firebase initialization code:
// firebase.module.ts - BEFORE (❌ Wrong)
{
  provide: 'FIREBASE_ADMIN',
  useFactory: (configService: ConfigService) => {
    const firebaseEnv = configService.get('firebase');
    const app = admin.initializeApp({
      credential: admin.credential.cert({
        projectId: firebaseEnv.projectId,
        clientEmail: firebaseEnv.clientEmail,
        privateKey: firebaseEnv.privateKey,
      }),
    });
    // ❌ No HTTP agent configuration!
    return app;
  },
  inject: [ConfigService],
}
Overhead per request: TCP handshake (20-50ms) + TLS handshake (50-150ms) = 70-200ms wasted per connection.
The fix: Enable HTTP Keep-Alive
// firebase.module.ts - AFTER (✅ Correct)
import { Agent as HttpsAgent } from 'https';

{
  provide: 'FIREBASE_ADMIN',
  useFactory: (configService: ConfigService) => {
    const firebaseEnv = configService.get('firebase');

    // ✅ Create Keep-Alive HTTPS Agent
    const httpsAgent = new HttpsAgent({
      keepAlive: true,        // Enable connection reuse
      keepAliveMsecs: 30000,  // Send TCP keep-alive probes on idle sockets after 30s
      maxSockets: 100,        // Max 100 concurrent connections
      maxFreeSockets: 10,     // Keep 10 idle sockets ready
      timeout: 60000,         // 60s socket inactivity timeout
      scheduling: 'lifo',     // Use most recent connection first
    });
    console.log('🔧 [Firebase] Keep-Alive HTTPS Agent created');

    // ✅ Initialize Firebase
    const app = admin.initializeApp({
      credential: admin.credential.cert({
        projectId: firebaseEnv.projectId,
        clientEmail: firebaseEnv.clientEmail,
        privateKey: firebaseEnv.privateKey,
      }),
    });

    // ✅ Inject Keep-Alive Agent globally
    process.env.GOOGLE_APPLICATION_TIMEOUT = '30000';
    const https = require('https');
    https.globalAgent = httpsAgent;
    console.log('✅ [Firebase] Keep-Alive enabled');

    return app;
  },
  inject: [ConfigService],
}
What changed:
Before: 100 requests = 100 new connections
After: 100 requests = 5-10 reused connections
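Don't take my word for it - here's a small standalone sketch (not part of the production code; the URL is just any reachable HTTPS endpoint) that counts how many distinct sockets 10 sequential requests actually use:
// keepalive-check.ts - count distinct TCP sockets used by sequential HTTPS requests
import * as https from 'https';
import type { Socket } from 'net';

async function socketsUsed(agent: https.Agent, requests: number): Promise<number> {
  const seen = new Set<Socket>();
  for (let i = 0; i < requests; i++) {
    await new Promise<void>((resolve, reject) => {
      const req = https.get('https://fcm.googleapis.com/', { agent }, (res) => {
        res.resume();             // drain the body so the socket can return to the pool
        res.on('end', resolve);
      });
      req.on('socket', (socket) => seen.add(socket)); // records new and reused sockets alike
      req.on('error', reject);
    });
  }
  agent.destroy(); // close idle keep-alive sockets so the process can exit
  return seen.size;
}

(async () => {
  console.log('keepAlive: false →', await socketsUsed(new https.Agent({ keepAlive: false }), 10), 'sockets');
  console.log('keepAlive: true  →', await socketsUsed(new https.Agent({ keepAlive: true }), 10), 'sockets');
})();
With keepAlive: false you should see roughly one socket per request; with keepAlive: true, one or two in total.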
Configuration explained
keepAlive: true
Enables TCP connection reuse. Without this, every request creates a new connection.
keepAliveMsecs: 30000
Sets the initial delay before TCP keep-alive probes are sent on an idle socket (30s here), which stops intermediaries from silently dropping the connection (Firebase's idle timeout is around 60s).
maxSockets: 100
Limits concurrent active connections to 100. Requests queue if exceeded.
maxFreeSockets: 10
Keeps 10 idle connections ready for immediate reuse.
timeout: 60000
Times out a socket after 60s with no activity so stalled requests don't hang forever (far longer than an FCM call should normally take).
scheduling: 'lifo'
Last In, First Out - reuses the most recently used connection (keeps connections "warm").
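One way to sanity-check these numbers in a running service is to log the agent's pools - sockets, freeSockets, and requests are documented properties of Node's http(s).Agent. A minimal sketch (the interval and log format are arbitrary; the agent mirrors the one created in the Firebase module above):
import { Agent as HttpsAgent } from 'https';

const httpsAgent = new HttpsAgent({ keepAlive: true, maxSockets: 100, maxFreeSockets: 10 });

// Each pool is an object keyed by "host:port" whose values are arrays.
function poolSize(pool: NodeJS.ReadOnlyDict<unknown[]>): number {
  return Object.values(pool).reduce((sum, arr) => sum + (arr?.length ?? 0), 0);
}

setInterval(() => {
  console.log(
    `[keep-alive] active=${poolSize(httpsAgent.sockets)}`,
    `idle=${poolSize(httpsAgent.freeSockets)}`,
    `queued=${poolSize(httpsAgent.requests)}`,
  );
}, 10_000).unref(); // unref() so this timer never keeps the process alive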
Measuring the impact
Test: Send 100,000 notifications (200 chunks of 500)
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total time | 12m 30s | 4m 15s | 3x faster |
| TCP connections | 200 | 12 | 16x fewer |
| CPU usage | 45% | 28% | 38% lower |
| nf_conntrack entries | 180 peak | 25 peak | 7x fewer |
Real production (1 week after fix):
- Send time (100K): 12.5 min → 4.2 min
- 502 errors: 3 incidents → 0 incidents
- CPU credit exhaustion: 2 times → 0
- Cost savings: ~$50/month (can downgrade EC2)
Key lessons
1. Default configurations are rarely optimal
Node's https.Agent defaults to keepAlive: false (only the built-in global agent enables it, and only since Node 19). Always enable it explicitly:
const httpsAgent = new HttpsAgent({ keepAlive: true });
2. Check kernel limits for high-connection workloads
# Default is often too low (16,384)
sysctl net.netfilter.nf_conntrack_max
# Increase for API-heavy apps
echo "net.netfilter.nf_conntrack_max = 262144" | sudo tee /etc/sysctl.d/99-custom.conf
sudo sysctl --system
3. Always check system logs
sudo tail -f /var/log/syslog
Kernel messages tell the real story. High CPU/network can have many causes - logs reveal the truth.
4. Problems hide at small scale
My 2-second chunk delays accidentally masked the issue at small scale (10K users), but couldn't keep up at large scale (50K users and two simultaneous campaigns).
How to check if you have this issue
Monitor while sending notifications:
# Check TIME_WAIT connections
ss -s | grep timew
# If you see 200+, you have a problem
# Check conntrack usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# If count is >80% of max, you're at risk
Quick fix for other Firebase users
Add this to your Firebase initialization:
import { Agent as HttpsAgent } from 'https';

const httpsAgent = new HttpsAgent({
  keepAlive: true,
  keepAliveMsecs: 30000,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000,
  scheduling: 'lifo',
});

const app = admin.initializeApp({ /* ... */ });

const https = require('https');
https.globalAgent = httpsAgent;
Overriding https.globalAgent affects every Node.js HTTP client that relies on the default agent - axios, node-fetch, the Google Cloud libraries, and more. Clients configured with their own agent keep their own settings.
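If you'd rather not replace the global agent, most HTTP clients accept an agent explicitly. For example with axios (a sketch - fcmHttp is just an illustrative name):
import axios from 'axios';
import { Agent as HttpsAgent } from 'https';

// Scope keep-alive to a single client instead of replacing https.globalAgent.
const keepAliveAgent = new HttpsAgent({ keepAlive: true, maxSockets: 100 });

export const fcmHttp = axios.create({
  httpsAgent: keepAliveAgent, // axios pulls sockets from this agent for HTTPS requests
  timeout: 60_000,
});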
Conclusion
A 502 error on Friday evening led me to discover I'd been creating 200+ unnecessary TCP connections every time I sent push notifications.
The fix? Three lines of code:
import { Agent as HttpsAgent } from 'https';
const https = require('https');
https.globalAgent = new HttpsAgent({ keepAlive: true });
The result:
- 3x faster sends
- 16x fewer connections
- 38% lower CPU usage
- Zero 502 errors since
If you're making high-volume HTTP requests in Node.js, check your connection pooling settings. You might be surprised what you find.
