Sangwoo Lee

When 502 Bad Gateway Revealed My Firebase Connection Pool Mistake (and How I Fixed It)

A production incident at 5PM - how a 502 error led me to discover we were creating new HTTP connections for every Firebase request, and the simple fix that improved performance 3-5x

5:15 PM. Friday afternoon. My Slack explodes with messages:

"Push notification admin page down - 502 Bad Gateway"
"Can't send any notifications"

I opened my AWS CloudWatch dashboard. Every metric had spiked simultaneously:

  • CPU usage: 15% → 95%
  • Network packets out: 50/sec → 12,000/sec
  • Network bytes out: 1 MB/min → 800 MB/min

My first thought: "Did we get DDoS'd?"

Turns out, I had been creating a new HTTP connection for every single Firebase push notification for 6 months without realizing it. Here's how I discovered the issue and how a 3-line code change fixed everything.

The smoking gun

I checked the system logs:

sudo tail -f /var/log/syslog
Jan 15 17:40:02 kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Jan 15 17:40:02 kernel: nf_conntrack: nf_conntrack: table full, dropping packet
(repeating thousands of times)

That one line explained everything. The Linux kernel's connection tracking table had filled up, so it was dropping packets for every new connection - including health checks, SSH attempts, and API requests.

What actually happened

Without Keep-Alive, each Firebase API call creates a brand new TCP connection:

Request 1: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
Request 2: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
Request 3: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
...
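You can watch this happen from Node itself. Here's a small standalone sketch (not my production code - example.com is just a stand-in endpoint) that counts how many distinct local ports get used when keep-alive is off:

// Sketch: count distinct local ports used across 5 requests without keep-alive.
import * as https from 'https';

const noKeepAlive = new https.Agent({ keepAlive: false });   // Node's default for a plain Agent
const localPorts = new Set<number>();

function ping(url: string): Promise<void> {
  return new Promise((resolve, reject) => {
    https.get(url, { agent: noKeepAlive }, (res) => {
      const port = res.socket?.localPort;
      if (port) localPorts.add(port);   // record which local port served this request
      res.resume();                     // drain the body; we only care about the socket
      res.on('end', resolve);
    }).on('error', reject);
  });
}

(async () => {
  for (let i = 0; i < 5; i++) await ping('https://example.com');
  // Prints 5: a brand-new TCP connection (and local port) for every request.
  console.log(`distinct local ports used: ${localPorts.size}`);
})();

Switch the agent to keepAlive: true and the same loop reports a single port - five requests over one reused connection.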

With retry logic (up to 3 retries per chunk):

50,000 users = 100 chunks of 500
Worst case: 100 chunks × 4 attempts = 400 connections
Plus retries and other traffic = 130+ simultaneous TIME_WAIT connections

When this pushed the table past the nf_conntrack_max limit (16,384 on my EC2), the kernel started dropping packets for every new connection → 502 Bad Gateway.
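To make the arithmetic concrete, here's a sketch of the sending pattern (the sendBatch helper is hypothetical, standing in for one FCM multicast send - this isn't the real service code):

// Sketch: chunked sending with retries. Without keep-alive, every attempt
// below opens its own TCP connection.
const CHUNK_SIZE = 500;
const MAX_ATTEMPTS = 4;   // 1 initial try + up to 3 retries

async function sendCampaign(
  tokens: string[],
  sendBatch: (chunk: string[]) => Promise<void>,   // hypothetical: one FCM multicast call
): Promise<void> {
  let connectionsOpened = 0;

  for (let i = 0; i < tokens.length; i += CHUNK_SIZE) {
    const chunk = tokens.slice(i, i + CHUNK_SIZE);

    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      connectionsOpened++;               // without keep-alive: a fresh TCP + TLS handshake per attempt
      try {
        await sendBatch(chunk);
        break;                           // chunk delivered, move on
      } catch {
        if (attempt === MAX_ATTEMPTS) throw new Error('chunk failed after all retries');
      }
    }
  }

  // 50,000 tokens => 100 chunks; worst case 100 chunks x 4 attempts = 400 connections,
  // each one leaving a TIME_WAIT socket and an nf_conntrack entry behind.
  console.log(`connections opened: ${connectionsOpened}`);
}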

Why it worked fine for 6 months

Three factors kept me safe:

1. Chunk delays

const chunkDelay = 2000; // 2 second delay between chunks

This gave connections time to clear from TIME_WAIT.

2. Small campaigns

  • Daily devotionals: 10,000 users (20 chunks)
  • Event announcements: 5,000 users (10 chunks)
  • All well below threshold

3. Peak hour spacing
Notifications were sent at 9 AM, 12 PM, and 9 PM - hours apart.

What changed on Jan 15th?

  • 50,000 users (100 chunks)
  • 2 campaigns sent simultaneously
  • FCM API errors triggered retry storm
  • Result: 400+ connections in 5 minutes → crash

The immediate fix: Increase conntrack limit

# Check current limit
sudo sysctl net.netfilter.nf_conntrack_max
# Output: 16384

# Increase to 262,144
sudo sysctl -w net.netfilter.nf_conntrack_max=262144

# Make permanent
echo "net.netfilter.nf_conntrack_max = 262144" | sudo tee /etc/sysctl.d/99-custom.conf
sudo sysctl --system

This fixed the immediate crisis but didn't address the root cause.

The root cause: No HTTP Keep-Alive

My Firebase initialization code:

// firebase.module.ts - BEFORE (❌ Wrong)
{
  provide: 'FIREBASE_ADMIN',
  useFactory: (configService: ConfigService) => {
    const firebaseEnv = configService.get('firebase');

    const app = admin.initializeApp({
      credential: admin.credential.cert({
        projectId: firebaseEnv.projectId,
        clientEmail: firebaseEnv.clientEmail,
        privateKey: firebaseEnv.privateKey,
      }),
    });

    // ❌ No HTTP agent configuration!
    return app;
  },
  inject: [ConfigService],
}

Overhead per request: TCP handshake (20-50ms) + TLS handshake (50-150ms) = 70-200ms wasted on every single connection.
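A quick way to sanity-check those numbers is to time a cold request against a warm one on the same keep-alive agent (introduced in the next section). This is a rough, standalone sketch - the endpoint is just an example, any HTTPS host works:

// Sketch: compare a cold request (full handshakes) with a warm one (reused socket).
import * as https from 'https';

function timedGet(agent: https.Agent, url: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    https.get(url, { agent }, (res) => {
      res.resume();                                   // drain the body; we only care about timing
      res.on('end', () => resolve(Date.now() - start));
    }).on('error', reject);
  });
}

(async () => {
  const agent = new https.Agent({ keepAlive: true });
  const url = 'https://fcm.googleapis.com';           // any HTTPS endpoint works for this comparison

  const cold = await timedGet(agent, url);            // pays the TCP + TLS handshakes
  const warm = await timedGet(agent, url);            // reuses the socket the agent kept open
  console.log(`cold: ${cold}ms, warm: ${warm}ms, handshake overhead ≈ ${cold - warm}ms`);

  agent.destroy();                                    // close pooled sockets so the script can exit
})();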

The fix: Enable HTTP Keep-Alive

// firebase.module.ts - AFTER (✅ Correct)
import { Agent as HttpsAgent } from 'https';

{
  provide: 'FIREBASE_ADMIN',
  useFactory: (configService: ConfigService) => {
    const firebaseEnv = configService.get('firebase');

    // ✅ Create Keep-Alive HTTPS Agent
    const httpsAgent = new HttpsAgent({
      keepAlive: true,              // Enable connection reuse
      keepAliveMsecs: 30000,        // Initial delay before TCP keep-alive probes on idle sockets
      maxSockets: 100,              // Max 100 concurrent connections
      maxFreeSockets: 10,           // Keep 10 idle sockets ready
      timeout: 60000,               // 60s socket timeout
      scheduling: 'lifo',           // Use most recent connection first
    });

    console.log('🔧 [Firebase] Keep-Alive HTTPS Agent created');

    // ✅ Initialize Firebase
    const app = admin.initializeApp({
      credential: admin.credential.cert({
        projectId: firebaseEnv.projectId,
        clientEmail: firebaseEnv.clientEmail,
        privateKey: firebaseEnv.privateKey,
      }),
    });

    // ✅ Inject Keep-Alive Agent globally
    process.env.GOOGLE_APPLICATION_TIMEOUT = '30000';
    const https = require('https');
    https.globalAgent = httpsAgent;

    console.log('✅ [Firebase] Keep-Alive enabled');
    return app;
  },
  inject: [ConfigService],
}

What changed:

Before: 100 requests = 100 new connections
After: 100 requests = 5-10 reused connections
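If you want to verify the reuse yourself, Node's ClientRequest exposes a reusedSocket flag whenever a keep-alive agent hands back a pooled connection. A minimal sketch (again hitting an arbitrary Google endpoint; results assume the server keeps the connection open, which Google's frontends do):

// Sketch: confirm that requests after the first ride on a pooled socket.
import * as https from 'https';

const agent = new https.Agent({ keepAlive: true, maxSockets: 100, maxFreeSockets: 10 });

function checkReuse(url: string): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const req = https.get(url, { agent }, (res) => {
      res.resume();
      res.on('end', () => resolve(req.reusedSocket === true));   // true when a pooled socket served this request
    });
    req.on('error', reject);
  });
}

(async () => {
  for (let i = 1; i <= 5; i++) {
    console.log(`request ${i}: reusedSocket = ${await checkReuse('https://fcm.googleapis.com')}`);
  }
  agent.destroy();   // let the process exit
})();

Expected output: false for the first request, then true for the rest.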

Configuration explained

keepAlive: true

Enables TCP connection reuse. Without this, every request creates a new connection.

keepAliveMsecs: 30000

Sets the initial delay for TCP keep-alive probes on idle sockets to 30s, so pooled connections stay alive between sends (Firebase's idle timeout is ~60s).

maxSockets: 100

Limits concurrent active connections to 100. Requests queue if exceeded.

maxFreeSockets: 10

Keeps 10 idle connections ready for immediate reuse.

timeout: 60000

Applies a 60s inactivity timeout to each socket so stalled connections can be cleaned up (Firebase requests should complete well within 30s).

scheduling: 'lifo'

Last In, First Out - reuses the most recently used connection (keeps connections "warm").
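These limits are easy to observe at runtime: the agent keeps per-host buckets of active sockets, idle (free) sockets, and queued requests. Here's a small sketch for logging them - maxSockets is lowered to 5 (instead of the 100 used above) purely so queueing becomes visible, and the endpoint is just an example:

// Sketch: peek at the agent's pool to see maxSockets / maxFreeSockets in action.
import * as https from 'https';

const agent = new https.Agent({ keepAlive: true, maxSockets: 5, maxFreeSockets: 10 });

function countAll(bucket: NodeJS.ReadOnlyDict<unknown[]>): number {
  return Object.values(bucket).reduce((sum, list) => sum + (list?.length ?? 0), 0);
}

function poolSnapshot(a: https.Agent) {
  return {
    active: countAll(a.sockets),       // in-flight connections, capped by maxSockets
    idle: countAll(a.freeSockets),     // warm sockets waiting for reuse, capped by maxFreeSockets
    queued: countAll(a.requests),      // requests waiting because maxSockets was hit
  };
}

// Fire a burst of requests and peek at the pool while it drains.
for (let i = 0; i < 20; i++) {
  https.get('https://fcm.googleapis.com', { agent }, (res) => res.resume()).on('error', () => {});
}
setTimeout(() => {
  console.log(poolSnapshot(agent));    // counts vary with timing, e.g. { active: 5, idle: 0, queued: 12 }
  agent.destroy();
}, 200);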

Measuring the impact

Test: Send 100,000 notifications (200 chunks of 500)

Metric                  Before       After       Improvement
Total time              12m 30s      4m 15s      ~3x faster
TCP connections         200          12          ~16x fewer
CPU usage               45%          28%         38% lower
nf_conntrack entries    180 (peak)   25 (peak)   ~7x fewer

Real production (1 week after fix):

  • Send time (100K): 12.5 min → 4.2 min
  • 502 errors: 3 incidents → 0 incidents
  • CPU credit exhaustion: 2 times → 0
  • Cost savings: ~$50/month (can downgrade EC2)

Key lessons

1. Default configurations are rarely optimal

A manually created Agent in Node.js defaults to keepAlive: false (and before Node 19, so did the global agent). Always enable it for high-volume clients:

const httpsAgent = new HttpsAgent({ keepAlive: true });

2. Check kernel limits for high-connection workloads

# Default is often too low (16,384)
sysctl net.netfilter.nf_conntrack_max

# Increase for API-heavy apps
echo "net.netfilter.nf_conntrack_max = 262144" | sudo tee /etc/sysctl.d/99-custom.conf

3. Always check system logs

sudo tail -f /var/log/syslog

Kernel messages tell the real story. High CPU/network can have many causes - logs reveal the truth.

4. Small-scale problems hide at scale

My 2-second chunk delays accidentally kept me safe at small scale (10K users), but stopped being enough at large scale (50K users).

How to check if you have this issue

Monitor while sending notifications:

# Check TIME_WAIT connections
ss -s | grep timew
# If you see 200+, you have a problem

# Check conntrack usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# If count is >80% of max, you're at risk
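You can also run the same check from inside the Node app - handy for logging or alerting while a big campaign is being sent. A sketch (Linux only, and the proc paths exist only when the nf_conntrack module is loaded; the 80% threshold is my own choice):

// Sketch: read the conntrack counters from /proc and warn when the table gets close to full.
import { readFileSync } from 'fs';

function conntrackUsage(): { count: number; max: number; ratio: number } {
  const read = (path: string) => parseInt(readFileSync(path, 'utf8').trim(), 10);
  const count = read('/proc/sys/net/netfilter/nf_conntrack_count');
  const max = read('/proc/sys/net/netfilter/nf_conntrack_max');
  return { count, max, ratio: count / max };
}

const { count, max, ratio } = conntrackUsage();
if (ratio > 0.8) {
  console.warn(`nf_conntrack at ${Math.round(ratio * 100)}% (${count}/${max}) - packets may start getting dropped`);
}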

Quick fix for other Firebase users

Add this to your Firebase initialization:

import { Agent as HttpsAgent } from 'https';

const httpsAgent = new HttpsAgent({
  keepAlive: true,
  keepAliveMsecs: 30000,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000,
  scheduling: 'lifo',
});

const app = admin.initializeApp({ /* ... */ });

const https = require('https');
https.globalAgent = httpsAgent;

This covers any Node.js HTTP client that relies on the default global agent - axios, node-fetch, the Google Cloud libraries, and so on. Clients configured with their own agent need the agent passed explicitly, as in the sketch below.
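If you'd rather not mutate the global agent at all, most clients accept one directly. For example, with axios (the baseURL and timeout values here are just placeholders):

// Sketch: give one axios instance its own keep-alive pool instead of touching https.globalAgent.
import axios from 'axios';
import { Agent as HttpsAgent } from 'https';

const httpsAgent = new HttpsAgent({ keepAlive: true, maxSockets: 100, maxFreeSockets: 10 });

const fcmClient = axios.create({
  baseURL: 'https://fcm.googleapis.com',   // example base URL
  httpsAgent,                              // every request from this instance reuses the pooled sockets
  timeout: 10000,
});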

Conclusion

A 502 error on Friday evening led me to discover I'd been creating 200+ unnecessary TCP connections every time I sent push notifications.

The fix? Three lines of code:

import { Agent as HttpsAgent } from 'https';
const httpsAgent = new HttpsAgent({ keepAlive: true });
require('https').globalAgent = httpsAgent;

The result:

  • 3x faster sends
  • 16x fewer connections
  • 38% lower CPU usage
  • Zero 502 errors since

If you're making high-volume HTTP requests in Node.js, check your connection pooling settings. You might be surprised what you find.
