Sangwoo Lee

When 502 Bad Gateway Revealed My Firebase Connection Pool Mistake (and How I Fixed It)

A production incident at 5PM - how a 502 error led me to discover we were creating new HTTP connections for every Firebase request, and the simple fix that improved performance 3-5x

5:15 PM. Friday afternoon. My Slack explodes with messages:

"Push notification admin page down - 502 Bad Gateway"
"Can't send any notifications"

I opened my AWS CloudWatch dashboard. Every metric had spiked simultaneously:

  • CPU usage: 15% → 95%
  • Network packets out: 50/sec → 12,000/sec
  • Network bytes out: 1 MB/min → 800 MB/min

My first thought: "Did we get DDoS'd?"

Turns out, I had been creating a new HTTP connection for every single Firebase push notification for 6 months without realizing it. Here's how I discovered the issue and how a 3-line code change fixed everything.

The smoking gun

I checked the system logs:

sudo tail -f /var/log/syslog
Jan 15 17:40:02 kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Jan 15 17:40:02 kernel: nf_conntrack: nf_conntrack: table full, dropping packet
(repeating thousands of times)

That one line explained everything. The Linux kernel's connection tracking table had filled up, so it was dropping packets for every new connection - including health checks, SSH attempts, and API requests.

What actually happened

Without Keep-Alive, each Firebase API call creates a brand new TCP connection:

Request 1: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
Request 2: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
Request 3: New connection → 3-way handshake → Send → Close → TIME_WAIT (60s)
...
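You can watch this happen from Node itself. Here's a small standalone sketch (not my production code - example.com is just a stand-in endpoint) that counts how many distinct local ports get used when keep-alive is off:

// Sketch: count distinct local ports used across 5 requests without keep-alive.
import * as https from 'https';

const noKeepAlive = new https.Agent({ keepAlive: false });   // Node's default for a plain Agent
const localPorts = new Set<number>();

function ping(url: string): Promise<void> {
  return new Promise((resolve, reject) => {
    https.get(url, { agent: noKeepAlive }, (res) => {
      const port = res.socket?.localPort;
      if (port) localPorts.add(port);   // record which local port served this request
      res.resume();                     // drain the body; we only care about the socket
      res.on('end', resolve);
    }).on('error', reject);
  });
}

(async () => {
  for (let i = 0; i < 5; i++) await ping('https://example.com');
  // Prints 5: a brand-new TCP connection (and local port) for every request.
  console.log(`distinct local ports used: ${localPorts.size}`);
})();

Switch the agent to keepAlive: true and the same loop reports a single port - five requests over one reused connection.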

With retry logic (up to 3 retries per chunk):

50,000 users = 100 chunks of 500
Worst case: 100 chunks × 4 attempts = 400 connections
Plus retries and other traffic = 130+ simultaneous TIME_WAIT connections

When this pushed the table past the nf_conntrack_max limit (16,384 on my EC2), the kernel started dropping packets for every new connection → 502 Bad Gateway.
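To make the arithmetic concrete, here's a sketch of the sending pattern (the sendBatch helper is hypothetical, standing in for one FCM multicast send - this isn't the real service code):

// Sketch: chunked sending with retries. Without keep-alive, every attempt
// below opens its own TCP connection.
const CHUNK_SIZE = 500;
const MAX_ATTEMPTS = 4;   // 1 initial try + up to 3 retries

async function sendCampaign(
  tokens: string[],
  sendBatch: (chunk: string[]) => Promise<void>,   // hypothetical: one FCM multicast call
): Promise<void> {
  let connectionsOpened = 0;

  for (let i = 0; i < tokens.length; i += CHUNK_SIZE) {
    const chunk = tokens.slice(i, i + CHUNK_SIZE);

    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      connectionsOpened++;               // without keep-alive: a fresh TCP + TLS handshake per attempt
      try {
        await sendBatch(chunk);
        break;                           // chunk delivered, move on
      } catch {
        if (attempt === MAX_ATTEMPTS) throw new Error('chunk failed after all retries');
      }
    }
  }

  // 50,000 tokens => 100 chunks; worst case 100 chunks x 4 attempts = 400 connections,
  // each one leaving a TIME_WAIT socket and an nf_conntrack entry behind.
  console.log(`connections opened: ${connectionsOpened}`);
}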

Why it worked fine for 6 months

Three factors kept me safe:

1. Chunk delays

const chunkDelay = 2000; // 2 second delay between chunks

This gave connections time to clear from TIME_WAIT.

2. Small campaigns

  • Daily devotionals: 10,000 users (20 chunks)
  • Event announcements: 5,000 users (10 chunks)
  • All well below threshold

3. Peak hour spacing
Notifications were sent at 9 AM, 12 PM, and 9 PM - hours apart.

What changed on Jan 15th?

  • 50,000 users (100 chunks)
  • 2 campaigns sent simultaneously
  • FCM API errors triggered retry storm
  • Result: 400+ connections in 5 minutes → crash

The immediate fix: Increase conntrack limit

# Check current limit
sudo sysctl net.netfilter.nf_conntrack_max
# Output: 16384

# Increase to 262,144
sudo sysctl -w net.netfilter.nf_conntrack_max=262144

# Make permanent
echo "net.netfilter.nf_conntrack_max = 262144" | sudo tee /etc/sysctl.d/99-custom.conf
sudo sysctl --system

This fixed the immediate crisis but didn't address the root cause.

The root cause: No HTTP Keep-Alive

My Firebase initialization code:

// firebase.module.ts - BEFORE (❌ Wrong)
{
  provide: 'FIREBASE_ADMIN',
  useFactory: (configService: ConfigService) => {
    const firebaseEnv = configService.get('firebase');

    const app = admin.initializeApp({
      credential: admin.credential.cert({
        projectId: firebaseEnv.projectId,
        clientEmail: firebaseEnv.clientEmail,
        privateKey: firebaseEnv.privateKey,
      }),
    });

    // ❌ No HTTP agent configuration!
    return app;
  },
  inject: [ConfigService],
}

Overhead per request: TCP handshake (20-50ms) + TLS handshake (50-150ms) = 70-200ms wasted on every single connection.
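A quick way to sanity-check those numbers is to time a cold request against a warm one on the same keep-alive agent (introduced in the next section). This is a rough, standalone sketch - the endpoint is just an example, any HTTPS host works:

// Sketch: compare a cold request (full handshakes) with a warm one (reused socket).
import * as https from 'https';

function timedGet(agent: https.Agent, url: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    https.get(url, { agent }, (res) => {
      res.resume();                                   // drain the body; we only care about timing
      res.on('end', () => resolve(Date.now() - start));
    }).on('error', reject);
  });
}

(async () => {
  const agent = new https.Agent({ keepAlive: true });
  const url = 'https://fcm.googleapis.com';           // any HTTPS endpoint works for this comparison

  const cold = await timedGet(agent, url);            // pays the TCP + TLS handshakes
  const warm = await timedGet(agent, url);            // reuses the socket the agent kept open
  console.log(`cold: ${cold}ms, warm: ${warm}ms, handshake overhead ≈ ${cold - warm}ms`);

  agent.destroy();                                    // close pooled sockets so the script can exit
})();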

The fix: Enable HTTP Keep-Alive

// firebase.module.ts - AFTER (✅ Correct)
import { Agent as HttpsAgent } from 'https';

{
  provide: 'FIREBASE_ADMIN',
  useFactory: (configService: ConfigService) => {
    const firebaseEnv = configService.get('firebase');

    // ✅ Create Keep-Alive HTTPS Agent
    const httpsAgent = new HttpsAgent({
      keepAlive: true,              // Enable connection reuse
      keepAliveMsecs: 30000,        // Initial delay before TCP keep-alive probes on idle sockets
      maxSockets: 100,              // Max 100 concurrent connections
      maxFreeSockets: 10,           // Keep 10 idle sockets ready
      timeout: 60000,               // 60s socket timeout
      scheduling: 'lifo',           // Use most recent connection first
    });

    console.log('🔧 [Firebase] Keep-Alive HTTPS Agent created');

    // ✅ Initialize Firebase
    const app = admin.initializeApp({
      credential: admin.credential.cert({
        projectId: firebaseEnv.projectId,
        clientEmail: firebaseEnv.clientEmail,
        privateKey: firebaseEnv.privateKey,
      }),
    });

    // ✅ Inject Keep-Alive Agent globally
    process.env.GOOGLE_APPLICATION_TIMEOUT = '30000';
    const https = require('https');
    https.globalAgent = httpsAgent;

    console.log('✅ [Firebase] Keep-Alive enabled');
    return app;
  },
  inject: [ConfigService],
}

What changed:

Before: 100 requests = 100 new connections
After: 100 requests = 5-10 reused connections
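If you want to verify the reuse yourself, Node's ClientRequest exposes a reusedSocket flag whenever a keep-alive agent hands back a pooled connection. A minimal sketch (again hitting an arbitrary Google endpoint; results assume the server keeps the connection open, which Google's frontends do):

// Sketch: confirm that requests after the first ride on a pooled socket.
import * as https from 'https';

const agent = new https.Agent({ keepAlive: true, maxSockets: 100, maxFreeSockets: 10 });

function checkReuse(url: string): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const req = https.get(url, { agent }, (res) => {
      res.resume();
      res.on('end', () => resolve(req.reusedSocket === true));   // true when a pooled socket served this request
    });
    req.on('error', reject);
  });
}

(async () => {
  for (let i = 1; i <= 5; i++) {
    console.log(`request ${i}: reusedSocket = ${await checkReuse('https://fcm.googleapis.com')}`);
  }
  agent.destroy();   // let the process exit
})();

Expected output: false for the first request, then true for the rest.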

Configuration explained

keepAlive: true

Enables TCP connection reuse. Without this, every request creates a new connection.

keepAliveMsecs: 30000

Sets the initial delay for TCP keep-alive probes on idle sockets to 30s, so pooled connections stay alive between sends (Firebase's idle timeout is ~60s).

maxSockets: 100

Limits concurrent active connections to 100. Requests queue if exceeded.

maxFreeSockets: 10

Keeps 10 idle connections ready for immediate reuse.

timeout: 60000

Applies a 60s inactivity timeout to each socket so stalled connections can be cleaned up (Firebase requests should complete well within 30s).

scheduling: 'lifo'

Last In, First Out - reuses the most recently used connection (keeps connections "warm").
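These limits are easy to observe at runtime: the agent keeps per-host buckets of active sockets, idle (free) sockets, and queued requests. Here's a small sketch for logging them - maxSockets is lowered to 5 (instead of the 100 used above) purely so queueing becomes visible, and the endpoint is just an example:

// Sketch: peek at the agent's pool to see maxSockets / maxFreeSockets in action.
import * as https from 'https';

const agent = new https.Agent({ keepAlive: true, maxSockets: 5, maxFreeSockets: 10 });

function countAll(bucket: NodeJS.ReadOnlyDict<unknown[]>): number {
  return Object.values(bucket).reduce((sum, list) => sum + (list?.length ?? 0), 0);
}

function poolSnapshot(a: https.Agent) {
  return {
    active: countAll(a.sockets),       // in-flight connections, capped by maxSockets
    idle: countAll(a.freeSockets),     // warm sockets waiting for reuse, capped by maxFreeSockets
    queued: countAll(a.requests),      // requests waiting because maxSockets was hit
  };
}

// Fire a burst of requests and peek at the pool while it drains.
for (let i = 0; i < 20; i++) {
  https.get('https://fcm.googleapis.com', { agent }, (res) => res.resume()).on('error', () => {});
}
setTimeout(() => {
  console.log(poolSnapshot(agent));    // counts vary with timing, e.g. { active: 5, idle: 0, queued: 12 }
  agent.destroy();
}, 200);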

Measuring the impact

Test: Send 100,000 notifications (200 chunks of 500)

Metric                  Before       After       Improvement
Total time              12m 30s      4m 15s      ~3x faster
TCP connections         200          12          ~16x fewer
CPU usage               45%          28%         38% lower
nf_conntrack entries    180 (peak)   25 (peak)   ~7x fewer

Real production (1 week after fix):

  • Send time (100K): 12.5 min → 4.2 min
  • 502 errors: 3 incidents → 0 incidents
  • CPU credit exhaustion: 2 times → 0
  • Cost savings: ~$50/month (can downgrade EC2)

Key lessons

1. Default configurations are rarely optimal

A manually created Agent in Node.js defaults to keepAlive: false (and before Node 19, so did the global agent). Always enable it for high-volume clients:

const httpsAgent = new HttpsAgent({ keepAlive: true });

2. Check kernel limits for high-connection workloads

# Default is often too low (16,384)
sysctl net.netfilter.nf_conntrack_max

# Increase for API-heavy apps
echo "net.netfilter.nf_conntrack_max = 262144" | sudo tee /etc/sysctl.d/99-custom.conf

3. Always check system logs

sudo tail -f /var/log/syslog

Kernel messages tell the real story. High CPU/network can have many causes - logs reveal the truth.

4. Small-scale problems hide at scale

My 2-second chunk delays accidentally kept me safe at small scale (10K users), but stopped being enough at large scale (50K users).

How to check if you have this issue

Monitor while sending notifications:

# Check TIME_WAIT connections
ss -s | grep timew
# If you see 200+, you have a problem

# Check conntrack usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# If count is >80% of max, you're at risk
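You can also run the same check from inside the Node app - handy for logging or alerting while a big campaign is being sent. A sketch (Linux only, and the proc paths exist only when the nf_conntrack module is loaded; the 80% threshold is my own choice):

// Sketch: read the conntrack counters from /proc and warn when the table gets close to full.
import { readFileSync } from 'fs';

function conntrackUsage(): { count: number; max: number; ratio: number } {
  const read = (path: string) => parseInt(readFileSync(path, 'utf8').trim(), 10);
  const count = read('/proc/sys/net/netfilter/nf_conntrack_count');
  const max = read('/proc/sys/net/netfilter/nf_conntrack_max');
  return { count, max, ratio: count / max };
}

const { count, max, ratio } = conntrackUsage();
if (ratio > 0.8) {
  console.warn(`nf_conntrack at ${Math.round(ratio * 100)}% (${count}/${max}) - packets may start getting dropped`);
}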

Quick fix for other Firebase users

Add this to your Firebase initialization:

import { Agent as HttpsAgent } from 'https';

const httpsAgent = new HttpsAgent({
  keepAlive: true,
  keepAliveMsecs: 30000,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000,
  scheduling: 'lifo',
});

const app = admin.initializeApp({ /* ... */ });

const https = require('https');
https.globalAgent = httpsAgent;

This covers any Node.js HTTP client that relies on the default global agent - axios, node-fetch, the Google Cloud libraries, and so on. Clients configured with their own agent need the agent passed explicitly, as in the sketch below.
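If you'd rather not mutate the global agent at all, most clients accept one directly. For example, with axios (the baseURL and timeout values here are just placeholders):

// Sketch: give one axios instance its own keep-alive pool instead of touching https.globalAgent.
import axios from 'axios';
import { Agent as HttpsAgent } from 'https';

const httpsAgent = new HttpsAgent({ keepAlive: true, maxSockets: 100, maxFreeSockets: 10 });

const fcmClient = axios.create({
  baseURL: 'https://fcm.googleapis.com',   // example base URL
  httpsAgent,                              // every request from this instance reuses the pooled sockets
  timeout: 10000,
});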

Conclusion

A 502 error on Friday evening led me to discover I'd been creating 200+ unnecessary TCP connections every time I sent push notifications.

The fix? Three lines of code:

import { Agent as HttpsAgent } from 'https';
const httpsAgent = new HttpsAgent({ keepAlive: true });
require('https').globalAgent = httpsAgent;

The result:

  • 3x faster sends
  • 16x fewer connections
  • 38% lower CPU usage
  • Zero 502 errors since

If you're making high-volume HTTP requests in Node.js, check your connection pooling settings. You might be surprised what you find.
