Jayesh Shinde

The Silent Connection Killer: MySQL2 and AWS Lambda's Freeze/Thaw Problem

The Mystery Error

You're running a Node.js Lambda with MySQL2, everything works great in testing, but production logs show intermittent failures:

Error: Connection lost: The server closed the connection
  code: "PROTOCOL_CONNECTION_LOST"
  fatal: true

No pattern. No warning. Just random failures that make you question your life choices.

Understanding the Real Problem: How Lambda Actually Works

Lambda doesn't spin up a fresh container for every request. AWS keeps containers "warm" for reuse:

Request 1 → Lambda runs → Response
                ↓
           [FREEZE] ← Container paused (not terminated)
                ↓     
         (5-15 min pass)
                ↓
           [THAW] ← Container resumed
                ↓
Request 2 → Lambda runs → 💥 PROTOCOL_CONNECTION_LOST


During freeze, your Lambda is literally paused. The JavaScript event loop stops. Timers stop. Everything stops.
But here's the catch: the outside world doesn't stop.

What Happens to Your Database Connection

  1. Lambda creates a MySQL connection pool
  2. Connections sit idle in the pool
  3. Lambda freezes (container paused)
  4. Real world time passes (5-30 minutes)
  5. Network timeouts occur, NAT gateways clear state, RDS Proxy cleans up
  6. The TCP socket dies, but your pool doesn't know
  7. Lambda thaws, tries to use the dead connection
  8. 💥 PROTOCOL_CONNECTION_LOST

Why idleTimeout Doesn't Help

You might think: "I'll set idleTimeout: 60000 to clean up idle connections!"
Here's why it doesn't work:

Timer starts (60s countdown)
    ↓
Lambda FREEZES at 1s elapsed
    ↓
████████████████████████████████
█  15 minutes pass in REAL WORLD  █
█  Timer is PAUSED at 1s          █
████████████████████████████████
    ↓
Lambda THAWS - timer resumes at 1s
    ↓
Connection still in pool (timer thinks 2s passed)
    ↓
Connection is DEAD but pool doesn't know


The timer doesn't run during freeze. Your 60-second timeout is useless against a 15-minute freeze.

The Solution: Detect and Retry
Since we can't prevent stale connections, we detect them and retry transparently.

Step 1: Enable TCP Keep-Alive

const pool = mysql.createPool({
  ...config,
  enableKeepAlive: true,
  keepAliveInitialDelay: 10000,
});

This helps get clear error codes (ECONNRESET, PROTOCOL_CONNECTION_LOST) instead of hanging indefinitely.

Step 2: Implement Retry Logic

// Method on a database helper class; getConnectionFromPool() and
// executeQueryWithConnection() are this class's thin wrappers around
// the pool's promise API.
async executeQuery(sql, params) {
  const maxRetries = 1;
  let lastError = null;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    let connection = null;

    try {
      connection = await this.getConnectionFromPool();
      const result = await this.executeQueryWithConnection(connection, sql, params);
      connection.release();
      return result;
    } catch (error) {
      // Non-recoverable error - throw immediately
      if (!this.isConnectionLostError(error)) {
        if (connection) connection.release();
        throw error;
      }

      // Connection lost - destroy stale connection
      if (connection) connection.destroy();

      // Retry if attempts left
      lastError = error;
      if (attempt < maxRetries) {
        console.warn("Connection lost, retrying...", { attempt: attempt + 1 });
        continue;
      }
    }
  }

  throw lastError;
}

isConnectionLostError(error) {
  const recoverableCodes = [
    "PROTOCOL_CONNECTION_LOST",  // Server closed connection
    "ECONNRESET",                // TCP reset
    "EPIPE",                     // Broken pipe
    "ETIMEDOUT",                 // Connection timeout
    "ECONNREFUSED",              // Connection refused
  ];
  return recoverableCodes.includes(error?.code);
}

Key Points:

  • connection.destroy() removes the stale connection from the pool (don't reuse it!)
  • connection.release() returns a healthy connection to the pool
  • One retry is usually enough: the second attempt gets a fresh connection
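
To see the retry path in isolation, here's a self-contained mock of the flow above. No real MySQL is involved: `FlakyPool`, `makeConn`, and the shared error object are invented for illustration, and the pool's FIFO behavior mirrors how mysql2 hands out idle connections.

```javascript
// Mock demo of the retry flow: the first connection is stale, the retry
// gets a healthy one. FlakyPool and makeConn are illustrative fakes.
const STALE = Object.assign(new Error("Connection lost"), {
  code: "PROTOCOL_CONNECTION_LOST",
});

function makeConn(stale) {
  return {
    stale,
    destroyed: false,
    async query() {
      if (this.stale) throw STALE;
      return [{ ok: 1 }];
    },
    release() {},
    destroy() { this.destroyed = true; },
  };
}

class FlakyPool {
  constructor(conns) { this.conns = conns; }
  async getConnection() { return this.conns.shift(); } // FIFO, like mysql2
}

// Same shape as the executeQuery above, with maxRetries = 1
async function executeQueryDemo(pool, sql) {
  let lastError = null;
  for (let attempt = 0; attempt <= 1; attempt++) {
    const connection = await pool.getConnection();
    try {
      const result = await connection.query(sql);
      connection.release();
      return result;
    } catch (error) {
      if (error.code !== "PROTOCOL_CONNECTION_LOST") {
        connection.release();
        throw error;
      }
      connection.destroy(); // stale: never return it to the pool
      lastError = error;
    }
  }
  throw lastError;
}

// One stale connection followed by a healthy one: the retry succeeds.
const demoPool = new FlakyPool([makeConn(true), makeConn(false)]);
executeQueryDemo(demoPool, "SELECT 1").then((rows) =>
  console.log(rows[0].ok) // 1
);
```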

What About Pool Settings?

Do I Need to Tune connectionLimit, maxIdle, etc.?
Short answer: Not really.

| Setting | Helps with freeze/thaw? | Why |
| --- | --- | --- |
| `idleTimeout` | No | Timer is paused during freeze |
| `maxIdle` | Marginally | Fewer connections means fewer stale ones, but adds reconnection overhead |
| `connectionLimit` | No | Doesn't affect stale connections |
| Retry logic | Yes | Handles stale connections at runtime |

If you're using RDS Proxy, it handles connection pooling at the infrastructure level. Keep your Lambda pool settings simple and let the retry logic do the heavy lifting.
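
Concretely, a simple Lambda-friendly setup might look like the sketch below. The option values are illustrative examples, not tuning advice, and the environment variable names are assumptions:

```javascript
// Illustrative options for mysql2's createPool() in a Lambda handler.
// Values are examples; env var names are placeholders.
const poolConfig = {
  host: process.env.DB_HOST,       // point at RDS Proxy if you use one
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  connectionLimit: 2,              // one request at a time per Lambda container
  enableKeepAlive: true,           // surfaces dead sockets as clear errors
  keepAliveInitialDelay: 10000,
  // idleTimeout deliberately left at its default: timers pause during
  // freeze, so tuning it buys nothing here.
};
// const pool = require("mysql2/promise").createPool(poolConfig);
```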

Using RDS Proxy?

RDS Proxy's "Idle client connection timeout" (default: 30 minutes) is separate from MySQL's wait_timeout. The proxy manages Lambda→Proxy connections independently.
But even with RDS Proxy, connections can die due to:

  • NAT gateway timeouts (typically 5-15 minutes for idle TCP)
  • Network state table cleanup
  • Proxy internal connection recycling

The retry logic is still your safety net.

The Final Architecture

┌─────────────────────────────────────────────────────┐
│                    Your Lambda                       │
├─────────────────────────────────────────────────────┤
│                                                      │
│   executeQuery()                                     │
│       ↓                                             │
│   for (attempt = 0; attempt <= 1; attempt++)        │
│       ↓                                             │
│   getConnection() → Try query                       │
│       ↓                                             │
│   Success? → return result                          │
│       ↓                                             │
│   Connection lost? → destroy() → retry              │
│       ↓                                             │
│   Other error? → release() → throw                  │
│                                                      │
└─────────────────────────────────────────────────────┘
           ↓
    [RDS Proxy] (optional)
           ↓
    [MySQL/Aurora]

Summary

| Problem | Solution |
| --- | --- |
| Lambda freezes, connections go stale | Retry logic detects and recovers |
| Pool doesn't know connections are dead | `enableKeepAlive` for faster detection |
| `idleTimeout` doesn't work during freeze | Accept it; rely on retry instead |
| Random `PROTOCOL_CONNECTION_LOST` | Transparent retry, so users don't notice |

The key insight: You can't prevent stale connections in a serverless environment. But you can detect them instantly and retry transparently.

BUT....


The Problem With Just Retrying Once

The first part of this post suggested detecting a stale connection error and retrying once. That works, but only if a single connection went stale. After a longer Lambda freeze (10–15+ minutes), the entire pool goes stale. Here's the scenario:

Pool has 5 connections → Lambda freezes → all 5 TCP sockets die

Request comes in after thaw:
→ gets conn #1 from _freeConnections → FAILS (stale)
→ retry: gets conn #2 from _freeConnections → FAILS (also stale!)

One retry is not enough because _freeConnections is a queue — the retry just picks the next dead connection in line.
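
A toy simulation makes the failure mode obvious. Plain objects stand in for connections, and `getConnection` shifts from the front of the queue exactly the way mysql2 does:

```javascript
// Simulate a fully stale pool: every connection in the FIFO queue is dead.
const queueOfStale = [
  { id: 1, stale: true },
  { id: 2, stale: true },
  { id: 3, stale: true },
];

function getNext() {
  return queueOfStale.shift(); // FIFO, no health check, like mysql2
}

function tryQuery(conn) {
  if (conn.stale) throw new Error("PROTOCOL_CONNECTION_LOST");
  return "ok";
}

let queryResult = null;
for (let attempt = 0; attempt <= 1; attempt++) { // single retry
  const conn = getNext();
  try {
    queryResult = tryQuery(conn);
    break;
  } catch (e) {
    // destroy and retry, but the next connection in line is just as dead
  }
}

console.log(queryResult); // null: both attempts hit stale connections
```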


How mysql2's Pool Actually Works Internally

Looking at mysql2's pool.js source, getConnection() does this:

// Simplified from mysql2 internals
getConnection(cb) {
  if (this._freeConnections.length > 0) {
    const connection = this._freeConnections.shift(); // FIFO — no health check
    return cb(null, connection);
  }
  // if allConnections.length < connectionLimit → create a NEW connection
  // otherwise → queue the request in _connectionQueue
}

There is zero validation when pulling from _freeConnections. The pool hands you whatever is sitting there, stale or not.

The inverse is also useful to know — when _freeConnections is empty AND _allConnections.length < connectionLimit, mysql2 will automatically create a brand new TCP connection. This is the behavior we want to exploit.


The Fix: Drain All Free Connections on a Stale Error

Instead of retrying once and hoping the next connection is healthy, destroy every connection in _freeConnections the moment you detect a stale error. The pool's own logic then forces a fresh connection on retry.

const STALE_ERRORS = new Set([
  'PROTOCOL_CONNECTION_LOST',
  'ECONNRESET',
  'EPIPE',
  'ETIMEDOUT',
  'ECONNREFUSED',
]);

function drainFreeConnections(pool) {
  // Destroy all idle connections in one sweep
  for (const conn of pool._freeConnections) {
    conn.destroy(); // sets conn._pool = null, removes from _allConnections
  }
  pool._freeConnections.length = 0; // clear the array in-place
  // pool._allConnections.length is now reduced → next getConnection()
  // sees: freeConnections empty + allConnections < connectionLimit
  // → creates a fresh TCP connection automatically
}

// Assumes a promise pool, e.g. require('mysql2/promise').createPool(...)
async function executeQuery(pool, sql, params) {
  let connection;
  try {
    connection = await pool.getConnection();
    const [rows] = await connection.query(sql, params);
    connection.release();
    return rows;
  } catch (err) {
    if (connection) connection.destroy(); // kill the one that triggered the error

    if (STALE_ERRORS.has(err.code)) {
      // Nuke all remaining free connections — they're all suspect
      drainFreeConnections(pool);

      // Retry — pool is now forced to open a fresh connection
      const freshConn = await pool.getConnection();
      try {
        const [rows] = await freshConn.query(sql, params);
        freshConn.release();
        return rows;
      } catch (retryErr) {
        freshConn.destroy();
        throw retryErr;
      }
    }

    throw err;
  }
}

Why pool._freeConnections.length = 0 instead of a while loop with shift()?

pool._freeConnections is an array used as a FIFO queue — mysql2 uses shift() when getting connections and push() when releasing them. Since we're destroying all of them, iterating with for...of then zeroing the length is simpler and safer than mutating the array mid-iteration with shift().
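
The in-place drain is easy to verify against a mock. The pool object below is invented for illustration (only the `_freeConnections` shape matters): `destroy()` is called on every idle connection, and the array is emptied without being replaced, so any internal reference mysql2 holds to it stays valid.

```javascript
// Mock pool demonstrating the drain: every idle connection is destroyed
// and the queue is cleared in place (same array reference afterwards).
function makeMockConn() {
  return { destroyed: false, destroy() { this.destroyed = true; } };
}

const mockPool = {
  _freeConnections: [makeMockConn(), makeMockConn(), makeMockConn()],
};

function drainFreeConnectionsDemo(pool) {
  for (const conn of pool._freeConnections) {
    conn.destroy();
  }
  pool._freeConnections.length = 0; // clear in place, keep the same array
}

const conns = [...mockPool._freeConnections]; // snapshot before draining
const sameArray = mockPool._freeConnections;
drainFreeConnectionsDemo(mockPool);

console.log(mockPool._freeConnections.length);      // 0
console.log(conns.every((c) => c.destroyed));       // true
console.log(sameArray === mockPool._freeConnections); // true: emptied, not replaced
```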

What conn.destroy() does under the hood

When you call destroy(), mysql2 does:

// Simplified from connection._removeFromPool()
this._allConnections.splice(this._allConnections.indexOf(connection), 1);
connection._pool = null;
// TCP socket is closed

So after drainFreeConnections:

  • _freeConnections → empty ✅
  • _allConnections.length → drops back toward 0 ✅
  • Next getConnection() → pool sees room under connectionLimit → creates fresh TCP connection ✅

Comparison: Single Retry vs. Drain All

| Approach | After 1 stale conn | After full pool stale |
| --- | --- | --- |
| Single retry (first half of this post) | ✅ Works | ❌ Retry hits another stale conn |
| Drain all + retry (this approach) | ✅ Works | ✅ Pool forced to create fresh conn |

A Note on _freeConnections Being a Private API

_freeConnections is not exported in mysql2's TypeScript typings (it's missing from Pool.d.ts). In TypeScript, you'll need a cast:

(pool as any)._freeConnections

It has been stable and present since the beginning of the library. But since it's not officially part of the public API, it's worth keeping an eye on across major version upgrades.


Why Prisma Doesn't Have This Problem

If you're using Prisma, you may have noticed it doesn't suffer from the freeze/thaw stale connection issue as badly. There's a concrete reason for this — it's not magic, it's the Rust query engine.

Prisma uses a connection pool built on the mobc library inside its Rust engine. Before handing a connection to your query, it performs a time-gated pre-ping:

Connection pulled from pool
         ↓
Has more than 15 seconds passed since this connection was last used?
    YES → run SELECT 1
              ↓
         SELECT 1 succeeds? → proceed with query
         SELECT 1 fails?    → discard, open fresh connection
    NO  → skip ping, proceed directly (optimization for rapid queries)

This is essentially the same pattern as SQLAlchemy's pool_pre_ping=True, with a 15-second grace window to avoid pinging on every rapid-fire query.

After a Lambda freeze of any meaningful duration (seconds to minutes), the timer has expired, so Prisma will ping before your query even runs — and silently replace any dead connection. Your application code never sees the error.
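
You can emulate this pattern yourself on top of any pool. The sketch below is a simplified illustration, not Prisma's actual implementation: the ping function and clock are injected so the logic runs without a database, and `createPrePing` is a name invented here. On a real mysql2 promise pool the ping would be something like `conn.query("SELECT 1")`.

```javascript
// Time-gated pre-ping in the spirit of Prisma's pool / SQLAlchemy's
// pool_pre_ping. Illustrative sketch: ping and clock are injectable.
const PING_THRESHOLD_MS = 15000;

function createPrePing({ ping, now = Date.now }) {
  const lastUsed = new WeakMap(); // per-connection "last used" timestamp

  return async function checkout(getConnection) {
    let conn = await getConnection();
    const idleFor = now() - (lastUsed.get(conn) ?? 0);

    if (idleFor > PING_THRESHOLD_MS) {
      try {
        await ping(conn); // e.g. conn.query("SELECT 1") on a real pool
      } catch {
        conn.destroy?.();             // dead: discard it
        conn = await getConnection(); // and fetch a fresh one
      }
    }

    lastUsed.set(conn, now());
    return conn;
  };
}
```

With a fake clock set past the 15-second window, a dead connection fails its ping and is silently replaced before the caller ever sees it, which is exactly the behavior described above.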

How it stacks up

| | mysql2 (raw pool) | Prisma (Rust engine) |
| --- | --- | --- |
| Pre-ping on checkout | ❌ None | ✅ If idle >15s |
| Handles full pool going stale | ❌ Needs manual drain logic | ✅ Each connection validated individually |
| Error surfaces to app code | ✅ Yes, you must handle it | ❌ Transparent, retried internally |
| Overhead | None (no extra queries) | One `SELECT 1` per connection after idle period |

Even Prisma Isn't Bulletproof

Worth noting: Prisma's pre-ping protects against stale connections, but the 15-second threshold means a freeze shorter than 15 seconds could still theoretically slip through. And connection-level issues outside the pool (e.g. NAT gateway state tables, RDS Proxy recycling) can still cause failures that the pre-ping doesn't catch. Retry logic at the application layer remains a good safety net regardless of ORM.


Summary

The key insight is to work with mysql2's internal pool logic rather than against it:

  1. On a stale connection error, don't just retry — drain _freeConnections first
  2. mysql2 will automatically open fresh connections to fill the gap (it's built into getConnection())
  3. Your retry then gets a genuinely new TCP connection instead of another dead one from the queue

If you want this behavior without managing it yourself, Prisma's Rust engine gives you a time-gated pre-ping out of the box — which is the more principled long-term solution for serverless MySQL workloads.
