Jayesh Shinde

The Silent Connection Killer: MySQL2 and AWS Lambda's Freeze/Thaw Problem

The Mystery Error

You're running a Node.js Lambda with MySQL2, everything works great in testing, but production logs show intermittent failures:

Error: Connection lost: The server closed the connection
  code: "PROTOCOL_CONNECTION_LOST"
  fatal: true

No pattern. No warning. Just random failures that make you question your life choices.

Understanding the Real Problem: How Lambda Actually Works

Lambda doesn't spin up a fresh container for every request. AWS keeps containers "warm" for reuse:

Request 1 → Lambda runs → Response
                ↓
           [FREEZE] ← Container paused (not terminated)
                ↓     
         (5-15 min pass)
                ↓
           [THAW] ← Container resumed
                ↓
Request 2 → Lambda runs → 💥 PROTOCOL_CONNECTION_LOST


During freeze, your Lambda is literally paused. The JavaScript event loop stops. Timers stop. Everything stops.
But here's the catch: the outside world doesn't stop.

What Happens to Your Database Connection

  1. Lambda creates a MySQL connection pool
  2. Connections sit idle in the pool
  3. Lambda freezes (container paused)
  4. Real world time passes (5-30 minutes)
  5. Network timeouts occur, NAT gateways clear state, RDS Proxy cleans up
  6. The TCP socket dies, but your pool doesn't know
  7. Lambda thaws, tries to use the dead connection
  8. 💥 PROTOCOL_CONNECTION_LOST

Why idleTimeout Doesn't Help

You might think: "I'll set idleTimeout: 60000 to clean up idle connections!"
Here's why it doesn't work:

Timer starts (60s countdown)
    ↓
Lambda FREEZES at 1s elapsed
    ↓
████████████████████████████████
█  15 minutes pass in REAL WORLD  █
█  Timer is PAUSED at 1s          █
████████████████████████████████
    ↓
Lambda THAWS - timer resumes at 1s
    ↓
Connection still in pool (timer thinks 2s passed)
    ↓
Connection is DEAD but pool doesn't know


The timer doesn't run during freeze. Your 60-second timeout is useless against a 15-minute freeze.

The Solution: Detect and Retry
Since we can't prevent stale connections, we detect them and retry transparently.

Step 1: Enable TCP Keep-Alive

const pool = mysql.createPool({
  ...config,
  enableKeepAlive: true,
  keepAliveInitialDelay: 10000,
});

This helps get clear error codes (ECONNRESET, PROTOCOL_CONNECTION_LOST) instead of hanging indefinitely.

Step 2: Implement Retry Logic

// Method on a database helper class; getConnectionFromPool() and
// executeQueryWithConnection() are this class's thin wrappers around
// the pool's promise API.
async executeQuery(sql, params) {
  const maxRetries = 1;
  let lastError = null;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    let connection = null;

    try {
      connection = await this.getConnectionFromPool();
      const result = await this.executeQueryWithConnection(connection, sql, params);
      connection.release();
      return result;
    } catch (error) {
      // Non-recoverable error - throw immediately
      if (!this.isConnectionLostError(error)) {
        if (connection) connection.release();
        throw error;
      }

      // Connection lost - destroy stale connection
      if (connection) connection.destroy();

      // Retry if attempts left
      lastError = error;
      if (attempt < maxRetries) {
        console.warn("Connection lost, retrying...", { attempt: attempt + 1 });
        continue;
      }
    }
  }

  throw lastError;
}

isConnectionLostError(error) {
  const recoverableCodes = [
    "PROTOCOL_CONNECTION_LOST",  // Server closed connection
    "ECONNRESET",                // TCP reset
    "EPIPE",                     // Broken pipe
    "ETIMEDOUT",                 // Connection timeout
    "ECONNREFUSED",              // Connection refused
  ];
  return recoverableCodes.includes(error?.code);
}

Key Points:

  • connection.destroy() removes the stale connection from the pool (don't reuse it!)
  • connection.release() returns a healthy connection to the pool
  • One retry is usually enough: the second attempt gets a fresh connection
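
To see the retry path in isolation, here's a self-contained mock of the flow above. No real MySQL is involved: `FlakyPool`, `makeConn`, and the shared error object are invented for illustration, and the pool's FIFO behavior mirrors how mysql2 hands out idle connections.

```javascript
// Mock demo of the retry flow: the first connection is stale, the retry
// gets a healthy one. FlakyPool and makeConn are illustrative fakes.
const STALE = Object.assign(new Error("Connection lost"), {
  code: "PROTOCOL_CONNECTION_LOST",
});

function makeConn(stale) {
  return {
    stale,
    destroyed: false,
    async query() {
      if (this.stale) throw STALE;
      return [{ ok: 1 }];
    },
    release() {},
    destroy() { this.destroyed = true; },
  };
}

class FlakyPool {
  constructor(conns) { this.conns = conns; }
  async getConnection() { return this.conns.shift(); } // FIFO, like mysql2
}

// Same shape as the executeQuery above, with maxRetries = 1
async function executeQueryDemo(pool, sql) {
  let lastError = null;
  for (let attempt = 0; attempt <= 1; attempt++) {
    const connection = await pool.getConnection();
    try {
      const result = await connection.query(sql);
      connection.release();
      return result;
    } catch (error) {
      if (error.code !== "PROTOCOL_CONNECTION_LOST") {
        connection.release();
        throw error;
      }
      connection.destroy(); // stale: never return it to the pool
      lastError = error;
    }
  }
  throw lastError;
}

// One stale connection followed by a healthy one: the retry succeeds.
const demoPool = new FlakyPool([makeConn(true), makeConn(false)]);
executeQueryDemo(demoPool, "SELECT 1").then((rows) =>
  console.log(rows[0].ok) // 1
);
```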

What About Pool Settings?

Do I Need to Tune connectionLimit, maxIdle, etc.?
Short answer: Not really.

| Setting | Helps with freeze/thaw? | Why |
| --- | --- | --- |
| `idleTimeout` | No | Timer is paused during freeze |
| `maxIdle` | Marginally | Fewer connections means fewer stale ones, but adds reconnection overhead |
| `connectionLimit` | No | Doesn't affect stale connections |
| Retry logic | Yes | Handles stale connections at runtime |

If you're using RDS Proxy, it handles connection pooling at the infrastructure level. Keep your Lambda pool settings simple and let the retry logic do the heavy lifting.
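
Concretely, a simple Lambda-friendly setup might look like the sketch below. The option values are illustrative examples, not tuning advice, and the environment variable names are assumptions:

```javascript
// Illustrative options for mysql2's createPool() in a Lambda handler.
// Values are examples; env var names are placeholders.
const poolConfig = {
  host: process.env.DB_HOST,       // point at RDS Proxy if you use one
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  connectionLimit: 2,              // one request at a time per Lambda container
  enableKeepAlive: true,           // surfaces dead sockets as clear errors
  keepAliveInitialDelay: 10000,
  // idleTimeout deliberately left at its default: timers pause during
  // freeze, so tuning it buys nothing here.
};
// const pool = require("mysql2/promise").createPool(poolConfig);
```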

Using RDS Proxy?

RDS Proxy's "Idle client connection timeout" (default: 30 minutes) is separate from MySQL's wait_timeout. The proxy manages Lambda→Proxy connections independently.
But even with RDS Proxy, connections can die due to:

  • NAT gateway timeouts (typically 5-15 minutes for idle TCP)
  • Network state table cleanup
  • Proxy internal connection recycling

The retry logic is still your safety net.

The Final Architecture

┌─────────────────────────────────────────────────────┐
│                    Your Lambda                       │
├─────────────────────────────────────────────────────┤
│                                                      │
│   executeQuery()                                     │
│       ↓                                             │
│   for (attempt = 0; attempt <= 1; attempt++)        │
│       ↓                                             │
│   getConnection() → Try query                       │
│       ↓                                             │
│   Success? → return result                          │
│       ↓                                             │
│   Connection lost? → destroy() → retry              │
│       ↓                                             │
│   Other error? → release() → throw                  │
│                                                      │
└─────────────────────────────────────────────────────┘
           ↓
    [RDS Proxy] (optional)
           ↓
    [MySQL/Aurora]

Summary

| Problem | Solution |
| --- | --- |
| Lambda freezes, connections go stale | Retry logic detects and recovers |
| Pool doesn't know connections are dead | `enableKeepAlive` for faster detection |
| `idleTimeout` doesn't work during freeze | Accept it; rely on retry instead |
| Random `PROTOCOL_CONNECTION_LOST` | Transparent retry, so users don't notice |

The key insight: You can't prevent stale connections in a serverless environment. But you can detect them instantly and retry transparently.

BUT....


The Problem With Just Retrying Once

The first part of this post suggested detecting a stale connection error and retrying once. That works, but only if a single connection went stale. After a longer Lambda freeze (10–15+ minutes), the entire pool goes stale. Here's the scenario:

Pool has 5 connections → Lambda freezes → all 5 TCP sockets die

Request comes in after thaw:
→ gets conn #1 from _freeConnections → FAILS (stale)
→ retry: gets conn #2 from _freeConnections → FAILS (also stale!)

One retry is not enough because _freeConnections is a queue — the retry just picks the next dead connection in line.
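
A toy simulation makes the failure mode obvious. Plain objects stand in for connections, and `getConnection` shifts from the front of the queue exactly the way mysql2 does:

```javascript
// Simulate a fully stale pool: every connection in the FIFO queue is dead.
const queueOfStale = [
  { id: 1, stale: true },
  { id: 2, stale: true },
  { id: 3, stale: true },
];

function getNext() {
  return queueOfStale.shift(); // FIFO, no health check, like mysql2
}

function tryQuery(conn) {
  if (conn.stale) throw new Error("PROTOCOL_CONNECTION_LOST");
  return "ok";
}

let queryResult = null;
for (let attempt = 0; attempt <= 1; attempt++) { // single retry
  const conn = getNext();
  try {
    queryResult = tryQuery(conn);
    break;
  } catch (e) {
    // destroy and retry, but the next connection in line is just as dead
  }
}

console.log(queryResult); // null: both attempts hit stale connections
```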


How mysql2's Pool Actually Works Internally

Looking at mysql2's pool.js source, getConnection() does this:

// Simplified from mysql2 internals
getConnection(cb) {
  if (this._freeConnections.length > 0) {
    const connection = this._freeConnections.shift(); // FIFO — no health check
    return cb(null, connection);
  }
  // if allConnections.length < connectionLimit → create a NEW connection
  // otherwise → queue the request in _connectionQueue
}

There is zero validation when pulling from _freeConnections. The pool hands you whatever is sitting there, stale or not.

The inverse is also useful to know — when _freeConnections is empty AND _allConnections.length < connectionLimit, mysql2 will automatically create a brand new TCP connection. This is the behavior we want to exploit.


The Fix: Drain All Free Connections on a Stale Error

Instead of retrying once and hoping the next connection is healthy, destroy every connection in _freeConnections the moment you detect a stale error. The pool's own logic then forces a fresh connection on retry.

const STALE_ERRORS = new Set([
  'PROTOCOL_CONNECTION_LOST',
  'ECONNRESET',
  'EPIPE',
  'ETIMEDOUT',
  'ECONNREFUSED',
]);

function drainFreeConnections(pool) {
  // Destroy all idle connections in one sweep
  for (const conn of pool._freeConnections) {
    conn.destroy(); // sets conn._pool = null, removes from _allConnections
  }
  pool._freeConnections.length = 0; // clear the array in-place
  // pool._allConnections.length is now reduced → next getConnection()
  // sees: freeConnections empty + allConnections < connectionLimit
  // → creates a fresh TCP connection automatically
}

// Assumes a promise pool, e.g. require('mysql2/promise').createPool(...)
async function executeQuery(pool, sql, params) {
  let connection;
  try {
    connection = await pool.getConnection();
    const [rows] = await connection.query(sql, params);
    connection.release();
    return rows;
  } catch (err) {
    if (connection) connection.destroy(); // kill the one that triggered the error

    if (STALE_ERRORS.has(err.code)) {
      // Nuke all remaining free connections — they're all suspect
      drainFreeConnections(pool);

      // Retry — pool is now forced to open a fresh connection
      const freshConn = await pool.getConnection();
      try {
        const [rows] = await freshConn.query(sql, params);
        freshConn.release();
        return rows;
      } catch (retryErr) {
        freshConn.destroy();
        throw retryErr;
      }
    }

    throw err;
  }
}

Why pool._freeConnections.length = 0 instead of a while loop with shift()?

pool._freeConnections is an array used as a FIFO queue — mysql2 uses shift() when getting connections and push() when releasing them. Since we're destroying all of them, iterating with for...of then zeroing the length is simpler and safer than mutating the array mid-iteration with shift().
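
The in-place drain is easy to verify against a mock. The pool object below is invented for illustration (only the `_freeConnections` shape matters): `destroy()` is called on every idle connection, and the array is emptied without being replaced, so any internal reference mysql2 holds to it stays valid.

```javascript
// Mock pool demonstrating the drain: every idle connection is destroyed
// and the queue is cleared in place (same array reference afterwards).
function makeMockConn() {
  return { destroyed: false, destroy() { this.destroyed = true; } };
}

const mockPool = {
  _freeConnections: [makeMockConn(), makeMockConn(), makeMockConn()],
};

function drainFreeConnectionsDemo(pool) {
  for (const conn of pool._freeConnections) {
    conn.destroy();
  }
  pool._freeConnections.length = 0; // clear in place, keep the same array
}

const conns = [...mockPool._freeConnections]; // snapshot before draining
const sameArray = mockPool._freeConnections;
drainFreeConnectionsDemo(mockPool);

console.log(mockPool._freeConnections.length);      // 0
console.log(conns.every((c) => c.destroyed));       // true
console.log(sameArray === mockPool._freeConnections); // true: emptied, not replaced
```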

What conn.destroy() does under the hood

When you call destroy(), mysql2 does:

// Simplified from connection._removeFromPool()
this._allConnections.splice(this._allConnections.indexOf(connection), 1);
connection._pool = null;
// TCP socket is closed

So after drainFreeConnections:

  • _freeConnections → empty ✅
  • _allConnections.length → drops back toward 0 ✅
  • Next getConnection() → pool sees room under connectionLimit → creates fresh TCP connection ✅

Comparison: Single Retry vs. Drain All

| Approach | After 1 stale conn | After full pool stale |
| --- | --- | --- |
| Single retry (first half of this post) | ✅ Works | ❌ Retry hits another stale conn |
| Drain all + retry (this approach) | ✅ Works | ✅ Pool forced to create fresh conn |

A Note on _freeConnections Being a Private API

_freeConnections is not exported in mysql2's TypeScript typings (it's missing from Pool.d.ts). In TypeScript, you'll need a cast:

(pool as any)._freeConnections

It has been stable and present since the beginning of the library. But since it's not officially part of the public API, it's worth keeping an eye on across major version upgrades.


Why Prisma Doesn't Have This Problem

If you're using Prisma, you may have noticed it doesn't suffer from the freeze/thaw stale connection issue as badly. There's a concrete reason for this — it's not magic, it's the Rust query engine.

Prisma uses a connection pool built on the mobc library inside its Rust engine. Before handing a connection to your query, it performs a time-gated pre-ping:

Connection pulled from pool
         ↓
Has more than 15 seconds passed since this connection was last used?
    YES → run SELECT 1
              ↓
         SELECT 1 succeeds? → proceed with query
         SELECT 1 fails?    → discard, open fresh connection
    NO  → skip ping, proceed directly (optimization for rapid queries)

This is essentially the same pattern as SQLAlchemy's pool_pre_ping=True, with a 15-second grace window to avoid pinging on every rapid-fire query.

After a Lambda freeze of any meaningful duration (seconds to minutes), the timer has expired, so Prisma will ping before your query even runs — and silently replace any dead connection. Your application code never sees the error.
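
You can emulate this pattern yourself on top of any pool. The sketch below is a simplified illustration, not Prisma's actual implementation: the ping function and clock are injected so the logic runs without a database, and `createPrePing` is a name invented here. On a real mysql2 promise pool the ping would be something like `conn.query("SELECT 1")`.

```javascript
// Time-gated pre-ping in the spirit of Prisma's pool / SQLAlchemy's
// pool_pre_ping. Illustrative sketch: ping and clock are injectable.
const PING_THRESHOLD_MS = 15000;

function createPrePing({ ping, now = Date.now }) {
  const lastUsed = new WeakMap(); // per-connection "last used" timestamp

  return async function checkout(getConnection) {
    let conn = await getConnection();
    const idleFor = now() - (lastUsed.get(conn) ?? 0);

    if (idleFor > PING_THRESHOLD_MS) {
      try {
        await ping(conn); // e.g. conn.query("SELECT 1") on a real pool
      } catch {
        conn.destroy?.();             // dead: discard it
        conn = await getConnection(); // and fetch a fresh one
      }
    }

    lastUsed.set(conn, now());
    return conn;
  };
}
```

With a fake clock set past the 15-second window, a dead connection fails its ping and is silently replaced before the caller ever sees it, which is exactly the behavior described above.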

How it stacks up

| | mysql2 (raw pool) | Prisma (Rust engine) |
| --- | --- | --- |
| Pre-ping on checkout | ❌ None | ✅ If idle >15s |
| Handles full pool going stale | ❌ Needs manual drain logic | ✅ Each connection validated individually |
| Error surfaces to app code | ✅ Yes, you must handle it | ❌ Transparent, retried internally |
| Overhead | None (no extra queries) | One `SELECT 1` per connection after idle period |

Even Prisma Isn't Bulletproof

Worth noting: Prisma's pre-ping protects against stale connections, but the 15-second threshold means a freeze shorter than 15 seconds could still theoretically slip through. And connection-level issues outside the pool (e.g. NAT gateway state tables, RDS Proxy recycling) can still cause failures that the pre-ping doesn't catch. Retry logic at the application layer remains a good safety net regardless of ORM.


Summary

The key insight is to work with mysql2's internal pool logic rather than against it:

  1. On a stale connection error, don't just retry — drain _freeConnections first
  2. mysql2 will automatically open fresh connections to fill the gap (it's built into getConnection())
  3. Your retry then gets a genuinely new TCP connection instead of another dead one from the queue

If you want this behavior without managing it yourself, Prisma's Rust engine gives you a time-gated pre-ping out of the box — which is the more principled long-term solution for serverless MySQL workloads.
