Series recap. Post 1 — Locking 101 for Serverless Devs covered the mental model: pessimistic vs optimistic locking, what each isolation level actually locks, and why short-lived functions amplify every locking pain point. Worth reading first — this post leans on it.
In post 1, a thousand Lambdas queued on a single inventory row. That's the obvious failure mode — the kind you can almost see coming if you squint. Post 2 is about the one that actually wakes you up, because nothing in your dashboard says the word "lock":
- Lambda invocations: spiking, then flat-lining at the concurrency ceiling.
- Database CPU: 8%.
- Database connections: pinned at max.
- p99: timing out.
- Error rate: 100% on bookings, partial everywhere else.
You restart the database. It works for ninety glorious seconds. Then it breaks again, the same way.
That's a connection storm meeting an orphan lock. The lock is the actual cause. The connection storm is what turns it into a public incident with a postmortem doc attached. Let's walk through how the two combine, and the small set of settings that make this whole class of outage go away.
The setup
A passenger taps "Book" on a popular flight. Your Lambda runs something like this:
async function bookSeat(flightId: string, fareClass: string, passengerId: string, paymentToken: string, fareAmount: number) {
  // pool and paymentProvider are module-level singletons (definitions omitted here);
  // fareAmount is resolved by the caller from the fare lookup.
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query(
      `UPDATE flight_inventory
         SET seats_available = seats_available - 1
       WHERE flight_id = $1 AND fare_class = $2 AND seats_available > 0`,
      [flightId, fareClass],
    );
    // Charge the card before we commit. Seems safe — if charge fails, we roll back.
    const charge = await paymentProvider.charge(paymentToken, fareAmount);
    await client.query(
      `INSERT INTO bookings (id, flight_id, fare_class, passenger_id, charge_id)
       VALUES (gen_random_uuid(), $1, $2, $3, $4)`,
      [flightId, fareClass, passengerId, charge.id],
    );
    await client.query('COMMIT');
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }
}
Stop here for a second and look at the shape of it. Between BEGIN and COMMIT there's a network call to a third party. The UPDATE already grabbed an exclusive row lock on the inventory row, and that lock is going to be held for as long as the payment call takes — every millisecond of it.
On a normal day, payment takes 300ms and nobody notices. On a bad day (the provider is degraded, there's packet loss, the customer fumbles a 3DS challenge) payment stalls for 25 seconds, or longer, or never comes back. Your Lambda's timeout is 30 seconds, so eventually an invocation hits its deadline, the runtime kills the container, and the connection back to Postgres is left holding the bag.
What "orphaned" actually means
When a Lambda gets killed mid-transaction, three things happen, in this order:
- The Node process is terminated. The pg client never gets a chance to send ROLLBACK. It was about to. It just didn't.
- The TCP connection eventually times out at the OS level. This can take minutes, depending on your network defaults. Postgres has no idea anything is wrong yet.
- In the meantime, Postgres still sees a perfectly valid connection sitting in idle in transaction state, holding every lock the transaction had acquired.
For a booking platform, that means the inventory row for the popular flight is locked. By a transaction that will never commit. On a connection that no longer has a process behind it.
Now multiply. Every other booking attempt for that flight runs UPDATE, hits the lock, and waits. Lambda concurrency climbs. The connection pool drains. New invocations can't even get a connection. Your platform's retry behavior makes it worse — API Gateway retries, EventBridge retries, the front-end retries — and you're now spending compute creating more doomed transactions to stack behind the dead one.
The funniest part (you won't think it's funny at the time) is that database CPU stays at 8% throughout. Nobody's doing any work. They're all waiting on a row lock held by a ghost.
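If you want to confirm it during the incident, Postgres will show you the ghost directly. A quick diagnostic against the standard pg_stat_activity view (nothing here is specific to the booking schema):

-- who is sitting idle inside an open transaction, and for how long
SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_age DESC;

-- and who is queued behind them
SELECT pid, pg_blocking_pids(pid) AS blocked_by, left(query, 60) AS waiting_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

The first result set is your orphan; the second is the queue forming behind it.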
You can't fix the upstream. You can fix the blast radius.
Payment providers occasionally get slow. Fraud APIs occasionally get slow. Customers occasionally walk into an elevator mid-checkout. None of that is going away. What you can do is make sure a slow upstream doesn't escalate from "some users see a hiccup" to "the whole booking flow is down."
There are four knobs. Turn all of them.
1. statement_timeout
Caps how long a single SQL statement can run. If your UPDATE ever takes longer than this, Postgres aborts it and returns an error. Easy.
const pool = new Pool({
// ...
statement_timeout: 5000, // 5 seconds — generous for OLTP
});
You can also scope it tighter for specific transactions:
await client.query(`SET LOCAL statement_timeout = '2s'`);
This catches runaway queries. It does not catch a transaction where each individual statement is fast but the transaction as a whole is slow because something in the middle is sleeping on a network call. Which is — annoyingly — exactly the case we just walked through. So:
2. idle_in_transaction_session_timeout
If you take one thing from this post, take this one.
It caps how long a connection is allowed to sit in idle in transaction state before Postgres forcibly aborts the transaction and closes the connection.
const pool = new Pool({
// ...
statement_timeout: 5000,
idle_in_transaction_session_timeout: 10000, // 10 seconds
});
With this set, a Lambda that dies mid-payment causes Postgres to release the inventory row lock 10 seconds later. No human in the loop. No database restart. Without it, the lock can be held for the TCP timeout duration, which on default AWS networking you can measure in minutes — long enough for the entire booking flow to stay broken until somebody notices.
This is the single most important Postgres setting for serverless workloads, and it ships disabled by default. Fix that.
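Setting it per connection works, but you can also make it a database-wide default so every client inherits it, including the ones you forgot about. The database name below is a placeholder for yours:

-- run once; every new connection to this database picks it up automatically
ALTER DATABASE bookings SET idle_in_transaction_session_timeout = '10s';

Per-connection or per-transaction settings can still override it where a specific job genuinely needs more room.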
3. lock_timeout
Caps how long a statement waits for a lock before giving up. Without it, a transaction that hits a held lock waits forever — well, until statement_timeout cuts it off. With it, you fail fast and surface the contention to the caller.
await client.query(`SET LOCAL lock_timeout = '500ms'`);
For a booking flow this is almost always what you want. If you can't get the inventory lock in 500ms, the user has already mentally moved on, and you'd rather return a clean retryable error than tie up a connection slot for four more seconds while they stare at a spinner.
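When lock_timeout fires, Postgres aborts the statement with SQLSTATE 55P03 (lock_not_available), and pg surfaces that as e.code, so you can translate contention into a clean retryable error instead of a generic 500. A sketch; decrementSeat and RetryableError are made-up names, and the caller's existing catch/ROLLBACK still applies:

import type { PoolClient } from 'pg';

// Hypothetical error type that your API layer maps to a retryable 409.
class RetryableError extends Error {}

async function decrementSeat(client: PoolClient, flightId: string, fareClass: string) {
  try {
    await client.query(`SET LOCAL lock_timeout = '500ms'`);
    await client.query(
      `UPDATE flight_inventory
         SET seats_available = seats_available - 1
       WHERE flight_id = $1 AND fare_class = $2 AND seats_available > 0`,
      [flightId, fareClass],
    );
  } catch (e: any) {
    if (e.code === '55P03') {
      // 55P03 = lock_not_available: lock_timeout expired before we got the row lock.
      throw new RetryableError('High demand on this flight, please retry');
    }
    throw e;
  }
}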
4. Concurrency caps at the runtime
The first three settings stop one bad transaction from holding a lock forever. They don't stop your fleet from creating bad transactions faster than the database can shed them.
That's what reserved concurrency is for. Configure your booking Lambda with reserved concurrency = 200, and at most 200 of them can be running at once. Invocations beyond that get throttled at the platform layer, before they ever touch a connection. The retry storm has a ceiling. Search, check-in, and status keep working because they're not sharing the booking budget.
The lever has different names depending on where you run — Vercel calls it concurrency limits, Cloudflare Workers has per-Worker concurrency — but the shape is identical. Cap blast radius at the runtime, not at the database.
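On AWS the cap is a single property on the function definition. A CDK sketch, meant to live inside a stack class; the construct id, handler, and asset path are made up, and reservedConcurrentExecutions is the part that matters:

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

const bookSeatFn = new lambda.Function(this, 'BookSeatFn', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'bookSeat.handler',
  code: lambda.Code.fromAsset('dist'),
  timeout: cdk.Duration.seconds(30),
  // Hard ceiling: at most 200 concurrent executions; everything above is throttled
  // at the platform layer before it can open a transaction.
  reservedConcurrentExecutions: 200,
});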
A short word on poolers
You can't run a thousand raw Postgres connections; every backend costs server memory and adds contention on shared internal state, so the congestion you're trying to fix gets worse. Postgres is happiest in the low hundreds.
So you put a pooler in front: RDS Proxy, PgBouncer, Supabase's pooler, Neon's, whichever fits. They multiplex many client connections onto fewer server connections. Two modes matter:
- Session pooling — a client connection is mapped to a server connection for its lifetime. Behaves like raw Postgres. Doesn't really help in serverless, because every concurrent Lambda container holds its own client connection, so you still need one server connection per container.
- Transaction pooling — a server connection is rented to a client only for the duration of a transaction, then returned to the pool. This is what you want. A thousand Lambdas can comfortably share fifty server connections as long as no single transaction monopolizes one.
Transaction pooling has sharp edges. They all share one root cause, so let's name it first, and then the four problems will basically write themselves.
The root cause. When you talk to Postgres directly, you have one connection and you keep it for the whole life of your client. Anything you set on it — settings, prepared queries, locks, subscriptions — stays put because you're the only one using it.
Transaction pooling breaks that assumption. Between each transaction, your "connection" is returned to the pool and can be handed to someone else. Your next transaction gets a different server connection. Anything you stored on the old connection is no longer reachable from where you're standing.
Every pitfall below is a variation on the same theme: a feature that quietly stores state on the server connection, used by code that doesn't realize the connection identity has changed underneath it.
1. Prepared statements
A prepared statement is Postgres caching a parsed and planned query under a name, so re-running it is faster:
PREPARE get_seat AS SELECT * FROM flight_inventory WHERE flight_id = $1;
EXECUTE get_seat('NYC-LON-FRI');
The cache lives on the server connection. Several client libraries use prepared statements under the hood for performance even though you never wrote PREPARE: postgres.js does it by default, and pg for Node does it whenever you pass a name with a query.
What breaks under transaction pooling: your client prepares get_seat on server connection A. Your next query lands on server connection B, which has never heard of get_seat. Postgres responds with prepared statement "get_seat" does not exist. Your code looks correct. The error is confusing.
Fix: tell your client not to use prepared statements on connections that go through the pooler. For postgres.js that's the prepare: false option; for pg, skip the name field on those queries. Classic PgBouncer in transaction mode requires this, though recent releases can track protocol-level prepared statements for you (max_prepared_statements), and managed poolers like RDS Proxy, Supabase's, and Neon's each document their own behavior. The performance hit from unprepared queries is real but small.
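If you happen to be on postgres.js rather than pg, the switch is a single connection option. A minimal sketch:

import postgres from 'postgres';

// prepare: false stops postgres.js from creating named prepared statements,
// so every query survives landing on a different server connection.
const sql = postgres(process.env.DATABASE_URL!, { prepare: false });

export const getSeat = (flightId: string) =>
  sql`SELECT * FROM flight_inventory WHERE flight_id = ${flightId}`;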
2. SET without LOCAL
You can change Postgres settings on the fly. There are two flavors:
SET timezone = 'UTC'; -- session-scoped: lasts until you disconnect
SET LOCAL timezone = 'UTC'; -- transaction-scoped: lasts until COMMIT/ROLLBACK
Without LOCAL, the change persists on the server connection forever. Under direct connections that's fine. Under transaction pooling, your transaction commits, the connection goes back to the pool with timezone = 'UTC' still applied, and the next tenant rents that connection and starts seeing UTC instead of whatever they expected. Their bug doesn't look like yours, and good luck tracing it.
This applies to everything settable: timezone, search_path, statement_timeout, role-related settings, all of it.
Fix: always use SET LOCAL inside a transaction. The setting unwinds automatically at commit and you can't poison the pool.
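One wrinkle worth knowing: SET LOCAL outside a transaction block is a no-op (Postgres issues a warning and discards it), so the pattern always travels with BEGIN/COMMIT. For example, with illustrative flight and fare values:

BEGIN;
SET LOCAL statement_timeout = '2s';  -- scoped to this transaction, gone at COMMIT
UPDATE flight_inventory
   SET seats_available = seats_available - 1
 WHERE flight_id = 'NYC-LON-FRI' AND fare_class = 'Y' AND seats_available > 0;
COMMIT;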
3. Session-level advisory locks
Postgres has application-level locks you can use as a lightweight mutex:
SELECT pg_advisory_lock(42); -- session-scoped: held until you unlock or disconnect
SELECT pg_advisory_xact_lock(42); -- transaction-scoped: held until COMMIT/ROLLBACK
The session-scoped one is tied to a server connection. Under transaction pooling, the same lifecycle problem appears, but the consequences are uglier:
- You acquire pg_advisory_lock(42) inside a transaction.
- The transaction commits. The connection goes back to the pool, still holding the advisory lock, because the lock outlives transactions.
- The next tenant rents that connection. They now hold "your" lock. They might release it (pg_advisory_unlock_all is a real footgun here). Or they might just block on their own work because something they didn't acquire is held.
- Meanwhile, the next time you try to acquire the same lock, you'll be on a different server connection that doesn't have it, so two of you might "hold" lock 42 at the same time. The whole point of the lock just evaporated.
Fix: use pg_advisory_xact_lock(key). It releases automatically when the transaction ends, no matter what happens to the connection afterwards. There's almost never a reason to use the session-scoped variant in serverless code.
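A sketch of what that looks like in application code. The key is any bigint you can derive consistently; hashing a string with hashtext() is one common trick, and the key string here is purely illustrative:

await client.query('BEGIN');
// Transaction-scoped mutex on this flight + fare class. Released automatically at
// COMMIT or ROLLBACK, even if the container dies and the idle-in-transaction
// timeout has to clean up after it.
await client.query(
  `SELECT pg_advisory_xact_lock(hashtext($1))`,
  [`flight:${flightId}:${fareClass}`],
);
// ... the contended work goes here ...
await client.query('COMMIT');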
4. LISTEN / NOTIFY
Postgres has a built-in pub/sub. Your client runs LISTEN bookings once, and any process that calls NOTIFY bookings, 'something happened' causes Postgres to push a message to your connection.
The subscription is bound to the server connection. Under transaction pooling, your connection identity changes constantly, so you might LISTEN on connection A, leave the transaction, get connection B next time, and have no idea where the notifications are going. They might be queuing on a connection some other tenant is using. They might be silently dropped. Worst of all, sometimes it works and sometimes it doesn't, depending on which connection you happen to draw.
Fix: don't run LISTEN / NOTIFY through a transaction pooler. If you need it, open a dedicated long-lived connection that bypasses the pooler (most managed Postgres services expose a separate endpoint for this), or use a real message broker — Redis, SQS, NATS — for pub/sub.
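If you do need LISTEN, the dedicated-connection approach looks like the sketch below. DIRECT_DATABASE_URL is a made-up name for whatever non-pooled endpoint your provider exposes:

import { Client } from 'pg';

// One long-lived client that never goes through the transaction pooler.
const listener = new Client({ connectionString: process.env.DIRECT_DATABASE_URL });

await listener.connect();
listener.on('notification', (msg) => {
  // msg.channel is 'bookings'; msg.payload is whatever the NOTIFY sent.
  console.log('booking event:', msg.payload);
});
await listener.query('LISTEN bookings');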
The shorter version of this whole section: in transaction pooling, the connection you talked to last time is not the connection you're talking to now. Anything that relied on "it's the same connection" needs a different design.
And one thing a pooler does not do is rescue you from orphan locks. If a Lambda hangs mid-transaction, the rented server connection sits in idle in transaction, the row lock is still held, and the queue still forms. The pooler just keeps you from running out of connection slots while it happens. The four settings above are still doing the real work.
Wiring it up
Connection setup for a serverless booking service, all the pieces in one place:
import { Pool } from 'pg';

export const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // points at RDS Proxy or PgBouncer
  max: 1, // one connection per Lambda container; the pooler does the multiplexing
  statement_timeout: 5000,
  idle_in_transaction_session_timeout: 10000,
  query_timeout: 5000, // client-side belt to the server-side suspenders
});

async function bookSeat(
  flightId: string,
  fareClass: string,
  passengerId: string,
  paymentToken: string,
  fareAmount: number, // resolved by the caller from the fare lookup
) {
  // 1. Charge the card BEFORE opening the transaction.
  //    The charge call now holds no DB resources.
  const charge = await paymentProvider.charge(paymentToken, fareAmount);

  // 2. Short, local-only transaction.
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query(`SET LOCAL lock_timeout = '500ms'`);
    const result = await client.query(
      `UPDATE flight_inventory
         SET seats_available = seats_available - 1
       WHERE flight_id = $1 AND fare_class = $2 AND seats_available > 0
       RETURNING seats_available`,
      [flightId, fareClass],
    );

    if (result.rowCount === 0) {
      // Roll back AND refund — we charged but couldn't book.
      await client.query('ROLLBACK');
      await paymentProvider.refund(charge.id);
      throw new Error('Sold out');
    }

    await client.query(
      `INSERT INTO bookings (id, flight_id, fare_class, passenger_id, charge_id)
       VALUES (gen_random_uuid(), $1, $2, $3, $4)`,
      [flightId, fareClass, passengerId, charge.id],
    );
    await client.query('COMMIT');
  } catch (e) {
    await client.query('ROLLBACK').catch(() => {});
    throw e;
  } finally {
    client.release();
  }
}
Better — but if you stare at it long enough you'll notice we've quietly traded one problem for another. If the Lambda dies between the charge and the booking insert, the customer is charged with no seat. The lock is gone, which is great, but the money's gone too, which is less great.
That's the exact gap idempotency keys close, and it's the whole subject of post 3. Don't fix it today. Today's win is smaller and more valuable: ship the four settings.
A checklist you can take to a code review
Walk through these against whatever serverless Postgres setup you have. If any answer is "no," you're one slow upstream away from a bad afternoon.
- Is idle_in_transaction_session_timeout set, on every connection or as a database-wide default?
- Is statement_timeout set, at a ceiling that matches your real OLTP workload?
- Are you using lock_timeout inside transactions that touch hot rows?
- Does your function platform cap concurrency on the booking path so retry storms can't compound?
Bonus round: are external calls (payment, fraud, PNR generation, anything that talks to the network) outside your transactions, not inside them?
What's next
Post 3 picks up exactly where this one stops: how to make a booking flow safe to retry without holding any locks at all, using idempotency keys. That's the move that turns "Lambda died mid-flow" from an incident into a no-op.