A flash sale goes live on a popular booking platform. Fifty dollars off the Friday evening flight from New York to London. Within seconds, a thousand Lambdas wake up and each one runs:
UPDATE flight_inventory
SET seats_available = seats_available - 1
WHERE flight_id = 'NYC-LON-FRI' AND fare_class = 'Y';
Your p99 latency jumps from 50ms to 30 seconds. No code changed. No deploy. CPU is fine. Connections look healthy-ish. What happened?
The answer is locks. If your mental model of database locking is fuzzy, serverless will find every gap and turn it into a pager alert. This post is the foundation: pessimistic vs optimistic, what isolation levels actually lock, and why short-lived functions amplify the pain.
We'll use Postgres and TypeScript throughout, and the running example is a booking platform — flights, seats, fare classes — because most of the interesting concurrency problems show up there in their natural habitat.
Why locks exist
When two transactions touch the same row at the same time, something has to give. Without coordination, one overwrites the other and you get corruption: a seat sold twice, a hold counter that ticks once instead of twice, a seats_available count that goes negative. The database prevents this by serializing conflicting access with locks.
A lock is a temporary, exclusive (or shared) claim on a piece of data, held by a transaction until it commits or rolls back. That's it. The complexity comes from which data, how exclusive, and for how long.
Throughout this post, assume a schema like:
CREATE TABLE flight_inventory (
  flight_id       TEXT NOT NULL,
  fare_class      TEXT NOT NULL,
  seats_available INTEGER NOT NULL CHECK (seats_available >= 0),
  version         INTEGER NOT NULL DEFAULT 0,
  PRIMARY KEY (flight_id, fare_class)
);

CREATE TABLE bookings (
  id           UUID PRIMARY KEY,
  flight_id    TEXT NOT NULL,
  fare_class   TEXT NOT NULL,
  passenger_id TEXT NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
Pessimistic locking: claim it before you touch it
Pessimistic locking assumes contention will happen and grabs a lock up front.
import { Pool } from 'pg';

const pool = new Pool();

async function bookSeatPessimistic(
  flightId: string,
  fareClass: string,
  passengerId: string,
) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // Grab an exclusive row lock on the inventory row. Concurrent bookers
    // for the same (flight, fare class) block here until we commit or roll back.
    const { rows } = await client.query(
      `SELECT seats_available
       FROM flight_inventory
       WHERE flight_id = $1 AND fare_class = $2
       FOR UPDATE`,
      [flightId, fareClass],
    );
    if (rows.length === 0) throw new Error('Flight not found');
    if (rows[0].seats_available < 1) throw new Error('Sold out');

    // Safe to decrement: we hold the lock, nobody else can race us here.
    await client.query(
      `UPDATE flight_inventory
       SET seats_available = seats_available - 1
       WHERE flight_id = $1 AND fare_class = $2`,
      [flightId, fareClass],
    );
    await client.query(
      `INSERT INTO bookings (id, flight_id, fare_class, passenger_id)
       VALUES (gen_random_uuid(), $1, $2, $3)`,
      [flightId, fareClass, passengerId],
    );

    await client.query('COMMIT');
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }
}
SELECT ... FOR UPDATE grabs a row-level exclusive lock. Any other transaction that tries to update or FOR UPDATE the same inventory row blocks until you commit. There are softer variants:
- FOR SHARE — read lock; multiple readers allowed, writers blocked.
- FOR UPDATE NOWAIT — fail immediately instead of waiting (sketched after this list).
- FOR UPDATE SKIP LOCKED — pretend locked rows don't exist (the foundation of a Postgres job queue, covered in a later post — handy for ticket issuance workers).
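NOWAIT pairs especially well with short-lived functions, where failing fast beats spending the whole invocation in a lock queue. Here is a minimal sketch under the same pool and schema as above; the lockInventoryRowOrBail name and the error message are illustrative. Postgres reports the failed lock attempt as SQLSTATE 55P03 (lock_not_available).
// Sketch: fail fast instead of queueing behind the current lock holder.
async function lockInventoryRowOrBail(flightId: string, fareClass: string) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query(
      `SELECT seats_available
       FROM flight_inventory
       WHERE flight_id = $1 AND fare_class = $2
       FOR UPDATE NOWAIT`,
      [flightId, fareClass],
    );
    // ... decrement, insert the booking, COMMIT as in bookSeatPessimistic ...
    await client.query('COMMIT');
  } catch (e: any) {
    await client.query('ROLLBACK');
    if (e.code === '55P03') {
      // lock_not_available: someone else holds the row right now.
      throw new Error('Seat temporarily unavailable, try again');
    }
    throw e;
  } finally {
    client.release();
  }
}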
Cost of pessimistic locking: while you hold the lock, every other booking attempt for that flight queues. If your function takes 200ms (because it also calls the payment provider, generates a PNR, and emits an event) and 500 invocations want the same flight, the last one waits 100 seconds. Deadlocks are also possible if two transactions grab locks in different orders — Postgres detects them and aborts one, but the retry is on you.
For booking platforms, this gets worse fast: the row people fight over (popular flight, popular fare class, popular date) is also the row most likely to have a viral moment.
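Back to the deadlock caveat for a moment. The standard defense is to acquire locks in a consistent order whenever one transaction needs more than one row, for example the outbound and return legs of a round trip. A sketch, assuming hypothetical outboundId and returnId variables and the client from the pessimistic example:
// Lock both inventory rows in one statement, in a fixed sort order, so two
// concurrent transactions can't grab them in opposite orders and deadlock.
await client.query(
  `SELECT flight_id, fare_class, seats_available
   FROM flight_inventory
   WHERE (flight_id, fare_class) IN (($1, $2), ($3, $4))
   ORDER BY flight_id, fare_class
   FOR UPDATE`,
  [outboundId, 'Y', returnId, 'Y'],
);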
Optimistic locking: assume no contention, verify on write
Optimistic locking skips the lock entirely. You read the row, do your work, and on write you check that nothing changed underneath you. The version column from the schema above does the bookkeeping.
async function bookSeatOptimistic(
  flightId: string,
  fareClass: string,
  passengerId: string,
) {
  for (let attempt = 0; attempt < 3; attempt++) {
    const client = await pool.connect();
    try {
      await client.query('BEGIN');
      const { rows } = await client.query(
        `SELECT seats_available, version
         FROM flight_inventory
         WHERE flight_id = $1 AND fare_class = $2`,
        [flightId, fareClass],
      );
      if (rows.length === 0) throw new Error('Flight not found');
      const { seats_available, version } = rows[0];
      if (seats_available < 1) throw new Error('Sold out');

      // Compare-and-swap: if anyone else committed since our read,
      // the version no longer matches and the UPDATE touches zero rows.
      const result = await client.query(
        `UPDATE flight_inventory
         SET seats_available = seats_available - 1, version = version + 1
         WHERE flight_id = $1 AND fare_class = $2 AND version = $3`,
        [flightId, fareClass, version],
      );
      if (result.rowCount === 0) {
        await client.query('ROLLBACK');
        continue; // someone else booked first; retry
      }

      await client.query(
        `INSERT INTO bookings (id, flight_id, fare_class, passenger_id)
         VALUES (gen_random_uuid(), $1, $2, $3)`,
        [flightId, fareClass, passengerId],
      );
      await client.query('COMMIT');
      return;
    } catch (e) {
      // Never hand a client back to the pool with an open transaction.
      await client.query('ROLLBACK');
      throw e;
    } finally {
      client.release();
    }
  }
  throw new Error('Too much contention, gave up after 3 retries');
}
If a concurrent transaction bumped the version between your read and your write, the UPDATE matches zero rows. You roll back and retry.
Postgres also exposes a built-in version via the xmin system column (the transaction id that last wrote the row). You can use it instead of an explicit version column, with the caveat that xmin changes on every update including unrelated columns.
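A minimal sketch of the xmin flavor, assuming the same query shape as above; the only new wrinkle is casting the value back to the xid type when you compare it:
// Read the current row version straight from the xmin system column...
const { rows } = await client.query(
  `SELECT seats_available, xmin
   FROM flight_inventory
   WHERE flight_id = $1 AND fare_class = $2`,
  [flightId, fareClass],
);

// ...and use it as the compare-and-swap token. Remember that xmin changes on
// every update to the row, including updates to unrelated columns.
const result = await client.query(
  `UPDATE flight_inventory
   SET seats_available = seats_available - 1
   WHERE flight_id = $1 AND fare_class = $2 AND xmin = $3::xid`,
  [flightId, fareClass, rows[0].xmin],
);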
Cost of optimistic locking: under heavy contention, retries pile up. If 100 booking attempts all want the last seat in the same fare class, 99 of them lose the race, retry, lose again, and so on. It works beautifully when contention is rare and terribly when it isn't. For most flights on most days, it's the right default. For the flash-sale flight, it isn't — and we'll address that with sharding and holds in later posts.
Isolation levels: what your transaction actually sees
Isolation level controls which anomalies a transaction can observe. In Postgres:
| Level | Dirty read | Non-repeatable read | Phantom read | Write skew |
|---|---|---|---|---|
| Read Committed (default) | No | Possible | Possible | Possible |
| Repeatable Read | No | No | No (snapshot) | Possible |
| Serializable | No | No | No | No |
A few things to know:
- Read Committed locks a row only when you write it. Reads see the latest committed value at the moment of each statement. This is what you get by default, and what most apps run on.
- Repeatable Read gives your transaction a consistent snapshot at the moment it began. Reads inside the transaction return the same data even if others commit. Postgres uses MVCC for this, no extra read locks needed.
- Serializable uses predicate locking (SSI — serializable snapshot isolation) to detect any pattern of reads and writes that couldn't have happened in some serial order, and aborts one of the offenders. It's the only level that prevents write skew without manual FOR UPDATE.
Write skew is worth pausing on, because booking platforms hit it often. Suppose policy says "always keep at least one seat in fare class Y reserved for elite upgrades." Two concurrent booking transactions each read seats_available = 2, each conclude "fine, plenty of buffer," and each book a seat. End state: seats_available = 0, constraint violated, no single transaction did anything wrong. Read Committed and Repeatable Read both allow this. Serializable does not.
You set the level per transaction:
await client.query('BEGIN ISOLATION LEVEL SERIALIZABLE');
Serializable is correct-by-default but produces serialization failures (SQLSTATE 40001) under contention, which you must retry. Most production booking systems run Read Committed and add explicit locks (or unique-index-backed constraints) where needed.
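If you do opt into Serializable, wrap the transaction in a retry loop keyed on that error code. A minimal sketch, assuming the same pool as above; the withSerializableRetry name and the attempt cap are illustrative:
import { PoolClient } from 'pg';

// Retry a serializable transaction on serialization_failure (SQLSTATE 40001),
// which can surface on any statement or at COMMIT.
async function withSerializableRetry<T>(
  work: (client: PoolClient) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    const client = await pool.connect();
    try {
      await client.query('BEGIN ISOLATION LEVEL SERIALIZABLE');
      const result = await work(client);
      await client.query('COMMIT');
      return result;
    } catch (e: any) {
      await client.query('ROLLBACK');
      if (e.code === '40001' && attempt < maxAttempts) continue;
      throw e;
    } finally {
      client.release();
    }
  }
}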
Why serverless makes all of this harder
Now the serverless angle. The locking mechanics above are the same on a long-lived monolith. What changes is the runtime around them.
Many short transactions instead of a few long ones. A monolith might handle 100 booking requests on one connection over a minute. Serverless spins up 100 concurrent functions, each with its own connection and its own micro-transactions. More acquire/release churn, more contention surface area.
Connection counts explode. Lambda concurrency of 1000 means up to 1000 Postgres connections unless you put a pooler in front. And a connection left idle in transaction keeps holding every lock that open transaction grabbed until someone commits, rolls back, or kills the session. (We'll fix that in post 2 of this series.)
Retries are automatic and unsafe by default. API Gateway retries, Step Functions retries, EventBridge retries. If your function charged the card, then timed out before writing the booking row, the retry can charge again unless you've thought carefully about idempotency. (Post 3. Coming soon)
Cold starts inside transactions. A Lambda that begins a transaction, calls the payment provider for 5 seconds, then commits, can hit its own timeout while holding row locks on the popular flight. The connection sits orphaned in idle in transaction state until Postgres reaps it, and during that window every other booking for that flight blocks behind it.
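Post 2 covers the full fix, but two Postgres settings are worth knowing now as guardrails against exactly this orphaned-lock window; the values below are illustrative, not recommendations.
// Abort any single statement that runs longer than 5 seconds.
await client.query(`SET statement_timeout = '5s'`);
// Terminate sessions that sit in an open transaction doing nothing for 10+ seconds.
await client.query(`SET idle_in_transaction_session_timeout = '10s'`);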
No shared memory. You can't reach for a process-local mutex or an in-memory cache to deduplicate work across invocations. The database (or Redis) is the only coordination point.
The thread running through the rest of this series is simple: serverless hates long locks. Anything that holds a lock across a network call (payment, fraud check, PNR generation), a cold start, or a retry boundary is a future incident.
Decision rule
For most CRUD on most tables, do nothing — Read Committed plus normal UPDATE statements handle it. When you need to coordinate concurrent writes to the same row (booking the same seat, decrementing the same counter), follow this:
- Default to optimistic locking with a version column and a small retry loop. It scales well, holds no locks while you read and decide, and degrades visibly (you see retries climb in logs) before it fails.
- Switch to pessimistic (SELECT ... FOR UPDATE) when you've measured high contention and retry cost is worse than queue cost — typically when the work inside the transaction is expensive or has external side effects you don't want to redo.
- Use SKIP LOCKED when you want queue semantics: many workers, each grabbing a different row. Ticket issuance and email-confirmation workers are textbook fits.
- Reach for Serializable only when write skew is a real risk (e.g., the "keep one seat for upgrades" policy) and you can't express the constraint as a unique index or a CHECK.
Back to the opening puzzle
A thousand Lambdas all run UPDATE flight_inventory SET seats_available = seats_available - 1 WHERE flight_id = 'NYC-LON-FRI' AND fare_class = 'Y'. Even though no one wrote FOR UPDATE, the UPDATE itself takes a row-level exclusive lock for the duration of the transaction. Concurrent updaters queue. With a thousand of them, the queue depth is a thousand, and your p99 is the time it takes for 999 transactions to finish before yours.
Two fixes, both covered later in the series:
- Make the work idempotent and use optimistic retries, so contention surfaces as fast retries instead of long waits.
- Shard the hot row — keep N inventory rows per (flight, fare class) and pick one at random — so contention spreads across N queues. The booking layer reassembles the total before showing availability. There's a sketch of the shape just below.
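To make the sharding idea concrete without getting ahead of the later post, a sketch; the shard_id column (added to flight_inventory's primary key) and the shard count are illustrative assumptions:
// Hot-row sharding sketch: N inventory rows per (flight, fare class), each
// holding a slice of the seats. A booking decrements one shard at random.
const SHARDS = 8; // illustrative
const shard = Math.floor(Math.random() * SHARDS);

const result = await client.query(
  `UPDATE flight_inventory
   SET seats_available = seats_available - 1
   WHERE flight_id = $1 AND fare_class = $2 AND shard_id = $3
     AND seats_available > 0`,
  [flightId, fareClass, shard],
);
// rowCount === 0 means this shard is empty; try another shard or two before
// declaring the fare class sold out.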
In post 2, we'll get specific about the connection-storm-meets-orphan-lock failure mode (the payment-provider-hang case above) and how to defuse it with timeouts and a pooler.