SELECT MAX()+1 Is a Race Condition Waiting to Happen

#postgres #sql #dotnet #concurrency

This is a story in three acts about one of the most innocent-looking patterns in backend code: generating the next number in a sequence. It passed every test. It ran fine for months. Then a unique-index violation showed up in production logs, and the fix I reached for first didn't actually fix it.

Act 1: The code that looks correct

The requirement was a per-customer progressive number — ORD-1, ORD-2, and so on. The implementation read like plain English:

var existing = await db.Orders
    .Where(o => o.CustomerId == customerId)
    .ToListAsync();

var next = existing
    .Select(o => ParseProgressive(o.Code))
    .DefaultIfEmpty(0)
    .Max() + 1;

db.Orders.Add(new Order { Code = $"ORD-{next}", CustomerId = customerId });
await db.SaveChangesAsync();

Read the current max, add one, insert. There is nothing wrong with this code if only one thing ever runs it at a time. The problem is that assumption is almost never true, and nothing in the code makes it true.

Act 2: Why it breaks, and the fix that doesn't

This is a TOCTOU bug — time-of-check to time-of-use. Two requests for the same customer arrive nearly together:

Request A reads max = 7. Computes next = 8.
Request B reads max = 7 before A has inserted. Also computes next = 8.
Both insert ORD-8. The unique index rejects the second. Production error.

My first instinct was a lock around the method — make the read-and-write atomic in the application. And in a single-process test, that "works." Here's the trap: the moment you run more than one instance of the service (and you will — that's what horizontal scaling is), an in-process lock protects nothing. Instance 1 and instance 2 each hold their own lock and happily compute the same number. The in-memory fix doesn't close the race; it just shrinks the window enough to pass your tests and lie to you.

The real lesson arrived here: uniqueness has to be enforced by the one thing both instances share — the database. Not by application code, no matter how clever the locking.

Act 3: The fixes that actually hold

There are several, and the right one depends on whether the number has to be gapless.

If gaps are acceptable (you just need unique and roughly increasing), use the database's own sequence or identity. It's atomic by construction, across any number of instances:

CREATE SEQUENCE order_seq;
-- each insert pulls nextval('order_seq'), no read-then-write

If the number must be per-customer and gapless, you can't lean on a global sequence. Then you serialize at the row level. A Postgres advisory lock keyed on the customer turns concurrent inserts for the same customer into a queue, while leaving different customers fully parallel:

SELECT pg_advisory_xact_lock(hashtext($1));  -- $1 = customer id
-- now the read-max-and-insert is safe within this transaction

Or let the database be the referee with insert-and-retry. Keep the unique index, attempt the insert, and on a conflict recompute and try again. The constraint you were treating as an error becomes your concurrency control:

INSERT INTO orders (...) VALUES (...)
ON CONFLICT (customer_id, code) DO NOTHING;

The thing I'd tell past me

The bug wasn't the arithmetic. It was believing that "read, then write" is a single step. It's two, and anything can happen in between. When correctness depends on two operations being indivisible, the only component that can guarantee that across a scaled-out service is the shared database — through a sequence, a lock, or a constraint you actually let do its job. Application-side locking on a multi-instance service is a comfort blanket, not a fix.