Alex Aslam
Deadlock hunting: PostgreSQL advisory locks in distributed systems

Or: How I learned to stop worrying and love the 64-bit integer


Let me paint you a scene. It's 2:47 AM on a Tuesday. You're the one on call because the "senior" in your title apparently means "person who gets woken up when the database screams." Your pager (yes, we still have pagers, don't judge) goes off. The distributed job scheduler is frozen. Requests are piling up. The metrics dashboard looks like a patient flatlining.

You've been here before. We all have.

The culprit? A classic distributed systems deadlock. Process A holds lock on resource X, waits for resource Y. Process B holds lock on resource Y, waits for resource X. Except in a distributed system, "resources" might be database rows, Redis keys, or even just logical operations across three different microservices.

And PostgreSQL's standard row-level locks? Useless across service boundaries.

This is where advisory locks saved my sanity. Let me show you how.


The problem with "normal" locks in a distributed world

After ten years, you know the drill. Standard database locks are tied to transactions, rows, and tables. They're great for what they do. But when you have:

  • A cron job that rebalances user data across shards
  • A message consumer that processes events for the same entity from different partitions
  • Two different services that need to coordinate access to an external API rate limit

...you quickly realize that row locks don't exist across PostgreSQL connections, let alone across services.

I once spent three days debugging a deadlock between a payment processor and a refund handler. Two separate services, both using SELECT FOR UPDATE on different rows in the same table, but the deadlock was happening at the application logic level, not the database level. PostgreSQL had no idea. The logs showed nothing. I aged three years in 72 hours.

Enter advisory locks: the Swiss Army knife you forgot you had

Advisory locks are PostgreSQL's way of saying, "I don't know what you're locking, but I'll remember it for you."

They're just integers. That's it. A 64-bit bigint. Or two 32-bit ints if you're feeling fancy. The database doesn't care what they represent—a user ID, a shard number, a job ID, a fictional character's social security number. It just knows: "Lock ID 12345 is held by session 789."

Here's the API you'll use 90% of the time:

-- Try to acquire lock (returns true/false, doesn't wait)
SELECT pg_try_advisory_lock(12345);

-- Block until lock is acquired
SELECT pg_advisory_lock(12345);

-- Release it
SELECT pg_advisory_unlock(12345);

-- Check if someone holds it: inspect pg_locks instead of calling
-- pg_try_advisory_lock, which would acquire the lock as a side effect
SELECT pid FROM pg_locks
WHERE locktype = 'advisory'
  AND (classid::bigint << 32) | objid::bigint = 12345;

Session-level locks (what I just showed) live until your connection dies or you explicitly release them. Transaction-level variants (pg_try_advisory_xact_lock) auto-release on commit/rollback.
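
The transaction-scoped flow can be sketched like this (the key is arbitrary):

BEGIN;

-- Returns true if acquired; false means someone else holds key 12345
SELECT pg_try_advisory_xact_lock(12345);

-- ... do the protected work ...

COMMIT;  -- the advisory lock is released automatically here

No unlock call to forget, which is exactly why the xact variants pair well with code paths that might throw mid-operation.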

The distributed deadlock detection pattern

Here's the real art. Not the lock itself—anyone can call pg_try_advisory_lock. The art is building a deadlock detection system on top of them.

Think of it as a lock registry plus heartbeat plus timeout.

Step 1: The lock key schema

Don't just pick random numbers. Design a scheme. I use a 64-bit composite:

High 32 bits: resource type (e.g., 0x01 for user, 0x02 for order, 0x03 for job)
Low 32 bits: resource identifier

So the lock for user 42 becomes (1::bigint << 32) | 42 (cast to bigint before shifting; shifting a plain 32-bit int by 32 wraps around in PostgreSQL). This keeps your keys debuggable and prevents accidental collisions.
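Composing and decomposing such a key in SQL looks like this (the type code 1 for "user" follows the scheme assumed above):

-- Compose: resource type 1 (user), id 42
SELECT (1::bigint << 32) | 42 AS lock_key;   -- 4294967338

-- Decompose a key back into its parts
SELECT (4294967338 >> 32)         AS resource_type,  -- 1
       (4294967338 & 4294967295)  AS resource_id;    -- 42

Do the composition in one place (a single SQL function or application helper) so every service encodes keys identically.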

Step 2: The heartbeat table

Advisory locks alone don't tell you who holds the lock or when they acquired it. So I add a small tracking table:

CREATE TABLE lock_heartbeats (
    lock_key bigint PRIMARY KEY,
    holder_id text NOT NULL,        -- service instance ID, pod name, etc.
    acquired_at timestamptz NOT NULL,
    expires_at timestamptz NOT NULL,
    last_heartbeat timestamptz NOT NULL
);

The pattern: Before acquiring an advisory lock, you check this table. If there's an entry that hasn't expired, someone legitimately holds the lock. If it's expired but the advisory lock still exists? Congratulations, you found a dead or hung process.
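
A sketch of the acquisition step against the table above (the holder ID and the 60-second TTL are illustrative choices, not requirements):

-- 1. Try the advisory lock first; bail out of the job if it's taken
SELECT pg_try_advisory_lock(4294967338);

-- 2. On success, record (or take over) the heartbeat row
INSERT INTO lock_heartbeats (lock_key, holder_id, acquired_at, expires_at, last_heartbeat)
VALUES (4294967338, 'worker-7', now(), now() + interval '60 seconds', now())
ON CONFLICT (lock_key) DO UPDATE
    SET holder_id      = EXCLUDED.holder_id,
        acquired_at    = EXCLUDED.acquired_at,
        expires_at     = EXCLUDED.expires_at,
        last_heartbeat = EXCLUDED.last_heartbeat;

-- 3. While working, refresh the heartbeat periodically
UPDATE lock_heartbeats
SET last_heartbeat = now(), expires_at = now() + interval '60 seconds'
WHERE lock_key = 4294967338;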

Step 3: The watchdog

Every 30 seconds, a background job scans for expired heartbeats where the advisory lock is still held. When found, you have two choices:

  • Force release: terminate the holder's backend with pg_terminate_backend(pid); note that pg_advisory_unlock can only release locks held by your own session (dangerous, but sometimes necessary)
  • Alert and investigate: log, notify, and let a human decide

I prefer the second. Automatically breaking locks in distributed systems is how you get data corruption. But I've worked places where the SLO demanded auto-recovery. Just know the risks.
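
The watchdog's scan can be sketched as a join between the heartbeat table and pg_locks, using the same key encoding as before:

-- Expired heartbeats whose advisory lock is still granted:
-- the holder is likely dead or hung
SELECT h.lock_key, h.holder_id, h.last_heartbeat, l.pid
FROM lock_heartbeats h
JOIN pg_locks l
  ON l.locktype = 'advisory'
 AND ((l.classid::bigint << 32) | l.objid::bigint) = h.lock_key
 AND l.granted
WHERE h.expires_at < now();

Whatever the watchdog does with the results, log the PID and holder_id first. That pairing is what makes the post-mortem possible.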

Step 4: The deadlock query

Here's the query that saved my career. It builds the waits-for edges of your advisory locks by joining pg_locks with itself; a cycle in those edges is your deadlock:

WITH advisory AS (
    SELECT
        (classid::bigint << 32) | objid::bigint AS lock_key,
        pid,
        granted
    FROM pg_locks
    WHERE locktype = 'advisory'
)
SELECT
    w.lock_key,
    w.pid AS waiter_pid,
    h.pid AS blocker_pid
FROM advisory w
JOIN advisory h ON w.lock_key = h.lock_key  -- same lock key
WHERE w.granted = false  -- w is waiting (a blocking pg_advisory_lock call)
  AND h.granted = true   -- h holds it
  AND w.pid != h.pid;

Run this every few seconds against your primary. When you get results, you've found a deadlock before the timeout kills your user experience.

Real-world example: the job queue nightmare

Let me walk you through a real fix from last year.

We had a worker pool processing "reconciliation jobs" for 10,000 tenants. Each job needed exclusive access to a tenant's data for 5-10 seconds. We used pg_try_advisory_lock(tenant_id) at the start of each job.

The problem: Worker A got tenant 42. Worker B got tenant 99. Worker A tried to update a shared lookup table (which required a lock on tenant 99 for a sub-operation). Worker B tried to update the same lookup table (requiring tenant 42). Neither would release the original tenant lock until the entire job finished.

Standard deadlock detection didn't catch this because the locks were advisory, not row locks. PostgreSQL saw two happy sessions holding unrelated integers.

Our custom detection query caught it in 2 seconds. We logged the PIDs, inspected the stack traces, and realized our sub-operation was using a different lock acquisition order than the main operation. Fixed the code, deployed, slept for the first time in a week.

The pitfalls that almost killed me (and will almost kill you)

Advisory locks don't survive connection poolers. PgBouncer in transaction pooling mode hands you a different server backend between transactions, so a session-level lock stays behind on a backend you no longer control. Use session pooling or direct connections for any service that takes session-level advisory locks, or stick to the transaction-level (xact) variants.

They're not replicated. PostgreSQL logical replication doesn't copy advisory locks. Your read replicas won't know about locks on the primary. Don't query pg_locks on a replica for deadlock detection.

They can leak. If your application crashes between acquiring a lock and releasing it, that lock stays held until the TCP connection times out or you manually clear it. Always set statement_timeout and idle_in_transaction_session_timeout on the connection.
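
The two safety nets from the paragraph above, as connection-level settings (the values are illustrative; tune them to your job durations):

SET statement_timeout = '30s';
SET idle_in_transaction_session_timeout = '60s';

These won't catch every leak (a session idle outside a transaction keeps its session-level locks), which is why the heartbeat table and watchdog exist at all.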

The 64-bit key space isn't magic. I've seen teams use timestamps as lock keys. Don't. Use deterministic, bounded keys. Hash strings if you must: hashtext('tenant:' || tenant_id) gives you a stable 32-bit integer, and hashtext('tenant:' || tenant_id)::bigint & 4294967295 masks off the sign when you need a non-negative value for the low half of a composite key.

When NOT to use advisory locks

I'm a pragmatist. Advisory locks are amazing, but they're not always the answer.

  • If you need cross-database coordination, look at etcd or ZooKeeper. Advisory locks are PostgreSQL-only.
  • If you need millisecond-level lock acquisition, Redis is faster. PostgreSQL advisory locks have network round trips and transaction overhead.
  • If you have fewer than 10 workers, a simple Redis SETNX with TTL is easier to reason about.

But for anything between 10 and 10,000 workers, where consistency matters more than absolute speed, and you're already using PostgreSQL? Advisory locks are your hammer. And this is a nail.


The art of the hunt

After a decade of this, I've learned that deadlock hunting isn't about the tools. It's about the story your system tells you.

The advisory lock is just a witness. The heartbeat table is the timeline. The detection query is the interrogation. You're not writing code—you're building a crime scene investigation unit for your distributed system.

And sometimes, at 2:47 AM, that's exactly what you need.

Next time your pager goes off, don't panic. Query pg_locks. Look for the advisory rows with granted = false. Follow the trail. And remember: every deadlock is just two processes that fell in love with the wrong resources.
