We run a fleet of MySQL 8.0 RDS instances — multi-TB databases on 32-vCPU / 128 GB machines doing thousands of write IOPS at peak across 1,000+ concurrent connections. Three of them have been brought to their knees by locking over the past year. Not slow queries. Not CPU saturation. Not disk I/O. Locks.
Here's what actually happens: a single metadata lock from a partition drop cascades into 1,600 queued connections in 90 seconds, exhausting your connection pool and crashing every microservice that writes to that table. A handful of abandoned application connections holding row locks slowly strangle write throughput until the database is effectively unresponsive. A bulk INSERT INTO ... SELECT under REPEATABLE READ gap-locks a range of an index and deadlocks every concurrent insert attempting to write into that range. In each case, the database didn't run out of CPU or memory — it ran out of the ability to make progress. Transactions pile up, connection pools saturate, application threads block, health checks fail, and the cascading failure takes out services that don't even touch the locked table.
This post is the full picture — what InnoDB's locks actually are, why they exist, how they interact, and the specific queries we use to find them before they page us.
But first — if you've never thought about why databases lock at all, let's build up from first principles.
Why Databases Lock: The Problem of Concurrent Access
Every multi-user database has the same fundamental problem: multiple transactions reading and writing the same data simultaneously. Without some coordination mechanism, you get anomalies — dirty reads, lost updates, phantom rows. The SQL standard defines four isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE) that describe which anomalies you're willing to tolerate.
The question is how the database enforces the isolation level you chose. There are two broad strategies:
Pessimistic concurrency (locking). Before accessing a row, acquire a lock on it. If someone else already holds a conflicting lock, wait. This guarantees correctness by preventing concurrent access entirely. The downside: contention. Transactions queue behind each other, and throughput drops under load.
Optimistic concurrency (MVCC). Don't lock for reads. Instead, maintain multiple versions of each row. Readers see a consistent snapshot from the start of their transaction, writers create new versions. Conflicts are detected at commit time. The downside: maintaining multiple versions costs memory and storage, and write-write conflicts still need resolution.
Here's the thing — most production databases use both. InnoDB uses MVCC for reads (non-locking consistent reads) and pessimistic locking for writes. When you run a SELECT, InnoDB reads from a snapshot without acquiring any row locks. When you run an UPDATE or DELETE, InnoDB acquires exclusive locks on the affected rows. This hybrid is why MySQL can handle thousands of concurrent readers without contention, but writers can still block each other.
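You can see the hybrid in action with two sessions (the table and values here are illustrative):

```sql
-- Session 1: a plain SELECT reads from a snapshot — no row locks acquired
START TRANSACTION;
SELECT status FROM orders WHERE id = 42;      -- non-locking consistent read

-- Session 2: the write proceeds immediately; readers never block writers
UPDATE orders SET status = 'shipped' WHERE id = 42;   -- X lock on the row
COMMIT;

-- Session 1: still sees its snapshot — session 2's committed change is invisible
SELECT status FROM orders WHERE id = 42;      -- same value as the first read
COMMIT;
```

Swap session 1's SELECT for SELECT ... FOR UPDATE and session 2 blocks instead — the snapshot read is what makes readers contention-free.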
Where databases fall on the spectrum
| Database | Default Isolation | Read Strategy | Write Strategy |
|---|---|---|---|
| MySQL/InnoDB | REPEATABLE READ | MVCC (snapshot) | Row-level locking (in-place update, old version to undo log) |
| PostgreSQL | READ COMMITTED | MVCC (snapshot) | Row-level locking (new tuple version per update, old tuple marked dead) |
| DynamoDB | — | Single-item reads (eventually or strongly consistent; no MVCC snapshot) | Optimistic (conditional writes) |

| CockroachDB | SERIALIZABLE | MVCC | Pessimistic + optimistic hybrid |
Both MySQL and PostgreSQL use row-level locking for writes: if two transactions try to update the same row, the second one blocks until the first commits or rolls back. They share the same hybrid model — MVCC for reads, pessimistic locks for writes — but differ in storage mechanics, gap locking behavior, and conflict resolution.
One operational difference worth calling out: under REPEATABLE READ, when two transactions try to update the same row, both databases block the second writer until the first commits. But what happens after the commit differs. InnoDB lets the second writer proceed — it reads the latest committed version and applies its update on top. PostgreSQL takes the opposite approach: it aborts the second writer with ERROR: could not serialize access due to concurrent update, forcing the application to retry. This "first-updater-wins" policy means PostgreSQL detects write-write conflicts and forces explicit retry logic, while InnoDB silently applies the update to the newest committed version. Neither has a classic "lost update" — but InnoDB's behavior means the second transaction's update is based on a version it never read in its snapshot, which can lead to subtle anomalies if the application logic depends on snapshot consistency across reads and writes.
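A minimal two-session sketch of the divergence, using a hypothetical counters table:

```sql
-- Both sessions run under REPEATABLE READ; row id = 1 starts at counter = 10.

-- Session A:
START TRANSACTION;
UPDATE counters SET counter = counter + 1 WHERE id = 1;   -- X lock held

-- Session B (blocks on the row lock in BOTH databases):
START TRANSACTION;
UPDATE counters SET counter = counter + 1 WHERE id = 1;

-- Session A:
COMMIT;

-- Session B now diverges:
--   InnoDB:     the UPDATE proceeds against the freshly committed version;
--               after B commits, counter = 12
--   PostgreSQL: ERROR: could not serialize access due to concurrent update
--               — B must ROLLBACK and retry the whole transaction
```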
InnoDB's locking is more aggressive than PostgreSQL's in one critical way: gap locking. Under REPEATABLE READ, InnoDB locks not just the rows that match your query, but the gaps between those rows to prevent phantom reads. PostgreSQL also prevents phantoms at REPEATABLE READ, but through a completely different mechanism — pure snapshot isolation. Since each transaction reads from a fixed snapshot, newly inserted rows by other transactions are simply invisible, so phantoms can't occur without any locking. The key difference isn't whether phantoms are prevented (both do), but how: InnoDB prevents them pessimistically by locking gaps in the index, while PostgreSQL prevents them passively through snapshot visibility rules. InnoDB's approach means write transactions can block each other even when they're operating on non-overlapping rows, simply because their index ranges overlap. PostgreSQL avoids this contention entirely at REPEATABLE READ, only adding conflict detection at SERIALIZABLE via SSI with non-blocking predicate locks.
InnoDB's Lock Types: The Full Taxonomy
InnoDB has more lock types than most engineers realize. Understanding the hierarchy matters because lock conflicts aren't always row-vs-row — they can be gap-vs-insert, metadata-vs-DML, or intention-vs-table. Here's the complete picture.
Row-Level Locks (Record Locks)
The most intuitive lock type. A record lock locks a single index record. When you execute:
UPDATE orders SET status = 'shipped' WHERE id = 42;
InnoDB acquires an exclusive (X) lock on the index record for id = 42. Any other transaction attempting to UPDATE or DELETE that same row will block until the first transaction commits or rolls back.
Two flavors:

- Shared (S) lock: acquired by SELECT ... FOR SHARE (or the older LOCK IN SHARE MODE syntax). Multiple transactions can hold S locks on the same row simultaneously. Blocks X locks.
- Exclusive (X) lock: acquired by UPDATE, DELETE, and SELECT ... FOR UPDATE. Only one transaction can hold an X lock on a given row. Blocks both S and X locks.
Record locks always operate on index records, not table rows directly. The MySQL docs state: "Record locks always lock index records, even if a table is defined with no indexes. For such cases, InnoDB creates a hidden clustered index and uses this index for record locking." (InnoDB Locking). This is why index design directly affects lock contention — a full table scan under an UPDATE means locking every index record scanned, not just the rows that match the WHERE clause: "If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked" (Locks Set by Different SQL Statements).
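A sketch of the consequence, assuming a hypothetical users table where email starts out unindexed:

```sql
-- No index on email: the UPDATE walks the clustered index and, under
-- REPEATABLE READ, locks every row it scans — not just the matching one.
UPDATE users SET last_login = NOW() WHERE email = 'a@example.com';

-- With a secondary index, only the matching index record (plus the adjacent
-- gap, under REPEATABLE READ) is locked:
ALTER TABLE users ADD INDEX idx_email (email);
UPDATE users SET last_login = NOW() WHERE email = 'a@example.com';
```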
Gap Locks
This is where InnoDB gets interesting — and where most locking surprises come from.
A gap lock locks the gap between index records, preventing other transactions from inserting into that gap. It doesn't lock the records themselves.
Consider a table with an indexed column age containing values [10, 20, 30]. The gaps are:
(-∞, 10) (10, 20) (20, 30) (30, +∞)
If transaction A runs:
SELECT * FROM users WHERE age BETWEEN 15 AND 25 FOR UPDATE;
InnoDB locks the gap (10, 20) and the gap (20, 30), plus the record at 20. This prevents any other transaction from inserting age = 12, age = 17, age = 22, or age = 28 — even though none of those rows exist yet.
Why gap locks exist: To prevent phantom reads under REPEATABLE READ. Without gap locks, transaction A could run the same SELECT twice and get different results because transaction B inserted a new row in the range between the two reads. Gap locks guarantee that if you read a range, no one can insert into that range until you commit.
The catch: Gap locks are purely inhibitive. The MySQL docs are explicit: "Gap locks in InnoDB are 'purely inhibitive', which means that their only purpose is to prevent other transactions from inserting to the gap. Gap locks can co-exist." (InnoDB Locking). Two transactions can hold gap locks on the same gap simultaneously. This seems harmless until you realize that each is blocking the other's INSERT, creating a deadlock.
The key is that this happens when no rows match the query — only gaps are locked, with no record locks to cause the second SELECT to block. Consider a table where age contains only [10, 30] (no row with age = 20):
Tx A: SELECT ... WHERE age BETWEEN 15 AND 25 FOR UPDATE; -- no matching rows; gap lock on (10, 30)
Tx B: SELECT ... WHERE age BETWEEN 15 AND 25 FOR UPDATE; -- also gap lock on (10, 30) — no conflict!
Tx A: INSERT INTO users (age) VALUES (18); -- blocked by Tx B's gap lock
Tx B: INSERT INTO users (age) VALUES (22); -- blocked by Tx A's gap lock → DEADLOCK
If rows did exist in the range, the SELECT ... FOR UPDATE would acquire next-key locks (record + gap), and the X record locks would cause Tx B to block on the SELECT itself — no deadlock, just contention. It's the empty-range case that's dangerous: both transactions sail through the SELECT, acquire only gap locks, and then deadlock on the INSERT.
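If you catch this window live (both transactions open, neither INSERT issued yet), performance_schema shows the two gap locks coexisting — a sketch against the users table from the example:

```sql
-- Two GRANTED gap locks on the same gap, one per transaction; note the
-- LOCK_MODE of 'X,GAP' — gap-only, no record component.
SELECT ENGINE_TRANSACTION_ID, INDEX_NAME, LOCK_MODE, LOCK_STATUS, LOCK_DATA
FROM performance_schema.data_locks
WHERE OBJECT_NAME = 'users'
  AND LOCK_MODE LIKE '%GAP%';
```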
We've hit this pattern in production. More on that later.
Next-Key Locks
A next-key lock is a combination of a record lock and a gap lock on the gap before that record. It's InnoDB's default locking strategy for index scans under REPEATABLE READ.
For our age index [10, 20, 30], the next-key locks are:
(-∞, 10] (10, 20] (20, 30] (30, +∞)
Notice the notation: the record itself is included (closed bracket on the right). A next-key lock on (10, 20] locks the gap (10, 20) AND the record 20.
When you run a range scan like WHERE age > 15 AND age < 25, InnoDB places next-key locks on:
- (10, 20] — the gap before 20, plus the record 20
- (20, 30] — the gap before 30, plus the record 30
This is more than the rows that match your predicate. The extra locking is the cost of preventing phantoms via pessimistic locking rather than snapshot visibility.
Intention Locks
Intention locks are table-level locks that signal what kind of row-level locks a transaction intends to acquire. They exist purely for efficiency.
- Intention Shared (IS): "I'm going to acquire shared row locks in this table."
- Intention Exclusive (IX): "I'm going to acquire exclusive row locks in this table."
When you run UPDATE orders SET status = 'shipped' WHERE id = 42, InnoDB first acquires an IX lock on the orders table, then an X lock on the row. The IX lock is cheap (no contention between IX and IX) and serves one purpose: if another transaction wants a full table lock (LOCK TABLES orders WRITE), it can check for intention locks instead of scanning every row's lock state.
Intention locks never block each other. IX + IX is fine. IS + IX is fine. They only conflict with full table locks:
| | IS | IX | S | X |
|---|---|---|---|---|
| IS | ✓ | ✓ | ✓ | ✗ |
| IX | ✓ | ✓ | ✗ | ✗ |
| S | ✓ | ✗ | ✓ | ✗ |
| X | ✗ | ✗ | ✗ | ✗ |
In practice, you almost never think about intention locks unless you're debugging a case where LOCK TABLES or DDL is blocking behind thousands of row-level transactions.
Insert Intention Locks
A special type of gap lock acquired before an INSERT. It signals: "I intend to insert into this gap, but I don't need to lock the entire gap — just my specific insertion point."
Two transactions inserting into the same gap at different positions won't block each other:
-- Gap (10, 20) exists
Tx A: INSERT INTO users (age) VALUES (12); -- insert intention lock at 12
Tx B: INSERT INTO users (age) VALUES (17); -- insert intention lock at 17 — no conflict!
But an insert intention lock DOES conflict with a gap lock on the same gap. This is the mechanism that makes gap locks effective — they block insert intentions.
Auto-Increment Locks
When a table has an AUTO_INCREMENT column, InnoDB needs to serialize the generation of new values. Historically, this was a table-level lock held for the duration of an INSERT statement (the "traditional" mode). Modern InnoDB has three modes controlled by innodb_autoinc_lock_mode:
| Mode | Value | Behavior | Use Case |
|---|---|---|---|
| Traditional | 0 | Table-level AUTO-INC lock held for entire statement | Legacy, maximally safe |
| Consecutive | 1 | Lightweight mutex for simple inserts, table lock for bulk inserts | Default in MySQL 5.7 |
| Interleaved | 2 | Lightweight mutex for all inserts, values may have gaps | Default in MySQL 8.0, safe only with row-based replication |
In mode 2 (the MySQL 8.0 default), simple INSERT statements use a lightweight mutex that's released as soon as the value is generated — not when the statement completes. This means auto-increment is rarely a bottleneck. But bulk operations like INSERT INTO ... SELECT or LOAD DATA can still cause contention in mode 1 because they hold the table-level lock for the entire statement duration.
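Worth verifying on your own instances — a quick check (the mode is read-only at runtime; on RDS it's set via the parameter group plus a reboot):

```sql
SELECT @@innodb_autoinc_lock_mode;   -- 2 on a default MySQL 8.0 install

-- Mode 2 interleaves bulk inserts, so auto-increment values can have gaps;
-- it is only replication-safe with row-based logging:
SELECT @@binlog_format;              -- should be ROW
```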
Metadata Locks (MDL)
This is the lock type that catches most people off guard because it's not an InnoDB lock — it's a MySQL server-level lock that sits above the storage engine.
Metadata locks protect the schema definition of a table. Any DML statement (SELECT, INSERT, UPDATE, DELETE) acquires a shared MDL on the table. DDL statements (ALTER TABLE, DROP TABLE, DROP PARTITION) require an exclusive MDL. Critically, once an exclusive MDL request is pending, all new shared MDL requests queue behind it — a waiting DDL blocks every subsequent DML from acquiring the lock (Metadata Locking).
The rule is simple: DDL waits for all active DML to finish, and blocks all new DML while waiting.
Timeline:
T0: Tx A starts SELECT on orders (acquires shared MDL)
T1: DBA runs ALTER TABLE orders ... (needs exclusive MDL → blocked by Tx A)
T2: Tx B starts INSERT INTO orders (needs shared MDL → blocked by ALTER)
T3: Tx C starts SELECT on orders (needs shared MDL → blocked by ALTER)
...
T?: Tx A finishes → ALTER acquires exclusive MDL → all DML queued behind it
This is the MDL queue stacking problem. A single ALTER TABLE doesn't just wait for the active transaction — it blocks every subsequent transaction too. On a table sustaining 500 writes/second, a 90-second stall creates a pileup of thousands of queued connections, exhausting your connection pool entirely. The ALTER itself might be sub-second, but the queue cascades into connection pool exhaustion, health check failures, and service-wide outages that have nothing to do with the table being altered.
Production Incident #1: Metadata Locks from Partition Drops
We run time-partitioned tables for high-volume event data. Each partition holds one day's worth of data, and a cron job drops partitions older than 90 days:
ALTER TABLE events DROP PARTITION p20250101;
This is an ALTER TABLE. It needs an exclusive metadata lock.
What happened
The partition drop job was already scheduled during a low-traffic window. There was no single long-running query blocking it. The problem was simpler and more insidious: the DROP PARTITION itself took time to execute — and for the entire duration it held an exclusive metadata lock, every new DML statement on the table queued behind it.
ALTER TABLE ... DROP PARTITION isn't always instant. On a large table with many partitions, MySQL needs to update the table's partition metadata, and if the partition being dropped has significant data, the storage engine needs to reclaim the tablespace. During this window — which stretched to 30+ seconds in our case — the exclusive MDL blocked all new INSERT, UPDATE, DELETE, and even SELECT statements on the table. The Waiting for table metadata lock state spread across the processlist like a wave.
What we saw in monitoring
The global MDL wait metric spiked sharply. We queried the process list:
SELECT id, user, host, db, command, time, state, info
FROM information_schema.processlist
WHERE state = 'Waiting for table metadata lock'
ORDER BY time DESC;
The output showed hundreds of DML statements queued behind the ALTER TABLE:
+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
| id | user | host | db | command | time | state | info |
+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
| 15901 | admin| 10.0.2.x | events | Query | 34 | altering table | ALTER TABLE events DROP ... |
| 15902 | app | 10.0.3.x | events | Query | 32 | Waiting for table metadata lock | INSERT INTO events ... |
| 15903 | app | 10.0.3.x | events | Query | 31 | Waiting for table metadata lock | INSERT INTO events ... |
| 15904 | app | 10.0.3.x | events | Query | 30 | Waiting for table metadata lock | INSERT INTO events ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
The ALTER TABLE was actively running (state: altering table), not waiting for anything — it was the one holding the exclusive MDL while it completed. Every DML that arrived during those 30+ seconds piled up.
The query to find MDL holders
MySQL 8.0 exposes metadata locks through performance_schema:
SELECT
mdl.OBJECT_SCHEMA,
mdl.OBJECT_NAME,
mdl.LOCK_TYPE,
mdl.LOCK_DURATION,
mdl.LOCK_STATUS,
mdl.OWNER_THREAD_ID,
t.PROCESSLIST_ID,
t.PROCESSLIST_USER,
t.PROCESSLIST_HOST,
t.PROCESSLIST_TIME,
t.PROCESSLIST_STATE,
t.PROCESSLIST_INFO
FROM performance_schema.metadata_locks mdl
JOIN performance_schema.threads t
ON mdl.OWNER_THREAD_ID = t.THREAD_ID
WHERE mdl.OBJECT_SCHEMA = 'your_database'
AND mdl.OBJECT_NAME = 'events'
ORDER BY t.PROCESSLIST_TIME DESC;
This tells you exactly who holds the lock and who's waiting. The output distinguishes GRANTED (holding the lock) from PENDING (waiting for it).
The fix
- Switched from DROP PARTITION to EXCHANGE PARTITION followed by dropping the exchanged table — EXCHANGE PARTITION is a metadata-only rename operation that completes near-instantly, so the exclusive MDL is held for milliseconds, not seconds
- For cases where DROP PARTITION is unavoidable, we batch the drops and set lock_wait_timeout = 5 for the session — if the statement can't acquire the MDL within 5 seconds, it backs off and retries on the next cycle
- Added an alert when the global count of connections in Waiting for table metadata lock exceeds a threshold sustained for 10 seconds
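The back-off pattern looks like this in the cron job's session (a sketch; the partition name matches the earlier example):

```sql
-- Fail fast instead of stacking the whole connection pool behind the DDL:
SET SESSION lock_wait_timeout = 5;             -- give up on the MDL after 5s
ALTER TABLE events DROP PARTITION p20250101;
-- On timeout MySQL raises ER_LOCK_WAIT_TIMEOUT; the job catches it and
-- retries on the next cycle instead of queuing DML behind a pending MDL.
```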
Production Incident #2: Row Locks from Incorrectly Terminated Application Instances
This one was subtle. Write latency on the orders table was slowly degrading over hours, not spiking suddenly. P99 crept from 15ms to 200ms to 2 seconds over the course of a day.
What happened
An application instance was terminated incorrectly during a deployment — the process was killed (SIGKILL / OOM) without gracefully closing its database connections. The connections had open transactions with row-level exclusive locks acquired via SELECT ... FOR UPDATE.
Here's the thing MySQL doesn't advertise: when a client process dies abruptly, no COM_QUIT is sent, no TCP FIN packet is transmitted. The TCP connection is now "half-open" — MySQL has no idea the client is gone. The server-side connection thread sits in a blocking read(), waiting for the next query that will never arrive. The transaction stays open. The locks stay held.
MySQL detects dead connections through two mechanisms, whichever fires first:
- wait_timeout (application layer): MySQL closes a connection that has been idle longer than this value. Default: 28800 seconds (8 hours).
- TCP keepalive (network layer): the OS sends keepalive probes to detect the dead peer. With Linux defaults (tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9), detection takes 7200 + 75 × 9 = 7,875 seconds (~2.2 hours).
Until one of those fires, the dead connection's transaction remains open and all its locks are held. InnoDB only rolls back the transaction when MySQL actually closes the server-side connection.
In our case, wait_timeout was still at the default — 28800 seconds (8 hours). So the killed application instance left behind connections holding exclusive row locks for hours. As subsequent deployments killed more instances the same way, more orphaned locks accumulated. Eventually, write contention on orders was so severe that P99 latency hit the innodb_lock_wait_timeout ceiling.
What we saw
-- Find transactions that have been open for a long time but aren't actively running
SELECT
trx.trx_id,
trx.trx_state,
trx.trx_started,
TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) AS trx_age_seconds,
trx.trx_rows_locked,
trx.trx_rows_modified,
trx.trx_mysql_thread_id,
p.user,
p.host,
p.command,
p.time AS command_time,
p.state,
p.info AS current_query
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist p
ON trx.trx_mysql_thread_id = p.id
WHERE p.command = 'Sleep'
AND TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) > 60
ORDER BY trx.trx_started ASC;
The output showed connections from hosts belonging to application instances that had already been terminated — their IPs were no longer in service, but MySQL was still holding their connections open:
+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
| trx_id | trx_state | trx_started | trx_age_seconds | trx_rows_locked | trx_rows_modified | ... | user | command | time | state | current_query |
+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
| 48291 | RUNNING | 2025-03-15 09:14:22 | 14338 | 47 | 0 | ... | app | Sleep | 9841 | | NULL |
| 48305 | RUNNING | 2025-03-15 09:21:07 | 13933 | 23 | 0 | ... | app | Sleep | 8122 | | NULL |
| 48412 | RUNNING | 2025-03-15 10:02:44 | 11436 | 12 | 0 | ... | app | Sleep | 7203 | | NULL |
+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
A few things jump out. trx_state is RUNNING but command is Sleep — the transaction is open but the connection is idle. trx_rows_locked is non-zero but trx_rows_modified is zero — these transactions acquired locks via FOR UPDATE but never wrote anything. The time column shows idle times of 7,000–10,000 seconds — the 8-hour wait_timeout default hadn't even kicked in yet, and these connections had already been abandoned for hours with no end in sight.
Finding exactly which rows are locked
SELECT
l.ENGINE_LOCK_ID,
l.ENGINE_TRANSACTION_ID,
l.LOCK_TYPE,
l.LOCK_MODE,
l.LOCK_STATUS,
l.LOCK_DATA,
l.OBJECT_SCHEMA,
l.OBJECT_NAME,
l.INDEX_NAME
FROM performance_schema.data_locks l
WHERE l.ENGINE_TRANSACTION_ID IN (
SELECT trx.trx_id
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist p
ON trx.trx_mysql_thread_id = p.id
WHERE p.command = 'Sleep'
AND TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) > 300
);
This shows every lock held by sleeping transactions older than 5 minutes. The LOCK_DATA column gives you the primary key values of the locked rows, and LOCK_MODE tells you whether it's shared or exclusive.
Finding who's waiting on whom
SELECT
w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx_id,
w.BLOCKING_ENGINE_TRANSACTION_ID AS blocking_trx_id,
rl.LOCK_MODE AS waiting_lock_mode,
bl.LOCK_MODE AS blocking_lock_mode,
rl.LOCK_DATA AS contested_row,
r.trx_query AS waiting_query,
b_proc.command AS blocking_command,
b_proc.time AS blocking_idle_time,
TIMESTAMPDIFF(SECOND, b_trx.trx_started, NOW()) AS blocking_trx_age
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks rl
ON w.REQUESTING_ENGINE_LOCK_ID = rl.ENGINE_LOCK_ID
JOIN performance_schema.data_locks bl
ON w.BLOCKING_ENGINE_LOCK_ID = bl.ENGINE_LOCK_ID
JOIN information_schema.innodb_trx r
ON w.REQUESTING_ENGINE_TRANSACTION_ID = r.trx_id
JOIN information_schema.innodb_trx b_trx
ON w.BLOCKING_ENGINE_TRANSACTION_ID = b_trx.trx_id
JOIN information_schema.processlist b_proc
ON b_trx.trx_mysql_thread_id = b_proc.id;
The fix
Short-term: killed the orphaned connections manually (KILL <processlist_id>).
Long-term:
- Fixed the deployment process to send SIGTERM with a grace period, allowing the application to close database connections before the container is killed. Added a pre-stop hook that explicitly closes the connection pool
- Reduced wait_timeout to 300 (5 minutes). The MySQL default of 28800 (8 hours) means orphaned connections can hold locks for an entire workday before MySQL notices. Check yours with SHOW VARIABLES LIKE 'wait_timeout'
- On RDS, you cannot tune server-side TCP keepalive — AWS manages the OS, and those sysctl parameters aren't user-configurable. wait_timeout is your primary lever for dead-client detection. On self-managed MySQL, you can additionally tune the server's TCP keepalive (tcp_keepalive_time=60, tcp_keepalive_intvl=10, tcp_keepalive_probes=6) so the server detects dead clients in ~120 seconds instead of ~2.2 hours
- Added monitoring for information_schema.innodb_trx rows where trx_state = 'RUNNING' and the processlist shows Sleep for more than 5 minutes
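For the short-term cleanup, a helper we find useful — it generates KILL statements rather than executing anything, so you can review before acting (the 300-second threshold is our choice, not a MySQL constant):

```sql
-- One KILL statement per zombie: open transaction, idle connection, >5 min old.
SELECT CONCAT('KILL ', p.id, ';') AS kill_stmt,
       p.user, p.host,
       trx.trx_rows_locked,
       TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) AS trx_age_seconds
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist p
  ON trx.trx_mysql_thread_id = p.id
WHERE p.command = 'Sleep'
  AND TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) > 300;
```

On RDS, run CALL mysql.rds_kill(&lt;processlist_id&gt;) instead of a bare KILL.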
Production Incident #3: Gap Locks from INSERT INTO ... SELECT
We have a nightly job that archives completed orders into a summary table:
INSERT INTO order_summaries (order_id, total, region, completed_at)
SELECT id, total_amount, region, completed_at
FROM orders
WHERE status = 'completed'
AND completed_at >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
AND completed_at < CURDATE();
This ran fine for months. Then the orders table grew, the nightly job started taking longer, and we started seeing deadlocks during peak write hours.
What happened
Under REPEATABLE READ, INSERT INTO ... SELECT places shared next-key locks on the rows read by the SELECT — the MySQL docs are explicit: "InnoDB sets shared next-key locks on rows from S" (InnoDB Locks Set by Different SQL Statements). This is required to guarantee a consistent read — InnoDB needs to prevent other transactions from modifying or inserting into the range being read while the bulk insert is in progress.
The problem: the SELECT scans a range of completed_at values. InnoDB places shared next-key locks on every index record in that range — and since each next-key lock includes the gap before the record, the entire scanned range is locked against inserts. Any INSERT into the orders table with a completed_at value that falls within or near the locked range will block.
Our application was simultaneously inserting new orders with completed_at values close to the current timestamp. Since the archival job was reading completed_at from the previous day, you might think there's no overlap. But gap locks extend to the next index record beyond the scanned range — and if the next record's completed_at is today, the gap lock extends into today's range.
Index records for completed_at:
... | 2025-03-14 23:58:12 | 2025-03-14 23:59:44 | 2025-03-15 00:01:23 | ...
↑
Next-key lock extends here
because this is the next record
after the scanned range
New INSERT with completed_at = 2025-03-15 00:00:15 → blocked by gap lock
Finding active lock waits in real time
In MySQL 8.0, lock wait information lives in performance_schema.data_lock_waits (the old information_schema.innodb_lock_waits was removed in 8.0):
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query,
b.trx_rows_locked AS blocking_rows_locked
FROM performance_schema.data_lock_waits w
JOIN information_schema.innodb_trx b
ON w.BLOCKING_ENGINE_TRANSACTION_ID = b.trx_id
JOIN information_schema.innodb_trx r
ON w.REQUESTING_ENGINE_TRANSACTION_ID = r.trx_id
ORDER BY wait_seconds DESC;
The fix
- Changed the archival job's isolation level to READ COMMITTED for that session:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
Under READ COMMITTED, InnoDB doesn't acquire gap locks at all (except for foreign-key and duplicate-key checks) — the SELECT portion runs as a consistent read without locking the source rows. (This requires row-based binary logging, which is the MySQL 8.0 default.) It gives up the phantom-prevention guarantee for that transaction, which is acceptable for archival.
- Added batching — instead of one massive INSERT INTO ... SELECT, process 1,000 rows at a time, keyed on the primary key (keyset pagination; an OFFSET would re-scan and re-lock the same range on every batch):

INSERT INTO order_summaries (order_id, total, region, completed_at)
SELECT id, total_amount, region, completed_at
FROM orders
WHERE status = 'completed'
AND completed_at >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
AND completed_at < CURDATE()
AND id > @last_seen_id  -- keyset cursor: highest id from the previous batch
ORDER BY id
LIMIT 1000;
Each batch holds locks for a shorter duration, reducing the window for contention.
- Moved the archival job to run against a read replica, then applied the summaries to the primary via smaller transactional batches.
Monitoring InnoDB Locks: The Essential Queries
Here's the full set of diagnostic queries we keep in our runbook.
1. Current InnoDB transaction status
SELECT
trx_id,
trx_state,
trx_started,
TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds,
trx_rows_locked,
trx_rows_modified,
trx_lock_memory_bytes,
trx_mysql_thread_id,
trx_query,
trx_operation_state
FROM information_schema.innodb_trx
ORDER BY trx_started ASC;
2. All currently held locks (MySQL 8.0+)
SELECT
ENGINE_TRANSACTION_ID,
OBJECT_SCHEMA,
OBJECT_NAME,
INDEX_NAME,
LOCK_TYPE,
LOCK_MODE,
LOCK_STATUS,
LOCK_DATA
FROM performance_schema.data_locks
ORDER BY ENGINE_TRANSACTION_ID, OBJECT_NAME;
LOCK_TYPE will be RECORD (row/gap/next-key) or TABLE (intention locks). LOCK_MODE tells you the specifics: X (exclusive), S (shared), X,GAP (exclusive gap lock), X,REC_NOT_GAP (record-only, no gap), S,GAP, etc.
3. Lock wait chains — who blocks whom
SELECT
w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx,
w.BLOCKING_ENGINE_TRANSACTION_ID AS blocking_trx,
rl.OBJECT_NAME AS table_name,
rl.INDEX_NAME,
rl.LOCK_MODE AS waiting_lock_mode,
bl.LOCK_MODE AS blocking_lock_mode,
rl.LOCK_DATA AS contested_row,
r.trx_query AS waiting_query,
TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks rl
ON w.REQUESTING_ENGINE_LOCK_ID = rl.ENGINE_LOCK_ID
JOIN performance_schema.data_locks bl
ON w.BLOCKING_ENGINE_LOCK_ID = bl.ENGINE_LOCK_ID
JOIN information_schema.innodb_trx r
ON w.REQUESTING_ENGINE_TRANSACTION_ID = r.trx_id
ORDER BY wait_seconds DESC;
4. Deadlock history
SHOW ENGINE INNODB STATUS\G
The LATEST DETECTED DEADLOCK section shows the last deadlock with full details: both transactions, the locks they held, the locks they were waiting for, and which transaction InnoDB chose as the victim. Parse the output — it's verbose but complete.
For ongoing monitoring, enable the deadlock log:
SET GLOBAL innodb_print_all_deadlocks = ON;
This writes every deadlock to the MySQL error log, not just the latest one.
5. Long-running transactions with sleeping connections
SELECT
trx.trx_id,
trx.trx_state,
TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) AS trx_age_seconds,
trx.trx_rows_locked,
trx.trx_rows_modified,
p.id AS processlist_id,
p.user,
p.host,
p.command,
p.time AS idle_seconds
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist p
ON trx.trx_mysql_thread_id = p.id
WHERE p.command = 'Sleep'
ORDER BY trx.trx_started ASC;
This is the "zombie transaction" detector. Any row here is a connection with an open transaction that isn't executing a query. These are the ones that silently hold locks while doing nothing.
6. InnoDB lock metrics summary
SELECT
(SELECT COUNT(*) FROM information_schema.innodb_trx) AS active_transactions,
(SELECT COUNT(*) FROM performance_schema.data_lock_waits) AS lock_waits,
(SELECT COUNT(*) FROM performance_schema.data_locks WHERE LOCK_STATUS = 'GRANTED') AS locks_held,
(SELECT COUNT(*) FROM performance_schema.data_locks WHERE LOCK_STATUS = 'WAITING') AS locks_waiting,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_row_lock_waits') AS total_row_lock_waits,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_row_lock_time_avg') AS avg_row_lock_wait_ms,
  (SELECT `COUNT` FROM information_schema.INNODB_METRICS WHERE NAME = 'lock_deadlocks') AS total_deadlocks;
(Stock MySQL 8.0 has no `Innodb_deadlocks` status variable — that's MariaDB/Percona. The cumulative deadlock counter lives in `information_schema.INNODB_METRICS` under the `lock_deadlocks` metric.)
MySQL Parameters That Control Lock Behavior
Most lock incidents we've hit were made worse — or outright caused — by MySQL parameters left at their defaults. Here are the ones that matter, what they do, and what we set them to.
-- Check all lock-related parameters on your instance
SHOW VARIABLES WHERE Variable_name IN (
'innodb_lock_wait_timeout',
'lock_wait_timeout',
'innodb_deadlock_detect',
'innodb_print_all_deadlocks',
'innodb_status_output_locks',
'wait_timeout',
'interactive_timeout',
'innodb_autoinc_lock_mode'
);
| Parameter | MySQL Default | Recommended | Why |
|---|---|---|---|
| `innodb_lock_wait_timeout` | 50 | 10–30 | How long a transaction waits for a row lock before erroring out. 50 seconds is too generous — if you're waiting 30 seconds for a row lock, the transaction should fail and retry, not queue behind a zombie |
| `lock_wait_timeout` | 31536000 (1 year) | 300 | How long a statement waits for a metadata lock. The default is effectively "wait forever." A `DROP PARTITION` waiting a year for a shared MDL to release will stack every connection in your pool long before then |
| `wait_timeout` | 28800 (8 hours) | 300 | How long MySQL keeps an idle connection alive. An abandoned connection holding row locks can sit there for 8 hours at the default. Drop this to 5 minutes |
| `interactive_timeout` | 28800 (8 hours) | 300 | Same as `wait_timeout`, for connections flagged as interactive (mysql CLI sessions) |
| `innodb_deadlock_detect` | ON | ON | Real-time deadlock detection. Turning this off (sometimes done for "performance") means deadlocks are only resolved by `innodb_lock_wait_timeout` expiring — much slower |
| `innodb_print_all_deadlocks` | OFF | ON | Log every deadlock to the error log. Without this, only the most recent deadlock is visible via `SHOW ENGINE INNODB STATUS` |
| `innodb_status_output_locks` | OFF | ON (during incidents) | Include lock details in `SHOW ENGINE INNODB STATUS` output. Verbose, but invaluable during active lock investigations |
| `innodb_autoinc_lock_mode` | 2 (MySQL 8.0) | 2 | Interleaved mode — least contention for auto-increment. Only change it if you need consecutive values across bulk inserts (you probably don't) |
The most dangerous defaults are lock_wait_timeout (1 year) and wait_timeout (8 hours). If you take nothing else from this post, check those two on your production instances right now.
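Applying the recommended values on a self-managed instance looks like this (a sketch; on RDS these belong in the DB parameter group instead, and note that `SET GLOBAL` only affects new connections, not existing ones):

```sql
-- Row lock waits: fail fast and let the application retry
SET GLOBAL innodb_lock_wait_timeout = 30;
-- Metadata lock waits: never let DDL queue behind DML for a year
SET GLOBAL lock_wait_timeout = 300;
-- Idle connections: reap abandoned clients after 5 minutes
SET GLOBAL wait_timeout = 300;
SET GLOBAL interactive_timeout = 300;
-- Log every deadlock to the error log, not just the latest
SET GLOBAL innodb_print_all_deadlocks = ON;
```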
A Quick Guide to Isolation Levels and Their Lock Behavior
The locks InnoDB takes are directly tied to the isolation level. Understanding this mapping is essential for diagnosing unexpected contention.
READ UNCOMMITTED
- No locks on consistent reads
- Writes take record locks (no gap locks)
- You almost never want this — it allows dirty reads
READ COMMITTED
- Consistent reads use a fresh snapshot per statement — each statement sees the latest committed data as of its start time, not the transaction start time
- No gap locks — this is the key difference from `REPEATABLE READ`. The docs state: "Gap locking is only used for foreign-key constraint checking and duplicate-key checking" at this level (Transaction Isolation Levels)
- Record locks on matched rows only, released for non-matching rows after evaluation
- Phantom reads are possible but gap lock deadlocks are eliminated
- This is the level we switch to for bulk operations
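Switching is a single statement and scoped to the session (or even one transaction), so it never affects other connections:

```sql
-- All subsequent transactions in this session
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Or scope it to only the next transaction:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Verify the session's current level
SELECT @@transaction_isolation;
```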
REPEATABLE READ (InnoDB Default)
- Consistent reads see a snapshot from the first read in the transaction
- Next-key locks on index scans — record lock + gap lock
- Gap locks prevent phantom inserts in scanned ranges
- This is where most unexpected locking contention occurs
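Next-key locking is easy to demonstrate. A two-session sketch against a hypothetical `orders` table with a non-unique index on `status`:

```sql
-- Session A: takes next-key locks on matching records
-- AND the gaps around them
START TRANSACTION;
SELECT * FROM orders WHERE status = 'pending' FOR UPDATE;

-- Session B: blocks even though it inserts a brand-new row,
-- because the row falls into a gap held by session A
INSERT INTO orders (id, status) VALUES (9999, 'pending');
-- ...hangs, then ERROR 1205 once innodb_lock_wait_timeout expires

-- Session C: watch it happen while B is waiting
SELECT LOCK_MODE, LOCK_DATA
FROM performance_schema.data_locks
WHERE OBJECT_NAME = 'orders';
```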
SERIALIZABLE
- All consistent reads are implicitly converted to `SELECT ... FOR SHARE`
- Every read takes shared next-key locks
- Maximum correctness, maximum contention
- We've never used this in production
Side-by-side lock behavior
| Operation | READ COMMITTED | REPEATABLE READ |
|---|---|---|
| `SELECT` (plain) | No locks (MVCC) | No locks (MVCC) |
| `SELECT ... FOR UPDATE` | Record locks only | Next-key locks (record + gap) |
| `UPDATE WHERE unique_col = ?` | Record lock on matching row | Record lock only (unique index optimization — no gap lock) |
| `UPDATE WHERE non_unique_col = ?` | Record locks on matching rows | Next-key locks on matching records + next record to seal the gap |
| `UPDATE WHERE range` | Record locks on matching rows | Next-key locks on range |
| `INSERT` | Insert intention lock | Insert intention lock |
| `INSERT INTO ... SELECT` | No locks on source (consistent read) | Shared next-key locks on source rows |
| `DELETE WHERE range` | Record locks on matching rows | Next-key locks on range |
The INSERT INTO ... SELECT row is the one that bit us. Under REPEATABLE READ, the SELECT side takes shared next-key locks on every row it reads. Under READ COMMITTED, it doesn't. Switching isolation level for that one session was a one-line fix that eliminated the gap lock deadlocks entirely.
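The fix in context, with hypothetical table names (note that `READ COMMITTED` requires row-based binary logging, which is the MySQL 8.0 default):

```sql
-- The SELECT side is now a plain consistent read:
-- no shared next-key locks on the source table
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
INSERT INTO orders_archive
SELECT * FROM orders WHERE created_at < '2023-07-01';
```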
Preventing Lock Contention: What We Do Now
After three incidents with three different lock types, we've adopted a set of practices that have kept us out of trouble.
Keep transactions short. The longer a transaction is open, the longer it holds locks. Every lock held is a potential blocker. We set innodb_lock_wait_timeout to 30 (the MySQL default is 50). If your transaction is waiting 30 seconds for a row lock, something is structurally wrong — fail fast and let the application retry.
Use READ COMMITTED for bulk operations. If you're running INSERT INTO ... SELECT, CREATE TABLE ... AS SELECT, or any bulk read-then-write pattern, switch the session to READ COMMITTED. You don't need phantom protection for a batch job.
Use pt-online-schema-change or gh-ost for large ALTERs. These tools perform schema changes by creating a shadow table, copying data in small batches, and swapping at the end — avoiding long-held metadata locks entirely. For partition operations, consider using EXCHANGE PARTITION which is metadata-only and near-instant.
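Archiving via `EXCHANGE PARTITION`, sketched with hypothetical names. The swap itself is metadata-only, but it still briefly takes an exclusive MDL, so a sane `lock_wait_timeout` still matters:

```sql
-- Target table must match the partitioned table's structure exactly
CREATE TABLE events_archive LIKE events;
ALTER TABLE events_archive REMOVE PARTITIONING;

-- Metadata-only swap: partition p202301's rows move into events_archive
ALTER TABLE events EXCHANGE PARTITION p202301 WITH TABLE events_archive;

-- The now-empty partition can be dropped cheaply
ALTER TABLE events DROP PARTITION p202301;
```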
Ensure graceful shutdown of application instances. When an application process is killed without closing its database connections (SIGKILL, OOM, container eviction), MySQL doesn't detect the dead client immediately. Locks are held until wait_timeout expires or TCP keepalive detects the dead peer — whichever comes first. Use SIGTERM with a grace period, and add pre-stop hooks that drain connections before the process exits.
Tune dead connection detection. Drop wait_timeout from the default 28800 (8 hours) to 300 — on RDS, this is your primary lever since server-side TCP keepalive isn't user-configurable. On self-managed MySQL, also tune the server's TCP keepalive (tcp_keepalive_time=60, tcp_keepalive_intvl=10, tcp_keepalive_probes=6) so the server detects dead clients in ~120 seconds. If you use a connection pool (HikariCP, etc.), configure its own idle connection eviction to detect and close stale connections faster than MySQL does.
Monitor zombie transactions. Alert on any transaction that has been open for more than N minutes while the connection is idle. This is the single most impactful alert we've added:
-- Alert query: transactions open > 5 min with sleeping connections
SELECT COUNT(*) AS zombie_transactions
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist p
ON trx.trx_mysql_thread_id = p.id
WHERE p.command = 'Sleep'
AND TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) > 300;
Index your write predicates. If your UPDATE or DELETE hits a full table scan, InnoDB locks every row it examines — not just the rows that match the WHERE clause. Under REPEATABLE READ, it also places gap locks across the entire index. A missing index on a write path doesn't just make the query slow — it makes every other concurrent write slow too.
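A quick check for this, using hypothetical names: if `EXPLAIN` on the write shows a full scan, every examined row will be locked.

```sql
-- type: ALL with no key means a full table scan; under REPEATABLE READ
-- that locks every row (and gap) the scan touches
EXPLAIN UPDATE orders SET status = 'expired'
WHERE customer_email = 'a@example.com';

-- Add the missing index so the UPDATE locks only matching rows
ALTER TABLE orders ADD INDEX idx_customer_email (customer_email);
```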
Wrapping Up
InnoDB uses a hybrid of MVCC and pessimistic locking. Reads don't lock (consistent snapshots). Writes lock. The specific lock type depends on the isolation level and the index structure.
- Gap locks are the most surprising lock type. They lock ranges between index records, not the records themselves. They exist to prevent phantom reads under `REPEATABLE READ`, and they're the root cause of most unexpected deadlocks.
- Metadata locks are MySQL-level, not InnoDB-level. Any DDL queues behind active DML and blocks all subsequent DML. On high-QPS tables, even fast DDL can cascade into connection pool exhaustion.
- Sleeping connections with open transactions are silent killers. A connection in `Sleep` state with an uncommitted transaction still holds all its row locks. Monitor `information_schema.innodb_trx` crossed with the processlist.
- `INSERT INTO ... SELECT` takes shared next-key locks on the source table under `REPEATABLE READ`. Switch to `READ COMMITTED` for bulk operations to eliminate gap locking on the read side.
- Monitor lock contention proactively. `performance_schema.data_locks`, `data_lock_waits`, and `innodb_trx` are your friends. On RDS, Performance Insights broken down by `db.wait_event` is the fastest way to see what your connections are actually waiting on.
Every lock incident we've had traced back to one of three things: a DDL operation that held an exclusive metadata lock longer than expected, application instances that died without releasing their connections, or an isolation level that was too strict for the workload. The queries in this post are the ones we reach for first. performance_schema.data_locks and data_lock_waits tell you exactly which rows are locked and who's waiting. information_schema.innodb_trx crossed with the processlist catches zombie transactions from dead application instances. And performance_schema.metadata_locks tells you who's holding the MDL that's stacking your connection pool. The data is always there — you just have to know where to look.