<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sriram Rajendran</title>
    <description>The latest articles on DEV Community by Sriram Rajendran (@sriramrajendran).</description>
    <link>https://dev.to/sriramrajendran</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F493246%2F53e33b7d-c73f-4236-9aec-d210a847c367.jpeg</url>
      <title>DEV Community: Sriram Rajendran</title>
      <link>https://dev.to/sriramrajendran</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sriramrajendran"/>
    <language>en</language>
    <item>
      <title>MySQL InnoDB Locking: The Silent Killer Behind Database Crashes</title>
      <dc:creator>Sriram Rajendran</dc:creator>
      <pubDate>Thu, 02 Apr 2026 07:50:40 +0000</pubDate>
      <link>https://dev.to/sriramrajendran/mysql-innodb-locking-the-silent-killer-behind-database-crashes-1ak9</link>
      <guid>https://dev.to/sriramrajendran/mysql-innodb-locking-the-silent-killer-behind-database-crashes-1ak9</guid>
      <description>&lt;p&gt;We run a fleet of MySQL 8.0 RDS instances — multi-TB databases on 32-vCPU / 128 GB machines doing thousands of write IOPS at peak across 1,000+ concurrent connections. Three of them have been brought to their knees by locking over the past year. Not slow queries. Not CPU saturation. Not disk I/O. Locks.&lt;/p&gt;

&lt;p&gt;Here's what actually happens: a single metadata lock from a partition drop cascades into 1,600 queued connections in 90 seconds, exhausting your connection pool and crashing every microservice that writes to that table. A handful of abandoned application connections holding row locks slowly strangle write throughput until the database is effectively unresponsive. A bulk &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; under &lt;code&gt;REPEATABLE READ&lt;/code&gt; gap-locks a range of an index and deadlocks every concurrent insert attempting to write into that range. In each case, the database didn't run out of CPU or memory — it ran out of the ability to make progress. Transactions pile up, connection pools saturate, application threads block, health checks fail, and the cascading failure takes out services that don't even touch the locked table.&lt;/p&gt;

&lt;p&gt;This post is the full picture — what InnoDB's locks actually are, why they exist, how they interact, and the specific queries we use to find them before they page us.&lt;/p&gt;

&lt;p&gt;But first — if you've never thought about why databases lock at all, let's build up from first principles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Databases Lock: The Problem of Concurrent Access
&lt;/h2&gt;

&lt;p&gt;Every multi-user database has the same fundamental problem: multiple transactions reading and writing the same data simultaneously. Without some coordination mechanism, you get anomalies — dirty reads, lost updates, phantom rows. The SQL standard defines four isolation levels (&lt;code&gt;READ UNCOMMITTED&lt;/code&gt;, &lt;code&gt;READ COMMITTED&lt;/code&gt;, &lt;code&gt;REPEATABLE READ&lt;/code&gt;, &lt;code&gt;SERIALIZABLE&lt;/code&gt;) that describe which anomalies you're willing to tolerate.&lt;/p&gt;

&lt;p&gt;The question is &lt;em&gt;how&lt;/em&gt; the database enforces the isolation level you chose. There are two broad strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pessimistic concurrency (locking).&lt;/strong&gt; Before accessing a row, acquire a lock on it. If someone else already holds a conflicting lock, wait. This guarantees correctness by preventing concurrent access entirely. The downside: contention. Transactions queue behind each other, and throughput drops under load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimistic concurrency (MVCC).&lt;/strong&gt; Don't lock for reads. Instead, maintain multiple versions of each row. Readers see a consistent snapshot from the start of their transaction; writers create new versions. Conflicts are detected at commit time. The downside: maintaining multiple versions costs memory and storage, and write-write conflicts still need resolution.&lt;/p&gt;

&lt;p&gt;Here's the thing — most production databases use &lt;strong&gt;both&lt;/strong&gt;. InnoDB uses MVCC for reads (non-locking consistent reads) and pessimistic locking for writes. When you run a &lt;code&gt;SELECT&lt;/code&gt;, InnoDB reads from a snapshot without acquiring any row locks. When you run an &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt;, InnoDB acquires exclusive locks on the affected rows. This hybrid is why MySQL can handle thousands of concurrent readers without contention, but writers can still block each other.&lt;/p&gt;
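
&lt;p&gt;A minimal two-session sketch makes the hybrid concrete (row and table are illustrative): the plain &lt;code&gt;SELECT&lt;/code&gt; reads from its snapshot and never waits, while the locking read queues behind the writer's X lock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: UPDATE orders SET status = 'shipped' WHERE id = 42;  -- X lock on row 42, not yet committed
Tx B: SELECT * FROM orders WHERE id = 42;                  -- snapshot read: returns the old version, no wait
Tx B: SELECT * FROM orders WHERE id = 42 FOR UPDATE;       -- locking read: blocks until Tx A commits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;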

&lt;h3&gt;
  
  
  Where databases fall on the spectrum
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Default Isolation&lt;/th&gt;
&lt;th&gt;Read Strategy&lt;/th&gt;
&lt;th&gt;Write Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MySQL/InnoDB&lt;/td&gt;
&lt;td&gt;REPEATABLE READ&lt;/td&gt;
&lt;td&gt;MVCC (snapshot)&lt;/td&gt;
&lt;td&gt;Row-level locking (in-place update, old version to undo log)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;READ COMMITTED&lt;/td&gt;
&lt;td&gt;MVCC (snapshot)&lt;/td&gt;
&lt;td&gt;Row-level locking (new tuple version per update, old tuple marked dead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Optimistic (conditional writes)&lt;/td&gt;
&lt;td&gt;Optimistic (conditional writes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CockroachDB&lt;/td&gt;
&lt;td&gt;SERIALIZABLE&lt;/td&gt;
&lt;td&gt;MVCC&lt;/td&gt;
&lt;td&gt;Pessimistic + optimistic hybrid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both MySQL and PostgreSQL use row-level locking for writes: if two transactions try to update the same row, the second one blocks until the first commits or rolls back. They share the same hybrid model — MVCC for reads, pessimistic locks for writes — but differ in storage mechanics, gap locking behavior, and conflict resolution.&lt;/p&gt;

&lt;p&gt;One operational difference worth calling out: under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, when two transactions try to update the same row, both databases block the second writer until the first commits. But what happens &lt;em&gt;after&lt;/em&gt; the commit differs. InnoDB lets the second writer &lt;strong&gt;proceed&lt;/strong&gt; — it reads the latest committed version and applies its update on top. PostgreSQL takes the opposite approach: it &lt;strong&gt;aborts&lt;/strong&gt; the second writer with &lt;code&gt;ERROR: could not serialize access due to concurrent update&lt;/code&gt;, forcing the application to retry. This "first-updater-wins" policy means PostgreSQL detects write-write conflicts and forces explicit retry logic, while InnoDB silently applies the update to the newest committed version. Neither has a classic "lost update" — but InnoDB's behavior means the second transaction's update is based on a version it never read in its snapshot, which can lead to subtle anomalies if the application logic depends on snapshot consistency across reads and writes.&lt;/p&gt;
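
&lt;p&gt;The divergence in a two-session timeline (same statements on both systems, &lt;code&gt;REPEATABLE READ&lt;/code&gt; on each; table name illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: UPDATE accounts SET balance = balance - 10 WHERE id = 1;
Tx B: UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- blocks behind Tx A on both systems
Tx A: COMMIT;
      -- InnoDB:     Tx B proceeds, applying its update on top of A's committed version
      -- PostgreSQL: Tx B aborts: ERROR: could not serialize access due to concurrent update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;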

&lt;p&gt;InnoDB's locking is more aggressive than PostgreSQL's in one critical way: &lt;strong&gt;gap locking&lt;/strong&gt;. Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, InnoDB locks not just the rows that match your query, but the &lt;em&gt;gaps between&lt;/em&gt; those rows to prevent phantom reads. PostgreSQL also prevents phantoms at &lt;code&gt;REPEATABLE READ&lt;/code&gt;, but through a completely different mechanism — pure snapshot isolation. Since each transaction reads from a fixed snapshot, newly inserted rows by other transactions are simply invisible, so phantoms can't occur without any locking. The key difference isn't &lt;em&gt;whether&lt;/em&gt; phantoms are prevented (both do), but &lt;em&gt;how&lt;/em&gt;: InnoDB prevents them pessimistically by locking gaps in the index, while PostgreSQL prevents them passively through snapshot visibility rules. InnoDB's approach means write transactions can block each other even when they're operating on non-overlapping rows, simply because their index ranges overlap. PostgreSQL avoids this contention entirely at &lt;code&gt;REPEATABLE READ&lt;/code&gt;, only adding conflict detection at &lt;code&gt;SERIALIZABLE&lt;/code&gt; via SSI with non-blocking predicate locks.&lt;/p&gt;




&lt;h2&gt;
  
  
  InnoDB's Lock Types: The Full Taxonomy
&lt;/h2&gt;

&lt;p&gt;InnoDB has more lock types than most engineers realize. Understanding the hierarchy matters because lock conflicts aren't always row-vs-row — they can be gap-vs-insert, metadata-vs-DML, or intention-vs-table. Here's the complete picture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Row-Level Locks (Record Locks)
&lt;/h3&gt;

&lt;p&gt;The most intuitive lock type. A record lock locks a single index record. When you execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;InnoDB acquires an exclusive (X) lock on the index record for &lt;code&gt;id = 42&lt;/code&gt;. Any other transaction attempting to &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; that same row will block until the first transaction commits or rolls back.&lt;/p&gt;

&lt;p&gt;Two flavors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shared (S) lock&lt;/strong&gt;: Acquired by &lt;code&gt;SELECT ... FOR SHARE&lt;/code&gt; (or &lt;code&gt;LOCK IN SHARE MODE&lt;/code&gt;). Multiple transactions can hold S locks on the same row simultaneously. Blocks X locks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclusive (X) lock&lt;/strong&gt;: Acquired by &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;. Only one transaction can hold an X lock. Blocks both S and X locks.&lt;/li&gt;
&lt;/ul&gt;
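
&lt;p&gt;A quick illustration of the compatibility rules (row illustrative): two shared locks co-exist, but a writer queues behind either of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: SELECT * FROM orders WHERE id = 42 FOR SHARE;        -- S lock on row 42
Tx B: SELECT * FROM orders WHERE id = 42 FOR SHARE;        -- S + S: both proceed
Tx C: UPDATE orders SET status = 'shipped' WHERE id = 42;  -- needs X: blocks until A and B finish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;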

&lt;p&gt;Record locks always operate on &lt;strong&gt;index records&lt;/strong&gt;, not table rows directly. The MySQL docs state: &lt;em&gt;"Record locks always lock index records, even if a table is defined with no indexes. For such cases, InnoDB creates a hidden clustered index and uses this index for record locking."&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html" rel="noopener noreferrer"&gt;InnoDB Locking&lt;/a&gt;). This is why index design directly affects lock contention — a full table scan under an &lt;code&gt;UPDATE&lt;/code&gt; means locking every index record scanned, not just the rows that match the &lt;code&gt;WHERE&lt;/code&gt; clause: &lt;em&gt;"If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked"&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locks-set.html" rel="noopener noreferrer"&gt;Locks Set by Different SQL Statements&lt;/a&gt;).&lt;/p&gt;
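
&lt;p&gt;This is easy to demonstrate: the same logical &lt;code&gt;UPDATE&lt;/code&gt; locks wildly different amounts depending on whether the &lt;code&gt;WHERE&lt;/code&gt; column is indexed (column and index names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- status NOT indexed: InnoDB scans the clustered index and
-- locks every row it examines, matching or not
UPDATE orders SET priority = 1 WHERE status = 'pending';

-- after: CREATE INDEX idx_status ON orders (status);
-- only the index records for 'pending' (plus adjacent gaps
-- under REPEATABLE READ) are locked
UPDATE orders SET priority = 1 WHERE status = 'pending';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;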

&lt;h3&gt;
  
  
  Gap Locks
&lt;/h3&gt;

&lt;p&gt;This is where InnoDB gets interesting — and where most locking surprises come from.&lt;/p&gt;

&lt;p&gt;A gap lock locks the &lt;em&gt;gap between&lt;/em&gt; index records, preventing other transactions from inserting into that gap. It doesn't lock the records themselves.&lt;/p&gt;

&lt;p&gt;Consider a table with an indexed column &lt;code&gt;age&lt;/code&gt; containing values &lt;code&gt;[10, 20, 30]&lt;/code&gt;. The gaps are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(-∞, 10)  (10, 20)  (20, 30)  (30, +∞)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If transaction A runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;InnoDB locks the gap &lt;code&gt;(10, 20)&lt;/code&gt; and the gap &lt;code&gt;(20, 30)&lt;/code&gt;, plus the record at &lt;code&gt;20&lt;/code&gt;. Because the scan must read one index record past the range to know where it ends, the locking typically covers the record at &lt;code&gt;30&lt;/code&gt; as well. This prevents any other transaction from inserting &lt;code&gt;age = 12&lt;/code&gt;, &lt;code&gt;age = 17&lt;/code&gt;, &lt;code&gt;age = 22&lt;/code&gt;, or &lt;code&gt;age = 28&lt;/code&gt; — even though none of those rows exist yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why gap locks exist:&lt;/strong&gt; To prevent &lt;strong&gt;phantom reads&lt;/strong&gt; under &lt;code&gt;REPEATABLE READ&lt;/code&gt;. Without gap locks, transaction A could run the same &lt;code&gt;SELECT&lt;/code&gt; twice and get different results because transaction B inserted a new row in the range between the two reads. Gap locks guarantee that if you read a range, no one can insert into that range until you commit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; Gap locks are &lt;strong&gt;purely inhibitive&lt;/strong&gt;. The MySQL docs are explicit: &lt;em&gt;"Gap locks in InnoDB are 'purely inhibitive', which means that their only purpose is to prevent other transactions from inserting to the gap. Gap locks can co-exist."&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html" rel="noopener noreferrer"&gt;InnoDB Locking&lt;/a&gt;). Two transactions can hold gap locks on the same gap simultaneously. This seems harmless until you realize that each is blocking the other's &lt;code&gt;INSERT&lt;/code&gt;, creating a deadlock.&lt;/p&gt;

&lt;p&gt;The key is that this happens when &lt;strong&gt;no rows match the query&lt;/strong&gt; — only gaps are locked, with no record locks to cause the second &lt;code&gt;SELECT&lt;/code&gt; to block. Consider a table where &lt;code&gt;age&lt;/code&gt; contains only &lt;code&gt;[10, 30]&lt;/code&gt; (no row with &lt;code&gt;age = 20&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: SELECT ... WHERE age BETWEEN 15 AND 25 FOR UPDATE;  -- no matching rows; gap lock on (10, 30)
Tx B: SELECT ... WHERE age BETWEEN 15 AND 25 FOR UPDATE;  -- also gap lock on (10, 30) — no conflict!
Tx A: INSERT INTO users (age) VALUES (18);                 -- blocked by Tx B's gap lock
Tx B: INSERT INTO users (age) VALUES (22);                 -- blocked by Tx A's gap lock → DEADLOCK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If rows &lt;em&gt;did&lt;/em&gt; exist in the range, the &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; would acquire next-key locks (record + gap), and the X record locks would cause Tx B to block on the &lt;code&gt;SELECT&lt;/code&gt; itself — no deadlock, just contention. It's the empty-range case that's dangerous: both transactions sail through the &lt;code&gt;SELECT&lt;/code&gt;, acquire only gap locks, and then deadlock on the &lt;code&gt;INSERT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We've hit this pattern in production. More on that later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next-Key Locks
&lt;/h3&gt;

&lt;p&gt;A next-key lock is a &lt;strong&gt;combination&lt;/strong&gt; of a record lock and a gap lock on the gap before that record. It's InnoDB's default locking strategy for index scans under &lt;code&gt;REPEATABLE READ&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For our &lt;code&gt;age&lt;/code&gt; index &lt;code&gt;[10, 20, 30]&lt;/code&gt;, the next-key locks are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(-∞, 10]  (10, 20]  (20, 30]  (30, +∞)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the notation: the record itself is included (closed bracket on the right). A next-key lock on &lt;code&gt;(10, 20]&lt;/code&gt; locks the gap &lt;code&gt;(10, 20)&lt;/code&gt; AND the record &lt;code&gt;20&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When you run a range scan like &lt;code&gt;WHERE age &amp;gt; 15 AND age &amp;lt; 25&lt;/code&gt;, InnoDB places next-key locks on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(10, 20]&lt;/code&gt; — gap before 20, plus record 20&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(20, 30]&lt;/code&gt; — gap before 30, plus record 30&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is more than the rows that match your predicate. The extra locking is the cost of preventing phantoms via pessimistic locking rather than snapshot visibility.&lt;/p&gt;
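
&lt;p&gt;You can watch these locks land from a third session. In MySQL 8.0, &lt;code&gt;performance_schema.data_locks&lt;/code&gt; shows one row per held or requested lock: next-key locks appear with &lt;code&gt;LOCK_MODE = 'X'&lt;/code&gt;, pure gap locks as &lt;code&gt;'X,GAP'&lt;/code&gt;, and record-only locks as &lt;code&gt;'X,REC_NOT_GAP'&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT object_name, index_name, lock_type, lock_mode, lock_status, lock_data
FROM performance_schema.data_locks
WHERE object_name = 'users';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;LOCK_DATA&lt;/code&gt; shows the index value the lock sits on, which is how you confirm which gaps a range scan actually took.&lt;/p&gt;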

&lt;h3&gt;
  
  
  Intention Locks
&lt;/h3&gt;

&lt;p&gt;Intention locks are &lt;strong&gt;table-level&lt;/strong&gt; locks that signal what kind of row-level locks a transaction intends to acquire. They exist purely for efficiency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intention Shared (IS)&lt;/strong&gt;: "I'm going to acquire shared row locks in this table."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intention Exclusive (IX)&lt;/strong&gt;: "I'm going to acquire exclusive row locks in this table."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you run &lt;code&gt;UPDATE orders SET status = 'shipped' WHERE id = 42&lt;/code&gt;, InnoDB first acquires an IX lock on the &lt;code&gt;orders&lt;/code&gt; table, then an X lock on the row. The IX lock is cheap (no contention between IX and IX) and serves one purpose: if another transaction wants a &lt;strong&gt;full table lock&lt;/strong&gt; (&lt;code&gt;LOCK TABLES orders WRITE&lt;/code&gt;), it can check for intention locks instead of scanning every row's lock state.&lt;/p&gt;

&lt;p&gt;Intention locks never block each other. IX + IX is fine. IS + IX is fine. They only conflict with full table locks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;IS&lt;/th&gt;
&lt;th&gt;IX&lt;/th&gt;
&lt;th&gt;S&lt;/th&gt;
&lt;th&gt;X&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, you almost never think about intention locks unless you're debugging a case where &lt;code&gt;LOCK TABLES&lt;/code&gt; or DDL is blocking behind thousands of row-level transactions.&lt;/p&gt;
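
&lt;p&gt;When you do need to see them, the same &lt;code&gt;performance_schema.data_locks&lt;/code&gt; view exposes intention locks as table-level rows (&lt;code&gt;LOCK_MODE&lt;/code&gt; of &lt;code&gt;IX&lt;/code&gt; or &lt;code&gt;IS&lt;/code&gt;) alongside the row locks they announce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT object_name, lock_type, lock_mode, lock_status
FROM performance_schema.data_locks
WHERE lock_type = 'TABLE';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;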

&lt;h3&gt;
  
  
  Insert Intention Locks
&lt;/h3&gt;

&lt;p&gt;A special type of gap lock acquired before an &lt;code&gt;INSERT&lt;/code&gt;. It signals: "I intend to insert into this gap, but I don't need to lock the entire gap — just my specific insertion point."&lt;/p&gt;

&lt;p&gt;Two transactions inserting into the same gap at &lt;strong&gt;different positions&lt;/strong&gt; won't block each other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Gap (10, 20) exists&lt;/span&gt;
&lt;span class="n"&gt;Tx&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- insert intention lock at 12&lt;/span&gt;
&lt;span class="n"&gt;Tx&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- insert intention lock at 17 — no conflict!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But an insert intention lock DOES conflict with a gap lock on the same gap. This is the mechanism that makes gap locks effective — they block insert intentions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-Increment Locks
&lt;/h3&gt;

&lt;p&gt;When a table has an &lt;code&gt;AUTO_INCREMENT&lt;/code&gt; column, InnoDB needs to serialize the generation of new values. Historically, this was a table-level lock held for the duration of an &lt;code&gt;INSERT&lt;/code&gt; statement (the "traditional" mode). Modern InnoDB has three modes controlled by &lt;code&gt;innodb_autoinc_lock_mode&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Table-level AUTO-INC lock held for entire statement&lt;/td&gt;
&lt;td&gt;Legacy, maximally safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consecutive&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Lightweight mutex for simple inserts, table lock for bulk inserts&lt;/td&gt;
&lt;td&gt;Default through MySQL 5.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interleaved&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lightweight mutex for all inserts, values may have gaps&lt;/td&gt;
&lt;td&gt;Default in MySQL 8.0, safe only with row-based replication&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In mode 2 (the MySQL 8.0 default), all &lt;code&gt;INSERT&lt;/code&gt; statements use a lightweight mutex that's released as soon as the value is generated — not when the statement completes. This means auto-increment is rarely a bottleneck. The caveat applies to mode 1: there, bulk operations like &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; or &lt;code&gt;LOAD DATA&lt;/code&gt; still hold the table-level AUTO-INC lock for the entire statement duration, so concurrent bulk inserts can contend.&lt;/p&gt;
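
&lt;p&gt;The mode is a server variable that is not dynamic — changing it requires a restart (on RDS, a parameter group change plus reboot) — so it's worth confirming what a given instance is actually running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT @@innodb_autoinc_lock_mode;  -- 2 = interleaved (the 8.0 default)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;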

&lt;h3&gt;
  
  
  Metadata Locks (MDL)
&lt;/h3&gt;

&lt;p&gt;This is the lock type that catches most people off guard because it's &lt;strong&gt;not an InnoDB lock&lt;/strong&gt; — it's a MySQL server-level lock that sits above the storage engine.&lt;/p&gt;

&lt;p&gt;Metadata locks protect the schema definition of a table. Any DML statement (&lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;) acquires a &lt;strong&gt;shared MDL&lt;/strong&gt; on the table. DDL statements (&lt;code&gt;ALTER TABLE&lt;/code&gt;, &lt;code&gt;DROP TABLE&lt;/code&gt;, &lt;code&gt;DROP PARTITION&lt;/code&gt;) require an &lt;strong&gt;exclusive MDL&lt;/strong&gt;. Critically, once an exclusive MDL request is pending, all new shared MDL requests queue behind it — a waiting DDL blocks every subsequent DML from acquiring the lock (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/metadata-locking.html" rel="noopener noreferrer"&gt;Metadata Locking&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The rule is simple: DDL waits for all active DML to finish, and blocks all new DML while waiting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timeline:
  T0: Tx A starts SELECT on orders (acquires shared MDL)
  T1: DBA runs ALTER TABLE orders ... (needs exclusive MDL → blocked by Tx A)
  T2: Tx B starts INSERT INTO orders (needs shared MDL → blocked by ALTER)
  T3: Tx C starts SELECT on orders (needs shared MDL → blocked by ALTER)
  ...
  T?: Tx A finishes → ALTER acquires exclusive MDL → all DML queued behind it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;strong&gt;MDL queue stacking&lt;/strong&gt; problem. A single &lt;code&gt;ALTER TABLE&lt;/code&gt; doesn't just wait for the active transaction — it blocks every subsequent transaction too. On a table sustaining 500 writes/second, a 90-second stall creates a pileup of thousands of queued connections, exhausting your connection pool entirely. The ALTER itself might be sub-second, but the queue cascades into connection pool exhaustion, health check failures, and service-wide outages that have nothing to do with the table being altered.&lt;/p&gt;
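
&lt;p&gt;One mitigation worth knowing (a sketch, not a cure): bound how long the DDL is allowed to sit in the MDL queue. The session variable &lt;code&gt;lock_wait_timeout&lt;/code&gt; governs metadata lock waits, so a pending &lt;code&gt;ALTER&lt;/code&gt; that can't acquire its exclusive MDL quickly fails with a lock wait timeout error instead of damming up DML behind it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET SESSION lock_wait_timeout = 5;  -- give up after 5 seconds in the MDL queue
ALTER TABLE events DROP PARTITION p20250101;
-- on timeout, the ALTER fails; retry later rather than stalling all DML
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;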




&lt;h2&gt;
  
  
  Production Incident #1: Metadata Locks from Partition Drops
&lt;/h2&gt;

&lt;p&gt;We run time-partitioned tables for high-volume event data. Each partition holds one day's worth of data, and a cron job drops partitions older than 90 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="n"&gt;p20250101&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an &lt;code&gt;ALTER TABLE&lt;/code&gt;. It needs an exclusive metadata lock.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The partition drop job was already scheduled during a low-traffic window. There was no single long-running query blocking it. The problem was simpler and more insidious: the &lt;code&gt;DROP PARTITION&lt;/code&gt; itself took time to execute — and for the entire duration it held an exclusive metadata lock, every new DML statement on the table queued behind it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE ... DROP PARTITION&lt;/code&gt; isn't always instant. On a large table with many partitions, MySQL needs to update the table's partition metadata, and if the partition being dropped has significant data, the storage engine needs to reclaim the tablespace. During this window — which stretched to 30+ seconds in our case — the exclusive MDL blocked all new &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and even &lt;code&gt;SELECT&lt;/code&gt; statements on the table. The &lt;code&gt;Waiting for table metadata lock&lt;/code&gt; state spread across the processlist like a wave.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we saw in monitoring
&lt;/h3&gt;

&lt;p&gt;The global MDL wait metric spiked sharply. We queried the process list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Waiting for table metadata lock'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output showed hundreds of DML statements queued behind the &lt;code&gt;ALTER TABLE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
| id    | user | host      | db     | command | time | state                            | info                        |
+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
| 15901 | admin| 10.0.2.x  | events | Query   |  34  | altering table                   | ALTER TABLE events DROP ... |
| 15902 | app  | 10.0.3.x  | events | Query   |  32  | Waiting for table metadata lock  | INSERT INTO events ...      |
| 15903 | app  | 10.0.3.x  | events | Query   |  31  | Waiting for table metadata lock  | INSERT INTO events ...      |
| 15904 | app  | 10.0.3.x  | events | Query   |  30  | Waiting for table metadata lock  | INSERT INTO events ...      |
| ...   | ...  | ...       | ...    | ...     | ...  | ...                              | ...                         |
+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ALTER TABLE&lt;/code&gt; was actively running (state: &lt;code&gt;altering table&lt;/code&gt;), not waiting for anything — it was the one holding the exclusive MDL while it completed. Every DML that arrived during those 30+ seconds piled up.&lt;/p&gt;
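
&lt;p&gt;MySQL 8.0's &lt;code&gt;sys&lt;/code&gt; schema packages the same diagnosis more directly: &lt;code&gt;sys.schema_table_lock_waits&lt;/code&gt; pairs each MDL waiter with the session blocking it, and even generates the &lt;code&gt;KILL&lt;/code&gt; statement for you (the underlying &lt;code&gt;wait/lock/metadata/sql/mdl&lt;/code&gt; instrument is enabled by default in 8.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT object_schema, object_name, waiting_query,
       blocking_pid, sql_kill_blocking_query
FROM sys.schema_table_lock_waits;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;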

&lt;h3&gt;
  
  
  The query to find MDL holders
&lt;/h3&gt;

&lt;p&gt;MySQL 8.0 exposes metadata locks through &lt;code&gt;performance_schema&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DURATION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OWNER_THREAD_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_USER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_TIME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_STATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_INFO&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata_locks&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OWNER_THREAD_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;THREAD_ID&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'your_database'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'events'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_TIME&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you exactly who holds the lock and who's waiting. The output distinguishes &lt;code&gt;GRANTED&lt;/code&gt; (holding the lock) from &lt;code&gt;PENDING&lt;/code&gt; (waiting for it).&lt;/p&gt;
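&lt;p&gt;When the output runs to hundreds of rows, grouping it by &lt;code&gt;LOCK_STATUS&lt;/code&gt; makes the picture obvious: one &lt;code&gt;GRANTED&lt;/code&gt; exclusive holder, everyone else &lt;code&gt;PENDING&lt;/code&gt;. A minimal sketch of that triage step (the row dicts mirror the query's columns; the function name and row shape are illustrative, not a real API):&lt;/p&gt;

```python
# Summarize performance_schema.metadata_locks rows into holders vs. waiters.
# Each row dict mirrors the columns of the query above (assumed shape).

def summarize_mdl(rows):
    holders = [r for r in rows if r["LOCK_STATUS"] == "GRANTED"]
    waiters = [r for r in rows if r["LOCK_STATUS"] == "PENDING"]
    return {
        "holders": [(r["PROCESSLIST_ID"], r["LOCK_TYPE"]) for r in holders],
        "pending_count": len(waiters),
        # The oldest waiter tells you how long the pile-up has been building.
        "max_wait_seconds": max((r["PROCESSLIST_TIME"] for r in waiters), default=0),
    }

rows = [
    {"LOCK_STATUS": "GRANTED", "LOCK_TYPE": "EXCLUSIVE", "PROCESSLIST_ID": 15877, "PROCESSLIST_TIME": 34},
    {"LOCK_STATUS": "PENDING", "LOCK_TYPE": "SHARED_WRITE", "PROCESSLIST_ID": 15902, "PROCESSLIST_TIME": 32},
    {"LOCK_STATUS": "PENDING", "LOCK_TYPE": "SHARED_WRITE", "PROCESSLIST_ID": 15903, "PROCESSLIST_TIME": 31},
]
print(summarize_mdl(rows))
```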

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Switched from &lt;code&gt;DROP PARTITION&lt;/code&gt; to &lt;code&gt;EXCHANGE PARTITION&lt;/code&gt; with an empty staging table, followed by dropping the exchanged table — because the incoming table is empty, the exchange is effectively a metadata-only rename that completes near-instantly, so the exclusive MDL is held for milliseconds, not seconds&lt;/li&gt;
&lt;li&gt;For cases where &lt;code&gt;DROP PARTITION&lt;/code&gt; is unavoidable, we batch the drops and add &lt;code&gt;lock_wait_timeout = 5&lt;/code&gt; to the session — if it can't acquire the MDL within 5 seconds, it backs off and retries on the next cycle&lt;/li&gt;
&lt;li&gt;Added an alert on the global &lt;code&gt;Waiting for table metadata lock&lt;/code&gt; count exceeding a threshold sustained for 10 seconds&lt;/li&gt;
&lt;/ol&gt;
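&lt;p&gt;Item 2 can be sketched in application code: a retry wrapper around the partition drop that fails fast on MDL contention and backs off. The &lt;code&gt;execute&lt;/code&gt; callable and &lt;code&gt;MdlTimeout&lt;/code&gt; error here are stand-ins, not any specific driver's API:&lt;/p&gt;

```python
import time

class MdlTimeout(Exception):
    """Stand-in for the driver error raised when lock_wait_timeout expires."""

def drop_partition_with_backoff(execute, table, partition, attempts=3, pause=2.0):
    # Fail fast instead of queueing behind other MDL waiters: if the DDL
    # can't acquire the lock within 5s, the server aborts it with an error.
    execute("SET SESSION lock_wait_timeout = 5")
    for attempt in range(1, attempts + 1):
        try:
            execute(f"ALTER TABLE {table} DROP PARTITION {partition}")
            return attempt
        except MdlTimeout:
            if attempt == attempts:
                raise
            time.sleep(pause)  # back off and let queued DML drain

# Fake execute for illustration: the DDL times out twice, then succeeds.
calls = []
state = {"alter_calls": 0}
def fake_execute(sql):
    calls.append(sql)
    if sql.startswith("ALTER"):
        state["alter_calls"] += 1
        if state["alter_calls"] in (1, 2):
            raise MdlTimeout()

attempts_used = drop_partition_with_backoff(fake_execute, "events", "p20250314", pause=0)
print(attempts_used)  # 3
```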




&lt;h2&gt;
  
  
  Production Incident #2: Row Locks from Incorrectly Terminated Application Instances
&lt;/h2&gt;

&lt;p&gt;This one was subtle. Write latency on the &lt;code&gt;orders&lt;/code&gt; table was slowly degrading over hours, not spiking suddenly. P99 crept from 15ms to 200ms to 2 seconds over the course of a day.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;An application instance was terminated incorrectly during a deployment — the process was killed (&lt;code&gt;SIGKILL&lt;/code&gt; / OOM) and the instance torn down without gracefully closing its database connections. The connections had open transactions with row-level exclusive locks acquired via &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here's the thing MySQL doesn't advertise: when the client dies along with its host (instance teardown, hard VM stop, a dropped network path), no &lt;code&gt;COM_QUIT&lt;/code&gt; is sent and no TCP FIN is ever transmitted. If only the process were killed while its kernel stayed up, the kernel would close the socket on its behalf and MySQL would notice; it's host-level death that leaves things dangling. The TCP connection is now "half-open" — MySQL has no idea the client is gone. The server-side connection thread sits in a blocking &lt;code&gt;read()&lt;/code&gt;, waiting for the next query that will never arrive. The transaction stays open. The locks stay held.&lt;/p&gt;

&lt;p&gt;MySQL detects dead connections through two mechanisms, whichever fires first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;wait_timeout&lt;/code&gt;&lt;/strong&gt; (application-layer): MySQL checks if the connection has been idle for longer than this value. Default: &lt;strong&gt;28800 seconds (8 hours)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TCP keepalive&lt;/strong&gt; (network-layer): The OS sends keepalive probes to detect the dead peer. With Linux defaults (&lt;code&gt;tcp_keepalive_time=7200&lt;/code&gt;, &lt;code&gt;tcp_keepalive_intvl=75&lt;/code&gt;, &lt;code&gt;tcp_keepalive_probes=9&lt;/code&gt;), this takes &lt;strong&gt;~7,875 seconds (~2.2 hours)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Until one of those fires, the dead connection's transaction remains open and &lt;strong&gt;all its locks are held&lt;/strong&gt;. InnoDB only rolls back the transaction when MySQL actually closes the server-side connection.&lt;/p&gt;
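&lt;p&gt;The keepalive number isn't magic: it falls straight out of the three sysctls, so it's worth being able to recompute it for your own settings. A quick check:&lt;/p&gt;

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Worst-case time for the OS to declare a silent peer dead:
    the idle threshold, then `probes` unanswered probes `intvl_s` apart."""
    return time_s + intvl_s * probes

# Linux defaults: roughly 2.2 hours before the kernel gives up on the peer.
print(keepalive_detection_seconds(7200, 75, 9))  # 7875
# The tuned values discussed later in this post: about 2 minutes.
print(keepalive_detection_seconds(60, 10, 6))    # 120
```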

&lt;p&gt;In our case, &lt;code&gt;wait_timeout&lt;/code&gt; was still at the default — 28800 seconds (8 hours). So the killed application instance left behind connections holding exclusive row locks for hours. As subsequent deployments killed more instances the same way, more orphaned locks accumulated. Eventually, write contention on &lt;code&gt;orders&lt;/code&gt; was so severe that P99 latency hit the &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; ceiling.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we saw
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find transactions that have been open for a long time but aren't actively running&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trx_age_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_modified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;command_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;current_query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output showed connections from hosts belonging to application instances that had already been terminated — their IPs were no longer in service, but MySQL was still holding their connections open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
| trx_id | trx_state | trx_started         | trx_age_seconds | trx_rows_locked | trx_rows_modified | ...   | user | command | time | state | current_query |
+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
| 48291  | RUNNING   | 2025-03-15 09:14:22 |           14338 |              47 |                 0 | ...   | app  | Sleep   | 9841 |       | NULL          |
| 48305  | RUNNING   | 2025-03-15 09:21:07 |           13933 |              23 |                 0 | ...   | app  | Sleep   | 8122 |       | NULL          |
| 48412  | RUNNING   | 2025-03-15 10:02:44 |           11436 |              12 |                 0 | ...   | app  | Sleep   | 7203 |       | NULL          |
+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things jump out. &lt;code&gt;trx_state&lt;/code&gt; is &lt;code&gt;RUNNING&lt;/code&gt; but &lt;code&gt;command&lt;/code&gt; is &lt;code&gt;Sleep&lt;/code&gt; — the transaction is open but the connection is idle. &lt;code&gt;trx_rows_locked&lt;/code&gt; is non-zero but &lt;code&gt;trx_rows_modified&lt;/code&gt; is zero — these transactions acquired locks via &lt;code&gt;FOR UPDATE&lt;/code&gt; but never wrote anything. The &lt;code&gt;time&lt;/code&gt; column shows idle times of 7,000–10,000 seconds — the 8-hour &lt;code&gt;wait_timeout&lt;/code&gt; default hadn't even kicked in yet, and these connections had already been abandoned for hours with no end in sight.&lt;/p&gt;
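&lt;p&gt;That "open transaction, sleeping connection, locks held" signature is mechanical enough to script. A sketch of the check (row shape mirrors the query output above; the field names are assumptions):&lt;/p&gt;

```python
def find_orphaned_trx(rows, min_idle_s=300):
    """Flag transactions that hold locks while their connection is idle:
    trx open (RUNNING), command Sleep, rows locked, idle past the threshold."""
    return [
        r["trx_id"]
        for r in rows
        if r["trx_state"] == "RUNNING"
        and r["command"] == "Sleep"
        and r["trx_rows_locked"] > 0
        and r["time"] >= min_idle_s
    ]

rows = [
    {"trx_id": 48291, "trx_state": "RUNNING", "command": "Sleep", "trx_rows_locked": 47, "time": 9841},
    {"trx_id": 48305, "trx_state": "RUNNING", "command": "Sleep", "trx_rows_locked": 23, "time": 8122},
    # A healthy, briefly-idle transaction: not flagged.
    {"trx_id": 48999, "trx_state": "RUNNING", "command": "Sleep", "trx_rows_locked": 2, "time": 12},
]
print(find_orphaned_trx(rows))  # [48291, 48305]
```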

&lt;h3&gt;
  
  
  Finding exactly which rows are locked
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DATA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INDEX_NAME&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows every lock held by sleeping transactions older than 5 minutes. The &lt;code&gt;LOCK_DATA&lt;/code&gt; column gives you the primary key values of the locked rows, and &lt;code&gt;LOCK_MODE&lt;/code&gt; tells you whether it's shared or exclusive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding who's waiting on whom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DATA&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;contested_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b_proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b_proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_idle_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;b_proc&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b_proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;Short-term: killed the orphaned connections manually (&lt;code&gt;KILL &amp;lt;processlist_id&amp;gt;&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Long-term:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fixed the deployment process to send &lt;code&gt;SIGTERM&lt;/code&gt; with a grace period, allowing the application to close database connections before the container is killed. Added a pre-stop hook that explicitly closes the connection pool&lt;/li&gt;
&lt;li&gt;Reduced &lt;code&gt;wait_timeout&lt;/code&gt; to &lt;code&gt;300&lt;/code&gt; (5 minutes). The MySQL default of 28800 (8 hours) means orphaned connections can hold locks for an entire workday before MySQL notices. Check yours with &lt;code&gt;SHOW VARIABLES LIKE 'wait_timeout'&lt;/code&gt;. One caveat: the application's connection pool must validate or recycle idle connections on a shorter interval, or MySQL will reap healthy idle connections out from under it
&lt;/li&gt;
&lt;li&gt;On RDS, you &lt;strong&gt;cannot&lt;/strong&gt; tune server-side TCP keepalive — AWS manages the OS and those sysctl parameters aren't user-configurable. &lt;code&gt;wait_timeout&lt;/code&gt; is your primary lever for dead client detection. On self-managed MySQL, you can additionally tune the server's TCP keepalive (&lt;code&gt;tcp_keepalive_time=60&lt;/code&gt;, &lt;code&gt;tcp_keepalive_intvl=10&lt;/code&gt;, &lt;code&gt;tcp_keepalive_probes=6&lt;/code&gt;) so the server detects dead clients in ~120 seconds instead of ~2.2 hours&lt;/li&gt;
&lt;li&gt;Added monitoring for &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; rows where &lt;code&gt;trx_state = 'RUNNING'&lt;/code&gt; and the processlist shows &lt;code&gt;Sleep&lt;/code&gt; for more than 5 minutes&lt;/li&gt;
&lt;/ol&gt;
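&lt;p&gt;Item 1 as a sketch: a &lt;code&gt;SIGTERM&lt;/code&gt; handler that drains the connection pool before the process exits. The &lt;code&gt;Pool&lt;/code&gt; class here is a stand-in, not a real driver's API:&lt;/p&gt;

```python
import signal

class Pool:
    """Stand-in connection pool: real pools expose some close/dispose call."""
    def __init__(self):
        self.closed = False
    def close_all(self):
        # A real implementation rolls back open transactions, sends COM_QUIT,
        # and closes the sockets, so MySQL releases the locks immediately.
        self.closed = True

pool = Pool()

def on_sigterm(signum, frame):
    # Stop accepting new work first (omitted), then release DB connections.
    pool.close_all()

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate delivery of SIGTERM. In production the orchestrator sends it and
# waits out the grace period before escalating to SIGKILL.
on_sigterm(signal.SIGTERM, None)
print(pool.closed)  # True
```

&lt;p&gt;This is what the pre-stop hook and grace period buy you: a window in which &lt;code&gt;close_all()&lt;/code&gt; actually runs before the process is killed for good.&lt;/p&gt;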




&lt;h2&gt;
  
  
  Production Incident #3: Gap Locks from INSERT INTO ... SELECT
&lt;/h2&gt;

&lt;p&gt;We have a nightly job that archives completed orders into a summary table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;order_summaries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ran fine for months. Then the orders table grew, the nightly job started taking longer, and we started seeing deadlocks during peak write hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; places &lt;strong&gt;shared next-key locks&lt;/strong&gt; on the rows read by the &lt;code&gt;SELECT&lt;/code&gt; — the MySQL docs are explicit: &lt;em&gt;"InnoDB sets shared next-key locks on rows from S"&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locks-set.html" rel="noopener noreferrer"&gt;InnoDB Locks Set by Different SQL Statements&lt;/a&gt;). This is required to guarantee a consistent read — InnoDB needs to prevent other transactions from modifying or inserting into the range being read while the bulk insert is in progress.&lt;/p&gt;

&lt;p&gt;The problem: the &lt;code&gt;SELECT&lt;/code&gt; scans a range of &lt;code&gt;completed_at&lt;/code&gt; values. InnoDB places shared next-key locks on every index record in that range — and since each next-key lock includes the gap before the record, the entire scanned range is locked against inserts. Any &lt;code&gt;INSERT&lt;/code&gt; into the &lt;code&gt;orders&lt;/code&gt; table with a &lt;code&gt;completed_at&lt;/code&gt; value that falls within or near the locked range will block.&lt;/p&gt;

&lt;p&gt;Our application was simultaneously inserting new orders with &lt;code&gt;completed_at&lt;/code&gt; values close to the current timestamp. Since the archival job was reading &lt;code&gt;completed_at&lt;/code&gt; from the previous day, you might think there's no overlap. But gap locks extend to the &lt;em&gt;next&lt;/em&gt; index record beyond the scanned range — and if the next record's &lt;code&gt;completed_at&lt;/code&gt; is today, the gap lock extends into today's range.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index records for completed_at:
  ... | 2025-03-14 23:58:12 | 2025-03-14 23:59:44 | 2025-03-15 00:01:23 | ...
                                                       ↑
                                          Next-key lock extends here
                                          because this is the next record
                                          after the scanned range

  New INSERT with completed_at = 2025-03-15 00:00:15 → blocked by gap lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Finding active lock waits in real time
&lt;/h3&gt;

&lt;p&gt;In MySQL 8.0, lock wait information lives in &lt;code&gt;performance_schema.data_lock_waits&lt;/code&gt; (the old &lt;code&gt;information_schema.innodb_lock_waits&lt;/code&gt; was removed in 8.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_wait_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_locked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_rows_locked&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Changed the archival job's isolation level to &lt;code&gt;READ COMMITTED&lt;/code&gt; for that session:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt; &lt;span class="k"&gt;ISOLATION&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;READ&lt;/span&gt; &lt;span class="k"&gt;COMMITTED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under &lt;code&gt;READ COMMITTED&lt;/code&gt;, InnoDB doesn't acquire gap locks at all (except for foreign key and duplicate-key checks) — the &lt;code&gt;SELECT&lt;/code&gt; portion runs as a consistent read without locking the source rows. This eliminates the phantom prevention guarantee for that transaction, which is acceptable for archival.&lt;/p&gt;
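&lt;p&gt;One sanity check worth running before the job starts, since &lt;code&gt;SET SESSION&lt;/code&gt; only affects the current connection (a quick sketch, not part of the original runbook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Applies to this session only; the global default is untouched
SELECT @@transaction_isolation;         -- READ-COMMITTED
SELECT @@global.transaction_isolation;  -- REPEATABLE-READ (assuming the default)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;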

&lt;ol start="2"&gt;
&lt;li&gt;Added batching — instead of one massive &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt;, process 1,000 rows at a time with an explicit &lt;code&gt;ORDER BY&lt;/code&gt; and &lt;code&gt;LIMIT&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;order_summaries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt;
     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
   &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each batch holds locks for a shorter duration, reducing the window for contention.&lt;/p&gt;
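&lt;p&gt;A keyset variant of the same batch avoids re-scanning rows from earlier batches entirely. A sketch against the same tables; the &lt;code&gt;@last_id&lt;/code&gt; bookkeeping below is ours, not part of the original job:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keyset batching: the application tracks @last_id between batches and
-- loops until a batch copies fewer than 1,000 rows. Unlike OFFSET, this
-- never re-scans (or re-locks) rows already processed.
SET @last_id = 0;  -- or the highest order id already archived

INSERT INTO order_summaries (order_id, total, region, completed_at)
SELECT id, total_amount, region, completed_at
FROM orders
WHERE id &amp;gt; @last_id
  AND status = 'completed'
  AND completed_at &amp;gt;= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
  AND completed_at &amp;lt; CURDATE()
ORDER BY id
LIMIT 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;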

&lt;ol start="3"&gt;
&lt;li&gt;Moved the archival job to run against a read replica, then applied the summaries to the primary via smaller transactional batches.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Monitoring InnoDB Locks: The Essential Queries
&lt;/h2&gt;

&lt;p&gt;Here's the full set of diagnostic queries we keep in our runbook.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Current InnoDB transaction status
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;age_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_rows_locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_rows_modified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_lock_memory_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_operation_state&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. All currently held locks (MySQL 8.0+)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_MODE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_DATA&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;LOCK_TYPE&lt;/code&gt; will be &lt;code&gt;RECORD&lt;/code&gt; (row/gap/next-key) or &lt;code&gt;TABLE&lt;/code&gt; (intention locks). &lt;code&gt;LOCK_MODE&lt;/code&gt; tells you the specifics: &lt;code&gt;X&lt;/code&gt; (exclusive), &lt;code&gt;S&lt;/code&gt; (shared), &lt;code&gt;X,GAP&lt;/code&gt; (exclusive gap lock), &lt;code&gt;X,REC_NOT_GAP&lt;/code&gt; (record-only, no gap), &lt;code&gt;S,GAP&lt;/code&gt;, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Lock wait chains — who blocks whom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_trx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DATA&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;contested_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_wait_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Deadlock history
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="n"&gt;INNODB&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;LATEST DETECTED DEADLOCK&lt;/code&gt; section shows the last deadlock with full details: both transactions, the locks they held, the locks they were waiting for, and which transaction InnoDB chose as the victim. Parse the output — it's verbose but complete.&lt;/p&gt;

&lt;p&gt;For ongoing monitoring, enable the deadlock log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="n"&gt;innodb_print_all_deadlocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This writes every deadlock to the MySQL error log, not just the latest one.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Long-running transactions with sleeping connections
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trx_age_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_modified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;processlist_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;idle_seconds&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the "zombie transaction" detector. Any row here is a connection with an open transaction that isn't executing a query. These are the ones that silently hold locks while doing nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. InnoDB lock metrics summary
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lock_waits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'GRANTED'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;locks_held&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'WAITING'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;locks_waiting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_VALUE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_status&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_row_lock_waits'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_row_lock_waits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_VALUE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_status&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_row_lock_time_avg'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_row_lock_wait_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_VALUE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_status&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_deadlocks'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_deadlocks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  MySQL Parameters That Control Lock Behavior
&lt;/h2&gt;

&lt;p&gt;Most lock incidents we've hit were made worse — or outright caused — by MySQL parameters left at their defaults. Here are the ones that matter, what they do, and what we set them to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check all lock-related parameters on your instance&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Variable_name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_lock_wait_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'lock_wait_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_deadlock_detect'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_print_all_deadlocks'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_status_output_locks'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'wait_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'interactive_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_autoinc_lock_mode'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;MySQL Default&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;10–30&lt;/td&gt;
&lt;td&gt;How long a transaction waits for a row lock before erroring out. 50 seconds is too generous — if you're waiting 30 seconds for a row lock, the transaction should fail and retry, not queue behind a zombie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lock_wait_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;31536000 (1 year)&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;How long a statement waits for a metadata lock. The default is effectively "wait forever." A &lt;code&gt;DROP PARTITION&lt;/code&gt; waiting a year for a shared MDL to release will stack every connection in your pool long before then&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wait_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;28800 (8 hours)&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;How long MySQL keeps an idle connection alive. An abandoned connection holding row locks can sit there for 8 hours at the default. Drop this to 5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interactive_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;28800 (8 hours)&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;Same as &lt;code&gt;wait_timeout&lt;/code&gt;, for connections flagged as interactive (mysql CLI sessions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_deadlock_detect&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ON&lt;/td&gt;
&lt;td&gt;ON&lt;/td&gt;
&lt;td&gt;Real-time deadlock detection. Turning this off (sometimes done for "performance") means deadlocks are only resolved by &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; expiring — much slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_print_all_deadlocks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OFF&lt;/td&gt;
&lt;td&gt;ON&lt;/td&gt;
&lt;td&gt;Log every deadlock to the error log. Without this, only the most recent deadlock is visible via &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_status_output_locks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OFF&lt;/td&gt;
&lt;td&gt;ON (during incidents)&lt;/td&gt;
&lt;td&gt;Include lock details in &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; output. Verbose, but invaluable during active lock investigations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_autoinc_lock_mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2 (MySQL 8.0)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Interleaved mode — least contention for auto-increment. Only change if you need consecutive values across bulk inserts (you probably don't)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most dangerous defaults are &lt;code&gt;lock_wait_timeout&lt;/code&gt; (1 year) and &lt;code&gt;wait_timeout&lt;/code&gt; (8 hours). If you take nothing else from this post, check those two on your production instances right now.&lt;/p&gt;
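&lt;p&gt;On self-managed MySQL 8.0 the fix is a few &lt;code&gt;SET PERSIST&lt;/code&gt; statements; on RDS the same values belong in the DB parameter group. The numbers below are the recommendations from the table above, not universal constants:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Persists across restarts on self-managed MySQL 8.0
-- (RDS manages these through the DB parameter group instead)
SET PERSIST innodb_lock_wait_timeout = 30;
SET PERSIST lock_wait_timeout = 300;
SET PERSIST wait_timeout = 300;
SET PERSIST interactive_timeout = 300;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;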




&lt;h2&gt;
  
  
  A Quick Guide to Isolation Levels and Their Lock Behavior
&lt;/h2&gt;

&lt;p&gt;The locks InnoDB takes are directly tied to the isolation level. Understanding this mapping is essential for diagnosing unexpected contention.&lt;/p&gt;

&lt;h3&gt;
  
  
  READ UNCOMMITTED
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No locks on consistent reads&lt;/li&gt;
&lt;li&gt;Writes take record locks (no gap locks)&lt;/li&gt;
&lt;li&gt;You almost never want this — it allows dirty reads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  READ COMMITTED
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Consistent reads use a &lt;strong&gt;fresh snapshot per statement&lt;/strong&gt; — each statement sees the latest committed data as of its start time, not the transaction start time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No gap locks&lt;/strong&gt; — this is the key difference from &lt;code&gt;REPEATABLE READ&lt;/code&gt;. The docs state: &lt;em&gt;"Gap locking is only used for foreign-key constraint checking and duplicate-key checking"&lt;/em&gt; at this level (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html" rel="noopener noreferrer"&gt;Transaction Isolation Levels&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Record locks on matched rows only, released for non-matching rows after evaluation&lt;/li&gt;
&lt;li&gt;Phantom reads are possible but gap lock deadlocks are eliminated&lt;/li&gt;
&lt;li&gt;This is the level we switch to for bulk operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  REPEATABLE READ (InnoDB Default)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Consistent reads see a snapshot from the first read in the transaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next-key locks&lt;/strong&gt; on index scans — record lock + gap lock&lt;/li&gt;
&lt;li&gt;Gap locks prevent phantom inserts in scanned ranges&lt;/li&gt;
&lt;li&gt;This is where most unexpected locking contention occurs&lt;/li&gt;
&lt;/ul&gt;
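&lt;p&gt;The gap-locking behavior is easy to reproduce with two sessions. A sketch against the &lt;code&gt;orders&lt;/code&gt; table from the earlier examples, assuming an index on &lt;code&gt;completed_at&lt;/code&gt; (without one, InnoDB next-key locks every row the scan touches):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Session 1 (REPEATABLE READ, the default): next-key locks the matching
-- rows plus the gap after the last one
BEGIN;
SELECT * FROM orders WHERE completed_at &amp;gt;= '2026-04-01' FOR UPDATE;

-- Session 2: this row falls inside the locked gap, so the insert blocks
-- until session 1 commits or rolls back
INSERT INTO orders (status, total_amount, region, completed_at)
VALUES ('completed', 99.00, 'us-east', '2026-04-02 08:00:00');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;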

&lt;h3&gt;
  
  
  SERIALIZABLE
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All consistent reads are implicitly converted to &lt;code&gt;SELECT ... FOR SHARE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Every read takes shared next-key locks&lt;/li&gt;
&lt;li&gt;Maximum correctness, maximum contention&lt;/li&gt;
&lt;li&gt;We've never used this in production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Side-by-side lock behavior
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;READ COMMITTED&lt;/th&gt;
&lt;th&gt;REPEATABLE READ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;SELECT&lt;/code&gt; (plain)&lt;/td&gt;
&lt;td&gt;No locks (MVCC)&lt;/td&gt;
&lt;td&gt;No locks (MVCC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record locks only&lt;/td&gt;
&lt;td&gt;Next-key locks (record + gap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE WHERE unique_col = ?&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record lock on matching row&lt;/td&gt;
&lt;td&gt;Record lock only (unique index optimization — no gap lock)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE WHERE non_unique_col = ?&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record lock on matching row&lt;/td&gt;
&lt;td&gt;Next-key locks on matching records + next record to seal the gap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE WHERE range&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record locks on matching rows&lt;/td&gt;
&lt;td&gt;Next-key locks on range&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;INSERT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Insert intention lock&lt;/td&gt;
&lt;td&gt;Insert intention lock&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No locks on source (consistent read)&lt;/td&gt;
&lt;td&gt;Shared next-key locks on source rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DELETE WHERE range&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record locks on matching rows&lt;/td&gt;
&lt;td&gt;Next-key locks on range&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; row is the one that bit us. Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, the &lt;code&gt;SELECT&lt;/code&gt; side takes shared next-key locks on every row it reads. Under &lt;code&gt;READ COMMITTED&lt;/code&gt;, it doesn't. Switching isolation level for that one session was a one-line fix that eliminated the gap lock deadlocks entirely.&lt;/p&gt;
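&lt;p&gt;In practice the one-liner is a session-scoped isolation change before the batch statement. A sketch (table names here are illustrative, not our real schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Applies only to this connection; every other session keeps REPEATABLE READ
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

INSERT INTO orders_archive
SELECT * FROM orders WHERE created_at &amp;lt; '2026-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;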




&lt;h2&gt;
  
  
  Preventing Lock Contention: What We Do Now
&lt;/h2&gt;

&lt;p&gt;After three incidents with three different lock types, we've adopted a set of practices that have kept us out of trouble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep transactions short.&lt;/strong&gt; The longer a transaction is open, the longer it holds locks. Every lock held is a potential blocker. We set &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; to 30 (the MySQL default is 50). If your transaction is waiting 30 seconds for a row lock, something is structurally wrong — fail fast and let the application retry.&lt;/p&gt;
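&lt;p&gt;A sketch of the knobs involved (the 5-second session value is illustrative, not a recommendation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Server-wide: give up on row-lock waits after 30s instead of the default 50s
SET GLOBAL innodb_lock_wait_timeout = 30;

-- Batch jobs can opt into an even tighter budget for their own session
SET SESSION innodb_lock_wait_timeout = 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;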

&lt;p&gt;&lt;strong&gt;Use READ COMMITTED for bulk operations.&lt;/strong&gt; If you're running &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt;, &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt;, or any bulk read-then-write pattern, switch the session to &lt;code&gt;READ COMMITTED&lt;/code&gt;. You don't need phantom protection for a batch job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use &lt;code&gt;pt-online-schema-change&lt;/code&gt; or &lt;code&gt;gh-ost&lt;/code&gt; for large ALTERs.&lt;/strong&gt; These tools perform schema changes by creating a shadow table, copying data in small batches, and swapping at the end — avoiding long-held metadata locks entirely. For partition operations, consider &lt;code&gt;EXCHANGE PARTITION&lt;/code&gt;, which is metadata-only and near-instant.&lt;/p&gt;
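&lt;p&gt;The partition swap looks roughly like this (table and partition names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Swap a standalone table in for a partition; both must have identical
-- structure. The swap itself is a metadata operation, not a data copy.
ALTER TABLE events
    EXCHANGE PARTITION p2026_03 WITH TABLE events_staging;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;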

&lt;p&gt;&lt;strong&gt;Ensure graceful shutdown of application instances.&lt;/strong&gt; When an application process is killed without closing its database connections (SIGKILL, OOM, container eviction), MySQL doesn't detect the dead client immediately. Locks are held until &lt;code&gt;wait_timeout&lt;/code&gt; expires or TCP keepalive detects the dead peer — whichever comes first. Use &lt;code&gt;SIGTERM&lt;/code&gt; with a grace period, and add pre-stop hooks that drain connections before the process exits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tune dead connection detection.&lt;/strong&gt; Drop &lt;code&gt;wait_timeout&lt;/code&gt; from the default 28800 (8 hours) to 300 — on RDS, this is your primary lever since server-side TCP keepalive isn't user-configurable. On self-managed MySQL, also tune the server's TCP keepalive (&lt;code&gt;tcp_keepalive_time=60&lt;/code&gt;, &lt;code&gt;tcp_keepalive_intvl=10&lt;/code&gt;, &lt;code&gt;tcp_keepalive_probes=6&lt;/code&gt;) so the server detects dead clients in ~120 seconds. If you use a connection pool (HikariCP, etc.), configure its own idle connection eviction to detect and close stale connections faster than MySQL does.&lt;/p&gt;
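&lt;p&gt;On the MySQL side that's two globals (on RDS, set them in the parameter group rather than at runtime):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Close idle connections after 5 minutes instead of 8 hours
SET GLOBAL wait_timeout = 300;
-- Same budget for interactive clients (mysql CLI, admin tools)
SET GLOBAL interactive_timeout = 300;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;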

&lt;p&gt;&lt;strong&gt;Monitor zombie transactions.&lt;/strong&gt; Alert on any transaction that has been open for more than N minutes while the connection is idle. This is the single most impactful alert we've added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Alert query: transactions open &amp;gt; 5 min with sleeping connections&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;zombie_transactions&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Index your write predicates.&lt;/strong&gt; If your &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; hits a full table scan, InnoDB locks every row it examines — not just the rows that match the &lt;code&gt;WHERE&lt;/code&gt; clause. Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, it also places gap locks across the entire index. A missing index on a write path doesn't just make the query slow — it makes every other concurrent write slow too.&lt;/p&gt;
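&lt;p&gt;A quick way to catch this before it ships: check the access path of the write itself, not just your reads. Table and column names below are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- type: ALL in the output means a full scan, and a full scan's worth of locks
EXPLAIN UPDATE orders SET processed_at = NOW() WHERE status = 'pending';

-- The fix is the same as for a slow read: index the write predicate
CREATE INDEX idx_orders_status ON orders (status);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;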




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;InnoDB uses a hybrid of MVCC and pessimistic locking.&lt;/strong&gt; Reads don't lock (consistent snapshots). Writes lock. The specific lock type depends on the isolation level and the index structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gap locks are the most surprising lock type.&lt;/strong&gt; They lock ranges between index records, not the records themselves. They exist to prevent phantom reads under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, and they're the root cause of most unexpected deadlocks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metadata locks are MySQL-level, not InnoDB-level.&lt;/strong&gt; Any DDL queues behind active DML and blocks all subsequent DML. On high-QPS tables, even fast DDL can cascade into connection pool exhaustion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sleeping connections with open transactions are silent killers.&lt;/strong&gt; A connection in &lt;code&gt;Sleep&lt;/code&gt; state with an uncommitted transaction still holds all its row locks. Monitor &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; crossed with the processlist.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; takes shared next-key locks on the source table under &lt;code&gt;REPEATABLE READ&lt;/code&gt;.&lt;/strong&gt; Switch to &lt;code&gt;READ COMMITTED&lt;/code&gt; for bulk operations to eliminate gap locking on the read side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor lock contention proactively.&lt;/strong&gt; &lt;code&gt;performance_schema.data_locks&lt;/code&gt;, &lt;code&gt;data_lock_waits&lt;/code&gt;, and &lt;code&gt;innodb_trx&lt;/code&gt; are your friends. On RDS, Performance Insights broken down by &lt;code&gt;db.wait_event&lt;/code&gt; is the fastest way to see what your connections are actually waiting on.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every lock incident we've had traced back to one of three things: a DDL operation that held an exclusive metadata lock longer than expected, application instances that died without releasing their connections, or an isolation level that was too strict for the workload. The queries in this post are the ones we reach for first. &lt;code&gt;performance_schema.data_locks&lt;/code&gt; and &lt;code&gt;data_lock_waits&lt;/code&gt; tell you exactly which rows are locked and who's waiting. &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; crossed with the processlist catches zombie transactions from dead application instances. And &lt;code&gt;performance_schema.metadata_locks&lt;/code&gt; tells you who's holding the MDL that's stacking your connection pool. The data is always there — you just have to know where to look.&lt;/p&gt;
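&lt;p&gt;If you'd rather not write the joins yourself, the &lt;code&gt;sys&lt;/code&gt; schema that ships with MySQL 8.0 wraps the same tables into a ready-made view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per blocked lock wait: who's waiting, who's blocking, for how long
SELECT waiting_pid, waiting_query,
       blocking_pid, blocking_query,
       wait_age
FROM sys.innodb_lock_waits;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;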

</description>
      <category>mysql</category>
      <category>database</category>
      <category>sql</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Debugging PostgreSQL Query Plan Instability in Production</title>
      <dc:creator>Sriram Rajendran</dc:creator>
      <pubDate>Tue, 31 Mar 2026 14:23:00 +0000</pubDate>
      <link>https://dev.to/sriramrajendran/debugging-postgresql-query-plan-instability-in-production-ncb</link>
      <guid>https://dev.to/sriramrajendran/debugging-postgresql-query-plan-instability-in-production-ncb</guid>
      <description>&lt;p&gt;The query plan is only as good as the statistics behind it. When those statistics are wrong, the planner makes confident decisions based on a false reality.&lt;/p&gt;

&lt;p&gt;We run a field technician dispatch system on PostgreSQL 14. The core query — find technicians matching specific dispatch criteria near a job site — ran in 3ms at 10am and 42ms at 2pm the same day. Same query text. Same bind parameters. The only thing that changed was the number of technicians available — as the fleet clocked in and dispatched through the day, the underlying data distribution shifted just enough to flip the query plan.&lt;/p&gt;

&lt;p&gt;This post is the story of how we traced a 14x latency regression to a fundamental assumption baked into every cost-based query optimizer, and why our existing composite index didn't help. Along the way, we'll dig into &lt;code&gt;pg_statistic&lt;/code&gt;, selectivity estimation, BitmapAnd mechanics, and the specific ways correlated boolean columns break the planner's world model.&lt;/p&gt;

&lt;p&gt;But first — if you've never looked at how PostgreSQL decides whether to use your index, let's build up from first principles.&lt;/p&gt;




&lt;h2&gt;
  
  
  How PostgreSQL Chooses a Query Plan
&lt;/h2&gt;

&lt;p&gt;PostgreSQL doesn't just "use the index." It evaluates multiple execution strategies and picks the one with the lowest estimated cost. This is the &lt;strong&gt;cost-based optimizer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The optimizer doesn't see your data. It sees a &lt;strong&gt;statistical summary&lt;/strong&gt; stored in &lt;code&gt;pg_statistic&lt;/code&gt; (exposed via the &lt;code&gt;pg_stats&lt;/code&gt; view). When &lt;code&gt;ANALYZE&lt;/code&gt; runs — manually or via autovacuum — PostgreSQL samples rows and builds per-column statistics: most common values (MCVs), histograms, distinct counts, and null fractions.&lt;/p&gt;

&lt;p&gt;For a single predicate like &lt;code&gt;WHERE is_available = true&lt;/code&gt;, the planner looks up the MCV frequency for &lt;code&gt;is_available&lt;/code&gt;. If &lt;code&gt;true&lt;/code&gt; appears with frequency 0.50, the estimate for a 15,000-row table is 15,000 × 0.50 = 7,500 rows.&lt;/p&gt;
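&lt;p&gt;You can see exactly what the planner sees by querying &lt;code&gt;pg_stats&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The MCV list and frequencies that ANALYZE collected for one column
SELECT most_common_vals, most_common_freqs, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'technicians'
  AND attname = 'is_available';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;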

&lt;p&gt;For multiple predicates, it applies the &lt;strong&gt;independence assumption&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(A AND B AND C) = P(A) × P(B) × P(C)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is correct if and only if the columns are statistically independent. The PostgreSQL docs are explicit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/planner-stats.html" rel="noopener noreferrer"&gt;Chapter 14.2: Statistics Used by the Planner&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the assumption holds, estimates are accurate. When it doesn't, they can be off by orders of magnitude. Our query hit the second case.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: Our Table and Query
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How we ended up with six booleans
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;technicians&lt;/code&gt; table tracks every field technician in the system — roughly 15,000 rows. It didn't start with six boolean columns. It started with two: &lt;code&gt;is_active&lt;/code&gt; and &lt;code&gt;is_available&lt;/code&gt;. Then we needed to track whether a technician was currently on a job, so &lt;code&gt;is_dispatched&lt;/code&gt; arrived. A compliance incident led to &lt;code&gt;is_blocked&lt;/code&gt;. Express delivery became a feature, so &lt;code&gt;is_express_enabled&lt;/code&gt;. An ops request for soft-disabling technicians without removing them added &lt;code&gt;is_suspended&lt;/code&gt;. Each boolean made sense in isolation, each was a small migration, and each shipped independently over about eighteen months.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;zone_id&lt;/span&gt;         &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_active&lt;/span&gt;       &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_available&lt;/span&gt;    &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_dispatched&lt;/span&gt;   &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_blocked&lt;/span&gt;      &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_express_enabled&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_suspended&lt;/span&gt;    &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_job_location&lt;/span&gt; &lt;span class="n"&gt;GEOMETRY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4326&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;-- ... other columns&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;technician_locations&lt;/code&gt; table stores real-time GPS positions (one row per technician, updated every few seconds).&lt;/p&gt;

&lt;h3&gt;
  
  
  Key indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Composite boolean index WITH zone&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_selection_v2&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zone_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_dispatched&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_blocked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_express_enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_suspended&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Composite boolean index WITHOUT zone&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_selection&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_dispatched&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_blocked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_express_enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_suspended&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Spatial index on last job location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_last_job_geom&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gist&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_job_location&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Spatial index on live GPS location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_location_geom&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technician_locations&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gist&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_location&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The allocation query
&lt;/h3&gt;

&lt;p&gt;The query finds technicians matching specific dispatch criteria within a geographic radius:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;technician_locations&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;technician_id&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;technician_stats&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;technician_id&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;daily_utilization&lt;/span&gt; &lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;du&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;technician_id&lt;/span&gt;
    &lt;span class="c1"&gt;-- ... additional left joins for tagging, blocklists, queue state&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_dispatched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_blocked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_express_enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_suspended&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ST_DWithin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_SetSRID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4326&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;geography&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reported_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'20 minutes'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six boolean conditions on &lt;code&gt;technicians&lt;/code&gt;, plus a spatial predicate on &lt;code&gt;technician_locations&lt;/code&gt;. Looks straightforward. The problem is entirely in how the planner estimates the boolean combination.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom: Same Query, Two Plans
&lt;/h2&gt;

&lt;p&gt;We noticed the issue in our Grafana dashboards. The allocation query's P50 latency was steady around 4ms, but P99 kept spiking to 40–45ms within the same day. The pattern correlated with fleet activity — as more technicians clocked in and changed dispatch states through the day, the underlying data distribution shifted. The planner's statistics drifted with it, and the plan flipped.&lt;/p&gt;

&lt;p&gt;When we pulled &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; hours apart on the same day, we saw two completely different plans.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fast plan (~3ms): BitmapAnd
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BitmapAnd  (cost=146.22..146.22 rows=4 width=0) (actual time=1.83..1.83 rows=0 loops=1)
  -&amp;gt; Bitmap Index Scan on idx_technician_selection
       (cost=0.00..5.22 rows=38 width=0) (actual time=0.45..0.45 rows=8720 loops=1)
       Index Cond: (is_active = true) AND (is_available = true) AND (is_dispatched = true)
                   AND (is_blocked = false) AND (is_express_enabled = true) AND (is_suspended = false)
  -&amp;gt; Bitmap Index Scan on idx_technician_last_job_geom
       (cost=0.00..140.75 rows=1628 width=0) (actual time=1.20..1.20 rows=8419 loops=1)
       Index Cond: (last_job_location &amp;amp;&amp;amp; &amp;lt;bounding box&amp;gt;)

Bitmap Heap Scan on technicians t
    Heap Blocks: exact=191
    -&amp;gt; BitmapAnd (above)
    Filter: ST_DWithin(last_job_location, ..., 4000)
    Rows Removed by Filter: 92
    -&amp;gt; actual rows=1

Execution Time: 3.155 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner used &lt;strong&gt;BitmapAnd&lt;/strong&gt; — it scanned both the boolean index and the spatial index on &lt;code&gt;technicians&lt;/code&gt;, intersected the two bitmaps in memory, and only fetched 191 heap pages. One row survived all filters. Fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  The slow plan (~42ms): Boolean index only
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bitmap Heap Scan on technicians t
    (cost=5.47..1205.33 rows=38 width=824) (actual time=0.62..8.14 rows=4626 loops=1)
  -&amp;gt; Bitmap Index Scan on idx_technician_selection
       (cost=0.00..5.22 rows=38 width=0) (actual time=0.48..0.48 rows=4626 loops=1)
       Index Cond: (is_active = true) AND (is_available = true) AND (is_dispatched = true)
                   AND (is_blocked = false) AND (is_express_enabled = true) AND (is_suspended = false)

  Nested Loop (actual loops=4558)
    -&amp;gt; Index Scan on technician_locations tl
         Index Cond: (technician_id = t.id)
         Filter: ST_DWithin(current_location, ..., 4000)
         loops=4558

Execution Time: 42.008 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner used &lt;strong&gt;only&lt;/strong&gt; the boolean index. No spatial index. It fetched 4,626 rows from &lt;code&gt;technicians&lt;/code&gt;, then nested-loop joined into &lt;code&gt;technician_locations&lt;/code&gt; 4,558 times, applying &lt;code&gt;ST_DWithin&lt;/code&gt; as a CPU filter on each loop. 14x slower.&lt;/p&gt;

&lt;h3&gt;
  
  
  Side by side
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Fast Plan (BitmapAnd)&lt;/th&gt;
&lt;th&gt;Slow Plan (Boolean only)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Strategy&lt;/td&gt;
&lt;td&gt;BitmapAnd (boolean + spatial)&lt;/td&gt;
&lt;td&gt;Bitmap Scan (boolean only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated rows from boolean index&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actual rows from boolean index&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8,720&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4,626&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heap pages fetched&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;4,558 nested loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spatial index used?&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution time&lt;/td&gt;
&lt;td&gt;3.1ms&lt;/td&gt;
&lt;td&gt;42ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things jump out. The estimated row count is 38 in &lt;strong&gt;both&lt;/strong&gt; plans. That's the planner's selectivity estimate for the boolean combination. But the actual count is thousands of rows. The estimate is wrong by two orders of magnitude.&lt;/p&gt;

&lt;p&gt;The difference between the plans isn't that one has better estimates — both are equally wrong. The difference is what the planner &lt;em&gt;decided to do&lt;/em&gt; with that bad estimate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause: 229x Underestimate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The selectivity math
&lt;/h3&gt;

&lt;p&gt;The planner looks up each boolean column's frequency independently from &lt;code&gt;pg_statistic&lt;/code&gt; and multiplies them together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;is_active = true:          ~70%  → 0.70
is_available = true:       ~50%  → 0.50
is_dispatched = true:      ~25%  → 0.25
is_blocked = false:        ~95%  → 0.95
is_express_enabled = true: ~40%  → 0.40
is_suspended = false:      ~95%  → 0.95

Combined (assuming independence):
  0.70 × 0.50 × 0.25 × 0.95 × 0.40 × 0.95 ≈ 0.03 with these rounded
  figures; the planner's exact stored frequencies multiply out to ≈ 0.0025,
  which is where its estimate comes from:

  15,000 rows × 0.0025 ≈ 38 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner arrives at 38. The actual number is &lt;strong&gt;8,720&lt;/strong&gt;. That's a 229x underestimate.&lt;/p&gt;
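&lt;p&gt;The gap is easy to verify by hand: run the boolean predicates on their own and compare the count against the plan's estimate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Actual combined selectivity, no planner involved
SELECT count(*)
FROM technicians
WHERE is_active
  AND is_available
  AND is_dispatched
  AND NOT is_blocked
  AND is_express_enabled
  AND NOT is_suspended;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;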

&lt;h3&gt;
  
  
  Why the multiplication is wrong
&lt;/h3&gt;

&lt;p&gt;These columns are &lt;strong&gt;not independent&lt;/strong&gt;. They encode a &lt;strong&gt;finite state machine&lt;/strong&gt; — the operational lifecycle of a field technician. The business logic enforces hard constraints between them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column A&lt;/th&gt;
&lt;th&gt;Column B&lt;/th&gt;
&lt;th&gt;Relationship&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_available&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Available implies active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_dispatched&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dispatched implies active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_available&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_dispatched&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dispatched implies available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_blocked&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Blocked implies not active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_suspended&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suspended implies not active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_suspended&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_available&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suspended implies not available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, 5 of the 6 booleans encode roughly 6 valid states (the sixth, &lt;code&gt;is_express_enabled&lt;/code&gt;, is an independent capability flag):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;is_active&lt;/th&gt;
&lt;th&gt;is_available&lt;/th&gt;
&lt;th&gt;is_dispatched&lt;/th&gt;
&lt;th&gt;is_blocked&lt;/th&gt;
&lt;th&gt;is_suspended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SUSPENDED&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLOCKED&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INACTIVE&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OFFLINE&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IDLE&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DISPATCHED&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's 6 valid states out of 2^5 = 32 theoretical combinations. The planner's independence assumption spreads probability mass across all 32 combinations, including the 26 that can never occur. It doesn't know that &lt;code&gt;is_blocked = true AND is_active = true&lt;/code&gt; is impossible in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actual selectivity of the boolean combination: ~58% (8,720 / 15,000)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Planner's estimate: 0.25% (38 / 15,000)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Error factor: 229x&lt;/strong&gt;&lt;/p&gt;
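
&lt;p&gt;You can reproduce the shape of this failure in a few lines, without PostgreSQL. A toy sketch with synthetic data, using perfect correlation as the extreme case (not our production distribution):&lt;/p&gt;

```python
import random

random.seed(7)

# Extreme case of dependent booleans: five columns that are all copies of
# one underlying coin flip (perfect correlation). Toy data only.
N = 15_000
rows = [(flag,) * 5 for flag in (random.choice([True, False]) for _ in range(N))]

# Actual selectivity of "all five columns are true".
actual = sum(1 for r in rows if all(r)) / N

# Planner-style estimate: multiply per-column marginal frequencies as if
# the columns were independent.
est = 1.0
for i in range(5):
    est *= sum(1 for r in rows if r[i]) / N

print(round(actual, 2))        # ~0.5: half the rows match
print(round(est, 3))           # ~0.031: roughly 0.5 to the 5th power
print(round(actual / est, 1))  # ~16x underestimate
```

&lt;p&gt;Our six columns aren't perfectly correlated, but the mechanism is identical: every column that restates the same underlying state multiplies the estimate down another time.&lt;/p&gt;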

&lt;p&gt;The PostgreSQL docs even demonstrate this exact failure mode:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The planner estimates the selectivity for each condition individually... Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/multivariate-statistics-examples.html" rel="noopener noreferrer"&gt;Multivariate Statistics Examples&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Why the Plan Flips: ANALYZE Sampling Non-Determinism
&lt;/h2&gt;

&lt;p&gt;The 229x underestimate explains why the slow plan is slow. But why does the query &lt;em&gt;sometimes&lt;/em&gt; get the fast plan?&lt;/p&gt;

&lt;p&gt;Because the estimate isn't always exactly 38 — it drifts. ANALYZE takes a random sample, not a census:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/sql-analyze.html" rel="noopener noreferrer"&gt;ANALYZE command reference&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our &lt;code&gt;technicians&lt;/code&gt; table is high-churn — technicians toggle &lt;code&gt;is_dispatched&lt;/code&gt; hundreds of times per day across the fleet. As the number of available technicians changes through the day, each ANALYZE sample captures a different snapshot. The boolean estimate drifts — and the plan is sensitive to that drift.&lt;/p&gt;
&lt;h3&gt;
  
  
  The tipping point
&lt;/h3&gt;

&lt;p&gt;The planner's decision to include the spatial index in a BitmapAnd has a cost threshold. When the boolean estimate is "moderate" (say, 30–50 rows), the planner decides it's worth adding a second index scan to intersect. When the estimate drops below a threshold (say, 8–15 rows), the planner decides the boolean index alone is good enough and drops the spatial index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the boolean estimate is ~38:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Planner: "Boolean index returns ~38 rows. That's moderate.
          Adding the spatial index scan (est. 1,628 rows) and
          intersecting bitmaps will reduce heap fetches to ~4.
          The extra index scan is worth it."

Plan: BitmapAnd(boolean_index + spatial_index)
Result: 191 heap pages, 1 row survives → 3ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When the boolean estimate drops to ~10:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Planner: "Boolean index returns only ~10 rows. That's already tiny.
          A second index scan costs more than just filtering 10 rows
          by distance. Not worth the overhead."

Plan: Bitmap Scan on boolean index only, ST_DWithin as CPU filter
Result: 8,720 rows scanned, spatial filter on every row → 42ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
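
&lt;p&gt;The decision boundary is easy to model as a toy cost comparison. The numbers here are invented for illustration, not PostgreSQL's actual cost model:&lt;/p&gt;

```python
# Toy model of the BitmapAnd tipping point (invented costs).
SECOND_INDEX_SCAN_COST = 60.0   # fixed cost to add the spatial index scan
PER_ROW_FILTER_COST = 2.0       # CPU cost to distance-filter one row

def choose_plan(boolean_row_estimate):
    # Pick whichever strategy the (estimated) arithmetic says is cheaper.
    costs = {
        "BitmapAnd(boolean + spatial)": SECOND_INDEX_SCAN_COST,
        "boolean index + ST_DWithin filter": boolean_row_estimate * PER_ROW_FILTER_COST,
    }
    return min(costs, key=costs.get)

print(choose_plan(38))  # estimate 38: the second index scan pays for itself
print(choose_plan(10))  # estimate 10: filtering "10" rows looks cheaper
```

&lt;p&gt;The plan flips purely because the estimate drifts across the threshold; the real data never changed shape.&lt;/p&gt;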



&lt;h3&gt;
  
  
  The perverse incentive
&lt;/h3&gt;

&lt;p&gt;Here's the cruel part: the MORE wrong the boolean estimate is (in the "too low" direction), the MORE confident the planner becomes that BitmapAnd is unnecessary — and the WORSE the actual performance gets. The planner drops the spatial index precisely when it would help the most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autovacuum is the trigger
&lt;/h3&gt;

&lt;p&gt;Autovacuum triggers ANALYZE when the number of changed tuples exceeds &lt;code&gt;autovacuum_analyze_threshold + autovacuum_analyze_scale_factor × table_size&lt;/code&gt;. With default settings, that's &lt;code&gt;50 + 0.10 × 15,000 = 1,550 tuple changes&lt;/code&gt;. On a table with thousands of state transitions per minute during peak hours, this threshold is crossed constantly. ANALYZE can fire multiple times within a few minutes, each time producing a slightly different boolean estimate that may or may not cross the BitmapAnd decision boundary.&lt;/p&gt;
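
&lt;p&gt;The trigger math above, as a quick sanity check:&lt;/p&gt;

```python
# Autovacuum's ANALYZE trigger, per the formula above, with PostgreSQL's
# defaults (autovacuum_analyze_threshold = 50,
# autovacuum_analyze_scale_factor = 0.1).
def analyze_trigger(live_tuples, threshold=50, scale_factor=0.1):
    return threshold + scale_factor * live_tuples

print(analyze_trigger(15_000))  # 1550.0 changed tuples re-analyzes the table
```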




&lt;h2&gt;
  
  
  BitmapAnd: How It Works and Why It's Fragile
&lt;/h2&gt;

&lt;p&gt;BitmapAnd scans multiple indexes on the &lt;strong&gt;same table&lt;/strong&gt;, builds a bitmap of matching heap page locations for each, and intersects them. Only pages in the intersection get fetched. In our fast plan, it intersects boolean-matching pages with spatial-matching pages, reducing thousands of candidates down to 191 heap pages.&lt;/p&gt;

&lt;p&gt;Two things make it fragile for our case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same-table only.&lt;/strong&gt; BitmapAnd can't combine indexes across different tables. Our spatial filter is on &lt;code&gt;technician_locations&lt;/code&gt;, our boolean filter is on &lt;code&gt;technicians&lt;/code&gt; — that's a join, not a merge. BitmapAnd only helps when both indexes live on the same table (like the &lt;code&gt;last_job_location&lt;/code&gt; GiST index and the boolean index, both on &lt;code&gt;technicians&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-sensitive inclusion.&lt;/strong&gt; Each additional bitmap index scan has a startup cost. When the boolean estimate is low enough, the planner decides the second index scan isn't worth it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/indexes-bitmap-scans.html" rel="noopener noreferrer"&gt;Chapter 11.5: Combining Multiple Indexes&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The planner concludes: "I only expect 10 rows from the boolean index — filtering them by distance is cheaper than running a second index scan." The math is correct given the premise. The premise is just 229x wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Our Composite Index Didn't Help
&lt;/h2&gt;

&lt;p&gt;We already had a composite index on all six boolean columns. The natural assumption was: "PostgreSQL has an index on exactly this combination — surely it knows how many rows match?"&lt;/p&gt;

&lt;p&gt;It doesn't. A composite index helps PostgreSQL &lt;strong&gt;find&lt;/strong&gt; rows efficiently (access path), but it does NOT help PostgreSQL &lt;strong&gt;estimate how many&lt;/strong&gt; rows exist before scanning (cardinality estimation). These are two separate systems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/catalog-pg-statistic.html" rel="noopener noreferrer"&gt;pg_statistic catalog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ANALYZE samples heap pages and builds per-column MCVs independently. The B-tree's internal knowledge of which key combinations exist is never extracted into planner statistics. Our composite index found the right rows perfectly every time — the problem was that the planner's &lt;strong&gt;estimate&lt;/strong&gt; of how many rows it would find was 229x too low, causing it to choose a bad join strategy around the index.&lt;/p&gt;

&lt;p&gt;Only two things can store combination frequencies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE STATISTICS (mcv)&lt;/code&gt;&lt;/strong&gt; — explicit opt-in, stores multi-column MCV lists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A single enum/status column&lt;/strong&gt; — collapses the combination into one value&lt;/li&gt;
&lt;/ol&gt;
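
&lt;p&gt;A sketch of option 1, using the column names from this post (the statistics object name is arbitrary):&lt;/p&gt;

```sql
-- Opt in to multivariate MCV statistics across the six booleans.
CREATE STATISTICS technician_flags_mcv (mcv)
  ON is_active, is_available, is_dispatched,
     is_blocked, is_express_enabled, is_suspended
  FROM technicians;

-- Extended statistics are only populated by ANALYZE.
ANALYZE technicians;
```

&lt;p&gt;After this, the planner can look up the frequency of the exact combination instead of multiplying six marginals.&lt;/p&gt;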




&lt;h2&gt;
  
  
  The Anti-Pattern: Dependent Booleans in OLTP
&lt;/h2&gt;

&lt;p&gt;This isn't just a query tuning story. It's a schema design lesson.&lt;/p&gt;

&lt;p&gt;When you model a state machine as independent boolean columns, you're making an implicit promise to the database: "these columns are independent dimensions." Every cost-based optimizer — PostgreSQL, MySQL, Oracle, SQL Server — takes you at your word. The planner multiplies their selectivities because that's mathematically correct for independent variables.&lt;/p&gt;

&lt;p&gt;The problem is that the promise is false. &lt;code&gt;is_available&lt;/code&gt; and &lt;code&gt;is_dispatched&lt;/code&gt; aren't independent dimensions — they're states in a lifecycle. One implies the other. The planner can't know this from the schema alone.&lt;/p&gt;

&lt;p&gt;This pattern tends to emerge organically. You start with &lt;code&gt;is_active&lt;/code&gt;. A feature ships, you add &lt;code&gt;is_available&lt;/code&gt;. A compliance requirement adds &lt;code&gt;is_blocked&lt;/code&gt;. Each column is a small, low-risk migration. Nobody notices that the columns are accumulating mutual dependencies until the planner starts making bad decisions — and even then, the symptom (intermittent latency spikes) doesn't obviously point at schema design.&lt;/p&gt;

&lt;p&gt;The fix is to model state as state. If your boolean columns have business-logic dependencies between them — if certain combinations can never occur — they should be a single enum or status column. One column, one MCV lookup, no multiplication error. &lt;code&gt;WHERE status = 'idle'&lt;/code&gt; gives the planner an exact frequency. &lt;code&gt;WHERE is_active = true AND is_available = true AND is_dispatched = false AND ...&lt;/code&gt; gives it a guess.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on InnoDB
&lt;/h3&gt;

&lt;p&gt;Interestingly, this specific failure mode wouldn't manifest the same way on MySQL/InnoDB with the same composite index. InnoDB's optimizer uses a technique called &lt;strong&gt;index dive&lt;/strong&gt; — when estimating the cardinality of a range scan on a composite index, it actually samples the B-tree directly rather than multiplying per-column statistics. For an equality scan across all columns of a composite index (which is what our query does), InnoDB dives into the index, reads a sample of pages at the leaf level, and estimates row count from the actual index structure.&lt;/p&gt;

&lt;p&gt;This means InnoDB would see that the combination &lt;code&gt;(true, true, false, false, true, false)&lt;/code&gt; maps to ~8,700 leaf entries, not 38. The estimate would be roughly correct, and the optimizer wouldn't make the same bad join order decision.&lt;/p&gt;

&lt;p&gt;PostgreSQL doesn't do index dives for cardinality estimation. It always goes back to &lt;code&gt;pg_statistic&lt;/code&gt; and multiplies. The composite index is invisible to the estimation layer — it's only visible to the execution layer. This is a deliberate design choice (keeping statistics separate from access paths), but it means PostgreSQL is more vulnerable to correlated-column estimation errors than InnoDB is, even when the right composite index already exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always compare estimated vs. actual rows in &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;.&lt;/strong&gt; If they diverge by more than 10x, the planner is making decisions based on a false premise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Composite indexes don't fix estimation in PostgreSQL.&lt;/strong&gt; They help the executor find rows, but &lt;code&gt;pg_statistic&lt;/code&gt; stores per-column statistics independently. The planner still multiplies. (InnoDB's index dives would handle this better.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model state as state, not as independent booleans.&lt;/strong&gt; If your boolean columns have mutual dependencies — if certain combinations can never occur — they belong in a single enum column.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL's cost-based optimizer is remarkably good. But it operates on a statistical model of your data, not the data itself. When your schema encodes correlations the model assumes away, the planner makes rational decisions from false premises. Understanding where the model breaks is the difference between a query that runs in 3ms and one that runs in 42ms.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>opensource</category>
      <category>sql</category>
    </item>
    <item>
      <title>Running Grafana Loki in Production: What We Actually Learned</title>
      <dc:creator>Sriram Rajendran</dc:creator>
      <pubDate>Mon, 30 Mar 2026 11:58:48 +0000</pubDate>
      <link>https://dev.to/sriramrajendran/running-grafana-loki-in-production-what-we-actually-learned-d9g</link>
      <guid>https://dev.to/sriramrajendran/running-grafana-loki-in-production-what-we-actually-learned-d9g</guid>
      <description>&lt;p&gt;We run Loki in distributed mode on EKS, processing ~1.16 TB of logs per day across ~34,000 lines/second. This post covers the architecture we landed on, the configuration decisions that actually matter, and the numbers from production that validate (or challenge) those decisions.&lt;/p&gt;

&lt;p&gt;But first — if you're evaluating Loki or just heard the name, let's build up from first principles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Loki Exists: A Different Philosophy on Logs
&lt;/h2&gt;

&lt;p&gt;Traditional logging systems like Elasticsearch (ELK stack) or Splunk work by &lt;strong&gt;full-text indexing&lt;/strong&gt; every log line. When a log line comes in, the system tokenizes it, builds an inverted index over every word, and stores that index alongside the raw data. This makes arbitrary text search fast, but the index itself becomes enormous — often larger than the raw logs. At scale, you're paying more to store and maintain the index than the data it points to.&lt;/p&gt;

&lt;p&gt;Loki takes the opposite approach: &lt;strong&gt;index only the metadata, store the logs as compressed chunks.&lt;/strong&gt; Instead of indexing the contents of &lt;code&gt;"ERROR: connection refused to database host db-prod-3"&lt;/code&gt;, Loki only indexes the &lt;em&gt;labels&lt;/em&gt; attached to that line — things like &lt;code&gt;{namespace="payments", app="api-gateway", pod="api-gateway-7f8b9c"}&lt;/code&gt;. When you query, Loki uses the label index to find the right chunks, then brute-force greps through those chunks.&lt;/p&gt;

&lt;p&gt;This is the fundamental trade-off: &lt;strong&gt;Loki trades query-time compute for storage-time simplicity.&lt;/strong&gt; Queries are slower than Elasticsearch for arbitrary text search, but storage costs drop dramatically because you're not maintaining a massive inverted index. For most operational use cases — "show me the logs from the payments namespace in the last hour where the line contains ERROR" — this is fast enough, and the cost savings are substantial.&lt;/p&gt;
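
&lt;p&gt;That operational query is one label matcher plus one line filter in LogQL (label value illustrative):&lt;/p&gt;

```
{namespace="payments"} |= "ERROR"
```

&lt;p&gt;The matcher drives the index lookup; the &lt;code&gt;|=&lt;/code&gt; filter is the brute-force grep over the selected chunks.&lt;/p&gt;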

&lt;p&gt;Think of it like this: Elasticsearch is a search engine that happens to store logs. Loki is a log storage system that happens to support search.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Components: What Each Piece Does and Why It Exists
&lt;/h2&gt;

&lt;p&gt;Before we get into our specific setup, let's understand what each Loki component does from first principles. In a traditional monolithic logging system, one process handles everything — accept logs, store them, index them, and query them. Loki breaks this into discrete components so each can be scaled independently based on its bottleneck (CPU, memory, I/O, or network).&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributor — The Front Door
&lt;/h3&gt;

&lt;p&gt;The distributor is the first component that touches your log data. Every log push (from Promtail, Fluentd, or any other agent) hits a distributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Validates incoming log streams (checks labels, enforces rate limits, rejects old samples), then hashes the stream's labels to determine which ingester(s) should own that stream. It uses a &lt;strong&gt;consistent hash ring&lt;/strong&gt; to route — the same label set always goes to the same ingester, which is critical for keeping related log lines together in memory.&lt;/p&gt;
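
&lt;p&gt;The routing invariant can be sketched in a few lines. This is a simplified stand-in for Loki's token ring, not its actual implementation:&lt;/p&gt;

```python
import hashlib

# Simplified distributor-style routing: hash the canonicalized label set
# so the same stream always lands on the same ingester.
INGESTERS = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]

def route(labels):
    # Canonical form: sorted key=value pairs, so label order never matters.
    key = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    token = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return INGESTERS[token % len(INGESTERS)]

print(route({"namespace": "payments", "app": "api-gateway"})
      == route({"app": "api-gateway", "namespace": "payments"}))  # True
```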

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; In Elasticsearch, the coordinating node handles both routing and querying. Loki splits these because the write path and read path have completely different scaling characteristics. Distributors are CPU-light and stateless — you can add or remove them without any data migration. They're essentially smart load balancers for your write path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; CPU usage and push latency. If P99 push latency climbs above 500ms, add more distributors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingester — Where Data Lives Before Storage
&lt;/h3&gt;

&lt;p&gt;The ingester is the most critical and most resource-hungry component. It's the equivalent of what Elasticsearch calls a "data node" for recent data, but with a key difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Receives log streams from distributors, holds them in memory as "chunks" (compressed blocks of log lines), builds the index entries for those chunks, and periodically flushes both to long-term storage (S3). While data is in the ingester, it's queryable directly from memory — no storage round-trip needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate and stateful:&lt;/strong&gt; Ingesters are StatefulSets, not Deployments, because they hold state — unflushed chunks in memory and a Write-Ahead Log (WAL) on disk. The WAL is Loki's crash recovery mechanism: if an ingester dies, the replacement can replay the WAL to recover data that hadn't been flushed to S3 yet.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from Elasticsearch, where durability comes from replica shards maintained by the data nodes themselves. In Loki, data durability between flush cycles is handled by a combination of &lt;strong&gt;replication factor&lt;/strong&gt; (writing to N ingesters simultaneously) and &lt;strong&gt;WAL replay&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; Memory usage. Ingesters are memory-bound — they hold all active streams plus unflushed chunks. If memory pressure rises, you either add ingesters (spreading the hash ring thinner) or tune flush intervals to push data to storage faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querier — The Brute-Force Search Engine
&lt;/h3&gt;

&lt;p&gt;The querier is where Loki's "index little, grep a lot" philosophy becomes concrete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Executes LogQL queries by first consulting the index (via the index gateway) to identify which chunks match the label matchers, then fetches those chunks from S3 (or cache), decompresses them, and does a line-by-line scan for your filter expression. It also queries ingesters directly for data that hasn't been flushed to storage yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; In Elasticsearch, a query coordinator fans out to data nodes that each search their local shards. Loki separates this because queriers are compute-intensive and bursty — one expensive query shouldn't affect your ability to ingest logs. By isolating the read path, you can scale queriers independently of ingesters.&lt;/p&gt;

&lt;p&gt;The brute-force scan is what makes Loki queries slower than Elasticsearch for wide searches, but it's also what makes the storage layer so simple. No inverted index to update on every write, no segment merges eating I/O in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; Query latency and memory. Wide time-range queries or unselective label matchers cause queriers to fetch and decompress many chunks. More queriers = more parallelism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Frontend — The Query Optimizer
&lt;/h3&gt;

&lt;p&gt;The query frontend sits between the user (Grafana) and the queriers. It doesn't execute queries itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Takes an incoming query, splits it into smaller sub-queries by time interval, dispatches those sub-queries to queriers (via the query scheduler), and merges the results. It also caches query results and deduplicates identical in-flight queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; This is a pattern borrowed from Cortex/Mimir (Prometheus long-term storage). Without the query frontend, a single "show me the last 24 hours" query would force one querier to scan 24 hours of data sequentially. With it, that query becomes 96 parallel 15-minute queries spread across your querier fleet. This is the single biggest lever for query performance.&lt;/p&gt;
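
&lt;p&gt;The splitting itself is simple arithmetic (a sketch; the real interval is governed by &lt;code&gt;split_queries_by_interval&lt;/code&gt;):&lt;/p&gt;

```python
# Query-frontend time splitting: one wide query becomes many small
# sub-queries over fixed intervals (times in seconds for simplicity).
def split(start_s, end_s, interval_s):
    n = -(-(end_s - start_s) // interval_s)   # ceiling division
    return [(start_s + i * interval_s,
             min(start_s + (i + 1) * interval_s, end_s))
            for i in range(n)]

DAY = 24 * 3600
subs = split(0, DAY, 15 * 60)
print(len(subs))  # 96 sub-queries fan out across the querier fleet
```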

&lt;p&gt;Elasticsearch has a similar concept with its search "phases" (query then fetch), but Loki makes the time-based splitting explicit and configurable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; Rarely the bottleneck. Scale if you see request queuing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Scheduler — The Traffic Controller
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Maintains a queue of pending sub-queries from the query frontend and distributes them fairly across available queriers. Implements per-tenant fair queuing so one user's expensive query doesn't starve others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; Without the scheduler, query frontends connect directly to queriers via round-robin. This is fine at small scale, but at high concurrency it leads to uneven load distribution. The scheduler ensures no single querier gets overwhelmed while others sit idle. Think of it as the difference between random checkout lane selection vs. a single serpentine queue at a bank.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Gateway — The Index Serving Layer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Downloads the BoltDB index files from S3, caches them on disk (EFS in our case), and serves index lookups to queriers over gRPC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; This is Loki-specific and exists because of the BoltDB Shipper pattern. Without an index gateway, every querier downloads and caches the full index locally. With 10 queriers, that's 10 copies of the index, 10x the S3 GET requests, and 10x the local disk needed. The index gateway centralizes this into 3 replicas that serve the entire querier fleet.&lt;/p&gt;

&lt;p&gt;This has no real equivalent in Elasticsearch because ES stores the index within its Lucene segments on each data node — it's not a separate concern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compactor — The Janitor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Runs as a singleton (only one instance) and performs two jobs: (1) compacts multiple small index files into larger ones to improve query performance, and (2) applies retention policies by marking expired chunks for deletion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; Compaction is I/O-intensive and runs on its own schedule. You don't want compaction competing with ingest or query for resources. In Elasticsearch, segment merging (the equivalent operation) happens on each data node and is one of the primary sources of I/O contention at scale. Loki avoids this by centralizing compaction into a dedicated component.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Manager — The Schema Enforcer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Pre-creates and manages the index "tables" (time-based partitions) according to the schema config. Ensures tables exist before data arrives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; Mostly a legacy component for when Loki supported external index stores like DynamoDB or Cassandra with time-sharded tables. With BoltDB Shipper or TSDB, it's less critical but still handles table lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gateway (nginx) — The Entry Point
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; A simple nginx reverse proxy that routes &lt;code&gt;/loki/api/v1/push&lt;/code&gt; to distributors and &lt;code&gt;/loki/api/v1/query&lt;/code&gt; to query frontends. Provides a single endpoint for clients.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our Setup at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cluster&lt;/strong&gt;: EKS (Kubernetes 1.33) on AWS, 8 nodes running Bottlerocket OS&lt;br&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: &lt;code&gt;loki-distributed&lt;/code&gt; Helm chart (v0.80.2), Loki version 2.9.10&lt;br&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: S3 for chunks, BoltDB Shipper for index, EFS for compactor and index gateway&lt;br&gt;
&lt;strong&gt;Caching&lt;/strong&gt;: Memcached (3 tiers — chunks, query frontend, index writes)&lt;br&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus + Grafana co-located in the same cluster&lt;/p&gt;
&lt;h2&gt;
  
  
  Component Sizing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Replicas&lt;/th&gt;
&lt;th&gt;CPU (req/limit)&lt;/th&gt;
&lt;th&gt;Memory (req/limit)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;250m / 500m&lt;/td&gt;
&lt;td&gt;512Mi / 1Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ingester&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;StatefulSet&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;250m / 500m&lt;/td&gt;
&lt;td&gt;16Gi / 18Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Querier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;200m / 400m&lt;/td&gt;
&lt;td&gt;3Gi / 6Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;200m / 400m&lt;/td&gt;
&lt;td&gt;1Gi / 4Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Scheduler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;250m / 500m&lt;/td&gt;
&lt;td&gt;512Mi / 1Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Index Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;StatefulSet&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1 / 2&lt;/td&gt;
&lt;td&gt;2Gi / 8Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compactor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;500m / 1&lt;/td&gt;
&lt;td&gt;6Gi / 8Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Table Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gateway (nginx)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things jump out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingesters are the memory hogs.&lt;/strong&gt; Each ingester requests 16Gi and uses nearly all of it (~15.9 GB working set in production). This is because ingesters hold all active streams and recent chunks in memory before flushing to storage. If you undersize these, you'll see OOMKills that take down a chunk of your in-flight data with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queriers need headroom.&lt;/strong&gt; We run 10 queriers not because steady-state demands it, but because queries are bursty. A single user running a broad &lt;code&gt;{namespace=~".+"}&lt;/code&gt; query over a wide time range can spike memory on multiple queriers simultaneously. The 3Gi request with 6Gi limit gives them room to burst.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributors are lightweight but need replicas.&lt;/strong&gt; At 34k lines/sec ingest, 6 distributors keep each one comfortably below its CPU limit. They're stateless, so scaling is trivial.&lt;/p&gt;
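
&lt;p&gt;The implied per-distributor load:&lt;/p&gt;

```python
# Per-distributor share of the write path at the rates quoted above,
# assuming even load balancing.
lines_per_sec = 34_000
distributors = 6
print(round(lines_per_sec / distributors))  # ~5667 lines/sec per distributor
```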
&lt;h2&gt;
  
  
  The Configuration That Matters
&lt;/h2&gt;

&lt;p&gt;Here are the config knobs that we've tuned from defaults and why:&lt;/p&gt;
&lt;h3&gt;
  
  
  Ingester Tuning
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingester&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;chunk_block_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;262144&lt;/span&gt;        &lt;span class="c1"&gt;# 256KB blocks&lt;/span&gt;
  &lt;span class="na"&gt;chunk_encoding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snappy&lt;/span&gt;           &lt;span class="c1"&gt;# Fast compression, ~3.5x ratio&lt;/span&gt;
  &lt;span class="na"&gt;chunk_idle_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;            &lt;span class="c1"&gt;# Flush idle chunks after 2 min&lt;/span&gt;
  &lt;span class="na"&gt;max_chunk_age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30m&lt;/span&gt;               &lt;span class="c1"&gt;# Hard cap — flush after 30 min regardless&lt;/span&gt;
  &lt;span class="na"&gt;chunk_retain_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;          &lt;span class="c1"&gt;# Keep flushed chunks for queries in flight&lt;/span&gt;
  &lt;span class="na"&gt;lifecycler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;replication_factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;        &lt;span class="c1"&gt;# Every log line written to 2 ingesters&lt;/span&gt;
  &lt;span class="na"&gt;wal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/wal&lt;/span&gt;
    &lt;span class="na"&gt;checkpoint_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;        &lt;span class="c1"&gt;# WAL checkpoint every minute&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;code&gt;replication_factor: 2&lt;/code&gt;&lt;/strong&gt; is the sweet spot for us. RF=2 already doubles ingester memory overhead relative to RF=1, and RF=3 — while safer — adds another 50% on top of that. With RF=2 and the WAL enabled, we can lose one ingester and recover without data loss. The WAL with 1-minute checkpointing gives us a recovery path that doesn't rely on ring state alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;chunk_idle_period: 2m&lt;/code&gt; and &lt;code&gt;max_chunk_age: 30m&lt;/code&gt;&lt;/strong&gt; control how long data lives in ingester memory before hitting S3. Shorter values = lower memory usage but more small chunks in object storage (which slows queries). 30 minutes is a good balance — our ingesters flush about 9.4 chunks/sec at steady state.&lt;/p&gt;
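
&lt;p&gt;You can verify these knobs are producing healthy chunks rather than a flood of tiny ones. A hedged example — the metric name is from Loki 2.x ingester histograms, so confirm it against your version's &lt;code&gt;/metrics&lt;/code&gt; endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.5, sum(rate(loki_ingester_chunk_size_bytes_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the median flushed chunk is only tens of KB, chunks are flushing before they fill — usually a sign of too many sparse streams or an overly aggressive idle period.&lt;/p&gt;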

&lt;p&gt;&lt;strong&gt;&lt;code&gt;chunk_encoding: snappy&lt;/code&gt;&lt;/strong&gt; over gzip because at 13.4 MB/s ingest, CPU matters more than compression ratio. Snappy gives us good-enough compression without burning cores.&lt;/p&gt;
&lt;h3&gt;
  
  
  Limits: The Rate Limiting That Keeps You Alive
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;limits_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;retention_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;72h&lt;/span&gt;
  &lt;span class="na"&gt;ingestion_rate_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;         &lt;span class="c1"&gt;# Per-tenant ingest rate (MB/s)&lt;/span&gt;
  &lt;span class="na"&gt;ingestion_burst_size_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;    &lt;span class="c1"&gt;# Burst allowance&lt;/span&gt;
  &lt;span class="na"&gt;per_stream_rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512M&lt;/span&gt;      &lt;span class="c1"&gt;# Per-stream rate limit&lt;/span&gt;
  &lt;span class="na"&gt;per_stream_rate_limit_burst&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1024M&lt;/span&gt;
  &lt;span class="na"&gt;max_entries_limit_per_query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000000&lt;/span&gt;
  &lt;span class="na"&gt;reject_old_samples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;reject_old_samples_max_age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;168h&lt;/span&gt; &lt;span class="c1"&gt;# Reject logs older than 7 days&lt;/span&gt;
  &lt;span class="na"&gt;cardinality_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200000&lt;/span&gt;
  &lt;span class="na"&gt;max_label_value_length&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20480&lt;/span&gt;
  &lt;span class="na"&gt;max_label_name_length&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10240&lt;/span&gt;
  &lt;span class="na"&gt;max_label_names_per_series&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
  &lt;span class="na"&gt;split_queries_by_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A few things worth calling out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;retention_period: 72h&lt;/code&gt;&lt;/strong&gt; — we only keep 3 days of logs in Loki. This is deliberate. Loki isn't our log archive; it's our log search tool. Anything older is handled by S3 lifecycle rules for long-term retention. This keeps our index small and queries fast.&lt;/p&gt;
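
&lt;p&gt;For completeness: with boltdb-shipper, that retention is enforced by the compactor, not the ingesters. A minimal sketch of the relevant knobs (the path and delete delay are illustrative, not necessarily our exact values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compactor:
  working_directory: /var/loki/compactor
  shared_store: s3
  retention_enabled: true        # without this, retention_period is never applied
  retention_delete_delay: 2h     # grace period before chunks are actually deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;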

&lt;p&gt;&lt;strong&gt;&lt;code&gt;per_stream_rate_limit: 512M&lt;/code&gt;&lt;/strong&gt; — this is intentionally high. We run with &lt;code&gt;auth_enabled: false&lt;/code&gt; (single tenant), so there's no per-tenant isolation. Instead, we rely on per-stream limits to prevent any single application from overwhelming the pipeline. If you're multi-tenant, you'd want much tighter per-tenant limits.&lt;/p&gt;
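
&lt;p&gt;If you do go multi-tenant, per-tenant limits live in a runtime overrides file that Loki reloads without a restart. A hypothetical sketch — the tenant names and numbers are illustrative, not ours:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# loki.yaml
runtime_config:
  file: /etc/loki/overrides.yaml

# overrides.yaml
overrides:
  team-a:
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100
  team-b:
    ingestion_rate_mb: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;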

&lt;p&gt;&lt;strong&gt;&lt;code&gt;reject_old_samples_max_age: 168h&lt;/code&gt;&lt;/strong&gt; — logs older than 7 days get rejected at the distributor. This prevents backfill jobs or misbehaving agents from pushing stale data that would create index entries far from the current write head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;split_queries_by_interval: 15m&lt;/code&gt;&lt;/strong&gt; — the query frontend splits every query into 15-minute sub-queries. These are parallelized across queriers, which is why a 1-hour query range actually fans out to 4 sub-queries. This, combined with the query scheduler (5 replicas), is what keeps our P99 query latency at ~2.75 seconds despite scanning TBs.&lt;/p&gt;
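
&lt;p&gt;How much of that fan-out actually runs in parallel is governed by a couple of related knobs. A sketch with illustrative values (key names from Loki 2.x):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;querier:
  max_concurrent: 8                   # sub-queries each querier executes at once

frontend:
  max_outstanding_per_tenant: 2048    # queue depth before the frontend returns 429s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With 10 queriers at &lt;code&gt;max_concurrent: 8&lt;/code&gt;, up to 80 sub-queries execute simultaneously — at 15-minute splits, that's 20 hours of query range cleared per wave.&lt;/p&gt;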
&lt;h3&gt;
  
  
  Storage: S3 + BoltDB Shipper
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;schema_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2020-09-07"&lt;/span&gt;
      &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h&lt;/span&gt;
        &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki_index_&lt;/span&gt;
      &lt;span class="na"&gt;object_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v11&lt;/span&gt;
      &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boltdb-shipper&lt;/span&gt;

&lt;span class="na"&gt;storage_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;bucketnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-s3-bucket&amp;gt;&lt;/span&gt;
    &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://&amp;lt;your-region&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;boltdb_shipper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;active_index_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/index&lt;/span&gt;
    &lt;span class="na"&gt;cache_location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/cache&lt;/span&gt;
    &lt;span class="na"&gt;cache_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;168h&lt;/span&gt;
    &lt;span class="na"&gt;index_gateway_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;server_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dns:///&amp;lt;loki-index-gateway-service&amp;gt;:9095&lt;/span&gt;
    &lt;span class="na"&gt;shared_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;BoltDB Shipper with Index Gateway&lt;/strong&gt; is the key pattern here. Instead of every querier downloading the full BoltDB index from S3, the index gateway (3 replicas on EFS) serves index lookups over gRPC. This dramatically reduces the number of S3 API calls and keeps query latency consistent.&lt;/p&gt;

&lt;p&gt;We back the index gateways and compactor with &lt;strong&gt;EFS&lt;/strong&gt; (not EBS), because EFS gives us shared persistent storage that survives pod rescheduling across AZs. The ingesters use &lt;strong&gt;gp3 EBS&lt;/strong&gt; volumes (200Gi each) for WAL and local index — they need low-latency local disk, not shared access.&lt;/p&gt;
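
&lt;p&gt;The ingester volumes come from a StatefulSet &lt;code&gt;volumeClaimTemplates&lt;/code&gt; block along these lines (a sketch — the storage class name is whatever your cluster calls gp3):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;volumeClaimTemplates:
  - metadata:
      name: data                 # mounted at /var/loki for WAL + local index
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3      # assumed storage class name
      resources:
        requests:
          storage: 200Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;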
&lt;h3&gt;
  
  
  Caching: The Three Layers
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;chunk_store_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;chunk_cache_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memcached_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;addresses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dnssrv+...memcached-chunks...&lt;/span&gt;
      &lt;span class="na"&gt;consistent_hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;write_dedupe_cache_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memcached_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;addresses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dnssrv+...memcached-index-writes...&lt;/span&gt;
      &lt;span class="na"&gt;consistent_hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;query_range&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;results_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memcached_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;addresses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dnssrv+...memcached-frontend...&lt;/span&gt;
        &lt;span class="na"&gt;consistent_hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three separate memcached tiers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunk cache&lt;/strong&gt; (3 replicas, 4Gi each) — caches decompressed chunks so repeated queries don't hit S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index write dedupe&lt;/strong&gt; (3 replicas, 1Gi each) — deduplicates index writes from multiple ingesters (critical with RF=2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query frontend cache&lt;/strong&gt; (2 replicas, 1Gi each) — caches query results for identical queries within the cache freshness window&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our memcached hit rate sits at &lt;strong&gt;97.8%&lt;/strong&gt;, meaning only ~2.2% of chunk fetches actually hit S3. This is the single biggest factor in keeping query latency reasonable at our scale.&lt;/p&gt;
&lt;h3&gt;
  
  
  Ring Membership: Memberlist over Consul/etcd
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;memberlist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;join_members&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;loki-memberlist-service&amp;gt;&lt;/span&gt;
&lt;span class="na"&gt;distributor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kvstore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memberlist&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We use &lt;strong&gt;memberlist&lt;/strong&gt; (gossip protocol) instead of Consul or etcd for hash ring coordination. One fewer external dependency to manage. It works well up to the scale we're at. If you're running 50+ ingesters, you might want to evaluate Consul for faster ring convergence.&lt;/p&gt;
&lt;h3&gt;
  
  
  gRPC Tuning
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingester_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grpc_client_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;grpc_compression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gzip&lt;/span&gt;
    &lt;span class="na"&gt;max_send_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;    &lt;span class="c1"&gt;# ~200MB&lt;/span&gt;
    &lt;span class="na"&gt;max_recv_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;

&lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grpc_server_max_recv_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;
  &lt;span class="na"&gt;grpc_server_max_send_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The default gRPC message sizes are too small for production. At high ingest rates, distributor-to-ingester messages can get large, especially when batching. We set both client and server to ~200MB. The &lt;code&gt;gzip&lt;/code&gt; compression on the ingester client cuts inter-component bandwidth significantly.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Numbers from Production
&lt;/h2&gt;

&lt;p&gt;Here's what the cluster looks like right now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingest rate&lt;/td&gt;
&lt;td&gt;~13.4 MB/s (~1.16 TB/day)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log lines/sec&lt;/td&gt;
&lt;td&gt;~34,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active streams&lt;/td&gt;
&lt;td&gt;632&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 push latency&lt;/td&gt;
&lt;td&gt;~245ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 query latency&lt;/td&gt;
&lt;td&gt;~2.75s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk flush rate&lt;/td&gt;
&lt;td&gt;~9.4 chunks/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push error rate&lt;/td&gt;
&lt;td&gt;0% (all 204s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memcached hit rate&lt;/td&gt;
&lt;td&gt;97.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingester memory (actual)&lt;/td&gt;
&lt;td&gt;~15.9 GB across 4 pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Querier memory (actual)&lt;/td&gt;
&lt;td&gt;~4.8 GB across 10 pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cluster memory footprint&lt;/td&gt;
&lt;td&gt;~22.4 GB working set (Loki components only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few observations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;632 active streams is low for 34k lines/sec&lt;/strong&gt; — this means our log labels are well-structured. High cardinality (thousands of unique label combinations) is the number one killer of Loki performance. We keep stream count low by using only a handful of labels: &lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;pod&lt;/code&gt;, &lt;code&gt;container&lt;/code&gt;, and &lt;code&gt;app&lt;/code&gt;. We avoid dynamic labels like request IDs or user IDs.&lt;/p&gt;
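
&lt;p&gt;If you ship with Promtail and Kubernetes service discovery, keeping the label set that small is a relabeling decision. A fragment-level sketch (the meta label names are Promtail's standard Kubernetes SD labels; the job name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nothing that isn't explicitly mapped here becomes a stream label, which is exactly the point.&lt;/p&gt;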

&lt;p&gt;&lt;strong&gt;245ms P99 push latency is solid&lt;/strong&gt; at this ingest rate. The 6 distributors with gRPC compression keep the write path fast. If this creeps above 500ms, it's time to add distributors or check if ingesters are falling behind on flushes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.75s P99 query latency is acceptable&lt;/strong&gt; for our use case (human-driven debugging sessions). If you need sub-second queries, look at increasing the querier count and reducing &lt;code&gt;split_queries_by_interval&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Monitoring Loki: The PromQL Queries That Matter
&lt;/h2&gt;

&lt;p&gt;Running Loki without monitoring Loki is flying blind. Loki exposes a rich set of Prometheus metrics out of the box — the challenge is knowing which ones to watch and what they're telling you. Here's the monitoring playbook, broken down by component.&lt;/p&gt;
&lt;h3&gt;
  
  
  Distributor Metrics — Is Data Getting In?
&lt;/h3&gt;

&lt;p&gt;The distributor is your canary. If the write path is unhealthy, these metrics will show it first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest rate (bytes/sec):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_bytes_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is your headline number — total bytes/sec hitting Loki across all distributors. Track this on a dashboard as the primary throughput gauge. A sudden drop means log agents are failing to ship; a sudden spike means something is logging excessively (a crash loop, debug logging left on, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest rate (lines/sec):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_lines_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this against bytes/sec to derive your average line size. If lines/sec stays flat but bytes/sec spikes, something is producing abnormally large log lines (stack traces, serialized payloads). If lines/sec spikes but bytes/sec doesn't, you've got a chatty service producing many small lines.&lt;/p&gt;
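
&lt;p&gt;That division is worth putting on the dashboard directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_bytes_received_total[5m]))
  /
sum(rate(loki_distributor_lines_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At our numbers (~13.4 MB/s over ~34k lines/sec) that works out to roughly 400 bytes per line — a sensible baseline for structured application logs.&lt;/p&gt;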

&lt;p&gt;&lt;strong&gt;Distributor-to-ingester failures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_ingester_append_failures_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should be zero. Any non-zero value means distributors are failing to write to ingesters — the ingester ring might be unhealthy, an ingester is OOMing, or gRPC connections are timing out. Alert on this immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discarded samples (dropped logs):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (reason) (rate(loki_discarded_samples_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logs that Loki actively rejected, grouped by reason. Common reasons: &lt;code&gt;rate_limited&lt;/code&gt; (hitting per-tenant or per-stream limits), &lt;code&gt;greater_than_max_sample_age&lt;/code&gt; (old logs rejected by &lt;code&gt;reject_old_samples_max_age&lt;/code&gt;), &lt;code&gt;per_stream_rate_limit&lt;/code&gt;. If you see &lt;code&gt;rate_limited&lt;/code&gt;, your limits are too tight or a service is misbehaving. This is the metric that tells you when logs are being silently dropped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingester Metrics — The Heart of the System
&lt;/h3&gt;

&lt;p&gt;Ingesters are stateful, memory-heavy, and the most likely component to cause data loss if they go wrong. Monitor them closely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active streams:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_memory_streams)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number of unique label combinations currently active in memory across all ingesters. This is your &lt;strong&gt;cardinality gauge&lt;/strong&gt; — the single most important metric for Loki health. If this number grows unbounded, you have a label cardinality problem (someone added a dynamic label like &lt;code&gt;request_id&lt;/code&gt; or &lt;code&gt;user_id&lt;/code&gt;). We alert if this crosses 2x our baseline.&lt;/p&gt;
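
&lt;p&gt;When that alert fires, the quickest way to find the offending label is logcli's cardinality analysis (the gateway address is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logcli series '{}' --analyze-labels --addr=http://&amp;lt;loki-gateway&amp;gt;:3100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;--analyze-labels&lt;/code&gt; prints each label with its count of unique values — the label with thousands of values is your culprit.&lt;/p&gt;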

&lt;p&gt;&lt;strong&gt;Memory chunks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_memory_chunks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number of chunk objects held in ingester memory. Each active stream has at least one chunk being actively written to. This correlates with memory usage — more chunks = more RAM. If chunks grow faster than flushes, ingesters will OOM.&lt;/p&gt;
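
&lt;p&gt;A useful derived ratio from these gauges:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_memory_chunks) / sum(loki_ingester_memory_streams)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Chunks per active stream. Steady means flushing is keeping pace; climbing while stream count stays flat means chunks are accumulating — check the flush rate and S3 latency next.&lt;/p&gt;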

&lt;p&gt;&lt;strong&gt;Chunk flush rate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_ingester_chunks_flushed_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How many chunks/sec are being flushed to long-term storage. This should be steady. A drop in flush rate while ingest stays constant means chunks are accumulating in memory — check whether S3 writes have slowed or the flush queue is backed up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk age at flush (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99, sum(rate(loki_ingester_chunk_age_seconds_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How old chunks are when they get flushed. Should be close to your &lt;code&gt;max_chunk_age&lt;/code&gt; setting (30 minutes = 1800s for us). If P99 chunk age drifts significantly higher, ingesters are holding data too long — either the flush loop is slow or chunks aren't hitting the idle timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk compression ratio:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.5, sum(rate(loki_ingester_chunk_compression_ratio_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How well your chunks compress. A ratio of 0.3 means data compresses to 30% of its original size (~3.3x). If this ratio climbs toward 1.0, your logs are either already compressed or have high entropy (binary data being logged). Snappy encoding typically gives 3-4x for structured text logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAL bytes in use:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_wal_bytes_in_use)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How much data is sitting in the Write-Ahead Log. If this grows steadily, WAL checkpointing may be falling behind. A zero value (like ours currently) means the WAL is keeping up — data is checkpointed and flushed regularly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAL corruptions (alert on any non-zero):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_ingester_wal_corruptions_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;WAL corruption means potential data loss on recovery. This should always be zero. Any corruption usually points to disk issues (EBS volume problems, filesystem corruption). Alert immediately and investigate the underlying storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint duration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loki_ingester_checkpoint_duration_seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How long WAL checkpoints take. If this exceeds your &lt;code&gt;checkpoint_duration&lt;/code&gt; setting (1 minute for us), checkpoints are overlapping and the WAL will grow unbounded. Usually means disk I/O is saturated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Path Metrics — Are Queries Healthy?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Push latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_request_duration_seconds_bucket{route="loki_api_v1_push"}[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;End-to-end time for a push request. This spans distributor validation, hashing, and ingester writes. Under 500ms is healthy. Above 1s means something in the write path is saturated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_request_duration_seconds_bucket{route=~"loki_api_v1_query_range|loki_api_v1_query"}[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;End-to-end time for read queries. This is what your Grafana users feel. The acceptable threshold depends on your use case — under 5s for interactive debugging is reasonable. If this degrades, check if queriers are memory-saturated, cache hit rates dropped, or someone is running expensive queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request rate by status code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (status_code, route) (rate(loki_request_duration_seconds_count[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Break down all requests by HTTP status code. &lt;code&gt;204&lt;/code&gt; for pushes is success. &lt;code&gt;200&lt;/code&gt; for queries is success. Watch for &lt;code&gt;429&lt;/code&gt; (rate limited), &lt;code&gt;500&lt;/code&gt; (internal errors), and &lt;code&gt;503&lt;/code&gt; (service unavailable — usually means queriers are overloaded).&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Metrics — Your Performance Multiplier
&lt;/h3&gt;

&lt;p&gt;Caching is what makes Loki viable at scale. If cache hit rates drop, query latency will spike and S3 costs will climb.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk cache hit rate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_cache_hits{cache="chunks"}[5m]))
  /
sum(rate(loki_cache_fetched_keys{cache="chunks"}[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What percentage of chunk fetches are served from memcached instead of S3. We target &amp;gt;95%. Below 90% means your memcached is undersized or your query patterns aren't cache-friendly (too many unique time ranges).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query result cache hit rate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_query_frontend_log_result_cache_hit_total[5m]))
  /
(
  sum(rate(loki_query_frontend_log_result_cache_hit_total[5m]))
  + sum(rate(loki_query_frontend_log_result_cache_miss_total[5m]))
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How often the query frontend serves results from cache instead of dispatching to queriers. High hit rates here mean your users are running the same (or overlapping) queries repeatedly — common when multiple people investigate the same incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memcached client health:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loki_memcache_client_servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Number of memcached servers each Loki component can see. If this drops below expected (e.g., from 3 to 2), a memcached pod is down and you're losing cache capacity. The consistent hashing will redistribute, but hit rate will temporarily drop as the remaining nodes re-warm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache queue depth:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_cache_background_queue_length)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How many cache writes are queued. A growing queue means memcached can't keep up with write volume — either the memcached pods need more CPU/memory, or network latency between Loki and memcached is too high.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Gateway Metrics — Is Index Serving Healthy?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Index gateway request latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_index_gateway_request_duration_seconds_bucket[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How long it takes the index gateway to serve a lookup. Should be low (under 500ms). High latency here means the index is too large for memory and the gateway is reading from disk, or EFS is slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BoltDB shipper upload health:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_boltdb_shipper_tables_upload_operation_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rate of index table uploads from ingesters to shared storage. A zero rate means ingesters aren't shipping index tables — queries for recently ingested data may fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BoltDB shipper request latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_boltdb_shipper_request_duration_seconds_bucket[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How long BoltDB shipper operations take. Our P99 sits at ~378ms. If this spikes, shared storage (S3/EFS) is likely the bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Panic Metric — The Last Resort
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_panic_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loki processes that panicked (crashed). This should always be zero. Any non-zero value means a component is hitting an unhandled error — check logs immediately. This is your "something is deeply wrong" alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommended Alert Rules
&lt;/h3&gt;

&lt;p&gt;Based on what we've learned running this in production, here are the alerts worth wiring up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki-alerts&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiIngesterAppendFailures&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_distributor_ingester_append_failures_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Distributors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;write&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ingesters"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiDiscardedSamples&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_discarded_samples_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loki&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dropping&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;label"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiIngesterHighMemoryStreams&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(loki_ingester_memory_streams) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Active&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unusually&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;possible&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cardinality&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;explosion"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiPushLatencyHigh&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;histogram_quantile(0.99,&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(loki_request_duration_seconds_bucket{route="loki_api_v1_push"}[5m])) by (le)&lt;/span&gt;
          &lt;span class="s"&gt;) &amp;gt; 1&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;push&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;second"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiChunkCacheHitRateLow&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(loki_cache_hits{cache="chunks"}[15m]))&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(loki_cache_fetched_keys{cache="chunks"}[15m]))&lt;/span&gt;
          &lt;span class="s"&gt;&amp;lt; 0.90&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chunk&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;90%"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiWALCorruption&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_ingester_wal_corruptions_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingester&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WAL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;corruption&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;potential&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;loss"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiPanic&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_panic_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loki&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;component&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;panicked"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Things We'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with TSDB index (schema v12+) instead of BoltDB Shipper.&lt;/strong&gt; We're on schema v11 because we started early. TSDB is significantly better for index performance and is the direction Grafana is investing in. New deployments should use it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set up recording rules earlier.&lt;/strong&gt; We added Prometheus recording rules for Loki metrics after the fact. Having rules like &lt;code&gt;loki:ingester_memory_streams:sum&lt;/code&gt; and &lt;code&gt;loki:distributor_bytes:rate5m&lt;/code&gt; pre-computed saves a lot of dashboard query overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consider upgrading to Loki 3.x.&lt;/strong&gt; We're on 2.9.10. Loki 3.x brings native OTLP ingest, improved bloom filters for faster queries, and the new v13 schema. The migration path exists but requires careful planning around schema migration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
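
&lt;p&gt;The recording rules mentioned in point 2 are ordinary Prometheus rules. This is a minimal sketch — the &lt;code&gt;expr&lt;/code&gt; bodies are our assumed mappings (e.g. &lt;code&gt;loki_distributor_bytes_received_total&lt;/code&gt; as the source for the bytes rate) and should be adapted to the metrics you actually dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: loki-recording-rules
    rules:
      # Total active streams across all ingesters
      - record: loki:ingester_memory_streams:sum
        expr: sum(loki_ingester_memory_streams)
      # Ingest throughput in bytes/sec, 5m window
      - record: loki:distributor_bytes:rate5m
        expr: sum(rate(loki_distributor_bytes_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dashboards then query the pre-computed series directly instead of re-aggregating raw counters on every panel refresh.&lt;/p&gt;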

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Loki in distributed mode isn't "install and forget." But it also isn't the operational nightmare that some make it out to be. The key principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Label cardinality is everything.&lt;/strong&gt; Keep your active stream count low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache aggressively.&lt;/strong&gt; Three-tier memcached turned 97.8% of our S3 reads into cache hits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size your ingesters for memory, not CPU.&lt;/strong&gt; They're memory-bound workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use index gateways&lt;/strong&gt; to keep queriers from hammering S3 for index lookups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set retention aggressively&lt;/strong&gt; if Loki isn't your archive. 72 hours keeps queries fast.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>logging</category>
      <category>devops</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
