<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sriram Rajendran</title>
    <description>The latest articles on DEV Community by Sriram Rajendran (@sriramrajendran).</description>
    <link>https://dev.to/sriramrajendran</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F493246%2F53e33b7d-c73f-4236-9aec-d210a847c367.jpeg</url>
      <title>DEV Community: Sriram Rajendran</title>
      <link>https://dev.to/sriramrajendran</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sriramrajendran"/>
    <language>en</language>
    <item>
      <title>MySQL InnoDB Locking: The Silent Killer Behind Database Crashes</title>
      <dc:creator>Sriram Rajendran</dc:creator>
      <pubDate>Thu, 02 Apr 2026 07:50:40 +0000</pubDate>
      <link>https://dev.to/sriramrajendran/mysql-innodb-locking-the-silent-killer-behind-database-crashes-1ak9</link>
      <guid>https://dev.to/sriramrajendran/mysql-innodb-locking-the-silent-killer-behind-database-crashes-1ak9</guid>
      <description>&lt;p&gt;We run a fleet of MySQL 8.0 RDS instances — multi-TB databases on 32-vCPU / 128 GB machines doing thousands of write IOPS at peak across 1,000+ concurrent connections. Three of them have been brought to their knees by locking over the past year. Not slow queries. Not CPU saturation. Not disk I/O. Locks.&lt;/p&gt;

&lt;p&gt;Here's what actually happens: a single metadata lock from a partition drop cascades into 1,600 queued connections in 90 seconds, exhausting your connection pool and crashing every microservice that writes to that table. A handful of abandoned application connections holding row locks slowly strangle write throughput until the database is effectively unresponsive. A bulk &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; under &lt;code&gt;REPEATABLE READ&lt;/code&gt; gap-locks a range of an index and deadlocks every concurrent insert attempting to write into that range. In each case, the database didn't run out of CPU or memory — it ran out of the ability to make progress. Transactions pile up, connection pools saturate, application threads block, health checks fail, and the cascading failure takes out services that don't even touch the locked table.&lt;/p&gt;

&lt;p&gt;This post is the full picture — what InnoDB's locks actually are, why they exist, how they interact, and the specific queries we use to find them before they page us.&lt;/p&gt;

&lt;p&gt;But first — if you've never thought about why databases lock at all, let's build up from first principles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Databases Lock: The Problem of Concurrent Access
&lt;/h2&gt;

&lt;p&gt;Every multi-user database has the same fundamental problem: multiple transactions reading and writing the same data simultaneously. Without some coordination mechanism, you get anomalies — dirty reads, lost updates, phantom rows. The SQL standard defines four isolation levels (&lt;code&gt;READ UNCOMMITTED&lt;/code&gt;, &lt;code&gt;READ COMMITTED&lt;/code&gt;, &lt;code&gt;REPEATABLE READ&lt;/code&gt;, &lt;code&gt;SERIALIZABLE&lt;/code&gt;) that describe which anomalies you're willing to tolerate.&lt;/p&gt;

&lt;p&gt;The question is &lt;em&gt;how&lt;/em&gt; the database enforces the isolation level you chose. There are two broad strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pessimistic concurrency (locking).&lt;/strong&gt; Before accessing a row, acquire a lock on it. If someone else already holds a conflicting lock, wait. This guarantees correctness by preventing concurrent access entirely. The downside: contention. Transactions queue behind each other, and throughput drops under load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimistic concurrency (MVCC).&lt;/strong&gt; Don't lock for reads. Instead, maintain multiple versions of each row. Readers see a consistent snapshot from the start of their transaction; writers create new versions. Conflicts are detected at commit time. The downside: maintaining multiple versions costs memory and storage, and write-write conflicts still need resolution.&lt;/p&gt;

&lt;p&gt;Here's the thing — most production databases use &lt;strong&gt;both&lt;/strong&gt;. InnoDB uses MVCC for reads (non-locking consistent reads) and pessimistic locking for writes. When you run a &lt;code&gt;SELECT&lt;/code&gt;, InnoDB reads from a snapshot without acquiring any row locks. When you run an &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt;, InnoDB acquires exclusive locks on the affected rows. This hybrid is why MySQL can handle thousands of concurrent readers without contention, but writers can still block each other.&lt;/p&gt;
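
&lt;p&gt;A minimal two-session sketch makes the hybrid concrete (row and table are illustrative): the plain &lt;code&gt;SELECT&lt;/code&gt; reads from its snapshot and never waits, while the locking read queues behind the writer's X lock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: UPDATE orders SET status = 'shipped' WHERE id = 42;  -- X lock on row 42, not yet committed
Tx B: SELECT * FROM orders WHERE id = 42;                  -- snapshot read: returns the old version, no wait
Tx B: SELECT * FROM orders WHERE id = 42 FOR UPDATE;       -- locking read: blocks until Tx A commits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;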

&lt;h3&gt;
  
  
  Where databases fall on the spectrum
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Default Isolation&lt;/th&gt;
&lt;th&gt;Read Strategy&lt;/th&gt;
&lt;th&gt;Write Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MySQL/InnoDB&lt;/td&gt;
&lt;td&gt;REPEATABLE READ&lt;/td&gt;
&lt;td&gt;MVCC (snapshot)&lt;/td&gt;
&lt;td&gt;Row-level locking (in-place update, old version to undo log)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;READ COMMITTED&lt;/td&gt;
&lt;td&gt;MVCC (snapshot)&lt;/td&gt;
&lt;td&gt;Row-level locking (new tuple version per update, old tuple marked dead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Optimistic (conditional writes)&lt;/td&gt;
&lt;td&gt;Optimistic (conditional writes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CockroachDB&lt;/td&gt;
&lt;td&gt;SERIALIZABLE&lt;/td&gt;
&lt;td&gt;MVCC&lt;/td&gt;
&lt;td&gt;Pessimistic + optimistic hybrid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both MySQL and PostgreSQL use row-level locking for writes: if two transactions try to update the same row, the second one blocks until the first commits or rolls back. They share the same hybrid model — MVCC for reads, pessimistic locks for writes — but differ in storage mechanics, gap locking behavior, and conflict resolution.&lt;/p&gt;

&lt;p&gt;One operational difference worth calling out: under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, when two transactions try to update the same row, both databases block the second writer until the first commits. But what happens &lt;em&gt;after&lt;/em&gt; the commit differs. InnoDB lets the second writer &lt;strong&gt;proceed&lt;/strong&gt; — it reads the latest committed version and applies its update on top. PostgreSQL takes the opposite approach: it &lt;strong&gt;aborts&lt;/strong&gt; the second writer with &lt;code&gt;ERROR: could not serialize access due to concurrent update&lt;/code&gt;, forcing the application to retry. This "first-updater-wins" policy means PostgreSQL detects write-write conflicts and forces explicit retry logic, while InnoDB silently applies the update to the newest committed version. Neither has a classic "lost update" — but InnoDB's behavior means the second transaction's update is based on a version it never read in its snapshot, which can lead to subtle anomalies if the application logic depends on snapshot consistency across reads and writes.&lt;/p&gt;
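
&lt;p&gt;The divergence in a two-session timeline (same statements on both systems, &lt;code&gt;REPEATABLE READ&lt;/code&gt; on each; table name illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: UPDATE accounts SET balance = balance - 10 WHERE id = 1;
Tx B: UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- blocks behind Tx A on both systems
Tx A: COMMIT;
      -- InnoDB:     Tx B proceeds, applying its update on top of A's committed version
      -- PostgreSQL: Tx B aborts: ERROR: could not serialize access due to concurrent update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;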

&lt;p&gt;InnoDB's locking is more aggressive than PostgreSQL's in one critical way: &lt;strong&gt;gap locking&lt;/strong&gt;. Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, InnoDB locks not just the rows that match your query, but the &lt;em&gt;gaps between&lt;/em&gt; those rows to prevent phantom reads. PostgreSQL also prevents phantoms at &lt;code&gt;REPEATABLE READ&lt;/code&gt;, but through a completely different mechanism — pure snapshot isolation. Since each transaction reads from a fixed snapshot, newly inserted rows by other transactions are simply invisible, so phantoms can't occur without any locking. The key difference isn't &lt;em&gt;whether&lt;/em&gt; phantoms are prevented (both do), but &lt;em&gt;how&lt;/em&gt;: InnoDB prevents them pessimistically by locking gaps in the index, while PostgreSQL prevents them passively through snapshot visibility rules. InnoDB's approach means write transactions can block each other even when they're operating on non-overlapping rows, simply because their index ranges overlap. PostgreSQL avoids this contention entirely at &lt;code&gt;REPEATABLE READ&lt;/code&gt;, only adding conflict detection at &lt;code&gt;SERIALIZABLE&lt;/code&gt; via SSI with non-blocking predicate locks.&lt;/p&gt;




&lt;h2&gt;
  
  
  InnoDB's Lock Types: The Full Taxonomy
&lt;/h2&gt;

&lt;p&gt;InnoDB has more lock types than most engineers realize. Understanding the hierarchy matters because lock conflicts aren't always row-vs-row — they can be gap-vs-insert, metadata-vs-DML, or intention-vs-table. Here's the complete picture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Row-Level Locks (Record Locks)
&lt;/h3&gt;

&lt;p&gt;The most intuitive lock type. A record lock locks a single index record. When you execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;InnoDB acquires an exclusive (X) lock on the index record for &lt;code&gt;id = 42&lt;/code&gt;. Any other transaction attempting to &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; that same row will block until the first transaction commits or rolls back.&lt;/p&gt;

&lt;p&gt;Two flavors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shared (S) lock&lt;/strong&gt;: Acquired by &lt;code&gt;SELECT ... FOR SHARE&lt;/code&gt; (or &lt;code&gt;LOCK IN SHARE MODE&lt;/code&gt;). Multiple transactions can hold S locks on the same row simultaneously. Blocks X locks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclusive (X) lock&lt;/strong&gt;: Acquired by &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;. Only one transaction can hold an X lock. Blocks both S and X locks.&lt;/li&gt;
&lt;/ul&gt;
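
&lt;p&gt;A quick illustration of the compatibility rules (row illustrative): two shared locks co-exist, but a writer queues behind either of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: SELECT * FROM orders WHERE id = 42 FOR SHARE;        -- S lock on row 42
Tx B: SELECT * FROM orders WHERE id = 42 FOR SHARE;        -- S + S: both proceed
Tx C: UPDATE orders SET status = 'shipped' WHERE id = 42;  -- needs X: blocks until A and B finish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;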

&lt;p&gt;Record locks always operate on &lt;strong&gt;index records&lt;/strong&gt;, not table rows directly. The MySQL docs state: &lt;em&gt;"Record locks always lock index records, even if a table is defined with no indexes. For such cases, InnoDB creates a hidden clustered index and uses this index for record locking."&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html" rel="noopener noreferrer"&gt;InnoDB Locking&lt;/a&gt;). This is why index design directly affects lock contention — a full table scan under an &lt;code&gt;UPDATE&lt;/code&gt; means locking every index record scanned, not just the rows that match the &lt;code&gt;WHERE&lt;/code&gt; clause: &lt;em&gt;"If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked"&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locks-set.html" rel="noopener noreferrer"&gt;Locks Set by Different SQL Statements&lt;/a&gt;).&lt;/p&gt;
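
&lt;p&gt;This is easy to demonstrate: the same logical &lt;code&gt;UPDATE&lt;/code&gt; locks wildly different amounts depending on whether the &lt;code&gt;WHERE&lt;/code&gt; column is indexed (column and index names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- status NOT indexed: InnoDB scans the clustered index and
-- locks every row it examines, matching or not
UPDATE orders SET priority = 1 WHERE status = 'pending';

-- after: CREATE INDEX idx_status ON orders (status);
-- only the index records for 'pending' (plus adjacent gaps
-- under REPEATABLE READ) are locked
UPDATE orders SET priority = 1 WHERE status = 'pending';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;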

&lt;h3&gt;
  
  
  Gap Locks
&lt;/h3&gt;

&lt;p&gt;This is where InnoDB gets interesting — and where most locking surprises come from.&lt;/p&gt;

&lt;p&gt;A gap lock locks the &lt;em&gt;gap between&lt;/em&gt; index records, preventing other transactions from inserting into that gap. It doesn't lock the records themselves.&lt;/p&gt;

&lt;p&gt;Consider a table with an indexed column &lt;code&gt;age&lt;/code&gt; containing values &lt;code&gt;[10, 20, 30]&lt;/code&gt;. The gaps are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(-∞, 10)  (10, 20)  (20, 30)  (30, +∞)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If transaction A runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;InnoDB locks the gap &lt;code&gt;(10, 20)&lt;/code&gt; and the gap &lt;code&gt;(20, 30)&lt;/code&gt;, plus the record at &lt;code&gt;20&lt;/code&gt;. Because the scan must read one index record past the range to know where it ends, the locking typically covers the record at &lt;code&gt;30&lt;/code&gt; as well. This prevents any other transaction from inserting &lt;code&gt;age = 12&lt;/code&gt;, &lt;code&gt;age = 17&lt;/code&gt;, &lt;code&gt;age = 22&lt;/code&gt;, or &lt;code&gt;age = 28&lt;/code&gt; — even though none of those rows exist yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why gap locks exist:&lt;/strong&gt; To prevent &lt;strong&gt;phantom reads&lt;/strong&gt; under &lt;code&gt;REPEATABLE READ&lt;/code&gt;. Without gap locks, transaction A could run the same &lt;code&gt;SELECT&lt;/code&gt; twice and get different results because transaction B inserted a new row in the range between the two reads. Gap locks guarantee that if you read a range, no one can insert into that range until you commit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; Gap locks are &lt;strong&gt;purely inhibitive&lt;/strong&gt;. The MySQL docs are explicit: &lt;em&gt;"Gap locks in InnoDB are 'purely inhibitive', which means that their only purpose is to prevent other transactions from inserting to the gap. Gap locks can co-exist."&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html" rel="noopener noreferrer"&gt;InnoDB Locking&lt;/a&gt;). Two transactions can hold gap locks on the same gap simultaneously. This seems harmless until you realize that each is blocking the other's &lt;code&gt;INSERT&lt;/code&gt;, creating a deadlock.&lt;/p&gt;

&lt;p&gt;The key is that this happens when &lt;strong&gt;no rows match the query&lt;/strong&gt; — only gaps are locked, with no record locks to cause the second &lt;code&gt;SELECT&lt;/code&gt; to block. Consider a table where &lt;code&gt;age&lt;/code&gt; contains only &lt;code&gt;[10, 30]&lt;/code&gt; (no row with &lt;code&gt;age = 20&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tx A: SELECT ... WHERE age BETWEEN 15 AND 25 FOR UPDATE;  -- no matching rows; gap lock on (10, 30)
Tx B: SELECT ... WHERE age BETWEEN 15 AND 25 FOR UPDATE;  -- also gap lock on (10, 30) — no conflict!
Tx A: INSERT INTO users (age) VALUES (18);                 -- blocked by Tx B's gap lock
Tx B: INSERT INTO users (age) VALUES (22);                 -- blocked by Tx A's gap lock → DEADLOCK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If rows &lt;em&gt;did&lt;/em&gt; exist in the range, the &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; would acquire next-key locks (record + gap), and the X record locks would cause Tx B to block on the &lt;code&gt;SELECT&lt;/code&gt; itself — no deadlock, just contention. It's the empty-range case that's dangerous: both transactions sail through the &lt;code&gt;SELECT&lt;/code&gt;, acquire only gap locks, and then deadlock on the &lt;code&gt;INSERT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We've hit this pattern in production. More on that later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next-Key Locks
&lt;/h3&gt;

&lt;p&gt;A next-key lock is a &lt;strong&gt;combination&lt;/strong&gt; of a record lock and a gap lock on the gap before that record. It's InnoDB's default locking strategy for index scans under &lt;code&gt;REPEATABLE READ&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For our &lt;code&gt;age&lt;/code&gt; index &lt;code&gt;[10, 20, 30]&lt;/code&gt;, the next-key locks are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(-∞, 10]  (10, 20]  (20, 30]  (30, +∞)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the notation: the record itself is included (closed bracket on the right). A next-key lock on &lt;code&gt;(10, 20]&lt;/code&gt; locks the gap &lt;code&gt;(10, 20)&lt;/code&gt; AND the record &lt;code&gt;20&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When you run a range scan like &lt;code&gt;WHERE age &amp;gt; 15 AND age &amp;lt; 25&lt;/code&gt;, InnoDB places next-key locks on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(10, 20]&lt;/code&gt; — gap before 20, plus record 20&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(20, 30]&lt;/code&gt; — gap before 30, plus record 30&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is more than the rows that match your predicate. The extra locking is the cost of preventing phantoms via pessimistic locking rather than snapshot visibility.&lt;/p&gt;
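
&lt;p&gt;You can watch these locks land from a third session. In MySQL 8.0, &lt;code&gt;performance_schema.data_locks&lt;/code&gt; shows one row per held or requested lock: next-key locks appear with &lt;code&gt;LOCK_MODE = 'X'&lt;/code&gt;, pure gap locks as &lt;code&gt;'X,GAP'&lt;/code&gt;, and record-only locks as &lt;code&gt;'X,REC_NOT_GAP'&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT object_name, index_name, lock_type, lock_mode, lock_status, lock_data
FROM performance_schema.data_locks
WHERE object_name = 'users';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;LOCK_DATA&lt;/code&gt; shows the index value the lock sits on, which is how you confirm which gaps a range scan actually took.&lt;/p&gt;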

&lt;h3&gt;
  
  
  Intention Locks
&lt;/h3&gt;

&lt;p&gt;Intention locks are &lt;strong&gt;table-level&lt;/strong&gt; locks that signal what kind of row-level locks a transaction intends to acquire. They exist purely for efficiency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intention Shared (IS)&lt;/strong&gt;: "I'm going to acquire shared row locks in this table."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intention Exclusive (IX)&lt;/strong&gt;: "I'm going to acquire exclusive row locks in this table."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you run &lt;code&gt;UPDATE orders SET status = 'shipped' WHERE id = 42&lt;/code&gt;, InnoDB first acquires an IX lock on the &lt;code&gt;orders&lt;/code&gt; table, then an X lock on the row. The IX lock is cheap (no contention between IX and IX) and serves one purpose: if another transaction wants a &lt;strong&gt;full table lock&lt;/strong&gt; (&lt;code&gt;LOCK TABLES orders WRITE&lt;/code&gt;), it can check for intention locks instead of scanning every row's lock state.&lt;/p&gt;

&lt;p&gt;Intention locks never block each other. IX + IX is fine. IS + IX is fine. They only conflict with full table locks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;IS&lt;/th&gt;
&lt;th&gt;IX&lt;/th&gt;
&lt;th&gt;S&lt;/th&gt;
&lt;th&gt;X&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, you almost never think about intention locks unless you're debugging a case where &lt;code&gt;LOCK TABLES&lt;/code&gt; or DDL is blocking behind thousands of row-level transactions.&lt;/p&gt;
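
&lt;p&gt;When you do need to see them, the same &lt;code&gt;performance_schema.data_locks&lt;/code&gt; view exposes intention locks as table-level rows (&lt;code&gt;LOCK_MODE&lt;/code&gt; of &lt;code&gt;IX&lt;/code&gt; or &lt;code&gt;IS&lt;/code&gt;) alongside the row locks they announce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT object_name, lock_type, lock_mode, lock_status
FROM performance_schema.data_locks
WHERE lock_type = 'TABLE';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;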

&lt;h3&gt;
  
  
  Insert Intention Locks
&lt;/h3&gt;

&lt;p&gt;A special type of gap lock acquired before an &lt;code&gt;INSERT&lt;/code&gt;. It signals: "I intend to insert into this gap, but I don't need to lock the entire gap — just my specific insertion point."&lt;/p&gt;

&lt;p&gt;Two transactions inserting into the same gap at &lt;strong&gt;different positions&lt;/strong&gt; won't block each other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Gap (10, 20) exists&lt;/span&gt;
&lt;span class="n"&gt;Tx&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- insert intention lock at 12&lt;/span&gt;
&lt;span class="n"&gt;Tx&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- insert intention lock at 17 — no conflict!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But an insert intention lock DOES conflict with a gap lock on the same gap. This is the mechanism that makes gap locks effective — they block insert intentions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-Increment Locks
&lt;/h3&gt;

&lt;p&gt;When a table has an &lt;code&gt;AUTO_INCREMENT&lt;/code&gt; column, InnoDB needs to serialize the generation of new values. Historically, this was a table-level lock held for the duration of an &lt;code&gt;INSERT&lt;/code&gt; statement (the "traditional" mode). Modern InnoDB has three modes controlled by &lt;code&gt;innodb_autoinc_lock_mode&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Table-level AUTO-INC lock held for entire statement&lt;/td&gt;
&lt;td&gt;Legacy, maximally safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consecutive&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Lightweight mutex for simple inserts, table lock for bulk inserts&lt;/td&gt;
&lt;td&gt;Default through MySQL 5.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interleaved&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Lightweight mutex for all inserts, values may have gaps&lt;/td&gt;
&lt;td&gt;Default in MySQL 8.0, safe only with row-based replication&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In mode 2 (the MySQL 8.0 default), all &lt;code&gt;INSERT&lt;/code&gt; statements use a lightweight mutex that's released as soon as the value is generated — not when the statement completes. This means auto-increment is rarely a bottleneck. The caveat applies to mode 1: there, bulk operations like &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; or &lt;code&gt;LOAD DATA&lt;/code&gt; still hold the table-level AUTO-INC lock for the entire statement duration, so concurrent bulk inserts can contend.&lt;/p&gt;
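
&lt;p&gt;The mode is a server variable that is not dynamic — changing it requires a restart (on RDS, a parameter group change plus reboot) — so it's worth confirming what a given instance is actually running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT @@innodb_autoinc_lock_mode;  -- 2 = interleaved (the 8.0 default)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;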

&lt;h3&gt;
  
  
  Metadata Locks (MDL)
&lt;/h3&gt;

&lt;p&gt;This is the lock type that catches most people off guard because it's &lt;strong&gt;not an InnoDB lock&lt;/strong&gt; — it's a MySQL server-level lock that sits above the storage engine.&lt;/p&gt;

&lt;p&gt;Metadata locks protect the schema definition of a table. Any DML statement (&lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;) acquires a &lt;strong&gt;shared MDL&lt;/strong&gt; on the table. DDL statements (&lt;code&gt;ALTER TABLE&lt;/code&gt;, &lt;code&gt;DROP TABLE&lt;/code&gt;, &lt;code&gt;DROP PARTITION&lt;/code&gt;) require an &lt;strong&gt;exclusive MDL&lt;/strong&gt;. Critically, once an exclusive MDL request is pending, all new shared MDL requests queue behind it — a waiting DDL blocks every subsequent DML from acquiring the lock (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/metadata-locking.html" rel="noopener noreferrer"&gt;Metadata Locking&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The rule is simple: DDL waits for all active DML to finish, and blocks all new DML while waiting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timeline:
  T0: Tx A starts SELECT on orders (acquires shared MDL)
  T1: DBA runs ALTER TABLE orders ... (needs exclusive MDL → blocked by Tx A)
  T2: Tx B starts INSERT INTO orders (needs shared MDL → blocked by ALTER)
  T3: Tx C starts SELECT on orders (needs shared MDL → blocked by ALTER)
  ...
  T?: Tx A finishes → ALTER acquires exclusive MDL → all DML queued behind it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;strong&gt;MDL queue stacking&lt;/strong&gt; problem. A single &lt;code&gt;ALTER TABLE&lt;/code&gt; doesn't just wait for the active transaction — it blocks every subsequent transaction too. On a table sustaining 500 writes/second, a 90-second stall creates a pileup of thousands of queued connections, exhausting your connection pool entirely. The ALTER itself might be sub-second, but the queue cascades into connection pool exhaustion, health check failures, and service-wide outages that have nothing to do with the table being altered.&lt;/p&gt;
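
&lt;p&gt;One mitigation worth knowing (a sketch, not a cure): bound how long the DDL is allowed to sit in the MDL queue. The session variable &lt;code&gt;lock_wait_timeout&lt;/code&gt; governs metadata lock waits, so a pending &lt;code&gt;ALTER&lt;/code&gt; that can't acquire its exclusive MDL quickly fails with a lock wait timeout error instead of damming up DML behind it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET SESSION lock_wait_timeout = 5;  -- give up after 5 seconds in the MDL queue
ALTER TABLE events DROP PARTITION p20250101;
-- on timeout, the ALTER fails; retry later rather than stalling all DML
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;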




&lt;h2&gt;
  
  
  Production Incident #1: Metadata Locks from Partition Drops
&lt;/h2&gt;

&lt;p&gt;We run time-partitioned tables for high-volume event data. Each partition holds one day's worth of data, and a cron job drops partitions older than 90 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="n"&gt;p20250101&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an &lt;code&gt;ALTER TABLE&lt;/code&gt;. It needs an exclusive metadata lock.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;The partition drop job was already scheduled during a low-traffic window. There was no single long-running query blocking it. The problem was simpler and more insidious: the &lt;code&gt;DROP PARTITION&lt;/code&gt; itself took time to execute — and for the entire duration it held an exclusive metadata lock, every new DML statement on the table queued behind it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE ... DROP PARTITION&lt;/code&gt; isn't always instant. On a large table with many partitions, MySQL needs to update the table's partition metadata, and if the partition being dropped has significant data, the storage engine needs to reclaim the tablespace. During this window — which stretched to 30+ seconds in our case — the exclusive MDL blocked all new &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and even &lt;code&gt;SELECT&lt;/code&gt; statements on the table. The &lt;code&gt;Waiting for table metadata lock&lt;/code&gt; state spread across the processlist like a wave.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we saw in monitoring
&lt;/h3&gt;

&lt;p&gt;The global MDL wait metric spiked sharply. We queried the process list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Waiting for table metadata lock'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output showed hundreds of DML statements queued behind the &lt;code&gt;ALTER TABLE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
| id    | user | host      | db     | command | time | state                            | info                        |
+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
| 15901 | admin| 10.0.2.x  | events | Query   |  34  | altering table                   | ALTER TABLE events DROP ... |
| 15902 | app  | 10.0.3.x  | events | Query   |  32  | Waiting for table metadata lock  | INSERT INTO events ...      |
| 15903 | app  | 10.0.3.x  | events | Query   |  31  | Waiting for table metadata lock  | INSERT INTO events ...      |
| 15904 | app  | 10.0.3.x  | events | Query   |  30  | Waiting for table metadata lock  | INSERT INTO events ...      |
| ...   | ...  | ...       | ...    | ...     | ...  | ...                              | ...                         |
+-------+------+-----------+--------+---------+------+----------------------------------+-----------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ALTER TABLE&lt;/code&gt; was actively running (state: &lt;code&gt;altering table&lt;/code&gt;), not waiting for anything — it was the one holding the exclusive MDL while it completed. Every DML that arrived during those 30+ seconds piled up.&lt;/p&gt;
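
&lt;p&gt;MySQL 8.0's &lt;code&gt;sys&lt;/code&gt; schema packages the same diagnosis more directly: &lt;code&gt;sys.schema_table_lock_waits&lt;/code&gt; pairs each MDL waiter with the session blocking it, and even generates the &lt;code&gt;KILL&lt;/code&gt; statement for you (the underlying &lt;code&gt;wait/lock/metadata/sql/mdl&lt;/code&gt; instrument is enabled by default in 8.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT object_schema, object_name, waiting_query,
       blocking_pid, sql_kill_blocking_query
FROM sys.schema_table_lock_waits;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;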

&lt;h3&gt;
  
  
  The query to find MDL holders
&lt;/h3&gt;

&lt;p&gt;MySQL 8.0 exposes metadata locks through &lt;code&gt;performance_schema&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DURATION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OWNER_THREAD_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_USER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_TIME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_STATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_INFO&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata_locks&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OWNER_THREAD_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;THREAD_ID&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'your_database'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;mdl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'events'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROCESSLIST_TIME&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you exactly who holds the lock and who's waiting. The output distinguishes &lt;code&gt;GRANTED&lt;/code&gt; (holding the lock) from &lt;code&gt;PENDING&lt;/code&gt; (waiting for it).&lt;/p&gt;
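&lt;p&gt;When the output runs to hundreds of rows, grouping it by &lt;code&gt;LOCK_STATUS&lt;/code&gt; makes the picture obvious: one &lt;code&gt;GRANTED&lt;/code&gt; exclusive holder, everyone else &lt;code&gt;PENDING&lt;/code&gt;. A minimal sketch of that triage step (the row dicts mirror the query's columns; the function name and row shape are illustrative, not a real API):&lt;/p&gt;

```python
# Summarize performance_schema.metadata_locks rows into holders vs. waiters.
# Each row dict mirrors the columns of the query above (assumed shape).

def summarize_mdl(rows):
    holders = [r for r in rows if r["LOCK_STATUS"] == "GRANTED"]
    waiters = [r for r in rows if r["LOCK_STATUS"] == "PENDING"]
    return {
        "holders": [(r["PROCESSLIST_ID"], r["LOCK_TYPE"]) for r in holders],
        "pending_count": len(waiters),
        # The oldest waiter tells you how long the pile-up has been building.
        "max_wait_seconds": max((r["PROCESSLIST_TIME"] for r in waiters), default=0),
    }

rows = [
    {"LOCK_STATUS": "GRANTED", "LOCK_TYPE": "EXCLUSIVE", "PROCESSLIST_ID": 15877, "PROCESSLIST_TIME": 34},
    {"LOCK_STATUS": "PENDING", "LOCK_TYPE": "SHARED_WRITE", "PROCESSLIST_ID": 15902, "PROCESSLIST_TIME": 32},
    {"LOCK_STATUS": "PENDING", "LOCK_TYPE": "SHARED_WRITE", "PROCESSLIST_ID": 15903, "PROCESSLIST_TIME": 31},
]
print(summarize_mdl(rows))
```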

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Switched from &lt;code&gt;DROP PARTITION&lt;/code&gt; to &lt;code&gt;EXCHANGE PARTITION&lt;/code&gt; with an empty staging table, followed by dropping the exchanged table — because the incoming table is empty, the exchange is effectively a metadata-only rename that completes near-instantly, so the exclusive MDL is held for milliseconds, not seconds&lt;/li&gt;
&lt;li&gt;For cases where &lt;code&gt;DROP PARTITION&lt;/code&gt; is unavoidable, we batch the drops and add &lt;code&gt;lock_wait_timeout = 5&lt;/code&gt; to the session — if it can't acquire the MDL within 5 seconds, it backs off and retries on the next cycle&lt;/li&gt;
&lt;li&gt;Added an alert on the global &lt;code&gt;Waiting for table metadata lock&lt;/code&gt; count exceeding a threshold sustained for 10 seconds&lt;/li&gt;
&lt;/ol&gt;
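&lt;p&gt;Item 2 can be sketched in application code: a retry wrapper around the partition drop that fails fast on MDL contention and backs off. The &lt;code&gt;execute&lt;/code&gt; callable and &lt;code&gt;MdlTimeout&lt;/code&gt; error here are stand-ins, not any specific driver's API:&lt;/p&gt;

```python
import time

class MdlTimeout(Exception):
    """Stand-in for the driver error raised when lock_wait_timeout expires."""

def drop_partition_with_backoff(execute, table, partition, attempts=3, pause=2.0):
    # Fail fast instead of queueing behind other MDL waiters: if the DDL
    # can't acquire the lock within 5s, the server aborts it with an error.
    execute("SET SESSION lock_wait_timeout = 5")
    for attempt in range(1, attempts + 1):
        try:
            execute(f"ALTER TABLE {table} DROP PARTITION {partition}")
            return attempt
        except MdlTimeout:
            if attempt == attempts:
                raise
            time.sleep(pause)  # back off and let queued DML drain

# Fake execute for illustration: the DDL times out twice, then succeeds.
calls = []
state = {"alter_calls": 0}
def fake_execute(sql):
    calls.append(sql)
    if sql.startswith("ALTER"):
        state["alter_calls"] += 1
        if state["alter_calls"] in (1, 2):
            raise MdlTimeout()

attempts_used = drop_partition_with_backoff(fake_execute, "events", "p20250314", pause=0)
print(attempts_used)  # 3
```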




&lt;h2&gt;
  
  
  Production Incident #2: Row Locks from Incorrectly Terminated Application Instances
&lt;/h2&gt;

&lt;p&gt;This one was subtle. Write latency on the &lt;code&gt;orders&lt;/code&gt; table was slowly degrading over hours, not spiking suddenly. P99 crept from 15ms to 200ms to 2 seconds over the course of a day.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;An application instance was terminated incorrectly during a deployment — the process was killed (&lt;code&gt;SIGKILL&lt;/code&gt; / OOM) and the instance torn down without gracefully closing its database connections. The connections had open transactions with row-level exclusive locks acquired via &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here's the thing MySQL doesn't advertise: when the client dies along with its host (instance teardown, hard VM stop, a dropped network path), no &lt;code&gt;COM_QUIT&lt;/code&gt; is sent and no TCP FIN is ever transmitted. If only the process were killed while its kernel stayed up, the kernel would close the socket on its behalf and MySQL would notice; it's host-level death that leaves things dangling. The TCP connection is now "half-open" — MySQL has no idea the client is gone. The server-side connection thread sits in a blocking &lt;code&gt;read()&lt;/code&gt;, waiting for the next query that will never arrive. The transaction stays open. The locks stay held.&lt;/p&gt;

&lt;p&gt;MySQL detects dead connections through two mechanisms, whichever fires first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;wait_timeout&lt;/code&gt;&lt;/strong&gt; (application-layer): MySQL checks if the connection has been idle for longer than this value. Default: &lt;strong&gt;28800 seconds (8 hours)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TCP keepalive&lt;/strong&gt; (network-layer): The OS sends keepalive probes to detect the dead peer. With Linux defaults (&lt;code&gt;tcp_keepalive_time=7200&lt;/code&gt;, &lt;code&gt;tcp_keepalive_intvl=75&lt;/code&gt;, &lt;code&gt;tcp_keepalive_probes=9&lt;/code&gt;), this takes &lt;strong&gt;~7,875 seconds (~2.2 hours)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Until one of those fires, the dead connection's transaction remains open and &lt;strong&gt;all its locks are held&lt;/strong&gt;. InnoDB only rolls back the transaction when MySQL actually closes the server-side connection.&lt;/p&gt;
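&lt;p&gt;The keepalive number isn't magic: it falls straight out of the three sysctls, so it's worth being able to recompute it for your own settings. A quick check:&lt;/p&gt;

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Worst-case time for the OS to declare a silent peer dead:
    the idle threshold, then `probes` unanswered probes `intvl_s` apart."""
    return time_s + intvl_s * probes

# Linux defaults: roughly 2.2 hours before the kernel gives up on the peer.
print(keepalive_detection_seconds(7200, 75, 9))  # 7875
# The tuned values discussed later in this post: about 2 minutes.
print(keepalive_detection_seconds(60, 10, 6))    # 120
```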

&lt;p&gt;In our case, &lt;code&gt;wait_timeout&lt;/code&gt; was still at the default — 28800 seconds (8 hours). So the killed application instance left behind connections holding exclusive row locks for hours. As subsequent deployments killed more instances the same way, more orphaned locks accumulated. Eventually, write contention on &lt;code&gt;orders&lt;/code&gt; was so severe that P99 latency hit the &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; ceiling.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we saw
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find transactions that have been open for a long time but aren't actively running&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trx_age_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_modified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;command_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;current_query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output showed connections from hosts belonging to application instances that had already been terminated — their IPs were no longer in service, but MySQL was still holding their connections open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
| trx_id | trx_state | trx_started         | trx_age_seconds | trx_rows_locked | trx_rows_modified | ...   | user | command | time | state | current_query |
+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
| 48291  | RUNNING   | 2025-03-15 09:14:22 |           14338 |              47 |                 0 | ...   | app  | Sleep   | 9841 |       | NULL          |
| 48305  | RUNNING   | 2025-03-15 09:21:07 |           13933 |              23 |                 0 | ...   | app  | Sleep   | 8122 |       | NULL          |
| 48412  | RUNNING   | 2025-03-15 10:02:44 |           11436 |              12 |                 0 | ...   | app  | Sleep   | 7203 |       | NULL          |
+--------+-----------+---------------------+-----------------+-----------------+-------------------+-------+------+---------+------+-------+---------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things jump out. &lt;code&gt;trx_state&lt;/code&gt; is &lt;code&gt;RUNNING&lt;/code&gt; but &lt;code&gt;command&lt;/code&gt; is &lt;code&gt;Sleep&lt;/code&gt; — the transaction is open but the connection is idle. &lt;code&gt;trx_rows_locked&lt;/code&gt; is non-zero but &lt;code&gt;trx_rows_modified&lt;/code&gt; is zero — these transactions acquired locks via &lt;code&gt;FOR UPDATE&lt;/code&gt; but never wrote anything. The &lt;code&gt;time&lt;/code&gt; column shows idle times of 7,000–10,000 seconds — the 8-hour &lt;code&gt;wait_timeout&lt;/code&gt; default hadn't even kicked in yet, and these connections had already been abandoned for hours with no end in sight.&lt;/p&gt;
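&lt;p&gt;That "open transaction, sleeping connection, locks held" signature is mechanical enough to script. A sketch of the check (row shape mirrors the query output above; the field names are assumptions):&lt;/p&gt;

```python
def find_orphaned_trx(rows, min_idle_s=300):
    """Flag transactions that hold locks while their connection is idle:
    trx open (RUNNING), command Sleep, rows locked, idle past the threshold."""
    return [
        r["trx_id"]
        for r in rows
        if r["trx_state"] == "RUNNING"
        and r["command"] == "Sleep"
        and r["trx_rows_locked"] > 0
        and r["time"] >= min_idle_s
    ]

rows = [
    {"trx_id": 48291, "trx_state": "RUNNING", "command": "Sleep", "trx_rows_locked": 47, "time": 9841},
    {"trx_id": 48305, "trx_state": "RUNNING", "command": "Sleep", "trx_rows_locked": 23, "time": 8122},
    # A healthy, briefly-idle transaction: not flagged.
    {"trx_id": 48999, "trx_state": "RUNNING", "command": "Sleep", "trx_rows_locked": 2, "time": 12},
]
print(find_orphaned_trx(rows))  # [48291, 48305]
```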

&lt;h3&gt;
  
  
  Finding exactly which rows are locked
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DATA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INDEX_NAME&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows every lock held by sleeping transactions older than 5 minutes. The &lt;code&gt;LOCK_DATA&lt;/code&gt; column gives you the primary key values of the locked rows, and &lt;code&gt;LOCK_MODE&lt;/code&gt; tells you whether it's shared or exclusive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding who's waiting on whom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DATA&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;contested_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b_proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b_proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_idle_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;b_proc&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;b_trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b_proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;Short-term: killed the orphaned connections manually (&lt;code&gt;KILL &amp;lt;processlist_id&amp;gt;&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Long-term:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fixed the deployment process to send &lt;code&gt;SIGTERM&lt;/code&gt; with a grace period, allowing the application to close database connections before the container is killed. Added a pre-stop hook that explicitly closes the connection pool&lt;/li&gt;
&lt;li&gt;Reduced &lt;code&gt;wait_timeout&lt;/code&gt; to &lt;code&gt;300&lt;/code&gt; (5 minutes). The MySQL default of 28800 (8 hours) means orphaned connections can hold locks for an entire workday before MySQL notices. Check yours with &lt;code&gt;SHOW VARIABLES LIKE 'wait_timeout'&lt;/code&gt;. One caveat: the application's connection pool must validate or recycle idle connections on a shorter interval, or MySQL will reap healthy idle connections out from under it
&lt;/li&gt;
&lt;li&gt;On RDS, you &lt;strong&gt;cannot&lt;/strong&gt; tune server-side TCP keepalive — AWS manages the OS and those sysctl parameters aren't user-configurable. &lt;code&gt;wait_timeout&lt;/code&gt; is your primary lever for dead client detection. On self-managed MySQL, you can additionally tune the server's TCP keepalive (&lt;code&gt;tcp_keepalive_time=60&lt;/code&gt;, &lt;code&gt;tcp_keepalive_intvl=10&lt;/code&gt;, &lt;code&gt;tcp_keepalive_probes=6&lt;/code&gt;) so the server detects dead clients in ~120 seconds instead of ~2.2 hours&lt;/li&gt;
&lt;li&gt;Added monitoring for &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; rows where &lt;code&gt;trx_state = 'RUNNING'&lt;/code&gt; and the processlist shows &lt;code&gt;Sleep&lt;/code&gt; for more than 5 minutes&lt;/li&gt;
&lt;/ol&gt;
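&lt;p&gt;Item 1 as a sketch: a &lt;code&gt;SIGTERM&lt;/code&gt; handler that drains the connection pool before the process exits. The &lt;code&gt;Pool&lt;/code&gt; class here is a stand-in, not a real driver's API:&lt;/p&gt;

```python
import signal

class Pool:
    """Stand-in connection pool: real pools expose some close/dispose call."""
    def __init__(self):
        self.closed = False
    def close_all(self):
        # A real implementation rolls back open transactions, sends COM_QUIT,
        # and closes the sockets, so MySQL releases the locks immediately.
        self.closed = True

pool = Pool()

def on_sigterm(signum, frame):
    # Stop accepting new work first (omitted), then release DB connections.
    pool.close_all()

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate delivery of SIGTERM. In production the orchestrator sends it and
# waits out the grace period before escalating to SIGKILL.
on_sigterm(signal.SIGTERM, None)
print(pool.closed)  # True
```

&lt;p&gt;This is what the pre-stop hook and grace period buy you: a window in which &lt;code&gt;close_all()&lt;/code&gt; actually runs before the process is killed for good.&lt;/p&gt;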




&lt;h2&gt;
  
  
  Production Incident #3: Gap Locks from INSERT INTO ... SELECT
&lt;/h2&gt;

&lt;p&gt;We have a nightly job that archives completed orders into a summary table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;order_summaries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ran fine for months. Then the orders table grew, the nightly job started taking longer, and we started seeing deadlocks during peak write hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; places &lt;strong&gt;shared next-key locks&lt;/strong&gt; on the rows read by the &lt;code&gt;SELECT&lt;/code&gt; — the MySQL docs are explicit: &lt;em&gt;"InnoDB sets shared next-key locks on rows from S"&lt;/em&gt; (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locks-set.html" rel="noopener noreferrer"&gt;InnoDB Locks Set by Different SQL Statements&lt;/a&gt;). This is required to guarantee a consistent read — InnoDB needs to prevent other transactions from modifying or inserting into the range being read while the bulk insert is in progress.&lt;/p&gt;

&lt;p&gt;The problem: the &lt;code&gt;SELECT&lt;/code&gt; scans a range of &lt;code&gt;completed_at&lt;/code&gt; values. InnoDB places shared next-key locks on every index record in that range — and since each next-key lock includes the gap before the record, the entire scanned range is locked against inserts. Any &lt;code&gt;INSERT&lt;/code&gt; into the &lt;code&gt;orders&lt;/code&gt; table with a &lt;code&gt;completed_at&lt;/code&gt; value that falls within or near the locked range will block.&lt;/p&gt;

&lt;p&gt;Our application was simultaneously inserting new orders with &lt;code&gt;completed_at&lt;/code&gt; values close to the current timestamp. Since the archival job was reading &lt;code&gt;completed_at&lt;/code&gt; from the previous day, you might think there's no overlap. But gap locks extend to the &lt;em&gt;next&lt;/em&gt; index record beyond the scanned range — and if the next record's &lt;code&gt;completed_at&lt;/code&gt; is today, the gap lock extends into today's range.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index records for completed_at:
  ... | 2025-03-14 23:58:12 | 2025-03-14 23:59:44 | 2025-03-15 00:01:23 | ...
                                                       ↑
                                          Next-key lock extends here
                                          because this is the next record
                                          after the scanned range

  New INSERT with completed_at = 2025-03-15 00:00:15 → blocked by gap lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Finding active lock waits in real time
&lt;/h3&gt;

&lt;p&gt;In MySQL 8.0, lock wait information lives in &lt;code&gt;performance_schema.data_lock_waits&lt;/code&gt; (the old &lt;code&gt;information_schema.innodb_lock_waits&lt;/code&gt; was removed in 8.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_wait_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_locked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_rows_locked&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Changed the archival job's isolation level to &lt;code&gt;READ COMMITTED&lt;/code&gt; for that session:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt; &lt;span class="k"&gt;ISOLATION&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;READ&lt;/span&gt; &lt;span class="k"&gt;COMMITTED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under &lt;code&gt;READ COMMITTED&lt;/code&gt;, InnoDB doesn't acquire gap locks at all (except for foreign key and duplicate-key checks) — the &lt;code&gt;SELECT&lt;/code&gt; portion runs as a consistent read without locking the source rows. This eliminates the phantom prevention guarantee for that transaction, which is acceptable for archival.&lt;/p&gt;
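&lt;p&gt;One sanity check worth running before the job starts, since &lt;code&gt;SET SESSION&lt;/code&gt; only affects the current connection (a quick sketch, not part of the original runbook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Applies to this session only; the global default is untouched
SELECT @@transaction_isolation;         -- READ-COMMITTED
SELECT @@global.transaction_isolation;  -- REPEATABLE-READ (assuming the default)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;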

&lt;ol start="2"&gt;
&lt;li&gt;Added batching — instead of one massive &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt;, process 1,000 rows at a time with an explicit &lt;code&gt;ORDER BY&lt;/code&gt; and &lt;code&gt;LIMIT&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;order_summaries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt;
     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;completed_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CURDATE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
   &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each batch holds locks for a shorter duration, reducing the window for contention.&lt;/p&gt;
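&lt;p&gt;A keyset variant of the same batch avoids re-scanning rows from earlier batches entirely. A sketch against the same tables; the &lt;code&gt;@last_id&lt;/code&gt; bookkeeping below is ours, not part of the original job:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keyset batching: the application tracks @last_id between batches and
-- loops until a batch copies fewer than 1,000 rows. Unlike OFFSET, this
-- never re-scans (or re-locks) rows already processed.
SET @last_id = 0;  -- or the highest order id already archived

INSERT INTO order_summaries (order_id, total, region, completed_at)
SELECT id, total_amount, region, completed_at
FROM orders
WHERE id &amp;gt; @last_id
  AND status = 'completed'
  AND completed_at &amp;gt;= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
  AND completed_at &amp;lt; CURDATE()
ORDER BY id
LIMIT 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;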

&lt;ol start="3"&gt;
&lt;li&gt;Moved the archival job to run against a read replica, then applied the summaries to the primary via smaller transactional batches.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Monitoring InnoDB Locks: The Essential Queries
&lt;/h2&gt;

&lt;p&gt;Here's the full set of diagnostic queries we keep in our runbook.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Current InnoDB transaction status
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;age_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_rows_locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_rows_modified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_lock_memory_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx_operation_state&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. All currently held locks (MySQL 8.0+)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;OBJECT_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_MODE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LOCK_DATA&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;LOCK_TYPE&lt;/code&gt; will be &lt;code&gt;RECORD&lt;/code&gt; (row/gap/next-key) or &lt;code&gt;TABLE&lt;/code&gt; (intention locks). &lt;code&gt;LOCK_MODE&lt;/code&gt; tells you the specifics: &lt;code&gt;X&lt;/code&gt; (exclusive), &lt;code&gt;S&lt;/code&gt; (shared), &lt;code&gt;X,GAP&lt;/code&gt; (exclusive gap lock), &lt;code&gt;X,REC_NOT_GAP&lt;/code&gt; (record-only, no gap), &lt;code&gt;S,GAP&lt;/code&gt;, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Lock wait chains — who blocks whom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_trx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_trx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_NAME&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_MODE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;blocking_lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_DATA&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;contested_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;waiting_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_wait_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCKING_ENGINE_LOCK_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENGINE_LOCK_ID&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUESTING_ENGINE_TRANSACTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;wait_seconds&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Deadlock history
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="n"&gt;INNODB&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;LATEST DETECTED DEADLOCK&lt;/code&gt; section shows the last deadlock with full details: both transactions, the locks they held, the locks they were waiting for, and which transaction InnoDB chose as the victim. Parse the output — it's verbose but complete.&lt;/p&gt;

&lt;p&gt;For ongoing monitoring, enable the deadlock log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="n"&gt;innodb_print_all_deadlocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This writes every deadlock to the MySQL error log, not just the latest one.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Long-running transactions with sleeping connections
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;trx_age_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_locked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_rows_modified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;processlist_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;idle_seconds&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the "zombie transaction" detector. Any row here is a connection with an open transaction that isn't executing a query. These are the ones that silently hold locks while doing nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. InnoDB lock metrics summary
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_lock_waits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lock_waits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'GRANTED'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;locks_held&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'WAITING'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;locks_waiting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_VALUE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_status&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_row_lock_waits'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_row_lock_waits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_VALUE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_status&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_row_lock_time_avg'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_row_lock_wait_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_VALUE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_status&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;VARIABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_deadlocks'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_deadlocks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  MySQL Parameters That Control Lock Behavior
&lt;/h2&gt;

&lt;p&gt;Most lock incidents we've hit were made worse — or outright caused — by MySQL parameters left at their defaults. Here are the ones that matter, what they do, and what we set them to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check all lock-related parameters on your instance&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Variable_name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_lock_wait_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'lock_wait_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_deadlock_detect'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_print_all_deadlocks'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_status_output_locks'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'wait_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'interactive_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_autoinc_lock_mode'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;MySQL Default&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;10–30&lt;/td&gt;
&lt;td&gt;How long a transaction waits for a row lock before erroring out. 50 seconds is too generous — if you're waiting 30 seconds for a row lock, the transaction should fail and retry, not queue behind a zombie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lock_wait_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;31536000 (1 year)&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;How long a statement waits for a metadata lock. The default is effectively "wait forever." A &lt;code&gt;DROP PARTITION&lt;/code&gt; waiting a year for a shared MDL to release will stack every connection in your pool long before then&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wait_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;28800 (8 hours)&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;How long MySQL keeps an idle connection alive. An abandoned connection holding row locks can sit there for 8 hours at the default. Drop this to 5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;interactive_timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;28800 (8 hours)&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;Same as &lt;code&gt;wait_timeout&lt;/code&gt;, for connections flagged as interactive (mysql CLI sessions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_deadlock_detect&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ON&lt;/td&gt;
&lt;td&gt;ON&lt;/td&gt;
&lt;td&gt;Real-time deadlock detection. Turning this off (sometimes done for "performance") means deadlocks are only resolved by &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; expiring — much slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_print_all_deadlocks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OFF&lt;/td&gt;
&lt;td&gt;ON&lt;/td&gt;
&lt;td&gt;Log every deadlock to the error log. Without this, only the most recent deadlock is visible via &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_status_output_locks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OFF&lt;/td&gt;
&lt;td&gt;ON (during incidents)&lt;/td&gt;
&lt;td&gt;Include lock details in &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; output. Verbose, but invaluable during active lock investigations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;innodb_autoinc_lock_mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2 (MySQL 8.0)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Interleaved mode — least contention for auto-increment. Only change if you need consecutive values across bulk inserts (you probably don't)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most dangerous defaults are &lt;code&gt;lock_wait_timeout&lt;/code&gt; (1 year) and &lt;code&gt;wait_timeout&lt;/code&gt; (8 hours). If you take nothing else from this post, check those two on your production instances right now.&lt;/p&gt;
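&lt;p&gt;On self-managed MySQL 8.0 the fix is a few &lt;code&gt;SET PERSIST&lt;/code&gt; statements; on RDS the same values belong in the DB parameter group. The numbers below are the recommendations from the table above, not universal constants:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Persists across restarts on self-managed MySQL 8.0
-- (RDS manages these through the DB parameter group instead)
SET PERSIST innodb_lock_wait_timeout = 30;
SET PERSIST lock_wait_timeout = 300;
SET PERSIST wait_timeout = 300;
SET PERSIST interactive_timeout = 300;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;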




&lt;h2&gt;
  
  
  A Quick Guide to Isolation Levels and Their Lock Behavior
&lt;/h2&gt;

&lt;p&gt;The locks InnoDB takes are directly tied to the isolation level. Understanding this mapping is essential for diagnosing unexpected contention.&lt;/p&gt;

&lt;h3&gt;
  
  
  READ UNCOMMITTED
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No locks on consistent reads&lt;/li&gt;
&lt;li&gt;Writes take record locks (no gap locks)&lt;/li&gt;
&lt;li&gt;You almost never want this — it allows dirty reads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  READ COMMITTED
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Consistent reads use a &lt;strong&gt;fresh snapshot per statement&lt;/strong&gt; — each statement sees the latest committed data as of its start time, not the transaction start time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No gap locks&lt;/strong&gt; — this is the key difference from &lt;code&gt;REPEATABLE READ&lt;/code&gt;. The docs state: &lt;em&gt;"Gap locking is only used for foreign-key constraint checking and duplicate-key checking"&lt;/em&gt; at this level (&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html" rel="noopener noreferrer"&gt;Transaction Isolation Levels&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Record locks on matched rows only, released for non-matching rows after evaluation&lt;/li&gt;
&lt;li&gt;Phantom reads are possible but gap lock deadlocks are eliminated&lt;/li&gt;
&lt;li&gt;This is the level we switch to for bulk operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  REPEATABLE READ (InnoDB Default)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Consistent reads see a snapshot from the first read in the transaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next-key locks&lt;/strong&gt; on index scans — record lock + gap lock&lt;/li&gt;
&lt;li&gt;Gap locks prevent phantom inserts in scanned ranges&lt;/li&gt;
&lt;li&gt;This is where most unexpected locking contention occurs&lt;/li&gt;
&lt;/ul&gt;
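&lt;p&gt;The gap-locking behavior is easy to reproduce with two sessions. A sketch against the &lt;code&gt;orders&lt;/code&gt; table from the earlier examples, assuming an index on &lt;code&gt;completed_at&lt;/code&gt; (without one, InnoDB next-key locks every row the scan touches):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Session 1 (REPEATABLE READ, the default): next-key locks the matching
-- rows plus the gap after the last one
BEGIN;
SELECT * FROM orders WHERE completed_at &amp;gt;= '2026-04-01' FOR UPDATE;

-- Session 2: this row falls inside the locked gap, so the insert blocks
-- until session 1 commits or rolls back
INSERT INTO orders (status, total_amount, region, completed_at)
VALUES ('completed', 99.00, 'us-east', '2026-04-02 08:00:00');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;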

&lt;h3&gt;
  
  
  SERIALIZABLE
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All consistent reads are implicitly converted to &lt;code&gt;SELECT ... FOR SHARE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Every read takes shared next-key locks&lt;/li&gt;
&lt;li&gt;Maximum correctness, maximum contention&lt;/li&gt;
&lt;li&gt;We've never used this in production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Side-by-side lock behavior
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;READ COMMITTED&lt;/th&gt;
&lt;th&gt;REPEATABLE READ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;SELECT&lt;/code&gt; (plain)&lt;/td&gt;
&lt;td&gt;No locks (MVCC)&lt;/td&gt;
&lt;td&gt;No locks (MVCC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record locks only&lt;/td&gt;
&lt;td&gt;Next-key locks (record + gap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE WHERE unique_col = ?&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record lock on matching row&lt;/td&gt;
&lt;td&gt;Record lock only (unique index optimization — no gap lock)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE WHERE non_unique_col = ?&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record lock on matching row&lt;/td&gt;
&lt;td&gt;Next-key locks on matching records + next record to seal the gap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE WHERE range&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record locks on matching rows&lt;/td&gt;
&lt;td&gt;Next-key locks on range&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;INSERT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Insert intention lock&lt;/td&gt;
&lt;td&gt;Insert intention lock&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No locks on source (consistent read)&lt;/td&gt;
&lt;td&gt;Shared next-key locks on source rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DELETE WHERE range&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Record locks on matching rows&lt;/td&gt;
&lt;td&gt;Next-key locks on range&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; row is the one that bit us. Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, the &lt;code&gt;SELECT&lt;/code&gt; side takes shared next-key locks on every row it reads. Under &lt;code&gt;READ COMMITTED&lt;/code&gt;, it doesn't. Switching isolation level for that one session was a one-line fix that eliminated the gap lock deadlocks entirely.&lt;/p&gt;
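&lt;p&gt;In practice the one-liner is a session-scoped isolation change before the batch statement. A sketch (table names here are illustrative, not our real schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Applies only to this connection; every other session keeps REPEATABLE READ
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

INSERT INTO orders_archive
SELECT * FROM orders WHERE created_at &amp;lt; '2026-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;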




&lt;h2&gt;
  
  
  Preventing Lock Contention: What We Do Now
&lt;/h2&gt;

&lt;p&gt;After three incidents with three different lock types, we've adopted a set of practices that have kept us out of trouble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep transactions short.&lt;/strong&gt; The longer a transaction is open, the longer it holds locks. Every lock held is a potential blocker. We set &lt;code&gt;innodb_lock_wait_timeout&lt;/code&gt; to 30 (the MySQL default is 50). If your transaction is waiting 30 seconds for a row lock, something is structurally wrong — fail fast and let the application retry.&lt;/p&gt;
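&lt;p&gt;A sketch of the knobs involved (the 5-second session value is illustrative, not a recommendation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Server-wide: give up on row-lock waits after 30s instead of the default 50s
SET GLOBAL innodb_lock_wait_timeout = 30;

-- Batch jobs can opt into an even tighter budget for their own session
SET SESSION innodb_lock_wait_timeout = 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;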

&lt;p&gt;&lt;strong&gt;Use READ COMMITTED for bulk operations.&lt;/strong&gt; If you're running &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt;, &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt;, or any bulk read-then-write pattern, switch the session to &lt;code&gt;READ COMMITTED&lt;/code&gt;. You don't need phantom protection for a batch job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use &lt;code&gt;pt-online-schema-change&lt;/code&gt; or &lt;code&gt;gh-ost&lt;/code&gt; for large ALTERs.&lt;/strong&gt; These tools perform schema changes by creating a shadow table, copying data in small batches, and swapping at the end — avoiding long-held metadata locks entirely. For partition operations, consider &lt;code&gt;EXCHANGE PARTITION&lt;/code&gt;, which is metadata-only and near-instant.&lt;/p&gt;
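&lt;p&gt;The partition swap looks roughly like this (table and partition names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Swap a standalone table in for a partition; both must have identical
-- structure. The swap itself is a metadata operation, not a data copy.
ALTER TABLE events
    EXCHANGE PARTITION p2026_03 WITH TABLE events_staging;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;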

&lt;p&gt;&lt;strong&gt;Ensure graceful shutdown of application instances.&lt;/strong&gt; When an application process is killed without closing its database connections (SIGKILL, OOM, container eviction), MySQL doesn't detect the dead client immediately. Locks are held until &lt;code&gt;wait_timeout&lt;/code&gt; expires or TCP keepalive detects the dead peer — whichever comes first. Use &lt;code&gt;SIGTERM&lt;/code&gt; with a grace period, and add pre-stop hooks that drain connections before the process exits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tune dead connection detection.&lt;/strong&gt; Drop &lt;code&gt;wait_timeout&lt;/code&gt; from the default 28800 (8 hours) to 300 — on RDS, this is your primary lever since server-side TCP keepalive isn't user-configurable. On self-managed MySQL, also tune the server's TCP keepalive (&lt;code&gt;tcp_keepalive_time=60&lt;/code&gt;, &lt;code&gt;tcp_keepalive_intvl=10&lt;/code&gt;, &lt;code&gt;tcp_keepalive_probes=6&lt;/code&gt;) so the server detects dead clients in ~120 seconds. If you use a connection pool (HikariCP, etc.), configure its own idle connection eviction to detect and close stale connections faster than MySQL does.&lt;/p&gt;
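&lt;p&gt;On the MySQL side that's two globals (on RDS, set them in the parameter group rather than at runtime):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Close idle connections after 5 minutes instead of 8 hours
SET GLOBAL wait_timeout = 300;
-- Same budget for interactive clients (mysql CLI, admin tools)
SET GLOBAL interactive_timeout = 300;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;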

&lt;p&gt;&lt;strong&gt;Monitor zombie transactions.&lt;/strong&gt; Alert on any transaction that has been open for more than N minutes while the connection is idle. This is the single most impactful alert we've added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Alert query: transactions open &amp;gt; 5 min with sleeping connections&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;zombie_transactions&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sleep'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SECOND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Index your write predicates.&lt;/strong&gt; If your &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; hits a full table scan, InnoDB locks every row it examines — not just the rows that match the &lt;code&gt;WHERE&lt;/code&gt; clause. Under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, it also places gap locks across the entire index. A missing index on a write path doesn't just make the query slow — it makes every other concurrent write slow too.&lt;/p&gt;
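&lt;p&gt;A quick way to catch this before it ships: check the access path of the write itself, not just your reads. Table and column names below are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- type: ALL in the output means a full scan, and a full scan's worth of locks
EXPLAIN UPDATE orders SET processed_at = NOW() WHERE status = 'pending';

-- The fix is the same as for a slow read: index the write predicate
CREATE INDEX idx_orders_status ON orders (status);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;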




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;InnoDB uses a hybrid of MVCC and pessimistic locking.&lt;/strong&gt; Reads don't lock (consistent snapshots). Writes lock. The specific lock type depends on the isolation level and the index structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gap locks are the most surprising lock type.&lt;/strong&gt; They lock ranges between index records, not the records themselves. They exist to prevent phantom reads under &lt;code&gt;REPEATABLE READ&lt;/code&gt;, and they're the root cause of most unexpected deadlocks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metadata locks are MySQL-level, not InnoDB-level.&lt;/strong&gt; Any DDL queues behind active DML and blocks all subsequent DML. On high-QPS tables, even fast DDL can cascade into connection pool exhaustion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sleeping connections with open transactions are silent killers.&lt;/strong&gt; A connection in &lt;code&gt;Sleep&lt;/code&gt; state with an uncommitted transaction still holds all its row locks. Monitor &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; crossed with the processlist.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; takes shared next-key locks on the source table under &lt;code&gt;REPEATABLE READ&lt;/code&gt;.&lt;/strong&gt; Switch to &lt;code&gt;READ COMMITTED&lt;/code&gt; for bulk operations to eliminate gap locking on the read side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor lock contention proactively.&lt;/strong&gt; &lt;code&gt;performance_schema.data_locks&lt;/code&gt;, &lt;code&gt;data_lock_waits&lt;/code&gt;, and &lt;code&gt;innodb_trx&lt;/code&gt; are your friends. On RDS, Performance Insights broken down by &lt;code&gt;db.wait_event&lt;/code&gt; is the fastest way to see what your connections are actually waiting on.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every lock incident we've had traced back to one of three things: a DDL operation that held an exclusive metadata lock longer than expected, application instances that died without releasing their connections, or an isolation level that was too strict for the workload. The queries in this post are the ones we reach for first. &lt;code&gt;performance_schema.data_locks&lt;/code&gt; and &lt;code&gt;data_lock_waits&lt;/code&gt; tell you exactly which rows are locked and who's waiting. &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; crossed with the processlist catches zombie transactions from dead application instances. And &lt;code&gt;performance_schema.metadata_locks&lt;/code&gt; tells you who's holding the MDL that's stacking your connection pool. The data is always there — you just have to know where to look.&lt;/p&gt;
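&lt;p&gt;If you'd rather not write the joins yourself, the &lt;code&gt;sys&lt;/code&gt; schema that ships with MySQL 8.0 wraps the same tables into a ready-made view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per blocked lock wait: who's waiting, who's blocking, for how long
SELECT waiting_pid, waiting_query,
       blocking_pid, blocking_query,
       wait_age
FROM sys.innodb_lock_waits;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;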

</description>
      <category>mysql</category>
      <category>database</category>
      <category>sql</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Debugging PostgreSQL Query Plan Instability in Production</title>
      <dc:creator>Sriram Rajendran</dc:creator>
      <pubDate>Tue, 31 Mar 2026 14:23:00 +0000</pubDate>
      <link>https://dev.to/sriramrajendran/debugging-postgresql-query-plan-instability-in-production-ncb</link>
      <guid>https://dev.to/sriramrajendran/debugging-postgresql-query-plan-instability-in-production-ncb</guid>
      <description>&lt;p&gt;The query plan is only as good as the statistics behind it. When those statistics are wrong, the planner makes confident decisions based on a false reality.&lt;/p&gt;

&lt;p&gt;We run a field technician dispatch system on PostgreSQL 14. The core query — find technicians matching specific dispatch criteria near a job site — ran in 3ms at 10am and 42ms at 2pm the same day. Same query text. Same bind parameters. The only thing that changed was the number of technicians available — as the fleet clocked in and dispatched through the day, the underlying data distribution shifted just enough to flip the query plan.&lt;/p&gt;

&lt;p&gt;This post is the story of how we traced a 14x latency regression to a fundamental assumption baked into every cost-based query optimizer, and why our existing composite index didn't help. Along the way, we'll dig into &lt;code&gt;pg_statistic&lt;/code&gt;, selectivity estimation, BitmapAnd mechanics, and the specific ways correlated boolean columns break the planner's world model.&lt;/p&gt;

&lt;p&gt;But first — if you've never looked at how PostgreSQL decides whether to use your index, let's build up from first principles.&lt;/p&gt;




&lt;h2&gt;
  
  
  How PostgreSQL Chooses a Query Plan
&lt;/h2&gt;

&lt;p&gt;PostgreSQL doesn't just "use the index." It evaluates multiple execution strategies and picks the one with the lowest estimated cost. This is the &lt;strong&gt;cost-based optimizer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The optimizer doesn't see your data. It sees a &lt;strong&gt;statistical summary&lt;/strong&gt; stored in &lt;code&gt;pg_statistic&lt;/code&gt; (exposed via the &lt;code&gt;pg_stats&lt;/code&gt; view). When &lt;code&gt;ANALYZE&lt;/code&gt; runs — manually or via autovacuum — PostgreSQL samples rows and builds per-column statistics: most common values (MCVs), histograms, distinct counts, and null fractions.&lt;/p&gt;

&lt;p&gt;For a single predicate like &lt;code&gt;WHERE is_available = true&lt;/code&gt;, the planner looks up the MCV frequency for &lt;code&gt;is_available&lt;/code&gt;. If &lt;code&gt;true&lt;/code&gt; appears with frequency 0.50, the estimate for a 15,000-row table is 15,000 × 0.50 = 7,500 rows.&lt;/p&gt;
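&lt;p&gt;You can see exactly what the planner sees by querying &lt;code&gt;pg_stats&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The MCV list and frequencies that ANALYZE collected for one column
SELECT most_common_vals, most_common_freqs, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'technicians'
  AND attname = 'is_available';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;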

&lt;p&gt;For multiple predicates, it applies the &lt;strong&gt;independence assumption&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(A AND B AND C) = P(A) × P(B) × P(C)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is correct if and only if the columns are statistically independent. The PostgreSQL docs are explicit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/planner-stats.html" rel="noopener noreferrer"&gt;Chapter 14.2: Statistics Used by the Planner&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the assumption holds, estimates are accurate. When it doesn't, they can be off by orders of magnitude. Our query hit the second case.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: Our Table and Query
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How we ended up with six booleans
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;technicians&lt;/code&gt; table tracks every field technician in the system — roughly 15,000 rows. It didn't start with six boolean columns. It started with two: &lt;code&gt;is_active&lt;/code&gt; and &lt;code&gt;is_available&lt;/code&gt;. Then we needed to track whether a technician was currently on a job, so &lt;code&gt;is_dispatched&lt;/code&gt; arrived. A compliance incident led to &lt;code&gt;is_blocked&lt;/code&gt;. Express delivery became a feature, so &lt;code&gt;is_express_enabled&lt;/code&gt;. An ops request for soft-disabling technicians without removing them added &lt;code&gt;is_suspended&lt;/code&gt;. Each boolean made sense in isolation, each was a small migration, and each shipped independently over about eighteen months.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;zone_id&lt;/span&gt;         &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_active&lt;/span&gt;       &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_available&lt;/span&gt;    &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_dispatched&lt;/span&gt;   &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_blocked&lt;/span&gt;      &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_express_enabled&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_suspended&lt;/span&gt;    &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_job_location&lt;/span&gt; &lt;span class="n"&gt;GEOMETRY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4326&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;-- ... other columns&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;technician_locations&lt;/code&gt; table stores real-time GPS positions (one row per technician, updated every few seconds).&lt;/p&gt;

&lt;h3&gt;
  
  
  Key indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Composite boolean index WITH zone&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_selection_v2&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zone_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_dispatched&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_blocked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_express_enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_suspended&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Composite boolean index WITHOUT zone&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_selection&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_dispatched&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_blocked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_express_enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_suspended&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Spatial index on last job location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_last_job_geom&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gist&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_job_location&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Spatial index on live GPS location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_technician_location_geom&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;technician_locations&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gist&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_location&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The allocation query
&lt;/h3&gt;

&lt;p&gt;The query finds technicians matching specific dispatch criteria within a geographic radius:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;technicians&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
    &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;technician_locations&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;technician_id&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;technician_stats&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;technician_id&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;daily_utilization&lt;/span&gt; &lt;span class="n"&gt;du&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;du&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;technician_id&lt;/span&gt;
    &lt;span class="c1"&gt;-- ... additional left joins for tagging, blocklists, queue state&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_dispatched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_blocked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_express_enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_suspended&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ST_DWithin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_SetSRID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4326&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;geography&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reported_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'20 minutes'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six boolean conditions on &lt;code&gt;technicians&lt;/code&gt;, plus a spatial predicate on &lt;code&gt;technician_locations&lt;/code&gt;. Looks straightforward. The problem is entirely in how the planner estimates the boolean combination.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom: Same Query, Two Plans
&lt;/h2&gt;

&lt;p&gt;We noticed the issue in our Grafana dashboards. The allocation query's P50 latency was steady around 4ms, but P99 kept spiking to 40–45ms within the same day. The pattern correlated with fleet activity — as more technicians clocked in and changed dispatch states through the day, the underlying data distribution shifted. The planner's statistics drifted with it, and the plan flipped.&lt;/p&gt;

&lt;p&gt;When we pulled &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; hours apart on the same day, we saw two completely different plans.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fast plan (~3ms): BitmapAnd
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BitmapAnd  (cost=146.22..146.22 rows=4 width=0) (actual time=1.83..1.83 rows=0 loops=1)
  -&amp;gt; Bitmap Index Scan on idx_technician_selection
       (cost=0.00..5.22 rows=38 width=0) (actual time=0.45..0.45 rows=8720 loops=1)
       Index Cond: (is_active = true) AND (is_available = true) AND (is_dispatched = true)
                   AND (is_blocked = false) AND (is_express_enabled = true) AND (is_suspended = false)
  -&amp;gt; Bitmap Index Scan on idx_technician_last_job_geom
       (cost=0.00..140.75 rows=1628 width=0) (actual time=1.20..1.20 rows=8419 loops=1)
       Index Cond: (last_job_location &amp;amp;&amp;amp; &amp;lt;bounding box&amp;gt;)

Bitmap Heap Scan on technicians t
    Heap Blocks: exact=191
    -&amp;gt; BitmapAnd (above)
    Filter: ST_DWithin(last_job_location, ..., 4000)
    Rows Removed by Filter: 92
    -&amp;gt; actual rows=1

Execution Time: 3.155 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner used &lt;strong&gt;BitmapAnd&lt;/strong&gt; — it scanned both the boolean index and the spatial index on &lt;code&gt;technicians&lt;/code&gt;, intersected the two bitmaps in memory, and only fetched 191 heap pages. One row survived all filters. Fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  The slow plan (~42ms): Boolean index only
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bitmap Heap Scan on technicians t
    (cost=5.47..1205.33 rows=38 width=824) (actual time=0.62..8.14 rows=4626 loops=1)
  -&amp;gt; Bitmap Index Scan on idx_technician_selection
       (cost=0.00..5.22 rows=38 width=0) (actual time=0.48..0.48 rows=4626 loops=1)
       Index Cond: (is_active = true) AND (is_available = true) AND (is_dispatched = true)
                   AND (is_blocked = false) AND (is_express_enabled = true) AND (is_suspended = false)

  Nested Loop (actual loops=4558)
    -&amp;gt; Index Scan on technician_locations tl
         Index Cond: (technician_id = t.id)
         Filter: ST_DWithin(current_location, ..., 4000)
         loops=4558

Execution Time: 42.008 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner used &lt;strong&gt;only&lt;/strong&gt; the boolean index. No spatial index. It fetched 4,626 rows from &lt;code&gt;technicians&lt;/code&gt;, then nested-loop joined into &lt;code&gt;technician_locations&lt;/code&gt; 4,558 times, applying &lt;code&gt;ST_DWithin&lt;/code&gt; as a CPU filter on each loop. 14x slower.&lt;/p&gt;

&lt;h3&gt;
  
  
  Side by side
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Fast Plan (BitmapAnd)&lt;/th&gt;
&lt;th&gt;Slow Plan (Boolean only)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Strategy&lt;/td&gt;
&lt;td&gt;BitmapAnd (boolean + spatial)&lt;/td&gt;
&lt;td&gt;Bitmap Scan (boolean only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated rows from boolean index&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actual rows from boolean index&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8,720&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4,626&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heap pages fetched&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;4,558 nested loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spatial index used?&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution time&lt;/td&gt;
&lt;td&gt;3.1ms&lt;/td&gt;
&lt;td&gt;42ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things jump out. The estimated row count is 38 in &lt;strong&gt;both&lt;/strong&gt; plans. That's the planner's selectivity estimate for the boolean combination. But the actual count is thousands of rows. The estimate is wrong by two orders of magnitude.&lt;/p&gt;

&lt;p&gt;The difference between the plans isn't that one has better estimates — both are equally wrong. The difference is what the planner &lt;em&gt;decided to do&lt;/em&gt; with that bad estimate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause: 229x Underestimate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The selectivity math
&lt;/h3&gt;

&lt;p&gt;The planner looks up each boolean column's frequency independently from &lt;code&gt;pg_statistic&lt;/code&gt; and multiplies them together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;is_active = true:          ~70%  → 0.70
is_available = true:       ~50%  → 0.50
is_dispatched = true:      ~25%  → 0.25
is_blocked = false:        ~95%  → 0.95
is_express_enabled = true: ~40%  → 0.40
is_suspended = false:      ~95%  → 0.95

Combined (assuming independence):
  0.70 × 0.50 × 0.25 × 0.95 × 0.40 × 0.95 ≈ 0.03 with these rounded
  figures; the planner's exact stored frequencies multiply out to ≈ 0.0025,
  which is where its estimate comes from:

  15,000 rows × 0.0025 ≈ 38 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner arrives at 38. The actual number is &lt;strong&gt;8,720&lt;/strong&gt;. That's a 229x underestimate.&lt;/p&gt;
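&lt;p&gt;The gap is easy to verify by hand: run the boolean predicates on their own and compare the count against the plan's estimate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Actual combined selectivity, no planner involved
SELECT count(*)
FROM technicians
WHERE is_active
  AND is_available
  AND is_dispatched
  AND NOT is_blocked
  AND is_express_enabled
  AND NOT is_suspended;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;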

&lt;h3&gt;
  
  
  Why the multiplication is wrong
&lt;/h3&gt;

&lt;p&gt;These columns are &lt;strong&gt;not independent&lt;/strong&gt;. They encode a &lt;strong&gt;finite state machine&lt;/strong&gt; — the operational lifecycle of a field technician. The business logic enforces hard constraints between them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column A&lt;/th&gt;
&lt;th&gt;Column B&lt;/th&gt;
&lt;th&gt;Relationship&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_available&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Available implies active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_dispatched&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dispatched implies active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_available&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_dispatched&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dispatched implies available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_blocked&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Blocked implies not active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_suspended&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suspended implies not active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_suspended&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;is_available&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suspended implies not available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, 5 of the 6 booleans encode roughly 6 valid states (the sixth, &lt;code&gt;is_express_enabled&lt;/code&gt;, is an independent capability flag):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;is_active&lt;/th&gt;
&lt;th&gt;is_available&lt;/th&gt;
&lt;th&gt;is_dispatched&lt;/th&gt;
&lt;th&gt;is_blocked&lt;/th&gt;
&lt;th&gt;is_suspended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SUSPENDED&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLOCKED&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INACTIVE&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OFFLINE&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IDLE&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DISPATCHED&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's 6 valid states out of 2^5 = 32 theoretical combinations. The planner's independence assumption spreads probability mass across all 32 combinations, including the 26 that can never occur. It doesn't know that &lt;code&gt;is_blocked = true AND is_active = true&lt;/code&gt; is impossible in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actual selectivity of the boolean combination: ~58% (8,720 / 15,000)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Planner's estimate: 0.25% (38 / 15,000)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Error factor: 229x&lt;/strong&gt;&lt;/p&gt;
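
&lt;p&gt;You can reproduce the shape of this failure in a few lines, without PostgreSQL. A toy sketch with synthetic data, using perfect correlation as the extreme case (not our production distribution):&lt;/p&gt;

```python
import random

random.seed(7)

# Extreme case of dependent booleans: five columns that are all copies of
# one underlying coin flip (perfect correlation). Toy data only.
N = 15_000
rows = [(flag,) * 5 for flag in (random.choice([True, False]) for _ in range(N))]

# Actual selectivity of "all five columns are true".
actual = sum(1 for r in rows if all(r)) / N

# Planner-style estimate: multiply per-column marginal frequencies as if
# the columns were independent.
est = 1.0
for i in range(5):
    est *= sum(1 for r in rows if r[i]) / N

print(round(actual, 2))        # ~0.5: half the rows match
print(round(est, 3))           # ~0.031: roughly 0.5 to the 5th power
print(round(actual / est, 1))  # ~16x underestimate
```

&lt;p&gt;Our six columns aren't perfectly correlated, but the mechanism is identical: every column that restates the same underlying state multiplies the estimate down another time.&lt;/p&gt;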

&lt;p&gt;The PostgreSQL docs even demonstrate this exact failure mode:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The planner estimates the selectivity for each condition individually... Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/multivariate-statistics-examples.html" rel="noopener noreferrer"&gt;Multivariate Statistics Examples&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Why the Plan Flips: ANALYZE Sampling Non-Determinism
&lt;/h2&gt;

&lt;p&gt;The 229x underestimate explains why the slow plan is slow. But why does the query &lt;em&gt;sometimes&lt;/em&gt; get the fast plan?&lt;/p&gt;

&lt;p&gt;Because the estimate isn't always exactly 38 — it drifts. ANALYZE takes a random sample, not a census:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/sql-analyze.html" rel="noopener noreferrer"&gt;ANALYZE command reference&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our &lt;code&gt;technicians&lt;/code&gt; table is high-churn — technicians toggle &lt;code&gt;is_dispatched&lt;/code&gt; hundreds of times per day across the fleet. As the number of available technicians changes through the day, each ANALYZE sample captures a different snapshot. The boolean estimate drifts — and the plan is sensitive to that drift.&lt;/p&gt;
&lt;h3&gt;
  
  
  The tipping point
&lt;/h3&gt;

&lt;p&gt;The planner's decision to include the spatial index in a BitmapAnd has a cost threshold. When the boolean estimate is "moderate" (say, 30–50 rows), the planner decides it's worth adding a second index scan to intersect. When the estimate drops below a threshold (say, 8–15 rows), the planner decides the boolean index alone is good enough and drops the spatial index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the boolean estimate is ~38:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Planner: "Boolean index returns ~38 rows. That's moderate.
          Adding the spatial index scan (est. 1,628 rows) and
          intersecting bitmaps will reduce heap fetches to ~4.
          The extra index scan is worth it."

Plan: BitmapAnd(boolean_index + spatial_index)
Result: 191 heap pages, 1 row survives → 3ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When the boolean estimate drops to ~10:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Planner: "Boolean index returns only ~10 rows. That's already tiny.
          A second index scan costs more than just filtering 10 rows
          by distance. Not worth the overhead."

Plan: Bitmap Scan on boolean index only, ST_DWithin as CPU filter
Result: 8,720 rows scanned, spatial filter on every row → 42ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
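
&lt;p&gt;The decision boundary is easy to model as a toy cost comparison. The numbers here are invented for illustration, not PostgreSQL's actual cost model:&lt;/p&gt;

```python
# Toy model of the BitmapAnd tipping point (invented costs).
SECOND_INDEX_SCAN_COST = 60.0   # fixed cost to add the spatial index scan
PER_ROW_FILTER_COST = 2.0       # CPU cost to distance-filter one row

def choose_plan(boolean_row_estimate):
    # Pick whichever strategy the (estimated) arithmetic says is cheaper.
    costs = {
        "BitmapAnd(boolean + spatial)": SECOND_INDEX_SCAN_COST,
        "boolean index + ST_DWithin filter": boolean_row_estimate * PER_ROW_FILTER_COST,
    }
    return min(costs, key=costs.get)

print(choose_plan(38))  # estimate 38: the second index scan pays for itself
print(choose_plan(10))  # estimate 10: filtering "10" rows looks cheaper
```

&lt;p&gt;The plan flips purely because the estimate drifts across the threshold; the real data never changed shape.&lt;/p&gt;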



&lt;h3&gt;
  
  
  The perverse incentive
&lt;/h3&gt;

&lt;p&gt;Here's the cruel part: the MORE wrong the boolean estimate is (in the "too low" direction), the MORE confident the planner becomes that BitmapAnd is unnecessary — and the WORSE the actual performance gets. The planner drops the spatial index precisely when it would help the most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autovacuum is the trigger
&lt;/h3&gt;

&lt;p&gt;Autovacuum triggers ANALYZE when the number of changed tuples exceeds &lt;code&gt;autovacuum_analyze_threshold + autovacuum_analyze_scale_factor × table_size&lt;/code&gt;. With default settings, that's &lt;code&gt;50 + 0.10 × 15,000 = 1,550 tuple changes&lt;/code&gt;. On a table with thousands of state transitions per minute during peak hours, this threshold is crossed constantly. ANALYZE can fire multiple times within a few minutes, each time producing a slightly different boolean estimate that may or may not cross the BitmapAnd decision boundary.&lt;/p&gt;
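
&lt;p&gt;The trigger math above, as a quick sanity check:&lt;/p&gt;

```python
# Autovacuum's ANALYZE trigger, per the formula above, with PostgreSQL's
# defaults (autovacuum_analyze_threshold = 50,
# autovacuum_analyze_scale_factor = 0.1).
def analyze_trigger(live_tuples, threshold=50, scale_factor=0.1):
    return threshold + scale_factor * live_tuples

print(analyze_trigger(15_000))  # 1550.0 changed tuples re-analyzes the table
```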




&lt;h2&gt;
  
  
  BitmapAnd: How It Works and Why It's Fragile
&lt;/h2&gt;

&lt;p&gt;BitmapAnd scans multiple indexes on the &lt;strong&gt;same table&lt;/strong&gt;, builds a bitmap of matching heap page locations for each, and intersects them. Only pages in the intersection get fetched. In our fast plan, it intersects boolean-matching pages with spatial-matching pages, reducing thousands of candidates down to 191 heap pages.&lt;/p&gt;

&lt;p&gt;Two things make it fragile for our case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same-table only.&lt;/strong&gt; BitmapAnd can't combine indexes across different tables. Our spatial filter is on &lt;code&gt;technician_locations&lt;/code&gt;, our boolean filter is on &lt;code&gt;technicians&lt;/code&gt; — that's a join, not a merge. BitmapAnd only helps when both indexes live on the same table (like the &lt;code&gt;last_job_location&lt;/code&gt; GiST index and the boolean index, both on &lt;code&gt;technicians&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-sensitive inclusion.&lt;/strong&gt; Each additional bitmap index scan has a startup cost. When the boolean estimate is low enough, the planner decides the second index scan isn't worth it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/indexes-bitmap-scans.html" rel="noopener noreferrer"&gt;Chapter 11.5: Combining Multiple Indexes&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The planner concludes: "I only expect 10 rows from the boolean index — filtering them by distance is cheaper than running a second index scan." The math is correct given the premise. The premise is just 229x wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Our Composite Index Didn't Help
&lt;/h2&gt;

&lt;p&gt;We already had a composite index on all six boolean columns. The natural assumption was: "PostgreSQL has an index on exactly this combination — surely it knows how many rows match?"&lt;/p&gt;

&lt;p&gt;It doesn't. A composite index helps PostgreSQL &lt;strong&gt;find&lt;/strong&gt; rows efficiently (access path), but it does NOT help PostgreSQL &lt;strong&gt;estimate how many&lt;/strong&gt; rows exist before scanning (cardinality estimation). These are two separate systems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column."&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.postgresql.org/docs/current/catalog-pg-statistic.html" rel="noopener noreferrer"&gt;pg_statistic catalog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ANALYZE samples heap pages and builds per-column MCVs independently. The B-tree's internal knowledge of which key combinations exist is never extracted into planner statistics. Our composite index found the right rows perfectly every time — the problem was that the planner's &lt;strong&gt;estimate&lt;/strong&gt; of how many rows it would find was 229x too low, causing it to choose a bad join strategy around the index.&lt;/p&gt;

&lt;p&gt;Only two things can store combination frequencies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE STATISTICS (mcv)&lt;/code&gt;&lt;/strong&gt; — explicit opt-in, stores multi-column MCV lists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A single enum/status column&lt;/strong&gt; — collapses the combination into one value&lt;/li&gt;
&lt;/ol&gt;
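
&lt;p&gt;A sketch of option 1, using the column names from this post (the statistics object name is arbitrary):&lt;/p&gt;

```sql
-- Opt in to multivariate MCV statistics across the six booleans.
CREATE STATISTICS technician_flags_mcv (mcv)
  ON is_active, is_available, is_dispatched,
     is_blocked, is_express_enabled, is_suspended
  FROM technicians;

-- Extended statistics are only populated by ANALYZE.
ANALYZE technicians;
```

&lt;p&gt;After this, the planner can look up the frequency of the exact combination instead of multiplying six marginals.&lt;/p&gt;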




&lt;h2&gt;
  
  
  The Anti-Pattern: Dependent Booleans in OLTP
&lt;/h2&gt;

&lt;p&gt;This isn't just a query tuning story. It's a schema design lesson.&lt;/p&gt;

&lt;p&gt;When you model a state machine as independent boolean columns, you're making an implicit promise to the database: "these columns are independent dimensions." Every cost-based optimizer — PostgreSQL, MySQL, Oracle, SQL Server — takes you at your word. The planner multiplies their selectivities because that's mathematically correct for independent variables.&lt;/p&gt;

&lt;p&gt;The problem is that the promise is false. &lt;code&gt;is_available&lt;/code&gt; and &lt;code&gt;is_dispatched&lt;/code&gt; aren't independent dimensions — they're states in a lifecycle. One implies the other. The planner can't know this from the schema alone.&lt;/p&gt;

&lt;p&gt;This pattern tends to emerge organically. You start with &lt;code&gt;is_active&lt;/code&gt;. A feature ships, you add &lt;code&gt;is_available&lt;/code&gt;. A compliance requirement adds &lt;code&gt;is_blocked&lt;/code&gt;. Each column is a small, low-risk migration. Nobody notices that the columns are accumulating mutual dependencies until the planner starts making bad decisions — and even then, the symptom (intermittent latency spikes) doesn't obviously point at schema design.&lt;/p&gt;

&lt;p&gt;The fix is to model state as state. If your boolean columns have business-logic dependencies between them — if certain combinations can never occur — they should be a single enum or status column. One column, one MCV lookup, no multiplication error. &lt;code&gt;WHERE status = 'idle'&lt;/code&gt; gives the planner an exact frequency. &lt;code&gt;WHERE is_active = true AND is_available = true AND is_dispatched = false AND ...&lt;/code&gt; gives it a guess.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on InnoDB
&lt;/h3&gt;

&lt;p&gt;Interestingly, this specific failure mode wouldn't manifest the same way on MySQL/InnoDB with the same composite index. InnoDB's optimizer uses a technique called &lt;strong&gt;index dive&lt;/strong&gt; — when estimating the cardinality of a range scan on a composite index, it actually samples the B-tree directly rather than multiplying per-column statistics. For an equality scan across all columns of a composite index (which is what our query does), InnoDB dives into the index, reads a sample of pages at the leaf level, and estimates row count from the actual index structure.&lt;/p&gt;

&lt;p&gt;This means InnoDB would see that the combination &lt;code&gt;(true, true, false, false, true, false)&lt;/code&gt; maps to ~8,700 leaf entries, not 38. The estimate would be roughly correct, and the optimizer wouldn't make the same bad join order decision.&lt;/p&gt;

&lt;p&gt;PostgreSQL doesn't do index dives for cardinality estimation. It always goes back to &lt;code&gt;pg_statistic&lt;/code&gt; and multiplies. The composite index is invisible to the estimation layer — it's only visible to the execution layer. This is a deliberate design choice (keeping statistics separate from access paths), but it means PostgreSQL is more vulnerable to correlated-column estimation errors than InnoDB is, even when the right composite index already exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always compare estimated vs. actual rows in &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;.&lt;/strong&gt; If they diverge by more than 10x, the planner is making decisions based on a false premise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Composite indexes don't fix estimation in PostgreSQL.&lt;/strong&gt; They help the executor find rows, but &lt;code&gt;pg_statistic&lt;/code&gt; stores per-column statistics independently. The planner still multiplies. (InnoDB's index dives would handle this better.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model state as state, not as independent booleans.&lt;/strong&gt; If your boolean columns have mutual dependencies — if certain combinations can never occur — they belong in a single enum column.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL's cost-based optimizer is remarkably good. But it operates on a statistical model of your data, not the data itself. When your schema encodes correlations the model assumes away, the planner makes rational decisions from false premises. Understanding where the model breaks is the difference between a query that runs in 3ms and one that runs in 42ms.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>opensource</category>
      <category>sql</category>
    </item>
    <item>
      <title>Running Grafana Loki in Production: What We Actually Learned</title>
      <dc:creator>Sriram Rajendran</dc:creator>
      <pubDate>Mon, 30 Mar 2026 11:58:48 +0000</pubDate>
      <link>https://dev.to/sriramrajendran/running-grafana-loki-in-production-what-we-actually-learned-d9g</link>
      <guid>https://dev.to/sriramrajendran/running-grafana-loki-in-production-what-we-actually-learned-d9g</guid>
      <description>&lt;p&gt;We run Loki in distributed mode on EKS, processing ~1.16 TB of logs per day across ~34,000 lines/second. This post covers the architecture we landed on, the configuration decisions that actually matter, and the numbers from production that validate (or challenge) those decisions.&lt;/p&gt;

&lt;p&gt;But first — if you're evaluating Loki or just heard the name, let's build up from first principles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Loki Exists: A Different Philosophy on Logs
&lt;/h2&gt;

&lt;p&gt;Traditional logging systems like Elasticsearch (ELK stack) or Splunk work by &lt;strong&gt;full-text indexing&lt;/strong&gt; every log line. When a log line comes in, the system tokenizes it, builds an inverted index over every word, and stores that index alongside the raw data. This makes arbitrary text search fast, but the index itself becomes enormous — often larger than the raw logs. At scale, you're paying more to store and maintain the index than the data it points to.&lt;/p&gt;

&lt;p&gt;Loki takes the opposite approach: &lt;strong&gt;index only the metadata, store the logs as compressed chunks.&lt;/strong&gt; Instead of indexing the contents of &lt;code&gt;"ERROR: connection refused to database host db-prod-3"&lt;/code&gt;, Loki only indexes the &lt;em&gt;labels&lt;/em&gt; attached to that line — things like &lt;code&gt;{namespace="payments", app="api-gateway", pod="api-gateway-7f8b9c"}&lt;/code&gt;. When you query, Loki uses the label index to find the right chunks, then brute-force greps through those chunks.&lt;/p&gt;

&lt;p&gt;This is the fundamental trade-off: &lt;strong&gt;Loki trades query-time compute for storage-time simplicity.&lt;/strong&gt; Queries are slower than Elasticsearch for arbitrary text search, but storage costs drop dramatically because you're not maintaining a massive inverted index. For most operational use cases — "show me the logs from the payments namespace in the last hour where the line contains ERROR" — this is fast enough, and the cost savings are substantial.&lt;/p&gt;
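
&lt;p&gt;That operational query is one label matcher plus one line filter in LogQL (label value illustrative):&lt;/p&gt;

```
{namespace="payments"} |= "ERROR"
```

&lt;p&gt;The matcher drives the index lookup; the &lt;code&gt;|=&lt;/code&gt; filter is the brute-force grep over the selected chunks.&lt;/p&gt;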

&lt;p&gt;Think of it like this: Elasticsearch is a search engine that happens to store logs. Loki is a log storage system that happens to support search.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Components: What Each Piece Does and Why It Exists
&lt;/h2&gt;

&lt;p&gt;Before we get into our specific setup, let's understand what each Loki component does from first principles. In a traditional monolithic logging system, one process handles everything — accept logs, store them, index them, and query them. Loki breaks this into discrete components so each can be scaled independently based on its bottleneck (CPU, memory, I/O, or network).&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributor — The Front Door
&lt;/h3&gt;

&lt;p&gt;The distributor is the first component that touches your log data. Every log push (from Promtail, Fluentd, or any other agent) hits a distributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Validates incoming log streams (checks labels, enforces rate limits, rejects old samples), then hashes the stream's labels to determine which ingester(s) should own that stream. It uses a &lt;strong&gt;consistent hash ring&lt;/strong&gt; to route — the same label set always goes to the same ingester, which is critical for keeping related log lines together in memory.&lt;/p&gt;
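
&lt;p&gt;The routing invariant can be sketched in a few lines. This is a simplified stand-in for Loki's token ring, not its actual implementation:&lt;/p&gt;

```python
import hashlib

# Simplified distributor-style routing: hash the canonicalized label set
# so the same stream always lands on the same ingester.
INGESTERS = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]

def route(labels):
    # Canonical form: sorted key=value pairs, so label order never matters.
    key = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    token = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return INGESTERS[token % len(INGESTERS)]

print(route({"namespace": "payments", "app": "api-gateway"})
      == route({"app": "api-gateway", "namespace": "payments"}))  # True
```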

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; In Elasticsearch, the coordinating node handles both routing and querying. Loki splits these because the write path and read path have completely different scaling characteristics. Distributors are CPU-light and stateless — you can add or remove them without any data migration. They're essentially smart load balancers for your write path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; CPU usage and push latency. If P99 push latency climbs above 500ms, add more distributors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingester — Where Data Lives Before Storage
&lt;/h3&gt;

&lt;p&gt;The ingester is the most critical and most resource-hungry component. It's the equivalent of what Elasticsearch calls a "data node" for recent data, but with a key difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Receives log streams from distributors, holds them in memory as "chunks" (compressed blocks of log lines), builds the index entries for those chunks, and periodically flushes both to long-term storage (S3). While data is in the ingester, it's queryable directly from memory — no storage round-trip needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate and stateful:&lt;/strong&gt; Ingesters are StatefulSets, not Deployments, because they hold state — unflushed chunks in memory and a Write-Ahead Log (WAL) on disk. The WAL is Loki's crash recovery mechanism: if an ingester dies, the replacement can replay the WAL to recover data that hadn't been flushed to S3 yet.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from Elasticsearch, where durability comes from replica shards maintained by the data nodes themselves. In Loki, data durability between flush cycles is handled by a combination of &lt;strong&gt;replication factor&lt;/strong&gt; (writing to N ingesters simultaneously) and &lt;strong&gt;WAL replay&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; Memory usage. Ingesters are memory-bound — they hold all active streams plus unflushed chunks. If memory pressure rises, you either add ingesters (spreading the hash ring thinner) or tune flush intervals to push data to storage faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querier — The Brute-Force Search Engine
&lt;/h3&gt;

&lt;p&gt;The querier is where Loki's "index little, grep a lot" philosophy becomes concrete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Executes LogQL queries by first consulting the index (via the index gateway) to identify which chunks match the label matchers, then fetches those chunks from S3 (or cache), decompresses them, and does a line-by-line scan for your filter expression. It also queries ingesters directly for data that hasn't been flushed to storage yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; In Elasticsearch, a query coordinator fans out to data nodes that each search their local shards. Loki separates this because queriers are compute-intensive and bursty — one expensive query shouldn't affect your ability to ingest logs. By isolating the read path, you can scale queriers independently of ingesters.&lt;/p&gt;

&lt;p&gt;The brute-force scan is what makes Loki queries slower than Elasticsearch for wide searches, but it's also what makes the storage layer so simple. No inverted index to update on every write, no segment merges eating I/O in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; Query latency and memory. Wide time-range queries or unselective label matchers cause queriers to fetch and decompress many chunks. More queriers = more parallelism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Frontend — The Query Optimizer
&lt;/h3&gt;

&lt;p&gt;The query frontend sits between the user (Grafana) and the queriers. It doesn't execute queries itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Takes an incoming query, splits it into smaller sub-queries by time interval, dispatches those sub-queries to queriers (via the query scheduler), and merges the results. It also caches query results and deduplicates identical in-flight queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; This is a pattern borrowed from Cortex/Mimir (Prometheus long-term storage). Without the query frontend, a single "show me the last 24 hours" query would force one querier to scan 24 hours of data sequentially. With it, that query becomes 96 parallel 15-minute queries spread across your querier fleet. This is the single biggest lever for query performance.&lt;/p&gt;
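
&lt;p&gt;The splitting itself is simple arithmetic (a sketch; the real interval is governed by &lt;code&gt;split_queries_by_interval&lt;/code&gt;):&lt;/p&gt;

```python
# Query-frontend time splitting: one wide query becomes many small
# sub-queries over fixed intervals (times in seconds for simplicity).
def split(start_s, end_s, interval_s):
    n = -(-(end_s - start_s) // interval_s)   # ceiling division
    return [(start_s + i * interval_s,
             min(start_s + (i + 1) * interval_s, end_s))
            for i in range(n)]

DAY = 24 * 3600
subs = split(0, DAY, 15 * 60)
print(len(subs))  # 96 sub-queries fan out across the querier fleet
```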

&lt;p&gt;Elasticsearch has a similar concept with its search "phases" (query then fetch), but Loki makes the time-based splitting explicit and configurable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling signal:&lt;/strong&gt; Rarely the bottleneck. Scale if you see request queuing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Scheduler — The Traffic Controller
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Maintains a queue of pending sub-queries from the query frontend and distributes them fairly across available queriers. Implements per-tenant fair queuing so one user's expensive query doesn't starve others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; Without the scheduler, query frontends connect directly to queriers via round-robin. This is fine at small scale, but at high concurrency it leads to uneven load distribution. The scheduler ensures no single querier gets overwhelmed while others sit idle. Think of it as the difference between random checkout lane selection vs. a single serpentine queue at a bank.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Gateway — The Index Serving Layer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Downloads the BoltDB index files from S3, caches them on disk (EFS in our case), and serves index lookups to queriers over gRPC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; This is Loki-specific and exists because of the BoltDB Shipper pattern. Without an index gateway, every querier downloads and caches the full index locally. With 10 queriers, that's 10 copies of the index, 10x the S3 GET requests, and 10x the local disk needed. The index gateway centralizes this into 3 replicas that serve the entire querier fleet.&lt;/p&gt;

&lt;p&gt;This has no real equivalent in Elasticsearch because ES stores the index within its Lucene segments on each data node — it's not a separate concern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compactor — The Janitor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Runs as a singleton (only one instance) and performs two jobs: (1) compacts multiple small index files into larger ones to improve query performance, and (2) applies retention policies by marking expired chunks for deletion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; Compaction is I/O-intensive and runs on its own schedule. You don't want compaction competing with ingest or query for resources. In Elasticsearch, segment merging (the equivalent operation) happens on each data node and is one of the primary sources of I/O contention at scale. Loki avoids this by centralizing compaction into a dedicated component.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Manager — The Schema Enforcer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Pre-creates and manages the index "tables" (time-based partitions) according to the schema config. Ensures tables exist before data arrives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; Mostly a legacy component for when Loki supported external index stores like DynamoDB or Cassandra with time-sharded tables. With BoltDB Shipper or TSDB, it's less critical but still handles table lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gateway (nginx) — The Entry Point
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; A simple nginx reverse proxy that routes &lt;code&gt;/loki/api/v1/push&lt;/code&gt; to distributors and &lt;code&gt;/loki/api/v1/query&lt;/code&gt; to query frontends. Provides a single endpoint for clients.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our Setup at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cluster&lt;/strong&gt;: EKS (Kubernetes 1.33) on AWS, 8 nodes running Bottlerocket OS&lt;br&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: &lt;code&gt;loki-distributed&lt;/code&gt; Helm chart (v0.80.2), Loki version 2.9.10&lt;br&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: S3 for chunks, BoltDB Shipper for index, EFS for compactor and index gateway&lt;br&gt;
&lt;strong&gt;Caching&lt;/strong&gt;: Memcached (3 tiers — chunks, query frontend, index writes)&lt;br&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus + Grafana co-located in the same cluster&lt;/p&gt;
&lt;h2&gt;
  
  
  Component Sizing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Replicas&lt;/th&gt;
&lt;th&gt;CPU (req/limit)&lt;/th&gt;
&lt;th&gt;Memory (req/limit)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;250m / 500m&lt;/td&gt;
&lt;td&gt;512Mi / 1Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ingester&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;StatefulSet&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;250m / 500m&lt;/td&gt;
&lt;td&gt;16Gi / 18Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Querier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;200m / 400m&lt;/td&gt;
&lt;td&gt;3Gi / 6Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;200m / 400m&lt;/td&gt;
&lt;td&gt;1Gi / 4Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Scheduler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;250m / 500m&lt;/td&gt;
&lt;td&gt;512Mi / 1Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Index Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;StatefulSet&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1 / 2&lt;/td&gt;
&lt;td&gt;2Gi / 8Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compactor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;500m / 1&lt;/td&gt;
&lt;td&gt;6Gi / 8Gi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Table Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gateway (nginx)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things jump out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingesters are the memory hogs.&lt;/strong&gt; Each ingester requests 16Gi and uses nearly all of it (~15.9 GB working set in production). This is because ingesters hold all active streams and recent chunks in memory before flushing to storage. If you undersize these, you'll see OOMKills that take down a chunk of your in-flight data with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queriers need headroom.&lt;/strong&gt; We run 10 queriers not because steady-state demands it, but because queries are bursty. A single user running a broad &lt;code&gt;{namespace=~".+"}&lt;/code&gt; query over a wide time range can spike memory on multiple queriers simultaneously. The 3Gi request with 6Gi limit gives them room to burst.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributors are lightweight but need replicas.&lt;/strong&gt; At 34k lines/sec ingest, 6 distributors keep each one comfortably below its CPU limit. They're stateless, so scaling is trivial.&lt;/p&gt;
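
&lt;p&gt;The implied per-distributor load:&lt;/p&gt;

```python
# Per-distributor share of the write path at the rates quoted above,
# assuming even load balancing.
lines_per_sec = 34_000
distributors = 6
print(round(lines_per_sec / distributors))  # ~5667 lines/sec per distributor
```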
&lt;h2&gt;
  
  
  The Configuration That Matters
&lt;/h2&gt;

&lt;p&gt;Here are the config knobs that we've tuned from defaults and why:&lt;/p&gt;
&lt;h3&gt;
  
  
  Ingester Tuning
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingester&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;chunk_block_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;262144&lt;/span&gt;        &lt;span class="c1"&gt;# 256KB blocks&lt;/span&gt;
  &lt;span class="na"&gt;chunk_encoding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snappy&lt;/span&gt;           &lt;span class="c1"&gt;# Fast compression, ~3.5x ratio&lt;/span&gt;
  &lt;span class="na"&gt;chunk_idle_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;            &lt;span class="c1"&gt;# Flush idle chunks after 2 min&lt;/span&gt;
  &lt;span class="na"&gt;max_chunk_age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30m&lt;/span&gt;               &lt;span class="c1"&gt;# Hard cap — flush after 30 min regardless&lt;/span&gt;
  &lt;span class="na"&gt;chunk_retain_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;          &lt;span class="c1"&gt;# Keep flushed chunks for queries in flight&lt;/span&gt;
  &lt;span class="na"&gt;lifecycler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;replication_factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;        &lt;span class="c1"&gt;# Every log line written to 2 ingesters&lt;/span&gt;
  &lt;span class="na"&gt;wal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/wal&lt;/span&gt;
    &lt;span class="na"&gt;checkpoint_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;        &lt;span class="c1"&gt;# WAL checkpoint every minute&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;code&gt;replication_factor: 2&lt;/code&gt;&lt;/strong&gt; is the sweet spot for us. RF=2 already doubles ingester memory overhead relative to RF=1, and RF=3 — while safer — adds another 50% on top of that. With RF=2 and the WAL enabled, we can lose one ingester and recover without data loss. The WAL with 1-minute checkpointing gives us a recovery path that doesn't rely on ring state alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;chunk_idle_period: 2m&lt;/code&gt; and &lt;code&gt;max_chunk_age: 30m&lt;/code&gt;&lt;/strong&gt; control how long data lives in ingester memory before hitting S3. Shorter values = lower memory usage but more small chunks in object storage (which slows queries). 30 minutes is a good balance — our ingesters flush about 9.4 chunks/sec at steady state.&lt;/p&gt;
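
&lt;p&gt;You can verify these knobs are producing healthy chunks rather than a flood of tiny ones. A hedged example — the metric name is from Loki 2.x ingester histograms, so confirm it against your version's &lt;code&gt;/metrics&lt;/code&gt; endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.5, sum(rate(loki_ingester_chunk_size_bytes_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the median flushed chunk is only tens of KB, chunks are flushing before they fill — usually a sign of too many sparse streams or an overly aggressive idle period.&lt;/p&gt;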

&lt;p&gt;&lt;strong&gt;&lt;code&gt;chunk_encoding: snappy&lt;/code&gt;&lt;/strong&gt; over gzip because at 13.4 MB/s ingest, CPU matters more than compression ratio. Snappy gives us good-enough compression without burning cores.&lt;/p&gt;
&lt;h3&gt;
  
  
  Limits: The Rate Limiting That Keeps You Alive
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;limits_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;retention_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;72h&lt;/span&gt;
  &lt;span class="na"&gt;ingestion_rate_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;         &lt;span class="c1"&gt;# Per-tenant ingest rate (MB/s)&lt;/span&gt;
  &lt;span class="na"&gt;ingestion_burst_size_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;    &lt;span class="c1"&gt;# Burst allowance&lt;/span&gt;
  &lt;span class="na"&gt;per_stream_rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512M&lt;/span&gt;      &lt;span class="c1"&gt;# Per-stream rate limit&lt;/span&gt;
  &lt;span class="na"&gt;per_stream_rate_limit_burst&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1024M&lt;/span&gt;
  &lt;span class="na"&gt;max_entries_limit_per_query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000000&lt;/span&gt;
  &lt;span class="na"&gt;reject_old_samples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;reject_old_samples_max_age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;168h&lt;/span&gt; &lt;span class="c1"&gt;# Reject logs older than 7 days&lt;/span&gt;
  &lt;span class="na"&gt;cardinality_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200000&lt;/span&gt;
  &lt;span class="na"&gt;max_label_value_length&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20480&lt;/span&gt;
  &lt;span class="na"&gt;max_label_name_length&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10240&lt;/span&gt;
  &lt;span class="na"&gt;max_label_names_per_series&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
  &lt;span class="na"&gt;split_queries_by_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A few things worth calling out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;retention_period: 72h&lt;/code&gt;&lt;/strong&gt; — we only keep 3 days of logs in Loki. This is deliberate. Loki isn't our log archive; it's our log search tool. Anything older is handled by S3 lifecycle rules for long-term retention. This keeps our index small and queries fast.&lt;/p&gt;
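
&lt;p&gt;For completeness: with boltdb-shipper, that retention is enforced by the compactor, not the ingesters. A minimal sketch of the relevant knobs (the path and delete delay are illustrative, not necessarily our exact values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compactor:
  working_directory: /var/loki/compactor
  shared_store: s3
  retention_enabled: true        # without this, retention_period is never applied
  retention_delete_delay: 2h     # grace period before chunks are actually deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;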

&lt;p&gt;&lt;strong&gt;&lt;code&gt;per_stream_rate_limit: 512M&lt;/code&gt;&lt;/strong&gt; — this is intentionally high. We run with &lt;code&gt;auth_enabled: false&lt;/code&gt; (single tenant), so there's no per-tenant isolation. Instead, we rely on per-stream limits to prevent any single application from overwhelming the pipeline. If you're multi-tenant, you'd want much tighter per-tenant limits.&lt;/p&gt;
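
&lt;p&gt;If you do go multi-tenant, per-tenant limits live in a runtime overrides file that Loki reloads without a restart. A hypothetical sketch — the tenant names and numbers are illustrative, not ours:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# loki.yaml
runtime_config:
  file: /etc/loki/overrides.yaml

# overrides.yaml
overrides:
  team-a:
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100
  team-b:
    ingestion_rate_mb: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;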

&lt;p&gt;&lt;strong&gt;&lt;code&gt;reject_old_samples_max_age: 168h&lt;/code&gt;&lt;/strong&gt; — logs older than 7 days get rejected at the distributor. This prevents backfill jobs or misbehaving agents from pushing stale data that would create index entries far from the current write head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;split_queries_by_interval: 15m&lt;/code&gt;&lt;/strong&gt; — the query frontend splits every query into 15-minute sub-queries. These are parallelized across queriers, which is why a 1-hour query range actually fans out to 4 sub-queries. This, combined with the query scheduler (5 replicas), is what keeps our P99 query latency at ~2.75 seconds despite scanning TBs.&lt;/p&gt;
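
&lt;p&gt;How much of that fan-out actually runs in parallel is governed by a couple of related knobs. A sketch with illustrative values (key names from Loki 2.x):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;querier:
  max_concurrent: 8                   # sub-queries each querier executes at once

frontend:
  max_outstanding_per_tenant: 2048    # queue depth before the frontend returns 429s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With 10 queriers at &lt;code&gt;max_concurrent: 8&lt;/code&gt;, up to 80 sub-queries execute simultaneously — at 15-minute splits, that's 20 hours of query range cleared per wave.&lt;/p&gt;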
&lt;h3&gt;
  
  
  Storage: S3 + BoltDB Shipper
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;schema_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2020-09-07"&lt;/span&gt;
      &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h&lt;/span&gt;
        &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki_index_&lt;/span&gt;
      &lt;span class="na"&gt;object_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v11&lt;/span&gt;
      &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boltdb-shipper&lt;/span&gt;

&lt;span class="na"&gt;storage_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;bucketnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-s3-bucket&amp;gt;&lt;/span&gt;
    &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://&amp;lt;your-region&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;boltdb_shipper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;active_index_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/index&lt;/span&gt;
    &lt;span class="na"&gt;cache_location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/cache&lt;/span&gt;
    &lt;span class="na"&gt;cache_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;168h&lt;/span&gt;
    &lt;span class="na"&gt;index_gateway_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;server_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dns:///&amp;lt;loki-index-gateway-service&amp;gt;:9095&lt;/span&gt;
    &lt;span class="na"&gt;shared_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;BoltDB Shipper with Index Gateway&lt;/strong&gt; is the key pattern here. Instead of every querier downloading the full BoltDB index from S3, the index gateway (3 replicas on EFS) serves index lookups over gRPC. This dramatically reduces the number of S3 API calls and keeps query latency consistent.&lt;/p&gt;

&lt;p&gt;We back the index gateways and compactor with &lt;strong&gt;EFS&lt;/strong&gt; (not EBS), because EFS gives us shared persistent storage that survives pod rescheduling across AZs. The ingesters use &lt;strong&gt;gp3 EBS&lt;/strong&gt; volumes (200Gi each) for WAL and local index — they need low-latency local disk, not shared access.&lt;/p&gt;
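
&lt;p&gt;The ingester volumes come from a StatefulSet &lt;code&gt;volumeClaimTemplates&lt;/code&gt; block along these lines (a sketch — the storage class name is whatever your cluster calls gp3):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;volumeClaimTemplates:
  - metadata:
      name: data                 # mounted at /var/loki for WAL + local index
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3      # assumed storage class name
      resources:
        requests:
          storage: 200Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;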
&lt;h3&gt;
  
  
  Caching: The Three Layers
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;chunk_store_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;chunk_cache_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memcached_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;addresses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dnssrv+...memcached-chunks...&lt;/span&gt;
      &lt;span class="na"&gt;consistent_hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;write_dedupe_cache_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memcached_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;addresses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dnssrv+...memcached-index-writes...&lt;/span&gt;
      &lt;span class="na"&gt;consistent_hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;query_range&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;results_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memcached_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;addresses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dnssrv+...memcached-frontend...&lt;/span&gt;
        &lt;span class="na"&gt;consistent_hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three separate memcached tiers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunk cache&lt;/strong&gt; (3 replicas, 4Gi each) — caches decompressed chunks so repeated queries don't hit S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index write dedupe&lt;/strong&gt; (3 replicas, 1Gi each) — deduplicates index writes from multiple ingesters (critical with RF=2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query frontend cache&lt;/strong&gt; (2 replicas, 1Gi each) — caches query results for identical queries within the cache freshness window&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our memcached hit rate sits at &lt;strong&gt;97.8%&lt;/strong&gt;, meaning only ~2.2% of chunk fetches actually hit S3. This is the single biggest factor in keeping query latency reasonable at our scale.&lt;/p&gt;
&lt;h3&gt;
  
  
  Ring Membership: Memberlist over Consul/etcd
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;memberlist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;join_members&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;loki-memberlist-service&amp;gt;&lt;/span&gt;
&lt;span class="na"&gt;distributor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kvstore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memberlist&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We use &lt;strong&gt;memberlist&lt;/strong&gt; (gossip protocol) instead of Consul or etcd for hash ring coordination. One fewer external dependency to manage. It works well up to the scale we're at. If you're running 50+ ingesters, you might want to evaluate Consul for faster ring convergence.&lt;/p&gt;
&lt;h3&gt;
  
  
  gRPC Tuning
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingester_client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grpc_client_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;grpc_compression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gzip&lt;/span&gt;
    &lt;span class="na"&gt;max_send_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;    &lt;span class="c1"&gt;# ~200MB&lt;/span&gt;
    &lt;span class="na"&gt;max_recv_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;

&lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grpc_server_max_recv_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;
  &lt;span class="na"&gt;grpc_server_max_send_msg_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;204857600&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The default gRPC message sizes are too small for production. At high ingest rates, distributor-to-ingester messages can get large, especially when batching. We set both client and server to ~200MB. The &lt;code&gt;gzip&lt;/code&gt; compression on the ingester client cuts inter-component bandwidth significantly.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Numbers from Production
&lt;/h2&gt;

&lt;p&gt;Here's what the cluster looks like right now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingest rate&lt;/td&gt;
&lt;td&gt;~13.4 MB/s (~1.16 TB/day)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log lines/sec&lt;/td&gt;
&lt;td&gt;~34,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active streams&lt;/td&gt;
&lt;td&gt;632&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 push latency&lt;/td&gt;
&lt;td&gt;~245ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 query latency&lt;/td&gt;
&lt;td&gt;~2.75s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk flush rate&lt;/td&gt;
&lt;td&gt;~9.4 chunks/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push error rate&lt;/td&gt;
&lt;td&gt;0% (all 204s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memcached hit rate&lt;/td&gt;
&lt;td&gt;97.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingester memory (actual)&lt;/td&gt;
&lt;td&gt;~15.9 GB across 4 pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Querier memory (actual)&lt;/td&gt;
&lt;td&gt;~4.8 GB across 10 pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cluster memory footprint&lt;/td&gt;
&lt;td&gt;~22.4 GB working set (Loki components only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few observations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;632 active streams is low for 34k lines/sec&lt;/strong&gt; — this means our log labels are well-structured. High cardinality (thousands of unique label combinations) is the number one killer of Loki performance. We keep stream count low by using only a handful of labels: &lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;pod&lt;/code&gt;, &lt;code&gt;container&lt;/code&gt;, and &lt;code&gt;app&lt;/code&gt;. We avoid dynamic labels like request IDs or user IDs.&lt;/p&gt;
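
&lt;p&gt;If you ship with Promtail and Kubernetes service discovery, keeping the label set that small is a relabeling decision. A fragment-level sketch (the meta label names are Promtail's standard Kubernetes SD labels; the job name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nothing that isn't explicitly mapped here becomes a stream label, which is exactly the point.&lt;/p&gt;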

&lt;p&gt;&lt;strong&gt;245ms P99 push latency is solid&lt;/strong&gt; at this ingest rate. The 6 distributors with gRPC compression keep the write path fast. If this creeps above 500ms, it's time to add distributors or check if ingesters are falling behind on flushes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.75s P99 query latency is acceptable&lt;/strong&gt; for our use case (human-driven debugging sessions). If you need sub-second queries, look at increasing the querier count and reducing &lt;code&gt;split_queries_by_interval&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Monitoring Loki: The PromQL Queries That Matter
&lt;/h2&gt;

&lt;p&gt;Running Loki without monitoring Loki is flying blind. Loki exposes a rich set of Prometheus metrics out of the box — the challenge is knowing which ones to watch and what they're telling you. Here's the monitoring playbook, broken down by component.&lt;/p&gt;
&lt;h3&gt;
  
  
  Distributor Metrics — Is Data Getting In?
&lt;/h3&gt;

&lt;p&gt;The distributor is your canary. If the write path is unhealthy, these metrics will show it first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest rate (bytes/sec):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_bytes_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is your headline number — total bytes/sec hitting Loki across all distributors. Track this on a dashboard as the primary throughput gauge. A sudden drop means log agents are failing to ship; a sudden spike means something is logging excessively (a crash loop, debug logging left on, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest rate (lines/sec):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_lines_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this against bytes/sec to derive your average line size. If lines/sec stays flat but bytes/sec spikes, something is producing abnormally large log lines (stack traces, serialized payloads). If lines/sec spikes but bytes/sec doesn't, you've got a chatty service producing many small lines.&lt;/p&gt;
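
&lt;p&gt;That division is worth putting on the dashboard directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_bytes_received_total[5m]))
  /
sum(rate(loki_distributor_lines_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At our numbers (~13.4 MB/s over ~34k lines/sec) that works out to roughly 400 bytes per line — a sensible baseline for structured application logs.&lt;/p&gt;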

&lt;p&gt;&lt;strong&gt;Distributor-to-ingester failures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_distributor_ingester_append_failures_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should be zero. Any non-zero value means distributors are failing to write to ingesters — the ingester ring might be unhealthy, an ingester is OOMing, or gRPC connections are timing out. Alert on this immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discarded samples (dropped logs):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (reason) (rate(loki_discarded_samples_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logs that Loki actively rejected, grouped by reason. Common reasons: &lt;code&gt;rate_limited&lt;/code&gt; (hitting per-tenant or per-stream limits), &lt;code&gt;greater_than_max_sample_age&lt;/code&gt; (old logs rejected by &lt;code&gt;reject_old_samples_max_age&lt;/code&gt;), &lt;code&gt;per_stream_rate_limit&lt;/code&gt;. If you see &lt;code&gt;rate_limited&lt;/code&gt;, your limits are too tight or a service is misbehaving. This is the metric that tells you when logs are being silently dropped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingester Metrics — The Heart of the System
&lt;/h3&gt;

&lt;p&gt;Ingesters are stateful, memory-heavy, and the most likely component to cause data loss if they go wrong. Monitor them closely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active streams:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_memory_streams)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number of unique label combinations currently active in memory across all ingesters. This is your &lt;strong&gt;cardinality gauge&lt;/strong&gt; — the single most important metric for Loki health. If this number grows unbounded, you have a label cardinality problem (someone added a dynamic label like &lt;code&gt;request_id&lt;/code&gt; or &lt;code&gt;user_id&lt;/code&gt;). We alert if this crosses 2x our baseline.&lt;/p&gt;
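
&lt;p&gt;When that alert fires, the quickest way to find the offending label is logcli's cardinality analysis (the gateway address is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logcli series '{}' --analyze-labels --addr=http://&amp;lt;loki-gateway&amp;gt;:3100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;--analyze-labels&lt;/code&gt; prints each label with its count of unique values — the label with thousands of values is your culprit.&lt;/p&gt;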

&lt;p&gt;&lt;strong&gt;Memory chunks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_memory_chunks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number of chunk objects held in ingester memory. Each active stream has at least one chunk being actively written to. This correlates with memory usage — more chunks = more RAM. If chunks grow faster than flushes, ingesters will OOM.&lt;/p&gt;
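
&lt;p&gt;A useful derived ratio from these gauges:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_memory_chunks) / sum(loki_ingester_memory_streams)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Chunks per active stream. Steady means flushing is keeping pace; climbing while stream count stays flat means chunks are accumulating — check the flush rate and S3 latency next.&lt;/p&gt;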

&lt;p&gt;&lt;strong&gt;Chunk flush rate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_ingester_chunks_flushed_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How many chunks/sec are being flushed to long-term storage. This should be steady. A drop in flush rate while ingest stays constant means chunks are accumulating in memory — check whether S3 writes have slowed or the flush queue is backed up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk age at flush (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99, sum(rate(loki_ingester_chunk_age_seconds_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How old chunks are when they get flushed. Should be close to your &lt;code&gt;max_chunk_age&lt;/code&gt; setting (30 minutes = 1800s for us). If P99 chunk age drifts significantly higher, ingesters are holding data too long — either the flush loop is slow or chunks aren't hitting the idle timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk compression ratio:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.5, sum(rate(loki_ingester_chunk_compression_ratio_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How well your chunks compress. A ratio of 0.3 means data compresses to 30% of its original size (~3.3x). If this ratio climbs toward 1.0, your logs are either already compressed or have high entropy (binary data being logged). Snappy encoding typically gives 3-4x for structured text logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAL bytes in use:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_ingester_wal_bytes_in_use)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How much data is sitting in the Write-Ahead Log. If this grows steadily, WAL checkpointing may be falling behind. A zero value (like ours currently) means the WAL is keeping up — data is checkpointed and flushed regularly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAL corruptions (alert on any non-zero):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_ingester_wal_corruptions_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;WAL corruption means potential data loss on recovery. This should always be zero. Any corruption usually points to disk issues (EBS volume problems, filesystem corruption). Alert immediately and investigate the underlying storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint duration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loki_ingester_checkpoint_duration_seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How long WAL checkpoints take. If this exceeds your &lt;code&gt;checkpoint_duration&lt;/code&gt; setting (1 minute for us), checkpoints are overlapping and the WAL will grow unbounded. Usually means disk I/O is saturated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Path Metrics — Are Queries Healthy?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Push latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_request_duration_seconds_bucket{route="loki_api_v1_push"}[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;End-to-end time for a push request. This spans distributor validation, hashing, and ingester writes. Under 500ms is healthy. Above 1s means something in the write path is saturated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_request_duration_seconds_bucket{route=~"loki_api_v1_query_range|loki_api_v1_query"}[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;End-to-end time for read queries. This is what your Grafana users feel. The acceptable threshold depends on your use case — under 5s for interactive debugging is reasonable. If this degrades, check if queriers are memory-saturated, cache hit rates dropped, or someone is running expensive queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request rate by status code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (status_code, route) (rate(loki_request_duration_seconds_count[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Break down all requests by HTTP status code. &lt;code&gt;204&lt;/code&gt; for pushes is success. &lt;code&gt;200&lt;/code&gt; for queries is success. Watch for &lt;code&gt;429&lt;/code&gt; (rate limited), &lt;code&gt;500&lt;/code&gt; (internal errors), and &lt;code&gt;503&lt;/code&gt; (service unavailable — usually means queriers are overloaded).&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Metrics — Your Performance Multiplier
&lt;/h3&gt;

&lt;p&gt;Caching is what makes Loki viable at scale. If cache hit rates drop, query latency will spike and S3 costs will climb.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk cache hit rate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_cache_hits{cache="chunks"}[5m]))
  /
sum(rate(loki_cache_fetched_keys{cache="chunks"}[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What percentage of chunk fetches are served from memcached instead of S3. We target &amp;gt;95%. Below 90% means your memcached is undersized or your query patterns aren't cache-friendly (too many unique time ranges).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query result cache hit rate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_query_frontend_log_result_cache_hit_total[5m]))
  /
(
  sum(rate(loki_query_frontend_log_result_cache_hit_total[5m]))
  + sum(rate(loki_query_frontend_log_result_cache_miss_total[5m]))
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How often the query frontend serves results from cache instead of dispatching to queriers. High hit rates here mean your users are running the same (or overlapping) queries repeatedly — common when multiple people investigate the same incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memcached client health:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loki_memcache_client_servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Number of memcached servers each Loki component can see. If this drops below expected (e.g., from 3 to 2), a memcached pod is down and you're losing cache capacity. The consistent hashing will redistribute, but hit rate will temporarily drop as the remaining nodes re-warm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache queue depth:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(loki_cache_background_queue_length)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How many cache writes are queued. A growing queue means memcached can't keep up with write volume — either the memcached pods need more CPU/memory, or network latency between Loki and memcached is too high.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Gateway Metrics — Is Index Serving Healthy?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Index gateway request latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_index_gateway_request_duration_seconds_bucket[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How long it takes the index gateway to serve a lookup. Should be low (under 500ms). High latency here means the index is too large for memory and the gateway is reading from disk, or EFS is slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BoltDB shipper upload health:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_boltdb_shipper_tables_upload_operation_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rate of index table uploads from ingesters to shared storage. A zero rate means ingesters aren't shipping index tables — queries for recently ingested data may fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BoltDB shipper request latency (P99):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum(rate(loki_boltdb_shipper_request_duration_seconds_bucket[5m])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How long BoltDB shipper operations take. Our P99 sits at ~378ms. If this spikes, shared storage (S3/EFS) is likely the bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Panic Metric — The Last Resort
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(loki_panic_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loki processes that panicked (crashed). This should always be zero. Any non-zero value means a component is hitting an unhandled error — check logs immediately. This is your "something is deeply wrong" alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommended Alert Rules
&lt;/h3&gt;

&lt;p&gt;Based on what we've learned running this in production, here are the alerts worth wiring up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki-alerts&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiIngesterAppendFailures&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_distributor_ingester_append_failures_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Distributors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;write&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ingesters"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiDiscardedSamples&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_discarded_samples_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loki&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dropping&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;logs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;label"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiIngesterHighMemoryStreams&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(loki_ingester_memory_streams) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Active&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unusually&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;possible&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cardinality&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;explosion"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiPushLatencyHigh&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;histogram_quantile(0.99,&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(loki_request_duration_seconds_bucket{route="loki_api_v1_push"}[5m])) by (le)&lt;/span&gt;
          &lt;span class="s"&gt;) &amp;gt; 1&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;push&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;second"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiChunkCacheHitRateLow&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(loki_cache_hits{cache="chunks"}[15m]))&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(loki_cache_fetched_keys{cache="chunks"}[15m]))&lt;/span&gt;
          &lt;span class="s"&gt;&amp;lt; 0.90&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chunk&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;90%"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiWALCorruption&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_ingester_wal_corruptions_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingester&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WAL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;corruption&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;potential&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;loss"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LokiPanic&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(loki_panic_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loki&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;component&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;panicked"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Things We'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with TSDB index (schema v12+) instead of BoltDB Shipper.&lt;/strong&gt; We're on schema v11 because we started early. TSDB is significantly better for index performance and is the direction Grafana is investing in. New deployments should use it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set up recording rules earlier.&lt;/strong&gt; We added Prometheus recording rules for Loki metrics after the fact. Having rules like &lt;code&gt;loki:ingester_memory_streams:sum&lt;/code&gt; and &lt;code&gt;loki:distributor_bytes:rate5m&lt;/code&gt; pre-computed saves a lot of dashboard query overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consider upgrading to Loki 3.x.&lt;/strong&gt; We're on 2.9.10. Loki 3.x brings native OTLP ingest, improved bloom filters for faster queries, and the new v13 schema. The migration path exists but requires careful planning around schema migration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
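
&lt;p&gt;The recording rules mentioned in point 2 are ordinary Prometheus rules. This is a minimal sketch — the &lt;code&gt;expr&lt;/code&gt; bodies are our assumed mappings (e.g. &lt;code&gt;loki_distributor_bytes_received_total&lt;/code&gt; as the source for the bytes rate) and should be adapted to the metrics you actually dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: loki-recording-rules
    rules:
      # Total active streams across all ingesters
      - record: loki:ingester_memory_streams:sum
        expr: sum(loki_ingester_memory_streams)
      # Ingest throughput in bytes/sec, 5m window
      - record: loki:distributor_bytes:rate5m
        expr: sum(rate(loki_distributor_bytes_received_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dashboards then query the pre-computed series directly instead of re-aggregating raw counters on every panel refresh.&lt;/p&gt;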

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Loki in distributed mode isn't "install and forget." But it also isn't the operational nightmare that some make it out to be. The key principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Label cardinality is everything.&lt;/strong&gt; Keep your active stream count low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache aggressively.&lt;/strong&gt; Three-tier memcached turned 97.8% of our S3 reads into cache hits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size your ingesters for memory, not CPU.&lt;/strong&gt; They're memory-bound workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use index gateways&lt;/strong&gt; to keep queriers from hammering S3 for index lookups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set retention aggressively&lt;/strong&gt; if Loki isn't your archive. 72 hours keeps queries fast.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>logging</category>
      <category>devops</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
