<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elvis</title>
    <description>The latest articles on DEV Community by Elvis (@elvis_n).</description>
    <link>https://dev.to/elvis_n</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2668248%2F318d2c5c-6264-4d9a-aef2-c22377174dcb.webp</url>
      <title>DEV Community: Elvis</title>
      <link>https://dev.to/elvis_n</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elvis_n"/>
    <language>en</language>
    <item>
      <title>How Database Indexes Actually Work — From the Disk Up</title>
      <dc:creator>Elvis</dc:creator>
      <pubDate>Wed, 24 Jun 2026 13:54:42 +0000</pubDate>
      <link>https://dev.to/elvis_n/how-database-indexes-actually-work-from-the-disk-up-126k</link>
      <guid>https://dev.to/elvis_n/how-database-indexes-actually-work-from-the-disk-up-126k</guid>
      <description>&lt;p&gt;&lt;em&gt;Every index is a different answer to the same trade-off.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A while ago I gave a talk to my engineering team about database indexes. The question I actually wanted to answer was the one nobody asks out loud: &lt;em&gt;what is an index, physically, and why does adding one turn a 40-minute query into a 40-millisecond one?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I don't think you really understand a tool until you can rebuild it from the ground up. So that's what this is. We're going to start at the disk, invent the index ourselves, and only then give it a name. By the end, "which index should I use?" stops being a thing you memorize and becomes a thing you can reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, what does "slow" even mean?
&lt;/h2&gt;

&lt;p&gt;Before indexes, before B-trees, there's one question: how does a database read a row off a disk?&lt;/p&gt;

&lt;p&gt;It doesn't read rows. It reads &lt;strong&gt;blocks&lt;/strong&gt; (also called pages). A block is a fixed-size chunk of disk — say 512 bytes. If each row is about 128 bytes, then one block holds four rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;512 B per block ÷ 128 B per row = 4 rows per block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The disk can only hand you a whole block at a time, and fetching one costs a seek, on a spinning disk, on the order of 10 milliseconds. Hold onto that number; it's where all the pain comes from.&lt;/p&gt;

&lt;p&gt;Now run this query with no index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database has no idea where &lt;code&gt;id = 7&lt;/code&gt; lives. So it does the only thing it can: it reads &lt;strong&gt;every block&lt;/strong&gt; and checks every row. This is a &lt;em&gt;full scan&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With 100 rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 rows ÷ 4 rows/block = 25 blocks
25 blocks × 10 ms        = 250 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A quarter of a second to find one row. Annoying, but survivable. Now scale up to a million rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1,000,000 ÷ 4   = 250,000 blocks
250,000 × 10 ms ≈ 42 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forty-two minutes. To find &lt;em&gt;one row&lt;/em&gt;. Every query, reading every block, forever. This is not a tuning problem, it's unusable by design. The whole reason indexes exist is sitting right there in that number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inventing the index
&lt;/h2&gt;

&lt;p&gt;Here's the move. What if, instead of searching the big table, we kept a second, much smaller table on the side, one that just says &lt;em&gt;which block each id lives in&lt;/em&gt;?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INDEX                 USERS TABLE
id  → block           Block 2
 1  → Block 1         ┌────┬───────┬─────────┐
 2  → Block 1         │ id │ name  │ dept    │
 ...                  │  5 │ David │ Finance │
 5  → Block 2         │  6 │ Lin   │ Eng     │
 6  → Block 2         │  7 │ Yusuf │ HR      │  ← right here
 7  → Block 2         │  8 │ Eva   │ Eng     │
 ...                  └────┴───────┴─────────┘
100 → Block 25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each index entry is tiny, just an id and a block pointer, maybe 16 bytes. That means a single block holds far more index entries than table rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;512 B ÷ 16 B = 32 index entries per block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And because the index is &lt;em&gt;sorted by id&lt;/em&gt;, we don't have to scan it, we can binary-search it. Let's redo the million-row lookup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index size:      1,000,000 ÷ 32   = 31,250 blocks
Binary search:   log₂(31,250)     ≈ 15 block reads
Fetch the row:   + 1 read
Total:           16 reads × 10 ms = 160 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;42 minutes → 160 milliseconds.&lt;/strong&gt; Same disk, same query, same data. The only thing that changed is that we stopped scanning and started &lt;em&gt;navigating&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But notice the new problem hiding in that math: the index itself is 31,250 blocks. It's a big sorted thing that we now have to search efficiently &lt;em&gt;and&lt;/em&gt; keep sorted as rows are inserted and deleted. Binary search over a flat file is fine until you insert a row in the middle and have to shuffle everything down. So the real question becomes: &lt;strong&gt;how do we organize the index?&lt;/strong&gt; That choice, and it is always a choice, is the entire rest of the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  B-trees: the default answer since 1971
&lt;/h2&gt;

&lt;p&gt;Most of the time, when you type &lt;code&gt;CREATE INDEX&lt;/code&gt; and don't think about it, you get a &lt;strong&gt;B-tree&lt;/strong&gt;. It's been the default for over fifty years, and once you see what it's optimizing for, you understand why.&lt;/p&gt;

&lt;p&gt;Start from a binary tree, the structure you'd reach for instinctively. The problem: a binary tree of a million keys is about 20 levels deep, and on disk every level you descend is another ~10 ms seek. Twenty seeks is slow. The depth is the enemy.&lt;/p&gt;

&lt;p&gt;So you ask the obvious next question: &lt;em&gt;why only two children per node?&lt;/em&gt; If a node can have two children, why not hundreds? Make each node exactly one disk page (~8 KB), pack it with hundreds of keys, and suddenly the tree is &lt;strong&gt;wide and shallow&lt;/strong&gt; instead of narrow and deep. That's a B-tree: a balanced, sorted tree with a high &lt;em&gt;fanout&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 [ 30 | 60 | 90 ]
                /      |       |     \
         [10|20]   [40|50]  [70|80]  ...
         /  |  \    /  |  \
     1..9 .. ..  41..49 51..59 ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finding &lt;code&gt;id = 45&lt;/code&gt;: at the root, 45 is between 30 and 60, so take that branch; in the next node, land in the &lt;code&gt;41..49&lt;/code&gt; leaf; done. Three page reads.&lt;/p&gt;

&lt;p&gt;Here's the payoff that still feels like magic to me. With a fanout of ~256 keys per node, the depth grows as &lt;code&gt;log₂₅₆(n)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log₂₅₆(1,000,000,000) ≈ 3.74
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Four page reads will find any row in a billion-row table.&lt;/strong&gt; That's the whole reason B-trees won.&lt;/p&gt;

&lt;p&gt;And because the tree stays sorted, the &lt;em&gt;same structure&lt;/em&gt; answers a whole family of questions for free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Equality&lt;/strong&gt; — &lt;code&gt;WHERE email = '...'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Range&lt;/strong&gt; — &lt;code&gt;WHERE created_at BETWEEN x AND y&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorting&lt;/strong&gt; — &lt;code&gt;ORDER BY&lt;/code&gt; on the indexed column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix&lt;/strong&gt; — &lt;code&gt;WHERE name LIKE 'Joh%'&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it's &lt;em&gt;bad&lt;/em&gt; at is just as informative, because it's the mirror image of what it's good at. A B-tree struggles with low-cardinality columns (&lt;code&gt;WHERE active = true&lt;/code&gt; — a boolean index barely narrows anything), with suffix or substring search (&lt;code&gt;LIKE '%son'&lt;/code&gt; can't use the sorted order), and with full-text search inside long documents. Every one of those weaknesses is a direct consequence of "balanced, sorted tree." The shape that makes ranges fast is the same shape that makes &lt;code&gt;%son&lt;/code&gt; impossible.&lt;/p&gt;

&lt;p&gt;In practice you write one line and the database does the rest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_email&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You declare the index. The engine picks B-tree by default. Queries use it automatically. This is why Postgres and MySQL reach for it first.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the default doesn't fit
&lt;/h2&gt;

&lt;p&gt;Everything above is one answer to "how do we organize the index." The interesting part is what happens when your workload asks a &lt;em&gt;different&lt;/em&gt; question. Each of the indexes below exists because someone had a query a B-tree couldn't serve well, and they were willing to give something up to get it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hash indexes — equality, and nothing else
&lt;/h3&gt;

&lt;p&gt;If all you ever do is look up an exact key, a tree is overkill. Hash the key, jump straight to its bucket, read the row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alice@…  →  hash()  →  bucket 47  →  row #4,827,193
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's &lt;strong&gt;O(1)&lt;/strong&gt; — faster than any tree, no matter how big the table gets. The catch is total: hashing &lt;em&gt;destroys order&lt;/em&gt;. Similar keys scatter to unrelated buckets, so there are no ranges, no sorting, no prefix matches. You traded the entire &lt;code&gt;ORDER BY&lt;/code&gt; / &lt;code&gt;BETWEEN&lt;/code&gt; family for raw point-lookup speed. This is the engine behind Redis and Memcached:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;SET user:1234 &lt;span class="s2"&gt;"{...}"&lt;/span&gt;
GET user:1234   → value, &lt;span class="k"&gt;in &lt;/span&gt;microseconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exact key in, exact value out. No range query is the &lt;em&gt;point&lt;/em&gt;, not a limitation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inverted indexes — searching inside text
&lt;/h3&gt;

&lt;p&gt;Now flip the question entirely. A B-tree answers "find the row matching this key." Full-text search asks "find the &lt;em&gt;documents containing these words&lt;/em&gt;." Different question, different index.&lt;/p&gt;

&lt;p&gt;The trick is to invert the natural mapping. Instead of &lt;code&gt;document → words&lt;/code&gt;, you store &lt;code&gt;word → documents&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Forward                          Inverted
Doc 1 → "refund for shipping"    "refund"   → [Doc 1, Doc 2]
Doc 2 → "refund policy"          "shipping" → [Doc 1, Doc 3]
Doc 3 → "shipping was fast"      "policy"   → [Doc 2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Search a word, get its document list in one lookup — microseconds, regardless of how big the corpus is. That unlocks full-text search across millions of documents, boolean queries (&lt;code&gt;refund AND shipping NOT cancelled&lt;/code&gt;), relevance ranking, phrase search, and autocomplete.&lt;/p&gt;

&lt;p&gt;The cost is the inverse of B-tree's strength. Writes are slow, because one document touches many word lists. It's eventually consistent — Elasticsearch refreshes on roughly a one-second delay. And it gives up ACID guarantees, so it isn't a system of record. Which is why the most common pattern in modern backends isn't "replace your database with Elasticsearch," it's &lt;em&gt;both&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL (source of truth, B-trees) → pipeline → Elasticsearch (search, inverted index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each index does the job it's shaped for.&lt;/p&gt;

&lt;h3&gt;
  
  
  LSM-trees — when writes outpace reads
&lt;/h3&gt;

&lt;p&gt;B-trees do their bookkeeping at &lt;em&gt;write&lt;/em&gt; time: every insert may have to find the right leaf, maybe split a node, maybe rebalance. That's random I/O, and it's fine until you're ingesting a firehose of writes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Log-Structured Merge tree&lt;/strong&gt; refuses to play that game. It never overwrites in place. It appends new writes to an in-memory table (and a write-ahead log), and when that fills up it flushes to disk as a sorted file. Background threads later merge those files together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;writes → MEMTABLE (RAM) --flush--&amp;gt; SSTable 1, 2, 3 (disk) --compact--&amp;gt; merged SSTable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The write path is just an append — no tree navigation, no splits, sequential I/O. That makes writes &lt;strong&gt;10–100× faster&lt;/strong&gt; than a B-tree.&lt;/p&gt;

&lt;p&gt;But — and this is the line from the talk I most wanted people to remember — &lt;em&gt;the cost didn't disappear, it moved.&lt;/em&gt; Reads got harder: a key might be in the memtable or in any of the SSTables, and if it was updated, several versions exist and the newest wins. Worst case, a read probes every file. Two optimizations make that bearable: &lt;strong&gt;bloom filters&lt;/strong&gt; (which answer "definitely not here, or maybe here" cheaply) and small sparse in-memory indexes per file. The work shifted from synchronous write-time effort to asynchronous background compaction. This is the engine inside Cassandra, RocksDB, ScyllaDB, and most time-series databases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector indexes — searching by similarity
&lt;/h3&gt;

&lt;p&gt;The newest question is the strangest. Not "find the match" but "find the things &lt;em&gt;similar&lt;/em&gt; to this thing." An embedding model turns text, images, or audio into vectors, positioned so that similar meanings land near each other in high-dimensional space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"shipment missing"  ─┐
                      ├─ cluster together
"package never came" ─┘            "update my email"  (far away)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index finds nearest neighbors fast. The twist is that exact nearest-neighbor search in high dimensions is brutally expensive, so these indexes go &lt;strong&gt;approximate&lt;/strong&gt;: ANN (approximate nearest neighbor) algorithms like HNSW and IVF return results that are ~98% correct but ~1000× faster. You trade a sliver of accuracy for tractability — and that trade is what makes semantic search, recommendations, and RAG pipelines possible. (This is the part I run into daily; the retrieval layer in the systems I build lives or dies on this trade-off.) In practice: pgvector, Pinecone, and friends.&lt;/p&gt;

&lt;h2&gt;
  
  
  So which index should you use?
&lt;/h2&gt;

&lt;p&gt;Here's the whole landscape on one page:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;The trade-off&lt;/th&gt;
&lt;th&gt;In the wild&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;B-tree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Equality, range, sort&lt;/td&gt;
&lt;td&gt;Slower writes (tree upkeep)&lt;/td&gt;
&lt;td&gt;Postgres, MySQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Equality only — O(1)&lt;/td&gt;
&lt;td&gt;No range, sort, or prefix&lt;/td&gt;
&lt;td&gt;Redis, Memcached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inverted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full-text search&lt;/td&gt;
&lt;td&gt;Slow writes, eventually consistent&lt;/td&gt;
&lt;td&gt;Elasticsearch, Lucene&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LSM-tree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Write-heavy workloads&lt;/td&gt;
&lt;td&gt;Reads probe many files&lt;/td&gt;
&lt;td&gt;Cassandra, RocksDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Similarity search&lt;/td&gt;
&lt;td&gt;Approximate, not exact&lt;/td&gt;
&lt;td&gt;Pinecone, pgvector&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Read that table top to bottom and the pattern is unmistakable. Every row buys speed on one kind of question by giving up speed on another. There is no free lunch and there is no universal winner.&lt;/p&gt;

&lt;p&gt;That's the thing I wanted my team to walk away with, and it's the thing I'll leave you with too:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is no best index. There are only indexes that fit your workload — and indexes that fight it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next time someone asks "should we add an index here?", you don't need to remember a rule. Ask what question the query is really asking, ask what you're willing to give up, and the right structure picks itself. That's the difference between memorizing a tool and actually understanding it — and it's a difference that, once you start looking, shows up absolutely everywhere in software.&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>database</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
