DEV Community

Hrishikesh Dalal
Hrishikesh Dalal

Posted on

EP 6.1: How to Search 1 Billion Rows in Milliseconds

The Problem: The "Full Table Scan"

Imagine a library with 1 million books. If you want to find "The Great Gatsby" and the books aren't organized, you have to look at every single book one by one until you find it. In database terms, this is a Full Table Scan ( time complexity).

As your database grows from thousands to millions of rows, your SELECT queries will slow down to a crawl, eventually timing out and crashing your application.

The Solution: The Index

An Index is a separate data structure that stores a sorted version of a column’s data alongside a pointer (Reference) to the actual row in the main table.

How it Works: The B-Tree

Most relational databases (like PostgreSQL and MySQL) use a B-Tree (Balanced Tree) structure.

  1. The Root: The search starts at the top node.
  2. The Decision: The system compares your value (e.g., "User ID 505") to the values in the node and decides whether to go left or right.
  3. The Leaf: You reach the bottom of the tree, which contains the exact location of the data on the disk.

This turns a search from into . Searching 1 million rows suddenly only takes about 20 "jumps."

The Textbook Analogy

Think of the Index like the Index at the back of a textbook.

  1. You look for the keyword "Load Balancing."
  2. The index tells you it's on page 42.
  3. You jump directly to page 42. You didn't have to read all 41 previous pages to find the information.

Types of Indexes You Should Know

1. Primary Index (Clustered)

This is usually your Primary Key. The table itself is physically sorted in this order. You can only have one clustered index per table.

2. Secondary Index (Non-Clustered)

These are additional indexes you create on columns like email or created_at. They live in a separate "folder" from your main data and point back to the Clustered Index.

3. Composite Index

If your query always looks like WHERE country='India' AND city='Mumbai', a single index covering both columns is much faster than the database trying to merge two separate indexes.

Pro Tip: Order matters in composite indexes. An index on (city, country) is not the same as (country, city).


The Trade-off: The "Index Tax"

Nothing in System Design is free. Indexes speed up Reads (SELECT), but they slow down Writes (INSERT/UPDATE/DELETE).

  • The Write Penalty: Every time you add a new row, the database cannot just "drop it in." It must find the correct spot in every B-Tree associated with that table and re-balance the tree.
  • Storage Cost: Indexes take up disk space. On massive tables, the index can sometimes be larger than the data itself.

Best Practices:

  • Index your Foreign Keys: This makes JOIN operations significantly faster.
  • Covering Queries: If your index contains all the data the SELECT needs (e.g., you index email and only select email), the database doesn't even have to look at the main table. This is called an Index-Only Scan.
  • Don't Index Low-Cardinality Columns: Indexing a "Gender" column (where values are only M/F/O) is often useless because the database still has to scan half the table anyway.
  • Monitor with EXPLAIN ANALYZE: Before adding an index, run your query with EXPLAIN to see if the database is actually performing a sequential scan or using an existing index.

Takeaway

Indexes are the most powerful tool for database performance, but use them like salt: too much will ruin the dish. Start by indexing columns used in WHERE clauses, JOIN conditions, and ORDER BY statements, then monitor your write latency.

Top comments (0)