DEV Community

James Miller

Mastering Database Indexing: A Practical Guide with 100 Million Rows

Many developers have faced this scenario: running locally, the dataset is small, pages load instantly, and SQL queries feel lightning-fast regardless of how they are written. But the moment you go to production and face millions of requests, queries time out, and the database CPU spikes to 100%.

To avoid the embarrassment of "works on my machine, dies in production," the best approach is to simulate massive data locally. Only when the data volume goes up do those hidden performance pitfalls reveal themselves.

Step 1: Generate 100 Million Rows

If a table only has a few thousand rows, a Full Table Scan is often just as fast as, or even faster than, an index lookup. To truly validate an indexing strategy, you need to stress test it.

Don't write a script in your application layer that loops through single-row inserts; the per-statement network round-trips will make you doubt your life choices. PostgreSQL's built-in generate_series is a godsend. The function below generates 100 million rows of simulated user operation logs in minutes.

-- The target table the function below inserts into
CREATE TABLE IF NOT EXISTS user_events (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  user_id INTEGER NOT NULL,
  event_type TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL
);

-- Create a function to quickly generate massive amounts of mock data
CREATE OR REPLACE FUNCTION populate_large_table(target_rows BIGINT)
RETURNS VOID AS $$
BEGIN
  -- Single set-based insert to avoid the overhead of row-by-row commits
  INSERT INTO user_events (user_id, event_type, created_at)
  SELECT
    (random() * 1000000)::INTEGER,    -- Simulate ~1 million distinct users
    CASE floor(random() * 4)::INTEGER -- Evenly distributed operation types
      WHEN 0 THEN 'login'
      WHEN 1 THEN 'logout'
      WHEN 2 THEN 'purchase'
      ELSE 'view'
    END,
    NOW() - (random() * INTERVAL '365 days') -- Spread over the past year
  FROM generate_series(1, target_rows);
END;
$$ LANGUAGE plpgsql;

-- Execute generation 
-- Note: This will take up significant disk space; execution time depends on machine performance
-- SELECT populate_large_table(100000000);

Once these 100 million rows hit the disk, run a simple SELECT * FROM user_events WHERE user_id = 12345. You will find that the execution time jumps from milliseconds to several seconds or even longer.

This is where the value of indexing becomes apparent.
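To make that concrete, here is a minimal sketch (the index name is my choice, and timings will vary with hardware): a plain B-tree index on user_id turns that multi-second scan into a near-instant lookup.

```sql
-- Building the index on 100M rows itself takes a while;
-- CONCURRENTLY avoids blocking writes during the build.
CREATE INDEX CONCURRENTLY idx_user_events_user_id
  ON user_events (user_id);

-- Re-run the earlier query: it should now use an Index Scan
SELECT * FROM user_events WHERE user_id = 12345;
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block.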

Step 2: More Indexes != Better Performance

The most common mistake beginners make is adding an index to every single field. You must understand that indexing is essentially trading space for time, and it comes with a cost.

Reads get faster, Writes get slower.
Every time you INSERT, UPDATE, or DELETE, the database not only has to modify the data file but also synchronously update the relevant index trees. If your table has 5 indexes, inserting one row means maintaining 5 trees. For write-heavy, read-light businesses (like logs or IoT sensor reporting), too many indexes are a disaster.

Strategy Suggestions:

  1. High-Frequency Query Columns: Add indexes (e.g., Foreign Keys, timestamps).
  2. High-Frequency Update Columns: Be cautious about adding indexes.
  3. Low Cardinality Columns: Do NOT add indexes. For example, "Gender" or "Status (0/1)". If the database scans the index and realizes it has to visit the table for 50% of the rows anyway, it will usually abandon the index and do a Full Table Scan. That index is wasted space.
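To audit whether existing indexes earn their keep, PostgreSQL's statistics views show how often each index is actually scanned (a sketch; these are standard catalog views, but verify the column names on your version):

```sql
-- Indexes that are rarely used but still cost disk space and write overhead
SELECT
  indexrelname,                                     -- index name
  idx_scan,                                         -- times the index was scanned
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
```

Indexes near the top of this list with idx_scan at or near zero are candidates for removal.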

Step 3: Don't Guess, Check the Execution Plan

Don't assume that just because you wrote WHERE user_id = ..., the database will definitely use the index. The optimizer is sometimes lazier (or smarter) than we think.

Always use EXPLAIN (supported by both PostgreSQL and MySQL) to see what the database is actually doing. Keep in mind that EXPLAIN ANALYZE actually executes the statement, so be careful pointing it at writes:

EXPLAIN ANALYZE SELECT * FROM user_events WHERE user_id = 42;
  • If you see Index Scan: Congratulations, the index is working.
  • If you see Seq Scan (Sequential Scan): a full table scan is happening. Check whether the query conditions actually match the index, whether the data distribution makes the index unattractive, and whether the planner's statistics are stale (run ANALYZE on the table to refresh them).
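When the planner insists on a Seq Scan and you want to know whether your index could help at all, PostgreSQL lets you discourage sequential scans for the current session. This is a debugging trick, not a production setting:

```sql
-- Make sequential scans look expensive to the planner (session-local)
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM user_events WHERE user_id = 42;
-- Always restore the default afterwards
RESET enable_seqscan;
```

If the plan switches to an Index Scan and gets faster, your statistics or cost settings are steering the planner wrong; if it gets slower, the planner was right all along.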

Common Indexing Patterns & Pitfalls

1. Composite Indexes: Order is Key

When query conditions involve multiple fields, such as checking "records for a specific user on a specific day," a single-column index is often not fast enough. You need a composite index:

CREATE INDEX idx_user_date ON user_events(user_id, created_at);

Pay attention to the order (Leftmost Prefix Rule):

  • This index works for WHERE user_id = ?.
  • It works for WHERE user_id = ? AND created_at = ?.
  • However, if the query is only WHERE created_at = ?, the planner generally cannot use this index efficiently; create a separate index on created_at instead. Always put the most frequently filtered column on the far left.
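A related trick for composite indexes (PostgreSQL 11+; the column choice here is illustrative): if the index already stores every column the query outputs, the planner can answer with an Index Only Scan and skip the table entirely. INCLUDE adds payload columns without widening the search key:

```sql
-- event_type is stored in the index leaf pages but is not a key column
CREATE INDEX idx_user_date_covering
  ON user_events (user_id, created_at) INCLUDE (event_type);

-- Eligible for an Index Only Scan (provided VACUUM has kept
-- the visibility map reasonably fresh)
SELECT created_at, event_type
FROM user_events
WHERE user_id = 42
  AND created_at >= NOW() - INTERVAL '7 days';
```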

2. Unique Indexes: Constraint + Speed

If business logic requires that an email or phone number cannot be duplicated, use a Unique Index directly. It not only speeds up queries but also provides a safety net at the database level, preventing dirty data from being written due to code logic loopholes.

CREATE UNIQUE INDEX idx_unique_email ON users(email);
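The unique index also unlocks idempotent writes: rather than catching a duplicate-key error in application code, you can resolve the conflict in SQL (standard PostgreSQL ON CONFLICT syntax; the name column is illustrative):

```sql
-- Insert, silently skipping duplicates
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice')
ON CONFLICT (email) DO NOTHING;

-- Or upsert: overwrite the existing row's name instead
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice')
ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;
```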

3. Specialized Index Types

  • Full-Text: Don't use LIKE '%keyword%' to search large blocks of text; a leading wildcard defeats a B-tree index and is incredibly slow. MySQL has FULLTEXT indexes, and PostgreSQL pairs tsvector with GIN or GiST.
  • GIN Index: PG-specific, designed for handling JSONB, arrays, and full-text search.
  • Bitmap: suitable for data warehousing scenarios with combined filters on low-cardinality fields like "Status" or "Tags". Note that PostgreSQL has no user-created bitmap index; the planner builds bitmaps on the fly (Bitmap Index Scan) by combining ordinary B-tree indexes.
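A quick GIN sketch, assuming a hypothetical api_events table with a JSONB payload column:

```sql
CREATE TABLE api_events (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  payload JSONB NOT NULL
);

-- GIN indexes the keys and values inside each JSONB document
CREATE INDEX idx_api_events_payload ON api_events USING GIN (payload);

-- The containment operator @> can use the GIN index
SELECT * FROM api_events WHERE payload @> '{"country": "DE"}';
```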

Simplifying the Infrastructure

For beginners (and even seniors), installing and managing multiple database versions for testing is tedious.

This is where ServBay shines.

ServBay allows you to install databases with one click. More importantly, it supports multi-instance concurrent execution. This means you can start MySQL 8.0 and MariaDB simultaneously, or run PostgreSQL 12 and 16 side-by-side.

This is incredibly convenient for data migration testing or performance comparison to see how the same complex SQL performs under different database versions.

  • One-Click Deployment: Covers mainstream databases like MySQL, PostgreSQL, MongoDB, Redis, and MariaDB. No need to hunt for installers or configure Brew.
  • Out-of-the-Box: Environment variables are configured automatically upon installation. You can type psql or mysql in the terminal to connect immediately.
  • Clean System: All services run independently. It manages your DB stack and your Python environment in a sandboxed manner, ensuring no garbage files are left in your system when you uninstall or stop services.

Conclusion

There is no "standard answer" for database optimization, only trade-offs.

Do you sacrifice write speed for read speed? Do you sacrifice disk space for query time? It depends on your specific business scenario, or even whether you can tolerate a few seconds of data delay.

There is no substitute for hands-on testing. Spin up 100 million rows locally and try it yourself.
