Faizan Firdousi

Posted on Apr 17

I Tested PostgreSQL on 5 Million Rows, Here’s What Actually Makes Queries Fast

#database #postgres #sql #distributedsystems

when you run a query on a large table in postgres, the performance difference is rarely about the query syntax, it’s about how postgres chooses to access the data underneath to understand this better, i tested on my local machine the same query on a table with 5 million rows under different execution strategies and the results were drastically different

What Really Happens When PostgreSQL Scans Your Entire Table

when you do the query

select * from table_name where id = 17

postgres can do a full sequential scan if there is no usable index or if it thinks scanning everything is cheaper. so what do you mean by full sequential scan ?
well you see the data of the table is stored in form of pages, which are 8kb by default in postgres, and these pages are stored in the heap. heap here just means the rows are not stored in any sorted order.

so when it runs the sequential scan it goes page by page through the whole heap and checks every row to find the one you asked for

How PostgreSQL Avoids Scanning Everything With Indexes

but this doesn't happens mostly because the primary key are always indexed, postgres automatically creates an index on primary key. well what is an index ? well as i said the data is stored in heap but for faster retrieval we make indexes, indexes are key value pair arranged in a B+ Tree, where keys are sorted and each key points to the exact location of the row in the heap using a tuple id

so the next time you run query on a indexed value it goes into the b+ tree, finds that id quickly using binary search like traversal, and then jumps directly to the heap location instead of scanning everything. this is called an index scan

Index Only Scan: Eliminating Heap Access Entirely

now here’s where things get interesting....even with an index scan, postgres still has to go to the heap to fetch the actual row data after finding the pointer in the index, but what if the query can be answered entirely from the index itself?

this is where index only scan comes in, in an index only scan, postgres does not jump to the heap at all, it reads everything directly from the index structure this removes one of the most expensive parts of the lookup which is random heap access
however, this is only possible when all required columns are present in the index (covering index) and postgres can safely skip heap visibility checks (via mvcc visibility map)
just a way to eliminate heap access when the conditions allow it, also depending on the workload, index only scan can be faster than index scan but not always, especially for single-row lookups

Let’s See How PostgreSQL Actually Executes These

now enough theory, lets actually see this in action with some real benchmarks

Benchmark 1: Forced Sequential Scan

i created a table with around 5 million rows

CREATE TABLE users (
id serial primary key,
name text,
email text,
age int,
city text
);

and ran the same query under different conditions to see how postgres behaves internally

first i forced postgres to not use any index so that it is forced to do a full scan

SET enable_indexscan = OFF;
SET enable_bitmapscan = OFF;

EXPLAIN ANALYZE
SELECT * FROM users WHERE id = 4000000;

this resulted in a parallel sequential scan where postgres scanned almost the entire table

execution time was around ~277 ms
you can also notice it used multiple workers because my machince can afford it, so even when doing a bad operation postgres tries to optimize it using parallelism, but still it had to go through millions of rows just to find one
now enabling index scan back

Benchmark 2: Index Scan

SET enable_indexscan = ON;
SET enable_bitmapscan = ON;

EXPLAIN ANALYZE
SELECT * FROM users WHERE id = 4000000;

this time postgres used an index scan on the primary key

execution time dropped to around ~0.05 ms
this is a massive difference because instead of scanning the whole table, it directly used the b+ tree to locate the exact row and then fetched it from the heap
now taking it one step further with index only scan

Benchmark 3: Index Only Scan

i created a covering index so that all required columns are already present inside the index

CREATE INDEX idx_users_covering 
ON users (id) INCLUDE (name, email, age, city);

then ran:


VACUUM ANALYZE users; -- enables index-only scan by marking pages as all-visible

EXPLAIN ANALYZE
SELECT id, name, email, age, city
FROM users
WHERE id = 4000000;

this resulted in an index only scan

execution time was around ~0.07 ms
and most importantly:

Heap Fetches: 0

which means postgres did not even touch the heap and served the query entirely from the index
but here is something interesting
even though index only scan avoids heap access, in this case it was slightly slower than index scan

why?
because we are fetching a single row using a primary key, and the heap access is extremely cheap (often already cached in memory), so removing heap access does not provide a meaningful gain here
index only scan really is best when you are doing something like reading multiple columns like this

Where Index Only Scan Actually Starts to Matter

run this to see the benefit of index only scan

EXPLAIN ANALYZE
SELECT id, name, email, age, city
FROM users
WHERE id BETWEEN 1000000 AND 2000000;

here postgres is using an index scan on the primary key for a range query (~1M rows), but it still has to fetch each row from the heap after locating it via the index

this results in higher execution time (~161 ms) because of repeated heap accesses, even though the lookup itself is efficient, this is exactly the scenario where an index only scan can outperform index scan by eliminating these heap fetches entirely

Understanding What Really Impacts Query Execution

in the end, what this really shows is that query performance in postgres is not about writing complex sql, but about understanding how the database actually accesses data. whether it scans the entire table, navigates through an index, or avoids the heap altogether, each approach is just a different trade-off based on cost. once you start thinking in terms of how much data is being touched and how it is being accessed, optimizing queries becomes far more intuitive and predictable.

DEV Community