You created an index, but the query still does a Seq Scan? The cost in EXPLAIN is some mysterious number and you have no idea how to interpret it? Reads with an index are slower than without one? Let's figure out how PostgreSQL actually works with indexes — using hands-on examples with 4 million rows.
This is Part 2 of a series based on my internal PostgreSQL talks. I spent a long time working with other technologies, and when I came back to PostgreSQL, I revisited my own notes to refresh the fundamentals. The fundamental mechanics of indexes haven't changed, but some useful tools have appeared over the past 5 years — more on that at the end.
Setup: 4 Million Rows
For our experiments, let's create a table with no indexes and no primary key — 2 million records each for 'hans' and 'paul':
DROP TABLE IF EXISTS t_test;
CREATE TABLE t_test (id serial, name text);
INSERT INTO t_test (name) SELECT 'hans' FROM generate_series(1, 2000000);
INSERT INTO t_test (name) SELECT 'paul' FROM generate_series(1, 2000000);
SELECT name, count(*) FROM t_test GROUP BY name;
No Indexes — Disaster
EXPLAIN ANALYZE SELECT * FROM t_test WHERE id = 432332;
Seq Scan on t_test (cost=0.00..71622.00 rows=1 width=9)
(actual time=15.304..126.473 rows=1 loops=1)
Filter: (id = 432332)
Rows Removed by Filter: 3999999
Planning Time: 0.561 ms
Execution Time: 126.488 ms
~126 milliseconds. PostgreSQL scanned all 4 million rows, discarded 3,999,999, and found one. Everyone knows indexes will solve this, but before building them — we need to understand the query plan.
EXPLAIN and "Parrots"
Let's look at cost=0.00..71622.00 again — what is this number?
Let's disable parallelism for a clean experiment:
SET max_parallel_workers_per_gather TO 0;
The cost is composed of the number of disk blocks and the cost of processing rows:
SELECT pg_relation_size('t_test') / 8192.0; -- ~21622 blocks of 8 KB
SHOW cpu_tuple_cost; -- 0.01 (cost of processing a row)
SHOW cpu_operator_cost; -- 0.0025 (cost of an operator/function)
The cost formula for our query:
SELECT (pg_relation_size('t_test') / 8192.0) * 1
+ count(id) * 0.01
+ count(id) * 0.0025
FROM t_test;
This gives us ~71622 — roughly what EXPLAIN shows.
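The same arithmetic can be checked outside the database. Here is a minimal sketch; the block count (21622) and row count (4,000,000) come from the example above, and the cost constants are the PostgreSQL defaults (`seq_page_cost`, `cpu_tuple_cost`, `cpu_operator_cost`):

```python
# PostgreSQL planner cost constants (default values)
SEQ_PAGE_COST = 1.0         # cost of reading one 8 KB block sequentially
CPU_TUPLE_COST = 0.01       # cost of processing one row
CPU_OPERATOR_COST = 0.0025  # cost of evaluating one operator (id = ...)

def seq_scan_cost(blocks, rows):
    # Seq Scan: read every block, process every row,
    # and apply the filter operator to every row
    return (blocks * SEQ_PAGE_COST
            + rows * CPU_TUPLE_COST
            + rows * CPU_OPERATOR_COST)

print(seq_scan_cost(21622, 4_000_000))  # 71622.0 — matches EXPLAIN
```

21622 + 40000 + 10000 = 71622 — exactly the upper cost figure from the plan.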
The cost in EXPLAIN has nothing to do with actual time. It's an abstract coefficient — "parrots," as we say in Russian. You can't convert it to milliseconds. Comparing costs between two different queries is also meaningless because it doesn't account for hardware. Within a single query, it's useful for understanding which part is more expensive. Between queries — only as a rough estimate.
Building an Index
CREATE INDEX idx_id ON t_test (id);
EXPLAIN SELECT * FROM t_test WHERE id = 43242;
Index Scan using idx_id on t_test (cost=0.43..8.45 rows=1 width=9)
Index Cond: (id = 43242)
The cost dropped from 71,622 to 8.45 — nearly 8,500x cheaper. Execution time goes from 126ms to sub-millisecond.
By default, PostgreSQL builds a BTree index. It provides high concurrency, but indexes aren't free — you pay with memory and slower INSERT operations.
Indexes and Sorting
BTree is used not just for lookups:
EXPLAIN SELECT * FROM t_test ORDER BY id DESC LIMIT 10;
Limit (cost=0.43..0.74 rows=10 width=9)
-> Index Scan Backward using idx_id on t_test (cost=0.43..125505.43 rows=4000000 width=9)
PostgreSQL doesn't sort — it simply walks the index in reverse (Backward) and stops after 10 rows.
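The idea is easy to model with a sorted list standing in for the index's leaf level (a toy sketch, not how the BTree is actually stored):

```python
# A BTree keeps keys ordered; a sorted list stands in for its leaf level.
index = list(range(1, 4_000_001))  # ids 1..4000000, already in order

def top_n_desc(idx, n):
    # Index Scan Backward: start at the largest key, walk in reverse,
    # stop after n rows — no sort of 4M rows ever happens
    result = []
    for key in reversed(idx):
        result.append(key)
        if len(result) == n:
            break
    return result

print(top_n_desc(index, 10))  # [4000000, 3999999, ..., 3999991]
```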
EXPLAIN SELECT min(id), max(id) FROM t_test;
Result (cost=0.92..0.93 rows=1 width=8)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.46 rows=1 width=4)
-> Index Only Scan using idx_id on t_test (cost=0.43..128791.43 rows=4000000 width=4)
Index Cond: (id IS NOT NULL)
InitPlan 2 (returns $1)
-> Limit (cost=0.43..0.46 rows=1 width=4)
-> Index Only Scan Backward using idx_id on t_test t_test_1 (cost=0.43..128791.43 rows=4000000 width=4)
Index Cond: (id IS NOT NULL)
For min — one row from the start of the tree. For max — one from the end (Backward). Total cost — less than one.
Bitmap Scans: Multiple Conditions
PostgreSQL can use the same index several times within a single query — for example, when conditions are combined with OR:
EXPLAIN SELECT * FROM t_test WHERE id = 30 OR id = 50;
Bitmap Heap Scan on t_test (cost=8.88..16.85 rows=2 width=9)
Recheck Cond: ((id = 30) OR (id = 50))
-> BitmapOr (cost=8.88..8.88 rows=2 width=0)
-> Bitmap Index Scan on idx_id (cost=0.00..4.44 rows=1 width=0)
Index Cond: (id = 30)
-> Bitmap Index Scan on idx_id (cost=0.00..4.44 rows=1 width=0)
Index Cond: (id = 50)
PostgreSQL scans the index twice (once per condition), merges the results via BitmapOr into a page bitmap, and only then goes to the table (Bitmap Heap Scan) for the actual rows.
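The merge step can be sketched with sets of heap page numbers (a simplification: a real bitmap also tracks per-tuple bits within a page, and the page numbers below are invented for illustration):

```python
# Each Bitmap Index Scan yields a bitmap of heap pages that may hold
# matching rows; BitmapOr unions the bitmaps before touching the heap.
pages_id_30 = {0}   # pages where rows with id = 30 might live (made up)
pages_id_50 = {0}   # id = 50 happens to land on the same page

def bitmap_or(*bitmaps):
    # Union of page bitmaps: each heap page is visited at most once,
    # even when several index conditions point at the same page
    result = set()
    for bm in bitmaps:
        result |= bm
    return result

heap_pages = bitmap_or(pages_id_30, pages_id_50)
print(sorted(heap_pages))  # [0] — one heap page visit instead of two
```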
Why PostgreSQL Ignores Your Index
Let's add an index on name:
CREATE INDEX idx_name ON t_test (name);
Search for a non-existent name:
EXPLAIN SELECT * FROM t_test WHERE name = 'hans2';
Index Scan using idx_name on t_test (cost=0.43..8.45 rows=1 width=9)
Index Cond: (name = 'hans2'::text)
The index works. Notice: rows=1, even though 'hans2' doesn't exist in the table. PostgreSQL never estimates rows=0 — even for non-existent data it expects at least one row, otherwise there would be no point in attempting a scan.
Now let's search for something that covers almost the entire table:
EXPLAIN SELECT * FROM t_test WHERE name = 'hans' OR name = 'paul';
Seq Scan on t_test (cost=0.00..81622.00 rows=3000003 width=9)
Filter: ((name = 'hans'::text) OR (name = 'paul'::text))
The index is not used. When the query covers a large portion of the table (roughly 70%+), PostgreSQL decides it's cheaper to read everything from disk directly rather than bounce between the index and the table.
This is a key insight: execution plans depend on the data and are not static. The same query with different parameters can execute in completely different ways. The planner makes decisions based on statistics — what it knows about the data distribution. If the statistics are stale, the decision may be suboptimal.
The term for this is selectivity: what fraction of the table a condition affects. High selectivity (few rows) — an index is beneficial. Low selectivity (many rows) — Seq Scan is cheaper.
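The trade-off can be modeled with the cost constants from earlier. This is a deliberately crude sketch: `random_page_cost = 4.0` is the real PostgreSQL default, but charging one random page read per matching row is an assumption for illustration, not the planner's actual formula:

```python
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0   # default cost of a random block read
CPU_TUPLE_COST = 0.01

def seq_scan_cost(blocks, total_rows):
    # Read the whole table once, sequentially
    return blocks * SEQ_PAGE_COST + total_rows * CPU_TUPLE_COST

def index_scan_cost(matching_rows):
    # Crude model: every matching row costs one random heap access.
    # The real planner is smarter (correlation, rows per page, etc.)
    return matching_rows * (RANDOM_PAGE_COST + CPU_TUPLE_COST)

blocks, total = 21622, 4_000_000
for selectivity in (0.0001, 0.0025, 0.5):
    matching = int(total * selectivity)
    cheaper = ("index" if index_scan_cost(matching) < seq_scan_cost(blocks, total)
               else "seq scan")
    print(f"selectivity {selectivity:>7}: {cheaper} wins")
```

Even in this toy model, fetching half the table through random index lookups costs far more than one sequential pass, which is exactly why the planner ignored `idx_name` above.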
Physical Data Layout
Suppose we want to read a range: the first 10,000 records. With an index it's fast — if the data is stored sequentially on disk.
But what if the data is scattered?
CREATE TABLE t_random AS SELECT * FROM t_test ORDER BY random();
CREATE INDEX idx_random ON t_random (id);
VACUUM ANALYZE t_random;
Let's compare:
-- Original table (data in order)
EXPLAIN (analyze true, buffers true) SELECT * FROM t_test WHERE id < 10000;
Index Scan using idx_id on t_test (cost=0.43..346.88 rows=10083 width=9)
(actual time=0.010..1.219 rows=9999 loops=1)
Index Cond: (id < 10000)
Buffers: shared hit=3 read=82
Planning Time: 0.108 ms
Execution Time: 1.668 ms
85 blocks (hit + read) — data is stored together, read sequentially. Now the randomized table:
-- Randomized table (data shuffled)
EXPLAIN (analyze true, buffers true) SELECT * FROM t_random WHERE id < 10000;
Bitmap Heap Scan on t_random (cost=196.41..18136.79 rows=10320 width=9)
(actual time=1.826..79.262 rows=9999 loops=1)
Recheck Cond: (id < 10000)
Heap Blocks: exact=7981
Buffers: shared hit=801 read=7210
-> Bitmap Index Scan on idx_random (cost=0.00..193.83 rows=10320 width=0)
Buffers: shared hit=3 read=27
Planning Time: 0.289 ms
Execution Time: 79.843 ms
8,011 blocks (shared hit + read) instead of 85 — nearly 100x more disk accesses. 79ms instead of 1.6ms. Same 10,000 rows, same index, but the data is scattered across disk. PostgreSQL even switched from Index Scan to Bitmap Heap Scan, because with low correlation that's cheaper.
Correlation
PostgreSQL knows about this through statistics:
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename IN ('t_test', 't_random') AND attname = 'id'
ORDER BY 1, 2;
tablename | attname | correlation
-----------+---------+--------------
t_random | id | -0.003704906
t_test | id | 1
Correlation measures how closely the physical order of data on disk matches the logical order:
- ~1 — data is ordered, blocks are read sequentially (t_test)
- ~0 — data is randomly scattered, each row means a separate disk access (t_random)
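The statistic itself is just a Pearson correlation between a row's physical position and the column's value, which is easy to reproduce on small lists:

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ordered = list(range(1, 10_001))   # t_test: ids in insert order
shuffled = ordered[:]
random.seed(42)
random.shuffle(shuffled)           # t_random: same ids, random layout

positions = list(range(len(ordered)))   # physical position on "disk"

print(round(pearson(positions, ordered), 3))   # 1.0 — perfectly ordered
print(round(pearson(positions, shuffled), 3))  # close to 0 — scattered
```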
CLUSTER
You can physically sort the data by an index:
CLUSTER t_random USING idx_random;
VACUUM ANALYZE t_random;
After this, range queries are fast again. But the downsides are significant:
- Locks the entire table for the duration — even SELECT waits
- Only works with one index
- Not maintained automatically — after new inserts, data becomes unordered again
Index Only Scan
If a query only accesses columns that are in the index, PostgreSQL can skip the table entirely:
EXPLAIN SELECT id FROM t_test WHERE id = 34234;
Index Only Scan using idx_id on t_test (cost=0.43..8.45 rows=1 width=4)
Index Cond: (id = 34234)
Everything needed (id) is already in the index — no table access required. But if we request all columns:
EXPLAIN SELECT * FROM t_test WHERE id = 34234;
Index Scan using idx_id on t_test (cost=0.43..8.45 rows=1 width=9)
Index Cond: (id = 34234)
Now it's a regular Index Scan — the name column isn't in the index, so PostgreSQL goes to the table.
To cover all needed columns — use a covering index with INCLUDE:
CREATE INDEX idx_random_cover ON t_random (id) INCLUDE (name);
EXPLAIN SELECT * FROM t_random WHERE id = 34234;
Index Only Scan using idx_random_cover on t_random (cost=0.43..4.45 rows=1 width=9)
Index Cond: (id = 34234)
Now it's Index Only Scan again — even for SELECT *, because name is included in the index via INCLUDE.
Functional Indexes
You can build an index on the result of a function. The only requirement — the function must be deterministic (same input → same output, no side effects):
CREATE INDEX idx_cos ON t_random (cos(id));
EXPLAIN SELECT * FROM t_random WHERE cos(id) = 10;
Index Scan using idx_cos on t_random (cost=0.43..8.45 rows=1 width=9)
Index Cond: (cos((id)::double precision) = '10'::double precision)
A typical use case: an index on lower(email) for case-insensitive search.
Partial Indexes
If 99% of your data is 'hans' and 'paul', and you never search for them:
CREATE INDEX idx_name ON t_test (name) WHERE name NOT IN ('hans', 'paul');
The index will be tiny and will only be updated when "atypical" values are inserted.
Index Types
PostgreSQL has 6 index types (SELECT * FROM pg_am). We've covered BTree — let's look at two more.
GiST and Fuzzy Search (pg_trgm)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
The pg_trgm extension splits strings into trigrams and computes the distance between them:
SELECT 'abcde' <-> 'abdeacb'; -- a number from 0 to 1
SELECT show_trgm('abcdef');
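The splitting is easy to sketch in a few lines. This is a simplified model of pg_trgm for a single ASCII word (ignoring its handling of multiple words and non-alphanumeric characters): each word is padded with two spaces in front and one behind, every 3-character window becomes a trigram, and similarity is shared trigrams over total distinct trigrams; the `<->` distance is 1 minus similarity.

```python
def trigrams(word):
    # pg_trgm pads a word with two leading spaces and one trailing space,
    # then takes every 3-character window (simplified: one ASCII word)
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    # Shared trigrams over the union of trigrams (Jaccard-style);
    # the <-> operator returns the distance: 1 - similarity
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("abcdef")))          # same sets as show_trgm('abcdef')
print(1 - similarity("abcde", "abdeacb"))  # like 'abcde' <-> 'abdeacb'
```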
Let's load 2,354 Austrian cities and try searching with a misspelling:
CREATE TABLE t_location (name text);
-- data import
SELECT * FROM t_location
ORDER BY name <-> 'Kramertneusiedel'
LIMIT 3;
The strings 'Gramatneusiedl' and 'Kramertneusiedel' sound similar, and K instead of G is a common mistake. Without an index — full table scan. With a GiST index:
CREATE INDEX idx_trgm ON t_location USING GiST(name gist_trgm_ops);
Now the search uses the index. A trigram index also works with LIKE and regular expressions:
EXPLAIN SELECT * FROM t_location WHERE name LIKE '%neusi%';
Bitmap Heap Scan on t_location (cost=4.31..18.42 rows=2 width=15)
Recheck Cond: (name ~~ '%neusi%'::text)
-> Bitmap Index Scan on idx_trgm (cost=0.00..4.31 rows=2 width=0)
Index Cond: (name ~~ '%neusi%'::text)
BTree is useless for LIKE '%...%' — it can't search for a substring in the middle. But GiST + pg_trgm handles it because it works with trigrams, not character ordering.
GIN and Full-Text Search
GIN indexes work with lexemes — canonical word forms:
SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars');
'car' and 'cars' → same lexeme. 'want' and 'wanted' → same lexeme. This lets you find documents by any word form:
SELECT to_tsvector('english', 'A car, I want a car...')
@@ to_tsquery('english', 'wanted'); -- true
SELECT to_tsvector('english', 'A car, I want a car...')
@@ to_tsquery('english', 'wanted & bmw'); -- false
Two approaches to setup:
1. Functional index:
CREATE INDEX idx_fts ON t_fts USING gin(to_tsvector('english', comment));
2. Separate column with a trigger (faster for search, more expensive on insert):
ALTER TABLE t_fts ADD COLUMN ts tsvector;
CREATE TRIGGER tsvectorupdate
BEFORE INSERT OR UPDATE ON t_fts
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(ts, 'pg_catalog.english', 'comment');
CREATE INDEX idx_fts ON t_fts USING gin(ts);
GiST and Exclusion Constraints
With the btree_gist extension, you can build constraints that prevent overlapping ranges — for example, for reservations:
CREATE EXTENSION btree_gist;
CREATE TABLE t_reservation (
room int,
"from" date,
"to" date,
EXCLUDE USING GiST (room WITH =, daterange("from", "to") WITH &&)
);
INSERT INTO t_reservation VALUES (10, '2017-01-01', '2017-03-03'); -- OK
INSERT INTO t_reservation VALUES (13, '2017-01-01', '2017-03-03'); -- OK (different room)
INSERT INTO t_reservation VALUES (13, '2017-02-02', '2017-08-14'); -- ERROR! (overlap)
PostgreSQL won't allow booking the same room for an overlapping period — the check is performed at the index level.
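The check behind the `&&` operator is the classic half-open interval test; `daterange(a, b)` is `[a, b)` by default. A sketch of the logic the constraint enforces (a linear scan here, whereas PostgreSQL does it inside the GiST index):

```python
from datetime import date

def overlaps(a_from, a_to, b_from, b_to):
    # Half-open ranges [from, to) overlap iff each starts
    # before the other one ends
    return a_from < b_to and b_from < a_to

booked = [(10, date(2017, 1, 1), date(2017, 3, 3)),
          (13, date(2017, 1, 1), date(2017, 3, 3))]

def can_book(room, d_from, d_to):
    # Reject iff the same room already has an overlapping reservation —
    # the EXCLUDE constraint performs this check at the index level
    return not any(r == room and overlaps(f, t, d_from, d_to)
                   for r, f, t in booked)

print(can_book(13, date(2017, 2, 2), date(2017, 8, 14)))  # False — overlap
print(can_book(10, date(2017, 3, 3), date(2017, 4, 1)))   # True — [) ranges only touch
```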
What Changed in PostgreSQL Over 5 Years
The fundamental mechanics of indexes haven't changed, but useful tools have appeared:
- BRIN indexes (Block Range Index) deserve a special mention. For data with high correlation (a topic we covered above), BRIN can be orders of magnitude more compact than BTree. A table with a billion rows and timestamps is the ideal BRIN use case.
- GIN for pg_trgm. This article showed GiST + pg_trgm, but starting with PG 13, GIN support for trigrams improved significantly. For large datasets, GIN + pg_trgm can be faster than GiST + pg_trgm.
- REINDEX CONCURRENTLY (PG 12) — index rebuilding without an exclusive table lock.
- pg_repack — an extension that does what CLUSTER does, but without locking the table. In production, CLUSTER is practically unusable, while pg_repack solves the same problem.
- Parallel index builds became the norm. Parallel Seq Scan and Index Scan also improved significantly — disabling parallelism is only useful for demos, not for production.
This is Part 2 of the series. In Part 1, we covered transactions, locks, and VACUUM.