Tags: #postgresql #database #performance #sql
Meta Description: Discover PostgreSQL indexes through the lens of a vast library system. Learn when to use B-tree, Hash, GIN, GiST, BRIN, and more with real-world examples and performance insights.
The Day My Query Took 47 Minutes (And What a Librarian Taught Me)
It was 2 AM on a Tuesday. Production was melting. Our analytics dashboard—the one the CEO checks every morning with his coffee—was timing out. The query that usually took 2 seconds was now approaching its 47th minute.
I did what any panicked developer does: I added EXPLAIN ANALYZE to the front of the query and watched in horror as PostgreSQL decided to scan all 12 million rows of our events table. No index. Just a good old-fashioned sequential scan, reading every single row like someone searching for a specific book by checking every single volume in the Library of Congress.
That's when it clicked: databases are libraries, and indexes are the card catalog system.
Let me show you what I mean.
The Library Metaphor: Understanding Index Philosophy
Imagine you walk into a massive library with 10 million books. You need to find "The Art of PostgreSQL" by Dimitri Fontaine. You have three options:
- Sequential Scan: Start at the entrance and check every single book until you find it (10 million books later...)
- Index Scan: Walk to the card catalog, look up the book by title, and go directly to shelf B-42
- No Library at All: Keep all 10 million books in your backpack (memory) and hope you don't need to walk anywhere
PostgreSQL gives you six types of card catalogs (indexes), each designed for different ways people search for books. Let's meet them all.
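You can see these for yourself: each built-in index type is registered as an access method in the pg_am catalog, so a quick query lists them all.
-- List the index access methods your server ships with
SELECT amname FROM pg_am WHERE amtype = 'i';
-- btree, hash, gist, gin, spgist, brin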
The Six Guardians of Query Performance
1. B-tree: The Classic Card Catalog
What it is: The default. The workhorse. The "if you're not sure, use this" index.
Real-world analogy: A dictionary. Want words starting with "Q"? Flip to the Q section. Want words between "quantum" and "quartz"? They're all grouped together, sorted, ready to go.
When to use:
- Equality comparisons (WHERE user_id = 42)
- Range queries (WHERE created_at > '2024-01-01')
- Sorting operations (ORDER BY price DESC)
- Pattern matching with left-anchored patterns (WHERE email LIKE 'john%')
Real example from my production disaster:
-- Before: 47 minutes ☠️
SELECT user_id, COUNT(*)
FROM events
WHERE event_type = 'page_view'
AND created_at >= '2024-11-01'
GROUP BY user_id;
-- The fix: B-tree composite index
CREATE INDEX idx_events_type_date
ON events(event_type, created_at);
-- After: 1.2 seconds 🎉
Why it worked: B-trees store data in sorted order. PostgreSQL could jump directly to 'page_view' entries, then scan only November dates within that subset.
The gotcha I learned the hard way:
-- ❌ A plain index on users(email) is NOT used (the function wraps the indexed column)
WHERE LOWER(email) = 'john@example.com'
-- ✅ Index used (expression index on the function's result)
CREATE INDEX idx_email_lower ON users(LOWER(email));
WHERE LOWER(email) = 'john@example.com'
Column order matters in composite indexes:
-- Index: (event_type, created_at)
-- ✅ Fast: Uses index fully
WHERE event_type = 'click' AND created_at > '2024-01-01'
-- ✅ Fast: Uses index partially (first column)
WHERE event_type = 'click'
-- ❌ Slow: can't use the index efficiently (skips the leading column)
WHERE created_at > '2024-01-01'
2. Hash: The Lightning-Fast Exact Matcher
What it is: A hash table. Exact-match lookups in near-constant time.
Real-world analogy: A locker room where your locker number is calculated from your name. You don't search—you compute hash("John") = Locker 42 and go straight there.
When to use:
- Equality comparisons ONLY
- Columns with high cardinality (many unique values)
- When you never need range queries or sorting
The catch: Hash indexes can't help with >, <, ORDER BY, or LIKE. They're specialists.
-- Perfect use case: UUID lookups
CREATE INDEX idx_users_uuid_hash
ON users USING hash(uuid);
SELECT * FROM users WHERE uuid = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
-- Benchmark from my tests:
-- B-tree: 0.042ms
-- Hash: 0.031ms
-- Gain: ~26% faster for exact matches
When I use it: Session tokens, API keys, UUIDs—anything where I only ever check "does this exact value exist?"
When I don't: Pretty much everything else. B-tree is almost always better unless you're doing millions of equality checks per second.
Hash index gotcha:
Before PostgreSQL 10, hash indexes weren't WAL-logged (no crash recovery!). Now they're safe, but many devs still avoid them out of habit.
3. GIN: The Full-Text Search Wizard
What it is: Generalized Inverted Index. It maps each value inside a column (a word, an array element, a JSON key/value pair) to every row that contains it.
Real-world analogy: A book's index page. Instead of organizing by page number, it organizes by every word, pointing to all pages where that word appears.
"PostgreSQL" → pages: 1, 15, 23, 47, 89
"index" → pages: 1, 2, 3, 15, 16, 23...
When to use:
- JSONB queries
- Array contains operations
- Full-text search
- Any scenario with "multi-valued" columns
The game-changer for my e-commerce project:
-- Product table with JSONB attributes
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
name TEXT,
attributes JSONB -- {"color": "blue", "size": "L", "tags": ["cotton", "summer"]}
);
-- Before GIN: Sequential scan on 500K products = 8 seconds
SELECT * FROM products
WHERE attributes @> '{"color": "blue"}';
-- The magic spell:
CREATE INDEX idx_products_attrs
ON products USING gin(attributes);
-- After: 0.023 seconds 🪄
GIN for arrays (my favorite trick):
CREATE TABLE posts (
id BIGSERIAL PRIMARY KEY,
tags TEXT[]
);
CREATE INDEX idx_posts_tags ON posts USING gin(tags);
-- Find posts with ANY of these tags
SELECT * FROM posts WHERE tags && ARRAY['postgresql', 'performance'];
-- Find posts with ALL these tags
SELECT * FROM posts WHERE tags @> ARRAY['postgresql', 'performance'];
-- Find posts that contain a specific tag
SELECT * FROM posts WHERE tags @> ARRAY['postgresql'];
Full-text search example:
CREATE TABLE articles (
id BIGSERIAL PRIMARY KEY,
title TEXT,
content TEXT,
search_vector tsvector GENERATED ALWAYS AS -- generated columns need PG 12+
(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''))) STORED
);
CREATE INDEX idx_articles_search ON articles USING gin(search_vector);
-- Natural language search
SELECT title, ts_rank(search_vector, query) AS rank
FROM articles,
plainto_tsquery('english', 'postgresql performance tuning') query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 10;
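On PostgreSQL 11 and newer, websearch_to_tsquery accepts Google-style input (quoted phrases, OR, a leading minus) without throwing syntax errors on messy user input, which is usually what you want behind a search box:
-- Friendlier query parsing for raw user input (PG 11+)
SELECT title
FROM articles
WHERE search_vector @@ websearch_to_tsquery('english', '"performance tuning" OR indexes');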
The trade-off:
- GIN indexes are LARGE (often 2-3x the size of the indexed data)
- Slower to INSERT/UPDATE (the inverted index must be maintained)
- But for reads? Lightning fast
- Use gin_pending_list_limit to balance write performance
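In practice those write-performance knobs are set as index storage parameters. A minimal sketch (the 2048 kB value is purely illustrative, not a recommendation):
-- fastupdate buffers new entries in a pending list; the limit (in kB)
-- controls how large that buffer grows before it's merged into the index
CREATE INDEX idx_products_attrs_tuned
ON products USING gin(attributes)
WITH (fastupdate = on, gin_pending_list_limit = 2048);
-- Or adjust an existing index:
ALTER INDEX idx_products_attrs_tuned SET (gin_pending_list_limit = 2048);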
4. GiST: The Geometric Genius
What it is: Generalized Search Tree. The Swiss Army knife for complex data types.
Real-world analogy: A map with zones. "Find all pizza places within 5 miles" doesn't require checking every restaurant—just the ones in nearby zones.
When to use:
- Geometric data (PostGIS)
- Range types (int4range, tstzrange)
- Full-text search (alternative to GIN)
- Nearest-neighbor searches
- Any custom data type with spatial/hierarchical properties
Real example: Building a delivery zone checker
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE delivery_zones (
id SERIAL PRIMARY KEY,
restaurant_id INT,
zone GEOMETRY(POLYGON, 4326)
);
CREATE INDEX idx_zones_geom
ON delivery_zones USING gist(zone);
-- Find which restaurants deliver to a location (San Francisco)
SELECT restaurant_id
FROM delivery_zones
WHERE ST_Contains(
zone,
ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)
);
-- Without GiST: checks every polygon = O(n)
-- With GiST: spatial pruning = O(log n)
The time-range trick I use for booking systems:
CREATE TABLE bookings (
id BIGSERIAL PRIMARY KEY,
resource_id INT,
time_range tstzrange -- '[2024-11-25 14:00, 2024-11-25 16:00)'
);
CREATE INDEX idx_bookings_range
ON bookings USING gist(time_range);
-- Find overlapping bookings (conflict detection)
SELECT * FROM bookings
WHERE time_range && tstzrange('2024-11-25 15:00', '2024-11-25 17:00');
-- This is how calendar apps stay fast with thousands of events
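The same GiST machinery can enforce "no overlaps" at the schema level via an exclusion constraint. A hedged sketch, assuming the btree_gist extension is available (it's needed so plain integer equality can participate in a GiST index):
CREATE EXTENSION IF NOT EXISTS btree_gist;
-- Reject any booking whose time_range overlaps an existing booking
-- for the same resource; this builds its own GiST index internally
ALTER TABLE bookings
ADD CONSTRAINT bookings_no_overlap
EXCLUDE USING gist (resource_id WITH =, time_range WITH &&);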
Nearest neighbor search (k-NN):
-- Find 5 closest coffee shops to your location
-- (assumes a GiST index on coffee_shops.location)
SELECT name, location <-> ST_MakePoint(-122.4194, 37.7749) AS distance
FROM coffee_shops
ORDER BY location <-> ST_MakePoint(-122.4194, 37.7749)
LIMIT 5;
-- The <-> operator in ORDER BY lets GiST drive an index-assisted k-NN search
GIN vs GiST for full-text search:
| Feature | GIN | GiST |
|---|---|---|
| Query speed | Faster (3x) | Slower |
| Index size | Larger (3x) | Smaller |
| Update speed | Slower | Faster |
| Use when | Read-heavy, fast searches | Write-heavy, space-constrained |
5. SP-GiST: The Hierarchical Organizer
What it is: Space-Partitioned GiST. For data with natural hierarchies or non-balanced structures.
Real-world analogy: A phone tree or quad-tree. Instead of balanced binary splits, it partitions space intelligently based on data distribution.
When to use:
- IP addresses (inet types)
- Phone numbers
- Geographic quad-trees
- Data with natural clustering
The IP address indexing revelation:
CREATE TABLE access_logs (
id BIGSERIAL PRIMARY KEY,
ip_address INET,
accessed_at TIMESTAMPTZ,
user_agent TEXT
);
-- SP-GiST is perfect for INET types
CREATE INDEX idx_logs_ip
ON access_logs USING spgist(ip_address);
-- Find all requests from a subnet
SELECT COUNT(*)
FROM access_logs
WHERE ip_address << '192.168.1.0/24';
-- Check if IP is in specific ranges
SELECT *
FROM access_logs
WHERE ip_address <<= '10.0.0.0/8' -- contained in or equal to
OR ip_address <<= '172.16.0.0/12'
OR ip_address <<= '192.168.0.0/16';
Performance comparison on 10M rows:
-- No index: 8.2 seconds (seq scan)
-- B-tree: 1.1 seconds (not ideal for ranges)
-- GiST: 0.3 seconds (good)
-- SP-GiST: 0.08 seconds (excellent!) 🎯
Text pattern matching (my secret weapon for autocomplete):
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
sku TEXT
);
CREATE INDEX idx_products_sku_spgist
ON products USING spgist(sku); -- the default SP-GiST opclass for text supports prefix search
-- Lightning-fast prefix matching (^@ means "starts with", PG 11+)
SELECT sku
FROM products
WHERE sku ^@ 'ELEC';
-- This beats LIKE 'ELEC%' on large datasets
When to choose SP-GiST:
- Your data has natural prefixes (IP addresses, phone numbers, hierarchical codes)
- You need efficient prefix or suffix searches
- Standard B-tree isn't cutting it for your specific data pattern
6. BRIN: The Time-Series Hero
What it is: Block Range Index. Instead of indexing every row, it indexes ranges of physical blocks.
Real-world analogy: Instead of cataloging every book individually, you label entire shelves: "Shelf A1 contains books from 1900-1950, Shelf A2 contains 1951-2000."
When to use:
- Very large tables (hundreds of gigabytes and up)
- Data that's naturally ordered (timestamps, auto-increment IDs)
- When you prioritize tiny index size over maximum speed
- Time-series data, log tables, append-only tables
The shocking size difference:
CREATE TABLE sensor_readings (
id BIGSERIAL PRIMARY KEY,
sensor_id INT,
reading NUMERIC,
recorded_at TIMESTAMPTZ DEFAULT NOW()
);
-- Insert 100 million rows (time-ordered)
-- Table size: 8.2 GB
-- B-tree index on recorded_at
CREATE INDEX idx_readings_time_btree ON sensor_readings(recorded_at);
-- Index size: 2.1 GB 😱
-- BRIN index on recorded_at
CREATE INDEX idx_readings_time_brin ON sensor_readings
USING brin(recorded_at) WITH (pages_per_range = 128);
-- Index size: 2.3 MB 🤯
-- That's 900x smaller!
Real-world performance:
-- Query: Get last hour of data
SELECT AVG(reading)
FROM sensor_readings
WHERE recorded_at > NOW() - INTERVAL '1 hour';
-- B-tree: 23ms
-- BRIN: 89ms
-- No index: 45 seconds
-- BRIN is ~4x slower than B-tree here but ~500x faster than no index
-- And it takes up about 0.1% of the space!
The correlation requirement:
-- Check if your data is suitable for BRIN
SELECT
correlation
FROM pg_stats
WHERE tablename = 'sensor_readings'
AND attname = 'recorded_at';
-- correlation close to 1.0 or -1.0 = PERFECT for BRIN
-- correlation close to 0.0 = DON'T use BRIN (use B-tree)
-- Example results:
-- recorded_at: 0.99 ✅ (naturally ordered by time)
-- sensor_id: 0.03 ❌ (randomly distributed)
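If correlation has decayed (say, heavy updates shuffled rows out of time order), you can physically re-sort the table once to restore it. Note that CLUSTER takes an exclusive lock for the duration, so treat this as a maintenance-window operation:
-- Rewrite the table in index order to restore physical correlation
CLUSTER sensor_readings USING idx_readings_time_btree;
ANALYZE sensor_readings; -- refresh planner statistics afterwards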
Tuning BRIN indexes:
-- Smaller pages_per_range = larger index, faster queries
CREATE INDEX idx_brin_fine ON logs
USING brin(created_at) WITH (pages_per_range = 32);
-- Larger pages_per_range = smaller index, slower queries
CREATE INDEX idx_brin_coarse ON logs
USING brin(created_at) WITH (pages_per_range = 512);
-- Default is 128, which is usually perfect
My BRIN use cases:
- Application logs (append-only, timestamp-ordered)
- IoT sensor data (millions of readings per day)
- Financial transaction tables (always increasing IDs)
- Archival data (historical records that never change)
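One BRIN maintenance detail worth knowing: block ranges for freshly inserted pages aren't summarized until vacuum gets to them, so a just-loaded tail of the table can be invisible to the index. You can summarize on demand, or opt in to autosummarization:
-- Summarize any not-yet-indexed block ranges right now
SELECT brin_summarize_new_values('idx_readings_time_brin');
-- Or let autovacuum handle it as ranges fill up (PG 10+)
ALTER INDEX idx_readings_time_brin SET (autosummarize = on);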
The Decision Tree: Which Index Should You Use?
Start here
↓
Do you have time-series or append-only data that's physically ordered?
├─ YES → Consider BRIN (especially if table > 100GB)
└─ NO → Continue
↓
Is it a geometry/geography column?
├─ YES → Use GiST (or PostGIS-specific indexes)
└─ NO → Continue
↓
Is it JSONB, array, or full-text search?
├─ YES → Use GIN (primary choice) or GiST (if write-heavy)
└─ NO → Continue
↓
Is it an IP address, phone number, or hierarchical text?
├─ YES → Consider SP-GiST
└─ NO → Continue
↓
Do you ONLY need equality checks (=) and never ranges or sorting?
├─ YES → Consider HASH (minor speedup over B-tree)
└─ NO → Use B-TREE (the reliable default)
Advanced Techniques: Index Optimization Secrets
1. Partial Indexes: Index Only What Matters
-- Don't index all users—only active ones
CREATE INDEX idx_users_active_email
ON users(email)
WHERE status = 'active' AND deleted_at IS NULL;
-- Most queries filter for active users
-- The index only contains active rows, so it's far smaller
-- Queries are just as fast
-- Win-win-win
My favorite partial index trick:
-- Only index recent orders (last ~90 days)
-- Index predicates must be immutable, so use a literal cutoff, not NOW()
CREATE INDEX idx_orders_recent
ON orders(created_at, customer_id)
WHERE created_at > '2024-09-01';
-- Old orders are rarely queried
-- Index stays small and fast
-- Rebuild monthly with a fresh cutoff date to keep the ~90-day window
2. Expression Indexes: When Functions Attack
-- ❌ This won't use an index
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
-- ✅ Create an expression index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- Now it's blazing fast
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
JSON path expression indexes:
CREATE INDEX idx_user_settings_theme
ON users((settings->>'theme'));
SELECT * FROM users WHERE settings->>'theme' = 'dark';
-- Fast lookup on nested JSON field
3. Covering Indexes: The Index-Only Scan
-- Query that needs user_id and email
SELECT user_id, email
FROM users
WHERE status = 'active';
-- Regular index: PostgreSQL must visit the table
CREATE INDEX idx_users_status ON users(status);
-- Covering index: Everything needed is IN the index
CREATE INDEX idx_users_status_covering
ON users(status) INCLUDE (user_id, email);
-- Result: Index-only scan (no table access!)
-- 3-5x faster on large tables
Check if you're getting index-only scans:
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, email
FROM users
WHERE status = 'active';
-- Look for:
-- Index Only Scan using idx_users_status_covering
-- Heap Fetches: 0 ← This is the magic number!
-- (If Heap Fetches stays high, VACUUM so the visibility map is up to date)
4. Multi-Column Index Strategy
Left-to-right rule:
CREATE INDEX idx_events_multi ON events(user_id, event_type, created_at);
-- ✅ Uses index fully
WHERE user_id = 123 AND event_type = 'click' AND created_at > '2024-01-01'
-- ✅ Uses index partially (user_id, event_type)
WHERE user_id = 123 AND event_type = 'click'
-- ✅ Uses index minimally (user_id only)
WHERE user_id = 123
-- ❌ Can't use index (skips first column)
WHERE event_type = 'click' AND created_at > '2024-01-01'
The cardinality trick: Put high-cardinality columns first (unless you have specific query patterns).
-- user_id has 1M unique values
-- status has 3 values ('active', 'pending', 'deleted')
-- created_at has 10M unique values
-- ✅ Best order for most queries
CREATE INDEX idx_best ON users(user_id, created_at, status);
-- ❌ Worst order (low cardinality first)
CREATE INDEX idx_worst ON users(status, user_id, created_at);
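A complementary rule of thumb, hedged because the right order always depends on your actual predicates: put columns you filter with equality before columns you filter with ranges, since a range condition stops the index from narrowing on anything that follows it.
-- For: WHERE user_id = ? AND status = ? AND created_at > ?
CREATE INDEX idx_eq_before_range ON users(user_id, status, created_at);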
Monitoring: Is Your Index Actually Being Used?
The Unused Index Detective
-- Find indexes that are never used
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%_pkey'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Drop the dead weight
-- I found 47GB of unused indexes in one audit!
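When you do drop dead weight on a busy system, DROP INDEX CONCURRENTLY avoids holding a lock that blocks writes during the drop (the index name below is just a placeholder):
DROP INDEX CONCURRENTLY IF EXISTS idx_events_never_used; -- hypothetical name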
Index Usage Check
-- How often is each index actually used, relative to sequential scans?
SELECT
i.indexrelname AS index_name,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
i.idx_scan,
i.idx_tup_read,
i.idx_tup_fetch,
ROUND(100.0 * i.idx_scan / NULLIF(i.idx_scan + t.seq_scan, 0), 2) AS index_usage_pct
FROM pg_stat_user_indexes i
JOIN pg_stat_user_tables t USING (relid)
WHERE i.schemaname = 'public'
ORDER BY pg_relation_size(i.indexrelid) DESC;
The EXPLAIN Decoder Ring
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM events WHERE user_id = 123;
-- What to look for:
-- "Index Scan" or "Index Only Scan" = Good ✅
-- "Seq Scan" = Bad (unless table is tiny) ❌
-- "Bitmap Index Scan" = Good for OR conditions ✅
-- "actual time" vs "planned time" = Accuracy check
-- "Buffers: shared read" = Disk I/O (lower is better)
The Performance Paradox: When NOT to Index
Yes, you read that right. Sometimes indexes hurt.
1. Small Tables (< 10,000 rows)
-- Table with 5,000 rows
-- Sequential scan: 2ms
-- Index scan: 3ms (overhead of index lookup)
-- Winner: No index (also saves space)
2. Low Cardinality Columns
-- Column with 2-3 unique values
CREATE TABLE orders (status VARCHAR(20)); -- 'pending', 'completed', 'cancelled'
-- ❌ Index is useless
-- PostgreSQL will seq scan anyway if > 5-10% of rows match
-- Save the disk space
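One hedged exception: if a low-cardinality column is heavily skewed and you mostly query the rare value, a partial index on just that slice can still pay off.
-- 'pending' is, say, 0.5% of rows but queried constantly
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';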
3. Write-Heavy Tables
-- Insert-intensive logging table
-- 10,000 inserts/second
-- Each insert updates 5 indexes
-- = 50,000 index writes/second
-- Database melts 🔥
-- Solution: Fewer indexes, or use BRIN
4. The Composite Index Trap
-- ❌ Anti-pattern: Too many similar indexes
CREATE INDEX idx1 ON my_table(a);
CREATE INDEX idx2 ON my_table(a, b);
CREATE INDEX idx3 ON my_table(a, b, c);
-- ✅ Better: One composite index
CREATE INDEX idx_all ON my_table(a, b, c);
-- Can serve all three query patterns
Real-World Battle Stories
Story 1: The JOIN That Ate My Weekend
Problem: Query joining 3 tables taking 45 seconds.
SELECT o.id, u.name, p.title
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE o.created_at > '2024-01-01'
AND u.country = 'US'
AND p.category = 'electronics';
The Fix:
-- Foreign keys had no indexes! 🤦
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_product_id ON orders(product_id);
-- Plus a composite for the WHERE clause
CREATE INDEX idx_orders_date_optimized
ON orders(created_at)
WHERE created_at > '2023-01-01'; -- Partial index!
-- Related table indexes
CREATE INDEX idx_users_country ON users(country) WHERE country IS NOT NULL;
CREATE INDEX idx_products_category ON products(category);
-- Result: 45 seconds → 0.3 seconds
Lesson: Always index foreign keys. Always.
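To audit this, here's a rough catalog query that flags foreign keys whose first column isn't the leading column of any index. It's a sketch that only checks leading columns, so treat its output as a starting point rather than a verdict:
SELECT c.conrelid::regclass AS table_name,
       c.conname AS fk_name
FROM pg_constraint c
WHERE c.contype = 'f'
AND NOT EXISTS (
    SELECT 1 FROM pg_index i
    WHERE i.indrelid = c.conrelid
      AND c.conkey[1] = i.indkey[0] -- conkey is 1-based; indkey (int2vector) is 0-based
);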
Story 2: The JSONB Query That Wouldn't Quit
Problem: Product search in e-commerce site timing out.
-- 2 million products, attributes stored as JSONB
SELECT * FROM products
WHERE attributes @> '{"brand": "Apple", "category": "laptops"}';
-- Query time: 12 seconds
The Fix:
-- GIN index with jsonb_path_ops
CREATE INDEX idx_products_attrs_pathops
ON products USING gin(attributes jsonb_path_ops);
-- Result: 12 seconds → 0.15 seconds
Bonus optimization:
-- For queries with many conditions, use expression index
CREATE INDEX idx_products_brand_category
ON products((attributes->>'brand'), (attributes->>'category'));
-- Now these are even faster:
SELECT * FROM products
WHERE attributes->>'brand' = 'Apple'
AND attributes->>'category' = 'laptops';
Story 3: The Time-Series Table That Grew Too Big
Problem: 500GB sensor data table, indexes taking 200GB, queries getting slower.
The Fix:
-- Drop B-tree, add BRIN
DROP INDEX idx_readings_timestamp;
CREATE INDEX idx_readings_timestamp_brin
ON sensor_readings USING brin(recorded_at);
-- Index size: 200GB → 180MB
-- Query time: 50ms → 200ms
-- Trade-off: 4x slower queries, 1000x smaller index
-- Verdict: WORTH IT (freed up 200GB!)
The Maintenance Checklist
Daily
-- Check for long-running queries
SELECT
pid,
now() - query_start AS duration,
query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;
Weekly
-- Analyze statistics (helps query planner)
ANALYZE;
-- Check for rarely used indexes
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan < 50 -- Rarely used
ORDER BY pg_relation_size(indexrelid) DESC;
Monthly
-- REINDEX if bloated (usually not needed; CONCURRENTLY requires PG 12+)
REINDEX INDEX CONCURRENTLY idx_name;
-- Vacuum to reclaim space
VACUUM ANALYZE;
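If the pg_stat_statements extension is loaded (it must be listed in shared_preload_libraries), it pairs nicely with this checklist for spotting which queries deserve an index in the first place. The column names below are the PG 13+ spellings; older versions use total_time/mean_time:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find the queries eating the most total time
SELECT query, calls,
       ROUND(mean_exec_time::numeric, 2) AS mean_ms,
       ROUND(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;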
Key Takeaways: Your Index Cheat Sheet
- B-tree (default): 95% of use cases. Equality, ranges, sorting.
- Hash: Equality only. Slightly faster than B-tree for simple lookups.
- GIN: JSONB, arrays, full-text. Large but fast for reads.
- GiST: Geometry, ranges, nearest-neighbor. Swiss Army knife.
- SP-GiST: IP addresses, hierarchical data, prefix matching.
- BRIN: Time-series, append-only. Tiny indexes for huge tables.
The Golden Rules:
- Index foreign keys (ALWAYS)
- Index columns in WHERE, JOIN, ORDER BY clauses
- Use composite indexes for multi-column queries
- Check correlation before using BRIN
- Monitor with pg_stat_user_indexes
- Don't over-index (write performance suffers)
- Partial indexes for filtered queries
- Expression indexes for functions
- Covering indexes for index-only scans
Go Forth and Index Wisely
That 2 AM crisis taught me something profound: indexes aren't just about making queries fast. They're about understanding your data's story—how it's organized, how it's accessed, how it grows over time.
Every table is a library. Every query is a reader searching for knowledge. Your job? Build the catalog system that helps them find it instantly.
Further Reading:
- PostgreSQL Documentation: Indexes
- The Art of PostgreSQL by Dimitri Fontaine
- Use The Index, Luke! - SQL indexing guide
Tools I Use:
- pgAdmin - GUI for PostgreSQL
- pg_stat_statements - Query performance tracking
- PEV2 - Visualize EXPLAIN plans