Igor Nosatov

Posted on Nov 25

The Library Heist: How PostgreSQL Indexes Are Like Planning the Perfect Crime (And Why You're Probably Using the Wrong One)

The Crime Scene

It was 2 AM on a Tuesday when my phone exploded with alerts. Our e-commerce platform was dying. Response times had ballooned from 200ms to 45 seconds. The CPU graphs looked like a heart monitor during a cardiac arrest.

I logged in, hands shaking, coffee forgotten. The culprit? A seemingly innocent query:

SELECT * FROM products 
WHERE description @@ to_tsquery('wireless bluetooth headphones')
ORDER BY created_at DESC 
LIMIT 20;

My "fix" from earlier that day: adding a B-Tree index on the description column.

The problem: I'd brought a lockpick to a safe-cracking job.

This disaster taught me something crucial: PostgreSQL indexes aren't just about "making queries faster." They're specialized tools, each designed for specific types of heists—I mean, queries. Using the wrong one is like trying to break into Fort Knox with a butter knife.

Let me tell you about the six types of indexes in PostgreSQL's arsenal, when to use each, and more importantly, when NOT to.

The Master Plan: Understanding Your Target

Before we choose our tools, let's understand what we're dealing with. Think of your database like a massive library:

PostgreSQL Database = Library
├── Tables = Bookshelves
├── Rows = Books
└── Indexes = Finding Systems

But here's where it gets interesting: not all finding systems are created equal.

B-Tree indexes are like the Dewey Decimal System—perfect for ordered searches.
Hash indexes are like a magical catalog that teleports you directly to one book.
GIN indexes are like a full-text search engine that knows every word in every book.
GiST indexes are like a map showing spatial relationships between books.
SP-GiST indexes are like a hierarchical filing system for weird organizational schemes.
BRIN indexes are like sticky notes saying "Books 1-1000 are roughly in this area."

Now, let's plan our heists.

Heist #1: The Vault (B-Tree Indexes)

The Target: Ordered data, range queries, equality searches

The Tool: B-Tree (Balanced Tree)

B-Tree is the Swiss Army knife of indexes. It's the default, and honestly, it solves 80% of your problems. But that's also why developers (including past-me) slap it on everything without thinking.

When B-Tree Shines

-- The classic: finding a user by ID
CREATE INDEX idx_users_id ON users(id);
SELECT * FROM users WHERE id = 12345;

-- Range queries on dates
CREATE INDEX idx_orders_created ON orders(created_at);
SELECT * FROM orders 
WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';

-- Sorting operations
CREATE INDEX idx_products_price ON products(price);
SELECT * FROM products ORDER BY price DESC LIMIT 10;

The B-Tree Structure (Simplified)

                    [50]
                   /    \
              [25]        [75]
             /   \       /   \
          [10] [40]   [60] [90]
          / \   / \   / \   / \
        [...leaves with actual data pointers...]

Why it works: B-Trees maintain balance and order. Search, insert, and delete operations are all O(log n). They're predictable, reliable, and fast for most operations.

When B-Tree Fails Spectacularly

Remember my production disaster? Here's what I learned:

-- DON'T DO THIS
CREATE INDEX idx_description_btree ON products(description);

-- Why? Because:
SELECT * FROM products WHERE description LIKE '%bluetooth%';
-- B-Tree can't optimize this! It needs to scan the entire index.

-- Also terrible:
CREATE INDEX idx_json_btree ON logs(data);
-- JSON data isn't ordered in a meaningful way for B-Tree

Real-world gotcha: B-Trees are fantastic for equality and range queries, but they're nearly useless for:

Pattern matching with wildcards at the start (%term)
Full-text search
JSON queries
Geometric data
Array contains operations

Performance benchmark from my disaster:

Before B-Tree on description: 200ms (seq scan)
After B-Tree on description: 45 seconds (index scan + sort nightmare)
After switching to GIN: 12ms (proper full-text search)

Heist #2: The Word Vault (GIN Indexes)

The Target: Full-text search, arrays, JSON documents

The Tool: GIN (Generalized Inverted Index)

GIN is what I should have used. Think of it as an inverted index—instead of "Document 5 contains words X, Y, Z," it stores "Word X appears in documents 3, 5, 7."

The GIN Solution to My Problem

-- Add a tsvector column for full-text search
ALTER TABLE products ADD COLUMN description_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', description)) STORED;

-- Create the GIN index
CREATE INDEX idx_description_gin ON products USING GIN(description_tsv);

-- Now this flies
SELECT * FROM products 
WHERE description_tsv @@ to_tsquery('english', 'wireless & bluetooth & headphones')
ORDER BY ts_rank(description_tsv, to_tsquery('english', 'wireless & bluetooth & headphones')) DESC
LIMIT 20;

GIN's Structure (Conceptual)

Word "bluetooth":     → [doc_1, doc_5, doc_7, doc_23, ...]
Word "wireless":      → [doc_1, doc_8, doc_12, doc_23, ...]
Word "headphones":    → [doc_1, doc_5, doc_23, doc_45, ...]

Query: bluetooth AND wireless AND headphones
Result: intersection → [doc_1, doc_23]

When to Use GIN

-- 1. Full-text search (my use case)
CREATE INDEX idx_articles_content ON articles 
USING GIN(to_tsvector('english', content));

-- 2. Array operations
CREATE INDEX idx_tags_gin ON posts USING GIN(tags);
SELECT * FROM posts WHERE tags @> ARRAY['postgresql', 'performance'];

-- 3. JSONB queries
CREATE INDEX idx_metadata_gin ON events USING GIN(metadata);
SELECT * FROM events WHERE metadata @> '{"user_type": "premium"}';

-- 4. Multiple columns
CREATE INDEX idx_multi_search ON products 
USING GIN(to_tsvector('english', name || ' ' || description));

The Dark Side of GIN

Warning: GIN indexes are LARGE. Like, really large.

-- Check your index sizes
SELECT 
    schemaname,
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

In my case:

Table size: 2 GB
B-Tree indexes: ~200 MB total
GIN index: 800 MB

Trade-off: GIN indexes also slow down inserts and updates because they need to update the inverted index structure. For high-write workloads, this matters.

Pro tip: Use GIN(description_tsv) WITH (fastupdate = on) to batch updates and reduce write overhead.

Heist #3: The Geometry Puzzle (GiST Indexes)

The Target: Geometric data, ranges, custom types

The Tool: GiST (Generalized Search Tree)

GiST is the elegant solution for "nearness" problems. Finding restaurants within 5km? Checking if date ranges overlap? GiST is your friend.

Real-World Example: Location-Based Search

-- Enable PostGIS extension
CREATE EXTENSION IF NOT EXISTS postgis;

-- Create a table of coffee shops
CREATE TABLE coffee_shops (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200),
    location GEOGRAPHY(POINT, 4326)
);

-- The magic: GiST index on geographic data
CREATE INDEX idx_location_gist ON coffee_shops USING GiST(location);

-- Find coffee shops within 1km of a point
SELECT 
    name,
    ST_Distance(location, ST_MakePoint(-122.4194, 37.7749)::geography) as distance_meters
FROM coffee_shops
WHERE ST_DWithin(
    location,
    ST_MakePoint(-122.4194, 37.7749)::geography,
    1000  -- 1km in meters
)
ORDER BY location <-> ST_MakePoint(-122.4194, 37.7749)::geography
LIMIT 5;

GiST Structure (Spatial Example)

                [Bounding Box: Entire City]
                      /          \
        [Downtown Area]          [Suburbs]
           /       \              /      \
    [Block A] [Block B]    [Region 1] [Region 2]
      /  \       /  \          /  \       /  \
   [shops....]              [shops....]

Non-Spatial GiST Uses

-- Range overlap queries
CREATE TABLE bookings (
    id SERIAL PRIMARY KEY,
    room_id INT,
    booking_period TSTZRANGE
);

CREATE INDEX idx_booking_period ON bookings USING GiST(booking_period);

-- Find overlapping bookings (double-booking detection)
SELECT * FROM bookings 
WHERE booking_period && '[2024-12-25 14:00, 2024-12-25 16:00)'::tstzrange;

-- Exclusion constraints (prevent overlaps)
ALTER TABLE bookings ADD CONSTRAINT no_overlap
EXCLUDE USING GiST (room_id WITH =, booking_period WITH &&);

When GiST Outperforms Everything

Use case: IP address range lookups

CREATE TABLE ip_locations (
    ip_range inet,
    country VARCHAR(2)
);

CREATE INDEX idx_ip_gist ON ip_locations USING GiST(ip_range inet_ops);

-- Lightning fast IP geolocation
SELECT country FROM ip_locations 
WHERE ip_range >>= '192.168.1.1'::inet;

Heist #4: The Hierarchical Vault (SP-GiST Indexes)

The Target: Hierarchical or partitioned data

The Tool: SP-GiST (Space-Partitioned GiST)

SP-GiST is the hipster of indexes—less commonly used but perfect for specific scenarios. It's designed for data with a natural hierarchy.

The Classic Use Case: Phone Numbers

CREATE TABLE phone_directory (
    id SERIAL PRIMARY KEY,
    phone_number VARCHAR(20)
);

-- SP-GiST for prefix searches
CREATE INDEX idx_phone_spgist ON phone_directory 
USING SPGIST(phone_number);

-- Fast prefix matching
SELECT * FROM phone_directory 
WHERE phone_number LIKE '415%';

SP-GiST Structure (Quad-tree example for 2D points)

                    [Origin]
                    /   |   \   \
            [NW]  [NE]  [SW]  [SE]
            /  \   |     |    /  \
         [...]  [...]  [...] [...]

When SP-GiST Shines

-- 1. IP network hierarchies
CREATE INDEX idx_network_spgist ON networks USING SPGIST(cidr_column);

-- 2. Quadtree spatial indexes (alternative to GiST)
CREATE INDEX idx_point_spgist ON locations USING SPGIST(point_column);

-- 3. Text prefix matching
CREATE INDEX idx_prefix_spgist ON words USING SPGIST(word text_ops);

Performance comparison (prefix search on 10M phone numbers):

No index: 2,300ms
B-Tree: 180ms
SP-GiST: 12ms

Why? SP-GiST naturally partitions the search space by prefixes, while B-Tree has to scan all matching prefixes.

Heist #5: The Timeline (BRIN Indexes)

The Target: Huge tables with naturally ordered data

The Tool: BRIN (Block Range INdex)

BRIN is the minimalist's dream. Instead of indexing every row, it stores summaries of data blocks. Tiny index size, surprisingly effective for the right use case.

The Perfect BRIN Scenario

-- Imagine a logs table with 1 billion rows
CREATE TABLE application_logs (
    id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    level VARCHAR(10),
    message TEXT
);

-- Traditional B-Tree index: ~40GB
-- BRIN index: ~40MB (1000x smaller!)
CREATE INDEX idx_timestamp_brin ON application_logs 
USING BRIN(timestamp) WITH (pages_per_range = 128);

-- Still fast for time-range queries
SELECT * FROM application_logs
WHERE timestamp BETWEEN '2024-12-01' AND '2024-12-02';

BRIN Structure (Conceptual)

Block Range 1 (rows 1-10000):    timestamp MIN: 2024-01-01, MAX: 2024-01-05
Block Range 2 (rows 10001-20000): timestamp MIN: 2024-01-05, MAX: 2024-01-10
Block Range 3 (rows 20001-30000): timestamp MIN: 2024-01-10, MAX: 2024-01-15
...

Query plan: Postgres skips entire block ranges that couldn't contain the target data.

When BRIN Fails

-- BAD: Random updates destroy correlation
UPDATE application_logs SET timestamp = NOW() WHERE id = 5;
-- Now block 1 might have rows from 2024 and 2025!

-- GOOD: Append-only tables
INSERT INTO application_logs (timestamp, level, message)
VALUES (NOW(), 'INFO', 'User logged in');

Real Numbers from Production

-- My time-series data table: 500M rows, 80GB
-- Comparing index approaches:

CREATE INDEX idx_ts_btree ON metrics(timestamp);
-- Index size: 15GB, Query time: 45ms

CREATE INDEX idx_ts_brin ON metrics USING BRIN(timestamp);
-- Index size: 15MB, Query time: 52ms

-- For a 1000x size reduction, 7ms slower? I'll take it.

Golden rule: If your data has natural physical ordering (timestamps, sequential IDs, geographic regions in order), BRIN is your secret weapon.

Heist #6: The Direct Teleport (Hash Indexes)

The Target: Exact equality matches only

The Tool: Hash Index

Hash indexes used to be the "don't use these" option in PostgreSQL (pre-v10 they weren't WAL-logged). Now they're viable, but their use case is narrow.

When Hash Beats B-Tree

-- UUID lookups
CREATE TABLE sessions (
    session_id UUID PRIMARY KEY,
    user_id INT,
    data JSONB
);

-- Hash index for exact UUID lookups
CREATE INDEX idx_session_hash ON sessions USING HASH(session_id);

-- Perfect for:
SELECT * FROM sessions WHERE session_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';

Hash vs B-Tree Benchmark

-- Test on 50M UUID rows
-- B-Tree index: 8.5GB, Query time: 0.8ms
-- Hash index: 6.2GB, Query time: 0.6ms

-- But...
SELECT * FROM sessions WHERE session_id > 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
-- B-Tree: 45ms
-- Hash: Can't do this at all!

The Truth About Hash Indexes

Use them when:

You ONLY do equality checks
You need slightly better space efficiency
Your data is high-cardinality (like UUIDs)

Don't use them when:

You might need range queries (even occasionally)
You're not sure (just use B-Tree)

Controversial take: In PostgreSQL 16, hash indexes are rarely worth the specificity. B-Trees are "good enough" for equality checks, and they give you flexibility. I've stopped using hash indexes entirely.

The Heist Debriefing: Decision Matrix

Here's how to choose your tool:

┌─────────────────────┬──────────────────────────────────────────┐
│ Query Type          │ Best Index                               │
├─────────────────────┼──────────────────────────────────────────┤
│ =, <, >, <=, >=    │ B-Tree                                   │
│ BETWEEN            │ B-Tree                                   │
│ LIKE 'abc%'        │ B-Tree                                   │
│ LIKE '%abc%'       │ GIN (with trigram extension)             │
│ Full-text search   │ GIN (with tsvector)                      │
│ @>, <@, &&, ?      │ GIN (arrays, JSONB)                      │
│ Geometric queries  │ GiST (PostGIS, ranges)                   │
│ && (overlap)       │ GiST (ranges)                            │
│ Prefix search      │ SP-GiST                                  │
│ IP address lookup  │ GiST or SP-GiST                          │
│ Time-series (huge) │ BRIN                                     │
│ = only (UUID)      │ Hash (or just use B-Tree)                │
└─────────────────────┴──────────────────────────────────────────┘

The Cost-Benefit Matrix

Index Type	Size	Insert Speed	Query Speed	Best For
B-Tree	Medium	Fast	Fast	General purpose
GIN	Large	Slow	Very Fast	Search, arrays, JSON
GiST	Medium	Medium	Fast	Geometric, ranges
SP-GiST	Small	Fast	Very Fast	Hierarchical data
BRIN	Tiny	Very Fast	Medium	Huge ordered tables
Hash	Medium	Fast	Fast	Equality only

The Master Thief's Toolkit: Advanced Techniques

1. Partial Indexes (The Precision Strike)

-- Only index active users
CREATE INDEX idx_active_users ON users(email) 
WHERE status = 'active';

-- Only index recent orders
CREATE INDEX idx_recent_orders ON orders(created_at)
WHERE created_at > NOW() - INTERVAL '90 days';

-- Combine with other index types
CREATE INDEX idx_recent_posts_gin ON posts 
USING GIN(to_tsvector('english', content))
WHERE published = true AND created_at > '2024-01-01';

Result: Smaller indexes, faster queries, less maintenance overhead.

2. Expression Indexes (The Transform)

-- Index on computed values
CREATE INDEX idx_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';

-- Index on JSON extraction
CREATE INDEX idx_user_type ON events((metadata->>'user_type'));
SELECT * FROM events WHERE metadata->>'user_type' = 'premium';

3. Multi-Column Indexes (The Combo)

-- Order matters!
CREATE INDEX idx_user_created ON logs(user_id, created_at);

-- Works great for:
SELECT * FROM logs WHERE user_id = 123 AND created_at > '2024-01-01';
SELECT * FROM logs WHERE user_id = 123;

-- Doesn't use the index:
SELECT * FROM logs WHERE created_at > '2024-01-01';  -- No user_id!

4. Index Monitoring (The Surveillance)

-- Find unused indexes
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan as scans,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE idx_scan = 0 
    AND indexrelname NOT LIKE '%pkey%'
ORDER BY pg_relation_size(indexrelid) DESC;

-- Find duplicate indexes
SELECT 
    indrelid::regclass as table_name,
    array_agg(indexrelid::regclass) as indexes
FROM pg_index
GROUP BY indrelid, indkey
HAVING count(*) > 1;

The Lessons from My 2 AM Disaster

B-Tree isn't always the answer. It's the default, not the optimal.
Index type matters more than having an index. Wrong index can be worse than no index.
Size matters. GIN indexes can be 40% of your table size. BRIN can be 0.1%.
Test with production-scale data. My B-Tree index worked great on 1,000 rows. At 10M rows, it collapsed.
Monitor your indexes. You'll be surprised how many are unused.
Partial indexes are underrated. If you're querying WHERE status = 'active' 99% of the time, don't index inactive rows.
Read the query planner. EXPLAIN ANALYZE is your best friend:

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM products 
WHERE description_tsv @@ to_tsquery('wireless & bluetooth');

Your Action Plan

Right now:

Run the unused index query on your database
Find your largest table and check if BRIN might work
Look for full-text searches using B-Tree (you know who you are)

This week:

Add monitoring for index usage
Test different index types on your slowest queries
Measure index sizes and correlate with query performance

This month:

Audit all your indexes
Add partial indexes where appropriate
Consider GIN for your JSON queries

The Perfect Crime

The best index is invisible. Users don't notice it. Developers forget it's there. It just works, quietly making queries fast, without consuming too much space or slowing down writes.

Choose wisely. Monitor religiously. Test thoroughly.

And please, for the love of Postgres, don't put a B-Tree index on a text column for full-text search. Learn from my 2 AM phone call.

What's your index horror story? Drop a comment below. Let's learn from each other's disasters—I've shared mine.