sql query optimization is the single skill that separates the engineer who writes a query from the one who ships it: a 30-second SELECT that returns the right rows is still a production incident, and the discipline of reading an explain plan, picking the right index types (b-tree index, hash, partial, covering), recognising which of the three join algorithms (nested loop, hash join, merge join) the planner will choose, and then rewriting SARGable predicates is what turns 30 seconds into 300 milliseconds. The senior round is rarely "do you know JOIN" — it is "show me the plan, find the bottleneck node, and tell me the one change that will move the needle". This deep-dive guide walks the full senior playbook end to end, with worked traces, cost models, and the query optimization techniques every modern data engineer should run on every PR.
This is a deep-dive companion to short tuning round-ups: where a 5-tip cheat sheet covers "add an index, avoid SELECT *, prefer JOIN over correlated subquery", this guide widens the surface into five full teaching stages — explain plan anatomy (read the tree from leaves to root), index types compared (B-tree, hash, partial, covering — when each wins and when it backfires), join algorithms (nested loop, hash, merge — and exactly when the planner picks each), the six-step sql tuning playbook (capture → EXPLAIN → bottleneck → rewrite/index → ANALYZE → compare), and a one-screen decision cheat sheet that maps every common symptom (sequential scan on a 10M-row table, hash-spill to disk, nested loop on a 1M × 1M join) onto the exact rewrite or index that fixes it. Each section ends as an interview-shaped Q&A — a question, a SQL snippet, a traced EXPLAIN walkthrough, a sample output, and a concept-by-concept why this works breakdown — the exact shape senior query optimization techniques rounds reward.
When you want hands-on reps immediately after reading, browse the SQL practice library →, drill query optimization problems →, sharpen indexing patterns →, rehearse join problems →, reinforce aggregation reconciliation →, or widen coverage on the full database problem set →.
On this page
- Why SQL query optimization is the senior-round signal
- EXPLAIN plan anatomy — reading the tree from leaves to root
- Index types — B-tree, Hash, Partial, Covering (when each wins)
- Join algorithms — Nested Loop, Hash Join, Merge Join (and when planners pick each)
- The six-step tuning playbook (capture → EXPLAIN → bottleneck → rewrite/index → ANALYZE → compare)
- Choosing the right tuning move (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why SQL query optimization is the senior-round signal
sql query optimization — the discipline that separates seniors from juniors
The one-sentence invariant: sql query optimization is the discipline of turning a query's logical shape (what rows you want) into its physical shape (how the planner will fetch them), then iterating on the physical shape until it meets the SLA. Junior engineers write the query and ship it; senior engineers run EXPLAIN ANALYZE, identify the highest-cost node, and rewrite either the predicate, the join order, or the index until the plan flips from a 30-second sequential scan to a 300-millisecond index scan. The skill is not knowing more SQL — it is reading the plan the database produces and acting on the bottleneck node.
What interviewers actually score on query optimization techniques.
-
Plan literacy on
explain plan— can you read a 12-lineEXPLAIN ANALYZEand point at the leaf node that owns the cost? -
Index intuition on
index types— given aWHERE a = ? AND b BETWEEN ? AND ?predicate, can you name the composite index that wins and the one that loses? -
Join algorithm fluency — can you predict whether the planner picks
nested loop,hash join, or merge join for a 1k × 10M join, and why? -
SARGable rewrite reflex — given
WHERE DATE(created_at) = '2026-05-29', can you rewrite it toWHERE created_at >= '2026-05-29' AND created_at < '2026-05-30'without prompting? -
Statistics + cost model awareness — do you know that
ANALYZErefreshes the histograms the planner relies on, and that stale statistics are the single most common cause of a "regressed" query plan? -
sql tuningdiscipline — can you change one thing per cycle and re-runEXPLAIN ANALYZEto prove the win, instead of changing four things at once and shipping a worse plan?
The 5-stage map this guide walks through.
-
Stage 1 —
explain plananatomy — read the tree from leaves to root; the worst leaf is the bottleneck. -
Stage 2 —
index types— B-tree (default), hash (equality only), partial (filtered subset), covering /INCLUDE(index-only scan). -
Stage 3 —
join algorithms—nested loop(small outer + indexed inner),hash join(no useful index, both sides large), merge join (both sides pre-sorted). -
Stage 4 —
sql tuningplaybook — capture → EXPLAIN → bottleneck → rewrite or index → ANALYZE → compare; one change per cycle. - Stage 5 — cheat sheet — symptom → fix; reach for the row that matches the bottleneck node.
Why this is the senior-round signal and not a syntax round.
-
query optimization techniquesare empirical, not theoretical — the right answer depends on the plan the planner produces, which depends on statistics, indexes, and data distribution; you must look, not guess. - The biggest wins are leaf-level — the cheapest improvement is almost always replacing a sequential scan on the largest table with an index scan; the join algorithm above it inherits the cost.
- Stale statistics produce silent regressions — a model that ran in 100ms last week now takes 90 seconds; the cause is usually a histogram that no longer matches the data, not a code change.
-
SARGable rewrites are free wins —
WHERE col = ?uses an index;WHERE FUNC(col) = ?does not, even with an index defined oncol; this single rule fixes 30% of slow queries. -
One change per cycle is the discipline gate — junior engineers change four things at once and ship a worse plan; senior engineers change one thing, re-run
EXPLAIN ANALYZE, prove the win, then move on.
Worked example — read one EXPLAIN plan and identify the bottleneck
Detailed explanation. Real interviews probe whether you can read a small EXPLAIN ANALYZE cold. Below is a canonical 3-node plan; your job is to point at the bottleneck node and propose the one change that will move the needle.
Question. Given the EXPLAIN ANALYZE output below, which node owns the cost, why is it slow, and what is the single change you would make first?
Input. A fact_orders table (8M rows, no index on customer_id) joined to dim_customers (50k rows, primary key on customer_id); the query filters orders to a single segment.
Code.
EXPLAIN ANALYZE
SELECT c.segment, SUM(o.amount) AS revenue
FROM fact_orders o
JOIN dim_customers c ON c.customer_id = o.customer_id
WHERE c.segment = 'enterprise'
GROUP BY c.segment;
-- QUERY PLAN
-- HashAggregate (cost=185432.10..185432.11 rows=1 width=40)
-- (actual time=28412.51..28412.52 rows=1 loops=1)
-- Group Key: c.segment
-- -> Hash Join (cost=1812.00..184230.40 rows=240340 width=12)
-- (actual time=22.10..27890.40 rows=232117 loops=1)
-- Hash Cond: (o.customer_id = c.customer_id)
-- -> Seq Scan on fact_orders o (cost=0.00..164010.00 rows=8000000 width=12)
-- (actual time=0.01..18402.18 rows=8000000 loops=1)
-- -> Hash (cost=1187.00..1187.00 rows=50000 width=8)
-- -> Index Scan on dim_customers c
-- (cost=0.00..1187.00 rows=1500 width=8)
-- (actual time=0.02..3.18 rows=1500 loops=1)
-- Index Cond: (segment = 'enterprise')
Step-by-step explanation.
-
Read the leaves first. The two leaf nodes are
Seq Scan on fact_orders(8,000,000 rows, 18.4s) andIndex Scan on dim_customers(1,500 rows, 3ms). The dim is fine; the fact is the bottleneck. -
Confirm with cost numbers.
Seq Scancost is0.00..164010.00; theHash Joinabove adds only~20,000more cost. The leaf owns ~80% of the plan's total cost. -
Identify why it's slow. No index exists on
fact_orders.customer_id, so the planner reads every row, hashes the dim, and probes. The dim filter (segment = 'enterprise') is not pushed down to the fact because the join column iscustomer_id, notsegment. -
One-change fix. Add
CREATE INDEX idx_fact_orders_customer_id ON fact_orders (customer_id); the planner will then switch to aNested Loopdriven by the 1,500-row enterprise customer set, doing 1,500 indexed lookups against the fact instead of one full sequential scan. -
Verify. Re-run
EXPLAIN ANALYZE; expected new plan isNested LoopoverIndex Scan on fact_ordersdriven by the 1,500-row inner — actual time should drop from 28s to under 1s.
Output (the bottleneck-node identification).
| node | type | rows | actual_time_ms | share_of_total |
|---|---|---|---|---|
| Seq Scan on fact_orders | leaf | 8,000,000 | 18,402 | ~65% |
| Hash Join | parent | 232,117 | 9,488 | ~33% |
| HashAggregate | root | 1 | 0.01 | <1% |
Rule of thumb: the worst-performing leaf is almost always the bottleneck; fix it first, then re-run EXPLAIN ANALYZE and re-evaluate.
sql tuning — the four senior signals interviewers chase
Signal 1 — opinionated index choices, not "add an index everywhere". Senior engineers do not say "indexes are good"; they say "I add a composite (tenant_id, created_at DESC) index because 90% of our queries filter on tenant then sort on time, and a covering INCLUDE (status, amount) lets the planner answer the query without touching the heap."
Signal 2 — empirical, not theoretical. Junior engineers reason about plans from first principles; senior engineers run EXPLAIN ANALYZE first, then reason. The plan tells you the truth; intuition is a starting point, not an answer.
Signal 3 — one change per cycle, with proof. Senior engineers change exactly one thing per tuning cycle — one index, one rewrite, one ANALYZE — and re-run EXPLAIN ANALYZE to prove the win before moving on. Four changes at once ships a worse plan and no learning.
Signal 4 — statistics-awareness, not just index-awareness. When a plan regresses, junior engineers look for code changes; senior engineers run ANALYZE on the affected tables first because stale histograms are the single most common cause of a "the query that worked yesterday is now slow today" incident.
SQL
Topic — optimization
Query optimization drills
SQL
Topic — indexing
Indexing practice problems
Solution Using a 5-stage tuning coverage matrix
Code.
-- One canonical coverage matrix — every row maps a tuning stage to an artefact.
CREATE TABLE sql_tuning_coverage AS
SELECT * FROM (VALUES
(1, 'explain_plan', 'read_plan_from_leaves', 'EXPLAIN ANALYZE + bottleneck node', 'every slow query'),
(2, 'index_types', 'pick_btree_vs_hash_vs_partial','match index shape to predicate', 'every new query'),
(2, 'index_types', 'covering_index', 'INCLUDE columns -> index-only scan', 'hot path queries'),
(3, 'join_algorithm', 'nested_vs_hash_vs_merge', 'driven by row counts + sort order', 'every multi-table query'),
(4, 'sql_tuning', 'sargable_rewrite', 'WHERE col = ? not FUNC(col) = ?', 'every WHERE clause'),
(4, 'sql_tuning', 'one_change_per_cycle', 'change one thing, re-EXPLAIN, prove', 'every PR'),
(4, 'sql_tuning', 'analyze_statistics', 'ANALYZE refreshes histograms', 'after bulk loads'),
(5, 'cheat_sheet', 'symptom_to_fix', 'seq scan -> index; spill -> work_mem','interview + on-call')
) AS t(stage_id, stage_name, technique, prescription, cadence);
Step-by-step trace.
| stage_id | stage_name | technique | prescription | cadence |
|---|---|---|---|---|
| 1 | explain_plan | read_plan_from_leaves | EXPLAIN ANALYZE + bottleneck node | every slow query |
| 2 | index_types | pick_btree_vs_hash_vs_partial | match index shape to predicate | every new query |
| 2 | index_types | covering_index | INCLUDE columns -> index-only scan | hot path queries |
| 3 | join_algorithm | nested_vs_hash_vs_merge | driven by row counts + sort order | every multi-table query |
| 4 | sql_tuning | sargable_rewrite | WHERE col = ? not FUNC(col) = ? | every WHERE clause |
| 4 | sql_tuning | one_change_per_cycle | change one thing, re-EXPLAIN, prove | every PR |
| 4 | sql_tuning | analyze_statistics | ANALYZE refreshes histograms | after bulk loads |
| 5 | cheat_sheet | symptom_to_fix | seq scan -> index; spill -> work_mem | interview + on-call |
- Row 1 —
explain_planis always the first move; never guess, always look at the plan. - Rows 2-3 —
index_typescovers both shape (B-tree vs hash vs partial) and the covering trick that eliminates heap fetches. - Row 4 —
join_algorithmis what the planner chooses; the prescription is to understand the inputs (row counts, sort order) so you can predict the choice. - Rows 5-7 —
sql_tuningis the discipline layer: SARGable rewrites, one change per cycle, and refreshed statistics. - Row 8 — the cheat sheet is the one-screen lookup for production incidents and interviews; given a symptom, name the fix.
Output.
| stage_id | stage_name | technique | cadence |
|---|---|---|---|
| 1 | explain_plan | read_plan_from_leaves | every slow query |
| 2 | index_types | pick_btree_vs_hash_vs_partial | every new query |
| 3 | join_algorithm | nested_vs_hash_vs_merge | every multi-table query |
| 4 | sql_tuning | sargable_rewrite + one-change | every PR |
| 5 | cheat_sheet | symptom_to_fix | interview + on-call |
Why this works — concept by concept:
- Stage coverage matrix — turns the 5-stage map into an auditable artefact; every tuning technique is owned by exactly one stage, so coverage gaps surface at a glance.
- Cadence binding — pairs each technique with its trigger (every slow query, every PR, after bulk loads); senior engineers assign cadence per technique, not "tune everything always".
- One change per cycle — codified as a row, not a culture norm; the discipline is visible in the matrix, not buried in tribal knowledge.
-
Empirical bias —
explain_planis row 1; nothing happens without looking at the plan first. This is the single biggest mindset shift from junior to senior. -
Cost —
O(1)to read the coverage matrix; the actual tuning isO(query)per cycle but each cycle is bounded by one change, so iterations stay fast.
2. EXPLAIN plan anatomy — reading the tree from leaves to root
explain plan — the tree, the cost numbers, and what they mean
explain plan is the database's answer to the question "how are you going to run this query?". The output is a tree: leaves are scans (sequential, index, index-only), interior nodes are joins (nested loop, hash, merge) and aggregations (sort-aggregate, hash-aggregate), and the root is whatever produces the final row set (often a Sort or Limit). The cost numbers — cost=startup..total — are the planner's estimate of arbitrary work units, not real wall-clock seconds; EXPLAIN ANALYZE adds the actual wall-clock measurements (actual time=startup..total), the actual rows produced, and the loop count.
The four invariants of every plan.
- Read leaves to root, not top to bottom. The execution starts at the leaves (scans), then climbs to interior nodes (joins, aggregations), then to the root. The output you see is printed top-down but executed bottom-up.
- The biggest leaf cost almost always wins. A sequential scan on a 10M-row table dwarfs a nested loop above it; fix the leaf and the parent's cost shrinks proportionally.
-
costis an estimate;actual timeis the truth. Always useEXPLAIN ANALYZEin tuning sessions — the estimate vs reality delta tells you if statistics are stale (estimate way off → runANALYZE). -
Loop count matters on Nested Loop.
actual time=0.1..0.2 rows=5 loops=12000means the inner side ran 12,000 times; multiplyactual timebyloopsfor the real cost — that's where nested loops on large outers explode.
Scan node families — what each one means.
-
Seq Scan— read every row of the table; cheap on small tables (under ~10k rows) but linear in table size on big tables. The default when no useful index exists or the predicate matches > ~20% of rows (planner threshold varies by engine). -
Index Scan— walk the B-tree to find matching keys, then fetch each row from the heap; cheap when selectivity is high (few rows match the predicate). -
Index Only Scan— walk the B-tree and answer the query from the index alone, skipping the heap fetch; requires a covering index where every selected column is either in the key or inINCLUDE. -
Bitmap Heap Scan— combine multiple indexes via bitmap OR/AND, then fetch rows; useful when several smaller indexes together beat one larger composite.
Join node families — what each one means.
-
Nested Loop— for each row of the outer, look up matching rows in the inner; cheap when the outer is tiny and the inner has a useful index. -
Hash Join— build a hash table on the smaller side, probe it with rows from the larger side; cheap when both sides are large and at least one fits inwork_mem. -
Merge Join— both sides arrive sorted on the join key, then walk them in lockstep; cheap when sort orders already exist (PK scan, index range scan).
Aggregate node families — what each one means.
-
HashAggregate— build a hash on the group keys, accumulate aggregates per bucket; needs to fit inwork_memper group. -
GroupAggregate— input is pre-sorted by group keys; streams through it accumulating one group at a time; constant memory. -
Aggregate(plain) — single-group aggregate (SELECT COUNT(*) FROM t); no grouping, single-pass accumulator.
Worked example — read every node in a 5-node plan
Detailed explanation. Real interviews show you a 5-7 node plan and ask you to narrate it node-by-node. Below is the canonical shape; learn to walk it and you can walk any plan.
Question. Given the EXPLAIN ANALYZE output below for a top-revenue-by-region query, narrate the plan from leaves to root, identify the bottleneck node, and propose the one change that will move the needle.
Input. A fact_orders table (12M rows, B-tree on order_date), dim_customers (200k rows, PK on customer_id), filter on order_date >= '2026-04-01' and GROUP BY c.region with a LIMIT 5.
Code.
EXPLAIN ANALYZE
SELECT c.region, SUM(o.amount) AS rev
FROM fact_orders o
JOIN dim_customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2026-04-01'
GROUP BY c.region
ORDER BY rev DESC
LIMIT 5;
-- QUERY PLAN
-- Limit (cost=24812.10..24812.12 rows=5 width=18)
-- (actual time=6890.30..6890.31 rows=5 loops=1)
-- -> Sort (cost=24812.10..24812.85 rows=300 width=18)
-- (actual time=6890.29..6890.30 rows=5 loops=1)
-- Sort Key: (sum(o.amount)) DESC
-- -> HashAggregate (cost=24800.10..24803.85 rows=300 width=18)
-- (actual time=6889.18..6889.40 rows=300 loops=1)
-- Group Key: c.region
-- -> Hash Join (cost=4012.00..24190.40 rows=121940 width=14)
-- (actual time=85.10..6580.40 rows=123210 loops=1)
-- Hash Cond: (o.customer_id = c.customer_id)
-- -> Index Scan using idx_orders_date on fact_orders o
-- (cost=0.42..19811.20 rows=121940 width=14)
-- (actual time=0.07..6210.15 rows=123210 loops=1)
-- Index Cond: (order_date >= '2026-04-01')
-- -> Hash (cost=2812.00..2812.00 rows=200000 width=8)
-- -> Seq Scan on dim_customers c
-- (cost=0.00..2812.00 rows=200000 width=8)
-- (actual time=0.01..78.10 rows=200000 loops=1)
Step-by-step explanation.
-
Leaf 1 —
Index Scan using idx_orders_date on fact_orders— the planner used the date index to fetch 123,210 matching rows out of 12M; 6.2s actual time. This is the dominant cost. -
Leaf 2 —
Seq Scan on dim_customers— full scan of 200k rows in 78ms; cheap because the table is small and the entire row set is needed to build the hash side. -
Hashbuild node — builds a hash ondim_customers.customer_id; ~3ms additional cost on top of the seq scan. -
Hash Join— probes the hash with each row fromfact_orders; total actual time is ~6.6s, of which ~6.2s came from the leaf scan; the join itself only adds ~370ms. -
HashAggregate— groups byc.regioninto ~300 buckets; cheap (~1ms) because the hash fits in memory. -
Sort+Limit— sort 300 region totals descending, return top 5; near-zero cost. -
Bottleneck. The
Index Scan on fact_ordersowns 90% of the runtime; the index is being used (good) but it returns 123k rows that all need a heap fetch to readamountandcustomer_id. The fix: turn the date index into a covering index withINCLUDE (customer_id, amount), which lets the planner useIndex Only Scanand skip the 123k heap lookups entirely.
Output (the node-by-node breakdown).
| node | type | rows | actual_time_ms | bottleneck |
|---|---|---|---|---|
| Index Scan idx_orders_date | leaf | 123,210 | 6,210 | YES |
| Seq Scan dim_customers | leaf | 200,000 | 78 | no |
| Hash Join | parent | 123,210 | 6,580 | inherited |
| HashAggregate | parent | 300 | 6,889 | inherited |
| Sort + Limit | root | 5 | 6,890 | inherited |
Rule of thumb: a parent node's actual time is the sum of its children's actual time plus its own work; if the parent is slow but the children are slow too, fix the children first.
explain plan cost model — what the numbers actually mean
The cost numbers in EXPLAIN look like wall-clock seconds but they are not. They are arbitrary planner work units, calibrated such that seq scanning one disk page costs seq_page_cost = 1.0 (the default). Other operations are scaled relative to that:
-
seq_page_cost = 1.0— one sequential page read. -
random_page_cost = 4.0— one random page read (default; lower on SSD, often tuned to1.1). -
cpu_tuple_cost = 0.01— process one row through a node. -
cpu_operator_cost = 0.0025— evaluate one operator (oneWHEREclause comparison). -
cpu_index_tuple_cost = 0.005— process one row through an index scan.
The planner sums these to produce the estimate. startup_cost is the cost before the first row can be returned (e.g., the entire hash side must be built before a hash join can produce any row); total_cost is the cost to produce all rows. A LIMIT 5 on top of a Sort uses startup_cost heavily — the sort must finish before the limit can take five rows.
The estimate-vs-actual delta is your statistics canary. If rows=121940 (estimate) and actual rows=12,194,000 (reality), your statistics are wildly stale — run ANALYZE on the affected table. A 100x estimate miss is the leading cause of a planner picking the wrong join algorithm (e.g., choosing nested loop because it thinks the outer is tiny, then looping millions of times).
SQL
Topic — optimization
EXPLAIN plan drills
SQL
Topic — database
Database internals practice
Solution Using a leaf-first plan-reading harness
Code.
-- One canonical pattern: capture plan, identify the worst leaf, propose the one-change fix.
WITH plan_nodes AS (
SELECT * FROM (VALUES
('Index Scan idx_orders_date', 'leaf', 123210, 6210),
('Seq Scan dim_customers', 'leaf', 200000, 78),
('Hash Join', 'parent', 123210, 6580),
('HashAggregate', 'parent', 300, 6889),
('Sort + Limit', 'root', 5, 6890)
) AS t(node, kind, rows, actual_time_ms)
)
SELECT
node,
kind,
rows,
actual_time_ms,
actual_time_ms - LAG(actual_time_ms, 1, 0)
OVER (ORDER BY actual_time_ms) AS self_time_ms,
CASE
WHEN kind = 'leaf'
AND actual_time_ms = (SELECT MAX(actual_time_ms)
FROM plan_nodes WHERE kind = 'leaf')
THEN 'BOTTLENECK'
ELSE 'ok'
END AS verdict
FROM plan_nodes
ORDER BY actual_time_ms DESC;
Step-by-step trace.
| node | kind | rows | actual_time_ms | self_time_ms | verdict |
|---|---|---|---|---|---|
| Sort + Limit | root | 5 | 6890 | 1 | ok |
| HashAggregate | parent | 300 | 6889 | 309 | ok |
| Hash Join | parent | 123210 | 6580 | 370 | ok |
| Index Scan idx_orders_date | leaf | 123210 | 6210 | 6132 | BOTTLENECK |
| Seq Scan dim_customers | leaf | 200000 | 78 | 78 | ok |
- Row 1 (
Sort + Limit) — root, near-zero self time; inherits all child cost. - Row 2 (
HashAggregate) — adds ~309ms of self work to group 123k rows into 300 buckets. - Row 3 (
Hash Join) — adds ~370ms to probe; the bulk of its time is inherited from the leaf. - Row 4 (
Index Scan) — leaf, ~6.1s self time; this is the bottleneck. The index is used but the planner still does 123k heap fetches. - Row 5 (
Seq Scanon dim) — leaf, fast (78ms); too small to matter.
Output.
| node | actual_time_ms | verdict |
|---|---|---|
| Index Scan idx_orders_date | 6210 | BOTTLENECK |
| Hash Join | 6580 | ok |
| HashAggregate | 6889 | ok |
| Sort + Limit | 6890 | ok |
| Seq Scan dim_customers | 78 | ok |
Why this works — concept by concept:
- Leaf-first scan — pick the leaf with the highest actual time; that is almost always the bottleneck and the cheapest single fix.
-
Self time vs total time —
total_time - child_time = self_time; isolating self time tells you which node itself is doing work vs which is just waiting on its children. -
Loops matter on nested loop — a leaf with
actual time=0.1..0.2 rows=5 loops=10000has a real cost of0.2 * 10000 = 2000ms; always multiply. -
Estimate vs actual — if planner-estimated rows differ from actual by > 10x, run
ANALYZE; the bad estimate is causing bad join-algorithm choices upstream. -
Cost —
O(nodes)to walk the plan; the actual fix isO(1)(one DDL or rewrite) but the cycle to prove it (re-runEXPLAIN ANALYZE) is the discipline gate.
3. Index types — B-tree, Hash, Partial, Covering (when each wins)
index types — four shapes that cover 95% of queries
index types are not interchangeable: a b-tree index wins on equality and range, a hash index wins on equality only and dies on range, a partial index is a B-tree on a subset of rows and saves enormous space on skewed columns, and a covering / INCLUDE index lets the planner answer the query from the index alone without touching the heap. Pick the wrong shape and the planner ignores the index entirely; pick the right one and the same query goes from a seq scan to an index-only scan.
The four families and when each wins.
-
B-tree — the default; supports
=,<,<=,>,>=,BETWEEN,IN,ORDER BY. Used in 80%+ of real-world indexes. Composite B-trees(a, b, c)support queries ona,(a, b), and(a, b, c)but notbalone orcalone — the leftmost-prefix rule. -
Hash — equality only (
=); O(1) lookup, no range support, noORDER BYsupport. Niche: very tall single-column equality lookups (e.g., session token by hash). PostgreSQL hash indexes are WAL-logged since 10.0; before that they were unsafe. -
Partial — a B-tree over only the rows that match a
WHEREclause; e.g.,CREATE INDEX ix_active_orders ON orders (customer_id) WHERE status = 'active'. Smaller, faster, only useful for queries that share the partial's predicate. -
Covering /
INCLUDE— a composite where some columns are in the key and others areINCLUDEd (non-key payload). Lets the planner do anIndex Only Scan— answer the query from the index without a heap fetch. Eliminates one disk seek per matched row.
The leftmost-prefix rule on composite B-trees.
Given CREATE INDEX ix_ab ON t (a, b):
-
WHERE a = ?— uses the index. Yes. -
WHERE a = ? AND b = ?— uses the index. Yes. -
WHERE a = ? AND b > ?— uses the index. Yes. -
WHERE b = ?— does not use the index. The leftmost columnais unbound; the B-tree cannot be navigated. -
WHERE a > ? AND b = ?— uses the index partially;ais a range,bcannot be used as an additional seek key (only as a filter after).
Column order on a composite matters. Order columns by equality predicate first, then range, then sort. A query WHERE region = ? AND created_at BETWEEN ? AND ? ORDER BY created_at DESC wants (region, created_at DESC), not (created_at, region).
Why a covering index is the senior trick. Every Index Scan involves two reads: (1) walk the B-tree to find matching keys, (2) fetch each matching row from the heap (the table itself). A covering index stores all the columns the query needs in the index, so step (2) is skipped — the planner does an Index Only Scan and never touches the heap. For a hot-path query that returns 10k rows, this can save 10k random disk seeks.
Worked example — B-tree vs Hash vs Partial vs Covering on the same predicate
Detailed explanation. Real interviews ask you to design the right index for a specific query. Below is one canonical query and how each of the four index families performs on it.
Question. A fact_orders table has 50M rows. The hot-path query is SELECT customer_id, amount FROM fact_orders WHERE status = 'shipped' AND created_at >= '2026-04-01' ORDER BY created_at DESC LIMIT 100. Design four candidate indexes — one of each family — and predict which one the planner picks.
Input. Table: fact_orders (id PK, customer_id INT, status TEXT, created_at TIMESTAMP, amount NUMERIC). Distribution: 95% of rows have status = 'shipped', 5% are pending/cancelled. The query needs to return 100 most-recent shipped orders since 2026-04-01.
Code.
-- Candidate A: plain B-tree on (status, created_at)
CREATE INDEX ix_a ON fact_orders (status, created_at);
-- Candidate B: hash index on status
CREATE INDEX ix_b ON fact_orders USING HASH (status);
-- Candidate C: partial B-tree (status = 'shipped' subset only)
CREATE INDEX ix_c ON fact_orders (created_at DESC) WHERE status = 'shipped';
-- Candidate D: covering composite with INCLUDE
CREATE INDEX ix_d ON fact_orders (status, created_at DESC) INCLUDE (customer_id, amount);
Step-by-step explanation.
-
Candidate A — plain B-tree
(status, created_at)— works (leftmost prefixstatus = 'shipped'), but becausestatus = 'shipped'matches 95% of the table the planner may still choose a seq scan; selectivity is too low to make the index worth it. Verdict: maybe used, maybe ignored. -
Candidate B — hash on
status— equality match works (status = 'shipped'), but the index returns 47.5M rows (95% of the table) with no ordering and no range support oncreated_at; the planner falls back to seq scan or uses the hash index plus a sort, both worse than A. Verdict: rarely useful here. -
Candidate C — partial B-tree on
created_at DESC WHERE status = 'shipped'— the partial only contains the 47.5M shipped rows, but it's sorted bycreated_at DESC. The planner walks the index from the top, takes the first 100 entries that satisfycreated_at >= '2026-04-01', and is done. Verdict: excellent, especially small index footprint. -
Candidate D — covering composite
(status, created_at DESC) INCLUDE (customer_id, amount)— the planner walks the composite, finds the matching key range, and answers the entire query from the index — no heap fetch needed. Verdict: best,Index Only Scanwith zero heap reads for the 100-row LIMIT. -
Planner's actual pick. With both C and D present, the planner usually picks D because its
Index Only Scanskips the heap entirely; C still requires a heap fetch per matched row to readcustomer_idandamount. If D doesn't exist, the planner picks C.
Output (the index-family ranking for this query).
| candidate | shape | uses_index? | heap_fetches | verdict |
|---|---|---|---|---|
| A | B-tree (status, created_at) | maybe | ~100 | maybe ignored |
| B | hash (status) | no | ~47.5M | useless here |
| C | partial B-tree (created_at DESC) WHERE shipped | yes | ~100 | excellent |
| D | covering (status, created_at DESC) INCLUDE (customer_id, amount) | yes | 0 | best |
Rule of thumb: if the query returns under ~5% of the table and you can name every selected column, build a covering index — Index Only Scan is the cheapest plan in SQL.
b-tree index and the SARGable rule
SARGable stands for Search-ARGument-able — a predicate the planner can push down into an index seek. The rule: the indexed column must appear alone on one side of the operator, with no function wrapped around it.
| Predicate | SARGable? | Why |
|---|---|---|
WHERE created_at = '2026-05-29' |
yes | column alone on left side |
WHERE created_at >= '2026-05-29' |
yes | column alone on left side |
WHERE created_at BETWEEN ? AND ? |
yes | column alone on left side |
WHERE DATE(created_at) = '2026-05-29' |
no | function wrapped around column → seq scan |
WHERE EXTRACT(YEAR FROM created_at) = 2026 |
no | function wrapped around column |
WHERE created_at + INTERVAL '1 day' >= NOW() |
no | arithmetic on column side |
WHERE LOWER(email) = 'foo@bar.com' |
no unless functional index on LOWER(email) exists |
|
WHERE email = LOWER('foo@bar.com') |
yes | function on the constant side, not the column |
WHERE id IN (1,2,3) |
yes | translates to id = ANY(...)
|
WHERE id NOT IN (1,2,3) |
usually no | anti-condition rarely uses index |
The SARGable rewrite. WHERE DATE(created_at) = '2026-05-29' becomes WHERE created_at >= '2026-05-29' AND created_at < '2026-05-30'. The semantics are identical; the second form uses the index, the first does not.
SQL
Topic — indexing
Index design practice
SQL
Topic — filtering
Filtering / WHERE-clause drills
Solution Using a covering index with INCLUDE to eliminate heap fetches
Code.
-- The cheapest fast plan in SQL: Index Only Scan via a covering composite.
DROP INDEX IF EXISTS ix_fact_orders_status_date;
CREATE INDEX ix_fact_orders_status_date
ON fact_orders (status, created_at DESC)
INCLUDE (customer_id, amount);
-- Then ANALYZE so the planner sees the new index and refreshed stats.
ANALYZE fact_orders;
-- And verify with EXPLAIN ANALYZE.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, amount
FROM fact_orders
WHERE status = 'shipped'
AND created_at >= '2026-04-01'
ORDER BY created_at DESC
LIMIT 100;
Step-by-step trace.
| step | action | what it produces |
|---|---|---|
| 1 | DROP old index | removes the obsolete (status)-only B-tree |
| 2 | CREATE composite + INCLUDE | builds the covering index (status, created_at DESC) with payload (customer_id, amount)
|
| 3 | ANALYZE fact_orders | refreshes histograms so the planner trusts the new selectivity |
| 4 | EXPLAIN ANALYZE the query | confirms Index Only Scan with Heap Fetches: 0
|
| 5 | Read Buffers line |
confirms shared hit=X read=0 — everything served from index pages |
- Step 1 — dropping the prior index avoids leaving two redundant indexes; index maintenance is
O(log N)per insert, redundant indexes are pure overhead. - Step 2 —
INCLUDE (customer_id, amount)is the key trick; the columns are in the index pages but not part of the B-tree key, so they don't bloat the seek path. - Step 3 —
ANALYZEis required because the planner uses stats to decide whether to use the new index; without fresh stats it may default to seq scan. - Step 4 — the
EXPLAIN ANALYZEshould now reportIndex Only Scan using ix_fact_orders_status_dateand aHeap Fetches: 0line. - Step 5 —
BUFFERSconfirms zero heap reads; the entire query is served from cached index pages.
Output.
| metric | before | after |
|---|---|---|
| Scan node | Seq Scan | Index Only Scan |
| Rows returned | 100 | 100 |
| Heap fetches | ~50M (full scan) | 0 |
| Actual time | 28,400 ms | 1.2 ms |
| Plan flip reason | covering index unblocks Index Only Scan | — |
Why this works — concept by concept:
- Covering index — every selected column lives in the index pages, so the planner skips the heap fetch entirely; this is the single biggest win you can get on a hot-path read query.
-
INCLUDE vs key columns —
INCLUDEcolumns ride along as payload but don't widen the B-tree key, so seeks stay fast; key columns slow down inserts proportionally. -
Descending sort in the key —
created_at DESCin the index key lets the planner satisfyORDER BY created_at DESC LIMIT 100by walking the index in order; no separate Sort node. - ANALYZE after DDL — without refreshed stats, the planner may not pick the new index; this step is non-optional and frequently forgotten.
-
Cost —
O(log N + K)where K is the limit (100); the heap fetch wasO(K)random seeks, now zero. Disk-side this is roughly a 1000x improvement on the K = 100 path.
4. Join algorithms — Nested Loop, Hash Join, Merge Join (and when planners pick each)
join algorithms — three shapes, three decision rules
You do not pick the join algorithms; the planner picks them based on table sizes, available indexes, and existing sort orders. But you do pick the indexes and the SQL shape that nudge the planner toward the right algorithm — and you must be able to predict which algorithm the planner will choose so you can build the right index up front.
The three families.
-
nested loop—for each row in outer: lookup matching rows in inner. Cheap when the outer is tiny and the inner has a useful index on the join key. Complexity:O(N × log M)with an index on the inner,O(N × M)without. -
hash join—build hash table on smaller side; probe it with rows from larger side. Cheap when both sides are large and the smaller side fits inwork_mem. Complexity:O(N + M)if the hash fits, much worse if it spills to disk. -
merge join—both sides sorted on join key; walk them in lockstep. Cheap when sort orders already exist (PK scan, index range scan) or the input is already sorted. Complexity:O(N + M)if pre-sorted,O(N log N + M log M)if sorts are required.
The decision matrix.
| outer size | inner size | inner index on join key | both sides sorted | planner picks |
|---|---|---|---|---|
| small (< 10k) | any | yes | no | nested loop |
| large | large | no useful index | no | hash join |
| large | large | yes on both | yes | merge join |
| large | large | yes on one side | no | hash or nested loop (depends on selectivity) |
| any | any | no | yes (CLUSTERED on join key) | merge join |
Why the planner picks what it picks.
-
nested loopdominates when the outer is tiny because the total work isouter_rows × inner_seek_cost; a 5-row outer doing 5 indexed lookups is unbeatable. -
hash joindominates on bulk equi-joins where both sides are large; you pay one full scan per side plus a hash build, then probe inO(1)per row. -
merge joindominates when both sides are already sorted on the join key (e.g., joining two range-scan results from indexes); the merge is a single pass with no hash overhead. -
The planner switches when statistics shift. A query that runs with hash join today may switch to nested loop tomorrow if
ANALYZEreveals the outer side is now much smaller; this is intentional and almost always correct.
Worked example — predict the join algorithm before EXPLAIN
Detailed explanation. Real interviews ask you to predict the join algorithm before showing you the plan. Below are three canonical join shapes; build the prediction reflex.
Question. For each of the three join scenarios below, predict the join algorithm the planner will pick and justify it in one sentence.
Input. Three scenarios on a fact_orders (10M rows) + dim_customers (200k rows) + dim_products (50 rows) schema.
Code.
-- Scenario A: tiny dim joined to a large fact, indexed PK on dim
SELECT o.id, p.name
FROM fact_orders o
JOIN dim_products p ON p.product_id = o.product_id
WHERE p.category = 'electronics'; -- filter narrows dim_products to 5 rows
-- Scenario B: both sides large, no useful index on the join key in fact
SELECT o.id, c.region
FROM fact_orders o
JOIN dim_customers c ON c.customer_id = o.customer_id;
-- (no index on fact_orders.customer_id)
-- Scenario C: both sides already sorted by the join key (PK scan on each)
SELECT o.id, c.region
FROM fact_orders o
JOIN dim_customers c ON c.customer_id = o.customer_id
ORDER BY o.customer_id;
-- (indexes exist on both join keys, query returns ordered output)
Step-by-step explanation.
-
Scenario A — Nested Loop. The
WHERE p.category = 'electronics'filter reducesdim_productsto 5 rows; for each of those 5 rows the planner does an indexed lookup againstfact_orders(assuming an index onproduct_id). Cost:5 × log(10M) ≈ 5 × 23 = 115 index seeks. Unbeatable. -
Scenario B — Hash Join. Both sides are large (10M and 200k); there is no index on
fact_orders.customer_id, so nested loop would be10M × full scan of dim= catastrophic. The planner builds a hash ondim_customers(200k rows fits inwork_mem), then probes once per fact row. Cost:10M + 200krow reads, plus the hash build. -
Scenario C — Merge Join. Both sides are scanned in
customer_idorder via their respective B-tree indexes; the merge walks both streams in lockstep, emitting matches. Cost:10M + 200krow reads, no hash overhead, and the output is already sorted (theORDER BYis free). -
The planner needs accurate stats. If
ANALYZEis stale and the planner thinksdim_productshas 5,000 rows after filter (when reality is 5), it may pick hash join in Scenario A — and pay a hash-build cost on a near-empty hash. -
Override hint. If you know the planner picked wrong, you can force the choice with planner hints (
SET enable_nestloop = off/SET enable_hashjoin = off) in PostgreSQL; in production it's almost always better to fix statistics or rewrite the SQL.
Output (the join-algorithm prediction matrix).
| scenario | outer | inner | inner_index | predicted | reason |
|---|---|---|---|---|---|
| A | dim_products (5 rows after filter) | fact_orders (10M) | yes (product_id) | Nested Loop | tiny outer + indexed inner |
| B | fact_orders (10M) | dim_customers (200k) | no on fact.customer_id | Hash Join | no index, both large |
| C | fact_orders (10M, indexed) | dim_customers (200k, indexed) | yes both sides, sorted | Merge Join | both pre-sorted on join key |
Rule of thumb: tiny outer → nested loop; both big, no useful index → hash; both big, both sorted → merge. Memorise these three.
hash join deep dive — why it's the workhorse
hash join is the most common algorithm on modern OLAP / warehouse queries because the typical shape is fact table joined to small dim with no useful index on the fact side. The mechanics:
-
Build phase. Scan the smaller side; for each row, hash the join key and insert into a hash table. Time:
O(M), space:O(M)inwork_mem. -
Probe phase. Scan the larger side; for each row, hash the join key, look it up in the hash table, emit matches. Time:
O(N). -
Spill to disk. If the hash exceeds
work_mem, partitions are spilled to disk; performance degrades catastrophically. Always sizework_memto fit the build side of your largest hash join. -
Build side selection. The planner picks the smaller side as the build side; if statistics are wrong it may pick the larger side and OOM. Check
Hashnodeactual rowsvsMemory Usage:inEXPLAIN ANALYZE— if reality is 10x the estimate, runANALYZE.
nested loop deep dive — the silent killer.
nested loop is the cheapest plan when the outer is tiny — and the most expensive plan when the outer is large. The asymmetry is outer_rows × inner_seek_cost:
-
Outer = 5 rows, inner = 10M with index →
5 × log(10M) = ~115 seeks→ ~1ms. -
Outer = 1M rows, inner = 10M with index →
1M × log(10M) = ~23M seeks→ minutes. -
Outer = 1M rows, inner = 10M without index →
1M × 10M = 10 trillion comparisons→ effectively forever.
The bug pattern: the planner predicts a 5-row outer (because stats are stale) and picks nested loop; reality is 1M rows and the query melts the server. The fix: ANALYZE the outer table; the planner re-plans next run.
merge join deep dive — the niche but unbeatable choice.
merge join requires both sides to arrive sorted on the join key. When they do — typically because both sides are scanned via a B-tree on the join key — it's a single linear pass with no hash overhead. The output is also sorted on the join key, so downstream GROUP BY on the join key or ORDER BY on the join key is free.
- When the planner picks it. Both sides have a B-tree on the join key, the result is large, and downstream operators benefit from the sort order.
-
When it loses. One side is unsorted; the planner would need to sort it explicitly, and
O(N log N) + O(M log M)sorting cost usually exceedsO(N + M)hash join cost.
SQL
Topic — joins
Join algorithm drills
SQL
Topic — sql
SQL practice library
Solution Using a decision-tree that picks the join algorithm from inputs
Code.
-- One canonical decision tree, materialised as a lookup table.
CREATE TABLE join_algorithm_chooser AS
SELECT * FROM (VALUES
('tiny_outer + indexed_inner', 'Nested Loop', 'O(N * log M)', 'small driver, indexed lookup per row'),
('large + large, no_index', 'Hash Join', 'O(N + M)', 'build hash on smaller side, probe larger'),
('large + large, both_sorted', 'Merge Join', 'O(N + M)', 'walk both sides in lockstep, output sorted'),
('large + large, one_indexed', 'Hash or NL', 'depends', 'planner picks via selectivity estimate'),
('any + any, both_in_memory', 'Hash Join', 'O(N + M)', 'no I/O cost; hash always wins'),
('any + any, work_mem_too_small','Hash spill', 'O((N+M) * spill_factor)','build spills to disk; tune work_mem')
) AS t(input_shape, algorithm, complexity, intuition);
Step-by-step trace.
| input_shape | algorithm | complexity | intuition |
|---|---|---|---|
| tiny_outer + indexed_inner | Nested Loop | O(N * log M) | small driver, indexed lookup per row |
| large + large, no_index | Hash Join | O(N + M) | build hash on smaller side, probe larger |
| large + large, both_sorted | Merge Join | O(N + M) | walk both sides in lockstep, output sorted |
| large + large, one_indexed | Hash or NL | depends | planner picks via selectivity estimate |
| any + any, both_in_memory | Hash Join | O(N + M) | no I/O cost; hash always wins |
| any + any, work_mem_too_small | Hash spill | O((N+M) * spill_factor) | build spills to disk; tune work_mem |
- Row 1 — nested loop wins on tiny outer because
N × log Mis small when N is small; no other algorithm beats it. - Row 2 — hash join is the OLAP workhorse; one full scan per side plus a hash build, then probe in
O(1)per row. - Row 3 — merge join is unbeatable when both sides are pre-sorted; no hash overhead and free downstream sort.
- Row 4 — when one side has an index and one does not, the planner estimates selectivity; if filter is narrow, nested loop; if wide, hash.
- Row 5 — when both sides fit in memory the I/O cost vanishes and hash is unbeatable; this is the warehouse-on-SSD common case.
- Row 6 — when the build side exceeds
work_mem, the hash spills to disk and performance degrades 10-100x; tunework_memper session.
Output.
| input_shape | algorithm |
|---|---|
| tiny_outer + indexed_inner | Nested Loop |
| large + large, no_index | Hash Join |
| large + large, both_sorted | Merge Join |
| large + large, one_indexed | Hash or NL |
| any + any, both_in_memory | Hash Join |
| any + any, work_mem_too_small | Hash spill (tune) |
Why this works — concept by concept:
- Three families, three rules — nested loop for tiny outer, hash for big × big, merge for pre-sorted; everything else is a planner judgement call.
-
Build side selection — hash join always builds on the smaller side; if stats are wrong the planner may build on the larger side and OOM.
ANALYZEkeeps this honest. - Pre-sorted is free — a B-tree range scan returns rows in key order at zero extra cost; merge join exploits this directly.
- Loop count multiplies — a nested loop with a large outer multiplies inner seek cost by outer rows; this is why the algorithm dies on big outers.
-
Cost — nested loop
O(N × log M), hashO(N + M)in memory /O((N+M) × spill_factor)on disk, mergeO(N + M)if pre-sorted. Match input shape to algorithm and the planner picks correctly.
5. The six-step tuning playbook (capture → EXPLAIN → bottleneck → rewrite/index → ANALYZE → compare)
The six-step sql tuning playbook — discipline, not heroics
sql tuning is a discipline, not a black art. The six-step playbook is the same loop senior data engineers run for every slow query they're handed: capture the slow query with its real parameters, EXPLAIN it (always ANALYZE, always with the real parameters), find the bottleneck node (worst leaf or worst loops × time), rewrite or add an index (exactly one change), ANALYZE the affected tables so the planner sees fresh stats, re-run + compare the new plan against the old. Repeat until the SLA is met. One change per cycle is non-negotiable.
Step 1 — Capture the slow query (with real parameters).
- Use
pg_stat_statements(PostgreSQL) /Query Store(SQL Server) /INFORMATION_SCHEMA.PROCESSLIST(MySQL) /query_history(Snowflake) /INFORMATION_SCHEMA.JOBS_BY_PROJECT(BigQuery) to find the query. - Capture the actual parameters the slow execution used; never EXPLAIN with
WHERE col = 'a'if production runsWHERE col = ?with a high-cardinality value. - Note the wall-clock time, rows returned, and the SLA the query is missing — you need a target to know when you're done.
Step 2 — EXPLAIN ANALYZE the query (always ANALYZE).
-
EXPLAINshows the planner's estimate;EXPLAIN ANALYZEactually runs the query and shows real time + rows. UseANALYZEevery time in a tuning session. - Add
BUFFERSto see disk vs cache reads:EXPLAIN (ANALYZE, BUFFERS) SELECT .... A highreadcount on the bottleneck node is a sign you're hitting cold pages. - On destructive queries (
UPDATE/DELETE), wrap in aBEGIN; ... ROLLBACK;block soEXPLAIN ANALYZEcan run without modifying data.
Step 3 — Find the bottleneck node.
- Walk leaves first; the leaf with the highest
actual timeis almost always the bottleneck. - Check
loopson nested loops;actual time × loopsis the real cost. - Check estimate vs actual rows; a > 10x miss means stale stats are causing wrong plans upstream.
- Check
Memory Usage:on hash nodes; if it showsDisk: X kB, the hash spilled — tunework_memor rewrite to avoid the hash.
Step 4 — Rewrite or add an index (exactly one change).
- If the bottleneck is a
Seq Scanon a big table, add an index that matches the predicate. - If the bottleneck is an
Index Scanwith many heap fetches, convert to a covering index withINCLUDE. - If the bottleneck is a function-wrapped predicate (
WHERE DATE(col) = ?), rewrite to SARGable form (WHERE col >= ? AND col < ?). - If the bottleneck is a hash spill, increase
work_memfor this session (SET work_mem = '256MB') and re-run. -
Only one change. If you add an index and rewrite the predicate and change
work_memall at once, you cannot attribute the win to a single cause and may ship a regression.
Step 5 — ANALYZE the affected tables.
- After any DDL (
CREATE INDEX,ALTER TABLE), runANALYZE table_nameso the planner sees the new index / new column shape. - After bulk loads (
COPY,INSERT ... SELECT), runANALYZEon the loaded table; without fresh stats the planner uses pre-load histograms and picks wrong plans. -
ANALYZEitself is cheap (O(sample)) — it samples ~30,000 rows per column by default. - Skipping this step is the #1 cause of "I added the index but the plan didn't change" tickets.
Step 6 — Re-run + compare.
- Run
EXPLAIN ANALYZEagain with the same parameters; compare actual time, plan shape, and rows. - Keep a tuning log: query, before plan, change made, after plan, latency delta. This is the artefact you bring to a senior interview.
- If the new plan is worse, revert the change — never "ship and hope".
- If the new plan is better but still misses the SLA, iterate: go back to step 3, find the next bottleneck.
Worked example — run the full six-step playbook on one query
Detailed explanation. Real interviews give you a slow query and ask you to walk the playbook out loud. Below is one canonical query and the full six-step trace.
Question. A reporting query runs in 32 seconds; the SLA is 2 seconds. Walk the six-step tuning playbook on this query, naming the change you make at each step and the expected latency after.
Input. Query: SELECT region, SUM(amount) FROM fact_orders WHERE DATE(created_at) = '2026-05-28' GROUP BY region. Table: 80M rows, indexed on (created_at), ~50k matching rows.
Code.
-- STEP 1 — Capture (from pg_stat_statements)
-- query: SELECT region, SUM(amount) FROM fact_orders WHERE DATE(created_at) = $1 GROUP BY region
-- mean_exec_time_ms: 32140
-- target: <2000
-- STEP 2 — EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT region, SUM(amount)
FROM fact_orders
WHERE DATE(created_at) = '2026-05-28'
GROUP BY region;
-- Planner output (abridged):
-- HashAggregate (cost=2,140,000 .. 2,140,001 rows=300 width=20)
-- (actual time=31,800 .. 31,820 rows=300)
-- -> Seq Scan on fact_orders (cost=0 .. 2,140,000 rows=400,000 width=20)
-- (actual time=12 .. 31,500 rows=50,000)
-- Filter: (date(created_at) = '2026-05-28')
-- Buffers: shared read=940,000
-- STEP 3 — Bottleneck: Seq Scan owns 31.5s; Filter is function-wrapped (DATE(created_at))
-- so the existing index on (created_at) is unused.
-- STEP 4 — Rewrite to SARGable form (one change)
EXPLAIN (ANALYZE, BUFFERS)
SELECT region, SUM(amount)
FROM fact_orders
WHERE created_at >= '2026-05-28' AND created_at < '2026-05-29'
GROUP BY region;
-- STEP 5 — ANALYZE if stats are stale (skipped here; stats fresh after recent ANALYZE)
ANALYZE fact_orders;
-- STEP 6 — Re-run + compare
-- New planner output:
-- HashAggregate (cost=18,400 .. 18,401 rows=300 width=20)
-- (actual time=1,310 .. 1,320 rows=300)
-- -> Index Scan using ix_fact_orders_created_at on fact_orders
-- (cost=0.42 .. 18,000 rows=50,000 width=20)
-- (actual time=0.18 .. 1,150 rows=50,000)
-- Index Cond: (created_at >= '2026-05-28' AND created_at < '2026-05-29')
-- Buffers: shared hit=1,400 read=4,200
Step-by-step explanation.
-
Capture —
pg_stat_statementsflags the query at 32s mean; target is 2s. -
EXPLAIN ANALYZE — the planner shows
Seq Scan on fact_orderswithFilter: date(created_at) = '2026-05-28'; the index on(created_at)exists but is unused. -
Bottleneck identification —
Seq Scanowns 31.5s of the 31.8s total; theFilterclause wrapscreated_atinDATE(...)which is not SARGable, so the planner cannot push the predicate into an index seek. -
Rewrite (one change) — replace
DATE(created_at) = '2026-05-28'withcreated_at >= '2026-05-28' AND created_at < '2026-05-29'; semantics identical, second form is SARGable. -
ANALYZE —
ANALYZE fact_ordersensures the planner trusts the row estimate for the new predicate; skipped only if stats were refreshed recently. -
Re-run + compare — new plan is
Index Scan using ix_fact_orders_created_at, actual time 1.3s. Win: 32s → 1.3s, ~25x improvement, SLA met. No further tuning needed.
Output (the before/after comparison).
| metric | before | after |
|---|---|---|
| Scan type | Seq Scan + Filter | Index Scan |
| Buffers (shared read) | 940,000 | 4,200 |
| Actual time (ms) | 31,820 | 1,320 |
| Plan flip reason | SARGable rewrite unlocked existing index | — |
| Changes made | 1 (SARGable rewrite) | — |
| SLA met? | no (32s vs 2s) | yes (1.3s vs 2s) |
Rule of thumb: one well-chosen change can produce a 10-30x speedup. If you cannot point at the one change that produced the win, you changed too many things at once.
The five most common anti-patterns and how to fix them
Anti-pattern 1 — WHERE FUNC(col) = ?. Function on the indexed column prevents the planner from using the index. Fix: rewrite to SARGable form, or create a functional index (CREATE INDEX ix_lower_email ON users (LOWER(email))).
Anti-pattern 2 — SELECT * on a wide table. Forces the planner to do heap fetches for every row even when a covering index could answer the query. Fix: name only the columns you need, then build a covering index over them.
Anti-pattern 3 — OR across tables. WHERE a.x = ? OR b.y = ? cannot use either index. Fix: rewrite as UNION of two single-predicate queries.
Anti-pattern 4 — Correlated subquery in SELECT. SELECT (SELECT COUNT(*) FROM child c WHERE c.pid = p.id) FROM parent p re-runs the subquery per row. Fix: rewrite as LEFT JOIN with GROUP BY or window function; one pass instead of N.
Anti-pattern 5 — Implicit type coercion on join. JOIN customers c ON c.id = o.customer_id_varchar where c.id INT and o.customer_id_varchar TEXT forces an implicit CAST that defeats the index. Fix: align types in the schema; never cross types on a join key.
SQL
Topic — optimization
SQL tuning playbook drills
SQL
Topic — aggregation
Aggregation + GROUP BY drills
Solution Using a tuning-log artefact you keep per query
Code.
-- Persist your tuning history; every senior engineer keeps one of these.
CREATE TABLE sql_tuning_log AS
SELECT * FROM (VALUES
(1, 'top_revenue_by_region', 'before', 'Seq Scan + DATE() filter', 32140, 940000, 'baseline'),
(2, 'top_revenue_by_region', 'rewrite_sargable', 'Index Scan on (created_at)', 1320, 4200, 'SARGable rewrite -> existing index used'),
(3, 'top_revenue_by_region', 'add_covering', 'Index Only Scan with INCLUDE(region, amount)', 410, 320, 'covering INCLUDE eliminates heap fetches'),
(4, 'top_revenue_by_region', 'partitioned_table', 'Index Only Scan on daily partition', 180, 110, 'partition pruning halves scanned pages'),
(5, 'top_revenue_by_region', 'final', 'meets SLA: 180ms < 2000ms', 180, 110, 'shipped; no further tuning needed')
) AS t(step, query_name, change_label, plan_after, actual_time_ms, buffers_read, notes);
Step-by-step trace.
| step | query_name | change_label | plan_after | actual_time_ms | buffers_read | notes |
|---|---|---|---|---|---|---|
| 1 | top_revenue_by_region | before | Seq Scan + DATE() filter | 32140 | 940000 | baseline |
| 2 | top_revenue_by_region | rewrite_sargable | Index Scan on (created_at) | 1320 | 4200 | SARGable rewrite |
| 3 | top_revenue_by_region | add_covering | Index Only Scan with INCLUDE | 410 | 320 | covering INCLUDE |
| 4 | top_revenue_by_region | partitioned_table | Index Only Scan on daily partition | 180 | 110 | partition pruning |
| 5 | top_revenue_by_region | final | meets SLA | 180 | 110 | shipped |
- Step 1 — baseline; record the exact slow plan, latency, and buffer reads before changing anything.
- Step 2 — one change: SARGable rewrite. Re-EXPLAIN, record new plan, new latency, new buffer reads. 32s → 1.3s.
- Step 3 — one change: add covering index with
INCLUDE. Re-EXPLAIN. 1.3s → 410ms. - Step 4 — one change: range-partition by day and let partition pruning skip irrelevant partitions. 410ms → 180ms.
- Step 5 — SLA met (180ms vs 2s target); stop tuning. Over-tuning past the SLA is wasted effort.
Output.
| step | actual_time_ms | speedup vs baseline |
|---|---|---|
| 1 (before) | 32140 | 1.0x |
| 2 (SARGable) | 1320 | 24x |
| 3 (covering) | 410 | 78x |
| 4 (partitioned) | 180 | 178x |
| 5 (final) | 180 | 178x (SLA met) |
Why this works — concept by concept:
- One change per row — every entry in the tuning log records exactly one change; this is the audit trail that proves you didn't ship a guess.
- Plan-after column — captures the shape of the plan, not just the latency; latency without plan context is unfalsifiable.
-
Buffers as a second metric —
buffers_readis the I/O proxy; latency can be cache-warm noise, buffer counts are deterministic. - Stop at the SLA — step 5 is the discipline gate; once the SLA is met, stop tuning and ship. Senior engineers do not over-tune past the requirement.
-
Cost —
O(cycles)where each cycle is one EXPLAIN ANALYZE + one DDL or rewrite; the log itself isO(rows)to read and the artefact you bring to the post-mortem (or the interview).
Choosing the right tuning move (cheat sheet)
A one-screen cheat sheet for sql query optimization — given a symptom in EXPLAIN ANALYZE, pick the move that fixes it.
| You see in the plan … | Likely cause | First move | Cadence |
|---|---|---|---|
Seq Scan on a big table with a selective predicate |
no useful index |
CREATE INDEX matching the predicate |
new query |
Seq Scan with Filter: FUNC(col) = ?
|
non-SARGable predicate | rewrite to col = ? or build functional index |
every PR |
Index Scan with high Heap Fetches:
|
non-covering index | add INCLUDE (...) for selected columns |
hot path |
Hash node with Disk: X kB
|
hash spill | raise work_mem for the session |
per query |
Nested Loop with loops= very large |
bad outer estimate, often stale stats |
ANALYZE + check planner row estimate |
regression |
Index Scan ignored, planner picks Seq Scan
|
low selectivity (>20% match) | reconsider whether index helps; or partial index on hot subset | review |
Sort node burning seconds |
output not pre-sorted | add ORDER BY col DESC to index key |
hot path |
Bitmap Heap Scan slow on huge result |
many random heap reads | covering index OR rewrite to narrow predicate | hot path |
| Planner-estimated rows 100x off actual | stale statistics | ANALYZE table_name |
regression |
WHERE created_at + INTERVAL '1 day' >= NOW() |
arithmetic on column | rewrite to created_at >= NOW() - INTERVAL '1 day'
|
every PR |
IN (subquery) with large subquery |
semi-join blow-up | rewrite as EXISTS or JOIN ... GROUP BY
|
review |
OR across two tables |
un-indexable disjunction | rewrite as UNION ALL of two queries |
review |
CAST on join key (INT = TEXT) |
implicit type coercion defeats index | align schema types; never cross types on join | schema |
| Same query, plan flipped overnight | autovacuum reset stats | check last_analyze; re-run ANALYZE if old |
on-call |
Aggregate on huge table, no GROUP BY
|
full scan to compute COUNT(*)
|
materialised view or pg_stat_user_tables.n_live_tup
|
nightly |
Frequently asked questions
What's the single most important query optimization techniques reflex to build?
Run EXPLAIN ANALYZE before you change anything. The single biggest gap between junior and senior engineers is the willingness to look at the plan before forming a hypothesis. Junior engineers reason about queries from first principles (the predicate looks selective, so the planner must use the index); senior engineers run EXPLAIN ANALYZE, see the plan the planner actually picked, and only then propose a change. Every other technique in this guide — SARGable rewrites, covering indexes, join-algorithm prediction, statistics refresh — is downstream of that single reflex. The mantra: don't guess, look at the plan.
How do I read an explain plan quickly under interview pressure?
Walk leaves to root, not top to bottom. The plan prints top-down but executes bottom-up; the bottom-most nodes (Seq Scan, Index Scan, Index Only Scan) are the leaves where actual work begins. Find the leaf with the highest actual time; that's almost always the bottleneck. Check loops on any Nested Loop parent — actual time × loops is the real cost. Check estimate-vs-actual rows; a > 10x miss means stale stats are causing wrong plans upstream. With those three habits you can narrate any 5-10 node plan in under a minute, which is exactly what the interviewer wants.
When should I pick a b-tree index vs a hash index vs a partial index?
B-tree is the default — pick it for any column you query with =, <, <=, >, >=, BETWEEN, IN, or ORDER BY. Hash is niche — equality only, no ranges, no sort; reach for it only when you have very tall single-column equality lookups and the engine supports WAL-logged hash indexes (PostgreSQL 10+). Partial is a B-tree on a subset of rows — pick it when your queries always filter on the same boolean-ish predicate (WHERE status = 'active'); the partial is smaller, faster, and skips index maintenance on rows it doesn't cover. Covering / INCLUDE is the senior trick — pick it for hot-path queries where you can name every selected column; it unlocks Index Only Scan and skips the heap fetch entirely. In practice, ~80% of production indexes are plain B-trees, ~15% are covering composites, ~5% are partial, and hash indexes show up only in very specific niches.
What's the difference between nested loop, hash join, and merge join — and when does each win?
nested loop is for each row in outer: lookup in inner; it wins when the outer is tiny (under ~10k rows) and the inner has a useful index — total cost is outer_rows × inner_seek_cost, which is unbeatable on small outers. hash join builds a hash on the smaller side and probes with the larger; it wins on big × big equi-joins with no useful index — total cost is O(N + M) if the build fits in work_mem. Merge join requires both sides arrive sorted on the join key; it wins when sort orders already exist (e.g., both sides scanned via a B-tree on the join key) and the output benefits downstream from being pre-sorted. You do not pick the algorithm — the planner does, based on table sizes, indexes, and statistics — but you must be able to predict the choice so you can build the right index up front. The decision matrix: tiny outer → nested loop; big × big, no index → hash; big × big, both sorted → merge.
Why does my query that worked yesterday suddenly run slowly today?
Almost always stale statistics. The query optimizer relies on histograms gathered by ANALYZE to estimate row counts; if the data distribution shifted overnight (bulk load, partition swap, schema change) and ANALYZE hasn't re-run, the planner is making decisions on yesterday's reality. The fix is one command: ANALYZE table_name. Other common causes are autovacuum interruptions, parameter sniffing on prepared statements (where the first param value cached a plan that's wrong for subsequent values), and silent index bloat (PostgreSQL's pgstattuple extension can confirm). Check pg_stat_user_tables.last_analyze for the affected table first; if it's older than your latest bulk load, run ANALYZE and re-EXPLAIN before debugging anything else.
How do I rewrite a non-SARGable predicate into a SARGable one?
The rule: the indexed column must appear alone on one side of the operator, with no function or arithmetic wrapped around it. WHERE DATE(created_at) = '2026-05-28' is non-SARGable; rewrite to WHERE created_at >= '2026-05-28' AND created_at < '2026-05-29'. WHERE EXTRACT(YEAR FROM created_at) = 2026 is non-SARGable; rewrite to WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'. WHERE created_at + INTERVAL '1 day' >= NOW() is non-SARGable; rewrite to WHERE created_at >= NOW() - INTERVAL '1 day' (move the arithmetic to the constant side). WHERE LOWER(email) = 'foo@bar.com' is non-SARGable on a plain index, but becomes SARGable if you add a functional index CREATE INDEX ix_lower_email ON users (LOWER(email)). This single rewrite class fixes ~30% of slow queries in production.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL drills keyed to the same sql query optimization skills this guide teaches (reading explain plan, designing b-tree index + covering composites, predicting nested loop vs hash join vs merge, SARGable rewrites, and the six-step sql tuning playbook). Whether you're prepping for a query optimization techniques round the night before a senior screen, or building the daily reps that turn 30-second queries into 300-millisecond ones over months, the practice library mirrors the same five-stage mental model — plus the index types, join algorithms, and explain plan cost-model intuition you'll wire into your production tuning workflow.





Top comments (0)