DEV Community

Cover image for SQL Query Optimization: EXPLAIN Plans, Indexes & Tuning Techniques for Data Engineers
Gowtham Potureddi
Gowtham Potureddi

Posted on

SQL Query Optimization: EXPLAIN Plans, Indexes & Tuning Techniques for Data Engineers

sql query optimization is the single skill that separates the engineer who writes a query from the one who ships it: a 30-second SELECT that returns the right rows is still a production incident, and the discipline of reading an explain plan, picking the right index types (b-tree index, hash, partial, covering), recognising which of the three join algorithms (nested loop, hash join, merge join) the planner will choose, and then rewriting SARGable predicates is what turns 30 seconds into 300 milliseconds. The senior round is rarely "do you know JOIN" — it is "show me the plan, find the bottleneck node, and tell me the one change that will move the needle". This deep-dive guide walks the full senior playbook end to end, with worked traces, cost models, and the query optimization techniques every modern data engineer should run on every PR.

This is a deep-dive companion to short tuning round-ups: where a 5-tip cheat sheet covers "add an index, avoid SELECT *, prefer JOIN over correlated subquery", this guide widens the surface into five full teaching stagesexplain plan anatomy (read the tree from leaves to root), index types compared (B-tree, hash, partial, covering — when each wins and when it backfires), join algorithms (nested loop, hash, merge — and exactly when the planner picks each), the six-step sql tuning playbook (capture → EXPLAIN → bottleneck → rewrite/index → ANALYZE → compare), and a one-screen decision cheat sheet that maps every common symptom (sequential scan on a 10M-row table, hash-spill to disk, nested loop on a 1M × 1M join) onto the exact rewrite or index that fixes it. Each section ends as an interview-shaped Q&A — a question, a SQL snippet, a traced EXPLAIN walkthrough, a sample output, and a concept-by-concept why this works breakdown — the exact shape senior query optimization techniques rounds reward.

PipeCode blog header for a deep-dive SQL query optimization guide — bold white headline 'SQL Query Optimization' with subtitle 'EXPLAIN plans · Indexes · Joins · Tuning' and a stylised four-step optimisation flow infographic (read plan → choose index → pick join → rewrite SQL) on a dark gradient with purple, orange, green, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse the SQL practice library →, drill query optimization problems →, sharpen indexing patterns →, rehearse join problems →, reinforce aggregation reconciliation →, or widen coverage on the full database problem set →.


On this page


1. Why SQL query optimization is the senior-round signal

sql query optimization — the discipline that separates seniors from juniors

The one-sentence invariant: sql query optimization is the discipline of turning a query's logical shape (what rows you want) into its physical shape (how the planner will fetch them), then iterating on the physical shape until it meets the SLA. Junior engineers write the query and ship it; senior engineers run EXPLAIN ANALYZE, identify the highest-cost node, and rewrite either the predicate, the join order, or the index until the plan flips from a 30-second sequential scan to a 300-millisecond index scan. The skill is not knowing more SQL — it is reading the plan the database produces and acting on the bottleneck node.

What interviewers actually score on query optimization techniques.

  • Plan literacy on explain plan — can you read a 12-line EXPLAIN ANALYZE and point at the leaf node that owns the cost?
  • Index intuition on index types — given a WHERE a = ? AND b BETWEEN ? AND ? predicate, can you name the composite index that wins and the one that loses?
  • Join algorithm fluency — can you predict whether the planner picks nested loop, hash join, or merge join for a 1k × 10M join, and why?
  • SARGable rewrite reflex — given WHERE DATE(created_at) = '2026-05-29', can you rewrite it to WHERE created_at >= '2026-05-29' AND created_at < '2026-05-30' without prompting?
  • Statistics + cost model awareness — do you know that ANALYZE refreshes the histograms the planner relies on, and that stale statistics are the single most common cause of a "regressed" query plan?
  • sql tuning discipline — can you change one thing per cycle and re-run EXPLAIN ANALYZE to prove the win, instead of changing four things at once and shipping a worse plan?

The 5-stage map this guide walks through.

  • Stage 1 — explain plan anatomy — read the tree from leaves to root; the worst leaf is the bottleneck.
  • Stage 2 — index types — B-tree (default), hash (equality only), partial (filtered subset), covering / INCLUDE (index-only scan).
  • Stage 3 — join algorithmsnested loop (small outer + indexed inner), hash join (no useful index, both sides large), merge join (both sides pre-sorted).
  • Stage 4 — sql tuning playbook — capture → EXPLAIN → bottleneck → rewrite or index → ANALYZE → compare; one change per cycle.
  • Stage 5 — cheat sheet — symptom → fix; reach for the row that matches the bottleneck node.

Why this is the senior-round signal and not a syntax round.

  • query optimization techniques are empirical, not theoretical — the right answer depends on the plan the planner produces, which depends on statistics, indexes, and data distribution; you must look, not guess.
  • The biggest wins are leaf-level — the cheapest improvement is almost always replacing a sequential scan on the largest table with an index scan; the join algorithm above it inherits the cost.
  • Stale statistics produce silent regressions — a model that ran in 100ms last week now takes 90 seconds; the cause is usually a histogram that no longer matches the data, not a code change.
  • SARGable rewrites are free winsWHERE col = ? uses an index; WHERE FUNC(col) = ? does not, even with an index defined on col; this single rule fixes 30% of slow queries.
  • One change per cycle is the discipline gate — junior engineers change four things at once and ship a worse plan; senior engineers change one thing, re-run EXPLAIN ANALYZE, prove the win, then move on.

Worked example — read one EXPLAIN plan and identify the bottleneck

Detailed explanation. Real interviews probe whether you can read a small EXPLAIN ANALYZE cold. Below is a canonical 3-node plan; your job is to point at the bottleneck node and propose the one change that will move the needle.

Question. Given the EXPLAIN ANALYZE output below, which node owns the cost, why is it slow, and what is the single change you would make first?

Input. A fact_orders table (8M rows, no index on customer_id) joined to dim_customers (50k rows, primary key on customer_id); the query filters orders to a single segment.

Code.

EXPLAIN ANALYZE
SELECT c.segment, SUM(o.amount) AS revenue
FROM   fact_orders o
JOIN   dim_customers c ON c.customer_id = o.customer_id
WHERE  c.segment = 'enterprise'
GROUP  BY c.segment;

--                                  QUERY PLAN
-- HashAggregate  (cost=185432.10..185432.11 rows=1 width=40)
--                (actual time=28412.51..28412.52 rows=1 loops=1)
--   Group Key: c.segment
--   ->  Hash Join  (cost=1812.00..184230.40 rows=240340 width=12)
--                  (actual time=22.10..27890.40 rows=232117 loops=1)
--         Hash Cond: (o.customer_id = c.customer_id)
--         ->  Seq Scan on fact_orders o (cost=0.00..164010.00 rows=8000000 width=12)
--                                       (actual time=0.01..18402.18 rows=8000000 loops=1)
--         ->  Hash  (cost=1187.00..1187.00 rows=50000 width=8)
--               ->  Index Scan on dim_customers c
--                       (cost=0.00..1187.00 rows=1500 width=8)
--                       (actual time=0.02..3.18 rows=1500 loops=1)
--                     Index Cond: (segment = 'enterprise')
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Read the leaves first. The two leaf nodes are Seq Scan on fact_orders (8,000,000 rows, 18.4s) and Index Scan on dim_customers (1,500 rows, 3ms). The dim is fine; the fact is the bottleneck.
  2. Confirm with cost numbers. Seq Scan cost is 0.00..164010.00; the Hash Join above adds only ~20,000 more cost. The leaf owns ~80% of the plan's total cost.
  3. Identify why it's slow. No index exists on fact_orders.customer_id, so the planner reads every row, hashes the dim, and probes. The dim filter (segment = 'enterprise') is not pushed down to the fact because the join column is customer_id, not segment.
  4. One-change fix. Add CREATE INDEX idx_fact_orders_customer_id ON fact_orders (customer_id); the planner will then switch to a Nested Loop driven by the 1,500-row enterprise customer set, doing 1,500 indexed lookups against the fact instead of one full sequential scan.
  5. Verify. Re-run EXPLAIN ANALYZE; expected new plan is Nested Loop over Index Scan on fact_orders driven by the 1,500-row inner — actual time should drop from 28s to under 1s.

Output (the bottleneck-node identification).

node type rows actual_time_ms share_of_total
Seq Scan on fact_orders leaf 8,000,000 18,402 ~65%
Hash Join parent 232,117 9,488 ~33%
HashAggregate root 1 0.01 <1%

Rule of thumb: the worst-performing leaf is almost always the bottleneck; fix it first, then re-run EXPLAIN ANALYZE and re-evaluate.

sql tuning — the four senior signals interviewers chase

Signal 1 — opinionated index choices, not "add an index everywhere". Senior engineers do not say "indexes are good"; they say "I add a composite (tenant_id, created_at DESC) index because 90% of our queries filter on tenant then sort on time, and a covering INCLUDE (status, amount) lets the planner answer the query without touching the heap."

Signal 2 — empirical, not theoretical. Junior engineers reason about plans from first principles; senior engineers run EXPLAIN ANALYZE first, then reason. The plan tells you the truth; intuition is a starting point, not an answer.

Signal 3 — one change per cycle, with proof. Senior engineers change exactly one thing per tuning cycle — one index, one rewrite, one ANALYZE — and re-run EXPLAIN ANALYZE to prove the win before moving on. Four changes at once ships a worse plan and no learning.

Signal 4 — statistics-awareness, not just index-awareness. When a plan regresses, junior engineers look for code changes; senior engineers run ANALYZE on the affected tables first because stale histograms are the single most common cause of a "the query that worked yesterday is now slow today" incident.

SQL
Topic — optimization
Query optimization drills

Practice →

SQL
Topic — indexing
Indexing practice problems

Practice →

Solution Using a 5-stage tuning coverage matrix

Code.

-- One canonical coverage matrix — every row maps a tuning stage to an artefact.
CREATE TABLE sql_tuning_coverage AS
SELECT * FROM (VALUES
    (1, 'explain_plan',     'read_plan_from_leaves',        'EXPLAIN ANALYZE + bottleneck node',   'every slow query'),
    (2, 'index_types',      'pick_btree_vs_hash_vs_partial','match index shape to predicate',      'every new query'),
    (2, 'index_types',      'covering_index',               'INCLUDE columns -> index-only scan',  'hot path queries'),
    (3, 'join_algorithm',   'nested_vs_hash_vs_merge',      'driven by row counts + sort order',   'every multi-table query'),
    (4, 'sql_tuning',       'sargable_rewrite',             'WHERE col = ? not FUNC(col) = ?',     'every WHERE clause'),
    (4, 'sql_tuning',       'one_change_per_cycle',         'change one thing, re-EXPLAIN, prove', 'every PR'),
    (4, 'sql_tuning',       'analyze_statistics',           'ANALYZE refreshes histograms',        'after bulk loads'),
    (5, 'cheat_sheet',      'symptom_to_fix',               'seq scan -> index; spill -> work_mem','interview + on-call')
) AS t(stage_id, stage_name, technique, prescription, cadence);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

stage_id stage_name technique prescription cadence
1 explain_plan read_plan_from_leaves EXPLAIN ANALYZE + bottleneck node every slow query
2 index_types pick_btree_vs_hash_vs_partial match index shape to predicate every new query
2 index_types covering_index INCLUDE columns -> index-only scan hot path queries
3 join_algorithm nested_vs_hash_vs_merge driven by row counts + sort order every multi-table query
4 sql_tuning sargable_rewrite WHERE col = ? not FUNC(col) = ? every WHERE clause
4 sql_tuning one_change_per_cycle change one thing, re-EXPLAIN, prove every PR
4 sql_tuning analyze_statistics ANALYZE refreshes histograms after bulk loads
5 cheat_sheet symptom_to_fix seq scan -> index; spill -> work_mem interview + on-call
  1. Row 1 — explain_plan is always the first move; never guess, always look at the plan.
  2. Rows 2-3 — index_types covers both shape (B-tree vs hash vs partial) and the covering trick that eliminates heap fetches.
  3. Row 4 — join_algorithm is what the planner chooses; the prescription is to understand the inputs (row counts, sort order) so you can predict the choice.
  4. Rows 5-7 — sql_tuning is the discipline layer: SARGable rewrites, one change per cycle, and refreshed statistics.
  5. Row 8 — the cheat sheet is the one-screen lookup for production incidents and interviews; given a symptom, name the fix.

Output.

stage_id stage_name technique cadence
1 explain_plan read_plan_from_leaves every slow query
2 index_types pick_btree_vs_hash_vs_partial every new query
3 join_algorithm nested_vs_hash_vs_merge every multi-table query
4 sql_tuning sargable_rewrite + one-change every PR
5 cheat_sheet symptom_to_fix interview + on-call

Why this works — concept by concept:

  • Stage coverage matrix — turns the 5-stage map into an auditable artefact; every tuning technique is owned by exactly one stage, so coverage gaps surface at a glance.
  • Cadence binding — pairs each technique with its trigger (every slow query, every PR, after bulk loads); senior engineers assign cadence per technique, not "tune everything always".
  • One change per cycle — codified as a row, not a culture norm; the discipline is visible in the matrix, not buried in tribal knowledge.
  • Empirical biasexplain_plan is row 1; nothing happens without looking at the plan first. This is the single biggest mindset shift from junior to senior.
  • CostO(1) to read the coverage matrix; the actual tuning is O(query) per cycle but each cycle is bounded by one change, so iterations stay fast.

2. EXPLAIN plan anatomy — reading the tree from leaves to root

Visual diagram of an EXPLAIN plan tree — a root Aggregate node at the top with rows + cost pills, branching down into a Hash Join node, which branches into an Index Scan and a Sequential Scan child; each node shows its estimated cost, rows, and width; a small legend card on the right defining scan types; on a light PipeCode card.

explain plan — the tree, the cost numbers, and what they mean

explain plan is the database's answer to the question "how are you going to run this query?". The output is a tree: leaves are scans (sequential, index, index-only), interior nodes are joins (nested loop, hash, merge) and aggregations (sort-aggregate, hash-aggregate), and the root is whatever produces the final row set (often a Sort or Limit). The cost numbers — cost=startup..total — are the planner's estimate of arbitrary work units, not real wall-clock seconds; EXPLAIN ANALYZE adds the actual wall-clock measurements (actual time=startup..total), the actual rows produced, and the loop count.

The four invariants of every plan.

  • Read leaves to root, not top to bottom. The execution starts at the leaves (scans), then climbs to interior nodes (joins, aggregations), then to the root. The output you see is printed top-down but executed bottom-up.
  • The biggest leaf cost almost always wins. A sequential scan on a 10M-row table dwarfs a nested loop above it; fix the leaf and the parent's cost shrinks proportionally.
  • cost is an estimate; actual time is the truth. Always use EXPLAIN ANALYZE in tuning sessions — the estimate vs reality delta tells you if statistics are stale (estimate way off → run ANALYZE).
  • Loop count matters on Nested Loop. actual time=0.1..0.2 rows=5 loops=12000 means the inner side ran 12,000 times; multiply actual time by loops for the real cost — that's where nested loops on large outers explode.

Scan node families — what each one means.

  • Seq Scan — read every row of the table; cheap on small tables (under ~10k rows) but linear in table size on big tables. The default when no useful index exists or the predicate matches > ~20% of rows (planner threshold varies by engine).
  • Index Scan — walk the B-tree to find matching keys, then fetch each row from the heap; cheap when selectivity is high (few rows match the predicate).
  • Index Only Scan — walk the B-tree and answer the query from the index alone, skipping the heap fetch; requires a covering index where every selected column is either in the key or in INCLUDE.
  • Bitmap Heap Scan — combine multiple indexes via bitmap OR/AND, then fetch rows; useful when several smaller indexes together beat one larger composite.

Join node families — what each one means.

  • Nested Loop — for each row of the outer, look up matching rows in the inner; cheap when the outer is tiny and the inner has a useful index.
  • Hash Join — build a hash table on the smaller side, probe it with rows from the larger side; cheap when both sides are large and at least one fits in work_mem.
  • Merge Join — both sides arrive sorted on the join key, then walk them in lockstep; cheap when sort orders already exist (PK scan, index range scan).

Aggregate node families — what each one means.

  • HashAggregate — build a hash on the group keys, accumulate aggregates per bucket; needs to fit in work_mem per group.
  • GroupAggregate — input is pre-sorted by group keys; streams through it accumulating one group at a time; constant memory.
  • Aggregate (plain) — single-group aggregate (SELECT COUNT(*) FROM t); no grouping, single-pass accumulator.

Worked example — read every node in a 5-node plan

Detailed explanation. Real interviews show you a 5-7 node plan and ask you to narrate it node-by-node. Below is the canonical shape; learn to walk it and you can walk any plan.

Question. Given the EXPLAIN ANALYZE output below for a top-revenue-by-region query, narrate the plan from leaves to root, identify the bottleneck node, and propose the one change that will move the needle.

Input. A fact_orders table (12M rows, B-tree on order_date), dim_customers (200k rows, PK on customer_id), filter on order_date >= '2026-04-01' and GROUP BY c.region with a LIMIT 5.

Code.

EXPLAIN ANALYZE
SELECT c.region, SUM(o.amount) AS rev
FROM   fact_orders o
JOIN   dim_customers c ON c.customer_id = o.customer_id
WHERE  o.order_date >= '2026-04-01'
GROUP  BY c.region
ORDER  BY rev DESC
LIMIT  5;

--                                         QUERY PLAN
-- Limit  (cost=24812.10..24812.12 rows=5 width=18)
--        (actual time=6890.30..6890.31 rows=5 loops=1)
--   ->  Sort  (cost=24812.10..24812.85 rows=300 width=18)
--             (actual time=6890.29..6890.30 rows=5 loops=1)
--         Sort Key: (sum(o.amount)) DESC
--         ->  HashAggregate  (cost=24800.10..24803.85 rows=300 width=18)
--                             (actual time=6889.18..6889.40 rows=300 loops=1)
--               Group Key: c.region
--               ->  Hash Join  (cost=4012.00..24190.40 rows=121940 width=14)
--                              (actual time=85.10..6580.40 rows=123210 loops=1)
--                     Hash Cond: (o.customer_id = c.customer_id)
--                     ->  Index Scan using idx_orders_date on fact_orders o
--                                  (cost=0.42..19811.20 rows=121940 width=14)
--                                  (actual time=0.07..6210.15 rows=123210 loops=1)
--                           Index Cond: (order_date >= '2026-04-01')
--                     ->  Hash  (cost=2812.00..2812.00 rows=200000 width=8)
--                           ->  Seq Scan on dim_customers c
--                                  (cost=0.00..2812.00 rows=200000 width=8)
--                                  (actual time=0.01..78.10 rows=200000 loops=1)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Leaf 1 — Index Scan using idx_orders_date on fact_orders — the planner used the date index to fetch 123,210 matching rows out of 12M; 6.2s actual time. This is the dominant cost.
  2. Leaf 2 — Seq Scan on dim_customers — full scan of 200k rows in 78ms; cheap because the table is small and the entire row set is needed to build the hash side.
  3. Hash build node — builds a hash on dim_customers.customer_id; ~3ms additional cost on top of the seq scan.
  4. Hash Join — probes the hash with each row from fact_orders; total actual time is ~6.6s, of which ~6.2s came from the leaf scan; the join itself only adds ~370ms.
  5. HashAggregate — groups by c.region into ~300 buckets; cheap (~1ms) because the hash fits in memory.
  6. Sort + Limit — sort 300 region totals descending, return top 5; near-zero cost.
  7. Bottleneck. The Index Scan on fact_orders owns 90% of the runtime; the index is being used (good) but it returns 123k rows that all need a heap fetch to read amount and customer_id. The fix: turn the date index into a covering index with INCLUDE (customer_id, amount), which lets the planner use Index Only Scan and skip the 123k heap lookups entirely.

Output (the node-by-node breakdown).

node type rows actual_time_ms bottleneck
Index Scan idx_orders_date leaf 123,210 6,210 YES
Seq Scan dim_customers leaf 200,000 78 no
Hash Join parent 123,210 6,580 inherited
HashAggregate parent 300 6,889 inherited
Sort + Limit root 5 6,890 inherited

Rule of thumb: a parent node's actual time is the sum of its children's actual time plus its own work; if the parent is slow but the children are slow too, fix the children first.

explain plan cost model — what the numbers actually mean

The cost numbers in EXPLAIN look like wall-clock seconds but they are not. They are arbitrary planner work units, calibrated such that seq scanning one disk page costs seq_page_cost = 1.0 (the default). Other operations are scaled relative to that:

  • seq_page_cost = 1.0 — one sequential page read.
  • random_page_cost = 4.0 — one random page read (default; lower on SSD, often tuned to 1.1).
  • cpu_tuple_cost = 0.01 — process one row through a node.
  • cpu_operator_cost = 0.0025 — evaluate one operator (one WHERE clause comparison).
  • cpu_index_tuple_cost = 0.005 — process one row through an index scan.

The planner sums these to produce the estimate. startup_cost is the cost before the first row can be returned (e.g., the entire hash side must be built before a hash join can produce any row); total_cost is the cost to produce all rows. A LIMIT 5 on top of a Sort uses startup_cost heavily — the sort must finish before the limit can take five rows.

The estimate-vs-actual delta is your statistics canary. If rows=121940 (estimate) and actual rows=12,194,000 (reality), your statistics are wildly stale — run ANALYZE on the affected table. A 100x estimate miss is the leading cause of a planner picking the wrong join algorithm (e.g., choosing nested loop because it thinks the outer is tiny, then looping millions of times).

SQL
Topic — optimization
EXPLAIN plan drills

Practice →

SQL
Topic — database
Database internals practice

Practice →

Solution Using a leaf-first plan-reading harness

Code.

-- One canonical pattern: capture plan, identify the worst leaf, propose the one-change fix.
WITH plan_nodes AS (
    SELECT * FROM (VALUES
        ('Index Scan idx_orders_date',  'leaf',   123210, 6210),
        ('Seq Scan dim_customers',      'leaf',   200000,   78),
        ('Hash Join',                   'parent', 123210, 6580),
        ('HashAggregate',               'parent',    300, 6889),
        ('Sort + Limit',                'root',        5, 6890)
    ) AS t(node, kind, rows, actual_time_ms)
)
SELECT
    node,
    kind,
    rows,
    actual_time_ms,
    actual_time_ms - LAG(actual_time_ms, 1, 0)
        OVER (ORDER BY actual_time_ms) AS self_time_ms,
    CASE
        WHEN kind = 'leaf'
         AND actual_time_ms = (SELECT MAX(actual_time_ms)
                                FROM plan_nodes WHERE kind = 'leaf')
        THEN 'BOTTLENECK'
        ELSE 'ok'
    END AS verdict
FROM   plan_nodes
ORDER  BY actual_time_ms DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

node kind rows actual_time_ms self_time_ms verdict
Sort + Limit root 5 6890 1 ok
HashAggregate parent 300 6889 309 ok
Hash Join parent 123210 6580 370 ok
Index Scan idx_orders_date leaf 123210 6210 6132 BOTTLENECK
Seq Scan dim_customers leaf 200000 78 78 ok
  1. Row 1 (Sort + Limit) — root, near-zero self time; inherits all child cost.
  2. Row 2 (HashAggregate) — adds ~309ms of self work to group 123k rows into 300 buckets.
  3. Row 3 (Hash Join) — adds ~370ms to probe; the bulk of its time is inherited from the leaf.
  4. Row 4 (Index Scan) — leaf, ~6.1s self time; this is the bottleneck. The index is used but the planner still does 123k heap fetches.
  5. Row 5 (Seq Scan on dim) — leaf, fast (78ms); too small to matter.

Output.

node actual_time_ms verdict
Index Scan idx_orders_date 6210 BOTTLENECK
Hash Join 6580 ok
HashAggregate 6889 ok
Sort + Limit 6890 ok
Seq Scan dim_customers 78 ok

Why this works — concept by concept:

  • Leaf-first scan — pick the leaf with the highest actual time; that is almost always the bottleneck and the cheapest single fix.
  • Self time vs total timetotal_time - child_time = self_time; isolating self time tells you which node itself is doing work vs which is just waiting on its children.
  • Loops matter on nested loop — a leaf with actual time=0.1..0.2 rows=5 loops=10000 has a real cost of 0.2 * 10000 = 2000ms; always multiply.
  • Estimate vs actual — if planner-estimated rows differ from actual by > 10x, run ANALYZE; the bad estimate is causing bad join-algorithm choices upstream.
  • CostO(nodes) to walk the plan; the actual fix is O(1) (one DDL or rewrite) but the cycle to prove it (re-run EXPLAIN ANALYZE) is the discipline gate.

3. Index types — B-tree, Hash, Partial, Covering (when each wins)

Visual diagram of four index types — B-tree (default, range queries), Hash (equality only), Partial (filtered subset of rows), and Covering / INCLUDE (composite + included columns); each is a small card with a tiny structure icon, a one-line use-case, and a pill marking what queries it accelerates; on a light PipeCode card.

index types — four shapes that cover 95% of queries

index types are not interchangeable: a b-tree index wins on equality and range, a hash index wins on equality only and dies on range, a partial index is a B-tree on a subset of rows and saves enormous space on skewed columns, and a covering / INCLUDE index lets the planner answer the query from the index alone without touching the heap. Pick the wrong shape and the planner ignores the index entirely; pick the right one and the same query goes from a seq scan to an index-only scan.

The four families and when each wins.

  • B-tree — the default; supports =, <, <=, >, >=, BETWEEN, IN, ORDER BY. Used in 80%+ of real-world indexes. Composite B-trees (a, b, c) support queries on a, (a, b), and (a, b, c) but not b alone or c alone — the leftmost-prefix rule.
  • Hash — equality only (=); O(1) lookup, no range support, no ORDER BY support. Niche: very tall single-column equality lookups (e.g., session token by hash). PostgreSQL hash indexes are WAL-logged since 10.0; before that they were unsafe.
  • Partial — a B-tree over only the rows that match a WHERE clause; e.g., CREATE INDEX ix_active_orders ON orders (customer_id) WHERE status = 'active'. Smaller, faster, only useful for queries that share the partial's predicate.
  • Covering / INCLUDE — a composite where some columns are in the key and others are INCLUDEd (non-key payload). Lets the planner do an Index Only Scan — answer the query from the index without a heap fetch. Eliminates one disk seek per matched row.

The leftmost-prefix rule on composite B-trees.

Given CREATE INDEX ix_ab ON t (a, b):

  • WHERE a = ? — uses the index. Yes.
  • WHERE a = ? AND b = ? — uses the index. Yes.
  • WHERE a = ? AND b > ? — uses the index. Yes.
  • WHERE b = ?does not use the index. The leftmost column a is unbound; the B-tree cannot be navigated.
  • WHERE a > ? AND b = ? — uses the index partially; a is a range, b cannot be used as an additional seek key (only as a filter after).

Column order on a composite matters. Order columns by equality predicate first, then range, then sort. A query WHERE region = ? AND created_at BETWEEN ? AND ? ORDER BY created_at DESC wants (region, created_at DESC), not (created_at, region).

Why a covering index is the senior trick. Every Index Scan involves two reads: (1) walk the B-tree to find matching keys, (2) fetch each matching row from the heap (the table itself). A covering index stores all the columns the query needs in the index, so step (2) is skipped — the planner does an Index Only Scan and never touches the heap. For a hot-path query that returns 10k rows, this can save 10k random disk seeks.

Worked example — B-tree vs Hash vs Partial vs Covering on the same predicate

Detailed explanation. Real interviews ask you to design the right index for a specific query. Below is one canonical query and how each of the four index families performs on it.

Question. A fact_orders table has 50M rows. The hot-path query is SELECT customer_id, amount FROM fact_orders WHERE status = 'shipped' AND created_at >= '2026-04-01' ORDER BY created_at DESC LIMIT 100. Design four candidate indexes — one of each family — and predict which one the planner picks.

Input. Table: fact_orders (id PK, customer_id INT, status TEXT, created_at TIMESTAMP, amount NUMERIC). Distribution: 95% of rows have status = 'shipped', 5% are pending/cancelled. The query needs to return 100 most-recent shipped orders since 2026-04-01.

Code.

-- Candidate A: plain B-tree on (status, created_at)
CREATE INDEX ix_a ON fact_orders (status, created_at);

-- Candidate B: hash index on status
CREATE INDEX ix_b ON fact_orders USING HASH (status);

-- Candidate C: partial B-tree (status = 'shipped' subset only)
CREATE INDEX ix_c ON fact_orders (created_at DESC) WHERE status = 'shipped';

-- Candidate D: covering composite with INCLUDE
CREATE INDEX ix_d ON fact_orders (status, created_at DESC) INCLUDE (customer_id, amount);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Candidate A — plain B-tree (status, created_at) — works (leftmost prefix status = 'shipped'), but because status = 'shipped' matches 95% of the table the planner may still choose a seq scan; selectivity is too low to make the index worth it. Verdict: maybe used, maybe ignored.
  2. Candidate B — hash on status — equality match works (status = 'shipped'), but the index returns 47.5M rows (95% of the table) with no ordering and no range support on created_at; the planner falls back to seq scan or uses the hash index plus a sort, both worse than A. Verdict: rarely useful here.
  3. Candidate C — partial B-tree on created_at DESC WHERE status = 'shipped' — the partial only contains the 47.5M shipped rows, but it's sorted by created_at DESC. The planner walks the index from the top, takes the first 100 entries that satisfy created_at >= '2026-04-01', and is done. Verdict: excellent, especially small index footprint.
  4. Candidate D — covering composite (status, created_at DESC) INCLUDE (customer_id, amount) — the planner walks the composite, finds the matching key range, and answers the entire query from the index — no heap fetch needed. Verdict: best, Index Only Scan with zero heap reads for the 100-row LIMIT.
  5. Planner's actual pick. With both C and D present, the planner usually picks D because its Index Only Scan skips the heap entirely; C still requires a heap fetch per matched row to read customer_id and amount. If D doesn't exist, the planner picks C.

Output (the index-family ranking for this query).

candidate shape uses_index? heap_fetches verdict
A B-tree (status, created_at) maybe ~100 maybe ignored
B hash (status) no ~47.5M useless here
C partial B-tree (created_at DESC) WHERE shipped yes ~100 excellent
D covering (status, created_at DESC) INCLUDE (customer_id, amount) yes 0 best

Rule of thumb: if the query returns under ~5% of the table and you can name every selected column, build a covering index — Index Only Scan is the cheapest plan in SQL.

b-tree index and the SARGable rule

SARGable stands for Search-ARGument-able — a predicate the planner can push down into an index seek. The rule: the indexed column must appear alone on one side of the operator, with no function wrapped around it.

Predicate SARGable? Why
WHERE created_at = '2026-05-29' yes column alone on left side
WHERE created_at >= '2026-05-29' yes column alone on left side
WHERE created_at BETWEEN ? AND ? yes column alone on left side
WHERE DATE(created_at) = '2026-05-29' no function wrapped around column → seq scan
WHERE EXTRACT(YEAR FROM created_at) = 2026 no function wrapped around column
WHERE created_at + INTERVAL '1 day' >= NOW() no arithmetic on column side
WHERE LOWER(email) = 'foo@bar.com' no unless functional index on LOWER(email) exists
WHERE email = LOWER('foo@bar.com') yes function on the constant side, not the column
WHERE id IN (1,2,3) yes translates to id = ANY(...)
WHERE id NOT IN (1,2,3) usually no anti-condition rarely uses index

The SARGable rewrite. WHERE DATE(created_at) = '2026-05-29' becomes WHERE created_at >= '2026-05-29' AND created_at < '2026-05-30'. The semantics are identical; the second form uses the index, the first does not.

SQL
Topic — indexing
Index design practice

Practice →

SQL
Topic — filtering
Filtering / WHERE-clause drills

Practice →

Solution Using a covering index with INCLUDE to eliminate heap fetches

Code.

-- The cheapest fast plan in SQL: Index Only Scan via a covering composite.
DROP   INDEX IF EXISTS ix_fact_orders_status_date;
CREATE INDEX ix_fact_orders_status_date
    ON fact_orders (status, created_at DESC)
    INCLUDE (customer_id, amount);

-- Then ANALYZE so the planner sees the new index and refreshed stats.
ANALYZE fact_orders;

-- And verify with EXPLAIN ANALYZE.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, amount
FROM   fact_orders
WHERE  status = 'shipped'
  AND  created_at >= '2026-04-01'
ORDER  BY created_at DESC
LIMIT  100;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step action what it produces
1 DROP old index removes the obsolete (status)-only B-tree
2 CREATE composite + INCLUDE builds the covering index (status, created_at DESC) with payload (customer_id, amount)
3 ANALYZE fact_orders refreshes histograms so the planner trusts the new selectivity
4 EXPLAIN ANALYZE the query confirms Index Only Scan with Heap Fetches: 0
5 Read Buffers line confirms shared hit=X read=0 — everything served from index pages
  1. Step 1 — dropping the prior index avoids leaving two redundant indexes; index maintenance is O(log N) per insert, redundant indexes are pure overhead.
  2. Step 2 — INCLUDE (customer_id, amount) is the key trick; the columns are in the index pages but not part of the B-tree key, so they don't bloat the seek path.
  3. Step 3 — ANALYZE is required because the planner uses stats to decide whether to use the new index; without fresh stats it may default to seq scan.
  4. Step 4 — the EXPLAIN ANALYZE should now report Index Only Scan using ix_fact_orders_status_date and a Heap Fetches: 0 line.
  5. Step 5 — BUFFERS confirms zero heap reads; the entire query is served from cached index pages.

Output.

metric before after
Scan node Seq Scan Index Only Scan
Rows returned 100 100
Heap fetches ~50M (full scan) 0
Actual time 28,400 ms 1.2 ms
Plan flip reason covering index unblocks Index Only Scan

Why this works — concept by concept:

  • Covering index — every selected column lives in the index pages, so the planner skips the heap fetch entirely; this is the single biggest win you can get on a hot-path read query.
  • INCLUDE vs key columnsINCLUDE columns ride along as payload but don't widen the B-tree key, so seeks stay fast; key columns slow down inserts proportionally.
  • Descending sort in the keycreated_at DESC in the index key lets the planner satisfy ORDER BY created_at DESC LIMIT 100 by walking the index in order; no separate Sort node.
  • ANALYZE after DDL — without refreshed stats, the planner may not pick the new index; this step is non-optional and frequently forgotten.
  • CostO(log N + K) where K is the limit (100); the heap fetch was O(K) random seeks, now zero. Disk-side this is roughly a 1000x improvement on the K = 100 path.

4. Join algorithms — Nested Loop, Hash Join, Merge Join (and when planners pick each)

Visual diagram of three SQL join algorithms — Nested Loop (small driver + index lookup), Hash Join (build hash on smaller side, probe with larger), Merge Join (both sides pre-sorted then merged); each is a small panel with a mini-illustration, big-O complexity pill, and a one-line 'use when' card; on a light PipeCode card.

join algorithms — three shapes, three decision rules

You do not pick the join algorithms; the planner picks them based on table sizes, available indexes, and existing sort orders. But you do pick the indexes and the SQL shape that nudge the planner toward the right algorithm — and you must be able to predict which algorithm the planner will choose so you can build the right index up front.

The three families.

  • nested loopfor each row in outer: lookup matching rows in inner. Cheap when the outer is tiny and the inner has a useful index on the join key. Complexity: O(N × log M) with an index on the inner, O(N × M) without.
  • hash joinbuild hash table on smaller side; probe it with rows from larger side. Cheap when both sides are large and the smaller side fits in work_mem. Complexity: O(N + M) if the hash fits, much worse if it spills to disk.
  • merge joinboth sides sorted on join key; walk them in lockstep. Cheap when sort orders already exist (PK scan, index range scan) or the input is already sorted. Complexity: O(N + M) if pre-sorted, O(N log N + M log M) if sorts are required.

The decision matrix.

outer size inner size inner index on join key both sides sorted planner picks
small (< 10k) any yes no nested loop
large large no useful index no hash join
large large yes on both yes merge join
large large yes on one side no hash or nested loop (depends on selectivity)
any any no yes (CLUSTERED on join key) merge join

Why the planner picks what it picks.

  • nested loop dominates when the outer is tiny because the total work is outer_rows × inner_seek_cost; a 5-row outer doing 5 indexed lookups is unbeatable.
  • hash join dominates on bulk equi-joins where both sides are large; you pay one full scan per side plus a hash build, then probe in O(1) per row.
  • merge join dominates when both sides are already sorted on the join key (e.g., joining two range-scan results from indexes); the merge is a single pass with no hash overhead.
  • The planner switches when statistics shift. A query that runs with hash join today may switch to nested loop tomorrow if ANALYZE reveals the outer side is now much smaller; this is intentional and almost always correct.

Worked example — predict the join algorithm before EXPLAIN

Detailed explanation. Real interviews ask you to predict the join algorithm before showing you the plan. Below are three canonical join shapes; build the prediction reflex.

Question. For each of the three join scenarios below, predict the join algorithm the planner will pick and justify it in one sentence.

Input. Three scenarios on a fact_orders (10M rows) + dim_customers (200k rows) + dim_products (50 rows) schema.

Code.

-- Scenario A: tiny dim joined to a large fact, indexed PK on dim
SELECT o.id, p.name
FROM   fact_orders o
JOIN   dim_products p ON p.product_id = o.product_id
WHERE  p.category = 'electronics';   -- filter narrows dim_products to 5 rows

-- Scenario B: both sides large, no useful index on the join key in fact
SELECT o.id, c.region
FROM   fact_orders o
JOIN   dim_customers c ON c.customer_id = o.customer_id;
-- (no index on fact_orders.customer_id)

-- Scenario C: both sides already sorted by the join key (PK scan on each)
SELECT o.id, c.region
FROM   fact_orders o
JOIN   dim_customers c ON c.customer_id = o.customer_id
ORDER  BY o.customer_id;
-- (indexes exist on both join keys, query returns ordered output)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Scenario A — Nested Loop. The WHERE p.category = 'electronics' filter reduces dim_products to 5 rows; for each of those 5 rows the planner does an indexed lookup against fact_orders (assuming an index on product_id). Cost: 5 × log(10M) ≈ 5 × 23 = 115 index seeks. Unbeatable.
  2. Scenario B — Hash Join. Both sides are large (10M and 200k); there is no index on fact_orders.customer_id, so nested loop would be 10M × full scan of dim = catastrophic. The planner builds a hash on dim_customers (200k rows fits in work_mem), then probes once per fact row. Cost: 10M + 200k row reads, plus the hash build.
  3. Scenario C — Merge Join. Both sides are scanned in customer_id order via their respective B-tree indexes; the merge walks both streams in lockstep, emitting matches. Cost: 10M + 200k row reads, no hash overhead, and the output is already sorted (the ORDER BY is free).
  4. The planner needs accurate stats. If ANALYZE is stale and the planner thinks dim_products has 5,000 rows after filter (when reality is 5), it may pick hash join in Scenario A — and pay a hash-build cost on a near-empty hash.
  5. Override hint. If you know the planner picked wrong, you can force the choice with planner hints (SET enable_nestloop = off / SET enable_hashjoin = off) in PostgreSQL; in production it's almost always better to fix statistics or rewrite the SQL.

Output (the join-algorithm prediction matrix).

scenario outer inner inner_index predicted reason
A dim_products (5 rows after filter) fact_orders (10M) yes (product_id) Nested Loop tiny outer + indexed inner
B fact_orders (10M) dim_customers (200k) no on fact.customer_id Hash Join no index, both large
C fact_orders (10M, indexed) dim_customers (200k, indexed) yes both sides, sorted Merge Join both pre-sorted on join key

Rule of thumb: tiny outer → nested loop; both big, no useful index → hash; both big, both sorted → merge. Memorise these three.

hash join deep dive — why it's the workhorse

hash join is the most common algorithm on modern OLAP / warehouse queries because the typical shape is fact table joined to small dim with no useful index on the fact side. The mechanics:

  • Build phase. Scan the smaller side; for each row, hash the join key and insert into a hash table. Time: O(M), space: O(M) in work_mem.
  • Probe phase. Scan the larger side; for each row, hash the join key, look it up in the hash table, emit matches. Time: O(N).
  • Spill to disk. If the hash exceeds work_mem, partitions are spilled to disk; performance degrades catastrophically. Always size work_mem to fit the build side of your largest hash join.
  • Build side selection. The planner picks the smaller side as the build side; if statistics are wrong it may pick the larger side and OOM. Check Hash node actual rows vs Memory Usage: in EXPLAIN ANALYZE — if reality is 10x the estimate, run ANALYZE.

nested loop deep dive — the silent killer.

nested loop is the cheapest plan when the outer is tiny — and the most expensive plan when the outer is large. The asymmetry is outer_rows × inner_seek_cost:

  • Outer = 5 rows, inner = 10M with index5 × log(10M) = ~115 seeks → ~1ms.
  • Outer = 1M rows, inner = 10M with index1M × log(10M) = ~23M seeks → minutes.
  • Outer = 1M rows, inner = 10M without index1M × 10M = 10 trillion comparisons → effectively forever.

The bug pattern: the planner predicts a 5-row outer (because stats are stale) and picks nested loop; reality is 1M rows and the query melts the server. The fix: ANALYZE the outer table; the planner re-plans next run.

merge join deep dive — the niche but unbeatable choice.

merge join requires both sides to arrive sorted on the join key. When they do — typically because both sides are scanned via a B-tree on the join key — it's a single linear pass with no hash overhead. The output is also sorted on the join key, so downstream GROUP BY on the join key or ORDER BY on the join key is free.

  • When the planner picks it. Both sides have a B-tree on the join key, the result is large, and downstream operators benefit from the sort order.
  • When it loses. One side is unsorted; the planner would need to sort it explicitly, and O(N log N) + O(M log M) sorting cost usually exceeds O(N + M) hash join cost.

SQL
Topic — joins
Join algorithm drills

Practice →

SQL
Topic — sql
SQL practice library

Practice →

Solution Using a decision-tree that picks the join algorithm from inputs

Code.

-- One canonical decision tree, materialised as a lookup table.
CREATE TABLE join_algorithm_chooser AS
SELECT * FROM (VALUES
    ('tiny_outer + indexed_inner',  'Nested Loop', 'O(N * log M)',  'small driver, indexed lookup per row'),
    ('large + large, no_index',     'Hash Join',   'O(N + M)',      'build hash on smaller side, probe larger'),
    ('large + large, both_sorted',  'Merge Join',  'O(N + M)',      'walk both sides in lockstep, output sorted'),
    ('large + large, one_indexed',  'Hash or NL',  'depends',       'planner picks via selectivity estimate'),
    ('any + any, both_in_memory',   'Hash Join',   'O(N + M)',      'no I/O cost; hash always wins'),
    ('any + any, work_mem_too_small','Hash spill', 'O((N+M) * spill_factor)','build spills to disk; tune work_mem')
) AS t(input_shape, algorithm, complexity, intuition);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

input_shape algorithm complexity intuition
tiny_outer + indexed_inner Nested Loop O(N * log M) small driver, indexed lookup per row
large + large, no_index Hash Join O(N + M) build hash on smaller side, probe larger
large + large, both_sorted Merge Join O(N + M) walk both sides in lockstep, output sorted
large + large, one_indexed Hash or NL depends planner picks via selectivity estimate
any + any, both_in_memory Hash Join O(N + M) no I/O cost; hash always wins
any + any, work_mem_too_small Hash spill O((N+M) * spill_factor) build spills to disk; tune work_mem
  1. Row 1 — nested loop wins on tiny outer because N × log M is small when N is small; no other algorithm beats it.
  2. Row 2 — hash join is the OLAP workhorse; one full scan per side plus a hash build, then probe in O(1) per row.
  3. Row 3 — merge join is unbeatable when both sides are pre-sorted; no hash overhead and free downstream sort.
  4. Row 4 — when one side has an index and one does not, the planner estimates selectivity; if filter is narrow, nested loop; if wide, hash.
  5. Row 5 — when both sides fit in memory the I/O cost vanishes and hash is unbeatable; this is the warehouse-on-SSD common case.
  6. Row 6 — when the build side exceeds work_mem, the hash spills to disk and performance degrades 10-100x; tune work_mem per session.

Output.

input_shape algorithm
tiny_outer + indexed_inner Nested Loop
large + large, no_index Hash Join
large + large, both_sorted Merge Join
large + large, one_indexed Hash or NL
any + any, both_in_memory Hash Join
any + any, work_mem_too_small Hash spill (tune)

Why this works — concept by concept:

  • Three families, three rules — nested loop for tiny outer, hash for big × big, merge for pre-sorted; everything else is a planner judgement call.
  • Build side selection — hash join always builds on the smaller side; if stats are wrong the planner may build on the larger side and OOM. ANALYZE keeps this honest.
  • Pre-sorted is free — a B-tree range scan returns rows in key order at zero extra cost; merge join exploits this directly.
  • Loop count multiplies — a nested loop with a large outer multiplies inner seek cost by outer rows; this is why the algorithm dies on big outers.
  • Cost — nested loop O(N × log M), hash O(N + M) in memory / O((N+M) × spill_factor) on disk, merge O(N + M) if pre-sorted. Match input shape to algorithm and the planner picks correctly.

5. The six-step tuning playbook (capture → EXPLAIN → bottleneck → rewrite/index → ANALYZE → compare)

Visual playbook diagram of a six-step SQL tuning workflow — Capture slow query → EXPLAIN ANALYZE → Find bottleneck node → Rewrite or add index → ANALYZE statistics → Re-run + compare; each step is a small card with a tiny icon and a one-line action; on a light PipeCode card.

The six-step sql tuning playbook — discipline, not heroics

sql tuning is a discipline, not a black art. The six-step playbook is the same loop senior data engineers run for every slow query they're handed: capture the slow query with its real parameters, EXPLAIN it (always ANALYZE, always with the real parameters), find the bottleneck node (worst leaf or worst loops × time), rewrite or add an index (exactly one change), ANALYZE the affected tables so the planner sees fresh stats, re-run + compare the new plan against the old. Repeat until the SLA is met. One change per cycle is non-negotiable.

Step 1 — Capture the slow query (with real parameters).

  • Use pg_stat_statements (PostgreSQL) / Query Store (SQL Server) / INFORMATION_SCHEMA.PROCESSLIST (MySQL) / query_history (Snowflake) / INFORMATION_SCHEMA.JOBS_BY_PROJECT (BigQuery) to find the query.
  • Capture the actual parameters the slow execution used; never EXPLAIN with WHERE col = 'a' if production runs WHERE col = ? with a high-cardinality value.
  • Note the wall-clock time, rows returned, and the SLA the query is missing — you need a target to know when you're done.

Step 2 — EXPLAIN ANALYZE the query (always ANALYZE).

  • EXPLAIN shows the planner's estimate; EXPLAIN ANALYZE actually runs the query and shows real time + rows. Use ANALYZE every time in a tuning session.
  • Add BUFFERS to see disk vs cache reads: EXPLAIN (ANALYZE, BUFFERS) SELECT .... A high read count on the bottleneck node is a sign you're hitting cold pages.
  • On destructive queries (UPDATE / DELETE), wrap in a BEGIN; ... ROLLBACK; block so EXPLAIN ANALYZE can run without modifying data.

Step 3 — Find the bottleneck node.

  • Walk leaves first; the leaf with the highest actual time is almost always the bottleneck.
  • Check loops on nested loops; actual time × loops is the real cost.
  • Check estimate vs actual rows; a > 10x miss means stale stats are causing wrong plans upstream.
  • Check Memory Usage: on hash nodes; if it shows Disk: X kB, the hash spilled — tune work_mem or rewrite to avoid the hash.

Step 4 — Rewrite or add an index (exactly one change).

  • If the bottleneck is a Seq Scan on a big table, add an index that matches the predicate.
  • If the bottleneck is an Index Scan with many heap fetches, convert to a covering index with INCLUDE.
  • If the bottleneck is a function-wrapped predicate (WHERE DATE(col) = ?), rewrite to SARGable form (WHERE col >= ? AND col < ?).
  • If the bottleneck is a hash spill, increase work_mem for this session (SET work_mem = '256MB') and re-run.
  • Only one change. If you add an index and rewrite the predicate and change work_mem all at once, you cannot attribute the win to a single cause and may ship a regression.

Step 5 — ANALYZE the affected tables.

  • After any DDL (CREATE INDEX, ALTER TABLE), run ANALYZE table_name so the planner sees the new index / new column shape.
  • After bulk loads (COPY, INSERT ... SELECT), run ANALYZE on the loaded table; without fresh stats the planner uses pre-load histograms and picks wrong plans.
  • ANALYZE itself is cheap (O(sample)) — it samples ~30,000 rows per column by default.
  • Skipping this step is the #1 cause of "I added the index but the plan didn't change" tickets.

Step 6 — Re-run + compare.

  • Run EXPLAIN ANALYZE again with the same parameters; compare actual time, plan shape, and rows.
  • Keep a tuning log: query, before plan, change made, after plan, latency delta. This is the artefact you bring to a senior interview.
  • If the new plan is worse, revert the change — never "ship and hope".
  • If the new plan is better but still misses the SLA, iterate: go back to step 3, find the next bottleneck.

Worked example — run the full six-step playbook on one query

Detailed explanation. Real interviews give you a slow query and ask you to walk the playbook out loud. Below is one canonical query and the full six-step trace.

Question. A reporting query runs in 32 seconds; the SLA is 2 seconds. Walk the six-step tuning playbook on this query, naming the change you make at each step and the expected latency after.

Input. Query: SELECT region, SUM(amount) FROM fact_orders WHERE DATE(created_at) = '2026-05-28' GROUP BY region. Table: 80M rows, indexed on (created_at), ~50k matching rows.

Code.

-- STEP 1 — Capture (from pg_stat_statements)
-- query: SELECT region, SUM(amount) FROM fact_orders WHERE DATE(created_at) = $1 GROUP BY region
-- mean_exec_time_ms: 32140
-- target: <2000

-- STEP 2 — EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT region, SUM(amount)
FROM   fact_orders
WHERE  DATE(created_at) = '2026-05-28'
GROUP  BY region;

-- Planner output (abridged):
-- HashAggregate  (cost=2,140,000 .. 2,140,001 rows=300 width=20)
--                (actual time=31,800 .. 31,820 rows=300)
--   ->  Seq Scan on fact_orders  (cost=0 .. 2,140,000 rows=400,000 width=20)
--                                  (actual time=12 .. 31,500 rows=50,000)
--         Filter: (date(created_at) = '2026-05-28')
--         Buffers: shared read=940,000

-- STEP 3 — Bottleneck: Seq Scan owns 31.5s; Filter is function-wrapped (DATE(created_at))
--          so the existing index on (created_at) is unused.

-- STEP 4 — Rewrite to SARGable form (one change)
EXPLAIN (ANALYZE, BUFFERS)
SELECT region, SUM(amount)
FROM   fact_orders
WHERE  created_at >= '2026-05-28' AND created_at < '2026-05-29'
GROUP  BY region;

-- STEP 5 — ANALYZE if stats are stale (skipped here; stats fresh after recent ANALYZE)
ANALYZE fact_orders;

-- STEP 6 — Re-run + compare
-- New planner output:
-- HashAggregate  (cost=18,400 .. 18,401 rows=300 width=20)
--                (actual time=1,310 .. 1,320 rows=300)
--   ->  Index Scan using ix_fact_orders_created_at on fact_orders
--                                  (cost=0.42 .. 18,000 rows=50,000 width=20)
--                                  (actual time=0.18 .. 1,150 rows=50,000)
--         Index Cond: (created_at >= '2026-05-28' AND created_at < '2026-05-29')
--         Buffers: shared hit=1,400 read=4,200
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Capturepg_stat_statements flags the query at 32s mean; target is 2s.
  2. EXPLAIN ANALYZE — the planner shows Seq Scan on fact_orders with Filter: date(created_at) = '2026-05-28'; the index on (created_at) exists but is unused.
  3. Bottleneck identificationSeq Scan owns 31.5s of the 31.8s total; the Filter clause wraps created_at in DATE(...) which is not SARGable, so the planner cannot push the predicate into an index seek.
  4. Rewrite (one change) — replace DATE(created_at) = '2026-05-28' with created_at >= '2026-05-28' AND created_at < '2026-05-29'; semantics identical, second form is SARGable.
  5. ANALYZEANALYZE fact_orders ensures the planner trusts the row estimate for the new predicate; skipped only if stats were refreshed recently.
  6. Re-run + compare — new plan is Index Scan using ix_fact_orders_created_at, actual time 1.3s. Win: 32s → 1.3s, ~25x improvement, SLA met. No further tuning needed.

Output (the before/after comparison).

metric before after
Scan type Seq Scan + Filter Index Scan
Buffers (shared read) 940,000 4,200
Actual time (ms) 31,820 1,320
Plan flip reason SARGable rewrite unlocked existing index
Changes made 1 (SARGable rewrite)
SLA met? no (32s vs 2s) yes (1.3s vs 2s)

Rule of thumb: one well-chosen change can produce a 10-30x speedup. If you cannot point at the one change that produced the win, you changed too many things at once.

The five most common anti-patterns and how to fix them

Anti-pattern 1 — WHERE FUNC(col) = ?. Function on the indexed column prevents the planner from using the index. Fix: rewrite to SARGable form, or create a functional index (CREATE INDEX ix_lower_email ON users (LOWER(email))).

Anti-pattern 2 — SELECT * on a wide table. Forces the planner to do heap fetches for every row even when a covering index could answer the query. Fix: name only the columns you need, then build a covering index over them.

Anti-pattern 3 — OR across tables. WHERE a.x = ? OR b.y = ? cannot use either index. Fix: rewrite as UNION of two single-predicate queries.

Anti-pattern 4 — Correlated subquery in SELECT. SELECT (SELECT COUNT(*) FROM child c WHERE c.pid = p.id) FROM parent p re-runs the subquery per row. Fix: rewrite as LEFT JOIN with GROUP BY or window function; one pass instead of N.

Anti-pattern 5 — Implicit type coercion on join. JOIN customers c ON c.id = o.customer_id_varchar where c.id INT and o.customer_id_varchar TEXT forces an implicit CAST that defeats the index. Fix: align types in the schema; never cross types on a join key.

SQL
Topic — optimization
SQL tuning playbook drills

Practice →

SQL
Topic — aggregation
Aggregation + GROUP BY drills

Practice →

Solution Using a tuning-log artefact you keep per query

Code.

-- Persist your tuning history; every senior engineer keeps one of these.
CREATE TABLE sql_tuning_log AS
SELECT * FROM (VALUES
    (1, 'top_revenue_by_region', 'before', 'Seq Scan + DATE() filter', 32140, 940000, 'baseline'),
    (2, 'top_revenue_by_region', 'rewrite_sargable', 'Index Scan on (created_at)', 1320, 4200, 'SARGable rewrite -> existing index used'),
    (3, 'top_revenue_by_region', 'add_covering', 'Index Only Scan with INCLUDE(region, amount)', 410, 320, 'covering INCLUDE eliminates heap fetches'),
    (4, 'top_revenue_by_region', 'partitioned_table', 'Index Only Scan on daily partition', 180, 110, 'partition pruning halves scanned pages'),
    (5, 'top_revenue_by_region', 'final', 'meets SLA: 180ms < 2000ms', 180, 110, 'shipped; no further tuning needed')
) AS t(step, query_name, change_label, plan_after, actual_time_ms, buffers_read, notes);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step query_name change_label plan_after actual_time_ms buffers_read notes
1 top_revenue_by_region before Seq Scan + DATE() filter 32140 940000 baseline
2 top_revenue_by_region rewrite_sargable Index Scan on (created_at) 1320 4200 SARGable rewrite
3 top_revenue_by_region add_covering Index Only Scan with INCLUDE 410 320 covering INCLUDE
4 top_revenue_by_region partitioned_table Index Only Scan on daily partition 180 110 partition pruning
5 top_revenue_by_region final meets SLA 180 110 shipped
  1. Step 1 — baseline; record the exact slow plan, latency, and buffer reads before changing anything.
  2. Step 2 — one change: SARGable rewrite. Re-EXPLAIN, record new plan, new latency, new buffer reads. 32s → 1.3s.
  3. Step 3 — one change: add covering index with INCLUDE. Re-EXPLAIN. 1.3s → 410ms.
  4. Step 4 — one change: range-partition by day and let partition pruning skip irrelevant partitions. 410ms → 180ms.
  5. Step 5 — SLA met (180ms vs 2s target); stop tuning. Over-tuning past the SLA is wasted effort.

Output.

step actual_time_ms speedup vs baseline
1 (before) 32140 1.0x
2 (SARGable) 1320 24x
3 (covering) 410 78x
4 (partitioned) 180 178x
5 (final) 180 178x (SLA met)

Why this works — concept by concept:

  • One change per row — every entry in the tuning log records exactly one change; this is the audit trail that proves you didn't ship a guess.
  • Plan-after column — captures the shape of the plan, not just the latency; latency without plan context is unfalsifiable.
  • Buffers as a second metricbuffers_read is the I/O proxy; latency can be cache-warm noise, buffer counts are deterministic.
  • Stop at the SLA — step 5 is the discipline gate; once the SLA is met, stop tuning and ship. Senior engineers do not over-tune past the requirement.
  • CostO(cycles) where each cycle is one EXPLAIN ANALYZE + one DDL or rewrite; the log itself is O(rows) to read and the artefact you bring to the post-mortem (or the interview).

Choosing the right tuning move (cheat sheet)

A one-screen cheat sheet for sql query optimization — given a symptom in EXPLAIN ANALYZE, pick the move that fixes it.

You see in the plan … Likely cause First move Cadence
Seq Scan on a big table with a selective predicate no useful index CREATE INDEX matching the predicate new query
Seq Scan with Filter: FUNC(col) = ? non-SARGable predicate rewrite to col = ? or build functional index every PR
Index Scan with high Heap Fetches: non-covering index add INCLUDE (...) for selected columns hot path
Hash node with Disk: X kB hash spill raise work_mem for the session per query
Nested Loop with loops= very large bad outer estimate, often stale stats ANALYZE + check planner row estimate regression
Index Scan ignored, planner picks Seq Scan low selectivity (>20% match) reconsider whether index helps; or partial index on hot subset review
Sort node burning seconds output not pre-sorted add ORDER BY col DESC to index key hot path
Bitmap Heap Scan slow on huge result many random heap reads covering index OR rewrite to narrow predicate hot path
Planner-estimated rows 100x off actual stale statistics ANALYZE table_name regression
WHERE created_at + INTERVAL '1 day' >= NOW() arithmetic on column rewrite to created_at >= NOW() - INTERVAL '1 day' every PR
IN (subquery) with large subquery semi-join blow-up rewrite as EXISTS or JOIN ... GROUP BY review
OR across two tables un-indexable disjunction rewrite as UNION ALL of two queries review
CAST on join key (INT = TEXT) implicit type coercion defeats index align schema types; never cross types on join schema
Same query, plan flipped overnight autovacuum reset stats check last_analyze; re-run ANALYZE if old on-call
Aggregate on huge table, no GROUP BY full scan to compute COUNT(*) materialised view or pg_stat_user_tables.n_live_tup nightly

Frequently asked questions

What's the single most important query optimization techniques reflex to build?

Run EXPLAIN ANALYZE before you change anything. The single biggest gap between junior and senior engineers is the willingness to look at the plan before forming a hypothesis. Junior engineers reason about queries from first principles (the predicate looks selective, so the planner must use the index); senior engineers run EXPLAIN ANALYZE, see the plan the planner actually picked, and only then propose a change. Every other technique in this guide — SARGable rewrites, covering indexes, join-algorithm prediction, statistics refresh — is downstream of that single reflex. The mantra: don't guess, look at the plan.

How do I read an explain plan quickly under interview pressure?

Walk leaves to root, not top to bottom. The plan prints top-down but executes bottom-up; the bottom-most nodes (Seq Scan, Index Scan, Index Only Scan) are the leaves where actual work begins. Find the leaf with the highest actual time; that's almost always the bottleneck. Check loops on any Nested Loop parent — actual time × loops is the real cost. Check estimate-vs-actual rows; a > 10x miss means stale stats are causing wrong plans upstream. With those three habits you can narrate any 5-10 node plan in under a minute, which is exactly what the interviewer wants.

When should I pick a b-tree index vs a hash index vs a partial index?

B-tree is the default — pick it for any column you query with =, <, <=, >, >=, BETWEEN, IN, or ORDER BY. Hash is niche — equality only, no ranges, no sort; reach for it only when you have very tall single-column equality lookups and the engine supports WAL-logged hash indexes (PostgreSQL 10+). Partial is a B-tree on a subset of rows — pick it when your queries always filter on the same boolean-ish predicate (WHERE status = 'active'); the partial is smaller, faster, and skips index maintenance on rows it doesn't cover. Covering / INCLUDE is the senior trick — pick it for hot-path queries where you can name every selected column; it unlocks Index Only Scan and skips the heap fetch entirely. In practice, ~80% of production indexes are plain B-trees, ~15% are covering composites, ~5% are partial, and hash indexes show up only in very specific niches.

What's the difference between nested loop, hash join, and merge join — and when does each win?

nested loop is for each row in outer: lookup in inner; it wins when the outer is tiny (under ~10k rows) and the inner has a useful index — total cost is outer_rows × inner_seek_cost, which is unbeatable on small outers. hash join builds a hash on the smaller side and probes with the larger; it wins on big × big equi-joins with no useful index — total cost is O(N + M) if the build fits in work_mem. Merge join requires both sides arrive sorted on the join key; it wins when sort orders already exist (e.g., both sides scanned via a B-tree on the join key) and the output benefits downstream from being pre-sorted. You do not pick the algorithm — the planner does, based on table sizes, indexes, and statistics — but you must be able to predict the choice so you can build the right index up front. The decision matrix: tiny outer → nested loop; big × big, no index → hash; big × big, both sorted → merge.

Why does my query that worked yesterday suddenly run slowly today?

Almost always stale statistics. The query optimizer relies on histograms gathered by ANALYZE to estimate row counts; if the data distribution shifted overnight (bulk load, partition swap, schema change) and ANALYZE hasn't re-run, the planner is making decisions on yesterday's reality. The fix is one command: ANALYZE table_name. Other common causes are autovacuum interruptions, parameter sniffing on prepared statements (where the first param value cached a plan that's wrong for subsequent values), and silent index bloat (PostgreSQL's pgstattuple extension can confirm). Check pg_stat_user_tables.last_analyze for the affected table first; if it's older than your latest bulk load, run ANALYZE and re-EXPLAIN before debugging anything else.

How do I rewrite a non-SARGable predicate into a SARGable one?

The rule: the indexed column must appear alone on one side of the operator, with no function or arithmetic wrapped around it. WHERE DATE(created_at) = '2026-05-28' is non-SARGable; rewrite to WHERE created_at >= '2026-05-28' AND created_at < '2026-05-29'. WHERE EXTRACT(YEAR FROM created_at) = 2026 is non-SARGable; rewrite to WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'. WHERE created_at + INTERVAL '1 day' >= NOW() is non-SARGable; rewrite to WHERE created_at >= NOW() - INTERVAL '1 day' (move the arithmetic to the constant side). WHERE LOWER(email) = 'foo@bar.com' is non-SARGable on a plain index, but becomes SARGable if you add a functional index CREATE INDEX ix_lower_email ON users (LOWER(email)). This single rewrite class fixes ~30% of slow queries in production.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL drills keyed to the same sql query optimization skills this guide teaches (reading explain plan, designing b-tree index + covering composites, predicting nested loop vs hash join vs merge, SARGable rewrites, and the six-step sql tuning playbook). Whether you're prepping for a query optimization techniques round the night before a senior screen, or building the daily reps that turn 30-second queries into 300-millisecond ones over months, the practice library mirrors the same five-stage mental model — plus the index types, join algorithms, and explain plan cost-model intuition you'll wire into your production tuning workflow.

Top comments (0)