PostgreSQL Query Rewriting Techniques
The previous articles in this series covered performance problems you fix by adding indexes, restructuring joins, or tuning memory. This one is about the queries where the plan is "fine" — every node is doing something reasonable — but the query itself is asking the wrong question, producing unnecessarily large intermediate results or forcing the planner down a path that a different SQL shape would avoid.
These rewrites don't change what the query returns; they change how PostgreSQL goes about computing it. Once you learn to recognise the patterns, most of them are mechanical — if the original form matches X, rewrite to Y — and the performance improvement is often an order of magnitude or more with no downside.
This article is the seventh in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. Every EXPLAIN block below is captured from the same Neon Postgres 17.8 database used throughout.
Offset pagination → keyset pagination
The single highest-impact rewrite in this article. OFFSET N LIMIT M is the default pagination shape in most ORMs and REST API frameworks. It's also a performance landmine as soon as users deep-paginate. To return page 1000 of 500,000 rows (20 per page), PostgreSQL must read and discard 19,980 rows before returning the 20 you want. Page 1 is fast; page 1000 is slow; page 10000 is a disaster.
Captured against our 500,000-row sim_bp_orders table — "page 24000 of 25000, 20 orders per page":
SELECT order_id, user_id, created_at
FROM sim_bp_orders
ORDER BY created_at DESC
LIMIT 20 OFFSET 480000;
Limit (cost=28511.34..28512.53 rows=20 width=16) (actual time=1900.713..1900.731 rows=20 loops=1)
Buffers: shared hit=481693 read=1775
-> Index Scan Backward using idx_sim_bp_orders_created_at
(actual time=0.018..1878.609 rows=480020 loops=1)
Execution Time: 1900.750 ms
1.9 seconds for 20 rows. The Index Scan Backward returns rows=480020 before the Limit takes 20 — PostgreSQL walked the created_at index backwards, visited every heap tuple for visibility checks, and discarded 99.996% of them. Buffers: shared hit=481693 read=1775 is 3.8 GB of page traffic for a result the size of a tweet.
The fix is keyset pagination — instead of OFFSET 480000, remember the cursor value of the last row you returned and ask for rows less than that:
-- Pass the created_at of the last row from the previous page as the cursor
-- (composite-cursor version shown further down).
SELECT order_id, user_id, created_at
FROM sim_bp_orders
WHERE created_at < '2024-03-01'
ORDER BY created_at DESC
LIMIT 20;
Limit (actual time=0.979..1.014 rows=20 loops=1)
Buffers: shared hit=22 read=1
-> Index Scan Backward using idx_sim_bp_orders_created_at
Index Cond: (created_at < '2024-03-01'::timestamptz)
(actual time=0.978..1.010 rows=20 loops=1)
Execution Time: 1.032 ms
1 ms, 23 buffers hit. The Index Cond means the planner could start the index scan from the cursor position rather than the beginning — no discarded rows, no wasted buffer reads. Page 1 and page 10,000 have identical cost.
Three things to know about keyset pagination:
- Use a composite cursor for uniqueness. ORDER BY created_at DESC isn't a deterministic total order unless created_at is unique. For production systems, use (created_at, id) or similar: WHERE (created_at, order_id) < ('2024-03-01 14:22:00+00', 984523) ORDER BY created_at DESC, order_id DESC LIMIT 20. This ensures no rows are skipped or duplicated at page boundaries when multiple rows share the same timestamp (spelled out in the sketch after this list).
- The index has to match the sort. ORDER BY created_at DESC, order_id DESC works against (created_at DESC, order_id DESC) directly or (created_at, order_id) read backwards. Mismatches force an in-memory sort that undoes the keyset win.
- You give up random-access "jump to page N" semantics. Keyset pagination is forward/backward through an ordered stream. Most APIs and infinite-scroll UIs don't actually need random access; if yours does, you're stuck with OFFSET (or need a completely different data model).
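Spelled out, the composite-cursor version looks like this — the index name and the cursor literals are illustrative, not values captured from the test database:
-- Hypothetical composite index; read backwards it serves the DESC, DESC sort.
CREATE INDEX idx_sim_bp_orders_created_at_order_id
    ON sim_bp_orders (created_at, order_id);
-- Next page: everything that sorts after the cursor in the DESC, DESC ordering.
SELECT order_id, user_id, created_at
FROM sim_bp_orders
WHERE (created_at, order_id) < ('2024-03-01 14:22:00+00', 984523)
ORDER BY created_at DESC, order_id DESC
LIMIT 20;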
Correlated scalar subquery → aggregating JOIN
A scalar subquery in the SELECT list runs once per outer row (SubPlan N in the plan). When the outer set is large, this is O(n²). The rewrite is a LEFT JOIN to a pre-aggregated table or CTE:
-- Before: SubPlan runs once per user.
SELECT u.user_id,
u.email,
(SELECT count(*) FROM sim_bp_orders o
WHERE o.user_id = u.user_id AND o.status = 'pending') AS pending_count
FROM sim_bp_users u
WHERE u.status = 'active';
-- After: single aggregation, left-joined.
SELECT u.user_id, u.email, COALESCE(p.pending_count, 0) AS pending_count
FROM sim_bp_users u
LEFT JOIN (
SELECT user_id, count(*) AS pending_count
FROM sim_bp_orders
WHERE status = 'pending'
GROUP BY user_id
) p ON p.user_id = u.user_id
WHERE u.status = 'active';
The rewrite computes all per-user counts in a single aggregating scan over sim_bp_orders, then joins them against users. On large outer sets (say, all 200k active users instead of LIMIT 100), the rewrite is usually 20-100× faster because the aggregation happens once rather than 200,000 times.
For "top-N related rows per outer" (not just count), use LATERAL JOIN with LIMIT N.
NOT IN → NOT EXISTS
The most insidious bug in SQL, bar none. NOT IN returns no rows whenever the inner set contains a single NULL, because x NOT IN (a, b, NULL) evaluates to x <> a AND x <> b AND x <> NULL, and x <> NULL is unknown, making the whole AND evaluate to unknown (not-true, hence excluded).
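You can see the trap with throwaway literals, no table required:
SELECT 'kept' WHERE 1 NOT IN (2, 3);     -- one row: both comparisons are true
SELECT 'kept' WHERE 1 NOT IN (2, NULL);  -- zero rows: 1 <> NULL is unknown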
-- If any user in the inner query has a NULL email, this returns empty.
SELECT * FROM customers
WHERE email NOT IN (SELECT email FROM unsubscribed_users);
-- Correct, NULL-safe equivalent:
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM unsubscribed_users u WHERE u.email = c.email);
NOT EXISTS uses existence semantics, not three-valued logic, so NULLs can't poison the result. The two forms also often produce different plans — NOT EXISTS is typically planned as an Anti Join, which PostgreSQL executes about as cheaply as a regular join. NOT IN against a nullable inner column blocks that transformation: the planner falls back to a SubPlan filter (hashed if it fits in memory) that has to honour NULL semantics, and that's slower.
Rule: never write NOT IN against a subquery unless you've confirmed the compared column is NOT NULL at the schema level. In production code, just default to NOT EXISTS.
DISTINCT → GROUP BY
SELECT DISTINCT tells PostgreSQL to deduplicate the output; GROUP BY on the same columns does the same thing. When the only goal is deduplication (no aggregate functions), the two are equivalent, and the planner usually produces the same plan for each. But GROUP BY is strictly more flexible — it composes with HAVING, plays nicely with window functions, and handles expressions more cleanly.
The rewrite that actually matters is when DISTINCT is used in a query shape that's really asking for something else. "The first order per user" is often written as:
-- Wrong: this gets any order, not the first.
SELECT DISTINCT ON (user_id) user_id, order_id, created_at
FROM sim_bp_orders;
DISTINCT ON (user_id) returns one row per user_id, but which row is unspecified without an ORDER BY. Usually you want:
SELECT DISTINCT ON (user_id) user_id, order_id, created_at
FROM sim_bp_orders
ORDER BY user_id, created_at DESC;
This returns the latest order per user, provided the ORDER BY starts with the DISTINCT ON column. An index on (user_id, created_at DESC) lets this run as an index scan that emits one row per user without a separate sort.
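The supporting index would look something like this; the index name is an assumption, not one of the indexes used elsewhere in the series:
-- Matches ORDER BY user_id, created_at DESC in the DISTINCT ON query above.
CREATE INDEX idx_sim_bp_orders_user_id_created_at
    ON sim_bp_orders (user_id, created_at DESC);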
DISTINCT ON is a PostgreSQL extension (not standard SQL) but it's the cleanest expression of "top-1 per group" when the pattern fits. For top-N with N > 1, use LATERAL (below) or a window function with a Run Condition.
Chunked deletes and updates
Large DELETE or UPDATE statements take locks on every row they touch, generate WAL proportional to the row count, and can trigger autovacuum storms. A 10-million-row delete often locks out writers for minutes. The rewrite is to do it in chunks:
-- Problematic: single massive delete.
DELETE FROM sim_bp_logs WHERE created_at < now() - interval '90 days';
-- Chunked: loop until no more rows to delete.
DO $$
DECLARE
deleted_count int;
BEGIN
LOOP
DELETE FROM sim_bp_logs
WHERE log_id IN (
SELECT log_id FROM sim_bp_logs
WHERE created_at < now() - interval '90 days'
LIMIT 10000
);
GET DIAGNOSTICS deleted_count = ROW_COUNT;
EXIT WHEN deleted_count = 0;
COMMIT; -- Releases locks; next iteration starts fresh txn.
END LOOP;
END $$;
Each chunk commits separately, releasing locks and letting autovacuum catch up between iterations. COMMIT inside a DO block requires PostgreSQL 11+ and only works when the block isn't executed inside an outer transaction. The IN (SELECT ... LIMIT ...) shape is needed because DELETE ... LIMIT isn't valid PostgreSQL syntax (unlike MySQL).
The same pattern applies to bulk UPDATEs. Batch size depends on row width and lock contention tolerance — 1,000 for wide rows with heavy concurrent load, up to 100,000 for narrow rows on an off-hours maintenance window.
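As a sketch, the same loop as a bulk UPDATE — the status values ('completed' → 'archived') are hypothetical, and the inner SELECT must filter on something the UPDATE changes, otherwise the loop never terminates:
DO $$
DECLARE
  updated_count int;
BEGIN
  LOOP
    UPDATE sim_bp_orders
    SET status = 'archived'
    WHERE order_id IN (
      SELECT order_id FROM sim_bp_orders
      WHERE status = 'completed'
        AND created_at < now() - interval '1 year'
      LIMIT 10000
    );
    GET DIAGNOSTICS updated_count = ROW_COUNT;
    EXIT WHEN updated_count = 0;  -- Nothing left matching the filter.
    COMMIT;                       -- Same caveat: PG 11+, not inside an outer txn.
  END LOOP;
END $$;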
INSERT ... ON CONFLICT
Pre-existing code often uses a read-then-write pattern for upserts:
-- Anti-pattern: race condition between the SELECT and INSERT.
SELECT 1 FROM sim_bp_users WHERE email = $1;
-- (application: if not found) INSERT INTO sim_bp_users ...;
Two round trips, and two sessions can both read "not found" and both try to insert, producing a unique-constraint violation. The PostgreSQL idiom is INSERT ... ON CONFLICT:
INSERT INTO sim_bp_users (email, username, status)
VALUES ($1, $2, 'active')
ON CONFLICT (email) DO UPDATE
SET username = EXCLUDED.username,
status = 'active'
RETURNING user_id;
One round trip, atomic, race-free. EXCLUDED references the row that would have been inserted (before the conflict). For "do nothing on duplicate," use ON CONFLICT (col) DO NOTHING. The conflict target must be a column or constraint that has a unique index — without one, PostgreSQL has no way to detect "a conflicting row already exists."
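For completeness, here is the "insert if absent" variant together with the unique constraint it relies on — the constraint name is an assumption, not something defined earlier in the series:
-- ON CONFLICT needs a unique index or constraint on the conflict target.
ALTER TABLE sim_bp_users
    ADD CONSTRAINT sim_bp_users_email_key UNIQUE (email);
-- Insert if absent: no update, no error on duplicates.
-- On conflict no row is inserted, so RETURNING yields nothing for that row.
INSERT INTO sim_bp_users (email, username, status)
VALUES ($1, $2, 'active')
ON CONFLICT (email) DO NOTHING
RETURNING user_id;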
SELECT * in production queries
Not a rewrite of the query's logic, but a rewrite of its projection. SELECT * from a wide table pulls every column over the wire and through every plan node — Index Only Scans degrade to regular Index Scans (heap fetches required for the extra columns), join memory usage multiplies, sort widths explode.
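As a rough illustration — the covering index here is hypothetical, not part of the series schema — a narrow projection can stay index-only where SELECT * cannot:
-- Hypothetical covering index for a narrow, hot query.
CREATE INDEX idx_sim_bp_orders_created_at_incl
    ON sim_bp_orders (created_at) INCLUDE (order_id);
-- Both columns live in the index, so an Index Only Scan is possible.
SELECT order_id, created_at
FROM sim_bp_orders
WHERE created_at >= now() - interval '7 days';
-- SELECT * needs every other column, forcing heap fetches → plain Index Scan.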
The per-query cost isn't always catastrophic, but the robustness cost is. A column-type change on an upstream table can break downstream consumers that never realised they depended on the old definition. In production code, name every column you actually need.
The exception: dump tools, ad-hoc debugging, and CTEs that genuinely pass all columns through. Context-dependent, but the default should be "name the columns."
HAVING vs WHERE
HAVING filters after aggregation; WHERE filters before. If a predicate could apply before aggregation, it should — the aggregate then operates on fewer rows. A classic misuse:
-- Inefficient: aggregate over all orders, then filter.
SELECT user_id, count(*) AS order_count
FROM sim_bp_orders
GROUP BY user_id
HAVING user_id IN (SELECT user_id FROM active_users);
-- Better: filter before aggregation.
SELECT user_id, count(*) AS order_count
FROM sim_bp_orders
WHERE user_id IN (SELECT user_id FROM active_users)
GROUP BY user_id;
The WHERE clause restricts the set of rows that go into the GROUP BY, so the aggregate runs over a smaller input. Only predicates that depend on the aggregate result (e.g., HAVING count(*) > 5) belong in HAVING; anything else is almost always more efficient in WHERE.
The planner usually pushes predicates from HAVING to WHERE when it's safe, but not always — especially when there are subqueries or complex expressions involved. Writing the filter in WHERE to begin with removes the uncertainty.
Composite rewrites: correlated subquery + LATERAL + keyset pagination
Real-world queries often combine several anti-patterns. The "show me the latest 20 orders for each of the top 100 users by lifetime spend" query is classic:
-- Naive: one subquery for the user list, window function for the per-user top-N.
WITH top_users AS (
SELECT user_id
FROM sim_bp_orders
GROUP BY user_id
ORDER BY sum(total_amount_cents) DESC
LIMIT 100
)
SELECT * FROM (
SELECT o.*,
row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM sim_bp_orders o
WHERE user_id IN (SELECT user_id FROM top_users)
) t WHERE rn <= 20;
The CTE lists 100 top users; the window function computes row numbers for all their orders (potentially thousands each); then the outer WHERE keeps only the top 20 per user. The window function is doing 10-100× the work that's actually needed.
Rewritten with LATERAL + LIMIT:
WITH top_users AS (
SELECT user_id, sum(total_amount_cents) AS total_spent
FROM sim_bp_orders
GROUP BY user_id
ORDER BY 2 DESC
LIMIT 100
)
SELECT t.user_id, t.total_spent, recent.*
FROM top_users t,
LATERAL (
SELECT order_id, total_amount_cents, created_at
FROM sim_bp_orders
WHERE user_id = t.user_id
ORDER BY created_at DESC
LIMIT 20
) recent;
For each of the 100 top users, a LATERAL subquery returns their 20 most recent orders — at most 2000 rows total, vs potentially hundreds of thousands in the window-function form. PostgreSQL 15+ can sometimes optimise the window-function form via Run Condition, but LATERAL is both clearer and more reliably cheap.
When not to rewrite
Every rewrite has a small risk of changing semantics in an edge case. Before deploying:
- Diff the results. Run the old and new forms against the same data; check that the row counts and a representative sample match exactly (one EXCEPT-based way to do this is sketched after this list).
- Check the plan with EXPLAIN ANALYZE. The rewrite should show the cost improvement you expect; if it doesn't, there's a case where the planner disagreed.
- Run both under load. Synthetic benchmarks rarely capture the real cache and concurrency effects. A rewrite that's 10× faster in isolation might be only 2× faster in production — still worth it, but measure.
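As one concrete way to do the first check, here's a sketch using the HAVING → WHERE pair from earlier — zero rows back means the two forms return the same multiset (this assumes both forms are plain SELECTs run against the same snapshot):
WITH old_form AS (
  SELECT user_id, count(*) AS order_count
  FROM sim_bp_orders
  GROUP BY user_id
  HAVING user_id IN (SELECT user_id FROM active_users)
),
new_form AS (
  SELECT user_id, count(*) AS order_count
  FROM sim_bp_orders
  WHERE user_id IN (SELECT user_id FROM active_users)
  GROUP BY user_id
)
(SELECT * FROM old_form EXCEPT ALL SELECT * FROM new_form)
UNION ALL
(SELECT * FROM new_form EXCEPT ALL SELECT * FROM old_form);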
Rewriting for performance is the right move after indexing, before buying bigger hardware. The patterns in this article cover most of what you'll find in a typical OLTP codebase; for the actually-broken queries — the ones that are wrong by construction — see the companion article on PostgreSQL Query Anti-Patterns and Common Mistakes.
#postgres #performance #database #sql
Full series and canonical version: https://mydba.dev/blog/postgres-query-rewriting-techniques