DEV Community

Cover image for PostgreSQL MVCC & VACUUM: Bloat, Wraparound, Autovacuum Tuning in Production
Gowtham Potureddi
Gowtham Potureddi

Posted on

PostgreSQL MVCC & VACUUM: Bloat, Wraparound, Autovacuum Tuning in Production

postgres vacuum is the single most under-asked-about Postgres operational knob a senior data engineer has to own, and the one most teams discover the hard way the first time a 200 GB table refuses writes at 03:00 because the transaction ID counter is two days from wraparound. postgresql mvcc is what makes Postgres concurrent at all — every UPDATE writes a new tuple version and marks the old one for later cleanup — and postgres autovacuum is the background worker that turns those dead tuples back into reusable space. Skip either side of that contract and you get postgres bloat, a steadily-growing on-disk footprint that no SQL query can explain.

This guide is the senior-DBA reference you wished existed the first time an interviewer asked "walk me through MVCC tuple visibility and how VACUUM cleans it up" or "how would you handle a 1.5 billion XID warning on a primary you cannot take down?". It walks through the MVCC visibility model (xmin, xmax, snapshot isolation, HOT updates), the VACUUM flavours (plain VACUUM, VACUUM FULL, vacuum freeze, ANALYZE, the pg_repack online alternative), the autovacuum tuning surface (autovacuum_vacuum_scale_factor, per-table overrides, cost-based throttling, worker count), and the transaction id wraparound emergency runbook (datfrozenxid, relfrozenxid, the autonomous wraparound vacuum, vacuumdb --freeze --all). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for PostgreSQL MVCC & VACUUM — bold white headline 'Postgres VACUUM' with subtitle 'MVCC, Bloat, Wraparound' and a stylised tuple-versioning split scene on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on query optimization problems →, and stack the ETL problem library → for the end-to-end pipeline shape.


On this page


1. Why MVCC + VACUUM is the senior-DBA interview

MVCC creates new tuple versions on every UPDATE, VACUUM cleans them up, and the 32-bit XID counter forces both — the trilogy senior interviewers actually probe

The one-sentence invariant: Postgres never updates a row in place. Every UPDATE writes a new tuple version, leaves the old one as "dead", and relies on postgres vacuum to reclaim that space before the table doubles in size or the 32-bit transaction ID counter loops around to zero. Every other VACUUM interview question — bloat monitoring, autovacuum tuning, pg_repack vs VACUUM FULL, wraparound emergencies — is a downstream consequence of those two facts. Once you internalise "MVCC creates dead tuples, VACUUM reclaims them, autovacuum schedules the work, freeze prevents wraparound," the whole surface collapses to a sequence of operational choices around that loop.

Four axes interviewers actually probe.

  • Visibility. Postgres stamps every tuple with xmin (the inserting XID) and xmax (the deleting XID). A transaction's snapshot is "all tuples whose xmin committed before me and whose xmax is either zero or committed after me". This is postgresql mvcc in one sentence — and it's why "dead tuples" are not garbage immediately, just garbage for the newest transactions.
  • Bloat. A heap page that's 80% dead tuples still occupies a full 8 KB on disk; the dead bytes are not returned to the filesystem until the page is rewritten. Bloated tables cost more I/O per query, blow out shared_buffers, and slow down sequential scans. postgres bloat is the operational symptom most DBAs miss until the table hits 5x its theoretical size.
  • Autovacuum. A background daemon (autovacuum launcher + autovacuum_max_workers workers) wakes up periodically and runs VACUUM + ANALYZE on tables that have crossed configurable thresholds. postgres autovacuum is the single most important Postgres operational subsystem — and the one most teams leave at defaults that were tuned for a 2010-era OLTP workload.
  • Wraparound. Transaction IDs are 32 bits. After ~2 billion transactions, the counter wraps. To prevent visibility chaos, Postgres "freezes" old tuples (marks them as visible-to-all) via vacuum freeze. If freezing falls behind, the database refuses writes — the most dramatic Postgres outage mode a senior DBA can face. transaction id wraparound is rare but always asked about in senior interviews because it tests whether you actually understand what an XID is.

The interview opener — the five-second version.

  • "Why does an UPDATE in Postgres not actually update?" — Because MVCC: it INSERTs a new tuple version and marks the old one's xmax. Senior signal if you say "and that creates a dead tuple that VACUUM has to reclaim."
  • "What is VACUUM actually doing?" — Marks dead tuples as reusable in the free space map; doesn't shrink the file. Senior signal if you separate "reclaim within file" from "VACUUM FULL rewrites the file."
  • "Why can't you just run VACUUM FULL?"ACCESS EXCLUSIVE lock, table rewrite, doubled disk usage during the rewrite. Senior signal if you mention pg_repack as the online alternative.
  • "What is a transaction ID wraparound and how do you prevent it?" — 32-bit XID counter; preventive vacuum freeze runs as part of autovacuum; emergency vacuumdb --freeze --all if you fall behind. Senior signal if you mention vacuum_freeze_min_age and autovacuum_freeze_max_age.

The 2026 reality — what changed since Postgres 13.

  • Postgres 14+ ships parallel index vacuums, dramatically shrinking VACUUM time on big tables with many indexes. Default behaviour is to use up to max_parallel_maintenance_workers index workers in parallel.
  • Postgres 15+ added per-statement VACUUM (FREEZE, INDEX_CLEANUP off, ...) options and made aggressive freezing the default for tables approaching autovacuum_freeze_max_age.
  • Postgres 16+ improved autovacuum's ability to skip pages via the visibility map, cutting wall time on append-mostly tables.
  • Postgres 17+ introduced the MAINTENANCE predicate locking model that lets autovacuum continue working on partitions while application traffic queries the parent table.
  • pg_repack remains the production-default tool for online table rebuilds; pg_squeeze is the auto-trigger variant.

What "senior" sounds like in a Postgres ops interview.

  • Do you say "MVCC creates dead tuples, autovacuum reclaims them" in the first sentence? — senior signal.
  • Do you mention "VACUUM FULL takes ACCESS EXCLUSIVE; we use pg_repack in production" unprompted? — senior signal.
  • Do you bring up "wraparound forces preventive freeze, which is why autovacuum runs even on read-only tables"? — senior signal.
  • Do you push back on "just turn off autovacuum on hot tables to reduce overhead" with "that guarantees wraparound; never turn it off, tune it"? — strongest senior signal.

Worked example — the four-line MVCC trace every senior should be able to draw

Detailed explanation. The single most common Postgres MVCC interview opener is "draw what happens on disk when I run an UPDATE." You should be able to answer in 30 seconds with a four-row visualisation. Every advanced question (HOT, index bloat, freezing, snapshot isolation) builds on this base.

Question. Walk through what happens to the heap, the indexes, and the dead tuple count when you run a single UPDATE accounts SET balance = balance + 10 WHERE id = 1 on a table with two indexes.

Input.

Step Action
t0 INSERT INTO accounts (id, balance) VALUES (1, 100); commit XID=1000
t1 session A: BEGIN; UPDATE accounts SET balance = 110 WHERE id = 1; commit XID=1001
t2 session B: SELECT balance FROM accounts WHERE id = 1; -- snapshot XID=1002

Code.

-- t0: initial INSERT
INSERT INTO accounts (id, balance) VALUES (1, 100);

-- t1: UPDATE in session A
BEGIN;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
COMMIT;

-- t2: SELECT in session B
SELECT id, balance, xmin, xmax FROM accounts WHERE id = 1;

-- inspect dead tuples
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'accounts';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. At t0 the INSERT writes one heap tuple (id=1, balance=100, xmin=1000, xmax=0). Both indexes (primary key on id, plus a secondary on balance) point at that tuple's (page, offset) location.
  2. At t1 the UPDATE does not rewrite the existing tuple. It writes a new heap tuple (id=1, balance=110, xmin=1001, xmax=0) and updates the old tuple's xmax to 1001. The old tuple is now "dead from XID 1001's perspective."
  3. Both indexes get a new entry pointing at the new tuple. The old index entries are not removed yet — they still point at the dead tuple. This is "index bloat" — the secondary effect of MVCC that VACUUM has to clean up separately.
  4. At t2 session B sees only the new tuple (xmin=1001 committed before B's snapshot at 1002). If session B had started its snapshot before XID 1001 committed, it would still see balance=100 — that's snapshot isolation.
  5. pg_stat_user_tables reports n_live_tup=1, n_dead_tup=1. The dead tuple stays on disk until VACUUM runs and marks its space reusable. The page is still 8 KB on disk — VACUUM does not shrink the file.
  6. If a HOT (Heap-Only Tuple) update was possible — same page, no indexed column changed — Postgres would skip the index writes and chain the new tuple to the old one via the page's HOT chain. HOT is the most important MVCC optimisation; it's why indexed-column updates are dramatically more expensive than unindexed-column updates.

Output.

Tuple id balance xmin xmax Visible to XID 1002?
old 1 100 1000 1001 no (xmax committed)
new 1 110 1001 0 yes

Rule of thumb. Every UPDATE = 1 INSERT + 1 mark-dead. Every UPDATE on an indexed column = 1 INSERT + 1 mark-dead + N new index entries (one per index). Treat indexes as multipliers on your write cost.

Worked example — measuring dead tuple ratio on a real table

Detailed explanation. "How would you measure bloat" is a standard senior probe. The answer has two layers: the lightweight pg_stat_user_tables counters (always available, no extension), and the precise pgstattuple extension (full page scan, more expensive but exact).

Question. Show how to read the dead tuple count on a 50 GB orders table, and how to compute the dead-tuple percentage and the autovacuum threshold from that.

Input.

Setting Value
Table public.orders
Reltuples (live row estimate) 100,000,000
autovacuum_vacuum_scale_factor 0.2 (default)
autovacuum_vacuum_threshold 50

Code.

-- lightweight reading (no extension)
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 2) AS dead_pct,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- precise scan via pgstattuple
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
    pg_size_pretty(table_len)               AS table_size,
    pg_size_pretty(dead_tuple_len)          AS dead_size,
    round(dead_tuple_percent::numeric, 2)   AS dead_pct,
    round(free_percent::numeric, 2)         AS free_pct
FROM pgstattuple('public.orders');

-- autovacuum trigger threshold for this table
SELECT
    autovacuum_vacuum_threshold +
        autovacuum_vacuum_scale_factor * reltuples AS trigger_dead_tuples
FROM pg_class
WHERE relname = 'orders';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. pg_stat_user_tables gives n_live_tup and n_dead_tup directly. These are statistics-collector estimates, refreshed on every transaction commit and every autovacuum run. They're cheap (a single catalog read) but can drift on append-heavy tables that don't get vacuumed often.
  2. pgstattuple does a real heap scan and returns exact counts of live, dead, and free bytes. It costs roughly the same as a sequential scan, so don't run it on a 5 TB table during peak hours. Use it for a true bloat audit, not for monitoring.
  3. The dead-tuple percentage gives you a rough bloat number, but the autovacuum trigger is what actually matters for scheduling. Postgres triggers a vacuum when n_dead_tup > threshold + scale_factor * reltuples. For our 100M-row table at default settings: 50 + 0.2 * 100,000,000 = 20,000,050 dead tuples before autovacuum kicks in.
  4. That's 20M dead tuples worth of bloat before autovacuum starts working — for a write-heavy table that's far too many. Per-table override to scale_factor=0.05 would trigger at 5M, dramatically reducing peak bloat.
  5. For monitoring, alert when dead_pct > 20% on any table over 1 GB. That catches a misconfigured autovacuum or a long-running transaction before it becomes a disk pressure incident.

Output.

Metric Default trigger Recommended for hot table
autovacuum_vacuum_threshold 50 rows 1000 rows
autovacuum_vacuum_scale_factor 0.2 (20%) 0.05 (5%)
Trigger on 100M-row table 20M dead 5M dead
Peak bloat at trigger ~20% ~5%

Rule of thumb. Default autovacuum settings are tuned for tables under ~1M rows. Above 10M rows, lower scale_factor to 0.05 or 0.02 via per-table ALTER TABLE so autovacuum runs more frequently and keeps peak bloat under control.

Worked example — the cost of long-running transactions

Detailed explanation. Senior interviewers love to ask "why does a long-running transaction block VACUUM?" The answer ties MVCC visibility to dead-tuple reclamation in one neat causal chain — and it's the root cause of most "autovacuum is running but bloat keeps growing" production incidents.

Question. Explain why a 6-hour BEGIN; SELECT ... ; transaction in one session prevents autovacuum from reclaiming dead tuples in every table in the database.

Input.

Session Action Duration
A BEGIN; SELECT * FROM big_report; -- not committed 6 hours
B UPDATE orders SET status = 'shipped' WHERE ...; (every minute) continuous
C autovacuum worker on orders every 5 min

Code.

-- find long-running transactions blocking VACUUM
SELECT
    pid,
    usename,
    state,
    xact_start,
    now() - xact_start AS xact_age,
    query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 10;

-- inspect the "horizon" — the oldest XID VACUUM can clean past
SELECT
    backend_xmin,
    pg_xact_commit_timestamp(backend_xmin) AS backend_xmin_ts
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin ASC
LIMIT 5;

-- kill the blocker (after consultation with the app team)
SELECT pg_cancel_backend(12345);    -- soft cancel
-- or
SELECT pg_terminate_backend(12345); -- hard terminate
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Session A holds a snapshot from 6 hours ago. From its perspective, any tuple whose xmin committed after its snapshot time is invisible; any tuple whose xmax committed after its snapshot time is still alive.
  2. Session B's UPDATEs create dead tuples constantly. From the perspective of A's snapshot, the old versions are still alive — the new versions don't exist for A.
  3. Autovacuum on orders runs and sees the dead tuples — but it cannot reclaim them, because doing so would corrupt session A's view. It computes the "global xmin horizon" (the oldest snapshot in any open transaction) and only reclaims tuples whose xmax committed before that horizon.
  4. Session A's horizon advances zero — so the horizon for the entire database is stuck 6 hours in the past. Every dead tuple created in those 6 hours accumulates and cannot be reclaimed.
  5. pg_stat_activity.backend_xmin shows the oldest XID that any backend is still using. If you have a session with backend_xmin from 6 hours ago, every dead tuple created after that point is locked in until that session ends.
  6. The fix is either to kill the long-running transaction (pg_terminate_backend(pid)) or to refactor the report query to use a short-lived snapshot (e.g. snapshot at the start of the report, copy the data out, COMMIT, then process the copy).

Output.

Symptom Underlying cause
Bloat keeps growing despite autovacuum running Horizon is stuck behind a long-running txn
pg_stat_activity shows a session with xact_age > 1h The blocker is found
n_dead_tup keeps climbing on every table Horizon is per-DB, not per-table

Rule of thumb. Set idle_in_transaction_session_timeout = '15min' and statement_timeout to a reasonable value. A transaction left open by a buggy ORM is the single most common cause of unexplained Postgres bloat.

Senior interview question on MVCC + VACUUM

A senior interviewer often opens with: "Walk me through how Postgres handles a high-throughput UPDATE workload — what happens on disk, what cleans up, and what fails if you misconfigure it?"

Solution Using the MVCC + VACUUM mental model

The full lifecycle of an UPDATE under MVCC + VACUUM
====================================================

1. UPDATE arrives
   - Writes new tuple version (xmin = currentXID, xmax = 0)
   - Marks old tuple xmax = currentXID
   - Updates all indexes (unless HOT applies)
   - Page may split if no room; new tuple ends up on a different page

2. Commit
   - Old tuple becomes "dead from XID currentXID forward"
   - New tuple becomes "live"
   - Snapshots taken before commit still see the old tuple

3. Autovacuum eventually wakes up
   - When n_dead_tup > threshold + scale_factor * reltuples
   - Computes global xmin horizon
   - Reclaims dead tuples older than the horizon
   - Marks page space as reusable in the free space map
   - Removes index entries pointing at reclaimed tuples
   - Does NOT shrink the file — that requires VACUUM FULL or pg_repack

4. Eventually, freeze
   - When tuple xmin gets older than vacuum_freeze_min_age (default 50M)
   - VACUUM sets the "frozen" hint bit so the tuple is visible to all future XIDs
   - Prevents wraparound from corrupting visibility

5. Failure modes
   - Long-running transaction → horizon stuck → bloat grows unbounded
   - Autovacuum too slow → bloat grows faster than reclamation
   - Wraparound approaching → autovacuum runs in aggressive mode
   - autovacuum disabled → bloat + eventual wraparound shutdown
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Phase Disk effect Index effect Visibility horizon Wraparound risk
1. UPDATE new tuple inserted N new index entries unchanged none
2. Commit old tuple marked dead old index entries dangle unchanged none
3. Autovacuum dead tuples reclaimed in-place dangling index entries removed advances reduced
4. Freeze xmin marked frozen none tuple visible to all prevented
5. Wraparound block (failure mode) no space available none stuck imminent — DB shuts down

After this trace, every VACUUM, autovacuum, and wraparound question is a question about which phase is failing in your specific incident. That decomposition is the senior signal.

Output:

Concern Where it shows up Tool to use
Bloat n_dead_tup, pgstattuple autovacuum tuning, pg_repack
Long-running txns pg_stat_activity.xact_start terminate or refactor app
Wraparound datfrozenxid age VACUUM FREEZE, vacuumdb --freeze
Autovacuum stuck pg_stat_progress_vacuum scale workers, lower naptime

Why this works — concept by concept:

  • MVCC tuple versioning — every UPDATE writes a new tuple and marks the old one dead. The whole subsequent VACUUM surface is the consequence of that decision. Naming it explicitly proves you understand the root cause.
  • The xmin horizon — autovacuum can only reclaim tuples older than the oldest open snapshot in the database. That single rule explains 90% of "why isn't autovacuum working" incidents.
  • In-place reclaim vs file shrink — VACUUM reclaims within the file; VACUUM FULL rewrites the file. Conflating them is a junior mistake. Senior engineers know to reach for pg_repack instead of FULL in production.
  • Freeze is a separate concern from bloat — freezing is wraparound prevention, not space reclamation. Autovacuum runs both jobs in the same worker, but they're triggered by different thresholds.
  • Cost — autovacuum cost is O(table_size + index_size) worst case, throttled by autovacuum_vacuum_cost_delay / _cost_limit. Wraparound prevention runs unconditionally and cannot be skipped, so a wraparound-pending table will block autovacuum on other tables.

SQL
Topic — sql
Postgres MVCC + VACUUM problems

Practice →

SQL Topic — optimization Query optimization drills

Practice →


2. MVCC visibility and tuple versions

postgresql mvcc stamps every tuple with xmin and xmax — visibility is a function of those two columns and the transaction's snapshot

The mental model in one line: every heap tuple carries xmin (the transaction that created it) and xmax (the transaction that deleted/updated it); a transaction's snapshot decides which tuples it can see by comparing those XIDs to the set of transactions that committed before the snapshot was taken. Once you say "visibility is xmin + xmax against the snapshot," every snapshot-isolation, dead-tuple, HOT-update, and wraparound question becomes a deduction from that rule.

Visual diagram of Postgres MVCC tuple versions — a single row going through INSERT, UPDATE, and DELETE with xmin/xmax stamps on each version and an arrow showing the snapshot horizon; on a light PipeCode card.

The core column pair — xmin / xmax.

  • xmin — the XID of the transaction that created this tuple version. Set on INSERT or on the new tuple version of an UPDATE.
  • xmax — the XID of the transaction that invalidated this tuple version. Zero if still alive. Set on DELETE, or on the old tuple version of an UPDATE.
  • Both columns are 32-bit unsigned integers stored in the tuple header (24 bytes per tuple — a real overhead you pay for MVCC).

Snapshot isolation — the visibility rule.

  • A transaction's snapshot is a triple: (xmin, xmax, xip_list) — the oldest still-running transaction, the next XID to be assigned, and the list of in-flight XIDs at snapshot time.
  • A tuple is visible to a snapshot if: (tuple.xmin is in the committed set AND tuple.xmin < snapshot.xmax AND tuple.xmin is not in snapshot.xip_list) AND (tuple.xmax is zero OR tuple.xmax did not commit by snapshot time OR tuple.xmax is in snapshot.xip_list).
  • In practice: "tuple was inserted by a committed transaction before me, and either is still alive or was deleted by a transaction that committed after my snapshot."

The UPDATE = INSERT + mark identity.

  • An UPDATE writes a new heap tuple with xmin = currentXID, xmax = 0 and sets the old tuple's xmax = currentXID.
  • Both tuples share a logical identity (same primary key) but live in different physical heap slots, often on different pages.
  • This is the only way Postgres handles updates — there is no in-place update. Even single-byte column changes write a full new tuple.

HOT updates — the optimisation.

  • If the UPDATE does not change any indexed column, and there is room on the same page, Postgres can do a Heap-Only Tuple (HOT) update.
  • The new tuple lives on the same page, chained from the old tuple via the page's HOT chain pointer.
  • Indexes are not updated — they still point at the old tuple, and the runtime follows the HOT chain to find the live version.
  • HOT updates are dramatically cheaper than regular updates. The cost difference is often the difference between a healthy and a bloat-prone table.

Why indexed-column updates are expensive.

  • Indexed column changed → no HOT → new index entries → old index entries dangle until VACUUM removes them.
  • A 1M-row table with 5 secondary indexes pays 6 writes per UPDATE (heap + 5 indexes) instead of 1.
  • The cost compounds: more index entries = larger indexes = more I/O on every read.

The visibility map — the optimisation that makes VACUUM cheap.

  • Postgres maintains a bitmap (relname_vm file) marking which pages are "all-visible" — every tuple on that page is visible to every snapshot.
  • VACUUM can skip all-visible pages entirely. On a mostly-read table, autovacuum becomes nearly free.
  • INSERTs and HOT updates can keep pages all-visible. Non-HOT UPDATEs and DELETEs clear the visibility flag, forcing the next VACUUM to scan that page.

Index-only scans — visibility map dividend.

  • If every page touched by an index scan is marked all-visible, Postgres can satisfy the query from the index alone, skipping the heap entirely.
  • This is "index-only scan" — one of the biggest performance wins in modern Postgres.
  • It only works if the visibility map is up to date, which requires VACUUM to have run recently.

Common interview probes on MVCC.

  • "What are xmin and xmax?" — the create-XID and the invalidate-XID of a tuple version.
  • "Why does an UPDATE produce a dead tuple?" — because UPDATE is INSERT + mark-old-dead; there's no in-place update.
  • "What is a HOT update?" — same-page UPDATE with no indexed column change; skips index writes.
  • "What is the visibility map?" — bitmap of all-visible pages; lets VACUUM skip pages and enables index-only scans.

Worked example — UPDATE-heavy table accumulating 10x bloat

Detailed explanation. A common production failure: a table with millions of UPDATEs per hour grows to 10x its theoretical size in a week. The culprit is almost always indexed-column updates preventing HOT, combined with default autovacuum settings that don't keep up. Tracing what's happening on disk is the first step.

Question. Show the disk-growth math for an events table with 10M rows, 5 secondary indexes, and 200K UPDATEs per hour on an indexed column. How much bloat accumulates in a day before autovacuum catches up?

Input.

Setting Value
Live rows 10,000,000
Avg row width 200 bytes
Secondary indexes 5
UPDATE rate 200,000 / hour
UPDATE target column indexed
autovacuum_vacuum_scale_factor 0.2 (default)

Code.

-- live size estimate
SELECT
    pg_size_pretty(pg_relation_size('events'))                  AS heap_size,
    pg_size_pretty(pg_indexes_size('events'))                   AS index_size,
    pg_size_pretty(pg_total_relation_size('events'))            AS total_size,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 2)        AS dead_pct
FROM pg_stat_user_tables JOIN pg_class ON oid = relid
WHERE relname = 'events';

-- bloat trigger threshold
SELECT
    pg_class.reltuples,
    pg_class.reltuples * 0.2 AS trigger_dead
FROM pg_class WHERE relname = 'events';

-- enforce HOT-friendly schema via fillfactor
ALTER TABLE events SET (fillfactor = 80);
-- 20% of every page reserved for HOT chains
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. 200K UPDATEs/hour × 24 h = 4.8M UPDATEs/day. Because the target column is indexed, each UPDATE writes 1 new heap tuple + 5 new index entries (no HOT possible). The old heap tuple plus 5 old index entries become dead.
  2. Heap bloat after 24h: 4.8M dead tuples × 200 bytes = ~960 MB. That's about 50% of the live heap size (2 GB), so the table grows to roughly 3 GB before autovacuum kicks in.
  3. Index bloat is far worse because each of the 5 indexes accumulates 4.8M dead entries. If each index entry is ~30 bytes, that's 4.8M × 30 × 5 = ~720 MB of dead index bytes. Indexes often grow more than the heap on update-heavy tables.
  4. Default autovacuum triggers at 0.2 * 10M = 2M dead tuples. That's reached in 10 hours at this rate. Once it runs, it reclaims the heap dead tuples and rebuilds the index visibility, dropping n_dead_tup to near zero.
  5. After one autovacuum cycle, the table file is still 3 GB on disk — VACUUM reclaims within the file, it doesn't shrink it. The next 24 hours of UPDATEs reuse the reclaimed space, so the table stabilises at ~3 GB rather than growing further.
  6. The fix for ongoing bloat: lower autovacuum_vacuum_scale_factor to 0.05 (autovacuum every ~2.5 hours instead of every 10 hours), and set fillfactor = 80 so HOT updates are more likely (more page free space available for in-page new tuple).

Output.

Phase Heap size Index size Dead rows
t0 (baseline) 2 GB 1.5 GB 0
t+10h (autovacuum trigger) 2.4 GB 1.8 GB 2M
t+24h (post-vacuum) 2.4 GB (steady) 1.8 GB (steady) < 100K
with tuned settings (scale=0.05) 2.1 GB 1.6 GB < 500K

Rule of thumb. If your table has >5 indexes and any indexed column is updated frequently, the indexes will dominate your bloat. Drop unused indexes ruthlessly — each one multiplies your write cost and your VACUUM time.

Worked example — HOT updates vs non-HOT updates on the same table

Detailed explanation. The same UPDATE can be 10x cheaper if it qualifies for HOT. Senior engineers know that "an UPDATE costs nothing on Postgres if it's HOT" is one of the most leverage-y performance facts in the database.

Question. Show two UPDATE patterns on a users table — one that triggers HOT, one that doesn't — and explain the per-update cost difference.

Input.

Schema Columns Indexed columns
users id (PK), email, login_count, last_login_at id, email

Code.

-- A: UPDATE on a non-indexed column → HOT eligible
UPDATE users SET login_count = login_count + 1 WHERE id = 42;

-- B: UPDATE on an indexed column → NOT HOT eligible
UPDATE users SET email = 'new@example.com' WHERE id = 42;

-- Inspect HOT updates count
SELECT
    schemaname, relname,
    n_tup_upd      AS total_updates,
    n_tup_hot_upd  AS hot_updates,
    round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 2) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Ensure HOT can happen — leave fillfactor room
ALTER TABLE users SET (fillfactor = 85);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. UPDATE A changes only login_count, which is not indexed. If there is free space on the same heap page (this is where fillfactor matters), Postgres writes the new tuple on the same page and chains it to the old tuple via the page's HOT chain. Indexes are untouched.
  2. UPDATE A costs: 1 heap page write, 0 index writes. The old tuple still exists but is reachable only via the HOT chain from the page's "redirect" pointer. VACUUM later prunes the HOT chain — much cheaper than full reclamation.
  3. UPDATE B changes email, which is indexed. HOT is impossible — the email index needs a new entry for the new tuple. Postgres writes the new heap tuple (anywhere there's room, possibly on a new page) and a new email-index entry. The old heap tuple and the old email-index entry become dead.
  4. UPDATE B costs: 1 heap page write (possibly a new page if the old page is full), 1 index entry write per indexed column. The dead entries linger until VACUUM clears them.
  5. pg_stat_user_tables.n_tup_hot_upd tracks HOT updates separately from total updates. Healthy ratio: > 70% HOT for an UPDATE-heavy OLTP table. If yours is below 30%, you have indexes on volatile columns that you should reconsider.
  6. fillfactor=85 reserves 15% of every page for HOT chain growth. Tables with frequent UPDATEs benefit from fillfactor=70-80; read-only tables can use the default 100 for maximum density.

Output.

UPDATE pattern Heap writes Index writes Dead tuples produced HOT eligible
Non-indexed column (A) 1 (same page) 0 1 (prunable cheap) yes
Indexed column (B) 1 (any page) N 1 heap + N index no

Rule of thumb. Don't index last_login_at, updated_at, view_count, or any high-velocity column unless an active query plan depends on it. Every such index is a 10x multiplier on your UPDATE cost and a magnet for bloat.

Worked example — long-running snapshot blocks the horizon

Detailed explanation. The single most insidious MVCC failure mode is a forgotten transaction holding back the global xmin horizon. It manifests as autovacuum running but not reclaiming any space. The fix is always to find and kill (or refactor) the blocking session.

Question. A monitoring dashboard shows that autovacuum has been running every 5 minutes but n_dead_tup keeps climbing. Bloat is now at 60%. Find the cause.

Input.

Symptom Value
autovacuum_workers 3 (active)
n_dead_tup on hot_events 50M and climbing
pg_stat_user_tables.last_autovacuum 2 minutes ago
free disk 25% and falling

Code.

-- find the oldest in-flight backend xmin
SELECT
    pid,
    application_name,
    state,
    backend_xmin,
    xact_start,
    now() - xact_start AS xact_age,
    LEFT(query, 80)    AS query_head
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
  AND state <> 'idle'
ORDER BY backend_xmin
LIMIT 5;

-- find idle-in-transaction sessions
SELECT
    pid,
    application_name,
    state,
    now() - state_change AS idle_age,
    LEFT(query, 80)      AS query_head
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change ASC;

-- terminate the offender (after app team confirms)
SELECT pg_terminate_backend(54321);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. pg_stat_activity filtered by backend_xmin IS NOT NULL and ordered by backend_xmin finds the session whose snapshot is the bottleneck. Often it's an idle in transaction session — a transaction that started, executed one query, and then was forgotten by buggy ORM code.
  2. The age now() - xact_start tells you how long the snapshot has been holding back the horizon. If it's > 30 minutes on an OLTP database, that's the bloat root cause.
  3. Killing the session via pg_terminate_backend(pid) releases the snapshot. The next autovacuum cycle sees a vastly advanced horizon and can reclaim hours of accumulated dead tuples.
  4. Long-term, set idle_in_transaction_session_timeout = '5min' and statement_timeout = '30min' in postgresql.conf. Both prevent any single buggy code path from holding back the horizon.
  5. Add an alert on max(now() - xact_start) > 1 hour from pg_stat_activity WHERE state <> 'idle'. This catches the issue before bloat becomes a disk pressure incident.
  6. Educate the app team: every transaction must end with COMMIT or ROLLBACK. Long-running read transactions should use SET TRANSACTION SNAPSHOT or short repeatable batches rather than holding one snapshot open for hours.

Output.

Backend xact_age State Action
54321 6h 12min idle in transaction terminate; tell the app team
54322 0:01 active leave alone
54323 0:42 active (analytics) refactor to short snapshots

Rule of thumb. Bloat that autovacuum cannot reclaim is always a horizon problem. Find the oldest backend_xmin first; do not retune autovacuum thresholds until the horizon is moving normally.

Senior interview question on MVCC visibility

A senior interviewer might frame this as: "Explain Postgres snapshot isolation in terms of xmin, xmax, and the visibility map. Walk me through what happens when two concurrent transactions try to update the same row."

Solution Using the MVCC visibility rules

-- Setup
CREATE TABLE accounts (id INT PRIMARY KEY, balance NUMERIC);
INSERT INTO accounts VALUES (1, 100);

-- Session A: XID 100
BEGIN;
SELECT balance FROM accounts WHERE id = 1;     -- sees 100
-- ... long-running snapshot ...

-- Session B: XID 101, concurrent
BEGIN;
UPDATE accounts SET balance = 150 WHERE id = 1;
COMMIT;                                          -- B commits

-- Session A still sees 100 (its snapshot predates B's commit)
SELECT balance FROM accounts WHERE id = 1;     -- still sees 100

-- Now session A wants to update
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
-- ERROR: could not serialize access due to concurrent update
-- (in SERIALIZABLE) OR
-- ROW LOCK WAIT then proceeds with B's value (in READ COMMITTED)

ROLLBACK;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Session A snapshot Tuple on disk A sees
t0 xmin=99, xmax=100 (id=1, bal=100, xmin=50, xmax=0) 100
t1 (B updates) unchanged old: xmax=101; new: (xmin=101, xmax=0, bal=150) 100 (xmax=101 in xip_list)
t2 (B commits) unchanged new tuple committed 100 (xmax=101 committed after A's snapshot)
t3 (A reads again) unchanged unchanged 100 (snapshot isolation guarantees stable view)
t4 (A updates) unchanged conflict detected error or wait, depending on isolation

The Session A view is consistent throughout — that's the whole point of snapshot isolation. The trade-off is that A and B cannot both update the same row at the same isolation level without one of them seeing a serialization error.

Output:

Isolation level Behaviour on concurrent update
READ COMMITTED (default) A waits for B, then re-reads at the new committed state, then applies its UPDATE on the new value
REPEATABLE READ A errors with "could not serialize access due to concurrent update"; app retries
SERIALIZABLE A errors immediately if any read-write conflict could violate serializability; app retries

Why this works — concept by concept:

  • xmin / xmax tuple stamps — every tuple carries its creator and invalidator XID. The whole visibility system is two integer comparisons. That simplicity is what makes Postgres MVCC fast.
  • Snapshot isolation — a snapshot fixes the visible set of XIDs at transaction start. From that point, the transaction sees a consistent view even if concurrent transactions commit changes. This is the read side of MVCC.
  • Write conflict detection — when A tries to UPDATE a row that B has already updated and committed, Postgres detects the conflict via xmax. In READ COMMITTED it waits and re-reads; in REPEATABLE READ / SERIALIZABLE it errors. The write side of MVCC.
  • The horizon is global — every backend's backend_xmin contributes to a single per-database horizon. One slow session blocks VACUUM for every table. This is the operational pitfall that catches teams.
  • Cost — visibility check is O(1) per tuple (two XID comparisons + xip_list lookup). Snapshot acquisition is O(active_transactions). Dead-tuple reclamation is O(table_size + index_size) amortised across autovacuum cycles.

SQL
Topic — sql
Postgres snapshot isolation problems

Practice →

SQL Topic — optimization Index optimization drills

Practice →


3. VACUUM mechanics and flavours

postgres vacuum ships in four shapes — plain VACUUM, vacuum full, vacuum freeze, and ANALYZE — and using the wrong one in production is a guaranteed outage

The mental model in one line: plain VACUUM reclaims dead-tuple space inside the file (cheap, no lock), VACUUM FULL rewrites the file from scratch (expensive, ACCESS EXCLUSIVE lock), VACUUM FREEZE preemptively marks old tuples as frozen to prevent wraparound, and ANALYZE updates the statistics the planner reads — they're four different jobs with four different cost profiles, and pg_repack is the production-safe alternative to VACUUM FULL. Once you internalise "VACUUM reclaims, FULL rewrites, FREEZE prevents wraparound, ANALYZE updates stats," every "which VACUUM should I run" interview question becomes a single-question lookup.

Visual diagram of the four Postgres VACUUM flavours — plain VACUUM reclaiming within a page, VACUUM FULL rewriting the table, VACUUM FREEZE marking frozen, and pg_repack doing an online swap; on a light PipeCode card.

Plain VACUUM — the safe default.

  • What it does. Scans the table, finds dead tuples, marks their space as reusable in the page's "item pointer" array, updates the visibility map, and prunes HOT chains. Does not rewrite the file, does not return space to the OS.
  • Lock level. SHARE UPDATE EXCLUSIVE — concurrent SELECTs, INSERTs, UPDATEs, DELETEs continue. Only blocks other VACUUM, ANALYZE, CREATE INDEX CONCURRENTLY, and a few DDL operations.
  • Disk effect. Frees space for reuse by this table. Future INSERTs/UPDATEs land in the reclaimed slots; the file size stays the same.
  • When to run. Constantly via autovacuum. Rarely by hand — only when autovacuum is misconfigured or you need an immediate reclaim before a query plan re-runs.

VACUUM FULL — the nuclear option.

  • What it does. Creates a new table file, copies all live tuples into it, rebuilds every index, then swaps the file. End result: the smallest possible disk footprint for the current data.
  • Lock level. ACCESS EXCLUSIVE — blocks every other operation on the table, including SELECTs. The table is effectively offline for the duration of the FULL.
  • Disk effect. Returns space to the OS. The new file is smaller. During the operation, you need at least 2x the current table size in free disk (old + new exist simultaneously).
  • When to run. Almost never in production. Use case: a one-time massive bloat reclaim during a planned maintenance window. Even then, prefer pg_repack.

VACUUM FREEZE — the wraparound preventer.

  • What it does. Marks every tuple older than vacuum_freeze_min_age as "frozen" — a hint bit that means "this tuple is visible to all current and future transactions, ignore its xmin." Frozen tuples don't contribute to the wraparound countdown.
  • Lock level. Same as plain VACUUM (SHARE UPDATE EXCLUSIVE).
  • Disk effect. Same as plain VACUUM (reclaims dead-tuple space) plus the freeze bit updates.
  • When to run. Autovacuum runs an "aggressive" mode FREEZE when a table's relfrozenxid is older than autovacuum_freeze_max_age (default 200M). You can force it via VACUUM FREEZE tablename.

ANALYZE — the planner statistics refresh.

  • What it does. Samples the table, updates pg_statistic with column histograms, distinct counts, null fractions. The query planner uses these for cost estimates.
  • Lock level. Same as plain VACUUM. Often run in the same autovacuum cycle (VACUUM (ANALYZE) tablename).
  • Disk effect. None on the table; only updates catalog rows.
  • When to run. After bulk loads, after index creation, after any operation that significantly changes the data distribution. Autovacuum runs it automatically when n_mod_since_analyze exceeds a threshold.

pg_repack — the production-safe rebuild.

  • What it does. Builds a new copy of the table in the background, captures concurrent changes via triggers into a log table, applies the log to the copy, then swaps the table file with a brief ACCESS EXCLUSIVE lock (seconds, not hours).
  • Lock level. Almost none — the swap is the only ACCESS EXCLUSIVE step, and it's instantaneous.
  • Disk effect. Returns space to the OS like VACUUM FULL. Requires 2x table size in free disk during the copy.
  • When to run. Whenever a table's bloat exceeds ~30% on a system you cannot take offline. The de facto standard for production table rebuilds.

The decision tree.

  • Need to reclaim within-file? → autovacuum (or manual VACUUM if urgent).
  • Need to shrink the file in a maintenance window? → VACUUM FULL.
  • Need to shrink the file on a 24/7 system? → pg_repack.
  • Need to prevent wraparound? → autovacuum handles it; force with VACUUM FREEZE if behind.
  • Need to fix stale query plans? → ANALYZE (or VACUUM ANALYZE).

Common interview probes on VACUUM flavours.

  • "Difference between VACUUM and VACUUM FULL?" — VACUUM reclaims in-place, FULL rewrites the file.
  • "Why is VACUUM FULL dangerous?" — ACCESS EXCLUSIVE lock for the duration. Use pg_repack instead.
  • "What does VACUUM FREEZE do?" — marks old tuples as frozen to prevent wraparound.
  • "Why run ANALYZE separately?" — it updates planner stats; cheaper than VACUUM if you just need fresh stats.

Worked example — VACUUM vs VACUUM FULL on the same bloated table

Detailed explanation. A 100 GB orders table is 60% dead tuples. You have two options: run plain VACUUM (no downtime, but the file stays 100 GB) or run VACUUM FULL (rewrites to 40 GB, but takes the table offline for hours). The right answer depends on what you're optimising for — disk vs uptime.

Question. Run both options against the same 100 GB orders table and compare disk footprint, lock duration, and operational risk.

Input.

Table Heap size Index size Dead pct Indexes
orders 100 GB 30 GB 60% 4

Code.

-- Option 1: plain VACUUM — no downtime, no file shrink
VACUUM (VERBOSE) orders;
-- Output: "pages reused", "tuples removed", "no shrinkage"
-- Heap still 100 GB, but 60 GB of it is now reusable

-- Option 2: VACUUM FULL — full rewrite, ACCESS EXCLUSIVE lock
VACUUM FULL VERBOSE orders;
-- Output: rebuilds heap and all indexes
-- Heap becomes 40 GB, indexes become smaller

-- Option 3: pg_repack — same effect as FULL, no long lock
-- run from shell:
pg_repack --no-superuser-check --table=orders --jobs=4 mydb
-- ACCESS EXCLUSIVE held only during the final swap (seconds)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Plain VACUUM scans every page, marks dead tuples reusable, updates the visibility map, and prunes HOT chains. On a 100 GB table with 4 indexes and parallel index vacuum, this might take ~30 minutes wall-clock. Concurrent traffic continues uninterrupted.
  2. The heap file is still 100 GB after plain VACUUM — VACUUM does not call ftruncate. The "reclaimed" 60 GB is now free space inside the file that future INSERTs/UPDATEs can reuse. The OS sees the same disk footprint.
  3. VACUUM FULL creates a new file, scans the old file, copies every live tuple into the new one, rebuilds every index from scratch, then atomically swaps the two files. End state: a 40 GB heap and indexes.
  4. VACUUM FULL holds ACCESS EXCLUSIVE on the table for the entire duration — every SELECT, INSERT, UPDATE, DELETE blocks. On a 100 GB table, this can be 2-6 hours. Production-fatal for OLTP.
  5. pg_repack does the same logical work as VACUUM FULL but in a transaction-friendly way: triggers capture concurrent changes into a log table, the background worker applies the log to the new file, and the final swap is an instant catalog change. Application impact is a few-second pause.
  6. After any of the three, you should run ANALYZE orders to refresh statistics — VACUUM does not always trigger an analyze, and stale stats cause bad query plans.

Output.

Approach Heap after Lock duration Downtime When to pick
Plain VACUUM 100 GB (60 GB reusable) none (concurrent) 0 autovacuum routine
VACUUM FULL 40 GB 2-6 hours ACCESS EXCLUSIVE full maintenance window
pg_repack 40 GB seconds (only the swap) seconds production, 24/7

Rule of thumb. Never run VACUUM FULL on a production OLTP table. The pg_repack alternative is operationally equivalent without the lock. The only legitimate VACUUM FULL use case is in a maintenance window on a table you can take offline.

Worked example — pg_repack online rebuild on a 200 GB table

Detailed explanation. pg_repack is the de facto production tool for shrinking bloated tables. The mechanics — triggers, log table, background copy, final swap — are worth understanding in detail because they explain why pg_repack works, what its constraints are, and when it can fail.

Question. Walk through a pg_repack run on a 200 GB events table with 8 indexes and 50K writes per minute. What happens at each phase, and what is the operational impact?

Input.

Setting Value
Table events
Heap size 200 GB
Index count 8
Write rate 50,000 / minute
Dead pct 55%
Target heap size after ~90 GB

Code.

# Install (one-time, on the server)
yum install pg_repack15

# Enable extension in the target database (one-time)
psql -d mydb -c 'CREATE EXTENSION pg_repack;'

# Run repack with 4 parallel index rebuilds, dry-run first
pg_repack --no-superuser-check \
          --table=events \
          --jobs=4 \
          --dry-run \
          mydb

# Real run
pg_repack --no-superuser-check \
          --table=events \
          --jobs=4 \
          mydb

# Monitor progress
SELECT * FROM repack.tables WHERE relname = 'events';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Phase 1 — setup. pg_repack creates a log table (repack.log_<oid>) and an INSERT/UPDATE/DELETE trigger on events. Every concurrent write is captured into the log table alongside being applied to the original.
  2. Phase 2 — copy. A new heap file is created (repack.table_<oid>). The background process scans the original events, copies live tuples into the new file. This is the longest phase — on a 200 GB table, expect 1-4 hours depending on disk speed. Concurrent writes continue uninterrupted, captured by the trigger.
  3. Phase 3 — index build. Indexes are built concurrently on the new heap. With --jobs=4, four indexes build in parallel. This is also long but parallelisable.
  4. Phase 4 — apply log. The captured changes from repack.log_<oid> are applied to the new heap. This catches up the new file with everything that happened during phases 2-3. Repeats until the log is empty.
  5. Phase 5 — swap. A brief ACCESS EXCLUSIVE lock is taken (seconds). The catalog is updated to point the events name at the new file. The old file is dropped. Triggers are removed.
  6. End state: events is now ~90 GB, indexes are freshly built (and smaller), and the 110 GB of dead-tuple bloat is gone. The application saw a few-second blip during the swap.

Output.

Phase Duration Lock Application impact
1. Setup seconds none trigger added
2. Heap copy 1-4 hours none full traffic OK
3. Index rebuild 30 min - 2 hours none full traffic OK
4. Apply log minutes none full traffic OK
5. Swap seconds ACCESS EXCLUSIVE (brief) seconds of pause

Rule of thumb. pg_repack needs 2x table size in free disk (old + new exist simultaneously) and a primary key or unique not-null index on the target table. Both are usually fine. The free disk requirement is the most common gotcha — you can run out of disk mid-repack and end up with the temp tables stuck.

Worked example — when to run ANALYZE separately from VACUUM

Detailed explanation. ANALYZE refreshes the planner's statistics. After a bulk load, after index creation, after a schema change — the planner can have wildly stale stats that produce bad query plans. Senior engineers run ANALYZE explicitly in these cases instead of waiting for autovacuum.

Question. A 500M-row data warehouse table just received a 100M-row bulk INSERT. Query plans on the table are now picking sequential scans where index scans were correct. What do you run and why?

Input.

Event Effect
Bulk INSERT of 100M rows n_live_tup jumps from 500M to 600M
Last autovacuum 30 minutes ago (pre-load)
Planner stats reflect 500M rows, old distribution
Query plan today seq scan instead of index scan

Code.

-- Refresh planner stats only (fast)
ANALYZE warehouse.facts;

-- Refresh and run a normal VACUUM in one pass (heavier)
VACUUM (ANALYZE, VERBOSE) warehouse.facts;

-- Force aggressive sample for large tables (default sample is 30K rows)
ALTER TABLE warehouse.facts ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE warehouse.facts;

-- Inspect stats freshness
SELECT
    schemaname, relname,
    last_analyze, last_autoanalyze,
    n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'facts';

-- Re-check the bad plan
EXPLAIN ANALYZE SELECT * FROM warehouse.facts WHERE customer_id = 42;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The bulk INSERT moved n_mod_since_analyze from 0 to 100M, but autovacuum hasn't run again yet. The planner still sees 500M rows and the old customer_id distribution.
  2. ANALYZE samples the table (default 30K rows × 300 = 9M sampled positions). It computes new histograms, MCVs (most common values), and null fractions for every column. Updates pg_statistic for the table.
  3. On large warehouse tables with high-cardinality columns, the default sample can miss important value distributions. ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 1000 increases the sample size (default 100 → 1000 means a 10x larger sample for that column).
  4. After ANALYZE, EXPLAIN ANALYZE shows the planner now picks an index scan — because the stats correctly say the index is selective. The bad seq-scan plan was a statistics problem, not a query problem.
  5. VACUUM (ANALYZE) combines both operations in one pass. Useful when you also want to reclaim dead tuples from concurrent UPDATEs, otherwise plain ANALYZE is cheaper.
  6. After any major data ingest, schedule an explicit ANALYZE in the same job — don't rely on autovacuum's autoanalyze to catch up before your dashboards run their queries.

Output.

Plan source Estimated rows Actual rows Chosen scan
Stale stats (pre-ANALYZE) 6M 12 Seq Scan (wrong)
Fresh stats (post-ANALYZE) 8 12 Index Scan (right)

Rule of thumb. After any bulk ingest > 10% of table size, run ANALYZE explicitly. After schema changes (new column, new index), run ANALYZE on the affected table. For warehouse tables with skewed distributions, bump per-column statistics targets to 500-1000.

Senior interview question on VACUUM flavours

A senior interviewer might frame this as: "You inherit a 24/7 OLTP database. The main orders table is at 70% bloat and 300 GB. The team wants to shrink it. Walk me through your decision and the runbook."

Solution Using the VACUUM-flavour decision tree + pg_repack runbook

Decision: 300 GB orders, 70% bloat, 24/7 OLTP, cannot take offline.

Step 1: Confirm bloat is real (pgstattuple, not just n_dead_tup estimate).
Step 2: Verify free disk is at least 2x table size (need ~600 GB free).
Step 3: Verify primary key exists on the table (required by pg_repack).
Step 4: Stage the rebuild during a low-traffic window (e.g. weekend).
Step 5: Run pg_repack with --dry-run first to surface any errors.
Step 6: Run real pg_repack; monitor via repack.tables view.
Step 7: After completion, run ANALYZE to refresh stats.
Step 8: Verify dead_pct returned to ~5% and table is ~90 GB.
Step 9: Tune autovacuum_vacuum_scale_factor=0.05 to prevent recurrence.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Risk Mitigation
1 pgstattuple confirms 70% bloat, 90 GB live full scan slow run off-peak
2 df -h shows 800 GB free not enough free pause / add disk
3 \d orders confirms PK on id repack will refuse add PK first
4 Saturday 02:00 window weekend pager crew on standby
5 pg_repack --dry-run missing extension install + CREATE EXTENSION
6 pg_repack --jobs=4 swap pause < 5 seconds expected
7 ANALYZE orders stale stats always do
8 n_dead_tup ≈ 100K, heap = 90 GB success monitor 24h
9 ALTER TABLE orders SET autovacuum_vacuum_scale_factor = 0.05 none per-table override

After step 9, the table stays under 100 GB indefinitely because autovacuum keeps up with the dead-tuple rate.

Output:

State Heap Indexes Dead pct Notes
Before 300 GB 90 GB 70% unsustainable
After pg_repack 90 GB 30 GB 5% one-time reclaim
After scale_factor tune 95 GB (steady) 32 GB (steady) < 10% sustainable

Why this works — concept by concept:

  • VACUUM FULL is forbidden on OLTP — ACCESS EXCLUSIVE for hours is an outage. pg_repack achieves the same file-shrink with a seconds-long swap. Naming this constraint first proves you understand production.
  • pg_repack uses triggers + log table + swap — concurrent writes are captured by triggers, the new heap is built in the background, the swap is a catalog flip. This is why pg_repack is online: writes never pause more than a few seconds.
  • Free disk is the gating constraint — 2x table size temporarily. Run out mid-repack and you have to clean up the leftover temp tables manually. Always verify before starting.
  • Sustainable reclaim needs autovacuum tuning — pg_repack solves the one-time bloat. Without lowering autovacuum_vacuum_scale_factor, the bloat returns in months. The two fixes are paired, not alternatives.
  • Cost — pg_repack cost is O(table_size + index_size) of disk I/O plus the trigger overhead on concurrent writes during the run (typically 5-15% per-write slowdown). Total wall-clock typically 1-4 hours per 100 GB; orders of magnitude cheaper than the alternative outage.

SQL
Topic — sql
Postgres maintenance problems

Practice →

SQL Topic — optimization Table optimization problems

Practice →


4. Autovacuum tuning

postgres autovacuum defaults are tuned for a 2010-era 1M-row table — production OLTP workloads need per-table overrides on every hot table

The mental model in one line: the autovacuum launcher wakes every autovacuum_naptime seconds, scans pg_stat_user_tables, and dispatches up to autovacuum_max_workers workers to tables where n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples; each worker runs a VACUUM throttled by a cost-budget system to avoid saturating I/O — and almost every production tuning problem reduces to "the global defaults don't fit my table's write rate, override them per-table". Once you internalise "threshold + scale × tuples = trigger; cost_limit/cost_delay = throttle; workers + naptime = parallelism," the entire autovacuum tuning interview surface collapses to four knobs and two cost dials.

Visual diagram of the Postgres autovacuum tuning surface — three workers running in parallel on three hot tables, with per-table overrides for scale_factor, cost_delay, and cost_limit; on a light PipeCode card.

The trigger formula.

  • A table is eligible for autovacuum when n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples.
  • Defaults: threshold=50, scale_factor=0.2. For a 1M-row table, trigger at 50 + 0.2 * 1M = 200,050 dead rows (~20% bloat).
  • For a 100M-row table: trigger at ~20M dead rows. That's 20% bloat before autovacuum starts — way too much for an OLTP workload.

The four global knobs that matter.

  • autovacuum_max_workers (default 3) — how many tables can be vacuumed in parallel. On a busy multi-table OLTP database, 5-10 is more realistic.
  • autovacuum_naptime (default 1 min) — how often the launcher scans for eligible tables. Lower (15-30s) on a database with frequent autovacuum needs.
  • autovacuum_vacuum_scale_factor (default 0.2) — the global trigger threshold as a fraction of reltuples. Almost always overridden per-table on hot tables.
  • autovacuum_vacuum_threshold (default 50) — the additive constant in the trigger formula. Matters only for small tables.

The cost-based throttle.

  • Every VACUUM operation accumulates "cost units" — a page read from disk is 10 units, a dirtied page is 20 units, a buffer access is 1 unit.
  • When the accumulated cost reaches autovacuum_vacuum_cost_limit (default 200), the worker sleeps for autovacuum_vacuum_cost_delay (default 2 ms).
  • This intentionally throttles VACUUM to avoid saturating disk I/O during normal operations. The default budget gives roughly 8 MB/s of vacuuming — fine for a 1 GB table, painfully slow for a 1 TB table.
  • For modern SSDs, tune autovacuum_vacuum_cost_limit=2000 (10x the default) and autovacuum_vacuum_cost_delay=2ms to get ~80 MB/s of vacuum throughput without saturating the I/O system.

Per-table overrides — the production lever.

  • The single most important production tuning pattern: ALTER TABLE hot_events SET (autovacuum_vacuum_scale_factor = 0.05);
  • Per-table overrides let you keep conservative defaults for the 90% of tables that don't matter, and tighten thresholds on the 10% of hot tables that drive bloat.
  • Other per-table knobs: autovacuum_vacuum_threshold, autovacuum_vacuum_cost_delay, autovacuum_vacuum_cost_limit, autovacuum_freeze_min_age, autovacuum_freeze_max_age, fillfactor.

The autoanalyze counterpart.

  • Identical mechanism applies to ANALYZE: autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples.
  • Default analyze_scale_factor=0.1 — analyze triggers at 10% modified rows. Usually fine; tune lower on warehouse tables with skewed distributions.

Operational monitoring.

  • pg_stat_progress_vacuum — real-time view of every running autovacuum, with phase, heap_blks_scanned, heap_blks_total. Use it during incidents.
  • pg_stat_user_tables.autovacuum_count — how many times each table has been autovacuumed since stats reset. Hot tables should be in the hundreds or thousands.
  • pg_stat_user_tables.last_autovacuum — last time each table was autovacuumed. > 24 hours on a hot table = trouble.

Common interview probes on autovacuum tuning.

  • "What is the autovacuum trigger?" — n_dead_tup > threshold + scale_factor * reltuples.
  • "How do you tune for a hot table?" — per-table override of autovacuum_vacuum_scale_factor to 0.02-0.05.
  • "Why does autovacuum throttle?" — cost-based system to avoid saturating I/O during normal operations.
  • "When do you increase autovacuum_max_workers?" — when multiple hot tables compete for vacuum slots; default 3 is too few for a 20-table OLTP database.

Worked example — tuning a hot events table from defaults to production

Detailed explanation. A hot_events table receives 1M writes per hour. At default scale_factor=0.2, autovacuum runs every ~5 hours but the table grows by 50% between runs — bloat peaks at 50%. The fix is a per-table override that triggers autovacuum every ~1 hour at ~10% bloat.

Question. Walk through the autovacuum math on hot_events (10M rows, 1M writes/hour) at default settings and after a tuned override. Show the SQL and the expected bloat curve.

Input.

Setting Default Tuned
reltuples 10,000,000 10,000,000
autovacuum_vacuum_threshold 50 1,000
autovacuum_vacuum_scale_factor 0.2 0.05
autovacuum_vacuum_cost_limit 200 2,000
autovacuum_vacuum_cost_delay 2 ms 2 ms

Code.

-- inspect current per-table settings (will be empty if none)
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'hot_events';

-- per-table override for production tuning
ALTER TABLE hot_events SET (
    autovacuum_vacuum_threshold     = 1000,
    autovacuum_vacuum_scale_factor  = 0.05,
    autovacuum_vacuum_cost_limit    = 2000,
    autovacuum_vacuum_cost_delay    = 2,
    autovacuum_analyze_scale_factor = 0.02,
    fillfactor                       = 80
);

-- verify the override took effect
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'hot_events';

-- compute the resulting trigger threshold
SELECT
    1000 + 0.05 * 10000000 AS new_trigger,
    50   + 0.2  * 10000000 AS old_trigger;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Default trigger: 50 + 0.2 * 10M = 2,000,050 dead tuples — about 5 hours of writes at 1M/hour and a 40% UPDATE rate (assume 400K dead/hour). Bloat peaks at ~20% before autovacuum runs.
  2. Tuned trigger: 1000 + 0.05 * 10M = 501,000 dead tuples — about 1.25 hours of writes. Bloat peaks at ~5%.
  3. The cost-limit bump (200 → 2000) lets the autovacuum worker actually finish before the next trigger fires. At default cost_limit, a 10M-row table can take 20+ minutes to vacuum; at 2000, it finishes in ~2 minutes.
  4. fillfactor=80 reserves page space so HOT updates can stay on the same page. This further reduces the dead-tuple production rate for the same workload.
  5. autovacuum_analyze_scale_factor=0.02 triggers an analyze at 2% modified rows. On a hot table, this keeps the planner stats fresh between autovacuum cycles.
  6. After deploying these settings: monitor pg_stat_user_tables for the next 24 hours. Expect n_dead_tup to oscillate around 500K (the new trigger), autovacuum_count to climb by 20+/day (vs 4-5/day at defaults), and the heap size to stabilise.

Output.

Metric Default Tuned
Autovacuum trigger 2M dead rows 500K dead rows
Autovacuum frequency every 5 h every 1.25 h
Peak bloat ~20% ~5%
Vacuum wall time 20+ min ~2 min
HOT update rate ~40% (typical) ~70%+ (with fillfactor 80)

Rule of thumb. Every table over 10M rows with > 1000 writes per minute should have per-table autovacuum overrides. The defaults are designed for safety, not for production scale. Lower scale_factor, raise cost_limit, and set fillfactor=80 is the standard production starting point.

Worked example — increasing autovacuum_max_workers for a multi-table OLTP

Detailed explanation. A database has 20 hot tables, each producing dead tuples at a similar rate. With autovacuum_max_workers=3 (default), only 3 tables can be vacuumed at any given moment — the other 17 queue up. On busy systems this means hot tables don't get vacuumed often enough.

Question. Walk through how to safely raise autovacuum_max_workers from 3 to 8 on a 16-core database server with 64 GB RAM and SSD storage. What are the constraints?

Input.

Resource Value
CPU cores 16
Memory 64 GB
Storage SSD, 5 GB/s read, 2 GB/s write
Hot table count 20
Current autovacuum_max_workers 3
maintenance_work_mem 256 MB

Code.

-- global settings (postgresql.conf)
-- autovacuum_max_workers = 8
-- autovacuum_naptime     = 15s          (default 1min — wake more often)
-- maintenance_work_mem   = 1GB          (each worker can use this much)
-- autovacuum_vacuum_cost_limit = 2000   (faster vacuum throughput)
-- autovacuum_vacuum_cost_delay = 2ms

-- apply via ALTER SYSTEM (requires reload)
ALTER SYSTEM SET autovacuum_max_workers      = 8;
ALTER SYSTEM SET autovacuum_naptime          = '15s';
ALTER SYSTEM SET maintenance_work_mem        = '1GB';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 2;

-- reload (autovacuum_max_workers REQUIRES a restart, not just reload)
SELECT pg_reload_conf();
-- ... then restart Postgres for max_workers to take effect

-- monitor concurrency after restart
SELECT count(*) FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Default autovacuum_max_workers=3 means at most 3 tables vacuum in parallel. On 20 hot tables, that's a 6-7x serialisation factor — most tables wait their turn.
  2. Bumping to 8 workers lets 8 tables vacuum concurrently. With 16 CPU cores, the workers consume at most ~50% of CPU during peak — leaves plenty of headroom for application queries.
  3. Memory budget: each worker can use up to maintenance_work_mem (1 GB tuned). 8 workers × 1 GB = 8 GB peak — comfortably within the 64 GB total. If you also run manual VACUUMs, account for those.
  4. Disk budget: each worker reads and writes at a rate throttled by cost_limit/cost_delay. At 2000/2ms, each worker delivers ~80 MB/s. 8 workers = 640 MB/s — well within the SSD's 2 GB/s write capacity.
  5. autovacuum_max_workers requires a Postgres restart (it's a max-connections-style setting that allocates worker slots at startup). Plan a maintenance window for the change.
  6. After restart, monitor pg_stat_user_tables.autovacuum_count — you should see it climb 2-3x faster on hot tables, with last_autovacuum staying under 1 hour for every busy table.

Output.

Metric Before (3 workers) After (8 workers)
Tables vacuumed in parallel up to 3 up to 8
Average wait for hot table ~30 min < 5 min
Peak CPU usage during vacuum ~15% ~40%
Peak disk write 240 MB/s 640 MB/s
Memory usage (worker pool) 3 × 256 MB = 768 MB 8 × 1 GB = 8 GB

Rule of thumb. Set autovacuum_max_workers = min(cores / 2, hot_table_count). On modern 16+ core servers with SSD, 6-10 workers is the right range. Don't bump it without also raising maintenance_work_mem and autovacuum_vacuum_cost_limit.

Worked example — autovacuum cost-budget tuning on SSD

Detailed explanation. The default autovacuum_vacuum_cost_limit=200 was set when "fast storage" meant 5400 RPM disks. On modern NVMe SSD, you can safely deliver 10x the vacuum throughput without affecting application latency. Senior engineers know to tune this knob explicitly.

Question. A 500 GB OLTP database on NVMe SSD has autovacuum running constantly but not keeping up — bloat keeps creeping upward. Cost limit is at default. Walk through the throughput math and the tuning.

Input.

Resource Value
Storage NVMe SSD, 7 GB/s read, 4 GB/s write
autovacuum_vacuum_cost_limit 200 (default)
autovacuum_vacuum_cost_delay 2 ms (default)
vacuum_cost_page_dirty 20 (default)
Database write rate 50 MB/s sustained
Bloat trend +1% per day

Code.

-- compute effective vacuum throughput at defaults
-- 200 cost_limit / 20 cost_page_dirty = 10 dirty pages per cycle
-- each cycle ends with a 2ms sleep
-- 1000ms / 2ms = 500 cycles/sec
-- 500 cycles * 10 pages = 5000 pages/sec
-- 5000 pages * 8KB = 40 MB/s

-- With default settings, vacuum can dirty ~40 MB/s of pages
-- App is writing 50 MB/s sustained ⇒ vacuum cannot keep up ⇒ bloat

-- Tuned for NVMe
ALTER SYSTEM SET autovacuum_vacuum_cost_limit  = 4000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay  = 2;
ALTER SYSTEM SET vacuum_cost_page_dirty        = 20; -- unchanged
ALTER SYSTEM SET vacuum_cost_page_hit          = 1;  -- unchanged
ALTER SYSTEM SET vacuum_cost_page_miss         = 2;  -- lowered from 10
SELECT pg_reload_conf();

-- recheck throughput
-- 4000 / 20 = 200 dirty pages per cycle
-- 500 cycles/sec * 200 pages * 8KB = 800 MB/s

-- now vacuum can dirty 800 MB/s vs app writing 50 MB/s
-- ⇒ vacuum stays well ahead of bloat production
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. At defaults, autovacuum's effective dirty-page throughput is (cost_limit / cost_page_dirty) * (1000ms / cost_delay) * 8KB = (200/20) * (1000/2) * 8KB = 40 MB/s. On a sustained 50 MB/s write workload, vacuum cannot keep up.
  2. The slow-disk era assumption is cost_page_miss=10 — a random read was expensive. On SSD with random I/O nearly free, lower this to 2. Same change applies to cost_page_hit=1 (default fine).
  3. Raising cost_limit to 4000 (20x default) gives autovacuum 800 MB/s of dirty-page throughput — far more than the 50 MB/s workload produces. Vacuum stays well ahead.
  4. cost_delay=2ms is the minimum reasonable; lower than that gives diminishing returns because the OS scheduler granularity is around 1-2 ms.
  5. The trade-off: at 800 MB/s of vacuum dirty-page rate, the autovacuum workers do compete with foreground queries for buffer cache and disk I/O. On NVMe with bandwidth headroom, the contention is invisible to applications. On rotational disks, this configuration would saturate IOPS — never use it.
  6. Monitor pg_stat_io (Postgres 16+) to see how many pages autovacuum reads vs the application. If autovacuum I/O dominates, ratchet back cost_limit.

Output.

Configuration Effective vacuum throughput Catches up to 50 MB/s writes?
Defaults 40 MB/s no — bloat grows
cost_limit=2000 400 MB/s yes (8x headroom)
cost_limit=4000 800 MB/s yes (16x headroom)

Rule of thumb. On NVMe SSD, set autovacuum_vacuum_cost_limit=2000-4000 and vacuum_cost_page_miss=2. On rotational disks, leave defaults. The "slow vacuum on fast disk" anti-pattern is one of the most common Postgres operational misconfigurations in 2026.

Senior interview question on autovacuum tuning

A senior interviewer might frame this as: "You're on call. A 50 GB orders table is at 65% bloat and growing. Autovacuum is enabled but seemingly not keeping up. Walk me through your tuning, in priority order, in the next 30 minutes."

Solution Using the autovacuum diagnostic + tuning playbook

30-min autovacuum diagnostic + tuning playbook
===============================================

Minute 0-5: Diagnose
  - SELECT * FROM pg_stat_progress_vacuum;
    (is autovacuum even running on this table?)
  - SELECT * FROM pg_stat_user_tables WHERE relname = 'orders';
    (last_autovacuum, autovacuum_count, n_dead_tup)
  - Check for long-running transactions blocking the horizon.

Minute 5-10: Tune trigger
  - ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);
  - Triggers next autovacuum at 5% bloat instead of 20%.

Minute 10-15: Tune throughput
  - ALTER TABLE orders SET (
      autovacuum_vacuum_cost_limit = 4000,
      autovacuum_vacuum_cost_delay = 2
    );
  - Vacuum delivers ~800 MB/s on SSD, plenty for 50 GB.

Minute 15-20: Force immediate vacuum (if patience runs out)
  - VACUUM (VERBOSE, ANALYZE) orders;
  - Manual run, picks up new settings; reclaims dead tuples.

Minute 20-25: Verify
  - SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname='orders';
  - Should drop sharply.

Minute 25-30: Document + monitor
  - Add per-table overrides to runbook.
  - Set alerting on dead_pct > 15% on this table.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Minute Action Expected result
0-5 Diagnose Identify default settings on hot table
5-10 scale_factor 0.2 → 0.05 Future triggers at 5%
10-15 cost_limit 200 → 4000 Vacuum throughput 40 → 800 MB/s
15-20 Manual VACUUM n_dead_tup drops by ~95%
20-25 Confirm autovacuum_count incremented, last_autovacuum recent
25-30 Document runbook captures the per-table overrides

After 30 minutes the bloat is reclaimed and the new settings prevent recurrence. The on-call closes the incident with a permanent fix, not a band-aid.

Output:

Setting Before After
autovacuum_vacuum_scale_factor 0.2 (DB default) 0.05 (table override)
autovacuum_vacuum_cost_limit 200 4000
Effective throughput 40 MB/s 800 MB/s
Peak bloat 65% < 8%
Recovery time "never" minutes

Why this works — concept by concept:

  • Diagnose before you tune — confirm autovacuum is actually running on the right table, not blocked by a long-running transaction or a missing extension. The horizon check before tuning is the senior signal.
  • Per-table overrides — global tuning is too blunt; production databases have 5-50 tables with different write profiles. The per-table override pattern is the daily-driver tool.
  • Trigger vs throughputscale_factor controls when vacuum runs; cost_limit controls how fast it runs. Tune both; they solve different sub-problems.
  • Force a manual VACUUM for immediate relief — autovacuum runs on its own schedule; if you need bloat gone now, run VACUUM by hand. It picks up the new per-table settings immediately.
  • CostO(table_size) per vacuum pass; per-cycle cost is throttled by cost_limit/cost_delay. The throttle exists to prevent vacuum from saturating I/O — tune it on fast storage so vacuum can actually keep up.

SQL
Topic — sql
Autovacuum tuning problems

Practice →

SQL Topic — optimization Database tuning drills

Practice →


5. Wraparound and emergency procedures

transaction id wraparound is the rare Postgres outage that everyone has to know about — 32-bit XID counter, 2 billion transactions, and a hard shutdown if vacuum freeze falls behind

The mental model in one line: Postgres assigns every transaction a 32-bit XID; when the counter approaches 2 billion (~2^31), Postgres must "freeze" old tuples (mark them visible-to-all) so the wrapped XID space stays consistent — and if freezing falls behind, the database refuses writes until you run a manual VACUUM FREEZE to catch up. Once you internalise "XID is 32-bit, freeze prevents wraparound, missing the freeze deadline = hard outage," the wraparound interview surface reduces to a known runbook.

Visual diagram of a Postgres XID counter approaching the 2B wraparound threshold — green warning, orange alert, and a red shutdown band, with an emergency vacuumdb command arrow; on a light PipeCode card.

The 32-bit XID — why wraparound exists.

  • Every transaction gets a unique 32-bit XID. After 2^32 = ~4.3 billion transactions, the counter wraps to 0.
  • Naively, an old tuple with xmin=10 would suddenly appear "newer" than a transaction with xmin=20 after the counter wrapped — that would break visibility.
  • Postgres prevents this by "freezing" old tuples — setting a hint bit that says "this tuple is visible to all current and future transactions, ignore its xmin."

The freeze thresholds.

  • vacuum_freeze_min_age (default 50M) — tuples older than 50M XIDs are eligible to be frozen by a regular VACUUM run.
  • autovacuum_freeze_max_age (default 200M) — if any table's oldest XID is more than 200M XIDs behind current, autovacuum runs a forced "aggressive" mode VACUUM on that table to freeze.
  • vacuum_freeze_table_age (default 150M) — VACUUM switches to aggressive (scans every page, not just non-all-visible) when the table's oldest XID hits this age.
  • The defaults give you ~1.8 billion XIDs of headroom (2B XID limit minus 200M freeze max age). At 1M transactions/hour, that's ~75 days — plenty of warning if autovacuum is working.

The age columns — your visibility into wraparound.

  • pg_database.datfrozenxid — the oldest XID still considered "non-frozen" anywhere in the database. The wraparound countdown is 2^31 - age(datfrozenxid).
  • pg_class.relfrozenxid — same, but per-table. Autovacuum advances this every time it freezes the oldest tuples.
  • age(xid) — Postgres function returning the distance between an XID and the current XID in modular-32-bit arithmetic. The closer to 2 billion, the more urgent the freeze.

The wraparound warning ladder.

  • At age(datfrozenxid) > 200M: autovacuum starts forcing aggressive freeze runs on every table.
  • At age > 1B: log warnings ("oldest transaction's age is..."). Time to investigate.
  • At age > 1.5B: log STRONG warnings. Pager territory.
  • At age > 2B: hard shutdown. Postgres refuses to accept writes (and eventually reads). The only fix is single-user mode VACUUM FREEZE.

The emergency runbook.

  • Step 1: Verify autovacuum is actually running. SELECT * FROM pg_stat_activity WHERE backend_type='autovacuum worker'.
  • Step 2: Identify the oldest table. SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind='r' ORDER BY age(relfrozenxid) DESC LIMIT 10.
  • Step 3: Run VACUUM FREEZE relname manually on the offending table, ideally in parallel for multiple tables.
  • Step 4: If approaching the hard limit, run vacuumdb --freeze --all from shell — runs VACUUM FREEZE against every table in every database.
  • Step 5: If hard limit hit, restart Postgres in single-user mode and run VACUUM FREEZE. Last-resort recovery.

What blocks freeze.

  • A long-running transaction holds back the horizon, just like for plain VACUUM. The horizon must advance past a tuple's xmax before that tuple can be frozen.
  • Read-only replicas: by default, do not block freeze on the primary. (hot_standby_feedback=on does propagate the replica's horizon back to the primary — be careful.)
  • Logical replication: an open replication slot holds back the horizon. A long-disconnected logical replica is a common wraparound cause.

Monitoring.

  • Alert on age(datfrozenxid) > 800M per database. Gives weeks of headroom.
  • Alert on any single table's age(relfrozenxid) > 500M. Catches per-table autovacuum failures.
  • Track xact_commit + xact_rollback per second to estimate XID burn rate. A spike forecasts wraparound earlier.

Common interview probes on wraparound.

  • "What is XID wraparound?" — the 32-bit XID counter looping; visibility breaks if old tuples aren't frozen first.
  • "How does Postgres prevent it?" — autovacuum runs aggressive freeze runs at autovacuum_freeze_max_age.
  • "What happens if it isn't prevented?" — database refuses writes; single-user-mode VACUUM FREEZE recovery.
  • "How would you monitor it?" — pg_database.datfrozenxid and per-table relfrozenxid age.

Worked example — diagnosing a table approaching 1.5B XID age

Detailed explanation. A monitoring alert fires: age(datfrozenxid) = 1.5B on the production database. That's the orange-warning zone — about 25-30 days from forced shutdown. You need to identify which table is the laggard and drain it safely.

Question. A wraparound alert fires. Walk through the diagnostic queries and the freeze plan to get the database back below 200M XID age within 24 hours.

Input.

Metric Value
age(datfrozenxid) 1,500,000,000
Hard limit 2,146,483,648
Days until limit at current XID rate ~28
Largest table (rows) events (500M)
Autovacuum running? yes, but on small tables

Code.

-- 1. Database-level age
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- 2. Tables with the oldest relfrozenxid
SELECT
    schemaname, relname,
    age(relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS size,
    last_autovacuum
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;

-- 3. Verify no long-running txn is blocking
SELECT pid, backend_xmin, age(backend_xmin) AS backend_age,
       now() - xact_start AS xact_age, application_name
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

-- 4. Check replication slot lag (can block freeze)
SELECT slot_name, slot_type, active, age(xmin) AS slot_xid_age
FROM pg_replication_slots
ORDER BY age(xmin) DESC;

-- 5. Force freeze on the offending table
VACUUM (FREEZE, VERBOSE) public.events;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The database age (1.5B) is dominated by the table with the oldest relfrozenxid. The catalog query in step 2 surfaces the top 10 offenders — usually one or two large tables that autovacuum hasn't been able to fully freeze.
  2. If a table is large (500M rows), a full VACUUM FREEZE can take hours. The plan is to start the freeze immediately, not wait for autovacuum's normal schedule.
  3. Step 3 checks for the most common freeze blocker: a long-running transaction. If you find one with age(backend_xmin) > 100M, terminate it before freezing.
  4. Step 4 catches the second most common blocker: an inactive replication slot. A logical replica that disconnected weeks ago and was never cleaned up holds back the horizon. Drop the slot if confirmed abandoned (SELECT pg_drop_replication_slot('slot_name')).
  5. The manual VACUUM (FREEZE, VERBOSE) events runs an aggressive freeze on the large table. Lock level is SHARE UPDATE EXCLUSIVE — application traffic continues uninterrupted.
  6. After the freeze finishes (might be 2-6 hours), recheck pg_database.datfrozenxid — should drop from 1.5B to under 200M, restoring the normal headroom.

Output.

Phase age(datfrozenxid) Action Status
Alert fires 1,500,000,000 diagnose top tables events is the laggard
Check blockers 1,500,000,000 no long txn, no stuck slot safe to freeze
Freeze events running, 2 hours aggressive scan in progress
After freeze 100,000,000 verify pg_database under control
Tune 100,000,000 lower freeze_max_age for events prevent recurrence

Rule of thumb. Anytime age(datfrozenxid) crosses 800M, treat it as a P1. The 30-day shutdown clock is the most consequential operational deadline in Postgres. Pause non-essential maintenance and run the freeze immediately.

Worked example — single-user mode emergency recovery

Detailed explanation. The worst case: you ignored the warnings, the XID counter is at 2B+1, and Postgres has shut down to prevent corruption. The only path back is single-user mode VACUUM FREEZE — a manual recovery procedure every senior Postgres DBA should know cold.

Question. Postgres refuses to start, logs show "database is not accepting commands to avoid wraparound data loss". Walk through the single-user-mode recovery.

Input.

State Value
Postgres status down (in shutdown to prevent wraparound)
pg_database.datfrozenxid age 2,146,483,648 +
Disk free 40%
Backups last good at 6 hours ago
Replication replica at same XID — also stuck

Code.

# Stop the running postmaster (it will not accept normal connections anyway)
pg_ctl stop -D /var/lib/postgresql/data -m immediate

# Start in single-user mode against the affected database
postgres --single -D /var/lib/postgresql/data mydb

# Inside single-user mode prompt, run:
# (note: no transaction wrapping; each statement is a transaction)
VACUUM FREEZE;     -- runs against current DB; can take hours
# Wait until completion.

# Exit single-user mode (Ctrl-D or \q)

# Start Postgres normally
pg_ctl start -D /var/lib/postgresql/data

# Verify
psql -d mydb -c "SELECT datname, age(datfrozenxid) FROM pg_database;"
# Should now show low ages.

# Repeat for every other database in the cluster
postgres --single -D /var/lib/postgresql/data otherdb
# VACUUM FREEZE;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Postgres refuses to start in normal mode because the wraparound risk is too high. Single-user mode (postgres --single) is the recovery interface — a stripped-down standalone Postgres that bypasses the wraparound check.
  2. Single-user mode talks directly to the storage layer. Connect to one database at a time; the wraparound condition is per-database (datfrozenxid is per-DB).
  3. The VACUUM FREEZE command in single-user mode runs an aggressive freeze across every table in the current database. On a 500 GB database with billions of tuples, this can take 2-8 hours. There is no progress indicator — just wait.
  4. After the freeze completes, exit single-user mode and start Postgres normally. The startup check now sees datfrozenxid ages well below the 2B threshold and allows full operations.
  5. If the cluster has multiple databases, repeat the procedure for each one — single-user mode only handles one DB at a time, and only the currently-stuck DBs need this treatment.
  6. After recovery, audit the autovacuum configuration: why did freezing fall behind? Common causes are autovacuum disabled, a long-running transaction held the horizon, a stuck replication slot, or autovacuum was so throttled it couldn't keep up.

Output.

Phase Status Wall time
Stop postmaster done seconds
Single-user mode VACUUM FREEZE running 2-8 hours
Exit single-user mode done seconds
Start Postgres normally accepting connections seconds
Recheck datfrozenxid age 50M-100M (safe) seconds

Rule of thumb. Single-user mode is the recovery path of last resort. The procedure works, but it costs you hours of downtime and shakes any trust the application team has in the database. Monitor age(datfrozenxid) and alert at 800M — you should never find yourself running this procedure if your alerting is in place.

Worked example — vacuumdb --freeze --all preemptive run

Detailed explanation. Sometimes you discover the warning early enough — age = 1.2B, no immediate emergency, but autovacuum is clearly behind. The right move is a controlled, parallel vacuumdb --freeze --all from shell that catches up every table in every database without waiting for autovacuum's schedule.

Question. Show how to run a parallel preemptive freeze across an entire Postgres cluster using vacuumdb from the shell, before wraparound becomes a real emergency.

Input.

Setting Value
Cluster databases 5
Largest database 500 GB
age(datfrozenxid) max across DBs 1.2B
Available maintenance window 6 hours overnight
CPU cores 16

Code.

# Inspect current age across all databases first
psql -c "SELECT datname, age(datfrozenxid)
         FROM pg_database
         ORDER BY age(datfrozenxid) DESC;"

# Run preemptive freeze with 6 parallel jobs against every DB
vacuumdb --freeze \
         --all \
         --jobs=6 \
         --verbose \
         2>&1 | tee /var/log/postgres/freeze.log

# Or target a specific DB if cluster is too large to do at once
vacuumdb --freeze --jobs=6 --verbose --dbname=mydb

# Skip indexes for a speed boost (rarely needed; the freeze itself is the bottleneck)
vacuumdb --freeze --jobs=6 --skip-locked --no-index-cleanup mydb

# Watch progress live
psql -c "SELECT pid, datname, relid::regclass, phase,
                heap_blks_scanned, heap_blks_total,
                round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total,0),2) AS pct
         FROM pg_stat_progress_vacuum;"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. vacuumdb --freeze --all --jobs=6 runs VACUUM (FREEZE) against every table in every database, with up to 6 vacuums running in parallel. It's the official "catch up on wraparound" command in the Postgres toolbox.
  2. The --jobs flag is per database; with 5 databases and --jobs=6, you can have up to 6 vacuums running at once on the largest database. The other DBs run sequentially after.
  3. --skip-locked skips any table currently locked by another operation rather than waiting. Useful if you have ongoing migrations or other vacuum activity.
  4. --no-index-cleanup skips index vacuum, which can be 3-5x faster on tables with many indexes. The trade-off: index bloat doesn't get cleaned this run. Use only if you're racing the wraparound clock.
  5. pg_stat_progress_vacuum gives live per-table progress, including phase (scanning heap, vacuuming indexes, etc.) and heap blocks scanned vs total. Monitor it during the run to estimate completion time.
  6. After completion, recheck age(datfrozenxid) per DB — should drop to under 100M. Schedule a postmortem to fix the autovacuum config that let it grow to 1.2B in the first place.

Output.

DB Before age After age Wall time Tables frozen
mydb (main) 1,200,000,000 80,000,000 4 h 15 min 247
analytics 800,000,000 50,000,000 1 h 30 min 89
reporting 600,000,000 40,000,000 45 min 52
catalog 200,000,000 30,000,000 10 min 18
logs 150,000,000 25,000,000 8 min 12

Rule of thumb. Run vacuumdb --freeze --all --jobs=N as a quarterly preventive job on any production cluster with hot tables. It's the easiest insurance against wraparound — costs you one maintenance window per quarter and guarantees you'll never face the single-user-mode emergency.

Senior interview question on wraparound

A senior interviewer might frame this as: "Your monitoring shows age(datfrozenxid) = 1.8B on a production primary. The XID burn rate is 8M/hour. Walk me through the next 24 hours — diagnostic, mitigation, root cause, prevention."

Solution Using the wraparound emergency runbook

T+0h    Diagnose
        - Identify table(s) with oldest relfrozenxid.
        - Check for long-running transactions blocking horizon.
        - Check for inactive replication slots holding XIDs.

T+0h    Compute remaining headroom
        - 2.1B − 1.8B = 300M XIDs remaining.
        - 8M/hour → 37 hours until shutdown.
        - 30-hour window to act.

T+0-1h  Clear blockers
        - pg_terminate_backend(<long-txn pid>)
        - pg_drop_replication_slot(<stuck slot>)

T+1-6h  Run freeze on top offenders in parallel
        - vacuumdb --freeze --jobs=8 --table=events --table=orders mydb
        - Monitor via pg_stat_progress_vacuum.
        - Expect age to drop sharply on each table as it completes.

T+6-12h Run vacuumdb --freeze --all to sweep the rest
        - Catches every other table that's still behind.

T+12h   Verify
        - age(datfrozenxid) under 200M.
        - Headroom restored to ~1.9B XIDs (~10 days).

T+24h   Postmortem + permanent fix
        - Lower autovacuum_freeze_max_age on hot tables to 100M.
        - Add per-table override for events: autovacuum_vacuum_scale_factor=0.02.
        - Add Datadog alert on any age(datfrozenxid) > 800M.
        - Document the runbook in wiki for next on-call.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Time age(datfrozenxid) Action Headroom
T+0h 1,800,000,000 diagnose 30 hours
T+1h 1,800,000,000 clear long txn + dead slot 30 hours
T+3h 1,500,000,000 events VACUUM FREEZE half-done restored
T+6h 800,000,000 events fully frozen restored
T+9h 400,000,000 orders fully frozen restored
T+12h 150,000,000 other tables frozen restored
T+24h 130,000,000 (steady) postmortem complete 10+ days

The emergency is resolved in 12 hours of vacuum work, plus 12 hours for the postmortem and permanent fix. The key signal: the database stayed online the whole time — no ACCESS EXCLUSIVE, no single-user mode, no application impact.

Output:

Outcome Status
Database stays online yes (no shutdown)
Application impact negligible (no ACCESS EXCLUSIVE)
Headroom restored from 30 hours to 10+ days
Root cause fixed autovacuum tuned per-table
Monitoring upgraded alert at 800M added

Why this works — concept by concept:

  • Headroom math first — knowing exactly how many hours you have before shutdown turns panic into a plan. The XID burn rate × remaining XIDs is the most actionable number in the runbook.
  • Clear blockers before vacuuming — long transactions and stuck replication slots prevent freeze. No amount of vacuumdb --freeze helps until the horizon can actually advance.
  • Parallel freeze on top offenders — instead of waiting for autovacuum's schedule, kick off vacuumdb --freeze --jobs=N against the largest tables manually. They run in parallel and finish dramatically faster than the default autovacuum cadence.
  • Postmortem yields permanent tuning — wraparound emergencies are autovacuum-tuning bugs in disguise. The permanent fix is per-table autovacuum_freeze_max_age lowered to 100M plus tighter scale_factor to keep dead-tuple accumulation in check.
  • CostO(table_size + index_size) per table to freeze; runs in parallel across tables and across DBs. Single-user mode recovery is the only fully-offline option and the only one with a O(downtime_hours) business cost — keep that off the table at all costs by monitoring early.

SQL
Topic — sql
Postgres wraparound problems

Practice →

ETL
Topic — etl
Postgres operational drills

Practice →


Cheat sheet — Postgres VACUUM recipes

  • Per-table autovacuum override (the daily-driver tune). ALTER TABLE hot_events SET (autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_threshold=1000, autovacuum_vacuum_cost_limit=2000, autovacuum_vacuum_cost_delay=2, fillfactor=80); — five settings, tuned for any table over 10M rows with > 1000 writes/min.
  • Quick bloat estimate from pg_stat_user_tables. SELECT relname, n_live_tup, n_dead_tup, round(100.0 * n_dead_tup / NULLIF(n_live_tup,0), 2) AS dead_pct FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20; — cheap, always available, refreshed on every commit.
  • Precise bloat audit via pgstattuple. CREATE EXTENSION pgstattuple; SELECT pg_size_pretty(table_len), round(dead_tuple_percent::numeric, 2) AS dead_pct FROM pgstattuple('public.orders'); — full scan cost, exact answer.
  • pg_repack one-shot. pg_repack --no-superuser-check --table=orders --jobs=4 mydb — online table rebuild; requires the pg_repack extension and 2x table size in free disk.
  • Wraparound monitor query. SELECT datname, age(datfrozenxid), 2146483648 - age(datfrozenxid) AS xids_remaining FROM pg_database ORDER BY age(datfrozenxid) DESC; — run hourly via monitoring; alert at < 1.3B XIDs remaining (age > 800M).
  • Per-table oldest XID. SELECT relname, age(relfrozenxid), pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind='r' ORDER BY age(relfrozenxid) DESC LIMIT 20; — find the laggards before they hit the cluster shutdown threshold.
  • Find horizon-blocking sessions. SELECT pid, backend_xmin, age(backend_xmin) AS xmin_age, now()-xact_start AS xact_age, application_name FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY age(backend_xmin) DESC LIMIT 10; — the first query to run when bloat is growing despite autovacuum activity.
  • Idle-in-transaction safety net. Set idle_in_transaction_session_timeout='15min' and statement_timeout='30min' in postgresql.conf. Prevents most accidental horizon blocks.
  • Preemptive cluster-wide freeze. vacuumdb --freeze --all --jobs=6 --verbose from shell — run quarterly in a maintenance window to catch up any tables autovacuum has fallen behind on.
  • Emergency single-user-mode recovery. pg_ctl stop -m immediate; postgres --single -D <data_dir> <db>; VACUUM FREEZE; — last resort if wraparound triggered a shutdown; budget 2-8 hours per database.
  • Aggressive index vacuum. VACUUM (PARALLEL 4, INDEX_CLEANUP on) orders; — uses up to 4 parallel index workers; Postgres 14+ feature. Use on tables with many indexes for a 3-5x speedup over single-threaded VACUUM.
  • Skip index cleanup for a wraparound race. VACUUM (FREEZE, INDEX_CLEANUP off) huge_table; — only the heap freeze runs; index bloat ignored. Recover from wraparound risk in half the time; rebuild indexes later.
  • HOT update rate as a tuning signal. SELECT relname, n_tup_upd AS updates, n_tup_hot_upd AS hot_updates, round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd,0), 2) AS hot_pct FROM pg_stat_user_tables ORDER BY n_tup_upd DESC LIMIT 20; — anything under 30% HOT on a busy table means an indexed column is being updated; revisit the indexes.
  • Per-table FREEZE override on a hot table. ALTER TABLE huge_table SET (autovacuum_freeze_max_age=100000000, vacuum_freeze_min_age=10000000); — forces aggressive freeze to start earlier on this table specifically.
  • Live autovacuum progress. SELECT pid, datname, relid::regclass AS rel, phase, heap_blks_scanned, heap_blks_total, round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total,0), 2) AS pct FROM pg_stat_progress_vacuum; — use during incidents to see exactly what each worker is doing.

Frequently asked questions

What is MVCC in PostgreSQL?

MVCC (Multi-Version Concurrency Control) is the technique PostgreSQL uses to let many transactions read and write the same data concurrently without traditional row-level read locks. Instead of updating rows in place, every UPDATE writes a new tuple version stamped with the current transaction's XID in xmin and marks the previous version's xmax. Each transaction sees a snapshot — the set of tuples whose xmin committed before the snapshot started and whose xmax is either zero or committed after. The cost of postgresql mvcc is "dead tuples": old versions that have to be reclaimed later by postgres vacuum and its autovacuum daemon. This is why Postgres needs VACUUM at all — it's the maintenance work that pays the bill for lock-free reads.

What is the difference between VACUUM and VACUUM FULL in Postgres?

Plain postgres vacuum reclaims dead-tuple space inside the existing table file — it marks the dead tuples reusable, updates the visibility map, and prunes HOT chains, but it does not call ftruncate on the file. The file stays the same size; future INSERTs and UPDATEs reuse the freed slots. vacuum full, by contrast, rewrites the entire table into a new file and atomically swaps it, holding an ACCESS EXCLUSIVE lock that blocks every other operation on the table for the duration. VACUUM FULL is the only way to return space to the OS via the built-in toolset — but it is production-fatal on any table with active traffic. The 2026 senior answer is "we never run VACUUM FULL in production; we use pg_repack for online table rebuilds when we need the file to actually shrink."

How do I prevent transaction ID wraparound in PostgreSQL?

Prevention is mostly automatic: autovacuum runs an aggressive vacuum freeze on any table whose oldest XID is more than autovacuum_freeze_max_age (default 200M) behind current. The freeze sets a hint bit marking the tuple "visible to all," removing it from the wraparound countdown. The two things that go wrong are (1) long-running transactions blocking the horizon so freeze cannot advance, and (2) inactive replication slots holding XIDs that prevent freeze. Monitor age(datfrozenxid) per database and alert at 800M XIDs (gives weeks of headroom); investigate any session with backend_xmin older than 100M; drop any abandoned replication slot. For preemptive catch-up, run vacuumdb --freeze --all --jobs=N quarterly. If you ever see transaction id wraparound shutdown messages, recover via single-user mode VACUUM FREEZE — but tune autovacuum so you never see it.

What is autovacuum tuning — which knobs actually matter?

The four knobs that matter in 95% of production tuning: (1) autovacuum_vacuum_scale_factor — the dead-tuple fraction that triggers a vacuum, default 0.2; lower to 0.02-0.05 per-table on hot tables to keep peak bloat under 5%. (2) autovacuum_vacuum_cost_limit — the per-cycle work budget, default 200; raise to 2000-4000 on SSD storage so vacuum can actually keep up with high write rates. (3) autovacuum_max_workers — how many tables vacuum in parallel, default 3; raise to 6-10 on multi-table OLTP systems with many cores. (4) fillfactor per table — reserve page space for HOT updates, default 100 (none reserved); set 70-80 on UPDATE-heavy tables to push HOT update rate above 70%. The pattern: keep global defaults conservative, override the four knobs per-table via ALTER TABLE on the 10% of tables that drive 90% of bloat. This autovacuum tuning discipline is the single highest-leverage Postgres ops change.

How is pg_repack different from VACUUM FULL?

Both rewrite the table file to reclaim space, so the end result on disk is identical. The difference is operational: VACUUM FULL holds an ACCESS EXCLUSIVE lock for the entire rewrite, blocking every other operation; on a 100 GB table that's hours of effective downtime. pg_repack does the same logical work — copies live tuples to a new file, rebuilds indexes, swaps the file — but uses triggers and a log table to capture concurrent writes, applying them to the new file in the background. Only the final swap requires ACCESS EXCLUSIVE, and that's a catalog flip that takes seconds. pg_repack requires 2x the table size in free disk during the run, a primary key or unique not-null index on the table, and the pg_repack extension installed. Within those constraints, it is the production-default tool for shrinking a bloated table on a 24/7 system.

How do I monitor Postgres bloat in production?

Two layers. The lightweight layer reads pg_stat_user_tables.n_dead_tup and n_live_tup to compute dead_pct per table — cheap, always available, refreshed on every commit. Alert at dead_pct > 20% on any table over 1 GB. The precise layer uses the pgstattuple extension to scan the table and return exact byte-level dead/live ratios — expensive (full table scan), so run weekly during low-traffic windows and feed the result into a dashboard. Pair the bloat metric with three causal queries: (1) longest open transaction (max(now() - xact_start) FROM pg_stat_activity WHERE state <> 'idle'), (2) oldest backend_xmin (the actual horizon), and (3) stuck replication slots. Bloat that autovacuum cannot reclaim is almost always a horizon issue, not a tuning issue — diagnose the horizon first, then tune autovacuum_vacuum_scale_factor per-table once the horizon is moving normally.

Practice on PipeCode

Lock in Postgres VACUUM muscle memory

Postgres docs explain MVCC. PipeCode drills explain the decision — when autovacuum needs a per-table override, when pg_repack beats VACUUM FULL, when wraparound becomes an emergency. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.

Practice SQL problems →
Practice optimization problems →

Top comments (0)