postgres vacuum is the single most under-asked-about Postgres operational knob a senior data engineer has to own, and the one most teams discover the hard way the first time a 200 GB table refuses writes at 03:00 because the transaction ID counter is two days from wraparound. postgresql mvcc is what makes Postgres concurrent at all — every UPDATE writes a new tuple version and marks the old one for later cleanup — and postgres autovacuum is the background worker that turns those dead tuples back into reusable space. Skip either side of that contract and you get postgres bloat, a steadily-growing on-disk footprint that no SQL query can explain.
This guide is the senior-DBA reference you wished existed the first time an interviewer asked "walk me through MVCC tuple visibility and how VACUUM cleans it up" or "how would you handle a 1.5 billion XID warning on a primary you cannot take down?". It walks through the MVCC visibility model (xmin, xmax, snapshot isolation, HOT updates), the VACUUM flavours (plain VACUUM, VACUUM FULL, vacuum freeze, ANALYZE, the pg_repack online alternative), the autovacuum tuning surface (autovacuum_vacuum_scale_factor, per-table overrides, cost-based throttling, worker count), and the transaction id wraparound emergency runbook (datfrozenxid, relfrozenxid, the autonomous wraparound vacuum, vacuumdb --freeze --all). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on query optimization problems →, and stack the ETL problem library → for the end-to-end pipeline shape.
On this page
- Why MVCC + VACUUM is the senior-DBA interview
- MVCC visibility and tuple versions
- VACUUM mechanics and flavours
- Autovacuum tuning
- Wraparound and emergency procedures
- Cheat sheet — Postgres VACUUM recipes
- Frequently asked questions
- Practice on PipeCode
1. Why MVCC + VACUUM is the senior-DBA interview
MVCC creates new tuple versions on every UPDATE, VACUUM cleans them up, and the 32-bit XID counter forces both — the trilogy senior interviewers actually probe
The one-sentence invariant: Postgres never updates a row in place. Every UPDATE writes a new tuple version, leaves the old one as "dead", and relies on postgres vacuum to reclaim that space before the table doubles in size or the 32-bit transaction ID counter loops around to zero. Every other VACUUM interview question — bloat monitoring, autovacuum tuning, pg_repack vs VACUUM FULL, wraparound emergencies — is a downstream consequence of those two facts. Once you internalise "MVCC creates dead tuples, VACUUM reclaims them, autovacuum schedules the work, freeze prevents wraparound," the whole surface collapses to a sequence of operational choices around that loop.
Four axes interviewers actually probe.
-
Visibility. Postgres stamps every tuple with
xmin(the inserting XID) andxmax(the deleting XID). A transaction's snapshot is "all tuples whose xmin committed before me and whose xmax is either zero or committed after me". This ispostgresql mvccin one sentence — and it's why "dead tuples" are not garbage immediately, just garbage for the newest transactions. -
Bloat. A heap page that's 80% dead tuples still occupies a full 8 KB on disk; the dead bytes are not returned to the filesystem until the page is rewritten. Bloated tables cost more I/O per query, blow out shared_buffers, and slow down sequential scans.
postgres bloatis the operational symptom most DBAs miss until the table hits 5x its theoretical size. -
Autovacuum. A background daemon (
autovacuum launcher+autovacuum_max_workersworkers) wakes up periodically and runs VACUUM + ANALYZE on tables that have crossed configurable thresholds.postgres autovacuumis the single most important Postgres operational subsystem — and the one most teams leave at defaults that were tuned for a 2010-era OLTP workload. -
Wraparound. Transaction IDs are 32 bits. After ~2 billion transactions, the counter wraps. To prevent visibility chaos, Postgres "freezes" old tuples (marks them as visible-to-all) via
vacuum freeze. If freezing falls behind, the database refuses writes — the most dramatic Postgres outage mode a senior DBA can face.transaction id wraparoundis rare but always asked about in senior interviews because it tests whether you actually understand what an XID is.
The interview opener — the five-second version.
- "Why does an UPDATE in Postgres not actually update?" — Because MVCC: it INSERTs a new tuple version and marks the old one's xmax. Senior signal if you say "and that creates a dead tuple that VACUUM has to reclaim."
- "What is VACUUM actually doing?" — Marks dead tuples as reusable in the free space map; doesn't shrink the file. Senior signal if you separate "reclaim within file" from "VACUUM FULL rewrites the file."
-
"Why can't you just run VACUUM FULL?" —
ACCESS EXCLUSIVElock, table rewrite, doubled disk usage during the rewrite. Senior signal if you mentionpg_repackas the online alternative. -
"What is a transaction ID wraparound and how do you prevent it?" — 32-bit XID counter; preventive
vacuum freezeruns as part of autovacuum; emergencyvacuumdb --freeze --allif you fall behind. Senior signal if you mentionvacuum_freeze_min_ageandautovacuum_freeze_max_age.
The 2026 reality — what changed since Postgres 13.
-
Postgres 14+ ships parallel index vacuums, dramatically shrinking VACUUM time on big tables with many indexes. Default behaviour is to use up to
max_parallel_maintenance_workersindex workers in parallel. -
Postgres 15+ added per-statement
VACUUM (FREEZE, INDEX_CLEANUP off, ...)options and made aggressive freezing the default for tables approachingautovacuum_freeze_max_age. - Postgres 16+ improved autovacuum's ability to skip pages via the visibility map, cutting wall time on append-mostly tables.
-
Postgres 17+ introduced the
MAINTENANCEpredicate locking model that lets autovacuum continue working on partitions while application traffic queries the parent table. - pg_repack remains the production-default tool for online table rebuilds; pg_squeeze is the auto-trigger variant.
What "senior" sounds like in a Postgres ops interview.
- Do you say "MVCC creates dead tuples, autovacuum reclaims them" in the first sentence? — senior signal.
- Do you mention "VACUUM FULL takes ACCESS EXCLUSIVE; we use pg_repack in production" unprompted? — senior signal.
- Do you bring up "wraparound forces preventive freeze, which is why autovacuum runs even on read-only tables"? — senior signal.
- Do you push back on "just turn off autovacuum on hot tables to reduce overhead" with "that guarantees wraparound; never turn it off, tune it"? — strongest senior signal.
Worked example — the four-line MVCC trace every senior should be able to draw
Detailed explanation. The single most common Postgres MVCC interview opener is "draw what happens on disk when I run an UPDATE." You should be able to answer in 30 seconds with a four-row visualisation. Every advanced question (HOT, index bloat, freezing, snapshot isolation) builds on this base.
Question. Walk through what happens to the heap, the indexes, and the dead tuple count when you run a single UPDATE accounts SET balance = balance + 10 WHERE id = 1 on a table with two indexes.
Input.
| Step | Action |
|---|---|
| t0 | INSERT INTO accounts (id, balance) VALUES (1, 100); commit XID=1000 |
| t1 | session A: BEGIN; UPDATE accounts SET balance = 110 WHERE id = 1; commit XID=1001 |
| t2 | session B: SELECT balance FROM accounts WHERE id = 1; -- snapshot XID=1002 |
Code.
-- t0: initial INSERT
INSERT INTO accounts (id, balance) VALUES (1, 100);
-- t1: UPDATE in session A
BEGIN;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
COMMIT;
-- t2: SELECT in session B
SELECT id, balance, xmin, xmax FROM accounts WHERE id = 1;
-- inspect dead tuples
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'accounts';
Step-by-step explanation.
- At t0 the INSERT writes one heap tuple
(id=1, balance=100, xmin=1000, xmax=0). Both indexes (primary key onid, plus a secondary onbalance) point at that tuple's(page, offset)location. - At t1 the UPDATE does not rewrite the existing tuple. It writes a new heap tuple
(id=1, balance=110, xmin=1001, xmax=0)and updates the old tuple's xmax to 1001. The old tuple is now "dead from XID 1001's perspective." - Both indexes get a new entry pointing at the new tuple. The old index entries are not removed yet — they still point at the dead tuple. This is "index bloat" — the secondary effect of MVCC that VACUUM has to clean up separately.
- At t2 session B sees only the new tuple (xmin=1001 committed before B's snapshot at 1002). If session B had started its snapshot before XID 1001 committed, it would still see balance=100 — that's snapshot isolation.
-
pg_stat_user_tablesreportsn_live_tup=1, n_dead_tup=1. The dead tuple stays on disk until VACUUM runs and marks its space reusable. The page is still 8 KB on disk — VACUUM does not shrink the file. - If a HOT (Heap-Only Tuple) update was possible — same page, no indexed column changed — Postgres would skip the index writes and chain the new tuple to the old one via the page's HOT chain. HOT is the most important MVCC optimisation; it's why indexed-column updates are dramatically more expensive than unindexed-column updates.
Output.
| Tuple | id | balance | xmin | xmax | Visible to XID 1002? |
|---|---|---|---|---|---|
| old | 1 | 100 | 1000 | 1001 | no (xmax committed) |
| new | 1 | 110 | 1001 | 0 | yes |
Rule of thumb. Every UPDATE = 1 INSERT + 1 mark-dead. Every UPDATE on an indexed column = 1 INSERT + 1 mark-dead + N new index entries (one per index). Treat indexes as multipliers on your write cost.
Worked example — measuring dead tuple ratio on a real table
Detailed explanation. "How would you measure bloat" is a standard senior probe. The answer has two layers: the lightweight pg_stat_user_tables counters (always available, no extension), and the precise pgstattuple extension (full page scan, more expensive but exact).
Question. Show how to read the dead tuple count on a 50 GB orders table, and how to compute the dead-tuple percentage and the autovacuum threshold from that.
Input.
| Setting | Value |
|---|---|
| Table | public.orders |
| Reltuples (live row estimate) | 100,000,000 |
| autovacuum_vacuum_scale_factor | 0.2 (default) |
| autovacuum_vacuum_threshold | 50 |
Code.
-- lightweight reading (no extension)
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 2) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- precise scan via pgstattuple
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
pg_size_pretty(table_len) AS table_size,
pg_size_pretty(dead_tuple_len) AS dead_size,
round(dead_tuple_percent::numeric, 2) AS dead_pct,
round(free_percent::numeric, 2) AS free_pct
FROM pgstattuple('public.orders');
-- autovacuum trigger threshold for this table
SELECT
autovacuum_vacuum_threshold +
autovacuum_vacuum_scale_factor * reltuples AS trigger_dead_tuples
FROM pg_class
WHERE relname = 'orders';
Step-by-step explanation.
-
pg_stat_user_tablesgivesn_live_tupandn_dead_tupdirectly. These are statistics-collector estimates, refreshed on every transaction commit and every autovacuum run. They're cheap (a single catalog read) but can drift on append-heavy tables that don't get vacuumed often. -
pgstattupledoes a real heap scan and returns exact counts of live, dead, and free bytes. It costs roughly the same as a sequential scan, so don't run it on a 5 TB table during peak hours. Use it for a true bloat audit, not for monitoring. - The dead-tuple percentage gives you a rough bloat number, but the autovacuum trigger is what actually matters for scheduling. Postgres triggers a vacuum when
n_dead_tup > threshold + scale_factor * reltuples. For our 100M-row table at default settings:50 + 0.2 * 100,000,000 = 20,000,050dead tuples before autovacuum kicks in. - That's 20M dead tuples worth of bloat before autovacuum starts working — for a write-heavy table that's far too many. Per-table override to
scale_factor=0.05would trigger at 5M, dramatically reducing peak bloat. - For monitoring, alert when dead_pct > 20% on any table over 1 GB. That catches a misconfigured autovacuum or a long-running transaction before it becomes a disk pressure incident.
Output.
| Metric | Default trigger | Recommended for hot table |
|---|---|---|
| autovacuum_vacuum_threshold | 50 rows | 1000 rows |
| autovacuum_vacuum_scale_factor | 0.2 (20%) | 0.05 (5%) |
| Trigger on 100M-row table | 20M dead | 5M dead |
| Peak bloat at trigger | ~20% | ~5% |
Rule of thumb. Default autovacuum settings are tuned for tables under ~1M rows. Above 10M rows, lower scale_factor to 0.05 or 0.02 via per-table ALTER TABLE so autovacuum runs more frequently and keeps peak bloat under control.
Worked example — the cost of long-running transactions
Detailed explanation. Senior interviewers love to ask "why does a long-running transaction block VACUUM?" The answer ties MVCC visibility to dead-tuple reclamation in one neat causal chain — and it's the root cause of most "autovacuum is running but bloat keeps growing" production incidents.
Question. Explain why a 6-hour BEGIN; SELECT ... ; transaction in one session prevents autovacuum from reclaiming dead tuples in every table in the database.
Input.
| Session | Action | Duration |
|---|---|---|
| A | BEGIN; SELECT * FROM big_report; -- not committed | 6 hours |
| B | UPDATE orders SET status = 'shipped' WHERE ...; (every minute) | continuous |
| C | autovacuum worker on orders | every 5 min |
Code.
-- find long-running transactions blocking VACUUM
SELECT
pid,
usename,
state,
xact_start,
now() - xact_start AS xact_age,
query
FROM pg_stat_activity
WHERE state <> 'idle'
AND xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 10;
-- inspect the "horizon" — the oldest XID VACUUM can clean past
SELECT
backend_xmin,
pg_xact_commit_timestamp(backend_xmin) AS backend_xmin_ts
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin ASC
LIMIT 5;
-- kill the blocker (after consultation with the app team)
SELECT pg_cancel_backend(12345); -- soft cancel
-- or
SELECT pg_terminate_backend(12345); -- hard terminate
Step-by-step explanation.
- Session A holds a snapshot from 6 hours ago. From its perspective, any tuple whose xmin committed after its snapshot time is invisible; any tuple whose xmax committed after its snapshot time is still alive.
- Session B's UPDATEs create dead tuples constantly. From the perspective of A's snapshot, the old versions are still alive — the new versions don't exist for A.
- Autovacuum on
ordersruns and sees the dead tuples — but it cannot reclaim them, because doing so would corrupt session A's view. It computes the "global xmin horizon" (the oldest snapshot in any open transaction) and only reclaims tuples whose xmax committed before that horizon. - Session A's horizon advances zero — so the horizon for the entire database is stuck 6 hours in the past. Every dead tuple created in those 6 hours accumulates and cannot be reclaimed.
-
pg_stat_activity.backend_xminshows the oldest XID that any backend is still using. If you have a session withbackend_xminfrom 6 hours ago, every dead tuple created after that point is locked in until that session ends. - The fix is either to kill the long-running transaction (
pg_terminate_backend(pid)) or to refactor the report query to use a short-lived snapshot (e.g. snapshot at the start of the report, copy the data out, COMMIT, then process the copy).
Output.
| Symptom | Underlying cause |
|---|---|
| Bloat keeps growing despite autovacuum running | Horizon is stuck behind a long-running txn |
pg_stat_activity shows a session with xact_age > 1h |
The blocker is found |
n_dead_tup keeps climbing on every table |
Horizon is per-DB, not per-table |
Rule of thumb. Set idle_in_transaction_session_timeout = '15min' and statement_timeout to a reasonable value. A transaction left open by a buggy ORM is the single most common cause of unexplained Postgres bloat.
Senior interview question on MVCC + VACUUM
A senior interviewer often opens with: "Walk me through how Postgres handles a high-throughput UPDATE workload — what happens on disk, what cleans up, and what fails if you misconfigure it?"
Solution Using the MVCC + VACUUM mental model
The full lifecycle of an UPDATE under MVCC + VACUUM
====================================================
1. UPDATE arrives
- Writes new tuple version (xmin = currentXID, xmax = 0)
- Marks old tuple xmax = currentXID
- Updates all indexes (unless HOT applies)
- Page may split if no room; new tuple ends up on a different page
2. Commit
- Old tuple becomes "dead from XID currentXID forward"
- New tuple becomes "live"
- Snapshots taken before commit still see the old tuple
3. Autovacuum eventually wakes up
- When n_dead_tup > threshold + scale_factor * reltuples
- Computes global xmin horizon
- Reclaims dead tuples older than the horizon
- Marks page space as reusable in the free space map
- Removes index entries pointing at reclaimed tuples
- Does NOT shrink the file — that requires VACUUM FULL or pg_repack
4. Eventually, freeze
- When tuple xmin gets older than vacuum_freeze_min_age (default 50M)
- VACUUM sets the "frozen" hint bit so the tuple is visible to all future XIDs
- Prevents wraparound from corrupting visibility
5. Failure modes
- Long-running transaction → horizon stuck → bloat grows unbounded
- Autovacuum too slow → bloat grows faster than reclamation
- Wraparound approaching → autovacuum runs in aggressive mode
- autovacuum disabled → bloat + eventual wraparound shutdown
Step-by-step trace.
| Phase | Disk effect | Index effect | Visibility horizon | Wraparound risk |
|---|---|---|---|---|
| 1. UPDATE | new tuple inserted | N new index entries | unchanged | none |
| 2. Commit | old tuple marked dead | old index entries dangle | unchanged | none |
| 3. Autovacuum | dead tuples reclaimed in-place | dangling index entries removed | advances | reduced |
| 4. Freeze | xmin marked frozen | none | tuple visible to all | prevented |
| 5. Wraparound block (failure mode) | no space available | none | stuck | imminent — DB shuts down |
After this trace, every VACUUM, autovacuum, and wraparound question is a question about which phase is failing in your specific incident. That decomposition is the senior signal.
Output:
| Concern | Where it shows up | Tool to use |
|---|---|---|
| Bloat | n_dead_tup, pgstattuple | autovacuum tuning, pg_repack |
| Long-running txns | pg_stat_activity.xact_start | terminate or refactor app |
| Wraparound | datfrozenxid age | VACUUM FREEZE, vacuumdb --freeze |
| Autovacuum stuck | pg_stat_progress_vacuum | scale workers, lower naptime |
Why this works — concept by concept:
- MVCC tuple versioning — every UPDATE writes a new tuple and marks the old one dead. The whole subsequent VACUUM surface is the consequence of that decision. Naming it explicitly proves you understand the root cause.
- The xmin horizon — autovacuum can only reclaim tuples older than the oldest open snapshot in the database. That single rule explains 90% of "why isn't autovacuum working" incidents.
- In-place reclaim vs file shrink — VACUUM reclaims within the file; VACUUM FULL rewrites the file. Conflating them is a junior mistake. Senior engineers know to reach for pg_repack instead of FULL in production.
- Freeze is a separate concern from bloat — freezing is wraparound prevention, not space reclamation. Autovacuum runs both jobs in the same worker, but they're triggered by different thresholds.
-
Cost — autovacuum cost is
O(table_size + index_size)worst case, throttled byautovacuum_vacuum_cost_delay/_cost_limit. Wraparound prevention runs unconditionally and cannot be skipped, so a wraparound-pending table will block autovacuum on other tables.
SQL
Topic — sql
Postgres MVCC + VACUUM problems
2. MVCC visibility and tuple versions
postgresql mvcc stamps every tuple with xmin and xmax — visibility is a function of those two columns and the transaction's snapshot
The mental model in one line: every heap tuple carries xmin (the transaction that created it) and xmax (the transaction that deleted/updated it); a transaction's snapshot decides which tuples it can see by comparing those XIDs to the set of transactions that committed before the snapshot was taken. Once you say "visibility is xmin + xmax against the snapshot," every snapshot-isolation, dead-tuple, HOT-update, and wraparound question becomes a deduction from that rule.
The core column pair — xmin / xmax.
- xmin — the XID of the transaction that created this tuple version. Set on INSERT or on the new tuple version of an UPDATE.
- xmax — the XID of the transaction that invalidated this tuple version. Zero if still alive. Set on DELETE, or on the old tuple version of an UPDATE.
- Both columns are 32-bit unsigned integers stored in the tuple header (24 bytes per tuple — a real overhead you pay for MVCC).
Snapshot isolation — the visibility rule.
- A transaction's snapshot is a triple: (xmin, xmax, xip_list) — the oldest still-running transaction, the next XID to be assigned, and the list of in-flight XIDs at snapshot time.
- A tuple is visible to a snapshot if: (tuple.xmin is in the committed set AND tuple.xmin < snapshot.xmax AND tuple.xmin is not in snapshot.xip_list) AND (tuple.xmax is zero OR tuple.xmax did not commit by snapshot time OR tuple.xmax is in snapshot.xip_list).
- In practice: "tuple was inserted by a committed transaction before me, and either is still alive or was deleted by a transaction that committed after my snapshot."
The UPDATE = INSERT + mark identity.
- An UPDATE writes a new heap tuple with
xmin = currentXID, xmax = 0and sets the old tuple'sxmax = currentXID. - Both tuples share a logical identity (same primary key) but live in different physical heap slots, often on different pages.
- This is the only way Postgres handles updates — there is no in-place update. Even single-byte column changes write a full new tuple.
HOT updates — the optimisation.
- If the UPDATE does not change any indexed column, and there is room on the same page, Postgres can do a Heap-Only Tuple (HOT) update.
- The new tuple lives on the same page, chained from the old tuple via the page's HOT chain pointer.
- Indexes are not updated — they still point at the old tuple, and the runtime follows the HOT chain to find the live version.
- HOT updates are dramatically cheaper than regular updates. The cost difference is often the difference between a healthy and a bloat-prone table.
Why indexed-column updates are expensive.
- Indexed column changed → no HOT → new index entries → old index entries dangle until VACUUM removes them.
- A 1M-row table with 5 secondary indexes pays 6 writes per UPDATE (heap + 5 indexes) instead of 1.
- The cost compounds: more index entries = larger indexes = more I/O on every read.
The visibility map — the optimisation that makes VACUUM cheap.
- Postgres maintains a bitmap (
relname_vmfile) marking which pages are "all-visible" — every tuple on that page is visible to every snapshot. - VACUUM can skip all-visible pages entirely. On a mostly-read table, autovacuum becomes nearly free.
- INSERTs and HOT updates can keep pages all-visible. Non-HOT UPDATEs and DELETEs clear the visibility flag, forcing the next VACUUM to scan that page.
Index-only scans — visibility map dividend.
- If every page touched by an index scan is marked all-visible, Postgres can satisfy the query from the index alone, skipping the heap entirely.
- This is "index-only scan" — one of the biggest performance wins in modern Postgres.
- It only works if the visibility map is up to date, which requires VACUUM to have run recently.
Common interview probes on MVCC.
- "What are xmin and xmax?" — the create-XID and the invalidate-XID of a tuple version.
- "Why does an UPDATE produce a dead tuple?" — because UPDATE is INSERT + mark-old-dead; there's no in-place update.
- "What is a HOT update?" — same-page UPDATE with no indexed column change; skips index writes.
- "What is the visibility map?" — bitmap of all-visible pages; lets VACUUM skip pages and enables index-only scans.
Worked example — UPDATE-heavy table accumulating 10x bloat
Detailed explanation. A common production failure: a table with millions of UPDATEs per hour grows to 10x its theoretical size in a week. The culprit is almost always indexed-column updates preventing HOT, combined with default autovacuum settings that don't keep up. Tracing what's happening on disk is the first step.
Question. Show the disk-growth math for an events table with 10M rows, 5 secondary indexes, and 200K UPDATEs per hour on an indexed column. How much bloat accumulates in a day before autovacuum catches up?
Input.
| Setting | Value |
|---|---|
| Live rows | 10,000,000 |
| Avg row width | 200 bytes |
| Secondary indexes | 5 |
| UPDATE rate | 200,000 / hour |
| UPDATE target column | indexed |
| autovacuum_vacuum_scale_factor | 0.2 (default) |
Code.
-- live size estimate
SELECT
pg_size_pretty(pg_relation_size('events')) AS heap_size,
pg_size_pretty(pg_indexes_size('events')) AS index_size,
pg_size_pretty(pg_total_relation_size('events')) AS total_size,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables JOIN pg_class ON oid = relid
WHERE relname = 'events';
-- bloat trigger threshold
SELECT
pg_class.reltuples,
pg_class.reltuples * 0.2 AS trigger_dead
FROM pg_class WHERE relname = 'events';
-- enforce HOT-friendly schema via fillfactor
ALTER TABLE events SET (fillfactor = 80);
-- 20% of every page reserved for HOT chains
Step-by-step explanation.
- 200K UPDATEs/hour × 24 h = 4.8M UPDATEs/day. Because the target column is indexed, each UPDATE writes 1 new heap tuple + 5 new index entries (no HOT possible). The old heap tuple plus 5 old index entries become dead.
- Heap bloat after 24h: 4.8M dead tuples × 200 bytes = ~960 MB. That's about 50% of the live heap size (2 GB), so the table grows to roughly 3 GB before autovacuum kicks in.
- Index bloat is far worse because each of the 5 indexes accumulates 4.8M dead entries. If each index entry is ~30 bytes, that's 4.8M × 30 × 5 = ~720 MB of dead index bytes. Indexes often grow more than the heap on update-heavy tables.
- Default autovacuum triggers at
0.2 * 10M = 2M dead tuples. That's reached in 10 hours at this rate. Once it runs, it reclaims the heap dead tuples and rebuilds the index visibility, droppingn_dead_tupto near zero. - After one autovacuum cycle, the table file is still 3 GB on disk — VACUUM reclaims within the file, it doesn't shrink it. The next 24 hours of UPDATEs reuse the reclaimed space, so the table stabilises at ~3 GB rather than growing further.
- The fix for ongoing bloat: lower
autovacuum_vacuum_scale_factorto 0.05 (autovacuum every ~2.5 hours instead of every 10 hours), and setfillfactor = 80so HOT updates are more likely (more page free space available for in-page new tuple).
Output.
| Phase | Heap size | Index size | Dead rows |
|---|---|---|---|
| t0 (baseline) | 2 GB | 1.5 GB | 0 |
| t+10h (autovacuum trigger) | 2.4 GB | 1.8 GB | 2M |
| t+24h (post-vacuum) | 2.4 GB (steady) | 1.8 GB (steady) | < 100K |
| with tuned settings (scale=0.05) | 2.1 GB | 1.6 GB | < 500K |
Rule of thumb. If your table has >5 indexes and any indexed column is updated frequently, the indexes will dominate your bloat. Drop unused indexes ruthlessly — each one multiplies your write cost and your VACUUM time.
Worked example — HOT updates vs non-HOT updates on the same table
Detailed explanation. The same UPDATE can be 10x cheaper if it qualifies for HOT. Senior engineers know that "an UPDATE costs nothing on Postgres if it's HOT" is one of the most leverage-y performance facts in the database.
Question. Show two UPDATE patterns on a users table — one that triggers HOT, one that doesn't — and explain the per-update cost difference.
Input.
| Schema | Columns | Indexed columns |
|---|---|---|
| users | id (PK), email, login_count, last_login_at | id, email |
Code.
-- A: UPDATE on a non-indexed column → HOT eligible
UPDATE users SET login_count = login_count + 1 WHERE id = 42;
-- B: UPDATE on an indexed column → NOT HOT eligible
UPDATE users SET email = 'new@example.com' WHERE id = 42;
-- Inspect HOT updates count
SELECT
schemaname, relname,
n_tup_upd AS total_updates,
n_tup_hot_upd AS hot_updates,
round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 2) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'users';
-- Ensure HOT can happen — leave fillfactor room
ALTER TABLE users SET (fillfactor = 85);
Step-by-step explanation.
- UPDATE A changes only
login_count, which is not indexed. If there is free space on the same heap page (this is wherefillfactormatters), Postgres writes the new tuple on the same page and chains it to the old tuple via the page's HOT chain. Indexes are untouched. - UPDATE A costs: 1 heap page write, 0 index writes. The old tuple still exists but is reachable only via the HOT chain from the page's "redirect" pointer. VACUUM later prunes the HOT chain — much cheaper than full reclamation.
- UPDATE B changes
email, which is indexed. HOT is impossible — the email index needs a new entry for the new tuple. Postgres writes the new heap tuple (anywhere there's room, possibly on a new page) and a new email-index entry. The old heap tuple and the old email-index entry become dead. - UPDATE B costs: 1 heap page write (possibly a new page if the old page is full), 1 index entry write per indexed column. The dead entries linger until VACUUM clears them.
-
pg_stat_user_tables.n_tup_hot_updtracks HOT updates separately from total updates. Healthy ratio: > 70% HOT for an UPDATE-heavy OLTP table. If yours is below 30%, you have indexes on volatile columns that you should reconsider. -
fillfactor=85reserves 15% of every page for HOT chain growth. Tables with frequent UPDATEs benefit fromfillfactor=70-80; read-only tables can use the default 100 for maximum density.
Output.
| UPDATE pattern | Heap writes | Index writes | Dead tuples produced | HOT eligible |
|---|---|---|---|---|
| Non-indexed column (A) | 1 (same page) | 0 | 1 (prunable cheap) | yes |
| Indexed column (B) | 1 (any page) | N | 1 heap + N index | no |
Rule of thumb. Don't index last_login_at, updated_at, view_count, or any high-velocity column unless an active query plan depends on it. Every such index is a 10x multiplier on your UPDATE cost and a magnet for bloat.
Worked example — long-running snapshot blocks the horizon
Detailed explanation. The single most insidious MVCC failure mode is a forgotten transaction holding back the global xmin horizon. It manifests as autovacuum running but not reclaiming any space. The fix is always to find and kill (or refactor) the blocking session.
Question. A monitoring dashboard shows that autovacuum has been running every 5 minutes but n_dead_tup keeps climbing. Bloat is now at 60%. Find the cause.
Input.
| Symptom | Value |
|---|---|
| autovacuum_workers | 3 (active) |
| n_dead_tup on hot_events | 50M and climbing |
| pg_stat_user_tables.last_autovacuum | 2 minutes ago |
| free disk | 25% and falling |
Code.
-- find the oldest in-flight backend xmin
SELECT
pid,
application_name,
state,
backend_xmin,
xact_start,
now() - xact_start AS xact_age,
LEFT(query, 80) AS query_head
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
AND state <> 'idle'
ORDER BY backend_xmin
LIMIT 5;
-- find idle-in-transaction sessions
SELECT
pid,
application_name,
state,
now() - state_change AS idle_age,
LEFT(query, 80) AS query_head
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change ASC;
-- terminate the offender (after app team confirms)
SELECT pg_terminate_backend(54321);
Step-by-step explanation.
-
pg_stat_activityfiltered bybackend_xmin IS NOT NULLand ordered bybackend_xminfinds the session whose snapshot is the bottleneck. Often it's anidle in transactionsession — a transaction that started, executed one query, and then was forgotten by buggy ORM code. - The age
now() - xact_starttells you how long the snapshot has been holding back the horizon. If it's > 30 minutes on an OLTP database, that's the bloat root cause. - Killing the session via
pg_terminate_backend(pid)releases the snapshot. The next autovacuum cycle sees a vastly advanced horizon and can reclaim hours of accumulated dead tuples. - Long-term, set
idle_in_transaction_session_timeout = '5min'andstatement_timeout = '30min'inpostgresql.conf. Both prevent any single buggy code path from holding back the horizon. - Add an alert on
max(now() - xact_start) > 1 hourfrompg_stat_activity WHERE state <> 'idle'. This catches the issue before bloat becomes a disk pressure incident. - Educate the app team: every transaction must end with COMMIT or ROLLBACK. Long-running read transactions should use
SET TRANSACTION SNAPSHOTor short repeatable batches rather than holding one snapshot open for hours.
Output.
| Backend | xact_age | State | Action |
|---|---|---|---|
| 54321 | 6h 12min | idle in transaction | terminate; tell the app team |
| 54322 | 0:01 | active | leave alone |
| 54323 | 0:42 | active (analytics) | refactor to short snapshots |
Rule of thumb. Bloat that autovacuum cannot reclaim is always a horizon problem. Find the oldest backend_xmin first; do not retune autovacuum thresholds until the horizon is moving normally.
Senior interview question on MVCC visibility
A senior interviewer might frame this as: "Explain Postgres snapshot isolation in terms of xmin, xmax, and the visibility map. Walk me through what happens when two concurrent transactions try to update the same row."
Solution Using the MVCC visibility rules
-- Setup
CREATE TABLE accounts (id INT PRIMARY KEY, balance NUMERIC);
INSERT INTO accounts VALUES (1, 100);
-- Session A: XID 100
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- sees 100
-- ... long-running snapshot ...
-- Session B: XID 101, concurrent
BEGIN;
UPDATE accounts SET balance = 150 WHERE id = 1;
COMMIT; -- B commits
-- Session A still sees 100 (its snapshot predates B's commit)
SELECT balance FROM accounts WHERE id = 1; -- still sees 100
-- Now session A wants to update
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
-- ERROR: could not serialize access due to concurrent update
-- (in SERIALIZABLE) OR
-- ROW LOCK WAIT then proceeds with B's value (in READ COMMITTED)
ROLLBACK;
Step-by-step trace.
| Step | Session A snapshot | Tuple on disk | A sees |
|---|---|---|---|
| t0 | xmin=99, xmax=100 | (id=1, bal=100, xmin=50, xmax=0) | 100 |
| t1 (B updates) | unchanged | old: xmax=101; new: (xmin=101, xmax=0, bal=150) | 100 (xmax=101 in xip_list) |
| t2 (B commits) | unchanged | new tuple committed | 100 (xmax=101 committed after A's snapshot) |
| t3 (A reads again) | unchanged | unchanged | 100 (snapshot isolation guarantees stable view) |
| t4 (A updates) | unchanged | conflict detected | error or wait, depending on isolation |
The Session A view is consistent throughout — that's the whole point of snapshot isolation. The trade-off is that A and B cannot both update the same row at the same isolation level without one of them seeing a serialization error.
Output:
| Isolation level | Behaviour on concurrent update |
|---|---|
| READ COMMITTED (default) | A waits for B, then re-reads at the new committed state, then applies its UPDATE on the new value |
| REPEATABLE READ | A errors with "could not serialize access due to concurrent update"; app retries |
| SERIALIZABLE | A errors immediately if any read-write conflict could violate serializability; app retries |
Why this works — concept by concept:
- xmin / xmax tuple stamps — every tuple carries its creator and invalidator XID. The whole visibility system is two integer comparisons. That simplicity is what makes Postgres MVCC fast.
- Snapshot isolation — a snapshot fixes the visible set of XIDs at transaction start. From that point, the transaction sees a consistent view even if concurrent transactions commit changes. This is the read side of MVCC.
- Write conflict detection — when A tries to UPDATE a row that B has already updated and committed, Postgres detects the conflict via xmax. In READ COMMITTED it waits and re-reads; in REPEATABLE READ / SERIALIZABLE it errors. The write side of MVCC.
-
The horizon is global — every backend's
backend_xmincontributes to a single per-database horizon. One slow session blocks VACUUM for every table. This is the operational pitfall that catches teams. - Cost — visibility check is O(1) per tuple (two XID comparisons + xip_list lookup). Snapshot acquisition is O(active_transactions). Dead-tuple reclamation is O(table_size + index_size) amortised across autovacuum cycles.
SQL
Topic — sql
Postgres snapshot isolation problems
3. VACUUM mechanics and flavours
postgres vacuum ships in four shapes — plain VACUUM, vacuum full, vacuum freeze, and ANALYZE — and using the wrong one in production is a guaranteed outage
The mental model in one line: plain VACUUM reclaims dead-tuple space inside the file (cheap, no lock), VACUUM FULL rewrites the file from scratch (expensive, ACCESS EXCLUSIVE lock), VACUUM FREEZE preemptively marks old tuples as frozen to prevent wraparound, and ANALYZE updates the statistics the planner reads — they're four different jobs with four different cost profiles, and pg_repack is the production-safe alternative to VACUUM FULL. Once you internalise "VACUUM reclaims, FULL rewrites, FREEZE prevents wraparound, ANALYZE updates stats," every "which VACUUM should I run" interview question becomes a single-question lookup.
Plain VACUUM — the safe default.
- What it does. Scans the table, finds dead tuples, marks their space as reusable in the page's "item pointer" array, updates the visibility map, and prunes HOT chains. Does not rewrite the file, does not return space to the OS.
-
Lock level.
SHARE UPDATE EXCLUSIVE— concurrent SELECTs, INSERTs, UPDATEs, DELETEs continue. Only blocks other VACUUM, ANALYZE, CREATE INDEX CONCURRENTLY, and a few DDL operations. - Disk effect. Frees space for reuse by this table. Future INSERTs/UPDATEs land in the reclaimed slots; the file size stays the same.
- When to run. Constantly via autovacuum. Rarely by hand — only when autovacuum is misconfigured or you need an immediate reclaim before a query plan re-runs.
VACUUM FULL — the nuclear option.
- What it does. Creates a new table file, copies all live tuples into it, rebuilds every index, then swaps the file. End result: the smallest possible disk footprint for the current data.
-
Lock level.
ACCESS EXCLUSIVE— blocks every other operation on the table, including SELECTs. The table is effectively offline for the duration of the FULL. - Disk effect. Returns space to the OS. The new file is smaller. During the operation, you need at least 2x the current table size in free disk (old + new exist simultaneously).
- When to run. Almost never in production. Use case: a one-time massive bloat reclaim during a planned maintenance window. Even then, prefer pg_repack.
VACUUM FREEZE — the wraparound preventer.
-
What it does. Marks every tuple older than
vacuum_freeze_min_ageas "frozen" — a hint bit that means "this tuple is visible to all current and future transactions, ignore its xmin." Frozen tuples don't contribute to the wraparound countdown. -
Lock level. Same as plain VACUUM (
SHARE UPDATE EXCLUSIVE). - Disk effect. Same as plain VACUUM (reclaims dead-tuple space) plus the freeze bit updates.
-
When to run. Autovacuum runs an "aggressive" mode FREEZE when a table's
relfrozenxidis older thanautovacuum_freeze_max_age(default 200M). You can force it viaVACUUM FREEZE tablename.
ANALYZE — the planner statistics refresh.
-
What it does. Samples the table, updates
pg_statisticwith column histograms, distinct counts, null fractions. The query planner uses these for cost estimates. -
Lock level. Same as plain VACUUM. Often run in the same autovacuum cycle (
VACUUM (ANALYZE) tablename). - Disk effect. None on the table; only updates catalog rows.
-
When to run. After bulk loads, after index creation, after any operation that significantly changes the data distribution. Autovacuum runs it automatically when
n_mod_since_analyzeexceeds a threshold.
pg_repack — the production-safe rebuild.
-
What it does. Builds a new copy of the table in the background, captures concurrent changes via triggers into a log table, applies the log to the copy, then swaps the table file with a brief
ACCESS EXCLUSIVElock (seconds, not hours). -
Lock level. Almost none — the swap is the only
ACCESS EXCLUSIVEstep, and it's instantaneous. - Disk effect. Returns space to the OS like VACUUM FULL. Requires 2x table size in free disk during the copy.
- When to run. Whenever a table's bloat exceeds ~30% on a system you cannot take offline. The de facto standard for production table rebuilds.
The decision tree.
- Need to reclaim within-file? → autovacuum (or manual VACUUM if urgent).
- Need to shrink the file in a maintenance window? → VACUUM FULL.
- Need to shrink the file on a 24/7 system? → pg_repack.
- Need to prevent wraparound? → autovacuum handles it; force with VACUUM FREEZE if behind.
- Need to fix stale query plans? → ANALYZE (or VACUUM ANALYZE).
Common interview probes on VACUUM flavours.
- "Difference between VACUUM and VACUUM FULL?" — VACUUM reclaims in-place, FULL rewrites the file.
- "Why is VACUUM FULL dangerous?" — ACCESS EXCLUSIVE lock for the duration. Use pg_repack instead.
- "What does VACUUM FREEZE do?" — marks old tuples as frozen to prevent wraparound.
- "Why run ANALYZE separately?" — it updates planner stats; cheaper than VACUUM if you just need fresh stats.
Worked example — VACUUM vs VACUUM FULL on the same bloated table
Detailed explanation. A 100 GB orders table is 60% dead tuples. You have two options: run plain VACUUM (no downtime, but the file stays 100 GB) or run VACUUM FULL (rewrites to 40 GB, but takes the table offline for hours). The right answer depends on what you're optimising for — disk vs uptime.
Question. Run both options against the same 100 GB orders table and compare disk footprint, lock duration, and operational risk.
Input.
| Table | Heap size | Index size | Dead pct | Indexes |
|---|---|---|---|---|
| orders | 100 GB | 30 GB | 60% | 4 |
Code.
-- Option 1: plain VACUUM — no downtime, no file shrink
VACUUM (VERBOSE) orders;
-- Output: "pages reused", "tuples removed", "no shrinkage"
-- Heap still 100 GB, but 60 GB of it is now reusable
-- Option 2: VACUUM FULL — full rewrite, ACCESS EXCLUSIVE lock
VACUUM FULL VERBOSE orders;
-- Output: rebuilds heap and all indexes
-- Heap becomes 40 GB, indexes become smaller
-- Option 3: pg_repack — same effect as FULL, no long lock
-- run from shell:
pg_repack --no-superuser-check --table=orders --jobs=4 mydb
-- ACCESS EXCLUSIVE held only during the final swap (seconds)
Step-by-step explanation.
- Plain VACUUM scans every page, marks dead tuples reusable, updates the visibility map, and prunes HOT chains. On a 100 GB table with 4 indexes and parallel index vacuum, this might take ~30 minutes wall-clock. Concurrent traffic continues uninterrupted.
- The heap file is still 100 GB after plain VACUUM — VACUUM does not call
ftruncate. The "reclaimed" 60 GB is now free space inside the file that future INSERTs/UPDATEs can reuse. The OS sees the same disk footprint. - VACUUM FULL creates a new file, scans the old file, copies every live tuple into the new one, rebuilds every index from scratch, then atomically swaps the two files. End state: a 40 GB heap and indexes.
- VACUUM FULL holds
ACCESS EXCLUSIVEon the table for the entire duration — every SELECT, INSERT, UPDATE, DELETE blocks. On a 100 GB table, this can be 2-6 hours. Production-fatal for OLTP. - pg_repack does the same logical work as VACUUM FULL but in a transaction-friendly way: triggers capture concurrent changes into a log table, the background worker applies the log to the new file, and the final swap is an instant catalog change. Application impact is a few-second pause.
- After any of the three, you should run
ANALYZE ordersto refresh statistics — VACUUM does not always trigger an analyze, and stale stats cause bad query plans.
Output.
| Approach | Heap after | Lock duration | Downtime | When to pick |
|---|---|---|---|---|
| Plain VACUUM | 100 GB (60 GB reusable) | none (concurrent) | 0 | autovacuum routine |
| VACUUM FULL | 40 GB | 2-6 hours ACCESS EXCLUSIVE | full | maintenance window |
| pg_repack | 40 GB | seconds (only the swap) | seconds | production, 24/7 |
Rule of thumb. Never run VACUUM FULL on a production OLTP table. The pg_repack alternative is operationally equivalent without the lock. The only legitimate VACUUM FULL use case is in a maintenance window on a table you can take offline.
Worked example — pg_repack online rebuild on a 200 GB table
Detailed explanation. pg_repack is the de facto production tool for shrinking bloated tables. The mechanics — triggers, log table, background copy, final swap — are worth understanding in detail because they explain why pg_repack works, what its constraints are, and when it can fail.
Question. Walk through a pg_repack run on a 200 GB events table with 8 indexes and 50K writes per minute. What happens at each phase, and what is the operational impact?
Input.
| Setting | Value |
|---|---|
| Table | events |
| Heap size | 200 GB |
| Index count | 8 |
| Write rate | 50,000 / minute |
| Dead pct | 55% |
| Target heap size after | ~90 GB |
Code.
# Install (one-time, on the server)
yum install pg_repack15
# Enable extension in the target database (one-time)
psql -d mydb -c 'CREATE EXTENSION pg_repack;'
# Run repack with 4 parallel index rebuilds, dry-run first
pg_repack --no-superuser-check \
--table=events \
--jobs=4 \
--dry-run \
mydb
# Real run
pg_repack --no-superuser-check \
--table=events \
--jobs=4 \
mydb
# Monitor progress
SELECT * FROM repack.tables WHERE relname = 'events';
Step-by-step explanation.
-
Phase 1 — setup. pg_repack creates a log table (
repack.log_<oid>) and an INSERT/UPDATE/DELETE trigger onevents. Every concurrent write is captured into the log table alongside being applied to the original. -
Phase 2 — copy. A new heap file is created (
repack.table_<oid>). The background process scans the originalevents, copies live tuples into the new file. This is the longest phase — on a 200 GB table, expect 1-4 hours depending on disk speed. Concurrent writes continue uninterrupted, captured by the trigger. -
Phase 3 — index build. Indexes are built concurrently on the new heap. With
--jobs=4, four indexes build in parallel. This is also long but parallelisable. -
Phase 4 — apply log. The captured changes from
repack.log_<oid>are applied to the new heap. This catches up the new file with everything that happened during phases 2-3. Repeats until the log is empty. -
Phase 5 — swap. A brief
ACCESS EXCLUSIVElock is taken (seconds). The catalog is updated to point theeventsname at the new file. The old file is dropped. Triggers are removed. - End state:
eventsis now ~90 GB, indexes are freshly built (and smaller), and the 110 GB of dead-tuple bloat is gone. The application saw a few-second blip during the swap.
Output.
| Phase | Duration | Lock | Application impact |
|---|---|---|---|
| 1. Setup | seconds | none | trigger added |
| 2. Heap copy | 1-4 hours | none | full traffic OK |
| 3. Index rebuild | 30 min - 2 hours | none | full traffic OK |
| 4. Apply log | minutes | none | full traffic OK |
| 5. Swap | seconds | ACCESS EXCLUSIVE (brief) | seconds of pause |
Rule of thumb. pg_repack needs 2x table size in free disk (old + new exist simultaneously) and a primary key or unique not-null index on the target table. Both are usually fine. The free disk requirement is the most common gotcha — you can run out of disk mid-repack and end up with the temp tables stuck.
Worked example — when to run ANALYZE separately from VACUUM
Detailed explanation. ANALYZE refreshes the planner's statistics. After a bulk load, after index creation, after a schema change — the planner can have wildly stale stats that produce bad query plans. Senior engineers run ANALYZE explicitly in these cases instead of waiting for autovacuum.
Question. A 500M-row data warehouse table just received a 100M-row bulk INSERT. Query plans on the table are now picking sequential scans where index scans were correct. What do you run and why?
Input.
| Event | Effect |
|---|---|
| Bulk INSERT of 100M rows | n_live_tup jumps from 500M to 600M |
| Last autovacuum | 30 minutes ago (pre-load) |
| Planner stats | reflect 500M rows, old distribution |
| Query plan today | seq scan instead of index scan |
Code.
-- Refresh planner stats only (fast)
ANALYZE warehouse.facts;
-- Refresh and run a normal VACUUM in one pass (heavier)
VACUUM (ANALYZE, VERBOSE) warehouse.facts;
-- Force aggressive sample for large tables (default sample is 30K rows)
ALTER TABLE warehouse.facts ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE warehouse.facts;
-- Inspect stats freshness
SELECT
schemaname, relname,
last_analyze, last_autoanalyze,
n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'facts';
-- Re-check the bad plan
EXPLAIN ANALYZE SELECT * FROM warehouse.facts WHERE customer_id = 42;
Step-by-step explanation.
- The bulk INSERT moved
n_mod_since_analyzefrom 0 to 100M, but autovacuum hasn't run again yet. The planner still sees 500M rows and the oldcustomer_iddistribution. -
ANALYZEsamples the table (default 30K rows × 300 = 9M sampled positions). It computes new histograms, MCVs (most common values), and null fractions for every column. Updatespg_statisticfor the table. - On large warehouse tables with high-cardinality columns, the default sample can miss important value distributions.
ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 1000increases the sample size (default 100 → 1000 means a 10x larger sample for that column). - After ANALYZE, EXPLAIN ANALYZE shows the planner now picks an index scan — because the stats correctly say the index is selective. The bad seq-scan plan was a statistics problem, not a query problem.
-
VACUUM (ANALYZE)combines both operations in one pass. Useful when you also want to reclaim dead tuples from concurrent UPDATEs, otherwise plainANALYZEis cheaper. - After any major data ingest, schedule an explicit
ANALYZEin the same job — don't rely on autovacuum'sautoanalyzeto catch up before your dashboards run their queries.
Output.
| Plan source | Estimated rows | Actual rows | Chosen scan |
|---|---|---|---|
| Stale stats (pre-ANALYZE) | 6M | 12 | Seq Scan (wrong) |
| Fresh stats (post-ANALYZE) | 8 | 12 | Index Scan (right) |
Rule of thumb. After any bulk ingest > 10% of table size, run ANALYZE explicitly. After schema changes (new column, new index), run ANALYZE on the affected table. For warehouse tables with skewed distributions, bump per-column statistics targets to 500-1000.
Senior interview question on VACUUM flavours
A senior interviewer might frame this as: "You inherit a 24/7 OLTP database. The main orders table is at 70% bloat and 300 GB. The team wants to shrink it. Walk me through your decision and the runbook."
Solution Using the VACUUM-flavour decision tree + pg_repack runbook
Decision: 300 GB orders, 70% bloat, 24/7 OLTP, cannot take offline.
Step 1: Confirm bloat is real (pgstattuple, not just n_dead_tup estimate).
Step 2: Verify free disk is at least 2x table size (need ~600 GB free).
Step 3: Verify primary key exists on the table (required by pg_repack).
Step 4: Stage the rebuild during a low-traffic window (e.g. weekend).
Step 5: Run pg_repack with --dry-run first to surface any errors.
Step 6: Run real pg_repack; monitor via repack.tables view.
Step 7: After completion, run ANALYZE to refresh stats.
Step 8: Verify dead_pct returned to ~5% and table is ~90 GB.
Step 9: Tune autovacuum_vacuum_scale_factor=0.05 to prevent recurrence.
Step-by-step trace.
| Step | Action | Risk | Mitigation |
|---|---|---|---|
| 1 | pgstattuple confirms 70% bloat, 90 GB live | full scan slow | run off-peak |
| 2 | df -h shows 800 GB free | not enough free | pause / add disk |
| 3 |
\d orders confirms PK on id |
repack will refuse | add PK first |
| 4 | Saturday 02:00 window | weekend pager | crew on standby |
| 5 | pg_repack --dry-run | missing extension | install + CREATE EXTENSION |
| 6 | pg_repack --jobs=4 | swap pause | < 5 seconds expected |
| 7 | ANALYZE orders | stale stats | always do |
| 8 | n_dead_tup ≈ 100K, heap = 90 GB | success | monitor 24h |
| 9 | ALTER TABLE orders SET autovacuum_vacuum_scale_factor = 0.05 | none | per-table override |
After step 9, the table stays under 100 GB indefinitely because autovacuum keeps up with the dead-tuple rate.
Output:
| State | Heap | Indexes | Dead pct | Notes |
|---|---|---|---|---|
| Before | 300 GB | 90 GB | 70% | unsustainable |
| After pg_repack | 90 GB | 30 GB | 5% | one-time reclaim |
| After scale_factor tune | 95 GB (steady) | 32 GB (steady) | < 10% | sustainable |
Why this works — concept by concept:
- VACUUM FULL is forbidden on OLTP — ACCESS EXCLUSIVE for hours is an outage. pg_repack achieves the same file-shrink with a seconds-long swap. Naming this constraint first proves you understand production.
- pg_repack uses triggers + log table + swap — concurrent writes are captured by triggers, the new heap is built in the background, the swap is a catalog flip. This is why pg_repack is online: writes never pause more than a few seconds.
- Free disk is the gating constraint — 2x table size temporarily. Run out mid-repack and you have to clean up the leftover temp tables manually. Always verify before starting.
-
Sustainable reclaim needs autovacuum tuning — pg_repack solves the one-time bloat. Without lowering
autovacuum_vacuum_scale_factor, the bloat returns in months. The two fixes are paired, not alternatives. -
Cost — pg_repack cost is
O(table_size + index_size)of disk I/O plus the trigger overhead on concurrent writes during the run (typically 5-15% per-write slowdown). Total wall-clock typically 1-4 hours per 100 GB; orders of magnitude cheaper than the alternative outage.
SQL
Topic — sql
Postgres maintenance problems
4. Autovacuum tuning
postgres autovacuum defaults are tuned for a 2010-era 1M-row table — production OLTP workloads need per-table overrides on every hot table
The mental model in one line: the autovacuum launcher wakes every autovacuum_naptime seconds, scans pg_stat_user_tables, and dispatches up to autovacuum_max_workers workers to tables where n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples; each worker runs a VACUUM throttled by a cost-budget system to avoid saturating I/O — and almost every production tuning problem reduces to "the global defaults don't fit my table's write rate, override them per-table". Once you internalise "threshold + scale × tuples = trigger; cost_limit/cost_delay = throttle; workers + naptime = parallelism," the entire autovacuum tuning interview surface collapses to four knobs and two cost dials.
The trigger formula.
- A table is eligible for autovacuum when
n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. - Defaults:
threshold=50, scale_factor=0.2. For a 1M-row table, trigger at50 + 0.2 * 1M = 200,050dead rows (~20% bloat). - For a 100M-row table: trigger at ~20M dead rows. That's 20% bloat before autovacuum starts — way too much for an OLTP workload.
The four global knobs that matter.
-
autovacuum_max_workers(default 3) — how many tables can be vacuumed in parallel. On a busy multi-table OLTP database, 5-10 is more realistic. -
autovacuum_naptime(default 1 min) — how often the launcher scans for eligible tables. Lower (15-30s) on a database with frequent autovacuum needs. -
autovacuum_vacuum_scale_factor(default 0.2) — the global trigger threshold as a fraction ofreltuples. Almost always overridden per-table on hot tables. -
autovacuum_vacuum_threshold(default 50) — the additive constant in the trigger formula. Matters only for small tables.
The cost-based throttle.
- Every VACUUM operation accumulates "cost units" — a page read from disk is 10 units, a dirtied page is 20 units, a buffer access is 1 unit.
- When the accumulated cost reaches
autovacuum_vacuum_cost_limit(default 200), the worker sleeps forautovacuum_vacuum_cost_delay(default 2 ms). - This intentionally throttles VACUUM to avoid saturating disk I/O during normal operations. The default budget gives roughly 8 MB/s of vacuuming — fine for a 1 GB table, painfully slow for a 1 TB table.
- For modern SSDs, tune
autovacuum_vacuum_cost_limit=2000(10x the default) andautovacuum_vacuum_cost_delay=2msto get ~80 MB/s of vacuum throughput without saturating the I/O system.
Per-table overrides — the production lever.
- The single most important production tuning pattern:
ALTER TABLE hot_events SET (autovacuum_vacuum_scale_factor = 0.05); - Per-table overrides let you keep conservative defaults for the 90% of tables that don't matter, and tighten thresholds on the 10% of hot tables that drive bloat.
- Other per-table knobs:
autovacuum_vacuum_threshold,autovacuum_vacuum_cost_delay,autovacuum_vacuum_cost_limit,autovacuum_freeze_min_age,autovacuum_freeze_max_age,fillfactor.
The autoanalyze counterpart.
- Identical mechanism applies to ANALYZE:
autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples. - Default
analyze_scale_factor=0.1— analyze triggers at 10% modified rows. Usually fine; tune lower on warehouse tables with skewed distributions.
Operational monitoring.
-
pg_stat_progress_vacuum— real-time view of every running autovacuum, with phase, heap_blks_scanned, heap_blks_total. Use it during incidents. -
pg_stat_user_tables.autovacuum_count— how many times each table has been autovacuumed since stats reset. Hot tables should be in the hundreds or thousands. -
pg_stat_user_tables.last_autovacuum— last time each table was autovacuumed. > 24 hours on a hot table = trouble.
Common interview probes on autovacuum tuning.
- "What is the autovacuum trigger?" —
n_dead_tup > threshold + scale_factor * reltuples. - "How do you tune for a hot table?" — per-table override of
autovacuum_vacuum_scale_factorto 0.02-0.05. - "Why does autovacuum throttle?" — cost-based system to avoid saturating I/O during normal operations.
- "When do you increase
autovacuum_max_workers?" — when multiple hot tables compete for vacuum slots; default 3 is too few for a 20-table OLTP database.
Worked example — tuning a hot events table from defaults to production
Detailed explanation. A hot_events table receives 1M writes per hour. At default scale_factor=0.2, autovacuum runs every ~5 hours but the table grows by 50% between runs — bloat peaks at 50%. The fix is a per-table override that triggers autovacuum every ~1 hour at ~10% bloat.
Question. Walk through the autovacuum math on hot_events (10M rows, 1M writes/hour) at default settings and after a tuned override. Show the SQL and the expected bloat curve.
Input.
| Setting | Default | Tuned |
|---|---|---|
| reltuples | 10,000,000 | 10,000,000 |
| autovacuum_vacuum_threshold | 50 | 1,000 |
| autovacuum_vacuum_scale_factor | 0.2 | 0.05 |
| autovacuum_vacuum_cost_limit | 200 | 2,000 |
| autovacuum_vacuum_cost_delay | 2 ms | 2 ms |
Code.
-- inspect current per-table settings (will be empty if none)
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'hot_events';
-- per-table override for production tuning
ALTER TABLE hot_events SET (
autovacuum_vacuum_threshold = 1000,
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_cost_limit = 2000,
autovacuum_vacuum_cost_delay = 2,
autovacuum_analyze_scale_factor = 0.02,
fillfactor = 80
);
-- verify the override took effect
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'hot_events';
-- compute the resulting trigger threshold
SELECT
1000 + 0.05 * 10000000 AS new_trigger,
50 + 0.2 * 10000000 AS old_trigger;
Step-by-step explanation.
- Default trigger:
50 + 0.2 * 10M = 2,000,050dead tuples — about 5 hours of writes at 1M/hour and a 40% UPDATE rate (assume 400K dead/hour). Bloat peaks at ~20% before autovacuum runs. - Tuned trigger:
1000 + 0.05 * 10M = 501,000dead tuples — about 1.25 hours of writes. Bloat peaks at ~5%. - The cost-limit bump (200 → 2000) lets the autovacuum worker actually finish before the next trigger fires. At default cost_limit, a 10M-row table can take 20+ minutes to vacuum; at 2000, it finishes in ~2 minutes.
-
fillfactor=80reserves page space so HOT updates can stay on the same page. This further reduces the dead-tuple production rate for the same workload. -
autovacuum_analyze_scale_factor=0.02triggers an analyze at 2% modified rows. On a hot table, this keeps the planner stats fresh between autovacuum cycles. - After deploying these settings: monitor
pg_stat_user_tablesfor the next 24 hours. Expectn_dead_tupto oscillate around 500K (the new trigger),autovacuum_countto climb by 20+/day (vs 4-5/day at defaults), and the heap size to stabilise.
Output.
| Metric | Default | Tuned |
|---|---|---|
| Autovacuum trigger | 2M dead rows | 500K dead rows |
| Autovacuum frequency | every 5 h | every 1.25 h |
| Peak bloat | ~20% | ~5% |
| Vacuum wall time | 20+ min | ~2 min |
| HOT update rate | ~40% (typical) | ~70%+ (with fillfactor 80) |
Rule of thumb. Every table over 10M rows with > 1000 writes per minute should have per-table autovacuum overrides. The defaults are designed for safety, not for production scale. Lower scale_factor, raise cost_limit, and set fillfactor=80 is the standard production starting point.
Worked example — increasing autovacuum_max_workers for a multi-table OLTP
Detailed explanation. A database has 20 hot tables, each producing dead tuples at a similar rate. With autovacuum_max_workers=3 (default), only 3 tables can be vacuumed at any given moment — the other 17 queue up. On busy systems this means hot tables don't get vacuumed often enough.
Question. Walk through how to safely raise autovacuum_max_workers from 3 to 8 on a 16-core database server with 64 GB RAM and SSD storage. What are the constraints?
Input.
| Resource | Value |
|---|---|
| CPU cores | 16 |
| Memory | 64 GB |
| Storage | SSD, 5 GB/s read, 2 GB/s write |
| Hot table count | 20 |
| Current autovacuum_max_workers | 3 |
| maintenance_work_mem | 256 MB |
Code.
-- global settings (postgresql.conf)
-- autovacuum_max_workers = 8
-- autovacuum_naptime = 15s (default 1min — wake more often)
-- maintenance_work_mem = 1GB (each worker can use this much)
-- autovacuum_vacuum_cost_limit = 2000 (faster vacuum throughput)
-- autovacuum_vacuum_cost_delay = 2ms
-- apply via ALTER SYSTEM (requires reload)
ALTER SYSTEM SET autovacuum_max_workers = 8;
ALTER SYSTEM SET autovacuum_naptime = '15s';
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 2;
-- reload (autovacuum_max_workers REQUIRES a restart, not just reload)
SELECT pg_reload_conf();
-- ... then restart Postgres for max_workers to take effect
-- monitor concurrency after restart
SELECT count(*) FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
Step-by-step explanation.
- Default
autovacuum_max_workers=3means at most 3 tables vacuum in parallel. On 20 hot tables, that's a 6-7x serialisation factor — most tables wait their turn. - Bumping to 8 workers lets 8 tables vacuum concurrently. With 16 CPU cores, the workers consume at most ~50% of CPU during peak — leaves plenty of headroom for application queries.
- Memory budget: each worker can use up to
maintenance_work_mem(1 GB tuned). 8 workers × 1 GB = 8 GB peak — comfortably within the 64 GB total. If you also run manual VACUUMs, account for those. - Disk budget: each worker reads and writes at a rate throttled by
cost_limit/cost_delay. At 2000/2ms, each worker delivers ~80 MB/s. 8 workers = 640 MB/s — well within the SSD's 2 GB/s write capacity. -
autovacuum_max_workersrequires a Postgres restart (it's a max-connections-style setting that allocates worker slots at startup). Plan a maintenance window for the change. - After restart, monitor
pg_stat_user_tables.autovacuum_count— you should see it climb 2-3x faster on hot tables, withlast_autovacuumstaying under 1 hour for every busy table.
Output.
| Metric | Before (3 workers) | After (8 workers) |
|---|---|---|
| Tables vacuumed in parallel | up to 3 | up to 8 |
| Average wait for hot table | ~30 min | < 5 min |
| Peak CPU usage during vacuum | ~15% | ~40% |
| Peak disk write | 240 MB/s | 640 MB/s |
| Memory usage (worker pool) | 3 × 256 MB = 768 MB | 8 × 1 GB = 8 GB |
Rule of thumb. Set autovacuum_max_workers = min(cores / 2, hot_table_count). On modern 16+ core servers with SSD, 6-10 workers is the right range. Don't bump it without also raising maintenance_work_mem and autovacuum_vacuum_cost_limit.
Worked example — autovacuum cost-budget tuning on SSD
Detailed explanation. The default autovacuum_vacuum_cost_limit=200 was set when "fast storage" meant 5400 RPM disks. On modern NVMe SSD, you can safely deliver 10x the vacuum throughput without affecting application latency. Senior engineers know to tune this knob explicitly.
Question. A 500 GB OLTP database on NVMe SSD has autovacuum running constantly but not keeping up — bloat keeps creeping upward. Cost limit is at default. Walk through the throughput math and the tuning.
Input.
| Resource | Value |
|---|---|
| Storage | NVMe SSD, 7 GB/s read, 4 GB/s write |
| autovacuum_vacuum_cost_limit | 200 (default) |
| autovacuum_vacuum_cost_delay | 2 ms (default) |
| vacuum_cost_page_dirty | 20 (default) |
| Database write rate | 50 MB/s sustained |
| Bloat trend | +1% per day |
Code.
-- compute effective vacuum throughput at defaults
-- 200 cost_limit / 20 cost_page_dirty = 10 dirty pages per cycle
-- each cycle ends with a 2ms sleep
-- 1000ms / 2ms = 500 cycles/sec
-- 500 cycles * 10 pages = 5000 pages/sec
-- 5000 pages * 8KB = 40 MB/s
-- With default settings, vacuum can dirty ~40 MB/s of pages
-- App is writing 50 MB/s sustained ⇒ vacuum cannot keep up ⇒ bloat
-- Tuned for NVMe
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 4000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 2;
ALTER SYSTEM SET vacuum_cost_page_dirty = 20; -- unchanged
ALTER SYSTEM SET vacuum_cost_page_hit = 1; -- unchanged
ALTER SYSTEM SET vacuum_cost_page_miss = 2; -- lowered from 10
SELECT pg_reload_conf();
-- recheck throughput
-- 4000 / 20 = 200 dirty pages per cycle
-- 500 cycles/sec * 200 pages * 8KB = 800 MB/s
-- now vacuum can dirty 800 MB/s vs app writing 50 MB/s
-- ⇒ vacuum stays well ahead of bloat production
Step-by-step explanation.
- At defaults, autovacuum's effective dirty-page throughput is
(cost_limit / cost_page_dirty) * (1000ms / cost_delay) * 8KB = (200/20) * (1000/2) * 8KB = 40 MB/s. On a sustained 50 MB/s write workload, vacuum cannot keep up. - The slow-disk era assumption is
cost_page_miss=10— a random read was expensive. On SSD with random I/O nearly free, lower this to 2. Same change applies tocost_page_hit=1(default fine). - Raising
cost_limitto 4000 (20x default) gives autovacuum 800 MB/s of dirty-page throughput — far more than the 50 MB/s workload produces. Vacuum stays well ahead. -
cost_delay=2msis the minimum reasonable; lower than that gives diminishing returns because the OS scheduler granularity is around 1-2 ms. - The trade-off: at 800 MB/s of vacuum dirty-page rate, the autovacuum workers do compete with foreground queries for buffer cache and disk I/O. On NVMe with bandwidth headroom, the contention is invisible to applications. On rotational disks, this configuration would saturate IOPS — never use it.
- Monitor
pg_stat_io(Postgres 16+) to see how many pages autovacuum reads vs the application. If autovacuum I/O dominates, ratchet backcost_limit.
Output.
| Configuration | Effective vacuum throughput | Catches up to 50 MB/s writes? |
|---|---|---|
| Defaults | 40 MB/s | no — bloat grows |
| cost_limit=2000 | 400 MB/s | yes (8x headroom) |
| cost_limit=4000 | 800 MB/s | yes (16x headroom) |
Rule of thumb. On NVMe SSD, set autovacuum_vacuum_cost_limit=2000-4000 and vacuum_cost_page_miss=2. On rotational disks, leave defaults. The "slow vacuum on fast disk" anti-pattern is one of the most common Postgres operational misconfigurations in 2026.
Senior interview question on autovacuum tuning
A senior interviewer might frame this as: "You're on call. A 50 GB orders table is at 65% bloat and growing. Autovacuum is enabled but seemingly not keeping up. Walk me through your tuning, in priority order, in the next 30 minutes."
Solution Using the autovacuum diagnostic + tuning playbook
30-min autovacuum diagnostic + tuning playbook
===============================================
Minute 0-5: Diagnose
- SELECT * FROM pg_stat_progress_vacuum;
(is autovacuum even running on this table?)
- SELECT * FROM pg_stat_user_tables WHERE relname = 'orders';
(last_autovacuum, autovacuum_count, n_dead_tup)
- Check for long-running transactions blocking the horizon.
Minute 5-10: Tune trigger
- ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);
- Triggers next autovacuum at 5% bloat instead of 20%.
Minute 10-15: Tune throughput
- ALTER TABLE orders SET (
autovacuum_vacuum_cost_limit = 4000,
autovacuum_vacuum_cost_delay = 2
);
- Vacuum delivers ~800 MB/s on SSD, plenty for 50 GB.
Minute 15-20: Force immediate vacuum (if patience runs out)
- VACUUM (VERBOSE, ANALYZE) orders;
- Manual run, picks up new settings; reclaims dead tuples.
Minute 20-25: Verify
- SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname='orders';
- Should drop sharply.
Minute 25-30: Document + monitor
- Add per-table overrides to runbook.
- Set alerting on dead_pct > 15% on this table.
Step-by-step trace.
| Minute | Action | Expected result |
|---|---|---|
| 0-5 | Diagnose | Identify default settings on hot table |
| 5-10 | scale_factor 0.2 → 0.05 | Future triggers at 5% |
| 10-15 | cost_limit 200 → 4000 | Vacuum throughput 40 → 800 MB/s |
| 15-20 | Manual VACUUM | n_dead_tup drops by ~95% |
| 20-25 | Confirm | autovacuum_count incremented, last_autovacuum recent |
| 25-30 | Document | runbook captures the per-table overrides |
After 30 minutes the bloat is reclaimed and the new settings prevent recurrence. The on-call closes the incident with a permanent fix, not a band-aid.
Output:
| Setting | Before | After |
|---|---|---|
| autovacuum_vacuum_scale_factor | 0.2 (DB default) | 0.05 (table override) |
| autovacuum_vacuum_cost_limit | 200 | 4000 |
| Effective throughput | 40 MB/s | 800 MB/s |
| Peak bloat | 65% | < 8% |
| Recovery time | "never" | minutes |
Why this works — concept by concept:
- Diagnose before you tune — confirm autovacuum is actually running on the right table, not blocked by a long-running transaction or a missing extension. The horizon check before tuning is the senior signal.
- Per-table overrides — global tuning is too blunt; production databases have 5-50 tables with different write profiles. The per-table override pattern is the daily-driver tool.
-
Trigger vs throughput —
scale_factorcontrols when vacuum runs;cost_limitcontrols how fast it runs. Tune both; they solve different sub-problems. -
Force a manual VACUUM for immediate relief — autovacuum runs on its own schedule; if you need bloat gone now, run
VACUUMby hand. It picks up the new per-table settings immediately. -
Cost —
O(table_size)per vacuum pass; per-cycle cost is throttled bycost_limit/cost_delay. The throttle exists to prevent vacuum from saturating I/O — tune it on fast storage so vacuum can actually keep up.
SQL
Topic — sql
Autovacuum tuning problems
5. Wraparound and emergency procedures
transaction id wraparound is the rare Postgres outage that everyone has to know about — 32-bit XID counter, 2 billion transactions, and a hard shutdown if vacuum freeze falls behind
The mental model in one line: Postgres assigns every transaction a 32-bit XID; when the counter approaches 2 billion (~2^31), Postgres must "freeze" old tuples (mark them visible-to-all) so the wrapped XID space stays consistent — and if freezing falls behind, the database refuses writes until you run a manual VACUUM FREEZE to catch up. Once you internalise "XID is 32-bit, freeze prevents wraparound, missing the freeze deadline = hard outage," the wraparound interview surface reduces to a known runbook.
The 32-bit XID — why wraparound exists.
- Every transaction gets a unique 32-bit XID. After 2^32 = ~4.3 billion transactions, the counter wraps to 0.
- Naively, an old tuple with
xmin=10would suddenly appear "newer" than a transaction withxmin=20after the counter wrapped — that would break visibility. - Postgres prevents this by "freezing" old tuples — setting a hint bit that says "this tuple is visible to all current and future transactions, ignore its xmin."
The freeze thresholds.
-
vacuum_freeze_min_age(default 50M) — tuples older than 50M XIDs are eligible to be frozen by a regular VACUUM run. -
autovacuum_freeze_max_age(default 200M) — if any table's oldest XID is more than 200M XIDs behind current, autovacuum runs a forced "aggressive" mode VACUUM on that table to freeze. -
vacuum_freeze_table_age(default 150M) — VACUUM switches to aggressive (scans every page, not just non-all-visible) when the table's oldest XID hits this age. - The defaults give you ~1.8 billion XIDs of headroom (2B XID limit minus 200M freeze max age). At 1M transactions/hour, that's ~75 days — plenty of warning if autovacuum is working.
The age columns — your visibility into wraparound.
-
pg_database.datfrozenxid— the oldest XID still considered "non-frozen" anywhere in the database. The wraparound countdown is2^31 - age(datfrozenxid). -
pg_class.relfrozenxid— same, but per-table. Autovacuum advances this every time it freezes the oldest tuples. -
age(xid)— Postgres function returning the distance between an XID and the current XID in modular-32-bit arithmetic. The closer to 2 billion, the more urgent the freeze.
The wraparound warning ladder.
- At
age(datfrozenxid) > 200M: autovacuum starts forcing aggressive freeze runs on every table. - At
age > 1B: log warnings ("oldest transaction's age is..."). Time to investigate. - At
age > 1.5B: log STRONG warnings. Pager territory. - At
age > 2B: hard shutdown. Postgres refuses to accept writes (and eventually reads). The only fix is single-user mode VACUUM FREEZE.
The emergency runbook.
- Step 1: Verify autovacuum is actually running.
SELECT * FROM pg_stat_activity WHERE backend_type='autovacuum worker'. - Step 2: Identify the oldest table.
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind='r' ORDER BY age(relfrozenxid) DESC LIMIT 10. - Step 3: Run
VACUUM FREEZE relnamemanually on the offending table, ideally in parallel for multiple tables. - Step 4: If approaching the hard limit, run
vacuumdb --freeze --allfrom shell — runsVACUUM FREEZEagainst every table in every database. - Step 5: If hard limit hit, restart Postgres in single-user mode and run VACUUM FREEZE. Last-resort recovery.
What blocks freeze.
- A long-running transaction holds back the horizon, just like for plain VACUUM. The horizon must advance past a tuple's xmax before that tuple can be frozen.
- Read-only replicas: by default, do not block freeze on the primary. (
hot_standby_feedback=ondoes propagate the replica's horizon back to the primary — be careful.) - Logical replication: an open replication slot holds back the horizon. A long-disconnected logical replica is a common wraparound cause.
Monitoring.
- Alert on
age(datfrozenxid) > 800Mper database. Gives weeks of headroom. - Alert on any single table's
age(relfrozenxid) > 500M. Catches per-table autovacuum failures. - Track
xact_commit + xact_rollbackper second to estimate XID burn rate. A spike forecasts wraparound earlier.
Common interview probes on wraparound.
- "What is XID wraparound?" — the 32-bit XID counter looping; visibility breaks if old tuples aren't frozen first.
- "How does Postgres prevent it?" — autovacuum runs aggressive freeze runs at
autovacuum_freeze_max_age. - "What happens if it isn't prevented?" — database refuses writes; single-user-mode VACUUM FREEZE recovery.
- "How would you monitor it?" —
pg_database.datfrozenxidand per-tablerelfrozenxidage.
Worked example — diagnosing a table approaching 1.5B XID age
Detailed explanation. A monitoring alert fires: age(datfrozenxid) = 1.5B on the production database. That's the orange-warning zone — about 25-30 days from forced shutdown. You need to identify which table is the laggard and drain it safely.
Question. A wraparound alert fires. Walk through the diagnostic queries and the freeze plan to get the database back below 200M XID age within 24 hours.
Input.
| Metric | Value |
|---|---|
| age(datfrozenxid) | 1,500,000,000 |
| Hard limit | 2,146,483,648 |
| Days until limit at current XID rate | ~28 |
| Largest table (rows) | events (500M) |
| Autovacuum running? | yes, but on small tables |
Code.
-- 1. Database-level age
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
-- 2. Tables with the oldest relfrozenxid
SELECT
schemaname, relname,
age(relfrozenxid) AS xid_age,
pg_size_pretty(pg_total_relation_size(c.oid)) AS size,
last_autovacuum
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
-- 3. Verify no long-running txn is blocking
SELECT pid, backend_xmin, age(backend_xmin) AS backend_age,
now() - xact_start AS xact_age, application_name
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;
-- 4. Check replication slot lag (can block freeze)
SELECT slot_name, slot_type, active, age(xmin) AS slot_xid_age
FROM pg_replication_slots
ORDER BY age(xmin) DESC;
-- 5. Force freeze on the offending table
VACUUM (FREEZE, VERBOSE) public.events;
Step-by-step explanation.
- The database age (1.5B) is dominated by the table with the oldest
relfrozenxid. The catalog query in step 2 surfaces the top 10 offenders — usually one or two large tables that autovacuum hasn't been able to fully freeze. - If a table is large (500M rows), a full VACUUM FREEZE can take hours. The plan is to start the freeze immediately, not wait for autovacuum's normal schedule.
- Step 3 checks for the most common freeze blocker: a long-running transaction. If you find one with
age(backend_xmin) > 100M, terminate it before freezing. - Step 4 catches the second most common blocker: an inactive replication slot. A logical replica that disconnected weeks ago and was never cleaned up holds back the horizon. Drop the slot if confirmed abandoned (
SELECT pg_drop_replication_slot('slot_name')). - The manual
VACUUM (FREEZE, VERBOSE) eventsruns an aggressive freeze on the large table. Lock level isSHARE UPDATE EXCLUSIVE— application traffic continues uninterrupted. - After the freeze finishes (might be 2-6 hours), recheck
pg_database.datfrozenxid— should drop from 1.5B to under 200M, restoring the normal headroom.
Output.
| Phase | age(datfrozenxid) | Action | Status |
|---|---|---|---|
| Alert fires | 1,500,000,000 | diagnose top tables | events is the laggard |
| Check blockers | 1,500,000,000 | no long txn, no stuck slot | safe to freeze |
| Freeze events | running, 2 hours | aggressive scan | in progress |
| After freeze | 100,000,000 | verify pg_database | under control |
| Tune | 100,000,000 | lower freeze_max_age for events | prevent recurrence |
Rule of thumb. Anytime age(datfrozenxid) crosses 800M, treat it as a P1. The 30-day shutdown clock is the most consequential operational deadline in Postgres. Pause non-essential maintenance and run the freeze immediately.
Worked example — single-user mode emergency recovery
Detailed explanation. The worst case: you ignored the warnings, the XID counter is at 2B+1, and Postgres has shut down to prevent corruption. The only path back is single-user mode VACUUM FREEZE — a manual recovery procedure every senior Postgres DBA should know cold.
Question. Postgres refuses to start, logs show "database is not accepting commands to avoid wraparound data loss". Walk through the single-user-mode recovery.
Input.
| State | Value |
|---|---|
| Postgres status | down (in shutdown to prevent wraparound) |
| pg_database.datfrozenxid age | 2,146,483,648 + |
| Disk free | 40% |
| Backups | last good at 6 hours ago |
| Replication | replica at same XID — also stuck |
Code.
# Stop the running postmaster (it will not accept normal connections anyway)
pg_ctl stop -D /var/lib/postgresql/data -m immediate
# Start in single-user mode against the affected database
postgres --single -D /var/lib/postgresql/data mydb
# Inside single-user mode prompt, run:
# (note: no transaction wrapping; each statement is a transaction)
VACUUM FREEZE; -- runs against current DB; can take hours
# Wait until completion.
# Exit single-user mode (Ctrl-D or \q)
# Start Postgres normally
pg_ctl start -D /var/lib/postgresql/data
# Verify
psql -d mydb -c "SELECT datname, age(datfrozenxid) FROM pg_database;"
# Should now show low ages.
# Repeat for every other database in the cluster
postgres --single -D /var/lib/postgresql/data otherdb
# VACUUM FREEZE;
Step-by-step explanation.
- Postgres refuses to start in normal mode because the wraparound risk is too high. Single-user mode (
postgres --single) is the recovery interface — a stripped-down standalone Postgres that bypasses the wraparound check. - Single-user mode talks directly to the storage layer. Connect to one database at a time; the wraparound condition is per-database (
datfrozenxidis per-DB). - The
VACUUM FREEZEcommand in single-user mode runs an aggressive freeze across every table in the current database. On a 500 GB database with billions of tuples, this can take 2-8 hours. There is no progress indicator — just wait. - After the freeze completes, exit single-user mode and start Postgres normally. The startup check now sees
datfrozenxidages well below the 2B threshold and allows full operations. - If the cluster has multiple databases, repeat the procedure for each one — single-user mode only handles one DB at a time, and only the currently-stuck DBs need this treatment.
- After recovery, audit the autovacuum configuration: why did freezing fall behind? Common causes are autovacuum disabled, a long-running transaction held the horizon, a stuck replication slot, or autovacuum was so throttled it couldn't keep up.
Output.
| Phase | Status | Wall time |
|---|---|---|
| Stop postmaster | done | seconds |
| Single-user mode VACUUM FREEZE | running | 2-8 hours |
| Exit single-user mode | done | seconds |
| Start Postgres normally | accepting connections | seconds |
| Recheck datfrozenxid age | 50M-100M (safe) | seconds |
Rule of thumb. Single-user mode is the recovery path of last resort. The procedure works, but it costs you hours of downtime and shakes any trust the application team has in the database. Monitor age(datfrozenxid) and alert at 800M — you should never find yourself running this procedure if your alerting is in place.
Worked example — vacuumdb --freeze --all preemptive run
Detailed explanation. Sometimes you discover the warning early enough — age = 1.2B, no immediate emergency, but autovacuum is clearly behind. The right move is a controlled, parallel vacuumdb --freeze --all from shell that catches up every table in every database without waiting for autovacuum's schedule.
Question. Show how to run a parallel preemptive freeze across an entire Postgres cluster using vacuumdb from the shell, before wraparound becomes a real emergency.
Input.
| Setting | Value |
|---|---|
| Cluster databases | 5 |
| Largest database | 500 GB |
| age(datfrozenxid) max across DBs | 1.2B |
| Available maintenance window | 6 hours overnight |
| CPU cores | 16 |
Code.
# Inspect current age across all databases first
psql -c "SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;"
# Run preemptive freeze with 6 parallel jobs against every DB
vacuumdb --freeze \
--all \
--jobs=6 \
--verbose \
2>&1 | tee /var/log/postgres/freeze.log
# Or target a specific DB if cluster is too large to do at once
vacuumdb --freeze --jobs=6 --verbose --dbname=mydb
# Skip indexes for a speed boost (rarely needed; the freeze itself is the bottleneck)
vacuumdb --freeze --jobs=6 --skip-locked --no-index-cleanup mydb
# Watch progress live
psql -c "SELECT pid, datname, relid::regclass, phase,
heap_blks_scanned, heap_blks_total,
round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total,0),2) AS pct
FROM pg_stat_progress_vacuum;"
Step-by-step explanation.
-
vacuumdb --freeze --all --jobs=6runsVACUUM (FREEZE)against every table in every database, with up to 6 vacuums running in parallel. It's the official "catch up on wraparound" command in the Postgres toolbox. - The
--jobsflag is per database; with 5 databases and--jobs=6, you can have up to 6 vacuums running at once on the largest database. The other DBs run sequentially after. -
--skip-lockedskips any table currently locked by another operation rather than waiting. Useful if you have ongoing migrations or other vacuum activity. -
--no-index-cleanupskips index vacuum, which can be 3-5x faster on tables with many indexes. The trade-off: index bloat doesn't get cleaned this run. Use only if you're racing the wraparound clock. -
pg_stat_progress_vacuumgives live per-table progress, including phase (scanning heap, vacuuming indexes, etc.) and heap blocks scanned vs total. Monitor it during the run to estimate completion time. - After completion, recheck
age(datfrozenxid)per DB — should drop to under 100M. Schedule a postmortem to fix the autovacuum config that let it grow to 1.2B in the first place.
Output.
| DB | Before age | After age | Wall time | Tables frozen |
|---|---|---|---|---|
| mydb (main) | 1,200,000,000 | 80,000,000 | 4 h 15 min | 247 |
| analytics | 800,000,000 | 50,000,000 | 1 h 30 min | 89 |
| reporting | 600,000,000 | 40,000,000 | 45 min | 52 |
| catalog | 200,000,000 | 30,000,000 | 10 min | 18 |
| logs | 150,000,000 | 25,000,000 | 8 min | 12 |
Rule of thumb. Run vacuumdb --freeze --all --jobs=N as a quarterly preventive job on any production cluster with hot tables. It's the easiest insurance against wraparound — costs you one maintenance window per quarter and guarantees you'll never face the single-user-mode emergency.
Senior interview question on wraparound
A senior interviewer might frame this as: "Your monitoring shows age(datfrozenxid) = 1.8B on a production primary. The XID burn rate is 8M/hour. Walk me through the next 24 hours — diagnostic, mitigation, root cause, prevention."
Solution Using the wraparound emergency runbook
T+0h Diagnose
- Identify table(s) with oldest relfrozenxid.
- Check for long-running transactions blocking horizon.
- Check for inactive replication slots holding XIDs.
T+0h Compute remaining headroom
- 2.1B − 1.8B = 300M XIDs remaining.
- 8M/hour → 37 hours until shutdown.
- 30-hour window to act.
T+0-1h Clear blockers
- pg_terminate_backend(<long-txn pid>)
- pg_drop_replication_slot(<stuck slot>)
T+1-6h Run freeze on top offenders in parallel
- vacuumdb --freeze --jobs=8 --table=events --table=orders mydb
- Monitor via pg_stat_progress_vacuum.
- Expect age to drop sharply on each table as it completes.
T+6-12h Run vacuumdb --freeze --all to sweep the rest
- Catches every other table that's still behind.
T+12h Verify
- age(datfrozenxid) under 200M.
- Headroom restored to ~1.9B XIDs (~10 days).
T+24h Postmortem + permanent fix
- Lower autovacuum_freeze_max_age on hot tables to 100M.
- Add per-table override for events: autovacuum_vacuum_scale_factor=0.02.
- Add Datadog alert on any age(datfrozenxid) > 800M.
- Document the runbook in wiki for next on-call.
Step-by-step trace.
| Time | age(datfrozenxid) | Action | Headroom |
|---|---|---|---|
| T+0h | 1,800,000,000 | diagnose | 30 hours |
| T+1h | 1,800,000,000 | clear long txn + dead slot | 30 hours |
| T+3h | 1,500,000,000 | events VACUUM FREEZE half-done | restored |
| T+6h | 800,000,000 | events fully frozen | restored |
| T+9h | 400,000,000 | orders fully frozen | restored |
| T+12h | 150,000,000 | other tables frozen | restored |
| T+24h | 130,000,000 (steady) | postmortem complete | 10+ days |
The emergency is resolved in 12 hours of vacuum work, plus 12 hours for the postmortem and permanent fix. The key signal: the database stayed online the whole time — no ACCESS EXCLUSIVE, no single-user mode, no application impact.
Output:
| Outcome | Status |
|---|---|
| Database stays online | yes (no shutdown) |
| Application impact | negligible (no ACCESS EXCLUSIVE) |
| Headroom restored | from 30 hours to 10+ days |
| Root cause fixed | autovacuum tuned per-table |
| Monitoring upgraded | alert at 800M added |
Why this works — concept by concept:
- Headroom math first — knowing exactly how many hours you have before shutdown turns panic into a plan. The XID burn rate × remaining XIDs is the most actionable number in the runbook.
-
Clear blockers before vacuuming — long transactions and stuck replication slots prevent freeze. No amount of
vacuumdb --freezehelps until the horizon can actually advance. -
Parallel freeze on top offenders — instead of waiting for autovacuum's schedule, kick off
vacuumdb --freeze --jobs=Nagainst the largest tables manually. They run in parallel and finish dramatically faster than the default autovacuum cadence. -
Postmortem yields permanent tuning — wraparound emergencies are autovacuum-tuning bugs in disguise. The permanent fix is per-table
autovacuum_freeze_max_agelowered to 100M plus tighterscale_factorto keep dead-tuple accumulation in check. -
Cost —
O(table_size + index_size)per table to freeze; runs in parallel across tables and across DBs. Single-user mode recovery is the only fully-offline option and the only one with aO(downtime_hours)business cost — keep that off the table at all costs by monitoring early.
SQL
Topic — sql
Postgres wraparound problems
ETL
Topic — etl
Postgres operational drills
Cheat sheet — Postgres VACUUM recipes
-
Per-table autovacuum override (the daily-driver tune).
ALTER TABLE hot_events SET (autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_threshold=1000, autovacuum_vacuum_cost_limit=2000, autovacuum_vacuum_cost_delay=2, fillfactor=80);— five settings, tuned for any table over 10M rows with > 1000 writes/min. -
Quick bloat estimate from
pg_stat_user_tables.SELECT relname, n_live_tup, n_dead_tup, round(100.0 * n_dead_tup / NULLIF(n_live_tup,0), 2) AS dead_pct FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;— cheap, always available, refreshed on every commit. -
Precise bloat audit via
pgstattuple.CREATE EXTENSION pgstattuple; SELECT pg_size_pretty(table_len), round(dead_tuple_percent::numeric, 2) AS dead_pct FROM pgstattuple('public.orders');— full scan cost, exact answer. -
pg_repack one-shot.
pg_repack --no-superuser-check --table=orders --jobs=4 mydb— online table rebuild; requires thepg_repackextension and 2x table size in free disk. -
Wraparound monitor query.
SELECT datname, age(datfrozenxid), 2146483648 - age(datfrozenxid) AS xids_remaining FROM pg_database ORDER BY age(datfrozenxid) DESC;— run hourly via monitoring; alert at < 1.3B XIDs remaining (age > 800M). -
Per-table oldest XID.
SELECT relname, age(relfrozenxid), pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind='r' ORDER BY age(relfrozenxid) DESC LIMIT 20;— find the laggards before they hit the cluster shutdown threshold. -
Find horizon-blocking sessions.
SELECT pid, backend_xmin, age(backend_xmin) AS xmin_age, now()-xact_start AS xact_age, application_name FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY age(backend_xmin) DESC LIMIT 10;— the first query to run when bloat is growing despite autovacuum activity. -
Idle-in-transaction safety net. Set
idle_in_transaction_session_timeout='15min'andstatement_timeout='30min'inpostgresql.conf. Prevents most accidental horizon blocks. -
Preemptive cluster-wide freeze.
vacuumdb --freeze --all --jobs=6 --verbosefrom shell — run quarterly in a maintenance window to catch up any tables autovacuum has fallen behind on. -
Emergency single-user-mode recovery.
pg_ctl stop -m immediate; postgres --single -D <data_dir> <db>; VACUUM FREEZE;— last resort if wraparound triggered a shutdown; budget 2-8 hours per database. -
Aggressive index vacuum.
VACUUM (PARALLEL 4, INDEX_CLEANUP on) orders;— uses up to 4 parallel index workers; Postgres 14+ feature. Use on tables with many indexes for a 3-5x speedup over single-threaded VACUUM. -
Skip index cleanup for a wraparound race.
VACUUM (FREEZE, INDEX_CLEANUP off) huge_table;— only the heap freeze runs; index bloat ignored. Recover from wraparound risk in half the time; rebuild indexes later. -
HOT update rate as a tuning signal.
SELECT relname, n_tup_upd AS updates, n_tup_hot_upd AS hot_updates, round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd,0), 2) AS hot_pct FROM pg_stat_user_tables ORDER BY n_tup_upd DESC LIMIT 20;— anything under 30% HOT on a busy table means an indexed column is being updated; revisit the indexes. -
Per-table FREEZE override on a hot table.
ALTER TABLE huge_table SET (autovacuum_freeze_max_age=100000000, vacuum_freeze_min_age=10000000);— forces aggressive freeze to start earlier on this table specifically. -
Live autovacuum progress.
SELECT pid, datname, relid::regclass AS rel, phase, heap_blks_scanned, heap_blks_total, round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total,0), 2) AS pct FROM pg_stat_progress_vacuum;— use during incidents to see exactly what each worker is doing.
Frequently asked questions
What is MVCC in PostgreSQL?
MVCC (Multi-Version Concurrency Control) is the technique PostgreSQL uses to let many transactions read and write the same data concurrently without traditional row-level read locks. Instead of updating rows in place, every UPDATE writes a new tuple version stamped with the current transaction's XID in xmin and marks the previous version's xmax. Each transaction sees a snapshot — the set of tuples whose xmin committed before the snapshot started and whose xmax is either zero or committed after. The cost of postgresql mvcc is "dead tuples": old versions that have to be reclaimed later by postgres vacuum and its autovacuum daemon. This is why Postgres needs VACUUM at all — it's the maintenance work that pays the bill for lock-free reads.
What is the difference between VACUUM and VACUUM FULL in Postgres?
Plain postgres vacuum reclaims dead-tuple space inside the existing table file — it marks the dead tuples reusable, updates the visibility map, and prunes HOT chains, but it does not call ftruncate on the file. The file stays the same size; future INSERTs and UPDATEs reuse the freed slots. vacuum full, by contrast, rewrites the entire table into a new file and atomically swaps it, holding an ACCESS EXCLUSIVE lock that blocks every other operation on the table for the duration. VACUUM FULL is the only way to return space to the OS via the built-in toolset — but it is production-fatal on any table with active traffic. The 2026 senior answer is "we never run VACUUM FULL in production; we use pg_repack for online table rebuilds when we need the file to actually shrink."
How do I prevent transaction ID wraparound in PostgreSQL?
Prevention is mostly automatic: autovacuum runs an aggressive vacuum freeze on any table whose oldest XID is more than autovacuum_freeze_max_age (default 200M) behind current. The freeze sets a hint bit marking the tuple "visible to all," removing it from the wraparound countdown. The two things that go wrong are (1) long-running transactions blocking the horizon so freeze cannot advance, and (2) inactive replication slots holding XIDs that prevent freeze. Monitor age(datfrozenxid) per database and alert at 800M XIDs (gives weeks of headroom); investigate any session with backend_xmin older than 100M; drop any abandoned replication slot. For preemptive catch-up, run vacuumdb --freeze --all --jobs=N quarterly. If you ever see transaction id wraparound shutdown messages, recover via single-user mode VACUUM FREEZE — but tune autovacuum so you never see it.
What is autovacuum tuning — which knobs actually matter?
The four knobs that matter in 95% of production tuning: (1) autovacuum_vacuum_scale_factor — the dead-tuple fraction that triggers a vacuum, default 0.2; lower to 0.02-0.05 per-table on hot tables to keep peak bloat under 5%. (2) autovacuum_vacuum_cost_limit — the per-cycle work budget, default 200; raise to 2000-4000 on SSD storage so vacuum can actually keep up with high write rates. (3) autovacuum_max_workers — how many tables vacuum in parallel, default 3; raise to 6-10 on multi-table OLTP systems with many cores. (4) fillfactor per table — reserve page space for HOT updates, default 100 (none reserved); set 70-80 on UPDATE-heavy tables to push HOT update rate above 70%. The pattern: keep global defaults conservative, override the four knobs per-table via ALTER TABLE on the 10% of tables that drive 90% of bloat. This autovacuum tuning discipline is the single highest-leverage Postgres ops change.
How is pg_repack different from VACUUM FULL?
Both rewrite the table file to reclaim space, so the end result on disk is identical. The difference is operational: VACUUM FULL holds an ACCESS EXCLUSIVE lock for the entire rewrite, blocking every other operation; on a 100 GB table that's hours of effective downtime. pg_repack does the same logical work — copies live tuples to a new file, rebuilds indexes, swaps the file — but uses triggers and a log table to capture concurrent writes, applying them to the new file in the background. Only the final swap requires ACCESS EXCLUSIVE, and that's a catalog flip that takes seconds. pg_repack requires 2x the table size in free disk during the run, a primary key or unique not-null index on the table, and the pg_repack extension installed. Within those constraints, it is the production-default tool for shrinking a bloated table on a 24/7 system.
How do I monitor Postgres bloat in production?
Two layers. The lightweight layer reads pg_stat_user_tables.n_dead_tup and n_live_tup to compute dead_pct per table — cheap, always available, refreshed on every commit. Alert at dead_pct > 20% on any table over 1 GB. The precise layer uses the pgstattuple extension to scan the table and return exact byte-level dead/live ratios — expensive (full table scan), so run weekly during low-traffic windows and feed the result into a dashboard. Pair the bloat metric with three causal queries: (1) longest open transaction (max(now() - xact_start) FROM pg_stat_activity WHERE state <> 'idle'), (2) oldest backend_xmin (the actual horizon), and (3) stuck replication slots. Bloat that autovacuum cannot reclaim is almost always a horizon issue, not a tuning issue — diagnose the horizon first, then tune autovacuum_vacuum_scale_factor per-table once the horizon is moving normally.
Practice on PipeCode
- Drill the SQL practice library → for the index-design, MVCC-snapshot, and EXPLAIN ANALYZE family of probes.
- Rehearse on query optimization problems → when the interviewer wants planner and index depth.
- Sharpen ETL drills → for the end-to-end Postgres ingest + dimensional model pattern.
- For the broader surface, work the entire PipeCode practice catalog →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- For ETL system design depth, work through the ETL system design course →.
Lock in Postgres VACUUM muscle memory
Postgres docs explain MVCC. PipeCode drills explain the decision — when autovacuum needs a per-table override, when pg_repack beats VACUUM FULL, when wraparound becomes an emergency. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.





Top comments (0)