- Book: Database Playbook: Choosing the Right Store for Every System You Build
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Your Postgres table grew by 40GB last quarter. The data only grew by 8GB. The other 32GB is bloat: dead tuples that autovacuum couldn't keep up with. Four settings decide whether autovacuum wins or loses on your hot tables, and the defaults are tuned for a 2009-sized database.
What autovacuum does (and why it falls behind)
Postgres uses MVCC. Every UPDATE and DELETE doesn't change a row in place. It writes a new version and marks the old one dead. The old version sticks around until vacuum reclaims its space. On a low-traffic table this is invisible. On a hot table doing 10k UPDATEs per minute, dead tuples pile up faster than autovacuum can sweep them.
Autovacuum kicks in when the number of dead tuples crosses a threshold:
threshold = autovacuum_vacuum_threshold
+ autovacuum_vacuum_scale_factor * reltuples
autovacuum_vacuum_threshold defaults to 50 rows. autovacuum_vacuum_scale_factor defaults to 0.2, meaning autovacuum waits for 20% of the table to die before doing anything. On a 100-million-row events table that's 20 million dead tuples before the cleanup even starts. By the time the worker grabs that table, the index scans alone take an hour, and queries against the table are slower the whole time because the planner has to walk dead heap pointers.
The cure isn't bigger machines. It's tuning four knobs.
Setting 1: autovacuum_vacuum_scale_factor
The 20% default exists because Postgres targets databases of any size and didn't want surprise I/O on tiny tables. At scale it's the single setting that hurts the most.
Drop it globally to something sane, then override per-table for the hottest five or ten:
-- postgresql.conf: global default, applied to all tables
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;
SELECT pg_reload_conf();
-- per-table override for a hot write table
ALTER TABLE payments_events
SET (autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 1000);
With scale_factor = 0.01 on a 100M-row table, autovacuum kicks in after 1M dead tuples instead of 20M. The vacuum runs are smaller, finish faster, and the table never accumulates the kind of bloat that needs a VACUUM FULL to recover.
There's a parallel knob for ANALYZE: autovacuum_analyze_scale_factor, default 0.1. On the same hot tables, drop it to 0.02. Stale planner statistics on a fast-moving table produce bad plans long before the data is actually corrupted.
ALTER TABLE payments_events
SET (autovacuum_analyze_scale_factor = 0.02);
Verify a per-table override stuck:
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'payments_events';
Setting 2: autovacuum_max_workers and autovacuum_naptime
autovacuum_max_workers defaults to 3. autovacuum_naptime defaults to 60 seconds. Together they cap how parallel and how often autovacuum runs.
Three workers is fine if you have three tables. If you have a thousand tables and ten of them are hot, three workers is a queue. A worker on a slow large vacuum blocks others behind it. Bump it on machines with cores to spare:
ALTER SYSTEM SET autovacuum_max_workers = 8;
ALTER SYSTEM SET autovacuum_naptime = '15s';
-- requires a server restart for max_workers
autovacuum_naptime tells the launcher how long to sleep between checks of each database. With the default 60s, a busy database with hundreds of tables gets surveyed once a minute. Drop it to 15s on busy systems and the launcher catches threshold crossings sooner.
There's a real cost: more workers means more concurrent I/O, and naptime lowered too aggressively wastes CPU on launcher overhead on small systems. The right shape for a meaningful production database in 2026 sits around 6-8 workers and a 15-30s naptime.
Setting 3: autovacuum_vacuum_cost_limit
Autovacuum throttles itself. Each page read, dirty, or hit costs some abstract "vacuum cost units", and once the running total hits autovacuum_vacuum_cost_limit (default 200), the worker sleeps for autovacuum_vacuum_cost_delay (default 2ms on modern Postgres, was 20ms before PG12). Then it resumes.
The defaults were tuned for spinning disks. On NVMe SSDs they leave throughput on the floor. A vacuum that could finish in 5 minutes takes 90 because the worker keeps napping.
Push the cost limit up aggressively on modern hardware:
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
SELECT pg_reload_conf();
Per-table overrides matter here too. The table that ate the table-space is the one to fix:
ALTER TABLE payments_events
SET (autovacuum_vacuum_cost_limit = 5000,
autovacuum_vacuum_cost_delay = 0);
autovacuum_vacuum_cost_delay = 0 removes the throttle entirely for that table. Use it when you know the table is on dedicated storage or when you're behind on bloat and need to catch up. Don't ship 0 to every table at once. You'll create an I/O spike that competes with foreground traffic.
Setting 4: per-table overrides for the hot 10
Global tuning gives you a sane default. Per-table overrides win the war. Most databases have a power-law distribution: ten tables generate 90% of the bloat, the other thousand generate noise. Find those ten:
SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup,
round(n_dead_tup::numeric / nullif(n_live_tup, 0) * 100, 2) AS dead_pct,
last_autovacuum,
autovacuum_count
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;
Anything with dead_pct > 20 and a stale last_autovacuum is the work list. For each, apply the full kit:
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 5000,
autovacuum_analyze_scale_factor = 0.02,
autovacuum_vacuum_cost_limit = 5000,
autovacuum_vacuum_cost_delay = 0
);
The discipline that pays: keep a small SQL file in your infra repo with every per-table autovacuum override, applied on every environment. Without that, someone restores a snapshot, drops the table, recreates it from a migration, and your tuning silently vanishes.
Diagnosing bloat with pgstattuple
pg_stat_user_tables tells you what autovacuum thinks. It can be wrong. The truth lives in pgstattuple, a contrib extension that opens every page and counts.
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- exact bloat measurement (slow, full scan)
SELECT * FROM pgstattuple('payments_events');
-- approximate, much faster on large tables
SELECT * FROM pgstattuple_approx('payments_events');
A row from pgstattuple_approx looks like:
table_len | 87530250240
scanned_percent | 6.42
approx_tuple_count | 142_837_201
approx_tuple_len | 51_421_392_360
approx_tuple_percent| 58.74
dead_tuple_count | 47_293_104
dead_tuple_len | 17_025_517_440
dead_tuple_percent | 19.45
approx_free_percent | 21.81
Anything over 20% dead or 25% free space is bloated. A healthy table sits under 5% dead, under 10% free. A 30-line bloat report you can wire into a weekly Slack alert:
SELECT
c.relname,
pg_size_pretty(pg_relation_size(c.oid)) AS size,
s.dead_tuple_percent,
s.approx_free_percent
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN LATERAL pgstattuple_approx(c.oid) s ON true
WHERE c.relkind = 'r'
AND n.nspname NOT IN ('pg_catalog', 'information_schema')
AND pg_relation_size(c.oid) > 100 * 1024 * 1024 -- 100MB+
ORDER BY s.dead_tuple_percent DESC
LIMIT 20;
Run it weekly. The trend matters more than any single snapshot. A table climbing from 8% to 14% to 22% dead over three weeks tells you autovacuum is losing to write traffic, even if it's not yet on fire.
pgstattuple_approx skips visibility-map-clean pages, which makes it 10-100x faster than full pgstattuple while losing maybe 1-2% accuracy. Use the approx version for everything except investigating a single suspect table.
There's also pgstatindex from the same extension, for measuring index bloat. That's the bloat people forget about until the planner picks a sequential scan because the index is twice the size it should be:
SELECT * FROM pgstatindex('idx_payments_events_user_id');
If avg_leaf_density is under 60% on a busy index, a REINDEX CONCURRENTLY reclaims real space without locking.
When manual VACUUM FULL is the answer
Autovacuum reclaims space within the existing file. It doesn't shrink the file. A table that hit 60% bloat last month and got cleaned up still occupies 60% more disk than it needs to. The empty space inside the pages just gets reused on new INSERTs.
VACUUM FULL rewrites the table from scratch. It returns the disk to the OS. It also takes an ACCESS EXCLUSIVE lock for the entire duration, which means every read and write blocks. On a 100GB table, that's hours of outage.
For a production table that needs shrinking without downtime, use pg_repack or pg_squeeze. Both rebuild the table online using triggers to capture concurrent writes, then swap. They need careful setup but they work, and they're the only way to recover badly bloated production data without a maintenance window.
The cheap version of "do I need this?": if your bloat percentage stabilizes after tuning and the disk usage isn't trending up dangerously, you don't need VACUUM FULL. Reclaimed slots inside pages get reused. Shrinking the file matters only when you're genuinely running out of disk.
The gotcha: partitioned tables
Autovacuum on partitioned tables does not behave the way the parent-table syntax suggests. Each partition is its own physical table, with its own autovacuum settings, its own pg_stat_user_tables row, its own bloat trajectory. Setting autovacuum_vacuum_scale_factor on the parent does nothing because the parent has no data.
-- this does NOTHING for the actual data
ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.01);
You have to set it on every partition:
DO $$
DECLARE
part record;
BEGIN
FOR part IN
SELECT inhrelid::regclass AS partition_name
FROM pg_inherits
WHERE inhparent = 'events'::regclass
LOOP
EXECUTE format(
'ALTER TABLE %s SET (autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 1000)',
part.partition_name
);
END LOOP;
END $$;
And do it again every time a new partition gets created. If you're using pg_partman or a similar tool, add the autovacuum settings to its template table so new partitions inherit them. Otherwise the partition created last Tuesday at midnight has the lazy defaults, and three months later it's the bloated one nobody noticed.
Older or finalized partitions can usually get more aggressive cleanup once writes stop. Set autovacuum_freeze_max_age lower on closed partitions to prevent the wraparound-vacuum surprise that locks them at the worst possible moment.
The other partitioned-table trap: ANALYZE on the parent doesn't update per-partition statistics. You need to ANALYZE partitions individually or rely on autovacuum's per-partition analyze trigger. Bad statistics on a single recent partition can wreck plans for queries that join across the partition set.
What good looks like
After tuning, a healthy production database has:
- Dead tuple percentage under 5% on hot tables (check via
pgstattuple_approx). - Last autovacuum within the last hour on top write tables.
- Per-table overrides documented in version control.
- A weekly bloat report wired into Slack so trend changes surface before they become incidents.
- Partition templates that include autovacuum settings.
The defaults assume your database is small and your hardware is slow. Neither is true in 2026. Spend an afternoon on the four settings above and you can save a quarterly storage bill that was paying for empty pages.
What's the worst bloat ratio you've found in production, and what tipped you off? Drop the numbers in the comments.
If this was useful
The autovacuum dance is one slice of a bigger question: when is Postgres the right store at all, and where do other databases fit? The Database Playbook walks through that decision for the major data shapes (OLTP, time-series, search, vector, analytical) and the chapter on Postgres operations covers vacuum, partitioning, replication, and the failure modes you only learn about at 3am.

Top comments (0)