Understanding and Fixing Table Bloat in PostgreSQL
I recently helped debug a production PostgreSQL instance where queries had gotten progressively slower over weeks, despite no schema changes and roughly the same amount of data. The table sizes on disk told the story: the busiest table was nearly twice the size it should have been. Not because of growth -- because of dead tuples. Table bloat had silently accumulated to the point where every query was scanning through as many dead rows as live ones. Here's what's happening under the hood and how to fix it.
The Problem: MVCC and Dead Tuples
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to let readers and writers operate without blocking each other. When you UPDATE a row, PostgreSQL does not modify it in place -- it marks the old version as dead and writes a new version elsewhere in the table. When you DELETE, the row is flagged dead but not physically removed. These leftover rows are called dead tuples.
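You can watch dead tuples appear on a scratch table. This is an illustrative sketch (the table name is made up; run it in a test database, not production):

```sql
-- Scratch table for demonstration purposes only
CREATE TABLE bloat_demo (id int PRIMARY KEY, status text);
INSERT INTO bloat_demo SELECT g, 'new' FROM generate_series(1, 10000) g;

-- Each UPDATE leaves the old row version behind as a dead tuple
UPDATE bloat_demo SET status = 'shipped';

-- Stats are reported asynchronously; wait a moment, then check the counters
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'bloat_demo';
-- n_dead_tup should be close to 10000: one dead version per updated row
```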
Autovacuum is responsible for reclaiming dead tuples, but when updates outpace vacuuming -- or when long-running transactions prevent cleanup -- dead tuples accumulate faster than they are removed. The table file grows larger than it needs to be. This is table bloat.
The effects compound quickly. Sequential scans read dead rows alongside live ones, wasting I/O bandwidth. Indexes still point to dead tuples, inflating index size. The shared buffer cache fills with pages containing dead data, reducing effective memory for useful data. A table where 40% of the space is dead weight stores the same live data in roughly 1.7x the pages -- and every scan, cache fill, and backup pays that overhead.
High-churn OLTP tables are especially vulnerable. An orders table processing hundreds of status updates per second accumulates tens of thousands of dead tuples per minute. With the default autovacuum settings (vacuum triggers when roughly 20% of rows are dead), a 10-million-row table accumulates about 2 million dead tuples before autovacuum kicks in -- by which point the damage is done.
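The trigger point comes from the formula autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples. Assuming the default values (50 and 0.2) and no per-table overrides, you can compute it for your own tables:

```sql
-- Dead tuples needed before autovacuum triggers, assuming the default
-- settings (threshold = 50, scale_factor = 0.2) and no per-table overrides
SELECT c.relname,
       c.reltuples::bigint AS estimated_rows,
       (50 + 0.2 * c.reltuples)::bigint AS autovacuum_trigger_point,
       s.n_dead_tup
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind = 'r'
ORDER BY c.reltuples DESC
LIMIT 10;
```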
How to Detect It
Dead Tuple Counts
The pg_stat_user_tables view tracks dead tuple counts and vacuum history for every table:
-- Find tables with high dead tuple counts and stale vacuums
SELECT
    schemaname,
    relname AS table_name,
    n_live_tup AS live_tuples,
    n_dead_tup AS dead_tuples,
    CASE WHEN n_live_tup > 0
         THEN round(100.0 * n_dead_tup / n_live_tup, 1)
         ELSE 0
    END AS dead_tuple_percentage,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;
A dead_tuple_percentage above 10% is a warning sign. Above 20% means autovacuum is falling behind and you need to investigate why. Check last_autovacuum -- if it is NULL or hours old on a high-churn table, autovacuum may be misconfigured, starved for workers, or blocked by long-running transactions.
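To check the long-running-transaction case specifically, look at pg_stat_activity for old open transactions -- anything near the top of this list can pin dead tuples and prevent vacuum from removing them:

```sql
-- Open transactions, oldest first
SELECT pid,
       state,
       now() - xact_start AS xact_age,
       left(query, 60) AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```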
Accurate Physical Bloat with pgstattuple
For a more accurate picture of actual disk bloat (as opposed to estimated dead tuple counts), use the pgstattuple extension:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('your_table_name');
This returns dead_tuple_percent and free_space -- the actual physical bloat in the table file. The dead tuple count from pg_stat_user_tables is an estimate; pgstattuple reads the table directly and gives you the real number.
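One caveat: pgstattuple scans the entire relation, which is itself expensive on a big table. The same extension ships pgstattuple_approx, which uses the visibility map to skip all-visible pages and returns approximate numbers much faster:

```sql
-- Exact, but reads the whole table
SELECT dead_tuple_percent, free_space
FROM pgstattuple('your_table_name');

-- Approximate, but fast: skips pages the visibility map marks all-visible
SELECT dead_tuple_percent, approx_free_space
FROM pgstattuple_approx('your_table_name');
```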
Why Trends Matter More Than Snapshots
The limitation of both approaches is that they show a point-in-time snapshot. You cannot tell from a single query whether bloat is trending up (getting worse), flat (stable), or regularly spiking and recovering (healthy autovacuum cycling). For effective bloat management, you need to track dead tuple counts and table sizes over time.
Set up periodic snapshots of pg_stat_user_tables and pg_relation_size() so you can plot trends. A rising dead tuple trend line between vacuum runs means autovacuum is falling behind. A flat or cycling pattern means things are healthy.
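A minimal sketch of such a snapshot setup -- the table and column names here are illustrative, and the INSERT would be scheduled from cron or pg_cron:

```sql
-- One-time setup: a table to hold periodic snapshots
CREATE TABLE IF NOT EXISTS bloat_snapshots (
    captured_at timestamptz NOT NULL DEFAULT now(),
    relname     name        NOT NULL,
    n_live_tup  bigint,
    n_dead_tup  bigint,
    table_bytes bigint
);

-- Scheduled job: capture a snapshot of every user table
INSERT INTO bloat_snapshots (relname, n_live_tup, n_dead_tup, table_bytes)
SELECT relname, n_live_tup, n_dead_tup, pg_relation_size(relid)
FROM pg_stat_user_tables;
```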
Visualizing table sizes proportionally makes it immediately obvious which tables dominate storage and where bloat is wasting the most space. Combine size tracking with dead tuple trends to catch bloat before it compounds into a performance problem.
How to Fix It
Standard VACUUM: Reclaim Space for Reuse
For moderate bloat (under 50% dead tuples), a standard VACUUM reclaims dead tuples within the existing table structure:
-- Reclaim dead tuples (does not lock the table for reads)
VACUUM VERBOSE sim_events;
Standard VACUUM marks space as reusable but does not return it to the operating system -- the table file stays the same size on disk, but new inserts reuse the freed space rather than extending the file. This is usually fine for tables with ongoing write activity.
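You can see this with pgstattuple: after a plain VACUUM the file size is unchanged, but the dead tuple space has been converted into reusable free space:

```sql
VACUUM sim_events;

SELECT pg_size_pretty(pg_relation_size('sim_events')) AS file_size,  -- unchanged
       dead_tuple_percent,  -- drops toward 0
       free_percent         -- rises: reclaimed space, available for reuse
FROM pgstattuple('sim_events');
```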
VACUUM FULL: Physically Shrink the Table
For severe bloat where you need to reclaim disk space:
-- VACUUM FULL rewrites the entire table -- blocks ALL reads and writes
-- Only run during maintenance windows
VACUUM FULL sim_events;
-- Verify the size decreased
SELECT pg_size_pretty(pg_relation_size('sim_events')) AS table_size;
VACUUM FULL takes an ACCESS EXCLUSIVE lock, blocking every query against the table for the duration. On a 10GB table, this can take minutes to hours. Only use it during maintenance windows.
pg_repack: Online Table Rewrite
For large production tables where downtime is unacceptable, pg_repack performs an online rewrite, holding only brief locks at the start and end:
-- Install the extension in the target database (once)
CREATE EXTENSION pg_repack;

# pg_repack itself is a command-line client; run it from the shell:
pg_repack -d mydb -t sim_events
pg_repack creates a new copy of the table, replays changes that happened during the copy, then swaps the two under a brief exclusive lock. It needs roughly twice the table's size in free disk space during the operation.
How to Prevent It
Per-Table Autovacuum Tuning
Tune autovacuum per table based on update frequency. The default autovacuum_vacuum_scale_factor of 0.2 (20%) lets far too many dead tuples accumulate on large, high-churn tables:
-- Make autovacuum trigger 10x earlier on high-churn tables
ALTER TABLE sim_events SET (
    autovacuum_vacuum_scale_factor = 0.02,  -- 2% instead of 20%
    autovacuum_vacuum_threshold = 500
);
With this setting, autovacuum triggers after roughly 2% of rows are dead rather than 20%. On a table with 100,000 rows, vacuum starts at about 2,500 dead tuples (500 + 2% of 100,000) instead of waiting for over 20,000. The table stays compact and queries stay fast.
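Per-table settings are stored in pg_class.reloptions, so you can review which tables already carry overrides:

```sql
-- List tables with per-table storage or autovacuum overrides
SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL
  AND relkind = 'r'
ORDER BY relname;
```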
Kill Long-Running Transactions
A single forgotten BEGIN in a psql session holds back the vacuum horizon, preventing cleanup of any row versions that died after that transaction's snapshot was taken. Set a timeout to kill idle-in-transaction sessions proactively:
-- Set globally in postgresql.conf or per-database
ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '5min';
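The timeout handles future sessions. To find and clear any that are already stuck, query pg_stat_activity and terminate by pid (the pid below is a placeholder; use a real one from the first query's output):

```sql
-- Find sessions sitting idle inside an open transaction
SELECT pid, usename, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change;

-- Terminate one by pid (placeholder value shown)
SELECT pg_terminate_backend(12345);
```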
Monitor Dead Tuple Trends Continuously
A rising dead tuple trend line means autovacuum is falling behind. You need to either make vacuum more aggressive (lower scale factor, higher cost limit) or add more autovacuum workers. Track n_dead_tup from pg_stat_user_tables over time for your busiest tables, and set alerts when dead tuple ratios exceed your thresholds.
The goal is catching bloat in the first week, not discovering it months later when the only fix is a VACUUM FULL during a maintenance window.
Originally published at mydba.dev/blog/table-bloat-dead-tuples