Pranay Ravi

Posted on May 24

PostgreSQL VACUUM Tuning: A Technical Deep Dive Into Autovacuum Configuration

#aws #database #performance #postgres

Author's Note: This article documents a production incident investigation and the technical findings that emerged from returning to foundational documentation. The fix was implemented by a colleague; this article captures the learning journey through proper documentation review.

The Incident: High CPU Due to Autovacuum Contention

A production Aurora PostgreSQL cluster experienced sustained CPU utilization between 85-90% over a 3-4 hour window. CloudWatch Performance Insights identified the primary wait event as CPU (not I/O or lock contention), and the top consuming operation was autovacuum VACUUM processes running on two large tables.

Observed State:

Table A (593 GB, 623M rows): 124 million dead tuples (16.6% dead ratio)
Table B (465M rows): 74 million dead tuples (13.7% dead ratio)
Autovacuum workers: 2 running concurrently
CPU utilization: 85-90%
Autovacuum frequency: Single massive vacuum operation per 4-6 hours
Status of tables: Never manually vacuumed since instance creation

Root Cause: Autovacuum thresholds were configured at system defaults, which were inappropriate for high-churn tables receiving bulk updates every 2-3 hours via scheduled data loader processes.

Understanding MVCC: PostgreSQL vs. Oracle's Undo Architecture

Before tuning can be effective, the fundamental difference in how PostgreSQL and Oracle manage concurrent access must be understood.

Oracle's Approach: Automatic Undo Retention Management (AUM)

In Oracle, Multi-Version Concurrency Control (MVCC) is implemented via undo tablespace segments:

Update operation: When a row is updated, the old version is written to undo tablespace (not to the table itself).
Undo retention: Oracle's Automatic Undo Retention Management (AUM) manages undo tablespace as a circular buffer. Undo extents are recycled automatically based on the UNDO_RETENTION parameter and available tablespace.
Space reclamation: Undo space is automatically freed when either (a) the retention period expires, or (b) tablespace pressure forces rollback of older undo data. The DBA's responsibility is limited to allocating sufficient undo tablespace upfront.

Key characteristic: The DBA tunes this mechanism once (setting retention period and tablespace size) and then relies on Oracle's background processes to manage undo lifecycle automatically.

[IMAGE 2: Oracle vs PostgreSQL MVCC Architecture Diagram]

PostgreSQL's Approach: Heap-Based MVCC with Explicit VACUUM

In PostgreSQL, MVCC is implemented at the table (heap) level:

Update operation: When a row is updated, a new version of the row is inserted into the same table. The old version is marked as "dead" but remains physically in the table.
Space reclamation: VACUUM must scan the table, identify dead tuples, and mark their space as reusable. Dead tuples are not automatically removed.
Autovacuum trigger: Autovacuum is a background process that decides when to vacuum based on tunable thresholds. Unlike Oracle's automatic undo recycling, PostgreSQL requires explicit configuration of when vacuum should trigger.

Key characteristic: The DBA must actively tune VACUUM parameters based on table churn patterns. There is no "set it and forget it" mechanism comparable to Oracle's AUM.

Implication: On Oracle, a table with high UPDATE volume simply generates more undo, which Oracle's AUM handles. On PostgreSQL, the same UPDATE volume generates more dead tuples, and if autovacuum thresholds are too conservative, dead tuples accumulate until autovacuum finally triggers—often at high volume, causing CPU spikes.

The Vacuum Threshold Formula: The Mathematical Foundation

When autovacuum decides to run, it evaluates the following formula for each table:

VACUUM_TRIGGER_THRESHOLD = autovacuum_vacuum_threshold + 
                           (autovacuum_vacuum_scale_factor × n_live_tup)

Where:

autovacuum_vacuum_threshold: Absolute minimum dead tuples (default: 50)
autovacuum_vacuum_scale_factor: Percentage of table size (default: 0.1 = 10%)
n_live_tup: Current number of live tuples in the table

Pre-Investigation Configuration

System defaults on the cluster:

autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.1

Calculation for Table A (725M rows):

THRESHOLD = 50 + (0.1 × 725,000,000)
          = 50 + 72,500,000
          = 72.5 million dead tuples

Interpretation: Autovacuum would not trigger on Table A until 72.5 million dead tuples accumulated. This is the critical misconfiguration.

For comparison, Table A actually accumulated 124 million dead tuples before the vacuum completed—well beyond this threshold, indicating autovacuum had already triggered much earlier in the lifecycle but was running continuously against an accumulating workload.

The Root Cause: Loader Pattern and Threshold Mismatch

The data loader process (running every 2-3 hours with 4 parallel workers) updated rows using a COALESCE merge pattern:

UPDATE table_a t SET 
  column_1 = COALESCE(s.column_1, t.column_1),
  column_2 = COALESCE(s.column_2, t.column_2),
  column_3 = COALESCE(s.column_3, t.column_3)
FROM staging_table s
WHERE t.id = s.id

This pattern creates a dead tuple for every row touched, regardless of whether values actually changed. Over a 2-3 hour window with millions of rows, dead tuple generation rate significantly exceeded autovacuum's ability to reclaim space given the high threshold values.

The collision:

High dead tuple generation rate from bulk UPDATE operations
Autovacuum thresholds calibrated for the default use case (moderate churn on tables of typical size)
No table-level overrides to account for this specific workload pattern

Result: Dead tuples accumulated to 16.6% of table size before stabilizing.

Investigation: Diagnostic Queries and Findings

Three queries provided diagnostic clarity:

Query 1: Current dead tuple status

SELECT 
  relname, 
  n_live_tup,
  n_dead_tup,
  ROUND(100.0 * n_dead_tup / (n_live_tup + n_dead_tup), 2) AS dead_ratio_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('table_a', 'table_b')
ORDER BY n_dead_tup DESC;

Result:

Table A: 623M live, 124M dead (16.6%)
Table B: 465M live, 74M dead (13.7%)

Query 2: Active autovacuum processes

SELECT 
  pid, 
  query, 
  query_start,
  EXTRACT(EPOCH FROM (NOW() - query_start))::INT AS runtime_seconds
FROM pg_stat_activity
WHERE query LIKE '%VACUUM%' 
  AND query NOT LIKE '%pg_stat%';

Result: Two autovacuum workers running concurrently, one on each table, both active for 55+ minutes and 3+ minutes respectively at time of investigation.

Query 3: Cumulative churn analysis

SELECT 
  relname,
  n_live_tup,
  n_dead_tup,
  n_tup_upd + n_tup_del AS total_modifications,
  ROUND(100.0 * (n_tup_upd + n_tup_del) / n_live_tup, 1) AS churn_ratio_pct
FROM pg_stat_user_tables
WHERE relname IN ('table_a', 'table_b');

Result:

Table A: 16 billion total modifications on 725M rows (2203% cumulative churn)
Table B: 11.9 billion total modifications on 465M rows (2566% cumulative churn)

Interpretation: These represent lifetime cumulative statistics since instance creation. The 2200%+ ratio indicates every row has been touched approximately 22 times on average over the instance lifetime.

The Configuration Adjustment: Formula-Based Approach

Rather than implementing ad-hoc changes, a formula-based approach was used to calculate appropriate thresholds.

The colleague conducting the fix referenced textbook formulas for maintenance memory allocation and threshold calculation, confirming that the system defaults were inappropriate for tables with this churn profile.

Applied changes (table-level only):

ALTER TABLE table_a SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 10000,
  autovacuum_analyze_scale_factor = 0.005,
  autovacuum_analyze_threshold = 5000,
  autovacuum_vacuum_cost_delay = 2,
  autovacuum_vacuum_cost_limit = 5000
);

ALTER TABLE table_b SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 10000,
  autovacuum_analyze_scale_factor = 0.005,
  autovacuum_analyze_threshold = 5000,
  autovacuum_vacuum_cost_delay = 2,
  autovacuum_vacuum_cost_limit = 5000
);

Rationale for each parameter:

Scale factor (0.1 → 0.01): Lowers the percentage-based trigger from 10% to 1% of table size, causing autovacuum to start at 7.25 million dead tuples instead of 72.5 million. This increases vacuum frequency but reduces per-operation work volume.
Threshold (50 → 10000): Sets an explicit minimum to prevent excessive vacuum triggering on very small tables, but still allows reasonable triggering on large tables.
Analyze scale factor (0.05 → 0.005): ANALYZE (which updates table statistics for the query planner) triggers more frequently, preventing stale statistics during high-churn periods.
Cost parameters: cost_delay = 2ms and cost_limit = 5000 distribute vacuum work into smaller increments with longer pauses, reducing per-operation CPU spike while still completing the work.

New threshold calculation for Table A:

THRESHOLD = 10,000 + (0.01 × 725,000,000)
          = 10,000 + 7,250,000
          = 7.26 million dead tuples

This represents a 10× reduction in the threshold value, causing autovacuum to trigger 10× more frequently but with proportionally smaller work loads.

Critical Design Decision: Table-Level Configuration, Not Cluster-Level

A deliberate choice was made to apply all tuning parameters at the table level only, not at the system/cluster level.

Why table-level configuration is necessary:

Heterogeneous workloads: Not all tables in a cluster have the same churn pattern. Table A and Table B are high-churn bulk-update targets. Other tables in the same cluster may be mostly static or read-heavy.
Preventing cascade effects: A cluster-wide reduction in autovacuum_vacuum_scale_factor would cause autovacuum to trigger more frequently on all tables, including those with minimal churn. This could result in:
- Unnecessary vacuum operations consuming CPU and I/O
- More frequent ANALYZE operations on stable tables
- Increased lock contention if many vacuums run concurrently
Isolation of risk: By tuning only the affected tables, the change is isolated to the problem source and does not introduce unexpected side effects on other database objects or applications.

This is a deliberate engineering discipline: tune at the smallest scope that solves the problem.

Results: Before and After Metrics

Before Configuration (April 13, 09:51 AM):

Metric	Table A	Table B
Dead tuples	124M	74M
Dead ratio	16.6%	13.7%
Last autovacuum	N/A (first vacuum)	N/A (first vacuum)
CPU utilization	85-90%	85-90%
Vacuum frequency	1 per 4-6 hours	1 per 4-6 hours

After Configuration (April 13, 12:07 PM - 13:45 PM):

Metric	Table A	Table B
Dead tuples	0 (vacuum completed)	21M (declining)
Dead ratio	0%	7.8%
Last autovacuum	2026-04-13 12:07:08	2026-04-13 13:45:44
CPU utilization	30-40%	30-40%
Vacuum frequency	Multiple per loader cycle	Multiple per loader cycle

The vacuum operations completed naturally without manual intervention. Dead tuple levels stabilized well below the new thresholds. CPU alerts cleared.

Monitoring: Proactive Visibility

To prevent recurrence, a monitoring table was implemented to capture dead tuple trends:

CREATE TABLE pg_table_stats_history (
  captured_at TIMESTAMP DEFAULT NOW(),
  table_name TEXT,
  n_live_tup BIGINT,
  n_dead_tup BIGINT,
  dead_ratio_percent NUMERIC,
  n_tup_upd BIGINT,
  n_tup_del BIGINT,
  last_autovacuum TIMESTAMP
);

SELECT cron.schedule('capture_table_stats', '*/30 * * * *', $$
  INSERT INTO pg_table_stats_history
  SELECT 
    NOW(), relname, n_live_tup, n_dead_tup,
    ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2),
    n_tup_upd, n_tup_del, last_autovacuum
  FROM pg_stat_user_tables
  WHERE relname IN ('table_a', 'table_b');
$$);

This captures a snapshot every 30 minutes, allowing observation of daily patterns: "During loader window (12:00-14:00), dead tuples climb to 5-8M, then autovacuum brings them back to 0.5M."

Key Concepts: What the Investigation Revealed

1. The Autovacuum Cost Mechanism

VACUUM has two cost-control parameters:

autovacuum_vacuum_cost_limit: Units of work (reading a page = 1 unit, writing a page = 20 units) allowed before autovacuum pauses
autovacuum_vacuum_cost_delay: Milliseconds to pause when cost limit is reached

Lower cost_limit + higher delay = slower vacuum, less CPU spike

Higher cost_limit + lower delay = faster vacuum, more CPU spike

The pre-incident configuration had cost_delay = 5ms and cost_limit = 1800 (already aggressive). The post-incident configuration used cost_delay = 2ms and cost_limit = 5000, allocating higher work budgets but still enforcing frequent pauses for distribution.

When autovacuum triggers, it scans the entire table. If dead tuples are being created faster than they're being reclaimed (because the loader is still running), the vacuum operation takes longer and consumes sustained CPU.

In the incident:

Table A vacuum started at 09:51:54 and completed at 12:07:08 (2h 15m)
Table B vacuum started at 08:59:39 and completed at 13:45:44 (4h 46m)
During this window, the loader was also running (starting 12:37:27), creating additional dead tuples

The concurrent activity created contention for CPU and I/O resources, explaining the 85-90% CPU utilization.

3. Wraparound Protection: A Critical Background Mechanism

While not the active problem in this incident, understanding wraparound protection is essential context:

PostgreSQL uses 4-byte transaction IDs (2^32 = 4.3 billion possible values). For MVCC visibility comparison to work correctly, only a 2-billion value range is usable at any given time. If autovacuum falls so far behind that unfrozen tuples approach the 2-billion transaction boundary, PostgreSQL will:

Log warnings at 200M transactions remaining
Shift to READ-ONLY mode at 1M transactions remaining
Force shutdown if the limit is reached

This is not a theoretical concern—it's a safety mechanism that has forced database shutdowns in under-monitored systems. Autovacuum must keep up, or the database fails.

The Documentation Recovery: Why AI Alone Is Insufficient

When the incident was first encountered, AI-based diagnostic tools provided generic suggestions: "lower cost_delay," "increase cost_limit," "check maintenance_work_mem."

These suggestions were not wrong, but they were not calibrated to the specific situation. The threshold formula, the rationale for scale factor adjustment, and the risks of cluster-level configuration changes were not apparent from AI output alone.

The fix required returning to foundational documentation:

PostgreSQL VACUUM documentation (official): Explained the cost mechanism and threshold formula
AWS Aurora PostgreSQL tuning guide: Provided context-specific guidance for managed Aurora instances
Textbook references on MVCC: Clarified why dead tuples accumulate and how autovacuum prevents wraparound

The colleague's insistence on reading the documentation forced a deeper investigation that revealed:

The mathematical formula driving autovacuum trigger decisions
The specific interaction between bulk update workloads and default thresholds
The risks and benefits of table-level vs. cluster-level configuration

The lesson: Base concepts—MVCC, dead tuples, autovacuum thresholds, wraparound protection—are non-negotiable foundations. Understanding these requires reading documentation. Once the foundations are understood, AI tools can accelerate diagnosis and suggest configurations. But without foundations, suggestions are just knobs to turn without understanding consequences.

Combining both—foundational reading with AI-assisted diagnosis—yields better outcomes than either alone.

Diagnostic Queries for Future Incidents

When autovacuum CPU spikes occur, these queries provide immediate visibility:

Dead tuple status:

SELECT 
  relname, n_live_tup, n_dead_tup,
  ROUND(100.0 * n_dead_tup / (n_live_tup + n_dead_tup), 2) AS dead_pct,
  last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Active autovacuum activity:

SELECT pid, query, query_start, 
  EXTRACT(EPOCH FROM (NOW() - query_start))::INT AS runtime_sec
FROM pg_stat_activity
WHERE query LIKE '%VACUUM%';

Churn rate analysis:

SELECT 
  relname, n_live_tup, 
  n_tup_upd + n_tup_del AS total_mods,
  ROUND(100.0 * (n_tup_upd + n_tup_del) / n_live_tup, 1) AS churn_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Transaction age (wraparound risk):

SELECT 
  datname,
  age(datfrozenxid) AS txid_age,
  (SELECT setting::int FROM pg_settings 
   WHERE name = 'autovacuum_freeze_max_age') - age(datfrozenxid) 
   AS txid_remaining
FROM pg_database
WHERE datallowconn
ORDER BY txid_age DESC;

Conclusion

This incident demonstrated that PostgreSQL's VACUUM mechanism requires active tuning based on workload patterns. Unlike Oracle's Automatic Undo Retention Management—which automatically recycles undo tablespace based on retention policies—PostgreSQL requires explicit configuration of autovacuum thresholds calibrated to specific table churn patterns.

The resolution came not from tweaking random parameters, but from understanding the mathematical formulas governing autovacuum behavior and applying them methodically at table scope.

The broader lesson is methodological: when facing production database performance issues, foundational understanding of mechanisms (documented in official sources) combined with diagnostic data yields better outcomes than parameter suggestions alone.

References

PostgreSQL Official: VACUUM
PostgreSQL Official: Autovacuum
AWS Database Blog: Understanding autovacuum in Amazon RDS for PostgreSQL
Percona: Overcoming VACUUM WRAPAROUND
CYBERTEC: Autovacuum wraparound protection

DEV Community