Pranay Ravi

Posted on May 23

From Zombie Connections to XactSync: A Post-Mortem on Aurora Postgres CPU Spikes

#aws #postgres #database #devops

The alert fires at 1:28 PM. By the time you open CloudWatch, CPU is already back to normal. No application errors. No user complaints. Nothing obviously broken.

Do you close the ticket and move on, or dig in?

This post is about why you should always dig in — and exactly how we traced a "self-resolving" CPU spike on Aurora PostgreSQL 16 all the way back to a mismatch between pipeline commit logic and the database storage engine.

The Environment

Engine: Aurora PostgreSQL 16.1 on db.x2g.xlarge (4 vCPUs, 128GB RAM)
Workload: Kafka stream-based pipeline inserting ~365 rows/sec average into a daily-partitioned table, with burst peaks near 825/sec
Replication: Oracle GoldenGate
Observability: AWS CloudWatch Database Insights + AppDynamics Database Agent

What the Dashboard Showed

The first stop is always CloudWatch $\rightarrow$ Database Insights $\rightarrow$ Database Load. This chart shows Average Active Sessions (AAS) broken down by wait type:

Color	Wait Type	What It Means
🟢 Green	`CPU`	Active computation
🟠 Orange	`IO:AuroraStorageLogAllocate`	WAL/storage log writes
🟣 Purple	`LWLock:BufferContent`	Buffer manager contention
🟤 Brown	`Timeout:SpinDelay`	Spinlock spin waits
🩷 Pink	`LWLock:WALInsert`	WAL insert lock contention

Our spike was pure green—CPU only, no I/O or lock contention. The Top SQL tab showed a high-volume INSERT into a partitioned event log table running at ~825 calls/sec with 0.17ms average latency. Fast and normal individually. The volume was the question.

However, the Database Telemetry tab revealed two ominous signals that often catch a developer's attention:

Max time idle in transaction: Climbing linearly for 3.2 hours straight.
Vacuum: Max used transaction IDs: On a steady linear climb.

When you see both of those linear trends on a single chart, it usually points to a common suspect in relational databases: something has been holding an open transaction for a very long time, threatening your autovacuum health.

Background: The Core Mechanics of Postgres MVCC

To understand why long-running transactions are so closely tracked, it helps to look at how PostgreSQL manages Multi-Version Concurrency Control (MVCC).

Unlike database engines that keep old record data in a separate utility tablespace (like Oracle's Undo Tablespace), PostgreSQL relies on a Heap-Based Approach.

The Postgres Heap Reality: There is no separate undo space. When a row is updated or deleted, the old version (called a "dead tuple") remains physically stored inside the exact same table page. It stays there until the background VACUUM process scans the table and marks that dead space as reusable for future inserts.

If an application leaves a transaction open, it anchors the xmin (the baseline transaction ID floor for that snapshot). VACUUM cannot safely clean up any dead tuples that were generated after the oldest active transaction's `xmin`. This can trigger a problematic domino effect:

Session opens → acquires xmin snapshot
       ↓
Session goes idle (no explicit commits or rollbacks)
       ↓
Dead tuples accumulate → VACUUM cannot clear them out
       ↓
Autovacuum loops continuously, wasting CPU scanning bloated tables
       ↓
Transient CPU spikes alert operations

Because autovacuum periodically wakes up to evaluate tables, the resulting CPU spikes look completely intermittent. The alert clears, engineers close the ticket as a glitch, and the underlying architectural friction persists.

Step 1: Find the Stuck Sessions

To hunt down the culprit, we ran a quick diagnostic query against pg_stat_activity:

SELECT
  pid,
  usename,
  application_name,
  client_addr,
  state,
  query,
  now() - state_change AS idle_duration,
  now() - xact_start   AS transaction_age
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;

The results: 35 sessions, all originating from the same connection IP, belonging to the replication user. The oldest transaction had been open for 133 days.

The queries they were idling on:

SELECT count(*) from pg_settings where name='ansi_force_foreign_key_checks';
SELECT count(*) from pg_type where typtype='e' and oid in (...);

These are classic compatibility probe queries issued during connection initialization by integration tooling like GoldenGate. The integration platform had been connecting, executing its basic discovery probes, and abandoning the sessions without explicitly closing them every time it restarted or reconnected over a four-month period.

Step 2: Verify the Vacuum Blocker (The Twist)

Finding idle in transaction sessions is a major red flag, but we had to verify if they were genuinely blocking data cleanup. Not all idle sessions hold an active database snapshot.

We executed the source-of-truth query to find actual vacuum blockers:

SELECT
  pid,
  usename,
  backend_xmin,
  age(backend_xmin) AS xmin_age_in_transactions,
  now() - xact_start AS session_age,
  state
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

⚠️ This is where our initial MVCC-bloat hypothesis collapsed.

The zombie integration sessions did not appear in this list; their backend_xmin fields were entirely NULL. Because they had only executed basic, read-only catalog queries during initialization, they never acquired a persistent data snapshot transaction ID. They were not blocking vacuum. The actual oldest xmin holders were normal application sessions with transaction ages under 3,000—completely standard for a busy transaction engine.

While these zombie connections were a clear operational hazard (consuming connection slots and wasting system memory), they weren't driving our CPU alert.

Step 3: Check the Partitions

With zombie connections ruled out as vacuum blockers, we turned our focus to the physical table health. Because our target event log table is a partitioned table, statistics are tracked natively at the individual child partition level, not the parent logical table. We targeted the table prefix:

SELECT schemaname, relname, n_live_tup, n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE 'event_log%'
ORDER BY n_dead_tup DESC;

Result: 0 dead tuples across all active partitions.

The partitioned strategy was working flawlessly. The application used an insert-only pattern on the active daily partition, and old partitions were dropped entirely rather than vacuumed. This meant VACUUM bloat was completely ruled out.

Step 4: Finding the Real Culprit via Wait States

With MVCC bloat off the table, we pivoted to the APM wait state drill-down for the intensive INSERT query during the exact minutes of the CPU spike:

ClientRead (54%): The database engine spent more than half its time completely idle, waiting for the application client to send the next query over the network wire.
Active (24.7%): Time spent physically executing the raw insert statements.
XactSync (17.9%): The engine waiting for the Write-Ahead Log (WAL) to be flushed and synchronized to durable storage upon a COMMIT.
AuroraStorageLogAllocate (2.4%): Storage block layer allocation.

Two metrics tell the final story here:

XactSync at 17.9%: This wait state occurs when an explicit COMMIT forces PostgreSQL to flush WAL records. At a peak volume of 825 inserts per second, the application was issuing 825 individual commits per second. This meant 825 distinct sync operations forcing the storage layer to acknowledge writes sequentially.
ClientRead at 54%: This is the classic signature of single-row, one-at-a-time application database calls. The application loop was executing: Send row $\rightarrow$ Execute $\rightarrow$ Commit $\rightarrow$ Wait for network Ack $\rightarrow$ Send next row.

The shape of the CPU spike closely mirrored a backlog flush pattern. Messages would pile up briefly in the upstream Kafka topic, and then the consumer would drain them all at once in a massive burst of single-row, high-frequency synchronous commits.

Root Cause Summary

Hypothesis	Verdict	Technical Proof
Dead Tuple Accumulation	❌ Ruled Out	`pg_stat_user_tables` showed a 0% dead tuple ratio across all database partitions.
Zombie Connections Blocking Vacuum	❌ Ruled Out	`backend_xmin` was explicitly `NULL` for all zombie replication sessions.
Burst Insert Flush + Per-Row Commits	Root Cause	High `XactSync` (17.9%) paired with a massive `ClientRead` (54%) network wait pattern.

The Action Plan

Fix 1: Terminate and Block the Zombie Connections

Even though they didn't block vacuum in this specific instance, leaving 35 abandoned sessions active for 133 days is risky. They leak connection resources and backend memory.

First, we safely terminated the dangling backends:

SELECT pg_terminate_backend(pid), pid, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE usename = 'repl_agent' 
  AND state = 'idle in transaction';

Next, we implemented an automated defense line. We updated the cluster configuration parameters to enforce a strict timeout globally:

idle_in_transaction_session_timeout = 600000 # 10 minutes in milliseconds

If any session sits idle inside an uncommitted transaction block for more than 10 minutes, Postgres will now automatically drop the connection and free up system resources.

Fix 2: Implement Micro-Batching in the Consumer Pipeline

The true fix for the CPU spike required changing how the application interacted with the storage engine. Instead of committing every single row individually, we modified the JDBC producer configuration to use a batch size of 500.

-- Old Pattern (825 WAL Syncs/sec):
INSERT row 1 → COMMIT → WAL fsync → Wait
INSERT row 2 → COMMIT → WAL fsync → Wait

-- New Pattern (1.7 WAL Syncs/sec):
INSERT row 1, INSERT row 2 ... INSERT row 500 → COMMIT → 1 WAL fsync

Metric	Before Batching	After Batching (Est.)
Commits/sec (Burst)	~825	~1.7
WAL Fsyncs/sec	~825	~1.7
`XactSync` Wait Time	17.9%	< 1.0%
Transient CPU Spikes	Frequent	Eliminated

Essential Checklist Queries for Triage

Add these queries to your incident response runbook for analyzing database load:

1. Identify Long-Running Snapshot Blockers

SELECT pid, usename, backend_xmin,
  age(backend_xmin) AS xmin_age_transactions,
  now() - xact_start AS session_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC 
LIMIT 5;

2. Spot Dangerous `Idle in Transaction` Sessions

SELECT pid, usename, client_addr,
  now() - xact_start AS transaction_age,
  now() - state_change AS idle_duration,
  LEFT(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start ASC;

3. Check Table Churn & Dead Tuple Ratios (Including Partitions)

SELECT schemaname, relname, n_live_tup, n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE 'your_table_prefix%'  -- Handles partitioned child tables
ORDER BY n_dead_tup DESC;

💡 Tip: A dead_ratio_pct > 10% indicates a strong need for space recovery; > 20% means the table is critically bloated and query execution performance will likely degrade.

What's Next?

If you found this breakdown helpful, keep an eye out for our next post. We will be diving into the opposite side of database optimization: When Autovacuum goes wrong. We will walk through a separate production incident where a high-churn table choked on over 124 million dead tuples (a brutal 16.6% dead ratio) due to misconfigured autovacuum cost thresholds. Stay tuned!

DEV Community

From Zombie Connections to XactSync: A Post-Mortem on Aurora Postgres CPU Spikes

The Environment

What the Dashboard Showed

Background: The Core Mechanics of Postgres MVCC

Step 1: Find the Stuck Sessions

Step 2: Verify the Vacuum Blocker (The Twist)

Step 3: Check the Partitions

Step 4: Finding the Real Culprit via Wait States

Root Cause Summary

The Action Plan

Fix 1: Terminate and Block the Zombie Connections

Fix 2: Implement Micro-Batching in the Consumer Pipeline

Essential Checklist Queries for Triage

1. Identify Long-Running Snapshot Blockers

2. Spot Dangerous `Idle in Transaction` Sessions

3. Check Table Churn & Dead Tuple Ratios (Including Partitions)

What's Next?

Top comments (0)

The Environment

What the Dashboard Showed

Background: The Core Mechanics of Postgres MVCC

Step 1: Find the Stuck Sessions

Step 2: Verify the Vacuum Blocker (The Twist)

Step 3: Check the Partitions

Step 4: Finding the Real Culprit via Wait States

Root Cause Summary

The Action Plan

Fix 1: Terminate and Block the Zombie Connections

Fix 2: Implement Micro-Batching in the Consumer Pipeline

Essential Checklist Queries for Triage

1. Identify Long-Running Snapshot Blockers

2. Spot Dangerous Idle in Transaction Sessions

3. Check Table Churn & Dead Tuple Ratios (Including Partitions)

What's Next?

2. Spot Dangerous `Idle in Transaction` Sessions