DEV Community

Cover image for From Zombie Connections to XactSync: A Post-Mortem on Aurora Postgres CPU Spikes
Pranay Ravi
Pranay Ravi

Posted on

From Zombie Connections to XactSync: A Post-Mortem on Aurora Postgres CPU Spikes

The alert fires at 1:28 PM. By the time you open CloudWatch, CPU is already back to normal. No application errors. No user complaints. Nothing obviously broken.

Do you close the ticket and move on, or dig in?

This post is about why you should always dig in β€” and exactly how we traced a "self-resolving" CPU spike on Aurora PostgreSQL 16 all the way back to a mismatch between pipeline commit logic and the database storage engine.


The Environment

  • Engine: Aurora PostgreSQL 16.1 on db.x2g.xlarge (4 vCPUs, 128GB RAM)
  • Workload: Kafka stream-based pipeline inserting ~365 rows/sec average into a daily-partitioned table, with burst peaks near 825/sec
  • Replication: Oracle GoldenGate
  • Observability: AWS CloudWatch Database Insights + AppDynamics Database Agent

What the Dashboard Showed

The first stop is always CloudWatch $\rightarrow$ Database Insights $\rightarrow$ Database Load. This chart shows Average Active Sessions (AAS) broken down by wait type:

Color Wait Type What It Means
🟒 Green CPU Active computation
🟠 Orange IO:AuroraStorageLogAllocate WAL/storage log writes
🟣 Purple LWLock:BufferContent Buffer manager contention
🟀 Brown Timeout:SpinDelay Spinlock spin waits
🩷 Pink LWLock:WALInsert WAL insert lock contention

Our spike was pure greenβ€”CPU only, no I/O or lock contention. The Top SQL tab showed a high-volume INSERT into a partitioned event log table running at ~825 calls/sec with 0.17ms average latency. Fast and normal individually. The volume was the question.

However, the Database Telemetry tab revealed two ominous signals that often catch a developer's attention:

  1. Max time idle in transaction: Climbing linearly for 3.2 hours straight.
  2. Vacuum: Max used transaction IDs: On a steady linear climb.

When you see both of those linear trends on a single chart, it usually points to a common suspect in relational databases: something has been holding an open transaction for a very long time, threatening your autovacuum health.


Background: The Core Mechanics of Postgres MVCC

To understand why long-running transactions are so closely tracked, it helps to look at how PostgreSQL manages Multi-Version Concurrency Control (MVCC).

Unlike database engines that keep old record data in a separate utility tablespace (like Oracle's Undo Tablespace), PostgreSQL relies on a Heap-Based Approach.

The Postgres Heap Reality: There is no separate undo space. When a row is updated or deleted, the old version (called a "dead tuple") remains physically stored inside the exact same table page. It stays there until the background VACUUM process scans the table and marks that dead space as reusable for future inserts.

If an application leaves a transaction open, it anchors the xmin (the baseline transaction ID floor for that snapshot). VACUUM cannot safely clean up any dead tuples that were generated after the oldest active transaction's `xmin`. This can trigger a problematic domino effect:

Session opens β†’ acquires xmin snapshot
       ↓
Session goes idle (no explicit commits or rollbacks)
       ↓
Dead tuples accumulate β†’ VACUUM cannot clear them out
       ↓
Autovacuum loops continuously, wasting CPU scanning bloated tables
       ↓
Transient CPU spikes alert operations

Enter fullscreen mode Exit fullscreen mode

Because autovacuum periodically wakes up to evaluate tables, the resulting CPU spikes look completely intermittent. The alert clears, engineers close the ticket as a glitch, and the underlying architectural friction persists.


Step 1: Find the Stuck Sessions

To hunt down the culprit, we ran a quick diagnostic query against pg_stat_activity:

SELECT
  pid,
  usename,
  application_name,
  client_addr,
  state,
  query,
  now() - state_change AS idle_duration,
  now() - xact_start   AS transaction_age
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;

Enter fullscreen mode Exit fullscreen mode

The results: 35 sessions, all originating from the same connection IP, belonging to the replication user. The oldest transaction had been open for 133 days.

The queries they were idling on:

SELECT count(*) from pg_settings where name='ansi_force_foreign_key_checks';
SELECT count(*) from pg_type where typtype='e' and oid in (...);

Enter fullscreen mode Exit fullscreen mode

These are classic compatibility probe queries issued during connection initialization by integration tooling like GoldenGate. The integration platform had been connecting, executing its basic discovery probes, and abandoning the sessions without explicitly closing them every time it restarted or reconnected over a four-month period.


Step 2: Verify the Vacuum Blocker (The Twist)

Finding idle in transaction sessions is a major red flag, but we had to verify if they were genuinely blocking data cleanup. Not all idle sessions hold an active database snapshot.

We executed the source-of-truth query to find actual vacuum blockers:

SELECT
  pid,
  usename,
  backend_xmin,
  age(backend_xmin) AS xmin_age_in_transactions,
  now() - xact_start AS session_age,
  state
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

Enter fullscreen mode Exit fullscreen mode

⚠️ This is where our initial MVCC-bloat hypothesis collapsed.

The zombie integration sessions did not appear in this list; their backend_xmin fields were entirely NULL. Because they had only executed basic, read-only catalog queries during initialization, they never acquired a persistent data snapshot transaction ID. They were not blocking vacuum. The actual oldest xmin holders were normal application sessions with transaction ages under 3,000β€”completely standard for a busy transaction engine.

While these zombie connections were a clear operational hazard (consuming connection slots and wasting system memory), they weren't driving our CPU alert.


Step 3: Check the Partitions

With zombie connections ruled out as vacuum blockers, we turned our focus to the physical table health. Because our target event log table is a partitioned table, statistics are tracked natively at the individual child partition level, not the parent logical table. We targeted the table prefix:

SELECT schemaname, relname, n_live_tup, n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE 'event_log%'
ORDER BY n_dead_tup DESC;

Enter fullscreen mode Exit fullscreen mode

Result: 0 dead tuples across all active partitions.

The partitioned strategy was working flawlessly. The application used an insert-only pattern on the active daily partition, and old partitions were dropped entirely rather than vacuumed. This meant VACUUM bloat was completely ruled out.


Step 4: Finding the Real Culprit via Wait States

With MVCC bloat off the table, we pivoted to the APM wait state drill-down for the intensive INSERT query during the exact minutes of the CPU spike:

  • ClientRead (54%): The database engine spent more than half its time completely idle, waiting for the application client to send the next query over the network wire.
  • Active (24.7%): Time spent physically executing the raw insert statements.
  • XactSync (17.9%): The engine waiting for the Write-Ahead Log (WAL) to be flushed and synchronized to durable storage upon a COMMIT.
  • AuroraStorageLogAllocate (2.4%): Storage block layer allocation.

Two metrics tell the final story here:

  1. XactSync at 17.9%: This wait state occurs when an explicit COMMIT forces PostgreSQL to flush WAL records. At a peak volume of 825 inserts per second, the application was issuing 825 individual commits per second. This meant 825 distinct sync operations forcing the storage layer to acknowledge writes sequentially.
  2. ClientRead at 54%: This is the classic signature of single-row, one-at-a-time application database calls. The application loop was executing: Send row $\rightarrow$ Execute $\rightarrow$ Commit $\rightarrow$ Wait for network Ack $\rightarrow$ Send next row.

The shape of the CPU spike closely mirrored a backlog flush pattern. Messages would pile up briefly in the upstream Kafka topic, and then the consumer would drain them all at once in a massive burst of single-row, high-frequency synchronous commits.


Root Cause Summary

Hypothesis Verdict Technical Proof
Dead Tuple Accumulation ❌ Ruled Out pg_stat_user_tables showed a 0% dead tuple ratio across all database partitions.
Zombie Connections Blocking Vacuum ❌ Ruled Out backend_xmin was explicitly NULL for all zombie replication sessions.
Burst Insert Flush + Per-Row Commits Root Cause High XactSync (17.9%) paired with a massive ClientRead (54%) network wait pattern.

The Action Plan

Fix 1: Terminate and Block the Zombie Connections

Even though they didn't block vacuum in this specific instance, leaving 35 abandoned sessions active for 133 days is risky. They leak connection resources and backend memory.

First, we safely terminated the dangling backends:

SELECT pg_terminate_backend(pid), pid, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE usename = 'repl_agent' 
  AND state = 'idle in transaction';

Enter fullscreen mode Exit fullscreen mode

Next, we implemented an automated defense line. We updated the cluster configuration parameters to enforce a strict timeout globally:

idle_in_transaction_session_timeout = 600000 # 10 minutes in milliseconds

Enter fullscreen mode Exit fullscreen mode

If any session sits idle inside an uncommitted transaction block for more than 10 minutes, Postgres will now automatically drop the connection and free up system resources.

Fix 2: Implement Micro-Batching in the Consumer Pipeline

The true fix for the CPU spike required changing how the application interacted with the storage engine. Instead of committing every single row individually, we modified the JDBC producer configuration to use a batch size of 500.

-- Old Pattern (825 WAL Syncs/sec):
INSERT row 1 β†’ COMMIT β†’ WAL fsync β†’ Wait
INSERT row 2 β†’ COMMIT β†’ WAL fsync β†’ Wait

-- New Pattern (1.7 WAL Syncs/sec):
INSERT row 1, INSERT row 2 ... INSERT row 500 β†’ COMMIT β†’ 1 WAL fsync

Enter fullscreen mode Exit fullscreen mode
Metric Before Batching After Batching (Est.)
Commits/sec (Burst) ~825 ~1.7
WAL Fsyncs/sec ~825 ~1.7
XactSync Wait Time 17.9% < 1.0%
Transient CPU Spikes Frequent Eliminated

Essential Checklist Queries for Triage

Add these queries to your incident response runbook for analyzing database load:

1. Identify Long-Running Snapshot Blockers

SELECT pid, usename, backend_xmin,
  age(backend_xmin) AS xmin_age_transactions,
  now() - xact_start AS session_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC 
LIMIT 5;

Enter fullscreen mode Exit fullscreen mode

2. Spot Dangerous Idle in Transaction Sessions

SELECT pid, usename, client_addr,
  now() - xact_start AS transaction_age,
  now() - state_change AS idle_duration,
  LEFT(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start ASC;

Enter fullscreen mode Exit fullscreen mode

3. Check Table Churn & Dead Tuple Ratios (Including Partitions)

SELECT schemaname, relname, n_live_tup, n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE 'your_table_prefix%'  -- Handles partitioned child tables
ORDER BY n_dead_tup DESC;

Enter fullscreen mode Exit fullscreen mode

πŸ’‘ Tip: A dead_ratio_pct > 10% indicates a strong need for space recovery; > 20% means the table is critically bloated and query execution performance will likely degrade.


What's Next?

If you found this breakdown helpful, keep an eye out for our next post. We will be diving into the opposite side of database optimization: When Autovacuum goes wrong. We will walk through a separate production incident where a high-churn table choked on over 124 million dead tuples (a brutal 16.6% dead ratio) due to misconfigured autovacuum cost thresholds. Stay tuned!

Top comments (0)