
Why Your Fast PostgreSQL Query Suddenly Became Slow: A Deep Dive into MVCC and Index Bloat

Ever had a query that uses an index correctly and runs blazing fast, then suddenly slows down, especially on a table with high write volume? We recently experienced this exact issue in production. Here's what we learned.

Our Setup

We have an event_logs table with the following structure:

CREATE TABLE event_logs (
  id BIGINT PRIMARY KEY,
  state VARCHAR,
  last_requeued_at TIMESTAMP,
  params JSON
);

A cron job runs every few minutes to pick up pending event logs and queue them for processing in batches:

Batch 1:

SELECT "event_logs".*
FROM "event_logs"
WHERE "event_logs"."state" = 'initial'
  AND "event_logs"."last_requeued_at" <= '2025-12-15 10:34:32.473436'
ORDER BY "event_logs"."id" ASC
LIMIT 1000;

Batch 2:

SELECT "event_logs".*
FROM "event_logs"
WHERE "event_logs"."state" = 'initial'
  AND "event_logs"."last_requeued_at" <= '2025-12-15 10:34:32.473436'
  AND "event_logs"."id" > 1000
ORDER BY "event_logs"."id" ASC
LIMIT 1000;

To optimize this query, we created a partial index that only indexes records in the 'initial' state:

CREATE INDEX index_event_logs_on_initial_id_last_requeued_at 
ON event_logs USING btree (id, last_requeued_at) 
WHERE state = 'initial';

The Problem

In production, we were processing around 900 writes/second on the event_logs table:

  • 300 new inserts/second (state = 'initial')
  • 300 updates/second (state from initial → processing)
  • 300 updates/second (state from processing → processed)

The table had around 1 billion rows.

Initially, query latency was around 3 milliseconds—excellent performance.

However, after a few hours, the batching query started experiencing high latency. Running EXPLAIN ANALYZE showed the query was taking several seconds, even with the index in place.
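This is roughly what that looked like for the first batch; the BUFFERS option is an optional addition that shows how many index and heap pages the scan has to touch:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "event_logs".*
FROM "event_logs"
WHERE "event_logs"."state" = 'initial'
  AND "event_logs"."last_requeued_at" <= '2025-12-15 10:34:32.473436'
ORDER BY "event_logs"."id" ASC
LIMIT 1000;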

Our first instinct was to suspect the index. We ran REINDEX, which rebuilt the index from scratch. Immediately after, the query was blazing fast again. But within hours, the problem returned.
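For reference, here's a minimal sketch of that rebuild, assuming PostgreSQL 12+ so REINDEX can run with CONCURRENTLY; comparing the index size before and after shows how much bloat the rebuild reclaimed:

-- How large has the partial index grown?
SELECT pg_size_pretty(pg_relation_size('index_event_logs_on_initial_id_last_requeued_at'));

-- Rebuild it without blocking writes (PostgreSQL 12+); a plain REINDEX INDEX
-- works on older versions but locks out writes for the duration.
REINDEX INDEX CONCURRENTLY index_event_logs_on_initial_id_last_requeued_at;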

Something deeper was going on.

Understanding PostgreSQL's MVCC (Multi-Version Concurrency Control)

PostgreSQL uses MVCC to handle concurrent transactions. Instead of modifying rows in place, PostgreSQL keeps multiple versions of the same row:

  • When a row is updated, PostgreSQL marks the old version as dead and creates a new version
  • When a row is deleted, PostgreSQL marks it as dead but doesn't physically remove it
  • The original row remains on disk until cleanup occurs (we'll cover cleanup later in this post)

Let's trace what happens to an event log as it moves through its states, and how each step affects our partial index:

Initial Insert

INSERT INTO event_logs (id, state, last_requeued_at, params) 
VALUES (1, 'initial', '2024-01-01 10:00:00', '{"key": "value"}');

Physical storage:

  • Row version 1: id=1, state='initial', last_requeued_at='2024-01-01 10:00:00'
  • Partial index: Entry for (id=1, last_requeued_at) → row version 1

Update 1: 'initial' → 'processing'

UPDATE event_logs SET state='processing' WHERE id=1;

Physical storage:

  • Row version 1: state='initial' (dead tuple)
  • Row version 2: state='processing' (current)
  • Partial index: The existing entry now points to a dead row version, so it is effectively dead. It stays in the partial index until VACUUM removes it (more on VACUUM below). No new entry is added, since state='processing' doesn't match the index's WHERE clause.

Hence, each event log that transitions from initial to processing state creates a dead entry in the partial index.
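If you repeat the sequence above and select PostgreSQL's hidden system columns, ctid (the physical location of a row version) and xmin/xmax (the transactions that created and deleted it), you can watch the new version being written:

-- Before the update: one visible row version at some physical location.
SELECT ctid, xmin, xmax, state FROM event_logs WHERE id = 1;

UPDATE event_logs SET state = 'processing' WHERE id = 1;

-- After the update: ctid points at a different location, because the UPDATE
-- wrote a new row version and left the old one behind as a dead tuple.
SELECT ctid, xmin, xmax, state FROM event_logs WHERE id = 1;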

Update 2: 'processing' → 'processed'

UPDATE event_logs SET state='processed' WHERE id=1;

Physical storage:

  • Row version 1: state='initial' (dead)
  • Row version 2: state='processing' (dead)
  • Row version 3: state='processed' (current)
  • Partial index: Still has 1 dead entry from the original 'initial' state

Result: 3 physical row copies on disk, with accumulated dead entries in the partial index.

(The update from 'processing' to 'processed' doesn't affect the partial index, since the index only covers rows in the 'initial' state.)
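A normal SELECT only ever returns the live version, so the dead copies are invisible from SQL. If you want to see them sitting on disk, the pageinspect extension (superuser only, so best tried on a staging copy) can dump a raw heap page:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Inspect the table's first heap page: lp is the line pointer slot,
-- t_xmin/t_xmax are the creating and deleting transactions, and t_ctid points
-- to the newer version of an updated row. Dead versions still show up here.
SELECT lp, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('event_logs', 0));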

How This Caused Our Slow Query

When PostgreSQL scans the partial index, it must check the visibility of each entry and skip dead tuples. Under high update volume, as more event logs transition from initial to processing state, dead entries accumulate rapidly in the index.

With 300 events/second and each event going through two state transitions (initial → processing → processed), we were creating dead index entries faster than they were being cleaned up.

The query was spending most of its time checking visibility and skipping dead entries rather than retrieving actual data—hence the dramatic slowdown.
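One way to confirm that the index itself is bloated is the pgstattuple extension; its pgstatindex() function reports how densely packed a B-tree's leaf pages are. Note that it reads the whole index, so use it carefully on a large one:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- avg_leaf_density falls as dead entries and half-empty pages accumulate;
-- a freshly built B-tree sits near its fillfactor (90% by default).
SELECT avg_leaf_density, leaf_fragmentation, deleted_pages
FROM pgstatindex('index_event_logs_on_initial_id_last_requeued_at');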

How PostgreSQL Cleans Up Dead Tuples

VACUUM

When VACUUM runs, it processes both the table and its indexes, physically removing dead entries and marking space as reusable.

Important: Regular VACUUM doesn't compact indexes or shrink files—it only marks space as reusable. For true compaction, you need VACUUM FULL, REINDEX, or pg_repack.
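You can also run a cleanup by hand; with VERBOSE, PostgreSQL reports how many dead row versions it removed from the table and from each index. On a billion-row table a full pass can take a long time:

-- Removes dead tuples from the heap and dead entries from all of the table's
-- indexes, marks the space reusable, and refreshes planner statistics.
-- It does not shrink the underlying files.
VACUUM (VERBOSE, ANALYZE) event_logs;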

AUTOVACUUM

PostgreSQL's automatic background process that runs VACUUM periodically. It's enabled by default, so why did we still have problems?

The Root Cause: Default Autovacuum Settings

PostgreSQL's default autovacuum settings aren't aggressive enough for high-write tables, especially large ones.

Default settings:

  • autovacuum_vacuum_threshold = 50 rows
  • autovacuum_vacuum_scale_factor = 0.2 (20%)
  • autovacuum_vacuum_cost_limit = -1, which falls back to vacuum_cost_limit = 200 (the number of vacuum cost units a worker can accumulate before it must sleep)
  • autovacuum_vacuum_cost_delay = 2ms
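Before tuning anything, it's worth checking the values actually in effect on your server, along with any per-table overrides:

-- Server-wide autovacuum settings.
SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';

-- Per-table overrides (reloptions is NULL when the table uses the defaults).
SELECT relname, reloptions FROM pg_class WHERE relname = 'event_logs';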

Autovacuum triggers when:

dead_tuples > autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor × total_rows)

For our event_logs table with 1 billion rows, autovacuum would only trigger after:

50 + (0.2 × 1,000,000,000) = ~200 million dead tuples

Each update leaves behind roughly one dead tuple, so the two state transitions were generating about 600 dead tuples/second:

600 dead tuples/sec × 86,400 seconds/day = ~51.8 million dead tuples/day

It would take nearly 4 days to hit the autovacuum threshold, during which our index would accumulate millions of dead entries, causing severe query degradation.

The Solution

We made autovacuum much more aggressive for this specific table:

ALTER TABLE test_schema.event_logs SET (
  autovacuum_vacuum_scale_factor = 0.005,
  autovacuum_analyze_scale_factor = 0.01 -- Keeps table stats fresh
);

This sets the scale factor to 0.5% instead of 20%.

New trigger threshold for 1 billion rows:

50 + (0.005 × 1,000,000,000) = ~5 million dead tuples

This means autovacuum now runs approximately every 2.3 hours instead of every 4 days, keeping dead tuple accumulation under control and maintaining fast query performance.

To make each autovacuum run finish faster, you can also set autovacuum_vacuum_cost_delay to zero so the workers don't throttle themselves.
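That setting can also be applied per table, so the change stays scoped to the hot table rather than the whole cluster:

-- With a zero cost delay, autovacuum workers on this table never pause
-- between chunks of work, at the cost of more I/O while vacuuming.
ALTER TABLE event_logs SET (autovacuum_vacuum_cost_delay = 0);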

Monitoring Autovacuum Activity

You can check dead tuple counts and autovacuum statistics with this query:

SELECT
  schemaname,
  relname AS table_name,
  n_live_tup,
  n_dead_tup,
  last_autovacuum,
  last_vacuum,
  autovacuum_count,
  vacuum_count
FROM pg_stat_user_tables
WHERE relname = 'event_logs' 
  AND schemaname = 'your_schema';
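To see whether a vacuum is running on the table right now, and how far it has gotten, pg_stat_progress_vacuum (available since PostgreSQL 9.6; some column names vary slightly between versions) shows one row per active vacuum worker:

SELECT
  pid,
  relid::regclass AS table_name,
  phase,
  heap_blks_scanned,
  heap_blks_total,
  index_vacuum_count
FROM pg_stat_progress_vacuum;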

Key Takeaways

  • MVCC creates dead tuples - Every update creates a new row version and leaves the old one as a dead tuple
  • Indexes accumulate dead entries - Including partial indexes, which must be cleaned by VACUUM
  • Dead entries slow down queries - PostgreSQL must check visibility and skip dead entries during index scans
  • Default autovacuum settings don't scale - The 20% threshold is too high for large, high-write tables
  • Tune per-table settings - Aggressive autovacuum settings on hot tables prevent bloat-related performance issues

