Myroslav Vivcharyk

Originally published at devgeist.com

The Ghost in the Database: Why Is This Empty Table Taking 20 Seconds to Query?

SELECT * FROM my_table;

Query execution time: ~5 seconds

Table rows: 0

This made absolutely no sense.

A few years ago, I found myself staring at a performance metric that defied logic. We had a table that was, for all intents and purposes, empty. Yet, querying it was taking longer than a coffee break.

The Setup

Here's the context: one massive MySQL database attached to an old monolith. Multiple teams outside the monolith had read-only access. The database was running on one of the beefiest RDS instances available - typically at 50-60% load.

What bothered me was that this load level never quite matched our actual traffic patterns. Something felt off.

Our monolith used the transactional outbox pattern - storing records in MySQL before publishing to RabbitMQ. The setup was straightforward: the monolith writes messages to an outbox_message table, and a separate worker service reads them in batches, publishes to RabbitMQ, then deletes them.

High throughput table. Lots of writes. Lots of deletes. Should be simple.
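
For context, the table behind this pattern can be as simple as a few columns. The actual schema isn't shown in this post, so treat this as a rough, hypothetical sketch:

CREATE TABLE outbox_message (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    payload    JSON NOT NULL,                    -- serialized event for RabbitMQ
    created_at DATETIME(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6)
) ENGINE=InnoDB;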

Then these warnings started appearing:

[WARN] Slow batch processing detected: 23.4s
[WARN] Slow batch processing detected: 31.2s
[WARN] Slow batch processing detected: 47.8s

Confirming the Symptoms

The monitoring dashboard showed the first signs. The outbox worker's latency metric - the time to fetch a batch of messages from the database and publish them to RabbitMQ - had jumped to 20+ seconds, with spikes up to 50 seconds.

Distributed traces pointed to the culprit: the SELECT query that fetches message batches was taking almost all the time. The commit operation was also suspiciously slow.

Time to verify directly:

SELECT * FROM outbox_message LIMIT 1;

An empty table taking 5 seconds to query.

I checked RDS Performance Insights expecting to see obvious problems - CPU maxed out, disk I/O bottlenecks, something.

Nothing. The database was sitting at a steady 50-60% load. It looked "busy," but not "breaking." No unusual spikes. The Top Queries view didn't even show our problematic query.

Which was actually the first real clue. This wasn't a typical bottleneck.

Following the Clues

The MySQL slow query log revealed what Performance Insights had missed:

# Query_time: 18.814028  Lock_time: 0.000021 Rows_sent: 1000  Rows_examined: 1000
SELECT * FROM outbox_message LIMIT 1000 FOR UPDATE SKIP LOCKED;

18.8 seconds for a query that examined only 1000 rows. The lock time was negligible.
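
If you're not sure slow-query logging is even enabled, a quick check like this helps (on RDS these settings live in the DB parameter group; the values below are just examples):

SHOW VARIABLES WHERE Variable_name IN
    ('slow_query_log', 'long_query_time', 'slow_query_log_file');

-- Example settings on self-managed MySQL:
-- SET GLOBAL slow_query_log = 'ON';
-- SET GLOBAL long_query_time = 1;   -- log statements slower than 1 second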

My first theory: blocking transactions. The worker retrieves messages in batches concurrently, so maybe they're stepping on each other.

SELECT trx_id, trx_state, trx_started, trx_query,
       dl.lock_type, dl.lock_mode, dl.LOCK_STATUS
FROM information_schema.innodb_trx
JOIN performance_schema.data_locks dl ON trx_id = ENGINE_TRANSACTION_ID
WHERE trx_query LIKE '%outbox_message%'
ORDER BY trx_started DESC;

Yes - several long-running transactions on the outbox_message table. But here's the thing: no deadlocks. All transactions had GRANTED locks, and all lock types were compatible. No conflicts.

The locks weren't the problem.

Let's look at the table itself:

SELECT table_name, 
       sys.format_bytes(data_length) AS data_size,
       sys.format_bytes(data_free) AS free_space,
       TABLE_ROWS
FROM information_schema.tables
WHERE TABLE_NAME='outbox_message';
data_size: 1.2 GiB
free_space: 1.1 GiB
TABLE_ROWS: 0

An empty table occupying 1.2 GiB on disk. Something was fundamentally wrong with this particular table.

Time to look deeper:

SHOW ENGINE INNODB STATUS;

And there it was:

History list length 21,842,680

21 million entries in the history list.

Understanding What's Actually Happening

Before we go further, let's step back and understand what this history list means.

Multi-Version Concurrency Control (MVCC)

InnoDB uses MVCC to allow concurrent reads and writes without locking. When you update or delete a row:

  1. InnoDB doesn't immediately overwrite or remove the old version
  2. It creates a new version of the row (or marks it as deleted)
  3. The old version gets stored in the undo log
  4. A reference to this old version is added to the history list

This enables readers to access old versions while writers work on new ones - no blocking required.

InnoDB's purge operations run in the background to clean up old row versions. But here's the critical constraint: the purge thread can only remove versions that are older than currently active transactions.

When you start a transaction, InnoDB creates a "snapshot" at that point in time. Even if your transaction sits idle, MySQL must keep all row versions created after your transaction started - because you might need them for consistent reads.
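
You can see this behaviour with two sessions against the same table (a minimal sketch; the session labels are mine):

-- Session A (REPEATABLE READ, the default isolation level):
START TRANSACTION;
SELECT COUNT(*) FROM outbox_message;   -- the read view (snapshot) is created here

-- Session B:
DELETE FROM outbox_message;            -- old row versions go to the undo log
COMMIT;

-- Session A, still open:
SELECT COUNT(*) FROM outbox_message;   -- still sees the pre-delete rows, so purge
                                       -- cannot discard those versions yet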

The Domino Effect

Here's what happened in our case:

Three transactions started. Due to a bug in another service, they never completed - no commit, no rollback. They just sat there, open for hours.

Meanwhile, the outbox worker continued its normal operation:

  • Write message to outbox_message (creates new row version)
  • Read and process message
  • Delete message (creates another row version marking the row as deleted)
  • Repeat millions of times (this loop is sketched below)
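
Put together, one worker iteration looks roughly like this (a simplified sketch; the id column and exact DELETE shape are assumptions):

START TRANSACTION;
SELECT * FROM outbox_message LIMIT 1000 FOR UPDATE SKIP LOCKED;
-- ... publish the fetched batch to RabbitMQ ...
DELETE FROM outbox_message WHERE id IN (/* ids from the batch */);
COMMIT;
-- Every DELETE leaves old row versions in the undo log until purge can remove them.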

The purge thread's dilemma: It could see all these deleted row versions in the history list. It wanted to clean them up. But it couldn't - those three transactions were still active. MySQL had to keep every single row version created since they started.

The result: The history list grew to 21 million entries at that moment. The table physically contained gigabytes of old row versions for rows that had been processed and deleted.

The query impact: When our outbox worker ran SELECT * FROM outbox_message, InnoDB had to scan through 21 million old versions to determine which rows should be visible to this transaction.

This is why an empty table was taking 20 seconds to query. We weren't querying an empty table - we were querying through 21 million historical versions of rows that didn't exist anymore.

And here's the kicker about that 50-60% database load I mentioned earlier: it wasn't doing useful work. The database was spending enormous resources on background purge attempts that were going nowhere.

Finding the Smoking Gun

Now that I understood the problem, I needed to find what was preventing the purge. Let's find those hanging transactions:

SELECT ps.id AS processlist_id,
       trx_id, 
       trx_started,
       TIMESTAMPDIFF(HOUR, trx_started, NOW()) AS hours_running,
       trx_query
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist ps ON trx.trx_mysql_thread_id = ps.id
WHERE trx.trx_started < NOW() - INTERVAL 10 MINUTE
ORDER BY trx_started;

There they were: three transactions that had been running for hours.

These queries were coming from another service that had database access - a bug in their code was leaving transactions open, preventing MySQL from cleaning up millions of row versions for rows that had been processed and deleted.

And here was the systemic issue hiding in plain sight:

SHOW VARIABLES WHERE Variable_name IN (
    'innodb_max_purge_lag',
    'innodb_max_purge_lag_delay',
    'innodb_purge_threads' 
);
    innodb_max_purge_lag: 0
    innodb_max_purge_lag_delay: 0
    innodb_purge_threads: 4 

innodb_max_purge_lag = 0 means the history list can grow indefinitely. No guardrails. No limits.

The Resolution

Now that we understood the problem, the fix was straightforward:

Immediate actions:

  1. Kill the hanging transactions
  2. Wait for the purge thread to catch up (it took some time to process 21 million entries)
  3. Optimize the table to reclaim disk space (sketched below)
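
A rough sketch of those steps (the processlist id is hypothetical - take it from the transaction query above):

KILL 12345;                     -- terminate each hanging connection

-- Once the history list has drained, rebuild the table to reclaim the space
-- still held by old row versions (on InnoDB this maps to ALTER TABLE ... FORCE):
OPTIMIZE TABLE outbox_message;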

Within minutes, query performance returned to normal. And the database load? It dropped from 50-60% down to 10%. Same traffic — but now the database wasn't wasting resources on futile purge operations.

Long-term fixes:

  1. Configure proper purge limits:
    SET GLOBAL innodb_max_purge_lag = XXX;

This caps how far the history list can lag behind: once it grows past XXX entries, InnoDB starts adding small delays to DML operations so the purge thread can catch up. A small performance hit may be better than a complete breakdown.

  2. Fix transaction timeout handling across all services with database access. No transaction should run for hours/days.
  3. Add monitoring and alerts for history list length. This metric should be on every write-heavy database dashboard.
    SELECT count FROM information_schema.INNODB_METRICS
    WHERE name = 'trx_rseg_history_len'; 

Wrapping Up

You can't control everything in a large system. Bugs will happen. But here's what you can control:

1. Monitoring Matters

Add the history list length (HLL) metric to your database dashboard:

SELECT count FROM information_schema.INNODB_METRICS 
WHERE name = 'trx_rseg_history_len';

Or add it to your RDS Performance Insights dashboard.

2. Set Proper Guardrails

Default MySQL configurations assume perfect transaction hygiene. In complex systems with multiple services, you may need limits.

Don't let innodb_max_purge_lag stay at 0 in write-heavy databases with high delete volumes. Set it to something reasonable and accept the small performance trade-off for the reliability gain.
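
For example (the numbers are purely illustrative - the right threshold depends on your write volume):

-- Start throttling DML once the history list exceeds ~1M entries:
SET GLOBAL innodb_max_purge_lag = 1000000;
-- Optionally cap the per-statement delay InnoDB is allowed to inject:
SET GLOBAL innodb_max_purge_lag_delay = 100000;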

3. Understand the Internals

Knowing how MVCC and the purge thread work was essential to diagnosing this issue. The history list concept isn't obscure trivia - it's fundamental to how InnoDB handles concurrency.

A DELETE doesn't immediately remove data; it marks the row as deleted and keeps the old version around. An UPDATE creates a new row version. All of it sits in the history list until the purge thread can safely remove it.

4. Discipline Matters

The service that caused this only had read-only access. But even a read-only service can bring down your database if it doesn't properly manage transactions.

Every service needs proper timeout handling and error recovery that ensures transactions complete. No exceptions.
