DEV Community

Mustafa ERBAY
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

PostgreSQL MVCC: Common Mistakes in Application Development

When working with PostgreSQL, especially in applications requiring high concurrency, correctly understanding the MVCC (Multi-Version Concurrency Control) concept is critically important. This mechanism allows the database to seamlessly manage many operations simultaneously. However, overlooking the intricacies of MVCC can lead to unexpected performance issues or data inconsistencies in a production environment.

While working on a production ERP or developing the backend for my side product's financial calculators, I encountered many problems caused by MVCC or by misunderstanding MVCC. In this post, I will share my experiences with common mistakes related to MVCC usage during application development and how to avoid them.

What is MVCC and Why is it Vital?

MVCC is a fundamental concurrency model that allows PostgreSQL to perform read and write operations concurrently without the need for locking. In traditional database systems, when one transaction reads data and another tries to write to the same data, deadlocks often occur. These deadlocks cause severe performance bottlenecks in high-traffic systems.

Thanks to MVCC, each transaction sees its own "snapshot." This means that when a transaction starts, it gets a consistent view of the database at that moment, and this view remains unchanged until the transaction completes. Even if other transactions modify the same data, the first transaction continues to read the data from its own snapshot. This prevents readers from blocking writers and writers from blocking readers, thereby increasing the overall concurrency of the system. For example, when designing operator screens showing the real-time status of a production line, where data needs to be continuously updated and reports generated simultaneously, this structure of MVCC proved very useful.

ℹ️ Snapshots and Tuples

When a row is updated or deleted in PostgreSQL, it is not actually physically deleted; instead, a new version (tuple) of that row is created, and the old version is marked as a "dead tuple." Each tuple has hidden columns like xmin (the transaction ID that created it) and xmax (the transaction ID that deleted or updated it). MVCC uses these IDs to decide which tuples a transaction can see and which it cannot.

The Hidden Hand of VACUUM: Bloat and Performance Issues

The benefits MVCC provides for database concurrency are undeniable, but this model comes with a cost: dead tuples. Since the old tuple is not physically deleted when a new tuple is created, these "dead" tuples continue to occupy disk space. Over time, these dead tuples accumulate, creating a condition called "bloat" in tables and indexes. Bloat causes database files to occupy more space than necessary and leads to a decrease in query performance.

I faced this bloat problem when I saw the size of the order table in a production ERP growing uncontrollably. Every day, thousands of new orders, updates, and deletions occurred. When I queried tables like pg_stat_user_tables and pg_class to examine the n_dead_tup and relpages values, I found that more than 40% of the table consisted of dead tuples. This increased disk I/O and caused even simple SELECT queries to slow down.

SELECT
    relname AS table_name,
    pg_size_pretty(pg_table_size(C.oid)) AS total_size,
    pg_size_pretty(pg_relation_size(C.oid)) AS table_size,
    pg_size_pretty(pg_total_relation_size(C.oid) - pg_relation_size(C.oid)) AS index_size,
    n_live_tup AS live_tuples,
    n_dead_tup AS dead_tuples,
    (n_dead_tup * 100 / (n_live_tup + n_dead_tup + 0.001))::numeric(5,2) AS dead_ratio,
    last_autovacuum,
    last_autoanalyze
FROM pg_class C
JOIN pg_namespace N ON (N.oid = C.relnamespace)
JOIN pg_stat_user_tables S ON (S.relid = C.oid)
WHERE nspname = 'public' AND C.relkind = 'r'
ORDER BY dead_ratio DESC
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

To solve this bloat problem, VACUUM and ANALYZE operations must run regularly. PostgreSQL's autovacuum daemon performs this automatically by default, but sometimes the default settings may be insufficient. Especially for tables with high write traffic, parameters like autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold may need to be adjusted. In my case, for the ERP order table, I reduced the autovacuum_vacuum_scale_factor from 20% to 5% and also lowered the autovacuum_vacuum_threshold to trigger autovacuum more frequently. This significantly reduced the amount of dead tuples accumulated in the table, and query times dropped to acceptable levels.

Transaction Isolation Levels and Misconceptions

PostgreSQL has four different transaction isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE. One of the most common mistakes application developers make with MVCC is using the default READ COMMITTED level without fully understanding the behavior of these isolation levels.

At the READ COMMITTED level, a query only sees data that has been committed before it. Different queries within the same transaction can return different results due to other commits in between (non-repeatable read). For example, in a financial calculator application, I was first checking a user's balance and then performing an operation. If another transaction changed and committed the user's balance immediately after I ran SELECT bakiye FROM hesaplar WHERE user_id = X, my UPDATE hesaplar SET bakiye = ... WHERE user_id = X query would operate on a different balance. This can lead to data inconsistencies and disrupt the system's logic.

-- Transaction A
BEGIN;
SELECT bakiye FROM hesaplar WHERE user_id = 1; -- Let's say 1000 TL
-- A few seconds passed, meanwhile another transaction reduced the balance to 900 TL.
UPDATE hesaplar SET bakiye = bakiye - 100 WHERE user_id = 1; -- Risk of operating on 1000 TL
COMMIT;

-- Transaction B (concurrent)
BEGIN;
UPDATE hesaplar SET bakiye = bakiye - 50 WHERE user_id = 1;
COMMIT;
Enter fullscreen mode Exit fullscreen mode

In such situations, it may be necessary to switch to REPEATABLE READ or SERIALIZABLE isolation levels. At the REPEATABLE READ level, the snapshot taken when the transaction starts remains valid until the end of the transaction. This means you will get the same results even if you run the same query again. However, "phantom reads" can still occur at this level, meaning newly added rows might be visible during the transaction. SERIALIZABLE is the highest isolation level and ensures that transactions behave as if they were executed one after another. At this level, the performance cost can be higher, and serialization_failure errors may occur.

⚠️ Incorrect Isolation Level Selection

Choosing the correct isolation level based on your application's business logic and data consistency requirements is crucial. Using unnecessarily high isolation levels can degrade performance, while lower levels can lead to data inconsistencies. We must always evaluate the trade-offs.

In a client project, when I used REPEATABLE READ in a complex reservation system, I used explicit locks with SELECT ... FOR UPDATE to prevent two users trying to make reservations simultaneously from booking conflicting slots. This required me to bypass MVCC's non-blocking nature for a specific moment to ensure correct data consistency. I covered this topic in more detail in the post [Related: Database Locking Strategies].

Long-Running Transactions and Snapshot Bloat

Another insidious aspect of MVCC is long-running transactions. As you know, a VACUUM operation can only clean dead tuples when it is sure that no active transaction sees a snapshot containing that dead tuple. If a transaction remains open for too long, the snapshot held by this transaction can prevent dead tuples from being cleaned. We call this "snapshot bloat."

Snapshot bloat can be more dangerous than normal bloat because it completely prevents autovacuum from doing its job. In an ERP system, a large reporting operation could sometimes run within a transaction that lasted for hours. As long as this operation remained open, the dead tuples accumulated in other tables could not be cleaned, which caused the pg_class.relfrozenxid value to advance slowly and increased the risk of transaction ID wraparound. This situation culminated in a crisis on April 28th when disk usage reached 95%.

To detect long-running transactions, I use the pg_stat_activity table:

SELECT
    pid,
    usename,
    datname,
    client_addr,
    backend_start,
    xact_start,
    state,
    query,
    age(backend_xid) AS xid_age,
    age(query_start) AS query_age
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_start ASC;
Enter fullscreen mode Exit fullscreen mode

This query shows the start times of active transactions and their xid_age (age of transaction IDs) values. High xid_age values can indicate potential sources of snapshot bloat. In such scenarios, I implemented solutions such as splitting long-running operations into smaller transactions, directing reporting operations to read replicas, or scheduling non-critical reports for off-peak hours. Sometimes, better managing the transaction lifecycle on the application side (e.g., avoiding unnecessarily long BEGIN blocks) can solve the problem fundamentally.

Index Bloat and MVCC Relationship

MVCC affects not only dead tuples in tables but also indexes. When a row is updated, a new tuple is actually created, and a new entry is made in the index for this new tuple. The index entries of the old tuple are not immediately deleted; they are also marked as "dead" and await cleaning by VACUUM. This situation also leads to bloat in indexes.

Index bloat causes the index size to grow unnecessarily. This, in turn, causes index scans to perform more disk I/O, leading to a decrease in query performance. Indexes of frequently updated tables are particularly affected by this situation. For example, the sizes of indexes in a production ERP's stock movement table were growing faster than the table itself. I detected this situation using the pg_indexes_size and pg_relation_size functions.

SELECT
    t.relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
    pg_stat_get_numscans(i.indexrelid) AS index_scans,
    pg_stat_get_tuples_returned(i.indexrelid) AS tuples_returned
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON idx.oid = i.indexrelid
WHERE t.relkind = 'r'
ORDER BY pg_relation_size(i.indexrelid) DESC
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

To reduce index bloat, VACUUM FULL or REINDEX commands can be used. However, VACUUM FULL completely locks the table and can take a long time; therefore, it should be used carefully in a production environment. REINDEX might be a better option, but it also requires a short-term lock. In my experience, optimizing autovacuum settings according to the table's update traffic is the most effective method to prevent index bloat. Additionally, the FILLFACTOR setting of indexes can also affect bloat. A lower FILLFACTOR (e.g., 70%) leaves more free space on index pages, making room for future updates and slowing down bloat formation, but immediately increases the index size. This is a trade-off between disk space and performance. I delved deeper into this topic in the post [Related: PostgreSQL Index Optimizations].

Explicit Locking with FOR UPDATE/SHARE: Bypassing MVCC

Although MVCC generally eliminates the need for locking, in some cases, explicit locking becomes mandatory. Especially if you need to update data immediately after reading it within a transaction and want to ensure that no other transaction modifies the same data, you can use SELECT ... FOR UPDATE or SELECT ... FOR SHARE statements.

SELECT ... FOR UPDATE locks the selected rows against other FOR UPDATE or FOR SHARE requests until the current transaction completes. This is indispensable for ensuring data consistency in critical business logic such as stock reduction or balance updates. In a production ERP, I prevented race conditions by using FOR UPDATE in scenarios where multiple users simultaneously tried to reduce the stock of the same product.

-- Transaction 1
BEGIN;
SELECT stok_miktari FROM urunler WHERE id = 123 FOR UPDATE; -- Locks the row
-- stok_miktari = 50
-- After some time
UPDATE urunler SET stok_miktari = 49 WHERE id = 123;
COMMIT;

-- Transaction 2 (concurrent, waits until Transaction 1 finishes)
BEGIN;
SELECT stok_miktari FROM urunler WHERE id = 123 FOR UPDATE; -- Waits until Transaction 1 finishes
-- Runs after Transaction 1 commits
-- stok_miktari = 49
UPDATE urunler SET stok_miktari = 48 WHERE id = 123;
COMMIT;
Enter fullscreen mode Exit fullscreen mode

SELECT ... FOR SHARE is a lighter lock; it only locks rows against FOR UPDATE requests while allowing concurrent execution with other FOR SHARE requests. This can be used in read-heavy situations where consistency is important. However, these explicit locking mechanisms reduce the concurrency benefits provided by MVCC to a certain extent. When used incorrectly or unnecessarily, they can create deadlocks or performance bottlenecks. Therefore, it is crucial to be very careful about when and where to use them. Deciding on their use after testing their performance impact and taking measurements will prevent headaches in a production environment.

Conclusion: Understanding and Correctly Using MVCC

PostgreSQL's MVCC mechanism provides a powerful foundation for developing modern, high-performance, and concurrent applications. However, to fully utilize this power and avoid potential pitfalls, a good understanding of MVCC's operating principles and related concepts is required. Topics such as the role of VACUUM, the intricacies of transaction isolation levels, the effects of long-running transactions, and the correct use of explicit locking directly impact database performance and data consistency.

In my own experience, ignoring these issues led to "disk full" alarms at night or inexplicably slow queries. Therefore, when developing applications on or operationally managing a PostgreSQL database, delving into the depths of MVCC and avoiding the mistakes I mentioned above is key to building robust and scalable systems. Next step: Examine the pg_stat_user_tables and pg_stat_activity outputs in your own systems to identify potential areas for MVCC optimizations.

Top comments (0)