DEV Community

Nikita Kutsokon
Nikita Kutsokon

Posted on

Databases: Concurrency Control. Part 1

Introduction

Imagine you're working in a busy office with multiple people trying to update the same spreadsheet at the same time. One person is adding new information, while another is making changes to existing data. At the same time, others are just looking at the document, but they need to see the latest, accurate version of the data. In this situation, how can we ensure that:

  • People making changes don’t accidentally overwrite each other’s work ?
  • People who are just reading the document always see the most up-to-date version, without waiting for someone else to finish their work ?
  • The document doesn't end up in a jumbled state ?

Now, imagine this happening in a database, where multiple users or applications are trying to read from and write to the same tables at the same time. The challenge is even bigger in this case because databases are designed to store huge amounts of data, and many transactions need to happen simultaneously.

Why Is Concurrency Hard in Databases ?
When multiple users or processes are interacting with the database at once, several things can go wrong:

  • Data consistency issues - One user might see outdated information because another user is changing it.
  • Locking problems - If one transaction locks a row, other transactions might have to wait, leading to delays and frustration.
  • Concurrency anomalies - Without the right mechanisms in place, transactions could lead to inconsistent results, such as one user overwriting another’s work.

Databases need to figure out how to let users work simultaneously without blocking each other, and also make sure the data stays consistent and correct no matter how many people are using it.

What is MVCC ?

The solution lies in using techniques that allow each transaction to see a consistent snapshot of the data, without interfering with other transactions. One of the most powerful and widely used techniques in modern databases - is called Multiversion Concurrency Control. It is a method that enables multiple transactions to access the database simultaneously without blocking each other. This is accomplished by creating multiple versions of a piece of data, instead of having just one active version at any time.

Table with MVCC

In an MVCC-based system, when data is updated or inserted, a new version of the row is created. However, if a transaction only reads the data without modifying it, it accesses the original version of the data. This ensures that read operations are not blocked by write operations, and write operations do not block reads. As a result, the new version of the data becomes visible only after a modification or insertion occurs, allowing concurrent transactions to work with different versions of the same data without causing interference. This approach enables databases to maintain data consistency and high concurrency while minimizing the need for locking resources.

Example of MVCC in Action
Let's say three transactions, T1, T2, and T3, are operating on the same row in the database:

  1. Transaction T1 reads a row and decides to update it. This creates version 2 of the row.
  2. Transaction T2 also reads the same row (before T1 has committed) and performs a different update, creating version 3 of the row.
  3. Transaction T3 reads the row as well and decides to insert a new value, creating version 4.

At this point, there are three versions of the same row, each with different updates from different transactions. The database now has to manage and ensure that the operations from these transactions do not interfere with each other. In the next parts of this article, we will explore how the database resolves conflicts when multiple transactions try to modify the same data. This conflict resolution ensures that data consistency is maintained and that transactions are isolated properly

The Hidden Cost of MVCC: Bloat

While Multiversion Concurrency Control offers many benefits, such as high concurrency and reduced locking, it comes with its own set of challenges. One of the most significant drawbacks of MVCC is bloat. In this section, we’ll explore what bloat is, how it occurs in MVCC systems, and its impact on database performance.

Bloat refers to the buildup of unused, outdated, or "dead" data versions in the database, resulting from the creation of multiple row versions due to MVCC. As time passes, these obsolete versions accumulate and consume storage space, leading to inefficient resource usage. As bloat increases, it introduces several negative effects on both storage and performance. The presence of outdated data versions creates inefficiencies that can worsen over time. If left unmanaged, these inefficiencies can significantly degrade the overall health and performance of the database. Let’s explore how bloat can impact the database’s functionality in more detail:

  1. Increased Storage Usage 💾🫠
    The primary cost of bloat is that it leads to increased disk space usage. As new versions of rows accumulate and old versions remain in the database, the overall size of the database increases. This means more disk space is needed for storage, and more memory is required to process queries that involve large amounts of outdated data.

  2. Slower Query Performance 🏃😶
    As bloat accumulates, queries may need to process more data than necessary. Even though old versions of rows are no longer needed, they still have to be read and checked during queries, especially if they are not cleaned up properly. This can slow down query performance, particularly for read-heavy operations.

  3. Increased Maintenance Overhead 🛠️☹️
    Managing and cleaning up bloat introduces additional complexity. While newer versions of a row are easily accessible, dead tuples need to be removed periodically, which requires an active maintenance process. Without proper cleanup, the database can become inefficient and slow down over time.

Enter VACUUM 🦸🌟

To combat the negative effects of bloat, databases like PostgreSQL use a process called VACUUM. VACUUM is designed to reclaim storage space by removing outdated or "dead" versions of rows that are no longer needed. Over time, as multiple versions of rows accumulate due to ongoing transactions, the database can become inefficient and sluggish. Without VACUUM, this bloat would continue to increase, leading to wasted storage and slower query performance. VACUUM helps keep the database running efficiently by removing these obsolete row versions and maintaining transaction visibility, ensuring that ongoing operations aren't impacted by outdated data.

MVCC cleaning

Think of VACUUM like a housekeeper in a busy office. Just as employees move documents around, updating or discarding them, the office can become cluttered if old, irrelevant papers are not cleared away. VACUUM performs a similar role, ensuring the database remains tidy and efficient. PostgreSQL offers several types of VACUUM processes to manage this cleanup. The standard VACUUM reclaims storage and removes dead rows without locking the table, allowing the database to continue operating with minimal disruption. The VACUUM FULL, on the other hand, not only removes dead rows but also compacts the table by physically rewriting it, reducing its disk space usage. While standard VACUUM is run frequently as part of regular database maintenance, VACUUM FULL is typically used when a large amount of space needs to be reclaimed, such as after large deletions or updates.

How MVCC and VACUUM Work Together 🤝

MVCC and VACUUM work together to maintain PostgreSQL’s performance and consistency. MVCC enables multiple transactions to run concurrently by creating new versions of data when updates or inserts occur, allowing reads to access the original version without waiting for writes. However, this leads to the accumulation of outdated row versions, causing bloat. VACUUM addresses this issue by periodically removing these "dead" versions, reclaiming storage space, and ensuring the database remains efficient. Together, MVCC allows high concurrency while VACUUM prevents performance degradation, keeping the database both responsive and well-maintained.

Helpful Links 🤓

Text resources:

Video resources:

Top comments (0)