An SQL transaction comprises many sequential reads and writes that affect different objects. These may be stored on other servers in a distributed database.
These operations include writes to table rows, which are made visible and durable at commit (the A and D of ACID), and lock information (also known as transaction intents or provisional records) that must be visible to other concurrent transactions for consistency and concurrency control (the C and I of ACID).
Once the transaction is committed, only a part of it needs to be stored for the long term: the effective modifications to the table rows and the commit time.
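As a minimal illustration (the accounts table and values are hypothetical), a classic transfer interleaves reads and provisional writes, yet only the final row versions and the commit time matter once it commits:

```sql
-- Hypothetical transfer: the intermediate reads and provisional writes exist
-- only during the transaction; what remains long term after COMMIT is the
-- final version of the two rows and the commit time.
BEGIN;
SELECT balance FROM accounts WHERE id = 1;                 -- read
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- provisional until commit
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- may live on another server
COMMIT;                                                    -- becomes visible and durable
```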
Recording, sharing, and cleaning the transaction intents
Some databases store intents, like locks, in memory to make them easy to access and clean up once the transaction is committed. However, this doesn't scale. It cannot scale horizontally, because memory cannot be shared efficiently between servers over the network. Even on a single server, memory is limited, so databases using this architecture escalate locks, for example locking a whole table when too many of its rows are updated; SQL Server calls this Lock Escalation.
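As a sketch of how this surfaces in SQL Server (the table name is hypothetical; the roughly 5,000-locks-per-statement trigger is the commonly documented threshold), escalation can be controlled per table:

```sql
-- SQL Server attempts to escalate many row/page locks to a single table lock;
-- this T-SQL option lets you disable or force that behavior for one table.
ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = DISABLE);  -- other values: AUTO, TABLE
```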
Other databases store all intents in persistent data blocks. They add a reference to the transaction table so that readers can tell whether the related transaction is still ongoing (its changes are not visible to others) or committed (its changes are visible to all transactions that read a state after the commit time). The provisional information stored in those blocks is cleaned up later so that subsequent reads do not keep paying this overhead. Jonathan Lewis described this for Oracle Database in Clean it up. PostgreSQL is designed for asynchronous cleanup by running VACUUM frequently.
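For example, in PostgreSQL each row version carries references to the transactions that created and deleted it (xmin, xmax), and VACUUM performs the asynchronous cleanup; a quick way to observe this (with a hypothetical table) is:

```sql
-- xmin/xmax are the ids of the transactions that created and (if any) deleted
-- this row version; readers consult the commit log to decide visibility.
SELECT xmin, xmax, ctid, * FROM accounts WHERE id = 1;

-- VACUUM later removes row versions that are no longer visible to any transaction.
VACUUM (VERBOSE) accounts;
```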
In a distributed SQL database, whether intents are stored in memory or on disk, reading and writing them during a transaction involves network calls and adds latency. Aurora DSQL, with Optimistic Concurrency Control, avoids this by storing the transaction intents locally and synchronizing only at commit. To achieve this, it imposes limits on the number of rows and the memory a transaction can use, and on the SQL isolation levels, which makes it incompatible with applications written for other databases.
YugabyteDB is PostgreSQL-compatible, which requires sharing the transaction intents over the network using Raft consensus. In each Raft group, called a tablet, the long-term storage LSM tree, RegularDB, stores only clean data with a commit time, with no leftover provisional intents once the transactions are committed. Before a transaction commits, all provisional records are stored in another LSM tree, IntentsDB. Instead of a delayed cleanup, the committed intents are applied to RegularDB asynchronously. Once in RegularDB, no further cleanup is needed, except for the garbage collection of Multi-Version Concurrency Control (MVCC) performed during SST file compaction.
Unlike Oracle or PostgreSQL, YugabyteDB does not piggyback delayed cleanup onto the read workload, which is beneficial for performance predictability and for the performance of read replicas. A read in YugabyteDB does not generate Write-Ahead Logging (except with the Serializable isolation level, which must record read intents when the transaction is not read-only).
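In practice, this means that declaring a transaction read-only avoids read intents even under Serializable isolation; a minimal example in PostgreSQL/YSQL syntax (the table is hypothetical):

```sql
-- A read-only transaction has no read intents to record, so no Write-Ahead
-- Logging is generated for its reads, even at the Serializable level.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY;
SELECT count(*) FROM orders WHERE status = 'shipped';
COMMIT;
```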
YugabyteDB PostgreSQL-compatible Distributed Transactions
It's all about trade-offs. A YugabyteDB transaction that reads a table row or an index entry must read from two databases: RegularDB, the long-term storage, and IntentsDB, to see whether there are ongoing transactions holding locks or committed transactions that have not yet been applied to RegularDB. This would be a performance penalty in traditional databases, but YugabyteDB uses LSM trees, where reading from multiple sources is already the nominal read path, as it merges reads from multiple SST files and MemTables. Reading from IntentsDB adds one more entry to the iterator, usually a MemTable only, except for very long transactions whose intents may have been flushed to SST files. YugabyteDB was built for modern storage (SSD), where random reads are fast; this would not have been efficient with HDD.
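One way to see the storage-level reads behind a statement is YugabyteDB's EXPLAIN with the DIST option (assuming a version that supports it; the table is hypothetical). It reports the distributed storage read requests and their timing, which include the merged read path, although it does not break down RegularDB versus IntentsDB:

```sql
-- Shows the DocDB (storage layer) read requests and time spent, on top of the
-- usual PostgreSQL plan; the merge of RegularDB and IntentsDB happens inside
-- those storage reads.
EXPLAIN (ANALYZE, DIST) SELECT * FROM orders WHERE order_id = 42;
```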
I mentioned the MemTables of IntentsDB and RegularDB. Memory access is fast, but even though MemTables are replicated in YugabyteDB, they are also protected by Write-Ahead Logging (WAL). YugabyteDB's LSM tree implementation is based on RocksDB, but it doesn't use RocksDB's WAL, because a log is already generated for Raft replication and the Raft log can be used to recover the MemTable after a crash. This WAL protects the IntentsDB MemTable, which can be discarded from memory once it has been flushed into an SST file. After a crash, the WAL is read and applied to the IntentsDB MemTable, recovering its state with all uncommitted provisional records and some committed changes.
Write-Ahead Logging and MemTable Flush
There's one more detail. When the committed changes from IntentsDB are applied asynchronously to RegularDB, the provisional records are deleted from IntentsDB. This doesn't release memory (that's how RocksDB works; it avoids managing fragmentation), but the deleted records are discarded when the MemTable is flushed to SST files. With many small transactions, you will see IntentsDB using memory but generating few or no SST files.
However, the committed records applied to the RegularDB MemTable are not protected in the same way, as RocksDB's WAL is not used, and once the corresponding intents have been flushed from IntentsDB, their Raft log entries can be discarded. To recover from a crash, no unflushed changes must remain in the RegularDB MemTable once their intents have been flushed from IntentsDB. For this reason, when IntentsDB is flushed to release memory, RegularDB must also be flushed: YugabyteDB checks whether a RegularDB flush happened recently and otherwise forces one (intents_flush_max_delay_ms, which defaults to two seconds). Note that the Raft WAL is not discarded immediately after IntentsDB is flushed, because a minimum of two files (the default for log_min_segments_to_retain) and 15 minutes of log (the default for log_min_seconds_to_retain) are retained to be able to resolve the gap with a Raft follower after a short network failure.
Large allocated memory for IntentsDB
The following screenshot summarizes this. The memory allocation for the IntentsDB MemTable is larger than for RegularDB because the workload had many small transactions. They filled the IntentsDB MemTable until it reached the per-tablet memory threshold, which defaults to 128MB and is defined by memstore_size_mb. However, in a busy system, a flush is triggered before the per-tablet threshold is reached because of two global cache thresholds: global_memstore_size_mb_max, which defaults to 2GB, and global_memstore_size_percentage, which defaults to 10% of the T-Server's memory.
Small SST files for RegularDB
Another consequence of flushes triggered by many deleted intents is the generation of small SST files for short-transaction workloads, especially if some transactions were rolled back. Because IntentsDB is flushed frequently and mostly contains deleted provisional intents, it generates small SST files or none at all. However, because RegularDB is flushed at the same time, it also produces small SST files.
I described this to explain what you may observe and find counterintuitive if you are unaware of these implementation details. However, it doesn't mean that there is a problem. Keeping memory allocated for the deleted IntentsDB records is an optimization that avoids the overhead of managing memory fragmentation, and memory is managed by flushing tablets to release space. The screenshot above is from a test where only one tablet was actively ingesting data, which is why IntentsDB reaches the 128MB limit; in a busy database, the global cache thresholds trigger a flush before it reaches this size. It also shows that the number of SST files never exceeds six, because they are compacted in the background, limiting the read amplification caused by many small SST files.
Fast-path for single-shard transactions
Note that transactions not involving multiple nodes are optimized with a 'fast path' as the intents don't need to be synchronized through Raft. They bypass IntentsDB and write the committed changes directly to RegularDB. Here is a summary of the read-and-write transactional paths:
(Diagram: summary of the read and write transactional paths.)
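As a rough SQL-level illustration (hypothetical tables; whether the fast path applies depends on the rows all landing in a single tablet), compare a single-row write with an explicit transaction that spans shards:

```sql
-- Likely single-shard fast path: one statement writing one row, so the change
-- can be written directly to RegularDB without provisional records in IntentsDB.
INSERT INTO events (id, payload) VALUES (1001, '{"type": "click"}');

-- Distributed path: an explicit transaction whose rows may sit on different
-- tablets, so provisional records go through IntentsDB and Raft first.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```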