YugabyteDB uses Multi-Version Concurrency Control (MVCC) and stores all versions of a table row together. This differs from Oracle, which keeps old versions in an undo tablespace organized by transaction identifier, and from PostgreSQL, where row versions are scattered throughout the table. Both of those databases store rows directly in fixed-size blocks, typically 8KB, which leaves no room for a variable-size version history next to the row.
In YugabyteDB, data is stored in Log-Structured Merge (LSM) trees, where each key addresses a document that is not limited in size. The version timestamp is appended to the end of the key and sorted in descending order, so a new version is inserted on top of the existing versions of that key. When querying a row by its primary key, a single seek is enough to find it.
This applies to packed rows: when an SQL insert writes many columns, all column values are stored under a single key. When a column is later updated, YugabyteDB adds a new version for that column value only. The key for this new version combines the row's key, the column identifier, and the MVCC timestamp. The row is re-packed only during a compaction that runs after the MVCC retention period. This is a significant optimization over PostgreSQL, which copies the entire row for each update.
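To make this concrete, here is a minimal sketch with a hypothetical two-column table t; the key notation is purely schematic and not the actual DocDB encoding:
-- Illustration only: hypothetical table, schematic keys (not the real DocDB encoding)
create table t (id int primary key, c1 int, c2 int);
insert into t values (1, 10, 20);
-- the insert is stored as a single packed key/value pair:
--   (id=1, T1)             -> { c1: 10, c2: 20 }
update t set c2 = 42 where id = 1;
-- the update adds one column-level version on top of the packed row:
--   (id=1, column c2, T2)  -> 42
-- both entries are merged back into one packed row at the next compaction,
-- once the old version is past the MVCC retention period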
Column-level versions can cause row fragmentation when multiple columns are updated frequently. Here is an example.
I will create a table with a thousand columns to emphasize the effect, and use the following query to generate the column lists for the SELECT, CREATE TABLE, and UPDATE clauses, plus a JSON document used later:
yugabyte=# \x
Expanded display is on.
yugabyte=# select
string_agg(format('c%s' ,n),',') as s ,
string_agg(format('c%s float' ,n),',') as c ,
string_agg(format('c%s=42',n),',') as u ,
'{'||string_agg(format('"c%s":42',n),',')||'}' as j
from generate_series(1,10) n;
-[ RECORD 1 ]---------------------------------------------------------------------------------
s | c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
c | c1 float,c2 float,c3 float,c4 float,c5 float,c6 float,c7 float,c8 float,c9 float,c10 float
u | c1=42,c2=42,c3=42,c4=42,c5=42,c6=42,c7=42,c8=42,c9=42,c10=42
j | {"c1":42,"c2":42,"c3":42,"c4":42,"c5":42,"c6":42,"c7":42,"c8":42,"c9":42,"c10":42}
yugabyte=# \gset
yugabyte=# \x
Expanded display is off.
For readability, the output above shows only ten columns; for the rest of the demo, I ran the same query with one thousand columns.
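For completeness, the one-thousand-column lists come from the same query with only the generate_series() bound changed, stored into psql variables with \gset:
select
string_agg(format('c%s' ,n),',') as s ,
string_agg(format('c%s float' ,n),',') as c ,
string_agg(format('c%s=42',n),',') as u ,
'{'||string_agg(format('"c%s":42',n),',')||'}' as j
from generate_series(1,1000) n
\gset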
I created the table using the generated :c variable:
yugabyte=# create table demo (
a int, b int,
primary key (a, b),
:c,
d jsonb
);
CREATE TABLE
I inserted one thousand rows:
yugabyte=# insert into demo(a,b) select generate_series(1,1000),1;
INSERT 0 1000
Here is an update statement that updates those thousand columns for one row. I ran it five times:
explain (analyze, dist, debug, summary off)
update demo set :u where a=1
\watch c=5
Here is the execution plan with all statistics, especially the LSM tree seek and next operations displayed by the debug option of explain analyze:
Sun 01 Dec 2024 03:59:52 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.535..2.535 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.017..2.025 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.482 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 1.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 1.000
Metric rocksdb_iter_bytes_read: 8098.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 210.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 03:59:54 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.911..2.911 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.381..2.390 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.782 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 1001.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 1001.000
Metric rocksdb_iter_bytes_read: 68524.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 381.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 03:59:56 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=3.101..3.101 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.523..2.532 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.943 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 2001.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 2001.000
Metric rocksdb_iter_bytes_read: 128950.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 439.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 03:59:58 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=3.467..3.467 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.971..2.980 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 2.311 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1001.000
Metric rocksdb_number_db_next: 2001.000
Metric rocksdb_number_db_seek_found: 1001.000
Metric rocksdb_number_db_next_found: 2001.000
Metric rocksdb_iter_bytes_read: 181376.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 768.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 04:00:00 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=3.575..3.575 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=3.054..3.063 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 2.264 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1001.000
Metric rocksdb_number_db_next: 2001.000
Metric rocksdb_number_db_seek_found: 1001.000
Metric rocksdb_number_db_next_found: 2001.000
Metric rocksdb_iter_bytes_read: 181376.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 854.000, count: 1.000
(14 rows)
- The first update finds the row to update with one seek and one next operation. It then inserts 1000 new column values because 1000 columns were updated.
- The second update also locates the row with one seek, but it must read the 1000 column versions to reconstruct the tuple as of the read time, which adds 1000 next operations. This is not expensive, since it reads contiguous keys (the same reconstruction cost applies to any read of the row, as shown after this list).
- In the third update, there are two versions of each column. Although only one version is needed, it is faster to read through the versions with next than to skip ahead with a seek, which is more CPU-intensive because it requires key comparisons. This behavior is controlled by the max_nexts_to_avoid_seek parameter, which defaults to two.
- The fourth update encounters three versions of each column. Following the max_nexts_to_avoid_seek parameter, it switches to a seek after reading two keys with next, which adds 1000 seeks.
- The fifth and subsequent updates follow the same pattern. This continues until the old versions go beyond the MVCC retention period (timestamp_history_retention_interval_sec, 900 seconds by default) and a compaction removes them and re-packs the row.
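This read amplification is not specific to UPDATE: any statement that reads the fragmented row must go through the same versions. A quick way to check (output not shown here) is to run the same explain options on a plain SELECT after the five updates; the debug metrics should show a similar seek and next pattern:
explain (analyze, dist, debug, summary off)
select * from demo where a = 1 and b = 1;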
If you observe this pattern, with many columns being updated together, it likely originates from an attempt to store documents in an SQL database prior to the introduction of JSON and JSONB data types, which made document storage significantly easier.
The demo table was created earlier with a JSONB column, d. To illustrate the difference, I will update it with a JSON document containing one thousand attributes.
yugabyte=# truncate table demo;
TRUNCATE TABLE
yugabyte=# insert into demo(a,b) select generate_series(1,1000),1;
INSERT 0 1000
yugabyte=# explain (analyze, dist, debug, summary off)
update demo set d=:'j' where a=1
yugabyte-# \watch c=5
The result has fewer seek and next operations because only one column is updated, even if it holds one thousand attributes.
Sun 01 Dec 2024 04:16:36 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.197..2.197 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.069..2.076 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.513 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 1.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 1.000
Metric rocksdb_iter_bytes_read: 8098.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 225.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 04:16:38 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.342..2.342 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.146..2.154 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.561 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 2.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 2.000
Metric rocksdb_iter_bytes_read: 28047.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 294.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 04:16:40 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.470..2.470 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.342..2.349 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.690 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 3.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 3.000
Metric rocksdb_iter_bytes_read: 47996.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 252.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 04:16:42 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.409..2.409 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.273..2.280 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.614 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 2.000
Metric rocksdb_number_db_next: 3.000
Metric rocksdb_number_db_seek_found: 2.000
Metric rocksdb_number_db_next_found: 3.000
Metric rocksdb_iter_bytes_read: 67937.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 229.000, count: 1.000
(14 rows)
Sun 01 Dec 2024 04:16:44 AM GMT (every 2s)
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.450..2.450 rows=0 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..15.25 rows=100 width=16136) (actual time=2.312..2.323 rows=1 loops=1)
Index Cond: (a = 1)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.652 ms
Storage Table Rows Scanned: 1
Storage Table Write Requests: 1
Metric rocksdb_number_db_seek: 2.000
Metric rocksdb_number_db_next: 3.000
Metric rocksdb_number_db_seek_found: 2.000
Metric rocksdb_number_db_next_found: 3.000
Metric rocksdb_iter_bytes_read: 67937.000
Metric docdb_keys_found: 1.000
Metric ql_read_latency: sum: 234.000, count: 1.000
(14 rows)
Normalization in an SQL database doesn't always mean that all information needs to be split into multiple columns. When a set of attributes is often updated together, it can be more efficient to store them within a single JSON document. This approach aligns with MongoDB's golden rule: "Data that is accessed together should be stored together." In a document database, these attributes are typically embedded within a single document, and this same principle can be applied to an SQL database.
You can still index specific parts of that document for better performance.
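For example, an expression index on one attribute of the JSONB document keeps lookups on that attribute efficient. A minimal sketch on the demo table (the index name and the choice of attribute c1 are arbitrary):
-- Sketch: index a single attribute of the JSONB document
create index demo_d_c1 on demo ((d->>'c1'));
-- lookups on that attribute can then use the index
select a, b from demo where d->>'c1' = '42';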