
Franck Pachot for YugabyteDB Distributed PostgreSQL Database


Table size in YugabyteDB, PostgreSQL and Oracle πŸ…ΎπŸ˜πŸš€

The protocol and SQL processing are the same, but the storage is different between PostgreSQL and YugabyteDB. Today, size on disk is not the most expensive resource, and it is not a big concern in a distributed database where you scale out anyway for high availability and performance. But it is good to get an idea of the storage size when you migrate across databases, especially between legacy databases (heap tables, B-Tree indexes, 8k block pages) and modern distributed ones (document store, LSM Tree, SST files).

I'm using my ybdemo lab with a minimal deployment:

git clone https://github.com/FranckPachot/ybdemo.git
cd ybdemo/docker/yb-lab
# create a minimal RF=1 deployment
sh gen-yb-docker-compose.sh minimal
# stop the demo containers
docker-compose kill yb-demo-{read,write,connect}

It is easier to look at the size on a one-node cluster. When distributing to more nodes, the total size doesn't change, and when adding high availability with replication factor RF=3, the total size is multiplied by 3. But here I'm interested in the raw size on disk.

I connect to the tserver node:

docker exec -it yb-tserver-0 bash

Generate CSV

I'll use the following functions to generate data:

  • gen_random_csv generates random alphabetic values of length $1, in $2 columns and $3 rows, to be loaded into the $4 table
  • gen_table_ddl generates the corresponding CREATE TABLE statement
  • gen_pg calls both to create the table and load the rows with COPY. PostgreSQL makes this easy, and YugabyteDB benefits from all those features while scaling them out to the distributed storage.

gen_random_csv(){
# $1: value length, $2: number of columns, $3: number of rows
# random alphabetic characters are folded into $1-character values, grouped by awk into
# $2 comma-separated columns per line, limited to $3 rows, and nl adds the id column
cat /dev/urandom | tr -dc '[:alpha:]' | fold -w $1 |
 awk '
   NR%m==1{if(NR>1)print "";d=""}
   {printf d $0 ; d=","}
   ' m=$2 |
 head -$3 | nl -s ","
}

gen_table_ddl(){
# $1: column size, $2: number of columns, $3: maximum id (used in the check constraint), $4: table name
echo -n "create table $4 ( id bigint primary key check (id<=$3)"
for i in $(seq 1 $2)
do
 echo -n ", col$i varchar($1)"
done
echo ");"
}

gen_pg(){
# generates the full script: DROP TABLE, CREATE TABLE, and COPY ... FROM STDIN followed by the CSV data
echo "drop table if exists $4;"
gen_table_ddl  $1 $2 $3 "$4"
echo "copy $4 from stdin with csv;"
gen_random_csv $1 $2 $3 "$4"
echo "\\."
}
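To preview what these functions generate, you can print a tiny script to the screen (demo_table and the small sizes are arbitrary values for this illustration):

gen_pg 5 3 2 demo_table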

YugabyteDB

load 1000 rows

As a first test, I'm loading 1000 rows with 60 columns of VARCHAR(100) to see how it works:

time gen_pg 100 60 1000 franck_varchar |
 ysqlsh -h $(hostname) -e -v ON_ERROR_STOP=ON 

Here are the metrics for this franck_varchar table:

[screenshot: DocDB metrics showing 61,000 subdocuments across 2 tablets]

1000 rows of 60 columns are stored as 61,000 subdocuments in DocDB: one per row plus one per column (1000 × (1 + 60)). They are distributed into 2 tablets (my default is 2 tablets per tserver). Each row required a seek on the primary key to check for duplicates, and this is why the YSQL interface takes longer on loads than the YCQL interface, which just appends the new version.

If you are new to YugabyteDB you may be surprised to see a size of zero for this table:
[screenshot: table size showing 0 bytes of SST files and about 6MB of WAL]
The first level of storage is a MemTable, 128MB by default. My 1000 rows fit in it, so nothing is stored in the SST files yet. Of course, this is protected by the WAL, and I can see the 6MB of loaded data there. In case of a crash, the MemTable can be recovered from it.
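If you don't want to wait for the MemTable to reach its threshold, a flush can also be forced with yb-admin (a sketch assuming the default yugabyte database used in this lab):

yb-admin --master_addresses $(echo yb-master-{0..2}:7100|tr ' ' ,) \
 flush_table ysql.yugabyte franck_varchar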

Now let's insert more data so that the MemTable will be flushed to SST files.

load 600M

I generate this to measure the raw size:

time gen_pg 100 60 100000 franck_varchar | wc -c | numfmt --to=si

Result: 607M

I run it to create and load the table:

time gen_pg 100 60 100000 franck_varchar |
 ysqlsh -h $(hostname) -e -v ON_ERROR_STOP=ON 

Here are the storage details from the http://tserver:9000/tables endpoint.
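Those details can also be fetched from the shell (a quick sketch: the page is HTML, so this is just a grep on the table name):

curl -s http://$(hostname):9000/tables | grep -i franck_varchar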

And the numbers from the filesystem show the same:

[root@yb-tserver-0 yugabyte]# du -h | grep 000030af00003000800000000000406a | sort -h
4.0K    ./data/yb-data/tserver/data/rocksdb/table-000030af00003000800000000000406a/tablet-0d3e98af78354a43bcaef9fc2e37da17.snapshots
4.0K    ./data/yb-data/tserver/data/rocksdb/table-000030af00003000800000000000406a/tablet-7bdf68bb1b324ff0baccfdb24e7d6ba0.snapshots
92K     ./data/yb-data/tserver/data/rocksdb/table-000030af00003000800000000000406a/tablet-0d3e98af78354a43bcaef9fc2e37da17.intents
92K     ./data/yb-data/tserver/data/rocksdb/table-000030af00003000800000000000406a/tablet-7bdf68bb1b324ff0baccfdb24e7d6ba0.intents
282M    ./data/yb-data/tserver/data/rocksdb/table-000030af00003000800000000000406a/tablet-0d3e98af78354a43bcaef9fc2e37da17
283M    ./data/yb-data/tserver/data/rocksdb/table-000030af00003000800000000000406a/tablet-7bdf68bb1b324ff0baccfdb24e7d6ba0
395M    ./data/yb-data/tserver/wals/table-000030af00003000800000000000406a/tablet-0d3e98af78354a43bcaef9fc2e37da17
395M    ./data/yb-data/tserver/wals/table-000030af00003000800000000000406a/tablet-7bdf68bb1b324ff0baccfdb24e7d6ba0
565M    ./data/yb-data/tserver/data/rocksdb/table-000030af00003000800000000000406a
789M    ./data/yb-data/tserver/wals/table-000030af00003000800000000000406a

Not counting the WALs, which are temporary and only there to protect the in-memory structures, the size on disk is less than the CSV size.

load 6.1 GB

Loading one million rows now:

time gen_pg 100 60 1000000 franck_varchar |
 ysqlsh -h $(hostname) -e -v ON_ERROR_STOP=ON 

The output shows the DDL, the load time on my laptop, and the time to generate the same data and only count its size:

drop table if exists franck_varchar;
DROP TABLE
create table franck_varchar ( id bigint primary key check (id<=1000000), col1 text, col2 text, col3 text, col4 text, col5 text, col6 text, col7 text, col8 tex
t, col9 text, col10 text, col11 text, col12 text, col13 text, col14 text, col15 text, col16 text, col17 text, col18 text, col19 text, col20 text, col21 text,
col22 text, col23 text, col24 text, col25 text, col26 text, col27 text, col28 text, col29 text, col30 text, col31 text, col32 text, col33 text, col34 text, co
l35 text, col36 text, col37 text, col38 text, col39 text, col40 text, col41 text, col42 text, col43 text, col44 text, col45 text, col46 text, col47 text, col4
8 text, col49 text, col50 text, col51 text, col52 text, col53 text, col54 text, col55 text, col56 text, col57 text, col58 text, col59 text, col60 text);
CREATE TABLE
copy franck_varchar from stdin with csv;
COPY 1000000

real    19m0.471s
user    15m30.699s
sys     14m54.813s

[root@yb-tserver-0 yugabyte]# time gen_pg 100 60 1000000 franck_varchar | wc -c | numfmt --to=si

6.1G

real    7m18.488s
user    10m15.511s
sys     12m7.927s

My table is stored in 6.4 GB:
[screenshot: total table size of 6.4 GB in the tserver web UI]
I get the same picture from the filesystem:

[root@yb-tserver-0 yugabyte]# du -h | grep 000030af000030008000000000004070 | sort -h
4.0K    ./data/yb-data/tserver/data/rocksdb/table-000030af000030008000000000004070/tablet-9fea5fced0924eebb7a978be7cd83132.snapshots
4.0K    ./data/yb-data/tserver/data/rocksdb/table-000030af000030008000000000004070/tablet-f6dffe42f77044deadd217ea3aeb7222.snapshots
92K     ./data/yb-data/tserver/data/rocksdb/table-000030af000030008000000000004070/tablet-9fea5fced0924eebb7a978be7cd83132.intents
92K     ./data/yb-data/tserver/data/rocksdb/table-000030af000030008000000000004070/tablet-f6dffe42f77044deadd217ea3aeb7222.intents
1.6G    ./data/yb-data/tserver/wals/table-000030af000030008000000000004070/tablet-9fea5fced0924eebb7a978be7cd83132
1.6G    ./data/yb-data/tserver/wals/table-000030af000030008000000000004070/tablet-f6dffe42f77044deadd217ea3aeb7222
3.2G    ./data/yb-data/tserver/wals/table-000030af000030008000000000004070
3.3G    ./data/yb-data/tserver/data/rocksdb/table-000030af000030008000000000004070/tablet-9fea5fced0924eebb7a978be7cd83132
3.3G    ./data/yb-data/tserver/data/rocksdb/table-000030af000030008000000000004070/tablet-f6dffe42f77044deadd217ea3aeb7222
6.5G    ./data/yb-data/tserver/data/rocksdb/table-000030af000030008000000000004070

So, basically, my data size in the database is roughly the same as my input CSV text.

WAL size

The WAL is there to protect the changes made in memory (the first level of the LSM tree is a MemTable). Its retention is defined by the following parameters, and I use the default settings:

--log_min_seconds_to_retain=900
--log_min_segments_to_retain=2
--log_segment_size_mb=64

This means that after 15 minutes without activity on this table, the WAL should shrink to 2×64MB segments per tablet, and I have two tablets. Here it is after my lunch break:
[screenshot: WAL size reduced to the retained segments after the idle period]
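These are tserver gflags, and their current values can also be checked from the web UI (a quick sketch: the /varz page returns HTML, so this is just a grep on the flag names):

curl -s http://$(hostname):9000/varz | grep -E 'log_min_seconds_to_retain|log_min_segments_to_retain|log_segment_size_mb'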

Compaction

I have 5 SST files here and no further automatic compaction will be triggered, because of the default --rocksdb_level0_file_num_compaction_trigger=5

I can run one manually:

time yb-admin --master_addresses $(echo yb-master-{0..2}:7100|tr ' ' ,)\
 compact_table_by_id 000030af000030008000000000004070 300

This will decrease to one file per tablet:
[screenshot: one SST file per tablet after the manual compaction]

In my case, the total size didn't change because I've only loaded rows without further changes, so there are no old versions to compact.
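Note that, instead of the table id, yb-admin can also compact by name (a sketch assuming the default yugabyte database used in this lab):

time yb-admin --master_addresses $(echo yb-master-{0..2}:7100|tr ' ' ,)\
 compact_table ysql.yugabyte franck_varchar 300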

PostgreSQL

I start a PostgreSQL container to compare with the same load:

docker pull postgres
docker run --name pg14 -e POSTGRES_PASSWORD=franck -d postgres
docker exec -it pg14 bash

I run the same load (with the gen_random_csv, gen_table_ddl and gen_pg functions defined above) to load 600MB:

time gen_pg 100 60 100000 franck_varchar |
 psql -U postgres -e -v ON_ERROR_STOP=ON 

This is fast:

drop table if exists franck_varchar;
DROP TABLE
create table franck_varchar ( id bigint primary key check (id<=100000), col1 varchar(100), col2 varchar(100), col3 varchar(100), col4 varchar(100), col5 varch
ar(100), col6 varchar(100), col7 varchar(100), col8 varchar(100), col9 varchar(100), col10 varchar(100), col11 varchar(100), col12 varchar(100), col13 varchar
(100), col14 varchar(100), col15 varchar(100), col16 varchar(100), col17 varchar(100), col18 varchar(100), col19 varchar(100), col20 varchar(100), col21 varch
ar(100), col22 varchar(100), col23 varchar(100), col24 varchar(100), col25 varchar(100), col26 varchar(100), col27 varchar(100), col28 varchar(100), col29 var
char(100), col30 varchar(100), col31 varchar(100), col32 varchar(100), col33 varchar(100), col34 varchar(100), col35 varchar(100), col36 varchar(100), col37 v
archar(100), col38 varchar(100), col39 varchar(100), col40 varchar(100), col41 varchar(100), col42 varchar(100), col43 varchar(100), col44 varchar(100), col45
 varchar(100), col46 varchar(100), col47 varchar(100), col48 varchar(100), col49 varchar(100), col50 varchar(100), col51 varchar(100), col52 varchar(100), col
53 varchar(100), col54 varchar(100), col55 varchar(100), col56 varchar(100), col57 varchar(100), col58 varchar(100), col59 varchar(100), col60 varchar(100));
CREATE TABLE
copy franck_varchar from stdin with csv;
COPY 100000

real    1m29.669s
user    0m32.372s
sys     1m14.037s

Be careful when looking at pg_relation_size:

postgres=# 
           select
           pg_size_pretty(pg_relation_size('franck_varchar')),
           current_setting('data_directory')
           ||'/'||pg_relation_filepath('franck_varchar');

 pg_size_pretty |                 ?column?
----------------+-------------------------------------------
 195 MB         | /var/lib/postgresql/data/base/13757/16384

How can I have 195MB to store 600MB? This is because my rows are larger than what fits in a page (PostgreSQL stores tuples in blocks), and TOAST is used to store the overflow. Let's look at my database files:

postgres=# \! du -ah /var/lib/postgresql/data/base/13757 | sort -h | tail -5
2.2M    /var/lib/postgresql/data/base/13757/16390
105M    /var/lib/postgresql/data/base/13757/16389
196M    /var/lib/postgresql/data/base/13757/16384
661M    /var/lib/postgresql/data/base/13757/16388
972M    /var/lib/postgresql/data/base/13757

Here is the related metadata in the catalog:

postgres=# 
           select pg_size_pretty(pg_relation_size(oid))
           , oid, relname, relkind, relpages, reltuples, reltoastrelid 
           from pg_class 
           where oid in (16390,16389,16384,16388)
           order by pg_relation_size(oid);

 pg_size_pretty |  oid  |       relname        | relkind | relpages | reltuples | reltoastrelid
----------------+-------+----------------------+---------+----------+-----------+---------------
 2208 kB        | 16390 | franck_varchar_pkey  | i       |      276 |    100000 |             0
 105 MB         | 16389 | pg_toast_16384_index | i       |        1 |         0 |             0
 195 MB         | 16384 | franck_varchar       | r       |    25000 |    100000 |         16388
 660 MB         | 16388 | pg_toast_16384       | t       |    84483 |   4.9e+06 |             0

The 600MB are actually in TOAST: 660MB, plus 105MB for its index. I have an additional 195MB in the table, for the non-toasted columns and the row internal attributes. And, because PostgreSQL stores rows in heap tables, the primary key needs an additional relation for the index, but this one is small: 2MB only. The total is 972MB to store 600MB of raw data. Of course, this has consequences on query performance, but that is another topic.
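To get the whole picture in one query, the pg_*_size functions can be combined (a quick sketch: pg_total_relation_size includes TOAST and all indexes):

psql -U postgres <<'SQL'
select pg_size_pretty(pg_total_relation_size('franck_varchar')) as total,
       pg_size_pretty(pg_table_size('franck_varchar'))          as table_with_toast,
       pg_size_pretty(pg_indexes_size('franck_varchar'))        as indexes;
SQL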

Looking at this, it seems that the storage in PostgreSQL, because of heap tables, pages, and tuples, is less optimal than the YugabyteDB one using LSM trees and SSTables. There are some optimizations possible with PostgreSQL, but basically this case (which I got from a user in our Slack channel) cannot benefit from compression (the threshold is 2KB of text). Of course, you can wonder whether it makes sense, in any database, to store 60 columns of 100 characters. JSON may be more appropriate if this is really a document.

Oracle Database

The initial question in the Slack channel was a comparison with Oracle.
Let's start a container:

docker pull gvenzl/oracle-xe
docker run --name ora -e ORACLE_PASSWORD=franck -d gvenzl/oracle-xe

You have to wait: an empty Oracle database is not light. Run docker logs ora until you see that the listener started successfully. Don't worry about The listener supports no services... the services will be registered dynamically.
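If you script this, a possible way to wait is to poll the logs (a sketch assuming the readiness message printed by the gvenzl/oracle-xe image):

until docker logs ora 2>&1 | grep -q "DATABASE IS READY TO USE"; do sleep 5; done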

I create the table using my gen_table_ddl defined above:

gen_table_ddl 100 60 100000 franck_varchar | sed -e '/create table/s/bigint/number/' | sqlplus system/franck@//localhost/xepdb1

I generate the CSV as I did above with gen_random_csv, but this time into a Linux named pipe to be read by SQL*Loader, as I don't think SQL*Loader can read from stdin:

mknod franck_varchar.dat p
gen_random_csv 100 60 100000 > franck_varchar.dat &

And using the SQL*Loader Express call with all defaults:

time sqlldr system/franck@//localhost/xepdb1 table=franck_varchar

This is fast:
[screenshot: SQL*Loader Express output and load timing]

Quickly checking the size from the dictionary:

sqlplus system/franck@//localhost/xepdb1

SQL>
select segment_type||' '||segment_name||': '
||dbms_xplan.format_size(bytes)
from dba_segments
where owner=user and segment_name in (
 'FRANCK_VARCHAR'
 ,(select index_name from user_indexes where table_name='FRANCK_VARCHAR'))
;

SEGMENT_TYPE||''||SEGMENT_NAME||':'||DBMS_XPLAN.FORMAT_SIZE(BYTES)
--------------------------------------------------------------------------------
TABLE FRANCK_VARCHAR: 783M
INDEX SYS_C008223: 2048K

The index is 2M, like in PostgreSQL. The heap table is 783M, a bit higher than my raw data. Note that I've left the default PCTFREE here. Those are things to take care of when you store rows in the blocks of heap tables. There are some compression options with Oracle, but for this you need to buy an Exadata to get HCC, or move your data to their cloud. The other table compression options will do nothing here, as there are no duplicate values.
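As a sanity check, the segment size can be compared with the row-length statistics from the dictionary (a sketch: it assumes statistics are gathered first, and the estimate ignores block overhead and PCTFREE):

sqlplus -s system/franck@//localhost/xepdb1 <<'SQL'
exec dbms_stats.gather_table_stats(user,'FRANCK_VARCHAR');
select num_rows, avg_row_len,
       round(num_rows*avg_row_len/1024/1024) estimated_data_mb
from user_tables where table_name='FRANCK_VARCHAR';
SQL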

What about small columns?

Here I inserted 100 characters in each column because that was the test case I had. But I know that, in YugabyteDB, storing columns as subdocuments has a per-column overhead that we don't have with variable-size rows in block-based heap tables. So I've run the same test with only one character per column, using gen_random_csv 1 60 100000.
The volume is now 13MB in the CSV:

# time gen_pg 1 60 100000 franck_varchar | wc -c | numfmt --to=si
13M

I've imported this and checked the stored volume.

Oracle:

SEGMENT_TYPE||''||SEGMENT_NAME||':'||DBMS_XPLAN.FORMAT_SIZE(BYTES)
--------------------------------------------------------------------------------
TABLE FRANCK_VARCHAR: 8192K
INDEX SYS_C008223: 2048K

8MB of data, 2MB of index: this is small. That's even a bit less than the CSV file... πŸ€” but, wait. No experimental result can be trusted without an explanation. The only thing that can be smaller is the NUMBER datatype. πŸ€¦β€β™‚οΈ Then I remembered that I had run a previous test with ALTER TABLE franck_varchar MOVE COMPRESS and didn't re-create the table. It made no difference with the random 100-character values, but now, with only 1 random character per column, there are a lot of duplicates per block.

Let's run it again with the table re-created:

SEGMENT_TYPE||''||SEGMENT_NAME||':'||DBMS_XPLAN.FORMAT_SIZE(BYTES)
--------------------------------------------------------------------------------
TABLE FRANCK_VARCHAR: 15M
INDEX SYS_C008225: 2048K

Yes, this is correct. Without compression, the volume is just a bit higher than the raw volume, plus the index.

PostgreSQL:

 pg_size_pretty |                   ?column?
----------------+----------------------------------------------
 15 MB          | /var/lib/postgresql/data/base/13757/19616663

   oid    |    relname     | reltoastrelid
----------+----------------+---------------


\! du -ah /var/lib/postgresql/data/base/13757  | sort -h | tail -5

752K    /var/lib/postgresql/data/base/13757/1255
752K    /var/lib/postgresql/data/base/13757/2840
2.2M    /var/lib/postgresql/data/base/13757/19616667
16M     /var/lib/postgresql/data/base/13757/19616663
27M     /var/lib/postgresql/data/base/13757

There is no TOAST here: 2MB for the index and 16MB for the heap table. PostgreSQL has a lot of per-row metadata to manage locks and MVCC.

YugabyteDB:

    Total: 328.32M
    Consensus Metadata: 19.7K
    WAL Files: 255.80M
    SST Files: 72.50M
    SST Files Uncompressed: 237.76M

I expected a larger size because of the per-column storage in the document store. It is not too much, thanks to the delta encoding of keys (simple prefix compression) and the default block compression (--enable_ondisk_compression=true --compression_type=Snappy). Of course, if you compare YugabyteDB with Oracle or PostgreSQL here, you see a 4x factor. But this is not what you should see in real life: storing 60 columns with only one byte in each is not a realistic design. We have all the PostgreSQL datatypes, like ARRAY or JSONB, for this. There is also an open issue for the packed columns enhancement. By the way, I'm running this on YugabyteDB 2.11.2.

Summary

I did this test on a specific case where I got a question about the size stored in YugabyteDB. Of course, all of this depends on the number of columns, the datatypes, the values, the write pattern, the database settings, the read performance requirements, and so on. Today, developers do not have time to fine-tune all of those. A key-value document store like RocksDB, which YugabyteDB tablet storage is based on, is easier. The small write amplification is balanced by the compression applied when compacting SST files. And the ability to scale out can overcome any limitation.
Anyway, modern SQL databases also support JSON documents: PostgreSQL JSONB, also available in YugabyteDB, or Oracle OSON. If you have 60 columns of text, then maybe this should go into one document.
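As an illustration, here is a quick sketch of that document alternative, run against the PostgreSQL container (franck_doc is a hypothetical table name; the same SQL runs in YugabyteDB through ysqlsh, where the size is then checked from the tserver web UI):

psql -U postgres <<'SQL'
create table franck_doc ( id bigint primary key, doc jsonb );
-- pack the 60 columns of each row into one JSONB document
insert into franck_doc select id, to_jsonb(t) - 'id' from franck_varchar t;
select pg_size_pretty(pg_total_relation_size('franck_doc'));
SQL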
