Franck Pachot for YugabyteDB

Posted on May 28, 2024 • Edited on Dec 14, 2024

The Log Is (not) The Database

#yugabytedb #distributed #postgres #sql

There is a common saying "The Log Is The Database", to explain some database innovations in cloud-native environments. For example, Kafka stores the events that modify data, Amazon Aurora's database instances send only the Write-Ahead Logging to the storage, and modern databases use LSM Tree to append all changes to a log. I dislike this saying for two reasons:

First, there's no real innovation. In traditional databases, your changes are first saved in the redo log, transactional log, or write-ahead log (WAL), and once they are saved, they are considered durable. "Write-Ahead Logging" means exactly that it is written first, and it's not new, but was described in the ARIES paper back in 1992.
Second, if your database consists only of a log, then it's not really a database but more of a sequentially written file. The primary function of databases is to process data, and a log file isn't efficient for much else besides disaster recovery. Additionally, a true database allows multiple users to read and write at the same time, managing concurrency, which often requires memory structures that aren't written to the log (except for higher resilience to failure, like YugabyteDB does with Raft).

Distributed SQL databases like YugabyteDB are closer to "The Log Is The Database" because the log is used to distribute and replicate before writing to memory structures. However, this would still ignore the fact that the log must be sorted and compacted to avoid read amplification when processing data.

I selected this title because when you consider the write-path only, the distributed log serves as the database for YugabyteDB. All data manipulation (including reading and writing intents, managing locks for consistency, and transaction control) is distributed to LSM Trees. Additionally, they are distributed over the network using a shared-nothing architecture, which is why I illustrated it with a mousepad from the previous century that I found in a drawer when working at CERN: "The Network Is The Computer."

YugabyteDB utilizes PostgreSQL code for the query layer. This layer is stateless and operates on all nodes. Write operations, and some read intents, are converted into a log of write requests that are sharded based on their key (the primary key for the tables, the indexed columns for secondary indexes), sent in batches to the storage layer, replicated as a Raft log, and then stored in LSM Trees.

Enough talking, let's look at it. I created a table on YugabyteDB cluster:

yugabyte=# create table demo (
  id bigserial primary key, a int, b int, c int
) split into 1 tablets;
CREATE TABLE

I forced it to a single tablet to make it easer to look at it.

I query the YB-Master Web Console /dump-entities endpoint to get the table UUID (more details here):

# curl -sL http://yb0.pachot.net:7000/dump-entities |
    jq -r '.tables[] | select(.table_name == "demo") '
{
  "table_id": "000033c000003000800000000000408e",
  "keyspace_id": "000033c0000030008000000000000000",
  "table_name": "demo",
  "state": "RUNNING"
}

The same endpoint exposes the details about the tablets.
I extend my JQ script to find the tablet leader identifier:

curl -sL http://yb0.pachot.net:7000/dump-entities |
 jq -r --arg table "demo" -r '
  . as $input |
  (
    $input.tables[] |
    select(.table_name == $table ) |
    .table_id
  ) as $table_id |
  $input.tablets[] |
  select(.table_id == $table_id and .state == "RUNNING" ) |
  .leader as $leader |
  {
    $table, table_id, tablet_id,
    leader: (
      .replicas[] |
      select(.server_uuid == $leader)
    )
  }
'

{
  "table": "demo",
  "table_id": "000033c000003000800000000000408e",
  "tablet_id": "4bccaaaa0fc3486ea565ccc18e325122",
  "leader": {
    "type": "VOTER",
    "server_uuid": "104130d300d64c9f9ed5df25823cd121",
    "addr": "10.0.0.39:9100"
  }
}

With this information, I connect to the node (10.0.0.39) which stores this tablet peer so that I can look at the files.

I find the WAL (Write Ahead Log) for this tablet:

# ls -t $(find / -regex '.*/yb-data/tserver/wals/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/wal-[0-9]+')

/home/opc/10.0.0.39/var/data/yb-data/tserver/wals/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/wal-000000001

As my database is not encrypted (it's a lab) I can look at the content of the WAL:

# log-dump /home/opc/10.0.0.39/var/data/yb-data/tserver/wals/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/wal-000000001

replicate {
  id {
    term: 1
    index: 1
  }
  hybrid_time: HT{ days: 19869 time: 20:46:57.683178 }
  op_type: NO_OP
  size: 26
  id { term: 1 index: 1 } hybrid_time: 7031834286830297088 op_type: NO_OP committed_op_id { term: 0 index: 0 } noop_request { }
}
replicate {
  id {
    term: 2
    index: 2
  }
  hybrid_time: HT{ days: 19869 time: 20:46:58.648734 }
  op_type: NO_OP
  size: 26
  id { term: 2 index: 2 } hybrid_time: 7031834290785214464 op_type: NO_OP committed_op_id { term: 1 index: 1 } noop_request { }
}

There is no data because my table is empty.

INSERT

I insert one row:

yugabyte=# insert into demo ( a, b, c ) values (1, 1, 1);
INSERT 0 1

I look at the WAL again and see one write:

replicate {
  id {
    term: 2
    index: 3
  }
  hybrid_time: HT{ days: 19869 time: 21:39:37.331124 }
  op_type: WRITE_OP
  size: 134
  write {
    unused_tablet_id:
    write_batch {
      write_pairs_size: 1
      write_pairs {
        Key: SubDocKey(DocKey(0xeda9, [1], []), [])
        Value: Not found (yb/docdb/kv_debug.cc:114): No packing information available
      }
    }
  }
}

The log is structured as a key-value where the value has documents (for table rows and index entries) with sub-documents (for the column values).
Here, the key is my primary key, the value 1, which I inserted into id. Because it is hash sharded (by default YugabyteDB sets the first column of the primary key as HASH) the hash code is added in front of the key (you can run select to_hex(yb_hash_code(1::bigint)) to verify that it is 0xeda9)

The sub-document value is not visible here because it is packed, and I didn't provide the metadata. Packed rows is an optimization for storing all column values into a single SubDocument, for faster INSERTs.

UPDATE

I update one column:

yugabyte=# update demo set a=2, b=2;
UPDATE 1

The WAL shows a new write with two sub-documents, one for each column value (this is better than PostgreSQL that copies the whole row when you update a single bit):

replicate {
  id {
    term: 2
    index: 4
  }
  hybrid_time: HT{ days: 19869 time: 21:41:01.749055 }
  op_type: WRITE_OP
  size: 238
  write {
    unused_tablet_id:
    write_batch {
      write_pairs_size: 2
      write_pairs {
        Key: SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1)])
        Value: 2
      }
      write_pairs {
        Key: SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(2)])
        Value: 2
      }
    }
  }
}
replicate {
  id {
    term: 2
    index: 5
  }
  hybrid_time: HT{ days: 19869 time: 21:41:01.752862 }
  op_type: UPDATE_TRANSACTION_OP
  size: 94
  update_transaction {
    transaction_id: 23074c91-3f86-4f28-b8c3-70295392c63b
    status: APPLYING
    tablets: ea42dda9f4634e9bb5193382ce41bf74
    commit_hybrid_time: HT{ days: 19869 time: 21:41:01.752154 }
    sealed: 0
  }
}

The log also holds information about the transaction because it's a shared-nothing architecture: all states must go through the network.

DELETE

I delete this row:

yugabyte=# delete from demo;
DELETE 1

There's a new write to the log marking the end of life of the row with the DEL marker, often called a "tombstone":

replicate {
  id {
    term: 2
    index: 6
  }
  hybrid_time: HT{ days: 19869 time: 21:42:13.005451 }
  op_type: WRITE_OP
  size: 183
  write {
    unused_tablet_id:
    write_batch {
      write_pairs_size: 1
      write_pairs {
        Key: SubDocKey(DocKey(0xeda9, [1], []), [])
        Value: DEL
      }
    }
  }
}
replicate {
  id {
    term: 2
    index: 7
  }
  hybrid_time: HT{ days: 19869 time: 21:42:13.007993 }
  op_type: UPDATE_TRANSACTION_OP
  size: 94
  update_transaction {
    transaction_id: 1e334b0c-d9c2-4ea0-a55e-212a3282e011
    status: APPLYING
    tablets: b1298c45b85a4475be2123b270655d82
    commit_hybrid_time: HT{ days: 19869 time: 21:42:13.007542 }
    sealed: 0
  }
}

This log can be used to reconstruct all versions of the table's rows at any point in time for the MVCC (Multi-Value Concurrency Control) retention. The same content is stored in memory, the LSM Tree MemTable, as soon as the replicated log gets consensus from the quorum, and processing data from it is efficient. The WAL file is used only to recover it, send the changes to a lagging replica, or for asynchronous replication or change data capture. At this point, "The Log Is The Database", and there's no presence of my table's data elsewhere on disk.

As the table grows, it may no longer fit entirely in memory, and processing data from the WAL file would be inefficient. To address this, the MemTable is flushed to an SST File where the values are ordered by key (instead of by time in the WAL), allowing for efficient point or range queries. This marks the end of "The Log Is The Database," and the database is now stored in files optimized for data retrieval. The WAL can be discarded as soon as it passed the retention for asynchronous replication or follower's gap resolution.

Flush

I want to keep my table small for this demo but I can force a flush:

$ yb-ts-cli --server-address 10.0.0.39:9100 flush_tablet 4bccaaaa0fc3486ea565ccc18e325122
Successfully flushed tablet <4bccaaaa0fc3486ea565ccc18e325122>

I find the SST file:

$ ls -t $(find / -regex '.*/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/[0-9]+.sst')

/home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst

I can read it with sst_dump:

$ sst_dump --command=scan --file=/home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst --output_format=decoded_regulardb

from [] to []
Process /home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst
Sst file format: block-based
SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759733007542 }]) -> DEL
SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759577331124 }]) -> PACKED_ROW[0](050000000A0000000F000000480000000148000000014800000001)
SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1); HT{ physical: 1716759661752154 }]) -> 2
SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(2); HT{ physical: 1716759661752154 w: 1 }]) -> 2

I still see all versions, with four sub-documents: the tombstone (DEL), the two columns with new values, and the initial packed row. This is still a log of all changes, but it is ordered by the key before the time. When the table grows, you will have multiple flushes and multiple SST files which will be merged on read. LSM Tree means Log Structure Merge Tree: it is a log-structure of sorted runs that can be merged when iterating on the key.

To see what is inside the packed row, I can provide the metadata (description of the columns) to the SST Dump tool. The metadata is stored in another directory:

ls -t $(find / -regex '.*/yb-data/tserver/tablet-meta/4bccaaaa0fc3486ea565ccc18e325122')

/home/opc/10.0.0.39/var/data/yb-data/tserver/tablet-meta/4bccaaaa0fc3486ea565ccc18e325122

I pass it as a --formatter_tablet_metadata option:

$ sst_dump --command=scan --file=/home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst --output_format=decoded_regulardb --formatter_tablet_metadata=/home/opc/10.0.0.39/var/data/yb-data/tserver/tablet-meta/4bccaaaa0fc3486ea565ccc18e325122

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 21:56:39.871042 321471 kv_formatter.cc:35] Found info for table ID 000033c000003000800000000000408e (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name demo, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock
from [] to []
Process /home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst
Sst file format: block-based
SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759733007542 }]) -> DEL
SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759577331124 }]) -> { 1: 1 2: 1 3: 1 }
SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1); HT{ physical: 1716759661752154 }]) -> 2
SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(2); HT{ physical: 1716759661752154 w: 1 }]) -> 2

The sub-document with the lowest time shows all the column values at the time of the insert. To allow consistent reads, the versions are ordered on the Hybrid Logical Clock, the cluster time, rather than the RocksDB sequence.

Compact

Merging from too many SST files would lower the read performance, but this read amplification is limited by background compaction. In addition to merging, the compaction can remove the intermediate versions when they are beyond the MVCC retention, to lower the space amplification.

I waited 15 minutes, the default MVCC retention (set by --timestamp_history_retention_interval_sec=900) and forced a compaction:

$ yb-ts-cli --server-address 10.0.0.39:9100 compact_tablet 4bccaaaa0fc3486ea565ccc18e325122
Successfully compacted tablet <4bccaaaa0fc3486ea565ccc18e325122>

This writes new SST files and discards the old ones (except if they are used by an active snapshot for Point In Time Recovery or Thin Clones). I look for the new SST files:

$ ls -t $(find / -regex '.*/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/[0-9]+.sst')

In my special case, where I deleted all the rows, there are no remaining SST files. They will be created when new data is inserted, logged, and flushed.

Where is the database?

No database should only serve as a log. Writing to the log first has two reasons:

it is fast to persist to disk, with sequential writes rather than random writes
it allows recovery of the database files by re-playing the changes

Thanks to these two properties, additional structures can be maintained in memory (Shared Buffer Pool in monolithic databases, MemTable in distributed LSM Trees) to sort them on the key before the timestamp, for faster retrieval. However, a large part of the database will be written to disk, with a checkpoint from the shared buffer pool, or a flush from the distributed MemTables. When Amazon Aurora says that there's no checkpoint happening, they refer to what happens on the single writer instance, but the WAL is applied to the blocks in the storage servers. All databases apply the log to materialize the current state, or a recent state.

In a cloud-native database, the computer is the network, and the log is the database, but that applies only to the write path. For efficient SQL processing, the data files act as the database, the cache is above it for faster access to the frequently read dataset and transaction control. Once written, the log is used to protect memory structures and roll forward to a consistent point after a point-in-time recovery. Note that all SQL DML needs to read before writing, to detect duplicate keys, check foreign keys, and be aware of locked rows, so even SQL writes used more than the log. Finally, the log is not the database, but only the safety net to avoid data loss in case of failure.

DEV Community