marijaselakovic for CrateDB

Guide to write operations in CrateDB

In our previous article, we gave a general overview of the storage layer in CrateDB. Every shard in CrateDB represents a Lucene index that is broken down into segments stored in the filesystem. A Lucene segment can be seen as a sub-index that can be searched independently.

When new records are written in CrateDB, Lucene first creates in-memory segments before flushing them to disk. In this article, our goal is to give you a thorough understanding of how CrateDB writes new records. With that in mind, we will go through the basic concepts of Lucene, such as Lucene segments and the refresh and flush operations, and introduce the translog, which guarantees that write operations are persisted to disk.

Lucene segments
A Lucene segment is a part of the logical Lucene index, and a Lucene index maps 1:1 to a CrateDB shard. It is an independent index that can contain inverted indexes, k-d trees, and doc values. A Lucene segment can be searched independently of other segments, and documents in segments are immutable. Every time a field of a document is updated, the document is flagged as deleted in the segment it belonged to, and the updated document is added to a new segment. The same behavior applies when a document is deleted in CrateDB. All subsequent queries skip documents that were previously marked as deleted.

Lucene index and segments

To keep the number of segments manageable, Lucene periodically merges them according to a merge policy. When segments are merged, documents that are marked as deleted are discarded, and the newly created segments contain only valid, non-deleted documents from the original segments. A merge is triggered when new documents are added and, perhaps surprisingly, it can result in a smaller index size. To merge the segments of a table or a partition on demand, you can use the OPTIMIZE TABLE command in CrateDB. When the max_num_segments parameter is set to 1, CrateDB fully merges the table or partition for optimal query performance. For more details on table optimization in CrateDB, check out our documentation.
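
For example, assuming a hypothetical table named metrics (with a hypothetical partition column day for the partitioned variant), a full merge could be requested like this:

```sql
-- Merge all segments of the table into a single segment
OPTIMIZE TABLE metrics WITH (max_num_segments = 1);

-- For a partitioned table, a single partition can be optimized on its own
OPTIMIZE TABLE metrics PARTITION (day = '2023-06-01') WITH (max_num_segments = 1);
```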

Retention period

To speed up node recovery, CrateDB needs to replay operations that were executed on one shard on its replicas residing on other nodes. To preserve recent deletions within the Lucene index, CrateDB supports the concept of soft deletes. Because of that, deleted documents still take up disk space, which is why CrateDB keeps only recently deleted documents. The retention lease period, which defines how long deleted documents have to be preserved, can be customized. CrateDB discards deleted documents only after this period expires; this means that if a merge takes place before the expiration, deleted documents still remain physically available. The default retention lease period is 12 hours.
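
The retention lease period can be adjusted per table. As a sketch, assuming a table named metrics and the soft_deletes.retention_lease.period setting (please verify the exact setting name against the CrateDB documentation for your version):

```sql
-- Keep soft-deleted documents for 24 hours instead of the default 12 hours
ALTER TABLE metrics SET ("soft_deletes.retention_lease.period" = '24h');
```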

Writing data to CrateDB

The following diagram illustrates how a new record is stored in CrateDB. When a new document arrives, it is first written to an in-memory buffer and appended to the translog (step 1). Documents in the memory buffer become in-memory Lucene segments once the refresh operation takes place (step 2). CrateDB executes a refresh operation every second, provided a search request was received within the last 30 seconds. Eventually, Lucene commits the new segments to disk (step 3). Once the data is stored on disk, merge operations are triggered from time to time and some of the segments are merged.

Flow of write operation
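
As a minimal sketch of this flow (the table named metrics and its columns are hypothetical), a single INSERT goes through these steps:

```sql
-- Step 1: the new row is buffered in memory and appended to the shard's translog.
-- Step 2: it becomes visible to queries only after the next refresh.
INSERT INTO metrics (ts, value) VALUES (now(), 42.0);
```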

Translog

The translog stores write operations for documents that are still in memory and have not yet been committed. A translog exists for each shard and is stored on the physical disk. In case of a node failure, CrateDB can recover potentially lost data by replaying the operations from the translog (step 4).

The data in the translog is persisted to disk when the translog is fsynced and committed. By default, CrateDB fsyncs the translog after every write operation. Alternatively, the translog can be fsynced every translog.sync_interval instead; which of the two behaviors applies is controlled by the translog.durability parameter, as described in the list below. A configuration sketch follows the list.

The following parameters control the behavior of the translog:

  • translog.flush_threshold_size sets the size that the transaction log of operations not yet safely persisted in Lucene may reach before a flush is triggered. This prevents recoveries from taking too long.

  • translog.interval sets how often CrateDB checks whether a flush is needed.

  • translog.durability can be set to REQUEST or ASYNC. When set to REQUEST (the default), the translog is fsynced after every operation; when set to ASYNC, it is fsynced every translog.sync_interval.

  • translog.sync_interval sets how often the translog is fsynced to disk. The default value is 5s. This setting only takes effect if translog.durability is set to ASYNC.
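
As a sketch of how these parameters could be set on the hypothetical metrics table (the exact value formats should be checked against the CrateDB settings documentation):

```sql
-- Fsync the translog asynchronously every 10 seconds instead of on every operation,
-- and trigger a flush once the translog grows beyond 256 MB
ALTER TABLE metrics SET (
    "translog.durability" = 'ASYNC',
    "translog.sync_interval" = '10s',
    "translog.flush_threshold_size" = '256mb'
);
```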

Refresh

Between CrateDB and the disk sits RAM: new records are first buffered in memory before being written to a new segment. The refresh operation makes in-memory segments available for search. Furthermore, a table can be refreshed explicitly to ensure that its latest state is queried; in CrateDB this is done with the REFRESH TABLE command. However, the refresh operation does not guarantee durability: to address persistence, CrateDB relies on the translog.
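
For example, an explicit refresh of the hypothetical metrics table guarantees that all writes performed so far are visible to the following query:

```sql
-- Make the latest writes searchable before counting
REFRESH TABLE metrics;

SELECT count(*) FROM metrics;
```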

If not done explicitly, the table is refreshed at a specified refresh interval. The default value is one second, but the interval can be changed with the table parameter refresh_interval. If no query accesses the table during the refresh interval, the table becomes idle. When the table is idle, it is not refreshed until the next query arrives; that query first refreshes the table and then executes, which also re-enables the periodic refresh.
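
The periodic interval can be adjusted per table; in CrateDB the refresh_interval value is given in milliseconds (the table name here is again hypothetical):

```sql
-- Refresh the table every 5 seconds instead of the default 1 second
ALTER TABLE metrics SET ("refresh_interval" = 5000);
```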

Flush

A flush operation triggers a Lucene commit: it writes the segments permanently to disk and clears the transaction log. Writing segments to disk is expensive, so flushes take place at less frequent intervals than refreshes. A flush happens automatically based on the translog configuration described in the previous section.

Summary

To summarize, this article provides an overview of how data is written to CrateDB. The following properties of CrateDB are important to be aware of:

  • A Lucene index is organized into segments, which are first built and kept in memory and later flushed to disk. This influences when data becomes available on disk.

  • In-memory segments become searchable after the refresh operation.

  • Documents are immutable and deleted documents are not discarded until a merge takes place.

  • Segments are occasionally merged and deleted documents are discarded.

  • If translog.durability is set to REQUEST, the translog is flushed after every operation. Otherwise, if set to ASYNC, the translog is flushed when certain configurable conditions are met.

  • CrateDB maintains the transaction log of operations on each shard for recovery in case of a node crash.

We hope you found this topic interesting. Do you want to learn more about CrateDB? Take a look at our documentation and in case of any questions, check out our CrateDB Community. We're happy to help!
