Johannes Lichtenberger

Posted on Aug 28, 2019 • Edited on Sep 4, 2019

SirixDB - Architecture of an Evolutionary, Temporal Database System

#versioning #json #nosql #database

Introduction

SirixDB is a temporal database system and never overwrites data. Every time you're committing a transaction, SirixDB creates a new lightweight snapshot. It uses a log-structured copy-on-write approach, whereas versioning takes place at the page- as well as node-level. Let's first define what a temporal database system is all about.

A temporal database is capable of retrieving past states. Typically it stores the transaction time, that is the time a transaction commits data. If the valid time is also stored, that is when a fact is true in the real world, we have a bitemporal relation, which is two time axes.

Questions such as the following might be easily answered: Give me last month's history of the Dollar-Pound Euro exchange rate. What was the customer's address on July 12th in 2015 as it was recorded back in the day? Did they move or did someone correct an error? Did we have errors in the database, which were corrected later on?

Let's turn our focus towards the question of why historical data has not been retained in the past. We postulate that new storage advances in recent years present possibilities, to build sophisticated solutions to help answer those questions without the hurdle, state-of-the-art systems bring.

Advantages and disadvantages of flash drives as for instance SSDs

As Marc Kramis points out in his paper "Growing Persistent Trees into the 21st Century":

The switch to flash drives keenly motivates to shift from the "current state" paradigm towards remembering the evolutionary steps leading to this state.

The main insight is that flash drives as for instance, SSDs, which are common nowadays have zero seek time while not being able to do in-place modifications of the data. Flash drives are organized into pages and blocks. Due to their characteristics, they are able to read data on a fine-granular page-level, but can only erase data at the coarser block-level. Furthermore, blocks first have to be erased before they can be updated. Thus, updated data is written to another place. A garbage collector marks the data, which has been rewritten to the new place as erased at the previous block location, such that new data can be stored in the future. Metadata to find the data at the new location is updated.

Evolution of state through fine-grained modifications

Furthermore, Marc points out, that those small modifications, usually involves writing not only the modified data but also all other records in the modified page as well as a number of pages with unmodified data. Traditional spinning disks require clustering due to slow random reads of traditionally mechanical disk head seek times. This clearly is an undesired effect.

Instead, from a storage point of view, it is desirable only to store the changes. As we'll see it boils down to a trade-off between read and write performance, which is having to reconstruct a page in-memory from scattered incremental changes or having to store more records than necessarily have changed.

How we built an Open Source storage system based on these observations from scratch

SirixDB stores per revision and per page deltas. Due to zero seek time of flash drives, SirixDB does not have to cluster data. It only ever clusters data during transaction commits. It is based on append-only, log-structured storage. Data is never modified in-place.

Instead, database pages are copied to memory, updated and synced to a file in batches by means of a postorder traversal of the internal tree-structure during a transaction commit.

The page-structure is heavily inspired by the operating system ZFS. We used some of the ideas to store and version data on a sub-file level. We'll see that Marc Kramis came up with a novel sliding snapshot algorithm to version record pages, based on observed shortcomings of versioning approaches from backup systems.

Node layer/Tree-encoding

The encoding of the underlying tree structure of both XML and JSON documents in SirixDB uses a pointer-based approach.

SirixDB doesn't use range-encodings (not update-friendly) or hierarchical labels (B+-tree index-structure traversal might be too expensive). However, we can specify that SirixDB stores hierarchical labels (DeweyIDs) for XML-resources to provide fast document order determination.

Instead of the aforementioned encodings, a node in SirixDB references other nodes by a firstChild/leftSibling/rightSibling/parentNodeKey/nodeKey encoding. Think of it as a persistent DOM:

The numbers in the figure are auto-generated unique, stable node-IDs or node-keys generated with a simple sequential number generator.

Every structural node might have a first child, a left sibling, a right sibling, and a parent node.

Namespaces and attributes are the only non-structural nodes in XML resources. They have a parent pointer. If the transactional cursor points to an element node, you can reference these nodes through special moveToAttribute and moveToNamespace.

In the JSON-to-tree mapping, however, every node is a structural node. To support fine granular versioning of nodes and to be able to reuse the axis-implementations, SirixDB uses the same encoding for JSON resources as we've seen.

Note that the binary JSON-format in SirixDB allows ordered, duplicate object record keys. Upper layers, however, may store object records in a hash map, thus not keeping track of the order nor supporting duplicate record keys.

Page structure

The page-structure for one revision is depicted in the following figure:

We don't want to go into the details, but the main point from the figure is that a resource is stored in a huge array-based trie. Committing a new UberPage version is atomic and written last as we'll see. A RevisionRootPage denotes the main entry point into a revision.

Each node and revision in SirixDB is referenced by a unique, stable identifier. First, SirixDB has to find the revision by its revision number traversing a tree of indirect-pages. Addressing nodes is done in exactly the same manner.

An additional top layer IndirectPage is written if the number of nodes or revisions cannot be referenced by the maximum number addressable with the current level.

Transaction commit

The next figure depicts what happens during a transaction-commit:

We assume that a record has been modified in the leftmost RecordPage. Depending on the versioning algorithm used by SirixDB, the modified record, as well as probably some other records of the page, are copied to a new page fragment. First, all changes are stored in an in-memory transaction (intent) log, which can be persisted, if needed. Second, during a transaction commit, the page-structure of the current RevisionRootPage is serialized in a postorder traversal.

All changed RecordPages are written to persistent storage, starting with the leftmost. If other changed record pages exist underneath an indirect page, these are serialized before the IndirectPage, which points to the updated record pages. Then the IndirectPage which points to the updated revision root page is written. The indirect pages are written with updated references to the new persistent locations of the record pages. SirixDB also stores checksums in the parent pointers as in ZFS, such that the storage in the future is able to detect data corruption and heal itself, once we partition and especially replicate the data. The whole page-structure is serialized in this manner. We also want to store an encryption key in the references in the future, to support encryption at rest.

Note that SirixDB has to update the ancestor path of each changed RecordPage. However, storing indirect pages as well as the RevisionRootPage, CASPage, PathSummaryPage, and the PathPage is cheap. We currently store copies of the NamePages, but in the future might also version these according to the chosen versioning algorithm. Thus, we do not need to copy the whole dictionaries and save storage costs thereof. Each reference, which doesn't point to a new page or page-fragment, is left unchanged. Thus, unchanged pages (which are also not on the ancestor-path of changed pages) are simply referenced at their respective position in the former revision and never rewritten.

Versioning at the record-level

One of the distinctive features of SirixDB is that it versions the RecordPages and not just copies all records of the page, even if only a single record has been modified. The new record page fragment always contains a reference to the former version. Thus, the versioning algorithms are able to dereference a fixed predefined number of page-fragments at maximum to reconstruct a RecordPage in-memory.

A sliding snapshot algorithm used to version record pages is able to avoid read and write peaks. Intermittent full-page snapshots are avoided, which are otherwise needed during incremental or differential page-versioning to fast-track its reconstruction.

Resource-level Transactions

SirixDB supports one read-write transaction concurrently to N read-only transactions. Thus, the architecture supports concurrency very well (note the difference between concurrency and parallel computations; the former simply is a prerequisite for the latter). If we ever want to allow concurrent writes to the same resource in SirixDB, we could introduce a form of serializable snapshot isolation. However, we think it's not feasible for tree-structured data as XML- and JSON, at least if hashes for the nodes are built and if the number of descendants is stored as well.

Conclusion

We've described some of the architectural concepts of SirixDB. The main feature is a thorough versioning built from the ground up into SirixDB. In contrast to other approaches, each version is indexed and not only page-snapshots are written from changed pages, but SirixDB also applies per record versioning. The history of each resource can be stored and queried efficiently.

Give it a try/Help needed

If you like this, please share this on Twitter, give me a star on Github and most importantly check it out and let me know. Since a few years I'm the only maintainer of SirixDB, now more eager than ever to put forth the idea of a versioned, secure temporal analytics platform as a community. I'd love to hear any suggestions, feedback, suggestions for future work, as for instance work on horizontal scaling, bug reports, just everything… please get in contact) :-)

The Open Source repository: https://github.com/sirixdb/sirix

The new community forum: https://sirix.discourse.group

SirixDB website: https://sirix.io