π¦ Goal
I'm building a temporal document store (SirixDB), which is capable of storing revisions of (currently) XML and JSON databases with minimal, constant overhead regarding space and the reconstruction of a revision or a specific record therein. SirixDB stores the data in a tailored binary format.
I want to build a front-end for this document store, which can efficiently store and restore revisions of data. SirixDB furthermore supports sophisticated time-travel queries.
However, I'm a backend engineer, eager to learn. I'm currently reading a book about TypeScript and another one about D3.js. I'm also currently learning Vue.js and Nuxt.js. I've found that building something together is much more fulfilling than working alone. Furthermore, I want to learn best practices, clean code and I'd love to do code reviews with you :-) That said, it should be clear that I'd love to learn from you and you're able to work on a greenfield project, basically using the latest technology, which fits our needs :-) I got the first few contributions to this project during the #Hacktoberfest.
On the one hand users should be able to store, update and query (time travel queries included) data in SirixDB. On the other hand I want to provide sophisticated, interactive visualizations to explore the differences between the revisions of the XML- or JSON-resources.
Let's first introduce SirixDB, and its features.
π SirixDB, a Temporal Document Store
SirixDB is log-structured at its core and never overwrites data. It appends new revisions during transaction-commits, whereas revisions share unchanged page structures. The timestamp of a transaction commit is stored precisely once in a so-called RevisionRootPage. It is the root page, which is the main entry point to the data of a particular revision. A revision represents a snapshot at a specific point in time. SirixDB indexes the revisions itself in a key-based trie and an in-memory map of timestamps for the fast reconstruction of a version.
π Versioning
Efficient versioning is the primary goal of SirixDB. We've written the storage engine from scratch to achieve this goal. As depicted in the figure above, SirixDB stores the state or a new revision during a transaction-commit. All states are retrievable for further analysis or to fix human and application errors.
In stark contrast to other approaches the temporal document store not only versions on a per revision basis, but also on a per page basis.
Revisions share unchanged pages. Furthermore, depending on the chosen page-versioning algorithm, data pages are versioned as well. A database page usually stores a fixed number of bytes at most. SirixDB, however, does not align specific block sizes on persistent storage with database page sizes. Instead, SirixDB stores page-fragments during updates. A page-fragment has a dynamic size. SirixDB has to read a constant number of page-fragments from persistent storage. The following figure describes the general copy-on-write approach in which the first revision shares unchanged pages with the second revision. Some metadata pages (IndirectPages) have to be updated to point to the new record pages.
As an example, let's say we update a single node in a tree structure. In SirixDB, it might involve a constant overhead of adapting (pointers of) neighbor nodes. However, SirixDB doesn't have to rewrite all other nodes in the page. Even if only a few bytes have changed, usually a whole database page is copied and written either in-place or to a new location. A page-size might be a few thousand bytes in total, and some records might even exceed the page-size limit. In that case, they might be in some form of overflow pages. The DBMS now usually writes the whole page even if only a few records have changed. This might impose a considerable overhead regarding space on persistent storage if the DBMS wants to retain old data.
To overcome this limitation, we first looked into how backup systems work (even if the work mostly at a file or block-level granularity).
πΎ Versioning Strategies
We implemented several versioning strategies best known from backup systems for copy-on-write operations of record-pages. Namely, we either copy
- the full record-pages that is any record in the page (full)
- only the changed records in a record-page regarding the former version (incremental)
- only the changed records in a record-page since a full-page dump (differential)
Incremental-versioning is the other extreme. Write performance is best, as it stores the optimum (only changed records). On the other hand, reconstructing a page needs intermittent full snapshots of pages. Otherwise, performance deteriorates with each new revision of the page as the number of increments increases with each new version.
Differential-versioning tries to balance reads and writes a bit better, but is still not optimal. A system implementing a differential versioning strategy has to write all changed records since a past full dump of the page. Thus, only ever two revisions of the page-fragment have to be read to reconstruct a record-page. However, write-performance also deteriorates with each new revision of the page.
Write peaks occur both during incremental versioning, due to the requirement of intermittent full dumps of the page. Differential versioning also suffers from a similar problem. Without an intermittent full dump, a system using differential versioning has to duplicate vast amounts of data during each new write.
Marc Kramis came up with the idea of a novel sliding snapshot algorithm, which balances read-/write-performance to circumvent any write-peaks.
The algorithm makes use of a sliding window. First, a system implementing the algorithm must write all changed records during a commit. Second, the system has to write any record, which is older than a predefined length N of the window and which didn't change during these N-revisions, too. Reading involves fetching of page-fragments from these N-revisions at max. A system can fetch page-fragments in parallel or linear. In the latter case, reading of the page fragments starts with the most recent revision. Once a system has read all records from a page, it does not have to fetch older page-fragments. You can find the best high-level overview of the algorithm in Marcβs Thesis: Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications
Time Travel Queries
The following time-travel query to be executed on our binary JSON representation of Twitter sample data, that is several revisions thereof gives an initial impression of what's possible:
let $doc := jn:open('database','resource', xs:dateTime('2019-04-13T16:24:27Z'))
let $statuses := $doc=>statuses
let $foundStatus := for $status in bit:array-values($statuses)
let $dateTimeCreated := xs:dateTime($status=>created_at)
where $dateTimeCreated > xs:dateTime("2018-02-01T00:00:00")
and not(exists(jn:previous($status)))
order by $dateTimeCreated
return $status
return {"revision": sdb:revision($foundStatus), $foundStatus{text}}
The query opens a database/resource in a specific revision based on a timestamp (2019β04β13T16:24:27Z
) and searches for all statuses, which have a created_at
timestamp, which has to be greater than the 1st of February in 2018 and did not exist in the previous revision. =>
is a dereferencing operator used to dereference keys in JSON objects. You can access array values as shown with the function bit:array-values
or through specifying an index, starting with zero: $array[[0]]
, for instance, specifies the first value of the array.
β€οΈ Design Goals of SirixDB
Some of the most important core principles and design goals are:
- Minimize Storage Overhead
- SirixDB shares unchanged data pages as well as records between revisions, depending on a chosen versioning algorithm during the initial bootstrapping of a resource. SirixDB aims to balance read and write performance in its default configuration
- Concurrent
- SirixDB contains very few locks and aims to be as suitable for multithreaded systems as possible
- Asynchronous
- Operations can happen independently; each transaction is bound to a specific revision and only one read/write-transaction on a resource is permitted concurrently to N read-only-transactions
- Versioning/Revision history
- SirixDB stores a revision history of every resource in the database without imposing extra overhead
- Data integrity
- SirixDB, like ZFS, stores full checksums of the pages in the parent pages. That means that almost all data corruption can be detected upon reading in the future, we aim to partition and replicate databases in the future
- Copy-on-write semantics
- Similarly to the file systems Btrfs and ZFS, SirixDB uses CoW semantics, meaning that SirixDB never overwrites data. Instead, database-page fragments are copied/written to a new location
- Per revision and per page versioning
- SirixDB does not only version on a per revision, but also on a per page-base. Thus, whenever we change a potentially small fraction of records in a data-page, it does not have to copy the whole page and write it to a new location on a disk or flash drive. Instead, we can specify one of several versioning strategies known from backup systems or a novel sliding snapshot algorithm during the creation of a database resource. The versioning-type we specify is used by SirixDB to version data-pages
- Guaranteed atomicity (without a WAL)
- The system never enters an inconsistent state (unless there is hardware failure), meaning that unexpected power-off won't ever damage the system. A write-ahead-log (WAL) is not needed. Instead, an UberPage is swapped atomically during transaction commits
- Log-structured and SSD friendly
- SirixDB batches writes and syncs everything sequentially to a flash drive during commits. It never overwrites committed data
Future With a Front-end
SirixDB is all about versioning. A front-end to interact with the document store and explore differences between revisions (XML or JSON diffing views) and to visualize query outcomes is of utmost importance.
I hope the project is useful for some of you. I even got my first contributions after years of working alone during and right before #Hacktoberfest, that's simply awesome. Thank you so much π
If you like the project you might share it on twitter and so and and spread the word!? π
Contribute on GitHub SirixDB Web Frontend and/or GitHub SirixDB π
Kind regards and have a great friday and weekend already
Johannes
Top comments (0)