Mark Andreev

Posted on May 13

What is RocksDB (and its role in streaming)?

#database #java #cpp #beginners

RocksDB, a high-performance database, is a hidden gem in the tech industry, often overlooked by developers. I’m excited to delve into the intricacies and applications of this powerful tool. Who knows? It might just become the cornerstone of your next project.

Introduction

Originating from Facebook, RocksDB emerged as a potent storage engine designed for server workloads across diverse storage media, with an initial emphasis on high-speed storage, particularly Flash storage. It’s a C++ library that stores keys and values as byte streams of arbitrary sizes. It’s capable of both point lookups and range scans, and offers various ACID guarantees.

RocksDB strikes a harmonious balance between customizability and self-adaptability. It boasts highly adjustable configuration settings that can be fine-tuned to operate in a wide range of production environments, including SSDs, hard disks, ramfs, or remote storage. It supports a variety of compression algorithms and provides robust tools for production support and debugging. Simultaneously, efforts are made to limit the number of adjustable parameters, ensuring satisfactory out-of-box performance and implementing adaptive algorithms where suitable.

RocksDB incorporates substantial code from the open-source leveldb project and ideas from Apache HBase. The initial codebase was a fork from open-source leveldb. Furthermore, it builds upon code and concepts that were developed at Facebook prior to the inception of RocksDB.

Overview

The fundamental design principle of RocksDB is its performance optimization for fast storage and server workloads. It is engineered to facilitate efficient point lookups and range scans. Its configurability allows it to support high random-read workloads, high update workloads, or a blend of both. The architecture of RocksDB is designed to easily adjust trade-offs to cater to different workloads and hardware configurations.

RocksDB serves as a storage engine library, providing a key-value store interface where keys and values are represented as arbitrary byte streams. It arranges all data in a sorted sequence, and the typical operations include Get(key), NewIterator(), Put(key, val), Delete(key), and SingleDelete(key)

RocksDB does not natively support SQL. While it does not have a relational data model and does not support SQL queries, it can be used in conjunction with other systems or frameworks that provide SQL-like querying capabilities. For example, MyRocks combines RocksDB with MySQL. However, it’s important to note that RocksDB itself does not natively support SQL. It also does not have direct support for secondary indexes, but a user may build their own internally using Column Families or externally.

The main features are:

Designed for application servers wanting to store up to a few terabytes of data on local or remote storage systems.

Optimized for storing small to medium size key-values on fast storage -- flash devices or in-memory

It works well on processors with many cores

For read & write workload, rocksdb should be opened by a single process. RocksDB has the capability to support a multi-process read-only operation without any write operations to the database. This can be achieved by invoking the DB::OpenForReadOnly() method to open the database.

RocksDB was initially released as an open-source project under the BSD 3-clause license. However, in July 2017, the licensing of RocksDB was changed to a dual license model, incorporating both Apache 2.0 and GPLv2. This change was significant as it enabled RocksDB’s integration into projects under the Apache Software Foundation, which had previously blacklisted the BSD+Patents license clause. The dual-license information is available in the root directory of the RocksDB repository on GitHub. For the most recent updates, you may want to check the official RocksDB GitHub repository.

Applications

RocksDB is widely used in stream processing frameworks like Apache Flink, serving as a fast and efficient state store for maintaining the state of streaming applications. It’s also utilized in web applications for caching and session storage due to its efficient memory utilization and fast read/write operations. Additionally, RocksDB serves as a storage engine in database systems like TiKV that require high performance and durability, and is used in embedded systems such as IoT devices or edge computing applications due to its small memory footprint and light design.

What is Streaming

Streaming and batching are two different methods of processing data, each with its own use cases and advantages.

Streaming is a data processing method where each data item is processed individually and in real time as soon as it arrives. It’s like a water stream, where data flows continuously. This method is particularly useful when you need to process live data, such as real-time analytics, live monitoring systems, or real-time recommendations. The advantage of streaming is that it provides real-time or near-real-time insights. However, it can be more complex to implement and requires a robust infrastructure to handle continuous data flow.

On the other hand, batching is a method where data is collected over a period of time and processed all at once. It’s like filling a bucket with water and then processing all the water at once. This method is often used when the data doesn’t need to be processed in real time, such as daily reports, historical data analysis, or large ETL (Extract, Transform, Load) jobs. The advantage of batching is that it’s simpler to implement and can be more cost-effective, as it allows for resource optimization. However, it doesn’t provide real-time insights and can be slower, as you need to wait for the batch to be completed before you can process the data.

In summary, the choice between streaming and batching depends on your specific use case and requirements. If you need real-time insights, streaming is the way to go. If you’re dealing with large volumes of data and don’t need real-time processing, batching could be a more efficient choice. As a data engineer, it’s crucial to understand these differences to make the best decision for your data processing needs.

Streaming. Apache Flink

Applications that process streams often have a stateful nature, meaning they retain information from events they’ve processed to influence the processing of future events. In the context of Flink, this retained information, or state, is locally stored in a state backend that’s been configured. To safeguard against data loss during failures, the state backend periodically takes a snapshot of its contents and stores it in a durable storage that’s been pre-configured.
Among the three built-in state backends in Flink, the RocksDB state backend (also known as RocksDBStateBackend) is one.

You can find example of usage in org/apache/flink/contrib/streaming/state package (https://github.com/apache/flink/tree/9fe8d7bf870987bf43bad63078e2590a38e4faf6/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state).

Streaming. Kafka Streams & KSQL

Kafka Streams and KSQL use RocksDB as their default storage engine for stateful operations.In KSQL, RocksDB is used to store the materialized view locally on its disk. RocksDB is an embedded key/value store that runs in process in each KSQL server. You do not need to start, manage, or interact with it. This allows KSQL to execute stateful operations efficiently and reliably.

You can find example of usage in RocksDBStore class (https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBStore.java).

CockroachDB

CockroachDB is a distributed SQL database system developed by Cockroach Labs. It is designed to be resilient and consistent, much like its namesake, the cockroach. The system is known for its scalability, high availability, and versatility. It can handle increasing loads by adding more nodes, ensures data is always accessible, and can be run in various environments.

One of the key features of CockroachDB is its compatibility with PostgreSQL. It supports the PostgreSQL wire protocol and the majority of PostgreSQL syntax. This means that existing applications built on PostgreSQL can often be migrated to CockroachDB without changing application code. However, it’s important to note that CockroachDB does not support some of the PostgreSQL features or behaves differently from PostgreSQL because not all features can be easily implemented in a distributed system.

CockroachDB uses RocksDB as its underlying storage engine. The architecture of CockroachDB is divided into two layers: the SQL layer and the storage layer. The SQL layer sits on top of a transactional and strongly-consistent distributed key-value store.

In this key-value store, the key ranges are divided and stored in RocksDB. This allows CockroachDB to leverage the efficient storage mechanism of RocksDB while providing the benefits of a SQL interface. The data stored in RocksDB is replicated across the cluster, enhancing the resilience of the system. This design choice enables CockroachDB to handle large-scale, distributed storage while maintaining high performance and strong consistency.

You can find details about RocksDB usage in CockroachDB at their blog https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/

TiKV

TiKV, short for “Ti Key-Value”, is an open-source, distributed, and transactional key-value database. It’s designed to handle vast amounts of data while ensuring strong consistency and reliability. Unlike traditional NoSQL systems, TiKV provides both classical key-value APIs and transactional APIs with ACID compliance.

Built in Rust and powered by the Raft consensus algorithm, TiKV was originally created by PingCAP to complement TiDB, a distributed HTAP (Hybrid Transactional and Analytical Processing) database compatible with the MySQL protocol. The design of TiKV is inspired by some great distributed systems from Google, such as BigTable, Spanner, and Percolator, and some of the latest achievements in academia in recent years, such as the Raft consensus algorithm.

TiKV uses RocksDB as its primary storage layer. This design choice allows TiKV to handle large-scale, distributed storage while maintaining high performance and strong consistency.

In TiKV’s architecture, data is stored as key-value pairs. The keys are divided into ranges, and each range of keys is stored in RocksDB. This allows TiKV to leverage the efficient storage mechanism of RocksDB.

TiKV also uses the Raft consensus algorithm for data replication. For every write request, TiKV first writes the request to the Raft log. After the log is committed, TiKV applies the Raft log and writes the data to RocksDB. This ensures data consistency across multiple replicas in the cluster.

Moreover, TiKV uses a feature of RocksDB called Prefix Bloom Filter (PBF). PBF can filter out data which is impossible to contain keys with the same prefix as the row key provided. This is particularly useful in TiKV’s multi-version concurrency control (MVCC) model, where multiple versions of the same row share the same prefix. This feature enhances the efficiency of read operations.

In summary, RocksDB plays a crucial role in TiKV’s ability to provide a reliable, high-performance distributed storage solution.

Apache Kvrocks

Apache Kvrocks, a distributed key-value NoSQL database, utilizes RocksDB as its storage engine and is designed to be compatible with the Redis protocol.

Kvrocks has the following key features:

Redis Compatible: Users can access Apache Kvrocks via any Redis client.

Namespace: Similar to Redis SELECT but equipped with token per namespace.

Replication: Async replication using binlog like MySQL.

High Availability: Support Redis sentinel to failover when master or slave was failed.

Cluster: Centralized management but accessible via any Redis cluster client.

You can find details in their official documentation https://kvrocks.apache.org/community/data-structure-on-rocksdb/

Use in application

RocksDB is an embedded database, which means it is designed to be integrated directly into your application. Unlike standalone database systems, an embedded database like RocksDB operates as an integral part of your application, allowing for efficient data management and storage within the application itself. This can lead to improved performance and easier data handling, as the database operations are closely tied to the application’s functionality. However, it also means that the application is responsible for the direct management of the database.

C++

In C++ you can operate with database using:

// Copyright (c) 2011-present, Facebook, Inc.  All rights reserved.
//  This source code is licensed under both the GPLv2 (found in the
//  COPYING file in the root directory) and Apache 2.0 License
//  (found in the LICENSE.Apache file in the root directory).

#include <cstdio>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/slice.h"

using ROCKSDB_NAMESPACE::DB;
using ROCKSDB_NAMESPACE::Options;
using ROCKSDB_NAMESPACE::PinnableSlice;
using ROCKSDB_NAMESPACE::ReadOptions;
using ROCKSDB_NAMESPACE::Status;
using ROCKSDB_NAMESPACE::WriteBatch;
using ROCKSDB_NAMESPACE::WriteOptions;

#if defined(OS_WIN)
std::string kDBPath = "C:\\Windows\\TEMP\\rocksdb_simple_example";
#else
std::string kDBPath = "/tmp/rocksdb_simple_example";
#endif

int main() {
  DB* db;
  Options options;
  // Optimize RocksDB. This is the easiest way to get RocksDB to perform well
  options.IncreaseParallelism();
  options.OptimizeLevelStyleCompaction();
  // create the DB if it's not already present
  options.create_if_missing = true;

  // open DB
  Status s = DB::Open(options, kDBPath, &db);
  assert(s.ok());

  // Put key-value
  s = db->Put(WriteOptions(), "key1", "value");
  assert(s.ok());
  std::string value;
  // get value
  s = db->Get(ReadOptions(), "key1", &value);
  assert(s.ok());
  assert(value == "value");

  // atomically apply a set of updates
  {
    WriteBatch batch;
    batch.Delete("key1");
    batch.Put("key2", value);
    s = db->Write(WriteOptions(), &batch);
  }

  s = db->Get(ReadOptions(), "key1", &value);
  assert(s.IsNotFound());

  db->Get(ReadOptions(), "key2", &value);
  assert(value == "value");

  {
    PinnableSlice pinnable_val;
    db->Get(ReadOptions(), db->DefaultColumnFamily(), "key2", &pinnable_val);
    assert(pinnable_val == "value");
  }

  {
    std::string string_val;
    // If it cannot pin the value, it copies the value to its internal buffer.
    // The intenral buffer could be set during construction.
    PinnableSlice pinnable_val(&string_val);
    db->Get(ReadOptions(), db->DefaultColumnFamily(), "key2", &pinnable_val);
    assert(pinnable_val == "value");
    // If the value is not pinned, the internal buffer must have the value.
    assert(pinnable_val.IsPinned() || string_val == "value");
  }

  PinnableSlice pinnable_val;
  s = db->Get(ReadOptions(), db->DefaultColumnFamily(), "key1", &pinnable_val);
  assert(s.IsNotFound());
  // Reset PinnableSlice after each use and before each reuse
  pinnable_val.Reset();
  db->Get(ReadOptions(), db->DefaultColumnFamily(), "key2", &pinnable_val);
  assert(pinnable_val == "value");
  pinnable_val.Reset();
  // The Slice pointed by pinnable_val is not valid after this point

  delete db;

  return 0;
}

You can find details in official wiki in github https://github.com/facebook/rocksdb/wiki/Basic-Operations

Java

RocksJava is an initiative aimed at developing a Java driver for RocksDB that is both high-performing and user-friendly.

<dependency>
  <groupId>org.rocksdb</groupId>
  <artifactId>rocksdbjni</artifactId>
  <version>6.6.4</version>
</dependency>

You can open a database using

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Options;
...
  // a static method that loads the RocksDB C++ library.
  RocksDB.loadLibrary();

  // the Options class contains a set of configurable DB options
  // that determines the behaviour of the database.
  try (final Options options = new Options().setCreateIfMissing(true)) {

    // a factory method that returns a RocksDB instance
    try (final RocksDB db = RocksDB.open(options, "path/to/db")) {

        // do something
    }
  } catch (RocksDBException e) {
    // do some error handling
    ...
  }
...

And do some read and write operations:

byte[] key1;
byte[] key2;
// some initialization for key1 and key2

try {
  final byte[] value = db.get(key1);
  if (value != null) {  // value == null if key1 does not exist in db.
    db.put(key2, value);
  }
  db.delete(key1);
} catch (RocksDBException e) {
  // error handling
}

You can find details in official documentation https://github.com/facebook/rocksdb/wiki/RocksJava-Basics

Conclusion

In conclusion, RocksDB is an excellent solution for managing persistent state on disk due to its high performance and efficient key-value storage. It’s particularly well-suited for applications that require fast, low-latency access to disk-based data. However, if your needs extend to larger-scale distributed state management or constraint management, more abstract databases such as MySQL might be a better fit. MySQL, being a relational database, offers features like transactional support, advanced querying capabilities, and robustness in handling complex data relationships, making it more suitable for these more complex scenarios. Therefore, the choice between RocksDB and MySQL (or any other database) should be guided by the specific requirements of your application.

Author is Mark Andreev, SWE @ Conundrum.AI

Top comments (9)

Gleb Chermennov • May 14

Nice article.
I have to point out that CockroachLabs stopped using RocksDb as its storage engine

Mark Andreev • May 14

Thank you.

In my perspective, this story serves as a great illustration of using RocksDB as an initial step.

haris • May 13

Brilliant intro and context into RocksDB. I had one question about scaling; traditional relational databases come with horizontal scaling/sharding, etc. Given that RocksDB is embedded in the application, how does scaling work?

Mark Andreev • May 13

RocksDB primarily concentrates on managing data within a single node. While it is not inherently a replicated system, it does offer auxiliary functions that allow users to construct their own replication system using RocksDB as a foundation. github.com/facebook/rocksdb/wiki/R...

So all scaling problems are end user problems.

Yuliya Fokina • May 13

Thanks for your post! Does it makes sense to integrate RocksDB in python app?

Mark Andreev • May 13

Yes, it is makes sense because RocksDB simplify persistence of your data. It is hard to maintain delete & update operations on one's own. The biggest challange is compaction ( github.com/facebook/rocksdb/wiki/C... ) which is required to remove inactive rows.

Doron Tohar • May 28

Depends. I believe for most use cases it will be an overkill. sqlite, mysql, postgres, mongodb will probably be easier to work with.

jakehov • May 13

Thank you!

I would like to share crate for rust docs.rs/rocksdb/latest/rocksdb/

Mark Andreev • May 13 • Edited

I would like to mention the limitation of this crate

The underlying RocksDB does allow column families to be created and dropped from multiple threads concurrently. But this crate doesn't allow it by default for compatibility. If you need to modify column families concurrently, enable crate feature called multi-threaded-cf, which makes this binding's data structures to use RwLock by default. Alternatively, you can directly create DBWithThreadMode without enabling the crate feature.
ref: docs.rs/crate/rocksdb/latest

DEV Community