Have you ever wondered how databases like Postgres, MySQL, or Cassandra store and retrieve data so efficiently? I was amazed when I first realized that databases — these powerful, complex systems — are, at their core, just files on disk. Yes, files! Structured and optimized for fast and efficient operations. Two key components make this possible:
Query Layer: Handles user queries, optimizes them, and interacts with the storage engine.
Storage Engine: Manages how data is stored, retrieved, and updated on disk or in memory.
What is a Storage Engine?
A storage engine is a core component of a database system responsible for managing how data is stored, retrieved, and updated on disk or in memory. It handles low-level details such as file organization, indexing, and concurrency control. At its heart, a storage engine relies on efficient data structures designed specifically for disk-based access.
Storage engines can be broadly classified into two types based on how they handle data modifications:
Mutable Storage Engines: These engines, typically built on B-Trees, are optimized for read-heavy workloads. They update data in place, meaning existing data is overwritten when modified. Examples include Postgres, InnoDB (MySQL), and SQLite.
Immutable Storage Engines: These engines, built on structures like Log-Structured Merge Trees (LSM-Trees) and Log-Structured Hash Tables (LSHTs), are optimized for write-heavy workloads. Instead of overwriting data, they append new data to a log file and periodically clean up old or deleted data through a process called compaction. Examples include LevelDB, RocksDB, Bitcask, and Cassandra.
In this post, we’ll build a simple immutable key-value store, inspired by the Bitcask model. Bitcask is based on a Log-Structured Hash Table (LSHT), which appends data sequentially to a log file and maintains an in-memory hash table for fast lookups. LSHT-based engines are simpler to implement compared to other designs like LSM-Trees, as they avoid the complexity of sorting, multi-level merging, and additional data structures.
By the end of this post, you’ll have a foundational understanding of how storage engines work and a working implementation of a minimal key-value store. In future posts, we’ll enhance this implementation by adding features like compaction, advanced indexing, and handling large datasets efficiently.
Full source code available on GitHub
Design Overview
Our key-value store uses an append-only binary file format with a simple yet efficient record structure:
Each record contains:
Timestamp (4 bytes)
Key length (4 bytes)
Value length (4 bytes)
Tombstone flag (1 byte)
Key data (variable length)
Value data (variable length)
Record format:
+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| Timestamp (4B) | Key len (4B) | Value len (4B) | Tombstone flag (1B)| Key (variable) | Value (variable) |
+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| 0x5F8D3E8F | 0x00000005 | 0x00000007 | 0x00 | "mykey" | "myvalue" |
+------------------+------------------+------------------+--------------------+--------------------+--------------------+
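So the example record above occupies 13 bytes of header (4 + 4 + 4 + 1), plus 5 bytes for "mykey" and 7 bytes for "myvalue", 25 bytes in total.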
Why Binary Format for Serialization?
While JSON and XML are human-readable and widely used for data interchange, most modern storage systems use binary formats for their core storage. Here’s why:
Performance: The system can read and write binary formats directly without parsing overhead.
Space Efficiency: Binary representations typically require significantly less storage space than text-based formats.
Cross-Platform Compatibility: Using standardized binary formats ensures data can be read consistently across different systems.
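To make the space argument concrete, here is a small, purely illustrative comparison (the field names and values below are placeholders, not part of the store) of the same record encoded as JSON versus the fixed 13-byte header plus raw bytes used here:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// JSON spends bytes on field names, quotes, and punctuation.
	jsonRecord, _ := json.Marshal(map[string]any{
		"timestamp": 1603092111, // 0x5F8D3E8F from the table above
		"tombstone": false,
		"key":       "mykey",
		"value":     "myvalue",
	})
	fmt.Println("JSON size:  ", len(jsonRecord), "bytes")

	// Binary layout: fixed 13-byte header + raw key + raw value.
	binarySize := 13 + len("mykey") + len("myvalue")
	fmt.Println("Binary size:", binarySize, "bytes")
}

On this tiny record the JSON encoding is roughly three times larger, and the gap widens further once binary values have to be escaped or base64-encoded to fit into text.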
Implementation
Types
type Record struct {
Key []byte
Value []byte
Timestamp uint32
Tombstone bool
}
type Store struct {
filename string
mu sync.RWMutex
file *os.File
index map[string]int64
}
The Store struct maintains a persistent file handle and an in-memory index for efficient lookups.
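The snippets below also reference two package-level constants that aren't shown here. A minimal sketch of how they could be declared follows; headerSize follows directly from the record layout, while the exact sentinel value chosen for tombstoneOffset is an assumption for illustration:

// headerSize is the fixed part of every record:
// timestamp (4) + key length (4) + value length (4) + tombstone flag (1).
const headerSize = 13

// tombstoneOffset is a sentinel stored in the index for deleted keys.
// The concrete value is an assumption; any value that can never be a
// real record offset works.
const tombstoneOffset int64 = -1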
Creating the Store
func Open(filename string) (*Store, error) {
file, err := os.OpenFile(filename, os.O_RDWR|os.O_CREATE, 0666)
if err != nil {
return nil, fmt.Errorf("failed to open file: %w", err)
}
store := &Store{
filename: filename,
file: file,
index: make(map[string]int64),
}
if err := store.buildIndex(); err != nil {
file.Close()
return nil, fmt.Errorf("failed to rebuild index: %w", err)
}
return store, nil
}
Index Rebuilding
To support efficient key lookups, the buildIndex function scans the file and populates the in-memory index with the latest offset for each key. (Note that if the file is large, building the index will be slow.)
func (s *Store) buildIndex() error {
s.mu.Lock()
defer s.mu.Unlock()
var offset int64
for {
record, err := s.readRecord(offset)
if err == io.EOF {
break
}
if err != nil {
return fmt.Errorf("failed to read record: %v", err)
}
// mark tombstoned records as deleted
if record.Tombstone {
s.index[string(record.Key)] = tombstoneOffset
} else {
s.index[string(record.Key)] = offset
}
offset += int64(headerSize + len(record.Key) + len(record.Value))
}
return nil
}
Binary Serialization
The writeRecord function efficiently serializes records into binary format:
func (s *Store) writeRecord(record *Record) (int64, error) {
totalSize := headerSize + len(record.Key) + len(record.Value)
buf := make([]byte, totalSize)
// serialize timestamp (4 bytes)
binary.BigEndian.PutUint32(buf[0:4], record.Timestamp)
// serialize key len (4 bytes)
binary.BigEndian.PutUint32(buf[4:8], uint32(len(record.Key)))
// serialize value len (4 bytes)
binary.BigEndian.PutUint32(buf[8:12], uint32(len(record.Value)))
// serialize tombstone (1 byte)
if record.Tombstone {
buf[12] = 1
} else {
buf[12] = 0
}
// copy key data
copy(buf[headerSize:headerSize+len(record.Key)], record.Key)
// copy value data
copy(buf[headerSize+len(record.Key):], record.Value)
// write record to file
offset, err := s.file.Seek(0, io.SeekEnd)
if err != nil {
return 0, err
}
if _, err := s.file.Write(buf); err != nil {
return 0, err
}
return offset, nil
}
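Note that writeRecord leaves flushing to the operating system's page cache; it never calls file.Sync(), so a crash can lose the most recent writes. This is a common trade-off between durability and throughput, and it ties into the batching idea listed under Future Optimizations.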
The readRecord function reads binary-encoded records from the file:
func (s *Store) readRecord(offset int64) (*Record, error) {
if _, err := s.file.Seek(offset, io.SeekStart); err != nil {
return nil, fmt.Errorf("failed to seek: %w", err)
}
record := &Record{}
// read timestamp (4 bytes)
timestampBuf := make([]byte, 4)
if _, err := s.file.Read(timestampBuf); err != nil {
return nil, err
}
record.Timestamp = binary.BigEndian.Uint32(timestampBuf)
// read key len (4 bytes)
keyLenBuf := make([]byte, 4)
if _, err := s.file.Read(keyLenBuf); err != nil {
return nil, err
}
keyLen := binary.BigEndian.Uint32(keyLenBuf)
// read value len (4 bytes)
valueLenBuf := make([]byte, 4)
if _, err := s.file.Read(valueLenBuf); err != nil {
return nil, err
}
valueLen := binary.BigEndian.Uint32(valueLenBuf)
// read tombstone (1 byte)
tombstoneBuf := make([]byte, 1)
if _, err := s.file.Read(tombstoneBuf); err != nil {
return nil, err
}
record.Tombstone = tombstoneBuf[0] == 1
// read key data
key := make([]byte, keyLen)
if _, err := s.file.Read(key); err != nil {
return nil, err
}
record.Key = key
// read value data
value := make([]byte, valueLen)
if _, err := s.file.Read(value); err != nil {
return nil, err
}
record.Value = value
return record, nil
}
Put Operation
The Put function writes a new record to the file and updates the in-memory index. It ensures thread safety by acquiring a write lock:
func (s *Store) Put(key, value []byte) error {
if len(key) == 0 {
return fmt.Errorf("key cannot be empty")
}
s.mu.Lock()
defer s.mu.Unlock()
record := &Record{
Key: key,
Value: value,
Timestamp: uint32(time.Now().Unix()),
}
offset, err := s.writeRecord(record)
if err != nil {
return fmt.Errorf("failed to write record: %w", err)
}
s.index[string(key)] = offset
return nil
}
Get Operation
The Get function retrieves a value by its key using the in-memory index. It acquires a read lock to ensure thread safety:
func (s *Store) Get(key []byte) ([]byte, error) {
if len(key) == 0 {
return nil, fmt.Errorf("key cannot be empty")
}
s.mu.RLock()
defer s.mu.RUnlock()
offset, exists := s.index[string(key)]
if !exists || offset == tombstoneOffset {
return nil, fmt.Errorf("key not found: %s", key)
}
record, err := s.readRecord(offset)
if err != nil {
return nil, fmt.Errorf("failed to read record: %w", err)
}
return record.Value, nil
}
Delete Operation
The Delete function writes a tombstone record to the file so that deletions persist across restarts, then updates the in-memory index to mark the key as deleted.
(Notice that the record data still exists in the file; we'll remove it later when implementing compaction.)
func (s *Store) Delete(key []byte) error {
if len(key) == 0 {
return fmt.Errorf("key cannot be empty")
}
s.mu.Lock()
defer s.mu.Unlock()
record := &Record{
Key: key,
Tombstone: true,
Timestamp: uint32(time.Now().Unix()),
}
_, err := s.writeRecord(record)
if err != nil {
return fmt.Errorf("failed to write record: %w", err)
}
s.index[string(key)] = tombstoneOffset
return nil
}
Getting All the Records
We've implemented an iterator pattern to scan all records from the file efficiently, without loading them all into memory at once, skipping tombstoned records along the way.
type Iterator interface {
Next() bool
Record() (*Record, error)
}
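The full implementation lives in the repo; below is a minimal sketch of how such an iterator could look, reusing readRecord, headerSize, and the in-memory index from above. The fileIterator name and the check that skips stale (superseded) versions are assumptions for illustration, not necessarily how MinKV does it:

type fileIterator struct {
	store   *Store
	offset  int64
	current *Record
	err     error
}

func (s *Store) Iterator() (Iterator, error) {
	return &fileIterator{store: s}, nil
}

// Next advances to the next live record, skipping tombstones and
// versions that the index no longer points to. It returns false at
// end of file or on error.
func (it *fileIterator) Next() bool {
	it.store.mu.RLock()
	defer it.store.mu.RUnlock()
	for {
		record, err := it.store.readRecord(it.offset)
		if err == io.EOF {
			return false
		}
		if err != nil {
			it.err = err
			return false
		}
		recordOffset := it.offset
		it.offset += int64(headerSize + len(record.Key) + len(record.Value))
		if record.Tombstone || it.store.index[string(record.Key)] != recordOffset {
			continue
		}
		it.current = record
		return true
	}
}

// Record returns the record found by the last successful call to Next.
func (it *fileIterator) Record() (*Record, error) {
	if it.err != nil {
		return nil, it.err
	}
	return it.current, nil
}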
Example usage:
package main
import (
"fmt"
"github.com/galalen/minkv"
)
func main() {
store, _ := minkv.Open("store.db")
defer store.Close()
// Put data
_ = store.Put([]byte("os"), []byte("mac"))
_ = store.Put([]byte("db"), []byte("kv"))
_ = store.Put([]byte("lang"), []byte("go"))
// Retrieve data
value, _ := store.Get([]byte("os"))
fmt.Printf("os => %s\n", value)
// update value
_ = store.Put([]byte("os"), []byte("linux"))
// Delete data
_ = store.Delete([]byte("db"))
// Iterate over records
iter, _ := store.Iterator()
for iter.Next() {
record, _ := iter.Record()
fmt.Printf("Key: %s, Value: %s\n", record.Key, record.Value)
}
}
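Assuming the iterator skips tombstoned keys and superseded versions (as in the sketch above), the final loop prints only the latest live entries, here lang => go and os => linux, while the deleted db key never shows up.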
Future Optimizations
Here are some areas where this key-value store can be further improved:
- Compaction: Currently, deleted or overwritten records remain in the file forever. Implementing a compaction mechanism that rewrites only the live records will reduce file size and improve performance (see the sketch after this list).
- Hintfiles: Reduce disk seeks and speed up index rebuilding.
- Batching Writes: Group multiple writes into a single batch to reduce the number of disk sync operations.
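As a taste of where this is headed, here is a rough sketch of what a naive offline compaction could look like; it is an assumption about the future design, not the eventual implementation. It copies only the live records into a temporary file, swaps it in with os.Rename, and adopts the freshly built index:

func (s *Store) Compact() error {
	s.mu.Lock()
	defer s.mu.Unlock()

	tmpName := s.filename + ".compact"
	tmpFile, err := os.OpenFile(tmpName, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
	if err != nil {
		return err
	}

	// Reuse writeRecord by wrapping the temporary file in a throwaway Store.
	tmp := &Store{filename: tmpName, file: tmpFile, index: make(map[string]int64)}
	for key, offset := range s.index {
		if offset == tombstoneOffset {
			continue // deleted keys are dropped entirely
		}
		record, err := s.readRecord(offset)
		if err != nil {
			tmpFile.Close()
			return err
		}
		newOffset, err := tmp.writeRecord(record)
		if err != nil {
			tmpFile.Close()
			return err
		}
		tmp.index[key] = newOffset
	}

	// Swap the compacted log in place of the old one and adopt its index.
	s.file.Close()
	tmpFile.Close()
	if err := os.Rename(tmpName, s.filename); err != nil {
		return err
	}
	file, err := os.OpenFile(s.filename, os.O_RDWR|os.O_CREATE, 0666)
	if err != nil {
		return err
	}
	s.file = file
	s.index = tmp.index
	return nil
}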
Conclusion
We’ve built a simple yet powerful key-value store that demonstrates core database storage principles. Key-value stores are the backbone of many real-world databases and systems. For example:
- Riak: A distributed NoSQL database that uses Bitcask as a storage backend to handle write-heavy workloads efficiently.
- Cassandra: Uses an LSM-based storage engine (memtables flushed to immutable SSTables) for distributed data storage.
- RocksDB: An LSM-based embedded key-value store used inside systems like MySQL (via MyRocks) and Kafka Streams.
Full source code available on GitHub
References
https://tech-lessons.in/en/blog/bitcask/
https://riak.com/assets/bitcask-intro.pdf
https://tikv.org/deep-dive/key-value-engine/introduction/
https://tikv.org/deep-dive/key-value-engine/b-tree-vs-lsm/
https://github.com/avinassh/py-caskdb
https://www.amazon.com/Database-Internals-Deep-Distributed-Systems/dp/1492040347