Have you ever wondered how databases like Postgres, MySQL, or Cassandra store and retrieve data so efficiently? I was amazed when I first realized that databases — these powerful, complex systems — are, at their core, just files on disk. Yes, files! Structured and optimized for fast and efficient operations. Two key components make this possible:
Query Layer: Handles user queries, optimizes them, and interacts with the storage engine.
Storage Engine: Manages how data is stored, retrieved, and updated on disk or in memory.
What is a Storage Engine?
A storage engine is a core component of a database system responsible for managing how data is stored, retrieved, and updated on disk or in memory. It handles low-level details such as file organization, indexing, and concurrency control. At its heart, a storage engine relies on efficient data structures designed specifically for disk-based access.
Storage engines can be broadly classified into two types based on how they handle data modifications:
Mutable Storage Engines: These engines, typically built on B-Trees, are optimized for read-heavy workloads. They update data in place, meaning existing data is overwritten when modified. Examples include Postgres, InnoDB (MySQL), and SQLite.
Immutable Storage Engines: These engines, built on structures like Log-Structured Merge Trees (LSM-Trees) and Log-Structured Hash Tables (LSHTs), are optimized for write-heavy workloads. Instead of overwriting data, they append new data to a log file and periodically clean up old or deleted data through a process called compaction. Examples include LevelDB, RocksDB, Bitcask, and Cassandra.
In this post, we’ll build a simple immutable key-value store, inspired by the Bitcask model. Bitcask is based on a Log-Structured Hash Table (LSHT), which appends data sequentially to a log file and maintains an in-memory hash table for fast lookups. LSHT-based engines are simpler to implement compared to other designs like LSM-Trees, as they avoid the complexity of sorting, multi-level merging, and additional data structures.
By the end of this post, you’ll have a foundational understanding of how storage engines work and a working implementation of a minimal key-value store. In future posts, we’ll enhance this implementation by adding features like compaction, advanced indexing, and handling large datasets efficiently.
Full source code available on GitHub
Design Overview
Our key-value store uses an append-only binary file format with a simple yet efficient record structure:
Each record contains:
Timestamp (4 bytes)
Key length (4 bytes)
Value length (4 bytes)
Tombstone flag (1 byte)
Key data (variable length)
Value data (variable length)
Record format:
+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| Timestamp (4B) | Key len (4B) | Value len (4B) | Tombstone flag (1B)| Key (variable) | Value (variable) |
+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| 0x5F8D3E8F | 0x00000005 | 0x00000007 | 0x00 | "mykey" | "myvalue" |
+------------------+------------------+------------------+--------------------+--------------------+--------------------+
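So the example record above occupies 13 bytes of header (4 + 4 + 4 + 1), plus 5 bytes for "mykey" and 7 bytes for "myvalue", 25 bytes in total.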
Why Binary Format for Serialization?
While JSON and XML are human-readable and widely used for data interchange, most modern storage systems use binary formats for their core storage. Here’s why:
Performance: The system can read and write binary formats directly without parsing overhead.
Space Efficiency: Binary representations typically require significantly less storage space than text-based formats.
Cross-Platform Compatibility: Using standardized binary formats ensures data can be read consistently across different systems.
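To make the space argument concrete, here is a small, purely illustrative comparison (the field names and values below are placeholders, not part of the store) of the same record encoded as JSON versus the fixed 13-byte header plus raw bytes used here:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// JSON spends bytes on field names, quotes, and punctuation.
	jsonRecord, _ := json.Marshal(map[string]any{
		"timestamp": 1603092111, // 0x5F8D3E8F from the table above
		"tombstone": false,
		"key":       "mykey",
		"value":     "myvalue",
	})
	fmt.Println("JSON size:  ", len(jsonRecord), "bytes")

	// Binary layout: fixed 13-byte header + raw key + raw value.
	binarySize := 13 + len("mykey") + len("myvalue")
	fmt.Println("Binary size:", binarySize, "bytes")
}

On this tiny record the JSON encoding is roughly three times larger, and the gap widens further once binary values have to be escaped or base64-encoded to fit into text.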
Implementation
Types
type Record struct {
Key []byte
Value []byte
Timestamp uint32
Tombstone bool
}
type Store struct {
filename string
mu sync.RWMutex
file *os.File
index map[string]int64
}
The Store struct maintains a persistent file handle and an in-memory index for efficient lookups.
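The snippets below also reference two package-level constants that aren't shown here. A minimal sketch of how they could be declared follows; headerSize follows directly from the record layout, while the exact sentinel value chosen for tombstoneOffset is an assumption for illustration:

// headerSize is the fixed part of every record:
// timestamp (4) + key length (4) + value length (4) + tombstone flag (1).
const headerSize = 13

// tombstoneOffset is a sentinel stored in the index for deleted keys.
// The concrete value is an assumption; any value that can never be a
// real record offset works.
const tombstoneOffset int64 = -1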
Creating the Store
func Open(filename string) (*Store, error) {
file, err := os.OpenFile(filename, os.O_RDWR|os.O_CREATE, 0666)
if err != nil {
return nil, fmt.Errorf("failed to open file: %w", err)
}
store := &Store{
filename: filename,
file: file,
index: make(map[string]int64),
}
if err := store.buildIndex(); err != nil {
file.Close()
return nil, fmt.Errorf("failed to rebuild index: %w", err)
}
return store, nil
}
Index Rebuilding
To support efficient key lookups, the buildIndex function scans the file and populates the in-memory index with the latest offset for each key. (Note that if the file is large, building the index will be slow.)
func (s *Store) buildIndex() error {
s.mu.Lock()
defer s.mu.Unlock()
var offset int64
for {
record, err := s.readRecord(offset)
if err == io.EOF {
break
}
if err != nil {
return fmt.Errorf("failed to read record: %v", err)
}
// mark tombstoned records as deleted
if record.Tombstone {
s.index[string(record.Key)] = tombstoneOffset
} else {
s.index[string(record.Key)] = offset
}
offset += int64(headerSize + len(record.Key) + len(record.Value))
}
return nil
}
Binary Serialization
The writeRecord function efficiently serializes records into binary format:
func (s *Store) writeRecord(record *Record) (int64, error) {
totalSize := headerSize + len(record.Key) + len(record.Value)
buf := make([]byte, totalSize)
// serialize timestamp (4 bytes)
binary.BigEndian.PutUint32(buf[0:4], record.Timestamp)
// serialize key len (4 bytes)
binary.BigEndian.PutUint32(buf[4:8], uint32(len(record.Key)))
// serialize value len (4 bytes)
binary.BigEndian.PutUint32(buf[8:12], uint32(len(record.Value)))
// serialize tombstone (1 byte)
if record.Tombstone {
buf[12] = 1
} else {
buf[12] = 0
}
// copy key data
copy(buf[headerSize:headerSize+len(record.Key)], record.Key)
// copy value data
copy(buf[headerSize+len(record.Key):], record.Value)
// write record to file
offset, err := s.file.Seek(0, io.SeekEnd)
if err != nil {
return 0, err
}
if _, err := s.file.Write(buf); err != nil {
return 0, err
}
return offset, nil
}
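Note that writeRecord leaves flushing to the operating system's page cache; it never calls file.Sync(), so a crash can lose the most recent writes. This is a common trade-off between durability and throughput, and it ties into the batching idea listed under Future Optimizations.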
The readRecord function reads binary-encoded records from the file:
func (s *Store) readRecord(offset int64) (*Record, error) {
if _, err := s.file.Seek(offset, io.SeekStart); err != nil {
return nil, fmt.Errorf("failed to seek: %w", err)
}
record := &Record{}
// read timestamp (4 bytes)
timestampBuf := make([]byte, 4)
if _, err := s.file.Read(timestampBuf); err != nil {
return nil, err
}
record.Timestamp = binary.BigEndian.Uint32(timestampBuf)
// read key len (4 bytes)
keyLenBuf := make([]byte, 4)
if _, err := s.file.Read(keyLenBuf); err != nil {
return nil, err
}
keyLen := binary.BigEndian.Uint32(keyLenBuf)
// read value len (4 bytes)
valueLenBuf := make([]byte, 4)
if _, err := s.file.Read(valueLenBuf); err != nil {
return nil, err
}
valueLen := binary.BigEndian.Uint32(valueLenBuf)
// read tombstone (1 byte)
tombstoneBuf := make([]byte, 1)
if _, err := s.file.Read(tombstoneBuf); err != nil {
return nil, err
}
record.Tombstone = tombstoneBuf[0] == 1
// read key data
key := make([]byte, keyLen)
if _, err := s.file.Read(key); err != nil {
return nil, err
}
record.Key = key
// read value data
value := make([]byte, valueLen)
if _, err := s.file.Read(value); err != nil {
return nil, err
}
record.Value = value
return record, nil
}
Put Operation
The Put function writes a new record to the file and updates the in-memory index. It ensures thread safety by acquiring a write lock:
func (s *Store) Put(key, value []byte) error {
if len(key) == 0 {
return fmt.Errorf("key cannot be empty")
}
s.mu.Lock()
defer s.mu.Unlock()
record := &Record{
Key: key,
Value: value,
Timestamp: uint32(time.Now().Unix()),
}
offset, err := s.writeRecord(record)
if err != nil {
return fmt.Errorf("failed to write record: %w", err)
}
s.index[string(key)] = offset
return nil
}
Get Operation
The Get function retrieves a value by its key using the in-memory index. It acquires a read lock to ensure thread safety:
func (s *Store) Get(key []byte) ([]byte, error) {
if len(key) == 0 {
return nil, fmt.Errorf("key cannot be empty")
}
s.mu.RLock()
defer s.mu.RUnlock()
offset, exists := s.index[string(key)]
if !exists || offset == tombstoneOffset {
return nil, fmt.Errorf("key not found: %s", key)
}
record, err := s.readRecord(offset)
if err != nil {
return nil, fmt.Errorf("failed to read record: %w", err)
}
return record.Value, nil
}
Delete Operation
The Delete function writes a tombstone record to the file so that deletions persist across restarts, then updates the in-memory index to mark the key as deleted.
(Notice that the record data still exists in the file; we'll remove it later when implementing compaction.)
func (s *Store) Delete(key []byte) error {
if len(key) == 0 {
return fmt.Errorf("key cannot be empty")
}
s.mu.Lock()
defer s.mu.Unlock()
record := &Record{
Key: key,
Tombstone: true,
Timestamp: uint32(time.Now().Unix()),
}
_, err := s.writeRecord(record)
if err != nil {
return fmt.Errorf("failed to write record: %w", err)
}
s.index[string(key)] = tombstoneOffset
return nil
}
Getting All the Records
We've implemented an iterator pattern to scan all records from the file efficiently, without loading them all into memory at once, skipping tombstoned records along the way.
type Iterator interface {
Next() bool
Record() (*Record, error)
}
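The full implementation lives in the repo; below is a minimal sketch of how such an iterator could look, reusing readRecord, headerSize, and the in-memory index from above. The fileIterator name and the check that skips stale (superseded) versions are assumptions for illustration, not necessarily how MinKV does it:

type fileIterator struct {
	store   *Store
	offset  int64
	current *Record
	err     error
}

func (s *Store) Iterator() (Iterator, error) {
	return &fileIterator{store: s}, nil
}

// Next advances to the next live record, skipping tombstones and
// versions that the index no longer points to. It returns false at
// end of file or on error.
func (it *fileIterator) Next() bool {
	it.store.mu.RLock()
	defer it.store.mu.RUnlock()
	for {
		record, err := it.store.readRecord(it.offset)
		if err == io.EOF {
			return false
		}
		if err != nil {
			it.err = err
			return false
		}
		recordOffset := it.offset
		it.offset += int64(headerSize + len(record.Key) + len(record.Value))
		if record.Tombstone || it.store.index[string(record.Key)] != recordOffset {
			continue
		}
		it.current = record
		return true
	}
}

// Record returns the record found by the last successful call to Next.
func (it *fileIterator) Record() (*Record, error) {
	if it.err != nil {
		return nil, it.err
	}
	return it.current, nil
}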
Example usage:
package main
import (
"fmt"
"github.com/galalen/minkv"
)
func main() {
store, _ := minkv.Open("store.db")
defer store.Close()
// Put data
_ = store.Put([]byte("os"), []byte("mac"))
_ = store.Put([]byte("db"), []byte("kv"))
_ = store.Put([]byte("lang"), []byte("go"))
// Retrieve data
value, _ := store.Get([]byte("os"))
fmt.Printf("os => %s\n", value)
// update value
_ = store.Put([]byte("os"), []byte("linux"))
// Delete data
_ = store.Delete([]byte("db"))
// Iterate over records
iter, _ := store.Iterator()
for iter.Next() {
record, _ := iter.Record()
fmt.Printf("Key: %s, Value: %s\n", record.Key, record.Value)
}
}
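Assuming the iterator skips tombstoned keys and superseded versions (as in the sketch above), the final loop prints only the latest live entries, here lang => go and os => linux, while the deleted db key never shows up.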
Future Optimizations
Here are some areas where this key-value store can be further improved:
- Compaction: Currently, deleted or overwritten records remain in the file forever. Implementing a compaction mechanism that rewrites only the live records will reduce file size and improve performance (see the sketch after this list).
- Hintfiles: Reduce disk seeks and speed up index rebuilding.
- Batching Writes: Group multiple writes into a single batch to reduce the number of disk sync operations.
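As a taste of where this is headed, here is a rough sketch of what a naive offline compaction could look like; it is an assumption about the future design, not the eventual implementation. It copies only the live records into a temporary file, swaps it in with os.Rename, and adopts the freshly built index:

func (s *Store) Compact() error {
	s.mu.Lock()
	defer s.mu.Unlock()

	tmpName := s.filename + ".compact"
	tmpFile, err := os.OpenFile(tmpName, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
	if err != nil {
		return err
	}

	// Reuse writeRecord by wrapping the temporary file in a throwaway Store.
	tmp := &Store{filename: tmpName, file: tmpFile, index: make(map[string]int64)}
	for key, offset := range s.index {
		if offset == tombstoneOffset {
			continue // deleted keys are dropped entirely
		}
		record, err := s.readRecord(offset)
		if err != nil {
			tmpFile.Close()
			return err
		}
		newOffset, err := tmp.writeRecord(record)
		if err != nil {
			tmpFile.Close()
			return err
		}
		tmp.index[key] = newOffset
	}

	// Swap the compacted log in place of the old one and adopt its index.
	s.file.Close()
	tmpFile.Close()
	if err := os.Rename(tmpName, s.filename); err != nil {
		return err
	}
	file, err := os.OpenFile(s.filename, os.O_RDWR|os.O_CREATE, 0666)
	if err != nil {
		return err
	}
	s.file = file
	s.index = tmp.index
	return nil
}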
Conclusion
We’ve built a simple yet powerful key-value store that demonstrates core database storage principles. Key-value stores are the backbone of many real-world databases and systems. For example:
- Riak: A distributed NoSQL database that uses Bitcask as a storage backend to handle write-heavy workloads efficiently.
- Cassandra: Uses an LSM-based storage engine (memtables flushed to immutable SSTables) for distributed data storage.
- RocksDB: An LSM-based embedded key-value store used inside systems like MySQL (via MyRocks) and Kafka Streams.
Full source code available on GitHub
References
https://tech-lessons.in/en/blog/bitcask/
https://riak.com/assets/bitcask-intro.pdf
https://tikv.org/deep-dive/key-value-engine/introduction/
https://tikv.org/deep-dive/key-value-engine/b-tree-vs-lsm/
https://github.com/avinassh/py-caskdb
https://www.amazon.com/Database-Internals-Deep-Distributed-Systems/dp/1492040347