Mohammed Galalen

How Databases Work Under the Hood: Building a Key-Value Store in Go

Have you ever wondered how databases like Postgres, MySQL, or Cassandra store and retrieve data so efficiently? I was amazed when I first realized that databases — these powerful, complex systems — are, at their core, just files on disk. Yes, files! Structured and optimized for fast and efficient operations. Two key components make this possible:

Query Layer: Handles user queries, optimizes them, and interacts with the storage engine.
Storage Engine: Manages how data is stored, retrieved, and updated on disk or in memory.

What Is a Storage Engine?
A storage engine is a core component of a database system responsible for managing how data is stored, retrieved, and updated on disk or in memory. It handles low-level details such as file organization, indexing, and concurrency control. At its heart, a storage engine relies on efficient data structures designed specifically for disk-based access.

Storage engines can be broadly classified into two types based on how they handle data modifications:

Mutable Storage Engines: These engines, typically built on B-Trees, are optimized for read-heavy workloads. They update data in place, meaning existing data is overwritten when modified. Examples include Postgres, InnoDB (MySQL), and SQLite.

Immutable Storage Engines: These engines, built on structures like Log-Structured Merge Trees (LSM Trees) and Log-Structured Hash Tables (LSHTs), are optimized for write-heavy workloads. Instead of overwriting data, they append new data to a log file and periodically clean up old or deleted data through a process called compaction. Examples include LevelDB, RocksDB, Bitcask, and Cassandra.

In this post, we’ll build a simple immutable key-value store, inspired by the Bitcask model. Bitcask is based on a Log-Structured Hash Table (LSHT), which appends data sequentially to a log file and maintains an in-memory hash table for fast lookups. LSHT-based engines are simpler to implement compared to other designs like LSM-Trees, as they avoid the complexity of sorting, multi-level merging, and additional data structures.

By the end of this post, you’ll have a foundational understanding of how storage engines work and a working implementation of a minimal key-value store. In future posts, we’ll enhance this implementation by adding features like compaction, advanced indexing, and handling large datasets efficiently.

Full source code available on GitHub

Design Overview

Our key-value store uses an append-only binary file format with a simple yet efficient record structure:

Each record contains:

Timestamp (4 bytes)
Key length (4 bytes)
Value length (4 bytes)
Tombstone flag (1 byte)
Key data (variable length)
Value data (variable length)

Record format:

+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| Timestamp (4B)   | Key len (4B)     | Value len (4B)   | Tombstone flag (1B)| Key (variable)     | Value (variable)   |
+------------------+------------------+------------------+--------------------+--------------------+--------------------+
| 0x5F8D3E8F       | 0x00000005       | 0x00000007       | 0x00               | "mykey"            | "myvalue"          |
+------------------+------------------+------------------+--------------------+--------------------+--------------------+

Why Binary Format for Serialization?

While JSON and XML are human-readable and widely used for data interchange, most modern storage systems use binary formats for their core storage. Here’s why:

Performance: The system can read and write binary formats directly without parsing overhead.
Space Efficiency: Binary representations typically require significantly less storage space than text-based formats.
Cross-Platform Compatibility: A fixed layout with an explicit byte order (big-endian here) can be read back identically on any system.
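
To make the space savings concrete: the sample record above takes 13 header bytes plus 5 key bytes plus 7 value bytes, 25 bytes in total. A JSON object carrying the same four fields would run to around 75 bytes, roughly three times as much, and would have to be parsed on every read.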

Implementation

Types

type Record struct {
   Key       []byte
   Value     []byte
   Timestamp uint32
   Tombstone bool
}

type Store struct {
   filename string
   mu       sync.RWMutex     // guards file and index
   file     *os.File
   index    map[string]int64 // key -> offset of its latest record
}

The Store struct (MinKV's core type) maintains a persistent file handle and an in-memory index for efficient lookups.
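
The snippets below also reference two package-level constants that aren't shown in the post. Here's a plausible minimal definition, assuming a 13-byte header and a sentinel offset for deletions (the actual repo may define these differently):

const (
   // headerSize is the fixed-size portion of every record:
   // 4-byte timestamp + 4-byte key length + 4-byte value length
   // + 1-byte tombstone flag
   headerSize = 4 + 4 + 4 + 1 // 13 bytes

   // tombstoneOffset is a sentinel index value that marks a key
   // as deleted without removing its map entry
   tombstoneOffset = int64(-1)
)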

Creating the Store

func Open(filename string) (*Store, error) {
   file, err := os.OpenFile(filename, os.O_RDWR|os.O_CREATE, 0666)
   if err != nil {
      return nil, fmt.Errorf("failed to open file: %w", err)
   }
   store := &Store{
      filename: filename,
      file:     file,
      index:    make(map[string]int64),
   }

   if err := store.buildIndex(); err != nil {
      file.Close()
      return nil, fmt.Errorf("failed to rebuild index: %w", err)
   }

   return store, nil
}

Index Rebuilding

To support efficient key lookups, the buildIndex function scans the file and populates the in-memory index with the latest offset for each key. (If the file is large, rebuilding the index this way will be slow; the hint files mentioned under Future Optimizations address this.)

func (s *Store) buildIndex() error {
   s.mu.Lock()
   defer s.mu.Unlock()

   var offset int64
   for {
      record, err := s.readRecord(offset)
      if err == io.EOF {
         break
      }
      if err != nil {
         return fmt.Errorf("failed to read record: %w", err)
      }

      // mark tombstoned records as deleted
      if record.Tombstone {
         s.index[string(record.Key)] = tombstoneOffset
      } else {
         s.index[string(record.Key)] = offset
      }
      offset += int64(headerSize + len(record.Key) + len(record.Value))
   }
   return nil
}
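Because the scan proceeds in file order, a key that was written several times ends up pointing at its most recent offset: each later record simply overwrites the earlier map entry for that key.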

Binary Serialization

The writeRecord function efficiently serializes records into binary format:

func (s *Store) writeRecord(record *Record) (int64, error) {
   totalSize := headerSize + len(record.Key) + len(record.Value)
   buf := make([]byte, totalSize)

   // serialize timestamp (4 bytes)
   binary.BigEndian.PutUint32(buf[0:4], record.Timestamp)

   // serialize key len (4 bytes)
   binary.BigEndian.PutUint32(buf[4:8], uint32(len(record.Key)))

   // serialize value len (4 bytes)
   binary.BigEndian.PutUint32(buf[8:12], uint32(len(record.Value)))

   // serialize tombstone (1 byte)
   if record.Tombstone {
      buf[12] = 1
   } else {
      buf[12] = 0
   }

   // copy key data
   copy(buf[headerSize:headerSize+len(record.Key)], record.Key)

   // copy value data
   copy(buf[headerSize+len(record.Key):], record.Value)

   // write record to file
   offset, err := s.file.Seek(0, io.SeekEnd)
   if err != nil {
      return 0, err
   }

   if _, err := s.file.Write(buf); err != nil {
      return 0, err
   }

   return offset, nil
}

The readRecord function reads a binary-encoded record back from the file. It uses ReadAt rather than Seek followed by Read, so concurrent readers holding the read lock never race on the file's shared offset:

func (s *Store) readRecord(offset int64) (*Record, error) {
   // read the fixed-size header in a single call; ReadAt doesn't move
   // the file's shared offset, so it's safe under a read lock
   header := make([]byte, headerSize)
   if _, err := s.file.ReadAt(header, offset); err != nil {
      return nil, err
   }

   record := &Record{
      Timestamp: binary.BigEndian.Uint32(header[0:4]),
      Tombstone: header[12] == 1,
   }
   keyLen := binary.BigEndian.Uint32(header[4:8])
   valueLen := binary.BigEndian.Uint32(header[8:12])

   // read the key and value with one more call
   data := make([]byte, keyLen+valueLen)
   if _, err := s.file.ReadAt(data, offset+headerSize); err != nil {
      return nil, err
   }
   record.Key = data[:keyLen]
   record.Value = data[keyLen:]

   return record, nil
}

Put Operation

The Put function writes a new record to the file and updates the in-memory index. It ensures thread safety by acquiring a write lock:

func (s *Store) Put(key, value []byte) error {
   if len(key) == 0 {
      return fmt.Errorf("key cannot be empty")
   }

   s.mu.Lock()
   defer s.mu.Unlock()

   record := &Record{
      Key:       key,
      Value:     value,
      Timestamp: uint32(time.Now().Unix()),
   }

   offset, err := s.writeRecord(record)
   if err != nil {
      return fmt.Errorf("failed to write record: %w", err)
   }

   s.index[string(key)] = offset
   return nil
}
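One durability caveat: Put returns as soon as the operating system has buffered the write, so a crash can lose records that were already acknowledged. Here's a minimal sketch of a sync-on-write variant (the syncPut name is illustrative, not from the repo); the Batching Writes idea under Future Optimizations amortizes this cost:

// syncPut is a durable variant of Put: it appends the record and then
// fsyncs, so the write survives a crash once the call returns.
func (s *Store) syncPut(key, value []byte) error {
   if err := s.Put(key, value); err != nil {
      return err
   }
   // flush the file's in-kernel buffers to stable storage
   return s.file.Sync()
}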

Get Operation

The Get function retrieves a value by its key using the in-memory index. It acquires a read lock to ensure thread safety:

func (s *Store) Get(key []byte) ([]byte, error) {
   if len(key) == 0 {
      return nil, fmt.Errorf("key cannot be empty")
   }

   s.mu.RLock()
   defer s.mu.RUnlock()

   offset, exists := s.index[string(key)]
   if !exists || offset == tombstoneOffset {
      return nil, fmt.Errorf("key not found: %s", key)
   }

   record, err := s.readRecord(offset)
   if err != nil {
      return nil, fmt.Errorf("failed to read record: %w", err)
   }

   return record.Value, nil
}

Delete Operation

The Delete function writes a tombstoned record to the file, so that deletions persist across restarts, then updates the in-memory index to mark the key as deleted.
(Notice that the old record data still exists in the file; we'll remove it when implementing compaction.)

func (s *Store) Delete(key []byte) error {
   if len(key) == 0 {
      return fmt.Errorf("key cannot be empty")
   }

   s.mu.Lock()
   defer s.mu.Unlock()
   record := &Record{
      Key:       key,
      Tombstone: true,
      Timestamp: uint32(time.Now().Unix()),
   }

   _, err := s.writeRecord(record)
   if err != nil {
      return fmt.Errorf("failed to write record: %w", err)
   }

   s.index[string(key)] = tombstoneOffset
   return nil
}

Getting All the Records

We've implemented an iterator pattern to scan records from the file one at a time, skipping tombstoned records, without loading everything into memory at once.

type Iterator interface {
    Next() bool
    Record() (*Record, error)
}
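The interface alone doesn't show how Store produces an iterator. Here's a minimal sketch of a file-order implementation under the same assumptions as above (the repo's actual iterator may differ):

type fileIterator struct {
   store  *Store
   offset int64
   record *Record
   err    error
}

func (s *Store) Iterator() (Iterator, error) {
   return &fileIterator{store: s}, nil
}

func (it *fileIterator) Next() bool {
   // locking is omitted for brevity; a production version would hold
   // s.mu.RLock while touching the index
   for {
      offset := it.offset
      record, err := it.store.readRecord(offset)
      if err == io.EOF {
         return false
      }
      if err != nil {
         it.err = err
         return false
      }
      it.offset = offset + int64(headerSize+len(record.Key)+len(record.Value))

      // yield only the live, latest version of each key: the index
      // points at the newest offset, so stale versions and tombstoned
      // records fail this check and are skipped
      if record.Tombstone || it.store.index[string(record.Key)] != offset {
         continue
      }
      it.record = record
      return true
   }
}

func (it *fileIterator) Record() (*Record, error) {
   return it.record, it.err
}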

Example usage:

package main

import (
 "fmt"

 "github.com/galalen/minkv"
)

func main() {
   store, _ := minkv.Open("store.db")
   defer store.Close()

   // Put data
   _ = store.Put([]byte("os"), []byte("mac"))
   _ = store.Put([]byte("db"), []byte("kv"))
   _ = store.Put([]byte("lang"), []byte("go"))

   // Retrieve data
   value, _ := store.Get([]byte("os"))
   fmt.Printf("os => %s\n", value)

   // update value
   _ = store.Put([]byte("os"), []byte("linux"))

   // Delete data
   _ = store.Delete([]byte("db"))

   // Iterate over records
   iter, _ := store.Iterator()
   for iter.Next() {
      record, _ := iter.Record()
      fmt.Printf("Key: %s, Value: %s\n", record.Key, record.Value)
   }
}
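Assuming the iterator sketch above, the final loop prints the two live records in file order: lang => go, then the updated os => linux. The deleted db key and the stale os => mac version are skipped.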

Future Optimizations

Here are some areas where this key-value store can be further improved:

- Compaction: Currently, deleted or overwritten records remain in the file until compaction is performed. Implementing a compaction mechanism will help reduce file size and improve performance.
- Hint Files: Persist key-to-offset mappings alongside the data file so the index can be rebuilt without scanning every record, reducing disk seeks and startup time.
- Batching Writes: Group multiple writes into a single batch to reduce the number of disk sync operations.

Conclusion

We’ve built a simple yet powerful key-value store that demonstrates core database storage principles. Key-value stores are the backbone of many real-world databases and systems. For example:

- Riak: A distributed NoSQL database that implements the Bitcask log structure to handle write-heavy operations efficiently.
- Cassandra: Uses key-value principles for distributed data storage.
- RocksDB: A key-value store built on an LSM-based architecture, embedded in systems like MySQL (via MyRocks) and Kafka Streams.

Full source code available on GitHub

References
https://tech-lessons.in/en/blog/bitcask/
https://riak.com/assets/bitcask-intro.pdf
https://tikv.org/deep-dive/key-value-engine/introduction/
https://tikv.org/deep-dive/key-value-engine/b-tree-vs-lsm/
https://github.com/avinassh/py-caskdb
https://www.amazon.com/Database-Internals-Deep-Distributed-Systems/dp/1492040347
