<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Debjit Bhattacharjee</title>
    <description>The latest articles on DEV Community by Debjit Bhattacharjee (@sf_1997).</description>
    <link>https://dev.to/sf_1997</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2481991%2F73e02a4c-868f-43a5-8ef4-f96999c90e99.png</url>
      <title>DEV Community: Debjit Bhattacharjee</title>
      <link>https://dev.to/sf_1997</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sf_1997"/>
    <language>en</language>
    <item>
      <title>DeepSeek 3FS: A High-Performance Distributed File System for Modern Workloads</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Sun, 09 Mar 2025 07:07:46 +0000</pubDate>
      <link>https://dev.to/sf_1997/deepseek-3fs-a-high-performance-distributed-file-system-for-modern-workloads-5998</link>
      <guid>https://dev.to/sf_1997/deepseek-3fs-a-high-performance-distributed-file-system-for-modern-workloads-5998</guid>
      <description>&lt;p&gt;In this blog post, we’ll dive deep into the design and implementation of DeepSeek 3FS, a distributed file system engineered for high-performance workloads like data analytics and machine learning. We’ll explore its architecture, components, file system interfaces, metadata management, and chunk storage system, with detailed explanations, diagrams, and flowcharts to break down the complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to DeepSeek 3FS
&lt;/h2&gt;

&lt;p&gt;DeepSeek 3FS is a distributed file system designed to provide strong consistency, high throughput, and fault tolerance, leveraging RDMA networks (InfiniBand or RoCE) and SSDs for optimal performance. It aims to bridge the gap between traditional file system semantics and modern object stores, offering a unified namespace and flexible data placement for applications.&lt;/p&gt;

&lt;p&gt;The system comprises four main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Manager&lt;/strong&gt;: Handles membership changes and distributes cluster configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Service&lt;/strong&gt;: Manages file metadata using a transactional key-value store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Service&lt;/strong&gt;: Stores file chunks with strong consistency using Chain Replication with Apportioned Queries (CRAQ).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client&lt;/strong&gt;: Provides two interfaces—FUSE client for ease of adoption and a native client for performance-critical applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break down each component and their interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The 3FS architecture is designed for scalability and fault tolerance, with all components communicating over an RDMA network for low-latency, high-bandwidth data transfers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Components and Their Roles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cluster Manager&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages membership and configuration changes.&lt;/li&gt;
&lt;li&gt;Multiple managers are deployed; one is elected as the primary using a distributed coordination service (e.g., ZooKeeper or etcd).&lt;/li&gt;
&lt;li&gt;Receives heartbeats from metadata and storage services to detect failures.&lt;/li&gt;
&lt;li&gt;Distributes updated cluster configurations to services and clients.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metadata Service&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless and scalable, handling file metadata operations (e.g., &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;create&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Stores metadata in a transactional key-value store (FoundationDB in production).&lt;/li&gt;
&lt;li&gt;Clients can connect to any metadata service for load balancing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Service&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages local SSDs and provides a chunk store interface.&lt;/li&gt;
&lt;li&gt;Implements CRAQ for strong consistency and high read throughput.&lt;/li&gt;
&lt;li&gt;File chunks are replicated across multiple SSDs for fault tolerance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FUSE Client&lt;/strong&gt;: Integrates with applications via the FUSE kernel module for ease of use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Client&lt;/strong&gt;: Offers asynchronous zero-copy I/O for performance-critical applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;Below is a high-level architecture diagram of 3FS, showing the interactions between components:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------+        +-------------------------+
|                         |        |                         |
|     Cluster Manager     | &amp;lt;----&amp;gt; |    Metadata Service     |
|                         |        |                         |
+-------------------------+        +-------------------------+
             |                                  |
             |                                  |
             v                                  v
+-------------------------+        +-------------------------+
|                         |        |                         |
|     Storage Service     | &amp;lt;----&amp;gt; |  Client (FUSE/Native)   |
|                         |        |                         |
+-------------------------+        +-------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  File System Interfaces
&lt;/h2&gt;

&lt;p&gt;3FS provides a POSIX-like file system interface with enhancements for modern workloads, addressing limitations of object stores while maintaining compatibility with existing applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why File System Semantics?
&lt;/h3&gt;

&lt;p&gt;Unlike object stores, 3FS offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Directory Manipulation&lt;/strong&gt;: Supports operations like moving or deleting directories atomically, critical for workflows involving temporary directories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbolic and Hard Links&lt;/strong&gt;: Enables lightweight snapshots for dynamic datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar Interface&lt;/strong&gt;: Simplifies adoption by supporting file-based data formats (e.g., CSV, Parquet) without requiring new APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations of FUSE
&lt;/h3&gt;

&lt;p&gt;While the FUSE client simplifies integration, it introduces performance overheads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Copy Overhead&lt;/strong&gt;: Data transfers between kernel and user space increase latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-threading Bottlenecks&lt;/strong&gt;: Lock contention in the FUSE shared queue limits scalability (benchmarks show ~400K 4KiB reads/sec).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent Writes&lt;/strong&gt;: Linux FUSE (v5.x) does not support concurrent writes to the same file, requiring workarounds like writing to multiple files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Native Client with Asynchronous Zero-Copy API
&lt;/h3&gt;

&lt;p&gt;To address FUSE limitations, 3FS implements a native client with an asynchronous zero-copy API inspired by Linux &lt;code&gt;io_uring&lt;/code&gt;. Key data structures include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iov&lt;/strong&gt;: A shared memory region for zero-copy read/write operations, registered with InfiniBand for RDMA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ior&lt;/strong&gt;: A ring buffer for request queuing, supporting batched and parallel I/O operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The native client spawns multiple threads to fetch and dispatch I/O requests to storage services, minimizing RPC overhead for small reads.&lt;/p&gt;
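&lt;p&gt;To make the Iov/Ior interplay concrete, here is a minimal, hypothetical Python sketch (the real client is not written in Python, and these names only stand in for the actual structures): a bounded submission ring that I/O threads drain in batches, with a byte buffer standing in for the RDMA-registered shared memory region.&lt;/p&gt;

```python
# Illustrative sketch only, not the real 3FS API. `Iov` stands in
# for the shared zero-copy buffer; `Ior` queues read requests that
# client I/O threads would dispatch to storage services in batches.
from collections import deque
from dataclasses import dataclass

@dataclass
class ReadRequest:
    chunk_id: str     # which chunk to read
    offset: int       # byte offset within the chunk
    length: int       # number of bytes to read
    iov_offset: int   # where the result lands in the shared Iov buffer

class Iov:
    """Shared memory region; in 3FS this is registered with the RDMA NIC."""
    def __init__(self, size: int):
        self.buf = bytearray(size)

class Ior:
    """Bounded submission ring; a full ring applies back-pressure."""
    def __init__(self, depth: int):
        self.depth = depth
        self.queue = deque()

    def submit(self, req: ReadRequest) -> bool:
        if len(self.queue) >= self.depth:
            return False          # caller retries later
        self.queue.append(req)
        return True

    def drain_batch(self, max_batch: int):
        """What a client I/O thread would do: pop a batch to dispatch."""
        batch = []
        while self.queue and len(batch) < max_batch:
            batch.append(self.queue.popleft())
        return batch
```

&lt;p&gt;Batching amortizes the per-request RPC cost, which is why small random reads benefit most from this interface.&lt;/p&gt;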

&lt;h4&gt;
  
  
  Flowchart: Native Client I/O Operation
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7ldx30srkt9tkowj1rf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7ldx30srkt9tkowj1rf.png" alt="A flowchart showing the lifecycle of an I/O request in the native client." width="482" height="896"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  File Metadata Management
&lt;/h2&gt;

&lt;p&gt;File metadata in 3FS is stored in FoundationDB, a distributed transactional key-value store providing Serializable Snapshot Isolation (SSI).&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata Structures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inodes&lt;/strong&gt;: Store attributes (e.g., ownership, permissions, timestamps) with a unique 64-bit ID.

&lt;ul&gt;
&lt;li&gt;File inodes include chunk size, chain table range, and shuffle seed.&lt;/li&gt;
&lt;li&gt;Directory inodes include parent inode ID and layout configurations.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Directory Entries&lt;/strong&gt;: Map parent inode IDs and entry names to target inode IDs.&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Key Encoding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Inode keys: &lt;code&gt;"INOD" + inode_id&lt;/code&gt; (little-endian for distribution across FoundationDB nodes).&lt;/li&gt;
&lt;li&gt;Directory entry keys: &lt;code&gt;"DENT" + parent_inode_id + entry_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
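&lt;p&gt;A small sketch of this key scheme (the exact byte layout in 3FS may differ; this just illustrates the idea): the inode ID is serialized little-endian so consecutive IDs scatter across FoundationDB's ordered key space, while all entries of one directory share a common prefix and stay contiguous, making &lt;code&gt;listdir&lt;/code&gt; a range scan.&lt;/p&gt;

```python
# Hypothetical encoding of the keys described above.
import struct

def inode_key(inode_id: int) -> bytes:
    # "INOD" prefix + 64-bit little-endian inode id: consecutive ids
    # differ in their low bytes, so they spread across nodes.
    return b"INOD" + struct.pack("<Q", inode_id)

def dirent_key(parent_inode_id: int, entry_name: str) -> bytes:
    # "DENT" prefix + parent inode id + entry name: all entries of
    # one directory share a prefix, so listdir is one range scan.
    return b"DENT" + struct.pack("<Q", parent_inode_id) + entry_name.encode()
```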

&lt;h3&gt;
  
  
  Metadata Operations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read-only Transactions&lt;/strong&gt;: Used for queries (e.g., &lt;code&gt;fstat&lt;/code&gt;, &lt;code&gt;listdir&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-write Transactions&lt;/strong&gt;: Used for updates (e.g., &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;rename&lt;/code&gt;), with automatic retries on conflicts.&lt;/li&gt;
&lt;/ul&gt;
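&lt;p&gt;The retry-on-conflict behavior can be sketched generically (FoundationDB's real client bindings provide their own retry loop; &lt;code&gt;ConflictError&lt;/code&gt; here is a stand-in for the store's actual conflict exception):&lt;/p&gt;

```python
# Generic sketch of an optimistic read-write transaction retry loop.
class ConflictError(Exception):
    """Stand-in for the key-value store's conflict exception."""

def run_transaction(body, max_retries=5):
    for _ in range(max_retries):
        try:
            return body()          # run the transaction body
        except ConflictError:
            continue               # concurrent conflicting update: retry
    raise RuntimeError("transaction failed after retries")
```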

&lt;h3&gt;
  
  
  Dynamic File Attributes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File Deletion&lt;/strong&gt;: For write-opened files, deletion is deferred until all file descriptors are closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Length Updates&lt;/strong&gt;: Clients periodically report maximum write positions; final length is computed on &lt;code&gt;close&lt;/code&gt; or &lt;code&gt;fsync&lt;/code&gt; by querying storage services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizations&lt;/strong&gt;: Uses rendezvous hashing to distribute length updates and hints in inodes to avoid querying all chains for small files.&lt;/li&gt;
&lt;/ul&gt;
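&lt;p&gt;Rendezvous (highest-random-weight) hashing is easy to sketch: every server scores the file ID, and the highest score wins, so updates for one file always land on the same server without any shared state. The server names below are invented for illustration.&lt;/p&gt;

```python
# Minimal rendezvous hashing sketch for routing length updates.
import hashlib

def rendezvous_pick(file_id: str, servers: list) -> str:
    def score(server: str) -> int:
        # deterministic pseudo-random weight for (server, file) pair
        h = hashlib.sha256(f"{server}:{file_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(servers, key=score)
```

&lt;p&gt;A nice property: if the chosen server disappears, only the files it owned are re-routed; everything else keeps its assignment.&lt;/p&gt;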

&lt;h2&gt;
  
  
  Chunk Storage System
&lt;/h2&gt;

&lt;p&gt;The chunk storage system is designed for high bandwidth and fault tolerance, using CRAQ for replication and balanced data placement across SSDs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Placement with CRAQ
&lt;/h3&gt;

&lt;p&gt;Files are split into chunks, replicated across storage targets using CRAQ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write Path&lt;/strong&gt;: Requests propagate from the head to the tail of a chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read Path&lt;/strong&gt;: Requests can be served by any target, balancing load across replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Chain Table Example
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chain&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Target 1 (head)&lt;/th&gt;
&lt;th&gt;Target 2&lt;/th&gt;
&lt;th&gt;Target 3 (tail)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;A1&lt;/td&gt;
&lt;td&gt;B1&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;D1&lt;/td&gt;
&lt;td&gt;E1&lt;/td&gt;
&lt;td&gt;F1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each chain has a version number, incremented on updates by the cluster manager.&lt;/p&gt;
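&lt;p&gt;A toy lookup over the chain table above (in 3FS, chunk-to-chain assignment also involves the file's chain table range and shuffle seed; this sketch uses plain round-robin for illustration). The version travels with each request so storage services can reject requests issued against a stale table.&lt;/p&gt;

```python
# Hypothetical chain-table lookup; rows mirror the example table.
CHAIN_TABLE = [
    {"chain": 1, "version": 1, "targets": ["A1", "B1", "C1"]},
    {"chain": 2, "version": 1, "targets": ["D1", "E1", "F1"]},
]

def chain_for_chunk(chunk_index: int) -> dict:
    # round-robin placement of chunks onto chains (simplified)
    return CHAIN_TABLE[chunk_index % len(CHAIN_TABLE)]
```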

&lt;h3&gt;
  
  
  Balanced Traffic During Recovery
&lt;/h3&gt;

&lt;p&gt;To mitigate bottlenecks during failures, 3FS distributes read traffic across multiple SSDs using a balanced incomplete block design. For example, if node A fails, its read traffic is split evenly among the remaining nodes instead of overloading a single replica.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recovery Traffic Flowchart
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qaw6k9kmteeo8d5znl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qaw6k9kmteeo8d5znl1.png" alt="A flowchart showing read traffic redirection during failure." width="621" height="907"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Replication with CRAQ
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Write Process
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Validate chain version.&lt;/li&gt;
&lt;li&gt;Fetch data via RDMA.&lt;/li&gt;
&lt;li&gt;Serialize writes at the head using a lock.&lt;/li&gt;
&lt;li&gt;Propagate writes along the chain.&lt;/li&gt;
&lt;li&gt;Commit at the tail and propagate acknowledgments.&lt;/li&gt;
&lt;/ol&gt;
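&lt;p&gt;The steps above can be sketched as a toy simulation (not the real protocol; error handling, RDMA, and locking are omitted): the write is marked pending at every replica head-to-tail, the tail commits, and commit acknowledgments flow back up the chain.&lt;/p&gt;

```python
# Toy CRAQ-style chain write: forward pass marks pending,
# backward pass commits from tail to head.
class Target:
    def __init__(self, name: str):
        self.name = name
        self.pending = {}    # chunk_id -> (version, data)
        self.committed = {}  # chunk_id -> (version, data)

def chain_write(chain, chunk_id, version, data):
    # forward pass: head -> tail, store as pending
    for t in chain:
        t.pending[chunk_id] = (version, data)
    # tail commits first; acknowledgments propagate tail -> head
    for t in reversed(chain):
        t.committed[chunk_id] = t.pending.pop(chunk_id)
    return True
```

&lt;p&gt;While a write is pending, CRAQ lets replicas keep serving the last committed version, which is what preserves read throughput during writes.&lt;/p&gt;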

&lt;h4&gt;
  
  
  Read Process
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Return committed version if available.&lt;/li&gt;
&lt;li&gt;Handle pending versions with a status code, allowing retries or relaxed reads.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Failure Detection and Recovery
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeats&lt;/strong&gt;: Cluster manager detects failures if heartbeats are missed for a configurable interval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Transitions&lt;/strong&gt;: Storage targets transition between public states (e.g., &lt;code&gt;serving&lt;/code&gt;, &lt;code&gt;syncing&lt;/code&gt;, &lt;code&gt;offline&lt;/code&gt;) based on local states.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery&lt;/strong&gt;: Offline targets are moved to the end of chains; data is synced using full-chunk-replace writes.&lt;/li&gt;
&lt;/ul&gt;
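&lt;p&gt;Heartbeat-based failure detection reduces to simple bookkeeping, sketched below (times are plain floats for clarity; a real implementation would use a monotonic clock and trigger chain updates when a service goes offline):&lt;/p&gt;

```python
# Sketch of the cluster manager's heartbeat tracking: a service is
# considered offline once no heartbeat arrives within the timeout.
class ClusterManager:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}   # service id -> last heartbeat time

    def heartbeat(self, service: str, now: float):
        self.last_seen[service] = now

    def offline_services(self, now: float):
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```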

&lt;h3&gt;
  
  
  Chunk Engine
&lt;/h3&gt;

&lt;p&gt;The chunk engine manages persistent storage on SSDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Files&lt;/strong&gt;: Store chunk data in physical blocks (64KiB to 64MiB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RocksDB&lt;/strong&gt;: Stores chunk metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allocator&lt;/strong&gt;: Uses bitmaps for efficient block allocation and reclamation.&lt;/li&gt;
&lt;/ul&gt;
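&lt;p&gt;A minimal bitmap allocator in the spirit described above: one bit per physical block, where allocation finds the first clear bit and freeing clears it for reuse. The real chunk engine manages multiple block size classes; this sketch uses a single one.&lt;/p&gt;

```python
# Toy bitmap block allocator: bit set = block in use.
class BitmapAllocator:
    def __init__(self, num_blocks: int):
        self.bits = bytearray((num_blocks + 7) // 8)
        self.num_blocks = num_blocks

    def allocate(self) -> int:
        for i in range(self.num_blocks):
            byte, bit = divmod(i, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit   # mark block as used
                return i
        raise RuntimeError("no free blocks")

    def free(self, i: int):
        byte, bit = divmod(i, 8)
        self.bits[byte] &= ~(1 << bit)        # block is reusable again
```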

&lt;h4&gt;
  
  
  Write Operation Flowchart
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy5hlmgfvxkejo61kix0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy5hlmgfvxkejo61kix0.png" alt="A flowchart for chunk write." width="436" height="902"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/deepseek-ai/3FS" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. All credit for this research goes to the researchers of this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;DeepSeek 3FS is a robust distributed file system tailored for modern workloads, combining the familiarity of file system semantics with the scalability of object stores. Its use of RDMA, CRAQ, and FoundationDB ensures high performance, strong consistency, and fault tolerance. Whether you're running data analytics or machine learning pipelines, 3FS offers a flexible and efficient storage solution.&lt;/p&gt;

&lt;p&gt;Feel free to experiment with 3FS in your projects! If you have questions or insights, drop them in the comments below.&lt;/p&gt;

&lt;p&gt;You can find me on &lt;a href="https://x.com/sf_9179" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Research Paper Series: Zanzibar: Google’s Consistent, Global Authorization System</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Sat, 08 Feb 2025 12:14:21 +0000</pubDate>
      <link>https://dev.to/sf_1997/research-paper-series-zanzibar-googles-consistent-global-authorization-system-1pk6</link>
      <guid>https://dev.to/sf_1997/research-paper-series-zanzibar-googles-consistent-global-authorization-system-1pk6</guid>
      <description>&lt;h2&gt;
  
  
  Understanding the Zanzibar White Paper: A Deep Dive into Scalable Authorization Systems
&lt;/h2&gt;

&lt;p&gt;Modern applications demand scalable and fine‑grained access control. With billions of relationships and millions of queries per second, traditional authorization systems often fall short. Enter &lt;strong&gt;Zanzibar&lt;/strong&gt;, Google’s innovative approach to distributed authorization. In this post, we’ll break down the key concepts, architecture, and advanced features (like Leopard indexing) described in the Zanzibar white paper—plus a look at a novel solution known as &lt;strong&gt;zookies&lt;/strong&gt; that tackles what some call the "new enemy problem" in security.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;The Need for a Scalable Authorization System&lt;/li&gt;
&lt;li&gt;Meet Zanzibar: Background and Key Contributors&lt;/li&gt;
&lt;li&gt;
The Core Tuple-Based Model

&lt;ul&gt;
&lt;li&gt;Direct and Indirect Permissions&lt;/li&gt;
&lt;li&gt;Real-World Examples&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Architectural Overview

&lt;ul&gt;
&lt;li&gt;Global Tuple Data Store&lt;/li&gt;
&lt;li&gt;The Authorization Engine&lt;/li&gt;
&lt;li&gt;Schema and Policy Layer&lt;/li&gt;
&lt;li&gt;Consistency and Caching&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Advanced Indexing with Leopard Indexing&lt;/li&gt;

&lt;li&gt;Zookies: Tackling the New Enemy Problem&lt;/li&gt;

&lt;li&gt;End-to-End Authorization Decision Flow&lt;/li&gt;

&lt;li&gt;Real-World Use Cases&lt;/li&gt;

&lt;li&gt;Challenges in Distributed Systems&lt;/li&gt;

&lt;li&gt;Impact and Future Directions&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s microservices and cloud-based architectures, &lt;strong&gt;authorization&lt;/strong&gt;—deciding who can do what—is both critical and challenging. Google’s Zanzibar system was designed to address these challenges at scale, enabling millions of authorization decisions per second with consistency and flexibility. The Zanzibar white paper outlines a novel, tuple-based model that can express both simple and complex access control policies.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The core concepts behind Zanzibar&lt;/li&gt;
&lt;li&gt;Its distributed architecture and data model&lt;/li&gt;
&lt;li&gt;How it handles complex policies such as hierarchical and time‑bound permissions&lt;/li&gt;
&lt;li&gt;The role of advanced techniques like Leopard indexing in ensuring low latency and high performance&lt;/li&gt;
&lt;li&gt;And finally, a new security solution—&lt;strong&gt;zookies&lt;/strong&gt;—that tackles what’s known as the "new enemy problem."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you’re an engineer working on security or simply curious about modern authorization mechanisms, read on to learn more about Zanzibar and its evolving features.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Need for a Scalable Authorization System
&lt;/h2&gt;

&lt;p&gt;Traditional access control systems, often based on role‑based access control (RBAC), struggle when facing modern demands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Volume:&lt;/strong&gt; Billions of relationships and millions of access decisions per second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Relationships:&lt;/strong&gt; Permissions aren’t always direct—users might inherit access via groups, hierarchies, or other indirect relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Environments:&lt;/strong&gt; Global systems must maintain consistency across multiple data centers and regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zanzibar was conceived to meet these challenges by providing a flexible yet efficient authorization engine that could work at Google’s massive scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Meet Zanzibar: Background and Key Contributors
&lt;/h2&gt;

&lt;p&gt;The Zanzibar white paper is the result of collaborative efforts by a dedicated team at Google. While the names might vary between versions, here are some key roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual Design:&lt;/strong&gt; Visionaries who recognized the limitations of existing systems and introduced the tuple‑based model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Engineering:&lt;/strong&gt; Engineers who tackled distributed consistency challenges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing Innovations:&lt;/strong&gt; Researchers who developed advanced indexing (like Leopard indexing) to optimize data retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Design:&lt;/strong&gt; Developers who created a flexible schema layer for defining complex access policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Timeline of Key Events
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Conceptualization:&lt;/strong&gt; Brainstorming sessions and whiteboarding led to the idea of representing permissions as tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototype Development:&lt;/strong&gt; Early prototypes exposed challenges in query performance and data retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leopard Indexing Introduction:&lt;/strong&gt; A breakthrough that dramatically reduced lookup latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal Rollout:&lt;/strong&gt; Iterative testing and refinement based on real-world feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;White Paper Publication:&lt;/strong&gt; Sharing the design and lessons learned with the broader community.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Core Tuple-Based Model
&lt;/h2&gt;

&lt;p&gt;At the heart of Zanzibar lies a simple yet powerful data model: the tuple.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direct and Indirect Permissions
&lt;/h3&gt;

&lt;p&gt;Each permission is represented as a tuple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(object, relation, subject)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object:&lt;/strong&gt; The resource (e.g., a document, folder, or service).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relation:&lt;/strong&gt; The type of permission (e.g., read, write, edit).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subject:&lt;/strong&gt; The entity (user, group, or service) granted the permission.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Direct Permissions
&lt;/h4&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document123, viewer, Alice)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tuple directly grants Alice viewing rights to Document123.&lt;/p&gt;

&lt;h4&gt;
  
  
  Indirect Permissions
&lt;/h4&gt;

&lt;p&gt;Indirect relationships can be expressed using multiple tuples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document123, editor, GroupX)
(GroupX, member, Bob)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though Bob isn’t directly assigned to Document123, his membership in GroupX grants him editor rights.&lt;/p&gt;
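&lt;p&gt;This indirect grant can be checked with a short recursive walk over the tuples, sketched here (a toy evaluator, not Zanzibar's actual algorithm; it hardcodes &lt;code&gt;member&lt;/code&gt; as the group-expansion relation and has no cycle protection):&lt;/p&gt;

```python
# Toy tuple check: a direct tuple grants access, and subjects that
# are themselves groups are expanded through their "member" tuples.
def check(tuples, obj, relation, subject):
    for (o, r, s) in tuples:
        if o == obj and r == relation:
            if s == subject:
                return True            # direct grant
            # treat the granted subject as a possible group and recurse
            # (no cycle detection in this sketch)
            if check(tuples, s, "member", subject):
                return True
    return False
```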

&lt;h3&gt;
  
  
  Real-World Examples
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Hierarchical Permissions
&lt;/h4&gt;

&lt;p&gt;In a corporate file system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(FolderA, contains, FolderB)
(FolderA, editor, UserX)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserX, an editor of FolderA, can inherit editing rights for FolderB and its files.&lt;/p&gt;

&lt;h4&gt;
  
  
  Combined Conditions
&lt;/h4&gt;

&lt;p&gt;A sensitive document might require both team membership and explicit permission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document456, editor, TeamY)
(TeamY, member, UserZ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserZ must be a member of TeamY (indirect permission) to gain editing rights.&lt;/p&gt;

&lt;h4&gt;
  
  
  Temporal Constraints
&lt;/h4&gt;

&lt;p&gt;Permissions can also be time‑bound:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document789, viewer, ContractorA)  // With an expiration timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access is only valid within a specified time window.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architectural Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3rrnhg9fr1ii0kzgvz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3rrnhg9fr1ii0kzgvz0.png" alt="Zanzibar architecture. Arrows indicate the direction of data flow" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zanzibar’s architecture is engineered for scalability and performance. Here’s a look at its key components:&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Tuple Data Store
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed:&lt;/strong&gt; Operates across multiple data centers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable:&lt;/strong&gt; Designed to handle billions of tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low Latency:&lt;/strong&gt; Optimized for rapid read and write operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Authorization Engine
&lt;/h3&gt;

&lt;p&gt;The engine processes access requests through these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Request Parsing:&lt;/strong&gt; Extract the object, relation, and subject from the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tuple Lookup:&lt;/strong&gt; Query the data store for relevant tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Evaluation:&lt;/strong&gt; Follow indirect relationships (e.g., group memberships) to determine effective permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision Output:&lt;/strong&gt; Consolidate findings and grant or deny access.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Schema and Policy Layer
&lt;/h3&gt;

&lt;p&gt;This layer provides flexibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customizable:&lt;/strong&gt; Define new object types, relations, and composite relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible:&lt;/strong&gt; Easily incorporate new access control paradigms without a full system redesign.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Consistency and Caching
&lt;/h3&gt;

&lt;p&gt;To ensure every node has up‑to‑date data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Propagation Protocols:&lt;/strong&gt; Distribute updates quickly across nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict Resolution:&lt;/strong&gt; Handle concurrent updates seamlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching Strategies:&lt;/strong&gt; Use local caches with invalidation mechanisms to reduce latency without sacrificing accuracy.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Advanced Indexing with Leopard Indexing
&lt;/h2&gt;

&lt;p&gt;As Zanzibar scaled, performance challenges emerged. &lt;strong&gt;Leopard indexing&lt;/strong&gt; was introduced as an advanced method to optimize tuple lookups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Leopard Indexing?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Minimizes latency by reducing disk and network operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Supports queries on billions of tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Efficiently handles multiple query directions (object, relation, subject).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;Leopard indexing decomposes tuples into individual components and builds multiple index structures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object-Relation Index:&lt;/strong&gt; Quickly retrieves all tuples associated with a specific object and relation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subject-Relation Index:&lt;/strong&gt; Enables queries initiated from the subject side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite Indexes for Groups:&lt;/strong&gt; Facilitates rapid evaluation of indirect relationships, such as group memberships.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Diagram: Leopard Indexing Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------+
|       Global Tuple Data Store       |
|  (All (object, relation, subject))  |
+-------------------------------------+
                   |
                   v
+-------------------------------------+
|       Leopard Indexing Layer        |
|                                     |
|  - Object-Relation Index            |
|  - Subject-Relation Index           |
|  - Composite Indexes for Groups     |
+-------------------------------------+
                   |
                   v
+-------------------------------------+
|     Rapid Tuple Retrieval Layer     |
| (Optimized Query Resolution Engine) |
+-------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these indices, when a client queries “Can Alice read Document123?”, the engine can directly retrieve the relevant tuples with minimal overhead.&lt;/p&gt;
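&lt;p&gt;The index idea can be illustrated with a small sketch (these structures are invented for illustration, not Leopard's actual layout): the same tuples are mirrored into two maps, so a check like the one above becomes a constant-time lookup instead of a scan over all tuples.&lt;/p&gt;

```python
# Toy dual-index over relationship tuples.
from collections import defaultdict

class TupleIndex:
    def __init__(self):
        self.by_object = defaultdict(set)   # (object, relation) -> subjects
        self.by_subject = defaultdict(set)  # (subject, relation) -> objects

    def add(self, obj, relation, subject):
        # every tuple is written to both indexes
        self.by_object[(obj, relation)].add(subject)
        self.by_subject[(subject, relation)].add(obj)

    def has(self, obj, relation, subject) -> bool:
        # "Can Alice read Document123?" is a single set lookup
        return subject in self.by_object[(obj, relation)]
```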




&lt;h2&gt;
  
  
  Zookies: Tackling the New Enemy Problem
&lt;/h2&gt;

&lt;p&gt;In addition to the core challenges of distributed authorization, modern systems must also address what is sometimes referred to as the &lt;strong&gt;"new enemy problem."&lt;/strong&gt; This problem involves adversaries attempting to exploit vulnerabilities by injecting unauthorized or stale tuple data into the system. To counter this, an innovative solution known as &lt;strong&gt;zookies&lt;/strong&gt; has been introduced.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Are Zookies?
&lt;/h3&gt;

&lt;p&gt;A zookie is an opaque consistency token, encoding a timestamp, that Zanzibar returns to clients and uses to bound how stale an authorization check may be. In practice, zookies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capture a Point in Time:&lt;/strong&gt;
Each zookie encodes the timestamp of an ACL or content update, so later checks can be required to be at least that fresh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Travel with Content:&lt;/strong&gt;
When content is modified, the application stores the zookie from the preceding authorization check alongside the new content version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound Staleness on Checks:&lt;/strong&gt;
A check request carrying a zookie must be evaluated at a snapshot no older than the zookie’s timestamp, so stale ACLs are never applied to newer content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preserve Performance:&lt;/strong&gt;
Because most checks can still be served from slightly stale replicas and caches, zookies provide external consistency without forcing every check to read the latest data, and they work in tandem with Leopard indexing with minimal impact on query performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Zookies Work in Practice
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Content Write:&lt;/strong&gt;
Before saving a new version of an object, the application issues an authorization check, and Zanzibar returns a zookie reflecting the evaluation timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zookie Storage:&lt;/strong&gt;
The application persists that zookie together with the content version it protects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subsequent Checks:&lt;/strong&gt;
When the content is later read, the stored zookie is sent along with the check request, and Zanzibar evaluates the check at a snapshot at least as fresh as the zookie’s timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fresh Enough, Not Necessarily Latest:&lt;/strong&gt;
Zanzibar is free to choose any snapshot newer than the zookie’s timestamp, so most checks can still be answered from local replicas and caches, ensuring that consistency does not come at the cost of performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By tying each piece of content to a minimum snapshot timestamp, zookies solve the "new enemy problem" and give Zanzibar external consistency without sacrificing low-latency checks.&lt;/p&gt;
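&lt;p&gt;The snapshot-selection rule a zookie implies can be sketched in a few lines of toy Python (invented names, plain integer timestamps standing in for the real opaque tokens):&lt;/p&gt;

```python
TUPLES = [
    # (object, relation, subject, granted_at)
    ("Document123", "read", "Alice", 5),
    ("Document123", "read", "Bob", 2),
]
REVOCATIONS = [
    # (object, relation, subject, revoked_at)
    ("Document123", "read", "Bob", 9),
]

def check(obj, relation, subject, zookie_ts, replica_ts):
    # A zookie forces the evaluation snapshot to be at least as
    # fresh as the content version it was issued for.
    snapshot = max(replica_ts, zookie_ts)
    granted = any(
        o == obj and r == relation and s == subject and snapshot >= t
        for (o, r, s, t) in TUPLES
    )
    revoked = any(
        o == obj and r == relation and s == subject and snapshot >= t
        for (o, r, s, t) in REVOCATIONS
    )
    return granted and not revoked
```

&lt;p&gt;With a stale replica at timestamp 7, a check for Bob without a zookie still succeeds, but a check carrying a zookie from timestamp 10 is forced onto a snapshot that already sees the revocation.&lt;/p&gt;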




&lt;h2&gt;
  
  
  End-to-End Authorization Decision Flow
&lt;/h2&gt;

&lt;p&gt;Let’s walk through the process step-by-step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request Arrival:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A client sends an access check request, e.g., “Can Alice read Document123?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request Parsing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The engine extracts the object (Document123), relation (read), and subject (Alice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Index Querying Using Leopard Indexing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The engine quickly queries the object‑relation index to retrieve direct tuples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct Tuple Evaluation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If &lt;code&gt;(Document123, read, Alice)&lt;/code&gt; exists, access is granted. Otherwise, indirect relationships are evaluated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recursive Evaluation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For example, if &lt;code&gt;(Document123, read, GroupX)&lt;/code&gt; exists, the engine checks if Alice is a member of GroupX via &lt;code&gt;(GroupX, member, Alice)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal and Conditional Checks:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The engine verifies any time‑bound or conditional metadata (with zookies ensuring data integrity).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Final Decision:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If a valid permission chain is found, access is granted; if not, it is denied.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Detailed Flow Chart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Start: Receive Access Request]
           |
           v
[Parse Request: Extract Object, Relation, Subject]
           |
           v
[Query Leopard Indexes for (Object, Relation) tuples]
           |
           v
[Direct Tuple Found?] --&amp;gt; [Yes] --&amp;gt; [Grant Access]
           |
         No|
           v
[Check for Indirect Relationships via Indexes]
           |
           v
[Recursive Evaluation of Group or Hierarchical Tuples]
           |
           v
[Evaluate Additional Conditions (Temporal, etc.)]
           |
           v
[Consolidate Findings]
           |
           v
[Decision: Valid Permission Chain Exists?]
           |             \
          Yes             No
           |               \
           v                v
     [Grant Access]     [Deny Access]
           |
           v
         [End]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
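&lt;p&gt;The direct and recursive steps in the chart above can be sketched in Python (toy data and a simple depth guard, without the caching and deduplication the real system uses):&lt;/p&gt;

```python
TUPLES = {
    ("Document123", "read", "GroupX"),
    ("GroupX", "member", "Alice"),
}

def check(obj, relation, subject, depth=0):
    if depth > 10:  # guard against cyclic group definitions
        return False
    # Direct tuple evaluation.
    if (obj, relation, subject) in TUPLES:
        return True
    # Recursive evaluation: expand group-valued subjects.
    for (o, r, s) in TUPLES:
        if o == obj and r == relation:
            if check(s, "member", subject, depth + 1):
                return True
    return False

print(check("Document123", "read", "Alice"))  # True
```

&lt;p&gt;Alice has no direct tuple on Document123, but the engine finds &lt;code&gt;(Document123, read, GroupX)&lt;/code&gt; and recurses into GroupX’s membership.&lt;/p&gt;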






&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Corporate File Systems with Hierarchical Permissions
&lt;/h3&gt;

&lt;p&gt;Imagine a corporate file system where folders are nested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tuples:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  (FolderA, contains, FolderB)
  (FolderA, editor, UserX)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserX’s permission on FolderA cascades down to FolderB and its contents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combined Conditions for Sensitive Resources
&lt;/h3&gt;

&lt;p&gt;For sensitive documents, multiple conditions may be required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tuples:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  (Document456, editor, SecurityTeam)
  (SecurityTeam, member, UserY)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserY must satisfy both the team membership (indirect permission) and any direct conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temporal Permissions and Time‑Bound Access
&lt;/h3&gt;

&lt;p&gt;Time-sensitive access is common for contractors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tuples:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  (Document789, viewer, ContractorA)  // With an expiration timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access is granted only within a specified time window.&lt;/p&gt;
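&lt;p&gt;A minimal sketch of such a time-bound check (hypothetical structure, using integer epoch-second timestamps rather than whatever encoding a real deployment would choose):&lt;/p&gt;

```python
import time

# Toy tuple store: (object, relation, subject) -> validity window.
GRANTS = {
    ("Document789", "viewer", "ContractorA"): (1_700_000_000, 1_702_592_000),
}

def check_temporal(obj, relation, subject, now=None):
    window = GRANTS.get((obj, relation, subject))
    if window is None:
        return False
    valid_from, valid_until = window
    if now is None:
        now = int(time.time())
    # Grant only inside the half-open window [valid_from, valid_until).
    return now >= valid_from and valid_until > now
```

&lt;p&gt;Once the expiration timestamp passes, the tuple still exists but the check fails, with no explicit revocation required.&lt;/p&gt;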




&lt;h2&gt;
  
  
  Challenges in Distributed Systems
&lt;/h2&gt;

&lt;p&gt;Operating at a global scale isn’t trivial. Zanzibar addresses several challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt;
Ensuring every node has the most up‑to‑date data via robust propagation protocols.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching:&lt;/strong&gt;
Local caches reduce latency but must remain synchronized to avoid stale decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Indexing:&lt;/strong&gt;
With techniques like Leopard indexing (and the additional security of zookies), low‑latency queries are maintained regardless of geographic location.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Impact and Future Directions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Influence on Modern Authorization Systems
&lt;/h3&gt;

&lt;p&gt;Zanzibar has influenced many modern access control systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adoption:&lt;/strong&gt;
Its tuple‑based model and indexing techniques have inspired both industry and open‑source projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Innovation:&lt;/strong&gt;
Design principles from Zanzibar continue to shape scalable, secure authorization in distributed environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flexibility and Extensibility
&lt;/h3&gt;

&lt;p&gt;Zanzibar’s model is adaptable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diverse Paradigms:&lt;/strong&gt;
Supports RBAC, attribute‑based, and relationship‑based access control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evolving Needs:&lt;/strong&gt;
The schema can be extended to include new relationship types, temporal constraints, and security enhancements like zookies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Indexing:&lt;/strong&gt;
Research into further optimizing indexing—possibly using predictive caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Consistency Models:&lt;/strong&gt;
New protocols may further reduce latency while ensuring up‑to‑date authorization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Upgrades:&lt;/strong&gt;
Continued development of zookies and other security measures to counter emerging threats.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Google’s Zanzibar paper presents a groundbreaking approach to authorization by breaking access control down into simple, composable tuples. By combining a robust, distributed architecture with advanced indexing techniques like Leopard indexing, plus zookies to tackle the "new enemy problem", Zanzibar handles billions of relationships and millions of authorization decisions per second while maintaining consistency across global data centers.&lt;/p&gt;

&lt;p&gt;In summary, this post covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Core Tuple-Based Model:&lt;/strong&gt;
How Zanzibar represents both direct and indirect permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Components:&lt;/strong&gt;
From the global tuple data store to the authorization engine and schema layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Indexing:&lt;/strong&gt;
How Leopard indexing optimizes performance by reducing lookup latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zookies:&lt;/strong&gt;
Consistency tokens that solve the "new enemy problem" by bounding the staleness of authorization checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-World Applications and Challenges:&lt;/strong&gt;
Practical use cases and the complexities of distributed systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future Directions:&lt;/strong&gt;
The ongoing impact of Zanzibar on modern authorization systems and areas for further innovation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By understanding the Zanzibar paper, and ideas like zookies, we gain valuable insight into the challenges and solutions powering today’s scalable and secure access control systems. Whether you’re building your own authorization engine or just curious about distributed systems, Zanzibar offers a wealth of ideas and inspiration.&lt;/p&gt;

&lt;p&gt;For a more in-depth understanding, you can access the full paper &lt;a href="https://www.usenix.org/system/files/atc19-pang.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Happy coding! If you found this post useful, please leave a comment or share your thoughts on tackling authorization challenges in your projects.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: dev, authorization, security, distributed-systems, backend, scalability&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You can find me on &lt;a href="https://x.com/sf_9179?t=Mq38zsA_v5LvsnyIZPdM8A&amp;amp;s=09" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>cloud</category>
      <category>security</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Research Paper Series: Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Mon, 13 Jan 2025 16:03:43 +0000</pubDate>
      <link>https://dev.to/sf_1997/research-paper-series-using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-1p37</link>
      <guid>https://dev.to/sf_1997/research-paper-series-using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-1p37</guid>
      <description>&lt;p&gt;The paper "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" presents a pragmatic approach to ensuring the correctness of &lt;strong&gt;ShardStore&lt;/strong&gt;, a key-value storage node in Amazon S3.&lt;/p&gt;

&lt;h2&gt;
  
  
  A bit on ShardStore
&lt;/h2&gt;

&lt;p&gt;ShardStore, a key-value storage node in Amazon S3, plays a critical role in efficiently storing and retrieving objects by organizing data into &lt;strong&gt;extents&lt;/strong&gt;. Here’s how it works and interacts with extents:&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Components and Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ShardStore Overview&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShardStore serves as a layer in the S3 architecture, responsible for handling object storage within shards.&lt;/li&gt;
&lt;li&gt;Data is partitioned into shards to ensure scalability, fault isolation, and efficient data management.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interaction with Extents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extents&lt;/strong&gt; are fixed-size blocks of data, typically spanning multiple megabytes. They are the primary storage units managed by ShardStore.&lt;/li&gt;
&lt;li&gt;Each extent contains:

&lt;ul&gt;
&lt;li&gt;Data for multiple objects or object fragments.&lt;/li&gt;
&lt;li&gt;Metadata for efficiently locating and retrieving specific pieces of data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Key-Value Abstraction&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShardStore exposes a key-value interface, where the key maps to a specific object or fragment, and the value references its location within extents.&lt;/li&gt;
&lt;li&gt;This abstraction decouples the logical organization of data from its physical storage, allowing ShardStore to optimize for performance and durability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write Workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When new data is written to ShardStore:

&lt;ol&gt;
&lt;li&gt;The data is assigned a key and appended to an available extent.&lt;/li&gt;
&lt;li&gt;Metadata is updated to reflect the key-to-extent mapping.&lt;/li&gt;
&lt;li&gt;The extent, now containing the new data, is persisted to disk and replicated for fault tolerance.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Read Workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To read data:

&lt;ol&gt;
&lt;li&gt;ShardStore uses the key to locate the corresponding extent and offset.&lt;/li&gt;
&lt;li&gt;The relevant extent is loaded from disk (or memory if cached).&lt;/li&gt;
&lt;li&gt;Data is extracted and returned to the client.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Crash Consistency and Concurrency&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShardStore employs techniques such as journaling and atomic updates to ensure crash consistency during writes.&lt;/li&gt;
&lt;li&gt;Concurrent read and write operations are managed using fine-grained locking and careful metadata updates to prevent conflicts.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Extent Lifecycle Management&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Garbage Collection&lt;/strong&gt;: As objects are deleted or overwritten, extents containing obsolete data are compacted to reclaim space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt;: Extents are replicated across multiple nodes for durability and availability.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ol&gt;
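&lt;p&gt;The key-to-extent mapping described above can be illustrated with a highly simplified Python toy (invented class and fields, nothing like the real Rust implementation, and with no journaling, replication, or crash handling):&lt;/p&gt;

```python
class ShardStoreToy:
    """Toy model: keys map to (extent_id, offset, length) locations."""
    def __init__(self, extent_size=8):
        self.extents = [bytearray()]   # extent 0 starts empty
        self.index = {}                # key -> (extent_id, offset, length)
        self.extent_size = extent_size

    def put(self, key, value):
        # Append to the current extent, rolling over when it is full.
        if len(self.extents[-1]) + len(value) > self.extent_size:
            self.extents.append(bytearray())
        extent_id = len(self.extents) - 1
        offset = len(self.extents[extent_id])
        self.extents[extent_id].extend(value)
        self.index[key] = (extent_id, offset, len(value))

    def get(self, key):
        extent_id, offset, length = self.index[key]
        return bytes(self.extents[extent_id][offset:offset + length])

store = ShardStoreToy()
store.put("obj-1", b"hello")
store.put("obj-2", b"world")
print(store.get("obj-1"))  # b'hello'
```

&lt;p&gt;Writes are append-only within an extent, which is what makes sequential I/O dominate, while the index decouples the key space from physical placement.&lt;/p&gt;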

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kvj0crvk2jppiguf1vs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kvj0crvk2jppiguf1vs.png" alt="LSM Tree chunk reclamation" width="642" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load Balancing&lt;/strong&gt;: ShardStore dynamically moves extents between nodes to balance storage and compute load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits of Using Extents in ShardStore
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Extents enable efficient storage of millions of objects within a shard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Sequential writes to extents reduce disk I/O overhead, improving throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Isolation&lt;/strong&gt;: By partitioning data into shards and extents, failures in one area are less likely to impact the entire system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This interaction between ShardStore and extents underpins the scalability, durability, and performance of Amazon S3, making it a key innovation in distributed storage systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to validations
&lt;/h2&gt;

&lt;p&gt;Traditional formal verification methods can be resource-intensive and challenging to maintain, especially in large-scale, evolving systems. To address this, the authors advocate for "lightweight formal methods," emphasizing automation, usability, and continuous validation alongside ongoing software development. &lt;/p&gt;

&lt;p&gt;A central aspect of their approach is the development of executable reference models that serve as specifications against which the implementation is validated. These models are written in the same programming language as the implementation (Rust), facilitating their integration into the development process and enabling engineers to maintain them as the system evolves. &lt;/p&gt;

&lt;p&gt;The authors decompose correctness into independent properties, each verified using the most appropriate tool. For instance, property-based testing is employed to ensure that the implementation conforms to the reference model under various scenarios, including crash consistency and concurrent executions. This method has been effective in identifying subtle bugs that might have been missed through traditional testing methods. &lt;/p&gt;
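&lt;p&gt;In the spirit of that approach, here is a tiny Python illustration (not the paper’s actual Rust code) of conformance checking: random operation sequences are replayed against both a simple executable reference model and the implementation under test, asserting that their observable behavior agrees:&lt;/p&gt;

```python
import random

class ReferenceModel:
    """Executable specification: a plain dict."""
    def __init__(self):
        self.data = {}
    def put(self, k, v):
        self.data[k] = v
    def get(self, k):
        return self.data.get(k)

class Implementation:
    """Stand-in for the optimized store under test (append-only log)."""
    def __init__(self):
        self.log = []
    def put(self, k, v):
        self.log.append((k, v))
    def get(self, k):
        for key, value in reversed(self.log):
            if key == k:
                return value
        return None

def property_test(seed, num_ops=200):
    rng = random.Random(seed)
    model, impl = ReferenceModel(), Implementation()
    for _ in range(num_ops):
        k = rng.choice("abcd")
        if rng.random() > 0.5:
            v = rng.randint(0, 99)
            model.put(k, v)
            impl.put(k, v)
        else:
            # Conformance property: reads agree with the reference model.
            assert impl.get(k) == model.get(k)
    return True
```

&lt;p&gt;Because the model is ordinary code in the same language as the implementation, engineers can extend it alongside new features, which is exactly the maintainability argument the paper makes.&lt;/p&gt;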

&lt;p&gt;By integrating these lightweight formal methods into the engineering workflow, the team has prevented 16 issues from reaching production, including complex crash consistency and concurrency problems. Notably, the approach has been adopted by non-formal-methods experts, with engineers contributing to the development and maintenance of the reference models, demonstrating the practicality and scalability of the method in a real-world, large-scale system like Amazon S3. &lt;/p&gt;

&lt;p&gt;This work was recognized with a best-paper award at the ACM Symposium on Operating Systems Principles (SOSP) in 2021, highlighting its significance in the field of automated reasoning and formal methods. &lt;/p&gt;

&lt;p&gt;For a more in-depth understanding, you can access the full paper &lt;a href="https://assets.amazon.science/77/5e/4a7c238f4ce890efdc325df83263/using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-amazon-s3-2.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
    </item>
    <item>
      <title>Database Indexing Internals Part III</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Sat, 30 Nov 2024 12:14:20 +0000</pubDate>
      <link>https://dev.to/sf_1997/database-indexing-internals-part-iii-5bhe</link>
      <guid>https://dev.to/sf_1997/database-indexing-internals-part-iii-5bhe</guid>
      <description>&lt;h2&gt;
  
  
  B+ Tree Extensions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;B+ Tree File Organization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;B+-tree file organization is a way of representing B+-tree nodes as a tree structure within a file. Each nonleaf node holds pointers, and following one initiates another disk I/O to fetch the next index node, until we reach the leaf node, which holds the pointer to the actual record. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indexing Strings&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Creating B+-tree indices on string-valued attributes raises two problems. The first problem is that strings can be of variable length. The second problem is that strings can be&lt;br&gt;
long, leading to a low fanout and a correspondingly increased tree height.&lt;/p&gt;

&lt;p&gt;The fanout of nodes can be increased by using a technique called prefix compression. With prefix compression, we do not store the entire search key value at nonleaf&lt;br&gt;
nodes. We only store a prefix of each search key value that is sufficient to distinguish&lt;br&gt;
between the key values in the subtrees that it separates. For example, if we had an index on names, the key value at a nonleaf node could be a prefix of a name; it may suffice to store “Silb” at a nonleaf node, instead of the full “Silberschatz” if the closest values in the two subtrees that it separates are, say, “Silas” and “Silver” respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bulk Loading of B+-Tree Indices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Insertion of a large number of entries at a time into an index is referred to as bulk loading of the index. An efficient way to perform bulk loading of an index is as follows:&lt;br&gt;
First, create a temporary file containing index entries for the relation, then sort the file on the search key of the index being constructed, and finally scan the sorted file&lt;br&gt;
and insert the entries into the index. &lt;/p&gt;

&lt;p&gt;There are efficient algorithms for sorting large relations, which can sort even a large file with an I/O cost comparable to that of reading the file a few times, assuming a reasonable amount of main memory is available. &lt;/p&gt;

&lt;p&gt;There is a significant benefit to sorting the entries before inserting them into the B+-tree. When the entries are inserted in sorted order, all entries that go to a particular leaf node will appear consecutively, and the leaf needs to be written out only once;&lt;br&gt;
nodes will never have to be read from disk during bulk load, if the B+-tree was empty to start with. Each leaf node will thus incur only one I/O operation even though many entries may be inserted into the node. &lt;/p&gt;

&lt;p&gt;For a relation with 100 million entries, if each leaf contains 100 entries, the leaf level will contain 1 million nodes, resulting in only 1 million I/O operations for creating the leaf level. Even these I/O operations can be expected to be sequential, if successive leaf nodes are allocated on successive disk blocks, and few disk seeks would be required. With magnetic disks, 1 millisecond per block is a reasonable estimate for mostly sequential I/O operations, in contrast to 10 milliseconds per block for random I/O operations.&lt;/p&gt;

&lt;p&gt;If the B+-tree is initially empty, it can be constructed faster by building it bottom-up, from the leaf level, instead of using the usual insert procedure. In bottom-up B+ tree construction, after sorting the entries as we just described, we break up the sorted&lt;br&gt;
entries into blocks, keeping as many entries in a block as can fit in the block; the resulting blocks form the leaf level of the B+-tree. &lt;br&gt;
The minimum value in each block, along with the pointer to the block, is used to create entries in the next level of the B+-tree, pointing to the leaf blocks. Each further level of the tree is similarly constructed using the minimum values associated with each node one level below, until the root is created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B Tree Index Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most databases use only B+-tree index structures, since their benefits outweigh those of B-tree indices, so I am not going to go into them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indexing on Flash Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Flash storage is structured as pages, and the B+-tree index structure can be used with flash based SSDs. SSDs provide much faster random I/O operations than magnetic disks, requiring only around 20 to 100 microseconds for a random page read, instead of about 5 to 10 milliseconds with magnetic disks. Thus, lookups run much faster with data on SSDs, compared to data on magnetic disks.&lt;/p&gt;

&lt;p&gt;The performance of write operations is more complicated with flash storage. An important difference between flash storage and magnetic disks is that flash storage does not permit in-place updates to data at the physical level, although it appears to do so logically. Every update turns into a copy+write of an entire flash-storage page, requiring the old copy of the page to be erased subsequently. &lt;/p&gt;

&lt;p&gt;A new page can be written in 20 to 100 microseconds, but eventually old pages need to be erased to free up the pages for&lt;br&gt;
further writes. Erases are done at the level of blocks containing multiple pages, and a block erase takes 2 to 5 milliseconds.&lt;br&gt;
The optimum B+-tree node size for flash storage is smaller than that with magnetic disk, since flash pages are smaller than disk blocks; it makes sense for tree-node sizes to match to flash pages, since larger nodes would lead to multiple page writes when a&lt;br&gt;
node is updated. Although smaller pages lead to taller trees and more I/O operations to access data, random page reads are so much faster with flash storage that the overall impact on read performance is quite small.&lt;/p&gt;

&lt;p&gt;Although random I/O is much cheaper with SSDs than with magnetic disks, bulk loading still provides significant performance benefits, compared to tuple-at-a-time insertion, with SSDs. In particular, bottom-up construction reduces the number of page writes compared to tuple-at-a-time insertion, even if the entries are sorted on the search key. Since page writes on flash cannot be done in place and require relatively expensive block erases at a later point in time, the reduction of number of page writes with bottom-up B+-tree construction provides significant performance benefits.&lt;/p&gt;

&lt;p&gt;Several extensions and alternatives to B+-trees have been proposed for flash storage, with a focus on reducing the number of erase operations that result due to page rewrites. One approach is to add buffers to internal nodes of B+-trees and record updates temporarily in buffers at higher levels, pushing the updates down to lower levels&lt;br&gt;
lazily. The key idea is that when a page is updated, multiple updates are applied together, reducing the number of page writes per update. Another approach creates multiple trees and merges them; the log-structured merge tree and its variants are based on this idea. &lt;/p&gt;

&lt;p&gt;That’s it for this blog. More coming soon.&lt;br&gt;
Part I: &lt;a href="https://dev.to/sf_1997/database-indexing-explained-47em"&gt;Database Indexing Internals Explained&lt;/a&gt;&lt;br&gt;
Part II: &lt;a href="https://planetscale.com/blog/btrees-and-database-indexes" rel="noopener noreferrer"&gt;B+ Trees Blog from PlanetScale&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>computerscience</category>
      <category>distributedsystems</category>
      <category>learning</category>
    </item>
    <item>
      <title>Database Indexing Internals Explained</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Mon, 25 Nov 2024 16:16:52 +0000</pubDate>
      <link>https://dev.to/sf_1997/database-indexing-explained-47em</link>
      <guid>https://dev.to/sf_1997/database-indexing-explained-47em</guid>
      <description>&lt;p&gt;There are two basic kinds of indices:&lt;br&gt;
• &lt;strong&gt;Ordered indices&lt;/strong&gt;. Based on a sorted ordering of the values.&lt;br&gt;
• &lt;strong&gt;Hash indices&lt;/strong&gt;. Based on a uniform distribution of values across a range of buckets.&lt;br&gt;
The bucket to which a value is assigned is determined by a function, called a hash&lt;br&gt;
function.&lt;/p&gt;

&lt;p&gt;A file may have several indices, on different search keys. If the file containing the records is sequentially ordered, a &lt;strong&gt;clustering index&lt;/strong&gt; is an index whose search key&lt;br&gt;
also defines the sequential order of the file.&lt;/p&gt;

&lt;p&gt;Indices whose search key&lt;br&gt;
specifies an order different from the sequential order of the file are called &lt;strong&gt;nonclustering&lt;br&gt;
indices, or secondary indices&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An index entry, or index record, consists of a search-key value and pointers to one or&lt;br&gt;
more records with that value as their search-key value. The pointer to a record consists&lt;br&gt;
of the identifier of a disk block and an offset within the disk block to identify the record&lt;br&gt;
within the block.&lt;br&gt;
There are two types of ordered indices that we can use:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense index&lt;/strong&gt;: In a dense index, an index entry appears for every search-key value&lt;br&gt;
in the file. In a dense clustering index, the index record contains the search-key&lt;br&gt;
value and a pointer to the first data record with that search-key value. The rest of&lt;br&gt;
the records with the same search-key value would be stored sequentially after the&lt;br&gt;
first record, since, because the index is a clustering one, records are sorted on the&lt;br&gt;
same search key.&lt;br&gt;
In a dense nonclustering index, the index must store a list of pointers to all&lt;br&gt;
records with the same search-key value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08yom1kyfsq9ekt3p50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08yom1kyfsq9ekt3p50.png" alt="Dense Index" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sparse index&lt;/strong&gt;: In a sparse index, an index entry appears for only some of the search-&lt;br&gt;
key values. Sparse indices can be used only if the relation is stored in sorted order&lt;br&gt;
of the search key; that is, if the index is a clustering index. As is true in dense&lt;br&gt;
indices, each index entry contains a search-key value and a pointer to the first data&lt;br&gt;
record with that search-key value. To locate a record, we find the index entry with&lt;br&gt;
the largest search-key value that is less than or equal to the search-key value for&lt;br&gt;
which we are looking. We start at the record pointed to by that index entry and&lt;br&gt;
follow the pointers in the file until we find the desired record.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz55abupt6qcumnz8xlad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz55abupt6qcumnz8xlad.png" alt="Sparse Index" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;
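&lt;p&gt;The sparse-index lookup rule (find the largest indexed key not exceeding the target, then scan forward through the file) can be sketched in Python with toy data:&lt;/p&gt;

```python
import bisect

# Sorted file of (search_key, record) pairs, as a clustering index requires.
FILE = [(10, "rec10"), (20, "rec20"), (30, "rec30"), (40, "rec40")]

# Sparse index: one entry for every other position in the file.
SPARSE = [(10, 0), (30, 2)]  # (search_key, position in FILE)

def sparse_lookup(key):
    keys = [k for (k, _) in SPARSE]
    # Largest indexed key that does not exceed the target.
    i = bisect.bisect_right(keys, key) - 1
    if i == -1:
        return None
    # Scan forward through the sorted file from that point.
    for k, record in FILE[SPARSE[i][1]:]:
        if k == key:
            return record
    return None

print(sparse_lookup(20))  # 'rec20'
```

&lt;p&gt;Key 20 is not in the sparse index, so the lookup lands on the entry for 10 and scans forward, trading a short sequential scan for a much smaller index.&lt;/p&gt;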

&lt;p&gt;&lt;strong&gt;Multilevel Indices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense Index Overview&lt;/strong&gt;: A dense index has one index entry for each record in the relation. Example: a relation with 1,000,000 tuples and 100 index entries per 4 KB block results in 10,000 index blocks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Issues with Large Indices&lt;/em&gt;: Large relations require large indices; for 100,000,000 tuples, the index occupies 1,000,000 blocks (4 GB). Indices this large are often stored as sequential files on disk, and searching them can become costly because it requires multiple random I/O operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary Search on the Index&lt;/strong&gt;: Binary search requires up to ⌈log2(b)⌉ block reads for an index occupying b blocks, and each read is random. Example: for a 10,000-block index, binary search requires 14 random reads, taking roughly 140 ms on a magnetic disk. Overflow blocks make binary search more complex and potentially less efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential Search&lt;/strong&gt;: Sequential search reads all b blocks, which may take longer than binary search; however, sequential reads sometimes benefit from lower access costs than random reads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilevel Indexing&lt;/strong&gt;: A solution that reduces search time for large indices: treat the large index (the inner index) as a sequential file and construct a sparse outer index on it. The outer index points to blocks of the inner index, which in turn contains pointers to the actual data blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of a Multilevel Index&lt;/strong&gt;: An inner index with 10,000 blocks needs an outer index of only 100 blocks. If the outer index fits in main memory, a search reads a single inner-index block instead of the 14 block reads of binary search: a 14x improvement in search efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilevel Index with Extremely Large Files&lt;/strong&gt;: For a relation with 100,000,000 tuples, the inner index occupies 1,000,000 blocks and the outer index 10,000 blocks (40 MB). If the outer index is itself too large for main memory, another level of indexing is added, creating a multilevel index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of a Multilevel Index&lt;/strong&gt;: Significantly fewer I/O operations than binary or sequential search, and efficient searching even over very large relations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relation to Tree Structures&lt;/strong&gt;: Multilevel indices resemble tree structures (e.g., binary trees): higher index levels act like parent nodes, and lower levels act like child nodes.&lt;/p&gt;
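&lt;p&gt;The block counts and read counts used in these examples can be checked with a few lines (assuming, as the notes do, 4 KB blocks with 100 index entries each and an outer index resident in main memory):&lt;/p&gt;

```python
# Verifying the multilevel-index arithmetic from the notes:
# 4 KB blocks, 100 index entries per block.
import math

ENTRIES_PER_BLOCK = 100

inner_blocks = 1_000_000 // ENTRIES_PER_BLOCK             # dense inner index
outer_blocks = math.ceil(inner_blocks / ENTRIES_PER_BLOCK)  # sparse outer index

binary_search_reads = math.ceil(math.log2(inner_blocks))  # over inner index
multilevel_reads = 1  # one inner-index block; outer index is in memory

print(inner_blocks, outer_blocks, binary_search_reads, multilevel_reads)
# 10000 100 14 1
```

&lt;p&gt;This is exactly the 14x reduction described above: 14 random block reads for binary search versus a single inner-index block read when the outer index is memory-resident.&lt;/p&gt;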

&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;br&gt;
Multilevel indices optimize the search process for large datasets by reducing the number of I/O operations.&lt;br&gt;
Sparse indexing at higher levels minimizes memory usage while maintaining search efficiency.&lt;br&gt;
They are an efficient alternative to binary or sequential searches for extremely large relations.&lt;/p&gt;

&lt;p&gt;This is Part 1 of the upcoming series on Database Indexing.&lt;/p&gt;

&lt;p&gt;These are just my notes from one of my favourite books: &lt;a href="https://db-book.com/" rel="noopener noreferrer"&gt;Database System Concepts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find me on &lt;br&gt;
&lt;a href="https://x.com/sf_9179?t=Mq38zsA_v5LvsnyIZPdM8A&amp;amp;s=09" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>database</category>
      <category>distributedsystems</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
