<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Debjit Bhattacharjee</title>
    <description>The latest articles on DEV Community by Debjit Bhattacharjee (@sf_1997).</description>
    <link>https://dev.to/sf_1997</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2481991%2F73e02a4c-868f-43a5-8ef4-f96999c90e99.png</url>
      <title>DEV Community: Debjit Bhattacharjee</title>
      <link>https://dev.to/sf_1997</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sf_1997"/>
    <language>en</language>
    <item>
      <title>DeepSeek 3FS: A High-Performance Distributed File System for Modern Workloads</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Sun, 09 Mar 2025 07:07:46 +0000</pubDate>
      <link>https://dev.to/sf_1997/deepseek-3fs-a-high-performance-distributed-file-system-for-modern-workloads-5998</link>
      <guid>https://dev.to/sf_1997/deepseek-3fs-a-high-performance-distributed-file-system-for-modern-workloads-5998</guid>
      <description>&lt;p&gt;In this blog post, we’ll dive deep into the design and implementation of DeepSeek 3FS, a distributed file system engineered for high-performance workloads like data analytics and machine learning. We’ll explore its architecture, components, file system interfaces, metadata management, and chunk storage system, with detailed explanations, diagrams, and flowcharts to break down the complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to DeepSeek 3FS
&lt;/h2&gt;

&lt;p&gt;DeepSeek 3FS is a distributed file system designed to provide strong consistency, high throughput, and fault tolerance, leveraging RDMA networks (InfiniBand or RoCE) and SSDs for optimal performance. It aims to bridge the gap between traditional file system semantics and modern object stores, offering a unified namespace and flexible data placement for applications.&lt;/p&gt;

&lt;p&gt;The system comprises four main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Manager&lt;/strong&gt;: Handles membership changes and distributes cluster configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Service&lt;/strong&gt;: Manages file metadata using a transactional key-value store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Service&lt;/strong&gt;: Stores file chunks with strong consistency using Chain Replication with Apportioned Queries (CRAQ).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client&lt;/strong&gt;: Provides two interfaces—FUSE client for ease of adoption and a native client for performance-critical applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break down each component and their interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The 3FS architecture is designed for scalability and fault tolerance, with all components communicating over an RDMA network for low-latency, high-bandwidth data transfers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Components and Their Roles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cluster Manager&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages membership and configuration changes.&lt;/li&gt;
&lt;li&gt;Multiple managers are deployed; one is elected as the primary using a distributed coordination service (e.g., ZooKeeper or etcd).&lt;/li&gt;
&lt;li&gt;Receives heartbeats from metadata and storage services to detect failures.&lt;/li&gt;
&lt;li&gt;Distributes updated cluster configurations to services and clients.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metadata Service&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless and scalable, handling file metadata operations (e.g., &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;create&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Stores metadata in a transactional key-value store (FoundationDB in production).&lt;/li&gt;
&lt;li&gt;Clients can connect to any metadata service for load balancing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Service&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages local SSDs and provides a chunk store interface.&lt;/li&gt;
&lt;li&gt;Implements CRAQ for strong consistency and high read throughput.&lt;/li&gt;
&lt;li&gt;File chunks are replicated across multiple SSDs for fault tolerance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FUSE Client&lt;/strong&gt;: Integrates with applications via the FUSE kernel module for ease of use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Client&lt;/strong&gt;: Offers asynchronous zero-copy I/O for performance-critical applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;Below is a high-level architecture diagram of 3FS, showing the interactions between components:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------+        +-------------------------+
|                         |        |                         |
|     Cluster Manager     | &amp;lt;----&amp;gt; |    Metadata Service     |
|                         |        |                         |
+-------------------------+        +-------------------------+
             |                                  |
             |                                  |
             v                                  v
+-------------------------+        +-------------------------+
|                         |        |                         |
|     Storage Service     | &amp;lt;----&amp;gt; |  Client (FUSE/Native)   |
|                         |        |                         |
+-------------------------+        +-------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  File System Interfaces
&lt;/h2&gt;

&lt;p&gt;3FS provides a POSIX-like file system interface with enhancements for modern workloads, addressing limitations of object stores while maintaining compatibility with existing applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why File System Semantics?
&lt;/h3&gt;

&lt;p&gt;Unlike object stores, 3FS offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Directory Manipulation&lt;/strong&gt;: Supports operations like moving or deleting directories atomically, critical for workflows involving temporary directories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbolic and Hard Links&lt;/strong&gt;: Enables lightweight snapshots for dynamic datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar Interface&lt;/strong&gt;: Simplifies adoption by supporting file-based data formats (e.g., CSV, Parquet) without requiring new APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations of FUSE
&lt;/h3&gt;

&lt;p&gt;While the FUSE client simplifies integration, it introduces performance overheads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Copy Overhead&lt;/strong&gt;: Data transfers between kernel and user space increase latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-threading Bottlenecks&lt;/strong&gt;: Lock contention in the FUSE shared queue limits scalability (benchmarks show ~400K 4KiB reads/sec).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent Writes&lt;/strong&gt;: Linux FUSE (v5.x) does not support concurrent writes to the same file, requiring workarounds like writing to multiple files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Native Client with Asynchronous Zero-Copy API
&lt;/h3&gt;

&lt;p&gt;To address FUSE limitations, 3FS implements a native client with an asynchronous zero-copy API inspired by Linux &lt;code&gt;io_uring&lt;/code&gt;. Key data structures include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iov&lt;/strong&gt;: A shared memory region for zero-copy read/write operations, registered with InfiniBand for RDMA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ior&lt;/strong&gt;: A ring buffer for request queuing, supporting batched and parallel I/O operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The native client spawns multiple threads to fetch and dispatch I/O requests to storage services, minimizing RPC overhead for small reads.&lt;/p&gt;
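&lt;p&gt;To make the Iov/Ior interplay concrete, here is a minimal, hypothetical Python sketch (the real client is not written in Python, and these names only stand in for the actual structures): a bounded submission ring that I/O threads drain in batches, with a byte buffer standing in for the RDMA-registered shared memory region.&lt;/p&gt;

```python
# Illustrative sketch only, not the real 3FS API. `Iov` stands in
# for the shared zero-copy buffer; `Ior` queues read requests that
# client I/O threads would dispatch to storage services in batches.
from collections import deque
from dataclasses import dataclass

@dataclass
class ReadRequest:
    chunk_id: str     # which chunk to read
    offset: int       # byte offset within the chunk
    length: int       # number of bytes to read
    iov_offset: int   # where the result lands in the shared Iov buffer

class Iov:
    """Shared memory region; in 3FS this is registered with the RDMA NIC."""
    def __init__(self, size: int):
        self.buf = bytearray(size)

class Ior:
    """Bounded submission ring; a full ring applies back-pressure."""
    def __init__(self, depth: int):
        self.depth = depth
        self.queue = deque()

    def submit(self, req: ReadRequest) -> bool:
        if len(self.queue) >= self.depth:
            return False          # caller retries later
        self.queue.append(req)
        return True

    def drain_batch(self, max_batch: int):
        """What a client I/O thread would do: pop a batch to dispatch."""
        batch = []
        while self.queue and len(batch) < max_batch:
            batch.append(self.queue.popleft())
        return batch
```

&lt;p&gt;Batching amortizes the per-request RPC cost, which is why small random reads benefit most from this interface.&lt;/p&gt;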

&lt;h4&gt;
  
  
  Flowchart: Native Client I/O Operation
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7ldx30srkt9tkowj1rf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7ldx30srkt9tkowj1rf.png" alt="A flowchart showing the lifecycle of an I/O request in the native client." width="482" height="896"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  File Metadata Management
&lt;/h2&gt;

&lt;p&gt;File metadata in 3FS is stored in FoundationDB, a distributed transactional key-value store providing Serializable Snapshot Isolation (SSI).&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata Structures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inodes&lt;/strong&gt;: Store attributes (e.g., ownership, permissions, timestamps) with a unique 64-bit ID.

&lt;ul&gt;
&lt;li&gt;File inodes include chunk size, chain table range, and shuffle seed.&lt;/li&gt;
&lt;li&gt;Directory inodes include parent inode ID and layout configurations.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Directory Entries&lt;/strong&gt;: Map parent inode IDs and entry names to target inode IDs.&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Key Encoding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Inode keys: &lt;code&gt;"INOD" + inode_id&lt;/code&gt; (little-endian for distribution across FoundationDB nodes).&lt;/li&gt;
&lt;li&gt;Directory entry keys: &lt;code&gt;"DENT" + parent_inode_id + entry_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
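&lt;p&gt;A small sketch of this key scheme (the exact byte layout in 3FS may differ; this just illustrates the idea): the inode ID is serialized little-endian so consecutive IDs scatter across FoundationDB's ordered key space, while all entries of one directory share a common prefix and stay contiguous, making &lt;code&gt;listdir&lt;/code&gt; a range scan.&lt;/p&gt;

```python
# Hypothetical encoding of the keys described above.
import struct

def inode_key(inode_id: int) -> bytes:
    # "INOD" prefix + 64-bit little-endian inode id: consecutive ids
    # differ in their low bytes, so they spread across nodes.
    return b"INOD" + struct.pack("<Q", inode_id)

def dirent_key(parent_inode_id: int, entry_name: str) -> bytes:
    # "DENT" prefix + parent inode id + entry name: all entries of
    # one directory share a prefix, so listdir is one range scan.
    return b"DENT" + struct.pack("<Q", parent_inode_id) + entry_name.encode()
```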

&lt;h3&gt;
  
  
  Metadata Operations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read-only Transactions&lt;/strong&gt;: Used for queries (e.g., &lt;code&gt;fstat&lt;/code&gt;, &lt;code&gt;listdir&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-write Transactions&lt;/strong&gt;: Used for updates (e.g., &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;rename&lt;/code&gt;), with automatic retries on conflicts.&lt;/li&gt;
&lt;/ul&gt;
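&lt;p&gt;The retry-on-conflict behavior can be sketched generically (FoundationDB's real client bindings provide their own retry loop; &lt;code&gt;ConflictError&lt;/code&gt; here is a stand-in for the store's actual conflict exception):&lt;/p&gt;

```python
# Generic sketch of an optimistic read-write transaction retry loop.
class ConflictError(Exception):
    """Stand-in for the key-value store's conflict exception."""

def run_transaction(body, max_retries=5):
    for _ in range(max_retries):
        try:
            return body()          # run the transaction body
        except ConflictError:
            continue               # concurrent conflicting update: retry
    raise RuntimeError("transaction failed after retries")
```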

&lt;h3&gt;
  
  
  Dynamic File Attributes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File Deletion&lt;/strong&gt;: For write-opened files, deletion is deferred until all file descriptors are closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Length Updates&lt;/strong&gt;: Clients periodically report maximum write positions; final length is computed on &lt;code&gt;close&lt;/code&gt; or &lt;code&gt;fsync&lt;/code&gt; by querying storage services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizations&lt;/strong&gt;: Uses rendezvous hashing to distribute length updates and hints in inodes to avoid querying all chains for small files.&lt;/li&gt;
&lt;/ul&gt;
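&lt;p&gt;Rendezvous (highest-random-weight) hashing is easy to sketch: every server scores the file ID, and the highest score wins, so updates for one file always land on the same server without any shared state. The server names below are invented for illustration.&lt;/p&gt;

```python
# Minimal rendezvous hashing sketch for routing length updates.
import hashlib

def rendezvous_pick(file_id: str, servers: list) -> str:
    def score(server: str) -> int:
        # deterministic pseudo-random weight for (server, file) pair
        h = hashlib.sha256(f"{server}:{file_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(servers, key=score)
```

&lt;p&gt;A nice property: if the chosen server disappears, only the files it owned are re-routed; everything else keeps its assignment.&lt;/p&gt;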

&lt;h2&gt;
  
  
  Chunk Storage System
&lt;/h2&gt;

&lt;p&gt;The chunk storage system is designed for high bandwidth and fault tolerance, using CRAQ for replication and balanced data placement across SSDs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Placement with CRAQ
&lt;/h3&gt;

&lt;p&gt;Files are split into chunks, replicated across storage targets using CRAQ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write Path&lt;/strong&gt;: Requests propagate from the head to the tail of a chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read Path&lt;/strong&gt;: Requests can be served by any target, balancing load across replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Chain Table Example
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chain&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Target 1 (head)&lt;/th&gt;
&lt;th&gt;Target 2&lt;/th&gt;
&lt;th&gt;Target 3 (tail)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;A1&lt;/td&gt;
&lt;td&gt;B1&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;D1&lt;/td&gt;
&lt;td&gt;E1&lt;/td&gt;
&lt;td&gt;F1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each chain has a version number, incremented on updates by the cluster manager.&lt;/p&gt;
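&lt;p&gt;A toy lookup over the chain table above (in 3FS, chunk-to-chain assignment also involves the file's chain table range and shuffle seed; this sketch uses plain round-robin for illustration). The version travels with each request so storage services can reject requests issued against a stale table.&lt;/p&gt;

```python
# Hypothetical chain-table lookup; rows mirror the example table.
CHAIN_TABLE = [
    {"chain": 1, "version": 1, "targets": ["A1", "B1", "C1"]},
    {"chain": 2, "version": 1, "targets": ["D1", "E1", "F1"]},
]

def chain_for_chunk(chunk_index: int) -> dict:
    # round-robin placement of chunks onto chains (simplified)
    return CHAIN_TABLE[chunk_index % len(CHAIN_TABLE)]
```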

&lt;h3&gt;
  
  
  Balanced Traffic During Recovery
&lt;/h3&gt;

&lt;p&gt;To mitigate bottlenecks during failures, 3FS distributes read traffic across multiple SSDs using a balanced incomplete block design. For example, if node A fails, its read traffic is split evenly among the remaining nodes instead of overloading a single replica.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recovery Traffic Flowchart
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qaw6k9kmteeo8d5znl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qaw6k9kmteeo8d5znl1.png" alt="A flowchart showing read traffic redirection during failure." width="621" height="907"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Replication with CRAQ
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Write Process
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Validate chain version.&lt;/li&gt;
&lt;li&gt;Fetch data via RDMA.&lt;/li&gt;
&lt;li&gt;Serialize writes at the head using a lock.&lt;/li&gt;
&lt;li&gt;Propagate writes along the chain.&lt;/li&gt;
&lt;li&gt;Commit at the tail and propagate acknowledgments.&lt;/li&gt;
&lt;/ol&gt;
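&lt;p&gt;The steps above can be sketched as a toy simulation (not the real protocol; error handling, RDMA, and locking are omitted): the write is marked pending at every replica head-to-tail, the tail commits, and commit acknowledgments flow back up the chain.&lt;/p&gt;

```python
# Toy CRAQ-style chain write: forward pass marks pending,
# backward pass commits from tail to head.
class Target:
    def __init__(self, name: str):
        self.name = name
        self.pending = {}    # chunk_id -> (version, data)
        self.committed = {}  # chunk_id -> (version, data)

def chain_write(chain, chunk_id, version, data):
    # forward pass: head -> tail, store as pending
    for t in chain:
        t.pending[chunk_id] = (version, data)
    # tail commits first; acknowledgments propagate tail -> head
    for t in reversed(chain):
        t.committed[chunk_id] = t.pending.pop(chunk_id)
    return True
```

&lt;p&gt;While a write is pending, CRAQ lets replicas keep serving the last committed version, which is what preserves read throughput during writes.&lt;/p&gt;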

&lt;h4&gt;
  
  
  Read Process
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Return committed version if available.&lt;/li&gt;
&lt;li&gt;Handle pending versions with a status code, allowing retries or relaxed reads.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Failure Detection and Recovery
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeats&lt;/strong&gt;: Cluster manager detects failures if heartbeats are missed for a configurable interval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Transitions&lt;/strong&gt;: Storage targets transition between public states (e.g., &lt;code&gt;serving&lt;/code&gt;, &lt;code&gt;syncing&lt;/code&gt;, &lt;code&gt;offline&lt;/code&gt;) based on local states.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery&lt;/strong&gt;: Offline targets are moved to the end of chains; data is synced using full-chunk-replace writes.&lt;/li&gt;
&lt;/ul&gt;
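&lt;p&gt;Heartbeat-based failure detection reduces to simple bookkeeping, sketched below (times are plain floats for clarity; a real implementation would use a monotonic clock and trigger chain updates when a service goes offline):&lt;/p&gt;

```python
# Sketch of the cluster manager's heartbeat tracking: a service is
# considered offline once no heartbeat arrives within the timeout.
class ClusterManager:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}   # service id -> last heartbeat time

    def heartbeat(self, service: str, now: float):
        self.last_seen[service] = now

    def offline_services(self, now: float):
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```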

&lt;h3&gt;
  
  
  Chunk Engine
&lt;/h3&gt;

&lt;p&gt;The chunk engine manages persistent storage on SSDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Files&lt;/strong&gt;: Store chunk data in physical blocks (64KiB to 64MiB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RocksDB&lt;/strong&gt;: Stores chunk metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allocator&lt;/strong&gt;: Uses bitmaps for efficient block allocation and reclamation.&lt;/li&gt;
&lt;/ul&gt;
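&lt;p&gt;A minimal bitmap allocator in the spirit described above: one bit per physical block, where allocation finds the first clear bit and freeing clears it for reuse. The real chunk engine manages multiple block size classes; this sketch uses a single one.&lt;/p&gt;

```python
# Toy bitmap block allocator: bit set = block in use.
class BitmapAllocator:
    def __init__(self, num_blocks: int):
        self.bits = bytearray((num_blocks + 7) // 8)
        self.num_blocks = num_blocks

    def allocate(self) -> int:
        for i in range(self.num_blocks):
            byte, bit = divmod(i, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit   # mark block as used
                return i
        raise RuntimeError("no free blocks")

    def free(self, i: int):
        byte, bit = divmod(i, 8)
        self.bits[byte] &= ~(1 << bit)        # block is reusable again
```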

&lt;h4&gt;
  
  
  Write Operation Flowchart
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy5hlmgfvxkejo61kix0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy5hlmgfvxkejo61kix0.png" alt="A flowchart for chunk write." width="436" height="902"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/deepseek-ai/3FS" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. All credit for this research goes to the researchers of this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;DeepSeek 3FS is a robust distributed file system tailored for modern workloads, combining the familiarity of file system semantics with the scalability of object stores. Its use of RDMA, CRAQ, and FoundationDB ensures high performance, strong consistency, and fault tolerance. Whether you're running data analytics or machine learning pipelines, 3FS offers a flexible and efficient storage solution.&lt;/p&gt;

&lt;p&gt;Feel free to experiment with 3FS in your projects! If you have questions or insights, drop them in the comments below.&lt;/p&gt;

&lt;p&gt;You can find me on &lt;a href="https://x.com/sf_9179" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Research Paper Series: Zanzibar: Google’s Consistent, Global Authorization System</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Sat, 08 Feb 2025 12:14:21 +0000</pubDate>
      <link>https://dev.to/sf_1997/research-paper-series-zanzibar-googles-consistent-global-authorization-system-1pk6</link>
      <guid>https://dev.to/sf_1997/research-paper-series-zanzibar-googles-consistent-global-authorization-system-1pk6</guid>
      <description>&lt;h2&gt;
  
  
  Understanding the Zanzibar White Paper: A Deep Dive into Scalable Authorization Systems
&lt;/h2&gt;

&lt;p&gt;Modern applications demand scalable and fine‑grained access control. With billions of relationships and millions of queries per second, traditional authorization systems often fall short. Enter &lt;strong&gt;Zanzibar&lt;/strong&gt;, Google’s innovative approach to distributed authorization. In this post, we’ll break down the key concepts, architecture, and advanced features (like Leopard indexing) described in the Zanzibar white paper—plus a look at a novel solution known as &lt;strong&gt;zookies&lt;/strong&gt; that tackles what some call the "new enemy problem" in security.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;The Need for a Scalable Authorization System&lt;/li&gt;
&lt;li&gt;Meet Zanzibar: Background and Key Contributors&lt;/li&gt;
&lt;li&gt;
The Core Tuple-Based Model

&lt;ul&gt;
&lt;li&gt;Direct and Indirect Permissions&lt;/li&gt;
&lt;li&gt;Real-World Examples&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Architectural Overview

&lt;ul&gt;
&lt;li&gt;Global Tuple Data Store&lt;/li&gt;
&lt;li&gt;The Authorization Engine&lt;/li&gt;
&lt;li&gt;Schema and Policy Layer&lt;/li&gt;
&lt;li&gt;Consistency and Caching&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Advanced Indexing with Leopard Indexing&lt;/li&gt;

&lt;li&gt;Zookies: Tackling the New Enemy Problem&lt;/li&gt;

&lt;li&gt;End-to-End Authorization Decision Flow&lt;/li&gt;

&lt;li&gt;Real-World Use Cases&lt;/li&gt;

&lt;li&gt;Challenges in Distributed Systems&lt;/li&gt;

&lt;li&gt;Impact and Future Directions&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s microservices and cloud-based architectures, &lt;strong&gt;authorization&lt;/strong&gt;—deciding who can do what—is both critical and challenging. Google’s Zanzibar system was designed to address these challenges at scale, enabling millions of authorization decisions per second with consistency and flexibility. The Zanzibar white paper outlines a novel, tuple-based model that can express both simple and complex access control policies.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The core concepts behind Zanzibar&lt;/li&gt;
&lt;li&gt;Its distributed architecture and data model&lt;/li&gt;
&lt;li&gt;How it handles complex policies such as hierarchical and time‑bound permissions&lt;/li&gt;
&lt;li&gt;The role of advanced techniques like Leopard indexing in ensuring low latency and high performance&lt;/li&gt;
&lt;li&gt;And finally, a new security solution—&lt;strong&gt;zookies&lt;/strong&gt;—that tackles what’s known as the "new enemy problem."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you’re an engineer working on security or simply curious about modern authorization mechanisms, read on to learn more about Zanzibar and its evolving features.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Need for a Scalable Authorization System
&lt;/h2&gt;

&lt;p&gt;Traditional access control systems, often based on role‑based access control (RBAC), struggle when facing modern demands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Volume:&lt;/strong&gt; Billions of relationships and millions of access decisions per second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Relationships:&lt;/strong&gt; Permissions aren’t always direct—users might inherit access via groups, hierarchies, or other indirect relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Environments:&lt;/strong&gt; Global systems must maintain consistency across multiple data centers and regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zanzibar was conceived to meet these challenges by providing a flexible yet efficient authorization engine that could work at Google’s massive scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Meet Zanzibar: Background and Key Contributors
&lt;/h2&gt;

&lt;p&gt;The Zanzibar white paper is the result of collaborative efforts by a dedicated team at Google. While the names might vary between versions, here are some key roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual Design:&lt;/strong&gt; Visionaries who recognized the limitations of existing systems and introduced the tuple‑based model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Engineering:&lt;/strong&gt; Engineers who tackled distributed consistency challenges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing Innovations:&lt;/strong&gt; Researchers who developed advanced indexing (like Leopard indexing) to optimize data retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Design:&lt;/strong&gt; Developers who created a flexible schema layer for defining complex access policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Timeline of Key Events
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Conceptualization:&lt;/strong&gt; Brainstorming sessions and whiteboarding led to the idea of representing permissions as tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototype Development:&lt;/strong&gt; Early prototypes exposed challenges in query performance and data retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leopard Indexing Introduction:&lt;/strong&gt; A breakthrough that dramatically reduced lookup latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal Rollout:&lt;/strong&gt; Iterative testing and refinement based on real-world feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;White Paper Publication:&lt;/strong&gt; Sharing the design and lessons learned with the broader community.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Core Tuple-Based Model
&lt;/h2&gt;

&lt;p&gt;At the heart of Zanzibar lies a simple yet powerful data model: the tuple.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direct and Indirect Permissions
&lt;/h3&gt;

&lt;p&gt;Each permission is represented as a tuple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(object, relation, subject)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object:&lt;/strong&gt; The resource (e.g., a document, folder, or service).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relation:&lt;/strong&gt; The type of permission (e.g., read, write, edit).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subject:&lt;/strong&gt; The entity (user, group, or service) granted the permission.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Direct Permissions
&lt;/h4&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document123, viewer, Alice)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tuple directly grants Alice viewing rights to Document123.&lt;/p&gt;

&lt;h4&gt;
  
  
  Indirect Permissions
&lt;/h4&gt;

&lt;p&gt;Indirect relationships can be expressed using multiple tuples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document123, editor, GroupX)
(GroupX, member, Bob)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though Bob isn’t directly assigned to Document123, his membership in GroupX grants him editor rights.&lt;/p&gt;
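&lt;p&gt;This indirect grant can be checked with a short recursive walk over the tuples, sketched here (a toy evaluator, not Zanzibar's actual algorithm; it hardcodes &lt;code&gt;member&lt;/code&gt; as the group-expansion relation and has no cycle protection):&lt;/p&gt;

```python
# Toy tuple check: a direct tuple grants access, and subjects that
# are themselves groups are expanded through their "member" tuples.
def check(tuples, obj, relation, subject):
    for (o, r, s) in tuples:
        if o == obj and r == relation:
            if s == subject:
                return True            # direct grant
            # treat the granted subject as a possible group and recurse
            # (no cycle detection in this sketch)
            if check(tuples, s, "member", subject):
                return True
    return False
```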

&lt;h3&gt;
  
  
  Real-World Examples
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Hierarchical Permissions
&lt;/h4&gt;

&lt;p&gt;In a corporate file system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(FolderA, contains, FolderB)
(FolderA, editor, UserX)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserX, an editor of FolderA, can inherit editing rights for FolderB and its files.&lt;/p&gt;

&lt;h4&gt;
  
  
  Combined Conditions
&lt;/h4&gt;

&lt;p&gt;A sensitive document might require both team membership and explicit permission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document456, editor, TeamY)
(TeamY, member, UserZ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserZ must be a member of TeamY (indirect permission) to gain editing rights.&lt;/p&gt;

&lt;h4&gt;
  
  
  Temporal Constraints
&lt;/h4&gt;

&lt;p&gt;Permissions can also be time‑bound:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Document789, viewer, ContractorA)  // With an expiration timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access is only valid within a specified time window.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architectural Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3rrnhg9fr1ii0kzgvz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3rrnhg9fr1ii0kzgvz0.png" alt="Zanzibar architecture. Arrows indicate the direction of data flow" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zanzibar’s architecture is engineered for scalability and performance. Here’s a look at its key components:&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Tuple Data Store
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed:&lt;/strong&gt; Operates across multiple data centers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable:&lt;/strong&gt; Designed to handle billions of tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low Latency:&lt;/strong&gt; Optimized for rapid read and write operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Authorization Engine
&lt;/h3&gt;

&lt;p&gt;The engine processes access requests through these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Request Parsing:&lt;/strong&gt; Extract the object, relation, and subject from the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tuple Lookup:&lt;/strong&gt; Query the data store for relevant tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Evaluation:&lt;/strong&gt; Follow indirect relationships (e.g., group memberships) to determine effective permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision Output:&lt;/strong&gt; Consolidate findings and grant or deny access.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Schema and Policy Layer
&lt;/h3&gt;

&lt;p&gt;This layer provides flexibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customizable:&lt;/strong&gt; Define new object types, relations, and composite relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible:&lt;/strong&gt; Easily incorporate new access control paradigms without a full system redesign.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Consistency and Caching
&lt;/h3&gt;

&lt;p&gt;To ensure every node has up‑to‑date data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Propagation Protocols:&lt;/strong&gt; Distribute updates quickly across nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict Resolution:&lt;/strong&gt; Handle concurrent updates seamlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching Strategies:&lt;/strong&gt; Use local caches with invalidation mechanisms to reduce latency without sacrificing accuracy.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Advanced Indexing with Leopard Indexing
&lt;/h2&gt;

&lt;p&gt;As Zanzibar scaled, performance challenges emerged. &lt;strong&gt;Leopard indexing&lt;/strong&gt; was introduced as an advanced method to optimize tuple lookups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Leopard Indexing?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Minimizes latency by reducing disk and network operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Supports queries on billions of tuples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Efficiently handles multiple query directions (object, relation, subject).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;Leopard indexing decomposes tuples into individual components and builds multiple index structures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object-Relation Index:&lt;/strong&gt; Quickly retrieves all tuples associated with a specific object and relation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subject-Relation Index:&lt;/strong&gt; Enables queries initiated from the subject side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite Indexes for Groups:&lt;/strong&gt; Facilitates rapid evaluation of indirect relationships, such as group memberships.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Diagram: Leopard Indexing Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------+
|       Global Tuple Data Store       |
|  (All (object, relation, subject))  |
+-------------------------------------+
                   |
                   v
+-------------------------------------+
|       Leopard Indexing Layer        |
|                                     |
|  - Object-Relation Index            |
|  - Subject-Relation Index           |
|  - Composite Indexes for Groups     |
+-------------------------------------+
                   |
                   v
+-------------------------------------+
|     Rapid Tuple Retrieval Layer     |
| (Optimized Query Resolution Engine) |
+-------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these indices, when a client queries “Can Alice read Document123?”, the engine can directly retrieve the relevant tuples with minimal overhead.&lt;/p&gt;
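&lt;p&gt;The index idea can be illustrated with a small sketch (these structures are invented for illustration, not Leopard's actual layout): the same tuples are mirrored into two maps, so a check like the one above becomes a constant-time lookup instead of a scan over all tuples.&lt;/p&gt;

```python
# Toy dual-index over relationship tuples.
from collections import defaultdict

class TupleIndex:
    def __init__(self):
        self.by_object = defaultdict(set)   # (object, relation) -> subjects
        self.by_subject = defaultdict(set)  # (subject, relation) -> objects

    def add(self, obj, relation, subject):
        # every tuple is written to both indexes
        self.by_object[(obj, relation)].add(subject)
        self.by_subject[(subject, relation)].add(obj)

    def has(self, obj, relation, subject) -> bool:
        # "Can Alice read Document123?" is a single set lookup
        return subject in self.by_object[(obj, relation)]
```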




&lt;h2&gt;
  
  
  Zookies: Tackling the New Enemy Problem
&lt;/h2&gt;

&lt;p&gt;In addition to the core challenges of distributed authorization, modern systems must also address what is sometimes referred to as the &lt;strong&gt;"new enemy problem."&lt;/strong&gt; This problem involves adversaries attempting to exploit vulnerabilities by injecting unauthorized or stale tuple data into the system. To counter this, an innovative solution known as &lt;strong&gt;zookies&lt;/strong&gt; has been introduced.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Are Zookies?
&lt;/h3&gt;

&lt;p&gt;A zookie is an opaque consistency token, encoding a timestamp, that Zanzibar returns to clients and uses to bound how stale an authorization check may be. In practice, zookies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capture a Point in Time:&lt;/strong&gt;
Each zookie encodes the timestamp of an ACL or content update, so later checks can be required to be at least that fresh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Travel with Content:&lt;/strong&gt;
When content is modified, the application stores the zookie from the preceding authorization check alongside the new content version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound Staleness on Checks:&lt;/strong&gt;
A check request carrying a zookie must be evaluated at a snapshot no older than the zookie’s timestamp, so stale ACLs are never applied to newer content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preserve Performance:&lt;/strong&gt;
Because most checks can still be served from slightly stale replicas and caches, zookies provide external consistency without forcing every check to read the latest data, and they work in tandem with Leopard indexing with minimal impact on query performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Zookies Work in Practice
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Content Write:&lt;/strong&gt;
Before saving a new version of an object, the application issues an authorization check, and Zanzibar returns a zookie reflecting the evaluation timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zookie Storage:&lt;/strong&gt;
The application persists that zookie together with the content version it protects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subsequent Checks:&lt;/strong&gt;
When the content is later read, the stored zookie is sent along with the check request, and Zanzibar evaluates the check at a snapshot at least as fresh as the zookie’s timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fresh Enough, Not Necessarily Latest:&lt;/strong&gt;
Zanzibar is free to choose any snapshot newer than the zookie’s timestamp, so most checks can still be answered from local replicas and caches, ensuring that consistency does not come at the cost of performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By tying each piece of content to a minimum snapshot timestamp, zookies solve the "new enemy problem" and give Zanzibar external consistency without sacrificing low-latency checks.&lt;/p&gt;
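&lt;p&gt;The snapshot-selection rule a zookie implies can be sketched in a few lines of toy Python (invented names, plain integer timestamps standing in for the real opaque tokens):&lt;/p&gt;

```python
TUPLES = [
    # (object, relation, subject, granted_at)
    ("Document123", "read", "Alice", 5),
    ("Document123", "read", "Bob", 2),
]
REVOCATIONS = [
    # (object, relation, subject, revoked_at)
    ("Document123", "read", "Bob", 9),
]

def check(obj, relation, subject, zookie_ts, replica_ts):
    # A zookie forces the evaluation snapshot to be at least as
    # fresh as the content version it was issued for.
    snapshot = max(replica_ts, zookie_ts)
    granted = any(
        o == obj and r == relation and s == subject and snapshot >= t
        for (o, r, s, t) in TUPLES
    )
    revoked = any(
        o == obj and r == relation and s == subject and snapshot >= t
        for (o, r, s, t) in REVOCATIONS
    )
    return granted and not revoked
```

&lt;p&gt;With a stale replica at timestamp 7, a check for Bob without a zookie still succeeds, but a check carrying a zookie from timestamp 10 is forced onto a snapshot that already sees the revocation.&lt;/p&gt;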




&lt;h2&gt;
  
  
  End-to-End Authorization Decision Flow
&lt;/h2&gt;

&lt;p&gt;Let’s walk through the process step-by-step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request Arrival:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A client sends an access check request, e.g., “Can Alice read Document123?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request Parsing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The engine extracts the object (Document123), relation (read), and subject (Alice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Index Querying Using Leopard Indexing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The engine quickly queries the object‑relation index to retrieve direct tuples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct Tuple Evaluation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If &lt;code&gt;(Document123, read, Alice)&lt;/code&gt; exists, access is granted. Otherwise, indirect relationships are evaluated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recursive Evaluation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For example, if &lt;code&gt;(Document123, read, GroupX)&lt;/code&gt; exists, the engine checks if Alice is a member of GroupX via &lt;code&gt;(GroupX, member, Alice)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal and Conditional Checks:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The engine verifies any time‑bound or conditional metadata (with zookies ensuring data integrity).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Final Decision:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If a valid permission chain is found, access is granted; if not, it is denied.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Detailed Flow Chart
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Start: Receive Access Request]
           |
           v
[Parse Request: Extract Object, Relation, Subject]
           |
           v
[Query Leopard Indexes for (Object, Relation) tuples]
           |
           v
[Direct Tuple Found?] --&amp;gt; [Yes] --&amp;gt; [Grant Access]
           |
         No|
           v
[Check for Indirect Relationships via Indexes]
           |
           v
[Recursive Evaluation of Group or Hierarchical Tuples]
           |
           v
[Evaluate Additional Conditions (Temporal, etc.)]
           |
           v
[Consolidate Findings]
           |
           v
[Decision: Valid Permission Chain Exists?]
           |             \
          Yes             No
           |               \
           v                v
     [Grant Access]     [Deny Access]
           |
           v
         [End]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
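&lt;p&gt;The direct and recursive steps in the chart above can be sketched in Python (toy data and a simple depth guard, without the caching and deduplication the real system uses):&lt;/p&gt;

```python
TUPLES = {
    ("Document123", "read", "GroupX"),
    ("GroupX", "member", "Alice"),
}

def check(obj, relation, subject, depth=0):
    if depth > 10:  # guard against cyclic group definitions
        return False
    # Direct tuple evaluation.
    if (obj, relation, subject) in TUPLES:
        return True
    # Recursive evaluation: expand group-valued subjects.
    for (o, r, s) in TUPLES:
        if o == obj and r == relation:
            if check(s, "member", subject, depth + 1):
                return True
    return False

print(check("Document123", "read", "Alice"))  # True
```

&lt;p&gt;Alice has no direct tuple on Document123, but the engine finds &lt;code&gt;(Document123, read, GroupX)&lt;/code&gt; and recurses into GroupX’s membership.&lt;/p&gt;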






&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Corporate File Systems with Hierarchical Permissions
&lt;/h3&gt;

&lt;p&gt;Imagine a corporate file system where folders are nested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tuples:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  (FolderA, contains, FolderB)
  (FolderA, editor, UserX)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserX’s permission on FolderA cascades down to FolderB and its contents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combined Conditions for Sensitive Resources
&lt;/h3&gt;

&lt;p&gt;For sensitive documents, multiple conditions may be required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tuples:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  (Document456, editor, SecurityTeam)
  (SecurityTeam, member, UserY)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;UserY must satisfy both the team membership (indirect permission) and any direct conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temporal Permissions and Time‑Bound Access
&lt;/h3&gt;

&lt;p&gt;Time-sensitive access is common for contractors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tuples:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  (Document789, viewer, ContractorA)  // With an expiration timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access is granted only within a specified time window.&lt;/p&gt;
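&lt;p&gt;A minimal sketch of such a time-bound check (hypothetical structure, using integer epoch-second timestamps rather than whatever encoding a real deployment would choose):&lt;/p&gt;

```python
import time

# Toy tuple store: (object, relation, subject) -> validity window.
GRANTS = {
    ("Document789", "viewer", "ContractorA"): (1_700_000_000, 1_702_592_000),
}

def check_temporal(obj, relation, subject, now=None):
    window = GRANTS.get((obj, relation, subject))
    if window is None:
        return False
    valid_from, valid_until = window
    if now is None:
        now = int(time.time())
    # Grant only inside the half-open window [valid_from, valid_until).
    return now >= valid_from and valid_until > now
```

&lt;p&gt;Once the expiration timestamp passes, the tuple still exists but the check fails, with no explicit revocation required.&lt;/p&gt;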




&lt;h2&gt;
  
  
  Challenges in Distributed Systems
&lt;/h2&gt;

&lt;p&gt;Operating at a global scale isn’t trivial. Zanzibar addresses several challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt;
Ensuring every node has the most up‑to‑date data via robust propagation protocols.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching:&lt;/strong&gt;
Local caches reduce latency but must remain synchronized to avoid stale decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Indexing:&lt;/strong&gt;
With techniques like Leopard indexing (and the additional security of zookies), low‑latency queries are maintained regardless of geographic location.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Impact and Future Directions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Influence on Modern Authorization Systems
&lt;/h3&gt;

&lt;p&gt;Zanzibar has influenced many modern access control systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adoption:&lt;/strong&gt;
Its tuple‑based model and indexing techniques have inspired both industry and open‑source projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Innovation:&lt;/strong&gt;
Design principles from Zanzibar continue to shape scalable, secure authorization in distributed environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flexibility and Extensibility
&lt;/h3&gt;

&lt;p&gt;Zanzibar’s model is adaptable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diverse Paradigms:&lt;/strong&gt;
Supports RBAC, attribute‑based, and relationship‑based access control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evolving Needs:&lt;/strong&gt;
The schema can be extended to include new relationship types, temporal constraints, and security enhancements like zookies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Indexing:&lt;/strong&gt;
Research into further optimizing indexing—possibly using predictive caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Consistency Models:&lt;/strong&gt;
New protocols may further reduce latency while ensuring up‑to‑date authorization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Upgrades:&lt;/strong&gt;
Continued development of zookies and other security measures to counter emerging threats.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Google’s Zanzibar paper presents a groundbreaking approach to authorization by breaking access control down into simple, composable tuples. By combining a robust, distributed architecture with advanced indexing techniques like Leopard indexing, plus zookies to tackle the "new enemy problem", Zanzibar handles billions of relationships and millions of authorization decisions per second while maintaining consistency across global data centers.&lt;/p&gt;

&lt;p&gt;In summary, this post covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Core Tuple-Based Model:&lt;/strong&gt;
How Zanzibar represents both direct and indirect permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Components:&lt;/strong&gt;
From the global tuple data store to the authorization engine and schema layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Indexing:&lt;/strong&gt;
How Leopard indexing optimizes performance by reducing lookup latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zookies:&lt;/strong&gt;
Consistency tokens that solve the "new enemy problem" by bounding the staleness of authorization checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-World Applications and Challenges:&lt;/strong&gt;
Practical use cases and the complexities of distributed systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future Directions:&lt;/strong&gt;
The ongoing impact of Zanzibar on modern authorization systems and areas for further innovation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By understanding the Zanzibar paper, and ideas like zookies, we gain valuable insight into the challenges and solutions powering today’s scalable and secure access control systems. Whether you’re building your own authorization engine or just curious about distributed systems, Zanzibar offers a wealth of ideas and inspiration.&lt;/p&gt;

&lt;p&gt;For a more in-depth understanding, you can access the full paper &lt;a href="https://www.usenix.org/system/files/atc19-pang.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Happy coding! If you found this post useful, please leave a comment or share your thoughts on tackling authorization challenges in your projects.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: dev, authorization, security, distributed-systems, backend, scalability&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You can find me on &lt;a href="https://x.com/sf_9179?t=Mq38zsA_v5LvsnyIZPdM8A&amp;amp;s=09" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>cloud</category>
      <category>security</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Research Paper Series: Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Mon, 13 Jan 2025 16:03:43 +0000</pubDate>
      <link>https://dev.to/sf_1997/research-paper-series-using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-1p37</link>
      <guid>https://dev.to/sf_1997/research-paper-series-using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-1p37</guid>
      <description>&lt;p&gt;The paper "Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" presents a pragmatic approach to ensuring the correctness of &lt;strong&gt;ShardStore&lt;/strong&gt;, a key-value storage node in Amazon S3.&lt;/p&gt;

&lt;h2&gt;
  
  
  A bit on ShardStore
&lt;/h2&gt;

&lt;p&gt;ShardStore, a key-value storage node in Amazon S3, plays a critical role in efficiently storing and retrieving objects by organizing data into &lt;strong&gt;extents&lt;/strong&gt;. Here’s how it works and interacts with extents:&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Components and Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ShardStore Overview&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShardStore serves as a layer in the S3 architecture, responsible for handling object storage within shards.&lt;/li&gt;
&lt;li&gt;Data is partitioned into shards to ensure scalability, fault isolation, and efficient data management.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interaction with Extents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extents&lt;/strong&gt; are fixed-size blocks of data, typically spanning multiple megabytes. They are the primary storage units managed by ShardStore.&lt;/li&gt;
&lt;li&gt;Each extent contains:

&lt;ul&gt;
&lt;li&gt;Data for multiple objects or object fragments.&lt;/li&gt;
&lt;li&gt;Metadata for efficiently locating and retrieving specific pieces of data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Key-Value Abstraction&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShardStore exposes a key-value interface, where the key maps to a specific object or fragment, and the value references its location within extents.&lt;/li&gt;
&lt;li&gt;This abstraction decouples the logical organization of data from its physical storage, allowing ShardStore to optimize for performance and durability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write Workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When new data is written to ShardStore:

&lt;ol&gt;
&lt;li&gt;The data is assigned a key and appended to an available extent.&lt;/li&gt;
&lt;li&gt;Metadata is updated to reflect the key-to-extent mapping.&lt;/li&gt;
&lt;li&gt;The extent, now containing the new data, is persisted to disk and replicated for fault tolerance.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Read Workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To read data:

&lt;ol&gt;
&lt;li&gt;ShardStore uses the key to locate the corresponding extent and offset.&lt;/li&gt;
&lt;li&gt;The relevant extent is loaded from disk (or memory if cached).&lt;/li&gt;
&lt;li&gt;Data is extracted and returned to the client.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Crash Consistency and Concurrency&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShardStore employs techniques such as journaling and atomic updates to ensure crash consistency during writes.&lt;/li&gt;
&lt;li&gt;Concurrent read and write operations are managed using fine-grained locking and careful metadata updates to prevent conflicts.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Extent Lifecycle Management&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Garbage Collection&lt;/strong&gt;: As objects are deleted or overwritten, extents containing obsolete data are compacted to reclaim space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt;: Extents are replicated across multiple nodes for durability and availability.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ol&gt;
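&lt;p&gt;The key-to-extent mapping described above can be illustrated with a highly simplified Python toy (invented class and fields, nothing like the real Rust implementation, and with no journaling, replication, or crash handling):&lt;/p&gt;

```python
class ShardStoreToy:
    """Toy model: keys map to (extent_id, offset, length) locations."""
    def __init__(self, extent_size=8):
        self.extents = [bytearray()]   # extent 0 starts empty
        self.index = {}                # key -> (extent_id, offset, length)
        self.extent_size = extent_size

    def put(self, key, value):
        # Append to the current extent, rolling over when it is full.
        if len(self.extents[-1]) + len(value) > self.extent_size:
            self.extents.append(bytearray())
        extent_id = len(self.extents) - 1
        offset = len(self.extents[extent_id])
        self.extents[extent_id].extend(value)
        self.index[key] = (extent_id, offset, len(value))

    def get(self, key):
        extent_id, offset, length = self.index[key]
        return bytes(self.extents[extent_id][offset:offset + length])

store = ShardStoreToy()
store.put("obj-1", b"hello")
store.put("obj-2", b"world")
print(store.get("obj-1"))  # b'hello'
```

&lt;p&gt;Writes are append-only within an extent, which is what makes sequential I/O dominate, while the index decouples the key space from physical placement.&lt;/p&gt;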

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kvj0crvk2jppiguf1vs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kvj0crvk2jppiguf1vs.png" alt="LSM Tree chunk reclamation" width="642" height="797"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load Balancing&lt;/strong&gt;: ShardStore dynamically moves extents between nodes to balance storage and compute load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits of Using Extents in ShardStore
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Extents enable efficient storage of millions of objects within a shard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Sequential writes to extents reduce disk I/O overhead, improving throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Isolation&lt;/strong&gt;: By partitioning data into shards and extents, failures in one area are less likely to impact the entire system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This interaction between ShardStore and extents underpins the scalability, durability, and performance of Amazon S3, making it a key innovation in distributed storage systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to validations
&lt;/h2&gt;

&lt;p&gt;Traditional formal verification methods can be resource-intensive and challenging to maintain, especially in large-scale, evolving systems. To address this, the authors advocate for "lightweight formal methods," emphasizing automation, usability, and continuous validation alongside ongoing software development. &lt;/p&gt;

&lt;p&gt;A central aspect of their approach is the development of executable reference models that serve as specifications against which the implementation is validated. These models are written in the same programming language as the implementation (Rust), facilitating their integration into the development process and enabling engineers to maintain them as the system evolves. &lt;/p&gt;

&lt;p&gt;The authors decompose correctness into independent properties, each verified using the most appropriate tool. For instance, property-based testing is employed to ensure that the implementation conforms to the reference model under various scenarios, including crash consistency and concurrent executions. This method has been effective in identifying subtle bugs that might have been missed through traditional testing methods. &lt;/p&gt;
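&lt;p&gt;In the spirit of that approach, here is a tiny Python illustration (not the paper’s actual Rust code) of conformance checking: random operation sequences are replayed against both a simple executable reference model and the implementation under test, asserting that their observable behavior agrees:&lt;/p&gt;

```python
import random

class ReferenceModel:
    """Executable specification: a plain dict."""
    def __init__(self):
        self.data = {}
    def put(self, k, v):
        self.data[k] = v
    def get(self, k):
        return self.data.get(k)

class Implementation:
    """Stand-in for the optimized store under test (append-only log)."""
    def __init__(self):
        self.log = []
    def put(self, k, v):
        self.log.append((k, v))
    def get(self, k):
        for key, value in reversed(self.log):
            if key == k:
                return value
        return None

def property_test(seed, num_ops=200):
    rng = random.Random(seed)
    model, impl = ReferenceModel(), Implementation()
    for _ in range(num_ops):
        k = rng.choice("abcd")
        if rng.random() > 0.5:
            v = rng.randint(0, 99)
            model.put(k, v)
            impl.put(k, v)
        else:
            # Conformance property: reads agree with the reference model.
            assert impl.get(k) == model.get(k)
    return True
```

&lt;p&gt;Because the model is ordinary code in the same language as the implementation, engineers can extend it alongside new features, which is exactly the maintainability argument the paper makes.&lt;/p&gt;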

&lt;p&gt;By integrating these lightweight formal methods into the engineering workflow, the team has prevented 16 issues from reaching production, including complex crash consistency and concurrency problems. Notably, the approach has been adopted by non-formal-methods experts, with engineers contributing to the development and maintenance of the reference models, demonstrating the practicality and scalability of the method in a real-world, large-scale system like Amazon S3. &lt;/p&gt;

&lt;p&gt;This work was recognized with a best-paper award at the ACM Symposium on Operating Systems Principles (SOSP) in 2021, highlighting its significance in the field of automated reasoning and formal methods. &lt;/p&gt;

&lt;p&gt;For a more in-depth understanding, you can access the full paper &lt;a href="https://assets.amazon.science/77/5e/4a7c238f4ce890efdc325df83263/using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-amazon-s3-2.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
    </item>
    <item>
      <title>Database Indexing Internals Part III</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Sat, 30 Nov 2024 12:14:20 +0000</pubDate>
      <link>https://dev.to/sf_1997/database-indexing-internals-part-iii-5bhe</link>
      <guid>https://dev.to/sf_1997/database-indexing-internals-part-iii-5bhe</guid>
      <description>&lt;h2&gt;
  
  
  B+ Tree Extensions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;B+ Tree File Organization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;B+-tree file organization is a way of representing B+-tree nodes as a tree structure within a file. Each nonleaf node holds pointers, and following one initiates another disk I/O to fetch the next index node, until we reach the leaf node, which holds the pointer to the actual record. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indexing Strings&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Creating B+-tree indices on string-valued attributes raises two problems. The first problem is that strings can be of variable length. The second problem is that strings can be&lt;br&gt;
long, leading to a low fanout and a correspondingly increased tree height.&lt;/p&gt;

&lt;p&gt;The fanout of nodes can be increased by using a technique called prefix compression. With prefix compression, we do not store the entire search key value at nonleaf&lt;br&gt;
nodes. We only store a prefix of each search key value that is sufficient to distinguish&lt;br&gt;
between the key values in the subtrees that it separates. For example, if we had an index on names, the key value at a nonleaf node could be a prefix of a name; it may suffice to store “Silb” at a nonleaf node, instead of the full “Silberschatz” if the closest values in the two subtrees that it separates are, say, “Silas” and “Silver” respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bulk Loading of B+-Tree Indices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Insertion of a large number of entries at a time into an index is referred to as bulk loading of the index. An efficient way to perform bulk loading of an index is as follows:&lt;br&gt;
First, create a temporary file containing index entries for the relation, then sort the file on the search key of the index being constructed, and finally scan the sorted file&lt;br&gt;
and insert the entries into the index. &lt;/p&gt;

&lt;p&gt;There are efficient algorithms for sorting large relations, which can sort even a large file with an I/O cost comparable to that of reading the file a few times, assuming a reasonable amount of main memory is available. &lt;/p&gt;

&lt;p&gt;There is a significant benefit to sorting the entries before inserting them into the B+-tree. When the entries are inserted in sorted order, all entries that go to a particular leaf node will appear consecutively, and the leaf needs to be written out only once;&lt;br&gt;
nodes will never have to be read from disk during bulk load, if the B+-tree was empty to start with. Each leaf node will thus incur only one I/O operation even though many entries may be inserted into the node. &lt;/p&gt;

&lt;p&gt;For a relation with 100 million entries, if each leaf contains 100 entries, the leaf level will contain 1 million nodes, resulting in only 1 million I/O operations for creating the leaf level. Even these I/O operations can be expected to be sequential, if successive leaf nodes are allocated on successive disk blocks, and few disk seeks would be required. With magnetic disks, 1 millisecond per block is a reasonable estimate for mostly sequential I/O operations, in contrast to 10 milliseconds per block for random I/O operations.&lt;/p&gt;

&lt;p&gt;If the B+-tree is initially empty, it can be constructed faster by building it bottom-up, from the leaf level, instead of using the usual insert procedure. In bottom-up B+ tree construction, after sorting the entries as we just described, we break up the sorted&lt;br&gt;
entries into blocks, keeping as many entries in a block as can fit in the block; the resulting blocks form the leaf level of the B+-tree. &lt;br&gt;
The minimum value in each block, along with the pointer to the block, is used to create entries in the next level of the B+-tree, pointing to the leaf blocks. Each further level of the tree is similarly constructed using the minimum values associated with each node one level below, until the root is created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B Tree Index Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most databases use only B+-tree index structures, since their benefits outweigh those of B-tree indices, so I am not going to go into them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indexing on Flash Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Flash storage is structured as pages, and the B+-tree index structure can be used with flash based SSDs. SSDs provide much faster random I/O operations than magnetic disks, requiring only around 20 to 100 microseconds for a random page read, instead of about 5 to 10 milliseconds with magnetic disks. Thus, lookups run much faster with data on SSDs, compared to data on magnetic disks.&lt;/p&gt;

&lt;p&gt;The performance of write operations is more complicated with flash storage. An important difference between flash storage and magnetic disks is that flash storage does not permit in-place updates to data at the physical level, although it appears to do so logically. Every update turns into a copy+write of an entire flash-storage page, requiring the old copy of the page to be erased subsequently. &lt;/p&gt;

&lt;p&gt;A new page can be written in 20 to 100 microseconds, but eventually old pages need to be erased to free up the pages for&lt;br&gt;
further writes. Erases are done at the level of blocks containing multiple pages, and a block erase takes 2 to 5 milliseconds.&lt;br&gt;
The optimum B+-tree node size for flash storage is smaller than that with magnetic disk, since flash pages are smaller than disk blocks; it makes sense for tree-node sizes to match to flash pages, since larger nodes would lead to multiple page writes when a&lt;br&gt;
node is updated. Although smaller pages lead to taller trees and more I/O operations to access data, random page reads are so much faster with flash storage that the overall impact on read performance is quite small.&lt;/p&gt;

&lt;p&gt;Although random I/O is much cheaper with SSDs than with magnetic disks, bulk loading still provides significant performance benefits, compared to tuple-at-a-time insertion, with SSDs. In particular, bottom-up construction reduces the number of page writes compared to tuple-at-a-time insertion, even if the entries are sorted on the search key. Since page writes on flash cannot be done in place and require relatively expensive block erases at a later point in time, the reduction of number of page writes with bottom-up B+-tree construction provides significant performance benefits.&lt;/p&gt;

&lt;p&gt;Several extensions and alternatives to B+-trees have been proposed for flash storage, with a focus on reducing the number of erase operations that result due to page rewrites. One approach is to add buffers to internal nodes of B+-trees and record updates temporarily in buffers at higher levels, pushing the updates down to lower levels&lt;br&gt;
lazily. The key idea is that when a page is updated, multiple updates are applied together, reducing the number of page writes per update. Another approach creates multiple trees and merges them; the log-structured merge tree and its variants are based on this idea. &lt;/p&gt;

&lt;p&gt;That’s it for this blog. More coming soon.&lt;br&gt;
Part I: &lt;a href="https://dev.to/sf_1997/database-indexing-explained-47em"&gt;Database Indexing Internals Explained&lt;/a&gt;&lt;br&gt;
Part II: &lt;a href="https://planetscale.com/blog/btrees-and-database-indexes" rel="noopener noreferrer"&gt;B+ Trees Blog from PlanetScale&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>computerscience</category>
      <category>distributedsystems</category>
      <category>learning</category>
    </item>
    <item>
      <title>Database Indexing Internals Explained</title>
      <dc:creator>Debjit Bhattacharjee</dc:creator>
      <pubDate>Mon, 25 Nov 2024 16:16:52 +0000</pubDate>
      <link>https://dev.to/sf_1997/database-indexing-explained-47em</link>
      <guid>https://dev.to/sf_1997/database-indexing-explained-47em</guid>
      <description>&lt;p&gt;There are two basic kinds of indices:&lt;br&gt;
• &lt;strong&gt;Ordered indices&lt;/strong&gt;. Based on a sorted ordering of the values.&lt;br&gt;
• &lt;strong&gt;Hash indices&lt;/strong&gt;. Based on a uniform distribution of values across a range of buckets.&lt;br&gt;
The bucket to which a value is assigned is determined by a function, called a hash&lt;br&gt;
function.&lt;/p&gt;

&lt;p&gt;A file may have several indices, on different search keys. If the file containing the records is sequentially ordered, a &lt;strong&gt;clustering index&lt;/strong&gt; is an index whose search key&lt;br&gt;
also defines the sequential order of the file.&lt;/p&gt;

&lt;p&gt;Indices whose search key&lt;br&gt;
specifies an order different from the sequential order of the file are called &lt;strong&gt;nonclustering&lt;br&gt;
indices, or secondary indices&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An index entry, or index record, consists of a search-key value and pointers to one or&lt;br&gt;
more records with that value as their search-key value. The pointer to a record consists&lt;br&gt;
of the identifier of a disk block and an offset within the disk block to identify the record&lt;br&gt;
within the block.&lt;br&gt;
There are two types of ordered indices that we can use:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense index&lt;/strong&gt;: In a dense index, an index entry appears for every search-key value&lt;br&gt;
in the file. In a dense clustering index, the index record contains the search-key&lt;br&gt;
value and a pointer to the first data record with that search-key value. The rest of&lt;br&gt;
the records with the same search-key value would be stored sequentially after the&lt;br&gt;
first record, since, because the index is a clustering one, records are sorted on the&lt;br&gt;
same search key.&lt;br&gt;
In a dense nonclustering index, the index must store a list of pointers to all&lt;br&gt;
records with the same search-key value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08yom1kyfsq9ekt3p50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08yom1kyfsq9ekt3p50.png" alt="Dense Index" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sparse index&lt;/strong&gt;: In a sparse index, an index entry appears for only some of the search-&lt;br&gt;
key values. Sparse indices can be used only if the relation is stored in sorted order&lt;br&gt;
of the search key; that is, if the index is a clustering index. As is true in dense&lt;br&gt;
indices, each index entry contains a search-key value and a pointer to the first data&lt;br&gt;
record with that search-key value. To locate a record, we find the index entry with&lt;br&gt;
the largest search-key value that is less than or equal to the search-key value for&lt;br&gt;
which we are looking. We start at the record pointed to by that index entry and&lt;br&gt;
follow the pointers in the file until we find the desired record.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz55abupt6qcumnz8xlad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz55abupt6qcumnz8xlad.png" alt="Sparse Index" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;
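&lt;p&gt;The sparse-index lookup rule (find the largest indexed key not exceeding the target, then scan forward through the file) can be sketched in Python with toy data:&lt;/p&gt;

```python
import bisect

# Sorted file of (search_key, record) pairs, as a clustering index requires.
FILE = [(10, "rec10"), (20, "rec20"), (30, "rec30"), (40, "rec40")]

# Sparse index: one entry for every other position in the file.
SPARSE = [(10, 0), (30, 2)]  # (search_key, position in FILE)

def sparse_lookup(key):
    keys = [k for (k, _) in SPARSE]
    # Largest indexed key that does not exceed the target.
    i = bisect.bisect_right(keys, key) - 1
    if i == -1:
        return None
    # Scan forward through the sorted file from that point.
    for k, record in FILE[SPARSE[i][1]:]:
        if k == key:
            return record
    return None

print(sparse_lookup(20))  # 'rec20'
```

&lt;p&gt;Key 20 is not in the sparse index, so the lookup lands on the entry for 10 and scans forward, trading a short sequential scan for a much smaller index.&lt;/p&gt;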

&lt;p&gt;&lt;strong&gt;Multilevel Indices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense Index Overview&lt;/strong&gt;: A dense index has one index entry for each record in the relation. Example: a relation with 1,000,000 tuples and 100 index entries per 4 KB block results in 10,000 index blocks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Issues with Large Indices&lt;/em&gt;: Large relations require large indices; for 100,000,000 tuples, the index occupies 1,000,000 blocks (4 GB). Indices this large are often stored as sequential files on disk, and searching them can become costly because it requires multiple random I/O operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary Search on the Index&lt;/strong&gt;: Binary search requires up to ⌈log2(b)⌉ block reads for an index occupying b blocks, and each read is random. Example: for a 10,000-block index, binary search requires 14 random reads, taking roughly 140 ms on a magnetic disk. Overflow blocks make binary search more complex and potentially less efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential Search&lt;/strong&gt;: Sequential search reads all b blocks, which may take longer than binary search; however, sequential reads sometimes benefit from lower access costs than random reads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilevel Indexing&lt;/strong&gt;: A solution that reduces search time for large indices: treat the large index (the inner index) as a sequential file and construct a sparse outer index on it. The outer index points to blocks of the inner index, which in turn contains pointers to the actual data blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of a Multilevel Index&lt;/strong&gt;: An inner index with 10,000 blocks needs an outer index of only 100 blocks. If the outer index fits in main memory, a search reads a single inner-index block instead of the 14 block reads of binary search: a 14x improvement in search efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilevel Index with Extremely Large Files&lt;/strong&gt;: For a relation with 100,000,000 tuples, the inner index occupies 1,000,000 blocks and the outer index 10,000 blocks (40 MB). If the outer index is itself too large for main memory, another level of indexing is added, creating a multilevel index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of a Multilevel Index&lt;/strong&gt;: Significantly fewer I/O operations than binary or sequential search, and efficient searching even over very large relations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relation to Tree Structures&lt;/strong&gt;: Multilevel indices resemble tree structures (e.g., binary trees): higher index levels act like parent nodes, and lower levels act like child nodes.&lt;/p&gt;
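&lt;p&gt;The block counts and read counts used in these examples can be checked with a few lines (assuming, as the notes do, 4 KB blocks with 100 index entries each and an outer index resident in main memory):&lt;/p&gt;

```python
# Verifying the multilevel-index arithmetic from the notes:
# 4 KB blocks, 100 index entries per block.
import math

ENTRIES_PER_BLOCK = 100

inner_blocks = 1_000_000 // ENTRIES_PER_BLOCK             # dense inner index
outer_blocks = math.ceil(inner_blocks / ENTRIES_PER_BLOCK)  # sparse outer index

binary_search_reads = math.ceil(math.log2(inner_blocks))  # over inner index
multilevel_reads = 1  # one inner-index block; outer index is in memory

print(inner_blocks, outer_blocks, binary_search_reads, multilevel_reads)
# 10000 100 14 1
```

&lt;p&gt;This is exactly the 14x reduction described above: 14 random block reads for binary search versus a single inner-index block read when the outer index is memory-resident.&lt;/p&gt;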

&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;br&gt;
Multilevel indices optimize the search process for large datasets by reducing the number of I/O operations.&lt;br&gt;
Sparse indexing at higher levels minimizes memory usage while maintaining search efficiency.&lt;br&gt;
They are an efficient alternative to binary or sequential searches for extremely large relations.&lt;/p&gt;

&lt;p&gt;This is Part 1 of the upcoming series on Database Indexing.&lt;/p&gt;

&lt;p&gt;These are just my notes from one of my favourite books: &lt;a href="https://db-book.com/" rel="noopener noreferrer"&gt;Database System Concepts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find me on &lt;br&gt;
&lt;a href="https://x.com/sf_9179?t=Mq38zsA_v5LvsnyIZPdM8A&amp;amp;s=09" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>database</category>
      <category>distributedsystems</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
