How I Built a Disaggregated Storage Layer for a Columnar MPP Database
And what broke in production.
The Problem
Enterprise data warehouses are expensive to run. When your database stores data on local SSDs, you're paying SSD prices for data that might only be queried once a month. For a bank managing petabytes of historical transaction data, this adds up fast.
The standard answer in 2022 was "move cold data to object storage" — S3, Alibaba OSS, Huawei OBS, take your pick. Object storage is cheap. The problem is that it's also slow: latency in the hundreds of milliseconds per request, compared to microseconds for local NVMe.
Our task: make object storage fast enough that a columnar MPP database could use it as a primary storage backend without destroying query performance.
The result was UniStore, a disaggregated storage layer that sits between the query engine and cloud object storage. This is the architecture we built, the tradeoffs we made, and the bugs we found the hard way.
Architecture
UniStore runs as a separate process on each database node. From the query engine's perspective, it looks like a local filesystem. Under the hood, it manages a local cache (around 1TB per node) backed by object storage.
```
Query Engine
     |
     |  (POSIX-like file API)
     v
UniStore Frontend
     |
     |  (internal protocol)
     v
UniStore Backend
     |                    |
     v                    v
Local Cache          Object Storage
(NVMe, ~1TB)         (S3/OSS/OBS/HDFS)
```
The frontend handles file open/read/write calls from the query engine. The backend manages the cache, coordinates uploads to object storage, and handles prefetching. They communicate over a Unix socket.
Object files are split into fixed-size blocks. Each block is independently cached, uploaded, and tracked. The metadata — which blocks exist, where they are, their cache state — is stored in RocksDB for persistence across restarts.
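To make the block-tracking concrete, here is a minimal sketch of how per-block metadata might be keyed and serialized for a RocksDB-style key-value store. The key layout, field names, and block size are illustrative assumptions, not UniStore's actual schema; a plain dict stands in for RocksDB.

```python
import json

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical fixed block size (4 MiB)

def block_key(file_id, index):
    # Key layout: file id + zero-padded block index, so a prefix scan
    # over one file returns its blocks in order.
    return f"{file_id}/{index:010d}".encode()

def block_value(cache_state, cache_path, refcount):
    # Value: cache state ("cached", "uploading", "remote"), local path
    # if cached, and a pin refcount -- persisted so a restart can
    # rebuild the cache map without scanning the filesystem.
    return json.dumps({"state": cache_state, "path": cache_path,
                       "refs": refcount}).encode()

# A plain dict stands in for RocksDB here.
meta = {}
meta[block_key("file-42", 0)] = block_value("cached", "/cache/file-42.0", 1)
meta[block_key("file-42", 1)] = block_value("remote", None, 0)

# Recovery after a restart: which blocks of file-42 are already local?
local = [k for k, v in sorted(meta.items())
         if k.startswith(b"file-42/") and json.loads(v)["state"] == "cached"]
```

The zero-padded index keeps lexicographic key order equal to block order, which is what makes the recovery scan a cheap ordered prefix walk.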
Key Engineering: The Prefetch Engine
Cold reads from object storage are slow (~200ms per request vs ~100μs for local NVMe — roughly 2000x slower). The only way to make this tolerable is to have data in cache before the query asks for it.
UniStore's prefetch engine works by predicting sequential scan patterns. When the query engine opens a file and starts reading blocks sequentially, the backend detects the pattern and begins fetching ahead.
Two mechanisms drive this. The first is a sequential scan predictor: given the current read position and recent access history, it estimates which blocks will be needed next and issues prefetch requests proactively. The second is a pre-registration API: when the query optimizer knows in advance which files it will access, it registers this list before execution starts, and the backend begins warming the cache early.
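The sequential predictor can be sketched like this. The detection rule (a few consecutive indices in a row) and the window sizes are assumptions for illustration, not UniStore's actual heuristics.

```python
from collections import deque

class SequentialPredictor:
    """Detects sequential block reads and proposes blocks to prefetch.

    Illustrative sketch: history length, depth, and the detection
    rule are assumed values, not the real implementation's.
    """
    def __init__(self, history=4, depth=8):
        self.recent = deque(maxlen=history)  # recently read block indices
        self.depth = depth                   # how far ahead to fetch

    def on_read(self, block_index):
        self.recent.append(block_index)
        # Sequential if the last few reads are consecutive indices.
        if len(self.recent) == self.recent.maxlen and all(
            b - a == 1 for a, b in zip(self.recent, list(self.recent)[1:])
        ):
            return list(range(block_index + 1, block_index + 1 + self.depth))
        return []  # no confident pattern yet; prefetch nothing
```

After four consecutive reads of blocks 0 through 3, the predictor proposes blocks 4 through 11; a random read resets its confidence.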
To handle variable object storage latency, the prefetch timing uses a sliding-window average of recent round-trip times. If the storage backend is responding slowly, the prefetch window expands to compensate.
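A minimal sketch of that adaptive sizing, assuming a simple rule of "prefetch enough blocks to cover one average round trip of consumption" (the exact formula is an assumption, not the real one):

```python
from collections import deque

class AdaptiveWindow:
    """Scales prefetch depth with observed object-storage round-trip time.

    Sketch under assumptions: a sliding-window mean of RTT samples,
    and a consume-rate-based sizing rule.
    """
    def __init__(self, consume_rate_blocks_per_s, window=16, min_depth=2):
        self.rtts = deque(maxlen=window)  # recent RTT samples, in seconds
        self.rate = consume_rate_blocks_per_s
        self.min_depth = min_depth

    def record_rtt(self, seconds):
        self.rtts.append(seconds)

    def depth(self):
        if not self.rtts:
            return self.min_depth
        avg = sum(self.rtts) / len(self.rtts)
        # Slow backend -> larger window, so data lands before it's read.
        return max(self.min_depth, int(avg * self.rate) + 1)
```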
Performance result: Cold start (empty cache) is roughly 20x slower than SSD. Warm cache (data already prefetched) performs on par with SSD. In practice, for repeated workloads on the same dataset, queries run at SSD speed despite the underlying storage being object storage.
Key Engineering: The Cache Layer
The cache uses a block-level LRU policy with pin/unpin support. Blocks referenced by active queries are pinned — they cannot be evicted until the query releases them. This prevents a pathological case where a long-running scan evicts its own prefetched data.
Eviction is triggered when cache utilization crosses a threshold. The eviction policy walks the LRU list, skipping pinned blocks, and frees the least recently used eligible blocks.
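The pin-aware eviction walk can be sketched as follows. Counting capacity in blocks rather than bytes is a simplification for illustration.

```python
from collections import OrderedDict

class PinnedLRU:
    """Block cache with LRU eviction that skips pinned blocks.

    Minimal sketch: capacity is counted in blocks, not bytes, and
    'touch' both records an access and optionally takes a pin.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> pin refcount; order = LRU

    def touch(self, block_id, pin=False):
        refs = self.blocks.pop(block_id, 0)
        self.blocks[block_id] = refs + (1 if pin else 0)  # move to MRU end
        self._evict()

    def unpin(self, block_id):
        if self.blocks.get(block_id, 0) > 0:
            self.blocks[block_id] -= 1

    def _evict(self):
        # Walk from the LRU end, skipping pinned blocks.
        for bid in list(self.blocks):
            if len(self.blocks) <= self.capacity:
                break
            if self.blocks[bid] == 0:  # unpinned -> evictable
                del self.blocks[bid]
```

A pinned block survives eviction pressure until its query unpins it, which is exactly what protects a long scan's prefetched data.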
One design decision worth noting: we track metadata (block location, cache state, reference counts) in RocksDB rather than in-memory only. This means cache metadata survives process restarts — after a crash, UniStore knows exactly which blocks are in the local cache without having to scan the filesystem.
Multi-Cloud Backend Abstraction
The customer required support for multiple cloud providers simultaneously, with the ability to switch backends or add redundancy without query engine changes.
We implemented a worker abstraction: separate backend implementations for each storage type (OSS-compatible, HDFS, and local filesystem) all satisfy the same interface. The frontend doesn't know or care which backend is active. A factory component selects the right implementation based on configuration.
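The shape of that worker abstraction, sketched in Python. Method names and the in-memory backends are illustrative assumptions, not the real UniStore worker API.

```python
from abc import ABC, abstractmethod

class StorageWorker(ABC):
    """Common interface every storage backend implements."""
    @abstractmethod
    def put(self, key, data): ...
    @abstractmethod
    def get(self, key): ...

class LocalFsWorker(StorageWorker):
    def __init__(self):
        self.store = {}          # in-memory stand-in for a local filesystem
    def put(self, key, data):
        self.store[key] = data
    def get(self, key):
        return self.store[key]

class OssWorker(LocalFsWorker):
    """A real OSS-compatible worker would speak the S3/OSS API instead."""

WORKERS = {"local": LocalFsWorker, "oss": OssWorker}

def make_worker(config):
    # Factory: the frontend only ever sees a StorageWorker.
    return WORKERS[config["backend"]]()
```

Running two backends at once (for the migration described below) is then just holding two worker instances and routing writes to both while reads drain from one.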
This turned out to be more useful than we expected. During deployment, we needed to migrate data between cloud providers while keeping the database online. Because the worker abstraction was clean, we could run two backends simultaneously and drain one while filling the other — all transparent to the query engine.
What Broke in Production
Bug 1: Silent Data Corruption on Large File Writes
This one was subtle.
Large files are written as a sequence of blocks. The frontend issues completion signals one by one as each block finishes. The backend adds these to an async task queue — so in theory, blocks could be processed out of order.
The bug: when the queue happened to process the last block first — a reordering the design permits — the multipart upload part number assigned to it collided with earlier blocks. The last block's data silently overwrote part of an earlier block. No error was returned. The file appeared to complete successfully, but the data was corrupt.
The root cause was in the upload type selection logic. The code was checking only buffer size to decide between single-upload and multipart-upload mode — but for files with multiple blocks, this check was wrong for any block beyond the first. It chose single-upload mode when it should have chosen multipart, which conflicted with an already-in-progress multipart upload session.
Fix: Gate the upload type decision on block position as well as buffer size. For any block beyond the first, force multipart mode.
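The fixed decision logic amounts to a two-condition check. The threshold value and function name here are illustrative, not the actual code:

```python
SINGLE_UPLOAD_LIMIT = 8 * 1024 * 1024  # hypothetical threshold, 8 MiB

def choose_upload_mode(block_index, buffer_size):
    """Return 'single' or 'multipart' for one block's upload.

    The buggy version looked only at buffer_size; any block beyond
    the first belongs to a multi-block file and must join the
    already-open multipart session.
    """
    if block_index > 0:
        return "multipart"  # the fix: block position gates the decision
    return "single" if buffer_size <= SINGLE_UPLOAD_LIMIT else "multipart"
```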
We also added a fault injection flag that forces the last block to be uploaded first, making the race condition deterministically reproducible in CI. This is the kind of test infrastructure we should have built before shipping — async upload pipelines have inherent ordering ambiguity, and that should have been a first-class test scenario from day one.
Bug 2: Cache Deadlock from Tmpfile Leak
The original write path worked like this: when writing a file, data was first buffered into a local temporary file, and a corresponding cache object was created. When the file write completed normally, the temporary file was transitioned to a state where it could eventually be evicted.
The problem: if an exception occurred before the file write completed — say, if object storage became temporarily unavailable — the temporary file and its associated cache objects remained in a state where they could not be evicted. Cache space leaked. Given enough retries (which the query engine would do automatically), the entire cache filled up with unevictable incomplete objects. New queries blocked waiting for cache space that would never be freed.
A secondary problem: block identifiers are allocated serially, so retrying the same failed write created new identifiers and new temporary files with each attempt — accelerating the leak.
Fix: Added a configuration flag that bypasses the local cache entirely for writes, streaming data directly to object storage via dedicated write implementations for each backend type. In the worst case, only one temporary file leaks (named by object path rather than block identifier).
The tradeoff: writes no longer populate the cache, so data written and immediately read will hit object storage instead of cache. For our workload (write-once, read-later analytics), this was acceptable.
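The two write paths, sketched side by side. All names here are illustrative, and the in-memory backend stands in for an object storage client; the point is the cleanup contract, not the real implementation.

```python
class MemBackend:
    """In-memory stand-in for an object storage client."""
    def __init__(self):
        self.objects = {}
    def put(self, key, data):
        self.objects[key] = data

def write_file(path, chunks, backend, cache, bypass_cache):
    """Sketch of the cached vs. bypass write paths.

    With bypass_cache=True, data streams straight to object storage:
    a mid-write failure leaves at most one partial object keyed by
    path, instead of unevictable temp blocks piling up in the cache.
    """
    if bypass_cache:
        backend.put(path, b"".join(chunks))       # direct streaming write
        return
    # Original path: stage blocks in the cache first. If the upload
    # raises, the staged entries must be freed or cache space leaks.
    keys = []
    try:
        for i, chunk in enumerate(chunks):
            key = f"{path}.tmp.{i}"
            cache[key] = chunk                    # staged temp block
            keys.append(key)
        backend.put(path, b"".join(cache[k] for k in keys))
    finally:
        for k in keys:                            # explicit cleanup contract
            cache.pop(k, None)
```

The `finally` block is the part the original code effectively lacked on the exception path; the bypass flag sidesteps the problem entirely by never staging in cache.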
Lessons
1. Test your failure modes, not just your happy path.
The data corruption bug only manifested when the task queue processed blocks out of order. This was always possible, but we didn't test for it. Fault injection infrastructure should be built alongside the feature, not after the bug is found.
2. Resource cleanup ordering in async systems needs explicit contracts.
Both bugs above were variations of the same mistake: implicit assumptions about ordering in async code. The tmpfile bug assumed writes would always complete cleanly. The data corruption bug assumed blocks would always be processed in submission order. Neither assumption was documented, and neither was enforced.
3. Metadata persistence matters.
Storing cache metadata in RocksDB rather than memory-only was the right call. It added complexity (serialization, recovery logic), but it meant that a crashed UniStore process could restart without invalidating the entire cache or losing track of in-flight uploads.
Results
Deployed at a major bank's big data center, managing thousands of PBs across multiple cloud providers (AWS S3, Alibaba Cloud OSS, Huawei Cloud OBS, Tencent Cloud). Cache size per node: ~1TB. Concurrent client connections supported: up to 17,500.
Cold start performance: ~20x slower than SSD (dominated by object storage round-trip latency).
Warm cache performance: on par with SSD.
For the bank's analytics workloads — where the same datasets are queried repeatedly within a business day — warm cache performance means object storage costs with SSD-equivalent query speed.
If you're working on similar storage infrastructure problems — disaggregated storage, object storage integration, or database performance engineering — feel free to reach out.