Architecture Design of Cloud Data Warehouse

Many people may ask these questions about cloud data warehouses:

Is traditional data warehouse architecture out of date when it comes to cloud?
What problems can a cloud data warehouse solve?
What is the ideal cloud data warehouse architecture?

Now let’s try to figure them out.

Sharding Warehouse

First, let's take a look at the architecture of a traditional sharding warehouse, and find out the limitations when it works in the cloud.

The architecture usually comes with a fixed data interval between shards, so the data skew problem is prone to occur. There’re a few options to fix this:

Upgrading the shard hardware: If the locations of hot spots are hard to predict, you must upgrade the entire cluster. Turns out the resource control becomes coarse-grained.
Expansion (for example, adding shard-4): This requires data migration. If there is a large mount of data to migrate, you have to wait a long time for shard-4 to be ready for service.

What if we just simply move a sharding warehouse to the cloud as it is? The resource control would remain coarse-grained, and it’s impossible to build an accurate billing model, for example, on-demand pay or pay-as-you-go.

In other words, easy expansion but high cost.

Cloud Warehouse

Can you try to imagine the architecture of a cloud warehouse that satisfies the following conditions?

On-demand elastic expansion
Refined resource control by the amount of data

First, the architecture must separate storage and computing. Second, the computing capability can increase or decrease as the number of computing nodes goes up or down. This is a very smooth process and does not require data migration.

The computing nodes are scheduled in a refined way. In this example, Node-4 is literally sever-less and can be considered as a process that will automatically shut down after running.

You might think the cloud warehouse architecture above seems simpler than the traditional one. The “Shared Storage” can be on the cloud, AWS S3 or Azure Blob Storage. Plus a computing engine like Presto, the cloud warehouse solution looks perfect to you?

Not really, because the shared storage is not designed for low latency and high throughput, and the network jitter occurs occasionally and it is difficult to control. If you want to fix these only With a powerful computing engine, that’s not a good idea.

Design Strategies

First, let's take a look at the data types in a cloud data warehouse:

Persistent data: Users’ data that is heavily dependent on shared storage.
Intermediate data: Temporary intermediate results, such as temporary data generated by sorting, JOIN.
Metadata: Object catalogs, table schema, user metadata, etc.

Since reading data from shared storage is assumed to be unreliable, we can read less by adding a cache. Then the question is, which data the cache is supposed to hold? Raw block data or index? Is it a global cache or a cache within a computing node?

Snowflake Architecture

Snowflake adds a Distributed Ephemeral Storage between the computing nodes and shared storage to store the intermediate data, as well as the persistent data. The architecture can fully utilize the cache, but the Distributed Ephemeral Storage still has some challenges to be elastic, for example, resource isolation in case of multiple tenants.

Databend Architecture

A cloud warehouse separates computing from storage. We pre-generate enough indexes for the persistent data and put them into the Metadata Service. Each computing node subscribes to the indexes and updates its local cache as needed. This architecture is similar to FireBolt.

This is a relatively simple and feasible way at present. The cache only needs a warm-up when a new computing node is added. But challenges still exist, for example, when you have massive index data to sync.

Summary

Databend is an open source cloud data warehouse where you can easily create your own data cloud. It separates computing from storage and allows for elastic expansion on the cloud.

DEV Community