What is Data Partitioning?

The data storage for a software system can often be a bottleneck. If you run an application on a single database server, then as you scale you'll eventually reach the limits of that machine, even if you keep paying for a bigger, more expensive machine with more CPUs and memory. This is where data partitioning can help.
Here's a simple example. A block blob in Azure Storage currently has a maximum size of just under 5 TB. Five terabytes is a large blob, but what if you have 8 TB of data to store? Well, you'd need to partition your data across multiple blobs. You could, for example, store 4 TB in blob 1 and another 4 TB in blob 2.
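
The arithmetic behind that split can be sketched in a few lines of Python. This is only an illustration: the function name plan_blobs is hypothetical, the 5 TB limit is a rounded stand-in for the "just under 5 TB" figure above, and nothing here calls the Azure Storage API.

```python
TB = 10**12  # decimal terabyte, used here only for illustration


def plan_blobs(total_bytes: int, max_blob_bytes: int) -> list[int]:
    """Split total_bytes into the fewest near-equal chunks that each fit in one blob."""
    if total_bytes < 0 or max_blob_bytes <= 0:
        raise ValueError("total must be non-negative and the blob limit positive")
    if total_bytes == 0:
        return []
    # Integer ceiling division: the smallest number of blobs that can hold the data.
    count = (total_bytes + max_blob_bytes - 1) // max_blob_bytes
    base, extra = divmod(total_bytes, count)
    # The first `extra` chunks carry one extra byte so the sizes sum to the total.
    return [base + 1 if i < extra else base for i in range(count)]


# 8 TB of data with an assumed 5 TB per-blob limit needs two blobs of 4 TB each.
sizes = plan_blobs(8 * TB, 5 * TB)
```

Splitting evenly, rather than filling the first blob and putting the remainder in the second, is a design choice made for this sketch; either layout counts as partitioning.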
That's a simple example of partitioning. Now let's talk about three specific types of partitioning. First, imagine you need to store orders in your application and an order consists of five attributes. We'll just call them A, B, C, D, and E. These attributes could be columns in a relational database or properties in a document-oriented database. Let's also say that the application frequently accesses attributes A and D, but rarely accesses B, C, and E.

In that scenario, we can vertically partition orders so that A and D are stored separately from B, C, and E. This could have a few benefits. For example, there could be less contention in a database if the attributes are separate and the tables are not as wide, or we could store A and D in fast premium storage while B, C, and E stay in less expensive storage. Also, A and D together are less data, so they are easier to cache and faster to replicate. So there are several opportunities to optimize your data when using vertical partitioning.
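
Here is a minimal sketch of that vertical split, assuming two in-memory dictionaries stand in for the fast store and the cheaper store. The names hot_store, cold_store, save_order, load_hot, and load_full are hypothetical and chosen only for this illustration.

```python
HOT_FIELDS = ("A", "D")        # frequently accessed attributes
COLD_FIELDS = ("B", "C", "E")  # rarely accessed attributes

# Stand-ins for two separate stores (for example, premium vs. cheaper storage).
hot_store: dict[int, dict] = {}
cold_store: dict[int, dict] = {}


def save_order(order_id: int, order: dict) -> None:
    """Write one order as two narrower records that share the same key."""
    hot_store[order_id] = {name: order[name] for name in HOT_FIELDS}
    cold_store[order_id] = {name: order[name] for name in COLD_FIELDS}


def load_hot(order_id: int) -> dict:
    """The common read path touches only the small, hot record."""
    return hot_store[order_id]


def load_full(order_id: int) -> dict:
    """The rare read path pays for a second lookup and recombines both halves."""
    return {**hot_store[order_id], **cold_store[order_id]}


save_order(1, {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1"})
```

Both halves are keyed by the same order ID, which is what lets the full order be reassembled when B, C, or E is actually needed.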

Another approach with orders is to spread the order data across multiple data sources. This is what we call horizontal partitioning, and it's a popular approach to scaling data because it gives you a lot of power that you can use for scalability and performance. One approach to horizontal partitioning is to use what we call range partitioning. I could say that all the orders for customers A through M go to one database, while all the orders for customers N through Z go to a different database. A range could also be temporal. So I could say all the orders for this month go to one database and all the orders for last month go to a different database. Or I could have a hashing algorithm that spreads orders across the different databases more evenly. You can almost think of it like load balancing for databases.

The choice here really depends on the application, the data, and how the data is used. In a horizontal partition, I use the same schema or the same document shape in all the different databases, but since I'm spreading the data out, it's easier to scale out my hardware and add more processing power as my data grows.
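
The three routing rules just described (by customer range, by month, and by hash) can be sketched as small functions that map an order to a database. This is a sketch under stated assumptions: the database names, the function names, and the choice of SHA-256 are all made up for illustration and are not tied to any particular product.

```python
import hashlib
from datetime import date


def range_shard(customer_name: str) -> str:
    """Range partitioning: customers A through M in one database, N through Z in another."""
    first = customer_name[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError("this sketch only handles names starting with A-Z")
    return "orders_a_to_m" if first <= "M" else "orders_n_to_z"


def month_shard(order_date: date) -> str:
    """Temporal range partitioning: one database per calendar month."""
    return f"orders_{order_date:%Y_%m}"


def hash_shard(order_id: str, shard_count: int) -> int:
    """Hash partitioning: a stable hash of the key picks one of shard_count databases."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so would not route consistently.
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

For example, range_shard("Martinez") returns "orders_a_to_m" and month_shard(date(2024, 3, 5)) returns "orders_2024_03". One trade-off of the simple modulo in hash_shard is that changing shard_count reassigns most keys, so a real system would need a plan for moving data when databases are added.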

Finally, there's also what we call functional partitioning. Let's say for this scenario our system stores orders and also stores invoices. In functional partitioning, these two datasets could live in different data sources that are optimized for the data and the system that uses the data. So we could use a relational database for orders and a document-based database for invoices. This sounds complicated, and it is. You don't necessarily want to use functional partitioning for a simple application, but for a large system with a lot of data that needs to scale, functional partitioning is one strategy to evaluate so you can match the best data store to your data and your application's data access patterns.
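
To make the orders-and-invoices scenario concrete, here is a small sketch that uses an in-memory SQLite database as the relational store for orders and a dictionary of JSON strings as a stand-in for a document database for invoices. The class names, the table layout, and the sample fields are hypothetical; a real system would use whichever stores fit its access patterns.

```python
import json
import sqlite3


class OrderStore:
    """Relational store: fixed columns, queried with SQL."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE orders ("
            "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, total REAL NOT NULL)"
        )

    def add(self, order_id: int, customer: str, total: float) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)",
                (order_id, customer, total),
            )

    def get(self, order_id: int):
        cursor = self._db.execute(
            "SELECT id, customer, total FROM orders WHERE id = ?", (order_id,)
        )
        return cursor.fetchone()  # a tuple, or None if the order does not exist


class InvoiceStore:
    """Document store stand-in: each invoice is one free-form JSON document."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def put(self, invoice_id: str, document: dict) -> None:
        self._docs[invoice_id] = json.dumps(document)

    def get(self, invoice_id: str) -> dict:
        return json.loads(self._docs[invoice_id])


orders = OrderStore()
invoices = InvoiceStore()
orders.add(1, "Ada", 19.5)
invoices.put("inv-1", {"order_id": 1, "lines": [{"sku": "X", "qty": 2}]})
```

Notice that the invoice refers to its order only by ID. Neither store can enforce that link on its own, which is part of the complexity that functional partitioning brings.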

So I cannot tell you which partitioning strategy is best for your application. I can just tell you that if you need to be highly scalable, you should evaluate a partitioning strategy. And yes, partitioning can complicate your design and implementation. It can be difficult, for example, to join data across partitions or enforce transactions across partitions.