Mustafa ERBAY

Posted on May 27 • Originally published at mustafaerbay.com.tr

Database Partitioning: Does It Always Provide Scalability?

#database #postgres #performance #scalability

Partitioning: The Magic Wand of Scalability?

When database partitioning comes up, "scalability" is usually the first word that follows. The idea is that dividing large tables into smaller, manageable pieces improves query performance and reduces management overhead. With that in mind, I have used partitioning often, both in my own projects and at companies I have consulted for. Over time, though, I realized that partitioning does not always deliver the magic-wand effect we expect, and in some cases it makes things more complicated. In this post I explain, based on my own experience, how much scalability partitioning actually provides, where it is useful, and which pitfalls to avoid.

Partitioning is the process of dividing a database table into multiple smaller physical pieces based on a specific criterion (e.g., date range, region code, user ID range). The goal is to reduce I/O by touching only the relevant pieces at query time, which improves performance. With large datasets it can also simplify database management, because old data can be archived or removed one piece at a time.

Types of Partitioning and Basic Principles

To the application, a partitioned table is still a single logical table; the database routes rows and queries to the underlying partitions. Different database systems (PostgreSQL, MySQL, Oracle, etc.) offer different partitioning strategies, but the basic principles are similar. Let's look at the most common types of partitioning:

1. Range Partitioning

In this strategy, data is divided based on a specific range. The most frequently used is date-based partitioning. For example, you can divide an orders table into a separate partition for each month.

Scenario: An e-commerce platform's orders table grew to billions of rows over time. Since the frequency of querying old orders decreased, we wanted to move them to separate partitions.
Implementation: Using PostgreSQL's declarative partitioning feature, I created a separate partition for each month with commands like CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');. The upper bound is exclusive, so each monthly partition ends at the first day of the following month.
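
Writing these statements by hand for every month is error-prone, mostly around the bounds. The following is a minimal Python sketch that builds the statement for one month. It assumes the parent table is range-partitioned on a date column and that partitions follow a parent_YYYY_MM naming scheme; the function names are hypothetical, not part of any library.

```python
from datetime import date


def next_month(d: date) -> date:
    """Return the first day of the month after d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def monthly_partition_ddl(parent: str, year: int, month: int) -> str:
    """Build the CREATE TABLE statement for one monthly partition.

    PostgreSQL range bounds are inclusive at the lower end and exclusive
    at the upper end, so the upper bound is the first day of the next month.
    """
    lower = date(year, month, 1)
    upper = next_month(lower)
    name = f"{parent}_{lower:%Y_%m}"  # assumed naming scheme
    return (
        f"CREATE TABLE {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{lower.isoformat()}') TO ('{upper.isoformat()}');"
    )


# Print the statements for a few months, including the year rollover.
for m in (1, 2, 12):
    print(monthly_partition_ddl("orders", 2024, m))
```

The script only produces SQL text; running it against a database, and deciding how far ahead to create partitions, is left to whatever scheduler you already use.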

2. List Partitioning

In this method, data is divided based on specific values. For example, you can partition a customer table by country.

Scenario: A global service provider's customer data belongs to users in different geographical regions.
Implementation: We set up a structure like CREATE TABLE customers PARTITION BY LIST (country_code); CREATE TABLE customers_us PARTITION OF customers FOR VALUES IN ('US'); CREATE TABLE customers_eu PARTITION OF customers FOR VALUES IN ('DE', 'FR', 'UK');. This ensures that queries filtering on country_code for a specific country only scan the relevant partition.

3. Hash Partitioning

In this strategy, data is divided using the hash value of a column. It is generally used to maintain a balanced data distribution.

Scenario: A user sessions table for a high-traffic application. Balanced data distribution is critical.
Implementation: We created a structure like CREATE TABLE sessions PARTITION BY HASH (user_id); CREATE TABLE sessions_p0 PARTITION OF sessions FOR VALUES WITH (modulus 4, remainder 0); .... This helped distribute the data load evenly across four partitions.
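
With hash partitioning, every remainder from 0 to modulus - 1 needs its own partition, otherwise rows that hash to a missing remainder have nowhere to go. A small sketch that generates the full set is shown here; the parent_pN naming and the function name are assumptions made for illustration.

```python
from __future__ import annotations


def hash_partition_ddl(parent: str, modulus: int) -> list[str]:
    """Build one CREATE TABLE statement per remainder (0 .. modulus - 1)."""
    if modulus < 1:
        raise ValueError("modulus must be at least 1")
    return [
        f"CREATE TABLE {parent}_p{remainder} PARTITION OF {parent} "
        f"FOR VALUES WITH (MODULUS {modulus}, REMAINDER {remainder});"
        for remainder in range(modulus)
    ]


# Four partitions, matching the sessions example.
for statement in hash_partition_ddl("sessions", 4):
    print(statement)
```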

While these types sound appealing, it's important not to overlook the additional overhead and management complexities that partitions bring.

Advantages of Partitioning and Its Real Impact

Although partitioning's biggest promise is scalability, its realization requires the right conditions. When implemented correctly, partitioning can indeed improve query performance and simplify the management of large datasets.

For instance, when I was working on a financial reporting system, there was a massive table storing monthly transaction data. Generating month-end reports on this table took hours. After implementing partitioning, we sped things up by creating a separate partition for each month. Reports now only scanned the relevant month's partition, reducing the processing time from hours to about 15-20 minutes. This demonstrated how powerful partitioning can be, especially in scenarios where data size directly impacts query performance.

💡 Performance Gain Example

In a manufacturing ERP system, the table holding daily production records had reached 200 million rows. Daily reporting queries took 45 minutes. After partitioning the table into daily partitions, the same queries dropped to an average of 5 minutes. This was made possible by the query planner scanning only the relevant day's partition.

However, the key point to note here is that the query must be optimized based on the partition key. If your query does not use the partition key, the database might still have to scan all partitions, and you won't achieve performance gains. This shows that partitioning is not always a magical solution. Proper query writing practices and data modeling are as important as, if not more important than, partitioning.

Furthermore, partitioning brings management overhead. New partitions have to be created, and old ones archived or deleted, on a regular schedule. If you don't automate these tasks correctly, management complexity increases over time.

Hidden Costs and Pitfalls of Partitioning

As they say, everything good has a cost, and partitioning is no exception. Behind the promised performance increase lie some costs and pitfalls that are easy to overlook. Knowing them will help you make a more informed decision before moving to partitioning.

In one project, we decided to manage a table holding session data for millions of users with date-based partitions. Initially, everything went smoothly, and queries sped up. However, after a few months, the number of partitions reached the hundreds, since each new day meant a new partition. Our database server started approaching its open-files limit. When I ran ls -l /proc/<pid>/fd/, I saw hundreds, even thousands, of file descriptors held open for these partitions. This happens because each partition is a separate table backed by its own files on disk (plus files for its indexes), so every partition a backend touches consumes operating system resources.

⚠️ Excessive Partition Count Issue

In PostgreSQL, a separate relation is created for each partition. A large number of partitions means many entries in the system catalog. This can increase query planning time. I've encountered tables with over 1000 partitions. In such cases, even deciding which partition to go to could take significant time for the query planner. The result? Instead of the initial performance increase, query times actually lengthened.
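
It is worth estimating the partition count before choosing a granularity. The sketch below is plain date arithmetic, with a hypothetical function name, and assumes partitions are aligned to calendar days or calendar months.

```python
from datetime import date, timedelta


def partition_count(start: date, end: date, granularity: str) -> int:
    """Partitions needed to cover the half-open range [start, end)."""
    if end <= start:
        raise ValueError("end must be after start")
    if granularity == "daily":
        return (end - start).days
    if granularity == "monthly":
        last = end - timedelta(days=1)  # last day that must be covered
        return (last.year - start.year) * 12 + (last.month - start.month) + 1
    raise ValueError(f"unknown granularity: {granularity}")


# Three years of data, 2022 through 2024 (2024 is a leap year).
print(partition_count(date(2022, 1, 1), date(2025, 1, 1), "daily"))    # 1096
print(partition_count(date(2022, 1, 1), date(2025, 1, 1), "monthly"))  # 36
```

Three years of daily partitions already passes the 1000-partition mark mentioned above, while monthly partitions over the same period stay at 36.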

Another pitfall is the performance degradation in INSERT and DELETE operations. When adding new data, the database has to route every row to the correct partition, which adds overhead to each insert. UPDATE and DELETE statements that do not filter on the partition key have to visit every partition. With DELETE in particular, removing old rows one by one instead of removing entire old partitions can significantly hurt performance. It is therefore generally faster to drop the whole partition than to delete old data row by row. In PostgreSQL this means DROP TABLE on the partition (optionally after ALTER TABLE ... DETACH PARTITION); MySQL uses ALTER TABLE ... DROP PARTITION.
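
Bulk removal is easy to script once partition names encode their period. The following sketch selects monthly partitions that fall outside a retention window and emits PostgreSQL DETACH and DROP statements for them. The parent_YYYY_MM naming scheme, the retention rule, and both function names are assumptions for illustration.

```python
from __future__ import annotations

import re
from datetime import date

# Assumed naming scheme: <parent>_YYYY_MM
_SUFFIX = re.compile(r"_(\d{4})_(\d{2})$")


def expired_partitions(names: list[str], today: date, keep_months: int) -> list[str]:
    """Return partitions older than the current month plus keep_months before it."""
    # Months are compared as a single running index: year * 12 + (month - 1).
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in names:
        match = _SUFFIX.search(name)
        if match is None:
            continue  # ignore names outside the assumed scheme (e.g. a default partition)
        index = int(match.group(1)) * 12 + (int(match.group(2)) - 1)
        if index < cutoff:
            expired.append(name)
    return sorted(expired)


def drop_statements(parent: str, names: list[str]) -> list[str]:
    """Emit DETACH followed by DROP for each partition."""
    statements = []
    for name in names:
        statements.append(f"ALTER TABLE {parent} DETACH PARTITION {name};")
        statements.append(f"DROP TABLE {name};")
    return statements


existing = ["orders_2023_11", "orders_2024_01", "orders_2024_05", "orders_default"]
old = expired_partitions(existing, today=date(2024, 6, 15), keep_months=6)
for statement in drop_statements("orders", old):
    print(statement)
```

Detaching first keeps the data available as a standalone table in case it should be archived before the final DROP TABLE.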

These experiences taught me that partitioning is not just about dividing a table; it also requires management strategies and careful planning.

The Technical Depth of Partitioning: Query Planner and Resource Management

To understand the impact of partitioning on performance, it's important to know how the database query planner works and how it interacts with partitioning. Partitioning gives the query planner the opportunity to apply an optimization technique called "partition pruning."

Partition pruning allows the query planner to determine which partitions cannot contain rows relevant to a query and to skip them. For example, in a query like SELECT * FROM orders WHERE order_date BETWEEN '2024-03-01' AND '2024-03-31';, if the table is divided into monthly partitions on order_date, the planner will only scan the orders_2024_03 partition. This significantly reduces the amount of data the query needs to read.

However, for this pruning to be effective, the column used in the query's WHERE clause must be compatible with the partition key. If your query has an expression like WHERE customer_id = 12345 AND order_date = '2024-03-15'; and the table is partitioned by order_date, the planner can prune based on order_date. But if the query is WHERE customer_name = 'John Doe'; and customer_name is not the partition key, the planner will have to scan all partitions.

ℹ️ PostgreSQL Partition Pruning Example

In a PostgreSQL database, let's say the sales table is partitioned monthly by the sale_date column.
-- Scans only the relevant partition
EXPLAIN ANALYZE
SELECT * FROM sales WHERE sale_date = '2024-04-10';

-- Scans all partitions (if customer_id is not the partition key)
EXPLAIN ANALYZE
SELECT * FROM sales WHERE customer_id = 98765;
The first query allows the planner to scan only the April 2024 partition. The second query has no filter on sale_date, so it scans every partition.
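
The pruning decision itself is an interval-overlap check. The sketch below is a simplified model of that idea, not PostgreSQL's planner code: each partition has an inclusive lower and exclusive upper bound, and a partition is kept only if the query's key range can overlap it. All names here are hypothetical.

```python
from __future__ import annotations

from datetime import date
from typing import NamedTuple


class Partition(NamedTuple):
    name: str
    lower: date  # inclusive
    upper: date  # exclusive


def prune(partitions: list[Partition], lo: date | None = None,
          hi: date | None = None) -> list[str]:
    """Return partitions that may hold rows with lo <= key <= hi.

    Passing neither bound models a query with no filter on the
    partition key, so nothing can be pruned.
    """
    kept = []
    for p in partitions:
        if hi is not None and p.lower > hi:
            continue  # partition starts after the queried range ends
        if lo is not None and p.upper <= lo:
            continue  # partition ends before the queried range starts
        kept.append(p.name)
    return kept


sales = [
    Partition("sales_2024_02", date(2024, 2, 1), date(2024, 3, 1)),
    Partition("sales_2024_03", date(2024, 3, 1), date(2024, 4, 1)),
    Partition("sales_2024_04", date(2024, 4, 1), date(2024, 5, 1)),
]

print(prune(sales, date(2024, 4, 10), date(2024, 4, 10)))  # equality on the key
print(prune(sales))                                        # no filter on the key
```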

In addition, partitions also affect file systems at the operating system level. Numerous partitions can lead to the use of more inodes on the file system and increased disk I/O. Especially operations like CREATE TABLE ... PARTITION OF ... require the creation of new files not only at the database level but also at the file system level. This can lead to performance bottlenecks, especially on storage systems with low I/O capacity.

When to Use Partitioning? (And When Not To)

We've seen that partitioning isn't a panacea. So, when should we use this powerful tool? In which scenarios does it create real value, and in which situations should we stay away from it?

Use Cases for Partitioning:

Very Large Tables: When table size reaches hundreds of millions or billions of rows, and query performance is significantly affected.
Date-Based Data Management: Scenarios where you add, delete, or archive data daily, weekly, or monthly. Such as logging, monitoring data, transaction records.
Data Lifecycle Management: When old data needs to be moved to less accessible storage or deleted. Partitions make this process easier and more performant.
Query Optimization: Situations where queries are generally filtered based on a specific partition key. For example, customer data queried by geographical region.

Situations to Avoid Partitioning:

Small or Medium-Sized Tables: If the table size is at a manageable level (e.g., a few million rows), partitioning can introduce extra cost and complexity.
Unbalanced Data Distribution: If the data distribution based on the partition key is not balanced, some partitions may become excessively full while others remain empty. This can reduce performance instead of increasing it.
Complex or Frequently Changing Partition Key: If the partition key changes constantly or is complex, it makes the query planner's job harder and reduces the pruning effect.
Very Frequent INSERT and UPDATE Operations: If rows are constantly updated in a way that changes their partition key value, each such update has to move the row to a different partition, which can lead to performance issues.
Desire to Avoid Management Complexity: Partition management (creating new partitions, deleting old ones) requires automation. If you don't want to set up or cannot manage this automation, it might be better to stay away from partitioning.
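
For the unbalanced-distribution case in the list above, one rough way to quantify skew is to compare the largest partition against the average. This is a sketch with a hypothetical function name; the per-partition row counts are assumed to come from your own monitoring or catalog queries.

```python
from __future__ import annotations


def skew_ratio(row_counts: dict[str, int]) -> float:
    """Largest partition's row count divided by the mean row count.

    1.0 means perfectly balanced; larger values mean one partition
    carries a disproportionate share of the data.
    """
    if not row_counts:
        raise ValueError("row_counts must not be empty")
    mean = sum(row_counts.values()) / len(row_counts)
    if mean == 0:
        return 0.0  # all partitions are empty
    return max(row_counts.values()) / mean


# Made-up counts for illustration only.
print(skew_ratio({"customers_us": 900, "customers_eu": 50, "customers_apac": 50}))
```

What counts as too much skew depends on the workload; the ratio is only a starting point for a closer look.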

🔥 Result of Incorrect Partition Implementation

In one project, we made the mistake of creating a separate partition for each customer. Since the number of customers was in the millions, the table had millions of partitions. This slowed down the database server so much that even basic queries became impossible to run. Eventually, we had to merge all partitions back into a single table. This was a live example of how destructive partitioning can be when implemented incorrectly.

Remember, partitioning is a tool. When used correctly, it can work wonders, but when used incorrectly, it can lead to serious problems. Therefore, before transitioning to partitioning, carefully evaluate your data, query patterns, and management capacity.

Alternatives and Future Perspectives

We've seen that partitioning isn't always the best solution. So, what other methods are there for dealing with large datasets? And what can be expected in this area in the future?

Alternative Scalability Strategies:

Index Optimization: Indexing solves a different problem than partitioning, but a proper indexing strategy can significantly improve query performance and is a good first step. Start by creating appropriate indexes on the columns your queries filter on. More specialized index types like PostgreSQL's BRIN indexes can also be effective for large tables whose rows are stored roughly in the order of the indexed column, such as append-only time-series data.
Database Sharding: Partitioning is about dividing a table on a single database server. Sharding, on the other hand, is about distributing the database horizontally across multiple servers. This is a more powerful solution for both read and write scalability, but it is much more complex to manage.
Columnar Databases: Columnar databases like Snowflake, BigQuery, or ClickHouse are optimized for analytical queries. Unlike row-based databases, these databases store data column by column and generally offer higher performance in large data analysis.
Data Archiving and Cleanup: Regular data archiving and policies for cleaning up unnecessary data are among the simplest ways to maintain performance by keeping table sizes under control.

Future Perspectives:

As data volume increases, scalability will always remain an important issue. Artificial intelligence and machine learning can make database management smarter. For example, AI-powered tools can offer automatic indexing suggestions, optimize query plans, or automatically adjust partition strategies.

Furthermore, distributed systems and cloud-based database services (managed databases) are becoming more accessible for solving scalability issues. These services abstract away infrastructure management, allowing developers and system administrators to focus more on the data.

ℹ️ AI-Assisted Database Management

In my own side project, I was having an issue with PostgreSQL WAL (Write-Ahead Logging) files consuming disk space. Instead of manually cleaning the pg_wal directory, I wrote a small script and scheduled it with a systemd timer. However, in the future, I imagine these routine maintenance tasks being managed automatically by AI. The database could "learn" to monitor and optimize its own performance.

In conclusion, partitioning is a powerful tool, but it's not always the best solution. It's important to understand the alternatives, evaluate the trade-offs, and choose the strategy that best suits your data needs.
