FastAPI & PostgreSQL Sharding: A Step-by-Step Guide (Part 1) - Theory

In this article series, I'll share my experience with database sharding using PostgreSQL and FastAPI.

This series consists of two articles. In the first one, I'll cover the theoretical concepts behind sharding, including consistent hashing and some common pitfalls of distributed systems. In the second article, I'll showcase a practical FastAPI application that works with PostgreSQL shards.

The purpose of this article is to provide a basic understanding of how distributed databases work and the complexities they introduce into the system.

What is sharding?

Sharding, or horizontal database partitioning, is the process of distributing your system's data across multiple physical servers (called shards) that together form a cluster.

Do we really need sharding?

Before we start, I should say that I haven't worked with sharding in production systems (and perhaps you won't have to either), and that's totally fine, because in my opinion this optimization technique is an extreme measure. I recommend avoiding sharding in production applications, since there are other effective optimization mechanisms such as indexing, partitioning, replication, denormalization, and query optimization.

Sharding is really required when you're dealing with billions of rows in a single table and can no longer optimize it on one physical server. At that point, you start spawning multiple database servers and grouping them into a cluster.

With horizontal partitioning, you can distribute the data across multiple smaller, cheaper servers (which is not really possible when you have a single big instance), and this makes your system more resilient as well.

Sharding can also help with geographical distribution: for example, user data from Europe can live in a European shard, and user data from the US can live in a US shard, reducing latency and improving compliance with regional data regulations.

Hashing and consistent hashing

We need to understand how consistent hashing works before moving forward. Regular hashing uses a hash function that accepts an arbitrary value and returns an integer.

Example: hash_function("python") = 2393

Now imagine you have 4 database shards and you want to add a new user to one of them; we can use this integer to locate the right DB instance. Since we have a fixed number of shards, we take the hash modulo the shard count, which in our case is 4.

shard = hash_function(user_id) % 4

This works fine until a shard is added or removed. Let's add a new shard, so now we have 5 shards in total.

shard = hash_function(user_id) % 5

Now most of the users across our original 4 shards are effectively lost: the hash value stays the same, but the modulo result changes, so the same user ID now points to a different shard and you can no longer locate the user on the instance where it was originally placed.
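To make this concrete, here's a small runnable sketch (the user IDs and the MD5-based hash function are just illustrative assumptions, not part of any real setup) showing how many keys change shards when we go from 4 to 5 shards with naive modulo routing:

```python
# Minimal sketch of naive modulo sharding breaking when the shard count changes.
import hashlib


def hash_function(key: str) -> int:
    # Use a stable hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


user_ids = [f"user-{i}" for i in range(10)]

for user_id in user_ids:
    shard_before = hash_function(user_id) % 4  # 4 shards
    shard_after = hash_function(user_id) % 5   # a 5th shard was added
    status = "moved" if shard_before != shard_after else "stayed"
    print(f"{user_id}: shard {shard_before} -> {shard_after} ({status})")
```

Running it shows that most keys land on a different shard after the change, even though their hash values never changed.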

Consistent hashing was invented to solve this problem. In this method we use a circle (usually called a ring), where we place our database servers at different positions around the ring.

(Image: a consistent hashing ring where each server owns a portion of the hash space.)

Let's look at this example:

```
[0 - - - - - - - - - - - - - - - - - - - 360)
 ^ server A      ^ server B      ^ server C
```

User ID 1 hashes between A and B, so we store it on B.

User ID 2 hashes between B and C, so we store it on C.

Now we add server D

```
[0 - - - - - - - - - - - - - - - - - - - 360)
 ^ server A    ^ server D  ^ server B  ^ server C
```

Only the keys between A and D (which previously lived on B) move to D. Everything else remains on the same servers.

User ID 1 originally hashed between A and B and was stored on B, but if its hash actually lies between A and D, it now maps to D.

User ID 2 still hashes between B and C, so it stays on C.

So some resharding still happens, but it is minimal: only the keys belonging to the neighboring segment are reallocated, instead of having to rehash and redistribute all data across all servers as with traditional hashing.
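Here's a minimal consistent-hashing ring sketch in Python. The shard names and the MD5-based ring hash are assumptions for illustration; production-grade rings usually also place each server at several "virtual node" positions to spread load more evenly.

```python
# Minimal consistent-hashing ring: servers sit at hash positions on a ring,
# and each key belongs to the first server clockwise from the key's hash.
import bisect
import hashlib


def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, servers: list[str]):
        # Each server is placed on the ring at the position of its hash.
        self._ring = sorted((ring_hash(s), s) for s in servers)

    def add_server(self, server: str) -> None:
        bisect.insort(self._ring, (ring_hash(server), server))

    def get_server(self, key: str) -> str:
        # The first server position >= the key's hash owns the key;
        # wrap around to the first server if we fall off the end.
        positions = [pos for pos, _ in self._ring]
        idx = bisect.bisect_right(positions, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.get_server("user-1"))  # e.g. shard-b
ring.add_server("shard-d")        # only keys in one ring segment move to shard-d
print(ring.get_server("user-1"))
```

Adding `shard-d` only affects keys whose hashes fall in the segment it takes over; every other key keeps resolving to the same server as before.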

Loss of traditional ACID transactions across shards

When you split data across multiple shards, the familiar ACID transaction guarantees are no longer available across shard boundaries. Operations that need to update data on multiple shards can't rely on the database's built-in transaction guarantees.

You either need to:

- Implement distributed transactions (with two-phase commit or custom coordination), which adds extra complexity and slows down the transaction commit.
- Move toward eventual consistency.
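Here's a minimal sketch of the underlying problem, using two in-memory SQLite databases as stand-ins for two shards (the table and user names are made up for illustration): once the first shard has committed, a failure on the second shard can't be rolled back by the database itself, so the application has to coordinate or compensate on its own.

```python
# Two independent databases playing the role of two shards.
import sqlite3

shard_a = sqlite3.connect(":memory:")
shard_b = sqlite3.connect(":memory:")
for shard in (shard_a, shard_b):
    shard.execute("CREATE TABLE balances (user_id TEXT PRIMARY KEY, amount INTEGER)")
shard_a.execute("INSERT INTO balances VALUES ('alice', 100)")
shard_b.execute("INSERT INTO balances VALUES ('bob', 0)")
shard_a.commit()
shard_b.commit()

try:
    shard_a.execute("UPDATE balances SET amount = amount - 10 WHERE user_id = 'alice'")
    shard_a.commit()  # the debit is now durable on shard A

    # Simulated failure: shard B becomes unreachable before the credit lands.
    raise ConnectionError("shard B is unreachable")
except ConnectionError:
    # There is no cross-shard ROLLBACK; the application must issue a
    # compensating write on shard A (or keep retrying shard B until it succeeds).
    shard_a.execute("UPDATE balances SET amount = amount + 10 WHERE user_id = 'alice'")
    shard_a.commit()

print(shard_a.execute("SELECT amount FROM balances WHERE user_id = 'alice'").fetchone())  # (100,)
```

The compensation step is exactly the kind of logic a single-node database would have handled for you with one `ROLLBACK`.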

Increased application complexity

With sharding, your application's query logic becomes harder:

- You might need to handle cross-shard joins when a single query needs data from different shards.
- You have to manage shard routing logic and fallback logic when rebalancing happens.
- Deployments and backups become more complex and expensive.

In many cases, scaling vertically (and applying the optimization techniques mentioned earlier) might be cheaper and simpler.

Conclusion

Now you know what consistent hashing and horizontal database partitioning (also known as sharding) are. We've also discussed the complexity of horizontally scaled systems and some of the pitfalls they introduce.

In part two, we'll explore a practical implementation using FastAPI and PostgreSQL.
