I thought I did… But when I started this course in my Master’s in Computer Science, I realized I was completely wrong. I used to think it was something simple and straightforward: just take my centralized database and replicate it across multiple nodes. Okay, that’s one possible approach, but it goes far beyond that.
Let’s say we adopt replication. How can we guarantee data consistency? When an INSERT
or UPDATE
is received, do I need to propagate it to all nodes? Does it make sense to keep exactly the same data on every node? I think you can already see the kind of challenge we’re facing. The truth is: a distributed database is much more than simple replication, it involves fragmentation, synchronization, fault tolerance, and optimization.
The most common techniques are:
- - Horizontal fragmentation (by rows/tuples)
- - Vertical fragmentation (by columns)
- - Hybrid fragmentation, which combines both
In horizontal fragmentation, for example, imagine a sales table.
Suppose you have 3 customers, and every month you need to generate reports for each one. Your query might look like:
SELECT * FROM customer_sales WHERE customer_id = x;
In this case, it makes sense to split the customer_sales table into three parts, one for each customer, and distribute them across different nodes.
If there are related tables that use JOINs
with customer_sales, we apply derived horizontal fragmentation to maintain integrity and query performance.
In vertical fragmentation, the focus is on the columns.
Suppose your application only needs some specific columns from the table. You can divide the table into two fragments: the first containing customer_id, date, and total_value, and the second containing the remaining columns. This improves performance by reducing the amount of data transferred and processed during queries.
The criteria for fragment distribution can vary, for example:
latency, cloud cost, geographic location of users, among others.
With this structure, we can distribute queries, reduce bottlenecks, and execute operations in parallel, making better use of available resources.
Of course, not all database management systems support this type of architecture. When they do, the DBMS is responsible for managing fragmentation, routing queries, and reassembling the data.
At the end of the course, we did a case study using Hive, which allowed us to see in practice how fragmentation and distributed querying actually work.
Despite the high complexity and maintenance effort, some scenarios require a distributed database or similar solutions.
Top comments (0)