
nelyanne

Unlearning what you know about relational databases to unlock the power of Redshift

My first encounter with columnar databases was BigQuery on GCP. At first I wasn't sure what made them special: you still create tables with a strictly defined row structure, and you still query them with familiar SQL syntax. Under the hood, however, these databases are architected in a completely different way, and that changes how you should work with them. After gaining practical experience with Redshift, I understood that I had to flip a few mental switches to use it effectively. Here's a short list of important differences between Redshift and traditional databases, and how they affect optimal schema design.

No primary keys

In relational databases, defining a primary key is usually one of the first steps when designing a new table. Redshift lets you declare primary keys (and other constraints), but it does not enforce them: they are informational hints for the query planner. What does this mean in practice? There is no built-in way to enforce the uniqueness of values in your columns. You need to be very careful in your ingestion process not to insert duplicates. It's also good to have validation rules in place that routinely check for duplicates, plus a remediation plan for when they are detected.
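As an illustration, here is a minimal sketch of such a validation step, with hypothetical table and column names:

```sql
-- Hypothetical example: find order_id values that appear more than once.
-- Since Redshift does not enforce uniqueness, a check like this can run
-- as a routine step after each load.
SELECT order_id, COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- One possible remediation pattern: keep a single row per order_id,
-- preferring the most recently updated copy.
CREATE TABLE orders_deduped AS
SELECT order_id, customer_id, update_date  -- list the remaining columns too
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY update_date DESC) AS rn
    FROM orders
) t
WHERE rn = 1;
```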

No foreign keys

In relational databases, joining even large tables can be quick and easy, thanks to indexes on the join columns (typically created alongside foreign keys). Redshift has neither: foreign keys can be declared but are not enforced, and there are no indexes at all. As a consequence, joins can become painfully slow, especially if the tables on the two sides of the join are distributed by different columns. Consider running Redshift on a 6-node cluster and joining the table ORDERS (distributed by customer ID) with the table DELIVERIES (distributed by location). Each of the 6 chunks of the ORDERS table will have to be matched against the 6 pieces of the DELIVERIES table, which requires a lot of inter-node communication.
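You can see this cost directly in the query plan. A minimal sketch, assuming hypothetical ORDERS and DELIVERIES tables joined on an order ID column:

```sql
-- Sketch: inspect how Redshift plans a join between two tables that are
-- distributed on different columns.
EXPLAIN
SELECT o.order_id, d.delivered_at
FROM orders o
JOIN deliveries d ON d.order_id = o.order_id;

-- In the resulting plan, join steps labelled DS_BCAST_INNER or DS_DIST_BOTH
-- mean rows are broadcast or redistributed across nodes at query time;
-- DS_DIST_NONE means the join can run locally on each node.
```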

How can you join tables effectively if you cannot rely on foreign keys? (A schema sketch follows the list below.)

  • Distribute both tables used in the join by the same column (the one used in the join).
  • For small tables (like static configuration, dictionary tables), use distribution style ALL, so each Redshift node will hold a copy of the entire table.
  • If the options above are not possible:
    • Do the JOIN before the ingestion step (for example, during extraction from the source database) and store the rows in a denormalized format.
    • Define the join logic in a materialized view.
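As an illustration of the first two options, here is a minimal sketch with hypothetical table definitions (the names and column types are mine, not a prescribed schema):

```sql
-- Option 1: co-locate joined tables by distributing both on the join column.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
)
DISTKEY (customer_id);

CREATE TABLE customers (
    customer_id BIGINT,
    country     VARCHAR(2)
)
DISTKEY (customer_id);

-- Option 2: replicate a small dictionary table to every node.
CREATE TABLE country_names (
    country VARCHAR(2),
    name    VARCHAR(64)
)
DISTSTYLE ALL;
```

With both ORDERS and CUSTOMERS distributed on customer_id, a join on that column stays local to each node (DS_DIST_NONE in the plan), and a join against the replicated dictionary table never needs to move data at all.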

Only one SORTKEY and DISTKEY per table

In a relational database, adding another index is often the first option considered when performance issues appear (which doesn't mean it's always the best solution). As a result, you may end up with a single table where most of the columns are indexed, to support queries filtering on different columns.

In Redshift, you don't have this luxury: each table has exactly one sort key (which may be compound, spanning several columns, but it mainly helps queries that filter on its leading columns) and one distribution key used to spread the table between nodes. You have to be mindful of your choice! Understand your query patterns and design the sort key and distribution key accordingly. If the same table needs to be queried using different filtering criteria, you can create a materialized view that uses a different distribution key than the source table.
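Here is a minimal sketch of that materialized-view idea, assuming the hypothetical ORDERS table from earlier, distributed by customer_id:

```sql
-- Sketch: same data, different distribution, to serve a different query
-- pattern. Redshift materialized views accept their own DISTKEY and SORTKEY.
CREATE MATERIALIZED VIEW orders_by_date
DISTKEY (order_date)
SORTKEY (order_date)
AS
SELECT order_id, customer_id, order_date
FROM orders;

-- Refresh periodically (or rely on auto refresh, where supported):
REFRESH MATERIALIZED VIEW orders_by_date;
```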

Impact of SORT KEY choice

Let's say that you decided the records in your table will be sorted by update_date. As a consequence (see the sketch after this list):

  • Redshift will work great if you want to query all rows last updated within a certain time frame and calculate aggregated statistics for them.
  • Redshift will not work so well when you want to find a row with a specific ID and you don't know when it was last updated. (This task would be trivial in a relational database with an index on ID.)
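To make the contrast concrete, a sketch against a hypothetical ORDERS table whose sort key is update_date (the total column is made up for the example):

```sql
-- Fast: the sort key lets Redshift skip blocks whose update_date range
-- falls entirely outside the filter (via zone maps).
SELECT COUNT(*), SUM(total)
FROM orders
WHERE update_date >= '2024-01-01'
  AND update_date <  '2024-02-01';

-- Slower: a point lookup on a non-sort-key column; with no index to use,
-- Redshift has to scan every block of the table.
SELECT *
FROM orders
WHERE order_id = 12345;
```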

Impact of DIST KEY choice

Let's say that you want to distribute the data in your ORDERS table by customer_id. As a result, all orders placed by a single customer will be stored on the same Redshift node.

  • If you want to calculate the number of orders per customer, each part of the aggregation will take place locally on a single node. Calculations will be spread between the nodes, fully utilizing your cluster's resources.
  • If you want to calculate statistics for a single customer, Redshift will only be able to use the power of one node, which can result in sub-optimal performance. Since a distribution key can only be a single column, one workaround is to distribute on a derived column that combines customer_id with another attribute (for example, the date of the order).
  • If some customers have significantly more orders than others, your data can become skewed and the load on the nodes will be uneven (some nodes will store far more rows than others). Redshift will not silently rebalance an explicitly chosen distribution key, so it's worth monitoring for skew (as shown below) and, if it becomes a problem, switching to a higher-cardinality key or to DISTSTYLE AUTO, which lets Redshift adjust the distribution itself.
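One way to monitor skew is the SVV_TABLE_INFO system view; a minimal sketch:

```sql
-- Sketch: check data distribution quality for the hypothetical ORDERS table.
-- skew_rows is the ratio of rows on the slice with the most rows to rows on
-- the slice with the fewest; values far above 1.0 indicate uneven distribution.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'orders';
```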

Conclusion

This list is by no means exhaustive, especially as Redshift is still an evolving technology, with new features released every quarter. Before you can fully enjoy the benefits it offers (scalability, cost-effectiveness, high performance), you need to let go of some of your old routines and embrace its uniqueness.
