Alex Antra

Posted on Nov 14, 2019 • Edited on Nov 20, 2019

The R.A.G (Redshift Analyst Guide): Distribution

#data #analytics #sql #aws

Welcome to the R.A.G, a guide about Amazon's Redshift Database written for the Analyst's out there in the world who use it.

Previously on the R.A.G....

Article No Longer Available

As mentioned in my previous two articles, a table may be spread across multiple Compute Nodes and, is in part, one of the things that can result in Redshift returning results lightning quick. We call this, Distribution.

Distribution

In other data warehouses you can speed up table reads by defining an index on a table.

Just like an index in a book, if you know what page in a book a certain subject is on, you can jump right to it saving you from scanning through the whole book.

Redshift takes a more physical approach to this. Rather than index at the beginning of a book outlining what subjects start and end where, instead each subject is in it's own book.

You achieve this when building tables by assigning a Distribution Style and Key.
This tells Redshift how to spread your data across it's physical nodes.

So if you are creating tables as part of a piece of analysis, you can have a hand in distributing the table in a way that will aid your future querying efforts.

However, if you are querying a pre-existing table, it's distribution style may actually work work against the query you are trying to run leaving you with two options (1) change your query to work with how a table is distributed, if possible, or (2) suck it up and deal.

Running a query that works against a specific tables distribution doesn't necessarily mean the query will be slow, after all you still have a heap of processors crunching data.

Diagram of a Redshift Cluster (to remind you)

Picking a Distribution Style

There are four distribution styles you can specify when building a table:

All

What does it do?
All copies the entire table to the first slice in every node so in an environment with four nodes, the table would be copied four times and obviously takes up 4x the amount of space, having a copy of the data on every node means faster queries, especially for joins, but has the inverse effect in that loading, updating, deleting or inserting takes longer, and remember Redshift is OLAP so these functions already take longer than an OLTP system.

When should I use it?
All should only be used on a static table that rarely change but is often referenced in a join or a cte as returning results will happen quickly but writing new data will take longer. Great for a large lookup table.

When shouldn't I use it?
You should avoid all style distribution in the following scenarios:

Small tables. About 500k-1m is the sweet spot for ALL, tables smaller than this will be forcibly re-distributed even.
Tables that are going to be used once
Tables that are going to updated a lot
Tables over 10 million rows, you really don't want to have explain to a DBA why a 10m row table has been copied onto every node...

How do I do this?

create table    schema.tablename
diststyle       All
sortkey         (<columnname>,<columnname>)
as
--
select
from
where
;

Auto

What does it do?
Auto, the new default for Redshift, and will manage a tables distribution automatically, over the life of the table, but only between All and Even, it can't do Key.
Say you have a small table Redshift will set it as All, however once it passes a certain threshold Redshift will re-distribute the table as Even automatically.

When should I use it?
When you don't need a key distribution and you can't figure out between all or even.
When shouldn't I use it?
When you later intend to join to that table, because in theory you should have done a key style.

How do I do this?

create table    schema.tablename
diststyle       Auto
sortkey         (<columnname>,<columnname>)
as
--
select
from
where
;

Even

What does it do?
Even distributes ever row of data in a round robin fashion across all slices, this makes for the most even distribution across the nodes.

When should I use it?
Fundamentally you should use this style when the other two don't suit. Even works when the table is highly unique as in there is no logical way to distribute the table on a key because every row is different from the rest.

Tables that are going to have an external program, like Tableau, read off of them without filtering would benefit from this style.

When shouldn't I use it?
When you later intend to join to that table, because in theory you should have done a key style.

How do I do this?

create table    schema.tablename
diststyle       Even
sortkey         (<columnname>,<columnname>)
as
--
select
from
where
;

Key

What does it do?
After defining a column as the distribution key, Redshift will store rows with the same key on the same node slice.

If possible, they will go to the same slice that other data from other tables that share the same dist key are located, though that's not always possible.

This is called collocation. If there are more dist keys than nodes, then failing any collocation, values with the same will be distributed out to nodes round robin style.

When should I use it?
You should use this when your table has multiple rows to a single id and you intend to query or join using the dist key at a later date.

When shouldn't I use it?
You should avoid key style distribution in the following scenarios:

When the key is heavily skewed, for example one keyid has a thousand rows but another has a million.
For the same reason as above, avoid distkey with null or blank values as they will all end up on the same node, resulting in a bad skew.
When HOW you will act on this table at a later date won't benefit from a key distribution. For example if you don't intend to query the table mentioning the dist key in the where clause or in a join, then don't bother.

How do I do this?

create table    schema.tablename
diststyle       Key
distkey         (<columnname>) --can only have 1
sortkey         (<columnname>,<columnname>)
as
--
select
from
where
;

So what should I pick?

The key thing to remember is that distribution determines where the data gets physically stored across your available nodes and so only queries that align to where the data is actually stored will happen lightning fast.

There isn't a silver bullet with Redshift, there will always be a query that you have to run that works against how a table is distributed and there is often no way around that. The hope is that through careful design and constant revision that the queries that work with Redshifts distribution occur more often than against.

So when trying to figure out what to do, you need to consider the following.

Only pick Key if it's actually going to get used

Because you are limited to only one key, you need to be sure that the table is going to get queried in a way that benefits that key.

So if you are building a staging table, you just need to tune it's distribution towards the next step of the process. Say the next thing to happen to the table is it's going to get joined on X column, then that's your dist key.

However if you are building a table for wider use, like a dim or a fact table then you need to really understand how that table is going to be used by the end user. No use defining a dist key that only get used in 30% of all queries.

Don't be afraid to give something a shot and then rebuild the table in the future if you find the dist key needs changing.

Try and avoid a bad Skew

What will really throw a spanner in the works is if your data has a bad skew. Say for example I had a table of customer orders and distributed my orders based on customer (customer as dist key) :

Customer One: 3,095 Orders
Customer Two: 987 Orders
Customer Three: 87,985 Orders
Customer Four: 48 Orders

Because the Leader Node won't return it's results until all Compute Nodes have finished,in the above scenario nodes one, two and four would finish way faster than Three and Three would act as a bottleneck.

While customer sounded like an ideal dist key, due to a bad skew it didn't work out. You may need to run some checks to see if the key you want to use isn't skewed too heavily.

This is also why you want to avoid setting columns that contain null or blank values as your dist key, all of those values will end up on the same node.

Redshift can tell you how effective your Dist Key is

If you run the below query, after you have built your table, and look for the column 'row_skew_ratio', the closer this is to 1 the better. A ratio of 3 would mean that their is a 3:1 difference between the smallest and largest distkey cohort.

select          i.schema        as schema_location, 
                i.table         as table_name,
                i.encoded       as are_columns_encoded,
                i.diststyle     as distyle_and_key,
                i.sortkey1      as first_sortkey,
                i.sortkey1_enc  as sortkey_compression,
                i.sortkey_num   as no_sort_keys,
                i.skew_sortkey1 as sortkey_skew_ratio,
                i.size          as size_in_blocks_mb,
                i.tbl_rows      as total_rows,
                i.skew_rows     as row_skew_ratio,
                i.pct_used      as percent_space_used,
                i.unsorted      as percent_unsorted,
                i.stats_off     as stats_needed
from            svv_table_info i
where           i.schema = '<schemaname>'
and             i.table = '<tablename>'
limit           50;

Distribution is also about processing power

If you can't think of a good enough dist key, then that's fine an even distribution is not the end of the world.
Remember that each nodes is also it's own processor. In the best case scenario your data is evenly distributed, both in volume and variety, across, lets say eight nodes, that means the data set is being divided up and worked on by eight nodes at once. Because Redshift is distributed processing, I've thrown horribly inefficient queries at it and still had results in seconds. Sometimes It's not worth investing the time and effort into optimizing a table or query as the sheer processing power will make your efforts null and void.

header image drawn by me

Who am I?

You should read....

Article No Longer Available

Top comments (1)

Adam Tokarski • May 22 '20

Hi, awesome series, it helps me a lot to understand bit after bit.

I have one question though, you wrote "Because the Leader Node won't return it's results until all Compute Nodes have finished,in the above scenario nodes one, two and four would finish way faster than Three and Three would act as a bottleneck." In which case? If I understood correctly, the leader node should cut off these overloaded nodes in case when I'd have query with filter for, say, only Customer Two.

Would it be a bottleneck when I had an aggregation for all customers then?