Alex Antra


The R.A.G (Redshift Analyst Guide): Things to avoid / Best Practice

Welcome to the R.A.G., a guide about Amazon's Redshift database written for the analysts out there in the world who use it.


Previously on the R.A.G....


Below is a collection of do's, don'ts, and 'be wary of's that we have come across in our travels.

None of these are hard and fast; sometimes you have no choice but to do all of the don'ts.


Things to Avoid

A series of things NOT to do, mainly sourced from official documentation.

❌ Don't Compress your first sort key

Compressing your first compound sort key runs the risk of actually making your query run slower in some scenarios.

From AWS: "If sort key columns are compressed much more highly than other columns in the same query, range-restricted scans might perform poorly."

As a general rule of thumb, leave your first sort key column unencoded (RAW). Feel free to compress the other sort keys.

Create table test
    (
    customer_id     char(36) encode  ZSTD,
    createddate     timestamp encode RAW,   -- first sort key: left unencoded
    customer_type   char(50) encode  ZSTD,  -- second sort key: safe to compress
    amount_due      decimal(10,2) encode AZ64
    )
diststyle   key
distkey     (customer_id)
sortkey     (createddate, customer_type);

Source: AWS Doco
Source: Sort Key Investigation

❌ Don't select * unless it's a small query

Redshift has a dedicated resource stream for handling small queries, so this rule doesn't apply if you just want to do a quick select * from table limit 50, as that query will be given its own resources.

HOWEVER, for everything else you should never do select * from unless you absolutely NEED every column. Redshift stores data by column, so the fewer columns you pull in, the faster your query runs.
Source: AWS Best Practice

❌ Don't assign a column with NULL values as a Dist key

If the column you set as your dist key has a lot of NULL values, all of the NULLs will end up on one slice, which can badly skew your data distribution. You can check for skew with a query like the sketch below.
Source: AWS Doco
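
You can sanity-check distribution with Redshift's real svv_table_info system view, whose skew_rows column reports the ratio of rows on the fullest slice to the emptiest (the table name below is just a placeholder):

select "table", diststyle, skew_rows
from svv_table_info
where "table" = 'test';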

❌ Don't join the same table multiple times in the same query

Referencing the same table multiple times in a query can come at a high performance cost. Explore other options, like breaking the query down into smaller data sets or using a CASE expression, as in the sketch below.
Source: AWS Best Practice
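
For instance, instead of self-joining a sales table once per channel, one scan with conditional aggregation does the same job (the table and columns here are hypothetical):

-- Two passes over sales: one sub-query per channel
select w.custid, w.total as web_sales, s.total as store_sales
from (select custid, sum(amount) as total from sales where channel = 'web' group by custid) w
join (select custid, sum(amount) as total from sales where channel = 'store' group by custid) s
  on w.custid = s.custid;

-- One pass over sales: a CASE expression per channel
select custid,
       sum(case when channel = 'web'   then amount else 0 end) as web_sales,
       sum(case when channel = 'store' then amount else 0 end) as store_sales
from sales
group by custid;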

❌ Don't use sub-queries for large complex operations

Avoid using sub-queries on data sets that have multiple conditions and are large in size. Sub-queries perform best over joins when they reduce to a simple IN clause returning a small number of rows.

The example below shows a good use of a sub-query over a join. The join version is worse because one data set is huge and the other is small, and joining them forces data to be redistributed across the nodes.
Source: AWS Best Practice
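
A sketch of the pattern, loosely based on AWS's TICKIT sample schema (treat the exact tables and dates as illustrative):

-- Good: listing is only used as a filter, and the IN list stays small
select sum(sales.qtysold)
from sales
where sales.listid in
      (select listid from listing where listtime > '2008-12-26');

-- Worse: the join can force the large sales table to be redistributed
select sum(sales.qtysold)
from sales
join listing on sales.listid = listing.listid
where listing.listtime > '2008-12-26';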

❌ Don't use highly unique columns as a Dist Key

For example, using a timestamp for a dist key would be bad.
If we used a timestamp as our dist key, second-level granularity alone could produce up to 86,400 unique dist key values PER day. This would vastly reduce the benefits of having a dist key.
Source: AWS Doco

❌ Don't use UNION, instead use UNION ALL

When you use UNION, Redshift tries to remove any duplicate rows, so depending on the size of your data the performance overhead could be huge. Use UNION ALL instead, and if you need to remove duplicate rows look at other methods, like the row_number() sketch below.

UNION is believed to perform ~150% worse than UNION ALL.
Source: Investigation on GitHub
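
A minimal sketch of the row_number() approach, assuming a hypothetical id column that defines what counts as a duplicate:

-- Cheap combine: no de-duplication cost up front
create temp table combined as
select id, payload from table_a
union all
select id, payload from table_b;

-- Keep one row per id; every rn > 1 is a duplicate
create temp table deduped as
select id, payload
from (select id, payload,
             row_number() over (partition by id order by id) as rn
      from combined) t
where rn = 1;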

❌ Don't use an Interleaved Sort key unless you 100% know what you're doing

Interleaved sort keys are complicated; only use them if you know what you're doing. By default, use compound sort keys.

A poorly implemented interleaved sort key can result in very poor query return times and long write / update / vacuum wait times.
Source: AWS Doco

❌ Don't use Dist Style All on very large or small tables.

About 500k-1m rows is the sweet spot for Dist Style ALL; remember the table gets copied to every node.

❌ Don't use LZO, when you can use ZSTD or AZ64

LZO's best-of-all-worlds compression has been superseded by ZSTD and AZ64, which do a better job. Use AZ64 on your numeric, date, and time columns, and ZSTD on the rest.


Things to Be Wary Of

A series of things to be aware of, most of them have come from my own experience.

🤔 Redshift Inserts columns in the order you specify - even if they don't line up

If you build a table with the columns in this order:

  • Customer_id
  • Createddate
  • Customer_type

But your insert statement is in this order:

  • Customer_id
  • Customer_type
  • Createddate

Redshift will attempt to insert the data in that order, so it will try to put customer_type data into the createddate column. If the data types don't match, Redshift will throw an error, but if they do match, you won't notice. Listing the target columns explicitly, as in the sketch below, avoids this entirely.
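
A minimal sketch, reusing the test table from earlier (the staging table is hypothetical):

-- The column list and the select list line up explicitly,
-- so the table's column order no longer matters
insert into test (customer_id, customer_type, createddate)
select customer_id, customer_type, createddate
from staging_customers;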

🤔 Comparison Operators > LIKE > SIMILAR TO > REGEX/POSIX

Applying logic to your data set comes at a cost in terms of performance.

Comparison operators such as < > = perform better than LIKE, which in turn performs better than SIMILAR TO.

And anything and everything performs better than REGEX / POSIX. The ladder below illustrates the ranking.
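
Roughly the same filter expressed four ways, cheapest first (the column and values are made up):

where customer_type = 'retail'                   -- comparison operator: cheapest
where customer_type like 'ret%'                  -- LIKE
where customer_type similar to 'ret(ail|ired)'   -- SIMILAR TO
where customer_type ~ '^ret'                     -- POSIX regex: most expensive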

🤔 Conditional Logic on Joins performs the worst

When we join tables on conditional logic, e.g. join table on blah > blah or join table on blah between bleh and bloh, Redshift has no choice but to do a nested loop join. That means every SINGLE row in table a has to be checked against every row in table b, which can carry massive amounts of overhead. A sketch of the shape to watch out for follows below.
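
In the sketch below (hypothetical tables), the BETWEEN condition gives Redshift no equality predicate to hash on, so it falls back to a nested loop:

-- Every transactions row is compared against every fx_rates row
select t.txn_id, r.rate
from transactions t
join fx_rates r
  on t.txn_date between r.valid_from and r.valid_to;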

🤔 All Functions will come at a cost

Using functions can slow down performance. For example, where invoicedate >= date_trunc('month', sysdate) has a higher performance cost than where invoicedate >= '2019-10-01'.

This will forever be a balancing act, of course. Over time you will learn which functions cost more.

🤔 Some data types have a base size you can't reduce.

For example, varchar(4) takes up 8 bytes of space even though you specified 4.

🤔 Queries against system tables can be slow

Querying the system tables for logs or usage history can be slow because all of that work is handled by the leader node only, so Redshift can't distribute those queries out to the compute nodes, sadly.
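
For example, a query against the stl_query log like the one below can't be spread across the cluster the way a query on your own tables can:

-- Recent queries, newest first
select query, starttime, substring(querytxt, 1, 60) as query_text
from stl_query
order by starttime desc
limit 20;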

🤔 CASE When statements take the first match

Case when statements will take the first 'when' statement that is true, even if there are multiple that are true.

create table if not exists case_when_experiment
(
  fruit   varchar(100)
)
;

insert into case_when_experiment
values ('apple')
;

select fruit,
       case 
       when fruit ilike 'a%'  then 'apple first match'
       when fruit ilike '%e'  then 'apple second match'
       when fruit ilike '_p%' then 'apple third match'
       else 'no match'
       end
from case_when_experiment
;

The result of the select statement is:
Fruit: apple
Case: apple first match
...even though all three of the 'when' conditions are true.

🤔 Avoid Delete if you can drop/truncate. Take note of what UPDATE does.

Delete doesn't actually delete the row; it hides it and marks it for removal at the next vacuum.

Likewise, UPDATE just hides the old row and writes a new one; the next vacuum will remove the hidden row.

If you can, drop or truncate the table instead. A table that is constantly deleted from without being vacuumed will keep growing in size. See the sketch below.
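
A minimal sketch (the table is hypothetical; vacuum delete only is real Redshift syntax):

-- Emptying the whole table? truncate reclaims the space immediately
truncate table staging_events;

-- If you must delete a subset, vacuum afterwards to reclaim the hidden rows
delete from staging_events where event_date < '2019-01-01';
vacuum delete only staging_events;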

🤔 MD5 works faster than Func_sha1

MD5 hashing has faster reads and writes due to being 128-bit; however, it's not as secure as FUNC_SHA1. Use MD5 when you want to obscure something without a security implication, and SHA1 when security matters.

🤔 Be aware of the SQL execution order.

SQL is executed in the following order:

  • FROM and JOIN clauses are loaded into memory
  • Filters in the WHERE and JOIN clauses are applied
  • GROUP BY aggregates the data
  • HAVING applies logic to the aggregates
  • SELECT brings in the columns needed
  • ORDER BY sorts the final product
  • LIMIT then limits the final data set

So LIMIT 10 will not just crunch 10 rows; it will crunch them all and show you 10.


Things to Do

The opposite of the things to avoid list :)

✔️ Use CHAR over VARCHAR if you have an exact length field.

For example, if you have a column for org IDs which are ALWAYS 36 characters long, Redshift will perform better if you set it to CHAR(36) rather than VARCHAR(36).
Source: Best Practices by DWGeeks

✔️ Use numeric / boolean logic where possible

The cost of applying logic to numbers and booleans is much lower than applying it to strings.

This is why OLTP systems use numbers to represent things, and why we use keys in data marts (dimmarketkey): where dimmarketkey = 4 performs better than where market_name = 'Australia'.

✔️ Repeating the same filters on multiple tables still helps.

If two tables you are joining have the same filters, specify them on both, even if it feels redundant. See the example below from Amazon.

-- BAD EXAMPLE
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
group by 1 order by 1;

-- GOOD EXAMPLE
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
and sales.saletime > '2008-12-01'
group by 1 order by 1;


Source: AWS Best Practice

✔️ Match Sort Keys to Group By clauses - but don't skip them.

If you build a table that you will later group by cust_type, cust_city, and cust_plan, consider sorting the base table by those same columns, in that order, as in the sketch below.

However, if you instead group by only cust_type and cust_plan, the first and third sort keys, you lose all benefit of the sort key.
Source: AWS Best Practice
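
A minimal sketch of the pattern (the table and columns are hypothetical):

create table customer_rollup
(
  cust_type varchar(20),
  cust_city varchar(50),
  cust_plan varchar(20),
  amount    decimal(10,2)
)
sortkey (cust_type, cust_city, cust_plan);

-- Groups by every sort key column, in sort key order
select cust_type, cust_city, cust_plan, sum(amount)
from customer_rollup
group by cust_type, cust_city, cust_plan;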

✔️ Match Order by and Group by clauses if possible.

Do:

group by a, b, c

order by a, b, c

Don't:

group by b, c, a

order by a, b, c

Source: AWS Best Practice


header image drawn by me

