Welcome to the R.A.G, a guide about Amazon's Redshift Database written for the Analysts out there in the world who use it.
Previously on the R.A.G....
When building tables you determine what 'data type' each column will be. You do this column by column, and while it may seem silly to specify exactly what is going to go into each column, especially when Redshift can guess for you, it can be a big contributing factor in speeding up performance and decreasing table size.
You can also choose to compress your columns.
Compression will allow more data to fit inside a block, again decreasing table size. It will also improve the efficiency of the Zone Maps, another thing that can speed up performance.
Data Types
Data Keywords | Description |
---|---|
SMALLINT, INT2 | This is for WHOLE numbers that only take up 2 bytes of data, range: -32768 to +32767 |
INTEGER, INT, INT4 | Also for whole numbers, taking up 4 bytes of data, range: -2147483648 to +2147483647 |
BIGINT, INT8 | Also for whole numbers, taking up 8 bytes of data, range: -9223372036854775808 to +9223372036854775807 |
DECIMAL, NUMERIC | For numbers with decimal points, up to 38 digits total. When you classify a column as decimal you must declare both the TOTAL number of digits and how many of those are decimal places. For example DECIMAL(10,2) means ten digits max with two decimal places, which equates to 8 digits on the left of the decimal point and 2 on the right. |
REAL, FLOAT4 | For storing floating point numbers with single (4 byte) precision; values are stored inexactly |
DOUBLE PRECISION, FLOAT, FLOAT8 | For storing floating point numbers with double (8 byte) precision; still inexact, but holds more significant digits |
BOOLEAN, BOOL | Boolean is a single byte flag which is either 1 or 0, true or false, though it can also hold a null value. It can also get represented as a checkbox. You can specify that the default value is true or false; if you don't specify a default value then the default will be null |
CHAR, CHARACTER, NCHAR, BPCHAR | CHAR is a fixed length text string; ignore references to NCHAR and BPCHAR, those are old keywords that have all been merged into the same functionality. CHAR always takes up all of the space you specify, so if you specify CHAR(100) but only put 'Hi' into the column, the remaining 98 characters of space will be filled with spaces, which can cause issues with EXACT matching. Use it when your column is always going to be a fixed length. CHAR will always use up at least 4 bytes of data, even if you specify CHAR(2) |
VARCHAR, CHARACTER VARYING, NVARCHAR, TEXT | VARCHAR allows for varying character length, which is good for free text fields. Unlike CHAR it will only use however much space has actually been entered, so in a VARCHAR(100) the word 'Christmas' will only use 9 of those 100 characters, saving space. VARCHAR will always use up at least 6 bytes of data, even if you specify VARCHAR(2) |
DATE | Use this for dealing with time at a granularity no finer than a day, i.e. this will not store, show, or handle the time of day. |
TIMESTAMP, TIMESTAMP WITHOUT TIME ZONE | Use this for dealing with time AND where your entire data warehouse is in the same timezone |
TIMESTAMPTZ, TIMESTAMP WITH TIME ZONE | Use this for dealing with time across various time zones |
Data Type Usage Examples
Below are some examples of how to use the above data types in your code.
CREATE TABLE <TABLE_NAME>
(
col_smallint SMALLINT,
col_int2 INT2,
col_integer INTEGER,
col_int INT,
col_int4 INT4,
col_bigint BIGINT,
col_int8 INT8,
col_decimal DECIMAL(20,2),
col_numeric NUMERIC(20,2),
col_real REAL,
col_float4 FLOAT4,
col_double DOUBLE PRECISION,
col_float FLOAT,
col_float8 FLOAT8,
col_boolean BOOLEAN DEFAULT FALSE,
col_bool BOOL DEFAULT FALSE,
col_char CHAR(36),
col_character CHARACTER(36),
col_nchar NCHAR(36),
col_bpchar BPCHAR,
col_varchar VARCHAR(36),
col_char_varying CHARACTER VARYING(36),
col_nvarchar NVARCHAR(36),
col_text TEXT,
col_date DATE,
col_timestamp TIMESTAMP,
col_timestamp_ntz TIMESTAMP WITHOUT TIME ZONE,
col_timestamptz TIMESTAMPTZ,
col_timestamp_tz TIMESTAMP WITH TIME ZONE
);
Which Data Type to use?
So there are a lot of data types to pick from, and plenty of overlap, so why not just use VARCHAR(999) for everything and go about your day? Reasons, that's why! The below will help.
Dealing with numbers
- Firstly, numbers are WAY more performant than text, so you should never use CHAR or VARCHAR when you could be using INT, DECIMAL, or DATE.
- INTEGERS don't have decimal places, so don't use them when your customer needs to go down to the decimal level, e.g. currency.
- INT2 is obviously more performant than INT4/INT8, however it caps out at 32767, so only use it for small numbers.
- When dealing with decimals, be smart about how many decimal places you declare, it helps performance. If you only need two decimal places then just specify 2, as in the sketch below.
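As an illustration, here is a minimal sketch of picking numeric types to fit the data; the table and column names are made up for this example:

CREATE TABLE daily_sales_example
(
store_id SMALLINT,      -- store numbers stay well below 32767, so 2 bytes is plenty
order_count INTEGER,    -- whole numbers only, no decimals needed
revenue DECIMAL(12,2)   -- currency needs decimals, and two places is all we need
);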
Dealing with Flags
- BOOL is what you should use for flags.
- Don't use CHAR(1) or VARCHAR(1) when you could use BOOL, as CHAR(1) uses up 4 bytes of data and VARCHAR(1) uses 6 bytes, while BOOL uses 1 byte.
- Try to set a default value on the flag, as it will save you in the long run (see the example below).
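For example, adding a hypothetical flag column with a sensible default (the table and column names are made up):

-- 1 byte per row, instead of the 4+ bytes a CHAR(1) 'Y'/'N' flag would use
ALTER TABLE daily_sales_example ADD COLUMN is_refunded BOOLEAN DEFAULT FALSE;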
Dealing with Dates
- When you are just dealing with the DATE, i.e. the day, month, or year, then use DATE
- For time, just use TIMESTAMP
- Only use TIMESTAMPTZ when dealing with multiple time zones; good data warehouses use the same time zone across all data
- Never put a date into a VARCHAR (the three options are shown in the sketch below)
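A minimal sketch showing the three options side by side; the table and column names are made up:

CREATE TABLE order_times_example
(
order_date DATE,        -- day-level only, no time of day
created_at TIMESTAMP,   -- date and time, whole warehouse in one time zone
created_tz TIMESTAMPTZ  -- only if you genuinely need multiple time zones
);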
Dealing with Text
- Don't use CHAR if you don't know how long the text is going to be: you will hit size errors (i.e. this text is too big for the field), or you will be forced to set the limit too high, wasting space (CHAR uses up all the space by filling it in with spaces).
- Conversely, don't use VARCHAR if you know the length of all your values. For example a GUID, which is always 36 characters long, should be CHAR(36) not VARCHAR(36), as VARCHAR(36) is actually 40 bytes long. Redshift will perform better on CHAR in these scenarios (see the sketch after this list).
- Don't use VARCHAR for anything less than 6 bytes, you won't gain any space with VARCHAR(2).
- Don't use CHAR or VARCHAR if you are storing a flag, as BOOL will be quicker and smaller.
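Putting that together, a hypothetical sketch of fixed length versus variable length text columns:

CREATE TABLE customer_notes_example
(
customer_guid CHAR(36),  -- GUIDs are always exactly 36 characters, so fixed length fits
note_text VARCHAR(500)   -- free text of unknown length, only what's entered gets stored
);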
Data Compression
Compression, also known as Encoding, makes the column smaller. You can choose different types of compression for different scenarios, and some compression types can only be used on certain data types. In theory, compressing data too much can make it take longer to read, however that's not often the case, as Amazon makes sure its compression methods balance out storage and reading. In some scenarios, compression actually can use up more space.
Compression Types
Type | Keyword | Applicable Data Types |
---|---|---|
Raw (no compression) | RAW | All Types |
AZ64 | AZ64 | INT2/INT4/INT8/DECIMAL/DATE/TIMESTAMP/TIMESTAMPTZ |
Byte Dictionary | BYTEDICT | All but BOOL & TEXT |
Delta | DELTA | INT2/INT4/INT8/DECIMAL/DATE/TIMESTAMP/TIMESTAMPTZ |
Delta 32k | DELTA32K | INT4/INT8/DECIMAL/DATE/TIMESTAMP/TIMESTAMPTZ |
LZO | LZO | All but BOOL and FLOAT/FLOAT8 |
Mostly8 | MOSTLY8 | INT2/INT4/INT8/DECIMAL |
Mostly16 | MOSTLY16 | INT4/INT8/DECIMAL |
Mostly32 | MOSTLY32 | INT8/DECIMAL |
Run-length | RUNLENGTH | All but TEXT |
Text | TEXT255, TEXT32K | VARCHAR Only |
Zstandard | ZSTD | All but TEXT |
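Compression is applied per column with the ENCODE keyword when you create the table. Below is a minimal sketch; the table, columns, and chosen encodings are just for illustration:

CREATE TABLE page_views_example
(
view_id BIGINT ENCODE az64,
view_date DATE ENCODE az64,
country CHAR(2) ENCODE bytedict,
page_url VARCHAR(300) ENCODE zstd
);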
Which one do I use?
Short Answer
AZ64 where it applies, then ZSTD for everything else. Most analysis won't require you to nail compression.
Long Answer
Each compression type has its own specific use cases. Some are more general in how they store data, while some work in a very specific manner, which leads to some very specific downsides if not used properly.
Some TLDRs...
- RAW: No compression. You're not supposed to compress your first sort key, so use it there.
- AZ64: An aggressive compression algorithm with good savings and performance. A general all-rounder for integers and dates.
- BYTEDICT: Creates a dictionary of values, so repeated values take up almost no space. Only use it on columns whose values repeat a lot. Its dictionary is limited to 256 values, after which new values get stored as RAW.
- DELTA/DELTA32K: Compresses data by only recording the difference between values. Only good if the delta (difference) is small and incremental. Large differences can actually cause DELTA to use more space than if it was uncompressed.
- MOSTLY8/16/32: This method is one you use when most of the values in a column fit in 8/16/32 bits, but there are some outlying larger values. So by specifying MOSTLY8 you are saying the majority of values can be compressed down to 8 bits, and the outliers are left as raw. Only use it when you know you can compress most of the values in the column.
- RUNLENGTH: Similar to BYTEDICT in that repeated values are collapsed, except here a run of consecutive identical values is stored once along with a count of how many times it repeats. It's not limited like BYTEDICT to a set number of values, however you shouldn't use this on a sort key column.
- TEXT255/TEXT32K: For text values only; works similarly to BYTEDICT in that it makes a dictionary of values. TEXT255 stores values in 1 byte indexes and is limited to the first 245 unique words in a column before new values are stored uncompressed. TEXT32K stores values in 2 byte indexes and will keep storing new words until it hits a hard limit of 32K bytes of data for the combined dictionary. Only columns with few unique words will compress well.
- ZSTD: An aggressive compression algorithm with good savings and performance. It will seldom result in using more data than it saves, unlike some other compression methods. Use this where AZ64 does not apply.
Pro-Tip: If sort key columns are compressed more aggressively than other columns in the same query, Redshift may perform poorly.
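Putting the short answer and the pro-tip together, here is a hedged sketch of what a 'default' table definition might look like (all names are made up): RAW on the sort key, AZ64 on the remaining numeric and date columns, ZSTD on the text.

CREATE TABLE events_example
(
event_date DATE ENCODE raw,             -- first (and only) sort key, left uncompressed
event_id BIGINT ENCODE az64,
amount DECIMAL(12,2) ENCODE az64,
description VARCHAR(200) ENCODE zstd
)
SORTKEY (event_date);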
Redshift can tell you what it recommends.
If you build a table and run the below command, Redshift will recommend, per column, what the compression should be, and will even include its estimate of how MUCH the new compression will help by.
ANALYZE COMPRESSION <table_name>;
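Once a table exists, you can also check which encoding each column currently has via the PG_TABLE_DEF catalog view (the table name below is just a placeholder, and the table's schema must be on your search_path for it to show up):

SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'events_example';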
header image drawn by me