If you love Postgres, you don't need to tell me why. It's fully open-source and yet, thanks to its rock-solid foundations and a growing set of delightful features, has become the absolute go-to choice for application databases. But when it comes to data warehousing, in an enterprise you'll quickly start to push its limits... so maybe you've become a little... cloud-curious.
That's okay. Don't feel bad! In fact, let's explore those feelings by diving into a primer on my current cloud data warehouse of choice: Snowflake. We'll start from a sky-high view and narrow down on the details next, because I think the details all make a lot more sense when you understand the big picture.
Architecture
Let's recap how a typical Postgres or other relational OLTP database management system is architected. You have a big, single machine, which has a huge storage array attached to it (probably on your SAN), a massive amount of RAM, and more processors than even the biggest hoarder among us has in their "old PC parts" box in their closet. All work goes through this one server, and although you probably have read replicas set up to handle reporting loads, there's still the ultimate limitation that your database can't be distributed.
Snowflake belongs to a new class of DBMSes that take the critical step of separating storage from compute. In practice, that means all query execution happens on ephemeral servers running in a layer abstracted above the actual storage medium, which in Snowflake's case is cloud (AWS, Azure, or GCP) object storage. The benefit, as you might have guessed, is that you can scale out the compute layer virtually infinitely. The only limit is your wallet!
Note: Ottertune has a great article recapping database developments in 2022, and separating storage & compute is a big theme amongst the newer entrants. Google even recently released AlloyDB, which is a modified PostgreSQL that takes the same step of separating storage from compute, so perhaps that's worth a look if it matches your needs. But hey, this article is still about Snowflake!
With that in mind, the architecture can be divided into about three layers, which I like to think of as follows:
The Interface layer
Snowflake calls this the "Cloud Services" layer, but I don't like that name, so I think of it as the interface. It provides the SQL interface that accepts your queries, plans them, and orchestrates their execution. It all runs in a private cloud, along with the rest of Snowflake's components, separate from other Snowflake customers.
The Compute layer
Query execution is done inside what Snowflake refers to as Warehouses. These represent computing power, and as of today come in a range of sizes from X-Small (1 credit/hour) to 6X-Large (512 credits/hour). Credits, by the way, cost a fixed amount depending on your pricing plan.
Warehouses can be configured to run full-time, but by default will auto-suspend and resume based on activity.
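As a sketch (the warehouse name is hypothetical), creating a warehouse with those auto-suspend/resume defaults looks like:

```sql
-- Hypothetical example: a small warehouse that suspends itself
-- after 60 seconds of inactivity and resumes on the next query.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND = 60
       AUTO_RESUME = TRUE;
```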
It should be intuitive that using a larger warehouse size will make your query faster, in the classic tradition of throwing money at a problem. Just like in Postgres, though, there are often other ways of making your queries faster!
The Storage layer
In Postgres and other RDBMSes, a table's data is stored in a row-based format, not that different from a giant CSV. To speed up queries, you create indexes that help Postgres quickly find the rows you're interested in.
Snowflake is entirely different. It's a columnar database, which means it stores data in a column-based format. To use an analogy, picture products stored in a physical warehouse. Under the Postgres model, you have a huge array of crates (rows) where each crate has a full set of items (columns) in it. Once your forklift retrieves a crate, you get all the items in that crate.
Under the Snowflake model, though, imagine all the different items categorized by their type and stored in their own, dedicated sections of the warehouse. Retrieving just one type of item (one column) will be much faster than retrieving all types of items.
Language Differences
Okay, I think that's enough high-level architecture. Let's look at the actual language differences between Snowflake SQL and Postgres SQL.
Syntax
As a Postgres lover, Snowflake SQL is going to be no problem for you: it's based on the ANSI SQL standard, and all your usual query syntax is supported. But there are a few differences you might run into.
Cross-database references actually work
Postgres actually does let you write out references qualified by database (mydb.schema.table), but it will tell you "cross-database references are not implemented" if you actually try to use something outside your connected database.
Snowflake, on the other hand, not only allows cross-database references but actively encourages them. Functionally, the difference is mostly at the governance layer (you can set up RBAC on a per-database level) since the data is all "in the cloud" anyway. Still, it's a very important thing to know.
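For example (database and table names here are hypothetical), a join that would fail in Postgres is perfectly legal in Snowflake:

```sql
-- Hypothetical databases: joining across databases just works.
SELECT o.id, c.name
FROM sales_db.public.orders o
JOIN crm_db.public.customers c
  ON c.id = o.customer_id;
```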
The USE statement
Following the above, you may want to set your "current" database. This is as simple as running USE db_name;. It's a connection-local setting, so it won't persist if you reconnect.
USE SCHEMA vs. SEARCH_PATH
If you're not in public all the time, you're probably used to running SET search_path = 'myschema'; in Postgres. Snowflake can do this as well:
ALTER SESSION SET SEARCH_PATH = 'db.schema1, db2.schema2';
Unlike in Postgres, this will not apply to DDL operations. The docs imply it shouldn't apply to DML statements either, but in some testing it seems to work.
For DDL operations, or cases where you just want to target one schema, you can set your current schema with USE SCHEMA myschema.
Snowflake's documentation on object name resolution has the full details.
No FILTER after aggregate functions
Have you experienced the joy of FILTER in Postgres? No? Behold!
SELECT
  date,
  SUM(amount) FILTER (WHERE status = 'completed') AS total_completed,
  SUM(amount) FILTER (WHERE status = 'pending') AS total_pending
FROM orders GROUP BY 1;
But suppress your joy, because Snowflake doesn't support that syntax. Instead, go back to the tried-and-true SUM(CASE):
SELECT
date,
SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END)
AS total_completed,
SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END)
AS total_pending
FROM orders GROUP BY 1;
No DISTINCT ON
DISTINCT ON is a very neat Postgres feature for the incredibly common task of deduplicating rows on certain columns without worrying about what happens to the other columns. You're surely familiar with GROUP BY, but GROUP BY has the annoying requirement that you be explicit about how to aggregate the values from the non-grouped columns. Sometimes I don't care, man! So Postgres lets you do this:
SELECT DISTINCT ON (unique_value)
unique_value, other_datapoint, corollary_val
FROM my_table
ORDER BY unique_value, other_datapoint, corollary_val;
In Snowflake, though, you have access to the QUALIFY clause, which gets you the same end result:
SELECT unique_value, other_datapoint, corollary_val
FROM my_table
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY unique_value
  ORDER BY other_datapoint, corollary_val
) = 1;
Lateral join queries cannot use ORDER BY ... LIMIT
Ever ran a query like this in Postgres?
SELECT * FROM orders o
JOIN LATERAL (
  SELECT * FROM orders
  WHERE customer_id = o.customer_id
  ORDER BY created_at DESC LIMIT 1
) latest_order ON TRUE;
The idea is pretty simple: pull each order alongside the latest order for the customer. If you're not familiar with LATERAL joins, they allow your join expression to be a subquery specific to each row. But Snowflake will refuse to execute this:
Unsupported subquery type cannot be evaluated
Helpful, right? The problem is that, as we'll go over below, ORDER BY ... LIMIT queries in Snowflake are really, really slow, and the possibility of having to run one for each row (which a LATERAL join implies) would absolutely murder performance, to the point where Snowflake just doesn't let you load that particular footgun.
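One workaround, sketched here against the same hypothetical orders table, is to express "latest order per customer" with a window function and QUALIFY instead of a per-row lateral subquery:

```sql
-- Hypothetical rewrite: compute each customer's latest order once,
-- then join it back, instead of a correlated ORDER BY ... LIMIT.
WITH latest_orders AS (
  SELECT *
  FROM orders
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY created_at DESC
  ) = 1
)
SELECT o.*, lo.id AS latest_order_id
FROM orders o
JOIN latest_orders lo ON lo.customer_id = o.customer_id;
```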
Data Type Comparison
Here's a rundown of Postgres data types mapped to their closest Snowflake equivalent:
| Postgres Data Type | Snowflake Data Type | Notes |
|---|---|---|
| bigint | BIGINT | |
| bigserial | BIGINT | |
| bit | BINARY | |
| varbit | VARBINARY | Equivalent to BINARY |
| boolean | BOOLEAN | Only supported for accounts provisioned after January 25, 2016. Weird! |
| box | GEOMETRY | |
| bytea | BINARY | |
| char | CHAR | |
| varchar | VARCHAR | |
| cidr | | |
| circle | GEOMETRY | |
| date | DATE | |
| double precision | DOUBLE | |
| inet | | |
| integer | INTEGER | |
| interval | | I'm really bummed Snowflake doesn't have this type. |
| json | TEXT | Probably more useful to use VARIANT. |
| jsonb | VARIANT | |
| line | GEOMETRY | |
| lseg | GEOMETRY | |
| macaddr | | |
| macaddr8 | | |
| money | NUMERIC | YMMV. |
| numeric | NUMERIC | |
| path | GEOMETRY | |
| pg_lsn | | It's not Postgres, so... no. |
| pg_snapshot | | |
| point | GEOMETRY | |
| polygon | GEOMETRY | |
| real | REAL | |
| smallint | SMALLINT | |
| smallserial | SMALLINT | |
| serial | INT | |
| text | TEXT | |
| time | TIME | |
| timetz | TIME | Careful! Snowflake's TIME is just a 24-hour time value. No concept of time zone is stored or recognized. |
| timestamp | TIMESTAMP_NTZ | |
| timestamptz | TIMESTAMP_LTZ | There is a third TIMESTAMP_TZ type, which stores the time in UTC along with the original time zone it was created in; that might be useful if you want to track e.g. what time zone a customer performed an operation in. However, LTZ is most similar to Postgres timestamptz. |
| tsquery | | |
| tsvector | | |
| txid_snapshot | | |
| uuid | TEXT | Not a native type. |
| xml | VARIANT | VARIANT is very cool. |
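Since VARIANT comes up a few times in that table, here's a small sketch (table and JSON are hypothetical) of how semi-structured data lands in one:

```sql
-- Hypothetical table: VARIANT holds semi-structured data, and
-- PARSE_JSON turns a JSON string into it (INSERT ... SELECT form).
CREATE TABLE events (payload VARIANT);

INSERT INTO events
SELECT PARSE_JSON('{"user": {"id": 42}, "action": "login"}');

-- Colon syntax traverses the structure; :: casts the result.
SELECT payload:user:id::INT AS user_id FROM events;
```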
Surviving without INTERVAL
Even though Snowflake doesn't have an actual INTERVAL data type, it still supports the INTERVAL literal for simplified date math. So this Postgres query will also work in Snowflake:
SELECT CURRENT_TIMESTAMP - INTERVAL '24 HOURS';
But this will not, because the interval itself can't be stored to a data type:
SELECT INTERVAL '24 hours';
If you need to store a duration of time, I have found it easiest to store number of seconds as a decimal (or integer, depending on precision needed).
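For example (column and table names hypothetical), you can store a duration as seconds and apply it with DATEADD when you need the math:

```sql
-- Hypothetical: duration_seconds is a stored NUMBER of seconds;
-- DATEADD applies it to a timestamp.
SELECT DATEADD('second', s.duration_seconds, s.started_at) AS ended_at
FROM sessions s;
```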
Things to Avoid
Snowflake's completely-different architecture means that your mental model of how a query executes will likely need some expansion. Here's a list of common mistakes that Snowflake newcomers make (all of which I've personally done).
SELECT * = 🤢
Remember how we mentioned that Snowflake is columnar? A SELECT * requires fetching data for every column, which in Postgres is barely more work than fetching one column: in our analogy, Postgres stores all of a row's columns in one big crate that your forklift is already picking up.
But Snowflake has different warehouse sections for each column, so your forklift would need to make potentially dozens of stops!
This doesn't mean you should never run SELECT * (maybe your report really does need all the data), but you should be mindful about what you're asking Snowflake to do.
ORDER BY … LIMIT = 🐌
This is a common and typically performant pattern in Postgres:
SELECT id FROM orders ORDER BY created_at DESC LIMIT 5;
Postgres can use an index on created_at to find the most-recently-created orders easily (think of a binder in our warehouse with directions to the right aisle).
But Snowflake has no indexes! Since data isn't stored by row, an index wouldn't help. So for a query like this, Snowflake could potentially need to scan every single value!
(Side note: it's surprisingly easy for Snowflake to do that, thanks to its ability to distribute the load, and I guarantee there's a warehouse size that you could pick to efficiently do it for your dataset, but boy would that be expensive).
But Snowflake is actually quite good at keeping statistics about each and every column, and it bundles up all the data in each column into nicely-labeled little boxes. This makes Snowflake really good at handling filters. So if you just narrow down the scope a bit, Snowflake's job gets much easier:
SELECT id FROM orders
WHERE created_at > current_date - 5
ORDER BY created_at DESC LIMIT 5;
Those detailed stats also help Snowflake perform queries that can just leverage metadata work efficiently, like this one:
SELECT MAX(created_at) FROM orders;
Picking too big a warehouse = 💸
Although it might be tempting to just bump up your warehouse size to make a query run quicker (or complete without timing out), treat that as a last resort. Bigger warehouses are not cheap. Start with X-Small and only work your way up once you can't optimize the query any further.
Conclusion
There is plenty more to learn about Snowflake, but I hope this article has provided a good starting point. Check out my earlier post on using Tasks, Streams, and Python UDFs if you're hungry for more Snowflake content!
Changelog:
- 2023-02-01: Corrected information related to INTERVAL, TIMESTAMP_LTZ, and SEARCH_PATH. Thanks to my co-worker, Jeremy Finzel, for pointing out the corrections!
- 2023-02-02: Corrected language in the ORDER BY ... LIMIT section. Thanks to Aaron Pavely for the note.
Top comments (1)
The syntax for accessing JSON columns is also different.
annoying_json_column->'key1'->>'key2'
or
(annoying_json_column->'key1'->>'key2')::text
becomes
annoying_json_column:key1:key2
or
annoying_json_column:key1:key2::string
There are also differences in things like:
lateral jsonb_array_elements(annoying_json_column)
(or jsonb_each) becomes something like:
inner join lateral flatten(input => annoying_json_column) as my_json_values