<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Arderne</title>
    <description>The latest articles on DEV Community by Matt Arderne (@mattarderne).</description>
    <link>https://dev.to/mattarderne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F429332%2F5fed0232-a284-45bb-b22a-e5637b843769.jpeg</url>
      <title>DEV Community: Matt Arderne</title>
      <link>https://dev.to/mattarderne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattarderne"/>
    <language>en</language>
    <item>
      <title>Snowflake Field Notes</title>
      <dc:creator>Matt Arderne</dc:creator>
      <pubDate>Mon, 13 Jul 2020 10:15:50 +0000</pubDate>
      <link>https://dev.to/mattarderne/snowflake-field-notes-2fdf</link>
      <guid>https://dev.to/mattarderne/snowflake-field-notes-2fdf</guid>
      <description>&lt;p&gt;In &lt;a href="https://groupby1.substack.com/p/data-as-a-utility-tool"&gt;my first post,&lt;/a&gt; I justified an approach to achieve a scalable system for &lt;strong&gt;loading, storing, transforming and distributing&lt;/strong&gt; data within an analytics context. &lt;/p&gt;

&lt;p&gt;In this post, we’ll be taking a look into my notebook on &lt;strong&gt;storing&lt;/strong&gt;. Specifically, the things I’ve noted as useful when implementing Snowflake. These few notes, scripts and points of reference should save you some time and get you out onto the water sooner. &lt;/p&gt;

&lt;p&gt;Welcome to my pocket notebook, heading &lt;em&gt;&lt;strong&gt;Snowflake Important Things - Jan 2020.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  Why Snowflake
&lt;/h2&gt;

&lt;p&gt;This isn’t paid content. Though with the flowery praise to come it should be (see contacts below). However, this post doesn't get too far into &lt;em&gt;why&lt;/em&gt; Snowflake. Rather it explores &lt;em&gt;how&lt;/em&gt; Snowflake. Nonetheless, we may need some justification.&lt;/p&gt;

&lt;p&gt;Snowflake’s real value is the &lt;strong&gt;reduction of non-value-adding complexity for the user&lt;/strong&gt;. Putting useful things in your path and keeping anything and everything operationally complex out of your way. Simple as that. If you’ve used PostgreSQL then this shouldn’t feel too foreign, minus index maintenance, table locks, performance issues and upgrades. Pretty standard SQL otherwise, and a few new concepts.&lt;/p&gt;

&lt;p&gt;And you only pay for the capacity and performance you use.&lt;/p&gt;

&lt;p&gt;That’s it really. &lt;/p&gt;

&lt;p&gt;It is &lt;em&gt;just&lt;/em&gt; a SQL database. A very fast one that handles huge volumes of data and has lots of usability features. It stores data by column rather than by row, which is what makes it so fast for analytical queries. But you’ll still be writing SQL queries, in a familiar, mostly pleasant SQL syntax. &lt;/p&gt;

&lt;p&gt;The primary alternative to Snowflake in this context is Google BigQuery. I’m no expert, but you’d struggle to go wrong with either. Snowflake offers a choice of AWS, Azure or GCP for your horsepower, so that might be reason enough for you to choose Snowflake. At some point, it should start to become clear that Snowflake is just a clever interface for storage and computation built on commodity cloud infrastructure. Very clever. S3 buckets + EC2 for anyone feeling like they’d rather DIY this part, or build a competitor. &lt;/p&gt;

&lt;p&gt;Last part of the intro fanfare: Snowflake is a Data Platform. This is made clear in their recent manoeuvring into the crowded, polluted sea of Data Marketplaces, and a peek into the BI world, with their very simple new Dashboards tool. However, the most platformy move here is a direct integration with Salesforce. More on this in the closing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;This post doesn’t get &lt;em&gt;too far&lt;/em&gt; into the details of the doing, but rather points out things that are somewhat peculiar or unique to Snowflake. Things to be kept in mind when doing the initial deployment. &lt;/p&gt;

&lt;p&gt;The context also caters entirely towards doing your transforming tasks in a SQL transformation tool like &lt;a href="https://dataform.co/?utm_source=groupby1.substack.com"&gt;Dataform&lt;/a&gt; or &lt;a href="https://getdbt.com/?utm_source=groupby1.substack.com"&gt;dbt&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The structure of this post will loosely follow the order in which you’ll encounter and want to consider various new concepts and features as you implement Snowflake. &lt;/p&gt;

&lt;p&gt;We will start with an intro to a Snowflake deployment. We’ll then apply some structure to loading, after getting the security and costs watertight we will finally set sail with some interesting new features and capabilities. &lt;/p&gt;

&lt;h1&gt;
  
  
  1. Deployment
&lt;/h1&gt;

&lt;p&gt;As of publishing this, you can sign up and get started with a free (no credit card), month-long trial, which gets you floating.&lt;/p&gt;

&lt;p&gt;Once you’ve signed up, you’ll need a few things in place as part of the deployment. These include roles, users, databases and warehouses. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lbAknRma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/e6935cff-ff66-4f5c-ae6a-8cee5110c5d9_1600x1067.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lbAknRma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/e6935cff-ff66-4f5c-ae6a-8cee5110c5d9_1600x1067.jpeg" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  New Concepts
&lt;/h2&gt;

&lt;p&gt;The new concepts introduced here are warehouses and credits. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/Warehouses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Essentially a warehouse is how you specify the &lt;strong&gt;power of compute&lt;/strong&gt; that you use to run queries. This is interesting because you can assign a warehouse to a role. &lt;code&gt;TRANSFORM&lt;/code&gt; roles can use a different warehouse to &lt;code&gt;REPORT&lt;/code&gt; roles. This allows you to fine-tune your compute power and response time for various scenarios. Predictable power for &lt;code&gt;TRANSFORM&lt;/code&gt;, snappy and responsive for &lt;code&gt;REPORT&lt;/code&gt; to keep the end-users happy! &lt;/p&gt;

&lt;p&gt;Warehouses are NOT where you keep your data. Think of a warehouse like a sail that you hoist when the cold query winds blow from the East (or when the warm Summer trade-winds blow from the East depending on your preference).&lt;/p&gt;

&lt;p&gt;Practically, a role is granted privileges to use a warehouse in much the same way a role is granted privileges to access a database. A warehouse also needs to be specified whenever a connection is made to Snowflake. &lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grant all privileges on warehouse WAREHOUSE_REPORT 
to role ROLE_REPORT;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
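&lt;p&gt;Creating a warehouse is itself a one-line statement. A minimal sketch, with the size and timeout values purely illustrative:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create warehouse WAREHOUSE_REPORT
  warehouse_size = 'XSMALL'
  auto_suspend = 60          -- suspend after 60 idle seconds to stop billing
  auto_resume = true         -- wake automatically on the next query
  initially_suspended = true;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;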

&lt;p&gt;&lt;strong&gt;/Credits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You get billed based on your usage of credits.&lt;/p&gt;

&lt;p&gt;Credits are consumed by storage and warehouses.&lt;/p&gt;

&lt;p&gt;Every time* you start a warehouse, you pay per second in credits, and so credits are effectively your unit of currency. &lt;/p&gt;

&lt;p&gt;At the time of writing a credit is &lt;a href="https://www.snowflake.com/pricing/?utm_source=groupby1.substack.com"&gt;$2-$3&lt;/a&gt;, and negotiating that down when your annual contract value reaches ~$10k is the typical script. &lt;/p&gt;

&lt;p&gt;The outcome of this warehouse/credit scenario is you have &lt;a href="https://www.snowflake.com/blog/understanding-snowflake-utilization-warehouse-profiling/?utm_source=groupby1.substack.com"&gt;a very granular cost breakdown&lt;/a&gt; of your query costs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;*Not every query starts a warehouse - see cached data section below.&lt;/em&gt;&lt;/p&gt;
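&lt;p&gt;You can query that breakdown yourself. A sketch against the &lt;code&gt;ACCOUNT_USAGE&lt;/code&gt; share, assuming your role can read the &lt;code&gt;SNOWFLAKE&lt;/code&gt; database:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- credits consumed per warehouse over the last 7 days
select warehouse_name,
       sum(credits_used) as credits
from snowflake.account_usage.warehouse_metering_history
where start_time &amp;gt;= dateadd(day, -7, current_timestamp())
group by 1
order by 2 desc;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;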

&lt;p&gt;&lt;strong&gt;Additional Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  See a walkthrough of cost calculations, product tiers and implications &lt;a href="https://www.tropos.io/blog/how-to-calculate-your-snowflake-monthly-cost/"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Permissions
&lt;/h2&gt;

&lt;p&gt;This is the &lt;code&gt;grant &amp;lt;PERMISSION&amp;gt; to &amp;lt;ROLE&amp;gt;&lt;/code&gt; part of the database deployment process. &lt;/p&gt;

&lt;p&gt;I like to follow one of the following two deployment patterns: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;Proof Of Concept&lt;/strong&gt; (POC) keeps things as simple as possible, while still being stable and scalable. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;Production&lt;/strong&gt; option adds some additional structure on top of the POC. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Proof of Concept
&lt;/h3&gt;

&lt;p&gt;This setup doesn't distinguish between &lt;code&gt;PROD&lt;/code&gt; and &lt;code&gt;DEV&lt;/code&gt;, and rather relies on branching features later on in the transformation, which is perfectly fine.&lt;/p&gt;

&lt;p&gt;At the core are the 3 roles, with each only having the permissions necessary to function, without the ability to interfere with the other roles’ domains. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;INGEST&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Loads data&lt;/li&gt;
&lt;li&gt;  Can create schemas in &lt;code&gt;RAW&lt;/code&gt; database&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;TRANSFORM&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Creates transformation scripts&lt;/li&gt;
&lt;li&gt;  Can read data in &lt;code&gt;RAW&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Can create schemas in &lt;code&gt;ANALYTICS&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;REPORT&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Read-only access to &lt;code&gt;ANALYTICS&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is shown in the relationship diagram below, where connections indicate permissions assigned. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1svi2s1a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69161c61-2125-4f33-b112-4517401729ed_1600x675.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1svi2s1a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69161c61-2125-4f33-b112-4517401729ed_1600x675.png" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;You’ll notice in the diagram that the &lt;code&gt;USER_REPORT&lt;/code&gt; cannot access the &lt;code&gt;RAW&lt;/code&gt; data. This is an entirely deliberate move towards ensuring that downstream tools cannot build a dependency on &lt;code&gt;RAW&lt;/code&gt; data.&lt;/p&gt;
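&lt;p&gt;In SQL, each connection in the diagram is a grant. A condensed sketch of the &lt;code&gt;REPORT&lt;/code&gt; leg (the starter kit linked below spells out all three roles):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create role ROLE_REPORT;
grant usage on database ANALYTICS to role ROLE_REPORT;
grant usage on all schemas in database ANALYTICS to role ROLE_REPORT;
grant select on all tables in database ANALYTICS to role ROLE_REPORT;
-- note: no grants on RAW, so downstream tools cannot depend on it
grant role ROLE_REPORT to user USER_REPORT;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;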

&lt;p&gt;For further clarification on how all this works, I’ve created a starter kit for Snowflake, which creates the above diagram exactly, ready for a POC. If you’re considering a Snowflake implementation, it is well worth an hour to take a look. Pull requests welcome!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://github.com/mattarderne/snowflake-starter"&gt;https://github.com/mattarderne/snowflake-starter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Production
&lt;/h3&gt;

&lt;p&gt;The following configuration takes the basics from the &lt;strong&gt;Proof Of Concept&lt;/strong&gt; and enhances them with a more robust separation between &lt;code&gt;PROD&lt;/code&gt; and &lt;code&gt;DEV&lt;/code&gt;. Every entity suffixed &lt;code&gt;_PROD&lt;/code&gt; is duplicated with a &lt;code&gt;_DEV&lt;/code&gt; version (&lt;code&gt;_DEV&lt;/code&gt; not shown in this diagram for simplicity), alongside a distinct role breakdown for accessing databases. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_CdU4b81--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/71aa7043-46d8-4938-bf05-fa57c2b65b99_1652x1033.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_CdU4b81--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/71aa7043-46d8-4938-bf05-fa57c2b65b99_1652x1033.png" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Snowflake case sensitivity is subtly &lt;a href="https://github.com/mattarderne/snowflake-starter/blob/master/utils/case_sensitivity.sql"&gt;different to PostgreSQL&lt;/a&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Unquoted object identifiers are case-insensitive&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;“ANALYTICS” = ANALYTICS = analytics&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a user for every connecting system, and a user for every developer. This will enable you to &lt;strong&gt;track the source and cost of all queries&lt;/strong&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you already have a Snowflake database, you can visually analyse your setup with &lt;a href="http://snowflakeinspector.hashmapinc.com/?utm_source=groupby1.substack.com"&gt;Snowflake Inspector&lt;/a&gt;, great for tracking down poorly configured Snowflake permissions that you may inherit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;A very useful bit of code is the &lt;strong&gt;&lt;code&gt;grant on future&lt;/code&gt;&lt;/strong&gt; snippet, which grants a permission on all future tables or schemas in a database: &lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grant usage on future SCHEMAS in database RAW to role TRANSFORM;
grant select on future TABLES in database RAW to role TRANSFORM;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  2. Extract and Load Nuance
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WZNreFqu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/b12d25d5-57a8-45a1-9742-801d66c84d1e_1600x1157.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WZNreFqu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/b12d25d5-57a8-45a1-9742-801d66c84d1e_1600x1157.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are using &lt;a href="https://stitchdata.com/?utm_source=groupby1.substack.com"&gt;Stitch&lt;/a&gt;, &lt;a href="https://fivetran.com/?utm_source=groupby1.substack.com"&gt;Fivetran&lt;/a&gt; or similar, you can target your data warehouse at this point. Assign the tool the appropriate role, warehouse, database and schema as specified in the deployment script (&lt;code&gt;ROLE_INGEST, WAREHOUSE_INGEST, RAW&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;Stitch will create a schema based on the name you give to the job, so stick with something scalable. I like the &lt;code&gt;&amp;lt;loader&amp;gt;_&amp;lt;source&amp;gt;&lt;/code&gt; format, so you’ll start with something like &lt;code&gt;STITCH_HUBSPOT&lt;/code&gt;. It’s key to note that this means you can later swap out the Stitch part for a &lt;code&gt;FIVETRAN_HUBSPOT&lt;/code&gt; or an &lt;code&gt;ETL_HUBSPOT&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/JSON&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managed ELT tools will load data as best they can, typically as rows and columns, but often will insert your data as raw JSON into a single column. This is a good thing. It allows you to become familiar with the incredibly useful Snowflake JSON SQL syntax. &lt;/p&gt;

&lt;p&gt;If you write any custom ELT scripts, make sure they load all data as the JSON &lt;code&gt;variant&lt;/code&gt; type. This is the crux of ELT: schemaless loading means your data lands without any notion of a schema, so you can define the schema later, in one go, in the transformation step. It may seem like a big step, but it helps to be able to define ALL transformations in the transformation stage, and not have to go back to your Python scripts to add new fields.&lt;/p&gt;
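&lt;p&gt;As a taste of that JSON syntax, a sketch pulling fields out of a &lt;code&gt;variant&lt;/code&gt; column (the table, column and paths here are made up):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- colon-path into the variant, double-colon to cast
select raw_json:email::string        as email,
       raw_json:company.name::string as company,
       t.value:label::string         as tag    -- one row per array element
from RAW.STITCH_HUBSPOT.CONTACTS,
     lateral flatten(input =&amp;gt; raw_json:tags) t;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;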

&lt;p&gt;&lt;strong&gt;Additional Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Start with a tutorial for &lt;a href="https://calogica.com/sql/2018/12/17/parsing-nested-json-snowflake.html/?utm_source=groupby1.substack.com"&gt;handling JSON in Snowflake&lt;/a&gt;, just to get the &lt;a href="https://interworks.com/blog/hcalder/2018/06/19/the-ease-of-working-with-json-in-snowflake/?utm_source=groupby1.substack.com"&gt;basics&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  3. Secure the perimeter
&lt;/h1&gt;

&lt;p&gt;At this stage there is a risk of moving too fast, and that awkward speed wobble is avoided by taking stock and balancing the books.&lt;/p&gt;

&lt;p&gt;The pre-retrospective things to attend to are Costs and Sensitive Data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Costs
&lt;/h2&gt;

&lt;p&gt;Snowflake is a powerful tool, and with the largest warehouse running into the thousands of dollars &lt;em&gt;per hour,&lt;/em&gt; you want to do two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/Set a budget and limit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Determining what you are willing to spend in a month is a good start, and setting a policy to alert you at various increments of that amount will avoid a broadside attack from Finance. Setting the policy to disable future queries across specific warehouses or all of them is a good trip switch to ensure that you aren’t caught at sea.&lt;/p&gt;
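&lt;p&gt;That policy is a resource monitor. A sketch, with the quota and thresholds as placeholders:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create resource monitor MONTHLY_BUDGET with
  credit_quota = 100
  frequency = monthly
  start_timestamp = immediately
  triggers on 75 percent do notify     -- warn yourself before Finance does
           on 100 percent do suspend;  -- the trip switch

alter account set resource_monitor = MONTHLY_BUDGET;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;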

&lt;p&gt;&lt;strong&gt;/Get alerted&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Worse than running up a large bill (depending on who you ask) would be for your credit limit policy to come into play the moment you click run when demo’ing your fancy analytics to a client or stakeholder. &lt;/p&gt;

&lt;p&gt;For this reason, keeping close tabs on spikes in credit usage and becoming familiar with how and where your credits are going is very high on your new agenda. Remember this is SaaS, i.e. &lt;em&gt;Operational Expense&lt;/em&gt;. &lt;strong&gt;All the costs lie ahead of you on this one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/snowflakedb/SnowAlert"&gt;SnowAlert&lt;/a&gt; is a tool that Snowflake maintains. I’ve adopted some of the queries as part of my suggested monitoring in the &lt;a href="https://github.com/mattarderne/snowflake-starter/#snowalert"&gt;Snowflake-Starter&lt;/a&gt; repo. The queries look for spending spikes across the infrastructure and will return results only if they detect a spike. &lt;/p&gt;

&lt;p&gt;Last thing on cost management and this is more of an opinion.&lt;/p&gt;

&lt;p&gt;Historically, database resources were specified against a budget for their maximum expected load. This left lots of performance headroom for the median query. One could view Snowflake costs as the equivalent of this performance headroom, in that a Snowflake query can run faster if you assign it a larger warehouse, at increased cost. &lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;there is a premium being paid for the flexibility&lt;/strong&gt;, so it benefits you to manage your fleet of warehouses carefully, lest they turn on you. Snowflake is an operational expense. This is a subtle shift. The crux is that every credit spent should “deliver value” in a somewhat meaningful way. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Snowflake caches the results of queries, meaning that you won’t get charged for queries that hit the cache. This requires some nuance when modelling credit-intensive processes like incremental updates. See this &lt;a href="https://medium.com/hashmapinc/30-second-snowflake-cloud-data-warehouse-cheat-sheet-e72c42b863a4"&gt;blog&lt;/a&gt; for a run-through.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Snowflake charges lightly for metadata queries, because each time your transform tool runs, it queries the schema definition &lt;em&gt;&lt;strong&gt;heavily&lt;/strong&gt;&lt;/em&gt;. This used to be free; it now isn’t. The cost is negligible, but it is worth noting what is going on. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Sensitive Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;/Masking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Snowflake’s &lt;strong&gt;“Dynamic Data Masking”&lt;/strong&gt; feature isn’t quite as dynamic as it sounds but is a welcome addition. You’ll &lt;strong&gt;&lt;code&gt;create or replace masking policy EMAIL_MASK&lt;/code&gt;&lt;/strong&gt; and attach it to a column; the policy body then decides what each role is allowed to see. See this &lt;a href="https://www.youtube.com/watch?v=ByyfTAj97xY"&gt;video&lt;/a&gt; for an explanation. Being able to define masks at the object level is a helpful addition. This is a new (Enterprise-only) feature and works in conjunction with the &lt;a href="https://community.snowflake.com/s/article/Methods-for-Securing-PII-Data-in-Snowflake/?utm_source=groupby1.substack.com"&gt;standard masking features&lt;/a&gt;.&lt;/p&gt;
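&lt;p&gt;A minimal sketch of such a policy, with the role and column names illustrative:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace masking policy EMAIL_MASK as (val string)
  returns string -&amp;gt;
  case
    when current_role() in ('ROLE_TRANSFORM') then val  -- trusted roles see it
    else '*********'                                    -- everyone else does not
  end;

alter table ANALYTICS.PUBLIC.USERS
  modify column EMAIL set masking policy EMAIL_MASK;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;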

&lt;p&gt;&lt;strong&gt;/Access Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enable a &lt;a href="https://docs.snowflake.com/en/user-guide/network-policies.html"&gt;network policy&lt;/a&gt; that whitelists the IPs of Stitch, your BI tool, VPN etc.&lt;/p&gt;
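&lt;p&gt;A sketch of such a policy, with placeholder IPs:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create network policy OFFICE_AND_TOOLS
  allowed_ip_list = ('203.0.113.10/32',    -- e.g. the office VPN
                     '198.51.100.0/24');   -- e.g. your loader's IP range

alter account set network_policy = OFFICE_AND_TOOLS;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;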

&lt;p&gt;Enable &lt;a href="https://docs.snowflake.com/en/user-guide/ui-preferences.html#enrolling-in-mfa-multi-factor-authentication"&gt;multi-factor authentication&lt;/a&gt; (MFA) with the &lt;a href="https://duo.com/product/multi-factor-authentication-mfa/duo-mobile-app"&gt;Duo app&lt;/a&gt;. Duo is GREAT. It prompts for a password protected authorisation on your phone’s home screen. No excuses. All users assigned the &lt;code&gt;ACCOUNTADMIN&lt;/code&gt; role should also be required to use MFA.&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Setting Sail
&lt;/h1&gt;

&lt;p&gt;Snowflake at this point, like setting sail, depends on where you want to go. In my &lt;a href="https://groupby1.substack.com/p/data-as-a-utility-tool"&gt;previous post&lt;/a&gt;, I outlined what I’d do next, and it looks something like setting up a few data loading tools, writing transforms in &lt;a href="https://dataform.co/?utm_source=groupby1.substack.com"&gt;Dataform&lt;/a&gt; and then distributing the results in an analytics tool. If you haven’t, &lt;a href="https://groupby1.substack.com/p/data-as-a-utility-tool"&gt;please check it out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VbrwFry5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/ee90038c-bd14-472b-87f7-301f36998802_1600x1066.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VbrwFry5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/ee90038c-bd14-472b-87f7-301f36998802_1600x1066.jpeg" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;I will not be overemphasising this section, but rather point out a few of the most interesting features that fall under &lt;strong&gt;analysing data&lt;/strong&gt;. You could at this point treat Snowflake like you would a very tiny &lt;code&gt;t2.tiny&lt;/code&gt; PostgreSQL instance, forget about it (other than the $) and continue. &lt;/p&gt;

&lt;p&gt;New features in themselves are not always so interesting, but what is interesting is what they enable when combined with existing features. As in technology, so in databases. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/Swap With&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alter database PROD swap with STAGE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. All access control privilege grants are also swapped. &lt;strong&gt;The two tables are essentially renamed in a single transaction&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It also enables a Blue/Green deployment, which in simple terms means: Create a new database with your changes (&lt;code&gt;STAGE&lt;/code&gt;), run tests on that, if they pass, swap it with &lt;code&gt;PROD&lt;/code&gt;. If an hour later you realise you’ve deployed something terrible, swap it back. &lt;/p&gt;
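&lt;p&gt;Sketched end to end, the Blue/Green flow combines the zero copy clone below with the swap:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create database STAGE clone PROD;     -- zero-copy clone to build against
-- ...run transformations and tests against STAGE...
alter database PROD swap with STAGE;  -- promote in one transaction
-- deployed something terrible? the same statement swaps it back
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;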

&lt;p&gt;&lt;strong&gt;/Zero copy clone&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace table USERS_V2 clone USERS
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create an instant clone of Tables, Schemas, and Databases with zero cost (until you change the data). Great for testing, development and deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/Time Travel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combined with the clone function, one can &lt;a href="https://docs.snowflake.com/en/user-guide/data-time-travel.html"&gt;time travel to a table&lt;/a&gt; as it existed at a specified time (1 day back on the standard plan, up to 90 days on Enterprise). The command below will recover a schema as of the given timestamp (after a wayward &lt;code&gt;DROP&lt;/code&gt;, perchance).&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create schema TEST_RESTORE clone TEST at (timestamp=&amp;gt; to_timestampe(40*365*86400));
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;/External functions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run a call to a &lt;a href="https://docs.snowflake.com/en/sql-reference/external-functions-introduction.html"&gt;REST API&lt;/a&gt; in your SQL. Great for those pesky ML functions. &lt;/p&gt;


&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select zipcode_to_city_external_function(ZIPCODE)&lt;br&gt;
from ADDRESS;&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  Closing Meta Industry Thoughts
&lt;/h1&gt;

&lt;p&gt;Snowflake is building a platform, meaning they are building the one-stop-shop for your data needs. The notion of Data Loading is likely going to become more fringe. Snowflake has already moved in this direction with &lt;a href="https://blocksandfiles.com/2020/06/04/snowflake-salesforce-integration-tools/?utm_source=groupby1.substack.com"&gt;Salesforce&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Einstein Analytics Output Connector for Snowflake lets customers move their Salesforce data into the Snowflake data warehouse alongside data from other sources. Joint customers can consolidate all their Salesforce data in Snowflake. Automated data import keeps the Snowflake copy up to date.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This off-the-shelf analytics is a reasonable next step, perhaps in this case due to investment by Salesforce into Snowflake, but that aside, the data space is finding where lie its &lt;em&gt;&lt;strong&gt;layers of abstraction&lt;/strong&gt;&lt;/em&gt;, and this is shown in these industry moves.  &lt;/p&gt;

&lt;p&gt;Snowflake is building a platform, doing it well, and charging you for it. Engineering time remains expensive, and so outsourcing this to Snowflake’s managed platform will be a welcome relief. However there are no free lunches, and Snowflake is building something bigger than a data warehouse. What this means is that if you take too much, you’ll be stuck with too much. &lt;/p&gt;

&lt;p&gt;Echoing &lt;a href="https://www.dremio.com/getting-locked-in-and-locked-out-with-snowflake/?utm_source=groupby1.substack.com"&gt;Dremio&lt;/a&gt;, there is always a thought towards a modular data architecture &lt;em&gt;&lt;strong&gt;“that’s built around an open cloud data lake* (e.g. S3) instead of a proprietary data warehouse”&lt;/strong&gt;.&lt;/em&gt; I generally agree with this premise. Snowflake is built on top of AWS or Azure or GCP, and so is (was) a thin layer on top of raw storage and compute. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;* More on &lt;a href="https://fivetran.com/blog/when-to-adopt-a-data-lake//?utm_source=groupby1.substack.com"&gt;data lakes here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Snowflake is marching towards the abstractions seen in Software Engineering, where every job is a feature for them to build. Snowflake has built Data Warehouse Engineer, it is building ETL Engineer &lt;em&gt;and will likely build Data Engineer in some version soon&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3XeszKNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/cf206714-d7ad-4e9f-a72e-41d12b408620_1600x1068.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3XeszKNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/cf206714-d7ad-4e9f-a72e-41d12b408620_1600x1068.jpeg" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“It is not the ship so much as the skilful sailing that assures the prosperous voyage.” - George William Curtis&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Please comment if you have any feedback on any of this, I aim to improve with your help.&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;&lt;em&gt;Please consider subscribing for more on the subject of data systems thinking&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://groupby1.substack.com/subscribe?"&gt;&lt;span&gt;Subscribe now&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What is &lt;a href="https://groupby1.substack.com/about"&gt;group by 1&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Who is &lt;a href="https://rdrn.dev/?utm_source=groupby1.substack.com"&gt;Matt Arderne&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>database</category>
    </item>
    <item>
      <title>Data Architecture as a Utility Tool</title>
      <dc:creator>Matt Arderne</dc:creator>
      <pubDate>Sat, 11 Jul 2020 14:43:04 +0000</pubDate>
      <link>https://dev.to/mattarderne/data-as-a-utility-tool-40oo</link>
      <guid>https://dev.to/mattarderne/data-as-a-utility-tool-40oo</guid>
      <description>&lt;p&gt;&lt;em&gt;Welcome to &lt;strong&gt;&lt;a href="https://groupby1.substack.com/"&gt;group by 1&lt;/a&gt;&lt;/strong&gt;. In this first post, I’ve started broad with my opinion on a few of the typical compromises made when implementing a modern data warehouse solution. Modern meaning cloud, data warehouse meaning the back-end for an analytics tool. This post is a primer for my future content.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;Within the companies I have worked for and plan on working for, uncertainty is a common thread. Sales &lt;em&gt;may&lt;/em&gt; continue to accelerate, funding &lt;em&gt;should&lt;/em&gt; land next quarter, we &lt;em&gt;hope&lt;/em&gt; to keep in touch. The uncertainty may be more concrete. We &lt;em&gt;should&lt;/em&gt; change to a new CRM. We &lt;em&gt;probably&lt;/em&gt; need to stop reporting in Excel. &lt;/p&gt;

&lt;p&gt;I’ve put together some opinions on what has worked for me in managing uncertainty when architecting data systems that need to cater for many parallel futures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Buckets
&lt;/h3&gt;

&lt;p&gt;Designing solutions for analytics systems can, somewhat abstractly, be described as a problem of two buckets. Bucket one is full of the typical problems a business might have. The business usually then approaches “the data team” with problems such as &lt;strong&gt;help us define a metric / store the data / visualise the KPI / distribute the report&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this simplistic utopia, bucket two, the Solutions Bucket, is typically filled with lots of products and opinions, like &lt;strong&gt;Snowflake / Big Query / my last company used Tableau / group by 1&lt;/strong&gt; etc.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from problems_bucket
   inner join (select * from solutions_bucket) 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6kYp2UJM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/465b1ebb-a438-4537-b2e9-3aedf405bcf6_410x259.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6kYp2UJM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/465b1ebb-a438-4537-b2e9-3aedf405bcf6_410x259.png" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The catch when architecting a solution is that you’re &lt;em&gt;&lt;strong&gt;given only one scoop&lt;/strong&gt;&lt;/em&gt; from the solution bucket, with the hope that it covers as many of the items in the problem bucket as possible. The solution bucket is usually very resource-intensive, expensive, time-consuming and gathers momentum once that scoop is moving.&lt;/p&gt;

&lt;p&gt;The decision to re-scoop is not going to be taken lightly. The first scoop usually needs to be made under significant uncertainty and pressure. This is a time to be making bets that will serve you in many of your uncertain futures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Travel Light
&lt;/h3&gt;

&lt;p&gt;For this reason, I invoke the spirit of a prepper, where travelling light is as essential as being prepared. Enter the Swiss army knife.&lt;/p&gt;

&lt;p&gt;My ideal scoop of the solution bucket, like a good utility tool, has a nice healthy mix of scalability, ease of use, and utilitarian functionality. Bare metal that stands the test of time and rests easily on the hip, ready for action!&lt;/p&gt;

&lt;p&gt;More concretely, a lightweight data architecture describes modularity, where each component plays a specified part in the greater whole, without restricting the system. This enables upgrading, downgrading and replacing as necessary. &lt;/p&gt;

&lt;p&gt;With that in mind, I’ll be describing &lt;em&gt;my opinion / experience / preference&lt;/em&gt; for a utilitarian data architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context
&lt;/h3&gt;

&lt;p&gt;The context of this article skews heavily to the typical &lt;strong&gt;first-hired-one-person-data-team&lt;/strong&gt; scenario and is generally applicable if that person is within a small business, a startup or a small team within a larger organisation. It can be extended to a data team within a larger organisation that is rethinking their architecture. &lt;/p&gt;

&lt;p&gt;New paradigms start from the ground up, and so it can safely be assumed that this paradigm will be what banks implement in 50 years while the rest of us use quantum computing to think the data into order.&lt;/p&gt;

&lt;p&gt;The driver for this workflow arises from the need to centralise data across multiple systems, typically at the point where there are 3+ key business apps or systems. &lt;/p&gt;

&lt;p&gt;If you're the one called in to take over from the last guy who burnt the ETL (extract-transform-load) candle at both ends and now has a 1,000-yard stare, then this might hit a nerve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingesting, Storing, Transforming, Distributing&lt;/strong&gt;. Four verbs for the four sections that follow, in that order.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Ingesting
&lt;/h3&gt;

&lt;p&gt;I generally subscribe to the opinion that engineers should avoid writing custom ETL code whenever practically possible, and rather use a managed SaaS ETL tool. This resembles the corkscrew of our Swiss army knife. Powerful and simple.&lt;/p&gt;

&lt;p&gt;Managed ETL tools allow you to connect to your supported sources, point those at your data warehouse and have data flowing in a matter of minutes. You are paying for specialisation here. Post-implementation, the ETL specialist at the end of an intercom is worth their (initially) nominal fee.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9On4w0-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/70020299-3b46-402c-b41e-4de0e3867437_500x333.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9On4w0-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/70020299-3b46-402c-b41e-4de0e3867437_500x333.png" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;If you cannot get your data into your data warehouse with a managed ETL, or you cannot strike the cost/benefit balance, then you’ll have to start building. This is a great time to think about the possibility of contracting the work to a specialist. They’ll bring the expertise, and over time you can consider internalising that skill as you see fit. Because the work is narrowly described and easily measured, this is a great piece of work for outsourcing. Budget for a maintenance contract, and keep an eye on those Managed ETL services as a replacement option over time.&lt;/p&gt;

&lt;p&gt;Some additional thoughts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ingesting raw data (JSON or tables) into your data warehouse is key. Don’t spend time at this point doing transforms in Python; there isn’t time. ETL has been surpassed by ELT (extract-load-transform), and the new paradigm is well established.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Google sheet is a data source. Time is of the essence and done is better than perfect. Data validation, spreadsheet protection and read-only permissions &lt;em&gt;do a database maketh&lt;/em&gt;. Use this one sparingly, as word may get out.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Storing
&lt;/h3&gt;

&lt;p&gt;Balancing the trifecta of scalability, cost and performance is key when picking the backbone of your system. Your data may start small, or large, or small with a risk of growing large. Stopping to change a tyre in bear country is never a good look, and neither is a data warehouse migration.&lt;/p&gt;

&lt;p&gt;Managed data warehouses balance the trifecta, with scalability from a team of 1 to 100+, megabytes to terabytes+, cost starting near zero, and performance flexibility to suit your budget and need.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mT8bwlYu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/05e61e78-d226-44c8-a58b-de561533e6c2_5472x3648.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mT8bwlYu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/05e61e78-d226-44c8-a58b-de561533e6c2_5472x3648.jpeg" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Your contract for a data warehouse should begin near $0 and go up from there. Start negotiating your unit costs down once the value of the contract becomes significant, or sooner. The point is that you can get started, prove value, and iron out the details down the line.&lt;/p&gt;

&lt;p&gt;Snowflake is a good start, Big Query does wonders. Microsoft is probably up to something with Azure. Redshift is squarely in the &lt;strong&gt;&lt;code&gt;migrated_from&lt;/code&gt;&lt;/strong&gt; category. All can scale beyond your VC backer's wildest dreams.&lt;/p&gt;

&lt;p&gt;This is the knife of your Swiss army knife. Simply put: a knife needs to be sharp, and a data warehouse needs to be powerful. The main attraction.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Transforming
&lt;/h3&gt;

&lt;p&gt;Pliers apply leverage. A Swiss army knife doesn’t have pliers, which is why no one owns one, preferring a utility-tool. Loosely applying the same logic, the Transformation Layer has long been the missing link in the analytics stack, with various frustrating attempts at enabling elegant management of transformations. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wra4oAiZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/02752614-89fb-4bc7-9720-ce2329e640b6_6240x4160.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wra4oAiZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/02752614-89fb-4bc7-9720-ce2329e640b6_6240x4160.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The broad goal here is to enable access to your data for your business users while abstracting away as much of the source system complexity as possible. The outcome is clean, documented, coherent, reliable, logical, self-explanatory and performant data that can be relied upon by the &lt;strong&gt;Distributing&lt;/strong&gt; tools. This is the highest leverage point in your pipeline. Leverage that magnifies both gains and mistakes.&lt;/p&gt;

&lt;p&gt;SQL is the language of analysis, and a collection of SQL scripts best describes the journey of data from where it lands &lt;strong&gt;&lt;code&gt;RAW&lt;/code&gt;&lt;/strong&gt; in your data warehouse to where it ends up: transformed and ready for &lt;strong&gt;&lt;code&gt;ANALYTICS&lt;/code&gt;&lt;/strong&gt;. The &lt;strong&gt;&lt;code&gt;ANALYTICS&lt;/code&gt;&lt;/strong&gt; mentioned here is the schema that you expose to your &lt;strong&gt;Distributing&lt;/strong&gt; tools.&lt;/p&gt;
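
&lt;p&gt;As a minimal sketch of that &lt;strong&gt;&lt;code&gt;RAW&lt;/code&gt;&lt;/strong&gt;-to-&lt;strong&gt;&lt;code&gt;ANALYTICS&lt;/code&gt;&lt;/strong&gt; step, assuming Snowflake and hypothetical table and field names (the colon syntax is Snowflake’s JSON path notation):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- expose a cleaned view of raw JSON to the ANALYTICS schema
create schema if not exists analytics;

create or replace view analytics.orders as
select
    data:id::number             as order_id,
    data:customer.email::string as customer_email,
    data:created_at::timestamp  as created_at
from raw.orders_json;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;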

&lt;p&gt;&lt;a href="https://dataform.co/?utm_source=groupby1.substack.com"&gt;Dataform&lt;/a&gt; is a tool that takes that simple concept and runs with it, making writing sophisticated transformations a delight for analysts. Simply explained, Dataform is a SQL editor that enables analysts to build complex transformations in a way that is maintainable and interpretable. Dataform is differentiated by three concepts from software engineering that are put in the hands of the analyst: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1/ Continuous Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A deployment of new code or changes to your transforms should be a thing that happens continuously, and without fear. This is achieved through automated schema tests, continuous deployment of code, and data validity and quality tests. In Dataform, this is handled by &lt;strong&gt;&lt;code&gt;assertions&lt;/code&gt;&lt;/strong&gt;, among other useful features.&lt;/p&gt;
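
&lt;p&gt;An assertion, in Dataform terms, is just a query that should return zero rows; any rows it does return fail the run. A minimal sketch, assuming a hypothetical &lt;code&gt;orders&lt;/code&gt; model:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- definitions/assertions/orders_are_valid.sqlx
config { type: "assertion" }

-- any row returned here is a data quality failure
select order_id
from ${ref("orders")}
where order_id is null
   or amount &lt; 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;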

&lt;p&gt;&lt;strong&gt;2/ Version Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your job involves writing SQL code, and doesn't involve version control, then perhaps more than anything else, this article was written for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3/ Modularity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your SQL queries typically run into the hundreds or thousands of lines, with sub-queries galore, then breaking them into individual reusable modular components will feel like our man on a rock below. Extend this with JavaScript and suddenly you will be able to &lt;em&gt;truly express yourself.&lt;/em&gt;&lt;/p&gt;
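
&lt;p&gt;To make the modularity point concrete, a sketch with hypothetical names: the thousand-line query becomes a chain of small files, each one referencing the last with &lt;code&gt;ref&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- definitions/staging/stg_payments.sqlx
config { type: "view" }
select payment_id, order_id, amount
from ${ref("raw_payments")}
where status = 'completed'

-- definitions/marts/order_revenue.sqlx
config { type: "table" }
select order_id, sum(amount) as revenue
from ${ref("stg_payments")}
group by 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;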

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tfzJNiM8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/2dc8effc-17e2-494a-9500-e300f9df28fe_6354x4236.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tfzJNiM8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/2dc8effc-17e2-494a-9500-e300f9df28fe_6354x4236.jpeg" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  4. Distributing
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;### TODO - setup a BI tool&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Distribution of data. Commonly described as an Analytics Tool or BI tool aka &lt;em&gt;The Last Mile delivery problem.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As with a physical product, so with an analytics project: the &lt;strong&gt;last mile of delivery&lt;/strong&gt; is often both the most expensive and the most time-consuming part of the delivery mechanism. This is the point where the surface area expands massively, and the usage pattern permutations explode. Bluntly: the neatly organised cookie-cutter data pipeline gets punched in the face by the needs of the user.&lt;/p&gt;

&lt;p&gt;The utility-tool analogy falls apart somewhat at this point, as arguably the pliers should be used here. Just like &lt;a href="https://www.leatherman.com/tread-425.html"&gt;this&lt;/a&gt; utility-tool, it can get a bit confusing.&lt;/p&gt;

&lt;p&gt;Broadly, the distribution problem breaks into two categories: BI tools and Analytics tools. The distinction is murky, like your requirements. Generally speaking, these tools are either:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1/&lt;/strong&gt; Good at solving the operational reporting problems of business: Metrics, KPIs, lots of users, lots of operational complexity (tools like Looker, &lt;a href="https://metabase.com/?utm_source=groupby1.substack.com"&gt;Metabase&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2/&lt;/strong&gt; Good at solving the analysts’ problems: complicated questions, nuanced analysis, vague outcomes, forecasts, predictions (tools like Mode, Periscope, Jupyter Notebooks).&lt;/p&gt;

&lt;p&gt;A rule of thumb is that you need a good few &lt;em&gt;business users&lt;/em&gt; who are comfortable writing complicated SQL or Python before Option 2 will be feasible. This decision is largely based on the operational complexity and technical fluency of the stakeholders in this grand adventure, and generally Option 1 is more broadly applicable.&lt;/p&gt;

&lt;p&gt;If you’ve done good work in your &lt;strong&gt;Transforming&lt;/strong&gt; layer, then you can get away with a compromise here, and use a cheaper tool as a stop-gap, or use an array of tools, or allow the team to choose whatever suits them. Ultimately, you want to trend towards a single source of truth for KPI / Metric type numbers, and aim to automate their delivery.&lt;/p&gt;

&lt;h3&gt;
  
  
  My experience
&lt;/h3&gt;

&lt;p&gt;I've homed in on my preferred data stack, described below. This stack is likely a feasible option for your goals if they are related to aligning your business on key metrics. Especially so if you have multiple SaaS or custom software systems floating around that drive these metrics. What you’ll end up with is something like the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--atVNghq3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/e0996748-9c20-4b96-a212-264c33e4ca9e_1600x1004.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--atVNghq3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/e0996748-9c20-4b96-a212-264c33e4ca9e_1600x1004.png" alt=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingesting/&lt;/strong&gt; As mentioned, I prefer to use a SaaS ELT tool like &lt;a href="https://www.stitchdata.com/?utm_source=groupby1.substack.com"&gt;Stitch&lt;/a&gt; or &lt;a href="https://www.fivetran.com/?utm_source=groupby1.substack.com"&gt;Fivetran&lt;/a&gt;, as they reduce the need for ongoing maintenance where possible. Stitch is the cheaper option, and a great low-cost starting point, with the following useful additions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It has a great &lt;a href="https://www.stitchdata.com/integrations/import-api/?utm_source=groupby1.substack.com"&gt;Import API&lt;/a&gt; that allows some simplification of ELT scripts if you do need to write them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It has a useful &lt;a href="https://www.stitchdata.com/integrations/google-sheets/?utm_source=groupby1.substack.com"&gt;Google Sheets&lt;/a&gt; Integration, as well as the usual Postgres, Hubspot, Salesforce, Google/Facebook ads etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storing/&lt;/strong&gt; The stack described orients towards &lt;a href="https://cloud.google.com/bigquery/?utm_source=groupby1.substack.com"&gt;BigQuery&lt;/a&gt; or &lt;a href="https://snowflake.com/?utm_source=groupby1.substack.com"&gt;Snowflake&lt;/a&gt;, with PostgreSQL also a feasible option. I prefer the scale / cost model of Snowflake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Snowflake scales up to enterprise but starts from $2/credit, so can be a very cost-effective bet with typical small loads running around 2-5 credits per day. This can get very expensive if you don’t manage it carefully with limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PostgreSQL will require a migration in the future, so unless you are very cost sensitive, the cost / benefit generally leans in favour of Snowflake.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I have a &lt;a href="https://github.com/mattarderne/snowflake_init/blob/master/first_run.sql"&gt;simple SQL script&lt;/a&gt; used to setup Snowflake ready to use for a POC, and I like to use &lt;a href="https://github.com/snowflakedb/SnowAlert/blob/master/packs/snowflake_query_pack.sql"&gt;these&lt;/a&gt; &lt;a href="https://github.com/snowflakedb/SnowAlert/blob/master/packs/snowflake_cost_management.sql"&gt;scripts&lt;/a&gt; to track Snowflake credit usage in combination with Dataform assertions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
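
&lt;p&gt;On the limits point, a resource monitor is one way to cap Snowflake spend; a sketch, assuming a warehouse named &lt;code&gt;analytics_wh&lt;/code&gt; and a budget of 5 credits per day:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- warn at 80% of the daily budget, suspend the warehouse at 100%
create or replace resource monitor daily_credit_cap
  with credit_quota = 5
       frequency = daily
       start_timestamp = immediately
  triggers
    on 80 percent do notify
    on 100 percent do suspend;

alter warehouse analytics_wh
  set resource_monitor = daily_credit_cap;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;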

&lt;p&gt;&lt;strong&gt;Distributing/&lt;/strong&gt; This is where business users will interact with and judge the success of your system, so to spend your budget on the rest of the components but cut corners on the distribution tool is a bad idea. That said, BI tools can have expensive annual contracts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://metabase.com/?utm_source=groupby1.substack.com"&gt;Metabase&lt;/a&gt; is a great open-source BI tool and should give you a good place to start. The cost jump is quite severe up to &lt;a href="https://looker.com/?utm_source=groupby1.substack.com"&gt;Looker&lt;/a&gt; / &lt;a href="https://chartio.com/?utm_source=groupby1.substack.com"&gt;ChartIO&lt;/a&gt;, but so is the feature set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;These tools are trickier to migrate from, and so it is reasonable to expect to be locked-in for the mid-term.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transforming/&lt;/strong&gt; This may be premature depending on the level of sophistication of logical transformations required to answer your questions, but at some stage it will make sense to move your transforms to the data warehouse from the BI tool. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The best of breed at this stage is &lt;a href="https://dataform.co/?utm_source=groupby1.substack.com"&gt;Dataform&lt;/a&gt; or &lt;a href="https://www.getdbt.com/?utm_source=groupby1.substack.com"&gt;dbt.&lt;/a&gt; These tools enable software development best practices (git, testing, documentation).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is relatively little involved in adding this early, and significant gains to be had if it is used to build a logical data model from the start.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I have deployed Metabase successfully with https and nice scalability using &lt;a href="https://github.com/mattarderne/metabase"&gt;these Docker scripts&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In future editions I’ll be diving into the above specifics, stay tuned.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Taking the time to properly implement a reasoned and scalable analytics infrastructure is an axe sharpening exercise with benefits that may compound massively over time. Second-order benefits to aim for include increasing the data proficiency of your team, enabling evidence-based decision making and most importantly, increasing alignment.&lt;/p&gt;

&lt;p&gt;Most businesses follow similar patterns, and in survival as in business, preparation is key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OBZP_Eqs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/ccaa6be0-683a-4961-abe6-abec07d6018b_1600x1068.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OBZP_Eqs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/ccaa6be0-683a-4961-abe6-abec07d6018b_1600x1068.jpeg" alt="https://images.unsplash.com/photo-1545476745-9211a9e7cca8?ixlib=rb-1.2.1&amp;amp;q=85&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=srgb"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” - Abe&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
