<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Census</title>
    <description>The latest articles on DEV Community by Census (@getcensus).</description>
    <link>https://dev.to/getcensus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F641750%2F4c1d8626-3e49-424b-8fee-5a51ca9666dd.png</url>
      <title>DEV Community: Census</title>
      <link>https://dev.to/getcensus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/getcensus"/>
    <language>en</language>
    <item>
      <title>Introducing dbt_census_utils: The first dbt macros for data activation</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Tue, 23 May 2023 16:51:37 +0000</pubDate>
      <link>https://dev.to/getcensus/introducing-dbtcensusutils-the-first-dbt-macros-for-data-activation-19mm</link>
      <guid>https://dev.to/getcensus/introducing-dbtcensusutils-the-first-dbt-macros-for-data-activation-19mm</guid>
      <description>&lt;p&gt;Census is excited to present our first dbt package &lt;a href="https://hub.getdbt.com/sutrolabs/census_utils/latest/"&gt;&lt;strong&gt;dbt_census_utils&lt;/strong&gt;&lt;/a&gt;, featuring six macros that bypass the most tedious aspects of readying your data for activation. In the past, we’ve published articles on common data modeling requirements, like &lt;a href="https://docs.getdbt.com/blog/customer-360-view-identity-resolution#step-21-extract-email-domain-from-an-email"&gt;extracting email domains&lt;/a&gt; from email addresses, but why not just…do the transformation for you? 🪄 &lt;/p&gt;

&lt;p&gt;You can peek at the &lt;a href="https://github.com/sutrolabs/dbt_census_utils/tree/main/macros"&gt;package’s source code&lt;/a&gt; if you want to know the nitty-gritty details on how it’s done, but the TL;DR is: &lt;strong&gt;Our new dbt_census_utils package helps you make your data ready to activate in over 160 SaaS tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The package features common macros for &lt;a href="https://www.getcensus.com/blog/what-is-reverse-etl"&gt;Reverse ETL&lt;/a&gt; use cases, such as identifying personal emails vs. business emails, cleaning names for advertising destinations, and standardizing country codes.&lt;/p&gt;

&lt;p&gt;The best part? Even if you don’t use Census today, you can use these macros to save time and teach you new tricks in dbt. 🙌&lt;/p&gt;

&lt;p&gt;Before we dive in, we (the Census Team) have to give a huge shout-out to &lt;a href="https://www.linkedin.com/in/stephen-ebrey-456b264/"&gt;Stephen Ebrey&lt;/a&gt;, founder of Sawtelle Analytics and OA Club Data Champion, for his work building this package with us. &lt;/p&gt;

&lt;h2&gt;
  
  
  dbt packages make your life easier
&lt;/h2&gt;

&lt;p&gt;A data scientist joining a new company can quickly do advanced analysis using Python libraries like pandas, numpy, and matplotlib, but until recently, data engineers and analysts generally wrote a ton of custom SQL. This caused data teams to expand in size to service all the use cases that a modern data stack can provide, like business intelligence, product analytics, marketing attribution, and data activation.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/docs/build/packages"&gt;dbt packages&lt;/a&gt; make it easier to build, manage, and test data transformations using open-source code. This brings analytics engineering in SQL up to par with data science and software engineering by letting you skip solving the same old problems as everyone else. Get to the juicy particulars of your business – faster!&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s unique about these dbt packages for data activation?
&lt;/h2&gt;

&lt;p&gt;Data activation has some specific challenges that traditional BI doesn’t have, like needing to match the customer parameter guidelines of a destination. Misformatting names or emails can cause records to be rejected by the destination’s API, and formatting them too much can interfere with user-matching processes. For instance, if you unnecessarily change João to joao, Google Ads may not realize the user was already shown an ad.&lt;/p&gt;

&lt;p&gt;Also, data activation can confuse operational teams if we’re not clearly delineating internal vs. external users. For example, internal employees must be differentiated from genuine external users, even when both use work emails or come from the same email domain. You don’t want a salesperson to get excited about a &lt;a href="https://www.getcensus.com/blog/product-qualified-leads-pqls-the-data-backed-process-to-help-you-increase-revenue"&gt;PQL&lt;/a&gt; in Salesforce when it’s actually just a developer working for your company. We’ve solved these problems before, and we want to pass on our solutions to you!&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful macros for just about anyone
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;extract_email_domain&lt;/code&gt;, &lt;code&gt;is_personal_email&lt;/code&gt;, &lt;code&gt;is_internal&lt;/code&gt;, and &lt;code&gt;get_country_code&lt;/code&gt; macros produce fields that matter both when sending data to third-party systems and in traditional BI. For example, if you have a QA tester who made 30 purchases, you wouldn’t want to rate them as a highly qualified lead in &lt;a href="https://www.salesforce.com/"&gt;Salesforce&lt;/a&gt; or pay to remarket to them on Instagram. Nor would you want to count them in your average LTV in &lt;a href="https://www.tableau.com/"&gt;Tableau&lt;/a&gt; or &lt;a href="https://www.looker.com/"&gt;Looker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x16jFWz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/8z2vhs9svo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x16jFWz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/8z2vhs9svo.png" alt="" width="772" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might not want to prioritize &lt;a href="https://www.zendesk.com/"&gt;Zendesk&lt;/a&gt; requests from &lt;a href="mailto:fred@gmail.com"&gt;fred@gmail.com&lt;/a&gt; as highly as ones from &lt;a href="mailto:ceo@hotnewstartup.com"&gt;ceo@hotnewstartup.com&lt;/a&gt;, and you also might not want to count Gmail addresses when calculating conversion rate in &lt;a href="https://mode.com/"&gt;Mode&lt;/a&gt;. Plus, it’s helpful to link &lt;a href="mailto:sara@client.com"&gt;sara@client.com&lt;/a&gt; with &lt;a href="mailto:amit@client.com"&gt;amit@client.com&lt;/a&gt;, both in HubSpot and in your count of unique customers on the executive dashboard!&lt;/p&gt;
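&lt;p&gt;The underlying logic is simple enough to sketch outside of dbt. Here is an illustrative Python version of what &lt;code&gt;extract_email_domain&lt;/code&gt;, &lt;code&gt;is_personal_email&lt;/code&gt;, and &lt;code&gt;is_internal&lt;/code&gt; do conceptually; the domain list and the &lt;code&gt;company_domain&lt;/code&gt; default below are made-up examples, not the package’s actual lists:&lt;/p&gt;

```python
# Illustrative sketch only -- the real macros compile cross-warehouse SQL.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}

def extract_email_domain(email):
    """Return everything after the last '@', lowercased."""
    return email.rsplit("@", 1)[-1].lower()

def is_personal_email(email):
    """Flag addresses on common personal-email providers."""
    return extract_email_domain(email) in PERSONAL_DOMAINS

def is_internal(email, company_domain="client.com"):
    """Flag addresses on your own (hypothetical) company domain."""
    return extract_email_domain(email) == company_domain

print(is_personal_email("fred@gmail.com"))      # True
print(is_personal_email("ceo@hotnewstartup.com"))  # False
print(extract_email_domain("Sara@Client.com"))  # client.com
```

&lt;p&gt;The real macros do the same kind of classification directly in your warehouse, against a much longer list of personal-email providers.&lt;/p&gt;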

&lt;p&gt;Another common use case is that you may have some sources that report countries in names and some in codes, and you want to standardize them all with &lt;code&gt;get_country_code&lt;/code&gt; before building reports or sending them off to destinations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TS8lZVJN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/uvo48otm2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TS8lZVJN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/uvo48otm2j.png" alt="" width="576" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;get_country_code&lt;/code&gt;, the entries above become:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2IKKWVEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/4np55dr3z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2IKKWVEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/4np55dr3z5.png" alt="" width="576" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use macros to standardize user data specifically for services like Google and Facebook Ads
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;clean&lt;/code&gt; macro lowercases names and strips spaces, symbols, and special characters, so records won’t get rejected by destinations like Facebook audiences. This is very useful with Census or your own reverse ETL pipelines to these destinations, though traditional BI does not require São Paulo to become saopaulo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l7dF5mYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/f2w4pg5xmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l7dF5mYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/5/23/f2w4pg5xmv.png" alt="" width="800" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful macros for GA4 customers
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;parse_ga4_client_id&lt;/code&gt; macro was a request from a Census customer who had trouble standardizing their GA4 client IDs. It lets you separate out the unique ID part of the client ID or the timestamp part (which represents the user’s first visit). All GA4 customers can benefit from this cleaning logic, so we’ve added it to the package as well!&lt;/p&gt;
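&lt;p&gt;As an illustration, assuming the common “random_id.first_visit_timestamp” shape of a GA4 client ID, the split can be sketched in Python like so (this is not the macro’s actual implementation, and real IDs have edge cases this ignores):&lt;/p&gt;

```python
from datetime import datetime, timezone

def parse_ga4_client_id(client_id):
    """Split a 'random_id.unix_timestamp' client ID into its two parts."""
    unique_part, _, ts_part = client_id.partition(".")
    first_visit = datetime.fromtimestamp(int(ts_part), tz=timezone.utc)
    return unique_part, first_visit

uid, first_seen = parse_ga4_client_id("1234567890.1656581234")
print(uid)         # 1234567890
print(first_seen)  # the user's first visit, as a UTC datetime
```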

&lt;h2&gt;
  
  
  A look behind the scenes
&lt;/h2&gt;

&lt;p&gt;If you’re curious about dbt packages, here’s a brief overview of how we created this package:&lt;/p&gt;

&lt;h3&gt;
  
  
  Info gathering
&lt;/h3&gt;

&lt;p&gt;Stephen combed through dozens of dbt packages on the &lt;a href="https://hub.getdbt.com/"&gt;package hub&lt;/a&gt; to see what problems were already solved, who made the best packages, and how many were company-sponsored vs. community-contributed. He looked at how dbt packages implemented tests with methods like &lt;code&gt;dbt_utils.equality&lt;/code&gt; and how they supported multiple data warehouses with &lt;a href="https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch"&gt;dispatch&lt;/a&gt; and &lt;a href="https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros"&gt;cross-database macros&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some smaller packages are only tested on one or two warehouses. We decided to support Redshift, Snowflake, and BigQuery, which meant setting up multiple test warehouses and dbt profiles, and learning how to link and unlink different dbt adapters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature specifications
&lt;/h3&gt;

&lt;p&gt;Stephen first got the idea of creating a dbt package for Census when he got some advice on how to get city names accepted by Facebook: a very ugly line of SQL that looks like the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;regexp_replace(translate(lower(name),'ůțąðěřšžųłşșýźľňèéëêēėęàáâäæãåāîïíīįìôöòóœøōõûüùúūñńçćč','utaoerszutssyzlneeeeeeaaaaaaaaiiiiiioooooooouuuuunnccc'),'[^a-z]','')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With that, he thought, &lt;em&gt;“Surely, a&lt;/em&gt; &lt;code&gt;clean()&lt;/code&gt; &lt;em&gt;macro would be so much more elegant!”&lt;/em&gt;  Then he started seeing other uses for macros from Census articles and working with clients. Stephen also made sure they weren’t already provided in dbt_utils or any other packages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;Writing a dbt macro is simple. Making sure it works for multiple warehouses, on the other hand, is tricky! Redshift, BigQuery, and Snowflake handle arrays very differently (*cough* Redshift does it the worst *cough*). And with the &lt;code&gt;is_internal&lt;/code&gt; macro, we needed to use variables so customers could specify their domain and any tables they had with lists of internal users.  &lt;/p&gt;

&lt;p&gt;With the &lt;code&gt;extract_email_domain&lt;/code&gt;, &lt;code&gt;is_personal_email&lt;/code&gt;, &lt;code&gt;is_internal&lt;/code&gt;, and &lt;code&gt;get_country_code&lt;/code&gt; macros, we had to compile lists of standardized country name formats and common personal email domains. Finally, Stephen ensured each macro had an integration test so it worked on all platforms and that changes did not cause them to break.&lt;/p&gt;

&lt;h3&gt;
  
  
  Documentation
&lt;/h3&gt;

&lt;p&gt;A dbt package needs a readme, dbt docs, an entry on the dbt package hub, and issue and PR templates! As a final step, Stephen did the holy work of writing a ton of documentation. We’ll release more packages and a lot of this work in &lt;strong&gt;dbt_census_utils&lt;/strong&gt; can be reused, but if you’re trying to write a package yourself, leave time for this step.&lt;/p&gt;

&lt;h2&gt;
  
  
  We want your feedback
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;dbt_census_utils&lt;/strong&gt; is our first dbt package, and you can request other macros that help wrangle data for use with Census via a &lt;a href="https://github.com/sutrolabs/dbt_census_utils/issues"&gt;GitHub Issue&lt;/a&gt;, or try implementing them yourself in a &lt;a href="https://github.com/sutrolabs/dbt_census_utils/pulls"&gt;pull request&lt;/a&gt;. We also monitor for questions and comments in the #tools-census channel in dbt Slack, as well as on the &lt;a href="https://www.operationalanalytics.club/"&gt;Operational Analytics Club&lt;/a&gt; Slack.&lt;/p&gt;

&lt;p&gt;📑&lt;strong&gt;Check out the package&lt;/strong&gt; &lt;a href="https://github.com/sutrolabs/dbt_census_utils/"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;, then put &lt;strong&gt;dbt_census_utils&lt;/strong&gt; to the test for yourself with Census. &lt;a href="http://getcensus.com/demo"&gt;&lt;strong&gt;Book a demo&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;to get started&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you’d like to learn more about the work that went into building this package, and some of the challenges Stephen worked through in the process, &lt;a href="https://www.getcensus.com/events/learn-dbt-build-a-package-profit"&gt;&lt;strong&gt;join us for his webinar June 5 at 10 am PT&lt;/strong&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;👉 Want to chat about this package? Got other burning data questions? &lt;strong&gt;Head over to the&lt;/strong&gt; &lt;a href="http://operationalanalytics.club"&gt;&lt;strong&gt;Operational Analytics Club&lt;/strong&gt;&lt;/a&gt; to join tons of other data-savvy folks in their quest for more knowledge.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>NoSQL isn't dead, but it left a lasting legacy</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Wed, 19 Apr 2023 19:02:04 +0000</pubDate>
      <link>https://dev.to/getcensus/nosql-isnt-dead-but-it-left-a-lasting-legacy-390m</link>
      <guid>https://dev.to/getcensus/nosql-isnt-dead-but-it-left-a-lasting-legacy-390m</guid>
      <description>&lt;p&gt;Once upon a time, NoSQL databases like MongoDB took the tech world by storm. They promised scalability and flexibility in handling unstructured data, and developers eagerly jumped on the bandwagon. Fast forward to today, and the hype surrounding NoSQL has certainly leveled off.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XYeZ6ePu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/pictures/6df35652-1b94-4116-8ae3-be0002a4e7f0" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XYeZ6ePu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/pictures/6df35652-1b94-4116-8ae3-be0002a4e7f0" alt="" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;MongoDB's growth has plateaued in the&lt;/em&gt; &lt;a href="https://db-engines.com/en/ranking_trend"&gt;&lt;em&gt;DB-Engines&lt;/em&gt;&lt;/a&gt; &lt;em&gt;ranking, but it still ranks 4th overall&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But its influence on the database landscape is undeniable. NoSQL has left a lasting legacy by shaping the way modern databases handle semi-structured data.&lt;/p&gt;

&lt;p&gt;In this blog post, we will explore the lasting impact of NoSQL on today's database systems, focusing on the rise of semi-structured data storage. We will discuss various types of semi-structured data and compare the approaches taken by popular database platforms such as Postgres, Redshift, BigQuery, Snowflake, Databricks, DuckDB, and SQLite.&lt;/p&gt;




&lt;p&gt;Traditional SQL databases share a very consistent and familiar approach to storing data. Columns (with types) and rows form a table, and each “cell” holds a value of the column’s type for that individual row. Traditionally, those cell values have been individual values: a boolean, an integer, a piece of text. In computer lingo, this is called a scalar value, meaning it only holds one value at a time, and the schema of the table enforces the type. The S in SQL stands for Structured, and this is exactly where the structure comes in.&lt;/p&gt;

&lt;p&gt;This approach has worked for decades, but if you’ve ever built databases, you’ll know there are times when you actually need more than one value. Maybe a user has multiple email addresses. Maybe you want to store a list of tags. This type of data is easy to store in programming languages because they support arrays and hashmaps to store collections of values. But back in SQL land, it wasn’t that easy.&lt;/p&gt;

&lt;p&gt;In recent years though, modern databases and warehouses have relaxed this constraint. It’s now possible, and sometimes even preferable, to store those arrays and maps in a database. Why the change of heart? Well, a few things happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  In the early 2010s, there was a huge rise in Document or NoSQL databases. They offered a lot of benefits with some drawbacks. And in particular, the recognition of the value of not always knowing schemas up front, particularly when dealing with data you may not control.&lt;/li&gt;
&lt;li&gt;  More recently, we’ve seen rapid adoption of data lake architectures where data is written to file stores and then queried (and thus typed) at read time. That data can be in a lot of formats, fancy ones like Avro or Parquet, or simple ones like JSON or the humble CSV. All of those (sans CSV) make it easy to include arrays and hashmaps, so data lakes need to be able to accept and query over that data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With data lakes and document databases, there was still a lot of excitement to just use SQL, so SQL and these sets of values needed to find a way to work together. That brings us to the world we find ourselves in today, one of semi-structured data.&lt;/p&gt;

&lt;p&gt;It will also come as no surprise that different services approach semi-structured data in slightly different ways. Today I want to cover all of the different techniques you’ll need.&lt;/p&gt;

&lt;h2&gt;
  
  
  New items on the menu
&lt;/h2&gt;

&lt;p&gt;There are four new types that can be used to represent semi-structured data. Most services offer a subset of the options, with Postgres and DuckDB the only two offering all the options.&lt;/p&gt;

&lt;p&gt;Some types such as STRUCTS and OBJECTS look identical at first glance but they have some subtle differences so it’s worth knowing exactly which one you’re working with. &lt;strong&gt;And if you’re only going to look at one, skip to the end and read about Variants.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Arrays
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ydv2b__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.00.45PM_U2R7ozh85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ydv2b__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.00.45PM_U2R7ozh85.png" alt="" width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Arrays represent a list of values. The &lt;em&gt;type&lt;/em&gt; of the values in those arrays varies depending on the service you’re using. For example, BigQuery and Databricks require an explicit type, Postgres can optionally have a type, and Snowflake assumes arrays contain variants. Redshift takes it one step further and doesn’t have an explicit array type, recommending its version of a variant instead.&lt;/p&gt;

&lt;p&gt;One of the most common ways to create an array is by using an Array aggregate function like you would any other aggregate function with a GROUP BY. But in this case, instead of COUNTing or SUMing the values, you’ll end up with an array of values for each of the groups.&lt;/p&gt;
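&lt;p&gt;You can see the same pattern in miniature with Python’s built-in SQLite, whose &lt;code&gt;json_group_array&lt;/code&gt; aggregate plays the role of a warehouse’s ARRAY_AGG:&lt;/p&gt;

```python
import sqlite3

# Group rows per customer and aggregate the items into an array,
# rather than COUNTing or SUMming them.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, item TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", "book"), ("ann", "pen"), ("bob", "mug")],
)
rows = con.execute(
    "SELECT customer, json_group_array(item) "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # one JSON array of items per customer
```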

&lt;h3&gt;
  
  
  Objects (aka Maps aka …)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HAikqOsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.00.55PM_kn3DzG8lR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HAikqOsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.00.55PM_kn3DzG8lR.png" alt="" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on your school of programming, you may call this an object, a map, a hash, or a dictionary. All of these names represent the same thing: a set of key/value pairs. An important feature of Objects though is that there’s no specific restriction on the types. If you want all of your objects to have the exact same set of keys every time, you’re probably looking for the next option.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;1️⃣ DuckDB’s Map is part way to a struct in that it requires all keys to be the same type, and all values to be the same type, but not that keys and values be the same type.&lt;br&gt;&lt;br&gt;
2️⃣ I have never seen a Postgres hstore in the wild. Do with that knowledge what you will.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Structs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9bJ-4O0L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.01.05PM_dMlG3WY5S.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9bJ-4O0L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.01.05PM_dMlG3WY5S.png" alt="" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Structs also act like an object with the additional twist that they have their own pre-defined schemas. Structs come out of the world of C but the pattern is widespread. Postgres calls it a composite type and that’s a good way to think about it: It’s a type made of other types, and the data will always have that shape or combination of types. This is as structured as semi-structured gets.&lt;/p&gt;

&lt;p&gt;A very common example of a struct is a GeoCoordinate which would always contain a lat and long value. It’s so common that some services also have a separate GeoCoordinate type just for this purpose.&lt;/p&gt;

&lt;h3&gt;
  
  
  Variants
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--deH-cUSx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.01.13PM_pJHitv1Im.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--deH-cUSx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.letterdrop.co/images/2023/4/19/getcensus/Screenshot2023-04-19at1.01.13PM_pJHitv1Im.png" alt="" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Variants are the catch-all. If you’re looking for a type without having to think too hard, this is the one. It’s not surprising that most services actually just call this what it’s most used for: JSON.&lt;/p&gt;

&lt;p&gt;If you think about it, a JSON blob can be a lot of different types. Most obviously you might get an object or an array at the root of any blob, and as you navigate into JSON, it may contain strings, numbers, nulls, or booleans (🪦 dates). Variants can represent any of those things and each service usually includes a series of functions to actually figure out what type the variant &lt;em&gt;actually&lt;/em&gt; is.&lt;/p&gt;

&lt;p&gt;It’s worth pointing out SQLite and DuckDB do something a little funky here. They actually just store their JSON as text and only parse it at query time. It shouldn’t matter, but if you run into weird errors where it feels like you’re for some reason manipulating a string, you probably are!&lt;/p&gt;
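&lt;p&gt;A quick demonstration of that text-under-the-hood behavior, using Python’s built-in SQLite:&lt;/p&gt;

```python
import sqlite3

# SQLite stores the JSON blob as plain TEXT and only parses it at
# query time via the JSON1 functions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (payload TEXT)")
con.execute("""INSERT INTO events VALUES ('{"user": "ann", "n": 2}')""")
typ, val = con.execute(
    "SELECT typeof(payload), json_extract(payload, '$.user') FROM events"
).fetchone()
print(typ, val)  # text ann
```

&lt;p&gt;Note that &lt;code&gt;typeof&lt;/code&gt; reports &lt;code&gt;text&lt;/code&gt;, not some dedicated JSON type, which is exactly why stray string manipulation can sneak into your queries.&lt;/p&gt;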

&lt;blockquote&gt;
&lt;p&gt;1️⃣ If you’re in Postgres land, skip over JSON and go straight to JSONB.&lt;br&gt;&lt;br&gt;
2️⃣ It’s a bit sad that Databricks doesn’t have a variant option here. Let me know if I’m just missing it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When to go semi-structured
&lt;/h2&gt;

&lt;p&gt;Now that you’re familiar with all the new types available, it’s useful to know &lt;em&gt;when&lt;/em&gt; to use them. As you’ve seen, each type can be used for a few different use cases, but arguably, you could avoid using them completely with some clever data structure design. So when does it make sense to reach for them?&lt;/p&gt;

&lt;p&gt;The most obvious answer is when dealing with JSON. If you’re loading JSON data, particularly data that might have a structure that changes over time or has arbitrary nesting, it’s really nice to just not think about structures and types. Another way to say this is when you’re dealing with data where you don’t control the structure or expect the structure to change.&lt;/p&gt;

&lt;p&gt;The other very reasonable use is when a single record or row needs to provide multiple values for a single column. A list of tags is one of the most common cases. Sure, you could store a comma-separated string, but it’s more foolproof to let your data service store the list for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping it up
&lt;/h2&gt;

&lt;p&gt;Hopefully, that gives you a much better sense of all the semi-structured types now at your disposal. There are lots to choose from, but in most cases, you should just reach for the Variant and you’ll be in good shape (unless you’re in Databricks, sadly).&lt;/p&gt;

&lt;p&gt;We didn’t touch on the patterns for querying semi-structured data here, but all the doc links should point to examples, and if you’d like to see a post on that, please let us know. And, of course, if you have questions about the right types for your particular use case, structured or semi, swing by the &lt;a href="https://www.operationalanalytics.club/"&gt;&lt;strong&gt;OA Club&lt;/strong&gt;&lt;/a&gt; and ask the experts.&lt;/p&gt;

&lt;p&gt;📖 Want to read more like this? Check out our &lt;a href="https://census.dev/blog/"&gt;census.dev blog&lt;/a&gt; that captures tales from the Census Engineering flock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; Enjoy working with SQL and data? We’re hiring. &lt;a href="https://www.getcensus.com/careers?utm_source=Mj8bGX3R85"&gt;&lt;strong&gt;Come work with us&lt;/strong&gt;&lt;/a&gt;!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to solve the unexpected &lt;EOF&gt; syntax error in Snowflake</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Fri, 02 Dec 2022 21:20:36 +0000</pubDate>
      <link>https://dev.to/getcensus/how-to-solve-the-unexpected-syntax-error-in-snowflake-4jc1</link>
      <guid>https://dev.to/getcensus/how-to-solve-the-unexpected-syntax-error-in-snowflake-4jc1</guid>
      <description>&lt;p&gt;You might encounter a number of different error messages during your coding process. While sometimes annoying, these error messages are intended to help you identify and fix issues with your programming with helpful descriptions, fulfilling a crucial role in your debugging workflow. However, some error messages can be confusing due to their ambiguous text and a lack of information on the possible sources, like Snowflake’s &lt;code&gt;&amp;lt;EOF&amp;gt;&lt;/code&gt; syntax error. 😕&lt;/p&gt;

&lt;p&gt;In this article, you'll learn what an &lt;code&gt;&amp;lt;EOF&amp;gt;&lt;/code&gt; error in Snowflake is, what happens when this error occurs in Snowflake, some possible sources of this error, and how to find and fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the &lt;code&gt;&amp;lt;EOF&amp;gt;&lt;/code&gt; syntax error?
&lt;/h2&gt;

&lt;p&gt;An &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; syntax error in Snowflake simply means that your SQL compiler has hit an obstacle in parsing your code. Specifically, it reached an unexpected end of file (EOF) while processing: your code deviated from the expected syntax, and the compiler posits that something is missing or incomplete. 🛑&lt;/p&gt;

&lt;p&gt;This error can be due to any single statement in your blocks of code, and your SQL compiler will give you positional information on the flawed statement if it can. When debugging this error, you'll typically find incomplete parameters, functions, or loop statements that caused the termination of your process. Generally, your compiler is programmed to expect certain sequences in code, like closing brackets and statements. When those are left hanging without any more code to complete them, the compiler raises an &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; syntax error.&lt;/p&gt;
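&lt;p&gt;You can reproduce the same class of failure in miniature with Python’s built-in SQLite: a statement that ends before its closing parenthesis makes the parser fail at the end of the input. (The exact error text differs from Snowflake’s, but the cause is the same.)&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
try:
    # Missing the closing ')' -- the parser runs out of input.
    con.execute("SELECT 1 WHERE (1 = 1")
except sqlite3.OperationalError as exc:
    print("parse failed:", exc)
```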

&lt;h2&gt;
  
  
  Solving the &lt;code&gt;&amp;lt;EOF&amp;gt;&lt;/code&gt; syntax error in Snowflake
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; syntax error can occur from most SQL commands run in Snowflake. The following tutorial goes through the process of creating a data set in Snowflake and explores examples of this error in SQL queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;To follow along with this article, you'll need the following:&lt;/p&gt;

&lt;p&gt;❄️ A Snowflake account.&lt;/p&gt;

&lt;p&gt;❄️ A local installation of &lt;a href="https://docs.snowflake.com/en/user-guide/snowsql-install-config.html"&gt;SnowSQL&lt;/a&gt;, a CLI tool that allows Windows, macOS, and Linux users to connect their systems to their Snowflake instances. &lt;/p&gt;

&lt;p&gt;For this article, you'll use the &lt;code&gt;gearbox_data.csv&lt;/code&gt; file from &lt;a href="https://github.com/Soot3/snowflake_syntax_error_unexpected_eof_demo.git"&gt;this GitHub repository&lt;/a&gt;. The data set is a sample from a &lt;a href="https://www.kaggle.com/datasets/brjapon/gearbox-fault-diagnosis"&gt;Kaggle project&lt;/a&gt; that simulated the operation of a gearbox with healthy and broken teeth across different load variations.&lt;/p&gt;

&lt;p&gt;Download the data set using the following CLI command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://github.com/Soot3/snowflake_syntax_error_unexpected_eof_demo/blob/main/gearbox_data.csv 
--output gearbox_data.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Connecting SnowSQL to Snowflake
&lt;/h3&gt;

&lt;p&gt;This process allows you to set up communication between your CLI and your Snowflake instance. You can then use SnowSQL CLI commands to interact with Snowflake and its resources. Connect to your Snowflake instance using the following command in your CLI:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;snowsql -a &amp;lt;account-name&amp;gt; -u &amp;lt;username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, input your Snowflake password to finish logging in to the Snowflake CLI. If you're using a new account, your Snowflake instance will be bare, with no Snowflake warehouse, database, or schema resources set up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FP1x6ZrX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/e7d334b2-e223-46e4-8627-c19b39047e0e" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FP1x6ZrX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/e7d334b2-e223-46e4-8627-c19b39047e0e" alt="Snowflake CLI" width="464" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Snowflake CLI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But don’t worry, we’ll walk through creating those resources using the relevant SnowSQL commands. 🚶‍♂️&lt;/p&gt;

&lt;h3&gt;
  
  
  Utilizing SnowSQL for Your SQL Commands
&lt;/h3&gt;

&lt;p&gt;After connecting SnowSQL to your Snowflake instance, you can utilize the tool to execute the necessary queries and SQL operations. Your first task is to create a database, schema, and warehouse where you can run queries for your resources.&lt;/p&gt;

&lt;p&gt;Create a database with the following SQL command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace database sf_errors;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates a database named &lt;code&gt;sf_errors&lt;/code&gt;, automatically creates a public schema, and sets the new database as the default one for your current session.&lt;/p&gt;

&lt;p&gt;You can check the database and schema name you are using with the following SQL command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select current_database(), current_schema();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create a warehouse, defining the resources to be used for your queries, with the following SQL command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace warehouse sf_errors_wh with
  warehouse_size='X-SMALL'
  auto_suspend = 180
  auto_resume = true
  initially_suspended=true;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the database, schema, and warehouse created, all that's left is to load in data tables so you can put the resources you've established to work in SQL queries and commands.&lt;/p&gt;

&lt;p&gt;You'll be using the &lt;code&gt;gearbox_data.csv&lt;/code&gt; file you downloaded earlier. To input this data set into your Snowflake account, first create an empty table with the necessary structure and data types.&lt;/p&gt;

&lt;p&gt;You can do so, using the downloaded data set's columns, with the following SQL command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace table sf_table (
  a1 double,
  a2 double,
  a3 double,
  a4 double,
  load int,
  failure int
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your next step is to populate the empty table you created with the CSV file's data. You can do this from the SnowSQL CLI with the &lt;code&gt;PUT&lt;/code&gt; command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unexpected '&amp;lt;EOF&amp;gt;' syntax errors and fixes
&lt;/h3&gt;

&lt;p&gt;Up until this point, we've been running SQL commands that have been parsed by the SQL compiler and executed once validated. This validation process is basically the compiler running through checks based on the SQL syntax, the defined variables, and the SQL keywords used. You'll now send in a command that the compiler recognizes as incomplete according to its standard syntax.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;PUT&lt;/code&gt; command has a particular syntax that also depends on the system environment the SnowSQL command is running in (Linux/Windows). Generally, the &lt;code&gt;PUT&lt;/code&gt; command requires the directory of the file you want to upload and the location in Snowflake where the data will be uploaded.&lt;/p&gt;

&lt;p&gt;Try out this command with SnowSQL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;put file://&amp;lt;directory-of-your-data&amp;gt;
/* Example
PUT file://C:\Users\soote\Documents\Github\snowflake_error_article\gearbox_data.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fp7UMyRI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/5355d918-5a42-4492-9199-e7986e860875" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fp7UMyRI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/5355d918-5a42-4492-9199-e7986e860875" alt=" raw `&amp;lt;EOF&amp;gt;` endraw  error" width="880" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; error&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see, running this command will generate an &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; syntax error, since the compiler expects all the required parameters for the &lt;code&gt;PUT&lt;/code&gt; command (directory of the data, directory in Snowflake) immediately after it starts parsing the command.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: You might need to use&lt;/em&gt; &lt;strong&gt;&lt;em&gt;Ctrl + Enter&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;rather than&lt;/em&gt; &lt;strong&gt;&lt;em&gt;Enter&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;alone to run this command, as by default, SnowSQL requires a closing&lt;/em&gt; &lt;em&gt;&lt;code&gt;;&lt;/code&gt;&lt;/em&gt; &lt;em&gt;before executing commands.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can fix this error by adding in the missing required parameter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;put file://&amp;lt;directory-of-your-data&amp;gt; @~
/* Example
PUT file://C:\Users\soote\Documents\Github\snowflake_error_article\gearbox_data.csv @~;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0-R04Hm---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/cfbef4ad-fbcc-464f-9c46-2b664aa63b17" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0-R04Hm---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/cfbef4ad-fbcc-464f-9c46-2b664aa63b17" alt="Fixed query" width="880" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fixed query&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@~&lt;/code&gt; in SnowSQL refers to your user stage, the internal stage Snowflake automatically provides for each user. Think of it as your default working location for uploaded files, much like &lt;code&gt;pwd&lt;/code&gt; reflects your current working directory in a shell.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; error can occur with all commands that are missing their required syntax. For instance, the earlier &lt;code&gt;create table&lt;/code&gt; command can generate an error if the closing bracket and semicolon are left out:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace table sf_table_error (
  a1 double,
  a2 double,
  a3 double,
  a4 double,
  load int,
  failure int
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SOs_CXhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/105be417-cf37-47bc-ad98-d760a7f10b35" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SOs_CXhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/105be417-cf37-47bc-ad98-d760a7f10b35" alt="Create table error" width="779" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Create table error&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are many ways a command can be left incomplete and flagged as such by the SQL compiler, so it's always important to confirm the syntax of the commands you're using. If you encounter an &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; syntax error, go through your statement line by line and ensure that every required element of the command is present.&lt;/p&gt;
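&lt;p&gt;A small balance check can pre-screen a statement before you submit it. The helper below is only a sketch, not a real SQL parser (it ignores comments, escaped quotes, and dialect quirks), but it catches the most common culprits behind this error:&lt;/p&gt;

```python
# Minimal pre-flight check for a SQL statement: flag unbalanced brackets
# and unterminated quotes before the compiler ever sees them.
PAIRS = {")": "(", "]": "[", "}": "{"}

def find_unbalanced(sql: str):
    stack, quote = [], None
    for i, ch in enumerate(sql):
        if quote:                      # inside a string literal
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch in "([{":
            stack.append((ch, i))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                return f"unexpected {ch!r} at position {i}"
            stack.pop()
    if quote:
        return f"unterminated {quote} quote"
    if stack:
        ch, i = stack[-1]
        return f"unclosed {ch!r} opened at position {i}"
    return None

print(find_unbalanced("create table t (a int"))  # unclosed '(' opened at position 15
print(find_unbalanced("select (1);"))            # None
```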

&lt;p&gt;The error can also surface in &lt;a href="https://stackoverflow.com/questions/63626178/sql-compilation-error-syntax-error-line-5-at-position-157-unexpected-eof"&gt;CTEs&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/answers/questions/860180/i-am-getting-below-error-i-am-calling-snowflake-st.html"&gt;stored procedures&lt;/a&gt;, &lt;a href="https://youtrack.jetbrains.com/issue/DBE-9167"&gt;pipes&lt;/a&gt;, &lt;a href="https://snowflakecommunity.force.com/s/feed/0D53r0000BXXAnpCQH"&gt;block scripts&lt;/a&gt;, and even plain &lt;a href="https://stackoverflow.com/questions/72204412/eof-error-when-using-sql-query-in-snowflake"&gt;&lt;code&gt;SELECT&lt;/code&gt; statements&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting your skills to the test 💻
&lt;/h2&gt;

&lt;p&gt;Error messages are helpful tools for your debugging processes, as they tell you exactly what obstacles were encountered in your code and where to find them. Sometimes, they’re direct pointers to the issues that plague your system. But sometimes, they’re just confusing. 🤷&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; syntax error, like any other syntax error, ultimately pushes you toward well-formed, standards-compliant code. At its core, it's about whether the syntax of your code matches what the compiler is programmed to expect. In this article, you learned more about the &lt;code&gt;unexpected '&amp;lt;EOF&amp;gt;'&lt;/code&gt; syntax error, its causes, and how to debug and fix it.&lt;/p&gt;

&lt;p&gt;📈 &lt;strong&gt;Want to learn more SQL tips and tricks?&lt;/strong&gt; Check out our &lt;a href="https://www.operationalanalytics.club/events/workshop-ergest-xheblati"&gt;free SQL workshop&lt;/a&gt; with Ergest Xheblati, author of &lt;a href="https://ergestx.gumroad.com/l/sqlpatterns"&gt;Minimum Viable SQL Patterns&lt;/a&gt;. The workshop was offered by the &lt;a href="https://www.operationalanalytics.club/"&gt;Operational Analytics Club&lt;/a&gt;, a great place to learn and chat with fellow data practitioners. Check it out and we'll see you in the OA Club!&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Want to start syncing your data from Snowflake to all your business tools?&lt;/strong&gt; &lt;a href="https://www.getcensus.com/demo"&gt;Book a demo&lt;/a&gt; with a &lt;a href="http://getcensus.com"&gt;Census&lt;/a&gt; product specialist to learn how we work for your specific operational analytics needs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to solve the SQL compilation error: "Missing column specification"</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Wed, 30 Nov 2022 23:14:17 +0000</pubDate>
      <link>https://dev.to/getcensus/how-to-solve-the-sql-compilation-error-missing-column-specification-1nob</link>
      <guid>https://dev.to/getcensus/how-to-solve-the-sql-compilation-error-missing-column-specification-1nob</guid>
      <description>&lt;p&gt;Data analytics and business intelligence have grown to be key considerations for most businesses in today's market. Organizations are now investing more in their data technologies, and there's an increasing focus on being data-driven. Unsurprisingly, Snowflake has emerged as a first-choice data warehousing technology solution for many, greatly enhancing the development experience when working on SQL-heavy tasks. 🏆&lt;/p&gt;

&lt;p&gt;Although working with SQL on Snowflake is largely similar to working with any other SQL platform, there are some cases where the Snowflake environment differs from the rest. One such case is running into a missing column specification error, which arises due to the distinctive way that Snowflake's query processing engine evaluates and parses SQL queries. &lt;/p&gt;

&lt;p&gt;In this article, you'll take a closer look at this error to better understand what it is, how it arises, and how it can be resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  For starters, what’s a SQL compilation error?
&lt;/h2&gt;

&lt;p&gt;Snowflake, like other database platforms, works with SQL and has implemented its own version of a SQL query engine. Although Snowflake largely follows the standard SQL syntax, it has a few modifications to the syntax and semantics when compared to databases such as &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;Postgres&lt;/a&gt; or &lt;a href="https://www.mysql.com/" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;As such, queries submitted to Snowflake for execution are compiled and validated to ensure they conform to Snowflake's query syntax rules. Like any other language, SQL conforms to a set of lexical usage rules and semantics. So, when a query is submitted to the query processing engine, it undergoes a process called compilation, during which the SQL statements undergo syntax checks and semantic checks to ensure compliance with these rules. &lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Syntax Checks&lt;/strong&gt; ensure that keywords are spelled correctly, statements are terminated with delimiters, and there aren't any unsupported formats in the query.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Semantic Checks&lt;/strong&gt; ensure that the objects such as tables and columns used within the query are valid and available to the user.&lt;/p&gt;

&lt;p&gt;Any syntax or semantic failure during the compilation process results in a compilation error.&lt;/p&gt;
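&lt;p&gt;Both kinds of checks are easy to observe in practice. As an illustration (using Python's built-in SQLite bindings as a stand-in for Snowflake's engine), a misspelled keyword fails the syntax check, while a reference to a table that doesn't exist fails the semantic check:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Syntax check failure: the misspelled keyword is rejected while parsing.
try:
    con.execute("selct 1")
except sqlite3.OperationalError as exc:
    print("syntax check:", exc)

# Semantic check failure: valid syntax, but the referenced object is missing.
try:
    con.execute("select * from missing_table")
except sqlite3.OperationalError as exc:
    print("semantic check:", exc)
```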

&lt;h2&gt;
  
  
  What causes "missing column specification" errors in particular?
&lt;/h2&gt;

&lt;p&gt;"Missing column specification" is a type of SQL compilation error, meaning that this error arises due to some incorrect syntax in the SQL query you're trying to process. 🛑 Specifically, the missing column specification error arises when you use Snowflake's data definition language (DDL) to create a table or create a view (materialized or not).&lt;/p&gt;

&lt;p&gt;If you're creating a new table, view, or materialized view using a select column statement, the Snowflake DDL requires that all the referenced columns are explicitly named with aliases. Here’s how this plays out in practice. 👇&lt;/p&gt;

&lt;h2&gt;
  
  
  Resolving "missing column specification" errors
&lt;/h2&gt;

&lt;p&gt;Before you can reproduce the error on Snowflake, you'll need to set up the SnowSQL client on your machine. &lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up SnowSQL
&lt;/h3&gt;

&lt;p&gt;Log in to your Snowflake account. If you don’t have a Snowflake account, you can create one &lt;a href="https://signup.snowflake.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Follow the appropriate SnowSQL CLI &lt;a href="https://docs.snowflake.com/en/user-guide/snowsql.html" rel="noopener noreferrer"&gt;installation guide&lt;/a&gt; for your system. After installing SnowSQL, execute the following command in the terminal to connect to your Snowflake account:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;snowsql -a &amp;lt;account-identifier&amp;gt;.&amp;lt;aws-region&amp;gt;.aws -u &amp;lt;username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On successful login, you should see the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* SnowSQL * v1.2.23
Type SQL statements or !help
&amp;lt;username&amp;gt;#(no warehouse)@(no database).(no schema)&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a new terminal, clone the following &lt;a href="https://github.com/furqanshahid85-python/sf-resolve-missing-col-spec-error" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/furqanshahid85-python/sf-resolve-missing-col-spec-error.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The data files you'll be using in this tutorial can be found in the &lt;code&gt;data_files&lt;/code&gt; folder in the repo, and the &lt;code&gt;db.sql&lt;/code&gt; file contains the DDL for all the objects you need to create on Snowflake. From &lt;code&gt;db.sql&lt;/code&gt;, execute the following commands in the SnowSQL terminal to create a warehouse, table, and schema in Snowflake:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- DDL for creating warehouse
CREATE OR REPLACE WAREHOUSE DEV_WAREHOUSE1 
WITH WAREHOUSE_SIZE ='XSMALL'
AUTO_SUSPEND        = 60
AUTO_RESUME         = TRUE
INITIALLY_SUSPENDED = TRUE
COMMENT = 'Warehouse for DEV purposes';
USE WAREHOUSE DEV_WAREHOUSE1;

-- DDL for Database creation
CREATE DATABASE IF NOT EXISTS ANALYTICALDB;
USE DATABASE ANALYTICALDB;

-- DDL for schema
CREATE SCHEMA IF NOT EXISTS PUBLIC;
USE SCHEMA PUBLIC;

-- DDL for tables
CREATE OR REPLACE TABLE ANALYTICALDB.PUBLIC.USERS (
USER_ID NUMBER(38,0),
JOINED_AT DATE,
COUNTRY_CODE VARCHAR(16777216)
);

CREATE OR REPLACE TABLE ANALYTICALDB.PUBLIC.BOOKINGS (
ID NUMBER(38,0),
STARTTIME DATE,
DURATION_HOURS NUMBER(38,0),
USER_ID NUMBER(38,0),
STATUS VARCHAR(16777216)
);

CREATE OR REPLACE TABLE ANALYTICALDB.PUBLIC.PAYMENTS (
ID NUMBER(38,0),
CREATED_AT DATE,
PAYMENT_AMOUNT FLOAT,
USER_ID NUMBER(38,0)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your database objects are now created. Next, copy the data from the CSV files in the repo into the database tables. To do this, create Snowflake &lt;a href="https://docs.snowflake.com/en/user-guide/data-load-local-file-system-create-stage.html#user-stages" rel="noopener noreferrer"&gt;stage&lt;/a&gt; and &lt;a href="https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#" rel="noopener noreferrer"&gt;file format&lt;/a&gt; objects. &lt;/p&gt;

&lt;p&gt;Then run the following commands from &lt;code&gt;db.sql&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- DDL for copying csv files
CREATE OR REPLACE STAGE USERS_STAGE;
CREATE OR REPLACE FILE FORMAT USERS_FORMAT TYPE = 'CSV' FIELD_DELIMITER = ',';
PUT file:///tmp/data/Users.csv @USERS_STAGE;
CREATE OR REPLACE STAGE BOOKINGS_STAGE;
CREATE OR REPLACE FILE FORMAT BOOKINGS_FORMAT TYPE = 'CSV' FIELD_DELIMITER = ',';
PUT file:///tmp/data/Bookings.csv @BOOKINGS_STAGE;
CREATE OR REPLACE STAGE PAYMENTS_STAGE;
CREATE OR REPLACE FILE FORMAT PAYMENTS_FORMAT TYPE = 'CSV' FIELD_DELIMITER = ',';
PUT file:///tmp/data/Payments.csv @PAYMENTS_STAGE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, run the following copy commands from &lt;code&gt;db.sql&lt;/code&gt; to copy data into the Snowflake tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY INTO ANALYTICALDB.PUBLIC.USERS FROM @USERS_STAGE;
COPY INTO ANALYTICALDB.PUBLIC.BOOKINGS FROM @BOOKINGS_STAGE;
COPY INTO ANALYTICALDB.PUBLIC.PAYMENTS FROM @PAYMENTS_STAGE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At this point, your database tables look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F59aaae08-3b7a-4cfd-a971-8e1c2aa35773" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F59aaae08-3b7a-4cfd-a971-8e1c2aa35773" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And your table data should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F9f05dfa6-2eb0-4a31-8a85-4e82a79aa36c" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F9f05dfa6-2eb0-4a31-8a85-4e82a79aa36c" width="767" height="894"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solving the "missing column specification" error
&lt;/h3&gt;

&lt;p&gt;Let's look at an example to demonstrate how to resolve this error. In the example, you'll be developing a report to provide insight into the percentage of users, segmented by country, who made their first payment within three days of registration. &lt;/p&gt;

&lt;p&gt;The report is generated by ranking each user's payments, then filtering for &lt;code&gt;payment rank = 1&lt;/code&gt; with a difference of &lt;code&gt;&amp;lt;=3&lt;/code&gt; days between the date of joining and the date of payment. Finally, the percentage of payments made within three days of registration is calculated by dividing the number of users who made a payment within three days of registration by the total number of users. The results are stored in the view &lt;code&gt;user_payments_report&lt;/code&gt;.&lt;/p&gt;
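&lt;p&gt;The report logic can be sketched in plain Python to make the SQL easier to follow. The sample rows below are hypothetical stand-ins for the Users and Payments tables, not data from the repo:&lt;/p&gt;

```python
from datetime import date

# Hypothetical sample rows standing in for the Users and Payments tables.
users = [
    {"user_id": 1, "joined_at": date(2022, 1, 1), "country_code": "US"},
    {"user_id": 2, "joined_at": date(2022, 1, 5), "country_code": "US"},
    {"user_id": 3, "joined_at": date(2022, 1, 2), "country_code": "DE"},
]
payments = [
    {"user_id": 1, "created_at": date(2022, 1, 2)},   # first payment on day 1
    {"user_id": 1, "created_at": date(2022, 1, 20)},  # later payment, higher rank
    {"user_id": 2, "created_at": date(2022, 1, 15)},  # first payment on day 10
]

def payment_report(users, payments, window_days=3):
    # Keep only each user's earliest payment, mirroring payment_rank = 1.
    first_payment = {}
    for p in sorted(payments, key=lambda p: p["created_at"]):
        first_payment.setdefault(p["user_id"], p["created_at"])
    totals = {}
    for u in users:
        registered, early = totals.get(u["country_code"], (0, 0))
        paid = first_payment.get(u["user_id"])
        is_early = paid is not None and (paid - u["joined_at"]).days <= window_days
        totals[u["country_code"]] = (registered + 1, early + int(is_early))
    # Percentage of registered users per country who paid within the window.
    return {cc: early / registered * 100 for cc, (registered, early) in totals.items()}

print(payment_report(users, payments))  # {'US': 50.0, 'DE': 0.0}
```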

&lt;p&gt;Start by executing the following query from &lt;code&gt;db.sql&lt;/code&gt; in the GitHub repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create view user_payments_report as (

SELECT 
     u.country_code,
     COUNT(DISTINCT u.user_id) registered_users,
     COUNT(DISTINCT t.user_id) first_3_days_payment,
     (CAST(COUNT(DISTINCT t.user_id) AS REAL)  / count(DISTINCT u.user_id)) * 100 
FROM 
     Users u 
LEFT join (
            SELECT 
                u.user_id,
                u.joined_at joined_at,
                p.created_at created_at,
                Rank() OVER (partition by u.user_id ORDER BY p.created_at ASC) as payment_rank
            FROM
                Users u
                JOIN Payments p on u.user_id = p.user_id
                WHERE p.created_at &amp;gt;= u.joined_at
        )t
ON u.user_id = t.user_id 
AND t.created_at - t.joined_at  &amp;lt;= 3  
AND payment_rank = 1
GROUP BY
u.country_code
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When you run the above query, it returns the error &lt;strong&gt;&lt;em&gt;SQL compilation error: Missing column specification&lt;/em&gt;&lt;/strong&gt;, like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F00b825a9-c019-4658-a580-50549cce0c06" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F00b825a9-c019-4658-a580-50549cce0c06" alt="Missing column specification error message" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Missing column specification error message&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason for this error is that when you calculate the percentage of users who made a payment, you did not provide a column alias. Snowflake SQL query syntax requires that whenever you create a table, view, or materialized view using a &lt;code&gt;Select column&lt;/code&gt; statement that has calculated fields, such as &lt;code&gt;percentage&lt;/code&gt; in this example, you must add an explicit column alias.&lt;/p&gt;

&lt;p&gt;Since you didn’t provide a column alias while creating a view from the query results, it triggered a syntax check failure.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- missing column alias
(CAST(COUNT(DISTINCT t.user_id) AS REAL)  / count(DISTINCT u.user_id)) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In order to fix the issue, you need to add a column alias for the calculated field, which in this case is &lt;code&gt;percentage&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- column alias added
(CAST(COUNT(DISTINCT t.user_id) AS REAL)  / count(DISTINCT u.user_id)) * 100 percentage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, if you run this query again, it executes successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F9da916e0-0683-45c6-aa21-995da7a27164" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F9da916e0-0683-45c6-aa21-995da7a27164" alt="Query execution success" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Query execution success&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Querying the &lt;code&gt;USER_PAYMENTS_REPORT&lt;/code&gt; view returns the expected results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F9d9d3aa6-5791-4d5a-8a4c-14e425345d5f" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.letterdrop.co%2Fpictures%2F9d9d3aa6-5791-4d5a-8a4c-14e425345d5f" alt="Query results" width="777" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Query results&lt;/em&gt;&lt;/p&gt;
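&lt;p&gt;To see why Snowflake insists on the alias, look at the column name a database reports for an unaliased expression. The sketch below uses Python's SQLite bindings as a stand-in: the raw expression text comes back as the column name, which is not a usable identifier for a table or view definition:&lt;/p&gt;

```python
import sqlite3

# Without an alias, the reported column name is just the expression text.
cur = sqlite3.connect(":memory:").execute("select max(1)")
print(cur.description[0][0])  # max(1)

# With an alias, the column gets a proper name.
cur = sqlite3.connect(":memory:").execute("select max(1) as mp")
print(cur.description[0][0])  # mp
```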

&lt;p&gt;The following scenarios illustrate additional cases in which the "Missing column specification" error will occur, and how to resolve it in each case:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max(payment_amount) from payments;
&lt;h2&gt;
  
  
  --no DDL, no alias, query works
&lt;/h2&gt;

&lt;p&gt;create table max_payment as (select max(payment_amount) from payments);&lt;br&gt;
-- alias not used, error: missing column specification&lt;/p&gt;

&lt;p&gt;create table max_payment as (select max(payment_amount) &lt;strong&gt;as mp&lt;/strong&gt; from payments);&lt;/p&gt;
&lt;h2&gt;
  
  
  -- alias used, works
&lt;/h2&gt;

&lt;p&gt;create view max_payment_view as (select max(payment_amount) from payments);&lt;br&gt;
-- alias not used, error: missing column specification&lt;/p&gt;

&lt;p&gt;create view max_payment_view as (select max(payment_amount) &lt;strong&gt;as mp&lt;/strong&gt; from payments);&lt;/p&gt;
&lt;h2&gt;
  
  
  -- alias used, works
&lt;/h2&gt;

&lt;p&gt;create materialized view max_payment_view as (select max(payment_amount) from payments);&lt;br&gt;
-- alias not used, error: missing column specification&lt;/p&gt;

&lt;p&gt;create materialized view max_payment_view as (select max(payment_amount) &lt;strong&gt;as mp&lt;/strong&gt; from payments);&lt;br&gt;
-- alias used, works&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Putting your newfound knowledge to use 🧠
&lt;/h2&gt;

&lt;p&gt;In this article, you've learned what causes the "Missing column specification" error in Snowflake SQL. To recap, we've covered how Snowflake SQL syntax is different from competing platforms, how one of the differences is that it requires you to explicitly add column aliases for calculated fields when creating tables or views, and we even included a business case illustrating this issue to see how it's fixed. &lt;/p&gt;

&lt;p&gt;Now, as businesses are moving towards being more data-driven, it's more crucial than ever to get data in your central data warehouses operational as soon as possible. This is where tools like &lt;a href="https://getcensus.com" rel="noopener noreferrer"&gt;Census&lt;/a&gt; come into play. 💪 Census provides a comprehensive solution for syncing data between Snowflake and business apps such as Salesforce, Stripe, Mixpanel, and Google Ads (just to name a few).&lt;/p&gt;

&lt;p&gt;Because Census is the leading &lt;a href="https://www.getcensus.com/blog/what-is-reverse-etl" rel="noopener noreferrer"&gt;reverse ETL tool&lt;/a&gt; available, it allows you to move data from Snowflake into all of your tools without needing to spend time setting up and maintaining these integrations. 😮‍💨 Instead, you can focus your time and resources on what matters: Getting value from and taking action on your data.&lt;/p&gt;

&lt;p&gt;👉 Want to start syncing your data? &lt;a href="https://www.getcensus.com/demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with one of our product specialists to learn how we work for your specific operational analytics needs.&lt;/p&gt;

</description>
      <category>french</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>Snowflake and SQLAlchemy tutorial: From installation to example queries</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Wed, 23 Nov 2022 18:05:18 +0000</pubDate>
      <link>https://dev.to/getcensus/snowflake-and-sqlalchemy-tutorial-from-installation-to-example-queries-3d3h</link>
      <guid>https://dev.to/getcensus/snowflake-and-sqlalchemy-tutorial-from-installation-to-example-queries-3d3h</guid>
      <description>&lt;p&gt;&lt;a href="https://www.sqlalchemy.org/"&gt;SQLAlchemy&lt;/a&gt; is one of the most popular libraries to interface with relational databases in Python. Its distinguishing feature is the &lt;a href="https://medium.com/analytics-vidhya/object-relational-mapping-with-sqlalchemy-233da7c0f064"&gt;object-relational mapper&lt;/a&gt; (ORM), which allows software engineers to develop classes and automatically map them to a database.&lt;/p&gt;

&lt;p&gt;Sure, SQLAlchemy is known to support operational databases, but did you know it also supports &lt;a href="https://www.roelpeters.be/glossary/what-is-a-data-warehouse/"&gt;data warehouses&lt;/a&gt; like Snowflake? While ORM use cases are less frequent with data warehouses like Snowflake, SQLAlchemy provides a solid interface for running analytical queries.&lt;/p&gt;

&lt;p&gt;In this tutorial, you’ll learn several different approaches to how to set up SQLAlchemy to interface with a Snowflake data warehouse. We’ll also explore the various pitfalls of these approaches so you can be prepared for &lt;em&gt;anything&lt;/em&gt;. 💪&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake + SQLAlchemy
&lt;/h2&gt;

&lt;p&gt;Since Snowflake is a data warehouse, its main users are analytics professionals. Although many of them will interact with it through a SQL IDE, others will prefer to embed queries in their Python code. &lt;/p&gt;

&lt;p&gt;But while Python is widely used in the data world, most applications require some kind of database technology for storing and processing relational data. That’s where SQLAlchemy comes in. 🦸&lt;/p&gt;

&lt;p&gt;SQLAlchemy is a well-known library for interfacing with all kinds of database technologies like Oracle, Postgres, and of course, Snowflake, offering a unified way to run queries on both operational and analytical databases. Plus, it’s widely supported and vendor-agnostic, which helps organizations avoid software lock-in. &lt;/p&gt;

&lt;p&gt;Here are a few use cases to show just how powerful it is. 👇&lt;/p&gt;

&lt;h3&gt;
  
  
  ORM
&lt;/h3&gt;

&lt;p&gt;To translate the logical representation of your application’s objects into rows that can be stored in a relational database (and back again), you use an object-relational mapper.&lt;/p&gt;

&lt;p&gt;SQLAlchemy is a high-performing and accurate toolkit that’s used in tens of thousands of applications, so it excels in the area of object-relational mapping. Since it was designed with the database administrator (DBA) in mind, generated SQL can be swapped out for hand-optimized statements, making it a flexible solution for most ORM use cases in Python.&lt;/p&gt;
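
&lt;p&gt;As a minimal sketch of what the ORM pattern looks like (using an in-memory SQLite database as a stand-in since it needs no credentials; the class and column names here are invented for illustration):&lt;/p&gt;

```python
# Minimal ORM sketch: map a Python class to a table, insert a row, query it back.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Customer(Base):
    __tablename__ = 'customers'
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine('sqlite:///:memory:')  # stand-in backend
Base.metadata.create_all(engine)  # issues CREATE TABLE from the class definition

with Session(engine) as session:
    session.add(Customer(name='Ada'))
    session.commit()
    print(session.query(Customer).first().name)
```

&lt;p&gt;The same mapping pattern works against any backend SQLAlchemy supports; only the connection string changes.&lt;/p&gt;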

&lt;h3&gt;
  
  
  SQL Engine
&lt;/h3&gt;

&lt;p&gt;Nowadays, most data professionals working in Python use &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt;, a data analysis and manipulation tool centered around the DataFrame. pandas has the built-in capability to translate the results of database queries to a DataFrame, but it requires a connector to a database. While most people are familiar with Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC), you might not know that Snowflake &lt;em&gt;also&lt;/em&gt; has its own proprietary connector.&lt;/p&gt;

&lt;p&gt;The Snowflake connector uses a cursor to retrieve data and put it in a pandas DataFrame object, but there's a more convenient way to do this – one that allows analysts to query Snowflake within their familiar pandas experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using SQLAlchemy with Snowflake
&lt;/h2&gt;

&lt;p&gt;To generate a pandas DataFrame from Snowflake query results in Python, you &lt;em&gt;could&lt;/em&gt; use the Python connector. However, a more straightforward solution is to use a SQLAlchemy engine with pandas’ &lt;code&gt;read_sql&lt;/code&gt; function. Below, both solutions are presented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;p&gt;To interface with Snowflake from Python, you’re going to need some packages. They can be installed with &lt;code&gt;pip install&lt;/code&gt; in your terminal, or in a notebook cell by prefixing the commands with an exclamation mark, as shown in the example below. Note that the versions have been pinned to avoid compatibility issues between the packages:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install 'snowflake-sqlalchemy==1.4.2'
!pip install 'typing-extensions&amp;gt;=4.3.0'
!pip install 'snowflake-connector-python==2.7.9'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once installed, gather your credentials by logging into your Snowflake account and workspace, and set them in the following dictionary object: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;credentials = {
    'account': '&amp;lt;YOUR_ACCOUNT&amp;gt;',
    'user': '&amp;lt;YOUR_USERNAME&amp;gt;',
    'password': '&amp;lt;YOUR_PASSWORD&amp;gt;',
    'database': '&amp;lt;YOUR_DATABASE&amp;gt;',
    'schema': '&amp;lt;YOUR_SCHEMA&amp;gt;',
    'warehouse': '&amp;lt;YOUR_WAREHOUSE&amp;gt;',
    'role': '&amp;lt;YOUR_ROLE&amp;gt;'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Throughout the next sections, we'll be using the &lt;a href="https://docs.snowflake.com/en/user-guide/sample-data-tpch.html"&gt;example Snowflake TPC-H benchmark data set&lt;/a&gt;. The following query calculates an aggregation of revenue and the number of orders per customer and per order status.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query = """
SELECT
    O_CUSTKEY,
    O_ORDERSTATUS,
    SUM(O_TOTALPRICE) AS sum_totalprice,
    COUNT(O_ORDERKEY) AS count_orderkeys
FROM TPCH_SF1.ORDERS 
GROUP BY
    1, 2
LIMIT 10;
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the sections below, we’ll execute this query in two different ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generate a pandas DataFrame from a Snowflake Query with the Snowflake Connector
&lt;/h3&gt;

&lt;p&gt;In this section, we’ll generate a DataFrame by iterating over the rows of a query result. &lt;/p&gt;

&lt;p&gt;To start, import and create a Snowflake connector object, providing it with the values from the &lt;strong&gt;credentials dictionary&lt;/strong&gt; you set earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import snowflake.connector

con = snowflake.connector.connect(
    account = credentials['account'],
    user = credentials['user'],
    password = credentials['password'],
    database = credentials['database'],
    schema = credentials['schema'],
    warehouse = credentials['warehouse'],
    role = credentials['role']
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the next piece of code, we’ll create a &lt;code&gt;generate_df&lt;/code&gt; function. It takes a &lt;a href="https://docs.snowflake.com/en/user-guide/python-connector-example.html#using-cursor-to-fetch-values"&gt;cursor&lt;/a&gt;, a SQL query, and the number of rows to fetch per batch. The function uses the cursor’s &lt;code&gt;fetchmany&lt;/code&gt; method, which returns at most that many rows at a time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_df(cur, query, n):
    cur.execute(query)
    col_names = [col.name for col in cur.description]
    rows = 0
    while True:
        dat = cur.fetchmany(n)
        if not dat:
            break
        df = pd.DataFrame(dat, columns = col_names)
        rows += df.shape[0]
    return df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, execute the &lt;code&gt;generate_df&lt;/code&gt; function with a cursor based on the connection you opened earlier, a query, and the number of rows to fetch per batch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = generate_df(con.cursor(), query, 10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you inspect the contents of your DataFrame, you should see the following results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5riOxG_Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/64c12f75-6520-4daf-bf6b-03c929bd7cb8" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5riOxG_Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/64c12f75-6520-4daf-bf6b-03c929bd7cb8" alt="A DataFrame from a Snowflake query" width="465" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A DataFrame from a Snowflake query&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that the number of connections you can open is finite, so it's good practice to close the connection you opened. Otherwise, you’ll run into issues if you run a large number of queries in a short time span:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    con.close()
except Exception as e:
    print(e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Generate a pandas DataFrame from a Snowflake Query with SQLAlchemy
&lt;/h3&gt;

&lt;p&gt;Alternatively, you could use SQLAlchemy to wrap the Snowflake connector in an easy-to-use interface.&lt;/p&gt;

&lt;p&gt;To do so, we’ll start by loading the required modules. You’ll notice that it not only loads SQLAlchemy, but also a module from the &lt;a href="https://docs.snowflake.com/en/user-guide/sqlalchemy.html"&gt;Snowflake SQLAlchemy toolkit&lt;/a&gt;, which acts as a bridge between SQLAlchemy and Snowflake.&lt;/p&gt;

&lt;p&gt;Next, we’ll create a &lt;a href="https://docs.sqlalchemy.org/en/14/core/engines.html"&gt;SQLAlchemy engine&lt;/a&gt; containing all the information that SQLAlchemy needs to interface with a database technology (which in this case is Snowflake). The engine is created with &lt;a href="https://docs.snowflake.com/en/user-guide/sqlalchemy.html#connection-string-examples"&gt;a Snowflake connection string&lt;/a&gt;, constructed by the &lt;code&gt;URL&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(URL(
    account = credentials['account'],
    user = credentials['user'],
    password = credentials['password'],
    database = credentials['database'],
    schema = credentials['schema'],
    warehouse = credentials['warehouse'],
    role = credentials['role']
))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Setting up the engine is basically all you need to do. Unlike the previous example, you don’t need to make custom functions to fetch all the rows from query results.&lt;/p&gt;

&lt;p&gt;From the engine, you can open a connection and pass it to &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html"&gt;the pandas &lt;code&gt;read_sql&lt;/code&gt; method&lt;/a&gt; alongside the query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;con = engine.connect()
df = pd.read_sql(query, con)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Again, it’s best practice to close the connection, and optionally remove the engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    con.close()
except Exception as e:
    print(e)

try:
    engine.dispose()
except Exception as e:
    print(e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can see that loading a DataFrame with data from Snowflake is much easier if you use SQLAlchemy. All you need is two lines of code to execute a query and load it into memory. 🤷&lt;/p&gt;

&lt;h2&gt;
  
  
  Common errors
&lt;/h2&gt;

&lt;p&gt;Despite the simplicity of this solution, there are a number of issues that have been reported across the internet. Luckily, most issues are known and have straightforward solutions. Here are a few. 👇&lt;/p&gt;

&lt;h3&gt;
  
  
  NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:snowflake
&lt;/h3&gt;

&lt;p&gt;It’s possible that the Snowflake dialect &lt;a href="https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake"&gt;can’t be found when creating an engine&lt;/a&gt;. This can be solved in two different ways.&lt;/p&gt;

&lt;p&gt;First, try reinstalling SQLAlchemy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade sqlalchemy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If that doesn’t work, try registering the Snowflake dialect explicitly in SQLAlchemy’s dialect registry, pointing it at the Snowflake SQLAlchemy bridge:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sqlalchemy.dialects import registry&lt;br&gt;
registry.register('snowflake', 'snowflake.sqlalchemy', 'dialect')&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  ModuleNotFoundError: No module named 'snowflake.sqlalchemy'
&lt;/h3&gt;

&lt;p&gt;Some Anaconda users are unaware of the required Snowflake-SQLAlchemy bridge. While it can be installed through pip, Anaconda users can install it with the &lt;code&gt;conda&lt;/code&gt; command. However, the package lives in the &lt;code&gt;conda-forge&lt;/code&gt; channel rather than the default one, so you need to pass the channel explicitly:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge snowflake-sqlalchemy&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  snowflake.connector.errors.OperationalError: 250003: Failed to get the response
&lt;/h3&gt;

&lt;p&gt;Many people report getting this error when setting up their connection for the first time. The most common root cause is that they &lt;a href="https://docs.snowflake.com/en/user-guide/sqlalchemy.html#connection-parameters"&gt;ignored a warning in the documentation&lt;/a&gt; and included &lt;code&gt;snowflakecomputing.com&lt;/code&gt; in their account identifier. Dropping it will likely solve the issue.&lt;/p&gt;
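
&lt;p&gt;For example, with a made-up account identifier:&lt;/p&gt;

```python
# Wrong: the hostname suffix must not be part of the account value
# credentials = {'account': 'ab12345.eu-west-1.snowflakecomputing.com'}

# Right: only the account identifier itself (value shown is made up)
credentials = {'account': 'ab12345.eu-west-1'}
```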

&lt;h3&gt;
  
  
  pandas only writes NULLs to Snowflake
&lt;/h3&gt;

&lt;p&gt;pandas can do more than just read data from Snowflake; it can also write to it. However, some users reported getting columns with only NULL values, and no error or warning to indicate why. The cause? Snowflake expects uppercase column names. Fixing it is easy. Just change the case of your column names as follows:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.columns = df.columns.str.upper()&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  AttributeError: 'SnowflakeCursor' object has no attribute &lt;code&gt;cursor&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When you’re using pandas’ &lt;code&gt;to_sql&lt;/code&gt; method, it expects a SQLAlchemy engine, not a Snowflake connector object. To get rid of this error, &lt;a href="https://stackoverflow.com/questions/64505552/attributeerror-snowflakecursor-object-has-no-attribute-cursor"&gt;replace the Snowflake connector with the correct SQLAlchemy engine&lt;/a&gt;.&lt;/p&gt;
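
&lt;p&gt;As a sketch of the fix, pass pandas a SQLAlchemy engine rather than the raw connector object (shown here with an in-memory SQLite engine as a stand-in for the Snowflake engine created earlier, and an illustrative table name):&lt;/p&gt;

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine; with Snowflake you would reuse the engine built with
# create_engine(URL(...)) as shown earlier in this tutorial.
engine = create_engine('sqlite:///:memory:')

df = pd.DataFrame({'ID': [1, 2], 'NAME': ['a', 'b']})

# to_sql expects a SQLAlchemy engine or connection, not a snowflake.connector object
df.to_sql('customers', con=engine, index=False, if_exists='replace')

# Read it back to confirm the round trip
out = pd.read_sql('SELECT * FROM customers', con=engine)
print(out)
```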

&lt;h2&gt;
  
  
  Use Snowflake and SQLAlchemy with confidence 🙌
&lt;/h2&gt;

&lt;p&gt;This tutorial walked you through using Snowflake with SQLAlchemy to run SQL queries and store the results as pandas DataFrames. You probably noticed that using SQLAlchemy to interface with Snowflake from Python is simpler than using the default connector. By creating a SQLAlchemy engine and providing it to pandas methods, data professionals can work from within their familiar pandas setup.&lt;/p&gt;

&lt;p&gt;After you integrate Snowflake and SQLAlchemy and you need a solution to sync your Snowflake data with your sales and marketing tools, &lt;a href="http://getcensus.com"&gt;Census&lt;/a&gt; has your back. &lt;a href="http://getcensus.com/demo"&gt;Book a demo&lt;/a&gt; with a product specialist to see how Census can make your data actionable. 🚀&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Stars, tables, and activities: How do we model the real world?</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Tue, 15 Nov 2022 17:56:10 +0000</pubDate>
      <link>https://dev.to/getcensus/stars-tables-and-activities-how-do-we-model-the-real-world-47dg</link>
      <guid>https://dev.to/getcensus/stars-tables-and-activities-how-do-we-model-the-real-world-47dg</guid>
      <description>&lt;p&gt;For 25 years, Ralph Kimball and Margy Ross’s &lt;a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/"&gt;&lt;em&gt;The Data Warehouse Toolkit&lt;/em&gt;&lt;/a&gt; has provided the mainstream method of data modeling, alternatively called Kimball-style warehouses, dimensional models, star schemas, or just “fact and dimension tables.”&lt;/p&gt;

&lt;p&gt;In fact, it’s so ingrained in the data industry that nearly every data conference has a talk arguing &lt;a href="https://www.youtube.com/watch?v=CPTao9jxLyg&amp;amp;list=PL0QYlrC86xQlj9UDGiEwhXQuSjuSyPJHl&amp;amp;index=13"&gt;for&lt;/a&gt; or &lt;a href="https://www.youtube.com/watch?v=3OcS2TMXELU"&gt;against&lt;/a&gt; its continued relevance – or suggesting a new &lt;a href="https://www.youtube.com/watch?v=c0WKg0HBhQs"&gt;alternative&lt;/a&gt; altogether.&lt;/p&gt;

&lt;h2&gt;
  
  
  The History
&lt;/h2&gt;

&lt;p&gt;For those of you who are new to the concept, fact tables are lists of events that rarely change (e.g. orders), and dimension tables are lists of entities that often change (e.g. a list of customers). To keep this straight in my mind, I often think about how a consultant once explained it to me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can tell what’s a fact and what’s a dimension by explaining your report with the word ‘by.’ &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sales &lt;em&gt;by&lt;/em&gt; customer, refunds &lt;em&gt;by&lt;/em&gt; product, etc… it’s always &lt;strong&gt;fact&lt;/strong&gt; &lt;em&gt;by&lt;/em&gt; &lt;strong&gt;dimension&lt;/strong&gt;.  Here’s an example 👇&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fact_orders&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_jRGzQtP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/images/2022/11/15/rvv521rs4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_jRGzQtP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/images/2022/11/15/rvv521rs4y.png" alt="" width="880" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oWAsNaB_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/images/2022/11/15/k7g50f20s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oWAsNaB_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/images/2022/11/15/k7g50f20s3.png" alt="" width="880" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea here is simple: &lt;strong&gt;It wouldn't make sense to include the email, location, or lifetime value (LTV) of the customer on the order.&lt;/strong&gt; Why? Those things change, and you don’t want to be updating year-old orders when your orders table has millions of rows.&lt;/p&gt;
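
&lt;p&gt;To make “fact &lt;em&gt;by&lt;/em&gt; dimension” concrete, here’s a small sketch in pandas (the column names and values are invented for illustration, not taken from the tables above):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical fact table: one row per order event
fact_orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [10, 10, 20],
    'total_price': [100.0, 50.0, 75.0],
})

# Hypothetical dimension table: one row per customer entity
dim_customer = pd.DataFrame({
    'customer_id': [10, 20],
    'email': ['a@example.com', 'b@example.com'],
})

# "Sales by customer": join the fact to the dimension, then aggregate
sales_by_customer = (
    fact_orders
    .merge(dim_customer, on='customer_id')
    .groupby('email', as_index=False)['total_price']
    .sum()
)
print(sales_by_customer)
```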

&lt;h2&gt;
  
  
  The Case For Kimball
&lt;/h2&gt;

&lt;p&gt;The basic organization – orders, customers, and products being separate tables – makes sense to most people. It’s easy to join the tables together and to know what to put in which table.&lt;/p&gt;

&lt;p&gt;Even if you want to make Wide Tables, combining fact and dimensions is often the easiest way to create them, so why not make them available? &lt;a href="https://looker.com"&gt;Looker&lt;/a&gt;, for example, is well suited to dimensional models because it takes care of the joins that can make Kimball warehouses hard to navigate for business users.&lt;/p&gt;

&lt;p&gt;But one of the main advantages of Kimball is that it provides a blueprint. 📘 I’ve worked with companies that aren’t modeling their data from ELT tools at all, and fact and dimension tables are a great model for a data warehouse. &lt;/p&gt;

&lt;p&gt;It’s also a common standard that new team members will be able to contribute to quickly. By comparison, if you skipped the star schema and just created Sales by Product and Sales by Customer tables, you could easily run into problems the moment someone asks for a Sales by Customer by Product table.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case Against Kimball
&lt;/h2&gt;

&lt;p&gt;Every modern talk about Kimball – even the ones that are generally positive – has a long list of things about Kimball that are no longer relevant to modern cloud data warehouses. &lt;/p&gt;

&lt;p&gt;Some of the more extreme Kimball practices (like creating a &lt;code&gt;city&lt;/code&gt; dimension and referring to it as &lt;code&gt;city ID 123&lt;/code&gt; instead of the actual city name) only made sense when analytical databases were row-based and did not store repeated names efficiently. &lt;/p&gt;

&lt;p&gt;Plus, a huge portion of the book talks about storage costs, which are practically free nowadays. This, then, raises the question of whether dimensional modeling is an artifact of a constrained time, like how the QWERTY layout was designed to slow down typing on antiquated typewriters.&lt;/p&gt;

&lt;p&gt;Generally, the arguments against Kimball fall into two main buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Business people don’t understand joins.&lt;/strong&gt; This is the argument for Wide Tables or One Big Table, such as a table called &lt;code&gt;sales_by_customer_by_product&lt;/code&gt;. Certain BI tools, such as Metabase, provide a good self-service interface when dealing with a single table, but are useless with star schemas.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Separating activities into multiple fact tables makes it hard to track a user’s behavior over time.&lt;/strong&gt; If a customer gets an email that’s in &lt;code&gt;fct_email_events&lt;/code&gt;, visits the website which creates a few rows in &lt;code&gt;fct_page_views&lt;/code&gt;, adds something to their cart which is recorded in &lt;code&gt;fct_events&lt;/code&gt;, and then makes a purchase which is stored in &lt;code&gt;fct_orders&lt;/code&gt;, it becomes hard to tell a cohesive story about why the customer made the purchase. 🤔 This is the argument for activity schemas, which combine all fact tables into a single one with a few standard columns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Everyone has their own incentives
&lt;/h2&gt;

&lt;p&gt;Tools that work well with Kimball models will suggest that you use Kimball models, while ones that do well with Wide Tables or Activity Schemas will push for those. As &lt;a href="https://twitter.com/panasenco"&gt;Aram Panasenco&lt;/a&gt; said in the &lt;a href="https://getdbt.com"&gt;dbt&lt;/a&gt; Slack: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NBVq5qKl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/ca5d593c-35c4-405f-8762-577d00264bdd" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NBVq5qKl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.letterdrop.co/pictures/ca5d593c-35c4-405f-8762-577d00264bdd" alt="" width="880" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While &lt;a href="https://narratordata.com"&gt;Narrator&lt;/a&gt; is a cool tool — and there’s nothing wrong with implementing a schema that works well with it — you should know whether you’re implementing something for best practices or for a tool. 🧰 I’d personally love to see more discussion of giant fact tables combining all your important events (not limited to just 3 columns) that play well in a particular tool. It’s a fascinating idea!&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern data modeling provides the answers
&lt;/h2&gt;

&lt;p&gt;With modern data modeling tools like dbt, it’s very easy to create fact and dimension tables, then spin up Wide Tables as needed. Maybe your CX team wants to see Orders by Customer in Metabase, or your marketing team wants to sync Orders by Product to Salesforce with &lt;a href="http://getcensus.com"&gt;Census&lt;/a&gt;. Either way, it takes very little time to create views for these use cases.  &lt;/p&gt;

&lt;p&gt;You can also create a view that turns your fact tables into a strict Activity Schema if you want to try out Narrator. Then, if you later implement Looker, you can join the fact and dimension tables directly in your Explores. Kimball modeling as the core (but not the extent) of a modern data warehouse is hard to beat in flexibility and ease of use!&lt;/p&gt;

&lt;p&gt;👉 Want to start sending data to all your go-to-market tools to make it actionable for your teams? &lt;a href="https://www.getcensus.com/demo"&gt;Book a demo&lt;/a&gt; with a Census product specialist.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Pipedrive API with Python: How to write to contact records</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Thu, 10 Jun 2021 19:11:29 +0000</pubDate>
      <link>https://dev.to/getcensus/pipedrive-api-with-python-how-to-write-to-contact-records-1bg</link>
      <guid>https://dev.to/getcensus/pipedrive-api-with-python-how-to-write-to-contact-records-1bg</guid>
      <description>&lt;p&gt;Chances are you and your team have felt frustrated at some point because you’re missing sales you could have made if data were better utilized. You took the effort to implement Pipedrive as a sales CRM but some other system also generates valuable data that is not fed to Pipedrive, too. &lt;/p&gt;

&lt;p&gt;On top of that, that data point doesn't fit neatly into Pipedrive's default fields. There is, for example, no default field for trial end date available to include in your contact records (even though this could help you reach out to potential customers in your sales pipeline with an offer at just the right time). Luckily, you can create a custom field and write data to it using the Pipedrive API and significantly improve your sales process functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting set up with the Pipedrive API
&lt;/h2&gt;

&lt;p&gt;We'll use Python to connect to the Pipedrive API to make our trial end date data available where we need it to be. To connect to the Pipedrive API, you will need to gather two things from within the web app. First, you’ll need your API access token, which you can find &lt;a href="https://app.pipedrive.com/settings/api"&gt;here&lt;/a&gt;. If you have access to multiple companies, make sure that you are within the right environment. Second, you will need the company domain, which you can find in the address bar (e.g. &lt;a href="https://this-is-your-domain.pipedrive.com"&gt;https://this-is-your-domain.pipedrive.com&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;You’ll also have to install the &lt;a href="https://pypi.org/project/requests/"&gt;Python requests library&lt;/a&gt; if you haven't done that previously. You can do so by launching your terminal or command prompt and entering the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
pip3 install requests # Use this if the first one doesn't work for you.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We recommend running all code in a &lt;a href="https://jupyter.org/"&gt;Jupyter Notebook&lt;/a&gt; for your first attempt so that you can easily see the output and interact with it, though creating a .py file will work as well.&lt;/p&gt;

&lt;p&gt;We'll start by importing the necessary libraries. The requests library will allow us to make HTTP requests to the Pipedrive API and the &lt;a href="https://docs.python.org/3/library/json.html"&gt;json library&lt;/a&gt; will allow us to properly parse the responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Checking your existing Pipedrive fields
&lt;/h2&gt;

&lt;p&gt;Before we get started, we need to find out if the field we want to write data to already exists. Since we'll be making a GET request, a POST request, and a PUT request, the variables have been prepended with &lt;em&gt;get_&lt;/em&gt;, &lt;em&gt;post_&lt;/em&gt; and &lt;em&gt;put_&lt;/em&gt; to help you distinguish between them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# token hasn't been prepended with get_ because it needs to be sent with all requests.

token = {
    'api_token': 'your-api-token-found-in-the-web-app'
}

get_url = 'https://your-domain.pipedrive.com/api/v1/personFields'

# The params argument is appended to the url as a query parameter.
get_response = requests.get(get_url, params=token)

get_content = json.loads(get_response.content)

# Upon success, the key 'success' will equal True.
print(get_content['success'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A response to a successful GET request will contain the person fields that already exist in your Pipedrive account. We'll print the names of all the fields and their respective indices. Unless your company naming conventions are absolutely top-notch, you'll want to go through this list manually to check whether your custom field already exists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;get_data = get_content['data']

for i, v in enumerate(get_data):
    print(i, v['name'])


# If you want to further examine the field at index 5.
print(get_data[5]) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the custom field that you want to write to already exists, save its “key” value to a variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# If for example the index of your field is 5.
field_key = get_data[5]['key']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow along with the next section to create a custom field if you didn't find an existing field that meets your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Pipedrive custom field
&lt;/h2&gt;

&lt;p&gt;Creating the custom field if it doesn't exist yet is fairly straightforward. You'll just need to think of a name for your custom field and decide its type.  You have several options when it comes to the field type, which you can find in the &lt;a href="https://developers.pipedrive.com/docs/api/v1/PersonFields#addPersonField"&gt;API reference&lt;/a&gt;. For our trial end date example, it makes the most sense to go with “date.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# token should still be defined from the GET request, but in case you skipped over that, here it is again.

token = {
    'api_token': 'your-api-token-found-in-the-web-app'
}

# The field that you want to create.
post_data = {
    'name': 'trial end date',
    'field_type': 'date'
}

post_url = 'https://your-domain.pipedrive.com/api/v1/personFields' 

post_response = requests.post(post_url, params=token, data=post_data)

post_content = json.loads(post_response.content)

# The key 'success' should equal True.
print(post_content['success'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you successfully created a field, the response will contain a “data” key with the information of the field. This information includes a key called “key,” which you’ll need when writing data to this field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;field_key = post_content['data']['key']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Writing data to a Pipedrive custom field
&lt;/h2&gt;

&lt;p&gt;Now that you have a custom field in place, you can write data to it. Heads up, the Pipedrive API reference is misleading (it makes it seem like you can only write data to default fields, which actually isn’t the case). To complete this step, you’ll need to find the ID of the user you want to write data to. To make it easier, you can get a list of &lt;a href="https://developers.pipedrive.com/docs/api/v1/Persons#getPersons"&gt;all your users&lt;/a&gt; or &lt;a href="https://developers.pipedrive.com/docs/api/v1/Persons#searchPersons"&gt;search for specific users&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# token should still be defined from the GET request, but in case you skipped over that, here it is again.
token = {
    'api_token': 'your-api-token-found-in-the-web-app'
}

# Replace id-of-person with the actual ID
put_url = 'https://your-domain.pipedrive.com/api/v1/persons/id-of-person'

# field_key is the 'key' value of the field that you want to write data to
put_payload = {
    field_key: '2021-06-01' # If this person's trial ends on 2021-06-01
}

put_response = requests.put(put_url, params=token, data=put_payload)

put_content = json.loads(put_response.content)

# The key 'success' should equal True.
print(put_content['success'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output contains the user we just wrote data to, including the newly added data. One thing to watch for is that the “success” key will equal True if you manage to write data at all, regardless of whether the data was valid. If you, for instance, try to write the string “wrong-data” to a date field, “success” will still equal True and the field’s value will be set to 1970-01-01. You’ll want to verify the result of your API request to make sure it’s accurate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This should equal the value that you just wrote using the PUT request.
print(put_content['data'][field_key])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Success! You just wrote your data to a Pipedrive custom field using the Pipedrive API... once.&lt;/p&gt;

&lt;h2&gt;
  
  
  There's more to life than writing to custom fields
&lt;/h2&gt;

&lt;p&gt;It is, in itself, easy enough to write data to a custom field through the Pipedrive API. The real challenge lies in getting this process right in production. That means scheduling it to run periodically. It also means making sure you don’t exceed the two-second or 24-hour rate limits - which count any actions you take in the web app, too. You’ll also need logging so you know exactly which data points were written successfully and which ones failed (and why). On top of that, you’ll have to build a process to retry the failed writes - and hope they don’t fail again. The list goes on.&lt;/p&gt;
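The retry-and-logging piece, for example, is easy to underestimate. Here’s a minimal sketch of what it might look like, assuming a `write_field` helper that wraps the `requests.put` call shown above (the helper name, backoff values, and exception handling are illustrative, not part of the Pipedrive API):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipedrive-sync")

def with_retries(write_field, person_id, payload, attempts=3, backoff=0.5):
    """Try to write one record, logging each failure, with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return write_field(person_id, payload)
        except Exception as exc:  # in practice, catch requests.RequestException
            log.warning("attempt %d failed for person %s: %s", attempt, person_id, exc)
            time.sleep(backoff * attempt)
    log.error("giving up on person %s after %d attempts", person_id, attempts)
    return None
```

In production you’d also persist the failures somewhere durable (a table or dead-letter queue) rather than only logging them, so they can be retried later.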

&lt;p&gt;You can struggle your way through this process, or you can let Census worry about it. We can take all the engineering for custom connectors off your plate and make it easy to sync your customer data from your warehouse to your business tools. See if we &lt;a href="https://www.getcensus.com/integrations"&gt;integrate with your tools&lt;/a&gt; or check out a &lt;a href="https://www.getcensus.com/demo"&gt;demo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>startup</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>New to data orchestration? Start here.</title>
      <dc:creator>Census</dc:creator>
      <pubDate>Wed, 02 Jun 2021 15:44:13 +0000</pubDate>
      <link>https://dev.to/getcensus/new-to-data-orchestration-start-here-51b5</link>
      <guid>https://dev.to/getcensus/new-to-data-orchestration-start-here-51b5</guid>
      <description>&lt;p&gt;The universe of data within modern companies is ever-expanding. It’s exciting, there’s more data than ever to dig into, but with more data comes more governance, sync schedules, and processing problems.&lt;/p&gt;

&lt;p&gt;Companies need to break silos between data sources and storage to truly &lt;a href="https://blog.getcensus.com/what-is-operational-analytics/" rel="noopener noreferrer"&gt;operationalize all the information they’re collecting&lt;/a&gt;. However, just adding new tools to the mix won’t solve the problem (it can actually make it much, much worse). To break down silos between their data sources and overcome system sprawl, they need better data governance combined with data orchestration.&lt;/p&gt;

&lt;p&gt;Data orchestration makes it possible for organizations to automate and streamline their data, operationalizing it so that this valuable information can be leveraged to drive real-time business decisions. By some estimates, &lt;a href="https://hbr.org/2016/12/breaking-down-data-silos" rel="noopener noreferrer"&gt;80% of the work involved&lt;/a&gt; in data analysis comes down to ingesting and preparing the data, which means data orchestration can cut out a huge share of your data processing and scheduling time.&lt;/p&gt;

&lt;p&gt;As you know, we here at Census are big fans of anything that breaks down silos and improves data quality and data access. ❤️ That’s why we’re excited to dive into the topic of data orchestration: what it is, why you should care, and some of the tools you can use to get started with orchestration today.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is data orchestration?
&lt;/h2&gt;

&lt;p&gt;Data orchestration can look a little different at each stage of your data pipeline, so for this article, we’re focusing on a general, 20,000-foot definition that captures it from end to end:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data orchestration is the use of software to break down data silos between sources and storage locations, improving data accessibility across the modern data stack via automation. It improves data collection, data preparation and transformation, data unification, and delivery and activation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When a company invests in the software necessary to support data orchestration at each layer of their stack, they’re better able to connect their systems and ensure they have access to all their information in relative real-time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpnt76bt7fzjkumfrzxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpnt76bt7fzjkumfrzxk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data orchestration services automate the movement of data between your event tracking, data loader, modeling, and data integration tools (as seen above within a sample modern data stack of our favorite tools).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As we mentioned in the definition above, there are four main processes that data orchestration helps with across your stack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data collection&lt;/strong&gt;: End-to-end data orchestration services handle data ingestion to collect important data from your customer touchpoints, often via SDKs and APIs that can be directly integrated with your applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data preparation and transformation&lt;/strong&gt;: Once your data has been collected, orchestration services help to standardize and check properties and values at your collection points. These values--such as names, times, and events--can be mutated to match a standard schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data unification&lt;/strong&gt;: Data orchestration services can help organizations unify their data into a pool that is more valuable than the individual input streams by themselves. This can be used to create a single, unified view of customers by stitching together data collected from websites, point of sale devices, and applications to help you understand individual user behavior over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery and activation&lt;/strong&gt;: Once unified customer profiles have been created, data orchestration services can send this valuable information to the tools that your team uses every day, including BI platforms, data analytics tools, and customer management solutions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can picture your data pipeline like a river, with a handful of tributaries (data sources) feeding into it. Imagine that you’re trying to create a reservoir (data warehouse) near the river’s mouth with a dam that will generate enough electricity (insight/value) to provide power to a nearby town (customers).&lt;/p&gt;

&lt;p&gt;Without data orchestration, the folks building your dam and controlling the reservoir level have to manually run upstream and release water gates on each source. It takes hours (and a lot of work). With data orchestration, the crew manning the dam can run operations programmatically to automate the water flow, without the literal leg work. This means they can spend more time monitoring the energy generated and getting it to the town.&lt;/p&gt;

&lt;p&gt;Historically, this manual leg work took the form of cron jobs, which data engineers and data scientists wrote in Python. This meant that things like failure handling had to be done on a job-by-job basis. However, as data stacks grew and became more complex to handle larger and larger volumes of data (more water, if we’re sticking with the metaphor), engineers couldn’t keep up with the manual work.&lt;/p&gt;

&lt;p&gt;This manual work--combined with disparate legacy systems--creates eddies of &lt;a href="https://www.gartner.com/en/information-technology/glossary/dark-data" rel="noopener noreferrer"&gt;dark data&lt;/a&gt; in the sources leading to your reservoir, preventing valuable information from reaching your team for analytics, customer engagement, or any other activity.&lt;/p&gt;

&lt;p&gt;Thankfully, there are frameworks today that support automated data orchestration and pipeline monitoring to prevent data from going dark. That’s why we believe data orchestration is a core part of the &lt;a href="https://blog.getcensus.com/graduating-to-the-modern-data-stack-for-startups/" rel="noopener noreferrer"&gt;modern data stack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Technically speaking, data orchestration solutions are built around DAGs (directed acyclic graphs), which are collections of the tasks that you want to run. These tasks are organized and run based on their dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9iu6na16894idq4z60k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9iu6na16894idq4z60k.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each node of the DAG represents a task in the process. It’s generally defined by a Python script, which makes it easy for data scientists to use.&lt;/p&gt;

&lt;p&gt;For example, take the four-step extract, load, transform, and sync process as a DAG. For the transform step to happen, data has to be loaded, and for data to be loaded, it first has to be extracted. Then, once all that has happened, a &lt;a href="https://blog.getcensus.com/what-is-reverse-etl/" rel="noopener noreferrer"&gt;reverse ETL&lt;/a&gt; tool like Census can take care of the downstream work by triggering a sync via API.&lt;/p&gt;
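To make the dependency idea concrete, here’s a minimal sketch of that four-step chain using only Python’s standard library (real orchestrators like Airflow express the same structure with richer operators; the task names here are just illustrative):

```python
# A DAG is just tasks plus dependencies; a topological sort gives a valid run order.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "load": {"extract"},
    "transform": {"load"},
    "sync": {"transform"},  # e.g. a reverse ETL sync triggered via API
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # extract runs first, sync runs last
```

An orchestrator does the same sorting, then adds scheduling, retries, and monitoring around each task as it runs.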

&lt;p&gt;More precisely, DAGs within data orchestration flows can help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data organizing, cleansing, and publishing in the warehouse&lt;/li&gt;
&lt;li&gt;Business metric computation&lt;/li&gt;
&lt;li&gt;Rule enforcement for campaign targets and user engagement through email campaigns&lt;/li&gt;
&lt;li&gt;Data infrastructure maintenance&lt;/li&gt;
&lt;li&gt;Training machine learning models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While Census's ability to orchestrate the last mile of your pipeline is unique, there are a variety of orchestration services on the market that can help with the rest of the flows above.&lt;/p&gt;

&lt;h3&gt;
  
  
  How data orchestration tools have evolved
&lt;/h3&gt;

&lt;p&gt;Like all technology, data orchestration tools change often to keep pace with the data management needs of evolving organizations. Each new generation of data orchestration services emphasizes a more and more specific use case.&lt;/p&gt;

&lt;p&gt;First-generation data orchestration tools like &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt; primarily focused on improving usability for data scientists by introducing Python support (vs previous tools that required pipelines to be defined in JSON and YAML). This made it easier for data teams to manage their pipeline flows without getting as caught up in the process.&lt;/p&gt;

&lt;p&gt;Second-generation data orchestration tools like &lt;a href="https://dagster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt; and &lt;a href="https://www.prefect.io/" rel="noopener noreferrer"&gt;Prefect&lt;/a&gt; are more focused on being data-driven. They’re able to detect the kinds of data within DAGs and improve data awareness by anticipating the actions triggered by each data type.&lt;/p&gt;

&lt;p&gt;These data-driven capabilities can be divided into two categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Active approaches pass data between steps and systems.&lt;/li&gt;
&lt;li&gt;Passive approaches wait for an event (cue) outside the DAG to occur before triggering a task (particularly useful for continuous model training).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For most use cases, active data orchestration approaches will work well. However, as stacks and data flows become more complex, passive approaches can be leveraged to orchestrate them.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the difference between orchestrated and un-orchestrated data?
&lt;/h3&gt;

&lt;p&gt;The difference between orchestrated and unorchestrated data stacks is the difference between operationalizing your data to fuel future decisions vs reacting to and wrestling with your past data to troubleshoot.&lt;/p&gt;

&lt;p&gt;Unfortunately, a lot of established companies and startups are still wrestling with their unorchestrated data. Some &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2018-12-06-gartner-data-shows-87-percent-of-organizations-have-low-bi-and-analytics-maturity" rel="noopener noreferrer"&gt;87% of organizations have low BI and analytics maturity&lt;/a&gt;, according to Gartner.&lt;/p&gt;

&lt;p&gt;This is because legacy systems and stacks that are still in the process of being migrated to the cloud tend to create more data silos than they break down. It takes deliberate governance and design decisions to revamp these legacy stacks to overcome technical debt and leverage data orchestration tools in your data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why should you care about data orchestration?
&lt;/h2&gt;

&lt;p&gt;Data orchestration breaks down the silos that fragment your data stack and leave your data going stale in dark data eddies. Many companies set their engineers on the warpath to build DIY orchestration solutions, but those quickly become irrelevant as the stack changes (plus, they require a lot of expensive rework and annoy your engineers). Beyond saving data engineering time, orchestration also helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve data governance and visibility&lt;/li&gt;
&lt;li&gt;Leverage fresher customer data&lt;/li&gt;
&lt;li&gt;Ensure data privacy compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Orchestration prevents the growing pains many companies experience by giving them a scalable way to keep their stacks connected while data flows smoothly. It’s great for companies that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have a lot of data systems that need to be pulled together.&lt;/li&gt;
&lt;li&gt;Have started to integrate the modern data stack and want to get more use out of it.&lt;/li&gt;
&lt;li&gt;Have just started building their first stack and want to establish a strong foundation to handle future scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data orchestration ensures that you and your team have the freshest data possible, without your engineers having to manually run jobs overnight to serve it to you. These services let you automate your sync schedule and use trigger APIs to update downstream dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved workflows for engineers and analysts
&lt;/h3&gt;

&lt;p&gt;Using manual cron jobs and Python scripts is a slow method for getting you the data you need. In today’s data world, the speed and volume of data collection have grown so much that data teams and engineers can’t possibly keep up with the manual organization of it.&lt;/p&gt;

&lt;p&gt;Rather than relying on one overloaded engineer to help fetch the data you need from multiple warehouses and storage systems, you can use data orchestration platforms to automatically transform and deliver it to you.&lt;/p&gt;

&lt;p&gt;This reduces the time engineers and data scientists have to spend on data collection and transformation and empowers data teams to make better decisions in real-time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved data governance and visibility
&lt;/h3&gt;

&lt;p&gt;Siloed and disparate data sources are hard to govern and get visibility into. To use orchestration tools effectively, companies must audit and organize their stack, creating more visibility in the process. This makes it easier for you to govern your data and improves the overall confidence in and quality of it.&lt;/p&gt;

&lt;p&gt;At its heart, data orchestration is about making your data and systems more useful for the systems and people that consume them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leverage fresher customer data
&lt;/h3&gt;

&lt;p&gt;RevOps folks know that the key to unlocking better insights into customers lies in data. The automation of your data processes from end to end makes it easier than ever for data consumers to operationalize their data use.&lt;/p&gt;

&lt;p&gt;With data orchestration functionality integrated with your pipeline, the data from campaigns, webcasts, web apps, and other customer data sources can be easily collected, transformed, loaded into your warehouse, and then sent back out to platforms like Salesforce and Marketo via reverse ETL tools.&lt;/p&gt;

&lt;p&gt;This information availability helps your RevOps teams score leads and accounts, create a single view of each customer, establish audience associations, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ensure data privacy compliance
&lt;/h3&gt;

&lt;p&gt;With great data comes great responsibility. &lt;a href="https://gdpr-info.eu/" rel="noopener noreferrer"&gt;GDPR&lt;/a&gt;, &lt;a href="https://oag.ca.gov/privacy/ccpa" rel="noopener noreferrer"&gt;CCPA&lt;/a&gt;, and other data privacy laws require organizations to be good stewards of their customer data by providing documentation that shows it was collected correctly and ethically.&lt;/p&gt;

&lt;p&gt;If you’re working within a chaotic, disparate stack, this kind of paper trail is hard to maintain and show. However, with a modern data stack outfitted with data orchestration, you can easily get a detailed view of the when, where, and why of each data point you’ve collected.&lt;/p&gt;

&lt;p&gt;This also makes it easier for organizations to &lt;a href="https://blog.getcensus.com/cascading-deletes-in-rails/" rel="noopener noreferrer"&gt;delete information on request&lt;/a&gt;, something that’s difficult to fully do if part of the data in question has been held indefinitely upstream due to poor data flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve your data orchestration, improve your operations
&lt;/h2&gt;

&lt;p&gt;Cool, so you’ve learned all this fancy information about data orchestration. But now what? Well, if you’re still wrestling with your legacy stack and looking to make the move to the modern data stack, check out &lt;a href="https://blog.getcensus.com/graduating-to-the-modern-data-stack-for-startups/" rel="noopener noreferrer"&gt;this resource on upgrading to the modern data stack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re ready to orchestrate your modern data stack (or if you’re just looking for orchestration for the last mile of your pipeline from your warehouse to destinations like Salesforce), we have great news: Census has released trigger syncs via API and dbt Cloud (with more integrations coming soon). 🎉&lt;/p&gt;

&lt;p&gt;Check out our &lt;a href="https://blog.getcensus.com/trigger-sync-api-announcement/" rel="noopener noreferrer"&gt;recent product announcement&lt;/a&gt; to learn more.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>operationalanalytics</category>
      <category>database</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
