
4 SQL Tips For Data Scientists And Data Analysts

SeattleDataGuy ・ 7 min read

Photo by Campaign Creators on Unsplash


SQL has become a common skill requirement across industries and job profiles over the last decade. 

Companies like Amazon and Google will often require their data analysts, data scientists and product managers to be at least familiar with SQL. This is because SQL remains the language of data. So, in order to be data-driven, people need to know how to access and analyze it.

With so many people looking at, slicing, manipulating and analyzing data, we wanted to provide some tips to help improve your SQL.

These are tips and tricks we have picked up along the way while writing SQL. Some are things you shouldn't do, even when you might be tempted; others are best practices that help ensure you can trust your data. Overall, we hope they will bring your SQL to the next level and save you some future headaches.


Don't Use Avg() on an Average: It's Not the Same

A common mistake we see in people's queries is averaging averages. Some people may think it's obvious not to average averages, yet there are discussions and whole articles across the web explaining why it is a bad idea.

So why is it bad to average averages, both in SQL and in general? Because a plain average of averages gives every group equal weight, regardless of how many underlying records each group's average was calculated from.

For example let's look at the table below.

In the table we have already averaged the cost per claim at the county level. We can also see that one county's average is based on 100 claims and the other's is based on 2 claims. In practice, this table probably wouldn't carry the total count of claims; we include it here to show how easily an average can be skewed.

What if we wanted to find the average across all of the counties? If you were to average the averages, you would get $525. That doesn't seem right.

If 100 claims averaged $50 and only 2 claims averaged $1,000, then the average across all of those claims should be much closer to $50 than to $500.

In actuality, the average of these claims is about $68. However, if you average the averages you get a number almost 8x greater.
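If the aggregated table also carries each county's claim count, the correct county-wide number is a weighted average: weight each county's average by its claim count. A minimal sketch, assuming hypothetical columns avg_cost_per_claim and total_claims on a table called agg_claim_counties:

SELECT sum(avg_cost_per_claim * total_claims)
        / sum(total_claims) avg_cost_per_claim
FROM agg_claim_counties

With averages of $50 over 100 claims and $1,000 over 2 claims, this returns 7000 / 102, or roughly $68.63, rather than the $525 that the average of averages gives.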

So why do people even ask whether it is OK to average an average?

We think it's because the average of averages can sometimes land close to the expected output.

Let's look at a SQL example.

SELECT patient_county
    ,avg(avg_visits_per_patient)*1.0 avg_visits
    ,avg(avg_cost_per_patient)*1.0 avg_cost
    ,'Avg of Averages' table_type
FROM agg_patient_counties 
GROUP BY patient_county,table_type

In this case we will be using a table that has the average cost per patient and average visits per patient by county and age.

However, we would like to find the average cost per patient and visits per patient at the county level.

We could just average the averages from this table using the query above and it will give us the output below.

Now if instead we were to correctly write a query that recalculates the average at the county granularity like the query below:

SELECT 
    patient_county
    ,count(*)*1.0/count(distinct c.patient_id) avg_total_visits
    ,sum(cast(claim_cost AS decimal))/count(distinct c.patient_id) avg_total_costs
    ,'Correct Way' as table_type
FROM patients p
JOIN claims c ON p.patient_id = c.patient_id
GROUP by patient_county,table_type

Now let's compare this query's output to the previous output.

You will notice a few differences in the King County output.

If we compare the average visits, they actually seem quite similar: 2.4 vs. 2.6. This is why we believe some people fall for the average of averages. It can land close to the actual output, so it may be tempting to use this method.

However, when we look at the average cost per patient, we notice a nearly $58 difference between roughly $560 and $620. That's almost 10%. When you are talking about cost savings, that's a huge difference.

So although 2.4 vs 2.6 seems negligible, it can lead to some massive differences.

In the end, don't average averages.


You Can Use A Case Statement Inside Sum

Another great tip when writing SQL is learning how to use case statements inside a sum function. This can be very useful when you are writing metrics that involve a ratio or percentage.

For example, let's look at the query below. You will see that we need to hit the claims table twice to get the count of values we are trying to filter as well as the total number of rows. However, we could reduce this.

SELECT total_claims_over_500 * 100.0 / COUNT(*)
FROM claims
JOIN (
    SELECT COUNT(*) total_claims_over_500
    FROM claims
    WHERE CAST(claim_cost AS INT) > 500
    ) t1 ON 1 = 1
GROUP BY total_claims_over_500

We can write a case statement to count the total values where the condition is true and then divide by the total count like in the query below.

SELECT SUM(CASE 
        WHEN CAST(claim_cost AS INT) > 500
            THEN 1
        ELSE 0
        END) * 100.0 / COUNT(*) perc_claims_over_500
FROM claims

What you will notice is that we don't need to hit the table twice to get both numbers. In addition, this is simpler to read.

From our experience this trick is usually picked up by most SQL developers somewhere in their first year or two using SQL. 

It's extremely helpful for writing code that calculates the percentage of nulls in a column, or for building metrics for dashboards. This is why most analysts and data engineers become familiar with this trick, as long as they write a decent amount of SQL and don't just use drag-and-drop solutions.
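For instance, the null check mentioned above can be written with the same pattern. This is a sketch against the claims table from earlier; swap in whichever column you want to audit:

SELECT SUM(CASE 
        WHEN claim_cost IS NULL
            THEN 1
        ELSE 0
        END) * 100.0 / COUNT(*) perc_null_claim_cost
FROM claims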


Understanding Arrays And How To Manipulate Them

Arrays and maps inside of your database tables aren't too common. However, we have noticed more and more teams relying on unstructured data which can often utilize data structures like arrays and array functions.

This is because databases like Postgres and SQL engines like Presto allow you to work with arrays directly in your query.

Although arrays and maps are not a new concept in general, they can be new to analysts and data scientists who aren't as familiar with programming.

This means you may need to occasionally learn a few array and map functions to extract data.

Let's start by learning how to unnest a map-style JSON column in Postgres. A map is a data structure that provides a key:value relationship. This means each value is labeled with a unique, descriptive key, like "first_name":"George". A map can also contain multiple key-value pairs, like the image below.

In this case we have two keys, dob and friend_ids that we would like to access.

So how do we access that data? Let's check out the query below.


SELECT username
    ,key
    ,value
FROM user_info u
    ,json_each_text(user_data) i

As you can see, each key-value pair comes back as its own row, with one column for the key and one for the value, so you can pull out exactly the fields you need.

The output will look like the image below.

You can also check the length of arrays, find specific keys, and so much more (the Presto documentation covers its array functions in depth). We do recommend that you don't use maps and arrays as replacements for good data modeling. However, they can come in handy when you are working with data that you might not want a rigid schema for.
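As a quick sketch of what that looks like in Presto, assume the patients table has an array column named visit_dates (a made-up column for illustration). cardinality() returns the array's length, and CROSS JOIN UNNEST expands each element into its own row:

SELECT p.patient_id
    ,cardinality(p.visit_dates) total_visits
    ,visit_date
FROM patients p
CROSS JOIN UNNEST(p.visit_dates) AS t (visit_date)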

Lead And Lag To Avoid Self Joins

Finally, let's talk about using Lead and Lag window functions to avoid self joins. 

When you're doing analytics, you will often need to compare the output of two events or calculate the amount of time between them.

One way you can do this is to self-join the table and connect the two rows. However, another nifty option is the Lag and Lead window functions.
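For context, a self-join version of "find each claim's previous claim" might look something like the sketch below. It needs a correlated subquery just to pin down the immediately preceding row, which is part of why the window-function approach is so much nicer:

SELECT curr.patient_id
    ,curr.claim_date
    ,prev.claim_date previous_claim_date
    ,prev.claim_cost previous_claim_cost
FROM claims curr
LEFT JOIN claims prev ON prev.patient_id = curr.patient_id
    AND prev.claim_date = (
        SELECT max(c2.claim_date)
        FROM claims c2
        WHERE c2.patient_id = curr.patient_id
            AND c2.claim_date < curr.claim_date
        )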

These let you reference the previous (lag) or next (lead) value in an ordered set of rows. You can also specify the granularity at which you want the lagging and leading values by partitioning the window.

For example, in the query below we are partitioning the lagging values by patient_id. This means we are only looking at the previous claim_date and claim_cost at the patient level.


WITH claim_cte
AS (
    SELECT patient_id
        ,claim_date
        ,claim_cost
        ,lag(claim_date) OVER (
            PARTITION BY patient_id ORDER BY claim_date
            ) previous_claim_date
        ,lag(claim_cost) OVER (
            PARTITION BY patient_id ORDER BY claim_date
            ) previous_claim_cost
    FROM claims
    )
SELECT claim_date - previous_claim_date days_between_dates
    ,patient_id
    ,claim_date
    ,claim_cost
    ,previous_claim_date
    ,previous_claim_cost
FROM claim_cte


The output of this query will look like the table below.

You will notice that for the first date of every patient the lagging claim_date and cost is null. This is because there is no cost or claim date prior. 
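If those nulls are inconvenient downstream, most engines let lag() take an offset and a default value as its second and third arguments, so each patient's first row can fall back to something sensible. A sketch that defaults the previous claim date to the claim's own date:

SELECT patient_id
    ,claim_date
    ,lag(claim_date, 1, claim_date) OVER (
        PARTITION BY patient_id ORDER BY claim_date
        ) previous_claim_date
FROM claims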

Overall, the lag and lead functions can make a SQL developer's life much simpler.


The Details Matter With SQL

SQL remains the language of data, and learning these tips and tricks can help ensure that your next dashboard or analysis is just that much better. Whether you stop averaging averages or start using window functions instead of self joins, these small improvements make a huge difference. Some of these mistakes have caused serious problems and long discussions inside companies, so we hope this helps bring many of you up to speed.

In addition, your data analysis will be more accurate and you can be more confident in the numbers you provide if you follow these SQL tips. 

Thanks for reading.



Discussion

 

I've always been a big fan of CASE statements inside of SUM functions. Quick and easy to get counts of various odd conditional measurements combined into a single query. Definitely an underrated trick!

 

Hey, cool article 😄. I believe I spotted a careless error in the discussion on Average, where in the image, the table shows total claims to be 100, 2. However in your discussion, you stated total claims to be over 1000 and 2.

 

Good catch! All the 1000s and 100s. Whoops. Thanks for the catch.

 

In the 90s, SQL was already a common skill requirement...