Metaplane

Originally published at metaplane.dev

3 ways to improve data sampling efficiency in Snowflake

The longer a query takes to execute, the more expensive it becomes. Not just in terms of compute resources, but also our most precious resource—time.

While it’s not much of a problem when your tables are small, as your tables grow, the cost, execution time, and iteration time of downstream tasks grow with them. That’s why writing efficient queries is just as important as writing queries that work.

That's where conditional statements come into play. In Snowflake in particular, conditional statements can drastically reduce the resources and time spent on data queries.

How do conditional statements in Snowflake make sampling queries more efficient?

Think of conditional statements as setting up smart filters for your data collection. These “filters” sort through all the data and pick out just the bits that are actually useful and relevant to the problem you’re trying to solve. Like if you're fishing and you only want to catch salmon, you'd ideally use a net that only lets salmon through and keeps all the other fish out (if such a thing existed).

Using conditional statements before you start sampling makes sure you’re only working with the data that matters, so you’re not wasting time and resources on data you don't need. And when you do take a representative sample, it's more likely to give you the information you need without hitting random data roadblocks (or worse, sampling bias/sampling error).

With that said, we’re starting with a fairly simple sampling technique. It’s not a conditional statement per se, but it’s a great way to wrangle your larger datasets more efficiently.

Option 1: Using Partitioned Tables with Sampling

Though more of a technique than a conditional statement, partitioning and sampling data in Snowflake is a great, easy way to enhance your query performance—especially if you’re using large datasets. Partitioning organizes the data based on certain keys or criteria, facilitating quicker access to relevant data segments and reducing the scope of data scans during queries. Essentially, it speeds up query execution by focusing only on pertinent data partitions.

Sampling after partitioning allows you to work with a smaller sample size that represents the larger whole. By analyzing a sample, you can infer patterns, trends, and insights without the overhead of processing the entire dataset, saving you on data storage down the road.

To combine these steps into a single SQL query in Snowflake, you’d typically make sure your table is organized into partitions based on a key that’s relevant to your query patterns. But since Snowflake automatically manages micro-partitions and doesn’t allow manual partitioning like traditional databases, we’ll focus on using clustering keys to organize the data and then sampling from it.
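For reference, here’s a minimal sketch of how you might define a clustering key on an existing table, assuming you have the privileges to alter it (sales_data and region_id are the same hypothetical names used in the example below):

-- Minimal sketch: define a clustering key so Snowflake co-locates rows with
-- the same region_id in the same micro-partitions over time.
ALTER TABLE sales_data CLUSTER BY (region_id);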

Here’s what that looks like in practice:

Let's say we have a sales data table (sales_data), and we're interested in analyzing sales performance by region. We assume that the table is clustered by region_id to optimize performance for queries filtered by region. Now, we want to sample a subset of this data for a quick analysis using the SQL query below:

SELECT 
  * 
FROM 
  sales_data TABLESAMPLE BERNOULLI (10) -- Sample approximately 10% of the rows
WHERE 
  region_id = 'NorthAmerica' -- Assuming you're interested in North American sales data
  AND DATE(sale_date) BETWEEN '2023-01-01' AND '2023-12-31';


In this case:

  • WHERE region_id = 'NorthAmerica' focuses the query on the North American sales data, using the table's clustering on region_id to improve performance.
  • AND DATE(sale_date) BETWEEN '2023-01-01' AND '2023-12-31' further filters the data to only include sales from the year 2023.
  • TABLESAMPLE BERNOULLI (10) retrieves approximately 10% of the rows. The BERNOULLI method makes an independent random keep/skip decision for each row. One nuance: because the sampling clause is attached to the table in the FROM clause, the sample is drawn from the table itself and the WHERE filters are then applied to the sampled rows, not the other way around (more on this below).

This query is designed to filter and sample the data efficiently based on the table's organization (clustering by region_id): the clustering aligns the data with the query's filter patterns, and the sampling keeps the result set small enough for quick analysis.
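If you specifically want the 10% to be drawn from the filtered rows rather than from the whole table, one option is to push the filters into a subquery and sample its result, roughly like this (same hypothetical table and columns as above):

-- Sketch: sample ~10% of the rows that survive the filters, rather than
-- ~10% of the entire sales_data table.
SELECT 
  * 
FROM 
  (
    SELECT 
      * 
    FROM 
      sales_data 
    WHERE 
      region_id = 'NorthAmerica' 
      AND DATE(sale_date) BETWEEN '2023-01-01' AND '2023-12-31'
  ) TABLESAMPLE BERNOULLI (10);

Both variants return a random subset; the difference is simply which population the 10% refers to.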

While partitioning is great for speeding up searches targeting specific regions, if your data isn't neatly organized around the criteria you're using to partition it, or if your searches don't align well with how the data is split up, partitioning won't help much. For instance, if you need information that's spread across multiple partitions, or if your search conditions change a lot and don't match the partitioning scheme, you might not see the performance boost you were hoping for. If that’s the case, you might want to investigate conditional statements.

Option 2: Using CASE statements

Using CASE statements in your data sampling queries in Snowflake adds a layer of conditional logic to the sampling process, which is particularly useful when you want to apply different sampling rates or methods based on specific criteria within your data.

For instance, you might want to sample more heavily in areas where your data is denser or more variable, and less so in more uniform areas. The CASE statement allows you to dynamically adjust the sampling rate or method based on the characteristics of the data (e.g. region, time period, or any other relevant dimension).

To use CASE statements to analyze sales performance by region (like in the above sales_data table example), you can design a query that selects a sample of sales data based on certain conditions related to regions. Since Snowflake's SQL does not support using TABLESAMPLE directly within a CASE statement, you’ll have to use a workaround that involves filtering data in subqueries or using conditional logic to assign sample rates and then applying these rates in a subsequent operation.

Here's what this looks like in practice:

WITH region_sample AS (
  SELECT 
    sale_id, 
    region_id, 
    sale_amount, 
    CASE 
      WHEN sale_amount < 1000 THEN 1 
      WHEN sale_amount BETWEEN 1000 AND 10000 THEN 2 
      WHEN sale_amount > 10000 THEN 3 
    END AS sale_group 
  FROM 
    sales_data
), 
sampled_data AS (
  SELECT 
    * 
  FROM 
    region_sample 
  WHERE 
    -- UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) returns a random float between 0 and 1
    (
      sale_group = 1 
      AND UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.05
    ) -- For sales under $1000, sample ~5%
    OR (
      sale_group = 2 
      AND UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.1
    ) -- For sales between $1000 and $10000, sample ~10%
    OR (
      sale_group = 3 
      AND UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.15
    ) -- For sales over $10000, sample ~15%
) 
SELECT 
  * 
FROM 
  sampled_data;

In this case:

  • The region_sample CTE assigns a sale_group value to each row based on its sale_amount, so each sales-amount tier lands in a different group.
  • sampled_data then filters region_sample by applying a random sampling condition per sale_group. UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) generates a random value between 0 and 1 for each row, and a row is kept when that value falls below the group's threshold (e.g., 0.05 for a ~5% sample rate). Snowflake's RANDOM() by itself returns a random 64-bit integer, which is why it's wrapped in UNIFORM() here.

Rather than partitioning, this approach allows for nuanced sampling based on characteristics of the data itself, in this case the sale amount. As a result, you get a much more targeted analysis of sales performance.
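If the CASE-plus-random-threshold pattern feels heavy, a rough equivalent is to sample each tier separately with Snowflake's built-in sampling clause and stitch the results back together with UNION ALL (a sketch using the same hypothetical tiers and rates):

-- Sketch: stratified sampling with a separate TABLESAMPLE per tier.
SELECT * FROM (SELECT * FROM sales_data WHERE sale_amount < 1000) TABLESAMPLE BERNOULLI (5)   -- ~5% of small sales
UNION ALL
SELECT * FROM (SELECT * FROM sales_data WHERE sale_amount BETWEEN 1000 AND 10000) TABLESAMPLE BERNOULLI (10) -- ~10% of mid-size sales
UNION ALL
SELECT * FROM (SELECT * FROM sales_data WHERE sale_amount > 10000) TABLESAMPLE BERNOULLI (15); -- ~15% of large sales

Each branch scans sales_data separately, so this trades one pass over the table for readability; whether that's worth it depends on your table size and clustering.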

But you also have to deal with increased complexity and reduced readability of your SQL queries. As you add more conditions and logic with CASE statements, the queries become harder to understand and maintain (which is especially true for teams where multiple analysts work on the same codebase). If this doesn’t work for your scenario, try using a JOIN.

Option 3: Using JOINS

Using JOIN statements with conditional logic allows you to sample data based on relationships between tables or within subsets of a single table. You can create a derived table or a Common Table Expression (CTE) that contains the specific conditions or subsets you care about, then join this derived table or CTE with the original (or another related) table and apply the sampling on this joined result set.

This method is particularly useful when the sampling criteria involve complex conditions or multiple tables.

Now, back to the sales_data table example from above. Let's assume we have a related table, regions, that contains detailed information about each sales region, and suppose we want to sample North American sales data more efficiently by joining the sales_data table with the regions table.

This is what that SQL query looks like:

WITH RegionSales AS (
  SELECT 
    sd.* 
  FROM 
    sales_data sd 
    JOIN regions r ON sd.region_id = r.region_id 
  WHERE 
    r.region_name = 'NorthAmerica' -- Condition to filter sales data by region
    AND DATE(sd.sale_date) BETWEEN '2023-01-01' AND '2023-12-31'
) 
SELECT 
  * 
FROM 
  RegionSales TABLESAMPLE BERNOULLI (10);

In this case:

  • RegionSales creates a temporary result set that joins the sales_data table with the regions table. It filters the sales data to include only those records from the NorthAmerica region and within the specified date range ('2023-01-01' to '2023-12-31').
  • The TABLESAMPLE BERNOULLI (10) clause is applied to this filtered and joined dataset, sampling approximately 10% of the rows.

JOINS are particularly advantageous when the sampling criteria involve complex conditions or multiple interconnected tables. Imagine trying to get a snapshot of data that's spread across different tables, each with its own set of rules or relationships. JOINS bring all that related information together first, so you can then apply your sampling logic to one combined dataset. This is especially helpful when your analysis depends on understanding how different pieces of data relate to each other, like how customer profiles link to their purchase histories.

But keep in mind: while JOINS are powerful for relating datasets, they’re not always the best choice if simplicity and performance are priorities. When you join tables, especially multiple or large ones, you increase the amount of data being processed before sampling can even occur, which requires more compute resources upfront and slows down query execution. Doing the JOIN after sampling improves efficiency somewhat, but it won’t fix the problem entirely.
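As a rough illustration of that last point, here’s a sketch that samples the (presumably much larger) sales_data table first and only then joins the sampled rows to regions. Note that the 10% is now drawn from all sales rather than just North American 2023 sales, so the final row count will be smaller:

-- Sketch: sample the large fact table first, then join the smaller sampled
-- set to the regions dimension table.
WITH sampled_sales AS (
  SELECT 
    * 
  FROM 
    sales_data TABLESAMPLE BERNOULLI (10)
) 
SELECT 
  ss.* 
FROM 
  sampled_sales ss 
  JOIN regions r ON ss.region_id = r.region_id 
WHERE 
  r.region_name = 'NorthAmerica' 
  AND DATE(ss.sale_date) BETWEEN '2023-01-01' AND '2023-12-31';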

A better way

Metaplane makes it easy to configure how your data is sampled in future Snowflake queries, with options such as Time Windows and WHERE clauses. With Time Windows and the “Include Data Since” option, you can set lookback periods so that only your most recent data is included. With WHERE clauses, you can further restrict the amount of data queried in a table along any dimension of that table.

As a bonus, if your goal is to make sure new data is accurate, consider using date functions supported by your warehouse, such as CURRENT_DATE(), to scan only data created or updated from today onwards.
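A minimal sketch of such a filter, assuming a hypothetical updated_at timestamp column on the table, might look like this:

-- Sketch: only scan rows created or updated today or later (updated_at is a
-- hypothetical audit column; substitute whatever your table actually uses).
SELECT 
  * 
FROM 
  sales_data 
WHERE 
  updated_at >= CURRENT_DATE();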

Want to see how Metaplane’s configurable monitors can make your Snowflake data sampling more efficient? Talk to us or start a free trial today.
