Colin Dsouza

Snowflake Compute Cost and Performance Optimization

A Snowflake virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake.




A warehouse provides the required resources, such as CPU, memory, and temporary storage, to perform the following operations in Snowflake:

1. Executing SQL statements that require compute resources, such as SELECT, JOIN, and GROUP BY (e.g. retrieving rows from tables and views).
2. Performing DML operations, such as:

- Updating rows in tables (DELETE, INSERT, UPDATE).
- Loading data into tables (COPY INTO <table>).
- Unloading data from tables (COPY INTO <location>).

The pricing model for Snowflake virtual warehouses is pay-as-you-go. There is no up-front cost for hardware purchases or licensing, and the management and upkeep of the infrastructure are included: you just pay for compute and storage as you use them. However, that does not mean you can ignore the cost side. There are optimization methods and fundamental guidelines for running your account efficiently. Snowflake provides a scalable solution, but the expense can rise to unwelcome heights if left unchecked.

Let's look at a few factors that can affect costs and that you ought to keep in check.


All the recommendations and insights below are based on my personal experience. Not all of them will apply to every project, but I feel it is worthwhile to be aware of them.




Virtual Warehouses



Each virtual warehouse's running time is billed in Snowflake credits (with per-second billing). The number of credits billed is determined by how long the warehouse runs and by its size. There are basic configuration settings for virtual warehouses that you can use to be efficient with both spending and processing time.

Auto suspend and auto resume parameters



This is one of the fundamental ways to save cost. Because you are charged for run time, do not leave the warehouses running 24 hours a day, seven days a week. Use them only when necessary. They can be suspended for the rest of the time.


With the auto-resume parameter, the warehouse is automatically restarted when a new query arrives. The auto-suspend setting shuts the warehouse down after a predefined idle time. The right idle time is determined by your workload: you can start with 1 second and work your way up to hours. When deciding on the best time for warehouse suspension, keep the following points in mind:


When a warehouse is started, you are always billed for one full minute (60 seconds); after that, per-second billing kicks in. This means that if you have small gaps between your queries (up to a couple of minutes), it does not make sense to set auto-suspend to a very low value.

When you suspend a virtual warehouse, its local cache is cleared. Caching can have a huge impact on both cost and speed, so try to take advantage of as many of Snowflake's cache types as feasible. We'll go through caches in further detail later.

The start/stop operation of a virtual warehouse is effectively immediate. There is no meaningful delay or waiting time, which gives you the freedom to design your compute layer around your workloads. Make sure to use this elasticity.
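To put the two parameters together, here is a minimal sketch (the warehouse name and the 300-second idle timeout are just examples to adapt to your own workload):

```sql
-- Suspend after 5 minutes of idle time and resume automatically
-- when the next query arrives (values are illustrative).
ALTER WAREHOUSE my_wh SET
  AUTO_SUSPEND = 300
  AUTO_RESUME  = TRUE;
```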



Experiment with different warehouse sizes



There is no single warehouse size that fits every use case.
You may be asking how to determine the optimal warehouse size that gives adequate compute power while avoiding over-provisioning. There are no hard and fast rules; experimenting is what I recommend. Try running a representative query on a reasonably sized sample dataset and compare the results. You must strike the right balance between time and cost, which may vary depending on the application. When executing ETL/ELT jobs, you will most likely use a larger warehouse to complete the process as fast as possible and make data available to end users. On the other hand, a virtual warehouse used by data analysts may not need such high compute power, since analysts can wait a couple of seconds for their query to be processed. We handled this by having separate warehouses for data ingestion, data querying, and data administration tasks.
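As a rough illustration of that split, the warehouses could be created along these lines (names, sizes, and timeouts are assumptions, not our exact setup):

```sql
-- Larger warehouse for ETL/ELT, suspended aggressively between loads.
CREATE WAREHOUSE IF NOT EXISTS ingest_wh
  WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;

-- Smaller warehouse for ad-hoc analyst queries.
CREATE WAREHOUSE IF NOT EXISTS analyst_wh
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

-- Tiny warehouse for administrative/housekeeping tasks.
CREATE WAREHOUSE IF NOT EXISTS admin_wh
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;
```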


Optimal size for my use case:



Warehouse size: X-Small
Data loading time: ~4 hrs





Warehouse size: Medium
Data loading time: ~0.5 hrs (2 runs of the same task)




As you can see, quadrupling the warehouse size (X-Small to Medium is two size steps, so roughly 4x the credits per hour) cut the run time by about 8x. Because you pay per second of computation, the bigger warehouse does not cost more here: assuming the standard rates of about 1 credit/hour for X-Small and 4 credits/hour for Medium, the X-Small run cost roughly 4 credits (4 hrs x 1) while the Medium run cost roughly 2 credits (0.5 hrs x 4). In other words, the data loaded 8 times faster for the same price or less. Consider this when designing your warehouses.




Always be aware that a bigger warehouse does not necessarily mean more cost.


Monitor your costs



Look into resource monitors and set quotas and warnings for your virtual warehouses. If there is a tight spending limit, you can even have the warehouse suspended when it is reached. Long-term tracking of warehouse activity and expenditure will give you a deeper understanding of your setup and provide input for another round of cost/performance tuning.
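A minimal sketch of such a monitor (the quota, thresholds, and names are placeholders to adjust to your budget; creating resource monitors requires the ACCOUNTADMIN role):

```sql
-- Monthly credit quota with a warning at 80% and a suspend at 100%.
CREATE OR REPLACE RESOURCE MONITOR monthly_budget
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80  PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to a warehouse.
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_budget;
```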



Snowflake also exposes account usage data in the shared SNOWFLAKE database (the ACCOUNT_USAGE schema). It contains a large number of system views with valuable data that can give you more information about total account usage at various levels (queries, tasks, pipes, warehouses, materialised views, and so on).
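For example, a quick look at which warehouses burned the most credits over the last week might look like this (a sketch; note that ACCOUNT_USAGE views can lag real time by a few hours):

```sql
-- Credits consumed per warehouse over the last 7 days.
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_7_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY credits_last_7_days DESC;
```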


Materialised views



A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use. Because the data is pre-computed, querying a materialized view is faster than executing a query against the base table of the view. This performance difference can be significant when a query is run frequently or is sufficiently complex. As a result, materialized views can speed up expensive aggregation, projection, and selection operations, especially those that run frequently and that run on large data sets.
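For reference, a materialized view is defined much like a regular view; a hypothetical example over a single base table (Snowflake MVs cannot contain joins):

```sql
-- Pre-computed daily aggregate; queries against it skip re-scanning sales.
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date,
       SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;
```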


Snowflake refreshes materialized views automatically. This means that if you have materialized views whose underlying data changes frequently, you could pay a considerable amount of credits just for the refreshes. As a result, it is wise to use materialized views only when the base table does not change regularly.
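To check whether refreshes are eating credits, ACCOUNT_USAGE has a refresh history view; a sketch below (column names are from memory, so verify against the current documentation):

```sql
-- Credits spent on materialized view refreshes over the last 30 days.
SELECT table_name AS materialized_view,
       SUM(credits_used) AS refresh_credits
FROM snowflake.account_usage.materialized_view_refresh_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP)
GROUP BY table_name
ORDER BY refresh_credits DESC;
```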

How to ingest data effectively?



There are a few factors that can affect performance, and consequently cost, when building a data pipeline. Snowflake provides what are known as clustering keys: a mechanism for organizing data effectively across its micro-partitions, which improves performance when retrieving data with SELECT. Clustering keys are not suited to every table; they are only useful for very large tables in terms of row count. Snowflake performs automatic clustering, but there is a significant expense associated with it.

The workaround is to simulate (to some extent) what clustering keys accomplish. When inserting data into tables, sort the rows by the column that is typically used in joins or filters (the WHERE clause); date and time columns, IDs, and so on are common filtering criteria. When data is later retrieved from Snowflake, this sorting greatly helps query pruning, so fewer micro-partitions are scanned and query run time drops significantly.
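A minimal sketch of the pattern (table and column names are made up for illustration):

```sql
-- Load the staged rows sorted by the column most queries filter on,
-- so rows with similar dates land in the same micro-partitions and
-- Snowflake can prune partitions on date-filtered queries.
INSERT INTO sales
SELECT *
FROM sales_staging
ORDER BY sale_date;
```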



Cloud Services Compute

The cloud services layer of the Snowflake architecture consumes credits as it performs behind-the-scenes tasks such as authentication, metadata management, and access control. Usage of the cloud services layer is charged only if the daily consumption of cloud services resources exceeds 10% of the daily warehouse usage.



This query returns the total cloud services credits consumed per query type over the last day:



```sql
SELECT query_type,
  SUM(credits_used_cloud_services) AS cs_credits,
  COUNT(1) AS num_queries
FROM snowflake.account_usage.query_history
WHERE true
  AND start_time >= TIMESTAMPADD(day, -1, CURRENT_TIMESTAMP)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```



In my personal experience, I found that `USE DATABASE` and `USE SCHEMA` statements were being issued inside a loop in a stored procedure, which significantly increased our cloud services costs.
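The fix was simply to set the context once before the loop. A hypothetical sketch of the shape of the problem (a caller's-rights JavaScript procedure; the names and the file loop are invented for illustration):

```sql
CREATE OR REPLACE PROCEDURE load_daily_files()
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
  // Set the context ONCE, outside the loop. Issuing USE DATABASE / USE SCHEMA
  // on every iteration adds a cloud-services call per file for no benefit.
  snowflake.execute({sqlText: "USE SCHEMA my_db.my_schema"});

  for (var i = 0; i < 24; i++) {
    snowflake.execute({
      sqlText: "COPY INTO raw_events FROM @raw_stage/events_" + i + ".csv"
    });
  }
  return "done";
$$;
```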


The next query lists the queries of a specific type that consumed the most cloud services credits. Below is an example for COPY queries:

```sql
SELECT *
FROM snowflake.account_usage.query_history
WHERE true
  AND start_time >= TIMESTAMPADD(day, -1, CURRENT_TIMESTAMP)
  AND query_type = 'COPY'
ORDER BY credits_used_cloud_services DESC
LIMIT 10;
```



How does caching affect costs?



Snowflake has different types of caches, and all of them have an impact on your costs and performance. Below are a few ways to use them to save cost and improve performance.

Metadata cache



Snowflake holds a large amount of metadata about different objects (tables, views, micro-partitions, and so on) and events (copy command history, load command history). This metadata can answer certain queries on its own. For example, consider the query SELECT COUNT(*) FROM <table>: no warehouse size is shown for it in the history tab, which indicates that the result comes from the metadata cache and does not require a running virtual warehouse.
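A few more statements that can typically be served from metadata alone (a non-exhaustive sketch with illustrative names; whether a given query qualifies depends on the data types and query shape):

```sql
-- Commonly answered from micro-partition metadata, no running warehouse needed.
SELECT COUNT(*) FROM orders;
SELECT MIN(order_date), MAX(order_date) FROM orders;

-- Pure metadata operations handled by the cloud services layer.
SHOW TABLES IN SCHEMA my_db.my_schema;
DESCRIBE TABLE orders;
```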


Virtual Warehouse cache or Local cache



This type of cache is known by various names. All of the designations refer to a cache that is tied to a specific instance of a virtual warehouse. Each virtual warehouse stores data locally. Assume you own a warehouse called COMPUTE_WH. When you run the query for the first time, the data is cached locally. The next time you execute a query that uses cached data, COMPUTE_WH can retrieve it directly from the local cache, saving time and money. In the query profile, you can see what was fetched from this cache. Just keep in mind that when you turn off the warehouse, the local cache is cleared. This is where good auto-suspend parameter tuning is required based on your requirements.

If the warehouse is stopped and then restarted in a minute or two, there will be no data in cache and you will be charged for the entire first minute.

Query Result Cache



The query result cache is the final type of cache. Snowflake saves the results of every query you run. If you execute the same query again within 24 hours, the result will be returned from the query result cache (within milliseconds). More rules must be followed in order to use the query result cache:


  • Non-deterministic functions such as CURRENT_TIMESTAMP and CURRENT_DATE cannot be used.
  • The underlying data must not have changed since the last execution.
  • The query cannot use UDFs.

Any of these factors will prevent the cached result from being reused. If you do execute the same query again within 24 hours, Snowflake resets the internal clock and the cached result is available for another 24 hours; this can continue for a maximum of 31 days.
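One practical consequence: when benchmarking warehouse sizes, disable the result cache for your session so timings are not skewed by cache hits (USE_CACHED_RESULT is a standard session parameter):

```sql
-- Turn the result cache off while benchmarking, then back on.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
-- ... run the queries you want to time ...
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```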

Stored procedure optimizations



If you load data in batches, you have most likely used stored procedures. Below are a few ways to improve both the performance and the cost of running stored procedures in a virtual warehouse.

  1. Optimize Code Logic: Review your stored procedure code to identify any areas that can be improved in terms of efficiency and resource usage. Avoid using unnecessary loops, nested queries, or redundant operations. Optimize your SQL queries to use the most efficient approach for data retrieval and manipulation.
  2. Minimize Data Movement: Try to minimize the amount of data that needs to be moved between stages or tables within the stored procedure. Data movement can be a costly operation in terms of both time and resource consumption.
  3. Emit Operational Metrics: For example, if you're copying multiple files from AWS S3 to Snowflake inside a stored procedure, keep a tab on the number of files copied per unit of time.
  4. Use Appropriate Data Types: Make sure to use the appropriate data types for your columns. Using data types with excessive precision or storage can lead to increased costs.
  5. Manage Transactions: Limit the usage of long-running transactions within your stored procedures. Transactions can hold locks on resources for extended periods, leading to potential resource contention and increased costs.
  6. Optimize Query Execution: Leverage Snowflake's query optimization features like automatic query optimization, clustering, and materialized views. This can help improve query performance and reduce the cost of data processing.
  7. Limit Data Scans: Be mindful of the amount of data your stored procedure reads or scans. Minimize the use of SELECT * and only fetch the necessary columns.
  8. Use Staging: When loading large volumes of data, consider using Snowflake's staging area to efficiently ingest data into your tables (see the sketch after this list).
  9. Monitor and Analyze: Regularly monitor the performance of your stored procedures and analyze query plans to identify potential bottlenecks. Snowflake provides several performance monitoring tools to help with this task.
  10. Scale Resources Appropriately: Adjust the size of your virtual warehouses appropriately based on the workload requirements of your stored procedures. Scaling up when necessary and down during off-peak times can help manage costs effectively.
  11. Optimize Storage: Evaluate the way your data is stored and managed. Utilize appropriate clustering keys and partitions to minimize storage costs and improve query performance.
  12. Schedule Wisely: Consider scheduling the execution of your stored procedures during non-peak hours to take advantage of lower compute costs.
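For point 8, a staged bulk load might look roughly like this (the stage URL, file format, and table names are assumptions; the storage integration or credentials needed for a real external stage are omitted for brevity):

```sql
-- Named external stage pointing at the raw files
-- (in practice you would also attach a STORAGE_INTEGRATION or credentials).
CREATE OR REPLACE STAGE raw_stage
  URL = 's3://my-bucket/raw/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Bulk-load all matching files in one COPY instead of many small inserts.
COPY INTO raw_orders
FROM @raw_stage
PATTERN = '.*orders.*[.]csv';
```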

Below are a few things I did to optimize our warehouse cost; together they saved about 70% on virtual warehouse spend:

  1. Based on business requirements, two (or more) sets of similar tasks were run simultaneously on the same warehouse. The warehouse was initially underutilized; by doing this it was used to its optimum capacity.
  2. Use data warehouses for what they are meant for: reduce single updates and single inserts, and use bulk updates and bulk inserts instead (see the MERGE sketch after this list).
  3. Each warehouse has multiple cores/threads, so design queries with this in mind. Parallelize when possible so that you can utilize all of the cores/threads.
  4. Check for locks/blocked queries stuck in transactions.

    SHOW LOCKS [ IN ACCOUNT ]

  5. Reduced transactions where they were not needed.

  6. Emitted metrics inside the stored procedure/task and logged them inside task return value for further analysis.

  7. Decreased DDL inside stored procedures and improved the logic.

  8. Removed queries inside stored procedures that only consumed cloud services (such as the repeated USE statements mentioned earlier).
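For point 2 above, a set-based MERGE is the usual replacement for row-by-row updates; a hypothetical sketch (table and column names are invented):

```sql
-- Apply a whole batch of status changes in one statement instead of
-- issuing one UPDATE per order.
MERGE INTO orders o
USING order_updates u            -- staging table holding the batch
  ON o.order_id = u.order_id
WHEN MATCHED THEN
  UPDATE SET o.status = u.status;
```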

Thanks for reading. :)
