Our team hit a familiar Snowflake paradox: slower ETL runs arrived at the same time as FinOps alerts about rising credits.
The warehouse in question was WH_ETL_BRONZE_01, a multi-cluster warehouse dedicated to Bronze layer ingestion and merge workloads. What looked like a simple cost problem turned out to be a workload-isolation and concurrency problem.
The Problem
We started with this setup:
ALTER WAREHOUSE WH_ETL_BRONZE_01 SET
WAREHOUSE_SIZE = 'XSMALL',
MAX_CLUSTER_COUNT = 10,
MIN_CLUSTER_COUNT = 1,
SCALING_POLICY = 'STANDARD',
AUTO_SUSPEND = 60,
MAX_CONCURRENCY_LEVEL = 8; -- default behavior
And we saw both of these at once:
- Higher queue times (`queued_overload_time`)
- Higher daily credits
At that point, we were loading 1,500+ tables through this single warehouse. That meant heavyweight Bronze MERGE workloads and smaller ingestion/utility queries were all competing in the same execution pool.
The critical issue was workload mix. Long-running MERGE statements (30 to 60 minutes) were running alongside tiny queries. When too many heavy MERGE statements landed on the same node/cluster, they competed for resources and all slowed down.
In other words, even with multi-cluster enabled, assignment patterns could create unstable performance if too many heavyweight queries were packed together.
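The packing effect can be sketched with simple arithmetic. This is an illustration of the idea, not Snowflake's actual scheduler; the burst sizes are hypothetical:

```python
import math

def clusters_spun_up(concurrent_queries: int,
                     max_concurrency_level: int,
                     max_cluster_count: int) -> int:
    """Rough model: a multi-cluster warehouse adds clusters once every
    running cluster already holds max_concurrency_level queries."""
    needed = math.ceil(concurrent_queries / max_concurrency_level)
    return min(needed, max_cluster_count)

# 10 heavy MERGE statements arriving together (hypothetical burst):
print(clusters_spun_up(10, 8, 10))  # 2 clusters -> up to 8 merges packed on one
print(clusters_spun_up(10, 2, 10))  # 5 clusters -> at most 2 merges per cluster
```

With the default concurrency level of 8, a burst of heavy merges can pile onto a single cluster before scale-out kicks in; with a lower level, the same burst is spread across more clusters.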
We also learned we needed workload separation: large-table loads should run in a dedicated warehouse so they do not contend with smaller table loads and operational queries.
What We Changed
This was not a single before/after discovery from history. We ran controlled experiments in sequence, while splitting the largest table loads into a separate warehouse path.
Phase 1: Baseline
ALTER WAREHOUSE WH_ETL_BRONZE_01 SET
WAREHOUSE_SIZE = 'XSMALL',
MAX_CLUSTER_COUNT = 10,
MIN_CLUSTER_COUNT = 1,
SCALING_POLICY = 'STANDARD',
AUTO_SUSPEND = 60,
MAX_CONCURRENCY_LEVEL = 8;
Phase 2: First Experiment
ALTER WAREHOUSE WH_ETL_BRONZE_01 SET
WAREHOUSE_SIZE = 'XSMALL',
MAX_CLUSTER_COUNT = 10,
MIN_CLUSTER_COUNT = 1,
SCALING_POLICY = 'STANDARD',
AUTO_SUSPEND = 60,
MAX_CONCURRENCY_LEVEL = 2;
Goal: force scale-out behavior earlier and reduce the chance that many heavy MERGE jobs share the same node.
At the same time, we moved larger-table processing to a separate warehouse so those jobs would not compete with the remaining 1,500+ table ingestion flow.
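The routing split itself lives in the orchestration layer, which picks a warehouse per table before issuing the load. A minimal sketch of that decision, where the threshold, the helper, and the dedicated warehouse name are all hypothetical:

```python
LARGE_TABLE_THRESHOLD_GB = 50  # hypothetical cutoff for the heavy MERGE path

def pick_warehouse(table_size_gb: float) -> str:
    """Route large-table loads to a dedicated warehouse so heavy MERGEs
    do not contend with the 1,500+ smaller ingestion flows."""
    if table_size_gb >= LARGE_TABLE_THRESHOLD_GB:
        return "WH_ETL_BRONZE_LARGE"  # hypothetical dedicated warehouse
    return "WH_ETL_BRONZE_01"

print(pick_warehouse(120.0))  # large table -> isolated warehouse
print(pick_warehouse(2.5))    # small table -> shared Bronze warehouse
```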
Phase 3: Final Optimization
ALTER WAREHOUSE WH_ETL_BRONZE_01 SET
WAREHOUSE_SIZE = 'SMALL',
MAX_CLUSTER_COUNT = 10,
MIN_CLUSTER_COUNT = 1,
SCALING_POLICY = 'STANDARD',
AUTO_SUSPEND = 60,
MAX_CONCURRENCY_LEVEL = 3;
This became the best balance for our workload profile.
Final state for this ETL warehouse path:
- `SMALL` size
- `MAX_CONCURRENCY_LEVEL = 3`
- Large-table workloads isolated to a separate warehouse
Why This Helped
From deeper analysis of the workload behavior:
- The number of queries and output volume could look similar across runs, but bytes scanned at table level could still rise.
- Heavy MERGE overlap was a key instability driver.
- On slow runs, many heavy MERGE statements could get assigned to the same node, causing resource contention.
- On fast runs, fewer heavy MERGE statements shared a node, leaving room for short-running queries.
- Isolating large-table workloads reduced cross-workload contention in the primary Bronze ETL warehouse.
Lowering concurrency reduced the tendency to over-pack heavyweight merges. Then moving to SMALL provided enough per-query resources to cut elapsed time and queueing further. Separating large-table loads into a dedicated warehouse stabilized performance for the rest of the pipeline.
Configuration Summary
| Phase | Size | Max Clusters | Max Concurrency | Intent |
|---|---|---|---|---|
| Baseline | XSMALL | 10 | 8 | Initial setup |
| Experiment | XSMALL | 10 | 2 | Force earlier scale-out / reduce heavy-query packing |
| Final | SMALL | 10 | 3 | Balance throughput, queueing, and cost |
The two final intentional changes vs baseline were:
- `WAREHOUSE_SIZE`: `XSMALL` -> `SMALL`
- `MAX_CONCURRENCY_LEVEL`: `8` -> `3`
How We Measured
We used Snowflake ACCOUNT_USAGE views for both performance and cost, comparing baseline and final optimization windows.
Query Performance (Before/After)
SELECT
CASE
WHEN start_time < '2026-05-05' THEN 'XSMALL Period'
ELSE 'SMALL Period'
END AS period,
COUNT(*) AS total_queries,
ROUND(AVG(total_elapsed_time)/1000, 2) AS avg_elapsed_sec,
ROUND(AVG(queued_overload_time)/1000, 2) AS avg_queued_sec,
ROUND(MAX(total_elapsed_time)/1000, 2) AS max_elapsed_sec
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE warehouse_name = 'WH_ETL_BRONZE_01'
AND start_time >= '2026-04-30'
GROUP BY period
ORDER BY period;
Credit Consumption (Daily)
SELECT
CASE
WHEN start_time::DATE < '2026-05-05' THEN 'XSMALL Period'
ELSE 'SMALL Period'
END AS period,
ROUND(SUM(credits_used_compute) / COUNT(DISTINCT start_time::DATE), 2) AS daily_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE warehouse_name = 'WH_ETL_BRONZE_01'
AND start_time::DATE >= '2026-04-30'
GROUP BY period
ORDER BY period;
Results
Cost
| Period | Size | Daily Credits |
|---|---|---|
| Baseline (5 days) | XSMALL | 15.12/day |
| Optimized (3 days) | SMALL | 13.50/day |
Daily credits dropped by about 11% after upsizing.
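The arithmetic behind that drop is worth spelling out. At Snowflake's standard compute rates (1 credit per cluster-hour for XSMALL, 2 for SMALL), the daily totals imply far fewer busy cluster-hours after the change:

```python
XSMALL_RATE, SMALL_RATE = 1.0, 2.0  # credits per cluster-hour (standard rates)

xsmall_daily_credits = 15.12
small_daily_credits = 13.50

# Implied busy cluster-hours per day at each size
xsmall_cluster_hours = xsmall_daily_credits / XSMALL_RATE  # ~15.1 on XSMALL
small_cluster_hours = small_daily_credits / SMALL_RATE     # ~6.75 on SMALL

savings = (xsmall_daily_credits - small_daily_credits) / xsmall_daily_credits
print(f"{xsmall_cluster_hours:.2f} -> {small_cluster_hours:.2f} cluster-hours/day")
print(f"daily credit savings: {savings:.0%}")  # about 11%
```

SMALL burns credits twice as fast, but it ran for well under half the cluster-hours, so the net daily spend went down.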
Performance
| Metric | XSMALL | SMALL | Improvement |
|---|---|---|---|
| Avg query time | 26.15s | 24.13s | 8% faster |
| Max query time | 3,475s | 2,850s | 18% faster |
| Total queue time | 2.20 min | 0.28 min | 87% less queuing |
Note: The intermediate XSMALL plus concurrency 2 phase was used to validate behavior and direction. The published KPI table above compares the stable baseline period against the final tuned period.
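The improvement percentages in the table follow directly from the raw numbers:

```python
def pct_improvement(before: float, after: float) -> float:
    """Relative improvement, in percent, from 'before' to 'after'."""
    return (before - after) / before * 100

print(round(pct_improvement(26.15, 24.13)))  # ~8% faster average query
print(round(pct_improvement(3475, 2850)))    # ~18% faster worst case
print(round(pct_improvement(2.20, 0.28)))    # ~87% less queue time
```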
The Counterintuitive Lesson
In some workloads, a bigger warehouse can cost less.
1. Faster execution means earlier suspend
Queries completed faster on SMALL, so the warehouse reached AUTO_SUSPEND = 60 sooner. Less runtime plus less idle overhead translated to fewer daily credits.
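How much an earlier suspend is worth can be estimated from Snowflake's per-second billing (with a 60-second minimum per resume). The 15-minute figure below is hypothetical, just to show the shape of the calculation:

```python
SMALL_RATE = 2.0  # credits per cluster-hour for a SMALL warehouse

def credits_for_seconds(active_seconds: float, rate: float) -> float:
    """Per-second billing with Snowflake's 60-second minimum per resume."""
    return rate * max(active_seconds, 60) / 3600

# Hypothetical: the batch window finishes 15 minutes earlier each day
saved_per_day = credits_for_seconds(15 * 60, SMALL_RATE)
print(round(saved_per_day, 2))  # 0.5 credits/day from earlier suspend alone
```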
2. Concurrency must match workload shape
MAX_CONCURRENCY_LEVEL trades per-cluster packing against cluster sprawl. Lowering it from the default 8 to 2 first improved isolation for heavy MERGE jobs; settling at 3 with SMALL gave the right balance of throughput and stability.
3. Less queueing improves throughput and utilization
With 87% less queue time, work finished in tighter windows. The warehouse did useful work and went to sleep sooner.
Key Takeaways
- Do not assume smaller is always cheaper.
- Monitor `queued_overload_time` closely. Persistent queueing often means the warehouse is under-sized or its concurrency settings are too tight.
- Tune `MAX_CONCURRENCY_LEVEL` together with warehouse size and workload type; they should be optimized as a pair.
- Isolate heavyweight large-table workflows into a dedicated warehouse to reduce overlap and contention.
- Use both `QUERY_HISTORY` (performance) and `WAREHOUSE_METERING_HISTORY` (cost) for before/after decisions.
- Keep aggressive auto-suspend for bursty workloads. Faster queries plus short suspend windows compound savings.
Closing Thought
Warehouse right-sizing is a performance and FinOps exercise, not just a cost-control exercise. In many real workloads, a slightly larger warehouse with slightly higher concurrency can win on both speed and spend.
Disclosure and Scope
- Results reflect our specific environment and workload profile; outcomes may vary.
- This analysis used Snowflake's built-in ACCOUNT_USAGE views only; no third-party monitoring stack was required.