De' Clerke

Posted on Jun 2

Snowflake for Data Engineers: The Mental Model Shifts That Actually Matter

#snowflake #database #datawarehouse #dataengineering

When I moved to Snowflake for the BizPulse Kenya project, my first reaction was that it felt familiar. The SQL was standard. The dbt models were identical. The Airflow DAG looked the same. Then I started hitting unexpected costs, slow loads, and behaviors that did not match what I expected from PostgreSQL. The SQL was the same. The mental model was completely different.

This article covers the shifts that matter, written from the perspective of someone who uses dbt, Python, and Airflow to build pipelines, not someone managing a Snowflake account full-time.

Compute and Storage Are Separate

In PostgreSQL, the database server handles both storage and query execution. In Snowflake, they are fully decoupled. Your data lives in cloud object storage (S3 or similar). Compute is provided by virtual warehouses, which are independent clusters you spin up on demand.

CREATE WAREHOUSE COMPUTE_WH
  WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

This separation has one critical implication: you only pay for compute when a warehouse is running. A warehouse that stays running 24 hours a day costs 24 times more than one that runs for one hour. Set AUTO_SUSPEND = 60 on every warehouse. 60 seconds of inactivity before suspend is aggressive but appropriate for a batch pipeline that runs once a day.

Warehouse sizing affects query speed directly. An X-SMALL warehouse has 1 server. A LARGE has 8. For a dbt run that builds 30 models in sequence, increasing the warehouse size reduces wall clock time. The credits-per-hour rate also increases, but the total run time drops proportionally, so the cost is roughly the same. Where sizing matters is for parallel workloads: multiple dbt threads, multiple concurrent queries, or large joins that exceed available memory.

For a typical data engineering project with a daily batch pipeline:

Use X-SMALL for dbt development and testing
Use SMALL or MEDIUM for production dbt runs if models are slow
Always AUTO_SUSPEND = 60 and AUTO_RESUME = TRUE

ALTER WAREHOUSE COMPUTE_WH SUSPEND;
ALTER WAREHOUSE COMPUTE_WH RESUME;
ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = 'SMALL';

You can resize a warehouse without dropping it. Resize up before a heavy job, resize down after.

There Are No Indexes: Micro-Partitions Do the Work

PostgreSQL requires you to create indexes explicitly. Snowflake does not have user-managed indexes at all. Data is automatically divided into compressed micro-partitions of 50 to 500 MB each, and Snowflake tracks the min and max value of every column in every micro-partition. When your query has a WHERE clause, Snowflake prunes partitions whose min/max range cannot contain the matching values and never reads them.

This works well for monotonically increasing columns like timestamps, where each partition naturally contains a distinct range. It works less well for high-cardinality random columns where values are scattered across many partitions.

When a table is large and queries on it are slow, check the clustering depth with:

SELECT SYSTEM$CLUSTERING_INFORMATION('ANALYTICS.MARTS.FCT_ORDERS', '(TO_DATE(CREATED_AT))');

If the average depth is high, the partitions are not well-organized around your query column. You can declare a clustering key to reorganize them:

ALTER TABLE ANALYTICS.MARTS.FCT_ORDERS
  CLUSTER BY (TO_DATE(created_at), customer_id);

Snowflake then automatically reclusters the table in the background. This costs credits, so use it only on large tables (hundreds of millions of rows) that you query repeatedly on the same columns. For normal pipeline tables of a few million rows, micro-partition pruning is sufficient without explicit clustering.

COPY INTO: The Right Way to Load Data

The PostgreSQL COPY command streams data directly to the database server. Snowflake's COPY INTO loads from a stage, which is a named location pointing to internal storage or an external cloud bucket (S3, GCS, Azure Blob).

-- Internal named stage
CREATE STAGE ANALYTICS.RAW.MY_STAGE
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1);

-- Upload a file (from SnowSQL)
PUT file:///path/to/orders.csv @MY_STAGE AUTO_COMPRESS=TRUE;

-- Load into table
COPY INTO ANALYTICS.RAW.ORDERS
FROM @MY_STAGE/orders.csv.gz
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT')
ON_ERROR = 'SKIP_FILE';

One behavior that is easy to miss: COPY INTO does not re-load a file that it has already successfully loaded from that stage. Snowflake tracks a load history per stage and table. If you re-run the same COPY INTO command with the same file, nothing happens. This is the right default for idempotent pipelines but will bite you if you are trying to reload corrected data. Use FORCE = TRUE to override:

COPY INTO ANALYTICS.RAW.ORDERS
FROM @MY_STAGE/orders.csv.gz
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT')
FORCE = TRUE;

Before loading production data, validate first:

COPY INTO ANALYTICS.RAW.ORDERS
FROM @MY_STAGE/orders.csv.gz
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT')
VALIDATION_MODE = 'RETURN_ERRORS';

This returns a list of parsing errors without writing anything to the table. A validation run that comes back clean means the actual load will succeed.

For Python pipelines that build DataFrames and write to Snowflake, write_pandas is the right call, not to_sql. Under the hood, write_pandas stages the data and uses COPY INTO. The performance difference is the same as the COPY vs insert difference in PostgreSQL.

from snowflake.connector.pandas_tools import write_pandas

success, nchunks, nrows, _ = write_pandas(
    conn,
    df,
    table_name="ORDERS",
    schema="RAW",
    database="ANALYTICS",
    auto_create_table=True,
    overwrite=False,
)
print(f"Loaded {nrows} rows in {nchunks} chunks")

to_sql via SQLAlchemy works, but it issues row-by-row inserts and is slow for any meaningful volume.

VARIANT: Semi-Structured Data Without the Pain

Snowflake's VARIANT type stores JSON, XML, or Avro natively. Unlike PostgreSQL's JSONB, you do not need GIN indexes for path queries. Snowflake automatically indexes paths within VARIANT columns at load time.

CREATE TABLE RAW.EVENTS (
    event_id  VARCHAR(36),
    payload   VARIANT,
    loaded_at TIMESTAMP_NTZ
);

-- Query VARIANT paths with colon-dot notation
SELECT
    payload:event_id::VARCHAR        AS event_id,
    payload:user.id::NUMBER          AS user_id,
    payload:tags[0]::VARCHAR         AS first_tag,
    (payload:amount)::FLOAT          AS amount
FROM RAW.EVENTS;

The :: casting is mandatory. Without it, every extracted value comes back as a VARIANT, not a typed column, and your aggregations will fail or return wrong results.

For arrays inside VARIANT, use LATERAL FLATTEN to expand them into rows:

SELECT
    e.event_id,
    f.value::VARCHAR AS tag
FROM RAW.EVENTS e,
LATERAL FLATTEN(input => e.payload:tags) f;

This is equivalent to PostgreSQL's jsonb_array_elements, but the syntax is distinct enough that it catches people who copy Postgres patterns directly.

The practical use case: land raw API responses as VARIANT, run dbt staging models that extract and type the fields you need. This pattern lets you change what you extract later without re-loading the raw data.

Time Travel and Zero-Copy Cloning

Two Snowflake features that have no direct PostgreSQL equivalent.

Time travel lets you query any table as it existed at any point in the past, up to the retention period (1 day on Standard edition, up to 90 days on Enterprise).

-- What did this table look like 30 minutes ago?
SELECT * FROM ANALYTICS.RAW.ORDERS AT (OFFSET => -60*30);

-- Restore an accidentally dropped table
UNDROP TABLE ANALYTICS.RAW.ORDERS;
UNDROP SCHEMA ANALYTICS.RAW;

This is a safety net, not a backup strategy. I have used it exactly once, after a dbt model accidentally ran with the wrong filter and truncated a staging table. UNDROP TABLE restored it in seconds.

Zero-copy cloning creates an instant copy of a table, schema, or entire database that shares the underlying storage until you start modifying rows. Creating a clone costs nothing and takes no time, regardless of the source size.

-- Clone a table (instant, zero cost until modified)
CREATE TABLE STAGING.ORDERS_BACKUP CLONE RAW.ORDERS;

-- Clone a whole database for dev/testing
CREATE DATABASE ANALYTICS_DEV CLONE ANALYTICS;

The practical use case: create a _DEV clone of production before running a risky migration or a new dbt model that touches fact tables. If something goes wrong, drop the clone and try again. No restore from backup, no waiting.

Streams: CDC Without Extra Infrastructure

A stream is a change-data-capture object that tracks inserts, updates, and deletes on a table since the last time the stream was consumed.

CREATE STREAM ANALYTICS.RAW.ORDERS_STREAM ON TABLE ANALYTICS.RAW.ORDERS;

-- Read what has changed since last consumption
SELECT * FROM ANALYTICS.RAW.ORDERS_STREAM;
-- Returns: METADATA$ACTION ('INSERT' or 'DELETE'), METADATA$ISUPDATE, METADATA$ROW_ID

-- Consume the stream in a MERGE (marks it as consumed)
MERGE INTO ANALYTICS.STAGING.STG_ORDERS AS target
USING (
    SELECT * FROM ANALYTICS.RAW.ORDERS_STREAM
    WHERE METADATA$ACTION = 'INSERT'
) AS src
ON target.order_id = src.order_id
WHEN MATCHED THEN
    UPDATE SET status = src.status, amount = src.amount
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, status, amount, created_at)
    VALUES (src.order_id, src.customer_id, src.status, src.amount, src.created_at);

The stream is consumed (reset) when you run a DML statement that reads from it. If the DML fails, the stream is not consumed and still contains all the changes. This makes it naturally idempotent: re-running the MERGE after a failure replays the same changes.

In a dbt + Airflow pipeline, streams are useful when you want to process only new data since the last DAG run without maintaining your own watermark table. Run a SnowflakeOperator to consume the stream into your staging table, then run dbt on top of it.

dbt on Snowflake: The Config Options That Matter

Most dbt behavior is identical between PostgreSQL and Snowflake. A few Snowflake-specific options are worth knowing.

Transient tables have no Fail-safe storage layer (the 7-day retention beyond Time Travel). For staging tables that are rebuilt every run and do not need point-in-time recovery, transient tables are cheaper.

{{ config(
    materialized='table',
    transient_table=true
) }}

Per-model warehouse lets you use a larger warehouse only for the models that need it.

{{ config(
    materialized='incremental',
    unique_key='order_id',
    snowflake_warehouse='LARGE_WH'
) }}

Use this for fact table builds that scan hundreds of millions of rows. Leave everything else on the default warehouse.

Clustering keys in dbt are configured at the model level:

{{ config(
    materialized='table',
    cluster_by=['TO_DATE(created_at)', 'customer_id']
) }}

copy_grants=true preserves grants on the table after a full rebuild. Without it, every dbt run that rebuilds a table model drops and recreates the table, which drops all grants. Any downstream role that had SELECT on the table loses access until grants are re-applied.

{{ config(
    materialized='table',
    copy_grants=true
) }}

Query result caching returns the same result for identical queries on unchanged data, at zero compute cost. This is excellent for dashboards but misleading when benchmarking dbt models. Disable it before timing a model:

ALTER SESSION SET USE_CACHED_RESULT = FALSE;

Python Connection Setup

The connection string format for SQLAlchemy is different from PostgreSQL:

from sqlalchemy import create_engine
import os

engine = create_engine(
    "snowflake://{user}:{password}@{account}/{database}/{schema}"
    "?warehouse={warehouse}&role={role}".format(
        user      = os.environ['SNOWFLAKE_USER'],
        password  = os.environ['SNOWFLAKE_PASSWORD'],
        account   = os.environ['SNOWFLAKE_ACCOUNT'],
        database  = 'ANALYTICS',
        schema    = 'RAW',
        warehouse = 'COMPUTE_WH',
        role      = 'TRANSFORMER',
    )
)

The account value is your Snowflake account identifier, not the URL. It looks like xy12345.eu-west-1 or myorg-myaccount depending on whether you're on the old or new account format. If you pass the full URL including .snowflakecomputing.com, the connection fails.

For Airflow, set the connection via environment variable in your Docker Compose file to avoid storing credentials in the Airflow metadata database:

environment:
  AIRFLOW_CONN_SNOWFLAKE_DEFAULT: >-
    snowflake://USER:PASS@ACCOUNT/ANALYTICS/RAW
    ?warehouse=COMPUTE_WH&role=TRANSFORMER

Then reference it in tasks:

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def load_to_snowflake(**context):
    hook = SnowflakeHook(snowflake_conn_id='snowflake_default')
    df   = hook.get_pandas_df("SELECT * FROM MARTS.FCT_ORDERS")
    return df

Monitoring Cost and Query Performance

Unlike PostgreSQL where you optimize for CPU and I/O, in Snowflake you optimize for credit consumption and bytes scanned. Both are visible in the query history.

-- Slow queries in the last 24 hours
SELECT query_text,
       total_elapsed_time / 1000       AS seconds,
       bytes_scanned / 1e9             AS gb_scanned,
       percentage_scanned_from_cache   AS cache_hit_pct
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(
    DATEADD('hour', -24, CURRENT_TIMESTAMP())
))
ORDER BY total_elapsed_time DESC
LIMIT 20;

-- Credit usage by warehouse this week
SELECT warehouse_name, SUM(credits_used) AS total_credits
FROM ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 2 DESC;

bytes_scanned is the key metric. A query that scans 500 GB is expensive. A query that scans 50 MB is not, even if it runs 100 times a day. Micro-partition pruning reduces bytes scanned. Caching eliminates it entirely for repeated identical queries.

percentage_scanned_from_cache being 100% means the query hit the result cache or the local disk cache and cost no credits. A dashboard query that runs 200 times a day and consistently hits cache costs the same as running it once.

If a query is scanning more data than expected, check whether the warehouse is running with a good clustering depth on the columns you are filtering by. A full scan on a 1 TB table when you are querying one week of data is a sign that Time Travel data is bloating the table or the clustering is poor.

What Is Different at a Glance

For anyone coming from PostgreSQL specifically, this is the quick mental model reset:

PostgreSQL	Snowflake
Manage indexes explicitly	No indexes; use clustering keys for large tables
psycopg2 COPY for bulk loads	COPY INTO from a stage
JSONB with GIN index	VARIANT with automatic path indexing
VACUUM to reclaim dead rows	No manual maintenance required
Connection pool tuning	Warehouse auto-suspend and sizing
`pg_stat_statements` for slow queries	`INFORMATION_SCHEMA.QUERY_HISTORY`
COPY breaks if the file changes	COPY INTO deduplicates by default (use FORCE=TRUE to reload)

The biggest adjustment is not technical. It is accepting that some things you control in PostgreSQL are handled automatically in Snowflake, and the levers you do have (warehouse size, clustering, transient vs permanent tables) exist for cost reasons, not correctness reasons.

BizPulse Kenya uses Snowflake as the warehouse layer with dbt and FinBERT for Kenyan news sentiment analysis. The code is on my GitHub.

Follow me on dev.to for more on data engineering, dbt, and Airflow.

DEV Community