DEV Community

Vinicius Fagundes

πŸ—οΈ Building a Sandbox Data Warehouse with Snowflake + GitHub Actions (Part 1)

A sandbox data warehouse is a reproducible playground where you can safely test schemas, pipelines, and transformations.

Instead of deploying manually, you can automate everything using GitHub Actions and idempotent SQL scripts.

This article (Part 1) covers:

  • Why sandboxes matter
  • Core design principles
  • SQL idempotence patterns for tables, views, and data loads
  • Repository structure

πŸš€ Why use a sandbox?

Sandboxes allow you to:

  • Iterate quickly without impacting production
  • Test end-to-end pipelines
  • Recreate environments in any Snowflake account just by providing new credentials

The golden rule: all scripts must be idempotent — they can run repeatedly and always converge to the same end state, without errors or duplicate data.


πŸ“ Core design principles

  1. Idempotence first β€” scripts must be safe to re-run.
  2. Parameterization β€” drive object names from environment variables.
  3. Small SQL files β€” easier to debug and maintain.
  4. Execution order — enforce with numeric folder prefixes (00_setup, 10_tables, etc.).
  5. Sandbox naming β€” append branch/run IDs for uniqueness.
  6. Secrets management β€” never hardcode credentials; use GitHub Secrets.
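Principles 2 and 5 can be combined in a few lines of Python. A minimal sketch — the `sandbox_name` helper is illustrative, but `GITHUB_REF_NAME` and `GITHUB_RUN_ID` are real environment variables that GitHub Actions sets automatically:

```python
import os
import re

def sandbox_name(prefix: str = "SANDBOX") -> str:
    """Build a unique, Snowflake-safe database name from CI metadata.

    GITHUB_REF_NAME / GITHUB_RUN_ID are provided by GitHub Actions;
    outside CI they fall back to placeholder values.
    """
    branch = os.environ.get("GITHUB_REF_NAME", "local")
    run_id = os.environ.get("GITHUB_RUN_ID", "0")
    # Snowflake unquoted identifiers allow only letters, digits, underscores.
    safe_branch = re.sub(r"[^A-Za-z0-9_]", "_", branch).upper()
    return f"{prefix}_{safe_branch}_{run_id}"
```

On branch `feature/login` in run 42, this yields `SANDBOX_FEATURE_LOGIN_42` — a name that is unique per run, so parallel branches never collide.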

πŸ—‚οΈ Recommended repository layout

  • sql/
    • 00_setup/ → database and schema creation scripts
      • 01_create_database.sql
      • 02_create_schema.sql
    • 10_tables/ → raw/core tables
      • 01_raw_events.sql
      • 02_dim_users.sql
    • 20_views/ → reporting views
      • v_user_sessions.sql
    • 30_grants/ → permissions
      • grants.sql
    • 90_tests/ → smoke tests
      • smoke_check.sql
  • scripts/ → Python runner to execute SQL scripts
    • run_all_sql.py
  • .github/workflows/ → GitHub Actions workflow
    • ci-deploy-sandbox.yml
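The actual `scripts/run_all_sql.py` is not shown in this part, but a minimal sketch of such a runner might look like this — assuming the `snowflake-connector-python` package, `{{NAME}}`-style placeholders in the SQL files, and credentials exposed as environment variables from GitHub Secrets:

```python
import os
import pathlib
import re

SQL_ROOT = pathlib.Path("sql")

def render(text: str, params: dict) -> str:
    """Substitute {{NAME}} placeholders with values from params."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: params[m.group(1)], text)

def run_all() -> None:
    """Execute every sql/ file in lexicographic (i.e. numeric-prefix) order."""
    import snowflake.connector  # pip install snowflake-connector-python

    params = {
        "DATABASE_NAME": os.environ["DATABASE_NAME"],
        "SCHEMA_NAME": os.environ["SCHEMA_NAME"],
    }
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        # sorted() honors the numeric prefixes: 00_setup runs before 10_tables, etc.
        for path in sorted(SQL_ROOT.rglob("*.sql")):
            print(f"-- running {path}")
            for cursor in conn.execute_string(render(path.read_text(), params)):
                cursor.close()
    finally:
        conn.close()

if __name__ == "__main__":
    run_all()
```

Because the folder prefixes sort naturally, plain `sorted()` is all the orchestration needed — no DAG tooling required for a sandbox of this size.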

🧩 SQL patterns for idempotence

-- 00_setup: "if not exists" makes database/schema creation safe to re-run
create database if not exists {{DATABASE_NAME}};
create schema if not exists {{SCHEMA_NAME}};

-- 10_tables: "create table if not exists" never clobbers existing data
create table if not exists raw.events (
  event_id varchar,
  user_id varchar,
  event_type varchar,
  event_time timestamp_ltz,
  payload variant
);

create table if not exists dim.users (
  user_id varchar primary key,
  first_name varchar,
  last_name varchar,
  email varchar,
  created_at timestamp_ltz
);

-- 20_views: views hold no data, so "create or replace" is safe to re-run
create or replace view analytics.v_user_sessions as
select 
    e.user_id, 
    count(*) as session_count,
    max(e.event_time) as last_event_time
from raw.events e
group by e.user_id;

-- Data loads: merge is an idempotent upsert. Note that querying a stage
-- directly exposes positional columns ($1, $2, ...), not column names.
merge into dim.users t
using (
    select
        $1::varchar as user_id,
        $2::varchar as first_name,
        $3::varchar as last_name,
        $4::varchar as email
    from @staging/users_csv
) s
on t.user_id = s.user_id
when matched then update set
    first_name = s.first_name,
    last_name = s.last_name,
    email = s.email
when not matched then insert (
    user_id, first_name, last_name, email
) values (
    s.user_id, s.first_name, s.last_name, s.email
);
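You can see the golden rule in action locally by running the same "deploy" twice. The sketch below uses SQLite purely as a stand-in for Snowflake — `create table if not exists` behaves the same, and SQLite's `insert ... on conflict` plays the role of `merge`:

```python
import sqlite3

# SQLite stands in for Snowflake here: what we're demonstrating is the
# idempotence property (re-running converges to the same end state),
# not Snowflake-specific syntax.
DDL = "create table if not exists users (user_id text primary key, email text)"

# SQLite's upsert is the local analogue of Snowflake's MERGE.
UPSERT = """
insert into users (user_id, email) values (?, ?)
on conflict (user_id) do update set email = excluded.email
"""

def deploy(conn: sqlite3.Connection) -> None:
    conn.execute(DDL)
    conn.execute(UPSERT, ("u1", "u1@example.com"))
    conn.commit()

conn = sqlite3.connect(":memory:")
deploy(conn)
deploy(conn)  # second run: no error, no duplicate rows
rows = conn.execute("select user_id, email from users").fetchall()
print(rows)  # [('u1', 'u1@example.com')]
```

Both runs succeed and the table ends up with exactly one row — the defining property every script in the sandbox must have.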

βœ… Conclusion

By designing your Snowflake sandbox with idempotent, parameterized SQL and automated GitHub Actions, you create a highly portable, reproducible, and maintainable environment. This approach not only accelerates testing and development but also reduces risk when moving projects between accounts or even different SQL-based platforms.

  • Easy migration between Snowflake accounts β€” Because all objects (databases, schemas, tables, views) are created dynamically from variables, running the same pipeline in a new account recreates the environment without manual intervention.
  • Cross-platform portability β€” Standard SQL constructs and MERGE patterns make it possible to move to Redshift, Databricks SQL, or other SQL-compatible warehouses with minimal refactoring.
  • Engineering discipline & reproducibility — Storing scripts in GitHub, executing them in a deterministic order, and keeping them idempotent ensures the sandbox can be rebuilt at any time, promoting reliability and version control.
  • Avoid vendor lock-in β€” By keeping SQL declarative, parameterized, and platform-agnostic, you reduce dependency on proprietary features while still taking advantage of Snowflake’s capabilities.
  • Safe iterative development β€” Developers can experiment freely in sandbox environments, run pipelines repeatedly, and validate transformations without affecting production or requiring complex rollback strategies.
  • Scalable best practices β€” This methodology supports teams of any size, enabling CI/CD for data engineering and aligning with modern data platform practices.
