Marko

Stop Waiting for the Cloud: Building a Hybrid SQL+Python Data Pipeline Locally with DuckDB

Cloud data warehouses are amazing for production. They are terrible for development.

If you’re a Data Engineer, you know the pain of the “cloud feedback loop”:

  1. You write a complex SQL query.
  2. You hit “Run” in your orchestrator.
  3. You wait 45 seconds for the warehouse to spin up or queue your job.
  4. It fails because of a syntax error.
  5. You fix it. You pay for the query slot. You wait again.

This latency kills flow.

In software engineering, we run code locally on our laptops before shipping to production. Why can’t we do the same for data pipelines?

I built FastFlowTransform (FFT) to solve this. It’s a framework that lets you build and test your pipeline locally using DuckDB (for speed and free compute), and then deploy the same project to Snowflake, BigQuery, or Databricks for production.

In this post we’ll build a tiny “Users” pipeline that:

  • runs locally on DuckDB in < 1s, and
  • can be deployed to BigQuery by changing a single CLI flag (--env).

1. Initialize a local-first project (no cloud creds)

You don’t need AWS keys or a Snowflake login. Just a laptop.

pip install fastflowtransform

fft init building_locally_demo --engine duckdb
cd building_locally_demo

This creates a minimal FFT project with:

  • models/ – SQL/Python models
  • seeds/ – CSV/Parquet seeds
  • profiles.yml – connection profiles, including a local DuckDB one

We’ll use DuckDB as our “dev warehouse”.
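
The “dev warehouse” here is literally the duckdb Python package: in-process, no server, no credentials. A tiny standalone check (plain DuckDB, nothing FFT-specific):

# DuckDB runs in-process: nothing to start, nothing to sign in to.
import duckdb

duckdb.sql("select 42 as answer").show()

# Relations convert straight to pandas, which is what makes mixing SQL and Python cheap later on.
print(duckdb.sql("select 'alice@example.com' as email").df())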


2. Add some data (seeds + sources)

We start with a simple users CSV.

seeds/seed_users.csv:

id,email,signup_date
1,alice@example.com,2023-01-01
2,bob@example.com,2023-01-02

Then we tell FFT how to reference this seed as a source.

sources.yml:

version: 2

sources:
  - name: raw
    schema: staging
    tables:
      - name: users
        identifier: seed_users

Now {{ source('raw', 'users') }} will resolve to the table created from seed_users.csv.

Load the seed into DuckDB:

fft seed . --env dev_duckdb

(Screenshot: fft seed output on local DuckDB)

You should now have:

  • a DuckDB file at .local/dev.duckdb
  • a table like staging.seed_users available in your local engine
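
If you want to peek at what the seed produced without going through FFT, you can open the DuckDB file directly. A minimal sketch, assuming the default .local/dev.duckdb path and the staging.seed_users name from above:

# Inspect the table created by `fft seed`, straight from the local DuckDB file.
import duckdb

con = duckdb.connect(".local/dev.duckdb", read_only=True)
con.sql("select * from staging.seed_users").show()
con.close()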

3. Write a transformation in SQL

FFT uses standard SQL with Jinja templating (similar to dbt). It takes care of engine differences for you.

Create models/staging/stg_users.ff.sql:

{{ config(
    materialized='table'
) }}

select 
    id, 
    -- Standardizing email casing
    lower(email) as email, 
    -- Casting types explicitly
    cast(signup_date as date) as signup_date
from {{ source('raw', 'users') }}

This is just… SQL. No FFT-specific magic beyond config() and source().


4. Run the pipeline locally (the “fast loop”)

Now run the DAG on your machine:

fft run . --env dev_duckdb

(Screenshot: fft run output on local DuckDB)

On my laptop, I see something like:

  • Time: 18 ms
  • Cost: $0.00
  • Infrastructure: my CPU

I can iterate on this loop hundreds of times per hour.
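
Everything the run materializes lands in the same local DuckDB file, so between iterations you can inspect it with plain Python. A quick sketch (the exact schema stg_users lands in depends on your profile defaults, so the first statement just lists what exists):

# See what the local run actually built, without leaving Python.
import duckdb

con = duckdb.connect(".local/dev.duckdb", read_only=True)

# List every table FFT has materialized so far (seeds plus models).
con.sql("show all tables").show()

# Then query the staging model, e.g.:
# con.sql("select * from stg_users").show()

con.close()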

Run data quality checks:

fft test . --env dev_duckdb

(Screenshot: fft test output on local DuckDB)

Or even model-level unit tests (no real DB needed):

fft utest . --env dev_duckdb

(Screenshot: fft utest output on local DuckDB)

Local-first DX: fast feedback, offline-friendly, and you only touch the cloud once your logic is solid.


5. Point the same project at BigQuery (the “flow loop”)

Once you’re happy with the logic, it’s time to push to production.

In other frameworks you might:

  • change connection strings,
  • update environment variables manually,
  • maybe even rewrite SQL if you used engine-specific functions.

In FFT, you just switch the profile (and make sure your BigQuery credentials are set up locally).

profiles.yml:

# My Local Playground
dev_duckdb:
  engine: duckdb
  duckdb:
    path: ".local/dev.duckdb"

# My Local utest Overrides
dev_duckdb_utest:
  engine: duckdb
  duckdb:
    path: ":memory:"

# My Production Environment
prod_bigquery:
  engine: bigquery
  bigquery:
    project: "fft-basic-demo"
    dataset: "production_marts"
    location: "EU"
    # Use the pandas backend here; FFT can also use BigFrames if you set this to true.
    use_bigframes: false
    allow_create_dataset: true

# My Production utest Overrides
prod_bigquery_utest:
  engine: bigquery
  bigquery:
    dataset: "production_marts_utest"

Now run the same project, different environment:

# exactly the same command, different --env
fft seed . --env prod_bigquery
fft run . --env prod_bigquery

(Screenshot: fft seed output against BigQuery)

(Screenshot: fft run output against BigQuery)

FFT:

  • builds the same DAG,
  • compiles the same SQL models,
  • authenticates with Google Cloud using your local creds,
  • executes the transformations on BigQuery.

We didn’t touch stg_users.ff.sql. Only --env changed.
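
If you want to double-check the result on the warehouse side, the standard BigQuery client works as usual. This is a hedged sketch rather than an FFT feature: it assumes the table name matches the model name and lives in the production_marts dataset from the profile above.

# Verify the deployed model with the plain BigQuery client (reuses your local Google Cloud credentials).
from google.cloud import bigquery

client = bigquery.Client(project="fft-basic-demo")

query = "select id, email, signup_date from `fft-basic-demo.production_marts.stg_users` order by id"
for row in client.query(query).result():
    print(row["id"], row["email"], row["signup_date"])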


6. Why this is a big deal for DX

This isn’t just about saving money (though you get that too). It’s about Developer Experience.

  • Work offline. Build complex DAGs on a train or in airplane mode with DuckDB.
  • Unit test your models. Use fft utest with tiny fixtures to validate logic before hitting any real warehouse.
  • Hybrid SQL+Python. FFT supports Python models alongside SQL. Use SQL for aggregations and joins, Python for custom logic or ML.

Example Python model:

# models/marts/mart_latest_signup.ff.py
import pandas as pd
from fastflowtransform import engine_model


@engine_model(
    # Register this model for both DuckDB (local) and BigQuery (pandas backend)
    only=("duckdb", "bigquery"),
    name="mart_latest_signup",
    materialized="table",
    deps=["stg_users.ff"],  # SQL model from earlier in the article
    tags=[
        "scope:mart", 
        "engine:duckdb",
        "engine:bigquery"
    ],
    requires={
        # Columns produced by stg_users.ff.sql:
        #   id, email, signup_date
        "stg_users.ff": {"id", "email", "signup_date"},
    },
)
def build(stg_users: pd.DataFrame) -> pd.DataFrame:
    """Return the latest signup per email domain using pandas."""

    # Derive an email_domain column in Python
    users = stg_users.copy()
    users["email_domain"] = users["email"].str.split("@").str[-1]

    latest = (
        users.sort_values("signup_date", ascending=False)
        .drop_duplicates("email_domain")  # keep the newest per domain
        .loc[:, ["email_domain", "id", "email", "signup_date"]]
        .rename(
            columns={
                "id": "latest_user_id",
                "email": "latest_email",
                "signup_date": "latest_signup_date",
            }
        )
        .reset_index(drop=True)
    )

    return latest

You can run this locally (DuckDB + pandas), then later switch to Spark or BigQuery with the same decorator.
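
A nice side effect: the transformation is ordinary pandas, so you can sanity-check the logic with a throwaway fixture before any engine is involved. A minimal sketch that mirrors the steps inside build() (plain pandas here, not FFT’s utest machinery):

import pandas as pd

# Tiny fixture with the columns stg_users produces.
users = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "email": ["alice@example.com", "bob@example.com", "carol@sample.org"],
        "signup_date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    }
)

# Same steps as build(): derive the domain, keep the newest signup per domain.
users["email_domain"] = users["email"].str.split("@").str[-1]
latest = (
    users.sort_values("signup_date", ascending=False)
    .drop_duplicates("email_domain")
    .set_index("email_domain")
)

assert latest.loc["example.com", "email"] == "bob@example.com"  # newer than alice
assert latest.loc["sample.org", "email"] == "carol@sample.org"
print("mart_latest_signup logic looks right")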


7. Try it yourself

Stop treating your laptop like a thin client. Your machine is powerful enough to build data pipelines.

pip install fastflowtransform

fft init building_locally_demo --engine duckdb
cd building_locally_demo

# add the users seed + sources.yml from this article,
# then:
fft seed . --env dev_duckdb
fft run . --env dev_duckdb

Repo:
https://github.com/FFTLabs/FastFlowTransform

If you try this, I’d love to hear:

  • What your dev workflow looks like today
  • Which warehouse you’re deploying to (BigQuery/Snowflake/Databricks/etc.)

Thinking about a follow-up post on incremental models or data-quality tests with FFT — if that’s interesting, tell me in the comments.
