DEV Community: Fritz Larco

How to Replicate Databricks Lakebase to Snowflake with Sling

Fritz Larco — Tue, 14 Jul 2026 23:51:39 +0000

Last updated: July 2026

Databricks Lakebase is a serverless Postgres database, built on the Neon engine Databricks acquired in 2025 and generally available since February 2026. It holds operational, transactional data: the reads and writes an application or an AI agent makes in real time, kept separate from the analytical tables in your lakehouse or warehouse. The idea is to keep the OLTP store fast and small and move data out to a warehouse when you need to run analytics across it.

Snowflake is a common destination for that analytics half, which raises an obvious question. How do you get data out of Lakebase and into Snowflake without standing up a separate ETL platform to do it?

Here's the thing that makes it easy: Lakebase is Postgres. Not Postgres-like, actual Postgres, running version 17 and speaking the standard wire protocol on port 5432. Anything that connects to Postgres connects to Lakebase. That includes Sling, which treats Lakebase as an ordinary postgres source and moves it to Snowflake with a few lines of YAML.

This guide walks through that move end to end. The row counts, timings, and type mappings below come from a real replication run: a Postgres source loaded with a demo e-commerce schema, replicated into a live Snowflake warehouse. Because Lakebase presents as standard Postgres, the mechanics don't change whether the source is a self-hosted Postgres or a Lakebase endpoint. Only the host and the credentials change.

Installation

Sling is a single binary with no runtime dependencies. Install it however suits your setup:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

Confirm it's on your path:

sling --version

Connecting Sling to Lakebase

A Lakebase instance exposes a standard Postgres connection: a hostname (Databricks generates one that looks like ep-<id>.databricks.com), port 5432, and a default database named databricks_postgres. SSL is required — Lakebase rejects unencrypted connections — so the connection string carries sslmode=require.

The one decision worth making up front is authentication. Lakebase supports two methods, and they behave very differently for a data-movement tool. You can pass an OAuth token as the password, which is convenient for poking around interactively, but the token expires after an hour, so a scheduled replication that runs overnight will fail when it goes stale. Or you can use a native Postgres role and password, which doesn't expire and works with connection pooling. Use the second one for Sling.

Create a Postgres role in Lakebase, give it read access to the tables you're replicating, and use that role's credentials. Then register the connection with Sling. Since Lakebase is Postgres, the connection type is postgres:

sling conns set lakebase type=postgres \
  host=ep-your-instance.databricks.com port=5432 \
  user=your_role password=your_password \
  database=databricks_postgres sslmode=require

Or as a connection string:

sling conns set lakebase \
  url="postgres://your_role:your_password@ep-your-instance.databricks.com:5432/databricks_postgres?sslmode=require"
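
If the role's password contains characters such as @, : or /, the URL form needs them percent-encoded or the string will not parse. Here is a minimal Python sketch of that encoding, assuming Sling parses the URL the way a standard Postgres client does. The helper name build_postgres_url is hypothetical and not part of Sling, and the host and role are the placeholders used above.

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, database, port=5432, sslmode="require"):
    """Assemble a postgres:// URL, percent-encoding the user and password."""
    # safe="" makes quote() encode reserved characters such as @ : / as well
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"postgres://{creds}@{host}:{port}/{database}?sslmode={sslmode}"


# Hypothetical password chosen to contain characters that break an unencoded URL
url = build_postgres_url(
    "your_role", "p@ss:w/rd", "ep-your-instance.databricks.com", "databricks_postgres"
)
print(url)
```

The printed value can be passed as the url= argument to sling conns set.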

Connecting Sling to Snowflake

Snowflake authenticates with an account identifier, a user, and either a password or a key pair. The Sling user needs USAGE on the warehouse, database, and target schema, plus CREATE TABLE on the schema so Sling can create the destination tables.

sling conns set snowflake type=snowflake \
  account=your-account user=your-user password=your-password \
  database=analytics warehouse=compute_wh role=your-role

Test both connections

sling conns test lakebase
sling conns test snowflake

8:41AM INF success!

If the Lakebase test hangs before failing, the instance may have scaled its compute to zero while idle — Lakebase spins compute down when nothing is connected and back up on demand. Give it a moment and test again. If it fails outright, check that sslmode=require is present and that your role has login rights.
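
For a scheduled job, "give it a moment and test again" can be automated. The sketch below is one way to do it in Python, assuming that sling conns test exits with a non-zero status on failure. The helper names and the retry counts are illustrative choices, not Sling features.

```python
import subprocess
import time


def retry(check, attempts=5, delay_seconds=5.0, sleep=time.sleep):
    """Call check() until it returns True; return the attempt number, or 0 if all fail."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # give idle compute time to wake up
    return 0


def lakebase_is_reachable():
    """True when `sling conns test lakebase` exits with status 0 (assumed behavior)."""
    result = subprocess.run(["sling", "conns", "test", "lakebase"], capture_output=True)
    return result.returncode == 0


# Usage (requires sling on the PATH):
#   retry(lakebase_is_reachable, attempts=5, delay_seconds=10)
```

Passing sleep as a parameter keeps the helper easy to test without real waiting.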

The source tables

The demo source is a small e-commerce schema, the kind of transactional data that lives in an OLTP store: 15,000 customers, 3,000 products, and 90,000 orders. The orders table carries an ordered_at timestamp, which matters for incremental loads later, plus a nullable promo_code column so the type-handling examples have something to point at.

-- on the Lakebase source
select 'customers' as t, count(*) as c from demo_lakebase_snowflake.customers
union all select 'products', count(*) from demo_lakebase_snowflake.products
union all select 'orders',   count(*) from demo_lakebase_snowflake.orders;

t          c
customers  15000
products   3000
orders     90000

Full refresh: the first load

A replication is one YAML file. The defaults block sets the mode and how targets are named; streams lists what to move. The {stream_table} token maps each source table to a Snowflake table of the same name.

# replication.yaml
source: lakebase
target: snowflake

defaults:
  mode: full-refresh
  object: demo_lakebase_snowflake.{stream_table}

streams:
  demo_lakebase_snowflake.customers:
  demo_lakebase_snowflake.products:
  demo_lakebase_snowflake.orders:

Run it:

sling run -r replication.yaml

INF Sling Replication [3 streams] | lakebase -> snowflake

INF [1 / 3] running stream demo_lakebase_snowflake.customers
INF writing to target database [mode: full-refresh]
INF created table "DEMO_LAKEBASE_SNOWFLAKE"."CUSTOMERS_TMP"
INF created table "DEMO_LAKEBASE_SNOWFLAKE"."CUSTOMERS"
INF inserted 15000 rows into "DEMO_LAKEBASE_SNOWFLAKE"."CUSTOMERS" in 15 secs [993 r/s] [1.3 MB]
...
INF [3 / 3] running stream demo_lakebase_snowflake.orders
INF created table "DEMO_LAKEBASE_SNOWFLAKE"."ORDERS_TMP"
INF inserted 90000 rows into "DEMO_LAKEBASE_SNOWFLAKE"."ORDERS" in 13 secs [6,814 r/s] [5.5 MB]
INF execution succeeded

INF Sling Replication Completed in 40s | lakebase -> snowflake | 3 Successes | 0 Failures

Three tables, 108,000 rows, 40 seconds. Two details in that log are worth noticing. First, Sling writes to a _TMP table and swaps it in once the load is clean, so a failed run leaves the existing table untouched rather than half-loaded. Second, Sling stages the data and loads it into Snowflake in bulk with COPY INTO rather than inserting row by row — that's why 90,000 rows land in 13 seconds.
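
The load-then-swap idea is easy to see in miniature. The following Python sketch uses an in-memory SQLite database as a stand-in for Snowflake, with a made-up two-column table. It illustrates the general pattern only and is not Sling's actual implementation.

```python
import sqlite3


def load_with_swap(conn, table, rows):
    """Load rows into <table>_tmp and swap it in only if every insert succeeds."""
    tmp = f"{table}_tmp"
    conn.execute(f"DROP TABLE IF EXISTS {tmp}")
    conn.execute(f"CREATE TABLE {tmp} (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("BEGIN")
    try:
        conn.executemany(f"INSERT INTO {tmp} (id, name) VALUES (?, ?)", rows)
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"ALTER TABLE {tmp} RENAME TO {table}")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # the live table is exactly as it was
        conn.execute(f"DROP TABLE IF EXISTS {tmp}")
        raise


# isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT are explicit
conn = sqlite3.connect(":memory:", isolation_level=None)
load_with_swap(conn, "customers", [(1, "Ada"), (2, "Grace")])

try:
    # The second row violates NOT NULL, so this load fails partway through
    load_with_swap(conn, "customers", [(1, "Linus"), (2, None)])
except sqlite3.IntegrityError:
    pass

names = [row[0] for row in conn.execute("SELECT name FROM customers ORDER BY id")]
print(names)  # the first load is still in place
```

The failed second load never touches the live customers table, which is the property the _TMP step buys you.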

You don't have to pre-create the tables. Sling creates them on first run and adjusts them on later runs when the schema changes.

A note on branching

Lakebase inherits Neon's copy-on-write branching: you can create an instant, zero-copy clone of your database. That's a useful trick here. Instead of pointing the replication at your production instance and adding read load to the same compute your application uses, create a branch and replicate from that. The branch is a consistent snapshot, the read traffic hits separate compute, and you tear the branch down when the load finishes. Nothing about the Sling side changes — you just point the host at the branch's endpoint.

Verification

Trust the load, but check it. Run a count and a min/max against the target:

sling conns exec snowflake \
  "select count(*) as orders, min(ordered_at) as first_order, max(ordered_at) as last_order from demo_lakebase_snowflake.orders"

ORDERS  FIRST_ORDER                    LAST_ORDER
90000   2025-01-01 00:02:00 +0000 UTC  2025-03-04 12:01:00 +0000 UTC

And a sample:

sling conns exec snowflake \
  "select order_id, customer_id, quantity, amount, promo_code, ordered_at from demo_lakebase_snowflake.orders order by order_id limit 5"

ORDER_ID  CUSTOMER_ID  QUANTITY  AMOUNT     PROMO_CODE  ORDERED_AT
1         2            2         10.740000  PROMO1      2025-01-01 00:02:00 +0000 UTC
2         3            3         17.220000  PROMO2      2025-01-01 00:03:00 +0000 UTC
3         4            4         24.440000  PROMO3      2025-01-01 00:04:00 +0000 UTC
4         5            5         32.400000              2025-01-01 00:05:00 +0000 UTC
5         6            1         6.850000   PROMO5      2025-01-01 00:06:00 +0000 UTC

Row count matches the source, the timestamps survived the trip, and the decimal amounts kept their precision. Notice row 4: the promo_code is empty because it was null in the source, and that nullability carried across intact.
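
The same count and min/max check can be scripted so it runs after every load. This is a minimal Python sketch that compares the two sides with one query. It uses in-memory SQLite databases as stand-ins for Lakebase and Snowflake, and the helper names are hypothetical.

```python
import sqlite3

CHECK_SQL = "SELECT count(*), min(ordered_at), max(ordered_at) FROM orders"


def table_stats(conn):
    """Return (row_count, first_order, last_order) for the orders table."""
    return conn.execute(CHECK_SQL).fetchone()


def make_db(rows):
    """Build a throwaway database holding the given (order_id, ordered_at) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, ordered_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


rows = [
    (1, "2025-01-01 00:02:00"),
    (2, "2025-01-01 00:03:00"),
    (3, "2025-01-01 00:04:00"),
]
source = make_db(rows)
target = make_db(rows)
short_target = make_db(rows[:2])  # simulates a load that dropped a row

full_match = table_stats(source) == table_stats(target)
short_match = table_stats(source) == table_stats(short_target)
print(full_match, short_match)
```

Against real systems you would run the query once per connection and compare the two result rows in the same way.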

Type mapping

Postgres and Snowflake don't share a type system, so Sling maps between them. Read the target schema to see the result:

sling conns exec snowflake \
  "select column_name, data_type from information_schema.columns
   where table_schema='DEMO_LAKEBASE_SNOWFLAKE' and table_name='ORDERS'
   order by ordinal_position"

COLUMN_NAME  DATA_TYPE
ORDER_ID     NUMBER
CUSTOMER_ID  NUMBER
PRODUCT_ID   NUMBER
QUANTITY     NUMBER
AMOUNT       NUMBER
PROMO_CODE   TEXT
ORDERED_AT   TIMESTAMP_NTZ

Postgres integers become Snowflake NUMBER, and numeric(10,2) also becomes NUMBER with its precision preserved, so the order amounts keep full decimal precision instead of collapsing into a float. varchar and text become TEXT, and timestamp becomes TIMESTAMP_NTZ. When a clean mapping isn't available, Sling renders the value as a string rather than dropping precision silently, and you can override any column with a columns: block in the stream if you need a specific Snowflake type.
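
As a quick reference, the mappings named in this guide can be written down as a lookup table. The Python sketch below covers only the types mentioned here (including jsonb and uuid from the FAQ) and returns None for anything else. It is a summary for the reader, not Sling's full mapping logic.

```python
import re

# Only the mappings this guide mentions; Sling's real table covers many more types
PG_TO_SNOWFLAKE = {
    "smallint": "NUMBER",
    "integer": "NUMBER",
    "bigint": "NUMBER",
    "numeric": "NUMBER",
    "varchar": "TEXT",
    "text": "TEXT",
    "timestamp": "TIMESTAMP_NTZ",
    "jsonb": "VARIANT",
    "uuid": "TEXT",
}


def snowflake_type(pg_type):
    """Look up the Snowflake type for a Postgres type, ignoring (precision, scale)."""
    base = re.sub(r"\(.*\)", "", pg_type).strip().lower()
    return PG_TO_SNOWFLAKE.get(base)  # None means "not covered by this sketch"


for pg_type in ["integer", "numeric(10,2)", "varchar(50)", "timestamp", "point"]:
    print(pg_type, "->", snowflake_type(pg_type))
```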

Incremental loads: only what changed

Full refresh is fine for a first load or a small dimension table. For anything that grows, you want to move only the new rows. Switch the mode to incremental and tell Sling which column tracks change and which column identifies a row:

# replication-incremental.yaml
source: lakebase
target: snowflake

defaults:
  mode: incremental
  object: demo_lakebase_snowflake.{stream_table}
  primary_key: [order_id]
  update_key: ordered_at

streams:
  demo_lakebase_snowflake.orders:

Say 2,500 new orders land in Lakebase. Re-run with the incremental file:

sling run -r replication-incremental.yaml

INF Sling Replication | lakebase -> snowflake | demo_lakebase_snowflake.orders
INF getting checkpoint value (ordered_at)
INF created table "DEMO_LAKEBASE_SNOWFLAKE"."ORDERS_TMP"
INF inserted 2500 rows into "DEMO_LAKEBASE_SNOWFLAKE"."ORDERS" in 10 secs [245 r/s] [152 kB]
INF execution succeeded

The line that does the work is getting checkpoint value (ordered_at). Sling reads the maximum ordered_at already in the Snowflake target, then pulls only source rows newer than that. There's no separate state file to manage — the target table is the checkpoint. The primary_key lets Sling merge, so a row that was updated rather than inserted is upserted instead of duplicated.
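
The checkpoint-and-merge logic described above can be sketched in a few lines of Python. This version works on in-memory lists of dicts and assumes a strict "newer than" comparison. Sling does the equivalent work in SQL against the real tables, so treat this as an illustration of the idea rather than its implementation.

```python
from datetime import datetime


def incremental_sync(source_rows, target_rows, primary_key, update_key):
    """Merge source rows newer than the target's high-water mark; return rows moved."""
    # The checkpoint is read from the target itself, so no state file is needed
    checkpoint = max((row[update_key] for row in target_rows), default=None)
    new_rows = [
        row for row in source_rows
        if checkpoint is None or row[update_key] > checkpoint
    ]
    # Upsert: a row whose key already exists replaces the old version
    by_key = {row[primary_key]: row for row in target_rows}
    for row in new_rows:
        by_key[row[primary_key]] = row
    target_rows[:] = by_key.values()
    return len(new_rows)


def at(minute):
    return datetime(2025, 1, 1, 0, minute)


target = [
    {"order_id": 1, "ordered_at": at(2), "quantity": 2},
    {"order_id": 2, "ordered_at": at(3), "quantity": 3},
]
source = [
    {"order_id": 1, "ordered_at": at(2), "quantity": 2},
    {"order_id": 2, "ordered_at": at(5), "quantity": 9},  # changed after the first load
    {"order_id": 3, "ordered_at": at(4), "quantity": 4},  # brand new
]

first_run = incremental_sync(source, target, "order_id", "ordered_at")
second_run = incremental_sync(source, target, "order_id", "ordered_at")
print(first_run, second_run, len(target))
```

Note that the changed row is picked up only because its ordered_at value moved past the checkpoint. An update that leaves the update_key untouched would be missed.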

Run it again with nothing new on the source and you get a clean no-op:

INF getting checkpoint value (ordered_at)
WRN no data or records found in stream. Nothing to do.
INF inserted 0 rows into "DEMO_LAKEBASE_SNOWFLAKE"."ORDERS" in 7 secs [0 r/s]
INF execution succeeded

Zero rows, no error. The warning is Sling telling you there was nothing past the checkpoint, which is exactly what you want from a scheduled job that fires on a cron whether or not there's anything to move.

A final count confirms the math:

sling conns exec snowflake "select count(*) as orders, max(ordered_at) as last_order from demo_lakebase_snowflake.orders"

ORDERS  LAST_ORDER
92500   2025-03-06 05:41:00 +0000 UTC

90,000 from the first load plus 2,500 from the incremental run, with the high-water mark advanced to the newest order.
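
The timestamps can be checked by hand as well. The sample rows show orders spaced one minute apart, so, assuming that spacing holds for the whole demo data set, a few lines of Python reproduce both high-water marks:

```python
from datetime import datetime, timedelta

first_order = datetime(2025, 1, 1, 0, 2)
spacing = timedelta(minutes=1)  # assumption read off the sample rows

# 90,000 orders means 89,999 gaps after the first one
last_after_full = first_order + (90_000 - 1) * spacing
last_after_incremental = last_after_full + 2_500 * spacing
total_rows = 90_000 + 2_500

print(last_after_full)         # 2025-03-04 12:01:00
print(last_after_incremental)  # 2025-03-06 05:41:00
print(total_rows)              # 92500
```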

Replicating many tables at once

Listing every table by hand doesn't scale past a handful. Use a wildcard in streams and the {stream_table} token in the target object to move a whole schema:

source: lakebase
target: snowflake

defaults:
  mode: incremental
  object: demo_lakebase_snowflake.{stream_table}
  primary_key: [id]
  update_key: updated_at

streams:
  demo_lakebase_snowflake.*:

Every table in the schema gets replicated, each into a Snowflake table of the same name. You can still override a single table by listing it explicitly below the wildcard — the explicit entry wins. This is the pattern to reach for when you're mirroring a whole Lakebase database into the warehouse.
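
To make the wildcard and override behavior concrete, here is a small Python sketch of how a stream list could resolve to target tables. It uses fnmatch for the pattern and a plain string replace for {stream_table}. The function names and the orders_raw target are hypothetical, and Sling's own resolution may differ in detail.

```python
from fnmatch import fnmatchcase


def render(template, table):
    """Fill {stream_table} with the table part of a schema.table name."""
    return template.replace("{stream_table}", table.split(".", 1)[1])


def plan_streams(source_tables, streams, default_object):
    """Map each source table to a target object; explicit entries beat wildcards."""
    plan = {}
    # First pass: wildcard patterns pick up every matching table
    for pattern in streams:
        if "*" in pattern:
            for table in source_tables:
                if fnmatchcase(table, pattern):
                    plan[table] = render(default_object, table)
    # Second pass: explicit entries overwrite whatever the wildcard produced
    for stream, config in streams.items():
        if "*" not in stream:
            template = (config or {}).get("object", default_object)
            plan[stream] = render(template, stream)
    return plan


tables = [
    "demo_lakebase_snowflake.customers",
    "demo_lakebase_snowflake.products",
    "demo_lakebase_snowflake.orders",
]
streams = {
    "demo_lakebase_snowflake.*": None,
    "demo_lakebase_snowflake.orders": {"object": "demo_lakebase_snowflake.orders_raw"},
}
plan = plan_streams(tables, streams, "demo_lakebase_snowflake.{stream_table}")
for source_table, target_table in plan.items():
    print(source_table, "->", target_table)
```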

Scheduling

Once the YAML works, scheduling is a cron entry:

0 * * * * cd /path/to/configs && sling run -r replication-incremental.yaml >> /var/log/sling.log 2>&1

Because incremental mode tracks its checkpoint against the target, each run is idempotent — if one fails or is skipped, the next picks up exactly where the last left off. One thing to keep in mind on a schedule: use a native Postgres password for the Lakebase role, not an OAuth token. Tokens expire after an hour, and a job that runs overnight will hit a stale one.
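
The token problem is simple arithmetic, and a short Python sketch makes it visible. It assumes a token issued once at a hypothetical 18:00 and the hourly cron schedule above. The one-hour lifetime is the figure quoted in this guide.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(hours=1)  # the one-hour expiry described in this guide


def token_is_valid(issued_at, run_at, lifetime=TOKEN_LIFETIME):
    """True if a token issued at issued_at is still usable at run_at."""
    return issued_at <= run_at < issued_at + lifetime


issued = datetime(2026, 7, 14, 18, 0)
hourly_runs = [issued + timedelta(hours=h) for h in range(4)]
results = [token_is_valid(issued, run) for run in hourly_runs]
print(results)  # only the run at issue time succeeds
```

Every run after the first would fail authentication, which is why the native role and password are the safer choice for cron.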

If you'd rather not babysit cron, the Sling Platform handles scheduling, alerting, and run history, while keeping the data movement itself on your own infrastructure — nothing routes through a vendor control plane.

Conclusion

Because Lakebase is real Postgres, moving data out of it is not a special case. You register it as a postgres connection, point Sling at Snowflake, and write two YAML files: full-refresh for the first load, incremental for everything after. Sling loads into Snowflake in bulk, maps the types sensibly by default, and keeps the checkpoint in the target table so there's no separate state store to lose.

The numbers here are from a real run: 108,000 rows on the first load in 40 seconds, 2,500 rows on the incremental in 10, a clean no-op when there's nothing to move. Point Sling at your own Lakebase endpoint, ideally a branch, and you'll get the same shape.

Related guides

The Lakebase-to-Snowflake path is the standard-Postgres-to-Snowflake path, so these companion walkthroughs apply directly:

Export PostgreSQL to Snowflake — the same replication against a self-hosted Postgres, with more on transformations and advanced options
Snowflake to Postgres — the reverse direction, when Snowflake is the source of truth
Postgres to DuckDB — a local, zero-cloud target for quick analysis
Extract databases into DuckLake — Lakebase to a lakehouse table format instead of a warehouse

If you're weighing Sling against a managed connector for this job, the Sling vs Fivetran comparison covers the tradeoffs. For the reasoning behind Sling's single-binary, no-control-plane design, see the Sling blog.

Frequently asked questions

Does Sling need a special Lakebase connector?

No. Lakebase speaks the standard Postgres wire protocol on port 5432, so Sling connects to it with the ordinary postgres connection type. There's nothing Lakebase-specific to configure beyond the hostname, credentials, and sslmode=require.

Should I use an OAuth token or a password for the Lakebase connection?

Use a native Postgres role and password. Lakebase's OAuth tokens expire after an hour, which breaks scheduled or long-running replications. A native password doesn't expire and works with connection pooling, so it's the right fit for a data-movement tool.

Will replicating add load to my production Lakebase instance?

It can, since the read query runs against your instance's compute. To avoid competing with application traffic, use Lakebase's copy-on-write branching to create a zero-copy clone and replicate from the branch. The read load hits separate compute, and you delete the branch when the run finishes.

How are Postgres `numeric`, `jsonb`, and `uuid` columns handled in Snowflake?

numeric maps to Snowflake NUMBER with precision preserved, so decimal amounts don't get flattened to floats. jsonb lands in a VARIANT column and stays queryable with : and FLATTEN(). uuid maps to TEXT by default; override it with a columns: block if you want a different type.

Can I do incremental Lakebase → Snowflake with deletes?

Sling's incremental mode propagates inserts and updates via the update_key and merges on the primary_key, but it does not detect physical deletes in the source. For delete-aware syncs, soft-delete with a deleted_at column and filter in queries, run mode: full-refresh on small enough tables, or use mode: snapshot to keep historical versions.

Does the source have to be Lakebase specifically?

No — this exact configuration works against any Postgres 16 or 17 instance. Lakebase is just Postgres with serverless compute and branching, so if you later move the transactional store somewhere else, the replication YAML doesn't change; only the connection host does.

Export from BigQuery to PostgreSQL with Sling

Fritz Larco — Tue, 14 Jul 2026 23:51:24 +0000

Last updated: June 2026

Introduction

Moving data between Google BigQuery and PostgreSQL traditionally involves complex ETL processes, custom scripts, and significant engineering effort. Organizations often face challenges such as:

Setting up and maintaining data extraction processes from BigQuery
Managing authentication and permissions across platforms
Handling schema compatibility and data type conversions
Implementing efficient data loading into PostgreSQL
Monitoring and maintaining the data pipeline
Dealing with incremental updates and schema changes

Setting up a traditional data pipeline between BigQuery and PostgreSQL can take weeks or even months, and it requires specialized knowledge of both platforms plus custom code. Common approaches include:

Writing custom Python scripts using libraries like pandas and sqlalchemy
Using ETL tools that require extensive configuration and maintenance
Implementing Apache Airflow DAGs with custom operators
Developing and maintaining complex data transformation logic

These approaches often lead to:

Increased development and maintenance costs
Complex error handling and retry mechanisms
Difficulty in handling schema changes
Performance bottlenecks
Limited monitoring and observability

Sling simplifies this entire process by providing a streamlined, configuration-based approach that eliminates the need for custom code and complex infrastructure setup. With Sling, you can:

Configure connections with simple environment variables or CLI commands
Automatically handle schema mapping and data type conversions
Optimize performance with built-in batch processing and parallel execution
Monitor and manage replications through both CLI and web interface
Implement incremental updates with minimal configuration

In this guide, we'll walk through the process of setting up a BigQuery to PostgreSQL replication using Sling, demonstrating how to overcome common challenges and implement an efficient data pipeline in minutes rather than days or weeks. If you need the opposite direction, our MySQL to BigQuery guide and Postgres to BigQuery guide cover loading into BigQuery.

Installation

Getting started with Sling is straightforward. Pick the install method that matches your platform:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

After installation, verify that Sling is properly installed by running:

# Check sling version
sling --version

For more detailed installation instructions and options, visit the installation guide.

Setting Up Connections

Before we can start replicating data, we need to configure our source (BigQuery) and target (PostgreSQL) connections. Sling provides multiple ways to manage connections, including environment variables, the sling conns command, and a YAML configuration file.

BigQuery Connection Setup

The BigQuery source connection here is identical to the one used when BigQuery is the source in other pipelines, such as BigQuery to Snowflake.

For BigQuery, you'll need:

Google Cloud project ID
Service account credentials with appropriate permissions
Dataset information
Google Cloud Storage bucket (for data transfer)

You can set up the BigQuery connection in several ways:

Using the sling conns set Command

# Set up BigQuery connection using CLI
sling conns set bigquery_source type=bigquery \
  project=<project> \
  dataset=<dataset> \
  gc_bucket=<gc_bucket> \
  key_file=/path/to/service.account.json \
  location=<location>

Using Environment Variables

# Set up using service account JSON content
export GC_KEY_BODY='{"type": "service_account", ...}'
export BIGQUERY_SOURCE='{type: bigquery, project: <project>, dataset: <dataset>, gc_bucket: <gc_bucket>}'

Using Sling Environment YAML

Create or edit ~/.sling/env.yaml:

connections:
  bigquery_source:
    type: bigquery
    project: your-project
    dataset: your_dataset
    gc_bucket: your-bucket
    key_file: /path/to/service.account.json
    location: US  # optional

PostgreSQL Connection Setup

For PostgreSQL, you'll need:

Host address
Port number (default: 5432)
Database name
Username and password
Schema (optional)
SSL mode (if required)

Here's how to set up the PostgreSQL connection:

Using the sling conns set Command

# Set up PostgreSQL connection using CLI
sling conns set postgres_target type=postgres \
  host=<host> \
  user=<user> \
  database=<database> \
  password=<password> \
  port=<port> \
  schema=<schema>

# Or use connection URL format
sling conns set postgres_target url="postgresql://user:password@host:5432/database?sslmode=require"

Using Environment Variables

# Set up using connection URL format
export POSTGRES_TARGET='postgresql://user:password@host:5432/database?sslmode=require'

Using Sling Environment YAML

Add to your ~/.sling/env.yaml:

connections:
  postgres_target:
    type: postgres
    host: your-host
    user: your-username
    password: your-password
    database: your-database
    port: 5432
    schema: public
    sslmode: require  # optional

Testing Connections

After setting up your connections, it's important to verify they work correctly:

# Test BigQuery connection
sling conns test bigquery_source

# Test PostgreSQL connection
sling conns test postgres_target

# List available tables in BigQuery
sling conns discover bigquery_source

You can also manage your connections through the Sling Platform's web interface.

For more details about connection configuration, visit the environment documentation.

Data Replication Methods

Sling provides multiple ways to replicate data from BigQuery to PostgreSQL. Let's explore both the CLI-based and YAML-based approaches, moving from simple configurations to more advanced use cases.

Using CLI Flags

The quickest way to start a replication is using CLI flags. Here are two examples:

Basic CLI Example

This example shows how to replicate a single table with default settings:

# Replicate a single table from BigQuery to PostgreSQL
sling run \
  --src-conn bigquery_source \
  --src-stream "analytics.daily_sales" \
  --tgt-conn postgres_target \
  --tgt-object "analytics.daily_sales" \
  --tgt-options '{ "column_casing": "snake" }'

Advanced CLI Example

This example demonstrates more advanced options including column selection and incremental updates:

# Replicate with advanced options
sling run \
  --src-conn bigquery_source \
  --src-stream "analytics.customer_orders" \
  --select "order_id, customer_id, order_date, total_amount" \
  --tgt-conn postgres_target \
  --tgt-object "analytics.customer_orders" \
  --mode incremental \
  --primary-key order_id \
  --update-key order_date \
  --tgt-options '{ "column_casing": "snake", "add_new_columns": true, "table_keys": { "unique": ["order_id"] } }'

For more CLI flag options, visit the CLI flags documentation.
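
When the CLI is called from a script, the JSON passed to --tgt-options is the part most likely to be misquoted. The Python sketch below builds the argument list from the flags shown in this guide and lets json.dumps produce the options string. The function name sling_run_args is hypothetical and the sketch does not execute anything by itself.

```python
import json
import shlex


def sling_run_args(src_conn, src_stream, tgt_conn, tgt_object,
                   mode=None, primary_key=None, update_key=None, tgt_options=None):
    """Build the argv list for `sling run` from the flags used in this guide."""
    args = [
        "sling", "run",
        "--src-conn", src_conn,
        "--src-stream", src_stream,
        "--tgt-conn", tgt_conn,
        "--tgt-object", tgt_object,
    ]
    if mode:
        args += ["--mode", mode]
    if primary_key:
        args += ["--primary-key", primary_key]
    if update_key:
        args += ["--update-key", update_key]
    if tgt_options:
        # json.dumps produces the JSON string that --tgt-options takes
        args += ["--tgt-options", json.dumps(tgt_options)]
    return args


args = sling_run_args(
    "bigquery_source", "analytics.customer_orders",
    "postgres_target", "analytics.customer_orders",
    mode="incremental", primary_key="order_id", update_key="order_date",
    tgt_options={"column_casing": "snake", "add_new_columns": True},
)
print(shlex.join(args))  # shell-quoted form, handy for logging
# To execute (requires sling on the PATH): subprocess.run(args, check=True)
```

Passing a list to subprocess.run avoids shell quoting altogether, so the JSON reaches Sling exactly as written.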

Using YAML Configuration

For more complex replication scenarios, YAML configuration files provide better maintainability and reusability. Let's look at two examples:

Basic YAML Example

Create a file named bigquery_to_postgres.yaml:

# Define source and target connections
source: bigquery_source
target: postgres_target

# Default settings for all streams
defaults:
  mode: full-refresh
  target_options:
    column_casing: snake
    add_new_columns: true

# Define streams to replicate
streams:
  analytics.daily_sales:
    object: analytics.daily_sales
    primary_key: [date, product_id]

  analytics.customer_orders:
    object: analytics.{stream_table}
    primary_key: order_id

Run the replication:

# Run the replication using YAML config
sling run -r bigquery_to_postgres.yaml

Advanced YAML Example

Here's a more complex example that demonstrates various features including runtime variables, custom SQL, and multiple streams:

source: bigquery_source
target: postgres_target

env:
  DATE: ${DATE}  # from env var

defaults:
  mode: incremental
  target_options:
    column_casing: snake
    add_new_columns: true

streams:
  # Stream with custom SQL and runtime variables
  analytics.orders_{DATE}:
    object: analytics.orders
    sql: |
      SELECT *
      FROM analytics.orders
      WHERE DATE(created_at) = '{DATE}'
    primary_key: order_id
    update_key: updated_at

  # Stream with column selection and transforms
  analytics.customers:
    object: analytics.customers
    select:
      - customer_id
      - first_name
      - last_name
      - email
      - -internal_notes  # exclude this column
    transforms:
      email: [lower, trim]
    primary_key: customer_id
    target_options:
      table_keys:
        primary: [customer_id]
        unique: [email]

  # Stream with wildcard pattern
  analytics.events_*:
    object: analytics.{stream_table}
    mode: full-refresh
    primary_key: event_id

Run the replication with runtime variables:

# Run the replication with a specific date
export DATE=2024-02-10
sling run -r bigquery_to_postgres.yaml
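
To see what the {DATE} variable does to the first stream, here is a short Python sketch of the substitution. It is a plain string replace under the assumption that single-brace tokens are filled with the value from the env block. The render helper is hypothetical and only mimics the visible effect.

```python
import os


def render(template, variables):
    """Replace each {NAME} token with its value; unknown tokens are left alone."""
    for name, value in variables.items():
        template = template.replace("{" + name + "}", value)
    return template


# Mirrors `env: DATE: ${DATE}`: take the value from the environment if it is set
variables = {"DATE": os.environ.get("DATE", "2024-02-10")}

stream_name = render("analytics.orders_{DATE}", variables)
sql = render(
    "SELECT * FROM analytics.orders WHERE DATE(created_at) = '{DATE}'",
    variables,
)
print(stream_name)
print(sql)
```

Running the replication with a different DATE value changes only the rendered stream name and filter, so the same file can be reused for daily backfills.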

For more details about replication configuration, see the Sling replication documentation.

Sling Platform UI

While the CLI provides powerful functionality for local development and automation, the Sling Platform offers a comprehensive web interface for managing and monitoring your data replications at scale.

Platform Overview

The Sling Platform provides:

Visual interface for creating and managing data workflows
Team collaboration features
Monitoring and alerting
Centralized connection management
Job scheduling and orchestration
Agent-based architecture for secure execution

Managing Connections

The platform provides an intuitive interface for managing your connections. You can:

Create and edit connections with a visual form
Test connections directly from the UI
View and manage connection permissions
Share connections with team members

Visual Replication Editor

The platform includes a powerful visual editor for creating and managing replications. Features include:

Visual stream configuration
Syntax highlighting for SQL and YAML
Real-time validation
Version control integration

Monitoring and Execution

Track your replication jobs with detailed execution information. The platform provides:

Real-time execution monitoring
Detailed logs and error messages
Performance metrics and statistics
Historical execution data

For more information about the Sling Platform, visit the platform documentation.

Getting Started

Now that we've covered the various aspects of using Sling for BigQuery to PostgreSQL data replication, here's a quick guide to get you started:

Install Sling
- Choose the appropriate installation method for your system
- Verify the installation with sling --version

Set Up Connections
- Configure BigQuery source connection
- Configure PostgreSQL target connection
- Test both connections using sling conns test

Start Simple
- Begin with a basic CLI command to replicate a single table
- Monitor the replication process
- Verify the data in PostgreSQL

Scale Up
- Create a YAML configuration file for multiple streams
- Add incremental updates and transformations
- Implement runtime variables for flexibility

Consider Platform Features
- Sign up for the Sling Platform for advanced features
- Set up team access and permissions
- Configure monitoring and alerts

Best Practices

Connection Management
- Use environment variables or YAML files for connection configuration
- Keep credentials secure and never commit them to version control
- Use separate connections for development and production

Replication Configuration
- Start with simple configurations and gradually add complexity
- Use YAML files for better maintainability
- Document your configurations with clear descriptions

Performance Optimization
- Use appropriate batch sizes for your data volume
- Implement incremental updates when possible
- Monitor and adjust configurations based on performance metrics

Monitoring and Maintenance
- Regularly check replication logs
- Set up alerts for failed replications
- Keep Sling updated to the latest version

Next Steps

Once your data lands in Postgres, you can keep moving it downstream; see Postgres to DuckDB for fast local analytics or loading JSON data into Postgres for the file-to-Postgres direction.

For additional examples and community support:

Join the Sling Discord Community
Follow Sling on GitHub
Contact support@slingdata.io for assistance

FAQ

How do I export a BigQuery table to PostgreSQL with Sling?

Set bigquery as the source connection and postgres as the target, then point a stream's object key at the destination schema and table. A single sling run command moves the data, creating the target table if it does not exist.

Can I run incremental loads from BigQuery into Postgres?

Yes. Set mode to incremental on the stream and supply a primary key plus an update_key. Sling compares the update_key against the existing target rows and loads only newer records.

How do I filter BigQuery rows before loading into Postgres?

Add a sql key to the stream with a SELECT statement. The query runs against BigQuery and only the result set is loaded, which is handy for date-partitioned or filtered extracts.

Does Sling automatically convert BigQuery data types to Postgres types?

Yes. Sling maps each BigQuery column to a compatible Postgres type during the load. Enable adjust_column_type under target_options if you want it to widen existing target columns when needed.

How do I set a primary key or unique constraint on the Postgres target?

Use a table_keys block under target_options with primary and unique lists, for example primary: [customer_id] and unique: [email]. Sling applies these keys when it creates or maintains the target table.

Can I use runtime variables like a date in the replication config?

Yes. Reference variables with single braces such as {DATE} in object names, stream names, or SQL, and pass the value through the env block or an environment variable at run time.

What permissions does the BigQuery service account need?

It needs read access to the source dataset and read/write access to the GCS bucket set as gc_bucket, since Sling stages data through Cloud Storage during extraction.

Moving Data from BigQuery to Snowflake Using Sling

Fritz Larco — Tue, 14 Jul 2026 23:50:50 +0000

Introduction

Last updated: June 2026

Moving data between cloud data warehouses like Google BigQuery and Snowflake traditionally involves complex ETL processes, custom scripts, and significant engineering effort. Organizations often face challenges such as:

Setting up and maintaining data extraction processes from BigQuery
Managing authentication and permissions across platforms
Handling schema compatibility and data type conversions
Implementing efficient data loading into Snowflake
Monitoring and maintaining the data pipeline
Dealing with incremental updates and schema changes

Setting up a traditional data pipeline between BigQuery and Snowflake can take weeks or even months, and it requires specialized knowledge of both platforms plus custom code. This complexity often leads to increased costs, maintenance overhead, and potential reliability issues.

Sling simplifies this entire process by providing a streamlined, configuration-based approach that eliminates the need for custom code and complex infrastructure setup. With Sling, you can:

Configure connections with simple environment variables or CLI commands
Automatically handle schema mapping and data type conversions
Optimize performance with built-in batch processing and parallel execution
Monitor and manage replications through both CLI and web interface
Implement incremental updates with minimal configuration

In this guide, we'll walk through the process of setting up a BigQuery to Snowflake replication using Sling, demonstrating how to overcome common challenges and implement an efficient data pipeline in minutes rather than days or weeks. If you need to move data the other direction, see the guide on exporting Snowflake to BigQuery.

Installation

Getting started with Sling is straightforward. You can install it using various package managers depending on your operating system:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

After installation, verify that Sling is properly installed by running:

# Check sling version
sling --version

For more detailed installation instructions and options, visit the installation guide.

Setting Up Connections

Before we can start replicating data, we need to configure our source (BigQuery) and target (Snowflake) connections. Sling provides multiple ways to manage connections, including environment variables, the sling conns command, and a YAML configuration file.

BigQuery Connection Setup

For BigQuery, you'll need:

Google Cloud project ID
Service account credentials with appropriate permissions
Dataset information
Google Cloud Storage bucket (for data transfer)

You can set up the BigQuery connection in several ways:

Using the sling conns set Command

# Set up BigQuery connection using CLI
sling conns set bigquery_source type=bigquery \
  project=<project> \
  dataset=<dataset> \
  gc_bucket=<gc_bucket> \
  key_file=/path/to/service.account.json \
  location=<location>

Using Environment Variables

# Set up using service account JSON content
export GC_KEY_BODY='{"type": "service_account", ...}'
export BIGQUERY_SOURCE='{type: bigquery, project: <project>, dataset: <dataset>, gc_bucket: <gc_bucket>}'

Using Sling Environment YAML

Create or edit ~/.sling/env.yaml:

connections:
  bigquery_source:
    type: bigquery
    project: your-project
    dataset: your_dataset
    gc_bucket: your-bucket
    key_file: /path/to/service.account.json
    location: US  # optional

Snowflake Connection Setup

For Snowflake, you'll need:

Account identifier (e.g., xy12345.us-east-1)
Username and password
Database name
Warehouse name
Role (optional)
Schema (optional)

Here's how to set up the Snowflake connection:

Using the sling conns set Command

# Set up Snowflake connection using CLI
sling conns set snowflake_target type=snowflake \
  account=<account> \
  user=<user> \
  password=<password> \
  database=<database> \
  warehouse=<warehouse> \
  role=<role>

Using Environment Variables

# Set up using connection URL format
export SNOWFLAKE_TARGET='snowflake://user:password@account/database?warehouse=compute_wh&role=sling_role'
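
Passwords that contain characters such as @ or / can make a connection URL ambiguous. The sketch below builds the same URL form in Python and percent-encodes the credentials first. It assumes that standard percent-encoding is accepted in the URL; the helper name snowflake_url and the sample values are hypothetical.

```python
from urllib.parse import quote, urlencode


def snowflake_url(user, password, account, database, **params):
    """Build a snowflake:// URL, percent-encoding the credentials."""
    # safe="" forces reserved characters such as "@" and "/" to be encoded
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    query = f"?{urlencode(params)}" if params else ""
    return f"snowflake://{auth}@{account}/{database}{query}"


if __name__ == "__main__":
    # Sample values only; substitute your own account details
    print(snowflake_url("sling", "p@ss/word", "xy12345.us-east-1", "analytics",
                        warehouse="compute_wh", role="sling_role"))
    # prints snowflake://sling:p%40ss%2Fword@xy12345.us-east-1/analytics?warehouse=compute_wh&role=sling_role
```

The printed value can then be exported as SNOWFLAKE_TARGET in the shell that runs Sling.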

Using Sling Environment YAML

Add to your ~/.sling/env.yaml:

connections:
  snowflake_target:
    type: snowflake
    account: xy12345.us-east-1
    user: your_username
    password: your_password
    database: your_database
    warehouse: compute_wh
    role: sling_role  # optional
    schema: public    # optional

Testing Connections

After setting up your connections, it's important to verify they work correctly:

# Test BigQuery connection
sling conns test bigquery_source

# Test Snowflake connection
sling conns test snowflake_target

# List available tables in BigQuery
sling conns discover bigquery_source

You can also manage your connections through the Sling Platform's web interface.

For more details about connection configuration, visit the environment documentation.

Data Replication Methods

Sling provides multiple ways to replicate data from BigQuery to Snowflake. Let's explore both CLI-based and YAML-based approaches, starting from simple configurations to more advanced use cases.

Using CLI Flags

The quickest way to start a replication is using CLI flags. Here are two examples:

Basic CLI Example

This example shows how to replicate a single table with default settings:

# Replicate a single table from BigQuery to Snowflake
sling run \
  --src-conn bigquery_source \
  --src-stream "analytics.daily_sales" \
  --tgt-conn snowflake_target \
  --tgt-object "ANALYTICS.DAILY_SALES" \
  --tgt-options '{ "column_casing": "upper" }'

Advanced CLI Example

This example demonstrates more advanced options, including column selection and incremental loading with a primary key and an update key:

# Replicate with advanced options
sling run \
  --src-conn bigquery_source \
  --src-stream "analytics.customer_orders" \
  --select "order_id, customer_id, order_date, total_amount" \
  --tgt-conn snowflake_target \
  --tgt-object "ANALYTICS.CUSTOMER_ORDERS" \
  --mode incremental \
  --primary-key order_id \
  --update-key order_date \
  --tgt-options '{ "column_casing": "upper", "add_new_columns": true }'

For more CLI flag options, visit the CLI flags documentation.

Using YAML Configuration

For more complex replication scenarios, YAML configuration files provide better maintainability and reusability. Let's look at two examples:

Basic YAML Example

Create a file named bigquery_to_snowflake.yaml:

# Define source and target connections
source: bigquery_source
target: snowflake_target

# Default settings for all streams
defaults:
  mode: full-refresh
  target_options:
    column_casing: upper
    add_new_columns: true

# Define the streams to replicate
streams:
  # Replicate multiple tables using wildcards
  "analytics.*":
    object: "ANALYTICS.{stream_table}"

  # Replicate a specific table with custom settings
  "sales.transactions":
    object: "SALES.TRANSACTIONS"
    mode: incremental
    primary_key: ["transaction_id"]
    update_key: "transaction_date"

Advanced YAML Example

Here's a more complex configuration with multiple streams and custom options:

source: bigquery_source
target: snowflake_target

env:
  run_date: ${RUN_DATE}

defaults:
  mode: incremental
  target_options:
    column_casing: upper
    add_new_columns: true

streams:
  # Replicate with custom SQL and column selection
  "custom_sales_report":
    object: "ANALYTICS.SALES_REPORT"
    sql: |
      SELECT 
        o.order_id,
        c.customer_name,
        p.product_name,
        o.quantity,
        o.total_amount,
        o.order_date
      FROM `dataset.orders` o
      JOIN `dataset.customers` c ON o.customer_id = c.customer_id
      JOIN `dataset.products` p ON o.product_id = p.product_id
      WHERE o.order_date >= '2023-01-01'

  # Replicate with runtime variables and transformations
  "daily_metrics":
    object: "ANALYTICS.DAILY_METRICS_{run_timestamp}"
    sql: |
      SELECT 
        date,
        product_category,
        SUM(revenue) as total_revenue,
        COUNT(DISTINCT customer_id) as unique_customers
      FROM `dataset.sales`
      WHERE date = '{run_date}'
      GROUP BY date, product_category

  # Replicate with advanced options
  "customer_segments":
    mode: truncate
    object: "ANALYTICS.CUSTOMER_SEGMENTS"
    select: ["segment_id", "segment_name", "created_at", "updated_at"]

To run a replication using a YAML configuration:

# Run the entire replication
sling run -r bigquery_to_snowflake.yaml

# Run only selected streams (comma-separated names as defined in the file)
sling run -r bigquery_to_snowflake.yaml --streams sales.transactions

For more information about runtime variables and configuration options, see the Sling documentation.

Sling Platform Features

While the CLI provides powerful command-line capabilities, the Sling Platform offers a comprehensive web-based solution for managing your data pipelines at scale. Let's explore some key features that make it easier to manage BigQuery to Snowflake replications.

Visual Configuration Editor

The Sling Platform includes a sophisticated configuration editor that makes it easy to create and modify replication configurations through a user-friendly interface.

The editor provides:

Syntax highlighting for YAML configurations
Auto-completion for connection names and options
Real-time validation of your configuration
Easy access to documentation and examples
Version control for configuration changes

Execution Monitoring

Monitor your replications in real-time with detailed execution statistics and logs.

Key monitoring features include:

Real-time progress tracking
Detailed execution logs
Performance metrics and statistics
Error reporting and diagnostics
Historical execution data

Team Collaboration

The platform facilitates team collaboration with features such as:

Role-based access control
Shared connection management
Configuration version history
Team activity monitoring
Collaborative troubleshooting

Additional Platform Benefits

The Sling Platform offers several advantages for enterprise users:

Scheduled executions
Automated retries and error handling
Integration with notification systems
Audit logging
Resource usage monitoring

For more information about the Sling Platform and its features, visit the platform documentation.

Getting Started

Ready to start using Sling for your BigQuery to Snowflake data pipeline? Here's how to get started:

1. Set Up Your Environment
   - Install Sling CLI
   - Configure your connections
   - Test connectivity to both platforms
2. Create Your First Replication
   - Start with a simple table replication
   - Test the replication process
   - Monitor the results
3. Scale Your Implementation
   - Add more tables and transformations
   - Implement incremental updates
   - Set up scheduling and monitoring
4. Explore Advanced Features
   - Try the Sling Platform
   - Implement complex transformations
   - Set up team collaboration

Additional Resources

For more examples and detailed documentation, visit the Sling Documentation.

FAQ

Why does the BigQuery connection need a Google Cloud Storage bucket?

Sling exports BigQuery data through GCS for efficient bulk extraction. It unloads query results to the bucket you set with gc_bucket, then streams them into Snowflake, so the bucket works as a staging area.

How do I match Snowflake's uppercase identifier convention?

Set column_casing to upper in target_options. Snowflake stores unquoted identifiers in uppercase, so this keeps column and table names lined up with what Snowflake expects.
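
To see why the casing matters, here is a small Python model of Snowflake's identifier rule: unquoted names resolve to uppercase, while double-quoted names keep their exact case. The function name resolve_identifier is hypothetical and exists only for illustration.

```python
def resolve_identifier(name: str) -> str:
    """Return the name Snowflake would look up for a given identifier."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        # Quoted identifiers are case-sensitive; "" is an escaped quote
        return name[1:-1].replace('""', '"')
    # Unquoted identifiers are folded to uppercase
    return name.upper()


if __name__ == "__main__":
    print(resolve_identifier("order_id"))    # prints ORDER_ID
    print(resolve_identifier('"order_id"'))  # prints order_id
```

A column created as lowercase order_id would therefore need double quotes in every query, which is the friction that uppercase column names avoid.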

Can I replicate many BigQuery tables at once?

Yes. Use a wildcard stream such as analytics.* to match every table in a dataset, and reference the {stream_table} runtime variable in the target object so each Snowflake table is named after its source.
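
As a rough mental model of that expansion, the Python sketch below matches table names against a wildcard and fills in the target object template. It is an illustration only, not Sling's implementation; the function name expand_wildcard and the table list are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, object_template, tables):
    """Map each source table matching `pattern` to a target object name."""
    mapping = {}
    for full_name in tables:
        if fnmatchcase(full_name, pattern):
            schema, table = full_name.split(".", 1)
            mapping[full_name] = object_template.format(
                stream_schema=schema, stream_table=table
            )
    return mapping


if __name__ == "__main__":
    tables = ["analytics.daily_sales", "analytics.customers", "sales.transactions"]
    for source, target in expand_wildcard(
        "analytics.*", "ANALYTICS.{stream_table}", tables
    ).items():
        print(f"{source} -> {target}")
```

The sketch leaves the table name's casing untouched; it only shows how one wildcard stream fans out into one target object per matched table.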

How do I run incremental loads from BigQuery to Snowflake?

Set mode to incremental and define a primary_key plus an update_key on the stream. Sling pulls only the rows that changed since the last run and merges them into the Snowflake target.
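
The idea behind an incremental load with a primary key and an update key can be sketched in a few lines of Python. This is a conceptual model under simplifying assumptions (in-memory rows, a strict greater-than comparison on the update key), not Sling's actual merge logic; the function name incremental_merge is hypothetical.

```python
def incremental_merge(target, source, primary_key, update_key):
    """Upsert source rows that are newer than the target's high-water mark."""
    # The high-water mark is the largest update_key value already loaded
    high_water = max((row[update_key] for row in target), default=None)
    changed = [
        row for row in source
        if high_water is None or row[update_key] > high_water
    ]
    # Index existing rows by primary key, then overwrite or append changes
    merged = {row[primary_key]: row for row in target}
    for row in changed:
        merged[row[primary_key]] = row
    return list(merged.values()), len(changed)


if __name__ == "__main__":
    target = [
        {"order_id": 1, "order_date": "2024-01-01", "total_amount": 10},
        {"order_id": 2, "order_date": "2024-01-02", "total_amount": 20},
    ]
    source = [
        {"order_id": 1, "order_date": "2024-01-01", "total_amount": 10},
        {"order_id": 2, "order_date": "2024-01-03", "total_amount": 25},
        {"order_id": 3, "order_date": "2024-01-04", "total_amount": 30},
    ]
    rows, changed = incremental_merge(target, source, "order_id", "order_date")
    print(changed)    # prints 2
    print(len(rows))  # prints 3
```

The update key only helps if the source actually bumps it on every change, so a last-modified timestamp is usually a better choice than a creation date.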

Does Sling create the Snowflake tables, and can it add new columns automatically?

Sling creates target tables that do not exist and infers types from BigQuery. Turn on add_new_columns in target_options so new source columns are added to the Snowflake table instead of breaking the run.

Which Snowflake role and warehouse does Sling use for loading?

Sling uses the warehouse and optional role you set on the Snowflake connection. The role needs permission to create or write to the target schema, and the warehouse has to be able to run the load.

Exporting Snowflake to BigQuery Using Sling

Fritz Larco — Fri, 26 Jun 2026 00:49:25 +0000

Last updated: May 2026

The Challenge of Snowflake to BigQuery Data Migration

Moving data between cloud data warehouses like Snowflake and BigQuery traditionally involves complex ETL processes, custom scripts, and significant engineering effort. Common challenges include:

Setting up and maintaining data extraction processes from Snowflake
Managing data type compatibility between platforms
Implementing efficient data loading into BigQuery
Monitoring and maintaining the data pipeline
Handling incremental updates and schema changes

Sling simplifies this entire process by providing a streamlined, configuration-based approach that eliminates the need for custom code and complex infrastructure setup.

Installing Sling

Getting started with Sling is straightforward. You can install the CLI tool using various package managers:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

For more detailed installation instructions, visit the official documentation.

Setting Up Connections

Before we can start replicating data, we need to configure our Snowflake and BigQuery connections. Sling makes this process simple with its connection management system.

First, let's set up our Snowflake connection:

# Set up Snowflake connection
export SNOWFLAKE_SOURCE="snowflake://${SNOWFLAKE_USER}:${SNOWFLAKE_PASSWORD}@${SNOWFLAKE_ACCOUNT}/${SNOWFLAKE_DATABASE}?warehouse=${SNOWFLAKE_WAREHOUSE}&role=${SNOWFLAKE_ROLE}"

# we should be able to test our connection now
sling conns test snowflake_source

Next, let's configure the BigQuery connection:

# Set up BigQuery connection
sling conns set bigquery_target type=bigquery project=<project> dataset=<dataset> key_file=/path/to/service.account.json

# we should be able to test our connection now
sling conns test bigquery_target

Creating a Snowflake to BigQuery Replication

Now that our connections are set up, we can create a replication configuration. Create a file named snowflake_to_bigquery.yaml with the following content:

# Define source and target connections
source: snowflake_source
target: bigquery_target

# Set default options for all streams
defaults:
  mode: full-refresh

# Define the tables to replicate
streams:
  # Replicate a single table
  "SALES.ORDERS":
    object: "sales_dataset.orders"
    primary_key: ["order_id"]

  # Replicate multiple tables using wildcards
  "SALES.*":
    object: "sales_dataset.{stream_table}"
    mode: incremental
    update_key: "last_modified_at"
    target_options:
      # Use BigQuery's bulk loading for better performance
      use_bulk: true

For more detailed configuration options, refer to the replication documentation.

Running the Replication

With our configuration in place, we can now run the replication using the Sling CLI:

# Run the replication
sling run -r snowflake_to_bigquery.yaml
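
If you trigger the replication from a scheduler or another Python process, a thin wrapper around the same command is often enough. The sketch below only shells out to the CLI invocation shown above; the function names are hypothetical, and it assumes the sling binary is on your PATH.

```python
import shutil
import subprocess


def sling_run_command(replication_file: str) -> list[str]:
    """Return the argv for running a replication file with the Sling CLI."""
    return ["sling", "run", "-r", replication_file]


def run_replication(replication_file: str) -> int:
    """Run the replication and return the process exit code."""
    if shutil.which("sling") is None:
        raise FileNotFoundError("sling binary not found on PATH")
    completed = subprocess.run(sling_run_command(replication_file))
    return completed.returncode


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch is safe to run
    print(" ".join(sling_run_command("snowflake_to_bigquery.yaml")))
```

Returning the exit code lets the caller decide whether a non-zero result should fail the surrounding job.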

The Sling Platform

While the CLI provides powerful functionality for data replication, the Sling Platform offers a comprehensive UI-based solution for managing your data pipelines at scale.

The platform provides:

Visual replication configuration
Real-time monitoring and logging
Team collaboration features
Scheduled executions
Agent management for distributed workloads

Best Practices and Tips

To get the most out of your Snowflake to BigQuery replications:

Use incremental mode for large tables that update frequently
Implement appropriate primary keys for data integrity
Leverage bulk loading for better performance
Monitor replication logs regularly
Use runtime variables for flexible configurations

Next Steps

To learn more about Sling's capabilities:

Explore database-to-database examples
Read about replication modes
Learn about runtime variables
Check out the Sling Platform documentation

Frequently Asked Questions

Does Sling pull data from Snowflake using UNLOAD to a stage, or does it stream rows over the wire?

Sling streams rows over the standard Snowflake driver and buffers them in batches before pushing to BigQuery. There's no Snowflake stage or external table involved, which makes the setup simpler but means very large tables benefit from running on hardware close to the BigQuery region.
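
The stream-and-buffer pattern described here can be pictured with a small Python generator. It is a generic illustration of batching a row stream, not Sling's internal code; the name batched and the batch size are assumptions.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of up to `batch_size` rows from any iterable."""
    iterator = iter(rows)
    # islice pulls at most batch_size rows; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch


if __name__ == "__main__":
    for batch in batched(range(7), 3):
        print(batch)
    # prints [0, 1, 2] then [3, 4, 5] then [6]
```

Only one batch is held in memory at a time, which is why streaming scales to tables that would not fit in RAM.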

How does Sling map Snowflake's VARIANT and OBJECT columns to BigQuery?

Variant, object, and array columns are serialized to JSON strings during extraction and land in BigQuery as STRING by default. If you want them as JSON in BigQuery, run a post-load SQL step that casts the column with SAFE.PARSE_JSON() into a new column or view.
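
To make the round trip concrete, the Python sketch below serializes a nested value to a JSON string, which is the shape a STRING column would hold, and parses it back. The sample value is made up, and the compact separator style is an assumption rather than a statement about Sling's exact output.

```python
import json

# A nested value of the kind a VARIANT column might hold (sample data)
variant_value = {"sku": "A-100", "tags": ["new", "sale"], "dims": {"w": 2.5, "h": 1}}

# Serialized form: a plain string, as it would sit in a STRING column
as_string = json.dumps(variant_value, separators=(",", ":"))
print(as_string)  # prints {"sku":"A-100","tags":["new","sale"],"dims":{"w":2.5,"h":1}}

# Parsing restores the structure, which is what a post-load cast achieves
restored = json.loads(as_string)
print(restored["dims"]["w"])  # prints 2.5
```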

Can I replicate a Snowflake share without copying data into my own database first?

Yes. As long as the Snowflake role on the connection has IMPORTED PRIVILEGES on the share, you can address the shared database and schema directly in your stream names. Sling reads from the share the same way it reads from any other database.

What's the right approach for handling case-sensitive Snowflake identifiers?

Snowflake stores unquoted identifiers in uppercase. Sling preserves the source casing by default, so streams like SALES.ORDERS keep their uppercase form. Set target_options.column_casing: snake if you want lower_snake_case columns on the BigQuery side, which is the BigQuery convention.

Will Sling create the target dataset in BigQuery automatically?

Sling will create the target tables, but the dataset itself must exist before the run starts. This is by design because dataset creation involves location and billing decisions that Sling shouldn't make for you. Create the dataset once, then point your replications at it.

How can I throttle the load on Snowflake during a large initial backfill?

Use source_options.batch_limit to cap rows per batch and run streams sequentially by leaving the default parallelism. You can also point the replication at a smaller Snowflake warehouse so it auto-suspends quickly if the run pauses.

Does the use_bulk: true option actually change anything for BigQuery targets?

BigQuery loads are already done via the bulk load API by default, so use_bulk: true is effectively a no-op for this target. You can safely omit it in BigQuery replications; it's only meaningful for targets that have both row-by-row and bulk paths.

Effortless Data Migration: How to Export from PostgreSQL and Load into S3 as Parquet with Sling

Fritz Larco — Wed, 17 Jun 2026 16:45:36 +0000

Last updated: June 2026

Introduction

In today's data-driven landscape, efficiently moving data from PostgreSQL databases to cloud storage solutions like Amazon S3 is a critical requirement for many organizations. When combined with the Parquet file format's superior compression and query performance capabilities, this creates a powerful solution for data warehousing and analytics. However, setting up and maintaining such a data pipeline traditionally involves multiple tools, complex configurations, and significant overhead.

Enter Sling, a modern data movement tool that dramatically simplifies this process. In this guide, we'll explore how to use Sling to efficiently transfer data from PostgreSQL to S3, storing it in the Parquet format for optimal performance and cost efficiency. We'll cover everything from installation and setup to advanced configuration options, making your data pipeline both powerful and maintainable.

Sling: A Modern Solution

Sling is a modern data movement platform designed to simplify data operations between various sources and destinations. It provides both a powerful CLI tool and a comprehensive platform for managing data workflows.

Key Features

Efficient Data Transfer: Optimized for performance with built-in parallelization and streaming capabilities
Native Parquet Support: Direct conversion to Parquet format without intermediate steps
Schema Handling: Automatic schema detection and evolution support
Incremental Updates: Built-in support for incremental data loading
Security: Secure credential management for both PostgreSQL and S3

Getting Started with Sling

Let's begin by installing Sling on your system. Sling provides multiple installation methods to suit your environment:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

After installation, verify that Sling is properly installed:

# Check Sling version
sling --version

For more detailed installation instructions, visit the Sling CLI Getting Started Guide.

Setting Up Connections

Before we can transfer data, we need to configure our source (PostgreSQL) and target (S3) connections. Sling provides multiple ways to set up and manage connections securely.

PostgreSQL Connection Setup

You can set up a PostgreSQL connection using one of these methods:

Using Environment Variables

The simplest way is to use environment variables:

# Set PostgreSQL connection using environment variable
export POSTGRES='postgresql://myuser:mypassword@localhost:5432/mydatabase'

Using the Sling CLI

Alternatively, use the sling conns set command:

# Set up PostgreSQL connection with individual parameters
sling conns set POSTGRES type=postgres host=localhost user=myuser database=mydatabase password=mypassword port=5432

# Or use a connection URL
sling conns set POSTGRES url="postgresql://myuser:mypassword@localhost:5432/mydatabase"

Using the Sling Environment File

You can also add the connection details to your ~/.sling/env.yaml file:

connections:
  POSTGRES:
    type: postgres
    host: localhost
    user: myuser
    password: mypassword
    port: 5432
    database: mydatabase
    schema: public

S3 Connection Setup

For Amazon S3, you'll need to configure AWS credentials. Here are the available methods:

Using Environment Variables

# Set AWS credentials using environment variables
export AWS_ACCESS_KEY_ID='your_access_key'
export AWS_SECRET_ACCESS_KEY='your_secret_key'
export AWS_REGION='us-west-2'  # optional, defaults to us-east-1

Using the Sling CLI

# Set up S3 connection with credentials
sling conns set S3 type=s3 access_key_id=your_access_key secret_access_key=your_secret_key region=us-west-2

Using the Sling Environment File

Add the S3 connection to your ~/.sling/env.yaml:

connections:
  S3:
    type: s3
    access_key_id: your_access_key
    secret_access_key: your_secret_key
    region: us-west-2  # optional, defaults to us-east-1

Testing Connections

After setting up your connections, it's important to verify they work correctly:

# Test the PostgreSQL connection
sling conns test POSTGRES

# Test the S3 connection
sling conns test S3

You can also explore the PostgreSQL database schema:

# List available tables in the public schema
sling conns discover POSTGRES -p 'public.*'

For more details about connection configuration and options, refer to the Sling documentation.

Basic Data Transfer with CLI Flags

Once you have your connections set up, you can start transferring data from PostgreSQL to S3 using Sling's CLI flags. Let's look at some common usage patterns.

Simple Transfer Example

The most basic way to transfer data is using the sling run command with source and target specifications:

# Export a single table to S3 as Parquet
sling run \
  --src-conn POSTGRES \
  --src-stream "public.users" \
  --tgt-conn S3 \
  --tgt-object "s3://my-bucket/data/users.parquet"

Understanding CLI Flag Options

Sling provides various CLI flags to customize your transfer:

# Export with specific columns and where clause
sling run \
  --src-conn POSTGRES \
  --src-stream "SELECT id, name, email FROM users WHERE created_at > '2024-01-01'" \
  --tgt-conn S3 \
  --tgt-object "s3://my-bucket/data/filtered_users.parquet" \
  --tgt-options '{ "compression": "snappy", "row_group_size": 100000 }'

# Export with custom Parquet options (file size cap and compression)
sling run \
  --src-conn POSTGRES \
  --src-stream "public.orders" \
  --tgt-conn S3 \
  --tgt-object "s3://my-bucket/data/orders.parquet" \
  --tgt-options '{ "file_max_bytes": 100000000, "compression": "snappy" }'

Using Runtime Variables

Sling supports runtime variables that can be used in your object paths and queries:

# Export multiple tables with runtime variables
sling run \
  --src-conn POSTGRES \
  --src-stream "public.sales_*" \
  --tgt-conn S3 \
  --tgt-object "s3://my-bucket/data/{stream_table}/{YYYY}_{MM}_{DD}.parquet" \
  --tgt-options '{ "file_max_bytes": 100000000 }'

For a complete list of available CLI flags and runtime variables, refer to the Sling documentation.

Advanced Data Transfer with Replication YAML

While CLI flags are great for simple transfers, YAML configuration files provide more flexibility and reusability for complex data transfer scenarios. Let's explore how to use YAML configurations with Sling.

Basic Multi-Stream Example

Create a file named postgres_to_s3.yaml with the following content:

# Basic configuration for exporting multiple tables
source: POSTGRES
target: S3

defaults:
  mode: full-refresh
  target_options:
    format: parquet
    compression: snappy
    file_max_bytes: 100000000

streams:
  # Export users table with specific columns
  public.users:
    object: s3://my-bucket/data/users/{YYYY}_{MM}_{DD}.parquet
    select: [id, name, email, created_at]

  # Export orders table with gzip compression instead of the default snappy
  public.orders:
    object: s3://my-bucket/data/orders/{YYYY}_{MM}_{DD}.parquet
    target_options:
      format: parquet
      compression: gzip
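
The date placeholders in those object paths are filled in at run time. As a rough illustration, the Python sketch below expands {YYYY}, {MM}, and {DD} from a given run time. Whether Sling uses UTC or local time is not stated here, so the UTC timestamp is an assumption, and expand_date_vars is a hypothetical helper.

```python
from datetime import datetime, timezone


def expand_date_vars(template: str, run_time: datetime) -> str:
    """Replace {YYYY}, {MM} and {DD} with zero-padded date parts."""
    values = {
        "YYYY": f"{run_time:%Y}",
        "MM": f"{run_time:%m}",
        "DD": f"{run_time:%d}",
    }
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template


if __name__ == "__main__":
    run_time = datetime(2026, 6, 17, tzinfo=timezone.utc)  # assumed run time
    print(expand_date_vars(
        "s3://my-bucket/data/users/{YYYY}_{MM}_{DD}.parquet", run_time
    ))
    # prints s3://my-bucket/data/users/2026_06_17.parquet
```

Because the path changes each day, every run writes a new file instead of overwriting the previous export.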

Advanced Configuration Example

Here's a more complex example with multiple streams and advanced options:

source: POSTGRES
target: S3

defaults:
  mode: incremental
  target_options:
    add_new_columns: true
    format: parquet
    compression: snappy
    row_group_size: 100000
    file_max_bytes: 100000000

streams:
  # Export all tables in sales schema
  sales.*:
    object: s3://my-bucket/data/{stream_schema}/{stream_table}.parquet
    mode: full-refresh
    target_options:
      format: parquet
      compression: snappy
      file_max_bytes: 500000000

  # Incremental export of customer transactions (partitioning)
  public.transactions:
    object: s3://my-bucket/data/transactions/{part_year}/{part_month}
    update_key: created_at
    sql: |
      select transaction_id, customer_id, amount, status, created_at
      from public.transactions
      where created_at > coalesce({incremental_value}, '2001-01-01')

  # Export specific customer data with custom query
  public.customers:
    object: s3://my-bucket/data/customers.parquet
    mode: full-refresh
    sql: |
      SELECT 
        c.customer_id,
        c.name,
        c.email,
        COUNT(o.order_id) as total_orders,
        SUM(o.total_amount) as lifetime_value
      FROM customers c
      LEFT JOIN orders o ON c.customer_id = o.customer_id
      GROUP BY c.customer_id, c.name, c.email

To run a replication configuration:

# Execute the replication configuration
sling run -r postgres_to_s3.yaml

For more details about replication configuration options, refer to the Sling documentation.

Using the Sling Platform UI

While the CLI is powerful for automation and scripting, the Sling Platform provides a user-friendly web interface for managing and monitoring your data transfers.

Key Platform Features

The Sling Platform offers several advantages:

Visual Replication Editor: Create and edit replication configurations with a user-friendly interface
Real-time Monitoring: Track the progress of your data transfers in real-time
History and Logs: View detailed execution history and logs for troubleshooting
Team Collaboration: Share connections and configurations with team members
Scheduling: Set up recurring transfers with flexible scheduling options

Getting Started with the Platform

To get started with the Sling Platform:

Visit app.slingdata.io to create an account
Follow the onboarding process to set up your workspace
Create your PostgreSQL and S3 connections
Create your first replication using the visual editor
Monitor your transfers in real-time

For more information about the Sling Platform, visit the Platform Documentation.

Getting Started and Next Steps

Now that you understand how to use Sling for transferring data from PostgreSQL to S3 in Parquet format, here are some next steps to explore:

Best Practices

Start Small: Begin with a single table and simple configuration
Test Thoroughly: Use the --dry-run flag to validate your configuration
Monitor Performance: Use the platform's monitoring features to optimize your transfers
Use Version Control: Store your replication YAML files in version control
Implement Security: Follow AWS best practices for S3 bucket policies and IAM roles

Next Steps

Set up your first PostgreSQL to S3 transfer using the CLI
Create a more complex replication using YAML configuration
Explore the Sling Platform for visual configuration and monitoring
Join the Sling community to share experiences and get help

With Sling, you can efficiently manage your data pipeline needs while maintaining flexibility and control over your data movement processes.

Related Guides

Parquet is the analytics-friendly choice, but Sling can write other formats from the same PostgreSQL source:

exporting PostgreSQL to S3 as CSV for a flat, widely-readable format
exporting PostgreSQL to S3 as JSON when you need a nested, schema-flexible format
exporting PostgreSQL to local Parquet files when the target is a filesystem instead of S3
exporting MySQL to S3 as Parquet for the same workflow from a MySQL source

FAQ

Why choose Parquet over CSV or JSON for PostgreSQL exports to S3?

Parquet is columnar and compressed, so files are smaller and analytical queries that read a subset of columns run much faster than over row-based CSV or JSON. It also carries column types, which avoids the type-guessing that text formats require downstream.

Which Parquet compression codecs does Sling support?

Sling supports snappy, gzip, and zstd among others, set via the compression property under target_options. Snappy is a good default balancing speed and size, while zstd compresses more tightly for cold storage.

What is row_group_size and how should I set it?

row_group_size controls how many rows go into each Parquet row group, which affects read parallelism and memory use. Larger groups compress better, while smaller groups let query engines skip data more granularly. The default works for most workloads.

Does Sling preserve PostgreSQL data types in the Parquet schema?

Yes. Sling maps PostgreSQL types to Parquet logical types, so numerics, timestamps, and booleans stay typed instead of being coerced to strings the way they would in CSV.

How do I partition the Parquet output by date in S3?

Use partition runtime variables such as {part_year} and {part_month} in the object path. Sling routes each row to the correct prefix, producing a Hive-style partitioned layout that query engines can prune.
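
The routing step can be pictured as grouping rows by the year and month of their timestamp and expanding the object path once per group. The Python sketch below shows only that grouping. The exact folder names Sling writes (plain values or key=value segments) are an assumption here, and partition_rows is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def partition_rows(rows, update_key, template):
    """Group rows by the object prefix derived from each row's timestamp."""
    groups = defaultdict(list)
    for row in rows:
        ts = row[update_key]
        prefix = template.format(part_year=f"{ts:%Y}", part_month=f"{ts:%m}")
        groups[prefix].append(row)
    return dict(groups)


if __name__ == "__main__":
    rows = [
        {"transaction_id": 1, "created_at": datetime(2024, 1, 15)},
        {"transaction_id": 2, "created_at": datetime(2024, 1, 20)},
        {"transaction_id": 3, "created_at": datetime(2024, 2, 3)},
    ]
    template = "s3://my-bucket/data/transactions/{part_year}/{part_month}"
    for prefix, group in partition_rows(rows, "created_at", template).items():
        print(prefix, len(group))
```

A query that filters on one month then only has to read the files under that month's prefix.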

Can I add new columns to existing exports without a full reload?

Yes. Enable add_new_columns under target_options so that when the source picks up a new column, Sling adds it to the schema on the next run rather than failing or requiring a manual reload.

How large should each Parquet file be?

Aim for roughly 128 MB to 512 MB per file for good query-engine performance, and control it with file_max_bytes under target_options. Many tiny files hurt read throughput, while a few huge files limit parallelism.
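
A quick back-of-the-envelope calculation helps when picking a value. The Python sketch below estimates how many files an export would produce for a given size cap. It assumes the cap applies to the written file size and ignores how compression changes the total, so treat the result as a rough guide; the function name is hypothetical.

```python
import math


def estimate_file_count(total_bytes: int, file_max_bytes: int) -> int:
    """Estimate how many files an export of `total_bytes` would produce."""
    if file_max_bytes <= 0:
        raise ValueError("file_max_bytes must be positive")
    return max(1, math.ceil(total_bytes / file_max_bytes))


if __name__ == "__main__":
    export_size = 12_000_000_000  # assumed 12 GB of Parquet output
    print(estimate_file_count(export_size, 100_000_000))  # prints 120
    print(estimate_file_count(export_size, 256 * 1024**2))  # prints 45
```

If the estimate runs into the thousands of files for a single table, raising file_max_bytes is usually the first thing to try.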

Extract data from Databases into DuckLake

Fritz Larco — Mon, 08 Jun 2026 13:22:41 +0000

Introduction

In the evolving landscape of data engineering, DuckLake is emerging as a powerful solution for building data lakes with ACID transactions, versioning, and a flexible catalog backend. It combines the speed and efficiency of DuckDB with the scalability of cloud storage, making it an attractive choice for modern data platforms.

A common requirement is to populate a data lake by extracting data from various transactional or analytical databases. This is where Sling comes in, offering a simple and powerful command-line interface (CLI) to move data between different sources and destinations.

In this article, we'll walk through how to use Sling to extract data from a PostgreSQL database and load it into DuckLake. The same principles can be applied to other databases that Sling supports, such as MySQL, SQL Server, Oracle, and more.

What is DuckLake?

DuckLake is a data lake format that brings the power of DuckDB to a data lake architecture. It provides a transactional layer over your data files (like Parquet) stored in object storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage, or local files). It uses a catalog database (like DuckDB, SQLite, PostgreSQL, or MySQL) to manage metadata, schemas, and versions.

This setup allows you to query your data lake using standard SQL with the performance benefits of DuckDB, while ensuring data consistency and reliability.

Setting up the Environment

Before we begin, make sure you have Sling CLI installed.

Configuring the Connections

We need to configure two connections in Sling: one for our source database (PostgreSQL) and one for our target (DuckLake).

1. Source Database: PostgreSQL

Let's set up a connection to a PostgreSQL database. You can do this by setting an environment variable or by using the sling conns command. See here for more details.

export POSTGRES_CONN="postgres://user:pass@host:5432/dbname"

2. Target: DuckLake

For DuckLake, we need to specify the catalog type, the connection string for the catalog, and the path where the data will be stored. For this example, we'll use a local DuckDB file as our catalog and a local directory for our data. See here for more details.

Here's how to set up the DuckLake connection using sling conns set:

sling conns set MY_DUCKLAKE type=ducklake \
  catalog_type=duckdb \
  catalog_conn_string=my_catalog.db \
  data_path=./ducklake_data

This command creates a DuckLake connection named MY_DUCKLAKE that uses a local DuckDB file my_catalog.db for the catalog and stores data in the ./ducklake_data directory.

You can verify that your connections are set up correctly by running:

sling conns list

Extracting Data from PostgreSQL to DuckLake

With our connections configured, extracting data is straightforward. We can use a simple sling run command.

Let's say we want to extract the customers table from the public schema in our PostgreSQL database and load it into a table named customers in the main schema of our DuckLake.

sling run --src-conn POSTGRES_CONN --src-stream public.customers \
          --tgt-conn MY_DUCKLAKE --tgt-object main.customers

Alternatively, you can use a YAML configuration file for more control:

# extract.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  object: main.{stream_table}

streams:
  # load all tables
  public.*:

  finance.customers:
    object: main.finance_customers

Then run:

sling run -r extract.yaml

That's it! Sling will handle:

Reading the data from all the tables in the public schema, and the finance.customers table.
Creating the main.finance_customers table in DuckLake if it doesn't exist, along with a target table for each table in the source public schema.
Writing the data into Parquet files in the ducklake_data directory.
Updating the DuckLake catalog (my_catalog.db) with the new table information.
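To see how the defaults and the per-stream override in extract.yaml interact, here is a small sketch of the name resolution. The function is hypothetical and the two public tables are assumed for the example; it mirrors the behaviour described above rather than Sling's actual code.

```python
from typing import Optional


def resolve_object(stream: str, default_object: str, override: Optional[str] = None) -> str:
    """Resolve a stream's target object: a stream-level object wins over the default."""
    if override is not None:
        return override
    table = stream.split(".", 1)[1]  # "public.customers" -> "customers"
    return default_object.replace("{stream_table}", table)


# Assumed result of expanding public.* against a source with two tables.
expanded = ["public.customers", "public.orders"]
targets = [resolve_object(s, "main.{stream_table}") for s in expanded]
targets.append(
    resolve_object("finance.customers", "main.{stream_table}", "main.finance_customers")
)
print(targets)
# ['main.customers', 'main.orders', 'main.finance_customers']
```

The override matters here because finance.customers and public.customers would otherwise both resolve to main.customers.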

Incremental Loads

One of the powerful features of Sling is its ability to handle incremental loads easily. This is crucial for keeping your data lake up-to-date without having to re-extract all the data every time.

To perform an incremental load, you need a key column in your source table that indicates the order of records, such as a timestamp or an auto-incrementing ID. Let's assume our customers table has an updated_at column.

We can set mode: incremental in a replication file and let Sling manage the state of our incremental loads automatically. Here's how you would structure the file:

# replication.yaml
source: POSTGRES_CONN
target: MY_DUCKLAKE

defaults:
  mode: incremental
  primary_key: [customer_id]
  update_key: updated_at

streams:
  public.customers:
    object: main.customers

You can then run this replication with:

sling run -r replication.yaml

The incremental mode will merge new or updated records into the target table and ensure there are no duplicates based on the primary_key. Sling will automatically track the last updated_at value it processed and only fetch newer records on subsequent runs.
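The merge behaviour can be modelled in a few lines of Python. This is an in-memory sketch under stated assumptions: the function name is made up, the target is a dict keyed by primary key, timestamps are ISO strings, and the strict greater-than comparison is a guess at the boundary rule. It is not how Sling is implemented.

```python
def incremental_merge(target: dict, source_rows: list, primary_key: str, update_key: str) -> dict:
    """Upsert rows newer than the target's highest update_key (in-memory model)."""
    checkpoint = max((row[update_key] for row in target.values()), default=None)
    for row in source_rows:
        if checkpoint is None or row[update_key] > checkpoint:
            target[row[primary_key]] = row  # insert, or replace the row with the same key
    return target


target = {
    1: {"customer_id": 1, "name": "Ada", "updated_at": "2026-01-01"},
    2: {"customer_id": 2, "name": "Lin", "updated_at": "2026-01-02"},
}
source = [
    {"customer_id": 1, "name": "Ada", "updated_at": "2026-01-01"},     # unchanged, skipped
    {"customer_id": 2, "name": "Lin Wu", "updated_at": "2026-01-05"},  # updated, replaces
    {"customer_id": 3, "name": "Sam", "updated_at": "2026-01-06"},     # new, inserted
]
merged = incremental_merge(target, source, "customer_id", "updated_at")
print(len(merged), merged[2]["name"])
# 3 Lin Wu
```

Customer 2 is replaced rather than duplicated because the write is keyed on customer_id, which is the role primary_key plays in the replication file.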

Conclusion

DuckLake offers a compelling solution for building modern, transactional data lakes, and Sling makes it incredibly simple to populate it from any database. With just a few commands, you can perform full extracts or set up robust incremental pipelines to keep your DuckLake synchronized with your source systems.

To learn more about what you can do with Sling, check out the official documentation. Happy slinging!

Sync PostgreSQL to MotherDuck with Sling

Fritz Larco — Mon, 01 Jun 2026 17:02:53 +0000

Introduction

MotherDuck is a serverless analytics service built on DuckDB. It hosts DuckDB databases in the cloud and keeps the same SQL surface you'd use locally. PostgreSQL is what most apps run on for transactional data.

So you usually want both: Postgres for the app, MotherDuck for analytics. The part in the middle that copies tables across is what Sling does.

This guide replicates a PostgreSQL schema into MotherDuck with Sling, in both full-refresh and incremental modes. The CLI output and row counts below come from an actual run, not a fabricated one.

Installing Sling

Sling is a single binary. Pick whichever install method fits your environment:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

Confirm the install:

sling --version

Full installation notes are in the Sling CLI Getting Started Guide.

Configuring the PostgreSQL Source

Sling reads connection details from ~/.sling/env.yaml, environment variables, or sling conns set. For PostgreSQL you'll need host, port, database, user, and password.

Using sling conns set:

sling conns set PG_SOURCE type=postgres host=host.ip user=myuser \
  database=mydb password=mypass port=5432

Or in ~/.sling/env.yaml:

connections:
  PG_SOURCE:
    type: postgres
    host: host.ip
    user: myuser
    password: mypass
    port: 5432
    database: mydb
    sslmode: require
    schema: public

Test it:

sling conns test PG_SOURCE

The PostgreSQL connection docs cover SSL, IAM auth, and other options.

Configuring the MotherDuck Target

A MotherDuck connection needs the database name and a service token. You can generate a token from the MotherDuck UI.

sling conns set MOTHERDUCK type=motherduck \
  database=my_db motherduck_token=eyJhbGciOi...

Or the URL form:

sling conns set MOTHERDUCK url="motherduck://my_db?motherduck_token=eyJhbGciOi..."

Or in ~/.sling/env.yaml:

connections:
  MOTHERDUCK:
    type: motherduck
    database: my_db
    motherduck_token: eyJhbGciOi...

Test it:

sling conns test MOTHERDUCK

Full options (attach modes, copy method, DuckDB version pinning) are in the MotherDuck connection docs.

A Full-Refresh Replication

For this run the PostgreSQL source has three tables in a demo_pg_motherduck schema:

customers — 5,000 rows
orders — 30,000 rows, with an updated_at timestamp
events — 60,000 rows, with an occurred_at timestamp

The replication file lives next to wherever you want to run Sling from:

# replication.yaml
source: PG_SOURCE
target: MOTHERDUCK

defaults:
  mode: full-refresh
  object: demo_pg_motherduck.{stream_table}

streams:
  demo_pg_motherduck.customers:
    primary_key: [customer_id]

  demo_pg_motherduck.orders:
    primary_key: [order_id]
    update_key: updated_at

  demo_pg_motherduck.events:
    primary_key: [event_id]
    update_key: occurred_at

A few things to point out:

object: demo_pg_motherduck.{stream_table} uses the {stream_table} runtime variable. Sling substitutes the source table name into the target object, so you don't repeat yourself per stream.
primary_key and update_key are set even though the mode here is full-refresh. The next section flips to incremental without touching those declarations; only the mode changes.
The target schema gets created automatically by Sling on the first run. No manual CREATE SCHEMA needed.

Run it:

sling run -r replication.yaml

Real output, trimmed for readability:

INF Sling Replication [3 streams] | PG_SOURCE -> MOTHERDUCK

INF [1 / 3] running stream demo_pg_motherduck.customers
INF reading from source database
INF writing to target database [mode: full-refresh]
INF created table "demo_pg_motherduck"."customers"
INF inserted 5000 rows into "demo_pg_motherduck"."customers" in 11 secs [425 r/s] [390 kB]
INF execution succeeded

INF [2 / 3] running stream demo_pg_motherduck.orders
INF created table "demo_pg_motherduck"."orders"
INF inserted 30000 rows into "demo_pg_motherduck"."orders" in 14 secs [2,131 r/s] [2.6 MB]
INF execution succeeded

INF [3 / 3] running stream demo_pg_motherduck.events
INF created table "demo_pg_motherduck"."events"
INF inserted 60000 rows into "demo_pg_motherduck"."events" in 9 secs [6,036 r/s] [3.3 MB]
INF execution succeeded

INF Sling Replication Completed in 40s | PG_SOURCE -> MOTHERDUCK | 3 Successes | 0 Failures

95,000 rows across three tables, end to end, in 40 seconds. The _tmp tables that show up in the full log are Sling's staging step before it swaps the data into the final target. They get cleaned up automatically.

Verification

A count(*) from MotherDuck right after the run:

select 'customers' as t, count(*) c from demo_pg_motherduck.customers
union all select 'orders',    count(*)   from demo_pg_motherduck.orders
union all select 'events',    count(*)   from demo_pg_motherduck.events;

+-----------+-------+
| T         |     C |
+-----------+-------+
| customers |  5000 |
| orders    | 30000 |
| events    | 60000 |
+-----------+-------+

A small sample to confirm the data made the trip with types intact:

select event_id, customer_id, event_type, region, occurred_at
from demo_pg_motherduck.events
order by event_id limit 5;

+----------+-------------+------------+--------+-------------------------------+
| EVENT_ID | CUSTOMER_ID | EVENT_TYPE | REGION | OCCURRED_AT                   |
+----------+-------------+------------+--------+-------------------------------+
|        1 |           2 | click      | us-2   | 2025-01-01 00:00:01 +0000 UTC |
|        2 |           3 | signup     | us-3   | 2025-01-01 00:00:02 +0000 UTC |
|        3 |           4 | purchase   | us-4   | 2025-01-01 00:00:03 +0000 UTC |
|        4 |           5 | page_view  | us-5   | 2025-01-01 00:00:04 +0000 UTC |
|        5 |           6 | click      | us-6   | 2025-01-01 00:00:05 +0000 UTC |
+----------+-------------+------------+--------+-------------------------------+

Numeric, varchar, and timestamp columns round-tripped cleanly. Nullable columns (region is null on every seventh row in the source) are preserved as nulls, not as the string "NULL".

Switching to Incremental

Full-refreshing a 60,000-row table every day is fine. Full-refreshing a 600-million-row event table every day is not. Sling's incremental mode reads only the rows newer than the highest update_key already in the target.

Drop customers from the streams (it changes slowly enough to keep on full-refresh in a separate run, or rebuild weekly) and switch the mode:

# replication-incremental.yaml
source: PG_SOURCE
target: MOTHERDUCK

defaults:
  mode: incremental
  object: demo_pg_motherduck.{stream_table}

streams:
  demo_pg_motherduck.orders:
    primary_key: [order_id]
    update_key: updated_at

  demo_pg_motherduck.events:
    primary_key: [event_id]
    update_key: occurred_at

Insert 1,000 new orders and 2,500 new events on the source (this simulates a day's worth of data flowing in), then run again:

sling run -r replication-incremental.yaml

INF Sling Replication [2 streams] | PG_SOURCE -> MOTHERDUCK

INF [1 / 2] running stream demo_pg_motherduck.orders
INF getting checkpoint value (updated_at)
INF writing to target database [mode: incremental]
INF inserted 1000 rows into "demo_pg_motherduck"."orders" in 9 secs [104 r/s] [86 kB]
INF execution succeeded

INF [2 / 2] running stream demo_pg_motherduck.events
INF getting checkpoint value (occurred_at)
INF writing to target database [mode: incremental]
INF inserted 2500 rows into "demo_pg_motherduck"."events" in 6 secs [358 r/s] [137 kB]
INF execution succeeded

INF Sling Replication Completed in 20s | PG_SOURCE -> MOTHERDUCK | 2 Successes | 0 Failures

The getting checkpoint value line is where Sling looks at the target, finds the largest update_key value already present (updated_at for orders, occurred_at for events), and uses that as the lower bound on the source query. Only the new rows come across:

select 'orders' as t, count(*) c from demo_pg_motherduck.orders
union all select 'events',  count(*)   from demo_pg_motherduck.events;

+--------+-------+
| T      |     C |
+--------+-------+
| orders | 31000 |
| events | 62500 |
+--------+-------+

Orders went from 30,000 to 31,000. Events went from 60,000 to 62,500. Matches what was inserted on the source.
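The checkpoint step can be pictured as a small query builder. This is a sketch of the idea only: the function name, the strict > comparison, the inlined literal, and the sample checkpoint value are all assumptions, and the SQL Sling actually issues may differ.

```python
from typing import Optional


def incremental_select(table: str, update_key: str, checkpoint: Optional[str]) -> str:
    """Build the source query for one incremental run (illustrative only)."""
    if checkpoint is None:
        # Empty target: there is nothing to bound against, so read everything.
        return f"select * from {table}"
    # Production code should bind the value as a parameter instead of inlining it.
    return f"select * from {table} where {update_key} > '{checkpoint}'"


# The checkpoint below is a made-up value standing in for max(updated_at) on the target.
print(incremental_select("demo_pg_motherduck.orders", "updated_at", "2025-01-01 08:20:00"))
# select * from demo_pg_motherduck.orders where updated_at > '2025-01-01 08:20:00'
```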

If you need updates as well as inserts (a row's updated_at changes and the existing row should be replaced rather than duplicated), keep mode: incremental and make sure primary_key is set. Sling will upsert against the primary key instead of appending. The replication modes docs cover the trade-offs.

Common Tweaks

A few options you'll reach for once the basics are in place:

Schema and column casing. MotherDuck (DuckDB) preserves identifier casing as written, even though it matches names case-insensitively, and Sling defaults to keeping the source casing. Add target_options: { column_casing: snake } under defaults if your Postgres source has mixed-case identifiers and you want a clean snake_case target.
Add new columns automatically. When the source schema changes, set target_options: { add_new_columns: true } so Sling alters the MotherDuck table on the next run. Without it, new source columns get dropped at the boundary.
Pick a copy method. The default for MotherDuck is csv_http. For very wide rows or large text values, switch to arrow_http via copy_method: arrow_http in the connection config. It's usually faster and avoids CSV escaping edge cases.
Filter at the source. Use a custom sql: block in a stream to project columns or filter rows before they leave Postgres. Cheaper than dragging unused columns to MotherDuck.
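For a sense of what a snake_case target looks like, the sketch below converts a few mixed-case column names. The regular expressions are one common way to do the conversion and are an assumption; Sling's own casing rules may treat edge cases differently.

```python
import re


def to_snake(name: str) -> str:
    """Convert a mixed-case or spaced identifier to snake_case (illustrative rules)."""
    s = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())    # spaces and symbols -> underscore
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)       # orderTotal -> order_Total
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)    # HTTPStatus -> HTTP_Status
    return s.strip("_").lower()


for column in ("CustomerID", "orderTotal", "Signup Date", "HTTPStatus"):
    print(column, "->", to_snake(column))
# CustomerID -> customer_id
# orderTotal -> order_total
# Signup Date -> signup_date
# HTTPStatus -> http_status
```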

Where to Go Next

The same replication pattern works for any of Sling's 30+ database sources into MotherDuck: MySQL, SQL Server, Snowflake, BigQuery, and the rest. Swap the source connection and leave the target alone.

If you'd rather store flat files than warehouse tables, see PostgreSQL to S3 as Parquet, which uses the same replication file shape with a file-system target. For a local DuckDB setup instead of a managed MotherDuck one, see PostgreSQL to DuckDB. For team workflows with scheduling and alerting on top of the same CLI, look at the Sling Platform.

Questions go to Discord or GitHub Issues.

Replicate MySQL to ClickHouse with Sling

Fritz Larco — Tue, 26 May 2026 13:29:56 +0000

Introduction

ClickHouse is a columnar OLAP database. It runs aggregate queries across billions of rows in seconds. MySQL is what most apps run on for transactional reads and writes. Different jobs, different storage shapes, which is why people end up running them side by side: MySQL for the app, ClickHouse for analytics on top of the app's data.

The piece in the middle, the bit that copies tables from MySQL into ClickHouse and keeps them current, is what Sling does.

This guide replicates a MySQL schema into ClickHouse with Sling, in both full-refresh and incremental modes. The CLI output, row counts, and timings below all come from an actual run against a Docker MySQL on the source side and a self-hosted ClickHouse 25.4 on the target side. The same configuration works against ClickHouse Cloud; only the connection URL changes.

Installing Sling

Sling is a single binary. Pick whichever install method fits your environment:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

Confirm the install:

sling --version

Installation notes for every platform are in the Sling CLI Getting Started Guide.

Configuring the MySQL Source

Sling reads connection details from ~/.sling/env.yaml, environment variables, or sling conns set. For MySQL you need host, port, database, user, and password.

A read-only Sling user is the right shape for replication:

CREATE USER 'sling'@'%' IDENTIFIED BY '<password>';
GRANT SELECT ON <source_schema>.* TO 'sling'@'%';

Using sling conns set:

sling conns set MYSQL_SOURCE type=mysql host=host.ip user=sling \
  database=mydb password=mypass port=3306

Or in ~/.sling/env.yaml:

connections:
  MYSQL_SOURCE:
    type: mysql
    host: host.ip
    user: sling
    password: mypass
    port: 3306
    database: mydb

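If your MySQL requires TLS, append ?tls=skip-verify to the URL form, or set tls: skip-verify in the YAML. Keep in mind that skip-verify encrypts the connection but does not validate the server certificate, so prefer a verifying mode where your setup supports one. Test it: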

sling conns test MYSQL_SOURCE

The MySQL connection docs cover SSL, IAM auth, and the rest of the options.

Configuring the ClickHouse Target

ClickHouse speaks two protocols: native (port 9000) and HTTP (port 8123 / 8443 with TLS). Sling supports both. For self-hosted clusters the native protocol is usually the fastest path; for ClickHouse Cloud, the HTTPS endpoint is the supported one.

Self-hosted, native protocol:

sling conns set CLICKHOUSE type=clickhouse host=host.ip user=default \
  password=mypass port=9000 database=default

ClickHouse Cloud over HTTPS:

sling conns set CLICKHOUSE \
  url="https://default:mypass@xxxxxx.us-east-1.aws.clickhouse.cloud:8443/default"

Or in ~/.sling/env.yaml:

connections:
  CLICKHOUSE:
    type: clickhouse
    host: host.ip
    user: default
    password: mypass
    port: 9000
    database: default

Test it:

sling conns test CLICKHOUSE

The ClickHouse connection docs list every option, including the HTTP URL form and the export_stream_format setting for tuning the staging file format.

A Full-Refresh Replication

For this run the MySQL source has three tables in a demo_mysql_clickhouse database:

customers — 5,000 rows
orders — 30,000 rows, with an updated_at timestamp
events — 60,000 rows, with an occurred_at timestamp

The replication file lives next to wherever you run Sling from:

# replication.yaml
source: MYSQL_SOURCE
target: CLICKHOUSE

defaults:
  mode: full-refresh
  object: demo_mysql_clickhouse.{stream_table}

streams:
  demo_mysql_clickhouse.customers:
    primary_key: [customer_id]

  demo_mysql_clickhouse.orders:
    primary_key: [order_id]
    update_key: updated_at

  demo_mysql_clickhouse.events:
    primary_key: [event_id]
    update_key: occurred_at

A few things worth pointing out:

object: demo_mysql_clickhouse.{stream_table} uses the {stream_table} runtime variable. Sling substitutes the source table name into the target object, so you don't repeat yourself per stream.
primary_key and update_key are set even though the mode is full-refresh. The next section flips to incremental without touching those declarations; only the mode changes.
The target database (demo_mysql_clickhouse on ClickHouse) gets created automatically by Sling on the first run. No manual CREATE DATABASE needed on the target side.

Run it:

sling run -r replication.yaml

Real output, trimmed for readability:

INF Sling Replication [3 streams] | MYSQL_SOURCE -> CLICKHOUSE

INF [1 / 3] running stream demo_mysql_clickhouse.customers
INF reading from source database
INF writing to target database [mode: full-refresh]
INF created table `demo_mysql_clickhouse`.`customers`
INF inserted 5000 rows into `demo_mysql_clickhouse`.`customers` in 0 secs [8,853 r/s] [396 kB]
INF execution succeeded

INF [2 / 3] running stream demo_mysql_clickhouse.orders
INF created table `demo_mysql_clickhouse`.`orders`
INF inserted 30000 rows into `demo_mysql_clickhouse`.`orders` in 1 secs [29,381 r/s] [2.8 MB]
INF execution succeeded

INF [3 / 3] running stream demo_mysql_clickhouse.events
INF created table `demo_mysql_clickhouse`.`events`
INF inserted 60000 rows into `demo_mysql_clickhouse`.`events` in 0 secs [81,559 r/s] [3.2 MB]
INF execution succeeded

INF Sling Replication Completed in 4s | MYSQL_SOURCE -> CLICKHOUSE | 3 Successes | 0 Failures

95,000 rows across three tables, end to end, in 4 seconds. The _tmp tables that show up in the full log are Sling's staging step before it swaps the data into the final target. They get cleaned up automatically.

When Sling creates the table, it asks for MergeTree with the primary key columns as the sorting key. That's a fine baseline for analytical queries. The "Common Tweaks" section below covers how to override it when you need partitioning, replication, or a different engine.

Verification

A count() from ClickHouse right after the run:

SELECT 'customers' AS t, count() AS c FROM demo_mysql_clickhouse.customers
UNION ALL SELECT 'orders',    count()   FROM demo_mysql_clickhouse.orders
UNION ALL SELECT 'events',    count()   FROM demo_mysql_clickhouse.events;

+-----------+-------+
| T         | C     |
+-----------+-------+
| customers |  5000 |
| orders    | 30000 |
| events    | 60000 |
+-----------+-------+

A small sample to confirm the data made the trip with types intact:

SELECT event_id, customer_id, event_type, region, occurred_at
FROM demo_mysql_clickhouse.events
ORDER BY event_id LIMIT 5;

+----------+-------------+------------+--------+-------------------------------+
| EVENT_ID | CUSTOMER_ID | EVENT_TYPE | REGION | OCCURRED_AT                   |
+----------+-------------+------------+--------+-------------------------------+
|        1 |           2 | signup     | us-2   | 2025-01-01 00:00:01 +0000 UTC |
|        2 |           3 | purchase   | us-3   | 2025-01-01 00:00:02 +0000 UTC |
|        3 |           4 | logout     | us-4   | 2025-01-01 00:00:03 +0000 UTC |
|        4 |           5 | page_view  | us-1   | 2025-01-01 00:00:04 +0000 UTC |
|        5 |           6 | click      | us-2   | 2025-01-01 00:00:05 +0000 UTC |
+----------+-------------+------------+--------+-------------------------------+

Numeric, varchar, and timestamp columns round-tripped cleanly. The nullable region column (every seventh row in the source is null) lands as ClickHouse Nullable(String) and preserves nulls as nulls, not as the literal string "NULL".

Switching to Incremental

Full-refreshing a 60,000-row event table every day is fine. Full-refreshing a 600-million-row event table every day is not. Sling's incremental mode reads only the rows newer than the highest update_key already in the target.

Drop customers from the streams (it changes slowly enough to keep on full-refresh in a separate run, or rebuild weekly) and switch the mode:

# replication-incremental.yaml
source: MYSQL_SOURCE
target: CLICKHOUSE

defaults:
  mode: incremental
  object: demo_mysql_clickhouse.{stream_table}

streams:
  demo_mysql_clickhouse.orders:
    primary_key: [order_id]
    update_key: updated_at

  demo_mysql_clickhouse.events:
    primary_key: [event_id]
    update_key: occurred_at

Insert 1,000 new orders and 2,500 new events on the source (a stand-in for a day of fresh data), then run again:

sling run -r replication-incremental.yaml

INF Sling Replication [2 streams] | MYSQL_SOURCE -> CLICKHOUSE

INF [1 / 2] running stream demo_mysql_clickhouse.orders
INF getting checkpoint value (updated_at)
INF writing to target database [mode: incremental]
INF inserted 1000 rows into `demo_mysql_clickhouse`.`orders` in 0 secs [1,926 r/s] [93 kB]
INF execution succeeded

INF [2 / 2] running stream demo_mysql_clickhouse.events
INF getting checkpoint value (occurred_at)
INF writing to target database [mode: incremental]
INF inserted 2500 rows into `demo_mysql_clickhouse`.`events` in 0 secs [4,040 r/s] [134 kB]
INF execution succeeded

INF Sling Replication Completed in 2s | MYSQL_SOURCE -> CLICKHOUSE | 2 Successes | 0 Failures

SELECT 'orders' AS t, count() c FROM demo_mysql_clickhouse.orders
UNION ALL SELECT 'events', count()   FROM demo_mysql_clickhouse.events;

+--------+-------+
| T      | C     |
+--------+-------+
| orders | 31000 |
| events | 62500 |
+--------+-------+

Orders went from 30,000 to 31,000. Events went from 60,000 to 62,500. Matches what was inserted on the source.

ClickHouse's MergeTree family is append-friendly. In incremental mode Sling inserts the new rows directly into the main table without rewriting partitions. If you also need updates (a row's updated_at changes and you want the existing target row replaced rather than duplicated), keep mode: incremental and make sure primary_key is set. Sling will use a ReplacingMergeTree-style upsert path against that key. The replication modes docs cover the trade-offs.
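ReplacingMergeTree semantics are easy to model: rows are appended as they arrive, and rows sharing a sorting key are later collapsed to the one with the highest version. The sketch below shows that collapse in memory. The function name and sample rows are made up, and it says nothing about how Sling wires up its upsert path internally.

```python
def collapse_latest(rows: list, key: str, version: str) -> list:
    """Keep one row per key: the highest version, or the last inserted on a tie."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


rows = [
    {"order_id": 1, "status": "pending", "updated_at": "2026-05-01 10:00:00"},
    {"order_id": 2, "status": "pending", "updated_at": "2026-05-01 10:05:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2026-05-02 09:00:00"},  # later version
]
for row in collapse_latest(rows, "order_id", "updated_at"):
    print(row["order_id"], row["status"])
# 1 shipped
# 2 pending
```

In ClickHouse itself the collapse happens during background merges, so a query that runs before a merge can still see both versions of order 1.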

Common Tweaks

A few options worth reaching for once the basics are in place:

Pick a table engine. ClickHouse's default MergeTree is a fine baseline, but for high-write or replicated clusters you'll want ReplicatedMergeTree, partitioning by month, and a TTL. Set target_options.table_ddl per stream with the full CREATE TABLE you want; Sling will use it instead of generating its own. Example engine clause for that statement: engine = MergeTree() ORDER BY (customer_id, occurred_at) PARTITION BY toYYYYMM(occurred_at).
Add new columns automatically. When the source schema changes, set target_options: { add_new_columns: true } so Sling alters the ClickHouse table on the next run. Without it, new source columns get dropped at the boundary.
Tune the staging format. Sling stages data as a file before bulk-loading into ClickHouse. The default is CSVWithNames, which is robust but verbose. For wide rows or large text values, set export_stream_format: Parquet on the ClickHouse connection. Usually faster and more compact on the wire.
Filter at the source. Use a custom sql: block in a stream to project columns or filter rows before they leave MySQL. Cheaper than dragging unused columns to ClickHouse, and it keeps row payloads small for the network hop.
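The toYYYYMM(occurred_at) partition expression from the engine example maps each timestamp to an integer such as 202501. A short sketch of the same arithmetic in Python shows how rows group into monthly partitions; the sample timestamps are invented for the example.

```python
from collections import Counter
from datetime import datetime


def to_yyyymm(ts: datetime) -> int:
    """Python equivalent of ClickHouse's toYYYYMM: year * 100 + month."""
    return ts.year * 100 + ts.month


events = [
    datetime(2025, 1, 1, 0, 0, 1),
    datetime(2025, 1, 31, 23, 59, 59),
    datetime(2025, 2, 1, 0, 0, 0),
]
print(Counter(to_yyyymm(ts) for ts in events))
# Counter({202501: 2, 202502: 1})
```

Monthly partitions keep the partition count low, and a TTL or a manual drop can then remove a whole month at once.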

Where to Go Next

The same replication pattern works for any of Sling's 30+ database sources into ClickHouse: PostgreSQL, SQL Server, Snowflake, BigQuery, and the rest. Swap the source connection and leave the target alone. For the equivalent flow from a Postgres source, see PostgreSQL to ClickHouse.

If your downstream is more cloud-warehouse than columnar engine, MySQL to MotherDuck covers the same setup with DuckDB-on-the-cloud as the target. For team workflows with scheduling and alerting on top of the same CLI, look at the Sling Platform.

Questions go to Discord or GitHub Issues.

Load PostgreSQL into Apache Iceberg with Sling

Fritz Larco — Mon, 18 May 2026 13:31:11 +0000

Introduction

Apache Iceberg is the table format that turns a pile of Parquet files in object storage into something that behaves like a warehouse table. You get schema evolution, hidden partitioning, time travel, and consistent reads from whichever engine you point at the table. PostgreSQL is where most operational data starts. Moving it into Iceberg gives you an analytics copy that DuckDB, Spark, Trino, Snowflake, and Athena can all read without anyone needing to agree on a single warehouse vendor first.

Sling speaks the Iceberg REST catalog directly. From the configuration side an Iceberg target is just another database connection: point Sling at the catalog URL and the underlying object store, then declare your streams. No JVM, no Spark, no manual manifest writing.

This guide replicates a Postgres schema into Iceberg using Sling. The catalog is Cloudflare R2's managed Iceberg REST catalog and the storage layer underneath is R2. Every CLI line, row count, and timing below comes from an actual run against those endpoints.

Installing Sling

Sling is a single binary. Pick whichever install fits:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

Confirm:

sling --version

Full install notes are in the Sling CLI Getting Started Guide.

Configuring the Postgres Source

Sling reads connection details from ~/.sling/env.yaml, environment variables, or sling conns set. A read-only user is enough:

CREATE USER sling WITH PASSWORD '<password>';
GRANT CONNECT ON DATABASE mydb TO sling;
GRANT USAGE ON SCHEMA public TO sling;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sling;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO sling;

Then register the connection:

sling conns set POSTGRES type=postgres host=host.ip user=sling \
  database=mydb password=mypass port=5432

Or in ~/.sling/env.yaml:

connections:
  POSTGRES:
    type: postgres
    host: host.ip
    user: sling
    password: mypass
    port: 5432
    database: mydb

If your Postgres requires SSL, add sslmode: require to the connection. Test it:

sling conns test POSTGRES

The Postgres connection docs cover SSL, IAM, and the rest.

Configuring the Iceberg Target

Sling treats Iceberg as a database-class target. The connection captures two things: the catalog, which stores table metadata, and the warehouse, which stores the actual Parquet data files. Sling supports REST, AWS Glue, and SQL catalogs. This guide uses REST.

For Cloudflare R2's Iceberg catalog you need the catalog URL, an API token, the warehouse identifier (account-id + bucket name), and S3-compatible credentials for the R2 bucket underneath. All four come from the R2 dashboard.

connections:
  ICEBERG:
    type: iceberg
    catalog_type: rest
    rest_uri: https://catalog.cloudflarestorage.com/<accountid>/<bucket>
    rest_token: <r2_catalog_api_token>
    rest_warehouse: <accountid>_<bucket>
    s3_access_key_id: <r2_access_key_id>
    s3_secret_access_key: <r2_secret_access_key>

For a self-hosted Lakekeeper or Nessie catalog, the shape is the same; only the rest_uri and rest_warehouse change. For AWS Glue, set catalog_type: glue and glue_warehouse: s3://my-bucket/warehouse. The Iceberg connection docs walk through each catalog type.
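For the R2 case, the two catalog fields follow directly from the account ID and bucket name, using the shapes shown in the connection block. The helper below just applies those two patterns; its name and the sample identifiers are hypothetical, and the token and S3 credentials still come from the R2 dashboard.

```python
def r2_catalog_fields(account_id: str, bucket: str) -> dict:
    """Derive rest_uri and rest_warehouse from the patterns in the connection block."""
    return {
        "rest_uri": f"https://catalog.cloudflarestorage.com/{account_id}/{bucket}",
        "rest_warehouse": f"{account_id}_{bucket}",
    }


# Placeholder identifiers, not real account values.
fields = r2_catalog_fields("abc123", "analytics")
for name, value in fields.items():
    print(f"{name}: {value}")
# rest_uri: https://catalog.cloudflarestorage.com/abc123/analytics
# rest_warehouse: abc123_analytics
```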

Test it:

sling conns test ICEBERG

A Full-Refresh Replication

For this run the Postgres source has three tables in a demo_postgres_iceberg schema:

users — 8,000 rows
orders — 35,000 rows
events — 60,000 rows, with an occurred_at timestamp

The replication file:

# replication.yaml
source: POSTGRES
target: ICEBERG

defaults:
  mode: full-refresh
  object: demo_postgres_iceberg.{stream_table}

streams:
  demo_postgres_iceberg.users:
  demo_postgres_iceberg.orders:
  demo_postgres_iceberg.events:
    mode: incremental
    primary_key: [event_id]
    update_key: occurred_at

A few notes:

object: follows the usual <namespace>.<table> shape. Sling creates the Iceberg namespace if it doesn't already exist in the catalog.
{stream_table} is a runtime variable. Sling substitutes the source table name so you don't repeat yourself.
The third stream switches to mode: incremental with an update_key. That's the only diff between a one-shot bulk load and an ongoing append flow.

Run it:

sling run -r replication.yaml

Real output, trimmed:

INF Sling CLI | https://slingdata.io
WRN for mode 'incremental' with iceberg target, primary-key is ineffective,
    incremental merge is not yet supported (only appends)
INF Sling Replication [3 streams] | POSTGRES -> ICEBERG

INF [1 / 3] running stream demo_postgres_iceberg.users
INF created table "demo_postgres_iceberg"."users"
INF streaming data (direct insert)
INF inserted 8000 rows into "demo_postgres_iceberg"."users" in 11 secs [713 r/s] [519 kB]

INF [2 / 3] running stream demo_postgres_iceberg.orders
INF created table "demo_postgres_iceberg"."orders"
INF inserted 35000 rows into "demo_postgres_iceberg"."orders" in 9 secs [3,721 r/s] [2.1 MB]

INF [3 / 3] running stream demo_postgres_iceberg.events
INF getting checkpoint value (occurred_at)
INF writing to target database [mode: incremental]
INF created table "demo_postgres_iceberg"."events"
INF inserted 60000 rows into "demo_postgres_iceberg"."events" in 7 secs [8,190 r/s] [4.5 MB]

INF Sling Replication Completed in 29s | POSTGRES -> ICEBERG | 3 Successes | 0 Failures

103,000 rows across three tables, 29 seconds end-to-end. The warning at the top deserves a real answer; see the section on incremental modes further down.
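As a quick illustration of what "only appends" means in practice, the sketch below models an append-only target in memory. If a source row changes and its update_key moves forward, the new version is appended next to the old one instead of replacing it. The function names and sample rows are invented for the example.

```python
from collections import Counter


def append_only(target: list, new_rows: list) -> list:
    """Append new rows without checking the primary key (models an append-only load)."""
    target.extend(new_rows)
    return target


def duplicated_keys(rows: list, primary_key: str) -> list:
    """Return primary-key values that appear more than once."""
    counts = Counter(row[primary_key] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


table = [{"event_id": 1, "event_type": "click", "occurred_at": "2026-05-10 09:00:00"}]
append_only(table, [
    {"event_id": 1, "event_type": "purchase", "occurred_at": "2026-05-11 09:00:00"},  # changed row
    {"event_id": 2, "event_type": "signup", "occurred_at": "2026-05-11 09:05:00"},    # new row
])
print(len(table), duplicated_keys(table, "event_id"))
# 3 [1]
```

For insert-only data such as an event log this never triggers, which is why append mode suits tables whose rows don't change after they are written.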

Verification

Sling can query Iceberg tables directly through its DuckDB-backed reader. Tables are addressed as iceberg_catalog.<namespace>.<table>:

sling conns exec ICEBERG \
  "select 'users' as t, count(*) as c
     from iceberg_catalog.demo_postgres_iceberg.users
   union all
   select 'orders', count(*) from iceberg_catalog.demo_postgres_iceberg.orders
   union all
   select 'events', count(*) from iceberg_catalog.demo_postgres_iceberg.events"

+--------+-------+
| T      |     C |
+--------+-------+
| users  |  8000 |
| orders | 35000 |
| events | 60000 |
+--------+-------+

Row counts match the source. A sample of users confirms columns and types survived the trip:

sling conns exec ICEBERG \
  "select user_id, email, country, signup_at
     from iceberg_catalog.demo_postgres_iceberg.users
    order by user_id limit 5"

+---------+-------------------+---------+-------------------------------+
| USER_ID | EMAIL             | COUNTRY | SIGNUP_AT                     |
+---------+-------------------+---------+-------------------------------+
|       1 | user1@example.com | BR      | 2025-01-01 00:14:00 -0300 -03 |
|       2 | user2@example.com | DE      | 2025-01-01 00:28:00 -0300 -03 |
|       3 | user3@example.com | FR      | 2025-01-01 00:42:00 -0300 -03 |
|       4 | user4@example.com | JP      | 2025-01-01 00:56:00 -0300 -03 |
|       5 | user5@example.com | UK      | 2025-01-01 01:10:00 -0300 -03 |
+---------+-------------------+---------+-------------------------------+

Postgres jsonb lands as a structured column too. Sampling events:

+----------+---------+------------+----------------------+----------------------+
| EVENT_ID | USER_ID | EVENT_TYPE | PAYLOAD              | OCCURRED_AT          |
+----------+---------+------------+----------------------+----------------------+
|    60001 |       2 | click      | {"v": 1, "utm": "x"} | 2026-05-11 ...       |
|    60002 |       3 | signup     | {"v": 2, "utm": "x"} | 2026-05-11 ...       |
|    60003 |       4 | purchase   | {"v": 3, "utm": "x"} | 2026-05-11 ...       |
+----------+---------+------------+----------------------+----------------------+

Any other Iceberg reader sees the same data: DuckDB with the iceberg extension, Spark, Trino, Athena, Snowflake's catalog-linked databases. That portability is the reason for the catalog in the first place.

Running an Incremental Append

After the bulk load, the day-to-day shape is: every few minutes (or hours, or once a day), pick up the new rows since the last run and append them to the Iceberg table. Sling's incremental mode does this. The state (the last seen value of the update_key) is tracked by Sling itself, so you don't need to manage a state file the way you would for a file-based target.

Insert 2,500 new events on the source (a stand-in for fresh activity):

insert into demo_postgres_iceberg.events (event_id, user_id, event_type, payload, occurred_at)
select 60000 + n, 1 + (n % 8000), 'click',
       jsonb_build_object('utm','x','v', n % 100),
       now() - (n * interval '1 second')
  from generate_series(1, 2500) g(n);

Run a single-stream replication that touches only events:

# replication-incremental.yaml
source: POSTGRES
target: ICEBERG

defaults:
  object: demo_postgres_iceberg.{stream_table}

streams:
  demo_postgres_iceberg.events:
    mode: incremental
    update_key: occurred_at

sling run -r replication-incremental.yaml

INF Sling Replication | POSTGRES -> ICEBERG | demo_postgres_iceberg.events
INF getting checkpoint value (occurred_at)
INF reading from source database
INF writing to target database [mode: incremental]
INF streaming data (direct insert)
INF inserted 2500 rows into "demo_postgres_iceberg"."events" in 8 secs [294 r/s] [178 kB]
INF execution succeeded

Sling read the saved checkpoint, pulled only rows newer than the last occurred_at it saw, and appended exactly the 2,500 new rows. A readback confirms the new total:

sling conns exec ICEBERG \
  "select min(occurred_at) as min_occurred_at,
          max(occurred_at) as max_occurred_at,
          count(*) as count
     from iceberg_catalog.demo_postgres_iceberg.events"

+-------------------------------+--------------------------------------+--------+
| MIN_OCCURRED_AT               | MAX_OCCURRED_AT                      | COUNT  |
+-------------------------------+--------------------------------------+--------+
| 2025-03-01 00:00:40 -0300 -03 | 2026-05-11 08:42:59.533692 -0300 -03 |  62500 |
+-------------------------------+--------------------------------------+--------+

60,000 + 2,500 = 62,500. The new high-water mark on occurred_at is the timestamp of the freshest insert. The next scheduled run will start from there.
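
The checkpoint pattern described above can be sketched in a few lines of Python. This illustrates the general high-water-mark technique and is not Sling's actual implementation: two in-memory SQLite databases stand in for Postgres and Iceberg, and the incremental_append helper is a hypothetical name invented for this example.

```python
import sqlite3


def incremental_append(source, target, table, update_key):
    """Append source rows whose update_key is newer than the target's maximum.

    Hypothetical helper for illustration only; not part of Sling.
    """
    # Checkpoint: the highest update_key value already present in the target.
    (checkpoint,) = target.execute(
        f"select max({update_key}) from {table}"
    ).fetchone()
    if checkpoint is None:
        # Empty target: nothing to compare against, so take everything.
        rows = source.execute(f"select * from {table}").fetchall()
    else:
        rows = source.execute(
            f"select * from {table} where {update_key} > ?", (checkpoint,)
        ).fetchall()
    if rows:
        placeholders = ", ".join(["?"] * len(rows[0]))
        target.executemany(f"insert into {table} values ({placeholders})", rows)
    return len(rows)


# Stand-ins for the Postgres source and the Iceberg target (assumption).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("create table events (event_id integer, occurred_at text)")

# First run: the target is empty, so all three rows are copied.
source.executemany(
    "insert into events values (?, ?)",
    [
        (1, "2025-03-01 00:00:40"),
        (2, "2025-03-01 00:01:20"),
        (3, "2025-03-01 00:02:00"),
    ],
)
first = incremental_append(source, target, "events", "occurred_at")

# Fresh activity on the source: only the two newer rows are picked up.
source.executemany(
    "insert into events values (?, ?)",
    [(4, "2026-05-11 08:42:58"), (5, "2026-05-11 08:42:59")],
)
second = incremental_append(source, target, "events", "occurred_at")

# Nothing new since the last run, so nothing is appended.
third = incremental_append(source, target, "events", "occurred_at")

total = target.execute("select count(*) from events").fetchone()[0]
print(first, second, third, total)
```

The strict greater-than comparison is what keeps a rerun from duplicating rows, and it is also why this pattern only suits columns that move forward: a row inserted later with an older occurred_at would be skipped.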

Append-incremental vs merge-incremental

That warning Sling printed on the first run matters:

WRN for mode 'incremental' with iceberg target, primary-key is ineffective,
    incremental merge is not yet supported (only appends)

For database targets like Postgres or Snowflake, Sling's incremental mode is a merge: a row whose primary_key already exists in the target gets updated in place. For an Iceberg target today, incremental means append only. New rows go in, existing rows stay as-is, and a primary_key declared on the stream is parsed but not enforced.
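
A toy Python sketch makes the difference concrete. It models the two behaviours on plain lists of dicts and says nothing about how Sling or Iceberg implement them; merge_rows and append_rows are hypothetical names, and the sample rows are made up.

```python
def merge_rows(target, batch, key):
    """Merge: a row whose key already exists in the target is replaced in place."""
    by_key = {row[key]: row for row in target}
    for row in batch:
        by_key[row[key]] = row
    return list(by_key.values())


def append_rows(target, batch):
    """Append: every incoming row is added, whether or not its key exists."""
    return target + batch


# Made-up sample data: user 2 changes country, user 3 is new.
target = [{"user_id": 1, "country": "BR"}, {"user_id": 2, "country": "DE"}]
batch = [{"user_id": 2, "country": "FR"}, {"user_id": 3, "country": "JP"}]

merged = merge_rows(target, batch, "user_id")
appended = append_rows(target, batch)

print(len(merged), len(appended))
```

After the merge there are three rows and user 2 shows the new country. After the append there are four rows, and user 2 appears twice: once with the old value, once with the new one.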

That is fine when your source is append-only: events, immutable transactions, log data. It is the wrong default if your source has mutable rows you need reflected on the lake side. Until merge lands, two patterns work:

Snapshot replays. Run mode: full-refresh on a cadence that matches your freshness budget. Iceberg's snapshot model means readers always see a consistent table: each refresh commits a new snapshot atomically, and the previous snapshot stays available until it is expired. For tables in the low millions of rows this is faster than it sounds.
CDC-style append plus downstream resolution. Append every Postgres change to Iceberg as-is (using a logical-replication tool or trigger-based capture) and resolve the latest-state view at read time with something like qualify row_number() over (partition by pk order by event_ts desc) = 1. A bit more work at query time, very cheap at write time.
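
The read-time resolution in the second pattern can be tried locally. The sketch below uses Python's built-in sqlite3 module as a stand-in query engine, which is an assumption made so the example runs anywhere; the users_changes table, its columns, and the rows are hypothetical. SQLite has no QUALIFY clause, so the same row_number() filter is written as a subquery, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical change log: every insert or update is appended as its own row.
db.execute(
    "create table users_changes (user_id integer, email text, event_ts text)"
)
db.executemany(
    "insert into users_changes values (?, ?, ?)",
    [
        (1, "user1@example.com", "2025-01-01 00:14:00"),
        (2, "user2@example.com", "2025-01-01 00:28:00"),
        (1, "user1+new@example.com", "2025-02-01 09:00:00"),  # later update to user 1
    ],
)

# Keep only the newest change per primary key.
latest = db.execute(
    """
    select user_id, email, event_ts
      from (
        select *,
               row_number() over (
                 partition by user_id order by event_ts desc
               ) as rn
          from users_changes
      )
     where rn = 1
     order by user_id
    """
).fetchall()

for row in latest:
    print(row)
```

On an engine that supports QUALIFY, the subquery collapses into the one-line filter quoted above. Either way the append stays cheap and the deduplication cost is paid by the reader.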

Track the Iceberg connector docs for when full merge mode ships.

Common tweaks

Choose the right catalog. REST is the most portable: the same connection shape works for Cloudflare R2, Lakekeeper, Nessie, Polaris, and any other REST-compatible catalog. Glue is the simplest in AWS-native shops. SQL catalog is fine for local dev. Avoid wiring a different catalog per environment if you can help it; the table layout doesn't care, but the metadata location does.
Namespace organization. Treat namespaces (the demo_postgres_iceberg part of demo_postgres_iceberg.users) the way you treat warehouse schemas: one per source system, or one per data domain. Don't dump everything into default.
Filter at the source. Use a sql: block per stream to project columns or filter rows before they leave Postgres. Smaller Parquet files, smaller manifests, cheaper queries downstream.
Time travel for free. Every replication produces a new Iceberg snapshot. Readers can time-travel to a previous snapshot, which is useful for "what did this table look like before yesterday's run?" without storing your own backups.
Maintain the table. Like any Iceberg table, periodic compaction and snapshot expiration keep the file count and metadata size from growing without bound. Set this up on a separate schedule from the replication itself.

Where to go next

The same pattern works for any of Sling's 30+ database sources into Iceberg: MySQL, SQL Server, Snowflake, BigQuery, MongoDB, and the rest. Swap the source and leave the target alone.

If the underlying R2 storage is what brought you here, the Postgres → R2 as Parquet walkthrough shows the same source landing as raw Parquet files instead of an Iceberg table, which is useful when downstream readers don't need a catalog. For a deeper comparison of file-format targets, see Postgres → S3 as Parquet and Postgres → DuckDB.

For team workflows with scheduling, alerting, and audit trails on top of the same CLI, look at the Sling Platform.

Questions go to Discord or GitHub Issues.
