DEV Community: Sourabh Gupta

Debezium vs Managed CDC: How to Actually Decide Between Build and Buy

Sourabh Gupta — Wed, 08 Jul 2026 09:41:47 +0000

Most "Debezium vs managed tool" articles get the question wrong. They frame it as a product bake-off, feature grid included, and declare a winner. But if you've actually run change data capture in production, you know the real decision isn't which tool captures a transaction log better. They mostly read the same logs the same way. The real decision is who operates everything that sits around the capture, and whether that work is a good use of your team's time.

That's a build-vs-buy question, not a product question. This post is a framework for answering it for your own situation.

First, let's kill an outdated assumption

A lot of Debezium criticism floating around is two or three years stale, and if you repeat it in 2026 you'll get corrected fast. So let's set the record straight before we compare anything.

Debezium is no longer just “the thing you run with Kafka Connect.” In recent Debezium 3.x releases, the project has become much more flexible than the old tutorials suggest. Today, you have several deployment options:

Kafka Connect, the classic setup, which gives you the Kafka ecosystem, distributed fault tolerance, durable schema history, and access to Kafka Connect sink connectors.
Debezium Server, a standalone application that streams changes to systems like Amazon Kinesis, Google Cloud Pub/Sub, Apache Pulsar, Redis Streams, or NATS JetStream without requiring Kafka.
Debezium Management Platform, which builds on Debezium Server and the Debezium Operator to provide a higher-level way to configure and manage CDC pipelines in Kubernetes-style environments.
Embedded usage, where you run Debezium Engine inside your own application. Recent Debezium releases also added framework support such as the Quarkus extension.

A few more things are worth knowing so the comparison is fair:

Kafka 4.x runs in KRaft mode, and ZooKeeper mode has been removed. “You need to babysit ZooKeeper” is no longer true for a modern Kafka deployment.
Debezium's default remains at-least-once delivery, so duplicates are possible on failures and restarts. Since Debezium 3.3, exactly-once support is available for core connectors when running through Kafka Connect with transactional support enabled. Outside that boundary, you still need idempotent consumers or downstream deduplication.

None of this makes Debezium the wrong choice. It makes it a more capable and more flexible choice than the old tutorials suggest. Keep that in mind, because a fair comparison is the only kind worth reading.

What you are actually comparing

Here's the mental model I'd start from. A CDC pipeline has two parts:

The capture engine. It reads the database transaction log (the Postgres WAL, the MySQL binlog, the SQL Server change tables) and emits row-level change events. This is the part Debezium does extremely well, and it's genuinely the easy part.
Everything around the capture. Backfilling historical data and reconciling it with the live stream. Handling schema changes so a new column doesn't break a downstream job. Delivering to your actual destinations. Monitoring lag, retries, and dead letters. Scaling, upgrades, and the 2 a.m. page when a replication slot fills up and your source database's disk starts climbing.

Whichever path you pick, part one is close to solved. The entire build-vs-buy decision lives in part two.

When you "build," you take Debezium (or another open-source engine) and you own part two. When you "buy," you pay a vendor to own most of part two for you. That's the whole trade, stated plainly.

The real cost of building

Debezium itself is free and Apache-licensed. That number is real, and it's zero. The cost shows up everywhere else, so budget for these honestly:

Engineering time to design and stand up the pipeline. Not just the connector config, but the surrounding services: sinks, schema handling, and whatever glue your destinations need.
Backfill and snapshot coordination. Loading history and then switching to the live stream without gaps or duplicates is finicky. Debezium's incremental snapshots help a lot here, but you still own the behavior.
Schema evolution. Debezium can emit schema-change information for supported connectors, but turning that into safe downstream schema evolution is still something you own. This is one of the most common sources of production breakage in home-grown pipelines.
Destination management. Debezium is a capture tool by design. Getting data into Snowflake, BigQuery, or an operational store means Kafka Connect sink connectors or your own consumers, each with their own limits.
Delivery guarantees. If you need exactly-once, you're running on Kafka Connect with EOS configured, or you're building idempotent consumers. That's real design work, not a checkbox.
Monitoring, on-call, and upgrades. Lag dashboards, alerting, dead-letter handling, and version upgrades across connectors. This never fully ends; it's a standing operational commitment.

The honest version of the build case: if you have the engineering depth, want full control, need to avoid vendor lock-in, and CDC is close enough to your core competency that owning it makes sense, building is a completely reasonable choice. Plenty of strong data teams run Debezium in production and are happy. The newer deployment modes (Server and the Management Platform) meaningfully reduce the operational surface compared to the full Kafka setup, so the "build" side is lighter than it used to be.

Where building goes wrong is when a team picks it to save money, undercounts part two, and discovers six months later that a senior engineer is effectively a part-time CDC platform operator.

The real cost of buying

Managed CDC platforms take part two off your plate. You connect a source, pick a destination, and the vendor handles snapshots, schema handling, delivery, scaling, and recovery. The catalog here includes Estuary, Fivetran, Airbyte Cloud, Confluent Cloud, and the cloud-native services like AWS DMS and Google Datastream. They differ in meaningful ways:

Fivetran is the managed ELT leader with the largest connector catalog. CDC is part of a broader platform, latency is typically schedule-based and often measured in minutes depending on connector and plan, and pricing is based on monthly active rows, which can surprise fast-growing teams.
Airbyte is open-core with 600+ connectors. For selected database CDC sources, it uses Debezium-based log capture under the hood. Self-host for control, or use the cloud tier for convenience.
Confluent Cloud is the natural fit if you want managed Kafka plus Debezium connectors with deep Kafka integration.
Cloud-native services like AWS DMS, Google Datastream, and Azure Data Factory can be convenient inside their own cloud, but cross-cloud use, egress, connector behavior, and feature support all need careful review.
Estuary is streaming-first, built on its own Gazette-based runtime rather than Kafka, with sub-second data movement for many use cases and exactly-once semantics where the destination supports transactional commits. It captures once and lets you reuse that stream across many destinations, and it exposes a Kafka-compatible interface, Dekaf, so consumers can read Estuary collections as if they were Kafka topics. It offers public SaaS, private, and BYOC deployments.

Note: I work at Estuary, one of the managed CDC platforms mentioned here. This post is meant as a practical build-vs-buy framework, not a product ranking.

Buying isn't free of tradeoffs, and pretending otherwise is how you lose a technical audience. Be clear-eyed about the buy side:

Recurring cost that scales with your data. At high volume, a managed bill can exceed the cost of running Debezium yourself. Model this at your real and projected volumes, not today's.
Less low-level control. You get the vendor's abstractions. If you need to fork the capture logic or do something exotic, a managed platform may not let you.
Lock-in and pricing-model risk. Different vendors bill on rows, connectors, or compute. A model that's cheap now can get expensive as you grow, and migrating off later is real work.
Security and data-residency review. Routing production data through a third party means a procurement and compliance conversation. BYOC and private deployments exist partly to answer this.

The decision framework

Here's how I'd actually decide. Score your situation against these, and let the weight fall where it falls.

Lean toward building (Debezium) if:

You have engineers who know, or want to own, streaming infrastructure.
Avoiding vendor lock-in is a hard requirement.
You need deep customization of the capture or event-processing logic.
CDC is central enough to your product that operating it is a reasonable investment.
You're comfortable owning backfills, schema propagation, and on-call.

Lean toward buying if:

Your team's time is better spent on product than on pipeline plumbing.
You need data in warehouses or apps quickly, without a multi-week build.
You want backfills, schema handling, and delivery guarantees handled for you.
You'd rather scale a bill than a Kafka Connect cluster.
Reliability and low-latency delivery matter more than low-level control.

A useful gut check: if the pipeline going down at 2 a.m. is your problem to fix, you're building. If it's a support ticket, you're buying. Decide which of those you want to sign up for.

A quick way to compare true cost

Don't compare a license fee to a subscription. Compare total cost of ownership over, say, two years. A rough model:

Build (Debezium) TCO = infrastructure (Kafka or the runtime you choose, plus storage and egress) + fully loaded engineering time to build it + ongoing operational time (monitoring, on-call, upgrades, incident response) + the cost of the occasional data incident.

Buy (managed) TCO = the subscription at your real volume + integration time + the internal cost of vendor and security review + any egress the platform passes through.

The line that catches teams off guard is "ongoing operational time." A conservative half-day a week of senior engineer time on pipeline maintenance is roughly 20-plus days a year. Price that against a subscription before you conclude that build is cheaper. Sometimes it genuinely is, especially at very high volume. Often it isn't once you count honestly.

When each choice tends to win

A couple of concrete patterns, since frameworks are easier to use with examples:

A five-person startup needs Postgres data in Snowflake for analytics next sprint. Buy. There's no world where hand-rolling a Kafka Connect pipeline is the right use of that team's month.
A large engineering org is already deep in Kafka, has an SRE function, and wants event streams feeding dozens of microservices with custom processing. Build. Debezium on Kafka Connect fits the existing competency, and the control is worth the operational cost.
A mid-size team needs the same change stream in a warehouse, an operational store, and a search index, and wants it reliable without becoming Kafka operators. This is where managed streaming platforms can make sense. Capture once, fan out, and don't run the brokers yourself.

Most teams aren't at the extremes. The framework is there to tell you which way you actually lean once you weigh the factors that matter to you.

Bottom line

Debezium and managed CDC platforms are not really competing on capture quality. They're offering you two different deals on the work that surrounds the capture. Building means you own that work in exchange for control and a lower license cost. Buying means you rent someone else's operations in exchange for speed and a predictable bill.

Neither is the smart choice in the abstract. The smart choice is the one that matches your team's skills, your data volume, and how much of your engineering attention you want CDC to consume. Run the framework, price the real total cost, and pick with your eyes open.

If you're on the build side, the Debezium docs are excellent and the community is genuinely helpful. If you're leaning toward buy, evaluate a few platforms against your actual sources and destinations before committing, because the right fit depends on your stack, not on anyone's feature grid.

If you're evaluating the managed side of this decision, Estuary is one option worth testing, especially if you need low-latency CDC, exactly-once semantics where supported, and the ability to reuse the same change stream across multiple destinations. The best way to evaluate it is the same way you should evaluate any managed CDC platform: try it against your real source, destination, schema changes, and expected data volume.

What did I miss? If you've run Debezium at scale or moved between build and buy, I'd like to hear how the tradeoffs actually played out for you.

Sources and further reading

Debezium Server, deployment and sinks: https://debezium.io/documentation/reference/stable/operations/debezium-server.html
Debezium exactly-once delivery: https://debezium.io/documentation/reference/stable/configuration/eos.html
Debezium Management Platform: https://debezium.io/documentation/reference/stable/operations/debezium-platform.html
Apache Kafka (KRaft replaces ZooKeeper in Kafka 4.x): https://kafka.apache.org
Estuary Dekaf, reading collections as Kafka topics: https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/

PostgreSQL to Snowflake: A Hands-On CDC Streaming Guide

Sourabh Gupta — Wed, 15 Apr 2026 10:49:37 +0000

Most PostgreSQL to Snowflake pipelines run on a schedule: extract every hour, load to S3, copy into Snowflake. That works for reporting. It fails the moment you need to track hard deletes, catch every intermediate row state, or reduce the lag between your operational database and your warehouse.

There are several ways to move data from PostgreSQL to Snowflake: manual CSV exports via COPY INTO, scheduled batch ETL with tools like Airflow, Snowflake's native Openflow connector (built on Apache NiFi), and managed CDC platforms like Fivetran, Airbyte, or Estuary. Each approach trades off differently across latency, delete handling, setup complexity, and cost.

This tutorial focuses on real-time Change data capture (CDC) using Estuary. CDC works by reading directly from PostgreSQL's write-ahead log (WAL). Every INSERT, UPDATE, and DELETE is captured the moment it's committed and streamed to Snowflake continuously: no polling, no missed deletes, no stale data windows. We use Snowpipe Streaming on the Snowflake side for the lowest-latency ingestion path available.

We'll use a demo Postgres instance in a Docker container to walk through logical replication configuration and to ensure a continuous stream of updates. If you want to evaluate the other approaches against what we build here, there's a comparison table at the end of the tutorial.

By the end of this tutorial, you'll have:

A running PostgreSQL CDC capture streaming live changes into Estuary collections
A Snowflake materialization receiving those changes in near real time
A working understanding of WAL-based replication, schema evolution, and Snowflake sync frequency tuning

Prerequisites: Docker, ngrok, a Snowflake account (trial is fine), and a free Estuary account.

💡 If you want to use an existing Postgres database rather than a demo Docker one, make sure to use PostgreSQL 10.0 or later. Logical replication was introduced in PG 10.

What You'll Build

Here's the architecture at a glance:

Source: A self-hosted PostgreSQL instance with logical replication enabled and a live data generator inserting rows continuously
Transport: Estuary captures the WAL stream, stores change events as JSON collections in cloud object storage
Destination: Estuary materializes those collections into Snowflake tables, with configurable sync frequency to control warehouse credits

What you'll need before starting:

Docker (Docker Compose V2 is bundled in modern Docker Desktop, no separate install needed)
ngrok (free tier is fine, though you’ll need to authenticate your account to use TCP) — to expose your local Postgres to Estuary's cloud connector
A Snowflake account (trial account works perfectly)
An Estuary account — sign up free at dashboard.estuary.dev/register

Step 1: Set Up the Source PostgreSQL Database

Spin Up Postgres + Data Generator with Docker Compose

Create a project directory and save the following as docker-compose.yml. It defines two services: a PostgreSQL container configured for logical replication, and a data generator that continuously writes realistic fake product records.

services:

  postgres:

    image: postgres:18

    container_name: postgres_cdc

    hostname: postgres_cdc

    restart: unless-stopped

    user: postgres

    environment:

      POSTGRES_USER: postgres

      POSTGRES_DB: postgres

      POSTGRES_PASSWORD: postgres

    command:

      - "postgres"

      - "-c"

      - "wal_level=logical"

    healthcheck:

      test: ["CMD-SHELL", "pg_isready -U postgres -d postgres"]

      interval: 5s

      timeout: 10s

      retries: 120

    volumes:

      - ./init.sql:/docker-entrypoint-initdb.d/init.sql

    ports:

      - "5432:5432"



  datagen:

    image: materialize/datagen

    container_name: datagen

    restart: unless-stopped

    environment:

      POSTGRES_HOST: postgres_cdc

      POSTGRES_PORT: 5432

      POSTGRES_DB: postgres

      POSTGRES_USER: postgres

      POSTGRES_PASSWORD: postgres

    entrypoint: "datagen -s /app/schemas/products.sql -n 10000 -f postgres -w 1000"

    depends_on:

      postgres:

        condition: service_healthy

    volumes:

      - ./schemas/products.sql:/app/schemas/products.sql

💡 The -w 1000 flag tells the generator to insert a new row every 1000ms. You can lower it to stress-test the pipeline, or raise it to keep things calm while you explore the UI.

Define the Data Schema

Create a schemas/ folder and save the following as schemas/products.sql. This defines the table schema that the data generator uses to create records:

-- schemas/products.sql

-- NOTE: COMMENT fields are datagen directives (Faker expressions),

-- not standard SQL. They are only used by the datagen tool, not executed against Postgres.

CREATE TABLE "public"."products" (

  "id"          INT         PRIMARY KEY,

  "name"        VARCHAR     COMMENT 'faker.internet.userName()',

  "merchant_id" INT NOT NULL COMMENT 'faker.datatype.number()',

  "price"       INT         COMMENT 'faker.datatype.number()',

  "status"      VARCHAR     COMMENT 'faker.datatype.boolean()',

  "created_at"  TIMESTAMP   DEFAULT now()

);

⚠️ The COMMENT directives in the schema file are Faker expressions consumed by the datagen tool — they are not executed against PostgreSQL. The actual table is created by init.sql below.

Configure PostgreSQL for CDC Replication

Save the following as init.sql in your project root. This runs automatically when the Postgres container starts and configures everything Estuary needs for log-based CDC:

-- init.sql



-- 1. Create a dedicated replication user (use a strong password in production)

CREATE USER flow_capture WITH PASSWORD 'secret' REPLICATION;



-- 2. Grant read access across all tables

GRANT pg_read_all_data TO flow_capture;



-- 3. Create the products table (the actual table, not the datagen schema)

CREATE TABLE IF NOT EXISTS public.products (

  id        INT PRIMARY KEY,

  name      VARCHAR,

  merchant_id INT NOT NULL,

  price     INT,

  status    VARCHAR,

  created_at  TIMESTAMP DEFAULT now()

);



-- 4. Create the watermarks table

--    Estuary writes small amounts of metadata here to ensure

--    accurate backfill sequencing. Not needed in read-only mode.

CREATE TABLE IF NOT EXISTS public.flow_watermarks (

  slot      TEXT PRIMARY KEY,

  watermark TEXT

);

GRANT ALL PRIVILEGES ON TABLE public.flow_watermarks TO flow_capture;



-- 5. Create and configure the replication publication

CREATE PUBLICATION flow_publication;

ALTER PUBLICATION flow_publication SET (publish_via_partition_root = true);

ALTER PUBLICATION flow_publication ADD TABLE

  public.flow_watermarks,

  public.products;

A few notes on what each block does:

Object	Purpose
flow_capture user	Dedicated replication user with REPLICATION attribute. Isolated from your app user for security.
pg_read_all_data	PostgreSQL built-in role granting SELECT on all tables.
flow_watermarks	A small scratch table Estuary writes to during backfill to track replication progress accurately.
flow_publication	Defines which tables' WAL events are exposed for replication. publish_via_partition_root is recommended for partitioned tables.

📌 Using an existing database that runs PostgreSQL v13 or earlier? Replace GRANT pg_read_all_data with per-schema grants: GRANT SELECT ON ALL TABLES IN SCHEMA public TO flow_capture; ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO flow_capture;

Start the Containers

Run the following from your project directory:

docker compose up -d

After a few seconds, verify both containers are healthy:

docker compose ps

You should see postgres_cdc and datagen both in a running state. Confirm rows are accumulating:

# Connect and check row counts a few seconds apart

psql -h localhost -p 5432 -U postgres -d postgres -c "SELECT count(*) FROM products;"

Expose the Database via ngrok

Estuary is a fully managed cloud service. Since your Postgres instance is running locally, you need to expose port 5432 to the internet. ngrok makes this a one-liner:

ngrok tcp 5432

Note the Forwarding address from the output — it will look like 5.tcp.ngrok.io:12345. You'll use the host and port as the connection details in Estuary.

⚠️ Keep the ngrok process running throughout the tutorial. Closing it breaks the connection between Estuary and your database.

Step 2: Create a Capture in Estuary

Log in to the Estuary dashboard and navigate to Captures in the left sidebar. Click + New Capture and select the PostgreSQL connector.

Configure the Connection

Fill in the connection details using your ngrok forwarding address:

Field	Value
Address	Your ngrok host and port (e.g. 5.tcp.ngrok.io:12345)
Database	postgres
User	flow_capture
Password	secret (or whatever you set in init.sql)

Click Next. Estuary will connect to your database and auto-discover the tables available in the publication.

Review Collections and Schema Evolution Settings

In the next section, you'll see the discovered tables as Estuary collections. A collection is a real-time data lake of JSON documents backed by cloud object storage (S3, GCS, or Azure Blob). Because it's object storage, data is retained indefinitely and can always be replayed from the beginning.

Estuary infers the JSON schema automatically from your Postgres table definition. You can inspect and override it here if needed.

You'll also see three schema evolution toggles, all enabled by default:

• Automatically keep schemas up to date: Estuary propagates upstream schema changes (new columns, type changes) through to downstream materializations automatically.

• Automatically add new source tables: Any new table added to your publication is picked up and captured without reconfiguration.

• Breaking changes re-version collections: If a change is incompatible (e.g. a column is dropped or renamed), Estuary creates a new versioned collection rather than corrupting existing data.

For this tutorial, leave all defaults and click Next, then Save and Publish. The connector will start and immediately begin backfilling existing rows while simultaneously streaming new changes from the WAL.

✅ Estuary captures both the historical backfill and the live change stream in a single consistent operation. There is no risk of missing events that happened during the initial load.

Step 3: Set Up the Snowflake Materialization

Prepare Snowflake

Open your Snowflake console and create a new SQL worksheet. Paste and run the following setup script. It creates a dedicated role, warehouse, database, schema, and user for Estuary:

-- Snowflake Setup Script for Estuary

-- Run all statements (use "Run All" from the dropdown next to the Run button)



SET database_name   = 'ESTUARY_DB';

SET warehouse_name   = 'ESTUARY_WH';

SET estuary_role    = 'ESTUARY_ROLE';

SET estuary_user    = 'ESTUARY_USER';

SET estuary_schema   = 'ESTUARY_SCHEMA';



-- Role and schema

CREATE ROLE IF NOT EXISTS IDENTIFIER($estuary_role);

GRANT ROLE IDENTIFIER($estuary_role) TO ROLE SYSADMIN;



-- Database

CREATE DATABASE IF NOT EXISTS IDENTIFIER($database_name);

USE DATABASE IDENTIFIER($database_name);

CREATE SCHEMA IF NOT EXISTS IDENTIFIER($estuary_schema);



-- User

CREATE USER IF NOT EXISTS IDENTIFIER($estuary_user)

  DEFAULT_ROLE  = $estuary_role

  DEFAULT_WAREHOUSE = $warehouse_name;

GRANT ROLE IDENTIFIER($estuary_role) TO USER IDENTIFIER($estuary_user);

GRANT ALL ON SCHEMA IDENTIFIER($estuary_schema) TO IDENTIFIER($estuary_role);



-- Warehouse (XS, auto-suspend after 60s to minimize credit usage)

CREATE WAREHOUSE IF NOT EXISTS IDENTIFIER($warehouse_name)

  WAREHOUSE_SIZE = XSMALL

  WAREHOUSE_TYPE = STANDARD

  AUTO_SUSPEND   = 60

  AUTO_RESUME   = TRUE

  INITIALLY_SUSPENDED = TRUE;

GRANT USAGE ON WAREHOUSE IDENTIFIER($warehouse_name) TO ROLE IDENTIFIER($estuary_role);



-- Database access

GRANT CREATE SCHEMA, MONITOR, USAGE

  ON DATABASE IDENTIFIER($database_name)

  TO ROLE IDENTIFIER($estuary_role);



-- Required for Snowflake on GCP only

USE ROLE ACCOUNTADMIN;

GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE IDENTIFIER($estuary_role);

USE ROLE SYSADMIN;



COMMIT;

💡 The AUTO_SUSPEND = 60 setting suspends the warehouse after 60 seconds of inactivity. Combined with the Sync Frequency setting below, this is the primary lever for controlling Snowflake compute costs.

You will then need to configure JWT authentication for your Snowflake user. You can generate a private-public key pair from the terminal using:

# generate a private key

openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt

# generate a public key

openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

Copy the public key to the clipboard using cat rsa_key.pub.

In Snowflake, you can then associate this key with your user. Run:

ALTER USER identifier($estuary_user) SET RSA_PUBLIC_KEY='<value you copied>'

Create the Materialization

Back in Estuary, navigate to Destinations and click + New Materialization.

Select Snowflake and fill in the connection details using the values from the script above.

Field	Value
Host	Your Snowflake host URL (.snowflakecomputing.com)
Database	ESTUARY_DB
Schema	ESTUARY_SCHEMA
Warehouse	ESTUARY_WH
Timestamp Type	Choose how timestamp columns should be stored, such as using `TIMESTAMP_LTZ`
User	ESTUARY_USER
Private Key	The private key from the key-pair you generated

Configure Sync Frequency (Cost Optimization)

Scroll down to Advanced Options. The Sync Frequency parameter controls how long Estuary waits before flushing the latest data into Snowflake. This is a critical cost lever:

Sync Frequency	Warehouse Behavior	Best For
0 seconds	Continuous	Real-time dashboards, operational use cases
30 minutes (default)	Wakes every 30 min, ~1 credit/hr on XS	Near-real-time analytics
2 hours	Wakes every 2 hrs, suspends in between	Daily reporting, cost-conscious setups
24 hours	Once daily flush	Archival, low-priority data

Set the frequency that matches your latency requirements.

To test out real-time streaming, for example, go ahead and set the sync frequency to 0s. This will push data as soon as it’s available.

These types of continuous updates are best suited to ingestion via Snowpipe Streaming, Snowflake’s lowest-latency ingestion option. Otherwise, continuous updates can negatively affect costs, as your warehouse may not be able to suspend activity between data loads. We’ll make sure we’re using Snowpipe Streaming in the next section.

Link Your Capture

In the “Source Collections” section, toggle on “Delta Updates” to be the default setting for all newly added collections.

💡 Delta updates is an append-only materialization mode that cleanly fits with Snowpipe Streaming’s low latency streaming capabilities. All bindings in your Snowflake materialization that use delta updates automatically ingest data via Snowpipe Streaming on the back end.

Click Source from Capture and select the PostgreSQL capture you created in Step 2. Estuary will automatically link the collections and propose a mapping to Snowflake tables. You can also toggle delta updates mode individually for each binding here.

Click Next, then Save and Publish. Watch the deployment logs — a successful publish will show the connector handshake completing and the first materialization batch starting.

Verify the Pipeline

Once the materialization is deployed, open Snowflake and run:

USE DATABASE ESTUARY_DB;

USE SCHEMA ESTUARY_SCHEMA;



-- Check that the table exists and has data

SELECT count(*) FROM products;



-- Inspect a sample of rows

SELECT * FROM products LIMIT 10;



-- Watch rows arriving in real time

SELECT count(*) FROM products; -- run this repeatedly

You should see the row count climbing as the datagen service continues writing to Postgres. The Estuary dashboard's collection view shows the change stream in real time; the Snowflake side reflects it after the next sync interval.

🔬 Want to verify a full end-to-end delete? Run DELETE FROM products WHERE id = 1; against Postgres and watch Estuary propagate the delete to Snowflake after the next sync. Batch pipelines miss these entirely — CDC captures every one.

What Happens When Your Schema Changes?

One of the reasons teams avoid CDC is fear of schema drift breaking the pipeline. Estuary handles this automatically. Here's what happens in practice:

• New column added: Estuary detects it via schema auto-discovery, updates the collection schema, and adds the column to the Snowflake table. Existing rows show NULL for the new column.

• Column type widened (e.g. INT to BIGINT): Handled automatically. Estuary updates the schema without interruption.

• Breaking change (column dropped, primary key changed): Estuary creates a new versioned collection (e.g. products_v2) rather than corrupting the existing one. You can choose when to cut over.

All three toggles on the capture's Collections tab control this behavior. The defaults are safe for most teams.

Cleanup

Postgres and ngrok

# Stop and remove containers

docker compose down -v



# Terminate ngrok (Ctrl+C in the ngrok terminal)

Snowflake

-- Run in Snowflake to remove all resources created by this tutorial

REVOKE ALL PRIVILEGES ON SCHEMA ESTUARY_SCHEMA FROM ROLE ESTUARY_ROLE;

REVOKE ALL PRIVILEGES ON DATABASE ESTUARY_DB FROM ROLE ESTUARY_ROLE;

DROP WAREHOUSE IF EXISTS ESTUARY_WH;

DROP USER IF EXISTS ESTUARY_USER;

DROP ROLE IF EXISTS ESTUARY_ROLE;

DROP SCHEMA IF EXISTS ESTUARY_SCHEMA;

DROP DATABASE IF EXISTS ESTUARY_DB;

Estuary

From the Estuary dashboard, disable or delete the capture and materialization you created.

How Does This Compare to Other Approaches?

The CDC pipeline you just built is one of several ways to get data from PostgreSQL into Snowflake. Here's how the major methods compare:

Method	Latency	Captures Deletes	Schema Evolution	Setup Complexity	Best For
Manual CSV/S3 + COPY INTO	Hours to days (scheduled)	No. Requires full reload or custom tracking.	Manual. You manage column changes yourself.	Low initial, high ongoing maintenance.	One-time migrations or very small datasets.
Batch ETL (Airflow, custom scripts)	Minutes to hours (schedule-dependent)	Only with soft-delete flags or full reloads.	Manual. DAG changes required per schema change.	Medium. Requires Airflow infrastructure and DAG maintenance.	Scheduled reporting where freshness is not critical.
Snowflake Openflow (PostgreSQL connector)	Near real-time (CDC via WAL).	Yes. Captures INSERT, UPDATE, DELETE via logical replication.	Tracks DDL changes and replicates downstream.	Medium-high. Requires Openflow runtime setup (Snowflake Deployment or BYOC), Apache NiFi familiarity helps. Enterprise account required.	Teams standardized on Snowflake wanting a native, single-vendor CDC path.
Fivetran	1 min to hours (plan-dependent). 1-min sync requires Enterprise or Business Critical plan.	Yes, with logical replication. Query-Based mode may miss short-lived rows.	Automatic schema migration, though some edge cases (e.g., ADD COLUMN SET DEFAULT) require manual re-sync.	Low. Fully managed, point-and-click setup.	Teams wanting managed simplicity and a large connector catalog, with budget for consumption-based pricing.
Estuary	Sub-second to configurable intervals. Snowpipe Streaming enabled by default for delta updates.	Yes. Captures every INSERT, UPDATE, DELETE from the WAL, including hard deletes.	Automatic. New columns propagate, type widenings handled, breaking changes create versioned collections.	Low. Web UI setup, no infrastructure to manage.	Real-time analytics, operational use cases, teams needing low-latency CDC with Snowpipe Streaming and cost control via sync frequency.

A few things worth noting:

Snowflake's Openflow connector is relatively new. It replaces the older Snowflake Connector for PostgreSQL (now in maintenance mode). Openflow is built on Apache NiFi and uses Snowpipe Streaming for ingestion. It's a strong option if you want everything inside Snowflake's ecosystem, but it requires an Enterprise account and Openflow runtime configuration.
Fivetran's 1-minute sync frequency is only available on Enterprise and Business Critical plans. On lower tiers, the minimum is 5 minutes. Their PostgreSQL connector supports both logical replication and a Query-Based fallback, but Query-Based mode can miss rows that exist for less than the sync interval.
Estuary's sync frequency is configurable from 0 seconds (continuous via Snowpipe Streaming) to 24 hours, on any plan. The tutorial above uses 0s for demonstration, but production deployments typically use 30 minutes or longer to control Snowflake warehouse credits.
The manual CSV/S3 approach is essentially free in tooling costs but scales poorly. It's viable for a one-time migration but not for ongoing replication.

Where to Go From Here

You now have a working CDC pipeline from Postgres to Snowflake. A few directions worth exploring next:

Connect your real database: Replace the Docker Postgres with your production instance. Estuary supports RDS, Aurora, Cloud SQL, Azure Database for PostgreSQL, Supabase, and Neon out of the box.
Add more tables: Modify the ALTER PUBLICATION flow_publication ADD TABLE statement and Estuary will auto-discover the new bindings.
Try read-only mode: If you're connecting to a replica or a database where you can't create the watermarks table, Estuary supports a read-only capture mode. No schema modifications required on the source.
Explore transformations: Estuary supports inline derivations using TypeScript, Python, or SQL. You can reshape, filter, or join collections before they land in Snowflake.
Add other destinations: The same Postgres capture can fan out to BigQuery, Redshift, Databricks, or Kafka simultaneously. No duplicate infrastructure.

Check the full Estuary documentation for connector-specific setup guides, and join the Estuary Slack community if you run into questions

2x Faster MongoDB CDC: An Engineering Deep-Dive on Performance Optimization

Sourabh Gupta — Tue, 31 Mar 2026 12:29:53 +0000

Estuary’s focus on in-house crafted connectors isn’t an accident.

It’s not about keeping secrets; we’re not a black box factory and connector source code is publicly available for anyone to review. It’s about maintaining the responsibility of ownership, starting with a high-quality base product, and refining from there.

Integrations are specifically designed to work seamlessly with Estuary, providing standard customization options and converting data to standard formats with as little waste as possible. And connectors get continuous updates to keep up with API changes or finetune performance.

Our MongoDB capture connector recently received one of these upgrades: while the connector reliably got the job done, it could fall behind in high-volume enterprise use cases. This could be especially detrimental for real-time pipelines that counted on the connector’s functionality with MongoDB’s change streams—if the connector couldn’t keep up with the data coming in, downstream systems could experience delays.

For real-time native applications, even small slowdowns have an outsized impact. Consider the route change notification for a shipment that arrives just after a driver misses the turnoff. Or a triage system that doesn't capture the latest developments in its priority calculations.

It was definitely time for some optimization work.

On the case was Mahdi Dibaiee. Based in Dublin, Ireland when not on adventures around the world, Mahdi has been a Senior Software Engineer with Estuary for nearly four years. Having worked on data planes, Estuary’s flowctl CLI, and various connectors, his deep knowledge of the platform lets him flexibly pick up whatever tasks have current top priority.

This is a behind-the-scenes look at how he analyzed the existing implementation’s limitations, researched solutions, and ended up with double the speed.

The Problem with Small Documents

“Make this integration faster,” while a laudable goal, isn’t much to go on. Why were captures falling behind? What was the expected throughput rate? And how could we find specific areas to improve?

First, start with a baseline.

The MongoDB capture connector tended towards a throughput rate of 34 MB/s when working with standard-sized documents, such as those around 20 KB apiece.

To test how the connector would react under different circumstances, Mahdi tried it out against a stream of much smaller documents, each around 250 bytes.

Something concerning happened when the connector processed these small documents. The capture’s ingestion rate dropped down to a meager 6 MB/s. While it would be unlikely to find this “tiny document” use case in the wild, 6 MB/s was still far too slow.

It also uncovered a possible path forward.

“This told us that we had a large overhead-per-document,” Mahdi explained, which resulted in the abysmal slowdown.

Essentially, all document processing would include some overhead. Changing the size of processed documents acted as a lever to quickly check just how much the overhead impacted performance: smaller documents with the same amount of overhead per document led to more overall time spent on the overhead rather than on making progress.

If he could find ways to reduce that overhead, all pipelines should speed up, not just ones with tiny documents.

But where exactly did that overhead come from? To tune the MongoDB capture’s performance, some digging would be required.

The Reason Behind the Bottleneck

To get a picture of the systems involved, Mahdi profiled a particular MongoDB capture that was struggling to keep up with its load.

First up was to rule out a couple obvious answers. He checked CPU load and memory pressure on both MongoDB’s side and the capture connector’s side. Neither indicated any issues.

Next, Mahdi wanted to see where Estuary spent the most time when ingesting data from MongoDB. He set up a detailed tracing view, dividing up the time for each data fetch and marking out network and CPU activity.

The trace exposed two areas of note: one a suspiciously empty space, and one a suspiciously long process, both related to the connector call to get more documents. In total, this caused Estuary to spend around two seconds on each batch of fetched data, which isn’t quite the millisecond latency Estuary aims for.

So, what was actually happening?

Activity trace for a MongoDB capture. ~2 seconds is highlighted, showing a noticeable gap in CPU usage before a string of activity.

600ms at the beginning of this cycle corresponded to the data fetch itself. When one batch of data finished processing, the connector sent out a request over the network for more, then started working on the new batch once it arrived.

Because of this synchronous mode of operation, the connector essentially sat around waiting for half a second each time it wanted to check for new data. When working with an end-to-end real-time system, those milliseconds in the pipeline add up. Not to mention the cumulative CPU idle time when the CPU’s doing nothing much for a full quarter of the connector’s process.

There, then, was an obvious bottleneck, but the activity following the fetch was also curious. The remaining 1.4 seconds in the cycle were spent processing documents.

By itself, emitting documents and checkpoints to Estuary shouldn’t take that long. But there was one more step in the processing phase that might: decoding MongoDB’s BSON documents in the first place.

With the possibility of optimizing document processing in the mix, there were two routes forward, two avenues to improve the connector’s performance.

Why not implement both?

From Go to Rust: An Expedient Solution

The CPU’s idle time was perhaps the more straightforward fix. Mahdi immediately identified that making the connector slightly more asynchronous would keep the CPU busy and shave those 600ms off of each batch.

To do so, he modified Estuary’s MongoDB connector to pre-fetch the next batch while still processing the first. In order to preserve both ordering and load on memory, he limited the number of fetched batches to four. With a maximum of 16 MB for each MongoDB cursor batch, this would keep the connector’s memory consumption to 64 MB.

This change alone would provide a welcome performance boost, but there was still the unsatisfyingly slow document processing time to contend with. And it was a trickier problem.

To standardize data coming from and going to a variety of different systems using a variety of different document formats and data types, Estuary translates everything to JSON as an intermediary. This makes it simple to mix and match data sources and destinations, or plug in a new connector: each connector only needs to handle its specific system and translation to or from the shared language.

Estuary translates MongoDB’s BSON documents to JSON so as to then easily translate the data to any destination format.

MongoDB documents come in BSON, or Binary JSON. This modified version of JSON generally makes for efficient storage and retrieval. It also includes a handful of additional data types, such as datetime and more specific numeric types.

This sounds like it would make for a reasonably simple conversion, but Mahdi found that Estuary’s MongoDB connector spent a lot of time decoding documents with Go’s bson package. On reflection, perhaps this wasn’t much of a surprise. Go’s reflect package, which infers data types that aren’t already known, is notoriously slow and the bson package relied heavily on reflect.

Looking for alternatives, he first performed some benchmarks on Rust’s corresponding bson crate. The results were demonstrable: Rust’s version was 2x faster than Go.

Mahdi’s meticulous research also uncovered another option. Rust’s most popular serialization/deserialization crate, serde, has a serde-transcode plugin crate. This transcoder can convert documents from one format to another without any intermediary layer, cutting down on unnecessary processing steps. With this, the BSON to JSON conversion could be 3x faster than the existing Go implementation.

serde couldn’t simply be swapped in as-is. Mahdi wrapped the out-of-the-box serializer in custom logic, extending the JSON conversion and sanitizing the data. The resulting implementation fit Estuary’s specific needs while retaining the 3x performance boost.

These changes would address both bottlenecks and refurbish the MongoDB capture connector.

End Result: Supercharged MongoDB Captures

One question remained: would these improvements hold up across various scenarios? Thorough testing commenced.

Mahdi started where it all began: the tiny documents scenario. He ran the MongoDB connector on a stream of small 250-byte documents, first using the main version before switching to use the improved branch. The measly ~6 MB/s throughput rate rose to around 17.5 MB/s, tripling throughput for the small documents use case.

Mahdi graphs out throughput results for the MongoDB connector, first using the original Go implementation, followed by the Rust transcoder.

Of course, this scenario was only ever meant as a test and example, a way to define how much overhead we were seeing as the connector processed documents.

Mahdi therefore reran the test, this time using 20 KB documents, a more standard size. The original 34 MB/s rate jumped to 57 MB/s, almost doubling throughput.

The difference when using larger documents is still substantial, even if less pronounced.

This rate was much more reasonable, allowing for around 200 GB of data ingestion per hour and ensuring the Estuary connector could keep up with higher volume use cases.

What this means in practical terms is that:

Huge initial databases would get backfilled in half the time
The platform would be able to handle twice as much data in continuous CDC mode
Which also means spikes in activity would be more easily handled: instead of choking performance, real-time events would stay real-time

After review and approval, Mahdi rolled out the changes to a select set of users first so he could closely monitor affected pipelines. He would be ready to quickly revert or revise as needed if any problems arose.

With so many use cases and interactions, one minor issue did raise its head: Rust and Go handled invalid UTF-8 characters differently. With a little more customization, Mahdi updated the connector’s leniency on invalid characters to mimic the former behavior.

Other than that, the rollout was smooth sailing, with capture throughput ticking upwards across the board.

So if you recently noticed your MongoDB capture speeding up: now you know.

What’s Next?

While 200 GB an hour is a decent clip, Mahdi noted that there is still room for further improvement. The main issue now is that the connector is relatively CPU-bound. And, after all, efficiency is one of those goals that doesn’t have a specific end.

For now, though, there are new challenges to face.

To test out the capture connector’s speed yourself, try it out in Estuary. Or set up a call to discuss how the connector could fit into your particular use case.

Or if you’re simply interested in switching to Rust for faster BSON decoding in your own code, check out Mahdi’s repo on benchmarking Rust and Go or his work in Estuary’s source code.

8 Key BYOC Deployment Options Every Data Engineer Should Know

Sourabh Gupta — Wed, 18 Mar 2026 07:35:38 +0000

Bring Your Own Cloud (BYOC) means running a vendor's managed software directly inside your own cloud account, keeping data, access controls, and billing firmly in your hands. For data teams, BYOC occupies the middle ground between fully managed SaaS and self-hosted deployments: vendors operate or orchestrate the software while your VPC, IAM policies, and storage define the security boundary. The result is stronger compliance posture, better cost governance, and tighter integration with existing infrastructure.

The eight patterns below are not products. They are architectural categories. Real-world deployments frequently blend two or more of them. Each section defines the pattern precisely, shows how leading vendors implement it today, and lays out the trade-offs that matter for architecture, security, and total cost of ownership.

The 8 BYOC Deployment Patterns at a Glance

Pattern	One-line definition	Best for	Key trade-off
Cloud-Provider-Specific	Vendor stack in a single CSP account	AWS- or Azure-first orgs	Cloud and vendor lock-in
Managed In-Your-Account	Vendor operates service inside your VPC	Low ops burden, full data control	Higher service fees
Self-Managed	You install, run, and maintain the stack	Max control, regulated industries	Full ops burden
Zero-Access / Zero-Trust	No inbound vendor access, outbound-only	High-assurance compliance environments	Slower support triage
Split Control / Data Plane	Vendor control plane + your data plane	Sovereignty with SaaS-like UX	Complex cross-plane auth
Open-Format Storage	Writes to your object store in open formats	Retention, cost, and egress control	Performance tuning required
Kubernetes-Centric	Vendor workloads run in your K8s cluster	Teams standardised on Kubernetes	K8s operational complexity
Lightweight / Serverless	Docker, SSH, or functions in your infra	Fast start, small teams, edge	Fewer enterprise guardrails

1. Cloud-Provider-Specific BYOC

Definition: The vendor deploys and manages their software inside a single cloud provider's account, using that provider's native services end-to-end.

In this pattern, the vendor tightly couples their stack to one cloud provider, such as AWS, and leverages native compute, networking, and identity primitives rather than building cloud-agnostic abstractions. The result is deep IAM alignment, native private networking, and a familiar operational surface for teams already standardised on that provider. Portability to other clouds is limited by design.

A well-documented example is Flightcontrol, which deploys application workloads to customers' own AWS accounts using Amazon ECS with either Fargate or EC2 launch types rather than Kubernetes. Fargate is the default path (serverless compute, no node management), while ECS with EC2 is available for teams that need GPU support, Reserved Instance pricing, or custom instance types. All builds run in the customer's AWS account via AWS CodeBuild, so build artifacts never leave the customer's environment, and secrets are stored in AWS Parameter Store or Secrets Manager encrypted under customer-managed KMS keys.

What This Looks Like in Practice

IAM roles, VPC subnets, security groups, and private endpoints are all CSP-native constructs.
Logging and metrics flow directly into CloudWatch, Azure Monitor, or Cloud Logging without an additional agent.
Reserved Instances, Savings Plans, and Committed Use Discounts apply because compute runs in the customer's billing account.
Flightcontrol stores secrets in AWS Parameter Store or Secrets Manager using the customer's KMS keys, not the vendor's.

Strategic Trade-Offs

Strong security posture: cloud-native policies, SCP guardrails, and private networking all apply natively.
Cloud lock-in is real: the architecture is not portable to a second provider without significant re-engineering.
Multi-cloud strategies are not supported; teams on Azure or GCP need a different vendor or model.

2. Managed BYOC Inside Your Cloud Account

Definition: The vendor deploys, operates, and upgrades their service inside your cloud account, while your organization retains ownership of data, encryption keys, and billing.

This is the most common commercial BYOC model. The customer grants the vendor cross-account IAM permissions scoped to the minimum needed to provision and manage infrastructure. The vendor handles day-2 operations including upgrades, scaling, and incident response, while all data remains in the customer's VPC. The customer keeps their CSP discounts and reserved capacity, and no data traverses the vendor's network.

Estuary is a real-time data integration platform built specifically for the data movement problem that makes BYOC relevant in the first place: moving data from operational databases, SaaS applications, and event streams into warehouses, lakes, and AI systems without copying it through a vendor's infrastructure. Estuary offers its managed BYOC model as Private Deployment. A private data plane runs entirely within the customer's VPC on AWS, GCP, or Azure. Only metadata flows to Estuary's control plane over AWS PrivateLink or equivalent private connectivity, so it never crosses the public internet. Estuary manages connector updates, pipeline orchestration, and uptime while the customer's IAM, KMS keys, and VPC peering configurations remain authoritative.

For data teams specifically, Estuary's private deployment covers 200+ connectors for CDC, streaming, and batch across databases, SaaS, and warehouses. Pipelines deliver sub-100ms end-to-end latency with exactly-once delivery guarantees, and automatic schema evolution means pipelines do not break when upstream schemas change. The platform is SOC 2 Type II certified and HIPAA-compliant, and it is designed for GDPR and data residency environments. It is distinct from Estuary's full BYOC option, in which the customer also owns the underlying cloud account and billing.

ClickHouse BYOC on AWS (GA as of February 2025) follows the same principle. The data plane, consisting of EKS clusters, Amazon S3 storage, and ClickHouse nodes, runs in the customer's AWS VPC. The ClickHouse control plane communicates with the customer's BYOC VPC over HTTPS port 443 for orchestration operations only. All data, logs, and metrics remain in the customer's VPC, with only critical telemetry crossing to the vendor for health monitoring. ClickHouse engineers can access system-level diagnostics only through a time-bound, audited approval workflow; they never have direct access to customer data.

What This Looks Like in Practice

Vendor provisions resources into your account using scoped cross-account IAM roles.
Your KMS keys encrypt data at rest; your VPC peering or PrivateLink rules govern all network paths.
Cloud billing flows to your account so reserved capacity and committed use discounts apply.
Vendor SRE teams manage upgrades and handle incidents without requiring persistent inbound access.

Strategic Trade-Offs

Lower operational burden than self-managed; faster time to value for data engineering teams.
Shared responsibility boundary must be documented clearly, especially for incident response.
Service fees are higher than self-managed because the vendor absorbs operational overhead.
ClickHouse BYOC on AWS does not publish a formal uptime SLA because the data plane runs on customer-owned resources; fully managed SaaS deployments carry a published SLA.

3. Self-Managed Vendor Software in Your Cloud

Definition: Your team installs, configures, and maintains the vendor's software end-to-end, taking full ownership of patching, scaling, HA/DR, and security hardening.

Self-managed BYOC is the highest-control option. The vendor distributes their software as binaries, container images, Helm charts, or Terraform modules, and the customer's platform engineering team handles the full operational lifecycle. This model is common among organisations with strict air-gap or no-internet requirements, teams that need deep customisation of configuration and network topology, and regulated enterprises where vendor access to infrastructure is contractually prohibited.

The trade-off is full operational ownership. Day-2 operations, including version upgrades, rolling restarts, capacity planning, certificate rotation, and disaster recovery runbooks, are entirely the customer's responsibility. Teams without mature SRE practices typically find this model more expensive in total than managed alternatives once engineering time is factored in.

What This Looks Like in Practice

Vendor distributes software via Helm charts, Terraform modules, container images, or RPM/deb packages.
Customer manages topology, replication factors, network zones, and storage backends.
Full integration with existing tooling: Terraform for provisioning, HashiCorp Vault for secrets, Prometheus and Grafana for observability.
Customer owns versioning strategy, blue/green deployments, and rollback procedures.

Strategic Trade-Offs

Maximum security control: no external party has any access to infrastructure or data.
Full operational burden for upgrades, scaling events, and reliability incidents.
Longer lead times for new features: customer must upgrade on their own schedule.
BYOC is the recommended middle ground for teams that want vendor-managed operations without giving up data sovereignty; self-managed is for cases where even vendor orchestration access is not permitted.

4. Zero-Access / Zero-Trust BYOC Models

Definition: The vendor holds no persistent inbound access or stored credentials to your infrastructure. All control-plane communication is outbound-only from the customer's environment, using short-lived, scoped tokens.

Zero-trust BYOC is an architectural constraint layered on top of any of the other patterns. The key principle is that the vendor's software, once deployed, operates autonomously and initiates all communication outward to the vendor's control plane. The vendor cannot SSH into customer nodes, cannot open inbound connections, and holds no long-lived secrets in their own systems that could be used to access customer infrastructure.

Redpanda's BYOC architecture is a widely cited example. A single Go binary agent is injected with a unique token at provisioning time and connects outbound to cloud.redpanda.com for lifecycle management. Customers can block that connection with a single firewall rule and all application traffic continues uninterrupted, because the data plane has no external runtime dependencies. Redpanda calls this data plane atomicity: the cluster runs fully independently of the control plane once provisioned, and control plane unavailability can only delay version upgrades, not disrupt running workloads.

ClickHouse's BYOC also uses an outbound-only channel for management traffic. Control-plane connectivity from the ClickHouse VPC to the customer's BYOC VPC is provided over a Tailscale connection that is outbound-only from the customer's BYOC VPC. ClickHouse engineers must request time-bound, audited access through an internal approval system; they can only reach system tables and infrastructure components, never customer data.

Confluent's BYOC approach (built on the WarpStream architecture acquired in September 2024) takes a different angle: WarpStream is designed entirely on top of object storage. The stateless brokers in the customer's VPC store no data locally; all records are written directly to the customer's Amazon S3 bucket. Because the brokers are stateless, the control plane has nothing to access even if a connection were established. The trade-off is higher write latency compared to traditional Kafka deployments, which makes WarpStream best suited for high-volume, latency-tolerant workloads such as logging and data lake ingestion.

What This Looks Like in Practice

Outbound-only control channels: no vendor VPNs, no inbound SSH jump hosts, no persistent credentials in vendor systems.
Ephemeral authentication tokens and short-lived certificates for all management operations.
Vendors can be blocked at the firewall with no impact on running workloads (if data plane atomicity is implemented).
Aligns with NIST SP 800-207 zero-trust architecture principles and passes most enterprise security reviews.

Strategic Trade-Offs

Excellent data isolation: vendor compromise cannot cascade into customer infrastructure.
Support triage requires the customer to run diagnostic tooling and share sanitised outputs; live debugging by the vendor is not possible.
Upgrades and configuration changes need more coordination and may require customer-side approval workflows.
WarpStream-style object-storage-backed BYOC introduces additional write latency (typically tens of milliseconds) versus broker-local storage.

5. Control-Plane and Data-Plane Separation

Definition: Orchestration, metering, and management (the control plane) remain vendor-operated, while compute and storage that process actual data (the data plane) run inside your cloud account.

Control-plane and data-plane separation is the architectural backbone of most modern BYOC offerings. The control plane manages cluster lifecycle, provisioning, version upgrades, RBAC, billing, and health monitoring. It does not touch or store customer data. The data plane executes queries, processes records, and persists data, and it runs entirely within the customer's VPC.

This separation achieves two goals simultaneously. First, the vendor can deliver a consistent, SaaS-quality experience: one-click upgrades, a unified dashboard, and central fleet management work the same way regardless of which cloud the data plane lives in. Second, the customer retains full data sovereignty: encryption keys, network policies, and storage bucket ACLs are all customer-controlled, and data never leaves the customer's perimeter.

ClickHouse Cloud BYOC on AWS clearly documents this split in its architecture reference. The control plane, hosted in the ClickHouse VPC, runs the Cloud Console, authentication and user management, APIs, and billing. The data plane, running in the customer's VPC on an EKS cluster, handles all ClickHouse nodes, Amazon S3 storage, EBS-backed logs, and Prometheus/Thanos metrics. Control-plane-to-data-plane traffic is limited to HTTPS on port 443 for orchestration commands and critical telemetry for health monitoring. Query traffic never touches the control plane.

Estuary applies this architecture across all three of its deployment modes: Public, Private Deployment, and BYOC. The Estuary control plane manages connector configuration, pipeline scheduling, and change data capture orchestration. The data plane runs captures (sources), derivations (transformations), and materializations (destinations) inside the customer's VPC. All pipeline data is stored as reusable collections in the customer's own cloud storage, not Estuary's. Only pipeline metadata and health signals cross to the control plane via PrivateLink. A key practical benefit for data teams is that the same Estuary control plane API, connectors, and pipeline specifications work identically whether the data plane is in Estuary's cloud or the customer's, so there is no lock-in to a deployment topology.

Union.ai's platform provides another illustrative example. The Union.ai control plane runs in the vendor's AWS account. The data plane runs in the customer's AWS or GCP account and is managed by a resident Union operator that communicates outbound to the control plane. The operator holds only the minimum permissions required: it can spin clusters up and down and provide access to system-level logs, but it does not have access to secrets or application data. All communication is initiated by the operator in the data plane, never the other way around.

What This Looks Like in Practice

Vendor-managed control plane provides cluster provisioning, RBAC, audit logs, and feature rollout.
Customer VPC hosts compute nodes, object storage, and all data at rest and in motion.
Control-plane traffic is strictly limited to orchestration commands and anonymised health telemetry.
Cross-account IAM roles are scoped to infrastructure management only, never to data access.

Strategic Trade-Offs

Delivers SaaS-like usability (one-click upgrades, central dashboard) with self-hosted data sovereignty.
Cross-plane identity and authentication design is complex and must be audited carefully.
Shared-responsibility boundaries for incidents need to be explicitly documented: who owns what when the data plane is degraded.
Control plane availability affects lifecycle operations (upgrades, scaling) but should not interrupt running workloads if the data plane has atomicity guarantees.

6. Open-Format Storage BYOC

Definition: The vendor's pipelines read and write raw and processed data to customer-owned object storage in open, vendor-neutral formats, separating compute from durable storage.

Open-format storage BYOC treats object storage, typically Amazon S3, Google Cloud Storage, or Azure Blob Storage, as the system of record, and keeps the vendor's compute layer entirely stateless. Data is written in open, interoperable formats such as Apache Parquet, Apache Iceberg, or Delta Lake. This means the customer can query data with any compatible engine, such as Apache Spark, Trino, DuckDB, or BigQuery Omni, without converting formats and without depending on the vendor's query layer to access their own data.

WarpStream's BYOC architecture (now part of Confluent) is the most prominent recent example in the data streaming space. WarpStream brokers are fully stateless: every record produced to a Kafka-compatible topic is written directly to the customer's Amazon S3 bucket before the produce acknowledgement is returned to the client. No data is stored on broker disk. Because the brokers hold no state, they can be terminated and restarted at any time without data loss, making autoscaling trivial. The customer owns the S3 bucket, the bucket policy, and the KMS key, which means they can audit, export, or delete data independently of the vendor.

The trade-off of routing every write through object storage is latency. Amazon S3 PUT operations typically add tens of milliseconds of latency compared to writing to a local disk or in-memory buffer. For high-volume, latency-tolerant workloads such as log aggregation, analytics ingestion, and data lake pipelines, this is acceptable. For low-latency streaming use cases requiring single-digit millisecond end-to-end latency, traditional broker-local storage is the better choice.

What This Looks Like in Practice

Vendor compute is stateless; all durable state lives in customer-owned Amazon S3, GCS, or Azure Blob buckets.
Data is written in Apache Parquet, Apache Iceberg, or Delta Lake format, enabling multi-engine access.
Customer controls bucket lifecycle policies, intelligent tiering, versioning, and cross-region replication independently of the vendor.
Object storage costs replace broker disk costs; at high volumes, object storage unit costs are significantly lower.

Strategic Trade-Offs

Write latency is higher than broker-local storage due to Amazon S3/GCS round-trip times (typically 10 to 50 ms additional latency).
Read performance for streaming consumers depends on object listing and GET operations; compaction and tiering strategies are needed at scale.
Compute and storage regions must be co-located to avoid high inter-region egress costs.
Vendor lock-in risk is significantly reduced: data is readable by any engine that supports the open format.

7. Kubernetes-Centric BYOC Deployments

Definition: Vendor software components are deployed as workloads in the customer's existing Kubernetes clusters, governed by standard K8s primitives such as namespaces, RBAC, NetworkPolicies, and Pod Security Standards.

Kubernetes-centric BYOC targets organisations that have already standardised on Kubernetes as their internal platform and want to apply uniform policy controls across all workloads, including vendor software. The vendor ships their components as Helm charts or Kubernetes Operators. The customer installs them into their own clusters, where existing GitOps pipelines, admission controllers, network policies, and service mesh configurations govern deployment.

Helm is the dominant packaging mechanism: as of 2024, approximately 75% of organisations use Helm to manage Kubernetes applications. Helm charts bundle Kubernetes manifests into versioned, configurable packages that can be installed, upgraded, and rolled back with single commands, making them well-suited for distributing vendor software that needs to run in arbitrary customer clusters.

Kubernetes Operators extend this model for stateful workloads. An Operator encodes domain-specific operational logic, such as automated failover, backup scheduling, rolling upgrades, and shard rebalancing, as a Kubernetes controller. The vendor ships the Operator as part of the BYOC package. Once deployed, it watches Custom Resource Definitions (CRDs) and reconciles the actual cluster state toward the desired state, allowing the customer's team to manage the vendor's software using the same kubectl and GitOps workflows they use for everything else.

What This Looks Like in Practice

Vendor components deploy via helm install or kubectl apply of the Operator manifest into customer-managed namespaces.
Namespace isolation, Kubernetes RBAC, NetworkPolicies, and PodSecurityAdmission policies apply uniformly to vendor and customer workloads.
GitOps tools such as Argo CD and Flux manage vendor chart versions alongside customer application versions in the same repository.
Service meshes such as Istio or Linkerd provide mTLS, traffic shaping, and zero-trust lateral movement controls for vendor pods.

Strategic Trade-Offs

Highest extensibility and policy control for teams with deep Kubernetes expertise.
CRD version management is non-trivial: vendor CRD updates can conflict with existing cluster CRDs and require careful upgrade sequencing.
Kubernetes operational complexity is real; this model is not appropriate for teams without dedicated platform engineering capacity.
Multi-cluster BYOC deployments increase operational surface area significantly.

8. Lightweight Container, SSH, and Serverless BYOC

Definition: Vendor agents or connectors run inside customer infrastructure as Docker containers, SSH-tunnelled processes, or serverless functions, without requiring Kubernetes or complex cloud-native infrastructure.

Not every BYOC deployment justifies a Kubernetes cluster or full cloud-native infrastructure. Lightweight BYOC patterns use the simplest available execution environment: a Docker container on a VM, an SSH tunnel, or a serverless function invoked on demand. These patterns are common for data integration connectors, observability agents, ETL workers, and event-driven ingestion pipelines that need to run inside the customer's perimeter but do not require the orchestration capabilities of Kubernetes.

SSH-based connectors are particularly common in data integration platforms where the connector needs to reach a database or file system inside a private network. The connector process runs on a customer-managed host, establishes an outbound SSH or SOCKS5 tunnel, and receives pipeline instructions from the vendor's control plane without requiring inbound network access. This is architecturally similar to the zero-trust model described in Pattern 4.

Serverless functions, such as AWS Lambda, Google Cloud Run, or Azure Functions, extend this to event-driven workloads. The vendor ships a function package and deployment configuration. The customer deploys it to their own account. The function is invoked by triggers the customer controls (API Gateway events, S3 notifications, Pub/Sub messages) and processes data within the customer's execution environment. Per-invocation billing means there is no idle infrastructure cost.

What This Looks Like in Practice

Docker-based agents run on customer VMs or EC2 instances with outbound-only network egress to the vendor control plane.
SSH tunnels from connector processes reach databases and file systems in private networks without firewall rule changes.
AWS Lambda or Cloud Run functions handle event-driven ingestion with per-invocation billing and no persistent infrastructure footprint.
Deployment is typically a single shell command, Terraform resource, or CloudFormation stack; no Kubernetes knowledge required.

Strategic Trade-Offs

Fast to set up and low operational overhead, making this well-suited for small teams and proof-of-concept deployments.
Serverless cold-start latency (typically 100 ms to 1 s depending on runtime) can be unacceptable for low-latency streaming pipelines.
Limited built-in high availability: a crashed Docker container or failed VM does not self-heal without additional orchestration.
Fewer enterprise guardrails compared to Kubernetes-centric deployments: no namespace isolation, no NetworkPolicies, no PodSecurityAdmission.

Choosing a BYOC Pattern for Real-Time Data Pipelines

The eight patterns above apply across all software categories, but data pipeline teams face a specific constraint set that narrows the options quickly. Here is how the patterns map to the decisions data engineers actually make.

When your primary concern is data residency or compliance

Pattern 2 (Managed BYOC) or Pattern 5 (Control-Plane/Data-Plane Separation) is typically the right starting point. Your data never leaves your VPC, the vendor handles operational work, and you retain encryption key ownership. For teams that need this with a real-time CDC pipeline covering databases, SaaS sources, and warehouse destinations, Estuary's Private Deployment is purpose-built for this: HIPAA- and GDPR-compliant, SOC 2 Type II certified, and deployable on AWS, GCP, or Azure in the customer's VPC.

When your primary concern is vendor access and zero-trust security

Pattern 4 (Zero-Access/Zero-Trust) is the baseline requirement. For data pipelines specifically, this means connectors run inside your perimeter, all communication is outbound-only to the vendor control plane, and the vendor cannot access your data even during a support incident. Estuary's architecture achieves this: the data plane runs in your VPC, data is stored in your own cloud storage, and Estuary's control plane only receives pipeline metadata, not records.

When your primary concern is cost control and using existing cloud credits

Pattern 2 (Managed BYOC) lets you leverage Reserved Instances, Savings Plans, and Committed Use Discounts because pipeline compute runs in your billing account. Estuary's BYOC option goes further: since pipeline data lands in your own object storage, you avoid the egress charges that accumulate when a vendor copies your data into their infrastructure and then back out.

When you need to move fast without infrastructure investment

Pattern 8 (Lightweight/Serverless) or Estuary's standard public SaaS deployment is the right starting point. Estuary's free tier includes 10 GB/month and 2 connector instances with no credit card required. Most teams have a working pipeline within minutes. Private Deployment or BYOC can be added later without rebuilding pipelines, because the same connector specifications and pipeline logic run identically on all deployment options.

Top 5 Snowflake Data Ingestion Tools in 2026 (Compared & Reviewed)

Sourabh Gupta — Fri, 20 Feb 2026 13:54:27 +0000

If you’re searching for Snowflake data ingestion tools, you’re usually trying to solve one (or more) of these problems:

Get data into Snowflake quickly from SaaS apps, databases, files, or event streams.
Keep Snowflake continuously updated (CDC / near real-time) without brittle scripts.
Minimize operational overhead (monitoring, retries, schema drift, cost control).
Balance latency vs. cost (batch is cheaper, streaming is fresher, but can be trickier).

This guide compares five widely used options and focuses on decision-making: what each tool is best for, where it struggles, and how it typically fits into a Snowflake ingestion architecture.

How we evaluated these Snowflake data ingestion tools

To help you pick the best tool for your use case, I scored each option across the criteria that usually matter most:

Ingestion patterns supported: batch, micro-batch, streaming, CDC.
Source coverage: SaaS apps, databases, files/object storage, event streams.
Latency + freshness controls: can you choose “right-time” (real-time or scheduled)?
Schema evolution & change handling: how painful is drift (new columns, deletes)?
Operational overhead: setup, monitoring, retries, scaling.
Security & deployment: SaaS vs. hybrid vs. in-your-VPC / inside Snowflake.
Cost model fit: predictable vs. usage-based, and where Snowflake compute spend lands.

Quick recommendations

Choose Estuary if you want low-latency pipelines into Snowflake with a platform designed around continuous movement + transformations, including support for Snowpipe Streaming in Snowflake ingestion.
Choose Snowflake Snowpipe / Snowpipe Streaming if you’re building ingestion natively on Snowflake and you can own the engineering (file/event integration, retries, schema handling).
Choose Fivetran if you want a fully managed “connect sources → Snowflake” experience with minimal ops, plus hosted dbt Core for transformations.
Choose Airbyte if you want open-source flexibility (self-host/cloud/hybrid) and you’re comfortable owning more operational work.
Choose Matillion if you want a visual ELT platform that pushes transformations down into Snowflake and can be deployed in SaaS/hybrid/inside Snowflake.

Comparison table

Tool	Best for	Real-time / CDC	Transformations	Deployment options	Primary tradeoff
Estuary	Real-time ingestion + streaming-style pipelines into Snowflake	Yes (incl. Snowpipe Streaming for delta bindings)	Built-in derivations (SQL/TypeScript/Python)	Managed + private/BYOC patterns (varies by feature)	New mental model (collections/derivations/materializations) vs. classic ETL
Snowpipe + Snowpipe Streaming	Native Snowflake ingestion from files/events	Yes (Streaming); Snowpipe is continuous micro-batch	You build it (tasks/SQL/apps)	Snowflake-native	You own the pipeline engineering + ops
Fivetran	Fast, managed ingestion from many sources into Snowflake	Often (depends on connector); strong for replication patterns	Hosted dbt Core + SQL in destination	SaaS + Hybrid	Usage-based pricing + less control for edge cases
Airbyte	Flexibility + OSS + custom connectors	Yes (CDC supported for some sources)	Typically downstream (dbt/SQL), connector-dependent	OSS, Cloud, hybrid control/data plane	More operational ownership + connector variability
Matillion	Visual ELT + pushdown transformations inside Snowflake	Yes for pipelines (tooling dependent)	Pushdown ELT designed for Snowflake	SaaS, hybrid, even inside Snowflake	Heavier platform than “just ingest”

5 Top Snowflake data ingestion tools

1) Estuary

Estuary is a data integration platform built around three core building blocks:

Collections (how data is represented and stored as documents)
Materializations (continuous delivery to destinations like Snowflake)
Derivations (transformations that produce new collections)

How Estuary ingests into Snowflake

Estuary’s Snowflake materialization connector supports both standard and delta updates, and Snowpipe Streaming is available for delta update bindings. The connector uploads changes to a Snowflake table stage and then transactionally applies those changes into the target table.

That architecture matters because it’s designed for continuous change application (not just periodic “dump and reload”).

Transformation support (important for real pipelines)

Estuary supports derivations (transformations) in:

SQL (SQLite)
TypeScript
Python

One nuance that’s easy to miss: Python derivations can only be deployed to private or BYOC data planes (so if you need Python transforms, plan deployment accordingly).

Strengths

Designed for low-latency pipelines to Snowflake, including Snowpipe Streaming for certain binding modes.
Materializations are continuously pushed with “very low latency,” and can handle documents up to 16 MB.
Connector ecosystem can be expanded: Estuary notes it can run Airbyte community connectors via airbyte-to-flow to broaden supported SaaS sources.
Pricing is published as pay-as-you-go with a free tier available (useful for evaluation).

Limitations / when it’s not ideal

The “collections/materializations/derivations” model is powerful, but can feel unfamiliar if you expect classic “ELT sync jobs.”
If your team is standardized on a specific orchestration + transformation stack (e.g., “all transforms in dbt”), you’ll want to decide whether to transform in Estuary vs. keep Estuary as pure ingestion.

Best for

Teams that want:

Real-time ingestion into Snowflake (including streaming-style ingestion),
Built-in transformation capability (especially SQL/TypeScript),
A managed experience without building Snowpipe pipelines from scratch.

2) Snowflake Snowpipe (and Snowpipe Streaming)

If you prefer “native-first,” Snowflake offers two core ingestion mechanisms:

Snowpipe: continuous loading of files (micro-batch style)
Snowpipe Streaming: streaming row ingestion with SDK/REST options

Snowpipe: continuous file ingestion (serverless)

Snowflake documents that automated Snowpipe loads use cloud storage event notifications to detect new files, then Snowpipe copies files into a queue and loads them into tables continuously and serverlessly based on a PIPE object configuration.

Snowflake also explicitly recommends enabling cloud event filtering to reduce costs, event noise, and latency.

Snowpipe Streaming: streaming ingestion into tables

Snowflake states Snowpipe Streaming:

ingests data “as it arrives,”
uses SDKs to write rows directly into tables (bypassing intermediate cloud storage),
is serverless and scalable, with billing optimized for streaming workloads (potentially more cost-effective for high-volume, low-latency feeds).

Snowpipe Streaming also has two implementations:

High-performance architecture (newer; uses the snowpipe-streaming SDK; throughput-based pricing; uses a PIPE object)
Classic architecture (original GA; different SDK; channels opened directly against tables; pricing based on serverless compute + active connections).

Strengths

No third-party vendor: fully Snowflake-native.
Great fit when ingestion is already in cloud storage (Snowpipe) or you own the event producer/application (Snowpipe Streaming).

Limitations / when it’s not ideal

Snowpipe is not a “connect to Salesforce and go” tool—you still need systems to extract data and land files/events.
You own the operational surface area: event notifications, backfills, schema handling, retries, monitoring, and pipeline code.
Snowpipe has operational details you must design around (example: Snowpipe vs bulk load behavior; REST auth, pipe metadata history, etc.).

Best for

Teams that:

Want to keep ingestion native in Snowflake,
Already have data landing in object storage or streaming systems,
Have engineering capacity to build and operate ingestion pipelines.

3) Fivetran

Fivetran is a managed ingestion platform known for quickly syncing many different sources into a warehouse.

How it ingests into Snowflake

Fivetran’s Snowflake destination docs emphasize Snowflake’s separation of storage and compute, noting you can run Fivetran in a separate logical warehouse—for example, one warehouse loading data and another serving analyst queries.

Deployment + security model

Fivetran supports SaaS and Hybrid deployment models for the Snowflake destination, and notes Hybrid requires certain plan levels.

Transformations

Fivetran offers transformations powered by Fivetran-hosted dbt Core, executing the resulting SQL in your destination (Snowflake).

Pricing model (important for tool selection)

Fivetran documents its usage-based pricing using Monthly Active Rows (MAR) as the measurement unit.

Strengths

Fastest “time to first pipeline” for many common SaaS/DB sources (highly managed).
Clear separation of ingestion vs transformation (dbt Core option is well-documented).

Limitations / when it’s not ideal

Usage-based pricing can be hard to predict if your data changes frequently (MAR-driven).
Custom or niche APIs can be harder unless a connector exists and meets your needs.

Best for

Teams that want:

A managed, low-ops path to ingest data into Snowflake,
Built-in transformation orchestration with dbt Core,
Strong defaults and minimal pipeline engineering.

4) Airbyte

Airbyte is a data movement platform with a major open-source footprint and multiple deployment options. The official GitHub repo explicitly references deploying Airbyte Open Source or using Airbyte Cloud.

Snowflake destination specifics

Airbyte’s Snowflake destination setup guide states that you set up Snowflake entities (warehouse, database, schema, user, role) and then configure the destination in Airbyte.

It also notes setting up Airbyte-specific Snowflake entities with OWNERSHIP permission to write into Snowflake and manage permissions/cost tracking.

CDC and schema evolution considerations

Airbyte’s CDC documentation notes it adds CDC metadata columns for CDC sources with the _ab_cdc_ prefix.

On the Snowflake destination side, the migration guide for destination version upgrades notes:

v4.0.0 moves Snowflake destination to the Direct-Load paradigm (improves performance and reduces warehouse spend),
adds an option for CDC deletions as soft-deletes,
requires ALTER TABLE permissions for schema evolution/table modifications.

Deployment options (including hybrid)

Airbyte’s Enterprise Flex is described as a hybrid model with a managed Cloud control plane and data planes running in your infrastructure—positioned for data sovereignty/compliance needs.

Strengths

Strong choice when you want control (open-source/self-managed) or hybrid deployment models.
Transparent documentation on Snowflake destination behaviors (direct-load, permissions, schema evolution).

Limitations / when it’s not ideal

You typically take on more operational responsibility than a fully managed ingestion vendor.
Connector quality can vary depending on support level and source (plan for testing/monitoring).

Best for

Teams that want:

Open-source flexibility or “run it in our infrastructure,”
A platform they can extend/customize,
Detailed control over Snowflake destination behavior and upgrades.

5) Matillion (Matillion ETL / Data Productivity Cloud)

Matillion is a long-established ETL/ELT vendor with a strong Snowflake focus.

Matillion’s own product docs describe Matillion ETL as an ETL/ELT tool built specifically for cloud data platforms including Snowflake, emphasizing push-down transformations into the warehouse.

Why Matillion is often chosen for Snowflake ingestion

Matillion ETL highlights:

pushdown transformations executed in your cloud data warehouse,
a browser-based UI with many components,
“over 80 out-of-the-box connectors.”

Matillion’s Data Productivity Cloud page further claims a “completely native pushdown architecture,” and explicitly says data “never leaves your cloud platform,” with deployment options including hosted SaaS, hybrid, or even running inside Snowflake.

Matillion also markets Snowflake Marketplace deployment, stating you can deploy Matillion “inside your Snowflake environment,” and even “run Matillion fully inside your Snowflake account.”

Strengths

Excellent when ingestion is tied to ELT pipeline development (ingest + transform + orchestrate).
Strong Snowflake alignment via pushdown and marketplace-style deployment options.

Limitations / when it’s not ideal

Typically heavier than “simple ingestion,” especially if you only need replication and no transformations.
Commercial licensing/procurement can be more involved than OSS.

Best for

Teams that want:

A visual, enterprise-ready platform to build ELT pipelines on Snowflake,
Strong transformation + orchestration capabilities alongside ingestion.

How to choose the best Snowflake ingestion tool for you

Use this practical decision checklist:

1) What freshness do you actually need?

Minutes/hours is fine → Batch ELT tools (Fivetran, Airbyte, Matillion) or Snowpipe (file micro-batch).
Seconds (near real-time) → Estuary or Snowpipe Streaming (or Airbyte/Fivetran if the specific connector supports the latency you need).

2) What kind of sources are you ingesting?

SaaS apps (CRM, ads, support tools) → Typically easiest with managed connector platforms (Fivetran) or connector-heavy OSS platforms (Airbyte).
Databases + CDC → Estuary, Airbyte CDC patterns, and Fivetran replication approaches are common choices; native Snowflake options usually require more custom plumbing.
Files landing in cloud storage → Snowpipe is often the cleanest native option.

3) Where do you want transformations to live?

In Snowflake (pushdown SQL) → Matillion and Fivetran’s hosted dbt Core model align strongly.
Inside the ingestion platform → Estuary derivations (SQL/TypeScript/Python) can reduce the number of moving parts.
Separate transformation layer → Airbyte + dbt / Snowflake tasks is common.

4) How much operational overhead can you accept?

Low ops / managed → Fivetran, Estuary.
Medium ops / platform ownership → Airbyte (especially self-hosted).
High ops / engineering build → Snowpipe + Snowpipe Streaming pipelines.

FAQ

Which Snowflake data ingestion tool is best for real-time ingestion?

If you want real-time ingestion with a managed tool, Estuary’s Snowflake connector explicitly supports Snowpipe Streaming for delta update bindings.

If you want a native Snowflake approach and can build/operate it, Snowpipe Streaming is Snowflake’s own serverless streaming ingestion option.

Can I ingest data into Snowflake without third-party tools?

Yes—Snowpipe (for continuous file ingestion) and Snowpipe Streaming (for row streaming ingestion) are Snowflake-native options, but you still need to build upstream extraction and operational controls.

I mainly need SaaS to Snowflake ingestion. What’s the simplest path?

A managed connector platform is usually the lowest-friction option. Fivetran’s Snowflake destination documentation emphasizes automated, continuous sync and separation of compute warehouses for loading vs querying.

I need open-source and the ability to customize connectors. What should I use?

Airbyte is designed around open-source deployment and extensibility, and supports Snowflake as a destination with documented setup and upgrade behaviors.

Final take

There isn’t a single “best” Snowflake data ingestion tool—there’s a best fit for your latency needs, source systems, security constraints, and appetite for operational ownership.

How to Stream OLTP Data to MotherDuck in Real Time with Estuary

Sourabh Gupta — Fri, 26 Sep 2025 05:51:23 +0000

Introduction

DuckDB is quickly becoming one of the most talked about analytical databases. It is fast, lightweight, and designed to run inside your applications, often described as SQLite for analytics. While it works great on a laptop for local analysis, production workflows need something more scalable.

That is where MotherDuck comes in. MotherDuck takes the power of DuckDB and brings it to the cloud. It adds collaboration features, secure storage, and a serverless model that lets teams use DuckDB at scale without worrying about infrastructure.

In this guide, you will learn how to stream data from an OLTP system into MotherDuck using Estuary. This approach lets you run analytical queries on fresh data without putting extra load on your production database.

🎥Prefer watching instead of reading? Check out the short walkthrough below.

Why DuckDB Is Gaining Traction

DuckDB is an open source analytical database designed with a clear goal: to make complex queries fast and simple without heavy infrastructure. Instead of being a traditional client-server database, DuckDB is embedded. It runs inside the host process, which reduces overhead and makes it easy to integrate directly into applications, notebooks, or scripts.

Several features stand out:

In-process operation: Similar to SQLite, DuckDB runs where your code runs. This avoids network calls and gives you low-latency access to data.
Columnar and vectorized execution: DuckDB is optimized for analytical queries. Its execution model speeds up heavy operations such as aggregations, filtering, and joins on large tables.
Portability and extensibility: It has a very small footprint and no external dependencies. At the same time, extensions support advanced data types and file formats, including Parquet, JSON, and geospatial data.
Seamless file access: DuckDB can query local files directly without requiring an ETL pipeline. For example, you can run SQL queries on CSV or Parquet files straight from disk.
Integration with data science tools: DuckDB connects smoothly with Python, R, and Jupyter notebooks, which makes it a favorite among data scientists. Because of this balance of speed, flexibility, and simplicity, DuckDB is increasingly used as the analytical layer in modern data pipelines, as well as for ad hoc analysis by engineers and analysts.

MotherDuck: DuckDB in the Cloud

DuckDB is excellent for local analysis, but production environments often require more than a local embedded database. Teams need collaboration, security, and scalability. That is where MotherDuck comes in.

MotherDuck is a managed cloud service built on top of DuckDB. It extends the same fast and lightweight query engine into a serverless environment while adding features that make it practical for organizations:

Serverless architecture: No servers to manage and no infrastructure overhead. MotherDuck scales automatically with your workloads.
Collaboration: Share queries, results, and datasets with teammates in real time. This makes it easier for teams to work from the same source of truth.
Secure secret storage: Manage credentials and connections safely in the cloud.
Integration with pipelines: Platforms like Estuary can write directly into MotherDuck, which means your data is always fresh and ready for analysis. In practice, MotherDuck gives teams the best of both worlds: the performance and simplicity of DuckDB combined with the scalability and ease of use of a modern cloud service.

OLTP → OLAP: The Core Use Case

Most production applications run on OLTP databases such as PostgreSQL, MySQL, or MongoDB. These systems are designed for fast inserts, updates, and deletes. They keep applications responsive but are not optimized for running heavy analytical queries.

Running aggregations, joins, or reports directly on an OLTP database can:

Slow down your application performance.
Increase operational risk by adding load to your production environment.
Limit the ability of analysts and data scientists to explore data freely.

This is why organizations separate OLTP (transactional) systems from OLAP (analytical) systems. The OLTP database handles day-to-day transactions, while an OLAP database is dedicated to complex queries and reporting.

DuckDB, and by extension MotherDuck, fits perfectly as an OLAP layer. With Estuary, you can capture real-time changes from your OLTP source and stream them into MotherDuck. This way, analysts always have up-to-date data to query without touching the production database.

Setting Up Estuary with MotherDuck

In this section, we’ll walk through the process of connecting your OLTP source to MotherDuck using Estuary. The setup is straightforward and only takes a few steps.

Step 1: Prepare Your Source in Estuary

Before you can send data to MotherDuck, you need a source system connected in Estuary. A source could be any OLTP database such as PostgreSQL, MySQL, or MongoDB. Estuary also supports SaaS applications, event streams, and file-based sources.

To prepare a source:

Go to the Captures tab in the Estuary dashboard.
Create a new capture and select the connector for your source system.
Provide the connection details (for example, host, port, database name, and credentials).
Save and publish the capture.

Once this is done, Estuary begins ingesting data from your source and continuously tracks new changes. This stream of data is stored in an internal collection, which you will later connect to MotherDuck.

Tip: If you are new to Estuary, try starting with a simple dataset (like PostgreSQL or a CSV file) before moving on to production-scale sources.

Step 2: Create a MotherDuck Materialization

With your source capture running, the next step is to set up MotherDuck as the destination for your data. In Estuary, this is called a materialization.

To create one:

Go to the Destinations tab in the Estuary dashboard.
Click New Materialization.
Search for MotherDuck in the connector catalog and select it.
Give the materialization a descriptive name so you can easily identify it later.

At this point, you will see the configuration screen for the MotherDuck connector. This is where you provide the details that allow Estuary to stage data and deliver it into your MotherDuck database.

In the next step, you’ll configure AWS S3 staging, which Estuary uses as a temporary storage location for data loads.

Step 3: Configure AWS S3 Staging

The MotherDuck connector in Estuary uses an Amazon S3 bucket as a staging area. Data is first written to S3, then loaded into MotherDuck. This design ensures high reliability and scalability for large datasets.

Here’s what you need to set up:

Create or choose an S3 bucket
- Note down the bucket name and its region.
- Optionally, you can define a prefix if you want Estuary to organize staged files under a specific folder.
Set up IAM permissions
- Create or use an IAM user that has read and write access to the S3 bucket.
- Attach a policy with at least the following actions:
  - s3:PutObject
  - s3:GetObject
  - s3:ListBucket
Generate access keys
- In the AWS console, go to the IAM user’s Security Credentials tab.
- Create an access key and secret key.
- Copy these values into the Estuary dashboard when configuring the MotherDuck connector.

At this point, Estuary knows where to stage data and has the permissions needed to write into your S3 bucket.

Tip: For production, avoid using a root account. Always generate access keys from an IAM user with the least privileges necessary.

Step 4: Set Up MotherDuck

Now that AWS S3 staging is ready, it’s time to configure the MotherDuck side of the connection. This step makes sure MotherDuck can pull the staged data into your chosen database.

Generate an access token
- Log in to your MotherDuck account.
- Open the Settings menu and go to Access Tokens.
- Create a new token and copy it into the Estuary connector configuration.
Provide AWS credentials to MotherDuck
- MotherDuck needs permission to read the staged files from your S3 bucket.
- You can provide these credentials either:
a. By running SQL statements inside MotherDuck:
```
 CREATE SECRET aws_access_key '<ACCESS_KEY>';
 CREATE SECRET aws_secret_key '<SECRET_KEY>';
```
b. Or by entering them through the MotherDuck UI.
Choose a target database
- Select an existing database in your MotherDuck account, or create a new one.
- Copy its name into the Estuary configuration.
Decide on delete behavior
- Soft deletes: Mark a record as deleted but keep it in the table for historical analysis.
- Hard deletes: Remove the record entirely.
- Choose the option that best matches your analytics or compliance needs.

Step 5: Publish and Stream Data

Once your MotherDuck materialization is configured, the final step is to publish it and start the data flow.

Select your source data
- Link an entire capture (for example, your PostgreSQL database)
- Or choose specific collections you want to replicate.
Review the configuration
- Double-check that your S3 credentials, MotherDuck token, and database name are correct.
- Make sure you selected the right delete behavior (soft or hard).
Save and publish
- Click Next, then Save & Publish.
- Estuary will immediately begin streaming data from your OLTP source into MotherDuck.

From here, data updates in your source will flow continuously into your MotherDuck database. This gives you a near real-time OLAP environment for analytics, without adding load to your production system.

Step 6: Query in MotherDuck

With the connector published, your data is now flowing into MotherDuck. The final step is to start exploring it.

Open the MotherDuck dashboard and go to Notebooks.
Select the database you configured as the destination.
Run queries using DuckDB’s familiar SQL syntax.

For example, if you replicated an orders table from your OLTP database, you could analyze top customers like this:

SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
ORDER BY order_count DESC
LIMIT 10;

Wrap-Up

By combining Estuary and MotherDuck, you can build a modern pipeline that keeps analytics separate from your production workload without adding extra complexity.

Estuary captures real-time changes from your OLTP databases.
Data is staged in S3 for reliability.
MotherDuck provides a cloud-native DuckDB environment where your team can query and collaborate.

This setup is fast to configure, easy to maintain, and scales with your needs. Instead of managing batch jobs or writing custom scripts, you can focus on analysis and insights.

✅ Key Takeaways

DuckDB is lightweight and powerful for analytics, while MotherDuck brings it to the cloud for collaboration and scalability.
Estuary makes it simple to stream data from OLTP systems into MotherDuck in real time.
AWS S3 is used as a staging layer, requiring IAM permissions and credentials.
Once published, you can query fresh data in MotherDuck notebooks using DuckDB SQL.

👉 Ready to try it yourself? Explore Estuary and see how quickly you can start streaming data into MotherDuck.

Which is Best for Real Time Dashboards: Airbyte, Fivetran, or Estuary

Sourabh Gupta — Tue, 12 Aug 2025 10:14:44 +0000

A dashboard is only as valuable as the freshness of the data behind it. If the numbers are hours old, the insights are already stale. In a world where customer actions, market conditions, and operational realities change by the second, waiting for the next scheduled batch job can mean missed opportunities and delayed responses.

Many teams turn to data integration tools like Airbyte, Fivetran, or Estuary to power their analytics dashboards. While all three can deliver data, their approaches to latency, scalability, and reliability vary greatly. These differences determine whether your dashboard reflects the current state of the business or lags behind the real world.

In this article, we will break down how each platform supports real time dashboarding and what truly makes right time analytics possible. We will look at sync speed, transformation capabilities, and delivery guarantees so you can choose the right foundation for instant, dependable insights.

What Makes a Real Time Dashboard Possible

Real time dashboards surface insights within seconds of data changes, not minutes or hours. To power metrics like active users, inventory updates, or transactional anomalies, your data pipeline must support ultra low latency and consistent freshness.

Key requirements for real time analytics pipelines

Sub second or second level latency

The pipeline must deliver data to your dashboard as events occur, with minimal delay.
Exactly once delivery

Preventing duplicate or missing records ensures metric accuracy, especially when using aggregation and real time visualization.
Schema evolution support

Data structure changes such as adding columns or nested fields must be handled seamlessly to avoid pipeline errors or dashboard downtime.
In flight transformations

The ability to transform, enrich, or filter data on the fly (via SQL or code) eliminates downstream ETL complexity and enables faster insights.
Integration with dashboard and analytics tools

The pipeline should connect smoothly to BI systems, data stores, or query engines that power your visualization layer.

Airbyte Overview

What Airbyte Is

Airbyte is a popular open source data integration platform that enables users to replicate data from a wide variety of sources into data warehouses, lakes, and databases using extract load transform workflows. It offers both self hosted and cloud deployment options, and its connector ecosystem is driven heavily by community contributions.

Strengths

Open source flexibility and extensibility: You can customize connectors or contribute new ones to the growing ecosystem.
Broad connector catalog: Supports hundreds of source target combinations with flexible deployment models.

Limitations for Real Time Dashboarding

Batch first architecture: Airbyte operates on batch syncs rather than continuous streaming. The default minimum sync cadence is five minutes, and frequent polling can degrade performance.
CDC support is not streaming based: While Airbyte supports Change Data Capture (CDC), it treats each CDC enabled sync as another scheduled batch rather than an ongoing stream. Real time change streaming is not natively supported.

What This Means

Airbyte is highly effective when low latency is not critical or where teams prefer open source tools with deployment flexibility. However, for dashboards that need updates within seconds, Airbyte’s architecture introduces inherent latency that may not meet real time expectations.

Fivetran Overview

What Fivetran Is

Fivetran is a fully managed, cloud based ELT platform that automates data movement from a large set of sources into analytics destinations. It focuses on reliability, low maintenance operation, and enterprise grade security, making it a popular choice for teams that prefer a hands off approach to infrastructure management.

Strengths

Extensive connector library with hundreds of production grade, fully managed source and destination integrations.
Automated schema migration so changes in source structure are handled with minimal disruption.
Zero maintenance experience where scaling, uptime, and infrastructure are managed by Fivetran.
Security and compliance including SOC 2 Type II, GDPR, and HIPAA readiness for regulated industries.

Limitations for Real Time Dashboarding

Batch oriented sync model: Most connectors run on a schedule, with intervals that are typically 15 minutes or longer for standard plans.
Streaming Change Data Capture only for select sources and often as part of higher priced enterprise tiers.
MAR based pricing (Monthly Active Rows) which can significantly increase costs for high volume, frequently changing datasets.
Limited in pipeline transformation options: Fivetran relies heavily on dbt for transformations, which are applied after loading into the destination rather than in real time.

What This Means

Fivetran offers excellent reliability and low maintenance for batch analytics use cases. However, for dashboards that require second level latency, its architecture and pricing model may limit feasibility unless you opt for specialized CDC features on high cost tiers.

Estuary: The Right Time Data Platform

What Estuary Is

Estuary is the Right Time Data Platform, built for unified, dependable, and scalable data movement. It lets you move and transform data continuously or at the cadence your business requires. With Estuary, you can synchronize systems in real time, near real time, or on schedule, all from a single platform that combines streaming and batch in one.

In other words, right time means data moves when it matters, whether that is sub second updates for live dashboards or hourly refreshes for analytics workloads.

Strengths for Right Time Dashboarding

Unified data movement: Handle streaming and batch data within one platform without separate infrastructure.
Right time performance: Achieve second level latency for continuous Change Data Capture (CDC) and event streams.
Exactly once delivery: Guarantees accuracy and consistency for operational and analytical dashboards.
In stream transformations: Apply SQL or TypeScript transformations as data moves so dashboards display clean, usable data instantly.
Automatic schema evolution: Accommodate source changes without breaking pipelines or visualizations.
Kafka compatible Dekaf API: Deliver data directly to Kafka consumers without maintaining brokers.
Flexible, secure deployment: Choose public SaaS, private cloud, or bring your own cloud (BYOC) for full compliance and control.
Predictable TCO: Volume based pricing eliminates the unpredictability of MAR based or usage tiered models.

What This Means

Estuary empowers organizations to deliver dashboards that always reflect the current state of the business, without trading off reliability or cost predictability. It combines the flexibility of streaming with the dependability of enterprise grade data movement in one platform.

Head to Head Comparison

Feature	Airbyte	Fivetran	Estuary
Latency	Minutes to hours depending on sync schedule	Typically 15 minutes or more for most sources	Seconds with right time streaming
Deployment	Self hosted or cloud	Cloud only	Cloud, private cloud, or BYOC
Pricing Model	Free self hosting, paid cloud plans	Monthly Active Rows (MAR) based	Predictable, volume based pricing
CDC Support	Batch based for some connectors	Select sources only	Continuous right time CDC for many sources
Exactly Once Delivery	No	No	Yes
In Pipeline Transformations	Basic via dbt	Basic via dbt	Real time SQL or TypeScript
Kafka Compatibility	No	No	Yes (via Dekaf API)
Schema Evolution Handling	Manual intervention often required	Automated	Automatic with zero downtime

Key Insight

Airbyte and Fivetran both effectively deliver batch data for analytics, but their architectures introduce unavoidable latency. Estuary stands apart as the only right time platform that combines continuous streaming, exactly once delivery, and unified transformations into a single dependable system.

Which Tool Fits Which Use Case

Airbyte

Best for teams who value open source flexibility and can tolerate delays of several minutes or hours between syncs.

Fivetran

Ideal for teams that want a fully managed, hands off ELT experience and are primarily focused on batch reporting.

Estuary

Purpose built for businesses where data freshness drives decisions:

Dashboards that must reflect reality within seconds.
Operational analytics needing accuracy and reliability.
Teams that want both streaming and batch movement in one platform.
Organizations prioritizing predictable TCO and compliance ready deployment.

The Real Cost of Not Choosing Right Time Data

Delays in dashboard updates are not just technical inconveniences. They have measurable business costs.

E commerce campaigns: Stale data means wasted ad spend and missed conversion optimization opportunities.
Fraud detection: Delayed signals can allow bad transactions to complete, costing thousands.
Operations and logistics: Without fresh data, routing and inventory systems react too late.
Customer experience: Old engagement metrics can lead to poor timing in retention strategies or feature rollouts.

Choosing batch based pipelines for use cases that demand immediacy often costs more in lost revenue and inefficiency than investing in a right time architecture upfront.

✅ Key Takeaways

Real time dashboards require right time data movement, not faster batches.
Airbyte offers open source flexibility but lacks continuous streaming.
Fivetran provides managed reliability but operates mainly on scheduled syncs.
Estuary combines streaming, transformations, and exactly once delivery in one dependable platform.
Predictable costs, right time performance, and enterprise reliability make Estuary the most future proof choice for mission critical dashboards.

2025 Data Warehouse Benchmark: What BigQuery, Snowflake, and Others Don’t Tell You

Sourabh Gupta — Thu, 17 Jul 2025 08:11:33 +0000

We Benchmark-Tested 5 Data Warehouses. Here's What Broke.

Choosing a data warehouse shouldn’t feel like a gamble — but it often is.

Marketing sites are polished. Demos are cherry-picked. Docs are full of high-level promises. But when your data team starts moving terabytes of real data, things change fast: performance bottlenecks, cost spikes, memory errors… and sometimes complete failure.

At Estuary, we help teams build real-time data pipelines that push warehouses hard — across batch and streaming. We’ve seen the consequences of choosing the wrong warehouse. So we built the benchmark we wish existed earlier.

🔍 The Estuary 2025 Data Warehouse Benchmark

We benchmarked 5 major data warehouses under real workloads:

Google BigQuery
Snowflake
Databricks
Amazon Redshift
Microsoft Fabric

We didn’t just run canned TPCH queries — we loaded over 8TB of structured + semi-structured data, then hit each platform with real-world SQL:

Joins, window functions, filters, and nesting
Query-F (“The Frankenquery”) — a deliberately brutal query that pushes limits
Full lifecycle tracking from ingest to query via Estuary Flow
Cost-to-runtime ratios with no vendor tuning or caching games

📂 Our full methodology is open source. Clone it. Run your own tests. Contribute.

🧠 What We Learned

🔵 BigQuery

Fast — especially on nested JSON
But zero cost guardrails = high bill risk
Cost-per-minute hit $15+ under some setups

⚪ Snowflake

Stable, predictable, smart scaling
Good balance of performance and cost
Strong default choice for teams who want reliability

🟨 Databricks

Great for ML workflows
SQL under load? Needs tuning
Performance quirks at scale

🟥 Redshift & 🟩 Fabric

Memory errors, long runtimes, incomplete results
Multiple queries failed or stalled for hours
Definitely not plug-and-play ready

📉 Chart: Cost vs Runtime

This graph tracks $ per minute of query runtime across warehouses and instance sizes.

Red bands = platforms that failed under load or threw memory errors.

⚙️ Rankings That Actually Matter

We scored each platform on:

Cost-efficiency 💰
Runtime performance ⚡
Scalability 📈
Reliability under pressure 🧱
Startup-friendliness 🚀
Enterprise readiness 🏢

🎯 Some platforms were efficient at small scale but crashed under growth. Others performed well but cost 10x more than peers.

📥 Get the Full Report

If you’re:

Planning a warehouse migration
Scaling analytics or ML pipelines
Comparing Snowflake vs BigQuery vs Databricks
Or just tired of guessing…

👉 Download the full benchmark report

👨‍🔬 Built by Engineers, Not Marketers

We created this benchmark at Estuary because we work with these warehouses daily. Our product — Estuary Flow — streams real-time data from sources like PostgreSQL, Kafka, MongoDB, and SaaS apps into modern warehouses.

We’ve helped teams recover from 18-month migrations and $100k+ in wasted compute. So we’re publishing what we’ve learned.

🤝 Contribute or fork the test harness here:

🔗 GitHub Repo

🌐 Estuary GitHub

💬 Join the Discussion

Have you had similar (or better?) experiences with these platforms?

Spot something we should test next?

Drop your thoughts, logs, or horror stories in the comments. We’re all ears 👇

Refresh Smarter: How Estuary’s Dataflow Reset Makes Backfills a Breeze

Sourabh Gupta — Tue, 15 Jul 2025 04:14:10 +0000

Backfills have always been a critical - but sometimes tedious - part of managing robust data pipelines. Whether you're dealing with schema drift, outdated destination tables, or bad source data, initiating a full reset of your pipeline used to require multiple steps.

Not anymore.

With Estuary’s new Dataflow Reset feature, you can perform a clean-sweep backfill in just one step - reloading your sources, refreshing schemas, re-triggering derivations, and updating destination tables - all at once.

What Is a Dataflow Reset?

A Dataflow Reset is Estuary’s one-click solution to refresh your entire dataflow. It works from top to bottom:

Re-extracts data from the source
Re-runs all derivations
Recalculates schemas using updated data
Rebuilds destination tables

This isn't just a re-run - it's a recalibration. If your schemas previously became too broad (due to inconsistent or junk data), the reset starts fresh and reflects the true shape of your source.

When Should You Use It?

The new Dataflow Reset option is ideal for scenarios like:

Structural changes in your source system
Schema inference gone awry
Destination tables out of sync with upstream logic

Bonus: It automatically tracks which downstream resources (like materializations) need updating - no manual selection required.

How to Use It

Go to your Capture in the Estuary Flow web app.
Click Edit.
Select Backfill.
The default backfill mode will now trigger a Dataflow Reset.

That’s it - your pipeline is reset and refreshed in one action.

Prefer Fine-Grained Control?

You can still choose from advanced backfill options:

Incremental Backfill

Reprocess only the source data while keeping the existing destination intact.
Materialization-Only Backfill

Rebuild destination tables from current collection data - no need to touch the source.

These modes are perfect for more targeted recovery and testing.

Known Limitation

Avoid using Dataflow Reset with Dekaf materializations (Estuary’s Kafka-compatible interface). This combination is currently unsupported.

Learn More

Want a deeper dive into backfilling options, use cases, and caveats? Check out the Estuary docs:

👉 https://docs.estuary.dev/reference/backfilling-data/

TL;DR

Dataflow Reset is a full-pipeline refresh: source -> schema -> derivation -> destination
Automatically recalculates schema to avoid issues caused by bad or outdated data
Easy to trigger and safer than ever to run
Still supports advanced, partial backfill modes
Avoid using with Dekaf (for now)

Make your next backfill a breeze with Estuary.

How to Load Data from Amazon S3 to Snowflake in Real Time

Sourabh Gupta — Wed, 09 Jul 2025 06:39:46 +0000

Got a bunch of raw data sitting in Amazon S3 and need to get it into Snowflake for analytics — fast? You’re not alone.

Maybe it’s JSON logs, CSV exports, or event data piling up in your S3 bucket. Maybe you’ve tried batch pipelines or custom scripts but ran into delays, duplicates, or schema chaos. What you actually need is a clean, reliable way to load that S3 data to Snowflake, without spending weeks building and maintaining it.

That’s exactly what Estuary Flow is built for.

Flow makes it easy to build real-time S3 to Snowflake data pipelines with no code, no ops overhead, and no latency headaches. It connects directly to your S3 bucket, picks up new files as they arrive, and keeps your Snowflake warehouse in sync continuously.

In this walkthrough, we’ll show you how to set up an Amazon S3 to Snowflake pipeline using Estuary Flow from start to finish. You’ll go from raw files to live Snowflake tables in just a few steps.

TL;DR: If you're looking to stream data from Amazon S3 to Snowflake, you're in the right place — and Flow makes it a breeze.

Why Stream S3 Data to Snowflake in Real Time?

Let’s be honest — batch processing worked fine back when dashboards only needed to update once a day. But today, teams expect real-time answers: marketing needs up-to-the-minute campaign performance, operations teams need live inventory data, and product managers want to react to user behavior as it happens.

That’s where streaming data from S3 to Snowflake changes the game.

If you’re storing raw files — like logs, events, or exports — in Amazon S3, you’re already halfway there. The missing piece is a low-latency pipeline that gets that data into Snowflake the moment it arrives. No waiting for hourly jobs. No stale reports. Just fresh, query-ready data flowing in 24/7.

Here are a few reasons real-time sync matters:

Analytics that actually keep up – Get real-time insights instead of reading yesterday’s data.
Automation that reacts fast – Trigger workflows in Snowflake based on live S3 updates.
Simplified ops – Eliminate brittle scripts, manual backfills, and sync delays.

Note: Since Amazon S3 doesn’t support native change notifications, Flow polls your bucket every few minutes to detect new files, then streams them to Snowflake immediately. It’s batch under the hood, but real-time in effect.

Why Use Estuary Flow Instead of Traditional ETL Tools?

If you’ve tried to move data from Amazon S3 to Snowflake before, you probably know the drill: patch together an ETL tool, deal with scheduling, wrestle with schema mismatches, and hope the job doesn’t break halfway through.

The thing is, most ETL tools were built for a different era — one where “real time” meant “hourly,” and everything ran in batches. Estuary Flow flips that on its head.

Here’s how Flow makes your S3 to Snowflake pipeline way easier:

Real-Time by Default: Flow isn’t just fast — it’s built for continuous streaming. Once you connect your S3 bucket, Flow automatically picks up new files as they land and streams the data directly into Snowflake.
No Code Required: Set up everything — capture, schema, and materialization — through a clean UI. You don’t need to write Python, wrangle Airflow, or babysit cron jobs.
Schema-Aware + Smart: Flow infers the structure of your S3 data and helps you map it to Snowflake tables. You can tighten up schemas, apply transformations, and evolve structure over time without breaking your pipeline.
Exactly-Once Delivery: No duplicates. No reprocessing. Flow uses cloud-native guarantees to ensure data lands in Snowflake exactly once, even if things get weird.
Built to Scale: Whether you're syncing a few JSON files or streaming terabytes a day, Flow scales automatically without locking you into complex infrastructure.

Estuary Flow takes the friction out of real-time data integration from S3 to Snowflake, so you can focus on using the data, not moving it.

What You Need to Get Started

You don’t need much to build an Amazon S3 to Snowflake pipeline with Estuary Flow — just a few basics:

Estuary Flow Account

Amazon S3 Bucket

This is your data source. You’ll need:

Bucket name & region
Either public access or your AWS access key + secret key
(Optional) A folder path, called a “prefix”

Snowflake Account

Your destination for the data. Make sure you have:

A database, schema, and virtual warehouse
A user with access
Your account URL + login credentials
(Optional) warehouse name and role

That’s it. With these in place, you’re ready to connect the pieces and start streaming.

Step 1: Capture Data from Amazon S3

First up, you’ll connect Estuary Flow to your S3 bucket — this step is called a capture. It’s how Flow knows where to pull your data from.

Here’s how to set it up:

Log into Estuary Flow at dashboard.estuary.dev.
Click the Sources tab and select New Capture.
Choose Amazon S3 from the list of connectors.

You’ll see a form where you enter your S3 details:

Capture name – Something like myorg/s3-orders
AWS credentials – Only needed if your bucket isn’t public
Bucket name & region – From your S3 console
Prefix (optional) – To pull from a specific folder
Match keys (optional) – For filtering files, like *.json

Once you click Next, Flow will connect to your bucket and auto-generate a schema based on your data. You’ll see a preview of your Flow collection — this acts as a live copy of your S3 data inside Flow.

Click Save and Publish to finish the capture.

Behind the scenes, Flow checks your S3 bucket on a 5-minute schedule (by default) to pick up new or updated files. This is how it delivers near-real-time sync, even though S3 itself doesn’t support streaming events.

Next, let’s connect this to Snowflake.

Step 2: Materialize to Snowflake

Now that your data is flowing into Estuary, it’s time to materialize it to Snowflake — in other words, stream it directly into a Snowflake table.

Here’s how to set it up:

After saving your S3 capture, click Materialize Collections.
Choose the Snowflake connector from the destination list.

You’ll fill out a simple form with your Snowflake details:

Materialization name – e.g., myorg/s3-to-snowflake
Account URL – Like myorg-account.snowflakecomputing.com
User + Password – A Snowflake user with the right permissions
Database & Schema – Where the table will live
Warehouse – Optional, but recommended
Role – Optional if already assigned to the user

Once Flow connects, you’ll see your captured collection (from S3) listed.

From here, you can:

Rename the output table
Enable delta updates (if you want changes applied instead of full inserts)
Use Schema Inference to map your flat S3 data into Snowflake’s tabular format

To do that:

Click the Collection tab
Select Schema Inference
Review the suggested schema → Click Apply

Finally, hit Save and Publish.

✅ That’s it — you’ve now got a fully working, real-time S3 to Snowflake pipeline. Flow will continuously sync new files from your bucket straight into your Snowflake warehouse.

What’s Next? Supercharge Your S3 to Snowflake Pipeline

You now have a fully operational, real-time pipeline from Amazon S3 to Snowflake — and it runs continuously, no scripts or schedulers required.

But that’s just the beginning.

With Estuary Flow, you can take things even further:

Add Transformations (a.k.a. Derivations)

Want to clean, filter, or join your data before it lands in Snowflake? Use derivations to apply real-time transformations using SQL or TypeScript, right inside Flow.

You can enrich JSON objects, flatten nested structures, or create entirely new views.

Plug into More Systems

Need to send the same S3 data to BigQuery, Kafka, or a dashboard tool? Just add another materialization — Flow supports multi-destination sync out of the box.

Monitor + Optimize

Use Flow’s built-in observability tools or plug into OpenMetrics to monitor throughput, schema evolution, and pipeline health in real time.

Start Streaming S3 Data to Snowflake Today

The old way — batch jobs, manual scripts, clunky ETL — just can’t keep up with today’s speed of data.

With Estuary Flow, you can:

Sync Amazon S3 to Snowflake in real time
Handle schema changes effortlessly
Scale without infrastructure headaches

Ready to go from raw files to real-time insights?

Try Estuary Flow for free and build your first streaming data pipeline today.

Top 5 Fivetran Alternatives in 2025: Faster, More Dependable Data Integration

Sourabh Gupta — Mon, 31 Mar 2025 04:57:04 +0000

In the era of data-driven business, seamless data integration is no longer a luxury but a necessity. While Fivetran has long been a popular choice, its limitations in latency, cost predictability, and reliability have led many organizations to explore alternatives.

In this guide, we will look at five powerful Fivetran alternatives in 2025: Estuary, Matillion, Integrate.io, Airbyte, and Hevo Data. Each platform has unique strengths and trade-offs that address common pain points experienced with Fivetran. Whether you are replacing Fivetran or adopting a new data integration platform, this comparison will help you make an informed decision.

Why Consider Fivetran Alternatives

Fivetran’s challenges in real-time data processing, unpredictable MAR-based pricing, and delivery reliability have left many users seeking more efficient and budget-friendly options. The alternatives below provide a range of deployment models, pricing structures, and latency profiles that fit the needs of the modern data stack.

With that in mind, let’s explore the top Fivetran alternatives that balance performance, cost predictability, and scalability. Each platform takes a different approach to data movement from right-time streaming to traditional ELT helping you find the best fit for your team’s needs.

1. Estuary

Estuary is the Right Time Data Platform built to unify streaming and batch data movement. Unlike traditional ELT tools that focus on scheduled syncs, Estuary enables data to move when it matters. This means you can operate in real time, near real time, or batch mode from the same platform.

Estuary’s architecture is designed for dependability and scalability, delivering exactly-once guarantees and second-level latency without requiring separate streaming infrastructure. With over 200 native connectors and compatibility with Airbyte, Meltano, and Stitch ecosystems, Estuary offers unmatched integration flexibility.

Estuary also solves one of the biggest concerns with Fivetran: unpredictable costs. Its transparent, volume-based pricing model makes total cost of ownership predictable and easy to control.

Key Features:

Right time performance with continuous Change Data Capture (CDC) and streaming
Unified streaming and batch data movement with exactly-once delivery
In-stream SQL and TypeScript transformations
Automated backfill, schema evolution, and time travel
Scales to enterprise-grade throughput levels
Flexible deployment: public cloud, private cloud, or bring your own cloud
Predictable volume-based pricing model

👉 Try Estuary for free

2. Matillion

Matillion is a cloud-native ETL and ELT platform known for its strong visual interface and enterprise-grade data transformation capabilities. It supports both cloud and on-prem deployments and focuses on governance, security, and data quality.

While it offers advanced transformation features, its enterprise-tier pricing may be excessive for smaller teams or simpler data movement needs.

Key Features:

Visual workflow builder for complex transformations
Cloud-native with hybrid deployment support
Strong governance and quality assurance tools
Reverse ETL capabilities

3. Integrate.io

Integrate.io is a no-code and low-code data integration platform built for simplicity. It provides an intuitive drag-and-drop interface that enables quick setup for teams without deep engineering resources.

Although it may lack advanced transformation features, Integrate.io covers fundamental integration use cases well and offers flexible pricing tiers.

Key Features:

Visual pipeline builder with drag-and-drop functionality
No-code and low-code environment
Wide connector library
Cloud and hybrid deployment options

4. Airbyte

Airbyte is an open-source ELT platform with more than 500 connectors, many maintained by the community. It gives technical teams complete control over their data pipelines and infrastructure.

While Airbyte provides great flexibility and community-driven growth, it requires more engineering effort and is better suited for non-real-time workloads. Its default sync frequency often makes it less ideal for right-time or operational analytics.

Key Features:

500+ open-source and custom connectors
Modular, extensible architecture
Self-hosted or cloud-hosted options
Default sync intervals starting at 5 minutes (OSS) or 1 hour (cloud)
Debezium-powered CDC with at-least-once delivery

5. Hevo Data

Hevo Data is a cloud-based ELT platform designed for ease of use and quick setup. It focuses on reliability and automation with strong schema handling. However, it offers limited transformation flexibility compared to more developer-oriented tools.

Key Features:

No-code setup with drag-and-drop transformations (in beta)
Batch-based delivery with exactly-once guarantees
Sync frequency starts at 1 hour (5 minutes on higher tiers)
Supports reverse ETL workflows

Fivetran Alternatives Comparison Table

Before choosing a platform, it helps to see how these tools compare across latency, transformations, cost models, and deployment flexibility. The table below summarizes key differences between Estuary, Fivetran, and other top data integration tools in 2025.

Feature / Platform	Estuary	Fivetran	Matillion	Integrate.io	Airbyte	Hevo Data
Data Movement Type	Streaming and batch	Batch (some CDC)	Batch ELT	Batch ETL	Batch with CDC	Batch ELT
Latency	Seconds	15 min to hours	Minutes	Minutes to hours	5 min+	1 hr (5 min on higher tiers)
Exactly Once Delivery	✅ Yes	❌ No	❌ No	❌ No	⚠️ Partial (at-least-once)	✅ Yes
Transformation Support	Real-time SQL or TypeScript	dbt-based (post-load)	Visual and SQL	Visual drag-and-drop	dbt integration	Visual (limited)
Schema Evolution	Automatic with zero downtime	Automated (some connectors)	Manual or scheduled	Automatic	Manual for custom connectors	Automatic
Deployment Options	Cloud, private cloud, or BYOC	Cloud only	Cloud or on-prem	Cloud or hybrid	Self-hosted or cloud	Cloud
Pricing Model	Volume-based, predictable	MAR-based (Monthly Active Rows)	License + usage	Tiered plans	Free OSS + paid cloud	Tiered plans
Open Source Model	Open Core	❌ No	❌ No	❌ No	✅ Yes	❌ No
Best For	Real-time and high-throughput analytics	Managed ELT with wide connector set	Enterprise transformations	No-code data teams	Engineering-heavy setups	Fast and easy batch syncs

10 More Fivetran Alternatives

If you want to explore additional tools, here are ten more Fivetran alternatives worth considering:

Stitch
Rivery
Striim
Talend
Informatica
Blendo
Alooma
Qlik Replicate
Panoply
Meltano

Conclusion

If right-time performance, scalability, and predictable pricing are your top priorities, Estuary is the strongest alternative to Fivetran in 2025. As a unified Right Time Data Platform, Estuary provides streaming and batch data movement, exactly-once guarantees, and sub-second latency for the most demanding workloads.

That said, the right choice depends on your team’s technical requirements and resources:

Estuary for unified, right-time data movement with predictable cost and reliability
Matillion for enterprise transformation and governance needs
Integrate.io for teams seeking an easy no-code integration setup
Airbyte for open-source flexibility and customization
Hevo Data for fast, reliable batch delivery with minimal setup

By understanding your specific goals, whether real-time analytics, reverse ETL, or simplified onboarding, you can select the platform that delivers dependable data movement and the insights your business needs.

✅ Key Takeaways

Fivetran’s batch-first design and MAR-based pricing can limit scalability and cost predictability.
Estuary provides right-time data movement that adapts to your latency and control needs.
Matillion, Integrate.io, Airbyte, and Hevo each serve specific use cases but are limited in streaming or flexibility.
Estuary’s exactly-once guarantees, in-stream transformations, and predictable pricing make it ideal for modern data stacks.

Oracle to PostgreSQL Migration: A Comprehensive Guide

Sourabh Gupta — Wed, 19 Mar 2025 10:27:36 +0000

Migrating from Oracle to PostgreSQL is becoming a priority for businesses looking to reduce costs, improve flexibility, and embrace open-source technologies. While Oracle provides enterprise-grade solutions, its proprietary nature and licensing fees can be restrictive. PostgreSQL, on the other hand, offers a robust, scalable, and cost-effective alternative.

This guide explores the steps, challenges, and tools available for a smooth Oracle to PostgreSQL migration, focusing on an automated approach using Estuary Flow.

Why Consider Migrating to PostgreSQL?

1. Cost Reduction

Oracle's high licensing and operational costs can be burdensome. PostgreSQL eliminates these expenses as it is open-source and freely available for commercial and non-commercial use.

2. Open-source Flexibility

PostgreSQL provides extensive customization options through extensions, whereas Oracle relies on costly add-ons for advanced functionalities.

3. Multi-cloud & Hybrid Deployment

Unlike Oracle, PostgreSQL allows seamless multi-cloud and hybrid deployments, supporting AWS, GCP, Azure, and on-premise setups without vendor lock-in.

4. Strong Community Support

PostgreSQL is backed by a strong global community that continuously enhances the database with new features and security updates.

Automated Oracle to PostgreSQL Migration Using Estuary Flow

Automating the migration process helps minimize downtime and human error while ensuring real-time synchronization. Estuary Flow is an advanced ETL tool that simplifies the process with minimal configuration.

Key Features of Estuary Flow

Change Data Capture (CDC): Supports real-time data sync, reducing the risk of data loss.
No-code Configuration: Enables easy migration without requiring extensive technical knowledge.
200+ Pre-built Connectors: Offers seamless integration with multiple databases, cloud services, and applications.
Secure & Scalable: Supports private deployments, ensuring complete control over data.

Steps to Migrate Data Using Estuary Flow

Step 1: Configure Oracle as the Source

Log in to Estuary Flow.
Select Sources from the dashboard and click + NEW CAPTURE.
Search for the Oracle Database connector and select the Real-time option.
Provide the necessary credentials:
- Name: Unique identifier for the connection.
- Server Address: Hostname and port of the Oracle database.
- User & Password: Authentication credentials.
Click NEXT and then SAVE AND PUBLISH to finalize the connection.

Step 2: Set Up PostgreSQL as the Destination

After setting up Oracle as a source, click MATERIALIZE COLLECTIONS.
Alternatively, navigate to Destinations and click + NEW MATERIALIZATION.
Search for the PostgreSQL connector and select Materialization.
Enter the following details:
- Name: Unique name for the destination.
- Address: PostgreSQL host and port (default: 5432).
- User & Password: PostgreSQL credentials.
Click NEXT > SAVE AND PUBLISH.

Once configured, Estuary Flow will migrate and sync Oracle data into PostgreSQL in real-time.

Common Challenges in Oracle to PostgreSQL Migration

1. Data Type Mismatch

Oracle NUMBER → PostgreSQL NUMERIC or BIGINT
Oracle CLOB → PostgreSQL TEXT
Oracle DATE → PostgreSQL TIMESTAMP

2. Stored Procedures & Functions

Oracle uses PL/SQL, whereas PostgreSQL uses PL/pgSQL. Converting complex procedures may require rewriting code.

3. Indexing & Performance Optimization

Oracle’s Index-Organized Tables (IOTs) and partitioning methods differ from PostgreSQL, requiring adjustments to maintain performance.

Conclusion

Migrating from Oracle to PostgreSQL is a strategic move for businesses looking to reduce costs, enhance scalability, and gain more control over their data. While manual migration methods can be time-consuming and error-prone, automated tools like Estuary Flow simplify the process, ensuring real-time synchronization and minimal downtime.

If you’re considering migrating, start with Estuary Flow today to experience seamless and efficient data migration!

FAQs

1. How long does an Oracle to PostgreSQL migration take?

The duration depends on data volume and the migration method. Automated tools like Estuary Flow speed up the process significantly.

2. Does PostgreSQL support Change Data Capture (CDC)?

Yes, PostgreSQL supports CDC using logical replication and tools like Estuary Flow.

3. Can I migrate stored procedures from Oracle to PostgreSQL?

Yes, but Oracle's PL/SQL must be converted to PostgreSQL’s PL/pgSQL, which may require manual intervention.