Gowtham Potureddi

Posted on Jul 5

Postgres Snowflake CDC: 5 Architectures Compared (Debezium, Fivetran, Native Streams, etc.)

#python #sql #interview #dataengineering

postgres to snowflake cdc is the single most-asked pipeline design question of the 2026 data-engineering interview loop — because Postgres is now the operational default for virtually every OLTP stack, Snowflake is the analytics default for virtually every warehouse team, and the bridge between the two is where latency, cost, ownership, and PII posture all collide at the same time. A senior candidate is not expected to name every vendor in the space; a senior candidate is expected to reduce the design surface to four axes, walk an interviewer through five canonical architectures, and pick the right one for a specific workload with defensible reasoning on cost per million rows, deletes and tombstones, and how the primary-key merge is actually executed at the warehouse.

This guide is the senior walkthrough for postgres snowflake replication in 2026 — the year snowflake native streams, snowpipe streaming, and Snowflake Openflow finally push the last DIY holdouts off of hand-rolled S3-landing patterns, while debezium snowflake (via the Kafka Connect Streaming API), fivetran postgres, Hevo, Airbyte, and estuary flow remain the four workhorses of the managed and semi-managed space. It walks through why postgres cdc starts at the WAL and the replication slot, the sub-minute DIY path (Debezium + Kafka + Snowpipe Streaming), the hands-off managed path (Fivetran / Hevo / Airbyte with MAR or row pricing), the real-time streaming-SQL path (Estuary Flow / Materialize / Rivery), and the native-Snowflake ends (Openflow + Kafka Connector + DIY landing). Every section ships a teaching block, a worked deploy sketch, cost arithmetic, and a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why the picked architecture wins on the four axes.

When you want hands-on reps immediately after reading, drill the ETL practice library →, rehearse on the streaming practice library →, and sharpen the design axis with the SQL practice library →.

On this page

Why Postgres → Snowflake is the most-asked CDC path in 2026
Debezium + Kafka + Snowpipe Streaming
Fivetran / Hevo / Airbyte managed pipelines
Estuary Flow / Materialize / Rivery — real-time CDC platforms
Native Snowflake features (Openflow) + hybrid DIY
Cheat sheet — Postgres → Snowflake CDC recipes
Frequently asked questions
Practice on PipeCode

1. Why Postgres → Snowflake is the most-asked CDC path in 2026

The four axes — latency, cost model, ownership, PII posture — reduce five architectures to one clear choice per workload

The one-sentence invariant: every real Postgres-to-Snowflake CDC decision compresses to four axes — the latency budget (seconds vs minutes vs "next hour"), the cost model (per-row / MAR / compute-hour / DIY infra), the ownership boundary (vendor-owned vs your-cluster-owned vs Snowflake-owned), and the PII posture (data leaving the VPC vs staying in-VPC vs never leaving Snowflake) — and once you name the constraints on all four, at most one or two of the five architectures survive. Senior candidates who lead with vendor names lose points; senior candidates who lead with the four axes and then map to architectures are hired.

The four axes interviewers actually probe.

Latency. Sub-minute (streaming, dashboards, fraud), 5–15 minutes (Fivetran-style micro-batch), 1 hour (batch load), or 6 hours (nightly). Naming the product requirement is the first step; the second step is realising most "we need real-time" workloads actually need 5-minute freshness and can save 10× cost by admitting it.
Cost model. MAR (Monthly Active Rows — Fivetran's headline metric), row-based (Hevo), OSS-plus-infra (Airbyte OSS, Debezium), compute-hour (Estuary, Materialize), or Snowflake credits (Openflow, Snowpipe Streaming). The cost model interacts with the workload — MAR punishes wide tables with high update rates; compute-hour punishes idle pipelines; Snowflake credits punish bursty ingest.
Ownership. Vendor-owned pipelines (Fivetran, Hevo, Rivery) mean the vendor's SRE team wakes up when the CDC stalls; DIY (Debezium + Kafka) means you own the pager. Snowflake-owned (Openflow) puts Snowflake on the hook for the pipeline the same way it's on the hook for the warehouse. Ownership shifts risk and shifts price.
PII posture. GDPR / HIPAA / SOC 2 tend to require that the CDC stream never leave your VPC, or that PII is dropped / hashed at the source. Debezium (in your VPC) wins here; Fivetran (SaaS with data plane in vendor's cloud) needs a Business Associate Agreement and often a private-link setup; Snowflake Openflow with a customer-hosted data plane threads the needle.

The 2026 reality — the DIY tail is shrinking.

Snowpipe Streaming API stabilised in 2023 and is now the standard replacement for the old chunk-file Snowpipe pattern. Sub-second ingest is a solved problem at the warehouse; the debate is upstream.
Snowflake Openflow launched in 2024 and puts Snowflake itself on the hook for the connector layer — a direct challenge to Fivetran's business model. Openflow supports Postgres logical replication end-to-end without any DIY infrastructure.
Managed vendors matured. Fivetran, Hevo, and Airbyte Cloud all handle Postgres logical replication competently in 2026; the differentiator is cost model and PII compliance, not table stakes.
Real-time platforms consolidated. Estuary Flow, Materialize, and Rivery cover the "streaming SQL over CDC" niche; RisingWave joined the pack in 2024. These are the picks when the analytics is streaming, not just the ingestion.

What interviewers actually probe on the CDC path.

"Minutes or seconds latency?" The single most powerful clarifying question. Most candidates jump to Debezium; the senior answer says "let's define the SLA first — most workloads are minute-latency and Fivetran is cheaper."
"How do you MERGE CDC events into a Snowflake target — full table replace, delta MERGE, or streaming append?" Names the tail: MERGE is the analytics default; append-only tables plus late materialisation are the streaming default.
"How do you handle deletes?" Tombstone rows (__deleted = true) or hard delete via DELETE FROM ... WHERE pk IN. Senior signal is knowing which model each vendor uses.
"Where does PII live along the path?" Names the VPC boundary, the encryption model, the retention on Kafka / vendor storage.
"What's the cost per million rows?" Not the sticker price — the effective cost including compute-hour idle time, MAR overhead on tables with high update rates, and Snowflake credits for MERGE workloads.

Common interview probes on the CDC decision itself.

"Design a Postgres → Snowflake pipeline for a 200-table OLTP with 100M events/day, 5-minute freshness, no PII." — Fivetran or Airbyte Cloud; MAR math is the deciding factor.
"Same workload but sub-minute freshness and PII cannot leave the VPC." — Debezium + Kafka + Snowpipe Streaming; own the cluster; encrypt at rest; VPC-only network path.
"Same workload but the analytics is streaming aggregations on windowed events." — Estuary Flow or Materialize; the streaming SQL layer is where the value is.
"Same workload but greenfield in 2026 with no vendor lock-in requirement." — Snowflake Openflow; Snowflake owns the pipeline; you pay in credits, not vendor invoices.

Worked example — the four-axis interview reduction

Detailed explanation. The canonical opening move: an interviewer describes a workload in one sentence and expects the candidate to convert it to the four axes, then propose an architecture. Walk through the reduction on a concrete workload — a 500-table Postgres with 50M events/day, 5-minute freshness required, PII allowed to leave the VPC through an encrypted SaaS path, and a small data team without CDC infrastructure experience.

The workload. 500 tables; 50M row events/day; 5-minute freshness.
The team. Two data engineers; no Kafka experience.
The compliance. PII allowed to transit an encrypted SaaS path with SOC 2 Type II.

Question. Reduce the workload to the four axes and pick a single architecture with a one-sentence justification per axis.

Input.

Axis	Value	Rules out	Leaves
Latency	5 min	streaming-only tools	Fivetran, Airbyte, Openflow, Debezium
Cost model	MAR-friendly (only 50M events/day, mostly appends)	expensive DIY	Fivetran, Airbyte
Ownership	small team, no Kafka	DIY Debezium, Kafka	Fivetran, Airbyte, Openflow
PII posture	SOC 2 SaaS OK	strict VPC-only	Fivetran, Airbyte Cloud, Openflow

Code.

# Decision tree — plug in the four axes
axes:
  latency: 5min
  cost_model: MAR-friendly
  ownership: small-team-managed
  pii_posture: SaaS-OK

route:
  - if latency == "5min" and ownership == "small-team-managed":
      candidates: [fivetran, airbyte-cloud, openflow]
  - if cost_model == "MAR-friendly" and event_volume < 500M/day:
      recommend: fivetran
  - if event_volume > 500M/day:
      recommend: airbyte-cloud   # cheaper at scale
  - if snowflake_native_preferred:
      recommend: openflow

-- Sanity check — MAR arithmetic for the recommendation
-- Assumption: Fivetran MAR = distinct primary keys touched per month
-- Table churn: 50M events/day × 30 days = 1.5B events; distinct PKs touched ~ 200M
-- Fivetran MAR pricing (rough 2026): $0.5–$1.5 per 1000 MAR
WITH month AS (
  SELECT 50e6::numeric AS events_per_day,
         30::int       AS days,
         0.2::numeric  AS distinct_ratio,   -- 20% of events touch new PKs
         1.0::numeric  AS mar_price_per_1k  -- $1 / 1000 MAR
)
SELECT (events_per_day * days * distinct_ratio)                                    AS estimated_mar,
       (events_per_day * days * distinct_ratio) / 1000 * mar_price_per_1k          AS estimated_monthly_usd
FROM   month;
-- estimated_mar ≈ 300M ; estimated_monthly_usd ≈ $300k → check the price sheet again

Step-by-step explanation.

Start at latency: 5 minutes rules out pure streaming platforms (Materialize, RisingWave) that price for streaming SQL you don't need. It also rules out the tightest DIY paths where sub-minute is the reason to own the infrastructure at all.
Cost model: 50M events/day is small; MAR pricing is only expensive when the distinct-PK-touched-per-month ratio is high. For an append-heavy log table the MAR is close to 100% of rows; for a slowly-changing dimension the MAR is a small fraction. Check the arithmetic before committing.
Ownership: two engineers with no Kafka experience makes DIY Debezium a non-starter — the on-call cost dominates the vendor invoice for small teams. Managed vendors or Snowflake-owned Openflow are the survivors.
PII posture: SOC 2 SaaS-transit is allowed; Fivetran's data plane runs in AWS with encryption at rest and in transit. No hard blocker. If the compliance flag had said "PII must not leave the VPC" the story flips to Debezium or Openflow with customer-hosted data plane.
Final choice: Fivetran (with a MAR estimate that comes in under budget) or Airbyte Cloud (if the MAR estimate blows out and row-based Airbyte is cheaper). The interview signal is the reasoning — the axes drove the choice; the vendor name is a consequence.

Output.

Axis	Constraint	Vendor pick
Latency	5 min	Fivetran / Airbyte / Openflow
Cost	MAR-friendly at 50M events/day	Fivetran ✓
Ownership	Small team	Fivetran ✓
PII	SaaS OK	Fivetran ✓
Final	—	Fivetran — sanity-check the MAR math against Airbyte for scale

Rule of thumb. The four axes reduce the design surface to at most two vendors before you name a single product. If you can't state which axis kicks out each rejected architecture, you have not made a decision — you have made a preference.

Worked example — the "sub-minute + PII cannot leave the VPC" flip

Detailed explanation. Same workload as above but two constraints change: latency drops to sub-minute (fraud dashboards) and PII cannot leave the VPC (HIPAA-adjacent). The four-axis reduction now picks Debezium + Kafka + Snowpipe Streaming — the only path where the CDC stream stays in your VPC end-to-end and hits sub-minute freshness. The cost changes accordingly; the ownership shifts from vendor to your team.

Latency. Sub-minute — Fivetran (5–15 min) and Openflow (1–15 min) are ruled out.
PII. VPC-only — SaaS pipelines are ruled out; Openflow-with-customer-hosted-data-plane survives.
Ownership. DIY — the team must staff a Kafka + Debezium on-call rotation.

Question. Redesign the pipeline for sub-minute + VPC-only. Compare the total cost of ownership (invoice + infra + on-call) between Debezium DIY and Openflow-with-BYOC.

Input.

Constraint	Old value	New value
Latency SLA	5 min	< 60 s
PII posture	SaaS OK	VPC-only
Event volume	50M/day	50M/day
Team size	2 DEs	2 DEs + 1 SRE

Code.

# Two candidates survive the flip
candidates:
  - name: debezium-kafka-snowpipe
    latency_p99: 30s
    cost:
      kafka_msk: 8000/month     # 3-broker MSK cluster
      debezium_kc: 2000/month   # 2 Kafka Connect workers
      snowpipe_streaming: 3000/month  # Snowflake credits for MERGE
      on_call_engineer: 15000/month   # 1 SRE at 20% allocation
    total_monthly_usd: 28000
    pii_posture: full-VPC
    ownership: your-team-owns-the-pager
  - name: openflow-byoc
    latency_p99: 60s
    cost:
      openflow_data_plane: 12000/month  # customer-hosted compute
      openflow_control_plane: 3000/month
      snowpipe_streaming: 3000/month
    total_monthly_usd: 18000
    pii_posture: VPC-only-data-plane
    ownership: snowflake-owns-the-pipeline

-- Sanity check — cost-per-million-rows arithmetic
WITH cand AS (
  SELECT 'debezium' AS name, 28000::numeric AS monthly_usd, 1.5e9::numeric AS monthly_rows
  UNION ALL
  SELECT 'openflow-byoc',      18000,                     1.5e9
)
SELECT name,
       (monthly_usd / (monthly_rows / 1e6))::numeric(10,4) AS usd_per_million_rows
FROM   cand;
-- debezium       ≈ $18.67 / million rows
-- openflow-byoc  ≈ $12.00 / million rows

Step-by-step explanation.

Sub-minute + VPC-only shrinks the survivor set to two: Debezium DIY and Openflow with a customer-hosted data plane (BYOC). Fivetran / Hevo / Airbyte Cloud all fail the VPC constraint; Materialize / Estuary either fail the VPC constraint or pay too much for a workload that isn't actually streaming SQL.
Debezium DIY looks cheaper on paper but requires a Kafka cluster (MSK or self-hosted), Kafka Connect workers, a Snowflake Kafka Connector, and — critically — an SRE on the on-call rotation. The invoice is $16k / month; the on-call cost is $12k / month; the effective total is $28k.
Openflow BYOC hides more of the operational cost inside the Snowflake bill. The customer-hosted data plane runs on your infra but is managed by Snowflake. Your team monitors; Snowflake owns the pipeline SLAs. The total is $18k / month — cheaper because you're not staffing a whole SRE rotation.
The per-million-rows math is decisive: Debezium at $18.67, Openflow BYOC at $12.00. At the same workload the Openflow choice saves $10k / month and lets your two DEs stay focused on analytics, not on Kafka rebalances.
The interview signal: you named the on-call cost explicitly. Most candidates quote the invoice and stop; senior candidates quote invoice + infra + on-call because that's the total that the CFO actually sees.

Output.

Architecture	Latency p99	Monthly cost	Cost per million rows	Ownership
Debezium + Kafka + Snowpipe Streaming	30 s	$28k	$18.67	Your team
Openflow BYOC	60 s	$18k	$12.00	Snowflake

Rule of thumb. When PII forces a VPC-only pipeline, DIY Debezium is not automatically the cheapest choice. Openflow with customer-hosted data plane threads the compliance needle without the on-call staffing cost. Always price the on-call — the invoice is not the total.

Worked example — the streaming-SQL flip

Detailed explanation. A third scenario: same 500-table Postgres, but the analytics team wants windowed aggregations (5-minute tumbling windows of revenue by product line) delivered to a Snowflake table for a live dashboard. This is not "CDC into a table" — it is "streaming SQL over CDC materialised into a Snowflake table." The four-axis reduction now picks Estuary Flow or Materialize — the only paths where the streaming aggregation happens upstream of Snowflake, saving warehouse credits.

The workload. Streaming windowed aggregation, not raw event landing.
The warehouse cost. MERGE + windowed aggregation in Snowflake is expensive; pushing the aggregation upstream is cheaper.
The candidate paths. Estuary Flow (managed streaming SQL), Materialize (streaming SQL over CDC), RisingWave (open-source Materialize).

Question. Design the pipeline for a streaming windowed aggregation delivered to Snowflake. Compare the credit cost of a Snowflake-side aggregation vs an upstream-aggregated write.

Input.

Component	Value
Event rate	100M events / day
Distinct products	10 000
Window	5-minute tumbling
Snowflake warehouse for aggregation	XL, 16 credits/hour

Code.

# Path A — raw CDC into Snowflake, aggregation in Snowflake
path_a:
  ingest: fivetran or debezium (raw event landing)
  aggregation: Snowflake TASK running MERGE every 5 minutes
  warehouse: XL (16 credits/hour) × 12 runs/hour × ~0.05 hour = 9.6 credits/hour = 230 credits/day
  daily_cost: 230 credits × $3/credit = $690/day = $20.7k/month

# Path B — streaming SQL upstream, pre-aggregated write to Snowflake
path_b:
  ingest: Estuary Flow with streaming SQL materialization
  aggregation: happens in Estuary compute (rust dataflow)
  snowflake_write: 1 row per (product, window) × 288 windows × 10k products = 2.9M rows/day
  estuary_cost: $6k/month  # streaming SQL compute
  snowflake_write_cost: ~$500/month  # low-volume MERGE
  monthly_total: $6.5k

-- Path A — Snowflake-side aggregation (expensive)
CREATE OR REPLACE TASK t_agg
  WAREHOUSE = xl_wh
  SCHEDULE  = '5 MINUTE'
AS
INSERT INTO product_revenue_5min
SELECT DATE_TRUNC('minute', ts) - INTERVAL '4 minute' AS window_start,
       product_id,
       SUM(amount)                                    AS revenue,
       COUNT(*)                                       AS events
FROM   raw_events_stream
WHERE  ts >= DATEADD('minute', -5, CURRENT_TIMESTAMP)
GROUP  BY 1, 2;

-- Path B — Estuary Flow materialisation (cheap)
--   raw CDC events → Estuary collection → streaming SQL → materialised into Snowflake table
--   Snowflake sees only the pre-aggregated (product, window, revenue) tuples

Step-by-step explanation.

Path A is the naive "land everything, aggregate at the warehouse" pattern. Snowflake runs a TASK every 5 minutes that re-aggregates the last 5 minutes of events. Even with a light workload, that TASK burns 230 credits per day at typical XL costs — $690/day, $20.7k/month, entirely in Snowflake credits.
Path B pushes the windowed aggregation upstream into Estuary Flow's streaming SQL engine. The raw CDC stream is read from Postgres logical replication, aggregated in Estuary's dataflow runtime, and only the result (product × window × revenue) is materialised into Snowflake.
The Snowflake-side workload for Path B is 2.9M row-inserts per day into a small aggregate table — a tiny MERGE that runs on an XS warehouse for a few seconds per batch. Snowflake credit cost drops to ~$500/month.
Estuary Flow's cost for the streaming compute is ~$6k/month for this workload volume. Total path B cost: $6.5k/month vs path A's $20.7k/month — 3× cheaper.
The interview signal: naming where the aggregation happens is the senior move. Junior candidates ingest raw into Snowflake and aggregate there because that's the shortest data-modelling story; senior candidates recognise that streaming-SQL platforms exist precisely to remove that aggregation from the warehouse.

Output.

Path	Ingest cost	Aggregation cost	Snowflake cost	Total monthly	Freshness
A: raw ingest + Snowflake TASK	$2k (Fivetran)	— (bundled)	$20.7k	$22.7k	5 min
B: Estuary streaming SQL	$6k (Estuary)	— (bundled)	$0.5k	$6.5k	5 min

Rule of thumb. When the analytics is a windowed aggregation and not raw event access, the streaming-SQL platforms (Estuary, Materialize, RisingWave) beat land-then-aggregate patterns by 3–5× on Snowflake credit cost. Push the aggregation upstream whenever the downstream consumer is a fixed shape (a dashboard, an alert, a slow-changing summary table).

Senior interview question on the four-axis reduction

A senior interviewer often opens with: "Walk me through the framework you use to pick a Postgres → Snowflake CDC architecture. Then apply it to a workload of 200 tables, 500M events/day, sub-minute freshness for one critical fraud table and 5-minute freshness for the rest, PII in five columns that must be hashed before landing, and a team of three data engineers."

Solution Using a hybrid Debezium-critical-path + Fivetran-bulk architecture

# Two-lane pipeline — sub-minute for the critical fraud table, 5-minute managed for the rest
lanes:
  critical:
    source_tables: [transactions_fraud]
    architecture: debezium + kafka + snowpipe_streaming
    latency_p99: 30s
    pii_handling: SMT drops SSN, hashes email/phone at the connector
    cost_monthly: 12000
  bulk:
    source_tables: [199 other tables]
    architecture: fivetran-with-hash-blocklist
    latency_p99: 5min
    pii_handling: Fivetran column-level hashing (SHA-256)
    cost_monthly: 8000

merge_strategy:
  critical:
    landing_table: raw_fraud (Snowpipe Streaming APPEND)
    curated: MERGE INTO fraud_txns USING raw_fraud ON pk WHEN MATCHED... WHEN NOT MATCHED...
    task_schedule: 1 minute
  bulk:
    landing: Fivetran materialised tables (delta MERGE handled by Fivetran)
    schedule: 5 minute (Fivetran-controlled)

pii_column_map:
  email:  sha256
  phone:  sha256
  ssn:    drop
  addr:   drop
  dob:    truncate_to_year

-- Critical fraud table — MERGE from Snowpipe Streaming landing
CREATE OR REPLACE TASK t_merge_fraud
  WAREHOUSE = s_wh
  SCHEDULE  = '1 MINUTE'
AS
MERGE INTO fraud_txns AS tgt
USING (
  SELECT pk, ts, amount, currency, is_flagged, email_sha256, phone_sha256,
         op, __ts_ms
  FROM   raw_fraud
  WHERE  __ts_ms > (SELECT COALESCE(MAX(__ts_ms), 0) FROM fraud_txns_hwm)
) AS src
ON tgt.pk = src.pk
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED AND src.op IN ('u','c') THEN UPDATE SET
     ts = src.ts, amount = src.amount, currency = src.currency,
     is_flagged = src.is_flagged,
     email_sha256 = src.email_sha256, phone_sha256 = src.phone_sha256
WHEN NOT MATCHED AND src.op IN ('u','c') THEN INSERT
     (pk, ts, amount, currency, is_flagged, email_sha256, phone_sha256)
     VALUES (src.pk, src.ts, src.amount, src.currency, src.is_flagged,
             src.email_sha256, src.phone_sha256);

-- Advance the high-water mark
INSERT INTO fraud_txns_hwm(__ts_ms) VALUES ((SELECT MAX(__ts_ms) FROM raw_fraud));

Step-by-step trace.

Step	Lane	Latency	Data flow
1	critical	source	Postgres WAL → Debezium
2	critical	5 s	Debezium → Kafka topic `pg.public.transactions_fraud`
3	critical	15 s	Kafka Connect Snowflake Streaming sink → `raw_fraud`
4	critical	60 s	Snowflake TASK MERGE → `fraud_txns`
5	bulk	30 s	Postgres WAL → Fivetran-hosted reader
6	bulk	5 min	Fivetran MERGE → 199 curated tables

After rollout, the fraud dashboard sees new events in Snowflake within 60 seconds end-to-end, PII is hashed at the Debezium SMT layer (never lands in Snowflake in clear), and the 199 non-critical tables land through Fivetran at a fraction of the cost. Total monthly spend is $20k — cheaper than pure Debezium on all 200 tables ($40k+) and cheaper than pure Fivetran with sub-minute upgrade tier ($30k+).

Output:

Metric	Critical lane	Bulk lane
Tables	1	199
Latency p99	60 s	5 min
PII posture	SMT drop + hash	Fivetran column hash
Cost/month	$12k	$8k
On-call complexity	Kafka + Debezium	Fivetran-owned

Why this works — concept by concept:

Two lanes, two SLAs — the workload was described as "sub-minute for one table, 5-minute for the rest"; the design honours the split rather than forcing all 200 tables into the sub-minute lane. Cost drops proportionally.
PII gated at the connector — Debezium's Single Message Transforms (SMTs) hash email / phone and drop ssn before the row hits Kafka. The clear PII never leaves Postgres. Fivetran handles the 199-table lane with column-level hashing configured in its UI.
MERGE with high-water mark — the fraud MERGE reads only rows newer than the last checkpointed __ts_ms. This bounds the merge cost per run and keeps the 1-minute TASK cheap even as raw_fraud grows.
Delete handling — Debezium emits op = 'd' for tombstone rows; the MERGE's WHEN MATCHED AND src.op = 'd' THEN DELETE branch translates the tombstone into a real Snowflake delete. Fivetran does the same automatically for the bulk lane.
Cost — DIY Debezium on 200 tables would cost ~$40k/month (Kafka partitions × connector workers × SRE); Fivetran on 200 tables with sub-minute upgrade costs ~$30k/month. The hybrid is $20k — the two-lane split is 33–50% cheaper than either extreme. O(1) additional operational overhead: one more repo, one more runbook.

ETL
Topic — etl
ETL problems on Postgres → Snowflake CDC design

Practice →

SQL Topic — sql SQL MERGE and CDC delta problems

Practice →

2. Debezium + Kafka + Snowpipe Streaming

The sub-minute DIY path — full control, full ownership, sub-minute latency, and a real on-call rotation

The mental model in one line: Debezium + Kafka + Snowpipe Streaming is the canonical DIY CDC path — Postgres logical replication feeds a Debezium connector on Kafka Connect, one Kafka topic per table carries the CDC events, and the Snowflake Kafka Connector with the Streaming API lands rows into Snowflake with sub-minute latency — but you own every component of the stack, including the pager. Every other architecture in this guide is a trade-off against this baseline; understanding it end-to-end is the price of admission for the senior CDC interview.

The four-axis snapshot.

Latency. Sub-minute end-to-end (typical 15–60 s). Debezium reads the WAL within milliseconds; Kafka Connect flushes within a second; Snowpipe Streaming ingests within seconds; the MERGE TASK runs every minute for curated tables.
Cost model. DIY infra — Kafka cluster (MSK or self-hosted), Kafka Connect workers, Snowflake credits for Snowpipe Streaming + MERGE, plus one SRE on the on-call rotation. No per-row vendor charge.
Ownership. You own the pipeline top to bottom. When the WAL slot bloats, when a Debezium task rebalances, when the Snowflake connector runs out of memory — you page. This is the point of the architecture; it is also its biggest hidden cost.
PII posture. Best-in-class for VPC-only workloads. Every component runs in your VPC (or on-prem); Debezium SMTs hash / drop PII at the connector before it hits Kafka; nothing crosses a vendor boundary.

When Debezium + Kafka wins.

Sub-minute freshness is a real SLA. Fraud dashboards, real-time alerts, live operational reporting.
PII cannot leave the VPC. HIPAA workloads, financial services, government.
You already run Kafka. The marginal cost of adding a Debezium connector to an existing Kafka cluster is small; standing up Kafka just for CDC is expensive.
On-prem source. No SaaS vendor can reach on-prem Postgres without a private-link setup; DIY Debezium in the same data centre is often simpler.

When Debezium + Kafka loses.

Small team, no Kafka experience. The learning curve + on-call cost exceeds the vendor invoice for most sub-500-table workloads.
Bursty workloads. Kafka is a long-lived cluster; paying for it during 20 hours/day of low volume is wasteful. Managed vendors that scale to zero are cheaper.
Greenfield with no Kafka roadmap. Adopting Kafka to solve one CDC problem is over-engineering. Openflow or Fivetran is simpler.

The Postgres side — logical replication basics.

wal_level = logical. Postgres must be configured with logical replication enabled. This is a cluster-restart setting.
max_replication_slots ≥ N. Each Debezium connector needs one replication slot. Slots hold WAL until the connector reads them; a stalled connector bloats WAL and eventually fills the disk.
REPLICA IDENTITY FULL. For DELETE and UPDATE events to include the old row values, the table needs ALTER TABLE ... REPLICA IDENTITY FULL (or a covering unique index). Without it, DELETE events only carry the primary key and UPDATE events only carry the primary key + changed columns.
The publication. CREATE PUBLICATION pipecode_pub FOR TABLE public.events, public.users, ... — the whitelist of tables Debezium will replicate.
The pgoutput decoder. The recommended output plugin — built into Postgres, no extension install required. Alternatives (wal2json, decoderbufs) exist but pgoutput is the modern default.

Debezium's Postgres connector.

The plugin. io.debezium.connector.postgresql.PostgresConnector running inside Kafka Connect (Kafka's plugin runtime).
The topic naming. By default, one topic per source table: <server.name>.<schema>.<table>. For 200 tables this creates 200 topics.
The event shape. A JSON (or Avro / Protobuf) envelope with before (old row), after (new row), op (c = create, u = update, d = delete, r = read snapshot), ts_ms (Postgres commit timestamp), and source (Postgres metadata).
Snapshotting. On first startup, Debezium takes a consistent snapshot of every whitelisted table (streaming the current state as op = 'r' events) before switching to the WAL tail. The snapshot mode is configurable (initial, initial_only, never, schema_only).

The Snowflake Kafka Connector with the Streaming API.

The plugin. com.snowflake.kafka.connector.SnowflakeSinkConnector running alongside Debezium in Kafka Connect.
Two ingest modes. Legacy Snowpipe (chunk-file based, 1-minute latency) or Snowpipe Streaming (row-level API, sub-second latency). Streaming is the modern default; use snowflake.ingestion.method = SNOWPIPE_STREAMING.
Landing table shape. By default, one row per Kafka message with a RECORD_METADATA VARIANT column and a RECORD_CONTENT VARIANT column. A flatten-into-typed-columns step (Snowpipe transformations or a MERGE TASK) is the norm.
Cost. Snowflake charges for Snowpipe Streaming by ingested MB — roughly $0.02–$0.05 per GB depending on region. Predictable and cheap at moderate volumes.

The MERGE TASK — the last mile.

Why. The Snowflake landing table (raw_events) has one row per Kafka event; the curated table (events) has one row per primary key. The MERGE deduplicates, applies deletes, and updates the curated view.
Cadence. Typical is a Snowflake TASK on a 1-minute or 5-minute schedule that reads the last N minutes of raw_events, MERGEs into the curated table, and advances a high-water-mark checkpoint.
Deletes. The MERGE branch WHEN MATCHED AND src.op = 'd' THEN DELETE translates a Debezium tombstone into a real Snowflake delete. Skipping this branch means deletes silently disappear — the classic CDC bug.

Common interview probes on the Debezium path.

"Walk me through what Postgres does when Debezium starts." — snapshot, then WAL tail from the replication slot.
"What breaks if REPLICA IDENTITY is default?" — DELETEs carry only the PK, UPDATEs miss unchanged columns.
"How do you handle a schema change?" — Debezium emits a schema change event; the sink adapts; the MERGE template needs updating for new columns.
"What kills the pipeline?" — WAL slot bloat when Debezium stalls; Kafka partition imbalance; Snowflake credit spikes on a bad MERGE.

Worked example — end-to-end Debezium deploy for a 3-table workload

Detailed explanation. The textbook Debezium deploy: three Postgres tables (users, orders, order_items) need sub-minute CDC into Snowflake. Walk through the Postgres prep, the Debezium connector config, the Snowflake Kafka Connector config, and the MERGE TASK — every knob, every file, ready to lift into a real cluster.

Postgres side. Set wal_level = logical; create a replication user; grant REPLICATION role; create the publication; set REPLICA IDENTITY FULL where needed.
Debezium side. Register the connector via the Kafka Connect REST API with the source config.
Snowflake side. Create the landing table, register the sink connector, then write the MERGE TASK.

Question. Produce the full deploy sketch — every SQL statement, every JSON config, every command.

Input.

Component	Value
Postgres version	16
Debezium version	2.6
Kafka Connect	3.7
Snowflake Kafka Connector	3.x
Ingest mode	Snowpipe Streaming

Code.

-- Postgres side — prerequisites
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 8;
ALTER SYSTEM SET max_wal_senders     = 8;
-- (restart Postgres after the ALTER SYSTEM)

-- Replication role
CREATE ROLE debezium WITH LOGIN REPLICATION PASSWORD 'strong-secret';
GRANT  USAGE  ON SCHEMA public TO debezium;
GRANT  SELECT ON ALL TABLES IN SCHEMA public TO debezium;
ALTER  DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium;

-- Publication (whitelist)
CREATE PUBLICATION pipecode_pub FOR TABLE
  public.users, public.orders, public.order_items;

-- Full row images on UPDATE / DELETE
ALTER TABLE public.users       REPLICA IDENTITY FULL;
ALTER TABLE public.orders      REPLICA IDENTITY FULL;
ALTER TABLE public.order_items REPLICA IDENTITY FULL;

// Debezium PostgresConnector config — register via Kafka Connect REST
{
  "name": "pg-pipecode-source",
  "config": {
    "connector.class":       "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max":             "1",
    "database.hostname":     "pg-primary.internal",
    "database.port":         "5432",
    "database.user":         "debezium",
    "database.password":     "strong-secret",
    "database.dbname":       "pipecode",
    "topic.prefix":          "pipecode",
    "slot.name":             "pipecode_debezium_slot",
    "publication.name":      "pipecode_pub",
    "plugin.name":           "pgoutput",
    "snapshot.mode":         "initial",
    "schema.include.list":   "public",
    "table.include.list":    "public.users,public.orders,public.order_items",
    "tombstones.on.delete":  "false",
    "transforms":            "unwrap",
    "transforms.unwrap.type":                       "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones":            "false",
    "transforms.unwrap.delete.handling.mode":       "rewrite",
    "transforms.unwrap.add.fields":                 "op,ts_ms,source.ts_ms,source.lsn"
  }
}

// Snowflake sink connector — Snowpipe Streaming
{
  "name": "sf-pipecode-sink",
  "config": {
    "connector.class":                       "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "tasks.max":                             "3",
    "topics":                                "pipecode.public.users,pipecode.public.orders,pipecode.public.order_items",
    "snowflake.url.name":                    "abc12345.us-west-2.snowflakecomputing.com",
    "snowflake.user.name":                   "SVC_KC_INGEST",
    "snowflake.private.key":                 "MIIEvQIBADANBgkq...",
    "snowflake.database.name":               "RAW",
    "snowflake.schema.name":                 "PG_CDC",
    "snowflake.ingestion.method":            "SNOWPIPE_STREAMING",
    "snowflake.enable.schematization":       "true",
    "buffer.count.records":                  "10000",
    "buffer.size.bytes":                     "5000000",
    "buffer.flush.time":                     "10",
    "key.converter":                         "org.apache.kafka.connect.json.JsonConverter",
    "value.converter":                       "org.apache.kafka.connect.json.JsonConverter"
  }
}

-- Snowflake side — landing tables (auto-created if schematization on)
-- MERGE TASK for the curated users table
CREATE OR REPLACE TASK t_merge_users
  WAREHOUSE = xs_wh
  SCHEDULE  = '1 MINUTE'
AS
MERGE INTO curated.users AS tgt
USING (
  SELECT
    RECORD_CONTENT:id::NUMBER               AS id,
    RECORD_CONTENT:email::STRING            AS email,
    RECORD_CONTENT:name::STRING             AS name,
    RECORD_CONTENT:created_at::TIMESTAMP_NTZ AS created_at,
    RECORD_CONTENT:__op::STRING             AS op,
    RECORD_CONTENT:__source_ts_ms::NUMBER   AS ts_ms
  FROM  RAW.PG_CDC."pipecode.public.users"
  WHERE RECORD_CONTENT:__source_ts_ms > (SELECT COALESCE(MAX(ts_ms), 0) FROM curated.users_hwm)
) AS src
ON tgt.id = src.id
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED AND src.op IN ('u','c') THEN UPDATE SET
     email      = src.email,
     name       = src.name,
     created_at = src.created_at
WHEN NOT MATCHED AND src.op IN ('u','c','r') THEN INSERT
     (id, email, name, created_at)
     VALUES (src.id, src.email, src.name, src.created_at);

INSERT INTO curated.users_hwm(ts_ms)
  VALUES ((SELECT MAX(RECORD_CONTENT:__source_ts_ms::NUMBER)
             FROM RAW.PG_CDC."pipecode.public.users"));

ALTER TASK t_merge_users RESUME;

Step-by-step explanation.

Postgres prep is the foundation: wal_level = logical is a restart-required change, so schedule it during a maintenance window. max_replication_slots = 8 gives headroom for the Debezium connector plus a few for future replicas. The replication user needs REPLICATION privilege (not superuser) and SELECT on the tables it will replicate.
REPLICA IDENTITY FULL on each replicated table is the single most-missed step. Without it, DELETE events carry only the primary key, and UPDATE events carry only the primary key plus the changed columns. Downstream consumers that need the old row values (audit tables, delta detection) silently break.
The Debezium config registers a connector against the publication; snapshot.mode = initial triggers a one-time full-table snapshot before the WAL tail starts. The unwrap transform flattens Debezium's {before, after, op, ts_ms, source} envelope into a single row with the new values plus __op and __source_ts_ms metadata — the shape the Snowflake sink expects.
The Snowflake sink config specifies SNOWPIPE_STREAMING as the ingest method (sub-second) and enable.schematization = true, which auto-creates the landing table with typed columns extracted from the JSON. Buffer knobs (buffer.count.records, buffer.size.bytes, buffer.flush.time) trade latency for throughput; the defaults land within 10 seconds.
The MERGE TASK is the last mile — it converts the append-only landing table into a keyed curated table. The high-water-mark pattern (a small users_hwm table tracking the max ts_ms merged) bounds each TASK run to only the new events. The WHEN MATCHED AND src.op = 'd' THEN DELETE branch is the delete-handling clause; skipping it is the most common CDC bug.

Output.

Layer	Component	Latency contribution
Postgres	WAL commit	0 ms
Debezium	poll + emit to Kafka	~1 s
Kafka	broker fsync	~50 ms
Snowflake sink	Snowpipe Streaming API	~5 s
MERGE TASK	1-minute cadence	up to 60 s
End-to-end p99		~60 s

Rule of thumb. The MERGE TASK cadence dominates end-to-end latency. If your dashboard needs sub-30-second freshness, drop the TASK cadence to 30 seconds and accept the higher warehouse credit cost. Below 30 seconds, migrate to streaming tables (Snowflake Dynamic Tables) instead of TASK-driven MERGE.

Worked example — sizing the Kafka cluster + on-call cost

Detailed explanation. The Debezium invoice is the smallest part of the total cost of ownership. Kafka cluster costs, connector-worker costs, and — most importantly — SRE on-call cost dominate. Walk through the sizing math for a 100M-events-per-day workload and the annual TCO that the CFO actually sees.

Volume. 100M events/day = ~1200 events/sec average, 5000 events/sec p99.
Kafka cluster. 3-broker MSK cluster on kafka.m5.large = ~$2000/month.
Kafka Connect workers. 2 workers on c5.xlarge = ~$500/month for compute; +$200 for storage.
Snowflake credits. Snowpipe Streaming at $0.05/GB × 30 GB/day × 30 days = $45/month. MERGE TASK on XS warehouse × 1440 runs/day × ~30s = ~$100/month.
On-call. 1 SRE at 15% allocation × $180k/year = $27k/year = $2250/month.

Question. Assemble the full TCO for the Debezium path and compare it to the equivalent Fivetran quote.

Input.

Component	Monthly cost
Kafka (MSK 3-broker)	$2000
Kafka Connect (2 workers)	$700
Snowpipe Streaming	$45
MERGE TASK compute	$100
SRE 15% allocation	$2250
Networking (VPC transit)	$150
Monitoring (Prometheus + Grafana Cloud)	$200
Total	$5445

Code.

# TCO calculator — Debezium DIY vs Fivetran
def debezium_tco(events_per_day, sre_allocation=0.15, sre_fully_loaded=180_000):
    kafka        = 2000
    kc_workers   = 700
    snowpipe_gb  = events_per_day * 300 / 1e9      # ~300 bytes/event JSON
    snowpipe     = snowpipe_gb * 30 * 0.05          # per GB/day × 30 days × $0.05
    merge        = 100
    sre          = sre_allocation * sre_fully_loaded / 12
    network      = 150
    monitoring   = 200
    return {
        "kafka": kafka, "kc_workers": kc_workers,
        "snowpipe": snowpipe, "merge": merge,
        "sre": sre, "network": network, "monitoring": monitoring,
        "total": sum([kafka, kc_workers, snowpipe, merge, sre, network, monitoring]),
    }

def fivetran_tco(events_per_day, mar_ratio=0.3, price_per_1k_mar=1.0):
    mar     = events_per_day * 30 * mar_ratio
    invoice = mar / 1000 * price_per_1k_mar
    snowflake_merge = 500   # Fivetran uses a small warehouse for MERGE; you pay Snowflake side
    return {"invoice": invoice, "snowflake": snowflake_merge, "total": invoice + snowflake_merge}

print(debezium_tco(100e6))
# {'kafka':2000,'kc_workers':700,'snowpipe':45,'merge':100,'sre':2250,'network':150,'monitoring':200,'total':5445}

print(fivetran_tco(100e6, mar_ratio=0.3))
# {'invoice':900, 'snowflake':500, 'total':1400}

Debezium DIY vs Fivetran — 100M events/day
==========================================

Debezium DIY:  $5445/month   →  $65340/year
Fivetran:      $1400/month   →  $16800/year   (4× cheaper on invoice)

BUT: PII posture, latency SLA, existing Kafka investment, or on-prem source can flip this.
Never compare only the invoice — compare the TCO against the four axes.

Step-by-step explanation.

The Kafka MSK cost is the fixed base. A 3-broker cluster on m5.large is the smallest production-grade shape; smaller shapes trade reliability for cost and are usually a false economy. Storage is EBS-backed and cheap at this volume.
Kafka Connect workers run the Debezium source and the Snowflake sink plugins. Two workers give one for redundancy; adding a third only helps at >5000 events/sec sustained. c5.xlarge is the sizing sweet spot.
Snowpipe Streaming cost scales linearly with ingested bytes — 100M events/day at ~300 bytes each is 30 GB/day, or $45/month at $0.05/GB. Predictable and small compared to compute.
The MERGE TASK cost is dominated by warehouse-hour credits. An XS warehouse (1 credit/hour) running for 30 seconds every minute is ~0.5 credit/hour × 24 hours × 30 days × $3/credit = ~$100/month. Larger warehouses or faster cadences scale this linearly.
The SRE on-call cost is the hidden 40% of the invoice. A 15% allocation ($27k/year) assumes the pipeline is stable and only needs occasional attention. New deployments or immature runbooks bump this to 25–40% for the first six months.

Output.

Cost component	Debezium DIY	Fivetran
Vendor invoice	$0	$900
Infra	$2850	$0
Snowflake credits	$145	$500
SRE on-call	$2250	$0 (vendor-owned)
Total monthly	$5445	$1400
Annual	$65k	$17k

Rule of thumb. Debezium DIY is 3–4× more expensive than Fivetran on pure invoice math for workloads under 500M events/day. The break-even flips when (a) PII cannot leave the VPC, (b) sub-minute freshness is a hard SLA, or (c) you already run Kafka and Debezium is the marginal cost of a new topic. Match the four axes before the invoice math.

Worked example — WAL slot bloat, the #1 Debezium incident

Detailed explanation. A Debezium connector stalls (crash, network partition, upstream Kafka issue) and Postgres cannot recycle WAL until the connector reconnects and confirms it has consumed the WAL. The result: Postgres disk fills, autovacuum stalls, and the whole cluster degrades. Walk through the diagnostic, the mitigation, and the alert that catches it before the disk fills.

The setup. Debezium's replication slot holds a Postgres LSN — the "last position the connector confirmed." Postgres cannot GC WAL past that LSN.
The failure. Debezium crashes; the slot LSN stops advancing; WAL accumulates.
The symptom. pg_wal directory grows; pg_replication_slots.wal_status = 'reserved' or 'extended'; pg_stat_replication.write_lsn lags.

Question. Build the alert query and the runbook for the on-call SRE when the alert fires.

Input.

Component	Value
Postgres version	16
Debezium slot	`pipecode_debezium_slot`
`pg_wal` disk	200 GB
Alert threshold	slot lag > 10 GB

Code.

-- The alert query — run every 60 seconds
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_size,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)                 AS lag_bytes
FROM   pg_replication_slots
WHERE  slot_name = 'pipecode_debezium_slot';

-- Example healthy output:
-- slot_name                | active | wal_status | lag_size | lag_bytes
-- pipecode_debezium_slot   | t      | reserved   | 32 MB    | 33554432

-- Example alarming output:
-- slot_name                | active | wal_status | lag_size | lag_bytes
-- pipecode_debezium_slot   | f      | extended   | 42 GB    | 45097156608

# Prometheus alert
groups:
- name: postgres-cdc
  rules:
  - alert: DebeziumWalSlotBloat
    expr: postgres_replication_slot_wal_lag_bytes{slot="pipecode_debezium_slot"} > 10 * 1024 * 1024 * 1024
    for: 5m
    labels:
      severity: page
    annotations:
      summary: Debezium replication slot > 10 GB WAL lag
      runbook: https://runbooks.internal/debezium-wal-slot-bloat

Runbook — DebeziumWalSlotBloat
==============================

t+0    Alert fires (WAL lag > 10 GB for 5 min)
t+2m   SRE confirms via SQL query; slot.active = false → Debezium crashed
t+3m   Check Debezium logs: `docker logs kafka-connect | tail -100`
t+5m   Root cause found (OOM, network, upstream Kafka partition offline)
t+7m   Restart Debezium: POST /connectors/pg-pipecode-source/restart
t+10m  Slot resumes advancing; lag drops back to <100 MB in 20 min
t+30m  Follow-up: increase Kafka Connect JVM heap; add extra broker capacity
CRITICAL escape hatch:
  If disk is < 10% free and slot won't reconnect within 15 min,
  drop the slot: SELECT pg_drop_replication_slot('pipecode_debezium_slot');
  → this loses CDC state; requires full re-snapshot of all tables.

Step-by-step explanation.

Postgres replication slots are the price of streaming CDC — they let a downstream consumer disconnect and resume from the same LSN, but they also hold WAL indefinitely if the consumer never resumes. This is by design; treat every slot as a potential disk-fill risk.
The alert query uses pg_replication_slots.confirmed_flush_lsn (the last LSN the consumer confirmed) and pg_current_wal_lsn() (the current WAL write position). The difference in bytes is the slot lag. Above 10 GB is worth paging; above 100 GB is imminent-disk-fill territory.
wal_status = 'reserved' is fine (slot is holding WAL within max_slot_wal_keep_size); wal_status = 'extended' means Postgres has extended WAL retention beyond the cap and disk pressure is mounting; wal_status = 'lost' means Postgres gave up and the slot can no longer resume — a full re-snapshot is required.
The runbook: confirm the slot state, check Debezium logs, restart the connector, watch the slot lag drop. In the extreme case where the connector cannot be restarted quickly and the disk is filling, the SRE must drop the slot — accepting the cost of a full re-snapshot as the price of keeping Postgres up.
The prevention: alert early (10 GB, not 100 GB); wire the Debezium liveness check into the same alerting stack; set max_slot_wal_keep_size on Postgres 13+ as an insurance policy (Postgres will drop the slot before the disk fills).

Output.

Slot state	wal_status	Lag	Action
Healthy	reserved	< 1 GB	none
Warning	reserved	1–10 GB	monitor
Alert	reserved / extended	10–50 GB	page SRE
Critical	extended	> 50 GB	page + prepare to drop slot
Lost	lost	any	re-snapshot required

Rule of thumb. WAL slot bloat is the single most common Debezium incident. Wire a slot-lag alert from day one; set max_slot_wal_keep_size as insurance; and rehearse the "drop the slot and re-snapshot" runbook before you need it. Every CDC pipeline that touches Postgres has this hazard.

Senior interview question on the Debezium architecture

A senior interviewer might ask: "You're the tech lead for a 100M-events-per-day, sub-minute Postgres → Snowflake CDC pipeline. Walk me through the Debezium + Kafka + Snowpipe Streaming design end-to-end, the production hazards you'd guard against, and where PgCat or Snowflake Openflow would tempt you to redesign."

Solution Using the four-layer Debezium blueprint with a slot-bloat alert and a hybrid escape hatch

# Layer 1 — Postgres
postgres:
  wal_level: logical
  max_replication_slots: 8
  max_slot_wal_keep_size: 50GB    # insurance policy
  replica_identity: FULL (per replicated table)
  publication: pipecode_pub
  user: debezium (REPLICATION + SELECT)

# Layer 2 — Kafka + Kafka Connect
kafka:
  broker_count: 3
  broker_type: MSK m5.xlarge
  partition_count_per_topic: 6
  replication_factor: 3
  retention_ms: 604800000        # 7 days
  compression: snappy
connect:
  workers: 3
  jvm_heap: 4G
  connector_source: debezium.PostgresConnector
  connector_sink:   snowflake.SnowflakeSinkConnector

# Layer 3 — Snowflake ingest
snowflake_ingest:
  method: SNOWPIPE_STREAMING
  landing_schema: RAW.PG_CDC
  schematization: true
  buffer_flush_time_s: 10

# Layer 4 — MERGE + observability
merge:
  task_cadence: 1 MINUTE
  warehouse: XS_WH (1 credit/hour)
  high_water_mark_table: <curated>.<table>_hwm
observability:
  slot_lag_alert: > 10 GB for 5m
  connector_health: kafka_connect_connector_state{state!="RUNNING"}
  merge_duration: task_history warehouse credits
  end_to_end_latency: max(now() - src_ts_ms) < 60s

-- Reusable MERGE template for a keyed CDC table
CREATE OR REPLACE PROCEDURE p_merge_cdc(
  landing_table  STRING,
  target_table   STRING,
  key_columns    ARRAY,
  data_columns   ARRAY
)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  merge_sql STRING;
BEGIN
  merge_sql :=
    'MERGE INTO ' || target_table || ' AS tgt
     USING (
       SELECT ' || ARRAY_TO_STRING(data_columns, ', ') || ',
              RECORD_CONTENT:__op::STRING          AS op,
              RECORD_CONTENT:__source_ts_ms::NUMBER AS ts_ms
       FROM   ' || landing_table || '
       WHERE  RECORD_CONTENT:__source_ts_ms > (SELECT COALESCE(MAX(ts_ms), 0)
                                                  FROM ' || target_table || '_hwm)
     ) AS src
     ON ' || (SELECT LISTAGG('tgt.' || v.VALUE::STRING || ' = src.' || v.VALUE::STRING, ' AND ')
                FROM  LATERAL FLATTEN(input => key_columns) v) || '
     WHEN MATCHED AND src.op = ''d'' THEN DELETE
     WHEN MATCHED AND src.op IN (''u'',''c'') THEN UPDATE SET '
       || (SELECT LISTAGG(v.VALUE::STRING || ' = src.' || v.VALUE::STRING, ', ')
             FROM  LATERAL FLATTEN(input => data_columns) v)
     || '
     WHEN NOT MATCHED AND src.op IN (''u'',''c'',''r'') THEN INSERT
       (' || ARRAY_TO_STRING(ARRAY_CAT(key_columns, data_columns), ', ') || ')
       VALUES (' || (SELECT LISTAGG('src.' || v.VALUE::STRING, ', ')
                       FROM  LATERAL FLATTEN(input => ARRAY_CAT(key_columns, data_columns)) v) || ');';
  EXECUTE IMMEDIATE :merge_sql;
  RETURN 'ok';
END;
$$;

Step-by-step trace.

Layer	Component	Latency	Failure mode	Guard
1	Postgres WAL	0 ms	Slot bloat	max_slot_wal_keep_size + alert
2	Kafka + Connect	1–5 s	Connector rebalance	replication_factor=3 + JVM heap
3	Snowpipe Streaming	5–10 s	Credit spike	buffer flush tuning
4	MERGE TASK	60 s	Warehouse queue	HWM checkpoint + XS warehouse

After rollout, the pipeline delivers 100M events/day at p99 60-second end-to-end latency, with a $5.5k/month TCO. The hybrid escape hatch (Fivetran for low-priority tables) is documented as a Day-90 review: if the SRE on-call cost exceeds 25% of headcount, migrate the non-critical tables to a managed vendor and reserve Debezium for the sub-minute critical path only.

Output:

Metric	Target	Actual
End-to-end p99 latency	< 60 s	55 s
Slot lag p99	< 500 MB	200 MB
Kafka Connect uptime	99.95%	99.97%
Monthly cost	< $6k	$5.4k
SRE on-call load	< 20%	15%

Why this works — concept by concept:

Four-layer blueprint — Postgres / Kafka / Snowflake ingest / MERGE — each layer has an SLA, a failure mode, and a guard. The mental model is sequential responsibility; a failure at layer N is diagnosed by walking layers N → 1.
WAL slot as first-class citizen — the slot lag alert is the single highest-value observability signal in the entire pipeline. Without it, the first sign of trouble is a full disk.
Reusable MERGE procedure — one Snowflake stored procedure handles the MERGE for every table via metadata. Adding a table is a config change, not a copy-paste of hundreds of lines of SQL.
Escape hatch is part of the design — the plan explicitly says "if the on-call load exceeds 25% we migrate half the tables to Fivetran." Naming the exit criteria up front prevents scope creep and gives the CFO confidence.
Cost — $5.5k/month at 100M events/day = $55 per million events, all in. Fivetran-only would be $17/million but forfeits sub-minute + VPC-only. The hybrid saves ~$30k/year on the non-critical half of the workload while preserving the four-axis tradeoff on the critical half. O(1) additional operational cost per lane.

Streaming
Topic — streaming
Streaming CDC + Debezium problems

Practice →

ETL Topic — etl ETL problems on Kafka + Snowflake pipelines

Practice →

3. Fivetran / Hevo / Airbyte managed pipelines

The hands-off managed path — the vendor owns the WAL reader, you pay per row, and Snowflake sees a clean landing table

The mental model in one line: Fivetran, Hevo, and Airbyte are the three canonical managed CDC vendors that give up ownership of the WAL reader and the connector runtime in exchange for a per-row (MAR / GB / event) invoice — Fivetran is the enterprise default with MAR pricing, Hevo undercuts on row-based pricing and lands in Snowflake near-native, and Airbyte splits into OSS (self-hosted) and Cloud (managed) tiers with a per-GB model. The trade-off is stark: your team ships zero infrastructure, your on-call load is zero, and your CFO signs a monthly invoice that scales with your data volume.

The four-axis snapshot.

Latency. 5–15 minutes typical (vendor-tuned batch windows). Sub-minute is technically available on premium tiers but roughly 3× the base cost; usually not worth it if you're picking managed.
Cost model. MAR (Fivetran), row-based (Hevo), or per-GB (Airbyte). MAR punishes wide tables with many updates; row-based punishes high-volume append logs; per-GB punishes wide JSON payloads. Choose based on the shape of your data, not the sticker price.
Ownership. Vendor owns the pipeline. You own the source Postgres, the Snowflake target, and the schema decisions; the vendor owns everything between. On-call cost is zero.
PII posture. Data plane runs in the vendor's cloud (or a customer-owned VPC on premium tiers). SOC 2 Type II is table stakes; HIPAA-eligible tiers are extra. GDPR-heavy workloads may need a VPC-hosted data plane, which raises cost.

Fivetran — the enterprise default.

Pricing. MAR (Monthly Active Rows) — distinct primary keys touched per calendar month. A row updated 100 times counts as 1 MAR; a row never touched counts as 0. Pricing tiers: ~$0.5–$2.5 per 1000 MAR depending on volume and contract.
Ingest cadence. Configurable per connector: 5 min, 15 min, 1 hour, 6 hours, 24 hours. Sub-minute is available on the Enterprise HVA tier.
Deletes. Handled natively — Fivetran writes a _fivetran_deleted boolean column on target tables. Queries that filter WHERE NOT _fivetran_deleted see only live rows.
Schema evolution. Auto-detects new columns, new tables, dropped columns. Emits a schema change event; the target table is altered in place.
PII posture. SaaS default; enterprise tier allows a customer-managed data plane. HIPAA / PHI requires a Business Associate Agreement.

Hevo Data — the row-based alternative.

Pricing. Row-based — every event (insert/update/delete) is one row. No MAR overhead. Typical: ~$0.10–$0.30 per 1000 events at moderate volumes.
Ingest cadence. Configurable per pipeline: 5 min to 24 hours. Real-time tier available for premium pricing.
Deletes. Handled as __hevo__marked_deleted column (similar to Fivetran).
Schema evolution. Automatic; UI lets you preview schema changes before applying.
PII posture. Similar to Fivetran; SOC 2, HIPAA-eligible tier available.

Airbyte — the OSS + Cloud hybrid.

Two tiers. OSS (self-hosted, free, you own infra) and Cloud (managed). OSS is a Docker / Kubernetes install; Cloud is a SaaS.
Pricing (Cloud). Per-GB of data replicated. Predictable but can spike on wide JSON payloads.
Ingest cadence. Batch-oriented: 5 min to 24 hours. Change Data Capture connector supports Postgres logical replication with configurable polling intervals.
Deletes. Handled via _ab_cdc_deleted_at column.
Schema evolution. Automatic; less mature than Fivetran on the edge cases.
PII posture. OSS gives you full VPC control; Cloud is SaaS with SOC 2.

The MAR arithmetic — Fivetran's cost trap.

The formula. MAR = distinct (source_table, primary_key) tuples touched in a calendar month. A wide table with many updates per key uses low MAR; a narrow append log uses high MAR.
The trap. Teams that quote "50M events/day" think of the total event count, not MAR. If those 50M events touch 10M distinct PKs, MAR is 300M/month. At $1/1000 MAR that's $300k/month — 10× what a rough back-of-envelope would guess.
The mitigation. Estimate MAR before signing. Use Fivetran's cost calculator with a representative table sample. Consider Hevo or Airbyte for high-MAR workloads.

When managed vendors win.

Small teams. No infra to own; vendor invoice replaces engineer headcount.
Standard schemas. Postgres logical replication with typical table shapes; no exotic column types.
Predictable volume. Vendors price per unit of data; predictable volume = predictable invoice.
Compliance-lite. SOC 2 is enough; the vendor's cloud is fine.

When managed vendors lose.

PII strict. VPC-only pipeline requirement pushes you to Debezium or Openflow BYOC.
Sub-minute freshness. Premium tiers exist but cost 3–5× the base rate; DIY is often competitive.
Very high MAR / row / GB. Above ~1B events/day the cost curve of a vendor invoice can beat a fixed infrastructure investment.
Cross-cloud. Vendors move bytes across cloud boundaries and charge for it; on-cloud DIY or Openflow can avoid the egress.

Worked example — Fivetran configuration and cost estimate

Detailed explanation. A small analytics team wants to land 20 Postgres tables into Snowflake with 15-minute freshness. Walk through the Fivetran setup — connector wizard, primary key declaration, sync mode, cost estimate — and the resulting Snowflake landing shape.

Source. RDS Postgres in the customer VPC; peering to Fivetran's VPC via AWS PrivateLink.
Target. Snowflake account in the same region.
Tables. 20 tables; total ~10M row changes/day.
Estimated MAR. ~4M distinct PKs touched/month per table = 80M MAR total.

Question. Configure the Fivetran connector, compute the monthly cost estimate, and show the resulting Snowflake landing table.

Input.

Component	Value
Source	Postgres 16 on RDS
Target	Snowflake account (us-east-1)
Tables	20
Row changes/day	10M
Estimated MAR/month	80M
Fivetran price tier	$1.50 / 1000 MAR

Code.

// Fivetran connector config (REST API — POST /v1/connectors)
{
  "service": "postgres",
  "group_id": "pipecode_group",
  "config": {
    "host":             "pg-primary.internal",
    "port":             5432,
    "database":         "pipecode",
    "user":             "fivetran_readonly",
    "password":         "***",
    "connection_type":  "PrivateLink",
    "update_method":    "WAL",
    "replication_slot": "fivetran_slot",
    "publication_name": "fivetran_pub",
    "tunnel_host":      null,
    "always_encrypted": true
  },
  "schedule_type":    "auto",
  "sync_frequency":   15,
  "paused":           false,
  "trust_certificates": false
}

-- Postgres side — prerequisites Fivetran needs
CREATE ROLE fivetran_readonly WITH LOGIN REPLICATION PASSWORD '***';
GRANT USAGE ON SCHEMA public TO fivetran_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fivetran_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fivetran_readonly;

CREATE PUBLICATION fivetran_pub FOR TABLE public.users, public.orders, ...;

-- Fivetran creates the slot on first run:
-- SELECT * FROM pg_create_logical_replication_slot('fivetran_slot', 'pgoutput');

-- Snowflake landing shape — Fivetran auto-creates these
-- One target table per source table, plus the metadata columns
DESCRIBE TABLE raw.pipecode.users;
-- Column                   | Type
-- id                       | NUMBER(38,0)
-- email                    | VARCHAR(16777216)
-- name                     | VARCHAR(16777216)
-- created_at               | TIMESTAMP_NTZ(9)
-- _fivetran_synced         | TIMESTAMP_LTZ(9)
-- _fivetran_deleted        | BOOLEAN
-- _fivetran_id             | VARCHAR(16777216)   -- only present for tables without a PK

# Cost estimate — MAR arithmetic
def fivetran_cost(mar_millions_per_month, price_per_1k_mar):
    return mar_millions_per_month * 1000 * price_per_1k_mar

print(fivetran_cost(80, 1.50))    # 80M MAR × $1.50/1000 MAR = $120000
# $120,000 per year at 80M MAR/month = $10,000/month

Step-by-step explanation.

Fivetran connects to Postgres via AWS PrivateLink (or direct network if the VPC allows it); the connector uses a read-only role with REPLICATION privilege. The publication whitelist and the replication slot are the same primitives Debezium uses — Fivetran is running the equivalent WAL reader inside its own cloud.
sync_frequency: 15 sets the batch cadence to 15 minutes — a Fivetran-optimised default that balances warehouse cost (fewer MERGEs) against freshness. The Enterprise HVA tier supports sub-minute cadences at premium pricing.
On first sync, Fivetran takes a full snapshot of every included table (op = 'r' events). This is the same "snapshot then WAL tail" pattern Debezium uses; the difference is that Fivetran handles it transparently — no config decisions required from your side.
The Snowflake target tables are auto-created with typed columns plus _fivetran_synced (last-synced timestamp), _fivetran_deleted (soft-delete flag), and _fivetran_id (only for tables without a natural PK). Queries against curated views typically filter WHERE NOT _fivetran_deleted.
The cost estimate at 80M MAR × $1.50/1000 = $10k/month. For comparison, the equivalent Debezium DIY at this volume would be ~$5k/month for the infra + ~$2.5k/month for on-call = $7.5k/month. Fivetran is 30% more expensive on invoice but 100% cheaper on operational burden — for a small team the trade-off is worthwhile.

Output.

Metric	Value
Tables synced	20
Latency p99	15 min
MAR/month	80M
Cost/month	$10k
Snowflake side cost	~$500 (Fivetran uses a small XS warehouse)
On-call cost	$0
Total TCO	$10.5k/month

Rule of thumb. For workloads under 100M MAR/month with SaaS-OK compliance and 5–15 minute freshness, Fivetran is the cheapest total-cost path — the invoice is bigger than DIY but the on-call and infra costs are zero. Above 500M MAR/month or with strict PII, redo the arithmetic.

Worked example — Airbyte OSS on Kubernetes for VPC-only compliance

Detailed explanation. A regulated workload requires that no vendor's cloud ever see the CDC stream. Airbyte OSS runs entirely in the customer VPC on Kubernetes — the connectors, the workers, the temporal server, the database — all in your cluster. The trade-off is you own the K8s deployment and the upgrade cycle, but the compliance story is airtight.

The setup. Airbyte OSS Helm chart on EKS/GKE/AKS; ~10 pods total.
The connectors. Postgres source + Snowflake destination — both from the Airbyte connector registry.
The compliance win. Every byte stays in the customer VPC.

Question. Deploy Airbyte OSS on Kubernetes with a Postgres source and Snowflake destination. Show the Helm values, the source config, and the destination config.

Input.

Component	Value
Kubernetes	EKS 1.29
Airbyte version	0.63
Postgres	16
Snowflake	account.us-east-1
CDC method	logical replication (pgoutput)

Code.

# Airbyte Helm values
global:
  airbyteYml: |
    server:
      accessManagement:
        enabled: true
      logging:
        level: INFO
    postgres:
      database: airbyte
      user: airbyte
      password: ***
      host: airbyte-db.internal
      port: 5432

workload-launcher:
  replicaCount: 2
  resources:
    requests: {cpu: "500m", memory: "1Gi"}
    limits:   {cpu: "2",     memory: "4Gi"}

worker:
  replicaCount: 4
  resources:
    requests: {cpu: "1",   memory: "2Gi"}
    limits:   {cpu: "4",   memory: "8Gi"}

// Airbyte source config — POST /api/v1/sources/create
{
  "sourceDefinitionId": "decd338e-5647-4c0b-adf4-da0e75f5a750",
  "connectionConfiguration": {
    "host":          "pg-primary.internal",
    "port":          5432,
    "database":      "pipecode",
    "username":      "airbyte_ro",
    "password":      "***",
    "schemas":       ["public"],
    "replication_method": {
      "method":                  "CDC",
      "plugin":                  "pgoutput",
      "replication_slot":        "airbyte_slot",
      "publication":             "airbyte_pub",
      "initial_waiting_seconds": 300
    },
    "ssl_mode": {"mode": "require"}
  },
  "workspaceId": "...",
  "name":        "pg-pipecode"
}

// Airbyte destination config — Snowflake
{
  "destinationDefinitionId": "424892c4-daac-4491-b35d-c6688ba547ba",
  "connectionConfiguration": {
    "host":       "abc12345.us-east-1.snowflakecomputing.com",
    "warehouse":  "xs_wh",
    "database":   "raw",
    "schema":     "airbyte_pg",
    "role":       "airbyte_role",
    "username":   "svc_airbyte",
    "credentials": {"auth_type": "Key Pair Authentication", "private_key": "..."},
    "loading_method": {
      "method":                 "Internal Staging",
      "flattening":             "Root level flattening"
    }
  }
}

-- Snowflake landing shape — Airbyte creates typed columns via flattening
DESCRIBE TABLE raw.airbyte_pg._airbyte_raw_users;
-- Column                     | Type
-- _airbyte_ab_id             | VARCHAR
-- _airbyte_emitted_at        | TIMESTAMP_TZ
-- _airbyte_data              | VARIANT      -- full row as JSON

DESCRIBE TABLE raw.airbyte_pg.users;
-- Column                    | Type
-- id                        | NUMBER
-- email                     | VARCHAR
-- name                      | VARCHAR
-- _ab_cdc_lsn               | NUMBER
-- _ab_cdc_updated_at        | TIMESTAMP_TZ
-- _ab_cdc_deleted_at        | TIMESTAMP_TZ
-- _airbyte_ab_id            | VARCHAR
-- _airbyte_emitted_at       | TIMESTAMP_TZ
-- _airbyte_normalized_at    | TIMESTAMP_TZ

Step-by-step explanation.

Airbyte OSS on Kubernetes runs entirely in the customer VPC — no data ever leaves your infrastructure. This is the compliance win over Airbyte Cloud (SaaS) and Fivetran (SaaS). The trade-off is you own the Helm upgrade cadence, the pod resource sizing, and the connector version pinning.
The Postgres source config uses the same primitives every CDC vendor uses: pgoutput plugin, a named replication slot, a publication whitelist. The initial_waiting_seconds: 300 gives Postgres time to build the snapshot before Airbyte starts consuming.
The Snowflake destination uses key-pair authentication (not password) — the modern standard for service accounts. loading_method: Internal Staging tells Airbyte to stage files in Snowflake's internal stage before COPY INTO — cheaper than an external S3 stage for small volumes.
Airbyte writes two tables per source: a _airbyte_raw_* VARIANT-only table (append-only history) and a normalized typed table with _ab_cdc_* metadata columns for CDC state. The normalized table is the analytics target; the raw table is the audit trail.
The compliance boundary is airtight: the CDC stream flows Postgres → Airbyte pods (in your VPC) → Snowflake stage (in your Snowflake account) → Snowflake target table. No vendor cloud in the path. The cost is roughly ~$3k/month in EKS + Airbyte support license, vs $10k for Fivetran on the same workload.

Output.

Layer	Component	Where it runs
WAL reader	Airbyte connector pod	Your EKS cluster
Buffer / temporal	Airbyte worker pod	Your EKS cluster
Staging	Snowflake internal stage	Your Snowflake account
Target	Snowflake table	Your Snowflake account

Rule of thumb. Airbyte OSS on Kubernetes is the cheapest compliance-safe managed-CDC pattern. It's ~30% of the Fivetran invoice at moderate volumes, but you own the K8s ops. Match this against Debezium DIY: Airbyte OSS is simpler than Debezium if you don't already run Kafka.

Worked example — MAR blowout diagnosis and mitigation

Detailed explanation. A team quotes their Postgres workload at "50M events/day" and picks Fivetran on the basis of "MAR should be small — most of our events are appends to a log table." Two months later, the Fivetran invoice arrives at $50k for a projected $5k budget. Diagnose the MAR blowout: an unindexed UPDATE loop is touching every row of a wide dimension table daily.

The estimate. 50M events/day × 30 = 1.5B events. MAR estimated at 10% = 150M MAR = ~$225/month at $1.50/1000. Estimated total: $2500/month.
The reality. MAR is 30M distinct PKs touched per day, or 900M/month = $1350/month for the dimension table alone plus the rest = $50k invoice.
The bug. A backend job runs a nightly UPDATE dim_users SET last_seen_at = now() WHERE id IN (SELECT ...) that touches every user row every day.

Question. Diagnose the MAR blowout, quantify the cost, and propose three mitigations.

Input.

Metric	Estimated	Actual
Events/day	50M	50M
Distinct PKs touched/day	5M	30M
MAR/month	150M	900M
Fivetran cost	$2.5k	$50k

Code.

-- Diagnostic — find the tables with the highest MAR
-- (Fivetran's usage dashboard shows this; the SQL below reconstructs it from Postgres)
WITH updates_per_day AS (
  SELECT relname                      AS table_name,
         n_tup_upd + n_tup_ins + n_tup_del AS row_ops_since_reset,
         stats_reset                    AS since,
         extract(epoch FROM now() - stats_reset)/86400 AS days
  FROM   pg_stat_user_tables
)
SELECT table_name,
       row_ops_since_reset::numeric / days::numeric AS avg_ops_per_day
FROM   updates_per_day
ORDER  BY avg_ops_per_day DESC
LIMIT  10;

-- Mitigation 1 — exclude the offending column from CDC
-- Fivetran side: uncheck the last_seen_at column in the schema UI
-- Postgres side: same behaviour via publication filter
ALTER PUBLICATION fivetran_pub SET (publish = 'insert, update, delete', publish_via_partition_root = false);
-- Or, if the vendor supports it, exclude the column:
--   Fivetran: schema config → last_seen_at → excluded
--   Airbyte: connection config → field selection → exclude last_seen_at

-- Mitigation 2 — batch the UPDATE at the app layer
-- Instead of updating last_seen_at on every login,
-- write to a separate append-only table and let dbt aggregate.
CREATE TABLE public.user_sessions (
  user_id   BIGINT,
  logged_in_at TIMESTAMPTZ DEFAULT now(),
  session_id UUID
);

-- Mitigation 3 — move the noisy table to Debezium
-- Keep 199 tables on Fivetran; move dim_users to Debezium + a coarser TASK schedule.

# Cost mitigation ROI calculator
def mar_cost(mar_monthly, price_per_1k=1.50):
    return mar_monthly / 1000 * price_per_1k

# Baseline (the problem)
print(mar_cost(900e6))    # $1350k/year on this one table (bug)

# Mitigation 1 — exclude last_seen_at column
# The UPDATE no longer triggers CDC (unchanged columns filter)
print(mar_cost(50e6))     # $75k/year — 18× cheaper

# Mitigation 2 — batch at app layer
print(mar_cost(30e6))     # $45k/year — 30× cheaper

Step-by-step explanation.

The diagnostic is pg_stat_user_tables — the "table with the highest average ops per day" is nearly always the source of an unexpected MAR blowout. If Fivetran's own usage dashboard is available, cross-reference; if not, this Postgres-native view is the fallback.
Mitigation 1 — exclude the noisy column from CDC. Both Fivetran and Airbyte allow per-column exclusion. Downstream analytics that need last_seen_at can compute it from the append-only session log instead. Cost drops 18× because the majority of CDC events were driven by that column update.
Mitigation 2 — change the app schema. Instead of updating a wide dim row on every login, append to a narrow session log. The dim row is only touched when a slow-moving attribute (email, name) changes. This is the "right" fix — it removes the MAR pressure at the source.
Mitigation 3 — split the table off to a different vendor. Fivetran is expensive on MAR; Debezium is cheap on high-MAR. Move dim_users to Debezium; keep the 199 other tables on Fivetran. The bill drops without changing the app.
The interview signal: naming three mitigations across the schema, the vendor config, and the architecture. Junior candidates propose one fix; senior candidates propose a hierarchy and let the team pick based on their constraints (app team's time, urgency, DIY appetite).

Output.

Mitigation	Effort	Monthly savings	Recommendation
1: exclude last_seen_at	30 minutes UI change	$45k	best short-term
2: batch at app layer	2-week app refactor	$50k	best long-term
3: move table to Debezium	1-day migration	$40k	best if 1 & 2 blocked

Rule of thumb. MAR blowouts are almost always driven by one noisy column on one table. Before signing a Fivetran contract for anything beyond ~50M MAR/month, run the pg_stat_user_tables diagnostic and identify the top 3 noisiest tables. Exclude the noisy columns before the vendor's first invoice.

Senior interview question on managed CDC vendor selection

A senior interviewer might ask: "Fivetran, Hevo, and Airbyte all support Postgres → Snowflake CDC. Walk me through the decision framework for picking one, including the MAR / row / GB pricing arithmetic, and describe the cost trap you'd guard against before signing a 12-month contract."

Solution Using a three-vendor MAR-vs-row-vs-GB decision matrix

# Decision matrix
axes:
  - vendor: fivetran
    pricing: MAR
    strength: enterprise support, sub-min HVA, HIPAA
    weakness: MAR blowout on noisy tables
    sweet_spot: < 300M MAR/month, standard schemas
  - vendor: hevo
    pricing: row-based (events)
    strength: predictable cost, no MAR overhead
    weakness: smaller ecosystem, less HA
    sweet_spot: high-volume appends, log tables
  - vendor: airbyte_cloud
    pricing: per-GB
    strength: 350+ connectors, OSS fallback
    weakness: less mature CDC connector history
    sweet_spot: wide catalog, moderate volume
  - vendor: airbyte_oss
    pricing: infra only (compute + eng time)
    strength: VPC-only, no invoice, control
    weakness: you own K8s + upgrades
    sweet_spot: strict compliance, existing K8s team

decision_flow:
  1. compliance:
     - VPC-only: airbyte_oss
     - SaaS-OK: continue
  2. volume_shape:
     - narrow_high_update: hevo (row-based)
     - wide_low_update:    fivetran (MAR)
     - very_wide_json:     airbyte_cloud (per-GB)
  3. eco_size:
     - 300+ sources needed: airbyte_cloud
     - 100-300:              fivetran
     - < 100:                any
  4. sla_ownership:
     - premium enterprise: fivetran
     - startup/mid:        any

# Concrete cost estimator
def estimate_cost(vendor, events_per_day, mar_ratio, avg_row_bytes, days=30):
    total_events = events_per_day * days
    mar = total_events * mar_ratio
    gb  = total_events * avg_row_bytes / 1e9

    if vendor == "fivetran":
        return mar / 1000 * 1.50            # $1.50 / 1000 MAR
    if vendor == "hevo":
        return total_events / 1e6 * 500     # $500 per 1M events (rough tier)
    if vendor == "airbyte_cloud":
        return gb * 10                      # $10/GB (rough Cloud pricing)
    if vendor == "airbyte_oss":
        return 3000                         # $3k/month infra + minimal support
    return None

for v in ["fivetran", "hevo", "airbyte_cloud", "airbyte_oss"]:
    print(v, "$", estimate_cost(v, 50e6, 0.3, 300))
# fivetran $ 675000  (blowout — bad mar_ratio)
# hevo $ 750
# airbyte_cloud $ 4500
# airbyte_oss $ 3000

Step-by-step trace.

Step	Question	Answer
1	VPC-only?	no → SaaS candidates in play
2	Volume shape?	50M events/day, 30% MAR ratio (high)
3	Ecosystem size?	20 Postgres tables — any vendor
4	SLA tier?	mid-market
5	Cost math	Fivetran $675k, Hevo $750, Airbyte Cloud $4.5k, Airbyte OSS $3k
6	Cost trap	Fivetran MAR blowout — 30% ratio is a bad fit
7	Final pick	Airbyte OSS ($3k) or Hevo ($750/month if the pricing tier holds)

The MAR blowout is the trap — a 30% MAR ratio (many updates touching many rows) breaks the Fivetran cost model. Hevo's row-based pricing is far cheaper here; Airbyte OSS gives full control at similar cost. The senior signal: quote all four numbers, name the trap, and let the team pick based on operational appetite.

Output:

Vendor	Estimated monthly cost	Fits axes?
Fivetran	$675k	✗ MAR blowout
Hevo	$750	✓ but check tier limits
Airbyte Cloud	$4.5k	✓
Airbyte OSS	$3k	✓ (needs K8s team)

Why this works — concept by concept:

Pricing model matches data shape — MAR punishes high-update workloads; row-based punishes high-volume appends; per-GB punishes wide JSON. Reading the workload shape before the vendor decision saves a MAR blowout.
OSS fallback — Airbyte OSS is a genuine safety net when a Cloud invoice becomes untenable. The connector configs migrate one-to-one; only the runtime moves.
Compliance as a gate — VPC-only is a hard gate that eliminates every SaaS vendor. Naming it as step 1 avoids wasting time on quotes that will never sign.
Cost trap explicit — every vendor has a pathological workload shape. The senior signal is naming which trap ends which vendor's suitability before running the invoice math.
Cost — the decision matrix is O(1) to apply and delivers a 10–100× cost range across candidates. The value of doing this once at project start dwarfs the effort of running the arithmetic on all four vendors. Contract terms are typically 12 months; get the choice right the first time.

ETL
Topic — etl
ETL problems on managed CDC vendor selection

Practice →

SQL Topic — sql SQL problems on MAR-vs-row cost modelling

Practice →

4. Estuary Flow / Materialize / Rivery — real-time CDC platforms

The streaming-SQL path — push the aggregation upstream, materialise into Snowflake at a fraction of the credit cost

The mental model in one line: Estuary Flow, Materialize, and Rivery are the three canonical real-time CDC platforms — they read the Postgres WAL, run streaming SQL over the events in a Rust or Timely-dataflow engine, and materialise the result into Snowflake — pushing the windowed aggregation upstream of the warehouse and slashing Snowflake credit spend for dashboard workloads. This is the pick when the downstream consumer is not a raw event table but a fixed-shape summary that needs sub-minute freshness.

The four-axis snapshot.

Latency. Sub-minute end-to-end for the streaming layer; the materialisation to Snowflake adds another 5–15 seconds. Dashboards are live to the second on the streaming engine and near-live on Snowflake.
Cost model. Compute-hour (Estuary, Materialize, Rivery) — you pay for the Rust dataflow runtime running your streaming SQL, regardless of throughput. Pricing scales with the complexity of the streaming SQL, not with raw event count.
Ownership. Vendor-owned pipelines. Estuary and Rivery are fully-managed SaaS; Materialize offers both Cloud and self-hosted; RisingWave is the OSS competitor.
PII posture. Varies. Estuary Cloud runs in the vendor's cloud; Materialize can self-host; Rivery is SaaS. Streaming platforms hold PII in memory longer than batch pipelines, which is a real compliance concern.

Estuary Flow — collections + materialisations.

The model. Data lands in a Flow "collection" (essentially an S3-backed change log with a schema). Transformations are streaming SQL or TypeScript; materialisations write to a destination (Snowflake, Postgres, Elasticsearch, etc.).
The streaming SQL. SQLite-flavoured; supports windows, joins, aggregations. The transform is compiled to a Rust dataflow.
Delta updates. Materialisations track state per destination row; only rows that changed are written to Snowflake, minimising warehouse credit spend.
Pricing. Compute-hour based; typical: $2–$5 per GB of change data processed.

Materialize — streaming SQL as a service.

The model. Materialize is a streaming database — you write CREATE MATERIALIZED VIEW in Postgres-compatible SQL and Materialize keeps the view continuously up-to-date over the CDC stream.
The SQL. Postgres-compatible; supports incremental joins, aggregations, window functions.
The materialisation. Push the materialised view to Snowflake via a sink connector.
Pricing. Compute-hour (mz_size); typical: $2–$8 per hour depending on cluster size.

Rivery — managed CDC + workflows.

The model. SaaS CDC + workflow orchestration. Rivery is closer to Fivetran than to Materialize but adds a workflow layer for post-CDC transformations.
The streaming SQL. Limited compared to Estuary / Materialize; better fit for straightforward CDC with light transformations.
Pricing. RPU (Rivery Pricing Units) — a hybrid MAR/row model.

RisingWave — the open-source Materialize.

The model. OSS streaming database; Postgres-compatible SQL; runs on Kubernetes.
The trade-off. No SaaS invoice, but you own the K8s cluster and the upgrade cadence.
Sweet spot. Existing K8s stack, engineering appetite for streaming SQL ops.

When streaming platforms win.

Windowed aggregations for dashboards. 5-minute revenue by product line, hourly active-user counts, etc.
Real-time enrichment. Join a fast stream (orders) against a slow-changing dimension (products) and land the enriched row.
Cost pressure on Snowflake. Snowflake credits are the biggest line item; pushing aggregation upstream can save 3–5×.
Sub-minute freshness on aggregates. Where the aggregate needs to be live, not just the raw events.

When streaming platforms lose.

Raw event access needed. If downstream consumers query the raw CDC event stream (not just an aggregate), streaming platforms are overkill — a plain CDC pipeline is simpler.
Sparse workloads. Compute-hour pricing punishes idle streaming SQL clusters. A 10-events-per-day dashboard should not run a streaming SQL cluster 24/7.
Complex analytical SQL. Streaming engines have narrower SQL support than Snowflake; complex analytical queries belong in Snowflake, not upstream.

Worked example — Estuary Flow collection + Snowflake materialisation

Detailed explanation. A team wants a live "revenue by product by 5-minute window" dashboard fed from a Postgres orders table. Instead of landing raw orders in Snowflake and aggregating there, they materialise the pre-aggregated result via Estuary Flow. Walk through the collection spec, the transformation, and the Snowflake materialisation.

Source. Postgres orders table via logical replication.
Transform. Streaming SQL: SELECT window_start, product_id, SUM(amount), COUNT(*) FROM orders GROUP BY tumble(5min), product_id.
Target. Snowflake product_revenue_5min table, keyed on (window_start, product_id).

Question. Produce the complete Estuary Flow spec — capture, collection, derivation, materialisation — with sizing math.

Input.

Component	Value
Orders/second (peak)	200
Distinct products	10 000
Window	5-minute tumbling
Snowflake target warehouse	XS
Estuary compute tier	small (2 vCPU / 4 GB)

Code.

# capture — read Postgres orders via logical replication
captures:
  pipecode/postgres-orders:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-postgres:v1
        config:
          address:  pg-primary.internal:5432
          database: pipecode
          user:     estuary_ro
          password: ***
    bindings:
      - resource:
          namespace:  public
          stream:     orders
          mode:       PRIMARY_KEY
        target: pipecode/collections/orders

collections:
  pipecode/collections/orders:
    schema:
      type: object
      required: [id, product_id, amount, ts]
      properties:
        id:         {type: integer}
        product_id: {type: integer}
        amount:     {type: number}
        ts:         {type: string, format: date-time}
        __op:       {type: string, enum: [c, u, d]}
        __ts_ms:    {type: integer}
    key: [/id]

# derivation — streaming SQL for the 5-minute aggregation
collections:
  pipecode/collections/product_revenue_5min:
    schema:
      type: object
      required: [window_start, product_id, revenue, events]
      properties:
        window_start: {type: string, format: date-time}
        product_id:   {type: integer}
        revenue:      {type: number}
        events:       {type: integer}
    key: [/window_start, /product_id]
    derive:
      using:
        sqlite:
          migrations:
            - CREATE TABLE state (window_start TEXT, product_id INTEGER, revenue REAL, events INTEGER, PRIMARY KEY (window_start, product_id));
      transforms:
        - name: aggregate
          source: pipecode/collections/orders
          lambda:
            language: sqlite
            body: |
              INSERT INTO state (window_start, product_id, revenue, events)
              VALUES (
                strftime('%Y-%m-%dT%H:%M:00', $ts, 'unixepoch',
                         'utc',
                         '-' || ((strftime('%M', $ts) % 5)) || ' minutes'),
                $product_id,
                $amount,
                1
              )
              ON CONFLICT (window_start, product_id)
              DO UPDATE SET revenue = revenue + excluded.revenue,
                            events  = events  + excluded.events;

# materialization — write the aggregated collection to Snowflake
materializations:
  pipecode/materializations/snowflake:
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-snowflake:v1
        config:
          host:      abc12345.us-east-1.snowflakecomputing.com
          account:   abc12345
          user:      svc_estuary
          role:      estuary_role
          database:  ANALYTICS
          schema:    DASHBOARD
          warehouse: XS_WH
          credentials: {authType: keyPair, privateKey: ...}
    bindings:
      - source: pipecode/collections/product_revenue_5min
        resource: {table: PRODUCT_REVENUE_5MIN, delta_updates: true}

-- Snowflake target — the pre-aggregated table
CREATE OR REPLACE TABLE analytics.dashboard.product_revenue_5min (
  window_start  TIMESTAMP_NTZ,
  product_id    NUMBER,
  revenue       NUMBER(20,2),
  events        NUMBER,
  PRIMARY KEY (window_start, product_id)
);

-- Dashboard query — fast because the aggregation already happened
SELECT window_start, product_id, revenue
FROM   analytics.dashboard.product_revenue_5min
WHERE  window_start >= DATEADD('hour', -6, CURRENT_TIMESTAMP)
ORDER  BY window_start, product_id;

Step-by-step explanation.

The capture reads Postgres logical replication and lands raw events in the pipecode/collections/orders collection — a schema-validated, S3-backed change log. This is the equivalent of Kafka topics in the Debezium path; Estuary uses S3 as its durable log.
The derivation is where the streaming SQL happens. Every incoming order event triggers the SQLite lambda, which UPSERTs into a state table keyed on (window_start, product_id). The ON CONFLICT ... DO UPDATE clause accumulates the revenue and event count per window. Estuary keeps this state in memory + persistent storage.
The materialization writes the aggregated collection to Snowflake with delta_updates: true — only rows that changed since the last flush are written. For a workload with 10k products and 5-minute windows, that's ~12k rows/hour to Snowflake instead of 720k raw events/hour.
The Snowflake side is a small MERGE-driven table. The dashboard query hits a pre-aggregated result and returns in <100 ms even for a 6-hour window — no need to sum 4M events at query time.
Cost math: Estuary compute at $600/month (small tier), Snowflake side at $200/month (XS warehouse for the tiny delta writes), total $800/month. Compare to the raw-CDC-plus-TASK approach: Fivetran at $2000/month + Snowflake TASK compute at $8000/month = $10k/month. Streaming SQL saves ~$9k/month.

Output.

Metric	Value
Orders/second peak	200
Windows/hour (10k products × 12 windows)	120k
Rows written to Snowflake/hour	~12k (only changed)
Dashboard query latency	< 100 ms
Total monthly cost	~$800

Rule of thumb. Streaming SQL platforms shine when the downstream is a fixed-shape aggregate. If the dashboard is "revenue by product by 5-minute window," push the aggregation upstream. If the dashboard is "raw order log with ad-hoc filters," land raw events and aggregate at query time.

Worked example — Materialize `CREATE MATERIALIZED VIEW` for continuous joins

Detailed explanation. A live enrichment problem: join a fast stream (orders) against a slow-changing dimension (products) so the enriched order (including product_name, category, list_price) lands in Snowflake within seconds of the source insert. Materialize handles this natively with CREATE MATERIALIZED VIEW over CDC sources.

Source A. orders — 200 rows/second from Postgres CDC.
Source B. products — 10k rows, updated ~10 times/day.
Join. orders JOIN products ON product_id.
Target. enriched_orders table in Snowflake.

Question. Configure Materialize to continuously maintain the enriched view and sink to Snowflake.

Input.

Component	Value
Orders/second	200
Products	10k
Materialize cluster size	small (2 vCPU)
Snowflake sink cadence	15 seconds

Code.

-- Materialize side — connect to Postgres via CDC
CREATE SOURCE pg_orders
  FROM POSTGRES CONNECTION postgres_conn (PUBLICATION 'materialize_pub')
  FOR TABLES (orders, products);

-- The materialised join — Materialize keeps this up-to-date incrementally
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
    o.id,
    o.customer_id,
    o.product_id,
    p.product_name,
    p.category,
    p.list_price,
    o.amount,
    o.ts
FROM   orders o
JOIN   products p ON o.product_id = p.id;

-- Sink to Snowflake (via Kafka-compatible sink or direct connector)
CREATE SINK snowflake_enriched
  FROM enriched_orders
  INTO SNOWFLAKE CONNECTION snowflake_conn (
    TABLE = 'analytics.dashboard.enriched_orders',
    UPSERT KEY = (id)
  )
  WITH (SNAPSHOT = TRUE);

-- Snowflake target
CREATE OR REPLACE TABLE analytics.dashboard.enriched_orders (
  id           NUMBER PRIMARY KEY,
  customer_id  NUMBER,
  product_id   NUMBER,
  product_name STRING,
  category     STRING,
  list_price   NUMBER(20,2),
  amount       NUMBER(20,2),
  ts           TIMESTAMP_NTZ
);

# Cost math
def materialize_cost(cluster_size_credits_per_hour=1.5, snowflake_sink_credits_per_hour=0.1):
    per_hour = cluster_size_credits_per_hour + snowflake_sink_credits_per_hour
    per_month = per_hour * 24 * 30
    usd = per_month * 3    # $3/credit
    return usd

print(materialize_cost())   # ~$3500/month

Step-by-step explanation.

The CREATE SOURCE pg_orders line connects Materialize to Postgres logical replication with the same primitives every CDC vendor uses — a publication, a replication slot. Materialize handles the WAL reading internally.
CREATE MATERIALIZED VIEW enriched_orders AS ... is Materialize's superpower: the join is incrementally maintained. When a new orders row arrives, Materialize joins it against the current products state and emits one enriched row. When a products row updates, Materialize re-emits every affected orders row. The complexity is Materialize's problem, not yours.
The CREATE SINK writes the maintained view to Snowflake via UPSERT on the primary key. Deletes propagate; updates propagate; the target table is always in sync within seconds.
Compare to a batch pipeline: raw orders and products land in Snowflake, a dbt model rebuilds enriched_orders every 5 minutes. That dbt model touches 200 × 300 = 60k orders every run, at a warehouse cost of ~$50/day. Materialize does the same work incrementally and writes only changes; the Snowflake side is nearly free.
Cost: Materialize cluster at ~$3500/month, Snowflake sink compute negligible. Compare to the dbt-batch approach: Fivetran + Snowflake compute at ~$5000/month. Materialize wins on cost and freshness.

Output.

Metric	Value
Join freshness p99	3 s
Snowflake landing latency	15 s
Materialize cluster credits	36 credits/day
Monthly cost	$3500
Dashboard freshness	20 s end-to-end

Rule of thumb. Streaming-SQL platforms with incremental view maintenance are cheaper than batch dbt for continuous joins over CDC sources. If your dbt run touches the same rows every 5 minutes on a static schedule, migrate the model to Materialize (or RisingWave for OSS) and pay only for what changed.

Worked example — RisingWave OSS deployment on Kubernetes

Detailed explanation. RisingWave is the open-source Materialize equivalent; it runs entirely on Kubernetes and speaks Postgres wire protocol. Walk through a minimal RisingWave deployment for the same enrichment workload from the previous example — no vendor invoice, full VPC control, at the cost of owning the K8s ops.

Cluster. RisingWave Helm chart on EKS; 3 compute pods + 1 meta node + 1 frontend.
Storage. S3-backed state (RisingWave's default persistence layer).
Compliance. Everything in the customer VPC.

Question. Deploy RisingWave, connect to Postgres, replicate the enriched view, and sink to Snowflake.

Input.

Component	Value
K8s	EKS 1.29
RisingWave version	2.0
Compute pods	3 × c5.2xlarge
Meta node	1 × c5.xlarge
S3 bucket	rw-state-prod

Code.

# RisingWave Helm values
image:
  repository: risingwavelabs/risingwave
  tag: v2.0.0
metaComponent:
  replicas: 1
  resources: {requests: {cpu: "2", memory: "8Gi"}}
computeComponent:
  replicas: 3
  resources: {requests: {cpu: "8", memory: "16Gi"}}
frontendComponent:
  replicas: 2
  resources: {requests: {cpu: "1", memory: "2Gi"}}
stateStore:
  hummock:
    bucket: rw-state-prod
    region: us-east-1

-- RisingWave — connect to Postgres CDC
CREATE SOURCE pg_orders
  WITH (
    connector     = 'postgres-cdc',
    hostname      = 'pg-primary.internal',
    port          = '5432',
    username      = 'risingwave',
    password      = '***',
    database.name = 'pipecode',
    schema.name   = 'public',
    slot.name     = 'risingwave_slot',
    publication.name = 'risingwave_pub'
  );

-- Create tables from the source
CREATE TABLE orders (id BIGINT PRIMARY KEY, customer_id BIGINT, product_id BIGINT, amount NUMERIC, ts TIMESTAMP)
  FROM pg_orders TABLE 'public.orders';
CREATE TABLE products (id BIGINT PRIMARY KEY, product_name VARCHAR, category VARCHAR, list_price NUMERIC)
  FROM pg_orders TABLE 'public.products';

-- The materialised view
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT o.id, o.customer_id, o.product_id, p.product_name, p.category, p.list_price, o.amount, o.ts
FROM   orders o
JOIN   products p ON o.product_id = p.id;

-- Sink to Snowflake
CREATE SINK snowflake_enriched
  FROM enriched_orders
  WITH (
    connector = 'snowflake',
    snowflake.account_identifier = 'abc12345.us-east-1',
    snowflake.database           = 'ANALYTICS',
    snowflake.schema             = 'DASHBOARD',
    snowflake.table              = 'ENRICHED_ORDERS',
    snowflake.user               = 'svc_risingwave',
    snowflake.private_key        = '***',
    type                         = 'upsert',
    primary_key                  = 'id'
  );

Cost sketch — RisingWave OSS on EKS
====================================

EKS node group (3 × c5.2xlarge + 1 × c5.xlarge + 2 × c5.large):
  ~$1200/month

S3 storage (Hummock state, ~500 GB):
  ~$12/month

CloudWatch + monitoring:
  ~$100/month

Engineering time (upgrades, on-call):
  ~$2000/month (10% eng allocation)

Total: ~$3300/month
Comparable Materialize Cloud: ~$3500/month

Step-by-step explanation.

RisingWave Helm chart deploys the meta node (control plane), compute pods (streaming SQL runtime), and frontend pods (Postgres wire-protocol endpoints). State is persisted to S3 via the Hummock storage engine, which gives durability and elastic scale.
CREATE SOURCE ... WITH connector = 'postgres-cdc' — RisingWave has a first-class Postgres CDC connector using the same pgoutput primitives Debezium uses. The replication slot and publication are RisingWave's responsibility to manage.
CREATE TABLE ... FROM pg_orders TABLE 'public.orders' binds a specific source table to a RisingWave table. RisingWave keeps this table's state up-to-date incrementally.
CREATE MATERIALIZED VIEW enriched_orders is the same declarative streaming SQL as Materialize. RisingWave's runtime is a Rust-based dataflow that incrementally maintains the join.
CREATE SINK snowflake_enriched writes to Snowflake via the built-in Snowflake sink connector with UPSERT semantics. The overall cost is comparable to Materialize Cloud but you own the K8s cluster — the trade-off is control vs invoice.

Output.

Component	Cost/month
EKS nodes	$1200
S3 storage	$12
Monitoring	$100
Eng time (10%)	$2000
Snowflake sink	$200
Total	$3512

Rule of thumb. RisingWave OSS is the right pick when the compliance story requires VPC-only, or when the team already runs a Kubernetes fleet and has engineering capacity for streaming SQL ops. For everyone else, Materialize Cloud saves the K8s ops for a comparable invoice.

Senior interview question on streaming CDC platforms

A senior interviewer might ask: "You have a Postgres source, a Snowflake target, and a live dashboard that needs 30-second freshness on a windowed aggregation of revenue by product. Walk me through whether you'd pick Estuary Flow, Materialize, RisingWave, or fall back to Snowflake-native streaming — and defend the choice on the four axes."

Solution Using an Estuary Flow materialisation with a Snowflake-native fallback plan

# Primary — Estuary Flow with delta updates to Snowflake
architecture:
  capture:      postgres CDC via pgoutput
  transform:    streaming SQL aggregating 5-min windows
  materialize:  snowflake with delta_updates
  freshness:    30s end-to-end (10s capture, 5s aggregation, 15s sink)
  cost:         $800-1500/month at this workload
  vpc_posture:  Estuary Cloud is SaaS; upgrade to Estuary BYOC for VPC-only

# Fallback — Snowflake Dynamic Tables (if greenfield 2026 Snowflake account)
snowflake_native_fallback:
  approach:     Snowpipe Streaming for raw ingest + Dynamic Tables for aggregation
  freshness:    60s (Dynamic Tables refresh cadence)
  cost:         $2000-3000/month (higher warehouse credits)
  pros:         no external vendor; native Snowflake-owned
  cons:         higher credit spend; harder to test locally

# When to migrate to native:
#   - Estuary invoice becomes untenable at scale (> 10 GB/day CDC)
#   - Snowflake credit discount agreement makes native economical
#   - Compliance requires zero external vendors

-- The Snowflake-native fallback in detail — Dynamic Tables
CREATE OR REPLACE DYNAMIC TABLE product_revenue_5min
  TARGET_LAG = '30 seconds'
  WAREHOUSE  = xs_wh
AS
SELECT
    DATE_TRUNC('minute', ts) - INTERVAL '4 minute' AS window_start,
    product_id,
    SUM(amount) AS revenue,
    COUNT(*)    AS events
FROM   raw.pg_cdc.orders
GROUP  BY 1, 2;

Step-by-step trace.

Axis	Estuary Flow	Snowflake Dynamic Tables
Latency	30 s	60 s
Cost model	Compute-hour vendor	Snowflake credits
Ownership	Estuary owns pipeline	Snowflake owns pipeline
PII posture	SaaS (or BYOC upgrade)	Snowflake-native
Monthly cost	$800–1500	$2000–3000
Migration cost	0 (chosen first)	~1 week if migrating later

The primary pick is Estuary Flow — it delivers 30-second freshness at half the cost of the Snowflake-native fallback. The fallback exists because Snowflake's Dynamic Tables are close enough in latency (60 s) that a future migration is realistic if the vendor bill grows or the compliance story tightens. Documenting the fallback path up-front is the senior signal: you know where you'd go if the primary choice failed.

Output:

Metric	Estuary primary	Snowflake fallback
End-to-end p99	30 s	60 s
Snowflake credit spend	$200	$2500
Vendor invoice	$800	$0
Total	$1000	$2500
Chosen	✓	fallback

Why this works — concept by concept:

Push aggregation upstream — the streaming SQL runs in Estuary's Rust dataflow, not in a Snowflake warehouse. Only the delta writes to Snowflake, saving 5-10× on warehouse credits.
Fallback planned — Snowflake Dynamic Tables provide the same shape at 2× the latency and 2.5× the cost. Documenting the fallback means the team knows the migration path if the vendor bill grows.
Delta updates — Estuary's materialisation writes only rows that changed since the last flush. For a 10k-product dashboard with 5-minute windows, that's ~12k rows/hour instead of 720k raw events/hour.
PII escape hatch — Estuary Cloud is SaaS by default; the BYOC upgrade path exists if PII compliance flips to VPC-only. Naming the upgrade in the plan avoids a later "we have to migrate everything" incident.
Cost — $1000/month for 30-second freshness on the aggregate. Comparable Snowflake-native would be $2500. The streaming platform's compute-hour pricing is cheaper at this workload; above ~10 GB/day of CDC the Snowflake-native picks up. Re-evaluate quarterly. O(1) additional operational complexity — Estuary owns the pipeline.

Streaming
Topic — streaming
Streaming SQL and windowed aggregation problems

Practice →

SQL Topic — sql SQL problems on materialised views and CDC

Practice →

5. Native Snowflake features (Openflow) + hybrid DIY

Snowflake owns the pipeline (Openflow) or you do (Kafka Connector, DIY landing) — four axes decide

The mental model in one line: Snowflake Openflow is Snowflake's 2024 answer to Fivetran — a Snowflake-owned CDC pipeline where you buy the ingest as a first-class Snowflake feature, priced in credits — while the Snowflake Kafka Connector (Streaming API) and the DIY S3 → Snowpipe → Streams+Tasks pattern remain the two "native ends" of the design space: Snowflake takes ownership at one end, you take full ownership at the other. The decision between the three collapses to the same four axes as the rest of this guide, now applied to Snowflake-native primitives.

The four-axis snapshot.

Latency. Openflow: 1–15 minutes tunable. Kafka Connector with Streaming: sub-minute. DIY landing: minutes to hours depending on batch shape.
Cost model. Openflow: Snowflake credits (bundled with your Snowflake account); Kafka Connector: Snowflake credits for Streaming + your Kafka cluster; DIY: your S3 + Snowpipe credits + TASK compute.
Ownership. Openflow: Snowflake-owned pipeline; Kafka Connector: hybrid (Snowflake owns the sink, you own Kafka); DIY: fully yours.
PII posture. Openflow BYOC: customer-hosted data plane; Kafka Connector: your Kafka in your VPC; DIY: your S3 + your VPC. All three keep data in your control on some tier.

Snowflake Openflow — the Snowflake-owned CDC pipeline.

The launch context. Openflow launched in 2024 as Snowflake's answer to Fivetran / Airbyte / Hevo. It puts Snowflake itself on the hook for the connector, the ingest, and the MERGE.
The connector catalogue. Postgres, MySQL, SQL Server, MongoDB, Salesforce, and dozens of SaaS sources. The Postgres connector uses logical replication (pgoutput).
The deployment model. Cloud (Snowflake runs the data plane in its cloud) or BYOC (you run the data plane in your VPC on Kubernetes; Snowflake runs the control plane).
Pricing. Credits per row processed + credits for the data plane compute. Bundled into the standard Snowflake invoice. A rough estimate: 2–5× cheaper than Fivetran at moderate volumes.
The catch. Newer than the mature vendors; connector catalogue smaller; less battle-tested for edge cases.

Snowflake Kafka Connector (Streaming API) — the DIY-adjacent choice.

The pattern. You own Kafka; you own Debezium (or another source); the Snowflake Kafka Connector is the sink. It's the last third of the Debezium blueprint from H2-2, without the first two thirds.
When it fits. You already run Kafka; the Debezium + Kafka side is a stable investment; you want Snowflake to own only the sink layer.
Cost. Snowflake credits for Streaming ingest; your infra cost for Kafka + Kafka Connect.

DIY S3 → Snowpipe → Streams + Tasks — the "classic native" pattern.

The pattern. Your CDC producer (Debezium, custom Python, Airbyte) writes JSON files to an S3 bucket; Snowpipe auto-ingests them into a landing table; Snowflake Streams track the appends; Tasks run MERGE.
When it fits. Batch-oriented CDC (10-minute+ freshness); existing S3 pipeline; strict cost discipline.
The trade-off. Latency floor is ~1 minute (Snowpipe polling interval); cost is very cheap at low volumes; state management (Streams checkpoints) adds complexity.

Snowflake Streams + Tasks — the CDC-inside-Snowflake primitive.

What Streams are. A change tracking mechanism on any Snowflake table. CREATE STREAM s_users ON TABLE raw.users gives a per-consumer view of "rows changed since the last consume."
What Tasks are. Scheduled or DAG-triggered SQL runs. Combined with Streams, they implement the "MERGE new events into curated" pattern.
The pattern. Land raw events (via any of the ingest paths above) → attach a Stream → run a Task on a cadence → MERGE the Stream into the curated table.

Snowflake Dynamic Tables — the declarative alternative.

What they are. CREATE DYNAMIC TABLE ... TARGET_LAG = '1 minute' AS SELECT ... — declaratively-refreshed materialised tables that Snowflake keeps up-to-date within the target lag.
Why they matter. Replace Streams + Tasks with a single declarative statement. Snowflake picks the refresh strategy (incremental or full) based on the SQL shape.
Cost. Compute time to refresh, charged in credits. Comparable to Task-driven MERGE but simpler to author.

When to pick each Snowflake-native path.

Openflow. Greenfield in 2026 with no existing pipeline; you already have a Snowflake contract; managed CDC without a Fivetran invoice; Snowflake credit discounts make it economical.
Kafka Connector. You already run Kafka; the source-side CDC is stable; you want Snowflake to own only the sink.
DIY landing. Batch-oriented workloads; existing S3 pipeline; strict cost discipline; comfort with Streams + Tasks or Dynamic Tables.

Worked example — Snowflake Openflow config for Postgres source

Detailed explanation. A greenfield 2026 team is picking their first CDC pipeline. Snowflake is the warehouse; the team has a Snowflake contract with volume discounts. Openflow is the natural pick — no vendor invoice; the credits are already bundled. Walk through the Openflow deployment.

Source. Postgres 16 in the customer AWS VPC.
Data plane. BYOC on EKS in the same VPC (compliance).
Target. Snowflake account in the same region.
Volume. 30M events/day.

Question. Configure Openflow BYOC for the Postgres source and demonstrate the cost math.

Input.

Component	Value
Data plane deployment	BYOC on EKS
Openflow control plane	Snowflake-managed
Postgres tables	50
Events/day	30M
Snowflake account tier	Enterprise + volume discount

Code.

# Openflow BYOC data plane — Snowflake-provided Helm chart
apiVersion: v1
kind: Namespace
metadata:
  name: snowflake-openflow

---
# Helm values.yaml
imagePullSecrets:
  - name: snowflake-openflow-registry
controlPlaneEndpoint: openflow-control.us-east-1.snowflakecomputing.com
account: abc12345
dataPlaneName: pipecode-prod
dataPlaneRegion: us-east-1

runtime:
  replicas: 3
  resources:
    requests: {cpu: "2", memory: "8Gi"}
    limits:   {cpu: "8", memory: "16Gi"}

storage:
  s3Bucket: openflow-state-prod
  region: us-east-1

// Openflow Postgres connector config — via Openflow UI or API
{
  "connector":         "postgresql",
  "name":              "pg-pipecode",
  "config": {
    "host":            "pg-primary.internal",
    "port":            5432,
    "database":        "pipecode",
    "user":            "openflow_ro",
    "password":        "***",
    "schemas":         ["public"],
    "replicationSlotName":  "openflow_slot",
    "publicationName":      "openflow_pub",
    "captureMode":     "logical_replication",
    "snapshotMode":    "initial",
    "batchSize":       10000,
    "flushInterval":   "30s"
  },
  "target": {
    "type":    "snowflake",
    "database": "RAW",
    "schema":   "PG_OPENFLOW",
    "warehouse": "OPENFLOW_XS_WH"
  }
}

-- Snowflake side — Openflow auto-creates the landing tables
DESCRIBE TABLE raw.pg_openflow.users;
-- Column                    | Type
-- ID                        | NUMBER
-- EMAIL                     | STRING
-- NAME                      | STRING
-- CREATED_AT                | TIMESTAMP_NTZ
-- _OPENFLOW_OP              | STRING  -- c/u/d/r
-- _OPENFLOW_TS_MS           | NUMBER
-- _OPENFLOW_LSN             | STRING

-- Recommended: use Dynamic Tables downstream for the curated layer
CREATE OR REPLACE DYNAMIC TABLE curated.users
  TARGET_LAG = '1 minute'
  WAREHOUSE  = xs_wh
AS
SELECT
    id, email, name, created_at, _OPENFLOW_TS_MS AS updated_ts_ms
FROM   raw.pg_openflow.users
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _OPENFLOW_TS_MS DESC) = 1
   AND  _OPENFLOW_OP != 'd';

# Openflow cost math
def openflow_cost(events_per_day, data_plane_credits_per_hour=2, control_plane_credits_per_month=1000,
                  snowflake_credit_price=3):
    data_plane_credits_per_month = data_plane_credits_per_hour * 24 * 30
    total_credits_per_month = data_plane_credits_per_month + control_plane_credits_per_month
    return total_credits_per_month * snowflake_credit_price

print(openflow_cost(30e6))    # (2*24*30 + 1000) * 3 = 7320 → ~$7300/month

Step-by-step explanation.

Openflow BYOC deploys a data plane in the customer VPC via a Snowflake-provided Helm chart. The control plane (job orchestration, connector metadata) runs in Snowflake's cloud; the data plane (actual data movement) runs in yours — the compromise between compliance and Snowflake-managed ops.
The Postgres connector uses the same primitives every CDC tool uses — logical replication, publication, slot. The Openflow UI configures these; the resulting connector state is stored in the control plane and executed on the data plane.
The landing tables auto-created in Snowflake include Openflow-branded metadata columns (_OPENFLOW_OP, _OPENFLOW_TS_MS, _OPENFLOW_LSN) that mirror Debezium's envelope. Downstream MERGE (or Dynamic Table) logic uses these to build the curated layer.
The recommended curated pattern uses Dynamic Tables with a 1-minute target lag. Snowflake handles the incremental refresh automatically; there's no TASK / Stream boilerplate. The QUALIFY clause dedupes on the primary key with the latest event timestamp winning; the _OPENFLOW_OP != 'd' filter excludes tombstones.
Cost: ~$7300/month at 30M events/day with the BYOC data plane. For comparison, Fivetran on the same workload would be ~$8k with a MAR blowout risk; Openflow is bundled into the existing Snowflake invoice with no separate vendor to negotiate with.

Output.

Metric	Value
End-to-end p99 latency	90 s
Data plane credits/month	1440
Control plane credits/month	1000
Snowflake curation credits	500
Total credits/month	~2940
USD/month	~$7300

Rule of thumb. Openflow is the greenfield 2026 pick when the team already has a Snowflake contract and no existing CDC investment. The credit-based pricing folds into the existing Snowflake negotiation and the BYOC option threads the PII needle. Snowflake owning the pipeline reduces the vendor count in your stack.

Worked example — DIY S3 → Snowpipe → Streams + Tasks

Detailed explanation. For batch-oriented CDC at 10-minute+ freshness, the classic Snowflake-native pattern is S3 landing + Snowpipe auto-ingest + Streams + Tasks. Walk through the end-to-end setup for a 20-table CDC pipeline where the CDC producer is a lightweight Python script that reads Postgres logical replication and writes JSONL files to S3 every minute.

CDC producer. Python + psycopg2 reading logical replication; writes JSONL to S3.
Landing. Snowpipe auto-ingest via SQS notifications.
Curation. Streams + TASK MERGE every 5 minutes.

Question. Wire up the DIY landing pattern end-to-end.

Input.

Component	Value
CDC producer	custom Python
S3 landing bucket	pipecode-cdc-landing
Snowpipe	auto-ingest via SQS
Curation cadence	5 minutes

Code.

# CDC producer — psycopg2 logical replication
import psycopg2
import psycopg2.extras
import json
import boto3
import io
import gzip
from datetime import datetime

s3 = boto3.client("s3")
BUCKET = "pipecode-cdc-landing"

def stream():
    conn = psycopg2.connect(
        "host=pg-primary.internal dbname=pipecode user=cdc_user password=***",
        connection_factory=psycopg2.extras.LogicalReplicationConnection,
    )
    cur = conn.cursor()
    cur.start_replication(
        slot_name="python_cdc_slot",
        decode=True,
        options={"proto_version": "1", "publication_names": "python_cdc_pub"},
    )

    buffer, last_flush = [], datetime.utcnow()
    def on_msg(msg):
        buffer.append(msg.payload)
        if len(buffer) > 1000 or (datetime.utcnow() - last_flush).seconds > 60:
            flush(buffer, last_flush)
            buffer.clear()

    cur.consume_stream(on_msg)

def flush(rows, since):
    key = f"raw/{since:%Y/%m/%d/%H}/batch-{since:%M%S}.jsonl.gz"
    body = io.BytesIO()
    with gzip.GzipFile(fileobj=body, mode="w") as f:
        for r in rows:
            f.write((r + "\n").encode())
    body.seek(0)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.read())

-- Snowflake side — external stage + Snowpipe
CREATE OR REPLACE STAGE raw.pipecode.s3_landing
  URL = 's3://pipecode-cdc-landing/raw/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
  FILE_FORMAT = (TYPE = JSON COMPRESSION = GZIP);

CREATE OR REPLACE TABLE raw.pipecode.events_landing (
  event VARIANT,
  loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

CREATE OR REPLACE PIPE raw.pipecode.pipe_events
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.pipecode.events_landing (event)
  FROM   @raw.pipecode.s3_landing
  FILE_FORMAT = (TYPE = JSON);

-- Create a Stream on the landing table
CREATE OR REPLACE STREAM raw.pipecode.stream_events
  ON TABLE raw.pipecode.events_landing
  APPEND_ONLY = TRUE;

-- MERGE task using the Stream
CREATE OR REPLACE TASK raw.pipecode.task_merge_users
  WAREHOUSE = xs_wh
  SCHEDULE  = '5 MINUTE'
AS
MERGE INTO curated.users AS tgt
USING (
  SELECT event:id::NUMBER AS id, event:email::STRING AS email,
         event:name::STRING AS name, event:__op::STRING AS op,
         event:__ts_ms::NUMBER AS ts_ms
  FROM   raw.pipecode.stream_events
  WHERE  event:__table::STRING = 'users'
) AS src
ON tgt.id = src.id
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED AND src.op IN ('u','c') THEN UPDATE SET email = src.email, name = src.name
WHEN NOT MATCHED AND src.op IN ('u','c','r') THEN INSERT VALUES (src.id, src.email, src.name);

ALTER TASK raw.pipecode.task_merge_users RESUME;

Step-by-step explanation.

The CDC producer is a lightweight Python script that reads Postgres logical replication and buffers rows into a gzipped JSONL blob. Every 60 seconds (or 1000 rows) it flushes to S3 at a partitioned path — raw/YYYY/MM/DD/HH/batch-NNNNNN.jsonl.gz. Total cost: one t3.small EC2 or Fargate task at ~$20/month.
The Snowflake external stage points at the S3 bucket; the Snowpipe with AUTO_INGEST = TRUE listens on the SQS notification queue and copies new files into the events_landing table within seconds of the file landing. Snowpipe credits are charged per file processed — cheap at 1-minute batch cadence.
The Stream on events_landing gives an append-only "changes since last consume" view. The APPEND_ONLY = TRUE flag optimises for the case where the landing table is only ever appended to (never updated).
The TASK runs every 5 minutes, reads the Stream (consuming its offset), and MERGEs into the curated users table. The MERGE branches handle inserts, updates, and deletes based on the __op field. Consuming the Stream advances its checkpoint atomically — no double-processing.
Cost: EC2 producer $20 + Snowpipe credits ~$50 + MERGE TASK credits ~$200 = $270/month. For 30M events/day at 5-minute freshness this is unbeatable on the four axes — but the latency floor is ~2 minutes (Snowpipe polling + TASK cadence), so this pattern loses to Openflow / Debezium for sub-minute workloads.

Output.

Component	Cost/month
CDC producer (EC2/Fargate)	$20
Snowpipe ingest	$50
MERGE TASK credits	$200
Total	$270

Rule of thumb. DIY S3 → Snowpipe → Streams + Tasks is the cheapest CDC pattern that exists — but only when 5-minute+ freshness is acceptable and the team can write and maintain a small CDC producer. For sub-minute freshness or no-Python-in-the-pipeline requirements, one of the other four architectures wins.

Worked example — the four-axis decision tree, applied end-to-end

Detailed explanation. With all five architectures on the table, the decision tree is the senior interview's endgame. Walk through the decision tree end-to-end for four representative workloads and show which architecture each one picks.

Workload A. 500 tables, 50M events/day, 5-min freshness, SaaS-OK, small team.
Workload B. 200 tables, 100M events/day, sub-minute freshness, VPC-only, mid team.
Workload C. 20 tables, 30M events/day, streaming aggregate for dashboard, SaaS-OK.
Workload D. Greenfield 2026, Snowflake volume discount, 40 tables, 20M events/day, 2-min freshness.

Question. Apply the decision tree to each workload and defend the pick.

Input.

Workload	Tables	Events/day	Freshness	PII	Team
A	500	50M	5 min	SaaS OK	small
B	200	100M	sub-min	VPC-only	mid
C	20	30M	streaming agg	SaaS OK	small
D	40	20M	2 min	SaaS OK	greenfield SF discount

Code.

decision_tree:
  latency:
    sub_minute:
      pii:
        vpc_only: debezium_or_openflow_byoc
        saas_ok: debezium_or_estuary
    minutes:
      cost_pressure:
        high: airbyte_oss_or_diy_s3_snowpipe
        low: fivetran_or_openflow
    streaming_aggregate:
      any: estuary_or_materialize_or_risingwave

  workload_type:
    raw_event_land: cdc_pipeline
    windowed_aggregate: streaming_platform
    slow_dimensions: batch_load_or_diy

decisions:
  workload_A:
    latency: 5min -> minutes
    cost_pressure: low
    pick: fivetran (small team, MAR budget, SaaS OK)
  workload_B:
    latency: sub-min
    pii: VPC-only
    pick: debezium_kafka_snowpipe (owns cluster, sub-min, VPC-safe)
  workload_C:
    workload_type: streaming aggregate
    pick: estuary_flow (aggregation upstream, Snowflake credits saved)
  workload_D:
    context: greenfield, Snowflake discount
    pick: snowflake_openflow (Snowflake-owned; credit discount)

-- Sanity check — cost matrix
WITH workloads(name, events_per_day, freshness_min, pii, arch, monthly_usd) AS (
  VALUES
    ('A', 50e6,  5,   'saas',  'fivetran',            10000),
    ('B', 100e6, 0.5, 'vpc',   'debezium',             8000),
    ('C', 30e6,  1,   'saas',  'estuary',              1500),
    ('D', 20e6,  2,   'saas',  'openflow',             5000)
)
SELECT * FROM workloads;

Step-by-step explanation.

Workload A — 5-minute freshness + small team + SaaS OK is Fivetran's sweet spot. The MAR estimate at 50M events/day × 30 × ~10% distinct = 150M MAR/month = ~$10k at $1.5/1000 MAR. No compliance blocker; small team can't afford DIY.
Workload B — sub-minute + VPC-only ends every SaaS vendor. Debezium + Kafka + Snowpipe Streaming is the survivor; Openflow BYOC is the alternative. Debezium wins on maturity and Kafka ecosystem; Openflow BYOC wins on Snowflake-owned pipeline. For a mid team the answer is Debezium; for a Snowflake-first team it's Openflow.
Workload C — the analytics is a streaming aggregate. Estuary Flow or Materialize; Estuary wins on the cost side (compute-hour is cheap at 30M events/day). Snowflake credits saved on the aggregation offset the vendor invoice.
Workload D — greenfield 2026 with a Snowflake volume discount is Openflow's sweet spot. The credits are bundled; no new vendor to negotiate with; the connector catalogue covers Postgres. 2-minute freshness is comfortably in Openflow's SLA range.
The interview signal: the decision tree runs in the same order every time (latency → PII → workload type → cost pressure → team size). Naming the axes and applying them mechanically is the senior move; picking an architecture based on personal preference is the junior move.

Output.

Workload	Pick	Cost/month	Latency
A	Fivetran	$10k	5 min
B	Debezium + Kafka + Snowpipe	$8k	30 s
C	Estuary Flow	$1.5k	30 s (agg)
D	Openflow	$5k	90 s

Rule of thumb. The four-axis decision tree gives a single defensible answer per workload. When two candidates tie, use team familiarity and vendor contract state as tiebreakers. Never pick a CDC architecture from vendor materials alone — the decision framework is your responsibility.

Senior interview question on Snowflake-native CDC

A senior interviewer might ask: "Walk me through Snowflake Openflow, the Snowflake Kafka Connector with Streaming API, and DIY S3 → Snowpipe → Streams + Tasks. Which one do you pick for a 40-table Postgres source at 2-minute freshness on a 2026 Snowflake Enterprise account, and why not Fivetran?"

Solution Using Snowflake Openflow with a Dynamic Tables curation layer

# Openflow deployment
control_plane: snowflake_managed
data_plane:    byoc_on_eks (VPC parity with Postgres source)
connector:     postgresql (logical replication, pgoutput)
target:        raw.pg_openflow.* (auto-created tables)

# Curation via Dynamic Tables
dynamic_tables:
  target_lag: 1 minute
  refresh:    incremental (Snowflake auto-selects)
  warehouse:  xs_wh (auto-scale off)

# Compliance
pii_handling:
  ssn:  connector-level drop
  email: connector-level hash (sha256)
  phone: connector-level hash (sha256)

# Cost
credits_per_month: ~2000
usd_per_month:     ~$6000 (with volume discount)

-- Curated Dynamic Table — replaces TASK + MERGE boilerplate
CREATE OR REPLACE DYNAMIC TABLE curated.users
  TARGET_LAG = '1 minute'
  WAREHOUSE  = xs_wh
AS
SELECT
    id,
    email,
    name,
    created_at,
    updated_at,
    _OPENFLOW_TS_MS AS __src_ts_ms
FROM   raw.pg_openflow.users
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _OPENFLOW_TS_MS DESC) = 1
   AND  _OPENFLOW_OP != 'd';

-- Downstream analytics reads curated.users; the pipeline is entirely native
CREATE OR REPLACE DYNAMIC TABLE curated.user_stats
  TARGET_LAG = '5 minutes'
  WAREHOUSE  = xs_wh
AS
SELECT DATE_TRUNC('day', created_at) AS signup_day,
       COUNT(*)                       AS signups
FROM   curated.users
GROUP  BY 1;

Step-by-step trace.

Layer	Component	Latency contribution
1	Postgres WAL commit	0
2	Openflow data plane	20 s
3	Snowflake raw landing	5 s
4	Dynamic Table refresh	60 s
5	End-to-end p99	90 s

The pipeline is entirely Snowflake-owned: Snowflake owns the connector, the data plane runtime, the ingest, and the curated Dynamic Table. The only piece your team owns is the SQL that defines the Dynamic Table and the compliance-driven connector configuration. Fivetran was ruled out because Openflow is bundled into the Snowflake credit contract (no separate invoice) and the pipeline stays inside a single vendor relationship.

Output:

Metric	Value
Vendor invoices	0 (Snowflake only)
End-to-end p99	90 s
Monthly cost	$6000
Ownership	Snowflake
Compliance	BYOC data plane keeps CDC in VPC
Curation complexity	5 lines of Dynamic Table SQL

Why this works — concept by concept:

Single-vendor stack — one invoice, one support relationship, one negotiation. For teams already committed to Snowflake, the reduction in vendor count is a strategic win.
Dynamic Tables replace TASK + MERGE — the declarative refresh is authored in one CREATE DYNAMIC TABLE statement instead of a Stream + Task + MERGE trio. Snowflake picks the refresh strategy; the team stays in SQL.
BYOC threads the PII needle — the data plane runs in the customer VPC on EKS, so the CDC stream never crosses a vendor boundary. Openflow control plane holds only metadata, not row data.
Credit-based pricing — the CDC pipeline is priced in the same credits as the analytics workload. Volume discounts apply to both; the CFO sees one line item instead of two.
Cost — ~$6k/month at 20M events/day with a Snowflake Enterprise volume discount. Comparable Fivetran would be ~$8k + separate invoice. Openflow is 25% cheaper and consolidates the vendor stack. O(0) additional operational overhead — Snowflake owns the pipeline SLAs.

ETL
Topic — etl
ETL problems on Snowflake-native CDC patterns

Practice →

SQL
Topic — sql
SQL problems on Snowflake Streams + Tasks + Dynamic Tables

Practice →

Cheat sheet — Postgres → Snowflake CDC recipes

The four-axis reduction. Every CDC decision compresses to (1) latency SLA (seconds vs minutes vs hours), (2) cost model (MAR / row / GB / compute-hour / credits), (3) ownership (vendor / hybrid / DIY), (4) PII posture (SaaS OK vs VPC-only). Name the four before naming a vendor.
Debezium Postgres connector JSON. plugin.name = pgoutput, slot.name = pipecode_slot, publication.name = pipecode_pub, snapshot.mode = initial, tombstones.on.delete = false, transforms = unwrap with add.fields = op,ts_ms,source.ts_ms,source.lsn. Register via Kafka Connect REST.
Postgres prerequisites for CDC. wal_level = logical, max_replication_slots ≥ N + 2, max_wal_senders ≥ N + 2, max_slot_wal_keep_size = 50GB (insurance), a replication role (REPLICATION + SELECT), a publication whitelist, and REPLICA IDENTITY FULL on every replicated table.
Fivetran cost estimate formula. MAR = distinct primary keys touched per calendar month; monthly cost = MAR / 1000 × $1.5 at moderate volume. Cost trap: high update rates on wide tables produce MAR blowout. Estimate MAR from pg_stat_user_tables before signing.
Hevo vs Airbyte pricing shape. Hevo: row-based (~$500 per 1M events on mid-tier). Airbyte Cloud: per-GB (~$10/GB). Airbyte OSS: infra-only (~$3k/month at moderate volume). Match the pricing model to the data shape, not to the vendor's marketing.
Estuary Flow collection spec. capture reads Postgres CDC → collection with a JSON schema and a key array → derivation runs streaming SQL (SQLite lambda) → materialization writes to Snowflake with delta_updates: true. Push the aggregation into the derivation; land the aggregate, not raw events.
Snowflake MERGE INTO template for CDC delta. MERGE INTO tgt USING (SELECT ... WHERE __ts_ms > (SELECT MAX(ts_ms) FROM tgt_hwm)) AS src ON tgt.pk = src.pk WHEN MATCHED AND src.op = 'd' THEN DELETE WHEN MATCHED AND src.op IN ('u','c') THEN UPDATE SET ... WHEN NOT MATCHED AND src.op IN ('u','c','r') THEN INSERT VALUES (...). High-water-mark table caps each run.
Snowpipe Streaming Kafka Connector config. snowflake.ingestion.method = SNOWPIPE_STREAMING, snowflake.enable.schematization = true, buffer.flush.time = 10, buffer.size.bytes = 5000000. One sink per Kafka topic; the connector auto-creates typed landing tables.
Snowflake Dynamic Table for curation. CREATE OR REPLACE DYNAMIC TABLE curated.x TARGET_LAG = '1 minute' WAREHOUSE = xs_wh AS SELECT ... QUALIFY ROW_NUMBER() OVER (PARTITION BY pk ORDER BY __ts_ms DESC) = 1 AND __op != 'd'. Replaces TASK + MERGE with declarative SQL.
Delete handling. Debezium: op = 'd' in the envelope; the MERGE's WHEN MATCHED AND src.op = 'd' THEN DELETE translates it. Fivetran / Hevo / Airbyte: soft-delete columns (_fivetran_deleted, __hevo__marked_deleted, _ab_cdc_deleted_at) that downstream queries filter on. Openflow: _OPENFLOW_OP = 'd' similar to Debezium.
PII gates at the connector. Debezium SMTs (hash, drop) at the source; Fivetran column-level hashing (SHA-256) in the UI; Airbyte field selection in the connection config; Openflow schema-level PII rules. Gate PII before it hits Snowflake, not after.
WAL slot bloat alert. pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) > 10 GB for 5 minutes = page on-call. Set max_slot_wal_keep_size = 50GB on Postgres 13+ as insurance. The #1 Debezium production incident.
Openflow BYOC deployment. Helm chart on EKS/GKE for data plane; Snowflake-managed control plane; credit-based pricing; VPC-only compliance. The 2026 pick for teams already committed to Snowflake with a volume discount.
Two-lane hybrid pattern. Debezium for the sub-minute critical table(s); Fivetran or Openflow for the bulk of the tables. Splits the cost — sub-minute lane on DIY infra, bulk lane on vendor invoice. Typical cost savings: 30–50% over either extreme.
Never invent latency numbers. Snowpipe Streaming: ~5s ingest. Debezium: ~1s to Kafka. Fivetran default: 5–15 min. Openflow: 1–15 min tunable. Estuary: 10–30s end-to-end for aggregates. Match the vendor's published SLA against the product requirement.

Frequently asked questions

What is the cheapest Postgres → Snowflake CDC path?

The cheapest path depends on both the volume and the freshness SLA — there is no single "cheapest vendor." At 5-minute+ freshness and moderate volume (< 100M events/day), DIY S3 → Snowpipe → Streams + Tasks wins on pure infrastructure cost — roughly $250-$500/month all-in for a small Python CDC producer + Snowpipe credits + a Task-driven MERGE. At sub-minute freshness the same DIY pattern doesn't hit the SLA, so Snowflake Openflow with credit-based pricing (or Airbyte OSS on your existing Kubernetes cluster) becomes the cheapest sub-minute path. Above ~500M events/day, Debezium + Kafka + Snowpipe Streaming amortises the infra cost across enough data that per-million-rows drops below any managed vendor invoice. Cheapest-on-paper vendor for standard 5–15 minute workloads is Hevo with row-based pricing (~$500 per 1M events on mid-tier); the trap is that pricing can shift on high-volume tiers.

Which Postgres → Snowflake CDC architecture gives sub-minute latency?

Three architectures reliably hit sub-minute end-to-end. Debezium + Kafka + Snowpipe Streaming is the gold-standard sub-minute DIY path — 30–60 second p99 with a 1-minute MERGE TASK on the curated layer, extendable to sub-30-second by dropping the TASK to 30-second cadence or migrating to Snowflake Dynamic Tables with a 30-second target lag. Snowflake Openflow with Snowpipe Streaming ingest hits 60–90 seconds end-to-end without the DIY infrastructure — Snowflake owns the connector runtime and prices in credits. Estuary Flow with Snowflake materialisation hits 20–30 seconds on aggregates (windowed sums, joins) because the aggregation happens upstream and only delta writes hit Snowflake. Fivetran and Hevo can hit sub-minute on premium tiers but at 3–5× the base cost; Airbyte's CDC connector is 5-minute+ by design and is not the right pick for sub-minute workloads.

Do I need Kafka to do Postgres → Snowflake CDC?

No — Kafka is one option, not a requirement. The five architectures in this guide split by whether Kafka is in the path: Debezium + Kafka + Snowpipe Streaming is the Kafka path; Fivetran / Hevo / Airbyte run their own internal WAL readers with no Kafka; Estuary Flow uses S3 as a durable log instead; Materialize / RisingWave hold state internally; Snowflake Openflow uses its own runtime; DIY S3 → Snowpipe skips Kafka entirely. Kafka wins when (a) you already run it for other workloads (marginal cost near zero), (b) the CDC stream has multiple downstream consumers (Snowflake + a real-time OLTP replica + a search index), or (c) you need the exactly-once semantics and replay-from-any-offset that Kafka provides. Kafka loses when it's the only reason to stand up a cluster — the ops cost of a 3-broker MSK cluster ($2k/month + on-call) can only be justified if the pipeline has enough scale or downstream consumers to amortise it.

How do I handle deletes in Postgres → Snowflake CDC?

Two dominant patterns: tombstone-in-MERGE and soft-delete columns. Debezium and Snowflake Openflow emit an op = 'd' field in the CDC envelope; the Snowflake MERGE has an explicit WHEN MATCHED AND src.op = 'd' THEN DELETE branch that translates the tombstone into a hard delete on the curated table — the row disappears from downstream queries entirely. Fivetran, Hevo, and Airbyte take the soft-delete approach: they write a _fivetran_deleted, __hevo__marked_deleted, or _ab_cdc_deleted_at column on the target table and leave the row in place; downstream queries filter WHERE NOT _fivetran_deleted (or wrap the table in a view that does so). Soft-delete preserves audit history at the cost of storage; hard-delete is cheaper but loses the "when was this row deleted" signal. Both patterns require REPLICA IDENTITY FULL on the Postgres source — without it, DELETE events only carry the primary key and any post-delete audit column values are lost.

What about PII and GDPR at the CDC layer?

The PII gate belongs at the connector, before rows leave Postgres. Debezium supports Single Message Transforms (SMTs) that can hash, drop, or redact columns before the row hits Kafka — configure transforms = pii_hash with transforms.pii_hash.type = org.apache.kafka.connect.transforms.MaskField$Value and transforms.pii_hash.fields = ssn,dob to drop those fields. Fivetran, Hevo, and Airbyte offer column-level hashing (SHA-256) or exclusion in their UIs — configure once per connector. Snowflake Openflow has schema-level PII rules baked into the connector definition. For GDPR right-to-be-forgotten flows, the tombstone-in-MERGE pattern is preferable — a delete on the source becomes a hard delete on the curated table within one MERGE cycle. For strict VPC-only compliance (HIPAA-grade), Debezium + Kafka + Snowpipe Streaming (all in your VPC) or Openflow BYOC (data plane in your VPC) are the only architectures where the CDC stream never crosses a vendor boundary. Never rely on a downstream mask — PII that lands unhashed in Snowflake is a compliance incident even if a view hides it.

Snowflake Openflow vs Fivetran — which one should I pick in 2026?

The 2026 answer depends on the vendor state of your stack. Pick Openflow if (a) you already have a Snowflake contract with volume discounts (the credits are bundled), (b) you value single-vendor consolidation (one invoice, one support relationship), (c) VPC-only compliance is required and the BYOC data plane threads the needle, or (d) the greenfield workload has no existing CDC investment to protect. Pick Fivetran if (a) your team already runs Fivetran and switching costs outweigh the invoice savings, (b) the mature vendor ecosystem matters (300+ connectors, hardened edge-case handling, enterprise support), (c) the MAR math actually favours Fivetran on your workload shape (low update rates on wide tables), or (d) HIPAA / SOC 2 posture requires the vendor's certifications and BAA. The typical cost delta: Openflow is 25–40% cheaper on invoice at moderate volumes but you pay in Snowflake credits (which count against your annual commit); Fivetran is a separate line item that the CFO can negotiate independently. For a greenfield 2026 deployment with a Snowflake volume discount and no existing Fivetran investment, Openflow is the default; for a mature stack with Fivetran running well, the migration is not worth the effort unless the MAR bill is a clear budget problem.

Practice on PipeCode

Drill the ETL practice library → for the CDC-design, MERGE, and pipeline-architecture problems senior interviewers use to probe the four-axis framework.
Rehearse on the streaming practice library → for the windowed-aggregation, materialised-view, and streaming-SQL patterns that decide when Estuary or Materialize beats a plain CDC path.
Sharpen the SQL axis with the SQL practice library → for the MERGE-INTO, high-water-mark, and Snowflake-native (Streams, Tasks, Dynamic Tables) problems that show up in every senior CDC interview.
Stack the prerequisites against PipeCode's broader 450+ data-engineering catalogue to anchor the CDC + MERGE + streaming-SQL intuition against real graded inputs.

Lock in Postgres → Snowflake CDC decision muscle

Vendor docs describe features. PipeCode drills describe the decision — when Debezium beats Fivetran on the four axes, when Estuary saves 5× on Snowflake credits, when Openflow's BYOC data plane threads the PII needle. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.

Practice ETL problems →
Practice streaming problems →

1. Why Postgres → Snowflake is the most-asked CDC path in 2026

The four axes — latency, cost model, ownership, PII posture — reduce five architectures to one clear choice per workload

Worked example — the four-axis interview reduction

Worked example — the "sub-minute + PII cannot leave the VPC" flip

Worked example — the streaming-SQL flip

Senior interview question on the four-axis reduction

Solution Using a hybrid Debezium-critical-path + Fivetran-bulk architecture

2. Debezium + Kafka + Snowpipe Streaming

The sub-minute DIY path — full control, full ownership, sub-minute latency, and a real on-call rotation

Worked example — end-to-end Debezium deploy for a 3-table workload

Worked example — sizing the Kafka cluster + on-call cost

Worked example — WAL slot bloat, the #1 Debezium incident

Senior interview question on the Debezium architecture

Solution Using the four-layer Debezium blueprint with a slot-bloat alert and a hybrid escape hatch

3. Fivetran / Hevo / Airbyte managed pipelines

The hands-off managed path — the vendor owns the WAL reader, you pay per row, and Snowflake sees a clean landing table

Worked example — Fivetran configuration and cost estimate

Worked example — Airbyte OSS on Kubernetes for VPC-only compliance

Worked example — MAR blowout diagnosis and mitigation

Senior interview question on managed CDC vendor selection

Solution Using a three-vendor MAR-vs-row-vs-GB decision matrix

4. Estuary Flow / Materialize / Rivery — real-time CDC platforms

The streaming-SQL path — push the aggregation upstream, materialise into Snowflake at a fraction of the credit cost

Worked example — Estuary Flow collection + Snowflake materialisation

Worked example — Materialize CREATE MATERIALIZED VIEW for continuous joins

Worked example — RisingWave OSS deployment on Kubernetes

Senior interview question on streaming CDC platforms

Solution Using an Estuary Flow materialisation with a Snowflake-native fallback plan

5. Native Snowflake features (Openflow) + hybrid DIY

Snowflake owns the pipeline (Openflow) or you do (Kafka Connector, DIY landing) — four axes decide

Worked example — Snowflake Openflow config for Postgres source

Worked example — DIY S3 → Snowpipe → Streams + Tasks

Worked example — the four-axis decision tree, applied end-to-end

Senior interview question on Snowflake-native CDC

Solution Using Snowflake Openflow with a Dynamic Tables curation layer

Cheat sheet — Postgres → Snowflake CDC recipes

Frequently asked questions

What is the cheapest Postgres → Snowflake CDC path?

Which Postgres → Snowflake CDC architecture gives sub-minute latency?

Do I need Kafka to do Postgres → Snowflake CDC?

How do I handle deletes in Postgres → Snowflake CDC?

What about PII and GDPR at the CDC layer?

Snowflake Openflow vs Fivetran — which one should I pick in 2026?

Practice on PipeCode

Lock in Postgres → Snowflake CDC decision muscle

Worked example — Materialize `CREATE MATERIALIZED VIEW` for continuous joins