reverse etl is the discipline that closes the loop a data team starts the first time it lands raw events in a warehouse and then realises the warehouse, however beautiful, is invisible to the GTM team. Forward ETL moved source data into the warehouse so analysts could ask questions; reverse ETL ships the answers back out into the operational tools — Salesforce, HubSpot, Marketo, Intercom, Slack, Facebook Ads, Iterable — where the people and systems that act on customers actually live. It is the bridge between analytical truth and operational action, and in 2026 it is the single fastest-growing surface in the modern data stack.
This guide walks the practitioner's view of operational analytics end to end. It defines the data activation pattern (model → audience → sync → destination), compares the three production-grade reverse etl tools — Hightouch, Census, and RudderStack — across destinations, dbt integration, hosting, and pricing, deconstructs the sync architecture that turns a warehouse query into a queue of API calls absorbing 429s and dead letters, and lays out the governance and observability layer that distinguishes a real data product from a fragile pipeline. Each section pairs a teaching block with a Solution-Tail worked answer — code, a step-by-step trace, an output table, and a concept-by-concept breakdown of why it works.
When you want hands-on reps while reading, drill the ETL practice library →, layer in API integration drills →, and stack the warehouse muscles with dimensional modelling problems →.
On this page
- Why reverse ETL exists — operational analytics as a discipline
- The reverse ETL data model — models, audiences, syncs
- Hightouch vs Census vs RudderStack — vendor comparison
- Sync architecture — incremental detection, queues, rate limits
- Governance, observability, and failure modes
- Cheat sheet — reverse ETL recipes
- Frequently asked questions
- Practice on PipeCode
1. Why reverse ETL exists — operational analytics as a discipline
Forward ETL moves data INTO the warehouse so analysts can ask questions; reverse ETL moves data OUT so operational systems can act
The one-sentence invariant: forward ETL turns raw source data into warehouse rows that humans read on dashboards; reverse ETL turns those warehouse rows back into API calls that machines and SaaS tools execute against customers. Once you internalise that the warehouse is now the source of truth for every customer attribute, the question stops being "should we sync this?" and becomes "which destinations, which fields, how often, and with what governance?"
The data activation gap in three bullets.
- Dashboards inform people; syncs inform systems. A lead score in Looker is a number a manager looks at on Monday. A lead score in Salesforce is a field a routing rule reads at midnight to assign the lead to the right rep. The two consumers want the same number but through different surfaces.
- The warehouse aggregates across silos; SaaS tools cannot. Stripe knows about payments. HubSpot knows about emails. The product database knows about feature usage. Only the warehouse joins them. Reverse ETL ships that join back into every silo.
- Manual CSV exports do not scale. A "send a CSV to ops once a week" workflow has zero observability, no schema contract, and breaks the first time a column is renamed. Reverse ETL turns the export into a versioned, scheduled, monitored data product.
Common destinations in 2026.
- CRMs. Salesforce, HubSpot, Microsoft Dynamics, Pipedrive.
- Marketing automation. Marketo, Iterable, Customer.io, Braze, Klaviyo, Mailchimp.
- Support + success. Intercom, Zendesk, Gainsight, Vitally, ChurnZero.
- Ad platforms. Facebook / Meta custom audiences, Google Ads customer match, TikTok audiences, LinkedIn matched audiences.
- Collaboration + ops. Slack channels, Microsoft Teams webhooks, Notion databases, Asana tasks.
- Product analytics. Amplitude cohorts, Mixpanel cohorts, Heap audiences.
Why the warehouse won as source of truth.
- Compute and storage are now cheap. Snowflake, BigQuery, Databricks, Redshift — every cloud warehouse runs the joins at a price that makes "send the join result downstream" feasible at scale.
-
dbt made transformation governable. Once
models/marts/customers.sqlis the single SQL definition of "a customer," every downstream system can subscribe to its rows instead of recomputing them. - Data teams finally have leverage on the operational stack. Reverse ETL gives the data team a contract with marketing, sales, and CS without writing custom Python in five different SaaS APIs.
When NOT to use reverse ETL.
- Sub-second latency requirements. Reverse ETL is a batch + micro-batch architecture. Hightouch ships syncs as fast as ~5 minutes; Census as fast as ~1 minute; RudderStack with streaming-event reverse ETL can hit seconds. Below that, you want event streaming (RudderStack event stream, Segment, Kafka → consumer) — not warehouse syncs.
- True event streaming. "Page view fires → personalisation engine reacts in 200ms" is not a reverse ETL problem; it is a Kafka / Kinesis / event-bus problem.
- One-off backfills. A 50k-row one-time list does not need a sync pipeline; a CSV import inside the destination is faster.
Worked example — the lead score sync that justifies reverse ETL
Detailed explanation. A B2B SaaS company computes a lead score in dbt by joining Salesforce contacts, product usage events, and marketing engagement. The score lives in marts.lead_scores. Sales wants the same score visible on the Salesforce Contact record so routing and prioritisation rules can act on it. Without reverse ETL the team writes a custom Python script, schedules it in Airflow, builds retries, builds dedupe, and rebuilds it every time the score model changes. With reverse ETL the team writes a one-page sync definition and inherits all of that infrastructure.
Question. Given the dbt model marts.lead_scores with columns (salesforce_contact_id, lead_score, last_engagement_at, churn_risk), how do you ship the row into Salesforce Contact.lead_score__c, Contact.last_engagement_at__c, and Contact.churn_risk__c so that routing rules can act on it?
Input — marts.lead_scores.
| salesforce_contact_id | lead_score | last_engagement_at | churn_risk |
|---|---|---|---|
| 003A1 | 87 | 2026-06-12 | 0.12 |
| 003A2 | 41 | 2026-05-30 | 0.55 |
| 003A3 | 92 | 2026-06-14 | 0.08 |
| 003A4 | NULL | NULL | NULL |
Code.
-- The dbt model that becomes the sync source.
-- File: models/marts/lead_scores.sql
SELECT
c.salesforce_contact_id,
ROUND(s.raw_score, 0) AS lead_score,
s.last_engagement_at,
s.churn_risk
FROM {{ ref('dim_contacts') }} c
LEFT JOIN {{ ref('int_lead_scoring') }} s
ON s.contact_id = c.contact_id
WHERE c.salesforce_contact_id IS NOT NULL
# Hightouch sync definition (illustrative YAML).
# File: hightouch/syncs/salesforce_lead_score.yaml
model: marts.lead_scores
destination: salesforce_production
sync_mode: upsert
primary_key: salesforce_contact_id
schedule: "*/30 * * * *" # every 30 minutes
mappings:
- source: lead_score -> Contact.lead_score__c
- source: last_engagement_at -> Contact.last_engagement_at__c
- source: churn_risk -> Contact.churn_risk__c
Step-by-step explanation.
- The dbt model produces one row per Salesforce contact with a stable
salesforce_contact_idprimary key. The model is the contract — change the SQL, change every downstream consumer. - Hightouch reads the model on the cron schedule. On the first run it stores a snapshot; on every later run it diffs the current rows against the previous snapshot to find changes.
- The sync_mode
upserttells the destination "insert ifsalesforce_contact_iddoes not exist, update otherwise." Salesforce External ID matching is configured in the Hightouch UI to mapsalesforce_contact_idto Salesforce'sIdfield. - The three field mappings turn warehouse columns into Salesforce custom fields. NULL
lead_scorefor003A4becomes a blank update on the Salesforce field; the destination keeps any previous value if the sync setting is "do not overwrite with NULL." - The cron
*/30runs every 30 minutes — far below Salesforce's daily API limit but fast enough for sales routing.
Output (Salesforce after sync).
| Salesforce Contact Id | lead_score__c | last_engagement_at__c | churn_risk__c |
|---|---|---|---|
| 003A1 | 87 | 2026-06-12 | 0.12 |
| 003A2 | 41 | 2026-05-30 | 0.55 |
| 003A3 | 92 | 2026-06-14 | 0.08 |
| 003A4 | (unchanged) | (unchanged) | (unchanged) |
Rule of thumb. Every operational team that asks for "a number on the record so we can route on it" is asking for reverse ETL. Push back when they ask for "a CSV every Monday" — propose the sync instead, because it ships with observability, history, and a schema contract for free.
Worked example — the dashboards-vs-syncs contrast
Detailed explanation. A common mistake is treating a dashboard and a sync as the same artefact with a different surface. They are not. A dashboard runs on demand and serves humans; a sync runs on a schedule and serves machines. Different SLA, different failure mode, different governance, different consumer.
Question. Given a churn-risk metric, write the two access patterns side by side — Looker dashboard query vs reverse ETL sync — and explain why both exist.
Input.
| Surface | Cadence | Consumer | Failure mode |
|---|---|---|---|
| Looker dashboard | on demand | account manager | empty card |
| Reverse ETL sync | every 6h | Intercom tag automation | stale tag |
Code.
-- Looker explore (shared view).
-- explore: account_health
-- view: marts.account_health
view: account_health {
sql_table_name: marts.account_health ;;
measure: avg_churn_risk {
type: average
sql: ${TABLE}.churn_risk ;;
}
}
# Hightouch sync — same underlying model, machine surface.
model: marts.account_health
destination: intercom
sync_mode: mirror
primary_key: account_id
schedule: "0 */6 * * *"
mappings:
- source: churn_risk -> Company.churn_risk_attr
- source: CASE WHEN churn_risk > 0.7
THEN 'at_risk' ELSE 'ok' END -> Company.health_tag
Step-by-step explanation.
- The same
marts.account_healthmodel feeds both surfaces. There is exactly one definition of "churn risk" in the company. - The dashboard query runs when a human opens it. The SLA is "the query returns in less than 10 seconds and the number is no older than the last warehouse refresh."
- The Hightouch sync runs every 6 hours regardless of human attention. The SLA is "the Intercom tag reflects yesterday's risk score by the end of every 6-hour window."
- Failure modes differ: a dashboard failure is loud (empty card, error toast); a sync failure is quiet (a stale tag still looks like data). Observability for the sync must be explicit.
Output.
| Surface | Behaviour when warehouse fails |
|---|---|
| Looker dashboard | error visible immediately to the user |
| Hightouch sync | last successful tag persists; alert fires only if observability is set up |
Rule of thumb. Treat the sync as a different product than the dashboard, even when both subscribe to the same model. Stamp a SLA on the sync, add an explicit row-error alert, and surface the sync as a dbt exposure so it shows up in lineage.
Worked example — when reverse ETL is the wrong tool
Detailed explanation. Reverse ETL has a lower bound on latency around a minute (Census) and a typical floor of 15–30 minutes for cost-efficient syncs (Hightouch on shared infrastructure). For sub-second personalisation, fraud-blocking, or in-session experiences, reverse ETL is the wrong tool — you need an event stream.
Question. Given a "personalise the homepage banner based on the user's churn risk" requirement, decide between reverse ETL and an event-stream architecture. Show the latency budget that drives the decision.
Input.
| Requirement | Latency target | Architecture |
|---|---|---|
| Sales Salesforce score | 30 minutes | reverse ETL |
| Marketing Intercom tag | 6 hours | reverse ETL |
| Ad audience refresh | 24 hours | reverse ETL |
| Homepage personalisation | < 500 ms | event stream |
| Fraud block at checkout | < 200 ms | online ML feature store |
Code.
Decision rubric (pseudo-code):
if latency_target >= 5_minutes:
use reverse_etl (Hightouch / Census / RudderStack)
elif latency_target >= 30_seconds:
use event_stream_reverse_etl (RudderStack event stream)
elif latency_target >= 100_ms:
use online_feature_store + low_latency_api (Tecton, Feast, custom)
else:
use in_request_compute (edge function, cached cache lookup)
Step-by-step explanation.
- The latency floor for batch reverse ETL is a function of warehouse query time + diff computation + destination API throughput. On a shared tenant in Hightouch this typically lands at 5–15 minutes.
- RudderStack's event-stream reverse ETL closes the loop in seconds for individual event triggers but still cannot serve a single-millisecond synchronous API call.
- Online ML feature stores (Tecton, Feast) maintain a serving layer separate from the warehouse precisely for sub-100ms reads. Reverse ETL pre-materialises features into that layer on a slower cadence.
- The rubric ranks tools by the actual latency budget the use case requires. Picking the wrong tier wastes either money (using a feature store for a daily ad audience) or signal (using reverse ETL for sub-second personalisation).
Output.
| Use case | Tool | Why |
|---|---|---|
| Lead score in Salesforce | Hightouch upsert | 30-min cadence, batch fine |
| Churn risk tag in Intercom | Census sync | 6h cadence, batch fine |
| Homepage banner | Edge feature read | sub-500ms, batch insufficient |
| Fraud rule at checkout | Online feature store | sub-200ms, must be pre-materialised |
Rule of thumb. Sketch the latency budget first. Anything above 5 minutes is a reverse ETL problem. Anything below 5 minutes is a streaming or feature-store problem. Mixing the two architectures costs more than picking the right one from the start.
Reverse ETL interview question on the lift-up from forward ETL
A senior interviewer often asks: "Walk me through what changes in the data team's responsibility model when reverse ETL enters the stack. What dbt practices have to harden? What new SLAs do you accept?"
Solution Using the data activation contract
The data team takes on three new responsibilities the day reverse ETL ships:
1. Model stability is now an operational SLA.
- Every sync model needs a stable primary key (renaming it
breaks identity resolution downstream).
- Column renames now break SaaS-tool fields that humans rely on.
- Type changes can silently corrupt destination fields.
- Solution: dbt contract tests + dbt exposures + protected branch
for any model with downstream syncs.
2. Freshness is now a destination-level SLA.
- Warehouse "fresh as of midnight" is no longer enough.
- Each destination has its own freshness contract (Salesforce: 30m,
Intercom: 6h, Facebook ads: 24h).
- Solution: per-sync alerting, last_synced_at columns, freshness
dashboards.
3. Governance now spans warehouse + SaaS tools.
- PII synced to Marketo is now subject to Marketo's retention.
- GDPR delete must propagate to every destination.
- Solution: PII tags on every column, per-destination policy,
destination-side deletes.
Step-by-step trace.
| Responsibility | Before reverse ETL | After reverse ETL |
|---|---|---|
| Model PK stability | nice-to-have | hard contract |
| Column rename | dashboard fix | downstream sync break |
| Freshness | warehouse-wide | per-destination |
| PII | warehouse policy | propagated to N SaaS tools |
| Lineage | dbt + BI | dbt + BI + syncs |
The data team learns to think of every model with at least one sync as an operational data product. The discipline is closer to backend engineering than to "writing SQL" — versioned, monitored, alerted, paged.
Output:
| Practice | New requirement once reverse ETL is in the stack |
|---|---|
| dbt contracts | required on every sync model |
| dbt exposures | every sync surfaced in lineage |
| PII tagging | per-column tags propagated to destination policy |
| Alerting | per-sync row-error rate and freshness SLA |
| On-call | one person owns sync health |
Why this works — concept by concept:
-
Models become contracts — the
(primary_key, columns, types)tuple is now a stable API. Any change is a versioned migration with downstream blast-radius assessment. - Freshness becomes per-destination — the warehouse SLA is the upper bound; each sync has its own, often tighter, freshness contract because downstream SaaS automation acts on it.
- PII becomes propagated — a column tagged "email PII" in the warehouse must inherit the same handling everywhere it lands. GDPR delete is the canonical stress test.
- Lineage becomes end-to-end — dbt exposures are the standard way to surface "this model is consumed by this Hightouch sync" inside the dbt docs and the data catalog.
- On-call gets a new pager — the day a sync fails silently is the day the data team learns operational analytics needs operational ownership. One person owns sync health, full stop.
- Cost — the new responsibilities are mostly process; the dbt features (contracts, exposures, tags) ship out of the box. Marginal infrastructure cost is the reverse ETL vendor subscription itself.
ETL
Topic — etl
ETL pipeline problems (data engineering)
2. The reverse ETL data model — models, audiences, syncs
Every reverse ETL platform organises around four nouns: model, audience, sync, destination — learn them once and every vendor feels the same
The mental model in one line: a model is a warehouse query that produces one row per entity; an audience is a filtered subset of a model; a sync is a mapping of model rows into a destination; a destination is the SaaS tool. Once you learn this four-noun vocabulary, every vendor UI collapses to the same shape and the differences become mostly cosmetic.
The four-noun glossary.
-
Model. A SQL query (or dbt model reference) that returns rows of a single entity —
one_row_per_user,one_row_per_account,one_row_per_subscription. The model has a primary key column and a set of attribute columns. -
Audience. A filter expression layered on top of a model —
WHERE plan = 'pro' AND last_seen_at < CURRENT_DATE - INTERVAL '30 days'. Audiences are reusable across syncs and across destinations. - Sync. The full specification: which model (or audience), which destination, which field mappings, which sync mode, which schedule. A sync is the deployable unit.
- Destination. The SaaS tool credentials + the destination object (Salesforce Contact, HubSpot Company, Intercom User, Marketo Lead, Facebook Custom Audience).
Sync modes you will encounter.
- Insert. New rows are inserted into the destination; existing rows are untouched. Used for append-only destinations like logging or analytics events.
- Update. Existing rows are updated; new rows are not inserted. Used when the destination owns identity creation (e.g. only update Salesforce contacts that already exist via lead capture).
- Upsert. Insert new rows, update existing rows. The most common mode for customer attribute syncs.
- Mirror. Make the destination match the model exactly — insert new, update changed, delete rows no longer in the model. The most powerful and the most dangerous; usually scoped to audiences (e.g. "the at-risk audience").
- Delete only. Remove rows from the destination based on a "tombstone" model. Often used for GDPR delete propagation.
Identity resolution at the sync boundary.
-
External ID matching. The most common pattern: the warehouse primary key (
salesforce_contact_id,hubspot_vid) is the same as the destination's primary key. The sync upserts on that key. - Email / phone matching. When the warehouse and the destination both store contact PII, syncs can match on email or phone. Brittle to changes (a user's email change creates a "new" record) but works for greenfield setups.
-
Custom external_id field. Hightouch and Census both support designating a custom external ID field in the destination (e.g. Marketo's
external_id_c). The sync writes the warehouse PK there once, then matches on it forever. -
Composite key matching. Some destinations (Salesforce, Marketo) support compound external IDs (e.g.
account_id + region). Rarely used; useful when the same person lives in multiple tenants.
Idempotency — the contract that saves the team.
- Stable primary key on every model. If the warehouse PK can change, the sync will double-write or fail to dedupe — every reverse ETL platform assumes the model PK is stable across runs.
-
Idempotent upserts. A retry on the same row must produce the same destination state. Most SaaS APIs support
idbased upsert; some require a "create-or-update" two-step. - Diff-only by default. Sync only the rows that changed since the last successful run. Saves API quota, reduces destination clutter, simplifies observability ("zero diffs is a healthy sync, not a broken one").
Change detection — three strategies.
- Full refresh. Read the entire model every run, ship every row. Simple, expensive, almost never the right answer above 100k rows.
- Diff-only (snapshot). Store a hash of every (PK, attribute) tuple on each successful run. On the next run, compare hashes and only ship the diffs.
- CDC mirror. Subscribe to the warehouse's change-data-capture stream (Snowflake streams, BigQuery change streams, Databricks CDC) and apply diffs incrementally. The lowest-latency option; vendor support varies.
Worked example — defining a model with a stable PK and clean attributes
Detailed explanation. A reverse ETL model is not a fact table. It is a one-row-per-entity row set with attributes the destination cares about. The biggest mistake newcomers make is reusing an analytics fact table as the model — fact tables have multiple rows per entity, and the sync will explode or drop most of them.
Question. Given a fact_orders table and a dim_customers table, write the right dbt model for a "current customer state" reverse ETL sync into Salesforce.
Input — fact_orders.
| order_id | customer_id | amount | order_date |
|---|---|---|---|
| 1 | C1 | 50 | 2026-06-01 |
| 2 | C1 | 30 | 2026-06-10 |
| 3 | C2 | 200 | 2026-05-20 |
Input — dim_customers.
| customer_id | name | salesforce_contact_id |
|---|---|---|
| C1 | Alice | 003A1 |
| C2 | Bob | 003A2 |
Code.
-- WRONG — multiple rows per customer; will fail upsert.
SELECT
c.salesforce_contact_id,
o.amount,
o.order_date
FROM dim_customers c
JOIN fact_orders o ON o.customer_id = c.customer_id;
-- RIGHT — one row per customer with aggregated attributes.
-- File: models/marts/reverse_etl_customer_state.sql
SELECT
c.salesforce_contact_id,
c.name,
COUNT(o.order_id) AS lifetime_orders,
COALESCE(SUM(o.amount), 0) AS lifetime_revenue,
MAX(o.order_date) AS last_order_at,
COUNT(o.order_id) FILTER
(WHERE o.order_date >= CURRENT_DATE - 30) AS orders_last_30d
FROM dim_customers c
LEFT JOIN fact_orders o ON o.customer_id = c.customer_id
GROUP BY c.salesforce_contact_id, c.name;
Step-by-step explanation.
- The wrong model emits two rows for
C1(one per order). The Hightouch sync sees two rows with the samesalesforce_contact_id, fails the "unique PK" assertion, and either rejects the sync or upserts the last row arbitrarily. - The right model wraps
fact_ordersin a GROUP BY oncustomer_id, collapsing every customer to one row. Attributes are aggregated:COUNTfor orders,SUMfor revenue,MAXfor last order date. -
LEFT JOINpreserves customers with zero orders.COALESCE(SUM(...), 0)turns the NULL sum into a clean 0 for downstream Salesforce automations. -
COUNT(...) FILTER (WHERE ...)produces the "last 30 days" attribute without a separate subquery. Postgres / Snowflake / BigQuery support FILTER; SQL Server usesCOUNT(CASE WHEN ... THEN 1 END).
Output (the reverse ETL model).
| salesforce_contact_id | name | lifetime_orders | lifetime_revenue | last_order_at | orders_last_30d |
|---|---|---|---|---|---|
| 003A1 | Alice | 2 | 80 | 2026-06-10 | 2 |
| 003A2 | Bob | 1 | 200 | 2026-05-20 | 0 |
Rule of thumb. A reverse ETL model is SELECT ... FROM ... GROUP BY entity_id plus joins. If the model emits more than one row per entity, the sync is wrong. Add a dbt-unique test on the PK column so the next CI run catches it.
Worked example — defining an audience from a model
Detailed explanation. Audiences are reusable filtered subsets of a model. A typical pattern: one underlying marts.reverse_etl_customer_state model, multiple audiences ("at-risk", "high-value", "trial-expiring"), each subscribed to a different destination.
Question. Define three audiences on top of the customer state model: at-risk (churn_risk > 0.7), high-value (lifetime_revenue > 5000), and active-trial (plan = 'trial' AND days_remaining < 7). Show how each maps to a different destination.
Input — marts.reverse_etl_customer_state.
| salesforce_contact_id | name | plan | lifetime_revenue | churn_risk | trial_ends_at |
|---|---|---|---|---|---|
| 003A1 | Alice | pro | 8000 | 0.05 | NULL |
| 003A2 | Bob | trial | 0 | NULL | 2026-06-18 |
| 003A3 | Cara | pro | 1200 | 0.82 | NULL |
| 003A4 | Dan | trial | 0 | NULL | 2026-06-30 |
Code.
-- Audience: at_risk
SELECT * FROM {{ ref('marts.reverse_etl_customer_state') }}
WHERE churn_risk > 0.7;
-- Audience: high_value
SELECT * FROM {{ ref('marts.reverse_etl_customer_state') }}
WHERE lifetime_revenue > 5000;
-- Audience: active_trial
SELECT * FROM {{ ref('marts.reverse_etl_customer_state') }}
WHERE plan = 'trial'
AND trial_ends_at IS NOT NULL
AND trial_ends_at - CURRENT_DATE BETWEEN 0 AND 7;
# Three syncs, one model, three destinations.
- name: at_risk_to_intercom
audience: at_risk
destination: intercom
sync_mode: mirror
mappings:
- source: churn_risk -> Company.churn_risk_attr
- source: lifetime_revenue -> Company.ltv_attr
- name: high_value_to_facebook_ads
audience: high_value
destination: facebook_ads
sync_mode: mirror
mappings:
- source: email -> custom_audience.email_hash
- name: active_trial_to_iterable
audience: active_trial
destination: iterable
sync_mode: mirror
mappings:
- source: trial_ends_at -> User.trial_end_date
- source: name -> User.first_name
Step-by-step explanation.
- The single underlying model
marts.reverse_etl_customer_stateis the source of truth. Every audience is a filter on top of it. - Audience
at_riskmirrors to Intercom for CS alerting. The sync ships only the matching subset and removes the tag when a customer drops out of the audience (mirror mode). - Audience
high_valuemirrors hashed emails to a Facebook custom audience. Add/remove behaviour follows audience membership automatically. - Audience
active_trialsyncs to Iterable for an automated email sequence. The mirror mode adds users when they enter the trial window and removes them when the trial ends. - Each sync inherits the same model contract — change the column, every audience and sync notices on the next run.
Output.
| Audience | Members | Destination | Effect |
|---|---|---|---|
| at_risk | 003A3 (Cara) | Intercom | tagged as at_risk |
| high_value | 003A1 (Alice) | added to custom audience | |
| active_trial | 003A2 (Bob) | Iterable | trial-end sequence triggered |
Rule of thumb. Build one model per entity, many audiences per model, one or more syncs per audience. The fan-out pattern (1 model → N audiences → M syncs) keeps the definition of an entity DRY and lets each downstream team pick the slice they care about.
Worked example — change detection: snapshot diff vs full refresh
Detailed explanation. Diff-only syncs are the default in every modern reverse ETL platform. They store a hash (or row checksum) per primary key after each successful run; on the next run they compare the new model output against the stored snapshot and emit only the changed rows. Full refresh is sometimes correct but very expensive.
Question. Given a 1M-row customer state model where 0.2% of rows change between runs, compare full-refresh API cost (every run ships every row) with diff-only (only changed rows shipped). Use a destination with a 200-row-per-API-call batch limit.
Input — assumptions.
| Variable | Value |
|---|---|
| Total model rows | 1,000,000 |
| Rows changed per run | 2,000 (0.2%) |
| Destination batch size | 200 rows |
| Syncs per day | 24 |
| Destination API call cost | $0.001 |
Code.
Full refresh per run:
api_calls = ceil(1_000_000 / 200) = 5_000
runs_per_day = 24
daily_cost = 5_000 * 24 * $0.001 = $120
Diff-only per run:
api_calls = ceil(2_000 / 200) = 10
runs_per_day = 24
daily_cost = 10 * 24 * $0.001 = $0.24
Cost ratio: 500x cheaper with diff-only.
Time ratio: same — typical API latency dominated by call count, not payload size.
Step-by-step explanation.
- Full refresh ships every row on every run. With 1M rows and 200/batch, the platform issues 5,000 API calls per run. 24 runs/day is 120,000 calls/day.
- Diff-only ships only the 0.2% changed rows. 2,000 rows / 200 per batch = 10 API calls per run. 24 runs/day is 240 calls/day.
- The math is independent of vendor — every reverse ETL platform that supports diff-only will produce this savings on a typical attribute-update workload.
- Diff-only does require the platform to maintain the previous snapshot. The snapshot is typically stored in the reverse ETL platform's own metadata DB (Hightouch) or as a hidden audit table in the source warehouse (Census's "tracking table" pattern).
Output.
| Strategy | API calls / day | Cost / day | Quota risk |
|---|---|---|---|
| Full refresh | 120,000 | $120 | high (Salesforce 15k cap) |
| Diff-only | 240 | $0.24 | very low |
Rule of thumb. Default to diff-only on every sync. Use full refresh only for "catch-up after a destination outage" or for small reference tables under ~10k rows. The 100–500× API quota savings are not optional at scale — Salesforce will hard-stop you at 15k API calls per 24h on the standard plan.
Reverse ETL interview question on idempotency
A senior interviewer often probes: "Your nightly sync runs, fails halfway through with a network blip, and reruns automatically. How do you guarantee the destination ends up in the same state it would have been if the sync had succeeded the first time?"
Solution Using the idempotent upsert contract
Idempotency is guaranteed if and only if:
1. Every model row has a stable primary key.
- The PK is the natural identity (salesforce_contact_id),
not a row number or a hash that changes between runs.
- dbt test: unique + not_null on the PK column.
2. The sync mode is upsert (not insert) on a destination-side
external ID field.
- Salesforce: Upsert /sobjects/Contact/extId/{externalId}
- HubSpot: Upsert /contacts/v1/contact/createOrUpdate/email/{email}
- Marketo: leads/createOrUpdate with lookupField=externalId
3. The destination accepts a duplicate row as a no-op when
nothing has actually changed.
- Hightouch: built-in "skip unchanged rows" toggle.
- Census: built-in idempotency cache.
- RudderStack: ETag / If-Match conditional updates.
4. Retries on transient errors (5xx, network timeout) are
safe because step 2 guarantees the second call lands the
same destination state as the first.
5. Permanent errors (4xx) go to a dead-letter queue for
manual inspection, NOT into the auto-retry loop.
Step-by-step trace.
| Run | Outcome | Destination state | Idempotent? |
|---|---|---|---|
| Run 1 (initial) | success | 1000 rows in Salesforce | n/a |
| Run 2 (no diff) | success | 1000 rows unchanged | yes — zero API calls |
| Run 3 (1 row change) | success | 1000 rows, 1 updated | yes — 1 API call |
| Run 4 (mid-run network blip) | partial fail at row 500 | 500 of 999 deltas applied | next run resumes |
| Run 4 retry | success | all deltas applied | yes — final state matches success-on-first-try |
The fourth row shows the key behaviour: a half-applied sync is safe because each row's upsert is idempotent. The retry picks up the unfinished deltas without re-applying the already-applied ones.
Output:
| Property | Behaviour with idempotency contract |
|---|---|
| Network blip | safe — retry resumes |
| Same model, two runs back-to-back | second run is a no-op |
| Schema change downstream | sync fails loudly, no half-update |
| Concurrent runs | platform locks the sync to one instance |
| Duplicate row in model | dbt test fails before sync starts |
Why this works — concept by concept:
- Stable PK — the primary key is the bridge between warehouse identity and destination identity. The whole upsert mechanism depends on it being stable across runs.
- External ID upsert — every modern SaaS API offers an upsert primitive keyed on a custom external ID. Use it. Two-step "search-then-create-or-update" patterns are error-prone and not idempotent under concurrency.
- Diff-only + skip-unchanged — short-circuits the destination call entirely when nothing has changed. A healthy sync run can legitimately make zero API calls.
- Dead-letter queue — permanent errors (validation failure, missing required field) are not retried in a tight loop; they go to an inspect-and-fix queue. The retry loop is only for transient errors.
- Concurrent-run lock — every reverse ETL platform single-instances each sync. Two parallel runs of the same sync would race on the diff snapshot and corrupt the next-run baseline.
- Cost — idempotency is essentially free once the contract is in place. The cost is the up-front discipline of designing models with stable PKs and configuring destination external IDs.
Data
Topic — dimensional-modeling
Dimensional modelling problems (data engineering)
3. Hightouch vs Census vs RudderStack — vendor comparison
Each vendor optimises for a different team shape — pick by who owns syncs and how dbt-native your stack is
The mental model in one line: Hightouch is the audience-builder-first managed platform, Census is the dbt-native data-team-first managed platform, RudderStack is the open-source CDP + reverse ETL combined platform with a self-hostable option. Once you map team shape and stack constraints to vendor identity, the choice becomes obvious — and obvious choices are easier to defend in a procurement meeting.
The vendor matrix in one table.
| Capability | Hightouch | Census | RudderStack |
|---|---|---|---|
| Destinations (2026) | 200+ | 180+ | 200+ (events + reverse ETL) |
| dbt integration | strong (model picker, exposures) | strongest (dbt exposures native, "data-team first") | adequate |
| Audience builder | first-class visual UI | SQL-first, basic UI builder | basic |
| Sequences / journeys | yes (Hightouch sequences) | yes (Census audiences with priority) | partial |
| Identity resolution | strong (configurable matching) | strong (entity model) | event-stream-first |
| Self-hosted option | no (managed only) | no (managed only) | yes (RudderStack OSS + BYOC) |
| Combined CDP + reverse ETL | no | no | yes (event stream + reverse ETL) |
| Observability | strong (per-row, per-sync) | strong (sync alerts) | adequate |
| Pricing model | per-destination + MTU | per-row synced | per-MTU + events |
Hightouch — audience-builder first, GTM-team first.
- Strengths. Best-in-class audience builder UI (drag-and-drop filters, custom calculations); broadest destination catalogue; "Hightouch sequences" let marketing build journeys without leaving the tool; deep observability with row-level error inspection.
- Best fit. Teams where the audience definitions live half in SQL and half in marketing's head; companies with 5+ destinations across CRM + marketing + ads.
- Trade-offs. Managed-only (no self-host); MTU-based pricing surprises mid-market companies as their user count grows; Hightouch's UI-first audience editor can drift from the dbt definition of an entity if not policed.
Census — data-team first, dbt-native.
-
Strengths. Tightest dbt integration of the three — Census reads
dbt_project.yml, recognises exposures, and surfaces sync metadata back into the dbt docs; "entity" model is a first-class concept; sync alerting is mature. - Best fit. Data teams that already live in dbt and want the warehouse-to-SaaS contract owned by analytics engineers, not marketing ops.
- Trade-offs. Audience-builder UI is intentionally minimal (SQL is the way); fewer "GTM goodies" like multi-channel journeys; managed-only; per-row pricing means batch refreshes can sting.
RudderStack — open-source CDP + reverse ETL combined.
- Strengths. Open-source under AGPLv3 with a managed plan; combines event streaming (Segment-style) with reverse ETL in one tool; self-hostable for BYOC / on-prem / compliance-driven shops; the only one of the three that can serve sub-30-second event reverse ETL.
- Best fit. Companies that need both event collection and reverse ETL but want to avoid SaaS sprawl; compliance / BYOC use cases; engineering-heavy teams comfortable running infra.
- Trade-offs. UI is less polished than Hightouch / Census; destination catalogue runs slightly behind on long-tail SaaS tools; the self-hosted operational cost is real (operate Postgres, Kubernetes, observability).
Pricing dimensions to model before procurement.
- MTU (Monthly Tracked Users). Most platforms charge per unique entity synced per month. The metric grows roughly with total customer base.
- Per-row synced. Census's primary metric. Drives a "diff-only is required" discipline because full refresh becomes ruinously expensive.
- Per-destination. Hightouch's standard plans cap the number of destinations on lower tiers. Multi-channel companies feel this fast.
- Per-seat. Both Hightouch and Census charge per audience-builder seat above a baseline.
- Events (RudderStack). Event-stream pricing is per event, not per unique user. Plan for both axes.
Self-hosted (RudderStack OSS) vs managed trade-off.
- Self-hosted wins for. BYOC compliance, data-residency, "all data must stay in our VPC," low cost at very large MTU counts (>1M).
- Managed wins for. Speed (live in a day vs a quarter), no infra ops burden, faster destination roll-outs, no upgrade cycles.
- Hybrid pattern. Many shops run RudderStack OSS for event collection (zero per-event vendor cost) and Hightouch managed for reverse ETL (fastest catalogue + audience UI).
Worked example — picking Hightouch when GTM owns audiences
Detailed explanation. A B2B SaaS company has a 6-person revenue ops team that owns Salesforce, HubSpot, Marketo, Outreach, and a half-dozen ad accounts. They want to build "buying-committee" audiences without filing a Jira to data each time. The data team owns the underlying dbt model; revenue ops owns the audience layer on top.
Question. Given the company profile (GTM-heavy, 5+ destinations, audience-builder UI matters), justify Hightouch as the right pick. List the decisive feature differences vs Census and RudderStack.
Input — the company profile.
| Property | Value |
|---|---|
| Audience owners | Revenue ops (non-SQL) |
| Destinations | Salesforce, HubSpot, Marketo, Outreach, FB Ads, LinkedIn Ads |
| Warehouse | Snowflake + dbt |
| Sync latency | 30 minutes acceptable |
| Self-host requirement | none |
Code.
Decision matrix:
| Need | Hightouch | Census | RudderStack |
|----------------------------|-----------|--------|-------------|
| Drag-drop audience builder | strong | basic | basic |
| 6+ destinations | yes | yes | yes |
| dbt exposure surfacing | yes | best | adequate |
| Multi-channel sequences | yes | partial| partial |
| No-SQL revenue ops users | strong | weak | weak |
Decision: Hightouch wins on (1) audience builder, (4) sequences,
(5) non-SQL audience editors. Census's dbt-first stance is a real
strength but the GTM team owns audiences in this org.
Step-by-step explanation.
- The team's bottleneck is "GTM ops cannot self-serve audiences." Hightouch's audience builder is the only one of the three optimised for that exact persona.
- Census's strength (dbt-native) does not help when the audience layer is owned outside the data team. The model is still in dbt; the audience-on-top-of-model is what's UI-driven.
- RudderStack's event-stream story is not relevant — this team is not building real-time personalisation, just attribute syncs at 30-minute cadence.
- The decisive feature is the audience builder UI, with Hightouch sequences as a bonus for multi-step marketing journeys.
Output.
| Decision | Hightouch |
|---|---|
| Why | audience builder + sequences + destination catalogue |
| Estimated MTU cost | $$ (mid-market plan) |
| Implementation timeline | 4 weeks to first sync |
Rule of thumb. Hightouch wins when GTM owns the audience layer and non-SQL editors need to ship audiences without filing tickets. Census wins when the data team owns the audience layer and dbt is the single source of truth. RudderStack wins when you need both CDP event collection and reverse ETL or you must self-host.
Worked example — picking Census when dbt is the source of truth
Detailed explanation. A fintech with strict change-management has a small analytics engineering team that defines every metric, every entity, and every audience in dbt. Marketing ops "subscribes" to dbt models via tickets. The team wants the sync layer to inherit dbt's contract testing, exposures, and lineage natively.
Question. Given the company profile (dbt-first, analytics engineering owns audiences, strict change management), justify Census over Hightouch. List the decisive dbt integration features.
Input — the company profile.
| Property | Value |
|---|---|
| Audience owners | Analytics engineering (SQL-fluent) |
| Source of truth | dbt models, branch-protected |
| Destinations | Salesforce, Iterable, Customer.io |
| Sync latency | 1 hour acceptable |
| Compliance | strict — every change reviewed |
Code.
Census dbt-native features that decided it:
1. dbt project sync — Census reads dbt_project.yml directly.
Models appear in Census with the same name as in dbt.
2. dbt exposures — every Census sync is automatically surfaced
as a dbt exposure. Lineage in dbt docs shows the destination.
3. Git-backed sync definitions — sync YAML lives in the dbt
repo, change-managed via PR.
4. dbt tests propagate — failing dbt tests block the sync.
Census never ships a failing-test row to a destination.
5. Entity model — Census's "entity" concept is the equivalent
of a dbt model with documented PK + columns. Discoverable
across the team.
Step-by-step explanation.
- The data team's discipline is "everything ships via PR." Census's git-backed sync definitions extend that discipline to the reverse ETL layer.
- Hightouch supports a Terraform provider for sync-as-code, but the UI-first culture pulls non-engineers off the git workflow. Census's SQL-first culture matches the team.
- dbt exposures inside Census are decisive — every destination becomes a known consumer in the lineage graph. Census surfaces "this sync depends on this model" automatically.
- Failing dbt tests blocking the sync is the killer feature for compliance — it means a regression in the model never silently corrupts a downstream SaaS field.
Output.
| Decision | Census |
|---|---|
| Why | dbt-native + git-backed syncs + exposures + test-gating |
| Estimated cost | $$ (per-row pricing acceptable at this volume) |
| Implementation timeline | 6 weeks to production sync |
Rule of thumb. Census wins when the analytics engineering team owns the audience layer and "everything ships via PR" is a non-negotiable. The dbt integration is real, not cosmetic — it changes how the team operates day to day.
Worked example — picking RudderStack OSS for BYOC compliance
Detailed explanation. A healthcare SaaS must keep PII inside its own VPC. Sending raw email addresses through a multi-tenant SaaS reverse ETL platform is a compliance blocker. RudderStack OSS runs inside the customer VPC, never touches the vendor's infrastructure, and combines event collection (replacing Segment) with reverse ETL in one tool.
Question. Given the company profile (PII must stay in VPC, single tool preferred for events + syncs), justify RudderStack OSS over the managed options.
Input — the company profile.
| Property | Value |
|---|---|
| Compliance | PII must stay in customer VPC |
| Existing event tool | considering Segment replacement |
| Destinations | Salesforce Health Cloud, HubSpot, internal API |
| Sync latency | 5 minutes for high-priority |
| Team | engineering-heavy, comfortable running infra |
Code.
Why RudderStack OSS wins on this profile:
1. Self-hosted in customer VPC.
- No PII leaves the customer's cloud account.
- Audit trail end-to-end within customer-owned storage.
2. Combined event stream + reverse ETL.
- Single tool covers Segment-like event collection AND
Hightouch-like warehouse reverse ETL.
- One destinations catalogue, one UI, one set of credentials.
3. Event-stream reverse ETL.
- Sub-30-second latency on high-priority warehouse changes
via the event-stream path (not the batch path).
4. AGPLv3 source-available.
- Customer can patch, audit, and extend.
- No vendor lock-in for compliance-critical features.
Trade-offs accepted:
- Operate Postgres, Redis, K8s yourself.
- Destination catalogue runs slightly behind Hightouch on
long-tail tools.
- UI is less polished — engineers, not marketers, configure syncs.
Step-by-step explanation.
- The PII-in-VPC requirement removes Hightouch and Census from contention immediately — both are managed-only.
- The combined event-stream + reverse ETL story removes Segment from the picture and consolidates spend.
- RudderStack OSS's event-stream reverse ETL path is the only sub-30-second option in this comparison — relevant for the "high-priority sync" use case.
- The trade-off is operational burden. The team must own the Postgres metadata DB, Redis broker, and Kubernetes orchestration. An engineering-heavy org accepts this.
Output.
| Decision | RudderStack OSS |
|---|---|
| Why | self-hosted compliance + combined CDP + sub-30s reverse ETL |
| Estimated cost | infrastructure + 0.5 SRE FTE |
| Implementation timeline | 8 weeks to production |
Rule of thumb. RudderStack wins on three triggers: BYOC compliance, single-tool consolidation of CDP + reverse ETL, or sub-30-second latency requirements via event-stream reverse ETL. If none of those triggers fires, prefer Hightouch or Census for the operational simplicity of managed.
Reverse ETL interview question on the buy-vs-build decision
A senior interviewer often frames it as: "Your CTO is asking whether we can just build reverse ETL in-house with Airflow + Python + the destination SDKs. Walk me through the buy-vs-build decision."
Solution Using the operational-burden lens
The build-it-yourself stack:
1. Airflow / Dagster orchestration.
2. Custom Python writers for each destination API.
3. Snapshot diff engine (you build it).
4. Queue + worker pool with retry semantics (you build it).
5. Dead-letter queue + inspection UI (you build it).
6. Per-row error logging (you build it).
7. Schema-change detection (you build it).
8. Audit log + lineage (you build it).
9. Audience builder UI for non-engineers (... you build it).
10. PII tagging + governance UI (you build it).
The buy stack:
1. Hightouch / Census / RudderStack subscription.
2. Sync configuration (a week of work).
The break-even calculation:
- Year 1 build cost: 2 senior engineers × 6 months = ~$300k.
- Year 1 buy cost: ~$30k–$80k subscription, depending on MTU.
- Year 2 build cost: 1 engineer × full year maintenance = ~$200k.
- Year 2 buy cost: ~$50k–$120k subscription.
Buy wins decisively unless:
- You have a destination not on the vendor catalogue (rare).
- You have a sub-second latency requirement (use a feature store).
- You have a compliance constraint requiring on-prem (use OSS).
Step-by-step trace.
| Component | Buy time | Build time |
|---|---|---|
| Destination connectors | 1 day per destination (config) | 2 weeks per destination (code + tests) |
| Diff engine | included | 4 weeks |
| Queue + retry | included | 6 weeks |
| Dead-letter inspection | included | 2 weeks |
| Audience builder UI | included | 12+ weeks (and your data team has to maintain it) |
| Schema-change detection | included | 4 weeks |
The "build everything" path lands at 6–9 months for a v1 covering 5 destinations with no UI. The "buy" path lands at 4–6 weeks for the same scope plus an audience UI.
Output:
| Year | Build cost | Buy cost |
|---|---|---|
| 1 | ~$300k | ~$50k |
| 2 | ~$200k | ~$80k |
| 3 | ~$200k | ~$100k |
Why this works — concept by concept:
- Connector breadth — vendors maintain hundreds of destination integrations as their full-time job. A 2-engineer team building from scratch will cover 5–10 destinations at best in year 1.
- Diff engine is the moat — every reverse ETL platform's secret sauce is the diff/snapshot/incremental detection logic. Building a reliable one is a 6-month research project, not a weekend hack.
- Audience UI — the moment a non-engineer needs to ship an audience, you need a UI. Building that internally is a years-long product investment that has nothing to do with your company's actual product.
- Observability — per-row error tracking, dead-letter queues, sync success ring charts — all included in the vendor stack. Building them stalls your data team for months.
- Compliance escape hatch — RudderStack OSS exists precisely for the rare cases where vendor managed cannot work. Use OSS, not in-house build.
- Cost — over a 3-year window the buy path is 3–5× cheaper and ships in 1/10 the time. The only counter-arguments are scale (>10M MTU and you renegotiate hard) or compliance (and OSS solves that).
ETL
Topic — api-integration
API integration problems (data engineering)
4. Sync architecture — incremental detection, queues, rate limits
A sync is a diff engine plus a queue plus a worker pool plus a rate-limited destination API — every reverse ETL platform implements the same four-stage pipeline
The mental model in one line: the warehouse query produces rows, the diff engine classifies each row as insert/update/delete vs the previous snapshot, the queue absorbs back-pressure, and the worker pool drains the queue into the destination API while respecting per-destination rate limits. Once you can draw the four stages on a whiteboard, every "why is my sync slow / failing / partial?" question becomes a probe of which stage is the bottleneck.
Stage 1 — warehouse query and snapshot detection.
- Query. The model SQL (or audience-filtered model SQL) runs against the warehouse. Result is materialised either into a temp table or streamed.
-
Snapshot store. The previous run's
(pk, hash(attributes))set lives somewhere — a hidden table in the warehouse, a Postgres metadata DB in the vendor's infra, or a CDC stream offset. - Diff classification. For each current row: if PK absent in snapshot → INSERT; if PK present and hash differs → UPDATE; for each snapshot PK absent in current → DELETE (or "tombstone").
Stage 2 — staging / queue.
- Per-sync queue. Each sync gets its own queue, single-instanced. No parallel runs of the same sync.
- Back-pressure absorption. When the destination's API is slow, the queue grows; workers pull at the destination's pace, not the warehouse's pace.
- Persistence. Queues persist to disk so a vendor restart does not lose in-flight rows.
Stage 3 — worker pool.
- Worker concurrency. Configured per destination; usually 1–8 parallel workers per sync.
- Batch packing. Workers pack queue rows into destination-specific batches (Salesforce: 200/batch, HubSpot: 100/batch, Marketo: 300/batch).
- Token-bucket rate limiter. Each worker checks the destination's quota before issuing the call.
Stage 4 — destination API.
- Auth. OAuth, API key, service account — refreshed automatically by the platform.
- Rate limit response. 429 (Too Many Requests) triggers exponential backoff and a slowdown of the worker pool.
- Per-row error response. 4xx errors on specific rows are recorded as row-level failures, surfaced in the sync log, and either retried (transient) or dead-lettered (permanent).
Destination rate limits in the wild (2026 baselines).
| Destination | Limit | Notes |
|---|---|---|
| Salesforce | 15,000 / 24h (standard) | per-org, all APIs share |
| HubSpot | 100 / 10s + 250k / day | per-portal |
| Marketo | 100 / 20s + 50k / day | per-instance |
| Intercom | 1,000 / minute | per-app |
| Iterable | 4 / second list endpoints | varies by endpoint |
| Facebook Custom Audience | 200,000 users / API call | batched mode |
| Slack | 1 / second per webhook | basic tier |
Retry semantics.
- Transient (5xx, 429, network timeout) — retry with exponential backoff. Typical: 1s → 2s → 4s → 8s → 16s → 32s, then dead-letter.
- Permanent (4xx with validation error) — log and dead-letter immediately. Retrying will not help.
- Auth (401, token expired) — refresh the token and retry once, then alert.
- Quota exhausted (429 with daily-cap header) — pause the sync until the quota window resets; alert if the window is >12 hours.
Latency tiers.
- Hourly batches. Default for most syncs. 5–60 minutes end-to-end.
- Sub-minute batches. Census + small models. 30 seconds–5 minutes.
- CDC mirror. Continuous; reflects warehouse changes in seconds.
- Event-stream reverse ETL. RudderStack's path; reflects in 1–30 seconds.
Worked example — the diff engine in pseudo-code
Detailed explanation. The diff engine is the heart of every reverse ETL platform. It compares the current model row set against the previous snapshot and emits a stream of insert/update/delete events. Knowing the shape of this code helps debug "why did my sync ship row X?" questions.
Question. Write a pseudo-code sketch of a diff engine that takes (current_rows, previous_snapshot) and emits classified events. Explain how it handles deletes.
Input.
previous_snapshot:
C1 -> hash("Alice|pro|0.05")
C2 -> hash("Bob|trial|null")
C3 -> hash("Cara|pro|0.40")
current_rows:
C1 -> ("Alice", "pro", 0.05) # unchanged
C2 -> ("Bob", "pro", 0.10) # changed (trial -> pro)
C4 -> ("Dan", "trial", null) # new
# C3 missing -> deleted
Code.
def diff_engine(current_rows, previous_snapshot):
"""Yield classified change events."""
current_keys = set(current_rows.keys())
previous_keys = set(previous_snapshot.keys())
# INSERTs — PKs in current but not previous.
for pk in current_keys - previous_keys:
yield ("INSERT", pk, current_rows[pk])
# UPDATEs — PKs in both, hash differs.
for pk in current_keys & previous_keys:
new_hash = row_hash(current_rows[pk])
if new_hash != previous_snapshot[pk]:
yield ("UPDATE", pk, current_rows[pk])
# else: unchanged, emit nothing (this is the big saving).
# DELETEs — PKs in previous but not current.
# Only if sync_mode == "mirror"; otherwise skip deletes.
for pk in previous_keys - current_keys:
yield ("DELETE", pk, None)
# Persist new snapshot for next run.
new_snapshot = {pk: row_hash(row) for pk, row in current_rows.items()}
save_snapshot(new_snapshot)
Step-by-step explanation.
- The set difference
current - previousyields rows present this run but not last run — INSERTs. - The set intersection plus hash comparison yields rows present in both runs whose attributes changed — UPDATEs. Unchanged rows are skipped silently (zero API calls).
- The set difference
previous - currentyields rows present last run but absent this run — DELETEs. Only emitted inmirrorsync mode;upsertmode ignores them. - The new snapshot is persisted at the end. If the run crashes before this point, the next run sees the same previous snapshot and re-classifies the same diffs (idempotent recovery).
- The row hash function is typically MD5 / xxHash over the JSON serialisation of attributes in a canonical column order. Hash collisions are theoretically possible; in practice the rate is negligible at billion-row scale.
Output.
| Event | PK | Attributes |
|---|---|---|
| INSERT | C4 | (Dan, trial, NULL) |
| UPDATE | C2 | (Bob, pro, 0.10) |
| DELETE | C3 | — |
Rule of thumb. Always store the previous snapshot durably (warehouse table, Postgres, or S3). A lost snapshot triggers a "full diff against empty," which classifies every row as INSERT and floods the destination — the canonical "first run after vendor restart was a disaster" outage.
Worked example — the rate limiter and the 429 backoff loop
Detailed explanation. Every destination has rate limits. The worker pool must respect them or risk getting the entire integration locked. The token-bucket + exponential-backoff pattern is the universal solution.
Question. Sketch a worker loop that drains a queue of upsert events into a Salesforce-like API with a 15k/24h limit, handles 429 responses, and emits to dead-letter on permanent errors.
Input.
Queue items:
- upsert C1 with payload P1
- upsert C2 with payload P2
- upsert C3 with payload P3 (will return 400 — invalid email)
- upsert C4 with payload P4
Destination state:
- quota_remaining = 14_998
- quota_resets_at = 24h from now
Code.
def worker(queue, destination, rate_limiter, dead_letter):
while True:
event = queue.pop()
if event is None:
break
# 1. Respect the destination's rate limit.
rate_limiter.acquire(destination)
# 2. Make the API call.
backoff = 1
for attempt in range(7):
try:
response = destination.upsert(event.pk, event.payload)
if response.status == 200:
break
if response.status == 429:
# Rate limited — exponential backoff.
sleep(backoff)
backoff = min(backoff * 2, 60)
continue
if 400 <= response.status < 500:
# Permanent error — dead letter.
dead_letter.push(event, response.body)
break
if 500 <= response.status:
# Transient server error — retry.
sleep(backoff)
backoff = min(backoff * 2, 60)
continue
except NetworkTimeout:
sleep(backoff)
backoff = min(backoff * 2, 60)
continue
else:
# Out of attempts — dead letter.
dead_letter.push(event, "max_retries_exceeded")
Step-by-step explanation.
-
rate_limiter.acquireblocks the worker until the token bucket has a slot. Implementation is typically a Redis script that decrements a per-destination counter and refills it at the destination's rate. - The retry loop runs up to 7 attempts. On 429, the worker sleeps and retries (backoff 1s → 2s → 4s → ... capped at 60s).
- On 5xx transient server errors, the worker also retries — server-side issues are usually self-healing within seconds.
- On 4xx permanent errors (validation failure, malformed payload, missing required field), the worker stops retrying and pushes the event to the dead-letter queue for human inspection.
- Network timeouts (no response) are treated as transient — the worker retries with backoff.
- If all 7 attempts fail, the event is dead-lettered with
max_retries_exceededso on-call has visibility.
Output (events that reach the destination vs dead-letter).
| Event | Destination state | Dead-letter? |
|---|---|---|
| C1 | upserted | no |
| C2 | upserted | no |
| C3 | rejected (400 invalid email) | yes |
| C4 | upserted | no |
Rule of thumb. The retry loop should always distinguish transient (4 categories: 429, 5xx, timeout, auth-refresh) from permanent (4xx). Mixing them either burns rate limits on hopeless retries or silently drops fixable failures.
Worked example — back-pressure from a slow destination
Detailed explanation. When the destination API is slow (or rate-limit-restricted), the queue grows. A well-designed reverse ETL platform absorbs the growth and only fails when the queue passes a configured high-water mark — not every time the destination has a slow minute.
Question. Given a warehouse producing 10k rows/minute and a destination accepting 100 rows/minute, model the queue growth over an hour. Show why a "queue depth" alert is the right SLI and how to use it for early warning.
Input.
| Variable | Value |
|---|---|
| Warehouse output rate | 10,000 rows/min |
| Destination accept rate | 100 rows/min |
| Initial queue depth | 0 |
| Alert threshold | 50,000 rows |
Code.
def queue_growth(warehouse_rate, destination_rate, minutes):
depth = 0
growth_per_min = warehouse_rate - destination_rate
log = []
for minute in range(1, minutes + 1):
depth += growth_per_min
log.append((minute, depth))
return log
# Compute for one hour:
growth = queue_growth(10_000, 100, minutes=60)
# Alert fires when depth crosses 50_000.
alert_minute = next(m for m, d in growth if d > 50_000)
print(f"Queue depth alert at minute {alert_minute}")
# -> Queue depth alert at minute 6
Step-by-step explanation.
- Net growth per minute = warehouse output - destination accept = 10000 - 100 = 9900 rows/min.
- After 1 min: 9,900 rows queued. After 5 min: 49,500 queued. After 6 min: 59,400 — crosses the 50k alert threshold.
- The alert at minute 6 gives on-call 50 minutes of headroom before the queue passes a typical "platform refuses to enqueue" limit of ~500k rows.
- The right remediation depends on the cause: (a) destination is rate-limited — wait for the quota to reset and accept the lag; (b) destination is genuinely broken — pause the sync until the destination is healthy; (c) warehouse is producing duplicates — fix the model.
- Without the queue-depth alert the team only learns about the problem when the platform errors out at 500k+ — too late, downstream is already stale by hours.
Output.
| Minute | Queue depth | Alert? |
|---|---|---|
| 1 | 9,900 | no |
| 3 | 29,700 | no |
| 5 | 49,500 | no |
| 6 | 59,400 | yes |
| 60 | 594,000 | platform errors |
Rule of thumb. Alert on queue depth, not on sync errors. A sync error is the symptom; queue depth is the leading indicator. Set the alert threshold at 30–50% of the platform's enqueue ceiling to buy on-call time.
Reverse ETL interview question on rate-limit-aware design
A senior interviewer often asks: "Salesforce has a 15k API calls per day quota and our customer state model has 200k rows. How do you design a sync that fits inside the quota?"
Solution Using batching + diff-only + audience filtering
The math first:
raw rows = 200_000
Salesforce upsert batch size = 200 rows / call
full refresh calls = 1_000 calls / run
diff-only on 0.5% changed rows = 1_000 changed rows
diff-only batch calls = ceil(1_000 / 200) = 5 calls / run
hourly cadence = 24 runs / day
daily API calls = 5 * 24 = 120 calls / day
Headroom under the 15k quota: 124x.
The design:
1. Composite Tooling API batching.
- Use Salesforce's Composite/sObject Collections API:
200 records per call vs 1 record per Standard upsert.
2. Diff-only sync mode (no full refresh).
- Reverse ETL platform stores last-run snapshot.
- Ship only rows whose attribute hash changed.
3. Audience scoping.
- Many syncs only need the "active" subset of customers.
- Filter at the audience layer (plan != 'churned')
so the diff engine compares smaller sets.
4. Cadence sized to business need.
- Sales routing: every 30 minutes.
- Account health: every 6 hours.
- LTV refresh: every 24 hours.
- Do not over-spec freshness; quota is finite.
5. Per-sync quota guard.
- Configure the reverse ETL platform's "max API calls per
window" knob to a sub-quota share per sync.
- Hightouch and Census both expose this; RudderStack via config.
Step-by-step trace.
| Design choice | Effect on quota |
|---|---|
| Full refresh | 1,000 calls/run × 24 = 24,000/day — over quota |
| Diff-only | 5 calls/run × 24 = 120/day |
| Audience scoping | reduces diff size further |
| Per-sync quota guard | prevents any one sync from monopolising quota |
| Hourly vs 30-min cadence | doubles or halves daily API calls |
The combination of (2) and (4) is decisive. Diff-only converts the metric from "rows in the model" to "rows that changed," which on most attribute syncs is 0.1–2% of the model.
Output:
| Strategy | Calls / day | Inside 15k quota? |
|---|---|---|
| Full refresh hourly | 24,000 | no |
| Diff-only hourly | 120 | yes |
| Diff-only every 30m | 240 | yes |
| Diff-only every 5m | 1,440 | yes |
| Full refresh every 5m | 288,000 | catastrophic |
Why this works — concept by concept:
- Batched upsert — Salesforce's composite endpoint is the single biggest lever. Going from 1 row per call to 200 rows per call drops the call count by 200×.
- Diff-only sync — the second biggest lever. Only ship rows that actually changed. Drops the call count by 50–500× on typical attribute workloads.
- Audience filtering — shrinks the model to the rows that matter. Skipping churned customers saves both diff computation and quota.
- Cadence sizing — the third lever. Match the sync frequency to the actual business cadence; "fresh every 5 minutes" is rarely needed for a CRM attribute.
- Per-sync quota guard — defensive design. Even if one sync misbehaves (e.g. a model bug emits 200k diffs), the guard prevents it from burning the org-wide quota and breaking unrelated syncs.
- Cost — the design is essentially free. All the levers are configuration, not code. The cost is the discipline to model the math up-front for each new sync.
ETL
Topic — etl
ETL design problems (data engineering)
5. Governance, observability, and failure modes
A sync that has no governance, no observability, and no defined failure modes is not a data product — it is a time bomb
The mental model in one line: governance answers "who can sync what to where"; observability answers "is the sync healthy right now"; failure modes answer "what breaks and how do we know". The discipline that separates a hobbyist sync from a production data product is treating these three pillars as first-class — versioned, owned, and on-call paged.
Governance — five non-negotiables.
-
Field-level PII tagging. Every column tagged
pii=email | phone | address | name | ssn. Tags propagate to the sync layer so destinations can enforce per-tag policy. - Per-destination policy. "Email PII can sync to Marketo; SSN PII cannot sync to anything." Hightouch and Census both support sync-level allow/deny rules.
- Audience approval. New audiences > 10k members require analytics-engineering sign-off. Catches "I just synced 200k users to Facebook by accident."
-
GDPR delete propagation. A user's right-to-delete must reach every destination. The platform must support a "delete pipeline" sync (model =
users_to_delete, mode = delete-only, fanned out to every destination). - Audit log. Every sync edit, schedule change, and credential rotation is logged with actor + timestamp.
Observability — six SLIs to track.
- Sync success rate. Percent of runs that finished without a top-level error. Target: >99.5%.
- Row-error rate. Percent of rows in a successful run that failed (typically destination 4xx validation). Target: <1%.
- Freshness lag. Time since last successful run vs the scheduled cadence. Target: <2× cadence.
- Queue depth. Pending rows waiting for the worker pool. Leading indicator of destination slowness.
- Rejected payload sample. Stratified sample of dead-letter events for human inspection.
- Latency p50 / p99. Wall-clock time from model row produced to destination row accepted.
Failure modes — the four most common.
- Mapping drift. Warehouse column renamed; destination field still expects the old name; sync silently writes NULL or fails.
- Schema drift. Column type changed (INT → BIGINT, VARCHAR(50) → VARCHAR(500)); destination rejects with type-mismatch error.
- Row-cap breach. Audience suddenly grows from 5k to 200k members because a filter became overly permissive; destination quota burns out.
- Credential expiry. OAuth refresh token expires; sync fails with 401; team finds out hours later when freshness lag alert fires.
Catalog + lineage — surfacing syncs as dbt exposures.
- Every sync is a known consumer of one or more dbt models. The standard surface is a dbt exposure:
# models/marts/exposures.yml
exposures:
- name: salesforce_lead_score_sync
type: application
owner:
name: Analytics Engineering
email: ae@example.com
depends_on:
- ref('reverse_etl_customer_state')
description: |
Hightouch sync into Salesforce.Contact.lead_score__c.
Cadence: every 30 minutes.
On-call: data-team rotation.
Cost guardrails.
- Per-sync row caps. "This sync will never ship more than 50k rows per run; abort if it tries."
- Audience size caps. "This audience will never include more than 100k members; alert if it does."
- Quota share caps. "This sync will use no more than 30% of the destination's daily API quota."
- Frequency caps. "Even if scheduled hourly, no more than 24 runs per day."
Worked example — propagating PII tags from dbt to the sync layer
Detailed explanation. Field-level PII tagging is the foundation of governance. When a column is tagged in dbt, the tag must propagate to every downstream sync so per-destination policy can enforce "this PII can/cannot land here." Census and Hightouch both read dbt meta tags directly.
Question. Tag dim_users.email as pii=email in dbt, configure Census to read the tag, and define a per-destination policy that allows email to sync to Marketo but blocks it from a marketing experimentation tool.
Input.
# dbt model schema.yml
version: 2
models:
- name: dim_users
columns:
- name: user_id
tests: [unique, not_null]
- name: email
meta:
pii: email
contains_pii: true
- name: ssn
meta:
pii: ssn
contains_pii: true
Code.
# Census destination policy.
destinations:
marketo:
allowed_pii_tags: [email, name]
blocked_pii_tags: [ssn, phone, address]
experimentation_tool:
allowed_pii_tags: [name]
blocked_pii_tags: [email, ssn, phone, address]
# Note: email is blocked here.
# Census sync definition.
syncs:
- name: users_to_marketo
source: dim_users
destination: marketo
mappings:
- source: email -> Lead.Email # OK — email allowed in Marketo
- source: name -> Lead.Name # OK
- name: users_to_experimentation
source: dim_users
destination: experimentation_tool
mappings:
- source: name -> User.display_name # OK
- source: email -> User.identifier # BLOCKED — sync refuses to compile
Step-by-step explanation.
- The dbt
metablock tags the column with structured PII metadata. Census's dbt project reader picks up the tag automatically — no second source of truth. - The destination policy lists allowed and blocked PII categories per destination. Marketo accepts email + name; the experimentation tool accepts only name.
- When the sync to Marketo compiles, every mapping is checked against the policy. Email → Lead.Email is allowed; the sync ships.
- When the sync to the experimentation tool compiles, the email mapping triggers a policy violation. Census refuses to compile the sync; the engineer sees a clear error and either removes the mapping or escalates for an exception approval.
- The policy is enforced at compile time, before any row hits a network. A misconfigured sync never reaches the destination.
Output.
| Sync | Policy decision |
|---|---|
| users_to_marketo | compiles + ships |
| users_to_experimentation | refused to compile (email blocked) |
Rule of thumb. Tag PII at the dbt column level; let the reverse ETL platform read tags and enforce per-destination policy at compile time. Never enforce PII policy at the row level at runtime — at runtime the data has already left the warehouse.
Worked example — the freshness SLA alert
Detailed explanation. Every sync has a freshness contract — "fresh within 2 hours" — set by the consuming team. The platform tracks the actual freshness and alerts when the contract is breached. The alert wakes on-call before the marketing team complains.
Question. Configure a freshness alert for the salesforce_lead_score_sync (cadence 30 min, SLA 2h) and walk through the on-call response when it fires.
Input.
| Sync | Cadence | SLA | Freshness now |
|---|---|---|---|
| salesforce_lead_score_sync | 30 min | 2h | 3h 15m ago |
Code.
# Census alert definition (illustrative).
alerts:
- name: lead_score_sync_freshness
sync: salesforce_lead_score_sync
condition: minutes_since_last_success > 120 # 2h SLA
severity: page
notify:
- pagerduty: data-team-oncall
- slack: "#data-alerts"
runbook: |
Sync has not succeeded in over 2 hours.
Steps:
1. Check Census dashboard for recent error.
2. If 401 — refresh OAuth credential.
3. If 429 — wait for quota reset; backfill afterwards.
4. If model SQL error — open dbt repo, fix, redeploy.
5. If destination outage — pause sync, monitor status page.
Step-by-step explanation.
- The alert condition
minutes_since_last_success > 120measures actual freshness against the 2h SLA. The 30-minute cadence is the target; the SLA is the deadline. - When the alert fires, PagerDuty pages the on-call data engineer and posts to the Slack channel. The runbook is in the alert body, not in a separate wiki.
- The on-call reads the Census dashboard, identifies the failure category (auth, quota, model error, destination outage), and applies the matching runbook step.
- The runbook covers the four most-common failure modes. Steps 1–3 are operational; step 4 escalates to the model owner; step 5 escalates to the destination vendor.
Output (timeline of the on-call response).
| Time | Event |
|---|---|
| 03:00 | Last successful run. |
| 03:30 | Scheduled run fails — 401 (token expired). |
| 04:00 | Second scheduled run fails — 401. |
| 04:30 | Third scheduled run fails — 401. |
| 05:00 | Freshness alert fires (2h SLA breached). PagerDuty pages on-call. |
| 05:05 | On-call reads runbook, refreshes OAuth credential. |
| 05:10 | Sync retries successfully. Freshness lag drops to 10 minutes. |
Rule of thumb. Freshness lag is the right top-line SLI for a sync — not "did the last run succeed." A sync that runs and succeeds every hour is fine. A sync that runs every 30 minutes but has failed for the last 4 runs is broken, and only the freshness lag catches it.
Worked example — schema drift catches before deploy
Detailed explanation. Schema drift happens when a model's column type or name changes in a way the downstream sync cannot accept. The right place to catch it is in dbt CI, before merge — not in production after the sync starts failing.
Question. Configure dbt contracts on the reverse_etl_customer_state model and walk through what happens when a developer tries to rename lifetime_revenue to lifetime_value without coordinating with the sync.
Input.
# dbt contract on the model.
models:
- name: reverse_etl_customer_state
config:
contract:
enforced: true
columns:
- name: salesforce_contact_id
data_type: varchar
tests: [unique, not_null]
- name: name
data_type: varchar
- name: lifetime_orders
data_type: integer
- name: lifetime_revenue
data_type: numeric
- name: last_order_at
data_type: timestamp
Code.
-- Developer's PR — renames lifetime_revenue.
-- File: models/marts/reverse_etl_customer_state.sql
SELECT
c.salesforce_contact_id,
c.name,
COUNT(o.order_id) AS lifetime_orders,
COALESCE(SUM(o.amount), 0) AS lifetime_value, -- renamed!
MAX(o.order_date) AS last_order_at
FROM dim_customers c
LEFT JOIN fact_orders o ON o.customer_id = c.customer_id
GROUP BY c.salesforce_contact_id, c.name;
Step-by-step explanation.
- The developer renames
lifetime_revenuetolifetime_valuein the SELECT clause. - dbt CI runs
dbt build. The contract check inspects the actual output schema against the declaredcolumns:list. - The output column
lifetime_valuedoes not match the declaredlifetime_revenue. dbt fails the build with a clear error: "column lifetime_revenue not produced; column lifetime_value produced unexpectedly." - The CI failure blocks the merge. The developer either reverts the rename or files a coordinated migration (rename in dbt + rename mapping in sync + cutover plan).
- Without the contract, the rename would merge, the next sync run would silently ship NULL for
lifetime_revenue(Salesforce field overwritten with NULL), and the marketing team would discover the bug three days later when their nurture sequence fires for everyone.
Output.
| Stage | Outcome |
|---|---|
| Pre-contract | rename merges, sync silently writes NULL, downstream stale |
| With contract | rename blocked in CI, coordinated migration required |
Rule of thumb. Every dbt model with at least one reverse ETL sync should have an enforced contract. The contract is the bridge between "data team owns the model" and "operational team owns the destination" — it makes drift loud instead of silent.
Reverse ETL interview question on the sync as a data product
A senior interviewer often asks: "How do you turn a one-off sync from a side-project into a production data product? What does the full lifecycle look like?"
Solution Using the data-product lifecycle
The data-product lifecycle for a reverse ETL sync:
1. INTAKE
- Consumer team files a sync request.
- Required fields: model, destination, fields, cadence,
SLA, on-call owner.
2. DESIGN
- Analytics engineer reviews the model PK + idempotency.
- PII tags audited; destination policy verified.
- Audience defined if filtering required.
- dbt contract on the source model.
- Cost estimate (quota + MTU).
3. BUILD
- Sync YAML / config committed to git.
- CI runs dbt build + sync linting.
- PR review by analytics engineering.
4. DEPLOY
- Sync deployed to staging destination first.
- Manual QA on 10 sample rows.
- Cut over to production destination.
5. MONITOR
- dbt exposure surfaced in catalog.
- Freshness alert + row-error alert configured.
- Queue-depth alert configured.
- On-call runbook attached.
6. ITERATE
- Quarterly review of sync health metrics.
- Audience drift review (size still in expected range?).
- Destination policy review (PII still compliant?).
- Cost review (still inside quota envelope?).
7. RETIRE
- When the consumer no longer needs it: archive the sync,
drop the dbt exposure, document the deprecation.
Step-by-step trace.
| Phase | Owner | Output |
|---|---|---|
| Intake | Consumer team + AE | sync request ticket |
| Design | Analytics engineering | sync design doc |
| Build | Analytics engineering | sync YAML + PR |
| Deploy | Analytics engineering | staging then prod |
| Monitor | Data on-call | dashboards + alerts |
| Iterate | Analytics engineering | quarterly review notes |
| Retire | Analytics engineering | deprecation note |
The discipline is the same as any backend service. The vocabulary borrows from product management (intake, MVP, monitoring, deprecation) more than from data engineering (model, refresh, materialise).
Output:
| Artifact | Where it lives |
|---|---|
| Sync config | dbt repo / sync YAML |
| dbt contract | model schema.yml |
| dbt exposure | exposures.yml |
| Alerts | observability platform |
| Runbook | alert body + wiki |
| Cost budget | per-sync row cap + quota share |
| On-call rota | PagerDuty schedule |
Why this works — concept by concept:
- Intake gates entry — not every "we want a sync" idea becomes a sync. The intake form forces the consumer to articulate model, destination, SLA, and ownership before any engineering time is spent.
- dbt contracts gate change — every sync model has an enforced contract. Drift is caught at PR time, not at production-failure time.
- Exposures surface lineage — the data catalog knows every sync. When a model changes, the catalog shows every downstream sync that will be affected.
- Alerts surface failure — freshness lag, row-error rate, and queue depth are the three SLIs. Every sync has them; on-call wakes up to them.
- Quarterly review surfaces drift — audiences grow, costs shift, PII policy evolves. Quarterly review catches slow drift before it becomes an incident.
- Retirement is explicit — syncs are retired explicitly, not abandoned. A retired sync is archived in git and removed from exposures so the catalog stays accurate.
- Cost — the discipline is overhead. For a low-stakes internal sync, the full lifecycle is overkill. For any sync touching customer-facing automation, the lifecycle is the floor.
Data
Topic — data-transformation
Data transformation problems (data engineering)
Cheat sheet — reverse ETL recipes
-
Lead score → Salesforce. Model
fct_lead_score(one row per Salesforce contact) → audience "lead_score >= 80" → upsert intoContact.lead_score__c. Cadence: 30 minutes. Use composite API batching for 200 rows/call. -
Account churn risk → Intercom. Model
dim_accountswithchurn_risk→ audience "churn_risk > 0.7" → mirror sync setsCompany.churn_risk_tag = at_riskand clears the tag when the account drops out of the audience. -
High-value users → Facebook custom audience. Model
dim_usersjoined tofct_user_revenue→ audience "ltv_usd > 5000" → mirror sync hashes emails and pushes to a Meta custom audience. Reflects add/remove automatically on each run. -
Slack high-value signup alert. Model
fct_signupsfiltered to "plan = pro AND first_seen_at >= today" → RudderStack event sync → Slack webhook posts to#sales-alertswith the new account name + plan + region. -
Marketing suppression list. Model
dim_usersfiltered to "opted_out = true OR gdpr_deleted = true" → mirror sync to every marketing destination's suppression list (Marketo, Iterable, Customer.io, Mailchimp). -
Reverse ETL → product analytics. Model
marts.user_cohortswith(user_id, cohort_label)→ upsert into Amplitude'scohortsAPI, mirrored to Mixpanel'scohortendpoint. Lets PMs filter funnels by warehouse-defined cohorts. -
GDPR delete pipeline. Model
users_to_delete(one row per requested deletion) → delete-only sync fanned out to Salesforce, HubSpot, Marketo, Intercom, Iterable, Facebook. Idempotent: a row deleted twice is a no-op. -
Trial-ending sequence trigger. Model
dim_usersfiltered to "plan = trial AND trial_ends_at BETWEEN today AND today + 7" → mirror sync to Iterable user propertytrial_end_date. Iterable workflow fires the in-app + email sequence. -
Customer attribute fan-out. Single model
marts.customer_attributes(one row per customer) → multiple syncs to Salesforce, HubSpot, Intercom, Iterable each picking the columns they need. One source, many destinations. -
Sales territory routing. Model
dim_accountswithterritory_code→ upsert into SalesforceAccount.RoutingTerritory__c. Pairs with a Salesforce assignment rule that reads the field at lead creation. -
NPS score sync. Model
marts.nps(one row per account with rolling NPS) → upsert into SalesforceAccount.nps_rolling__c. Customer success team filters Salesforce dashboards by NPS bucket. -
Webhook fan-out. Model
fct_account_events(one row per significant account event) → RudderStack event sync → internal API webhook, Slack channel, and Salesforce task creation in parallel.
Frequently asked questions
Is reverse ETL the same as a CDP?
Not quite — they overlap but solve different starting problems. A CDP (Customer Data Platform like Segment or RudderStack Event) collects events from your sources and forwards them to destinations; the warehouse is optional. Reverse ETL starts from the warehouse — it assumes you already have a single source of truth for customer attributes and ships that to destinations. The modern stack often uses both: a CDP collects events into the warehouse (forward path), and a reverse ETL tool ships warehouse-aggregated state back to operational tools (reverse path). RudderStack is unusual in offering both in one product; Hightouch and Census focus on the reverse ETL half only.
Do I need a customer data warehouse before reverse ETL?
Yes — you need a warehouse and a single canonical definition of the entity you want to sync. The warehouse can be Snowflake, BigQuery, Databricks, Redshift, or Postgres; it does not have to be branded a "customer data warehouse." What matters is that one SQL query produces one row per entity with the attributes you need to ship. If your data is still scattered across SaaS tools with no aggregation layer, you have a forward ETL problem first, and reverse ETL has nothing to sync.
How is Hightouch different from Census?
Hightouch optimises for the GTM / revenue ops persona — drag-and-drop audience builder, multi-channel journeys (Hightouch Sequences), broad destination catalogue (200+), strong observability with row-level error inspection. Census optimises for the analytics engineering / data team persona — tightest dbt integration of any vendor (reads dbt_project.yml, surfaces exposures, git-backed sync configs), SQL-first audience model, sync-test gating tied to dbt tests. Pick Hightouch when non-SQL users own the audience layer; pick Census when the data team owns it end-to-end and dbt is the source of truth.
Can I build reverse ETL myself with Airflow + APIs?
Yes, technically — and you should not, in practice. A v1 covering 5 destinations takes two senior engineers about 6 months to build: connectors, diff engine, queue + retry, dead-letter inspection, audience builder UI, schema-change detection, audit logging, PII governance. The three production vendors (Hightouch, Census, RudderStack) ship all of that for the price of about one engineer-year per year. The only cases where in-house build wins are (a) you have an extremely narrow scope (one destination, never more), (b) you are at a scale where MTU pricing genuinely hurts (>10M MTU and you can renegotiate hard), or (c) you have a hard BYOC compliance constraint and even RudderStack OSS does not fit.
What latency can reverse ETL realistically deliver?
Batch reverse ETL typically delivers 5–60 minute end-to-end latency, dominated by the warehouse query time plus the destination API throughput. Census claims sub-minute sync on small models with their fastest tier; Hightouch's shared infrastructure typically lands around 5–15 minutes. RudderStack's event-stream reverse ETL path closes the loop in seconds to a minute for individual event triggers but is not magic for batch attribute updates. If your use case requires sub-second response (in-session personalisation, fraud blocking, real-time bidding), reverse ETL is the wrong tool — you want an online feature store or an event-stream architecture that does not round-trip through a warehouse query.
How do I handle GDPR deletes through reverse ETL?
Build a dedicated delete pipeline. The pattern: one warehouse model users_to_delete with one row per requested deletion (user_id, email, requested_at), fanned out as a delete-only sync to every destination that received that user's PII. Each destination has a delete or "right-to-be-forgotten" API; Hightouch and Census both expose delete-only sync modes that wire into them. Idempotency matters — a user deleted twice should be a no-op. Audit-log every delete sync run for compliance evidence. Crucially, the platform itself must be able to delete its sync history for the deleted user; verify your vendor's GDPR posture before committing to PII-heavy syncs.
Practice on PipeCode
- Drill the ETL practice library → for the warehouse-to-destination data movement patterns that reverse ETL formalises.
- Layer in API integration drills → for the rate-limit + retry + idempotency primitives every sync depends on.
- Stack the dimensional modelling library → so your reverse ETL models are one-row-per-entity by default.
- Sharpen the data transformation library → for the aggregation patterns that turn fact tables into reverse ETL models.
- Practise streaming problems → for the event-stream reverse ETL path RudderStack and modern Hightouch / Census tiers ship.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the system-design axis with the ETL system design course →.
- For long-form data modelling craft, work through data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every reverse ETL recipe above ships with hands-on practice rooms where you design the model, write the idempotent upsert, and reason about rate limits against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your sync design will hold up at scale.
Practice ETL now → API integration drills →





Top comments (0)