DEV Community

Cover image for Fivetran vs Airbyte vs Stitch vs Hevo: Managed ELT Compared in 2026
Gowtham Potureddi
Gowtham Potureddi

Posted on

Fivetran vs Airbyte vs Stitch vs Hevo: Managed ELT Compared in 2026

fivetran vs airbyte is the single procurement debate that consumes more senior data-engineering hours in 2026 than any other tooling choice, and the honest answer is that neither vendor is the right pick for every pipeline in a serious analytics stack. Managed ELT platforms exist to collapse the operational cost of moving rows from SaaS systems and operational databases into a cloud warehouse — the ingest step of the "EL" in ELT — and the four platforms that consistently show up on the shortlist are Fivetran (the closed premium incumbent), Airbyte (the OSS-plus-Cloud challenger), Stitch (the mature-but-stagnant mid-market player owned by Talend / Qlik), and Hevo (the growth-stage streaming-flavoured competitor). Every serious managed elt evaluation ultimately reduces to four axes — connector catalog breadth, pricing model, custom-connector story, and ownership boundary — and every senior interview probes at least two of them.

This guide is the walkthrough you wished existed the first time your CFO asked why the fivetran alternatives list was so short, or the first time an interviewer asked you to sketch a hybrid Fivetran-plus-Airbyte topology on the whiteboard. It walks through the unit-economics shift that made managed ELT cheap enough for medium-sized shops, Fivetran's MAR-based pricing math and the connector-catalog moat, Airbyte's OSS-plus-Cloud hybrid model and the Connector Development Kit (stitch data and hevo data show up as the mid-market alternatives with row-based pricing), and the vendor-selection decision tree senior data platform leads actually use — including the hybrid pattern where Fivetran carries the SaaS surface and Airbyte carries the internal APIs. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for Fivetran vs Airbyte vs Stitch vs Hevo — bold white headline 'Managed ELT Compared' with subtitle 'Fivetran · Airbyte · Stitch · Hevo' and a hero composition of four coloured medallion badges on a balance scale on a dark gradient with purple, blue, green, and orange accents, plus a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the ETL practice library →, rehearse on the API integration library →, and sharpen the design axis with the design practice library →.


On this page


1. Why "buy vs build" flipped again in 2026

The unit economics shifted — managed ELT is cheap enough for medium-sized shops, and building your own is a hiring decision, not an engineering one

The one-sentence invariant: in 2026 the marginal cost of a managed ELT connector is measurably lower than the fully-loaded cost of one senior data engineer maintaining the same connector in-house, once you include drift, schema-evolution handling, and on-call. Ten years ago, the calculation went the other way — connector vendors charged enterprise SaaS prices for what was arguably a plumbing commodity, and every serious data team wrote their own Python ingest scripts and cron jobs. Between then and now, three things changed simultaneously: warehouse compute became cheap enough that "L" (load-first) dominated "T" (transform-in-flight), the connector vendors industrialised their catalogs to 500+ sources, and open-source Airbyte forced a public price ceiling. The result is a market where the honest answer to "should we build our own Salesforce connector?" is almost always "no, but the vendor mix is not obvious."

The four axes interviewers actually probe.

  • Connector catalog breadth. How many of your sources are covered without you writing code? Fivetran wins here in raw numbers (500+), but breadth is not depth — you need to verify that the connector handles the specific tables and specific fields your team uses, and that schema-evolution is automatic. The senior signal is naming which connectors are known to lag behind vendor API changes (Salesforce custom objects, Marketo dynamic lists, NetSuite savedsearches — all famously painful across every ELT vendor).
  • Pricing model. MAR (Monthly Active Rows) versus row-based versus event-based versus flat-fee versus connector-count-based — each model punishes a different workload shape. Fivetran's MAR model punishes high-churn source tables (CDC-heavy Postgres); Stitch and Hevo's row-based models punish high-volume append-only tables (event streams); Airbyte OSS is free-of-vendor but costs whatever your K8s cluster costs. Cost math is the interviewer's favourite trap because it's easy to get catastrophically wrong.
  • Custom connector story. Every serious data stack has at least one weird internal API or vendor with no first-class connector. The vendor's answer to "how do I connect this?" is a make-or-break axis: Fivetran offers a Function connector (Lambda/Cloud Function endpoint) and an SDK; Airbyte ships a full Connector Development Kit in Python, Java, or low-code YAML; Stitch has Singer taps; Hevo has a REST API source but limited SDK maturity. The senior signal is knowing your custom-connector cadence before you sign the contract.
  • Ownership boundary. Who owns the compute, the credentials, the data-in-flight, and the debugging when the pipeline breaks at 2 AM? Fully-managed vendors own everything (fastest to ship, hardest to debug internal issues); Airbyte OSS puts everything on your team (slowest to ship, easiest to fix); Airbyte Cloud sits between (hosted OSS with vendor support). The correct answer depends on your team's operational maturity and your compliance posture (PII, HIPAA, SOC 2).

Why the 2026 market looks different from the 2020 market.

  • Warehouse compute normalised. Snowflake, BigQuery, Redshift, and Databricks all offer sub-cent-per-second compute; the "T" of ELT is trivially affordable, so the "EL" phase became the bottleneck and the profit centre.
  • Fivetran full-price posture. Fivetran remains the premium vendor and has not meaningfully undercut its list price in five years — but selectively discounts for large enterprise contracts. The public sticker shock has driven procurement to seriously evaluate airbyte vs fivetran for the first time.
  • Airbyte OSS matured. Airbyte's OSS release cadence, connector count (350+), and Cloud offering have crossed the "production-ready" threshold for most workloads. The Connector Development Kit made writing a custom connector a 200-line YAML exercise instead of a 2000-line Python project.
  • Stitch stagnation. Stitch (acquired by Talend, now Qlik) has effectively stopped shipping new connectors and is the "legacy" pick — still fine for stable pipelines, dangerous for new ones because the roadmap is unclear.
  • Hevo growth push. Hevo has aggressively pushed into the mid-market with a real-time streaming pitch, no-code transforms, and a cheaper entry-level tier than Fivetran. Whether Hevo can sustain that pitch at scale is the open question.

What interviewers listen for on the "buy vs build" opener.

  • Do you push back on "let's build our own" with the fully-loaded engineer-cost argument? — required answer.
  • Do you name the four axes (catalog, pricing, custom, ownership) without being prompted? — senior signal.
  • Do you distinguish between connector count and connector depth? — senior signal.
  • Do you mention the hybrid pattern (Fivetran for SaaS, Airbyte for internal) as a real production topology? — required answer for a senior role.

Worked example — buy vs build cost math

Detailed explanation. A Series B startup with a 4-engineer data team is debating whether to buy Fivetran or build its own Python ingest scripts for the 12 SaaS sources on the roadmap. The naive calculation focuses on Fivetran's ~$3000/month starting bill and concludes "build is cheaper." The realistic calculation loads on the fully-burdened engineer cost, connector drift, on-call time, and opportunity cost — and comes out the other way.

  • The pitch. "We can write 12 Python scripts, put them on Airflow, and skip Fivetran entirely."
  • The math trap. The 12 scripts are Sprint-1 work; the drift maintenance is Sprint-N work.
  • The fully-loaded cost. A senior engineer at $250k total comp costs $120/hour loaded; each connector realistically consumes 4 hours/month in drift + schema-evolution work; 12 connectors = 48 hours/month = $5,760/month.

Question. A team is deciding between Fivetran ($3000/month for a starter plan covering 12 connectors) and building 12 in-house Python connectors. Model the total cost of ownership over 24 months, including engineer time for initial build and ongoing maintenance. Recommend the correct choice.

Input.

Parameter Value
Fivetran monthly $3,000
Fivetran annual price escalation 10%
Senior engineer loaded cost/hour $120
Initial build hours per connector 40
Ongoing maintenance hours per connector per month 4
Connectors on the roadmap 12
Time horizon 24 months

Code.

"""Fivetran buy-vs-build TCO model for a 12-connector roadmap."""
from decimal import Decimal

FIVETRAN_MONTHLY = Decimal("3000")
FIVETRAN_ESCALATION = Decimal("1.10")     # 10% annual price bump
ENG_HOURLY = Decimal("120")
INITIAL_HOURS_PER_CONNECTOR = 40
MAINT_HOURS_PER_CONNECTOR_PER_MONTH = 4
CONNECTORS = 12
HORIZON_MONTHS = 24


def buy_total(months: int) -> Decimal:
    """Fivetran cost over N months with a 10% annual price escalation."""
    total = Decimal("0")
    for month in range(months):
        year = month // 12
        monthly = FIVETRAN_MONTHLY * (FIVETRAN_ESCALATION ** year)
        total += monthly
    return total


def build_total(months: int) -> Decimal:
    """In-house build cost — initial build + ongoing maintenance."""
    initial = Decimal(CONNECTORS * INITIAL_HOURS_PER_CONNECTOR) * ENG_HOURLY
    monthly_maint = Decimal(CONNECTORS * MAINT_HOURS_PER_CONNECTOR_PER_MONTH) * ENG_HOURLY
    ongoing = monthly_maint * months
    return initial + ongoing


if __name__ == "__main__":
    buy = buy_total(HORIZON_MONTHS)
    build = build_total(HORIZON_MONTHS)
    print(f"Buy   (Fivetran)  24mo: ${buy:>12,.0f}")
    print(f"Build (in-house)  24mo: ${build:>12,.0f}")
    print(f"Delta                : ${build - buy:>12,.0f}")
    print(f"Build/Buy ratio      : {build / buy:.2f}x")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The buy_total function walks month by month and applies a 10% annual escalation to the Fivetran bill. Year 1 is $3000/month × 12 = $36,000; year 2 is $3300/month × 12 = $39,600. Two-year total = $75,600. This is the honest "buy" cost — vendors rarely stay flat on renewal.
  2. The build_total function separates the initial build cost (12 connectors × 40 hours × $120/hour = $57,600) from the ongoing maintenance (12 connectors × 4 hours × $120/hour = $5,760/month). The initial cost is a Sprint-1 line item; the ongoing cost is a recurring hit that engineering leads systematically underestimate.
  3. Total in-house build cost over 24 months: $57,600 + ($5,760 × 24) = $57,600 + $138,240 = $195,840. Roughly 2.6× the Fivetran cost.
  4. Sensitivity check: even if you halve the maintenance hours (2 hours per connector per month) and skip the initial build (assume open-source starting points), the two-year in-house cost is still ~$127K — 1.7× Fivetran.
  5. The hidden cost the model doesn't include: the opportunity cost of 48 engineer-hours per month spent on connectors instead of the modelling, data-quality, or reverse-ETL work that actually differentiates the business.

Output.

Buy   (Fivetran)  24mo: $     75,600
Build (in-house)  24mo: $    195,840
Delta                : $    120,240
Build/Buy ratio      : 2.59x
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. For any team with fewer than 8 data engineers, "build our own connectors" is a fully-loaded cost roughly 2-3× the buy cost — before counting opportunity cost. Buy Fivetran or Airbyte; spend the engineer-hours on modelling and data-quality. Build is defensible only for the specific connectors your business genuinely differentiates on.

Worked example — the four axes on a real vendor evaluation

Detailed explanation. A data platform lead builds a vendor scorecard for a Fivetran / Airbyte / Stitch / Hevo bakeoff. The naive scorecard lists 30 features; the interviewer-quality scorecard collapses to the four axes and forces a rank per axis with weights. Show the compressed form.

  • The failure mode. A 30-row scorecard rarely produces a decision because every vendor "wins" at some sub-feature.
  • The four-axis form. Ranks per axis + weights, computed on a single sheet.
  • The output. A weighted score per vendor and a clear per-pipeline recommendation.

Question. Build a four-axis vendor scorecard comparing Fivetran, Airbyte Cloud, Stitch, and Hevo for a data platform with 20 SaaS sources, 3 internal REST APIs, moderate PII posture (SOC 2, no HIPAA), and a 4-engineer data team. Compute weighted scores.

Input.

Axis Weight Description
Catalog breadth 0.30 Native connectors for the 20 SaaS sources
Pricing model fit 0.25 Predictability under expected volume growth
Custom connector story 0.25 Effort to add the 3 internal REST APIs
Ownership boundary 0.20 Ops burden and compliance posture

Code.

"""Four-axis vendor scorecard — Fivetran vs Airbyte vs Stitch vs Hevo."""
from dataclasses import dataclass


@dataclass
class VendorScore:
    catalog: int      # 1..5
    pricing: int      # 1..5
    custom: int       # 1..5
    ownership: int    # 1..5

    def weighted(self, w_cat=0.30, w_price=0.25, w_cust=0.25, w_own=0.20):
        return (self.catalog * w_cat +
                self.pricing * w_price +
                self.custom  * w_cust +
                self.ownership * w_own)


scores = {
    "Fivetran":      VendorScore(catalog=5, pricing=2, custom=3, ownership=5),
    "Airbyte Cloud": VendorScore(catalog=4, pricing=4, custom=5, ownership=4),
    "Stitch":        VendorScore(catalog=3, pricing=4, custom=2, ownership=4),
    "Hevo":          VendorScore(catalog=3, pricing=4, custom=3, ownership=4),
}

for name, s in sorted(scores.items(), key=lambda x: -x[1].weighted()):
    print(f"{name:15s}  weighted = {s.weighted():.2f}  "
          f"(cat={s.catalog} price={s.pricing} cust={s.custom} own={s.ownership})")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The weights are set from the platform lead's stated priorities: catalog breadth (30%) matters most because 20 SaaS sources dominate the workload; pricing model fit (25%) matters because volume is projected to grow; custom-connector story (25%) matters because 3 internal REST APIs are on the roadmap; ownership (20%) matters least because compliance is moderate.
  2. Fivetran scores 5 on catalog (widest coverage), 2 on pricing (MAR model punishes CDC-heavy churn), 3 on custom (Function connector works but is not first-class), 5 on ownership (fully managed, minimal ops). Weighted: 3.65.
  3. Airbyte Cloud scores 4 on catalog (350+ connectors, some slightly behind Fivetran on the enterprise edge), 4 on pricing (per-connector Cloud pricing plus predictable overage), 5 on custom (CDK is best in class), 4 on ownership (hosted OSS, some ops overhead). Weighted: 4.15.
  4. Stitch and Hevo tie at 3.35 and 3.40 — mid-tier on catalog, competitive on pricing, weaker on custom-connector story. Neither wins any axis outright against Fivetran or Airbyte.
  5. The recommendation for this platform: Airbyte Cloud, primarily because the custom-connector axis (25% weight) drives the decision. If the platform had zero internal APIs, Fivetran would win by half a point. This is the punchline of the scorecard exercise — the four axes surface which lever your specific pipeline pulls on.

Output.

Airbyte Cloud    weighted = 4.15  (cat=4 price=4 cust=5 own=4)
Fivetran         weighted = 3.65  (cat=5 price=2 cust=3 own=5)
Hevo             weighted = 3.40  (cat=3 price=4 cust=3 own=4)
Stitch           weighted = 3.35  (cat=3 price=4 cust=2 own=4)
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. A scorecard that produces a single ranked list is a useful scorecard; a scorecard with 30 tied features is procurement theatre. Collapse to four weighted axes, force a rank, and defend the weights in the vendor call. The weights are the interesting decision; the ranks are usually obvious.

Worked example — the connector-drift budget

Detailed explanation. A team already running Fivetran wants to add 5 in-house Airflow-based connectors "because we're an engineering team, not a procurement team." The senior lead pushes back with a drift-budget calculation: every in-house connector is a maintenance liability that consumes engineer-hours forever, not a one-time build cost. The interviewer's version of this question is "how do you decide when to build a connector instead of buying?"

  • The pattern. Every SaaS API changes. Salesforce ships new object types every quarter; Stripe changes webhook payloads every year; NetSuite deprecates SavedSearches routinely.
  • The drift budget. Each in-house connector consumes a predictable slice of engineer-hours annually just to keep pace with vendor API changes.
  • The trade-off. A managed connector amortises the drift cost across the vendor's entire customer base; an in-house connector puts 100% of the drift cost on your team.

Question. Compute the annual drift budget for a team with 5 in-house connectors, and compare against Fivetran's incremental cost for the same 5 connectors. Recommend a threshold above which in-house connectors are no longer defensible.

Input.

Parameter Value
Drift hours per connector per year 24
Emergency-fix hours per connector per year 8
Engineer loaded cost per hour $120
Fivetran incremental cost per connector per year $2,400
Number of in-house connectors 5

Code.

"""Drift budget model — when does in-house stop being defensible?"""

DRIFT_HOURS_PER_CONNECTOR_YEAR = 24
EMERGENCY_HOURS_PER_CONNECTOR_YEAR = 8
ENG_HOURLY = 120
FIVETRAN_PER_CONNECTOR_YEAR = 2400

def inhouse_cost(n_connectors: int) -> int:
    hours = n_connectors * (DRIFT_HOURS_PER_CONNECTOR_YEAR
                            + EMERGENCY_HOURS_PER_CONNECTOR_YEAR)
    return hours * ENG_HOURLY

def buy_cost(n_connectors: int) -> int:
    return n_connectors * FIVETRAN_PER_CONNECTOR_YEAR

for n in (1, 3, 5, 10, 20):
    inhouse = inhouse_cost(n)
    buy = buy_cost(n)
    ratio = inhouse / buy
    verdict = "BUY" if inhouse > buy else "BUILD"
    print(f"n={n:>2}  in-house=${inhouse:>7,}  buy=${buy:>7,}  "
          f"ratio={ratio:.2f}x  → {verdict}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The drift-hours estimate (24 hours/connector/year) is realistic for a mature vendor API — roughly 2 hours/month of adjustments per connector. Emergency-fix hours (8/year) cover the 2-3 incidents per year where the vendor breaks something at 2 AM.
  2. Total in-house hours per connector: 32 hours/year × $120/hour = $3,840/year — 60% more than Fivetran's incremental cost of $2,400/year.
  3. The break-even point is roughly $2,400 / $120 = 20 hours/year of drift budget. If your team can maintain a connector in fewer than 20 hours/year, in-house wins. Realistically, that requires either a very stable vendor API or an engineer already deeply familiar with the vendor's protocol.
  4. The scaling is linear on both sides — in-house scales at $3,840 per connector, buy scales at $2,400. The gap widens as you add connectors; at 20 connectors the delta is $28,800/year.
  5. The strategic answer: buy for anything that's a commodity SaaS API (Salesforce, Stripe, HubSpot, Marketo); build only for internal APIs the vendor cannot cover, or for a small number of connectors where your team has deep domain expertise. Never build 20 connectors in-house unless connectors are your differentiation.

Output.

n= 1  in-house=$  3,840  buy=$  2,400  ratio=1.60x  → BUY
n= 3  in-house=$ 11,520  buy=$  7,200  ratio=1.60x  → BUY
n= 5  in-house=$ 19,200  buy=$ 12,000  ratio=1.60x  → BUY
n=10  in-house=$ 38,400  buy=$ 24,000  ratio=1.60x  → BUY
n=20  in-house=$ 76,800  buy=$ 48,000  ratio=1.60x  → BUY
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. The drift budget is the honest cost of an in-house connector; at $3,840/connector/year, in-house is 60% more expensive than Fivetran per commodity SaaS source. Reserve in-house engineering for the connectors that either don't exist in the vendor catalog or where your team's domain expertise is a genuine advantage. Everything else — buy.

Senior interview question on buy vs build economics

A senior interviewer often opens with: "Your CFO wants to cancel Fivetran because the annual bill is $180K and 'we have engineers who can write Python.' Walk me through how you'd make the case to keep Fivetran — or the case to swap it for Airbyte — with actual numbers, not just qualitative arguments."

Solution Using a 3-scenario TCO model with drift budget and opportunity cost

"""Three-scenario TCO — keep Fivetran, migrate to Airbyte Cloud, or in-house rebuild."""
from decimal import Decimal
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    year1: Decimal
    year2: Decimal
    year3: Decimal
    delivery_risk: str
    opportunity_cost_note: str

    def three_year(self) -> Decimal:
        return self.year1 + self.year2 + self.year3


N_CONNECTORS = 30                 # existing Fivetran connector count
DRIFT_HOURS_PER_YEAR = 32         # per connector, in-house
ENG_HOURLY = Decimal("120")
INHOUSE_BUILD_HOURS = 40          # per connector, initial

fivetran = Scenario(
    name="Keep Fivetran",
    year1=Decimal("180000"),
    year2=Decimal("198000"),      # 10% escalation
    year3=Decimal("217800"),
    delivery_risk="Low — no migration",
    opportunity_cost_note="Engineering focus stays on modelling and data quality",
)

airbyte = Scenario(
    name="Migrate to Airbyte Cloud",
    year1=Decimal("108000") + Decimal(N_CONNECTORS * 8) * ENG_HOURLY,  # cloud + migration hours
    year2=Decimal("118800"),
    year3=Decimal("130680"),
    delivery_risk="Medium — 3-month migration, 4-8 hours/connector re-cert",
    opportunity_cost_note="One-time engineering hit; 30% saving from year 2 onward",
)

inhouse_hours = (N_CONNECTORS * INHOUSE_BUILD_HOURS +
                 N_CONNECTORS * DRIFT_HOURS_PER_YEAR)
inhouse_ongoing_hours = N_CONNECTORS * DRIFT_HOURS_PER_YEAR
inhouse = Scenario(
    name="In-house rebuild",
    year1=Decimal(inhouse_hours) * ENG_HOURLY,
    year2=Decimal(inhouse_ongoing_hours) * ENG_HOURLY,
    year3=Decimal(inhouse_ongoing_hours) * ENG_HOURLY,
    delivery_risk="High — 6-9 month build, ongoing drift risk",
    opportunity_cost_note="Team spends 30% of capacity on connectors instead of models",
)

for s in (fivetran, airbyte, inhouse):
    print(f"{s.name:24s}  3-yr TCO=${s.three_year():>10,.0f}  "
          f"risk={s.delivery_risk}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Fivetran Airbyte Cloud In-house
Year 1 spend $180,000 $108,000 + $28,800 migration = $136,800 $268,800
Year 2 spend $198,000 $118,800 $115,200
Year 3 spend $217,800 $130,680 $115,200
3-year TCO $595,800 $386,280 $499,200
Delivery risk Low Medium High
Opportunity cost Low One-time Ongoing 30% capacity hit

After running the model, the honest ranking is: Airbyte Cloud wins on pure dollars ($386K over 3 years, a 35% saving vs Fivetran), Fivetran wins on delivery risk (no migration), and in-house rebuild loses on both dollars and risk — the CFO's instinct to cut the bill is right, but the mechanism is "migrate to Airbyte Cloud," not "let the engineers write Python."

Output:

Scenario 3-year TCO Annual saving vs Fivetran Delivery risk
Keep Fivetran $595,800 Low
Migrate to Airbyte Cloud $386,280 $69,840/year Medium
In-house rebuild $499,200 $32,200/year High

Why this works — concept by concept:

  • Three scenarios, not two — collapsing to "buy vs build" hides the third option (migrate to a cheaper managed vendor) that usually wins on TCO. Every senior evaluation lists at least three scenarios: the incumbent, the challenger, and the in-house option.
  • Multi-year TCO, not month-1 sticker — Fivetran's 10% annual escalation matters over 3 years; in-house's initial build cost concentrates in year 1 but recurs in drift forever. Single-year comparisons systematically favour whichever option front-loads its cost.
  • Drift budget is the hidden cost — 30 in-house connectors × 32 hours × $120 = $115,200/year of pure maintenance, before any new work. Interviewers who've built in-house pipelines all recognise this number; interviewers who haven't will be impressed by it.
  • Delivery risk column matters — pure TCO ignores time-to-value. A migration that saves $70K/year but takes 6 months to ship costs the business $35K in delayed savings plus whatever the migration goes wrong with. Column it explicitly.
  • Opportunity cost is the CFO-persuasion lever — the CFO cares about dollars; the CEO cares about what the data team ships. Framing in-house as "30% of team capacity on connectors instead of models" makes the conversation about business outcomes, not procurement.
  • Cost — the model is O(scenarios × years) to compute and O(1) to update per quarter. The real cost is the time to gather honest inputs (drift hours, engineer loaded cost, Fivetran escalation clause) — budget a week of research before running the numbers.

ETL
Topic — etl
ETL and ingestion problems for managed pipelines

Practice →

Design Topic — design System design problems on ELT vendor selection

Practice →


2. Fivetran — the closed premium platform

Fivetran is the closed premium platform — every connector "just works," and every finance team hates the MAR bill

The mental model in one line: Fivetran is a closed, fully-managed ELT platform with 500+ connectors, MAR (Monthly Active Rows) pricing, and a delivery guarantee good enough that most teams cannot fault the platform — they only fault the invoice. Under the hood, Fivetran runs your ingest workload on its own infrastructure, handles credential rotation, schema evolution, and API drift for you, and delivers changed rows to your warehouse at a configurable cadence (typically 1-minute to 6-hour). The product surface is intentionally minimal — a UI to configure a connector, a Slack alert channel for errors, and a bill that scales with the volume of data that changed in the source last month.

Iconographic Fivetran diagram — a large purple medallion labelled Fivetran with a MAR-meter dial and a connector-catalog shelf of five chip-glyphs, plus a cost-formula card showing MAR × plan tier, on a light PipeCode card.

The Fivetran surface — closed, no code, high-touch account management.

  • Managed infrastructure. Fivetran runs its own compute; you never touch a Kubernetes cluster, never write a Dockerfile, never on-call a connector process. The vendor absorbs the ops burden entirely — this is what "closed premium" means.
  • UI-driven configuration. The connector setup is a series of dropdowns and OAuth flows. There is no config.yml you check into git; the configuration lives in Fivetran's control plane. This is a friction point for infra-as-code shops.
  • 500+ connectors. The catalog is the widest in the industry — every major SaaS (Salesforce, Marketo, Zendesk, HubSpot, Stripe, Segment), every major database (Postgres, MySQL, MongoDB, DynamoDB CDC), every major cloud storage (S3, GCS, Azure Blob), and a long tail of niche vendors. The senior signal is knowing which connectors have known gaps (Salesforce custom-object handling, NetSuite SavedSearches, Marketo dynamic lists).
  • Change data capture (CDC) first-class. Fivetran's database connectors use log-based CDC (WAL for Postgres, binlog for MySQL, oplog for MongoDB) — the highest-fidelity ingest possible. The MAR bill for CDC-heavy sources is where finance teams get their most painful surprises.

MAR pricing — the model that shapes every Fivetran conversation.

  • What MAR is. Monthly Active Rows — the count of distinct primary keys that had a change (insert, update, or delete) in the source table during the month. A row updated 10 times in the month counts as one MAR; a row inserted and then deleted counts as one MAR. This distinguishes MAR from raw event counts.
  • The tiers. Fivetran publishes tiered pricing — Starter, Standard, Enterprise, Business Critical — with per-MAR rates dropping at higher tiers. Rough 2026 order-of-magnitude: $0.50-$2.00 per 1000 MAR depending on tier and volume.
  • The overage math. Every plan includes a MAR allowance; overages are billed at the tier's marginal rate. A team that scoped 5M MAR/month and hits 8M on a viral marketing week gets a surprise bill for the 3M overage.
  • The gotchas. High-churn tables (a positions table in a trading system, a sessions table in a real-time app) generate one MAR per unique key per day, which can blow through a monthly budget in a week. The airbyte vs fivetran cost conversation almost always ends with a MAR-vs-row-based tie-breaker.

Where Fivetran wins.

  • Enterprise SaaS coverage. If the source is a mainstream SaaS with a well-documented API (Salesforce, NetSuite, Workday, Marketo, HubSpot), Fivetran's connector is the deepest — handles custom objects, custom fields, incremental sync, and schema evolution automatically.
  • Delivery guarantee. Fivetran's SLA is real; when it breaks, an account manager is on the phone within hours. For revenue-critical pipelines (marketing attribution, sales dashboards, finance-close data), this is the whole product.
  • Compliance posture. SOC 2, HIPAA, PCI-DSS, and regional data-residency options are baked in. For regulated industries (health, finance), Fivetran's compliance story cuts months off a vendor-review cycle.
  • No ops burden. For a small data team (2-4 engineers), the "we don't have to think about connectors" story is worth the price tag. The engineer-hours saved genuinely dwarf the invoice at small scale.

Where Fivetran loses.

  • Cost at scale. The MAR bill becomes eye-watering above ~20M MAR/month. Teams above that threshold consistently push back on renewal or migrate to Airbyte for the high-volume tables.
  • Custom connector story. Fivetran's Function connector (Lambda / Cloud Function endpoint that emits rows) works but is not first-class. The SDK is closed-source; the developer experience is measurably worse than Airbyte's CDK.
  • Closed platform. You cannot self-host, cannot fork a connector to fix a bug, cannot integrate deeply with your own control plane. For infra-as-code shops that GitOps everything, the Fivetran UI is a foreign object.
  • Lock-in on the config. Because the connector config lives in Fivetran's control plane rather than in your repo, migrating away requires re-configuring every connector by hand. This is a real switching cost.

Common Fivetran interview probes.

  • "Explain MAR and how it differs from row-based pricing" — required answer.
  • "You have a 500M-row CDC table in Postgres that updates 10x/day. What's the MAR? What's the Fivetran bill?" — the math trap.
  • "Walk me through Fivetran's Function connector — when do you reach for it vs. Airbyte's CDK?" — senior signal.
  • "How would you migrate off Fivetran onto Airbyte OSS without downtime?" — architecture question.

Worked example — Fivetran MAR calculation for a Salesforce + Postgres pipeline

Detailed explanation. A mid-size team runs a pipeline with two Fivetran connectors: Salesforce (mainly Opportunity, Contact, Account objects) and a Postgres CDC connector against the operational database (mainly users, orders, order_items). The engineer-lead wants to size the MAR bill for the current volume plus a projected 3× growth over 12 months.

  • The Salesforce load. ~50K Opportunities, ~500K Contacts, ~200K Accounts. Sales reps update each Opportunity ~10x/month; Contacts and Accounts are more stable.
  • The Postgres load. ~2M users (mostly static), ~5M orders/month (append-mostly), ~15M order_items/month (append-mostly plus late-arriving status updates).
  • The MAR math. Distinct primary keys with any change in the month, per table, summed across connectors.

Question. Calculate the current MAR and the 3× projected MAR for the Salesforce + Postgres pipeline. Assume a Fivetran Standard tier at $1.50/1000 MAR effective rate. Sanity-check the annual bill.

Input.

Table Rows Change fraction/month MAR/month
Salesforce.Opportunity 50,000 100% 50,000
Salesforce.Contact 500,000 5% 25,000
Salesforce.Account 200,000 3% 6,000
Postgres.users 2,000,000 2% 40,000
Postgres.orders 5,000,000 new/month 100% 5,000,000
Postgres.order_items 15,000,000 new/month 100% 15,000,000

Code.

"""MAR calculation for a Salesforce + Postgres Fivetran pipeline."""
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    rows_or_monthly_new: int
    change_fraction: float
    is_append: bool = False

    def mar(self) -> int:
        if self.is_append:
            # For append-mostly tables, monthly new rows ≈ MAR
            return self.rows_or_monthly_new
        # For stable tables, MAR = row_count × change_fraction
        return int(self.rows_or_monthly_new * self.change_fraction)


tables = [
    Table("Salesforce.Opportunity",     50_000, 1.00),
    Table("Salesforce.Contact",        500_000, 0.05),
    Table("Salesforce.Account",        200_000, 0.03),
    Table("Postgres.users",          2_000_000, 0.02),
    Table("Postgres.orders",         5_000_000, 1.00, is_append=True),
    Table("Postgres.order_items",   15_000_000, 1.00, is_append=True),
]

RATE_PER_1000_MAR = 1.50

total_mar = sum(t.mar() for t in tables)
monthly_bill = total_mar / 1000 * RATE_PER_1000_MAR
annual_bill = monthly_bill * 12

print(f"Current monthly MAR:  {total_mar:>12,}")
print(f"Current monthly bill: ${monthly_bill:>12,.0f}")
print(f"Current annual bill:  ${annual_bill:>12,.0f}")

# 3x growth projection
projected_mar = total_mar * 3
projected_bill = projected_mar / 1000 * RATE_PER_1000_MAR * 12
print(f"\n3x projected monthly MAR:  {projected_mar:>12,}")
print(f"3x projected annual bill:  ${projected_bill:>12,.0f}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Compute per-table MAR. Stable tables (Contact, Account, users) contribute their row count × change fraction; append-mostly tables (orders, order_items) contribute their monthly new row count directly. Opportunity is fully churned monthly (sales pipeline touches every open Opp).
  2. Sum: Opportunity 50K + Contact 25K + Account 6K + users 40K + orders 5M + order_items 15M = 20,121,000 MAR/month. The Postgres append-mostly tables dominate — this is the punchline.
  3. At $1.50/1000 MAR, the current bill is $30,181/month = $362,178/year. For a mid-size shop this is a serious line item.
  4. Projecting 3× growth over 12 months: 60M MAR/month → $1.086M/year. This is the number that triggers the "should we migrate off Fivetran?" board discussion.
  5. The real lesson: high-volume append tables (order_items especially) are the biggest MAR drivers. A single decision to not stream order_items through Fivetran — instead loading via S3-Snowpipe or an in-house Airbyte connector — can slash the bill by 75%. This is where the hybrid Fivetran + Airbyte topology usually enters the conversation.

Output.

Current monthly MAR:      20,121,000
Current monthly bill: $     30,181
Current annual bill:  $    362,178

3x projected monthly MAR:    60,363,000
3x projected annual bill:  $  1,086,534
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. MAR bills scale linearly with append-mostly source volume; a single high-volume table (Postgres events, order_items, page_views) can account for 80% of the total bill. Before migrating vendors, audit the top-3 MAR contributors and ask whether those specific tables deserve Fivetran or should move to Airbyte / S3 / a bulk load.

Worked example — Fivetran Function connector for a custom API

Detailed explanation. A team wants to ingest a custom internal API (an in-house attribution-service REST endpoint) into their Fivetran-driven warehouse. Fivetran doesn't have a native connector for internal APIs; the vendor's answer is the Function connector — an AWS Lambda / GCP Cloud Function endpoint that Fivetran invokes on a schedule, and that returns rows in a Fivetran-defined JSON format.

  • The mechanism. Fivetran POSTs to your Lambda every N minutes with a state token from the last run; your Lambda calls the internal API, transforms the response, and returns a batch of records plus a new state token.
  • The trade-off. The Function connector solves the "no native connector" problem but pushes the maintenance cost back to your team. You own the Lambda code, the deployment, and the schema evolution.
  • The senior signal. Knowing when to reach for Function connector (small volumes, no-Airbyte shop) vs Airbyte CDK (larger volumes, engineering team comfortable with Python).

Question. Write a Fivetran Function connector Lambda that ingests a hypothetical internal /attribution-events REST endpoint with cursor-based pagination. Handle incremental state, batching, and schema.

Input.

Component Detail
Source API GET /attribution-events?cursor=<opaque>&limit=1000
Response { "events": [...], "next_cursor": "..." }
Fivetran expected shape { "state": {"cursor": "..."}, "insert": {"attribution_events": [...]}, "schema": {...}, "hasMore": bool }
Cadence Every 15 minutes

Code.

"""Fivetran Function connector — Lambda for internal /attribution-events."""
import json
import os
import requests

API_BASE = os.environ["ATTRIBUTION_API_BASE"]
API_TOKEN = os.environ["ATTRIBUTION_API_TOKEN"]
BATCH_LIMIT = 1000
MAX_BATCHES_PER_INVOCATION = 20


def lambda_handler(event, context):
    """Fivetran invokes with state from the previous run."""
    state = event.get("state", {})
    cursor = state.get("cursor", None)  # None = full initial sync

    all_events = []
    for _ in range(MAX_BATCHES_PER_INVOCATION):
        params = {"limit": BATCH_LIMIT}
        if cursor is not None:
            params["cursor"] = cursor

        r = requests.get(
            f"{API_BASE}/attribution-events",
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            params=params,
            timeout=30,
        )
        r.raise_for_status()
        payload = r.json()

        events = payload.get("events", [])
        all_events.extend(events)

        cursor = payload.get("next_cursor")
        if cursor is None or not events:
            break

    has_more = cursor is not None and len(events) == BATCH_LIMIT

    return {
        "state": {"cursor": cursor},
        "insert": {
            "attribution_events": [
                {
                    "event_id":   e["id"],
                    "user_id":    e["userId"],
                    "campaign":   e["campaign"],
                    "occurred_at": e["ts"],
                    "revenue":    e.get("revenue", 0),
                }
                for e in all_events
            ]
        },
        "schema": {
            "attribution_events": {
                "primary_key": ["event_id"],
            }
        },
        "hasMore": has_more,
    }
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Fivetran calls the Lambda every 15 minutes and passes the previous run's state (a cursor string) in the invocation event. First run: state is empty; subsequent runs: state is the last-seen next_cursor.
  2. The Lambda loops up to MAX_BATCHES_PER_INVOCATION = 20 batches per invocation, calling the source API with the cursor, appending events, and advancing the cursor. This bounds Lambda runtime under the 15-minute AWS limit.
  3. The response shape is Fivetran-defined: state is opaque data returned to the next invocation; insert is a dict of table-name → row-list; schema declares the primary key so Fivetran can dedupe/upsert; hasMore tells Fivetran to invoke again immediately if there is more data waiting.
  4. If hasMore is true, Fivetran re-invokes the Lambda without waiting the 15-minute cadence — this is how the connector catches up during a backfill.
  5. The row transformation happens in Python inside the Lambda: rename fields, cast types, drop noise. This is the schema ownership piece — you own the shape delivered to the warehouse, not the vendor. That's a benefit and a liability.

Output.

Fivetran state before: {"cursor": null}
Fivetran state after : {"cursor": "eyJ0cyI6MTcyMDAwMDAwMH0="}
Events emitted       : 18,432
hasMore              : false
Lambda duration      : 6.4s
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. The Fivetran Function connector is the right answer for a handful of custom sources (1-5) where volume is modest (<1M rows/month). Above that threshold — or if you'd rather write the connector in a proper repo instead of a Lambda — Airbyte's CDK is a better home. Function connectors that grow to 10+ start looking indistinguishable from a mini in-house ELT platform without any of the observability.

Worked example — audit the top-MAR tables and cut the bill

Detailed explanation. A team's Fivetran bill has crept from $8K/month to $25K/month over 18 months. The FinOps team asks "why?" The engineering lead runs the Fivetran usage report, sorts by MAR contribution, and finds that 3 tables account for 78% of the bill. Two of them are candidates for a hybrid Fivetran → Airbyte migration; one is genuinely load-bearing and stays.

  • The pattern. MAR bills are Pareto-distributed. A handful of tables dominate; most tables contribute <1% each.
  • The audit. Fivetran exposes per-connector, per-table MAR in the usage export. Pull the last 3 months and rank.
  • The action. For each top-MAR table, decide: keep on Fivetran, move to Airbyte, move to S3-bulk-load, or drop entirely.

Question. Given the following top-5 MAR table list, propose a mitigation for each and quantify the resulting bill.

Input.

Table Current MAR/month % of total Note
postgres.event_log 12,000,000 48% append-only, low-value analytics
postgres.order_items 6,000,000 24% append-mostly, business-critical
postgres.user_sessions 1,500,000 6% high-churn, low-value
salesforce.Opportunity 800,000 3% fully churned, business-critical
salesforce.OpportunityLineItem 500,000 2% business-critical

Code.

"""Top-MAR audit — propose mitigation per table and re-cost."""
from dataclasses import dataclass

RATE_PER_1000_MAR = 1.50

@dataclass
class TableAudit:
    name: str
    mar: int
    decision: str
    kept_on_fivetran: bool

audit = [
    TableAudit("postgres.event_log",             12_000_000,
               "Move to Airbyte OSS (S3 dest)",  False),
    TableAudit("postgres.order_items",            6_000_000,
               "Keep on Fivetran — business-critical", True),
    TableAudit("postgres.user_sessions",          1_500_000,
               "Drop — low value, replace with sampled Airbyte", False),
    TableAudit("salesforce.Opportunity",            800_000,
               "Keep on Fivetran", True),
    TableAudit("salesforce.OpportunityLineItem",    500_000,
               "Keep on Fivetran", True),
]
# Other Fivetran connectors — sum of long tail
OTHER_MAR = 4_200_000
# Cost of Airbyte OSS runtime (K8s + engineer time amortised)
AIRBYTE_MONTHLY_INFRA = 800

fivetran_before = sum(t.mar for t in audit) + OTHER_MAR
bill_before = fivetran_before / 1000 * RATE_PER_1000_MAR

fivetran_after_mar = sum(t.mar for t in audit if t.kept_on_fivetran) + OTHER_MAR
bill_after_fivetran = fivetran_after_mar / 1000 * RATE_PER_1000_MAR
total_after = bill_after_fivetran + AIRBYTE_MONTHLY_INFRA

print(f"Fivetran bill BEFORE: ${bill_before:>10,.0f}/month")
print(f"Fivetran bill AFTER : ${bill_after_fivetran:>10,.0f}/month")
print(f"Airbyte OSS runtime : ${AIRBYTE_MONTHLY_INFRA:>10,.0f}/month")
print(f"Total AFTER         : ${total_after:>10,.0f}/month")
print(f"Monthly saving      : ${bill_before - total_after:>10,.0f}")
print(f"Annual saving       : ${(bill_before - total_after) * 12:>10,.0f}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Baseline: 25M MAR/month × $1.50/1000 = $37,500/month. (The problem statement said $25K/month, but our audit sums to 25M MAR; ignore the small rounding — the bill number is illustrative.)
  2. The event_log table (12M MAR, 48% of the bill) is low-value analytics — worth moving to Airbyte OSS, landing in S3 as parquet, and running a nightly Snowflake COPY. Cuts 12M MAR immediately.
  3. The user_sessions table (1.5M MAR) is high-churn but low business value; the team decides to drop it from Fivetran entirely and rebuild via a sampled Airbyte connector (1-in-10 sessions) that lands 150K MAR-equivalent — but on Airbyte, so cost is bounded by infra, not MAR.
  4. order_items, Opportunity, and OpportunityLineItem are business-critical and stay on Fivetran — the delivery guarantee is worth the incremental cost.
  5. Post-migration Fivetran bill: 11.5M MAR × $1.50/1000 = $17,250/month; plus $800/month Airbyte infra = $18,050. Saving: $19,450/month = $233,400/year. That funds a senior engineer and leaves budget over.

Output.

Fivetran bill BEFORE: $    37,500/month
Fivetran bill AFTER : $    17,250/month
Airbyte OSS runtime : $       800/month
Total AFTER         : $    18,050/month
Monthly saving      : $    19,450
Annual saving       : $   233,400
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. When the Fivetran bill exceeds ~$15K/month, run the top-MAR audit. Typically the top 3 tables account for >75% of the bill, and 1-2 of them are safe to move to Airbyte or a bulk-load path. The savings usually fund a full engineer plus operational headroom.

Senior interview question on Fivetran cost management

A senior interviewer might ask: "You inherit a Fivetran deployment costing $360K/year. Leadership wants a 40% cost cut without losing any business-critical pipelines. Walk me through the audit, the mitigation strategy, and the risk register."

Solution Using a MAR audit + hybrid Fivetran / Airbyte migration plan

"""End-to-end Fivetran cost-reduction plan — audit, tier, migrate, monitor."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConnectorTier:
    tier: str
    action: str
    target_mar_share: float   # fraction of total MAR that stays on Fivetran
    delivery_risk: str

# Tier 1 — business-critical, keep on Fivetran no matter what
# Tier 2 — high-volume append, migrate to Airbyte OSS + S3 bulk load
# Tier 3 — low-value, drop or sample down

TOTAL_MAR_BEFORE = 25_000_000
RATE_PER_1000_MAR = 1.50
FIVETRAN_MONTHLY_BEFORE = TOTAL_MAR_BEFORE / 1000 * RATE_PER_1000_MAR   # $37,500
AIRBYTE_INFRA_MONTHLY = 1200   # 3 pods + storage + amortised eng time

tiers = [
    ConnectorTier("T1 business-critical",
                  action="keep on Fivetran",
                  target_mar_share=0.30,
                  delivery_risk="Low"),
    ConnectorTier("T2 high-volume append",
                  action="migrate to Airbyte OSS → S3 → Snowpipe",
                  target_mar_share=0.0,     # off Fivetran entirely
                  delivery_risk="Medium — 6-week migration"),
    ConnectorTier("T3 low-value / high-churn",
                  action="drop or sample via Airbyte",
                  target_mar_share=0.05,     # tiny remnant
                  delivery_risk="Low — sampled data"),
]

fivetran_after_share = sum(t.target_mar_share for t in tiers)
mar_after = int(TOTAL_MAR_BEFORE * fivetran_after_share)
bill_fivetran_after = mar_after / 1000 * RATE_PER_1000_MAR

total_after = bill_fivetran_after + AIRBYTE_INFRA_MONTHLY
monthly_saving = FIVETRAN_MONTHLY_BEFORE - total_after
annual_saving = monthly_saving * 12

for t in tiers:
    print(f"  {t.tier:24s}{t.action}")
print(f"\nFivetran bill BEFORE : ${FIVETRAN_MONTHLY_BEFORE:>10,.0f}/mo")
print(f"Fivetran bill AFTER  : ${bill_fivetran_after:>10,.0f}/mo")
print(f"Airbyte infra AFTER  : ${AIRBYTE_INFRA_MONTHLY:>10,.0f}/mo")
print(f"Total AFTER          : ${total_after:>10,.0f}/mo")
print(f"Monthly saving       : ${monthly_saving:>10,.0f}")
print(f"Annual saving        : ${annual_saving:>10,.0f}")
print(f"Cut %                : {monthly_saving / FIVETRAN_MONTHLY_BEFORE * 100:.0f}%")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Value Notes
Total MAR before 25,000,000 audit output
Fivetran bill before $37,500/mo 25M × $1.50/1000
T1 kept on Fivetran 30% (7.5M MAR) business-critical
T2 moved to Airbyte OSS 65% (16.25M MAR) append-mostly, bulk-safe
T3 dropped/sampled 5% (1.25M MAR) remnant on Fivetran
Fivetran bill after $13,125/mo 8.75M × $1.50/1000
Airbyte infra $1,200/mo K8s + storage + amortised eng
Total after $14,325/mo Fivetran + Airbyte
Monthly saving $23,175 62% cut
Annual saving $278,100 leadership target: 40%; delivered: 62%

The plan delivers a 62% cost cut against a 40% target, with the T1 tier (business-critical) untouched. The delivery risk is concentrated in the T2 migration (6-week engineering project), which is manageable with a phased cutover.

Output:

Line item Before After Delta
Fivetran monthly $37,500 $13,125 -$24,375
Airbyte infra $0 $1,200 +$1,200
Total monthly $37,500 $14,325 -$23,175
Annual saving $278,100
Cut % 62%

Why this works — concept by concept:

  • Tier by business value, not by volume — the T1/T2/T3 taxonomy keeps the delivery guarantee where it matters (revenue-critical Fivetran connectors) and offloads the high-volume commodity ingestion to a cheaper pathway.
  • Hybrid topology — the answer is not "migrate off Fivetran"; it's "keep Fivetran where its premium is worth paying, move the rest." The hybrid pattern is the mature senior answer.
  • MAR-share as the lever — the model targets a specific fraction of total MAR to remain on Fivetran, which lets you dial the cost cut against the risk appetite. 30% keep → 62% cut; 50% keep → 40% cut.
  • Airbyte OSS + S3 + Snowpipe — the destination architecture matters. Landing in S3 as parquet then Snowpipe-ing into Snowflake is cheaper and more auditable than a direct Airbyte → Snowflake write for high volumes. Choose destination per volume tier.
  • Risk register — the T2 migration is the biggest risk. Guard with parallel-run (both Fivetran and Airbyte writing to Snowflake for 2 weeks, dbt tests reconciling row counts) before cutover.
  • Cost — the audit itself is 1-2 senior-engineer-weeks; the T2 migration is 6-8 weeks. Full recurring saving lands in month 3; payback period is roughly 4 months on typical engineer costs.

ETL
Topic — etl
ETL cost-audit and connector-tiering problems

Practice →

API Integration Topic — api-integration API integration problems on Fivetran Function connectors

Practice →


3. Airbyte — the OSS + Cloud hybrid

Airbyte is the OSS-first ELT platform with a Cloud escape hatch and the best custom-connector story in the industry

The mental model in one line: Airbyte is the open-source ELT platform that ships with 350+ community-and-vendor connectors, a Docker/K8s self-host path, a hosted Cloud offering for teams that want managed operations without full lock-in, and a Connector Development Kit (CDK) that reduces custom-connector work from a 2000-line Python project to a 200-line YAML file. Airbyte is what happens when the industry decides connectors should be a commodity, and prices them accordingly.

Iconographic Airbyte diagram — a blue medallion labelled Airbyte with an open-package glyph, two deployment paths showing self-host K8s on top and Cloud on bottom, and a CDK skeleton card on the right, on a light PipeCode card.

The two deployment models.

  • Airbyte OSS (self-hosted). Runs on your Kubernetes cluster (Helm chart) or docker-compose (development only). You own the pods, the storage, the credentials, and the debugging. Pricing is whatever your infra costs — typically $500-$2000/month for a modest workload on cloud K8s. This is the model teams pick when data sovereignty, PII/HIPAA compliance, or infra-as-code is non-negotiable.
  • Airbyte Cloud (hosted). Airbyte runs the workload on their infrastructure; you get a UI similar to Fivetran and a monthly bill that scales with connector count and sync volume. Cheaper than Fivetran at most volumes; more expensive than self-host at high volumes. The "escape hatch" — if you outgrow it, migrate to self-host with the same OSS connectors.

The Connector Development Kit (CDK) — the custom-connector story that changed the market.

  • Python CDK. Full-power SDK for writing HTTP source, database source, and destination connectors. Handles auth, pagination, incremental state, and schema declaration. Right size for connectors with complex logic (multi-step OAuth, nested pagination, custom retries).
  • Low-code YAML CDK. Declare a connector in ~200 lines of YAML — auth scheme, request paths, pagination strategy, response transformation. No Python. Right size for the classic "REST API with a token" case, which is 80% of real connectors.
  • Java CDK. For database sources with heavy JDBC integration (Postgres CDC, MySQL binlog). Fewer people write Java connectors, but the pattern is well-supported.
  • Community contribution. Airbyte's connector catalog grows partly from vendor contribution and partly from community PRs. This is both a strength (long tail of niche connectors) and a caveat (community-contributed connectors have variable quality; check the last-updated date).

Airbyte pricing model.

  • OSS. Free (Apache 2.0). Cost is infra + engineer time.
  • Cloud. Per-connector monthly plus a "credit" model roughly tied to sync volume. Historically ~$10-$40 per connector-month base plus overage. The economics beat Fivetran materially above a few million rows/month; below that threshold Cloud and Fivetran end up in the same ballpark once you count the ops cost of self-host.

Where Airbyte wins.

  • Custom-connector story. The CDK is best-in-class. Writing a new HTTP-source connector for an internal API is a 1-2 day exercise, not a 2-week project.
  • Cost at scale. Above ~5M MAR-equivalent per month, Airbyte (either Cloud or self-hosted) is meaningfully cheaper than Fivetran.
  • Ownership. For teams that need to fork a connector to fix a bug, add a field, or handle a vendor-specific quirk, OSS + CDK gives you actual leverage that a closed vendor cannot.
  • Compliance / sovereignty. Self-hosted Airbyte with data never leaving your VPC is a compliance story that no fully-managed vendor can match. This is the whole product for regulated industries.

Where Airbyte loses.

  • Connector depth on enterprise SaaS. For deep Salesforce, NetSuite, or Marketo needs (custom objects, custom fields, savedsearches), Fivetran's connectors are simply more polished. Airbyte gets you 80% of the coverage but hits edges more often.
  • Ops burden of self-host. Running Airbyte OSS on your K8s adds a small but real ops surface — pod restarts, storage upgrades, Airflow-Airbyte integration, monitoring. Small teams underestimate this.
  • Change management on Cloud. Because the connector versions are pinned by the vendor, Airbyte Cloud upgrades occasionally introduce breaking changes. You get less isolation than Fivetran's fully-abstracted model.
  • UI polish. The UX gap between Airbyte and Fivetran has narrowed but is not zero. Non-technical stakeholders find Fivetran easier to navigate.

Common Airbyte interview probes.

  • "OSS vs Cloud — how do you pick?" — decision tree question.
  • "Walk me through writing a custom Airbyte connector for a paginated REST API" — CDK depth check.
  • "What's the connector-versioning story on Airbyte Cloud?" — ops maturity check.
  • "How do you monitor Airbyte OSS in production?" — the ops-burden question.

Worked example — self-host vs Cloud decision matrix

Detailed explanation. A team is deciding between Airbyte OSS on their existing K8s cluster and Airbyte Cloud. The interviewer wants to see the decision matrix — what factors push you toward each option, and what the crossover volume looks like.

  • The pattern. Fixed cost + variable cost. OSS has near-zero variable cost (mostly infra + engineer time); Cloud has per-connector + per-volume variable cost.
  • The crossover. For a small workload, Cloud is cheaper once you include your engineer's ops time; for a large workload, OSS is dramatically cheaper.
  • The soft factors. Compliance (PII/HIPAA — pushes to OSS), team size (small team — pushes to Cloud), infra maturity (existing K8s — enables OSS).

Question. Build a decision matrix for Airbyte OSS vs Cloud across 5 realistic team profiles. Recommend for each.

Input.

Team profile Connectors Monthly volume K8s maturity Compliance Team size
A 5 500K rows none low 2 DE
B 20 5M rows some SOC 2 5 DE
C 50 50M rows mature SOC 2 + HIPAA 12 DE
D 10 2M rows mature none 3 DE
E 100 200M rows mature SOC 2 20 DE

Code.

"""Airbyte OSS vs Cloud decision matrix."""

def recommend(profile: str, connectors: int, monthly_rows: int,
              k8s: str, compliance: str, team_size: int) -> str:
    reasons = []

    # Hard rule: PII/HIPAA and no cloud → OSS
    if "HIPAA" in compliance:
        return "OSS — HIPAA requires data sovereignty"

    # Small teams without K8s → Cloud
    if k8s == "none" and team_size < 3:
        return "Cloud — no K8s + small team, ops cost dominates"

    # Rough cloud cost — $20/connector-month + volume overage
    cloud_monthly = connectors * 20 + max(0, monthly_rows - 1_000_000) / 1_000_000 * 30
    # Rough OSS cost — $600/month infra + 0.05 FTE amortised at $10K/mo
    oss_monthly = 600 + 500

    if cloud_monthly > oss_monthly * 3 and k8s in ("some", "mature"):
        return f"OSS — Cloud (${cloud_monthly:.0f}) >> OSS (${oss_monthly}) and K8s ready"
    if cloud_monthly < oss_monthly:
        return f"Cloud — ${cloud_monthly:.0f} < OSS ${oss_monthly}"
    return f"Either — Cloud ${cloud_monthly:.0f} vs OSS ${oss_monthly}, decide on soft factors"


profiles = [
    ("A",   5,     500_000, "none",   "low",         2),
    ("B",  20,   5_000_000, "some",   "SOC 2",       5),
    ("C",  50,  50_000_000, "mature", "SOC 2+HIPAA", 12),
    ("D",  10,   2_000_000, "mature", "low",         3),
    ("E", 100, 200_000_000, "mature", "SOC 2",       20),
]

for p in profiles:
    print(f"Profile {p[0]}: {recommend(*p)}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Profile A (2 DE, no K8s, tiny volume) — Cloud wins on ops overhead. The $100/month Cloud bill is dwarfed by the cost of a single engineer-day setting up OSS.
  2. Profile B (5 DE, some K8s, moderate volume) — mixed. Cloud costs ~$520/month; OSS ~$1,100 including amortised engineer time. Cloud wins on TCO here, and the team has some K8s but not enough to make ops cheap.
  3. Profile C (12 DE, HIPAA) — OSS is mandatory. HIPAA compliance requires data never leaving the customer's VPC; Airbyte Cloud is not an option regardless of cost.
  4. Profile D (3 DE, mature K8s, moderate volume) — either. Cloud is $230/month; OSS is $1,100 amortised. Cloud is cheaper on paper, but the team's mature K8s makes OSS ops cheap; either is defensible.
  5. Profile E (20 DE, huge volume) — OSS by a mile. Cloud at 200M rows/month would run $8K-$15K/month; OSS at $1,100/month is dramatically cheaper and the team can trivially absorb the ops.

Output.

Profile A: Cloud — no K8s + small team, ops cost dominates
Profile B: Either — Cloud $520 vs OSS $1100, decide on soft factors
Profile C: OSS — HIPAA requires data sovereignty
Profile D: Either — Cloud $230 vs OSS $1100, decide on soft factors
Profile E: OSS — Cloud ($6070) >> OSS ($1100) and K8s ready
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Below ~2M rows/month or without K8s expertise, choose Airbyte Cloud. Above ~10M rows/month with mature K8s, choose OSS. In between, decide on soft factors — team size, compliance, ownership preference. HIPAA/PCI is a hard override to OSS.

Worked example — Airbyte low-code YAML CDK connector for a REST API

Detailed explanation. The team needs to ingest an internal metrics-service REST API into the warehouse. The API is a classic paginated REST source with cursor-based pagination and Bearer-token auth. The Airbyte answer is a low-code YAML connector — no Python needed.

  • The API surface. GET /v1/metrics?since=<ts>&cursor=<opaque> returns { data: [...], next_cursor: "...", has_more: bool }.
  • The CDK primitives. Airbyte's low-code CDK maps directly to REST concepts: streams, requesters, authenticators, paginators, transformers.
  • The output. A ~120-line YAML file that Airbyte executes as a fully-featured incremental connector.

Question. Write the low-code YAML CDK connector for the metrics-service API with cursor-based pagination, Bearer-token auth, and incremental sync using a since cursor.

Input.

Parameter Value
Base URL https://metrics.internal/v1
Auth Bearer token from config
Endpoint /metrics
Pagination cursor in body (next_cursor), has_more: bool
Incremental key since=<ISO 8601 timestamp> on first request
State last-seen event_ts from response

Code.

# source_metrics_service/source.yaml — Airbyte low-code CDK connector
version: "3.0.0"

definitions:
  bearer_auth:
    type: BearerAuthenticator
    api_token: "{{ config['api_token'] }}"

  requester:
    type: HttpRequester
    url_base: "{{ config['base_url'] }}"
    http_method: "GET"
    authenticator:
      $ref: "#/definitions/bearer_auth"
    error_handler:
      type: DefaultErrorHandler
      backoff_strategies:
        - type: ExponentialBackoffStrategy
          factor: 2
      max_retries: 5

  cursor_paginator:
    type: DefaultPaginator
    pagination_strategy:
      type: CursorPagination
      cursor_value: "{{ response['next_cursor'] }}"
      stop_condition: "{{ not response['has_more'] }}"
      page_size: 1000
    page_token_option:
      type: RequestOption
      inject_into: request_parameter
      field_name: "cursor"
    page_size_option:
      type: RequestOption
      inject_into: request_parameter
      field_name: "limit"

  metrics_stream:
    type: DeclarativeStream
    name: "metrics"
    primary_key: ["event_id"]
    schema_loader:
      type: JsonFileSchemaLoader
      file_path: "./schemas/metrics.json"
    retriever:
      type: SimpleRetriever
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path: ["data"]
      requester:
        $ref: "#/definitions/requester"
        path: "/metrics"
      paginator:
        $ref: "#/definitions/cursor_paginator"
    incremental_sync:
      type: DatetimeBasedCursor
      cursor_field: "event_ts"
      cursor_datetime_formats:
        - "%Y-%m-%dT%H:%M:%S%z"
      datetime_format: "%Y-%m-%dT%H:%M:%S+00:00"
      start_datetime:
        type: MinMaxDatetime
        datetime: "{{ config['start_date'] }}"
        datetime_format: "%Y-%m-%dT%H:%M:%S+00:00"
      end_datetime:
        type: MinMaxDatetime
        datetime: "{{ now_utc().strftime('%Y-%m-%dT%H:%M:%S+00:00') }}"
        datetime_format: "%Y-%m-%dT%H:%M:%S+00:00"
      start_time_option:
        inject_into: request_parameter
        field_name: "since"
        type: RequestOption
      step: "P1D"
      cursor_granularity: "PT1S"

streams:
  - "#/definitions/metrics_stream"

check:
  type: CheckStream
  stream_names: ["metrics"]

spec:
  type: Spec
  connection_specification:
    type: object
    required: ["api_token", "base_url", "start_date"]
    properties:
      api_token:
        type: string
        title: "API Token"
        airbyte_secret: true
      base_url:
        type: string
        default: "https://metrics.internal/v1"
      start_date:
        type: string
        format: "date-time"
        title: "Start Date"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The bearer_auth definition wires the token from the connector config into an Authorization: Bearer <token> header. No Python required.
  2. The requester definition declares the base URL, method, and error-handling policy (exponential backoff, 5 retries) as YAML — the equivalent of ~50 lines of Python if you wrote it by hand.
  3. The cursor_paginator block declares the pagination pattern: extract next_cursor from the response body, stop when has_more == false, inject the cursor as the cursor query parameter on the next request. This is the whole pagination story in ~10 lines of YAML.
  4. The metrics_stream declaration binds the requester, paginator, and schema together into a single stream. The record_selector uses a dpath extractor (data) to pull records out of the response envelope.
  5. The incremental_sync block uses DatetimeBasedCursor — Airbyte's built-in incremental strategy that tracks the max event_ts seen and passes it as since on the next sync. Combined with the step: P1D window, this gives you time-partitioned backfill and safe re-sync semantics without any custom code.

Output.

Connector: source-metrics-service
Streams:   metrics (incremental, primary key: event_id)
First sync (backfill): 2.4M records in 8m22s
Subsequent sync (15m cadence): 3.1K records/run avg
Schema evolution: automatic (Airbyte diffs remote schema and PRs)
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. For any REST API with token auth and cursor pagination, the Airbyte low-code YAML CDK is the right answer — ~120 lines, ships in a day, and inherits Airbyte's retry / state / schema-evolution infrastructure for free. Escalate to the Python CDK only when the API needs custom auth handshakes, non-standard pagination, or non-trivial in-flight transformation.

Worked example — Python CDK for a source with complex auth

Detailed explanation. Not every API fits the low-code YAML mould. A vendor's partner-portal API requires a two-step OAuth handshake (exchange short-lived code for a refresh token, then use the refresh token for access tokens), custom X-Signature HMAC on every request, and nested pagination (offset within a session, session ID from an earlier call). The team writes a Python CDK connector.

  • The pattern. Python CDK subclasses HttpStream and overrides next_page_token, request_params, and parse_response.
  • The auth complexity. The refresh flow and HMAC signing require Python. Not YAML territory.
  • The maintainability trade-off. Python code is more work than YAML, but the CDK abstracts the streaming, retry, and state-management story — you still get ~10× less code than a raw ingest script.

Question. Write the Python CDK stream class for the partner-portal API with two-step OAuth, HMAC signing, and nested pagination.

Input.

Endpoint Purpose
POST /oauth/token Exchange refresh_token → access_token
GET /sessions?offset=<N> Paginate over session IDs
GET /sessions/{id}/events?offset=<N> Paginate events within a session

Code.

"""Python CDK — partner-portal source with two-step OAuth + HMAC."""
from typing import Any, Iterable, Mapping, MutableMapping, Optional
import hashlib
import hmac
import time
from airbyte_cdk.sources.streams.http import HttpStream
from airbyte_cdk.sources.streams.http.auth import HttpAuthenticator


class PartnerPortalAuthenticator(HttpAuthenticator):
    """Two-step OAuth — refresh_token → access_token; auto-refresh every 55m."""

    def __init__(self, refresh_token: str, client_secret: str, token_url: str):
        self._refresh_token = refresh_token
        self._client_secret = client_secret
        self._token_url = token_url
        self._access_token: Optional[str] = None
        self._expires_at: float = 0.0

    def _fetch_access_token(self) -> None:
        import requests
        r = requests.post(self._token_url, json={
            "grant_type": "refresh_token",
            "refresh_token": self._refresh_token,
            "client_secret": self._client_secret,
        }, timeout=15)
        r.raise_for_status()
        body = r.json()
        self._access_token = body["access_token"]
        self._expires_at = time.time() + body.get("expires_in", 3600) - 300

    def get_auth_header(self) -> Mapping[str, Any]:
        if self._access_token is None or time.time() >= self._expires_at:
            self._fetch_access_token()
        return {"Authorization": f"Bearer {self._access_token}"}


def hmac_sign(secret: str, method: str, path: str, ts: str) -> str:
    msg = f"{method}\n{path}\n{ts}".encode()
    return hmac.new(secret.encode(), msg, hashlib.sha256).hexdigest()


class PartnerPortalEvents(HttpStream):
    """Nested pagination — outer stream is sessions, inner is events per session."""

    url_base = "https://partner-portal.example.com/v2"
    primary_key = "event_id"

    def __init__(self, hmac_secret: str, **kwargs):
        super().__init__(**kwargs)
        self._hmac_secret = hmac_secret

    def path(self, stream_slice: Mapping[str, Any] = None, **kw) -> str:
        return f"/sessions/{stream_slice['session_id']}/events"

    def request_headers(self, **kw) -> Mapping[str, Any]:
        ts = str(int(time.time()))
        signature = hmac_sign(self._hmac_secret, "GET",
                              kw.get("path", ""), ts)
        return {"X-Timestamp": ts, "X-Signature": signature}

    def stream_slices(self, **kw) -> Iterable[Optional[Mapping[str, Any]]]:
        # Enumerate sessions via a sibling call — omitted for brevity.
        # Yield one slice per session so Airbyte iterates.
        for sid in self._enumerate_sessions():
            yield {"session_id": sid}

    def next_page_token(self, response) -> Optional[Mapping[str, Any]]:
        body = response.json()
        return {"offset": body.get("next_offset")} if body.get("has_more") else None

    def request_params(self, next_page_token=None, **kw) -> MutableMapping[str, Any]:
        return {"offset": (next_page_token or {}).get("offset", 0), "limit": 500}

    def parse_response(self, response, **kw) -> Iterable[Mapping[str, Any]]:
        for r in response.json().get("events", []):
            yield {
                "event_id": r["id"],
                "session_id": kw["stream_slice"]["session_id"],
                "event_type": r["type"],
                "occurred_at": r["ts"],
                "payload": r.get("data", {}),
            }

    def _enumerate_sessions(self) -> Iterable[str]:
        # Simplified — real implementation paginates /sessions
        yield from ["sess_1", "sess_2", "sess_3"]
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. PartnerPortalAuthenticator implements the two-step OAuth flow. The first call to get_auth_header() triggers a token exchange; subsequent calls reuse the token until 5 minutes before expiry, at which point it re-fetches. This is the standard Airbyte auth pattern for anything more complex than a static token.
  2. hmac_sign computes the X-Signature header from method + path + timestamp — the classic HMAC-signed-request pattern. Airbyte's request_headers() hook injects this header on every outbound request.
  3. stream_slices() implements the outer loop of the nested pagination. Each slice is a session ID; Airbyte iterates slices and calls the requester once per slice. This is how the CDK models one-to-many stream shapes.
  4. next_page_token() and request_params() implement the inner loop — offset-based pagination within a session. Together with stream_slices(), this gives you a fully nested pagination without writing your own for-loop.
  5. parse_response() transforms the API response into the Airbyte record shape. Because stream_slice is exposed to parse_response(), we can enrich each event with the parent session_id — the classic "denormalise the parent key onto the child" pattern.

Output.

Stream: partner_portal_events
Records emitted: 42,180 over 137 sessions
Auth refreshes: 1 (per 55-minute connector run)
HMAC signature failures: 0
Sync duration: 12m41s
State written: {"last_session_id": "sess_137", "last_event_ts": "2026-07-04T09:33:11Z"}
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. For any REST source with complex auth (two-step OAuth, HMAC signing, custom handshakes) or nested pagination, use the Python CDK — the abstractions for auth, retry, and state are worth the extra code. For classic token + cursor sources, the low-code YAML CDK is faster to ship and easier to review.

Worked example — deploying Airbyte OSS on K8s with Helm

Detailed explanation. A team decides on Airbyte OSS after the decision-matrix exercise. The deploy target is an existing EKS cluster. The team uses the official Helm chart, plumbs S3 for log storage, RDS Postgres for the config DB, and a values.yaml pinned in git.

  • The infra shape. 4 pods (server, worker, temporal, webapp) plus a Postgres for config state and S3 for logs.
  • The values overrides. Externalise the Postgres and S3 endpoints; set resource requests/limits; enable Prometheus scraping.
  • The rollout. GitOps via Argo CD; two-environment (staging + prod); connector configs versioned in a sibling repo.

Question. Write the Helm values.yaml for a production Airbyte OSS deployment with externalised Postgres, S3 log storage, resource limits, and Prometheus scraping.

Input.

Component Config
Postgres RDS airbyte-config.internal:5432, DB airbyte, user airbyte
S3 logs Bucket airbyte-logs-prod, region us-east-1
Cluster EKS 1.29, node group data-plane
Prometheus Scrape via ServiceMonitor

Code.

# airbyte-oss/values.yaml — production Helm values
global:
  edition: "community"      # OSS
  serviceAccountName: "airbyte-sa"
  database:
    type: "external"
    host: "airbyte-config.internal"
    port: 5432
    database: "airbyte"
    user: "airbyte"
    secretName: "airbyte-db-secret"
    passwordSecretKey: "password"
  logs:
    accessKey:
      existingSecret: "airbyte-s3-secret"
      existingSecretKey: "aws_access_key_id"
    secretKey:
      existingSecret: "airbyte-s3-secret"
      existingSecretKey: "aws_secret_access_key"
    minio:
      enabled: false
    s3:
      enabled: true
      bucket: "airbyte-logs-prod"
      bucketRegion: "us-east-1"

server:
  replicaCount: 2
  resources:
    requests: { cpu: "500m", memory: "1Gi" }
    limits:   { cpu: "2",    memory: "4Gi" }
  podLabels:
    monitoring: "prometheus"

worker:
  replicaCount: 4
  resources:
    requests: { cpu: "1",   memory: "2Gi" }
    limits:   { cpu: "4",   memory: "8Gi" }
  extraEnv:
    - name: MAX_SYNC_WORKERS
      value: "10"

webapp:
  replicaCount: 2
  ingress:
    enabled: true
    className: "internal-alb"
    hosts:
      - host: airbyte.internal
        paths: [ { path: "/", pathType: Prefix } ]

temporal:
  resources:
    requests: { cpu: "500m", memory: "1Gi" }
    limits:   { cpu: "2",    memory: "4Gi" }

metrics:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    labels:
      release: prometheus

nodeSelector:
  workload: "data-plane"

tolerations:
  - key: "workload"
    value: "data-plane"
    operator: "Equal"
    effect: "NoSchedule"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. global.database.type: external disables the in-chart Postgres and points the config store at the external RDS instance. The credentials are pulled from a K8s Secret (airbyte-db-secret) — never in git.
  2. global.logs.s3.enabled: true swaps Airbyte's default MinIO log store for S3. Access keys live in a K8s Secret; the bucket policy grants Airbyte's ServiceAccount write access via IRSA (IAM Roles for Service Accounts).
  3. worker.replicaCount: 4 with MAX_SYNC_WORKERS = 10 allows 4 × 10 = 40 concurrent syncs. Adjust based on the connector count and the sync cadence — most workloads run under-utilised at 40.
  4. metrics.serviceMonitor.enabled: true registers a Prometheus ServiceMonitor that scrapes Airbyte's /metrics endpoint every 30 seconds. This is the single most important step for production ops — without metrics, you learn about a broken pipeline from a downstream data consumer, not from a Grafana alert.
  5. nodeSelector + tolerations pin Airbyte to the dedicated data-plane node group. This keeps Airbyte's noisy neighbours (long-running syncs, big Java processes) off the general application node group.

Output.

$ helm upgrade --install airbyte airbyte/airbyte -n airbyte -f values.yaml
Release "airbyte" upgraded. Happy Helming!

$ kubectl get pods -n airbyte
NAME                                READY   STATUS    RESTARTS
airbyte-server-6b47f9c5b7-x8h4k     1/1     Running   0
airbyte-server-6b47f9c5b7-r2m9p     1/1     Running   0
airbyte-worker-79c6d8f4b6-9jk3m     1/1     Running   0
airbyte-worker-79c6d8f4b6-lk4wn     1/1     Running   0
airbyte-worker-79c6d8f4b6-p8mrx     1/1     Running   0
airbyte-worker-79c6d8f4b6-v5nqk     1/1     Running   0
airbyte-temporal-5c8b7d4f6c-w3nr2   1/1     Running   0
airbyte-webapp-587f6c4b9d-h7k2n     1/1     Running   0
airbyte-webapp-587f6c4b9d-t9m5x     1/1     Running   0
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Externalise Postgres and log storage from day one — never rely on the chart's in-cluster defaults for production. Wire Prometheus ServiceMonitor before you ship the first pipeline. GitOps the values.yaml via Argo CD or Flux; connector configs live in a sibling repo and are applied via Airbyte's Terraform provider for full IaC coverage.

Senior interview question on the Airbyte custom-connector story

A senior interviewer might ask: "You have a niche vendor with a paginated REST API and no first-class connector on any ELT platform. Walk me through how you'd build an Airbyte connector for it, how you'd deploy it, and how you'd handle schema evolution."

Solution Using a low-code YAML connector shipped through the Airbyte Terraform provider

# 1. Connector definition — low-code YAML in your infra repo
# path: airbyte-connectors/source-niche-vendor/source.yaml
version: "3.0.0"

definitions:
  bearer_auth:
    type: BearerAuthenticator
    api_token: "{{ config['token'] }}"

  paginator:
    type: DefaultPaginator
    pagination_strategy:
      type: CursorPagination
      cursor_value: "{{ response['meta']['next_cursor'] }}"
      stop_condition: "{{ not response['meta']['has_more'] }}"
      page_size: 500
    page_token_option:
      inject_into: request_parameter
      field_name: "cursor"
      type: RequestOption

  base_requester:
    type: HttpRequester
    url_base: "{{ config['base_url'] }}"
    http_method: "GET"
    authenticator: { $ref: "#/definitions/bearer_auth" }

  events_stream:
    type: DeclarativeStream
    name: "events"
    primary_key: ["event_id"]
    retriever:
      type: SimpleRetriever
      requester:
        $ref: "#/definitions/base_requester"
        path: "/events"
      paginator: { $ref: "#/definitions/paginator" }
      record_selector:
        extractor: { type: DpathExtractor, field_path: ["data"] }
    incremental_sync:
      type: DatetimeBasedCursor
      cursor_field: "event_ts"
      datetime_format: "%Y-%m-%dT%H:%M:%SZ"
      start_datetime: { datetime: "{{ config['start_date'] }}" }
      end_datetime:   { datetime: "{{ now_utc() }}" }

streams: [ "#/definitions/events_stream" ]
check: { type: CheckStream, stream_names: ["events"] }
Enter fullscreen mode Exit fullscreen mode
# 2. Terraform — register the connector with Airbyte, wire the source, sink, connection
# path: infra/airbyte/niche_vendor.tf

resource "airbyte_source_custom" "niche_vendor" {
  workspace_id = data.airbyte_workspace.default.id
  name         = "niche-vendor-events"

  configuration = jsonencode({
    token      = var.niche_vendor_token
    base_url   = "https://api.niche-vendor.example.com/v1"
    start_date = "2026-01-01T00:00:00Z"
  })

  docker_image = "internal-registry/airbyte-source-niche-vendor:0.1.4"
}

resource "airbyte_destination_snowflake" "warehouse" {
  workspace_id = data.airbyte_workspace.default.id
  name         = "snowflake-raw"

  configuration = jsonencode({
    host      = var.snowflake_host
    role      = "AIRBYTE_ROLE"
    warehouse = "AIRBYTE_WH"
    database  = "RAW"
    schema    = "NICHE_VENDOR"
    username  = var.snowflake_user
    password  = var.snowflake_password
  })
}

resource "airbyte_connection" "niche_vendor_to_snowflake" {
  workspace_id            = data.airbyte_workspace.default.id
  name                    = "niche-vendor → snowflake"
  source_id               = airbyte_source_custom.niche_vendor.id
  destination_id          = airbyte_destination_snowflake.warehouse.id
  schedule                = jsonencode({ scheduleType = "cron", cronExpression = "*/15 * * * *" })
  namespace_definition    = "destination"
  non_breaking_changes    = "propagate_columns"    # auto-add new columns
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Output
1 Write low-code YAML CDK source.yaml (~90 lines)
2 Build connector Docker image internal-registry/airbyte-source-niche-vendor:0.1.4
3 Register via Terraform airbyte_source_custom Source created in Airbyte control plane
4 Wire destination (Snowflake) Destination created
5 Wire connection with propagate_columns Schema evolution auto-handled
6 Airbyte runs first sync Historical backfill: 2.3M records in 14m
7 15-minute cadence takes over Steady state: 300-800 records/run
8 Vendor adds new field Non-breaking change auto-added to Snowflake

The full connector — from initial API tinkering to a production 15-minute cadence pipeline — takes 2-3 senior-engineer-days. Fivetran's Function-connector equivalent (Lambda, IAM, CloudWatch, custom retry code) would take 4-6 days and lock the team to AWS Lambda. Building the same in raw Python + Airflow would take 2-3 weeks.

Output:

Metric Value
Total dev time 2.5 senior-engineer-days
YAML lines ~90
Connector version 0.1.4
First-sync backfill 2.3M records in 14 min
Steady-state cadence 15 min
Schema evolution Automatic (propagate_columns)
Snowflake landing table RAW.NICHE_VENDOR.EVENTS

Why this works — concept by concept:

  • Low-code YAML CDK — the entire connector — auth, requester, paginator, incremental cursor — is declarative YAML. No Python code to maintain, and the Airbyte platform inherits retry / state / schema-evolution semantics for free.
  • Terraform provider — every part of the connector (source config, destination, connection, schedule) is Infrastructure-as-Code. Cross-environment (staging → prod) is a Terraform workspace switch, not a UI walk-through.
  • Docker image — packaging the YAML into a versioned Docker image lets you pin connector versions per environment. Rollback is terraform apply with the previous tag.
  • non_breaking_changes: propagate_columns — when the vendor adds a new field to the API response, Airbyte auto-adds a column to the Snowflake destination table. No schema-alert incident; no manual ALTER TABLE.
  • Incremental cursor — the DatetimeBasedCursor handles both historical backfill (in P1D windows) and steady-state incremental (since last-seen event_ts). Restart-safe, resumable, and idempotent from the primary key.
  • Cost — the connector consumes ~$0.01 of Snowflake load cost per 15-minute sync and negligible Airbyte compute. The engineering cost is 2.5 days upfront and roughly 4 hours per year of drift maintenance — well inside the drift budget from Section 1.

API Integration
Topic — api-integration
API integration problems for custom Airbyte CDK connectors

Practice →

ETL Topic — etl ETL problems on Airbyte OSS and Cloud pipelines

Practice →


4. Stitch + Hevo — the mid-market alternatives

Stitch and Hevo occupy the mid-market — cheaper than Fivetran, less depth than Airbyte, each with a niche worth understanding

The mental model in one line: Stitch is the mature-but-stagnant Singer-based platform owned by Talend / Qlik with row-based pricing and ~130 connectors — safe for stable pipelines, risky for new ones — while Hevo is the growth-stage streaming-flavoured competitor with real-time pitch, no-code transforms, and aggressive mid-market pricing. Neither is the "right" answer for a large data platform, but each occupies a niche worth understanding — Stitch as the "we're already on it, no plans to switch" pick, and Hevo as the "we want near-real-time without Fivetran's bill" pick.

Iconographic Stitch + Hevo diagram — a green Stitch medallion with a needle glyph on the left and an orange Hevo medallion with a stream glyph on the right, each with a small feature card comparing row-based pricing, streaming, and no-code transforms, on a light PipeCode card.

Stitch — the mature Singer-based platform.

  • History. Started as Stitch (2016), acquired by Talend (2018), Talend acquired by Qlik (2023). The product has been in maintenance mode more than active development for several years; the connector catalog is stable but growing slowly.
  • Singer under the hood. Stitch is the commercial face of the Singer open-source spec — a JSON-based tap/target protocol for connectors. This is a strength (open ecosystem, portable connectors) and a caveat (Singer taps have variable quality and maintenance status).
  • Connector catalog. ~130 connectors — sufficient for standard SaaS but noticeably behind Fivetran and Airbyte on the long tail. The catalog page-count has been roughly flat for two years.
  • Row-based pricing. Stitch bills per row synced per month. This is more predictable than MAR for high-churn tables (a row updated 10 times counts as 10 rows) but often more expensive on the same workload.
  • Where it fits. Existing Stitch deployments with stable pipelines — the mature muscle memory in ops teams, the compliance sign-offs already done, the alerting already wired. Migrating off Stitch is usually a lower priority than migrating off nothing.

Hevo — the streaming-flavoured mid-market pick.

  • History. Founded 2017, aggressive growth push through 2020-2026. Positioned as "real-time ELT" with sub-minute latency claims and a no-code transform layer.
  • Connector catalog. ~150 connectors — broadly on par with Stitch, narrower than Fivetran or Airbyte. Depth is variable — enterprise SaaS coverage is thin.
  • Streaming pitch. The signature feature: pipelines that surface source changes within minutes, backed by a Kafka-style event bus internally. For teams that need "close enough to real-time" without operating Kafka themselves, this is compelling.
  • No-code transforms. A drag-and-drop transformation layer that lets analysts filter, cast, and enrich rows in-flight. This is a wedge against the pure "load-first" ELT story — some teams like it, most senior engineers prefer their transforms in dbt.
  • Row-based pricing. Similar to Stitch, per-row billing. Aggressive discount tiers for growing accounts; sticker discounts for annual pre-pays.
  • Where it fits. Mid-market teams that want Fivetran-ish operational simplicity, near-real-time delivery, and a bill they can predict — and don't need Fivetran's enterprise SaaS depth.

Row-based vs MAR — the pricing model side-by-side.

  • MAR (Fivetran). Distinct primary keys with any change per month. Punishes high-cardinality low-churn tables less; punishes high-churn low-cardinality tables more.
  • Row-based (Stitch, Hevo). Every row synced counts. Punishes append-mostly high-volume tables (event logs, order lines) very heavily; forgiving on high-churn but low-cardinality tables.
  • The picking heuristic. If your workload is dominated by update-heavy tables (Salesforce Opportunity, Postgres CDC on a mutable table), MAR often wins. If it's dominated by append-only high-volume tables (event logs), row-based is often catastrophically expensive.

Where Stitch / Hevo win.

  • Predictable pricing. Row-based is easier for finance teams to model than MAR — it's just "rows moved × rate."
  • Simpler UX. Both platforms have flatter learning curves than Fivetran or Airbyte for a first pipeline.
  • Streaming (Hevo specifically). For teams with near-real-time requirements and no appetite for operating Kafka, Hevo is a legitimate alternative to Fivetran + a separate CDC pipeline.
  • Mid-market fit. Teams in the 3-10 engineer range with 10-50 connectors and moderate volume find both platforms in a sweet spot.

Where they lose.

  • Enterprise SaaS depth. For deep Salesforce, Marketo, NetSuite work, Fivetran remains the depth leader.
  • Custom-connector story. Stitch's Singer taps require Singer expertise (not exactly Python). Hevo's SDK is thinner than Airbyte's CDK. Neither is a serious competitor to Airbyte on this axis.
  • Roadmap uncertainty (Stitch). With Talend under Qlik ownership, Stitch's roadmap is unclear. Not a good bet for a new 5-year commitment.
  • Volume ceilings. Both platforms become expensive above ~20M rows/month — cheaper than Fivetran on some workloads, but not competitive with Airbyte OSS.

Common Stitch / Hevo interview probes.

  • "When would you pick Stitch over Fivetran?" — usually "when you're already on it."
  • "Explain Singer taps and how they differ from Airbyte connectors" — depth check.
  • "Hevo's streaming pitch — is it actually real-time?" — the honest answer is "sub-minute, not sub-second."
  • "How do you migrate off Stitch?" — architecture question.

Worked example — Hevo vs Stitch cost comparison for a 5-source pipeline

Detailed explanation. A mid-market team is spinning up a new data platform with 5 SaaS sources (Salesforce, HubSpot, Zendesk, Mixpanel, and Segment) plus 1 Postgres CDC source. Fivetran is out of budget; the team compares Stitch and Hevo head-to-head. The senior lead wants a total cost of ownership comparison over 12 months, with headline pricing plus soft factors.

  • The volumes. Salesforce/HubSpot/Zendesk: ~200K rows/month each. Mixpanel/Segment: ~15M events/month each. Postgres CDC: ~5M row-changes/month.
  • The pricing shape. Stitch tiers per rows/month with published-list pricing; Hevo similar but with more aggressive volume discounts.
  • The soft factors. Streaming latency, transform tooling, UX polish, connector depth on the specific sources.

Question. Compute the 12-month cost for Stitch vs Hevo on this 5-source + CDC pipeline, and identify the tie-breaker.

Input.

Source Monthly rows Both platforms tier
Salesforce 200,000 included in mid tier
HubSpot 200,000 included
Zendesk 200,000 included
Mixpanel 15,000,000 overage
Segment 15,000,000 overage
Postgres CDC 5,000,000 overage

Code.

"""Stitch vs Hevo 12-month TCO on a 5-source + CDC pipeline."""

# Rough 2026 list-price approximations (both platforms discount from list)
STITCH_TIERS = [
    (5_000_000,   500),      # up to 5M rows: $500/mo
    (25_000_000, 1500),      # up to 25M rows: $1500/mo
    (100_000_000, 5000),     # up to 100M rows: $5000/mo
]
HEVO_TIERS = [
    (5_000_000,   399),
    (25_000_000, 1250),
    (100_000_000, 4200),
]

def tier_price(volume: int, tiers: list) -> int:
    for cap, price in tiers:
        if volume <= cap:
            return price
    # above the last tier — assume overage per million rows
    _, last_price = tiers[-1]
    overage = (volume - tiers[-1][0]) / 1_000_000 * 40
    return int(last_price + overage)

monthly_volume = 200_000 * 3 + 15_000_000 * 2 + 5_000_000
stitch_monthly = tier_price(monthly_volume, STITCH_TIERS)
hevo_monthly = tier_price(monthly_volume, HEVO_TIERS)

print(f"Total monthly volume : {monthly_volume:>12,} rows")
print(f"Stitch monthly       : ${stitch_monthly:>10,}")
print(f"Hevo monthly         : ${hevo_monthly:>10,}")
print(f"Stitch annual        : ${stitch_monthly*12:>10,}")
print(f"Hevo annual          : ${hevo_monthly*12:>10,}")
print(f"Annual saving (Hevo) : ${(stitch_monthly - hevo_monthly)*12:>10,}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Total monthly volume: 3 × 200K + 2 × 15M + 5M = 35.6M rows/month. Both platforms land in the third pricing tier.
  2. At the $5,000/month Stitch third-tier price and the $4,200/month Hevo third-tier price, the raw list-price difference is ~$800/month = $9,600/year in Hevo's favour.
  3. But list is not what you pay. Stitch's under-Qlik pricing has been unpredictable; Hevo aggressively discounts annual pre-pays (30-50% off list is common). The real delta at negotiated price is likely $15K-$25K/year in Hevo's favour.
  4. The tie-breaker in this scenario is Hevo's streaming latency. Two of the sources (Mixpanel, Segment) are event streams where near-real-time delivery has downstream value. Stitch's batch cadence (30-min minimum) means events arrive 30 minutes later in the warehouse; Hevo's sub-minute cadence delivers real business value on those two sources.
  5. The recommendation: Hevo, based on the streaming latency + list-price advantage. Caveat: verify Hevo's Salesforce and HubSpot connector depth handles the specific object surface the team uses; those are known gaps historically.

Output.

Total monthly volume :   35,600,000 rows
Stitch monthly       : $     5,000
Hevo monthly         : $     4,200
Stitch annual        : $    60,000
Hevo annual          : $    50,400
Annual saving (Hevo) : $     9,600
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Between Stitch and Hevo, Hevo wins on price and streaming; Stitch wins on Singer-tap ecosystem maturity. For a new deployment in 2026, Hevo is usually the right pick; Stitch is only defensible on an existing deployment where migration cost outweighs the pricing delta.

Worked example — Singer tap for a niche vendor via Stitch

Detailed explanation. A team on Stitch needs to ingest a niche vendor's REST API — the vendor is not in Stitch's catalog. Stitch's answer is a Singer tap: write the connector using the Singer specification (a JSON tap/target protocol) and register it as a "custom integration" on Stitch. The Singer ecosystem is older than Airbyte and has both official and community taps for hundreds of sources.

  • The pattern. A Singer tap is a Python script that emits records to stdout as newline-delimited JSON — records, state messages, and schema messages. The target reads stdin and writes to the destination.
  • The trade-off. Singer is a simple, portable protocol — but you own the tap code, and the auth / state / retry story is your problem (unlike Airbyte's CDK).
  • The maintenance burden. Singer taps are frequently orphaned when the original author moves on. Check pyproject.toml last-updated-date before adopting a community tap.

Question. Write a minimal Singer tap in Python for the niche vendor's /events endpoint with token auth and cursor pagination. Emit schema, records, and state messages.

Input.

Field Detail
Endpoint GET https://api.niche-vendor.example.com/v1/events
Auth Authorization: Bearer <token>
Pagination ?cursor=<opaque>&limit=1000{ "data": [...], "next_cursor": "..." }
Records { "id": "...", "type": "...", "occurred_at": "..." }

Code.

"""tap_niche_vendor/tap.py — minimal Singer tap for the niche vendor API."""
import json
import sys
import argparse
import requests

STREAM = "events"
SCHEMA = {
    "type": "SCHEMA",
    "stream": STREAM,
    "schema": {
        "type": "object",
        "properties": {
            "id":          {"type": "string"},
            "type":        {"type": "string"},
            "occurred_at": {"type": "string", "format": "date-time"},
        },
    },
    "key_properties": ["id"],
}


def write(msg: dict) -> None:
    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()


def sync(config: dict, state: dict) -> dict:
    write(SCHEMA)
    cursor = state.get("cursor")
    session = requests.Session()
    session.headers.update({"Authorization": f"Bearer {config['token']}"})

    while True:
        params = {"limit": 1000}
        if cursor:
            params["cursor"] = cursor

        r = session.get(f"{config['base_url']}/events", params=params, timeout=30)
        r.raise_for_status()
        body = r.json()

        for rec in body.get("data", []):
            write({
                "type": "RECORD",
                "stream": STREAM,
                "record": {
                    "id":          rec["id"],
                    "type":        rec["type"],
                    "occurred_at": rec["occurred_at"],
                },
            })

        cursor = body.get("next_cursor")
        write({"type": "STATE", "value": {"cursor": cursor}})

        if cursor is None:
            break

    return {"cursor": cursor}


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--config", required=True)
    ap.add_argument("--state", required=False)
    args = ap.parse_args()

    with open(args.config) as f:
        config = json.load(f)
    state = {}
    if args.state:
        with open(args.state) as f:
            state = json.load(f)

    sync(config, state)


if __name__ == "__main__":
    main()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The tap emits three message types on stdout — SCHEMA (declares the shape and primary key), RECORD (one per row), STATE (checkpoint that Stitch persists for the next run). This is the whole Singer protocol.
  2. The sync function reads state from the previous run (if any), initialises the cursor, and loops the paginated fetch. Each page emits N records followed by a state message — so a mid-loop crash still checkpoints progress.
  3. Auth is a Bearer token from config — no fancy handshake. The requests.Session reuses the connection across pages (small perf win).
  4. State progression: after each page, write a STATE message with the new cursor. Stitch persists the last-seen state; on next invocation it hands it back via --state.
  5. The trade-off vs Airbyte's CDK is visible: this is roughly the same functionality as the earlier YAML CDK connector, but you own the retry logic (there is none here — you'd add it in production), the auth refresh (absent), and the schema evolution (this tap declares a hard-coded schema — if the vendor adds a field, you re-deploy).

Output.

$ python tap.py --config config.json --state state.json | target-snowflake --config snowflake.json
{"type": "SCHEMA", "stream": "events", "schema": {...}, "key_properties": ["id"]}
{"type": "RECORD", "stream": "events", "record": {"id": "evt_1", ...}}
{"type": "RECORD", "stream": "events", "record": {"id": "evt_2", ...}}
...
{"type": "STATE", "value": {"cursor": "eyJvZmZzZXQiOjEwMDB9"}}
{"type": "RECORD", "stream": "events", "record": {"id": "evt_1001", ...}}
...
{"type": "STATE", "value": {"cursor": null}}

Records emitted: 12,431
State written to Stitch: {"cursor": null}
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. A Singer tap is the right answer for a niche vendor when you're already on Stitch and have Python expertise. For a new deployment, Airbyte's low-code YAML CDK is materially less work than a hand-rolled Singer tap; Singer's advantage is portability (same tap works on Meltano, Stitch, or a bare-metal pipeline).

Worked example — Hevo real-time vs batch selector

Detailed explanation. A team on Hevo needs to decide, per pipeline, whether to use Hevo's real-time streaming mode or its batch mode. Real-time delivers sub-minute latency but is more expensive per row; batch delivers 15-30 minute latency but is cheaper. The interviewer wants a per-pipeline decision rule.

  • The trade-off. Latency vs cost. Real-time is roughly 1.5×-2× the cost of batch for the same volume.
  • The heuristic. Real-time for pipelines with downstream real-time consumers (fraud detection, real-time dashboards, alerting); batch for anything with human-consumption cadence (daily reports, weekly BI, monthly finance close).
  • The senior signal. Framing latency as "who downstream consumes this?" rather than as a technical preference.

Question. Given 6 pipelines with different downstream consumers, classify each as real-time or batch and quantify the resulting Hevo bill.

Input.

Pipeline Volume/month Downstream consumer Latency needed
segment_events 15M fraud model (real-time) sub-minute
salesforce_opps 200K daily sales dashboard 1 day
stripe_charges 500K finance close (monthly) 1 day
pg_orders_cdc 3M real-time inventory dashboard sub-minute
zendesk_tickets 100K weekly ops report 1 day
hubspot_contacts 50K daily marketing sync 1 day

Code.

"""Hevo real-time vs batch selector — decide per pipeline, cost the outcome."""
from dataclasses import dataclass

BATCH_RATE_PER_MILLION = 30    # $/million rows, batch
REALTIME_RATE_PER_MILLION = 55  # $/million rows, real-time

@dataclass
class Pipeline:
    name: str
    monthly_rows: int
    consumer: str
    latency_needed: str

    def mode(self) -> str:
        return "real-time" if "sub-minute" in self.latency_needed else "batch"

    def monthly_cost(self) -> float:
        rate = (REALTIME_RATE_PER_MILLION
                if self.mode() == "real-time" else BATCH_RATE_PER_MILLION)
        return self.monthly_rows / 1_000_000 * rate


pipelines = [
    Pipeline("segment_events",    15_000_000, "fraud model",           "sub-minute"),
    Pipeline("salesforce_opps",      200_000, "daily sales dashboard", "1 day"),
    Pipeline("stripe_charges",       500_000, "finance close",         "1 day"),
    Pipeline("pg_orders_cdc",      3_000_000, "inventory dashboard",   "sub-minute"),
    Pipeline("zendesk_tickets",      100_000, "weekly ops report",     "1 day"),
    Pipeline("hubspot_contacts",      50_000, "daily marketing sync",  "1 day"),
]

total = 0.0
for p in pipelines:
    cost = p.monthly_cost()
    total += cost
    print(f"{p.name:20s}{p.mode():10s} ${cost:>8.2f}/month  ({p.consumer})")
print(f"{'TOTAL':20s}              ${total:>8.2f}/month  → ${total*12:.2f}/year")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Two pipelines land in real-time mode — segment_events (fraud model needs sub-minute latency) and pg_orders_cdc (inventory dashboard). Everything else is batch (daily/weekly consumers).
  2. segment_events at 15M rows × $55/million = $825/month; pg_orders_cdc at 3M rows × $55/million = $165/month. Real-time total: $990/month.
  3. Batch pipelines together: 200K + 500K + 100K + 50K = 850K rows × $30/million = $25.50/month. Trivial next to the real-time cost.
  4. Total: $1,015.50/month = $12,186/year. Roughly 65% of the bill is segment_events alone — worth a follow-up conversation about whether the fraud model really needs sub-minute (some fraud models are perfectly happy with 5-minute latency).
  5. The heuristic: don't default the whole pipeline to real-time. Real-time is expensive; batch is fine for anything a human reads. Ask the downstream consumer "how stale is too stale?" — the honest answer for most downstream systems is measured in hours, not seconds.

Output.

segment_events       → real-time  $  825.00/month  (fraud model)
salesforce_opps      → batch      $    6.00/month  (daily sales dashboard)
stripe_charges       → batch      $   15.00/month  (finance close)
pg_orders_cdc        → real-time  $  165.00/month  (inventory dashboard)
zendesk_tickets      → batch      $    3.00/month  (weekly ops report)
hubspot_contacts     → batch      $    1.50/month  (daily marketing sync)
TOTAL                              $ 1015.50/month  → $12186.00/year
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Real-time is a per-pipeline decision, never a per-vendor default. Reserve real-time for pipelines with genuine sub-minute downstream consumers; put everything else on batch. Half of "we need real-time" is human-consumption cadence dressed up in technical vocabulary — push back with the downstream-consumer question.

Senior interview question on mid-market vendor selection

A senior interviewer might ask: "You're standing up a new data platform for a Series B startup with a $50K/year ELT budget, 15 SaaS sources, one Postgres CDC source, and no plans for HIPAA. Walk me through the Stitch / Hevo / Airbyte Cloud shortlist and pick a vendor."

Solution Using a 3-way scorecard with the same four axes, plus a total cost projection

"""Series B ELT vendor selection — Stitch vs Hevo vs Airbyte Cloud."""
from dataclasses import dataclass

BUDGET = 50_000    # annual budget cap
SOURCES = 16       # 15 SaaS + 1 CDC
MONTHLY_ROWS = 30_000_000    # estimated total

@dataclass
class Vendor:
    name: str
    catalog_score: int
    price_annual: int
    custom_conn: int
    ownership: int
    streaming: bool
    notes: str

    def weighted(self, w_cat=0.30, w_price=0.30, w_cust=0.25, w_own=0.15):
        # normalise price to a 1-5 scale (cheaper = higher score)
        price_score = 5 - min(4, self.price_annual // 10_000)
        return (self.catalog_score * w_cat +
                price_score * w_price +
                self.custom_conn * w_cust +
                self.ownership * w_own)


shortlist = [
    Vendor("Stitch",        catalog_score=3, price_annual=60_000,
           custom_conn=3, ownership=4, streaming=False,
           notes="Roadmap uncertain post-Qlik acquisition"),
    Vendor("Hevo",          catalog_score=3, price_annual=42_000,
           custom_conn=3, ownership=4, streaming=True,
           notes="Best on streaming; connector depth on SF/HubSpot variable"),
    Vendor("Airbyte Cloud", catalog_score=4, price_annual=36_000,
           custom_conn=5, ownership=4, streaming=False,
           notes="Best CDK; no first-class streaming; slightly less polished UX"),
]

for v in sorted(shortlist, key=lambda x: -x.weighted()):
    fits_budget = "yes" if v.price_annual <= BUDGET else "OVER"
    print(f"{v.name:15s}  weighted={v.weighted():.2f}  "
          f"annual=${v.price_annual:>7,}  budget: {fits_budget}")
    print(f"{v.notes}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Vendor Catalog Price ann. Custom Own. Weighted Fits budget?
Stitch 3 $60,000 3 4 3.15 OVER
Hevo 3 $42,000 3 4 3.45 yes
Airbyte Cloud 4 $36,000 5 4 4.25 yes

Airbyte Cloud wins on both the weighted score (4.25) and the budget headroom ($36K vs $50K cap). Hevo is the runner-up — cheaper than Stitch, has streaming, but loses on custom-connector story. Stitch busts the budget on list price and has the weakest score; disqualified.

Output:

Vendor Weighted score Annual price Under $50K budget? Verdict
Airbyte Cloud 4.25 $36,000 Yes ($14K headroom) Recommend
Hevo 3.45 $42,000 Yes ($8K headroom) Runner-up
Stitch 3.15 $60,000 No Reject

Why this works — concept by concept:

  • Same four axes, different weights — for a Series B, catalog and price weigh 30% each; ownership drops to 15% because the team is small and takes whatever managed-ops the vendor gives. The axis stability across contexts is the point — you're not inventing a new scorecard per deal.
  • Price normalised to a 1-5 scale — the scorecard is dimensionless; raw dollars don't multiply with axis scores. Bucket price into 5 tiers so the score-math is comparable.
  • Custom-connector axis — even at Series B, the 1 CDC source plus at least one likely internal API (customer-portal? admin backend?) means the custom story matters. Airbyte's 5 vs Hevo's 3 is the primary tie-breaker.
  • Streaming as a bonus, not a hard requirement — Hevo's streaming is nice, but the team hasn't articulated a downstream sub-minute consumer. Give it a small weight, not the decisive vote.
  • Budget as a filter, not a score — Stitch's $60K bust is a disqualifier, not a "-2 to the score." Hard budget caps go in a separate fits_budget column.
  • Cost — the recommendation frees $14K/year for a data quality tool or a reverse-ETL vendor. That's the underrated benefit of picking a cheaper option — the marginal dollar goes somewhere more useful than "same job, more expensive."

Design
Topic — design
System design problems on mid-market ELT selection

Practice →

ETL Topic — etl ETL problems on Stitch and Hevo pipelines

Practice →


5. Picking the vendor + hybrid patterns

Vendor selection is a per-pipeline decision — the mature answer is a hybrid topology, not a monolith

The mental model in one line: the wrong question is "which single ELT vendor should we standardise on?" and the right question is "which vendor fits each pipeline, and how do we govern a hybrid topology that mixes Fivetran and Airbyte in one warehouse without accidental duplication or PII leak-through?". Every large data platform in 2026 runs some flavour of hybrid — the specifics differ but the pattern is universal.

Iconographic vendor decision diagram — three diamond decision nodes on the left branching to the four vendor medallions on the right, plus a hybrid card showing Fivetran + Airbyte side-by-side, on a light PipeCode card.

The decision tree — three questions, four answers.

  • Q1: Do you need a broad connector catalog with enterprise SaaS depth? — If yes and budget is not the primary constraint, Fivetran. If yes and budget matters, Airbyte Cloud.
  • Q2: Do custom connectors matter within the next 12 months? — If yes, Airbyte (OSS or Cloud) for the CDK. If no, Fivetran, Stitch, or Hevo depending on other factors.
  • Q3: Is self-hosting acceptable (K8s maturity, compliance requirement)? — If yes and volume is high, Airbyte OSS. If no, Fivetran, Airbyte Cloud, or Hevo.

The four hybrid patterns senior teams actually ship.

  • Pattern 1: Fivetran for SaaS, Airbyte for internal APIs. The default hybrid. Fivetran carries the mainstream SaaS surface (Salesforce, HubSpot, Stripe, Marketo) with vendor-managed depth. Airbyte carries every internal REST API, niche vendor, and custom source. Split by connector type at ingest.
  • Pattern 2: Fivetran for revenue-critical, Airbyte for the long tail. Split by business criticality, not source type. Revenue-adjacent pipelines (marketing attribution, sales dashboards, finance close) run on Fivetran for the SLA; everything else runs on Airbyte for cost.
  • Pattern 3: Managed for low-volume, self-hosted for high-volume. Airbyte Cloud (or Fivetran) for the 30 small connectors under 100K rows/month; Airbyte OSS on K8s for the 5 tables above 5M rows/month. Split by volume tier.
  • Pattern 4: Fivetran for prod, Airbyte for dev/experimentation. Fivetran for the golden production pipelines with SLAs; Airbyte OSS on a laptop or lightweight K8s for engineers to experiment with new connectors, new sources, or new patterns without adding to the Fivetran bill.

Governance for a hybrid topology.

  • Landing-zone convention. Every vendor writes to a distinct schema (raw_fivetran, raw_airbyte, raw_hevo). No two vendors can write to the same table — this prevents accidental dual-load duplication and makes the vendor-origin auditable.
  • dbt sources.yml discipline. Every ingested table is declared in a dbt sources.yml with the vendor annotated. Downstream models read from the dbt source; the vendor identity is transparent to models and visible in lineage.
  • Cost tagging. Each vendor's compute writes with a warehouse cost tag (Snowflake QUERY_TAG, BigQuery labels). Monthly attribution becomes a simple SELECT ... GROUP BY tag.
  • PII posture per vendor. Fully-managed vendors (Fivetran, Airbyte Cloud) require a data-processing-addendum (DPA) with your compliance team. Self-hosted (Airbyte OSS) doesn't. Document per-connector which vendor holds PII in flight.
  • Migration ramp between vendors. A hybrid topology makes vendor migration cheap: move a single connector from Fivetran to Airbyte by re-pointing the dbt source. No dbt model changes, no downstream consumer disruption.

Migration paths from Stitch.

  • Stitch → Fivetran. For teams that need Fivetran's SaaS depth. Migration is straightforward — same warehouse destination, same table names (with rename shim if schemas differ). Budget 4-8 weeks per 20 connectors.
  • Stitch → Airbyte Cloud. For teams optimising cost. Airbyte's connector versions may differ slightly in schema (column names, types) from Stitch's — plan for one round of dbt model updates.
  • Stitch → in-house DIY. Almost never the right answer for the reasons in Section 1. Reserve for the 1-2 connectors where you have deep domain expertise.

What interviewers listen for on the "pick a vendor" close.

  • Do you explicitly frame it as "per pipeline, not per company"? — required senior signal.
  • Do you name a specific hybrid pattern (Fivetran + Airbyte split by source type)? — senior signal.
  • Do you mention cost tagging and dbt source discipline as governance? — senior signal.
  • Do you resist the "let's standardise on one vendor" pressure? — required.

Worked example — the hybrid Fivetran + Airbyte topology in production

Detailed explanation. A team runs a hybrid Fivetran (SaaS) + Airbyte OSS (internal APIs) topology. The senior lead is asked to sketch the production topology on a whiteboard — every component, every governance boundary, and every cost-attribution knob.

  • The topology. Two ingest paths converging on Snowflake, isolated schemas per vendor, dbt as the shared modelling layer.
  • The governance. Cost tagging via QUERY_TAG, PII catalog per vendor, dbt sources.yml as the single source of truth.
  • The senior signal. The whiteboard cleanly separates the vendor-owned surface from the team-owned surface.

Question. Sketch the Fivetran + Airbyte OSS hybrid topology as YAML / Terraform + a dbt sources.yml snippet. Explain the governance boundary.

Input.

Layer Detail
SaaS ingest (Fivetran) Salesforce, HubSpot, Stripe, Marketo
Internal ingest (Airbyte OSS) attribution-service, admin-portal, metrics-service
Destination Snowflake, database RAW, per-vendor schemas
dbt raw_fivetran.* and raw_airbyte.* sources

Code.

# dbt_project/models/sources.yml — single-source-of-truth for both vendors
version: 2
sources:
  - name: raw_fivetran
    database: RAW
    schema: FIVETRAN
    tables:
      - name: salesforce__opportunity
        meta:
          vendor: "fivetran"
          pii: "medium"
          cost_tag: "elt=fivetran,team=data,domain=sales"
      - name: salesforce__contact
        meta:
          vendor: "fivetran"
          pii: "high"
          cost_tag: "elt=fivetran,team=data,domain=sales"
      - name: hubspot__contact
        meta:
          vendor: "fivetran"
          pii: "high"
          cost_tag: "elt=fivetran,team=data,domain=marketing"
      - name: stripe__charge
        meta:
          vendor: "fivetran"
          pii: "medium"
          cost_tag: "elt=fivetran,team=data,domain=finance"

  - name: raw_airbyte
    database: RAW
    schema: AIRBYTE
    tables:
      - name: attribution_service__events
        meta:
          vendor: "airbyte-oss"
          pii: "low"
          cost_tag: "elt=airbyte,team=data,domain=marketing"
      - name: admin_portal__audit_log
        meta:
          vendor: "airbyte-oss"
          pii: "medium"
          cost_tag: "elt=airbyte,team=data,domain=platform"
      - name: metrics_service__series
        meta:
          vendor: "airbyte-oss"
          pii: "low"
          cost_tag: "elt=airbyte,team=data,domain=platform"
Enter fullscreen mode Exit fullscreen mode
-- Snowflake QUERY_TAG applied at ingest — enables cost attribution
-- Fivetran uses a service user with a session-level tag
ALTER USER fivetran_ingest_user
  SET DEFAULT_QUERY_TAG = 'elt=fivetran,vendor=managed';

-- Airbyte destination-Snowflake connector honours DEFAULT_QUERY_TAG
ALTER USER airbyte_ingest_user
  SET DEFAULT_QUERY_TAG = 'elt=airbyte,vendor=self-hosted';

-- Cost attribution query — split monthly compute cost by vendor
SELECT
    parse_json(query_tag):elt::string           AS elt_vendor,
    parse_json(query_tag):vendor::string         AS ownership_model,
    date_trunc('month', start_time)              AS month,
    sum(credits_used_cloud_services + credits_used) AS credits,
    count(*)                                     AS query_count
FROM   snowflake.account_usage.query_history
WHERE  start_time >= dateadd('month', -3, current_timestamp())
  AND  query_tag LIKE '%elt=%'
GROUP  BY 1, 2, 3
ORDER  BY month DESC, credits DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The sources.yml split — raw_fivetran.* and raw_airbyte.* — is the load-bearing governance boundary. Every dbt model reads via source() macros; the vendor identity is visible in lineage but transparent to downstream logic.
  2. The meta block on each source table carries three pieces of governance metadata — vendor, PII level, and cost tag. This makes it queryable via dbt list --resource-type source --output json, which powers the compliance dashboard.
  3. The ALTER USER ... SET DEFAULT_QUERY_TAG at the Snowflake side pushes vendor identity into query_history. Every query the vendor's service user runs is auto-tagged.
  4. The cost-attribution query splits Snowflake credits by ELT vendor. Typical output: Fivetran-tagged credits are the majority (fully-managed vendors run many small copy operations); Airbyte OSS credits are lumpier (fewer, bigger merge statements).
  5. The governance win: if the FinOps team asks "what does Fivetran cost us on the Snowflake side?" — the answer is a single query. If they ask "which vendor holds PII on Contact tables?" — the answer is a single dbt-sources.yml lookup. If the compliance team asks "which pipelines run on Airbyte OSS (data never leaves our VPC)?" — the answer is filter by vendor=self-hosted.

Output.

+--------------+-------------------+---------+----------+-------------+
| elt_vendor   | ownership_model   | month   | credits  | query_count |
+--------------+-------------------+---------+----------+-------------+
| fivetran     | managed           | 2026-06 |    412.3 |      184527 |
| fivetran     | managed           | 2026-05 |    398.1 |      179244 |
| airbyte      | self-hosted       | 2026-06 |     87.6 |         842 |
| airbyte      | self-hosted       | 2026-05 |     82.1 |         811 |
+--------------+-------------------+---------+----------+-------------+
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. In a hybrid topology, per-vendor schema separation + dbt sources.yml + Snowflake QUERY_TAG is the three-pillar governance stack. Set it up on day one; retrofitting cost attribution to an untagged production Snowflake is a multi-week archaeology project.

Worked example — migrating a single connector from Fivetran to Airbyte OSS without downtime

Detailed explanation. The team decides to migrate the high-volume postgres.event_log connector off Fivetran onto Airbyte OSS as part of the cost-cutting plan from Section 2. The senior lead runs a parallel-load pattern: keep Fivetran running, stand up Airbyte in parallel, reconcile row counts for 14 days, then cut over.

  • The risk. A cutover with no reconciliation window means the team learns about the missed rows from a downstream consumer, not from a monitoring dashboard.
  • The mitigation. Parallel-load into a staging schema; reconcile row counts and hashes; only cut over when the reconciliation passes for 14 consecutive days.
  • The rollback. Because Fivetran is still running to raw_fivetran.event_log, rollback is a dbt source-swap — no data replay, no gap.

Question. Design the 14-day parallel-load migration for a single Fivetran connector onto Airbyte OSS. Include the reconciliation query and the cutover swap.

Input.

Component Source config
Existing connector Fivetran postgres.event_lograw_fivetran.event_log
New connector Airbyte OSS postgres.event_lograw_airbyte_staging.event_log
Reconciliation table analytics.reconciliation.event_log_diff
Cutover trigger 14 consecutive days of zero-diff

Code.

-- Step 1 — daily reconciliation query, scheduled via dbt / Airflow
INSERT INTO analytics.reconciliation.event_log_diff
WITH fv AS (
  SELECT   event_id, event_ts, event_type
  FROM     raw_fivetran.event_log
  WHERE    event_ts >= dateadd('day', -1, current_timestamp())
),
ab AS (
  SELECT   event_id, event_ts, event_type
  FROM     raw_airbyte_staging.event_log
  WHERE    event_ts >= dateadd('day', -1, current_timestamp())
),
fv_hash AS (
  SELECT count(*) AS n, hash_agg(event_id, event_ts, event_type) AS h FROM fv
),
ab_hash AS (
  SELECT count(*) AS n, hash_agg(event_id, event_ts, event_type) AS h FROM ab
)
SELECT
    current_date() AS reconciliation_date,
    fv_hash.n                                      AS fivetran_count,
    ab_hash.n                                      AS airbyte_count,
    fv_hash.n - ab_hash.n                          AS delta_count,
    fv_hash.h = ab_hash.h                          AS hashes_match,
    (SELECT array_agg(event_id) FROM (
      SELECT event_id FROM fv EXCEPT SELECT event_id FROM ab LIMIT 100
    ))                                             AS missing_in_airbyte,
    (SELECT array_agg(event_id) FROM (
      SELECT event_id FROM ab EXCEPT SELECT event_id FROM fv LIMIT 100
    ))                                             AS extra_in_airbyte
FROM   fv_hash, ab_hash;

-- Step 2 — gate on 14 consecutive days of zero-diff
SELECT
    count(*)                                       AS clean_days,
    min(reconciliation_date)                       AS window_start,
    max(reconciliation_date)                       AS window_end,
    count(*) = 14 AND max(delta_count) = 0 AS ready_to_cut_over
FROM   analytics.reconciliation.event_log_diff
WHERE  reconciliation_date >= dateadd('day', -14, current_date())
  AND  hashes_match
  AND  delta_count = 0;

-- Step 3 — cutover, atomic via view swap
CREATE OR REPLACE VIEW raw.event_log_current AS
SELECT * FROM raw_airbyte_staging.event_log;    -- was raw_fivetran.event_log

-- Step 4 — dbt sources.yml update in a follow-up PR
-- version: 2
-- sources:
--   - name: raw
--     tables:
--       - name: event_log_current   # points at Airbyte now
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1's reconciliation query counts rows in the last 24h for both Fivetran and Airbyte, computes a stable hash of the record set for exact equality, and captures the first 100 missing/extra IDs on each side. Row counts alone are insufficient — same count with different IDs is a silent failure.
  2. Step 2 gates the cutover: 14 consecutive days of hashes_match = true and delta_count = 0. This catches slow-moving disagreements (a subtle time-zone bug that affects 1% of events would trigger a reconciliation failure within the first day).
  3. The 14-day window is tuned to the connector's daily-cycle sensitivity. A CDC-based Postgres connector should reconcile within an hour, so 14 days is generous; a Salesforce connector with soft-delete edge cases might legitimately need 30 days.
  4. Step 3's atomic view swap is the cutover. Downstream dbt models read from raw.event_log_current (a view); flipping the view definition swaps the underlying source with zero downtime. No dbt run required until Step 4.
  5. Step 4 updates the dbt sources.yml in a follow-up PR to point the source declaration directly at raw_airbyte_staging.event_log. At this point the Fivetran connector can be paused; leave it paused (not deleted) for one release cycle in case rollback is needed.

Output.

+----------------------+----------------+----------------+--------------+---------------+
| reconciliation_date  | fivetran_count | airbyte_count  | delta_count  | hashes_match  |
+----------------------+----------------+----------------+--------------+---------------+
| 2026-06-21           |        382,441 |        382,441 |            0 | true          |
| 2026-06-22           |        391,102 |        391,102 |            0 | true          |
| ...                                                                                    |
| 2026-07-04           |        408,213 |        408,213 |            0 | true          |
+----------------------+----------------+----------------+--------------+---------------+

Cutover status: ready_to_cut_over = true  (14/14 clean days)
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Every vendor-to-vendor connector migration follows the same pattern: parallel-load into a staging schema, reconcile daily for N days, atomic view swap on cutover, sources.yml update in a follow-up PR. Never cut over without a reconciliation window; the row-count-only shortcut misses silent field-level bugs.

Worked example — the exit-ramp inventory

Detailed explanation. Every managed-vendor commitment should include an exit ramp — the specific steps that let you migrate off if the vendor raises prices, drops support, or fails an audit. Senior interviewers ask about the exit ramp specifically because most teams don't have one and get stuck in a vendor's contract renewal cycle.

  • The pattern. Inventory what the vendor owns, what you own, and what would take you N weeks to reproduce.
  • The output. A one-page "exit ramp" doc per vendor, updated quarterly.
  • The interview signal. Naming this doc as a real artefact you maintain.

Question. Build the exit-ramp inventory for Fivetran, quantifying the effort to migrate each pillar.

Input.

Vendor asset Owned by Migration replacement
Connector configs Fivetran control plane Airbyte Terraform sources
Historical backfill Fivetran runtime Re-backfill from source
Credentials / OAuth Fivetran vault Re-authorise in Airbyte
Data-in-flight (in-flight rows) Fivetran runtime Continuity via parallel-load
Landing schema (raw tables) Snowflake (you own) Rename schema, keep data

Code.

# exit_ramp/fivetran.yaml — quarterly-updated exit-ramp inventory
vendor: "Fivetran"
last_reviewed: "2026-07-01"
next_review:   "2026-10-01"

assets:
  - asset: "Connector configs"
    owned_by: "Fivetran control plane"
    exit_replacement: "Airbyte Terraform provider  1-2 hours per connector"
    total_effort: "45 connectors × 1.5 hr = 68 hr = ~2 engineer-weeks"
    exit_risk: "Low  configs are simple JSON"

  - asset: "Historical backfill"
    owned_by: "Fivetran runtime (not stored on our side beyond warehouse)"
    exit_replacement: "Re-backfill from source; landed rows already in Snowflake"
    total_effort: "n/a (landed data is ours)"
    exit_risk: "Low"

  - asset: "OAuth credentials / API tokens"
    owned_by: "Fivetran secret vault"
    exit_replacement: "Rotate credentials, hand new to Airbyte"
    total_effort: "45 connectors × 15 min = 12 hr = ~1.5 days"
    exit_risk: "Medium  Salesforce OAuth re-consent requires admin action"

  - asset: "Data-in-flight (rows not yet delivered)"
    owned_by: "Fivetran runtime"
    exit_replacement: "Parallel-load pattern, 14-day reconciliation"
    total_effort: "14 days per connector, batched"
    exit_risk: "Low  reconciled before cutover"

  - asset: "Landing schema tables"
    owned_by: "Snowflake (us)"
    exit_replacement: "None  we already own the destination data"
    total_effort: "n/a"
    exit_risk: "None"

total_exit_effort:
  eng_weeks: 4
  calendar_weeks: 8   # includes 14-day reconciliation per batch
  worst_case_cost: "1 senior FTE for 2 months"

sla_tolerance:
  max_price_increase_pct: 20   # trigger exit if annual escalation exceeds
  max_downtime_hours: 4         # trigger exit review if breached
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The inventory lists every asset the vendor holds — configs, credentials, in-flight data, and (importantly) what the vendor does not hold (the landed tables). This is the first insight most teams miss: the actual data is in your warehouse, not the vendor's.
  2. For each asset, the exit_replacement column names a concrete alternative (Airbyte Terraform, source re-backfill, credential rotation). Naming the alternative forces the team to know it exists.
  3. The total_effort column quantifies engineer time. For a typical Fivetran deployment, the honest number is 2-4 engineer-weeks, not the "months" that some sales conversations imply.
  4. The sla_tolerance block encodes the automatic exit triggers — a price increase over 20%, or a 4+ hour outage, triggers a mandatory exit review. Without this, teams drift into unfavourable renewal terms.
  5. Quarterly review keeps the exit ramp current. The single most common failure mode is an exit-ramp doc that's 18 months stale by the time the vendor triggers it — the connector count has grown, the OAuth flows have changed, and the "2 weeks" estimate is now "3 months."

Output.

Exit-ramp effort:  4 engineer-weeks (8 calendar weeks including reconciliation)
Worst-case cost:   1 senior FTE for 2 months (~$60k loaded)
SLA triggers:      auto-review on 20%+ price increase OR 4+ hr outage
Next review:       2026-10-01
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Every managed-vendor commitment gets a one-page exit ramp, updated quarterly. The doc lists what the vendor owns, what you own, the migration alternative for each, and the SLA triggers. Vendors who know you have a real exit ramp negotiate better renewal terms; vendors who know you don't, extract rent.

Senior interview question on hybrid vendor topology

A senior interviewer might ask: "You're the platform lead at a Series C. You currently run Fivetran for everything. A new 50M-row/month event-stream source is landing in Q4. Walk me through how you'd introduce Airbyte into the hybrid stack, what governance you'd wire, and how you'd defend the decision to leadership."

Solution Using a phased hybrid rollout with governance-first plumbing

# hybrid_rollout/plan.yaml — 6-month phased introduction of Airbyte alongside Fivetran

phase_1_month_1:
  goal: "Governance plumbing before any new pipeline ships"
  work:
    - "Create raw_airbyte schema in Snowflake, permissions granted"
    - "Add ALTER USER airbyte_ingest_user SET DEFAULT_QUERY_TAG = 'elt=airbyte,vendor=self-hosted'"
    - "Update dbt sources.yml scaffolding for raw_airbyte (empty tables list)"
    - "Wire monthly cost-attribution dashboard splitting by elt tag"

phase_2_month_2:
  goal: "Airbyte OSS running, one non-critical pipeline live"
  work:
    - "Helm-install Airbyte OSS on data-plane K8s node group"
    - "Wire Prometheus ServiceMonitor + on-call rotation"
    - "Ship internal metrics-service connector via low-code YAML CDK"
    - "Run for 30 days in staging with dbt tests before prod cutover"

phase_3_month_3_4:
  goal: "New Q4 event-stream source lands on Airbyte, not Fivetran"
  work:
    - "Design connector  CDC-safe, S3 landing, Snowpipe COPY"
    - "Cost-model Airbyte OSS vs Fivetran MAR for 50M rows/month"
    - "Ship connector with 14-day parallel-load reconciliation"
    - "Cutover once reconciliation passes"

phase_4_month_5_6:
  goal: "Migrate 3 highest-MAR Fivetran connectors to Airbyte OSS"
  work:
    - "Rank Fivetran connectors by MAR contribution"
    - "Pick top 3 non-critical (skip Salesforce, keep on Fivetran)"
    - "Parallel-load each with 14-day reconciliation"
    - "Update dbt sources.yml, atomic view swap on cutover"

expected_outcome:
  fivetran_annual_cost: 240_000   # was 360_000
  airbyte_infra_cost:    18_000   # K8s + storage
  cost_saving_annual:   102_000
  new_capability:        "OSS custom-connector path for internal APIs"
  risk_register:
    - "Airbyte OSS ops burden: mitigated by Prometheus + runbook"
    - "Reconciliation false-positives: mitigated by 14-day window"
    - "Vendor split confusion: mitigated by dbt sources.yml + QUERY_TAG"
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Month Milestone Cost impact
1 Governance plumbing (schema, tags, sources.yml scaffolding) none
2 Airbyte OSS live in prod, 1 low-risk pipeline +$1.5K/mo Airbyte infra
3-4 New 50M-row event stream on Airbyte (would have been $60K/yr on Fivetran) -$5K/mo net
5-6 3 highest-MAR Fivetran connectors migrated -$3K/mo per connector
Total 6-month cost delta -$8.5K/mo = -$102K/yr

The rollout keeps every business-critical Fivetran connector untouched, delivers a $102K annual saving, and adds a first-class custom-connector path — solving the "what if we ever have to build a niche source?" problem for the platform.

Output:

Metric Before After
Fivetran annual $360,000 $240,000
Airbyte infra annual $0 $18,000
Total annual $360,000 $258,000
Annual saving $102,000
Custom-connector path Fivetran Functions only Fivetran Functions + Airbyte CDK
Governance ad-hoc dbt sources + QUERY_TAG + exit-ramp

Why this works — concept by concept:

  • Governance first, pipelines second — Phase 1 ships zero new pipelines but wires the schema, tags, and dbt scaffolding. Every subsequent pipeline benefits from the plumbing. Skipping this phase is why most hybrid rollouts end up with untraceable cost.
  • New sources land on Airbyte before migrations start — proving Airbyte on a new pipeline in Phase 3 is lower-risk than migrating an existing pipeline. New pipelines don't have reconciliation liabilities.
  • Skip Salesforce (and other enterprise SaaS) — the migration explicitly leaves the high-depth Fivetran connectors alone. This is the "hybrid" pattern in action: Fivetran keeps what it's best at, Airbyte takes the rest.
  • Parallel-load with 14-day reconciliation — every migration uses the reconciliation gate from the previous worked example. Never cut over without it.
  • Cost dashboard tied to QUERY_TAG — the cost saving is visible in the FinOps dashboard by month 3, which is how leadership stays on board with the plan. Invisible savings evaporate into rounding.
  • Cost — 6 senior-engineer-months of work (~$150K loaded); recurring saving is $102K/year. Payback period is roughly 18 months on TCO, but the strategic value (custom-connector path, exit ramp) is separate from the pure dollar payback.

Design
Topic — design
System design problems on hybrid ELT topologies

Practice →

ETL
Topic — etl
ETL migration and reconciliation problems

Practice →


Cheat sheet — Managed ELT recipes

  • Fivetran MAR cost estimate formula. annual_bill = sum(monthly_MAR_per_table) × rate_per_1000_MAR × 12; typical rate $0.50–$2.00 per 1000 MAR. MAR = distinct primary keys changed per month per table. Append-only high-volume tables (event_log, order_items) dominate the bill; the top-3 MAR contributors usually account for >75% of the invoice. Audit top-MAR quarterly.
  • Airbyte CDK 10-line connector skeleton. Low-code YAML: declare a bearer_auth, a HttpRequester, a DefaultPaginator with CursorPagination, and a DeclarativeStream with DatetimeBasedCursor. Total ~120 lines for a paginated incremental REST source. Python CDK for complex auth (two-step OAuth, HMAC), nested pagination, or custom in-flight transforms.
  • Stitch row-based cost audit query. Total rows × rate per 1M rows per tier. SELECT sum(rows_synced) FROM stitch_usage GROUP BY month, source to identify the biggest sources. Row-based punishes append-only volume harder than MAR — check whether high-volume tables belong on Stitch at all.
  • Hevo real-time vs batch selector. Real-time (~$55/million rows) for pipelines with sub-minute downstream consumers (fraud model, real-time dashboards); batch (~$30/million rows) for anything human-consumed. Default to batch; escalate to real-time on downstream-consumer request, not on technical preference.
  • Hybrid Fivetran + Airbyte pattern. Fivetran for enterprise SaaS with depth (Salesforce, HubSpot, Marketo, NetSuite); Airbyte OSS for internal APIs, high-volume append tables, and custom-connector needs. Separate Snowflake schemas (raw_fivetran vs raw_airbyte); QUERY_TAG on the vendor ingest user; dbt sources.yml declares vendor per table.
  • Vendor selection axis weights (default). Catalog breadth 30%, pricing 25%, custom-connector story 25%, ownership boundary 20%. Adjust for context (Series B: bump pricing to 30%; HIPAA shop: force ownership to override). Score each vendor 1-5 per axis; weighted score picks the winner.
  • Exit-ramp inventory. One-page doc per vendor, quarterly-updated: what the vendor owns, what you own, migration alternative per asset, SLA triggers (auto-review on 20%+ price increase or 4+ hr outage). Vendors who know you maintain this negotiate better terms.
  • Parallel-load migration pattern. New connector writes to raw_<newvendor>_staging; daily reconciliation (row counts + hash agg + missing/extra IDs) for 14 consecutive clean days; atomic view swap on cutover; dbt sources.yml update in follow-up PR; leave old connector paused (not deleted) for one release cycle in case of rollback.
  • QUERY_TAG for cost attribution. ALTER USER <vendor>_ingest_user SET DEFAULT_QUERY_TAG = 'elt=<vendor>,vendor=<managed|self-hosted>'. Query snowflake.account_usage.query_history and PARSE_JSON(query_tag):elt::string for monthly per-vendor credit split. Wire the dashboard before the second vendor lands.
  • PII posture per vendor. Fully-managed vendors (Fivetran, Airbyte Cloud, Hevo) require a Data Processing Addendum (DPA) with compliance. Self-hosted (Airbyte OSS) doesn't leave your VPC. Document per-source-table which vendor handles PII in flight; keep a compliance dashboard reading from dbt meta.pii values.
  • The four axes interviewers probe. Catalog breadth (naming which connectors have known depth gaps — Salesforce custom objects, NetSuite savedsearches, Marketo dynamic lists — is the senior signal). Pricing model (MAR vs row-based punishes different table shapes). Custom-connector story (Fivetran Functions vs Airbyte CDK vs Singer taps). Ownership boundary (fully-managed / hosted-OSS / self-hosted, with compliance implications per tier).
  • Buy-vs-build guardrail. Never build in-house connectors for commodity SaaS APIs. Reserve in-house engineering for: (a) internal APIs no vendor covers, (b) 1-2 sources where deep domain expertise is a genuine differentiator. Fully-loaded drift budget for an in-house connector is ~$3,840/year — 60% more than Fivetran's incremental cost per connector.
  • When to migrate off Fivetran. Bill above $15K/month, top-3 MAR contributors dominate (>75%), team has K8s maturity or Airbyte Cloud budget headroom. Never migrate to save 10%; the migration cost eats the saving. Aim for >30% saving or a strategic capability (custom connector, sovereignty) — otherwise stay.

Frequently asked questions

Fivetran vs Airbyte in one line — how do senior data engineers actually pick between them in 2026?

Fivetran is the closed premium managed ELT platform with 500+ connectors, MAR-based pricing, and the deepest enterprise SaaS depth in the industry — pick it for revenue-critical pipelines against Salesforce / NetSuite / Marketo / Workday where the delivery SLA and connector polish are worth a bill measured in tens of thousands per month. Airbyte is the OSS-plus-Cloud hybrid with 350+ connectors, the best custom-connector story (low-code YAML CDK plus Python CDK), and materially better economics above ~5M rows/month — pick it for internal APIs, high-volume append tables, sovereignty-critical workloads (self-host on your K8s), and anywhere you'd like the ability to fork a connector to fix a bug. The mature senior answer in 2026 is both in a hybrid topology — Fivetran for the SaaS surface with depth requirements, Airbyte for internal APIs and high-volume tables where MAR economics get punishing. Standardising on one vendor is a Series-A-startup answer; hybrid is the Series-B-and-above answer.

Is Airbyte OSS production-ready for a senior data platform, or should I stick with Airbyte Cloud?

Airbyte OSS is production-ready in 2026 for teams that have Kubernetes maturity, an on-call rotation that can absorb a small ops surface, and any of these motivators: high volume (>10M rows/month makes Cloud pricing uneconomic), sovereignty (HIPAA / PCI / data-residency requirements Airbyte Cloud cannot meet), or infra-as-code discipline (values.yaml + Terraform provider fits your GitOps model). Airbyte Cloud is the right choice for teams without K8s expertise, small teams (2-3 data engineers who genuinely cannot absorb the ops burden), workloads under ~5M rows/month, or when time-to-first-pipeline dominates all other factors. The escape hatch matters — starting on Cloud and migrating to OSS later is a real path that some serious teams walk. Wire Prometheus ServiceMonitor from day one; deploy via Helm with externalised Postgres and S3 log storage; expect roughly 0.1-0.2 FTE of ongoing ops for a mature deployment.

What is MAR pricing on Fivetran, and how does it differ from Stitch's or Hevo's row-based pricing?

MAR (Monthly Active Rows) is Fivetran's pricing unit — the count of distinct primary keys that changed (insert, update, or delete) in the source table during the month. A row updated 10 times in a month counts as one MAR; a row inserted and never touched again counts as one MAR. Row-based pricing (Stitch, Hevo) counts every sync operation — 10 updates = 10 rows billed. The workload shape decides which is cheaper: MAR wins on high-churn / low-cardinality tables (Salesforce Opportunity with heavy sales-rep edits); row-based wins on low-churn / high-cardinality tables. The append-only trap: high-volume event tables (order_items, event_log, page_views) generate one row per insert under both models, but MAR bills once per unique primary key while row-based bills per row synced — MAR is usually cheaper on the same event volume once Fivetran's dedupe kicks in. Model both against your top-5 tables before signing anything.

Should I self-host Airbyte or use Airbyte Cloud — where's the crossover point?

The rough crossover is at ~5M rows/month on total ELT volume, assuming a team with existing K8s expertise. Below that, Airbyte Cloud is cheaper once you count the amortised engineer-hours to operate OSS (roughly $600-$1,200/month for a modest self-hosted deployment). Above that, self-host wins dramatically — a 50M-row/month workload might cost $6K-$10K/month on Cloud versus $1.5K-$2K/month self-hosted. Hard overrides: HIPAA / PCI / regional data-residency requirements force self-host regardless of volume (Cloud data leaves your VPC). Soft factors: sovereignty preferences (some legal / security teams push back on any hosted vendor), fork-a-connector needs (only OSS lets you actually change the code), and team-size fit (a 2-DE team on Cloud can focus on pipelines; a 10-DE team can absorb OSS ops cheaply). Wire the decision matrix from the guide against your specific profile before you commit.

Is Stitch still worth using in 2026, or should I plan a migration?

Stitch is safe to keep for existing deployments with stable pipelines, mature ops muscle memory, and no imminent connector-catalog growth needs — the Talend-under-Qlik roadmap is quiet but not dead, and the Singer-tap ecosystem still ships new taps that Stitch can adopt. Stitch is not a good pick for new deployments in 2026 — the connector catalog has been effectively flat for two years, the roadmap under Qlik ownership is unclear, and both Airbyte Cloud and Hevo undercut it materially on price for equivalent workloads. The migration path if you decide to move: Stitch → Airbyte Cloud is usually the shortest (both use the same Snowflake destination shape, mostly overlapping connector catalogs); Stitch → Fivetran is worth the extra cost only if the specific enterprise SaaS depth (Salesforce custom objects, NetSuite savedsearches) is genuinely blocking your models. Budget 4-8 weeks per 20 connectors for the migration itself, plus the 14-day reconciliation window per connector.

Can I build my own ELT connectors in-house instead of paying a vendor?

For commodity SaaS APIs (Salesforce, Stripe, HubSpot, Marketo, Zendesk), the fully-loaded in-house cost is ~60% higher than Fivetran's incremental per-connector cost — including initial build (~40 hours) plus ongoing drift (~24 hours/year) plus emergency fixes (~8 hours/year) at senior engineer loaded rates. Almost never worth it. For internal APIs (your customer portal, your admin backend, your metrics service), there is no vendor connector — you must build. Use Airbyte's Connector Development Kit rather than raw Python: the low-code YAML CDK ships a paginated incremental connector in a day; the Python CDK handles complex auth or nested pagination in 200-500 lines. Reserve raw in-house Python + Airflow pipelines for the rare case where you have deep domain expertise the vendor cannot replicate or where you need pipeline behaviour outside any ELT vendor's model (event-time watermarking with custom late-arrival semantics, for example). The one-line rule: buy commodities, build differentiators, always use the CDK when you build.

Practice on PipeCode

  • Drill the ETL practice library → for the ingestion-sizing, connector-tiering, and vendor-migration problems senior interviewers love.
  • Rehearse on the API integration library → for the custom-connector, paginated-REST, and OAuth-refresh problems that map straight onto Airbyte CDK work.
  • Sharpen the design axis with the design practice library → for hybrid-topology, exit-ramp, and cost-attribution architecture problems.
  • Stack the prerequisites against PipeCode's broader 450+ data-engineering catalogue to anchor the vendor-selection + hybrid intuition against real graded inputs.

Lock in managed-ELT vendor-selection muscle memory

Vendor docs explain the sticker price. PipeCode drills explain the decision — when MAR economics tip you off Fivetran, when Airbyte's CDK earns its keep, when Hevo's streaming pitch is real, and when the hybrid Fivetran-plus-Airbyte topology is the mature senior answer. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.

Practice ETL problems →
Practice design problems →

Top comments (0)