OpenLineage and OpenMetadata are the pair of names that quietly replaced the closed-catalog conversation in 2024 and 2025. By 2026, when an interviewer asks "how would you build lineage and a catalog across your stack?", the wrong answer is "we'd license Atlan" and the right answer starts with "OpenLineage as the wire format, OpenMetadata or DataHub as the backend." The shift is the same one that happened with Kubernetes versus proprietary container schedulers: the moment a credible open standard exists, every vendor either adopts it or argues itself into irrelevance.
This guide walks the two standards in production-engineering detail. It opens with why open standards for lineage and metadata matter at all (the cost of being trapped inside a closed metadata graph), then layers the OpenLineage event model (run, job, dataset, facets) on top of the OpenMetadata architecture (ingestion, metadata server, UI), and closes with the interop patterns that let you migrate off Atlan, Collibra, or Alation without a Big Bang cutover. Along the way it ties in Marquez and DataHub — the two most-mentioned reference backends — and shows the column-level lineage facet that makes a modern open data catalog actually useful for impact analysis. Every H2 ships at least one worked example with code, a step-by-step trace, an output table, and a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the ETL practice library, rehearse on dimensional modeling problems, and layer the data aggregation drills.
On this page
- Why open standards for lineage and metadata matter
- The open standards stack
- The OpenLineage event model
- OpenMetadata architecture and entity model
- Interop with proprietary vendors and migration patterns
- Cheat sheet — open standards recipes
- Frequently asked questions
- Practice on PipeCode
1. Why open standards for lineage and metadata matter
Closed catalogs trap your metadata graph inside a vendor's billing model — open standards let lineage and entity definitions outlive the contract
The one-sentence invariant: lineage and metadata are the two most expensive things to backfill, so the format you choose to emit them in is a 5-to-10-year decision, and proprietary catalogs charge you forever to read back data you already paid to compute. Once you internalise that "the graph you build is more valuable than the UI you license," the case for OpenLineage plus OpenMetadata (or DataHub) over a closed product becomes the default architectural posture.
The lock-in tax of proprietary catalogs.
- Per-asset pricing scales with success. Every catalog vendor invoices on "data assets" — tables, dashboards, columns, pipelines. The more your platform grows, the more you pay, even when the marginal user value of asset 50,001 is near zero.
- Export is intentionally hard. Closed catalogs expose only narrow REST APIs (or paginated CSV exports) for the metadata you contributed. Lineage edges, column-level mappings, glossary tags, and ownership graphs are often not round-trippable — you can read them, but you cannot bulk-extract them in a form the next catalog will understand.
- Connectors are the moat. A vendor's competitive edge is "we have 200 connectors." But those connectors emit into the vendor's internal metadata model. Switching means rebuilding every connector for the new tool — months of work for a platform team that wants to ship product instead.
The "every tool emits to its own black box" problem.
In a typical 2023-era stack, Airflow exposed lineage to its own DB, dbt exposed lineage to dbt Cloud, Spark exposed lineage to Spline (if anything), Atlan ingested from BigQuery, Monte Carlo ingested separately for observability, and Collibra ingested independently for governance. Each tool maintained its own copy of the same fact: job daily_orders reads raw_orders and writes fct_orders. That fact was duplicated five times, inconsistently, with each vendor's UI showing a slightly different graph.
What an open standard buys.
- One emit, many consumers. Airflow emits an OpenLineage event once. Marquez, OpenMetadata, DataHub, Monte Carlo, Atlan, and Collibra can all receive it. The graph is single-sourced; the receivers compete on UX, not on data ownership.
- Vendor portability. Move from Atlan to OpenMetadata? You point the OpenLineage transport at the new backend. Your emitters do not change. Your pipeline code does not change.
- Community integrations. When the OpenLineage spec adds a `columnLineage` facet, every emitter and every receiver can implement it in the same shape. No more "Vendor X supports column lineage on Snowflake but not Postgres."
- Schema review in the open. OpenLineage is governed by the LF AI & Data Foundation, and spec changes go through public proposal discussion; OpenMetadata is developed in the open under the Apache 2.0 license. A single vendor changing strategy cannot force a surprise breaking change on the wire format.
Lineage vs metadata vs catalog — separating the three concerns.
- Lineage is the runtime fact of "this job read these inputs and wrote these outputs at this time." It is a stream of events emitted by the compute engine.
- Metadata is the static description of an asset: its schema, owner, tags, description, freshness SLO, classification. It is rows in a catalog DB.
- Catalog is the application layer — the UI, the search index, the REST API, the access policies — that lets humans browse and query the metadata graph.
OpenLineage targets the lineage problem. OpenMetadata targets the metadata + catalog problem. They are complementary, not competitors.
The current ecosystem.
- OpenLineage — the wire-format standard. JSON Schema for runs, jobs, datasets, and extensible facets. Reference backend is Marquez.
- OpenMetadata — the open catalog application. Self-hosted or managed via Collate. Ingests from databases, dashboards, pipelines, ML models. Defines its own entity schemas.
- Marquez — the original OpenLineage backend. Simple Postgres + REST UI. Great when you only want lineage and do not yet need a full catalog.
- DataHub — alternative open catalog, originally from LinkedIn. Slightly different entity model than OpenMetadata, stronger upstream metadata-event story.
- Amundsen — earlier-generation open catalog from Lyft. Less actively developed in 2026; relevant mostly for historical context.
What interviewers listen for.
- Do you say "OpenLineage is the wire format, not a catalog"? — senior signal.
- Do you mention Marquez as the reference backend for OpenLineage? — senior signal.
- Do you distinguish DataHub and OpenMetadata as two parallel open-catalog projects? — senior signal.
- Do you propose a two-write migration when leaving a closed catalog? — senior signal.
Worked example — the lock-in cost of a closed catalog in one number
Detailed explanation. A platform team has 8,000 tables, 1,200 dashboards, and 400 dbt models. The closed catalog vendor invoices on data_assets. Migrating off the vendor requires re-emitting lineage from every pipeline; staying means paying forever. Pricing the two options surfaces why the open-standard answer is the default.
Question. Compute the three-year total cost of staying on a closed catalog at $0.50 per asset per month versus migrating to OpenMetadata + OpenLineage in a self-hosted footprint that costs $4,000 per month all-in (infra + 0.25 FTE). Assume asset count grows 25% per year.
Input.
| Year | Assets (start) | Assets (end) | Avg assets |
|---|---|---|---|
| 1 | 9,600 | 12,000 | 10,800 |
| 2 | 12,000 | 15,000 | 13,500 |
| 3 | 15,000 | 18,750 | 16,875 |
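The averages in the input table can be reproduced from the starting inventory and the growth assumption. The sketch below is only an illustration of that arithmetic: it assumes growth is applied once per year and that the yearly average is the midpoint between the start and end counts, which is the approximation the table appears to use.

```python
# Derive the per-year asset averages from the starting inventory.
# Inputs come from the worked example; the midpoint average is an assumption.
start_assets = 8_000 + 1_200 + 400   # tables + dashboards + dbt models
growth_rate = 0.25                   # assumed annual asset growth

rows = []
assets = start_assets
for year in (1, 2, 3):
    end_assets = assets * (1 + growth_rate)
    rows.append((year, assets, end_assets, (assets + end_assets) / 2))
    assets = end_assets

avg_assets = [round(avg) for _, _, _, avg in rows]
for year, start, end, avg in rows:
    print(f"Year {year}: start={start:,.0f} end={end:,.0f} avg={avg:,.0f}")
```

The resulting `avg_assets` list is the same one the cost model hard-codes, so changing `growth_rate` here is a quick way to test how sensitive the three-year comparison is to the growth assumption.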
Code.
# Three-year cost model — closed vs open
closed_unit_cost_per_month = 0.50 # USD per asset per month
open_monthly_cost = 4_000 # USD per month, all-in self-hosted
avg_assets = [10_800, 13_500, 16_875]
closed_total = sum(a * closed_unit_cost_per_month * 12 for a in avg_assets)
open_total = open_monthly_cost * 12 * 3
print(f"Closed 3-year cost: ${closed_total:>10,.0f}")
print(f"Open 3-year cost: ${open_total:>10,.0f}")
print(f"Savings: ${closed_total - open_total:>10,.0f}")
print(f"Break-even assets: {open_monthly_cost / closed_unit_cost_per_month:>10,.0f}")
Step-by-step explanation.
- Closed-catalog cost is linear in asset count. At $0.50 per asset per month, 10,800 average assets in year 1 means `10,800 * 0.50 * 12 = $64,800` for year 1.
- Year 2 grows to 13,500 average assets → `13,500 * 0.50 * 12 = $81,000`. Year 3 hits 16,875 average → `$101,250`.
- Open-catalog cost is flat: $4,000 per month * 36 months = $144,000.
- The break-even is `open_monthly / closed_unit = 4000 / 0.50 = 8,000 assets`. Above that asset count, OpenMetadata is cheaper at this infra budget.
Output.
| Metric | Value |
|---|---|
| Closed 3-year cost | $247,050 |
| Open 3-year cost | $144,000 |
| Savings | $103,050 |
| Break-even assets | 8,000 |
Rule of thumb. Below ~5,000 assets, the cost case for self-hosting is weaker — the FTE overhead dominates. Above ~10,000 assets, the open-standard answer pays for itself within the first contract renewal, before counting the value of avoiding vendor lock-in.
Worked example — what "lineage as a stream of events" looks like end-to-end
Detailed explanation. OpenLineage's mental model is event-driven: every job run emits a START event when it begins and a COMPLETE event when it finishes (or FAIL / ABORT on error). Each event carries the run, the job, the input datasets, the output datasets, and any number of facets. Concatenated over time, these events are the lineage graph.
Question. Show the minimum two-event sequence that captures a daily Airflow run of dbt_run_orders which reads raw.orders and writes analytics.fct_orders. Identify which fields are mandatory and which are optional.
Input.
| Field | Value |
|---|---|
| run id | a3f1-2026-06-15-01 |
| job name | analytics.dbt_run_orders |
| inputs | raw.orders |
| outputs | analytics.fct_orders |
Code.
// Event 1 — START
{
"eventType": "START",
"eventTime": "2026-06-15T01:00:00.000Z",
"run": { "runId": "a3f1-2026-06-15-01" },
"job": { "namespace": "analytics", "name": "dbt_run_orders" },
"inputs": [ { "namespace": "warehouse", "name": "raw.orders" } ],
"outputs": [ { "namespace": "warehouse", "name": "analytics.fct_orders" } ],
"producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/airflow"
}
// Event 2 — COMPLETE
{
"eventType": "COMPLETE",
"eventTime": "2026-06-15T01:04:12.000Z",
"run": { "runId": "a3f1-2026-06-15-01" },
"job": { "namespace": "analytics", "name": "dbt_run_orders" },
"inputs": [ { "namespace": "warehouse", "name": "raw.orders" } ],
"outputs": [ { "namespace": "warehouse", "name": "analytics.fct_orders" } ],
"producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/airflow"
}
Step-by-step explanation.
- The START event arrives when the Airflow operator begins. The `runId` is a UUID stamped once per attempt; Airflow derives it from the DAG run's `try_number` plus the task id. The value shown in the example is shortened for readability.
- Inputs and outputs are listed intentionally. OpenLineage does not infer them; the emitter is responsible for declaring what the job will read and write. dbt knows from its compiled manifest; Spark knows from its query plan; Airflow falls back to operator-specific hints.
- The COMPLETE event arrives when the operator returns. It re-states the same run, job, inputs, and outputs, and receivers reconcile the two events by `runId`. If a FAIL or ABORT event arrives instead, the receiver knows the lineage edge is attempted rather than successful.
- The `producer` field is the URL of the emitter's source. Receivers use it to know "this event came from the OpenLineage 1.20.0 Airflow integration" and apply version-specific facet handling.
- Mandatory versus optional: the spec requires `eventTime`, `producer`, `schemaURL`, `run.runId`, `job.namespace`, and `job.name` on every run event; `eventType`, `inputs`, `outputs`, and all facets are optional. The abbreviated events above omit `schemaURL` for brevity.
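The START/COMPLETE pair can also be built and checked in a few lines of plain Python. This is a minimal sketch, not the OpenLineage client API: `make_event` and `validate` are hypothetical helpers, the producer URI is made up, and the spec version inside `SCHEMA_URL` is an assumption that should be replaced with whatever your client library ships with.

```python
import uuid
from datetime import datetime, timezone

# Assumptions: illustrative spec version and a hypothetical producer URI.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"
PRODUCER = "https://example.com/pipelines/orders"

REQUIRED_TOP_LEVEL = ("eventTime", "producer", "schemaURL", "run", "job")


def make_event(event_type, run_id, event_time):
    """Build one run event for the dbt_run_orders job from the example."""
    return {
        "eventType": event_type,
        "eventTime": event_time.isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": "analytics", "name": "dbt_run_orders"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "analytics.fct_orders"}],
    }


def validate(event):
    """Raise ValueError if a required field is missing or runId is not a UUID."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in event]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    uuid.UUID(event["run"]["runId"])  # raises ValueError on a malformed id
    for key in ("namespace", "name"):
        if key not in event["job"]:
            raise ValueError(f"job is missing {key!r}")
    return event


run_id = str(uuid.uuid4())  # one id shared by both events of this attempt
start = validate(make_event(
    "START", run_id, datetime(2026, 6, 15, 1, 0, 0, tzinfo=timezone.utc)))
complete = validate(make_event(
    "COMPLETE", run_id, datetime(2026, 6, 15, 1, 4, 12, tzinfo=timezone.utc)))
```

Passing the shortened id from the input table (`a3f1-2026-06-15-01`) to `validate` raises `ValueError`, which is a cheap way to catch non-UUID run ids before a backend rejects them.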
Output (Marquez UI rendering).
| Node type | Identifier | Edges |
|---|---|---|
| Job | analytics.dbt_run_orders | input from warehouse.raw.orders, output to warehouse.analytics.fct_orders |
| Dataset | warehouse.raw.orders | read by analytics.dbt_run_orders |
| Dataset | warehouse.analytics.fct_orders | written by analytics.dbt_run_orders |
| Run | a3f1-2026-06-15-01 | status COMPLETE, duration 4m 12s |
Rule of thumb. Think of OpenLineage the way you think of OpenTelemetry for traces: emitters push events, backends receive and persist them, UIs render. The wire format is small and stable; the value compounds over thousands of runs.
Worked example — the "every tool has its own graph" failure mode
Detailed explanation. Without an open standard, every tool keeps its own private graph and your platform team operates as the human consistency layer. When the dbt graph says model X depends on table Y but the Airflow graph says task A depends on table Z and the BI tool says dashboard D depends on column C, no one can answer "if I drop column C, what breaks?" in less than a half-day investigation.
Question. A finance dashboard breaks because the currency column was renamed in the source. Trace the four lookups a platform engineer must do without an open standard and the single lookup they would do with one.
Input (impacted assets in five tools).
| Tool | Asset | What it records |
|---|---|---|
| Postgres source | raw.invoices | ccy (renamed from currency) |
| dbt | int_invoices, fct_revenue | references currency |
| Airflow | DAG daily_revenue | runs dbt build |
| BI | dashboard finance.revenue_v2 | uses fct_revenue.currency |
| Catalog | fct_revenue lineage | last refreshed 6h ago |
Code.
# Without OpenLineage / OpenMetadata — four siloed lookups
1. dbt docs: which models reference `currency`?
2. Airflow UI: which DAGs run those models?
3. BI tool: which dashboards depend on those tables?
4. Catalog: which downstream owners need notification?
# With OpenLineage + OpenMetadata — one query
GET /api/v1/lineage/table/name/warehouse.raw.invoices?upstreamDepth=0&downstreamDepth=4
Step-by-step explanation.
- Without the open standard, each tool answers a slice of the question against its private graph. The engineer manually stitches the answers: dbt says "models X, Y depend on `currency`"; Airflow says "DAG `daily_revenue` runs them"; the BI tool says "dashboards A and B depend on Y"; the catalog confirms ownership but is stale.
- The stitching is error-prone: a dbt model invoked by an ad-hoc notebook (not Airflow) is invisible to the Airflow lookup. A dashboard that depends on a derived column via a join is invisible unless the BI tool indexed column lineage.
- With OpenLineage emitters everywhere and OpenMetadata as the single sink, the question is one API call. The downstream graph is materialised continuously from the events; the answer is whichever assets currently sit downstream of `warehouse.raw.invoices`.
- Time-to-impact-analysis drops from "half a day" to "30 seconds." That speed is the operational ROI of a unified metadata graph, and the strongest argument when the team's senior engineer asks "why are we spending two weeks adopting another standard?"
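Under the hood, the single-query answer is a bounded graph traversal over stored lineage edges. The sketch below illustrates that idea with a breadth-first search; the `EDGES` list and the `downstream` helper are hypothetical stand-ins for whatever a backend actually stores and exposes, not the OpenMetadata implementation.

```python
from collections import deque

# Hypothetical edge list: (upstream, downstream) pairs as a backend might store them.
EDGES = [
    ("warehouse.raw.invoices", "analytics.int_invoices"),
    ("analytics.int_invoices", "analytics.fct_revenue"),
    ("analytics.fct_revenue", "bi.dashboards.finance.revenue_v2"),
    ("warehouse.raw.orders", "analytics.fct_orders"),  # unrelated branch
]


def downstream(root, edges, max_depth):
    """Return (hop, asset) pairs reachable from `root` within `max_depth` hops."""
    children = {}
    for up, down in edges:
        children.setdefault(up, []).append(down)

    seen = {root}
    result = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                result.append((depth + 1, child))
                queue.append((child, depth + 1))
    return result


impacted = downstream("warehouse.raw.invoices", EDGES, max_depth=4)
for hop, asset in impacted:
    print(hop, asset)
```

The `seen` set keeps the traversal safe on graphs with diamonds or cycles, and `max_depth` plays the role of the depth parameter in the API call.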
Output (impact-analysis table from a single OpenMetadata query).
| Hop | Asset | Owner | Action required |
|---|---|---|---|
| 1 | warehouse.raw.invoices.currency | data-eng | rename mapping in dbt staging |
| 2 | analytics.int_invoices | data-eng | regenerate, redeploy |
| 3 | analytics.fct_revenue | analytics-eng | document column |
| 4 | bi.dashboards.finance.revenue_v2 | finance-eng | update dashboard tile |
Rule of thumb. The single best heuristic for "is our metadata stack mature?" is "can we answer the impact-analysis question in under a minute?" If no, the next architecture investment is OpenLineage emitters plus a single open backend.
Data engineering interview question on choosing between open and closed catalogs
A senior interviewer might frame this as: "Your CFO is asking why we should not just buy Atlan and be done. Defend the open-standards path in a 60-second answer that does not sound like an open-source zealot speech."
Solution Using a TCO + portability scorecard
# Decision matrix — closed vs open catalog
# Score each criterion 1-5 (higher = better for the option)
criteria = {
"feature_velocity_today": {"closed": 5, "open": 4}, # vendor ships polish
"five_year_TCO_at_scale": {"closed": 2, "open": 5}, # per-asset pricing scales painfully
"vendor_portability": {"closed": 1, "open": 5}, # OL means switching cost is near zero
"control_over_metadata_graph": {"closed": 2, "open": 5}, # self-hosted = your own DB
"platform_team_FTE_required": {"closed": 5, "open": 3}, # closed is cheaper in eng hours
"integration_with_OSS_emitters": {"closed": 3, "open": 5}, # OL emitters land on open backends day 1
}
closed_score = sum(c["closed"] for c in criteria.values())
open_score = sum(c["open"] for c in criteria.values())
print(f"Closed total: {closed_score}")
print(f"Open total: {open_score}")
Step-by-step trace.
| Criterion | Closed | Open | Comment |
|---|---|---|---|
| feature_velocity_today | 5 | 4 | vendors ship polish faster |
| five_year_TCO_at_scale | 2 | 5 | per-asset bill grows with success |
| vendor_portability | 1 | 5 | OL means switching cost near zero |
| control_over_metadata_graph | 2 | 5 | self-hosted = your own DB |
| platform_team_FTE_required | 5 | 3 | closed cheaper in eng hours |
| integration_with_OSS_emitters | 3 | 5 | OL emitters land everywhere day 1 |
The closed option leads on short-term polish and FTE economy; the open path leads on every multi-year axis (TCO, portability, control, integrations). For a platform expected to outlive any single vendor contract, the open path wins on every criterion you would care about three renewals out.
Output:
| Path | Total |
|---|---|
| Closed catalog | 18 |
| Open standards (OL + OM/DH) | 27 |
Why this works — concept by concept:
- Total cost of ownership — vendor pricing is per-asset, and asset counts grow super-linearly with success; open infra is a flat-ish cost. Above ~10K assets, open wins on cash alone.
- Portability premium — OpenLineage emitters survive backend changes; the cost of changing backends approaches the cost of pointing the OL transport at a new URL. That option value is real and grows with platform maturity.
- Control over your own metadata graph — when the catalog DB is yours, you can run arbitrary queries against it: cardinality audits, governance dashboards, custom impact analyses. Closed APIs cap you at the vendor's imagination.
- FTE realism — yes, self-hosted costs platform-engineering time. The fair comparison is not "free vs paid"; it is "X FTE-months vs Y dollars plus lock-in." The decision matrix surfaces this honestly.
- OSS emitter integration — every new OpenLineage emitter (Snowflake, Trino, Materialize) lands on every open backend at the same time. Closed catalogs lag by a release cycle.
- Cost — the analysis itself is one spreadsheet plus a back-of-envelope FTE estimate. The actual decision is bought back over years of avoided lock-in pain.
2. The open standards stack
OpenLineage is the wire format, OpenMetadata is the catalog application — they sit at different layers of the same stack, and confusing them is the most-common interview mistake
The mental model in one line: OpenLineage defines what to emit (a JSON event); OpenMetadata defines where to store and query (a catalog application with REST APIs and a UI). Once you say "wire format versus application," every follow-up question about Marquez, DataHub, or whether to "use OpenLineage or OpenMetadata" answers itself: you almost always use both, at different layers.
The four-layer stack in one paragraph.
- Layer 1: Emitters. The things that produce lineage events: Airflow, dbt, Spark, Flink, Dagster, Prefect, custom Python apps. Each emitter has an OpenLineage integration that translates its native execution model into OL events.
- Layer 2: Wire format (OpenLineage). The JSON schema for the event itself: `run`, `job`, `dataset`, and an extensible `facets` slot. Versioned by the OpenLineage spec.
- Layer 3: Backends. The things that consume and persist the events: Marquez (reference backend, lineage-only), OpenMetadata (full catalog), DataHub (alternative catalog), and vendor receivers (Monte Carlo, Atlan, Bigeye, Collibra) when those products accept OL.
- Layer 4: Consumers. The humans and systems that read the persisted graph: catalog UIs, search indexes, impact-analysis services, governance dashboards, downstream alerting.
The "one event, many consumers" pattern.
A single OpenLineage event emitted by an Airflow task can simultaneously land in:
- Marquez for the lineage graph UI used by data engineers.
- OpenMetadata for the broader catalog with glossary and tags used by analysts and stewards.
- Monte Carlo or Bigeye for observability and freshness anomaly detection.
- A custom Kafka topic that downstream services subscribe to for "this table just changed" event-driven processing.
The OpenLineage client libraries ship an HTTP transport and a Kafka transport out of the box. Multi-cast is solved either by a composite transport that wraps several transports (available in newer client versions) or by running a small fan-out proxy that re-emits each event to N backends.
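The fan-out proxy option can be sketched with the standard library alone. This is a rough illustration under stated assumptions: `http_sender` and `fan_out` are hypothetical helper names, backend hosts are placeholders, and a production proxy would add retries, authentication, and buffering.

```python
import json
import urllib.request


def http_sender(url, timeout=5.0):
    """Return a callable that POSTs one event to `url` as JSON."""
    def send(event):
        request = urllib.request.Request(
            url,
            data=json.dumps(event).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    return send


def fan_out(event, senders):
    """Deliver `event` to every sender; one failing backend does not block the rest."""
    failures = {}
    for name, send in senders.items():
        try:
            send(event)
        except Exception as exc:  # record the failure and keep going
            failures[name] = exc
    return failures


# Demo with in-memory senders so the sketch runs without a network.
received = []


def broken_backend(event):
    raise ConnectionError("backend down")


failures = fan_out(
    {"eventType": "START", "run": {"runId": "demo"}},
    {"lineage_store": received.append, "catalog": broken_backend},
)
print(sorted(failures))
```

In a real deployment each entry in `senders` would be something like `http_sender("http://marquez.internal:5000/api/v1/lineage")`, where the host name is a placeholder. Isolating failures per backend is the key design choice: a catalog outage should not cost you lineage in the other store.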
OpenLineage vs OpenMetadata — when each is the right answer.
| You need | OpenLineage | OpenMetadata |
|---|---|---|
| The fact "job X read table Y at time T" | yes (emit + persist) | partial (ingests OL events) |
| A searchable UI of every table with owners and tags | no | yes |
| Column-level lineage facets | yes (in the event) | yes (renders the graph) |
| A glossary, classifications, PII tags | no | yes |
| Data quality test results | partial (facet) | yes (first-class entity) |
| Connectors for BigQuery, Snowflake, Tableau metadata | no | yes |
| A wire format other tools can also emit to | yes | no (it is an application) |
Marquez and DataHub in one sentence each.
- Marquez is the reference OpenLineage backend — Postgres for storage, REST API for ingest and query, a minimal lineage UI. Use when you want "OpenLineage and a graph viewer" and nothing else.
- DataHub is an alternative open catalog (originally LinkedIn) that competes with OpenMetadata. It uses its own metadata-event model (MCE / MAE) but accepts OL events through an adapter. Use when you want strong upstream metadata propagation with Kafka under the hood.
Where vendors plug in.
- As emitters. A vendor's product (e.g. a closed orchestrator) can ship native OL events instead of a proprietary metadata API. Increasingly common — even Databricks and Snowflake now have OL integration paths.
- As backends. Monte Carlo, Bigeye, Atlan, and Collibra accept OL events as input. Your team emits once; the vendor enriches and visualises.
- As ingestion sources for OpenMetadata. OpenMetadata's `ingestion-framework` runs as Airflow DAGs (or a Python container) and uses connectors to pull metadata from Snowflake, BigQuery, Tableau, Looker, Kafka. These connectors do not emit OL; they push entities directly into the OpenMetadata server.
Two paths in: events versus connectors.
OpenMetadata has two ingest paths. (1) Connectors that crawl source systems (Snowflake INFORMATION_SCHEMA, Tableau REST API, etc.) and push entity records via REST. (2) OpenLineage events that arrive via the OL endpoint and get converted into Pipeline entities + lineage edges. Many teams use both — connectors for the entity inventory, OL for the runtime lineage.
Common interview probes on the stack.
- "Is OpenLineage a database?" — no. It is a wire format. Storage is the backend's job.
- "Can I use OpenLineage without a catalog?" — yes. Marquez gives you lineage-only without the wider catalog surface.
- "Can I use OpenMetadata without OpenLineage?" — yes. Connectors alone populate the catalog; lineage will then be limited to whatever the connectors infer from query history.
- "Why not DataHub then?" — usually a tie. DataHub's metadata-event model is more event-native; OpenMetadata's connector library is broader. Pick by ecosystem fit, not by logo.
Worked example — sketching the stack as data flow
Detailed explanation. Drawing the four-layer stack with concrete tools at each layer is the fastest way to internalise where each project sits. The picture makes "OpenLineage versus OpenMetadata" stop being a question.
Question. Sketch a four-layer stack diagram for a team running Airflow, dbt, and Spark that wants both runtime lineage and a searchable catalog. Identify which projects sit at which layer and which transport carries events between them.
Input.
| Tool | Role |
|---|---|
| Airflow | orchestration |
| dbt | transformation in warehouse |
| Spark | external transformation |
| Snowflake | warehouse |
| Marquez | wanted as lineage UI |
| OpenMetadata | wanted as catalog UI |
Code.
LAYER 1: Emitters
    Airflow (OL plugin)    dbt (OL adapter)    Spark (OL listener)
LAYER 2: Wire format
    OpenLineage event (run, job, dataset, facets) over HTTP
    Endpoint: OPENLINEAGE_URL = http://oltransport:5000
LAYER 3: Backends
    Marquez (lineage UI + Postgres)
    OpenMetadata (catalog UI + Elasticsearch + Postgres)
    Both receive events via a fan-out proxy or a composite transport
LAYER 4: Consumers
    Marquez UI for "trace the job"
    OpenMetadata UI for "find the table, owner, tags"
    Custom Slack bot subscribed to FAIL events for on-call
Step-by-step explanation.
- Each emitter is configured once to point at the OpenLineage transport URL. The team does not have to know which backends are subscribed downstream.
- The transport is HTTP by default; Kafka is the production choice when you want backpressure and durability between emitters and backends.
- The fan-out happens at the transport layer or with a small proxy (often a single FastAPI service) that POSTs each incoming event to every configured backend.
- Marquez and OpenMetadata coexist happily. They consume the same OL events but render different parts of the metadata graph — Marquez focuses on the lineage graph; OpenMetadata adds catalog, glossary, and quality on top.
Output (the stack table).
| Layer | Tool | What it does |
|---|---|---|
| 1 | Airflow, dbt, Spark | emit OL events on every run |
| 2 | OpenLineage JSON over HTTP | transport |
| 3 | Marquez, OpenMetadata | persist + render |
| 4 | Marquez UI, OpenMetadata UI, Slack bot | humans and downstream alerting |
Rule of thumb. Draw this four-layer stack on a whiteboard before you write any code. The teams that get OL adoption wrong almost always conflated layer 2 with layer 3 ("we're going to use OpenLineage as our catalog") or layer 3 with layer 4 ("we'll just point everyone at Marquez UI").
Worked example — OpenMetadata's two ingest paths side by side
Detailed explanation. OpenMetadata accepts metadata via connectors (pull from source) and via OpenLineage events (push from emitter). Each path fills a different slot in the graph, and most teams need both for a complete picture.
Question. A platform team wants analytics.fct_orders in the OpenMetadata UI with its schema, owner, tags, and a lineage graph that shows the dbt model writing it. Outline which ingest path supplies which fields, and the order in which the paths should run.
Input.
| Asset | Source of truth |
|---|---|
| Table schema (columns, types) | Snowflake `INFORMATION_SCHEMA` |
| Owner, tags, description | OpenMetadata UI + glossary |
| Lineage edge "dbt → fct_orders" | dbt run-time |
| Last-refreshed timestamp | dbt run-time |
Code.
# 1) Connector ingest — runs as an Airflow DAG every hour
source:
  type: snowflake
  serviceName: warehouse_prod
  serviceConnection:
    config:
      type: Snowflake
      hostPort: acct.snowflakecomputing.com
      username: openmetadata_ro
      database: ANALYTICS
sink:
  type: metadata-rest
  config: {}
# 2) OpenLineage ingest — runs as a webhook the dbt CLI POSTs to
# Configure dbt to emit OL events to OpenMetadata's OL endpoint:
# OPENLINEAGE_URL=https://openmetadata.example.com
# OPENLINEAGE_ENDPOINT=/api/v1/openlineage
Step-by-step explanation.
- The Snowflake connector enumerates every table in `ANALYTICS`, reads `INFORMATION_SCHEMA` for column types, and pushes Table entities into OpenMetadata. `fct_orders` appears in the UI but without lineage edges yet.
- The dbt OL emitter fires on every `dbt run` and POSTs OL events to OpenMetadata's `/api/v1/openlineage` endpoint. OpenMetadata converts each event into a Pipeline entity and creates lineage edges from inputs to outputs.
- After both paths have run, `fct_orders` appears in the UI with its full schema and the upstream edge from the dbt Pipeline. The user adds the owner and tags manually (or by API); that metadata is catalog-native and not in any source system.
- Order matters: the connector should run first so that the Table entity exists before the OL event tries to create the lineage edge. If the order is reversed, OpenMetadata creates a placeholder Table from the OL `dataset` reference and fills in the real schema on the next connector pass.
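The placeholder behaviour described in the last step can be modelled with a toy in-memory catalog. The `Catalog` class below is purely illustrative and is not the OpenMetadata API; it only shows why the two ingest paths converge on the same end state regardless of order.

```python
class Catalog:
    """Toy in-memory catalog used to illustrate ingest ordering."""

    def __init__(self):
        self.tables = {}    # table name -> {"columns": ..., "placeholder": bool}
        self.edges = set()  # (upstream, downstream) pairs

    def upsert_from_connector(self, name, columns):
        """Connector path: create the table or replace a placeholder with real schema."""
        self.tables[name] = {"columns": columns, "placeholder": False}

    def ingest_lineage(self, inputs, outputs):
        """Event path: add edges, creating placeholders for tables not yet known."""
        for name in list(inputs) + list(outputs):
            self.tables.setdefault(name, {"columns": None, "placeholder": True})
        for upstream in inputs:
            for downstream in outputs:
                self.edges.add((upstream, downstream))


# Reversed order: the lineage event arrives before the connector has run.
catalog = Catalog()
catalog.ingest_lineage(["raw.orders"], ["analytics.fct_orders"])
state_before = dict(catalog.tables["analytics.fct_orders"])

# The next connector pass fills in the real schema; the edge is untouched.
catalog.upsert_from_connector(
    "analytics.fct_orders", {"order_id": "NUMBER", "amount": "NUMBER"}
)
state_after = catalog.tables["analytics.fct_orders"]
print(state_before["placeholder"], state_after["placeholder"])
```

Keying both paths on the same table name is what makes this work; in practice that means the dataset naming used by the emitter has to line up with the fully qualified names the connector produces.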
Output (assembled OpenMetadata UI panel).
| Field | Source |
|---|---|
| Name `analytics.fct_orders` | Snowflake connector |
| Columns + types | Snowflake connector |
| Tags `PII::masked`, `Domain::Finance` | Manual + glossary |
| Lineage upstream `dbt.run_fct_orders` | OpenLineage event |
| Last refresh timestamp | OpenLineage event |
Rule of thumb. Run the connector hourly (or on a metadata-change CDC if available); run OpenLineage continuously (per task). Mixing the two cadences gives you both static asset inventory and live runtime lineage at the cost each path implies.
Data engineering interview question on partitioning the stack between OL and OM
A senior interviewer might say: "Walk me through which problems you would solve with OpenLineage and which with OpenMetadata if you were designing a metadata platform from scratch in 2026."
Solution Using a layered responsibility split
# Stack ownership matrix
LAYER                 OWNER PROJECT               ARTIFACT
emitters (per tool)   OpenLineage integration     one OL event per run
wire format           OpenLineage spec            JSON event with facets
transport             OL HTTP / Kafka client      POST / produce
durable store         OpenMetadata or DataHub     Postgres + Elasticsearch
catalog entities      OpenMetadata schemas        Table, Pipeline, Dashboard
search + UI           OpenMetadata UI             browse, search, lineage view
governance            OpenMetadata Glossary       terms, classifications, PII
data quality          OM Test Suite               test cases + results entity
runtime lineage       OL events ingested by OM    edges populated from facets
freshness alerts      downstream consumer         Slack bot or vendor receiver
Step-by-step trace.
| Concern | OpenLineage | OpenMetadata |
|---|---|---|
| run/job/dataset events | OL spec | consumes |
| schema + classifications | dataset facet (per event) | first-class entity |
| glossary + business terms | n/a | Glossary entity |
| lineage graph storage | n/a | yes |
| catalog search UI | n/a | yes |
| connectors to BI / Kafka | n/a | yes |
| extensible custom metadata | facets | extension API |
| transport multicast | yes (HTTP / Kafka) | n/a |
The split makes the role of each project clear: OL owns the protocol and the runtime events; OM owns the application, the catalog entities, and the user experience. They meet at the OpenLineage endpoint where OM consumes events.
Output:
| Decision | Project |
|---|---|
| What format do my emitters speak? | OpenLineage |
| Where do events live for the long term? | OpenMetadata (or DataHub) |
| Where do humans browse the catalog? | OpenMetadata UI |
| Where do I add a glossary or PII tags? | OpenMetadata |
| Which project do I configure transport on? | OpenLineage client |
Why this works — concept by concept:
- Separation of concerns — wire formats and applications evolve on different cadences; coupling them slows both. The four-layer stack is the architecture pattern that makes the metadata platform sustainable.
- Backend portability — by treating OL as the protocol, you can replace Marquez with OpenMetadata, OpenMetadata with DataHub, or DataHub with a vendor without changing a single emitter.
- Catalog ownership — OpenMetadata owns the entities (Table, Pipeline, Dashboard, MLModel, Glossary, Tag) and the policies that govern them; OL contributes the lineage edges between those entities.
- Custom metadata via facets — anything you cannot express in the core OL schema goes into a custom facet. The receiver chooses whether to surface it. No forking required.
- Transport choices — HTTP for simple setups, Kafka for high-volume production stacks where you want durability and replayability between emitters and backends.
- Cost — protocol design plus catalog design happens once; daily operations are O(events) and dominated by Postgres + Elasticsearch in the backend. The architecture itself is cheap; the content is where the value lives.
3. The OpenLineage event model
run, job, dataset, facets — four nouns that capture every transformation in your stack, and the column-level facet is where the modern open data catalog gets its impact-analysis superpower
The mental model in one line: every OpenLineage event is a tuple (run, job, inputs, outputs, facets) describing one execution attempt of one unit of work. Once you can name the four nouns and the four run states you will meet most often (START, COMPLETE, FAIL, ABORT; the spec also defines RUNNING and OTHER), the entire OL spec collapses to "fill in the right facets for your use case."
The four core entities.
- run: a single execution attempt. Has a runId (UUID) plus optional facets for parent run, nominal time, error message.
- job: the unit of work itself, independent of any single execution. Identified by (namespace, name). The job is stable across runs; runs come and go.
- dataset: an input or output of the job. Identified by (namespace, name). Examples: (warehouse, raw.orders), (s3, bucket-name/path/prefix).
- facets: optional, extensible blocks of typed metadata attached to runs, jobs, or datasets. The whole spec is extended through facets, not by changing the core schema.
Run states.
- START: the run has begun. Receivers create an open run record.
- COMPLETE: the run finished successfully. Receivers close the run and finalise edges.
- FAIL: the run failed. Edges may be marked attempted; downstream consumers can alert.
- ABORT: the run was killed (timeout, manual stop). Treated like FAIL by most receivers, but the cause is different.
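To make the four nouns and the run states concrete, here is a minimal sketch that assembles a run event as a plain Python dict using only the standard library. The helper name build_run_event and the producer URL are illustrative assumptions, not part of any OpenLineage client; a real deployment would normally rely on an official integration rather than hand-built dicts.

```python
import json
import uuid
from datetime import datetime, timezone

# The four run states named in this section.
RUN_STATES = {"START", "COMPLETE", "FAIL", "ABORT"}


def build_run_event(event_type, job_namespace, job_name, run_id=None,
                    inputs=None, outputs=None,
                    producer="https://example.com/my-emitter"):  # hypothetical producer URI
    """Assemble one OpenLineage-style run event as a plain dict."""
    if event_type not in RUN_STATES:
        raise ValueError(f"unknown run state: {event_type}")
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": inputs or [],
        "outputs": outputs or [],
        "producer": producer,
    }


# One job, one run, two events: START and COMPLETE share the same runId.
run_id = str(uuid.uuid4())
start = build_run_event("START", "analytics", "fct_orders", run_id=run_id)
complete = build_run_event(
    "COMPLETE", "analytics", "fct_orders", run_id=run_id,
    inputs=[{"namespace": "warehouse", "name": "raw.orders"}],
    outputs=[{"namespace": "warehouse", "name": "analytics.fct_orders"}],
)
print(json.dumps(complete, indent=2))
```

The job identity (namespace, name) stays fixed while the runId changes on every execution, which is how a receiver can pair a START with its terminal event.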
Standard facets you will use every day.
- schemaFacet: attached to a dataset; lists columns and types. Lets a receiver know the shape of the data at the moment of the event.
- sourceFacet: attached to a dataset; identifies the physical storage system (Snowflake, S3, Kafka topic). Helps backends group datasets by source.
- sqlFacet: attached to a job; the exact SQL text the job ran. Powers query-level lineage for SQL engines.
- columnLineageFacet: attached to an output dataset; maps each output column to the input columns it was derived from. The single most valuable facet for impact analysis.
- dataQualityFacet: attached to a dataset; expected/actual stats (row count, null ratio, distinct count). Powers freshness and quality observability.
- ownershipFacet: attached to a job or dataset; team or person responsible. Lets receivers route alerts.
- parentRunFacet: attached to a run; reference to a parent run (e.g. an Airflow DAG run that contains a dbt task run). Lets the graph render hierarchically.
How Airflow, Spark, dbt, and Flink emit events.
- Airflow. The OL Airflow plugin instruments every task. Each task emits a START on pre_execute and a COMPLETE / FAIL on post_execute. Operator-specific extractors fill in inputs and outputs (e.g. SnowflakeOperator knows what tables the SQL touches).
- dbt. The OL dbt adapter wraps dbt run. After each model materialises, it emits an event with inputs (refs) and outputs (the model's relation). The sqlFacet carries the compiled SQL; the columnLineageFacet is derived from dbt's manifest.
- Spark. The OL Spark listener hooks into the SparkSession. On each query execution, it walks the logical plan to extract input and output dataset references and emits a START / COMPLETE pair.
- Flink. The OL Flink integration emits per-job events with stream sources and sinks as input / output datasets. Useful for keeping the streaming side of the graph aligned with the batch side.
Column-level lineage via the columnLineage facet.
The facet maps each output column to a list of (input dataset, input column, transformation type) tuples. Receivers render this as a column-level graph in the lineage UI. For a SQL job, the facet is computed by SQL-parsing the query plan (sqlglot, Calcite, or the engine's native parser). For a dbt model, the facet can be derived from dbt's manifest and ref() graph.
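As a rough illustration of the facet's shape, the sketch below builds a columnLineage-style body from a simple Python mapping. The function name column_lineage_facet is hypothetical, and the sketch deliberately omits the transformation type mentioned above; it only shows the output-column to input-field structure.

```python
def column_lineage_facet(mapping):
    """Build a columnLineage-style facet body.

    mapping: {output_column: [(namespace, dataset, input_column), ...]}
    """
    return {
        "fields": {
            out_col: {
                "inputFields": [
                    {"namespace": ns, "name": ds, "field": col}
                    for ns, ds, col in sources
                ]
            }
            for out_col, sources in mapping.items()
        }
    }


# Two output columns, each derived from exactly one input column.
facet = column_lineage_facet({
    "order_id": [("warehouse", "raw.orders", "order_id")],
    "country": [("warehouse", "raw.customers", "country")],
})
```

An output column computed from several inputs (for example a concatenation) would simply carry more than one tuple in its list.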
Custom facets — when and how.
- When. You have metadata that does not fit the standard facets but is useful to your platform. Examples: a securityClassificationFacet, a costFacet (compute units consumed), a lineageQualityFacet (confidence score).
- How. Declare a JSON Schema for the facet under a unique URI (e.g. https://your-org.com/openlineage/cost.json). Emit it inline. Receivers either render it or ignore it, with no breaking changes either way.
The wire format itself in one paragraph.
Every event is a JSON object with mandatory fields eventType, eventTime, run.runId, job.namespace, job.name, plus optional inputs[], outputs[], and producer. Facets sit under run.facets, job.facets, or per-dataset facets. The schema is versioned via the top-level schemaURL. Receivers ignore facets they do not understand, which makes spec evolution painless.
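A quick way to internalise the mandatory fields is to write the check a receiver might run before accepting an event. The sketch below is an assumption-laden illustration (the name missing_fields and the list REQUIRED_PATHS are invented here and mirror only the fields named in the paragraph above); it is not the validation logic of any particular backend.

```python
# Mandatory fields as listed in the paragraph above, expressed as key paths.
REQUIRED_PATHS = [
    ("eventType",),
    ("eventTime",),
    ("run", "runId"),
    ("job", "namespace"),
    ("job", "name"),
]


def missing_fields(event):
    """Return dotted paths of mandatory fields that are absent or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node = event
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node in (None, ""):
            missing.append(".".join(path))
    return missing


ok_event = {
    "eventType": "COMPLETE",
    "eventTime": "2026-06-15T01:08:24.000Z",
    "run": {"runId": "c8b3-2026-06-15-01"},
    "job": {"namespace": "analytics", "name": "fct_orders"},
}
bad_event = {"eventType": "START", "run": {}, "job": {"name": "fct_orders"}}

print(missing_fields(ok_event))
print(missing_fields(bad_event))
```

Unknown facets are intentionally not inspected here, which matches the "ignore what you do not understand" behaviour described above.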
Common interview probes on the event model.
- "What is the difference between a job and a run?" The job is the recipe; the run is one execution attempt. Multiple runs share a job; runs are immutable post-completion.
- "Can I emit OL without inputs and outputs?" Yes (the event still describes the run), but the lineage edge is empty. You lose the main reason to emit at all.
- "How does OL handle streaming jobs that never complete?" Periodic checkpoint events with a START at startup and intermittent COMPLETE markers (or no terminal event), with the Flink integration's convention being a long-lived run that receives status updates.
- "What stops a facet from being misused?" JSON Schema validation. Receivers validate facets against their declared schemas; malformed facets are dropped or quarantined.
Worked example — anatomy of a single dbt OL event
Detailed explanation. Reading one real event end-to-end is the fastest way to internalise the spec. The example below is the COMPLETE event for a dbt model that joins raw.orders to raw.customers and writes analytics.fct_orders.
Question. Annotate the event with which field powers which UI feature. Identify the mandatory fields, the input/output datasets, the SQL facet, the schema facet, and the column-lineage facet.
Input.
| Field | Value |
|---|---|
| job | analytics.fct_orders |
| run id | c8b3-2026-06-15-01 |
| inputs | raw.orders, raw.customers |
| outputs | analytics.fct_orders |
Code.
{
"eventType": "COMPLETE",
"eventTime": "2026-06-15T01:08:24.000Z",
"run": { "runId": "c8b3-2026-06-15-01" },
"job": {
"namespace": "analytics",
"name": "fct_orders",
"facets": {
"sql": {
"_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/dbt",
"_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SqlJobFacet.json",
"query": "select o.order_id, o.amount, c.country from raw.orders o join raw.customers c using (customer_id)"
}
}
},
"inputs": [
{ "namespace": "warehouse", "name": "raw.orders",
"facets": {
"schema": { "fields": [
{"name": "order_id", "type": "BIGINT"},
{"name": "customer_id", "type": "BIGINT"},
{"name": "amount", "type": "NUMERIC"}
]}
}
},
{ "namespace": "warehouse", "name": "raw.customers",
"facets": {
"schema": { "fields": [
{"name": "customer_id", "type": "BIGINT"},
{"name": "country", "type": "STRING"}
]}
}
}
],
"outputs": [
{ "namespace": "warehouse", "name": "analytics.fct_orders",
"facets": {
"schema": { "fields": [
{"name": "order_id", "type": "BIGINT"},
{"name": "amount", "type": "NUMERIC"},
{"name": "country", "type": "STRING"}
]},
"columnLineage": {
"fields": {
"order_id": { "inputFields": [
{"namespace": "warehouse", "name": "raw.orders", "field": "order_id"}
]},
"amount": { "inputFields": [
{"namespace": "warehouse", "name": "raw.orders", "field": "amount"}
]},
"country": { "inputFields": [
{"namespace": "warehouse", "name": "raw.customers", "field": "country"}
]}
}
}
}
}
],
"producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/dbt"
}
Step-by-step explanation.
- The mandatory fields eventType, eventTime, run.runId, job.namespace, and job.name define the run identity. Receivers reconcile START + COMPLETE pairs by runId.
- inputs[] and outputs[] declare the lineage edge. The two inputs (raw.orders, raw.customers) feed the single output (analytics.fct_orders). Marquez and OpenMetadata draw this as two arrows into one node.
- The sqlFacet on the job carries the compiled SQL. Catalog UIs render it as a clickable code block; impact-analysis tools can SQL-parse it to derive column lineage when the emitter does not provide it natively.
- The schemaFacet on each dataset lists columns and types. Catalog UIs render it as the table's schema panel at the moment of the run.
- The columnLineageFacet on the output dataset is the high-value payload: it maps output.order_id to inputs.raw.orders.order_id, output.amount to inputs.raw.orders.amount, and output.country to inputs.raw.customers.country. Downstream "if I drop country from raw.customers, what breaks?" queries traverse this map.
Output (UI features powered by this event).
| UI feature | Field used |
|---|---|
| Run timeline | run.runId + START / COMPLETE pair |
| Lineage graph | inputs[], outputs[] |
| Schema panel | dataset schemaFacet |
| Compiled SQL viewer | job sqlFacet.query |
| Column-level lineage view | output columnLineageFacet.fields |
| Producer attribution | top-level producer |
Rule of thumb. When integrating a new tool, start with the mandatory fields plus schemaFacet. Add sqlFacet next (cheap to capture). columnLineageFacet last — it is the most valuable but the most work to compute correctly.
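The "what breaks if I drop this column?" question from the walkthrough can be sketched as a lookup over the column-lineage facet. The function downstream_columns below is a hypothetical helper, and it follows a single hop only; a real catalog would walk the graph transitively across many events.

```python
def downstream_columns(events, namespace, dataset, column):
    """Return (output_dataset, output_column) pairs derived from one input column."""
    hits = []
    for event in events:
        for out in event.get("outputs", []):
            fields = (out.get("facets", {})
                         .get("columnLineage", {})
                         .get("fields", {}))
            for out_col, spec in fields.items():
                for src in spec.get("inputFields", []):
                    key = (src["namespace"], src["name"], src["field"])
                    if key == (namespace, dataset, column):
                        hits.append((out["name"], out_col))
    return hits


# Trimmed copy of the dbt COMPLETE event from this worked example.
event = {
    "outputs": [{
        "namespace": "warehouse",
        "name": "analytics.fct_orders",
        "facets": {"columnLineage": {"fields": {
            "order_id": {"inputFields": [
                {"namespace": "warehouse", "name": "raw.orders", "field": "order_id"}]},
            "amount": {"inputFields": [
                {"namespace": "warehouse", "name": "raw.orders", "field": "amount"}]},
            "country": {"inputFields": [
                {"namespace": "warehouse", "name": "raw.customers", "field": "country"}]},
        }}},
    }],
}

impacted = downstream_columns([event], "warehouse", "raw.customers", "country")
print(impacted)
```

Note that customer_id, the join key, does not appear in any output column's inputFields in this event, so a lookup on it returns nothing; join and filter dependencies need the SQL facet or a richer lineage payload.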
Worked example — dbt → Airflow → Spark chained run via parent facets
Detailed explanation. Real production lineage usually crosses tools. A scheduled Airflow DAG runs a dbt step that calls a Spark job. Each tool emits its own OL event; the chain is reconstructed via the parentRunFacet. The result is a single hierarchical graph spanning all three tools.
Question. Sketch the three OL events emitted when Airflow's DAG nightly schedules a dbt task dbt_run which kicks off Spark job etl_orders. Show the parent-run references that link the graph.
Input.
| Tool | Run id | Parent run id |
|---|---|---|
| Airflow DAG nightly | a-001 | none |
| dbt task dbt_run | d-001 | a-001 |
| Spark job etl_orders | s-001 | d-001 |
Code.
// 1) Airflow DAG-level event
{
"eventType": "START",
"run": { "runId": "a-001" },
"job": { "namespace": "airflow", "name": "nightly" }
}
// 2) dbt task event, parent = airflow DAG
{
"eventType": "START",
"run": {
"runId": "d-001",
"facets": {
"parent": {
"run": { "runId": "a-001" },
"job": { "namespace": "airflow", "name": "nightly" }
}
}
},
"job": { "namespace": "analytics", "name": "dbt_run" }
}
// 3) Spark job event, parent = dbt task
{
"eventType": "START",
"run": {
"runId": "s-001",
"facets": {
"parent": {
"run": { "runId": "d-001" },
"job": { "namespace": "analytics", "name": "dbt_run" }
}
}
},
"job": { "namespace": "etl", "name": "etl_orders" }
}
Step-by-step explanation.
- Airflow emits the outer event for the DAG run with id a-001. This becomes the top-level node in the lineage graph.
- dbt's emitter knows it was invoked from inside an Airflow task: the integration reads the OPENLINEAGE_PARENT_* environment variables to construct the parentRunFacet, pointing back to a-001.
- Spark's emitter, when invoked from a dbt python model or external task, similarly reads the parent context and constructs a facet pointing back to d-001.
- Receivers reconstruct the tree: a-001 is the root; d-001 is a child of a-001; s-001 is a grandchild via d-001. The Marquez and OpenMetadata UIs render this hierarchically with collapsible sub-runs.
Output (graph structure).
| Level | Run id | Job |
|---|---|---|
| root | a-001 | airflow.nightly |
| child | d-001 | analytics.dbt_run |
| grandchild | s-001 | etl.etl_orders |
Rule of thumb. Always propagate parent run context through environment variables when one tool launches another. Without it, the graph fragments into disconnected islands and the "what triggered this job?" question becomes hard to answer.
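The receiver-side reconstruction can be sketched in a few lines: group runs by the runId found in their parent facet, then walk from the roots. The helpers build_run_tree and walk are illustrative names, not functions from Marquez or OpenMetadata, and the events are trimmed copies of the three shown above.

```python
from collections import defaultdict


def build_run_tree(events):
    """Return (roots, children) keyed by runId, using the parent run facet."""
    children = defaultdict(list)
    roots, seen = [], set()
    for event in events:
        run_id = event["run"]["runId"]
        if run_id in seen:  # START and COMPLETE share a runId; count the run once
            continue
        seen.add(run_id)
        parent = event["run"].get("facets", {}).get("parent")
        if parent:
            children[parent["run"]["runId"]].append(run_id)
        else:
            roots.append(run_id)
    return roots, dict(children)


def walk(roots, children):
    """Depth-first listing of (depth, runId) pairs."""
    out = []
    stack = [(r, 0) for r in reversed(roots)]
    while stack:
        run_id, depth = stack.pop()
        out.append((depth, run_id))
        for child in reversed(children.get(run_id, [])):
            stack.append((child, depth + 1))
    return out


events = [
    {"run": {"runId": "a-001"}},
    {"run": {"runId": "d-001", "facets": {"parent": {"run": {"runId": "a-001"}}}}},
    {"run": {"runId": "s-001", "facets": {"parent": {"run": {"runId": "d-001"}}}}},
]
roots, children = build_run_tree(events)
for depth, run_id in walk(roots, children):
    print("  " * depth + run_id)
```

If the Spark event arrived without its parent facet, s-001 would show up as a second root, which is exactly the "disconnected islands" failure described in the rule of thumb.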
Worked example — emitting a custom facet for compute cost
Detailed explanation. Sometimes the standard facets do not cover a metric your team needs. The OL spec lets you declare a custom facet under your own URI. The receiver either renders it or ignores it — both are safe behaviours.
Question. Define a computeCostFacet that carries CPU seconds and dollar cost for each run, and attach it to a Spark COMPLETE event. Show the facet payload and the receiver's options for displaying it.
Input.
| Field | Value |
|---|---|
| CPU seconds | 124.5 |
| Estimated cost USD | 0.42 |
| Cluster id | spark-cluster-prod-01 |
Code.
{
"eventType": "COMPLETE",
"run": {
"runId": "s-001",
"facets": {
"computeCost_dataeng_example_com": {
"_producer": "https://github.com/dataeng-example/openlineage-cost",
"_schemaURL": "https://dataeng.example.com/openlineage/computeCost.json",
"cpu_seconds": 124.5,
"estimated_cost_usd": 0.42,
"cluster_id": "spark-cluster-prod-01"
}
}
},
"job": { "namespace": "etl", "name": "etl_orders" }
}
Step-by-step explanation.
- The facet key is namespaced (computeCost_dataeng_example_com) so it cannot collide with any future standard facet.
- The mandatory facet fields _producer and _schemaURL let receivers identify the source and validate the payload. Receivers that do not recognise the schema simply ignore the facet, so nothing breaks.
- The payload itself is arbitrary JSON conforming to the schema at _schemaURL. The schema lives in your org's repo and is referenced by URL; receivers can fetch and validate at runtime, or trust the producer.
- Receivers like OpenMetadata render unknown facets either as raw JSON in a "raw facets" panel or, with a custom plugin, as a typed widget. Marquez stores them in its facets table for later querying.
Output (UI surfaces).
| Receiver | Treatment |
|---|---|
| Marquez | persisted in facets table; queryable via REST |
| OpenMetadata | rendered in raw facets panel; surfaced via custom widget if installed |
| Monte Carlo | ignored (does not know the schema) |
| Custom Spark cost dashboard | consumes via Kafka feed; renders as a chart |
Rule of thumb. Use custom facets sparingly. If three teams independently invent the same facet, lobby for it to become a standard. The OpenLineage community has accepted multiple originally-custom facets into the spec over the past two years.
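On the emitter side, attaching the facet is a small dictionary operation. The sketch below is one possible shape, assuming a hypothetical helper with_run_facet that insists on the two mandatory facet fields and refuses to overwrite an existing key; the URLs and values are the ones from the example above.

```python
import copy


def with_run_facet(event, key, producer, schema_url, payload):
    """Return a copy of the event with a custom run facet attached."""
    if not producer or not schema_url:
        raise ValueError("custom facets need both _producer and _schemaURL")
    updated = copy.deepcopy(event)  # leave the caller's event untouched
    facets = updated["run"].setdefault("facets", {})
    if key in facets:
        raise KeyError(f"facet {key!r} already present")
    facets[key] = {"_producer": producer, "_schemaURL": schema_url, **payload}
    return updated


base = {
    "eventType": "COMPLETE",
    "run": {"runId": "s-001"},
    "job": {"namespace": "etl", "name": "etl_orders"},
}
costed = with_run_facet(
    base,
    key="computeCost_dataeng_example_com",
    producer="https://github.com/dataeng-example/openlineage-cost",
    schema_url="https://dataeng.example.com/openlineage/computeCost.json",
    payload={"cpu_seconds": 124.5, "estimated_cost_usd": 0.42,
             "cluster_id": "spark-cluster-prod-01"},
)
```

Working on a copy keeps the original event reusable, for instance when the same COMPLETE event is sent to a receiver that should not see cost data.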
Data engineering interview question on minimum-viable lineage instrumentation
A senior interviewer might frame this as: "You have an existing Airflow + dbt + Spark stack and zero lineage today. Walk me through the smallest first deployment of OpenLineage that delivers useful lineage in two weeks."
Solution Using emitters-first, single-backend rollout
WEEK 1
- Stand up Marquez via docker-compose in staging.
POSTGRES + the marquez-web container.
- Install the OL Airflow plugin on the staging Airflow.
Set OPENLINEAGE_URL=http://marquez:5000.
- Verify lineage events arrive by running one staging DAG.
- Add the dbt OL adapter to the dbt project.
Run `dbt build` against staging; confirm events appear.
WEEK 2
- Add the OL Spark listener to staging Spark cluster config.
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
- Run the most important production DAG once in staging
with realistic data; capture the full lineage graph.
- Promote OL configuration to production for one team's pipelines
with feature flag, monitor Marquez for two days.
- Plan week 3 for OpenMetadata or DataHub as second consumer.
Step-by-step trace.
| Day | Action | Outcome |
|---|---|---|
| 1 | docker-compose up Marquez | running locally |
| 2 | install Airflow OL plugin | events arriving from staging Airflow |
| 3 | install dbt OL adapter | events arriving from staging dbt |
| 4 | run end-to-end staging DAG | lineage graph visible in Marquez UI |
| 5–7 | iterate, fix missing extractors | graph passes peer review |
| 8 | enable Spark listener | Spark jobs join the graph |
| 9 | flag one prod team | prod lineage flowing |
| 10–14 | monitor, fix gaps | stable for one team |
By week three, the team can choose to layer OpenMetadata or DataHub as a second consumer of the same OL events without touching the emitters. The migration cost is configuring the new backend's OL endpoint, not re-instrumenting the pipelines.
Output:
| Milestone | Week |
|---|---|
| Marquez running, Airflow events flowing | 1 |
| dbt events flowing | 1 |
| Spark events flowing | 2 |
| One prod team fully instrumented | 2 |
| Second backend (OpenMetadata) as additional consumer | 3 |
Why this works — concept by concept:
- Marquez first, catalog second — Marquez is the cheapest credible OL backend. Standing it up validates the emitters before you spend weeks on OpenMetadata schemas and connectors.
- Per-tool integration — each emitter (Airflow, dbt, Spark) plugs into its native lifecycle. No code changes to pipelines; the integration owns the event generation.
- Feature flag in prod — emitter overhead is small but real (one HTTP call per task). Roll out by team so any regression is contained.
- Two-week MVP — the metric that matters is "first useful lineage graph visible to humans." Everything beyond that (column lineage, facets, glossary) layers on without touching the foundation.
- Backend swap is cheap — by week three, switching from Marquez to OpenMetadata is "point the OL URL at the new endpoint." This is exactly the portability the standard buys you.
- Cost — staging infra plus ~3 engineer-weeks for the MVP; prod onboarding is incremental per team thereafter.
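For the "verify lineage events arrive" step in week 1, a hand-rolled smoke test can be handy before the integrations are installed. The sketch below builds (but does not send) an HTTP POST for one event. It assumes the backend accepts events at /api/v1/lineage, which is the path Marquez documents; lineage_request is a made-up helper name, and the default base URL simply reuses the OPENLINEAGE_URL value from the plan above.

```python
import json
import os
import urllib.request


def lineage_request(event, base_url=None):
    """Build a POST request carrying one OpenLineage event as JSON."""
    base = base_url or os.environ.get("OPENLINEAGE_URL", "http://marquez:5000")
    return urllib.request.Request(
        url=base.rstrip("/") + "/api/v1/lineage",  # assumed Marquez-style path
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event = {
    "eventType": "START",
    "eventTime": "2026-06-15T01:00:00.000Z",
    "run": {"runId": "a-001"},
    "job": {"namespace": "airflow", "name": "nightly"},
    "producer": "https://example.com/smoke-test",  # hypothetical producer URI
}
req = lineage_request(event, base_url="http://marquez:5000")
print(req.get_method(), req.full_url)

# To actually send it against a running staging backend:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(resp.status)
```

Swapping the backend in week 3 then amounts to changing the base URL (and path, if the new receiver uses a different one), which is the portability argument made above.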
4. OpenMetadata architecture and entity model
OpenMetadata is a catalog application with three layers — Ingestion, Metadata Server, UI — and a unified entity model spanning tables, dashboards, pipelines, topics, and ML models
The mental model in one line: OpenMetadata is "a single catalog DB plus a REST API plus a UI plus a connector framework," and every metadata concern (lineage, governance, quality, glossary, classification) is a first-class entity in that DB. Once you internalise that "everything is an entity," the API surface and the UI both make obvious sense.
Three-layer architecture.
- Ingestion. Connectors run as Airflow DAGs (or as standalone Python apps) and push entity records into the metadata server. The ingestion framework is open and pluggable — adding a connector is writing a Python class that conforms to the source / sink interface.
- Metadata server. The heart. A Java service (Dropwizard) exposing a REST API; backed by Postgres or MySQL for storage and Elasticsearch (or OpenSearch) for search. Defines the entity schemas, the policies, and the lineage graph queries.
- UI. A React app that calls the REST API. Renders entity pages, search, the lineage graph, the glossary, data quality results, and admin pages.
The unified entity model.
OpenMetadata models everything as an entity. The same patterns (versioning, tagging, ownership, lineage) apply across:
- Database, Schema, Table. Tables across Snowflake, BigQuery, Postgres, MySQL, etc., live as Table entities under a DatabaseService → Database → DatabaseSchema → Table hierarchy.
- Pipeline. Airflow DAGs, dbt projects, Dagster pipelines: each becomes a Pipeline entity. Lineage edges connect Pipelines to Tables (read / write).
- Dashboard, Chart. Looker / Tableau / Metabase dashboards become Dashboard entities, with each tile or chart as a Chart sub-entity.
- Topic. Kafka / Pulsar / Kinesis topics become Topic entities with schema and ownership.
- MLModel, Container. ML models (MLflow / SageMaker) become MLModel entities; storage containers (S3 / GCS / Azure) become Container entities.
- Glossary, GlossaryTerm. Business vocabulary lives in Glossary and GlossaryTerm entities, which can be linked as tags on any other entity.
- Tag, Classification. PII tags, data sensitivity classifications, and domain tags all live as Tag entities under Classification parents.
- TestSuite, TestCase, TestCaseResult. Data quality is first-class: TestCase definitions and their run results are entities that the UI renders alongside the table.
Ingestion framework in detail.
- Connectors. One per source (Snowflake, BigQuery, Postgres, MySQL, Trino, Redshift, Tableau, Looker, PowerBI, Kafka, Airflow, dbt, MLflow). Each connector reads from the source via its native API and yields OpenMetadata entity records.
- Workflow types. Metadata (entities + schema), Lineage (edges from query history), Profiler (column stats), Data Quality (test runs), Usage (query history for popularity), dbt (parses manifest.json), Application Settings (admin).
- Scheduling. Workflows run as Airflow DAGs that come pre-bundled with OpenMetadata's ingestion image. Production teams typically point them at their own Airflow.
The metadata server's data model.
- Postgres stores entity rows, versions, and relationships.
- Elasticsearch stores the search index for each entity type plus the autocomplete index.
- REST API at /api/v1/* exposes every entity type. Filtering, search, and lineage queries all live here.
UI features.
- Search. Full-text plus typed filters (entity type, service, tier, owner, tag).
- Lineage graph. Bidirectional graph view with table-level and column-level depth controls.
- Glossary. Hierarchical business vocabulary; terms can be assigned to tables, columns, dashboards.
- Data quality. Test results render inline with each table; failing tests can route to Slack.
- Profiling. Column-level statistics (null %, distinct %, distributions) computed by the Profiler workflow.
- Roles, policies. Fine-grained access — who can read / edit / delete which entity types.
- PII tagging. Auto-classification of columns based on data and naming patterns; manual override via the UI.
How OL events flow into OpenMetadata.
OpenMetadata exposes an OpenLineage endpoint at /api/v1/openlineage. Each arriving event is translated into:
- A Pipeline entity (created if absent, looked up by namespace + name).
- Lineage edges from the listed input datasets to the listed output datasets.
- Column-level lineage edges if the event carries a columnLineageFacet.
- Pipeline status entries reflecting the run's success or failure.
The translation is convention-driven (dataset namespace warehouse maps to the DatabaseService named warehouse_prod, etc.) and configurable via the connector settings.
Self-hosted vs Collate.
- Self-hosted. Docker / Kubernetes Helm chart; you operate Postgres, Elasticsearch, and the metadata server. Cost is infra plus part-time platform-engineering work.
- Collate. Commercial managed offering from the same team. Hosted multi-tenant; eliminates the operational burden in exchange for per-asset pricing similar to other vendors.
Common interview probes on OpenMetadata.
- "What is the difference between a Table and a Topic entity?" — both are dataset-like, but Table maps to a relational warehouse and Topic to an event-stream; lineage edges treat them the same.
- "Where does PII classification come from?" — automatic classifiers run during the metadata or profiler workflow; manual overrides via the UI. Both produce Tag entities attached to the column.
- "How does OpenMetadata handle column-level lineage?" — column edges live as part of the Table entity; the UI renders them as a sub-graph inside the lineage panel.
- "Can OpenMetadata be the source of truth for ownership?" — yes — the ownership field on each entity is canonical and propagates to downstream alerting via webhooks.
Worked example — modeling a Snowflake table with full metadata
Detailed explanation. Walking through the entity payload for one table makes the model concrete. Below is the JSON shape stored for analytics.fct_orders after the connector ingests it.
Question. Build the Table entity for warehouse_prod.ANALYTICS.fct_orders with three columns, an owner team, a Finance domain tag, a PII tag on one column, and a glossary term link. Identify which fields are connector-supplied and which are user-curated.
Input.
| Field | Value | Source |
|---|---|---|
| Name | fct_orders | Snowflake |
| Database | WAREHOUSE_PROD.ANALYTICS | Snowflake |
| Columns | order_id, amount, customer_email | Snowflake |
| Owner | team analytics-eng | Manual |
| Domain tag | Domain.Finance | Manual |
| PII tag on customer_email | PII.Sensitive | Auto-classifier |
| Glossary term | Finance.GMV linked | Manual |
Code.
{
"name": "fct_orders",
"fullyQualifiedName": "warehouse_prod.ANALYTICS.fct_orders",
"service": "warehouse_prod",
"database": "WAREHOUSE_PROD",
"databaseSchema": "ANALYTICS",
"columns": [
{"name": "order_id", "dataType": "BIGINT"},
{"name": "amount", "dataType": "NUMERIC"},
{"name": "customer_email",
"dataType": "STRING",
"tags": [{"tagFQN": "PII.Sensitive", "labelType": "Automated"}]}
],
"owner": {"id": "team-analytics-eng", "type": "team"},
"tags": [{"tagFQN": "Domain.Finance", "labelType": "Manual"}],
"glossaryTerms": [{"id": "gt-finance-gmv", "type": "glossaryTerm"}]
}
Step-by-step explanation.
- The Snowflake connector populates name, fullyQualifiedName, service, database, databaseSchema, and the column list with names and types. Connector-supplied fields are versioned and re-synced on every ingestion run.
- The auto-classifier (part of the metadata workflow) inspects column names and sample data. It tags customer_email with PII.Sensitive and records labelType: Automated so reviewers can distinguish auto from manual labels.
- A platform admin (or a steward in the UI) assigns the team owner. Owner propagates to all downstream alerts: failing tests, freshness violations, and OpenLineage FAIL events route to the team.
- The domain tag Domain.Finance and the glossary term link are manual. They make the table discoverable via filtered search ("show me every Finance table") and tie business vocabulary to physical assets.
Output (rendered Table entity page).
| Panel | Content |
|---|---|
| Schema | order_id BIGINT, amount NUMERIC, customer_email STRING (PII) |
| Owner | analytics-eng |
| Tags | Domain.Finance, PII.Sensitive (on column) |
| Glossary | Finance.GMV |
| Lineage | upstream from dbt.fct_orders, downstream to BI |
| Quality | last 3 test runs and freshness metric |
Rule of thumb. Let the connector own everything mechanical (names, types, sizes, freshness timestamps); let humans own everything contextual (owner, domain, glossary). Auto-classifiers sit in the middle — let them propose, let stewards approve.
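The "let them propose, let stewards approve" split can be illustrated with a toy name-based classifier. Everything in this sketch is an assumption made for illustration: the regular expression, the function name propose_pii_tags, and the rule that an existing PII.* label is never overridden. OpenMetadata's real auto-classifier also looks at sample data and is configured differently.

```python
import re

# Illustrative pattern only; a real classifier would use richer signals.
PII_NAME_PATTERN = re.compile(r"(email|phone|ssn|address)", re.IGNORECASE)


def propose_pii_tags(columns):
    """Return new column dicts with automated PII proposals added."""
    proposed = []
    for col in columns:
        col = dict(col)  # shallow copy so the input list is not mutated
        tags = list(col.get("tags", []))
        already_labelled = any(t["tagFQN"].startswith("PII.") for t in tags)
        if not already_labelled and PII_NAME_PATTERN.search(col["name"]):
            tags.append({"tagFQN": "PII.Sensitive", "labelType": "Automated"})
        if tags:
            col["tags"] = tags
        proposed.append(col)
    return proposed


columns = [
    {"name": "order_id", "dataType": "BIGINT"},
    {"name": "amount", "dataType": "NUMERIC"},
    {"name": "customer_email", "dataType": "STRING"},
]
result = propose_pii_tags(columns)
```

Recording labelType as Automated is what lets a steward filter for proposals awaiting review, as described in the walkthrough above.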
Worked example — converting an OL event into an OpenMetadata Pipeline + lineage
Detailed explanation. When an OpenLineage event arrives at OpenMetadata's /api/v1/openlineage endpoint, the server converts it into one Pipeline entity plus lineage edges. Walking the conversion makes the integration tangible.
Question. Trace the conversion for the dbt event from Section 3 (analytics.fct_orders reading raw.orders and raw.customers). Identify which entities are created and which edges are upserted.
Input.
| OL field | Becomes in OM |
|---|---|
| job.namespace + job.name | Pipeline FQN |
| inputs[] | source nodes for edges |
| outputs[] | target nodes for edges |
| columnLineageFacet | column-level edges |
| run.runId + eventType | PipelineStatus entry |
Code.
Incoming OL event
job = analytics.fct_orders
inputs = [warehouse.raw.orders, warehouse.raw.customers]
outputs = [warehouse.analytics.fct_orders]
columnLineage = {order_id: [raw.orders.order_id], ...}
Conversion
ensure Pipeline entity "analytics.fct_orders" exists
ensure Table "warehouse_prod.raw.orders" referenced
ensure Table "warehouse_prod.raw.customers" referenced
ensure Table "warehouse_prod.analytics.fct_orders" referenced
upsert lineage edge: raw.orders -> analytics.fct_orders
upsert lineage edge: raw.customers -> analytics.fct_orders
upsert column-level edge: raw.orders.order_id -> analytics.fct_orders.order_id
upsert column-level edge: raw.orders.amount -> analytics.fct_orders.amount
upsert column-level edge: raw.customers.country -> analytics.fct_orders.country
append PipelineStatus: runId=c8b3-2026-06-15-01, state=Successful
Step-by-step explanation.
- The server looks up the Pipeline by (service, namespace, name). If absent, it is created with the OL producer as the service hint. Subsequent events update the same Pipeline rather than create duplicates.
- The input and output datasets are mapped to Table entities by FQN convention. Datasets that do not yet exist (because the table connector has not run) are created as placeholder Tables and enriched later when the connector pass arrives.
- The lineage edges are upserted. Re-running the same event is idempotent, so no duplicate edges appear. This is critical: every COMPLETE event in production carries the same edges, and the storage must collapse them.
- The column lineage facet drives the column-level edges. The UI renders them as a sub-graph inside the table-level edge; users toggle "column lineage" to drill in.
- The PipelineStatus entry records the run's outcome with timestamps. The Pipeline page displays a run history; failing runs annotate the connected tables with "last run failed."
Output.
| Entity | Action |
|---|---|
| Pipeline analytics.fct_orders | upserted |
| Table raw.orders | placeholder upserted |
| Table raw.customers | placeholder upserted |
| Table analytics.fct_orders | placeholder upserted |
| Lineage edge (table) | 2 upserted |
| Lineage edge (column) | 3 upserted |
| PipelineStatus | 1 appended |
Rule of thumb. Run the database connectors before expecting OL ingestion to fill the catalog. The connectors give you the entity inventory; OL gives you the lineage edges. Run them in the right order and your catalog is complete on day one.
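The conversion traced in this example can be modelled with an in-memory store to show why replaying an event is harmless. This is a simplified sketch, not OpenMetadata's server code: apply_event, table_fqn, and the SERVICE_BY_NAMESPACE mapping are assumptions that follow the naming convention described in the text (namespace warehouse maps to service warehouse_prod).

```python
SERVICE_BY_NAMESPACE = {"warehouse": "warehouse_prod"}  # assumed convention


def table_fqn(dataset):
    """Map an OL dataset (namespace, name) to a catalog-style table FQN."""
    service = SERVICE_BY_NAMESPACE.get(dataset["namespace"], dataset["namespace"])
    return f"{service}.{dataset['name']}"


def apply_event(store, event):
    """Upsert pipeline, table edges, and column edges; append one status row."""
    pipeline = f"{event['job']['namespace']}.{event['job']['name']}"
    store["pipelines"].add(pipeline)
    for dst in event.get("outputs", []):
        for src in event.get("inputs", []):
            # Sets make the upsert idempotent: the same edge is stored once.
            store["table_edges"].add((table_fqn(src), table_fqn(dst), pipeline))
        fields = dst.get("facets", {}).get("columnLineage", {}).get("fields", {})
        for out_col, spec in fields.items():
            for inp in spec.get("inputFields", []):
                store["column_edges"].add(
                    (f"{table_fqn(inp)}.{inp['field']}", f"{table_fqn(dst)}.{out_col}")
                )
    store["statuses"].append((event["run"]["runId"], event["eventType"]))


def col(ds, field):
    return {"namespace": "warehouse", "name": ds, "field": field}


event = {
    "eventType": "COMPLETE",
    "run": {"runId": "c8b3-2026-06-15-01"},
    "job": {"namespace": "analytics", "name": "fct_orders"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"},
               {"namespace": "warehouse", "name": "raw.customers"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.fct_orders",
                 "facets": {"columnLineage": {"fields": {
                     "order_id": {"inputFields": [col("raw.orders", "order_id")]},
                     "amount": {"inputFields": [col("raw.orders", "amount")]},
                     "country": {"inputFields": [col("raw.customers", "country")]},
                 }}}}],
}

store = {"pipelines": set(), "table_edges": set(), "column_edges": set(), "statuses": []}
apply_event(store, event)
apply_event(store, event)  # replaying the same event adds no new edges
```

After both calls the store holds two table-level edges and three column-level edges, matching the output table above; only the status history grows with each delivery.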
Worked example — wiring data quality results into the catalog
Detailed explanation. OpenMetadata's TestSuite and TestCase entities make data quality first-class — every table can carry a list of tests, each test has a definition (e.g. "row count > 0"), and each test run produces a TestCaseResult that the UI surfaces inline. The same model accepts results from external tools via REST.
Question. Define a TestSuite for analytics.fct_orders with three tests (row count, distinct customer count, freshness), and show how a test runner posts results to the catalog.
Input.
| Test | Expectation |
|---|---|
| row_count_min | rows > 0 |
| distinct_customers_min | unique customer_id > 100 |
| freshness | data updated within 24h |
Code.
// 1) Create the TestSuite
POST /api/v1/dataQuality/testSuites
{
"name": "fct_orders_quality",
"entity": {"id": "table-id-fct-orders", "type": "table"}
}
// 2) Define a TestCase
POST /api/v1/dataQuality/testCases
{
"name": "row_count_min",
"entityLink": "<#E::table::warehouse_prod.ANALYTICS.fct_orders>",
"testDefinition": "tableRowCountToBeBetween",
"parameterValues": [
{"name": "minValue", "value": "1"}
],
"testSuite": "fct_orders_quality"
}
// 3) After running the test, post the result
POST /api/v1/dataQuality/testCases/testResults
{
"testCaseFQN": "warehouse_prod.ANALYTICS.fct_orders.row_count_min",
"result": "Success",
"timestamp": 1718492400000,
"testResultValue": [{"name": "rowCount", "value": "248913"}]
}
Step-by-step explanation.
- The TestSuite is the container for tests on one entity. Each table can have one TestSuite that aggregates its tests; failing tests on any case roll up to a suite-level health indicator.
- The TestCase definition references a testDefinition (a built-in or custom test type) plus parameters. The platform ships a library of definitions like tableRowCountToBeBetween, columnValuesToBeUnique, tableFreshnessSLA, plus a custom SQL test.
- The result is posted by whoever runs the test: OpenMetadata's own profiler workflow, an external dbt test run, a Great Expectations run, or a custom script. The same REST API accepts results from any source.
- The UI surfaces the latest result inline on the table page, with a colour-coded badge (green / amber / red). Failing tests can trigger webhooks to Slack or PagerDuty via OpenMetadata's alerting system.
Output (table page UI).
| Test | Last result | Last run |
|---|---|---|
| row_count_min | Success | 2026-06-15 03:00 |
| distinct_customers_min | Success | 2026-06-15 03:01 |
| freshness | Failed | 2026-06-15 03:02 |
Rule of thumb. Treat the test results as another lineage signal — a failing freshness test on a source table is exactly the information a downstream consumer needs before reading. Surface them inline in the lineage graph, not on a separate dashboard.
Data engineering interview question on adopting OpenMetadata across a 50-team org
A senior interviewer might frame this as: "You have OpenMetadata running for one team. How do you scale it to 50 teams without it becoming a dumping ground of stale entities?"
Solution Using domain-scoped ingestion + steward ownership
SCALING PLAN
1. Domain-scope the catalog
- Each business domain (Finance, Marketing, Product, Platform)
gets its own DatabaseService prefix and Glossary scope.
- Tags use Domain.* hierarchy so search is domain-filterable.
2. Steward per domain
- Every domain nominates a data steward.
- Stewards own glossary terms, tag policies, and PII reviews
for assets in their domain.
3. Connector cadence by tier
- Tier-1 assets (production warehouse, dashboards): hourly
- Tier-2 (staging, lab): daily
- Tier-3 (sandboxes): weekly or on-demand
- Tier classification is itself a Tag entity.
4. Lineage from OL is continuous
- Airflow + dbt + Spark + Flink emit OL events.
- Per-team OL endpoints converge in one OM instance.
5. Quality tests gated by tier
- Tier-1 tables MUST have row_count + freshness + uniqueness
- Tier-2 SHOULD have at least one custom test
- Tier-3 OPTIONAL.
6. PII review SLA
- Auto-classifier proposes; steward approves within 14 days.
- Unreviewed PII tags flagged on the steward dashboard.
7. Stale asset reaping
- Assets without ingestion for 30 days auto-archived
unless explicitly pinned.
Step-by-step trace.
| Step | Owner | Cadence | Output |
|---|---|---|---|
| 1 domain scope | platform | once | DatabaseService + Glossary roots |
| 2 stewards | data leadership | once + on join | named steward per domain |
| 3 connector cadence | platform + team | continuous | per-tier ingestion DAGs |
| 4 OL emitters | each team | continuous | runtime lineage |
| 5 tier-gated tests | each team | per release | failing tests block deploy |
| 6 PII review | steward | 14-day SLA | tags approved or rejected |
| 7 archive | platform | weekly | clean catalog |
The result is a catalog where every entity has a known owner, a known tier, and a known refresh expectation. Search returns relevant assets first because tier and domain are filterable.
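Steps 3 and 7 of the plan reduce to two date comparisons. The sketch below is one way to express them, assuming each asset record carries a tier, a last-ingestion timestamp, and a pin flag; the `Asset` class and the sample assets are hypothetical and not part of any OpenMetadata client.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Refresh expectation per tier, taken from step 3 of the plan.
CADENCE = {1: timedelta(hours=1), 2: timedelta(days=1), 3: timedelta(weeks=1)}
ARCHIVE_AFTER = timedelta(days=30)  # step 7

@dataclass
class Asset:
    fqn: str
    tier: int
    last_ingested: datetime
    pinned: bool = False

def overdue(asset: Asset, now: datetime) -> bool:
    """True when the asset has missed its tier's refresh cadence."""
    return now - asset.last_ingested > CADENCE[asset.tier]

def to_archive(assets, now: datetime) -> list:
    """Assets with no ingestion inside the archive window, unless pinned."""
    return [a.fqn for a in assets
            if not a.pinned and now - a.last_ingested > ARCHIVE_AFTER]

now = datetime(2026, 6, 15, tzinfo=timezone.utc)
assets = [
    Asset("finance.fct_orders", 1, now - timedelta(minutes=20)),
    Asset("lab.stg_events", 2, now - timedelta(days=3)),
    Asset("sandbox.tmp_export", 3, now - timedelta(days=45)),
    Asset("sandbox.audit_snapshot", 3, now - timedelta(days=90), pinned=True),
]
print([a.fqn for a in assets if overdue(a, now)])
print(to_archive(assets, now))
```

An asset can be overdue without being archivable: overdue feeds the steward dashboard, while archiving only applies after the full 30-day window and never to pinned assets.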
Output:
| Health metric | Target |
|---|---|
| Tier-1 coverage by tests | 100% |
| Domain assignment completeness | > 95% |
| Stale entities (no refresh in 30d) | < 2% |
| PII auto-tags unreviewed > 14 days | 0 |
| OL events per minute (steady state) | proportional to pipeline count |
Why this works — concept by concept:
- Domain scoping: the Domain.* tag hierarchy gives the catalog a top-down structure that mirrors how the org thinks about data, and lets stewards own their slice without blocking each other.
- Steward ownership: putting humans at the leaf of every policy decision (glossary, PII, classification) is the only way a catalog survives at scale. Auto-classification proposes; humans dispose.
- Tier-driven cadence: not every asset deserves hourly metadata. Tiering keeps the ingestion pipeline cheap and the catalog signal-to-noise high.
- Continuous OL ingestion: runtime lineage is the always-fresh part of the graph; static connectors fill in the shape; together they keep the catalog accurate.
- Stale-asset reaping: a catalog that grows monotonically becomes useless. Archive policies keep search focused on assets that still matter.
- Cost: connectors scale O(assets); OL events scale O(pipeline runs). Size Postgres and Elasticsearch to those rates, and budget a fraction of an FTE per few hundred TB of source metadata.
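The tier-gated test rule from step 5 of the plan can be checked mechanically before a deploy. The following is a rough sketch under assumed inputs: the `tables` mapping and the test-kind labels (`row_count`, `freshness`, `uniqueness`, `custom`) are hypothetical names for illustration, not OpenMetadata identifiers.

```python
# Tier-1 tables MUST carry these test kinds; Tier-2 SHOULD have a custom test.
REQUIRED = {1: {"row_count", "freshness", "uniqueness"}}

def gate(tables: dict):
    """tables maps fqn -> (tier, set of test kinds).

    Returns (blocking, warnings): blocking lists missing mandatory tests per
    table, warnings lists Tier-2 tables without any custom test.
    """
    blocking, warnings = {}, []
    for fqn, (tier, tests) in sorted(tables.items()):
        missing = REQUIRED.get(tier, set()) - tests
        if missing:
            blocking[fqn] = sorted(missing)
        elif tier == 2 and "custom" not in tests:
            warnings.append(fqn)
    return blocking, warnings

tables = {
    "analytics.fct_orders": (1, {"row_count", "freshness"}),
    "staging.stg_orders": (2, set()),
    "sandbox.tmp": (3, set()),
}
blocking, warnings = gate(tables)
print(blocking)   # mandatory gaps that should fail the release
print(warnings)   # advisory gaps for the steward to chase
```

Separating blocking from advisory results mirrors the MUST / SHOULD / OPTIONAL wording in the plan: only Tier-1 gaps stop a release.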
5. Interop with proprietary vendors and migration patterns
OpenLineage is the migration off-ramp from Atlan, Collibra, Alation, and Monte Carlo — emit once, route to whichever backend wins this quarter, and use the two-write pattern to stage the cutover
The mental model in one line: as long as OpenLineage events leave your pipelines, the choice of backend is a configuration change, not an architecture change. Once your team can quote that invariant, the conversation with the closed-catalog vendor on renewal day becomes very different — and the migration plan can be incremental rather than Big Bang.
Where vendors plug into the OpenLineage event stream.
- Monte Carlo. Accepts OL events as a lineage input. Layers freshness, volume, and schema-change anomaly detection on top of the same graph your open backend sees.
- Atlan. Has a documented OL adapter; ingests events into the Atlan graph and renders them inline with vendor-curated metadata.
- Bigeye. Similar to Monte Carlo — OL events feed the observability layer.
- Collibra. Accepts OL events for technical lineage; business-glossary side stays inside Collibra's model. Most teams keep Collibra for governance and use OL to keep its lineage panel current.
- Alation. Accepts OL through a plugin; the business catalog stays vendor-owned while runtime lineage is single-sourced from OL.
Emit OpenLineage from Airflow / dbt / Spark and forward to vendor X.
The integration pattern is identical regardless of which vendor receives:
emitter (Airflow / dbt / Spark)
|
v
OPENLINEAGE_URL = http(s)://vendor-endpoint/openlineage
|
v
vendor receiver ingests, renders, alerts
The emitter does not know it is talking to a vendor. The vendor does not know it is reading a community-format event. The standard makes both sides plug-and-play.
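To make the vendor-neutral payload concrete, here is a minimal sketch of an OpenLineage run event built as a plain dictionary. The top-level fields follow the OpenLineage RunEvent schema; the producer URI, namespaces, dataset names, and the pinned spec version in `schemaURL` are assumptions for illustration, so use the values your client library ships with.

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a minimal OpenLineage run event (no facets)."""
    return {
        "eventType": event_type,                        # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/pipelines/orders",  # hypothetical URI
        # Assumed spec version; match whatever your client emits.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "analytics", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }

evt = run_event(
    "COMPLETE",
    "dbt.fct_orders",
    inputs=[("snowflake://acme", "raw.orders")],
    outputs=[("snowflake://acme", "analytics.fct_orders")],
    run_id="0190d3a1-7c3e-7b1a-9f2e-5a6b7c8d9e0f",
)
print(json.dumps(evt, indent=2))
```

Nothing in the body names the receiver. The same JSON can be POSTed to Marquez, OpenMetadata, or a vendor endpoint, which is why the choice of backend stays a configuration concern.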
Multi-cast to two or more receivers.
When you want both an open backend and a vendor receiver during a migration, configure multi-cast:
- Newer OL integrations accept a comma-separated OPENLINEAGE_URL.
- Older integrations require a small proxy: a single FastAPI service that POSTs each event to N configured URLs.
- Kafka transport turns multi-cast into "multiple consumer groups on one topic."
This is the two-write pattern: events flow to the old backend and the new one for the duration of the migration, so the new backend builds historical context before you turn the old one off.
Replace a closed catalog with OpenMetadata gradually.
A 90-day migration timeline that has worked for multiple platform teams:
- Day 1–14. Stand up OpenMetadata in staging. Run connectors against the same sources the old catalog covers. Verify entity completeness against the old catalog's asset list.
- Day 15–30. Enable OpenLineage emitters in production with multi-cast: events flow to both the old vendor and to OpenMetadata. Both catalogs now show identical runtime lineage.
- Day 31–60. Migrate business metadata (glossary, ownership, tags) into OpenMetadata. Most vendors have an export API or a CSV bulk download; the import can be scripted via OpenMetadata's REST API.
- Day 61–80. Switch primary user UI to OpenMetadata. Old vendor stays read-only as a fallback.
- Day 81–90. Decommission the old vendor. The OL multi-cast configuration drops the vendor endpoint. The renewal is not signed.
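The "verify entity completeness" check in the first phase is a set comparison. A minimal sketch, assuming both catalogs can export a flat list of fully qualified asset names (the sample names below are hypothetical):

```python
def completeness(old_assets, new_assets):
    """Compare the old catalog's asset list with the new catalog's.

    Returns (coverage, missing, extra), where coverage is the share of old
    assets that already exist in the new catalog. Names are compared
    case-insensitively because warehouses differ in identifier casing.
    """
    old = {a.lower() for a in old_assets}
    new = {a.lower() for a in new_assets}
    coverage = 1.0 if not old else len(old & new) / len(old)
    return coverage, sorted(old - new), sorted(new - old)

old_catalog = ["PROD.ANALYTICS.FCT_ORDERS", "PROD.ANALYTICS.DIM_CUSTOMER",
               "PROD.RAW.ORDERS", "PROD.RAW.PAYMENTS"]
new_catalog = ["prod.analytics.fct_orders", "prod.analytics.dim_customer",
               "prod.raw.orders", "prod.raw.refunds"]
coverage, missing, extra = completeness(old_catalog, new_catalog)
print(f"coverage={coverage:.0%} missing={missing} extra={extra}")
```

The `missing` list is the work queue for connector fixes; `extra` usually points at assets the old catalog never covered and is worth a quick look rather than a fix.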
DataHub vs OpenMetadata — when to pick which.
Both are credible open catalogs with active communities and similar feature surface. The choice usually comes down to ecosystem fit.
- Pick OpenMetadata when: you want a broader out-of-the-box connector library, tighter integration with OpenLineage as a native ingest path, a more polished UI for end-users, or a managed offering (Collate) on the same stack.
- Pick DataHub when: you want an event-native architecture under the hood (metadata changes flow through Kafka as Metadata Change Proposal / Metadata Change Log events, the successors to the older Metadata Change Event / Metadata Audit Event model), strong propagation of metadata changes to downstream services, or your existing stack already has heavy Kafka investment.
- Either way: OL events flow into both. The wire-format standard means you can change your mind later without re-instrumenting pipelines.
Governance integrations.
- Glossary and business terms. OpenMetadata models Glossary and GlossaryTerm as entities; terms can be linked to tables, columns, dashboards. DataHub uses the GlossaryNode / GlossaryTerm model. Both let you bulk-import terms from a CSV or an external governance tool.
- Data classification. Both support hierarchical tags (PII.Sensitive, PII.Email, Finance.Revenue). OpenMetadata's auto-classifier proposes tags; admins approve. DataHub uses Glossary Terms similarly.
- Access policies. Role + Policy model in both: a Policy lists allowed actions on entity types matched by a rule. Roles bundle policies. Users / Teams are assigned roles.
- Compliance reporting. Glossary + Tag + Classification combine into a queryable matrix: "show every column tagged PII that touches a Finance domain dashboard." Both catalogs support this via search filters; OpenMetadata also exposes the query as a REST call.
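The compliance matrix in the last bullet is a join across tags, lineage, and domain assignment. The sketch below shows the logic over in-memory dictionaries; every name in it is hypothetical, and a real catalog would answer the same question through its search or REST API rather than local data.

```python
# Hypothetical catalog extract: tags per column, dashboards each column feeds
# (from lineage), and the domain each dashboard belongs to.
column_tags = {
    "fct_orders.customer_email": {"PII.Email"},
    "fct_orders.order_total": {"Finance.Revenue"},
    "dim_customer.phone": {"PII.Sensitive"},
}
column_feeds = {
    "fct_orders.customer_email": {"finance_revenue_dash"},
    "fct_orders.order_total": {"finance_revenue_dash"},
    "dim_customer.phone": {"marketing_funnel_dash"},
}
dashboard_domain = {
    "finance_revenue_dash": "Finance",
    "marketing_funnel_dash": "Marketing",
}

def tagged_columns_in_domain(tag_prefix: str, domain: str) -> list:
    """Columns carrying a tag under tag_prefix that reach a dashboard in domain."""
    hits = []
    for column, tags in column_tags.items():
        has_tag = any(t == tag_prefix or t.startswith(tag_prefix + ".") for t in tags)
        if not has_tag:
            continue
        dashboards = column_feeds.get(column, ())
        if any(dashboard_domain.get(d) == domain for d in dashboards):
            hits.append(column)
    return sorted(hits)

print(tagged_columns_in_domain("PII", "Finance"))
```

Matching on the tag prefix is what makes hierarchical tags useful here: one query for `PII` covers `PII.Email` and `PII.Sensitive` without listing every leaf.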
Cost picture — self-hosted vs vendor.
| Stack | Year 1 | Year 3 | Notes |
|---|---|---|---|
| Closed vendor at $0.50/asset/mo, 10K assets | $60,000 | ~$225K cumulative | Grows with asset count |
| OpenMetadata self-hosted (4 vCPU, 16GB, 200GB DB) | ~$25K infra + 0.25 FTE | ~$100K cumulative | Flat-ish; FTE is bulk |
| Collate managed (similar to vendor) | $0.40/asset/mo | similar to vendor | Less ops overhead |
| Vendor receiver (Monte Carlo / Bigeye) — additive on top of any catalog | $20–60K/year typical | similar | Pays only for the observability layer, not the catalog |
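The closed-vendor row can be reproduced with a few lines of arithmetic. In this sketch the 23% annual asset growth is an assumption, chosen only because it lands near the table's ~$225K three-year figure; the source does not state the growth rate it used.

```python
def vendor_cumulative(price_per_asset_month, assets, years, annual_growth):
    """Cumulative per-asset licence cost, with the asset count growing yearly."""
    total = 0.0
    for _ in range(years):
        total += price_per_asset_month * assets * 12
        assets *= 1 + annual_growth
    return total

year1 = vendor_cumulative(0.50, 10_000, years=1, annual_growth=0.0)
flat3 = vendor_cumulative(0.50, 10_000, years=3, annual_growth=0.0)
grow3 = vendor_cumulative(0.50, 10_000, years=3, annual_growth=0.23)  # assumed rate

print(f"year 1:            ${year1:,.0f}")
print(f"3 years, flat:     ${flat3:,.0f}")
print(f"3 years, +23%/yr:  ${grow3:,.0f}")
```

With no growth the three-year total is three times year one; the gap between that and the grown figure is the part of the bill driven purely by asset count.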
Long-term bets.
- The OL spec is converging on column-level lineage as the default. Within two years, "OL without column lineage" will be considered a half-instrumented stack.
- Vendor receivers are becoming OL-first. New observability tools launch with OL ingestion as the recommended path, not as an afterthought.
- OpenMetadata and DataHub will likely both survive. They serve different architectural tastes; neither is going away.
- Marquez stays the reference backend. Useful as a sanity check during migrations and as a lightweight first deployment.
Common interview probes on interop and migration.
- "Can I send OpenLineage to Monte Carlo?" Yes. Configure OPENLINEAGE_URL to Monte Carlo's OL endpoint, or multi-cast.
- "What is the two-write pattern?" Emit events to both the old and new backend during migration; cut over when the new backend has parity.
- "How do I migrate business metadata (glossary, owners) into OpenMetadata?" Export from the old vendor (REST or CSV), import via OpenMetadata's REST API. Scriptable in a day for most orgs.
- "Is column lineage automatic?" Only when the emitter produces the columnLineageFacet. dbt and Spark do; Airflow does for the operators that have extractors; custom Python is on you.
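For the column-lineage probe it helps to have the facet shape in mind. The sketch below builds a column lineage facet as a plain dictionary, using the `fields` / `inputFields` layout from the OpenLineage spec, and answers the impact-analysis question over it. The dataset and column names are hypothetical, and the `_producer` / `_schemaURL` keys a real facet carries are left out for brevity.

```python
# Column lineage facet for one output dataset: each output column lists the
# input columns it was derived from.
column_lineage = {
    "fields": {
        "customer_id": {
            "inputFields": [
                {"namespace": "snowflake://acme", "name": "raw.orders", "field": "customer_id"},
            ]
        },
        "customer_ltv": {
            "inputFields": [
                {"namespace": "snowflake://acme", "name": "raw.orders", "field": "customer_id"},
                {"namespace": "snowflake://acme", "name": "raw.orders", "field": "amount"},
            ]
        },
    }
}

def impacted_outputs(facet: dict, dataset: str, column: str) -> list:
    """Output columns that depend on dataset.column, per the facet."""
    return sorted(
        out_col
        for out_col, spec in facet["fields"].items()
        if any(src["name"] == dataset and src["field"] == column
               for src in spec["inputFields"])
    )

print(impacted_outputs(column_lineage, "raw.orders", "amount"))
print(impacted_outputs(column_lineage, "raw.orders", "customer_id"))
```

A catalog runs the same lookup transitively across jobs, which is how "if I drop this column, what breaks?" gets answered at dashboard level.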
Worked example — the two-write pattern in configuration
Detailed explanation. Two-write is the safest migration shape: send every event to both backends, verify parity, then drop the old one. The configuration cost is tiny; the safety it buys is real.
Question. Configure a dbt project to emit OpenLineage events to both Atlan (the old catalog) and OpenMetadata (the new catalog) during a 60-day migration. Show the env vars or proxy required.
Input.
| Endpoint | Role |
|---|---|
| Atlan | old catalog, read-only by day 60 |
| OpenMetadata | new catalog, gaining context |
Code.
# Option A — multi-URL (newer OL integrations)
export OPENLINEAGE_URL="https://atlan.example.com/openlineage,https://openmetadata.example.com/api/v1/openlineage"
export OPENLINEAGE_API_KEY_ATLAN="..."
export OPENLINEAGE_API_KEY_OM="..."
# Option B — fan-out proxy (older OL integrations)
# proxy posts every incoming event to both URLs
export OPENLINEAGE_URL="http://ol-proxy.internal:5000"
# Minimal fan-out proxy (FastAPI), Option B
import asyncio

import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# Every incoming event is forwarded to each of these receivers.
TARGETS = [
    "https://atlan.example.com/openlineage",
    "https://openmetadata.example.com/api/v1/openlineage",
]

@app.post("/")
async def fanout(request: Request):
    body = await request.body()
    headers = {"Content-Type": "application/json"}
    async with httpx.AsyncClient(timeout=5.0) as client:
        # POST to all targets concurrently. return_exceptions=True swallows
        # per-target failures so one slow or down receiver never blocks the producer.
        await asyncio.gather(
            *(client.post(url, content=body, headers=headers) for url in TARGETS),
            return_exceptions=True,
        )
    return {"status": "ok"}
Step-by-step explanation.
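- Option A is the cleanest path when the OL integration can deliver each event to more than one target natively. Support depends on the integration and its version (the figures quoted here are Airflow OL >= 1.18, dbt OL >= 1.16, and Spark OL >= 1.20 with the OpenLineageClient transports config), so confirm against the release notes for the version you run before relying on it. Recent OpenLineage clients also document a composite transport that wraps several transports in one client configuration. Each configured target receives every event.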
- Option B works with any integration. The proxy is a single ~20-line FastAPI service. It POSTs each event to every configured target, swallowing per-target failures so the producer never blocks.
- The producer's view never changes during the migration. Pipelines do not know they are now talking to two backends; they POST once to the OL URL.
- On migration day 60, drop one URL from the list (Option A) or remove one TARGET entry (Option B). No code change anywhere else.
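The failure-isolation behaviour in the steps above is easy to unit-test if the delivery function is injected. This is a simplified, synchronous sketch rather than the proxy itself: `fan_out`, `fake_send`, and the URLs are hypothetical, and the fake sender stands in for an HTTP client so the example runs without a network.

```python
def fan_out(body: bytes, targets, send) -> dict:
    """Deliver body to every target; record failures instead of raising."""
    results = {}
    for url in targets:
        try:
            send(url, body)
            results[url] = "ok"
        except Exception as exc:  # one bad receiver must not affect the others
            results[url] = f"failed: {exc}"
    return results

def fake_send(url: str, body: bytes) -> None:
    """Stand-in for an HTTP POST: the old catalog is 'down' in this example."""
    if "atlan" in url:
        raise ConnectionError("receiver unavailable")

targets = [
    "https://atlan.example.com/openlineage",
    "https://openmetadata.example.com/api/v1/openlineage",
]
during = fan_out(b'{"eventType": "COMPLETE"}', targets, fake_send)
print(during)

# Day 60: dropping the old catalog is a one-line change to the target list.
after = fan_out(b'{"eventType": "COMPLETE"}', targets[1:], fake_send)
print(after)
```

Returning a per-target status, rather than silently discarding errors, gives the proxy something to log or export as a metric, which is what the parity check during the two-write window relies on.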
Output (during the migration window).
| Backend | Events received | UI status |
|---|---|---|
| Atlan | 100% | primary (days 0–45), read-only (days 46–60) |
| OpenMetadata | 100% | secondary (days 0–45), primary (days 46–60) |
Rule of thumb. Run the two-write window for at least 30 days. The new backend needs a meaningful history before you trust it as the primary UI.
Worked example — migrating glossary terms from Collibra to OpenMetadata
Detailed explanation. Business metadata does not flow over OpenLineage — it lives in the catalog itself. Migrating it is an export + transform + import job. OpenMetadata's REST API makes the import scriptable.
Question. Migrate 500 Collibra business terms (each with a name, description, and domain) into OpenMetadata as GlossaryTerm entities under a Finance glossary. Show the script outline.
Input.
| Field | Example |
|---|---|
| name | Gross Merchandise Value |
| description | Total value of goods sold over a period. |
| domain | Finance |
Code.
import csv

import requests

OM_URL = "https://openmetadata.example.com/api/v1"
OM_TOKEN = "...JWT..."
HDR = {"Authorization": f"Bearer {OM_TOKEN}",
       "Content-Type": "application/json"}

# 1) Ensure parent Glossary exists (PUT is create-or-update)
glossary = {"name": "Finance",
            "displayName": "Finance",
            "description": "Finance domain business vocabulary"}
requests.put(f"{OM_URL}/glossaries", json=glossary, headers=HDR).raise_for_status()

# 2) For each Collibra term, PUT as GlossaryTerm
with open("collibra_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        term = {
            "name": row["name"].replace(" ", "_"),  # no spaces in the FQN part
            "displayName": row["name"],             # keeps the original wording
            "description": row["description"],
            "glossary": "Finance",
        }
        r = requests.put(f"{OM_URL}/glossaryTerms", json=term, headers=HDR)
        r.raise_for_status()  # stop on the first rejected term
Step-by-step explanation.
- The Collibra export is a CSV with name, description, domain columns. Standard Collibra "Export Asset List" feature.
- The script ensures the parent Glossary entity exists in OpenMetadata. PUT is idempotent: re-running the script does not duplicate the Glossary.
- For each row, the script POSTs (or PUTs, depending on whether you want create-or-update) a GlossaryTerm. The name field cannot contain spaces in OpenMetadata FQNs; displayName keeps the original.
- Each term lands in the Finance Glossary. The terms can now be linked from tables, columns, and dashboards via the UI or programmatically.
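Before posting anything, it is worth running the transform step offline and checking for name collisions, since two Collibra terms that differ only in whitespace would map to the same OpenMetadata name. A minimal sketch, assuming the CSV columns described above; `to_term` and `load_terms` are hypothetical helpers and the sample rows are made up.

```python
import csv
import io
import re

def to_term(row: dict, glossary: str = "Finance") -> dict:
    """Map one exported row to a GlossaryTerm payload."""
    display = row["name"].strip()
    return {
        "name": re.sub(r"\s+", "_", display),  # collapse whitespace runs to "_"
        "displayName": display,
        "description": row["description"].strip(),
        "glossary": glossary,
    }

def load_terms(csv_text: str):
    """Return (terms, duplicates); duplicates are names seen more than once."""
    seen, terms, duplicates = set(), [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        term = to_term(row)
        if term["name"] in seen:
            duplicates.append(term["name"])
            continue
        seen.add(term["name"])
        terms.append(term)
    return terms, duplicates

sample = """name,description,domain
Gross Merchandise Value,Total value of goods sold over a period.,Finance
Net Revenue,Revenue after refunds and discounts.,Finance
Gross  Merchandise Value,Duplicate with a double space.,Finance
"""
terms, duplicates = load_terms(sample)
print([t["name"] for t in terms], duplicates)
```

Reporting duplicates instead of overwriting them leaves the decision to a steward, which fits the review step that follows the import.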
Output.
| Imported | Count |
|---|---|
| Glossary | 1 (Finance) |
| GlossaryTerm | 500 |
| Links to tables | 0 (next migration phase) |
Rule of thumb. Migrate the glossary first, then the table-to-term links. Linking is the part that benefits most from human review — let stewards approve sample links rather than bulk-import them blindly.
Worked example — multi-cast to Marquez, OpenMetadata, and Monte Carlo
Detailed explanation. Some teams want the lineage UI of Marquez (fast to render), the catalog of OpenMetadata (governance), and the observability of Monte Carlo (anomaly detection). The OL standard makes this trivial: each backend is just another URL.
Question. Configure the fan-out proxy to deliver every OL event to Marquez, OpenMetadata, and Monte Carlo. Show the resulting graph experience for the end user.
Input.
| Backend | Role |
|---|---|
| Marquez | lineage graph UI for engineers |
| OpenMetadata | catalog + glossary + governance |
| Monte Carlo | observability + freshness alerts |
Code.
TARGETS = [
"http://marquez.internal:5000/api/v1/lineage",
"https://openmetadata.example.com/api/v1/openlineage",
"https://api.getmontecarlo.com/openlineage",
]
# same fan-out logic as Worked example above
Step-by-step explanation.
- The proxy accepts one event per task / model / job and POSTs it to all three URLs in parallel. Latency is bounded by the slowest receiver.
- Marquez renders the lineage graph immediately. Engineers use it for "trace the job" deep-dives during incidents.
- OpenMetadata creates a Pipeline entity and lineage edges, plus updates the affected tables. Analysts and stewards use this view.
- Monte Carlo cross-references the event against learned baselines — table appeared, schema changed, row count dropped. It alerts on anomalies; the alert pages the on-call.
- All three views show the same underlying facts because the source-of-truth event is the OL payload from the pipeline.
Output (per persona).
| Persona | Tool | Reason |
|---|---|---|
| Data engineer | Marquez | clean lineage graph for debugging |
| Analytics engineer | OpenMetadata | catalog browsing, glossary, owners |
| Analyst | OpenMetadata | search, find tables, see freshness |
| Steward | OpenMetadata | governance, PII review |
| On-call | Monte Carlo | freshness / schema-change alerts |
Rule of thumb. The right number of OL consumers is "however many distinct user personas you have, minus the ones whose needs overlap entirely." The marginal cost of adding a receiver is configuration; the marginal value is the persona it serves.
Data engineering interview question on planning a closed-catalog exit
A senior interviewer might frame this as: "Your CFO has asked for a plan to leave Vendor X at renewal in six months. Walk me through it from week one to cutover, in enough detail that the platform team can execute without me."
Solution Using a six-month phased migration plan
MONTH 1 — Stand up
- Deploy OpenMetadata in staging via Helm.
- Configure Postgres + Elasticsearch in dedicated VMs / managed services.
- Run all warehouse connectors (Snowflake, BigQuery, Postgres) once.
- Sanity check: entity count vs vendor's reported asset count.
MONTH 2 — Lineage
- Enable OL emitters on staging Airflow + dbt + Spark.
- Multi-cast OL events to both the vendor and OpenMetadata.
- Verify table-level + column-level lineage parity for top-20 tables.
- Document gaps; file integration bugs upstream where needed.
MONTH 3 — Business metadata
- Export glossary + owner + tags from vendor (CSV or API).
- Script the import into OpenMetadata GlossaryTerm / Tag / owner.
- Stewards review sample of 50 imports; fix mapping issues.
MONTH 4 — Quality and policy
- Define top TestSuite per Tier-1 table.
- Migrate or re-author data quality tests (dbt tests + custom SQL).
- Replicate Role / Policy model — admin / steward / read-only.
MONTH 5 — UX cutover
- Switch internal documentation links from vendor to OpenMetadata.
- Vendor UI moves to read-only mode; team is told "use OM going forward."
- Monitor support tickets, fix UX gaps, train teams.
MONTH 6 — Renewal day
- Drop vendor URL from OL multi-cast.
- Cancel vendor contract.
- Capture lessons learned for the next standards adoption (e.g. DataHub
as second open option, or a managed Collate as an upgrade path).
Step-by-step trace.
| Month | Headline deliverable | Risk | Mitigation |
|---|---|---|---|
| 1 | OM running, connectors green | infra sizing | start with x86-large VMs + 200GB Postgres |
| 2 | OL multi-cast in prod | emitter overhead | feature flag per team |
| 3 | Business metadata imported | term mapping errors | steward review sample |
| 4 | Quality tests live | test coverage gaps | tier-gated requirements |
| 5 | UX cutover | user pushback | early demos, training |
| 6 | Vendor decommissioned | sign-off blocking | written acceptance from each domain |
The plan is designed to fail safely: at every month, if the new stack is not ready, the old vendor is still receiving events and serving as the source of truth. Cutover only happens when parity is real, not when the calendar says.
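"Parity is real" can be turned into a concrete gate for the month-2 lineage check. The sketch below compares lineage edges exported from both backends for a sample of tables; the edge tuples and the export step are assumptions, since each catalog exposes lineage through its own API.

```python
def lineage_parity(vendor_edges, om_edges) -> dict:
    """Compare (upstream, downstream) edges seen by the vendor and by OpenMetadata."""
    vendor, om = set(vendor_edges), set(om_edges)
    return {
        "missing_in_om": sorted(vendor - om),   # blocks cutover
        "only_in_om": sorted(om - vendor),      # usually new coverage, review only
        "parity": vendor <= om,                 # every vendor edge exists in OM
    }

# Hypothetical edge exports for a few Tier-1 tables.
vendor_edges = [
    ("raw.orders", "analytics.fct_orders"),
    ("raw.customers", "analytics.dim_customer"),
    ("analytics.fct_orders", "dash.finance_revenue"),
]
om_edges = [
    ("raw.orders", "analytics.fct_orders"),
    ("raw.customers", "analytics.dim_customer"),
    ("raw.payments", "analytics.fct_orders"),
]
report = lineage_parity(vendor_edges, om_edges)
print(report)
```

Treating parity as "the new backend is a superset of the old one" keeps extra edges in OpenMetadata from failing the gate while still catching anything users would lose at cutover.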
Output:
| Month | Status |
|---|---|
| 1 | OpenMetadata running in staging |
| 2 | Lineage two-write in prod |
| 3 | Glossary imported, owners assigned |
| 4 | Quality tests + roles parity |
| 5 | UX cutover, vendor read-only |
| 6 | Vendor decommissioned at renewal |
Why this works — concept by concept:
- Two-write everywhere — the migration never has a "one-night switch" risk because events flow to both backends throughout. Either side can be the primary at any moment.
- Connector-first, lineage-second — entities give you the inventory; OL gives you the edges. Stand them up in that order so the OL graph has nodes to attach to.
- Steward review of business metadata — automated import handles 80%; humans handle the 20% with judgement calls. Stewards are the only durable defence against junk metadata.
- Tier-gated quality — every Tier-1 table must have a test suite; lower tiers are optional. This keeps quality investment proportional to business impact.
- UX cutover before renewal — the team must actively prefer the new UI before renewal day. If they do not, the plan slips by a month — better than slipping the renewal.
- Cost — six months of platform-engineering attention (~0.5 FTE) plus ~$40K infra annually. The vendor renewal usually exceeds that within a year for any non-trivial asset count.
Cheat sheet — open standards recipes
- "I want lineage without a catalog yet." OpenLineage emitters in every tool + Marquez as the backend. One docker-compose stack; rendered lineage graph in 30 minutes.
- "I want a full open catalog." OpenMetadata (broader connector library, polished UI) or DataHub (event-native, Kafka-friendly). Pick by ecosystem fit, not by feature checklist alone.
- "dbt + Airflow + Spark stack." OpenLineage emitters in all three (dbt OL adapter, Airflow OL plugin, Spark OL listener), single OPENLINEAGE_URL, one backend behind it. Promote per team via feature flag.
- "Migrating off Collibra / Alation / Atlan." OpenMetadata in parallel; OL multi-cast for 60–90 days; import glossary via REST; cut over once user-facing parity is real.
- "Need column-level lineage." Enable the columnLineageFacet end-to-end. dbt computes it from its manifest; Spark from query plans; SQL engines via sqlglot. Render in OpenMetadata or DataHub.
- "Want governance + glossary." OpenMetadata's Glossary + GlossaryTerm + Tag + Classification entities, plus the Role / Policy model. Stewards own approval; auto-classifiers propose.
- "Need to stream lineage into a vendor." Configure the vendor's OL endpoint as one of the multi-cast targets. Monte Carlo, Bigeye, Atlan, and Collibra all accept OL events.
- "Production transport: HTTP or Kafka?" HTTP for setups under a few thousand events per minute and one backend. Kafka when you need durability, replay, or multiple downstream consumer groups.
- "How do I cross tool boundaries?" Use the parentRunFacet. Airflow → dbt → Spark events all carry parent links; receivers reconstruct the hierarchical graph automatically.
- "Custom metadata that the spec does not cover." Custom facet with your org's URI. Receivers either render it or ignore it. Lobby for promotion to standard if the use case generalises.
- "OpenMetadata vs DataHub — quick decision." Want the deepest connector library and the most polished UI? OpenMetadata. Want event-native with Kafka under the hood? DataHub. Both accept OL events natively.
- "Cost back-of-envelope." Closed catalog: ~$0.50/asset/mo, grows linearly. Self-hosted OpenMetadata: ~$2–4K/month infra (roughly $25–50K/year, in line with the cost table above) + 0.25 FTE. Crossover around 8–10K assets. Add ~$20–60K/year for a vendor observability layer if needed.
- "What about Marquez in production?" Fine for lineage-only at moderate scale. Lacks the catalog surface (glossary, tags, classification) — pair with OpenMetadata or DataHub if you need those.
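The parent-run recipe is the least obvious one in the list, so here is a small sketch of how a receiver could rebuild the hierarchy. It uses the parent facet layout from the OpenLineage spec (`run.facets.parent` holding the parent's `run.runId` and `job`); the run ids and job names are hypothetical, and facet metadata keys such as `_producer` are omitted.

```python
from collections import defaultdict

def event(run_id, job, parent=None):
    """Minimal event carrying only identity and an optional parent facet."""
    run = {"runId": run_id}
    if parent:
        parent_run_id, parent_job = parent
        run["facets"] = {"parent": {
            "run": {"runId": parent_run_id},
            "job": {"namespace": "analytics", "name": parent_job},
        }}
    return {"run": run, "job": {"namespace": "analytics", "name": job}}

def build_tree(events):
    """Index job names by run id and group child runs under their parent run."""
    names, children = {}, defaultdict(list)
    for e in events:
        run_id = e["run"]["runId"]
        names[run_id] = e["job"]["name"]
        parent = e["run"].get("facets", {}).get("parent")
        children[parent["run"]["runId"] if parent else None].append(run_id)
    return names, children

def render(names, children, root=None, depth=0):
    """Indented job names, parents before children."""
    lines = []
    for run_id in children.get(root, []):
        lines.append("  " * depth + names[run_id])
        lines.extend(render(names, children, run_id, depth + 1))
    return lines

events = [
    event("run-airflow", "airflow.daily_orders"),
    event("run-dbt", "dbt.fct_orders", parent=("run-airflow", "airflow.daily_orders")),
    event("run-spark", "spark.enrich_orders", parent=("run-dbt", "dbt.fct_orders")),
]
names, children = build_tree(events)
print("\n".join(render(names, children)))
```

Because each event only names its immediate parent, tools never need to know about each other; the receiver stitches the chain together from the run ids.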
Frequently asked questions
Is OpenLineage a catalog?
No — OpenLineage is a wire format for emitting lineage events; it is not a catalog application. It defines the JSON schema (run, job, dataset, facets) and reference clients in Python and Java, but storage and UI are the backend's job. The reference backend is Marquez (Postgres + a minimal lineage UI). For a full catalog you pair OpenLineage with OpenMetadata or DataHub. The most common interview mistake is conflating the standard with a backend — "we'll use OpenLineage as our catalog" is the wrong sentence; "we'll emit OpenLineage and store it in OpenMetadata" is the right one.
Should I use OpenMetadata or DataHub?
Both are credible open catalogs with active communities, similar feature surfaces, and OpenLineage support. Pick OpenMetadata when you want a broader out-of-the-box connector library, a polished end-user UI, native OL ingestion as a first-class path, or a managed offering (Collate) on the same code base. Pick DataHub when you want an event-native architecture with Kafka under the hood, strong propagation of metadata changes to downstream services (through Metadata Change Proposal / Metadata Change Log events, the successors to the older MCE / MAE model), or your existing stack already has heavy Kafka investment. Either way, your OL emitters do not change: you can switch later by pointing the transport at the new endpoint.
Does OpenLineage support column-level lineage?
Yes — the columnLineageFacet is a standard facet that maps each output column to the input columns it was derived from. dbt's OL adapter generates it from the compiled manifest; Spark's listener derives it from query plans; SQL engines via parsers like sqlglot or Calcite can compute it from the SQL text. Receivers (OpenMetadata, DataHub, Marquez) render column-level edges as a sub-graph inside the table-level lineage view. Column-level lineage is the high-value payload for impact analysis ("if I drop column C, what dashboards break?") — make sure your emitters produce the facet end-to-end.
Can I send OpenLineage events to Monte Carlo or Bigeye?
Yes — both vendors document OpenLineage endpoints. Configure OPENLINEAGE_URL to the vendor's OL endpoint (or include it in the comma-separated list for multi-cast) and the vendor receives every event your pipelines emit. Monte Carlo and Bigeye layer freshness, volume, and schema-change anomaly detection on top of the same graph your open backend sees, so you can keep one observability vendor while running an open catalog underneath. Atlan and Collibra also accept OL events for the lineage half of their products. The standard is the shared interface; the vendors compete on UX and analytics, not on data ownership.
Is Marquez production-ready?
For lineage-only workloads at small-to-medium scale (~100K events / day, ~10K datasets), yes — Marquez has been in production at multiple companies since 2019. It is the reference backend for OpenLineage, so spec changes land there first, and the Postgres + REST + minimal UI architecture is easy to operate. Marquez does not include the broader catalog surface (no glossary, no tag classification, no role / policy model). If you need that, pair Marquez with OpenMetadata or DataHub — or skip Marquez entirely and use OpenMetadata as both lineage backend and catalog. Many teams use Marquez during the OL adoption phase (weeks 1–4) and migrate to OpenMetadata as the second consumer once the catalog needs surface.
How does OpenMetadata compare to Atlan and Collibra?
OpenMetadata and the vendors converge on the same feature set (entity model, lineage graph, glossary, classification, data quality) but diverge on ownership and pricing. With Atlan or Collibra you license the product per asset and the metadata graph lives inside the vendor's database; switching vendors means rebuilding connectors and re-ingesting metadata. With OpenMetadata you self-host (or pay for the managed Collate variant), the metadata DB is yours, and the OL emitters that feed it also feed every other open or vendor receiver. Atlan and Collibra still win on out-of-the-box polish and vendor support; OpenMetadata wins on portability, cost at scale, and the option value of swapping backends without re-instrumenting pipelines. The honest answer is "both are credible; pick the trade-off your platform can actually live with for the next five years."
Practice on PipeCode
- Drill the ETL practice library → for end-to-end pipeline problems where lineage and catalog instrumentation actually pay off.
- Rehearse on dimensional modeling problems → to sharpen the entity instincts you need for catalog schema design.
- Sharpen the event modeling library → for the runtime side of lineage emitters.
- Layer the data aggregation drills → for the catalog metrics and coverage reports senior interviewers love.
- Stack the aggregation library → for the COUNT-style queries that drive every "how many assets / pipelines / tests do we have?" question.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the design axis with the ETL system design for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every OpenLineage and OpenMetadata recipe above ships with hands-on practice rooms where you wire the emitters, design the entity model, and write the SQL behind the catalog metrics against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your column-lineage facet actually round-trips between Marquez, OpenMetadata, and a vendor receiver in the same way it will on interview day.




