Gowtham Potureddi
OpenLineage & OpenMetadata: Open Standards for Lineage and Cataloging

OpenLineage and OpenMetadata are the pair of names that quietly replaced the closed-catalog conversation in 2024 and 2025. By 2026, when an interviewer asks "how would you build lineage and a catalog across your stack?", the weak answer is "we'd license Atlan" and the strong answer starts with "OpenLineage as the wire format, OpenMetadata or DataHub as the backend." The shift is the same one that happened with Kubernetes versus proprietary container schedulers: the moment a credible open standard exists, every vendor either adopts it or argues itself into irrelevance.

This guide walks through the two standards in production-engineering detail. It opens with why open standards for lineage and metadata matter at all (the cost of being trapped inside a closed metadata graph), then layers the OpenLineage event model (run, job, dataset, facets) on top of the OpenMetadata architecture (ingestion, metadata server, UI), and closes with the interop patterns that let you migrate off Atlan, Collibra, or Alation without a Big Bang cutover. Along the way it ties in Marquez and DataHub, the two most-mentioned open backends, and shows the column-level lineage facet that makes a modern open data catalog actually useful for impact analysis. Every numbered section ships at least one worked example with code, a step-by-step trace, an output table, and a concept-by-concept breakdown of why it works.


1. Why open standards for lineage and metadata matter

Closed catalogs trap your metadata graph inside a vendor's billing model — open standards let lineage and entity definitions outlive the contract

The one-sentence invariant: lineage and metadata are the two most expensive things to backfill, so the format you choose to emit them in is a 5-to-10-year decision, and proprietary catalogs charge you forever to read back data you already paid to compute. Once you internalise that "the graph you build is more valuable than the UI you license," the case for OpenLineage plus OpenMetadata (or DataHub) over a closed product becomes the default architectural posture.

The lock-in tax of proprietary catalogs.

  • Per-asset pricing scales with success. Closed-catalog vendors commonly invoice on "data assets": tables, dashboards, columns, pipelines. The more your platform grows, the more you pay, even when the marginal user value of asset 50,001 is near zero.
  • Export is intentionally hard. Closed catalogs expose only narrow REST APIs (or paginated CSV exports) for the metadata you contributed. Lineage edges, column-level mappings, glossary tags, and ownership graphs are often not round-trippable — you can read them, but you cannot bulk-extract them in a form the next catalog will understand.
  • Connectors are the moat. A vendor's competitive edge is "we have 200 connectors." But those connectors emit into the vendor's internal metadata model. Switching means rebuilding every connector for the new tool — months of work for a platform team that wants to ship product instead.

The "every tool emits to its own black box" problem.

In a typical 2023-era stack, Airflow exposed lineage to its own DB, dbt exposed lineage to dbt Cloud, Spark exposed lineage to Spline (if anything), Atlan ingested from BigQuery, Monte Carlo ingested separately for observability, and Collibra ingested independently for governance. Each tool maintained its own copy of the same fact: job daily_orders reads raw_orders and writes fct_orders. That fact was duplicated five times, inconsistently, with each vendor's UI showing a slightly different graph.

What an open standard buys.

  • One emit, many consumers. Airflow emits an OpenLineage event once. Marquez, OpenMetadata, DataHub, Monte Carlo, Atlan, and Collibra can all receive it. The graph is single-sourced; the receivers compete on UX, not on data ownership.
  • Vendor portability. Move from Atlan to OpenMetadata? You point the OpenLineage transport at the new backend. Your emitters do not change. Your pipeline code does not change.
  • Community integrations. When the OpenLineage spec adds a columnLineage facet, every emitter and every receiver implements it on the same schedule, in the same shape. No more "Vendor X supports column lineage on Snowflake but not Postgres."
  • Schema review in the open. OpenLineage is a Linux Foundation (LF AI & Data) project, and spec changes go through public proposals and review. OpenMetadata is an Apache-2.0 licensed project led by Collate and developed in the open on GitHub. In both cases a vendor changing strategy cannot silently break the format you emit.
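To make the columnLineage facet mentioned above concrete, here is a minimal sketch of how such a facet can sit on an output dataset and how a receiver might answer "which output columns depend on this input column?". The fields / inputFields layout reflects my understanding of the OpenLineage column-lineage facet; the dataset names, column names, and the helper `columns_depending_on` are hypothetical.

```python
# Output dataset carrying a columnLineage facet (all names are illustrative).
output_dataset = {
    "namespace": "warehouse",
    "name": "analytics.fct_orders",
    "facets": {
        "columnLineage": {
            "fields": {
                "order_id": {
                    "inputFields": [
                        {"namespace": "warehouse", "name": "raw.orders", "field": "id"}
                    ]
                },
                "amount_usd": {
                    "inputFields": [
                        {"namespace": "warehouse", "name": "raw.orders", "field": "amount"},
                        {"namespace": "warehouse", "name": "raw.orders", "field": "currency"},
                    ]
                },
            }
        }
    },
}


def columns_depending_on(dataset, namespace, name, field):
    """Return the output columns whose inputFields include the given input column."""
    fields = dataset.get("facets", {}).get("columnLineage", {}).get("fields", {})
    target = (namespace, name, field)
    return sorted(
        out_col
        for out_col, spec in fields.items()
        if any(
            (f["namespace"], f["name"], f["field"]) == target
            for f in spec.get("inputFields", [])
        )
    )


impacted = columns_depending_on(output_dataset, "warehouse", "raw.orders", "currency")
print(impacted)
```

Because every emitter uses the same shape, a receiver needs only one function like this, regardless of which engine produced the event.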

Lineage vs metadata vs catalog — separating the three concerns.

  • Lineage is the runtime fact of "this job read these inputs and wrote these outputs at this time." It is a stream of events emitted by the compute engine.
  • Metadata is the static description of an asset: its schema, owner, tags, description, freshness SLO, classification. It is rows in a catalog DB.
  • Catalog is the application layer — the UI, the search index, the REST API, the access policies — that lets humans browse and query the metadata graph.

OpenLineage targets the lineage problem. OpenMetadata targets the metadata + catalog problem. They are complementary, not competitors.

The current ecosystem.

  • OpenLineage — the wire-format standard. JSON Schema for runs, jobs, datasets, and extensible facets. Reference backend is Marquez.
  • OpenMetadata — the open catalog application. Self-hosted or managed via Collate. Ingests from databases, dashboards, pipelines, ML models. Defines its own entity schemas.
  • Marquez — the original OpenLineage backend. Simple Postgres + REST UI. Great when you only want lineage and do not yet need a full catalog.
  • DataHub — alternative open catalog, originally from LinkedIn. Slightly different entity model than OpenMetadata, stronger upstream metadata-event story.
  • Amundsen — earlier-generation open catalog from Lyft. Less actively developed in 2026; relevant mostly for historical context.

What interviewers listen for.

  • Do you say "OpenLineage is the wire format, not a catalog"? — senior signal.
  • Do you mention Marquez as the reference backend for OpenLineage? — senior signal.
  • Do you distinguish DataHub and OpenMetadata as two parallel open-catalog projects? — senior signal.
  • Do you propose a two-write migration when leaving a closed catalog? — senior signal.

Worked example — the lock-in cost of a closed catalog in one number

Detailed explanation. A platform team has 8,000 tables, 1,200 dashboards, and 400 dbt models. The closed catalog vendor invoices on data_assets. Migrating off the vendor requires re-emitting lineage from every pipeline; staying means paying forever. Pricing the two options surfaces why the open-standard answer is the default.

Question. Compute the three-year total cost of staying on a closed catalog at $0.50 per asset per month versus migrating to OpenMetadata + OpenLineage in a self-hosted footprint that costs $4,000 per month all-in (infra + 0.25 FTE). Assume asset count grows 25% per year.

Input.

| Year | Assets (start) | Assets (end) | Avg assets |
|---|---|---|---|
| 1 | 9,600 | 12,000 | 10,800 |
| 2 | 12,000 | 15,000 | 13,500 |
| 3 | 15,000 | 18,750 | 16,875 |

Code.

# Three-year cost model — closed vs open

closed_unit_cost_per_month = 0.50  # USD per asset per month
open_monthly_cost = 4_000          # USD per month, all-in self-hosted

avg_assets = [10_800, 13_500, 16_875]

closed_total = sum(a * closed_unit_cost_per_month * 12 for a in avg_assets)
open_total   = open_monthly_cost * 12 * 3

print(f"Closed 3-year cost:  ${closed_total:>10,.0f}")
print(f"Open   3-year cost:  ${open_total:>10,.0f}")
print(f"Savings:             ${closed_total - open_total:>10,.0f}")
print(f"Break-even assets:   {open_monthly_cost / closed_unit_cost_per_month:>10,.0f}")

Step-by-step explanation.

  1. Closed-catalog cost is linear in asset count. At $0.50 per asset per month, 10,800 average assets year 1 means 10,800 * 0.50 * 12 = $64,800 for year 1.
  2. Year 2 grows to 13,500 average assets → 13,500 * 0.50 * 12 = $81,000. Year 3 hits 16,875 average → $101,250.
  3. Open-catalog cost is flat: $4,000 per month * 36 months = $144,000.
  4. The break-even is open_monthly / closed_unit = 4000 / 0.50 = 8,000 assets. Above that asset count, OpenMetadata is cheaper at this infra budget.
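The average-asset inputs can also be derived instead of typed in by hand. The sketch below assumes the starting count is the sum of the team's tables, dashboards, and dbt models (8,000 + 1,200 + 400) and that growth compounds once per year; the function name `project_assets` is hypothetical.

```python
def project_assets(start_assets, growth_rate, years):
    """Return one row per year with start, end, and average asset counts."""
    rows = []
    start = start_assets
    for year in range(1, years + 1):
        end = start * (1 + growth_rate)
        rows.append({"year": year, "start": start, "end": end, "avg": (start + end) / 2})
        start = end  # next year begins where this one ended
    return rows


start_assets = 8_000 + 1_200 + 400  # tables + dashboards + dbt models
rows = project_assets(start_assets, growth_rate=0.25, years=3)
avg_assets = [row["avg"] for row in rows]

closed_total = sum(avg * 0.50 * 12 for avg in avg_assets)  # USD over three years
open_total = 4_000 * 12 * 3

for row in rows:
    print(f"Year {row['year']}: {row['start']:,.0f} -> {row['end']:,.0f} (avg {row['avg']:,.0f})")
print(f"Closed: ${closed_total:,.0f}  Open: ${open_total:,.0f}")
```

Changing `growth_rate` or the unit price in one place makes it easy to see how sensitive the savings figure is to the growth assumption.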

Output.

| Metric | Value |
|---|---|
| Closed 3-year cost | $247,050 |
| Open 3-year cost | $144,000 |
| Savings | $103,050 |
| Break-even assets | 8,000 |

Rule of thumb. Below ~5,000 assets, the cost case for self-hosting is weaker — the FTE overhead dominates. Above ~10,000 assets, the open-standard answer pays for itself within the first contract renewal, before counting the value of avoiding vendor lock-in.

Worked example — what "lineage as a stream of events" looks like end-to-end

Detailed explanation. OpenLineage's mental model is event-driven: every job run emits a START event when it begins and a COMPLETE event when it finishes (or FAIL / ABORT on error). Each event carries the run, the job, the input datasets, the output datasets, and any number of facets. Concatenated over time, these events are the lineage graph.

Question. Show the minimum two-event sequence that captures a daily Airflow run of dbt_run_orders which reads raw.orders and writes analytics.fct_orders. Identify which fields are mandatory and which are optional.

Input.

| Field | Value |
|---|---|
| run id | a3f1-2026-06-15-01 |
| job name | analytics.dbt_run_orders |
| inputs | raw.orders |
| outputs | analytics.fct_orders |

Code.

// Event 1  START
{
  "eventType": "START",
  "eventTime": "2026-06-15T01:00:00.000Z",
  "run":  { "runId": "a3f1-2026-06-15-01" },
  "job":  { "namespace": "analytics", "name": "dbt_run_orders" },
  "inputs":  [ { "namespace": "warehouse", "name": "raw.orders" } ],
  "outputs": [ { "namespace": "warehouse", "name": "analytics.fct_orders" } ],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/airflow"
}

// Event 2  COMPLETE
{
  "eventType": "COMPLETE",
  "eventTime": "2026-06-15T01:04:12.000Z",
  "run":  { "runId": "a3f1-2026-06-15-01" },
  "job":  { "namespace": "analytics", "name": "dbt_run_orders" },
  "inputs":  [ { "namespace": "warehouse", "name": "raw.orders" } ],
  "outputs": [ { "namespace": "warehouse", "name": "analytics.fct_orders" } ],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/airflow"
}

Step-by-step explanation.

  1. The START event arrives when the Airflow operator begins. The spec expects runId to be a UUID generated once per attempt; the value shown here is shortened for readability. The Airflow integration derives it from identifiers such as the DAG run, the task id, and try_number.
  2. Inputs and outputs are listed intentionally. OpenLineage does not infer them; the emitter is responsible for declaring what the job will read and write. dbt knows from its compiled manifest; Spark knows from its query plan; Airflow falls back to operator-specific hints.
  3. The COMPLETE event arrives when the operator returns. It re-states the same run, job, inputs, and outputs, and receivers reconcile the two events by runId. If a FAIL or ABORT event arrives instead, the receiver knows the lineage edge is attempted rather than successful.
  4. The producer field is a URI identifying the emitter's code. Receivers use it to know "this event came from the OpenLineage Airflow integration, version 1.20.0" and apply version-specific facet handling.
  5. On mandatory versus optional: in the spec, eventTime, producer, schemaURL, run.runId, job.namespace, and job.name are required, while eventType, inputs, and outputs are optional. The events above omit schemaURL for brevity; a real emitter sets it to the URL of the spec version it implements.
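The two events can also be built programmatically. The following is a minimal sketch using only the standard library, not the official OpenLineage client. The `SCHEMA_URL` value and the list of required paths are assumptions based on my reading of the spec, so pin them to the spec version you actually target; `run_event` and `missing_required` are hypothetical helpers.

```python
import uuid
from datetime import datetime, timezone

PRODUCER = "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/airflow"
# Assumption: schemaURL of the RunEvent definition; use the one for your spec version.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"

# Fields this sketch treats as required (a reading of the spec, not an official validator).
REQUIRED_PATHS = [
    ("eventTime",), ("producer",), ("schemaURL",),
    ("run", "runId"), ("job", "namespace"), ("job", "name"),
]


def run_event(event_type, event_time, run_id):
    """Build one run event for the dbt_run_orders job as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": event_time.isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": "analytics", "name": "dbt_run_orders"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "analytics.fct_orders"}],
    }


def missing_required(event):
    """Return the dotted paths of required fields that are absent or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node = event
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        if not node:
            missing.append(".".join(path))
    return missing


run_id = str(uuid.uuid4())  # one UUID shared by every event of this attempt
start = run_event("START", datetime(2026, 6, 15, 1, 0, 0, tzinfo=timezone.utc), run_id)
complete = run_event("COMPLETE", datetime(2026, 6, 15, 1, 4, 12, tzinfo=timezone.utc), run_id)

print(missing_required(start))
print(missing_required({"run": {}, "job": {}}))
```

In production the official client libraries handle serialization and transport; the point here is only that both events share one runId and carry the same small set of required fields.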

Output (Marquez UI rendering).

| Node type | Identifier | Edges |
|---|---|---|
| Job | analytics.dbt_run_orders | input from warehouse.raw.orders, output to warehouse.analytics.fct_orders |
| Dataset | warehouse.raw.orders | read by analytics.dbt_run_orders |
| Dataset | warehouse.analytics.fct_orders | written by analytics.dbt_run_orders |
| Run | a3f1-2026-06-15-01 | status COMPLETE, duration 4m 12s |
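How a receiver arrives at the Run row can be sketched as a fold over events keyed by runId. This is an illustrative reconciliation loop, not Marquez's implementation; `reconcile` and `parse_time` are hypothetical names, and the events are trimmed to the fields the loop reads.

```python
from datetime import datetime

TERMINAL_STATES = {"COMPLETE", "FAIL", "ABORT"}


def parse_time(value):
    # fromisoformat on older Pythons rejects a trailing "Z", so normalise it first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def reconcile(events):
    """Fold a list of run events into one summary row per runId."""
    runs = {}
    # Events may arrive out of order, so process them by event time.
    for event in sorted(events, key=lambda e: parse_time(e["eventTime"])):
        run = runs.setdefault(
            event["run"]["runId"],
            {"status": None, "started": None, "ended": None, "duration_s": None},
        )
        when = parse_time(event["eventTime"])
        if event["eventType"] == "START":
            run["started"] = when
        elif event["eventType"] in TERMINAL_STATES:
            run["ended"] = when
        run["status"] = event["eventType"]
        if run["started"] and run["ended"]:
            run["duration_s"] = (run["ended"] - run["started"]).total_seconds()
    return runs


events = [  # deliberately delivered out of order
    {"eventType": "COMPLETE", "eventTime": "2026-06-15T01:04:12.000Z",
     "run": {"runId": "a3f1-2026-06-15-01"}},
    {"eventType": "START", "eventTime": "2026-06-15T01:00:00.000Z",
     "run": {"runId": "a3f1-2026-06-15-01"}},
]
summary = reconcile(events)
row = summary["a3f1-2026-06-15-01"]
print(row["status"], row["duration_s"])  # COMPLETE 252.0
```

The 252 seconds correspond to the 4m 12s shown in the Run row.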

Rule of thumb. Think of OpenLineage as doing for lineage what a shared metrics format did for monitoring, with one difference from Prometheus: it is push-based. Emitters push events; backends receive and persist; UIs render. The wire format is small and stable; the value compounds over thousands of runs.

Worked example — the "every tool has its own graph" failure mode

Detailed explanation. Without an open standard, every tool keeps its own private graph and your platform team operates as the human consistency layer. When the dbt graph says model X depends on table Y but the Airflow graph says task A depends on table Z and the BI tool says dashboard D depends on column C, no one can answer "if I drop column C, what breaks?" in less than a half-day investigation.

Question. A finance dashboard breaks because the currency column was renamed in the source. Trace the four lookups a platform engineer must do without an open standard and the single lookup they would do with one.

Input (impacted assets in five tools).

| Tool | Asset | What the tool records |
|---|---|---|
| Postgres source | raw.invoices | ccy (renamed from currency) |
| dbt | int_invoices, fct_revenue | references currency |
| Airflow | DAG daily_revenue | runs dbt build |
| BI | dashboard finance.revenue_v2 | uses fct_revenue.currency |
| Catalog | fct_revenue | lineage last refreshed 6h ago |

Code.

# Without OpenLineage / OpenMetadata — four siloed lookups
1. dbt docs: which models reference `currency`?
2. Airflow UI: which DAGs run those models?
3. BI tool: which dashboards depend on those tables?
4. Catalog: which downstream owners need notification?

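# With OpenLineage + OpenMetadata: one query
# (by-name lookup; three downstream hops reach the dashboard; verify the path for your OpenMetadata version)
GET /api/v1/lineage/table/name/warehouse.raw.invoices?upstreamDepth=0&downstreamDepth=3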

Step-by-step explanation.

  1. Without the open standard, each tool answers a slice of the question against its private graph. The engineer manually stitches answers — dbt says "models X, Y depend on currency"; Airflow says "DAG daily_revenue runs them"; the BI tool says "dashboards A and B depend on Y"; the catalog confirms ownership but is stale.
  2. The stitching is error-prone: a dbt model invoked by an ad-hoc notebook (not Airflow) is invisible to the Airflow lookup. A dashboard that depends on a derived column via a join is invisible unless the BI tool indexed column lineage.
  3. With OpenLineage emitters everywhere and OpenMetadata as the single sink, the question is one API call. The downstream graph is materialised continuously from the events; the answer is whichever assets currently sit downstream of warehouse.raw.invoices.
  4. Time-to-impact-analysis drops from "half a day" to "30 seconds." That speed is the operational ROI of a unified metadata graph — and the strongest argument when the team's senior engineer asks "why are we spending two weeks adopting another standard?"
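What the single lookup does on the server side is, conceptually, a depth-limited traversal of the lineage graph. The sketch below is an illustration over a hand-written edge map, not OpenMetadata's implementation; the `edges` dictionary and the `downstream` helper are hypothetical.

```python
from collections import deque

# Downstream edges as they would be materialised from lineage events (illustrative).
edges = {
    "warehouse.raw.invoices": ["analytics.int_invoices"],
    "analytics.int_invoices": ["analytics.fct_revenue"],
    "analytics.fct_revenue": ["bi.dashboards.finance.revenue_v2"],
}


def downstream(edges, root, max_depth):
    """Breadth-first walk returning (depth, asset) pairs below root, up to max_depth."""
    seen = {root}
    result = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for child in edges.get(node, []):
            if child not in seen:  # guards against cycles and diamond-shaped graphs
                seen.add(child)
                result.append((depth + 1, child))
                queue.append((child, depth + 1))
    return result


impact = downstream(edges, "warehouse.raw.invoices", max_depth=3)
for depth, asset in impact:
    print(depth, asset)
```

Note that this sketch counts the source table as depth 0, so its depth 1 corresponds to the first asset downstream of the source.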

Output (impact-analysis table from a single OpenMetadata query).

| Hop | Asset | Owner | Action required |
|---|---|---|---|
| 1 | warehouse.raw.invoices.currency | data-eng | rename mapping in dbt staging |
| 2 | analytics.int_invoices | data-eng | regenerate, redeploy |
| 3 | analytics.fct_revenue | analytics-eng | document column |
| 4 | bi.dashboards.finance.revenue_v2 | finance-eng | update dashboard tile |

Rule of thumb. The single best heuristic for "is our metadata stack mature?" is "can we answer the impact-analysis question in under a minute?" If no, the next architecture investment is OpenLineage emitters plus a single open backend.

Data engineering interview question on choosing between open and closed catalogs

A senior interviewer might frame this as: "Your CFO is asking why we should not just buy Atlan and be done. Defend the open-standards path in a 60-second answer that does not sound like an open-source zealot speech."

Solution Using a TCO + portability scorecard

# Decision matrix — closed vs open catalog
# Score each criterion 1-5 (higher = better for the option)

criteria = {
    "feature_velocity_today":        {"closed": 5, "open": 4},   # vendor ships polish
    "five_year_TCO_at_scale":         {"closed": 2, "open": 5},   # per-asset pricing scales painfully
    "vendor_portability":             {"closed": 1, "open": 5},   # OL means switching cost is near zero
    "control_over_metadata_graph":    {"closed": 2, "open": 5},   # self-hosted = your own DB
    "platform_team_FTE_required":     {"closed": 5, "open": 3},   # closed is cheaper in eng hours
    "integration_with_OSS_emitters":  {"closed": 3, "open": 5},   # OL emitters land on open backends day 1
}

closed_score = sum(c["closed"] for c in criteria.values())
open_score   = sum(c["open"]   for c in criteria.values())

print(f"Closed total: {closed_score}")
print(f"Open total:   {open_score}")

Step-by-step trace.

| Criterion | Closed | Open | Comment |
|---|---|---|---|
| feature_velocity_today | 5 | 4 | vendors ship polish faster |
| five_year_TCO_at_scale | 2 | 5 | per-asset bill grows with success |
| vendor_portability | 1 | 5 | OL means switching cost near zero |
| control_over_metadata_graph | 2 | 5 | self-hosted = your own DB |
| platform_team_FTE_required | 5 | 3 | closed cheaper in eng hours |
| integration_with_OSS_emitters | 3 | 5 | OL emitters land everywhere day 1 |

The closed option leads on short-term polish and FTE economy; the open path leads on every multi-year axis (TCO, portability, control, integrations). For a platform expected to outlive any single vendor contract, those multi-year axes are the ones that still matter three renewals out. The scores themselves are judgment calls, so re-score them for your own organisation before quoting the totals.

Output:

| Path | Total |
|---|---|
| Closed catalog | 18 |
| Open standards (OL + OM/DH) | 27 |

Why this works — concept by concept:

  • Total cost of ownership — vendor pricing is per-asset, and asset counts grow super-linearly with success; open infra is a flat-ish cost. Above ~10K assets, open wins on cash alone.
  • Portability premium — OpenLineage emitters survive backend changes; the cost of changing backends approaches the cost of pointing the OL transport at a new URL. That option value is real and grows with platform maturity.
  • Control over your own metadata graph — when the catalog DB is yours, you can run arbitrary queries against it: cardinality audits, governance dashboards, custom impact analyses. Closed APIs cap you at the vendor's imagination.
  • FTE realism — yes, self-hosted costs platform-engineering time. The fair comparison is not "free vs paid"; it is "X FTE-months vs Y dollars plus lock-in." The decision matrix surfaces this honestly.
  • OSS emitter integration — every new OpenLineage emitter (Snowflake, Trino, Materialize) lands on every open backend at the same time. Closed catalogs lag by a release cycle.
  • Cost — the analysis itself is one spreadsheet plus a back-of-envelope FTE estimate. The actual decision is bought back over years of avoided lock-in pain.


2. The open standards stack

OpenLineage is the wire format, OpenMetadata is the catalog application — they sit at different layers of the same stack, and confusing them is the most-common interview mistake

The mental model in one line: OpenLineage defines what to emit (a JSON event); OpenMetadata defines where to store and query (a catalog application with REST APIs and a UI). Once you say "wire format versus application," every follow-up question about Marquez, DataHub, or whether to "use OpenLineage or OpenMetadata" answers itself: you almost always use both, at different layers.

Figure: vertical four-layer stack diagram with layers labelled Emitters, Wire format (OpenLineage), Backends (Marquez / OpenMetadata / DataHub), and Consumers.

The four-layer stack in one paragraph.

  • Layer 1 — Emitters. The things that produce lineage events: Airflow, dbt, Spark, Flink, Dagster, Prefect, custom Python apps. Each emitter has an OpenLineage integration that translates its native execution model into OL events.
  • Layer 2 — Wire format (OpenLineage). The JSON schema for the event itself: run, job, dataset, and an extensible facets slot. Versioned by the OpenLineage spec.
  • Layer 3 — Backends. The things that consume and persist the events: Marquez (reference backend, lineage-only), OpenMetadata (full catalog), DataHub (alternative catalog), and vendor receivers (Monte Carlo, Atlan, Bigeye, Collibra) when those products accept OL.
  • Layer 4 — Consumers. The humans and systems that read the persisted graph: catalog UIs, search indexes, impact-analysis services, governance dashboards, downstream alerting.

The "one event, many consumers" pattern.

A single OpenLineage event emitted by an Airflow task can simultaneously land in:

  • Marquez for the lineage graph UI used by data engineers.
  • OpenMetadata for the broader catalog with glossary and tags used by analysts and stewards.
  • Monte Carlo or Bigeye for observability and freshness anomaly detection.
  • A custom Kafka topic that downstream services subscribe to for "this table just changed" event-driven processing.

The OpenLineage client libraries ship an HTTP transport and a Kafka transport out of the box. Multi-cast is solved either in the client, by configuring a composite transport that wraps several transports (available in newer client versions), or by running a small fan-out proxy that re-emits each event to N backends.
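The fan-out idea fits in a few lines. The sketch below is a minimal illustration using only the standard library, not a production proxy: it has no retries, batching, or authentication. The backend URLs are hypothetical, and the `send` parameter exists so the demo can run with a stub instead of real network calls.

```python
import json
import urllib.request


def http_post(url, payload, timeout=5):
    """POST one event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def fan_out(event, backend_urls, send=http_post):
    """Deliver one event to every backend; one failing backend does not block the others."""
    results = {}
    for url in backend_urls:
        try:
            results[url] = send(url, event)
        except Exception as exc:  # record the failure and continue with the next backend
            results[url] = f"error: {exc}"
    return results


# Demo with a stub sender so the example runs without any network access.
def fake_send(url, payload):
    if "down" in url:
        raise ConnectionError("backend unavailable")
    return 200


backends = [
    "http://marquez.example:5000/api/v1/lineage",      # hypothetical Marquez host
    "http://down.example/api/v1/lineage",              # hypothetical unreachable backend
]
demo = fan_out({"eventType": "START"}, backends, send=fake_send)
print(demo)
```

A real proxy would usually put a durable queue between receipt and delivery so that a slow backend cannot cause event loss; that is the gap the Kafka transport fills.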

OpenLineage vs OpenMetadata — when each is the right answer.

| You need | OpenLineage | OpenMetadata |
|---|---|---|
| The fact "job X read table Y at time T" | yes (emit; a backend persists) | partial (ingests OL events) |
| A searchable UI of every table with owners and tags | no | yes |
| Column-level lineage facets | yes (in the event) | yes (renders the graph) |
| A glossary, classifications, PII tags | no | yes |
| Data quality test results | partial (facet) | yes (first-class entity) |
| Connectors for BigQuery, Snowflake, Tableau metadata | no | yes |
| A wire format other tools can also emit to | yes | no (it is an application) |

Marquez and DataHub in one sentence each.

  • Marquez is the reference OpenLineage backend — Postgres for storage, REST API for ingest and query, a minimal lineage UI. Use when you want "OpenLineage and a graph viewer" and nothing else.
  • DataHub is an alternative open catalog (originally LinkedIn) that competes with OpenMetadata. It uses its own metadata-event model (Metadata Change Proposals and Metadata Change Logs, the successors to the older MCE / MAE events) but accepts OL events through an adapter. Use when you want strong upstream metadata propagation with Kafka under the hood.

Where vendors plug in.

  • As emitters. A vendor's product (e.g. a closed orchestrator) can ship native OL events instead of a proprietary metadata API. Increasingly common — even Databricks and Snowflake now have OL integration paths.
  • As backends. Monte Carlo, Bigeye, Atlan, and Collibra accept OL events as input. Your team emits once; the vendor enriches and visualises.
  • As ingestion sources for OpenMetadata. OpenMetadata's ingestion-framework runs as Airflow DAGs (or a Python container) and uses connectors to pull metadata from Snowflake, BigQuery, Tableau, Looker, Kafka. These connectors do not emit OL; they push entities directly into the OpenMetadata server.

Two paths in: events versus connectors.

OpenMetadata has two ingest paths. (1) Connectors that crawl source systems (Snowflake INFORMATION_SCHEMA, Tableau REST API, etc.) and push entity records via REST. (2) OpenLineage events that arrive via the OL endpoint and get converted into Pipeline entities + lineage edges. Many teams use both — connectors for the entity inventory, OL for the runtime lineage.

Common interview probes on the stack.

  • "Is OpenLineage a database?" — no. It is a wire format. Storage is the backend's job.
  • "Can I use OpenLineage without a catalog?" — yes. Marquez gives you lineage-only without the wider catalog surface.
  • "Can I use OpenMetadata without OpenLineage?" — yes. Connectors alone populate the catalog; lineage will then be limited to whatever the connectors infer from query history.
  • "Why not DataHub then?" — usually a tie. DataHub's metadata-event model is more event-native; OpenMetadata's connector library is broader. Pick by ecosystem fit, not by logo.

Worked example — sketching the stack as data flow

Detailed explanation. Drawing the four-layer stack with concrete tools at each layer is the fastest way to internalise where each project sits. The picture makes "OpenLineage versus OpenMetadata" stop being a question.

Question. Sketch a four-layer stack diagram for a team running Airflow, dbt, and Spark that wants both runtime lineage and a searchable catalog. Identify which projects sit at which layer and which transport carries events between them.

Input.

| Tool | Role |
|---|---|
| Airflow | orchestration |
| dbt | transformation in warehouse |
| Spark | external transformation |
| Snowflake | warehouse |
| Marquez | wanted as lineage UI |
| OpenMetadata | wanted as catalog UI |

Code.

LAYER 1 — Emitters
  Airflow (OL plugin)  dbt (OL adapter)  Spark (OL listener)

LAYER 2 — Wire format
  OpenLineage event (run, job, dataset, facets) over HTTP
    Endpoint: OPENLINEAGE_URL = http://oltransport:5000

LAYER 3 — Backends
  Marquez (lineage UI + Postgres)
  OpenMetadata (catalog UI + Elasticsearch + Postgres)
  Both receive events via a fan-out proxy or a composite transport in the OL client

LAYER 4 — Consumers
  Marquez UI for "trace the job"
  OpenMetadata UI for "find the table, owner, tags"
  Custom Slack bot subscribed to FAIL events for on-call

Step-by-step explanation.

  1. Each emitter is configured once to point at the OpenLineage transport URL. The team does not have to know which backends are subscribed downstream.
  2. The transport is HTTP by default; Kafka is the production choice when you want backpressure and durability between emitters and backends.
  3. The fan-out happens at the transport layer or with a small proxy (often a single FastAPI service) that POSTs each incoming event to every configured backend.
  4. Marquez and OpenMetadata coexist happily. They consume the same OL events but render different parts of the metadata graph — Marquez focuses on the lineage graph; OpenMetadata adds catalog, glossary, and quality on top.

Output (the stack table).

| Layer | Tool | What it does |
|---|---|---|
| 1 | Airflow, dbt, Spark | emit OL events on every run |
| 2 | OpenLineage | JSON over HTTP transport |
| 3 | Marquez, OpenMetadata | persist + render |
| 4 | Marquez UI, OpenMetadata UI, Slack bot | humans and downstream alerting |

Rule of thumb. Draw this four-layer stack on a whiteboard before you write any code. The teams that get OL adoption wrong almost always conflated layer 2 with layer 3 ("we're going to use OpenLineage as our catalog") or layer 3 with layer 4 ("we'll just point everyone at Marquez UI").

Worked example — OpenMetadata's two ingest paths side by side

Detailed explanation. OpenMetadata accepts metadata via connectors (pull from source) and via OpenLineage events (push from emitter). Each path fills a different slot in the graph, and most teams need both for a complete picture.

Question. A platform team wants analytics.fct_orders in the OpenMetadata UI with its schema, owner, tags, and a lineage graph that shows the dbt model writing it. Outline which ingest path supplies which fields, and the order in which the paths should run.

Input.

| Asset | Source of truth |
|---|---|
| Table schema (columns, types) | Snowflake INFORMATION_SCHEMA |
| Owner, tags, description | OpenMetadata UI + glossary |
| Lineage edge "dbt → fct_orders" | dbt run-time |
| Last-refreshed timestamp | dbt run-time |

Code.

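# 1) Connector ingest: runs as an Airflow DAG every hour
source:
  type: snowflake
  serviceName: warehouse_prod
  serviceConnection:
    config:
      type: Snowflake
      account: acct                    # Snowflake account identifier, not a host:port
      username: openmetadata_ro
      password: <secret>               # placeholder; inject from a secrets manager
      warehouse: <warehouse>           # placeholder
      database: ANALYTICS
  sourceConfig:
    config:
      type: DatabaseMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: https://openmetadata.example.com/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <ingestion-bot-token>  # placeholder

# 2) OpenLineage ingest: dbt runs through the dbt-ol wrapper, which emits OL events
# Point the OL client at the endpoint OpenMetadata exposes for OL events:
# OPENLINEAGE_URL=https://openmetadata.example.com
# OPENLINEAGE_ENDPOINT=/api/v1/openlineage
# (Endpoint path as given in this guide; check your OpenMetadata version, since the
#  documented OpenLineage connector consumes events from a Kafka topic.)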

Step-by-step explanation.

  1. The Snowflake connector enumerates every table in ANALYTICS, reads INFORMATION_SCHEMA for column types, and pushes Table entities into OpenMetadata. fct_orders appears in the UI but without lineage edges yet.
  2. The dbt OL emitter fires on every dbt run and POSTs OL events to OpenMetadata's /api/v1/openlineage endpoint. OpenMetadata converts each event into a Pipeline entity and creates lineage edges from inputs to outputs.
  3. After both paths have run, fct_orders appears in the UI with its full schema and the upstream edge from the dbt Pipeline. The user adds the owner and tags manually (or by API) — those metadata are catalog-native and not in any source system.
  4. Order matters: ideally the connector runs first so that the Table entity exists before the OL event tries to create the lineage edge. If the order is reversed, OpenMetadata creates a placeholder Table from the OL dataset reference and fills in the real schema on the next connector pass, so both orders converge on the same end state.
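The ordering behaviour described in step 4 can be modelled with a toy in-memory catalog. This is a sketch of the described behaviour only, not OpenMetadata's code; the `catalog` structure and the helpers `ensure_table`, `apply_connector`, and `apply_lineage_event` are all hypothetical.

```python
def ensure_table(catalog, fqn):
    """Return the table record, creating a schema-less placeholder if it is missing."""
    return catalog.setdefault(fqn, {"columns": None, "upstream": set(), "placeholder": True})


def apply_connector(catalog, fqn, columns):
    """Connector path: fill in the real schema and clear the placeholder flag."""
    table = ensure_table(catalog, fqn)
    table["columns"] = dict(columns)
    table["placeholder"] = False


def apply_lineage_event(catalog, job, outputs):
    """Event path: attach the producing job as an upstream edge of each output."""
    for fqn in outputs:
        ensure_table(catalog, fqn)["upstream"].add(job)


columns = {"order_id": "NUMBER", "amount": "NUMBER"}
table_name = "analytics.fct_orders"

# Preferred order: connector first, then the lineage event.
connector_first = {}
apply_connector(connector_first, table_name, columns)
apply_lineage_event(connector_first, "dbt.run_fct_orders", [table_name])

# Reversed order: the event creates a placeholder that the connector fills in later.
events_first = {}
apply_lineage_event(events_first, "dbt.run_fct_orders", [table_name])
was_placeholder = events_first[table_name]["placeholder"]
apply_connector(events_first, table_name, columns)

print(was_placeholder, connector_first == events_first)
```

The reversed order only costs a window during which the table shows lineage but no schema.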

Output (assembled OpenMetadata UI panel).

| Field | Value | Source |
|---|---|---|
| Name | analytics.fct_orders | Snowflake connector |
| Columns + types |  | Snowflake connector |
| Tags | PII::masked, Domain::Finance | Manual + glossary |
| Lineage upstream | dbt.run_fct_orders | OpenLineage event |
| Last refresh | timestamp | OpenLineage event |

Rule of thumb. Run the connector hourly (or on a metadata-change CDC if available); run OpenLineage continuously (per task). Mixing the two cadences gives you both static asset inventory and live runtime lineage at the cost each path implies.

Data engineering interview question on partitioning the stack between OL and OM

A senior interviewer might say: "Walk me through which problems you would solve with OpenLineage and which with OpenMetadata if you were designing a metadata platform from scratch in 2026."

Solution Using a layered responsibility split

# Stack ownership matrix
LAYER                     OWNER PROJECT             ARTIFACT
emitters (per tool)       OpenLineage integration   one OL event per run
wire format               OpenLineage spec          JSON event with facets
transport                 OL HTTP / Kafka client    POST / produce
durable store             OpenMetadata or DataHub   Postgres + Elasticsearch
catalog entities          OpenMetadata schemas      Table, Pipeline, Dashboard
search + UI               OpenMetadata UI           browse, search, lineage view
governance                OpenMetadata Glossary     terms, classifications, PII
data quality              OM Test Suite             test cases + results entity
runtime lineage           OL events ingested by OM  edges populated from facets
freshness alerts          downstream consumer       Slack bot or vendor receiver

Step-by-step trace.

| Concern | OpenLineage | OpenMetadata |
|---|---|---|
| run/job/dataset events | OL spec | consumes |
| schema + classifications | dataset facet (per event) | first-class entity |
| glossary + business terms | n/a | Glossary entity |
| lineage graph storage | n/a | yes |
| catalog search UI | n/a | yes |
| connectors to BI / Kafka | n/a | yes |
| extensible custom metadata | facets | extension API |
| transport multicast | yes (HTTP / Kafka) | n/a |

The split makes the role of each project clear: OL owns the protocol and the runtime events; OM owns the application, the catalog entities, and the user experience. They meet at the OpenLineage endpoint where OM consumes events.

Output:

| Decision | Project |
|---|---|
| What format do my emitters speak? | OpenLineage |
| Where do events live for the long term? | OpenMetadata (or DataHub) |
| Where do humans browse the catalog? | OpenMetadata UI |
| Where do I add a glossary or PII tags? | OpenMetadata |
| Which project do I configure transport on? | OpenLineage client |

Why this works — concept by concept:

  • Separation of concerns — wire formats and applications evolve on different cadences; coupling them slows both. The four-layer stack is the architecture pattern that makes the metadata platform sustainable.
  • Backend portability — by treating OL as the protocol, you can replace Marquez with OpenMetadata, OpenMetadata with DataHub, or DataHub with a vendor without changing a single emitter.
  • Catalog ownership — OpenMetadata owns the entities (Table, Pipeline, Dashboard, MLModel, Glossary, Tag) and the policies that govern them; OL contributes the lineage edges between those entities.
  • Custom metadata via facets — anything you cannot express in the core OL schema goes into a custom facet. The receiver chooses whether to surface it. No forking required.
  • Transport choices — HTTP for simple setups, Kafka for high-volume production stacks where you want durability and replayability between emitters and backends.
  • Cost — protocol design plus catalog design happens once; daily operations are O(events) and dominated by Postgres + Elasticsearch in the backend. The architecture itself is cheap; the content is where the value lives.
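The backend-portability point can be sketched in a few lines of Python. This is a minimal illustration rather than the OpenLineage client: `Emitter` and the two in-memory "backends" are hypothetical names, and a transport is modelled as any callable that accepts a serialized event.

```python
import json
from typing import Callable, Dict, List

# A "transport" is any callable that accepts a serialized event.
# Real deployments would use the OpenLineage HTTP or Kafka transport instead.
Transport = Callable[[str], None]


class Emitter:
    """Hypothetical emitter: knows the event shape, not the backend."""

    def __init__(self, transports: List[Transport]) -> None:
        self.transports = transports

    def emit(self, event: Dict) -> None:
        payload = json.dumps(event, sort_keys=True)
        for send in self.transports:  # fan out to every configured backend
            send(payload)


# Two stand-in backends that simply record what they receive.
marquez_inbox: List[str] = []
openmetadata_inbox: List[str] = []

event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "analytics", "name": "fct_orders"},
    "run": {"runId": "c8b3-2026-06-15-01"},
}

# Swapping or adding a backend changes configuration only, never the event.
Emitter([marquez_inbox.append]).emit(event)
Emitter([marquez_inbox.append, openmetadata_inbox.append]).emit(event)
```

Both inboxes end up holding byte-identical payloads, which is the property that lets a second backend be added without touching any emitter.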



3. The OpenLineage event model

run, job, dataset, facets — four nouns that capture every transformation in your stack, and the column-level facet is where the modern open data catalog gets its impact-analysis superpower

The mental model in one line: every OpenLineage event is a tuple (run, job, inputs, outputs, facets) describing one execution attempt of one unit of work. Once you can name the four nouns and the four everyday run states (START, COMPLETE, FAIL, ABORT; the spec also defines RUNNING and OTHER), the entire OL spec collapses to "fill in the right facets for your use case."

Central rounded card labelled 'OpenLineage event' surrounded by four satellite entity cards labelled run, job, dataset, and facets, with a thin ring of run-state pills (START, COMPLETE, FAIL, ABORT) around the run card, on a light PipeCode card.

The four core entities.

  • run — a single execution attempt. Has a runId (UUID) plus optional facets for parent run, nominal time, error message.
  • job — the unit of work itself, independent of any single execution. Identified by (namespace, name). The job is stable across runs; runs come and go.
  • dataset — an input or output of the job. Identified by (namespace, name). Examples: (warehouse, raw.orders), (s3, bucket-name/path/prefix).
  • facets — optional, extensible blocks of typed metadata attached to runs, jobs, or datasets. The whole spec is extended through facets, not by changing the core schema.
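One way to make the four nouns concrete is to model them as small Python types. This is a sketch for intuition only: the class and function names are hypothetical (they are not the OpenLineage Python client), and the event dict omits timing and producer fields to stay short.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
import uuid


@dataclass(frozen=True)
class Job:
    namespace: str
    name: str


@dataclass(frozen=True)
class Dataset:
    namespace: str
    name: str


@dataclass
class Run:
    # A fresh UUID per execution attempt; the job identity stays the same.
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    facets: Dict[str, Any] = field(default_factory=dict)


def to_event(event_type: str, run: Run, job: Job,
             inputs: List[Dataset], outputs: List[Dataset]) -> Dict[str, Any]:
    """Assemble the (run, job, inputs, outputs) tuple into an event-shaped dict."""
    def ds(d: Dataset) -> Dict[str, Any]:
        return {"namespace": d.namespace, "name": d.name, "facets": {}}

    return {
        "eventType": event_type,
        "run": {"runId": run.run_id, "facets": run.facets},
        "job": {"namespace": job.namespace, "name": job.name, "facets": {}},
        "inputs": [ds(d) for d in inputs],
        "outputs": [ds(d) for d in outputs],
    }


job = Job("analytics", "fct_orders")
first = to_event("START", Run(), job, [Dataset("warehouse", "raw.orders")], [])
second = to_event("START", Run(), job, [Dataset("warehouse", "raw.orders")], [])
```

`first` and `second` share the same job block but carry different run ids, which is the "jobs are stable, runs come and go" distinction in code.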

Run states.

  • START — the run has begun. Receivers create an open run record.
  • COMPLETE — the run finished successfully. Receivers close the run and finalise edges.
  • FAIL — the run failed. Edges may be marked attempted; downstream consumers can alert.
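  • ABORT — the run was killed (timeout, manual stop). Treated like FAIL by most receivers but the cause is different.
  • The spec also defines RUNNING (a progress update on a run that is still open, common for streaming jobs) and OTHER (extra metadata outside the normal lifecycle). The four states above are the ones you will handle most often.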
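How a receiver folds these states into one record per run can be sketched as follows. The policy shown (a run that has reached a terminal state is never reopened) is an assumption made for illustration; real backends may handle late or duplicate events differently.

```python
from typing import Dict, Iterable, Tuple

TERMINAL = {"COMPLETE", "FAIL", "ABORT"}


def reconcile(events: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Fold (runId, eventType) pairs into the latest known state per run.

    A run that has reached a terminal state is not reopened by a late START.
    """
    state: Dict[str, str] = {}
    for run_id, event_type in events:
        if state.get(run_id) in TERMINAL:
            continue  # ignore out-of-order or duplicate events after closing
        state[run_id] = event_type
    return state


states = reconcile([
    ("r-1", "START"), ("r-2", "START"),
    ("r-1", "COMPLETE"), ("r-2", "FAIL"),
    ("r-1", "START"),  # late duplicate, ignored
    ("r-3", "START"),  # still open
])
```

Here `r-1` stays COMPLETE despite the late START, `r-2` is FAIL, and `r-3` remains an open run.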

Standard facets you will use every day.

  • schemaFacet — attached to a dataset; lists columns and types. Lets a receiver know the shape of the data at the moment of the event.
  • sourceFacet — attached to a dataset; identifies the physical storage system (Snowflake, S3, Kafka topic). Helps backends group datasets by source.
  • sqlFacet — attached to a job; the exact SQL text the job ran. Powers query-level lineage for SQL engines.
  • columnLineageFacet — attached to an output dataset; maps each output column to the input columns it was derived from. The single most valuable facet for impact analysis.
  • dataQualityFacet — attached to a dataset; expected/actual stats (row count, null ratio, distinct count). Powers freshness and quality observability.
  • ownershipFacet — attached to a job or dataset; team or person responsible. Lets receivers route alerts.
  • parentRunFacet — attached to a run; reference to a parent run (e.g. an Airflow DAG run that contains a dbt task run). Lets the graph render hierarchically.

How Airflow, Spark, dbt, and Flink emit events.

  • Airflow. The OL Airflow plugin instruments every task. Each task emits a START on pre_execute and a COMPLETE / FAIL on post_execute. Operator-specific extractors fill in inputs and outputs (e.g. SnowflakeOperator knows what tables the SQL touches).
  • dbt. The OL dbt adapter wraps dbt run. After each model materialises, it emits an event with inputs (refs) and outputs (the model's relation). The sqlFacet carries the compiled SQL; the columnLineageFacet is derived from dbt's manifest.
  • Spark. The OL Spark listener hooks into the SparkSession. On each query execution, it walks the logical plan to extract input and output dataset references and emits a START / COMPLETE pair.
  • Flink. The OL Flink integration emits per-job events with stream sources and sinks as input / output datasets. Useful for keeping the streaming side of the graph aligned with the batch side.
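The common pattern across these integrations (START when work begins, COMPLETE or FAIL when it ends) can be sketched with a context manager. This is not how any of the real integrations are implemented; `lineage_run` and the `emitted` list are hypothetical stand-ins for a hook and a transport.

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone
from typing import Dict, Iterator, List

emitted: List[Dict] = []  # stand-in for a real transport


def _event(event_type: str, run_id: str, namespace: str, name: str) -> Dict:
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": name},
    }


@contextmanager
def lineage_run(namespace: str, name: str) -> Iterator[str]:
    """Emit START on entry, then COMPLETE or FAIL depending on the outcome."""
    run_id = str(uuid.uuid4())
    emitted.append(_event("START", run_id, namespace, name))
    try:
        yield run_id
    except Exception:
        emitted.append(_event("FAIL", run_id, namespace, name))
        raise  # never swallow the task's own error
    else:
        emitted.append(_event("COMPLETE", run_id, namespace, name))


with lineage_run("airflow", "nightly.load_orders"):
    pass  # the task body would run here

try:
    with lineage_run("airflow", "nightly.broken_task"):
        raise ValueError("boom")
except ValueError:
    pass
```

Each pair of events shares one run id, which is what lets a receiver match the terminal event to its START.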

Column-level lineage via the columnLineage facet.

The facet maps each output column to a list of (input dataset, input column, transformation type) tuples. Receivers render this as a column-level graph in the lineage UI. For a SQL job, the facet is computed by SQL-parsing the query plan (sqlglot, Calcite, or the engine's native parser). For a dbt model, the facet can be derived from dbt's manifest and ref() graph.

Custom facets — when and how.

  • When. You have metadata that does not fit the standard facets but is useful to your platform. Examples: a securityClassificationFacet, a costFacet (compute units consumed), a lineageQualityFacet (confidence score).
  • How. Declare a JSON Schema for the facet under a unique URI (e.g. https://your-org.com/openlineage/cost.json). Emit it inline. Receivers either render it or ignore it — no breaking changes either way.

The wire format itself in one paragraph.

Every event is a JSON object. The spec requires eventTime, producer, schemaURL, run.runId, job.namespace, and job.name; eventType is expected on every run lifecycle event, and inputs[] and outputs[] are optional. Facets sit under run.facets, job.facets, or per-dataset facets. The schema is versioned via the top-level schemaURL. Receivers ignore facets they do not understand, which makes spec evolution painless.
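A lightweight pre-flight check on the event envelope might look like the sketch below. It covers only a subset of fields (eventTime, the run and job identity, and the eventType value) and is not a substitute for validating against the published JSON Schema; the helper names are hypothetical.

```python
from typing import Any, Dict, List

KNOWN_EVENT_TYPES = {"START", "RUNNING", "COMPLETE", "ABORT", "FAIL", "OTHER"}
REQUIRED_PATHS = ["eventTime", "run.runId", "job.namespace", "job.name"]


def _lookup(obj: Dict[str, Any], dotted: str) -> Any:
    """Walk a dotted path through nested dicts; return None if any hop is missing."""
    cur: Any = obj
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


def problems(event: Dict[str, Any]) -> List[str]:
    """List what is wrong with the envelope; an empty list means it looks sane."""
    found = [f"missing {p}" for p in REQUIRED_PATHS if _lookup(event, p) in (None, "")]
    if event.get("eventType") not in KNOWN_EVENT_TYPES:
        found.append("unknown eventType")
    return found


good = {
    "eventType": "COMPLETE",
    "eventTime": "2026-06-15T01:08:24.000Z",
    "run": {"runId": "c8b3-2026-06-15-01"},
    "job": {"namespace": "analytics", "name": "fct_orders"},
}
bad = {
    "eventType": "DONE",
    "eventTime": "2026-06-15T01:08:24.000Z",
    "run": {},
    "job": {"namespace": "analytics", "name": "fct_orders"},
}
good_report = problems(good)
bad_report = problems(bad)
```

Running a check like this in the emitter's CI is cheap and catches malformed envelopes before they reach a backend.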

Common interview probes on the event model.

  • "What is the difference between a job and a run?" — the job is the recipe; the run is one execution attempt. Multiple runs share a job; runs are immutable post-completion.
  • "Can I emit OL without inputs and outputs?" — yes (the event still describes the run), but the lineage edge is empty. You lose the main reason to emit at all.
  • "How does OL handle streaming jobs that never complete?" — the job emits a START at startup and then periodic RUNNING events (often aligned with checkpoints) against the same long-lived run. The run stays open until the job stops, at which point a COMPLETE, ABORT, or FAIL closes it. The Flink integration follows this convention.
  • "What stops a facet from being misused?" — JSON Schema validation. Every facet declares its schema via _schemaURL, so receivers can validate the payload and drop or quarantine malformed facets; how strictly they do so varies by backend.

Worked example — anatomy of a single dbt OL event

Detailed explanation. Reading one real event end-to-end is the fastest way to internalise the spec. The example below is the COMPLETE event for a dbt model that joins raw.orders to raw.customers and writes analytics.fct_orders.

Question. Annotate the event with which field powers which UI feature. Identify the identity fields (eventType, eventTime, run.runId, job.namespace, job.name), the input/output datasets, the SQL facet, the schema facet, and the column-lineage facet.

Input.

| Field | Value |
|---|---|
| job | analytics.fct_orders |
| run id | c8b3-2026-06-15-01 |
| inputs | raw.orders, raw.customers |
| outputs | analytics.fct_orders |

Code.

{
  "eventType": "COMPLETE",
  "eventTime": "2026-06-15T01:08:24.000Z",
  "run": { "runId": "c8b3-2026-06-15-01" },
  "job": {
    "namespace": "analytics",
    "name": "fct_orders",
    "facets": {
      "sql": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/dbt",
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SqlJobFacet.json",
        "query": "select o.order_id, o.amount, c.country from raw.orders o join raw.customers c using (customer_id)"
      }
    }
  },
  "inputs": [
    { "namespace": "warehouse", "name": "raw.orders",
      "facets": {
        "schema": { "fields": [
          {"name": "order_id", "type": "BIGINT"},
          {"name": "customer_id", "type": "BIGINT"},
          {"name": "amount", "type": "NUMERIC"}
        ]}
      }
    },
    { "namespace": "warehouse", "name": "raw.customers",
      "facets": {
        "schema": { "fields": [
          {"name": "customer_id", "type": "BIGINT"},
          {"name": "country", "type": "STRING"}
        ]}
      }
    }
  ],
  "outputs": [
    { "namespace": "warehouse", "name": "analytics.fct_orders",
      "facets": {
        "schema": { "fields": [
          {"name": "order_id", "type": "BIGINT"},
          {"name": "amount", "type": "NUMERIC"},
          {"name": "country", "type": "STRING"}
        ]},
        "columnLineage": {
          "fields": {
            "order_id": { "inputFields": [
              {"namespace": "warehouse", "name": "raw.orders", "field": "order_id"}
            ]},
            "amount":   { "inputFields": [
              {"namespace": "warehouse", "name": "raw.orders", "field": "amount"}
            ]},
            "country":  { "inputFields": [
              {"namespace": "warehouse", "name": "raw.customers", "field": "country"}
            ]}
          }
        }
      }
    }
  ],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.20.0/integration/dbt"
}

Step-by-step explanation.

  1. The fields eventType, eventTime, run.runId, job.namespace, and job.name identify which run of which job the event describes. Receivers reconcile START + COMPLETE pairs by runId. The runId shown here is shortened for readability; real emitters send a UUID.
  2. inputs[] and outputs[] declare the lineage edge. The two inputs (raw.orders, raw.customers) feed the single output (analytics.fct_orders). Marquez and OpenMetadata draw this as two arrows into one node.
  3. The sqlFacet on the job carries the compiled SQL. Catalog UIs render it as a clickable code block; impact-analysis tools can SQL-parse it to derive column lineage when the emitter does not provide it natively.
  4. The schemaFacet on each dataset lists columns and types. Catalog UIs render it as the table's schema panel at the moment of the run.
  5. The columnLineageFacet on the output dataset is the high-value payload: it maps output.order_id to inputs.raw.orders.order_id, output.amount to inputs.raw.orders.amount, and output.country to inputs.raw.customers.country. Downstream "if I drop country from raw.customers, what breaks?" queries traverse this map.
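The "what breaks if I drop this column?" traversal can be sketched by inverting the facet into a reverse index. The snippet uses a trimmed copy of the output dataset from the event above; the function name and the tuple-based column key are illustrative choices, not part of the spec.

```python
from typing import Dict, List, Tuple

# Trimmed copy of the output dataset from the event above.
output_dataset = {
    "namespace": "warehouse",
    "name": "analytics.fct_orders",
    "facets": {"columnLineage": {"fields": {
        "order_id": {"inputFields": [
            {"namespace": "warehouse", "name": "raw.orders", "field": "order_id"}]},
        "amount": {"inputFields": [
            {"namespace": "warehouse", "name": "raw.orders", "field": "amount"}]},
        "country": {"inputFields": [
            {"namespace": "warehouse", "name": "raw.customers", "field": "country"}]},
    }}},
}

Column = Tuple[str, str, str]  # (namespace, dataset name, column)


def build_reverse_index(datasets: List[Dict]) -> Dict[Column, List[Column]]:
    """Map each input column to the output columns derived from it."""
    index: Dict[Column, List[Column]] = {}
    for ds in datasets:
        fields = ds.get("facets", {}).get("columnLineage", {}).get("fields", {})
        for out_col, spec in fields.items():
            target = (ds["namespace"], ds["name"], out_col)
            for src in spec.get("inputFields", []):
                key = (src["namespace"], src["name"], src["field"])
                index.setdefault(key, []).append(target)
    return index


index = build_reverse_index([output_dataset])
broken = index.get(("warehouse", "raw.customers", "country"), [])
join_key_hits = index.get(("warehouse", "raw.customers", "customer_id"), [])
```

`broken` contains the `country` column of `analytics.fct_orders`. Note that `join_key_hits` is empty: the facet in this event lists only directly projected columns, so a join key such as `customer_id` would not be flagged by this traversal unless the emitter also records it.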

Output (UI features powered by this event).

| UI feature | Field used |
|---|---|
| Run timeline | run.runId + START / COMPLETE pair |
| Lineage graph | inputs[], outputs[] |
| Schema panel | dataset schemaFacet |
| Compiled SQL viewer | job sqlFacet.query |
| Column-level lineage view | output columnLineageFacet.fields |
| Producer attribution | top-level producer |

Rule of thumb. When integrating a new tool, start with the mandatory fields plus schemaFacet. Add sqlFacet next (cheap to capture). columnLineageFacet last — it is the most valuable but the most work to compute correctly.

Worked example — Airflow → dbt → Spark chained run via parent facets

Detailed explanation. Real production lineage usually crosses tools. A scheduled Airflow DAG runs a dbt step that calls a Spark job. Each tool emits its own OL event; the chain is reconstructed via the parentRunFacet. The result is a single hierarchical graph spanning all three tools.

Question. Sketch the three OL events emitted when Airflow's DAG nightly schedules a dbt task dbt_run which kicks off Spark job etl_orders. Show the parent-run references that link the graph.

Input.

| Tool | Run id | Parent run id |
|---|---|---|
| Airflow DAG nightly | a-001 | (none) |
| dbt task dbt_run | d-001 | a-001 |
| Spark job etl_orders | s-001 | d-001 |

Code.

// 1) Airflow DAG-level event
{
  "eventType": "START",
  "run": { "runId": "a-001" },
  "job": { "namespace": "airflow", "name": "nightly" }
}

// 2) dbt task event, parent = airflow DAG
{
  "eventType": "START",
  "run": {
    "runId": "d-001",
    "facets": {
      "parent": {
        "run": { "runId": "a-001" },
        "job": { "namespace": "airflow", "name": "nightly" }
      }
    }
  },
  "job": { "namespace": "analytics", "name": "dbt_run" }
}

// 3) Spark job event, parent = dbt task
{
  "eventType": "START",
  "run": {
    "runId": "s-001",
    "facets": {
      "parent": {
        "run": { "runId": "d-001" },
        "job": { "namespace": "analytics", "name": "dbt_run" }
      }
    }
  },
  "job": { "namespace": "etl", "name": "etl_orders" }
}

Step-by-step explanation.

  1. Airflow emits the outer event for the DAG run with id a-001. This becomes the top-level node in the lineage graph.
  2. dbt's emitter knows it was invoked from inside an Airflow task — the integration reads the OPENLINEAGE_PARENT_* environment variables to construct the parentRunFacet, pointing back to a-001.
  3. Spark's emitter, when invoked from a dbt python model or external task, similarly reads the parent context and constructs a facet pointing back to d-001.
  4. Receivers reconstruct the tree: a-001 is the root; d-001 is a child of a-001; s-001 is a grandchild via d-001. The Marquez and OpenMetadata UIs render this hierarchically with collapsible sub-runs.
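The receiver-side reconstruction in step 4 can be sketched as a walk up the parent references. The events below are the three from the example with only the fields the walk needs; `parent_of` and `depths` are hypothetical helper names.

```python
from typing import Dict, List, Optional

events = [
    {"run": {"runId": "a-001"}, "job": {"namespace": "airflow", "name": "nightly"}},
    {"run": {"runId": "d-001", "facets": {"parent": {"run": {"runId": "a-001"}}}},
     "job": {"namespace": "analytics", "name": "dbt_run"}},
    {"run": {"runId": "s-001", "facets": {"parent": {"run": {"runId": "d-001"}}}},
     "job": {"namespace": "etl", "name": "etl_orders"}},
]


def parent_of(event: Dict) -> Optional[str]:
    """Return the parent run id from the parent facet, or None for a root run."""
    parent = event["run"].get("facets", {}).get("parent")
    return parent["run"]["runId"] if parent else None


def depths(evts: List[Dict]) -> Dict[str, int]:
    """Return the depth of every run in the tree (roots are depth 0)."""
    parents = {e["run"]["runId"]: parent_of(e) for e in evts}
    result: Dict[str, int] = {}
    for run_id in parents:
        depth, cur, seen = 0, run_id, set()
        # Stop at a root, at a parent we never saw an event for, or on a cycle.
        while parents.get(cur) is not None and cur not in seen:
            seen.add(cur)
            cur = parents[cur]
            depth += 1
        result[run_id] = depth
    return result


tree = depths(events)
```

Depth 0, 1, and 2 correspond to the root, child, and grandchild levels of the graph.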

Output (graph structure).

| Level | Run id | Job |
|---|---|---|
| root | a-001 | airflow.nightly |
| child | d-001 | analytics.dbt_run |
| grandchild | s-001 | etl.etl_orders |

Rule of thumb. Always propagate parent run context through environment variables when one tool launches another. Without it, the graph fragments into disconnected islands and the "what triggered this job?" question becomes hard to answer.

Worked example — emitting a custom facet for compute cost

Detailed explanation. Sometimes the standard facets do not cover a metric your team needs. The OL spec lets you declare a custom facet under your own URI. The receiver either renders it or ignores it — both are safe behaviours.

Question. Define a computeCostFacet that carries CPU seconds and dollar cost for each run, and attach it to a Spark COMPLETE event. Show the facet payload and the receiver's options for displaying it.

Input.

| Field | Value |
|---|---|
| CPU seconds | 124.5 |
| Estimated cost | USD 0.42 |
| Cluster id | spark-cluster-prod-01 |

Code.

{
  "eventType": "COMPLETE",
  "run": {
    "runId": "s-001",
    "facets": {
      "computeCost_dataeng_example_com": {
        "_producer": "https://github.com/dataeng-example/openlineage-cost",
        "_schemaURL": "https://dataeng.example.com/openlineage/computeCost.json",
        "cpu_seconds": 124.5,
        "estimated_cost_usd": 0.42,
        "cluster_id": "spark-cluster-prod-01"
      }
    }
  },
  "job": { "namespace": "etl", "name": "etl_orders" }
}

Step-by-step explanation.

  1. The facet key is namespaced (computeCost_dataeng_example_com) so it cannot collide with any future standard facet.
  2. The mandatory facet fields _producer and _schemaURL let receivers identify the source and validate the payload. Receivers that do not recognise the schema simply ignore the facet — no breaking change.
  3. The payload itself is arbitrary JSON conforming to the schema at _schemaURL. The schema lives in your org's repo and is referenced by URL — receivers can fetch and validate at runtime, or trust the producer.
  4. Receivers like OpenMetadata render unknown facets either as raw JSON in a "raw facets" panel or, with a custom plugin, as a typed widget. Marquez stores them in its facets table for later querying.
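Both sides of this contract can be sketched briefly: a producer helper that stamps the two bookkeeping fields onto a payload, and a receiver that renders what it knows and keeps the rest as raw JSON. The `KNOWN_FACETS` set and both function names are assumptions made for the example, not any backend's actual behaviour.

```python
from typing import Any, Dict

KNOWN_FACETS = {"parent", "nominalTime", "errorMessage"}  # what this toy receiver renders


def custom_facet(producer: str, schema_url: str, **payload: Any) -> Dict[str, Any]:
    """Wrap a payload with the two bookkeeping fields every facet carries."""
    return {"_producer": producer, "_schemaURL": schema_url, **payload}


def split_facets(run_facets: Dict[str, Dict]) -> Dict[str, Dict[str, Dict]]:
    """Separate facets the receiver understands from ones it only stores."""
    rendered = {k: v for k, v in run_facets.items() if k in KNOWN_FACETS}
    raw = {k: v for k, v in run_facets.items() if k not in KNOWN_FACETS}
    return {"rendered": rendered, "raw": raw}


run_facets = {
    "parent": {"run": {"runId": "d-001"}},
    "computeCost_dataeng_example_com": custom_facet(
        "https://github.com/dataeng-example/openlineage-cost",
        "https://dataeng.example.com/openlineage/computeCost.json",
        cpu_seconds=124.5,
        estimated_cost_usd=0.42,
        cluster_id="spark-cluster-prod-01",
    ),
}
view = split_facets(run_facets)
```

The unknown facet lands in `view["raw"]` intact, so nothing is lost even though this receiver has no widget for it.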

Output (UI surfaces).

| Receiver | Treatment |
|---|---|
| Marquez | persisted in facets table; queryable via REST |
| OpenMetadata | rendered in raw facets panel; surfaced via custom widget if installed |
| Monte Carlo | ignored (does not know the schema) |
| Custom Spark cost dashboard | consumes via Kafka feed; renders as a chart |

Rule of thumb. Use custom facets sparingly. If three teams independently invent the same facet, lobby for it to become a standard. The OpenLineage community has accepted multiple originally-custom facets into the spec over the past two years.

Data engineering interview question on minimum-viable lineage instrumentation

A senior interviewer might frame this as: "You have an existing Airflow + dbt + Spark stack and zero lineage today. Walk me through the smallest first deployment of OpenLineage that delivers useful lineage in two weeks."

Solution Using emitters-first, single-backend rollout

WEEK 1
- Stand up Marquez via docker-compose in staging.
  Postgres + the Marquez API + the marquez-web container.
- Install the OL Airflow plugin on the staging Airflow.
  Set OPENLINEAGE_URL=http://marquez:5000.
- Verify lineage events arrive by running one staging DAG.
- Add the dbt OL adapter to the dbt project.
  Run `dbt build` against staging; confirm events appear.

WEEK 2
- Add the OL Spark listener to staging Spark cluster config.
  spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
- Run the most important production DAG once in staging
  with realistic data; capture the full lineage graph.
- Promote OL configuration to production for one team's pipelines
  with feature flag, monitor Marquez for two days.
- Plan week 3 for OpenMetadata or DataHub as second consumer.

Step-by-step trace.

| Day | Action | Outcome |
|---|---|---|
| 1 | docker-compose up | Marquez running locally |
| 2 | install Airflow OL plugin | events arriving from staging Airflow |
| 3 | install dbt OL adapter | events arriving from staging dbt |
| 4 | run end-to-end staging DAG | lineage graph visible in Marquez UI |
| 5–7 | iterate, fix missing extractors | graph passes peer review |
| 8 | enable Spark listener | Spark jobs join the graph |
| 9 | flag one prod team | prod lineage flowing |
| 10–14 | monitor, fix gaps | stable for one team |

By week three, the team can choose to layer OpenMetadata or DataHub as a second consumer of the same OL events without touching the emitters. The migration cost is configuring the new backend's OL endpoint, not re-instrumenting the pipelines.
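For the "verify lineage events arrive" steps, a hand-rolled smoke test is often enough. The sketch below only builds the HTTP request so it can be inspected offline; it assumes Marquez's `/api/v1/lineage` path, a hypothetical producer URI, and an assumed spec version in `schemaURL`. Other backends may expose a different path, so treat the URL suffix as configuration.

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone


def lineage_request(base_url: str, namespace: str, job_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying a minimal START event."""
    event = {
        "eventType": "START",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "producer": "https://example.com/smoke-test",  # hypothetical producer URI
        # Assumed spec version; match it to the client version you deploy.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = lineage_request("http://marquez:5000", "staging", "smoke_test")
# To actually send it against a reachable backend:
#   urllib.request.urlopen(req, timeout=5)
```

Because the base URL is the only backend-specific input, the same script can be pointed at a second consumer later by changing one argument.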

Output:

| Milestone | Week |
|---|---|
| Marquez running, Airflow events flowing | 1 |
| dbt events flowing | 1 |
| Spark events flowing | 2 |
| One prod team fully instrumented | 2 |
| Second backend (OpenMetadata) as additional consumer | 3 |

Why this works — concept by concept:

  • Marquez first, catalog second — Marquez is the cheapest credible OL backend. Standing it up validates the emitters before you spend weeks on OpenMetadata schemas and connectors.
  • Per-tool integration — each emitter (Airflow, dbt, Spark) plugs into its native lifecycle. No code changes to pipelines; the integration owns the event generation.
  • Feature flag in prod — emitter overhead is small but real (one HTTP call per task). Roll out by team so any regression is contained.
  • Two-week MVP — the metric that matters is "first useful lineage graph visible to humans." Everything beyond that (column lineage, facets, glossary) layers on without touching the foundation.
  • Backend swap is cheap — by week three, switching from Marquez to OpenMetadata is "point the OL URL at the new endpoint." This is exactly the portability the standard buys you.
  • Cost — staging infra plus ~3 engineer-weeks for the MVP; prod onboarding is incremental per team thereafter.



4. OpenMetadata architecture and entity model

OpenMetadata is a catalog application with three layers — Ingestion, Metadata Server, UI — and a unified entity model spanning tables, dashboards, pipelines, topics, and ML models

The mental model in one line: OpenMetadata is "a single catalog DB plus a REST API plus a UI plus a connector framework," and every metadata concern (lineage, governance, quality, glossary, classification) is a first-class entity in that DB. Once you internalise that "everything is an entity," the API surface and the UI both make obvious sense.

Three-layer architecture diagram of OpenMetadata — top layer Ingestion (Airflow DAGs and connector tiles), middle layer Metadata Server (REST API + Elasticsearch + relational DB), bottom layer UI (Search, Lineage, Glossary, Quality), with thin connecting arrows, on a light PipeCode card.

Three-layer architecture.

  • Ingestion. Connectors run as Airflow DAGs (or as standalone Python apps) and push entity records into the metadata server. The ingestion framework is open and pluggable — adding a connector is writing a Python class that conforms to the source / sink interface.
  • Metadata server. The heart. A Java service (Dropwizard) exposing a REST API; backed by Postgres or MySQL for storage and Elasticsearch (or OpenSearch) for search. Defines the entity schemas, the policies, and the lineage graph queries.
  • UI. A React app that calls the REST API. Renders entity pages, search, the lineage graph, the glossary, data quality results, and admin pages.

The unified entity model.

OpenMetadata models everything as an entity. The same patterns (versioning, tagging, ownership, lineage) apply across:

  • Database, Schema, Table. Tables across Snowflake, BigQuery, Postgres, MySQL, etc., live as Table entities under a DatabaseService → Database → DatabaseSchema → Table hierarchy.
  • Pipeline. Airflow DAGs, dbt projects, Dagster pipelines — each becomes a Pipeline entity. Lineage edges connect Pipelines to Tables (read / write).
  • Dashboard, Chart. Looker / Tableau / Metabase dashboards become Dashboard entities, with each tile or chart as a Chart sub-entity.
  • Topic. Kafka / Pulsar / Kinesis topics become Topic entities with schema and ownership.
  • MLModel, Container. ML models (MLflow / SageMaker) become MLModel entities; storage containers (S3 / GCS / Azure) become Container entities.
  • Glossary, GlossaryTerm. Business vocabulary lives in Glossary and GlossaryTerm entities, which can be linked as tags on any other entity.
  • Tag, Classification. PII tags, data sensitivity classifications, and domain tags all live as Tag entities under Classification parents.
  • TestSuite, TestCase, TestCaseResult. Data quality is first-class: TestCase definitions and their run results are entities that the UI renders alongside the table.

Ingestion framework in detail.

  • Connectors. One per source (Snowflake, BigQuery, Postgres, MySQL, Trino, Redshift, Tableau, Looker, PowerBI, Kafka, Airflow, dbt, MLflow). Each connector reads from the source via its native API and yields OpenMetadata entity records.
  • Workflow types. Metadata (entities + schema), Lineage (edges from query history), Profiler (column stats), Data Quality (test runs), Usage (query history for popularity), dbt (parses manifest.json), Application Settings (admin).
  • Scheduling. Workflows run as Airflow DAGs that come pre-bundled with OpenMetadata's ingestion image. Production teams typically point them at their own Airflow.

The metadata server's data model.

  • Postgres stores entity rows, versions, and relationships.
  • Elasticsearch stores the search index for each entity type plus the autocomplete index.
  • REST API at /api/v1/* exposes every entity type. Filtering, search, and lineage queries all live here.

UI features.

  • Search. Full-text plus typed filters (entity type, service, tier, owner, tag).
  • Lineage graph. Bidirectional graph view with table-level and column-level depth controls.
  • Glossary. Hierarchical business vocabulary; terms can be assigned to tables, columns, dashboards.
  • Data quality. Test results render inline with each table; failing tests can route to Slack.
  • Profiling. Column-level statistics (null %, distinct %, distributions) computed by the Profiler workflow.
  • Roles, policies. Fine-grained access — who can read / edit / delete which entity types.
  • PII tagging. Auto-classification of columns based on data and naming patterns; manual override via the UI.

How OL events flow into OpenMetadata.

OpenMetadata exposes an OpenLineage endpoint at /api/v1/openlineage. Each arriving event is translated into:

  • A Pipeline entity (created if absent, looked up by namespace + name).
  • Lineage edges from the listed input datasets to the listed output datasets.
  • Column-level lineage edges if the event carries a columnLineageFacet.
  • Pipeline status entries reflecting the run's success or failure.

The translation is convention-driven (dataset namespace warehouse maps to the DatabaseService named warehouse_prod, etc.) and configurable via the connector settings.

Self-hosted vs Collate.

  • Self-hosted. Docker / Kubernetes Helm chart; you operate Postgres, Elasticsearch, and the metadata server. Cost is infra plus part-time platform-engineering work.
  • Collate. Commercial managed offering from the same team. Hosted multi-tenant; eliminates the operational burden in exchange for per-asset pricing similar to other vendors.

Common interview probes on OpenMetadata.

  • "What is the difference between a Table and a Topic entity?" — both are dataset-like, but Table maps to a relational warehouse and Topic to an event-stream; lineage edges treat them the same.
  • "Where does PII classification come from?" — automatic classifiers run during the metadata or profiler workflow; manual overrides via the UI. Both produce Tag entities attached to the column.
  • "How does OpenMetadata handle column-level lineage?" — column edges live as part of the Table entity; the UI renders them as a sub-graph inside the lineage panel.
  • "Can OpenMetadata be the source of truth for ownership?" — yes — the ownership field on each entity is canonical and propagates to downstream alerting via webhooks.

Worked example — modeling a Snowflake table with full metadata

Detailed explanation. Walking through the entity payload for one table makes the model concrete. Below is the JSON shape stored for analytics.fct_orders after the connector ingests it.

Question. Build the Table entity for warehouse_prod.ANALYTICS.fct_orders with three columns, an owner team, a Finance domain tag, a PII tag on one column, and a glossary term link. Identify which fields are connector-supplied and which are user-curated.

Input.

| Field | Value | Source |
|---|---|---|
| Name | fct_orders | Snowflake |
| Database | WAREHOUSE_PROD.ANALYTICS | Snowflake |
| Columns | order_id, amount, customer_email | Snowflake |
| Owner team | analytics-eng | Manual |
| Domain tag | Domain.Finance | Manual |
| PII tag on customer_email | PII.Sensitive | Auto-classifier |
| Glossary term | Finance.GMV linked | Manual |

Code.

{
  "name": "fct_orders",
  "fullyQualifiedName": "warehouse_prod.ANALYTICS.fct_orders",
  "service": "warehouse_prod",
  "database": "WAREHOUSE_PROD",
  "databaseSchema": "ANALYTICS",
  "columns": [
    {"name": "order_id", "dataType": "BIGINT"},
    {"name": "amount",   "dataType": "NUMERIC"},
    {"name": "customer_email",
     "dataType": "STRING",
     "tags": [{"tagFQN": "PII.Sensitive", "labelType": "Automated"}]}
  ],
  "owner": {"id": "team-analytics-eng", "type": "team"},
  "tags": [{"tagFQN": "Domain.Finance", "labelType": "Manual"}],
  "glossaryTerms": [{"id": "gt-finance-gmv", "type": "glossaryTerm"}]
}

Step-by-step explanation.

  1. The Snowflake connector populates name, fullyQualifiedName, service, database, databaseSchema, and the column list with names and types. Connector-supplied fields are versioned and re-synced on every ingestion run.
  2. The auto-classifier (part of the metadata workflow) inspects column names and sample data. It tags customer_email with PII.Sensitive and records labelType: Automated so reviewers can distinguish auto from manual labels.
  3. A platform admin (or a steward in the UI) assigns the team owner. Owner propagates to all downstream alerts: failing tests, freshness violations, and OpenLineage FAIL events route to the team.
  4. The domain tag Domain.Finance and the glossary term link are manual. They make the table discoverable via filtered search ("show me every Finance table") and tie business vocabulary to physical assets.
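The split between connector-supplied and user-curated fields implies a merge rule on every re-sync. The sketch below is one plausible rule, not OpenMetadata's actual implementation: the `CONNECTOR_OWNED` set and `resync` function are hypothetical, and column tags are carried over by column name.

```python
from typing import Any, Dict

CONNECTOR_OWNED = {"name", "fullyQualifiedName", "service", "database",
                   "databaseSchema", "columns"}


def resync(stored: Dict[str, Any], from_connector: Dict[str, Any]) -> Dict[str, Any]:
    """Refresh connector-owned fields while leaving curated fields untouched."""
    merged = dict(stored)
    for key in CONNECTOR_OWNED:
        if key in from_connector:
            merged[key] = from_connector[key]
    # Column-level tags are curated or auto-classified, so carry them over by name.
    old_tags = {c["name"]: c.get("tags", []) for c in stored.get("columns", [])}
    merged["columns"] = [
        {**col, "tags": old_tags.get(col["name"], [])}
        for col in merged.get("columns", [])
    ]
    return merged


stored = {
    "name": "fct_orders",
    "columns": [
        {"name": "order_id", "dataType": "BIGINT"},
        {"name": "amount", "dataType": "NUMERIC"},
        {"name": "customer_email", "dataType": "STRING",
         "tags": [{"tagFQN": "PII.Sensitive", "labelType": "Automated"}]},
    ],
    "owner": {"id": "team-analytics-eng", "type": "team"},
    "tags": [{"tagFQN": "Domain.Finance", "labelType": "Manual"}],
}
fresh = {
    "columns": [
        {"name": "order_id", "dataType": "BIGINT"},
        {"name": "amount", "dataType": "DECIMAL"},    # type changed upstream
        {"name": "customer_email", "dataType": "STRING"},
        {"name": "currency", "dataType": "STRING"},   # new column
    ],
}
result = resync(stored, fresh)
```

After the re-sync the new type and the new column appear, while the owner, the domain tag, and the PII tag on `customer_email` survive.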

Output (rendered Table entity page).

| Panel | Content |
|---|---|
| Schema | order_id BIGINT, amount NUMERIC, customer_email STRING (PII) |
| Owner | analytics-eng |
| Tags | Domain.Finance, PII.Sensitive (on column) |
| Glossary | Finance.GMV |
| Lineage | upstream from dbt.fct_orders, downstream to BI |
| Quality | last 3 test runs and freshness metric |

Rule of thumb. Let the connector own everything mechanical (names, types, sizes, freshness timestamps); let humans own everything contextual (owner, domain, glossary). Auto-classifiers sit in the middle — let them propose, let stewards approve.

Worked example — converting an OL event into an OpenMetadata Pipeline + lineage

Detailed explanation. When an OpenLineage event arrives at OpenMetadata's /api/v1/openlineage endpoint, the server converts it into one Pipeline entity plus lineage edges. Walking the conversion makes the integration tangible.

Question. Trace the conversion for the dbt event from Section 3 (analytics.fct_orders reading raw.orders and raw.customers). Identify which entities are created and which edges are upserted.

Input.

| OL field | Becomes in OM |
|---|---|
| job.namespace + job.name | Pipeline FQN |
| inputs[] | source nodes for edges |
| outputs[] | target nodes for edges |
| columnLineageFacet | column-level edges |
| run.runId + eventType | PipelineStatus entry |

Code.

Incoming OL event
  job = analytics.fct_orders
  inputs = [warehouse.raw.orders, warehouse.raw.customers]
  outputs = [warehouse.analytics.fct_orders]
  columnLineage = {order_id: [raw.orders.order_id], ...}

Conversion
  ensure Pipeline entity "analytics.fct_orders" exists
  ensure Table "warehouse_prod.raw.orders" referenced
  ensure Table "warehouse_prod.raw.customers" referenced
  ensure Table "warehouse_prod.analytics.fct_orders" referenced
  upsert lineage edge: raw.orders -> analytics.fct_orders
  upsert lineage edge: raw.customers -> analytics.fct_orders
  upsert column-level edge: raw.orders.order_id -> analytics.fct_orders.order_id
  upsert column-level edge: raw.orders.amount   -> analytics.fct_orders.amount
  upsert column-level edge: raw.customers.country -> analytics.fct_orders.country
  append PipelineStatus: runId=c8b3-2026-06-15-01, state=Successful

Step-by-step explanation.

  1. The server looks up the Pipeline by (service, namespace, name). If absent, it is created with the OL producer as the service hint. Subsequent events update the same Pipeline rather than create duplicates.
  2. The input and output datasets are mapped to Table entities by FQN convention. Datasets that do not yet exist (because the table connector has not run) are created as placeholder Tables and enriched later when the connector pass arrives.
  3. The lineage edges are upserted. Re-running the same event is idempotent — no duplicate edges. This is critical: every COMPLETE event in production carries the same edges, and the storage must collapse them.
  4. The column lineage facet drives the column-level edges. The UI renders them as a sub-graph inside the table-level edge; users toggle "column lineage" to drill in.
  5. The PipelineStatus entry records the run's outcome with timestamps. The Pipeline page displays a run history; failing runs annotate the connected tables with "last run failed."
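The conversion steps can be sketched with a toy in-memory catalog. Everything here is illustrative: the `Catalog` class, the namespace-to-service mapping, and the FQN format are assumptions, and the real server does considerably more. The point is that storing edges in sets makes repeated events collapse naturally.

```python
from typing import Dict, List, Set, Tuple

SERVICE_FOR_NAMESPACE = {"warehouse": "warehouse_prod"}  # assumed mapping


def table_fqn(namespace: str, name: str) -> str:
    return f"{SERVICE_FOR_NAMESPACE.get(namespace, namespace)}.{name}"


class Catalog:
    """Toy in-memory catalog; sets make edge upserts idempotent."""

    def __init__(self) -> None:
        self.pipelines: Set[str] = set()
        self.table_edges: Set[Tuple[str, str]] = set()
        self.column_edges: Set[Tuple[str, str]] = set()
        self.statuses: List[Tuple[str, str]] = []

    def ingest(self, event: Dict) -> None:
        job = event["job"]
        self.pipelines.add(f'{job["namespace"]}.{job["name"]}')
        for out in event.get("outputs", []):
            target = table_fqn(out["namespace"], out["name"])
            for inp in event.get("inputs", []):
                self.table_edges.add((table_fqn(inp["namespace"], inp["name"]), target))
            fields = out.get("facets", {}).get("columnLineage", {}).get("fields", {})
            for col, spec in fields.items():
                for src in spec["inputFields"]:
                    source = table_fqn(src["namespace"], src["name"]) + "." + src["field"]
                    self.column_edges.add((source, f"{target}.{col}"))
        # Statuses are appended, not upserted: each run adds one history entry.
        if event.get("eventType") in {"COMPLETE", "FAIL", "ABORT"}:
            self.statuses.append((event["run"]["runId"], event["eventType"]))


def make_event(run_id: str) -> Dict:
    src = lambda ds, f: {"namespace": "warehouse", "name": ds, "field": f}
    return {
        "eventType": "COMPLETE",
        "run": {"runId": run_id},
        "job": {"namespace": "analytics", "name": "fct_orders"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"},
                   {"namespace": "warehouse", "name": "raw.customers"}],
        "outputs": [{"namespace": "warehouse", "name": "analytics.fct_orders",
                     "facets": {"columnLineage": {"fields": {
                         "order_id": {"inputFields": [src("raw.orders", "order_id")]},
                         "amount": {"inputFields": [src("raw.orders", "amount")]},
                         "country": {"inputFields": [src("raw.customers", "country")]},
                     }}}}],
    }


catalog = Catalog()
catalog.ingest(make_event("c8b3-2026-06-15-01"))
catalog.ingest(make_event("c8b3-2026-06-16-01"))  # next night's run, same edges
```

After two nightly runs the catalog still holds one pipeline, two table edges, and three column edges, but two status entries.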

Output.

| Entity | Action |
|---|---|
| Pipeline analytics.fct_orders | upserted |
| Table raw.orders | placeholder upserted |
| Table raw.customers | placeholder upserted |
| Table analytics.fct_orders | placeholder upserted |
| Lineage edge (table) | 2 upserted |
| Lineage edge (column) | 3 upserted |
| PipelineStatus | 1 appended |

Rule of thumb. Run the database connectors before expecting OL ingestion to fill the catalog. The connectors give you the entity inventory; OL gives you the lineage edges. Run them in the right order and your catalog is complete on day one.

Worked example — wiring data quality results into the catalog

Detailed explanation. OpenMetadata's TestSuite and TestCase entities make data quality first-class — every table can carry a list of tests, each test has a definition (e.g. "row count > 0"), and each test run produces a TestCaseResult that the UI surfaces inline. The same model accepts results from external tools via REST.

Question. Define a TestSuite for analytics.fct_orders with three tests (row count, distinct customer count, freshness), and show how a test runner posts results to the catalog.

Input.

Test Expectation
row_count_min rows > 0
distinct_customers_min unique customer_id > 100
freshness data updated within 24h

Code.

// 1) Create the TestSuite
POST /api/v1/dataQuality/testSuites
{
  "name": "fct_orders_quality",
  "entity": {"id": "table-id-fct-orders", "type": "table"}
}

// 2) Define a TestCase
POST /api/v1/dataQuality/testCases
{
  "name": "row_count_min",
  "entityLink": "<#E::table::warehouse_prod.ANALYTICS.fct_orders>",
  "testDefinition": "tableRowCountToBeBetween",
  "parameterValues": [
    {"name": "minValue", "value": "1"}
  ],
  "testSuite": "fct_orders_quality"
}

// 3) After running the test, post the result
POST /api/v1/dataQuality/testCases/testResults
{
  "testCaseFQN": "warehouse_prod.ANALYTICS.fct_orders.row_count_min",
  "result": "Success",
  "timestamp": 1718492400000,
  "testResultValue": [{"name": "rowCount", "value": "248913"}]
}

Step-by-step explanation.

  1. The TestSuite is the container for tests on one entity. Each table can have one TestSuite that aggregates its tests; a failure in any test case rolls up to a suite-level health indicator.
  2. The TestCase definition references a testDefinition (a built-in or custom test type) plus parameters. The platform ships a library of definitions like tableRowCountToBeBetween, columnValuesToBeUnique, tableFreshnessSLA, plus a custom SQL test.
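  3. The result is posted by whoever runs the test: OpenMetadata's own profiler workflow, an external dbt test run, a Great Expectations run, or a custom script. The same REST API accepts results from any source, though the exact result endpoint and field names differ between OpenMetadata releases, so check the API reference for the version you run.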
  4. The UI surfaces the latest result inline on the table page, with a colour-coded badge (green / amber / red). Failing tests can trigger webhooks to Slack or PagerDuty via OpenMetadata's alerting system.
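The suite-level roll-up from step 1 and the badge colours from step 4 can be sketched as a small pure function. The mapping of statuses to colours is an assumption for illustration (in particular, treating "Aborted" or an empty suite as amber); `suite_health` is a hypothetical helper, not an OpenMetadata API.

```python
from collections import Counter


def suite_health(results: dict) -> str:
    """Roll per-test statuses up into one suite-level badge colour."""
    counts = Counter(results.values())
    if counts["Failed"] > 0:
        return "red"      # any failing case fails the suite
    if counts["Aborted"] > 0 or not results:
        return "amber"    # assumed: inconclusive or untested suites are amber
    return "green"


latest = {
    "row_count_min": "Success",
    "distinct_customers_min": "Success",
    "freshness": "Failed",
}
print(suite_health(latest))
```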

Output (table page UI).

| Test | Last result | Last run |
| --- | --- | --- |
| row_count_min | Success | 2026-06-15 03:00 |
| distinct_customers_min | Success | 2026-06-15 03:01 |
| freshness | Failed | 2026-06-15 03:02 |

Rule of thumb. Treat the test results as another lineage signal — a failing freshness test on a source table is exactly the information a downstream consumer needs before reading. Surface them inline in the lineage graph, not on a separate dashboard.
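One way to treat test results as a lineage signal is to walk the graph downstream from every failing table and flag what it feeds. The sketch below assumes table-level edges are available as (source, target) pairs; `downstream_of` and the dashboard name `dash.revenue_daily` are hypothetical.

```python
from collections import deque


def downstream_of(failing: set, edges: list) -> set:
    """Return every asset reachable from a failing asset via lineage edges."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    affected = set()
    queue = deque(failing)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in affected and child not in failing:
                affected.add(child)
                queue.append(child)
    return affected


edges = [
    ("raw.orders", "analytics.fct_orders"),
    ("raw.customers", "analytics.fct_orders"),
    ("analytics.fct_orders", "dash.revenue_daily"),
]
affected = downstream_of({"raw.orders"}, edges)
print(sorted(affected))
```

A failing freshness test on raw.orders would then annotate both the fact table and the dashboard that reads it.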

Data engineering interview question on adopting OpenMetadata across a 50-team org

A senior interviewer might frame this as: "You have OpenMetadata running for one team. How do you scale it to 50 teams without it becoming a dumping ground of stale entities?"

Solution Using domain-scoped ingestion + steward ownership

SCALING PLAN

1. Domain-scope the catalog
   - Each business domain (Finance, Marketing, Product, Platform)
     gets its own DatabaseService prefix and Glossary scope.
   - Tags use Domain.* hierarchy so search is domain-filterable.

2. Steward per domain
   - Every domain nominates a data steward.
   - Stewards own glossary terms, tag policies, and PII reviews
     for assets in their domain.

3. Connector cadence by tier
   - Tier-1 assets (production warehouse, dashboards): hourly
   - Tier-2 (staging, lab): daily
   - Tier-3 (sandboxes): weekly or on-demand
   - Tier classification is itself a Tag entity.

4. Lineage from OL is continuous
   - Airflow + dbt + Spark + Flink emit OL events.
   - Per-team OL endpoints converge in one OM instance.

5. Quality tests gated by tier
   - Tier-1 tables MUST have row_count + freshness + uniqueness
   - Tier-2 SHOULD have at least one custom test
   - Tier-3 OPTIONAL.

6. PII review SLA
   - Auto-classifier proposes; steward approves within 14 days.
   - Unreviewed PII tags flagged on the steward dashboard.

7. Stale asset reaping
   - Assets without ingestion for 30 days auto-archived
     unless explicitly pinned.

Step-by-step trace.

| Step | Owner | Cadence | Output |
| --- | --- | --- | --- |
| 1 domain scope | platform | once | DatabaseService + Glossary roots |
| 2 stewards | data leadership | once + on join | named steward per domain |
| 3 connector cadence | platform + team | continuous | per-tier ingestion DAGs |
| 4 OL emitters | each team | continuous | runtime lineage |
| 5 tier-gated tests | each team | per release | failing tests block deploy |
| 6 PII review | steward | 14-day SLA | tags approved or rejected |
| 7 archive | platform | weekly | clean catalog |

The result is a catalog where every entity has a known owner, a known tier, and a known refresh expectation. Search returns relevant assets first because tier and domain are filterable.

Output:

| Health metric | Target |
| --- | --- |
| Tier-1 coverage by tests | 100% |
| Domain assignment completeness | > 95% |
| Stale entities (no refresh in 30d) | < 2% |
| PII auto-tags unreviewed > 14 days | 0 |
| OL events per minute (steady state) | proportional to pipeline count |

Why this works — concept by concept:

  • Domain scoping: the Domain.* tag hierarchy gives the catalog a top-down structure that mirrors how the org thinks about data, and lets stewards own their slice without blocking each other.
  • Steward ownership — putting humans at the leaf of every policy decision (glossary, PII, classification) is the only way a catalog survives at scale. Auto-classification proposes; humans dispose.
  • Tier-driven cadence — not every asset deserves hourly metadata. Tiering keeps the ingestion pipeline cheap and the catalog signal-to-noise high.
  • Continuous OL ingestion — runtime lineage is the always-fresh part of the graph; static connectors fill in the shape; together they keep the catalog accurate.
  • Stale-asset reaping — a catalog that grows monotonically becomes useless. Archive policies keep search focused on assets that still matter.
  • Cost: connectors scale O(assets); OL events scale O(pipeline runs). Size Postgres + Elasticsearch to those two rates, and budget a fraction of an FTE per few hundred TB of source metadata.
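The stale-asset reaping rule (archive after 30 days without ingestion unless pinned) can be sketched as a filter over asset records. The record shape and the `pinned` flag are assumptions for illustration; a real job would read these from the catalog's API.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # threshold taken from the scaling plan


def assets_to_archive(assets: list, now: datetime) -> list:
    """Return FQNs of unpinned assets with no ingestion inside the window."""
    return sorted(
        a["fqn"]
        for a in assets
        if not a.get("pinned", False) and now - a["last_ingested"] > STALE_AFTER
    )


now = datetime(2026, 6, 15, tzinfo=timezone.utc)
assets = [
    {"fqn": "warehouse_prod.ANALYTICS.fct_orders", "last_ingested": now - timedelta(days=1)},
    {"fqn": "sandbox.tmp.scratch_2025", "last_ingested": now - timedelta(days=90)},
    {"fqn": "sandbox.tmp.reference_rates", "last_ingested": now - timedelta(days=90), "pinned": True},
]
to_archive = assets_to_archive(assets, now)
print(to_archive)
```

Only the unpinned sandbox table is selected; the pinned one survives even though it is equally old.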



5. Interop with proprietary vendors and migration patterns

OpenLineage is the migration off-ramp from Atlan, Collibra, Alation, and Monte Carlo — emit once, route to whichever backend wins this quarter, and use the two-write pattern to stage the cutover

The mental model in one line: as long as OpenLineage events leave your pipelines, the choice of backend is a configuration change, not an architecture change. Once your team can quote that invariant, the conversation with the closed-catalog vendor on renewal day becomes very different — and the migration plan can be incremental rather than Big Bang.

Central OpenLineage hub fanning out via glowing arrows to two columns of receiver cards — open backends (Marquez, OpenMetadata, DataHub) on one side and proprietary vendors (Monte Carlo, Atlan, Collibra) on the other — with a small 'two-write' band overlay illustrating migration, on a light PipeCode card.

Where vendors plug into the OpenLineage event stream.

  • Monte Carlo. Accepts OL events as a lineage input. Layers freshness, volume, and schema-change anomaly detection on top of the same graph your open backend sees.
  • Atlan. Has a documented OL adapter; ingests events into the Atlan graph and renders them inline with vendor-curated metadata.
  • Bigeye. Similar to Monte Carlo — OL events feed the observability layer.
  • Collibra. Accepts OL events for technical lineage; business-glossary side stays inside Collibra's model. Most teams keep Collibra for governance and use OL to keep its lineage panel current.
  • Alation. Accepts OL through a plugin; the business catalog stays vendor-owned while runtime lineage is single-sourced from OL.

Emit OpenLineage from Airflow / dbt / Spark and forward to vendor X.

The integration pattern is identical regardless of which vendor receives:

emitter (Airflow / dbt / Spark)
   |
   v
OPENLINEAGE_URL = http(s)://vendor-endpoint/openlineage
   |
   v
vendor receiver ingests, renders, alerts

The emitter does not know it is talking to a vendor. The vendor does not know it is reading a community-format event. The standard makes both sides plug-and-play.
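What actually travels over that URL is a JSON run event. A minimal sketch of the payload shape, built with the standard library only, is shown here; the producer URI, the dataset namespace `snowflake://prod`, and the pinned `schemaURL` version are assumptions, and in practice the OL client libraries build this for you.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Assemble the core fields of an OpenLineage run event as a dict."""
    return {
        "eventType": event_type,  # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
        # Assumed spec version; pin this to the version your receiver supports.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    "COMPLETE",
    "analytics",
    "fct_orders",
    inputs=[("snowflake://prod", "raw.orders"), ("snowflake://prod", "raw.customers")],
    outputs=[("snowflake://prod", "analytics.fct_orders")],
)
payload = json.dumps(event)  # this string is what gets POSTed to the receiver
print(payload[:80])
```

Because the payload is identical whichever receiver sits behind the URL, swapping the backend stays a configuration change.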

Multi-cast to two or more receivers.

When you want both an open backend and a vendor receiver during a migration, configure multi-cast:

  • Newer OL integrations accept a comma-separated OPENLINEAGE_URL.
  • Older integrations require a small proxy: a single FastAPI service that POSTs each event to N configured URLs.
  • Kafka transport turns multi-cast into "multiple consumer groups on one topic."

This is the two-write pattern: events flow to the old backend and the new one for the duration of the migration, so the new backend builds historical context before you turn the old one off.

Replace a closed catalog with OpenMetadata gradually.

A 90-day migration timeline that has worked for multiple platform teams:

  • Day 1–14. Stand up OpenMetadata in staging. Run connectors against the same sources the old catalog covers. Verify entity completeness against the old catalog's asset list.
  • Day 15–30. Enable OpenLineage emitters in production with multi-cast: events flow to both the old vendor and to OpenMetadata. Both catalogs now show identical runtime lineage.
  • Day 31–60. Migrate business metadata (glossary, ownership, tags) into OpenMetadata. Most vendors have an export API or a CSV bulk download; the import can be scripted via OpenMetadata's REST API.
  • Day 61–80. Switch primary user UI to OpenMetadata. Old vendor stays read-only as a fallback.
  • Day 81–90. Decommission the old vendor. The OL multi-cast configuration drops the vendor endpoint. The renewal is not signed.
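The Day 1–14 completeness check can be sketched as a set comparison between the old catalog's asset list and what the new connectors found. The case-insensitive match is an assumption (warehouses differ on identifier casing), and `completeness_report` is a hypothetical helper.

```python
def completeness_report(old_assets: list, new_assets: list) -> dict:
    """Compare asset inventories from the old and new catalog by normalised name."""
    old_norm = {a.lower() for a in old_assets}
    new_norm = {a.lower() for a in new_assets}
    coverage = 1.0 if not old_norm else len(old_norm & new_norm) / len(old_norm)
    return {
        "missing": sorted(old_norm - new_norm),  # in the old catalog only
        "extra": sorted(new_norm - old_norm),    # found only by the new connectors
        "coverage": coverage,
    }


old_catalog = [
    "PROD.ANALYTICS.FCT_ORDERS",
    "PROD.RAW.ORDERS",
    "PROD.RAW.CUSTOMERS",
    "PROD.RAW.LEGACY_EVENTS",
]
new_catalog = [
    "prod.analytics.fct_orders",
    "prod.raw.orders",
    "prod.raw.customers",
]
report = completeness_report(old_catalog, new_catalog)
print(report)
```

A coverage number below 1.0 with a short "missing" list is a concrete to-do list for connector configuration before the multi-cast phase starts.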

DataHub vs OpenMetadata — when to pick which.

Both are credible open catalogs with active communities and similar feature surface. The choice usually comes down to ecosystem fit.

  • Pick OpenMetadata when — you want a broader out-of-the-box connector library, tighter integration with OpenLineage as a native ingest path, a more polished UI for end-users, or a managed offering (Collate) on the same stack.
  • Pick DataHub when — you want an event-native architecture under the hood (the Metadata Change Event / Metadata Audit Event model on Kafka), strong upstream propagation for downstream services, or your existing stack already has heavy Kafka investment.
  • Either way — OL events flow into both. The wire-format standard means you can change your mind later without re-instrumenting pipelines.

Governance integrations.

  • Glossary and business terms. OpenMetadata models Glossary and GlossaryTerm as entities; terms can be linked to tables, columns, dashboards. DataHub uses the GlossaryNode / GlossaryTerm model. Both let you bulk-import terms from a CSV or an external governance tool.
  • Data classification. Both support hierarchical tags (PII.Sensitive, PII.Email, Finance.Revenue). OpenMetadata's auto-classifier proposes tags; admins approve. DataHub uses Glossary Terms similarly.
  • Access policies. Role + Policy model in both: a Policy lists allowed actions on entity types matched by a rule. Roles bundle policies. Users / Teams are assigned roles.
  • Compliance reporting. Glossary + Tag + Classification combine into a queryable matrix: "show every column tagged PII that touches a Finance domain dashboard." Both catalogs support this via search filters; OpenMetadata also exposes the query as a REST call.
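The compliance query in the last bullet is essentially a graph reachability question. Here is a rough in-memory sketch, assuming column tags, column-level edges, and dashboard metadata have already been pulled from the catalog; all names and the data shapes are hypothetical.

```python
def downstream(col: str, edges: list) -> set:
    """All columns reachable from col through column-level lineage edges."""
    reached, stack = set(), [col]
    while stack:
        cur = stack.pop()
        for src, dst in edges:
            if src == cur and dst not in reached:
                reached.add(dst)
                stack.append(dst)
    return reached


def pii_columns_reaching(domain: str, tags: dict, edges: list, dashboards: dict) -> list:
    """Pairs of (PII column, dashboard) where the dashboard belongs to the domain."""
    hits = []
    for col, col_tags in tags.items():
        if not any(t.startswith("PII.") for t in col_tags):
            continue
        reach = downstream(col, edges) | {col}
        for name, dash in dashboards.items():
            if dash["domain"] == domain and reach & dash["columns"]:
                hits.append((col, name))
    return sorted(hits)


tags = {
    "raw.customers.email": {"PII.Email"},
    "raw.orders.amount": {"Finance.Revenue"},
}
edges = [
    ("raw.customers.email", "analytics.fct_orders.customer_email"),
    ("raw.orders.amount", "analytics.fct_orders.amount"),
]
dashboards = {
    "finance_revenue_daily": {
        "domain": "Finance",
        "columns": {"analytics.fct_orders.customer_email", "analytics.fct_orders.amount"},
    },
    "marketing_funnel": {
        "domain": "Marketing",
        "columns": {"analytics.fct_orders.customer_email"},
    },
}
hits = pii_columns_reaching("Finance", tags, edges, dashboards)
print(hits)
```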

Cost picture — self-hosted vs vendor.

| Stack | Year 1 | Year 3 | Notes |
| --- | --- | --- | --- |
| Closed vendor at $0.50/asset/mo, 10K assets | $60,000 | ~$225K cumulative | Grows with asset count |
| OpenMetadata self-hosted (4 vCPU, 16GB, 200GB DB) | ~$25K infra + 0.25 FTE | ~$100K cumulative | Flat-ish; FTE is bulk |
| Collate managed (similar to vendor) | $0.40/asset/mo | similar to vendor | Less ops overhead |
| Vendor receiver (Monte Carlo / Bigeye), additive on top of any catalog | $20–60K/year typical | similar | Pays only for the observability layer, not the catalog |
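The per-asset arithmetic behind the first row can be checked with a few lines. The growth rate is an assumption introduced here for illustration: the table does not state one, and 25% per year is simply a value that lands near its ~$225K three-year figure.

```python
def vendor_cumulative(assets, price_per_asset_month, years, annual_growth=0.0):
    """Cumulative per-asset licence cost at the end of each year."""
    total, by_year = 0.0, []
    for _ in range(years):
        total += assets * price_per_asset_month * 12
        by_year.append(round(total))
        assets *= 1 + annual_growth  # asset count grows before the next year
    return by_year


flat = vendor_cumulative(10_000, 0.50, 3)           # no growth in asset count
growing = vendor_cumulative(10_000, 0.50, 3, 0.25)  # assumed 25% yearly growth
print(flat, growing)
```

Year 1 comes out at $60,000 either way; with no growth the three-year total is $180,000, and with the assumed growth it is $228,750.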

Long-term bets.

  • The OL spec is converging on column-level lineage as the default. Within two years, "OL without column lineage" will be considered a half-instrumented stack.
  • Vendor receivers are becoming OL-first. New observability tools launch with OL ingestion as the recommended path, not as an afterthought.
  • OpenMetadata and DataHub will likely both survive. They serve different architectural tastes; neither is going away.
  • Marquez stays the reference backend. Useful as a sanity check during migrations and as a lightweight first deployment.

Common interview probes on interop and migration.

  • "Can I send OpenLineage to Monte Carlo?" — yes. Configure OPENLINEAGE_URL to Monte Carlo's OL endpoint, or multi-cast.
  • "What is the two-write pattern?" — emit events to both the old and new backend during migration; cut over when the new backend has parity.
  • "How do I migrate business metadata (glossary, owners) into OpenMetadata?" — export from the old vendor (REST or CSV), import via OpenMetadata's REST API. Scriptable in a day for most orgs.
  • "Is column lineage automatic?" — only when the emitter produces the columnLineageFacet. dbt and Spark do; Airflow does for the operators that have extractors; custom Python is on you.
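To make the column-lineage probe concrete, here is a sketch of the facet's shape (each output column maps to a list of input fields) and a tiny impact-analysis lookup over it. The namespace `snowflake://prod` and the column names are hypothetical, and only the `fields` / `inputFields` portion of the facet is shown.

```python
# Trimmed column lineage facet for the output dataset analytics.fct_orders.
column_lineage_facet = {
    "fields": {
        "customer_email": {
            "inputFields": [
                {"namespace": "snowflake://prod", "name": "raw.customers", "field": "email"},
            ]
        },
        "order_total": {
            "inputFields": [
                {"namespace": "snowflake://prod", "name": "raw.orders", "field": "amount"},
                {"namespace": "snowflake://prod", "name": "raw.orders", "field": "tax"},
            ]
        },
    }
}


def outputs_depending_on(facet: dict, dataset: str, column: str) -> list:
    """Output columns that list dataset.column among their input fields."""
    return sorted(
        out_col
        for out_col, spec in facet["fields"].items()
        if any(f["name"] == dataset and f["field"] == column for f in spec["inputFields"])
    )


impacted = outputs_depending_on(column_lineage_facet, "raw.orders", "amount")
print(impacted)
```

If the emitter never produces this facet, there is nothing for the lookup to read, which is why custom Python jobs need to build it themselves.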

Worked example — the two-write pattern in configuration

Detailed explanation. Two-write is the safest migration shape: send every event to both backends, verify parity, then drop the old one. The configuration cost is tiny; the safety it buys is real.

Question. Configure a dbt project to emit OpenLineage events to both Atlan (the old catalog) and OpenMetadata (the new catalog) during a 60-day migration. Show the env vars or proxy required.

Input.

Endpoint Role
Atlan old catalog, read-only by day 60
OpenMetadata new catalog, gaining context

Code.

# Option A — multi-URL (newer OL integrations)
export OPENLINEAGE_URL="https://atlan.example.com/openlineage,https://openmetadata.example.com/api/v1/openlineage"
export OPENLINEAGE_API_KEY_ATLAN="..."
export OPENLINEAGE_API_KEY_OM="..."

# Option B — fan-out proxy (older OL integrations)
# proxy posts every incoming event to both URLs
export OPENLINEAGE_URL="http://ol-proxy.internal:5000"
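The Option B proxy, as a minimal FastAPI service:

# Minimal fan-out proxy (FastAPI), Option B
import asyncio
import logging

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
log = logging.getLogger("ol-fanout")
TARGETS = [
    "https://atlan.example.com/openlineage",
    "https://openmetadata.example.com/api/v1/openlineage",
]

# The OL HTTP transport appends api/v1/lineage to OPENLINEAGE_URL by default,
# so the proxy listens on that path.
@app.post("/api/v1/lineage")
async def fanout(request: Request):
    body = await request.body()
    headers = {"Content-Type": "application/json"}
    async with httpx.AsyncClient(timeout=5.0) as client:
        # Send to all targets concurrently; collect failures instead of raising
        results = await asyncio.gather(
            *(client.post(url, content=body, headers=headers) for url in TARGETS),
            return_exceptions=True,
        )
    for url, result in zip(TARGETS, results):
        if isinstance(result, Exception):
            log.warning("fan-out to %s failed: %r", url, result)  # never block the producer
    return {"status": "ok"}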

Step-by-step explanation.

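  1. Option A is the cleanest path when the OL integration supports multiple endpoints (Airflow OL >= 1.18, dbt OL >= 1.16, Spark OL >= 1.20 with the OpenLineageClient transports config). Multi-endpoint syntax differs between client versions, so confirm it against the documentation for the version you run. Each configured URL receives every event.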
  2. Option B works with any integration. The proxy is a single small FastAPI service. It POSTs each event to every configured target, swallowing per-target failures so the producer never blocks.
  3. The producer's view never changes during the migration. Pipelines do not know they are now talking to two backends; they POST once to the OL URL.
  4. On migration day 60, drop one URL from the list (Option A) or remove one TARGET entry (Option B). No code change anywhere else.

Output (during the migration window).

| Backend | Events received | UI status |
| --- | --- | --- |
| Atlan | 100% | primary (days 0–45), read-only (days 46–60) |
| OpenMetadata | 100% | secondary (days 0–45), primary (days 46–60) |

Rule of thumb. Run the two-write window for at least 30 days. The new backend needs a meaningful history before you trust it as the primary UI.

Worked example — migrating glossary terms from Collibra to OpenMetadata

Detailed explanation. Business metadata does not flow over OpenLineage — it lives in the catalog itself. Migrating it is an export + transform + import job. OpenMetadata's REST API makes the import scriptable.

Question. Migrate 500 Collibra business terms (each with a name, description, and domain) into OpenMetadata as GlossaryTerm entities under a Finance glossary. Show the script outline.

Input.

Field Example
name Gross Merchandise Value
description Total value of goods sold over a period.
domain Finance

Code.

import csv

import requests

OM_URL = "https://openmetadata.example.com/api/v1"
OM_TOKEN = "...JWT..."
HDR = {"Authorization": f"Bearer {OM_TOKEN}",
       "Content-Type": "application/json"}

# 1) Ensure the parent Glossary exists (PUT is create-or-update)
glossary = {"name": "Finance",
            "displayName": "Finance",
            "description": "Finance domain business vocabulary"}
r = requests.put(f"{OM_URL}/glossaries", json=glossary, headers=HDR, timeout=30)
r.raise_for_status()

# 2) For each Collibra term, PUT it as a GlossaryTerm
with open("collibra_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        term = {
            "name": row["name"].replace(" ", "_"),  # no spaces in the FQN part
            "displayName": row["name"],             # original label for the UI
            "description": row["description"],
            "glossary": "Finance"
        }
        r = requests.put(f"{OM_URL}/glossaryTerms", json=term, headers=HDR, timeout=30)
        r.raise_for_status()

Step-by-step explanation.

  1. The Collibra export is a CSV with name, description, domain columns. Standard Collibra "Export Asset List" feature.
  2. The script ensures the parent Glossary entity exists in OpenMetadata. PUT is idempotent — re-running the script does not duplicate the Glossary.
  3. For each row, the script PUTs a GlossaryTerm, which gives create-or-update behaviour; switch to POST if you want create-only semantics. The name field cannot contain spaces in OpenMetadata FQNs; displayName keeps the original.
  4. Each term lands in the Finance Glossary. The terms can now be linked from tables, columns, and dashboards via the UI or programmatically.
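Before any request is sent, the transform step can be dry-run offline. The sketch below normalises names the same way as the script and flags rows that collapse to the same name, which would otherwise silently overwrite each other under PUT. The sample rows beyond "Gross Merchandise Value" are hypothetical, as are the helper names.

```python
import csv
import io
import re


def to_term_payload(row: dict, glossary: str = "Finance") -> dict:
    """Build a GlossaryTerm-style payload from one exported CSV row."""
    display = row["name"].strip()
    return {
        "name": re.sub(r"\s+", "_", display),  # collapse whitespace runs to one underscore
        "displayName": display,
        "description": row["description"].strip(),
        "glossary": glossary,
    }


def build_payloads(csv_text: str):
    """Return (payloads, collisions); collisions are rows whose name was already seen."""
    payloads, seen, collisions = [], set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        payload = to_term_payload(row)
        if payload["name"] in seen:
            collisions.append(payload["displayName"])
            continue
        seen.add(payload["name"])
        payloads.append(payload)
    return payloads, collisions


sample = """name,description,domain
Gross Merchandise Value,Total value of goods sold over a period.,Finance
Gross  Merchandise Value,Duplicate spelling with a double space.,Finance
Net Revenue,Revenue after refunds and discounts.,Finance
"""
payloads, collisions = build_payloads(sample)
print([p["name"] for p in payloads], len(collisions))
```

Handing the collision list to a steward before the import is cheaper than untangling overwritten terms afterwards.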

Output.

Imported Count
Glossary 1 (Finance)
GlossaryTerm 500
Links to tables 0 (next migration phase)

Rule of thumb. Migrate the glossary first, then the table-to-term links. Linking is the part that benefits most from human review — let stewards approve sample links rather than bulk-import them blindly.

Worked example — multi-cast to Marquez, OpenMetadata, and Monte Carlo

Detailed explanation. Some teams want the lineage UI of Marquez (fast to render), the catalog of OpenMetadata (governance), and the observability of Monte Carlo (anomaly detection). The OL standard makes this trivial: each backend is just another URL.

Question. Configure the fan-out proxy to deliver every OL event to Marquez, OpenMetadata, and Monte Carlo. Show the resulting graph experience for the end user.

Input.

Backend Role
Marquez lineage graph UI for engineers
OpenMetadata catalog + glossary + governance
Monte Carlo observability + freshness alerts

Code.

TARGETS = [
    "http://marquez.internal:5000/api/v1/lineage",
    "https://openmetadata.example.com/api/v1/openlineage",
    "https://api.getmontecarlo.com/openlineage",
]
# same fan-out logic as Worked example above

Step-by-step explanation.

  1. The proxy accepts one event per task / model / job and POSTs it to all three URLs in parallel. Latency is bounded by the slowest receiver.
  2. Marquez renders the lineage graph immediately. Engineers use it for "trace the job" deep-dives during incidents.
  3. OpenMetadata creates a Pipeline entity and lineage edges, plus updates the affected tables. Analysts and stewards use this view.
  4. Monte Carlo cross-references the event against learned baselines — table appeared, schema changed, row count dropped. It alerts on anomalies; the alert pages the on-call.
  5. All three views show the same underlying facts because the source-of-truth event is the OL payload from the pipeline.
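The "bounded by the slowest receiver" behaviour in step 1 only holds when the deliveries run concurrently. A network-free sketch with simulated receivers shows the pattern; the delays and the failing receiver are made up for illustration.

```python
import asyncio
import time


async def fake_receiver(name: str, delay: float, fail: bool = False) -> str:
    """Stand-in for an HTTP POST to one backend."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} unavailable")
    return name


async def fan_out():
    started = time.perf_counter()
    # gather runs all deliveries concurrently; return_exceptions keeps one
    # failing receiver from cancelling or masking the others.
    results = await asyncio.gather(
        fake_receiver("marquez", 0.05),
        fake_receiver("openmetadata", 0.10),
        fake_receiver("montecarlo", 0.02, fail=True),
        return_exceptions=True,
    )
    return results, time.perf_counter() - started


results, elapsed = asyncio.run(fan_out())
delivered = [r for r in results if not isinstance(r, Exception)]
print(delivered, f"{elapsed:.2f}s")
```

Elapsed time tracks the slowest simulated receiver rather than the sum of all three, and the failed delivery is reported without affecting the other two.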

Output (per persona).

| Persona | Tool | Reason |
| --- | --- | --- |
| Data engineer | Marquez | clean lineage graph for debugging |
| Analytics engineer | OpenMetadata | catalog browsing, glossary, owners |
| Analyst | OpenMetadata | search, find tables, see freshness |
| Steward | OpenMetadata | governance, PII review |
| On-call | Monte Carlo | freshness / schema-change alerts |

Rule of thumb. The right number of OL consumers is "however many distinct user personas you have, minus the ones whose needs overlap entirely." The marginal cost of adding a receiver is configuration; the marginal value is the persona it serves.

Data engineering interview question on planning a closed-catalog exit

A senior interviewer might frame this as: "Your CFO has asked for a plan to leave Vendor X at renewal in six months. Walk me through it from week one to cutover, in enough detail that the platform team can execute without me."

Solution Using a six-month phased migration plan

MONTH 1 — Stand up
- Deploy OpenMetadata in staging via Helm.
- Configure Postgres + Elasticsearch in dedicated VMs / managed services.
- Run all warehouse connectors (Snowflake, BigQuery, Postgres) once.
- Sanity check: entity count vs vendor's reported asset count.

MONTH 2 — Lineage
- Enable OL emitters on staging Airflow + dbt + Spark.
- Multi-cast OL events to both the vendor and OpenMetadata.
- Verify table-level + column-level lineage parity for top-20 tables.
- Document gaps; file integration bugs upstream where needed.

MONTH 3 — Business metadata
- Export glossary + owner + tags from vendor (CSV or API).
- Script the import into OpenMetadata GlossaryTerm / Tag / owner.
- Stewards review sample of 50 imports; fix mapping issues.

MONTH 4 — Quality and policy
- Define top TestSuite per Tier-1 table.
- Migrate or re-author data quality tests (dbt tests + custom SQL).
- Replicate Role / Policy model — admin / steward / read-only.

MONTH 5 — UX cutover
- Switch internal documentation links from vendor to OpenMetadata.
- Vendor UI moves to read-only mode; team is told "use OM going forward."
- Monitor support tickets, fix UX gaps, train teams.

MONTH 6 — Renewal day
- Drop vendor URL from OL multi-cast.
- Cancel vendor contract.
- Capture lessons learned for the next standards adoption (e.g. DataHub
  as second open option, or a managed Collate as an upgrade path).

Step-by-step trace.

| Month | Headline deliverable | Risk | Mitigation |
| --- | --- | --- | --- |
| 1 | OM running, connectors green | infra sizing | start with x86-large VMs + 200GB Postgres |
| 2 | OL multi-cast in prod | emitter overhead | feature flag per team |
| 3 | Business metadata imported | term mapping errors | steward review sample |
| 4 | Quality tests live | test coverage gaps | tier-gated requirements |
| 5 | UX cutover | user pushback | early demos, training |
| 6 | Vendor decommissioned | sign-off blocking | written acceptance from each domain |

The plan is designed to fail safely: at every month, if the new stack is not ready, the old vendor is still receiving events and serving as the source of truth. Cutover only happens when parity is real, not when the calendar says.

Output:

Month Status
1 OpenMetadata running in staging
2 Lineage two-write in prod
3 Glossary imported, owners assigned
4 Quality tests + roles parity
5 UX cutover, vendor read-only
6 Vendor decommissioned at renewal

Why this works — concept by concept:

  • Two-write everywhere — the migration never has a "one-night switch" risk because events flow to both backends throughout. Either side can be the primary at any moment.
  • Connector-first, lineage-second — entities give you the inventory; OL gives you the edges. Stand them up in that order so the OL graph has nodes to attach to.
  • Steward review of business metadata — automated import handles 80%; humans handle the 20% with judgement calls. Stewards are the only durable defence against junk metadata.
  • Tier-gated quality — every Tier-1 table must have a test suite; lower tiers are optional. This keeps quality investment proportional to business impact.
  • UX cutover before renewal — the team must actively prefer the new UI before renewal day. If they do not, the plan slips by a month — better than slipping the renewal.
  • Cost — six months of platform-engineering attention (~0.5 FTE) plus ~$40K infra annually. The vendor renewal usually exceeds that within a year for any non-trivial asset count.



Cheat sheet — open standards recipes

  • "I want lineage without a catalog yet." OpenLineage emitters in every tool + Marquez as the backend. One docker-compose stack; rendered lineage graph in 30 minutes.
  • "I want a full open catalog." OpenMetadata (broader connector library, polished UI) or DataHub (event-native, Kafka-friendly). Pick by ecosystem fit, not by feature checklist alone.
  • "dbt + Airflow + Spark stack." OpenLineage emitters in all three (dbt OL adapter, Airflow OL plugin, Spark OL listener), single OPENLINEAGE_URL, one backend behind it. Promote per team via feature flag.
  • "Migrating off Collibra / Alation / Atlan." OpenMetadata in parallel; OL multi-cast for 60–90 days; import glossary via REST; cut over once user-facing parity is real.
  • "Need column-level lineage." Enable the columnLineageFacet end-to-end. dbt computes it from its manifest; Spark from query plans; SQL engines via sqlglot. Render in OpenMetadata or DataHub.
  • "Want governance + glossary." OpenMetadata's Glossary + GlossaryTerm + Tag + Classification entities, plus the Role / Policy model. Stewards own approval; auto-classifiers propose.
  • "Need to stream lineage into a vendor." Configure the vendor's OL endpoint as one of the multi-cast targets. Monte Carlo, Bigeye, Atlan, and Collibra all accept OL events.
  • "Production transport — HTTP or Kafka?" HTTP for setups under a few thousand events per minute and one backend. Kafka when you need durability, replay, or multiple downstream consumer groups.
  • "How do I cross tool boundaries?" Use the parentRunFacet. Airflow → dbt → Spark events all carry parent links; receivers reconstruct the hierarchical graph automatically.
  • "Custom metadata that the spec does not cover." Custom facet with your org's URI. Receivers either render it or ignore it. Lobby for promotion to standard if the use case generalises.
  • "OpenMetadata vs DataHub — quick decision." Want the deepest connector library and the most polished UI? OpenMetadata. Want event-native with Kafka under the hood? DataHub. Both accept OL events natively.
  • "Cost back-of-envelope." Closed catalog: ~$0.50/asset/mo, grows linearly. Self-hosted OpenMetadata: ~$2–4K/mo infra (roughly $25–40K/year) + 0.25 FTE. Crossover around 8–10K assets. Add ~$20–60K/year for a vendor observability layer if needed.
  • "What about Marquez in production?" Fine for lineage-only at moderate scale. Lacks the catalog surface (glossary, tags, classification) — pair with OpenMetadata or DataHub if you need those.
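The parent-facet recipe above can be sketched as a small tree builder: each event may carry a `parent` facet under `run.facets` naming the run that launched it, and a receiver groups runs by that link. The run IDs and job names are hypothetical, and the events are trimmed to the fields the sketch reads.

```python
def build_run_tree(events: list) -> dict:
    """Map each parent runId (None for roots) to the child runIds seen in events."""
    children = {}
    for ev in events:
        run_id = ev["run"]["runId"]
        parent = ev["run"].get("facets", {}).get("parent")
        parent_id = parent["run"]["runId"] if parent else None
        siblings = children.setdefault(parent_id, [])
        if run_id not in siblings:  # START and COMPLETE share one runId
            siblings.append(run_id)
    return children


def parent_facet(run_id: str, namespace: str, name: str) -> dict:
    return {"parent": {"run": {"runId": run_id}, "job": {"namespace": namespace, "name": name}}}


events = [
    {"eventType": "START", "run": {"runId": "dag-1"}},
    {"eventType": "START", "run": {"runId": "dbt-1", "facets": parent_facet("dag-1", "airflow", "daily_orders")}},
    {"eventType": "COMPLETE", "run": {"runId": "dbt-1", "facets": parent_facet("dag-1", "airflow", "daily_orders")}},
    {"eventType": "START", "run": {"runId": "spark-1", "facets": parent_facet("dbt-1", "dbt", "fct_orders")}},
]
tree = build_run_tree(events)
print(tree)
```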

Frequently asked questions

Is OpenLineage a catalog?

No — OpenLineage is a wire format for emitting lineage events; it is not a catalog application. It defines the JSON schema (run, job, dataset, facets) and reference clients in Python and Java, but storage and UI are the backend's job. The reference backend is Marquez (Postgres + a minimal lineage UI). For a full catalog you pair OpenLineage with OpenMetadata or DataHub. The most common interview mistake is conflating the standard with a backend — "we'll use OpenLineage as our catalog" is the wrong sentence; "we'll emit OpenLineage and store it in OpenMetadata" is the right one.

Should I use OpenMetadata or DataHub?

Both are credible open catalogs with active communities, similar feature surfaces, and OpenLineage support. Pick OpenMetadata when you want a broader out-of-the-box connector library, a polished end-user UI, native OL ingestion as a first-class path, or a managed offering (Collate) on the same code base. Pick DataHub when you want an event-native architecture with Kafka under the hood, strong upstream propagation to downstream services via the MCE / MAE model, or your existing stack already has heavy Kafka investment. Either way, your OL emitters do not change — you can switch later by pointing the transport at the new endpoint.

Does OpenLineage support column-level lineage?

Yes — the columnLineageFacet is a standard facet that maps each output column to the input columns it was derived from. dbt's OL adapter generates it from the compiled manifest; Spark's listener derives it from query plans; SQL engines via parsers like sqlglot or Calcite can compute it from the SQL text. Receivers (OpenMetadata, DataHub, Marquez) render column-level edges as a sub-graph inside the table-level lineage view. Column-level lineage is the high-value payload for impact analysis ("if I drop column C, what dashboards break?") — make sure your emitters produce the facet end-to-end.

Can I send OpenLineage events to Monte Carlo or Bigeye?

Yes — both vendors document OpenLineage endpoints. Configure OPENLINEAGE_URL to the vendor's OL endpoint (or include it in the comma-separated list for multi-cast) and the vendor receives every event your pipelines emit. Monte Carlo and Bigeye layer freshness, volume, and schema-change anomaly detection on top of the same graph your open backend sees, so you can keep one observability vendor while running an open catalog underneath. Atlan and Collibra also accept OL events for the lineage half of their products. The standard is the shared interface; the vendors compete on UX and analytics, not on data ownership.

Is Marquez production-ready?

For lineage-only workloads at small-to-medium scale (~100K events / day, ~10K datasets), yes. Marquez has been in production at multiple companies since 2019. It is the reference backend for OpenLineage, so spec changes land there first, and the Postgres + REST + minimal UI architecture is easy to operate. Marquez does not include the broader catalog surface (no glossary, no tag classification, no role / policy model). If you need that, pair Marquez with OpenMetadata or DataHub, or skip Marquez entirely and use OpenMetadata as both lineage backend and catalog. Many teams use Marquez during the OL adoption phase (weeks 1–4) and migrate to OpenMetadata as the second consumer once catalog requirements emerge.

How does OpenMetadata compare to Atlan and Collibra?

OpenMetadata and the vendors converge on the same feature set (entity model, lineage graph, glossary, classification, data quality) but diverge on ownership and pricing. With Atlan or Collibra you license the product per asset and the metadata graph lives inside the vendor's database; switching vendors means rebuilding connectors and re-ingesting metadata. With OpenMetadata you self-host (or pay for the managed Collate variant), the metadata DB is yours, and the OL emitters that feed it also feed every other open or vendor receiver. Atlan and Collibra still win on out-of-the-box polish and vendor support; OpenMetadata wins on portability, cost at scale, and the option value of swapping backends without re-instrumenting pipelines. The honest answer is "both are credible; pick the trade-off your platform can actually live with for the next five years."

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every OpenLineage and OpenMetadata recipe above ships with hands-on practice rooms where you wire the emitters, design the entity model, and write the SQL behind the catalog metrics against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your column-lineage facet actually round-trips between Marquez, OpenMetadata, and a vendor receiver in the same way it will on interview day.

Practice ETL design now →
Dimensional modeling drills →
