Introduction
After exploring cloud migration patterns in my current work, I started thinking about one of the next layers: data lineage.
Legacy lineage systems built on top of SQLLineage often struggle with on-premise codebases that lack enforced standards — SQL mixed across legacy scripts and raw .sql files, runtime variables, shared temp tables, duplicate aliases. Accuracy tends to top out around 70%.
The cloud migration changed the equation. A common modern pattern is to use a Spark-based ingestion layer for data landing and dbt for transformation — and with that comes actual standards: consistent naming conventions, declarative SQL, and reproducible artifacts. That's the foundation automated lineage needs. But it also introduced a new challenge: the ingestion layer and dbt are separate tools, and getting end-to-end lineage means both need to emit events in a common format. OpenLineage is designed exactly for this. This series starts with the dbt side: given that manifest.json already captures the dependency graph, what does OpenLineage actually emit?
This series documents my hands-on exploration of openlineage-dbt. In this first post, I'll cover:
- How `dbt-ol` works under the hood (it's not what I initially assumed)
- The anatomy of an OpenLineage event: Job, Run, and Dataset
- What column-level lineage actually looks like in a raw `.ndjson` file
- Why `inputs: []` is empty — and why that matters
The full project is on GitHub: openlineage-dbt
The Setup
Tech Stack
| Component | Tool | Why |
|---|---|---|
| Transformation | dbt-core + dbt-duckdb | Lightweight, no server needed |
| Database | DuckDB | File-based, perfect for local learning |
| Lineage | openlineage-dbt | Official dbt integration |
| Transport | File (`.ndjson`) | Start simple, switch to Marquez later |
Project Architecture
One design decision I made early on: keep the dbt project completely clean of OpenLineage configuration.
openlineage_dbt/
├── jaffle_shop/ ← pure dbt project (deployable independently)
│ ├── dbt_project.yml
│ ├── models/
│ │ ├── staging/ stg_customers, stg_orders, stg_payments
│ │ ├── intermediate/
│ │ └── marts/
│ └── seeds/ raw_customers.csv, raw_orders.csv, raw_payments.csv
├── openlineage/
│ ├── openlineage.yml ← OL config lives here, not inside jaffle_shop/
│ └── events/ ← file transport output (.gitignored)
└── docker/ ← Marquez compose (Milestone 3)
The OpenLineage config is injected via environment variable:
export OPENLINEAGE_CONFIG=../openlineage/openlineage.yml
cd jaffle_shop
uv run dbt-ol run
Why this matters: The dbt project stays pure transformation logic; lineage config is an infrastructure concern. The same dbt project runs unchanged across environments — only the injected config differs. For this learning project, OPENLINEAGE_CONFIG is sufficient. In production, the newer OPENLINEAGE__ double-underscore env var system (e.g., OPENLINEAGE__TRANSPORT__TYPE=http) is the recommended approach — no config file needed at all.
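For comparison, a production-style setup under that scheme would look something like the sketch below. The endpoint URL is a placeholder; only the variable names follow the documented OPENLINEAGE__ double-underscore pattern.

```shell
# Production-style alternative (sketch): configure the transport entirely
# through OPENLINEAGE__ env vars -- no openlineage.yml file at all.
# The URL below is a placeholder, not a real endpoint.
export OPENLINEAGE__TRANSPORT__TYPE=http
export OPENLINEAGE__TRANSPORT__URL=http://localhost:5000
# dbt-ol run   # would pick up the transport config from the environment
```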
Key Learning #1: How dbt-ol Actually Works
My first assumption was that dbt-ol intercepts dbt's execution and emits events in real time — like a plugin hooked into each model run. That turned out to be wrong.
dbt-ol is a post-processing wrapper. Here's what actually happens:
dbt-ol run
│
├── 1. Run dbt normally (identical to `dbt run`)
│ └── produces: manifest.json, run_results.json
│
└── 2. After dbt completes, read the artifacts
└── parse manifest.json + run_results.json
└── emit OpenLineage events to transport
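The post-processing step can be made concrete with a short Python sketch. This is a hypothetical simplification, not the actual openlineage-dbt code, but the artifact keys it reads (`results`, `unique_id`, `nodes`, `depends_on`) are real fields of dbt's artifacts:

```python
import json

def collect_model_results(manifest_path, run_results_path):
    """Pair each executed model with its upstream dependencies by joining
    run_results.json (what ran, with what status) against manifest.json
    (the dependency graph) -- the essence of dbt-ol's post-processing step.
    Simplified sketch, not the real openlineage-dbt implementation."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    with open(run_results_path) as f:
        run_results = json.load(f)

    summaries = []
    for result in run_results["results"]:
        unique_id = result["unique_id"]     # e.g. "model.jaffle_shop.stg_customers"
        node = manifest["nodes"].get(unique_id, {})
        summaries.append({
            "job": unique_id,
            "status": result["status"],     # "success", "error", "skipped", ...
            "depends_on": node.get("depends_on", {}).get("nodes", []),
        })
    return summaries
```

Note that everything here is a file read after the fact: if an artifact is missing or stale, the lineage is too.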
The implications of this design:
- Events are emitted after execution, not during
- All lineage information comes from static artifact parsing, not runtime introspection
- If `dbt run` fails mid-way, only the completed models get events
This is also why catalog.json matters — dbt-ol reads it (if available) to enrich output datasets with schema information (field names and DuckDB types). More on this in a later section.
The Three Artifacts
| Artifact | Produced by | What dbt-ol reads from it |
|---|---|---|
| `manifest.json` | `dbt run` / `dbt compile` | Model graph, compiled SQL, dependencies |
| `run_results.json` | `dbt run` | Execution status, timing per model |
| `catalog.json` | `dbt docs generate` | Column names and types (input schema) |
Running It
# Set config path (relative to execution directory)
export OPENLINEAGE_CONFIG=../openlineage/openlineage.yml
cd jaffle_shop
uv run dbt-ol run
One subtle gotcha: log_file_path in openlineage.yml is resolved relative to where you run dbt-ol (i.e., jaffle_shop/), not relative to the config file itself. So the path needs to account for that:
# openlineage/openlineage.yml
transport:
type: file
log_file_path: ../openlineage/events/events.ndjson # relative to jaffle_shop/
append: true
Key Learning #2: Anatomy of an OpenLineage Event
Running dbt-ol run with 3 staging models produces 8 events in events.ndjson — one JSON object per line (NDJSON format):
Line 1: parent START → the entire dbt run
Line 2: stg_customers START
Line 3: stg_orders START
Line 4: stg_payments START
Line 5: stg_customers COMPLETE
Line 6: stg_orders COMPLETE
Line 7: stg_payments COMPLETE
Line 8: parent COMPLETE → the entire dbt run
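Because each line is an independent JSON object, the file is easy to inspect with a few lines of Python (a quick sketch; the path comes from the transport config above):

```python
import json
from collections import Counter

def summarize_events(ndjson_path):
    """Tally OpenLineage events by eventType in an NDJSON file."""
    counts = Counter()
    with open(ndjson_path) as f:
        for line in f:
            if line.strip():                # skip blank lines
                counts[json.loads(line)["eventType"]] += 1
    return counts

# e.g. summarize_events("../openlineage/events/events.ndjson")
# the 8-event run above tallies to 4 START and 4 COMPLETE
```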
Every event shares the same top-level structure:
{
"eventType": "START",
"eventTime": "2026-02-27T14:55:51.150612+00:00",
"job": { ... },
"run": { ... },
"inputs": [ ... ],
"outputs": [ ... ]
}
The Three Core Entities
Job — what the work is (static, doesn't change between runs)
"job": {
"name": "dev.main.jaffle_shop.stg_customers",
"namespace": "dbt",
"facets": {
"jobType": { "jobType": "MODEL", "integration": "DBT" },
"sql": { "dialect": "duckdb", "query": "with source as (...) select ..." }
}
}
Run — this specific execution instance
"run": {
"runId": "019c9f99-59c5-75bd-...", // UUID v7, new every execution
"facets": {
"parent": {
"job": { "name": "dbt-run-jaffle_shop" },
"run": { "runId": "019c9f99-516e-7d6a-..." } // same across all models in this dbt run
}
}
}
Dataset — the data being read or written
{
"name": "dev.main.stg_customers",
"namespace": "duckdb://dev.duckdb"
}
Why the parent Facet Matters
Every model's run points to the same parent runId — the ID of the overall dbt run. This is what allows Marquez (or any OpenLineage backend) to group all models from a single execution together, and eventually reconstruct the full DAG from a single pipeline run.
dbt-run-jaffle_shop (runId: 019c9f99-516e...)
├── stg_customers (runId: 019c9f99-59c5-75bd..., parent → 019c9f99-516e...)
├── stg_orders (runId: 019c9f99-59c5-7bf9..., parent → 019c9f99-516e...)
└── stg_payments (runId: 019c9f99-59c5-7a35..., parent → 019c9f99-516e...)
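A backend's grouping step is simple to sketch: collect job names under each parent runId. This is a simplified illustration that assumes events shaped like the excerpts above, not Marquez's actual implementation:

```python
from collections import defaultdict

def group_by_parent_run(events):
    """Map each parent runId to the model jobs that ran under it,
    using the run.facets.parent facet carried by every model event."""
    groups = defaultdict(set)
    for event in events:
        parent = event.get("run", {}).get("facets", {}).get("parent")
        if parent:
            groups[parent["run"]["runId"]].add(event["job"]["name"])
    return {run_id: sorted(jobs) for run_id, jobs in groups.items()}
```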
Key Learning #3: Column-Level Lineage in Practice
Column-level lineage (CLL) is where OpenLineage gets interesting. Without any extra configuration, dbt-ol already parses the compiled SQL and produces field-level mappings in outputs[].facets.columnLineage.
Here's stg_customers.sql:
with source as (
select * from {{ ref('raw_customers') }}
)
select
id as customer_id,
first_name,
last_name,
first_name || ' ' || last_name as full_name
from source
And what OpenLineage extracts from it:
"columnLineage": {
"fields": {
"customer_id": {
"inputFields": [{ "field": "id", "name": "source" }]
},
"first_name": {
"inputFields": [{ "field": "first_name", "name": "source" }]
},
"last_name": {
"inputFields": [{ "field": "last_name", "name": "source" }]
},
"full_name": {
"inputFields": [
{ "field": "first_name", "name": "source" },
{ "field": "last_name", "name": "source" }
]
}
}
}
For simple rename and concatenation, the tracking is accurate:
| Output column | Source column(s) | Tracking |
|---|---|---|
| `customer_id` | `id` | ✅ rename detected |
| `first_name` | `first_name` | ✅ pass-through |
| `full_name` | `first_name` + `last_name` | ✅ multi-source detected |
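A useful way to consume this facet is to flatten it into edges, one per contributing source column. A small sketch against the facet shape shown above:

```python
def column_edges(column_lineage_facet):
    """Flatten a columnLineage facet into (input_name, input_field,
    output_field) tuples -- one edge per contributing source column."""
    return [
        (inp["name"], inp["field"], out_field)
        for out_field, spec in column_lineage_facet["fields"].items()
        for inp in spec["inputFields"]
    ]
```

For the stg_customers facet this yields five edges: one each for `customer_id`, `first_name`, and `last_name`, and two feeding `full_name`.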
The CTE Alias Problem
Notice "name": "source" in every inputFields entry. That's the CTE alias — not the actual table name raw_customers. OpenLineage parsed the SQL correctly, but without knowing what source resolves to, the lineage chain is broken at the CTE boundary.
raw_customers → source (CTE) → stg_customers
↑
lineage stops here without catalog.json
This is a limitation of parsing the compiled SQL in isolation: without resolving what `source` points to, the chain stops at the alias. As the catalog.json experiment below shows, adding the catalog does not fix it either; Milestone 2 looks at how this behaves as the SQL gets more complex.
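The break is easy to detect programmatically: compare the names referenced in the facet against datasets you actually know about. A sketch (the known-dataset set is whatever your backend has registered; the names here are illustrative):

```python
def unresolved_input_names(column_lineage_facet, known_datasets):
    """Input names referenced by column lineage that match no known
    dataset -- typically unresolved CTE aliases like 'source'."""
    referenced = {
        inp["name"]
        for spec in column_lineage_facet["fields"].values()
        for inp in spec["inputFields"]
    }
    return sorted(referenced - set(known_datasets))
```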
What's Missing: The Empty inputs[]
Every model event in this run has "inputs": []. The output dataset and its column lineage are present, but there's no record of what was read.
{
"eventType": "COMPLETE",
"inputs": [], // ← empty
"outputs": [{ "name": "dev.main.stg_customers", "facets": { "columnLineage": {...} } }]
}
The reason: these staging models read from seed tables (raw_customers, raw_orders, raw_payments). Seeds are created by dbt seed and exist as real tables in DuckDB, but dbt-ol does not treat them as upstream lineage datasets — they don't appear in inputs[] regardless of what artifacts are present.
What catalog.json actually changes
Running dbt docs generate produces target/catalog.json, which records the actual schema (column names and types) of every table and view in the database as DuckDB sees them.
uv run dbt docs generate # produces target/catalog.json
uv run dbt-ol run # now reads catalog.json alongside manifest.json
With catalog.json present, the output dataset gains a SchemaDatasetFacet:
"outputs": [{
"name": "dev.main.stg_customers",
"facets": {
"columnLineage": { ... },
"schema": {
"fields": [
{ "name": "customer_id", "type": "INTEGER" },
{ "name": "first_name", "type": "VARCHAR" },
{ "name": "last_name", "type": "VARCHAR" },
{ "name": "full_name", "type": "VARCHAR" }
]
}
}
}]
inputs[] remains empty. catalog.json enriches the output side with type information — it does not resolve the upstream seed tables into input datasets.
What actually changes with catalog.json
| Item | Without catalog.json | With catalog.json |
|---|---|---|
| `inputs[]` | `[]` empty | `[]` still empty |
| Output schema facet | absent | present (field names + DuckDB types) |
| `columnLineage` | present | present (unchanged) |
| CTE alias in `inputFields` | `"name": "source"` | `"name": "source"` (unchanged) |
The gap this leaves: there's no dataset-to-dataset edge connecting raw_customers → stg_customers in the emitted events. For this learning project the seeds act as the raw data layer, but in a real pipeline where staging models read from source() references (external tables), dbt-ol would populate inputs[] — that's the scenario where catalog.json matters for input schema enrichment.
This is exactly what Milestone 2 tests: intermediate and mart models that ref staging models (not seeds) — where inputs[] is no longer empty.
Takeaways
1. dbt-ol is a post-processor, not a runtime interceptor
It reads manifest.json and run_results.json after dbt run completes. This means lineage accuracy depends entirely on what's in the artifacts — and catalog.json is what adds output schema (field names and DuckDB types) to the emitted events.
2. The Job/Run/Dataset model maps cleanly onto dbt concepts
Job = dbt model definition. Run = one execution instance. Dataset = a table or view in the database. The parent facet ties all model runs within a single dbt run together, which is what makes pipeline-level lineage possible.
3. Column-level lineage works out of the box for simple SQL
Renames, pass-throughs, and multi-column expressions are all tracked correctly. The CTE alias issue ("name": "source" instead of the actual table name) is a known parser limitation — catalog.json does not resolve it. The inputFields still reference the CTE alias even after dbt docs generate.
4. Keep lineage config separate from the dbt project
Keeping openlineage.yml outside the dbt project enforces a clean boundary: dbt handles transformation, lineage config is infrastructure. OPENLINEAGE_CONFIG works well for local learning. In production, the newer OPENLINEAGE__ double-underscore env var system (e.g., OPENLINEAGE__TRANSPORT__TYPE=http) is the recommended path — each value injected directly, no config file required. The separation habit transfers either way.
What's Next
In Milestone 2, I'll add intermediate and mart models with more complex SQL (CTEs + JOINs + window functions) and examine how column-level lineage accuracy degrades as SQL complexity increases.