Introduction
After exploring cloud migration patterns in my current work, I started thinking about one of the next layers: data lineage.
Legacy lineage systems built on top of SQLLineage often struggle with on-premise codebases that lack enforced standards — SQL mixed across legacy scripts and raw .sql files, runtime variables, shared temp tables, duplicate aliases. Accuracy tends to top out around 70%.
The cloud migration changed the equation. A common modern pattern is to use a Spark-based ingestion layer for data landing and dbt for transformation — and with that comes actual standards: consistent naming conventions, declarative SQL, and reproducible artifacts. That's the foundation automated lineage needs. But it also introduced a new challenge: the ingestion layer and dbt are separate tools, and getting end-to-end lineage means both need to emit events in a common format. OpenLineage is designed exactly for this. This series starts with the dbt side: given that manifest.json already captures the dependency graph, what does OpenLineage actually emit?
This series documents my hands-on exploration of openlineage-dbt. In this first post, I'll cover:
- How `dbt-ol` works under the hood (it's not what I initially assumed)
- The anatomy of an OpenLineage event: Job, Run, and Dataset
- What column-level lineage actually looks like in a raw `.ndjson` file
- Why `inputs: []` is empty — and why that matters
The full project is on GitHub: openlineage-dbt
The Setup
Tech Stack
| Component | Tool | Why |
|---|---|---|
| Transformation | dbt-core + dbt-duckdb | Lightweight, no server needed |
| Database | DuckDB | File-based, perfect for local learning |
| Lineage | openlineage-dbt | Official dbt integration |
| Transport | File (`.ndjson`) | Start simple, switch to Marquez later |
Project Architecture
One design decision I made early on: keep the dbt project completely clean of OpenLineage configuration.
openlineage_dbt/
├── jaffle_shop/ ← pure dbt project (deployable independently)
│ ├── dbt_project.yml
│ ├── models/
│ │ ├── staging/ stg_customers, stg_orders, stg_payments
│ │ ├── intermediate/
│ │ └── marts/
│ └── seeds/ raw_customers.csv, raw_orders.csv, raw_payments.csv
├── openlineage/
│ ├── openlineage.yml ← OL config lives here, not inside jaffle_shop/
│ └── events/ ← file transport output (.gitignored)
└── docker/ ← Marquez compose (Milestone 3)
The OpenLineage config is injected via environment variable:
export OPENLINEAGE_CONFIG=../openlineage/openlineage.yml
cd jaffle_shop
uv run dbt-ol run
Why this matters: The dbt project stays pure transformation logic; lineage config is an infrastructure concern. The same dbt project runs unchanged across environments — only the injected config differs. For this learning project, OPENLINEAGE_CONFIG is sufficient. In production, the newer OPENLINEAGE__ double-underscore env var system (e.g., OPENLINEAGE__TRANSPORT__TYPE=http) is the recommended approach — no config file needed at all.
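For comparison, a production-style setup under that scheme would look something like the sketch below. The endpoint URL is a placeholder; only the variable names follow the documented OPENLINEAGE__ double-underscore pattern.

```shell
# Production-style alternative (sketch): configure the transport entirely
# through OPENLINEAGE__ env vars -- no openlineage.yml file at all.
# The URL below is a placeholder, not a real endpoint.
export OPENLINEAGE__TRANSPORT__TYPE=http
export OPENLINEAGE__TRANSPORT__URL=http://localhost:5000
# dbt-ol run   # would pick up the transport config from the environment
```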
Key Learning #1: How dbt-ol Actually Works
My first assumption was that dbt-ol intercepts dbt's execution and emits events in real time — like a plugin hooked into each model run. That turned out to be wrong.
dbt-ol is a post-processing wrapper. Here's what actually happens:
dbt-ol run
│
├── 1. Run dbt normally (identical to `dbt run`)
│ └── produces: manifest.json, run_results.json
│
└── 2. After dbt completes, read the artifacts
└── parse manifest.json + run_results.json
└── emit OpenLineage events to transport
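The post-processing step can be made concrete with a short Python sketch. This is a hypothetical simplification, not the actual openlineage-dbt code, but the artifact keys it reads (`results`, `unique_id`, `nodes`, `depends_on`) are real fields of dbt's artifacts:

```python
import json

def collect_model_results(manifest_path, run_results_path):
    """Pair each executed model with its upstream dependencies by joining
    run_results.json (what ran, with what status) against manifest.json
    (the dependency graph) -- the essence of dbt-ol's post-processing step.
    Simplified sketch, not the real openlineage-dbt implementation."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    with open(run_results_path) as f:
        run_results = json.load(f)

    summaries = []
    for result in run_results["results"]:
        unique_id = result["unique_id"]     # e.g. "model.jaffle_shop.stg_customers"
        node = manifest["nodes"].get(unique_id, {})
        summaries.append({
            "job": unique_id,
            "status": result["status"],     # "success", "error", "skipped", ...
            "depends_on": node.get("depends_on", {}).get("nodes", []),
        })
    return summaries
```

Note that everything here is a file read after the fact: if an artifact is missing or stale, the lineage is too.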
The implications of this design:
- Events are emitted after execution, not during
- All lineage information comes from static artifact parsing, not runtime introspection
- If `dbt run` fails mid-way, only the completed models get events
This is also why catalog.json matters — dbt-ol reads it (if available) to enrich output datasets with schema information (field names and DuckDB types). More on this in a later section.
The Three Artifacts
| Artifact | Produced by | What dbt-ol reads from it |
|---|---|---|
| `manifest.json` | `dbt run` / `dbt compile` | Model graph, compiled SQL, dependencies |
| `run_results.json` | `dbt run` | Execution status, timing per model |
| `catalog.json` | `dbt docs generate` | Column names and types (input schema) |
Running It
# Set config path (relative to execution directory)
export OPENLINEAGE_CONFIG=../openlineage/openlineage.yml
cd jaffle_shop
uv run dbt-ol run
One subtle gotcha: log_file_path in openlineage.yml is resolved relative to where you run dbt-ol (i.e., jaffle_shop/), not relative to the config file itself. So the path needs to account for that:
# openlineage/openlineage.yml
transport:
type: file
log_file_path: ../openlineage/events/events.ndjson # relative to jaffle_shop/
append: true
Key Learning #2: Anatomy of an OpenLineage Event
Running dbt-ol run with 3 staging models produces 8 events in events.ndjson — one JSON object per line (NDJSON format):
Line 1: parent START → the entire dbt run
Line 2: stg_customers START
Line 3: stg_orders START
Line 4: stg_payments START
Line 5: stg_customers COMPLETE
Line 6: stg_orders COMPLETE
Line 7: stg_payments COMPLETE
Line 8: parent COMPLETE → the entire dbt run
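Because each line is an independent JSON object, the file is easy to inspect with a few lines of Python (a quick sketch; the path comes from the transport config above):

```python
import json
from collections import Counter

def summarize_events(ndjson_path):
    """Tally OpenLineage events by eventType in an NDJSON file."""
    counts = Counter()
    with open(ndjson_path) as f:
        for line in f:
            if line.strip():                # skip blank lines
                counts[json.loads(line)["eventType"]] += 1
    return counts

# e.g. summarize_events("../openlineage/events/events.ndjson")
# the 8-event run above tallies to 4 START and 4 COMPLETE
```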
Every event shares the same top-level structure:
{
"eventType": "START",
"eventTime": "2026-02-27T14:55:51.150612+00:00",
"job": { ... },
"run": { ... },
"inputs": [ ... ],
"outputs": [ ... ]
}
The Three Core Entities
Job — what the work is (static, doesn't change between runs)
"job": {
"name": "dev.main.jaffle_shop.stg_customers",
"namespace": "dbt",
"facets": {
"jobType": { "jobType": "MODEL", "integration": "DBT" },
"sql": { "dialect": "duckdb", "query": "with source as (...) select ..." }
}
}
Run — this specific execution instance
"run": {
"runId": "019c9f99-59c5-75bd-...", // UUID v7, new every execution
"facets": {
"parent": {
"job": { "name": "dbt-run-jaffle_shop" },
"run": { "runId": "019c9f99-516e-7d6a-..." } // same across all models in this dbt run
}
}
}
Dataset — the data being read or written
{
"name": "dev.main.stg_customers",
"namespace": "duckdb://dev.duckdb"
}
Why the parent Facet Matters
Every model's run points to the same parent runId — the ID of the overall dbt run. This is what allows Marquez (or any OpenLineage backend) to group all models from a single execution together, and eventually reconstruct the full DAG from a single pipeline run.
dbt-run-jaffle_shop (runId: 019c9f99-516e...)
├── stg_customers (runId: 019c9f99-59c5-75bd..., parent → 019c9f99-516e...)
├── stg_orders (runId: 019c9f99-59c5-7bf9..., parent → 019c9f99-516e...)
└── stg_payments (runId: 019c9f99-59c5-7a35..., parent → 019c9f99-516e...)
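A backend's grouping step is simple to sketch: collect job names under each parent runId. This is a simplified illustration that assumes events shaped like the excerpts above, not Marquez's actual implementation:

```python
from collections import defaultdict

def group_by_parent_run(events):
    """Map each parent runId to the model jobs that ran under it,
    using the run.facets.parent facet carried by every model event."""
    groups = defaultdict(set)
    for event in events:
        parent = event.get("run", {}).get("facets", {}).get("parent")
        if parent:
            groups[parent["run"]["runId"]].add(event["job"]["name"])
    return {run_id: sorted(jobs) for run_id, jobs in groups.items()}
```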
Key Learning #3: Column-Level Lineage in Practice
Column-level lineage (CLL) is where OpenLineage gets interesting. Without any extra configuration, dbt-ol already parses the compiled SQL and produces field-level mappings in outputs[].facets.columnLineage.
Here's stg_customers.sql:
with source as (
select * from {{ ref('raw_customers') }}
)
select
id as customer_id,
first_name,
last_name,
first_name || ' ' || last_name as full_name
from source
And what OpenLineage extracts from it:
"columnLineage": {
"fields": {
"customer_id": {
"inputFields": [{ "field": "id", "name": "source" }]
},
"first_name": {
"inputFields": [{ "field": "first_name", "name": "source" }]
},
"last_name": {
"inputFields": [{ "field": "last_name", "name": "source" }]
},
"full_name": {
"inputFields": [
{ "field": "first_name", "name": "source" },
{ "field": "last_name", "name": "source" }
]
}
}
}
For simple rename and concatenation, the tracking is accurate:
| Output column | Source column(s) | Tracking |
|---|---|---|
| `customer_id` | `id` | ✅ rename detected |
| `first_name` | `first_name` | ✅ pass-through |
| `full_name` | `first_name` + `last_name` | ✅ multi-source detected |
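A useful way to consume this facet is to flatten it into edges, one per contributing source column. A small sketch against the facet shape shown above:

```python
def column_edges(column_lineage_facet):
    """Flatten a columnLineage facet into (input_name, input_field,
    output_field) tuples -- one edge per contributing source column."""
    return [
        (inp["name"], inp["field"], out_field)
        for out_field, spec in column_lineage_facet["fields"].items()
        for inp in spec["inputFields"]
    ]
```

For the stg_customers facet this yields five edges: one each for `customer_id`, `first_name`, and `last_name`, and two feeding `full_name`.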
The CTE Alias Problem
Notice "name": "source" in every inputFields entry. That's the CTE alias — not the actual table name raw_customers. OpenLineage parsed the SQL correctly, but without knowing what source resolves to, the lineage chain is broken at the CTE boundary.
raw_customers → source (CTE) → stg_customers
↑
lineage stops here without catalog.json
This is a limitation of parsing the compiled SQL in isolation: without resolving what `source` points to, the chain stops at the alias. As the catalog.json experiment below shows, adding the catalog does not fix it either; Milestone 2 looks at how this behaves as the SQL gets more complex.
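The break is easy to detect programmatically: compare the names referenced in the facet against datasets you actually know about. A sketch (the known-dataset set is whatever your backend has registered; the names here are illustrative):

```python
def unresolved_input_names(column_lineage_facet, known_datasets):
    """Input names referenced by column lineage that match no known
    dataset -- typically unresolved CTE aliases like 'source'."""
    referenced = {
        inp["name"]
        for spec in column_lineage_facet["fields"].values()
        for inp in spec["inputFields"]
    }
    return sorted(referenced - set(known_datasets))
```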
What's Missing: The Empty inputs[]
Every model event in this run has "inputs": []. The output dataset and its column lineage are present, but there's no record of what was read.
{
"eventType": "COMPLETE",
"inputs": [], // ← empty
"outputs": [{ "name": "dev.main.stg_customers", "facets": { "columnLineage": {...} } }]
}
The reason: these staging models read from seed tables (raw_customers, raw_orders, raw_payments). Seeds are created by dbt seed and exist as real tables in DuckDB, but dbt-ol does not treat them as upstream lineage datasets — they don't appear in inputs[] regardless of what artifacts are present.
What catalog.json actually changes
Running dbt docs generate produces target/catalog.json, which records the actual schema (column names and types) of every table and view in the database as DuckDB sees them.
uv run dbt docs generate # produces target/catalog.json
uv run dbt-ol run # now reads catalog.json alongside manifest.json
With catalog.json present, the output dataset gains a SchemaDatasetFacet:
"outputs": [{
"name": "dev.main.stg_customers",
"facets": {
"columnLineage": { ... },
"schema": {
"fields": [
{ "name": "customer_id", "type": "INTEGER" },
{ "name": "first_name", "type": "VARCHAR" },
{ "name": "last_name", "type": "VARCHAR" },
{ "name": "full_name", "type": "VARCHAR" }
]
}
}
}]
inputs[] remains empty. catalog.json enriches the output side with type information — it does not resolve the upstream seed tables into input datasets.
What actually changes with catalog.json
| Item | Without catalog.json | With catalog.json |
|---|---|---|
| `inputs[]` | `[]` empty | `[]` still empty |
| Output schema facet | absent | present (field names + DuckDB types) |
| `columnLineage` | present | present (unchanged) |
| CTE alias in `inputFields` | `"name": "source"` | `"name": "source"` (unchanged) |
The gap this leaves: there's no dataset-to-dataset edge connecting raw_customers → stg_customers in the emitted events. For this learning project the seeds act as the raw data layer, but in a real pipeline where staging models read from source() references (external tables), dbt-ol would populate inputs[] — that's the scenario where catalog.json matters for input schema enrichment.
This is exactly what Milestone 2 tests: intermediate and mart models that ref staging models (not seeds) — where inputs[] is no longer empty.
Takeaways
1. dbt-ol is a post-processor, not a runtime interceptor
It reads manifest.json and run_results.json after dbt run completes. This means lineage accuracy depends entirely on what's in the artifacts — and catalog.json is what adds output schema (field names and DuckDB types) to the emitted events.
2. The Job/Run/Dataset model maps cleanly onto dbt concepts
Job = dbt model definition. Run = one execution instance. Dataset = a table or view in the database. The parent facet ties all model runs within a single dbt run together, which is what makes pipeline-level lineage possible.
3. Column-level lineage works out of the box for simple SQL
Renames, pass-throughs, and multi-column expressions are all tracked correctly. The CTE alias issue ("name": "source" instead of the actual table name) is a known parser limitation — catalog.json does not resolve it. The inputFields still reference the CTE alias even after dbt docs generate.
4. Keep lineage config separate from the dbt project
Keeping openlineage.yml outside the dbt project enforces a clean boundary: dbt handles transformation, lineage config is infrastructure. OPENLINEAGE_CONFIG works well for local learning. In production, the newer OPENLINEAGE__ double-underscore env var system (e.g., OPENLINEAGE__TRANSPORT__TYPE=http) is the recommended path — each value injected directly, no config file required. The separation habit transfers either way.
What's Next
In Milestone 2, I'll add intermediate and mart models with more complex SQL (CTEs + JOINs + window functions) and examine how column-level lineage accuracy degrades as SQL complexity increases.