Byron Hsieh
dbt + OpenLineage #1: Why dbt-ol Is a Post-Processor (Not a Plugin) — and Why It Matters

Introduction

After exploring cloud migration patterns in my current work, I started thinking about one of the next layers: data lineage.

Legacy lineage systems built on top of SQLLineage often struggle with on-premise codebases that lack enforced standards — SQL mixed across legacy scripts and raw .sql files, runtime variables, shared temp tables, duplicate aliases. Accuracy tends to top out around 70%.

The cloud migration changed the equation. A common modern pattern is to use a Spark-based ingestion layer for data landing and dbt for transformation — and with that comes actual standards: consistent naming conventions, declarative SQL, and reproducible artifacts. That's the foundation automated lineage needs. But it also introduced a new challenge: the ingestion layer and dbt are separate tools, and getting end-to-end lineage means both need to emit events in a common format. OpenLineage is designed exactly for this. This series starts with the dbt side: given that manifest.json already captures the dependency graph, what does OpenLineage actually emit?

This series documents my hands-on exploration of openlineage-dbt. In this first post, I'll cover:

  • How dbt-ol works under the hood (it's not what I initially assumed)
  • The anatomy of an OpenLineage event: Job, Run, and Dataset
  • What column-level lineage actually looks like in a raw .ndjson file
  • Why inputs: [] is empty — and why that matters

The full project is on GitHub: openlineage-dbt


The Setup

Tech Stack

| Component | Tool | Why |
| --- | --- | --- |
| Transformation | dbt-core + dbt-duckdb | Lightweight, no server needed |
| Database | DuckDB | File-based, perfect for local learning |
| Lineage | openlineage-dbt | Official dbt integration |
| Transport | File (.ndjson) | Start simple, switch to Marquez later |

Project Architecture

One design decision I made early on: keep the dbt project completely clean of OpenLineage configuration.

```text
openlineage_dbt/
├── jaffle_shop/          ← pure dbt project (deployable independently)
│   ├── dbt_project.yml
│   ├── models/
│   │   ├── staging/      stg_customers, stg_orders, stg_payments
│   │   ├── intermediate/
│   │   └── marts/
│   └── seeds/            raw_customers.csv, raw_orders.csv, raw_payments.csv
├── openlineage/
│   ├── openlineage.yml   ← OL config lives here, not inside jaffle_shop/
│   └── events/           ← file transport output (.gitignored)
└── docker/               ← Marquez compose (Milestone 3)
```

The OpenLineage config is injected via environment variable:

```bash
export OPENLINEAGE_CONFIG=../openlineage/openlineage.yml
cd jaffle_shop
uv run dbt-ol run
```

Why this matters: The dbt project stays pure transformation logic; lineage config is an infrastructure concern. The same dbt project runs unchanged across environments — only the injected config differs. For this learning project, OPENLINEAGE_CONFIG is sufficient. In production, the newer OPENLINEAGE__ double-underscore env var system (e.g., OPENLINEAGE__TRANSPORT__TYPE=http) is the recommended approach — no config file needed at all.
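The double-underscore scheme maps each env var segment onto a nested config key. A minimal Python sketch of that mapping — my own illustration of the convention, not OpenLineage's actual parser:

```python
def env_to_config(environ: dict) -> dict:
    """Fold OPENLINEAGE__SECTION__KEY env vars into a nested config dict."""
    config: dict = {}
    prefix = "OPENLINEAGE__"
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        # e.g. OPENLINEAGE__TRANSPORT__TYPE -> ["transport", "type"]
        path = key[len(prefix):].lower().split("__")
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config

cfg = env_to_config({
    "OPENLINEAGE__TRANSPORT__TYPE": "http",
    "OPENLINEAGE__TRANSPORT__URL": "http://localhost:5000",
})
# cfg == {"transport": {"type": "http", "url": "http://localhost:5000"}}
```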


Key Learning #1: How dbt-ol Actually Works

My first assumption was that dbt-ol intercepts dbt's execution and emits events in real time — like a plugin hooked into each model run. That turned out to be wrong.

dbt-ol is a post-processing wrapper. Here's what actually happens:

```text
dbt-ol run
    │
    ├── 1. Run dbt normally (identical to `dbt run`)
    │       └── produces: manifest.json, run_results.json
    │
    └── 2. After dbt completes, read the artifacts
            └── parse manifest.json + run_results.json
                └── emit OpenLineage events to transport
```

The implications of this design:

  • Events are emitted after execution, not during
  • All lineage information comes from static artifact parsing, not runtime introspection
  • If dbt run fails mid-way, only the completed models get events

This is also why catalog.json matters — dbt-ol reads it (if available) to enrich output datasets with schema information (field names and DuckDB types). More on this in a later section.
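Conceptually, the wrapper behaves like a short script over the artifacts. A simplified sketch of that post-processing loop — the real openlineage-dbt code does much more, this only shows the shape:

```python
import json
from pathlib import Path

def emit_events(target_dir: str) -> list[dict]:
    """Build one OpenLineage-style event per model, purely from dbt artifacts."""
    target = Path(target_dir)
    manifest = json.loads((target / "manifest.json").read_text())
    run_results = json.loads((target / "run_results.json").read_text())

    events = []
    for result in run_results["results"]:  # only models dbt actually ran
        node = manifest["nodes"][result["unique_id"]]
        events.append({
            "eventType": "COMPLETE" if result["status"] == "success" else "FAIL",
            "job": {"name": node["name"], "namespace": "dbt"},
            # everything comes from static artifacts -- no runtime introspection
            "sql": node.get("compiled_code", ""),
        })
    return events
```

Because the loop iterates over run_results.json, a model that never ran never produces an event — the same reason a mid-run failure leaves gaps in the lineage.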

The Three Artifacts

| Artifact | Produced by | What dbt-ol reads from it |
| --- | --- | --- |
| manifest.json | dbt run / dbt compile | Model graph, compiled SQL, dependencies |
| run_results.json | dbt run | Execution status, timing per model |
| catalog.json | dbt docs generate | Column names and types (input schema) |

Running It

```bash
# Set config path (relative to execution directory)
export OPENLINEAGE_CONFIG=../openlineage/openlineage.yml

cd jaffle_shop
uv run dbt-ol run
```

One subtle gotcha: log_file_path in openlineage.yml is resolved relative to where you run dbt-ol (i.e., jaffle_shop/), not relative to the config file itself. So the path needs to account for that:

```yaml
# openlineage/openlineage.yml
transport:
  type: file
  log_file_path: ../openlineage/events/events.ndjson  # relative to jaffle_shop/
  append: true
```

Key Learning #2: Anatomy of an OpenLineage Event

Running dbt-ol run with 3 staging models produces 8 events in events.ndjson — one JSON object per line (NDJSON format):

```text
Line 1: parent START       → the entire dbt run
Line 2: stg_customers START
Line 3: stg_orders START
Line 4: stg_payments START
Line 5: stg_customers COMPLETE
Line 6: stg_orders COMPLETE
Line 7: stg_payments COMPLETE
Line 8: parent COMPLETE    → the entire dbt run
```
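A quick sanity check on the file is to tally events by type. A small sketch, assuming events.ndjson at the path configured earlier:

```python
import json
from collections import Counter

def tally_events(path: str) -> Counter:
    """Count OpenLineage events per eventType in an NDJSON file."""
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():  # NDJSON: one JSON object per non-empty line
                counts[json.loads(line)["eventType"]] += 1
    return counts

# For the 8-event run above this tallies 4 START and 4 COMPLETE
```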

Every event shares the same top-level structure:

```json
{
  "eventType": "START",
  "eventTime": "2026-02-27T14:55:51.150612+00:00",
  "job":     { ... },
  "run":     { ... },
  "inputs":  [ ... ],
  "outputs": [ ... ]
}
```

The Three Core Entities

Job — what the work is (static, doesn't change between runs)

```json
"job": {
  "name": "dev.main.jaffle_shop.stg_customers",
  "namespace": "dbt",
  "facets": {
    "jobType": { "jobType": "MODEL", "integration": "DBT" },
    "sql": { "dialect": "duckdb", "query": "with source as (...) select ..." }
  }
}
```

Run — this specific execution instance

```json
"run": {
  "runId": "019c9f99-59c5-75bd-...",   // UUID v7, new every execution
  "facets": {
    "parent": {
      "job": { "name": "dbt-run-jaffle_shop" },
      "run": { "runId": "019c9f99-516e-7d6a-..." }  // same across all models in this dbt run
    }
  }
}
```

Dataset — the data being read or written

```json
{
  "name": "dev.main.stg_customers",
  "namespace": "duckdb://dev.duckdb"
}
```

Why the parent Facet Matters

Every model's run points to the same parent runId — the ID of the overall dbt run. This is what allows Marquez (or any OpenLineage backend) to group all models from a single execution together, and eventually reconstruct the full DAG from a single pipeline run.

```text
dbt-run-jaffle_shop  (runId: 019c9f99-516e...)
    ├── stg_customers  (runId: 019c9f99-59c5-75bd..., parent → 019c9f99-516e...)
    ├── stg_orders     (runId: 019c9f99-59c5-7bf9..., parent → 019c9f99-516e...)
    └── stg_payments   (runId: 019c9f99-59c5-7a35..., parent → 019c9f99-516e...)
```
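Reconstructing that grouping from the raw events is a one-pass bucket by parent runId. A sketch over already-parsed event dicts:

```python
from collections import defaultdict

def group_by_parent(events: list[dict]) -> dict[str, list[str]]:
    """Map parent runId -> child job names, via the parent run facet."""
    groups: dict[str, list[str]] = defaultdict(list)
    for event in events:
        parent = event["run"].get("facets", {}).get("parent")
        if parent:  # the top-level dbt run event carries no parent facet
            groups[parent["run"]["runId"]].append(event["job"]["name"])
    return dict(groups)
```

This is essentially what a lineage backend does on ingest: any event whose parent facet points at runId X becomes a child of run X.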

Key Learning #3: Column-Level Lineage in Practice

Column-level lineage (CLL) is where OpenLineage gets interesting. Without any extra configuration, dbt-ol already parses the compiled SQL and produces field-level mappings in outputs[].facets.columnLineage.

Here's stg_customers.sql:

```sql
with source as (
    select * from {{ ref('raw_customers') }}
)
select
    id                             as customer_id,
    first_name,
    last_name,
    first_name || ' ' || last_name as full_name
from source
```

And what OpenLineage extracts from it:

```json
"columnLineage": {
  "fields": {
    "customer_id": {
      "inputFields": [{ "field": "id",         "name": "source" }]
    },
    "first_name": {
      "inputFields": [{ "field": "first_name", "name": "source" }]
    },
    "last_name": {
      "inputFields": [{ "field": "last_name",  "name": "source" }]
    },
    "full_name": {
      "inputFields": [
        { "field": "first_name", "name": "source" },
        { "field": "last_name",  "name": "source" }
      ]
    }
  }
}
```

For simple rename and concatenation, the tracking is accurate:

| Output column | Source column(s) | Tracking |
| --- | --- | --- |
| customer_id | id | ✅ rename detected |
| first_name | first_name | ✅ pass-through |
| full_name | first_name + last_name | ✅ multi-source detected |

The CTE Alias Problem

Notice "name": "source" in every inputFields entry. That's the CTE alias — not the actual table name raw_customers. OpenLineage parsed the SQL correctly, but without knowing what source resolves to, the lineage chain is broken at the CTE boundary.

```text
raw_customers  →  source (CTE)  →  stg_customers
                  ↑
              lineage stops here without catalog.json
```

This is a known limitation of the static SQL parser: it sees the alias, not what the alias resolves to. I initially expected catalog.json to resolve the CTE back to the actual dataset — as the next section shows, it doesn't. Testing exactly what catalog.json does change is what Milestone 2 covers.
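In the meantime, one workaround is to patch the facet yourself: when manifest.json shows a model has exactly one upstream, every alias in its inputFields can only refer to that upstream. A sketch of that idea — my own post-processing step, not an openlineage-dbt feature:

```python
def resolve_single_upstream(column_lineage: dict, upstreams: list[str]) -> dict:
    """Rewrite alias dataset names when the model has exactly one upstream.

    upstreams: dataset names derived from the model's depends_on in manifest.json.
    """
    if len(upstreams) != 1:
        return column_lineage  # ambiguous with multiple upstreams: leave as-is
    resolved: dict = {"fields": {}}
    for out_col, spec in column_lineage["fields"].items():
        resolved["fields"][out_col] = {
            "inputFields": [
                {**field, "name": upstreams[0]} for field in spec["inputFields"]
            ]
        }
    return resolved
```

For stg_customers, whose only upstream is raw_customers, this collapses every "name": "source" into the real table name; models with JOINs across several upstreams are left untouched, since the alias-to-table mapping is no longer unambiguous.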


What's Missing: The Empty inputs[]

Every model event in this run has "inputs": []. The output dataset and its column lineage are present, but there's no record of what was read.

```json
{
  "eventType": "COMPLETE",
  "inputs":  [],              //  empty
  "outputs": [{ "name": "dev.main.stg_customers", "facets": { "columnLineage": {...} } }]
}
```

The reason: these staging models read from seed tables (raw_customers, raw_orders, raw_payments). Seeds are created by dbt seed and exist as real tables in DuckDB, but dbt-ol does not treat them as upstream lineage datasets — they don't appear in inputs[] regardless of what artifacts are present.

What catalog.json actually changes

Running dbt docs generate produces target/catalog.json, which records the actual schema (column names and types) of every table and view in the database as DuckDB sees them.

```bash
uv run dbt docs generate   # produces target/catalog.json
uv run dbt-ol run          # now reads catalog.json alongside manifest.json
```

With catalog.json present, the output dataset gains a SchemaDatasetFacet:

```json
"outputs": [{
  "name": "dev.main.stg_customers",
  "facets": {
    "columnLineage": { ... },
    "schema": {
      "fields": [
        { "name": "customer_id", "type": "INTEGER" },
        { "name": "first_name",  "type": "VARCHAR" },
        { "name": "last_name",   "type": "VARCHAR" },
        { "name": "full_name",   "type": "VARCHAR" }
      ]
    }
  }
}]
```

inputs[] remains empty. catalog.json enriches the output side with type information — it does not resolve the upstream seed tables into input datasets.
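This is easy to verify directly from the events: scan each line and record which output datasets carry a schema facet. A sketch:

```python
import json

def outputs_with_schema(path: str) -> dict[str, bool]:
    """Map each output dataset name to whether any event gave it a schema facet."""
    result: dict[str, bool] = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            for out in json.loads(line).get("outputs", []):
                has_schema = "schema" in out.get("facets", {})
                result[out["name"]] = result.get(out["name"], False) or has_schema
    return result
```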

Side-by-side: with vs. without catalog.json

| Item | Without catalog.json | With catalog.json |
| --- | --- | --- |
| inputs[] | [] (empty) | [] (still empty) |
| Output schema facet | absent | present (field names + DuckDB types) |
| columnLineage | present | present (unchanged) |
| CTE alias in inputFields | "name": "source" | "name": "source" (unchanged) |

The gap this leaves: there's no dataset-to-dataset edge connecting raw_customers → stg_customers in the emitted events. For this learning project the seeds act as the raw data layer, but in a real pipeline where staging models read from source() references (external tables), dbt-ol would populate inputs[] — that's the scenario where catalog.json matters for input schema enrichment.

This is exactly what Milestone 2 tests: intermediate and mart models that ref staging models (not seeds) — where inputs[] is no longer empty.


Takeaways

1. dbt-ol is a post-processor, not a runtime interceptor
It reads manifest.json and run_results.json after dbt run completes. This means lineage accuracy depends entirely on what's in the artifacts — and catalog.json is what adds output schema (field names and DuckDB types) to the emitted events.

2. The Job/Run/Dataset model maps cleanly onto dbt concepts
Job = dbt model definition. Run = one execution instance. Dataset = a table or view in the database. The parent facet ties all model runs within a single dbt run together, which is what makes pipeline-level lineage possible.

3. Column-level lineage works out of the box for simple SQL
Renames, pass-throughs, and multi-column expressions are all tracked correctly. The CTE alias issue ("name": "source" instead of the actual table name) is a known parser limitation — catalog.json does not resolve it. The inputFields still reference the CTE alias even after dbt docs generate.

4. Keep lineage config separate from the dbt project
Keeping openlineage.yml outside the dbt project enforces a clean boundary: dbt handles transformation, lineage config is infrastructure. OPENLINEAGE_CONFIG works well for local learning. In production, the newer OPENLINEAGE__ double-underscore env var system (e.g., OPENLINEAGE__TRANSPORT__TYPE=http) is the recommended path — each value injected directly, no config file required. The separation habit transfers either way.


What's Next

In Milestone 2, I'll run dbt docs generate to bring catalog.json into the picture, add intermediate and mart models with more complex SQL (CTEs + JOINs + window functions), and examine how column-level lineage accuracy degrades as SQL complexity increases.


Resources
