Introducing DataLineage -- Automated Data Pipeline Lineage Tracking

#opensource #python #datalineage #tutorial

---
title: "Your Schema Changed. Congratulations, You've Just Inherited a Mystery."
published: false
tags: [dataengineering, python, opensource, dbt]
---

It's 2:47 PM on a Thursday.

A Slack message appears: *"Hey, the revenue dashboard is showing nulls everywhere."*

You stare at it. You made one change this morning — renamed a column in a Postgres table. A sensible rename. A *good* rename. `rev_amt` → `revenue_amount`. Clean. Descriptive. Professional.

What you didn't know: that column fed a dbt model, which fed a Spark aggregation job, which fed three Airflow DAGs, which fed the dashboard your CFO reviews every Friday morning.

You didn't know because **nobody knew**. The knowledge lived in the heads of people who've since left the company, in a Confluence doc last edited in 2021, and in the silent, load-bearing assumptions baked into 40,000 lines of pipeline code.

This is the data dependency problem. And it's not a tooling gap — it's a visibility gap.

---

## Introducing DataLineage

[DataLineage](https://github.com/datalineage) automatically discovers and maps every dependency across your pipeline stack — dbt, Airflow, Spark, custom ETL scripts — and gives you real-time impact analysis when anything changes.

No more archaeological digs through DAG definitions. No more "who owns this table?" Slack threads. No more Thursday afternoon mysteries.

---

## How It Works (The Honest Version)

Most lineage tools ask you to *declare* your dependencies. You write YAML. You tag things. You maintain a catalog. This is documentation-driven lineage, and it has the same problem as all documentation: it drifts.

DataLineage takes a different approach: **discovery over declaration**.

It instruments your existing tools — parsing dbt manifests, hooking into Airflow's metadata DB, intercepting Spark execution plans — and builds a live dependency graph automatically. Your pipelines are the source of truth. Not a YAML file someone forgot to update.
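To make "discovery" concrete for the dbt case: the `manifest.json` that dbt writes into `target/` already lists each node's upstream dependencies under `depends_on.nodes`. The sketch below turns that into a list of edges. It illustrates the idea under stated assumptions and is not DataLineage's actual parser; the helper name `edges_from_manifest` and the trimmed sample manifest are hypothetical.

```python
import json


def edges_from_manifest(manifest: dict) -> list[tuple[str, str]]:
    """Return sorted (upstream, downstream) pairs from a dbt manifest dict."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        # Each dbt node records the unique IDs of the nodes it depends on.
        for parent in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent, node_id))
    return sorted(edges)


# Hypothetical, heavily trimmed manifest. A real one would be read from
# the target/manifest.json file of your dbt project.
sample_manifest = json.loads("""
{
  "nodes": {
    "model.shop.stg_orders": {
      "depends_on": {"nodes": ["source.shop.analytics.orders"]}
    },
    "model.shop.orders_daily_agg": {
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    }
  }
}
""")

for upstream, downstream in edges_from_manifest(sample_manifest):
    print(f"{upstream} -> {downstream}")
```

Because the edges come from an artifact dbt regenerates on every compile, there is nothing to keep in sync by hand, which is the point of discovery over declaration.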

When a schema change happens (detected via your warehouse's information schema or pushed via the API), DataLineage traverses the graph and returns every downstream node affected, with a severity score based on how directly it consumes the changed field.
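The traversal itself can be pictured as a breadth-first walk over the dependency graph. The sketch below is a simplified stand-in, not the library's implementation: it assumes the graph is a plain dict of downstream edges and that severity is just hop distance (one hop is `DIRECT`, anything further is `TRANSITIVE`). The function name `affected_nodes` and the sample graph are hypothetical.

```python
from collections import deque


def affected_nodes(graph: dict[str, list[str]], changed: str) -> dict[str, str]:
    """Walk downstream of `changed` and label each reachable node by distance."""
    severity: dict[str, str] = {}
    queue = deque((child, 1) for child in graph.get(changed, []))
    while queue:
        node, depth = queue.popleft()
        # Skip the origin and anything already reached by a shorter path.
        if node == changed or node in severity:
            continue
        severity[node] = "DIRECT" if depth == 1 else "TRANSITIVE"
        queue.extend((child, depth + 1) for child in graph.get(node, []))
    return severity


# Hypothetical graph: keys are upstream nodes, values are their consumers.
sample_graph = {
    "analytics.orders.rev_amt": ["orders_daily_agg", "revenue_spark_job"],
    "orders_daily_agg": ["finance_weekly_rollup"],
    "revenue_spark_job": ["exec_dashboard_refresh"],
}

result = affected_nodes(sample_graph, "analytics.orders.rev_amt")
for name, level in result.items():
    print(f"{name}: {level}")
```

Breadth-first order matters here: a node reachable both directly and through an intermediate job is visited at its shortest distance first, so it keeps the more urgent label.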

---

## Quick Start

```bash
pip install datalineage
```


Point it at your stack:

```python
from datalineage import LineageClient

client = LineageClient(api_key="your_key")
client.connect(dbt_project="./dbt", airflow_db="postgresql://...", spark_app="my_spark_app")
graph = client.discover()
```


That's the full discovery. `graph` is now a queryable dependency map of your entire pipeline.

---

## A Real-World Scenario: The Schema Change You Can Survive

Let's say you're about to rename that column. Before you do, you run an impact check:

```python
from datalineage import LineageClient, ChangeEvent

client = LineageClient(api_key="your_key")

# Describe the change you're about to make
change = ChangeEvent(
    table="analytics.orders",
    column="rev_amt",
    change_type="rename",
    new_name="revenue_amount",
)

impact = client.analyze_impact(change)

for node in impact.affected_nodes:
    print(f"{node.name} ({node.tool}) — severity: {node.severity}")
    print(f" Owner: {node.owner}")
    print(f" Last run: {node.last_run}")
    print()
```


Output:

```plaintext
orders_daily_agg (dbt) — severity: DIRECT
 Owner: analytics-team@company.com
 Last run: 2024-01-18 06:00 UTC

revenue_spark_job (Spark) — severity: DIRECT
 Owner: data-platform@company.com
 Last run: 2024-01-18 07:30 UTC

finance_weekly_rollup (Airflow DAG) — severity: TRANSITIVE
 Owner: finance-eng@company.com
 Last run: 2024-01-18 00:00 UTC

exec_dashboard_refresh (Airflow DAG) — severity: TRANSITIVE
 Owner: analytics-team@company.com
 Last run: 2024-01-18 08:00 UTC
```


You now know:
- **What breaks** (4 downstream consumers)
- **Who to notify** (3 different teams)
- **How urgently** (two DIRECT dependencies will fail immediately)

You can send this report to stakeholders *before* you merge the PR. You can open tickets. You can coordinate. You can be the engineer who prevented the Thursday afternoon mystery instead of the one who caused it.
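Turning that report into per-team notifications is a small grouping step. The sketch below uses a hypothetical `AffectedNode` dataclass as a stand-in for the objects in `impact.affected_nodes`; it assumes only the `name`, `tool`, `severity`, and `owner` attributes shown in the example above, and the helper name `notifications_by_owner` is made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AffectedNode:
    """Hypothetical stand-in for an entry in impact.affected_nodes."""
    name: str
    tool: str
    severity: str
    owner: str


def notifications_by_owner(nodes: list[AffectedNode]) -> dict[str, list[str]]:
    """Group affected nodes by owner, listing DIRECT dependents first."""
    grouped: dict[str, list[str]] = defaultdict(list)
    # sorted() is stable, so nodes keep their original order within a severity.
    for node in sorted(nodes, key=lambda n: n.severity != "DIRECT"):
        grouped[node.owner].append(f"{node.name} ({node.tool}): {node.severity}")
    return dict(grouped)


nodes = [
    AffectedNode("orders_daily_agg", "dbt", "DIRECT", "analytics-team@company.com"),
    AffectedNode("revenue_spark_job", "Spark", "DIRECT", "data-platform@company.com"),
    AffectedNode("finance_weekly_rollup", "Airflow DAG", "TRANSITIVE", "finance-eng@company.com"),
    AffectedNode("exec_dashboard_refresh", "Airflow DAG", "TRANSITIVE", "analytics-team@company.com"),
]

report = notifications_by_owner(nodes)
for owner, lines in report.items():
    print(owner)
    for line in lines:
        print(f"  {line}")
```

Each key is one message to send, and the most urgent items sit at the top of each team's list.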

---

## The API-First Design (Why It Matters)

We built DataLineage API-first because lineage isn't a dashboard you check occasionally — it's a signal that should flow through your existing workflows.

The Python client is a thin wrapper around a REST API, which means you can:

- **Integrate with CI/CD**: fail a PR if a schema change has unacknowledged DIRECT dependents
- **Trigger Slack alerts** when a new dependency is discovered on a critical table
- **Feed your data catalog** with freshness and ownership data that's actually current
- **Build custom tooling** without being locked into our UI

```python
# Example: CI/CD gate
import sys

impact = client.analyze_impact(change)
critical = [n for n in impact.affected_nodes if n.severity == "DIRECT"]

# Block the merge if any direct dependent has not been acknowledged
if critical and not all(n.acknowledged for n in critical):
    print("⛔ Unacknowledged direct dependents. Blocking merge.")
    sys.exit(1)
```


This is the kind of guard rail that turns "move fast and break things" into "move fast and *know* what you're breaking."

---

## What We Don't Do (Yet)

Honest section, because you deserve it:

- **Column-level lineage for custom ETL** is still in beta. We handle dbt and Spark column lineage well; custom Python scripts get table-level lineage for now.
- **Streaming pipelines** (Kafka, Flink) are on the roadmap but not in v1.
- **The UI** is functional but not beautiful. We prioritized the API. PRs welcome.

---

## Getting Started

The core library is open source. The hosted API (which handles the graph storage and real-time diffing) has a free tier that covers most individual and small-team use cases.

```bash
pip install datalineage
```


**→ [Star us on GitHub](https://github.com/datalineage/datalineage)** — it genuinely helps, and the issue tracker is where the roadmap lives.

**→ [Try the API](https://datalineage.dev/signup)** — free tier, no credit card, five-minute setup if your dbt project is local.

**→ [Read the docs](https://docs.datalineage.dev)** — especially the Airflow integration guide, which has some non-obvious setup steps we've documented carefully.

---

If you've ever been the person explaining to a VP why the dashboard is broken because of a column rename — this one's for you.

Drop questions in the comments. I read all of them.
