---
title: "Your Schema Changed. Congratulations, You've Just Inherited a Mystery."
published: false
tags: [dataengineering, python, opensource, dbt]
---
It's 2:47 PM on a Thursday.
A Slack message appears: *"Hey, the revenue dashboard is showing nulls everywhere."*
You stare at it. You made one change this morning — renamed a column in a Postgres table. A sensible rename. A *good* rename. `rev_amt` → `revenue_amount`. Clean. Descriptive. Professional.
What you didn't know: that column fed a dbt model, which fed a Spark aggregation job, which fed three Airflow DAGs, which fed the dashboard your CFO reviews every Friday morning.
You didn't know because **nobody knew**. The knowledge lived in the heads of people who've since left the company, in a Confluence doc last edited in 2021, and in the silent, load-bearing assumptions baked into 40,000 lines of pipeline code.
This is the data dependency problem. And it's not a tooling gap — it's a visibility gap.
---
## Introducing DataLineage
[DataLineage](https://github.com/datalineage) automatically discovers and maps every dependency across your pipeline stack — dbt, Airflow, Spark, custom ETL scripts — and gives you real-time impact analysis when anything changes.
No more archaeological digs through DAG definitions. No more "who owns this table?" Slack threads. No more Thursday afternoon mysteries.
---
## How It Works (The Honest Version)
Most lineage tools ask you to *declare* your dependencies. You write YAML. You tag things. You maintain a catalog. This is documentation-driven lineage, and it has the same problem as all documentation: it drifts.
DataLineage takes a different approach: **discovery over declaration**.
It instruments your existing tools — parsing dbt manifests, hooking into Airflow's metadata DB, intercepting Spark execution plans — and builds a live dependency graph automatically. Your pipelines are the source of truth. Not a YAML file someone forgot to update.
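To make "parsing dbt manifests" concrete, here is a minimal sketch of what that kind of discovery can look like. It is not DataLineage's implementation. It assumes only that each entry under `nodes` in dbt's `manifest.json` lists its parents under `depends_on.nodes`; the sample manifest and its node IDs are made up for illustration.

```python
import json

# A trimmed, hand-written stand-in for dbt's target/manifest.json.
# Real manifests carry many more keys; only `depends_on.nodes` is read here.
# The project name "shop" and all node IDs are hypothetical.
SAMPLE_MANIFEST = json.loads("""
{
  "nodes": {
    "model.shop.stg_orders": {
      "depends_on": {"nodes": ["source.shop.raw.orders"]}
    },
    "model.shop.orders_daily_agg": {
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    }
  }
}
""")


def edges_from_manifest(manifest: dict) -> list[tuple[str, str]]:
    """Return sorted (upstream, downstream) pairs for every node in the manifest."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        for parent_id in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent_id, node_id))
    return sorted(edges)


edges = edges_from_manifest(SAMPLE_MANIFEST)
for upstream, downstream in edges:
    print(f"{upstream} -> {downstream}")
```

Because the edges come from an artifact dbt regenerates on every compile, there is nothing to hand-maintain: the graph changes when the project changes.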
When a schema change happens (detected via your warehouse's information schema or pushed via the API), DataLineage traverses the graph and returns every downstream node affected, each labeled with a severity (DIRECT or TRANSITIVE) based on how directly it consumes the changed field.
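The traversal itself is ordinary graph work. Below is a rough sketch of the idea, assuming the graph is stored as a plain dictionary from each node to the nodes that read from it. The function name, the graph shape, and the edges are all hypothetical; the node names simply echo the example later in this post.

```python
from collections import deque

# Hypothetical graph: each key maps to the nodes that read from it.
DOWNSTREAM = {
    "analytics.orders.rev_amt": ["orders_daily_agg", "revenue_spark_job"],
    "orders_daily_agg": ["finance_weekly_rollup"],
    "revenue_spark_job": ["exec_dashboard_refresh"],
}


def downstream_impact(graph: dict, changed: str) -> dict:
    """Breadth-first walk from the changed field, labeling each node reached.

    Nodes one hop away are DIRECT; anything further is TRANSITIVE.
    Breadth-first order means a node reachable both ways is labeled DIRECT.
    """
    severity = {}
    queue = deque((child, 1) for child in graph.get(changed, []))
    while queue:
        node, depth = queue.popleft()
        if node in severity or node == changed:
            continue  # already labeled, or a cycle back to the start
        severity[node] = "DIRECT" if depth == 1 else "TRANSITIVE"
        queue.extend((child, depth + 1) for child in graph.get(node, []))
    return severity


impact = downstream_impact(DOWNSTREAM, "analytics.orders.rev_amt")
for name, level in impact.items():
    print(f"{name}: {level}")
```

A two-level label is the simplest possible scoring. A real system could just as well keep the hop count or weight edges by how the field is used.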
---
## Quick Start
```bash
pip install datalineage
```
Point it at your stack:
```python
from datalineage import LineageClient

client = LineageClient(api_key="your_key")
client.connect(dbt_project="./dbt", airflow_db="postgresql://...", spark_app="my_spark_app")

graph = client.discover()
```
That's the full discovery. `graph` is now a queryable dependency map of your entire pipeline.
---
## A Real-World Scenario: The Schema Change You Can Survive
Let's say you're about to rename that column. Before you do, you run an impact check:
```python
from datalineage import LineageClient, ChangeEvent

client = LineageClient(api_key="your_key")

# Describe the change you're about to make
change = ChangeEvent(
    table="analytics.orders",
    column="rev_amt",
    change_type="rename",
    new_name="revenue_amount"
)

impact = client.analyze_impact(change)

for node in impact.affected_nodes:
    print(f"{node.name} ({node.tool}) — severity: {node.severity}")
    print(f"  Owner: {node.owner}")
    print(f"  Last run: {node.last_run}")
    print()
```
Output:
```plaintext
orders_daily_agg (dbt) — severity: DIRECT
  Owner: analytics-team@company.com
  Last run: 2024-01-18 06:00 UTC

revenue_spark_job (Spark) — severity: DIRECT
  Owner: data-platform@company.com
  Last run: 2024-01-18 07:30 UTC

finance_weekly_rollup (Airflow DAG) — severity: TRANSITIVE
  Owner: finance-eng@company.com
  Last run: 2024-01-18 00:00 UTC

exec_dashboard_refresh (Airflow DAG) — severity: TRANSITIVE
  Owner: analytics-team@company.com
  Last run: 2024-01-18 08:00 UTC
```
You now know:
- **What breaks** (4 downstream consumers)
- **Who to notify** (3 different teams)
- **How urgently** (two DIRECT dependencies will fail immediately)
You can send this report to stakeholders *before* you merge the PR. You can open tickets. You can coordinate. You can be the engineer who prevented the Thursday afternoon mystery instead of the one who caused it.
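Turning that report into "who gets which message" is a small grouping step. Here is one way it might look, using a stand-in `AffectedNode` dataclass rather than the real client objects. The field names mirror the attributes printed above; everything else is an assumption for the sake of a runnable example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AffectedNode:
    """Stand-in for one entry in the impact report (hypothetical type)."""
    name: str
    tool: str
    severity: str
    owner: str


# Same four nodes as the output above.
NODES = [
    AffectedNode("orders_daily_agg", "dbt", "DIRECT", "analytics-team@company.com"),
    AffectedNode("revenue_spark_job", "Spark", "DIRECT", "data-platform@company.com"),
    AffectedNode("finance_weekly_rollup", "Airflow DAG", "TRANSITIVE", "finance-eng@company.com"),
    AffectedNode("exec_dashboard_refresh", "Airflow DAG", "TRANSITIVE", "analytics-team@company.com"),
]


def notifications_by_owner(nodes):
    """Group nodes by owner, listing DIRECT dependents first for each owner."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node.owner].append(node)
    # sorted() is stable, so nodes with equal severity keep their report order.
    return {
        owner: sorted(items, key=lambda n: n.severity != "DIRECT")
        for owner, items in grouped.items()
    }


by_owner = notifications_by_owner(NODES)
for owner, items in by_owner.items():
    print(owner)
    for node in items:
        print(f"  {node.name} ({node.tool}): {node.severity}")
```

Four affected nodes collapse into three messages, one per team, with the things that will break first at the top of each.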
---
## The API-First Design (Why It Matters)
We built DataLineage API-first because lineage isn't a dashboard you check occasionally — it's a signal that should flow through your existing workflows.
The Python client is a thin wrapper around a REST API, which means you can:
- **Integrate with CI/CD**: fail a PR if a schema change has unacknowledged DIRECT dependents
- **Trigger Slack alerts** when a new dependency is discovered on a critical table
- **Feed your data catalog** with freshness and ownership data that's actually current
- **Build custom tooling** without being locked into our UI
```python
# Example: CI/CD gate
# Reuses `client` and `change` from the impact check above.
import sys

impact = client.analyze_impact(change)
critical = [n for n in impact.affected_nodes if n.severity == "DIRECT"]

if critical and not all(n.acknowledged for n in critical):
    print("⛔ Unacknowledged direct dependents. Blocking merge.")
    sys.exit(1)  # non-zero exit code fails the CI step
```
This is the kind of guard rail that turns "move fast and break things" into "move fast and *know* what you're breaking."
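The Slack-alert idea from the list above comes down to diffing two snapshots of the graph. Here is a hedged sketch of that diff, assuming edges are stored as `(upstream, downstream)` pairs. The snapshot contents, table names, and function name are all hypothetical, and the actual Slack call is left out.

```python
# Edges are (upstream, downstream) pairs. All names here are hypothetical.
YESTERDAY = {
    ("analytics.orders", "orders_daily_agg"),
    ("analytics.orders", "revenue_spark_job"),
}
TODAY = YESTERDAY | {
    ("analytics.orders", "churn_model_features"),
    ("staging.events", "sessionize_job"),
}
CRITICAL_TABLES = {"analytics.orders"}


def new_critical_dependents(previous, current, critical):
    """Edges present now but not before, limited to critical upstream tables."""
    return sorted(edge for edge in current - previous if edge[0] in critical)


alerts = [
    f"New dependency on {upstream}: {downstream}"
    for upstream, downstream in new_critical_dependents(YESTERDAY, TODAY, CRITICAL_TABLES)
]
for message in alerts:
    print(message)  # hand this string to whatever Slack integration you use
```

The new edge on `staging.events` is ignored because that table is not in the critical set, which keeps the channel quiet unless something important changes.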
---
## What We Don't Do (Yet)
Honest section, because you deserve it:
- **Column-level lineage for custom ETL** is still in beta. We handle dbt and Spark column lineage well; custom Python scripts get table-level lineage for now.
- **Streaming pipelines** (Kafka, Flink) are on the roadmap but not in v1.
- **The UI** is functional but not beautiful. We prioritized the API. PRs welcome.
---
## Getting Started
The core library is open source. The hosted API (which handles the graph storage and real-time diffing) has a free tier that covers most individual and small-team use cases.
```bash
pip install datalineage
```
**→ [Star us on GitHub](https://github.com/datalineage/datalineage)** — it genuinely helps, and the issue tracker is where the roadmap lives.
**→ [Try the API](https://datalineage.dev/signup)** — free tier, no credit card, five-minute setup if your dbt project is local.
**→ [Read the docs](https://docs.datalineage.dev)** — especially the Airflow integration guide, which has some non-obvious setup steps we've documented carefully.
---
If you've ever been the person explaining to a VP why the dashboard is broken because of a column rename — this one's for you.
Drop questions in the comments. I read all of them.