leo gu

Posted on Apr 4

I Tried to Analyze SQL Lineage Across 15 Databases — Everything Broke Until I Did This

#sql #database

The problem nobody talks about

If you’ve ever worked with SQL at scale, you’ve probably run into this:

Queries spanning multiple schemas
dbt models referencing each other
Views built on top of views
Different SQL dialects (Snowflake, BigQuery, Spark…)

And then someone asks:

“Where does this column actually come from?”

At that moment, everything falls apart.

Not because the answer doesn’t exist, but because your tools can’t give it to you.

I tried to solve it the “normal” way

I went through the usual stack:

dbt lineage graphs
Database-native tools
SQL IDEs like DataGrip and DBeaver

They work… until they don’t.

Here’s where things break:

❌ Cross-database lineage? Forget it
❌ Offline analysis? Not possible
❌ Large SQL projects? Slow or incomplete
❌ Raw SQL parsing? Surprisingly fragile

Especially when you mix:

Snowflake + dbt
Spark SQL + Hive
BigQuery + custom scripts

So I ran an experiment

I wanted to see how bad it really is.

So I tested SQL lineage across:

10+ SQL dialects
dbt projects with hundreds of models
Real-world open-source repositories

Including:

dbt projects (400+ models)
Spark / Hive SQL codebases
Data warehouse examples across multiple vendors

The goal was simple:

Can I reliably trace column-level lineage across all of them?

Short answer:

None of the tools I tried handled all of it well.

The workaround that actually worked

Instead of relying on cloud tools or database engines, I tried something different:

👉 Analyze SQL locally, directly inside VS Code

That’s where this comes in:

👉 gudu sql omni (VS Code extension)

It’s essentially:

A local, offline SQL lineage engine that supports multiple databases

What makes it different?

Here’s what stood out immediately:

1. Works across multiple SQL dialects

Not just one database.

It handled:

Snowflake
BigQuery
Spark SQL
Hive
Redshift
Databricks

…in a single workflow.

2. Fully offline

No:

uploading SQL
connecting to cloud services
worrying about sensitive data

Everything runs locally.

3. Actually parses complex SQL

Including:

nested queries
CTE chains
dbt-style transformations
multi-layer views

This is where most tools fail.
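
To make the "views on views" case concrete, here is a minimal sketch of what column-level tracing means once the SQL has already been parsed. It is not how the extension works internally. The relation names, column names, and the hand-written `VIEWS` mapping are all hypothetical, and a real tool has to derive this mapping from the SQL itself, which is the hard part.

```python
# Hypothetical, hand-written lineage metadata: for each view, every output
# column maps to the (relation, column) pairs it is computed from.
VIEWS = {
    "reporting.revenue_by_customer": {
        "customer_id": [("staging.orders_clean", "customer_id")],
        "revenue": [("staging.orders_clean", "net_amount")],
    },
    "staging.orders_clean": {
        "customer_id": [("raw.orders", "cust_id")],
        "net_amount": [("raw.orders", "amount"), ("raw.orders", "discount")],
    },
}


def trace_column(relation, column, views=VIEWS, _seen=None):
    """Return the set of base-table (relation, column) pairs feeding a column."""
    seen = set() if _seen is None else _seen
    if (relation, column) in seen:  # guard against cyclic definitions
        return set()
    seen.add((relation, column))
    if relation not in views:  # not a known view: treat it as a base table
        return {(relation, column)}
    sources = set()
    for src_rel, src_col in views[relation].get(column, []):
        sources |= trace_column(src_rel, src_col, views, seen)
    return sources


# "Where does this column actually come from?"
origin = sorted(trace_column("reporting.revenue_by_customer", "revenue"))
print(origin)
```

The recursion is the easy part. Building `VIEWS` correctly from nested queries and CTE chains in several dialects is where a parser earns its keep.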

What it looks like in practice

Inside VS Code:

Open a SQL file (or a project)
Run lineage analysis
Instantly get:

👉 table-level lineage

👉 column-level lineage

👉 dependency graph

No setup. No infra.
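
The three outputs are closely related. As a rough illustration (my own sketch, not the extension's output format), column-level edges can be collapsed into table-level lineage, and that in turn gives a dependency order. The edge list below is hypothetical sample data.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical column-level edges:
# (source relation, source column, target relation, target column)
COLUMN_EDGES = [
    ("raw.orders", "cust_id", "staging.orders_clean", "customer_id"),
    ("raw.orders", "amount", "staging.orders_clean", "net_amount"),
    ("raw.orders", "discount", "staging.orders_clean", "net_amount"),
    ("staging.orders_clean", "customer_id",
     "reporting.revenue_by_customer", "customer_id"),
    ("staging.orders_clean", "net_amount",
     "reporting.revenue_by_customer", "revenue"),
]


def table_lineage(column_edges):
    """Collapse column-level edges into {target relation: {source relations}}."""
    deps = {}
    for src_rel, _src_col, dst_rel, _dst_col in column_edges:
        deps.setdefault(dst_rel, set()).add(src_rel)
    return deps


deps = table_lineage(COLUMN_EDGES)

# TopologicalSorter takes {node: predecessors} and yields predecessors first,
# so upstream relations come before the ones built on top of them.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```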

Real use case: dbt projects

This is where things get interesting.

dbt already provides lineage, but:

It’s tied to the dbt ecosystem
It requires a dbt setup
It’s not always flexible for raw SQL

With a local parser:

You can analyze dbt SQL without the dbt runtime
You can inspect edge cases dbt doesn’t visualize well
You can debug transformations faster
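
As a small illustration of reading dbt SQL without running dbt, model-to-model dependencies can be pulled out of the `{{ ref('...') }}` calls with a regular expression. This is a deliberately naive sketch of my own and not what the extension does. It only handles the single-argument form of `ref`, ignores `source()`, and does not skip refs inside Jinja comments. The model SQL and names are made up.

```python
import re

# Matches the single-argument form {{ ref('model_name') }}, either quote style.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def extract_refs(model_sql):
    """Return the model names referenced via ref() in one model's SQL, in order."""
    return REF_PATTERN.findall(model_sql)


# Hypothetical dbt model body
MODEL_SQL = """
with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref("stg_customers") }}
)
select c.customer_id, sum(o.amount) as revenue
from orders o
join customers c on o.customer_id = c.customer_id
group by c.customer_id
"""

refs = extract_refs(MODEL_SQL)
print(refs)  # ['stg_orders', 'stg_customers']
```

This gets you model-level dependencies only. Column-level lineage still needs a real SQL parser once the Jinja is resolved.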

Where this approach wins

After testing across multiple datasets, this approach works best when:

You have mixed SQL environments
You need offline analysis
You deal with large SQL codebases
You want fast iteration inside your editor

Where it still needs improvement

To be fair:

It’s not a full replacement for dbt
Visualization can still improve
Edge cases exist (as with any parser)

But as a developer tool, it fills a gap that’s been ignored for years.

Final thoughts

SQL lineage shouldn’t be this hard.

And yet:

Most tools are tied to one ecosystem
Or require heavy infrastructure
Or simply break on real-world SQL

What surprised me most is this:

A lightweight, local approach actually works better in many cases.

Try it yourself

If you work with SQL seriously, it’s worth testing:

👉 https://marketplace.visualstudio.com/items?itemName=gudusoftware.gudu-sql-omni

Takes less than a minute to install.

I’m curious

If you’ve struggled with SQL lineage before:

What tools did you try?
Where did they fail?

I’d love to compare notes.
