Get an overview of your databases/tables' dependencies in 1 minute.

Andromelus — Thu, 20 Apr 2023 16:55:05 +0000

Introduction

When working in big data, or even just "normal data", you may at some point have a lot of databases, with a lot of tables and a lot of dependencies between all these tables.

You may have to modify something in the data flow: a query, a table structure, name, or anything. And it can break stuff. A lot of stuff.

While I cannot help you not breaking stuff, I can help you know what can break (your project manager will call that "impact analysis", but we call that "features") with a little project of mine that generates automatically a mermaidjs diagram showing you all the links in and out of all the referenced tables in your SQL script: sql-to-mermaid-awk (I know the name sucks).

And most important: It is written in gawk ! Available on many *nix distros, even in your git for windows (git bash).

And before someone says it, yes, there already are some lineage tools. Only you can use it if the company or your wallet allows you to. ;)

Example

Lets take this very simple SQL script. Note that it might not be a correct SQL script, but it matches all the correct syntax element for the program to understand it.

use raw_data_db;

# loading data from HDFS
create external table client (column string) location "/hdfs/path/somewhere";

use processing_data_db;

create table proc_client as
with raw_client as(

    select * from raw_data_db.client

),
raw_addresses as (
    select * from addresses
)

select * from raw_client
left join raw_addresses
on raw_client.id = raw_addresses.client_id;

use clean_data_db;

-- list all the clients and other company's clients
create view exposed_clients as select * from 
processing_data_db.proc_client

union all

select * from other_company_db.clients;

We have three databases, between one to two tables per database. Nothing fancy, but if you want to know, without any tool, which table requires which data, it will take you some time to manually draw something (with a pen, or any software).

With this sql-to-mermaid tool, a one line command does the trick:

awk -f main.awk -v generate_html=1 < script.sql > diagram.html

Which would create a HTML (diagram.html) file with the following:

Ho, and to get the entire lineage of a table, just click on its name.

Now, imagine a team managing a datalake: Dozens of databases, dozens of tables per database, at least one HQL/Impala script to build one table, several tables joined together to create a new one, and no lineage tool (yup, personnal experience). Well, this project has you covered.

DEV Community: Andromelus

Get an overview of your databases/tables' dependencies in 1 minute.

Introduction

Example