<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andromelus</title>
    <description>The latest articles on DEV Community by Andromelus (@andromelus).</description>
    <link>https://dev.to/andromelus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1066077%2F841b2039-4965-49fe-87ec-439770fe02a4.png</url>
      <title>DEV Community: Andromelus</title>
      <link>https://dev.to/andromelus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andromelus"/>
    <language>en</language>
    <item>
      <title>Get an overview of your databases/tables' dependencies in 1 minute.</title>
      <dc:creator>Andromelus</dc:creator>
      <pubDate>Thu, 20 Apr 2023 16:55:05 +0000</pubDate>
      <link>https://dev.to/andromelus/get-an-overview-of-your-databasestables-links-in-1-minute-m55</link>
      <guid>https://dev.to/andromelus/get-an-overview-of-your-databasestables-links-in-1-minute-m55</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When working in big data, or even just "normal data", you may at some point have a lot of databases, with a lot of tables and a lot of dependencies between all these tables.&lt;/p&gt;

&lt;p&gt;You may have to modify something in the data flow: a query, a table structure, name, or anything. And it can break stuff. A lot of stuff.&lt;/p&gt;

&lt;p&gt;While I cannot help you not breaking stuff, I can help you know what can break (your project manager will call that "impact analysis", but we call that "features") with a little project of mine that generates automatically a &lt;a href="https://mermaid.js.org/"&gt;mermaidjs&lt;/a&gt; diagram showing you all the links in and out of all the referenced tables in your SQL script: &lt;a href="https://github.com/Andromelus/sql-to-mermaid-awk"&gt;sql-to-mermaid-awk (I know the name sucks)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And most important: It is written in &lt;a href="https://www.gnu.org/software/gawk/"&gt;gawk&lt;/a&gt; ! Available on many *nix distros, even in your git for windows (git bash).&lt;/p&gt;

&lt;p&gt;And before someone says it, yes, there already are some lineage tools. Only you can use it if the company or your wallet allows you to. ;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Lets take this very simple SQL script. Note that it might not be a correct SQL script, but it matches all the correct syntax element for the program to understand it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;raw_data_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;HDFS&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;external&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="nv"&gt;"/hdfs/path/somewhere"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;processing_data_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;proc_client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;raw_client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;raw_data_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;

&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;raw_addresses&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;addresses&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;raw_client&lt;/span&gt;
&lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;raw_addresses&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;raw_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_addresses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;clean_data_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- list all the clients and other company's clients&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;exposed_clients&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; 
&lt;span class="n"&gt;processing_data_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proc_client&lt;/span&gt;

&lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;other_company_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have three databases, between one to two tables per database. Nothing fancy, but if you want to know, without any tool, which table requires which data, it will take you some time to manually draw something (with a pen, or any software).&lt;/p&gt;

&lt;p&gt;With this sql-to-mermaid tool, a one line command does the trick:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;awk -f main.awk -v generate_html=1 &amp;lt; script.sql &amp;gt; diagram.html&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Which would create a HTML (diagram.html) file with the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lUsqsJd_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nui9hqb0z3q1lcvpr1dh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lUsqsJd_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nui9hqb0z3q1lcvpr1dh.PNG" alt="diagram no highlight" width="582" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ho, and to get the entire lineage of a table, just click on its name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BGyKOpfy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zc2w9a1d3cmk7ekpvya8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BGyKOpfy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zc2w9a1d3cmk7ekpvya8.PNG" alt="diagram highlight" width="603" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, imagine a team managing a datalake: Dozens of databases, dozens of tables per database, &lt;strong&gt;at least&lt;/strong&gt; one HQL/Impala script to build one table, several tables joined together to create a new one, and no lineage tool (yup, personnal experience). Well, this project has you covered.&lt;/p&gt;

</description>
      <category>database</category>
      <category>awk</category>
      <category>mermaidjs</category>
      <category>lineage</category>
    </item>
  </channel>
</rss>
