DEV Community

沈欒
沈欒

Posted on

πŸš€ Solving DataHub’s Limitation in Parsing Complex SQL Lineage (gsp-datahub-sidecar Validation)

🧩 Background

When using DataHub for data lineage analysis, a common issue is:

πŸ‘‰ Failure to correctly parse complex SQL (especially BigQuery procedural SQL)

As mentioned in this issue:

  • DataHub issue #11654

Typical symptoms include:

  • Multi-statement SQL (e.g., BEGIN...END) cannot be parsed
  • Missing or incomplete lineage
  • Column-level lineage cannot be extracted

πŸ’‘ Solution

Use the sidecar tool provided by Gudu:

πŸ‘‰ gsp-datahub-sidecar

GitHub repository:

https://github.com/gudusoftware/gsp-datahub-sidecar

Capabilities

  • Supports BigQuery procedural SQL
  • Supports column-level lineage
  • No modification required to DataHub itself

🎯 Validation Objective

This article verifies whether the following pipeline works:

BigQuery procedural SQL
↓
gsp-datahub-sidecar
↓
DataHub GMS
↓
DataHub UI lineage visualization
Enter fullscreen mode Exit fullscreen mode

πŸ§ͺ Test Steps (Reproducible)

Prerequisite: You already have a running DataHub instance

(GMS default: http://localhost:8080)


Step 0: Install the Sidecar

pip install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git
Enter fullscreen mode Exit fullscreen mode

πŸ‘‰ If you are using macOS + Homebrew Python (PEP 668 issue), recommended:

brew install pipx
pipx install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git
Enter fullscreen mode Exit fullscreen mode

Verify installation:

gsp-datahub-sidecar --version
Enter fullscreen mode Exit fullscreen mode

Step 1: Get Test SQL

git clone https://github.com/gudusoftware/gsp-datahub-sidecar.git
cd gsp-datahub-sidecar
Enter fullscreen mode Exit fullscreen mode

Use the built-in example:

examples/bigquery_procedural.sql
Enter fullscreen mode Exit fullscreen mode

Step 2: Validate SQL Parsing (Dry Run Mode)

gsp-datahub-sidecar \
--mode authenticated \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file examples/bigquery_procedural.sql \
--dry-run
Enter fullscreen mode Exit fullscreen mode

Dry-run mode is for debugging only. It parses lineage for review but does NOT send data to the DataHub server.

βœ… Example output:

Detected procedural SQL β€” sending as single statement

Lineage: PROJECT.DATASET.VIEW_NAME --> TEMP_TABLE (12 columns)
Lineage: TEMP_TABLE_DELTA --> FINAL_OUTPUT (5 columns)

[DRY RUN] Would emit 10 MCPs
Enter fullscreen mode Exit fullscreen mode


Step 3: Ingest into DataHub

Test with BigQuery SQL

gsp-datahub-sidecar \
--mode authenticated \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file examples/bigquery_procedural.sql
Enter fullscreen mode Exit fullscreen mode

βœ… Example output:

Detected procedural SQL β€” sending as single statement

Lineage: PROJECT.DATASET.VIEW_NAME --> TEMP_TABLE (12 columns)
Lineage: TEMP_TABLE_DELTA --> FINAL_OUTPUT (5 columns)

Emitted 10 MCPs
Enter fullscreen mode Exit fullscreen mode

πŸ” Verify Table-Level Lineage

  1. In DataHub UI, search for temp_table, you should see:
project.dataset.view_name β†’ temp_table
Enter fullscreen mode Exit fullscreen mode

  1. Search for final_output, you should see:
temp_table_delta β†’ final_output
Enter fullscreen mode Exit fullscreen mode


πŸ”¬ Verify Column-Level Lineage

You should observe column-level relationships such as:

  • email
  • idfield

mapped between temp_table_delta and final_output.


πŸ§ͺ Test with Oracle SQL

gsp-datahub-sidecar \
--mode authenticated \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file examples/oracle_create_view.sql \
--db-vendor dbvoracle \
--datahub-server http://localhost:8080
Enter fullscreen mode Exit fullscreen mode

In the UI, search for vsal, and you should see correct table and column lineage relationships.


🏠 Test Using Self-Hosted Mode

gsp-datahub-sidecar --mode self_hosted \
--sqlflow-url http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file examples/oracle_create_view.sql \
--db-vendor dbvoracle
Enter fullscreen mode Exit fullscreen mode

⚠️ Note: You must install and run SQLFlow locally in advance.


πŸ“Š Validation Results

Capability Result
Procedural SQL Parsing βœ…
Table-Level Lineage βœ…
Column-Level Lineage βœ…
DataHub Integration βœ…
UI Visualization βœ…

🧨 Key Conclusion

πŸ‘‰ gsp-datahub-sidecar effectively solves DataHub's limitations in parsing complex SQL

Especially suitable for:

  • 20+ mainstream databases
  • Multi-statement SQL
  • Column-level lineage analysis

If your problem is:

DataHub cannot correctly parse complex SQL lineage

Then:

πŸ‘‰ This tool has been validated as a reliable solution


πŸ”— Related Project

https://github.com/gudusoftware/gsp-datahub-sidecar

Top comments (0)