Why Data Teams Need Data Lineage
From Common Pain Points to Real-World Challenges
Data Lineage has become a core component of modern data platforms. It provides the transparency, traceability, and observability needed for maintainability and governance. This article first covers the common challenges faced by data teams, then explores the additional real-world difficulties encountered in highly heterogeneous SQL ecosystems — the context that motivated building an automated SQL lineage extraction tool.
1. Why Data Teams Need Data Lineage (General)
Complex dependencies make incident investigation expensive
In mature data warehouses, data flows across multiple transformation layers.
When issues occur, teams must determine:
- Where does this column come from?
- Which transformation introduced the issue?
- What downstream tables will be impacted?
Without lineage, this requires manual code tracing — slow and error-prone.
Risky schema and logic changes without impact analysis
Schema updates or ETL refactoring require an understanding of downstream dependencies.
Without lineage:
- Dashboards may break
- ML pipelines may fail
- Incidents appear only after deployment
Lineage makes changes predictable.
Slow onboarding and weak knowledge transfer
Most data platforms lack complete documentation.
Newcomers must learn:
- Table relationships
- Column semantics
- End-to-end data flow
Lineage accelerates onboarding by offering a visual data map.
Data teams become a support center for business users
Common questions include:
- “How is this metric calculated?”
- “What is the source of this column?”
- “Why does this month’s number differ?”
Without lineage, engineers manually search SQL each time.
Inefficient Data Quality (DQ) incident handling and reprocessing
After identifying a data quality issue, teams must decide:
- Which downstream tables need reprocessing?
- How far does the impact propagate?
- What is the safe order for backfilling data?
Without lineage, reprocessing scope is often guesswork — teams either miss affected systems or waste time rerunning unnecessary jobs. Lineage provides the dependency map needed for surgical, efficient remediation.
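With table-level lineage in hand, both questions above reduce to graph operations: a downstream walk finds the blast radius, and a topological sort of the impacted subgraph yields a safe backfill order. A minimal sketch using only the standard library (the table names and the `upstream` map are hypothetical examples, not from any real system):

```python
from graphlib import TopologicalSorter

# Hypothetical table-level lineage: each table -> the upstream tables it reads from.
upstream = {
    "stg_orders": {"raw_orders"},
    "dim_customer": {"raw_customers"},
    "fct_orders": {"stg_orders", "dim_customer"},
    "rpt_revenue": {"fct_orders"},
}

def impacted(table):
    # Walk the graph downstream from the corrupted table.
    hit = set()
    frontier = {table}
    while frontier:
        frontier = {t for t, ups in upstream.items() if ups & frontier} - hit
        hit |= frontier
    return hit

def backfill_order(table):
    # Topological order over the impacted subgraph: every table is rerun
    # only after the tables it reads from have been rerun.
    scope = impacted(table)
    ts = TopologicalSorter({t: upstream.get(t, set()) & scope for t in scope})
    return list(ts.static_order())
```

For a corruption in `raw_orders`, `impacted` returns `stg_orders`, `fct_orders`, and `rpt_revenue`, and `backfill_order` lists them in dependency order — exactly the "surgical remediation" the dependency map enables.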
2. Real-World Challenges: A Highly Heterogeneous SQL Ecosystem
Beyond typical issues, certain enterprise environments present extreme complexity due to decades of organic growth and diverse technology stacks. Drawing from experience in large-scale data warehouses, this section describes challenges that make automated lineage extraction particularly difficult — challenges that motivated building a specialized preprocessing and parsing pipeline.
Thousands of legacy batch jobs
In mature data warehouses, jobs accumulate over years, created by different teams with inconsistent conventions:
- Highly coupled dependencies
- Missing or outdated metadata
- Manual dependency tracking becomes impractical at scale
Automation becomes essential.
SQL is not pure SQL: templates, embedded code, and vendor dialects
SQL rarely exists in isolation in production systems. Common patterns include:
Template-based SQL:
```sql
SELECT * FROM ${SOURCE_SCHEMA}.customer_data
WHERE batch_date = ${RUN_DATE}
```
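Before a parser can touch such a query, the `${...}` placeholders must be rendered. Since the syntax happens to match Python's `string.Template`, a minimal sketch of the expansion step could look like this (the variable values are assumed examples):

```python
from string import Template

raw_sql = (
    "SELECT * FROM ${SOURCE_SCHEMA}.customer_data "
    "WHERE batch_date = ${RUN_DATE}"
)

# safe_substitute leaves unresolved placeholders in place instead of raising,
# which helps when a job's full variable set is unknown at lineage-extraction time.
rendered = Template(raw_sql).safe_substitute(
    SOURCE_SCHEMA="edw_prod",    # hypothetical example values
    RUN_DATE="'2024-01-31'",
)
```

Real template engines (shell, Jinja, proprietary schedulers) each need their own renderer, but the principle is the same: produce parseable SQL first.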
SQL embedded in application code:
- Python: Dynamic query construction with f-strings or string concatenation
- COBOL: SQL embedded in EXEC SQL blocks
- Perl: Template systems mixing procedural logic with SQL
Vendor-specific dialects:
- Teradata procedural extensions (.SET, .IF, EXEC)
- Oracle PL/SQL blocks
- T-SQL stored procedures
Each approach complicates lineage extraction — the SQL parser must first extract the query from its host language context.
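For some host languages the extraction step is mechanical. COBOL, for example, brackets every statement between `EXEC SQL` and `END-EXEC`, so a regex pass can recover the queries; a minimal sketch (the sample program text is invented for illustration):

```python
import re

# EXEC SQL ... END-EXEC blocks, case-insensitive, possibly spanning lines.
_EXEC_SQL = re.compile(r"EXEC\s+SQL\s+(.*?)\s*END-EXEC", re.IGNORECASE | re.DOTALL)

def extract_cobol_sql(source: str) -> list[str]:
    # Return every embedded SQL statement found in a COBOL source file.
    return [m.group(1).strip() for m in _EXEC_SQL.finditer(source)]

cobol = """
       MOVE 1 TO WS-X.
       EXEC SQL
           SELECT CUST_ID INTO :WS-CUST FROM CUSTOMER
       END-EXEC.
"""
```

Python f-strings and Perl templates are far harder: the query is assembled at runtime, so extraction degrades to heuristics (constant folding of string concatenation) or to capturing queries from logs.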
Noise from comments, multilingual content, and inconsistent naming
Legacy codebases often accumulate various forms of noise:
Comments and documentation:
- Mix of inline, multi-line, and vendor-specific comment styles
- Documentation in multiple languages (reflecting global or offshore teams)
- Outdated comments referencing deprecated logic
Naming inconsistencies:
- Column names in different languages (English, local language, or mixed)
- Meaningless aliases (t1, x, temp_final_final)
- Reused table names across contexts
Extracting clean lineage requires distinguishing signal from noise — comments must be removed, but not at the expense of losing vendor-specific SQL syntax.
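A naive regex that deletes everything after `--` will mangle queries where the marker appears inside a string literal. A character-level scan avoids that; the sketch below is a simplified illustration (it does not yet handle vendor-specific comment styles, which a production version would special-case):

```python
def strip_sql_comments(sql: str) -> str:
    # Single pass: copy string literals verbatim, skip comments,
    # so a '--' inside a quoted value is NOT treated as a comment.
    out, i, n = [], 0, len(sql)
    while i < n:
        if sql[i] == "'":                      # string literal: copy to closing quote
            j = i + 1
            while j < n and sql[j] != "'":
                j += 1
            out.append(sql[i:j + 1])
            i = j + 1
        elif sql.startswith("--", i):          # line comment: skip to end of line
            j = sql.find("\n", i)
            i = n if j == -1 else j            # keep the newline itself
        elif sql.startswith("/*", i):          # block comment: skip to closing marker
            j = sql.find("*/", i + 2)
            i = n if j == -1 else j + 2
        else:
            out.append(sql[i])
            i += 1
    return "".join(out)
```

Note what the literal-aware branch buys: `SELECT '--not a comment' FROM t` survives intact, while genuine inline and block comments disappear.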
SQL patterns too complex for off-the-shelf lineage tools
Production SQL often contains patterns that defeat generic parsers:
Computed column dependencies within the same SELECT:
```sql
SELECT
    acct.account_id,
    CASE
        WHEN acct.status_code = '00' THEN 'A'
        WHEN acct.status_code = '07' THEN 'C'
        ELSE 'X'
    END as derived_status,
    -- References the column defined above
    CASE
        WHEN derived_status = 'C' THEN 0
        ELSE acct.balance
    END as effective_balance,
    -- Further chained reference
    CAST(effective_balance * rate.conversion_rate AS DECIMAL(18,2)) as converted_balance
FROM source_table acct
JOIN exchange_rates rate ON acct.currency = rate.currency
```
This pattern is common in financial ETL but problematic for parsers: derived_status is defined and immediately referenced in the same SELECT clause. Generic parsers fail because they expect column references to resolve to source tables, not to computed columns in the same projection.
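Handling this requires an extra resolution pass: after a parser has collected the raw column references per projection, any reference that matches an earlier alias in the same SELECT must be expanded recursively. A minimal sketch of that pass (the `projection_refs` map mimics what a parser might emit for the query above; it is hand-written here, not real parser output):

```python
# Hypothetical per-projection references, as a parser might emit them
# for the financial ETL query above.
projection_refs = {
    "derived_status": {"acct.status_code"},
    "effective_balance": {"derived_status", "acct.balance"},
    "converted_balance": {"effective_balance", "rate.conversion_rate"},
}

def resolve(refs: set[str], aliases: dict[str, set[str]]) -> set[str]:
    # Expand same-SELECT alias references until only true source columns remain.
    out = set()
    for ref in refs:
        if ref in aliases:
            out |= resolve(aliases[ref], aliases)   # chained computed column
        else:
            out.add(ref)                            # genuine source column
    return out
```

After resolution, `converted_balance` traces back to `acct.status_code`, `acct.balance`, and `rate.conversion_rate` — the column-level lineage a generic parser misses.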
Other challenging patterns:
- Deeply nested subqueries (5+ levels) with alias renaming at each layer
- Large UNION chains combining 10+ tables with overlapping column names
- Dynamic table/column name resolution (metadata-driven ETL)
- Vendor-specific ranking functions (QUALIFY in Teradata, TOP in T-SQL)
Each pattern requires specialized AST traversal and context tracking beyond what standard SQL parsers provide.
A dedicated preprocessing pipeline is required
To reliably feed SQL into lineage engines (e.g., sqllineage, sqlglot), a multi-stage preprocessing pipeline is necessary:
Cleaning and normalization:
- Comment removal (preserving vendor syntax)
- Template variable expansion and macro resolution
- Vendor-specific syntax normalization (Teradata, Oracle, T-SQL, etc.)
- Removal of multilingual content and non-standard formatting
Structure extraction:
- Separating DDL metadata from DML queries
- Extracting SQL from host language contexts (Python, COBOL, shell scripts)
- Tokenization and syntax tree preparation
Only after this normalization can standard lineage engines produce reliable results. The preprocessing layer becomes as critical as the parser itself.
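The cleaning and normalization stage can be sketched as a chain of small transforms. The version below is deliberately simplified (its comment stripping, for instance, ignores string literals, which a production pass must respect) and exists only to show the shape of the pipeline:

```python
import re
from string import Template

def preprocess(sql: str, variables: dict[str, str]) -> str:
    # 1. Strip line comments (simplified: a real pass must respect string literals).
    sql = re.sub(r"--[^\n]*", "", sql)
    # 2. Strip block comments.
    sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.DOTALL)
    # 3. Expand ${VAR} placeholders; unknown ones are left in place.
    sql = Template(sql).safe_substitute(variables)
    # 4. Normalize whitespace for the downstream parser.
    return re.sub(r"\s+", " ", sql).strip()
```

Each numbered step maps to one bullet above; vendor-dialect normalization would slot in as a further step, typically dialect-specific rewrites before the query reaches the lineage engine.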
3. Conclusion
Data Lineage is not optional — it is foundational to modern data operations. In environments with large amounts of legacy SQL and heterogeneous scripting, a robust lineage pipeline is crucial for maintainability and reliability.
This context motivated the development of an automated SQL lineage extraction tool. In the next articles of this series, we'll explore:
- Part 2: Getting started with Python and sqllineage for standard SQL scenarios
- Part 3: System Design - A Production-Ready SQL Lineage Pipeline
- Architecture designed for real production environments with thousands of legacy SQL files
- Multi-stage processing pipeline: preprocessing → parsing → lineage extraction → export
- Happy path workflow and edge case handling strategies
- Error isolation, graceful degradation, and comprehensive logging
- Design decisions: single-threaded processing, step-based architecture, and scalability considerations
- Real-world deployment and operational characteristics