SchemaMind

Generative Data Intelligence (GDI): A Paradigm Shift from ETL and Data Fusion

For a long time, data engineering has worked the same way: you define the schema, write the pipeline, and maintain the mappings. Tools like ETL and data fusion frameworks require a lot of human work to understand the data structure before anything useful can happen.

This way of doing things is changing.

Generative Data Intelligence, or GDI, turns that model upside down. Instead of humans having to understand the data structure, the system figures it out for them.

The problem with ETL and traditional data fusion is that they are not very flexible.

  • ETL pipelines are hard to set up and need a lot of configuration.

  • Every source needs a predefined schema.

  • Every transformation needs a rule.

  • If you add a column or rename a field it can break the pipeline.

Data fusion is a little better because it tries to combine data from multiple sources into one view. It still needs humans to define join keys and field mappings.

  • This can be a problem when different teams or systems use names for the same thing.

  • Traditional fusion can break if the naming conventions are not the same.

GDI is different. It uses intelligence to understand the data and figure out what the user wants.

  • The user can type a question in language like "Show 30-day readmission rates by diagnosis and age group".

  • GDI handles everything from there. It finds the sources clarifies any ambiguity generates retrieval commands and returns a unified result.

GDI works in a series of stages.

  1. Source Discovery & Metadata Extraction: GDI reads a configuration file, connects to each registered source, and extracts metadata, such as table schemas and column names, which it stores in a catalogue.

  2. Natural Language Query Refinement: The user submits a query in natural language. GDI uses a language model to classify the query's intent and resolve any ambiguities.

  3. Source Selection & Command Generation: GDI selects the relevant sources and generates retrieval commands. It uses a large language model to evaluate each command and assign it a confidence score.

  4. Cross-Source Dependency Resolution: GDI analyses dependencies across the selected sources and creates temporary staging tables to hold intermediate results.

  5. Semantic Similarity Merge: GDI detects semantically similar columns across datasets and infers join keys automatically.

  6. Conflict Resolution & Policy Enforcement: GDI resolves any conflicts and enforces governance policies like anonymizing fields or restricting access.

  7. Post-Merge Refinement & Output: GDI refines the dataset and delivers the final output to the user.
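Stage 5 is worth illustrating, since it is where GDI replaces hand-written field mappings. Here is a minimal sketch of join-key inference, using plain string similarity from Python's standard library as a stand-in for the embedding-based semantic matching a real GDI system would use; all table and column names are made up for the example.

```python
# Minimal sketch of semantic-merge join-key inference (stage 5).
# String similarity stands in for real embedding-based matching.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def infer_join_keys(cols_a, cols_b, threshold=0.7):
    """Pair up columns from two datasets whose names look alike."""
    pairs = []
    for a in cols_a:
        best = max(cols_b, key=lambda b: similarity(a, b))
        score = similarity(a, best)
        if score >= threshold:  # skip columns with no plausible match
            pairs.append((a, best, round(score, 2)))
    return pairs

# Two sources that name the same fields differently:
ehr_cols = ["patient_id", "diagnosis_code", "admit_date"]
claims_cols = ["PatientID", "diag_code", "claim_amount"]
print(infer_join_keys(ehr_cols, claims_cols))
```

This pairs `patient_id` with `PatientID` and `diagnosis_code` with `diag_code`, while `admit_date` correctly finds no counterpart, which is exactly the mapping a human would otherwise have to write by hand.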

GDI is also self-healing, which means it can fix commands before they are executed.

  • If a command is generated with a confidence score GDI will try to correct it before executing it.

  • This reduces the risk of execution failures. Makes the system more reliable.

GDI can also handle schema evolution, which means it can keep up with changes to the data structure.

  • It can detect changes in time or on a schedule and regenerate affected queries and execution plans.

Overall, GDI is a more flexible and user-friendly way of doing data engineering. It uses intelligence to understand the data and figure out what the user wants, and it can handle complex tasks like cross-source joins and semantic column matching.

  • It is a paradigm shift because it changes the way we think about data integration and makes it more accessible, to -technical users.

  • GDI treats data integration as an intelligence problem, not an engineering problem.

  • It is a way of doing things that can make data engineering more efficient and effective.

GDI looks at data integration as a problem to be solved continuously, through inference, semantic understanding, and the ability to adapt to new situations.

The change is not about the technology. It is about the way we think:

We used to say "set up the system so it can understand your data".

Now we say "the system can understand your data so you do not have to".

This change has an impact on who can use the data, how fast companies can make decisions based on it, and how much time engineers can save on maintenance work.

Where GDI Fits

GDI is a way of thinking and a concept for building systems, not a prescription for which technology to use. The principles apply in any situation where:

You need to ask questions that span multiple kinds of structured data

You have people with different skill levels who need to use the data

The way the data is organized is changing and breaking the old ways of doing things

You need to make sure the rules are followed at the level of the data

You want to reduce the amount of work it takes to set up and maintain the system

The specific implementation (which tools, libraries, storage, and interfaces to use) is up to the people building it. GDI tells you what to do and why to do it; the how is for you to decide.

Closing Thought

The hardest part of integrating data has never been the technology. It has been the knowledge gap: the distance between what people want to know and what the system needs to be told.

GDI fills that gap. Not completely, and not like magic, but in a way that makes a difference: by putting a layer of intelligence between what people want and the structured data, and letting that layer do the work people have been doing by hand for thirty years.

That is the change.

What patterns have you seen in your data pipelines that GDI's approach might help with? Or where do you think the limits of this model are? I would love to hear your thoughts in the comments πŸ‘‡
