We've been designing data systems backwards.
For decades, we started with structure - defining schemas, modeling entities, establishing relationships - and only then did we let data flow through those predefined paths.
It made sense in a world where systems were isolated, controlled, and relatively stable.
That world no longer exists.
The Problem with Schema-First Thinking
In most enterprises today, data doesn't originate from a single system.
It comes from:
· Legacy applications
· SaaS platforms
· External integrations
· Rapidly evolving business logic
And none of these evolve in sync.
Yet we still insist on imposing a fixed schema on top of them.
The result is predictable:
· Models drift away from reality
· Relationships become assumptions rather than facts
· Every integration requires re-interpretation
Over time, the schema stops describing the system.
It starts describing what we think the system looks like.
And that gap is where most data problems live.
Data Already Knows More Than We Do
If you step away from modeling and look at the data itself, something interesting emerges.
Data carries signals about its own structure.
Not explicitly - but statistically.
For any given column, you can observe:
· How many distinct values it contains
· How complete those values are
· How those values overlap with other columns
These are not design decisions.
They are observable properties.
For example:
If the majority of values in one column consistently appear in another, that is not a coincidence.
It is evidence of a relationship.
This is what is often overlooked.
We treat structure as something we define.
But in reality:
Structure is something we can measure.
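The overlap signal described above can be measured directly. Here is a minimal sketch of an inclusion coefficient: the fraction of distinct values in one column that also appear in another. The data is hypothetical, purely for illustration.

```python
def inclusion_coefficient(source, target):
    """Fraction of distinct non-null values in `source` that also
    appear in `target`. A value near 1.0 is evidence that `source`
    may reference `target` (an inclusion dependency)."""
    src = {v for v in source if v is not None}
    tgt = {v for v in target if v is not None}
    if not src:
        return 0.0
    return len(src & tgt) / len(src)

# Hypothetical columns: orders referencing customer ids.
customer_ids = [1, 2, 3, 4, 5]
order_customer_ids = [1, 1, 2, 3, None, 5]

print(inclusion_coefficient(order_customer_ids, customer_ids))  # → 1.0
print(inclusion_coefficient(customer_ids, order_customer_ids))  # → 0.8
```

Note the asymmetry: every order's customer exists in the customer column, but not every customer has an order. That directionality is itself structural information.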
From Definition to Discovery
This leads to a different way of thinking about data systems.
Instead of:
Define schema → Ingest data
We invert the flow:
Analyze data → Infer schema
This doesn't eliminate modeling.
But it changes its role.
Schema is no longer the starting point.
It becomes a derived artifact - something we validate and refine, not something we assume to be correct from the beginning.
Technically, this shift is grounded in a few simple ideas:
· Distinct value patterns indicate identity or cardinality
· Null distribution reveals optionality and completeness
· Inclusion relationships expose containment and dependency
Individually, these signals are weak.
Combined, they form a reliable structural picture.
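A sketch of how the first two signals might be computed and combined. The thresholds and the `looks_like_key` heuristic are illustrative assumptions, not a production rule set; a real system would calibrate them against many columns.

```python
def profile_column(values):
    """Compute two of the signals above for a single column:
    how complete it is, and how distinct its values are."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "null_ratio": 1 - len(non_null) / total if total else 0.0,
        "distinct_ratio": len(distinct) / len(non_null) if non_null else 0.0,
    }

def looks_like_key(profile):
    # Illustrative heuristic: a column with no nulls and all-distinct
    # values behaves like an identifier.
    return profile["null_ratio"] == 0.0 and profile["distinct_ratio"] == 1.0

ids = [101, 102, 103, 104]
statuses = ["open", "open", "closed", None]

print(looks_like_key(profile_column(ids)))       # → True
print(looks_like_key(profile_column(statuses)))  # → False
```

Each check is weak on its own; the point is that they are cheap to compute and stack into stronger evidence.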
In other words:
Data can explain itself - if we are willing to listen.
Why This Matters Now
This shift is not theoretical.
It becomes necessary as systems scale.
At small scale, humans can:
· Read schemas
· Trace relationships
· Validate assumptions
At enterprise scale, this breaks down completely.
You are no longer dealing with:
· Hundreds of tables
· Thousands of fields
But tens of thousands of columns across multiple systems.
Manual understanding doesn't scale.
Assumptions don't scale.
Documentation certainly doesn't scale.
Only evidence-based structure scales.
A Practical Direction
Some systems are beginning to move toward this model.
Instead of relying solely on metadata or predefined keys, they analyze data content directly:
· Identifying inclusion patterns across tables
· Inferring relationships without naming conventions
· Constructing relationship graphs that can be executed
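The steps above can be sketched end to end: profile every column pair, keep the pairs whose inclusion coefficient clears a threshold, and emit them as edges of a relationship graph. The tables, column names, and threshold are hypothetical; this is a naive O(n²) scan, not how a production discovery engine would scale.

```python
# Hypothetical tables: dicts mapping column name → values.
tables = {
    "customers": {"id": [1, 2, 3]},
    "orders": {"order_id": [10, 11, 12], "customer_id": [1, 1, 3]},
}

def inclusion(source, target):
    src = {v for v in source if v is not None}
    tgt = {v for v in target if v is not None}
    return len(src & tgt) / len(src) if src else 0.0

def discover_edges(tables, threshold=0.95):
    """Propose 'child column → parent column' edges wherever the
    inclusion coefficient clears the threshold. Column names play
    no part — only the data content does."""
    edges = []
    for t1, cols1 in tables.items():
        for c1, v1 in cols1.items():
            for t2, cols2 in tables.items():
                if t1 == t2:
                    continue
                for c2, v2 in cols2.items():
                    if inclusion(v1, v2) >= threshold:
                        edges.append((f"{t1}.{c1}", f"{t2}.{c2}"))
    return edges

print(discover_edges(tables))  # → [('orders.customer_id', 'customers.id')]
```

The resulting edge list is exactly the kind of relationship graph that can then be validated, governed, or executed as a join path.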
One example is Arisyn.
It approaches data relationships as a discovery problem rather than a modeling task - analyzing actual data characteristics to infer how tables connect, even across systems.
The significance here is not the tool itself.
It's the shift in approach.
Rethinking the Role of Data Engineering
If schema can be discovered rather than defined, then the role of data engineering changes.
Less time is spent on:
· Manually mapping relationships
· Maintaining brittle models
· Reconciling inconsistencies
More time is spent on:
· Validating structural signals
· Governing discovered relationships
· Building systems that adapt with data
This is a subtle but important transition.
From:
Designing structure
To:
Managing structural truth
Final Thought
We've long treated schema as the source of truth.
But in modern data systems, that assumption is increasingly fragile.
Perhaps the more durable approach is this:
The schema is not something we define once.
It is something we continuously discover.
And if that's true,
then a more interesting question emerges:
If data can reveal its own structure, what does a data engineer become?