Why Text-to-SQL Needs Join Path Context, Not Just Schema

#ai #sql #dataengineering #texttosql

A lot of Text-to-SQL examples start with something like this:

Give the model the schema.
Ask the user for a question.
Generate SQL.
Run it.

That can work for demos.

It gets much harder in real enterprise data.

The problem is not always SQL syntax. The model may generate perfectly valid SQL. The query may run. The output may even look reasonable.

The issue is whether the query used the right path through the data.

Schema does not tell the whole story

A schema gives you tables, columns, types, and sometimes constraints.

But it usually does not tell you enough about analytical intent.

For example, imagine a question like:

Show monthly revenue by customer segment.
A model may see tables like:

1）customers
2）orders
3）order_items
4）payments
5）invoices
6）customer_segments
It can probably generate a join.

But which one?

Should revenue come from orders, payments, invoices, or finance postings?

Should customer segment be current segment or segment at the time of order?

Should the query join customer directly to orders, or through account, membership, or billing entity?

Will joining order items duplicate revenue?

These are not syntax questions.

They are relationship and grain questions.

Valid SQL is not the same as correct SQL

This is where enterprise Text-to-SQL gets dangerous.

A bad query does not always fail.

Sometimes it returns a clean table and a nice chart.

The model may choose a join path that looks obvious from column names but is not the business-approved path. Or it may join two fact tables directly and inflate the result. Or it may use a dimension table that is current-state when the metric needs historical context.

No exception is thrown.

The number is just wrong.

What join path context should include

A more reliable Text-to-SQL system needs relationship context before SQL generation.

At minimum, join path context should include:

candidate tables for the question
approved table relationships
join keys
cardinality
table grain
preferred paths
risky paths
duplication risks
validation checks
This turns table relationships into something the model can actually use.

Instead of asking the model to infer everything from names, the system gives it a query plan.

A better workflow

A schema-only workflow looks like this:

User question
→ schema prompt
→ SQL generation
→ execution

A relationship-aware workflow looks more like this:

User question
→ intent detection
→ semantic matching
→ candidate table selection
→ trusted join path retrieval
→ query plan
→ SQL generation
→ validation
→ answer

That extra planning step is not just overhead.

It is what prevents the system from producing plausible but wrong SQL.

The analyst version of this

Good analysts already think this way.

Before writing a join, they ask:

What is the grain of this table?
Is this the source of truth?
Will this join duplicate the metric?
Is this relationship current-state or historical?
Does this path match the business definition?
Text-to-SQL systems need to capture more of that reasoning.

Not as a long prompt full of vague instructions, but as structured context the system can retrieve and apply.

Final thought

Text-to-SQL is not just a translation problem.

It is a data modeling problem, a governance problem, and a relationship-context problem.

Models are getting better at writing SQL.

But if the system does not know how the underlying data connects, better SQL generation just gives us wrong answers faster.