ArisynData

Posted on Jun 26

Why Text-to-SQL Breaks When the Join Path Is Not Obvious

#texttosql #ai #dataengineering #datagovernance

Most Text-to-SQL examples are too clean.

They usually assume a simple schema, obvious table names, clear foreign keys, and a question that maps neatly to one or two tables. In that environment, generating SQL from natural language looks impressive.

Enterprise databases are not like that.

In real analytics work, the hard part is often not the SELECT clause. It is the join path.

The SQL Can Be Valid and Still Wrong

Imagine a user asks:

“Show revenue by customer for the last quarter.”

A model may generate something like:

SELECT
  c.customer_name,
  SUM(o.revenue) AS total_revenue
FROM customers c
JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_date >= '2026-01-01'
  AND o.order_date < '2026-04-01'
GROUP BY c.customer_name;

Technically, this looks fine.

But in an enterprise environment, several things may be wrong:

customers may not be the approved customer master.
orders.revenue may not be the finance-approved revenue field.
Customer records may be duplicated across regions.
Some orders may need to be excluded because they were adjusted later.
The join may create duplication if there are multiple customer records per account.
Last quarter may follow fiscal, not calendar, logic.

The database accepts the query.

The dashboard renders.

The answer is wrong.

That is the uncomfortable part of enterprise Text-to-SQL.

A syntactically valid query is not the same as a trusted query.

Why Join Paths Are Hard

Join paths are obvious only when the data model is clean.

In production systems, they are usually messy.

You may have:

Missing foreign keys
Legacy tables
Similar columns with different meanings
One-to-many relationships that create fanout
Historical snapshots
Slowly changing dimensions
Department-specific marts
Fields reused for different purposes
Business rules that exist only in old SQL reports

Even experienced engineers often need time to inspect the schema, check existing reports, ask someone in finance, and run sample queries before trusting a join.

Now ask an AI model to do the same thing with only table names and column names.

It will guess.

Sometimes the guess will be right.

Sometimes it will be dangerously plausible.

The Fanout Problem

One of the most common issues is fanout.

Suppose you join orders to order_lines and then to shipments.

SELECT
  o.customer_id,
  SUM(o.order_amount) AS revenue
FROM orders o
JOIN order_lines l
  ON o.order_id = l.order_id
JOIN shipments s
  ON l.line_id = s.line_id
GROUP BY o.customer_id;

If one order line can have multiple shipments, order revenue may be counted multiple times.

The SQL is valid.

The join is valid.

The result is not valid for revenue reporting.

A human analyst may know to aggregate at the order level first, or use a shipment-adjusted revenue table, or avoid this path altogether.

A model needs that knowledge in context.

Why Metadata Alone Is Not Enough

Metadata helps, but it does not solve the full problem.

Column names can tell you that customer_id relates to customers.

They do not tell you whether this is the right customer relationship for financial reporting.

Foreign keys can tell you that a relationship exists.

They do not tell you whether the relationship is safe for aggregation.

Descriptions can tell you what a table contains.

They do not always explain historical exceptions.

That is why enterprise Text-to-SQL needs more than schemas.

It needs relationship context.

What Relationship Context Should Include

At minimum, an AI query system should know:

Candidate join paths between tables
Approved join paths for common business questions
Relationship cardinality
Known fanout risks
Join confidence
Source of relationship evidence
Whether the relationship came from constraints, SQL history, naming patterns, dbt models, BI datasets, or human approval
Which paths are rejected or deprecated

This changes the behavior of the system.

Instead of simply generating SQL, it can reason about query safety.

It can choose the trusted path.

It can warn when no approved path exists.

It can ask for clarification when multiple paths are possible.

That is far more useful than blindly producing a query.

A Better Pattern

A more reliable Text-to-SQL system should work like this:

Parse the user question.
Identify business entities and metrics.
Resolve semantic definitions.
Retrieve candidate tables.
Retrieve trusted relationship paths.
Check join risks.
Generate SQL using approved paths.
Validate the SQL against semantic and relationship rules.
Explain the assumptions behind the result.

The key difference is step 5 and step 6.

Many systems jump from semantic mapping directly to SQL generation.

That is where errors enter.

The missing layer is relationship intelligence.

Final Thought

Text-to-SQL is not just a language translation problem.

It is a context problem.

Models are getting better at writing SQL.

But enterprise analytics requires more than syntactically correct SQL.

It requires knowing which joins are safe, which paths are trusted, and which assumptions should be checked before the query runs.

Until that context exists, Text-to-SQL will continue to work well in demos and struggle in real companies.

DEV Community