ArisynData

Posted on Jun 11

Why Text-to-SQL Needs Table Relationship Discovery Before SQL Generation

#ai #sql #data

Most Text-to-SQL discussions start with language.

Can the model understand the question?
Can it generate valid SQL?
Can it explain the result?

Those are important.

But in enterprise systems, the harder problem is often not language.

It is table relationships.

A model may understand the user’s question and still choose the wrong tables, the wrong join keys, or the wrong join path.

The SQL may run.

The answer may look reasonable.

The result may still be wrong.

A simple example

Suppose a user asks:

What was revenue by customer segment last quarter?
A model may retrieve tables that look relevant:

customers
customer_profiles
customer_segments
orders
contracts
invoices
invoice_items
revenue_snapshot

Now the difficult part begins.

Which customer table is authoritative?

Should revenue come from invoices, contracts, orders, or a finance snapshot?

Should customer segment be current or historical?

How should invoice data connect to customer segment?

Should the join go directly from invoices to customers, or through contracts, accounts, or a bridge table?

These are not just semantic questions.

They are relationship questions.

Schema retrieval is not enough

A common Text-to-SQL flow looks like this:

User question
→ retrieve relevant schemas
→ send schemas to LLM
→ generate SQL
→ execute query
This can work for simple demos.

It breaks down when the schema is large, old, inconsistent, or distributed across systems.

Schema retrieval can tell the model that a column exists.

It does not always tell the model whether that column should be used.

For example:

customer.customer_id
crm_account.account_id
billing_customer.customer_ref
customer_snapshot.customer_id
customer_profile.customer_id
All of these may look relevant.

Only one may be right for the question.

Or the correct answer may require more than one of them.

Foreign keys are helpful, but incomplete

Declared foreign keys are useful when they exist.

But many enterprise systems do not fully declare relationships in the database.

Some relationships are enforced in application code. Some come from ETL logic. Some exist across systems. Some live only in BI models or analyst knowledge.

Even when foreign keys exist, they may not define the right analytical path.

A join can be technically valid and analytically wrong.

That is why Text-to-SQL systems need a relationship-aware layer before SQL generation.

What relationship discovery should provide

Before generating SQL, the system should be able to build context like this:

{
  "question": "revenue by customer segment last quarter",
  "candidate_tables": [
    "invoices",
    "contracts",
    "customers",
    "customer_segments"
  ],
  "preferred_join_path": [
    "invoices.contract_id -> contracts.contract_id",
    "contracts.customer_id -> customers.customer_id",
    "customers.segment_id -> customer_segments.segment_id"
  ],
  "grain_warning": "invoice detail level may duplicate revenue unless aggregated before joining",
  "time_logic": "use historical customer segment for period-based reporting",
  "confidence": "high"
}

This kind of context gives the LLM a much better foundation.

Instead of guessing from names, the model receives evidence about how the data connects.

A better Text-to-SQL workflow

A relationship-aware workflow looks more like this:

User question
→ intent understanding
→ semantic mapping
→ metadata retrieval
→ table relationship discovery
→ join path selection
→ SQL generation
→ SQL validation
→ execution
→ explanation
The relationship discovery step should answer:

Which tables are relevant?
Which joins are possible?
Which join path is preferred?
What is the relationship strength?
Could the join duplicate rows?
Is there a bridge table?
Are there multiple valid paths?
Should the system ask a clarification question?
This step reduces the chance that the model produces SQL that is syntactically correct but structurally wrong.

What can go wrong without relationship context?

Here are common failure modes.

Duplicate rows

A one-to-many join multiplies records because the model does not understand data grain.

SELECT
c.segment,
SUM(i.amount) AS revenue
FROM invoices i
JOIN customers c
ON i.customer_id = c.customer_id
JOIN customer_contacts cc
ON c.customer_id = cc.customer_id
GROUP BY c.segment;
If each customer has multiple contacts, revenue may be over-counted.

Missing records

An inner join removes valid records because not all entities have a matching dimension row.

Wrong table

The model chooses customer_profile when the approved reporting path requires customer_snapshot.

Wrong time logic

The model uses the customer’s current segment instead of the segment at the time of transaction.

False confidence

The worst case: the query runs, the chart looks good, and nobody notices the join path was wrong.

Relationship discovery is not just for AI

This is useful beyond Text-to-SQL.

Table relationship discovery also helps with:
1）data warehouse modeling
2）ETL and ELT design
3）BI dataset preparation
4）migration planning
5）impact analysis
6）data quality checks
7）metadata documentation
8）lineage and governance
In many companies, relationship knowledge is scattered across SQL scripts, notebooks, dashboards, and senior analysts’ memory.

AI systems need that knowledge to become structured and reusable.

Final thought

Text-to-SQL is not only a translation problem.

It is a data context problem.

A model needs to know what the user means.

It also needs to know how the data connects.

That is why table relationship discovery should happen before SQL generation, not after the query fails.

Better prompts help.

Better relationship context matters more.

DEV Community

Why Text-to-SQL Needs Table Relationship Discovery Before SQL Generation

Top comments (0)