DEV Community

ArisynData
ArisynData

Posted on

Designing Relationship Context for Text-to-SQL Systems

A common Text-to-SQL architecture looks like this:

user question -> schema context -> prompt -> SQL

This works well enough for demos.

It is not enough for enterprise databases.

In real production environments, the missing piece is often relationship
context: structured information about how tables and fields connect,
which join paths are trusted, and where ambiguity exists.

A semantic layer can define what a business term means.

But the SQL generator still needs to know how the physical data objects
connect.

The problem with schema-only context

Many Text-to-SQL systems start by passing table names, column names, and
a user question into a model.

For example:

Question:

Which customers have not placed an order in the last 90 days?

Schema:

customers(id, name, email, signup_date)

orders(id, customer_id, order_date, status, total_amount)

For a simple schema, this is fine.

The relationship is obvious:

customers.id = orders.customer_id

But enterprise schemas are rarely that simple.

You may have:

crm_account

customer_profile

customer_profile_v2

sales_contract

invoice_header

invoice_line

payment_record

finance_revenue_snapshot

region_mapping

Some tables are authoritative.

Some are historical.

Some are replicated.

Some are partially deprecated.

Some relationships are declared as foreign keys.

Many are not.

A model can still generate SQL from this environment, but without the
right relationship context, it may generate SQL that looks correct and
returns the wrong answer.

That is worse than an obvious error.

Semantic context is different from relationship context

It is useful to separate two kinds of context.

Semantic context answers:

What does this business term mean?

Relationship context answers:

How do the relevant data objects connect?

For example, semantic context may define:

{

"business_term": "revenue",

"approved_metric": "recognized_revenue",

"definition": "Revenue recognized according to finance reporting rules",

"owner": "Finance",

"version": "2026.1"

}

That helps the model understand the business meaning.

But it does not fully answer which tables to join.

For that, the system needs relationship context.

What relationship context might look like

Relationship context can be represented as structured input before SQL
generation.

For example:

{

"business_intent": "customers_without_recent_orders",

"semantic_context": {

"entity": "customer",

"activity": "order",

"time_window": "last_90_days"

},

"relationship_context": {

"candidate_tables": [

"customers",

"orders",

"order_items"

],

"preferred_path": [

{

"from": "customers.customer_id",

"to": "orders.customer_id",

"relationship_type": "customer_to_order",

"confidence": 0.97

}

],

"risk_flags": [

"orders table contains cancelled orders",

"order_date should be used instead of created_at"

],

"clarification_required": false

}

}

Now the model is not guessing from raw table names alone.

It has a constrained context.

It knows which tables matter.

It knows how they connect.

It knows which risks to consider.

A more reliable workflow

A relationship-aware Text-to-SQL workflow might look like this:

Natural language question

Intent and entity extraction

Business semantic mapping

Metadata retrieval

Relationship context assembly

Join path selection

SQL generation

Validation

Execution

Explanation and feedback

The important point is that SQL generation happens after semantic and
relationship context are assembled.

This reduces the burden on the model.

The model is still useful, but it is no longer responsible for inventing
the entire data environment from scratch.

Why prompts are not enough

Prompt engineering helps.

You can tell the model:

Use the correct tables.

Avoid ambiguous joins.

Prefer approved metrics.

Ask a clarifying question when necessary.

These are good instructions.

But they are not a substitute for missing data knowledge.

If the system does not know which table is authoritative, the
instruction "use the correct table" is not enough.

If the system does not know which join path is trusted, the instruction
"avoid ambiguous joins" is not enough.

The system needs a governed source of truth for metadata, business
definitions, and table relationships.

Otherwise, the model is forced to make a probabilistic decision in a
place where the business expects determinism.

What should be included in relationship context?

A practical relationship context layer may include:

  • source table and target table

  • source field and target field

  • relationship type

  • confidence score

  • coverage ratio

  • uniqueness ratio

  • preferred join path

  • deprecated relationships

  • business meaning of the relationship

  • validation rules

  • known duplication risks

For example:

{

"relationship": {

"source": "sales_contract.customer_id",

"target": "customer_master.customer_id",

"type": "many_to_one",

"confidence": 0.94,

"coverage": 0.98,

"preferred": true,

"notes": "Preferred customer relationship for sales reporting"

}

}

This kind of context gives the SQL generation layer a much better
starting point.

Validation matters

Even with relationship context, generated SQL should not be executed
blindly.

Validation should check:

  • whether referenced tables exist

  • whether join fields are resolvable

  • whether the query may duplicate records

  • whether the metric definition matches the approved semantic layer

  • whether the user has permission to access the requested data

  • whether ambiguity should trigger clarification

For example, if there are two possible revenue definitions, the system
should not guess.

It should ask:

Do you mean booked revenue or recognized revenue?

A good Text-to-SQL system should know when not to answer.

The practical takeaway

If you are building an enterprise Text-to-SQL system, do not start with
the model alone.

Start with these questions:

  1. What business terms does the user use?

  2. Which metrics and dimensions are governed?

  3. Which tables are authoritative?

  4. Which join paths are trusted?

  5. What should the system do when the question is ambiguous?

  6. How will generated SQL be validated before execution?

  7. How will user feedback improve future answers?

Only then should the model generate SQL.

At Arisyn, this is one of the problems we are exploring: how to make
enterprise Text-to-SQL more relationship-aware, governed, and
explainable by design.

The direction is simple:

Better prompts help.

Better context matters more.

Top comments (0)