Vivek Kumar

Posted on May 21

From English to SQL: How LLMs Actually Understand Your Database Schema

#sql #database #ai #tutorial

You type "show me total revenue by customer for last month" into a chat interface. A second later, out pops a perfectly formatted SQL query against your orders and customers tables. It feels like magic.

But it's not. And once you understand the mechanics — specifically, how the model reads and interprets your database schema — you'll know exactly why text-to-SQL tools sometimes get it right, sometimes get it embarrassingly wrong, and what you can do to tip the odds in your favor.

This post breaks down the internal mechanics of schema understanding in LLM-powered text-to-SQL systems, with practical advice for developers building or integrating these tools.

The First Problem: The LLM Doesn't Know Your Database

Out of the box, an LLM has no idea that your SaaS product has a subscriptions table, or that mrr lives in billing_events, or that you store user_id in lowercase snake_case everywhere except the one legacy table where it's UserID.

That knowledge has to be injected into the model's prompt — every single time. The process of selecting which schema information to include, and how to format it, is called schema linking. It's the most critical (and most commonly fumbled) part of any text-to-SQL pipeline.

A typical schema injection looks like this:

-- Passed to the LLM in the system or user prompt:

Table: users
Columns: id (integer, PK), email (text), created_at (timestamp), plan (text)

Table: orders
Columns: id (integer, PK), user_id (integer, FK → users.id), amount_cents (integer), status (text), created_at (timestamp)

Table: subscriptions
Columns: id (integer, PK), user_id (integer, FK → users.id), plan_id (integer), started_at (timestamp), cancelled_at (timestamp, nullable)

The model reads this and builds an internal understanding of your data model — which tables exist, what columns they have, how they relate to each other, and what data types to expect.

Step 1: Schema Parsing — What the Model Actually Sees

When a user asks "which customers churned last month?", the LLM doesn't just search for the word "churned" in your schema. It reasons through what "churned" likely means given the available columns: probably a cancelled_at timestamp that falls within the last 30 days, joined against users via subscriptions.

This reasoning — connecting semantic intent in the question to structural elements in the schema — is called schema linking. It involves three sub-problems:

1. Table selection: Which tables are relevant to this question? For large databases with hundreds of tables, the model must filter down to the handful that matter. Dumping your entire schema into the prompt doesn't scale — a 300-table database easily exceeds a model's effective context window.

2. Column selection: Within those tables, which columns are needed? This is where naming matters enormously. amt is ambiguous. amount_cents is clear. ts is useless. created_at is useful.

3. Join inference: How do the relevant tables connect? The model needs to reconstruct the relationship graph — ideally from explicit foreign key declarations in the schema you provide, rather than guessing from column name patterns alone.

Step 2: Why Column Names and Table Names Matter More Than You Think

Here's a test. Give an LLM these two schemas and ask "how many users signed up last week?"

Schema A:

Table: u
Columns: uid, em, ts, pl

Schema B:

Table: users
Columns: id, email, created_at, plan

Schema B wins every time. Natural language tokens in column names (email, created_at, plan) map cleanly to words a model understands. Abbreviations break that mapping.

Research from IBM and others has shown that schema-aware models — those given rich metadata including table descriptions, column definitions, and sample values — outperform models handed raw DDL by a significant margin.

The practical takeaway: invest in good naming conventions, and pass them into your text-to-SQL pipeline explicitly. If you're stuck with legacy short column names, add descriptions.

Step 3: How to Structure Schema for Maximum LLM Comprehension

Here's an example of a well-formatted schema prompt that a text-to-SQL system might use:

You are a SQL expert. Use the following schema to answer the user's question.

### Schema

Table: users
Description: One row per registered user.
Columns:
  - id (integer, primary key)
  - email (text) — the user's login email
  - created_at (timestamp) — when the account was created
  - plan (text) — current plan: 'free', 'pro', 'enterprise'

Table: orders
Description: Purchases made by users.
Columns:
  - id (integer, primary key)
  - user_id (integer) — foreign key → users.id
  - amount_cents (integer) — order total in cents (divide by 100 for dollars)
  - status (text) — 'pending', 'completed', 'refunded'
  - created_at (timestamp)

### Question
Which users on the 'pro' plan have made more than 3 orders in the past 30 days?

Notice what's included beyond just column names:

Table descriptions that explain business purpose
Inline comments on ambiguous columns (units, enumerations)
Explicit foreign key relationships spelled out in plain language

This is the difference between a mediocre text-to-SQL result and one you'd actually run in production.

Step 4: The Large Schema Problem — When You Can't Fit Everything

Most real-world databases have far more tables than you'd ever want in a single prompt. Cramming 200 table definitions into the context window creates two problems: cost, and accuracy. Models are measurably worse at attending to schema details when surrounded by irrelevant tables.

The solution is retrieval-augmented schema selection. Before calling the LLM, a retrieval step filters the schema:

# Pseudocode for schema-aware RAG in a text-to-SQL pipeline

user_question = "show me monthly revenue by plan for the last 6 months"

# 1. Embed the user question
question_embedding = embed(user_question)

# 2. Search a vector index of table + column descriptions
relevant_tables = vector_search(
    index=schema_index,
    query=question_embedding,
    top_k=5
)

# 3. Build a focused schema prompt with only relevant tables
schema_context = format_schema(relevant_tables)

# 4. Call the LLM with a focused prompt
sql = llm.complete(
    system_prompt=f"Use this schema:\n{schema_context}",
    user_prompt=user_question
)

Systems like RASL (Retrieval Augmented Schema Linking) from Amazon use this approach to handle databases with thousands of tables, achieving dramatically better accuracy than naive full-schema injection.

The key insight: the right schema in the prompt beats a larger model every time.

Step 5: Foreign Keys Are the Secret Weapon

Joins are where text-to-SQL models fail most visibly. A model might correctly identify that you need both users and orders, then produce a broken query because it guessed the wrong join column.

The fix is explicit: always include foreign key relationships in your schema prompt. Compare these two approaches:

Vague:

Table: orders
Columns: id, user_id, amount_cents, created_at

Explicit:

Table: orders
Columns:
  - id (integer, primary key)
  - user_id (integer) — foreign key → users.id; never null
  - amount_cents (integer)
  - created_at (timestamp)

When foreign keys are declared, models reliably produce correct JOINs:

-- Generated correctly when FK is explicit:
SELECT 
  u.email,
  COUNT(o.id) AS order_count,
  SUM(o.amount_cents) / 100.0 AS total_revenue
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.status = 'completed'
  AND o.created_at >= NOW() - INTERVAL '30 days'
GROUP BY u.email
ORDER BY total_revenue DESC;

Without the FK hint, you're relying on the model to guess that user_id in orders maps to id in users. It usually gets this right — but "usually" isn't good enough for production.

Common Mistakes Developers Make

Passing raw DDL without descriptions. Your CREATE TABLE statements are a start, but they're optimized for the database engine, not for an LLM. Add comments and descriptions.

-- Don't just pass this:
CREATE TABLE ev (
  id SERIAL PRIMARY KEY,
  uid INT,
  typ VARCHAR(50),
  ts TIMESTAMPTZ
);

-- Also include:
-- Table: events (alias: ev)
-- Description: user activity events for product analytics
-- uid → users.id, typ = 'click' | 'page_view' | 'signup' | 'purchase'

Including every table for every query. Token-bloat hurts accuracy. Filter schema by relevance before injecting.

Ignoring enum values. If a column like status has a fixed set of values, list them. The model needs to know that 'completed' is a valid status, not 'done' or 'success'.

No sample data. Even a single example row per table dramatically improves accuracy for edge cases — especially date formats, unit conventions, and enum spellings.

Key Takeaways

The gap between a text-to-SQL tool that impresses and one you'd trust in production almost always comes down to schema quality, not model quality. Here's the short version:

Use descriptive table and column names; avoid abbreviations
Declare foreign key relationships explicitly in your schema prompt
Include column descriptions for ambiguous fields (units, enums, nullable meaning)
Filter schema by relevance using vector search for large databases
Add sample values or enum lists for categorical columns
Never assume the model will infer what you didn't tell it

The LLM is doing surprisingly sophisticated reasoning — it's matching your English words to your schema structure, inferring joins, and constructing syntactically valid SQL all in one pass. Give it the right schema context, and it'll give you queries you can actually use.

Have you built a text-to-SQL system or integrated one into your product? I'd love to hear how you handle schema injection — drop a comment below with your approach, especially if you've tackled the large-schema problem in an interesting way.