The Problem
Most people who need data can't write SQL. Product managers open Jira tickets for simple queries. Support teams need custom admin panels for every lookup. Analysts spend hours on routine joins they've written a hundred times.
NL2SQL (natural language to SQL) fixes this. You type:
"Who are the top 5 customers by order volume?"
The system returns:
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY order_count DESC
LIMIT 5;
LLMs have made this practical. On the Spider benchmark (200+ databases, standard academic dataset), modern systems hit 85–92% execution accuracy. But getting from "works on benchmarks" to "works in production" requires more than a single prompt.
Why Single-Prompt NL2SQL Breaks
The naive approach:
System: Here's the schema: {entire_database_schema}
User: {natural_language_question}
Assistant: {sql_query}
This works when your database has 5 tables with names like customers and orders.
It fails when you have 500 tables, columns named usr_trx_fl, and foreign key chains 6 layers deep. A single LLM call can't simultaneously:
- Figure out which 5 of 500 tables are relevant
- Check for SQL injection
- Generate correct SQL
- Validate syntax, logic, and performance
- Explain what the query does
You need decomposition. Same principle as microservices — single responsibility per agent.
Multi-Agent NL2SQL: How It Works
I'll walk through the pipeline using the open-source NL2SQL project as a reference. It uses 8 agents orchestrated with LangGraph.
1. Table Discovery (Schema Linking)
The hardest part. Given a question, find the relevant tables.
Three signals run in parallel:
Keyword Matching — Token overlap between query and table/column names. No LLM needed.
"singers" → singer table (fuzzy match score: 0.95)
Semantic Search — Embedding similarity. Catches conceptual matches keywords miss.
"revenue" → order_details table (cosine similarity: 0.82)
Foreign Key Graph — BFS from seed tables through FK relationships.
singer (seed) → concert (depth 1, score 0.5) → stadium (depth 2, score 0.25)
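The keyword and FK-graph signals need no LLM at all. A minimal sketch of both, with `difflib` fuzzy matching as an assumption (the project may use a different scorer) and a hand-written FK edge map:

```python
from collections import deque
from difflib import SequenceMatcher

def keyword_score(query_token: str, name: str) -> float:
    """Fuzzy token overlap between a query word and a table/column name."""
    return SequenceMatcher(None, query_token.lower(), name.lower()).ratio()

def fk_graph_scores(fk_edges, seeds, max_depth=2, decay=0.5):
    """BFS outward from seed tables along FK edges; each hop halves the
    score, matching singer (seed) -> concert (0.5) -> stadium (0.25)."""
    scores = {t: 1.0 for t in seeds}
    queue = deque((t, 0) for t in seeds)
    while queue:
        table, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor in fk_edges.get(table, ()):
            if neighbor not in scores:
                scores[neighbor] = decay ** (depth + 1)
                queue.append((neighbor, depth + 1))
    return scores

print(round(keyword_score("singers", "singer"), 2))  # 0.92 with difflib
edges = {"singer": ["concert"], "concert": ["singer", "stadium"],
         "stadium": ["concert"]}
print(fk_graph_scores(edges, seeds=["singer"]))
# {'singer': 1.0, 'concert': 0.5, 'stadium': 0.25}
```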
Scores are weighted and merged:
final_score = 0.35 * keyword + 0.45 * semantic + 0.20 * fk_graph
Only top-K tables (default: 5) pass through to the next stage.
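The merge itself is a few lines. A sketch using the weights above, with each signal as a dict of table → score (the example scores are illustrative, not the project's actual numbers):

```python
def merge_scores(keyword, semantic, fk_graph, top_k=5):
    """Weighted merge of the three discovery signals.
    A table missing from a signal contributes 0 for that signal."""
    tables = set(keyword) | set(semantic) | set(fk_graph)
    final = {
        t: 0.35 * keyword.get(t, 0.0)
           + 0.45 * semantic.get(t, 0.0)
           + 0.20 * fk_graph.get(t, 0.0)
        for t in tables
    }
    # Only the top-K tables survive into the next stage.
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

ranked = merge_scores(
    keyword={"singer": 0.95},
    semantic={"singer": 0.90, "concert": 0.80},
    fk_graph={"singer": 1.00, "concert": 0.50},
)
# singer: 0.35*0.95 + 0.45*0.90 + 0.20*1.00 ≈ 0.94, ranks first
```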
2. Security Filter
Runs before generation. Checks for:
- SQL injection patterns ('; DROP TABLE --)
- Destructive operations (DELETE, TRUNCATE, ALTER)
- System table access
- Dangerous function calls
Blocked queries never reach the generator. Non-negotiable for user-facing systems.
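A minimal deny-list sketch. The patterns are illustrative, not the project's actual rules; a real filter parses the SQL, and deny-lists should complement, never replace, read-only credentials and parameterized execution:

```python
import re

# Illustrative deny-list. Blunt by design: false positives are
# acceptable here, false negatives are not.
BLOCKED_PATTERNS = [
    r"'\s*;",                                            # quote-terminated stacked query
    r";\s*--",                                           # statement chaining + comment
    r"\b(drop|delete|truncate|alter|grant|revoke)\b",    # destructive / DDL operations
    r"\b(information_schema|pg_catalog|sqlite_master)\b",# system table access
    r"\b(load_file|xp_cmdshell)\b",                      # dangerous function calls
]

def passes_security_filter(text: str) -> bool:
    """Return True only if no blocked pattern appears in the input."""
    lowered = text.lower()
    return not any(re.search(p, lowered) for p in BLOCKED_PATTERNS)

print(passes_security_filter("top 5 customers by order volume"))  # True
print(passes_security_filter("'; DROP TABLE users --"))           # False
```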
3. Query Generation
The generator receives:
- Only the relevant tables (not the full schema)
- The user's question
- Few-shot examples
If the first attempt fails validation → one-shot retry with the error message appended. This simple pattern catches a surprising number of wrong column names and missing joins.
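The retry loop is only a few lines. A sketch with `llm` and `validate` as hypothetical callables standing in for the generator model and the validation stage (in the full pipeline the retried query goes back through validation before being returned):

```python
def generate_with_retry(llm, question, schema, validate, max_retries=1):
    """Generate SQL; on validation failure, re-prompt with the error
    message appended so the model can correct itself."""
    prompt = f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
    sql = llm(prompt)
    for _ in range(max_retries):
        error = validate(sql)  # returns None on success, error text on failure
        if error is None:
            break
        # Feed the failure back: wrong column names and missing joins
        # are often fixed on the first retry.
        sql = llm(f"{prompt}\n\nPrevious attempt:\n{sql}\n"
                  f"Error: {error}\nFixed SQL:")
    return sql
```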
4. Parallel Validation
Four validators run concurrently:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐
│   Syntax    │ │    Logic    │ │  Security   │ │ Performance  │
│  validator  │ │  validator  │ │  validator  │ │  validator   │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘
       │               │               │               │
       └───────────────┴───────┬───────┴───────────────┘
                               │
                      ┌────────▼────────┐
                      │  Fan-in merge   │
                      └─────────────────┘
| Validator | Checks |
|---|---|
| Syntax | SQL parsing, grammar, dialect |
| Logic | Tables/columns exist, joins correct, types compatible |
| Security | Injection, unauthorized ops, data exfiltration |
| Performance | Full table scans, missing indexes, complexity |
Any failure → back to generator with error context.
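The fan-out/fan-in itself is a one-liner with `asyncio.gather`. A sketch with trivial stand-in checks (the real validators parse the SQL and inspect the schema):

```python
import asyncio

# Trivial stand-ins for the four validator agents;
# each returns a list of issues (empty list = pass).
async def check_syntax(sql):
    return [] if sql.strip().lower().startswith("select") else ["not a SELECT statement"]

async def check_logic(sql):
    return []  # real version: verify tables/columns exist, joins, types

async def check_security(sql):
    return ["possible injection"] if "--" in sql else []

async def check_performance(sql):
    return [] if "limit" in sql.lower() else ["unbounded result set"]

async def validate_all(sql):
    """Fan out to all four validators concurrently, then fan in."""
    results = await asyncio.gather(
        check_syntax(sql), check_logic(sql),
        check_security(sql), check_performance(sql),
    )
    issues = [issue for result in results for issue in result]
    return len(issues) == 0, issues

ok, issues = asyncio.run(validate_all("SELECT name FROM singer LIMIT 5"))
print(ok, issues)  # True []
```

On failure, `issues` is exactly the error context handed back to the generator for the retry.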
5. Explanation + Safety Score
Output includes:
- Plain-English explanation of the query
- Safety score (0–4 across four dimensions)
- Optimization recommendations
═══════════════════════════════════════
EXPLANATION
═══════════════════════════════════════
This query retrieves the top 5 customers by order volume by:
1. Joining customers with orders on customer_id
2. Counting orders per customer
3. Sorting by count descending, limiting to 5
Safety Score: 3.5/4.0
✓ Security: Safe (no injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider index on customer_id
═══════════════════════════════════════
DDL Vector Stores: The Efficiency Fix
Here's something worth knowing if you're building or evaluating NL2SQL systems.
The semantic search step in most pipelines (including the one above) recomputes embeddings on every request. It embeds the query + every candidate table, computes cosine similarity in-memory, and discards the vectors. Next request: same computation, same API calls, same cost.
The Better Approach
Pre-compute embeddings for your DDL structure and store them in a vector DB.
What you embed per table:
Table: orders
Columns: order_id (PK, INTEGER), customer_id (FK → customers.id),
order_date (DATE), total_amount (DECIMAL), status (VARCHAR)
Relationships: references customers, referenced by order_items
The key insight: include primary keys, foreign keys, types, and relationships in the text you embed. This gives the embedding model structural context, not just names.
Store in FAISS / Chroma / Pinecone / Weaviate.
At query time: embed only the user's question (1 API call) → vector similarity search against pre-computed embeddings.
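The whole index-once/query-cheap pattern fits in a small class. A self-contained sketch with an in-memory index standing in for FAISS/Chroma and a toy bag-of-words "embedding" so it runs without an API (a real system would call text-embedding-3-small or similar; all names here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class DDLVectorStore:
    """Table embeddings are computed once at index time, so each
    query costs exactly one embedding call (the question itself)."""

    def __init__(self, embed):
        self.embed = embed  # embedding function (real systems call an API)
        self.index = {}     # table name -> pre-computed vector

    def add(self, table, ddl_text):
        # One-time cost, paid at indexing / schema-change time.
        self.index[table] = self.embed(ddl_text)

    def search(self, question, k=5):
        # The only embedding call on the request path.
        q = self.embed(question)
        ranked = sorted(self.index.items(),
                        key=lambda kv: cosine(q, kv[1]), reverse=True)
        return [table for table, _ in ranked[:k]]

# Toy "embedding": term counts over a tiny vocabulary.
VOCAB = ["order", "customer", "singer", "concert"]
def toy_embed(text):
    t = text.lower()
    return [float(t.count(w)) for w in VOCAB]

store = DDLVectorStore(toy_embed)
store.add("orders", "Table: orders Columns: order_id (PK), customer_id (FK -> customers.id)")
store.add("singer", "Table: singer Columns: singer_id (PK), name, country")
print(store.search("top customers by order volume", k=1))  # ['orders']
```

Note that the indexed text includes keys and relationships, not just the table name: that structural context is what the question vector gets matched against.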
Comparison
| Metric | Per-Request Embedding | DDL Vector Store |
|---|---|---|
| Embedding calls/query | 1 + N (N = tables) | 1 |
| Latency | O(N) | O(1) |
| Cost at scale | Linear with schema size | Near-constant |
| Schema changes | Auto (always fresh) | Requires re-index |
Research backing: LitE-SQL (EACL 2026) uses this exact pattern — pre-computed schema embeddings with contrastive learning. Results: 88.45% on Spider, 72.10% on BIRD, with 2–30x fewer parameters than full LLM approaches.
For production systems where the schema changes less often than queries arrive (i.e., basically every production database), this is a clear win.
The Benchmark-to-Production Gap
Numbers you should know:
| Benchmark | Schema Type | Accuracy |
|---|---|---|
| Spider 1.0 | Clean, 3–10 tables, descriptive names | 85–92% |
| Spider 2.0 | Enterprise-realistic schemas | 6–21% |
| BIRD | Dirty schemas, noisy labels | ~72% (best systems) |
The 85% → 6% drop isn't a bug. Spider 1.0 is clean academic data. Real databases have:
- Hundreds of tables with cryptic names
- "Active users" meaning different things per company
- Nulls, inconsistent formats, undocumented columns
- Multi-hop joins, window functions, nested subqueries
Multi-agent validation doesn't close this gap entirely, but it's the difference between a system that fails silently and one that says "I'm not confident — here's why."
Quick Start
Install from PyPI:
pip install nl2sql-agents
Configure (works with any OpenAI-compatible API — OpenRouter, OpenAI, Ollama, vLLM):
export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_MODEL="openai/gpt-4o-mini"
export OPENAI_EMBEDDING_MODEL="openai/text-embedding-3-small"
export DB_TYPE="sqlite"
export DB_PATH="/path/to/your/database.sqlite"
Run:
# Interactive REPL
nl2sql
# One-shot
nl2sql "Show me all singers from France"
# Override database
nl2sql --db /path/to/other.sqlite "List all employees"
Works with any SQLite database. Point DB_PATH at your own .sqlite file or use the Spider benchmark databases.
GitHub: github.com/ToheedAsghar/NL2SQL
PyPI: pypi.org/project/nl2sql-agents
Architecture at a Glance
User Query
         │
         ▼
┌─────────────────┐
│ Security Filter │──── Block dangerous queries
└────────┬────────┘
         ▼
┌─────────────────────────────────────┐
│           Table Discovery           │
│  ┌──────┐ ┌────────┐  ┌──────────┐  │
│  │Keywrd│ │Semantic│  │ FK Graph │  │
│  └──┬───┘ └───┬────┘  └────┬─────┘  │
│     └─────────┴────────────┘        │
│           Weighted Merge            │
└────────────┬────────────────────────┘
             ▼
    ┌──────────────────┐
    │ Schema Formatter │──── Format relevant tables
    └────────┬─────────┘
             ▼
    ┌─────────────────┐
    │ Query Generator │◄─── Retry with error context
    └────────┬────────┘
             ▼
┌─────────────────────────────────────┐
│         Parallel Validation         │
│  ┌──────┐ ┌─────┐ ┌─────┐ ┌──────┐  │
│  │Syntax│ │Logic│ │Secur│ │ Perf │  │
│  └──────┘ └─────┘ └─────┘ └──────┘  │
└────────────┬────────────────────────┘
             ▼
    ┌─────────────────┐
    │    Explainer    │──── Explanation + Safety Score
    └─────────────────┘
8 agents. Parallel fan-out/fan-in. LangGraph StateGraph with conditional edges. MIT licensed.
What's Next for NL2SQL
- DDL vector stores replacing per-request embeddings as the default
- Semantic layers (dbt, Cube) + NL2SQL as complementary approaches
- Multi-database support — PostgreSQL, MySQL, BigQuery
- Execution + visualization — run queries and render results
- Multi-turn conversations — follow-up questions with context
- Right-sized models — fine-tuned 7B models for individual agent tasks instead of GPT-4 for everything
Star the repo if it's useful: github.com/ToheedAsghar/NL2SQL
The architecture diagram in the README is worth a look even if you're building something completely different — it's a clean reference for structuring multi-agent LangGraph pipelines.