The Problem
Most people who need data can't write SQL. Product managers open Jira tickets for simple queries. Support teams need custom admin panels for every lookup. Analysts spend hours on routine joins they've written a hundred times.
NL2SQL (natural language to SQL) fixes this. You type:
"Who are the top 5 customers by order volume?"
The system returns:
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY order_count DESC
LIMIT 5;
LLMs have made this practical. On the Spider benchmark (200+ databases, standard academic dataset), modern systems hit 85–92% execution accuracy. But getting from "works on benchmarks" to "works in production" requires more than a single prompt.
Why Single-Prompt NL2SQL Breaks
The naive approach:
System: Here's the schema: {entire_database_schema}
User: {natural_language_question}
Assistant: {sql_query}
This works when your database has 5 tables with names like customers and orders.
It fails when you have 500 tables, columns named usr_trx_fl, and foreign key chains 6 layers deep. A single LLM call can't simultaneously:
- Figure out which 5 of 500 tables are relevant
- Check for SQL injection
- Generate correct SQL
- Validate syntax, logic, and performance
- Explain what the query does
You need decomposition. Same principle as microservices — single responsibility per agent.
Multi-Agent NL2SQL: How It Works
I'll walk through the pipeline using the open-source NL2SQL project as a reference. It uses 8 agents orchestrated with LangGraph.
1. Table Discovery (Schema Linking)
The hardest part. Given a question, find the relevant tables.
Three signals run in parallel:
Keyword Matching — Token overlap between query and table/column names. No LLM needed.
"singers" → singer table (fuzzy match score: 0.95)
Semantic Search — Embedding similarity. Catches conceptual matches keywords miss.
"revenue" → order_details table (cosine similarity: 0.82)
Foreign Key Graph — BFS from seed tables through FK relationships.
singer (seed) → concert (depth 1, score 0.5) → stadium (depth 2, score 0.25)
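The keyword and FK-graph signals need no LLM at all. A minimal sketch of both, with `difflib` fuzzy matching as an assumption (the project may use a different scorer) and a hand-written FK edge map:

```python
from collections import deque
from difflib import SequenceMatcher

def keyword_score(query_token: str, name: str) -> float:
    """Fuzzy token overlap between a query word and a table/column name."""
    return SequenceMatcher(None, query_token.lower(), name.lower()).ratio()

def fk_graph_scores(fk_edges, seeds, max_depth=2, decay=0.5):
    """BFS outward from seed tables along FK edges; each hop halves the
    score, matching singer (seed) -> concert (0.5) -> stadium (0.25)."""
    scores = {t: 1.0 for t in seeds}
    queue = deque((t, 0) for t in seeds)
    while queue:
        table, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor in fk_edges.get(table, ()):
            if neighbor not in scores:
                scores[neighbor] = decay ** (depth + 1)
                queue.append((neighbor, depth + 1))
    return scores

print(round(keyword_score("singers", "singer"), 2))  # 0.92 with difflib
edges = {"singer": ["concert"], "concert": ["singer", "stadium"],
         "stadium": ["concert"]}
print(fk_graph_scores(edges, seeds=["singer"]))
# {'singer': 1.0, 'concert': 0.5, 'stadium': 0.25}
```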
Scores are weighted and merged:
final_score = 0.35 * keyword + 0.45 * semantic + 0.20 * fk_graph
Only top-K tables (default: 5) pass through to the next stage.
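The merge itself is a few lines. A sketch using the weights above, with each signal as a dict of table → score (the example scores are illustrative, not the project's actual numbers):

```python
def merge_scores(keyword, semantic, fk_graph, top_k=5):
    """Weighted merge of the three discovery signals.
    A table missing from a signal contributes 0 for that signal."""
    tables = set(keyword) | set(semantic) | set(fk_graph)
    final = {
        t: 0.35 * keyword.get(t, 0.0)
           + 0.45 * semantic.get(t, 0.0)
           + 0.20 * fk_graph.get(t, 0.0)
        for t in tables
    }
    # Only the top-K tables survive into the next stage.
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

ranked = merge_scores(
    keyword={"singer": 0.95},
    semantic={"singer": 0.90, "concert": 0.80},
    fk_graph={"singer": 1.00, "concert": 0.50},
)
# singer: 0.35*0.95 + 0.45*0.90 + 0.20*1.00 ≈ 0.94, ranks first
```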
2. Security Filter
Runs before generation. Checks for:
- SQL injection patterns ('; DROP TABLE --)
- Destructive operations (DELETE, TRUNCATE, ALTER)
- System table access
- Dangerous function calls
Blocked queries never reach the generator. Non-negotiable for user-facing systems.
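A minimal deny-list sketch. The patterns are illustrative, not the project's actual rules; a real filter parses the SQL, and deny-lists should complement, never replace, read-only credentials and parameterized execution:

```python
import re

# Illustrative deny-list. Blunt by design: false positives are
# acceptable here, false negatives are not.
BLOCKED_PATTERNS = [
    r"'\s*;",                                            # quote-terminated stacked query
    r";\s*--",                                           # statement chaining + comment
    r"\b(drop|delete|truncate|alter|grant|revoke)\b",    # destructive / DDL operations
    r"\b(information_schema|pg_catalog|sqlite_master)\b",# system table access
    r"\b(load_file|xp_cmdshell)\b",                      # dangerous function calls
]

def passes_security_filter(text: str) -> bool:
    """Return True only if no blocked pattern appears in the input."""
    lowered = text.lower()
    return not any(re.search(p, lowered) for p in BLOCKED_PATTERNS)

print(passes_security_filter("top 5 customers by order volume"))  # True
print(passes_security_filter("'; DROP TABLE users --"))           # False
```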
3. Query Generation
The generator receives:
- Only the relevant tables (not the full schema)
- The user's question
- Few-shot examples
If the first attempt fails validation → one-shot retry with the error message appended. This simple pattern catches a surprising number of wrong column names and missing joins.
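The retry loop is only a few lines. A sketch with `llm` and `validate` as hypothetical callables standing in for the generator model and the validation stage (in the full pipeline the retried query goes back through validation before being returned):

```python
def generate_with_retry(llm, question, schema, validate, max_retries=1):
    """Generate SQL; on validation failure, re-prompt with the error
    message appended so the model can correct itself."""
    prompt = f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
    sql = llm(prompt)
    for _ in range(max_retries):
        error = validate(sql)  # returns None on success, error text on failure
        if error is None:
            break
        # Feed the failure back: wrong column names and missing joins
        # are often fixed on the first retry.
        sql = llm(f"{prompt}\n\nPrevious attempt:\n{sql}\n"
                  f"Error: {error}\nFixed SQL:")
    return sql
```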
4. Parallel Validation
Four validators run concurrently:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐
│   Syntax    │ │    Logic    │ │  Security   │ │ Performance  │
│  validator  │ │  validator  │ │  validator  │ │  validator   │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘
       │               │               │               │
       └───────────────┴───────┬───────┴───────────────┘
                               │
                      ┌────────▼────────┐
                      │  Fan-in merge   │
                      └─────────────────┘
| Validator | Checks |
|---|---|
| Syntax | SQL parsing, grammar, dialect |
| Logic | Tables/columns exist, joins correct, types compatible |
| Security | Injection, unauthorized ops, data exfiltration |
| Performance | Full table scans, missing indexes, complexity |
Any failure → back to generator with error context.
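The fan-out/fan-in itself is a one-liner with `asyncio.gather`. A sketch with trivial stand-in checks (the real validators parse the SQL and inspect the schema):

```python
import asyncio

# Trivial stand-ins for the four validator agents;
# each returns a list of issues (empty list = pass).
async def check_syntax(sql):
    return [] if sql.strip().lower().startswith("select") else ["not a SELECT statement"]

async def check_logic(sql):
    return []  # real version: verify tables/columns exist, joins, types

async def check_security(sql):
    return ["possible injection"] if "--" in sql else []

async def check_performance(sql):
    return [] if "limit" in sql.lower() else ["unbounded result set"]

async def validate_all(sql):
    """Fan out to all four validators concurrently, then fan in."""
    results = await asyncio.gather(
        check_syntax(sql), check_logic(sql),
        check_security(sql), check_performance(sql),
    )
    issues = [issue for result in results for issue in result]
    return len(issues) == 0, issues

ok, issues = asyncio.run(validate_all("SELECT name FROM singer LIMIT 5"))
print(ok, issues)  # True []
```

On failure, `issues` is exactly the error context handed back to the generator for the retry.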
5. Explanation + Safety Score
Output includes:
- Plain-English explanation of the query
- Safety score (0–4 across four dimensions)
- Optimization recommendations
═══════════════════════════════════════
EXPLANATION
═══════════════════════════════════════
This query retrieves the top 5 customers by order volume by:
1. Joining customers with orders on customer_id
2. Counting orders per customer
3. Sorting by count descending, limiting to 5
Safety Score: 3.5/4.0
✓ Security: Safe (no injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider index on customer_id
═══════════════════════════════════════
DDL Vector Stores: The Efficiency Fix
Here's something worth knowing if you're building or evaluating NL2SQL systems.
The semantic search step in most pipelines (including the one above) recomputes embeddings on every request. It embeds the query + every candidate table, computes cosine similarity in-memory, and discards the vectors. Next request: same computation, same API calls, same cost.
The Better Approach
Pre-compute embeddings for your DDL structure and store them in a vector DB.
What you embed per table:
Table: orders
Columns: order_id (PK, INTEGER), customer_id (FK → customers.id),
order_date (DATE), total_amount (DECIMAL), status (VARCHAR)
Relationships: references customers, referenced by order_items
The key insight: include primary keys, foreign keys, types, and relationships in the text you embed. This gives the embedding model structural context, not just names.
Store in FAISS / Chroma / Pinecone / Weaviate.
At query time: embed only the user's question (1 API call) → vector similarity search against pre-computed embeddings.
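The whole index-once/query-cheap pattern fits in a small class. A self-contained sketch with an in-memory index standing in for FAISS/Chroma and a toy bag-of-words "embedding" so it runs without an API (a real system would call text-embedding-3-small or similar; all names here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class DDLVectorStore:
    """Table embeddings are computed once at index time, so each
    query costs exactly one embedding call (the question itself)."""

    def __init__(self, embed):
        self.embed = embed  # embedding function (real systems call an API)
        self.index = {}     # table name -> pre-computed vector

    def add(self, table, ddl_text):
        # One-time cost, paid at indexing / schema-change time.
        self.index[table] = self.embed(ddl_text)

    def search(self, question, k=5):
        # The only embedding call on the request path.
        q = self.embed(question)
        ranked = sorted(self.index.items(),
                        key=lambda kv: cosine(q, kv[1]), reverse=True)
        return [table for table, _ in ranked[:k]]

# Toy "embedding": term counts over a tiny vocabulary.
VOCAB = ["order", "customer", "singer", "concert"]
def toy_embed(text):
    t = text.lower()
    return [float(t.count(w)) for w in VOCAB]

store = DDLVectorStore(toy_embed)
store.add("orders", "Table: orders Columns: order_id (PK), customer_id (FK -> customers.id)")
store.add("singer", "Table: singer Columns: singer_id (PK), name, country")
print(store.search("top customers by order volume", k=1))  # ['orders']
```

Note that the indexed text includes keys and relationships, not just the table name: that structural context is what the question vector gets matched against.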
Comparison
| Metric | Per-Request Embedding | DDL Vector Store |
|---|---|---|
| Embedding calls/query | 1 + N (N = tables) | 1 |
| Latency | O(N) | O(1) |
| Cost at scale | Linear with schema size | Near-constant |
| Schema changes | Auto (always fresh) | Requires re-index |
Research backing: LitE-SQL (EACL 2026) uses this exact pattern — pre-computed schema embeddings with contrastive learning. Results: 88.45% on Spider, 72.10% on BIRD, with 2–30x fewer parameters than full LLM approaches.
For production systems where the schema changes less often than queries arrive (i.e., basically every production database), this is a clear win.
The Benchmark-to-Production Gap
Numbers you should know:
| Benchmark | Schema Type | Accuracy |
|---|---|---|
| Spider 1.0 | Clean, 3–10 tables, descriptive names | 85–92% |
| Spider 2.0 | Enterprise-realistic schemas | 6–21% |
| BIRD | Dirty schemas, noisy labels | ~72% (best systems) |
The 85% → 6% drop isn't a bug. Spider 1.0 is clean academic data. Real databases have:
- Hundreds of tables with cryptic names
- "Active users" meaning different things per company
- Nulls, inconsistent formats, undocumented columns
- Multi-hop joins, window functions, nested subqueries
Multi-agent validation doesn't close this gap entirely, but it's the difference between a system that fails silently and one that says "I'm not confident — here's why."
Quick Start
Install from PyPI:
pip install nl2sql-agents
Configure (works with any OpenAI-compatible API — OpenRouter, OpenAI, Ollama, vLLM):
export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_MODEL="openai/gpt-4o-mini"
export OPENAI_EMBEDDING_MODEL="openai/text-embedding-3-small"
export DB_TYPE="sqlite"
export DB_PATH="/path/to/your/database.sqlite"
Run:
# Interactive REPL
nl2sql
# One-shot
nl2sql "Show me all singers from France"
# Override database
nl2sql --db /path/to/other.sqlite "List all employees"
Works with any SQLite database. Point DB_PATH at your own .sqlite file or use the Spider benchmark databases.
GitHub: github.com/ToheedAsghar/NL2SQL
PyPI: pypi.org/project/nl2sql-agents
Architecture at a Glance
User Query
         │
         ▼
┌─────────────────┐
│ Security Filter │──── Block dangerous queries
└────────┬────────┘
         ▼
┌─────────────────────────────────────┐
│           Table Discovery           │
│  ┌──────┐ ┌────────┐  ┌──────────┐  │
│  │Keywrd│ │Semantic│  │ FK Graph │  │
│  └──┬───┘ └───┬────┘  └────┬─────┘  │
│     └─────────┴────────────┘        │
│           Weighted Merge            │
└────────────┬────────────────────────┘
             ▼
    ┌──────────────────┐
    │ Schema Formatter │──── Format relevant tables
    └────────┬─────────┘
             ▼
    ┌─────────────────┐
    │ Query Generator │◄─── Retry with error context
    └────────┬────────┘
             ▼
┌─────────────────────────────────────┐
│         Parallel Validation         │
│  ┌──────┐ ┌─────┐ ┌─────┐ ┌──────┐  │
│  │Syntax│ │Logic│ │Secur│ │ Perf │  │
│  └──────┘ └─────┘ └─────┘ └──────┘  │
└────────────┬────────────────────────┘
             ▼
    ┌─────────────────┐
    │    Explainer    │──── Explanation + Safety Score
    └─────────────────┘
8 agents. Parallel fan-out/fan-in. LangGraph StateGraph with conditional edges. MIT licensed.
What's Next for NL2SQL
- DDL vector stores replacing per-request embeddings as the default
- Semantic layers (dbt, Cube) + NL2SQL as complementary approaches
- Multi-database support — PostgreSQL, MySQL, BigQuery
- Execution + visualization — run queries and render results
- Multi-turn conversations — follow-up questions with context
- Right-sized models — fine-tuned 7B models for individual agent tasks instead of GPT-4 for everything
Star the repo if it's useful: github.com/ToheedAsghar/NL2SQL
The architecture diagram in the README is worth a look even if you're building something completely different — it's a clean reference for structuring multi-agent LangGraph pipelines.