Toheed Asghar

NL2SQL in 2026: How Multi-Agent Pipelines Convert Natural Language to Safe SQL

The Problem

Most people who need data can't write SQL. Product managers open Jira tickets for simple queries. Support teams need custom admin panels for every lookup. Analysts spend hours on routine joins they've written a hundred times.

NL2SQL (natural language to SQL) fixes this. You type:

"Who are the top 5 customers by order volume?"

The system returns:

SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY order_count DESC
LIMIT 5;

LLMs have made this practical. On the Spider benchmark (200+ databases, standard academic dataset), modern systems hit 85–92% execution accuracy. But getting from "works on benchmarks" to "works in production" requires more than a single prompt.


Why Single-Prompt NL2SQL Breaks

The naive approach:

System: Here's the schema: {entire_database_schema}
User: {natural_language_question}
Assistant: {sql_query}

This works when your database has 5 tables with names like customers and orders.

It fails when you have 500 tables, columns named usr_trx_fl, and foreign key chains 6 layers deep. A single LLM call can't simultaneously:

  • Figure out which 5 of 500 tables are relevant
  • Check for SQL injection
  • Generate correct SQL
  • Validate syntax, logic, and performance
  • Explain what the query does

You need decomposition. Same principle as microservices — single responsibility per agent.


Multi-Agent NL2SQL: How It Works

I'll walk through the pipeline using the open-source NL2SQL project as a reference. It uses 8 agents orchestrated with LangGraph.

1. Table Discovery (Schema Linking)

The hardest part. Given a question, find the relevant tables.

Three signals run in parallel:

Keyword Matching — Token overlap between query and table/column names. No LLM needed.

"singers" → singer table (fuzzy match score: 0.95)
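This signal is cheap to sketch with the standard library alone. A minimal version, assuming `difflib`'s ratio as the fuzzy score (the project may use a different matcher):

```python
from difflib import SequenceMatcher

def keyword_score(question: str, table_name: str) -> float:
    """Best fuzzy-match ratio between any question token and a table name.

    No LLM involved: plural/singular variants like "singers" vs "singer"
    score high because nearly all characters overlap.
    """
    tokens = question.lower().rstrip("?").split()
    return max(
        (SequenceMatcher(None, tok, table_name.lower()).ratio() for tok in tokens),
        default=0.0,
    )
```

Unrelated tokens score near zero, so this signal alone is enough to seed the FK-graph walk.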

Semantic Search — Embedding similarity. Catches conceptual matches keywords miss.

"revenue" → order_details table (cosine similarity: 0.82)

Foreign Key Graph — BFS from seed tables through FK relationships.

singer (seed) → concert (depth 1, score 0.5) → stadium (depth 2, score 0.25)
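The depth-decay scoring above is a breadth-first traversal that halves the score at each FK hop. A sketch, where `fk_edges` and the decay factor are illustrative assumptions rather than the project's exact traversal:

```python
from collections import deque

def fk_scores(seed: str, fk_edges: dict[str, list[str]], decay: float = 0.5) -> dict[str, float]:
    """BFS outward from a seed table; score halves at each FK hop.

    `fk_edges` maps a table to its FK neighbours (both directions),
    so the walk follows references in and out alike.
    """
    scores = {seed: 1.0}
    queue = deque([seed])
    while queue:
        table = queue.popleft()
        for neighbour in fk_edges.get(table, []):
            if neighbour not in scores:        # first visit = shortest FK path
                scores[neighbour] = scores[table] * decay
                queue.append(neighbour)
    return scores

edges = {"singer": ["concert"], "concert": ["singer", "stadium"], "stadium": ["concert"]}
fk_scores("singer", edges)  # → {"singer": 1.0, "concert": 0.5, "stadium": 0.25}
```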

Scores are weighted and merged:

final_score = 0.35 * keyword + 0.45 * semantic + 0.20 * fk_graph

Only top-K tables (default: 5) pass through to the next stage.
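The merge and top-K cut can be sketched directly from the formula above. The weights are the defaults quoted here; treat them as tunable hyperparameters, not canon:

```python
# Default weights for the three discovery signals (tunable).
WEIGHTS = {"keyword": 0.35, "semantic": 0.45, "fk_graph": 0.20}

def merge(signals: dict[str, dict[str, float]]) -> dict[str, float]:
    """Combine per-signal table scores into one weighted score per table.

    A table missing from a signal simply contributes 0 for that signal.
    """
    tables = {t for scores in signals.values() for t in scores}
    return {
        t: sum(WEIGHTS[sig] * signals[sig].get(t, 0.0) for sig in WEIGHTS)
        for t in tables
    }

def top_k(scores: dict[str, float], k: int = 5) -> list[str]:
    """Only the k highest-scoring tables go on to the generator."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

signals = {
    "keyword":  {"singer": 0.95},
    "semantic": {"singer": 0.88, "concert": 0.61},
    "fk_graph": {"singer": 1.0, "concert": 0.5, "stadium": 0.25},
}
ranked = top_k(merge(signals), k=2)  # → ["singer", "concert"]
```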

2. Security Filter

Runs before generation. Checks for:

  • SQL injection patterns ('; DROP TABLE --)
  • Destructive operations (DELETE, TRUNCATE, ALTER)
  • System table access
  • Dangerous function calls

Blocked queries never reach the generator. Non-negotiable for user-facing systems.
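A minimal deny-list sketch of this stage. The patterns here are hypothetical; a production filter would pair regexes with a real SQL parser rather than rely on patterns alone:

```python
import re

# Hypothetical deny-list patterns; illustrative, not exhaustive.
BLOCKED_PATTERNS = [
    r"(?i)\b(DROP|DELETE|TRUNCATE|ALTER)\b",                   # destructive operations
    r"--|/\*",                                                 # comment-based injection tails
    r"(?i)\b(sqlite_master|information_schema|pg_catalog)\b",  # system tables
]

def is_blocked(user_input: str) -> bool:
    """True if the input trips any deny-list pattern.

    Blocked inputs never reach the query generator.
    """
    return any(re.search(p, user_input) for p in BLOCKED_PATTERNS)
```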

3. Query Generation

The generator receives:

  • Only the relevant tables (not the full schema)
  • The user's question
  • Few-shot examples

If the first attempt fails validation → one-shot retry with the error message appended. This simple pattern catches a surprising number of wrong column names and missing joins.
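The retry loop itself is tiny. In this sketch, `generate` and `validate` are stand-ins for the LLM call and the validator fan-in; the names are illustrative, not the library's API:

```python
def generate_with_retry(question, schema, generate, validate, max_retries=1):
    """One-shot retry: regenerate with the validator's error appended."""
    error = None
    for _ in range(max_retries + 1):
        prompt = question if error is None else f"{question}\nPrevious error: {error}"
        sql = generate(prompt, schema)
        error = validate(sql)              # None means the query passed
        if error is None:
            return sql
    raise ValueError(f"query still failing after retry: {error}")

# Demo: the first attempt uses a wrong column name, the retry fixes it.
attempts = []
def fake_generate(prompt, schema):
    attempts.append(prompt)
    return "SELECT nam FROM customers" if len(attempts) == 1 else "SELECT name FROM customers"

def fake_validate(sql):
    return None if " name " in sql else "no such column: nam"

result = generate_with_retry("list customer names", {}, fake_generate, fake_validate)
```

Feeding the exact database error back into the prompt is what makes the retry useful: the model sees "no such column: nam" instead of guessing what went wrong.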

4. Parallel Validation

Four validators run concurrently:

┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐
│   Syntax    │  │    Logic    │  │  Security   │  │ Performance  │
│  validator  │  │  validator  │  │  validator  │  │  validator   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └───────┬──────┘
       │                │                │                 │
       └────────────────┴────────────────┴─────────────────┘
                              │
                     ┌────────▼────────┐
                     │   Fan-in merge  │
                     └─────────────────┘
Validator     Checks
Syntax        SQL parsing, grammar, dialect
Logic         Tables/columns exist, joins correct, types compatible
Security      Injection, unauthorized ops, data exfiltration
Performance   Full table scans, missing indexes, complexity

Any failure → back to generator with error context.
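The fan-out/fan-in above maps naturally onto `asyncio.gather`. The validators below are trivial stand-ins (real ones would parse the SQL, check it against the schema, or call an LLM), but the concurrency shape is the point:

```python
import asyncio

# Trivial stand-in validators, each returning a list of issues.
async def check_syntax(sql):
    return [] if sql.strip().upper().startswith("SELECT") else ["not a SELECT statement"]

async def check_logic(sql):
    return []                              # assume schema checks pass

async def check_security(sql):
    return ["destructive operation"] if "DROP" in sql.upper() else []

async def check_performance(sql):
    return [] if "LIMIT" in sql.upper() else ["unbounded result set"]

async def validate(sql: str) -> list[str]:
    """Fan out all four validators concurrently, then fan in their findings."""
    results = await asyncio.gather(
        check_syntax(sql), check_logic(sql), check_security(sql), check_performance(sql)
    )
    return [issue for found in results for issue in found]   # fan-in merge

issues = asyncio.run(validate("SELECT name FROM singer LIMIT 5"))  # → []
```

An empty list means the query proceeds to the explainer; anything else is handed back to the generator as error context.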

5. Explanation + Safety Score

Output includes:

  • Plain-English explanation of the query
  • Safety score (0–4 across four dimensions)
  • Optimization recommendations

═══════════════════════════════════════
 EXPLANATION
═══════════════════════════════════════
This query retrieves the top 5 customers by order volume by:
1. Joining customers with orders on customer_id
2. Counting orders per customer
3. Sorting by count descending, limiting to 5

Safety Score: 3.5/4.0
✓ Security: Safe (no injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider index on customer_id
═══════════════════════════════════════
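The 0–4 score reads as four per-dimension checks, each contributing up to one point. A hypothetical aggregation consistent with the sample output above (the project may weight dimensions differently):

```python
def safety_score(security: float, syntax: float, logic: float, performance: float) -> float:
    """Each dimension scores in [0, 1]; the total lands in [0, 4].

    Hypothetical aggregation: the 3.5/4.0 above corresponds to three
    clean passes plus a half-point performance warning.
    """
    return security + syntax + logic + performance

safety_score(1.0, 1.0, 1.0, 0.5)  # → 3.5
```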

DDL Vector Stores: The Efficiency Fix

Here's something worth knowing if you're building or evaluating NL2SQL systems.

The semantic search step in most pipelines (including the one above) recomputes embeddings on every request. It embeds the query + every candidate table, computes cosine similarity in-memory, and discards the vectors. Next request: same computation, same API calls, same cost.

The Better Approach

Pre-compute embeddings for your DDL structure and store them in a vector DB.

What you embed per table:

Table: orders
Columns: order_id (PK, INTEGER), customer_id (FK → customers.id),
order_date (DATE), total_amount (DECIMAL), status (VARCHAR)
Relationships: references customers, referenced by order_items

The key insight: include primary keys, foreign keys, types, and relationships in the text you embed. This gives the embedding model structural context, not just names.

Store in FAISS / Chroma / Pinecone / Weaviate.

At query time: embed only the user's question (1 API call) → vector similarity search against pre-computed embeddings.
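The index-once/query-once pattern fits in a few lines. This sketch uses a deterministic stand-in `embed` function so it runs offline; in practice you would call a real embedding API (and a real vector store instead of the in-memory dict):

```python
import math
import random

def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for a real embedding API call, seeded on the text itself
    so the sketch is deterministic. Returns a unit vector."""
    rng = random.Random(text)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Index time: embed each table's DDL card ONCE and keep the vectors around.
ddl_cards = {
    "orders": "Table orders: order_id PK, customer_id FK to customers.customer_id, "
              "order_date DATE, total_amount DECIMAL; references customers",
    "customers": "Table customers: customer_id PK, name VARCHAR, country VARCHAR; "
                 "referenced by orders",
}
index = {name: embed(card) for name, card in ddl_cards.items()}

def search(question: str, k: int = 1) -> list[str]:
    """Query time: ONE embedding call, then cosine similarity in memory."""
    q = embed(question)
    sims = {name: sum(a * b for a, b in zip(q, vec))   # unit vectors → cosine
            for name, vec in index.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```

On a schema change you re-embed only the affected tables, which is the re-index cost the comparison below refers to.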

Comparison

Metric                  Per-Request Embedding      DDL Vector Store
Embedding calls/query   1 + N (N = tables)         1
Latency                 O(N)                       O(1)
Cost at scale           Linear with schema size    Near-constant
Schema changes          Auto (always fresh)        Requires re-index

Research backing: LitE-SQL (EACL 2026) uses this exact pattern — pre-computed schema embeddings with contrastive learning. Results: 88.45% on Spider, 72.10% on BIRD, with 2–30x fewer parameters than full LLM approaches.

For production systems where the schema changes less often than queries arrive (i.e., basically every production database), this is a clear win.


The Benchmark-to-Production Gap

Numbers you should know:

Benchmark    Schema Type                              Accuracy
Spider 1.0   Clean, 3–10 tables, descriptive names    85–92%
Spider 2.0   Enterprise-realistic schemas             6–21%
BIRD         Dirty schemas, noisy labels              ~72% (best systems)

The 85% → 6% drop isn't a bug. Spider 1.0 is clean academic data. Real databases have:

  • Hundreds of tables with cryptic names
  • "Active users" meaning different things per company
  • Nulls, inconsistent formats, undocumented columns
  • Multi-hop joins, window functions, nested subqueries

Multi-agent validation doesn't close this gap entirely, but it's the difference between a system that fails silently and one that says "I'm not confident — here's why."


Quick Start

Install from PyPI:

pip install nl2sql-agents

Configure (works with any OpenAI-compatible API — OpenRouter, OpenAI, Ollama, vLLM):

export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_MODEL="openai/gpt-4o-mini"
export OPENAI_EMBEDDING_MODEL="openai/text-embedding-3-small"
export DB_TYPE="sqlite"
export DB_PATH="/path/to/your/database.sqlite"

Run:

# Interactive REPL
nl2sql

# One-shot
nl2sql "Show me all singers from France"

# Override database
nl2sql --db /path/to/other.sqlite "List all employees"

Works with any SQLite database. Point DB_PATH at your own .sqlite file or use the Spider benchmark databases.

GitHub: github.com/ToheedAsghar/NL2SQL
PyPI: pypi.org/project/nl2sql-agents


Architecture at a Glance

User Query
    │
    ▼
┌──────────────────┐
│ Security Filter  │──── Block dangerous queries
└────────┬─────────┘
         ▼
┌─────────────────────────────────────┐
│           Table Discovery           │
│  ┌───────┐ ┌────────┐ ┌──────────┐  │
│  │Keyword│ │Semantic│ │ FK Graph │  │
│  └───┬───┘ └───┬────┘ └────┬─────┘  │
│      └─────────┴───────────┘        │
│           Weighted Merge            │
└────────────────┬────────────────────┘
                 ▼
┌──────────────────┐
│ Schema Formatter │──── Format relevant tables
└────────┬─────────┘
         ▼
┌──────────────────┐
│ Query Generator  │◄─── Retry with error context
└────────┬─────────┘
         ▼
┌─────────────────────────────────────┐
│         Parallel Validation         │
│  ┌──────┐ ┌─────┐ ┌─────┐ ┌──────┐  │
│  │Syntax│ │Logic│ │Secur│ │ Perf │  │
│  └──────┘ └─────┘ └─────┘ └──────┘  │
└────────────────┬────────────────────┘
                 ▼
┌──────────────────┐
│    Explainer     │──── Explanation + Safety Score
└──────────────────┘

8 agents. Parallel fan-out/fan-in. LangGraph StateGraph with conditional edges. MIT licensed.


What's Next for NL2SQL

  • DDL vector stores replacing per-request embeddings as the default
  • Semantic layers (dbt, Cube) + NL2SQL as complementary approaches
  • Multi-database support — PostgreSQL, MySQL, BigQuery
  • Execution + visualization — run queries and render results
  • Multi-turn conversations — follow-up questions with context
  • Right-sized models — fine-tuned 7B models for individual agent tasks instead of GPT-4 for everything

Star the repo if it's useful: github.com/ToheedAsghar/NL2SQL

The architecture diagram in the README is worth a look even if you're building something completely different — it's a clean reference for structuring multi-agent LangGraph pipelines.
