<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Toheed Asghar</title>
    <description>The latest articles on DEV Community by Toheed Asghar (@toheed_asghar_123132).</description>
    <link>https://dev.to/toheed_asghar_123132</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3741735%2F6d9123f7-8bb8-4b2c-b5c3-1987bbcf521d.jpg</url>
      <title>DEV Community: Toheed Asghar</title>
      <link>https://dev.to/toheed_asghar_123132</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/toheed_asghar_123132"/>
    <language>en</language>
    <item>
      <title>NL2SQL in 2026: How Multi-Agent Pipelines Convert Natural Language to Safe SQL</title>
      <dc:creator>Toheed Asghar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:15:47 +0000</pubDate>
      <link>https://dev.to/toheed_asghar_123132/nl2sql-in-2026-how-multi-agent-pipelines-convert-natural-language-to-safe-sql-18l2</link>
      <guid>https://dev.to/toheed_asghar_123132/nl2sql-in-2026-how-multi-agent-pipelines-convert-natural-language-to-safe-sql-18l2</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most people who need data can't write SQL. Product managers open Jira tickets for simple queries. Support teams need custom admin panels for every lookup. Analysts spend hours on routine joins they've written a hundred times.&lt;/p&gt;

&lt;p&gt;NL2SQL (natural language to SQL) fixes this. You type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Who are the top 5 customers by order volume?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMs have made this practical. On the Spider benchmark (200+ databases, standard academic dataset), modern systems hit 85–92% execution accuracy. But getting from "works on benchmarks" to "works in production" requires more than a single prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Single-Prompt NL2SQL Breaks
&lt;/h2&gt;

&lt;p&gt;The naive approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System: Here's the schema: {entire_database_schema}
User: {natural_language_question}
Assistant: {sql_query}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works when your database has 5 tables with names like &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It fails when you have 500 tables, columns named &lt;code&gt;usr_trx_fl&lt;/code&gt;, and foreign key chains 6 layers deep. A single LLM call can't simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Figure out which 5 of 500 tables are relevant&lt;/li&gt;
&lt;li&gt;Check for SQL injection&lt;/li&gt;
&lt;li&gt;Generate correct SQL&lt;/li&gt;
&lt;li&gt;Validate syntax, logic, and performance&lt;/li&gt;
&lt;li&gt;Explain what the query does&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need decomposition. Same principle as microservices — single responsibility per agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Agent NL2SQL: How It Works
&lt;/h2&gt;

&lt;p&gt;I'll walk through the pipeline using the open-source &lt;a href="https://github.com/ToheedAsghar/NL2SQL" rel="noopener noreferrer"&gt;NL2SQL&lt;/a&gt; project as a reference. It uses 8 agents orchestrated with LangGraph.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Table Discovery (Schema Linking)
&lt;/h3&gt;

&lt;p&gt;The hardest part. Given a question, find the relevant tables.&lt;/p&gt;

&lt;p&gt;Three signals run in parallel:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keyword Matching&lt;/strong&gt; — Token overlap between query and table/column names. No LLM needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"singers" → singer table (fuzzy match score: 0.95)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
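
&lt;p&gt;A minimal sketch of the keyword signal using Python's standard-library &lt;code&gt;difflib&lt;/code&gt; (the project's actual scorer may differ):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def keyword_score(question, table_name):
    """Best fuzzy-match ratio between any question token and the table name."""
    tokens = question.lower().split()
    return max(
        (SequenceMatcher(None, tok, table_name.lower()).ratio() for tok in tokens),
        default=0.0,
    )

# "singers" vs "singer" is a near-perfect fuzzy match
score = keyword_score("Who are the singers from France?", "singer")
```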



&lt;p&gt;&lt;strong&gt;Semantic Search&lt;/strong&gt; — Embedding similarity. Catches conceptual matches keywords miss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"revenue" → order_details table (cosine similarity: 0.82)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Foreign Key Graph&lt;/strong&gt; — BFS from seed tables through FK relationships.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;singer (seed) → concert (depth 1, score 0.5) → stadium (depth 2, score 0.25)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
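
&lt;p&gt;The BFS-with-decay idea fits in a few lines. The 0.5-per-hop decay matches the scores shown above; the project's actual traversal may differ:&lt;/p&gt;

```python
from collections import deque

def fk_graph_scores(fk_edges, seeds, decay=0.5, max_depth=2):
    """BFS outward from seed tables; the score halves with each FK hop."""
    # Build an undirected adjacency list from (table, referenced_table) pairs.
    adj = {}
    for a, b in fk_edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    scores = {t: 1.0 for t in seeds}
    queue = deque((t, 0) for t in seeds)
    while queue:
        table, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nbr in adj.get(table, ()):
            if nbr not in scores:
                scores[nbr] = decay ** (depth + 1)
                queue.append((nbr, depth + 1))
    return scores

scores = fk_graph_scores([("singer", "concert"), ("concert", "stadium")], ["singer"])
```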



&lt;p&gt;Scores are weighted and merged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.20&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fk_graph&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the top-K tables (default: 5) pass through to the next stage.&lt;/p&gt;
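
&lt;p&gt;Putting the merge formula to work is a one-liner per table plus a sort; the weights here are the ones from the formula above:&lt;/p&gt;

```python
def merge_scores(keyword, semantic, fk_graph, k=5):
    """Weighted merge of the three per-table signals, returning the top-k tables."""
    tables = set(keyword) | set(semantic) | set(fk_graph)
    final = {
        t: 0.35 * keyword.get(t, 0.0)
           + 0.45 * semantic.get(t, 0.0)
           + 0.20 * fk_graph.get(t, 0.0)
        for t in tables
    }
    return sorted(final, key=final.get, reverse=True)[:k]

top = merge_scores(
    keyword={"singer": 0.95},
    semantic={"singer": 0.80, "concert": 0.60},
    fk_graph={"singer": 1.0, "concert": 0.5, "stadium": 0.25},
    k=2,
)
```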

&lt;h3&gt;
  
  
  2. Security Filter
&lt;/h3&gt;

&lt;p&gt;Runs before generation. Checks for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL injection patterns (&lt;code&gt;'; DROP TABLE --&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Destructive operations (DELETE, TRUNCATE, ALTER)&lt;/li&gt;
&lt;li&gt;System table access&lt;/li&gt;
&lt;li&gt;Dangerous function calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Blocked queries never reach the generator. Non-negotiable for user-facing systems.&lt;/p&gt;
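
&lt;p&gt;A deny-list regex pass is one simple way to implement such a filter. The patterns below are illustrative, not the project's actual rules; a production filter would also run a real SQL parser and an allow-list of statement types:&lt;/p&gt;

```python
import re

# Illustrative deny-list; regexes alone are not a complete defense.
BLOCKED_PATTERNS = [
    r"(?i)\b(drop|delete|truncate|alter|grant|revoke)\b",      # destructive ops
    r"(?i)\b(pg_catalog|information_schema|sqlite_master)\b",  # system tables
    r"--",        # inline comments, common in injection payloads
    r";\s*\S",    # stacked statements after a semicolon
]

def is_safe(question):
    """Reject a request before it ever reaches the generator."""
    return not any(re.search(p, question) for p in BLOCKED_PATTERNS)

safe = is_safe("Who are the top 5 customers by order volume?")
blocked = is_safe("'; DROP TABLE users --")
```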

&lt;h3&gt;
  
  
  3. Query Generation
&lt;/h3&gt;

&lt;p&gt;The generator receives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only the relevant tables (not the full schema)&lt;/li&gt;
&lt;li&gt;The user's question&lt;/li&gt;
&lt;li&gt;Few-shot examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the first attempt fails validation → &lt;strong&gt;one-shot retry&lt;/strong&gt; with the error message appended. This simple pattern catches a surprising number of wrong column names and missing joins.&lt;/p&gt;
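
&lt;p&gt;The retry loop is almost trivial to sketch. Here &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt; are hypothetical callables, not the project's real interfaces:&lt;/p&gt;

```python
def generate_sql(llm, question, schema_text, validate):
    """One-shot retry: generate, validate, and on failure regenerate exactly once
    with the validator's error appended. Assumes llm(prompt) returns SQL text
    and validate(sql) returns an error string, or None on success."""
    prompt = f"Schema:\n{schema_text}\n\nQuestion: {question}\nSQL:"
    sql = llm(prompt)
    error = validate(sql)
    if error is None:
        return sql
    # Append the failure context and try exactly once more.
    retry = f"{prompt}\n\nPrevious attempt:\n{sql}\nError: {error}\nCorrected SQL:"
    return llm(retry)
```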

&lt;h3&gt;
  
  
  4. Parallel Validation
&lt;/h3&gt;

&lt;p&gt;Four validators run concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐
│   Syntax    │  │    Logic    │  │  Security   │  │ Performance  │
│  validator  │  │  validator  │  │  validator  │  │  validator   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬───────┘
       │                │                │                 │
       └────────────────┴────────────────┴─────────────────┘
                              │
                     ┌────────▼────────┐
                     │   Fan-in merge  │
                     └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Validator&lt;/th&gt;
&lt;th&gt;Checks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Syntax&lt;/td&gt;
&lt;td&gt;SQL parsing, grammar, dialect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logic&lt;/td&gt;
&lt;td&gt;Tables/columns exist, joins correct, types compatible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Injection, unauthorized ops, data exfiltration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Full table scans, missing indexes, complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Any failure → back to generator with error context.&lt;/p&gt;
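
&lt;p&gt;In the real pipeline LangGraph handles the fan-out, but a plain &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; shows the same fan-out/fan-in pattern. The toy validators here are stand-ins:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def validate_parallel(sql, validators):
    """Fan one SQL string out to every validator concurrently, then fan in.
    Each validator returns a list of issues; an empty list means it passed."""
    with ThreadPoolExecutor(max_workers=len(validators)) as pool:
        futures = {name: pool.submit(fn, sql) for name, fn in validators.items()}
        issues = {name: f.result() for name, f in futures.items()}
    passed = all(not found for found in issues.values())
    return passed, issues

ok, report = validate_parallel(
    "SELECT name FROM customers",
    {
        "syntax": lambda q: [] if q.upper().startswith("SELECT") else ["not a SELECT"],
        "security": lambda q: [] if "--" not in q else ["comment found"],
    },
)
```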

&lt;h3&gt;
  
  
  5. Explanation + Safety Score
&lt;/h3&gt;

&lt;p&gt;Output includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plain-English explanation of the query&lt;/li&gt;
&lt;li&gt;Safety score (0–4 across four dimensions)&lt;/li&gt;
&lt;li&gt;Optimization recommendations
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════
 EXPLANATION
═══════════════════════════════════════
This query retrieves the top 5 customers by order volume by:
1. Joining customers with orders on customer_id
2. Counting orders per customer
3. Sorting by count descending, limiting to 5

Safety Score: 3.5/4.0
✓ Security: Safe (no injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider index on customer_id
═══════════════════════════════════════
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  DDL Vector Stores: The Efficiency Fix
&lt;/h2&gt;

&lt;p&gt;Here's something worth knowing if you're building or evaluating NL2SQL systems.&lt;/p&gt;

&lt;p&gt;The semantic search step in most pipelines (including the one above) &lt;strong&gt;recomputes embeddings on every request&lt;/strong&gt;. It embeds the query + every candidate table, computes cosine similarity in-memory, and discards the vectors. Next request: same computation, same API calls, same cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Better Approach
&lt;/h3&gt;

&lt;p&gt;Pre-compute embeddings for your DDL structure and store them in a vector DB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you embed per table:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Table: orders
Columns: order_id (PK, INTEGER), customer_id (FK → customers.id),
order_date (DATE), total_amount (DECIMAL), status (VARCHAR)
Relationships: references customers, referenced by order_items
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: include &lt;strong&gt;primary keys, foreign keys, types, and relationships&lt;/strong&gt; in the text you embed. This gives the embedding model structural context, not just names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store in&lt;/strong&gt; FAISS / Chroma / Pinecone / Weaviate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At query time&lt;/strong&gt;: embed only the user's question (1 API call) → vector similarity search against pre-computed embeddings.&lt;/p&gt;
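
&lt;p&gt;The whole pattern fits in a short sketch. A toy hashing embedder stands in for a real embedding model, and a dict stands in for the vector DB; what matters is that table vectors are computed once, offline, and only the question is embedded at query time:&lt;/p&gt;

```python
import hashlib
import math

# Toy hashing embedder standing in for a real embedding model.
def toy_embed(text, dim=512):
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: one embedding per table, stored (in practice, in FAISS/Chroma/...).
ddl_texts = {
    "orders": "orders table with columns order id customer id order date total amount status",
    "customers": "customers table with columns customer id name email signup date",
}
index = {name: toy_embed(text) for name, text in ddl_texts.items()}

# Online: embed only the question, then rank tables by similarity.
question_vec = toy_embed("total amount of each order")
best = max(index, key=lambda name: cosine(question_vec, index[name]))
```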

&lt;h3&gt;
  
  
  Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Per-Request Embedding&lt;/th&gt;
&lt;th&gt;DDL Vector Store&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding calls/query&lt;/td&gt;
&lt;td&gt;1 + N (N = tables)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;O(N)&lt;/td&gt;
&lt;td&gt;O(1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost at scale&lt;/td&gt;
&lt;td&gt;Linear with schema size&lt;/td&gt;
&lt;td&gt;Near-constant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema changes&lt;/td&gt;
&lt;td&gt;Auto (always fresh)&lt;/td&gt;
&lt;td&gt;Requires re-index&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Research backing&lt;/strong&gt;: LitE-SQL (EACL 2026) uses this exact pattern — pre-computed schema embeddings with contrastive learning. Results: &lt;strong&gt;88.45% on Spider, 72.10% on BIRD&lt;/strong&gt;, with 2–30x fewer parameters than full LLM approaches.&lt;/p&gt;

&lt;p&gt;For production systems where the schema changes less often than queries arrive (i.e., basically every production database), this is a clear win.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark-to-Production Gap
&lt;/h2&gt;

&lt;p&gt;Numbers you should know:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Schema Type&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Spider 1.0&lt;/td&gt;
&lt;td&gt;Clean, 3–10 tables, descriptive names&lt;/td&gt;
&lt;td&gt;85–92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spider 2.0&lt;/td&gt;
&lt;td&gt;Enterprise-realistic schemas&lt;/td&gt;
&lt;td&gt;6–21%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIRD&lt;/td&gt;
&lt;td&gt;Dirty schemas, noisy labels&lt;/td&gt;
&lt;td&gt;~72% (best systems)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 85% → 6% drop isn't a bug. Spider 1.0 is clean academic data. Real databases have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of tables with cryptic names&lt;/li&gt;
&lt;li&gt;"Active users" meaning different things per company&lt;/li&gt;
&lt;li&gt;Nulls, inconsistent formats, undocumented columns&lt;/li&gt;
&lt;li&gt;Multi-hop joins, window functions, nested subqueries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-agent validation doesn't close this gap entirely, but it's the difference between a system that fails silently and one that says "I'm not confident — here's why."&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;Install from PyPI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nl2sql-agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure (works with any OpenAI-compatible API — OpenRouter, OpenAI, Ollama, vLLM):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://openrouter.ai/api/v1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4o-mini"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_EMBEDDING_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"openai/text-embedding-3-small"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DB_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sqlite"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DB_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/database.sqlite"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Interactive REPL&lt;/span&gt;
nl2sql

&lt;span class="c"&gt;# One-shot&lt;/span&gt;
nl2sql &lt;span class="s2"&gt;"Show me all singers from France"&lt;/span&gt;

&lt;span class="c"&gt;# Override database&lt;/span&gt;
nl2sql &lt;span class="nt"&gt;--db&lt;/span&gt; /path/to/other.sqlite &lt;span class="s2"&gt;"List all employees"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works with any SQLite database. Point &lt;code&gt;DB_PATH&lt;/code&gt; at your own &lt;code&gt;.sqlite&lt;/code&gt; file or use the Spider benchmark databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ToheedAsghar/NL2SQL" rel="noopener noreferrer"&gt;github.com/ToheedAsghar/NL2SQL&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/nl2sql-agents/" rel="noopener noreferrer"&gt;pypi.org/project/nl2sql-agents&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture at a Glance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    │
    ▼
┌─────────────────┐
│ Security Filter  │──── Block dangerous queries
└────────┬────────┘
         ▼
┌─────────────────────────────────────┐
│         Table Discovery             │
│  ┌──────┐ ┌────────┐ ┌──────────┐  │
│  │Keywrd│ │Semantic│ │ FK Graph │  │
│  └──┬───┘ └───┬────┘ └────┬─────┘  │
│     └─────────┴────────────┘        │
│         Weighted Merge              │
└────────────┬────────────────────────┘
             ▼
┌─────────────────┐
│ Schema Formatter │──── Format relevant tables
└────────┬────────┘
         ▼
┌─────────────────┐
│ Query Generator  │◄─── Retry with error context
└────────┬────────┘
         ▼
┌─────────────────────────────────────┐
│        Parallel Validation          │
│ ┌──────┐ ┌─────┐ ┌─────┐ ┌──────┐ │
│ │Syntax│ │Logic│ │Secur│ │ Perf │ │
│ └──────┘ └─────┘ └─────┘ └──────┘ │
└────────────┬────────────────────────┘
             ▼
┌─────────────────┐
│   Explainer     │──── Explanation + Safety Score
└─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;8 agents. Parallel fan-out/fan-in. LangGraph &lt;code&gt;StateGraph&lt;/code&gt; with conditional edges. MIT licensed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next for NL2SQL
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DDL vector stores&lt;/strong&gt; replacing per-request embeddings as the default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic layers&lt;/strong&gt; (dbt, Cube) + NL2SQL as complementary approaches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-database support&lt;/strong&gt; — PostgreSQL, MySQL, BigQuery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution + visualization&lt;/strong&gt; — run queries and render results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn conversations&lt;/strong&gt; — follow-up questions with context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sized models&lt;/strong&gt; — fine-tuned 7B models for individual agent tasks instead of GPT-4 for everything&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Star the repo if it's useful: &lt;a href="https://github.com/ToheedAsghar/NL2SQL" rel="noopener noreferrer"&gt;github.com/ToheedAsghar/NL2SQL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture diagram in the README is worth a look even if you're building something completely different — it's a clean reference for structuring multi-agent LangGraph pipelines.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>sql</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Built a Multi-Agent RAG System That Fact-Checks Its Own Answers — Here's How</title>
      <dc:creator>Toheed Asghar</dc:creator>
      <pubDate>Fri, 20 Feb 2026 14:24:23 +0000</pubDate>
      <link>https://dev.to/toheed_asghar_123132/i-built-a-multi-agent-rag-system-that-fact-checks-its-own-answers-heres-how-1g7j</link>
      <guid>https://dev.to/toheed_asghar_123132/i-built-a-multi-agent-rag-system-that-fact-checks-its-own-answers-heres-how-1g7j</guid>
      <description>&lt;p&gt;Every RAG system has the same Achilles' heel: &lt;strong&gt;hallucination&lt;/strong&gt;. You ask a question, it retrieves some documents, and the LLM confidently generates an answer that sounds right but is subtly wrong. No warning, no citation, no second opinion.&lt;/p&gt;

&lt;p&gt;I spent weeks building a system that fixes this. &lt;strong&gt;DocForge&lt;/strong&gt; is an open-source multi-agent RAG pipeline where four specialized AI agents collaborate — and one of them exists solely to fact-check the others.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through the architecture, the problems it solves, and how you can run it yourself.&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ToheedAsghar" rel="noopener noreferrer"&gt;
        ToheedAsghar
      &lt;/a&gt; / &lt;a href="https://github.com/ToheedAsghar/DocForge" rel="noopener noreferrer"&gt;
        DocForge
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A RAG pipeline that doesn't trust its own answers. 4 AI agents collaborate to route queries, retrieve docs, synthesize answers, and catch hallucinations automatically.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;DocForge&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;
A &lt;strong&gt;Multi-Agent Retrieval-Augmented Generation (RAG) system&lt;/strong&gt; built with LangGraph, featuring intelligent query routing, adaptive retrieval, fact-checking with automatic retry logic, and a FastAPI backend
&lt;/p&gt;

&lt;p&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/9833615dcd5c4175da2140bbe363cb42af4faaeffc0624453b6154e68a60d0aa/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31312b2d626c75652e737667"&gt;&lt;img src="https://camo.githubusercontent.com/9833615dcd5c4175da2140bbe363cb42af4faaeffc0624453b6154e68a60d0aa/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31312b2d626c75652e737667" alt="Python"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/f8bae329f8d2a971ebb55aeab5bbff19068db78b1c61b908b89ef3fdf640224f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c616e6747726170682d302e302e32302b2d677265656e2e737667"&gt;&lt;img src="https://camo.githubusercontent.com/f8bae329f8d2a971ebb55aeab5bbff19068db78b1c61b908b89ef3fdf640224f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c616e6747726170682d302e302e32302b2d677265656e2e737667" alt="LangGraph"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/ToheedAsghar/DocForge/assets/demo.gif"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FToheedAsghar%2FDocForge%2Fassets%2Fdemo.gif" width="600" alt="DocForge Demo"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Key Features&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Multi-Agent Architecture&lt;/h3&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Routing Agent&lt;/strong&gt; — Classifies query complexity (simple lookup / complex reasoning / multi-hop) and generates an optimized search query for the vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval Agent&lt;/strong&gt; — Adaptive document retrieval (3-10 docs based on complexity, with relaxed thresholds on retries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis Agent&lt;/strong&gt; — Synthesizes coherent, cited answers from multiple sources using chain-of-thought reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation Agent&lt;/strong&gt; — Fact-checks every claim against source documents, identifies hallucinations, and corrects the answer if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Intelligent Workflow&lt;/h3&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confidence-based validation skip&lt;/strong&gt; — When retrieval scores are high, sources are sufficient, and no information gaps exist, validation is skipped entirely for faster responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic retry with adaptive strategy&lt;/strong&gt; — On validation failure, the system retries retrieval with 50% more documents and a relaxed relevance threshold (up to 3 attempts)&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ToheedAsghar/DocForge" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  Why Traditional RAG Falls Short
&lt;/h2&gt;

&lt;p&gt;A standard RAG pipeline is straightforward: embed a query, retrieve similar chunks from a vector database, and pass them to an LLM to generate an answer. It works — until it doesn't.&lt;/p&gt;

&lt;p&gt;Here are the failure modes I kept hitting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No query understanding&lt;/strong&gt; — A simple factual lookup and a complex multi-hop question both get the same retrieval strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed retrieval&lt;/strong&gt; — Always fetching the same number of documents regardless of question complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No verification&lt;/strong&gt; — The LLM's answer is accepted as-is, even when it contradicts or fabricates information beyond the source documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No recovery&lt;/strong&gt; — When retrieval fails to find relevant documents, the system has no mechanism to retry with a different strategy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DocForge addresses every one of these with a multi-agent architecture.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Architecture: Four Agents, One Pipeline
&lt;/h2&gt;

&lt;p&gt;DocForge is built on &lt;strong&gt;LangGraph&lt;/strong&gt;, which orchestrates four specialized agents into a stateful workflow:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    │
    ▼
┌─────────────────┐
│   Redis Cache    │ ◄── Check cache first (SHA-256 key)
└────────┬────────┘
         │ (cache miss)
         ▼
┌─────────────────┐
│  Routing Agent   │ ◄── Classify complexity, optimize search query
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Retrieval Agent  │ ◄── Fetch 3–10 docs from Pinecone
└────────┬────────┘     (50% more on retry, relaxed threshold)
         │
         ▼
┌─────────────────┐
│ Analysis Agent   │ ◄── Synthesize cited answer (chain-of-thought)
└────────┬────────┘
         │
         ▼
    Confidence Check
    │
    ├── High confidence ──▶ Skip validation ──▶ Return &amp;amp; Cache
    │
    └── Otherwise:
         │
         ▼
    ┌─────────────────┐
    │ Validation Agent │ ◄── Fact-check every claim
    └────────┬────────┘
             │
             ▼
        ├── Valid           ──▶ Return &amp;amp; Cache
        ├── Invalid (&amp;lt; 3)  ──▶ Retry from Retrieval (adaptive)
        └── Invalid (≥ 3)  ──▶ Return corrected answer &amp;amp; Cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let me break down each agent.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Routing Agent — The Dispatcher
&lt;/h3&gt;

&lt;p&gt;Not all questions are equal. "What is LangGraph?" is a simple lookup. "Compare the tradeoffs of LangGraph vs. CrewAI for multi-agent orchestration" requires complex reasoning across multiple sources.&lt;/p&gt;

&lt;p&gt;The Routing Agent classifies every incoming query into one of three types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple lookup&lt;/strong&gt; — Direct factual questions (retrieves 3 documents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex reasoning&lt;/strong&gt; — Questions requiring synthesis across sources (retrieves 7 documents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-hop&lt;/strong&gt; — Questions that chain multiple pieces of information (retrieves 10 documents)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also rewrites the user's natural-language query into an optimized search query for the vector database, improving retrieval relevance.&lt;/p&gt;
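
&lt;p&gt;The mapping from complexity class to retrieval size is just a small lookup. The labels and document counts come from the post; the function name and output shape are illustrative, since the real Routing Agent is an LLM call:&lt;/p&gt;

```python
# Document counts per complexity class, as described in the post.
DOCS_PER_COMPLEXITY = {"simple": 3, "complex": 7, "multi_hop": 10}

def retrieval_plan(complexity, rewritten_query):
    """Turn the router's classification into retrieval parameters."""
    return {
        "query": rewritten_query,
        "top_k": DOCS_PER_COMPLEXITY.get(complexity, 3),  # default to a simple lookup
    }

plan = retrieval_plan("multi_hop", "LangGraph CrewAI orchestration tradeoffs")
```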

&lt;h3&gt;
  
  
  2. Retrieval Agent — Adaptive Search
&lt;/h3&gt;

&lt;p&gt;Based on the routing classification, the Retrieval Agent queries &lt;strong&gt;Pinecone&lt;/strong&gt; with the appropriate number of documents and relevance threshold.&lt;/p&gt;

&lt;p&gt;The key innovation here is &lt;strong&gt;adaptive retry&lt;/strong&gt;. If the Validation Agent later rejects the answer, retrieval reruns with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50% more documents&lt;/strong&gt; than the previous attempt&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;relaxed relevance threshold&lt;/strong&gt; to cast a wider net&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means the system self-corrects when initial retrieval wasn't sufficient.&lt;/p&gt;
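
&lt;p&gt;The retry widening reduces to two arithmetic tweaks. The 50% growth factor is from the post; the relaxation step of 0.05 is an illustrative value, not DocForge's actual constant:&lt;/p&gt;

```python
import math

def retry_params(top_k, threshold, growth=1.5, relax=0.05):
    """Widen the net on each retry: more documents, lower relevance bar."""
    return math.ceil(top_k * growth), max(threshold - relax, 0.0)

# A complex-reasoning query that failed validation on its first pass:
k, t = retry_params(7, 0.75)
```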

&lt;h3&gt;
  
  
  3. Analysis Agent — The Synthesizer
&lt;/h3&gt;

&lt;p&gt;The Analysis Agent takes the retrieved document chunks and synthesizes a coherent, cited answer using chain-of-thought reasoning. Every claim in the answer is tied back to a specific source document.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Validation Agent — The Fact-Checker
&lt;/h3&gt;

&lt;p&gt;This is the agent that makes DocForge different. The Validation Agent independently fact-checks &lt;strong&gt;every claim&lt;/strong&gt; in the synthesized answer against the source documents. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identifies unsupported claims&lt;/li&gt;
&lt;li&gt;Detects hallucinated information&lt;/li&gt;
&lt;li&gt;Flags contradictions with sources&lt;/li&gt;
&lt;li&gt;Provides a corrected answer when issues are found&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If validation fails, the system retries from retrieval with an adaptive strategy, up to 3 attempts. If it still fails after the maximum number of retries, it returns the best corrected answer it has.&lt;/p&gt;




&lt;h2&gt;
  
  
  Smart Optimizations That Matter in Production
&lt;/h2&gt;

&lt;p&gt;Building a multi-agent system that's correct is one thing. Making it fast and cost-effective is another.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confidence-Based Validation Skip
&lt;/h3&gt;

&lt;p&gt;Not every answer needs fact-checking. When all three conditions are met, DocForge skips the Validation Agent entirely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval scores are above &lt;strong&gt;0.85&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;At least &lt;strong&gt;3 source documents&lt;/strong&gt; were used&lt;/li&gt;
&lt;li&gt;No information gaps were detected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cuts latency by 30–40% on high-confidence queries.&lt;/p&gt;
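&lt;p&gt;The skip check is just a conjunction of the three conditions. A minimal sketch, assuming per-chunk relevance scores and a gap flag from the Analysis Agent:&lt;/p&gt;

```python
# Hedged sketch of the confidence-based validation skip: fact-checking is
# skipped only when every condition holds.
def can_skip_validation(scores, gaps_detected,
                        min_score=0.85, min_sources=3):
    """True when retrieval is confident enough to skip the Validation Agent."""
    return (len(scores) >= min_sources          # at least 3 source documents
            and all(s > min_score for s in scores)  # all scores above 0.85
            and not gaps_detected)              # no information gaps

print(can_skip_validation([0.91, 0.88, 0.87], gaps_detected=False))  # True
print(can_skip_validation([0.91, 0.62], gaps_detected=False))        # False
```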

&lt;h3&gt;
  
  
  Redis Caching
&lt;/h3&gt;

&lt;p&gt;Every query result is cached in Redis with a SHA-256 key and 1-hour TTL. Repeated queries return instantly — roughly &lt;strong&gt;10x faster&lt;/strong&gt; than a fresh pipeline run, with zero token cost.&lt;/p&gt;
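&lt;p&gt;A minimal sketch of that cache layer, assuming the key is a SHA-256 hash of the normalized query (the exact key scheme is an implementation detail):&lt;/p&gt;

```python
import hashlib
import json

CACHE_TTL_SECONDS = 3600  # 1-hour TTL

def cache_key(query: str) -> str:
    """Derive a stable key: lowercase, collapse whitespace, then SHA-256."""
    normalized = " ".join(query.lower().split())
    return "docforge:query:" + hashlib.sha256(normalized.encode()).hexdigest()

def cached_answer(redis_client, query: str, run_pipeline):
    """Return a cached result if present; otherwise run the full pipeline
    and store the JSON-serialized result with a 1-hour expiry."""
    key = cache_key(query)
    hit = redis_client.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no pipeline run, no token cost
    result = run_pipeline(query)
    redis_client.set(key, json.dumps(result), ex=CACHE_TTL_SECONDS)
    return result
```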

&lt;h3&gt;
  
  
  Task-Specific Model Selection
&lt;/h3&gt;

&lt;p&gt;Different agents need different capabilities. DocForge lets you assign different models per task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fast, cheap model for simple routing decisions&lt;/span&gt;
&lt;span class="nv"&gt;GEMINI_ROUTING_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemini-2.0-flash-lite

&lt;span class="c"&gt;# More capable model for complex synthesis and validation&lt;/span&gt;
&lt;span class="nv"&gt;GEMINI_ANALYSIS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemini-2.5-flash
&lt;span class="nv"&gt;GEMINI_VALIDATION_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemini-2.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cuts token costs by &lt;strong&gt;40–50%&lt;/strong&gt; compared to using a single expensive model for everything.&lt;/p&gt;
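&lt;p&gt;Resolving those variables at runtime is a one-liner per task. The mapping below mirrors the env vars above; the function name is illustrative:&lt;/p&gt;

```python
import os

# Per-task model selection, keyed on the env vars shown above, with the
# defaults falling back to the same models the example configures.
TASK_MODEL_VARS = {
    "routing":    ("GEMINI_ROUTING_MODEL",    "gemini-2.0-flash-lite"),
    "analysis":   ("GEMINI_ANALYSIS_MODEL",   "gemini-2.5-flash"),
    "validation": ("GEMINI_VALIDATION_MODEL", "gemini-2.5-flash"),
}

def model_for(task: str) -> str:
    """Resolve the model name for a task, falling back to its default."""
    var, default = TASK_MODEL_VARS[task]
    return os.environ.get(var, default)
```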

&lt;h3&gt;
  
  
  Dual LLM Provider Support
&lt;/h3&gt;

&lt;p&gt;DocForge supports both &lt;strong&gt;OpenAI GPT&lt;/strong&gt; (via OpenRouter) and &lt;strong&gt;Google Gemini&lt;/strong&gt;. Switch providers with a single environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LLM_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemini  &lt;span class="c"&gt;# or "gpt"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
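&lt;p&gt;One way to validate such a switch at startup; the endpoint mapping here is an assumption for illustration, not DocForge's actual wiring:&lt;/p&gt;

```python
import os

# Hypothetical provider switch keyed on LLM_PROVIDER, rejecting unknown
# values early instead of failing mid-pipeline.
PROVIDERS = {
    "gemini": "https://generativelanguage.googleapis.com",
    "gpt": "https://openrouter.ai/api/v1",
}

def resolve_provider() -> tuple[str, str]:
    """Read LLM_PROVIDER and return (name, base_url)."""
    name = os.environ.get("LLM_PROVIDER", "gemini").lower()
    if name not in PROVIDERS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {name!r}")
    return name, PROVIDERS[name]
```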






&lt;h2&gt;
  
  
  Getting Started in 5 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt; account (free tier works)&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; or &lt;a href="https://ai.google.dev/" rel="noopener noreferrer"&gt;Google Gemini&lt;/a&gt; API key&lt;/li&gt;
&lt;li&gt;Redis (optional, for caching)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ToheedAsghar/DocForge.git
&lt;span class="nb"&gt;cd &lt;/span&gt;DocForge
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LLM_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemini
&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-gemini-key
&lt;span class="nv"&gt;PINECONE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-pinecone-key
&lt;span class="nv"&gt;PINECONE_ENVIRONMENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east-1
&lt;span class="nv"&gt;PINECONE_INDEX_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;techdoc-intelligence
&lt;span class="nv"&gt;REDIS_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;redis://localhost:6379
&lt;span class="nv"&gt;CACHE_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ingest Your Documents
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;backend.ingestion.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ingest_documents&lt;/span&gt;

&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ingest_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./documents/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents_loaded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; PDFs → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunks_created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Query the System
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;backend.agents.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_graph&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is LangGraph?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fact_checked_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sources: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;validation_passed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;corrected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Or Use the REST API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn backend.main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000

curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/api/v1/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"query": "What is LangGraph?"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker is also supported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent Orchestration&lt;/td&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Providers&lt;/td&gt;
&lt;td&gt;OpenAI GPT-4o-mini (via OpenRouter), Google Gemini 2.5 Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;OpenAI &lt;code&gt;text-embedding-3-small&lt;/code&gt; (1536 dims)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector Database&lt;/td&gt;
&lt;td&gt;Pinecone (serverless, cosine similarity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching&lt;/td&gt;
&lt;td&gt;Redis (SHA-256 keys, 1-hour TTL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Framework&lt;/td&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Framework&lt;/td&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Pydantic Settings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Containerization&lt;/td&gt;
&lt;td&gt;Docker + Docker Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;p&gt;A few takeaways from building a multi-agent RAG system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Validation is worth the latency cost.&lt;/strong&gt; In my testing, the Validation Agent caught hallucinated claims in roughly 15–20% of responses. That's up to 1 in 5 answers that would have been wrong without it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Adaptive retry is better than aggressive retrieval.&lt;/strong&gt; Instead of always retrieving 10+ documents (slow, expensive, noisy), start small and retry with more only when needed. Most queries are answered well with 3–5 documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Caching is a multiplier.&lt;/strong&gt; In any production Q&amp;amp;A system, users ask similar questions repeatedly. Redis caching turned repeated queries from 3–5 second operations into sub-100ms responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Different tasks need different models.&lt;/strong&gt; Routing a query is a simple classification task — it doesn't need GPT-4. Synthesizing a multi-source answer does. Task-specific model assignment is an easy win for cost optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;DocForge is actively being developed. Here's what's on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for more document formats (DOCX, TXT, Markdown, HTML)&lt;/li&gt;
&lt;li&gt;Conversation history and multi-turn chat&lt;/li&gt;
&lt;li&gt;A frontend UI for non-technical users&lt;/li&gt;
&lt;li&gt;Multi-tenancy support&lt;/li&gt;
&lt;li&gt;Deployment guides for AWS, Railway, and Render&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;DocForge is fully open-source under the MIT license. If you're building a RAG system and tired of hallucinated answers, give it a spin:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ToheedAsghar/DocForge" rel="noopener noreferrer"&gt;github.com/ToheedAsghar/DocForge&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you found this useful, a star on the repo would mean a lot. I'm also happy to answer questions in the comments — whether it's about the architecture, LangGraph, or multi-agent systems in general.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/ToheedAsghar" rel="noopener noreferrer"&gt;Toheed Asghar&lt;/a&gt; with LangGraph, LangChain, Pinecone, and FastAPI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>langchain</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
