Hady Walied


From 40% to 100% SQL Generation Accuracy: Why Local AI Needs Self-Correction, Not Perfect Prompts

I spent 12 hours fighting a local AI model to generate valid SQL queries. My success rate went from 40% to 100%, not by prompt engineering, but by teaching the model to learn from its own mistakes.

Key Takeaways (TL;DR):

• Self-correction loops beat perfect-first-time approaches for local AI
• DSPy optimization improved SQL accuracy from 40% to 100% without manual prompt tuning
• Defensive parsing is critical: LLM output is probabilistic; your code isn't
• When to use this: privacy-critical data, edge deployment, vendor lock-in avoidance

The Problem: Local Models Are "Creative" (and that's bad for SQL)

Building a Retail Analytics Copilot that runs entirely on a laptop (using a quantized 24B model) sounds great for privacy, but it's a nightmare for reliability. Unlike GPT-class cloud models, which follow instructions like a senior engineer, local models are like enthusiastic interns: they try hard, but they hallucinate syntax, forget schema details, and love to chat when they should be coding.

My initial baseline was dismal: only 40% of generated SQL queries actually executed. The rest were plagued by syntax errors, hallucinated columns, or conversational fluff ("Here is your query...").

Here are the three counterintuitive lessons I learned getting that number to 100%.

Lesson 1: Self-Correction > Perfect Prompts

I started where everyone starts: Prompt Engineering. I spent hours crafting the "perfect" system prompt, begging the model to "ONLY output SQL" and "CHECK the schema." It didn't work. The model would obey for 5 queries and fail on the 6th.

The Shift: Instead of trying to prevent errors, I built a system that expects them.

I implemented a LangGraph workflow with a dedicated Repair Loop.

  1. Generate: The model writes a query.
  2. Execute: We try to run it against SQLite.
  3. Catch: If it fails, we catch the error (e.g., no such column: 'Price').
  4. Feedback: We feed the exact error message back to the model: "The previous query failed with error X. Fix it."

This pattern generalizes beyond SQL. Anytime you're working with probabilistic systems, external APIs that can fail, or ambiguous user input, you should design for graceful degradation.
Amazon's mantra: "Everything fails all the time." Your architecture should assume failure and route around it.

The Hidden Benefit: This loop generates training data. Every (failed_query, error_message, corrected_query) triple becomes a potential few-shot example for optimization. You're not just fixing bugs; you're building a self-improving system.

The Code

Here is the actual logic for the repair loop:

import sqlite3

# Assumes `cursor` is an open sqlite3 cursor created once at graph setup, and
# AgentState is the project's LangGraph state dict.

def sql_execution_node(state: AgentState) -> AgentState:
    """Execute SQL and handle errors gracefully."""
    query = state["sql_query"]

    try:
        cursor.execute(query)
        state["sql_results"] = cursor.fetchall()
        state["errors"] = []
    except sqlite3.OperationalError as e:
        # Don't crash: capture the error and route to repair
        state["sql_results"] = []
        state["errors"] = state.get("errors", []) + [str(e)]
        state["feedback"] = f"SQL execution failed: {e}. Fix the query."
        state["repair_count"] = state.get("repair_count", 0) + 1

    return state

def should_repair(state: AgentState) -> str:
    """Conditional edge: repair or continue?"""
    if state["errors"] and state["repair_count"] < 2:
        return "sql_generator"  # Loop back
    return "synthesizer"  # Give up or continue
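For context, here is a minimal sketch of how these nodes could be wired into a LangGraph StateGraph. The AgentState fields, sql_generator_node, and synthesizer_node below are stand-ins inferred from the snippet above, not the project's exact code:

from typing import List, TypedDict

from langgraph.graph import StateGraph, END

class AgentState(TypedDict, total=False):
    question: str
    sql_query: str
    sql_results: list
    errors: List[str]
    feedback: str
    repair_count: int

def sql_generator_node(state: AgentState) -> AgentState:
    # Placeholder: the real node prompts the local model and, on a retry,
    # injects state.get("feedback") so it can fix its own mistake.
    return state

def synthesizer_node(state: AgentState) -> AgentState:
    # Placeholder: the real node turns SQL results (and RAG context) into an answer.
    return state

graph = StateGraph(AgentState)
graph.add_node("sql_generator", sql_generator_node)
graph.add_node("sql_executor", sql_execution_node)  # the node shown above
graph.add_node("synthesizer", synthesizer_node)

graph.set_entry_point("sql_generator")
graph.add_edge("sql_generator", "sql_executor")

# should_repair returns the name of the next node: loop back while there is
# an error and repair budget left, otherwise hand off to the synthesizer.
graph.add_conditional_edges(
    "sql_executor",
    should_repair,
    {"sql_generator": "sql_generator", "synthesizer": "synthesizer"},
)
graph.add_edge("synthesizer", END)

app = graph.compile()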

Lesson 2: The "ELECT" Bug (Defensive Parsing)

Local models generate creative variations: SELECT, Select, SQL: SELECT. I wrote defensive code to strip prefixes using lstrip("SQL:"). Seems reasonable, right?

Wrong.

lstrip removes any character in the set from the left. Since SELECT starts with S, and S is in SQL:, it got stripped. My code turned SELECT * FROM... into ELECT * FROM....

The model was doing its job; my parsing broke it.
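A quick check in a Python shell makes the failure mode concrete:

>>> "SELECT * FROM orders".lstrip("SQL:")
'ELECT * FROM orders'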

The Lesson: LLMs are probabilistic. Your parsing must be deterministic.

  • Don't use character-level stripping (lstrip).
  • Do use Regex extraction (re.search(r'(SELECT\s+.*)', ...)), as sketched below.
  • Better yet: Enforce structured output (JSON mode) so you aren't parsing freeform text at all.
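
Here is a minimal sketch of that regex approach. The function name clean_sql_output matches the test in "The ELECT Test" further down; the exact pattern is illustrative, not the project's verbatim code:

import re

def clean_sql_output(raw: str) -> str:
    """Pull the SQL statement out of freeform model output."""
    # Anchor on the first SELECT instead of stripping characters, so prefixes
    # like "SQL:" or "Here is your query:" are dropped without mangling the query.
    match = re.search(r"(SELECT\s+.*)", raw, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else raw.strip()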

Lesson 3: Optimization Is Code, Not Magic

I stopped hand-writing prompts. Instead, I used DSPy.

Think of traditional prompt engineering like manually tuning hyperparameters in a neural network by running experiments and eyeballing loss curves. DSPy is like having backpropagation: it automates the search for optimal prompts using gradient-free optimization over a metric you define.

The mental shift: You're not writing prompts anymore. You're writing loss functions.

I defined a metric: "A query is good if it executes AND returns non-empty results." Then, I used the BootstrapFewShot optimizer. It acted like a teacher, generating multiple potential SQL queries, running them, and keeping only the ones that passed my metric to use as "few-shot examples".
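
In code, the pattern looks roughly like this. The signature fields, toy schema, database path, and metric details are simplified assumptions; the BootstrapFewShot flow itself is standard DSPy and assumes dspy.settings.configure(lm=...) has already pointed DSPy at the local model:

import sqlite3

import dspy
from dspy.teleprompt import BootstrapFewShot

class TextToSQL(dspy.Signature):
    """Translate a retail analytics question into a SQLite query."""
    question = dspy.InputField()
    schema = dspy.InputField(desc="CREATE TABLE statements for the database")
    sql = dspy.OutputField(desc="a single executable SQLite query")

generate_sql = dspy.ChainOfThought(TextToSQL)

def executes_with_rows(example, pred, trace=None):
    """Metric: a query is 'good' if it executes AND returns non-empty results."""
    try:
        cursor = sqlite3.connect("retail.db").cursor()  # hypothetical DB path
        return len(cursor.execute(pred.sql).fetchall()) > 0
    except sqlite3.Error:
        return False

SCHEMA = "CREATE TABLE orders (id INTEGER, order_date TEXT, amount REAL);"  # toy schema
trainset = [
    dspy.Example(question="What is the total order amount?", schema=SCHEMA).with_inputs("question", "schema"),
    # ...more questions; BootstrapFewShot generates candidate SQL and keeps only what passes the metric
]

optimizer = BootstrapFewShot(metric=executes_with_rows, max_bootstrapped_demos=4)
optimized_sql = optimizer.compile(generate_sql, trainset=trainset)

After compilation, optimized_sql carries the bootstrapped demonstrations in its prompt, so no hand-written few-shot examples are needed.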

The Results

| Metric | Baseline | After Optimization | After Repair Loop |
| --- | --- | --- | --- |
| Valid SQL (%) | 40% | 85% | 100% |
| Correct Format (%) | 30% | 60% | 95% |
| End-to-End Success (%) | 12% | 51% | 66%* |

Why 100% SQL but only 66% end-to-end?
Because SQL accuracy is necessary but not sufficient. The failures now happen at different layers: the router misclassifies 15% of hybrid questions as pure SQL, the retriever returns irrelevant docs 10% of the time, and the synthesizer occasionally combines SQL results with RAG context incorrectly. This reveals a critical insight: optimizing one component creates new bottlenecks elsewhere. The system is now limited by orchestration logic, not model quality.

3 Things You Can Steal Today

  1. The Repair Pattern: Add a feedback field to your LLM state. On failure, inject the error message and retry. It costs ~1 extra LLM call but buys roughly a 10x gain in reliability.
  2. The ELECT Test: Run this on your parsing logic:

    assert clean_sql_output("SQL: SELECT * FROM orders") == "SELECT * FROM orders"
    assert clean_sql_output("SELECT * FROM orders") == "SELECT * FROM orders"
    

    If it fails, you're using string methods wrong.

  3. The DSPy Starter Template: Define your task as (inputs, output, metric). Let the optimizer find examples instead of writing them manually.

Why This Matters (Beyond This Project)

The AI landscape is bifurcating:

  • Cloud-first (GPT-5.1, Claude, etc.): powerful but expensive and privacy-risky.
  • Edge-first (local models): cheaper, private, but harder to wrangle.

Companies that master local AI will own regulated industries (healthcare, finance, government) where cloud LLMs are non-starters. The bottleneck isn't model weights; it's reliability engineering.

If you can build systems that make 7B models behave like 70B models through clever orchestration, you're solving a $100B problem. This isn't just a technical exercise; it's a strategic moat.

The skills that matter:

  • Systems thinking: Understanding failure modes and designing around them.
  • Optimization: Treating prompts as learnable parameters, not art.
  • Defensive engineering: Building for the probabilistic world.

Master these, and you're not competing with prompt engineers. You're competing with infrastructure engineers at AI-native companies.


You can find this project on GitHub: hadywalied/RACopilot
