kasi viswanath vandanapu
Build a Production‑Ready SQL Evaluation Engine for LLMs

Intro

When I first started building a text‑to‑SQL system, the obvious thing was to run the generated query against a database and compare the result with a ground truth. That worked for a handful of examples, but as soon as we hit hundreds of user queries, the naive approach broke down: it was slow, brittle, and offered no insight into why a query failed.

What I needed was a two‑layer engine:

  1. Fast deterministic checks that catch the most common mistakes in under a second.
  2. An AI judge that digs deeper when those checks fail, tells you exactly what’s missing or wrong, and even spits out a corrected SQL snippet.

Below is my complete, production‑ready framework (no storage, no UI). I’ll walk through the architecture, show you the core code, and explain how to plug it into your own pipeline. By the end, you’ll have a reusable tool that turns every LLM‑generated query into actionable feedback—perfect for continuous model improvement.


1. Why Two Layers?

| Layer | Purpose | Typical Cost | Speed |
|---|---|---|---|
| Deterministic | Quick sanity checks (row count, column coverage, AST structure) | Negligible | < 0.5 s per query |
| AI Judge | Deep semantic review + actionable suggestions | Medium (model token cost) | 1–3 s per query |

The deterministic layer filters out the obvious failures—missing joins, wrong aggregates, mismatched row counts—so we only pay for the expensive AI pass when something truly needs human‑level reasoning. This keeps overall costs low while still giving you rich diagnostics.
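To see why the gating matters, here is a back-of-envelope cost model. The numbers are assumptions for illustration (a 70% deterministic pass rate, ~$0.004 and ~2 s per judge call), not measurements from my system:

```python
# Rough cost/latency model for the two-layer design.
# All defaults are illustrative assumptions, not production figures.
def blended_cost(n_queries: int, pass_rate: float = 0.70,
                 judge_cost_usd: float = 0.004, judge_latency_s: float = 2.0):
    """Estimate how many queries escalate to the AI judge and what that costs."""
    judged = round(n_queries * (1 - pass_rate))  # only failures hit the judge
    return {
        "judge_calls": judged,
        "total_cost_usd": round(judged * judge_cost_usd, 2),
        "extra_latency_s": round(judged * judge_latency_s, 1),  # sequential worst case
    }

# For 1,000 queries, only ~300 judge calls instead of 1,000.
print(blended_cost(1000))
```

Even with pessimistic assumptions, skipping the judge for clean queries cuts token spend by the deterministic pass rate, which is why the cheap layer runs first.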


2. High‑Level Flow

```
User Query + Expected SQL
          │
          ▼
[Deterministic Evaluator] → scores, verdict
          │
          ├─ If score ≥ 0.92 → return quick summary
          │
          ▼
[AI Judge] (LLM) → detailed JSON: missing elements, root cause, suggested fix
```

The evaluator returns a dictionary with metrics like `row_count_match`, `column_coverage`, and an overall weighted score. If the score is high enough, we skip the AI step; otherwise we call the LLM with a carefully crafted prompt that forces it to output structured JSON.
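The gating itself is a few lines. A subtlety worth calling out: the six weights used in the framework sum to 0.90, so the weighted sum must be normalized back to a true 0–1 scale for a 0.92 threshold to be meaningful. A minimal sketch:

```python
# Minimal sketch of the score-and-gate step.
# Weights and the 0.92 threshold mirror the framework; tune for your workload.
weights = {"row": 0.15, "col": 0.20, "result": 0.25,
           "tables": 0.10, "where": 0.10, "group_by": 0.10}

def overall_score(checks: dict) -> float:
    """checks maps each weight key to a 0.0-1.0 score (bools count as 0/1)."""
    raw = sum(float(checks[k]) * w for k, w in weights.items())
    # Normalize: the raw weights sum to 0.90, so a perfect query
    # would otherwise top out below the 0.92 skip threshold.
    return raw / sum(weights.values())

def needs_judge(checks: dict, threshold: float = 0.92) -> bool:
    return overall_score(checks) < threshold

perfect = {"row": True, "col": 1.0, "result": True,
           "tables": True, "where": True, "group_by": True}
print(needs_judge(perfect))  # → False: a perfect query skips the judge
```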


3. Core Code

Save this as `sql_eval_framework.py`. It contains:

  • `deterministic_evaluate()` – the fast checks.
  • `build_judge_prompt()` – the prompt template for the AI step.
  • `evaluate_batch()` – an async batch runner with concurrency control.

```python
# sql_eval_framework.py
import asyncio, json
from typing import Dict, Any, List
import pandas as pd
from sqlglot import parse_one, exp
from litellm import acompletion  # pip install litellm
from pydantic import BaseModel, Field

# ------------------------------------------------------------------
# 1️⃣ Deterministic Evaluator (80/20 + extras)
# ------------------------------------------------------------------
def deterministic_evaluate(
    expected_sql: str,
    llm_sql: str,
    expected_df: pd.DataFrame,
    llm_df: pd.DataFrame
) -> Dict[str, Any]:
    """
    Fast checks that run in < 0.5 s per query.
    Returns a dict with scores and a weighted overall score.
    """

    # --- Row count -------------------------------------------------
    row_match = len(expected_df) == len(llm_df)
    row_delta_pct = abs(len(expected_df) - len(llm_df)) / max(len(expected_df), 1)

    # --- Column coverage --------------------------------------------
    exp_cols, llm_cols = set(expected_df.columns), set(llm_df.columns)
    col_coverage = len(exp_cols & llm_cols) / len(exp_cols) if exp_cols else 1.0
    missing_cols = list(exp_cols - llm_cols)
    extra_cols   = list(llm_cols - exp_cols)

    # --- Result set match (order‑insensitive) -----------------------
    try:
        exp_sorted = expected_df.sort_values(by=list(expected_df.columns)).reset_index(drop=True)
        llm_sorted  = llm_df.sort_values(by=list(llm_df.columns)).reset_index(drop=True)
        result_exact = exp_sorted.equals(llm_sorted)
    except Exception:
        result_exact = False

    # --- Basic AST structural checks --------------------------------
    try:
        ast_exp = parse_one(expected_sql, read="postgres")
        ast_llm = parse_one(llm_sql, read="postgres")

        # Compare plain table names rather than raw identifier nodes,
        # so quoting differences don't cause false mismatches.
        tables_match   = {t.name for t in ast_exp.find_all(exp.Table)} == \
                         {t.name for t in ast_llm.find_all(exp.Table)}
        where_exists   = bool(ast_exp.args.get("where")) == bool(ast_llm.args.get("where"))
        group_by_match = str(ast_exp.args.get("group")) == str(ast_llm.args.get("group"))
    except Exception:
        tables_match, where_exists, group_by_match = False, False, False

    # --- Weighted score ---------------------------------------------
    weights = {
        "row": 0.15,
        "col": 0.20,
        "result": 0.25,
        "tables": 0.10,
        "where": 0.10,
        "group_by": 0.10
    }
    raw_score = (
        (1 if row_match else 0) * weights["row"] +
        col_coverage * weights["col"] +
        (1 if result_exact else 0) * weights["result"] +
        (1 if tables_match else 0) * weights["tables"] +
        (1 if where_exists else 0) * weights["where"] +
        (1 if group_by_match else 0) * weights["group_by"]
    )
    # Normalize: the raw weights sum to 0.90, so without this a perfect
    # query could never reach the 0.92 skip threshold used downstream.
    overall_score = raw_score / sum(weights.values())

    return {
        "row_count_match": row_match,
        "row_delta_pct": round(row_delta_pct, 3),
        "column_coverage": round(col_coverage, 2),
        "missing_columns": missing_cols,
        "extra_columns": extra_cols,
        "result_exact_match": result_exact,
        "tables_match": tables_match,
        "where_exists_match": where_exists,
        "group_by_match": group_by_match,
        "deterministic_score": round(overall_score, 2),
    }

# ------------------------------------------------------------------
# 2️⃣ AI Judge Prompt Builder
# ------------------------------------------------------------------
def build_judge_prompt(
    user_query: str,
    schema_ddl: str,
    expected_sql: str,
    llm_sql: str,
    exp_sample_csv: str,
    llm_sample_csv: str,
    exp_rows: int,
    llm_rows: int,
    exp_cols: List[str],
    llm_cols: List[str]
) -> str:
    """
    Returns a prompt that forces the LLM to output *only* JSON.
    The prompt includes enough context (schema, samples, row counts)
    so the model can reason about missing or extra elements.
    """

    return f"""
You are an expert SQL analyst and debugger with 15+ years experience.

Task: Deeply compare the **Expected SQL** (ground truth) with the **LLM-generated SQL**.
User natural‑language query: {user_query}

Database Schema (full DDL):
{schema_ddl}

Expected SQL (ground truth):
{expected_sql}

LLM-generated SQL:
{llm_sql}

Expected Result (first 10 rows as CSV):
{exp_sample_csv}

LLM Result (first 10 rows as CSV):
{llm_sample_csv}

Row count — Expected: {exp_rows} | LLM: {llm_rows}
Columns — Expected: {', '.join(exp_cols)} | LLM: {', '.join(llm_cols)}

Evaluate using this exact rubric. Think step‑by‑step.

1. **Semantic Correctness** (0-100): How well does the LLM SQL answer the user query compared to the expected SQL?
2. **What is MISSING?** List every important element present in Expected but absent/mis‑implemented in LLM SQL (filters, joins, aggregations, computed columns, groupings, ordering, etc.).
3. **What is EXTRA or WRONG?** List anything in LLM SQL that should not be there or is incorrect.
4. **Root Cause Hypothesis**: Why might the LLM have made these mistakes given the original prompt?
5. **Suggested Fix**: Provide a complete, corrected SQL that would pass all checks. Make minimal changes possible while fully matching the expected semantics.

Return ONLY valid JSON (no markdown):
{{
  "semantic_score": integer 0-100,
  "missing_elements": ["filter: status='ACTIVE'", "join on customer_id", ...],
  "extra_or_wrong_elements": ["unnecessary ORDER BY", ...],
  "root_cause": "string explanation",
  "suggested_improved_sql": "full corrected SQL here",
  "overall_verdict": "PASS | PARTIAL | FAIL",
  "explanation": "concise paragraph"
}}
"""

# ------------------------------------------------------------------
# 3️⃣ Batch Evaluator (async, concurrency control)
# ------------------------------------------------------------------
async def evaluate_batch(
    test_cases: List[Dict[str, Any]],
    max_parallel: int = 10,
    judge_model: str = "claude-3-5-haiku-20241022"
) -> List[Dict[str, Any]]:
    """
    test_cases is a list of dicts with keys:
        user_query, schema_ddl, expected_sql, llm_sql,
        expected_df, llm_df
    Returns each case enriched with deterministic scores and, if needed,
    the AI judge JSON.
    """

    semaphore = asyncio.Semaphore(max_parallel)

    async def process_one(case: Dict[str, Any]) -> Dict[str, Any]:
        # 1️⃣ Deterministic pass
        det_scores = deterministic_evaluate(
            case["expected_sql"],
            case["llm_sql"],
            case["expected_df"],
            case["llm_df"]
        )
        result = {**case, **det_scores}

        # 2️⃣ If high enough, skip AI judge
        if det_scores["deterministic_score"] >= 0.92:
            result["judge_review"] = None
            return result

        # 3️⃣ Build prompt and call LLM
        exp_sample_csv = case["expected_df"].head(10).to_csv(index=False)
        llm_sample_csv = case["llm_df"].head(10).to_csv(index=False)

        prompt = build_judge_prompt(
            user_query=case["user_query"],
            schema_ddl=case["schema_ddl"],
            expected_sql=case["expected_sql"],
            llm_sql=case["llm_sql"],
            exp_sample_csv=exp_sample_csv,
            llm_sample_csv=llm_sample_csv,
            exp_rows=len(case["expected_df"]),
            llm_rows=len(case["llm_df"]),
            exp_cols=list(case["expected_df"].columns),
            llm_cols=list(case["llm_df"].columns)
        )

        response = await acompletion(
            model=judge_model,
            messages=[{"role": "user", "content": prompt}]
        )
        # Parse the JSON safely (the prompt demands raw JSON, but guard anyway)
        try:
            review = json.loads(response.choices[0].message.content)
        except Exception as e:
            review = {
                "error": str(e),
                "raw_response": response.choices[0].message.content
            }

        result["judge_review"] = review
        return result

    # Bound concurrency with the semaphore, then run all cases in parallel
    async def run_limited(case: Dict[str, Any]) -> Dict[str, Any]:
        async with semaphore:
            return await process_one(case)

    return await asyncio.gather(*(run_limited(c) for c in test_cases))
```
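One practical note on that final `try/except`: despite the "Return ONLY valid JSON" instruction, models sometimes wrap their answer in markdown fences. A small defensive parser (a standalone sketch, not part of the framework above) keeps a batch run alive instead of flagging those responses as errors:

```python
# Hedged sketch: defensively parse judge output that may arrive
# wrapped in ```json fences despite the prompt's JSON-only instruction.
import json
import re

def parse_judge_json(raw: str) -> dict:
    """Try strict parsing first, then strip markdown fences, then fail gracefully."""
    stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    for candidate in (raw, stripped):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return {"error": "unparseable judge output", "raw_response": raw}

print(parse_judge_json('```json\n{"overall_verdict": "PARTIAL"}\n```'))
# → {'overall_verdict': 'PARTIAL'}
```

Swapping this in for the bare `json.loads` call is a one-line change and noticeably reduces spurious `error` entries in large batches.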
