DIEGO FABRIZIO ANDIA NAVARRO

Posted on Jul 1

QL AI Database Solutions: Building a Text-to-SQL Pipeline You Can Actually Run

#ai #sql #python #database

Abstract

"Talk to your database" tools — from freeCodeCamp's SQL query extractor tutorials to Hugging Face's smolagents Text-to-SQL examples to production frameworks like Vanna.AI — all share the same core architecture: describe the schema, hand a natural-language question and that schema to an LLM, get back SQL, execute it, return results. This article builds that pipeline from scratch against a real SQLite database for a small coffee-subscription store, with a pluggable LLM client so the same code path works whether you're calling Claude in production or running the demo with no API key at all. Four natural-language questions were asked against the live database and produced real, verified SQL and real result sets — not illustrative snippets.

The architecture behind every "AI SQL" tool

Every Text-to-SQL system, regardless of vendor, follows the same four steps:

Schema introspection — read the actual table and column names from the database, because the LLM can't guess a schema it's never seen.
Prompt construction — combine the schema description with the user's question into a single instruction.
SQL generation — the LLM returns a SQL string (not an answer — the SQL itself).
Execution — the generated SQL runs against the real database, and the result set is what gets shown to the user.

Production tools like Vanna.AI add a retrieval layer on top — they train on your DDL, documentation, and past queries so the LLM sees relevant examples rather than the entire schema on every call — but the four-step core is the same one built below.

The real-world example: a coffee-store database

A SQLite database with four tables: customers, products, orders, and order_items — small enough to read in full, realistic enough to need actual joins and aggregation.

# schema.py
def describe_schema(db_path: str) -> str:
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = [row[0] for row in cur.fetchall()]

    lines = []
    for table in tables:
        cur.execute(f"PRAGMA table_info({table})")
        columns = [f"{col[1]} {col[2]}" for col in cur.fetchall()]
        lines.append(f"{table}({', '.join(columns)})")
    return "\n".join(lines)

Real output:

customers(id INTEGER, name TEXT, tier TEXT)
products(id INTEGER, name TEXT, price REAL)
orders(id INTEGER, customer_id INTEGER, created_at TEXT)
order_items(id INTEGER, order_id INTEGER, product_id INTEGER, quantity INTEGER)

That's exactly the kind of compact schema description that gets dropped into an LLM prompt.

A pluggable LLM client — so the pipeline is testable without API costs

The interesting engineering decision isn't the prompt — it's keeping the LLM call behind an interface, the same way you'd mock any external dependency:

# llm_client.py
class SQLClient(ABC):
    @abstractmethod
    def generate_sql(self, question: str, schema: str) -> str:
        """Return a SQL query string for the given natural-language question."""


class AnthropicSQLClient(SQLClient):
    """Production client: translates the question using Claude."""

    def __init__(self, model: str = "claude-sonnet-4-6"):
        import anthropic
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env
        self.model = model

    def generate_sql(self, question: str, schema: str) -> str:
        prompt = f"""You are a SQL generator. Given this schema:

{schema}

Write a single SQLite query that answers this question. Return ONLY the SQL.

Question: {question}"""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text.strip()

AnthropicSQLClient is the real production path. To keep this article's demo runnable by anyone — no API key, no cost — a second implementation, LocalSQLClient, satisfies the exact same interface using schema-grounded pattern matching instead of a hosted model. Both plug into the identical pipeline below; switching from demo to production is a one-line change at the call site.

The pipeline: question in, real results out

# ask.py
def ask(question: str, client: SQLClient, db_path: str = "store.db"):
    schema = describe_schema(db_path)
    sql = client.generate_sql(question, schema)

    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(sql)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    conn.close()

    return sql, columns, rows

Running four real questions against the live database:

Q: How many gold customers do we have?
SQL: SELECT COUNT(*) FROM customers WHERE tier = 'gold'
Columns: ['COUNT(*)']
  (2,)

Q: What are the top 3 products by revenue?
SQL: SELECT p.name, SUM(oi.quantity * p.price) AS revenue
FROM order_items oi
JOIN products p ON p.id = oi.product_id
GROUP BY p.name
ORDER BY revenue DESC
LIMIT 3
Columns: ['name', 'revenue']
  ('Pour-over Kit', 64.0)
  ('Espresso Beans 1kg', 55.5)
  ('Cold Brew Concentrate', 42.599999999999994)

Q: What orders has Ana Torres placed?
SQL: SELECT o.id, o.created_at
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.name = 'Ana Torres'
Columns: ['id', 'created_at']
  (1, '2026-05-02')
  (2, '2026-05-20')

Q: What is the total revenue?
SQL: SELECT SUM(oi.quantity * p.price) AS total_revenue
FROM order_items oi
JOIN products p ON p.id = oi.product_id
Columns: ['total_revenue']
  (191.79999999999998,)

Every one of those numbers is checkable against the seed data by hand — that's the actual test of whether a Text-to-SQL system works: not whether the SQL looks plausible, but whether it returns the right rows.

Testing the pipeline like any other code path

Because the SQL generation step sits behind an interface, the rest of the system is fully testable without touching a real LLM:

def test_top_products_by_revenue():
    sql, columns, rows = ask("What are the top 3 products by revenue?", client)
    assert len(rows) == 3
    assert rows[0][0] == "Pour-over Kit"

def test_unmatched_question_raises():
    with pytest.raises(ValueError):
        ask("What's the weather like today?", client)

Real run:

test_ask.py::test_gold_customer_count PASSED
test_ask.py::test_top_products_by_revenue PASSED
test_ask.py::test_orders_for_named_customer PASSED
test_ask.py::test_total_revenue PASSED
test_ask.py::test_unmatched_question_raises PASSED

5 passed in 0.02s

The last test matters as much as the others: a Text-to-SQL system needs a defined behavior for the question it can't answer, not a guess that produces a confidently wrong query.

What production systems add on top of this

The pipeline above is honest about being a minimal core, not a finished product. What tools like Vanna.AI, the smolagents Text-to-SQL example, and the freeCodeCamp tutorial referenced below add:

Retrieval-augmented prompting — instead of stuffing the entire schema into every prompt, retrieve only the tables and example queries relevant to the question, which matters once a database has hundreds of tables.
Query validation — checking the generated SQL is read-only and references real tables before execution, since an LLM can hallucinate a column name.
Result explanation — a second LLM call that turns the raw rows back into a natural-language answer.
Conversation memory — letting a follow-up question like "now just the gold ones" refer back to the previous query's filters.

Conclusion

The core of every "AI SQL" product is small enough to build and verify in an afternoon: introspect the schema, prompt an LLM with it, execute what comes back, test the whole thing with a fake model standing in for the real one. The part that's genuinely hard — and where tools like Vanna.AI earn their complexity — is making that core reliable across hundreds of tables and ambiguous questions. But understanding the four-step skeleton first is what makes evaluating (or building on top of) any of those production tools possible.

Top comments (1)

CLARA QUISPE CHILENO • Jul 4

Interesting breakdown, keeping the LLM behind a SQLClient interface so the whole pipeline can be tested without an API key is a solid decision that a lot of skip entirely.
One important detail is that ask() executes whatever SQL generate_sql() returns with no check in between. This works fine for LocalSQLClient against a throwaway SQLite file, but the moment AnthropicSQLClient points at a real, writable database, nothing stops a well-crafted question from making the model return a DROP TABLE or DELETE instead of a SELECT. The pipeline as written would run it without issue, since it only checks that a string came back, not what kind of SQL it is.
I don't think this requires modifying the rest of the architecture — just one small step inserted right before execution, in the same "pluggable, testable" spirit as the SQLClient interface, for example:

def is_safe_select(sql: str) -> bool:
    parsed = sqlparse.parse(sql)
    if len(parsed) != 1 or parsed[0].get_type() != "SELECT":
        return False
    return not any(kw in sql.upper() for kw in ["DROP", "DELETE", "UPDATE", "INSERT", "ALTER"])

sql = client.generate_sql(question, schema)
if not is_safe_select(sql):
    raise ValueError("Generated SQL query failed the read-only safety check")

Something like this could be integrated alongside describe_schema() and ask() without disrupting the modular design.