Vivek Kumar

Posted on Jun 16

Building a Natural Language Query Interface for Your Database: A Developer's Blueprint

#sql #database #ai #tutorial

Every product team eventually hits the same wall. Marketing wants to know which signups came from last week's campaign. Support wants a list of customers on the Pro plan who opened a ticket in the last 48 hours. The founder wants MRR by cohort, broken down by acquisition channel, by yesterday at 9am.

And every time, someone pings the nearest engineer to write yet another one-off SQL query.

The dream is obvious: let people ask questions in plain English and get answers from the database. For years, that dream lived in expensive enterprise BI tools. In 2026, with capable LLMs available behind an API call, you can build a credible natural language query interface yourself — or use something like Draxlr that ships with AI-powered SQL generation already built in. If you want to understand how it works under the hood, or roll your own, this post walks through what such a system actually looks like: the architecture, the SQL plumbing, the prompt design, and the production failure modes you'll want to plan for before your first user types a question.

What "Natural Language to SQL" actually means

At its simplest, a text-to-SQL system is a function:

question (string) -> SQL query (string) -> result (rows)

The naive implementation is one prompt to GPT-style model: "Here's my schema, here's the question, write a SQL query." That works for toy demos and falls apart the moment you point it at a real database with 80 tables, 14 of which are named some variation of users.

A production-grade interface looks more like this:

question
   |
   v
[ schema retrieval ]  <-- pull only relevant tables
   |
   v
[ prompt assembly ]   <-- schema + examples + guardrails
   |
   v
[ SQL generation ]    <-- LLM call
   |
   v
[ validation ]        <-- parse, lint, dry-run
   |
   v
[ safe execution ]    <-- read-only role, row limits, timeouts
   |
   v
result + the SQL it ran (always show this)

Each of those stages is a thing you build. Let's walk through them.

Step 1: Schema retrieval (don't dump everything)

The single biggest accuracy lever is the schema context you feed the model. Dumping your entire schema into the prompt sounds tempting and is almost always wrong: it blows your context window, costs more, and — counterintuitively — makes the model less accurate because it has to pick the right tables out of a haystack.

The fix is retrieval. Treat your schema like a knowledge base, embed it, and pull only what's relevant to the question.

Start by capturing a clean description of every table:

SELECT
  c.table_name,
  c.column_name,
  c.data_type,
  pgd.description AS column_comment
FROM information_schema.columns c
LEFT JOIN pg_catalog.pg_statio_all_tables st
  ON st.schemaname = c.table_schema AND st.relname = c.table_name
LEFT JOIN pg_catalog.pg_description pgd
  ON pgd.objoid = st.relid AND pgd.objsubid = c.ordinal_position
WHERE c.table_schema = 'public'
ORDER BY c.table_name, c.ordinal_position;

For each table, build a small text document that the LLM can actually read:

Table: subscriptions
Description: One row per customer subscription. A user can have at most one
active subscription at a time. `status` is one of active, paused, cancelled.
Columns:
  - id              (uuid, PK)
  - user_id         (uuid, FK -> users.id)
  - plan_id         (uuid, FK -> plans.id)
  - status          (text)
  - mrr_cents       (integer) -- monthly recurring revenue in cents
  - started_at      (timestamptz)
  - cancelled_at    (timestamptz, nullable)
Sample rows:
  id=a1.., user_id=u3.., plan_id=p_pro, status=active, mrr_cents=4900

Embed each of those documents with any embedding model, store the vectors, and at query time embed the user's question and pull the top 5–10 most similar tables. This is plain RAG, applied to schema instead of documents.

The payoff is dramatic. A question like "how many users upgraded to Pro last month?" will retrieve users, subscriptions, and plans — and leave the 70 other tables out of the prompt entirely.

Step 2: Prompt assembly

Once you have the relevant tables, the prompt itself follows a predictable shape:

A system message defining the role and the dialect.
The retrieved schema documents.
A few worked examples (few-shot).
Hard rules ("read-only", "always LIMIT 1000", "use ISO dates").
The user's question.

Here's a stripped-down version in pseudocode:

def build_prompt(question, tables, examples):
    return f"""You are a PostgreSQL expert. Generate a single SELECT query
to answer the user's question. Rules:
- Use only the tables and columns shown below.
- Always include LIMIT 1000.
- Use ISO date literals (e.g. '2026-05-01').
- Never write INSERT, UPDATE, DELETE, DROP, or DDL.
- If the question is ambiguous, return a JSON object
  {{"clarify": "..."}} instead of SQL.

## Schema
{format_tables(tables)}

## Examples
{format_examples(examples)}

## Question
{question}

Return only the SQL, no explanation."""

The two underrated pieces here are the examples and the clarification escape hatch.

A handful of question/SQL pairs from your domain teaches the model your conventions — that mrr_cents is in cents, that "active users" means status = 'active' AND last_seen_at > now() - interval '30 days', that you always join on tenant_id. Three good examples often beat a thousand words of instruction.

The clarification hatch is the difference between a tool that hallucinates confidently and one that admits when a question is too vague. Ambiguous natural language is the #1 source of bad SQL, and giving the model a way to ask back is far better than letting it guess.

Step 3: Validation before execution

Never run an LLM-generated query straight against your database. There are three layers worth wiring up.

Parse it. Run the SQL through a parser like sqlglot or pgsql-parser. If it doesn't parse, you have a clean signal to either retry or report the error — no need to wait for the database to reject it.

Lint it. Walk the parsed AST and reject anything that isn't a SELECT. Reject queries that reference tables outside your allowed list. Reject queries without a LIMIT. This is your defence against a creative model that decides DELETE FROM users is a reasonable answer.

from sqlglot import parse_one, exp

def is_safe(sql, allowed_tables):
    tree = parse_one(sql, read="postgres")
    if not isinstance(tree, exp.Select):
        return False, "only SELECT allowed"
    for t in tree.find_all(exp.Table):
        if t.name not in allowed_tables:
            return False, f"table {t.name} not allowed"
    if not tree.args.get("limit"):
        return False, "missing LIMIT clause"
    return True, None

Dry-run it. PostgreSQL has EXPLAIN. Run the query under EXPLAIN (not EXPLAIN ANALYZE — that executes it) to confirm the planner accepts it. If EXPLAIN returns an estimated cost above some threshold, refuse to run it or warn the user.

Step 4: Safe execution

The query has parsed, linted, and been planned. Now run it — but not as your application's regular DB user.

Create a dedicated read-only role with the minimum privileges needed:

CREATE ROLE nl_query_runner LOGIN PASSWORD '...';

REVOKE ALL ON ALL TABLES IN SCHEMA public FROM nl_query_runner;
GRANT SELECT ON users, subscriptions, plans, events TO nl_query_runner;

ALTER ROLE nl_query_runner SET statement_timeout = '10s';
ALTER ROLE nl_query_runner SET default_transaction_read_only = on;

Statement timeouts catch runaway queries. default_transaction_read_only is belt-and-braces protection in case your lint layer ever has a bug. Granting SELECT only on the specific tables you've exposed means even a perfectly-crafted injection can't touch your secrets table.

For multi-tenant apps, layer Postgres row-level security on top, so even a query like SELECT * FROM subscriptions only sees the calling tenant's rows. (I wrote about this in a previous post on row-level security for embedded dashboards.)

A realistic end-to-end example

Imagine a SaaS analytics app where a user types:

"What was our MRR from Pro customers in April?"

Here's what happens:

Retrieval pulls subscriptions, plans, and users.
Prompt assembles those tables plus three example queries.
Generation produces:

   SELECT SUM(s.mrr_cents) / 100.0 AS mrr_dollars
   FROM subscriptions s
   JOIN plans p ON p.id = s.plan_id
   WHERE p.name = 'Pro'
     AND s.status = 'active'
     AND s.started_at <= '2026-04-30'
     AND (s.cancelled_at IS NULL OR s.cancelled_at > '2026-04-30')
   LIMIT 1000;

Validation parses cleanly, only references allowed tables, has a LIMIT.
Execution runs under the read-only role and returns:

mrr_dollars
48,372.00

The UI shows the answer and the SQL that produced it. Always show the SQL. Users learn to trust the system faster when they can see its work, and your power users will start tweaking the SQL directly.

Common mistakes and gotchas

A few traps you'll hit if you don't plan for them:

Trusting the model on dates. LLMs are weirdly bad at "last week" vs. "the last 7 days" vs. "the previous calendar week". Resolve relative time expressions in your code before generating SQL, and inject explicit dates into the prompt.

Ambiguous column names. If you have users.created_at and subscriptions.created_at, a question like "how many created this month?" is genuinely ambiguous. Detect this and fall back to the clarification hatch instead of guessing.

Pre-aggregated columns. If you have a daily_metrics rollup table, the model may not know whether to query it or recompute from raw events. Document this explicitly in the table description: "Prefer this table for date-based aggregations; events table only for event-level analysis."

Joins that explode. A model can produce a cartesian join with a missing condition and pull back a billion rows. The EXPLAIN cost check and the statement_timeout are your safety net.

Currency, units, and time zones. mrr_cents is not mrr_dollars. started_at might be UTC; the user's "April" might mean Pacific Time. Encode these in your schema documentation and your few-shot examples, or expect surprising answers.

Showing only the answer, not the SQL. This destroys trust. The first time a user gets a number that looks off and has no way to inspect it, they'll stop using your tool. Always show the query.

Key takeaways

A practical natural-language-to-SQL interface is not a single LLM call. It's a small pipeline: retrieve the relevant schema, assemble a tight prompt with examples and rules, generate the SQL, validate it before it ever touches the database, and run it under a tightly-scoped role. The LLM is the interesting part, but the boring infrastructure around it — schema retrieval, parsing, linting, read-only roles, timeouts — is what separates a demo from a product.

Build for the failure cases from day one. Add a clarification path for ambiguous questions. Always show the generated SQL. Log every question and query so you can mine them for new few-shot examples. The systems that work in production are the ones that treat the LLM as a junior analyst whose output always gets reviewed, not as an oracle.

Over to you

Have you built a natural language interface for your own database, or are you using one in a product? What broke first when real users started typing into it? Drop your war stories — and the tools you reached for — in the comments. I'm especially curious which retrieval strategies have worked for people with very large or very messy schemas.

DEV Community