DAYAN ELVIS JAHUIRA PILCO

Posted on Jul 5

SQL AI Database Solutions: Building a Safe Text-to-SQL App with Streamlit and Hugging Face

#ai #python #sql #llm

TL;DR: I built a working "talk to your database" app: you ask a question in plain English, an LLM (via the free Hugging Face Inference API) writes the SQL, a guardrail layer validates it, and SQLite returns the results in a Streamlit table. The part most tutorials skip — never executing raw LLM output — is the core of this article, and it's covered by 17 unit tests. Repo: github.com/Dayan-18/sql-ai-demo →

The promise and the problem

Text-to-SQL is one of the most useful applications of LLMs: most of the world's business data lives in relational databases, and most of the people who need answers from it don't write SQL. Modern models are good enough that "Which country has the highest total sales?" reliably becomes a correct JOIN + GROUP BY.

But there's a problem most tutorials ignore: an LLM's output is untrusted input. If you take whatever the model returns and hand it to your database, you've built a natural-language injection interface. A malicious (or just confused) prompt can produce DROP TABLE, UPDATE, or an ATTACH DATABASE pointing somewhere it shouldn't. So this demo has three layers, and the middle one is the point:

question ──▶ LLM (writes SQL) ──▶ GUARDRAILS (validates) ──▶ SQLite (executes) ──▶ table

The demo database

A small e-commerce SQLite database — customers, products, orders — seeded with realistic data. The key function extracts the schema as text, because showing the schema to the model is the single biggest accuracy factor in text-to-SQL:

def get_schema_text(conn):
    """Return the CREATE TABLE statements — this is what the LLM sees."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n\n".join(r[0] for r in rows)

The prompt

def build_prompt(schema: str, question: str) -> str:
    return f"""You are an expert SQLite analyst. Given this database schema:

{schema}

Write ONE SQLite SELECT query that answers this question:
"{question}"

Rules:
- Return ONLY the SQL, inside a ```
{% endraw %}
sql code block.
- Read-only: SELECT statements only.
- Use exact table and column names from the schema.
"""
{% raw %}

Note that we ask the model to be read-only — but we never rely on it. Politely asking an LLM to behave is a UX optimization, not a security control.

The guardrails (the actual security)

Three deterministic functions between the model and the database:


python
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|REPLACE|TRUNCATE|"
    r"ATTACH|DETACH|PRAGMA|VACUUM|GRANT|REVOKE)\b",
    re.IGNORECASE,
)


def extract_sql(llm_output: str) -> str:
    """LLMs love wrapping SQL in markdown fences and prose. Strip all of it."""
    match = re.search(r"

```(?:sql)?\s*(.*?)```

", llm_output, re.DOTALL)
    sql = match.group(1) if match else llm_output
    sql = sql.split(";")[0].strip()   # keep only the first statement
    return sql


def validate_sql(sql: str) -> str:
    """Allow exactly one read-only SELECT statement. Raise otherwise."""
    stripped = sql.strip().rstrip(";").strip()
    if not re.match(r"^\s*(SELECT|WITH)\b", stripped, re.IGNORECASE):
        raise UnsafeSQLError("Only SELECT queries are allowed")
    if FORBIDDEN.search(stripped):
        raise UnsafeSQLError("Query contains a forbidden keyword")
    if ";" in stripped:
        raise UnsafeSQLError("Multiple statements are not allowed")
    return stripped

Allow-list first (SELECT/WITH only), deny-list second (forbidden keywords), and single-statement enforcement to kill the classic SELECT 1; DROP TABLE chain. In production you'd add a read-only database connection and a row limit — defense in depth.

The app: 30 lines of Streamlit


python
question = st.text_input("Ask a question about the data:")

if question:
    raw = ask_llm(build_prompt(schema, question))
    sql = validate_sql(extract_sql(raw))
    st.code(sql, language="sql")
    df = pd.read_sql_query(sql, conn)
    st.dataframe(df)

The LLM call uses the Hugging Face Inference API (free token, no GPU needed), with temperature=0.1 because we want deterministic SQL, not creativity:


python
resp = requests.post(
    HF_URL,
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 300,
        "temperature": 0.1,
    },
    timeout=60,
)

Ask "Which country has the highest total sales?" and the app shows the generated SQL —


sql
SELECT c.country, SUM(p.price * o.quantity) AS total
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN products p ON p.id = o.product_id
GROUP BY c.country
ORDER BY total DESC
LIMIT 1

— and the result table (Peru, 1645.99, if you're curious). Ask it to "delete all customers" instead, and the guardrail answers: 🛡️ Blocked by guardrails: Only SELECT queries are allowed.

Testing what can be tested

You can't unit-test an LLM's creativity, but everything around it is deterministic and must be tested. The suite covers the database seed, the analytical queries, the fence-stripping, and — most importantly — the attack cases:


python
@pytest.mark.parametrize("evil", [
    "DROP TABLE customers",
    "DELETE FROM orders",
    "UPDATE products SET price = 0",
    "PRAGMA writable_schema = ON",
    "ATTACH DATABASE '/etc/passwd' AS pwn",
    "SELECT * FROM customers; DROP TABLE customers",
])
def test_rejects_dangerous_sql(evil):
    with pytest.raises(UnsafeSQLError):
        validate_sql(evil)


text
test_core.py::test_schema_has_three_tables PASSED                        [  5%]
test_core.py::test_seed_data_loaded PASSED                               [ 11%]
test_core.py::test_analytical_query_runs PASSED                          [ 17%]
test_core.py::test_extracts_sql_from_markdown_fence PASSED               [ 23%]
test_core.py::test_extracts_plain_sql_without_fence PASSED               [ 29%]
test_core.py::test_keeps_only_first_statement PASSED                     [ 35%]
test_core.py::test_accepts_select PASSED                                 [ 41%]
test_core.py::test_accepts_cte PASSED                                    [ 47%]
test_core.py::test_rejects_dangerous_sql[DROP TABLE customers] PASSED    [ 52%]
test_core.py::test_rejects_dangerous_sql[DELETE FROM orders] PASSED      [ 58%]
test_core.py::test_rejects_dangerous_sql[INSERT INTO customers...] PASSED [ 64%]
test_core.py::test_rejects_dangerous_sql[UPDATE products SET price = 0] PASSED [ 70%]
test_core.py::test_rejects_dangerous_sql[PRAGMA writable_schema = ON] PASSED [ 76%]
test_core.py::test_rejects_dangerous_sql[ATTACH DATABASE...] PASSED      [ 82%]
test_core.py::test_rejects_dangerous_sql[SELECT 1; DROP TABLE...] PASSED [ 88%]
test_core.py::test_rejects_empty PASSED                                  [ 94%]
test_core.py::test_prompt_contains_schema_and_question PASSED            [100%]

============================== 17 passed ==============================

These 17 tests run in GitHub Actions on every push — the same CI pattern from my previous articles. Note what's not tested: the LLM call itself. Network calls to a probabilistic model don't belong in unit tests — the guardrails exist precisely so correctness doesn't depend on trusting the model.

Where this goes in the real world

This 4-file demo scales conceptually to real solutions: swap SQLite for Postgres with a read-only role, add semantic caching so repeated questions don't hit the LLM, log every generated query for audit, and consider fine-tuned SQL models (like the ones on the Hugging Face text-to-SQL page) when latency or cost matter. The architecture — schema in, validated SQL out, execution sandboxed — stays the same.

Conclusion

Text-to-SQL is genuinely within reach of any developer today: a free HF token, Streamlit, and ~150 lines of Python. The difference between a demo and a solution is the layer in the middle — treat the LLM's output like user input, validate with an allow-list, test the attack cases, and let CI enforce it forever.

Full code: github.com/Dayan-18/sql-ai-demo

Would you let an LLM query your production database? Tell me in the comments! 👇

Previous articles in this series: SAST with Bandit · IaC scanning with Checkov · API testing with pytest · CI/CD tools compared

References: Text-to-SQL on Hugging Face · Streamlit docs · HF Inference API

Top comments (2)

Cristhian Carlos MAMANI CORI • Jul 5

Impressive work, Dayan! Here is my abstract and technical observation: This article builds a working "talk to your database" application — natural language in, LLM-generated SQL via the Hugging Face Inference API, validation through a guardrail layer, and execution on SQLite rendered in a Streamlit table. What sets it apart from typical text-to-SQL tutorials is treating the model's output as untrusted input: an allow-list permitting only SELECT/WITH, a forbidden-keyword check, a single-statement rule, and 17 unit tests including seven attack cases like DROP TABLE and PRAGMA. Important Observation: your distinction between asking the model to be read-only (a UX optimization) and enforcing it deterministically (a security control) is exactly the right mindset for LLM applications. For a production deployment I would add the defense-in-depth measures you mention — a read-only database role, row limits and query audit logging — plus protection against prompt injection through the question field itself. Excellent proof of concept with real security thinking.

Mudassir Khan • Jul 8

"politely asking an LLM to behave is a UX optimization, not a security control" — more tutorials should lead with that.

one prod limitation at scale: schema injection as LLM context. works great for 3 tables. at 200 tables the schema hits 4K tokens and the model starts hallucinating columns that almost exist. the fix we use: semantic search over a schema registry, inject only the 5 to 10 most relevant tables per query. same guardrail layer, smaller attack surface, faster.

semantic caching is worth building early. in most internal analytics tools, the top 20 questions account for ~80% of volume. hitting the LLM for those every time is cost and latency you do not need.

has anyone tried this with read replicas where schema might lag? curious whether you validate columns against live schema or trust the prompt.