Prompting AI for Complex Multi-Table SQL: A Practical Guide

#sql #database #ai #tutorial

You paste your question into an AI tool: "Show me revenue by plan for customers who churned last quarter." The model returns a query. It looks plausible. You run it. Cartesian product. Forty million rows. Your database grinds to a halt.

The problem usually isn't the AI — it's the prompt. Multi-table SQL generation is where generic prompting falls apart. A single-table SELECT is forgiving; the model can guess. But the moment you need three tables, two JOINs, a date filter, and a GROUP BY, the model needs to know your schema precisely. Without that, it invents column names, picks the wrong join key, or misses a relationship entirely.

This guide covers the specific techniques that get AI to produce accurate, runnable multi-table SQL — the same results you'd expect from a purpose-built tool like Draxlr, but applicable wherever you're prompting an AI today.

Why Multi-Table Queries Break AI Prompts

A large language model generating SQL faces a core challenge: it doesn't know your schema unless you tell it. For simple queries it can make educated guesses — SELECT * FROM users WHERE id = 1 is hard to get wrong. But for multi-table queries, the model needs to know:

Which tables exist and what each column means
Which columns are foreign keys and what they reference
What "revenue" means in your system (is it in orders.total, invoices.amount_due, or subscription_events.mrr?)
Whether joins should be INNER or LEFT (which affects whether churned customers appear at all)

When this context is missing, the model fills the gaps with plausible-sounding guesses. That's how you get JOIN customers ON customers.id = orders.customer_id when your actual foreign key is orders.account_uuid.

Technique 1: Inject the Full Schema for Relevant Tables

The single biggest improvement you can make is including CREATE TABLE statements (or equivalent schema definitions) directly in your prompt. Don't summarize — paste the actual DDL.

Weak prompt:

I have users, orders, and subscriptions tables.
Write a query that shows revenue by plan for churned customers.

Strong prompt:

-- Schema context:
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email TEXT NOT NULL,
  created_at TIMESTAMPTZ
);

CREATE TABLE subscriptions (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  plan_name TEXT,         -- 'starter', 'growth', 'enterprise'
  status TEXT,            -- 'active', 'canceled', 'paused'
  canceled_at TIMESTAMPTZ,
  mrr_cents INTEGER       -- monthly recurring revenue in cents
);

CREATE TABLE invoices (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  subscription_id UUID REFERENCES subscriptions(id),
  amount_cents INTEGER,
  paid_at TIMESTAMPTZ,
  period_start DATE,
  period_end DATE
);

-- Question: Show total revenue collected last quarter, grouped by plan,
-- for subscriptions that were canceled during that same quarter.

With this prompt, the model knows the join keys, the status values, where revenue lives, and what "churned" means. The output will be far more accurate.

Key rules for schema injection:

Include all tables involved in the query, even indirectly
Keep column comments — they're gold for business terminology
Include REFERENCES clauses so the model sees foreign key relationships explicitly
If you have an ENUM or constrained set of values, list them

Technique 2: Name the Join Path Explicitly

Even with full schema context, the model may choose an incorrect join path when multiple routes exist between tables. If invoices can be joined to users either directly via invoices.user_id or indirectly through subscriptions, tell the model which path to use.

Join invoices to subscriptions using invoices.subscription_id = subscriptions.id,
then join subscriptions to users using subscriptions.user_id = users.id.
Do NOT join invoices directly to users.

This eliminates the ambiguity that causes fan-out bugs — where joining along two paths simultaneously multiplies row counts unexpectedly.

Technique 3: Use Chain-of-Thought for Complex Logic

For queries involving subqueries, window functions, or multi-step aggregation, ask the model to reason before writing SQL. Chain-of-thought prompting produces dramatically fewer logic errors on hard queries.

Add this to your prompt:

Before writing the query, think through:
1. Which tables are needed and how they join
2. What filtering needs to happen before aggregation vs. after
3. Whether to use a subquery, CTE, or window function
Then write the final SQL.

The model's reasoning output often catches problems that would otherwise end up as bugs. For example, it might note: "Filtering on canceled_at in the WHERE clause will exclude nulls for active subscriptions — I should use HAVING or a subquery instead." That's the kind of insight that prevents incorrect results.

Here's what a well-structured CTE-based output looks like for a revenue-by-plan query:

-- Step 1: Identify subscriptions canceled in Q1 2026
WITH churned_subs AS (
  SELECT
    id,
    user_id,
    plan_name,
    canceled_at
  FROM subscriptions
  WHERE status = 'canceled'
    AND canceled_at >= '2026-01-01'
    AND canceled_at < '2026-04-01'
),

-- Step 2: Sum invoices paid during the same period for those subscriptions
revenue_by_sub AS (
  SELECT
    cs.plan_name,
    SUM(i.amount_cents) AS total_revenue_cents
  FROM churned_subs cs
  JOIN invoices i
    ON i.subscription_id = cs.id
  WHERE i.paid_at >= '2026-01-01'
    AND i.paid_at < '2026-04-01'
  GROUP BY cs.plan_name
)

SELECT
  plan_name,
  ROUND(total_revenue_cents / 100.0, 2) AS total_revenue_usd
FROM revenue_by_sub
ORDER BY total_revenue_usd DESC;

The CTE structure is a direct result of chain-of-thought reasoning — the model decomposed the problem into steps before writing a single line of SQL.

Technique 4: Provide a Few-Shot Example

If you repeatedly query the same database, including one working example in your prompt dramatically improves accuracy. Show the model a question it already got right, plus the SQL that produced the correct result.

Example:
Question: "How many active subscriptions does each user have?"
SQL:
  SELECT u.email, COUNT(s.id) AS active_subs
  FROM users u
  LEFT JOIN subscriptions s
    ON s.user_id = u.id AND s.status = 'active'
  GROUP BY u.email;

Now answer: "Show the most recent invoice date for each active subscriber."

The example teaches the model your naming conventions, your preferred JOIN style, and which table is the "primary" source of truth — without you having to explain it again.

Technique 5: Constrain the Output Format

AI tools often produce multiple variants, verbose explanations, or queries with placeholder values. If you're feeding output directly into an application or testing pipeline, add explicit output constraints:

Return only the SQL query. No explanations. No markdown fences.
Use aliases for all column names. Use CTEs instead of nested subqueries.
Do not use SELECT *.

This makes the output machine-readable and enforces the style your team has standardized on.

Common Mistakes and How to Avoid Them

Forgetting NULL semantics on LEFT JOINs. Ask the model to use LEFT JOIN when you want to include rows with no match (e.g., users with zero invoices). If you use INNER JOIN, those rows silently disappear.

Ambiguous column names across tables. If both users and subscriptions have a created_at, the model may reference the wrong one. Qualify everything: users.created_at, subscriptions.created_at.

Business terms without definitions. "Active customers" means different things in different systems. Does it mean status = 'active'? Logged in within 30 days? Has a paid invoice? Define it in the prompt: "Active means the subscription has status = 'active' and canceled_at IS NULL."

Assuming the model remembers prior context. Each new prompt is (usually) stateless. Re-inject your schema and any previously-established definitions every time.

Key Takeaways

Getting accurate multi-table SQL from AI isn't about picking the right tool — it's about giving any tool enough context to reason correctly. The techniques that matter most:

Paste your CREATE TABLE statements, including foreign keys and column comments
Name the join path explicitly when multiple routes exist
Use chain-of-thought prompting for queries with subqueries or complex aggregation
Include one working example query to anchor naming conventions
Constrain the output format for consistent, machine-readable results

The more context you give, the less the model has to guess — and guessing is where multi-table SQL generation goes wrong.

What's the most complex multi-table query you've successfully generated with AI? Did you find a prompting technique that made a big difference? Drop it in the comments — I'd love to see what's working in production.