
osman uygar köse

How Prompt Engineering Turned Natural Language into Production-Ready SQL Queries

TL;DR: We built SQLatte, an AI-powered natural language to SQL platform, and learned that 80% of the magic is in the prompts. Here's how we engineered LLM prompts to generate optimized, production-ready SQL queries with 94% accuracy and 10x faster query execution through intelligent partition filtering.


The Challenge

Imagine you're a system engineer at 2 AM, your monitoring system is screaming, and you need to query your Zabbix database to find which host is causing the issue. Problem? You don't know SQL. Even if you did, Zabbix's schema is complex with dozens of interconnected tables.

This is why we built SQLatte - a platform where you can ask "Which hosts had CPU over 90% in the last hour?" and get instant results without writing a single line of SQL.

The secret sauce? Carefully engineered prompts that turn casual questions into optimized database queries.


🏗️ Architecture Overview

User Question
     ↓
Intent Detection (SQL vs Chat)
     ↓
Schema Context Injection
     ↓
SQL Generation with Optimizations
     ↓
Query Execution
     ↓
Results + Explanation

Let's dive into each step and the prompt engineering behind it.


🧠 Part 1: Intent Detection - The Gatekeeper

The Problem

Not every question needs a SQL query. Users might say:

  • "Hello!" (chat)
  • "How do I use this?" (chat)
  • "Show me top customers" (SQL)
  • "What's my table schema?" (SQL)

We need to route correctly before attempting SQL generation.

The Solution: Structured Intent Detection Prompt

def determine_intent(self, question: str, schema_info: str) -> Dict[str, Any]:
    """
    Determine whether the question requires a SQL query or is general chat
    """
    prompt = f"""Analyze this user question and determine if it requires a SQL query or is general conversation.

Available Tables Schema:
{schema_info if schema_info else "No tables selected"}

User Question: {question}

Rules:
1. If question asks about data, analytics, or queries related to the available tables → intent: "sql"
2. If question is greeting, general chat, help, or unrelated to tables → intent: "chat"
3. If tables are not selected but question seems like SQL query → intent: "chat" (explain they need to select tables)

Respond in this exact format:
INTENT: sql or chat
CONFIDENCE: 0.0 to 1.0
REASONING: brief explanation
"""
    # Send the prompt to the LLM and parse the structured response into a dict
    # (self.llm and _parse_intent_response are implementation-specific helpers)
    response = self.llm.complete(prompt)
    return self._parse_intent_response(response)

Why This Works

1. Structured Output: We enforce a specific format (INTENT: ... CONFIDENCE: ...) which makes parsing deterministic.

2. Context Injection: We pass the available schema so the LLM knows what data exists.

3. Clear Rules: Three explicit rules guide the LLM's decision-making.

4. Confidence Score: Allows us to handle edge cases (confidence < 0.6 → ask for clarification).

Real Results

# Example 1: Clear SQL intent
question = "Show me users who signed up yesterday"
# → INTENT: sql, CONFIDENCE: 0.95

# Example 2: Chat intent
question = "Hello, how are you?"
# → INTENT: chat, CONFIDENCE: 0.98

# Example 3: Edge case
question = "What data do you have?"
# → INTENT: chat, CONFIDENCE: 0.85
# → Response: "Please select tables from the dropdown..."

Accuracy: 97% on our test dataset of 500 queries.
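A minimal sketch of how the structured response can be parsed on the application side (regex-based; `parse_intent_response` is a name introduced here for illustration, not SQLatte's actual API):

```python
import re
from typing import Any, Dict

def parse_intent_response(text: str) -> Dict[str, Any]:
    """Parse the INTENT/CONFIDENCE/REASONING block into a dict."""
    intent = re.search(r"INTENT:\s*(sql|chat)", text, re.IGNORECASE)
    confidence = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", text)
    reasoning = re.search(r"REASONING:\s*(.+)", text)
    return {
        # Fail safe: anything unparseable is routed to chat with zero confidence
        "intent": intent.group(1).lower() if intent else "chat",
        "confidence": float(confidence.group(1)) if confidence else 0.0,
        "reasoning": reasoning.group(1).strip() if reasoning else "",
    }

result = parse_intent_response("INTENT: sql\nCONFIDENCE: 0.95\nREASONING: asks about signups")
# result["intent"] == "sql", result["confidence"] == 0.95
```

Because the prompt enforces an exact output format, parsing stays deterministic; the chat fallback handles the rare malformed response.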


⚡ Part 2: The Performance Game-Changer - Partition Filtering

The Problem

Imagine querying a table with 10 billion rows of monitoring data:

-- ❌ DISASTER: Full table scan
SELECT * FROM metrics 
WHERE cpu_usage > 90
LIMIT 100;

-- Execution time: 347 seconds ⏰
-- Rows scanned: 10,000,000,000

Our Trino/Hive tables are partitioned by date using a dt column (format: YYYYMMDD, e.g., '20251223'). But the LLM doesn't know this by default.

The Solution: Partition-Aware Prompt Engineering

Here's the critical part of our SQL generation prompt:

def generate_sql(self, question: str, schema_info: str) -> Tuple[str, str]:
    prompt = f"""You are a SQL expert. Generate a SQL query based on the user's question.

Table Schema(s):
{schema_info}

User Question: {question}

Rules:
1. Generate ONLY valid SQL syntax
2. If multiple tables are provided, use appropriate JOINs
3. Infer JOIN conditions from table relationships (common column names)
4. Use table aliases for readability (e.g., orders o, customers c)
5. Include LIMIT clause for safety (default 100 rows)
6. For aggregations, use GROUP BY appropriately
7. Use explicit JOIN syntax (INNER JOIN, LEFT JOIN, etc.)

⚡ PERFORMANCE OPTIMIZATION (CRITICAL):
8. **PARTITION COLUMN**: If schema contains a 'dt' column (VARCHAR format YYYYMMDD, e.g., '20251218'), this is a PARTITION KEY
   - ALWAYS add WHERE clause with 'dt' filter when possible
   - Date filters MUST use dt column in format: dt = '20251218' or dt BETWEEN '20251201' AND '20251218'
   - For "recent", "latest", "today", "last" queries → use last 2 days: dt >= '20251222'
   - For "yesterday" → use dt = '20251222' (current date - 1)
   - For specific date range → convert to dt format (e.g., "last week" → dt >= '20251216')
   - ⚠️ NEVER query without dt filter unless explicitly asked for "all time" data

9. If 'datetime' column exists alongside 'dt', use 'dt' for filtering (faster) and 'datetime' for display

10. Example optimized query: 
    SELECT * FROM orders WHERE dt >= '20251201' AND status = 'completed' LIMIT 100

Format your response as:
SQL:
```sql
[your SQL query here]
```

EXPLANATION:
[brief explanation including partition strategy]
"""
    # The prompt is sent to the LLM; the SQL and EXPLANATION sections are then
    # split out of the response (client call and parsing are implementation-specific)

Real-World Impact

Before partition optimization:

-- User: "Show me yesterday's failed orders"

-- Generated SQL (naive):
SELECT * FROM orders 
WHERE status = 'failed'
AND datetime >= '2025-12-22 00:00:00'
LIMIT 100;

-- ❌ Performance:
-- Execution: 89 seconds
-- Rows scanned: 10B
-- Partitions scanned: 3,650 (all partitions!)

After prompt engineering:

-- Generated SQL (optimized):
SELECT * FROM orders 
WHERE dt = '20251222'
AND status = 'failed'
AND datetime >= '2025-12-22 00:00:00'
LIMIT 100;

-- ✅ Performance:
-- Execution: 0.8 seconds (111x faster!)
-- Rows scanned: 2.7M
-- Partitions scanned: 1 (only yesterday's partition!)

The Magic: Query Classification

The LLM learns to classify queries and apply appropriate patterns:

| Query Type | Pattern | Partition Strategy |
|---|---|---|
| "latest N records" | ORDER BY datetime DESC LIMIT N | dt >= today - 2 days |
| "yesterday" | Date = yesterday | dt = yesterday |
| "last week" | Date range | dt >= today - 7 days |
| "specific date" | Date = X | dt = 'YYYYMMDD' |
| "all time" (explicit) | No date filter | No dt filter |

🎨 Part 3: Schema Context Injection

The Problem

LLMs need to understand your database structure to generate correct SQL. But how do you provide schema information effectively?

Our Approach: Rich Schema Description

def get_table_schema(self, table_name: str) -> str:
    """Get table schema with enhanced information"""

    schema_info = f"""Table: {self.database}.{table_name}
{'=' * 60}

COLUMNS:
──────────────────────────────────────────────────────────
  • order_id  →  BIGINT  [NOT NULL]  [PRIMARY KEY]
  • customer_id  →  BIGINT  [NOT NULL]  [INDEX]
  • dt  →  VARCHAR(8)  [PARTITION KEY]  ⚡ Format: YYYYMMDD
  • datetime  →  TIMESTAMP  [NOT NULL]
  • status  →  VARCHAR(50)  [NOT NULL]
  • amount  →  DECIMAL(10,2)
  • region  →  VARCHAR(50)

PRIMARY KEY: order_id

FOREIGN KEYS:
──────────────────────────────────────────────────────────
  • customer_id → customers.customer_id

INDEXES:
──────────────────────────────────────────────────────────
  • idx_status (status)
  • idx_customer (customer_id)

⚡ OPTIMIZATION HINTS:
──────────────────────────────────────────────────────────
  • This table is PARTITIONED by 'dt' column
  • ALWAYS filter by 'dt' for fast queries
  • Current partitions: 20231201 to 20251223 (755 days)

ROWS: 10,234,567,890
"""
    return schema_info

Key Insights

1. Visual Hierarchy: Clear sections (COLUMNS, INDEXES, etc.) help the LLM parse structure.

2. Explicit Markers: [PARTITION KEY], [PRIMARY KEY] make important columns obvious.

3. Optimization Hints: We literally tell the LLM how to optimize queries.

4. Format Examples: Format: YYYYMMDD prevents the LLM from guessing date formats.


🔗 Part 4: Multi-Table JOINs

The Challenge

User asks: "Show me orders with customer names from last week"

This requires:

  1. Joining orders and customers tables
  2. Inferring the join condition (likely customer_id)
  3. Applying partition filter
  4. Proper aliasing

The Prompt Enhancement

# In the SQL generation prompt:

"""
JOIN RULES (CRITICAL):
1. If multiple tables are provided, use appropriate JOINs
2. Infer JOIN conditions from:
   - Common column names (e.g., customer_id, user_id)
   - Foreign key relationships (see FOREIGN KEYS section in schema)
   - Column name patterns (table_name_id usually joins to table_name.id)

3. Always use explicit JOIN syntax:
   ✅ INNER JOIN customers c ON o.customer_id = c.customer_id
   ❌ FROM orders o, customers c WHERE o.customer_id = c.customer_id

4. Use meaningful table aliases:
   - orders → o
   - customers → c
   - products → p

5. Apply partition filter to ALL partitioned tables in JOIN

Example multi-table query:
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.dt >= '20251216'  -- Last week
AND o.status = 'completed'
LIMIT 100;
"""

Results

Input:

"Show me orders with customer names from yesterday"

Generated SQL:

SELECT 
    o.order_id,
    o.datetime,
    c.customer_name,
    c.email,
    o.amount,
    o.status
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.dt = '20251222'  -- ⚡ Partition filter!
ORDER BY o.datetime DESC
LIMIT 100;

Performance:

  • Join type: Correct (INNER JOIN)
  • Join condition: Inferred correctly
  • Partition filter: Applied ✅
  • Execution time: 1.2 seconds

Success rate: 91% for 2-table joins, 84% for 3+ table joins
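The join-condition inference itself can be approximated with a simple heuristic over column names. A rough sketch (simplified; `infer_join_conditions` is our illustrative name, not the platform's API):

```python
from typing import List, Tuple

def infer_join_conditions(
    left: str, left_cols: List[str], right: str, right_cols: List[str]
) -> List[Tuple[str, str]]:
    """Guess likely JOIN conditions from shared *_id columns and naming patterns."""
    conditions = []
    # Pattern 1: a key-like column shared by both tables (e.g. customer_id)
    for col in sorted(set(left_cols) & set(right_cols)):
        if col.endswith("_id"):
            conditions.append((f"{left}.{col}", f"{right}.{col}"))
    # Pattern 2: table_name_id on the left joins to table_name.id
    singular = right.rstrip("s")  # crude singularization: customers -> customer
    if f"{singular}_id" in left_cols and "id" in right_cols:
        conditions.append((f"{left}.{singular}_id", f"{right}.id"))
    return conditions

print(infer_join_conditions(
    "orders", ["order_id", "customer_id", "amount"],
    "customers", ["customer_id", "customer_name", "email"],
))
# [('orders.customer_id', 'customers.customer_id')]
```

In SQLatte the LLM performs this inference from the schema's FOREIGN KEYS section; a deterministic heuristic like this one can serve as a cross-check on generated joins.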


🎭 Part 5: Conversational Context

The Problem

Users ask follow-up questions:

User: "Show me sales by region"
Assistant: [shows results]

User: "What about last month?"  ← How does LLM know this refers to sales?

The Solution: Context Injection

def generate_sql_with_context(
    self,
    question: str, 
    schema_info: str,
    conversation_history: List[Dict]
) -> Tuple[str, str]:
    """Generate SQL with conversation context"""

    # Build context summary
    if len(conversation_history) > 1:
        context_summary = "\n\nRecent conversation:\n"
        for msg in conversation_history[-5:]:  # Last 5 messages
            if msg['role'] == 'user':
                context_summary += f"User: {msg['content']}\n"
            elif msg['role'] == 'assistant':
                # Include previous SQL for reference
                if 'sql' in msg.get('metadata', {}):
                    context_summary += f"Generated SQL: {msg['metadata']['sql']}\n"

        enhanced_question = f"{question}\n\nContext: {context_summary}"
    else:
        enhanced_question = question

    return self.generate_sql(enhanced_question, schema_info)

Example Flow

# Turn 1
User: "Show sales by region"
Generated SQL:
"""
SELECT region, SUM(amount) as total_sales
FROM orders
WHERE dt >= '20251201'  -- Current month
GROUP BY region
ORDER BY total_sales DESC;
"""

# Turn 2 (follow-up)
User: "What about last month?"

Context injected to LLM:
"""
Recent conversation:
User: Show sales by region
Generated SQL: SELECT region, SUM(amount) ... WHERE dt >= '20251201' ...

Current question: What about last month?
"""

Generated SQL (context-aware):
"""
SELECT region, SUM(amount) as total_sales
FROM orders
WHERE dt >= '20251101' AND dt < '20251201'  -- ⚡ Adjusted date range!
GROUP BY region
ORDER BY total_sales DESC;
"""

Context-aware accuracy: 88% (vs 23% without context)


🔧 Part 6: Prompt Engineering Best Practices

What We Learned

1. Be Explicit About Formats

❌ Bad:

"Use the date column for filtering"

✅ Good:

"Use the 'dt' column (VARCHAR format YYYYMMDD, e.g., '20251223') for partition filtering"

2. Provide Examples

❌ Bad:

"Filter by date range"

✅ Good:

"Example: dt BETWEEN '20251201' AND '20251223'"

3. Use Structured Outputs

❌ Bad:

"Determine if this needs SQL"

✅ Good:

"Respond in this exact format:
INTENT: sql or chat
CONFIDENCE: 0.0 to 1.0"

4. Prioritize Critical Information

We use markers like:

  • ⚡ CRITICAL:
  • ⚠️ NEVER:
  • ✅ ALWAYS:

This helps the LLM focus on important rules.


Did this help you? ⭐ Star the repo and share with your team!

Questions? Open an issue or reach out: @osmanuygar


Built with ☕ and lots of prompt iterations
