Explore a systematic approach to evaluating SQL query generation by comparing results against predefined gold queries while accounting for valid variations
By Yaswanth Vudumula
This article was originally published on IBM Developer.
SQL-Eval is a framework designed to evaluate the correctness of SQL queries generated by Large Language Models (LLMs). Developed by the defog.ai team, it supports multiple LLMs, including OpenAI, AWS Bedrock, Gemini, MLX, Mixtral, Anthropic, and more.
I contributed by developing a watsonx wrapper, integrating ChromaDB for few-shot learning, and implementing self-correcting SQL queries.
How SQL-Eval works
1. Generates a SQL query from an LLM after retrieving relevant columns and metadata.
2. Executes both the gold (expected) query and the LLM-generated query on their respective databases to obtain results.
3. Compares the resulting dataframes using exact match and subset match to calculate the accuracy rate.
4. Logs metrics (for example, token usage, latency) and aggregates results for reporting.
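At a high level, the loop for a single question and gold-query pair might look like the sketch below. It is a simplification: llm_client and its generate_sql method are hypothetical placeholders, a single database connection is used for both queries, and compare_df and subset_df stand in for the comparison helpers described later in this article.

import time

import pandas as pd

def evaluate_question(question, gold_query, db_conn, llm_client):
    """Sketch of one evaluation step: generate, execute, compare, and log."""
    start = time.time()
    # 1. Ask the LLM for a SQL query (generate_sql is a hypothetical method).
    generated_sql = llm_client.generate_sql(question)
    latency = time.time() - start

    # 2. Execute the gold (expected) query and the generated query.
    df_gold = pd.read_sql_query(gold_query, db_conn)
    df_generated = pd.read_sql_query(generated_sql, db_conn)

    # 3. Try an exact match first, then fall back to the relaxed subset match.
    exact = compare_df(df_generated, df_gold)
    correct = exact or subset_df(df_gold, df_generated)

    # 4. Return per-question metrics for aggregation and reporting.
    return {"correct": correct, "exact_match": exact, "latency_s": latency}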
Handling multiple correct queries
SQL-Eval ensures that different valid SQL queries are recognized as correct.
For example, consider the question: "Who are the 10 most recent users from Toronto?"
Both of the following queries are valid:
- Query 1
SELECT userid, username, created_at
FROM users
WHERE city = 'Toronto'
ORDER BY created_at DESC
LIMIT 10;
- Query 2
SELECT userid, firstname || ' ' || lastname, created_at
FROM users
WHERE city = 'Toronto'
ORDER BY created_at DESC
LIMIT 10;
SQL-Eval validates such variations, ensuring that all correct queries are accepted.
GitHub repository: SQL-Generation-Evaluation
Principles of SQL-Eval
SQL-Eval follows key principles to ensure accurate and reliable evaluation:
- Reproducible and repeatable: Ensures consistent evaluations following the scientific method.
- Measures SQL complexity and diversity: Handles nested queries, multiple joins, and different ways of expressing SQL logic.
Creating the SQL-Eval dataset
The defog.ai team created a dataset of questions and gold query pairs, where the gold query produces the expected results. Unlike other benchmarks that rely on string-based matching (e.g., ROUGE), SQL-Eval focuses on result accuracy rather than exact SQL syntax.
Since multiple SQL queries can correctly answer the same question, SQL-Eval accepts different valid formats.
Example: Top 3 restaurants in New York
All three queries below are correct:
- Query 1
SELECT name
FROM restaurants
GROUP BY name
ORDER BY AVG(rating) DESC LIMIT 3
- Query 2
SELECT id, name
FROM restaurants
GROUP BY id, name
ORDER BY AVG(rating) DESC LIMIT 3
- Query 3
SELECT name, AVG(rating)
FROM restaurants
GROUP BY 1
ORDER BY 2 DESC LIMIT 3
Handling multiple correct queries
To ensure flexibility, SQL-Eval allows multiple acceptable column variations. For example:
SELECT {id, user_name} FROM users;
This means any of the following queries are valid:
SELECT id FROM users;
SELECT user_name FROM users;
SELECT id, user_name FROM users;
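For illustration, expanding such a brace block into every non-empty column combination could be done along the following lines. This is a minimal sketch, not the framework's actual implementation, and it assumes a single {...} block per query.

from itertools import combinations

def expand_braces(query_template):
    """Expand a single '{col_a, col_b}' block into all non-empty column combinations."""
    prefix, rest = query_template.split("{", 1)
    options, suffix = rest.split("}", 1)
    cols = [c.strip() for c in options.split(",")]
    queries = []
    for r in range(1, len(cols) + 1):
        for combo in combinations(cols, r):
            queries.append(prefix + ", ".join(combo) + suffix)
    return queries

print(expand_braces("SELECT {id, user_name} FROM users;"))
# ['SELECT id FROM users;', 'SELECT user_name FROM users;', 'SELECT id, user_name FROM users;']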
Ensuring uniform SQL representation
SQL-Eval considers various SQL structures, including:
- Joins: Supports multiple joins and nested queries.
- Aggregations: Evaluates GROUP BY usage.
- Ordered results: Assesses ORDER BY statements.
- Ratios and calculations: Handles expressions like SELECT X / Y, ensuring meaningful pre-aggregation and avoiding division by zero.
- Filters: Tests the effectiveness of WHERE conditions.
Evaluation method
Big-picture approach
Validating SQL queries deterministically is challenging, but SQL-Eval takes a deterministic approach by comparing the results of executed queries against those of predefined gold queries. Instead of relying on exact SQL matches, it evaluates logical equivalence, allowing for valid variations while handling edge cases.
To achieve this, define a minimal set of acceptable query alternatives in the question/query pairs. The detailed logic behind this process is available in a Python function in our repository. The following example illustrates this process.
Example: User preferences query
Consider the following database schema:
CREATE TABLE users (
    uid BIGINT,
    name TEXT,
    likes_movies BOOLEAN,
    likes_plays BOOLEAN
);
For the question, "Return users and whether they like movies", an acceptable gold query is:
SELECT {uid, name}, likes_movies FROM users;
Since the question does not specify whether uid, name, or both should be included, multiple variations are considered correct. SQL-Eval ensures these flexible interpretations are accounted for.
Evaluation steps
We begin by expanding the braces in the gold query to generate all possible column combinations. This results in 2² − 1 = 3 valid queries (excluding the empty set):
- SELECT uid, likes_movies FROM users
- SELECT name, likes_movies FROM users
- SELECT uid, name, likes_movies FROM users
Each of these queries is then executed against the database, producing a corresponding dataframe. For brevity, we use two rows in this example:
import pandas as pd

dfs_gold = [
    pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]}),
    pd.DataFrame({"name": ["alice", "bob"], "likes_movies": [True, False]}),
    pd.DataFrame({"uid": [1, 2], "name": ["alice", "bob"], "likes_movies": [True, False]}),
]
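In practice, dataframes like these come from running each expanded gold query against the database rather than being hard-coded. A minimal sketch, assuming a local SQLite copy of the users table, might look like this:

import sqlite3

import pandas as pd

# Assumed local SQLite database containing a copy of the users table.
conn = sqlite3.connect("example.db")

gold_queries = [
    "SELECT uid, likes_movies FROM users",
    "SELECT name, likes_movies FROM users",
    "SELECT uid, name, likes_movies FROM users",
]
dfs_gold = [pd.read_sql_query(q, conn) for q in gold_queries]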
After executing the gold queries and the generated query, compare their results. If the generated query's output matches any of the gold query results exactly (ignoring data types), it is marked as correct. This comparison is handled by the compare_df function.
In the example below, only the first dataframe in dfs_gold matches exactly, as it contains the same columns (uid, likes_movies) with identical values.
Example:
# result of executing a generated query:
# `SELECT u.uid, u.likes_movies FROM users u`
df_generated = pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]})
compare_df(df_generated, dfs_gold[0]) # True
compare_df(df_generated, dfs_gold[1]) # False
compare_df(df_generated, dfs_gold[2]) # False
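For reference, an exact-match comparison along the lines of compare_df could look like the sketch below. It ignores data types by normalizing all values to strings; the framework's actual implementation may differ.

import pandas as pd

def compare_df(df_generated: pd.DataFrame, df_gold: pd.DataFrame) -> bool:
    """Sketch of an exact-match check: same shape and same cell values, ignoring dtypes."""
    if df_generated.shape != df_gold.shape:
        return False
    # Normalize to strings so that, for example, 1 and 1.0 still compare as equal.
    a = df_generated.reset_index(drop=True).astype(str)
    b = df_gold.reset_index(drop=True).astype(str)
    return a.values.tolist() == b.values.tolist()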
However, some variations in the generated query's results should still be considered correct, such as:
- Column aliases: Queries that rename columns but return the same data.
SELECT u.uid AS id, likes_movies FROM users u;
id likes_movies
1 True
2 False
- Additional columns: Queries that include extra, non-essential columns.
id name likes_movies
1 alice True
2 bob False
- Different row order: Queries that return the correct data but in a different order.
SELECT uid, likes_movies FROM users u ORDER BY likes_movies;
uid likes_movies
2 False
1 True
These variations are harmless and may even be useful depending on the context. To account for them, SQL-Eval uses a relaxed comparison function called subset_df: if a generated query's result does not match exactly under compare_df, subset_df checks whether the gold query result is a subset of the generated query result.
Subset evaluation
Here’s how the subset_df function works:
1. For each column in df1, check whether the same values exist in df2, ignoring data types, column names (which may be aliased), and row order.
2. Extract the relevant columns from df2, rename them to match df1, and verify that the resulting dataframe matches df1. This prevents incorrect matches caused by shuffled columns, which can occur with low-cardinality data types such as booleans or enums.
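Putting those two steps together, a subset_df sketch under the same assumptions (ignore data types, aliased column names, and row order, then verify whole rows to guard against shuffled low-cardinality columns) might look like this; the actual implementation in the repository may differ:

import pandas as pd

def subset_df(df_gold: pd.DataFrame, df_generated: pd.DataFrame) -> bool:
    """Sketch of a relaxed check: is the gold result contained in the generated result?"""
    if len(df_gold) != len(df_generated):
        return False
    gold = df_gold.reset_index(drop=True).astype(str)
    gen = df_generated.reset_index(drop=True).astype(str)

    # Step 1: match each gold column to a generated column with the same multiset of
    # values, ignoring column names (which may be aliased) and row order.
    mapping = {}
    for g_col in gold.columns:
        match = next(
            (
                c_col
                for c_col in gen.columns
                if c_col not in mapping.values()
                and sorted(gen[c_col]) == sorted(gold[g_col])
            ),
            None,
        )
        if match is None:
            return False
        mapping[g_col] = match

    # Step 2: extract the matched columns, rename them to the gold names, and compare
    # whole rows (sorted) so that shuffled low-cardinality columns do not slip through.
    candidate = gen[[mapping[g] for g in gold.columns]].copy()
    candidate.columns = gold.columns
    return sorted(candidate.values.tolist()) == sorted(gold.values.tolist())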
Conclusion
This evaluation method accounts for harmless result variations, such as column renaming, additional columns, and row reordering.
You can test your own prompt-engineered solutions using the watsonx API (as outlined in the README) since fine-tuning a model may not always be feasible.
If you're considering a text-to-SQL model or product, assess whether this benchmark aligns with your needs—particularly in terms of semantic diversity, SQL complexity, and specificity within a narrow domain.
Explore our repository here: SQL Generation Evaluation