
Ilya Fastovets for Datalynx


LLMs for Text-to-SQL problems: the benchmark vs real-world performance

The scope of the problem

As Large Language Models (LLMs) gain popularity for code generation tasks, there is growing interest in using them to generate SQL queries from text descriptions and to run those queries in order to retrieve data from databases. Unlike many other code generation problems, however, this task involves more than generating code from a text description. The model needs to be aware of the structure of the database, the data types, the relationships between tables, and more. In addition, depending on the business task, it may be necessary to teach the model how to translate specialized business terminology into the data structures of the database so that it can answer questions effectively. Furthermore, it is often necessary for the model to understand the business needs in general, so that it can chat with the user with a proper understanding of the problem and then use the data to answer their questions.

How to evaluate the performance?

Given the scope of the problem, the first question is how to evaluate the performance of the available AI solutions. Since it is difficult to take everything into account, the first step is to narrow the problem down to a simpler Text-to-SQL problem. The benchmark would include balanced train, validation (dev), and test sets with various databases from different business fields. The minimum necessary information is the database file itself and a set of Question-SQL pairs of varying query difficulty for each database. We may also include text descriptions of the database, its tables, and each column in the tables, and provide examples of the terms used in the questions. This information is called ‘external knowledge’, and it models the real-life business case in which a human data analyst knows more than just the database itself. The simplest metric for evaluating model performance is Execution Accuracy (EX): the number of correctly answered questions divided by the total number of questions, where a question counts as correctly answered if the execution result of the predicted query matches that of the ground truth query.
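As an illustration (not part of the official benchmark code), EX could be computed roughly as follows, assuming a hypothetical run_query helper that executes a query and returns its rows as tuples:

def execution_accuracy(pairs, run_query):
    """Compute Execution Accuracy for a list of (predicted_sql, gold_sql) pairs."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        try:
            # A prediction is correct if its execution result matches the gold
            # query's result, ignoring row order.
            if set(run_query(predicted_sql)) == set(run_query(gold_sql)):
                correct += 1
        except Exception:
            # Queries that fail to execute count as incorrect.
            pass
    return correct / len(pairs)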

There are two widely used benchmarks for assessing the performance of LLM-based Text-to-SQL solutions: Spider and BIRD. At the time of writing this article, only Spider-1.0 is available, and the BIRD benchmark seems more suitable, as it is specifically designed for real business problems. It includes a wide variety of databases of different sizes from different business fields, standardized query difficulty gradation, and some external knowledge.

A closer look at the BIRD benchmark performance

The official BIRD whitepaper concludes: ‘Our experimental results demonstrate that BIRD presents a more daunting challenge compared to existing benchmarks since even the most popular and powerful LLM, ChatGPT, falls significantly short of human performance. This leaves plenty of room for improvement and innovation in the text-to-SQL tasks’.

A closer look at the benchmark results, however, reveals several interesting things. First, the gap between dev-set and test-set performance is much larger for some submitted solutions than for others. Second, the difference in performance between generations of GPT models (GPT-3.5 vs GPT-4) is much larger than the difference between methods built on the same model (e.g. different methodologies using GPT-4). This suggests that the quality of the underlying model matters more than the approach built around it. Finally, the role of external knowledge should not be underestimated. For example, the EX of the GPT-4 model on the test set was 60.77% with external knowledge and 40.20% without it. Human performance also differed: 90.27% with external knowledge vs 70.36% without it.

While the addition of external knowledge plays an important role, as the BIRD results demonstrate, the actual external knowledge used in the benchmark is scarce. We may speculate that in a real-world scenario a more detailed set of examples would be available, making it easier for the model to generate correct predictions. Furthermore, an LLM-based data retrieval solution can analyze a much wider set of examples than a human can, further bridging the gap between LLM and human performance. Here, we can go back to the scope of the business problem and once again point out that it is important to find ways to increase execution accuracy for the particular problem, not for the benchmark in general.

Example: SQL query generation with and without a knowledge base example

The following simplified examples visually demonstrate the importance of providing additional information for a specific business case of analyzing car sales.

import openai
# Add OpenAI API key here
openai.api_key = "..."

Let's start by downloading the TrueCar dataset and creating a PostgreSQL database from it. We put different columns from this dataset into three separate tables, to simulate a real-world situation and to force the model to join them when answering questions.
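The exact loading code is not shown here; as a rough sketch (with placeholder connection settings, and the CSV import omitted), the three tables could be created like this:

import psycopg2

# Placeholder connection settings; adjust for your own PostgreSQL instance.
conn = psycopg2.connect(host="localhost", dbname="truecar", user="postgres", password="...")

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE vehicles (
            vehicle_id SERIAL PRIMARY KEY,
            make       VARCHAR(50)  NOT NULL,
            model      VARCHAR(100) NOT NULL,
            year       SMALLINT     NOT NULL,
            vin        VARCHAR(17)  NOT NULL UNIQUE,
            created_on TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE listings (
            vehicle_id INTEGER REFERENCES vehicles (vehicle_id),
            price      INTEGER NOT NULL,
            mileage    INTEGER NOT NULL,
            region     VARCHAR(20)
        );
        CREATE TABLE regions (
            vehicle_id INTEGER REFERENCES vehicles (vehicle_id),
            city       VARCHAR(100) NOT NULL,
            state      VARCHAR(2)   NOT NULL,
            county     VARCHAR(100),
            region     VARCHAR(20)
        );
    """)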

Then, we extract the database schema and constraints for all tables and pass them to the model as .csv strings. This part simulates the automatic collection of database information by the AI model.
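One possible way to collect these strings automatically (a sketch, assuming the psycopg2 connection from above and pandas installed) is to query the information_schema views and export the results as CSV:

import pandas as pd

# Column-level schema information for all tables in the public schema.
# (pandas may warn about using a raw DBAPI connection; a SQLAlchemy engine could be used instead.)
schema_df = pd.read_sql(
    """
    SELECT table_schema, table_name, column_name, data_type,
           character_maximum_length, numeric_precision, column_default, is_nullable
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position;
    """,
    conn,
)

# Primary key, foreign key, and unique constraints for the same tables.
constraints_df = pd.read_sql(
    """
    SELECT tc.table_schema, tc.table_name, tc.constraint_name, tc.constraint_type,
           kcu.column_name,
           ccu.table_schema AS foreign_table_schema,
           ccu.table_name   AS foreign_table_name,
           ccu.column_name  AS foreign_column_name
    FROM information_schema.table_constraints AS tc
    JOIN information_schema.key_column_usage AS kcu
      ON tc.constraint_name = kcu.constraint_name
     AND tc.table_schema = kcu.table_schema
    JOIN information_schema.constraint_column_usage AS ccu
      ON tc.constraint_name = ccu.constraint_name
     AND tc.table_schema = ccu.table_schema
    WHERE tc.table_schema = 'public';
    """,
    conn,
)

schema_csv = schema_df.to_csv(index=False)
constraints_csv = constraints_df.to_csv(index=False)

The hard-coded EXAMPLE_DB_SCHEMA and EXAMPLE_DB_CONSTRAINTS strings below show what such output looks like for our database.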

We also define the question that we would like to answer.

SQL_DIALECT = "PostgreSQL"

EXAMPLE_DB_SCHEMA = """
"table_schema","table_name","column_name","data_type","character_maximum_length","numeric_precision","column_default","is_nullable"
"public","listings","vehicle_id","integer",NULL,32,NULL,"YES"
"public","listings","price","integer",NULL,32,NULL,"NO"
"public","listings","mileage","integer",NULL,32,NULL,"NO"
"public","listings","region","character varying",20,NULL,NULL,"YES"
"public","regions","vehicle_id","integer",NULL,32,NULL,"YES"
"public","regions","city","character varying",100,NULL,NULL,"NO"
"public","regions","state","character varying",2,NULL,NULL,"NO"
"public","regions","county","character varying",100,NULL,NULL,"YES"
"public","regions","region","character varying",20,NULL,NULL,"YES"
"public","vehicles","vehicle_id","integer",NULL,32,"nextval('vehicles_vehicle_id_seq'::regclass)","NO"
"public","vehicles","make","character varying",50,NULL,NULL,"NO"
"public","vehicles","model","character varying",100,NULL,NULL,"NO"
"public","vehicles","year","smallint",NULL,16,NULL,"NO"
"public","vehicles","vin","character varying",17,NULL,NULL,"NO"
"public","vehicles","created_on","timestamp without time zone",NULL,NULL,"CURRENT_TIMESTAMP","NO"
"""

EXAMPLE_DB_CONSTRAINTS = """
"table_schema","table_name","constraint_name","constraint_type","column_name","foreign_table_schema","foreign_table_name","foreign_column_name"
"public","listings","listings_vehicle_id_fkey","FOREIGN KEY","vehicle_id","public","vehicles","vehicle_id"
"public","regions","regions_vehicle_id_fkey","FOREIGN KEY","vehicle_id","public","vehicles","vehicle_id"
"public","vehicles","vehicles_pkey","PRIMARY KEY","vehicle_id","public","vehicles","vehicle_id"
"public","vehicles","vehicles_vin_key","UNIQUE","vin","public","vehicles","vin"
"""

QUESTION = "How many cars were sold in California?"

Define the system prompt and insert the previously defined SQL dialect into it.

system_message = """
    You are a robot that generates {} SQL queries.
"""
system_message = system_message.format(SQL_DIALECT)

Insert the SQL dialect, the database constraints and schema, and the question into the user prompt.

user_message = """
    You are given the datasets of database constraints and table summaries (formatted as .csv strings) for the required table names. 
    You are required to generate a syntactically correct {} SQL query that retrieves the data from the tables that would help answer the question most accurately. 

    Guide on how to generate the query:
       - Pay attention to the details of the question: accurately follow the instructions. 
       - Return only the query and nothing else. Do not return anything other than a query. Do not answer the question with text.
       - Example output query: 'SELECT * FROM table_name'. 

    Database constraints for the relevant tables are: {}. 
    Table summaries for the relevant tables are: {}.  

    The question is: {}.
"""
user_message = user_message.format(SQL_DIALECT, EXAMPLE_DB_CONSTRAINTS, EXAMPLE_DB_SCHEMA, QUESTION)

Define the messages list used to create a response from the OpenAI API.

Using this history of messages, the chat will generate the next response.

Each message is a dictionary that contains the role and the text content.

The role can be 'system' for the system prompt, 'user' for the user messages, or 'assistant' for the response messages from the chat.

In our case, we only generate a response once and no memory is used, so no 'assistant' messages are needed.

If function calls are used, the functions can be passed as a separate argument when creating the chat completion object. In the previous articles, we simplified this step using LangChain tools. In this example, we only use messages, and no functions are used.

messages = [
    {
        "role": "system", 
        "content": system_message, 
    },
    {
        "role": "user", 
        "content": user_message,
    },
]

Create the response using the OpenAI chat completions endpoint.

We use the GPT-4 model and set the temperature to 0 to make the outputs as deterministic as possible.

Reducing the temperature is often recommended for code generation tasks, as it helps generate more reliable code.

response = openai.chat.completions.create(model="gpt-4", 
                                          temperature=0,
                                          messages=messages)
response_message = response.choices[0].message.content
print(f"The question is: {QUESTION}")
print(f"\nThe response without example is:\n\n{response_message}")
The question is: How many cars were sold in California?

The response without example is:

SELECT COUNT(*) FROM listings 
JOIN regions ON listings.vehicle_id = regions.vehicle_id 
WHERE regions.state = 'California';

The response from the chat returned a query that would produce an incorrect result.

The reason is that the model is unaware of the categories available in the 'state' column.

In our database, the state of California is recorded as 'CA' or 'Ca', not as 'California'.

This problem can be fixed by automatically providing the model with the available categories when it is connected to the database, as sketched below.
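A minimal sketch of that idea, reusing the psycopg2 connection from the earlier sketch:

# Collect the distinct values of a categorical column so the model knows the actual labels.
with conn.cursor() as cur:
    cur.execute("SELECT DISTINCT state FROM regions ORDER BY state;")
    state_values = [row[0] for row in cur.fetchall()]

# This hint could then be appended to the user prompt before calling the model.
category_hint = f"The 'state' column of the 'regions' table contains the values: {state_values}."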

However, here we would like to demonstrate how adding a business knowledge example helps. We will provide an explanation and a ground truth query for this question. Then, we will ask a similar question with an additional step: filtering the results to the year 2003.

EXAMPLE_KNOWLEDGE = {
    "Question": "How many cars were sold in California?", 
    "Explanation": "The number of cars sold in California State in the USA", 
    "Query": """
             SELECT COUNT(*) FROM listings 
             JOIN regions ON listings.vehicle_id = regions.vehicle_id 
             WHERE regions.state = 'CA' OR regions.state = 'Ca';
             """, 
}

QUESTION_W_KNOWLEDGE = "How many cars were sold in California in 2003?"

Now, let's change the user prompt. We add a clause with an explanation of how to use the example and a placeholder for the example itself.

Then, we insert the example and the new question into the new prompt.

user_message = """
    You are given the datasets of database constraints and table summaries (formatted as .csv strings) for the required table names. 
    You are required to generate a syntactically correct {} SQL query that retrieves the data from the tables that would help answer the question most accurately. 

    Guide on how to generate the query:
       - Pay attention to the details of the question: accurately follow the instructions. 
       - Return only the query and nothing else. Do not return anything other than a query. Do not answer the question with text.
       - You may or may not be provided a relevant ground truth example. Use it to generate a more accurate query.
       - Example output query: 'SELECT * FROM table_name'. 

    Database constraints for the relevant tables are: {}. 
    Table summaries for the relevant tables are: {}.  
    Ground truth example is: {}.

    The question is: {}.
"""
user_message = user_message.format(SQL_DIALECT, EXAMPLE_DB_CONSTRAINTS, EXAMPLE_DB_SCHEMA, EXAMPLE_KNOWLEDGE, QUESTION_W_KNOWLEDGE)

Repeat the same steps as above to generate responses, but use the updated user message.

messages = [
    {
        "role": "system", 
        "content": system_message, 
    },
    {
        "role": "user", 
        "content": user_message,
    },
]
response = openai.chat.completions.create(model="gpt-4", 
                                          temperature=0,
                                          messages=messages)
response_message = response.choices[0].message.content
print(f"The question is: {QUESTION_W_KNOWLEDGE}")
print(f"\nThe response with example is:\n\n{response_message}")
The question is: How many cars were sold in California in 2003?

The response with example is:

SELECT COUNT(*) FROM listings 
JOIN regions ON listings.vehicle_id = regions.vehicle_id 
JOIN vehicles ON listings.vehicle_id = vehicles.vehicle_id
WHERE (regions.state = 'CA' OR regions.state = 'Ca') AND vehicles.year = 2003;

Although the question is different and requires an additional join to extract the year, the categories for California state are used correctly this time. This demonstrates the capability of the model to use additional business knowledge to generate more accurate queries.
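To close the loop, the generated query could be executed against the database to verify the result (a sketch, reusing the psycopg2 connection from the earlier sketches):

# Execute the generated query and print the result as a quick sanity check.
with conn.cursor() as cur:
    cur.execute(response_message)
    print(cur.fetchall())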

Conclusion

Although current LLM-based Text-to-SQL tools cannot yet match human performance on benchmark tests, those same benchmarks demonstrate that adding more information about the specific business case can greatly improve a model's performance. When solving a specific business problem, it is therefore more important to improve performance on that particular data than on a broad benchmark spanning many different datasets.
