DEV Community: Ilya Fastovets

Text-to-SQL: question breakdown by ground truth queries examples

Ilya Fastovets — Tue, 12 Mar 2024 00:37:09 +0000

Approaching human-level performance

In the previous article, I talked about the benchmark performance of LLMs on Text-to-SQL tasks.

As of the time of writing this article, there is still a huge gap between the performance of LLM-based solutions and baseline human-level performance (HLP).

We acknowledged this problem but also discussed that the benchmark performance is conceptually not the same thing as the performance for a specific business case.

For example, it has been demonstrated by the benchmark (BIRD) that the addition of extra knowledge vastly increases the model’s performance. Here, we speculate that it might be possible to achieve a performance much closer to HLP if we think of HLP as a typical data analyst in a specific company.

Even if the AI solution is not as ‘intelligent’ as a human data analyst, it might be better than a human at something else, which can help bridge the gap between the model performance and HLP.

One such thing is the use of knowledge base examples (external knowledge): a collection of ground truth queries for pre-defined questions.

If this collection is huge and the queries are complex, it might be impossible for a human to remember them all and reuse them effectively.

In contrast, similarity search integrated into an LLM-based AI solution, also known as Retrieval Augmented Generation (RAG), can retrieve relevant knowledge base examples quickly and effectively.

Steps not from general knowledge, but from business knowledge

Let’s say we have an LLM agent that is capable of breaking down complex input questions into multiple simple steps, for which a separate query is generated.

This multi-step approach has been demonstrated to perform better on the benchmarks than the baseline with the same LLM (BIRD). Here, we go one step further and propose an approach to break down the input question by the most similar examples.

Instead of only creating a series of steps from the LLM’s general understanding and then retrieving the most similar example from the knowledge base for each step, we generate the steps themselves so that they are as close to the available examples as possible.

For example, assume we have the examples for ‘customer growth’ and ‘monthly revenue’ and the question is ‘How is the revenue generated related to the increase in customers?’.

With the proposed approach, the model would consider available examples at the question breakdown steps, and generate the sub-questions as close to the available examples as possible.

This step would encourage the model to re-use the available examples as much as possible, in this case, the examples for ‘customer growth’ and ‘monthly revenue’.

The closer the provided example is to the desired question, the bigger the chance the generated query would be accurate.

The proposed solution uses RAG twice: once for breaking down the question, and once for generating the sub-query (Fig.1).

Fig. 1. Proposed workflow for breaking down questions into simple steps using RAG based on available business knowledge examples.
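The two retrieval stages in Fig. 1 can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the implementation behind the figure: the knowledge base entries, the table names in the queries, and the helpers `retrieve`, `build_breakdown_prompt`, and `build_query_prompt` are all hypothetical, and a simple bag-of-words cosine similarity stands in for the embedding-based similarity search a real RAG setup would likely use. The LLM calls are left out; the sketch only shows what would be retrieved and placed into each prompt.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base of ground truth examples (tables are made up).
KNOWLEDGE_BASE = [
    {
        "topic": "customer growth",
        "question": "How many new customers joined each month?",
        "query": "SELECT DATE_TRUNC('month', created_on) AS month, COUNT(*) "
                 "FROM customers GROUP BY 1 ORDER BY 1;",
    },
    {
        "topic": "monthly revenue",
        "question": "What is the monthly revenue?",
        "query": "SELECT DATE_TRUNC('month', paid_on) AS month, SUM(amount) "
                 "FROM payments GROUP BY 1 ORDER BY 1;",
    },
    {
        "topic": "return rate",
        "question": "Which products have a high return rate?",
        "query": "SELECT product_id, AVG(is_returned::int) AS return_rate "
                 "FROM order_items GROUP BY 1 ORDER BY 2 DESC;",
    },
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, k=2):
    """Return the k knowledge base examples most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda ex: cosine_similarity(q, Counter(tokenize(ex["question"]))),
        reverse=True,
    )
    return ranked[:k]


def build_breakdown_prompt(question, examples):
    """Stage 1 prompt: break the question down along the retrieved examples."""
    listed = "\n".join(f"- {ex['question']}" for ex in examples)
    return (
        "Break the question into sub-questions that stay as close as possible "
        f"to these known questions:\n{listed}\nQuestion: {question}"
    )


def build_query_prompt(sub_question, example):
    """Stage 2 prompt: generate a sub-query guided by the closest example."""
    return (
        f"Ground truth example: {example['question']} -> {example['query']}\n"
        f"Write a query for: {sub_question}"
    )


question = "How is the revenue generated related to the increase in customers?"

# Stage 1: retrieve examples that guide the question breakdown.
stage_one_examples = retrieve(question, k=2)
print(build_breakdown_prompt(question, stage_one_examples))

# Stage 2: the sub-questions would come from the LLM; they are hard-coded here.
for sub_question in ["What is the monthly revenue?",
                     "How many new customers joined each month?"]:
    best_example = retrieve(sub_question, k=1)[0]
    print(build_query_prompt(sub_question, best_example))
```

Because the sub-questions are phrased close to the stored questions, the second retrieval finds a near-exact match, which is the effect the proposed breakdown aims for.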

Unlimited examples at a lower cost

The first step of the proposed solution uses an LLM agent to break down business questions into sub-questions by the available ground truth examples from the knowledge base.

One may argue that RAG is not necessary at this step, as we could simply put all available examples into the prompt.

This solution would work for a small knowledge base, but as the knowledge base grows bigger, it would be more difficult to fit all available examples into a single prompt.

Even as bigger prompt limits are now available, it would not be optimal from the cost-saving perspective. By using RAG with a limited number of similar examples, we can achieve both better performance and reasonable cost savings.

We further speculate that this approach would be most effective with extremely large databases and also a large number of ground truth examples.
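The cost argument can be made concrete with a rough back-of-the-envelope sketch. Everything here is an assumption for illustration: the synthetic examples, the knowledge base size of 1000, the choice of k = 5, and the crude one-token-per-word estimate (real tokenizers count differently).

```python
def estimate_tokens(text):
    """Very rough proxy: one token per whitespace-separated word."""
    return len(text.split())


def prompt_size(examples, k=None):
    """Estimated tokens spent on examples: all of them, or only the first k."""
    selected = examples if k is None else examples[:k]
    return sum(estimate_tokens(example) for example in selected)


# Synthetic knowledge base of 1000 short question/query examples.
examples = [
    f"Question {i}: how many cars were sold in region {i}? "
    f"Query: SELECT COUNT(*) FROM listings WHERE region = '{i}';"
    for i in range(1000)
]

full_size = prompt_size(examples)        # every example in the prompt
top_k_size = prompt_size(examples, k=5)  # only five retrieved examples

print(f"All examples: ~{full_size} tokens per request")
print(f"Top-5 examples: ~{top_k_size} tokens per request")
```

The first figure grows linearly with the knowledge base, while the second stays constant for a fixed k, which is why the retrieval step matters more as the collection grows.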

We need a better benchmark

Breaking down complex business questions into simple steps and adding ground truth business knowledge greatly improves the performance of LLM-powered Text-to-SQL solutions.

The approach proposed in this article aims at combining the two ideas to break down questions into the steps most similar to the available knowledge base examples.

This could improve the performance of the solutions on the actual real-world problems.

It is currently unclear how to evaluate the performance of the proposed solution. To my knowledge, there is no suitable benchmark to test it effectively, so we leave the evaluation of this method for future research.

LLMs for Text-to-SQL problems: the benchmark vs real-world performance

Ilya Fastovets — Fri, 08 Mar 2024 03:57:57 +0000

The scope of the problem

As Large Language Models (LLMs) are gaining popularity for code generation tasks, it has become of particular interest to utilize their capabilities for generating SQL commands from text descriptions and running them, to be able to retrieve the data from the databases. But contrary to many other code generation problems, this problem is more than just code generation from a text description. The model needs to be aware of the structure of the database, the data types, the relationships between tables, and more. In addition to that, depending on the business task, it may be necessary to train the model on how to translate specialized business terminology into the data structure of the database, to be able to answer the question effectively. Furthermore, it is often necessary to make the model understand the business needs in general, to be able to chat with the user with a proper understanding of the problem, and then use the data to answer their questions.

How to evaluate the performance?

Given the scope of the problem, the first question would be how to evaluate the performance of the available AI solutions. Since it is pretty difficult to take everything into account, the first step would be to narrow down the problem to a simpler Text-to-SQL problem. The benchmark would include balanced train, validation (dev), and test sets with various databases from different business fields. The minimum necessary information would be the database file itself and a set of Question-SQL pairs with different SQL query difficulties for each database. Also, we may include text descriptions for the database, tables, and each column in the tables, and provide examples of the terms used in the questions. This information is called ‘external knowledge’, and it tries to model the real-life business case when a human data analyst is aware of more than just the database itself. The simplest metric that can be used for evaluating the model performance is Execution Accuracy (EX): the number of questions for which the generated query returns the same execution result as the ground truth query, divided by the total number of questions.
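As a rough illustration of the EX metric, the sketch below runs predicted and ground truth queries against a tiny in-memory SQLite database and compares the returned rows. The table, the queries, and the `execution_accuracy` helper are hypothetical, and comparing rows while ignoring their order is an assumption of this sketch rather than a statement about how any particular benchmark implements the check.

```python
import sqlite3


def execution_accuracy(conn, pairs):
    """Share of (predicted_sql, ground_truth_sql) pairs with equal results."""
    correct = 0
    for predicted_sql, truth_sql in pairs:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            continue  # A query that fails to run counts as incorrect
        truth_rows = conn.execute(truth_sql).fetchall()
        # Compare the result rows while ignoring their order.
        if sorted(predicted_rows, key=repr) == sorted(truth_rows, key=repr):
            correct += 1
    return correct / len(pairs)


# Toy database, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regions (vehicle_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO regions VALUES (?, ?)", [(1, "CA"), (2, "Ca"), (3, "NV")]
)

pairs = [
    # Wrong category value: returns 0 instead of 2.
    ("SELECT COUNT(*) FROM regions WHERE state = 'California'",
     "SELECT COUNT(*) FROM regions WHERE state IN ('CA', 'Ca')"),
    # Identical result.
    ("SELECT COUNT(*) FROM regions WHERE state = 'NV'",
     "SELECT COUNT(*) FROM regions WHERE state = 'NV'"),
    # Same rows, different order.
    ("SELECT state FROM regions ORDER BY state DESC",
     "SELECT state FROM regions"),
    # Misspelled table name: the query fails to execute.
    ("SELECT COUNT(*) FROM region",
     "SELECT COUNT(*) FROM regions"),
]

ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.2%}")
```

Note that EX judges only the execution result: two differently written queries count as equally correct as long as they return the same rows.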

There are two widely used benchmarks for assessing the performance of LLM-based Text-to-SQL solutions: Spider and BIRD. At the time of writing this article, when only Spider 1.0 is available, the BIRD benchmark seems to be more suitable as it is specifically designed for real business problems. It includes a wide variety of databases of different sizes from different business fields, standardized query difficulty gradation, and some external knowledge.

A closer look at the BIRD benchmark performance

The official BIRD whitepaper concludes: ‘Our experimental results demonstrate that BIRD presents a more daunting challenge compared to existing benchmarks since even the most popular and powerful LLM, ChatGPT, falls significantly short of human performance. This leaves plenty of room for improvement and innovation in the text-to-SQL tasks’.

But a closer look at the benchmark results reveals several interesting things. First, the gap between dev and test set performance is much bigger for some submitted solutions than for others. Second, the difference in performance between different generations of GPT models (GPT-3.5 vs GPT-4) is much bigger than the difference between methods that use the same model (e.g. different methodologies using the GPT-4 model). This suggests that the quality of the model is more important than the approach used. Finally, the role of external knowledge should not be underestimated. For example, the EX of the GPT-4 model on the test dataset was 60.77% with external knowledge and 40.20% without it. Human performance also differed: 90.27% with external knowledge vs 70.36% without.

While the addition of external knowledge plays an important role, as demonstrated by BIRD results, the actual external knowledge used in the benchmark is scarce. We may speculate that in a real-world scenario, a more detailed set of examples would be available, making it easier for the model to generate correct predictions. Furthermore, a much wider set of examples can be analyzed by the LLM-based data retrieval AI solution, as compared to a human, further bridging the gap between the LLM and human performance. Here, we can go back to the scope of the business problem, and once again point out that it is important to find ways to increase execution accuracy for the particular problem, not the benchmark in general.

Example: SQL query generation with and without a knowledge base example

The following simplified examples visually demonstrate the importance of providing additional information for a specific business case of analyzing car sales.

import openai

# Add OpenAI API key here
openai.api_key = "..."

Let's start by downloading the Truecar dataset and creating a PostgreSQL database out of it. We put different columns from this dataset into three different tables, to simulate real-world situations and to force the model to join them when answering questions.

Then, we extract database schema and constraints from it, for all tables, and pass them as .csv strings. This part simulates the automatic collection of the database information by the AI model.
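One possible way to produce such .csv strings is sketched below. It is an assumption about how the extraction could be done, not a record of how the strings in this article were created: the query selects standard columns of PostgreSQL's `information_schema.columns` view, the helpers `format_field` and `rows_to_csv` are hypothetical, and the rows are hard-coded here in place of a real `cursor.fetchall()` result from a database driver.

```python
# Columns of information_schema.columns to include in the schema summary.
SCHEMA_COLUMNS = [
    "table_schema", "table_name", "column_name", "data_type",
    "character_maximum_length", "numeric_precision",
    "column_default", "is_nullable",
]

# Query that would be sent to PostgreSQL through any DB-API driver.
SCHEMA_QUERY = (
    f"SELECT {', '.join(SCHEMA_COLUMNS)} FROM information_schema.columns "
    "WHERE table_schema = 'public' ORDER BY table_name, ordinal_position;"
)


def format_field(value):
    """Quote strings, leave numbers bare, and write missing values as NULL."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return '"' + str(value).replace('"', '""') + '"'


def rows_to_csv(header, rows):
    """Build a .csv string from a header and a list of row tuples."""
    lines = [",".join(format_field(name) for name in header)]
    lines += [",".join(format_field(value) for value in row) for row in rows]
    return "\n".join(lines)


# Stand-in for cursor.fetchall() after executing SCHEMA_QUERY.
sample_rows = [
    ("public", "listings", "vehicle_id", "integer", None, 32, None, "YES"),
    ("public", "regions", "state", "character varying", 2, None, None, "NO"),
]

schema_csv = rows_to_csv(SCHEMA_COLUMNS, sample_rows)
print(schema_csv)
```

A similar query against `information_schema.table_constraints` and its related views could supply the constraints string.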

We also define the question that we would like to answer.

SQL_DIALECT = "PostgreSQL"

EXAMPLE_DB_SCHEMA = """
"table_schema","table_name","column_name","data_type","character_maximum_length","numeric_precision","column_default","is_nullable"
"public","listings","vehicle_id","integer",NULL,32,NULL,"YES"
"public","listings","price","integer",NULL,32,NULL,"NO"
"public","listings","mileage","integer",NULL,32,NULL,"NO"
"public","listings","region","character varying",20,NULL,NULL,"YES"
"public","regions","vehicle_id","integer",NULL,32,NULL,"YES"
"public","regions","city","character varying",100,NULL,NULL,"NO"
"public","regions","state","character varying",2,NULL,NULL,"NO"
"public","regions","county","character varying",100,NULL,NULL,"YES"
"public","regions","region","character varying",20,NULL,NULL,"YES"
"public","vehicles","vehicle_id","integer",NULL,32,"nextval('vehicles_vehicle_id_seq'::regclass)","NO"
"public","vehicles","make","character varying",50,NULL,NULL,"NO"
"public","vehicles","model","character varying",100,NULL,NULL,"NO"
"public","vehicles","year","smallint",NULL,16,NULL,"NO"
"public","vehicles","vin","character varying",17,NULL,NULL,"NO"
"public","vehicles","created_on","timestamp without time zone",NULL,NULL,"CURRENT_TIMESTAMP","NO"
"""

EXAMPLE_DB_CONSTRAINTS = """
"table_schema","table_name","constraint_name","constraint_type","column_name","foreign_table_schema","foreign_table_name","foreign_column_name"
"public","listings","listings_vehicle_id_fkey","FOREIGN KEY","vehicle_id","public","vehicles","vehicle_id"
"public","regions","regions_vehicle_id_fkey","FOREIGN KEY","vehicle_id","public","vehicles","vehicle_id"
"public","vehicles","vehicles_pkey","PRIMARY KEY","vehicle_id","public","vehicles","vehicle_id"
"public","vehicles","vehicles_vin_key","UNIQUE","vin","public","vehicles","vin"
"""

QUESTION = "How many cars were sold in California?"

Define the system prompt and insert the previously defined SQL dialect into it.

system_message = """
    You are a robot that generates {} SQL queries.
"""
system_message = system_message.format(SQL_DIALECT)

Insert the SQL dialect, database constraints and schema, and the question into the user prompt.

user_message = """
    You are given the datasets of database constraints and table summaries (formatted as .csv strings) for the required table names. 
    You are required to generate a syntactically correct {} SQL query that retrieves the data from the tables that would help answer the question most accurately. 

    Guide on how to generate the query:
       - Pay attention to the details of the question: accurately follow the instructions. 
       - Return only the query and nothing else. Do not return anything other than a query. Do not answer the question with text.
       - Example output query: 'SELECT * FROM table_name'. 

    Database constraints for the relevant tables are: {}. 
    Table summaries for the relevant tables are: {}.  

    The question is: {}.
"""
user_message = user_message.format(SQL_DIALECT, EXAMPLE_DB_CONSTRAINTS, EXAMPLE_DB_SCHEMA, QUESTION)

Define the messages list used to create a response message from OpenAI API.

Using this history of messages, the chat will generate the next response.

Each message is a dictionary that contains the role and the text content.

The role can be 'system' for the system prompt, 'user' for the user messages, or 'assistant' for the response messages from the chat.

In our case, we only generate a response once, and no memory is used, so no 'assistant' messages are needed.

If the function calls are used, the functions can be passed as a separate argument when creating the chat completion object. In the previous articles, we simplified this step using Langchain tools. In this example, we only use messages, and no functions are used.

messages = [
    {
        "role": "system", 
        "content": system_message, 
    },
    {
        "role": "user", 
        "content": user_message,
    },
]

Create the response using the OpenAI chat completions tool.

We use the GPT-4 model and set the temperature to 0 to make the outputs as reproducible as possible.

Also, reducing temperature for code generation tasks is often recommended as it helps generate more reliable code.

response = openai.chat.completions.create(model="gpt-4", 
                                          temperature=0,
                                          messages=messages)
response_message = response.choices[0].message.content
print(f"The question is: {QUESTION}")
print(f"\nThe response without example is:\n\n{response_message}")

The question is: How many cars were sold in California?

The response without example is:

SELECT COUNT(*) FROM listings 
JOIN regions ON listings.vehicle_id = regions.vehicle_id 
WHERE regions.state = 'California';

The response from the chat returned a query that would produce an incorrect output.

The reason is that the model is unaware of the available categories in the 'state' column.

In our database, California state is recorded as 'CA' or 'Ca', but not as 'California'.

This problem can be fixed by providing more information about available categories to the model automatically when connected to the database.
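A minimal sketch of that automatic approach is shown here, using an in-memory SQLite table as a stand-in for the PostgreSQL database in this article. The sample rows and the `column_categories` helper are hypothetical; the idea is simply to read the distinct values of a categorical column and pass them to the model as a hint.

```python
import sqlite3

# SQLite stand-in for the 'regions' table, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regions (vehicle_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO regions VALUES (?, ?)", [(1, "CA"), (2, "Ca"), (3, "NV")]
)


def column_categories(conn, table, column, max_values=20):
    """Return up to max_values distinct values stored in a column.

    The table and column names are interpolated into the SQL text,
    so they must come from trusted schema metadata, never from user input.
    """
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1 LIMIT ?',
        (max_values,),
    ).fetchall()
    return [row[0] for row in rows]


# The query generated without extra knowledge matches no rows at all.
naive_count = conn.execute(
    "SELECT COUNT(*) FROM regions WHERE state = 'California'"
).fetchone()[0]

state_values = column_categories(conn, "regions", "state")
hint = f"Available values in regions.state: {state_values}"

print(f"Rows matching 'California': {naive_count}")
print(hint)
```

The `max_values` cap keeps high-cardinality columns, such as VINs, from flooding the prompt.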

However, here we would like to demonstrate how adding a business knowledge example would help. We will provide an explanation and ground truth query for this question. Then, we will ask a similar question with an additional step to filter out the results only for the year 2003.

EXAMPLE_KNOWLEDGE = {
    "Question": "How many cars were sold in California?", 
    "Explanation": "The number of cars sold in California State in the USA", 
    "Query": """
             SELECT COUNT(*) FROM listings 
             JOIN regions ON listings.vehicle_id = regions.vehicle_id 
             WHERE regions.state = 'CA' OR regions.state = 'Ca';
             """, 
}

QUESTION_W_KNOWLEDGE = "How many cars were sold in California in 2003?"

Now, let's change the user prompt. We add a clause with an explanation of how to use the example and a placeholder for the example itself.

Then, we insert the example and the new question into the new prompt.

user_message = """
    You are given the datasets of database constraints and table summaries (formatted as .csv strings) for the required table names. 
    You are required to generate a syntactically correct {} SQL query that retrieves the data from the tables that would help answer the question most accurately. 

    Guide on how to generate the query:
       - Pay attention to the details of the question: accurately follow the instructions. 
       - Return only the query and nothing else. Do not return anything other than a query. Do not answer the question with text.
       - You may or may not be provided a relevant ground truth example. Use it to generate a more accurate query.
       - Example output query: 'SELECT * FROM table_name'. 

    Database constraints for the relevant tables are: {}. 
    Table summaries for the relevant tables are: {}.  
    Ground truth example is: {}.

    The question is: {}.
"""
user_message = user_message.format(SQL_DIALECT, EXAMPLE_DB_CONSTRAINTS, EXAMPLE_DB_SCHEMA, EXAMPLE_KNOWLEDGE, QUESTION_W_KNOWLEDGE)

Repeat the same steps as above to generate responses, but use the updated user message.

messages = [
    {
        "role": "system", 
        "content": system_message, 
    },
    {
        "role": "user", 
        "content": user_message,
    },
]

response = openai.chat.completions.create(model="gpt-4", 
                                          temperature=0,
                                          messages=messages)
response_message = response.choices[0].message.content
print(f"The question is: {QUESTION_W_KNOWLEDGE}")
print(f"\nThe response with example is:\n\n{response_message}")

The question is: How many cars were sold in California in 2003?

The response with example is:

SELECT COUNT(*) FROM listings 
JOIN regions ON listings.vehicle_id = regions.vehicle_id 
JOIN vehicles ON listings.vehicle_id = vehicles.vehicle_id
WHERE (regions.state = 'CA' OR regions.state = 'Ca') AND vehicles.year = 2003;

Although the question is different and requires an additional join to extract the year, the categories for California state are used correctly this time. This demonstrates the capability of the model to use additional business knowledge to generate more accurate queries.

Conclusion

Although the current LLM-based Text-to-SQL tools cannot match the human performance on the benchmark tests yet, it is also demonstrated by the benchmark tests that adding more information about specific business cases can greatly improve the model's performance. For solving specific business problems, it is important to improve the performance for the specific data rather than on a wide benchmark with different datasets.

Memory in LLM agents

Ilya Fastovets — Fri, 16 Feb 2024 03:56:26 +0000

The role of memory in LLM chats

In the previous article, we discussed how the reasoning and decision-making capabilities of LLM agents can help us solve practical tasks. In the provided example we used OpenAI LLM with function calls to create this agent. Two dummy methods were implemented to demonstrate its capabilities, namely calling them one after the other, and passing the results between one another. The scratchpad that we added to the agent helped it remember the previous steps so that the next steps could be reasoned correctly.

However, as the decision-making agent was integrated into the chat, the chat itself lacked memory capability. What this means is that every message that you sent to the chat was treated as a new one. Let’s consider the following case. The agent's reasoning and decision-making are now part of the general dialogue with the same storyline. With each message, the chat decides whether to call a specific tool or to continue a conversation with a simple response. But now the user also needs the chat to remember what the user was talking about before, to be able to adjust future responses.

Let’s consider an example. Suppose you are developing a weather forecast assistant chat that can discuss the weather with you and advise you on the driving conditions in the specified areas. Apart from being able to retrieve information about the weather, it also needs to remember what you were talking about in the previous messages. This is where the concept of memory comes into play.

LLM memory types

The memory in relation to the LLM chats is basically a history of messages that is provided to the chat as the input along with the most recent user message, to generate a meaningful response. Thus, the simplest form of memory is just the whole history of messages which is called conversation buffer memory.

The problem with this type of memory is that there is a limit on the number of input tokens the model can handle. Another consideration is the cost associated with each input token. A simple solution is the conversation buffer window memory type, which only uses the latest k messages from the chat history. The obvious problem here is that the chat would not be capable of remembering the long history of the conversation, and the memory would only be short-term.

This problem can be solved by another popular type of memory: conversation summary memory. At each step, another LLM call summarizes the most recent message together with the previous summary (or the previous message, if no summary exists yet). Then, the generated summary is used in the conversation in place of the actual history of the messages. The generated summary is short, and now the chat is capable of having a long memory without using too many tokens. However, a major drawback of conversation summary memory is that important details can be lost.

To handle both of these problems, a combination of the previous two approaches was introduced: the conversation summary buffer memory type. It uses the window approach with the specified window size, but instead of deleting the messages older than the window, it triggers the summary generation for them. Thus, the summary is used for long-term memory while the actual messages are used for short-term memory. This way, the chat would remember what you were talking about a long time ago, as well as the details of the most recent conversation. Another advantage of this approach is that the summary call is not always triggered, as it depends on the chat length. Because generating the summary requires another call to the LLM, skipping it while the chat is still short often saves time and cost.
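The logic of the summary buffer can be sketched in a few lines of plain Python. This is a simplified illustration and not Langchain's implementation: the `SummaryBufferMemory` class is hypothetical, tokens are approximated by counting words, and a stub function stands in for the LLM call that would write the summary.

```python
def count_tokens(text):
    """Rough proxy for a tokenizer: one token per word."""
    return len(text.split())


class SummaryBufferMemory:
    """Keeps recent messages verbatim and condenses older ones."""

    def __init__(self, summarize, max_token_limit):
        self.summarize = summarize  # callable(previous_summary, older_messages)
        self.max_token_limit = max_token_limit
        self.summary = ""
        self.buffer = []

    def add_message(self, message):
        self.buffer.append(message)
        pruned = []
        # Move the oldest messages out of the window while it is too long.
        while (
            len(self.buffer) > 1
            and sum(count_tokens(m) for m in self.buffer) > self.max_token_limit
        ):
            pruned.append(self.buffer.pop(0))
        # The (costly) summary call happens only when something was pruned.
        if pruned:
            self.summary = self.summarize(self.summary, pruned)

    def context(self):
        """What would be sent to the chat model as history."""
        prefix = []
        if self.summary:
            prefix = [f"Summary of earlier conversation: {self.summary}"]
        return prefix + self.buffer


calls = []  # Records how many messages each summary call received


def stub_summarize(previous_summary, older_messages):
    """Stand-in for an LLM summarization call."""
    calls.append(len(older_messages))
    note = f"[{len(older_messages)} older message(s) condensed]"
    return f"{previous_summary} {note}".strip()


memory = SummaryBufferMemory(stub_summarize, max_token_limit=12)
messages = [
    "user: Is it safe to drive to Las Vegas?",
    "assistant: Yes, the sky is clear.",
    "user: And on to New York?",
]
for message in messages:
    memory.add_message(message)

print(memory.context())
```

In this run the summarizer is called only once, when the second message pushes the buffer past the limit, which mirrors the time and cost savings described above.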

Other types of memory also exist, but they are more specific and less related to the weather assistant example we discussed. For the complete list of the memory types available in Langchain, refer to this documentation page.

Example: a driving assistant using weather data

In our example, we will use the OpenWeatherMap API to retrieve real-time weather data. Before the example can be run, it is necessary to register on this website. Then, activate the subscription and create an API key. The API key will be activated within 2 hours of the subscription. You will be asked to provide payment info, but a free tier is available as long as you don’t exceed its request limit. Use this API key to run the example.

import requests
from pydantic.v1 import BaseModel, Field
from langchain.chat_models import ChatOpenAI
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.agents import AgentExecutor
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.tools.render import format_tool_to_openai_function
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.memory import ConversationSummaryBufferMemory
from langchain.tools import StructuredTool

# Put your OpenAI API key here
OPENAI_API_KEY = "..."

# Put your OpenWeatherMap API key here (see instructions above)
OPEN_WEATHER_MAP_API_KEY = "..."

Define the method used in the agent

First, we define the function that we need.

Note that we are using type annotations.
This will help Langchain to properly convert the Python function to Langchain Structured Tool and to represent it as an OpenAI function in OpenAI API.

def get_weather_for_city(city_name: str, units: str="imperial") -> dict:
    """
    Fetches weather data for a specified city.

    Parameters:
    - city_name (str): The name of the city.
    - units (str): Units of measurement. "metric" for Celsius, "imperial" for Fahrenheit.

    Returns:
    - dict: Weather data for the city.
    """
    base_url = "https://api.openweathermap.org/data/2.5/weather"
    # Let requests build and URL-encode the query string
    params = {"appid": OPEN_WEATHER_MAP_API_KEY, "q": city_name, "units": units}

    response = requests.get(base_url, params=params)

    if response.status_code == 200:
        return response.json()
    else:
        return {"error": "Failed to fetch data"}

Let's call the function to test if it works correctly and examine what the output looks like.

This is the output that the agent will use to generate the response.

As this is just an example, you can modify the API call to retrieve the necessary data. Use this API reference.

city_name = "London"
weather_data = get_weather_for_city(city_name)
print(weather_data)

{'coord': {'lon': -0.1257, 'lat': 51.5085}, 'weather': [{'id': 804, 'main': 'Clouds', 'description': 'overcast clouds', 'icon': '04n'}], 'base': 'stations', 'main': {'temp': 54.34, 'feels_like': 53.83, 'temp_min': 52.68, 'temp_max': 55.27, 'pressure': 1012, 'humidity': 93}, 'visibility': 10000, 'wind': {'speed': 9.22, 'deg': 200}, 'clouds': {'all': 100}, 'dt': 1707954149, 'sys': {'type': 2, 'id': 2075535, 'country': 'GB', 'sunrise': 1707895066, 'sunset': 1707930712}, 'timezone': 0, 'id': 2643743, 'name': 'London', 'cod': 200}
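Since the full response contains many fields that are irrelevant for driving advice, one option is to condense it before handing it to the agent, which also reduces the number of tokens used. The `condense_weather` helper below is a hypothetical sketch that is not part of the original example, and the choice of fields is an assumption about what matters for driving conditions.

```python
def condense_weather(data):
    """Keep only the fields that are likely relevant for driving conditions."""
    if "error" in data:
        return data  # Pass error responses through unchanged
    weather = data.get("weather") or [{}]
    main = data.get("main", {})
    wind = data.get("wind", {})
    return {
        "city": data.get("name"),
        "conditions": weather[0].get("description"),
        "temperature": main.get("temp"),
        "visibility_m": data.get("visibility"),  # Reported in meters
        "wind_speed": wind.get("speed"),
        "wind_gust": wind.get("gust"),  # Not present in every response
    }


# Trimmed copy of the London response shown above.
sample_response = {
    "weather": [{"id": 804, "main": "Clouds", "description": "overcast clouds"}],
    "main": {"temp": 54.34, "feels_like": 53.83, "humidity": 93},
    "visibility": 10000,
    "wind": {"speed": 9.22, "deg": 200},
    "name": "London",
    "cod": 200,
}

condensed = condense_weather(sample_response)
print(condensed)
```

The agent in this article works with the raw response, so this step is optional.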

Define Pydantic arguments schema for these methods

To better convert Python functions to Langchain Tools, I found it helpful to also describe their inputs using Pydantic classes.

Those will be passed together with the function as arguments to the Langchain method that creates Tools from Python functions.

Note that Pydantic v1 is used here because, at the time of writing, Pydantic v2 is not yet supported by Langchain.

class GetWeatherForCityInput(BaseModel):
    """
    Pydantic schema for the get_weather_for_city function inputs.
    """
    city_name: str = Field(..., description="The name of the city for which to fetch weather data.")
    units: str = Field(default="imperial", description="Units of measurement. Use 'metric' for Celsius or 'imperial' for Fahrenheit. Defaults to 'imperial'.")

Define prompts

We will use two input prompts: a system prompt and a user input prompt.

In this case, the system prompt describes what needs to be done, and the user initialization prompt contains the question in it.

We describe in detail what needs to be done in the system prompt.

Also, we will pass the chat history (memory) in the user prompt.

system_init_prompt = """
You are a driving assistant capable of accessing weather data in any location. 
With this weather data, you provide detailed information about how safe it would be to drive in this location.
If two locations are provided, you also check two or three locations between them to make sure the entire road is good to drive.
"""

user_init_prompt = """
Chat history is: {}.
The question is: {}. 
Go!
"""

Define parts of the agent using LCEL

Here, we define the parts used in the agent and create the agent and the agent executor.

First, we create the LLM object from the ChatOpenAI class for the OpenAI API. We pass the OpenAI API key here as a parameter.

We then initialize the memory object to be used with the agent. We are using the conversation summary buffer memory type. Note that we are using GPT-3.5 instead of GPT-4 here because it is faster, cheaper, and good enough for text summarization tasks. Also, we set max token limit to 1024 tokens. This parameter determines the messages history length before we start summarizing the messages. Increase it to have a longer window of the memory, and decrease it to shorten it.

Then, we create a tools list from the Python function. Here, we use the from_function() method of StructuredTool to create the Tools. The Tools are combined in a list, and then the bind() method is used to add them to the LLM object that we created above.

In the next step, we initialize the prompt object from the prompt messages that we defined above. It contains the system prompt and a formatted user init prompt.

The agent is defined using LCEL, which is a recommended way to define chains and agents in Langchain. This article describes why. The agent combines input formatting, the prompt, the LLM with tools, and a parser. In the case of OpenAI functions, it is convenient to use OpenAIFunctionsAgentOutputParser right out of the box, as we do here.

Finally, we initialize the agent executor and set verbose to True to display intermediate steps. This will help us to understand how reasoning works in Langchain Agents. The agent executor is now initialized with the 'memory' parameter that automatically loads, parses, and updates memory variables.

# Initialize the LLM
llm = ChatOpenAI(
    temperature=0.5,
    model_name="gpt-4",
    openai_api_key=OPENAI_API_KEY,
)

# Initialize the memory: conversation summary buffer
memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(
        model_name="gpt-3.5-turbo", # Use a cheaper model to summarize the history
        openai_api_key=OPENAI_API_KEY,
    ),
    memory_key="chat_history", # What dict key to use to parse in the agent
    return_messages=True,
    max_token_limit=1024, # The bigger the limit, the more unsummarized messages
)

# Initialize the tools
tools = [
    StructuredTool.from_function(
        func=get_weather_for_city,
        args_schema=GetWeatherForCityInput,
        description="Function to get weather for specified city.",
    ), 
]
llm_with_tools = llm.bind(
    functions=[format_tool_to_openai_function(t) for t in tools]
)

# Initialize the prompt
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_init_prompt),
        ("user", user_init_prompt.format("{chat_history}", "{input}")),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ],
)

# Initialize agent
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(
            x["intermediate_steps"]
        ),
        "chat_history": lambda x: x["chat_history"],
    }
    | prompt
    | llm_with_tools
    | OpenAIFunctionsAgentOutputParser()
)

# Initialize the agent executor
agent_executor = AgentExecutor(agent=agent, 
                               tools=tools, 
                               memory=memory,
                               verbose=True)

Run the chat with the agent executor

Let's run the chat to see how it works.

We start by asking for the driving recommendations for the trip from San Francisco to Las Vegas and get a response from the agent.

Then, we tell it that we would like to proceed to New York. If the memory works correctly, the chat assistant should remember the previous conversation and provide driving recommendations from Las Vegas to New York.

Finally, we confirm that the memory is working correctly by asking to summarize the whole trip. We type 'exit' to leave the chat.

print("Welcome to the chatbot. Type 'exit' to leave the chat.")

while True:
    user_message = input("You: ")
    if user_message.lower() == "exit":
        print("Exiting chat. Have a great day!")
        break

    response = agent_executor.invoke({"input": user_message})
    response = response.get("output")

    print(f"Chatbot: {response}")

Welcome to the chatbot. Type 'exit' to leave the chat.

You:  How safe is it to drive from San Francisco to Las Vegas? 

Entering new AgentExecutor chain...

Invoking: get_weather_for_city with {'city_name': 'San Francisco', 'units': 'imperial'}

{'coord': {'lon': -122.4194, 'lat': 37.7749}, 'weather': [{'id': 804, 'main': 'Clouds', 'description': 'overcast clouds', 'icon': '04d'}], 'base': 'stations', 'main': {'temp': 54.72, 'feels_like': 54.03, 'temp_min': 52.32, 'temp_max': 57.88, 'pressure': 1015, 'humidity': 88}, 'visibility': 10000, 'wind': {'speed': 20, 'deg': 142, 'gust': 28.99}, 'clouds': {'all': 100}, 'dt': 1707954565, 'sys': {'type': 2, 'id': 2003880, 'country': 'US', 'sunrise': 1707922868, 'sunset': 1707961609}, 'timezone': -28800, 'id': 5391959, 'name': 'San Francisco', 'cod': 200}

Invoking: get_weather_for_city with {'city_name': 'Las Vegas', 'units': 'imperial'}

{'coord': {'lon': -115.1372, 'lat': 36.175}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01d'}], 'base': 'stations', 'main': {'temp': 62.74, 'feels_like': 60.22, 'temp_min': 61.11, 'temp_max': 64.83, 'pressure': 1017, 'humidity': 32}, 'visibility': 10000, 'wind': {'speed': 10, 'deg': 90, 'gust': 0}, 'clouds': {'all': 0}, 'dt': 1707954794, 'sys': {'type': 2, 'id': 2083590, 'country': 'US', 'sunrise': 1707920986, 'sunset': 1707959997}, 'timezone': -28800, 'id': 5506956, 'name': 'Las Vegas', 'cod': 200}

Invoking: get_weather_for_city with {'city_name': 'Bakersfield', 'units': 'imperial'}

{'coord': {'lon': -119.0187, 'lat': 35.3733}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01d'}], 'base': 'stations', 'main': {'temp': 63.07, 'feels_like': 61.81, 'temp_min': 56.88, 'temp_max': 64.15, 'pressure': 1017, 'humidity': 58}, 'visibility': 10000, 'wind': {'speed': 8.05, 'deg': 320}, 'clouds': {'all': 0}, 'dt': 1707954993, 'sys': {'type': 2, 'id': 2019205, 'country': 'US', 'sunrise': 1707921850, 'sunset': 1707960995}, 'timezone': -28800, 'id': 5325738, 'name': 'Bakersfield', 'cod': 200}

Invoking: get_weather_for_city with {'city_name': 'Fresno', 'units': 'imperial'}

{'coord': {'lon': -119.8343, 'lat': 36.6666}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01d'}], 'base': 'stations', 'main': {'temp': 63.03, 'feels_like': 61.39, 'temp_min': 61.45, 'temp_max': 64, 'pressure': 1019, 'humidity': 50}, 'visibility': 10000, 'wind': {'speed': 9.22, 'deg': 170}, 'clouds': {'all': 0}, 'dt': 1707954426, 'sys': {'type': 1, 'id': 4068, 'country': 'US', 'sunrise': 1707922153, 'sunset': 1707961083}, 'timezone': -28800, 'id': 5350964, 'name': 'Fresno', 'cod': 200}

The weather conditions for the drive from San Francisco to Las Vegas are as follows:


San Francisco: The weather is overcast with clouds and the temperature is around 54.72°F. The wind speed is quite high at 20 mph, gusting up to 28.99 mph. Visibility is good at 10,000 meters.
Bakersfield: The weather is clear with a temperature of 63.07°F. The wind speed is moderate at 8.05 mph. Visibility is good at 10,000 meters.
Fresno: The weather is clear with a temperature of 63.03°F. The wind speed is moderate at 9.22 mph. Visibility is good at 10,000 meters.
Las Vegas: The weather is clear with a temperature of 62.74°F. The wind speed is moderate at 10 mph. Visibility is good at 10,000 meters.


Overall, it seems safe to drive from San Francisco to Las Vegas considering the weather conditions. However, the high wind speed in San Francisco might require extra caution. Please also consider the road conditions and traffic situation, and drive safely.

Finished chain.

Chatbot: The weather conditions for the drive from San Francisco to Las Vegas are as follows:


San Francisco: The weather is overcast with clouds and the temperature is around 54.72°F. The wind speed is quite high at 20 mph, gusting up to 28.99 mph. Visibility is good at 10,000 meters.
Bakersfield: The weather is clear with a temperature of 63.07°F. The wind speed is moderate at 8.05 mph. Visibility is good at 10,000 meters.
Fresno: The weather is clear with a temperature of 63.03°F. The wind speed is moderate at 9.22 mph. Visibility is good at 10,000 meters.
Las Vegas: The weather is clear with a temperature of 62.74°F. The wind speed is moderate at 10 mph. Visibility is good at 10,000 meters.


Overall, it seems safe to drive from San Francisco to Las Vegas considering the weather conditions. However, the high wind speed in San Francisco might require extra caution. Please also consider the road conditions and traffic situation, and drive safely.

You:  Then I would like to proceed to New York

Entering new AgentExecutor chain...

Invoking: get_weather_for_city with {'city_name': 'New York'}

{'coord': {'lon': -74.006, 'lat': 40.7143}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01n'}], 'base': 'stations', 'main': {'temp': 32.76, 'feels_like': 21.88, 'temp_min': 28.94, 'temp_max': 35.22, 'pressure': 1020, 'humidity': 41}, 'visibility': 10000, 'wind': {'speed': 17, 'deg': 314, 'gust': 17}, 'clouds': {'all': 0}, 'dt': 1707954762, 'sys': {'type': 2, 'id': 2008776, 'country': 'US', 'sunrise': 1707911526, 'sunset': 1707949714}, 'timezone': -18000, 'id': 5128581, 'name': 'New York', 'cod': 200}

Invoking: get_weather_for_city with {'city_name': 'Las Vegas'}

{'coord': {'lon': -115.1372, 'lat': 36.175}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01d'}], 'base': 'stations', 'main': {'temp': 62.73, 'feels_like': 60.21, 'temp_min': 61.11, 'temp_max': 64.83, 'pressure': 1017, 'humidity': 32}, 'visibility': 10000, 'wind': {'speed': 10, 'deg': 90, 'gust': 0}, 'clouds': {'all': 0}, 'dt': 1707954946, 'sys': {'type': 2, 'id': 2083590, 'country': 'US', 'sunrise': 1707920986, 'sunset': 1707959997}, 'timezone': -28800, 'id': 5506956, 'name': 'Las Vegas', 'cod': 200}

Invoking: get_weather_for_city with {'city_name': 'Denver'}

{'coord': {'lon': -104.9847, 'lat': 39.7392}, 'weather': [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04d'}], 'base': 'stations', 'main': {'temp': 47.62, 'feels_like': 45.43, 'temp_min': 41.07, 'temp_max': 50.18, 'pressure': 1013, 'humidity': 37}, 'visibility': 10000, 'wind': {'speed': 5.01, 'deg': 0, 'gust': 11.99}, 'clouds': {'all': 75}, 'dt': 1707954392, 'sys': {'type': 2, 'id': 2004334, 'country': 'US', 'sunrise': 1707918863, 'sunset': 1707957246}, 'timezone': -25200, 'id': 5419384, 'name': 'Denver', 'cod': 200}

Invoking: get_weather_for_city with {'city_name': 'Chicago'}

{'coord': {'lon': -87.65, 'lat': 41.85}, 'weather': [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04n'}], 'base': 'stations', 'main': {'temp': 41.77, 'feels_like': 40.37, 'temp_min': 38.17, 'temp_max': 45.18, 'pressure': 1022, 'humidity': 54}, 'visibility': 10000, 'wind': {'speed': 3, 'deg': 195, 'gust': 3}, 'clouds': {'all': 81}, 'dt': 1707955040, 'sys': {'type': 2, 'id': 2005153, 'country': 'US', 'sunrise': 1707914907, 'sunset': 1707952881}, 'timezone': -21600, 'id': 4887398, 'name': 'Chicago', 'cod': 200} 

The weather conditions for the drive from Las Vegas to New York, passing through Denver and Chicago, are as follows:


Las Vegas: The weather is clear with a temperature of 62.73°F. The wind speed is moderate at 10 mph. Visibility is good at 10,000 meters.
Denver: The weather is cloudy with a temperature of 47.62°F. The wind speed is light at 5.01 mph. Visibility is good at 10,000 meters.
Chicago: The weather is cloudy with a temperature of 41.77°F. The wind speed is light at 3 mph. Visibility is good at 10,000 meters.
New York: The weather is clear with a temperature of 32.76°F. The wind speed is high at 17 mph. Visibility is good at 10,000 meters.


Overall, it seems safe to drive from Las Vegas to New York considering the weather conditions. However, the high wind speed in New York might require extra caution. Please also consider the road conditions and traffic situation, and drive safely.

Finished chain.

Chatbot: The weather conditions for the drive from Las Vegas to New York, passing through Denver and Chicago, are as follows:


Las Vegas: The weather is clear with a temperature of 62.73°F. The wind speed is moderate at 10 mph. Visibility is good at 10,000 meters.
Denver: The weather is cloudy with a temperature of 47.62°F. The wind speed is light at 5.01 mph. Visibility is good at 10,000 meters.
Chicago: The weather is cloudy with a temperature of 41.77°F. The wind speed is light at 3 mph. Visibility is good at 10,000 meters.
New York: The weather is clear with a temperature of 32.76°F. The wind speed is high at 17 mph. Visibility is good at 10,000 meters.


Overall, it seems safe to drive from Las Vegas to New York considering the weather conditions. However, the high wind speed in New York might require extra caution. Please also consider the road conditions and traffic situation, and drive safely.

You:  So, summarize the whole trip

Entering new AgentExecutor chain...

The weather conditions for the entire trip from San Francisco to New York, passing through Las Vegas, Denver, and Chicago, are as follows:


San Francisco: Overcast with clouds, temperature around 54.72°F, high wind speed at 20 mph, gusting up to 28.99 mph, good visibility at 10,000 meters.
Bakersfield: Clear weather, temperature of 63.07°F, moderate wind speed at 8.05 mph, good visibility at 10,000 meters.
Fresno: Clear weather, temperature of 63.03°F, moderate wind speed at 9.22 mph, good visibility at 10,000 meters.
Las Vegas: Clear weather, temperature of 62.74°F, moderate wind speed at 10 mph, good visibility at 10,000 meters.
Denver: Cloudy, temperature of 47.62°F, light wind speed at 5.01 mph, good visibility at 10,000 meters.
Chicago: Cloudy, temperature of 41.77°F, light wind speed at 3 mph, good visibility at 10,000 meters.
New York: Clear weather, temperature of 32.76°F, high wind speed at 17 mph, good visibility at 10,000 meters.


Overall, the weather conditions seem safe for driving. However, high wind speeds in San Francisco and New York may require extra caution. It's important to also consider road conditions and traffic situation. Safe journey!

Finished chain.

Chatbot: The weather conditions for the entire trip from San Francisco to New York, passing through Las Vegas, Denver, and Chicago, are as follows:


San Francisco: Overcast with clouds, temperature around 54.72°F, high wind speed at 20 mph, gusting up to 28.99 mph, good visibility at 10,000 meters.
Bakersfield: Clear weather, temperature of 63.07°F, moderate wind speed at 8.05 mph, good visibility at 10,000 meters.
Fresno: Clear weather, temperature of 63.03°F, moderate wind speed at 9.22 mph, good visibility at 10,000 meters.
Las Vegas: Clear weather, temperature of 62.74°F, moderate wind speed at 10 mph, good visibility at 10,000 meters.
Denver: Cloudy, temperature of 47.62°F, light wind speed at 5.01 mph, good visibility at 10,000 meters.
Chicago: Cloudy, temperature of 41.77°F, light wind speed at 3 mph, good visibility at 10,000 meters.
New York: Clear weather, temperature of 32.76°F, high wind speed at 17 mph, good visibility at 10,000 meters.


Overall, the weather conditions seem safe for driving. However, high wind speeds in San Francisco and New York may require extra caution. It's important to also consider road conditions and traffic situation. Safe journey!

You:  exit

Exiting chat. Have a great day!

Conclusion

We have created an OpenAI LLM agent that invokes a function to retrieve weather data and uses it to provide driving assistance to the user. Different types of LLM memory have been considered, and conversation summary buffer memory was integrated into the example. As a result, the chat now has an efficient memory at a lower cost.
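To make the idea behind conversation summary buffer memory more concrete, here is a toy sketch in plain Python. It is not LangChain's implementation: the class name `SummaryBufferSketch`, the word-based budget (instead of tokens), and the `naive_summarize` stand-in for an LLM summarization call are all assumptions made for illustration.

```python
from typing import Callable, List


class SummaryBufferSketch:
    """Keep recent messages verbatim and fold older ones into a summary."""

    def __init__(self, summarize: Callable[[str, List[str]], str], max_words: int = 30):
        self.summarize = summarize      # hypothetical summarizer (an LLM call in practice)
        self.max_words = max_words      # budget for the verbatim buffer, in words
        self.summary = ""
        self.buffer: List[str] = []

    def _buffer_words(self) -> int:
        return sum(len(message.split()) for message in self.buffer)

    def add(self, message: str) -> None:
        self.buffer.append(message)
        evicted = []
        # Evict the oldest messages until the buffer fits the budget,
        # always keeping at least the most recent message.
        while self._buffer_words() > self.max_words and len(self.buffer) > 1:
            evicted.append(self.buffer.pop(0))
        if evicted:
            self.summary = self.summarize(self.summary, evicted)

    def context(self) -> str:
        """Text that would be placed into the prompt as chat history."""
        parts = [f"Summary: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.buffer)


def naive_summarize(previous: str, evicted: List[str]) -> str:
    # Stand-in for an LLM: keep the first four words of each evicted message.
    pieces = [previous] if previous else []
    pieces += [" ".join(message.split()[:4]) + "..." for message in evicted]
    return " ".join(pieces)


memory = SummaryBufferSketch(naive_summarize, max_words=12)
memory.add("You: How safe is it to drive from San Francisco to Las Vegas?")
memory.add("Chatbot: It seems safe, but watch the wind in San Francisco.")
memory.add("You: Then I would like to proceed to New York")
print(memory.context())
```

Only the latest message stays verbatim in this run; the two older ones are compressed into the summary, which is what keeps the prompt short as the conversation grows.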

Using OpenAI Functions with Langchain Agents

Ilya Fastovets — Thu, 18 Jan 2024 03:11:21 +0000

About the author

My name is Ilya Fastovets, and I am a data scientist. My primary area of expertise is machine learning tools for agriculture. I find this field particularly exciting because it helps to optimize the production of food which directly affects nature and people’s lives. What makes it even more interesting is that this field combines some other fields of science, such as Biology, Chemistry, Physics, and Soil Science. After GPT-3 and GPT-4 models were released, I became particularly interested in their decision-making capabilities, and how they can be applied to solve real-world problems. We noticed that the sales leaders have a special interest in understanding their company’s data using natural language. This is how Datalynx was started. There, I am working on exploring the decision-making capabilities of LLMs to come up with a solution to this problem.

Introduction

With the release of the GPT-3 Large Language Model (LLM), OpenAI has revolutionized the machine learning space. Many people have found it useful for solving tasks related to text, such as composing emails, writing reports, and generating code. However, LLMs have more hidden capabilities that many people overlook. In particular, they can power reasoning and decision-making chats that ‘talk to themselves’ to break down complex problems into simple ones, trigger actions, and analyze the output until the desired result is achieved. These tasks may not even be related to text. An example could be a smart home. First, the voice recognition system accepts your voice command (e.g., ‘make it more cosy’). The command is converted into text by a speech-to-text model. Then, an LLM is used to get the temperature and lighting conditions from the sensors, analyze their output, and take action to make the atmosphere in the house more cosy. In this example, the task is not related to text, yet an LLM is used in the backend for decision-making. But how can this be achieved? This is where Langchain comes into play, with its powerful Agent capabilities. In this short article, we will create a simple working example to demonstrate how it works.

What is Langchain?

According to the official documentation, ‘Langchain is a framework for developing applications powered by language models’. This is a higher-level framework that is often used instead of or in addition to the official OpenAI API for Python, to utilize the full capabilities of GPT-4 LLM. To achieve this, several new high-level concepts were introduced, such as Chains, Agents, and Memory. In this article, I will focus on the Agents and specifically on how they can be used to work with OpenAI function calls for decision-making.

What is a function call in OpenAI?

Apart from raw text generation, newer OpenAI chat completion models are capable of generating output in a structured format. Given a detailed description of a function in the input prompt, the model can automatically decide whether to generate a ‘free’ text output or to ‘call a function’. If the model decides to call a function, a JSON object is returned in the output. It contains the name of the function and the parameters to call the function with. Multiple functions can be described, and the model decides which function to use and when. This capability is extended in the idea of Agents in Langchain.
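As a rough illustration of the paragraph above, the sketch below shows what a function description and a function-call message can look like, and how the application side might dispatch on it. No API request is made: `get_room_temperature`, the hand-written `EXAMPLE_MESSAGE`, and `handle_message` are hypothetical and only mimic the shape of the data.

```python
import json

# Description of a function in the JSON-schema style used for OpenAI
# function calling. The function itself is hypothetical.
FUNCTION_SPEC = {
    "name": "get_room_temperature",
    "description": "Get the current temperature of a room in the house.",
    "parameters": {
        "type": "object",
        "properties": {
            "room": {"type": "string", "description": "Name of the room"},
        },
        "required": ["room"],
    },
}

# Hand-written example of an assistant message that 'calls a function'.
# The arguments arrive as a JSON-encoded string, not as a dict.
EXAMPLE_MESSAGE = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "get_room_temperature",
        "arguments": '{"room": "living room"}',
    },
}


def get_room_temperature(room: str) -> str:
    """Dummy implementation that returns a fixed reading."""
    return f"21.5 C in the {room}"


AVAILABLE_FUNCTIONS = {"get_room_temperature": get_room_temperature}


def handle_message(message: dict) -> str:
    """Return free text as-is, or run the function the model asked for."""
    call = message.get("function_call")
    if call is None:
        return message["content"]
    func = AVAILABLE_FUNCTIONS[call["name"]]
    kwargs = json.loads(call["arguments"])
    return func(**kwargs)


result = handle_message(EXAMPLE_MESSAGE)
print(result)
```

The model never executes anything itself: it only names the function and its arguments, and the calling code decides whether and how to run it.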

What is a Langchain Agent?

A Langchain Agent is a reasoning engine that can choose a sequence of actions to take. It is a flexible concept and can be used for solving various problems. An easy way to think about it is as a model that chats with itself to solve a particular problem. It is a general concept suitable for various LLMs, but in the case of OpenAI models, it works by utilizing OpenAI function calls. To achieve this, a special parser is used. The parser analyzes the output of the model and decides whether the agent is taking the next action (AgentAction object) or is finished (AgentFinish object). The output of the agent is returned to the chat when the parser returns an AgentFinish object. Another concept that needs to be mentioned in this regard is a Tool. In our case, a Langchain Tool is just another representation of an OpenAI function call that can be utilized by an Agent. Now, let us use a simple example to demonstrate how to define Tools and an Agent, and how to execute the Agent to solve a particular problem with these tools using the parser for OpenAI function calls.
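The action/finish loop described above can be sketched without Langchain at all. The code below is a simplified stand-in, not the library's implementation: `Action`, `Finish`, `scripted_model`, `parse`, and `run_agent` are hypothetical names, and the 'model' is a scripted function rather than an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Action:
    """The agent wants to call a tool (plays the role of AgentAction)."""
    tool: str
    tool_input: dict


@dataclass
class Finish:
    """The agent is done (plays the role of AgentFinish)."""
    output: str


Step = Tuple[Action, str]  # (action taken, observation returned by the tool)


def scripted_model(question: str, scratchpad: List[Step]) -> dict:
    """Stand-in for the LLM: picks the next move from the steps so far."""
    if not scratchpad:
        return {"function_call": {"name": "to_upper", "arguments": {"text": question}}}
    last_observation = scratchpad[-1][1]
    return {"content": f"Done: {last_observation}"}


def parse(model_output: dict) -> Union[Action, Finish]:
    """Decide whether the output is a tool call or the final answer."""
    call = model_output.get("function_call")
    if call is not None:
        return Action(tool=call["name"], tool_input=call["arguments"])
    return Finish(output=model_output["content"])


def run_agent(question: str, tools: Dict[str, Callable[..., str]], max_steps: int = 5) -> str:
    scratchpad: List[Step] = []  # history of intermediate steps
    for _ in range(max_steps):
        decision = parse(scripted_model(question, scratchpad))
        if isinstance(decision, Finish):
            return decision.output
        observation = tools[decision.tool](**decision.tool_input)
        scratchpad.append((decision, observation))
    raise RuntimeError("Agent did not finish within max_steps")


tools = {"to_upper": lambda text: text.upper()}
answer = run_agent("hello", tools)
print(answer)
```

The `max_steps` guard is a design choice of this sketch: without some limit, a model that keeps requesting tools would loop forever.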

Example

In the example below, I will provide a demonstration of how an agent can be created in Langchain. In this simple example, I will use two dummy methods. The first method will retrieve the full name of an imaginary customer from their first name by simply attaching the last name 'Smith' to it. The second method will use the full name retrieved by the first method to get the email of the customer. This will be done by simply attaching '@gmail.com' to it. Then I will ask the question 'What is the full name and email of our customer John?'. The agent should call the first method with the first name to retrieve the full name, and then sequentially call the second method with the output of the first method to get the email of the customer. Finally, the agent should combine this information in an answer, and promptly stop itself at this final step. Let's proceed with this example.

Example: retrieving customer data using a Langchain Agent

We start by importing the necessary modules and adding our OpenAI API key

from pydantic.v1 import BaseModel, Field
from langchain.chat_models import ChatOpenAI
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.agents import AgentExecutor
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.tools.render import format_tool_to_openai_function
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.tools import StructuredTool

# Put your OpenAI API key here
OPENAI_API_KEY = "..."

Define the methods used in the agent

First, we define the functions that we need.

Note that we are using type annotations.
This helps Langchain to properly convert the Python functions to Langchain Tools and to represent them as OpenAI functions in the OpenAI API.

def get_customer_full_name(first_name: str) -> str:
    """
    Retrieve customer's full name given the customer first name.

    Args:
        first_name (str): The first name of the customer. 

    Returns: 
        str: The full name of the customer.
    """
    full_name = first_name + "_Smith"
    return full_name

def get_customer_email(full_name: str) -> str:
    """
    Retrieve customer email given the full name of the customer. 

    Args: 
        full_name (str): The full name of the customer.

    Returns:
        str: The email of the customer.
    """
    email = full_name.lower() + "@gmail.com"
    return email
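To see why the type annotations and docstrings matter, here is a small standard-library sketch that derives a function description from them. It only illustrates the idea; Langchain's actual conversion is more involved, and `describe_function` and `PYTHON_TO_JSON` are hypothetical helpers.

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python types to JSON-schema type names (assumption:
# only these simple types are needed for the illustration).
PYTHON_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_function(func) -> dict:
    """Build an OpenAI-style function description from annotations."""
    hints = get_type_hints(func)
    hints.pop("return", None)  # the return type is not an input parameter
    properties = {
        name: {"type": PYTHON_TO_JSON.get(tp, "string")} for name, tp in hints.items()
    }
    signature = inspect.signature(func)
    required = [
        name
        for name, param in signature.parameters.items()
        if param.default is inspect.Parameter.empty
    ]
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.splitlines()[0] if doc else "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def get_customer_email(full_name: str) -> str:
    """Retrieve customer email given the full name of the customer."""
    return full_name.lower() + "@gmail.com"


spec = describe_function(get_customer_email)
print(spec)
```

Without the `full_name: str` annotation, the helper would have nothing to put under `properties`, which is the kind of information the model needs to fill in the arguments.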

Define Pydantic arguments schema for these methods

To better convert Python functions to Langchain Tools, I found it helpful to also describe their inputs using Pydantic classes.

Those will be passed together with the function as arguments to the Langchain method that creates Tools from Python functions.

Note that Pydantic v1 is used here because, at the time of writing, Pydantic v2 was not yet supported.

class GetCustomerFullNameInput(BaseModel):
    """
    Pydantic arguments schema for get_customer_full_name method
    """
    first_name: str = Field(..., description="The first name of the customer")

class GetCustomerEmailInput(BaseModel):
    """
    Pydantic arguments schema for get_customer_email method
    """
    full_name: str = Field(..., description="The full name of the customer")

Define prompts

We will use two input prompts: a system prompt and a user input prompt.

In this case, the system prompt describes what needs to be done, and the user initialization prompt contains the question in it.

system_init_prompt = """
You are a shop manager capable of retrieving full names and emails of the customers. 
Given the question, answer it to the best of your abilities.
"""

user_init_prompt = """
The question is: {}. 
Go!
"""
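The user prompt above uses an anonymous `{}` placeholder, and later in the article it is formatted with the literal string "{input}". The short sketch below shows what that two-stage formatting produces. Plain `str.format` is used here as a stand-in for the later step in which the prompt template fills in the user's question.

```python
user_init_prompt = """
The question is: {}. 
Go!
"""

# Stage 1: replace the anonymous placeholder with a *named* placeholder.
# The inserted text is not re-parsed, so "{input}" survives literally.
template = user_init_prompt.format("{input}")
print(template)

# Stage 2 (stand-in for the prompt template): fill the named placeholder.
final = template.format(input="What is the full name and email of our customer John?")
print(final)
```

Writing `{input}` directly in the prompt string would work just as well; the two-stage form simply keeps the prompt text independent of the variable name used by the agent.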

Define parts of the agent using LCEL

Here, we define the parts used in the agent and create the agent and the agent executor.

First, we create the LLM object from the ChatOpenAI class for the OpenAI API. We pass the OpenAI API key here as a parameter.

Then, we create tools from Python functions. Here, we use a method from StructuredTool to create the Tools. In our case, the functions only have one input, so the use of a structured tool is not required. However, this is the right way to go when functions with multiple inputs are used. The Tools are combined in a list, and then the bind() method is used to add them to the LLM object that we created above.

In the next step, we initialize the prompt object from the prompt messages that we defined above. It contains the system prompt and a formatted user init prompt. Note that it also has a placeholder for 'agent_scratchpad'. This variable is used in the agent to store the history of the agent (intermediate steps) when it is executed.

The agent is defined using LCEL, which is the recommended way to define chains and agents in Langchain. This article describes why: https://python.langchain.com/docs/expression_language/why . The agent combines input formatting, the prompt, the LLM with tools, and a parser. In the case of OpenAI functions, it is convenient to use OpenAIFunctionsAgentOutputParser right out of the box, as we do here.

Finally, we initialize the agent executor and set verbose to True to display intermediate steps. This will help us to understand how reasoning works in Langchain Agents.

# Initialize the LLM
llm = ChatOpenAI(
    temperature=0.5,
    model_name="gpt-4",
    openai_api_key=OPENAI_API_KEY,
)

# Initialize the tools
tools = [
    StructuredTool.from_function(
        func=get_customer_full_name,
        args_schema=GetCustomerFullNameInput,
        description="Function to get customer full name.",
    ), 
    StructuredTool.from_function(
        func=get_customer_email,
        args_schema=GetCustomerEmailInput,
        description="Function to get customer email",
    )
]
llm_with_tools = llm.bind(
    functions=[format_tool_to_openai_function(t) for t in tools]
)

# Initialize the prompt
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_init_prompt),
        ("user", user_init_prompt.format("{input}")),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ],
)

# Initialize agent
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(
            x["intermediate_steps"]
        ),
    }
    | prompt
    | llm_with_tools
    | OpenAIFunctionsAgentOutputParser()
)

# Initialize the agent executor
agent_executor = AgentExecutor(agent=agent, 
                               tools=tools, 
                               verbose=True)
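The `|` composition used to define the agent above can be illustrated with a few lines of plain Python. This is a toy model of the piping idea only, under the assumption that each stage is a single-argument callable; the `Step` class is hypothetical and omits everything else LCEL provides.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Three toy stages mirroring 'input formatting | prompt | llm'.
format_input = Step(lambda x: {"input": x["input"].strip()})
build_prompt = Step(lambda d: f"The question is: {d['input']}")
fake_llm = Step(lambda prompt: prompt.upper())  # stand-in for a model call

chain = format_input | build_prompt | fake_llm
result = chain.invoke({"input": "  who is John? "})
print(result)
```

Each stage only needs to agree with its neighbours on the shape of the value passed along, which is what makes it easy to swap the prompt, the model, or the parser independently.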

Run the chat with the agent executor

The final step is to invoke the agent with the user input message.

This step can also be done in a loop. In this case, it will work like a chat.

user_message = "What is the full name and email of our customer John?"
response = agent_executor.invoke({"input": user_message})
response = response.get("output")
print(f"Response: {response}")

Entering new AgentExecutor chain...

Invoking: get_customer_full_name with {'first_name': 'John'}

Invoking: get_customer_email with {'full_name': 'John_Smith'}

The full name of our customer John is John Smith and his email is john_smith@gmail.com.

Finished chain.

Response: The full name of our customer John is John Smith and his email is john_smith@gmail.com.

Summary and improvements

Langchain Agents are a powerful reasoning and decision-making tool that can be used in various situations, even for tasks that are not text-related. In this simple example, I explained how to set up an Agent and run it to solve a dummy task. The next step could be adding memory to the chat to make it remember things you discussed. This could be the topic of a whole new discussion. Another possible improvement could be the set of methods (tools) to use. Those tools could also be created with LLMs in most cases, and designing them accurately could be crucial for solving your problems.

Related article

Memory in LLM agents
