Txt-to-SQL: Querying Databases with Nebius AI Studio and Agents (part 3)

#agents #nebius #codeassistant #llm

Maximum effort! Not really. We're building an agent to query the database for this installment. It occurred to me that I already had a RAG system to write agent code. And away we go!

Adding agent embeddings

An Internet search leads me to An agent for interacting with a SQL database. I could copy the code, but I'm in a hurry. It's HTML, so I'll have to make a few changes to the ingest script, including downloading the page and converting it to plain text.

The page is dynamically generated, and when I use beautifulsoup to extract the text from the body, e.g.,

content = soup.find(id="body")

I get an empty list. Impatience wins again, and I brute force the HTML parsing by grabbing everything between a <p> tag and extracting only the text.

def html2text(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    text = []
    for p in soup.find_all('p'):
        para = p.get_text()
        text.append(para)

    return str(text)

In the spirit of DRY, I added a separate function to create embeddings for a single document. Previously, creating embeddings was buried in a function that read documents from a directory.

With those changes, I inserted the agent embeddings into the database.

url ="https://langchain-ai.github.io/langgraph/tutorials/sql-agent/#utility-functions"
doc = html2text(url)
embeddings = doc_create_embeddings(doc)
upsert(embeddings, database_conn)

Testing models for code generation

I reused the client from the previous post. Here's the prompt.

rag_query = """Write a Python program that creates a tool and agent 
that queries a Postgres database with a SQL statement and prints the 
result to the console. The program only executes the SQL, and the 
schema is already known."
"""

But hey! I've barely used any of the credits I received for signing up.

It's time to live a little and try some of the big-boy LLMs.

First up is Meta-Llama-3.1-405B-Instruct. This code looks reasonable. It creates an agent and method to execute the tool. A minor nit: neither the os nor json imports are used.

import os
import json
from langchain.llms import AI
from langchain.agents import get_tool_names, initialize_agent, Tool
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_openai import OpenAI

# Initialize the OpenAI LLM
llm = OpenAI(model="text-davinci-003")

# Initialize the Postgres database connection
db = SQLDatabase.from_uri("postgresql://username:password@host:port/dbname")

# Create a tool that executes a SQL query against the database
class SQLQueryTool(Tool):
    name = "sql_query_tool"

    def __call__(self, query: str) -> str:
        try:
            result = db.run_no_throw(query)
            if not result:
                return "Error: Query failed."
            return str(result)
        except Exception as e:
            return f"Error: {str(e)}"

# Create a toolkit with the SQL query tool
toolkit = SQLDatabaseToolkit(db=db, llm=llm, tools=[SQLQueryTool()])

# Initialize the agent
agent = initialize_agent(toolkit=toolkit, llm=llm)

# Define a function that queries the database with a SQL statement
def query_database(sql_statement: str) -> None:
    # Create a tool call with the SQL query
    tool_call = {"name": SQLQueryTool.name, "args": {"query": sql_statement}}

    # Execute the tool call using the agent
    result = agent({"tool_calls": [tool_call]})

    # Print the result to the console
    print(result)

Next up is Qwen2.5-Coder-32B-Instruct. Whoa, complete fail on the task. Qwen did not create an agent and wrote a straightforward program to connect to Postgres and execute the query.

import psycopg2
import openai

# Set up your OpenAI API key
openai.api_key = 'your-openai-api-key'

# Database connection parameters
db_params = {
    'dbname': 'your_database_name',
    'user': 'your_username',
    'password': 'your_password',
    'host': 'your_host',
    'port': 'your_port'  # Default is 5432
}

def connect_to_db(params):
    """Connect to the PostgreSQL database."""
    try:
        connection = psycopg2.connect(**params)
        print("Database connection successful")
        return connection
    except Exception as e:
        print(f"Error connecting to database: {e}")
        return None

def execute_query(connection, query):
    """Execute a SQL query and return the results."""
    try:
        with connection.cursor() as cursor:
            cursor.execute(query)
            results = cursor.fetchall()
            return results
    except Exception as e:
        print(f"Error executing query: {e}")
        return None

def main():
    # Connect to the database
    conn = connect_to_db(db_params)
    if not conn:
        return

    # Example SQL query
    sql_query = "SELECT * FROM your_table_name LIMIT 10;"

    # Execute the query
    results = execute_query(conn, sql_query)
    if results:
        print("Query Results:")
        for row in results:
            print(row)

    # Close the database connection
    conn.close()

if __name__ == "__main__":
    main()

Next, DeepSeek-Coder-V2-Lite-Instruct. This code accomplishes the task of creating the tool and agent, but it also includes code for extracting a table's schema. This is an artifact from the RAG embeddings because the prompt specifies executing only SQL.

from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_openai import OpenAI
from langchain_core.tools import Tool
from langchain_community.utilities import SQLDatabase

# Initialize the database
db = SQLDatabase.from_uri("postgresql://username:password@localhost:5432/dbname")

# Create a toolkit for the database
toolkit = SQLDatabaseToolkit(db=db, llm=OpenAI(model="gpt-4o"))

# Create a tool to execute SQL queries
@tool
def db_query_tool(query: str) -> str:
    """Execute a SQL query against the database and return the results."""
    result = db.run(query)
    if not result:
        return "Error: Query failed."
    return result

# List available tools
tools = toolkit.get_tools()
print(list(map(lambda tool: tool.name, tools)))

# Get schema for a specific table (e.g., 'Artist')
get_schema_tool = next(tool for tool in tools if tool.name == "sql_db_schema")
print(get_schema_tool.invoke("Artist"))

# Example query
query = "SELECT * FROM Artist LIMIT 5;"
result = db_query_tool(query)
print(result)

The results

The models tested didn't produce "copy and paste" code, but they did produce workable code that provided a shortcut to the langchain API. Instantiating the Nebius model with Langchain is a minor change, similar to the OpenAI client.

from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_community.agent_toolkits import create_sql_agent
from langchain_openai import ChatOpenAI
import os


API_KEY = os.environ.get('NEBIUS_API_KEY')

client = ChatOpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=API_KEY,
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
)

# Connect to the PostgreSQL database
db = SQLDatabase.from_uri("postgresql://postgres:postgres@localhost:5432/NORTHWIND")

# Define a tool to execute SQL queries
def db_query_tool(query: str) -> str:
    """Execute a SQL query against the database and return the results."""
    try:
        result = db.run_no_throw(query)
        if not result:
            return "Error: Query failed."
        return result
    except Exception as e:
        return f"Error: {str(e)}"

agent = create_sql_agent(
    llm=client,
    toolkit=SQLDatabaseToolkit(db=db, llm=client),
    verbose=True
)

# SQL query to execute
sql_query = """SELECT country, COUNT(*) AS number_of_suppliers
FROM suppliers
GROUP BY country
ORDER BY country;
"""

# Execute the agent with the SQL query
response = agent.run(sql_query)

# Print the result
print(response)

The output from the agent is verbose and requires formatting in a practical application.

> Entering new SQL Agent Executor chain...
Thought: I should look at the tables in the database to see what I can query. Then I should query the schema of the most relevant tables.
Action: sql_db_list_tables
Action Input: categories, customer_customer_demo, customer_demographics, customers, employee_territories, employees, order_details, orders, products, region, shippers, suppliers, territories, us_statesAction: sql_db_schema
Action Input: suppliers
CREATE TABLE suppliers (
        supplier_id SMALLINT NOT NULL, 
        company_name VARCHAR(40) NOT NULL, 
        contact_name VARCHAR(30), 
        contact_title VARCHAR(30), 
        address VARCHAR(60), 
        city VARCHAR(15), 
        region VARCHAR(15), 
        postal_code VARCHAR(10), 
        country VARCHAR(15), 
        phone VARCHAR(24), 
        fax VARCHAR(24), 
        homepage TEXT, 
        CONSTRAINT pk_suppliers PRIMARY KEY (supplier_id)
)

/*
3 rows from suppliers table:
supplier_id     company_name    contact_name    contact_title   address city    region  postal_code     country phone   fax     homepage
1       Exotic Liquids  Charlotte Cooper        Purchasing Manager      49 Gilbert St.  London  None    EC1 4SD UK      (171) 555-2222  None       None
2       New Orleans Cajun Delights      Shelley Burke   Order Administrator     P.O. Box 78934  New Orleans     LA      70117   USA     (100) 555-4822     None    #CAJUN.HTM#
3       Grandma Kelly's Homestead       Regina Murphy   Sales Representative    707 Oxford Rd.  Ann Arbor       MI      48104   USA     (313) 555-5735     (313) 555-3349  None
*/Thought: I can now construct the query using the schema of the suppliers table. I will use the sql_db_query_checker tool to double check if my query is correct before executing it.
Action: sql_db_query_checker
Action Input: SELECT country, COUNT(*) AS number_of_suppliers FROM suppliers GROUP BY country ORDER BY country;SELECT country, COUNT(*) AS number_of_suppliers 
FROM suppliers 
GROUP BY country 
ORDER BY country;Action: sql_db_query
Action Input: SELECT country, COUNT(*) AS number_of_suppliers FROM suppliers GROUP BY country ORDER BY country;[('Australia', 2), ('Brazil', 1), ('Canada', 2), ('Denmark', 1), ('Finland', 1), ('France', 3), ('Germany', 3), ('Italy', 2), ('Japan', 2), ('Netherlands', 1), ('Norway', 1), ('Singapore', 1), ('Spain', 1), ('Sweden', 2), ('UK', 2), ('USA', 4)]Thought: I now know the final answer
Final Answer: The countries with their respective number of suppliers are: Australia (2), Brazil (1), Canada (2), Denmark (1), Finland (1), France (3), Germany (3), Italy (2), Japan (2), Netherlands (1), Norway (1), Singapore (1), Spain (1), Sweden (2), UK (2), USA (4).

> Finished chain.
{'input': 'SELECT country, COUNT(*) AS number_of_suppliers\nFROM suppliers\nGROUP BY country\nORDER BY country;\n', 'output': 'The countries with their respective number of suppliers are: Australia (2), Brazil (1), Canada (2), Denmark (1), Finland (1), France (3), Germany (3), Italy (2), Japan (2), Netherlands (1), Norway (1), Singapore (1), Spain (1), Sweden (2), UK (2), USA (4).'}

Takeaways

I doubt that LLMs will replace developers or make someone a 10x developer. However, I could cobble together the working code in an hour. LLMs can assist with understanding an unfamiliar API, which makes them useful.

In the next installment, we'll build an application from the code snippets in the previous installments.