From raw CSVs to a production-ready AI assistant that queries data safely — without hallucinating SQL.
In this post, I’ll walk through how I built an AI-powered data analyst using:
- Azure OpenAI
- LangChain
- LangGraph
- Function Calling
- SQLite
The assistant can:
- Analyze CSV data using pandas
- Query a SQL database safely
- Choose predefined backend functions automatically
- Explain results clearly
- Avoid hallucinations and unsafe SQL
🧩 Problem Statement
When working with AI models and databases, common problems include:
- ❌ Hallucinated SQL queries
- ❌ Unsafe eval or raw SQL execution
- ❌ No control over what the model can access
- ❌ No explanation of how results were computed
Goal:
Build an AI assistant that:
- Answers analytical questions about COVID data
- Uses only allowed tools
- Never guesses
- Explains every answer
📊 Dataset
We use the COVID all-states history dataset, which includes:
- state
- date
- hospitalizedIncrease
- positiveIncrease
- …and more
The dataset is used in two forms:
- A pandas DataFrame (for direct analysis)
- A SQLite database (for SQL querying)
🧱 Architecture Overview
```
User Question
      ↓
Azure OpenAI (Assistant / LangChain)
      ↓
Tool Selection (Function / SQL / DataFrame)
      ↓
Safe Backend Execution
      ↓
Result
      ↓
Final Explanation
```
Key idea:
The model decides WHAT to do.
Your backend decides HOW it is done.
🔹 Part 1: Talking to Azure OpenAI via LangChain
We start by connecting to Azure OpenAI using AzureChatOpenAI:
```python
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint="https://<your-endpoint>.cognitiveservices.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2024-12-01-preview",
    model="gpt-4o-mini",
)
```
A simple sanity check:
```python
from langchain_core.messages import HumanMessage

response = llm.invoke([
    HumanMessage(content="Hello, Azure OpenAI via LangChain!")
])
print(response.content)
```
🔹 Part 2: DataFrame Agent (CSV Analysis)
We load the CSV into pandas and expose controlled computation via a tool.
DataFrame Tool
```python
from langchain_core.tools import tool

@tool
def run_df(query: str) -> str:
    """Run a Python expression against the global DataFrame `df` and return the result."""
    return str(eval(query))
```
⚠️ Note: In production, replace eval with a restricted execution layer.
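One way to harden that tool is to strip the builtins from the evaluation namespace and reject obviously dangerous tokens before evaluating. This is a minimal sketch of my own (the function name `run_df_safe` and the blocklist are illustrative, not from the original post), not a complete sandbox:

```python
import pandas as pd

# Tokens that should never appear in a pandas expression from the model.
BLOCKED = ("__", "import", "open(", "exec", "eval")

def run_df_safe(query: str, df: pd.DataFrame) -> str:
    """Evaluate a pandas expression against `df` in a restricted namespace."""
    if any(token in query for token in BLOCKED):
        raise ValueError(f"Query rejected by safety filter: {query!r}")
    # No builtins; only `df` and `pd` are visible to the expression.
    result = eval(query, {"__builtins__": {}}, {"df": df, "pd": pd})
    return str(result)

df = pd.DataFrame({"positiveIncrease": [10, 20, 30]})
print(run_df_safe("df['positiveIncrease'].sum()", df))  # prints 60
```

A string blocklist is a speed bump, not a sandbox; a real deployment would run the expression in a separate, resource-limited process.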
Enforcing Tool Usage
```python
llm_with_tools = llm.bind_tools([run_df])
```
The prompt forces the model to:
- Use the tool
- Perform actual pandas calculations
- Explain results
🔹 Part 3: Moving from CSV → SQL (SQLite)
We convert the CSV into SQLite:
```python
from sqlalchemy import create_engine

engine = create_engine("sqlite:///./db/test.db")
df.to_sql(
    name="all_states_history",
    con=engine,
    if_exists="replace",
    index=False,
)
```
Now the same dataset can be queried via SQL.
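The round trip can be demonstrated end-to-end with an in-memory database and a tiny stand-in frame (the real post uses the full all-states-history CSV and a file-backed engine):

```python
import sqlite3
import pandas as pd

# Tiny stand-in for the COVID dataset.
df = pd.DataFrame({
    "state": ["NY", "NY", "CA"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-01"],
    "hospitalizedIncrease": [50, 60, 40],
})

# Write the frame into SQLite, then query it back with plain SQL.
conn = sqlite3.connect(":memory:")
df.to_sql("all_states_history", conn, if_exists="replace", index=False)

result = pd.read_sql(
    "SELECT SUM(hospitalizedIncrease) AS total "
    "FROM all_states_history WHERE state = 'NY'",
    conn,
)
print(result["total"].iloc[0])  # prints 110
```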
🔹 Part 4: SQL Agent with LangGraph
Using LangGraph’s ReAct agent:
```python
from langgraph.prebuilt import create_react_agent

# `toolkit` is a LangChain SQLDatabaseToolkit built over the SQLite engine
agent_executor_SQL = create_react_agent(
    model=llm,
    tools=toolkit.get_tools(),
)
```
The system prompt enforces:
- Only valid tables
- Only specific columns
- No hallucinated values
- Markdown-only output
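Those guardrails live in the system prompt. The exact wording below is my own sketch of what such a prompt might look like, not the original author's:

```python
SQL_AGENT_SYSTEM_PROMPT = """
You are a SQL analyst for a COVID statistics database.
Rules:
- Query ONLY the table `all_states_history`.
- Use ONLY these columns: state, date, hospitalizedIncrease, positiveIncrease.
- Never invent table names, column names, or values.
- If a question cannot be answered from this schema, say so explicitly.
- Format every answer as Markdown and include the SQL you executed.
"""
```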
🔹 Part 5: Function Calling (No Raw SQL)
Instead of letting the model generate SQL, we define pre-approved backend functions.
Example Functions
```python
def get_hospitalized_increase_for_state_on_date(state_abbr, specific_date):
    ...

def get_positive_cases_for_state_on_date(state_abbr, specific_date):
    ...
```
Function Registry (Critical!)
```python
FUNCTION_MAP = {
    "get_hospitalized_increase_for_state_on_date": get_hospitalized_increase_for_state_on_date,
    "get_positive_cases_for_state_on_date": get_positive_cases_for_state_on_date,
}
```
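The post elides the function bodies, but one plausible implementation runs a fixed, parameterized query against the SQLite table. This sketch assumes a module-level connection and the schema shown earlier; the seeded data is illustrative:

```python
import sqlite3
import pandas as pd

# Seed an in-memory database with sample rows for the demo.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "state": ["NY", "CA"],
    "date": ["2021-03-01", "2021-03-01"],
    "positiveIncrease": [1200, 900],
}).to_sql("all_states_history", conn, index=False)

def get_positive_cases_for_state_on_date(state_abbr: str, specific_date: str) -> int:
    """Return positiveIncrease for one state on one date via a fixed, parameterized query."""
    row = conn.execute(
        "SELECT positiveIncrease FROM all_states_history "
        "WHERE state = ? AND date = ?",
        (state_abbr, specific_date),
    ).fetchone()
    return row[0] if row else 0

print(get_positive_cases_for_state_on_date("NY", "2021-03-01"))  # prints 1200
```

Because the SQL is fixed and the inputs are bound parameters, the model can never smuggle arbitrary SQL through this function.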
This ensures:
- ✅ Only allowed functions run
- ❌ No arbitrary code execution
🔹 Part 6: Azure OpenAI Function Calling (No Assistant API)
Using Chat Completions + functions:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    functions=functions,
    function_call="auto",
)
```
If the model calls a function:
- Extract the arguments
- Route the call through `FUNCTION_MAP`
- Execute the backend logic
- Send the result back to the model
- Get the final, grounded answer
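The routing step can be sketched in isolation: given the model's function-call payload (a name plus JSON-encoded arguments), look the name up in `FUNCTION_MAP` and refuse anything unregistered. The stub function here stands in for the real database-backed one:

```python
import json

def get_positive_cases_for_state_on_date(state_abbr, specific_date):
    return 1200  # stub standing in for the real backend function

FUNCTION_MAP = {
    "get_positive_cases_for_state_on_date": get_positive_cases_for_state_on_date,
}

def dispatch(function_name: str, raw_arguments: str):
    """Route a model function call through the registry; reject unknown names."""
    fn = FUNCTION_MAP.get(function_name)
    if fn is None:
        raise ValueError(f"Function not allowed: {function_name}")
    return fn(**json.loads(raw_arguments))

result = dispatch(
    "get_positive_cases_for_state_on_date",
    '{"state_abbr": "NY", "specific_date": "2021-03-01"}',
)
print(result)  # prints 1200
```

The dict lookup is the entire security boundary: if the model hallucinates a function name, the call fails loudly instead of executing anything.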
🔹 Part 7: Assistant API (Persistent Context)
Now we level up.
Creating the Assistant
```python
assistant = client.beta.assistants.create(
    name="Covid Data Assistant",
    model="gpt-4o-mini",
    tools=[{"type": "function", "function": fn} for fn in functions],
)
```
Assistant Loop (Key Concept)
```python
import time

while True:
    run_status = client.beta.threads.runs.retrieve(...)
    if run_status.status == "requires_action":
        # extract the function name and arguments,
        # dispatch via FUNCTION_MAP,
        # then submit the tool output
        ...
    elif run_status.status == "completed":
        break
    time.sleep(1)  # poll politely
```
The assistant remembers conversation context,
but never caches database results.
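The `requires_action` handshake can be exercised offline with a stub run object: the run moves from `requires_action` to `completed`, and each tool call is routed through `FUNCTION_MAP` exactly as in the real loop. Everything named `Fake*` below is test scaffolding of mine, not part of the Assistants API:

```python
import json

FUNCTION_MAP = {"add": lambda a, b: a + b}

class FakeRun:
    """Stub that replays a fixed sequence of run statuses."""
    def __init__(self, statuses, tool_calls):
        self._statuses = iter(statuses)
        self.tool_calls = tool_calls
        self.outputs = []

    def retrieve(self):
        self.status = next(self._statuses)
        return self

run = FakeRun(
    statuses=["requires_action", "completed"],
    tool_calls=[{"name": "add", "arguments": '{"a": 2, "b": 3}'}],
)

while True:
    run_status = run.retrieve()
    if run_status.status == "requires_action":
        for call in run_status.tool_calls:
            fn = FUNCTION_MAP[call["name"]]            # registry lookup
            out = fn(**json.loads(call["arguments"]))  # safe execution
            run.outputs.append(out)                    # would be submit_tool_outputs
    elif run_status.status == "completed":
        break

print(run.outputs)  # prints [5]
```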
🧠 Key Takeaways
✅ What This Design Solves
- Prevents SQL hallucinations
- Enforces backend safety
- Keeps AI answers grounded in data
- Scales cleanly as tools grow
🧩 Mental Model
| Layer | Responsibility |
|---|---|
| LLM | Reasoning & intent |
| Assistant | Tool selection |
| Backend | Data access |
| Function Map | Security |
🎯 When to Use What?
| Use Case | Best Choice |
|---|---|
| One-shot queries | Chat + function calling |
| Multi-turn analysis | Assistant API |
| CSV exploration | DataFrame tools |
| Production DB | Predefined SQL functions |
🚀 Final Thoughts
This approach mirrors how real production AI systems are built:
- AI decides what
- Backend controls how
- Data remains authoritative
- Explanations remain transparent
Connect With Me
Let’s learn and build cool data science and AI projects together!
- 💼 LinkedIn: https://www.linkedin.com/in/singla-khushi/
- 🔗 GitHub: https://github.com/KhushiSingla-tech
- 📩 Comments below are always welcome!