IPL 2026 starts this month, so I built something around it.
I have been learning how to build AI agents, and one thing I kept wanting was a project that felt concrete. Not just a chatbot demo, but something that had to deal with real data, real edge cases, and real mistakes.
So I built IPL Cricket Analyst.
It lets you ask questions about IPL data in plain English and get back a SQL-backed answer in real time, along with charts, follow-up suggestions, and support for multi-turn questions. The user does not have to know SQL; the LLM generates the queries for you.
Some example questions:
- "Who has the best death-over economy since 2020?"
- "Show me a bar chart of the top 10 wicket takers."
- "What is Virat Kohli's strike rate at Eden Gardens after 2022?"
Under the hood, the agent writes the SQL, validates it, runs it against 278,000+ ball-by-ball deliveries, and streams the result back while it works.
What I built
| Layer | Tech |
|---|---|
| Frontend | Next.js 14, TypeScript, Tailwind CSS |
| Backend | FastAPI, Python 3.11, LangChain |
| Database | PostgreSQL, 9 tables, 278k+ rows |
| Vector store | ChromaDB |
| Cache / History | Redis 7 |
| LLM | GPT-4o for SQL, GPT-4o-mini for rewrite and insights |
| Charts | MCP Chart Server with Vega-Lite v5 |
The basic goal was simple: take a cricket question, turn it into SQL, run it safely, and make the result feel responsive in the UI.
The pipeline
Each question goes through a pipeline inside run_agent_stream() that looks roughly like this:
```
Input validation
→ Response cache check
→ Query rewrite + history summarization
→ Entity resolution
→ [Table selection || Cricket RAG]
→ SQL generation
→ SQL validation + semantic check
→ SQL execution
→ [Answer rephrase || Insights || Viz]
→ Streamed NDJSON to frontend
```
The frontend receives events step by step, so the SQL shows up first, then the answer, then the extra pieces like insights and charts.
That streaming part made a much bigger difference to the feel of the app than I expected.
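The streaming shape can be sketched as an async generator that yields one JSON object per pipeline step. This is an illustrative sketch, not the actual `run_agent_stream()` internals: the step stubs, event names, and payloads are all hypothetical. In FastAPI, a generator like this would be wrapped in a `StreamingResponse` with `media_type="application/x-ndjson"`.

```python
import asyncio
import json

async def run_agent_stream(question: str):
    """Yield one NDJSON line per pipeline step as it completes.

    Hypothetical sketch: the real pipeline does validation, rewrite,
    entity resolution, SQL generation, etc. Each step is stubbed here.
    """
    # SQL is emitted first, so the frontend can show it early.
    sql = "SELECT batter, SUM(runs) AS runs FROM deliveries GROUP BY batter"
    yield json.dumps({"event": "sql", "data": sql}) + "\n"

    # The answer follows after execution.
    yield json.dumps({"event": "answer", "data": "Placeholder answer"}) + "\n"

    # Extras (insights, chart spec) arrive last.
    yield json.dumps({"event": "insights", "data": ["Placeholder insight"]}) + "\n"

# In FastAPI this generator would be returned as:
# StreamingResponse(run_agent_stream(q), media_type="application/x-ndjson")

async def main():
    # Consume the stream the way the frontend would: line by line.
    return [json.loads(line) async for line in run_agent_stream("top run scorer?")]

print(asyncio.run(main()))
```

Because each line is a complete JSON document, the frontend can render every event the moment it arrives instead of waiting for the whole response.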
What turned out to be harder than I expected
1. Cricket stats are tricky in ways generic NL2SQL examples do not prepare you for
A lot of NL2SQL tutorials work on clean, simple schemas. IPL data is not that.
For example, a batting average is not just a straightforward aggregation. Dismissals can be subtle because the dismissed player is not always the striker. Ducks are also not a ball-level concept. They have to be computed at the innings level.
I ran into a lot of cases where the SQL looked reasonable but was still wrong from a cricket point of view.
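To make the batting-average pitfall concrete, here is a toy in-memory SQLite example. The schema and column names (`batter`, `runs_batter`, `player_dismissed`) are simplified stand-ins for the real dataset, which may differ. The point is the grain: runs are grouped by the striker, but dismissals must be counted by `player_dismissed`, because a run out can dismiss the non-striker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE deliveries (
    batter TEXT,            -- player on strike for this ball
    runs_batter INTEGER,    -- runs credited to the striker
    player_dismissed TEXT   -- may differ from the striker (e.g. run out at the non-striker's end)
);
INSERT INTO deliveries VALUES
    ('Kohli',  4, NULL),
    ('Kohli',  1, NULL),
    ('Sharma', 0, 'Kohli'),  -- Kohli run out while Sharma was on strike
    ('Kohli',  6, NULL),
    ('Kohli',  0, 'Kohli');
""")

# Counting dismissals only on the striker's own deliveries would miss
# the run out above. The correct query computes runs (by batter) and
# dismissals (by player_dismissed) separately, then divides.
row = conn.execute("""
    SELECT r.runs * 1.0 / d.outs AS batting_average
    FROM
        (SELECT SUM(runs_batter) AS runs FROM deliveries WHERE batter = 'Kohli') r,
        (SELECT COUNT(*) AS outs FROM deliveries WHERE player_dismissed = 'Kohli') d
""").fetchone()

print(row[0])  # 11 runs / 2 dismissals = 5.5
```

A filter on `batter = 'Kohli' AND player_dismissed IS NOT NULL` would find only one dismissal here and report an average of 11.0, which is exactly the kind of SQL that looks reasonable but is wrong from a cricket point of view.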
To handle that better, I added:
- a cricket rules document for retrieval
- IPL-specific few-shot SQL examples
- an extra semantic validation step before execution
That combination helped a lot more than just changing prompts randomly.
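Even cheap, rule-based pre-execution checks catch a surprising share of failures before the heavier LLM-based review runs. This sketch is hypothetical (the table set and check rules are illustrative, not the project's actual validator): it verifies that referenced tables exist in the schema and that a season mentioned in the question actually shows up as a filter in the SQL.

```python
import re

# Hypothetical subset of the real 9-table schema.
KNOWN_TABLES = {"deliveries", "matches", "players"}

def semantic_check(sql: str, question: str) -> list:
    """Cheap pre-execution checks, run before any LLM-based review."""
    problems = []

    # Every table named after FROM/JOIN must exist in the schema.
    for table in re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE):
        if table.lower() not in KNOWN_TABLES:
            problems.append(f"unknown table: {table}")

    # A question that mentions a season should produce a season filter.
    year = re.search(r"\b(20\d\d)\b", question)
    if year and year.group(1) not in sql:
        problems.append(f"question mentions {year.group(1)} but SQL has no such filter")

    return problems

bad_sql = "SELECT batter FROM deliveries JOIN matchs USING (match_id)"
print(semantic_check(bad_sql, "Who scored most runs in 2023?"))
# flags the typo'd table name and the missing 2023 filter
```

Checks like these are deterministic and instant, which makes them a good first gate; anything that survives them can then go to a slower semantic review.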
2. Accuracy improved only after I started measuring it properly
At first I was mostly testing with ad hoc questions, which felt fine until I started noticing inconsistencies.
So I put together a 50-question ground-truth evaluation set and started running the system against it repeatedly.
The first version was around 82% accurate.
After a lot of iteration, it got to 98% on that eval.
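The eval harness itself does not need to be fancy. A minimal version looks like the sketch below; the questions, expected answers, and `ask_agent` stub are all hypothetical stand-ins for the real 50-question set and the real pipeline.

```python
# Minimal ground-truth eval loop: run each question through the agent
# and compare against an expected answer. The agent is stubbed here;
# in the real project it would invoke the full NL2SQL pipeline.

EVAL_SET = [
    {"question": "Who scored the most runs in 2023?", "expected": "Player A"},
    {"question": "Top wicket taker overall?", "expected": "Player B"},
]

def ask_agent(question: str) -> str:
    # Stub standing in for the real pipeline.
    canned = {
        "Who scored the most runs in 2023?": "Player A",
        "Top wicket taker overall?": "Player C",  # deliberately wrong
    }
    return canned[question]

def run_eval(eval_set):
    failures = []
    for case in eval_set:
        got = ask_agent(case["question"])
        if got != case["expected"]:
            failures.append((case["question"], case["expected"], got))
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures

accuracy, failures = run_eval(EVAL_SET)
print(f"accuracy: {accuracy:.0%}, failures: {len(failures)}")
```

Keeping the failure tuples around matters as much as the score: the list of (question, expected, got) triples is what points at the specific failure modes worth fixing.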
Most of the improvements did not come from big architectural changes. They came from fixing very specific failure modes, like:
- using the wrong grain for aggregation
- getting milestone logic wrong
- small cricket-specific details like death overs being overs 16 to 20, which in this dataset meant handling indexing carefully
- selecting columns that made the answer noisier than it needed to be
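The death-overs indexing issue is a good example of how small these failure modes can be. Assuming the dataset stores the over number 0-indexed (an assumption about this schema, not a documented fact), the cricket phrase "overs 16 to 20" must become a filter on indices 15 to 19:

```python
# Cricket convention: "death overs" are overs 16-20 of a T20 innings.
# If the dataset stores the over number 0-indexed (over 0 is the first
# over), the correct SQL filter is BETWEEN 15 AND 19, not 16 AND 20.

def death_over_filter(zero_indexed: bool) -> str:
    lo, hi = (15, 19) if zero_indexed else (16, 20)
    return f"over BETWEEN {lo} AND {hi}"

# A naive translation of "overs 16 to 20" against a 0-indexed column
# silently drops the real 16th over (index 15) and references an
# over index (20) that never occurs in a T20 innings.
print(death_over_filter(zero_indexed=True))   # over BETWEEN 15 AND 19
print(death_over_filter(zero_indexed=False))  # over BETWEEN 16 AND 20
```

The off-by-one version still returns plausible-looking numbers, which is why it only surfaced once a fixed eval set was in place.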
That was probably the biggest lesson in the whole project. Evaluation made the work much more grounded.
3. Follow-up questions needed more care than I thought
One of the things I wanted was for follow-ups to feel natural.
Questions like:
- "Who scored the most runs in 2023?"
- "What was his strike rate?"
- "What about 2022?"
That sounds simple from a user perspective, but a lot has to go right for it to work consistently.
I ended up rewriting follow-up questions into standalone questions before sending them downstream. That made the rest of the pipeline much more reliable.
It was one of those changes that feels obvious in hindsight.
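Structurally, the rewrite step is just prompt construction around the conversation history. In the project this calls the model; in the sketch below the LLM is mocked so the shape is testable, and the prompt wording, function names, and example history are illustrative rather than the real ones.

```python
# Sketch of the follow-up rewrite step. The real version calls an LLM
# (GPT-4o-mini in this project); here the model is mocked so the
# structure is runnable. Prompt text and names are hypothetical.

REWRITE_PROMPT = """Given the conversation history, rewrite the latest
question as a fully standalone question. Resolve pronouns and implicit
references. Return only the rewritten question.

History:
{history}

Latest question: {question}
Standalone question:"""

def rewrite_followup(history, question, llm):
    prompt = REWRITE_PROMPT.format(history="\n".join(history), question=question)
    return llm(prompt).strip()

def mock_llm(prompt: str) -> str:
    # Stands in for the real model call.
    return "What was Player A's strike rate in 2023?"

history = ["Q: Who scored the most runs in 2023? A: Player A."]
standalone = rewrite_followup(history, "What was his strike rate?", mock_llm)
print(standalone)
```

Once every downstream stage receives a standalone question, entity resolution, table selection, and SQL generation no longer need any notion of conversation state, which is what made the rest of the pipeline more reliable.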
4. Reliability work matters even in small projects
I did not want this to be just a cool demo that works once.
So I added some basic safeguards:
- per-IP rate limiting
- a response cache
- a circuit breaker
- request timeouts
- input validation
- SELECT-only SQL enforcement
None of that is especially flashy, but it made the project feel much more solid.
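As one example, a SELECT-only gate can be quite small. This is a sketch of the idea, not necessarily how the project enforces it, and it should be defense-in-depth on top of a read-only database role rather than the only safeguard:

```python
import re

# Keywords that should never appear in a read-only query.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def enforce_select_only(sql: str) -> str:
    """Reject anything that is not a single read-only SELECT/CTE statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?i)\s*(SELECT|WITH)\b", stripped):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(stripped):
        raise ValueError("mutating keyword detected")
    return stripped

print(enforce_select_only("SELECT batter FROM deliveries LIMIT 10"))
```

The `\b` word boundaries matter: they let column names like `created_at` pass while still catching a bare `CREATE`. Pairing a check like this with a database user that has no write grants means a bypass in one layer still cannot mutate data.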
What I learned
This project taught me a lot about building agents in a way that feels less magical and more engineering-focused.
A few things stood out:
Evaluation matters a lot.
Without a fixed eval set, it is very easy to convince yourself the system is getting better when it is just getting different.
Domain grounding matters more than I expected.
A strong model can generate convincing SQL, but convincing is not the same as correct. The cricket-specific rules and examples made a huge difference.
Streaming helps the UX a lot.
Even when the full pipeline takes a few seconds, showing progress step by step makes the app feel much better.
The hard part is usually not generation.
A lot of the work ended up being around validation, edge cases, memory, retries, and handling the weird questions cleanly.
Why I liked building this
I started this mainly as a way to learn more about AI agents, but it turned into a really useful exercise in building around failure cases.
It is easy to make an agent look smart in a short demo.
It is much harder to make it dependable when the inputs are messy, the domain has tricky rules, and the answer actually needs to be right.
That is what made this project fun.
Try it yourself
The dataset is public on Kaggle:
https://www.kaggle.com/datasets/sandeepbkadam/ipl-cricket-dataset-20082025-postgresql
GitHub: https://github.com/Sandhu93/nl2sql-agent
What I want to improve next
Right now I want to get better visibility into how the system behaves in practice.
Monitoring and observability are next on the list, especially:
- latency by pipeline step
- better failure logging
- more structured evaluation runs
If you have worked on NL2SQL, agent reliability, or evaluation workflows, I would genuinely love to hear what has worked for you.
Happy to answer questions in the comments. Happy learning!