IPL 2026 starts this month, so I built something around it.
I have been learning how to build AI agents, and one thing I kept wanting was a project that felt concrete. Not just a chatbot demo, but something that had to deal with real data, real edge cases, and real mistakes.
So I built IPL Cricket Analyst.
It lets you ask questions about IPL data in plain English and get back a SQL-backed answer in real time, along with charts, follow-up suggestions, and support for multi-turn questions. The user does not have to know SQL; the LLM generates the queries for you.
Some example questions:
- "Who has the best death-over economy since 2020?"
- "Show me a bar chart of the top 10 wicket takers."
- "What is Virat Kohli's strike rate at Eden Gardens after 2022?"
Under the hood, the agent writes the SQL, validates it, runs it against 278,000+ ball-by-ball deliveries, and streams the result back while it works.
What I built
| Layer | Tech |
|---|---|
| Frontend | Next.js 14, TypeScript, Tailwind CSS |
| Backend | FastAPI, Python 3.11, LangChain |
| Database | PostgreSQL, 9 tables, 278k+ rows |
| Vector store | ChromaDB |
| Cache / History | Redis 7 |
| LLM | GPT-4o for SQL, GPT-4o-mini for rewrite and insights |
| Charts | MCP Chart Server with Vega-Lite v5 |
The basic goal was simple: take a cricket question, turn it into SQL, run it safely, and make the result feel responsive in the UI.
The pipeline
Each question goes through a pipeline inside run_agent_stream() that looks roughly like this:
```
Input validation
→ Response cache check
→ Query rewrite + history summarization
→ Entity resolution
→ [Table selection || Cricket RAG]
→ SQL generation
→ SQL validation + semantic check
→ SQL execution
→ [Answer rephrase || Insights || Viz]
→ Streamed NDJSON to frontend
```
The frontend receives events step by step, so the SQL shows up first, then the answer, then the extra pieces like insights and charts.
That streaming part made a much bigger difference to the feel of the app than I expected.
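The streaming shape can be sketched as an async generator that yields one JSON object per pipeline step. This is an illustrative sketch, not the actual `run_agent_stream()` internals: the step stubs, event names, and payloads are all hypothetical. In FastAPI, a generator like this would be wrapped in a `StreamingResponse` with `media_type="application/x-ndjson"`.

```python
import asyncio
import json

async def run_agent_stream(question: str):
    """Yield one NDJSON line per pipeline step as it completes.

    Hypothetical sketch: the real pipeline does validation, rewrite,
    entity resolution, SQL generation, etc. Each step is stubbed here.
    """
    # SQL is emitted first, so the frontend can show it early.
    sql = "SELECT batter, SUM(runs) AS runs FROM deliveries GROUP BY batter"
    yield json.dumps({"event": "sql", "data": sql}) + "\n"

    # The answer follows after execution.
    yield json.dumps({"event": "answer", "data": "Placeholder answer"}) + "\n"

    # Extras (insights, chart spec) arrive last.
    yield json.dumps({"event": "insights", "data": ["Placeholder insight"]}) + "\n"

# In FastAPI this generator would be returned as:
# StreamingResponse(run_agent_stream(q), media_type="application/x-ndjson")

async def main():
    # Consume the stream the way the frontend would: line by line.
    return [json.loads(line) async for line in run_agent_stream("top run scorer?")]

print(asyncio.run(main()))
```

Because each line is a complete JSON document, the frontend can render every event the moment it arrives instead of waiting for the whole response.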
What turned out to be harder than I expected
1. Cricket stats are tricky in ways generic NL2SQL examples do not prepare you for
A lot of NL2SQL tutorials work on clean, simple schemas. IPL data is not that.
For example, a batting average is not just a straightforward aggregation. Dismissals can be subtle because the dismissed player is not always the striker. Ducks are also not a ball-level concept. They have to be computed at the innings level.
I ran into a lot of cases where the SQL looked reasonable but was still wrong from a cricket point of view.
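To make the batting-average pitfall concrete, here is a toy in-memory SQLite example. The schema and column names (`batter`, `runs_batter`, `player_dismissed`) are simplified stand-ins for the real dataset, which may differ. The point is the grain: runs are grouped by the striker, but dismissals must be counted by `player_dismissed`, because a run out can dismiss the non-striker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE deliveries (
    batter TEXT,            -- player on strike for this ball
    runs_batter INTEGER,    -- runs credited to the striker
    player_dismissed TEXT   -- may differ from the striker (e.g. run out at the non-striker's end)
);
INSERT INTO deliveries VALUES
    ('Kohli',  4, NULL),
    ('Kohli',  1, NULL),
    ('Sharma', 0, 'Kohli'),  -- Kohli run out while Sharma was on strike
    ('Kohli',  6, NULL),
    ('Kohli',  0, 'Kohli');
""")

# Counting dismissals only on the striker's own deliveries would miss
# the run out above. The correct query computes runs (by batter) and
# dismissals (by player_dismissed) separately, then divides.
row = conn.execute("""
    SELECT r.runs * 1.0 / d.outs AS batting_average
    FROM
        (SELECT SUM(runs_batter) AS runs FROM deliveries WHERE batter = 'Kohli') r,
        (SELECT COUNT(*) AS outs FROM deliveries WHERE player_dismissed = 'Kohli') d
""").fetchone()

print(row[0])  # 11 runs / 2 dismissals = 5.5
```

A filter on `batter = 'Kohli' AND player_dismissed IS NOT NULL` would find only one dismissal here and report an average of 11.0, which is exactly the kind of SQL that looks reasonable but is wrong from a cricket point of view.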
To handle that better, I added:
- a cricket rules document for retrieval
- IPL-specific few-shot SQL examples
- an extra semantic validation step before execution
That combination helped a lot more than just changing prompts randomly.
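Even cheap, rule-based pre-execution checks catch a surprising share of failures before the heavier LLM-based review runs. This sketch is hypothetical (the table set and check rules are illustrative, not the project's actual validator): it verifies that referenced tables exist in the schema and that a season mentioned in the question actually shows up as a filter in the SQL.

```python
import re

# Hypothetical subset of the real 9-table schema.
KNOWN_TABLES = {"deliveries", "matches", "players"}

def semantic_check(sql: str, question: str) -> list:
    """Cheap pre-execution checks, run before any LLM-based review."""
    problems = []

    # Every table named after FROM/JOIN must exist in the schema.
    for table in re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE):
        if table.lower() not in KNOWN_TABLES:
            problems.append(f"unknown table: {table}")

    # A question that mentions a season should produce a season filter.
    year = re.search(r"\b(20\d\d)\b", question)
    if year and year.group(1) not in sql:
        problems.append(f"question mentions {year.group(1)} but SQL has no such filter")

    return problems

bad_sql = "SELECT batter FROM deliveries JOIN matchs USING (match_id)"
print(semantic_check(bad_sql, "Who scored most runs in 2023?"))
# flags the typo'd table name and the missing 2023 filter
```

Checks like these are deterministic and instant, which makes them a good first gate; anything that survives them can then go to a slower semantic review.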
2. Accuracy improved only after I started measuring it properly
At first I was mostly testing with ad hoc questions, which felt fine until I started noticing inconsistencies.
So I put together a 50-question ground-truth evaluation set and started running the system against it repeatedly.
The first version was around 82% accurate.
After a lot of iteration, it got to 98% on that eval.
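The eval harness itself does not need to be fancy. A minimal version looks like the sketch below; the questions, expected answers, and `ask_agent` stub are all hypothetical stand-ins for the real 50-question set and the real pipeline.

```python
# Minimal ground-truth eval loop: run each question through the agent
# and compare against an expected answer. The agent is stubbed here;
# in the real project it would invoke the full NL2SQL pipeline.

EVAL_SET = [
    {"question": "Who scored the most runs in 2023?", "expected": "Player A"},
    {"question": "Top wicket taker overall?", "expected": "Player B"},
]

def ask_agent(question: str) -> str:
    # Stub standing in for the real pipeline.
    canned = {
        "Who scored the most runs in 2023?": "Player A",
        "Top wicket taker overall?": "Player C",  # deliberately wrong
    }
    return canned[question]

def run_eval(eval_set):
    failures = []
    for case in eval_set:
        got = ask_agent(case["question"])
        if got != case["expected"]:
            failures.append((case["question"], case["expected"], got))
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures

accuracy, failures = run_eval(EVAL_SET)
print(f"accuracy: {accuracy:.0%}, failures: {len(failures)}")
```

Keeping the failure tuples around matters as much as the score: the list of (question, expected, got) triples is what points at the specific failure modes worth fixing.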
Most of the improvements did not come from big architectural changes. They came from fixing very specific failure modes, like:
- using the wrong grain for aggregation
- getting milestone logic wrong
- small cricket-specific details like death overs being overs 16 to 20, which in this dataset meant handling indexing carefully
- selecting columns that made the answer noisier than it needed to be
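The death-overs indexing issue is a good example of how small these failure modes can be. Assuming the dataset stores the over number 0-indexed (an assumption about this schema, not a documented fact), the cricket phrase "overs 16 to 20" must become a filter on indices 15 to 19:

```python
# Cricket convention: "death overs" are overs 16-20 of a T20 innings.
# If the dataset stores the over number 0-indexed (over 0 is the first
# over), the correct SQL filter is BETWEEN 15 AND 19, not 16 AND 20.

def death_over_filter(zero_indexed: bool) -> str:
    lo, hi = (15, 19) if zero_indexed else (16, 20)
    return f"over BETWEEN {lo} AND {hi}"

# A naive translation of "overs 16 to 20" against a 0-indexed column
# silently drops the real 16th over (index 15) and references an
# over index (20) that never occurs in a T20 innings.
print(death_over_filter(zero_indexed=True))   # over BETWEEN 15 AND 19
print(death_over_filter(zero_indexed=False))  # over BETWEEN 16 AND 20
```

The off-by-one version still returns plausible-looking numbers, which is why it only surfaced once a fixed eval set was in place.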
That was probably the biggest lesson in the whole project. Evaluation made the work much more grounded.
3. Follow-up questions needed more care than I thought
One of the things I wanted was for follow-ups to feel natural.
Questions like:
- "Who scored the most runs in 2023?"
- "What was his strike rate?"
- "What about 2022?"
That sounds simple from a user perspective, but a lot has to go right for it to work consistently.
I ended up rewriting follow-up questions into standalone questions before sending them downstream. That made the rest of the pipeline much more reliable.
It was one of those changes that feels obvious in hindsight.
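Structurally, the rewrite step is just prompt construction around the conversation history. In the project this calls the model; in the sketch below the LLM is mocked so the shape is testable, and the prompt wording, function names, and example history are illustrative rather than the real ones.

```python
# Sketch of the follow-up rewrite step. The real version calls an LLM
# (GPT-4o-mini in this project); here the model is mocked so the
# structure is runnable. Prompt text and names are hypothetical.

REWRITE_PROMPT = """Given the conversation history, rewrite the latest
question as a fully standalone question. Resolve pronouns and implicit
references. Return only the rewritten question.

History:
{history}

Latest question: {question}
Standalone question:"""

def rewrite_followup(history, question, llm):
    prompt = REWRITE_PROMPT.format(history="\n".join(history), question=question)
    return llm(prompt).strip()

def mock_llm(prompt: str) -> str:
    # Stands in for the real model call.
    return "What was Player A's strike rate in 2023?"

history = ["Q: Who scored the most runs in 2023? A: Player A."]
standalone = rewrite_followup(history, "What was his strike rate?", mock_llm)
print(standalone)
```

Once every downstream stage receives a standalone question, entity resolution, table selection, and SQL generation no longer need any notion of conversation state, which is what made the rest of the pipeline more reliable.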
4. Reliability work matters even in small projects
I did not want this to be just a cool demo that works once.
So I added some basic safeguards:
- per-IP rate limiting
- a response cache
- a circuit breaker
- request timeouts
- input validation
- SELECT-only SQL enforcement
None of that is especially flashy, but it made the project feel much more solid.
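As one example, a SELECT-only gate can be quite small. This is a sketch of the idea, not necessarily how the project enforces it, and it should be defense-in-depth on top of a read-only database role rather than the only safeguard:

```python
import re

# Keywords that should never appear in a read-only query.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def enforce_select_only(sql: str) -> str:
    """Reject anything that is not a single read-only SELECT/CTE statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?i)\s*(SELECT|WITH)\b", stripped):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(stripped):
        raise ValueError("mutating keyword detected")
    return stripped

print(enforce_select_only("SELECT batter FROM deliveries LIMIT 10"))
```

The `\b` word boundaries matter: they let column names like `created_at` pass while still catching a bare `CREATE`. Pairing a check like this with a database user that has no write grants means a bypass in one layer still cannot mutate data.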
What I learned
This project taught me a lot about building agents in a way that feels less magical and more engineering-focused.
A few things stood out:
Evaluation matters a lot.
Without a fixed eval set, it is very easy to convince yourself the system is getting better when it is just getting different.
Domain grounding matters more than I expected.
A strong model can generate convincing SQL, but convincing is not the same as correct. The cricket-specific rules and examples made a huge difference.
Streaming helps the UX a lot.
Even when the full pipeline takes a few seconds, showing progress step by step makes the app feel much better.
The hard part is usually not generation.
A lot of the work ended up being around validation, edge cases, memory, retries, and handling the weird questions cleanly.
Why I liked building this
I started this mainly as a way to learn more about AI agents, but it turned into a really useful exercise in building around failure cases.
It is easy to make an agent look smart in a short demo.
It is much harder to make it dependable when the inputs are messy, the domain has tricky rules, and the answer actually needs to be right.
That is what made this project fun.
Try it yourself
The dataset is public on Kaggle:
https://www.kaggle.com/datasets/sandeepbkadam/ipl-cricket-dataset-20082025-postgresql
GitHub: https://github.com/Sandhu93/nl2sql-agent
What I want to improve next
Right now I want to get better visibility into how the system behaves in practice.
Monitoring and observability are next on the list, especially:
- latency by pipeline step
- better failure logging
- more structured evaluation runs
If you have worked on NL2SQL, agent reliability, or evaluation workflows, I would genuinely love to hear what has worked for you.
Happy to answer questions in the comments. Happy learning!