Part 3 of 5 | ← Part 2 | Part 4 → | View Series
The Problem With LLMs for Retrieval
I considered using GPT or Claude for Q&A. Open the API, feed it cricket data, done.
Three problems:
- Cost: $0.01 per request × 10,000 daily users = $100/day
- Latency: API calls take 500-2000ms
- Hallucinations: An LLM might confidently invent data ("Suryakumar hit 12 sixes") if it's unsure
For cricket facts, hallucinations are unacceptable.
The Solution: TF-IDF + Cosine Similarity
Instead:
- Generate 42,000 Q&A pairs from raw data
- Vectorize all questions using TF-IDF
- At query time: vectorize user's question, find closest match, return answer
Cost: $0 after initial computation
Latency: <5ms per query
What is TF-IDF?
TF-IDF = Term Frequency × Inverse Document Frequency
Term Frequency (TF):
How often does word X appear in THIS question?
(More = higher weight)
Inverse Document Frequency (IDF):
How many documents contain word X?
(Rare words = higher weight)
TF-IDF = TF × IDF
Common word ("the"): low TF-IDF (appears everywhere)
Specific word ("Suryakumar"): high TF-IDF (appears rarely)
Result: Important words get high weight, noise gets filtered out.
Generating 42,000 Q&A Pairs
Type 1: Per-Match Facts (35 variations per match)
For every match, generate multiple question phrasings:
def generate_match_qa(match_id, date, batting_team, winner, pom, runs):
answer = (
f"Match {match_id} | {date} | {batting_team} scored {runs} | "
f"Winner: {winner} | Player of Match: {pom}"
)
questions = [
f"who won match {match_id}",
f"player of the match {match_id}",
f"how many runs did {batting_team} score match {match_id}",
f"pom match {match_id}",
f"who was man of match {match_id}",
# ... 30 more variations covering:
# - Different phrasings
# - Natural language variants
# - Abbreviations (POM, 6s, 4s, RR, PP)
# - Casual phrasing ("tell me about match X")
]
return [(q, answer) for q in questions]
Result: ~36,000 pairs from ~1,000 matches.
Each answer identical (same match, 35 questions). This trains the vectorizer that "player of match", "pom", "man of match" all mean the same thing.
Type 2: Aggregate Stats
# Most Wins
top_team = df["winner"].value_counts().index[0] # Mumbai Indians
top_wins = df["winner"].value_counts().iloc[0] # 69 wins
questions = [
"which team has most wins",
"most successful team in ipl",
"best team ipl history",
]
answer = f"Mumbai Indians has most wins: {top_wins}"
# Toss Impact
toss_win_rate = df["toss_winner"].eq(df["winner"]).mean() * 100
questions = [
"does toss matter",
"toss impact on match",
"toss win match win",
]
answer = f"Toss wins lead to match wins {toss_win_rate:.1f}% of time"
Result: ~3,000 aggregate pairs.
Type 3: Head-to-Head Records
for team1 in teams:
for team2 in teams[teams.index(team1)+1:]:
h2h_matches = df[
((df["team1"] == team1) & (df["team2"] == team2)) |
((df["team1"] == team2) & (df["team2"] == team1))
]
wins = h2h_matches["winner"].value_counts()
answer = f"H2H: {team1} {wins[team1]} wins, {team2} {wins[team2]} wins"
questions = [
f"head to head {team1} vs {team2}",
f"{team1} vs {team2} record",
f"h2h {team1} {team2}",
]
Result: ~3,000 H2H pairs.
Total: 42,000+ pairs
Building the TF-IDF Index
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize vectorizer
tfidf = TfidfVectorizer(
ngram_range=(1, 3), # Unigrams, bigrams, trigrams
min_df=1, # Keep all terms
analyzer="word" # Split by spaces
)
# Extract questions and fit
questions = [pair[0] for pair in qa_pairs]
Q_matrix = tfidf.fit_transform(questions)
print(f"Vectorized {len(questions)} questions")
print(f"Vocabulary: {Q_matrix.shape[1]}")
# Output:
# "Vectorized 42523 questions"
# "Vocabulary: 18394"
Matrix shape: (42,523 × 18,394)
- 42,523 rows: One per question
- 18,394 columns: Unique vocabulary terms
- Sparse: Most cells are 0 (most terms don't appear in most questions)
Why ngram_range=(1, 3)?
Unigrams (1 word):
"player", "of", "the", "match"
Bigrams (2 words):
"player of", "of the", "the match"
Trigrams (3 words):
"player of the"
Without bigrams/trigrams:
Questions vectorized separately:
"player" gets weight 0.05 (appears in tons of questions)
"of" gets weight 0.02 (extremely common)
"the" gets weight 0.01 (super common)
Result: Query "player of the match" matches EVERY question with "player"
→ False positives everywhere
With bigrams/trigrams:
"player of the" as a UNIT gets weight 0.80 (specific to match contexts)
"of" alone still gets weight 0.01
Result: Query "player of the match" matches ONLY match-specific questions
→ No false positives
Query-Time Retrieval
from sklearn.metrics.pairwise import cosine_similarity
def answer_question(user_question, tfidf, Q_matrix, answers, threshold=0.15):
"""
Algorithm:
1. Vectorize user's question (same vectorizer)
2. Compute cosine similarity against all 42K questions
3. Find best match
4. Return if confident enough
"""
# Step 1: Vectorize user input
q_vec = tfidf.transform([user_question.lower()])
# Shape: (1, 18394) vector
# Step 2: Compute similarity to all 42K questions
similarities = cosine_similarity(q_vec, Q_matrix)[0]
# Shape: (42000,) array of similarity scores
# Step 3: Find best match
best_idx = np.argmax(similarities)
best_score = similarities[best_idx]
# Step 4: Return only if confident
if best_score < threshold:
return None, best_score
return answers[best_idx], best_score
Real Examples
Example 1: Good Match ✅
User: "Who was player of the match in match 335982?"
Vectorization: TF-IDF creates 1×18394 vector
Similarity: Computed against all 42K
Best match: Index 5203 (similarity: 0.92)
Question 5203: "player of the match in match 335982"
Answer: "...Match 335982 | ...Player of Match: Suryakumar Yadav"
Confidence gate: 0.92 > 0.15 → ✅ Return answer
Example 2: Fuzzy Match ✅
User: "who was pom in match 335982?"
Vectorization: "pom" is abbreviation, matches "player of match" trigram
Similarity: 0.85 (strong match due to trigram)
Best match: Question 5204 "pom in match 335982"
Answer: Same as above
Confidence gate: 0.85 > 0.15 → ✅ Return answer
Example 3: Off-Topic ❌
User: "Can I learn TF-IDF for machine learning?"
Vectorization: Cricket vocabulary doesn't have "learn", "machine", "for"
Similarity: Best match is 0.08 (coincidental overlap)
Best match: Some random match fact
Confidence gate: 0.08 < 0.15 → ❌ Reject
Response: "🤔 I'm not confident about that"
The threshold (0.15) acts as a graceful uncertainty gate.
Memory Efficiency
Without sparse matrices (dense):
42,523 × 18,394 × 8 bytes = 6.2GB
With sparse matrix (TF-IDF default):
Only non-zero entries stored (~20MB compressed)
Result: 310x smaller! Fits in memory easily.
Performance Metrics
| Metric | Value |
|---|---|
| Q&A pairs | 42,523 |
| Vocabulary size | 18,394 |
| Index file size | 20MB |
| Query time | <5ms |
| Memory (at rest) | 20MB |
| Memory (loaded) | 150MB |
| Queries per second (1 server) | 10,000+ |
Cost: Single $5 server handles 100,000 queries/day with zero API calls.
Why This Beats Alternatives
| Approach | Cost | Speed | Accuracy | Hallucination Risk |
|---|---|---|---|---|
| TF-IDF (Ours) | $0 | <5ms | 100% | 0% |
| LLM API (GPT) | $100/day | 500ms | 95% | 5-10% |
| Hybrid (LLM + retrieval) | $50/day | 200ms | 99% | 1% |
| Database + SQL | $50/month | 50ms | 100% | 0% |
We chose TF-IDF because:
- ✅ Zero hallucinations (only returns data facts)
- ✅ Instant (<5ms)
- ✅ Free (no APIs)
- ✅ Transparent (we control everything)
For this use case, TF-IDF is optimal.
Intent Detection (Going Beyond Direct Match)
Raw Q&A answer is often verbose. If user asks "How many sixes?", returning full match details is wrong.
def detect_intent(question):
if any(w in question for w in ["sixes", "6", "boundary"]):
return "sixes"
if any(w in question for w in ["player", "man", "pom"]):
return "player_of_match"
if any(w in question for w in ["winner", "win", "result"]):
return "winner"
return "summary"
def format_answer(raw_answer, intent):
if intent == "sixes":
import re
m = re.search(r"6s (\d+)", raw_answer)
return f"🔥 Sixes: {m.group(1)}" if m else raw_answer
if intent == "player_of_match":
m = re.search(r"Player of Match: ([^\n]+)", raw_answer)
return f"🌟 {m.group(1)}" if m else raw_answer
return raw_answer
Result: Clean, focused answers instead of walls of text.
Model Serialization
qa_bundle = {
"tfidf": tfidf, # Fitted vectorizer
"Q_matrix": Q_matrix, # Pre-computed TF-IDF vectors
"questions": questions, # For debugging
"answers": answers, # Parallel to questions
}
joblib.dump(qa_bundle, "models/qa_model.joblib")
Everything in one file. Load once at startup, reuse forever.
What's in Part 4 (FastAPI Backend)
Next post: How to route questions intelligently:
✅ Lazy loading (optional until first use)
✅ Intent detection (prediction vs Q&A)
✅ Team extraction (fuzzy matching)
✅ Smart routing (/chat does both)
✅ Error handling (timeout, API errors)
Sneak preview: One endpoint handles both systems. The backend decides which AI to use.
This is Part 3 of 5. Subscribe to follow! 🏏
Top comments (0)