<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akshay Rajinikanth</title>
    <description>The latest articles on DEV Community by Akshay Rajinikanth (@akshay_rajinikanth).</description>
    <link>https://dev.to/akshay_rajinikanth</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3777734%2F7e08e17c-e9dd-4449-a06a-b059eda43be8.jpg</url>
      <title>DEV Community: Akshay Rajinikanth</title>
      <link>https://dev.to/akshay_rajinikanth</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akshay_rajinikanth"/>
    <language>en</language>
    <item>
      <title>I Did AI/ML "Wrong", And It's the Best Mistake I Ever Made</title>
      <dc:creator>Akshay Rajinikanth</dc:creator>
      <pubDate>Thu, 12 Mar 2026 08:30:19 +0000</pubDate>
      <link>https://dev.to/akshay_rajinikanth/i-did-aiml-wrong-and-its-the-best-mistake-i-ever-made-3hh1</link>
      <guid>https://dev.to/akshay_rajinikanth/i-did-aiml-wrong-and-its-the-best-mistake-i-ever-made-3hh1</guid>
      <description>&lt;p&gt;Okay, real talk.&lt;/p&gt;

&lt;p&gt;When my classmates were installing TensorFlow and building "AI projects" for their resumes, I was sitting in the library reading about probability distributions and hypothesis testing.&lt;/p&gt;

&lt;p&gt;They laughed. I felt behind.&lt;/p&gt;

&lt;p&gt;Fast forward to our final year and now I'm the one they're texting at 11 PM asking &lt;em&gt;"bro what even is overfitting?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Funny how that works.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Noticed Among My Peers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o4ne66g3fn2u9glnorh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o4ne66g3fn2u9glnorh.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a pattern I've watched play out all semester. A friend jumps into a PyTorch tutorial, builds something that "works," and then hits a wall the moment anything breaks or needs explaining.&lt;/p&gt;

&lt;p&gt;They can &lt;em&gt;run&lt;/em&gt; the model. They just can't &lt;em&gt;think&lt;/em&gt; about it.&lt;/p&gt;

&lt;p&gt;Here's what I keep hearing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Why is my model performing perfectly on training data but terrible on test data?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"What does this ROC curve actually mean?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Why does the learning rate matter so much?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"What do you mean the data has bias — it's just numbers?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't framework problems. These are &lt;strong&gt;statistics problems.&lt;/strong&gt; And no amount of &lt;code&gt;model.fit()&lt;/code&gt; will fix them if you don't understand what's happening underneath.&lt;/p&gt;

&lt;p&gt;The hard truth? Most ML tutorials teach you to drive a car without explaining how the engine works. Fun until something breaks.&lt;/p&gt;
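&lt;p&gt;That first question — the perfect-on-training, terrible-on-test gap — is easy to reproduce. A minimal sketch with scikit-learn and synthetic data (every number here is made up for illustration):&lt;/p&gt;

```python
# Sketch of the "perfect on training, terrible on test" symptom.
# Assumes scikit-learn is installed; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise, like messy real-world data
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree happily memorizes the noise in the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"train acc: {tree.score(X_tr, y_tr):.2f}")  # 1.00 -- memorized
print(f"test  acc: {tree.score(X_te, y_te):.2f}")  # noticeably lower
```

&lt;p&gt;The model isn't broken. It did exactly what it was asked: minimize training error. The gap is the statistics talking.&lt;/p&gt;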




&lt;h2&gt;
  
  
  Why Statistics Is Secretly the Foundation of AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yf65doaunjfhpaalzxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yf65doaunjfhpaalzxi.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's something nobody tells you in year one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Machine Learning is just applied statistics with better marketing.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I'm only slightly joking.&lt;/p&gt;

&lt;p&gt;Every ML concept you struggle with maps almost perfectly to a statistics concept:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What ML calls it&lt;/th&gt;
&lt;th&gt;What it actually is&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Overfitting&lt;/td&gt;
&lt;td&gt;High variance in your model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regularization&lt;/td&gt;
&lt;td&gt;Penalizing complexity to reduce variance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;Probability estimation using the sigmoid function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Loss Function&lt;/td&gt;
&lt;td&gt;A formalized way of measuring error distributions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Descent&lt;/td&gt;
&lt;td&gt;Optimization over a cost surface (calculus + statistics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation Metrics&lt;/td&gt;
&lt;td&gt;Hypothesis testing applied to model predictions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When my professor introduced &lt;strong&gt;logistic regression&lt;/strong&gt;, half the class stared blankly. I understood it immediately, because I already knew that the sigmoid function outputs a probability between 0 and 1. I'd spent weeks thinking about probability. It clicked in seconds.&lt;/p&gt;
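&lt;p&gt;If you want to see that for yourself, the sigmoid is a few lines of NumPy:&lt;/p&gt;

```python
# The sigmoid squashes any real-valued score into a probability in (0, 1).
# Pure NumPy -- no framework needed to see the idea.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 -- sitting right on the decision boundary
print(sigmoid(4.0))   # ~0.98, a confident "yes"
print(sigmoid(-4.0))  # ~0.02, a confident "no"
```

&lt;p&gt;Logistic regression is just a linear model feeding its score through this function. Once you know the output is a probability, thresholds, calibration, and ROC curves all stop being mysterious.&lt;/p&gt;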




&lt;h2&gt;
  
  
  Realizations That Changed How I See ML
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjm5zu0169e0l3i7b6fy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjm5zu0169e0l3i7b6fy.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are three moments where my statistics background saved me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Probability → Logistic Regression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logistic regression doesn't predict a class. It predicts the &lt;em&gt;probability&lt;/em&gt; of a class. If you've never thought carefully about what probability means (conditional probability, Bayes' theorem, odds ratios), that distinction will confuse you endlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Distributions → Data Modeling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you model anything, you need to understand what your data &lt;em&gt;looks like&lt;/em&gt;. Is it normally distributed? Skewed? Heavy-tailed? This changes everything — which algorithm you use, how you preprocess, whether your assumptions hold.&lt;/p&gt;

&lt;p&gt;My peers skip this step. Then their model "doesn't work" and they have no idea why.&lt;/p&gt;
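&lt;p&gt;Checking the shape of your data takes seconds. A sketch with NumPy and SciPy, using synthetic income-like data as a stand-in:&lt;/p&gt;

```python
# A quick distribution check before modeling -- the step that gets skipped.
# The data here is synthetic, standing in for a real skewed column.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=1, size=5000)  # heavy right tail

print(f"skewness: {skew(incomes):.2f}")           # large positive -> skewed
print(f"after log: {skew(np.log(incomes)):.2f}")  # near 0 -> roughly normal
```

&lt;p&gt;Two lines tell you whether a log transform is worth it — before you've wasted an afternoon wondering why a model that assumes normality is misbehaving.&lt;/p&gt;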

&lt;p&gt;&lt;strong&gt;3. Variance → Overfitting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Overfitting sounds mysterious until you realize it's literally the definition of &lt;em&gt;high variance&lt;/em&gt;: your model is too sensitive to the training data. Understanding the &lt;strong&gt;bias-variance tradeoff&lt;/strong&gt; isn't an advanced topic. It's a statistics 101 concept that makes sense of ML model evaluation.&lt;/p&gt;

&lt;p&gt;Once you see it, you can't unsee it.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You're Starting Today, Follow This Roadmap
&lt;/h2&gt;

&lt;p&gt;You don't need years. You need the right &lt;strong&gt;order.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Statistics Basics&lt;/strong&gt; &lt;em&gt;(3–4 weeks)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean, median, variance, standard deviation&lt;/li&gt;
&lt;li&gt;Probability and conditional probability&lt;/li&gt;
&lt;li&gt;Bayes' theorem (yes, now, not later)&lt;/li&gt;
&lt;li&gt;Distributions: normal, binomial, Poisson&lt;/li&gt;
&lt;/ul&gt;
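&lt;p&gt;Bayes' theorem is worth a worked example, because the result surprises almost everyone the first time. The numbers below are illustrative, not from any real test:&lt;/p&gt;

```python
# Bayes' theorem in ten lines: why a "90% accurate" medical test
# can still be wrong most of the time. All numbers are illustrative.
p_disease = 0.01             # prior: 1% of people have the condition
p_pos_given_disease = 0.90   # sensitivity
p_pos_given_healthy = 0.05   # false positive rate

# Total probability of testing positive
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

# Bayes: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive

print(f"P(disease | positive test) = {p_disease_given_pos:.2f}")  # ~0.15
```

&lt;p&gt;A positive result from a "90% accurate" test means roughly a 15% chance of having the condition, because the prior is so low. If that surprises you, that's exactly the intuition ML evaluation metrics depend on.&lt;/p&gt;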

&lt;p&gt;&lt;strong&gt;Step 2 — Data Analysis Thinking&lt;/strong&gt; &lt;em&gt;(2–3 weeks)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exploratory Data Analysis (EDA)&lt;/li&gt;
&lt;li&gt;Correlation vs. causation (this will save your life)&lt;/li&gt;
&lt;li&gt;Hypothesis testing and p-values&lt;/li&gt;
&lt;li&gt;Handling missing data, outliers, skewness&lt;/li&gt;
&lt;/ul&gt;
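&lt;p&gt;Hypothesis testing sounds abstract until you run one. A sketch with SciPy on synthetic A/B-style data (the group names and numbers are invented):&lt;/p&gt;

```python
# A two-sample t-test with SciPy: is the difference between two groups
# plausibly just noise? Synthetic data for illustration only.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=105, scale=15, size=200)  # a real effect exists here

t_stat, p_value = ttest_ind(group_a, group_b)
# p_value: how often you'd see a gap this large if there were no real difference
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

&lt;p&gt;A small p-value says the observed gap would be rare under "no difference." That's the same logic behind asking whether your model's metric improvement is real or luck.&lt;/p&gt;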

&lt;p&gt;&lt;strong&gt;Step 3 — Python + Data Tools&lt;/strong&gt; &lt;em&gt;(2–3 weeks)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy, Pandas, Matplotlib, Seaborn&lt;/li&gt;
&lt;li&gt;Practice EDA on real datasets (Kaggle is your friend)&lt;/li&gt;
&lt;li&gt;Learn to &lt;em&gt;question&lt;/em&gt; your data before modeling it&lt;/li&gt;
&lt;/ul&gt;
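&lt;p&gt;A first EDA pass doesn't need to be fancy. Here's a minimal sketch with pandas on a tiny made-up dataset:&lt;/p&gt;

```python
# A minimal EDA pass with pandas -- the "question your data" habit.
# Tiny inline dataset standing in for a real one.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [22, 25, np.nan, 31, 29, 120],  # one missing, one suspicious
    "salary": [30e3, 35e3, 40e3, 48e3, 45e3, 46e3],
})

print(df.describe())                  # spot the max age of 120 immediately
print(df.isna().sum())                # count missing values per column
print(df["age"].corr(df["salary"]))  # correlation -- not causation
```

&lt;p&gt;Three lines, and you already know you have a missing value and an outlier that will wreck a naive model. Most "my model doesn't work" stories start with skipping exactly this.&lt;/p&gt;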

&lt;p&gt;&lt;strong&gt;Step 4 — Machine Learning Fundamentals&lt;/strong&gt; &lt;em&gt;(4–6 weeks)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear and logistic regression (you'll now actually understand these)&lt;/li&gt;
&lt;li&gt;Decision trees, k-NN, SVMs&lt;/li&gt;
&lt;li&gt;Bias-variance tradeoff, cross-validation, regularization&lt;/li&gt;
&lt;li&gt;scikit-learn&lt;/li&gt;
&lt;/ul&gt;
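&lt;p&gt;Cross-validation is one of those Step 4 ideas that's easier in code than in words. A sketch with scikit-learn on synthetic data:&lt;/p&gt;

```python
# Cross-validation: estimate generalization from several held-out folds
# instead of trusting one lucky train/test split. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 5-fold CV: train on 4/5 of the data, score on the remaining 1/5, five times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores.round(2))  # five held-out accuracies, not one
print(f"mean: {scores.mean():.2f} +/- {scores.std():.2f}")
```

&lt;p&gt;Reporting a mean and a spread instead of a single number is, again, just statistics: one sample of a metric tells you much less than its distribution.&lt;/p&gt;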

&lt;p&gt;&lt;strong&gt;Step 5 — Deep Learning&lt;/strong&gt; &lt;em&gt;(ongoing)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neural networks, backpropagation, activation functions&lt;/li&gt;
&lt;li&gt;CNNs, RNNs, Transformers&lt;/li&gt;
&lt;li&gt;PyTorch or TensorFlow — &lt;em&gt;now&lt;/em&gt; you're ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference? By Step 4, you won't just be copying code. You'll be &lt;strong&gt;reasoning&lt;/strong&gt; about your models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Actionable Advice (From One Student to Another)
&lt;/h2&gt;

&lt;p&gt;If you take nothing else from this, take these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't skip EDA.&lt;/strong&gt; Ever. Look at your data before you model it. Always.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn to read a confusion matrix&lt;/strong&gt; before you learn to build a neural network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand what a p-value is&lt;/strong&gt; — ML evaluation metrics are essentially the same idea, dressed differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build intuition before building models.&lt;/strong&gt; StatQuest on YouTube is criminally underrated for this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kaggle notebooks are your best textbook.&lt;/strong&gt; Read others' EDA sections obsessively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When your model fails, think statistically first.&lt;/strong&gt; Bad data distribution, data leakage, and class imbalance cause more failures than bad architectures.&lt;/li&gt;
&lt;/ul&gt;
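&lt;p&gt;That confusion-matrix advice deserves a concrete example. Here's the classic imbalanced-data trap, with invented numbers:&lt;/p&gt;

```python
# Why "95% accurate" can be meaningless on imbalanced data: a classifier
# that just predicts "negative" for everyone. Illustrative numbers.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array([0] * 95 + [1] * 5)  # only 5% positive class
y_pred = np.zeros(100, dtype=int)       # lazy model: always predict 0

print(confusion_matrix(y_true, y_pred))
# [[95  0]
#  [ 5  0]]  -- every single positive case is missed
print(accuracy_score(y_true, y_pred))  # 0.95, looks great on a resume
print(recall_score(y_true, y_pred))    # 0.0 -- the model is useless
```

&lt;p&gt;Accuracy says 95%. Recall says the model never catches the thing it was built to catch. You only see that by reading the matrix.&lt;/p&gt;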




&lt;h2&gt;
  
  
  The Conclusion I Wish Someone Had Given Me
&lt;/h2&gt;

&lt;p&gt;There's a version of you that learns TensorFlow in week one, builds a "95% accurate" model on an imbalanced dataset, puts it on your resume, and doesn't realize what went wrong until an interview.&lt;/p&gt;

&lt;p&gt;There's another version that spends a few extra weeks understanding &lt;em&gt;why&lt;/em&gt; things work.&lt;/p&gt;

&lt;p&gt;That second version isn't slower. They're just building on solid ground.&lt;/p&gt;

&lt;p&gt;The flashy frameworks come and go. The math doesn't.&lt;/p&gt;

&lt;p&gt;And honestly? Once the statistics clicked, AI stopped feeling like magic and started feeling like a puzzle I actually knew how to solve. That shift in confidence — from copying tutorials to genuinely understanding — is worth every extra week.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI isn't magic — it's mostly statistics wearing a hoodie.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Start there. The rest will follow.&lt;/p&gt;




&lt;h2&gt;
  
  
  📌 TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Many CS students jump into ML frameworks without understanding the statistical foundations underneath&lt;/li&gt;
&lt;li&gt;Concepts like overfitting, gradient descent, loss functions, and evaluation metrics are fundamentally &lt;strong&gt;statistics concepts&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Learning probability, distributions, variance, and hypothesis testing first makes ML dramatically easier to understand and debug&lt;/li&gt;
&lt;li&gt;Suggested order: &lt;strong&gt;Statistics → Data Analysis → Python Tools → ML → Deep Learning&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Build intuition before you build models&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💬 Discussion Question
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Did you learn statistics before ML, or did you dive straight into frameworks?&lt;/strong&gt;&lt;br&gt;
Looking back, do you think the order mattered? Drop your experience in the comments. I'm genuinely curious how different paths shaped how people think about models. 👇&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxrywkc0bmyy263but85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxrywkc0bmyy263but85.png" alt=" " width="498" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If this helped you, consider sharing it with a classmate who just installed PyTorch for the first time. They might need this more than they know. 😄&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>career</category>
    </item>
    <item>
      <title>I Thought I Knew Data. Then This Book Proved Me Wrong.</title>
      <dc:creator>Akshay Rajinikanth</dc:creator>
      <pubDate>Sun, 22 Feb 2026 13:50:17 +0000</pubDate>
      <link>https://dev.to/akshay_rajinikanth/i-thought-i-knew-data-then-this-book-proved-me-wrong-11dj</link>
      <guid>https://dev.to/akshay_rajinikanth/i-thought-i-knew-data-then-this-book-proved-me-wrong-11dj</guid>
      <description>&lt;p&gt;&lt;em&gt;A ~14 min read for anyone who has ever sent a chart into the void and heard nothing back.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqo8m6hdlme00mtgtxqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqo8m6hdlme00mtgtxqp.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me set the scene.&lt;/p&gt;

&lt;p&gt;It's my third week as a data intern at a mid-sized tech company. I've spent two days — &lt;em&gt;two full days&lt;/em&gt; — building what I genuinely believed was the most beautiful dashboard of all time. Six charts. Multiple color gradients. A secondary y-axis because, honestly, it looked sophisticated. I walked into the review meeting with the energy of someone who had just invented data visualization.&lt;/p&gt;

&lt;p&gt;The VP glanced at my screen for about four seconds.&lt;/p&gt;

&lt;p&gt;"What am I supposed to take away from this?"&lt;/p&gt;

&lt;p&gt;I had no answer. Because I hadn't thought about that. Not once.&lt;/p&gt;

&lt;p&gt;That weekend, I picked up &lt;em&gt;Storytelling with Data&lt;/em&gt; by Cole Nussbaumer Knaflic. What followed was the most humbling and genuinely useful reading experience of my early career. This isn't a book summary. This is what I actually learned — the stuff that hit me in the face and made me rethink how I communicate &lt;em&gt;anything&lt;/em&gt; with numbers.&lt;/p&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Difference Between Exploring Data and Explaining Data (And Why Most People Confuse Them)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqo4hcsjxx6i0d70sl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqo4hcsjxx6i0d70sl.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the first gut-punch the book delivers: there are two completely different jobs you can do with data, and most people are doing one when they think they're doing the other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory analysis&lt;/strong&gt; is detective work. You're digging through data, looking for patterns, outliers, anything interesting. You're the audience. You're allowed to be messy, confused, and wrong. This is the part most of us are decent at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanatory analysis&lt;/strong&gt; is journalism. You've found the story. Now your job is to &lt;em&gt;communicate it clearly&lt;/em&gt; to someone who doesn't have the context you do. This is the part most of us are terrible at — because we skip straight to making slides without ever deciding what story we're telling.&lt;/p&gt;

&lt;p&gt;That review meeting I bombed? I had done great exploratory analysis. I found real patterns in the data. But I walked into that room still in detective mode, showing my evidence board instead of delivering my verdict.&lt;/p&gt;

&lt;p&gt;The fix isn't technical. It's a mindset shift. Before you open PowerPoint, Tableau, or a Jupyter notebook to make anything presentable — stop and answer three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Who&lt;/strong&gt; is your audience, specifically? (Not "the team." Who, exactly, will be in the room, and what do they care about?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What&lt;/strong&gt; do you want them to &lt;em&gt;do&lt;/em&gt; after seeing this? (Not "understand the data." What action, decision, or change?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How&lt;/strong&gt; will you deliver it — live presentation, email, shared document?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Three questions. Answering them before touching any tool will save you more time than any productivity hack you've read.&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Big Idea" and the 3-Minute Story: Two Things That Will Change How You Communicate Forever
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vxd15srxraq1k7bs3qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vxd15srxraq1k7bs3qg.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, two concepts from this book that I now use almost daily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Big Idea&lt;/strong&gt; is a single sentence — not a bullet point, not a title, one complete sentence — that captures your entire point, why it matters, and what someone should do about it.&lt;/p&gt;

&lt;p&gt;It sounds easy. It is not easy.&lt;/p&gt;

&lt;p&gt;Try it right now with something you're working on. Most people produce something like: &lt;em&gt;"This slide shows Q3 engagement metrics across platforms."&lt;/em&gt; That's a description. A Big Idea sounds like: &lt;em&gt;"Mobile engagement dropped 23% in Q3 because our push notifications are sending at 3am for users in APAC, and if we fix the timezone logic we can recover it by end of Q4."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Feel the difference? One describes what you made. The other tells someone why they should care and what to do next.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;3-minute story&lt;/strong&gt; is the companion skill. It's your ability to explain your key message clearly, without slides, in about three minutes. Imagine someone catches you in the elevator and asks what your project found. Can you deliver the point? If not, you don't know your story well enough yet.&lt;/p&gt;

&lt;p&gt;I used to think the slides &lt;em&gt;were&lt;/em&gt; the presentation. The book made me realize: the slides are a visual aid for a story you should already know by heart.&lt;/p&gt;




&lt;h2&gt;
  
  
  Nobody Taught You How to Pick Charts, And It Shows
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmi2pyskn60zv9lq7ct7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmi2pyskn60zv9lq7ct7.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raise your hand if you've ever used a pie chart in a professional setting.&lt;/p&gt;

&lt;p&gt;(It's okay. We've all done it. This is a safe space.)&lt;/p&gt;

&lt;p&gt;Here's the thing about chart selection — most of us pick visuals based on vibes. The data looks circular, so we pick a pie chart. There are categories, so we use a stacked bar. It looks cool, so we add a second y-axis. We never learned a framework.&lt;/p&gt;

&lt;p&gt;Here's the simplified version of what the book teaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use text&lt;/strong&gt; when you have just one or two numbers that are genuinely important. Just write the number big. Don't hide a headline in a chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use line charts&lt;/strong&gt; for anything time-based where trend matters. This is your most powerful and underused weapon. The eye naturally follows a line and reads direction instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use bar charts&lt;/strong&gt; for comparisons. They're simple, they work, everyone understands them. Always start your bar chart at zero — truncating the axis is one of the most common ways charts accidentally lie. And horizontal bar charts are massively underrated, especially when your category labels are long.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use scatter plots&lt;/strong&gt; rarely, but when you need to show a relationship between two variables — correlation, clusters, outliers — they're excellent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use slope charts&lt;/strong&gt; when you're comparing two points in time across multiple categories. It's incredibly efficient. Instead of a 10-line chart mess, you get a clean "here's what changed" visual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid:&lt;/strong&gt; pie charts (the human eye is bad at comparing angles and areas), 3D charts (they distort every number), donut charts (same problem as pie, but with a hole), and dual y-axis charts (they create confusion almost every time).&lt;/p&gt;

&lt;p&gt;The key insight: there's rarely one "correct" chart, but there are definitely wrong ones. The right question is: what comparison or pattern am I trying to make obvious? Then pick the chart that makes that thing most obvious.&lt;/p&gt;




&lt;h2&gt;
  
  
  Clutter Is Actively Hurting Your Credibility
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms41fqs9xxa6ae1dmqty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms41fqs9xxa6ae1dmqty.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This section of the book made me feel genuinely embarrassed about past work.&lt;/p&gt;

&lt;p&gt;There's a concept called &lt;strong&gt;cognitive load&lt;/strong&gt; — basically, the mental energy your brain spends processing what it sees. When a chart is cluttered, your audience burns that energy on navigation instead of understanding. They get tired. They disengage. They ask "what am I supposed to take away from this?" — which is exactly what happened to me.&lt;/p&gt;

&lt;p&gt;The book introduces the &lt;strong&gt;Gestalt principles of visual perception&lt;/strong&gt;, which are psychological rules for how humans group things visually. The practical ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Things that are &lt;strong&gt;close together&lt;/strong&gt; look related. Use spacing intentionally.&lt;/li&gt;
&lt;li&gt;Things that are &lt;strong&gt;similar in color or shape&lt;/strong&gt; look related. Be deliberate about when you make things the same color.&lt;/li&gt;
&lt;li&gt;Things that are &lt;strong&gt;enclosed&lt;/strong&gt; together (light shading, borders) look like a group. A subtle background box can do a lot of work.&lt;/li&gt;
&lt;li&gt;People's eyes follow the &lt;strong&gt;smoothest path&lt;/strong&gt;. Diagonal text and labels force people to rotate their heads or mentally rotate the text. Don't do it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The decluttering checklist is almost aggressive in its simplicity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Remove the chart border (you don't need it)&lt;/li&gt;
&lt;li&gt;Remove gridlines (or make them very light grey)&lt;/li&gt;
&lt;li&gt;Remove data markers from line charts (unless a specific point matters)&lt;/li&gt;
&lt;li&gt;Clean up axis labels (fewer, simpler)&lt;/li&gt;
&lt;li&gt;Label data directly instead of using a legend (legends force eye travel)&lt;/li&gt;
&lt;li&gt;Use one consistent color palette&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When I first tried this on a chart I'd made, it felt like I was deleting half my work. The result looked better in every way. The data became the star because there was nothing else competing for attention.&lt;/p&gt;
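&lt;p&gt;If you work in Python, the checklist above maps almost directly onto matplotlib calls. A sketch with invented data (the labels, colors, and filename are my choices, not the book's):&lt;/p&gt;

```python
# The decluttering checklist applied in matplotlib: no border, no legend,
# barely-there gridlines, direct labels. Data is invented for illustration.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

months = list(range(1, 7))
product_a = [10, 12, 15, 14, 18, 21]
product_b = [11, 11, 10, 9, 9, 8]

fig, ax = plt.subplots()
ax.plot(months, product_a, color="#1f77b4")
ax.plot(months, product_b, color="#bbbbbb")

# Label lines directly at their endpoints instead of using a legend
ax.text(6.1, product_a[-1], "Product A", color="#1f77b4", va="center")
ax.text(6.1, product_b[-1], "Product B", color="#bbbbbb", va="center")

# Remove the chart border; keep only a faint horizontal grid
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)
ax.grid(axis="y", color="#eeeeee")

fig.savefig("units_sold.png")
```

&lt;p&gt;Nothing about the data changed. Everything about where your eye goes did.&lt;/p&gt;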




&lt;h2&gt;
  
  
  Your Brain Has Three Types of Memory, and They All Matter for Presentations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8xxyo5lh66y03rpcrvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8xxyo5lh66y03rpcrvm.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick neuroscience detour that genuinely changed how I think about making slides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iconic memory&lt;/strong&gt; is your visual buffer — it captures everything you see for a fraction of a second before your brain decides what to pay attention to. It processes things like color, size, and movement before you're consciously aware of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-term (working) memory&lt;/strong&gt; can only hold about four chunks of information at once. Four. When you put a chart with 12 data series, 3 legends, a title, axis labels, and annotations on a single slide, you are blowing past that limit immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term memory&lt;/strong&gt; is where you want your key message to end up. Information gets there through repetition and through the combination of images + words (they reinforce each other).&lt;/p&gt;

&lt;p&gt;The practical implication: simplify each slide so your audience only has to hold two or three things in working memory, and they'll remember your message.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pre-Attentive Attributes: Directing Eyes Without Saying a Word
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg9lfe49yf3wcb53jalm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg9lfe49yf3wcb53jalm.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part that feels like a superpower once you understand it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-attentive attributes&lt;/strong&gt; are visual properties that the human eye notices before conscious attention kicks in — things like color, size, position, and contrast. In the fraction of a second before your audience "decides" to look at something, their eyes have already been directed.&lt;/p&gt;

&lt;p&gt;The practical implication: if you make one bar orange in a chart full of grey bars, every single person in the room will look at that bar first. Before they read the title. Before they read the axis. That bar becomes the thing.&lt;/p&gt;

&lt;p&gt;This is how you guide your audience without speaking.&lt;/p&gt;

&lt;p&gt;The rules the book gives for color specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use it &lt;strong&gt;sparingly&lt;/strong&gt;. If everything is colorful, nothing is.&lt;/li&gt;
&lt;li&gt;Use it &lt;strong&gt;consistently&lt;/strong&gt;. Once you pick a color to mean "this is the important thing," it should always mean that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be colorblind-aware&lt;/strong&gt;. Red-green combinations are seen differently by roughly 8% of men. Add secondary cues (line style, labels) if you have no choice.&lt;/li&gt;
&lt;li&gt;Be thoughtful about what colors &lt;strong&gt;mean culturally&lt;/strong&gt;. Red often means danger or loss. Green means growth or positive. Don't accidentally tell a different story with your palette.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Size is the other big one. Bigger = more important. If you make something small, people will treat it as footnote material.&lt;/p&gt;
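&lt;p&gt;In matplotlib, the one-orange-bar trick is literally a list of colors. A sketch with invented categories and values:&lt;/p&gt;

```python
# One orange bar in a field of grey: the pre-attentive highlight in code.
# Category names and values are invented for illustration.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West", "Online"]
revenue = [42, 38, 35, 31, 67]

# Grey for context, a single accent color for the point you're making
colors = ["#c8c8c8"] * len(regions)
colors[regions.index("Online")] = "#e69138"

fig, ax = plt.subplots()
ax.bar(regions, revenue, color=colors)  # bars start at zero by default
ax.set_title("Online is now our largest channel")
fig.savefig("revenue_by_region.png")
```

&lt;p&gt;Note the title: it states the takeaway, not "Revenue by Region." The color and the words are telling the same story.&lt;/p&gt;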




&lt;h2&gt;
  
  
  The "Slideument" Problem Is Real and It's Destroying Your Meetings
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F524qr7bxrrvfqb6hvq2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F524qr7bxrrvfqb6hvq2h.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've ever made slides for a presentation and then emailed those same slides as a follow-up document, you've created what the book calls a "slideument."&lt;/p&gt;

&lt;p&gt;The problem: presentations and documents have opposite requirements.&lt;/p&gt;

&lt;p&gt;A presentation slide needs to be simple enough that someone can absorb it while listening to you speak. A document needs enough detail that someone can understand it without you there to explain.&lt;/p&gt;

&lt;p&gt;Trying to do both simultaneously means you fail at both. Your presentation slides are too text-heavy to follow live. Your circulated slides are too sparse to stand alone.&lt;/p&gt;

&lt;p&gt;The honest solution the book recommends: make two versions. Simple slides for the room. Annotated document for circulation. Yes, it's more work. But it actually works.&lt;/p&gt;

&lt;p&gt;The shortcut most people can actually execute: for your presentation, use &lt;strong&gt;progressive animation&lt;/strong&gt; to reveal data piece by piece in sync with your narration. Then, for the circulated version, send the final annotated slide with callout text explaining each key observation. Same underlying visual, different experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Has a Three-Act Structure, Just Like Every Great Story
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwn5vei337vns8wg95qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwn5vei337vns8wg95qw.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the book goes from practical to genuinely philosophical, and I mean that as a compliment.&lt;/p&gt;

&lt;p&gt;The structure of a great play is the structure of a great presentation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 1 — Setup:&lt;/strong&gt; Establish the world. Who is the main character (the metric, the product, the team)? What was the situation? What changed?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 2 — Conflict:&lt;/strong&gt; What's the problem? What happens if nothing changes? What have you tried?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 3 — Resolution:&lt;/strong&gt; What's the recommended action? What does success look like?&lt;/p&gt;

&lt;p&gt;Nancy Duarte, one of the great thinkers on presentation design, describes this as the tension between "what is" and "what could be." Your data tells the story of that gap — and your job is to make the audience care enough to close it.&lt;/p&gt;

&lt;p&gt;Every dataset has a story in it. The mistake is presenting the data and waiting for the audience to find the story themselves. They won't. Or they'll find the wrong one. You need to have done that work beforehand and then &lt;em&gt;guide&lt;/em&gt; them through it.&lt;/p&gt;

&lt;p&gt;A test the book gives: if you cover up all your charts and only read your slide titles in order, does a coherent story emerge? If yes, you have &lt;strong&gt;horizontal logic&lt;/strong&gt; — your titles alone tell the story. If the titles are just labels (Q1 Results, Q2 Results, Q3 Results), you have a data dump, not a presentation.&lt;/p&gt;

&lt;p&gt;Another test: does every element on each slide directly support the message in that slide's title? If there are elements that don't — even interesting ones — cut them. This is &lt;strong&gt;vertical logic&lt;/strong&gt;. One slide, one idea.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Repetition Thing That Feels Redundant But Isn't
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6nqfzx31ote2w9hdm0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6nqfzx31ote2w9hdm0q.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's something counterintuitive about presenting to humans: saying the same thing multiple times is not annoying. It's effective.&lt;/p&gt;

&lt;p&gt;Working memory is fragile. By the time you're on slide 8, most people have forgotten the key point you made on slide 2. Repetition moves information from short-term to long-term memory.&lt;/p&gt;

&lt;p&gt;The structure the book advocates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tell them what you're going to tell them&lt;/strong&gt; (opening summary)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell them&lt;/strong&gt; (the actual content)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell them what you told them&lt;/strong&gt; (closing recap)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yes, it feels redundant to you as the presenter. You've been living with this content for weeks. But your audience is hearing it for the first time, potentially while also thinking about their next meeting. Repetition isn't condescending. It's considerate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Actionable Takeaways (The Part You'll Actually Save)
&lt;/h2&gt;

&lt;p&gt;If you've made it this far, here's the condensed playbook:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before you build anything:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Answer: Who? What action? How will it be delivered?&lt;/li&gt;
&lt;li&gt;Write your Big Idea as one complete sentence.&lt;/li&gt;
&lt;li&gt;Storyboard on paper before opening any tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When choosing your visual:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default to line charts for trends, bar charts for comparisons.&lt;/li&gt;
&lt;li&gt;When in doubt, horizontal bar chart.&lt;/li&gt;
&lt;li&gt;Kill the pie charts. I'm serious.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When cleaning up your visual:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove every element that doesn't earn its place.&lt;/li&gt;
&lt;li&gt;No chart borders. Minimal gridlines. Direct labels over legends.&lt;/li&gt;
&lt;li&gt;Use color sparingly — one accent color max.&lt;/li&gt;
&lt;/ul&gt;
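&lt;p&gt;The cleanup checklist above can be sketched in matplotlib (hypothetical numbers; the specific colors and label positions are just for illustration):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere
import matplotlib.pyplot as plt

# Hypothetical monthly values for two product lines
months = list(range(1, 13))
product_a = [12, 14, 15, 15, 17, 19, 22, 24, 26, 27, 29, 31]
product_b = [11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17]

fig, ax = plt.subplots()

# One accent color for the series you want people to see; gray for context
ax.plot(months, product_a, color="#1f77b4", linewidth=2)
ax.plot(months, product_b, color="#bbbbbb", linewidth=2)

# No chart border: hide the top and right spines
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

# Minimal gridlines
ax.grid(False)

# Direct labels at the end of each line instead of a legend
ax.text(12.2, product_a[-1], "Product A", color="#1f77b4", va="center")
ax.text(12.2, product_b[-1], "Product B", color="#999999", va="center")

fig.savefig("trend.png")
```

&lt;p&gt;Every line in the block maps to one checklist item: spines off, grid off, no legend object, one accent color.&lt;/p&gt;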

&lt;p&gt;&lt;strong&gt;When presenting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your slides are a visual aid for a story you already know.&lt;/li&gt;
&lt;li&gt;Practice your 3-minute verbal version before the meeting.&lt;/li&gt;
&lt;li&gt;Never read your slides aloud. Your audience is literate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When writing your story:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slide titles should tell the story on their own.&lt;/li&gt;
&lt;li&gt;Structure: tension (what is) → conflict (why it matters) → resolution (what to do).&lt;/li&gt;
&lt;li&gt;Repeat your key message at the start, middle, and end.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For the long game:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get good at one tool (Python/Matplotlib, Tableau, even Excel properly).&lt;/li&gt;
&lt;li&gt;Seek feedback from people outside your team — they'll catch what you're too close to see.&lt;/li&gt;
&lt;li&gt;Collect examples of great data visualization and study what makes them work.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Real Lesson (It Has Nothing to Do With Charts)
&lt;/h2&gt;

&lt;p&gt;Here's what I took away that I didn't expect.&lt;/p&gt;

&lt;p&gt;The skills in this book — understanding your audience, crafting a clear message, eliminating noise, guiding attention, telling a story with a beginning, conflict, and resolution — these are not data skills. They're communication skills. They work in presentations, in emails, in Slack messages, in job interviews.&lt;/p&gt;

&lt;p&gt;Most early career people think being good at their craft is enough. Write clean code, run solid analyses, design polished mockups. But the people who actually advance are the ones who can make their work land — who can take something complex and make it clear to someone who doesn't have their context.&lt;/p&gt;

&lt;p&gt;The chart is never the point. The decision the chart enables is the point.&lt;/p&gt;

&lt;p&gt;I still think about that VP's question in my third week. "What am I supposed to take away from this?" At the time it felt like a failure. Now I treat it as the most important question in every presentation I build.&lt;/p&gt;

&lt;p&gt;Answer that question before your audience has to ask it. That's the whole game.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this resonated with you, I'd genuinely recommend picking up Storytelling with Data — it's a fast read and the before/after chart makeovers alone are worth it. For more in the same vein, check out eagereyes.org, flowingdata.com, and storytellingwithdata.com. Happy to discuss any of this in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>productivity</category>
      <category>books</category>
      <category>beginners</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Akshay Rajinikanth</dc:creator>
      <pubDate>Sat, 21 Feb 2026 14:06:53 +0000</pubDate>
      <link>https://dev.to/akshay_rajinikanth/-d4l</link>
      <guid>https://dev.to/akshay_rajinikanth/-d4l</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/akshay_rajinikanth" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3777734%2F7e08e17c-e9dd-4449-a06a-b059eda43be8.jpg" alt="akshay_rajinikanth"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/akshay_rajinikanth/when-similarity-search-breaks-why-rag-fails-on-numerical-queries-1c3g" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;When Similarity Search Breaks: Why RAG Fails on Numerical Queries&lt;/h2&gt;
      &lt;h3&gt;Akshay Rajinikanth ・ Feb 19&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#beginners&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>When Similarity Search Breaks: Why RAG Fails on Numerical Queries</title>
      <dc:creator>Akshay Rajinikanth</dc:creator>
      <pubDate>Thu, 19 Feb 2026 03:53:02 +0000</pubDate>
      <link>https://dev.to/akshay_rajinikanth/when-similarity-search-breaks-why-rag-fails-on-numerical-queries-1c3g</link>
      <guid>https://dev.to/akshay_rajinikanth/when-similarity-search-breaks-why-rag-fails-on-numerical-queries-1c3g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcjygzd6pmhvl37j4074.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcjygzd6pmhvl37j4074.png" alt="cover image" width="704" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was building a chatbot using Retrieval-Augmented Generation (RAG) over a semi-structured insurance database. The system answered questions about policies, coverage, reviews, and claims history. During testing, I asked: “What health problems commonly result in insurance claims over $1,000?” The chatbot confidently listed diabetes complications ($847), minor surgeries ($654), and preventive care ($423), all below the requested threshold.&lt;/p&gt;

&lt;p&gt;This was not an isolated mistake. To investigate, I created a minimal test case using 20 smartphones and built a basic RAG system. When asked “Phones released in 2024,” the system returned iPhone SE (2022), Nothing Phone (2) (2023), and Samsung Galaxy A54 (2023). The retrieved context looked relevant, yet it violated the numeric constraint.&lt;/p&gt;

&lt;p&gt;If you build RAG applications, you’ve likely encountered this pattern: semantic questions work reliably, while queries involving numbers, dates, or quantities fail systematically. The issue is not the language model’s reasoning; it is the retrieval step. Embedding models encode numbers as tokens rather than ordered values, and vector search retrieves semantically similar documents instead of numerically valid ones. In embedding space, “$499” and “$999” can appear close despite representing very different quantities.&lt;/p&gt;
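&lt;p&gt;A tiny self-contained illustration of the tokens-versus-values point. Embeddings are not lexical sorts, but they share the same blind spot: a number treated as a string (or a token) carries no magnitude.&lt;/p&gt;

```python
prices = ["$999", "$1,000", "$499"]

# Lexical (string) order: how text tokens compare
lexical = sorted(prices)

# Numeric order: what a threshold query actually needs
numeric = sorted(prices, key=lambda p: int(p.strip("$").replace(",", "")))

print(lexical)  # ['$1,000', '$499', '$999'] -- "$1,000" sorts before "$499"
print(numeric)  # ['$499', '$999', '$1,000']
```

&lt;p&gt;Any representation that compares numbers as text, rather than as values, will rank them in ways that violate numeric constraints.&lt;/p&gt;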

&lt;p&gt;This behavior has practical implications. Systems used in finance, healthcare, compliance, or analytics dashboards often rely on thresholds and filters; retrieving the wrong evidence can produce confident but incorrect conclusions. The failure stems from similarity search optimizing semantic closeness rather than respecting structured constraints. In this article, I will examine why this happens and how to address it in practical RAG pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Embeddings Struggle with Hard Constraints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This failure is not a bug but a consequence of how similarity is computed in embedding space.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When text is embedded, the transformer maps it to a point in a high-dimensional vector space, positioning semantically similar chunks close together. Retrieval is then performed by ranking chunks using cosine similarity. In my project, the retrieved recommendations were topically relevant to the query, but similarity was dominated by shared product descriptions rather than the release year, producing results with correct context but incorrect dates. The model learns that “$999” and “$499” are related as prices, yet it does not encode their numerical relationship.&lt;/p&gt;

&lt;p&gt;A simple observation explains this behavior: embedding models capture semantic distinctions such as “cheap” vs. “expensive,” categorical groupings like “budget” vs. “flagship,” and even recognize that “$499” and “$999” represent prices, but they do not represent ordered relationships such as magnitude ($299 &amp;lt; $500 &amp;lt; $799 &amp;lt; $999), temporal sequence (2022 &amp;lt; 2023 &amp;lt; 2024), or quantitative comparison (64GB &amp;lt; 128GB &amp;lt; 256GB).&lt;/p&gt;

&lt;p&gt;On a traditional number line, ordering is explicit: values have fixed relative positions.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zkh2iunkq1s5p1xkxu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zkh2iunkq1s5p1xkxu4.png" alt="RAG embedding space and ordered quntity comparison" width="754" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key point is that the &lt;strong&gt;model isn’t making a mistake&lt;/strong&gt;: the search space itself is wrong for constraint-based queries. In my insurance project I tried prompt engineering, few-shot examples, switching models, even tuning temperature, but nothing improved the results. By the time the language model received the retrieved documents, the incorrect candidates had already been selected. The generation step was operating on flawed evidence. In the smartphone test, I was effectively asking the model to choose phones under $500 from a candidate set containing an iPhone 15 Pro ($999) and a Samsung S24 ($799). The failure occurred before reasoning began.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Retrieval: Separating Constraints from Semantic Search
&lt;/h2&gt;

&lt;p&gt;The failure reveals a design mismatch: &lt;em&gt;semantic similarity and logical constraints are different problems&lt;/em&gt;. Vector search excels at understanding meaning, ranking relevance, and capturing semantics, but it does not enforce logical constraints, numerical comparisons, or exact number matching.&lt;/p&gt;

&lt;p&gt;Instead of forcing embeddings and similarity search to handle both, we separate responsibilities: embeddings determine relevance, while structured filters enforce validity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv58zjp47tu16hv3d9eqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv58zjp47tu16hv3d9eqb.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern vector databases support metadata filtering alongside vector search capabilities. Structured constraints are applied first to reduce the candidate set, after which semantic search is applied to find relevant results. This prevents numerically invalid documents from ever entering the retrieval stage and improves both accuracy and latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  The implementation takes three simple steps:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Step 1: Extract the metadata
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def extract_metadata(text: str) -&amp;gt; dict:
    """Parse structured data from text."""
    return {
        'price': float(re.search(r'\$(\d+)', text).group(1)),
        'release_year': int(re.search(r'(\d{4})-', text).group(1)),
        'category': next(c for c in ['budget', 'flagship', 'premium'] if c in text.lower())
    }

# Store with both embedding and metadata
doc = Document(
    page_content=content,
    metadata=extract_metadata(content)  # ← Indexed separately
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Metadata extraction can be implemented with regex (rule-based extraction), which is extremely fast and reliable on semi-structured data. Named entity recognition models, LLMs, or a hybrid of both can handle purely unstructured data, at the cost of latency and occasional formatting errors.&lt;/p&gt;
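&lt;p&gt;For a record in the shape the regexes above expect (the input string here is a made-up example), the extractor yields a filterable dict. A self-contained copy with the import included:&lt;/p&gt;

```python
import re

def extract_metadata(text: str) -> dict:
    """Parse structured fields out of a semi-structured record."""
    # Note: re.search returns None on a miss, so production code should
    # guard each field; this sketch assumes well-formed records.
    return {
        "price": float(re.search(r"\$(\d+)", text).group(1)),
        "release_year": int(re.search(r"(\d{4})-", text).group(1)),
        "category": next(c for c in ["budget", "flagship", "premium"] if c in text.lower()),
    }

record = "Samsung Galaxy A54, a budget phone at $449, released 2023-03"
meta = extract_metadata(record)
print(meta)  # {'price': 449.0, 'release_year': 2023, 'category': 'budget'}
```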

&lt;h4&gt;
  
  
  Step 2: Parsing constraints from queries
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_constraints(query: str) -&amp;gt; dict:
    """
    Convert natural language constraints into database filters.

    Examples:
        "phones under $500"  -&amp;gt; {'price': {'$lte': 500}}
        "phones in 2024"     -&amp;gt; {'release_year': 2024}
    """
    filters = {}

    # Price: under / below / less than
    if match := re.search(r'(?:under|below|less than)\s*\$(\d+)', query, re.I):
        filters['price'] = {'$lte': int(match.group(1))}

    # Price: over / above / more than
    if match := re.search(r'(?:over|above|more than)\s*\$(\d+)', query, re.I):
        filters['price'] = {'$gte': int(match.group(1))}

    # Year constraint
    if match := re.search(r'(?:in|from)\s*(\d{4})', query):
        filters['release_year'] = int(match.group(1))

    return filters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the query is converted into structured filters; in the example above, the constraint values are parsed with regex. In production systems this step may also be implemented with tool-calling, letting the model output structured parameters instead of raw text.&lt;/p&gt;
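&lt;p&gt;As a quick sanity check, here is a self-contained copy of the parser with the import included, showing how a few phrasings map to filters:&lt;/p&gt;

```python
import re

def parse_constraints(query: str) -> dict:
    """Convert natural-language constraints into metadata filters."""
    filters = {}
    # Price ceiling: under / below / less than
    if match := re.search(r"(?:under|below|less than)\s*\$(\d+)", query, re.I):
        filters["price"] = {"$lte": int(match.group(1))}
    # Price floor: over / above / more than
    if match := re.search(r"(?:over|above|more than)\s*\$(\d+)", query, re.I):
        filters["price"] = {"$gte": int(match.group(1))}
    # Year constraint
    if match := re.search(r"(?:in|from)\s*(\d{4})", query):
        filters["release_year"] = int(match.group(1))
    return filters

print(parse_constraints("Show me phones under $500"))  # {'price': {'$lte': 500}}
print(parse_constraints("Phones released in 2024"))    # {'release_year': 2024}
print(parse_constraints("phones over $700"))           # {'price': {'$gte': 700}}
```

&lt;p&gt;Note that a query containing both a ceiling and a floor would need a combined filter; the sketch keeps one price key for simplicity.&lt;/p&gt;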

&lt;h4&gt;
  
  
  Step 3:  Applying the filters before searching
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search(query: str, k: int = 5):
    """
    Apply structured filters first, then rank remaining results by semantic similarity.
    """
    filters = parse_constraints(query)

    return vectorstore.similarity_search(
        query,
        k=k,
        filter=filters  # ChromaDB applies metadata filtering BEFORE vector search
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vector database first narrows the candidate set using metadata constraints, and only then performs semantic ranking on the filtered results. As a result, the language model receives only valid evidence, eliminating the earlier failure mode where reasoning operated on incorrect context.&lt;/p&gt;
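&lt;p&gt;Putting the three steps together in a minimal in-memory sketch, with toy data and keyword overlap standing in for a real vector store and embeddings:&lt;/p&gt;

```python
import re

def parse_constraints(query):
    # Step 2: turn phrasing into structured filters
    filters = {}
    if m := re.search(r"(?:under|below|less than)\s*\$(\d+)", query, re.I):
        filters["max_price"] = int(m.group(1))
    if m := re.search(r"(?:in|from)\s*(\d{4})", query):
        filters["year"] = int(m.group(1))
    return filters

# Step 1 output: docs stored with text plus extracted metadata
docs = [
    {"text": "iPhone 15 Pro flagship camera phone", "price": 999, "year": 2023},
    {"text": "Pixel 7a affordable camera phone",    "price": 499, "year": 2023},
    {"text": "Galaxy A54 budget phone",             "price": 449, "year": 2023},
]

def search(query, k=2):
    f = parse_constraints(query)
    # Step 3: hard filters first...
    valid = [d for d in docs
             if f.get("max_price", 10**9) >= d["price"]
             and d["year"] == f.get("year", d["year"])]
    # ...then semantic ranking (keyword overlap as a stand-in for cosine similarity)
    terms = set(query.lower().split())
    rank = lambda d: len(terms.intersection(d["text"].lower().split()))
    return sorted(valid, key=rank, reverse=True)[:k]

results = search("camera phones under $500")
print([d["text"] for d in results])
# ['Pixel 7a affordable camera phone', 'Galaxy A54 budget phone']
```

&lt;p&gt;The iPhone never reaches the ranking step, so the generator can only see valid evidence.&lt;/p&gt;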

&lt;h2&gt;
  
  
  Evaluation: Constraint Satisfaction Before vs After Filtering
&lt;/h2&gt;

&lt;p&gt;To evaluate the behavior, I used a synthetic semi-structured dataset of 20 smartphones and queried the system using top-k retrieval. A result was marked correct only if all returned items satisfied the numeric constraint. Both systems used the same embedding model, LLM, and prompts; only the retrieval strategy was modified.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Basic RAG&lt;/th&gt;
&lt;th&gt;Metadata RAG&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Show me phones under $500&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;+40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phones released in 2024&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;+100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget phones under $400&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;+80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flagship phones between $700 and $900&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;+80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is consistent: basic RAG significantly underperforms on constraint-based queries, often returning correct answers only by random chance. After adding metadata filtering, every query satisfies its numerical conditions, demonstrating that separating structured constraints from semantic retrieval turns RAG into a reliable system for quantitative questions.&lt;/p&gt;

&lt;p&gt;In the implementation we did not change the embedding model, LLM, or the prompts. We just modified the retrieval objective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozcsnovihyjbjqfnnjxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozcsnovihyjbjqfnnjxb.png" alt="O/P basic rag" width="726" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above is the output of the basic RAG code. The standard semantic RAG retrieves contextually relevant smartphones but ignores the numerical constraint. Results include items outside the requested range because the embedding search treats “$500” as context, not a hard filter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsdvpjxzyyjil3jpt1ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsdvpjxzyyjil3jpt1ss.png" alt="o/p metadata-aware rag" width="678" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above is the output of metadata-aware RAG. The system first applies structured filters (price, year, category) and only then performs semantic ranking. Every returned result now satisfies the constraint, showing correct retrieval instead of approximate semantic matches.&lt;/p&gt;

&lt;p&gt;Separating constraints from semantic retrieval enables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Composable multi-constraint queries&lt;/li&gt;
&lt;li&gt;Scalable filtering on large datasets&lt;/li&gt;
&lt;li&gt;Deterministic enforcement of business rules&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The performance gain comes from correcting evidence selection rather than improving reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Ways to Handle Constraints in RAG
&lt;/h2&gt;

&lt;p&gt;Metadata filtering is not the only strategy to handle numerical queries in RAG, but it differs from others in an important way: it enforces constraints before retrieval rather than correcting them afterward.&lt;/p&gt;

&lt;p&gt;A common workaround is &lt;strong&gt;post-retrieval filtering&lt;/strong&gt;: retrieve a larger candidate set (for example, top 20 or top 50) using pure vector search, then ask the LLM to remove invalid results. This helps, but it wastes retrieval budget on irrelevant documents, and the model still misjudges boundaries (“around $500” vs. “under $500”). The behavior remains probabilistic rather than reliable.&lt;/p&gt;

&lt;p&gt;Another attempt is &lt;strong&gt;query rewriting&lt;/strong&gt;. For example, transforming “phones under $500” into phrases like “cheap affordable budget low-cost phones below 500 dollars.” This shifts similarity toward cheaper items, yet semantic closeness is not equivalent to numerical correctness, and high-priced phones often remain in the candidate set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can also apply a programmatic filter after retrieval:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;[r for r in results if r.metadata['price'] &amp;lt; 500]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This removes invalid outputs, but it requires retrieving several times more documents than necessary. When valid items are rare, the system may still fail simply because the correct documents were never retrieved.&lt;/p&gt;
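&lt;p&gt;The shortfall is easy to demonstrate: when the few valid documents rank below the top-k semantically, post-retrieval filtering cannot recover them, while pre-filtering can (toy candidates with made-up similarity scores):&lt;/p&gt;

```python
# Toy candidates ranked by a pretend similarity score (higher = more similar)
candidates = [
    {"name": "iPhone 15 Pro", "price": 999, "score": 0.92},
    {"name": "Samsung S24",   "price": 799, "score": 0.90},
    {"name": "Pixel 8",       "price": 699, "score": 0.88},
    {"name": "Galaxy A54",    "price": 449, "score": 0.61},  # valid, ranked low
    {"name": "Moto G",        "price": 199, "score": 0.55},  # valid, ranked low
]

k = 3
top_k = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]

# Post-retrieval filtering: constraint applied AFTER similarity search
post_filtered = [c for c in top_k if 500 >= c["price"]]
print(len(post_filtered))  # 0 -- every valid phone was outside the top-3

# Pre-filtering: constraint applied BEFORE ranking
valid = [c for c in candidates if 500 >= c["price"]]
pre_filtered = sorted(valid, key=lambda c: c["score"], reverse=True)[:k]
print([c["name"] for c in pre_filtered])  # ['Galaxy A54', 'Moto G']
```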

&lt;p&gt;&lt;strong&gt;Hybrid dense-sparse search (BM25 combined with vectors)&lt;/strong&gt; helps with exact keyword matching, yet numbers remain tokens rather than ordered quantities: “500” and “999” are matched lexically, not numerically.&lt;/p&gt;

&lt;p&gt;Finally, embeddings can be &lt;strong&gt;fine-tuned&lt;/strong&gt; on numerical data so the model learns ordering relationships. However, this increases training cost and still yields probabilistic improvements rather than guaranteed constraint satisfaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why metadata filtering wins
&lt;/h3&gt;

&lt;p&gt;Metadata filtering changes the retrieval objective itself. Instead of encouraging the model to respect numerical constraints, it enforces them before semantic ranking occurs. Modern vector databases support structured filtering alongside similarity search because applying constraints first reduces the candidate space and prevents invalid evidence from entering the reasoning stage.&lt;/p&gt;

&lt;p&gt;In other words, most alternatives attempt to persuade the model to behave correctly, whereas metadata filtering ensures the system operates only on valid inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem Was Retrieval
&lt;/h2&gt;

&lt;p&gt;When my insurance chatbot answered “claims over $1,000” with results under $1,000, I initially treated it as a prompting problem. I tried clearer instructions, added examples, and switched models; nothing changed.&lt;br&gt;
The issue was not the language model but the retrieval stage. The system never retrieved numerically valid documents, so the model reasoned correctly over incorrect evidence. Embeddings place $499 and $999 near each other as “prices,” not as ordered quantities.&lt;/p&gt;

&lt;p&gt;The resolution was architectural rather than model-based: metadata enforces constraints, while vector search determines relevance. Once those responsibilities were separated, the system became predictable instead of approximate.&lt;/p&gt;

&lt;p&gt;The broader lesson extends beyond RAG. When building AI systems, improving reasoning often means improving evidence selection. Many apparent model failures are retrieval failures in disguise, and reliability comes from structuring the search space, not from making the model larger.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Any AI system that retrieves probabilistic context for deterministic requirements will appear to “reason badly” even when the reasoning is correct.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
