How I Routed Associate Questions Across Specialized TF-IDF Indexes Before Assembly
TL;DR
This write-up documents a personal experiment I ran while thinking about how retail associates actually use knowledge in the moment. A shopper rarely asks a question that fits neatly into a single policy PDF. The phrasing is noisy, the intent is mixed, and the clock is always ticking. I built a small Python system that treats each question as a routing problem rather than a retrieval problem with a single index. Three independent TF-IDF corpora stand in for returns policies, product care guidance, and service-floor procedures. An orchestrator scores each domain, retrieves top hits from the winner, and optionally blends in a second domain when the primary score looks weak. I kept the entire pipeline on-device without calling a hosted language model, because I wanted the evidence to be inspectable and reproducible on a laptop. The repository is public for learning purposes only; it is not a product recommendation, not a deployment blueprint, and not connected to anything I have shipped at a job. I am describing it as a proof of concept because that is what it is, and I am careful not to claim that this small corpus behaves like a real enterprise knowledge base. If you only take one sentence away, take this: I cared more about inspectable routing than about impressing anyone with model names, and that priority shaped every file I wrote.
Introduction
Before I describe the code, I want to anchor the emotional posture of this article. I am writing as an individual who builds experiments in public to learn, not as someone who is reporting on a deployed system or claiming validated business outcomes. That framing matters because retail is easy to romanticize and easy to misrepresent. I have spent a fair amount of time watching how people ask questions in retail settings, not as a researcher with formal instruments but as someone who pays attention to phrasing when I am in line or when I am helping a friend think through a store policy. The questions are rarely perfect. Someone might say “I need to return this” while also mentioning “the coating on the jacket feels wrong after one wash.” That sentence mixes two worlds. One world is about returns and receipts. The other is about care instructions and product durability. If you flatten everything into one retrieval index, you can still get plausible text, but you lose the ability to explain why a snippet was chosen. That loss of explainability matters to me as an engineer, not because I dislike neural models, but because I like to know which shelf the system reached for first.
In my PoC I chose a retail floor framing because it is relatable, easy to illustrate with synthetic documents, and clear of the domains I have deliberately set aside in this personal writing series. I am not describing inventory optimization, warehouse counts, or pricing strategy here. I am not touching financial advice. I am staying with a narrow slice: how an associate-facing assistant might assemble evidence before anyone writes a customer-facing sentence. I also want to be clear about the social layer. Associates are not robots. They use judgment. A system that pretends to replace that judgment with a single score is not something I would defend. What I am experimenting with is a structured scaffold that keeps evidence visible so a human can still override.
I wrote this article because I wanted a serious project that still fits on a laptop. I have seen enough demos where a model produces fluent language and hides the underlying evidence. I wanted the opposite. I wanted logs that read like engineering notes, not marketing copy. The code is structured so that I can print a bundle of evidence rows with identifiers and cosine scores. That is not glamorous, but it is the kind of transparency I find useful when I iterate.
There is another motivation I should state plainly. I am interested in practices that survive contact with messy language. People shorten words, omit nouns, and rely on context. They say “the thirty-day thing” instead of “the return window policy.” Any retrieval system that assumes the query already contains canonical terms will fail in ways that look embarrassing on a demo but painful in real life. I did not solve that fully here. I only created a place to talk about it honestly while still writing code.
I also want readers to know the scope boundary I used while writing. This article discusses a synthetic dataset and illustrative scoring rules. It does not describe any real retailer’s policies, staffing model, or vendor contracts. If a phrase resembles language you have seen in the wild, that is because operational writing converges on similar vocabulary, not because I copied private material.
What's This Article About?
- The article walks through RetailFloor-AgenticRouter-AI, a Python project that ingests short customer-style questions, scores three separate TF-IDF indexes, and retrieves evidence from the best-matching domains before optional blending.
- I explain why I combined lexical hints with cosine similarity rather than relying on either signal in isolation.
- I show how the batch table and matplotlib chart help me see whether the demo is skewing toward one domain because of a bug or because of the wording of the synthetic questions.
- I discuss limitations honestly: tiny corpora, linear scoring, and heuristic thresholds are not the same as a live knowledge system.
- I include a code walkthrough that mirrors how I read the repository myself when I return to it after a gap.
Tech Stack
The implementation is intentionally boring in a good way. I rely on Python 3.10 or newer, NumPy, scikit-learn for TF-IDF and cosine similarity, matplotlib for a bar chart, and Rich for readable terminal output. There is no hosted vector database and no cloud requirement; the entire index fits in memory.
From where I stand, that stack is enough to demonstrate the idea that “agentic routing” in the small can be practiced with classical IR tooling when the corpus is tiny and the goal is structured assembly rather than open-ended generation. If I later swap TF-IDF for embeddings, the per-domain interfaces remain stable, which was a design goal while I sketched the modules.
Why Read It?
- If you are evaluating how to structure prompts or pre-model logic for operational assistants, this article offers a concrete pattern: treat context as composable blocks with clear boundaries.
- If you are learning scikit-learn’s text pipelines, the retrieval module is short and testable.
- If you care about reproducibility, the orchestrator gives you a baseline against which any future learned model can be compared.
- I think the read is most useful for practitioners who want a middle ground between “pure LLM” and “pure rules,” because the code shows exactly where those worlds meet in my PoC.
There is also a pedagogical angle I care about. Many tutorials jump straight to embeddings and vector databases without establishing why lexical baselines still matter. I am not anti-embedding; I use them elsewhere. But I believe beginners should see cosine similarity on explicit vectors at least once, because it demystifies what “nearest neighbor” means in code rather than in marketing language.
Finally, if you maintain open-source examples, you know the burden of dependencies. I kept the stack small so a reader in a constrained environment can still run the demo. That constraint shaped decisions as much as any architectural principle.
Let's Design
Framing the problem without overfitting the story
Before touching code, I spent time writing short synthetic shopper questions on paper. I noticed recurring patterns: some messages emphasize receipts and timing early, others bury the actionable detail in the second half, and a few mix care language with service language. I did not try to split multi-intent questions into multiple tickets in this repository. Instead, I focused on a single-text input so the routing stays easy to reason about. That choice trades realism for clarity, and I am comfortable stating that upfront.
Why multiple indexes instead of one concatenated corpus
The design starts from a simple observation I kept returning to while prototyping: returns policy text is not interchangeable with product care guidance. They answer different questions. Policies talk about windows, receipts, and eligibility. Care guidance talks about fabric, detergents, and storage. Service-floor procedures talk about pickup, escalations, and documented steps for adjustments. When I mixed those prematurely, I got tangled retrieval results. When I separated them, I could log each domain independently.
The retrieval layer builds one TF-IDF vectorizer per domain. Each domain has a handful of short documents with identifiers. For each query, I compute a cosine similarity between the query vector and every document vector in that domain, and I take the maximum as a domain strength signal. That is a simple baseline, but it is a baseline I can explain to a colleague without drawing diagrams.
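To make that concrete, here is a minimal sketch of the per-domain pieces as I would reconstruct them. The names DomainIndex and domain_strength mirror what the later snippets reference, but Doc, build_index, and the exact dataclass layout are my illustration, not copied from the repository:

```python
from dataclasses import dataclass
from typing import Any

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class Doc:
    doc_id: str
    text: str


@dataclass
class DomainIndex:
    vectorizer: TfidfVectorizer
    doc_matrix: Any  # sparse TF-IDF matrix, one row per document
    docs: list[Doc]


def build_index(docs: list[Doc]) -> DomainIndex:
    """Fit one TF-IDF vectorizer per domain and cache the document matrix."""
    vec = TfidfVectorizer()
    return DomainIndex(vec, vec.fit_transform([d.text for d in docs]), docs)


def domain_strength(index: DomainIndex, query: str) -> float:
    """Max cosine similarity between the query and any document in this domain."""
    qv = index.vectorizer.transform([query])
    sims = cosine_similarity(qv, index.doc_matrix).ravel()
    return float(np.max(sims)) if sims.size else 0.0
```

Taking the maximum rather than the mean is deliberate: a domain should win if even one of its documents strongly matches the query.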
The orchestration layer combines those strengths with a small lexical hint. The hint is a deliberately limited set of regular expressions that look for words like “return,” “wash,” or “curbside.” I did not try to build a full intent model. I wanted a small nudge that prevents absurd routing when the vector space is sparse. In my opinion, that is a trade you can criticize. I would rather hear that criticism than pretend the vector space is larger than it is.
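A hint like that can stay very small. This is a sketch of what the lexical nudge could look like; the pattern table and the 0.15 weight are illustrative assumptions, and the repository's actual regexes and values may differ:

```python
import re

# Hypothetical hint table: domain name -> (pattern, additive boost).
# Patterns and the 0.15 weight are illustrative, not the repo's values.
_HINTS: dict[str, tuple[re.Pattern[str], float]] = {
    "returns": (re.compile(r"\b(return|refund|receipt|exchange)\b", re.I), 0.15),
    "care": (re.compile(r"\b(wash|detergent|fabric|coating)\b", re.I), 0.15),
    "service": (re.compile(r"\b(curbside|pickup|escalat\w*)\b", re.I), 0.15),
}


def lexical_boost(domain: str, query: str) -> float:
    """Small constant nudge when any hint pattern for the domain matches."""
    pattern, weight = _HINTS[domain]
    return weight if pattern.search(query) else 0.0
```

Keeping the boost a flat constant, rather than counting matches, is what stops it from dominating the cosine signal on keyword-heavy queries.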
The agentic step as a confidence gate
I use the word “agentic” in a narrow sense. There is no autonomous loop that calls external tools. I am referring to a decision step that can widen retrieval when the primary domain looks weak. If the combined score for the primary domain falls below a threshold I tuned by hand, I pull additional hits from the next-best domain. That is not deep reasoning. It is a guardrail. I still call it agentic in the sense that the system chooses a second retrieval path based on measured confidence rather than a fixed pipeline.
If I were to extend this experiment, I would log the threshold crossings and measure how often the secondary blend helps. In this PoC, I only observe the behavior in the console.
Retrieval choices and what I rejected
I considered a few alternatives before settling on TF-IDF for the first public cut. A dense embedding model would likely rank semantically related chunks more robustly, but it would also introduce versioning questions, dependency weight, and reproducibility concerns for readers who just want to clone and run. I decided that demonstrating clean interfaces mattered more than squeezing extra retrieval quality from a miniature corpus.
I also thought about BM25. It is a strong baseline for lexical tasks and behaves well on short documents. I stayed with TF-IDF largely because the scikit-learn pipeline is familiar to many readers and the difference between BM25 and TF-IDF on a handful of short documents is unlikely to change the story materially. If I expand the corpus by an order of magnitude, BM25 or a hybrid approach becomes more compelling.
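For readers curious what the rejected alternative involves, here is a hand-rolled Okapi BM25 scorer over pre-tokenized documents. This is my sketch of the textbook formula, not code from the repository, and a production swap would more likely use a maintained library:

```python
import math
from collections import Counter


def bm25_scores(corpus_tokens: list[list[str]], query_tokens: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 over pre-tokenized documents; returns one score per document."""
    n = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n
    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with length normalization.
            saturation = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
            score += idf * saturation
        scores.append(score)
    return scores
```

The k1 and b defaults are the conventional starting points; on a corpus this small, tuning them would be indistinguishable from noise.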
Observability as a first-class requirement
I log the evidence bundle for the first query in every run not because the first query is special, but because it proves the pipeline without drowning the reader in repetition. In a longer study I would probably log structured JSON for every query and ship it to a file, but the PoC keeps stdout readable.
The matplotlib chart is part of the same philosophy. A batch table tells you what happened row by row; a distribution tells you whether the demo batch skewed toward one domain. In my experiments, skew often revealed mistakes in keyword priorities rather than retrieval mistakes, which surprised me at first.
Let's Get Cooking
The entry point is main.py. It keeps the demo batch in one helper so the narrative stays obvious when someone reads top to bottom.
def _demo_queries() -> list[tuple[str, str, str]]:
    """(id, short label, customer text)"""
    return [
        (
            "Q-01",
            "returns window",
            "I bought jeans last week, tags still on, can I still bring them back with my receipt?",
        ),
        (
            "Q-02",
            "care",
            "How should I wash this water-resistant jacket without ruining the coating?",
        ),
        # ... additional synthetic questions ...
    ]
What this does: it defines the synthetic workload as tuples that include a question identifier, a human-readable label for my own notes, and the free-text body. I structured it this way because separating labels from the text lets me test routing without fabricating metadata inside the prose.
Why I wrote it this way: early on, I inlined labels as hashtags inside the text and immediately regretted it. Parsing labels from natural language is a separate project. For this PoC, explicit fields keep runs reproducible.
The orchestration layer is run_agentic_retrieval in orchestrator.py. It ranks domains, retrieves primary hits, and optionally blends secondary hits.
def run_agentic_retrieval(
    indexes: dict[IntentDomain, DomainIndex],
    query: str,
    top_k_per_domain: int = 2,
) -> OrchestratorResult:
    ranked = rank_domains(indexes, query)
    primary_domain, primary_score = ranked[0]
    secondary_domain, secondary_score = ranked[1]

    evidence: list[tuple[IntentDomain, RetrievalHit]] = []
    primary_hits = retrieve_top_k(indexes[primary_domain], query, k=top_k_per_domain)
    for h in primary_hits:
        evidence.append((primary_domain, h))

    rationale = (
        f"Primary domain {primary_domain.value} "
        f"(combined routing score={primary_score:.3f})."
    )

    if primary_score < _SECONDARY_THRESHOLD and secondary_score > 0.05:
        secondary_hits = retrieve_top_k(indexes[secondary_domain], query, k=top_k_per_domain)
        for h in secondary_hits:
            evidence.append((secondary_domain, h))
        rationale += (
            f" Secondary blend {secondary_domain.value} "
            f"(combined={secondary_score:.3f}) because primary score stayed below "
            f"{_SECONDARY_THRESHOLD}."
        )
        return OrchestratorResult(
            query=query,
            primary_domain=primary_domain,
            primary_score=primary_score,
            secondary_domain=secondary_domain,
            secondary_score=secondary_score,
            evidence=evidence,
            rationale=rationale,
        )

    return OrchestratorResult(
        query=query,
        primary_domain=primary_domain,
        primary_score=primary_score,
        secondary_domain=None,
        secondary_score=None,
        evidence=evidence,
        rationale=rationale,
    )
What this does: it materializes the agentic gate in code. If the primary score is below threshold, I append evidence from the runner-up domain. If not, I keep the evidence narrow.
Why I wrote it this way: I wanted the branching logic to be explicit and readable. An implicit merge would have made debugging harder when I tuned thresholds.
The hybrid score itself is computed in rank_domains by combining cosine strength with lexical boosts. I kept the boosts small so they cannot dominate the vector signal.
def _combined_score(indexes: dict[IntentDomain, DomainIndex], domain: IntentDomain, query: str) -> float:
    base = domain_strength(indexes[domain], query)
    b = _lexical_boost(domain, query)
    return float(min(1.0, base + b))
What this does: it anchors the routing in measurable similarity while still allowing short, common retail words to nudge the domain when the corpus is tiny.
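The ranked list the orchestrator unpacks comes from sorting those combined scores across domains. The repo's rank_domains works over the index dict and the query; to keep this sketch self-contained I pass precomputed (cosine strength, lexical boost) pairs instead, so the signatures here are my illustration:

```python
def combined_score(cosine_strength: float, boost: float) -> float:
    """Clamp so a lexical boost can never push the hybrid score past 1.0."""
    return min(1.0, cosine_strength + boost)


def rank_domains(signals: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Return (domain, combined score) pairs sorted best-first."""
    scored = [(d, combined_score(c, b)) for d, (c, b) in signals.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The clamp matters: without it, a strong cosine match plus a keyword hit could exceed 1.0 and make the threshold comparisons harder to reason about.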
Per-domain retrieval is standard TF-IDF with cosine similarity.
def retrieve_top_k(index: DomainIndex, query: str, k: int = 3) -> list[RetrievalHit]:
    qv = index.vectorizer.transform([query])
    sims = cosine_similarity(qv, index.doc_matrix).ravel()
    order = np.argsort(-sims)[:k]
    hits: list[RetrievalHit] = []
    for i in order:
        doc = index.docs[int(i)]
        hits.append(
            RetrievalHit(
                doc_id=doc.doc_id,
                score=float(sims[int(i)]),
                snippet=doc.text,
            )
        )
    return hits
What this does: it returns the top-k most similar documents within a single domain index. That is the building block the orchestrator repeats.
Let's Setup
- Clone the public repository from GitHub: https://github.com/aniket-work/RetailFloor-AgenticRouter-AI
- Create a virtual environment inside the project directory using python -m venv venv
- Activate the environment using source venv/bin/activate on macOS or Linux
- Install dependencies with pip install -r requirements.txt
Step-by-step details can be found at the repository README in the same project. I prefer local virtual environments because they keep experiments isolated and reproducible.
Let's Run
- Run python main.py from the repository root with the virtual environment active.
- Read the first evidence bundle, which prints the primary domain and the retrieved snippets with scores.
- Read the batch summary table and open output/domain_distribution.png to see the matplotlib distribution.
Theory and personal notes on similarity in tiny corpora
When I describe TF-IDF to someone who has only used embeddings, I often start with frequency rather than geometry. Term frequency inside a document tells you what the document emphasizes. Inverse document frequency down-weights terms that appear everywhere. Once you have a sparse vector per document and a vector for the query, cosine similarity becomes a concrete operation: a dot product normalized by magnitudes. That story is not new. I still find it valuable because it connects the math to the words people actually typed.
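The "dot product normalized by magnitudes" claim is easy to verify by hand on toy vectors, which is exactly the demystifying exercise I mean:

```python
import numpy as np

# Toy term-count vectors over a shared three-word vocabulary.
doc = np.array([1.0, 2.0, 0.0])
query = np.array([1.0, 0.0, 0.0])

# Cosine similarity: dot product divided by the product of magnitudes.
cos = float(np.dot(doc, query) / (np.linalg.norm(doc) * np.linalg.norm(query)))
print(cos)  # 1 / sqrt(5), roughly 0.447
```

sklearn's cosine_similarity computes the same quantity over sparse matrices; seeing it once on dense arrays makes the library call feel less magical.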
In my experiments with this PoC, the corpus is so small that IDF behaves differently than it would on a large intranet crawl. Rare words can dominate. Common words can look overly important if they appear in multiple documents within the same domain. I noticed that when I added a few more sentences to one policy snippet, the relative rankings shifted more than I expected. That sensitivity is not a secret flaw; it is a reminder that retrieval quality tracks corpus curation.
I also thought about correlation between domains. In a single merged index, a query might retrieve a returns snippet and a care snippet together because both mention “original packaging” or similar phrasing. By splitting indexes, I force the system to decide which domain is primary before I show mixed evidence. That decision can be wrong, but at least it is explicit. In my opinion, explicit wrongness is easier to debug than implicit blending.
Another topic I want to address is calibration. Cosine scores on TF-IDF are not probabilities. I still print them because relative ranking matters more than absolute numbers for this demo. If I ever publish a more serious evaluation, I would separate “routing accuracy” from “snippet usefulness” and measure them with different labeled sets. For now, I rely on a batch table and a chart, which is humble instrumentation, but it matches the scale of the project.
Deeper walkthrough of the batch questions I chose
I picked six questions because they cover different shapes of language without pretending to be a comprehensive benchmark. The returns window question is direct. The care question uses product language. The curbside question pulls service-floor vocabulary. The “ambiguous” running shoe question is intentionally written to stress the boundary between care guidance and subjective comfort language. The gift receipt question tests whether returns language routes cleanly. The escalation question tests service-floor procedures.
When I first ran the batch, I looked at the primary domain column before I looked at scores. That habit comes from older debugging practice: identify the categorical outcome, then inspect the numeric confidence. In my PoC, I also read the chart to see whether one domain swallowed the batch. If that happened without a good reason, I assumed I had a bug or a badly worded question. That is a cheap sanity check, but it caught a few mistakes while I iterated.
What I would measure next if I invested another weekend
- Precision and recall for domain routing against a labeled set of at least two hundred queries.
- Mean reciprocal rank for the top evidence snippet within the correct domain.
- Frequency of secondary-blend activations and whether those cases correlate with improved human ratings.
I have not done that work here. I am stating it as a matter of intellectual honesty. A chart in a blog post is not an evaluation suite.
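For honesty about what that weekend would involve, both routing accuracy and MRR reduce to a few lines each. These helpers are my sketch of the harness, not code from the repository:

```python
def routing_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of queries routed to their labeled domain."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def mean_reciprocal_rank(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    """Average 1/rank of the first correct snippet; a miss contributes 0."""
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(gold_ids)
```

The hard part is not these functions; it is labeling two hundred queries consistently, which is why I have not done it yet.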
Failure modes I observed while iterating
Sometimes the vector score for the primary domain is modest even when the lexical hint is strong. In those cases, the combined score still tends to land on the right domain because the hint is doing real work. The opposite also happened during early drafts: strong cosine matches to the wrong domain because of shared words like “store” or “order.” I addressed some of that by tightening documents so repeated generic words appear less often, but the real fix for a production setting would be more documents and better tokenization choices, not cleverer regex.
I also saw cases where the secondary blend did not trigger because the primary score crossed the threshold even though a human would still want cross-domain evidence. That is a design tension. If I lower the threshold, I blend more often and risk noisy evidence. If I raise the threshold, I stay pure but miss helpful cross-domain context. I do not think there is a universal answer. I think there is only a policy choice that should be explicit.
Why I avoided a hosted model in the core loop
I am not opposed to models. I use them in other projects. In this PoC, I wanted the repository to remain lightweight and the behavior to remain inspectable for readers who may not have API keys or budget. I also wanted to avoid a moving target. Hosted models change versions; retrieval baselines change less often. For a learning artifact, stability matters.
If I integrated a model later, I would still keep the routing structure. The orchestrator pattern is not tied to TF-IDF. It is a way to decide which evidence shelves to open. In my view, that separation ages well.
Statistics and visualization as a discipline habit
The matplotlib chart is simple: counts of primary domain picks across the batch. I still find it useful because it forces a second perspective on the same data. Tables can hide imbalance when you are focused on individual rows. A bar chart makes imbalance obvious. In my experience writing operational tooling, that kind of redundancy is how I catch mistakes before they become narratives.
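The plotting helper itself needs very little code. This is a sketch of how I would write it; the function name and the Agg backend choice are my assumptions, not necessarily what the repo does:

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt


def plot_domain_distribution(primary_domains: list[str], out_path: str) -> None:
    """Write a bar chart of primary-domain counts across the batch."""
    counts = Counter(primary_domains)
    labels = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar(labels, [counts[label] for label in labels])
    ax.set_xlabel("Primary domain")
    ax.set_ylabel("Queries routed")
    ax.set_title("Primary domain distribution across the demo batch")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

Closing the figure explicitly matters in batch scripts; leaked figures accumulate memory if you ever loop over many runs.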
I also think visualization discipline matters when writing publicly. A demo can look convincing because the author cherry-picked queries. A batch section with a chart is still cherry-picked, but it is harder to hide systemic skew without looking inconsistent. I am not claiming purity. I am claiming a slightly higher bar than a single happy-path screenshot.
How this relates to my broader experiments with context assembly
Across several personal PoCs, I keep returning to the same lesson: the model is only as grounded as the evidence you hand it. Routing is one way to ground. Chunking is another. Metadata filters are another. In this retail framing, routing is the headline because it is the piece that most directly mirrors how I think a careful associate works. I picture someone mentally classifying the question before reaching for a binder or a search box. The code is a crude metaphor for that mental step.
I also think solo experiments have a hidden advantage. There is no committee to smooth the awkward edges. If a design choice is hard to explain, I feel it immediately because I have to write the article myself. That pressure improves clarity even when it does not improve novelty. In my opinion, many good engineering blogs come from that kind of forced explanation rather than from raw brilliance.
Data hygiene notes for anyone who forks the repository
If you fork this project and replace the synthetic corpus with your own text, start with document boundaries. Decide what constitutes a chunk. Decide whether headings belong in the chunk or in metadata. Decide whether you need stable identifiers for compliance reasons. I used short stable identifiers like ret_001 because they read well in logs. In a real setting, you might need provenance fields and timestamps. None of that appears here because I am not simulating compliance tooling.
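If you do need provenance fields, a frozen dataclass is a cheap starting point. Everything beyond doc_id and text below is my hypothetical addition for forks, not something the PoC stores:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicyDoc:
    doc_id: str  # short stable identifier, e.g. ret_001, as in the PoC
    text: str  # the chunk body itself
    source_path: str = ""  # hypothetical: where the chunk was extracted from
    # hypothetical: UTC timestamp recorded at ingestion time
    retrieved_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

Freezing the dataclass means a chunk cannot be mutated after indexing, which keeps evidence logs trustworthy when you replay a run.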
I would also caution against mixing marketing language with policy language in the same chunk unless you intend to. Marketing copy often uses emotional words that pollute retrieval for operational questions. In my synthetic set, I tried to keep tone dry and procedural. That choice is artificial, but it is intentionally artificial.
Security and privacy framing
This PoC does not store customer data, payment data, or loyalty identifiers. I mention loyalty only inside synthetic policy text as a generic procedural note. If you adapt the idea to real systems, you should treat evidence logs as sensitive depending on your environment. Retrieval indexes can leak information through side channels if someone can probe them repeatedly. I am not providing a threat model here. I am naming the concern because responsible write-ups should name concerns even when the demo is synthetic.
Longer reflection on agentic retrieval as a phrase
Language shifts quickly in this field. I use “agentic” because the orchestrator makes a conditional decision that changes which retrievals run. That is a narrow meaning. I am not claiming autonomous agency, persistent memory, or tool use beyond vector retrieval. If the word feels too flashy for your taste, you can substitute “conditional multi-retrieval” and the code still reads the same.
From my perspective, the value of the word is that it signals intent to practitioners who are comparing patterns. If you are building a catalog of approaches, you want names that map to behaviors. The behavior here is: measure confidence, branch, merge evidence. That is enough to distinguish the flow from a single-query single-index baseline.
Expanded discussion of matplotlib output and how I read it
The chart file is written to output/domain_distribution.png. When I open it, I look for dominance first. If one bar towers over the others, I ask whether the batch questions were written to favor that domain accidentally. Then I look at the absolute counts. With six questions, ties and singletons are common. That is fine for a story, but it would be insufficient for a statistical claim. I treat the chart as a sanity check, not as proof of generalization.
I also think about color and accessibility. The default color cycle in matplotlib is familiar to many readers, but it is not perfect for every color vision profile. If I extended this project with a public UI, I would revisit palette choices. For a static PNG in a repository, I kept the defaults to reduce dependencies and keep the focus on structure.
What I changed between iterations while writing this article
Early drafts used only cosine similarity without lexical hints. The routing looked mathematically pure but sometimes felt silly in plain language. I added small boosts because I wanted the demo to track common sense when the corpus is tiny. Some readers will dislike that because it introduces hand-tuned rules. I accept the criticism. I would rather show a transparent hand-tuned rule than hide the same bias inside an unlabeled embedding space.
I also adjusted the secondary threshold upward once I saw how often the blend triggered. The goal was to make blending meaningful rather than routine. If blending happens on every query, it stops being a guardrail and becomes a second retrieval path you always pay for.
Closing Thoughts
What I learned about thresholds
I spent more time than I expected tuning the secondary threshold and lexical boosts. That is typical for small corpora. When you only have a few documents per domain, cosine similarity can swing based on a single shared word. I do not consider that a flaw in cosine similarity. I consider it a reminder that the corpus is the real product.
Edge cases I still think about
- Multi-intent questions that require two different operational actions, not just two evidence bundles.
- Questions that reference SKU numbers or store-specific hours that are not in the synthetic corpus.
- Situations where policy language conflicts across documents, which this PoC does not attempt to resolve.
Ethics and responsible framing
I believe any assistant that touches customer-facing work should default to transparency. That means citing sources, showing scores, and making it obvious when the system is uncertain. I did not build a customer-facing UI here, but I did build the kind of evidence rows I would want to see before trusting a draft answer.
Roadmap if this stays a hobby experiment
- Replace TF-IDF with BM25 or embeddings when the corpus grows.
- Add evaluation harnesses with labeled queries rather than eyeballing batch tables.
- Add structured logging to JSON for offline analysis.
A few more words on reproducibility and environment capture
Whenever I publish a PoC, I ask myself whether someone can reproduce the same numbers on their machine. For this project, the deterministic parts are the TF-IDF fit and the query order. The parts that can drift are library versions and floating point noise. I pinned versions loosely in requirements.txt with minimum versions rather than exact hashes because this is not a safety-critical artifact. If I needed bitwise reproducibility, I would pin exact versions and record a seed wherever randomness appears. Randomness does not play a role in the current retrieval path, which keeps the story simpler.
I also think about documentation as part of reproducibility. A repository without a clear run command is a puzzle, not an experiment. That is why I kept main.py as a single entry point and why I describe the output paths explicitly. From my experience, the fastest way to lose a reader is to hide the command they should run after cloning.
Narrative closure without overstating the result
I want to end the technical portion with a calm statement of scope. This PoC demonstrates routing and retrieval assembly. It does not demonstrate customer satisfaction. It does not demonstrate associate productivity. It does not demonstrate compliance alignment. Those outcomes require measurements I did not perform. I am naming the gap on purpose because overstated claims are how experimental articles age poorly.
Repository link
Public code for this experiment: https://github.com/aniket-work/RetailFloor-AgenticRouter-AI
I keep this repository separate from my publishing scripts so the public repo only contains the PoC implementation, diagrams, and images. If you clone it, you will not find my article drafts or automation utilities mixed into the same tree, because I want the repository to stay a clean reference implementation for the idea itself.
Disclaimer
The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.
Tags: python, retail, machinelearning, agents