Building a RAG Search Engine for an AI Sales Agent: Problems, Iterations, and Real Decisions

#rag #ai

In early 2024 I was building an AI-driven WhatsApp sales agent that needed to answer customer questions about a business's products accurately. The agent had to behave like a sales development representative (SDR): qualifying leads, answering product questions, and moving conversations forward.

The core challenge: how do you give an AI agent accurate, relevant product knowledge when a business has thousands of products and you can't fit them all in a context window?

This article describes the real decisions and iterations behind the solution, including what failed before we got it right.

The Problem

Stuffing an entire product catalogue into the system prompt was never viable. Beyond the context window ceiling, sending thousands of products on every request wastes tokens, increases latency, and actively degrades response quality. The model's attention gets diluted across irrelevant products.

The real problem has two parts:

What data to include: Given a conversation, which products are actually relevant?

When to include it: Not every message needs product context. Triggering a product search on every turn is wasteful and slow.

An additional constraint shaped the early architecture: in early 2024, tool calling support across models was inconsistent and unreliable. We could not depend on it as a foundation.

First Attempt and Why It Failed

My first approach was to handle the "when to search" problem directly in the system prompt. I instructed the model to return a structured identifier when the conversation required product information, something like ["REQUEST_PRODUCTS", "search term"]. My code would intercept this identifier, run the vector search, and re-call the model passing the conversation history plus the returned products as additional context for generating the next response.

This worked in theory but fell apart in practice. The main issue was output consistency. The model frequently ignored the required format, adding extra text alongside the identifier or generating a full response instead of the structured output I needed. Since my code was parsing a specific format, any deviation broke the pipeline.
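To make the fragility concrete, here is a minimal sketch of the kind of strict parsing this approach depends on. The function name `parse_product_request` and the assumption that the identifier arrives as a JSON array are illustrative; the original interception code is not shown in this article.

```python
import json
from typing import Optional


def parse_product_request(model_output: str) -> Optional[str]:
    """Return the search term if the output is exactly the identifier, else None."""
    try:
        parsed = json.loads(model_output.strip())
    except json.JSONDecodeError:
        # Any extra prose around the identifier makes the output unparseable.
        return None
    if (
        isinstance(parsed, list)
        and len(parsed) == 2
        and parsed[0] == "REQUEST_PRODUCTS"
        and isinstance(parsed[1], str)
    ):
        return parsed[1]
    return None
```

A clean `["REQUEST_PRODUCTS", "running shoes"]` parses, but the same identifier preceded by a friendly sentence does not, which is exactly the deviation that kept breaking the pipeline.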

The fundamental problem was that I was asking a single model call to handle two responsibilities at once: deciding whether a product search was needed and signaling that decision in a machine-readable format. Mixing those concerns into one prompt made the output unreliable.

This is where classifiers came in.

The Classifier Architecture

The solution was to break the pipeline into dedicated steps, each with a single responsibility. Instead of asking one model call to decide, search, and respond, I introduced classifiers: small focused AI calls that evaluate one thing and return a structured output.

The pipeline has three stages: intent detection, search quality evaluation, and response generation.

Stage 1: Intent Detection

The first classifier receives the conversation history and returns a search term if product information is needed, or null if it is not. A single focused question to the model returns a consistent structured output far more reliably than embedding that logic into a generation prompt.
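A minimal sketch of such a classifier, assuming a hypothetical `call_model` callable that takes a prompt string and returns the model's raw text. The prompt wording and the `NONE` sentinel are assumptions for illustration, not the production prompt, and the extra `call_model` parameter exists only to keep the example self-contained.

```python
from typing import Callable, List, Optional

# Hypothetical prompt: the wording is illustrative only.
INTENT_PROMPT = (
    "You decide whether the assistant needs product information to reply.\n"
    "If it does, answer with a short search term and nothing else.\n"
    "If it does not, answer with exactly NONE.\n\n"
    "Conversation:\n{conversation}"
)


def intent_classifier(
    conversation_history: List[str],
    window_size: int,
    call_model: Callable[[str], str],
) -> Optional[str]:
    """Return a search term, or None when no product lookup is needed."""
    # Only the most recent `window_size` messages are shown to the classifier.
    window = conversation_history[-window_size:]
    prompt = INTENT_PROMPT.format(conversation="\n".join(window))
    answer = call_model(prompt).strip()
    if not answer or answer.upper() == "NONE":
        return None
    return answer
```

Because the prompt asks one question and allows only two shapes of answer, the calling code needs nothing more than a sentinel check to interpret the result.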

Stage 2: Search Quality Evaluation

After the vector search returns results, a second classifier evaluates how well those results match the conversation context and returns a precision score between 0 and 1. If the score falls below the threshold (0.7 in the code below), the pipeline retries with a variation: the first classifier receives a shorter message window and generates a different search term. Reducing the conversation history shifts the emphasis of the generated search term, and that variation can surface better results.

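The retry loop runs for a maximum of 5 attempts. The message window starts at 12 messages and shrinks by 2 after each attempt that misses the threshold, so the five attempts use windows of 12, 10, 8, 6, and 4 messages. The loop stops when the threshold is met, the intent classifier returns no search term, the window reaches 2 messages, or the attempt limit is hit. The best-scoring result across all attempts is carried forward regardless of whether the threshold was reached.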

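MAX_ATTEMPTS = 5       # hard cap on search attempts
SCORE_THRESHOLD = 0.7  # minimum acceptable precision score
MIN_WINDOW = 2         # stop once the window shrinks to this size

# intent_classifier, vector_search, quality_classifier, and generate
# are the pipeline stages described in this section.
best_score = 0.0
best_products = []
window_size = 12

attempt = 0
while attempt < MAX_ATTEMPTS and window_size > MIN_WINDOW:
    # Stage 1: derive a search term from the last `window_size` messages
    search_term = intent_classifier(conversation_history, window_size)

    if search_term is None:
        # No product information needed: skip the search entirely
        break

    products = vector_search(search_term)
    # Stage 2: score the results against the conversation
    score = quality_classifier(products, conversation_history)

    # Keep the best result seen so far, even if it misses the threshold
    if score > best_score:
        best_score = score
        best_products = products

    if score >= SCORE_THRESHOLD:
        break

    # Retry with a shorter window to vary the generated search term
    window_size -= 2
    attempt += 1

# Stage 3: generate the reply with the best product context found
response = generate(conversation_history, best_products)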
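The loop above relies on the quality classifier returning a usable number on every call. A minimal sketch of defensive score parsing, assuming the classifier is prompted to reply with a bare number; `parse_score` is a hypothetical helper, not part of the original system.

```python
def parse_score(model_output: str) -> float:
    """Convert classifier output to a precision score clamped to [0, 1]."""
    try:
        score = float(model_output.strip())
    except ValueError:
        # Treat unparseable output as the worst score so the loop retries.
        return 0.0
    return min(1.0, max(0.0, score))
```

Mapping malformed output to 0.0 is a design choice for this sketch: a bad classifier response then costs one retry instead of crashing the turn.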

Stage 3: Response Generation

Only at this final stage does the model generate a customer-facing message, now with the highest quality product context available. Separating generation from classification meant each prompt had a single job, which dramatically improved output consistency.

The Search Implementation

With the classifier architecture handling when and what to search, the search itself needed to be fast, accurate, and operationally simple.

Storing products as vectors

When a product catalogue is uploaded, the system generates an embedding for each product and stores it in PostgreSQL using the pgvector extension. Rather than embedding each field individually, the system embeds each product as a single concatenated string:

Name: {productName}, Description: {productDescription}, Price: {productPrice}

Embedding fields individually would have complicated the search without a clear quality benefit. A single embedding per product keeps the search straightforward: one vector comparison per product, one similarity score per result.
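As a small illustration, the template above can be produced by a single formatting helper. The function name and the choice to pass the price as an already formatted string are assumptions for this sketch.

```python
def product_to_embedding_text(name: str, description: str, price: str) -> str:
    """Build the single string that gets embedded for one product."""
    return f"Name: {name}, Description: {description}, Price: {price}"
```

Keeping the format in one function means every product is embedded from text with the same shape, which is what makes a single similarity score per product comparable across the catalogue.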

Why pgvector over a dedicated vector database

Adding a dedicated vector database like Pinecone or Weaviate would have introduced an additional managed service, a separate billing account, and another failure point. At the scale this system operated at, none of those costs were justified. pgvector runs inside the same PostgreSQL instance already handling relational data, meaning one database to back up, monitor, and connect to.

For systems requiring millions of vectors or sub-millisecond retrieval at high concurrency, a dedicated vector database becomes the right call. This was not that system.

The similarity search

Queries run as cosine similarity searches using pgvector's <=> operator. The operator returns a cosine distance between 0 and 2. To convert this into a more intuitive similarity score, the search query transforms the result using 1 - (embedding <=> query_vector), producing a score between -1 and 1 where higher values indicate stronger similarity. The search term generated by the intent classifier is embedded using the same model and format as the stored products, then compared against the catalogue vectors. The top-K results are returned with their similarity scores for the quality classifier to evaluate.
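The distance-to-similarity conversion can be checked with a few lines of plain Python that mirror what the `<=>` operator computes. The query string at the end is a sketch only: the `products` table, its `id` and `name` columns, and the named parameters are assumptions, not the schema from the original system.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity), the quantity <=> returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Zero-length vectors are not handled in this sketch.
    return 1.0 - dot / (norm_a * norm_b)


def similarity_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Convert a distance in [0, 2] to a similarity in [-1, 1]."""
    return 1.0 - cosine_distance(a, b)


# Hypothetical top-K query; table and column names are assumed.
TOP_K_QUERY = """
SELECT id, name, 1 - (embedding <=> %(query_vector)s::vector) AS similarity
FROM products
ORDER BY embedding <=> %(query_vector)s::vector
LIMIT %(top_k)s
"""
```

Identical directions give a distance of 0 and a similarity of 1, orthogonal vectors give 1 and 0, and opposite directions give 2 and -1. Ordering by the raw distance ascending is equivalent to ordering by similarity descending.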

One known limitation of this kind of embedding similarity search is poor handling of negation. If a user says they do not want a specific type of product, the search may still return that type of product, because the embedding captures semantic proximity without understanding negation. In practice this was handled at the generation stage by instructing the model to filter out irrelevant results from the context. A more robust solution would involve query rewriting before the vector search: detecting negation in the conversation and reformulating the search term to exclude the unwanted concepts. This was not implemented in the production system.

Known Limitations

The current implementation has a few limitations worth acknowledging for anyone considering this approach in production.

Embedding caching: Every query triggers a new embedding request to the OpenAI API. At low volume this is negligible, but at scale, repeated queries would benefit from cached embeddings to reduce both latency and cost.
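One way such a cache could look, as a minimal in-memory sketch. `CachedEmbedder` and the injected `embed_fn` are hypothetical names; `embed_fn` stands in for whatever function calls the embedding API. Normalizing case and whitespace for the cache key is an assumption, and it means the first-seen variant's embedding is reused for later variants.

```python
from typing import Callable, Dict, List


class CachedEmbedder:
    """Wrap an embedding function with an in-memory cache."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}

    def embed(self, text: str) -> List[float]:
        # Normalize case and whitespace so trivially different queries share a key.
        key = " ".join(text.lower().split())
        if key not in self._cache:
            # Cache miss: call the underlying embedding function once.
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]
```

This sketch never evicts entries, so a long-running service would need a size bound or a shared store instead of a plain dictionary.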

Ingestion pipeline: The companion repository (linked at the end of this article) ingests products sequentially on startup for simplicity. The production system handled this differently: businesses uploaded their own product catalogues through the platform, which triggered a background ingestion pipeline. The sequential approach is sufficient for local testing but does not represent how ingestion works at real scale.

Negation: As described in the previous section, the embedding search does not handle negation well. Query rewriting before the vector search is the most promising direction for addressing this, though it was not implemented here.

Conclusion

This architecture was shaped by the constraints of early 2024, when reliable tool calling was not something you could build a production system on. The classifier approach solved that constraint but added its own complexity: multiple model calls per turn, a retry loop, and careful prompt engineering to keep each classifier focused.

Today I would approach this differently. Native tool calling has matured enough to be worth testing as a replacement for the intent detection step. But I do not think classifiers disappear entirely. The quality evaluation loop, where the system iterates toward a better search result rather than accepting the first one, solves a problem that tool calling does not address. The practical answer is probably a hybrid: native tool calls where the model's built-in capabilities are sufficient, and explicit classification steps where output precision matters enough to warrant the extra calls.

The extracted search implementation this article references is available at github.com/Jancera/rag-search. It covers the vector storage and similarity search backbone described here. The classifier architecture lives in the production system it was built for.