DEV Community: Jancer Lima

Building a RAG Search Engine for an AI Sales Agent: Problems, Iterations, and Real Decisions

Jancer Lima — Fri, 29 May 2026 00:46:09 +0000

In early 2024 I was building an AI-driven WhatsApp sales agent that needed to answer customer questions about business products accurately. The agent had to behave like an SDR, qualifying leads, answering product questions, and moving conversations forward.

The core challenge: how do you give an AI agent accurate, relevant product knowledge when a business has thousands of products and you can't fit them all in a context window?

This article describes the real decisions and iterations behind the solution, including what failed before we got it right.

The Problem

Stuffing an entire product catalogue into the system prompt was never viable. Beyond the context window ceiling, sending thousands of products on every request wastes tokens, increases latency, and actively degrades response quality. The model's attention gets diluted across irrelevant products.

The real problem has two parts:

What data to include: Given a conversation, which products are actually relevant?

When to include it: Not every message needs product context. Triggering a product search on every turn is wasteful and slow.

An additional constraint shaped the early architecture: in early 2024, tool calling support across models was inconsistent and unreliable. We could not depend on it as a foundation.

First Attempt and Why It Failed

My first approach was to handle the "when to search" problem directly in the system prompt. I instructed the model to return a structured identifier when the conversation required product information, something like ["REQUEST_PRODUCTS", "search term"]. My code would intercept this identifier, run the vector search, and re-call the model passing the conversation history plus the returned products as additional context for generating the next response.

This worked in theory but fell apart in practice. The main issue was output consistency. The model frequently ignored the required format, adding extra text alongside the identifier or generating a full response instead of the structured output I needed. Since my code was parsing a specific format, any deviation broke the pipeline.
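To make the failure mode concrete, here is a minimal Python sketch of a strict parser for that identifier. The function name parse_search_request and the assumption that the identifier is valid JSON are mine, for illustration only; the actual parsing code is not shown in this article.

```python
import json
from typing import Optional


def parse_search_request(model_output: str) -> Optional[str]:
    """Return the search term only if the output is exactly the identifier."""
    try:
        parsed = json.loads(model_output.strip())
    except json.JSONDecodeError:
        # Any prose around the identifier makes the whole output unparseable.
        return None
    if (
        isinstance(parsed, list)
        and len(parsed) == 2
        and parsed[0] == "REQUEST_PRODUCTS"
        and isinstance(parsed[1], str)
    ):
        return parsed[1]
    return None


# A clean identifier parses; the same identifier wrapped in extra text does not.
clean = parse_search_request('["REQUEST_PRODUCTS", "running shoes"]')
noisy = parse_search_request('Sure! ["REQUEST_PRODUCTS", "running shoes"] Let me check.')
```

A more lenient parser could try to extract the identifier from surrounding text, but that only hides the underlying issue: the model is still free to skip the identifier entirely and answer the customer directly.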

The fundamental problem was that I was asking a single model call to handle two responsibilities at once: deciding whether a product search was needed and signaling that decision in a machine-readable format. Mixing those concerns into one prompt made the output unreliable.

This is where classifiers came in.

The Classifier Architecture

The solution was to break the pipeline into dedicated steps, each with a single responsibility. Instead of asking one model call to decide, search, and respond, I introduced classifiers: small focused AI calls that evaluate one thing and return a structured output.

The pipeline has three stages: intent detection, search quality evaluation, and response generation.

Stage 1: Intent Detection

The first classifier receives the conversation history and returns a search term if product information is needed, or null if it is not. A single focused question to the model returns a consistent structured output far more reliably than embedding that logic into a generation prompt.

Stage 2: Search Quality Evaluation

After the vector search returns results, a second classifier evaluates how well those results match the conversation context, returning a precision score between 0 and 1. If the score does not meet the threshold, the pipeline retries with a variation: a shorter message window passed to the first classifier to generate a different search term. The idea is that reducing the conversation history changes the emphasis of the generated search term, introducing variation that can surface better results.

The retry loop runs for a maximum of 5 attempts. On each retry the message window shrinks by 2 messages, starting from 12 messages and stopping if the window reaches 2 messages or the attempt limit is hit. The best scoring result across all attempts is carried forward regardless of whether the threshold was reached.

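MAX_ATTEMPTS = 5
SCORE_THRESHOLD = 0.7

best_score = 0.0
best_products = []
window_size = 12
attempt = 0

while attempt < MAX_ATTEMPTS and window_size > 2:
    # Stage 1: derive a search term from the last `window_size` messages
    search_term = intent_classifier(conversation_history, window_size)

    if search_term is None:
        # No product context is needed for this turn
        break

    products = vector_search(search_term)
    # Stage 2: score how well the results match the conversation (0 to 1)
    score = quality_classifier(products, conversation_history)

    # Keep the best result seen so far, even if it never reaches the threshold
    if score > best_score:
        best_score = score
        best_products = products

    if score >= SCORE_THRESHOLD:
        break

    # Shrink the window to shift the emphasis of the next search term
    window_size -= 2
    attempt += 1

# Stage 3: generate the customer-facing reply with the best context found
response = generate(conversation_history, best_products)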
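As a quick check of the stop conditions, the sketch below replays only the loop's counters, with no model calls. The function name window_schedule and its parameters are hypothetical; the default values are the ones stated above.

```python
def window_schedule(start: int = 12, step: int = 2,
                    max_attempts: int = 5, floor: int = 2) -> list[int]:
    """List the message-window size used on each attempt, mirroring the loop's bounds."""
    sizes = []
    window_size = start
    attempt = 0
    while attempt < max_attempts and window_size > floor:
        sizes.append(window_size)
        window_size -= step
        attempt += 1
    return sizes


print(window_schedule())                 # [12, 10, 8, 6, 4]
print(window_schedule(max_attempts=10))  # [12, 10, 8, 6, 4]
```

With these values the two limits coincide: the fifth attempt uses a 4-message window, and the next shrink would reach 2, which the loop condition excludes. Raising the attempt limit alone would therefore not add more retries.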

Stage 3: Response Generation

Only at this final stage does the model generate a customer-facing message, now with the highest quality product context available. Separating generation from classification meant each prompt had a single job, which dramatically improved output consistency.

The Search Implementation

With the classifier architecture handling when and what to search, the search itself needed to be fast, accurate, and operationally simple.

Storing products as vectors

When a product catalogue is uploaded, the system generates an embedding for each product and stores it in PostgreSQL using the pgvector extension. Rather than embedding each field individually, each product is embedded as a single concatenated string:

Name: {productName}, Description: {productDescription}, Price: {productPrice}

Embedding fields individually would have complicated the search without a clear quality benefit. A single embedding per product keeps the search straightforward: one vector comparison per product, one similarity score per result.
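A minimal sketch of that formatting step, assuming a simple Product record (the class and function names are hypothetical and not taken from the repository):

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    description: str
    price: str


def to_embedding_text(product: Product) -> str:
    """Build the single string that gets embedded for one product."""
    return (
        f"Name: {product.name}, "
        f"Description: {product.description}, "
        f"Price: {product.price}"
    )


example = to_embedding_text(Product("Trail Shoe", "Lightweight running shoe", "89.90"))
```

Keeping this in one function means there is a single place that defines the embedded format, so ingestion and any later re-embedding cannot drift apart.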

Why pgvector over a dedicated vector database

Adding a dedicated vector database like Pinecone or Weaviate would have introduced an additional managed service, a separate billing account, and another failure point. At the scale this system operated, none of those tradeoffs were justified. pgvector runs inside the same PostgreSQL instance already handling relational data, meaning one database to back up, monitor, and connect to.

For systems requiring millions of vectors or sub-millisecond retrieval at high concurrency, a dedicated vector database becomes the right call. This was not that system.

The similarity search

Queries run as cosine similarity searches using pgvector's <=> operator. The operator returns a cosine distance between 0 and 2. To convert this into a more intuitive similarity score, the search query transforms the result using 1 - (embedding <=> query_vector), producing a score between -1 and 1 where higher values indicate stronger similarity. The search term generated by the intent classifier is embedded using the same model and format as the stored products, then compared against the catalogue vectors. The top-K results are returned with their similarity scores for the quality classifier to evaluate.
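The distance-to-similarity conversion can be checked with plain Python. The sketch below is illustrative only: the table name, column names, and placeholder style in the SQL string are assumptions, and the two helper functions simply reproduce the arithmetic that pgvector performs inside the database.

```python
import math
from typing import Sequence

# Hypothetical query shape. Table and column names are assumptions, and how the
# query vector is bound depends on the database driver in use.
TOP_K_QUERY = """
SELECT id, name, 1 - (embedding <=> %(query_vector)s::vector) AS similarity
FROM products
ORDER BY embedding <=> %(query_vector)s::vector
LIMIT %(top_k)s
"""


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance as described for <=>: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """The 1 - distance transform, giving a score between -1 and 1."""
    return 1.0 - cosine_distance(a, b)


same = similarity([1.0, 0.0], [1.0, 0.0])       # same direction
unrelated = similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal
opposite = similarity([1.0, 0.0], [-1.0, 0.0])  # opposite direction
```

Ordering by the raw distance and exposing the transformed value only in the select list keeps the sort key identical to what the operator returns.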

One known limitation of embedding-based similarity search is poor handling of negation. If a user says they do not want a specific type of product, the search may still return results related to that product, because the embedding captures semantic proximity without understanding negation. In practice this was handled at the generation stage, by instructing the model to filter out irrelevant results from the context. A more robust solution would involve query rewriting before the vector search, detecting negation in the conversation and reformulating the search term to exclude the unwanted concepts, but this was not implemented in the production system.

Known Limitations

The current implementation has a few limitations worth acknowledging for anyone considering this approach in production.

Embedding caching: Every query generates a new OpenAI API call. At low volume this is negligible, but at scale repeated similar queries would benefit from caching embeddings to reduce both latency and cost.

Ingestion pipeline: The repository ingests products sequentially on startup for simplicity. The production system handled this differently: businesses uploaded their own product catalogues through the platform, triggering a background ingestion pipeline. The sequential approach is sufficient for local testing but is not representative of how ingestion works at real scale.

Cosine similarity and negation: As described in the previous section, the search does not handle negation well. Query rewriting before the vector search is the most promising direction for addressing this, though it was not implemented here.
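For the embedding caching point, a minimal in-memory sketch could look like the following. It is not part of the repository: the class name, the normalization rule, and the fake_embed stand-in are all assumptions, and it only catches queries that are identical after normalization, not queries that are merely similar.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by normalized query text (in-memory, unbounded)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0

    @staticmethod
    def _normalize(text: str) -> str:
        # Assumption: case and extra whitespace do not matter for search terms.
        return " ".join(text.casefold().split())

    def get(self, text: str) -> List[float]:
        key = self._normalize(text)
        if key not in self._store:
            self.misses += 1
            # Embed the normalized key so the cached vector depends only on the key.
            self._store[key] = self._embed_fn(key)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in for the real embedding API call, so the sketch runs offline.
    return [float(len(text)), 0.0]


cache = EmbeddingCache(fake_embed)
first = cache.get("Running shoes")
second = cache.get("  running   SHOES ")
print(cache.misses)  # 1
```

A production version would likely need a size limit or expiry, and a shared store if several server instances handle queries.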

Conclusion

This architecture was shaped by the constraints of early 2024, when reliable tool calling was not something you could build a production system on. The classifier approach solved that constraint but added its own complexity: multiple model calls per turn, a retry loop, and careful prompt engineering to keep each classifier focused.

Today I would approach this differently. Native tool calling has matured enough to be worth testing as a replacement for the intent detection step. But I do not think classifiers disappear entirely. The quality evaluation loop, where the system iterates toward a better search result rather than accepting the first one, solves a problem that tool calling does not address. The practical answer is probably a hybrid: native tool calls where the model's built-in capabilities are sufficient, and explicit classification steps where output precision matters enough to warrant the extra calls.

The extracted search implementation this article references is available at github.com/Jancera/rag-search. It covers the vector storage and similarity search backbone described here. The classifier architecture lives in the production system it was built for.

Why worker pools beat clustering for CPU-Heavy tasks on Node.js

Jancer Lima — Tue, 05 May 2026 00:15:24 +0000

Imagine you have a Node.js server with an endpoint that performs heavy CPU operations.

By default, your server's JavaScript runs on a single thread, so a CPU-heavy request blocks the event loop until it finishes. If your server has other asynchronous endpoints, for example ones that run database operations, those endpoints become unresponsive while the heavy endpoint is processing.

Our first idea is to create more threads, sending the heavy tasks to be processed in parallel by another CPU core. Once finished, we send the output back to the main thread and return the answer to the client.

The problem now is that it makes little sense to run more threads than there are available CPU cores (technically we can, but the extra threads just compete for the same cores). So we start thinking about worker pools, where we instantiate a fixed number of workers and reuse them for our tasks.

Now we have a stable structure where we offload CPU-intensive tasks to other threads, keeping the main thread free and available for new requests.

I set up a test case where the server runs in a Docker container with 2 CPUs and 2 GB of memory. The server has a root endpoint / which returns an OK response and a /blocking/:n endpoint that runs a Fibonacci algorithm.

I left n as a parameter so we can customize how much work the server does (n is the Fibonacci input).

All the source code for the server and the benchmark can be found here.

I configured 20k requests on / and 30 requests on /blocking/35. A run takes approximately 10 seconds, after which we can analyze the output. Note: the tests hit both endpoints simultaneously.

One test was against a server with 1 instance with 2 threads.

The other test was against a server with 2 instances with 1 thread each.

The clustered server used the Node.js cluster module, the worker pool used Piscina (which uses the worker_threads module internally), and the tests were run with autocannon.

Results

2 processes, 1 thread each

Results for http://localhost:3000/

┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%   │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼───────┤
│ Latency │ 0 ms │ 2 ms │ 43 ms │ 50 ms │ 4.17 ms │ 9.12 ms │ 94 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴───────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%    │ 97.5%  │ Avg     │ Stdev  │ Min    │
├───────────┼────────┼────────┼────────┼────────┼─────────┼────────┼────────┤
│ Req/Sec   │ 1,162  │ 1,162  │ 1,926  │ 3,761  │ 2,000.2 │ 744.83 │ 1,162  │
├───────────┼────────┼────────┼────────┼────────┼─────────┼────────┼────────┤
│ Bytes/Sec │ 277 kB │ 277 kB │ 458 kB │ 895 kB │ 476 kB  │ 177 kB │ 277 kB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴────────┴────────┘

Req/Bytes counts sampled once per second.
# of samples: 10

20k requests in 10.07s, 4.76 MB read


Results for http://localhost:3000/blocking/35

┌─────────┬────────┬─────────┬─────────┬─────────┬────────────┬────────────┬─────────┐
│ Stat    │ 2.5%   │ 50%     │ 97.5%   │ 99%     │ Avg        │ Stdev      │ Max     │
├─────────┼────────┼─────────┼─────────┼─────────┼────────────┼────────────┼─────────┤
│ Latency │ 276 ms │ 1968 ms │ 8645 ms │ 8645 ms │ 2327.81 ms │ 1998.18 ms │ 8645 ms │
└─────────┴────────┴─────────┴─────────┴─────────┴────────────┴────────────┴─────────┘
┌───────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ Stat      │ 1%    │ 2.5%  │ 50%   │ 97.5% │ Avg   │ Stdev │ Min   │
├───────────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ Req/Sec   │ 1     │ 1     │ 3     │ 3     │ 2.73  │ 0.62  │ 1     │
├───────────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ Bytes/Sec │ 253 B │ 253 B │ 759 B │ 759 B │ 690 B │ 156 B │ 253 B │
└───────────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

Req/Bytes counts sampled once per second.
# of samples: 11

30 requests in 11.07s, 7.59 kB read

1 process with 2 threads

Results for http://localhost:3000/

┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%   │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼───────┤
│ Latency │ 1 ms │ 2 ms │ 35 ms │ 43 ms │ 4.56 ms │ 7.63 ms │ 61 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴───────┘
┌───────────┬────────┬────────┬────────┬────────┬──────────┬────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%    │ 97.5%  │ Avg      │ Stdev  │ Min    │
├───────────┼────────┼────────┼────────┼────────┼──────────┼────────┼────────┤
│ Req/Sec   │ 640    │ 640    │ 1,454  │ 3,379  │ 1,818.37 │ 790.84 │ 640    │
├───────────┼────────┼────────┼────────┼────────┼──────────┼────────┼────────┤
│ Bytes/Sec │ 152 kB │ 152 kB │ 346 kB │ 804 kB │ 433 kB   │ 188 kB │ 152 kB │
└───────────┴────────┴────────┴────────┴────────┴──────────┴────────┴────────┘

Req/Bytes counts sampled once per second.
# of samples: 11

20k requests in 11.05s, 4.76 MB read


Results for http://localhost:3000/blocking/35

┌─────────┬────────┬─────────┬─────────┬─────────┬────────────┬───────────┬─────────┐
│ Stat    │ 2.5%   │ 50%     │ 97.5%   │ 99%     │ Avg        │ Stdev     │ Max     │
├─────────┼────────┼─────────┼─────────┼─────────┼────────────┼───────────┼─────────┤
│ Latency │ 276 ms │ 2794 ms │ 3861 ms │ 3861 ms │ 2358.17 ms │ 957.99 ms │ 3861 ms │
└─────────┴────────┴─────────┴─────────┴─────────┴────────────┴───────────┴─────────┘
┌───────────┬───────┬───────┬─────────┬─────────┬───────┬───────┬───────┐
│ Stat      │ 1%    │ 2.5%  │ 50%     │ 97.5%   │ Avg   │ Stdev │ Min   │
├───────────┼───────┼───────┼─────────┼─────────┼───────┼───────┼───────┤
│ Req/Sec   │ 2     │ 2     │ 4       │ 4       │ 3.34  │ 0.82  │ 2     │
├───────────┼───────┼───────┼─────────┼─────────┼───────┼───────┼───────┤
│ Bytes/Sec │ 506 B │ 506 B │ 1.01 kB │ 1.01 kB │ 843 B │ 207 B │ 506 B │
└───────────┴───────┴───────┴─────────┴─────────┴───────┴───────┴───────┘

Req/Bytes counts sampled once per second.
# of samples: 9

30 requests in 9.02s, 7.59 kB read

Conclusion

We can see that 1 process with 2 threads gave us better tail latency on both endpoints, while average latency was roughly the same in both setups.

Looking at the 99th percentile latency, it was 7 ms faster on the / route (43 ms vs 50 ms) and 4784 ms faster on the /blocking endpoint (3861 ms vs 8645 ms).
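The differences can be recomputed directly from the tables. The short Python sketch below only copies the reported latency figures and subtracts them; the dictionary layout and the labels "cluster" and "pool" are my own naming.

```python
# Latency figures in milliseconds, copied from the autocannon tables above.
RESULTS = {
    "/": {
        "cluster": {"p99": 50, "avg": 4.17},
        "pool": {"p99": 43, "avg": 4.56},
    },
    "/blocking/35": {
        "cluster": {"p99": 8645, "avg": 2327.81},
        "pool": {"p99": 3861, "avg": 2358.17},
    },
}


def improvement(route: str, metric: str) -> float:
    """Milliseconds saved by the worker pool; negative means the pool was slower."""
    run = RESULTS[route]
    return round(run["cluster"][metric] - run["pool"][metric], 2)


for route in RESULTS:
    print(route, "p99:", improvement(route, "p99"), "avg:", improvement(route, "avg"))
```

The gain is concentrated in the tail: the p99 improves by 7 ms and 4784 ms, while the averages are slightly higher with the pool (by 0.39 ms and 30.36 ms).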

This shows us that spinning up multiple independent processes might seem like a quick scaling fix, but in practice we spend resources managing process overhead instead of computing actual work. More importantly, a single process with a worker pool keeps the main event loop unblocked. It keeps handling incoming traffic while distributing the heavy CPU load across the workers, resulting in significantly lower wait times for 99% of our requests.

Of course, we could expand the test scenario and environment to get more realistic numbers, but that is work for another article.

Thanks for your attention!

Incident Report: Service failure due to storage full

Jancer Lima — Sat, 18 Apr 2026 01:09:42 +0000

Yesterday, my homelab server suddenly became unresponsive. It started with a flurry of Discord notifications, the universal signal that something has gone seriously wrong.

I found all services offline. The logs pointed to a primary culprit: a Redis failure, specifically a Server Out of Memory error.

The core error was: RedisClient::CommandError: MISCONF Errors writing to the AOF file: No space left on device

My first thought was: why is AOF even enabled? I had turned it on for testing and forgotten about it. My root partition was at 99% capacity, with just 270MB remaining out of 24GB.

Further investigation revealed where the "wasted" space was hiding:

PM2 Logs (~3.8GB): The process manager was storing massive, unrotated text logs.
Hidden Caches (~1.5GB): Accumulated ~/.cache, ~/.npm, and ~/.rvm source files from multiple builds and deployments.

To get the system breathing again, I performed a quick "surgical" cleaning:

PM2 Flush: Immediately cleared the massive log files using pm2 flush.
Log Truncation: Emptied application logs using truncate -s 0 log/*.log (this clears the content without deleting the file, so processes that have it open keep writing to it).
Cache Pruning: Deleted hidden build caches in ~/.npm and ~/.cache.
Journal Vacuum: Cleared system logs with journalctl --vacuum-size=500M.

Now I had enough space to spin up all the processes again, but I still needed to recover Redis, since it had entered a read-only mode to protect data integrity.

Fixing the AOF Manifest: Because the disk filled during a Redis write, the appendonly.aof.manifest was corrupted. I fixed it using sudo redis-check-aof --fix on the manifest file inside /var/lib/redis/appendonlydir/.
Clearing the MISCONF Lock: Even with free space, Redis remained in a "protected" state. I manually overrode this with redis-cli config set stop-writes-on-bgsave-error no.
Service Restart: Reset the systemd failure counter with systemctl reset-failed redis-server and restarted the service.

After that, I could restart all services and get everything running again. None of the data in Redis was critical, so I didn't care about losing it.

Lessons learned

The failure was a classic case of neglecting "boring" infrastructure: log rotation and disk monitoring. To prevent a repeat performance, I've implemented the following:

Log Management: Installed pm2-logrotate to cap PM2 logs at 10MB per file and limited journald to 500MB globally.
Next Steps:
- Expand the VM disk size (24GB is too tight for this stack).
- Set up a cron job for weekly apt autoremove and cache clearing.
- Implement an automated disk usage alert (likely via Grafana or a simple shell script to Discord).
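The disk usage alert from the last item could be sketched as follows. I mention a shell script above; this version uses Python instead to keep it self-contained, and it is not something I have deployed. The 85% threshold, the function names, and the User-Agent value are assumptions, and the webhook URL must be supplied by the caller.

```python
import json
import shutil
import urllib.request
from typing import Optional


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def build_alert(path: str, percent_used: float, threshold: float = 85.0) -> Optional[dict]:
    """Return a Discord webhook payload, or None when usage is below the threshold."""
    if percent_used < threshold:
        return None
    return {
        "content": f"Disk usage on {path} is at {percent_used:.1f}% (threshold {threshold:.0f}%)"
    }


def send_to_discord(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to a Discord webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: an explicit User-Agent is set in case the default one is rejected.
        headers={"Content-Type": "application/json", "User-Agent": "disk-alert/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10):
        pass


def check_and_alert(webhook_url: str, path: str = "/", threshold: float = 85.0) -> bool:
    """Send an alert if usage is at or above the threshold; return whether one was sent."""
    payload = build_alert(path, disk_usage_percent(path), threshold)
    if payload is None:
        return False
    send_to_discord(webhook_url, payload)
    return True
```

Run from cron every few minutes, a check like this would have flagged the partition long before it reached 99%.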

Is TLS Enough? A Retrospective on Application-Layer Encryption

Jancer Lima — Tue, 24 Feb 2026 11:27:35 +0000

Years ago, I was part of a heated debate that every engineering team eventually faces: Is standard TLS enough, or do we need custom application-layer encryption?

We were implementing a payment solution. The provider required a backend-to-backend integration, meaning we had to take user credit card data, send it to our server, and then forward it to the provider.

My argument was that TLS would be enough. The rest of the team disagreed. They didn't have a technical counter-argument; it was just a "lack of trust".

We ended up building a complex Dual-Keypair System:

The app keypair: Used to sign requests so the server could verify the data actually came from our app (Authenticity).
The server keypair: The app used the server's public key to encrypt the payload, ensuring only our backend could read it (Confidentiality).

It worked, but years later I realized there were hidden costs we hadn't considered.

The scale: We were small then. But if you scale to multiple server instances, you suddenly need a secure way to share those private keys. You've just turned a "payment problem" into a "key management problem".
The key rotation: What happens if a key is compromised or expires? If you have users on old app versions, you're stuck. You either support legacy keys forever or force your users to update. Both options are bad.
The "hostile" client: We stored a key in the app. But unless you are using the device's hardware Secure Enclave (iOS) or Keystore (Android), your app is a hostile environment. A determined attacker can decompile the code and extract those keys.
The TLS termination: The only valid concern was where the TLS "ends". If your HTTPS connection terminates at a Load Balancer and the internal traffic to your web server is plain HTTP, you have a gap.

Unless you are building a banking core or a literal payment gateway, TLS is enough. If you’re worried about the "last mile" inside your VPC, solve that at the infrastructure level. Don’t bake custom crypto into your application logic unless you’re ready to manage the massive operational overhead that comes with it.

Experience taught me that "Trust issues" should be solved with better infrastructure, not more code.