I was testing LocusLab, a multi-tenant AI agent platform I am building, and something was off with the latency. Not consistently off. Sometimes the agent would reply in under 2 seconds, sometimes it would take 9 or 10. No errors in the logs. No timeouts. Just this random unpredictable delay that made no sense.
My first instinct was the queue. Messages piling up maybe? I checked. Queue depth was fine. Then I thought it was the webhook, maybe the DM events were arriving late. Also fine. Then I thought the LLM calls were inconsistent so I started logging every stage of the pipeline individually. Still couldn't find it.
It took me 3 days to find the actual cause.
Both my indexing pipeline and query pipeline were running inside the same Lambda function. So when a user uploaded documents and indexing started, all the heavy work (preprocessing the documents, splitting them into chunks, generating embeddings, storing them in the vector database) was consuming the same compute and memory that the query side needed to reply to messages. The indexing would finish, resources would free up, and query latency would drop back to normal. Then someone would upload another document, indexing would kick off again, and latency would spike. That is why it looked random. It was not random at all. It was completely tied to whenever indexing was happening in the background.
The moment I realised this I felt stupid. Because I knew this was the right architecture from the start and I still did not do it.
Why These Two Things Cannot Live Together
Before getting into what I changed, it helps to understand what these two pipelines actually are and why they are so different.
Think of it this way. The indexing pipeline is the librarian organising books in the background. Nobody is standing there waiting for each book to be placed on the shelf. It can take its time. The more it batches together, the more efficient it becomes. A document taking 2 minutes to fully index is completely fine because no user is waiting on the other side.
The query pipeline is the librarian answering a question from someone standing at the desk. That person is waiting right now. Every second feels long. You need to find the answer as fast as possible and get back to them.
When you put both of these in the same function, the librarian is trying to organise shelves and answer questions at the same time. The person at the desk keeps waiting because the librarian is busy in the back.
More technically, the indexing pipeline is optimised for throughput. You want to process as many documents as possible, and batching embedding calls makes them cheaper. The query pipeline is optimised for latency. You want to get below 2 seconds end to end, run searches in parallel, and check the cache first so you can skip the whole pipeline on repeated questions. These two goals fight each other when they share the same compute.
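To make the throughput side concrete, here is a minimal sketch of what batching embedding calls looks like. The `embed` function is a stand-in for a real embedding API (not LocusLab's actual code), and the batch size is an arbitrary illustration: the point is one network round trip per batch of chunks instead of one per chunk.

```python
def embed(texts):
    # Stand-in for a real embedding API call. Each invocation is one
    # network round trip, so fewer, larger calls are cheaper.
    return [[float(len(t))] for t in texts]

def embed_all(chunks, batch_size=64):
    # Throughput-optimised: one embed() call per batch of chunks.
    # No user is waiting, so total wall-clock time barely matters.
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed(chunks[i:i + batch_size]))
    return vectors
```

On the query side you would never do this: you embed exactly one query, as fast as possible, and spend your effort on parallelism and caching instead.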
And the frustrating part is the failure is invisible. No errors, no crashes, just inconsistent latency that looks like 10 different problems before you find the real one.
What I Changed
I split them into two separate Lambda functions with an SQS queue connecting them.
The indexing pipeline is now its own Lambda that gets triggered by the SQS queue. When someone uploads documents we push a job to the queue and immediately tell the user their upload was received. The Lambda picks it up in the background and handles everything, figuring out what type of document it is, extracting the text, splitting it into chunks that make sense for retrieval, generating embeddings, storing everything in VectorDB. The user is not waiting for any of this. If it is slow it does not matter. If it fails the message stays in the queue and retries automatically. Failed jobs go to a dead letter queue so nothing silently disappears.
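A sketch of what that indexing Lambda looks like, with the pipeline steps stubbed out. The function and field names (`parse_job`, `index_document`, `tenant_id`, `document_key`) are hypothetical, but the `batchItemFailures` response shape is the real SQS partial-batch-response contract: only the failed messages return to the queue for retry, and repeated failures land in the dead letter queue.

```python
import json

def parse_job(record):
    # Extract an indexing job from a single SQS record.
    body = json.loads(record["body"])
    return {"tenant_id": body["tenant_id"], "document_key": body["document_key"]}

def index_document(job):
    # Placeholder for the real pipeline: detect document type,
    # extract text, chunk, embed, store in the vector database.
    chunks = [f"chunk-{i}" for i in range(3)]
    return len(chunks)

def handler(event, context):
    # SQS can deliver a batch of records. Reporting per-message failures
    # means only the failed jobs are retried, not the whole batch.
    failures = []
    for record in event["Records"]:
        try:
            index_document(parse_job(record))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

The upload endpoint just pushes the job body onto the queue and returns immediately; that is the whole reason the user never waits on indexing.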
The query pipeline is a separate Lambda. When a message comes in it handles the full retrieval flow. It checks the cache first because a cache hit means you can respond in under 50ms without running any search at all. If it is a cache miss it runs vector search and keyword search at the same time in parallel rather than one after the other, then combines the results, picks the most relevant chunks, builds the context, calls the LLM, and returns the response. This function has no idea the indexing Lambda exists.
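The query-side flow can be sketched like this. The search functions and the in-memory cache are stubs (a real deployment would use a vector database client and something like Redis or DynamoDB for the cache), but the structure is the one described above: cache check first, then vector and keyword search in parallel, then merge and rank.

```python
import concurrent.futures

CACHE = {}  # stand-in for a real cache layer (assumption, not the actual store)

def vector_search(query):
    # Stub for a similarity search against the vector index.
    return [("doc-a", 0.9), ("doc-b", 0.7)]

def keyword_search(query):
    # Stub for a keyword/BM25-style search.
    return [("doc-b", 0.8), ("doc-c", 0.5)]

def merge(vec_hits, kw_hits):
    # Keep the best score seen for each document, then rank.
    scores = {}
    for doc, score in vec_hits + kw_hits:
        scores[doc] = max(scores.get(doc, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)

def answer(query):
    if query in CACHE:
        return CACHE[query]  # cache hit: skip the whole pipeline
    # Cache miss: run both searches at the same time, not sequentially.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        vec_f = pool.submit(vector_search, query)
        kw_f = pool.submit(keyword_search, query)
        ranked = merge(vec_f.result(), kw_f.result())
    response = f"answer built from {ranked[:2]}"  # stand-in for the LLM call
    CACHE[query] = response
    return response
```

Running the two searches concurrently means the retrieval step costs max(vector, keyword) rather than their sum, which is exactly the latency-first trade-off the indexing side never has to make.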
The two functions share only two things: the VectorDB index where vectors are stored and a DynamoDB table that tracks document and chunk metadata. That is the only connection between them.
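For illustration, the shared metadata item might look something like this. The attribute names and key layout here are assumptions, not the actual LocusLab schema; the point is that one Lambda writes this shape, the other reads it, and nothing else couples them.

```python
def chunk_item(tenant_id, document_id, chunk_index, vector_id):
    # Hypothetical DynamoDB item: the indexing Lambda writes it after
    # storing a chunk's embedding, the query Lambda reads it to map
    # search hits back to documents.
    return {
        "pk": f"TENANT#{tenant_id}#DOC#{document_id}",
        "sk": f"CHUNK#{chunk_index:05d}",
        "vector_id": vector_id,  # pointer into the shared VectorDB index
        "status": "indexed",
    }
```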
What Actually Got Better
Query latency dropped and stayed consistent. I ran the same test that originally broke things, a large document upload triggering full indexing, while at the same time hitting the query side with multiple messages. Latency did not move.
Debugging also became much simpler, and I did not expect that part.
Before the split every investigation started with figuring out what else was happening in the function at that exact moment. After the split that question became irrelevant. If a query is slow the problem is in the query Lambda. If a document fails to index the problem is in the indexing Lambda. They fail separately and it is obvious where to look.
The Honest Part
I knew this was the right architecture before I started building. Separate the pipelines, queue between them, indexing runs async in the background. I have read enough about system design to know this.
But I told myself just for now, just to move fast and see the agent working end to end, I will keep them together and fix it later. That was the plan.
Then I spent 3 days debugging a problem that should not have existed.
The agent quality was actually good during all of this. The retrieval was working, the responses were accurate, the Shopify integration was pulling the right products. All of that was fine. The only thing hurting the user experience was an architecture shortcut I took on day one that I knew was wrong.
That is the part that still bothers me. It was not a hard problem. It was a known problem I chose to defer.
Where I Am Now and What Comes Next
I am still on Lambda for both pipelines. At my current scale it works fine and Lambda is honestly a good fit for early stage products. It scales automatically, you only pay for what you use, and there is no infrastructure to manage.
The two real limitations I am aware of are cold starts and the 15-minute execution limit. Cold starts add latency when a function has not been called recently, which matters a lot for the query side. The 15-minute limit means very large document processing jobs need to be broken into smaller pieces so they do not hit the ceiling.
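Working around the 15-minute ceiling usually means fanning one big document out into several smaller queue messages. A minimal sketch of that split (the pages-per-job number is an arbitrary assumption, tuned so each job finishes comfortably within the limit):

```python
def split_into_jobs(page_count, pages_per_job=50):
    # Break one large document into (start, end) page ranges, each of
    # which becomes its own SQS message and its own Lambda invocation.
    return [
        (start, min(start + pages_per_job, page_count))
        for start in range(0, page_count, pages_per_job)
    ]
```

Because each range is an independent queue message, a failure in one slice retries on its own without redoing the rest of the document.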
When traffic grows to the point where I need a constantly warm query function and more control over how long indexing jobs can run, I will move to ECS. But that is a future problem. The separation itself is what mattered, not which compute service I used to do it.
When Keeping Them Together Is Actually Fine
If you are building a prototype, single tenant, small number of documents, no real users yet, keep them together. The overhead of managing two functions, a queue between them, and separate monitoring is not worth it when you are just trying to see if the product idea works at all.
The moment you have users uploading documents while other users are querying at the same time, separate them. That is the line. Not because of some rule, but because that is exactly when one pipeline starts silently hurting the other one.
Do not wait for 3 days of confused debugging to make the call.

Top comments (2)
This is such a real failure mode.
What’s interesting is it’s not just about “same Lambda”, it’s how indexing workloads end up saturating shared system limits (concurrency, vector DB, embedding APIs), and query latency takes the hit
Looks random on the surface, but it’s actually tightly correlated with background load
Separating indexing (throughput) and query (latency) pipelines is one of those things everyone knows… until they debug it the hard way😅
@ayanarshad02 yes, you got it right. Everyone knows it, but many ignore it at the beginning...