<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prithvi Rajan</title>
    <description>The latest articles on DEV Community by Prithvi Rajan (@prithvi_rajan_222).</description>
    <link>https://dev.to/prithvi_rajan_222</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3785609%2F47db4c68-0c5a-4001-8f70-2ca9dc8c9116.png</url>
      <title>DEV Community: Prithvi Rajan</title>
      <link>https://dev.to/prithvi_rajan_222</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prithvi_rajan_222"/>
    <language>en</language>
    <item>
      <title>My RAG Pipeline Took an Hour. Here's How I Got It Down to 30 Seconds.</title>
      <dc:creator>Prithvi Rajan</dc:creator>
      <pubDate>Sun, 01 Mar 2026 19:57:16 +0000</pubDate>
      <link>https://dev.to/prithvi_rajan_222/my-rag-pipeline-took-an-hour-heres-how-i-got-it-down-to-30-seconds-3786</link>
      <guid>https://dev.to/prithvi_rajan_222/my-rag-pipeline-took-an-hour-heres-how-i-got-it-down-to-30-seconds-3786</guid>
      <description>&lt;p&gt;A content ingestion job used to take over an hour. Now it finishes in 30 seconds. No change in hardware, just better utilization of what is already there, a smarter queue system, and hours debugging how CUDA and multiprocessing works. Here’s how I got there.&lt;/p&gt;

&lt;p&gt;I was building a RAG application with Django, with Milvus as my vector database. My initial approach to ingesting documents was very simple.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gnpg9hqrraz1ts7ia6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gnpg9hqrraz1ts7ia6y.png" alt="Old Pipeline" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a Celery task → Fetch the page → Chunk the page → Create vector embeddings → Upload them to Milvus.&lt;/p&gt;

&lt;p&gt;This worked great. Nothing wrong with it, except that it was slow: ingesting the entire Django docs took over an hour.&lt;/p&gt;
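&lt;p&gt;As a rough sketch, the naive version looked something like this. The function names here are illustrative stand-ins, not the exact code from my project:&lt;/p&gt;

```python
# naive_pipeline.py -- a minimal sketch of the original single-task pipeline.
# Every helper is a stand-in for the real fetch / chunk / embed / upload code.

def fetch_page(url: str) -> str:
    return f"contents of {url}"               # stand-in for an HTTP fetch

def chunk(text: str, size: int = 10) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks: list[str]) -> list[list[float]]:
    return [[float(len(c))] for c in chunks]  # stand-in for the embedding model

def upload_to_milvus(vectors: list[list[float]]) -> int:
    return len(vectors)                       # stand-in for the Milvus insert

def ingest_page(url: str) -> int:
    # One task does everything serially: fetch -> chunk -> embed -> upload.
    # The worker is blocked for the full duration of every stage.
    text = fetch_page(url)
    vectors = embed(chunk(text))
    return upload_to_milvus(vectors)
```

&lt;p&gt;In the real project that function body sat inside a single Celery task, which is exactly why one slow stage stalled the whole worker.&lt;/p&gt;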

&lt;h2&gt;
  
  
  Can we do better?
&lt;/h2&gt;

&lt;p&gt;I run everything on my personal computer, which has a CUDA-capable GPU (an NVIDIA RTX 4070 Super), so I wanted to see if it could speed up the process. I switched the embedding model to the GPU, tweaked some of the Docker images, and got the GPU creating embeddings in my test code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_embedding_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;force_cpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;FlagEmbedding&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BGEM3FlagModel&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;force_cpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading BGE-M3 on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Correct, fast loading
&lt;/span&gt;        &lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BGEM3FlagModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/models/bge-m3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# local path from Docker image
&lt;/span&gt;            &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model was loaded with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks simple, right?&lt;/p&gt;

&lt;p&gt;This change ended up breaking my Celery setup, and forced me to actually optimize my code.&lt;/p&gt;

&lt;p&gt;CUDA does not support &lt;code&gt;fork()&lt;/code&gt;, so running multiple prefork Celery workers was no longer possible. Each forked worker tries to re-initialize the CUDA context, which throws this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="nl"&gt;RuntimeError:&lt;/span&gt; &lt;span class="n"&gt;Cannot&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;forked&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;To&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;spawn&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I changed how Celery creates new processes to use the “spawn” start method, but that ended up crashing my system: each spawned worker loaded its own copy of the embedding model, my GPU quickly ran out of VRAM, and everything shut down.&lt;/p&gt;
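&lt;p&gt;For reference, in plain Python multiprocessing the fix the error message asks for looks like the snippet below. Celery uses its own multiprocessing fork (billiard), so in my case the change went through the worker configuration rather than this exact call, but the idea is the same:&lt;/p&gt;

```python
# spawn_demo.py -- forcing the 'spawn' start method, as the CUDA error demands.
# With spawn, each child process is a fresh interpreter that initializes its
# own CUDA context instead of inheriting a forked (and therefore broken) one.
import multiprocessing as mp

def main() -> str:
    mp.set_start_method("spawn", force=True)
    # Every process created from here on starts clean...
    # ...which also means each one loads its own copy of the model into VRAM.
    return mp.get_start_method()

if __name__ == "__main__":
    print(main())  # "spawn"
```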

&lt;p&gt;My PC crashing kinda scared me and made me take a step back. What bottlenecks was I actually facing, and what would solve them?&lt;/p&gt;

&lt;p&gt;There were two distinct bottlenecks. The first was fetching pages: in the original pipeline, after fetching a page, a worker had to wait for chunking, embedding, and the upload to Milvus before it could fetch the next one. The second was VRAM: loading the embedding model onto the GPU takes up a lot of it, so every extra worker multiplied that cost.&lt;/p&gt;

&lt;p&gt;Luckily, the solution to both bottlenecks was exactly the same: decoupling the I/O work from the GPU work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimized architecture
&lt;/h2&gt;

&lt;p&gt;The solution was two Celery queues with very different personalities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95d6y5bmiq0igpk6t14g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95d6y5bmiq0igpk6t14g.png" alt="New Pipeline" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first is a general-purpose CPU queue. Multiple workers run freely in parallel, each one fetching a page from the API, chunking the content, and passing it downstream. Concurrency here is a feature; the more workers, the faster the pages come in.&lt;/p&gt;

&lt;p&gt;The second is a GPU queue, locked to a single worker by design. That one worker does nothing but take chunks from the CPU queue, run them through the embedding model, and push the results to a Redis queue. One process, one model loaded in memory, running continuously without interruption.&lt;/p&gt;

&lt;p&gt;The final piece is a Celery Beat job that drains the Redis queue every minute, batching up the accumulated embeddings and writing them to Milvus in bulk rather than one document at a time.&lt;/p&gt;
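&lt;p&gt;In Celery terms, most of this split is routing configuration. A sketch of what that looks like (module and task names here are hypothetical, not my actual code):&lt;/p&gt;

```python
# routing_sketch.py -- illustrative Celery routing for the two-queue design.
# Task names are hypothetical; the point is the queue split and the
# single-worker constraint on the GPU side.

task_routes = {
    "ingest.fetch_and_chunk": {"queue": "cpu"},   # many parallel workers
    "ingest.embed_chunks":    {"queue": "gpu"},   # exactly one worker
}

# A beat schedule entry that drains the Redis buffer into Milvus once a minute.
beat_schedule = {
    "flush-embeddings-to-milvus": {
        "task": "ingest.flush_to_milvus",
        "schedule": 60.0,  # seconds
    },
}

# The two workers are then started with different concurrency settings, e.g.:
#   celery -A app worker -Q cpu -c 8
#   celery -A app worker -Q gpu -c 1 --pool=solo
```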

&lt;p&gt;What makes this architecture satisfying is how cleanly each component maps to a resource. CPU workers are cheap to spin up, so you run many. GPU memory is precious, so you protect it with a single process. Milvus writes are expensive per-call, so you batch them. Every design decision follows directly from the constraint it's solving.&lt;/p&gt;

&lt;p&gt;And because each layer is independent, scaling is straightforward. If fetching becomes the bottleneck, add CPU workers. If embedding becomes the bottleneck, add a GPU container. Neither change requires touching the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final results
&lt;/h2&gt;

&lt;p&gt;An embedding job that used to take over an hour now finishes in 30 seconds, a 120x improvement over the naive strategy. It could have been even faster, but I rate-limit the API calls to GitHub to follow their usage policy.&lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
      <category>rag</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to make Django asynchronous?</title>
      <dc:creator>Prithvi Rajan</dc:creator>
      <pubDate>Thu, 26 Feb 2026 06:54:10 +0000</pubDate>
      <link>https://dev.to/prithvi_rajan_222/how-to-make-django-asynchronous-7j9</link>
      <guid>https://dev.to/prithvi_rajan_222/how-to-make-django-asynchronous-7j9</guid>
      <description>&lt;p&gt;Django is synchronous by nature. It was not built for an asynchronous system. However, the team is making an effort to bring true asynchronous support to Django. In the meantime, how do we make Django work async?&lt;/p&gt;

&lt;p&gt;Note how I mentioned “true” asynchronous support. Currently, Django does support some form of asynchronous workflows, but the core part of the ORM (connecting to the DB) is still synchronous. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why do we need async workflows in the first place?
&lt;/h3&gt;

&lt;p&gt;If you are building a RAG chatbot, calling an LLM API in a synchronous workflow blocks the worker for an extremely long time. Another option is to push the call to Celery, but that just adds unnecessary complexity.&lt;/p&gt;
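&lt;p&gt;A toy example of why this matters. With async, ten slow "LLM calls" overlap instead of queueing behind one another; the call here is simulated with a sleep:&lt;/p&gt;

```python
# async_demo.py -- why async helps for slow external calls.
import asyncio
import time

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.1)       # stand-in for a slow LLM API call
    return f"answer to {prompt}"

async def main() -> float:
    start = time.monotonic()
    # Ten concurrent calls finish in roughly the time of one,
    # because the event loop interleaves the waits.
    answers = await asyncio.gather(*(call_llm(f"q{i}") for i in range(10)))
    assert len(answers) == 10
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.2f}s for 10 calls")   # roughly 0.1s, not 1.0s
```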

&lt;h2&gt;
  
  
  So how do we make it asynchronous?
&lt;/h2&gt;

&lt;p&gt;A basic checklist looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch to an ASGI server&lt;/li&gt;
&lt;li&gt;Start using the async ORM from Django&lt;/li&gt;
&lt;li&gt;Stop using any sync middleware&lt;/li&gt;
&lt;li&gt;Stop using DRF&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds easy, right? This will work, but the moment you get any real load, it starts showing cracks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The big problem
&lt;/h2&gt;

&lt;p&gt;In Django, connections are normally reused, but only per thread. So if you have multiple web workers, multiple Celery workers, and long-running LLM calls, you end up with a ton of open connections to your DB.&lt;/p&gt;

&lt;p&gt;The thing is, DB connections are expensive in terms of memory: on a Postgres server, a single connection can use around 10 MB of RAM, which adds up fast. And a Postgres server typically limits the number of connections to a few dozen, say 20 to 50. So under load you are extremely likely to hit those limits, and your APIs will start to fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we fix this?
&lt;/h2&gt;

&lt;p&gt;The answer is a connection pooler like PgBouncer. PgBouncer sits between Django and the DB: Django can open thousands of connections to PgBouncer, and PgBouncer routes them through the 20-odd real connections you have to Postgres.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffksi69ydh46uljvwddf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffksi69ydh46uljvwddf4.png" alt="Django to PgBouncer to Postgres" width="800" height="365"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This works because a connection to PgBouncer is cheap: each one adds only about 2 KB of overhead, so you can easily hold thousands of them.&lt;/p&gt;
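&lt;p&gt;A minimal PgBouncer configuration for this setup might look like the fragment below. All names and numbers are illustrative, not a drop-in config; transaction pooling is the usual choice for short-lived ORM queries, though note it is incompatible with session-level features like session-scoped prepared statements. Django then points its &lt;code&gt;DATABASES&lt;/code&gt; host/port at PgBouncer (6432 here) instead of Postgres directly.&lt;/p&gt;

```ini
; pgbouncer.ini -- illustrative values only
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
; reuse server connections between transactions
pool_mode = transaction
; cheap client-side connections from Django and Celery
max_client_conn = 2000
; the scarce, real Postgres connections
default_pool_size = 20
```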

&lt;p&gt;With this change, your endpoints can now handle a much larger concurrent load.&lt;/p&gt;

&lt;p&gt;Even when Django adds truly async ORM support, you may still face the exact same issue, since the problem mainly arises from running multiple workers or adding Celery.&lt;/p&gt;

</description>
      <category>django</category>
      <category>python</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Introducing Slide-CN</title>
      <dc:creator>Prithvi Rajan</dc:creator>
      <pubDate>Mon, 23 Feb 2026 00:14:27 +0000</pubDate>
      <link>https://dev.to/prithvi_rajan_222/introducing-slide-cn-1038</link>
      <guid>https://dev.to/prithvi_rajan_222/introducing-slide-cn-1038</guid>
      <description>&lt;p&gt;Have you ever created a presentation, and tried to change the font size of the header? You need to change it manually for every slide. If you want to change the font color, you need to select all the text in every slide, and change it manually. Then, after making these changes, you realize it looked better before, but going back to the previous version is a nightmare.&lt;/p&gt;

&lt;p&gt;The problem is not fonts, colors, or anything else. I am a developer, and those are solved problems. What frustrates me is still having to struggle with basics like version control and reusability. With code, I have Git. I can create one component and reuse it everywhere, so why can’t I do the same with slides?&lt;/p&gt;

&lt;p&gt;Slides are just UI. It is not rocket science.&lt;/p&gt;

&lt;p&gt;So why not create slides with code?&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://slide-cn.com" rel="noopener noreferrer"&gt;Slide-CN&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What this unlocks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Version control for presentations
&lt;/h3&gt;

&lt;p&gt;Because I am now using code to create slides, I can also use tools like Git and GitHub. You can review your collaborators’ changes. Your workflow as a whole gets better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reusability without templates
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://slide-cn.com" rel="noopener noreferrer"&gt;Slide-CN&lt;/a&gt; is component based. You can create small custom components that you can reuse anywhere you want. You can set color schemes. You can standardize how presentations from your company is supposed to look like, without having to create complex templates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real data, not screenshots
&lt;/h3&gt;

&lt;p&gt;Your slide deck is basically a website now. That means you can call APIs and render content dynamically. You can pull data from your dashboard in real time. Your presentations come alive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interactive storytelling
&lt;/h3&gt;

&lt;p&gt;Your slides are now made of web components, and web components can be interactive: they can react to clicks, mouse movements, and more. Maximum freedom, zero restrictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Link, not a file
&lt;/h3&gt;

&lt;p&gt;Host your presentation and share it with a URL. That way, you can update your slides and everyone sees the new version, so you don’t have to deal with files like “demo_version_2_final_final” anymore. You can also track how people progress through your slides: where they are most engaged, where they lose interest, and so on. Unlimited freedom to do what you want.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open-source
&lt;/h3&gt;

&lt;p&gt;There are a ton of open-source component libraries that you can drag and drop into a Slide-CN project. A few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reactbits&lt;/li&gt;
&lt;li&gt;Shadcn&lt;/li&gt;
&lt;li&gt;MagicUI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing comes close to the kind of ecosystem code has.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not Gamma or Canva?
&lt;/h2&gt;

&lt;p&gt;I will probably write a whole article about this, but the short version is that tools like Canva and PowerPoint are abstractions over code. They were built at a time when coding a website seemed alien to most people.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is not the case now&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anyone can pick up Cursor and prompt away. People are not shying away from code anymore, and the tools for vibe-coding get better every month. Even LLMs are natively better at “coding” than at “Canva.” &lt;/p&gt;

&lt;p&gt;Gamma represents the opposite end of the spectrum, where an agent generates your entire presentation and you give up all control: you cannot change minor details, and you are constrained by their system. &lt;a href="https://slide-cn.com" rel="noopener noreferrer"&gt;Slide-CN&lt;/a&gt; gives you complete freedom, along with the speed that comes from vibe-coding.&lt;/p&gt;

</description>
      <category>react</category>
      <category>opensource</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
