Sri Hari Karthick
RAG Made Simple: Demonstration and Analysis of Simplicity (Part 3)

Stunning image by Ales Nostril, courtesy of Unsplash

Live Demonstration

This demonstration was run on a cloud instance with an RTX A5000 GPU using Microsoft’s Phi model as the generator. If you plan to use Mistral as the language model, note that it requires a Hugging Face API key since it is not publicly accessible. Phi and GPT models can be used without a key by configuring them in config.yml.
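As a rough illustration of how that configuration-driven model selection might look, here is a minimal Python sketch. The config key names, the `HF_TOKEN` environment variable, and the `microsoft/phi-2` checkpoint are assumptions for illustration, not necessarily what the project's config.yml actually uses:

```python
# Minimal sketch of config-driven generator selection (key names are assumptions).
import os

import yaml
from transformers import AutoModelForCausalLM, AutoTokenizer

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

# Hypothetical key; defaults to a Phi checkpoint, which needs no API key.
model_id = cfg.get("generator_model", "microsoft/phi-2")

# Gated models such as Mistral require a Hugging Face token; Phi does not.
hf_token = os.environ.get("HF_TOKEN") if "mistral" in model_id.lower() else None

tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # let transformers pick a dtype suited to the GPU
    device_map="auto",    # needs `accelerate`; places weights on the available GPU
    token=hf_token,
)
```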

The first run takes longer, as the model weights and ChromaDB’s embedding function are downloaded on first use. Below is a video of the system in action. (The response time is noticeably slow due to GPU limitations.)
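Part of that first-run delay is ChromaDB’s default embedding function pulling down its small sentence-embedding model before anything can be vectorised. A minimal sketch of setting up a persistent store with that default, where the path and collection name are placeholders rather than the project’s actual values:

```python
import chromadb
from chromadb.utils import embedding_functions

# The default embedding function downloads its MiniLM weights the first time it runs.
embedding_fn = embedding_functions.DefaultEmbeddingFunction()

client = chromadb.PersistentClient(path="./chroma_db")   # placeholder path
collection = client.get_or_create_collection(
    name="arxiv_papers",                                  # placeholder name
    embedding_function=embedding_fn,
)
```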

(Video: the system in action)

What Went Well...

  • Surprisingly good output from Phi: After multiple rounds of prompt tuning, and with no fine-tuning at all, the Phi model generated coherent summaries with proper inline citations for the given query.
  • Vector store growth: Documents retrieved from the API were successfully vectorised and stored in ChromaDB for future use (see the caching sketch after this list).
  • Responsive UI: The interface remained snappy and interactive throughout the demonstration.
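For reference, caching the API results amounts to a single add call on the collection. This sketch assumes the `collection` from the earlier setup snippet and hypothetical field names (`id`, `title`, `abstract`, `url`) for the retrieved papers:

```python
def cache_papers(collection, papers):
    """Embed and store retrieved papers so future queries can be served locally."""
    collection.add(
        ids=[p["id"] for p in papers],               # e.g. the ArXiv entry id
        documents=[p["abstract"] for p in papers],   # the text that gets embedded
        metadatas=[{"title": p["title"], "url": p["url"]} for p in papers],
    )
```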

...And What Didn’t

  • Slow inference: The time to generate a summary was painfully long. This could be mitigated with more powerful or distributed hardware, but that’s beyond the scope of this portfolio project.
  • Inconsistent summarisation: Citation formatting varied. While some responses were crisp and relevant, others veered off and generated excess tokens — a clear sign that better results would require fine-tuning on a domain-specific dataset.
  • Naive document ranking: A simple distance threshold was used to filter local results before falling back to the ArXiv API. While functional, more advanced re-ranking techniques could improve precision; a sketch of this threshold-then-fallback logic follows this list.
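To make that last point concrete, here is a minimal sketch of threshold-then-fallback retrieval. The threshold value and the use of the `arxiv` client library are assumptions for illustration, not necessarily what the project implements:

```python
import arxiv

DISTANCE_THRESHOLD = 0.8  # hypothetical cut-off; smaller distance = closer match

def retrieve(collection, query, k=5):
    """Serve documents from the local vector store when they are close enough,
    otherwise fall back to the ArXiv API."""
    local = collection.query(query_texts=[query], n_results=k)
    docs = [
        doc
        for doc, dist in zip(local["documents"][0], local["distances"][0])
        if dist <= DISTANCE_THRESHOLD
    ]
    if docs:
        return docs

    # Fallback: query ArXiv and return the abstracts of the top results.
    search = arxiv.Search(query=query, max_results=k)
    return [result.summary for result in arxiv.Client().results(search)]
```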

Conclusion

Despite its limitations, the system delivered a complete, working end-to-end RAG pipeline running in the cloud. Seeing it in action underscored both the promise and the computational demands of even small-scale LLMs. Document retrieval typically took ~11 seconds, while summarisation took ~60 seconds, both of which could be improved with stronger infrastructure. Still, as a self-contained, vendor-free solution built from the ground up, this was a very satisfying personal milestone.

I hope this was a productive and insightful read. If you have any feedback, ideas for improvement, or suggestions for pushing this further, I’d love to hear from you; feel free to reach out or drop a comment!

Top comments (3)

WalkingTree Technologies

Hey Sri Hari, this is cool! I loved the way you explained and demonstrated the live RAG system in action – super helpful. Agree, the slow inference is such a blocker sometimes. We’ve been working on similar agent-based RAG setups and have been trying out some ways to improve speed and coordination between agents.

Would love to exchange ideas or geek out on this if you’re up for it! Also, we’re hosting a webinar soon on how agentic AI and RAG can work together for complex workflows – happy to share the link if you’re interested.

Sri Hari Karthick

Appreciate that! Yeah, the goal was to show what’s possible even on a basic setup, kind of a “low-barrier” RAG system anyone can run locally without tying into any vendor.

Totally agree that inference could be way faster with multi-GPU or distributed setups, but I wanted to make it approachable first. Would love to see how you’re scaling things on your end, always keen to learn more modern approaches!
