TL;DR:
In 8 days of part-time work, I built a RAG system using NestJS + PostgreSQL (pgvector) that processes ~11,000 document chunks.
The first version responded in about 4 minutes; after optimization, it took 40-60 seconds.
The main takeaway: RAG isn't "vector search + LLM," but rather data preparation, context filtering, and careful handling of prompts.
Why I did this
The main goal of the project was to build a RAG system that could answer questions based on my knowledge and experience. It also gave me real-world experience working with large volumes of documents.
The RAG system was integrated with my business card website site15.ru. There, I showcase and describe some of my projects from the last 10 years: downloads, stars, views, npm libraries, group counters, posts, and karma on Habr and dev.to.
Technically, it's implemented like this: site15.ru frontend → site15.ru backend → rag-system server. The backend passes a special API key, which at least partially protects the site from unnecessary requests.
Thus, site15.ru acts as a demo and interface for interacting with RAG.
Chat frontend on site15.ru, example of a RAG request and response
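For illustration, a minimal sketch (not the actual site15.ru code) of how the backend might forward a question to the RAG server together with that key. The endpoint path, header name, and environment variable names are assumptions:

// Hypothetical pass-through on the site15.ru backend: forward the user's
// question to the RAG server and attach the shared API key.
// Endpoint path, header name and env variables are assumptions.
export async function askRag(question: string): Promise<string> {
  const res = await fetch(`${process.env.RAG_SERVER_URL}/api/chat`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.RAG_API_KEY ?? '',
    },
    body: JSON.stringify({ question }),
  });
  if (!res.ok) throw new Error(`RAG server responded with ${res.status}`);
  const data = (await res.json()) as { answer: string };
  return data.answer;
}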
Why RAG turned out to be more complicated than it seemed
At the start, the project looked simple:
RAG = vector search + LLM
In practice, it turned out that most of the time was spent on:
- data preparation and segmentation,
- context filtering,
- generating and matching prompts.
The first version of the system did everything sequentially and responded in about 4 minutes even to a simple question.
System Architecture
Data Sources (Telegram messages, articles, portfolios, resumes)
↓
Backend (NestJS)
├─ LLM module
├─ PostgreSQL + pgvector
├─ RxJS asynchronous pipeline
├─ Dialog Manager
└─ API controller for site15.ru
↓
RAG Components
├─ Question Transformer
├─ Document and Section Filtering
├─ Vector Search
└─ Prompt Generation
↓
LLM Providers
(OpenAI / Groq / Ollama)
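As a rough illustration only, the modules from the diagram could be wired in NestJS roughly like this; the module and controller names are assumptions, not the real repository layout:

// Sketch of the backend wiring in NestJS. Module/controller names are
// illustrative assumptions, not the actual repo structure.
import { Controller, Get, Module } from '@nestjs/common';

@Module({}) class LlmModule {}     // wrappers around OpenAI / Groq / Ollama
@Module({}) class RagModule {}     // filtering, pgvector search, prompt building
@Module({}) class DialogModule {}  // dialog manager / conversation history

@Controller('api')
class ChatController {
  // endpoint called by the site15.ru backend
  @Get('health') health() { return { ok: true }; }
}

@Module({
  imports: [LlmModule, RagModule, DialogModule],
  controllers: [ChatController],
})
export class AppModule {}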
Choosing a Stack
Backend - NestJS
I constantly write backends in NestJS, so the choice was obvious.
Frontend - React Admin
I needed an admin panel to manage data and prompts.
PostgreSQL + pgvector
One system for regular data and vectors - simpler and more reliable than separate storage.
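As a rough sketch of what storing and searching chunks with pgvector can look like from Node (table and column names and the embedding dimension are assumptions; the real schema may differ):

// Minimal pgvector sketch using the node-postgres client.
// Table/column names and the 1536-dim embedding size are assumptions.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// One-time setup: enable the extension and create a chunks table.
export async function initSchema(): Promise<void> {
  await pool.query(`CREATE EXTENSION IF NOT EXISTS vector`);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS document_chunks (
      id BIGSERIAL PRIMARY KEY,
      content TEXT NOT NULL,
      metadata JSONB NOT NULL DEFAULT '{}',
      embedding vector(1536)
    )
  `);
}

// Cosine-distance search over the chunks that survived earlier filtering.
export async function searchChunks(embedding: number[], limit = 10) {
  const res = await pool.query(
    `SELECT id, content, metadata
       FROM document_chunks
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [JSON.stringify(embedding), limit],
  );
  return res.rows;
}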
Multiple LLM Providers
Supporting different providers lets you use free tiers and switch easily if needed.
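All three providers can be reached through an OpenAI-compatible chat API, so a thin factory is enough. A hedged sketch (the LLM_PROVIDER variable and key handling are configuration assumptions, not the project's actual code):

// Minimal provider-switching sketch. Groq and Ollama expose OpenAI-compatible
// endpoints at their documented default base URLs; env variable names and
// fallback behavior here are assumptions.
import OpenAI from 'openai';

type Provider = 'openai' | 'groq' | 'ollama';

export function createLlmClient(provider: Provider): OpenAI {
  switch (provider) {
    case 'groq':
      return new OpenAI({
        baseURL: 'https://api.groq.com/openai/v1',
        apiKey: process.env.GROQ_API_KEY,
      });
    case 'ollama':
      return new OpenAI({
        baseURL: 'http://localhost:11434/v1',
        apiKey: 'ollama', // Ollama ignores the key, but the client requires one
      });
    default:
      return new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  }
}

const llm = createLlmClient((process.env.LLM_PROVIDER as Provider) ?? 'openai');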
Key Architectural Decision: Hierarchical Filtering
The main idea is not to send everything to the LLM.
Request processing pipeline:
User request (frontend site15.ru)
↓
Backend site15.ru
↓
RAG server: query normalization
↓
Metadata filtering (11,000 → ~500)
↓
Section/header filtering (500 → ~200)
↓
Vector search (200 → 5–10)
↓
One optimized request to the LLM
This significantly reduced the amount of data sent to the LLM and sped up responses.
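Sketched in TypeScript, the staged narrowing can be expressed roughly like this; stage names, types, and the counts in the comments are illustrative assumptions, not the project's real interfaces:

// Sketch of the hierarchical filtering idea: each stage narrows the candidate
// set before the expensive steps run. All names and counts are assumptions.
interface Chunk { id: number; content: string; }

interface PipelineStages {
  normalizeQuery(raw: string): Promise<string>;
  filterByMetadata(q: string): Promise<Chunk[]>;               // ~11,000 -> ~500
  filterBySections(q: string, c: Chunk[]): Promise<Chunk[]>;   // ~500 -> ~200
  vectorSearch(q: string, c: Chunk[], k: number): Promise<Chunk[]>; // ~200 -> 5-10
  buildPrompt(q: string, c: Chunk[]): string;
  callLlm(prompt: string): Promise<string>;
}

export async function answerQuestion(raw: string, s: PipelineStages): Promise<string> {
  const question = await s.normalizeQuery(raw);
  const byMetadata = await s.filterByMetadata(question);
  const bySections = await s.filterBySections(question, byMetadata);
  const topChunks = await s.vectorSearch(question, bySections, 10);
  return s.callLlm(s.buildPrompt(question, topChunks)); // one LLM call at the end
}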
The biggest technical challenge
Creating metadata for 11,005 document chunks.
The cloud didn't allow processing everything at once, so I ran it locally through LM Studio with the qwen2.5-7b-instruct model; it took 2 days on an RTX 2060 SUPER.
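A rough sketch of that batch job against LM Studio's local OpenAI-compatible server; the prompt, output format, and key handling are simplified assumptions:

// Sketch of generating metadata for a chunk through a local LM Studio server.
// LM Studio serves an OpenAI-compatible API on localhost:1234 by default;
// the prompt and the expected output shape are simplified assumptions.
import OpenAI from 'openai';

const local = new OpenAI({
  baseURL: 'http://localhost:1234/v1',
  apiKey: 'lm-studio', // the local server does not check the key
});

export async function describeChunk(chunk: string): Promise<string> {
  const res = await local.chat.completions.create({
    model: 'qwen2.5-7b-instruct',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content:
          'Summarize the chunk as short metadata: topic, entities, document type. Answer with a compact JSON object.',
      },
      { role: 'user', content: chunk },
    ],
  });
  return res.choices[0]?.message?.content ?? '';
}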
Why the first version was slow
First version: 8 intermediate prompts, consecutive LLM calls → ~4 minutes per response.
After optimization: 4 coordinated prompts, some stages running in parallel, an asynchronous RxJS queue → 40–60 seconds.
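A hedged RxJS sketch of the idea: a queue with limited concurrency, and independent stages running in parallel (which stages are actually independent in the real pipeline is an assumption here):

// Sketch of the asynchronous queue on RxJS. Stage functions are stubs and
// the choice of which stages run in parallel is an assumption.
import { Subject, forkJoin, from } from 'rxjs';
import { mergeMap } from 'rxjs/operators';

// Stubs standing in for the real stages.
const embedQuery = async (q: string): Promise<number[]> => [];
const filterByMetadata = async (q: string): Promise<string[]> => [];
const finishPipeline = async (q: string, emb: number[], docs: string[]): Promise<string> =>
  `answer to: ${q}`;

const requests$ = new Subject<string>();

requests$
  .pipe(
    mergeMap(
      (question) =>
        forkJoin({
          // these two do not depend on each other, so they run in parallel
          embedding: from(embedQuery(question)),
          candidates: from(filterByMetadata(question)),
        }).pipe(
          mergeMap(({ embedding, candidates }) =>
            from(finishPipeline(question, embedding, candidates)),
          ),
        ),
      2, // at most two user requests in flight; the rest wait in the queue
    ),
  )
  .subscribe((answer) => console.log(answer));

// requests$.next('Tell me about NestJS');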
Prompts and Errors
Prompts turned out to be the most challenging aspect. Examples:
Context Leak
User: "Tell me about NestJS"
LLM: "NestJS is a great framework. By the way, I have a Telegram bot for coffee..."
Instruction Conflict
Prompt 1: Be brief
Prompt 2: Provide detailed examples
Conclusion: Fewer, but consistent prompts are better.
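In practice that meant merging the conflicting instructions into one explicit system prompt. A simplified, hypothetical example of the idea (not the project's actual prompt):

// A hypothetical merged system prompt: one set of non-conflicting rules
// instead of several partial prompts. Not the prompt used in the project.
export const SYSTEM_PROMPT = `
You answer questions about the author's projects and experience.
Rules:
- Use ONLY the facts from the "Context" section below; do not mention unrelated projects.
- Default to a short answer (3-5 sentences).
- Add a detailed example only if the user explicitly asks for details.
- If the context does not contain the answer, say so directly.
`.trim();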
Using AI in Development
About 70% of the code was written with the help of AI assistants, but without my architectural editing and debugging, it wouldn't have worked.
Security
- Checking the authorized IP address,
- Checking the API key for requests from the site15.ru backend.
The project is not production-ready, but this at least somewhat protects the main site from unnecessary requests.
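A minimal sketch of such a check as a NestJS guard; the header name, environment variables, and allow-list format are assumptions:

// Minimal NestJS guard sketch: reject requests from unknown IPs or without
// the shared API key. Header name and env variables are assumptions.
import { CanActivate, ExecutionContext, Injectable, UnauthorizedException } from '@nestjs/common';
import { Request } from 'express';

@Injectable()
export class Site15Guard implements CanActivate {
  private readonly allowedIps = (process.env.ALLOWED_IPS ?? '').split(',');

  canActivate(context: ExecutionContext): boolean {
    const req = context.switchToHttp().getRequest<Request>();
    const ipOk = this.allowedIps.includes(req.ip ?? '');
    const keyOk = req.headers['x-api-key'] === process.env.RAG_API_KEY;
    if (!ipOk || !keyOk) {
      throw new UnauthorizedException();
    }
    return true;
  }
}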
Deployment
- RAG server on a separate VPS,
- docker-compose for PostgreSQL and Ollama,
- backend via pm2,
- integration with site15.ru.
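For reference, a minimal docker-compose sketch of the two containers; image tags, ports, and credentials are assumptions, and the real compose file may differ:

# Sketch of the docker-compose part: Postgres with pgvector plus Ollama.
# Image tags, ports and credentials are assumptions, not the project's file.
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: rag
      POSTGRES_PASSWORD: rag
      POSTGRES_DB: rag
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama

volumes:
  pgdata:
  ollama: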
Current status
Experimental project, not MVP, not production-ready.
Site15.ru serves as an interface for demonstrating RAG and project statistics.
Future Plans
- Writing user scenario tests (checking response and pipeline correctness).
- Refactoring the LLM module to NestJS style.
- Adding analytics: processing time, number of LLM calls, token consumption.
- Automating deployment via Docker/Kubernetes.
- Improving accuracy and speed by optimizing context filtering.
Conclusions
RAG is a complex engineering process: data preparation, context filtering, and careful prompt handling are more important than the LLM itself.
In 8 days of part-time work, I was able to assemble a working system, integrate it with site15.ru, and gain real-world experience with RAG.
Links
https://github.com/site15/rag-system - Project repository
https://site15.ru - My business card website with support chat











