How I Built a Local-First AI Stack for Document Q&A Without OpenAI ππ€
A multi-service monorepo with llama.cpp, Qdrant, Python FastAPI services, React, Node and MCP support for AI agents.
Youβve probably seen buzzwords like RAG, vector database, embeddings, MCP, and local LLMs everywhere. This article is meant to make those terms feel concrete by showing how they fit together in a real project.
What Youβll See in This Project π
- Local-first RAG architecture
- Document PDF ingestion and chunking pipeline
-
Embedding generation using
sentence-transformers -
Vector search with
Qdrant -
Local LLM inference with
llama.cpp -
Python backend microservices built with
FastAPI - React frontend for document upload and chat
- Optional ML layer for security and query analysis
- MCP integration so AI agents can use the system as tools
Table of contents π§
- 1. Introduction
- 2. What Is a Local AI Stack
- 3. Why Build AI Without OpenAI
- 4. Use Cases for Local AI
- 5. Key Concepts Behind the System
- 6. High Level Architecture
- 7. Technology Stack
- 8. System Components Explained
- 9. Document Ingestion Pipeline
- 10. Example Document Ingestion Lifecycle
- 11. Query Processing Flow
- 12. Example Request Lifecycle
- 13. Improving Retrieval Quality
- 14. Security Considerations
- 15. Performance Optimization
- 16. Advantages And Pros of a Local AI Stack
- 17. Limitations And Cons and Tradeoffs
- 18. Future Improvements
- 19. Refactoring Path: LangChain, LlamaIndex, or Bedrock
- 20. Conclusion
- 21. Demoing This Repo
Most AI tutorials still follow the same recipe: call OpenAI, print the response, and label it an AI application.
That is fine for a quick prototype, but it becomes limiting fast. You inherit API costs, external latency, privacy concerns, and a system design that often relies on a single provider sitting in the middle of everything.
I wanted to build something closer to a real product: a local-first AI system that can ingest documents, search them semantically, generate grounded answers, and stay flexible enough to support both humans and AI agents.
That is what document_rag is. It is a local-first Retrieval-Augmented Generation (RAG) platform for uploading documents, retrieving relevant context, and answering questions with sources. By default, it runs locally without requiring OpenAI, and it is structured as a multi-service monorepo with an MCP server so tools like Cursor or Claude Desktop can also use the same platform.
You can find the full source code on GitHub at GitHub.
In this article, I will walk through the architecture, the tech stack, the tradeoffs, and why building AI locally is worth considering in the first place.
1. Introduction π±
AI-powered applications are quickly moving from novelty to default product features. Search, support assistants, internal copilots, documentation chat, and workflow automation are all being rebuilt around language models.
The easiest way to build these systems is to rely entirely on hosted providers such as OpenAI. That works well for prototypes, but many teams eventually hit the same questions:
- How much will this cost at scale?
- Can we safely send internal documents to an external API?
- What happens if latency spikes or pricing changes?
- What if the application needs to work inside a private network?
Running AI locally is one answer to those concerns.
In this project, I built a local-first RAG system that ingests PDFs and text, chunks them, turns them into embeddings, stores them in a vector database, retrieves relevant context for a question, and then generates an answer with a local LLM. It lives in a monorepo that contains the frontend, backend services, shared modules, and an MCP server for agent access. The article shows how that stack fits together and why this architecture is useful beyond a demo.
2. What Is a Local AI Stack π§±
A local AI stack is a system where the critical AI components run on infrastructure you control instead of depending entirely on an external API provider.
In practice, that usually means:
- A local or self-hosted LLM runtime
- A local embedding model
- A vector database for retrieval
- Backend services that orchestrate ingestion and question answering
- A UI or API layer for users and other tools
The biggest difference from cloud AI is where inference and data processing happen.
- In a cloud-first setup, your app sends prompts and often context to a remote provider.
- In a local-first setup, your app keeps the pipeline close to the data and only uses external providers if you intentionally enable them.
In document_rag, the default local stack is:
-
llama.cppfor LLM inference -
BAAI/bge-small-en-v1.5for embeddings - Qdrant for vector storage
- FastAPI services for ingestion, embedding, retrieval, RAG orchestration, and optional ML analysis
- Express Gateway as the entry point for the frontend
- React frontend for upload and chat
- MCP server so AI agents can search, ask questions, and ingest content through the same platform
From a repository-design perspective, this is also a monorepo: multiple related applications and services live in one Git repository, share documentation and infrastructure, and work together as one system.
3. Why Build AI Without OpenAI πΈ
There is nothing wrong with OpenAI or other cloud providers. They are excellent tools. But there are solid engineering reasons to build a system that does not require them by default.
Cost considerations
Hosted APIs are easy to start with, but repeated embedding calls, chat completions, and large context windows can become expensive. A local stack makes costs more predictable because the main expense is infrastructure and hardware, not token billing on every request.
For example, AWS Bedrock pricing depends on the model you choose and can add up quickly at scale. As one reference point, the AWS Bedrock pricing page lists Claude 3.5 Sonnet Extended Access at roughly $6 per million input tokens and $30 per million output tokens, with batch pricing lower than that. That may be perfectly reasonable for production workloads, but the cost becomes usage-driven very quickly once you have frequent queries, longer contexts, or multiple users.
With a local llama.cpp setup, the cost model is different. There is no per-token API bill for each request. Instead, you are paying for the machine, electricity, storage, and the operational overhead of running the model. If you are testing on hardware you already own, the marginal cost can feel close to zero. But if you need a stronger dedicated GPU box, the fixed monthly cost can be significant even before traffic grows.
Data privacy and security
Many RAG systems work with internal PDFs, team docs, policies, contracts, or private knowledge bases. Keeping the pipeline local reduces exposure and makes the system easier to justify in privacy-sensitive environments.
Latency improvements
When embeddings, search, and inference are close to the application, you remove some network overhead. Local inference is not always faster in absolute terms, but it is often more predictable.
Vendor lock-in concerns
If the whole product depends on one hosted provider, switching later can be painful. This project avoids that by using config-driven backends. The default path is local, but optional OpenAI and other future backends fit behind stable service contracts.
Offline and internal systems
Some tools need to run on internal networks, development laptops, or restricted environments. A local-first design makes those scenarios practical.
4. Use Cases for Local AI π―
A local AI stack is especially useful for:
- Internal document search
- Engineering or product knowledge bases
- Company documentation assistants
- Secure enterprise environments
- Teams in regulated industries that want stronger control over data flow
This repository focuses on document question answering, but the same architecture can support internal wikis, policy assistants, onboarding tools, legal document review support, and research archives.
5. Key Concepts Behind the System π§
Before looking at the architecture, it helps to define the core RAG building blocks.
Large Language Models
LLM stands for Large Language Model. It is the component responsible for generating the final answer from the user question plus the retrieved context. In this project, the default runtime is llama.cpp, which serves a local model such as Mistral 7B in GGUF format.
Embeddings
Embeddings convert text into vectors so semantically similar content can be matched even if the wording is different. This repo uses BAAI/bge-small-en-v1.5 by default.
Vector search
Vector search lets the system retrieve the most relevant document chunks for a user query. Qdrant is used as the default vector store.
Retrieval-Augmented Generation
RAG stands for Retrieval-Augmented Generation. It combines document retrieval with language generation, so the model answers using relevant source material instead of relying only on its pretrained knowledge.
In practice, RAG combines retrieval and generation:
- Turn the user question into an embedding.
- Search the vector database for relevant chunks.
- Pass those chunks as context to the LLM.
- Generate an answer grounded in the retrieved content.
That grounding is what makes RAG more useful for document-based assistants than raw prompting alone.
6. High Level Architecture ποΈ
At a high level, the system is split into focused services instead of a single large app:
- Frontend for upload and chat
- Gateway for a unified API entry point
- Ingestion service for parsing and chunking documents
- Embedding service for converting text to vectors
- Retrieval service for vector storage and search
- RAG service for orchestration and answer generation
- Optional ML service for injection detection, query classification, and retrieval scoring
- MCP server so AI agents can use the same backend as tools
This layout is one reason I describe the project as a monorepo. Instead of separating everything into different repositories, the frontend, backend services, shared modules, and MCP integration are versioned together. For a system like this, it makes local development, documentation, and architecture changes easier to manage.
The main data flow looks like this:
- A user uploads a document through the frontend.
- The Gateway forwards the request to the Ingestion service.
- Ingestion parses the file, splits it into chunks, and asks the Embedding service for vectors.
- The Retrieval service stores those vectors in Qdrant.
- Later, when the user asks a question, the RAG service embeds the query, retrieves relevant chunks, optionally reranks them, and sends grounded context to the LLM.
- The answer comes back with sources.
Here is the high-level architecture:
There is also a second access path besides the browser UI: the MCP server exposes the system to AI agents over the Model Context Protocol. That means the same platform can power both a human-facing frontend and agent workflows such as search_documents, ask_rag, and ingest_document.
That separation makes the system easier to reason about, replace, and extend.
7. Technology Stack π§°
Here is the concrete stack used in this project.
| Layer | Technology |
|---|---|
| LLM runtime | (C++ runtime) --> llama.cpp |
| Default LLM model | (GGUF model) --> Mistral 7B |
| Embeddings | (Python / sentence-transformers) --> BAAI/bge-small-en-v1.5 |
| Vector database | (Vector DB) --> Qdrant |
| API gateway | (NodeJS) --> Express + TypeScript |
| Backend services | (Python) --> FastAPI |
| Frontend | Frontend (React) --> React + Vite + TypeScript |
| Optional agent interface | (Python) --> MCP server via FastMCP |
| Optional alternative backends | (Config-driven) --> OpenAI, pgvector, Bedrock placeholders/backends |
One thing I like about this setup is that it stays practical. The default stack is local-first, but the interfaces are designed so that changing a backend does not force a full rewrite of the product.
8. System Components Explained π§©
Gateway / API layer
The Gateway is the public entry point used by the frontend. It keeps the UI simple and hides the internal service boundaries.
Embedding service
This service owns text-to-vector conversion. Other services do not care which embedding provider is behind it as long as the /embed contract stays stable.
Vector database
Qdrant stores chunk vectors and powers similarity search. The Retrieval service sits in front of it so vector database details are isolated.
LLM service
The generation layer uses llama.cpp by default. The RAG service talks to an abstraction, so local inference is the default but not the only possible implementation.
Ingestion service
The Ingestion service is responsible for parsing documents, chunking text, requesting embeddings, and inserting results into the retrieval layer. The next sections go deeper into the ingestion pipeline and its request lifecycle.
RAG orchestration service
This is the brain of the application. It handles query processing, context assembly, prompt construction, answer generation, safeguards, optional query rewriting, and optional reranking. The dedicated query-flow sections below show that process in more detail.
Optional ML service
The ML service adds extra intelligence around prompt injection detection, query intent classification, and retrieval scoring. It is not required for the core app to work, which is a good design choice for graceful degradation.
MCP server
The MCP server is a thin integration layer that exposes the platform as tools for AI agents such as Cursor or Claude Desktop. Instead of building a separate agent-specific backend, this repo reuses the same ingestion, retrieval, and RAG services and makes them available over MCP.
The service architecture looks like this:
9. Document Ingestion Pipeline π₯
Ingestion is where a lot of real RAG quality is decided.
The pipeline in this repo looks like this:
- Accept a PDF upload or raw text.
- Parse the document into plain text.
- Split the text into chunks.
- Generate embeddings for those chunks.
- Store vectors and metadata in the retrieval layer.
Here is the ingestion pipeline:
Input
A document or text is provided by the user through upload or via the MCP client.
Gateway / MCP Server
The request is received and validated by the Gateway API or the MCP server, which acts as the system entry point.
Ingestion Service
The request is forwarded to the Ingestion Service, responsible for preparing the document for processing and indexing.
Parser
The parser extracts raw text from the uploaded content (for example PDF, TXT, or other supported formats).
Chunker
The extracted text is split into smaller chunks to optimise embedding generation and retrieval accuracy.
Embedding Service
Each chunk is converted into a vector representation (embedding) using an embedding model.
Vector Preparation
The generated vectors represent the semantic meaning of each chunk.
Retrieval Service
The vectors and their associated metadata are upserted through the retrieval service.
Vector Database
The vectors are stored in Qdrant, where they become searchable for future semantic queries.
Document parsing
The system supports PDF ingestion and also text ingestion, which is useful for testing, automation, and MCP-driven workflows.
Chunking strategies
Chunking is critical because poor chunking hurts retrieval quality even if the model is strong. This project exposes chunk-size configuration and keeps chunking as a shared concern rather than scattering it across services.
Embedding generation
Each chunk is sent to the Embedding service, which returns vector representations using the configured backend.
Storing vectors
The Retrieval service upserts the vectors into Qdrant, making the document searchable for future queries.
10. Example Document Ingestion Lifecycle π
To make the ingestion path as concrete as the query path, here is a simplified lifecycle of what happens when a user uploads a PDF document through the frontend.
Document Ingestion Service Flow (with Optional ML)
| Step | Caller | Endpoint | Target Service | Description |
|---|---|---|---|---|
| 1 | Frontend | /ingest/ |
Gateway | Upload PDF or raw text document |
| 2 | Gateway | /ingest |
Ingestion Service | Forward document to ingestion pipeline |
| 3 | Ingestion Service | β | Parser Module | Extract text from PDF or text input |
| 4 | Ingestion Service |
/classify (optional) |
ML Service | Classify document type when document classification is enabled |
| 5 | Ingestion Service | β | Chunker Module | Split extracted text into smaller chunks |
| 6 | Ingestion Service | /embed |
Embedding Service | Convert text chunks into vector embeddings |
| 7 | Embedding Service | β | Embedding Model | Generate embeddings using sentence-transformers
|
| 8 | Ingestion Service | /upsert |
Retrieval Service | Send vectors and metadata for storage |
| 9 | Retrieval Service | β | Qdrant | Store embeddings in vector database |
| 10 | Ingestion Service | Response | Gateway | Return ingestion result |
| 11 | Gateway | JSON response | Frontend | Confirm document indexed successfully |
User action
Upload file:
employment_contract.pdf
Step 1 β Frontend uploads the document
The React frontend sends the PDF to the Gateway as a multipart form upload.
POST http://localhost:8000/ingest/
Content-Type: multipart/form-data
file = employment_contract.pdf
Step 2 β Gateway forwards to the Ingestion service
The Gateway proxies the uploaded file to the Ingestion service.
POST http://localhost:8001/ingest
Content-Type: multipart/form-data
file = employment_contract.pdf
Step 3 β PDF parsing and text extraction
The Ingestion service stores the upload temporarily, extracts text from the PDF, and prepares it for chunking. If document classification is enabled, the service can also classify the document before indexing it.
Step 4 β Chunking
The extracted text is split into smaller chunks so the content can be embedded and retrieved effectively later.
Example chunk output:
Chunk 1: This employment agreement begins on...
Chunk 2: Either party may terminate this contract by...
Chunk 3: Confidentiality obligations survive termination...
Step 5 β Batch embedding
The Ingestion service sends all chunks to the Embedding service in one batch request.
POST http://localhost:8002/embed
Content-Type: application/json
{
"texts": [
"This employment agreement begins on...",
"Either party may terminate this contract by...",
"Confidentiality obligations survive termination..."
]
}
The Embedding service returns one vector per chunk.
Step 6 β Vector upsert
The Ingestion service packages the chunk text, source filename, and embeddings into points and sends them to the Retrieval service.
POST http://localhost:8003/upsert
Content-Type: application/json
{
"points": [
{
"id": "uuid-1",
"vector": [ ... ],
"payload": {
"text": "This employment agreement begins on...",
"source": "employment_contract.pdf"
}
}
]
}
The Retrieval service stores those points in Qdrant, making the document searchable.
Step 7 β Ingestion result returned
Once indexing finishes, the Ingestion service returns a success payload.
Example response:
{
"status": "success",
"chunks_inserted": 3,
"document": "employment_contract.pdf"
}
The Gateway passes that response back to the frontend, which can then confirm that the document is ready for search and question answering.
Here is the same ingestion flow shown as a sequence trace:
This flow is useful because it shows that ingestion is not just file upload. It is a full indexing pipeline: parsing, chunking, embedding, and vector storage. That is what makes later RAG queries possible.
11. Query Processing Flow π
Question answering follows a similar service-oriented path:
- The user asks a question in the frontend.
- The Gateway forwards the question to the RAG service.
- The RAG service validates the input with safeguards.
- If enabled, the ML service analyses the query for prompt injection and intent classification.
- If enabled, the query can be rewritten into a clearer search form.
- The query is embedded.
- The Retrieval service performs vector search in Qdrant.
- If enabled, a reranker improves the ordering of retrieved chunks.
- If enabled, the ML service scores retrieval quality before generation.
- The RAG service builds a prompt with the selected context.
- The LLM generates an answer.
- Output safeguards validate the response before it is returned.
- The frontend shows the answer together with sources.
The RAG query flow looks like this:
User Input
A user submits a question to the system.
Gateway
The request is received by the Gateway API, which acts as the entry point for user queries.
RAG Service
The request is forwarded to the RAG orchestration service, which coordinates the entire retrieval and generation pipeline.
Input Safeguards
The query is validated to detect unsafe or malformed inputs.
Query Analysis (Optional)
A machine learning service may analyse the query to determine intent, complexity, or additional metadata.
Query Rewriting (Optional)
The query can be rewritten to improve retrieval accuracy.
Embedding Generation
The processed query is converted into a vector embedding using the embedding service.
Retrieval Service
The retrieval service searches for relevant document chunks using the query vector.
Vector Database Search
The similarity search is executed in Qdrant, which stores the document embeddings.
Context Retrieval
The most relevant chunks are returned as context for the language model.
Reranking (Optional)
A reranker model may reorder the retrieved chunks to improve relevance.
Retrieval Scoring (Optional)
An ML service may evaluate and score the quality of the retrieved results.
Prompt Construction
The prompt builder assembles the final prompt using the user query and the retrieved context.
Model Execution
The prompt is sent to the local model runtime powered by llama.cpp.
Output Safeguards
The generated response is validated to ensure safety and compliance.
Final Response
The system returns the answer along with source references to the user.
This is a good example of why modularity matters. The user experiences one simple chat flow, but the system is actually combining retrieval, ranking, safety checks, and generation behind the scenes.
12. Example Request Lifecycle π
To make the architecture more concrete, here is a simplified lifecycle of a real request when a user asks a question about an uploaded document. The URLs below use the default local development ports from this repository.
Example Request Flow
| Step | Caller | Endpoint | Target Service | Description |
|---|---|---|---|---|
| 1 | Frontend | /chat/ |
Gateway | Send the user question to the public API |
| 2 | Gateway | /ask |
RAG Service | Forward the request to the RAG orchestration layer |
| 3 | RAG |
/analyze (optional) |
ML Service | Check prompt injection and classify query intent |
| 4 | RAG | /embed |
Embedding Service | Generate a vector embedding for the question |
| 5 | RAG | /search |
Retrieval Service | Retrieve the most relevant chunks |
| 6 | Retrieval | Vector search | Qdrant | Run similarity search on stored embeddings |
| 7 | RAG |
/score (optional) |
ML Service | Score retrieval quality before generation |
| 8 | RAG | Completion call | LLM Runtime | Send prompt and context to the configured model backend |
| 9 | Gateway | Response | Frontend | Return the final answer and sources |
User question
What does the contract say about termination conditions?
Step 1 β Frontend sends the request
The React frontend sends the user question to the Gateway.
POST http://localhost:8000/chat/
Content-Type: application/json
{
"question": "What does the contract say about termination conditions?"
}
Step 2 β Gateway forwards to the RAG service
The Gateway validates the incoming request shape and forwards it to the RAG orchestration service.
POST http://localhost:8004/ask
Content-Type: application/json
{
"question": "What does the contract say about termination conditions?"
}
Step 3 β Input validation and optional ML analysis
The RAG service first runs its input safeguards. If the optional ML service is enabled, it can also analyze the query for prompt injection and classify the query intent.
POST http://localhost:8005/analyze
Content-Type: application/json
{
"query": "What does the contract say about termination conditions?"
}
This step is optional and the system can continue without it if the ML service is disabled or unavailable.
Step 4 β Query embedding
Inside the RAG service, the question is passed to the Embedding service to generate a vector representation.
POST http://localhost:8002/embed
Content-Type: application/json
{
"text": "What does the contract say about termination conditions?"
}
The Embedding service returns an embedding vector for the query.
Step 5 β Vector search
The RAG service sends the query vector to the Retrieval service, which searches Qdrant for the most relevant chunks.
POST http://localhost:8003/search
Content-Type: application/json
{
"query_vector": [ ... ],
"top_k": 5
}
The Retrieval service returns the most similar chunks, including their text and source metadata.
Step 6 β Optional reranking and retrieval scoring
If reranking is enabled, the RAG service reorders the retrieved chunks before generation. If the ML service is enabled, it can also score the retrieval quality.
POST http://localhost:8005/score
Content-Type: application/json
{
"query": "What does the contract say about termination conditions?",
"chunks": [
{
"text": "Either party may terminate the agreement with written notice...",
"source": "employment_contract.pdf"
}
]
}
This lets the system estimate whether the retrieved context is strong enough before asking the model to generate an answer.
Step 7 β Prompt construction
The RAG service builds a grounded prompt that combines the original question with the retrieved context.
Example prompt:
Context:
[Chunk 1: termination clause description...]
[Chunk 2: conditions for ending the agreement...]
Question:
What does the contract say about termination conditions?
Answer using only the provided context.
If query rewriting, reranking, safeguards, or ML-based scoring are enabled, those steps are applied around retrieval and prompt construction before the model is called.
Step 8 β LLM inference
The prompt is then sent from the RAG service to the configured LLM backend. In the default local setup, that means a call to the llama.cpp server configured by LLM_URL (typically http://localhost:8080).
Step 9 β Response returned
Once the answer is generated, the RAG service returns the result together with the source references.
Example response:
{
"question": "What does the contract say about termination conditions?",
"answer": "The contract allows termination if either party provides 30 days written notice...",
"sources": [
"contract_page_4_chunk_2",
"contract_page_5_chunk_1"
]
}
Finally, the Gateway returns that response to the frontend, and the answer is displayed in the chat interface.
Here is the same request shown as a sequence trace:
This section is useful because it shows the system as an actual request trace, not just a conceptual diagram. It makes it easier to see how the services collaborate and how RAG works in practice.
13. Improving Retrieval Quality π―
A basic RAG demo often stops at embedding plus vector search. This project goes further.
Chunking strategies
Chunk size is configurable because retrieval quality depends heavily on how much meaning each chunk carries.
Top-k retrieval
The system can retrieve a broader candidate set for search and then narrow it down before generation. That is more robust than sending only the first raw matches directly to the LLM.
Context filtering
The architecture supports filtering and validating what reaches the model, which matters both for relevance and safety.
Query rewriting
One of the nicer features in this repo is optional query rewriting. Short or vague questions can be expanded into clearer search queries for better embedding and retrieval, while the original user wording is preserved for the final answer.
Reranking
The project also supports optional BGE reranking. That means vector search can fetch a wider set of candidates, and then a reranker can choose the best chunks to pass into the answer prompt.
Together, these choices make the retrieval layer more realistic than a minimal tutorial project.
14. Security Considerations π
Local AI does not automatically mean secure AI. You still need defensive layers.
This project includes several useful security-oriented ideas:
- Input safeguards to block obvious prompt injection or disallowed patterns
- Output safeguards to block sensitive or restricted responses
- Optional ML-based injection detection
- Context controls so only selected retrieved chunks are passed to generation
Prompt injection is especially important in RAG because the model is reading untrusted document content and user instructions at the same time. Even in a local system, that risk still exists.
Input validation and context filtering are therefore just as important as model quality.
The ML and safety layer in this project looks like this:
| Step | What it does | Service / Module | Endpoint / Call Type |
|---|---|---|---|
| User Query | Incoming question | Gateway β RAG |
POST /chat (Gateway) β POST /ask (RAG) |
| Input Safeguards | Validate input | Safeguard module inside RAG (backend/services/safeguard) |
In-process validate_input(...)
|
| Query Analysis | Injection / intent analysis (optional) | ML Service | POST /analyze |
| Query Rewriter | Improve query text (optional) | Query Rewriter (backend/shared/query_rewriter) |
In-process rewrite(...)
|
| Embedding Service | Generate query vector | Embedding Service | POST /embed |
| Retrieval Service | Search context chunks | Retrieval Service | POST /search |
| Reranker | Reorder results (optional) | Reranker (backend/shared/reranker, BGE) |
In-process rerank(...)
|
| ML Retrieval Scoring | Score retrieved context (optional) | ML Service | POST /score |
| Prompt Builder | Build final prompt with context | Prompt builder (shared helpers in RAG) | In-process prompt assembly |
| Model Runtime | LLM inference | LLM backend (e.g. llama.cpp) | RAG β LLM HTTP call (LLM_URL) |
| Output Safeguards | Validate model response | Safeguard module inside RAG | In-process validate_output(...)
|
| Final Response | Return answer + sources | RAG β Gateway β Client | HTTP response |
15. Performance Optimization β‘
Performance in local AI is about balancing model quality, retrieval quality, and resource usage.
Some useful optimisation levers visible in this repo are:
- Caching opportunities for repeated queries or embeddings
- Embedding reuse for already-ingested chunks
- Vector DB tuning such as
TOP_Kand retrieval candidate size - Choosing smaller or larger local models depending on hardware
- Disabling optional steps like query rewriting or ML analysis when latency matters more than quality
The nice thing about a modular design is that you can tune each part independently instead of treating the whole system as one black box.
16. Advantages And Pros of a Local AI Stack β
If you want the short version, these are the main pros of this architecture:
- Full control over document data
- More predictable operational cost
- Freedom to customise models and providers
- Flexible service boundaries
- Better fit for internal tools and private deployments
- A path to support both web users and AI agents through the same backend
The last point is worth highlighting. This repo not only exposes a frontend. It also includes an MCP server, which means AI agents can search documents, ask grounded questions, and ingest text using the same backend services.
That matters because it turns the project from a simple web app into a more reusable AI platform. The same monorepo supports browser users, backend APIs, and agent tooling without duplicating business logic.
17. Limitations And Cons and Tradeoffs βοΈ
A local-first approach is powerful, but it is not magic.
If you want the short version, these are the main cons:
Hardware requirements
Running local models well still depends on available CPU, RAM, and ideally GPU support. The better the model, the more demanding the setup tends to be.
Model quality vs cloud providers
Strong local models can be impressive, but top hosted models may still outperform them on reasoning, instruction following, or multilingual tasks depending on the setup.
Throughput and concurrent requests
Another important limitation is serving capacity. A local model runtime such as llama.cpp can work very well for development, demos, and low-volume internal tools, but multiple simultaneous requests can quickly become a bottleneck.
If several users send questions at the same time, you may see:
- queued requests
- slower response times
- higher CPU or GPU contention
- reduced throughput compared to managed cloud inference
That does not make local inference a bad choice, but it does mean you should think about expected traffic. A local-first stack is often strongest for single-user workflows, small teams, internal tools, or controlled environments rather than high-concurrency public applications.
Maintenance overhead
When you own the stack, you also own more operational work:
- Managing model files
- Tuning chunking and retrieval
- Running vector infrastructure
- Maintaining service compatibility
- Handling upgrades and troubleshooting
That is the tradeoff for greater control.
18. Future Improvements πΊοΈ
This project already implements more than a minimal RAG demo, but there is still room to grow.
Some especially valuable next steps are:
- Hybrid search that combines vector and keyword retrieval
- More advanced reranking strategies
- ML-based query rewriting improvements
- Multi-model orchestration or query routing
- Observability for latency, retrieval quality, and answer quality
- Multi-tenant document collections
Because the project already uses stable service boundaries and config-driven backends, those improvements can be added incrementally instead of requiring a full redesign.
19. Refactoring Path: LangChain, LlamaIndex, or Bedrock π£οΈ
One advantage of this architecture is that it does not lock the project into one implementation style forever. Because the system is already separated into services with stable contracts, it can be refactored gradually to use higher-level frameworks or managed cloud providers.
Refactoring toward LangChain
If I wanted to adopt LangChain, I would not rewrite the whole repo at once. The cleaner approach would be to replace internal orchestration inside the RAG service first:
- Use LangChain for prompt templates, retrievers, and chain composition
- Keep the existing Gateway, frontend, and ingestion APIs unchanged
- Wrap the current Embedding, Retrieval, and LLM integrations behind LangChain-compatible adapters
That would let the repo keep its current service boundaries while using LangChain as an orchestration layer instead of making LangChain the whole architecture.
Refactoring toward LlamaIndex
LlamaIndex would make the most sense if I wanted a more framework-driven retrieval pipeline with built-in indexing, query engines, and document abstractions.
A practical path would be:
- Move retrieval orchestration logic from the custom RAG service into LlamaIndex components
- Keep Qdrant as the vector backend if desired
- Reuse the existing ingestion and document-loading flow where it still fits
- Preserve the external API contracts so the frontend and MCP tools do not need to change
In other words, LlamaIndex could replace part of the internal RAG implementation without forcing a full product rewrite.
Refactoring toward AWS Bedrock
Bedrock is a different kind of change because it is a provider shift rather than only a framework shift.
This repo is already designed for that direction:
- Embedding backends are configurable
- LLM backends are configurable
- The docs and code structure already anticipate Bedrock-style backend implementations
That means a Bedrock migration could be done by implementing or completing:
- a Bedrock embedding backend in the Embedding service
- a Bedrock LLM backend in the RAG service
- optional AWS-specific configuration in service settings
The important part is that the public APIs would stay the same. The frontend, Gateway, ingestion flow, and MCP integration would not need to know whether the actual model provider is local llama.cpp, OpenAI, or Bedrock.
Why this matters
This is exactly why I prefer a modular monorepo for projects like this. The current stack is local-first, but the architecture leaves room for a future where:
- local inference is used in development
- Bedrock or another managed provider is used in production
- LangChain or LlamaIndex is introduced only where it adds value
That flexibility makes the project more realistic from a software engineering perspective.
20. Conclusion π§΅
This repository demonstrates that you can build a serious AI application without making OpenAI the centre of the architecture.
The system combines local embeddings, vector search, document ingestion, retrieval orchestration, safeguards, reranking, query rewriting, optional ML analysis, MCP-based agent access, and local LLM inference into one coherent stack. It also shows an important design principle: local-first does not have to mean rigid. By keeping the interfaces stable inside a multi-service monorepo, the system stays flexible enough to support alternative backends later.
Local AI makes the most sense when you care about privacy, predictable cost, internal deployment, and architectural control. Cloud AI may still be the better fit when you need the strongest hosted models immediately, want minimal infrastructure work, or do not mind sending data to external providers.
For me, that is the biggest takeaway from building this project: the interesting part is not just calling a model. It is designing the full pipeline around retrieval quality, data ownership, extensibility, and real product constraints.
21. Demoing This Repo π¬
If you want to use this article together with a live demo, the shortest path is:
make up
make llm
make frontend
When everything is up and running locally, it looks like this:
Then:
- Open
http://localhost:5173 - Upload a PDF
- Ask a question that requires document retrieval
- Show that the answer includes sources







Top comments (0)