Have you ever wanted to chat with your own PDF collection – textbooks, research papers, internal documentation – using a local LLM, while keeping your data completely private?
This is exactly what I built. In this article, I’ll walk you through a complete, production‑ready setup that:
- Ingests a folder of PDFs into a vector database (Chroma)
- Serves an OpenAI‑compatible RAG API using FastAPI
- Uses llama.cpp as the local LLM backend (any GGUF model works)
- Connects seamlessly to Open WebUI for a beautiful chat interface
- Provides persistent memory (the vector store survives restarts)
All code is available at the end of this article – ready to copy, paste, and run.
🧠 Why this system?
- Privacy first – everything runs on your machine.
- Long‑term knowledge – uploaded PDFs stay in the vector store; you can chat with them any time.
- Cross‑chat memory – every question goes through the RAG pipeline, so the knowledge base is available in every conversation, not just the chat where you first uploaded the files.
- Modular – swap Chroma for Qdrant, replace llama.cpp with Ollama, or add hybrid search.
📦 Prerequisites
- Python 3.11+ (I used 3.12, but 3.11 is safer)
- Docker (for Open WebUI)
- A GGUF model (e.g., Llama 3 8B, Mistral) and the llama.cpp server
- Basic terminal knowledge
🔧 Step 1 – Project setup
Create a directory and a virtual environment:
mkdir my_knowledge_base && cd my_knowledge_base
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Create requirements.txt:
fastapi
uvicorn[standard]
chromadb
langchain
langchain-community
langchain-text-splitters
pypdf
sentence-transformers
openai
python-multipart
Install everything:
pip install -r requirements.txt
Create two folders:
mkdir knowledge_pdfs vector_store
Place your PDF files inside knowledge_pdfs/.
⚙️ Step 2 – The FastAPI application (app.py)
Copy the entire code below into app.py.
It handles:
- Background ingestion of PDFs (non‑blocking)
- An OpenAI‑compatible /v1/chat/completions endpoint
- A /v1/models endpoint for Open WebUI
- Health and status endpoints
import os
import glob
import threading
import time
from pathlib import Path
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional, Dict
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from openai import OpenAI
# ------------------------------------------------------------
# Configuration
# ------------------------------------------------------------
PDF_DIR = "./knowledge_pdfs"
VECTOR_DB_DIR = "./vector_store"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
LLAMA_CPP_HOST = "http://localhost:10002" # your llama.cpp server address
LLAMA_CPP_MODEL = "my-gguf-model" # can be any name
# ------------------------------------------------------------
# Global objects & status
# ------------------------------------------------------------
app = FastAPI(title="Knowledge Base RAG API", version="2.0")
vector_store = None
embeddings = None
ingestion_status = {
"running": False,
"done": False,
"error": None,
"total_chunks": 0,
"files_processed": 0
}
ingestion_lock = threading.Lock()
# ------------------------------------------------------------
# Pydantic models (OpenAI compatible)
# ------------------------------------------------------------
class Message(BaseModel):
role: str
content: str
class ChatCompletionRequest(BaseModel):
model: str
messages: List[Message]
temperature: Optional[float] = 0.7
max_tokens: Optional[int] = 2048
stream: Optional[bool] = False
# ------------------------------------------------------------
# Background ingestion worker (non‑blocking)
# ------------------------------------------------------------
def _ingest_pdfs_worker(vector_store, pdf_dir, chunk_size, chunk_overlap):
global ingestion_status
try:
# vector_store.delete_collection()
pdf_files = glob.glob(os.path.join(pdf_dir, "*.pdf"))
if not pdf_files:
with ingestion_lock:
ingestion_status["error"] = "No PDF files found"
return
total_chunks = 0
for idx, pdf_path in enumerate(pdf_files, 1):
# Load
loader = PyPDFLoader(pdf_path)
docs = loader.load()
# Add metadata
for doc in docs:
doc.metadata["source"] = os.path.basename(pdf_path)
# Split
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)
# Batch insert into vector store
batch_size = 500
for i in range(0, len(chunks), batch_size):
vector_store.add_documents(chunks[i:i+batch_size])
total_chunks += len(chunks)
with ingestion_lock:
ingestion_status["files_processed"] = idx
ingestion_status["total_chunks"] = total_chunks
# Persist once at the end
vector_store.persist()
with ingestion_lock:
ingestion_status["done"] = True
ingestion_status["running"] = False
except Exception as e:
with ingestion_lock:
ingestion_status["error"] = str(e)
ingestion_status["running"] = False
# ------------------------------------------------------------
# Startup event – initialise vector store (no auto‑ingestion)
# ------------------------------------------------------------
@app.on_event("startup")
def startup():
global vector_store, embeddings, ingestion_status
Path(PDF_DIR).mkdir(parents=True, exist_ok=True)
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
vector_store = Chroma(
persist_directory=VECTOR_DB_DIR,
embedding_function=embeddings,
collection_name="pdf_knowledge"
)
# Check if already populated
if vector_store._collection.count() > 0:
ingestion_status = {
"running": False,
"done": True,
"error": None,
"total_chunks": vector_store._collection.count(),
"files_processed": 0
}
print(f"✅ Vector store already contains {vector_store._collection.count()} chunks.")
else:
ingestion_status = {
"running": False,
"done": False,
"error": None,
"total_chunks": 0,
"files_processed": 0
}
print("⚠️ Vector store is empty. Use POST /reload to ingest PDFs.")
# ------------------------------------------------------------
# Endpoints
# ------------------------------------------------------------
@app.get("/health")
def health():
return {
"status": "ok",
"vector_store_ready": vector_store is not None,
"ingestion_done": ingestion_status["done"]
}
@app.get("/v1/models")
def list_models():
return {
"object": "list",
"data": [
{
"id": LLAMA_CPP_MODEL,
"object": "model",
"created": int(time.time()),
"owned_by": "local"
}
]
}
@app.post("/reload")
async def reload_knowledge():
"""Start background ingestion (non‑blocking)."""
global ingestion_status, vector_store
if ingestion_status.get("running", False):
raise HTTPException(status_code=409, detail="Ingestion already in progress")
# Reset status
ingestion_status = {
"running": True,
"done": False,
"error": None,
"total_chunks": 0,
"files_processed": 0
}
# Optionally clear existing collection to avoid duplicates
try:
vector_store.delete_collection()
vector_store = Chroma(
persist_directory=VECTOR_DB_DIR,
embedding_function=embeddings,
collection_name="pdf_knowledge"
)
except Exception:
pass # collection might not exist yet
thread = threading.Thread(
target=_ingest_pdfs_worker,
args=(vector_store, PDF_DIR, CHUNK_SIZE, CHUNK_OVERLAP),
daemon=True
)
thread.start()
return {"message": "Ingestion started"}
@app.get("/ingestion-status")
def get_ingestion_status():
with ingestion_lock:
return ingestion_status.copy()
@app.post("/v1/chat/completions")
async def chat_completion(req: ChatCompletionRequest):
# 1. Wait if ingestion is still running
if ingestion_status.get("running", False) and not ingestion_status.get("done", False):
return {
"id": "loading",
"object": "chat.completion",
"created": int(time.time()),
"model": req.model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Knowledge base is still loading. Please try again shortly."},
"finish_reason": "stop"
}],
"usage": {}
}
# 2. Extract last user message
user_msg = None
for m in reversed(req.messages):
if m.role == "user":
user_msg = m.content
break
if not user_msg:
raise HTTPException(status_code=400, detail="No user message found")
# 3. Retrieve relevant chunks
try:
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke(user_msg)
except Exception as e:
print(f"Retrieval error: {e}")
docs = []
if not docs:
context = "No relevant documents found in the knowledge base."
else:
context_parts = []
for i, doc in enumerate(docs):
source = doc.metadata.get("source", "unknown")
text = doc.page_content
context_parts.append(f"[Document {i+1} from {source}]\n{text}")
context = "\n\n---\n\n".join(context_parts)
# 4. Build the improved system prompt (strict RAG assistant)
system_prompt = (
"You are a knowledgeable assistant that answers questions strictly based on the provided context. "
"Follow these rules:\n"
"1. If the context contains the relevant information, answer clearly and concisely using only that information.\n"
"2. If the context is insufficient or does not answer the question, say: 'The knowledge base does not contain enough information to answer this question.' – Do not invent an answer.\n"
"3. When applicable, reference the source document(s) by the filename shown in the context (e.g., 'According to [filename]...').\n"
"4. Keep answers focused and avoid adding external knowledge not found in the context.\n"
"5. If the user asks to elaborate or explain step‑by‑step, provide a detailed answer as long as the context supports it."
)
user_prompt = f"Context:\n{context}\n\nQuestion: {user_msg}\nAnswer:"
# 5. Call llama.cpp server
try:
llm_client = OpenAI(
base_url=f"{LLAMA_CPP_HOST}/v1",
api_key="not-needed",
timeout=60.0
)
response = llm_client.chat.completions.create(
model=LLAMA_CPP_MODEL,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=req.temperature if req.temperature is not None else 0.7,
max_tokens=req.max_tokens if req.max_tokens is not None else 2048
)
answer = response.choices[0].message.content
except Exception as e:
print(f"LLM call failed: {e}")
answer = "Sorry, I encountered an error while generating the answer. Please check that your llama.cpp server is running."
# 6. Return OpenAI‑compatible response
return {
"id": "chatcmpl-rag",
"object": "chat.completion",
"created": int(time.time()),
"model": req.model,
"choices": [
{
"index": 0,
"message": {"role": "assistant", "content": answer},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": len(user_msg.split()),
"completion_tokens": len(answer.split()),
"total_tokens": len(user_msg.split()) + len(answer.split())
}
}
# ------------------------------------------------------------
# Run with: uvicorn app:app --reload --host 0.0.0.0 --port 8000
# ------------------------------------------------------------
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
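A quick note on the chunking settings before moving on: CHUNK_SIZE and CHUNK_OVERLAP control how much text each retrieved chunk carries. Smaller chunks retrieve more precisely but give the LLM less context per hit; larger chunks do the opposite. If you want to preview the effect before a full ingestion, here is a minimal standalone sketch using the same libraries from requirements.txt (example.pdf is just a placeholder for any file in knowledge_pdfs/):
# chunk_preview.py – preview how one PDF gets split before running a full ingestion
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyPDFLoader("./knowledge_pdfs/example.pdf").load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # same as CHUNK_SIZE in app.py
    chunk_overlap=50,     # same as CHUNK_OVERLAP in app.py
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)
print(f"{len(docs)} pages -> {len(chunks)} chunks")
print(chunks[0].page_content[:300])   # peek at the first chunk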
🚀 Step 3 – Run the llama.cpp server
Download a GGUF model (e.g., from Hugging Face) and start the server:
./llama-server -m models/your-model.gguf \
--host 0.0.0.0 --port 10002 \
--ctx-size 8192 \
--n-predict 2048 \
--rope-scaling linear
Important – --n-predict 2048 allows long answers. The default is 512, which will cut off responses.
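Before wiring everything together, it helps to confirm the llama.cpp server answers on its own. Since it exposes an OpenAI‑compatible API, you can use the openai package (already in requirements.txt) for a quick sanity check. A minimal sketch, assuming the server runs on port 10002 as configured above; the model name is only a label, llama.cpp serves whichever GGUF you loaded:
# llm_check.py – quick sanity check against the llama.cpp server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="my-gguf-model",   # any string works here; the loaded GGUF is used regardless
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=50
)
print(resp.choices[0].message.content)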
🧪 Step 4 – Start your FastAPI knowledge base
uvicorn app:app --reload --host 0.0.0.0 --port 8000
Then ingest your PDFs (this runs in the background):
curl -X POST http://localhost:8000/reload
Monitor progress:
curl http://localhost:8000/ingestion-status
When "done": true, you’re ready.
Test the chat endpoint:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-gguf-model",
"messages": [{"role": "user", "content": "What does your documentation say about microprocessors?"}]
}'
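You can send the same request from Python, using the openai client against the RAG API instead of curl. This is handy if you want to script queries or write tests (the question is only an example):
# ask.py – query the RAG API programmatically
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="my-gguf-model",
    messages=[{"role": "user", "content": "What does your documentation say about microprocessors?"}]
)
print(resp.choices[0].message.content)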
🌐 Step 5 – Connect Open WebUI
Run Open WebUI (Docker):
docker run -d -p 3000:8080 \
-v openwebui-data:/app/backend/data \
--name openwebui \
ghcr.io/open-webui/open-webui:main
Then:
1. Open http://localhost:3000 and create an admin account.
2. Go to Admin Settings → Connections → OpenAI.
3. Click Add Connection.
4. URL: http://host.docker.internal:8000/v1 (if Open WebUI runs in Docker and your FastAPI app runs on the host) or http://localhost:8000/v1 (if both run natively). On Linux, add --add-host=host.docker.internal:host-gateway to the docker run command so that hostname resolves. API Key: leave blank (or type dummy).
5. Save.
Your model (my-gguf-model) will appear in the model selector. Start chatting with your PDFs!
📝 The system prompt explained
The improved prompt (inside /v1/chat/completions) forces the LLM to:
✅ Use only the retrieved context
✅ Refuse to answer if context is missing (no hallucination)
✅ Cite source filenames when possible
✅ Stay focused and avoid external knowledge
This is the secret to reliable, grounded answers.
🧹 Tips & troubleshooting
Answers get cut off → Increase max_tokens in the endpoint (default is 2048) and ensure llama.cpp uses --n-predict 2048 or higher.
Retrieval returns nothing → Check /ingestion-status. If total_chunks is 0, run POST /reload again and watch the logs. Make sure your PDFs are in knowledge_pdfs/.
Open WebUI doesn’t see the model → Manually add the model in Workspace → Models with the same ID (my-gguf-model). Also verify that /v1/models returns the model.
Duplicate chunks on re‑ingest → The /reload endpoint now deletes the old collection before ingesting, so duplicates should not happen.
ModuleNotFoundError: No module named 'langchain.text_splitter' → Change the import to from langchain_text_splitters import RecursiveCharacterTextSplitter and install langchain-text-splitters.
Chroma collection does not exist → Delete the ./vector_store folder and run POST /reload again. The collection will be created on first add.
LLM call fails → Ensure your llama.cpp server is running on http://localhost:10002 and that the model name matches. Test with curl http://localhost:10002/v1/models.
🎯 Final thoughts
You now have a fully local, persistent knowledge base that you can query from a beautiful chat interface. All data stays on your machine, and you can extend it with more PDFs anytime (just run POST /reload again).
The complete code (the app.py above) is ready to be copied and used as a starting point for your own projects. Swap the embedding model, try a different vector store, or add hybrid search – the possibilities are endless.
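To make the hybrid search idea concrete: LangChain's EnsembleRetriever can blend the Chroma retriever with a keyword‑based BM25Retriever. The sketch below assumes one extra dependency (pip install rank_bm25) and rebuilds plain documents from the persisted store so BM25 can index them; the weights are only a starting point to tune:
# hybrid_retriever.py – sketch of hybrid (vector + BM25 keyword) retrieval
# Extra dependency assumed: pip install rank_bm25
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(persist_directory="./vector_store", embedding_function=embeddings,
               collection_name="pdf_knowledge")

# Rebuild plain Documents from the persisted store so BM25 can index them
data = store.get(include=["documents", "metadatas"])
chunks = [Document(page_content=t, metadata=m or {})
          for t, m in zip(data["documents"], data["metadatas"])]

bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 4
hybrid = EnsembleRetriever(
    retrievers=[bm25, store.as_retriever(search_kwargs={"k": 4})],
    weights=[0.4, 0.6]   # starting point; tune for your corpus
)
for doc in hybrid.invoke("example question about your PDFs"):
    print(doc.metadata.get("source"), "->", doc.page_content[:100])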
Happy building! 🚀
Found this helpful? Leave a like or comment below – I’d love to hear how you’re using local RAG.