navid mirnouri

Building a Persistent Knowledge Base RAG System with FastAPI, llama.cpp, Chroma, and Open WebUI

Have you ever wanted to chat with your own PDF collection – textbooks, research papers, internal documentation – using a local LLM, while keeping your data completely private?

This is exactly what I built. In this article, I’ll walk you through a complete, ready‑to‑run setup that:

  • Ingests a folder of PDFs into a vector database (Chroma)
  • Serves an OpenAI‑compatible RAG API using FastAPI
  • Uses llama.cpp as the local LLM backend (any GGUF model works)
  • Connects seamlessly to Open WebUI for a beautiful chat interface
  • Provides persistent memory (the vector store survives restarts)

All code is available at the end of this article – ready to copy, paste, and run.


🧠 Why this system?

  • Privacy first – everything runs on your machine.
  • Long‑term knowledge – uploaded PDFs stay in the vector store; you can chat with them any time.
  • Cross‑chat memory – retrieval runs against the vector store on every question, so your documents stay available across separate chats without relying on chat history.
  • Modular – swap Chroma for Qdrant, replace llama.cpp with Ollama, or add hybrid search.

📦 Prerequisites

  • Python 3.11+ (I used 3.12, but 3.11 is safer)
  • Docker (for Open WebUI)
  • A GGUF model (e.g., Llama 3 8B, Mistral) and the llama.cpp server
  • Basic terminal knowledge

🔧 Step 1 – Project setup

Create a directory and a virtual environment:

mkdir my_knowledge_base && cd my_knowledge_base
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

Create requirements.txt:

fastapi
uvicorn[standard]
chromadb
langchain
langchain-community
langchain-text-splitters
pypdf
sentence-transformers
openai
python-multipart

Install everything:

pip install -r requirements.txt

Create two folders:

mkdir knowledge_pdfs vector_store

Place your PDF files inside knowledge_pdfs/.
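
Before moving on, it’s worth a quick sanity check that pypdf can actually open each file – corrupt or scanned‑image PDFs are a common cause of empty ingestions later. A minimal sketch (a throwaway script; the filename is just a suggestion):

import glob
from pypdf import PdfReader

# Verify every PDF in the knowledge folder is readable before ingestion
for path in sorted(glob.glob("knowledge_pdfs/*.pdf")):
    try:
        reader = PdfReader(path)
        print(f"{path}: {len(reader.pages)} pages")
    except Exception as e:
        print(f"{path}: failed to read ({e})")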

⚙️ Step 2 – The FastAPI application (app.py)
Copy the entire code below into app.py. It handles:

  • Background ingestion of PDFs (non‑blocking)
  • An OpenAI‑compatible /v1/chat/completions endpoint
  • A /v1/models endpoint for Open WebUI
  • Health and status endpoints

import os
import glob
import threading
import time
from pathlib import Path
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional, Dict
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from openai import OpenAI

# ------------------------------------------------------------
# Configuration
# ------------------------------------------------------------
PDF_DIR = "./knowledge_pdfs"
VECTOR_DB_DIR = "./vector_store"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
LLAMA_CPP_HOST = "http://localhost:10002"   # your llama.cpp server address
LLAMA_CPP_MODEL = "my-gguf-model"           # can be any name

# ------------------------------------------------------------
# Global objects & status
# ------------------------------------------------------------
app = FastAPI(title="Knowledge Base RAG API", version="2.0")
vector_store = None
embeddings = None

ingestion_status = {
    "running": False,
    "done": False,
    "error": None,
    "total_chunks": 0,
    "files_processed": 0
}
ingestion_lock = threading.Lock()

# ------------------------------------------------------------
# Pydantic models (OpenAI compatible)
# ------------------------------------------------------------
class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 2048
    stream: Optional[bool] = False

# ------------------------------------------------------------
# Background ingestion worker (non‑blocking)
# ------------------------------------------------------------

def _ingest_pdfs_worker(vector_store, pdf_dir, chunk_size, chunk_overlap):
    global ingestion_status
    try:
        # vector_store.delete_collection()
        pdf_files = glob.glob(os.path.join(pdf_dir, "*.pdf"))
        if not pdf_files:
            with ingestion_lock:
                ingestion_status["error"] = "No PDF files found"
            return

        total_chunks = 0
        for idx, pdf_path in enumerate(pdf_files, 1):
            # Load
            loader = PyPDFLoader(pdf_path)
            docs = loader.load()
            # Add metadata
            for doc in docs:
                doc.metadata["source"] = os.path.basename(pdf_path)

            # Split
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                separators=["\n\n", "\n", " ", ""]
            )
            chunks = splitter.split_documents(docs)

            # Batch insert into vector store
            batch_size = 500
            for i in range(0, len(chunks), batch_size):
                vector_store.add_documents(chunks[i:i+batch_size])

            total_chunks += len(chunks)
            with ingestion_lock:
                ingestion_status["files_processed"] = idx
                ingestion_status["total_chunks"] = total_chunks

        # Persist once at the end (on chromadb >= 0.4 this is a no-op; persistence is automatic)
        vector_store.persist()
        with ingestion_lock:
            ingestion_status["done"] = True
            ingestion_status["running"] = False

    except Exception as e:
        with ingestion_lock:
            ingestion_status["error"] = str(e)
            ingestion_status["running"] = False

# ------------------------------------------------------------
# Startup event – initialise vector store (no auto‑ingestion)
# ------------------------------------------------------------
@app.on_event("startup")
def startup():
    global vector_store, embeddings, ingestion_status
    Path(PDF_DIR).mkdir(parents=True, exist_ok=True)
    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
    vector_store = Chroma(
        persist_directory=VECTOR_DB_DIR,
        embedding_function=embeddings,
        collection_name="pdf_knowledge"
    )
    # Check if already populated
    if vector_store._collection.count() > 0:
        ingestion_status = {
            "running": False,
            "done": True,
            "error": None,
            "total_chunks": vector_store._collection.count(),
            "files_processed": 0
        }
        print(f"✅ Vector store already contains {vector_store._collection.count()} chunks.")
    else:
        ingestion_status = {
            "running": False,
            "done": False,
            "error": None,
            "total_chunks": 0,
            "files_processed": 0
        }
        print("⚠️ Vector store is empty. Use POST /reload to ingest PDFs.")

# ------------------------------------------------------------
# Endpoints
# ------------------------------------------------------------
@app.get("/health")
def health():
    return {
        "status": "ok",
        "vector_store_ready": vector_store is not None,
        "ingestion_done": ingestion_status["done"]
    }

@app.get("/v1/models")
def list_models():
    return {
        "object": "list",
        "data": [
            {
                "id": LLAMA_CPP_MODEL,
                "object": "model",
                "created": int(time.time()),
                "owned_by": "local"
            }
        ]
    }

@app.post("/reload")
async def reload_knowledge():
    """Start background ingestion (non‑blocking)."""
    global ingestion_status, vector_store
    if ingestion_status.get("running", False):
        raise HTTPException(status_code=409, detail="Ingestion already in progress")

    # Reset status
    ingestion_status = {
        "running": True,
        "done": False,
        "error": None,
        "total_chunks": 0,
        "files_processed": 0
    }

    # Optionally clear existing collection to avoid duplicates
    try:
        vector_store.delete_collection()
        vector_store = Chroma(
            persist_directory=VECTOR_DB_DIR,
            embedding_function=embeddings,
            collection_name="pdf_knowledge"
        )
    except Exception:
        pass  # collection might not exist yet

    thread = threading.Thread(
        target=_ingest_pdfs_worker,
        args=(vector_store, PDF_DIR, CHUNK_SIZE, CHUNK_OVERLAP),
        daemon=True
    )
    thread.start()
    return {"message": "Ingestion started"}

@app.get("/ingestion-status")
def get_ingestion_status():
    with ingestion_lock:
        return ingestion_status.copy()

@app.post("/v1/chat/completions")
async def chat_completion(req: ChatCompletionRequest):
    # 1. Wait if ingestion is still running
    if ingestion_status.get("running", False) and not ingestion_status.get("done", False):
        return {
            "id": "loading",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": req.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": "Knowledge base is still loading. Please try again shortly."},
                "finish_reason": "stop"
            }],
            "usage": {}
        }

    # 2. Extract last user message
    user_msg = None
    for m in reversed(req.messages):
        if m.role == "user":
            user_msg = m.content
            break
    if not user_msg:
        raise HTTPException(status_code=400, detail="No user message found")

    # 3. Retrieve relevant chunks
    try:
        retriever = vector_store.as_retriever(search_kwargs={"k": 4})
        docs = retriever.invoke(user_msg)
    except Exception as e:
        print(f"Retrieval error: {e}")
        docs = []

    if not docs:
        context = "No relevant documents found in the knowledge base."
    else:
        context_parts = []
        for i, doc in enumerate(docs):
            source = doc.metadata.get("source", "unknown")
            text = doc.page_content
            context_parts.append(f"[Document {i+1} from {source}]\n{text}")
        context = "\n\n---\n\n".join(context_parts)

    # 4. Build the improved system prompt (strict RAG assistant)
    system_prompt = (
        "You are a knowledgeable assistant that answers questions strictly based on the provided context. "
        "Follow these rules:\n"
        "1. If the context contains the relevant information, answer clearly and concisely using only that information.\n"
        "2. If the context is insufficient or does not answer the question, say: 'The knowledge base does not contain enough information to answer this question.' – Do not invent an answer.\n"
        "3. When applicable, reference the source document(s) by the filename shown in the context (e.g., 'According to [filename]...').\n"
        "4. Keep answers focused and avoid adding external knowledge not found in the context.\n"
        "5. If the user asks to elaborate or explain step‑by‑step, provide a detailed answer as long as the context supports it."
    )

    user_prompt = f"Context:\n{context}\n\nQuestion: {user_msg}\nAnswer:"

    # 5. Call llama.cpp server
    try:
        llm_client = OpenAI(
            base_url=f"{LLAMA_CPP_HOST}/v1",
            api_key="not-needed",
            timeout=60.0
        )
        response = llm_client.chat.completions.create(
            model=LLAMA_CPP_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=req.temperature or 0.7,
            max_tokens=req.max_tokens or 2048
        )
        answer = response.choices[0].message.content
    except Exception as e:
        print(f"LLM call failed: {e}")
        answer = "Sorry, I encountered an error while generating the answer. Please check that your llama.cpp server is running."

    # 6. Return OpenAI‑compatible response
    return {
        "id": "chatcmpl-rag",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": answer},
                "finish_reason": "stop"
            }
        ],
        "usage": {
            "prompt_tokens": len(user_msg.split()),
            "completion_tokens": len(answer.split()),
            "total_tokens": len(user_msg.split()) + len(answer.split())
        }
    }

# ------------------------------------------------------------
# Run with: uvicorn app:app --reload --host 0.0.0.0 --port 8000
# ------------------------------------------------------------
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

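A quick note on the chunking settings: CHUNK_SIZE=500 and CHUNK_OVERLAP=50 mean each chunk holds at most roughly 500 characters, with 50 characters shared between neighbouring chunks so meaning isn’t lost at chunk boundaries. If you want to see what the splitter actually produces before ingesting real PDFs, here is a standalone sketch using the same settings as app.py:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
)

# Any long string works here; the real ingestion uses split_documents() on PDF pages
sample = "Microprocessors execute instructions fetched from memory. " * 40
chunks = splitter.split_text(sample)
print(f"{len(chunks)} chunks; first chunk starts with:\n{chunks[0][:120]}...")

Smaller chunks give more precise retrieval but less context per hit; larger chunks do the opposite.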

🚀 Step 3 – Run the llama.cpp server
Download a GGUF model (e.g., from Hugging Face) and start the server:


./llama-server -m models/your-model.gguf \
  --host 0.0.0.0 --port 10002 \
  --ctx-size 8192 \
  --n-predict 2048 \
  --rope-scaling linear

Important – --n-predict 2048 allows long answers. With a lower limit, responses can get cut off mid‑sentence, so raise it if answers look truncated.
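
Before wiring it into the RAG API, confirm the llama.cpp server answers on its own. A minimal sketch using the same openai client that app.py relies on (assumes the host/port above; llama.cpp’s OpenAI‑compatible endpoint generally accepts any model name):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="my-gguf-model",  # llama.cpp serves whatever model it loaded, regardless of this name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)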

🧪 Step 4 – Start your FastAPI knowledge base

uvicorn app:app --reload --host 0.0.0.0 --port 8000


Then ingest your PDFs (this runs in the background):

curl -X POST http://localhost:8000/reload


Monitor progress:

curl http://localhost:8000/ingestion-status


When "done": true, you’re ready.

Test the chat endpoint:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-gguf-model",
    "messages": [{"role": "user", "content": "What does your documentation say about microprocessors?"}]
  }'

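Because the endpoint is OpenAI‑compatible, the same test works from Python – just point the openai client at the FastAPI server instead of llama.cpp:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="my-gguf-model",
    messages=[{"role": "user", "content": "What does your documentation say about microprocessors?"}],
)
print(resp.choices[0].message.content)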

🌐 Step 5 – Connect Open WebUI
Run Open WebUI (Docker):

docker run -d -p 3000:8080 \
  -v openwebui-data:/app/backend/data \
  --name openwebui \
  ghcr.io/open-webui/open-webui:main

Then:

  1. Open http://localhost:3000 and create an admin account.

  2. Go to Admin Settings → Connections → OpenAI.

  3. Click Add Connection.

  4. URL: http://host.docker.internal:8000/v1 (if Open WebUI runs in Docker and your FastAPI runs on the host – on Linux, also start the container with --add-host=host.docker.internal:host-gateway)
    or http://localhost:8000/v1 (if both run natively).

  5. API Key: leave blank (or type dummy).

  6. Save.

Your model (my-gguf-model) will appear in the model selector. Start chatting with your PDFs!

📝 The system prompt explained
The improved prompt (inside /v1/chat/completions) forces the LLM to:

✅ Use only the retrieved context

✅ Refuse to answer if context is missing (no hallucination)

✅ Cite source filenames when possible

✅ Stay focused and avoid external knowledge

This is the secret to reliable, grounded answers.

🧹 Tips & troubleshooting

  • Answers get cut off → Increase max_tokens in the endpoint (default is 2048) and ensure llama.cpp uses --n-predict 2048 or higher.

  • Retrieval returns nothing → Check /ingestion-status. If total_chunks is 0, run POST /reload again and watch the logs. Make sure your PDFs are in knowledge_pdfs/. (See the inspection sketch after this list.)

  • Open WebUI doesn’t see the model → Manually add the model in Workspace → Models with the same ID (my-gguf-model). Also verify that /v1/models returns the model.

  • Duplicate chunks on re‑ingest → The /reload endpoint now deletes the old collection before ingesting, so duplicates should not happen.

  • ModuleNotFoundError: No module named 'langchain.text_splitter' → Change the import to from langchain_text_splitters import RecursiveCharacterTextSplitter and install langchain-text-splitters.

  • Chroma collection does not exist → Delete the ./vector_store folder and run POST /reload again. The collection will be created on first add.

  • LLM call fails → Ensure your llama.cpp server is running on http://localhost:10002 and that the model name matches. Test with curl http://localhost:10002/v1/models.
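
For the “retrieval returns nothing” case, the quickest diagnosis is to open the persisted store directly and run a test query. A sketch reusing the exact settings from app.py:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(
    persist_directory="./vector_store",
    embedding_function=embeddings,
    collection_name="pdf_knowledge",
)
print("chunks stored:", store._collection.count())  # same private accessor app.py uses
for doc in store.similarity_search("microprocessor", k=2):
    print(doc.metadata.get("source"), "->", doc.page_content[:80])

If the count is 0, ingestion never wrote anything; if it is nonzero but the results look irrelevant, try a different query phrasing or a larger k.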

🎯 Final thoughts
You now have a fully local, persistent knowledge base that you can query from a beautiful chat interface. All data stays on your machine, and you can extend it with more PDFs anytime (just run POST /reload again).

The complete code (the app.py above) is ready to be copied and used as a starting point for your own projects. Swap the embedding model, try a different vector store, or add hybrid search – the possibilities are endless.

Happy building! 🚀

Found this helpful? Leave a like or comment below – I’d love to hear how you’re using local RAG.
