DEV Community: navid mirnouri

Fed 15 papers into Gemma 4. Got back a hypothesis none of them actually state — with a null hypothesis, experiment design, and a confidence score that drops when the model reviews itself.

navid mirnouri — Sat, 09 May 2026 15:32:33 +0000

Gemma 4 Challenge: Build With Gemma 4 Submission

I Built a Research Synthesis Engine That Reads 15 Papers and Generates Peer-Reviewed Hypotheses — Powered by Gemma 4

#devchallenge #gemmachallenge #gemma

navid mirnouri — Sat, 09 May 2026 14:25:06 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Every researcher knows the feeling: you have a stack of papers, a vague sense that something important is hiding between them, and no time to find it. Individual papers answer narrow questions. The breakthroughs live in the gaps between them.

I built LitSynth — a local, fully offline research synthesis engine that ingests up to 15 scientific PDFs, reasons across all of them simultaneously, and produces four structured outputs: cross-paper agreements, contradictions with mechanistic explanations, research gap analysis ranked by importance, and novel falsifiable hypotheses — each one put through a multi-round adversarial peer review loop before it reaches you.

This only exists because of Gemma 4's 128K context window and thinking mode. RAG pipelines approximate this. Gemma 4 actually does it.

What I Built

LitSynth is a seven-stage reasoning pipeline that treats a set of scientific papers as a single evidence corpus rather than a collection of independent documents.

The seven stages

1. Parallel PDF ingestion — Papers are parsed concurrently with pdfplumber, chunked into 8,000-character segments, and passed to the extraction stage.

2. Batched claim extraction (3 chunks per LLM call) — Each batch prompt asks Gemma 4 to extract up to 4 specific, falsifiable, numerically-grounded claims per section. Claims are namespaced by paper ID and chunk index to prevent collision. Running 6 workers in parallel reduces this to roughly a third of the wall-clock time of sequential extraction.

3. Agreement identification — A single long-context prompt packages all claims (within a token budget) and asks Gemma 4 to find convergent findings across papers — with specific claim IDs as evidence, not just paper names.

4. Contradiction detection (parallel clusters) — Claims are grouped by experimental method. Each cluster runs in its own thread. The contradiction prompt requires:

The exact claim text from each paper
A mechanistic explanation of why they conflict
A proposed reconciliation (different populations, measurement conditions, etc.)

5. Gap analysis — Research gaps are traced back to the specific claims and contradictions that reveal them, and ranked critical / high / medium / low by importance. The prompt explicitly asks: "what question is implied by this evidence that no paper answers?"

6. Hypothesis generation — This is the centrepiece. The generation prompt enforces mandatory rules at the prompt level:

Every hypothesis must reference ≥2 specific claim IDs from the corpus
Every hypothesis must name a gap_addressed (a gap ID from stage 5)
The mechanism field must name the specific signal, its origin layer/module, and the downstream effect it produces
A null hypothesis must be included for every hypothesis
The experiment design must specify the independent variable, control condition, measurements, and statistical test
Forbidden language: "necessary and sufficient", "proves", "objective metric", "always", "guaranteed"

7. Adversarial refinement loop — Every generated hypothesis enters a multi-round peer review cycle (up to 2 rounds by default):

All hypotheses are reviewed in parallel (each gets its own LLM call, no waiting)
The reviewer scores weakness count, assigns a confidence penalty, and flags fatal_flaw
If an improved hypothesis is provided, a quick re-review checks whether it has fewer weaknesses than the original before accepting the improvement
Confidence is recalibrated with a floor of 0.05: max(0.05, original_conf − (0.06 × weaknesses) − reviewer_penalty)
Hypotheses with fatal_flaw=True are moved to a discarded list, not silently dropped
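
One round of the loop above can be sketched roughly as follows. This is a minimal illustration, not the LitSynth source: Review, refine_round, review_fn and improve_fn are hypothetical names, and the two callables stand in for the LLM reviewer and rewriter calls. Confidence recalibration is left out to keep the accept/discard logic visible.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Review:
    """Hypothetical reviewer result: listed weaknesses plus a fatal-flaw flag."""
    weaknesses: list = field(default_factory=list)
    fatal_flaw: bool = False


def refine_round(hypotheses, review_fn, improve_fn):
    """Run one review round and return (accepted, discarded)."""
    # Every hypothesis gets its own reviewer call, all submitted at once.
    with ThreadPoolExecutor(max_workers=max(1, len(hypotheses))) as pool:
        reviews = list(pool.map(review_fn, hypotheses))  # map preserves input order

    accepted, discarded = [], []
    for hyp, review in zip(hypotheses, reviews):
        if review.fatal_flaw:
            # Keep the critique next to the hypothesis instead of dropping it.
            discarded.append((hyp, review))
            continue
        candidate = improve_fn(hyp, review)  # may return None (no rewrite offered)
        if candidate is not None:
            recheck = review_fn(candidate)
            # Accept the rewrite only if the quick re-review finds fewer weaknesses.
            if not recheck.fatal_flaw and len(recheck.weaknesses) < len(review.weaknesses):
                hyp = candidate
        accepted.append(hyp)
    return accepted, discarded
```

Calling refine_round once per round, up to the configured maximum, and feeding the accepted list back in gives the multi-round behaviour described above.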

The final output separates accepted hypotheses from discarded ones, shows revision history, and includes calibrated confidence scores.

Demo

Input

15 open-access papers on transformer attention mechanisms and long-context performance.

Sample accepted hypothesis (after 2 refinement rounds, revision 2)

HYPOTHESIS:
  In decoder-only LLMs with ≥7B parameters trained on sequences ≤8K tokens,
  injecting domain-specific embeddings into KV cache positions 0–32 will reduce
  hallucination rate on closed-domain QA by ≥15% compared to prompt-only
  injection, because early-layer cache slots function as high-priority retrieval
  anchors for attention heads in layers 8–16.

NULL HYPOTHESIS:
  KV cache position injection will show no statistically significant difference
  in hallucination rate compared to prompt-only injection (p > 0.05).

MECHANISM: [architectural]
  Domain embeddings written to KV positions 0–32 are preferentially attended
  to by layers 8–16 due to recency bias in rotary position encoding, causing
  those layers to anchor factual retrieval against the injected context before
  processing user tokens.

EXPERIMENT:
  IV: injection method (KV cache positions 0–32 vs. system prompt prefix)
  Control: same model, same domain corpus, same evaluation prompts
  Measurements: hallucination rate on TruthfulQA-domain subset, exact-match F1
  Statistical test: paired t-test, α = 0.05, n = 500 per condition

GROUNDED IN: paper_2_ck1_c3, paper_7_ck0_c1, paper_11_ck3_c2
FILLS GAP:   gap_3a8f2c (effect of cache position on retrieval priority)
CONFIDENCE:  0.61 (recalibrated from 0.80 after 2 review rounds)
REVISION:    2

Sample discarded hypothesis

One hypothesis was flagged fatal_flaw=True after round 1 because it claimed a mechanism was "necessary and sufficient" — the schema validator rejected the rewrite attempt as well (still contained absolute language), so it was cleanly discarded with the critique logged.

Pipeline summary output

Papers:            15
Claims extracted:  312
Agreements:        8
Contradictions:    14 (across 6 method clusters)
Research gaps:     9  (3 critical, 4 high, 2 medium)
Hypotheses:        2 accepted, 1 discarded
Refinement rounds: 2
Runtime:           ~18 minutes on a MacBook M2 Pro (local, offline)

How I Built It

Architecture

PDF files
    │
    ▼
Parallel PDF loader (pdfplumber, 4 workers)
    │
    ▼
Batched claim extractor (6 workers, 3 chunks/call, streaming=True, thinking=False)
    │
    ├─────────────────────────┐
    ▼                         ▼
Agreements              Contradictions
(single long-context)   (parallel method clusters, thinking=True)
    │                         │
    └──────────┬──────────────┘
               ▼
           Gap analysis
           (importance-ranked, causally linked)
               │
               ▼
       Hypothesis generation
       (grounded, falsifiable, schema-validated)
               │
               ▼
     Adversarial refinement loop
     ┌─────────────────────────┐
     │  Review all (parallel)  │
     │  ↓                      │
     │  Recalibrate confidence │
     │  ↓                      │
     │  Attempt improvement    │
     │  ↓                      │
     │  Re-review candidate    │
     │  ↓                      │
     │  Accept if better       │ ← up to MAX_REFINEMENT_ROUNDS
     └─────────────────────────┘
               │
               ▼
     LiteratureSynthesis output
     (JSON + Gradio UI)

Key technical decisions

Batched extraction instead of one call per chunk. Packing 3 chunks into one prompt with section headers ([paper_id=paper_2 chunk_id=1]) reduces LLM calls by ~3x with no quality loss. The prompt instructs the model to treat each section independently, so cross-contamination doesn't occur.
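
A rough sketch of the batching idea, assuming chunks arrive as (paper_id, chunk_id, text) tuples. The helper names and the instruction wording are illustrative assumptions, not the actual LitSynth prompt; only the section-header format is taken from the description above.

```python
from typing import Iterator, List, Tuple

Chunk = Tuple[str, int, str]  # (paper_id, chunk_id, text); assumed shape
BATCH_SIZE = 3                # chunks per LLM call, as described above

# Hypothetical instruction text; the real prompt is not shown in this post.
INSTRUCTIONS = (
    "Extract up to 4 specific, falsifiable claims from each section below. "
    "Treat every section independently and tag each claim with its paper_id and chunk_id."
)


def batch_chunks(chunks: List[Chunk], size: int = BATCH_SIZE) -> Iterator[List[Chunk]]:
    """Yield consecutive groups of at most `size` chunks."""
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]


def build_batch_prompt(batch: List[Chunk]) -> str:
    """Prefix each chunk with its section header and join them into one prompt."""
    sections = [f"[paper_id={pid} chunk_id={cid}]\n{text}" for pid, cid, text in batch]
    return INSTRUCTIONS + "\n\n" + "\n\n".join(sections)


# Eight chunks from two papers become three prompts instead of eight.
chunks = [(f"paper_{p}", c, f"text {p}.{c}") for p in (1, 2) for c in range(4)]
batches = list(batch_chunks(chunks))
```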

Thread-local LLM instances. ChatOllama is not thread-safe. Each worker thread constructs its own instance via threading.local(). Six extraction workers + two parallel synthesis steps all run without any shared state on the model object.
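
The per-thread pattern looks roughly like this. FakeLLM is a hypothetical stand-in so the sketch runs without Ollama; in the real pipeline the factory would construct the model client instead. get_llm and extract are illustrative names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # each thread sees its own attributes on this object


class FakeLLM:
    """Hypothetical stand-in for a per-thread model client."""

    def __init__(self):
        self.owner = threading.get_ident()  # remember which thread built it

    def invoke(self, prompt: str) -> str:
        return f"{self.owner}:{prompt}"


def get_llm(factory=FakeLLM):
    """Return this thread's client, creating it on first use."""
    llm = getattr(_local, "llm", None)
    if llm is None:
        llm = factory()
        _local.llm = llm
    return llm


def extract(prompt: str) -> str:
    # Workers never share a client object; each one calls get_llm() itself.
    return get_llm().invoke(prompt)


with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(extract, [f"chunk {i}" for i in range(12)]))
```

The client is built lazily inside the worker, so a pool of six threads ends up with at most six instances regardless of how many tasks are submitted.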

Checkpoint invalidation by input fingerprint. A manifest file stores an MD5 of filename + size + mtime for every input PDF (a metadata fingerprint, not a hash of the file contents). If any input changes, all checkpoints are wiped before the run starts. This prevents stale checkpoints from silently producing wrong results.
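
A minimal sketch of such a manifest check, assuming checkpoints live in a single directory next to a manifest.json file. The file name, directory layout and function names are assumptions for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


def fingerprint(pdf_path: Path) -> str:
    """MD5 over filename + size + mtime (metadata only, not file contents)."""
    stat = pdf_path.stat()
    key = f"{pdf_path.name}|{stat.st_size}|{stat.st_mtime_ns}"
    return hashlib.md5(key.encode("utf-8"), usedforsecurity=False).hexdigest()


def build_manifest(pdf_dir: Path) -> dict:
    """Map every PDF in the folder to its fingerprint."""
    return {p.name: fingerprint(p) for p in sorted(pdf_dir.glob("*.pdf"))}


def invalidate_if_changed(pdf_dir: Path, checkpoint_dir: Path) -> bool:
    """Wipe all checkpoints when the inputs differ from the stored manifest.

    Returns True if checkpoints were wiped (including the very first run).
    """
    manifest_path = checkpoint_dir / "manifest.json"  # hypothetical file name
    current = build_manifest(pdf_dir)
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else None
    if previous == current:
        return False
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps(current, indent=2))
    return True
```

Hashing metadata is cheap but can miss an edit that preserves both size and mtime; hashing file contents would close that gap at the cost of reading every PDF on each run.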

Two LLM profiles per thread.

Extraction: streaming=True, thinking=False — simple JSON task, user sees token progress
Synthesis: streaming=False, thinking=True — complex reasoning, no streaming overhead

Schema-level validation as a last-resort guardrail. The Hypothesis Pydantic model runs a model_validator that scans hypothesis + mechanism text for forbidden phrases and raises ValueError before a bad hypothesis ever enters the refinement loop. This catches cases where the prompt-level constraints fail.
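
The check itself is small. The sketch below uses a standard-library dataclass so it runs without dependencies; it is a stand-in for the Pydantic validator described above, not a copy of it. Matching on word boundaries is my assumption: a plain substring test would also reject "improves" because it contains "proves".

```python
import re
from dataclasses import dataclass

FORBIDDEN_PHRASES = (
    "necessary and sufficient",
    "proves",
    "objective metric",
    "always",
    "guaranteed",
)

# One case-insensitive, whole-word pattern per phrase.
_PATTERNS = [
    (phrase, re.compile(rf"\b{re.escape(phrase)}\b", re.IGNORECASE))
    for phrase in FORBIDDEN_PHRASES
]


def find_forbidden(text: str) -> list:
    """Return the forbidden phrases that occur in the text, in list order."""
    return [phrase for phrase, pattern in _PATTERNS if pattern.search(text)]


@dataclass
class Hypothesis:
    hypothesis: str
    mechanism: str

    def __post_init__(self):
        # Scan both fields at construction time, as the schema validator does.
        hits = find_forbidden(f"{self.hypothesis} {self.mechanism}")
        if hits:
            raise ValueError(f"forbidden language: {', '.join(hits)}")
```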

Confidence recalibration. LLM-assigned confidence scores are untrustworthy. After each review round, confidence is recomputed: max(0.05, conf − 0.06 × len(weaknesses) − reviewer_penalty). A hypothesis that entered generation at 0.80 but accumulated 5 weaknesses and a 0.20 reviewer penalty exits at 0.30 — an honest signal.
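
The formula and the worked example above, written out as a small function. The function and parameter names are illustrative; the constants are the ones quoted in the text.

```python
def recalibrate(confidence, weaknesses, reviewer_penalty,
                per_weakness=0.06, floor=0.05):
    """Apply max(floor, conf - per_weakness * len(weaknesses) - reviewer_penalty)."""
    return max(floor, confidence - per_weakness * len(weaknesses) - reviewer_penalty)


# Worked example from the text: 0.80 - (0.06 * 5) - 0.20 = 0.30
five_weaknesses = ["w1", "w2", "w3", "w4", "w5"]
example = round(recalibrate(0.80, five_weaknesses, 0.20), 2)

# A heavily penalised hypothesis bottoms out at the floor instead of going negative.
floored = recalibrate(0.20, five_weaknesses, 0.20)
```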

Stack

Model: Gemma 4 31B Dense via Ollama (local, offline)
Orchestration: Python + LangChain Ollama adapter
Schema: Pydantic v2 with custom validators
UI: Gradio with tabbed output (Agreements / Contradictions / Gaps / Hypotheses / Raw JSON)
PDF parsing: pdfplumber
Parallelism: concurrent.futures.ThreadPoolExecutor

Code

Full source on GitHub: github.com/navid72m/litsynth

pip install pdfplumber langchain-ollama pydantic gradio tqdm
ollama pull gemma4
python ui.py

Why Gemma 4

Three capabilities made this project possible — and none of them are present in smaller models:

1. The 128K context window is the load-bearing wall.
A standard RAG pipeline would embed chunks, retrieve the top-k, and reason over those. The problem is that cross-paper relationships are exactly the kind of signal that falls between retrieval buckets. A finding in paper 3 that partially contradicts a result in paper 11 only becomes visible if both are in context simultaneously. With Gemma 4's 128K window, the entire evidence corpus fits. The model sees everything at once. RAG approximates this — Gemma 4 actually does it.

2. Thinking mode changes the quality of synthesis.
The difference between Gemma 4's thinking mode and standard generation on the hypothesis step is not subtle. Standard generation produces fluent but shallow hypotheses. Thinking mode produces hypotheses that trace through intermediate reasoning steps — "if finding A holds and gap B exists, then mechanism C predicts outcome D." You can see this in the <think> blocks (stripped before JSON parsing, but logged separately for inspection). The adversarial reviewer benefits equally: it produces structured, dimension-by-dimension critiques rather than vague feedback.
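
Separating the reasoning from the JSON payload can be done with a short regular expression. This sketch assumes the reasoning arrives inline in <think>...</think> tags as described above; split_thinking and parse_response are hypothetical helper names.

```python
import json
import re

# Non-greedy and DOTALL so that multi-line blocks are captured one at a time.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str):
    """Return (list of reasoning blocks, remaining visible text)."""
    thoughts = [block.strip() for block in THINK_BLOCK.findall(raw)]
    visible = THINK_BLOCK.sub("", raw).strip()
    return thoughts, visible


def parse_response(raw: str):
    """Keep the reasoning for logging and parse what is left as JSON."""
    thoughts, visible = split_thinking(raw)
    return thoughts, json.loads(visible)


raw = '<think>if finding A holds and gap B exists,\nthen C predicts D</think>\n{"hypotheses": []}'
thoughts, payload = parse_response(raw)
```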

3. The 31B dense model is the right size for this task.
The E2B/E4B models are excellent for edge deployment and single-task extraction. But synthesis — holding 300 claims in context and reasoning about relationships between them — requires the full 31B. The task isn't latency-sensitive (a research session takes minutes, not milliseconds), so the larger model's reasoning quality justifies the compute. The 31B also runs locally on an M2 Pro with 32GB RAM via Ollama, which keeps the entire pipeline offline — no paper content leaves the machine.

The model choice isn't incidental. Every design decision in LitSynth — batching, the token budget guard, the context assembly strategy — exists to make the most of Gemma 4's specific capabilities. A different model would require a different architecture. This one is built around what Gemma 4 can actually do.

Built for the Gemma 4 Challenge, May 2026. All synthesis runs locally. No API calls. No paper data leaves your machine.

Building a Persistent Knowledge Base RAG System with FastAPI, llama.cpp, Chroma, and Open WebUI

navid mirnouri — Thu, 30 Apr 2026 18:36:01 +0000

Have you ever wanted to chat with your own PDF collection – textbooks, research papers, internal documentation – using a local LLM, while keeping your data completely private?

This is exactly what I built. In this article, I’ll walk you through a complete, production‑ready setup that:

Ingests a folder of PDFs into a vector database (Chroma)
Serves an OpenAI‑compatible RAG API using FastAPI
Uses llama.cpp as the local LLM backend (any GGUF model works)
Connects seamlessly to Open WebUI for a beautiful chat interface
Provides persistent memory (the vector store survives restarts)

All code is available at the end of this article – ready to copy, paste, and run.

🧠 Why this system?

Privacy first – everything runs on your machine.
Long‑term knowledge – uploaded PDFs stay in the vector store; you can chat with them any time.
Cross‑chat memory – the RAG pipeline works every time you ask a question.
Modular – swap Chroma for Qdrant, replace llama.cpp with Ollama, or add hybrid search.

📦 Prerequisites

Python 3.11+ (I used 3.12, but 3.11 is safer)
Docker (for Open WebUI)
A GGUF model (e.g., Llama 3 8B, Mistral) and the llama.cpp server
Basic terminal knowledge

🔧 Step 1 – Project setup

Create a directory and a virtual environment:

mkdir my_knowledge_base && cd my_knowledge_base
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

Create requirements.txt:

fastapi
uvicorn[standard]
chromadb
langchain
langchain-community
langchain-text-splitters
pypdf
sentence-transformers
openai
python-multipart

Install everything:

pip install -r requirements.txt

Create two folders:

mkdir knowledge_pdfs vector_store

Place your PDF files inside knowledge_pdfs/.

⚙️ Step 2 – The FastAPI application (app.py)

Copy the entire code below into app.py. It handles:

Background ingestion of PDFs (non‑blocking)
An OpenAI‑compatible /v1/chat/completions endpoint
A /v1/models endpoint for Open WebUI
Health and status endpoints

import os
import glob
import threading
import time
from pathlib import Path
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional, Dict
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from openai import OpenAI

# ------------------------------------------------------------
# Configuration
# ------------------------------------------------------------
PDF_DIR = "./knowledge_pdfs"
VECTOR_DB_DIR = "./vector_store"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
LLAMA_CPP_HOST = "http://localhost:10002"   # your llama.cpp server address
LLAMA_CPP_MODEL = "my-gguf-model"           # can be any name

# ------------------------------------------------------------
# Global objects & status
# ------------------------------------------------------------
app = FastAPI(title="Knowledge Base RAG API", version="2.0")
vector_store = None
embeddings = None

ingestion_status = {
    "running": False,
    "done": False,
    "error": None,
    "total_chunks": 0,
    "files_processed": 0
}
ingestion_lock = threading.Lock()

# ------------------------------------------------------------
# Pydantic models (OpenAI compatible)
# ------------------------------------------------------------
class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 2048
    stream: Optional[bool] = False

# ------------------------------------------------------------
# Background ingestion worker (non‑blocking)
# ------------------------------------------------------------
def _ingest_pdfs_worker(vector_store, pdf_dir, chunk_size, chunk_overlap):
    global ingestion_status
    try:
        pdf_files = glob.glob(os.path.join(pdf_dir, "*.pdf"))
        if not pdf_files:
            with ingestion_lock:
                ingestion_status["error"] = "No PDF files found"
                # Clear the running flag so a later POST /reload is not rejected with 409
                ingestion_status["running"] = False
            return

        total_chunks = 0
        for idx, pdf_path in enumerate(pdf_files, 1):
            # Load
            loader = PyPDFLoader(pdf_path)
            docs = loader.load()
            # Add metadata
            for doc in docs:
                doc.metadata["source"] = os.path.basename(pdf_path)

            # Split
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                separators=["\n\n", "\n", " ", ""]
            )
            chunks = splitter.split_documents(docs)

            # Batch insert into vector store
            batch_size = 500
            for i in range(0, len(chunks), batch_size):
                vector_store.add_documents(chunks[i:i+batch_size])

            total_chunks += len(chunks)
            with ingestion_lock:
                ingestion_status["files_processed"] = idx
                ingestion_status["total_chunks"] = total_chunks

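        # Chroma 0.4+ writes to persist_directory automatically; only call
        # persist() where the wrapper still provides it (older setups)
        if hasattr(vector_store, "persist"):
            vector_store.persist()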
        with ingestion_lock:
            ingestion_status["done"] = True
            ingestion_status["running"] = False

    except Exception as e:
        with ingestion_lock:
            ingestion_status["error"] = str(e)
            ingestion_status["running"] = False

# ------------------------------------------------------------
# Startup event – initialise vector store (no auto‑ingestion)
# ------------------------------------------------------------
@app.on_event("startup")
def startup():
    global vector_store, embeddings, ingestion_status
    Path(PDF_DIR).mkdir(parents=True, exist_ok=True)
    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
    vector_store = Chroma(
        persist_directory=VECTOR_DB_DIR,
        embedding_function=embeddings,
        collection_name="pdf_knowledge"
    )
    # Check if already populated
    if vector_store._collection.count() > 0:
        ingestion_status = {
            "running": False,
            "done": True,
            "error": None,
            "total_chunks": vector_store._collection.count(),
            "files_processed": 0
        }
        print(f"✅ Vector store already contains {vector_store._collection.count()} chunks.")
    else:
        ingestion_status = {
            "running": False,
            "done": False,
            "error": None,
            "total_chunks": 0,
            "files_processed": 0
        }
        print("⚠️ Vector store is empty. Use POST /reload to ingest PDFs.")

# ------------------------------------------------------------
# Endpoints
# ------------------------------------------------------------
@app.get("/health")
def health():
    return {
        "status": "ok",
        "vector_store_ready": vector_store is not None,
        "ingestion_done": ingestion_status["done"]
    }

@app.get("/v1/models")
def list_models():
    return {
        "object": "list",
        "data": [
            {
                "id": LLAMA_CPP_MODEL,
                "object": "model",
                "created": int(time.time()),
                "owned_by": "local"
            }
        ]
    }

@app.post("/reload")
async def reload_knowledge():
    """Start background ingestion (non‑blocking)."""
    global ingestion_status, vector_store
    if ingestion_status.get("running", False):
        raise HTTPException(status_code=409, detail="Ingestion already in progress")

    # Reset status
    ingestion_status = {
        "running": True,
        "done": False,
        "error": None,
        "total_chunks": 0,
        "files_processed": 0
    }

    # Optionally clear existing collection to avoid duplicates
    try:
        vector_store.delete_collection()
        vector_store = Chroma(
            persist_directory=VECTOR_DB_DIR,
            embedding_function=embeddings,
            collection_name="pdf_knowledge"
        )
    except Exception:
        pass  # collection might not exist yet

    thread = threading.Thread(
        target=_ingest_pdfs_worker,  # must match the worker function defined above
        args=(vector_store, PDF_DIR, CHUNK_SIZE, CHUNK_OVERLAP),
        daemon=True
    )
    thread.start()
    return {"message": "Ingestion started"}

@app.get("/ingestion-status")
def get_ingestion_status():
    with ingestion_lock:
        return ingestion_status.copy()

@app.post("/v1/chat/completions")
async def chat_completion(req: ChatCompletionRequest):
    # 1. If ingestion is still running, return a placeholder reply instead of querying
    if ingestion_status.get("running", False) and not ingestion_status.get("done", False):
        return {
            "id": "loading",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": req.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": "Knowledge base is still loading. Please try again shortly."},
                "finish_reason": "stop"
            }],
            "usage": {}
        }

    # 2. Extract last user message
    user_msg = None
    for m in reversed(req.messages):
        if m.role == "user":
            user_msg = m.content
            break
    if not user_msg:
        raise HTTPException(status_code=400, detail="No user message found")

    # 3. Retrieve relevant chunks
    try:
        retriever = vector_store.as_retriever(search_kwargs={"k": 4})
        docs = retriever.invoke(user_msg)
    except Exception as e:
        print(f"Retrieval error: {e}")
        docs = []

    if not docs:
        context = "No relevant documents found in the knowledge base."
    else:
        context_parts = []
        for i, doc in enumerate(docs):
            source = doc.metadata.get("source", "unknown")
            text = doc.page_content
            context_parts.append(f"[Document {i+1} from {source}]\n{text}")
        context = "\n\n---\n\n".join(context_parts)

    # 4. Build the improved system prompt (strict RAG assistant)
    system_prompt = (
        "You are a knowledgeable assistant that answers questions strictly based on the provided context. "
        "Follow these rules:\n"
        "1. If the context contains the relevant information, answer clearly and concisely using only that information.\n"
        "2. If the context is insufficient or does not answer the question, say: 'The knowledge base does not contain enough information to answer this question.' – Do not invent an answer.\n"
        "3. When applicable, reference the source document(s) by the filename shown in the context (e.g., 'According to [filename]...').\n"
        "4. Keep answers focused and avoid adding external knowledge not found in the context.\n"
        "5. If the user asks to elaborate or explain step‑by‑step, provide a detailed answer as long as the context supports it."
    )

    user_prompt = f"Context:\n{context}\n\nQuestion: {user_msg}\nAnswer:"

    # 5. Call llama.cpp server
    try:
        llm_client = OpenAI(
            base_url=f"{LLAMA_CPP_HOST}/v1",
            api_key="not-needed",
            timeout=60.0
        )
        response = llm_client.chat.completions.create(
            model=LLAMA_CPP_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
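            # Explicit None check so that a requested temperature of 0 is respected
            temperature=req.temperature if req.temperature is not None else 0.7,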
            max_tokens=req.max_tokens or 2048
        )
        answer = response.choices[0].message.content
    except Exception as e:
        print(f"LLM call failed: {e}")
        answer = "Sorry, I encountered an error while generating the answer. Please check that your llama.cpp server is running."

    # 6. Return OpenAI‑compatible response
    return {
        "id": "chatcmpl-rag",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": answer},
                "finish_reason": "stop"
            }
        ],
        # Rough word counts, not true token counts (no tokenizer runs in this service)
        "usage": {
            "prompt_tokens": len(user_msg.split()),
            "completion_tokens": len(answer.split()),
            "total_tokens": len(user_msg.split()) + len(answer.split())
        }
    }

# ------------------------------------------------------------
# Run with: uvicorn app:app --reload --host 0.0.0.0 --port 8000
# ------------------------------------------------------------
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

🚀 Step 3 – Run the llama.cpp server

Download a GGUF model (e.g., from Hugging Face) and start the server:

./llama-server -m models/your-model.gguf \
  --host 0.0.0.0 --port 10002 \
  --ctx-size 8192 \
  --n-predict 2048 \
  --rope-scaling linear

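Important – --n-predict 2048 caps each response at 2048 tokens, which leaves room for long answers. Defaults differ between llama.cpp versions, so check ./llama-server --help; a lower limit will cut responses off.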

🧪 Step 4 – Start your FastAPI knowledge base

uvicorn app:app --reload --host 0.0.0.0 --port 8000

Then ingest your PDFs (this runs in the background):

curl -X POST http://localhost:8000/reload

Monitor progress:

curl http://localhost:8000/ingestion-status

When "done": true, you’re ready.
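
If you would rather wait from a script than re-run curl, a small polling helper along these lines should work. It is a sketch using only the standard library and assumes the /ingestion-status response shape from app.py above; fetch_status and wait_for_ingestion are names invented here.

```python
import json
import time
from urllib.request import urlopen


def fetch_status(url: str = "http://localhost:8000/ingestion-status") -> dict:
    """Read the status JSON from the FastAPI service."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_ingestion(fetch=fetch_status, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until ingestion is done; raise on a reported error or on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch()
        if status.get("error"):
            raise RuntimeError(f"Ingestion failed: {status['error']}")
        if status.get("done"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Ingestion did not finish in time")
        sleep(interval)


if __name__ == "__main__":
    final = wait_for_ingestion()
    print(f"Ingested {final['total_chunks']} chunks from {final['files_processed']} files")
```

The fetch and sleep parameters are injectable so the helper can be exercised without a running server.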

Test the chat endpoint:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-gguf-model",
    "messages": [{"role": "user", "content": "What does your documentation say about microprocessors?"}]
  }'

🌐 Step 5 – Connect Open WebUI
Run Open WebUI (Docker):

docker run -d -p 3000:8080 \
  -v openwebui-data:/app/backend/data \
  --name openwebui \
  ghcr.io/open-webui/open-webui:main

Then:

Open http://localhost:3000 and create an admin account.
Go to Admin Settings → Connections → OpenAI.
Click Add Connection.
URL: http://host.docker.internal:8000/v1 (if Open WebUI is in Docker and your FastAPI runs on the host)
or http://localhost:8000/v1 (if both run natively).
API Key: leave blank (or type dummy).
Save.

Your model (my-gguf-model) will appear in the model selector. Start chatting with your PDFs!

📝 The system prompt explained
The improved prompt (inside /v1/chat/completions) forces the LLM to:

✅ Use only the retrieved context

✅ Refuse to answer if context is missing (no hallucination)

✅ Cite source filenames when possible

✅ Stay focused and avoid external knowledge

This is the secret to reliable, grounded answers.

🧹 Tips & troubleshooting

Answers get cut off → Increase max_tokens in the endpoint (default is 2048) and ensure llama.cpp uses --n-predict 2048 or higher.
Retrieval returns nothing → Check /ingestion-status. If total_chunks is 0, run POST /reload again and watch the logs. Make sure your PDFs are in knowledge_pdfs/.
Open WebUI doesn’t see the model → Manually add the model in Workspace → Models with the same ID (my-gguf-model). Also verify that /v1/models returns the model.
Duplicate chunks on re‑ingest → The /reload endpoint now deletes the old collection before ingesting, so duplicates should not happen.
ModuleNotFoundError: No module named 'langchain.text_splitter' → Change the import to from langchain_text_splitters import RecursiveCharacterTextSplitter and install langchain-text-splitters.
Chroma collection does not exist → Delete the ./vector_store folder and run POST /reload again. The collection will be created on first add.
LLM call fails → Ensure your llama.cpp server is running on http://localhost:10002 and that the model name matches. Test with curl http://localhost:10002/v1/models.

🎯 Final thoughts
You now have a fully local, persistent knowledge base that you can query from a beautiful chat interface. All data stays on your machine, and you can extend it with more PDFs anytime (just run POST /reload again).

The complete code (the app.py above) is ready to be copied and used as a starting point for your own projects. Swap the embedding model, try a different vector store, or add hybrid search – the possibilities are endless.

Happy building! 🚀

Found this helpful? Leave a like or comment below – I’d love to hear how you’re using local RAG.

Run a Fully Local AI With Persistent Memory: LM Studio + Big RAG Guide

navid mirnouri — Sun, 26 Apr 2026 10:19:50 +0000

If you've ever wanted a completely private, offline AI assistant that actually remembers what's in your documents — and doesn't forget your conversation the moment you open a new chat — this guide is for you.

We're going to:

Set up LM Studio running Google's Gemma 4 locally
Install the Big RAG plugin to index your documents
Modify the plugin source to add genuine persistent memory across sessions

No cloud. No subscriptions. No data leaving your machine.

What Is RAG and Why Does It Matter?

RAG (Retrieval-Augmented Generation) lets you point a language model at your own files — PDFs, notes, documentation, whatever — and ask questions about them. Instead of the model relying on what it learned during training, it searches your documents in real time and injects the most relevant passages into the prompt before generating a response.

The result: accurate, grounded answers from your data, not hallucinated guesses.

Part 1: Installing LM Studio and Loading Gemma 4

Download LM Studio

Head to lmstudio.ai and grab the installer for your platform. It supports macOS (Apple Silicon + Intel), Windows, and Linux.

Download Gemma 4

In the left sidebar, click the Discover tab and search for gemma-4. Pick the quantisation that fits your hardware:

Quantisation	RAM needed	Notes
Q4_K_M	16 GB	Best balance of quality/speed
Q3_K_M	8 GB	Lighter, still capable
Q8_0	24 GB+	Highest quality

Download the Embedding Model

Big RAG needs a separate embedding model to convert your documents into searchable vectors. Search for:

nomic-ai/nomic-embed-text-v1.5-GGUF

It's small (~270 MB) and purpose-built for this job. You'll never chat with it — it runs silently in the background.

Verify Everything Works

Click the Chat tab, select Gemma 4 from the model picker, and send a test message. If it responds, you're ready.

Part 2: Installing the Big RAG Plugin

Big RAG is an open-source plugin that indexes an entire folder of documents into a persistent local vector database. It supports PDF, TXT, Markdown, HTML, and images (via OCR).

Prerequisites

You need Node.js — download the LTS version from nodejs.org.

Then bootstrap the lms CLI (it ships with LM Studio):

# macOS / Linux
~/.lmstudio/bin/lms bootstrap

# Windows
cmd /c %USERPROFILE%/.lmstudio/bin/lms.exe bootstrap

Open a new terminal after that, then verify:

lms --version

Clone and Build

git clone https://github.com/ari99/lm_studio_big_rag_plugin.git
cd lm_studio_big_rag_plugin
npm install
npm run build

Install Permanently

# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin

# Windows (PowerShell)
Copy-Item -Recurse . "$env:USERPROFILE\.lmstudio\plugins\lm_studio_big_rag_plugin"

Restart LM Studio, go to Settings → Plugins, and toggle Big RAG on.

Configure It

In Settings → Plugins → Big RAG, set:

Documents directory — the folder with your files, e.g. ~/Documents/MyKnowledgeBase
Vector store directory — where the index lives, e.g. ~/.lmstudio/big-rag-db
Embedding model — nomic-ai/nomic-embed-text-v1.5-GGUF
Retrieval limit — 5
Affinity threshold — 0.35

Drop some PDFs or text files into your documents folder, open a new chat with Gemma 4, and send any message. Big RAG will index your documents on first run — you'll see a progress indicator. After that, every question automatically pulls relevant passages.

💡 Tip: If Big RAG says "no relevant content found", lower the affinity threshold to 0.2. If it's pulling irrelevant results, raise it to 0.5.
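To make the two retrieval settings concrete, here is a minimal sketch of how a retrieval limit and an affinity threshold typically interact. This is not Big RAG's actual code: the ScoredChunk type, the selectChunks function, and the sample scores are hypothetical, and it assumes scores are similarities where higher means more relevant.

```typescript
// Hypothetical shape for a retrieved passage; not a Big RAG type.
interface ScoredChunk {
  filePath: string;
  text: string;
  score: number; // assumed: similarity score, higher = more relevant
}

// Keep chunks at or above the threshold, best first, capped at `limit`.
function selectChunks(
  chunks: ScoredChunk[],
  limit: number,
  threshold: number
): ScoredChunk[] {
  return chunks
    .filter(c => c.score >= threshold) // filter() copies, so the input is not reordered
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Made-up scores, for illustration only.
const candidates: ScoredChunk[] = [
  { filePath: "notes.md", text: "...", score: 0.62 },
  { filePath: "paper.pdf", text: "...", score: 0.41 },
  { filePath: "todo.txt", text: "...", score: 0.28 },
];

console.log(selectChunks(candidates, 5, 0.35).length); // 2
console.log(selectChunks(candidates, 5, 0.2).length);  // 3
console.log(selectChunks(candidates, 5, 0.5).length);  // 1
```

With these sample scores, lowering the threshold to 0.2 lets the weakest match through, while raising it to 0.5 keeps only the strongest one.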

Part 3: Adding Persistent Memory to Big RAG

This is the interesting part. Out of the box, Big RAG has no memory between sessions. Every new chat starts completely blank. We're going to fix that by modifying src/promptPreprocessor.ts.

The solution has two layers:

Within-session history — using LM Studio's pullHistory() API to inject recent conversation turns directly into the prompt
Cross-session memory — using a local chat_memory.json file to remember context from past sessions

Step 1 — Install the Memory Dependency

cd lm_studio_big_rag_plugin
npm install lowdb

lowdb is a tiny JSON file database, which is all this needs. The JSONFilePreset helper used below requires lowdb v7 or later.

Step 2 — Add Imports and Types

Open src/promptPreprocessor.ts and add these imports at the top alongside the existing ones:

import { JSONFilePreset } from "lowdb/node";
import * as path from "path";

Then add the memory schema and helpers before the preprocess function:

type MemorySchema = {
  history: Array<{
    timestamp: string;
    user_text: string;
    summary: string;
  }>
}

async function getMemory(vectorStoreDir: string) {
  const dbPath = path.join(vectorStoreDir, 'chat_memory.json');
  const defaultData: MemorySchema = { history: [] };
  return await JSONFilePreset<MemorySchema>(dbPath, defaultData);
}

function summarizeText(
  text: string,
  maxLines: number = 3,
  maxChars: number = 400
): string {
  // Drop blank lines, then clip by line count and by character count
  const lines = text.split(/\r?\n/).filter(line => line.trim() !== "");
  const joined = lines.join("\n");
  let clipped = lines.slice(0, maxLines).join("\n");
  if (clipped.length > maxChars) clipped = clipped.slice(0, maxChars);
  // Add an ellipsis only when real content was cut, not when blank lines were dropped
  const needsEllipsis = joined.length > clipped.length;
  return needsEllipsis ? `${clipped.trimEnd()}…` : clipped;
}
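For orientation, this is roughly the shape of the data that ends up in chat_memory.json once an entry has been written. The values are invented placeholders. Only the field names and the "Q: ... | Top doc: ..." summary format (written in Step 3) come from the code in this post.

```typescript
// Same schema as above, repeated so this snippet stands alone.
type MemorySchema = {
  history: Array<{
    timestamp: string;
    user_text: string;
    summary: string;
  }>;
};

// Placeholder values, not real output.
const example: MemorySchema = {
  history: [
    {
      timestamp: "2026-01-01T09:00:00.000Z",
      user_text: "What does the onboarding guide say about VPN access?",
      summary:
        "Q: What does the onboarding guide say about VPN access? | Top doc: onboarding.pdf",
    },
  ],
};

// The file on disk is this object serialised as JSON.
console.log(JSON.stringify(example, null, 2));
```

Note that the top level is an object with a single history key, so any manual edits should keep that wrapper intact.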

Step 3 — Replace the Prompt Assembly Block

Find the section inside preprocess() where ragContextFull is assembled and the final prompt is built. Replace that entire block with this:

// ── Within-session chat history ────────────────────────────────────────────
const history = await ctl.pullHistory();

// Chat is an iterable — use getText() / getRole(), not .content (which
// doesn't exist on the typed ChatMessage object)
const allTurns: Array<{ role: string; text: string }> = [];
for (const msg of history) {
  allTurns.push({ role: msg.getRole(), text: msg.getText() });
}
const recentTurns = allTurns.slice(-6); // last 3 full exchanges

let historyContext = "";
if (recentTurns.length > 0) {
  historyContext = "\n\nRecent conversation history:\n";
  for (const msg of recentTurns) {
    const role = msg.role === "user" ? "User" : "Assistant";
    historyContext += `${role}: ${summarizeText(msg.text, 6, 600)}\n\n`;
  }
}

// ── Cross-session persistent memory ───────────────────────────────────────
const memoryDb = await getMemory(vectorStoreDir);
const pastMemories = memoryDb.data.history.slice(-5);
const persistentMemory = pastMemories.length > 0
  ? "\n\nPersistent memory from past sessions:\n" +
    pastMemories.map(m => `- [${m.timestamp}] ${m.summary}`).join("\n")
  : "";

// Inject both into the RAG context block
ragContextFull    += historyContext + persistentMemory;
ragContextPreview += historyContext + persistentMemory;

// ── Build and return the final prompt ──────────────────────────────────────
const promptTemplate = normalizePromptTemplate(pluginConfig.get("promptTemplate"));
const finalPrompt = fillPromptTemplate(promptTemplate, {
  [RAG_CONTEXT_MACRO]: ragContextFull.trimEnd(),
  [USER_QUERY_MACRO]: userPrompt,
});
const finalPromptPreview = fillPromptTemplate(promptTemplate, {
  [RAG_CONTEXT_MACRO]: ragContextPreview.trimEnd(),
  [USER_QUERY_MACRO]: userPrompt,
});

await warnIfContextOverflow(ctl, finalPrompt);

// Write a meaningful memory entry for future sessions
memoryDb.data.history.push({
  timestamp: new Date().toISOString(),
  user_text: userPrompt,
  summary: `Q: ${summarizeText(userPrompt, 1, 100)} | Top doc: ${
    results[0] ? path.basename(results[0].filePath) : "none"
  }`,
});
await memoryDb.write();

return finalPrompt;

Step 4 — Rebuild and Reinstall

npm run build

# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin

In LM Studio go to Settings → Plugins and toggle Big RAG off then back on.

How It All Works Together

When you send a message, the preprocessor now does four things before Gemma 4 ever sees it:

Embeds your query with nomic-embed-text and retrieves the most relevant document chunks from the vector index
Pulls live chat history via ctl.pullHistory() — real conversation turns from LM Studio's own engine — and formats the last 3 exchanges as context
Loads cross-session memory from chat_memory.json and injects the last 5 entries (one entry is written per message, so these can include earlier questions from the current session)
Assembles the final prompt combining all three context sources and passes it to Gemma 4

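Just before the prompt is handed to the model, a new memory entry is written recording what you asked and which document ranked highest. Because the write happens in the preprocessor, the entry captures your question, not the model's answer. This survives new chats, restarts, and LM Studio updates.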

Common Gotchas

Property 'messages' does not exist on type 'Chat' — Chat is an iterable object, not a plain array. Use for (const msg of history) to iterate it.

Property 'content' does not exist on type 'ChatMessage' — ChatMessage exposes getText() and getRole() methods, not a .content property. The .content field only exists on the raw input format used when building a chat with Chat.from([...]).

lms import . fails with "Path is not a file" — lms import expects a zip, not a folder. Use the cp method above instead.

Plugin doesn't appear after copying — Restart LM Studio fully, not just toggle the plugin.
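The first two errors share a cause: the history object has to be iterated, and each message is read through methods rather than properties. One way to keep that logic testable outside LM Studio is to code against a small structural interface. In the sketch below, MessageLike, collectTurns, and the stub messages are hypothetical names introduced for illustration. Only getRole() and getText() come from the plugin code above.

```typescript
// Hypothetical minimal interface covering the two methods used in Step 3.
interface MessageLike {
  getRole(): string;
  getText(): string;
}

// Accepts any iterable of messages, so no .messages or .content access is needed.
function collectTurns(
  history: Iterable<MessageLike>,
  maxTurns: number = 6
): Array<{ role: string; text: string }> {
  const turns: Array<{ role: string; text: string }> = [];
  for (const msg of history) {
    turns.push({ role: msg.getRole(), text: msg.getText() });
  }
  // slice(-0) would return everything, so handle non-positive limits explicitly
  return maxTurns > 0 ? turns.slice(-maxTurns) : [];
}

// Stub messages standing in for a real chat history.
const stub = (role: string, text: string): MessageLike => ({
  getRole: () => role,
  getText: () => text,
});

const turns = collectTurns(
  [stub("user", "hi"), stub("assistant", "hello"), stub("user", "what's in my notes?")],
  2
);
console.log(turns.map(t => t.role)); // [ 'assistant', 'user' ]
```

Since the function only relies on iteration plus those two methods, the object returned by ctl.pullHistory() should be usable with it as well, though that is an assumption about the SDK's types rather than something verified here.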

Tuning Tips

Chunk size — default 500 tokens works for most docs. Use 700 for dense technical content, 300 for short notes
Memory size — the chat_memory.json file grows indefinitely. Open it in any text editor and prune old entries if needed. It's a plain JSON object with a single history array
Re-indexing — enable the Manual Reindex Trigger toggle in plugin settings to pick up newly added documents
Top-k — increase the retrieval limit from 5 to 8 if you want more context injected, but watch your context window
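Instead of pruning the memory file by hand, the history could be capped in code. A minimal sketch, assuming a cap of 200 entries is acceptable: pruneHistory and MAX_ENTRIES are hypothetical names and are not part of Big RAG or lowdb.

```typescript
type MemoryEntry = { timestamp: string; user_text: string; summary: string };

// Assumed cap; pick whatever suits your usage.
const MAX_ENTRIES = 200;

// Returns the newest `max` entries as a new array, leaving the input untouched.
function pruneHistory(
  history: MemoryEntry[],
  max: number = MAX_ENTRIES
): MemoryEntry[] {
  if (max <= 0) return [];
  return history.length > max ? history.slice(-max) : history.slice();
}

// Demo with generated placeholder entries.
const entries: MemoryEntry[] = Array.from({ length: 5 }, (_, i) => ({
  timestamp: new Date(Date.UTC(2026, 0, i + 1)).toISOString(),
  user_text: `question ${i + 1}`,
  summary: `Q: question ${i + 1} | Top doc: none`,
}));

const pruned = pruneHistory(entries, 3);
console.log(pruned.map(e => e.user_text)); // [ 'question 3', 'question 4', 'question 5' ]
```

In the plugin, one place this could go is right before the memoryDb.write() call in Step 3, for example memoryDb.data.history = pruneHistory(memoryDb.data.history).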

Final Thoughts

What we've built here is a genuinely useful private knowledge assistant. No monthly fee, no API key, no data leaving your machine. Gemma 4 is strong on instruction following, nomic-embed-text is one of the best local embedding models available, and Big RAG's incremental indexing means adding documents to your library doesn't force a full re-index.

The persistent memory piece is a workaround for what will eventually be a first-class LM Studio feature — the plugin SDK is still in beta. But it works reliably today, and you own it completely.

Tested on LM Studio 0.4.12, macOS Sequoia, Apple M-series. Windows commands are included where they differ.

Drop any questions in the comments — happy to help troubleshoot your setup.