If you've ever wanted a completely private, offline AI assistant that actually remembers what's in your documents — and doesn't forget your conversation the moment you open a new chat — this guide is for you.
We're going to:
- Set up LM Studio running Google's Gemma 4 locally
- Install the Big RAG plugin to index your documents
- Modify the plugin source to add genuine persistent memory across sessions
No cloud. No subscriptions. No data leaving your machine.
What Is RAG and Why Does It Matter?
RAG (Retrieval-Augmented Generation) lets you point a language model at your own files — PDFs, notes, documentation, whatever — and ask questions about them. Instead of the model relying on what it learned during training, it searches your documents in real time and injects the most relevant passages into the prompt before generating a response.
The result: accurate, grounded answers from your data, not hallucinated guesses.
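In pseudocode terms, the whole loop is only a few steps. Here's a rough TypeScript sketch of the idea (not Big RAG's actual implementation; the `embed` and `generate` helpers are stand-ins for whatever embedding model and LLM calls you wire up):

```typescript
// Illustrative only: the bare-bones RAG loop.
// embed() and generate() are placeholders for your embedding model and LLM.
type Chunk = { text: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function answerWithRag(
  question: string,
  chunks: Chunk[],                                // pre-indexed document chunks
  embed: (text: string) => Promise<number[]>,     // e.g. nomic-embed-text
  generate: (prompt: string) => Promise<string>   // e.g. Gemma via LM Studio
): Promise<string> {
  const queryVec = await embed(question);

  // Rank every chunk by similarity to the question and keep the best few
  const topChunks = chunks
    .map(c => ({ c, score: cosine(queryVec, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .map(x => x.c.text);

  // Inject the retrieved passages ahead of the user's question
  const prompt =
    `Answer using only the context below.\n\n` +
    `Context:\n${topChunks.join("\n---\n")}\n\n` +
    `Question: ${question}`;
  return generate(prompt);
}
```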
Part 1: Installing LM Studio and Loading Gemma 4
Download LM Studio
Head to lmstudio.ai and grab the installer for your platform. It supports macOS (Apple Silicon + Intel), Windows, and Linux.
Download Gemma 4
In the left sidebar, click the Discover tab and search for gemma-4. Pick the quantisation that fits your hardware:
| Quantisation | RAM needed | Notes |
|---|---|---|
| Q4_K_M | 16 GB | Best balance of quality/speed |
| Q3_K_M | 8 GB | Lighter, still capable |
| Q8_0 | 24 GB+ | Highest quality |
Download the Embedding Model
Big RAG needs a separate embedding model to convert your documents into searchable vectors. Search for:
nomic-ai/nomic-embed-text-v1.5-GGUF
It's small (~270 MB) and purpose-built for this job. You'll never chat with it — it runs silently in the background.
Verify Everything Works
Click the Chat tab, select Gemma 4 from the model picker, and send a test message. If it responds, you're ready.
Part 2: Installing the Big RAG Plugin
Big RAG is an open-source plugin that indexes an entire folder of documents into a persistent local vector database. It supports PDF, TXT, Markdown, HTML, and images (via OCR).
Prerequisites
You need Node.js — download the LTS version from nodejs.org.
Then bootstrap the lms CLI (it ships with LM Studio):
# macOS / Linux
~/.lmstudio/bin/lms bootstrap
# Windows
cmd /c %USERPROFILE%/.lmstudio/bin/lms.exe bootstrap
Open a new terminal after that, then verify:
lms --version
Clone and Build
git clone https://github.com/ari99/lm_studio_big_rag_plugin.git
cd lm_studio_big_rag_plugin
npm install
npm run build
Install Permanently
# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin
# Windows (PowerShell)
Copy-Item -Recurse . "$env:USERPROFILE\.lmstudio\plugins\lm_studio_big_rag_plugin"
Restart LM Studio, go to Settings → Plugins, and toggle Big RAG on.
Configure It
In Settings → Plugins → Big RAG, set:
- Documents directory — the folder with your files, e.g. `~/Documents/MyKnowledgeBase`
- Vector store directory — where the index lives, e.g. `~/.lmstudio/big-rag-db`
- Embedding model — `nomic-ai/nomic-embed-text-v1.5-GGUF`
- Retrieval limit — `5`
- Affinity threshold — `0.35`
Drop some PDFs or text files into your documents folder, open a new chat with Gemma 4, and send any message. Big RAG will index your documents on first run — you'll see a progress indicator. After that, every question automatically pulls relevant passages.
💡 Tip: If Big RAG says "no relevant content found", lower the affinity threshold to `0.2`. If it's pulling irrelevant results, raise it to `0.5`.
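To make that dial concrete, here's a toy illustration of how a similarity threshold behaves (the scores and filenames are invented, and Big RAG's internal scoring may differ in detail):

```typescript
// Toy example of affinity-threshold filtering. Scores are invented.
const retrieved = [
  { file: "notes.md", score: 0.62 },
  { file: "spec.pdf", score: 0.41 },
  { file: "todo.txt", score: 0.18 },
];

const threshold = 0.35;
const kept = retrieved.filter(r => r.score >= threshold);
// 0.35 → keeps notes.md and spec.pdf
// 0.2  → also keeps todo.txt (more recall, more noise)
// 0.5  → keeps only notes.md (stricter, may drop relevant passages)
```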
Part 3: Adding Persistent Memory to Big RAG
This is the interesting part. Out of the box, Big RAG has no memory between sessions. Every new chat starts completely blank. We're going to fix that by modifying src/promptPreprocessor.ts.
The solution has two layers:
- Within-session history — using LM Studio's `pullHistory()` API to inject recent conversation turns directly into the prompt
- Cross-session memory — using a local `chat_memory.json` file to remember context from past sessions
Step 1 — Install the Memory Dependency
cd lm_studio_big_rag_plugin
npm install lowdb
lowdb is a tiny, zero-dependency JSON file database. Perfect for this.
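If you haven't used it before, the whole API is essentially three calls. A minimal sketch (the file name and `notes` field are just placeholders):

```typescript
import { JSONFilePreset } from 'lowdb/node';

type Data = { notes: string[] };

// Creates example.json with the default data if it doesn't exist yet
const db = await JSONFilePreset<Data>('example.json', { notes: [] });

db.data.notes.push('hello from lowdb');
await db.write();   // persists { "notes": ["hello from lowdb"] } to disk
```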
Step 2 — Add Imports and Types
Open src/promptPreprocessor.ts and add these imports at the top alongside the existing ones:
import { JSONFilePreset } from 'lowdb/node'
import * as path from "path";
Then add the memory schema and helpers before the preprocess function:
type MemorySchema = {
history: Array<{
timestamp: string;
user_text: string;
summary: string;
}>
}
async function getMemory(vectorStoreDir: string) {
const dbPath = path.join(vectorStoreDir, 'chat_memory.json');
const defaultData: MemorySchema = { history: [] };
return await JSONFilePreset<MemorySchema>(dbPath, defaultData);
}
function summarizeText(
text: string,
maxLines: number = 3,
maxChars: number = 400
): string {
const lines = text.split(/\r?\n/).filter(line => line.trim() !== "");
const clippedLines = lines.slice(0, maxLines);
let clipped = clippedLines.join("\n");
if (clipped.length > maxChars) clipped = clipped.slice(0, maxChars);
const needsEllipsis = lines.length > maxLines || text.length > clipped.length;
return needsEllipsis ? `${clipped.trimEnd()}…` : clipped;
}
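A quick sanity check of what `summarizeText` returns with its defaults:

```typescript
const longReply = [
  "Line one of a long assistant reply.",
  "Line two with more detail.",
  "Line three.",
  "Line four that will be dropped.",
].join("\n");

console.log(summarizeText(longReply));
// → the first three lines only, capped at 400 characters, ending in "…"
```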
Step 3 — Replace the Prompt Assembly Block
Find the section inside preprocess() where ragContextFull is assembled and the final prompt is built. Replace that entire block with this:
// ── Within-session chat history ────────────────────────────────────────────
const history = await ctl.pullHistory();
// Chat is an iterable — use getText() / getRole(), not .content (which
// doesn't exist on the typed ChatMessage object)
const allTurns: Array<{ role: string; text: string }> = [];
for (const msg of history) {
allTurns.push({ role: msg.getRole(), text: msg.getText() });
}
const recentTurns = allTurns.slice(-6); // last 3 full exchanges
let historyContext = "";
if (recentTurns.length > 0) {
historyContext = "\n\nRecent conversation history:\n";
for (const msg of recentTurns) {
const role = msg.role === "user" ? "User" : "Assistant";
historyContext += `${role}: ${summarizeText(msg.text, 6, 600)}\n\n`;
}
}
// ── Cross-session persistent memory ───────────────────────────────────────
const memoryDb = await getMemory(vectorStoreDir);
const pastMemories = memoryDb.data.history.slice(-5);
const persistentMemory = pastMemories.length > 0
? "\n\nPersistent memory from past sessions:\n" +
pastMemories.map(m => `- [${m.timestamp}] ${m.summary}`).join("\n")
: "";
// Inject both into the RAG context block
ragContextFull += historyContext + persistentMemory;
ragContextPreview += historyContext + persistentMemory;
// ── Build and return the final prompt ──────────────────────────────────────
const promptTemplate = normalizePromptTemplate(pluginConfig.get("promptTemplate"));
const finalPrompt = fillPromptTemplate(promptTemplate, {
[RAG_CONTEXT_MACRO]: ragContextFull.trimEnd(),
[USER_QUERY_MACRO]: userPrompt,
});
const finalPromptPreview = fillPromptTemplate(promptTemplate, {
[RAG_CONTEXT_MACRO]: ragContextPreview.trimEnd(),
[USER_QUERY_MACRO]: userPrompt,
});
await warnIfContextOverflow(ctl, finalPrompt);
// Write a meaningful memory entry for future sessions
memoryDb.data.history.push({
timestamp: new Date().toISOString(),
user_text: userPrompt,
summary: `Q: ${summarizeText(userPrompt, 1, 100)} | Top doc: ${
results[0] ? path.basename(results[0].filePath) : "none"
}`,
});
await memoryDb.write();
return finalPrompt;
Step 4 — Rebuild and Reinstall
npm run build
# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin
In LM Studio go to Settings → Plugins and toggle Big RAG off then back on.
How It All Works Together
When you send a message, the preprocessor now does four things before Gemma 4 ever sees it:
- Embeds your query with nomic-embed-text and retrieves the most relevant document chunks from the vector index
- Pulls live chat history via `ctl.pullHistory()` — real conversation turns from LM Studio's own engine — and formats the last 3 exchanges as context
- Loads cross-session memory from `chat_memory.json` and injects the last 5 session summaries
- Assembles the final prompt combining all three context sources and passes it to Gemma 4
As part of assembling that prompt, a new memory entry is written recording what you asked and which document was most relevant. This survives new chats, restarts, and LM Studio updates.
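For reference, a single entry in `chat_memory.json` ends up looking like this (values invented):

```typescript
// Hypothetical entry from chat_memory.json
const exampleEntry = {
  timestamp: "2025-01-15T09:32:11.482Z",
  user_text: "What does the onboarding doc say about VPN setup?",
  summary: "Q: What does the onboarding doc say about VPN setup? | Top doc: onboarding.pdf",
};
```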
Common Gotchas
Property 'messages' does not exist on type 'Chat' — Chat is an iterable object, not a plain array. Use for (const msg of history) to iterate it.
Property 'content' does not exist on type 'ChatMessage' — ChatMessage exposes getText() and getRole() methods, not a .content property. The .content field only exists on the raw input format used when building a chat with Chat.from([...]).
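In other words, read messages through the accessor methods, and only use `content` when constructing a chat yourself. A short sketch, assuming `history` from the preprocessor above and `Chat` imported from the `@lmstudio/sdk` package the plugin already depends on:

```typescript
// Reading: ChatMessage exposes getRole() / getText(), not .content
for (const msg of history) {
  console.log(`${msg.getRole()}: ${msg.getText()}`);
}

// Writing: the raw { role, content } shape is only for building a chat,
// e.g. with Chat.from([...])
const chat = Chat.from([
  { role: "user", content: "Hello!" },
]);
```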
lms import . fails with "Path is not a file" — lms import expects a zip, not a folder. Use the cp method above instead.
Plugin doesn't appear after copying — Restart LM Studio fully, not just toggle the plugin.
Tuning Tips
- Chunk size — default 500 tokens works for most docs. Use 700 for dense technical content, 300 for short notes
- Memory size — the `chat_memory.json` file grows indefinitely. Open it in any text editor and prune old entries if needed — it's just JSON (see the pruning sketch after this list)
- Re-indexing — enable the Manual Reindex Trigger toggle in plugin settings to pick up newly added documents
- Top-k — increase the retrieval limit from 5 to 8 if you want more context injected, but watch your context window
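Here's a tiny standalone script for that pruning step. It's a sketch, not part of the plugin: `DB_PATH` assumes the `~/.lmstudio/big-rag-db` vector store directory from earlier, and `KEEP` is up to you.

```typescript
// prune_memory.ts — trim chat_memory.json to the most recent entries.
import { JSONFilePreset } from 'lowdb/node';
import * as path from 'path';
import * as os from 'os';

type MemorySchema = {
  history: Array<{ timestamp: string; user_text: string; summary: string }>;
};

const DB_PATH = path.join(os.homedir(), '.lmstudio', 'big-rag-db', 'chat_memory.json');
const KEEP = 50; // number of recent entries to retain

const db = await JSONFilePreset<MemorySchema>(DB_PATH, { history: [] });
db.data.history = db.data.history.slice(-KEEP);
await db.write();
console.log(`Kept the ${db.data.history.length} most recent memory entries.`);
```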
Final Thoughts
What we've built here is a genuinely useful private knowledge assistant. No monthly fee, no API key, no data leaving your machine. Gemma 4 is strong on instruction following, nomic-embed-text is one of the best local embedding models available, and Big RAG's incremental indexing means your document library can grow without penalty.
The persistent memory piece is a workaround for what will eventually be a first-class LM Studio feature — the plugin SDK is still in beta. But it works reliably today, and you own it completely.
Tested on LM Studio 0.4.12, macOS Sequoia, Apple M-series. Windows commands are included where they differ.
Drop any questions in the comments — happy to help troubleshoot your setup.