If you've ever wanted a completely private, offline AI assistant that actually remembers what's in your documents — and doesn't forget your conversation the moment you open a new chat — this guide is for you.
We're going to:
- Set up LM Studio running Google's Gemma 4 locally
- Install the Big RAG plugin to index your documents
- Modify the plugin source to add genuine persistent memory across sessions
No cloud. No subscriptions. No data leaving your machine.
What Is RAG and Why Does It Matter?
RAG (Retrieval-Augmented Generation) lets you point a language model at your own files — PDFs, notes, documentation, whatever — and ask questions about them. Instead of the model relying on what it learned during training, it searches your documents in real time and injects the most relevant passages into the prompt before generating a response.
The result: accurate, grounded answers from your data, not hallucinated guesses.
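In pseudocode terms, the whole loop is only a few steps. Here's a rough TypeScript sketch of the idea (not Big RAG's actual implementation; the `embed` and `generate` helpers are stand-ins for whatever embedding model and LLM calls you wire up):

```typescript
// Illustrative only: the bare-bones RAG loop.
// embed() and generate() are placeholders for your embedding model and LLM.
type Chunk = { text: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function answerWithRag(
  question: string,
  chunks: Chunk[],                                // pre-indexed document chunks
  embed: (text: string) => Promise<number[]>,     // e.g. nomic-embed-text
  generate: (prompt: string) => Promise<string>   // e.g. Gemma via LM Studio
): Promise<string> {
  const queryVec = await embed(question);

  // Rank every chunk by similarity to the question and keep the best few
  const topChunks = chunks
    .map(c => ({ c, score: cosine(queryVec, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .map(x => x.c.text);

  // Inject the retrieved passages ahead of the user's question
  const prompt =
    `Answer using only the context below.\n\n` +
    `Context:\n${topChunks.join("\n---\n")}\n\n` +
    `Question: ${question}`;
  return generate(prompt);
}
```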
Part 1: Installing LM Studio and Loading Gemma 4
Download LM Studio
Head to lmstudio.ai and grab the installer for your platform. It supports macOS (Apple Silicon + Intel), Windows, and Linux.
Download Gemma 4
In the left sidebar, click the Discover tab and search for gemma-4. Pick the quantisation that fits your hardware:
| Quantisation | RAM needed | Notes |
|---|---|---|
| Q4_K_M | 16 GB | Best balance of quality/speed |
| Q3_K_M | 8 GB | Lighter, still capable |
| Q8_0 | 24 GB+ | Highest quality |
Download the Embedding Model
Big RAG needs a separate embedding model to convert your documents into searchable vectors. Search for:
nomic-ai/nomic-embed-text-v1.5-GGUF
It's small (~270 MB) and purpose-built for this job. You'll never chat with it — it runs silently in the background.
Verify Everything Works
Click the Chat tab, select Gemma 4 from the model picker, and send a test message. If it responds, you're ready.
Part 2: Installing the Big RAG Plugin
Big RAG is an open-source plugin that indexes an entire folder of documents into a persistent local vector database. It supports PDF, TXT, Markdown, HTML, and images (via OCR).
Prerequisites
You need Node.js — download the LTS version from nodejs.org.
Then bootstrap the lms CLI (it ships with LM Studio):
# macOS / Linux
~/.lmstudio/bin/lms bootstrap
# Windows
cmd /c %USERPROFILE%/.lmstudio/bin/lms.exe bootstrap
Open a new terminal after that, then verify:
lms --version
Clone and Build
git clone https://github.com/ari99/lm_studio_big_rag_plugin.git
cd lm_studio_big_rag_plugin
npm install
npm run build
Install Permanently
# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin
# Windows (PowerShell)
Copy-Item -Recurse . "$env:USERPROFILE\.lmstudio\plugins\lm_studio_big_rag_plugin"
Restart LM Studio, go to Settings → Plugins, and toggle Big RAG on.
Configure It
In Settings → Plugins → Big RAG, set:
- Documents directory — the folder with your files, e.g. `~/Documents/MyKnowledgeBase`
- Vector store directory — where the index lives, e.g. `~/.lmstudio/big-rag-db`
- Embedding model — `nomic-ai/nomic-embed-text-v1.5-GGUF`
- Retrieval limit — `5`
- Affinity threshold — `0.35`
Drop some PDFs or text files into your documents folder, open a new chat with Gemma 4, and send any message. Big RAG will index your documents on first run — you'll see a progress indicator. After that, every question automatically pulls relevant passages.
💡 Tip: If Big RAG says "no relevant content found", lower the affinity threshold to `0.2`. If it's pulling irrelevant results, raise it to `0.5`.
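To make that dial concrete, here's a toy illustration of how a similarity threshold behaves (the scores and filenames are invented, and Big RAG's internal scoring may differ in detail):

```typescript
// Toy example of affinity-threshold filtering. Scores are invented.
const retrieved = [
  { file: "notes.md", score: 0.62 },
  { file: "spec.pdf", score: 0.41 },
  { file: "todo.txt", score: 0.18 },
];

const threshold = 0.35;
const kept = retrieved.filter(r => r.score >= threshold);
// 0.35 → keeps notes.md and spec.pdf
// 0.2  → also keeps todo.txt (more recall, more noise)
// 0.5  → keeps only notes.md (stricter, may drop relevant passages)
```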
Part 3: Adding Persistent Memory to Big RAG
This is the interesting part. Out of the box, Big RAG has no memory between sessions. Every new chat starts completely blank. We're going to fix that by modifying src/promptPreprocessor.ts.
The solution has two layers:
- Within-session history — using LM Studio's `pullHistory()` API to inject recent conversation turns directly into the prompt
- Cross-session memory — using a local `chat_memory.json` file to remember context from past sessions
Step 1 — Install the Memory Dependency
cd lm_studio_big_rag_plugin
npm install lowdb
lowdb is a tiny, zero-dependency JSON file database. Perfect for this.
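If you haven't used it before, the whole API is essentially three calls. A minimal sketch (the file name and `notes` field are just placeholders):

```typescript
import { JSONFilePreset } from 'lowdb/node';

type Data = { notes: string[] };

// Creates example.json with the default data if it doesn't exist yet
const db = await JSONFilePreset<Data>('example.json', { notes: [] });

db.data.notes.push('hello from lowdb');
await db.write();   // persists { "notes": ["hello from lowdb"] } to disk
```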
Step 2 — Add Imports and Types
Open src/promptPreprocessor.ts and add these imports at the top alongside the existing ones:
import { JSONFilePreset } from 'lowdb/node'
import * as path from "path";
Then add the memory schema and helpers before the preprocess function:
type MemorySchema = {
history: Array<{
timestamp: string;
user_text: string;
summary: string;
}>
}
async function getMemory(vectorStoreDir: string) {
const dbPath = path.join(vectorStoreDir, 'chat_memory.json');
const defaultData: MemorySchema = { history: [] };
return await JSONFilePreset<MemorySchema>(dbPath, defaultData);
}
function summarizeText(
text: string,
maxLines: number = 3,
maxChars: number = 400
): string {
const lines = text.split(/\r?\n/).filter(line => line.trim() !== "");
const clippedLines = lines.slice(0, maxLines);
let clipped = clippedLines.join("\n");
if (clipped.length > maxChars) clipped = clipped.slice(0, maxChars);
const needsEllipsis = lines.length > maxLines || text.length > clipped.length;
return needsEllipsis ? `${clipped.trimEnd()}…` : clipped;
}
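A quick sanity check of what `summarizeText` returns with its defaults:

```typescript
const longReply = [
  "Line one of a long assistant reply.",
  "Line two with more detail.",
  "Line three.",
  "Line four that will be dropped.",
].join("\n");

console.log(summarizeText(longReply));
// → the first three lines only, capped at 400 characters, ending in "…"
```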
Step 3 — Replace the Prompt Assembly Block
Find the section inside preprocess() where ragContextFull is assembled and the final prompt is built. Replace that entire block with this:
// ── Within-session chat history ────────────────────────────────────────────
const history = await ctl.pullHistory();
// Chat is an iterable — use getText() / getRole(), not .content (which
// doesn't exist on the typed ChatMessage object)
const allTurns: Array<{ role: string; text: string }> = [];
for (const msg of history) {
allTurns.push({ role: msg.getRole(), text: msg.getText() });
}
const recentTurns = allTurns.slice(-6); // last 3 full exchanges
let historyContext = "";
if (recentTurns.length > 0) {
historyContext = "\n\nRecent conversation history:\n";
for (const msg of recentTurns) {
const role = msg.role === "user" ? "User" : "Assistant";
historyContext += `${role}: ${summarizeText(msg.text, 6, 600)}\n\n`;
}
}
// ── Cross-session persistent memory ───────────────────────────────────────
const memoryDb = await getMemory(vectorStoreDir);
const pastMemories = memoryDb.data.history.slice(-5);
const persistentMemory = pastMemories.length > 0
? "\n\nPersistent memory from past sessions:\n" +
pastMemories.map(m => `- [${m.timestamp}] ${m.summary}`).join("\n")
: "";
// Inject both into the RAG context block
ragContextFull += historyContext + persistentMemory;
ragContextPreview += historyContext + persistentMemory;
// ── Build and return the final prompt ──────────────────────────────────────
const promptTemplate = normalizePromptTemplate(pluginConfig.get("promptTemplate"));
const finalPrompt = fillPromptTemplate(promptTemplate, {
[RAG_CONTEXT_MACRO]: ragContextFull.trimEnd(),
[USER_QUERY_MACRO]: userPrompt,
});
const finalPromptPreview = fillPromptTemplate(promptTemplate, {
[RAG_CONTEXT_MACRO]: ragContextPreview.trimEnd(),
[USER_QUERY_MACRO]: userPrompt,
});
await warnIfContextOverflow(ctl, finalPrompt);
// Write a meaningful memory entry for future sessions
memoryDb.data.history.push({
timestamp: new Date().toISOString(),
user_text: userPrompt,
summary: `Q: ${summarizeText(userPrompt, 1, 100)} | Top doc: ${
results[0] ? path.basename(results[0].filePath) : "none"
}`,
});
await memoryDb.write();
return finalPrompt;
Step 4 — Rebuild and Reinstall
npm run build
# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin
In LM Studio go to Settings → Plugins and toggle Big RAG off then back on.
How It All Works Together
When you send a message, the preprocessor now does four things before Gemma 4 ever sees it:
- Embeds your query with nomic-embed-text and retrieves the most relevant document chunks from the vector index
- Pulls live chat history via `ctl.pullHistory()` — real conversation turns from LM Studio's own engine — and formats the last 3 exchanges as context
- Loads cross-session memory from `chat_memory.json` and injects the last 5 session summaries
- Assembles the final prompt combining all three context sources and passes it to Gemma 4
As part of assembling that prompt, a new memory entry is written recording what you asked and which document was most relevant. This survives new chats, restarts, and LM Studio updates.
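For reference, a single entry in `chat_memory.json` ends up looking like this (values invented):

```typescript
// Hypothetical entry from chat_memory.json
const exampleEntry = {
  timestamp: "2025-01-15T09:32:11.482Z",
  user_text: "What does the onboarding doc say about VPN setup?",
  summary: "Q: What does the onboarding doc say about VPN setup? | Top doc: onboarding.pdf",
};
```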
Common Gotchas
Property 'messages' does not exist on type 'Chat' — Chat is an iterable object, not a plain array. Use for (const msg of history) to iterate it.
Property 'content' does not exist on type 'ChatMessage' — ChatMessage exposes getText() and getRole() methods, not a .content property. The .content field only exists on the raw input format used when building a chat with Chat.from([...]).
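In other words, read messages through the accessor methods, and only use `content` when constructing a chat yourself. A short sketch, assuming `history` from the preprocessor above and `Chat` imported from the `@lmstudio/sdk` package the plugin already depends on:

```typescript
// Reading: ChatMessage exposes getRole() / getText(), not .content
for (const msg of history) {
  console.log(`${msg.getRole()}: ${msg.getText()}`);
}

// Writing: the raw { role, content } shape is only for building a chat,
// e.g. with Chat.from([...])
const chat = Chat.from([
  { role: "user", content: "Hello!" },
]);
```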
lms import . fails with "Path is not a file" — lms import expects a zip, not a folder. Use the cp method above instead.
Plugin doesn't appear after copying — Restart LM Studio fully, not just toggle the plugin.
Tuning Tips
- Chunk size — default 500 tokens works for most docs. Use 700 for dense technical content, 300 for short notes
- Memory size — the `chat_memory.json` file grows indefinitely. Open it in any text editor and prune old entries if needed — it's just JSON (see the pruning sketch after this list)
- Re-indexing — enable the Manual Reindex Trigger toggle in plugin settings to pick up newly added documents
- Top-k — increase the retrieval limit from 5 to 8 if you want more context injected, but watch your context window
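Here's a tiny standalone script for that pruning step. It's a sketch, not part of the plugin: `DB_PATH` assumes the `~/.lmstudio/big-rag-db` vector store directory from earlier, and `KEEP` is up to you.

```typescript
// prune_memory.ts — trim chat_memory.json to the most recent entries.
import { JSONFilePreset } from 'lowdb/node';
import * as path from 'path';
import * as os from 'os';

type MemorySchema = {
  history: Array<{ timestamp: string; user_text: string; summary: string }>;
};

const DB_PATH = path.join(os.homedir(), '.lmstudio', 'big-rag-db', 'chat_memory.json');
const KEEP = 50; // number of recent entries to retain

const db = await JSONFilePreset<MemorySchema>(DB_PATH, { history: [] });
db.data.history = db.data.history.slice(-KEEP);
await db.write();
console.log(`Kept the ${db.data.history.length} most recent memory entries.`);
```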
Final Thoughts
What we've built here is a genuinely useful private knowledge assistant. No monthly fee, no API key, no data leaving your machine. Gemma 4 is strong on instruction following, nomic-embed-text is one of the best local embedding models available, and Big RAG's incremental indexing means your document library can grow without penalty.
The persistent memory piece is a workaround for what will eventually be a first-class LM Studio feature — the plugin SDK is still in beta. But it works reliably today, and you own it completely.
Tested on LM Studio 0.4.12, macOS Sequoia, Apple M-series. Windows commands are included where they differ.
Drop any questions in the comments — happy to help troubleshoot your setup.