Run a Fully Local AI With Persistent Memory: LM Studio + Big RAG Guide

navid mirnouri

If you've ever wanted a completely private, offline AI assistant that actually remembers what's in your documents — and doesn't forget your conversation the moment you open a new chat — this guide is for you.

We're going to:

  1. Set up LM Studio running Google's Gemma 4 locally
  2. Install the Big RAG plugin to index your documents
  3. Modify the plugin source to add genuine persistent memory across sessions

No cloud. No subscriptions. No data leaving your machine.


What Is RAG and Why Does It Matter?

RAG (Retrieval-Augmented Generation) lets you point a language model at your own files — PDFs, notes, documentation, whatever — and ask questions about them. Instead of the model relying on what it learned during training, it searches your documents in real time and injects the most relevant passages into the prompt before generating a response.

The result: accurate, grounded answers from your data, not hallucinated guesses.
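In miniature, the retrieval step is just nearest-neighbour search over embedding vectors. Here's a toy sketch of the idea — the three-number "embeddings" are made up for illustration (a real embedding model produces vectors with hundreds of dimensions), but the ranking and prompt assembly work the same way:

```typescript
// Toy RAG retrieval: rank passages by cosine similarity to the query
// embedding, then inject the best match into the prompt.
type Passage = { text: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function retrieve(query: number[], passages: Passage[], topK: number): Passage[] {
  return [...passages]
    .sort((p, q) =>
      cosineSimilarity(query, q.embedding) - cosineSimilarity(query, p.embedding))
    .slice(0, topK);
}

const passages: Passage[] = [
  { text: "Invoices are due within 30 days.", embedding: [0.9, 0.1, 0.0] },
  { text: "The office is closed on Fridays.", embedding: [0.0, 0.2, 0.9] },
];
const queryEmbedding = [0.8, 0.2, 0.1]; // pretend embedding of "when are invoices due?"

const context = retrieve(queryEmbedding, passages, 1)
  .map(p => p.text)
  .join("\n");
const prompt = `Context:\n${context}\n\nQuestion: When are invoices due?`;
```

The model never sees the whole document set — only the top-scoring chunks, which is what keeps this workable within a finite context window.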


Part 1: Installing LM Studio and Loading Gemma 4

Download LM Studio

Head to lmstudio.ai and grab the installer for your platform. It supports macOS (Apple Silicon + Intel), Windows, and Linux.

Download Gemma 4

In the left sidebar, click the Discover tab and search for gemma-4. Pick the quantisation that fits your hardware:

| Quantisation | RAM needed | Notes |
| --- | --- | --- |
| Q4_K_M | 16 GB | Best balance of quality/speed |
| Q3_K_M | 8 GB | Lighter, still capable |
| Q8_0 | 24 GB+ | Highest quality |

Download the Embedding Model

Big RAG needs a separate embedding model to convert your documents into searchable vectors. Search for:

nomic-ai/nomic-embed-text-v1.5-GGUF

It's small (~270 MB) and purpose-built for this job. You'll never chat with it — it runs silently in the background.

Verify Everything Works

Click the Chat tab, select Gemma 4 from the model picker, and send a test message. If it responds, you're ready.


Part 2: Installing the Big RAG Plugin

Big RAG is an open-source plugin that indexes an entire folder of documents into a persistent local vector database. It supports PDF, TXT, Markdown, HTML, and images (via OCR).

Prerequisites

You need Node.js — download the LTS version from nodejs.org.

Then bootstrap the lms CLI (it ships with LM Studio):

# macOS / Linux
~/.lmstudio/bin/lms bootstrap

# Windows
cmd /c %USERPROFILE%/.lmstudio/bin/lms.exe bootstrap

Open a new terminal after that, then verify:

lms --version

Clone and Build

git clone https://github.com/ari99/lm_studio_big_rag_plugin.git
cd lm_studio_big_rag_plugin
npm install
npm run build

Install Permanently

# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin

# Windows (PowerShell)
Copy-Item -Recurse . "$env:USERPROFILE\.lmstudio\plugins\lm_studio_big_rag_plugin"

Restart LM Studio, go to Settings → Plugins, and toggle Big RAG on.

Configure It

In Settings → Plugins → Big RAG, set:

  • Documents directory — the folder with your files, e.g. ~/Documents/MyKnowledgeBase
  • Vector store directory — where the index lives, e.g. ~/.lmstudio/big-rag-db
  • Embedding model — nomic-ai/nomic-embed-text-v1.5-GGUF
  • Retrieval limit — 5
  • Affinity threshold — 0.35

Drop some PDFs or text files into your documents folder, open a new chat with Gemma 4, and send any message. Big RAG will index your documents on first run — you'll see a progress indicator. After that, every question automatically pulls relevant passages.

💡 Tip: If Big RAG says "no relevant content found", lower the affinity threshold to 0.2. If it's pulling irrelevant results, raise it to 0.5.
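Assuming Big RAG scores matches with a similarity metric (which is how vector stores typically work), the affinity threshold acts as a minimum-score cut-off: chunks below it are dropped before anything is injected into the prompt. A sketch of the effect, with illustrative scores:

```typescript
// How a similarity cut-off changes what gets injected.
// The scores below are invented for illustration, not real Big RAG output.
type ScoredChunk = { text: string; score: number };

function applyThreshold(chunks: ScoredChunk[], threshold: number): ScoredChunk[] {
  return chunks.filter(c => c.score >= threshold);
}

const retrieved: ScoredChunk[] = [
  { text: "Closely related passage", score: 0.62 },
  { text: "Loosely related passage", score: 0.41 },
  { text: "Barely related passage", score: 0.22 },
];

const strict = applyThreshold(retrieved, 0.5);   // only the close match survives
const standard = applyThreshold(retrieved, 0.35); // close + loose matches
const loose = applyThreshold(retrieved, 0.2);    // everything gets through
```

Lowering the threshold trades precision for recall: you get more context, but more of it may be noise.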


Part 3: Adding Persistent Memory to Big RAG

This is the interesting part. Out of the box, Big RAG has no memory between sessions. Every new chat starts completely blank. We're going to fix that by modifying src/promptPreprocessor.ts.

The solution has two layers:

  • Within-session history — using LM Studio's pullHistory() API to inject recent conversation turns directly into the prompt
  • Cross-session memory — using a local chat_memory.json file to remember context from past sessions

Step 1 — Install the Memory Dependency

cd lm_studio_big_rag_plugin
npm install lowdb

lowdb is a tiny JSON file database with a minimal dependency footprint. Perfect for this.
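Under the hood, the pattern lowdb wraps is just read-JSON, mutate, write-JSON. A dependency-free sketch of the same idea using Node's built-in fs (lowdb's JSONFilePreset adds a nicer API and safer writes on top of this):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Minimal read-modify-write JSON store — the pattern lowdb wraps.
type MemorySchema = { history: Array<{ timestamp: string; summary: string }> };

function loadMemory(file: string): MemorySchema {
  if (!fs.existsSync(file)) return { history: [] }; // default data on first run
  return JSON.parse(fs.readFileSync(file, "utf8")) as MemorySchema;
}

function saveMemory(file: string, data: MemorySchema): void {
  fs.writeFileSync(file, JSON.stringify(data, null, 2));
}

// Usage: append one entry and persist it (in a throwaway temp dir here).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "big-rag-demo-"));
const dbFile = path.join(dir, "chat_memory.json");

const mem = loadMemory(dbFile);
mem.history.push({
  timestamp: new Date().toISOString(),
  summary: "Q: payment terms | Top doc: billing.pdf",
});
saveMemory(dbFile, mem);
```

Because the whole store is one small JSON file, it survives restarts for free — which is exactly the property we want for cross-session memory.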

Step 2 — Add Imports and Types

Open src/promptPreprocessor.ts and add these imports at the top alongside the existing ones:

import { JSONFilePreset } from 'lowdb/node'
import * as path from "path";

Then add the memory schema and helpers before the preprocess function:

type MemorySchema = {
  history: Array<{
    timestamp: string;
    user_text: string;
    summary: string;
  }>
}

async function getMemory(vectorStoreDir: string) {
  const dbPath = path.join(vectorStoreDir, 'chat_memory.json');
  const defaultData: MemorySchema = { history: [] };
  return await JSONFilePreset<MemorySchema>(dbPath, defaultData);
}

function summarizeText(
  text: string,
  maxLines: number = 3,
  maxChars: number = 400
): string {
  const lines = text.split(/\r?\n/).filter(line => line.trim() !== "");
  const clippedLines = lines.slice(0, maxLines);
  let clipped = clippedLines.join("\n");
  if (clipped.length > maxChars) clipped = clipped.slice(0, maxChars);
  const needsEllipsis = lines.length > maxLines || text.length > clipped.length;
  return needsEllipsis ? `${clipped.trimEnd()}…` : clipped;
}
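To make the helper's behaviour concrete, here it is again (repeated so the snippet runs on its own) with a quick usage check — it keeps the first few non-empty lines and appends an ellipsis whenever anything was cut:

```typescript
// Same summarizeText as above, shown with example inputs.
function summarizeText(
  text: string,
  maxLines: number = 3,
  maxChars: number = 400
): string {
  const lines = text.split(/\r?\n/).filter(line => line.trim() !== "");
  const clippedLines = lines.slice(0, maxLines);
  let clipped = clippedLines.join("\n");
  if (clipped.length > maxChars) clipped = clipped.slice(0, maxChars);
  const needsEllipsis = lines.length > maxLines || text.length > clipped.length;
  return needsEllipsis ? `${clipped.trimEnd()}…` : clipped;
}

const short = summarizeText("line one\nline two\nline three\nline four");
// keeps the first three lines, drops "line four", ends with "…"

const whole = summarizeText("just one line");
// fits entirely, so it comes back untouched with no ellipsis
```

This keeps each remembered turn small so the injected history doesn't crowd out the retrieved document chunks.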

Step 3 — Replace the Prompt Assembly Block

Find the section inside preprocess() where ragContextFull is assembled and the final prompt is built. Replace that entire block with this:

// ── Within-session chat history ────────────────────────────────────────────
const history = await ctl.pullHistory();

// Chat is an iterable — use getText() / getRole(), not .content (which
// doesn't exist on the typed ChatMessage object)
const allTurns: Array<{ role: string; text: string }> = [];
for (const msg of history) {
  allTurns.push({ role: msg.getRole(), text: msg.getText() });
}
const recentTurns = allTurns.slice(-6); // last 3 full exchanges

let historyContext = "";
if (recentTurns.length > 0) {
  historyContext = "\n\nRecent conversation history:\n";
  for (const msg of recentTurns) {
    const role = msg.role === "user" ? "User" : "Assistant";
    historyContext += `${role}: ${summarizeText(msg.text, 6, 600)}\n\n`;
  }
}

// ── Cross-session persistent memory ───────────────────────────────────────
const memoryDb = await getMemory(vectorStoreDir);
const pastMemories = memoryDb.data.history.slice(-5);
const persistentMemory = pastMemories.length > 0
  ? "\n\nPersistent memory from past sessions:\n" +
    pastMemories.map(m => `- [${m.timestamp}] ${m.summary}`).join("\n")
  : "";

// Inject both into the RAG context block
ragContextFull    += historyContext + persistentMemory;
ragContextPreview += historyContext + persistentMemory;

// ── Build and return the final prompt ──────────────────────────────────────
const promptTemplate = normalizePromptTemplate(pluginConfig.get("promptTemplate"));
const finalPrompt = fillPromptTemplate(promptTemplate, {
  [RAG_CONTEXT_MACRO]: ragContextFull.trimEnd(),
  [USER_QUERY_MACRO]: userPrompt,
});
const finalPromptPreview = fillPromptTemplate(promptTemplate, {
  [RAG_CONTEXT_MACRO]: ragContextPreview.trimEnd(),
  [USER_QUERY_MACRO]: userPrompt,
});

await warnIfContextOverflow(ctl, finalPrompt);

// Write a meaningful memory entry for future sessions
memoryDb.data.history.push({
  timestamp: new Date().toISOString(),
  user_text: userPrompt,
  summary: `Q: ${summarizeText(userPrompt, 1, 100)} | Top doc: ${
    results[0] ? path.basename(results[0].filePath) : "none"
  }`,
});
await memoryDb.write();

return finalPrompt;

Step 4 — Rebuild and Reinstall

npm run build

# macOS / Linux
cp -r . ~/.lmstudio/plugins/lm_studio_big_rag_plugin

In LM Studio go to Settings → Plugins and toggle Big RAG off then back on.


How It All Works Together

When you send a message, the preprocessor now does four things before Gemma 4 ever sees it:

  1. Embeds your query with nomic-embed-text and retrieves the most relevant document chunks from the vector index
  2. Pulls live chat history via ctl.pullHistory() — real conversation turns from LM Studio's own engine — and formats the last 3 exchanges as context
  3. Loads cross-session memory from chat_memory.json and injects the last 5 session summaries
  4. Assembles the final prompt combining all three context sources and passes it to Gemma 4

After the response, a new memory entry is written recording what you asked and which document was most relevant. This survives new chats, restarts, and LM Studio updates.
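Concretely, after a couple of sessions the chat_memory.json file will contain entries shaped like this (values are illustrative; the structure follows the MemorySchema type from Step 2):

```json
{
  "history": [
    {
      "timestamp": "2025-01-12T09:14:03.511Z",
      "user_text": "What are our payment terms?",
      "summary": "Q: What are our payment terms? | Top doc: billing-policy.pdf"
    },
    {
      "timestamp": "2025-01-13T16:42:51.902Z",
      "user_text": "Summarise the Q3 report",
      "summary": "Q: Summarise the Q3 report | Top doc: q3-report.pdf"
    }
  ]
}
```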


Common Gotchas

Property 'messages' does not exist on type 'Chat' — Chat is an iterable object, not a plain array. Use for (const msg of history) to iterate it.

Property 'content' does not exist on type 'ChatMessage' — ChatMessage exposes getText() and getRole() methods, not a .content property. The .content field only exists on the raw input format used when building a chat with Chat.from([...]).

lms import . fails with "Path is not a file" — lms import expects a zip, not a folder. Use the cp method above instead.

Plugin doesn't appear after copying — Restart LM Studio fully, not just toggle the plugin.


Tuning Tips

  • Chunk size — default 500 tokens works for most docs. Use 700 for dense technical content, 300 for short notes
  • Memory size — the chat_memory.json file grows indefinitely. Open it in any text editor and prune old entries if needed — it's just a JSON array
  • Re-indexing — enable the Manual Reindex Trigger toggle in plugin settings to pick up newly added documents
  • Top-k — increase the retrieval limit from 5 to 8 if you want more context injected, but watch your context window
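The memory pruning mentioned above can also be scripted. A small standalone sketch that keeps only the newest N entries — run it while LM Studio is closed so the plugin isn't writing the file concurrently, and point it at your actual vector store directory (the demo below uses a throwaway temp file):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Trim a Big RAG memory file down to its newest `keep` entries.
// Returns how many entries were dropped.
function pruneMemory(file: string, keep: number): number {
  const data = JSON.parse(fs.readFileSync(file, "utf8")) as {
    history: Array<{ timestamp: string; user_text: string; summary: string }>;
  };
  const removed = Math.max(0, data.history.length - keep);
  data.history = data.history.slice(-keep); // newest entries are at the end
  fs.writeFileSync(file, JSON.stringify(data, null, 2));
  return removed;
}

// Demo on a throwaway file with five fake entries:
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "prune-demo-"));
const file = path.join(dir, "chat_memory.json");
fs.writeFileSync(file, JSON.stringify({
  history: Array.from({ length: 5 }, (_, i) => ({
    timestamp: `t${i}`, user_text: `q${i}`, summary: `s${i}`,
  })),
}));
const removed = pruneMemory(file, 2); // drops the three oldest entries
```

For real use you'd call it as something like pruneMemory("~/.lmstudio/big-rag-db/chat_memory.json", 50) — with the path expanded to wherever you set the vector store directory.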

Final Thoughts

What we've built here is a genuinely useful private knowledge assistant. No monthly fee, no API key, no data leaving your machine. Gemma 4 is strong on instruction following, nomic-embed-text is one of the best local embedding models available, and Big RAG's incremental indexing means your document library can grow without penalty.

The persistent memory piece is a workaround for what will eventually be a first-class LM Studio feature — the plugin SDK is still in beta. But it works reliably today, and you own it completely.


Tested on LM Studio 0.4.12, macOS Sequoia, Apple M-series. Windows commands are included where they differ.

Drop any questions in the comments — happy to help troubleshoot your setup.
