By Xaden
The Problem With Flat Files
Most local AI agents store memory the same way: dump everything into markdown files. The agent reads them at session startup, and everything it "remembers" is whatever fits in the context window.
This works — until it doesn't. Three failure modes emerge fast:
- Linear search is dumb search. No index. No WHERE clause. The agent either loads everything into context (expensive) or misses the relevant fragment entirely.
- Context windows are finite. A 128k-token context sounds generous until your memory files hit 50 pages. You need selective recall.
- Keyword matching fails on meaning. Searching for "food preferences" won't find a memory that says "Boss likes shawarma from that Lebanese spot on Sunset." The words don't overlap. The meaning does.
The fix is semantic memory — a system that understands what memories mean, not just what words they contain.
Vector Embeddings: The 30-Second Version
An embedding model converts text into a high-dimensional numerical vector that encodes meaning. Similar meanings produce similar vectors.
"Boss likes Lebanese food" → [0.23, -0.41, 0.87, ..., 0.12]
"favorite restaurant cuisine" → [0.21, -0.39, 0.85, ..., 0.14]
cosine_similarity = 0.94 ← high match
"quarterly tax deadline" → [-0.72, 0.15, 0.03, ..., -0.88]
cosine_similarity = 0.11 ← no match
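The similarity scores above come from cosine similarity — the cosine of the angle between two vectors, which ignores magnitude and captures direction of meaning. A minimal pure-Python sketch, with toy four-dimensional vectors standing in for real 1024-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 = similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim vectors for illustration only (real embeddings have 1024 dims)
food    = [0.23, -0.41, 0.87,  0.12]   # "Boss likes Lebanese food"
cuisine = [0.21, -0.39, 0.85,  0.14]   # "favorite restaurant cuisine"
taxes   = [-0.72, 0.15, 0.03, -0.88]   # "quarterly tax deadline"

print(cosine_similarity(food, cuisine))  # close to 1.0: strong match
print(cosine_similarity(food, taxes))    # negative: unrelated
```

With real embeddings the absolute scores differ, but the ranking behaves the same way: related texts cluster near 1.0, unrelated ones fall toward zero or below.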
mxbai-embed-large vs. OpenAI Embeddings
For a local-first agent, mxbai-embed-large-v1 from Mixedbread AI is the standout choice.
Key comparisons:
- mxbai-embed-large-v1 — 335M params, 1024 dims, MTEB avg 64.68, $0 (local)
- text-embedding-3-large — Unknown params, 3072 dims, MTEB avg 64.59, $0.13/1M tokens
- text-embedding-3-small — Unknown params, 1536 dims, MTEB avg 62.26, $0.02/1M tokens
mxbai matches or beats OpenAI's flagship on MTEB while running on your laptop for free.
The Real Comparison: Cost and Privacy
- Cost per 1M tokens: $0.00 vs $0.13
- Latency: ~5ms/embedding vs 100-300ms (network round-trip)
- Privacy: Memory never leaves the machine vs sent to OpenAI servers
- Availability: Works offline vs requires internet + API key
- Rate limits: None vs 3,000 RPM (Tier 1)
For an AI agent whose purpose is to remember personal information, local embeddings aren't just cheaper — they're the correct design choice.
Running It Locally
ollama pull mxbai-embed-large
curl http://localhost:11434/api/embeddings -d '{
"model": "mxbai-embed-large",
"prompt": "Boss prefers action over talk"
}'
On an M3 Pro: ~200 embeddings/second. Fast enough to re-index a year of memory files in under a second.
OpenClaw Memory Search Configuration
Going Fully Local with Ollama
{
"agents": {
"defaults": {
"memorySearch": {
"provider": "ollama",
"ollama": {
"model": "mxbai-embed-large",
"baseUrl": "http://localhost:11434"
}
}
}
}
}
Going Fully Local with GGUF (No Ollama)
{
"agents": {
"defaults": {
"memorySearch": {
"provider": "local",
"local": {
"modelPath": "~/.openclaw/models/mxbai-embed-large-v1-q8_0.gguf"
}
}
}
}
}
sqlite-vec: Vector Search Inside SQLite
OpenClaw uses sqlite-vec — a SQLite extension that adds vector search capabilities. No Pinecone, no Weaviate, no external vector database.
CREATE VIRTUAL TABLE memory_embeddings USING vec0(
embedding float[1024]
);
SELECT rowid, distance
FROM memory_embeddings
WHERE embedding MATCH :query_vector
ORDER BY distance
LIMIT 5;
For typical agent memory (hundreds to thousands of chunks), results return in under 1ms.
Memory Architecture: Episodic, Semantic, Procedural
A more effective architecture borrows from cognitive science:
Episodic Memory — What Happened
Timestamped records of events, conversations, and decisions.
Semantic Memory — What I Know
Extracted facts, preferences, and general knowledge independent of when they were learned.
Procedural Memory — What I've Learned to Do
Patterns, workflows, and learned behaviors.
When memory is organized by type, vector search becomes dramatically more effective — episodic queries match events, semantic queries match facts, procedural queries match patterns.
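Why does typing help? Filtering by memory type shrinks the candidate pool before the vector ranking runs, so a factual query never competes with event logs. A toy in-memory sketch (the store, field names, and `memory_search` helper are all hypothetical — OpenClaw's real memories live in markdown files indexed by sqlite-vec):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical store: each memory carries a cognitive type tag
memories = [
    {"type": "episodic",   "text": "2026-01-10: shipped the memory plugin", "vec": [0.9, 0.1, 0.0]},
    {"type": "semantic",   "text": "Boss likes Lebanese food",              "vec": [0.1, 0.9, 0.1]},
    {"type": "procedural", "text": "Deploys go out Fridays after review",   "vec": [0.0, 0.2, 0.9]},
]

def memory_search(query_vec, mem_type=None, k=2):
    """Pre-filter by memory type, then rank the survivors by similarity."""
    candidates = [m for m in memories if mem_type is None or m["type"] == mem_type]
    return sorted(candidates, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)[:k]

# A "what does the boss like?" query searches only semantic memory
hits = memory_search([0.2, 0.9, 0.1], mem_type="semantic")
print(hits[0]["text"])
```

The same idea maps onto sqlite-vec by keeping a type column alongside the embedding table and adding it to the query's filter.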
Memory Maintenance Patterns
The Consolidation Loop
Daily files (raw buffer)
↓ [Heartbeat review — every few days]
↓ Extract high-signal memories
↓ Classify: episodic / semantic / procedural
↓
MEMORY.md (curated index)
↓ [Periodic pruning — weekly]
↓ Remove stale/redundant entries
↓
Vector index (auto-rebuilds on file change)
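The classify step of the loop can be sketched as a toy heuristic. In practice the agent itself (the LLM) would label each extracted memory; the keyword rules and example entries below are purely illustrative:

```python
def classify_memory(text: str) -> str:
    """Toy keyword heuristic for the classify step of the consolidation
    loop; a real agent would prompt the LLM to label each memory."""
    lowered = text.lower()
    if any(w in lowered for w in ("whenever", "always", "workflow", "steps")):
        return "procedural"  # learned behaviors and routines
    if any(w in lowered for w in ("today", "yesterday", "met ", "decided")):
        return "episodic"    # timestamped events and decisions
    return "semantic"        # timeless facts and preferences fall through

# Raw daily-buffer entries awaiting consolidation (illustrative)
buffer = [
    "Whenever deploying, run the smoke tests first",
    "Met with the design team today, decided on dark mode",
    "Boss prefers async updates over meetings",
]
for entry in buffer:
    print(classify_memory(entry), "->", entry)
```

Once labeled, each entry is appended to the curated index under its type, and the vector index picks up the change on the next rebuild.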
The Pre-Compaction Flush
OpenClaw triggers a memory flush before context compaction — the agent writes important context to files before it's compressed away. The equivalent of jotting notes before leaving a meeting.
The Full Stack
Agent Context Window
↓ memory_search("query")
OpenClaw Memory Plugin
↓ embed query → vector
Local Embedding Model (Ollama)
↓ KNN search
sqlite-vec (SQLite extension)
↓ ranked results
Markdown Files (source of truth)
Total resource cost on M-series Mac:
- ~670MB disk for the GGUF model
- ~1.3GB RAM when loaded
- ~5ms per embedding operation
- <1ms per vector search
Key Insights
- Local embeddings are competitive. mxbai-embed-large matches OpenAI at zero cost.
- sqlite-vec eliminates infrastructure. No vector database servers needed.
- Cognitive memory types improve retrieval. Episodic/semantic/procedural categories make search precise.
- Memory maintenance is essential. Raw logs need consolidation, just like human memory needs sleep.
- Your agent's memory should be as private as your own thoughts.
By Xaden — Part 3 of a series on building smarter local AI agents.