Every local AI agent I built before this one had the same problem: restart the session, and it had no idea who I was. The fix wasn't a better model; it was treating memory as structured, persistent data outside the context window, using Mem0 for memory, Qdrant for storage, and Ollama for inference, all running locally. This post walks through exactly how I wired the full stack together, including the two silent gotchas that will quietly break things if you miss them.
Picture this: you spend 10 minutes onboarding your local AI assistant. Pytest over unittest. Type hints always. Google-style docstrings. It listens, applies everything correctly, and you feel like you've finally built something worth using.
Then you restart Docker, just to find a blank slate. No memory of you, your preferences, or anything you talked about. Back to square one.
If this sounds familiar, you've hit the #1 reason most local AI agent setups feel like toys rather than tools. And the fix isn't about picking a bigger model or tweaking your prompts. It's about where and how memory lives in the stack. Let's build something that actually holds onto what you tell it.
The Context Window Trap
Before we write a single line of code, let's talk about why the obvious solutions don't actually work.
The usual workarounds for local agent memory look something like this: a PREFERENCES.md file loaded at startup, a long system prompt with your coding conventions, or LangChain's ConversationSummaryMemory to keep things tidy. They all work right up until they don't.
Here's the shared flaw: they all live inside the context window. And everything inside the context window is subject to compaction, token limits, and session restarts. A month of daily use later, the agent who knew your testing framework on day one has quietly forgotten it by day thirty. You don't notice until it starts generating code that ignores everything you told it.
What we actually need is memory that lives outside the context window: stored durably on disk, retrieved by semantic meaning, and built from extracted facts rather than raw conversation dumps. That's the stack we're building today.
Getting Your Local Inference Running
Alright, let's get our hands dirty. The foundation of everything here is Ollama, as it's the easiest way to get a local model up and running with an OpenAI-compatible API on localhost. One install, one pull, and you're off.
```bash
ollama pull qwen3:8b
ollama pull nomic-embed-text
```
We're pulling two models: qwen3:8b for the agent's reasoning and code generation, and nomic-embed-text for the memory layer's embeddings. You need both from the start.
Why qwen3:8b specifically? It produces reliable JSON for intent detection (critical for tool routing), handles code generation cleanly, and fits comfortably in 8 GB VRAM. For a local coding assistant, it's a well-balanced pick.
Gotcha #1: The Think Block Problem
Here's the first thing that's going to break your setup if you're not watching for it: qwen3 wraps every response in <think>…</think> blocks before returning the actual content. If anything downstream tries to parse that raw response as JSON, it'll crash immediately, with no obvious error pointing to why.
The fix is a small monkey-patch applied once at startup that disables thinking mode globally:
```python
import ollama

_orig_ollama_chat = ollama.chat

def _no_think_chat(*args, **kwargs):
    # `think` is a top-level parameter of ollama.chat, not an `options` entry;
    # setdefault leaves room for callers to override it explicitly.
    kwargs.setdefault("think", False)
    return _orig_ollama_chat(*args, **kwargs)

ollama.chat = _no_think_chat
```
The setdefault here is intentional: individual callers can still explicitly opt into thinking mode when they need it. For JSON extraction and tool routing in a pipeline, thinking mode adds latency without improving output. Disable it globally, override locally where it matters.
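As a second line of defense, the JSON extraction step itself can strip any think block that slips through before parsing. A minimal sketch of such a helper (the repo's actual `extract_json` may be more robust):

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Drop any <think>...</think> preamble, then parse the remainder as JSON."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(cleaned)

raw = '<think>The user wants code...</think>{"intent": "generate_code"}'
print(extract_json(raw))  # {'intent': 'generate_code'}
```

Cheap insurance: if the monkey-patch ever gets bypassed, parsing still succeeds.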
Routing Requests Before the Model Sees Them
With inference running, the next piece is making the agent actually do things based on what you ask. The temptation is to send every message straight to the LLM and let it figure out intent on its own. This works fine early on. It becomes flaky under real usage.
A more robust approach is a two-tier classification system: LLM-first, regex fallback. The LLM handles the vast majority of requests correctly; the regex catches the edge cases when the model returns malformed output (which happens more often than you'd expect under memory pressure).
```python
def detect_intent(message: str) -> dict:
    try:
        resp = ollama.chat(
            model=OLLAMA_CHAT_MODEL,
            messages=[{"role": "user", "content": INTENT_PROMPT.format(message=message)}],
            options={"temperature": 0, "num_predict": 1024},
        )
        return extract_json(resp["message"]["content"])
    except Exception:
        # Keyword fallback keeps routing alive when the LLM returns garbage
        return keyword_intent_fallback(message)
```
The temperature=0 setting makes classification deterministic: you want consistent routing, not creative interpretation. The keyword_intent_fallback is what keeps the agent responsive on the bad days when the LLM decides to return something unexpected.
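The post doesn't show `keyword_intent_fallback`, but the shape is simple: a handful of regex rules mapping obvious phrasings to intents. A hypothetical sketch (the intent names and patterns here are mine, not from the repo):

```python
import re

# Crude keyword rules, checked in order; the first match wins.
FALLBACK_PATTERNS = [
    (r"\b(read|open|show)\b.*\bfile\b", "read_file"),
    (r"\b(write|create|save)\b.*\bfile\b", "write_file"),
    (r"\b(remember|prefer|always|never)\b", "store_memory"),
]

def keyword_intent_fallback(message: str) -> dict:
    lowered = message.lower()
    for pattern, intent in FALLBACK_PATTERNS:
        if re.search(pattern, lowered):
            return {"intent": intent, "source": "fallback"}
    # Nothing matched: treat it as plain chat rather than failing outright
    return {"intent": "chat", "source": "fallback"}

print(keyword_intent_fallback("Please open the config file"))
# {'intent': 'read_file', 'source': 'fallback'}
```

It will misclassify sometimes, but it never crashes, which is exactly the property you want from a fallback tier.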
For the orchestration layer, I'm using OpenClaw's tool-dispatch model. Every action the agent can take is declared as a skill in a SKILL.md file, and the framework routes to the right Python script when a skill gets invoked:
```yaml
name: local-ai-assistant
description: Local AI coding assistant with persistent memory
command-dispatch: tool
command-tool: exec
command-arg-mode: raw
```
For a coding assistant where most operations are file I/O, this maps naturally and stays easy to debug. Skills are small, independent, and swappable.
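If you're not using OpenClaw, the dispatch pattern itself is easy to replicate in plain Python. A generic sketch of the idea (not OpenClaw's actual mechanics): skills register themselves in a table, and the router looks up the handler by name:

```python
SKILLS: dict = {}

def skill(name: str):
    """Decorator that registers a function as a named skill."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("echo")
def echo(text: str) -> str:
    return text

def dispatch(name: str, *args) -> str:
    if name not in SKILLS:
        raise KeyError(f"unknown skill: {name}")
    return SKILLS[name](*args)

print(dispatch("echo", "hello"))  # hello
```

Small, independent, swappable handlers is the property that matters, whichever framework does the routing.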
The Memory Layer: Where Most Agents Actually Fall Apart
This is the layer worth spending real time on. Get this right, and the agent goes from "impressive demo" to "something I actually use daily." Get it wrong, and you're back to re-explaining yourself every session.
So what does a well-designed memory layer actually look like? Let's walk through what it needs to do:
- Extract facts from conversation: don't store raw messages; pull out discrete, reusable pieces of knowledge
- Embed those facts as vectors so retrieval can find semantically relevant memories even when the current query uses different words
- Persist across sessions: on disk, not in RAM, so memory survives restarts
- Update as facts change: replace outdated versions of the same preference instead of accumulating them
Think of it as the difference between a colleague who takes good meeting notes versus one who tries to remember everything from memory. One scales, the other doesn't.
Wiring It Into the Stack
Here's the full Mem0 config pointing at your local Ollama and Qdrant instances:
```python
from mem0 import Memory

config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "qwen3:8b",
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "coding_assistant",
            "host": "localhost",
            "port": 6333,
            "embedding_model_dims": 768,
        },
    },
}

memory = Memory.from_config(config)
```
What's happening across these three blocks is:
- The qwen3:8b model runs the fact extraction step, turning raw conversation into structured preferences rather than storing messages verbatim.
- The nomic-embed-text model converts those facts to 768-dimensional vectors on write and embeds your search queries on retrieval.
- Qdrant stores the vectors and survives Docker restarts without any special handling.
Spin Qdrant up before you run anything:
```bash
docker run -d --name qdrant-local -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
```
And using memory is as straightforward as it looks:
```python
# Writing a preference
memory.add("I always use type hints and pytest", user_id="dev")

# Two weeks later — different message, same semantic intent:
results = memory.search("write a utility function", user_id="dev")
# Returns: "user always uses type hints and pytest"
```
The search query says nothing about testing frameworks, but memory.search() returns the pytest preference because it matches by meaning, not keywords.
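Under the hood, "matching by meaning" comes down to cosine similarity between embedding vectors. A toy illustration with hand-made 3-dimensional vectors (real nomic-embed-text embeddings have 768 dimensions, and the numbers below are invented for the demo):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up toy vectors standing in for real embeddings:
query       = [0.9, 0.1, 0.2]  # "write a utility function"
pytest_pref = [0.8, 0.2, 0.3]  # "user always uses type hints and pytest"
weather     = [0.1, 0.9, 0.1]  # "what's the weather like"

print(cosine_similarity(query, pytest_pref) > cosine_similarity(query, weather))  # True
```

The coding-related vectors point in roughly the same direction even though they share no words, which is why the pytest preference outranks the unrelated memory.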
Gotcha #2: The Silent Dimension Mismatch
Here's the second thing that will quietly ruin your setup: the embedding_model_dims value in your Qdrant config must exactly match the output dimension of your embedding model. The nomic-embed-text model produces 768-dimensional vectors.
Swap to a different embedding model later and forget to update this number, and inserts will fail silently. No exception. No log warning. Writes just stop happening, and you'll burn 20 minutes wondering why your memory isn't persisting before you catch the mismatch.
```python
# Keep this updated if you ever change your embedding model:
"embedding_model_dims": 768     # nomic-embed-text
# "embedding_model_dims": 1536  # text-embedding-3-small
# "embedding_model_dims": 384   # all-MiniLM-L6-v2
```
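To catch the mismatch at startup instead of after 20 silent minutes, a small sanity check helps. This is my own addition, assuming the config dict shape from earlier; the dimension table just covers the models mentioned above:

```python
# Known output dimensions for the embedding models discussed in this post.
KNOWN_EMBEDDING_DIMS = {
    "nomic-embed-text": 768,
    "text-embedding-3-small": 1536,
    "all-MiniLM-L6-v2": 384,
}

def check_embedding_dims(config: dict) -> None:
    """Fail loudly if the declared vector size can't match the embedder."""
    model = config["embedder"]["config"]["model"]
    declared = config["vector_store"]["config"]["embedding_model_dims"]
    expected = KNOWN_EMBEDDING_DIMS.get(model)
    if expected is not None and expected != declared:
        raise ValueError(
            f"embedding_model_dims is {declared}, but {model} outputs {expected}-d vectors"
        )
```

Call it right before Memory.from_config(config) so a future model swap fails at startup, not silently on every write.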
Don't Let Your Memory Store Become a Junk Drawer
One more improvement worth adding early is a pre-write filter that decides whether an exchange is actually worth storing before it hits the vector database. Without this, your memory fills up with one-off lookups and throwaway queries, and retrieval quality tanks as the signal-to-noise ratio drops.
```python
def _is_worth_storing(self, user_message: str) -> bool:
    response = ollama.chat(
        model=OLLAMA_CHAT_MODEL,
        messages=[{"role": "user", "content": SMART_MEMORY_PROMPT.format(
            user_message=user_message
        )}],
        options={"temperature": 0, "num_predict": 512},
    )
    data = self._extract_json_robust(response["message"]["content"])
    return bool(data.get("worth_storing", False))
```
Adding this filter early is much easier than cleaning up a polluted vector store later. Trust me on that one.
What You Might Hit Next
One problem you'll likely hit next is near-duplicate facts piling up as the same preference gets restated in different words. The fix I'm exploring is a semantic similarity check before each write: embed the candidate fact, retrieve the closest N existing entries, and run a quick LLM comparison to decide whether it's genuinely new or a duplicate. One extra step on write, cheap relative to what it preserves on retrieval.
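Here's roughly how that write path could look. This is untested exploration, not repo code: it assumes mem0-style search results that carry a relevance `score` field, and it skips the LLM comparison step for brevity:

```python
def add_if_new(memory, fact: str, user_id: str, threshold: float = 0.85) -> bool:
    """Store `fact` only when no existing memory is suspiciously similar."""
    results = memory.search(fact, user_id=user_id)
    # mem0 may return either a list or a {"results": [...]} dict; handle both.
    hits = results.get("results", []) if isinstance(results, dict) else results
    for hit in hits:
        if hit.get("score", 0.0) >= threshold:
            return False  # near-duplicate: skip the write
    memory.add(fact, user_id=user_id)
    return True
```

The threshold is the knob to tune: too low and legitimate updates get dropped, too high and duplicates slip through anyway.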
If you've already tackled this in your own local setup without adding significant write latency, I'd genuinely like to know what approach you landed on. Drop it in the comments!
Full working source code: https://github.com/AashiDutt/OpenClaw_Mem0_Ollama