Harish Kotra (he/him)

Beyond the Context Window: Simulating True AI Memory with Ollama and AIsa.one

We often confuse "Context" with "Memory" in LLMs.

When you paste a 100-page PDF into an LLM, you aren't giving it memory; you're giving it a very long short-term memory. True memory isn't about stuffing everything into the prompt - it's about state persistence, emotional continuity, and the ability to recall specific facts without needing them constantly repeated.

I built LongMind, a proof-of-concept demo to visualize exactly how Memory and Context differ in AI behavior. Here is how it works.

The Architecture

The stack is simple but effective:

  • Backend: Node.js + Express
  • Frontend: React (Vite)
  • AI Engine: Ollama (Local Llama 3.2) OR AIsa.one (Cloud)
  • Memory: A simple JSON store (no vector DB complexity needed for this demo; a minimal sketch follows the list)
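
The memory store itself can be as simple as a JSON file on disk. Here is a minimal sketch of what that could look like; the memory.json path and the loadMemories/saveMemory helper names are my own placeholders, not necessarily what LongMind ships with:

// server/memory/store.js (hypothetical layout)
const fs = require("fs");
const path = require("path");

const MEMORY_FILE = path.join(__dirname, "memory.json");

// Load every persisted fact (an array of strings)
function loadMemories() {
    if (!fs.existsSync(MEMORY_FILE)) return [];
    return JSON.parse(fs.readFileSync(MEMORY_FILE, "utf8"));
}

// Append a new fact and write the store back to disk
function saveMemory(fact) {
    const memories = loadMemories();
    memories.push(fact);
    fs.writeFileSync(MEMORY_FILE, JSON.stringify(memories, null, 2));
}

module.exports = { loadMemories, saveMemory };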

The 3 Modes of Cognition

The core feature of LongMind is the ability to toggle between three inference strategies (a sketch of the backend wiring follows the list):

  1. Context Only (Amnesia)
    The LLM sees only the current message. It has no idea who you are or what happened 5 minutes ago.
    Result: You betray the NPC, and 10 seconds later, he greets you warmly.

  2. Memory Only (Rigid Retrieval)
    The LLM sees only the persisted facts, ignoring the current conversational nuance.
    Result: You say "Hello", and the NPC ignores the greeting to rant about a past betrayal. This mimics bad RAG implementations where retrieval overpowers flow.

  3. Memory + Context (The Holy Grail)
    The LLM sees both. It integrates the past (Memory) with the present (Context).
    Result: You apologize. The NPC hears the apology (Context) but refuses it because he remembers the betrayal (Memory). This feels "human."
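
On the backend, that toggle is just a field on the chat request. Here is a rough sketch of how an Express route could wire it up; the /api/chat path, the request shape, and the loadMemories helper are assumptions for illustration, and generateResponse is the provider function shown in the next section (assuming it ends by returning the model's reply):

// server/routes/chat.js (hypothetical wiring)
const express = require("express");
const { loadMemories } = require("../memory/store");
const { generateResponse } = require("../llm/provider");

const router = express.Router();

router.post("/api/chat", async (req, res) => {
    // mode comes straight from the UI toggle:
    // 'context_only', 'memory_only', or the combined memory + context mode
    const { message, mode, provider } = req.body;

    const reply = await generateResponse({
        systemPrompt: "You are an NPC guard with a long memory.",
        memory: loadMemories(),
        userMessage: message,
        mode,
        provider
    });

    res.json({ reply });
});

module.exports = router;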

The Code: LLM Abstraction

We created a unified generateResponse function that changes how the prompt is assembled based on the selected mode:

// server/llm/provider.js
async function generateResponse({ systemPrompt, memory, userMessage, mode, provider }) {
    const messages = [{ role: "system", content: systemPrompt }];

    // Inject persisted memories unless we are in "Context Only" mode
    if (mode !== 'context_only' && memory && memory.length > 0) {
        const memoryText = memory.map(m => `- ${m}`).join("\n");
        messages.push({ role: "system", content: `RELEVANT MEMORIES:\n${memoryText}` });
    }

    // Inject the user's message unless we are in "Memory Only" mode
    if (mode !== 'memory_only' && userMessage) {
        messages.push({ role: "user", content: userMessage });
    } else if (mode === 'memory_only') {
        // Force a reaction to the memories even without fresh input
        messages.push({ role: "system", content: "React solely to your memories." });
    }

    // ... Call Ollama or AIsa
}
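
For the Ollama path, the elided call can be a plain HTTP request to Ollama's /api/chat endpoint (Node 18+ ships a global fetch). This is just one possible way to finish the function; the model name and options here are assumptions:

// One way the elided Ollama branch could look
async function callOllama(messages) {
    const response = await fetch("http://localhost:11434/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            model: "llama3.2",   // pulled locally via `ollama pull llama3.2`
            messages,            // the messages array built above
            stream: false        // single JSON response instead of a token stream
        })
    });

    const data = await response.json();
    return data.message.content; // /api/chat returns { message: { role, content }, ... }
}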

This simple demo illustrates why "Context Windows" aren't a silver bullet. You can have a 1M token context, but if you treat it as a scratchpad, you get drift. True agentic behavior requires a persistent "Self" that exists outside the inference cycle.
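
That persistent "Self" is simply the write path of the loop: after each exchange, decide what is worth keeping and push it into the store so it survives the next inference cycle. A minimal sketch, reusing the hypothetical saveMemory helper from the store above (a fuller version might ask the model itself to summarise the exchange into a fact):

// After each exchange, persist anything worth remembering
const { saveMemory } = require("./memory/store");

function rememberIfImportant(userMessage, npcReply) {
    // Naive keyword heuristic for the demo; swap in an LLM-based extractor for real use
    const keywords = ["betray", "promise", "secret"];
    if (keywords.some(k => userMessage.toLowerCase().includes(k))) {
        saveMemory(`The player said: "${userMessage}" and the NPC replied: "${npcReply}"`);
    }
}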

Check out the code on GitHub to run your own local NPC with Llama 3.2!

Here's what the output looks like

[Screenshots: Memory Only (reacting to the betrayal), Context Only, and Memory + Context]
