Oleksander

419 Clones in 48 Hours — What Happened When I Launched an SDK for Offline AI Agent Memory

48 hours after launch. 419 clones. 90 unique developers. 8 stars. Nobody said a word.

That silence told me something important: engineers don't star things — they test them.

Here's the story of what I built, why, and what those numbers actually mean.


The Problem Nobody Talks About

Everyone is building AI agents. Most of them have a memory problem.

The standard approach: use embeddings. Store text as vectors, query them at recall time. Tools like Mem0, Zep, and LangMem all work this way.

The hidden cost:

  • Every recall = an embedding API call = 150–300ms latency
  • Every embedding call = money (OpenAI charges per token)
  • Offline deployment? Impossible — you need the embedding API available

For cloud-based chatbots this is fine. But for local AI agents running on your own hardware — especially with Ollama — this breaks the whole offline-first promise.

If your agent needs to "remember" something, it has to call home first.

That felt wrong to me.


A Different Idea: SDR Instead of Embeddings

I started reading about Sparse Distributed Representations (SDR) — the pattern encoding mechanism used in Hierarchical Temporal Memory (HTM) theory, originally inspired by how the neocortex works.

The core idea: represent any concept as a sparse binary vector (256K bits in Aura's case) where only ~2% of bits are active. Similarity between patterns is computed with the Tanimoto coefficient — pure bit math, no neural network needed.

No embedding model. No API call. No GPU.

Just math.

Recall latency: 0.35ms. That's not a typo.


What I Built

Aura — a cognitive memory system for AI agents written in Rust.

Key properties:

  • Sub-millisecond recall — 0.35ms average, 0.29ms after warm cache
  • Zero LLM calls for memory operations — the recall itself needs no model
  • 2.7MB binary — the entire memory engine fits in a small file
  • Fully offline — works with Ollama, any local model, no internet required
  • Persistent across sessions — brain reloads from disk, all context intact
  • 217 tests, ChaCha20-Poly1305 encryption, patent pending (US 63/969,703)

Four memory levels with different retention weights:

Working Memory    → 0.80 retention  (temporary context)
Decision Memory   → 0.90 retention  (choices made)
Domain Memory     → 0.95 retention  (learned knowledge)
Identity Memory   → 0.99 retention  (core facts)
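How those retention weights might shape recall — this is purely my guess at the mechanics, not Aura's actual decay algorithm: treat each weight as a per-tick survival factor, so identity facts outlive working context by orders of magnitude.

```python
# Hypothetical decay model: strength multiplied by the level's retention
# weight once per "tick" (session, day, ...). Not Aura's real algorithm.
RETENTION = {
    "working":  0.80,
    "decision": 0.90,
    "domain":   0.95,
    "identity": 0.99,
}

def strength_after(level: str, ticks: int, initial: float = 1.0) -> float:
    """Memory strength after `ticks` decay steps at this level's retention."""
    return initial * RETENTION[level] ** ticks

for level in RETENTION:
    print(f"{level:>8}: {strength_after(level, 50):.4f} after 50 ticks")
```

Under this model, after 50 ticks a working-memory item has all but vanished while an identity fact retains well over half its strength — which matches the intent of the four levels, whatever the real mechanism is.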

Integration with Ollama: 3 Lines

import ollama
from aura_memory import Aura

brain = Aura("./agent_brain")
context = brain.recall(user_input, token_budget=1500)

# inject context into your Ollama system prompt
response = ollama.chat(
    model="gemma3n:e4b",
    messages=[
        {"role": "system", "content": f"Context:\n{context}\n\nYou are a helpful assistant."},
        {"role": "user", "content": user_input}
    ]
)

# store the interaction
brain.store(user_input, response["message"]["content"])

That's it. Your Ollama agent now has persistent memory across sessions — no embedding API, no cloud, no ongoing cost.
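The `token_budget=1500` argument above implies recall trims context to fit a budget. Here's one plausible way to do that — a hypothetical sketch, not Aura's implementation: rank memories by similarity score, then greedily pack them until the budget runs out.

```python
def pack_context(memories: list[tuple[float, str]], token_budget: int) -> str:
    """Greedily select the highest-scoring memories that fit the budget.
    Tokens are approximated as whitespace-separated words in this sketch."""
    selected, used = [], 0
    for score, text in sorted(memories, key=lambda m: m[0], reverse=True):
        cost = len(text.split())
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return "\n".join(selected)

memories = [
    (0.91, "User's name is Aleksander, an AI engineer from Ukraine"),
    (0.84, "Working on AuraSDK, cognitive memory for agents"),
    (0.40, "Prefers concise technical explanations"),
]
print(pack_context(memories, token_budget=16))
```

Whatever the real packing strategy is, the key point stands: selecting and trimming context is list manipulation, not inference, so it adds nothing to the latency budget.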


Live Demo Output

I ran a 4-phase test with gemma3n:e4b locally. Here's the actual terminal output:

Phase 1: Storing facts
✓ Stored: Name is Aleksander, AI engineer from Ukraine
✓ Stored: Working on AuraSDK — cognitive memory for agents
✓ Stored: Prefers concise technical explanations

Phase 2: Conversations with memory context
[Recall: 0.35ms] Context injected into system prompt
[Recall: 0.48ms] Agent referenced previous preference correctly
[Recall: 0.41ms] Agent remembered project name without being told

Phase 3: Session reload (fresh Python instance)
Brain loaded from disk...
[Recall: 0.29ms] ALL context intact ✅

Total records: 12
Memory persisted: YES
LLM calls for memory: 0

The agent remembered my name, project, and communication preferences across a completely fresh Python instance — without a single LLM or embedding call.


Benchmark vs Embedding-based approach

| Metric | Aura | Embedding-based |
| --- | --- | --- |
| Recall latency | 0.35ms | ~200ms |
| Embedding API calls | 0 | Required |
| Offline capable | Yes | No |
| Binary size | 2.7MB | N/A (cloud) |
| Cost per recall | $0 | API pricing |
| Speedup | 270x faster | baseline |

Why Rust?

Three reasons:

  1. Performance — sub-millisecond recall requires zero garbage collection overhead
  2. Safety — memory systems that corrupt data are worse than no memory at all
  3. Portability — 2.7MB binary runs anywhere: Raspberry Pi, edge devices, air-gapped servers

19,500 lines of Rust. 217 tests. Built during power outages in Kyiv 🇺🇦


The 419 Clones

After posting in the Ollama Discord and commenting on a few Twitter threads about agent memory, the GitHub traffic spiked:

  • 419 clones in 48 hours
  • 90 unique cloners
  • Zero comments

I think developers are quietly testing it. That's the most honest validation I could ask for — nobody clones a repo to be polite.

If you're one of those 90 people: I'd genuinely love to know what you found. What worked, what didn't, what you were trying to build.


Get Started

pip install aura-memory

One Question For You

How are you handling memory in your AI agents right now?

Embeddings? Simple conversation history? Something else entirely?

I'm genuinely curious about the tradeoffs people are navigating — especially for local/offline deployments where latency and API costs actually matter.
