Oleksander

419 Clones in 48 Hours — What Happened When I Launched an SDK for Offline AI Agent Memory

48 hours after launch. 419 clones. 90 unique developers. 8 stars. Nobody said a word.

That silence told me something important: engineers don't star things — they test them.

Here's the story of what I built, why, and what those numbers actually mean.


The Problem Nobody Talks About

Everyone is building AI agents. Most of them have a memory problem.

The standard approach: use embeddings. Store text as vectors, query them at recall time. Tools like Mem0, Zep, and LangMem all work this way.

The hidden cost:

  • Every recall = an embedding API call = 150–300ms latency
  • Every embedding call = money (OpenAI charges per token)
  • Offline deployment? Impossible — you need the embedding API available

For cloud-based chatbots this is fine. But for local AI agents running on your own hardware — especially with Ollama — this breaks the whole offline-first promise.

If your agent needs to "remember" something, it has to call home first.

That felt wrong to me.


A Different Idea: SDR Instead of Embeddings

I started reading about Sparse Distributed Representations (SDR) — the pattern encoding mechanism used in Hierarchical Temporal Memory (HTM) theory, originally inspired by how the neocortex works.

The core idea: represent any concept as a sparse binary vector (256K bits in Aura's case) where only ~2% of bits are active. Similarity between patterns is computed with the Tanimoto coefficient — pure bit math, no neural network needed.

No embedding model. No API call. No GPU.

Just math.

Recall latency: 0.35ms. That's not a typo.


What I Built

Aura — a cognitive memory system for AI agents written in Rust.

Key properties:

  • Sub-millisecond recall — 0.35ms average, 0.29ms after warm cache
  • Zero LLM calls for memory operations — the recall itself needs no model
  • 2.7MB binary — the entire memory engine fits in a small file
  • Fully offline — works with Ollama, any local model, no internet required
  • Persistent across sessions — brain reloads from disk, all context intact
  • 217 tests, ChaCha20-Poly1305 encryption, patent pending (US 63/969,703)

Four memory levels with different retention weights:

Working Memory    → 0.80 retention  (temporary context)
Decision Memory   → 0.90 retention  (choices made)
Domain Memory     → 0.95 retention  (learned knowledge)
Identity Memory   → 0.99 retention  (core facts)
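How those retention weights might shape recall — this is purely my guess at the mechanics, not Aura's actual decay algorithm: treat each weight as a per-tick survival factor, so identity facts outlive working context by orders of magnitude.

```python
# Hypothetical decay model: strength multiplied by the level's retention
# weight once per "tick" (session, day, ...). Not Aura's real algorithm.
RETENTION = {
    "working":  0.80,
    "decision": 0.90,
    "domain":   0.95,
    "identity": 0.99,
}

def strength_after(level: str, ticks: int, initial: float = 1.0) -> float:
    """Memory strength after `ticks` decay steps at this level's retention."""
    return initial * RETENTION[level] ** ticks

for level in RETENTION:
    print(f"{level:>8}: {strength_after(level, 50):.4f} after 50 ticks")
```

Under this model, after 50 ticks a working-memory item has all but vanished while an identity fact retains well over half its strength — which matches the intent of the four levels, whatever the real mechanism is.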

Integration with Ollama: 3 Lines

import ollama
from aura_memory import Aura

brain = Aura("./agent_brain")
context = brain.recall(user_input, token_budget=1500)

# inject context into your Ollama system prompt
response = ollama.chat(
    model="gemma3n:e4b",
    messages=[
        {"role": "system", "content": f"Context:\n{context}\n\nYou are a helpful assistant."},
        {"role": "user", "content": user_input}
    ]
)

# store the interaction
brain.store(user_input, response["message"]["content"])

That's it. Your Ollama agent now has persistent memory across sessions — no embedding API, no cloud, no ongoing cost.
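The `token_budget=1500` argument above implies recall trims context to fit a budget. Here's one plausible way to do that — a hypothetical sketch, not Aura's implementation: rank memories by similarity score, then greedily pack them until the budget runs out.

```python
def pack_context(memories: list[tuple[float, str]], token_budget: int) -> str:
    """Greedily select the highest-scoring memories that fit the budget.
    Tokens are approximated as whitespace-separated words in this sketch."""
    selected, used = [], 0
    for score, text in sorted(memories, key=lambda m: m[0], reverse=True):
        cost = len(text.split())
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return "\n".join(selected)

memories = [
    (0.91, "User's name is Aleksander, an AI engineer from Ukraine"),
    (0.84, "Working on AuraSDK, cognitive memory for agents"),
    (0.40, "Prefers concise technical explanations"),
]
print(pack_context(memories, token_budget=16))
```

Whatever the real packing strategy is, the key point stands: selecting and trimming context is list manipulation, not inference, so it adds nothing to the latency budget.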


Live Demo Output

I ran a 4-phase test with gemma3n:e4b locally. Here's the actual terminal output:

Phase 1: Storing facts
✓ Stored: Name is Aleksander, AI engineer from Ukraine
✓ Stored: Working on AuraSDK — cognitive memory for agents
✓ Stored: Prefers concise technical explanations

Phase 2: Conversations with memory context
[Recall: 0.35ms] Context injected into system prompt
[Recall: 0.48ms] Agent referenced previous preference correctly
[Recall: 0.41ms] Agent remembered project name without being told

Phase 3: Session reload (fresh Python instance)
Brain loaded from disk...
[Recall: 0.29ms] ALL context intact ✅

Total records: 12
Memory persisted: YES
LLM calls for memory: 0

The agent remembered my name, project, and communication preferences across a completely fresh Python instance — without a single LLM or embedding call.


Benchmark vs Embedding-based approach

| Metric | Aura | Embedding-based |
| --- | --- | --- |
| Recall latency | 0.35ms | ~200ms |
| Embedding API calls | 0 | Required |
| Offline capable | Yes | No |
| Binary size | 2.7MB | N/A (cloud) |
| Cost per recall | $0 | API pricing |
| Speedup | 270x faster | baseline |

Why Rust?

Three reasons:

  1. Performance — sub-millisecond recall requires zero garbage collection overhead
  2. Safety — memory systems that corrupt data are worse than no memory at all
  3. Portability — 2.7MB binary runs anywhere: Raspberry Pi, edge devices, air-gapped servers

19,500 lines of Rust. 217 tests. Built during power outages in Kyiv 🇺🇦


The 419 Clones

After posting in the Ollama Discord and commenting on a few Twitter threads about agent memory, the GitHub traffic spiked:

  • 419 clones in 48 hours
  • 90 unique cloners
  • Zero comments

I think developers are quietly testing it. That's the most honest validation I could ask for — nobody clones a repo to be polite.

If you're one of those 90 people: I'd genuinely love to know what you found. What worked, what didn't, what you were trying to build.


Get Started

pip install aura-memory

One Question For You

How are you handling memory in your AI agents right now?

Embeddings? Simple conversation history? Something else entirely?

I'm genuinely curious about the tradeoffs people are navigating — especially for local/offline deployments where latency and API costs actually matter.
