DEV Community

Oleksander
Oleksander

Posted on

Agent memory in 5 lines of Python — no LLM, no cloud, <1ms recall

Last week, my AI agent analyzed 10 competitors in the AI memory market over 3 days. On day 4, I asked it to compare their pricing. It didn't search again — it already knew them all. That's what happens when your agent has real memory, not a chat history.

Your AI agent forgets everything between sessions. Every conversation starts from zero. Every user preference, every decision, every piece of context — gone. You paste old conversations into the system prompt, hit the token limit, and wonder why the agent feels so... stateless.

Most "memory" solutions bolt on a vector database, call an embedding API, and charge you per query. You now have 200ms latency, a cloud dependency, and a monthly bill — for what is essentially a fancy search index.

What if your agent could remember like a human? Important things stick. Trivial things fade. Trusted sources rank higher than random web scrapes. And it all happens in under 1 millisecond, locally, with zero LLM calls.

That's Aura.

What makes Aura different

Aura Others (Mem0, Zep, Cognee)
LLM required No Yes
Recall latency <1ms 200ms+ / LLM-bound
Works offline Yes No
Binary size 2.7 MB Heavy
Cost per op $0 API billing
Source provenance Built-in No

Aura is a Rust-native cognitive memory engine with Python bindings. It uses a 4-signal RRF (Reciprocal Rank Fusion) recall system — no embeddings required — and models memory decay, consolidation, and trust scoring inspired by how human memory actually works.

Let's see how it works.

1. Install

pip install aura-memory
Enter fullscreen mode Exit fullscreen mode

That's it. No Docker, no API keys, no cloud account. The entire engine ships as a single 2.7 MB binary.

2. Store & recall — the basics

from aura import Aura, Level

brain = Aura("./agent_memory")

# Store memories at different importance levels
brain.store("User prefers dark mode and Vim keybindings",
            level=Level.Identity, tags=["preference", "ui"])

brain.store("Deploy staging before production, always run tests",
            level=Level.Decisions, tags=["workflow"])

brain.store("Fix login bug - users getting 403 on /api/auth",
            level=Level.Working, tags=["bug", "auth"])

# Recall — returns formatted context ready for LLM injection
context = brain.recall("authentication issues", token_budget=2000)
print(context)
Enter fullscreen mode Exit fullscreen mode

Output:

=== COGNITIVE CONTEXT ===
[IDENTITY]
  - User prefers dark mode and Vim keybindings [preference, ui]

[DECISIONS]
  - Deploy staging before production, always run tests [workflow]

[WORKING]
  - Fix login bug - users getting 403 on /api/auth [bug, auth]

=== END CONTEXT ===
Enter fullscreen mode Exit fullscreen mode

That's it. store()recall() → inject into your system prompt. Five lines to give your agent persistent memory.

3. Memory levels — not all memories are equal

Aura organizes memory into 4 levels across 2 tiers, modeled after human cognitive architecture:

Tier Level Decay rate Use case
Core Identity 0.99/cycle User preferences, personality
Core Domain 0.95/cycle Learned facts, domain knowledge
Cognitive Decisions 0.90/cycle Choices made, action items
Cognitive Working 0.80/cycle Current tasks, recent messages

Core tier = slow decay (weeks to months). Your agent's "personality" and knowledge base.
Cognitive tier = fast decay (hours to days). Ephemeral context that fades naturally.

This means your agent doesn't need explicit "forget" logic. Old tasks decay away. Core knowledge persists. Just like your brain.

# Query only recent, ephemeral memories
recent = brain.recall_cognitive("workflow")

# Query only long-term knowledge
knowledge = brain.recall_core_tier("programming")
Enter fullscreen mode Exit fullscreen mode

4. Trust scoring — the killer feature

Here's where Aura gets interesting. Not all information sources are equally reliable. A user telling you their name is more trustworthy than a web scrape claiming "Python 4.0 is coming soon."

from aura import TrustConfig

tc = TrustConfig()
tc.source_trust = {"user": 1.0, "api": 0.8, "web_scrape": 0.5}
brain.set_trust_config(tc)

# Store from different sources
brain.store("Python 3.13 released October 2024",
            tags=["python"], channel="user")

brain.store("Python 4.0 coming soon",
            tags=["python"], channel="web_scrape")

# Trust-weighted recall ranks user-sourced memory higher
results = brain.recall_structured("python release", top_k=5)
for r in results:
    print(f"  score={r['score']:.3f}  trust={r['trust']:.2f}  {r['content']}")
Enter fullscreen mode Exit fullscreen mode

Output:

  score=0.995  trust=1.00  Python 3.13 released October 2024
  score=0.589  trust=0.50  Python 4.0 coming soon
Enter fullscreen mode Exit fullscreen mode

The user-sourced fact scores 0.995. The web scrape scores 0.589. Your agent now has built-in epistemological hygiene — it knows how much to trust each piece of information.

5. Source provenance — know where every fact came from

Every memory in Aura carries an epistemological tag:

  • recorded — direct user input (trust × 1.00)
  • retrieved — fetched from web/API (trust × 0.90)
  • inferred — LLM conclusion (trust × 0.85)
  • generated — agent-created (trust × 0.80)
brain.store("BTC at $67k on Feb 21",
            source_type="retrieved", tags=["crypto"])

brain.store("User prefers conservative trading strategy",
            source_type="recorded", tags=["crypto"])

# Recall ranks recorded higher than retrieved
results = brain.recall_structured("crypto strategy", top_k=5)
Enter fullscreen mode Exit fullscreen mode

This prevents a subtle but dangerous problem: agents presenting web search results as their own "memories." With source_type, your agent always knows what it observed vs what it found vs what it guessed.

No other memory SDK tracks this. Not Mem0. Not Zep. Not Cognee.

6. Plug it into any LLM

Aura is LLM-agnostic. The pattern is always the same:

user_message = "What are my UI preferences?"

# 1. Recall relevant context
context = brain.recall(user_message, token_budget=2000)

# 2. Build system prompt
system_prompt = f"""You are a helpful assistant with memory.

{context}

Use the above context to personalize your responses."""

# 3. Send to your LLM of choice:
# Ollama:     requests.post("http://localhost:11434/api/chat", ...)
# OpenAI:     openai.chat.completions.create(messages=[...])
# LangChain:  ChatPromptTemplate with {context}
# Claude:     anthropic.messages.create(...)
# Any HTTP:   just inject system_prompt
Enter fullscreen mode Exit fullscreen mode

No adapters. No framework lock-in. If your LLM takes a string, Aura works with it.

Bonus: structured recall with scores

When you need more than formatted text — for routing, filtering, or debugging:

results = brain.recall_structured("user preferences", top_k=5)
for r in results:
    print(f"  [{r['level']}] score={r['score']:.3f} -- {r['content'][:80]}")
Enter fullscreen mode Exit fullscreen mode
  [IDENTITY] score=0.590 -- User prefers dark mode and Vim keybindings
  [WORKING]  score=0.586 -- Fix login bug - users getting 403 on /api/auth
  [DECISIONS] score=0.581 -- Deploy staging before production, always run tests
Enter fullscreen mode Exit fullscreen mode

Each result includes level, score, trust, tags, timestamps, and source metadata. Use this to build intelligent routing: high-trust Identity memories go straight to the system prompt; low-trust Working memories get verified first.

Performance — sub-millisecond, for real

Benchmarked on a standard machine with 1,000 stored records:

Operation Latency
Store 0.129 ms/op
Recall (1K records) 0.861 ms/op
Search by tag 0.103 ms/op

For comparison, embedding-based recall typically runs 200ms+ per call. Aura is 200x faster because it uses SDR (Sparse Distributed Representation) encoding + MinHash + tag matching — no neural network inference needed.

You can optionally add embeddings as a 4th signal if you want semantic similarity on top:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
brain.set_embedding_fn(lambda text: model.encode(text).tolist())
Enter fullscreen mode Exit fullscreen mode

But the 3-signal fusion works great without them.

Living memory — decay, reflect, consolidate

Run a single maintenance cycle and Aura handles the rest:

report = brain.run_maintenance()
print(f"Decayed: {report.decay.decayed}")
print(f"Promoted: {report.reflect.promoted}")
print(f"Consolidated: {report.consolidation.native_merged}")
print(f"Archived: {report.records_archived}")
Enter fullscreen mode Exit fullscreen mode

Call this periodically (every N interactions, or on a schedule), and your agent's memory stays clean and relevant without manual curation.

More features you get out of the box

  • Namespace isolation — keep test/prod/per-user memories separate
  • Encryption — ChaCha20-Poly1305 + Argon2id, one argument: Aura("./data", password="secret")
  • MCP server — expose memory as a tool for Claude, GPT, or any MCP-compatible agent
  • Zero dependencies — pure Rust core, no runtime requirements

Try it now

The fastest way to try Aura is the interactive Colab notebook — zero setup, runs in your browser:

▶ Open in Google Colab

Or install locally:

pip install aura-memory
Enter fullscreen mode Exit fullscreen mode
  • GitHub — star it if you find it useful ⭐
  • API docs — full reference for 40+ methods
  • Examples — Ollama integration, research bot, edge devices

Aura is MIT-licensed. Built by a solo developer in Kyiv, Ukraine — including during power outages. Patent pending (US 63/969,703). If you're building AI agents that need to remember, I'd love to hear what you think.

Top comments (0)