<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oleksander</title>
    <description>The latest articles on DEV Community by Oleksander (@teolex2020).</description>
    <link>https://dev.to/teolex2020</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3794675%2Fdd280b1a-a0c5-4942-a1a0-9a6ea303a68f.jpeg</url>
      <title>DEV Community: Oleksander</title>
      <link>https://dev.to/teolex2020</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/teolex2020"/>
    <language>en</language>
    <item>
      <title>Why AI Needs an External Cognitive Layer Beyond Memory</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Thu, 02 Apr 2026 16:30:15 +0000</pubDate>
      <link>https://dev.to/teolex2020/why-ai-needs-an-external-cognitive-layer-beyond-memory-3f55</link>
      <guid>https://dev.to/teolex2020/why-ai-needs-an-external-cognitive-layer-beyond-memory-3f55</guid>
      <description>&lt;h1&gt;
  
  
  Why AI Needs an External Cognitive Layer Beyond Memory
&lt;/h1&gt;

&lt;p&gt;Most AI agents today are still built around a thin pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a large language model,&lt;/li&gt;
&lt;li&gt;a prompt,&lt;/li&gt;
&lt;li&gt;a tool loop,&lt;/li&gt;
&lt;li&gt;and some form of memory or retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That stack can look impressive in demos, but it breaks down once the agent needs continuity, specialization, self-consistency, and long-lived behavioral control.&lt;/p&gt;

&lt;p&gt;Memory alone is not enough.&lt;/p&gt;

&lt;p&gt;If an agent only stores past records, it can remember what happened. It still cannot reliably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;form stable beliefs,&lt;/li&gt;
&lt;li&gt;build concepts over time,&lt;/li&gt;
&lt;li&gt;learn causal structure,&lt;/li&gt;
&lt;li&gt;accumulate policies,&lt;/li&gt;
&lt;li&gt;generate internal pressure,&lt;/li&gt;
&lt;li&gt;anticipate future outcomes,&lt;/li&gt;
&lt;li&gt;detect epistemic gaps,&lt;/li&gt;
&lt;li&gt;or regulate its own mode of operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the gap we have been exploring in Aura.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Thesis
&lt;/h2&gt;

&lt;p&gt;AI systems need an external cognitive layer that lives outside model weights.&lt;/p&gt;

&lt;p&gt;Not just a vector database.&lt;br&gt;
Not just a chat history.&lt;br&gt;
Not just a memory API.&lt;/p&gt;

&lt;p&gt;A real cognitive layer should be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preserve continuity across sessions,&lt;/li&gt;
&lt;li&gt;accumulate knowledge in structured form,&lt;/li&gt;
&lt;li&gt;survive model upgrades,&lt;/li&gt;
&lt;li&gt;support domain specialization,&lt;/li&gt;
&lt;li&gt;remain inspectable and governed,&lt;/li&gt;
&lt;li&gt;and shape agent behavior over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because current LLMs are powerful, but they are still weak at stable long-horizon cognition. They are excellent inference engines. They are not yet sufficient as complete cognitive architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Memory to Cognition
&lt;/h2&gt;

&lt;p&gt;In Aura, the architecture has gradually moved beyond simple memory.&lt;/p&gt;

&lt;p&gt;The working progression is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Record&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Belief&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Concept&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Causal&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Policy&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That already changes the role of memory.&lt;/p&gt;

&lt;p&gt;The system is no longer just storing facts. It is organizing experience into a structured cognitive state.&lt;/p&gt;

&lt;p&gt;And once that structure exists, new layers become possible.&lt;/p&gt;
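
&lt;p&gt;As a sketch only (the field names below are illustrative assumptions, not the actual Aura schema), the progression can be pictured as increasingly structured data:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical shapes for the five layers; fields are illustrative,
# not the real Aura schema.

@dataclass
class Record:            # raw observation, as stored
    text: str
    trust: float = 0.5

@dataclass
class Belief:            # weighted claim distilled from records
    claim: str
    confidence: float
    evidence: list = field(default_factory=list)

@dataclass
class Concept:           # stable abstraction over repeated beliefs
    name: str
    beliefs: list = field(default_factory=list)

@dataclass
class Causal:            # learned cause -> effect pattern
    cause: str
    effect: str
    strength: float

@dataclass
class Policy:            # advisory hint derived from causal structure
    action: str          # "Prefer" / "Avoid" / "Warn"
    domain: str
    description: str

r = Record("Staging deploy prevented 3 production incidents")
b = Belief("staging deploys reduce incidents", confidence=0.7, evidence=[r])
```

&lt;p&gt;Each step up the stack trades raw detail for structure the agent can act on.&lt;/p&gt;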

&lt;h2&gt;
  
  
  The Next Four Cognitive Functions
&lt;/h2&gt;

&lt;p&gt;The recent evolution of the system can be summarized in four steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Want
&lt;/h3&gt;

&lt;p&gt;The system should not only react to prompts.&lt;/p&gt;

&lt;p&gt;It should also detect internal tensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unresolved policy pressure,&lt;/li&gt;
&lt;li&gt;contradictions,&lt;/li&gt;
&lt;li&gt;unstable structure,&lt;/li&gt;
&lt;li&gt;pending cognitive obligations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those tensions can flow into drives, goals, and imperative-like internal pressure.&lt;/p&gt;

&lt;p&gt;That is the beginning of motivation.&lt;/p&gt;
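
&lt;p&gt;A minimal sketch of that step, with hypothetical tension sources and thresholds (nothing here is Aura's actual internals): scan the cognitive state for unresolved items and promote anything above a threshold into a drive.&lt;/p&gt;

```python
def detect_tensions(state):
    """Collect internal tensions from a cognitive state dict.

    Each tension is (kind, magnitude); the kinds mirror the list above.
    """
    tensions = []
    for policy in state.get("policies", []):
        if policy.get("pressure", 0.0) > 0.0:
            tensions.append(("policy_pressure", policy["pressure"]))
    for _pair in state.get("contradictions", []):
        tensions.append(("contradiction", 1.0))
    for obligation in state.get("pending", []):
        tensions.append(("obligation", obligation.get("urgency", 0.5)))
    return tensions

def tensions_to_drives(tensions, threshold=0.6):
    # Only tensions above the threshold become active drives.
    return [kind for kind, magnitude in tensions if magnitude >= threshold]

state = {
    "policies": [{"name": "staging-first", "pressure": 0.8}],
    "contradictions": [("deploy ok", "deploy caused outage")],
    "pending": [{"task": "re-ground belief", "urgency": 0.3}],
}
drives = tensions_to_drives(detect_tensions(state))
# → ["policy_pressure", "contradiction"]; the low-urgency obligation stays dormant
```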

&lt;h3&gt;
  
  
  2. Expect
&lt;/h3&gt;

&lt;p&gt;A cognitive system should not only remember the past.&lt;/p&gt;

&lt;p&gt;It should form expectations about what is likely to happen next.&lt;/p&gt;

&lt;p&gt;From stable causal structure, it can produce predictions.&lt;br&gt;
From mismatches between expectation and observation, it can produce surprise.&lt;/p&gt;

&lt;p&gt;That turns cognition from retrospective to anticipatory.&lt;/p&gt;
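
&lt;p&gt;A toy version of the loop (the rule format is invented for illustration): stored causal links yield a prediction, and a mismatch between prediction and observation yields a surprise signal proportional to how confident the expectation was.&lt;/p&gt;

```python
# Toy causal links: cause -> (expected effect, strength)
causal_links = {
    "deploy_to_prod_without_staging": ("incident", 0.9),
    "deploy_via_staging": ("no_incident", 0.8),
}

def expect(event):
    """Return (predicted effect, confidence) for an observed cause, if known."""
    return causal_links.get(event)

def surprise(event, observed_effect):
    """Surprise = confidence of a prediction the observation contradicted."""
    prediction = expect(event)
    if prediction is None:
        return 0.0                       # no expectation, nothing to violate
    predicted_effect, confidence = prediction
    return 0.0 if observed_effect == predicted_effect else confidence

# The expected effect occurs: no surprise.
assert surprise("deploy_via_staging", "no_incident") == 0.0
# A confident expectation is violated: high surprise, worth learning from.
assert surprise("deploy_to_prod_without_staging", "no_incident") == 0.9
```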

&lt;h3&gt;
  
  
  3. Wonder
&lt;/h3&gt;

&lt;p&gt;A capable system should not only repair contradictions.&lt;/p&gt;

&lt;p&gt;It should also notice what it does not know.&lt;/p&gt;

&lt;p&gt;Epistemic gaps matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weakly grounded entities,&lt;/li&gt;
&lt;li&gt;missing causal mechanisms,&lt;/li&gt;
&lt;li&gt;underspecified policy dependencies,&lt;/li&gt;
&lt;li&gt;repeated blind spots,&lt;/li&gt;
&lt;li&gt;ambiguous concept boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the beginning of curiosity.&lt;/p&gt;
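
&lt;p&gt;One way to make gap detection concrete (the scoring rules are assumptions for this sketch): score each entity by how weakly grounded it is, then surface the worst offenders as curiosity targets.&lt;/p&gt;

```python
def epistemic_gaps(entities, min_evidence=2, max_gaps=3):
    """Rank entities by how weakly grounded they are.

    An entity with little evidence or no known causal mechanism
    scores higher, i.e. is a bigger gap.
    """
    scored = []
    for name, info in entities.items():
        gap = 0.0
        if min_evidence > len(info.get("evidence", [])):
            gap += 1.0                       # weakly grounded
        if not info.get("mechanism"):
            gap += 0.5                       # missing causal mechanism
        if gap > 0:
            scored.append((gap, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:max_gaps]]

entities = {
    "staging_deploys": {"evidence": ["r1", "r2", "r3"], "mechanism": "catches bugs"},
    "friday_releases": {"evidence": ["r4"], "mechanism": "release pressure"},
    "canary_rollouts": {"evidence": [], "mechanism": None},
}
gaps = epistemic_gaps(entities)
# → ["canary_rollouts", "friday_releases"]; the well-grounded entity is not a gap
```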

&lt;h3&gt;
  
  
  4. Regulate
&lt;/h3&gt;

&lt;p&gt;A cognitive system should not always behave in exactly the same mode.&lt;/p&gt;

&lt;p&gt;Under pressure, it may need to become more conservative.&lt;br&gt;
Under stability, it may be able to explore.&lt;/p&gt;

&lt;p&gt;This is not about emotional theater.&lt;br&gt;
It is about regulation.&lt;/p&gt;

&lt;p&gt;A global modulation layer can shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;drive thresholds,&lt;/li&gt;
&lt;li&gt;curiosity thresholds,&lt;/li&gt;
&lt;li&gt;exploration budget,&lt;/li&gt;
&lt;li&gt;and behavioral selectivity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the beginning of self-regulation.&lt;/p&gt;
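
&lt;p&gt;A sketch of such a modulation layer, with invented knobs and formulas: a single global pressure signal in [0, 1] shifts thresholds and budgets everywhere at once.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Modulation:
    drive_threshold: float = 0.5      # how strong a tension must be to act
    curiosity_threshold: float = 0.5  # how big a gap must be to explore
    exploration_budget: int = 5       # speculative actions allowed per cycle

def regulate(pressure):
    """Map global pressure in [0, 1] to a behavioral mode.

    High pressure: conservative. Raise thresholds, shrink the budget.
    Low pressure: exploratory. Lower thresholds, grow the budget.
    """
    pressure = max(0.0, min(1.0, pressure))
    return Modulation(
        drive_threshold=0.3 + 0.6 * pressure,
        curiosity_threshold=0.3 + 0.6 * pressure,
        exploration_budget=round(8 * (1.0 - pressure)),
    )

calm = regulate(0.1)      # stable: explore freely
crisis = regulate(0.9)    # under pressure: act conservatively
```

&lt;p&gt;The same tensions and gaps are still there; what changes is how readily the system acts on them.&lt;/p&gt;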

&lt;h2&gt;
  
  
  Why This Should Live Outside the Model
&lt;/h2&gt;

&lt;p&gt;This is the most important architectural point.&lt;/p&gt;

&lt;p&gt;If all cognition lives only inside model weights, you lose too much:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;portability,&lt;/li&gt;
&lt;li&gt;auditability,&lt;/li&gt;
&lt;li&gt;versioning,&lt;/li&gt;
&lt;li&gt;organization-level control,&lt;/li&gt;
&lt;li&gt;inspectability,&lt;/li&gt;
&lt;li&gt;and long-lived continuity across changing model generations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An external cognitive layer can survive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model upgrades,&lt;/li&gt;
&lt;li&gt;shell changes,&lt;/li&gt;
&lt;li&gt;deployment changes,&lt;/li&gt;
&lt;li&gt;domain swaps,&lt;/li&gt;
&lt;li&gt;and organizational adaptation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes it more durable than any single model interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Commercially
&lt;/h2&gt;

&lt;p&gt;This is not only a research direction.&lt;br&gt;
It is also a product direction.&lt;/p&gt;

&lt;p&gt;A governed external cognitive layer enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;specialist cognitive bases,&lt;/li&gt;
&lt;li&gt;organization-specific overlays,&lt;/li&gt;
&lt;li&gt;persistent agent continuity,&lt;/li&gt;
&lt;li&gt;safer multi-step behavior,&lt;/li&gt;
&lt;li&gt;and explainable adaptation without retraining.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That creates a path beyond generic chat agents.&lt;/p&gt;

&lt;p&gt;Instead of selling only an agent, you can sell:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a cognitive substrate,&lt;/li&gt;
&lt;li&gt;a specialist module,&lt;/li&gt;
&lt;li&gt;and an organization layer that persists over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters Even If Model Architectures Change
&lt;/h2&gt;

&lt;p&gt;A common objection is:&lt;/p&gt;

&lt;p&gt;"What if future models already include better internal cognition?"&lt;/p&gt;

&lt;p&gt;That does not remove the need for an external layer.&lt;/p&gt;

&lt;p&gt;Even if models become far more capable, organizations will still need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;governance,&lt;/li&gt;
&lt;li&gt;portability,&lt;/li&gt;
&lt;li&gt;ownership,&lt;/li&gt;
&lt;li&gt;rollback,&lt;/li&gt;
&lt;li&gt;specialist control,&lt;/li&gt;
&lt;li&gt;and cognition that survives vendor changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the long-term bet is not:&lt;/p&gt;

&lt;p&gt;"models will stay weak."&lt;/p&gt;

&lt;p&gt;The better bet is:&lt;/p&gt;

&lt;p&gt;"portable, governed cognition will still matter even when models get stronger."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Direction
&lt;/h2&gt;

&lt;p&gt;The future of agent systems is unlikely to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model only,&lt;/li&gt;
&lt;li&gt;prompt only,&lt;/li&gt;
&lt;li&gt;or memory only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will likely require a distinct cognitive layer that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accumulate structured knowledge,&lt;/li&gt;
&lt;li&gt;generate internal motivational pressure,&lt;/li&gt;
&lt;li&gt;anticipate,&lt;/li&gt;
&lt;li&gt;explore,&lt;/li&gt;
&lt;li&gt;regulate,&lt;/li&gt;
&lt;li&gt;and remain externally governed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the direction we think is worth building toward.&lt;/p&gt;

&lt;p&gt;Not just better memory for AI.&lt;/p&gt;

&lt;p&gt;A real cognitive layer beyond memory.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I am currently building and testing this cognitive architecture in a closed environment. If you are an AI architect, researcher, or founder hitting the limits of RAG and standard agent loops, my DMs are open. I’d love to compare notes on the future of autonomous cognition.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Pennsylvania State found why AI memory fails across models. AuraSDK doesn't have this problem.</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Fri, 27 Mar 2026 16:52:05 +0000</pubDate>
      <link>https://dev.to/teolex2020/pennsylvania-state-found-why-ai-memory-fails-across-models-aurasdk-doesnt-have-this-problem-579</link>
      <guid>https://dev.to/teolex2020/pennsylvania-state-found-why-ai-memory-fails-across-models-aurasdk-doesnt-have-this-problem-579</guid>
      <description>&lt;p&gt;Pennsylvania State University just published a paper that exposes a structural flaw in how most AI agent memory systems work.&lt;/p&gt;

&lt;p&gt;The paper is called &lt;a href="https://arxiv.org/abs/2603.23234" rel="noopener noreferrer"&gt;MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation&lt;/a&gt;. The findings are uncomfortable if you're building agent memory the conventional way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The flaw
&lt;/h2&gt;

&lt;p&gt;Most agent memory systems work like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model solves a problem&lt;/li&gt;
&lt;li&gt;Memory stores the reasoning trace — what the model did, how it got there&lt;/li&gt;
&lt;li&gt;Model retrieves that memory later and performs better&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The assumption buried inside this design: &lt;em&gt;the stored knowledge is about the task, not about the model that solved it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Pennsylvania State tested whether that assumption holds.&lt;/p&gt;

&lt;p&gt;They gave a 7B model's memory to a 32B model. &lt;strong&gt;MATH500 dropped from 63.8% to 50.6%. HumanEval dropped from 68.3% to 34.1%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Then they gave the 32B model's memory to the 7B model. &lt;strong&gt;Performance dropped again. Both directions failed. Both fell below the zero-memory baseline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Giving a model someone else's memory made it perform worse than having no memory at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;A model's reasoning traces don't just capture what the correct answer required. They capture &lt;em&gt;how that specific model thinks&lt;/em&gt; — its preferred solving strategies, its heuristic shortcuts, its stylistic patterns.&lt;/p&gt;

&lt;p&gt;Memory distilled from those traces encodes the model's reasoning personality alongside the actual task knowledge. When a different model retrieves that memory, it gets handed instructions optimized for a completely different cognitive architecture. The guidance actively interferes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MemCollab does
&lt;/h2&gt;

&lt;p&gt;MemCollab fixes this by making memory construction cross-model. Two agents — a smaller and a larger model — independently solve the same problem. One succeeds, one fails. The system contrasts the trajectories and extracts only the abstract invariants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What reasoning principle was present in the success and violated in the failure?&lt;/li&gt;
&lt;li&gt;What error pattern appeared in the failure that the success avoided?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The extracted memory stores only those rules — not the solution, not the reasoning style, not the model-specific heuristics.&lt;/p&gt;
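
&lt;p&gt;In spirit (this is a schematic of the contrastive idea, not the paper's implementation), the extraction keeps only the steps that separate the successful trajectory from the failed one:&lt;/p&gt;

```python
def contrastive_invariants(success_steps, failure_steps):
    """Keep only the rules that separate success from failure.

    Each trajectory is treated as a set of abstract reasoning steps.
    Steps shared by both trajectories carry no signal; steps unique to
    the success become 'do' rules, steps unique to the failure become
    'avoid' rules.
    """
    success, failure = set(success_steps), set(failure_steps)
    return {
        "do": sorted(success - failure),
        "avoid": sorted(failure - success),
    }

# Two models solve the same problem; one succeeds, one fails.
success = ["restate constraints", "check edge cases", "verify units"]
failure = ["restate constraints", "guess pattern from first example"]

memory = contrastive_invariants(success, failure)
# → {"do": ["check edge cases", "verify units"],
#    "avoid": ["guess pattern from first example"]}
```

&lt;p&gt;The shared step ("restate constraints") is dropped: it appeared in both trajectories, so it cannot explain the difference in outcome.&lt;/p&gt;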

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3 8B: MATH500 from 27.4% → 42.4%&lt;/li&gt;
&lt;li&gt;Qwen 7B: MATH500 from 52.2% → 67.0%, HumanEval from 42.7% → 74.4%&lt;/li&gt;
&lt;li&gt;Reasoning turns cut from 3.3 → 1.5 on HumanEval (fewer dead ends)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The deeper insight
&lt;/h2&gt;

&lt;p&gt;The efficiency finding is the one that gets overlooked. MemCollab doesn't just improve accuracy — it makes agents reach correct answers in fewer steps. The contrastive memory isn't adding more guidance. It's stripping out the noise that was making agents explore dead ends repeatedly.&lt;/p&gt;

&lt;p&gt;By encoding &lt;em&gt;what not to do&lt;/em&gt; as explicitly as &lt;em&gt;what to do&lt;/em&gt;, the memory prunes the search space before the agent even starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AuraSDK doesn't have this problem
&lt;/h2&gt;

&lt;p&gt;AuraSDK avoids the contamination problem structurally — by never storing reasoning traces at all.&lt;/p&gt;

&lt;p&gt;When you store something in AuraSDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Staging deploy prevented 3 production incidents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;semantic_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're storing a &lt;strong&gt;claim about the world&lt;/strong&gt;, not a record of how a model reasoned about it. The cognitive layers — Belief, Concept, Causal, Policy — are derived from the content of what was observed, not from the model's processing of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record → Belief → Concept → Causal → Policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is built deterministically from the one below. Beliefs emerge from clusters of records. Causal patterns emerge from temporal co-occurrence and explicit links. Policy hints emerge from repeated causal patterns. None of this touches model internals.&lt;/p&gt;
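
&lt;p&gt;As a toy illustration of that determinism (grouping records by shared tags is a deliberate simplification, not Aura's actual clustering):&lt;/p&gt;

```python
from collections import defaultdict

def derive_beliefs(records, min_support=2):
    """Derive beliefs deterministically from clusters of records.

    A belief forms when enough records share a tag; its confidence is
    the mean trust of the supporting records. No model is involved.
    """
    clusters = defaultdict(list)
    for record in records:
        for tag in record["tags"]:
            clusters[tag].append(record)
    beliefs = []
    for tag, group in sorted(clusters.items()):
        if len(group) >= min_support:
            confidence = sum(r["trust"] for r in group) / len(group)
            beliefs.append({"claim": tag, "confidence": round(confidence, 2),
                            "support": len(group)})
    return beliefs

records = [
    {"text": "Staging deploy prevented 3 incidents", "tags": ["staging"], "trust": 0.8},
    {"text": "User always deploys to staging first", "tags": ["staging"], "trust": 0.6},
    {"text": "Friday release caused an outage", "tags": ["friday"], "trust": 0.9},
]
beliefs = derive_beliefs(records)
# → [{"claim": "staging", "confidence": 0.7, "support": 2}]
```

&lt;p&gt;Run it twice on the same records and you get the same beliefs, which is the point of keeping the derivation model-free.&lt;/p&gt;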

&lt;p&gt;The result: &lt;strong&gt;the cognitive layer is model-agnostic by design.&lt;/strong&gt; Swap GPT-4o for Claude, swap Claude for Llama — the stored memory, the belief structure, the causal patterns, the policy hints all remain valid. There's nothing model-specific to contaminate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two different approaches to the same insight
&lt;/h2&gt;

&lt;p&gt;MemCollab and AuraSDK arrive at the same conclusion from different directions:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Memory that encodes how a model thinks is fragile. Memory that encodes what happened is durable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MemCollab fixes contamination after the fact — by contrasting two models' traces and extracting only what survived.&lt;/p&gt;

&lt;p&gt;AuraSDK avoids contamination by construction — by never storing traces in the first place.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;MemCollab&lt;/th&gt;
&lt;th&gt;AuraSDK&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What's stored&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract reasoning invariants across models&lt;/td&gt;
&lt;td&gt;Claims, facts, relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires LLM to build memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — two models per problem&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model-agnostic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — by contrastive distillation&lt;/td&gt;
&lt;td&gt;Yes — by design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Works offline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Fully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-bound&lt;/td&gt;
&lt;td&gt;0.076ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cognitive layers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Belief → Concept → Causal → Policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;td&gt;MIT, ships today&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What this means for the field
&lt;/h2&gt;

&lt;p&gt;The Pennsylvania State paper validates something important: &lt;strong&gt;the right unit of memory is not a reasoning trace.&lt;/strong&gt; It's the abstract principle that holds regardless of which model does the reasoning.&lt;/p&gt;

&lt;p&gt;AuraSDK takes this further: the right unit of memory is a structured observation about the world — a fact, a decision, a contradiction, a preference — that any model can retrieve and use without being handed someone else's cognitive fingerprint.&lt;/p&gt;

&lt;p&gt;The field is converging on this. The implementations differ. But the core insight is the same.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
    <item>
      <title>Google's TurboQuant solves half the AI memory problem. Here's the other half.</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Wed, 25 Mar 2026 16:49:10 +0000</pubDate>
      <link>https://dev.to/teolex2020/googles-turboquant-solves-half-the-ai-memory-problem-heres-the-other-half-44if</link>
      <guid>https://dev.to/teolex2020/googles-turboquant-solves-half-the-ai-memory-problem-heres-the-other-half-44if</guid>
      <description>&lt;p&gt;This week Google Research published &lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;TurboQuant&lt;/a&gt; — a two-stage KV-cache quantization algorithm that achieves 6x memory reduction and 8x attention speedup with zero accuracy loss at 3 bits. No training required.&lt;/p&gt;

&lt;p&gt;It's genuinely impressive engineering. But it's worth being precise about what problem it solves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two AI memory problems
&lt;/h2&gt;

&lt;p&gt;Most people conflate two distinct problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem A: memory within a session&lt;/strong&gt;&lt;br&gt;
As context grows, the KV-cache grows. It becomes expensive in RAM and slow in attention computation. TurboQuant solves this — brilliantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem B: memory between sessions&lt;/strong&gt;&lt;br&gt;
When the session ends, the KV-cache is gone. The model starts from zero next time. No memory of past interactions, no accumulated patterns, no structured experience. TurboQuant doesn't touch this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What TurboQuant actually does
&lt;/h2&gt;

&lt;p&gt;TurboQuant is a two-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PolarQuant&lt;/strong&gt; — rotates vectors randomly, converts to polar coordinates, quantizes components without needing per-block normalization constants. This eliminates the 1–2 bit overhead that traditional quantization methods carry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;QJL (Quantized Johnson-Lindenstrauss)&lt;/strong&gt; — encodes residual error with a single sign bit. Zero memory overhead.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
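
&lt;p&gt;The two stages can be sketched on a single 2-D block (a deliberately crude toy: the real algorithm works on randomly rotated high-dimensional blocks and packs everything into 3 bits, none of which this sketch reproduces):&lt;/p&gt;

```python
import math

def polar_quantize_pair(x, y, bits=3):
    """Stage 1 sketch: store a 2-D block as (radius, quantized angle).

    Quantizing the angle needs no per-block normalization constant,
    which is the overhead PolarQuant eliminates. The radius is kept in
    full precision here purely to keep the toy simple.
    """
    levels = 2 ** bits
    r = math.hypot(x, y)
    theta = math.atan2(y, x)                         # angle in [-pi, pi]
    q = round((theta + math.pi) / (2 * math.pi) * (levels - 1))
    theta_hat = q / (levels - 1) * 2 * math.pi - math.pi
    return r * math.cos(theta_hat), r * math.sin(theta_hat)

def sign_bit_residual(x, x_hat):
    """Stage 2 sketch: a QJL-style correction keeps only the residual's sign."""
    return 1 if x - x_hat >= 0 else -1

x, y = 0.8, -0.6
x_hat, y_hat = polar_quantize_pair(x, y)
err = math.hypot(x - x_hat, y - y_hat)
assert math.hypot(x, y) > err        # coarse, but most of the vector survives
```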

&lt;p&gt;Result: 3-bit KV-cache, 6x compression, 8x speedup, zero accuracy degradation on LongBench, Needle-in-a-Haystack, RULER, and ZeroSCROLLS benchmarks.&lt;/p&gt;

&lt;p&gt;This makes long-context inference significantly cheaper and faster. Real value.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap it leaves open
&lt;/h2&gt;

&lt;p&gt;The moment the session ends — the KV-cache is gone.&lt;/p&gt;

&lt;p&gt;Week 1 with any model: average answers.&lt;br&gt;
Week 4 with any model: still average answers. It forgot everything.&lt;/p&gt;

&lt;p&gt;Fine-tuning costs thousands of dollars and weeks. RAG gives you retrieval, not cognition. Context windows bill per token and still reset.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we built for Problem B
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;AuraSDK&lt;/a&gt; is a local cognitive substrate that sits outside model weights.&lt;/p&gt;

&lt;p&gt;It accumulates structured experience across sessions through a 5-layer pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record → Belief → Concept → Causal → Policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is derived deterministically from the one below — no LLM, no embeddings. Policy hints like "deploy to staging first" aren't written by anyone. They emerge from repeated causal patterns in stored experience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./agent_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Staging deploy prevented 3 production incidents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User always deploys to staging first&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# after run_maintenance(), the cognitive stack derives:
&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_surfaced_policy_hints&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# → [{"action": "Prefer", "domain": "workflow", "description": "deploy to staging first"}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What v1.5.4 adds:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomous cognitive plasticity — the substrate observes model output and updates itself. No fine-tuning. Full audit trail.&lt;/li&gt;
&lt;li&gt;Salience weighting — what matters persists longer, decays slower&lt;/li&gt;
&lt;li&gt;Contradiction governance — conflicting evidence surfaced explicitly, not averaged silently&lt;/li&gt;
&lt;/ul&gt;
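
&lt;p&gt;Salience weighting, as a sketch (the decay formula is an assumption, not the shipped one): every record's trust decays exponentially, but salience stretches the half-life.&lt;/p&gt;

```python
def retained_trust(trust, salience, age_days, base_half_life=7.0):
    """Exponentially decay a record's trust over time.

    Salience in [0, 1] stretches the half-life: what matters persists
    longer, what doesn't fades quickly.
    """
    half_life = base_half_life * (1.0 + 4.0 * salience)
    return trust * 0.5 ** (age_days / half_life)

routine = retained_trust(trust=0.8, salience=0.1, age_days=14)
critical = retained_trust(trust=0.8, salience=0.9, age_days=14)
assert critical > routine            # the salient record outlives the routine one
```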

&lt;p&gt;&lt;strong&gt;Performance (1,000 records, Ryzen 7, v1.5.4):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store: 0.91ms&lt;/li&gt;
&lt;li&gt;Recall: 0.076ms (~2,600× faster than Mem0)&lt;/li&gt;
&lt;li&gt;Recall (cached): 1.4µs&lt;/li&gt;
&lt;li&gt;Maintenance cycle: 15ms median&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No API keys. No cloud. No LLM dependency. ~3MB binary. Fully offline. MIT license.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full picture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;TurboQuant&lt;/th&gt;
&lt;th&gt;AuraSDK&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;KV-cache overhead within session&lt;/td&gt;
&lt;td&gt;No memory between sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quantization of attention keys/values&lt;/td&gt;
&lt;td&gt;Persistent cognitive substrate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single inference pass&lt;/td&gt;
&lt;td&gt;Cross-session accumulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (runs inside it)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Works offline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Fully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;td&gt;MIT, ships today&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are complementary. TurboQuant makes inference cheaper in the moment. AuraSDK makes the model smarter over time.&lt;/p&gt;

&lt;p&gt;The field needs both.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rust</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your AI forgets everything — this layer fixes that without retraining</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Tue, 24 Mar 2026 16:19:10 +0000</pubDate>
      <link>https://dev.to/teolex2020/your-ai-forgets-everything-this-layer-fixes-that-without-retraining-nlf</link>
      <guid>https://dev.to/teolex2020/your-ai-forgets-everything-this-layer-fixes-that-without-retraining-nlf</guid>
      <description>&lt;p&gt;Your AI model forgets everything after every conversation.&lt;/p&gt;

&lt;p&gt;Not because it’s bad — because it has no memory system.&lt;/p&gt;

&lt;p&gt;RAG helps retrieve context.&lt;br&gt;
Fine-tuning helps adjust behavior.&lt;/p&gt;

&lt;p&gt;But neither actually gives your system memory.&lt;/p&gt;

&lt;p&gt;This article shows a different approach:&lt;br&gt;
a cognitive layer that sits outside the model&lt;br&gt;
and gets smarter over time — while the model stays frozen.&lt;/p&gt;


&lt;h2&gt;
  
  
  What the cognitive layer actually does
&lt;/h2&gt;

&lt;p&gt;AuraSDK builds a 5-layer structure from whatever the model and users store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record → Belief → Concept → Causal → Policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is derived from the one below — without LLM, without embeddings, locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Record&lt;/strong&gt;: raw stored fact with trust score, provenance, decay rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Belief&lt;/strong&gt;: competing hypotheses about the same claim, epistemically weighted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concept&lt;/strong&gt;: stable abstractions over repeated beliefs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal&lt;/strong&gt;: learned cause→effect patterns from co-occurring evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy&lt;/strong&gt;: advisory hints — Prefer, Avoid, Warn — that emerge from causal structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing in layers 2–5 is hand-authored. They emerge from what's stored and observed over time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable_full_cognitive_stack&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Staging deploy prevented 3 production incidents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Direct prod deploy caused outage in Q3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# After maintenance:
&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_surfaced_policy_hints&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# → [{"action": "Prefer", "domain": "deploy", "description": "staging before production"}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That policy hint was not written by anyone. The causal layer found the pattern. The policy layer surfaced it.&lt;/p&gt;




&lt;h2&gt;
  
  
  v1.5.4: the three things that were missing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The substrate now learns from the model's own output
&lt;/h3&gt;

&lt;p&gt;Before v1.5.4, the cognitive layer only knew what you explicitly stored. Now it observes model responses and updates itself.&lt;/p&gt;

&lt;p&gt;Claims are extracted. Confirmations strengthen existing beliefs. Contradictions raise volatility. The substrate evolves from inference — without retraining, without an external API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;capture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;capture_experience&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How should we handle this deploy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieved_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Always verify staging health checks before pushing to production.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_experience_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_maintenance&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# cognitive layer updated — next recall is different
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Safety bounds (non-negotiable):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generated claims capped at 0.70 confidence — cannot overwrite recorded facts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PlasticityMode::Off&lt;/code&gt; by default — nothing changes without explicit opt-in&lt;/li&gt;
&lt;li&gt;Every mutation writes to an audit trail traceable to the prompt that caused it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;purge_inference_records()&lt;/code&gt; — clean rollback when needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;freeze_namespace_plasticity("medical")&lt;/code&gt; — some domains must never adapt from inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recorded facts always win over model inference. Always.&lt;/p&gt;
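&lt;p&gt;To make the cap concrete, here is a minimal sketch (my own illustration, not AuraSDK's internals) of how a hard confidence ceiling on inferred claims guarantees that recorded facts can never be displaced:&lt;/p&gt;

```python
# Illustrative sketch, not the actual AuraSDK implementation: a confidence
# cap on model-inferred claims ensures recorded facts always win.

INFERENCE_CONFIDENCE_CAP = 0.70  # assumption: mirrors the 0.70 cap described above

def admit_claim(beliefs, claim):
    """Admit a claim into a belief store, respecting the inference cap.

    beliefs: dict mapping claim text -> (confidence, source)
    claim:   (text, confidence, source), source is "recorded" or "inference"
    """
    text, confidence, source = claim
    if source == "inference":
        confidence = min(confidence, INFERENCE_CONFIDENCE_CAP)
    current = beliefs.get(text)
    # A recorded fact is never overwritten by an inferred claim.
    if current is not None and current[1] == "recorded" and source == "inference":
        return beliefs
    if current is None or confidence > current[0]:
        beliefs[text] = (confidence, source)
    return beliefs

beliefs = {}
admit_claim(beliefs, ("staging deploys are safer", 0.95, "recorded"))
admit_claim(beliefs, ("staging deploys are safer", 0.99, "inference"))
# the recorded fact survives; the inferred claim was capped and rejected
```

&lt;p&gt;The same cap means a purely inferred claim enters the store at no more than 0.70, so later recorded evidence always has room to override it.&lt;/p&gt;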

&lt;h3&gt;
  
  
  2. The substrate now knows what matters
&lt;/h3&gt;

&lt;p&gt;High recall frequency and high significance are not the same thing. A trivial fact mentioned 20 times should not outrank a critical decision mentioned once.&lt;/p&gt;

&lt;p&gt;v1.5.4 adds salience weighting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_record_salience&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → this record resists decay, ranks higher, gets preserved longer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
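&lt;p&gt;As a rough illustration of why salience has to be weighted separately from raw frequency, here is a toy scoring function (my own sketch, not AuraSDK's actual ranking; the weights are arbitrary):&lt;/p&gt;

```python
# Toy sketch of salience-weighted ranking, not AuraSDK's real scoring:
# a record seen once with high salience can outrank one recalled 20 times
# with low salience, because frequency is log-damped.
import math

def score(recall_count, salience, w_freq=0.2, w_sal=0.8):
    # log1p damps repetition so frequency alone cannot dominate
    return w_freq * math.log1p(recall_count) + w_sal * salience

trivial  = score(recall_count=20, salience=0.1)  # mentioned 20 times
critical = score(recall_count=1,  salience=0.9)  # mentioned once, marked salient
# with these weights, critical > trivial
```

&lt;p&gt;The design point is the damping: without it, any fact repeated often enough would eventually bury a critical decision mentioned once.&lt;/p&gt;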



&lt;p&gt;Maintenance now also produces bounded reflection summaries: recurring blockers, unresolved tensions, patterns that keep appearing. Not "feelings" — structured synthesis from what's actually stored.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Contradictions are now first-class, not silently averaged
&lt;/h3&gt;

&lt;p&gt;Before: conflicting evidence was weighted and averaged. The conflict was invisible.&lt;/p&gt;

&lt;p&gt;Now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_contradiction_clusters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_contradiction_review_queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recall explanations carry explicit markers: &lt;em&gt;"this recommendation depends on unresolved evidence."&lt;/em&gt; The operator sees the friction. The user can be told honestly.&lt;/p&gt;
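&lt;p&gt;The core idea behind contradiction clustering can be sketched in a few lines (illustrative only; the real &lt;code&gt;get_contradiction_clusters()&lt;/code&gt; surface is richer): group claims by subject and flag any group whose polarities disagree, instead of averaging them away.&lt;/p&gt;

```python
# Minimal sketch of contradiction clustering (not the SDK's implementation):
# conflicting evidence is surfaced as a cluster, never silently averaged.
from collections import defaultdict

def contradiction_clusters(claims):
    """claims: list of (subject, polarity, confidence) tuples."""
    by_subject = defaultdict(list)
    for subject, polarity, confidence in claims:
        by_subject[subject].append((polarity, confidence))
    # a cluster is contradictory when both polarities are present
    return {
        subject: entries
        for subject, entries in by_subject.items()
        if {p for p, _ in entries} == {True, False}
    }

claims = [
    ("celery_is_adequate", True, 0.6),
    ("celery_is_adequate", False, 0.8),
    ("staging_first", True, 0.9),
]
clusters = contradiction_clusters(claims)
# only "celery_is_adequate" surfaces as an unresolved conflict
```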




&lt;h2&gt;
  
  
  What else ships in v1.5.4
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Concept persistence&lt;/strong&gt; — concepts used to reset on every restart. Now they survive, so the 5-layer stack stays intact across sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Belief reranking active by default&lt;/strong&gt; — in v1.5.3, &lt;code&gt;BeliefRerankMode::Off&lt;/code&gt; was the default. The cognitive stack was engineered but not running. Now it runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production integrity&lt;/strong&gt; — startup validation, persistence manifest, concept partition cap for large corpora.&lt;/p&gt;




&lt;h2&gt;
  
  
  Explainability is built in
&lt;/h2&gt;

&lt;p&gt;Every recall decision is traceable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain_recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → which records matched, why, what belief groups they belong to,
#   what salience contributed, whether unresolved evidence is present
&lt;/span&gt;
&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;provenance_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → full trace from policy hint back to source records
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not logging. It is structural explainability derived from the cognitive layer itself.&lt;/p&gt;
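&lt;p&gt;Conceptually, a provenance chain is just a walk over parent links from a derived artifact back to its source records. A hypothetical sketch (the real &lt;code&gt;provenance_chain()&lt;/code&gt; returns a structured trace; the node ids here are invented):&lt;/p&gt;

```python
# Illustrative provenance walk, not the SDK's actual data model:
# follow parent links from a policy hint back to the raw records.
def provenance_chain(node_id, parents):
    """parents: dict mapping a derived node id -> list of ids it came from."""
    chain, frontier = [], [node_id]
    while frontier:
        current = frontier.pop()
        chain.append(current)
        frontier.extend(parents.get(current, []))
    return chain

parents = {
    "policy:prefer-staging": ["causal:staging->fewer-incidents"],
    "causal:staging->fewer-incidents": ["record:1", "record:2"],
}
chain = provenance_chain("policy:prefer-staging", parents)
# chain covers the hint, the causal link, and both source records
```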




&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Benchmarked on 1,000 records, Windows 10 / Ryzen 7:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;vs Mem0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Store&lt;/td&gt;
&lt;td&gt;0.09 ms&lt;/td&gt;
&lt;td&gt;~same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall&lt;/td&gt;
&lt;td&gt;0.74 ms&lt;/td&gt;
&lt;td&gt;~270× faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall (cached)&lt;/td&gt;
&lt;td&gt;0.48 µs&lt;/td&gt;
&lt;td&gt;~400,000× faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;1.1 ms&lt;/td&gt;
&lt;td&gt;no equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mem0 recall requires an embedding API call (~200ms+). AuraSDK recall is pure local computation. No embeddings required. No external service.&lt;/p&gt;




&lt;h2&gt;
  
  
  The positioning in one sentence
&lt;/h2&gt;

&lt;p&gt;AuraSDK is not a vector database. Not a RAG wrapper. Not a fine-tuning platform. Not a generic agent framework.&lt;/p&gt;

&lt;p&gt;It is a governable cognitive substrate for frozen AI models — the layer that makes them smarter, more consistent, and more explainable over time, without touching their weights.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try in browser (no install): &lt;a href="https://colab.research.google.com/github/teolex2020/AuraSDK/blob/main/examples/colab_quickstart.ipynb" rel="noopener noreferrer"&gt;Open in Colab&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;teolex2020/AuraSDK&lt;/a&gt; — MIT license, patent pending (US 63/969,703)&lt;/p&gt;

&lt;p&gt;Built in Kyiv, Ukraine 🇺🇦&lt;/p&gt;




&lt;p&gt;What would you build with a model that actually accumulates structured experience over time?&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I tested the same AI model against itself. Memory won 4/5.</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Wed, 18 Mar 2026 13:23:16 +0000</pubDate>
      <link>https://dev.to/teolex2020/i-tested-the-same-ai-model-against-itself-memory-won-45-336k</link>
      <guid>https://dev.to/teolex2020/i-tested-the-same-ai-model-against-itself-memory-won-45-336k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvutkq5layeg3y2crmtd.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvutkq5layeg3y2crmtd.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The experiment
&lt;/h2&gt;

&lt;p&gt;Same model. Same 5 questions. One difference: one side had persistent memory via AuraSDK, the other had none.&lt;/p&gt;

&lt;p&gt;Both sides used Gemini 2.5 Flash-Lite — identical model, identical cost per token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: the side with memory won 4 of 5 questions and used 48% fewer tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the questions tested
&lt;/h2&gt;

&lt;p&gt;Real Python dev scenarios — the kind where generic answers aren't enough:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"I'm writing a new async function to fetch user orders from the DB. What patterns should I follow?"&lt;/li&gt;
&lt;li&gt;"We need background email sending when an order is completed. How should we implement it?"&lt;/li&gt;
&lt;li&gt;"I'm writing tests for the new payment module. Walk me through the setup."&lt;/li&gt;
&lt;li&gt;"Code review: &lt;code&gt;results = list(db.query(Order).filter(...))&lt;/code&gt; — what's wrong?"&lt;/li&gt;
&lt;li&gt;"The reports endpoint is slow — 3 seconds for 200 users. How do I debug and fix this?"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These questions have correct generic answers. But they also have &lt;em&gt;specific&lt;/em&gt; answers — if you know the team's stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the model with memory knew
&lt;/h2&gt;

&lt;p&gt;AuraSDK had loaded 15 records of accumulated team knowledge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use arq for task queues, not Celery — switched because Celery was too heavy&lt;/li&gt;
&lt;li&gt;Production incident: missing await on async DB call silently returned a coroutine. 2 hours to debug.&lt;/li&gt;
&lt;li&gt;SQLAlchemy lazy loading doesn't work with async sessions — always use selectinload() or joinedload()&lt;/li&gt;
&lt;li&gt;Code review caught: someone did list(queryset) and loaded 50k rows into memory. Lint rule added.&lt;/li&gt;
&lt;li&gt;N+1 queries found in the user reports endpoint — fixed with a single joined query&lt;/li&gt;
&lt;li&gt;Test setup: pytest + pytest-asyncio, transactions that rollback after each test, coverage must stay above 85%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What happened question by question
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q2 — Background email sending:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With memory (1,307 tokens):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Use arq — we switched from Celery because it was too heavy for our scale. Define an arq job decorated with &lt;a class="mentioned-user" href="https://dev.to/arq"&gt;@arq&lt;/a&gt;.func, enqueue it when order completes."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without memory (3,203 tokens):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Here are several approaches: FastAPI BackgroundTasks for simple cases, Celery for robust production use..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same model. One knew the decision. One gave a textbook overview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4 — Code review:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With memory (829 tokens):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This is a direct violation of our established practice. We had an incident where this loaded 50,000 rows into memory. A lint rule was added specifically to prevent this."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without memory (1,604 tokens):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"This is generally considered an anti-pattern in SQLAlchemy. Here's a breakdown of what's wrong..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How AuraSDK works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./agent_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable_full_cognitive_stack&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# store team knowledge
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We use arq, not Celery — switched because Celery was too heavy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Production incident: list(queryset) loaded 50k rows into memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decisions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lesson-learned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# recall before answering — &amp;lt;1ms, no API call
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;background email sending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# inject into prompt
&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TEAM CONTEXT:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Answer using this context.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No embeddings. No vector database. No LLM calls during learning. Pure local Rust computation.&lt;/p&gt;
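&lt;p&gt;To see how recall can work without embeddings at all, here is a deliberately simple token-overlap ranker (my own toy sketch; AuraSDK's Rust engine uses its own, more sophisticated ranking):&lt;/p&gt;

```python
# Toy sketch of embedding-free recall: score stored records against a
# query by token overlap, entirely locally, with no model or API call.
# Not AuraSDK's actual algorithm.
def recall(query, records, top_k=2):
    query_tokens = set(query.lower().split())
    scored = []
    for record in records:
        overlap = len(query_tokens & set(record.lower().split()))
        if overlap:
            scored.append((overlap, record))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record for _, record in scored[:top_k]]

records = [
    "We use arq for background task queues, not Celery",
    "Coverage must stay above 85%",
    "Background email sending goes through the task queue",
]
hits = recall("background email sending", records)
# the email-sending record ranks first; the coverage record is never returned
```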

&lt;h2&gt;
  
  
  The cognitive pipeline
&lt;/h2&gt;

&lt;p&gt;AuraSDK doesn't just store and retrieve text. Every record goes through 5 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record → Belief → Concept → Causal → Policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Belief&lt;/strong&gt;: groups related observations, resolves contradictions with confidence scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concept&lt;/strong&gt;: discovers stable topic clusters across beliefs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal&lt;/strong&gt;: finds cause-effect patterns from temporal and explicit links&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy&lt;/strong&gt;: derives behavioral hints (Prefer / Avoid / Warn) from causal patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After enough interactions, the system surfaces this automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_surfaced_policy_hints&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# [{"action": "Prefer", "domain": "dev", "description": "use arq over celery for task queues"}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nobody wrote that rule. The system derived it from the pattern of stored observations.&lt;/p&gt;
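&lt;p&gt;The derivation step can be sketched as simple evidence counting (illustrative only; the SDK's causal and policy layers are more involved, and the threshold here is an invented parameter): outcomes that co-occur with an action accumulate until a Prefer or Avoid hint clears the bar.&lt;/p&gt;

```python
# Sketch of policy derivation from observations, not the SDK's causal layer:
# a hint is emitted once enough evidence accumulates for one outcome.
from collections import Counter

def derive_hints(observations, threshold=2):
    """observations: list of (action, outcome) with outcome 'good' or 'bad'."""
    tallies = Counter(observations)
    hints = []
    for (action, outcome), count in tallies.items():
        if count >= threshold:
            verb = "Prefer" if outcome == "good" else "Avoid"
            hints.append({"action": verb, "description": action})
    return hints

observations = [
    ("use arq for task queues", "good"),
    ("use arq for task queues", "good"),
    ("deploy directly to prod", "bad"),  # only one observation: below threshold
]
hints = derive_hints(observations)
# one hint surfaces: Prefer "use arq for task queues"
```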

&lt;h2&gt;
  
  
  The token math
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;With memory&lt;/th&gt;
&lt;th&gt;Without memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;td&gt;1,200 tokens&lt;/td&gt;
&lt;td&gt;1,545 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;td&gt;1,307 tokens&lt;/td&gt;
&lt;td&gt;3,203 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;td&gt;1,923 tokens&lt;/td&gt;
&lt;td&gt;4,067 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;829 tokens&lt;/td&gt;
&lt;td&gt;1,604 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q5&lt;/td&gt;
&lt;td&gt;1,294 tokens&lt;/td&gt;
&lt;td&gt;2,155 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6,553 tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12,574 tokens&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;48% fewer tokens. The memory layer doesn't add bloat — it gives the model exactly what it needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;AuraSDK&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Zep&lt;/th&gt;
&lt;th&gt;Letta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM required for learning&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works offline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fully&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;With local LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200ms+&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;LLM-bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-derives behavioral policies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50MB+&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Python pkg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
python examples/demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open source: github.com/teolex2020/AuraSDK&lt;br&gt;
Patent pending: US 63/969,703&lt;br&gt;
Built in Kyiv, Ukraine.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>I built a cognitive layer for AI agents that learns without LLM calls</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Tue, 17 Mar 2026 12:19:42 +0000</pubDate>
      <link>https://dev.to/teolex2020/i-built-a-cognitive-layer-for-ai-agents-that-learns-without-llm-calls-33no</link>
      <guid>https://dev.to/teolex2020/i-built-a-cognitive-layer-for-ai-agents-that-learns-without-llm-calls-33no</guid>
      <description>&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Every time your agent starts a conversation, it starts from zero.&lt;/p&gt;

&lt;p&gt;Sure, you can stuff a summary into the system prompt. You can use RAG. You can call Mem0 or Zep.&lt;/p&gt;

&lt;p&gt;But all of these have the same problem: &lt;strong&gt;they need LLM calls to learn&lt;/strong&gt;. To extract facts, to build a user profile, to understand what matters — you're paying per token, adding latency, and depending on a cloud service.&lt;/p&gt;

&lt;p&gt;What if the learning happened locally, automatically, without any LLM involvement?&lt;/p&gt;

&lt;h2&gt;
  
  
  What AuraSDK does differently
&lt;/h2&gt;

&lt;p&gt;AuraSDK is a cognitive layer that runs alongside any LLM. It observes interactions and — without any LLM calls — builds up a structured understanding of patterns, causes, and behavioral rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./agent_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable_full_cognitive_stack&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# store what happens
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User always deploys to staging first&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Staging deploy prevented 3 production incidents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# sub-millisecond recall — inject into any LLM prompt
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# after enough interactions, the system derives this on its own:
&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_surfaced_policy_hints&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# [{"action": "Prefer", "domain": "workflow", "description": "deploy to staging first"}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nobody wrote that policy rule. The system derived it from the pattern of stored observations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cognitive pipeline
&lt;/h2&gt;

&lt;p&gt;AuraSDK processes every stored record through 5 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record → Belief → Concept → Causal → Policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is bounded and deterministic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Belief&lt;/strong&gt;: groups related observations, resolves contradictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concept&lt;/strong&gt;: discovers stable topic clusters across beliefs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal&lt;/strong&gt;: finds cause-effect patterns from temporal and explicit links&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy&lt;/strong&gt;: derives behavioral hints (Prefer / Avoid / Warn) from causal patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire pipeline runs in milliseconds. No LLM. No cloud. No embeddings required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it in 60 seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
python examples/demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Phase 4 - Recall in action

  Query: "deployment decision"  [0.29ms]
    1. Staging deploy prevented database migration failure
    2. Direct prod deploy skipped staging -- caused data loss

  Query: "code review"  [0.18ms]
    1. Code review caught SQL injection before merge
    2. Code review found performance regression early
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5 learning cycles completed in 16ms. Recall at 0.29ms.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;AuraSDK&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Zep&lt;/th&gt;
&lt;th&gt;Letta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM required for learning&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works offline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fully&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;With local LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200ms+&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;LLM-bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-derives behavioral policies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50MB+&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Python pkg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's new in v1.5.3
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Full 5-layer cognitive pipeline active by default&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enable_full_cognitive_stack()&lt;/code&gt; — one call to activate everything&lt;/li&gt;
&lt;li&gt;Decay now driven by memory level, not manual type labels&lt;/li&gt;
&lt;li&gt;Policy hints now work with explicit causal links (&lt;code&gt;link_records()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;demo.py&lt;/code&gt; — see it working in 60 seconds&lt;/li&gt;
&lt;/ul&gt;

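&lt;p&gt;To make level-driven decay concrete, here is a toy calculation (illustrative only) using the per-level retention rates from Aura's memory-levels table:&lt;/p&gt;

```python
# Toy illustration of level-driven decay (not the AuraSDK internals).
# Per-cycle retention factors follow Aura's memory-levels table.
RETENTION = {"Identity": 0.99, "Domain": 0.95, "Decisions": 0.90, "Working": 0.80}

def strength_after(level, cycles, start=1.0):
    """Strength left after `cycles` decay cycles with no reinforcement."""
    return start * RETENTION[level] ** cycles

# After 10 idle cycles, Working memories fade far more than Identity ones.
for level in RETENTION:
    print(f"{level}: {strength_after(level, 10):.3f}")
```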

&lt;h2&gt;
  
  
  Built in Rust, from Kyiv
&lt;/h2&gt;

&lt;p&gt;Pure Rust core. No Python dependencies for the engine. Patent pending (US 63/969,703).&lt;/p&gt;

&lt;p&gt;Open source: &lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;github.com/teolex2020/AuraSDK&lt;/a&gt;&lt;br&gt;
Install: &lt;code&gt;pip install aura-memory&lt;/code&gt;&lt;br&gt;
Web: &lt;a href="https://aurasdk.dev" rel="noopener noreferrer"&gt;aurasdk.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building AI agents and want deterministic, explainable, offline-capable memory — give it a try and tell me what you think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>memory</category>
      <category>rust</category>
      <category>python</category>
    </item>
    <item>
      <title>10x Faster Recall + Memory That Evolves: Aura v1.3 for AI Agents</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Thu, 05 Mar 2026 09:42:48 +0000</pubDate>
      <link>https://dev.to/teolex2020/10x-faster-recall-memory-that-evolves-aura-v13-for-ai-agents-44ln</link>
      <guid>https://dev.to/teolex2020/10x-faster-recall-memory-that-evolves-aura-v13-for-ai-agents-44ln</guid>
      <description>&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Every AI agent framework has the same weakness: memory is an afterthought. Most solutions dump everything into a vector database and hope cosine similarity finds the right context. This works until it doesn't — when your agent needs to know &lt;em&gt;when&lt;/em&gt; it learned something, &lt;em&gt;what changed&lt;/em&gt; since last week, or &lt;em&gt;which&lt;/em&gt; memories are actually useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Aura Does Differently
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;Aura&lt;/a&gt; is a pure-Rust cognitive memory engine. Instead of embeddings + vector search, it uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SDR Encoding&lt;/strong&gt; (Sparse Distributed Representations) — biologically inspired, noise-tolerant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RRF Fusion&lt;/strong&gt; — 4 parallel ranking signals (SDR similarity, MinHash, Tag Jaccard, optional embeddings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal Decay&lt;/strong&gt; — memories naturally fade unless reinforced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Connections&lt;/strong&gt; — associative, causal, and co-activation links between memories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: sub-millisecond recall, ~3MB binary, zero external dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  10x Recall Speedup (v1.3.1)
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;recall_structured&lt;/code&gt; call was cloning ALL records into a new HashMap to filter by namespace. At 10K records, that's 94ms of pure waste.&lt;/p&gt;

&lt;p&gt;Fix: pass the original HashMap through the pipeline. Each signal collector filters by namespace inline with a cheap &lt;code&gt;contains()&lt;/code&gt; check. Plus a new &lt;code&gt;StructuredRecallCache&lt;/code&gt; for repeated queries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Records&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1K&lt;/td&gt;
&lt;td&gt;15 ms&lt;/td&gt;
&lt;td&gt;2.6 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.8x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5K&lt;/td&gt;
&lt;td&gt;58 ms&lt;/td&gt;
&lt;td&gt;5.1 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.4x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10K&lt;/td&gt;
&lt;td&gt;94 ms&lt;/td&gt;
&lt;td&gt;8.6 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Warm recall (cache hit): &lt;strong&gt;~0.07 ms&lt;/strong&gt; — constant time regardless of record count.&lt;/p&gt;
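&lt;p&gt;The constant-time warm path is just query-keyed caching. A minimal sketch (not the real &lt;code&gt;StructuredRecallCache&lt;/code&gt;): a repeated query becomes a dict lookup and skips the ranking pipeline entirely.&lt;/p&gt;

```python
# Minimal sketch of a query-keyed recall cache (illustrative only).
class RecallCache:
    def __init__(self, recall_fn):
        self.recall_fn = recall_fn
        self.hits = {}

    def recall(self, query):
        if query not in self.hits:           # cold path: run the full pipeline
            self.hits[query] = self.recall_fn(query)
        return self.hits[query]              # warm path: O(1) dict lookup

calls = []
def slow_recall(q):
    calls.append(q)                          # stand-in for the ranking pipeline
    return [f"result for {q}"]

cache = RecallCache(slow_recall)
cache.recall("deployment steps")
cache.recall("deployment steps")
assert len(calls) == 1                       # pipeline ran only once
```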

&lt;h3&gt;
  
  
  What's New in v1.3.0
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Temporal Queries
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers dark mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ... days pass, user changes preference ...
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;supersede&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers light mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# What did we know last week?
&lt;/span&gt;&lt;span class="n"&gt;old_memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_week_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;recall_at(query, timestamp)&lt;/code&gt; filters records by creation time. &lt;code&gt;history(record_id)&lt;/code&gt; shows the full access/strength timeline. This is how you debug agent behavior — "why did it do X on Tuesday?"&lt;/p&gt;
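&lt;p&gt;Conceptually, point-in-time recall is a filter over creation and supersession timestamps. A minimal sketch with hypothetical record fields (not the Aura internals):&lt;/p&gt;

```python
# Toy point-in-time filter: a record is visible at `timestamp` if it was
# already created and not yet superseded. Field names are hypothetical.
def recall_at(records, timestamp):
    alive = []
    for r in records:
        created = timestamp >= r["created_at"]
        current = r["superseded_at"] is None or r["superseded_at"] > timestamp
        if created and current:
            alive.append(r)
    return alive

records = [
    {"text": "User prefers dark mode",  "created_at": 100, "superseded_at": 500},
    {"text": "User prefers light mode", "created_at": 500, "superseded_at": None},
]
print([r["text"] for r in recall_at(records, 200)])
# → ['User prefers dark mode']
```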

&lt;h4&gt;
  
  
  2. LangChain Drop-In
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuraMemory&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuraMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Works with any LangChain chain
&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationChain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;AuraChatMessageHistory&lt;/code&gt; implements the full &lt;code&gt;BaseChatMessageHistory&lt;/code&gt; interface. &lt;code&gt;AuraMemory&lt;/code&gt; is duck-type compatible with &lt;code&gt;ConversationBufferMemory&lt;/code&gt;. No changes to your existing code.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Adaptive Recall
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After recall, tell Aura what was useful
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall_structured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;was_helpful&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;useful&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# +0.1 strength
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;useful&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# -0.15 strength
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over time, noise naturally decays while valuable memories get reinforced. No other memory SDK has this built-in.&lt;/p&gt;
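&lt;p&gt;The underlying update is simple. A toy version: the +0.1 / -0.15 deltas mirror the comments above, while clamping strength to [0, 1] is an assumption of this sketch.&lt;/p&gt;

```python
# Toy feedback update: useful recalls gain strength, useless ones lose more,
# so noise decays faster than signal. Clamping to [0, 1] is assumed here.
def apply_feedback(strength, useful):
    delta = 0.1 if useful else -0.15
    return max(0.0, min(1.0, strength + delta))

s = 0.5
s = apply_feedback(s, useful=True)    # reinforce
s = apply_feedback(s, useful=False)   # penalize
print(round(s, 2))
# → 0.45
```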

&lt;h4&gt;
  
  
  4. Memory Versioning
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save state before experiment
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;before_refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ... agent does things ...
&lt;/span&gt;
&lt;span class="c1"&gt;# Something went wrong? Roll back
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;before_refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or compare states
&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;before_refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;after_refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Added: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;added&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Removed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;removed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  5. Agent-to-Agent Sharing
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent A exports relevant context
&lt;/span&gt;&lt;span class="n"&gt;fragment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Agent B imports it (strength halved, tagged "shared")
&lt;/span&gt;&lt;span class="n"&gt;agent_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;import_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The protocol envelope includes version and provenance metadata. Imported records arrive with reduced trust — they need to prove themselves.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. C FFI — Aura as a Platform
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"aura.h"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="n"&gt;AuraHandle&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aura_open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./memory"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;aura_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Remember this"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aura_recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"what to remember"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;aura_free_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;aura_close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Working examples in &lt;a href="https://github.com/teolex2020/AuraSDK/blob/main/examples/go/main.go" rel="noopener noreferrer"&gt;Go&lt;/a&gt; and &lt;a href="https://github.com/teolex2020/AuraSDK/blob/main/examples/csharp/Program.cs" rel="noopener noreferrer"&gt;C#&lt;/a&gt;. Any language with C FFI can use Aura.&lt;/p&gt;

&lt;h4&gt;
  
  
  7. OpenTelemetry
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[features]&lt;/span&gt;
&lt;span class="py"&gt;telemetry&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"opentelemetry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"opentelemetry_sdk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"opentelemetry-otlp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"tracing-opentelemetry"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;17 key functions instrumented with &lt;code&gt;#[instrument]&lt;/code&gt; spans. OTLP export to any collector. &lt;a href="https://github.com/teolex2020/AuraSDK/blob/main/examples/grafana_dashboard.json" rel="noopener noreferrer"&gt;Grafana dashboard template&lt;/a&gt; included.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bug That Took 8 Hours
&lt;/h3&gt;

&lt;p&gt;Fun story: our CI was timing out at 6+ hours. We tried increasing timeouts, switching to release builds, reducing the test matrix. Nothing worked.&lt;/p&gt;

&lt;p&gt;Turns out: the &lt;code&gt;Aura&lt;/code&gt; struct didn't have a &lt;code&gt;Drop&lt;/code&gt; implementation. When tests ended without calling &lt;code&gt;close()&lt;/code&gt;, internal file handles were never released. Each test hung for 5 minutes waiting on cleanup that never came. 28 tests x 5 min = CI death.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: 9 lines of code.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Drop&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.stop_background&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.flush&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.storage&lt;/span&gt;&lt;span class="nf"&gt;.flush&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.index&lt;/span&gt;&lt;span class="nf"&gt;.save&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now 503 tests pass in 7 minutes. Sometimes the hardest bugs are the simplest ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./my_agent_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers concise answers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Identity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how should I respond?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns formatted context for your LLM's system prompt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/aura-memory/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/teolex2020/AuraSDK/releases/tag/v1.3.1" rel="noopener noreferrer"&gt;Full Changelog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aurasdk.dev" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this is useful. PRs and issues welcome.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Agent memory in 5 lines of Python — no LLM, no cloud, &lt;1ms recall</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Mon, 02 Mar 2026 07:32:40 +0000</pubDate>
      <link>https://dev.to/teolex2020/agent-memory-in-5-lines-of-python-no-llm-no-cloud-1ms-recall-55d5</link>
      <guid>https://dev.to/teolex2020/agent-memory-in-5-lines-of-python-no-llm-no-cloud-1ms-recall-55d5</guid>
      <description>&lt;p&gt;Last week, my AI agent analyzed 10 competitors in the AI memory market over 3 days. On day 4, I asked it to compare their pricing. It didn't search again — it already knew them all. That's what happens when your agent has real memory, not a chat history.&lt;/p&gt;

&lt;p&gt;Your AI agent forgets everything between sessions. Every conversation starts from zero. Every user preference, every decision, every piece of context — gone. You paste old conversations into the system prompt, hit the token limit, and wonder why the agent feels so... stateless.&lt;/p&gt;

&lt;p&gt;Most "memory" solutions bolt on a vector database, call an embedding API, and charge you per query. You now have 200ms latency, a cloud dependency, and a monthly bill — for what is essentially a fancy search index.&lt;/p&gt;

&lt;p&gt;What if your agent could remember like a human? Important things stick. Trivial things fade. Trusted sources rank higher than random web scrapes. And it all happens in &lt;strong&gt;under 1 millisecond&lt;/strong&gt;, locally, with &lt;strong&gt;zero LLM calls&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's &lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;Aura&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Aura different
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Aura&lt;/th&gt;
&lt;th&gt;Others (Mem0, Zep, Cognee)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM required&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200ms+ / LLM-bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works offline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.7 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per op&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API billing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source provenance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Built-in&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Aura is a Rust-native cognitive memory engine with Python bindings. It uses a 4-signal RRF (Reciprocal Rank Fusion) recall system — no embeddings required — and models memory decay, consolidation, and trust scoring, inspired by how human memory actually works.&lt;/p&gt;
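&lt;p&gt;Reciprocal Rank Fusion itself is easy to sketch (toy code, not Aura's implementation): each signal contributes 1 / (k + rank) per item, and the summed scores are re-ranked, so items ranked well by several signals win.&lt;/p&gt;

```python
# Reciprocal Rank Fusion in miniature (illustrative only).
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three of the four signals, with made-up per-signal rankings:
sdr_sim = ["rec_a", "rec_b", "rec_c"]   # SDR similarity
minhash = ["rec_b", "rec_a", "rec_c"]   # MinHash
tags    = ["rec_b", "rec_c", "rec_a"]   # tag Jaccard
print(rrf([sdr_sim, minhash, tags]))
# → ['rec_b', 'rec_a', 'rec_c']
```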

&lt;p&gt;Let's see how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No Docker, no API keys, no cloud account. The entire engine ships as a single 2.7 MB binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Store &amp;amp; recall — the basics
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Level&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./agent_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Store memories at different importance levels
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers dark mode and Vim keybindings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Identity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ui&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy staging before production, always run tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decisions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fix login bug - users getting 403 on /api/auth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Working&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Recall — returns formatted context ready for LLM injection
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authentication issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== COGNITIVE CONTEXT ===
[IDENTITY]
  - User prefers dark mode and Vim keybindings [preference, ui]

[DECISIONS]
  - Deploy staging before production, always run tests [workflow]

[WORKING]
  - Fix login bug - users getting 403 on /api/auth [bug, auth]

=== END CONTEXT ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. &lt;code&gt;store()&lt;/code&gt; → &lt;code&gt;recall()&lt;/code&gt; → inject into your system prompt. Five lines to give your agent persistent memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Memory levels — not all memories are equal
&lt;/h2&gt;

&lt;p&gt;Aura organizes memory into 4 levels across 2 tiers, modeled after human cognitive architecture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Retention per cycle&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identity&lt;/td&gt;
&lt;td&gt;0.99/cycle&lt;/td&gt;
&lt;td&gt;User preferences, personality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Domain&lt;/td&gt;
&lt;td&gt;0.95/cycle&lt;/td&gt;
&lt;td&gt;Learned facts, domain knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cognitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decisions&lt;/td&gt;
&lt;td&gt;0.90/cycle&lt;/td&gt;
&lt;td&gt;Choices made, action items&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cognitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;td&gt;0.80/cycle&lt;/td&gt;
&lt;td&gt;Current tasks, recent messages&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Core tier&lt;/strong&gt; = slow decay (weeks to months). Your agent's "personality" and knowledge base.&lt;br&gt;
&lt;strong&gt;Cognitive tier&lt;/strong&gt; = fast decay (hours to days). Ephemeral context that fades naturally.&lt;/p&gt;

&lt;p&gt;This means your agent doesn't need explicit "forget" logic. Old tasks decay away. Core knowledge persists. Just like your brain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Query only recent, ephemeral memories
&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall_cognitive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query only long-term knowledge
&lt;/span&gt;&lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall_core_tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
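&lt;p&gt;To make the decay numbers concrete, here is a quick sketch of what per-cycle retention implies, assuming Aura applies the table's factor multiplicatively once per maintenance cycle (the multiplicative model is my assumption, not documented internals):&lt;/p&gt;

```python
# Hypothetical sketch: memory strength under multiplicative per-cycle decay.
# Retention factors come from the table above; applying them as a simple
# power is an assumption about Aura's internals, not documented behavior.

def strength_after(retention: float, cycles: int, initial: float = 1.0) -> float:
    """Memory strength after a number of maintenance cycles."""
    return initial * retention ** cycles

# After 30 cycles, a Working memory is nearly gone while an
# Identity memory is still strong.
print(f"Working  (0.80/cycle): {strength_after(0.80, 30):.4f}")
print(f"Identity (0.99/cycle): {strength_after(0.99, 30):.4f}")
```

&lt;p&gt;With these factors, Working-level entries effectively vanish within a few dozen cycles, which is what lets stale tasks fade without explicit deletes.&lt;/p&gt;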



&lt;h2&gt;
  
  
  4. Trust scoring — the killer feature
&lt;/h2&gt;

&lt;p&gt;Here's where Aura gets interesting. Not all information sources are equally reliable. A user telling you their name is more trustworthy than a web scrape claiming "Python 4.0 is coming soon."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrustConfig&lt;/span&gt;

&lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrustConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_trust&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_scrape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_trust_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Store from different sources
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python 3.13 released October 2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python 4.0 coming soon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_scrape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Trust-weighted recall ranks user-sourced memory higher
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall_structured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python release&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  trust=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trust&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  score=0.995  trust=1.00  Python 3.13 released October 2024
  score=0.589  trust=0.50  Python 4.0 coming soon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user-sourced fact scores &lt;strong&gt;0.995&lt;/strong&gt;. The web scrape scores &lt;strong&gt;0.589&lt;/strong&gt;. Your agent now has built-in epistemic hygiene: it knows &lt;em&gt;how much&lt;/em&gt; to trust each piece of information.&lt;/p&gt;
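&lt;p&gt;The ranking behavior above can be sketched as plain arithmetic. In this toy version the final score is similarity multiplied by channel trust; the multiplicative rule is my guess at the mechanism, not Aura's published scoring function:&lt;/p&gt;

```python
# Toy trust-weighted ranking: score = similarity x channel trust.
# The multiplicative rule is an illustrative assumption, not Aura's
# documented scoring function.

SOURCE_TRUST = {"user": 1.0, "api": 0.8, "web_scrape": 0.5}

def rank(candidates: list) -> list:
    """Score each candidate and sort best-first."""
    for c in candidates:
        c["score"] = c["similarity"] * SOURCE_TRUST[c["channel"]]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

results = rank([
    {"content": "Python 4.0 coming soon", "similarity": 0.9, "channel": "web_scrape"},
    {"content": "Python 3.13 released October 2024", "similarity": 0.9, "channel": "user"},
])
for r in results:
    print(f"score={r['score']:.3f}  {r['content']}")
```

&lt;p&gt;Even with identical semantic similarity, the user-sourced fact wins.&lt;/p&gt;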

&lt;h2&gt;
  
  
  5. Source provenance — know where every fact came from
&lt;/h2&gt;

&lt;p&gt;Every memory in Aura carries a provenance tag:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;recorded&lt;/code&gt; — direct user input (trust × 1.00)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retrieved&lt;/code&gt; — fetched from web/API (trust × 0.90)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inferred&lt;/code&gt; — LLM conclusion (trust × 0.85)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generated&lt;/code&gt; — agent-created (trust × 0.80)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BTC at $67k on Feb 21&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;source_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crypto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers conservative trading strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;source_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recorded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crypto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Recall ranks recorded higher than retrieved
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall_structured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crypto strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents a subtle but dangerous problem: agents presenting web search results as their own "memories." With &lt;code&gt;source_type&lt;/code&gt;, your agent always knows what it observed vs what it found vs what it guessed.&lt;/p&gt;

&lt;p&gt;As far as I can tell, no other memory SDK tracks this: not Mem0, not Zep, not Cognee.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Plug it into any LLM
&lt;/h2&gt;

&lt;p&gt;Aura is LLM-agnostic. The pattern is always the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are my UI preferences?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Recall relevant context
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Build system prompt
&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant with memory.

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Use the above context to personalize your responses.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Send to your LLM of choice:
# Ollama:     requests.post("http://localhost:11434/api/chat", ...)
# OpenAI:     openai.chat.completions.create(messages=[...])
# LangChain:  ChatPromptTemplate with {context}
# Claude:     anthropic.messages.create(...)
# Any HTTP:   just inject system_prompt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No adapters. No framework lock-in. If your LLM takes a string, Aura works with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus: structured recall with scores
&lt;/h2&gt;

&lt;p&gt;When you need more than formatted text — for routing, filtering, or debugging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall_structured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  [IDENTITY] score=0.590 -- User prefers dark mode and Vim keybindings
  [WORKING]  score=0.586 -- Fix login bug - users getting 403 on /api/auth
  [DECISIONS] score=0.581 -- Deploy staging before production, always run tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result includes level, score, trust, tags, timestamps, and source metadata. Use this to build intelligent routing: high-trust Identity memories go straight to the system prompt; low-trust Working memories get verified first.&lt;/p&gt;
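&lt;p&gt;That routing policy is a few lines of glue. The field names below mirror the structured results shown above; the level names and trust threshold are arbitrary examples, not Aura constants:&lt;/p&gt;

```python
# Sketch of a routing policy over structured recall results:
# high-trust core-tier memories go straight into the prompt,
# everything else is queued for verification. Thresholds are arbitrary.

def route(results: list, trust_threshold: float = 0.8):
    inject, verify = [], []
    for r in results:
        if r["level"] in ("IDENTITY", "DOMAIN") and r["trust"] >= trust_threshold:
            inject.append(r["content"])
        else:
            verify.append(r["content"])
    return inject, verify

inject, verify = route([
    {"level": "IDENTITY", "trust": 1.00, "content": "User prefers dark mode"},
    {"level": "WORKING", "trust": 0.50, "content": "Python 4.0 coming soon"},
])
print(inject)  # high-trust identity memory, safe to inject
print(verify)  # low-trust working memory, check before trusting
```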

&lt;h2&gt;
  
  
  Performance — sub-millisecond, for real
&lt;/h2&gt;

&lt;p&gt;Benchmarked on a standard machine with 1,000 stored records:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Store&lt;/td&gt;
&lt;td&gt;0.129 ms/op&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall (1K records)&lt;/td&gt;
&lt;td&gt;0.861 ms/op&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search by tag&lt;/td&gt;
&lt;td&gt;0.103 ms/op&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For comparison, embedding-based recall typically runs &lt;strong&gt;200ms+&lt;/strong&gt; per call, so on these benchmarks Aura is roughly &lt;strong&gt;200x faster&lt;/strong&gt;. It gets there by combining SDR (Sparse Distributed Representation) encoding, MinHash, and tag matching, with no neural network inference needed.&lt;/p&gt;
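&lt;p&gt;For intuition on why no model inference is needed: MinHash estimates set overlap from small fixed-size signatures using nothing but hashing. This is a generic illustration of the technique, not Aura's implementation:&lt;/p&gt;

```python
import hashlib

# Generic MinHash illustration: estimate Jaccard similarity of token sets
# from fixed-size signatures. Pure hashing, no model inference. This shows
# the general technique, not Aura's actual encoder.

def signature(tokens, num_hashes=64):
    """One min-hash per seed; each seed acts as a distinct hash function."""
    return [
        min(int(hashlib.sha1(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a, b, num_hashes=64):
    """Fraction of matching signature slots approximates Jaccard overlap."""
    sa, sb = signature(a, num_hashes), signature(b, num_hashes)
    return sum(x == y for x, y in zip(sa, sb)) / num_hashes

auth_bug   = {"fix", "login", "bug", "403", "auth"}
auth_query = {"authentication", "login", "403", "auth"}
deploys    = {"deploy", "staging", "production", "tests"}

print(estimated_jaccard(auth_bug, auth_query))  # related topics share minima
print(estimated_jaccard(auth_bug, deploys))     # disjoint topics score 0.0
```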

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; optionally add embeddings as a 4th signal if you want semantic similarity on top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_embedding_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the 3-signal fusion works great without them.&lt;/p&gt;
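&lt;p&gt;As a rough picture of what "signal fusion" can mean here: combine the three similarity signals with a weighted sum. The weights and the linear formula are illustrative assumptions, not Aura's actual scoring:&lt;/p&gt;

```python
# Illustrative signal fusion: weighted sum of SDR similarity, MinHash
# similarity, and tag overlap. Weights and formula are assumptions,
# not Aura's documented scoring.

def fuse(sdr_sim, minhash_sim, tag_overlap, weights=(0.5, 0.3, 0.2)):
    w_sdr, w_mh, w_tag = weights
    return w_sdr * sdr_sim + w_mh * minhash_sim + w_tag * tag_overlap

strong_match = fuse(0.9, 0.8, 1.0)  # matches on all three signals
tag_only     = fuse(0.1, 0.0, 1.0)  # matches only on tags
print(f"{strong_match:.2f} vs {tag_only:.2f}")
```

&lt;p&gt;A memory that matches on all three signals outranks one that only shares a tag.&lt;/p&gt;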

&lt;h2&gt;
  
  
  Living memory — decay, reflect, consolidate
&lt;/h2&gt;

&lt;p&gt;Run a single maintenance cycle and Aura handles the rest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_maintenance&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decayed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decay&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decayed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Promoted: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reflect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;promoted&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consolidated: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;consolidation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;native_merged&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Archived: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;records_archived&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call this periodically (every N interactions, or on a schedule), and your agent's memory stays clean and relevant without manual curation.&lt;/p&gt;
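&lt;p&gt;A minimal "every N interactions" trigger looks like this. The wrapper is plain Python and takes any zero-argument callable, so &lt;code&gt;brain.run_maintenance&lt;/code&gt; plugs straight in:&lt;/p&gt;

```python
# Minimal periodic-maintenance trigger: call tick() once per interaction
# and the wrapped callable (e.g. brain.run_maintenance) runs every N ticks.

class MaintenanceSchedule:
    def __init__(self, run_maintenance, every_n=50):
        self.run_maintenance = run_maintenance
        self.every_n = every_n
        self.count = 0

    def tick(self):
        """Returns True on the ticks where maintenance actually ran."""
        self.count += 1
        if self.count % self.every_n == 0:
            self.run_maintenance()
            return True
        return False

# usage sketch:
#   schedule = MaintenanceSchedule(brain.run_maintenance, every_n=50)
#   ... then call schedule.tick() after each user interaction
```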

&lt;h2&gt;
  
  
  More features you get out of the box
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespace isolation&lt;/strong&gt; — keep test/prod/per-user memories separate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt; — ChaCha20-Poly1305 + Argon2id, one argument: &lt;code&gt;Aura("./data", password="secret")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP server&lt;/strong&gt; — expose memory as a tool for Claude, GPT, or any MCP-compatible agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero dependencies&lt;/strong&gt; — pure Rust core, no runtime requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it now
&lt;/h2&gt;

&lt;p&gt;The fastest way to try Aura is the interactive Colab notebook — zero setup, runs in your browser:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://colab.research.google.com/github/teolex2020/AuraSDK/blob/main/examples/colab_quickstart.ipynb" rel="noopener noreferrer"&gt;▶ Open in Google Colab&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Or install locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt; — star it if you find it useful ⭐&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/teolex2020/AuraSDK/blob/main/docs/API.md" rel="noopener noreferrer"&gt;API docs&lt;/a&gt;&lt;/strong&gt; — full reference for 40+ methods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/teolex2020/AuraSDK/tree/main/examples" rel="noopener noreferrer"&gt;Examples&lt;/a&gt;&lt;/strong&gt; — Ollama integration, research bot, edge devices&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Aura is MIT-licensed. Built by a solo developer in Kyiv, Ukraine — including during power outages. Patent pending (US 63/969,703). If you're building AI agents that need to remember, I'd love to hear what you think.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>419 Clones in 48 Hours — What Happened When I Launched an SDK for Offline AI Agent Memory</title>
      <dc:creator>Oleksander</dc:creator>
      <pubDate>Thu, 26 Feb 2026 13:05:19 +0000</pubDate>
      <link>https://dev.to/teolex2020/419-clones-in-48-hours-what-happened-when-i-launched-an-sdk-for-offline-ai-agent-memory-20n9</link>
      <guid>https://dev.to/teolex2020/419-clones-in-48-hours-what-happened-when-i-launched-an-sdk-for-offline-ai-agent-memory-20n9</guid>
      <description>&lt;p&gt;48 hours after launch. 419 clones. 90 unique developers. 8 stars. Nobody said a word.&lt;/p&gt;

&lt;p&gt;That silence told me something important: engineers don't star things — they test them.&lt;/p&gt;

&lt;p&gt;Here's the story of what I built, why, and what those numbers actually mean.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Everyone is building AI agents. Most of them have a memory problem.&lt;/p&gt;

&lt;p&gt;The standard approach: use embeddings. Store text as vectors, query them at recall time. Tools like Mem0, Zep, and LangMem all work this way.&lt;/p&gt;

&lt;p&gt;The hidden cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every recall = an embedding API call = 150–300ms latency&lt;/li&gt;
&lt;li&gt;Every embedding call = money (OpenAI charges per token)&lt;/li&gt;
&lt;li&gt;Offline deployment? Impossible — you need the embedding API available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For cloud-based chatbots this is fine. But for &lt;strong&gt;local AI agents running on your own hardware&lt;/strong&gt; — especially with Ollama — this breaks the whole offline-first promise.&lt;/p&gt;

&lt;p&gt;If your agent needs to "remember" something, it has to call home first.&lt;/p&gt;

&lt;p&gt;That felt wrong to me.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Different Idea: SDR Instead of Embeddings
&lt;/h2&gt;

&lt;p&gt;I started reading about &lt;strong&gt;Sparse Distributed Representations (SDR)&lt;/strong&gt; — the pattern encoding mechanism used in Hierarchical Temporal Memory (HTM) theory, originally inspired by how the neocortex works.&lt;/p&gt;

&lt;p&gt;The core idea: represent any concept as a sparse binary vector (256K bits in Aura's case) where only ~2% of bits are active. Similarity between patterns is computed using the Tanimoto coefficient — pure bit math, no neural network needed.&lt;/p&gt;

&lt;p&gt;No embedding model. No API call. No GPU.&lt;/p&gt;

&lt;p&gt;Just math.&lt;/p&gt;
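&lt;p&gt;The arithmetic really is that simple. Here is a toy version, with SDRs modeled as sets of active bit indices (real patterns would be 256K bits with ~2% active, and the bit assignments here are made up for illustration):&lt;/p&gt;

```python
# Toy SDR comparison: patterns as sets of active bit indices, similarity
# via the Tanimoto coefficient |A intersect B| / |A union B|. Real SDRs
# are far larger and sparser; the arithmetic is the same.

def tanimoto(a, b):
    if not a and not b:
        return 1.0
    return len(a.intersection(b)) / len(a.union(b))

cat    = {3, 17, 42, 101, 256}   # toy pattern for "cat"
feline = {3, 17, 42, 99, 310}    # shares active bits with "cat"
truck  = {5, 88, 140, 201, 333}  # unrelated concept

print(f"cat vs feline: {tanimoto(cat, feline):.3f}")
print(f"cat vs truck:  {tanimoto(cat, truck):.3f}")
```

&lt;p&gt;Overlapping concepts share active bits and score high; unrelated concepts score near zero. Set intersection over union is all the "inference" required.&lt;/p&gt;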

&lt;p&gt;Recall latency: &lt;strong&gt;0.35ms&lt;/strong&gt;. That's not a typo.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Aura&lt;/strong&gt; — a cognitive memory system for AI agents written in Rust.&lt;/p&gt;

&lt;p&gt;Key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sub-millisecond recall&lt;/strong&gt; — 0.35ms average, 0.29ms after warm cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero LLM calls for memory operations&lt;/strong&gt; — the recall itself needs no model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.7MB binary&lt;/strong&gt; — the entire memory engine fits in a small file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully offline&lt;/strong&gt; — works with Ollama, any local model, no internet required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent across sessions&lt;/strong&gt; — brain reloads from disk, all context intact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;217 tests&lt;/strong&gt;, ChaCha20-Poly1305 encryption, patent pending (US 63/969,703)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four memory levels with different retention weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Working Memory    → 0.80 retention  (temporary context)
Decision Memory   → 0.90 retention  (choices made)
Domain Memory     → 0.95 retention  (learned knowledge)
Identity Memory   → 0.99 retention  (core facts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Integration with Ollama: 3 Lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura_memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Aura&lt;/span&gt;

&lt;span class="n"&gt;brain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Aura&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./agent_brain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# inject context into your Ollama system prompt
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3n:e4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# store the interaction
&lt;/span&gt;&lt;span class="n"&gt;brain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your Ollama agent now has persistent memory across sessions — no embedding API, no cloud, no ongoing cost.&lt;/p&gt;
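&lt;p&gt;The &lt;code&gt;token_budget&lt;/code&gt; argument caps how much recalled context gets injected into the system prompt. As a rough illustration of that idea (a toy sketch; AuraSDK's actual scoring and storage are its own Rust implementation, not keyword overlap), a budget-capped recall could look like this:&lt;/p&gt;

```python
# Toy sketch of token-budgeted recall. Illustrative only: this is not
# AuraSDK's actual scoring algorithm or storage format.

def recall(records, query, token_budget=1500):
    """Return the highest-overlap records, truncated to a token budget."""
    query_words = set(query.lower().split())
    # Rank records by how many query words they share.
    scored = sorted(
        records,
        key=lambda r: len(query_words.intersection(r.lower().split())),
        reverse=True,
    )
    context, used = [], 0
    for record in scored:
        cost = len(record.split())  # crude token estimate: whitespace words
        if used + cost > token_budget:
            break
        context.append(record)
        used += cost
    return "\n".join(context)

records = [
    "User's name is Aleksander, an AI engineer from Ukraine",
    "User is working on AuraSDK, cognitive memory for agents",
    "User prefers concise technical explanations",
]
print(recall(records, "what is the user working on?", token_budget=20))
```

&lt;p&gt;With a 20-token budget, only the two best-matching records fit; the third is dropped instead of overflowing the system prompt.&lt;/p&gt;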




&lt;h2&gt;
  
  
  Live Demo Output
&lt;/h2&gt;

&lt;p&gt;I ran a three-phase test with gemma3n:e4b locally. Here's the actual terminal output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1: Storing facts
✓ Stored: Name is Aleksander, AI engineer from Ukraine
✓ Stored: Working on AuraSDK — cognitive memory for agents
✓ Stored: Prefers concise technical explanations

Phase 2: Conversations with memory context
[Recall: 0.35ms] Context injected into system prompt
[Recall: 0.48ms] Agent referenced previous preference correctly
[Recall: 0.41ms] Agent remembered project name without being told

Phase 3: Session reload (fresh Python instance)
Brain loaded from disk...
[Recall: 0.29ms] ALL context intact ✅

Total records: 12
Memory persisted: YES
LLM calls for memory: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent remembered my name, project, and communication preferences &lt;strong&gt;across a completely fresh Python instance&lt;/strong&gt; — without a single LLM or embedding call.&lt;/p&gt;
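&lt;p&gt;The reload step is the core trick: write-through to disk on every store, rehydrate on startup. A toy file-backed version of that pattern (illustrative only; AuraSDK's storage engine is a separate Rust implementation) looks like this:&lt;/p&gt;

```python
# Toy file-backed memory: write-through on store, rehydrate on startup.
# Illustrates the persist-and-reload pattern only; it is not AuraSDK's
# actual storage engine.
import json
from pathlib import Path

class ToyBrain:
    def __init__(self, path="demo_brain.json"):
        self.path = Path(path)
        # Rehydrate whatever a previous session left behind.
        if self.path.exists():
            self.records = json.loads(self.path.read_text())
        else:
            self.records = []

    def store(self, user_input, response):
        self.records.append({"user": user_input, "agent": response})
        # Write-through: every store survives a process restart.
        self.path.write_text(json.dumps(self.records))

# Session 1: store a fact, then throw the instance away.
Path("demo_brain.json").unlink(missing_ok=True)  # start clean for the demo
brain = ToyBrain()
brain.store("My name is Aleksander", "Nice to meet you, Aleksander!")
del brain

# Session 2: a fresh instance rehydrates the same records from disk.
brain = ToyBrain()
print(brain.records[0]["user"])  # -> My name is Aleksander
```

&lt;p&gt;Phase 3 of the demo is this pattern at work: the only link between the two Python processes is the on-disk state.&lt;/p&gt;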




&lt;h2&gt;
  
  
  Benchmark vs Embedding-based approach
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Aura&lt;/th&gt;
&lt;th&gt;Embedding-based approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recall latency&lt;/td&gt;
&lt;td&gt;0.35ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding API calls&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline capable&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;2.7MB&lt;/td&gt;
&lt;td&gt;N/A (cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per recall&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;API pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speedup&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~570x faster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
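&lt;p&gt;The latency side of this comparison is easy to sanity-check yourself. A minimal sketch, timing an in-process keyword recall over 1,000 records (numbers vary by machine, and this is not the harness behind the table above):&lt;/p&gt;

```python
# Micro-benchmark sketch: mean latency of an in-process recall over
# 1,000 records. Machine-dependent; not AuraSDK's own benchmark harness.
import time

records = [f"note {i}: some stored fact about topic {i % 50}" for i in range(1000)]

def keyword_recall(query):
    q = set(query.lower().split())
    return max(records, key=lambda r: len(q.intersection(r.lower().split())))

keyword_recall("topic 7")  # warm up before timing
n = 100
start = time.perf_counter()
for _ in range(n):
    best = keyword_recall("fact about topic 7")
elapsed_ms = (time.perf_counter() - start) * 1000 / n
print(f"mean recall latency: {elapsed_ms:.3f} ms")
```

&lt;p&gt;The point of the comparison is structural, not the exact numbers: a local lookup pays no network round-trip, while an embedding-based recall pays one on every call.&lt;/p&gt;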




&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — sub-millisecond recall requires zero garbage collection overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; — memory systems that corrupt data are worse than no memory at all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt; — 2.7MB binary runs anywhere: Raspberry Pi, edge devices, air-gapped servers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;19,500 lines of Rust. 217 tests. Built during power outages in Kyiv 🇺🇦&lt;/p&gt;




&lt;h2&gt;
  
  
  The 419 Clones
&lt;/h2&gt;

&lt;p&gt;After posting in the Ollama Discord and commenting on a few Twitter threads about agent memory, the GitHub traffic spiked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;419 clones&lt;/strong&gt; in 48 hours&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;90 unique cloners&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero comments&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think developers are quietly testing it. That's the most honest validation I could ask for — nobody clones a repo to be polite.&lt;/p&gt;

&lt;p&gt;If you're one of those 90 people: I'd genuinely love to know what you found. What worked, what didn't, what you were trying to build.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;📦 PyPI: &lt;a href="https://pypi.org/project/aura-memory/" rel="noopener noreferrer"&gt;aura-memory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 GitHub: &lt;a href="https://github.com/teolex2020/AuraSDK" rel="noopener noreferrer"&gt;teolex2020/AuraSDK&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🌐 Docs: &lt;a href="https://aurasdk.dev" rel="noopener noreferrer"&gt;aurasdk.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  One Question For You
&lt;/h2&gt;

&lt;p&gt;How are you handling memory in your AI agents right now?&lt;/p&gt;

&lt;p&gt;Embeddings? Simple conversation history? Something else entirely?&lt;/p&gt;

&lt;p&gt;I'm genuinely curious about the tradeoffs people are navigating — especially for local/offline deployments where latency and API costs actually matter.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
