The experiment
Same model. Same 5 questions. One difference: one side had persistent memory via AuraSDK, the other had none.
Both sides used Gemini 2.5 Flash-Lite — identical model, identical cost per token.
Result: 4/5 questions won by the side with memory. 48% fewer tokens used.
What the questions tested
Real Python dev scenarios — the kind where generic answers aren't enough:
- "I'm writing a new async function to fetch user orders from the DB. What patterns should I follow?"
- "We need background email sending when an order is completed. How should we implement it?"
- "I'm writing tests for the new payment module. Walk me through the setup."
- "Code review: `results = list(db.query(Order).filter(...))` — what's wrong?"
- "The reports endpoint is slow — 3 seconds for 200 users. How do I debug and fix this?"
These questions have correct generic answers. But they also have specific answers — if you know the team's stack.
What the model with memory knew
AuraSDK had loaded 15 records of accumulated team knowledge:
- We use arq for task queues, not Celery — switched because Celery was too heavy
- Production incident: missing await on async DB call silently returned a coroutine. 2 hours to debug.
- SQLAlchemy lazy loading doesn't work with async sessions — always use selectinload() or joinedload()
- Code review caught: someone did list(queryset) and loaded 50k rows into memory. Lint rule added.
- N+1 queries found in the user reports endpoint — fixed with a single joined query
- Test setup: pytest + pytest-asyncio, transactions that rollback after each test, coverage must stay above 85%
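The missing-`await` incident is cheap to reproduce. A minimal stdlib sketch (the function name is illustrative, not from the actual incident):

```python
import asyncio

async def fetch_order_count():
    # stands in for an async DB call
    await asyncio.sleep(0)
    return 42

async def main():
    broken = fetch_order_count()       # missing await: a coroutine object, not 42
    assert asyncio.iscoroutine(broken)  # and it's truthy, so the bug hides downstream
    broken.close()                     # avoid a "never awaited" warning

    fixed = await fetch_order_count()
    return fixed

result = asyncio.run(main())
print(result)  # 42
```

The dangerous part is that the coroutine object is truthy, so `if result:` checks pass and the failure surfaces far from the missing `await` — which is why it took 2 hours to debug.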
What happened question by question
Q2 — Background email sending:
With memory (1,307 tokens):
"Use arq — we switched from Celery because it was too heavy for our scale. Define an arq job decorated with @arq.func, enqueue it when order completes."
Without memory (3,203 tokens):
"Here are several approaches: FastAPI BackgroundTasks for simple cases, Celery for robust production use..."
Same model. One knew the decision. One gave a textbook overview.
Q4 — Code review:
With memory (829 tokens):
"This is a direct violation of our established practice. We had an incident where this loaded 50,000 rows into memory. A lint rule was added specifically to prevent this."
Without memory (1,604 tokens):
"This is generally considered an anti-pattern in SQLAlchemy. Here's a breakdown of what's wrong..."
How AuraSDK works
```python
from aura import Aura, Level

brain = Aura("./agent_memory")
brain.enable_full_cognitive_stack()

# store team knowledge
brain.store("We use arq, not Celery — switched because Celery was too heavy",
            level=Level.Domain, tags=["python", "dev"])
brain.store("Production incident: list(queryset) loaded 50k rows into memory",
            level=Level.Decisions, tags=["python", "lesson-learned"])

# recall before answering — <1ms, no API call
context = brain.recall("background email sending", token_budget=1000)

# inject into prompt
system = f"TEAM CONTEXT:\n{context}\n\nAnswer using this context."
```
No embeddings. No vector database. No LLM calls during learning. Pure local Rust computation.
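The SDK's recall internals aren't shown here, but embedding-free retrieval under a token budget can be approximated with plain keyword overlap. A hedged stdlib sketch (the scoring and budget logic are my assumptions, not AuraSDK's actual algorithm):

```python
def recall(records, query, token_budget):
    """Rank stored records by keyword overlap with the query,
    then pack the best-scoring ones into a rough token budget."""
    q = set(query.lower().split())
    scored = sorted(records,
                    key=lambda r: len(q & set(r.lower().split())),
                    reverse=True)
    picked, used = [], 0
    for rec in scored:
        cost = len(rec.split())  # crude token estimate: whitespace words
        if used + cost > token_budget:
            break
        picked.append(rec)
        used += cost
    return "\n".join(picked)

records = [
    "We use arq for task queues, not Celery",
    "SQLAlchemy lazy loading breaks with async sessions",
    "Email sending happens in a background arq task",
]
context = recall(records, "background email sending task", token_budget=20)
```

With a 20-word budget, the email record ranks first on overlap, the arq record fits next, and the SQLAlchemy record is cut. No model call, no vectors, just set intersection, which is why sub-millisecond latency is plausible for a local store.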
The cognitive pipeline
AuraSDK doesn't just store and retrieve text. Every record goes through 5 layers:
Record → Belief → Concept → Causal → Policy
- Belief: groups related observations, resolves contradictions with confidence scores
- Concept: discovers stable topic clusters across beliefs
- Causal: finds cause-effect patterns from temporal and explicit links
- Policy: derives behavioral hints (Prefer / Avoid / Warn) from causal patterns
After enough interactions, the system surfaces this automatically:
```python
hints = brain.get_surfaced_policy_hints()
# [{"action": "Prefer", "domain": "dev", "description": "use arq over celery for task queues"}]
```
Nobody wrote that rule. The system derived it from the pattern of stored observations.
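The real derivation runs in Rust inside the SDK, but the core idea is simple enough to sketch. A toy stdlib version that counts outcome-tagged observations and emits a Prefer/Avoid hint once there is enough evidence (the record shape and threshold are invented for illustration):

```python
from collections import Counter

def derive_hints(observations, min_support=2):
    """Turn (action, outcome) observations into Prefer/Avoid hints
    once an action has accumulated enough consistent evidence."""
    tally = Counter()
    for action, outcome in observations:
        tally[(action, outcome)] += 1

    hints = []
    for (action, outcome), n in tally.items():
        if n < min_support:
            continue  # not enough evidence yet
        verb = "Prefer" if outcome == "good" else "Avoid"
        hints.append({"action": verb, "description": action})
    return hints

obs = [
    ("use arq for task queues", "good"),
    ("use arq for task queues", "good"),
    ("use celery for task queues", "bad"),
    ("use celery for task queues", "bad"),
    ("eager-load with selectinload", "good"),  # single observation: no hint yet
]
hints = derive_hints(obs)
```

The point of the toy: no rule is ever written by hand. Repeated observations cross a support threshold and a policy falls out, which is the same shape of mechanism the Policy layer describes.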
The token math
| Question | With memory | Without memory |
|---|---|---|
| Q1 | 1,200 tokens | 1,545 tokens |
| Q2 | 1,307 tokens | 3,203 tokens |
| Q3 | 1,923 tokens | 4,067 tokens |
| Q4 | 829 tokens | 1,604 tokens |
| Q5 | 1,294 tokens | 2,155 tokens |
| Total | 6,553 tokens | 12,574 tokens |
48% fewer tokens. The memory layer doesn't add bloat — it gives the model exactly what it needs.
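The totals in the table are easy to check:

```python
with_memory = [1200, 1307, 1923, 829, 1294]
without_memory = [1545, 3203, 4067, 1604, 2155]

total_with = sum(with_memory)        # 6553
total_without = sum(without_memory)  # 12574
savings = round(100 * (1 - total_with / total_without))  # 48
```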
How it compares
| Capability | AuraSDK | Mem0 | Zep | Letta |
|---|---|---|---|---|
| LLM required for learning | No | Yes | Yes | Yes |
| Works offline | Fully | Partial | No | With local LLM |
| Recall latency | <1ms | ~200ms+ | ~200ms | LLM-bound |
| Self-derives behavioral policies | Yes | No | No | No |
| Binary size | ~3MB | ~50MB+ | Cloud | Python pkg |
Try it
```shell
pip install aura-memory
python examples/demo.py
```
Open source: github.com/teolex2020/AuraSDK
Patent pending: US 63/969,703
Built in Kyiv, Ukraine.