<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Christian Alexander Nonis</title>
    <description>The latest articles on DEV Community by Christian Alexander Nonis (@christian_alexandernonis).</description>
    <link>https://dev.to/christian_alexandernonis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2110029%2F7d94a8c9-4081-4b91-ac1f-23a6acb18307.JPG</url>
      <title>DEV Community: Christian Alexander Nonis</title>
      <link>https://dev.to/christian_alexandernonis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/christian_alexandernonis"/>
    <language>en</language>
    <item>
      <title>From RAG to a “memory layer”: what building an AI assistant taught us</title>
      <dc:creator>Christian Alexander Nonis</dc:creator>
      <pubDate>Sun, 29 Mar 2026 16:36:07 +0000</pubDate>
      <link>https://dev.to/christian_alexandernonis/from-rag-to-a-memory-layer-what-building-an-ai-assistant-taught-us-4efm</link>
      <guid>https://dev.to/christian_alexandernonis/from-rag-to-a-memory-layer-what-building-an-ai-assistant-taught-us-4efm</guid>
      <description>&lt;p&gt;About a year and a half ago, we were building a proactive AI assistant.&lt;/p&gt;

&lt;p&gt;Not just a chatbot, but something that could actually act on your behalf.&lt;/p&gt;

&lt;p&gt;It could reply to emails in your tone, move calendar events, organize your inbox, and surface information based on what you actually care about.&lt;/p&gt;

&lt;p&gt;The goal was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;build something that feels like an extension of how you think.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;The part we didn’t expect&lt;/h2&gt;

&lt;p&gt;To make that work, we started with what most people use today: RAG.&lt;/p&gt;

&lt;p&gt;And to be fair, RAG works.&lt;/p&gt;

&lt;p&gt;You can go pretty far with chunking, embeddings, and retrieval.&lt;br&gt;
You can build systems that feel smart.&lt;/p&gt;

&lt;p&gt;But as the assistant got more complex, something started to break.&lt;/p&gt;

&lt;p&gt;Not in an obvious way.&lt;/p&gt;

&lt;p&gt;It was more subtle.&lt;/p&gt;

&lt;p&gt;The system could retrieve relevant information,&lt;br&gt;
but it didn’t really &lt;strong&gt;understand&lt;/strong&gt; how things were connected.&lt;/p&gt;

&lt;p&gt;Everything was based on similarity.&lt;/p&gt;

&lt;p&gt;And similarity is not structure.&lt;/p&gt;




&lt;h2&gt;Building a "brain"&lt;/h2&gt;

&lt;p&gt;To move forward, we needed something else.&lt;/p&gt;

&lt;p&gt;We started building what we internally called a "brain".&lt;/p&gt;

&lt;p&gt;A layer responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extracting meaning from data&lt;/li&gt;
&lt;li&gt;connecting concepts together&lt;/li&gt;
&lt;li&gt;maintaining a consistent structure over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the beginning, it was just a supporting component for the assistant.&lt;/p&gt;

&lt;p&gt;But the deeper we went, the clearer it became: this was the real problem.&lt;/p&gt;

&lt;p&gt;About 7 months ago, we made a decision:&lt;br&gt;
we stopped focusing on the assistant itself&lt;br&gt;
and went all-in on this layer.&lt;/p&gt;

&lt;p&gt;That became BrainAPI.&lt;/p&gt;




&lt;h2&gt;From retrieval to structure&lt;/h2&gt;

&lt;p&gt;The shift can be summarized like this.&lt;/p&gt;

&lt;p&gt;Typical RAG pipeline:&lt;br&gt;
chunk -&amp;gt; embed -&amp;gt; retrieve -&amp;gt; generate&lt;/p&gt;

&lt;p&gt;What we moved toward:&lt;br&gt;
ingest -&amp;gt; extract -&amp;gt; connect -&amp;gt; graph -&amp;gt; query&lt;/p&gt;

&lt;p&gt;Instead of treating data as independent chunks,&lt;br&gt;
we process it into a structured representation of entities and relationships.&lt;/p&gt;

&lt;p&gt;In practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;documents are parsed into concepts&lt;/li&gt;
&lt;li&gt;relationships are extracted and normalized&lt;/li&gt;
&lt;li&gt;everything is stored in a graph + vector layer&lt;/li&gt;
&lt;/ul&gt;
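
&lt;p&gt;To make the difference concrete, here is a deliberately tiny sketch (toy data, naive scoring; not how BrainAPI is implemented): chunks ranked by similarity on one side, explicit triples you can walk on the other.&lt;/p&gt;

```python
# Toy contrast: similarity over chunks vs. structure over triples.
# (Illustration only; not BrainAPI's actual pipeline.)

chunks = [
    "Mary is getting married next year.",
    "The wedding will be in Rome.",
    "Rome is the capital of Italy.",
]

def retrieve(query_terms):
    """Rank chunks by naive term overlap (a stand-in for embedding similarity)."""
    scored = [(len(query_terms.intersection(c.lower().split())), c) for c in chunks]
    return [c for score, c in sorted(scored, reverse=True) if score]

triples = [
    ("Mary", "getting_married", "next year"),
    ("Mary_wedding", "located_in", "Rome"),
    ("Rome", "capital_of", "Italy"),
]

def neighbors(entity):
    """Everything directly connected to an entity, with the relation name."""
    out = [(p, o) for s, p, o in triples if s == entity]
    out += [(p, s) for s, p, o in triples if o == entity]
    return out

print(retrieve({"wedding"}))   # ranked text snippets
print(neighbors("Rome"))       # typed connections you can keep traversing
```

&lt;p&gt;The similarity side can only hand back text; the triple side answers "how is Rome connected?" directly.&lt;/p&gt;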

&lt;p&gt;Vectors are still useful,&lt;br&gt;
but they are no longer the primary abstraction.&lt;/p&gt;

&lt;p&gt;The graph is.&lt;/p&gt;




&lt;h2&gt;What changes in practice&lt;/h2&gt;

&lt;p&gt;This changes how you interact with data.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"what text is similar to this query?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what entities are involved?&lt;/li&gt;
&lt;li&gt;how are they connected?&lt;/li&gt;
&lt;li&gt;what paths exist between concepts?&lt;/li&gt;
&lt;li&gt;what else is related in this context?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retrieval becomes navigation.&lt;/p&gt;
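
&lt;p&gt;"Navigation" here is literal graph traversal. A breadth-first search over a toy concept graph (hypothetical data) answers the "what paths exist between concepts?" question above:&lt;/p&gt;

```python
from collections import deque

# A tiny undirected concept graph (hypothetical data, for illustration).
edges = [
    ("Mary", "wedding"), ("wedding", "Rome"),
    ("Rome", "Italy"), ("Mary", "guest list"),
]

def path(start, goal):
    """BFS: the shortest chain of concepts linking start to goal, or None."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        p = queue.popleft()
        if p[-1] == goal:
            return p
        for nxt in adj.get(p[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(p + [nxt])
    return None

print(path("Mary", "Italy"))  # ['Mary', 'wedding', 'Rome', 'Italy']
```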




&lt;h2&gt;Where this approach helps&lt;/h2&gt;

&lt;p&gt;We found this particularly useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context spans multiple sources and long stretches of time&lt;/li&gt;
&lt;li&gt;relationships matter more than keywords&lt;/li&gt;
&lt;li&gt;consistency is important (not just relevance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some practical use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recommendation systems (ecommerce, social)&lt;/li&gt;
&lt;li&gt;search systems that go beyond keyword matching&lt;/li&gt;
&lt;li&gt;persistent memory for agents and chatbots&lt;/li&gt;
&lt;li&gt;more reliable RAG setups in complex domains&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Exploring "polarities"&lt;/h2&gt;

&lt;p&gt;One interesting direction we’ve been exploring is something we call &lt;strong&gt;polarities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of returning a single "best" answer,&lt;br&gt;
the system can surface a range of possible solutions around a problem,&lt;br&gt;
based on how concepts relate in the graph.&lt;/p&gt;

&lt;p&gt;It’s less about ranking results,&lt;br&gt;
and more about exploring a solution space.&lt;/p&gt;
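
&lt;p&gt;The post doesn't spell out the mechanics, but one way to picture "polarities" is: keep the best candidate from each distinct region of the graph instead of a single global winner. A toy sketch with made-up candidates and scores:&lt;/p&gt;

```python
# Hypothetical candidates, each tagged with the graph region it came from.
# (Made-up data and scoring; not BrainAPI's actual algorithm.)
candidates = [
    {"answer": "cache results",      "region": "performance", "score": 0.91},
    {"answer": "add an index",       "region": "performance", "score": 0.87},
    {"answer": "denormalize schema", "region": "data model",  "score": 0.80},
    {"answer": "shard by tenant",    "region": "scaling",     "score": 0.78},
]

def top_per_region(cands):
    """One strongest candidate per region: a solution space, not a single hit."""
    best = {}
    for c in cands:
        if c["score"] > best.get(c["region"], {"score": -1.0})["score"]:
            best[c["region"]] = c
    return sorted(best.values(), key=lambda c: -c["score"])

for c in top_per_region(candidates):
    print(c["region"], "::", c["answer"])
```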




&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;At Lumen Labs (our startup), this direction came from a broader observation.&lt;/p&gt;

&lt;p&gt;AI systems today are powerful,&lt;br&gt;
but they are also fragile in how they represent knowledge.&lt;/p&gt;

&lt;p&gt;They retrieve well.&lt;br&gt;
They generate well.&lt;/p&gt;

&lt;p&gt;But they don’t really &lt;strong&gt;ground&lt;/strong&gt; information in a consistent structure.&lt;/p&gt;

&lt;p&gt;And that’s where a lot of issues come from,&lt;br&gt;
especially when accuracy actually matters.&lt;/p&gt;

&lt;p&gt;If we want systems that people can rely on,&lt;br&gt;
we need something closer to a structured memory layer.&lt;/p&gt;




&lt;h2&gt;Open sourcing it&lt;/h2&gt;

&lt;p&gt;We’ve been using this approach in production for a few B2B use cases,&lt;br&gt;
but never exposed it publicly.&lt;/p&gt;

&lt;p&gt;Now we’re opening it up.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the core is open source&lt;/li&gt;
&lt;li&gt;it can run fully locally (we’ve tested it with Ollama + offline setups)&lt;/li&gt;
&lt;li&gt;or be deployed as managed instances in the cloud&lt;/li&gt;
&lt;li&gt;it’s extensible via a plugin system&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Closing thoughts&lt;/h2&gt;

&lt;p&gt;We don’t think this replaces RAG.&lt;/p&gt;

&lt;p&gt;But it feels like RAG is one component of a bigger system,&lt;br&gt;
not the system itself.&lt;/p&gt;

&lt;p&gt;After spending the last year and a half building on top of AI systems,&lt;br&gt;
this "memory layer" is the piece that felt missing.&lt;/p&gt;

&lt;p&gt;Curious to hear how others are approaching this,&lt;br&gt;
especially if you’ve hit similar limitations with chunk-based retrieval.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/Lumen-Labs/brainapi2" rel="noopener noreferrer"&gt;https://github.com/Lumen-Labs/brainapi2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Website / Video: &lt;a href="https://brain-api.dev" rel="noopener noreferrer"&gt;https://brain-api.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Giving LLMs Real Memory: Why It’s Hard, and How BrainAPI Solves It</title>
      <dc:creator>Christian Alexander Nonis</dc:creator>
      <pubDate>Sat, 09 Aug 2025 07:27:19 +0000</pubDate>
      <link>https://dev.to/christian_alexandernonis/giving-llms-real-memory-why-its-hard-and-how-brainapi-solves-it-eod</link>
      <guid>https://dev.to/christian_alexandernonis/giving-llms-real-memory-why-its-hard-and-how-brainapi-solves-it-eod</guid>
      <description>&lt;p&gt;Large Language Models (LLMs) are impressive. They can write code, answer questions, and chat fluently on almost any topic.&lt;br&gt;&lt;br&gt;
But there’s one fundamental flaw:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LLMs have no memory.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unless you manually feed the model your entire conversation history each time, it will forget everything.&lt;br&gt;&lt;br&gt;
For developers, this creates a lot of friction when building anything that needs &lt;em&gt;persistence&lt;/em&gt; or &lt;em&gt;context&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we’ll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore &lt;strong&gt;why LLM memory is a hard problem&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;See &lt;strong&gt;how BrainAPI approaches it&lt;/strong&gt; with a structured, hybrid memory architecture&lt;/li&gt;
&lt;li&gt;Walk through a &lt;strong&gt;mini tutorial&lt;/strong&gt; to integrate it into your project&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;Why LLM Memory Is Hard&lt;/h2&gt;

&lt;p&gt;LLMs are &lt;strong&gt;stateless&lt;/strong&gt; by design. Each prompt is processed independently.&lt;br&gt;&lt;br&gt;
When you ask a follow-up question, the model doesn’t “remember” the previous answer — it only knows what you explicitly include in the input.&lt;/p&gt;

&lt;p&gt;Developers try to work around this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; — storing chunks of data in a vector DB and fetching relevant ones per query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt stuffing&lt;/strong&gt; — appending conversation history to every prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual state tracking&lt;/strong&gt; — keeping facts in variables or databases and re-injecting them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG struggles with &lt;strong&gt;multi-turn continuity&lt;/strong&gt; (“What was the second method again?”)&lt;/li&gt;
&lt;li&gt;Prompt stuffing &lt;strong&gt;bloats tokens&lt;/strong&gt; and drives up cost&lt;/li&gt;
&lt;li&gt;Coreference issues — “it” and “she” become ambiguous without entity tracking&lt;/li&gt;
&lt;li&gt;No &lt;em&gt;high-level awareness&lt;/em&gt; — the bot can’t easily remember your goals, preferences, or evolving context&lt;/li&gt;
&lt;/ul&gt;
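
&lt;p&gt;The token-bloat problem is easy to see in code: with prompt stuffing, every turn resends the entire history, so per-turn cost grows linearly and total session cost grows quadratically. A minimal sketch (word count as a crude token proxy):&lt;/p&gt;

```python
# Prompt stuffing: the whole history rides along on every request.
history = []

def stuffed_prompt(user_msg):
    history.append("user: " + user_msg)
    return "\n".join(history)  # everything so far, resent each call

sizes = []
for i in range(1, 6):
    prompt = stuffed_prompt("message number " + str(i))
    sizes.append(len(prompt.split()))  # word count as a crude token proxy

print(sizes)  # [4, 8, 12, 16, 20] -- the prompt grows every single turn
```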

&lt;p&gt;We need something that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understands&lt;/strong&gt; entities and relationships
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stores&lt;/strong&gt; facts and knowledge in a structured way
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieves&lt;/strong&gt; context intelligently, not just by keyword similarity
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracks&lt;/strong&gt; conversation at both a &lt;em&gt;detail&lt;/em&gt; and &lt;em&gt;summary&lt;/em&gt; level&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;Introducing BrainAPI&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://brainapi.lumen-labs.ai/docs" rel="noopener noreferrer"&gt;&lt;strong&gt;BrainAPI&lt;/strong&gt;&lt;/a&gt; by Lumen Labs is an &lt;strong&gt;on-demand memory layer&lt;/strong&gt; for LLM applications.&lt;br&gt;&lt;br&gt;
It’s accessible via Python and Node.js SDKs, and it handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storing conversation messages&lt;/li&gt;
&lt;li&gt;Injecting static or dynamic knowledge&lt;/li&gt;
&lt;li&gt;Retrieving relevant context for the current query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key differences vs. simple RAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coreference resolution&lt;/strong&gt; — normalizes references so “she” and “Mary” are connected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triplet-based knowledge graph&lt;/strong&gt; — facts are stored as &lt;code&gt;subject → predicate → object&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid retrieval&lt;/strong&gt; — combines graph traversal and vector similarity search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-level observation layer&lt;/strong&gt; — summaries of user goals, topics, and context slices&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;How It Works Under the Hood&lt;/h2&gt;

&lt;p&gt;The architecture has &lt;strong&gt;five layers&lt;/strong&gt;:&lt;/p&gt;
&lt;h3&gt;1. Coreference Resolution&lt;/h3&gt;

&lt;p&gt;Ensures entity consistency across messages.&lt;br&gt;&lt;br&gt;
Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Mary is getting married next year. She wants it in Rome."
→ "Mary is getting married next year. Mary wants the wedding in Rome."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Currently using &lt;a href="https://github.com/biu-nlp/fastcoref" rel="noopener noreferrer"&gt;&lt;code&gt;fastcoref&lt;/code&gt;&lt;/a&gt; in Python; exploring a faster C++ rule-based resolver.&lt;/p&gt;




&lt;h3&gt;2. Triplet Extraction &amp;amp; Embedding&lt;/h3&gt;

&lt;p&gt;From each message or knowledge chunk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract &lt;strong&gt;subject-predicate-object&lt;/strong&gt; triples
&lt;/li&gt;
&lt;li&gt;Embed &lt;strong&gt;whole phrase&lt;/strong&gt; and &lt;strong&gt;individual entities&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wikify&lt;/strong&gt; entity names to avoid duplicates (e.g. "NYC" → "New York City")&lt;/li&gt;
&lt;/ul&gt;
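
&lt;p&gt;A real extractor uses ML models, but the shape of the output is easy to show. Here is a deliberately crude rule-based sketch (hypothetical patterns and alias table; BrainAPI's actual extraction and wikification are presumably far more robust):&lt;/p&gt;

```python
import re

# Crude subject/predicate/object matcher for simple declarative sentences.
# Group 1 = subject, group 2 = predicate, group 3 = object.
PATTERN = re.compile(r"^([A-Z][\w ]*?) (is|wants|lives in|works at) (.+?)\.?$")

# Toy "wikification" table: map surface forms to canonical entity names.
ALIASES = {"NYC": "New York City"}

def extract_triple(sentence):
    m = PATTERN.match(sentence)
    if m is None:
        return None
    subj, pred, obj = m.group(1), m.group(2), m.group(3)
    return (ALIASES.get(subj, subj), pred.replace(" ", "_"), ALIASES.get(obj, obj))

print(extract_triple("Mary lives in NYC."))  # ('Mary', 'lives_in', 'New York City')
```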




&lt;h3&gt;3. Storage Backend&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j&lt;/strong&gt; — the knowledge graph
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone&lt;/strong&gt; — vector embeddings for semantic search
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; — raw text chunks, logs, and metadata&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;4. Hybrid Retrieval&lt;/h3&gt;

&lt;p&gt;When asked &lt;em&gt;“Mary’s wedding date”&lt;/em&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract &lt;code&gt;(Mary)&lt;/code&gt; and &lt;code&gt;(wedding date)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search Neo4j for subject = “Mary”
&lt;/li&gt;
&lt;li&gt;Traverse edges for exact match on object
&lt;/li&gt;
&lt;li&gt;If no exact match, run vector search on connected nodes
&lt;/li&gt;
&lt;li&gt;If no subject found, run vector search for closest parent entity&lt;/li&gt;
&lt;/ol&gt;
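
&lt;p&gt;The fallback chain above can be sketched over in-memory stand-ins (a dict for the graph, substring matching standing in for vector similarity; obviously not the production path):&lt;/p&gt;

```python
# Stand-ins: a dict for Neo4j, substring overlap for vector similarity.
graph = {
    "Mary": {"wedding date": "next May", "wedding city": "Rome"},
}

def query(subject, attribute):
    # Step 2: look up the subject node
    node = graph.get(subject)
    if node is None:
        # Step 5: no exact subject; fall back to fuzzy entity lookup
        matches = [s for s in graph if subject.lower() in s.lower()]
        node = graph[matches[0]] if matches else None
    if node is None:
        return None
    # Step 3: traverse edges for an exact match on the object
    if attribute in node:
        return node[attribute]
    # Step 4: no exact match; "semantic" search over connected edges
    for edge, value in node.items():
        if any(word in edge for word in attribute.split()):
            return value
    return None

print(query("Mary", "wedding date"))  # exact edge hit
print(query("mary", "wedding day"))   # fuzzy subject plus near-match edge
```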




&lt;h3&gt;5. High-Level LLM Observations&lt;/h3&gt;

&lt;p&gt;A summarization layer produces &lt;em&gt;structured observations&lt;/em&gt; every few turns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topics discussed&lt;/li&gt;
&lt;li&gt;User goals&lt;/li&gt;
&lt;li&gt;Relevant constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These summaries give the bot &lt;strong&gt;bird’s-eye awareness&lt;/strong&gt; without flooding the context window.&lt;/p&gt;
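
&lt;p&gt;Mechanically, an observation layer can be as simple as a rolling window: every N turns, compress the window into one compact note. In a real system the compression step is an LLM call; this sketch fakes it with truncation:&lt;/p&gt;

```python
SUMMARIZE_EVERY = 3
turns, observations = [], []

def record_turn(role, text):
    turns.append((role, text))
    if len(turns) % SUMMARIZE_EVERY == 0:
        window = turns[-SUMMARIZE_EVERY:]
        # A real system would ask an LLM to summarize the window;
        # here we just join truncated snippets.
        observations.append({
            "covers_turns": (len(turns) - SUMMARIZE_EVERY + 1, len(turns)),
            "summary": " / ".join(text[:20] for _, text in window),
        })

for msg in ["plan a conference", "venue should be in Rome",
            "budget is 10k euros", "prefer dates in May"]:
    record_turn("user", msg)

print(observations)  # one compact note so far, covering turns 1-3
```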




&lt;h2&gt;When to Use BrainAPI&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation Bots&lt;/strong&gt; — remember context between Q&amp;amp;A and follow-ups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal-Oriented Assistants&lt;/strong&gt; — persist user preferences and constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Educational Tutors&lt;/strong&gt; — track student progress and personalize lessons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal AI Companions&lt;/strong&gt; — maintain continuity across days or weeks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Mini Tutorial: Adding Memory to Your Bot&lt;/h2&gt;

&lt;p&gt;Let’s add BrainAPI to a Python chatbot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install the SDK&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;lumen-brain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Save incoming messages&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from lumen_brain import LumenBrainDriver
driver = LumenBrainDriver("your-api-key")

driver.save_message(
    memory_uuid="project-chat-memory",
    content="I’m planning a conference in Rome next May.",
    role="user",
    conversation_id="conv-001"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Inject Knowledge&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver.inject_knowledge(
    memory_uuid="project-chat-memory",
    type="file",
    content="Our conference venue options include the Colosseum and Forum."
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Retrieve relevant context&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = driver.query_memory(
    text="When is the conference happening again?",
    memory_uuid="project-chat-memory",
    conversation_id="conv-001"
)

response = llm.invoke({ "input": "When is the conference happening again?" + result.context })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
  </channel>
</rss>
