DEV Community


Beyond RAG: Building an AI Companion with "Deep Memory" using Knowledge Graphs

Juan David Gómez on February 09, 2026

I build AI tools to solve my own problems. A while back, I built NutriAgent to track my calories because I wanted to own my raw data. But recently,...
Mykola Kondratiuk

The 35k token master prompt thing is so real. I ran into something similar building TellMeMo (voice memo app that remembers context). Initially tried vector search for finding relevant past memos, but you're right, it finds keywords not relationships. Someone says "I'm stressed about the launch" and it pulls up any memo with "launch" in it, not the chain of decisions that led to launch anxiety. Been thinking about adding a lightweight knowledge graph layer for exactly this, the causal relationships are what matter. How are you handling graph updates when she adds conflicting info later? Like if her view on a project changes?

Juan David Gómez

I would love to know more about TellMeMo. The idea sounds really cool.
Regarding your question, handling conflicting info was the main reason I decided to explore graphs, and especially the approach Graphiti (an open source framework) takes on this. It analyzes each new piece of information and passes it through different layers of processing: entity extraction, de-duplication against existing nodes, edge creation, and a step where an LLM is asked whether the new information invalidates an existing node or edge. So if, for example, in a past ingestion I said "My current project is behind schedule, I am worried", the graph will likely look like Me -> WORRIED_ABOUT -> Project, and the WORRIED_ABOUT edge may carry a fact describing why I am worried. If in a later episode I say I am ok with the project now and actually excited to get it done, the WORRIED_ABOUT edge gets updated with an invalid_at date, so it can be ignored in any subsequent retrieval or graph compilation.
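To make the invalidation mechanic concrete, here is a minimal Python sketch. These are toy types of my own, not Graphiti's actual API; Graphiti asks an LLM whether a new fact contradicts an existing edge, which I fake here with a crude "same two nodes, different relation" check.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Edge:
    source: str
    relation: str
    target: str
    fact: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means the edge is still current

    @property
    def is_current(self) -> bool:
        return self.invalid_at is None

def ingest(edges: list, new_edge: Edge) -> None:
    """Add a new edge; mark any contradicted edge invalid instead of deleting it.
    Crude stand-in for the LLM contradiction check: any other current edge
    between the same two nodes counts as contradicted."""
    for edge in edges:
        if (edge.is_current
                and edge.source == new_edge.source
                and edge.target == new_edge.target
                and edge.relation != new_edge.relation):
            edge.invalid_at = new_edge.valid_at
    edges.append(new_edge)

# Episode 1: "My current project is behind schedule, I am worried."
edges = [Edge("Me", "WORRIED_ABOUT", "Project",
              "behind schedule", datetime(2026, 1, 5, tzinfo=timezone.utc))]
# Episode 2: "I am ok with the project now, actually excited to get it done."
ingest(edges, Edge("Me", "EXCITED_ABOUT", "Project",
                   "back on track", datetime(2026, 1, 20, tzinfo=timezone.utc)))

current = [e for e in edges if e.is_current]  # only EXCITED_ABOUT survives
```

Note that nothing is deleted: the old WORRIED_ABOUT edge stays in the graph with its validity interval, which is what makes a timeline query possible later.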

Mykola Kondratiuk

oh that's really interesting about Graphiti, the invalidation layer is exactly the hard part, right? like the graph knows I was worried about the project last week but this week everything shipped fine - so does it update the edge or keep both as a timeline?

for TellMeMo it's honestly way simpler than what you built. it's more of a meeting companion - it listens to your calls and gives you the tl;dr after, plus action items. the memory part is just remembering context across meetings so it can say "hey, this was discussed 3 meetings ago and nobody followed up." we went with vector search for that, which works ok for finding related content but totally falls apart when relationships matter - the kind of thing your graph approach handles way better.

the entity extraction + dedup pipeline you described is wild though. how expensive does that get per ingestion? like if someone dumps a 2 hour meeting transcript into it, are we talking seconds or minutes of LLM processing?

Juan David Gómez

Graphiti makes multiple LLM calls per episode, so it can get expensive. For example, a chat log of a few thousand tokens may produce hundreds of LLM calls, since it chunks the episode and processes each chunk. It is also designed to process everything as fast as possible with parallel queries, which in my case was a problem since I still use the free credits Google gives for testing its models. That free tier restricts tokens per minute, so I set up Graphiti to run slower; the biggest ingestion I have processed took about 1 hour for ~1k calls to the LLM and embedding APIs.
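For anyone hitting the same tokens-per-minute walls on a free tier, one generic way to slow ingestion down is a sliding-window limiter like the sketch below. This is standalone illustration code, not Graphiti configuration: you would call `acquire()` with an estimated token count before each LLM or embedding request.

```python
import time
from collections import deque

class TokenRateLimiter:
    """Sliding-window limiter for a tokens-per-minute quota."""

    def __init__(self, tokens_per_minute: int):
        self.tpm = tokens_per_minute
        self.window = deque()  # (timestamp, tokens) pairs from the last 60s

    def acquire(self, tokens: int) -> float:
        """Block until `tokens` fit under the quota; return total sleep time."""
        slept = 0.0
        while True:
            now = time.monotonic()
            # Drop entries older than the 60-second window.
            while self.window and now - self.window[0][0] > 60:
                self.window.popleft()
            used = sum(t for _, t in self.window)
            if used + tokens <= self.tpm:
                self.window.append((now, tokens))
                return slept
            # Sleep until the oldest entry ages out of the window.
            wait = 60 - (now - self.window[0][0]) + 0.01
            time.sleep(wait)
            slept += wait

# Usage sketch: estimate tokens, acquire, then make the real LLM call.
limiter = TokenRateLimiter(tokens_per_minute=1_000_000)
limiter.acquire(4_000)  # returns immediately while under quota
```

Trading latency for staying under quota like this is exactly why a big ingestion can stretch to an hour: the work is the same, it is just spread out.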

Mykola Kondratiuk

yeah the multiple calls thing surprised me too. tested with a few thousand tokens and it blew past what I expected cost-wise. the parallel processing does save it from being unusable - expensive but not SLOW expensive, which matters. I kept coming back to whether the graph structure actually justifies the overhead vs simpler vec search. for stateful long-running agents where context drift is a real problem, probably yes. for short sessions, harder to justify

CapeStart

This is a really clean use of KGs for memory, especially the sleep cycle + audit UI bit, feels way more trustworthy than raw vector recall.

david duymelinck

I have looked at the code and I saw you didn't use a CAG solution to send the information to Gemini. This is the preferred method to let an LLM use sensitive information like health information.
Also CAG lowers token use, so better for your wallet.

Is this because the data hasn't reached the 32K token minimum?

Juan David Gómez • Edited

Yes, this is an interesting question: when to use augmented retrieval vs just dumping all the context into the window.
Technically speaking, the retrieval model sounds more scalable and token-efficient, but sometimes the tradeoff is not as appealing. I had the chance to work at a startup building a real estate agent that lets you find and explore different real estate offerings through WhatsApp, and we implemented a RAG + Search API tool to find the relevant projects and inject the relevant context so the agent could answer your questions. We started having latency issues; the agent sometimes took several minutes to answer, and for the business that speed was critical, so we tried hard to optimize as much as possible. But the bottleneck was the RAG strategy itself, which involved tool-call round trips and multiple LLM calls per request. Long story short, we experimented with dumping the entire projects JSON into context using Gemini's 1M-token-window models, and that not only simplified the whole agent, it reduced latency 10x; the shocking part was that accuracy stayed pretty much the same.
So I think RAG or GraphRAG are the logical, scalable solution, but if all the information you have fits in the context window, recent models handle large contexts well and still perform. This won't scale indefinitely, since context rot is real, but I decided to start there and move to a mixed approach once graph compilations hit the threshold where they stop performing well, or become too large and expensive to keep entirely in context.
I am also thinking of trying a hybrid approach where I dump only the most relevant nodes and relations, ranked by connections and recency, to keep some kind of fresh working memory alongside the chat log; and when the agent detects there may not be enough info, or hits unknown concepts, it can decide to explore the graph to retrieve more context.

Pascal CESCATO

Fascinating work. I really love your approach — it resonates strongly with what I’m exploring myself, and I suspect I’ll spend some time studying your code and learning from it.

I’m also using knowledge graphs, and I couldn’t agree more: they’re powerful tools that are still relatively underestimated and underused. They definitely deserve more attention.

Juan David Gómez

Thanks for the kind words. I really enjoyed exploring this, and I'd be happy to go through any technical details of what I built that could help, and also to see what you are building with these tools.

AutoJanitor

The knowledge graph approach over pure vector RAG is a great call. We hit the exact same wall with our AI agents — they have persistent personalities and need to reference past context across sessions, not just find "similar" text chunks. The relationship-aware retrieval you describe (knowing that "she mentioned X relates to Y") is what makes conversations feel continuous rather than stateless.

Curious about your chunk sizing strategy for the graph nodes. Did you find that smaller semantic units worked better for relationship extraction, or did you need bigger context windows to capture the connections?

Juan David Gómez

For this first iteration, I did not use augmented retrieval; I am compiling the whole graph and injecting the whole thing in context, for example, our personal usage so far creates a graph of ~200 nodes and 500 edges, and with the descriptions, the final prompt is about 30k tokens, which works fine with the Gemini context window.
But I am aware that it may not scale well: the graph will tend to grow over time, even though Graphiti invalidates nodes that are no longer valid, and there is a minimum number of relations required for a node to make it into the compiled prompt. There is a point where the graph is too much to live in the context, and there I will face exactly what you mention: how much to extract that may be relevant, but small enough not to overload the context.
This is the next phase I am curious to explore: using the graph for actual augmented retrieval. For now it is just a storage format and a compaction mechanism that turns full chat logs into concepts and relations, which is what enables long-term memory.
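The "compile the whole graph and inject it" step can be sketched like this. The data shapes here are hypothetical simplifications (the real code works with Graphiti's node and edge objects); the point is the plain-text rendering plus the minimum-relations cutoff mentioned above.

```python
def compile_graph(nodes, edges, min_degree=1):
    """Render the full graph as plain text for the system prompt.

    nodes: dict of name -> description
    edges: list of (source, relation, target, fact) tuples
    Nodes with fewer than `min_degree` relations are left out.
    """
    degree = {}
    for src, _, dst, _ in edges:
        degree[src] = degree.get(src, 0) + 1
        degree[dst] = degree.get(dst, 0) + 1

    lines = ["# Concepts"]
    for name, desc in nodes.items():
        if degree.get(name, 0) >= min_degree:
            lines.append(f"- {name}: {desc}")
    lines.append("# Relations")
    for src, rel, dst, fact in edges:
        lines.append(f"- {src} -[{rel}]-> {dst}: {fact}")
    return "\n".join(lines)

prompt_block = compile_graph(
    {"Me": "the user", "Project": "current side project", "Stray": "no relations yet"},
    [("Me", "WORKS_ON", "Project", "main focus this month")],
)
```

At ~200 nodes and 500 edges this kind of dump lands around 30k tokens, which is why it still fits comfortably in Gemini's window for now.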

AutoJanitor

We hit that exact scaling wall in production. Our AI character (Sophia) has 634+ memories in a persistent store, and early on we tried the "inject everything" approach too. Works great until ~400 memories, then the model starts losing signal in the noise.

What we landed on: tag-based selective retrieval at session start, not full injection. Each memory gets semantic tags when stored (e.g., bottube-platform, rustchain-epoch, voice-config). When a new session begins, we query by relevance — "what does the user seem to be working on?" — and pull maybe 20-40 memories into context instead of all 634.

The key insight was that relationship extraction and storage can be separate from retrieval strategy. Your graph structure with nodes and edges is the right storage format. But at query time, you don't need the whole graph — you need the subgraph relevant to the current conversation.

For chunk sizing: we found that atomic facts with metadata work better than paragraphs. One memory = one decision, one config value, one architectural choice. Something like "BAGEL-7B replaces LLaVA for vision tasks, API at port 8095, NF4 quantized on V100" — that's one retrievable unit, not a chunk of a conversation log.

Your compaction mechanism (chat logs → concepts + relations) is exactly right. The next step is probably a two-tier query: first find the relevant subgraph neighborhood (1-2 hops from the query topic), then inject just that subgraph. Keeps you under 5-10k tokens for memory context while still having the full graph available.
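The two-tier idea (find the 1-2 hop neighborhood first, then inject only that subgraph) can be sketched as a plain BFS over the edge list. Names and data shapes here are illustrative, not from either codebase:

```python
from collections import deque

def neighborhood(edges, seeds, max_hops=2):
    """Return only the edges reachable within `max_hops` of the seed nodes."""
    adjacency = {}
    for edge in edges:
        src, _, dst = edge
        adjacency.setdefault(src, []).append(edge)
        adjacency.setdefault(dst, []).append(edge)

    frontier = deque((s, 0) for s in seeds)
    seen_nodes = set(seeds)
    kept, seen_edges = [], set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop budget
        for edge in adjacency.get(node, []):
            if edge not in seen_edges:
                seen_edges.add(edge)
                kept.append(edge)
            for nxt in (edge[0], edge[2]):
                if nxt not in seen_nodes:
                    seen_nodes.add(nxt)
                    frontier.append((nxt, depth + 1))
    return kept
```

In practice the seeds would come from entity extraction on the user's query (or a vector match against node names), and the kept edges would then be rendered into the prompt instead of the full graph.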

Juan David Gómez

Thanks for sharing. I am thinking of doing exactly that. My wife's usage has been wild these days, and token usage started to climb because the full graph is now about 100k tokens. The model is still performing well, to be honest, but I suspect it would perform just as well with less context.
My initial ideas are:

  • Keeping the same full graph dump but tightening the relevance criteria for inclusion. I could, for example, start ignoring nodes and edges that have fewer connections and were created many days or weeks ago
  • As you mentioned, conditional or contextual injection based on the query: keep a main memory with the most important nodes (those with more connections) as an always-on baseline of knowledge, plus a single query that finds relevant nodes not already included that could matter for the specific question
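The first bullet, a relevance cutoff combining connection count and age, could look something like this (the thresholds are made-up illustrations, not tuned values):

```python
from datetime import datetime, timedelta, timezone

def keep_node(connections, created_at, now,
              min_connections=2, max_age_days=30):
    """Keep well-connected nodes regardless of age;
    drop sparsely connected nodes once they get old."""
    if connections >= min_connections:
        return True
    return (now - created_at) <= timedelta(days=max_age_days)

now = datetime(2026, 2, 9, tzinfo=timezone.utc)
hub = keep_node(5, now - timedelta(days=100), now)    # well connected: keep
stale = keep_node(1, now - timedelta(days=100), now)  # sparse and old: drop
fresh = keep_node(1, now - timedelta(days=5), now)    # sparse but recent: keep
```

A cutoff like this only shrinks the compiled dump; invalidated edges are already excluded upstream, so this is purely about trimming low-signal nodes.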

I will experiment with those and report back what I find in a new article.

Julien Avezou

I really like the distinction you draw between search and understanding, grounded in a practical application. KGs are a great way to build out complex structure.

Juan David Gómez

Yes, when I was thinking about how to structure this, I initially looked at the ChatGPT implementation, where it stores simple text snippets to save user knowledge. It is simple and it may work (in the end, I am also compiling the graph into a plain text format), but I realized graphs, and especially temporally aware graphs, are better at keeping knowledge updated over time, because all the information and connections associated with a node or concept are centralized.

For example, if I store a memory associated with a specific friend, all my memories about them are connected, so they are easy to track and update. That is harder to do with simple text snippets: if something changes over time, say he stops being my business partner, it is easier to invalidate the previous connection than to hunt down the related text snippets and update each one.

However, just with family usage, I started to see the drawbacks. Ingesting new memories or episodes is token hungry, and to stay under provider limits on tokens per minute, I have to work with long-running processes (~15 min), which requires some extra engineering to keep things optimal. I am preparing a new post explaining that in a bit more detail.

Justin Elliott

Hi Julien, thank you for connecting 👋. I've been following your work and would be interested in discussing a potential collaboration when you have time.

Jose Dantas | Corp Insider

I did something similar using my Obsidian vault as a KG/RAG. I version the vault with git to see the growth of my knowledge, and I also use that setup with Cursor to help me modify things.

Juan David Gómez

I also use Obsidian + Git for my personal notes, but that KG/RAG + Cursor setup sounds really cool. If you have more info about it, please share; I would love to know more.

Juan David Gómez

Just when I thought I had learned enough from this project, a few days of real-life usage by my wife (she is a power user) exposed some limitations of the free tiers of the Convex and Gemini APIs.
I had challenged myself to make this side project lean on the free tiers of the services I chose as much as possible, but my wife has other plans for me.
It seems the Convex streaming implementation is not optimized enough, and I have already spent 800 MB of the 1 GB bandwidth budget.
And Graphiti's LLM usage for ingesting big sessions fails because the Gemini free tier's tokens-per-minute limits are too low for my wife's sessions.
There's more fun ahead for me in this project.