In a past article, I wrote about Synapse, an AI companion I built for my wife. To solve the problem of an LLM forgetting her past, I bypassed standard vector RAG entirely. Instead, I used a Knowledge Graph (via Graphiti and Neo4j) to map her life, compiled the entire graph into text, and injected it straight into Gemini's massive context window.
It worked beautifully. Until it didn't.
When you build a prototype, you test it with a few messages. When your wife is the power user, she builds an entire world. By day 21 of her using the app daily for deep sessions, the system hit a wall.
Here is the raw data of her input tokens per message over 18 days:
She was sending over 120,000 tokens of system context on every single chat turn.
Gemini handled it; modern context windows are incredible. But the reality of production kicked in: my API costs were climbing, Convex bandwidth was getting chewed up storing and moving massive payloads, and latency was creeping upward.
Dumping everything into the prompt is a great MVP, but it does not scale. I needed a new architecture.
The Community Was Right: Storage ≠ Retrieval
When I published the first article, the Dev.to community called this exact scaling wall.
Developers like @scottcjn and @itskondrat pointed out that while a Knowledge Graph is the perfect way to store relationships and causality, you shouldn't retrieve the whole thing every time.
I didn't want to revert to standard vector RAG, because standard RAG loses the plot. If she says "I'm stressed," a vector search retrieves a random journal entry about "stress." A graph knows the causality: Project A -> CAUSED -> Stress. And for first sessions or smaller graphs, filling the context window with the whole graph is still the best option.
I needed a hybrid approach:
- A Base Prompt (Working Memory): The most critical structural info about her life, capped at a strict budget.
- GraphRAG (Episodic Recall): Long-tail memories retrieved on-demand for the current chat turn.
Here is how I built it.
Phase 1: Hydration V2 (The Budget-Aware Brain)
My first API endpoint (Hydration V1) just ran the equivalent of a SELECT * on the graph and formatted the results.
I rewrote it as Hydration V2: a cascading waterfill allocation system. I set a hard limit of roughly 120,000 characters (~30k tokens). The goal is to maximize the usefulness of the prompt without blowing the budget.
Here is how the waterfill logic allocates space:
1. The Node Budget (40%):
Nodes are the entities (People, Projects, Concepts). I sort them by their "degree" (number of connections). The most connected nodes are included first. Because nodes are just short summaries, they rarely use the full 40%. The unused characters roll over into the Edge budget.
2. The Edge Budget (60% + Rollover):
Edges are the relationships (the actual stories and facts). To prioritize them, I classify the top 30% of nodes by connection count as "Hubs," then sort edges into three tiers:
- P1 (Hub-to-Hub): The structural backbone of her life. (e.g., User -> WORKS_ON -> Main Career). These are included first.
- P2 (Hub-Adjacent): One node is a Hub, sorted by recency.
- P3 (Long-Tail): Low-degree nodes. These are the first to get cut when the budget fills up.
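Here is a rough sketch of that two-pass allocation. The Node and Edge classes and the priority field are simplified stand-ins (the real hydration_v2.py derives the hub tiers from node degree), but the budget mechanics are the same idea: fill nodes most-connected-first, roll the unused characters into the edge budget, then fill edges tier by tier.

```python
from dataclasses import dataclass

# Hypothetical constants; the real hydration_v2.py budgets ~120k characters.
CHAR_BUDGET = 120_000
NODE_SHARE = 0.4  # 40% reserved for nodes; leftovers roll into the edge budget

@dataclass
class Node:
    uuid: str
    summary: str
    degree: int = 0  # number of connections

@dataclass
class Edge:
    uuid: str
    fact: str
    priority: int = 3  # 1 = hub-to-hub, 2 = hub-adjacent, 3 = long-tail

def waterfill(nodes: list[Node], edges: list[Edge], budget: int = CHAR_BUDGET):
    """Two-pass allocation: fill nodes most-connected-first, then spend
    the remaining budget (edge share + node rollover) on edges by tier."""
    included_nodes: list[Node] = []
    node_budget = int(budget * NODE_SHARE)
    used = 0
    for node in sorted(nodes, key=lambda n: n.degree, reverse=True):
        if used + len(node.summary) > node_budget:
            break  # everything below this degree gets cut
        used += len(node.summary)
        included_nodes.append(node)

    included_edges: list[Edge] = []
    edge_budget = budget - used  # rollover: unused node chars go to edges
    used = 0
    for edge in sorted(edges, key=lambda e: e.priority):
        if used + len(edge.fact) > edge_budget:
            break  # long-tail edges are the first casualties
        used += len(edge.fact)
        included_edges.append(edge)

    is_partial = (len(included_nodes) < len(nodes)
                  or len(included_edges) < len(edges))
    return included_nodes, included_edges, is_partial
```

The is_partial flag falling out of this function is what feeds the Metadata Contract described next.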
The Bridge: The Metadata Contract
Here was the hardest architectural problem: If Hydration V2 puts "Fact A" in the Base Prompt, and my RAG pipeline searches for "Fact A" on the next turn, I will inject duplicate data into the LLM.
To fix this, Hydration V2 doesn't just return text. It returns a Metadata Contract:
```json
{
  "compilationMetadata": {
    "is_partial": true,
    "total_estimated_tokens": 29500,
    "included_node_ids": ["uuid-1", "uuid-2"],
    "included_edge_ids": ["uuid-x", "uuid-y"]
  }
}
```
If is_partial is true, it means the graph was too big and the waterfill algorithm had to cut things. It also returns the exact UUIDs of the nodes and edges that did make it into the prompt.
The React frontend stores this metadata and sends it back to the backend on every single chat request. Now, the backend knows exactly what the LLM already knows.
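Building that contract is mechanical once the compilation pass has run. A minimal sketch (the field names match the JSON above; the node and edge objects are assumed to expose a uuid attribute):

```python
def compilation_metadata(included_nodes, included_edges, is_partial, est_tokens):
    """Build the contract the frontend stores and echoes back on every chat
    request, so the backend knows exactly what the LLM already has in context."""
    return {
        "compilationMetadata": {
            "is_partial": is_partial,
            "total_estimated_tokens": est_tokens,
            "included_node_ids": [n.uuid for n in included_nodes],
            "included_edge_ids": [e.uuid for e in included_edges],
        }
    }
```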
Phase 2: Deterministic GraphRAG (No Agents)
Most RAG systems today use "Agents" or tool-calling loops. The LLM decides if it needs to search, writes a query, waits for the tool, and then answers.
I hate this pattern for chat UIs. For use cases that require no complex reasoning or multiple tools, it just adds 2 to 5 seconds of latency. I wanted my RAG pipeline to be deterministic and execute in under 1 second.
Here is my straight-line GraphRAG pipeline:
1. The Gate Check
Before doing any search, the backend checks compilationMetadata.is_partial. If it is false, that means her entire graph fits into the Base Prompt. The system skips RAG entirely. Zero wasted compute.
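The gate itself is a one-liner. A sketch, assuming the contract shape shown earlier:

```python
def needs_rag(contract: dict) -> bool:
    """Gate check: run retrieval only when the waterfill had to cut content.
    If the whole graph fit into the base prompt, searching it again is wasted work."""
    meta = contract.get("compilationMetadata", {})
    return bool(meta.get("is_partial", False))
```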
2. The Query
Instead of just taking her last message (which might just be "Why?"), I concatenate the last 3 non-system messages to build a context-rich search query.
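A sketch of that query builder (message shape assumed to be role/content dicts):

```python
def build_search_query(messages: list[dict], window: int = 3) -> str:
    """Concatenate the last `window` non-system messages so a terse follow-up
    like "Why?" still carries its surrounding context into the search."""
    recent = [m["content"] for m in messages if m["role"] != "system"][-window:]
    return "\n".join(recent)
```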
3. Hybrid Search
I use Graphiti to run a single hybrid search: Semantic Search (vector embeddings) + BM25 (exact keyword match), fused together using Reciprocal Rank Fusion (RRF).
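Graphiti handles the fusion internally, but the RRF idea itself is simple enough to show standalone. Each result scores 1/(k + rank) in every ranked list it appears in, and the totals decide the final order (k = 60 is the conventional damping constant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked result lists (e.g. vector + BM25) into one.
    Items near the top of several lists accumulate the highest scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how "a", top of both lists, wins even though no single ranker is trusted outright; that robustness to disagreement is why RRF is the standard fusion choice.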
4. The Secret Sauce: Deduplication
Once I have the search results, I cross-reference them with the Metadata Contract from the frontend.
```python
def deduplicate_edges(retrieved_edges: list[Edge], metadata: CompilationMetadata) -> list[Edge]:
    """Drop any edges that are already present in the Base System Prompt."""
    included = set(metadata.included_edge_ids)  # set for O(1) membership checks
    return [e for e in retrieved_edges if e.uuid not in included]
```
This guarantees zero redundancy. If the RAG pipeline finds a memory, but it's already in the Base Prompt, it gets silently dropped.
5. Ephemeral Injection
The surviving edges and nodes are formatted and injected into the System Message right before hitting Gemini, under a clear header: ### RELEVANT EPISODIC MEMORY FOR THIS TURN ###.
Crucially, this injected context is ephemeral. It is sent to the LLM for this specific turn, but it is never saved to the persistent database chat history. This prevents the context window from bloating with old RAG results over time (context rot).
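The injection step is just string assembly on the outgoing request; only the base prompt is ever persisted. A sketch (edge facts assumed to be pre-formatted strings):

```python
EPISODIC_HEADER = "### RELEVANT EPISODIC MEMORY FOR THIS TURN ###"

def inject_ephemeral_context(base_system_prompt: str, facts: list[str]) -> str:
    """Append this turn's retrieved facts to the system message.
    The result is sent to the model but never saved to chat history,
    so old RAG results cannot accumulate over time (context rot)."""
    if not facts:
        return base_system_prompt
    bullet_list = "\n".join(f"- {fact}" for fact in facts)
    return f"{base_system_prompt}\n\n{EPISODIC_HEADER}\n{bullet_list}"
```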
Observability & The Results
You can't improve what you don't measure. I added OpenTelemetry across the backend. Now, when I look at a trace, I can see exactly what the waterfill dropped (hydrate.is_partial), how long the search took (rag.search_duration_ms), and how many facts were actually injected (rag.injected_edges_count).
The Impact:
Look back at the chart at the start of this article. After Day 21, I deployed this architecture.
The input tokens per message instantly collapsed from 120k back down to a stable ~40k tokens (the budget limit + chat history).
The magic is that the AI didn't get dumber. It still feels like it knows everything about her because the structural skeleton (the Hubs) is always there in the Base Prompt. But when she asks a specific question about a past event, the GraphRAG pipeline silently fetches the long-tail details in under a second.
Conclusion
A massive 1 million token context window is an incredible luxury, but it is not a substitute for software architecture.
Dumping everything into the prompt is the best way to validate an idea. But building real products eventually forces you to move from "what works theoretically" to "what works economically and efficiently."
By separating Storage (Knowledge Graphs) from Retrieval (Budget-Aware Base Prompts + Deterministic RAG), Synapse is now fast, cheap to run, and able to scale no matter how large her graph grows.
The code for both of these systems is open source. You can check out exactly how I implemented the waterfill allocation (hydration_v2.py) and the retrieval pipeline (graph_rag.py) in the backend repository.
- Frontend (Body): synapse-chat-ai
- Backend (Cortex): synapse-cortex
I love sharing these real-world scaling problems. If you are building memory systems or working with AI in production, I'd love to hear your approach. Let's connect on X or LinkedIn.