<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Juan David Gómez</title>
    <description>The latest articles on DEV Community by Juan David Gómez (@juandastic).</description>
    <link>https://dev.to/juandastic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2387605%2F11f53f0b-023e-4c47-87db-00467fa8f7e1.jpeg</url>
      <title>DEV Community: Juan David Gómez</title>
      <link>https://dev.to/juandastic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/juandastic"/>
    <language>en</language>
    <item>
      <title>My AI Sends 30k Tokens Per Message. 80% of Them Were Wasted.</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Sun, 19 Apr 2026 00:42:07 +0000</pubDate>
      <link>https://dev.to/juandastic/my-ai-sends-30k-tokens-per-message-80-of-them-were-wasted-1lmp</link>
      <guid>https://dev.to/juandastic/my-ai-sends-30k-tokens-per-message-80-of-them-were-wasted-1lmp</guid>
      <description>&lt;p&gt;Building AI side projects is fun until you have to pay for them.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://dev.to/juandastic/my-wife-sent-297-messages-in-15-days-not-to-me-to-the-ai-i-built-her-the-synapse-story-333o"&gt;Synapse&lt;/a&gt;, an AI companion with deep memory powered by a knowledge graph. My wife uses it daily for therapy, coaching, and reflection. The AI knows her life, her patterns, her goals, her emotional triggers. It remembers things across weeks and months.&lt;/p&gt;

&lt;p&gt;Two weeks ago, I connected PostHog to track LLM costs. Here is what I saw:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkchp9m62z3t83qto3tqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkchp9m62z3t83qto3tqr.png" alt="AI generations price" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;$24 in two weeks. Four users. One of her sessions hit $2.42 for 28 messages. A single conversation.&lt;/p&gt;

&lt;p&gt;I looked at the token breakdown and the problem was obvious. Every message sends roughly 30,000 tokens of system context. Her knowledge graph has grown rich after weeks of daily use. Entities, relationships, temporal facts, emotional patterns. All of it compiled into a structured text snapshot and injected into every single message.&lt;/p&gt;

&lt;p&gt;And 80 to 90% of those tokens are the exact same compiled knowledge repeated on every single turn.&lt;/p&gt;

&lt;p&gt;The memory quality is great. The cost structure is not. So I made two changes: I restructured how context is assembled, and I added an explicit cache layer using Gemini's CachedContent API. Together they cut the cost per message by more than half.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Context: The Hybrid Memory Architecture
&lt;/h2&gt;

&lt;p&gt;If you have been following this series, you know the backstory. If not, here is the short version. (The full technical deep dive is in &lt;a href="https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85"&gt;Scaling AI Memory: How I Tamed a 120K Token Prompt with Deterministic GraphRAG&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;Synapse uses a two-layer approach to give the AI long-term memory:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Base Compilation (Working Memory).&lt;/strong&gt; When a session starts, &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;Synapse Cortex&lt;/a&gt; compiles the knowledge graph into a structured text summary. Entities, relationships, temporal facts. The most connected nodes always make it in. A waterfill algorithm caps the budget at roughly 120,000 characters (~30K tokens). This is the "always-on" context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. GraphRAG (Episodic Recall).&lt;/strong&gt; When the graph is too large for the budget, a second layer retrieves long-tail memories per-turn using hybrid search. It uses the graph UUIDs from the compilation metadata to avoid duplicating what is already in the base. Zero-latency, deterministic, no agent loops.&lt;/p&gt;

&lt;p&gt;This works well for quality. The AI still feels like it knows everything about you. But the cost story has a gap: that 30K compilation is the same text for the entire session, and it gets billed as fresh input tokens on every single message.&lt;/p&gt;

&lt;p&gt;In a 28-message session, that is 28 x 30k = 840k tokens just from the base knowledge. Almost all of it identical.&lt;/p&gt;
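&lt;p&gt;To make the shape of the bill concrete, here is the same arithmetic as a sketch. The per-token price is an illustrative placeholder, not Gemini's actual rate:&lt;/p&gt;

```python
# Session-level cost of re-sending the base compilation on every turn.
# PRICE_PER_MTOK is an illustrative placeholder, not a real Gemini rate.
BASE_TOKENS = 30_000      # compiled knowledge injected into every message
PRICE_PER_MTOK = 0.30     # hypothetical dollars per million input tokens

def base_tokens_for_session(messages: int) -> int:
    """Tokens billed for the base compilation alone across one session."""
    return messages * BASE_TOKENS

def base_cost_for_session(messages: int) -> float:
    """Dollar cost of those tokens at the illustrative input price."""
    return base_tokens_for_session(messages) * PRICE_PER_MTOK / 1_000_000
```

&lt;p&gt;At those numbers a 28-message session re-bills 840K identical tokens before the conversation history is even counted.&lt;/p&gt;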

&lt;h2&gt;
  
  
  The Problem: One Blob, One Bill
&lt;/h2&gt;

&lt;p&gt;Before this change, context assembly on the Convex side (the backend that serves the frontend) looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// prepareContext: before&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;systemContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cachedSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;systemContent&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s2"&gt;`\n\nCurrent date and time: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;currentDateTime&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userKnowledge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;systemContent&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s2"&gt;`\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userKnowledge&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiMessages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemContent&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;conversationHistory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One string. Persona prompt, datetime, and the entire 30K compilation concatenated together. Sent as a single system message on every request.&lt;/p&gt;

&lt;p&gt;This design was simple and it worked fine when the graph was small. But it has two problems at scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You cannot cache part of a blob.&lt;/strong&gt; The major providers (OpenAI, Google, Anthropic) all offer some form of prompt caching, and some of it is automatic: when you send the same text prefix repeatedly, the provider may cache it behind the scenes and charge less. But implicit caching is unreliable. A small change anywhere in the prompt can break the prefix matching. The provider may choose not to cache for reasons you cannot see or control. And you have zero visibility into whether it is working.&lt;/p&gt;

&lt;p&gt;In my case, the &lt;code&gt;systemContent&lt;/code&gt; blob included the current datetime on every message. That single line changing every turn was enough to break any automatic prefix matching. Even though the other 25k tokens were identical.&lt;/p&gt;

&lt;p&gt;The persona prompt (~500 tokens) is lightweight and rarely changes. The datetime changes every turn. The knowledge compilation (~25-30K tokens) is heavy but stable for the entire session. Treating them as one string means the lightweight parts are hostage to the heavy part.&lt;/p&gt;
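&lt;p&gt;A toy model makes the prefix mechanics visible. &lt;code&gt;common_prefix_len&lt;/code&gt; below stands in for the provider's matcher; it is not a real API, just a way to see where a volatile line kills the shared prefix:&lt;/p&gt;

```python
# Implicit prefix caching can only match two prompts up to the first
# differing character. This toy matcher shows the effect of ordering.

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared prefix between two prompt strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

PERSONA = "You are Synapse."                  # stable, tiny
KNOWLEDGE = "ENTITIES: Maria, ... " * 800     # stable per session, huge

def one_blob(now: str) -> str:
    # Before: the volatile datetime sits in front of the heavy knowledge.
    return PERSONA + "\nCurrent date and time: " + now + "\n" + KNOWLEDGE

def stable_first(now: str) -> str:
    # Stable parts first, volatile datetime last: now the whole knowledge
    # block is inside the shared prefix.
    return PERSONA + "\n" + KNOWLEDGE + "\nCurrent date and time: " + now

blob_shared = common_prefix_len(one_blob("10:00"), one_blob("10:03"))
split_shared = common_prefix_len(stable_first("10:00"), stable_first("10:03"))
```

&lt;p&gt;With the single blob the shared prefix dies at the timestamp, a few dozen characters in; with the stable parts first it covers the entire knowledge block. The actual fix in Synapse goes further than reordering, splitting the context into separate fields and adding an explicit cache, but this is the failure mode it removes.&lt;/p&gt;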

&lt;h2&gt;
  
  
  Change #1: Splitting the Context Snapshot
&lt;/h2&gt;

&lt;p&gt;The first change was structural. Instead of returning one &lt;code&gt;systemContent&lt;/code&gt; string, &lt;code&gt;prepareContext&lt;/code&gt; now returns three separate fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// prepareContext: after&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemInstruction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cachedSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\nCurrent date and time: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;currentDateTime&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cacheName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;knowledgeCache&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;cacheName&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;apiMessages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// user/assistant turns only, no system message&lt;/span&gt;
  &lt;span class="nx"&gt;systemInstruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// lightweight persona + datetime (~500 tokens)&lt;/span&gt;
  &lt;span class="nx"&gt;compilation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// heavy knowledge (~25K tokens), stable per session&lt;/span&gt;
  &lt;span class="nx"&gt;cacheName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// Gemini cache pointer (if available)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTTP layer sends these as separate JSON parameters to Cortex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;system_instruction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemInstruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;...(&lt;/span&gt;&lt;span class="nx"&gt;compilation&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;compilation&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;...(&lt;/span&gt;&lt;span class="nx"&gt;cacheName&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;cache_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cacheName&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;apiMessages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation is what makes everything else possible. The compilation is now an independent unit that the server can handle differently from the volatile parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Episodic Section
&lt;/h3&gt;

&lt;p&gt;I also adjusted the compilation itself. I added a new section that summarizes the previous session. Instead of relying only on the graph's entity and relationship definitions, the model now gets a short episodic recap: "Last session you talked about X, explored Y, and mentioned Z."&lt;/p&gt;

&lt;p&gt;This serves two purposes. First, it gives the model easy session continuity without loading raw message history. Second, it let me trim the budget for less-connected facts and concepts. The total max tokens dropped from ~30K to ~25K. That is ~5,000 fewer tokens per message before caching even enters the picture.&lt;/p&gt;

&lt;p&gt;You can see the effect in the PostHog data. After April 10th, the average tokens per message for my wife's account dropped visibly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42kp5497boh1v3kz5yrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42kp5497boh1v3kz5yrx.png" alt="posthog data avg tokens per message" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Change #2: Gemini Explicit Cache
&lt;/h2&gt;

&lt;p&gt;Here is where the real savings come from.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini CachedContent: The Concept
&lt;/h3&gt;

&lt;p&gt;Gemini offers an explicit caching API. You create a cache resource by uploading content to &lt;code&gt;caches.create()&lt;/code&gt;. You get back a resource name like &lt;code&gt;cachedContents/abc123&lt;/code&gt;. On subsequent requests, you pass that name and Gemini uses the cached content as a prefix instead of re-processing the input.&lt;/p&gt;

&lt;p&gt;The economics: cached tokens cost roughly 75% less than regular input tokens. For a 25K token compilation, that means paying for about 6,250 tokens instead of 25,000. On every single turn.&lt;/p&gt;

&lt;p&gt;Gemini enforces a minimum cacheable size of around 1,024 tokens. I use 4,000 characters as a conservative threshold: anything smaller is skipped and inlined as usual.&lt;/p&gt;
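&lt;p&gt;The discount math, with the 75% figure from above (the helper is a sketch; actual billing also adds a small per-hour storage charge for the cache):&lt;/p&gt;

```python
# Effective tokens billed for the compilation with and without the cache.
# The ~75% discount is the figure quoted above; storage fees are ignored.
CACHE_DISCOUNT = 0.75   # cached tokens billed at roughly 25% of input rate

def billable_tokens(compilation_tokens: int, cached: bool) -> float:
    """Full-price-equivalent input tokens billed for the compilation."""
    if cached:
        return compilation_tokens * (1 - CACHE_DISCOUNT)
    return float(compilation_tokens)
```

&lt;p&gt;For the 25K compilation that is 6,250 effective tokens per turn instead of 25,000.&lt;/p&gt;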

&lt;h3&gt;
  
  
  How I Integrated It
&lt;/h3&gt;

&lt;p&gt;The cache lifecycle follows the existing Synapse pipeline. No new infrastructure. No new services. Just a new step after compilation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When a session starts&lt;/strong&gt; (hydration), Cortex compiles the knowledge and creates a Gemini cache from it. The &lt;code&gt;cacheName&lt;/code&gt; is returned to the client alongside the compilation. The client stores both in &lt;code&gt;user_knowledge_cache&lt;/code&gt; (Convex table).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During the session&lt;/strong&gt;, the client sends &lt;code&gt;cache_name&lt;/code&gt; and &lt;code&gt;compilation&lt;/code&gt; on every chat request. If the cache is valid, Cortex passes it to Gemini via &lt;code&gt;cached_content&lt;/code&gt; and the compilation is served from cache. If there is no cache, Cortex inlines the compilation into the prompt. Same result, different price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the session closes&lt;/strong&gt; (ingestion), new messages are processed into the knowledge graph, a fresh compilation is generated, and a new cache is created. The cycle repeats.&lt;/p&gt;

&lt;p&gt;The entire cache manager is about 100 lines of Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CacheManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_compilation_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compilation_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compilation_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MIN_CHARS_FOR_CACHE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compilation_too_small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CreateCachedContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compilation_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;system_instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compilation_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate_by_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;refresh_ttl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;UpdateCachedContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend is stateless. It does not track which user owns which cache. The client persists the &lt;code&gt;cacheName&lt;/code&gt; and forwards it. This keeps the architecture clean and avoids a new data store.&lt;/p&gt;

&lt;h3&gt;
  
  
  The TTL Problem
&lt;/h3&gt;

&lt;p&gt;Gemini caches expire by wall clock, not by usage. The default TTL is one hour. So a user in a long conversation would hit expiration at exactly the 60-minute mark regardless of how many messages they sent.&lt;/p&gt;

&lt;p&gt;My solution: after every successful cache hit, I spawn a fire-and-forget task that pushes the TTL forward by another hour. Active users never expire mid-session. The task runs in the background, does not block the response, and if it fails nothing breaks. The worst case is the cache expires and the fallback kicks in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache_hit&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;active_cache_name&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;cache_manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;_spawn_background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh_ttl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;active_cache_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Fallback: Engineering for the Unhappy Path
&lt;/h2&gt;

&lt;p&gt;One thing I like about Gemini's implementation: if the cache is expired or not found, the request fails explicitly. It does not silently fall back to full-price tokens. It tells you. That gives you control to decide what to do next, whether that is retry with the full compilation or create a fresh cache.&lt;/p&gt;

&lt;p&gt;But that also means a stale cache is a new way for the app to break. Caches expire by wall clock. They can get deleted upstream. The model in the cache might not match the model in the request. If I built a system that only works when the cache is hot, it would only be a matter of time before my wife woke me up at midnight asking why her precious app had stopped working.&lt;/p&gt;

&lt;p&gt;The design principle is simple: &lt;strong&gt;the client always sends everything&lt;/strong&gt;. Both the &lt;code&gt;cache_name&lt;/code&gt; and the full &lt;code&gt;compilation&lt;/code&gt;. The server decides which to use.&lt;/p&gt;

&lt;p&gt;If the cache is valid, use it. 75% cheaper. If the cache is stale, fall back to inlining the compilation. Full price, but it works. The user never notices.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Peek Pattern
&lt;/h3&gt;

&lt;p&gt;There is a subtlety with the Gemini SDK. It is lazy. When you open a streaming request, the actual HTTP call does not happen until you pull the first chunk. That means cache errors do not surface when you create the stream. They surface when you iterate.&lt;/p&gt;

&lt;p&gt;So I peek the first chunk inside a &lt;code&gt;try/except&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;stream_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_open_and_peek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;active_cache_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cache&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;_looks_like_cache_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Cache is stale, invalidate it
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invalidate_by_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;active_cache_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Rebuild with compilation inlined and retry
&lt;/span&gt;        &lt;span class="n"&gt;gemini_contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_contents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inline_compilation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;stream_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_open_and_peek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I match against a list of known error strings: "cache expired", "cache not found", "does not match the model in the cached content", and a few more. If the error matches, I invalidate the stale cache, rebuild the contents with the compilation inlined, and retry the stream. All before any bytes reach the client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto Re-Hydration
&lt;/h3&gt;

&lt;p&gt;One more thing. When a fallback happens, I do not just serve the request and move on. The final SSE usage chunk includes &lt;code&gt;cache_fallback_triggered: true&lt;/code&gt;. The Convex client watches for this flag and schedules a background re-hydration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;cache_fallback_triggered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runAfter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cortex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hydrate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a fresh cache for the next message. So only one message per expiration window pays full price. Every subsequent message in that session gets the cache benefit again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Here is what the data shows after both changes went live.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Reduction
&lt;/h3&gt;

&lt;p&gt;The episodic restructuring (Change #1) dropped the average tokens per message from the ~40K range to the ~30K range. That is visible in the PostHog chart starting around April 10th.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Per Generation
&lt;/h3&gt;

&lt;p&gt;From the PostHog session tracking, here is the daily breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Generations&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;th&gt;Cost/Generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apr 11&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;$1.14&lt;/td&gt;
&lt;td&gt;$0.0170&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 12&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;$1.18&lt;/td&gt;
&lt;td&gt;$0.0137&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 13&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;$1.32&lt;/td&gt;
&lt;td&gt;$0.0169&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 14&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;$3.08&lt;/td&gt;
&lt;td&gt;$0.0390&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 16&lt;/td&gt;
&lt;td&gt;109&lt;/td&gt;
&lt;td&gt;$1.56&lt;/td&gt;
&lt;td&gt;$0.0143&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 17&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;$2.16&lt;/td&gt;
&lt;td&gt;$0.0399&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 18&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0088&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;April 18 is when the explicit cache was fully active. The cost per generation dropped to $0.0088, less than half the cost of a typical day.&lt;/p&gt;

&lt;p&gt;The expensive days (Apr 14 and 17) included sessions with heavy ingestion: 28 and 39 generation calls respectively, counting the Gemini calls for graph processing during the Sleep Cycle. Those costs cover both the chat generation and the knowledge extraction, not just the user-facing messages.&lt;/p&gt;

&lt;h3&gt;
  
  
  The $2.42 Session Revisited
&lt;/h3&gt;

&lt;p&gt;That 28-message session that cost $2.42? With caching active at the April 18 rate, the same conversation would cost roughly $0.25. That is almost a 90% reduction.&lt;/p&gt;
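&lt;p&gt;A toy cost model shows the direction of that saving (the per-token prices below are illustrative placeholders, not Gemini's actual rate card; the full ~90% figure also reflects the token reduction from Change #1):&lt;/p&gt;

```python
# Hypothetical prices per 1M input tokens; cached tokens are billed
# at a steep discount (exact rates depend on model and provider).
FULL_PRICE = 0.30      # $ per 1M uncached input tokens (illustrative)
CACHED_PRICE = 0.075   # $ per 1M cached input tokens (illustrative)


def session_cost(messages: int, tokens_per_msg: int, cached_fraction: float) -> float:
    """Input cost of a session where `cached_fraction` of each
    message's prompt is served from the explicit cache."""
    cost = 0.0
    for _ in range(messages):
        cached = tokens_per_msg * cached_fraction
        fresh = tokens_per_msg - cached
        cost += fresh / 1e6 * FULL_PRICE + cached / 1e6 * CACHED_PRICE
    return cost


no_cache = session_cost(28, 30_000, 0.0)
with_cache = session_cost(28, 30_000, 0.8)   # ~80% of each prompt cached
reduction = 1 - with_cache / no_cache        # fraction of input cost saved
```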

&lt;p&gt;And the AI still has full access to the knowledge graph. The compilation is the same content. It is just served from cache instead of being re-tokenized on every turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Both changes are fully instrumented. On the Axiom side, every chat request logs cache attributes: &lt;code&gt;cache.hit&lt;/code&gt;, &lt;code&gt;cache.hit_ratio&lt;/code&gt;, &lt;code&gt;cache.fallback_triggered&lt;/code&gt;, &lt;code&gt;cache.skip_reason&lt;/code&gt;. On PostHog, I track &lt;code&gt;cache_enabled&lt;/code&gt;, &lt;code&gt;cache_hit&lt;/code&gt;, and &lt;code&gt;cached_tokens&lt;/code&gt; per generation.&lt;/p&gt;

&lt;p&gt;This means I can answer questions like: what percentage of requests hit the cache? How often does the fallback trigger? Is the TTL refresh working? No guessing.&lt;/p&gt;
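&lt;p&gt;Answering those questions is a simple aggregation over the logged attributes. A sketch (the event shape here is hypothetical, not Axiom's or PostHog's actual schema):&lt;/p&gt;

```python
# Hypothetical log events carrying the cache attributes described above.
events = [
    {"cache_enabled": True, "cache_hit": True, "cached_tokens": 24_000},
    {"cache_enabled": True, "cache_hit": True, "cached_tokens": 24_000},
    {"cache_enabled": True, "cache_hit": False, "cached_tokens": 0},
    {"cache_enabled": False, "cache_hit": False, "cached_tokens": 0},
]


def cache_stats(events: list[dict]) -> dict:
    """Aggregate hit rate and cached-token volume over cache-eligible requests."""
    eligible = [e for e in events if e["cache_enabled"]]
    hits = sum(1 for e in eligible if e["cache_hit"])
    return {
        "hit_rate": hits / len(eligible) if eligible else 0.0,
        "cached_tokens": sum(e["cached_tokens"] for e in eligible),
    }


stats = cache_stats(events)
```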

&lt;h2&gt;
  
  
  What is Next
&lt;/h2&gt;

&lt;p&gt;The cache layer is live and working. But there are a few things I want to explore:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tighter compilation budgets.&lt;/strong&gt; Now that the episodic summary provides session continuity, I think I can trim the base compilation further. Maybe from 25K to 20K tokens. The GraphRAG layer already handles the long tail. Less base knowledge means a smaller cache, which means even cheaper turns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-model flexibility.&lt;/strong&gt; Gemini caches are bound to a specific model. If I switch between Flash and Pro based on the task, each model needs its own cache. That is a limitation I have not solved yet.&lt;/p&gt;

&lt;p&gt;Building AI solutions is fun. Paying for them is the part that makes you think harder. And thinking harder usually leads to a better architecture.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is article #5 in the Synapse series. If you want the full backstory:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/juandastic/my-wife-sent-297-messages-in-15-days-not-to-me-to-the-ai-i-built-her-the-synapse-story-333o"&gt;The Synapse Story&lt;/a&gt; - What Synapse is and why I built it&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e"&gt;Beyond RAG&lt;/a&gt; - Knowledge graphs as the memory foundation&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85"&gt;Scaling AI Memory&lt;/a&gt; - The hybrid approach with Hydration V2 and GraphRAG&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/juandastic/full-circle-giving-my-ais-knowledge-graph-a-notion-interface-using-mcp-2dmp"&gt;Full Circle&lt;/a&gt; - Giving the knowledge graph a Notion interface via MCP&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;The code is open source: &lt;a href="https://github.com/juandastic/synapse-chat-ai" rel="noopener noreferrer"&gt;synapse-chat-ai&lt;/a&gt; (frontend) and &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;synapse-cortex&lt;/a&gt; (backend).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's connect on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My Wife Sent 297 Messages in 15 Days. Not to Me. To the AI I Built Her. The Synapse Story</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Fri, 03 Apr 2026 01:26:03 +0000</pubDate>
      <link>https://dev.to/juandastic/my-wife-sent-297-messages-in-15-days-not-to-me-to-the-ai-i-built-her-the-synapse-story-333o</link>
      <guid>https://dev.to/juandastic/my-wife-sent-297-messages-in-15-days-not-to-me-to-the-ai-i-built-her-the-synapse-story-333o</guid>
      <description>&lt;h2&gt;
  
  
  The Psychologist Who Couldn't Find the Right Therapist
&lt;/h2&gt;

&lt;p&gt;My wife is a professional psychologist. She is also a regular therapy patient. And for years, she struggled to find a therapist who could match her intelligence and clinical knowledge and use them in her favor, not against her.&lt;/p&gt;

&lt;p&gt;The problem was not the therapists. The problem was the format. She would walk into a session with a mental list (sometimes an actual list or even full presentations) of what she wanted to cover that week. Sometimes the session went deep into the right topics. Other times, something emotionally loud from that day would take over the entire hour. She would leave feeling lighter, sure, but frustrated. She had used her session to vent about something temporary instead of working on her core issues. And her therapist only saw her for one hour per week. There was no way to cover everything.&lt;/p&gt;

&lt;p&gt;When she discovered LLMs and understood their potential, something clicked. She started experimenting with Gemini as a daily companion for emotional exploration, not as a replacement for therapy, but as the &lt;strong&gt;missing piece&lt;/strong&gt; between sessions. The LLM handles the daily processing: the venting, the pattern recognition, the emotional sorting. The professional therapist acts as the safeguard of the process, keeping everything aligned with long-term goals.&lt;/p&gt;

&lt;p&gt;It worked. But Gemini has no memory. So she built a workaround.&lt;/p&gt;

&lt;p&gt;Over months, she crafted a massive "Master Prompt" in Notion. It contained her medical history, key life events, emotional triggers, therapeutic frameworks, and ongoing projects. Every time she started a new conversation, she had to manually copy-paste this just to get the AI up to speed. If she didn't, the advice was generic and useless.&lt;/p&gt;

&lt;p&gt;The prompt grew every week because life kept happening. She dreaded starting new threads because of the "context set up" tax. She felt like she was constantly repeating herself.&lt;/p&gt;

&lt;p&gt;She didn't need a search engine or a simple chat history. She needed a &lt;strong&gt;continuous brain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://synapse-chat.juandago.dev/" rel="noopener noreferrer"&gt;Synapse&lt;/a&gt;. An AI chat with deep memory, powered by a knowledge graph, designed for the kind of personal conversations that actually need the AI to know who you are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furhpv2al75xpw47iuy8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furhpv2al75xpw47iuy8b.png" alt="Side-by-side comparison: Regular AI vs. Synapse" width="781" height="510"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Articles, Three Walls, One User
&lt;/h2&gt;

&lt;p&gt;Synapse didn't start as what it is today. It evolved through four versions, and every single one broke when she actually used it. I documented each step on Dev.to. Here is the compressed timeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V1: The Knowledge Graph.&lt;/strong&gt; I replaced the Notion page with a Neo4j knowledge graph powered by &lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt;. As she chatted, the AI quietly extracted entities and relationships in the background. I called it the "Sleep Cycle." No more copy-pasting. The compiled graph was about 10,000 tokens, down from her 35,000-token manual prompt. It worked. (&lt;a href="https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e"&gt;Read the full origin story&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V2: The Scaling Wall.&lt;/strong&gt; The graph grew. By day 21, every message carried over &lt;strong&gt;120,000 tokens&lt;/strong&gt; of system context. Costs climbed. Latency suffered. I built a budget-aware "waterfill" system (Hydration V2) that caps the prompt at ~30K tokens and retrieves the rest on demand with zero-latency GraphRAG. The AI didn't get dumber. It still felt like it knew everything about her. (&lt;a href="https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85"&gt;How I tamed 120K tokens&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V3: The UX Wall.&lt;/strong&gt; I built a graph visualizer so she could explore her AI's memory. I thought it was beautiful. To her, it was just overwhelming. She missed Notion. So I brought Notion back as the interface to her AI's brain, using MCP for bidirectional sync. She could review her AI's knowledge in structured tables, flag mistakes with a checkbox, and push corrections back to the graph. (&lt;a href="https://dev.to/juandastic/full-circle-giving-my-ais-knowledge-graph-a-notion-interface-using-mcp-2dmp"&gt;The full circle&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Every version looked perfect in my demo. Every version broke when she actually used it. That is what makes building for a real user different from building a tutorial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhtw1unze40ixe4qs5se.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhtw1unze40ixe4qs5se.png" alt="Evolution diagram" width="276" height="630"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Synapse: A Chat AI That Builds a Map of Your Life
&lt;/h2&gt;

&lt;p&gt;Enough history. Let me show you what Synapse is today.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pipeline: Converse, Ingest, Compile, Evolve
&lt;/h3&gt;

&lt;p&gt;Synapse works in a 4-step cycle. You don't configure anything. You just talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Converse.&lt;/strong&gt; You open a chat and start talking. Pick a persona (more on that below). The AI already knows who you are because the knowledge from your previous sessions was compiled and injected before you sent your first message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ingest.&lt;/strong&gt; When a conversation pauses (3 hours of inactivity) or you press the "Consolidate Memory" button, the system closes the session. It sends the transcript to &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;Synapse Cortex&lt;/a&gt;, the Python backend that powers the brain. Cortex uses Graphiti and Gemini to extract entities, relationships, and patterns into a Neo4j knowledge graph. This is the "Sleep Cycle."&lt;/p&gt;
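&lt;p&gt;The trigger condition for closing a session is simple. A sketch (simplified; the real check runs inside the app backend, and only the 3-hour window is taken from the description above):&lt;/p&gt;

```python
INACTIVITY_LIMIT_S = 3 * 3600  # close the session after 3 hours of silence


def should_consolidate(last_message_at: float, now: float,
                       button_pressed: bool = False) -> bool:
    """A session closes (and goes to the Sleep Cycle) when the user
    presses "Consolidate Memory" or has been inactive for 3 hours."""
    return button_pressed or (now - last_message_at) >= INACTIVITY_LIMIT_S


still_open = should_consolidate(last_message_at=0, now=600)        # 10 minutes in
timed_out = should_consolidate(last_message_at=0, now=3 * 3600)    # 3 hours later
forced = should_consolidate(last_message_at=0, now=600, button_pressed=True)
```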

&lt;p&gt;&lt;strong&gt;3. Compile.&lt;/strong&gt; Next time you start a conversation, Cortex hydrates the session. It compiles the most important knowledge from your graph into a structured text snapshot (~30K tokens). The most connected entities, the "hubs" of your life, always make it in. If the graph is too large, a waterfill algorithm prioritizes what matters most and retrieves the rest on demand.&lt;/p&gt;
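&lt;p&gt;The waterfill idea can be sketched as a greedy budget fill (a simplified toy version; the real Hydration V2 budgeting described in the Scaling AI Memory article is more involved, and the entity data below is invented):&lt;/p&gt;

```python
def waterfill_compile(entities: list[tuple], budget_tokens: int):
    """Greedy budget fill: the most-connected entities (the 'hubs')
    go into the compiled snapshot first, until the token budget runs out.

    `entities` is a list of (name, connection_count, token_cost) tuples.
    Anything excluded is retrieved on demand via GraphRAG instead.
    """
    ranked = sorted(entities, key=lambda e: e[1], reverse=True)
    included, used = [], 0
    for name, _, cost in ranked:
        if used + cost <= budget_tokens:
            included.append(name)
            used += cost
    return included, used


entities = [
    ("Work", 42, 9_000),
    ("Partner", 35, 8_000),
    ("Insomnia", 20, 7_000),
    ("Old hobby", 3, 12_000),  # weakly connected: left out of the snapshot
]
snapshot, used = waterfill_compile(entities, budget_tokens=30_000)
```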

&lt;p&gt;&lt;strong&gt;4. Evolve.&lt;/strong&gt; With every conversation, the graph refines. Old facts get invalidated with timestamps, not deleted. New connections emerge. The AI's understanding of you grows over weeks and months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1h9ppsk8x4mhed3791l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1h9ppsk8x4mhed3791l.png" alt="4-step pipeline" width="748" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof37uv9hdiqosr9rl99c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof37uv9hdiqosr9rl99c.png" alt=" " width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Personas, One Memory
&lt;/h3&gt;

&lt;p&gt;Most AI tools give you one generic chatbot. Synapse gives you three specialized lenses. All three share the same knowledge graph. &lt;strong&gt;Same memory, different perspective.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🧭 &lt;strong&gt;Compass (Therapeutic).&lt;/strong&gt; Built on Acceptance and Commitment Therapy (ACT), Dialectical Behavior Therapy (DBT), and Polyvagal Theory. Neuroaffirmative by default. It tracks nervous system states. It remembers which grounding techniques worked before. One meaningful question at a time. For processing anxiety, grief, anger, and anything that needs deep emotional context.&lt;/p&gt;

&lt;p&gt;🌿 &lt;strong&gt;Solace (Wellbeing).&lt;/strong&gt; Built on Positive Psychology (PERMA model), Self-Compassion (Kristin Neff), and Mindfulness-Based Stress Reduction. Gentle, unhurried, reflective. For daily emotional check-ins, mood patterns, and self-compassion practices. It notices patterns in your energy, sleep, and stress over time.&lt;/p&gt;

&lt;p&gt;⚡ &lt;strong&gt;Momentum (Growth Coach).&lt;/strong&gt; Built on Motivational Interviewing and Implementation Intentions. Direct, action-oriented, no fluff. For goal tracking, overcoming procrastination, and building momentum. It remembers your commitments across sessions and calls you on it.&lt;/p&gt;

&lt;p&gt;She might process a hard week with Compass on Monday, do a gentle check-in with Solace on Wednesday, and set specific goals with Momentum on Friday. &lt;strong&gt;All three know what happened. No repetition.&lt;/strong&gt; The knowledge graph is shared across all personas.&lt;/p&gt;

&lt;p&gt;These are not toy system prompts. Each persona has carefully designed therapeutic frameworks, response styles, and boundaries. The Compass persona alone references four evidence-based frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Facts vs. Relationships: What Makes Memory Actually Useful
&lt;/h3&gt;

&lt;p&gt;Let me be honest about what other AI tools offer. Gemini and ChatGPT both now have memory features. And they do show you what they remember. But look at what that memory looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqfzmcqor2mt5leld9la.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqfzmcqor2mt5leld9la.png" alt="ChatGPT Memory" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Luciana's favorite color is purple."&lt;br&gt;
"Luciana has a cat."&lt;br&gt;
"Is using Windows with WSL."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A flat list of disconnected facts. No relationships between them. No causality. No timeline. You can delete a memory, but you cannot tell the AI "actually, I left that job in March" and have it update everything connected to that fact.&lt;/p&gt;

&lt;p&gt;Now compare that to what Synapse stores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Work Stress → TRIGGERS → Insomnia
Insomnia → AFFECTS → Relationship with Partner
Therapist → RECOMMENDED → Grounding Techniques
Grounding Techniques → HELPED_WITH → Work Stress (since March)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is not just visibility. It is &lt;strong&gt;structure.&lt;/strong&gt; One stores isolated data points. The other stores a connected model of your life. When she tells the AI "I'm feeling overwhelmed today," a flat memory might recall that she mentioned "overwhelm" three months ago. The knowledge graph knows the causal chain: which project caused the stress, how the stress affected her sleep, and what techniques helped her last time.&lt;/p&gt;
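&lt;p&gt;That difference can be made concrete with a tiny traversal (a toy adjacency list mirroring the example edges above, not Synapse's actual Neo4j schema):&lt;/p&gt;

```python
# Toy adjacency list mirroring the example edges above.
graph = {
    "Work Stress": [("TRIGGERS", "Insomnia")],
    "Insomnia": [("AFFECTS", "Relationship with Partner")],
    "Grounding Techniques": [("HELPED_WITH", "Work Stress")],
}


def causal_chain(graph: dict, start: str) -> str:
    """Follow outgoing edges from `start`, collecting the causal path.
    A flat fact list cannot answer this: it has no edges to follow."""
    chain, node, seen = [start], start, {start}
    while node in graph:
        relation, target = graph[node][0]
        if target in seen:
            break  # avoid cycles
        chain.append(f"-{relation}-> {target}")
        node = target
        seen.add(target)
    return " ".join(chain)


path = causal_chain(graph, "Work Stress")
```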

&lt;p&gt;For personal conversations, especially around mental health, this difference changes everything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5qf9uewk8wpocsittg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5qf9uewk8wpocsittg5.png" alt="Synapse Memory Explorer" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Your Memory Is Yours
&lt;/h3&gt;

&lt;p&gt;Beyond the graph structure, Synapse gives you full control over your AI's memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explore your graph.&lt;/strong&gt; An interactive force-directed visualization where you can click any entity and see its connections, descriptions, and relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correct in plain English.&lt;/strong&gt; Type "I actually left that job in March, not April" and the graph updates. Graphiti handles temporal invalidation. Old facts are marked as outdated, not deleted. The AI knows what is history and what is current.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export to Notion.&lt;/strong&gt; Your full knowledge graph synced to Notion databases. The AI designs the schema based on your actual data (if you talk about health, it creates a Medications database; if you talk about work, it creates a Projects database). Review it in a tool you already know. Flag errors with a checkbox. Push corrections back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully open source.&lt;/strong&gt; Both &lt;a href="https://github.com/juandastic/synapse-chat-ai" rel="noopener noreferrer"&gt;the frontend&lt;/a&gt; and &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;the backend&lt;/a&gt; are on GitHub. You can audit exactly how your data is processed. For something that touches mental health, this is not optional.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsm99ixmlbtxlcqyn8xx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsm99ixmlbtxlcqyn8xx.png" alt=" " width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  297 Messages in 15 Days
&lt;/h2&gt;

&lt;p&gt;Enough about architecture. What happens when a real person uses this thing every day?&lt;/p&gt;

&lt;p&gt;On March 19, 2026, I started tracking product analytics with PostHog. Here is what the first 15 days looked like for my wife, User Zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;297 messages&lt;/strong&gt; sent across 15 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15.2 million tokens&lt;/strong&gt; processed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100% daily active usage.&lt;/strong&gt; She used it every single day. No exceptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak: 69 messages in one day&lt;/strong&gt; (March 22). That is roughly 5 million tokens in a single day.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average: ~20 messages per day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average ~51K tokens per message.&lt;/strong&gt; These are not quick Q&amp;amp;A exchanges. These are deep, contextual conversations where the AI brings in compiled knowledge from weeks of previous sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fseui9itpa9mm3j0hwuyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fseui9itpa9mm3j0hwuyt.png" alt="Messages sent" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oicgr04dbhzd04wqalm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oicgr04dbhzd04wqalm.png" alt="Tokens per day" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not a novelty bounce. She didn't try it and forget. She integrated Synapse into her daily routine. For context, she was already a heavy Gemini user before Synapse existed. The difference is she stopped using Gemini for personal conversations. Synapse became &lt;strong&gt;the de facto daily companion.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Therapy Bridge
&lt;/h3&gt;

&lt;p&gt;I want to be clear about something. Synapse is not replacing her therapist or psychiatrist. It is &lt;strong&gt;bridging the gaps between sessions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Between weekly therapy appointments, life happens. Emotions surface. Patterns repeat. With Synapse, she can process them in real-time with an AI that knows her full context. The Compass persona tracks nervous system states. It remembers which coping strategies worked. It connects this week's anxiety to the same trigger from last month.&lt;/p&gt;

&lt;p&gt;When she goes to her next therapy session, the preliminary exploration is already done. She arrives with clearer language for what she is feeling and why. The session is more productive because she is not spending the first 20 minutes catching her therapist up on context.&lt;/p&gt;

&lt;p&gt;A human therapist sees you for 1 hour per week. Synapse fills the other 167 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synapse is not a replacement for therapy. It is the journal that talks back, with memory.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hfzrfc4trh0dpmljrmj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hfzrfc4trh0dpmljrmj.png" alt=" " width="552" height="526"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building for an Audience of One
&lt;/h2&gt;

&lt;p&gt;My wife is the most demanding and picky user I know. She never holds back from telling me when something I built is useless. But when something is actually useful, she uses it every day. My hack for building things that matter is simple: build for her, and I know I will build something real.&lt;/p&gt;

&lt;p&gt;A few reflections from this journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory is not a feature. It IS the product.&lt;/strong&gt; For mental health and personal conversations, the smartest model in the world is useless if it doesn't know what made you cry last Tuesday. Gemini is brilliant. It can explain quantum physics. But ask it about YOUR life, and it starts from zero every time. The Master Prompt was her workaround. Synapse is the fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparency is not optional for personal AI.&lt;/strong&gt; She can see her graph, correct it, export it to Notion. She knows what the AI "thinks" about her. For something this personal, a black box is not acceptable. Open source is not a nice-to-have. It is a requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building in public works.&lt;/strong&gt; The community saw the 120K-token problem before I did. My first article got 26 comments. Someone signed up for Synapse and warned me about costs within hours of publishing Article #3. I shipped a plan system and a demo account the same day. That feedback loop is priceless.&lt;/p&gt;

&lt;p&gt;One comment from Victor Okefie on my third article stuck with me. About watching my wife ignore the graph visualizer I built for her:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"That's not feature development. That's listening."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the whole philosophy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It, Break It, Build With Me
&lt;/h2&gt;

&lt;p&gt;Synapse is live and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's available today (Free tier):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 messages per day, all three personas&lt;/li&gt;
&lt;li&gt;Knowledge graph visualization and memory corrections&lt;/li&gt;
&lt;li&gt;Full English and Spanish support&lt;/li&gt;
&lt;li&gt;Fully open source, both frontend and backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's coming:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pro tier:&lt;/strong&gt; 50 messages per day, intelligent retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Therapeutic tier:&lt;/strong&gt; Share graph insights with your therapist, crisis detection and alerts, session reminders, therapy homework integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;Try it:&lt;/strong&gt; &lt;a href="https://synapse-chat.juandago.dev/" rel="noopener noreferrer"&gt;synapse-chat.juandago.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;Frontend code:&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-chat-ai" rel="noopener noreferrer"&gt;github.com/juandastic/synapse-chat-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;Backend code:&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;github.com/juandastic/synapse-cortex&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The technical deep-dive series:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e"&gt;Beyond RAG: Building an AI Companion with Deep Memory Using Knowledge Graphs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85"&gt;Scaling AI Memory: How I Tamed a 120K-Token Prompt with Deterministic GraphRAG&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/juandastic/full-circle-giving-my-ais-knowledge-graph-a-notion-interface-using-mcp-2dmp"&gt;Full Circle: Giving My AI's Knowledge Graph a Notion Interface Using MCP&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Building software is fun. But seeing it come alive and solve actual problems for someone you care about is something else entirely.&lt;/p&gt;

&lt;p&gt;Let's connect on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I Benchmarked Graphiti vs Mem0: The Hidden Cost of Context Blindness in AI Memory</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Sun, 22 Mar 2026 07:08:06 +0000</pubDate>
      <link>https://dev.to/juandastic/i-benchmarked-graphiti-vs-mem0-the-hidden-cost-of-context-blindness-in-ai-memory-4le3</link>
      <guid>https://dev.to/juandastic/i-benchmarked-graphiti-vs-mem0-the-hidden-cost-of-context-blindness-in-ai-memory-4le3</guid>
      <description>&lt;p&gt;A few days ago, &lt;a href="https://x.com/taranjeetio" rel="noopener noreferrer"&gt;Taranjeet&lt;/a&gt;, the CEO of Mem0, reacted to one of my articles about building AI memory with knowledge graphs. That caught my attention.&lt;/p&gt;

&lt;p&gt;Mem0 is one of the most popular memory frameworks in the AI space. Thousands of developers use it. And here I was, running a heavier, more expensive architecture with Graphiti and Neo4j for my personal project.&lt;/p&gt;

&lt;p&gt;Was I over-engineering this?&lt;/p&gt;

&lt;p&gt;I had to find out. So I built a benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Context: Why I Care About AI Memory
&lt;/h2&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;Synapse&lt;/a&gt;, an AI companion for my wife. Not a chatbot. A companion that remembers her life, her relationships, her emotional states, and how all of that connects over time.&lt;/p&gt;

&lt;p&gt;It started with a 35,000-token "Master Prompt" that she maintained manually in Notion. Every time something changed in her life, she updated it by hand. That obviously didn't scale. So I moved to &lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt;, a knowledge graph framework that extracts entities and relationships from conversations automatically.&lt;/p&gt;

&lt;p&gt;I wrote about this journey in two previous articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e"&gt;Beyond RAG: Building an AI Companion with Deep Memory Using Knowledge Graphs&lt;/a&gt; (how knowledge graphs replaced the manual prompt)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85"&gt;Scaling AI Memory: How I Tamed a 120K-Token Prompt with Deterministic GraphRAG&lt;/a&gt; (how I kept the prompt under control as the graph grew)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system works well. But when I started looking at Mem0, I realized they solve some of the same problems (fact extraction, deduplication, contradiction handling) with a different architecture. They use a vector store as the primary brain and offer an optional graph layer on top. Fewer LLM calls per ingestion, and a fundamentally different take on how to combine vectors and graphs.&lt;/p&gt;

&lt;p&gt;I wanted to understand both approaches. What does storing everything in one graph give you? What does splitting vectors and graphs into independent stores give you? What do you lose in each case?&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Fundamentally Different Philosophies
&lt;/h2&gt;

&lt;p&gt;Before the benchmark, let me explain what each system actually does under the hood. They both ingest conversations and store memories. But the architecture is completely different.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graphiti: The Unified Graph
&lt;/h3&gt;

&lt;p&gt;Graphiti puts everything in one place: a Neo4j graph database. Entities become nodes. Facts become edges. Embeddings live as properties on those nodes and edges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsml56d8mpzoxjxz7fw7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsml56d8mpzoxjxz7fw7k.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key detail: each edge carries a full natural-language fact, plus temporal fields. When a fact becomes outdated, Graphiti doesn't delete it. It marks it with an &lt;code&gt;invalid_at&lt;/code&gt; timestamp and creates the new fact alongside it.&lt;/p&gt;
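A minimal sketch of that invalidation pattern (the `valid_at`/`invalid_at` field names follow Graphiti's convention, but `supersede()` and the data are illustrative, not Graphiti's API):

```python
from datetime import datetime, timezone

# When a fact changes, keep the old edge and stamp it with invalid_at,
# then create the new fact alongside it. Nothing is deleted.
def supersede(edges, old_fact, new_fact, now=None):
    now = now or datetime.now(timezone.utc)
    for edge in edges:
        if edge["fact"] == old_fact and edge["invalid_at"] is None:
            edge["invalid_at"] = now  # mark outdated, but preserve as history
    edges.append({"fact": new_fact, "valid_at": now, "invalid_at": None})

edges = [{"fact": "Demy trains at Roots MMA",
          "valid_at": datetime(2025, 6, 1, tzinfo=timezone.utc),
          "invalid_at": None}]
supersede(edges, "Demy trains at Roots MMA", "Demy trains at Iron Flow")

current = [e["fact"] for e in edges if e["invalid_at"] is None]
# current == ['Demy trains at Iron Flow']; the Roots MMA edge survives as history
```

The payoff comes at retrieval time: an LLM shown both edges can distinguish "trains at Iron Flow" (current) from "trained at Roots MMA" (outdated) instead of seeing two contradictory facts.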

&lt;h3&gt;
  
  
  Mem0: The Split Architecture
&lt;/h3&gt;

&lt;p&gt;Mem0 takes a different approach. The primary brain is a vector store (Qdrant, Pinecone, etc.) holding atomic fact strings. It has an optional graph (Neo4j), but it runs as a completely independent parallel system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3ohkti1mhkk0eh5qyrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3ohkti1mhkk0eh5qyrh.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The vector store holds rich text. The graph holds thin triples: &lt;code&gt;entity -&amp;gt; relationship_type -&amp;gt; entity&lt;/code&gt;. No natural-language facts on edges. No temporal fields. And critically: &lt;strong&gt;the two stores share no IDs and run independently&lt;/strong&gt;. They can drift out of sync.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Actually Stored on an Edge
&lt;/h3&gt;

&lt;p&gt;This is the single most important difference. Let me show it concretely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graphiti edge&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Demy"&lt;/span&gt;
&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maplewood"&lt;/span&gt;
&lt;span class="na"&gt;relation_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WORKS_AT"&lt;/span&gt;
&lt;span class="na"&gt;fact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Demy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;started&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;working&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;startup&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Maplewood&lt;/span&gt;
       &lt;span class="s"&gt;doing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;full-stack&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;work,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;just&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;backend"&lt;/span&gt;
&lt;span class="na"&gt;valid_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-02-15&lt;/span&gt;
&lt;span class="na"&gt;invalid_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="na"&gt;embedding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;0.012&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;-0.034&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mem0 graph edge&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;source:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"demy"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;target:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"maplewood"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;relationship:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WORKS_AT"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;valid:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;mentions:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Graphiti stores the &lt;strong&gt;full story&lt;/strong&gt; on every edge. Mem0 stores the &lt;strong&gt;label&lt;/strong&gt; on the graph edge and puts the text in the vector store as a separate entry. For retrieval this means: Graphiti can give you structure AND semantics in one query. Mem0 needs two separate lookups and hopes they align.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Aha!" Moment: Context Blindness
&lt;/h2&gt;

&lt;p&gt;Before I show you the benchmark results, I need to explain the insight that made this comparison matter to me. Because the results only make sense once you understand what "context blindness" means in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Pure RAG
&lt;/h3&gt;

&lt;p&gt;Most AI memory systems work like this: user asks something, you do a similarity search, you inject the top-K results into the prompt. Simple and effective.&lt;/p&gt;

&lt;p&gt;But there's a hidden cost. The LLM only sees what the similarity search returns. If the user asks about work, and the search returns work facts, the model has no idea about the emotional context from childhood that might be relevant. It's blind to everything outside the search window.&lt;/p&gt;

&lt;p&gt;I call this &lt;strong&gt;context blindness&lt;/strong&gt;: the LLM's intelligence is limited by the narrow slice of memory that semantic similarity surfaces for each turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters for a Companion
&lt;/h3&gt;

&lt;p&gt;Modern models are incredible at reasoning over large contexts. Give them 50k tokens of well-organized information about a person's life, and they make connections you didn't explicitly ask for. They notice patterns. They bring up relevant history naturally.&lt;/p&gt;

&lt;p&gt;But you can't give them everything. That's expensive and noisy. So the question becomes: &lt;strong&gt;how do you decide what the model should always know vs what it should retrieve on demand?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Synapse Approach: Base Context + RAG for the Long Tail
&lt;/h3&gt;

&lt;p&gt;This is the architecture I built for Synapse, which I call &lt;a href="https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85"&gt;Hydration V2&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Base Context&lt;/strong&gt;: A budget-aware prompt (~30k tokens) that always includes the most important entities. I use the graph structure, specifically node degree (how many connections an entity has), to find the "hubs" of her life. Elena (mom), Noa (partner), Marco (tech lead). These always go in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG for Long Tail&lt;/strong&gt;: Similarity search only kicks in for specific details that don't fit in the base context. And here's the trick: I track exactly which facts are already in the base prompt.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The metadata contract. Cortex sends this on every request
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compilationMetadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_partial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;included_node_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uuid-elena&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uuid-noa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uuid-marco&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;included_edge_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uuid-works-at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uuid-diagnosed-with&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When RAG retrieves results, I cross-reference against this list and &lt;strong&gt;drop any facts already in context&lt;/strong&gt;. No duplication. No wasted tokens.&lt;/p&gt;
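A minimal sketch of that cross-referencing step, assuming the metadata contract shown above (the function and field names are illustrative, not Synapse's actual API):

```python
# Drop RAG hits whose edge UUID is already in the compiled base context.
# "included_edge_ids" mirrors the metadata contract; everything else is a stand-in.
def filter_new_facts(rag_hits, compilation_metadata):
    in_context = set(compilation_metadata["included_edge_ids"])
    return [hit for hit in rag_hits if hit["edge_id"] not in in_context]

metadata = {"is_partial": True,
            "included_edge_ids": ["uuid-works-at", "uuid-diagnosed-with"]}
hits = [
    {"edge_id": "uuid-works-at", "fact": "Demy works at Maplewood"},  # already in context
    {"edge_id": "uuid-belt-purple", "fact": "Demy was promoted to purple belt"},
]
remaining = filter_new_facts(hits, metadata)
# remaining contains only the purple-belt fact; the Maplewood fact is dropped
```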

&lt;h3&gt;
  
  
  Why This Only Works with Co-located Semantics
&lt;/h3&gt;

&lt;p&gt;Here's the thing: this metadata contract requires that nodes and edges live in the same store with shared IDs. I go from "Elena has high degree" to "here are Elena's facts" in one database query.&lt;/p&gt;
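A toy illustration of what co-location buys (all data and helpers here are hypothetical, not Synapse or Mem0 code):

```python
# Co-located: one node ID carries structure (degree) AND semantics (facts).
graph = {
    "uuid-elena": {
        "name": "Elena",
        "degree": 14,  # hub entity: many connections
        "facts": ["Elena is Demy's mother",
                  "Elena enrolled Demy in swimming lessons"],
    }
}

def hydrate(node_id):
    """One lookup returns importance and the facts to put in the prompt."""
    node = graph[node_id]
    return {"name": node["name"], "degree": node["degree"], "facts": node["facts"]}

# Split: the graph knows Elena matters, but her facts live in a separate
# store under unrelated IDs, reachable only by a second similarity search.
split_graph = {"elena": {"degree": 14}}
vector_store = {"mem-481": "Elena is Demy's mother"}  # no shared ID with "elena"

elena = hydrate("uuid-elena")
# elena carries degree 14 and both facts from a single lookup
```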

&lt;p&gt;With Mem0's split architecture, this is impossible. The graph knows Elena is important (she has many connections). But Elena's actual facts live in the vector store under different IDs. There's no direct link between the graph entity "elena" and the vector memories about Elena. You'd need to search the vector store by text similarity to find Elena-related facts. Which is exactly the context blindness problem you're trying to avoid.&lt;/p&gt;

&lt;p&gt;Could you build a mapping table between vector IDs and graph entities? Sure. But at that point you're building a co-location layer on top of a split architecture. You're rebuilding what Graphiti gives you for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark: What I Actually Tested
&lt;/h2&gt;

&lt;p&gt;I built a 3-phase benchmark using a fictional user profile (Demy) with complex life situations: an ASD diagnosis, workplace dynamics, BJJ training, family trauma, and relationship changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: Synapse doesn't use advanced graph features like BFS traversal or multi-hop queries. It does hybrid search: BM25 + cosine similarity + RRF reranking. So this benchmark doesn't test "graph retrieval" in the academic sense. It tests something more practical: &lt;strong&gt;what you gain or lose in retrieval quality when semantic context and graph entities live together vs apart&lt;/strong&gt;.&lt;/p&gt;
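For reference, the RRF reranking step mentioned above fits in a few lines (k=60 is the constant commonly used in the RRF literature; the ranked lists are toy data, not benchmark output):

```python
# Reciprocal Rank Fusion: each result list votes 1/(k + rank) for its hits,
# so documents that rank well across multiple lists rise to the top.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["fact-marco-feedback", "fact-bjj-belt", "fact-elena-mom"]
cosine_hits = ["fact-marco-feedback", "fact-elena-mom", "fact-asd-diagnosis"]
fused = rrf([bm25_hits, cosine_hits])
# fused[0] == "fact-marco-feedback": top-ranked in both lists, so it wins the fusion
```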

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;Both systems got the exact same data, same LLM (gpt-4.1-mini), same embedding model (text-embedding-3-small). Graphiti searched with the same &lt;code&gt;SearchConfig&lt;/code&gt; that Cortex uses in production (edge + node hybrid with RRF). Mem0 searched with both vector memories AND graph relations in parallel.&lt;/p&gt;

&lt;p&gt;Every phase ended with a &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; assessment that scored both systems on relevant dimensions (1-5 scale).&lt;/p&gt;

&lt;p&gt;The full benchmark is &lt;a href="https://github.com/juandastic/graphiti-vs-mem0-benchmark" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. You can run it yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Knowledge Extraction
&lt;/h3&gt;

&lt;p&gt;Four conversations ingested: an ASD Level 1 diagnosis, workplace feedback from tech lead Marco, a BJJ blue belt promotion, and childhood memories with mother Elena.&lt;/p&gt;

&lt;p&gt;Then I ran 5 knowledge probes: factual, relational, event-based, emotional, and workplace queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Contradiction Handling
&lt;/h3&gt;

&lt;p&gt;Six facts changed: new job (Maplewood startup), belt upgrade (blue to purple), gym switch (Roots MMA to Iron Flow), breakup (Noa), role change (backend to full-stack), and new pet (Pixel the cat).&lt;/p&gt;

&lt;p&gt;Both systems were probed before and after the updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Story Retention
&lt;/h3&gt;

&lt;p&gt;A rich 14-message narrative about a traumatic childhood event called "the forest event." A camping trip with family, sensory overload at a campfire, going nonverbal, the mother's reaction, a fight between parents that led to their divorce, and 20 years of guilt. Sensory triggers. EMDR therapy plans.&lt;/p&gt;

&lt;p&gt;This was the hardest test. Can atomic fact extraction preserve a story's connective tissue?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cost: Mem0 Wins
&lt;/h3&gt;

&lt;p&gt;No surprise here. Graphiti's richer pipeline costs more.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Graphiti&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phase 1 (4 sessions)&lt;/td&gt;
&lt;td&gt;34,632 tokens&lt;/td&gt;
&lt;td&gt;25,394 tokens&lt;/td&gt;
&lt;td&gt;1.36x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 2 (2 sessions)&lt;/td&gt;
&lt;td&gt;25,601&lt;/td&gt;
&lt;td&gt;14,532&lt;/td&gt;
&lt;td&gt;1.76x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 3 (1 session, 14 msgs)&lt;/td&gt;
&lt;td&gt;26,900&lt;/td&gt;
&lt;td&gt;11,936&lt;/td&gt;
&lt;td&gt;2.25x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;87,133&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51,862&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.68x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ratio increases with narrative complexity. Phase 3's single story session cost 2.25x more with Graphiti, driven by its entity deduplication pipeline checking each new edge against the entire existing graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Knowledge Coverage
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Graphiti&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fact completeness&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entity relations&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specificity&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrievability&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.25&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Graphiti won 4 of 5 probes. Its entity summaries added context that Mem0 lacked. Marco's entity node included the specific date of the 1-on-1 and the feedback details, making retrieval sharper.&lt;/p&gt;

&lt;p&gt;But two problems showed up in Mem0's results that I didn't expect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Top-K crowding.&lt;/strong&gt; When I asked "What feedback did Marco give Demy?", Mem0's vector search returned childhood memories about Elena alongside the Marco results. The emotional weight of those embeddings dominated the similarity rankings and pushed relevant results down. The graph relations were even worse, returning &lt;code&gt;elena → enrolled → demy&lt;/code&gt; and &lt;code&gt;elena → is_mom_of → demy&lt;/code&gt; for a workplace query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: Graph retrieval noise.&lt;/strong&gt; Mem0's graph search returns structural neighbors without semantic awareness. It doesn't know that Elena triples are irrelevant to a Marco query. It just returns whatever is connected. This happened in 3 of 5 probes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Contradictions (The Split-Brain Problem)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Graphiti&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Temporal handling&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Current fact retrieval&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Additive facts&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Historical awareness&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Graphiti's temporal invalidation worked as expected. When I searched for "Who is Demy's partner?" after the breakup, the old Noa edge appeared clearly marked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User is processing their neurodivergent experience
with the support of Noa. [OUTDATED]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An LLM reading this knows Noa is history, not present. It can say "I remember Noa" without confusing past and present.&lt;/p&gt;

&lt;p&gt;Mem0 had a different problem. After the purple belt update, both facts appeared as equally current:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Got a purple belt last month in martial arts
&lt;span class="p"&gt;-&lt;/span&gt; Got promoted to blue belt at Roots MMA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No way to tell which is current. Both just exist side by side.&lt;/p&gt;

&lt;p&gt;But the most interesting finding was the &lt;strong&gt;split-brain&lt;/strong&gt;. When Demy switched gyms from Roots MMA to Iron Flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mem0's &lt;strong&gt;graph&lt;/strong&gt; correctly updated: &lt;code&gt;demy → trains_at → iron_flow_gym&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Mem0's &lt;strong&gt;vector store&lt;/strong&gt; still prominently featured: "Feels physically exhausted but mentally regulated after training" (from the Roots MMA era)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The two independent stores drifted out of sync. The graph knew one thing, the vectors said another. This is an architectural consequence, not a bug. The two stores process the same messages independently with no cross-referencing.&lt;/p&gt;

&lt;p&gt;Both systems handled purely additive facts well. Pixel the cat was correctly stored by both. Graphiti even caught a secondary effect: the improved relationship with Rodrigo who helped pick out the cat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Story Retention (The Surprise)
&lt;/h3&gt;

&lt;p&gt;This is where it got interesting. I expected Graphiti to dominate again. It didn't.&lt;/p&gt;

&lt;p&gt;Graphiti extracted 16 story-related edges. Clean entity connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; [Elena -&amp;gt; Tomas] Elena and Tomas had a major fight
  during the camping trip, leading to...
&lt;span class="p"&gt;-&lt;/span&gt; [Tomas -&amp;gt; Elena] Tomas was married to Elena until
  their separation about a year after the forest event [OUTDATED]
&lt;span class="p"&gt;-&lt;/span&gt; [User -&amp;gt; Dr. Vega] User is being treated by Dr. Vega
  who helps them understand the forest event trauma
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mem0 extracted 12 story-related memories. Different kind of detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Has sensory triggers related to the event: smell of
  wood smoke, sound of running water, someone screaming a name
- Carried guilt for nearly 20 years believing the event
  caused parents' separation
- Experienced sensory overload on the second night due to
  noise, smoke, and flickering light
- Experienced a recent trigger in a park when someone
  yelled a name loudly, causing them to freeze
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern was clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Graphiti captured the causal structure&lt;/strong&gt;: who did what to whom, what led to what, entity connections. The skeleton of the story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mem0 captured the lived experience&lt;/strong&gt;: sensory triggers, emotional weight, the 20-year guilt, the specific park incident. The flesh of the story.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I asked "What are Demy's sensory triggers?", Graphiti returned generic references to the forest event. Mem0 returned the exact three triggers: wood smoke, running water, someone screaming a name.&lt;/p&gt;

&lt;p&gt;When I asked "Why did Demy's parents separate?", Graphiti returned the direct causal chain: fight during camping → separation a year later. Mem0 returned the emotional aftermath but with weaker causation.&lt;/p&gt;

&lt;p&gt;For a companion that needs to both understand the story structure AND respond with emotional awareness, neither system alone was complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mem0 is genuinely good at vector retrieval
&lt;/h3&gt;

&lt;p&gt;Looking at the data fairly, Mem0's atomic fact extraction produces high-quality, well-crafted memories. "Feels anger about not knowing earlier, which might have prevented burnout." "At a cousin's birthday party, hide in the bathroom for 45 minutes due to the loud noise." These are clean, specific, and individually useful.&lt;/p&gt;

&lt;p&gt;For a standard RAG pipeline (similarity search against a query, inject top results), Mem0's memories are arguably better optimized than Graphiti's edge facts, which are structured around entity pairs rather than standalone readability.&lt;/p&gt;

&lt;h3&gt;
  
  
  But vector retrieval alone creates blind spots
&lt;/h3&gt;

&lt;p&gt;The top-k crowding problem is real. When all your memories are independent vectors with no structural awareness, emotionally heavy content dominates similarity rankings. Childhood trauma bleeds into workplace queries. The system has no way to say "these facts are about Elena, those are about Marco" without relying entirely on embedding distance.&lt;/p&gt;

&lt;p&gt;This is what I mean by context blindness. The LLM only sees what similarity search surfaces. And similarity search doesn't understand life categories. It understands embedding proximity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Co-located semantics are the key differentiator
&lt;/h3&gt;

&lt;p&gt;The practical advantage of Graphiti isn't graph traversal (I don't use it). It's that entities and their facts live together. This enables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Knowing what matters&lt;/strong&gt;: node degree tells you Elena is a hub entity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Getting the full picture&lt;/strong&gt;: one query returns both the entity summary and all its facts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracking what's in context&lt;/strong&gt;: the metadata contract prevents duplicate retrieval&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Mem0, you can know Elena is structurally important (the graph tells you). But getting Elena's rich facts requires a separate vector search, and that search might return non-Elena results based on embedding similarity. The two stores don't talk to each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real architecture is base context + selective RAG
&lt;/h3&gt;

&lt;p&gt;After running this benchmark, I'm more convinced than ever: the future of AI memory isn't "retrieve everything via similarity search." It's:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-load the important stuff&lt;/strong&gt;: use the graph structure to identify key entities, put their facts in the base context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use RAG for the long tail&lt;/strong&gt;: specific memories, niche details, historical events that don't fit in the budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track what's already in context&lt;/strong&gt;: so RAG doesn't waste tokens re-retrieving facts the model already knows&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This way, the model always has the structural backbone of the user's life. RAG extends it when needed. Latency stays low for the common case. And you avoid the top-k crowding problem because the important entities aren't competing in similarity search. They're already in context.&lt;/p&gt;
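A minimal sketch of step 1, the budget-aware pre-load (degree ranking as described above; the word-count token estimate and all data are crude stand-ins):

```python
# Rank entities by graph degree and pack their facts into the base context
# until the token budget runs out.
def build_base_context(entities, budget_tokens):
    context, used = [], 0
    for entity in sorted(entities, key=lambda e: e["degree"], reverse=True):
        cost = sum(len(fact.split()) for fact in entity["facts"])  # crude estimate
        if used + cost > budget_tokens:
            break  # everything past here is left to RAG (the long tail)
        context.extend(entity["facts"])
        used += cost
    return context

entities = [
    {"name": "Elena", "degree": 14, "facts": ["Elena is Demy's mother."]},
    {"name": "Marco", "degree": 9, "facts": ["Marco is Demy's tech lead."]},
    {"name": "Pixel", "degree": 2, "facts": ["Pixel is Demy's cat, adopted recently."]},
]
base = build_base_context(entities, budget_tokens=10)
# base holds Elena's and Marco's facts; Pixel falls to the long tail
```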

&lt;p&gt;I won't go into details today, but coding agents seem to do something similar: AGENTS.md is always in context alongside the tool definitions, while skills search and code discovery handle the long tail on demand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mem0 is the right choice for most AI agents.&lt;/strong&gt; If you need a reliable, mutable memory system with great fact extraction and you're doing standard similarity search, Mem0 is simpler, cheaper (40% fewer tokens), and well-maintained. For 90% of agents, the split architecture doesn't matter because you're not building base context from graph structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graphiti is worth the cost for deeply interconnected companions.&lt;/strong&gt; If you need to build a structural understanding of someone's life, know which entities are central, pre-load their context, track what's already known, and handle temporal evolution, Graphiti's unified architecture pays for itself. The extra tokens buy you co-located semantics that enable strategies Mem0's split stores can't support.&lt;/p&gt;

&lt;p&gt;The hidden cost of context blindness isn't in the retrieval scores. It's in the connections the model never makes because the right context wasn't there.&lt;/p&gt;




&lt;p&gt;The full benchmark (scripts, seed data, results, and the technical report) is &lt;a href="https://github.com/juandastic/graphiti-vs-mem0-benchmark" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. You can run it yourself, swap models, and see if your results match mine.&lt;/p&gt;

&lt;p&gt;If you're building memory systems for AI agents, I'd love to hear how you approach this. What's working for you? What breaks at scale?&lt;/p&gt;

&lt;p&gt;Find me on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Full Circle: Giving My AI's Knowledge Graph a Notion Interface using MCP</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Tue, 17 Mar 2026 06:06:10 +0000</pubDate>
      <link>https://dev.to/juandastic/full-circle-giving-my-ais-knowledge-graph-a-notion-interface-using-mcp-2dmp</link>
      <guid>https://dev.to/juandastic/full-circle-giving-my-ais-knowledge-graph-a-notion-interface-using-mcp-2dmp</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/notion-2026-03-04"&gt;Notion MCP Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I started building AI tools for my wife, it was because she had outgrown Notion. &lt;/p&gt;

&lt;p&gt;She uses LLMs as a life coach, therapist, and sounding board. To give the AI context, she maintained a massive 35,000-token "Master Prompt" in a Notion page detailing her life, medical history, and goals. She had to manually copy-paste this wall of text into every new chat. &lt;/p&gt;

&lt;p&gt;To automate this, I built &lt;a href="https://synapse-chat.juandago.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Synapse&lt;/strong&gt;&lt;/a&gt;, a system that replaces that manual prompt with a Temporal Knowledge Graph (Neo4j + Graphiti). As she chats, the AI quietly extracts entities and relationships in the background, building a continuous memory.&lt;/p&gt;

&lt;p&gt;It worked perfectly. But then I hit a UX wall.&lt;/p&gt;

&lt;p&gt;I built a visualizer of the actual knowledge graph so she could explore her AI's memory. I thought it was beautiful. To me, it was fascinating to watch the graph grow and see new connections form over time. But to her, it was just overwhelming. The sheer amount of nodes and floating edges was too much to process, so she ended up completely ignoring that section of the app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frajk3td11fe6mkekzasb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frajk3td11fe6mkekzasb.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It turns out that while the &lt;em&gt;concept&lt;/em&gt; of a graph is great for understanding relationships, navigating a massive raw graph view is for machines, not humans. She missed Notion. She missed structured tables, clear properties, and the simple ability to just click and type to fix a mistake.&lt;/p&gt;

&lt;p&gt;So, I brought the project full circle. I used the new Notion MCP to turn Notion back into the ultimate Human-Machine interface for her AI's brain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built a &lt;strong&gt;bidirectional, human-in-the-loop sync&lt;/strong&gt; between a Neo4j Knowledge Graph and Notion.&lt;/p&gt;

&lt;p&gt;This isn't just a one-way "AI appending text to a page" script. It is a dynamic two-way pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysapyhjh7zqw8zkg9ny2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysapyhjh7zqw8zkg9ny2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Export (AI Designs the UI):&lt;/strong&gt; Instead of using hardcoded Notion templates, Synapse compiles the user's graph and asks Gemini to &lt;em&gt;design&lt;/em&gt; a custom database schema. If the user talks a lot about their health, the AI creates a "Medications" database with "Active/Suspended" select tags. If they talk about code, it creates a "Projects" database with tech stacks. No two exports look the same.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Import (Human-in-the-Loop):&lt;/strong&gt; AI memory systems hallucinate. To fix this, every AI-generated Notion database gets a &lt;code&gt;Needs Review&lt;/code&gt; checkbox and a &lt;code&gt;Correction Notes&lt;/code&gt; column. If the AI misunderstood something, my wife just checks the box, types the correction in Notion, and hits sync. The system updates the Knowledge Graph (invalidating the old facts) and automatically patches the Notion row.&lt;/li&gt;
&lt;/ol&gt;
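&lt;p&gt;To make the export phase concrete, here is a minimal sketch of turning an LLM-designed schema into a &lt;code&gt;databases.create&lt;/code&gt; payload for the Notion API, with the review columns bolted onto every database. The schema spec shape and function name are hypothetical illustrations, not the actual Synapse code; the payload shape follows the public Notion API.&lt;/p&gt;

```python
def build_database_payload(parent_page_id: str, schema: dict) -> dict:
    """Translate a hypothetical LLM-designed schema spec into a Notion
    databases.create request body."""
    # Every Notion database needs exactly one title property.
    properties = {"Name": {"title": {}}}
    for name, spec in schema["properties"].items():
        if spec["type"] == "select":
            properties[name] = {
                "select": {"options": [{"name": opt} for opt in spec["options"]]}
            }
        elif spec["type"] == "checkbox":
            properties[name] = {"checkbox": {}}
        else:
            properties[name] = {"rich_text": {}}
    # Human-in-the-loop columns added to every exported database.
    properties["Needs Review"] = {"checkbox": {}}
    properties["Correction Notes"] = {"rich_text": {}}
    return {
        "parent": {"type": "page_id", "page_id": parent_page_id},
        "title": [{"type": "text", "text": {"content": schema["title"]}}],
        "properties": properties,
    }
```

&lt;p&gt;The payload would then be passed to the Notion SDK's database-creation call; only the schema design itself comes from the LLM.&lt;/p&gt;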

&lt;h2&gt;
  
  
  Video Demo
&lt;/h2&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/AXeioxrrht0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  Show us the code
&lt;/h2&gt;

&lt;p&gt;The entire architecture is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Backend (Synapse Cortex):&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;https://github.com/juandastic/synapse-cortex&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Frontend (Synapse Chat):&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-chat-ai" rel="noopener noreferrer"&gt;https://github.com/juandastic/synapse-chat-ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/juandastic" rel="noopener noreferrer"&gt;
        juandastic
      &lt;/a&gt; / &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;
        synapse-cortex
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Synapse Cortex&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Cognitive backend for the Synapse AI Chat application&lt;/strong&gt;. A stateless REST API that processes conversational data into a dynamic knowledge graph, enabling personalized long-term memory and intelligent context retrieval for AI assistants.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;📋 Table of Contents&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#overview" rel="noopener noreferrer"&gt;Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#core-features" rel="noopener noreferrer"&gt;Core Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#technical-architecture" rel="noopener noreferrer"&gt;Technical Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#backend-components" rel="noopener noreferrer"&gt;Backend Components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#api-endpoints" rel="noopener noreferrer"&gt;API Endpoints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#data-flow" rel="noopener noreferrer"&gt;Data Flow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#technology-stack" rel="noopener noreferrer"&gt;Technology Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#observability-in-axiom" rel="noopener noreferrer"&gt;Observability in Axiom&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#notion-export" rel="noopener noreferrer"&gt;Notion Export&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#notion-correction-import" rel="noopener noreferrer"&gt;Notion Correction Import&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#setup--deployment" rel="noopener noreferrer"&gt;Setup &amp;amp; Deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/juandastic/synapse-cortex#demo-user-seeding" rel="noopener noreferrer"&gt;Demo User Seeding&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Overview&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Synapse Cortex is a &lt;strong&gt;knowledge graph-powered backend&lt;/strong&gt; designed to give AI chat applications long-term memory capabilities. Instead of treating each conversation in isolation, Synapse Cortex:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingests&lt;/strong&gt; conversational data from chat sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extracts&lt;/strong&gt; entities, relationships, and facts using LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stores&lt;/strong&gt; them in a temporal knowledge graph (Neo4j)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieves&lt;/strong&gt; relevant context for future conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualizes&lt;/strong&gt; the knowledge graph for user exploration and debugging&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The system is built on &lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt;, a temporal knowledge graph framework that handles entity resolution, relationship extraction, and temporal invalidation of…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;However, to see the actual backend code that implements the Notion integration, you can check the &lt;a href="https://github.com/juandastic/synapse-cortex/commit/7e61868c8e27b180f0e83ea18e13784d5266a8ab" rel="noopener noreferrer"&gt;Export feature commit&lt;/a&gt; and the &lt;a href="https://github.com/juandastic/synapse-cortex/commit/66cee1fdd48a55136af3b0e503b523e724e1e831" rel="noopener noreferrer"&gt;Correction commit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/juandastic/synapse-chat-ai/commit/926d68050807d64bd3d5fecba9082d7bf199345e" rel="noopener noreferrer"&gt;UI work&lt;/a&gt; here was minimal, since Notion itself becomes the actual UI. I added a simple interface to set the Notion config (for simplicity, I did not implement a full OAuth flow) and to trigger exports and correction syncs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Notion MCP
&lt;/h2&gt;

&lt;p&gt;Integrating AI with rigid APIs is usually a nightmare of mapping schemas, formatting JSON, and handling edge cases. MCP fundamentally changes this. I no longer write rigid ETL pipelines; I just give tools to reasoning engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. SDK for Structure, MCP for Intelligence
&lt;/h3&gt;

&lt;p&gt;I split my architecture into two phases. &lt;/p&gt;

&lt;p&gt;First, I use the standard Notion SDK to create the empty databases. This is a rigid, structural operation. &lt;/p&gt;

&lt;p&gt;Second, I use the &lt;code&gt;@notionhq/notion-mcp-server&lt;/code&gt; combined with &lt;strong&gt;LangGraph&lt;/strong&gt; (a ReAct agent) to actually populate the data and process corrections. &lt;/p&gt;

&lt;p&gt;When a row is flagged for correction, I don't write complex if/else logic to figure out how to update Notion. I just pass the user's correction and the updated graph data to the LangGraph agent equipped with the Notion MCP tools. &lt;strong&gt;The agent autonomously decides&lt;/strong&gt; whether to use &lt;code&gt;API-patch-page&lt;/code&gt; (to update the specific properties) or &lt;code&gt;API-delete-block&lt;/code&gt; (if the fact is completely invalidated and the row should be archived). &lt;/p&gt;
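&lt;p&gt;The instruction handed to that agent can stay simple, because the decision-making lives in the agent rather than in my code. A hedged sketch of assembling it (the function and field names here are illustrative, not the real Synapse code):&lt;/p&gt;

```python
def build_correction_prompt(page_id: str, correction_note: str,
                            updated_facts: list) -> str:
    """Assemble the instruction given to the LangGraph agent for one flagged row."""
    facts = "\n".join(f"- {fact}" for fact in updated_facts)
    return (
        f"The user flagged Notion page {page_id} for correction.\n"
        f"Correction note: {correction_note}\n"
        f"Updated facts from the knowledge graph:\n{facts}\n"
        "Use API-patch-page to fix the row's properties, or API-delete-block "
        "if the fact is fully invalidated and the row should be archived."
    )
```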

&lt;h3&gt;
  
  
  2. The Engineering Deep Dive: Node.js in a Python World
&lt;/h3&gt;

&lt;p&gt;My backend is written in Python (FastAPI). The official Notion MCP server is written in Node.js. &lt;/p&gt;

&lt;p&gt;Because Synapse is a multi-tenant system (each user has their own independent Notion OAuth token), I couldn't just leave a single global MCP server running. I needed a way to securely isolate connections and ensure low latency between my Python agent and the MCP tools.&lt;/p&gt;

&lt;p&gt;I decided to run the official Node.js MCP server as a &lt;strong&gt;subprocess (&lt;code&gt;stdio&lt;/code&gt;)&lt;/strong&gt; directly inside my FastAPI backend. &lt;/p&gt;

&lt;p&gt;This created some fun lifecycle management challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Docker adjustments:&lt;/strong&gt; I had to modify my Python &lt;code&gt;Dockerfile&lt;/code&gt; to install Node.js so the environment could execute &lt;code&gt;npx @notionhq/notion-mcp-server&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Context Management:&lt;/strong&gt; I built an asynchronous context manager (&lt;code&gt;_NotionAgentContext&lt;/code&gt;) in Python. When an export or correction job starts, it spins up the Node subprocess, passes the specific user's &lt;code&gt;NOTION_TOKEN&lt;/code&gt; securely via environment variables, initializes the LangGraph agent, processes the batches of data, and gracefully shuts down the subprocess when the job is done.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_NotionAgentContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Start the Node.js MCP subprocess via stdio
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_stdio_cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stdio_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_stdio_cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Initialize session and load Notion tools
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__aenter__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;load_mcp_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Return a LangGraph autonomous agent equipped with Notion MCP
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__aexit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_tb&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Gracefully shut down the subprocess to prevent zombie Node processes
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__aexit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_tb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_stdio_cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__aexit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_tb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By running it via &lt;code&gt;stdio&lt;/code&gt; instead of SSE, the communication between the LangGraph reasoning loop and the Notion MCP server is lightning fast, localized, and securely scoped to the current user's job.&lt;/p&gt;

&lt;p&gt;Notion MCP allowed me to stop writing fragile API wrappers and focus on what actually matters: building a system that lets a human seamlessly collaborate with their AI's memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project has been incredibly rewarding. My wife absolutely loves the result; she finally has her AI's brain in a format she can actually read, organize, and correct without feeling overwhelmed. &lt;/p&gt;

&lt;p&gt;I also have to acknowledge that this Notion MCP Challenge was perfectly timed. I already knew my graph visualizer wasn't working for her, but this contest provided the exact motivation and the right technology (MCP) to bring this bidirectional integration to life. It’s a great feeling when a new tool perfectly aligns with a real-world problem you are trying to solve.&lt;/p&gt;

&lt;p&gt;If you are curious about the rest of the Synapse architecture—like why I chose Knowledge Graphs over standard Vector RAG, or how I handled the backend scaling challenges of processing massive context windows—you can check out my previous articles on my DEV profile.&lt;/p&gt;

&lt;p&gt;Synapse is live at &lt;a href="https://synapse-chat.juandago.dev/" rel="noopener noreferrer"&gt;https://synapse-chat.juandago.dev/&lt;/a&gt; if you want to check it out.&lt;/p&gt;

&lt;p&gt;Building software is fun, but seeing it come alive and solve actual problems for the people you care about is magical. &lt;/p&gt;

&lt;p&gt;I'd love to hear your thoughts on this approach or how you are using MCP in your own projects. Let's continue the conversation on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or connect on &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>notionchallenge</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Scaling AI Memory: How I Tamed a 120k-Token Prompt with Deterministic GraphRAG</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Sun, 01 Mar 2026 08:59:50 +0000</pubDate>
      <link>https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85</link>
      <guid>https://dev.to/juandastic/scaling-ai-memory-how-i-tamed-a-120k-token-prompt-with-deterministic-graphrag-4f85</guid>
      <description>&lt;p&gt;In a past article, I wrote about &lt;strong&gt;Synapse&lt;/strong&gt;, an &lt;a href="https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e"&gt;AI companion I built for my wife&lt;/a&gt;. To solve the problem of an LLM forgetting her past, I bypassed standard vector RAG entirely. Instead, I used a Knowledge Graph (via Graphiti and Neo4j) to map her life, compiled the &lt;em&gt;entire&lt;/em&gt; graph into text, and injected it straight into Gemini's massive context window.&lt;/p&gt;

&lt;p&gt;It worked beautifully. Until it didn't. &lt;/p&gt;

&lt;p&gt;When you build a prototype, you test it with a few messages. When your wife is the power user, she builds an entire world. By day 21 of her using the app daily for deep sessions, the system hit a wall. &lt;/p&gt;

&lt;p&gt;Here is the raw data of her input tokens per message over 18 days:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbpy2jr8nsx1vkk42jpo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbpy2jr8nsx1vkk42jpo.png" alt=" " width="700" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;She was sending &lt;strong&gt;over 120,000 tokens&lt;/strong&gt; of system context on &lt;em&gt;every single chat turn&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Gemini handled it. Modern context windows are incredible, but the reality of production kicked in. My API costs were climbing, Convex bandwidth was getting chewed up storing and moving massive payloads, and latency was increasing. &lt;/p&gt;

&lt;p&gt;Dumping everything into the prompt is a great MVP, but it does not scale. I needed a new architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Community Was Right: Storage ≠ Retrieval
&lt;/h2&gt;

&lt;p&gt;When I published the first article, the Dev.to community called this exact scaling wall. &lt;/p&gt;

&lt;p&gt;Developers like &lt;a class="mentioned-user" href="https://dev.to/scottcjn"&gt;@scottcjn&lt;/a&gt; and &lt;a class="mentioned-user" href="https://dev.to/itskondrat"&gt;@itskondrat&lt;/a&gt; pointed out that while a Knowledge Graph is the perfect way to &lt;strong&gt;store&lt;/strong&gt; relationships and causality, you shouldn't retrieve the &lt;em&gt;whole&lt;/em&gt; thing every time. &lt;/p&gt;

&lt;p&gt;I didn't want to revert to standard Vector RAG, because standard RAG loses the plot. If she says "I'm stressed," a vector search retrieves a random journal entry about "stress." A graph knows the causality: &lt;code&gt;Project A -&amp;gt; CAUSED -&amp;gt; Stress&lt;/code&gt;. And for first sessions or smaller graphs, dumping the full graph into the context window is still the best option.&lt;/p&gt;

&lt;p&gt;I needed a hybrid approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A Base Prompt (Working Memory):&lt;/strong&gt; The most critical structural info about her life, capped at a strict budget.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;GraphRAG (Episodic Recall):&lt;/strong&gt; Long-tail memories retrieved on-demand for the current chat turn.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is how I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Hydration V2 (The Budget-Aware Brain)
&lt;/h2&gt;

&lt;p&gt;My first API endpoint (Hydration V1) just ran a &lt;code&gt;SELECT *&lt;/code&gt; from the graph and formatted the results. &lt;/p&gt;

&lt;p&gt;I rewrote it as &lt;strong&gt;Hydration V2&lt;/strong&gt;: a cascading waterfill allocation system. I set a hard limit of roughly 120,000 characters (~30k tokens). The goal is to maximize the &lt;em&gt;usefulness&lt;/em&gt; of the prompt without blowing the budget.&lt;/p&gt;

&lt;p&gt;Here is how the waterfill logic allocates space:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7juzj1bligdpp44mrbaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7juzj1bligdpp44mrbaq.png" alt=" " width="318" height="662"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Node Budget (40%):&lt;/strong&gt; &lt;br&gt;
Nodes are the entities (People, Projects, Concepts). I sort them by their "degree" (number of connections). The most connected nodes are included first. Because nodes are just short summaries, they rarely use the full 40%. The unused characters &lt;strong&gt;roll over&lt;/strong&gt; into the Edge budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Edge Budget (60% + Rollover):&lt;/strong&gt; &lt;br&gt;
Edges are the relationships (the actual stories and facts). To prioritize them, I classify the top 30% of nodes by connection count as &lt;strong&gt;"Hubs."&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;P1 (Hub-to-Hub):&lt;/strong&gt; The structural backbone of her life. (e.g., User -&amp;gt; WORKS_ON -&amp;gt; Main Career). These are included first.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;P2 (Hub-Adjacent):&lt;/strong&gt; One node is a Hub, sorted by recency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;P3 (Long-Tail):&lt;/strong&gt; Low-degree nodes. These are the first to get cut when the budget fills up.&lt;/li&gt;
&lt;/ul&gt;
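&lt;p&gt;The allocation logic above can be sketched in a few lines. This is a simplified illustration of the waterfill pass (the real implementation lives in &lt;code&gt;hydration_v2.py&lt;/code&gt; in the repo); the dict shapes are assumptions, and tie-breaking by recency is omitted:&lt;/p&gt;

```python
def waterfill(nodes, edges, budget_chars=120_000):
    """Budget-aware prompt compilation: 40% to nodes, 60% plus rollover to edges."""
    node_budget = int(budget_chars * 0.4)
    nodes = sorted(nodes, key=lambda n: n["degree"], reverse=True)
    included_nodes, used = [], 0
    for node in nodes:
        if used + len(node["summary"]) > node_budget:
            break  # most-connected nodes were admitted first
        included_nodes.append(node)
        used += len(node["summary"])
    # Hubs: the top 30% of nodes by connection count.
    k = max(1, int(len(nodes) * 0.3))
    hubs = {node["id"] for node in nodes[:k]}

    def priority(edge):
        # P1 hub-to-hub sorts first, then P2 hub-adjacent, then P3 long-tail.
        return -sum(end in hubs for end in (edge["source"], edge["target"]))

    edge_budget = budget_chars - used  # unused node budget rolls over
    included_edges = []
    for edge in sorted(edges, key=priority):
        if len(edge["fact"]) > edge_budget:
            break  # long-tail facts are the first to get cut
        included_edges.append(edge)
        edge_budget -= len(edge["fact"])
    is_partial = len(included_nodes) + len(included_edges) != len(nodes) + len(edges)
    return included_nodes, included_edges, is_partial
```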
&lt;h2&gt;
  
  
  The Bridge: The Metadata Contract
&lt;/h2&gt;

&lt;p&gt;Here was the hardest architectural problem: If Hydration V2 puts "Fact A" in the Base Prompt, and my RAG pipeline searches for "Fact A" on the next turn, I will inject duplicate data into the LLM.&lt;/p&gt;

&lt;p&gt;To fix this, Hydration V2 doesn't just return text. It returns a &lt;strong&gt;Metadata Contract&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compilationMetadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"is_partial"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_estimated_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;29500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"included_node_ids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"uuid-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uuid-2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"included_edge_ids"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"uuid-x"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uuid-y"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;is_partial&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt;, it means the graph was too big and the waterfill algorithm had to cut things. It also returns the exact UUIDs of the nodes and edges that &lt;em&gt;did&lt;/em&gt; make it into the prompt. &lt;/p&gt;

&lt;p&gt;The React frontend stores this metadata and sends it back to the backend on every single chat request. Now, the backend knows exactly what the LLM already knows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Deterministic GraphRAG (No Agents)
&lt;/h2&gt;

&lt;p&gt;Most RAG systems today use "Agents" or tool-calling loops. The LLM decides if it needs to search, writes a query, waits for the tool, and then answers. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I hate this pattern for chat UI.&lt;/strong&gt; For use cases that need no complex reasoning or multiple tools, it just adds 2 to 5 seconds of latency. I wanted my RAG pipeline to be deterministic and execute in under 1 second.&lt;/p&gt;

&lt;p&gt;Here is my straight-line GraphRAG pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Gate Check&lt;/strong&gt;&lt;br&gt;
Before doing any search, the backend checks &lt;code&gt;compilationMetadata.is_partial&lt;/code&gt;. If it is &lt;code&gt;false&lt;/code&gt;, that means her entire graph fits into the Base Prompt. &lt;strong&gt;The system skips RAG entirely.&lt;/strong&gt; Zero wasted compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Query&lt;/strong&gt;&lt;br&gt;
Instead of just taking her last message (which might just be "Why?"), I concatenate the &lt;strong&gt;last 3 non-system messages&lt;/strong&gt; to build a context-rich search query.&lt;/p&gt;
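&lt;p&gt;The query builder itself is tiny. A sketch, assuming messages are dicts with &lt;code&gt;role&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt; keys:&lt;/p&gt;

```python
def build_search_query(messages, n=3):
    """Join the last n non-system messages so that a bare "Why?"
    still carries the context of the preceding exchange."""
    recent = [m["content"] for m in messages if m["role"] != "system"][-n:]
    return "\n".join(recent)
```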

&lt;p&gt;&lt;strong&gt;3. Hybrid Search&lt;/strong&gt;&lt;br&gt;
I use Graphiti to run a single hybrid search: Semantic Search (vector embeddings) + BM25 (exact keyword match), fused together using Reciprocal Rank Fusion (RRF). &lt;/p&gt;
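&lt;p&gt;Graphiti performs that fusion internally, but RRF itself is simple enough to show standalone: each document's score is the sum of &lt;code&gt;1 / (k + rank)&lt;/code&gt; over the rankings it appears in. This sketch is an illustration of the technique, not Graphiti's code:&lt;/p&gt;

```python
def rrf_fuse(semantic, keyword, k=60):
    """Reciprocal Rank Fusion of two ranked result lists (k=60 is the
    conventional damping constant)."""
    scores = {}
    for ranking in (semantic, keyword):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; documents in both lists float to the top.
    return sorted(scores, key=scores.get, reverse=True)
```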

&lt;p&gt;&lt;strong&gt;4. The Secret Sauce: Deduplication&lt;/strong&gt;&lt;br&gt;
Once I have the search results, I cross-reference them with the Metadata Contract from the frontend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deduplicate_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Edge&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CompilationMetadata&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Drops any edges that are already present in the Base System Prompt.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;retrieved_edges&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;included_edge_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guarantees &lt;strong&gt;zero redundancy&lt;/strong&gt;. If the RAG pipeline finds a memory, but it's already in the Base Prompt, it gets silently dropped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Ephemeral Injection&lt;/strong&gt;&lt;br&gt;
The surviving edges and nodes are formatted and injected into the System Message right before hitting Gemini, under a clear header: &lt;code&gt;### RELEVANT EPISODIC MEMORY FOR THIS TURN ###&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Crucially, this injected context is &lt;strong&gt;ephemeral&lt;/strong&gt;. It is sent to the LLM for this specific turn, but it is &lt;em&gt;never&lt;/em&gt; saved to the persistent database chat history. This prevents the context window from bloating with old RAG results over time (context rot).&lt;/p&gt;
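&lt;p&gt;The injection step itself is deliberately boring: append the surviving facts to the system message for this one turn, and never write them back to the stored history. A sketch with a hypothetical function name:&lt;/p&gt;

```python
EPISODIC_HEADER = "### RELEVANT EPISODIC MEMORY FOR THIS TURN ###"

def inject_episodic_memory(base_prompt, surviving_facts):
    """Build this turn's system message; the result is sent to the LLM
    but never persisted, which avoids context rot."""
    if not surviving_facts:
        return base_prompt  # gate check skipped RAG, or dedup dropped everything
    facts = "\n".join(f"- {fact}" for fact in surviving_facts)
    return f"{base_prompt}\n\n{EPISODIC_HEADER}\n{facts}"
```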

&lt;h2&gt;
  
  
  Observability &amp;amp; The Results
&lt;/h2&gt;

&lt;p&gt;You can't improve what you don't measure. I added OpenTelemetry across the backend. Now, when I look at a trace, I can see exactly what the waterfill dropped (&lt;code&gt;hydrate.is_partial&lt;/code&gt;), how long the search took (&lt;code&gt;rag.search_duration_ms&lt;/code&gt;), and how many facts were actually injected (&lt;code&gt;rag.injected_edges_count&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Impact:&lt;/strong&gt;&lt;br&gt;
Look back at the chart at the start of this article. After Day 21, I deployed this architecture. &lt;br&gt;
The input tokens per message instantly collapsed from 120k back down to a stable ~40k tokens (the budget limit + chat history). &lt;/p&gt;

&lt;p&gt;The magic is that the AI didn't get dumber. It still feels like it knows &lt;em&gt;everything&lt;/em&gt; about her because the structural skeleton (the Hubs) is always there in the Base Prompt. But when she asks a specific question about a past event, the GraphRAG pipeline silently fetches the long-tail details in under a second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A massive 1 million token context window is an incredible luxury, but it is not a substitute for software architecture. &lt;/p&gt;

&lt;p&gt;Dumping everything into the prompt is the best way to validate an idea. But building real products eventually forces you to move from "what works theoretically" to "what works economically and efficiently." &lt;/p&gt;

&lt;p&gt;By separating &lt;strong&gt;Storage&lt;/strong&gt; (Knowledge Graphs) from &lt;strong&gt;Retrieval&lt;/strong&gt; (Budget-Aware Base Prompts + Deterministic RAG), Synapse is now fast, cheap to run, and infinitely scalable.&lt;/p&gt;

&lt;p&gt;The code for both of these systems is open source. You can check out exactly how I implemented the waterfill allocation (&lt;code&gt;hydration_v2.py&lt;/code&gt;) and the retrieval pipeline (&lt;code&gt;graph_rag.py&lt;/code&gt;) in the backend repository.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Frontend (Body):&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-chat-ai" rel="noopener noreferrer"&gt;synapse-chat-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backend (Cortex):&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;synapse-cortex&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I love sharing these real-world scaling problems. If you are building memory systems or working with AI in production, I'd love to hear your approach. Let's connect on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>When 5 Minutes Isn't Enough: Moving AI Ingestion from Sync to Async (And Saving 99% Compute)</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Fri, 13 Feb 2026 03:38:32 +0000</pubDate>
      <link>https://dev.to/juandastic/when-5-minutes-isnt-enough-moving-ai-ingestion-from-sync-to-async-and-saving-99-compute-59o8</link>
      <guid>https://dev.to/juandastic/when-5-minutes-isnt-enough-moving-ai-ingestion-from-sync-to-async-and-saving-99-compute-59o8</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e"&gt;last post&lt;/a&gt;, I introduced &lt;strong&gt;Synapse&lt;/strong&gt;, the AI system I built for my wife that uses a Knowledge Graph to give her LLM a "Deep Memory."&lt;/p&gt;

&lt;p&gt;In the early demos and tests, it looked perfect. She ends a chat, the system processes it, and the graph updates in about 50 seconds.&lt;/p&gt;

&lt;p&gt;But demos are lies.&lt;/p&gt;

&lt;p&gt;When we started using it for real, with 45-minute chat sessions and dozens of messages, the system fell apart. The "End Session" button would spin for 5 minutes and then crash.&lt;/p&gt;

&lt;p&gt;I thought I had a simple timeout bug. It turned out I had a fundamental architecture problem.&lt;/p&gt;

&lt;p&gt;Here is how I went from crashing servers and wasting tokens to a &lt;strong&gt;99% reduction in Convex Actions compute time&lt;/strong&gt; by implementing the &lt;strong&gt;Async Request-Reply Pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Happy Path" Trap
&lt;/h2&gt;

&lt;p&gt;My initial implementation was naive. I treated the heavy AI processing like a standard web request.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Convex (The Orchestrator)&lt;/strong&gt; triggers an HTTP POST to my Python backend.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;FastAPI (The Brain)&lt;/strong&gt; calls &lt;code&gt;Graphiti&lt;/code&gt; + &lt;code&gt;Gemini&lt;/code&gt; to process the text.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;FastAPI&lt;/strong&gt; waits for the result and returns it.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Convex&lt;/strong&gt; saves the result to the DB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the standard &lt;strong&gt;Synchronous&lt;/strong&gt; pattern.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;Convex Actions have a hard execution limit&lt;/strong&gt; (usually 5 to 10 minutes depending on the plan).&lt;/p&gt;

&lt;p&gt;When my wife had a short conversation, processing took 1 or 2 minutes. Fine.&lt;br&gt;
But when she had a deep conversation, the Graph extraction logic (running on Gemini 3 Flash) took &lt;strong&gt;15 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You cannot fit a 15-minute task into a 5-minute box.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt #1: The "Brute Force" Retry (And Why It Failed)
&lt;/h2&gt;

&lt;p&gt;At first, I didn't realize it was taking 15 minutes. I assumed the Gemini API was just being flaky or slow.&lt;/p&gt;

&lt;p&gt;So, I did what any engineer does when things fail: &lt;strong&gt;I added retries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I configured Convex to retry the action with exponential backoff on failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is the disaster that followed:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convex sends the request.&lt;/li&gt;
&lt;li&gt;It waits 5 minutes. &lt;strong&gt;Timeout.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Convex thinks the request failed, so it schedules a &lt;strong&gt;Retry&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It sends the request &lt;em&gt;again&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Hidden Bug:&lt;/strong&gt;&lt;br&gt;
The Python backend didn't know Convex had timed out. The first process was &lt;em&gt;still running&lt;/em&gt; in the background, consuming LLM tokens and writing to the graph.&lt;/p&gt;

&lt;p&gt;Suddenly, I had &lt;strong&gt;two&lt;/strong&gt; heavy processes processing the &lt;em&gt;same&lt;/em&gt; chat log simultaneously. I was paying double the API costs, wasting bandwidth, and clogging my backend with "zombie" processes. And the user &lt;em&gt;still&lt;/em&gt; got an error message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Turning Point: Observability
&lt;/h2&gt;

&lt;p&gt;I couldn't fix what I couldn't see. I installed &lt;strong&gt;OpenTelemetry&lt;/strong&gt; and connected it to &lt;strong&gt;Axiom&lt;/strong&gt; to trace the actual execution time on the Python backend.&lt;/p&gt;

&lt;p&gt;The trace was a slap in the face.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyjgqpwx4cum7ktoanmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyjgqpwx4cum7ktoanmn.png" alt="Ingest trace screenshot" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ingestion wasn't failing; it was just slow. It consistently took &lt;strong&gt;12 to 18 minutes&lt;/strong&gt; for large sessions.&lt;/p&gt;

&lt;p&gt;I realized this wasn't a bug I could "optimize" away. I needed to change the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: The Async Request-Reply Pattern
&lt;/h2&gt;

&lt;p&gt;In software engineering, when a task takes longer than a user (or a server) is willing to wait, you decouple the &lt;strong&gt;Request&lt;/strong&gt; from the &lt;strong&gt;Response&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I switched to a &lt;strong&gt;Polling Architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of Convex waiting for the answer, it just asks for a "ticket."&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Convex&lt;/strong&gt; sends a &lt;code&gt;POST /ingest&lt;/code&gt; request.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;FastAPI&lt;/strong&gt; immediately returns &lt;code&gt;202 Accepted&lt;/code&gt; with a &lt;code&gt;jobId&lt;/code&gt;. (Time taken: ~300ms).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;FastAPI&lt;/strong&gt; starts the heavy processing in a background task (&lt;code&gt;asyncio.create_task&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Convex&lt;/strong&gt; goes to sleep and wakes up every few minutes to check the status.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zbmoayxjnfbwz795ph9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zbmoayxjnfbwz795ph9.png" alt="flow diagram" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Linear Backoff?
&lt;/h3&gt;

&lt;p&gt;I switched from &lt;strong&gt;Exponential&lt;/strong&gt; to &lt;strong&gt;Linear&lt;/strong&gt; backoff for the polling.&lt;/p&gt;

&lt;p&gt;If I know a task takes &lt;em&gt;at least&lt;/em&gt; 5 minutes, checking after 10 seconds is a waste of resources. Checking after 2 minutes is also a waste.&lt;/p&gt;

&lt;p&gt;I set the scheduler to check after &lt;strong&gt;5 minutes&lt;/strong&gt;, then &lt;strong&gt;10 minutes&lt;/strong&gt;, then &lt;strong&gt;10 minutes&lt;/strong&gt; again. This reduces the noise on my server significantly.&lt;/p&gt;
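&lt;p&gt;Expressed as data, the schedule is just a table of delays (a sketch of the idea; the actual delays live in the Convex scheduler configuration):&lt;/p&gt;

```python
# First check after 5 minutes, then 10-minute steps; later attempts
# keep reusing the last delay instead of growing exponentially.
POLL_DELAYS_MINUTES = [5, 10, 10]

def next_poll_delay(attempt):
    """Minutes to wait before poll number `attempt` (0-indexed)."""
    return POLL_DELAYS_MINUTES[min(attempt, len(POLL_DELAYS_MINUTES) - 1)]
```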

&lt;h2&gt;
  
  
  The Results: 99% Efficiency Gain
&lt;/h2&gt;

&lt;p&gt;The difference in resource usage is massive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (Synchronous):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Convex Action running time: &lt;strong&gt;5 minutes&lt;/strong&gt; (blocking/waiting).&lt;/li&gt;
&lt;li&gt;  Result: Fail -&amp;gt; Retry -&amp;gt; &lt;strong&gt;5 more minutes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Total "Billed" Compute: &lt;strong&gt;~10-15 minutes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Token Waste:&lt;/strong&gt; High (re-processing the same data).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (Async Polling):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Request 1 (Trigger): &lt;strong&gt;~300ms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Request 2 (Poll at 5m): &lt;strong&gt;~300ms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Request 3 (Final Fetch): &lt;strong&gt;~300ms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Total "Billed" Compute: &lt;strong&gt;&amp;lt; 2 seconds&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We went from wasting 10 minutes of compute just "waiting" for a response, to using less than 2 seconds of active execution time to manage the same job.&lt;/p&gt;

&lt;p&gt;More importantly, the Python backend never processes the same job twice. If Convex asks for the status of a job that is already running, FastAPI just says "Still working on it," and the work continues undisturbed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project taught me a valuable lesson about building "Vertical AI" apps: &lt;strong&gt;AI tasks are slow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are used to web requests taking 200ms. In the world of LLMs and Knowledge Graphs, a "fast" task might take 30 seconds, and a "deep" task might take 15 minutes.&lt;/p&gt;

&lt;p&gt;If your backend takes longer than your timeout limit, don't just increase the timeout. &lt;strong&gt;Decouple the request.&lt;/strong&gt; It makes your system more resilient, your bills lower, and your architecture cleaner.&lt;/p&gt;

&lt;p&gt;I'd love to hear how you handle long-running LLM tasks. Let me know on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Beyond RAG: Building an AI Companion with "Deep Memory" using Knowledge Graphs</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Mon, 09 Feb 2026 00:07:19 +0000</pubDate>
      <link>https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e</link>
      <guid>https://dev.to/juandastic/beyond-rag-building-an-ai-companion-with-deep-memory-using-knowledge-graphs-2e6e</guid>
      <description>&lt;p&gt;I build AI tools to solve my own problems. A while back, &lt;a href="https://dev.to/juandastic/i-ditched-myfitnesspal-and-built-an-ai-agent-to-track-my-food-3eia"&gt;I built NutriAgent to track my calories&lt;/a&gt; because I wanted to &lt;strong&gt;own my raw data&lt;/strong&gt;. But recently, the problem wasn't mine, it was my wife's.&lt;/p&gt;

&lt;p&gt;She uses LLMs differently than I do. While I use them for code or quick facts, she uses them as a therapist, a life coach, and a sounding board. Over the last year, she built a massive "Master Prompt" in Notion. It contained her medical history, key life events, emotional triggers, and ongoing projects.&lt;/p&gt;

&lt;p&gt;It was &lt;strong&gt;35,000 tokens&lt;/strong&gt; long.&lt;/p&gt;

&lt;p&gt;Every time she started a new chat, she had to manually copy-paste this wall of text just to get the AI up to speed. If she didn't, the advice was generic and useless.&lt;/p&gt;

&lt;p&gt;She didn't need a search engine or a simple chat history. She needed a &lt;strong&gt;continuous brain&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I realized that the standard way we build AI memory with RAG (Retrieval Augmented Generation) wouldn't be enough. So I built &lt;strong&gt;Synapse AI Chat&lt;/strong&gt;. It's an AI architecture that uses a Knowledge Graph to give an LLM "Deep Memory."&lt;/p&gt;

&lt;p&gt;Here is how I built it, why I chose Knowledge Graphs over Vectors (to be fair, I used both), and how I handled the engineering messiness of making it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Standard RAG Wasn't Enough
&lt;/h2&gt;

&lt;p&gt;Most AI memory systems today use Vector RAG. You chunk text, turn it into numbers (vectors), and find "similar" chunks later.&lt;/p&gt;

&lt;p&gt;This works great for finding a specific policy in a PDF, but far less well for modeling human relationships and history.&lt;/p&gt;

&lt;p&gt;Vectors find &lt;strong&gt;similarity&lt;/strong&gt;, not &lt;strong&gt;structure&lt;/strong&gt;.&lt;br&gt;
If my wife tells the AI, "I'm feeling overwhelmed today," a Vector search might pull up a journal entry from three months ago where she mentioned "overwhelm."&lt;/p&gt;

&lt;p&gt;But a &lt;strong&gt;Knowledge Graph&lt;/strong&gt; understands the &lt;em&gt;story&lt;/em&gt;. It knows:&lt;br&gt;
&lt;code&gt;"Project A" -&amp;gt; CAUSED -&amp;gt; "Stress" -&amp;gt; RESULTED_IN -&amp;gt; "Overwhelm"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I needed the AI to understand &lt;em&gt;causality&lt;/em&gt;, not just keywords.&lt;/p&gt;
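&lt;p&gt;A toy example makes the difference concrete: in a graph, a causal chain is a path you can walk backwards, something a pure similarity index cannot give you (illustrative code, not how Graphiti stores its edges):&lt;/p&gt;

```python
# Toy causal graph: (source, RELATION, target) triples.
edges = [
    ("Project A", "CAUSED", "Stress"),
    ("Stress", "RESULTED_IN", "Overwhelm"),
]

def root_causes(node):
    """Walk edges backwards to find what originally led to `node`."""
    parents = [s for s, _, t in edges if t == node]
    roots = []
    for p in parents:
        deeper = root_causes(p)
        roots.extend(deeper if deeper else [p])
    return roots
```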
&lt;h3&gt;
  
  
  The Architecture Decision: Full Context Injection
&lt;/h3&gt;

&lt;p&gt;Because I was using Google's Gemini models (which have a massive context window), I didn't need to retrieve just 5 small chunks of text. I could inject the &lt;strong&gt;entire&lt;/strong&gt; compiled profile into the prompt.&lt;/p&gt;

&lt;p&gt;My goal was to turn the raw chat logs into a structured graph, then flatten it back into a comprehensive "User Manual" for the AI to read before every interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getzep.com/product/open-source/" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt;, the framework I used for the graph indexing, supports semantic search for a retrieval strategy. I decided to take advantage of the Gemini's big context windows. The compiled graph output ended up being smaller than the source, from almost 35k tokens to ~14k, just combining the entities with their descriptions and their relations in plain text, avoiding extra tokens to build a narrative prompt like her old master's prompt&lt;/p&gt;
&lt;h2&gt;
  
  
  Introducing Synapse: The Architecture
&lt;/h2&gt;

&lt;p&gt;I split the project into two parts: the &lt;strong&gt;Body&lt;/strong&gt; (the UI you talk to) and the &lt;strong&gt;Brain&lt;/strong&gt; (the API that processes memory).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Frontend (Body):&lt;/strong&gt; React 19 + &lt;strong&gt;Convex&lt;/strong&gt;. I chose Convex because it handles real-time database syncing effortlessly, which makes the chat feel snappy.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Cortex (Brain):&lt;/strong&gt; Python + FastAPI. This does the heavy data processing.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Memory Engine:&lt;/strong&gt; &lt;strong&gt;Graphiti&lt;/strong&gt; + &lt;strong&gt;Neo4j&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Models:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Gemini 3 Flash:&lt;/strong&gt; For the "heavy lifting" (building the graph).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Gemini 2.5 Flash:&lt;/strong&gt; For the actual chat (speed and cost).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is the high-level view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzkxbivdlj2ycdiz1683.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzkxbivdlj2ycdiz1683.png" alt="high-level view" width="376" height="407"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works: The "Deep Memory" Pipeline
&lt;/h2&gt;

&lt;p&gt;The system operates in three distinct phases.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase A: Conversation (The Chat)
&lt;/h3&gt;

&lt;p&gt;When my wife chats with Synapse, she is talking to &lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt;. It’s fast and fluid.&lt;/p&gt;

&lt;p&gt;The trick is that the System Prompt isn't static. Before she sends her first message, I &lt;strong&gt;hydrate&lt;/strong&gt; the prompt with a text summary of her entire Knowledge Graph. The AI immediately knows who she is, what she's worried about, and who her friends are.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase B: Ingestion (The "Sleep" Cycle)
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. When she finishes a conversation, either by going quiet for 3 hours or by clicking a Consolidate button, I treat it like the AI taking a nap to consolidate memories.&lt;/p&gt;

&lt;p&gt;We send the chat transcript to the Python Cortex. Here, I switch to &lt;strong&gt;Gemini 3 Flash&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why the upgrade? Extracting entities from a messy human conversation is hard.&lt;br&gt;
If she says, "I stopped taking medication X and started Y," a weaker model might just add "Taking Y" to the graph. &lt;strong&gt;Gemini 3&lt;/strong&gt; is smart enough to model the full transition:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Find node "Medication X".&lt;/li&gt;
&lt;li&gt; Mark the relationship as &lt;code&gt;STOPPED&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Create node "Medication Y".&lt;/li&gt;
&lt;li&gt; Create relationship &lt;code&gt;STARTED&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
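&lt;p&gt;The steps above amount to a temporal update: invalidate the old edge without deleting it, then add the new one. Here is a toy sketch of that idea (illustrative only; Graphiti performs this kind of edge invalidation internally):&lt;/p&gt;

```python
class Graph:
    """Minimal edge store: each edge is [source, relation, target, is_valid]."""
    def __init__(self):
        self.edges = []

    def add(self, s, rel, t):
        self.edges.append([s, rel, t, True])

    def invalidate(self, s, rel, t):
        for e in self.edges:
            if e[:3] == [s, rel, t]:
                e[3] = False  # keep the history, mark it as no longer current

def apply_medication_change(graph, old_med, new_med):
    graph.invalidate("user", "TAKES", old_med)  # Medication X: STOPPED
    graph.add("user", "TAKES", new_med)         # Medication Y: STARTED
```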

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frudhrks1ss6tb87h6bn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frudhrks1ss6tb87h6bn4.png" alt=" " width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase C: Hydration (The Awakening)
&lt;/h3&gt;

&lt;p&gt;When she returns, the next session is already prepared with the new compiled graph summary. It doesn't just dump raw graph data into the prompt; it compiles the nodes and edges into a natural language narrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_format_compilation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;definitions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;relationships&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;definitions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#### 1. CONCEPTUAL DEFINITIONS &amp;amp; IDENTITY ####&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# (Understanding what these concepts mean specifically for this user)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;definitions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;relationships&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#### 2. RELATIONAL DYNAMICS &amp;amp; CAUSALITY ####&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# (How these concepts interact and evolve over time)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relationships&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The "Killer Feature": Memory Explorer
&lt;/h2&gt;

&lt;p&gt;AI memory is usually a "Black Box." Users don't trust what they can't see.&lt;/p&gt;

&lt;p&gt;I wanted my wife to be able to audit her own brain. I built a visualizer using &lt;code&gt;react-force-graph&lt;/code&gt;. She can see bubbles representing her life: "Work," "Health," "Family."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtrtl3gie08bglhdfs8p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtrtl3gie08bglhdfs8p.png" alt=" " width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If she sees a connection that is wrong (e.g., the AI thinks she likes a food she actually hates), she can edit the input and re-process the graph with new information like &lt;em&gt;"I actually hate mushrooms now."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The system then processes that new input and updates the graph, creating new nodes and relations or invalidating the existing ones. This "Human-in-the-loop" approach builds massive trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering Challenges
&lt;/h2&gt;

&lt;p&gt;Building this wasn't just about prompt engineering. There were real system challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Handling Latency (The Job Queue)
&lt;/h3&gt;

&lt;p&gt;Graph ingestion is slow. It takes anywhere from &lt;strong&gt;60 to 200 seconds&lt;/strong&gt; for Graphiti and Gemini to process a long conversation and update Neo4j.&lt;/p&gt;

&lt;p&gt;I couldn't have the UI hang for 3 minutes.&lt;/p&gt;

&lt;p&gt;I used &lt;strong&gt;Convex&lt;/strong&gt; as a Job Queue. When the session ends, the UI returns immediately. Convex processes the job in the background, updating the UI state to "Processing..." and then "Memory Updated" when it's done.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Handling Flakiness (The Retry Logic)
&lt;/h3&gt;

&lt;p&gt;The Gemini API is powerful, but occasionally it throws &lt;strong&gt;503 Service Unavailable&lt;/strong&gt; errors, especially during heavy graph processing tasks.&lt;/p&gt;

&lt;p&gt;I implemented an "Event-Driven Retry" system. If the graph build fails, I don't just crash. I schedule a retry with exponential backoff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;RETRY_DELAYS_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// Attempt 1: Immediate&lt;/span&gt;
  &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// Attempt 2: +2 minutes (let the API cool down)&lt;/span&gt;
  &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Attempt 3: +10 minutes&lt;/span&gt;
  &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Attempt 4: +30 minutes&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processJob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;internalAction&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;jobId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cortex_jobs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cortexJobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;jobId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// 1. Do the heavy lifting (Call Gemini 3 Flash)&lt;/span&gt;
      &lt;span class="c1"&gt;// This is where 503 errors usually happen&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ingestGraphData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// 2. Mark complete if successful&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runMutation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cortexJobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;jobId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;jobId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nextAttempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Stop the loop if we've tried too many times&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runMutation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cortexJobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
            &lt;span class="na"&gt;jobId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;jobId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 3. Schedule the retry using Convex's scheduler&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;RETRY_DELAYS_MS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runAfter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processJob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;jobId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;jobId&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Snappy UX
&lt;/h3&gt;

&lt;p&gt;Convex's real-time sync was a lifesaver here. I didn't have to write complex WebSocket code. If the Python backend updates the status of a memory job in the database, the React UI updates instantly.&lt;/p&gt;

&lt;p&gt;Token streaming also works better with Convex in the middle, since the backend streams into Convex rather than directly to the client. If the user's browser closes or the connection drops, token generation continues: the answer lands in Convex, and it streams to the user as soon as they reconnect.&lt;/p&gt;

&lt;p&gt;The catch is that this can inflate function usage, since every update counts as a write. So the streaming updates are throttled to 100ms intervals to balance responsiveness with database write efficiency.&lt;/p&gt;
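&lt;p&gt;A minimal sketch of that throttling idea in Python (illustrative only; the real system writes through Convex mutations, and all names here are assumptions):&lt;/p&gt;

```python
import time

FLUSH_INTERVAL = 0.1  # flush to the database at most every 100ms

class ThrottledStream:
    """Buffer streamed tokens and write them out at a bounded rate."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn          # stand-in for a Convex mutation call
        self.buffer = []
        self.last_flush = float("-inf")   # force an immediate first flush

    def on_token(self, token: str) -> None:
        self.buffer.append(token)
        if time.monotonic() - self.last_flush >= FLUSH_INTERVAL:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn("".join(self.buffer))
            self.buffer.clear()
        self.last_flush = time.monotonic()

chunks = []
stream = ThrottledStream(chunks.append)
for tok in ["Hel", "lo", " wor", "ld"]:
    stream.on_token(tok)
stream.flush()  # final flush so the tail of the answer is not lost
print("".join(chunks))  # → Hello world
```

&lt;p&gt;The first token flushes immediately for perceived responsiveness; everything arriving inside the 100ms window is batched into one write.&lt;/p&gt;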

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;The difference is night and day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; My wife dreaded starting a new thread because of the "context setup" tax. She felt like she was constantly repeating herself, and she carried the burden of regularly pausing to update the Master Prompt with new data before starting a new thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now:&lt;/strong&gt; She just talks. The system has a "Deep Memory" of about &lt;strong&gt;10,000 tokens&lt;/strong&gt; (compressed from months of chats) that is injected automatically.&lt;/p&gt;

&lt;p&gt;She has different threads for different topics, but they all share the same &lt;strong&gt;Cortex&lt;/strong&gt;. If she mentions a health issue in the "Work" thread (e.g., "My back hurts from sitting"), the "Health" thread knows about it the next time she logs in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project taught me that we are moving from "Horizontal" AI platforms (like ChatGPT, which knows a little about everything) to "Vertical" AI stacks that know &lt;strong&gt;everything about you&lt;/strong&gt;. I’ve been watching how the ChatGPT and Gemini apps are starting to create user profiles and thread summaries to build this kind of memory. They are chasing the same goal: a truly personalized experience.&lt;/p&gt;

&lt;p&gt;The key takeaway for me is that &lt;strong&gt;Vectors are great for search, but Knowledge Graphs are essential for understanding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I keep enjoying building solutions for real problems. Nowadays, we have powerful tools to build awesome software faster than ever, but I found that having a &lt;strong&gt;product vision&lt;/strong&gt; and the &lt;strong&gt;technical understanding&lt;/strong&gt; to architect a solution is still critical. That is the difference between building a quick prototype and solving a real problem.&lt;/p&gt;

&lt;p&gt;This project is being used for real by my wife and me, and honestly, this is my favorite part of building products. The fun doesn't end when the architecture is done; it begins when people actually use it. Watching the product evolve, finding bugs, pivoting features, or even realizing that an initial idea didn't make sense at all, that is the journey. Building software is fun, but seeing it come alive and solve actual problems is magical.&lt;/p&gt;

&lt;p&gt;The project is live at &lt;a href="https://synapse-chat.juandago.dev" rel="noopener noreferrer"&gt;synapse-chat.juandago.dev&lt;/a&gt; if you want to see it in action.&lt;/p&gt;

&lt;p&gt;The code is open source if you want to dig into the implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Frontend (Body):&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-chat-ai" rel="noopener noreferrer"&gt;synapse-chat-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backend (Cortex):&lt;/strong&gt; &lt;a href="https://github.com/juandastic/synapse-cortex" rel="noopener noreferrer"&gt;synapse-cortex&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love to hear your impressions and thoughts. Let's continue the conversation on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or connect on &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>I Used My AI Nutrition Agent Every Day for a Month. Here's What I Actually Had to Fix</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Sat, 27 Dec 2025 06:19:17 +0000</pubDate>
      <link>https://dev.to/juandastic/i-used-my-ai-nutrition-coach-every-day-for-a-month-heres-what-i-actually-had-to-fix-1ej8</link>
      <guid>https://dev.to/juandastic/i-used-my-ai-nutrition-coach-every-day-for-a-month-heres-what-i-actually-had-to-fix-1ej8</guid>
      <description>&lt;p&gt;A month ago, I wrote about building NutriAgent, my AI nutrition tracker that logs meals from Telegram and the web into a Google Sheet I own (&lt;a href="https://dev.to/juandastic/i-ditched-myfitnesspal-and-built-an-ai-agent-to-track-my-food-3eia"&gt;you can read the original post here&lt;/a&gt;). I got it working, posted the article, and figured that was the end of the story.&lt;/p&gt;

&lt;p&gt;Then I started using it every single day. And that's when the real problems began to show up.&lt;/p&gt;

&lt;p&gt;Not bugs. Not crashes. Just... little things that made me think "wait, this is annoying" multiple times per day. Things you only notice when you're the actual user solving a real problem, not just demoing a cool idea.&lt;/p&gt;

&lt;p&gt;Two problems broke the experience completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Spreadsheets Problem (Why My Data Felt Broken)
&lt;/h2&gt;

&lt;p&gt;I'd log my breakfast quickly on Telegram from my phone. Then at lunch, I'd be at my computer and use the web interface because it was easier. But at the end of the day, when I wanted to see my full nutrition breakdown, I had my data split across two different accounts and two different spreadsheets. I had to manually copy rows and merge them just to get a simple daily total.&lt;/p&gt;

&lt;p&gt;The agent stored my Telegram meals under one user ID. My web chats were under another. When I asked "what did I eat this week?" the answer depended entirely on which platform I was using. My nutrition data was fragmented, making any real analysis impossible.&lt;/p&gt;

&lt;p&gt;I realized that "make it multi-user" wasn't enough. I needed one identity across both channels.&lt;/p&gt;

&lt;p&gt;Since both channels were useful in different scenarios, I decided to keep both while making my data integrated and easy to visualize and analyze.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Linking Actually Works
&lt;/h3&gt;

&lt;p&gt;I first considered building this into the main agent as a tool: "Send your email to link your account." But typing emails in chat felt clunky, and waiting for verification codes in Telegram felt slower than just clicking a button.&lt;/p&gt;

&lt;p&gt;Some features are just faster in a web interface. Account linking is one of them.&lt;/p&gt;

&lt;p&gt;So I built a Settings page in the web app that generates a short-lived linking code. You copy it, paste it into Telegram, and the bot connects your accounts. That's it.&lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get a code from the web Settings&lt;/li&gt;
&lt;li&gt;Send it to the Telegram bot&lt;/li&gt;
&lt;li&gt;Backend validates and binds your &lt;code&gt;telegram_user_id&lt;/code&gt; to your &lt;code&gt;clerk_user_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Merge the chat histories and nutrition logs to keep everything in a single user account&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa50pjd22d8itug4m0h3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa50pjd22d8itug4m0h3.png" alt="Screenshot of the web settings page" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Under the Hood: One User, Two Channels, One Source of Truth
&lt;/h3&gt;

&lt;p&gt;Under the hood, the core decision was to &lt;strong&gt;pick a single canonical user identity&lt;/strong&gt; and force everything else to align with it.&lt;/p&gt;

&lt;p&gt;On the web side, authentication is handled by Clerk, which gives me a stable &lt;code&gt;clerk_user_id&lt;/code&gt;. Instead of inventing a parallel identity system for Telegram, I decided to make &lt;code&gt;clerk_user_id&lt;/code&gt; the primary key everywhere.&lt;/p&gt;

&lt;p&gt;On the backend, the user model now looks roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clerk_user_id&lt;/code&gt; → primary identifier&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telegram_user_id&lt;/code&gt; → optional, nullable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;email&lt;/code&gt; → metadata and debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Telegram is no longer a “separate user”&lt;/li&gt;
&lt;li&gt;It’s just another interface attached to the same account&lt;/li&gt;
&lt;li&gt;All nutrition logs, chat history, and summaries are keyed off the same ID&lt;/li&gt;
&lt;/ul&gt;
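&lt;p&gt;As a rough sketch, the record described above could look like this (a plain dataclass for illustration, not the actual backend model):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    clerk_user_id: str                       # primary identifier everywhere
    telegram_user_id: Optional[int] = None   # attached only after linking
    email: Optional[str] = None              # metadata and debugging

# Both channels resolve to the same record once linked:
user = User(clerk_user_id="user_abc123", email="me@example.com")
user.telegram_user_id = 987654321  # set when the linking code is redeemed
print(user.clerk_user_id)  # → user_abc123
```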

&lt;p&gt;The linking code flow is intentionally simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The web app generates a short-lived code bound to &lt;code&gt;clerk_user_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Telegram sends the code back to the backend&lt;/li&gt;
&lt;li&gt;If valid, the backend attaches &lt;code&gt;telegram_user_id&lt;/code&gt; to the existing user record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No guessing. No heuristics. No email matching.&lt;br&gt;
If the code matches, the user explicitly intended to link the accounts.&lt;/p&gt;

&lt;p&gt;This small constraint eliminated an entire class of edge cases I didn’t want to debug later.&lt;/p&gt;
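&lt;p&gt;The whole flow fits in a few lines. A hedged sketch with in-memory storage (the real backend persists users in a database; &lt;code&gt;CODE_TTL&lt;/code&gt; and the function names are assumptions):&lt;/p&gt;

```python
import secrets
import time

CODE_TTL = 300  # seconds a linking code stays valid (assumed value)

pending: dict[str, tuple[str, float]] = {}        # code -> (clerk_user_id, issued_at)
users = {"user_abc": {"telegram_user_id": None}}  # stand-in for the user table

def generate_code(clerk_user_id: str) -> str:
    """Web app side: issue a short-lived code bound to the signed-in user."""
    code = secrets.token_hex(4)
    pending[code] = (clerk_user_id, time.time())
    return code

def redeem_code(code: str, telegram_user_id: int) -> bool:
    """Bot side: validate the code and attach the Telegram ID to the user."""
    entry = pending.pop(code, None)  # pop makes each code single-use
    if entry is None:
        return False
    clerk_user_id, issued_at = entry
    if time.time() - issued_at > CODE_TTL:
        return False
    users[clerk_user_id]["telegram_user_id"] = telegram_user_id
    return True

code = generate_code("user_abc")
print(redeem_code(code, 987654321))  # → True
print(redeem_code(code, 111))        # → False (already consumed)
```

&lt;p&gt;Because the code is bound to the Clerk user and consumed on first use, a successful redemption is itself proof of intent.&lt;/p&gt;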
&lt;h2&gt;
  
  
  The "One Meal, Three Messages" Telegram Headache
&lt;/h2&gt;

&lt;p&gt;Once I got both channels working smoothly, I started using them interchangeably. That's when I noticed something else. The web version lets me attach multiple images to a single message, for instance, a photo of my food plus a screenshot of the nutrition label. This made the AI estimates much more accurate.&lt;/p&gt;

&lt;p&gt;But when I tried the same thing on Telegram, it fired off three separate messages, and I got three separate AI responses with different calorie counts. Each photo arrived as its own webhook event and was processed in isolation, without the context of the others. The experience gap was frustrating: the agent felt smart on the web, broken on Telegram.&lt;/p&gt;
&lt;h3&gt;
  
  
  How I Fixed the Multiple Images Problem
&lt;/h3&gt;

&lt;p&gt;Telegram marks photos that are sent together with a shared media group ID, so I introduced a &lt;code&gt;MediaGroupHandler&lt;/code&gt; in the webhook handler for when you send multiple photos at once. It's a simple batching system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the bot receives an image as part of a media group, it waits 1 second to start processing the request&lt;/li&gt;
&lt;li&gt;If more images arrive in that chat within the window, it groups them and resets the delay&lt;/li&gt;
&lt;li&gt;Sends them all as &lt;code&gt;list[bytes]&lt;/code&gt; to the agent in one call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent's &lt;code&gt;analyze()&lt;/code&gt; method already accepts &lt;code&gt;list[bytes]&lt;/code&gt;, so no changes needed there. The fix was purely in the Telegram handler.&lt;/p&gt;

&lt;p&gt;Now I can send three angles of my plate plus a nutrition label and get one smart response.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Fix Lives in the Telegram Layer (Not the Agent)
&lt;/h3&gt;

&lt;p&gt;One important detail: &lt;strong&gt;I didn’t change the agent at all&lt;/strong&gt; to support multiple images.&lt;/p&gt;

&lt;p&gt;The agent already accepts &lt;code&gt;list[bytes]&lt;/code&gt; for images. The real bug wasn’t model capability — it was &lt;strong&gt;message orchestration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Telegram delivers images as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate webhook events&lt;/li&gt;
&lt;li&gt;Sometimes grouped with a &lt;code&gt;media_group_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sometimes arriving milliseconds apart, out of order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Originally, each webhook triggered an agent call immediately. That meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One image = one analysis&lt;/li&gt;
&lt;li&gt;Zero shared context&lt;/li&gt;
&lt;li&gt;Conflicting calorie estimates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was to treat Telegram messages as &lt;strong&gt;signals&lt;/strong&gt;, not requests.&lt;/p&gt;

&lt;p&gt;I introduced a lightweight batching layer in the Telegram handler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Images with the same &lt;code&gt;media_group_id&lt;/code&gt; are buffered&lt;/li&gt;
&lt;li&gt;A short debounce window (1 second) waits for more images&lt;/li&gt;
&lt;li&gt;Each new image resets the timer&lt;/li&gt;
&lt;li&gt;When the window closes, all images are sent together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually, it’s:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Wait until the user is done talking, then think.”&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;media_groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;media_group_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;media_groups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;media_group_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]).&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;media_group_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;media_group_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;media_group_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;process_after_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;media_group_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_after_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;media_group_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;media_groups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;media_group_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By keeping this logic inside the Telegram adapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent stays platform-agnostic&lt;/li&gt;
&lt;li&gt;The same analysis pipeline works for web uploads, Telegram albums, or future mobile clients&lt;/li&gt;
&lt;li&gt;Telegram quirks don’t leak into core business logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ended up being one of those fixes that made everything &lt;em&gt;feel&lt;/em&gt; smarter without making the system more complex.&lt;/p&gt;

&lt;p&gt;Another side effect of this implementation was that it forced me to go deeper into asynchronous programming with FastAPI and Uvicorn. I already had some exposure to asyncio, but this was the first time I had to reason explicitly about timing, cancellation, and shared state in a real user-facing flow.&lt;/p&gt;

&lt;p&gt;To keep the solution simple, I used in-memory storage combined with &lt;code&gt;asyncio.Lock()&lt;/code&gt; and cancellable &lt;code&gt;asyncio.Task&lt;/code&gt;s to implement the batching and debounce logic. This works well because the bot currently runs with a single worker, so I don’t need external coordination or persistence.&lt;/p&gt;

&lt;p&gt;The important part is that this wasn’t a shortcut — it was a conscious tradeoff. The same pattern would translate cleanly to Redis, a queue, or a background worker if I needed to scale horizontally. For now, the simpler solution keeps the system easier to reason about, test, and evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Oh, That's Actually Smooth Now" Moment
&lt;/h2&gt;

&lt;p&gt;After the changes, I logged lunch on Telegram during a break, used the web chat when I was at the computer, and that evening, I opened the single spreadsheet with the whole picture of my day ready to analyze and compare with the rest of the week.&lt;/p&gt;

&lt;p&gt;I sent three images of dinner—no spam, just one clean response. The product finally feels intentional instead of held together with duct tape.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Dogfooding Actually Teaches You
&lt;/h2&gt;

&lt;p&gt;Building for yourself is different than building for a hypothetical user. You feel the pain immediately. You can't ignore bad UX because you're the one suffering.&lt;/p&gt;

&lt;p&gt;The gap between "it works" and "it works well enough to use daily" is massive—and only dogfooding reveals it.&lt;/p&gt;

&lt;p&gt;I learned that context engineering is more important than overloading prompts. I learned that some features belong in web UIs, not chat. And I learned that starting with a no-code tool is great for testing, but real usage demands real architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's a Real Product Now
&lt;/h2&gt;

&lt;p&gt;NutriAgent stopped being a toy project when I started needing it. These changes didn't just add features—they made it something I can share and scale.&lt;/p&gt;

&lt;p&gt;The project is live at &lt;a href="https://nutriagent.juandago.dev" rel="noopener noreferrer"&gt;https://nutriagent.juandago.dev&lt;/a&gt;. The code is open source for the &lt;a href="https://github.com/juandastic/nutri-agent-bot" rel="noopener noreferrer"&gt;Agent&lt;/a&gt; and &lt;a href="https://github.com/juandastic/nutri-agent-web" rel="noopener noreferrer"&gt;Web UI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This was my journey, but I'd love to hear your thoughts. Let's continue the conversation on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ux</category>
      <category>ai</category>
      <category>agents</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Ditched MyFitnessPal and Built an AI Agent to Track My Food</title>
      <dc:creator>Juan David Gómez</dc:creator>
      <pubDate>Sat, 15 Nov 2025 03:28:36 +0000</pubDate>
      <link>https://dev.to/juandastic/i-ditched-myfitnesspal-and-built-an-ai-agent-to-track-my-food-3eia</link>
      <guid>https://dev.to/juandastic/i-ditched-myfitnesspal-and-built-an-ai-agent-to-track-my-food-3eia</guid>
      <description>&lt;p&gt;I wanted to track my calories and protein for my training goals, but I got tired of existing apps. They lock you into their pretty dashboards, make it hard to export your own data, and you can't cross-reference that nutrition data with your training logs easily. I just wanted to &lt;strong&gt;own my raw data&lt;/strong&gt; and build custom reports for myself.&lt;/p&gt;

&lt;p&gt;So I built NutriAgent. It's an AI nutrition tracker that understands text and photos of my meals, logs everything into a database and Google Sheets that I control, and I can chat with it on Telegram or the web. This post is about my journey of turning a simple "call GPT" prototype into a real tool-using agent with memory—for myself, but built with proper product decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  My First Agent Wasn't Code (And That's Why I Rewrote It)
&lt;/h2&gt;

&lt;p&gt;I didn't start with Python. My first version was actually a quick PoC in n8n (the self-hosted workflow tool). I set up a simple flow with an agent node, a few tools, and Telegram integration. It worked surprisingly well; I used it for several days, and it logged my meals fine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14d6c0tbz59r7i0g1153.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14d6c0tbz59r7i0g1153.png" alt=" " width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem hit when I shared it with a friend. He wanted to try it, but I realized nothing was reusable. All my credentials for third-party services were hardcoded to my accounts. The whole flow was built around a single user: me. It couldn't support multiple people, and turning that n8n setup into a real product would have been a hack on top of a hack.&lt;br&gt;
That was the real push. I decided to rebuild it properly in Python—not just for me, but as a real multi-user system. It was more work, but it gave me the excuse to spend more time bringing a proper product to life, which is what I actually enjoy doing.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building a Proper Agent in Python
&lt;/h2&gt;

&lt;p&gt;The n8n prototype proved the concept worked, but now I had to rebuild it from scratch; this time with proper architecture for multiple users. As I started writing the Python version, I realized I needed to be more intentional about the agent's design than I was in my quick n8n flow.&lt;/p&gt;

&lt;p&gt;In n8n, I had basic tools duct-taped together. For a real system, I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A clean agent setup that could handle many users' conversations and data&lt;/li&gt;
&lt;li&gt;Well-designed tools that actually corresponded to product features&lt;/li&gt;
&lt;li&gt;Robust memory that wouldn't break when I scaled beyond just my own use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used LangChain's &lt;code&gt;create_agent&lt;/code&gt; because it handles a lot of the heavy lifting. The core setup looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PROMPT_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;food_analysis_prompt.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FoodAnalysisAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_system_prompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PROMPT_FILE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_datetime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_datetime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I keep the prompt in a separate file because I edit it a lot; it's easier to tweak the instructions without touching code. I inject the current datetime so the agent knows what day and time it is, which is important for queries like "today" or "this week" in my conversations.&lt;/p&gt;
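&lt;p&gt;For reference, the template is just a plain text file with a &lt;code&gt;{current_datetime}&lt;/code&gt; placeholder that &lt;code&gt;str.format&lt;/code&gt; fills in. A minimal self-contained sketch (the filename and prompt wording here are illustrative, not my actual prompt):&lt;/p&gt;

```python
from datetime import datetime
from pathlib import Path

# Hypothetical prompt file; the real prompt is much longer.
PROMPT_FILE = Path("food_agent_prompt.txt")
PROMPT_FILE.write_text(
    "You are a nutrition assistant.\n"
    "The current datetime is {current_datetime}.\n"
    "Resolve relative dates like 'today' against it.",
    encoding="utf-8",
)

def create_system_prompt() -> str:
    # Re-read on every call so prompt edits take effect without a restart
    template = PROMPT_FILE.read_text(encoding="utf-8")
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return template.format(current_datetime=now)

print(create_system_prompt())
```

&lt;p&gt;One gotcha with &lt;code&gt;str.format&lt;/code&gt;: any literal braces in the prompt file have to be doubled (&lt;code&gt;{{&lt;/code&gt; and &lt;code&gt;}}&lt;/code&gt;), or formatting will raise.&lt;/p&gt;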

&lt;h2&gt;
  
  
  Making It Understand Photos and My Chat History
&lt;/h2&gt;

&lt;p&gt;The agent needs to handle my messy real-world inputs: sometimes text, sometimes a photo, sometimes both. Plus, it needs to remember what we were just talking about.&lt;/p&gt;

&lt;p&gt;Here's how I normalize everything before sending it to the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FoodAnalysisAgent.analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;redirect_uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Pull my past conversation from DB and convert to LangChain format
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;conversation_history&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="c1"&gt;# Add my current message (text + optional images)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redirect_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;redirect_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets me send a photo of fries and add context like "these were air-fried" to get a better estimate. The agent sees the image and text together, plus our conversation history, so it feels like a natural chat about my meals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing Tools for My Own Use Cases
&lt;/h2&gt;

&lt;p&gt;Each tool maps to something I actually want to &lt;strong&gt;do&lt;/strong&gt;. I didn't want abstract functions; I wanted "register this meal" or "show me my data."&lt;/p&gt;

&lt;h3&gt;
  
  
  Saving My Meals to DB and Google Sheets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_register_nutritional_info_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_nutritional_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;calories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;proteins&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;carbs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meal_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;extra_details&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;save_nutritional_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# This is me
&lt;/span&gt;            &lt;span class="n"&gt;calories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;calories&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;proteins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proteins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;carbs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;carbs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;meal_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meal_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;extra_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extra_details&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_spreadsheet_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;append_nutritional_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;calories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;calories&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;proteins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proteins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;carbs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;carbs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;fats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;meal_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meal_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;extra_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extra_details&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# DB is my source of truth; Sheets is best-effort
&lt;/span&gt;                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to append to my spreadsheet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Build a friendly summary for me
&lt;/span&gt;        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;register_nutritional_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;My database is the source of truth.&lt;/strong&gt; Google Sheets is a nice-to-have mirror. If Sheets fails, I don't lose my data; the meal is already saved in Supabase. This gives me peace of mind because I know my data is always safe.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying My Past Meals
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_query_nutritional_info_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_nutritional_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_nutritional_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Querying my own history
&lt;/span&gt;            &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No nutritional records found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Meal: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meal_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calories: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;calories&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Proteins: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;proteins&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;g | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Carbs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;carbs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;g | Fats: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;g&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query_nutritional_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I pre-format my records into simple text lines instead of dumping raw JSON. The model understands this better and can answer my questions like "what was my protein intake on Monday?" more reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting My Google Sheets via OAuth
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_register_google_account_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redirect_uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_google_account&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_spreadsheet_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your Google account is already connected. I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll keep saving meals there.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;redirect_uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I need a valid redirect URL to start the Google authorization flow. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The server configuration seems incomplete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;authorization_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_authorization_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redirect_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To enable Google Sheets integration, please authorize access using this link:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;authorization_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;register_google_account&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps all the OAuth complexity inside a tool. The agent just decides &lt;em&gt;when&lt;/em&gt; I need to connect my account and triggers the flow naturally in our conversation.&lt;/p&gt;
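&lt;p&gt;&lt;code&gt;get_authorization_url&lt;/code&gt; isn't shown here, but a Google OAuth2 consent URL is simple to build by hand, and this hypothetical stand-in sketches roughly what such a helper does (the client ID is a placeholder, and carrying &lt;code&gt;user_id&lt;/code&gt; in &lt;code&gt;state&lt;/code&gt; is my assumption):&lt;/p&gt;

```python
from urllib.parse import urlencode

GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
SHEETS_SCOPE = "https://www.googleapis.com/auth/spreadsheets"

def get_authorization_url(user_id: int, redirect_uri: str) -> str:
    # state round-trips the user_id, so the OAuth callback knows whose tokens arrived
    params = {
        "client_id": "YOUR_CLIENT_ID.apps.googleusercontent.com",  # placeholder
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": SHEETS_SCOPE,
        "access_type": "offline",  # required to receive a refresh token
        "prompt": "consent",
        "state": str(user_id),
    }
    return f"{GOOGLE_AUTH_ENDPOINT}?{urlencode(params)}"

print(get_authorization_url(42, "https://example.com/oauth/callback"))
```

&lt;p&gt;&lt;code&gt;access_type=offline&lt;/code&gt; matters: without it Google won't issue a refresh token, and the Sheets mirror would stop working as soon as the first access token expired.&lt;/p&gt;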

&lt;h2&gt;
  
  
  My Memory System: Two Stores for Different Jobs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Supabase&lt;/strong&gt; is my core memory: my chats, messages, and nutritional records all live there. It's fast and reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Sheets&lt;/strong&gt; is for me: I can see my data, build custom charts, and truly own it. But it's slower and sometimes fails, so it's a mirror, not the primary store.&lt;/p&gt;

&lt;p&gt;Here's how I ensure my spreadsheet exists before writing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ensure_spreadsheet_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Credentials&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_spreadsheet_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No spreadsheet config for my user_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ensure_valid_credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spreadsheet_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;create_spreadsheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;verify_spreadsheet_has_headers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;HttpError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;create_spreadsheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;spreadsheet_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dual-store approach balances reliability with my need for ownership. I get a spreadsheet I control, but the app doesn't break if Google has issues.&lt;/p&gt;
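&lt;p&gt;That write path can be sketched like this. Everything here is an illustrative stub (the helper names and the in-memory "store" are not the project's actual API): the primary write must succeed, while the mirror write is wrapped so a Sheets outage only produces a log line.&lt;/p&gt;

```python
import asyncio
import logging

logger = logging.getLogger(__name__)
primary_store: list[dict] = []  # stands in for the Supabase table

async def save_meal_record(user_id: int, meal: dict) -> None:
    # Core memory: this write must succeed or the whole operation fails.
    primary_store.append({"user_id": user_id, **meal})

async def append_meal_to_sheet(user_id: int, meal: dict) -> None:
    # Simulate a flaky third-party service.
    raise ConnectionError("Google Sheets is unavailable")

async def save_meal(user_id: int, meal: dict) -> None:
    await save_meal_record(user_id, meal)
    # Best-effort mirror: failures are logged, never surfaced to the user.
    try:
        await append_meal_to_sheet(user_id, meal)
    except Exception:
        logger.exception("Sheets mirror write failed for user_id=%s", user_id)

asyncio.run(save_meal(1, {"food": "arepa", "calories": 320}))
print(len(primary_store))  # prints 1: the meal is saved despite the mirror failing
```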

&lt;h2&gt;Same Brain, Different Ways to Chat&lt;/h2&gt;

&lt;p&gt;The agent is just a class. I can talk to it however I want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Telegram&lt;/strong&gt;: I message my bot; it normalizes my messages (text, photos, documents), downloads media, and calls the agent. I use webhooks to keep it responsive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web UI&lt;/strong&gt;: I built a simple web interface that hits the same agent API. It creates chats with &lt;code&gt;chat_type="external"&lt;/code&gt; so the agent doesn't care if I'm using Telegram or the web.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent interface is stable. I could add WhatsApp, SMS, or anything else without changing the core AI logic.&lt;/p&gt;
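&lt;p&gt;The adapter idea above can be sketched as follows; the message shape, field names, and the &lt;code&gt;NutritionAgent&lt;/code&gt; stub are assumptions for illustration, not the real interface:&lt;/p&gt;

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class IncomingMessage:
    user_id: int
    text: str
    chat_type: str                        # "telegram" or "external"
    media_paths: list = field(default_factory=list)

class NutritionAgent:
    """Stub standing in for the real agent: one entry point, any channel."""
    async def analyze(self, message: IncomingMessage) -> str:
        return f"analyzed {message.text!r} from {message.chat_type}"

def from_telegram(update: dict) -> IncomingMessage:
    # A real Telegram adapter would also download photos/documents here.
    return IncomingMessage(
        user_id=update["from"]["id"],
        text=update.get("text", ""),
        chat_type="telegram",
    )

def from_web(payload: dict) -> IncomingMessage:
    return IncomingMessage(
        user_id=payload["user_id"],
        text=payload["text"],
        chat_type="external",
        media_paths=payload.get("media", []),
    )

tg = from_telegram({"from": {"id": 7}, "text": "two eggs"})
web = from_web({"user_id": 7, "text": "two eggs"})
agent = NutritionAgent()
print(asyncio.run(agent.analyze(tg)))   # analyzed 'two eggs' from telegram
print(asyncio.run(agent.analyze(web)))  # analyzed 'two eggs' from external
```

&lt;p&gt;Adding WhatsApp or SMS would mean writing one more adapter function; the agent itself never changes.&lt;/p&gt;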

&lt;h2&gt;Tracing and Logging Saved My Sanity&lt;/h2&gt;

&lt;p&gt;I added &lt;code&gt;@traceable&lt;/code&gt; from LangSmith around the main &lt;code&gt;analyze&lt;/code&gt; method. Suddenly I could see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exactly what the model received from me&lt;/li&gt;
&lt;li&gt;Every tool call and its arguments&lt;/li&gt;
&lt;li&gt;Where errors happened and how long things took&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also log my user ID, spreadsheet IDs, and macros to debug production issues.&lt;/p&gt;
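&lt;p&gt;The real &lt;code&gt;@traceable&lt;/code&gt; ships with the &lt;code&gt;langsmith&lt;/code&gt; package and sends traces to LangSmith's servers; the rough stdlib stand-in below only illustrates the kind of data such a decorator captures per call: inputs, output, errors, and timing.&lt;/p&gt;

```python
import functools
import time

TRACES: list[dict] = []  # LangSmith stores the real equivalent server-side

def traceable(fn):
    """Record inputs, output, errors, and timing for every call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"name": fn.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            record["output"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            TRACES.append(record)
    return wrapper

@traceable
def analyze(text: str) -> str:
    # Stand-in for the agent's main entry point.
    return text.upper()

analyze("two eggs and coffee")
print(TRACES[0]["name"], TRACES[0]["output"])  # analyze TWO EGGS AND COFFEE
```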

&lt;p&gt;&lt;strong&gt;Real example&lt;/strong&gt;: When I built the Web UI, meal photos stopped working. In the traces, I saw the model wasn't receiving the images: the format was wrong. I fixed it in five minutes because the trace made it obvious.&lt;/p&gt;

&lt;h2&gt;What I Learned Building This for Myself&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Where agents are worth it&lt;/strong&gt;: When they orchestrate real tools and stateful systems (like a database, Sheets, and OAuth), not just when they chat. Each tool should map to a clear, real-world action I want to take.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What surprised me&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't need the most intelligent LLM to build a useful agent. A simple, well-written prompt and simple tools that capture the main features are often enough for a reliable, pleasant user experience.&lt;/li&gt;
&lt;li&gt;Context engineering is key. Understanding the tools and what information or context each tool provides is more important than loading the prompt with ultra-detailed instructions.&lt;/li&gt;
&lt;li&gt;Handling OAuth tokens, refresh flows, and "self-healing" spreadsheets (like recreating one if I accidentally delete it) was critical for making a reliable tool that depends on a third-party service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The main takeaway&lt;/strong&gt;: I've always loved building digital products that solve real problems; it's been my main career motivation. But this project was different. I had a personal problem, and I wasn't just building a "good enough" solution; I was able to build the perfect solution for my own needs. That gets me excited to build more and keep growing my skills with these new technologies.&lt;/p&gt;

&lt;p&gt;Starting with a no-code tool like n8n was great for testing ideas quickly. But for a product you might want to share or scale, investing in proper code architecture from the start saves you from rebuilding everything later.&lt;/p&gt;

&lt;p&gt;I can't say it was easy; I definitely leaned on my existing experience in software development. But it's a total game-changer. The way we can build products today is so different from even just a few years ago.&lt;/p&gt;

&lt;p&gt;The project is live at &lt;a href="https://nutriagent.juandago.dev" rel="noopener noreferrer"&gt;https://nutriagent.juandago.dev&lt;/a&gt; if you want to see what I built. The code is available on GitHub for the &lt;a href="https://github.com/juandastic/nutri-agent-bot" rel="noopener noreferrer"&gt;Agent&lt;/a&gt; and also for the &lt;a href="https://github.com/juandastic/nutri-agent-web" rel="noopener noreferrer"&gt;Web UI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heads up&lt;/strong&gt;: Since this is a personal project, my Google Cloud account isn't verified. If you try connecting your Google account, you'll get a scary warning screen (Google's way of handling unverified apps). I don't store your credentials; it's just for writing to your own Sheets, but the warning looks dramatic.&lt;/p&gt;

&lt;p&gt;This was my journey, but I'd love to hear your thoughts. I'm excited to start sharing more updates on this project and other things I'm building. Let's continue the conversation on &lt;a href="https://x.com/juandastic" rel="noopener noreferrer"&gt;X&lt;/a&gt; or connect on &lt;a href="https://www.linkedin.com/in/juandastic/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>tooling</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
