DEV Community

Scott Crawford

Posted on • Originally published at hifriendbot.com

Building a Cloud Memory System for AI Coding Assistants

I spent the last several months building a persistent memory system for Claude Code. Not a wrapper around a local vector database — a cloud-native, multi-layer cognitive memory engine designed to survive context compaction, scale across devices, and get smarter over time.

This post isn't about the product. It's about the engineering: the architectural decisions, the trade-offs, and what I learned about building memory for AI coding assistants.

The Problem Space

Building memory for an AI coding assistant is harder than it sounds. You're not building a database — you're building a system that needs to answer a question most databases can't: "What does this developer need to know right now?"

The challenges:

  • Volume asymmetry. A developer generates thousands of lines of conversation per day. Maybe 1% of that is worth remembering long-term. You need an extraction layer that separates signal from noise.
  • Retrieval relevance. Keyword search fails for memory. When Claude needs to know about your "database setup," it should find your PostgreSQL configuration decisions even if you never used those exact words. You need semantic search.
  • Temporal dynamics. A bug fix from yesterday is more important than a bug fix from three months ago. But a core architecture decision from three months ago is more important than both. You need a ranking system that accounts for both recency and importance.
  • Contradiction handling. Developers change their minds. You migrated from MySQL to PostgreSQL. The old "we use MySQL" memory is now wrong. You need conflict detection and resolution.
  • Context boundaries. Your React conventions shouldn't contaminate your Python project. But your preference for tabs over spaces should follow you everywhere. You need scoping with inheritance.

The 3-Layer Architecture

The system has three layers, each solving a different part of the problem:

Layer 1: AI Extraction     — What's worth remembering?
Layer 2: Semantic Storage  — How do we find it later?
Layer 3: Time-Aware Rank   — What matters most right now?

Each layer is independent and can be reasoned about separately.

Layer 1: AI Extraction

The extraction layer answers the hardest question: given a conversation between a developer and an AI assistant, what facts are worth storing permanently?

A naive approach would be to store everything. That fails fast — a 2-hour coding session generates thousands of tokens of conversation, most of which is transient (debugging output, file reads, exploratory questions). Storing all of it creates noise that drowns out signal during retrieval.

The key insight is that extraction is lossy by design. You want to lose most of the conversation. What survives should be concise, factual, complete sentences that stand on their own: "The database uses a custom prefix" or "CSS overrides require a specific selector pattern."

What Makes a Good Memory

Through testing, I found that effective memories share four properties:

  • Self-contained. The memory makes sense without reading the conversation it came from.
  • Specific. "We use PostgreSQL" is useful. "We discussed database options" is not.
  • Actionable. It should change Claude's behavior. A memory that directly prevents a mistake is worth ten memories that provide background context.
  • Stable. It shouldn't become outdated within a session. Architectural patterns are more valuable than version numbers.
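To make the four properties concrete, here's a minimal sketch of cheap pre-storage filters. The phrase lists and thresholds are illustrative assumptions on my part, not the system's actual rules (the real extraction layer is LLM-driven), but cheap heuristics like these can reject obviously bad candidates before anything hits storage:

```python
# Hypothetical post-extraction filters approximating the four properties.
VAGUE_PHRASES = ("we discussed", "we talked about", "looked into", "explored")
TRANSIENT_MARKERS = ("currently", "right now", "for today")

def is_storable(memory: str) -> bool:
    text = memory.strip()
    if len(text.split()) < 4:                # too short to be self-contained
        return False
    if not text[0].isupper() or text[-1] not in ".!":  # not a complete sentence
        return False
    lowered = text.lower()
    if any(p in lowered for p in VAGUE_PHRASES):       # fails "specific"
        return False
    if any(m in lowered for m in TRANSIENT_MARKERS):   # likely fails "stable"
        return False
    return True

print(is_storable("We use PostgreSQL with a custom table prefix."))  # True
print(is_storable("We discussed database options"))                  # False
```

Filters like this only catch gross failures; judging "actionable" still needs the extraction model itself.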

Layer 2: Semantic Storage and Search

Once a memory is extracted, it needs to be stored in a way that supports meaning-based retrieval.

Why Semantic, Not Keyword

A query for "how does auth work?" should find a memory about JWT tokens and cookie settings — even though the query and the memory share almost no keywords. This is the fundamental limitation of keyword search for memory systems. You need meaning-based retrieval.
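To see why keyword matching fails here, compare token overlap against vector similarity. The vectors below are hand-picked stand-ins, not real model output, just to illustrate the mechanism:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

query = "how does auth work?"
memory = "JWT tokens are stored in httpOnly cookies."

# Keyword overlap between query and memory: empty set.
overlap = set(query.lower().split()) & set(memory.lower().split())
print(overlap)  # set()

# With embeddings (illustrative vectors, not real model output),
# the two texts land close together in vector space:
q_vec = [0.82, 0.10, 0.55]
m_vec = [0.78, 0.05, 0.60]
print(cosine(q_vec, m_vec) > 0.95)  # True
```

In production the vectors come from an embedding model and live in a vector index, but the retrieval principle is exactly this: nearest neighbors by meaning, not by shared words.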

Why Cloud, Not Local

The most popular local memory solutions run embedding models and vector databases on the developer's machine. This works, but it has real costs:

  • RAM: In-process vector stores with local embeddings consume significant memory; community reports describe leaks exceeding 15 GB.
  • Startup latency: Loading a local embedding model adds seconds to every MCP server startup.
  • Platform fragility: Local runtimes have known issues across different platforms and architectures.
  • No sync: A local vector database is inherently single-machine.

Moving the heavy lifting server-side eliminates all of these. The MCP server becomes a thin HTTP client — stateless, lightweight, platform-agnostic. The trade-off is network dependency, which I consider acceptable for a developer tool.
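Here's what "thin HTTP client" means in practice. This is a sketch of the shape, not CogmemAi's actual API: the endpoint path, payload fields, and auth scheme are all hypothetical:

```python
import json

API_BASE = "https://memory.example.com/v1"  # placeholder endpoint

def build_search_request(query: str, project_id: str, api_key: str) -> dict:
    """Assemble the single HTTPS call an MCP tool handler would make.

    No local embedding model, no local vector store: the server does
    embedding, search, and ranking, and returns ready-to-use memories.
    """
    return {
        "url": f"{API_BASE}/memories/search",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"query": query, "project_id": project_id}),
    }

req = build_search_request("database setup", "proj_123", "sk-demo")
print(req["url"])
```

The actual send is one `urllib.request.urlopen` (or equivalent) away; the point is that the client holds no state and loads no model, so startup is instant on any platform.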

Layer 3: Time-Aware Ranking

Semantic similarity alone isn't enough. Given a query, you might find 50 relevant memories. Which ones should surface first?

The challenge is balancing multiple signals. A memory can be semantically relevant but old and unimportant. Another can be important and recent but semantically off-topic. The ranking system needs to weigh all of these factors and surface the best results for any given moment.

The key insight: importance and recency are both critical, but they interact in non-obvious ways. A recent low-importance memory might outrank an old low-importance one, but a core architecture decision from months ago should always surface when relevant — regardless of age.
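One way to capture that interaction is to let importance flatten the recency decay, so that important memories age slowly. The weights and half-life below are illustrative, not the system's actual tuning:

```python
# Illustrative ranking score: blends semantic similarity, importance,
# and an exponential recency decay whose half-life grows with importance.
def score(similarity: float, importance: float, age_days: float,
          half_life_days: float = 14.0) -> float:
    # High-importance memories decay more slowly.
    effective_half_life = half_life_days * (1 + 4 * importance)
    recency = 0.5 ** (age_days / effective_half_life)
    return 0.6 * similarity + 0.25 * importance + 0.15 * recency

# A 90-day-old core architecture decision still outranks yesterday's
# minor note when both match the query equally well:
old_core = score(similarity=0.85, importance=0.95, age_days=90)
fresh_minor = score(similarity=0.85, importance=0.20, age_days=1)
print(old_core > fresh_minor)  # True
```

The useful property of this shape is that each signal is an independent dial: you can retune the decay or the weights without touching extraction or search.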

Deduplication and Conflict Resolution

Returning to the earlier example: after a migration from MySQL to PostgreSQL, the old "we use MySQL" memory is simply wrong. A good memory system needs to handle both duplicates and contradictions gracefully, keeping the most accurate, most specific version while preserving a version trail so nothing is truly lost.

Compaction Recovery

Claude Code's auto-compaction is the single biggest threat to productive sessions. When context gets compressed, Claude loses its working state — which files it was looking at, what approach it was taking, what decisions were made earlier in the conversation.

CogmemAi solves this by automatically preserving and restoring context around compaction events. The result: you experience seamless continuity rather than a cold restart. Claude picks up right where it left off, with full awareness of your project and the current session.
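Conceptually, compaction recovery is a snapshot-and-restore cycle around the compaction event. The sketch below is my reading of the idea, not CogmemAi's internals; the state fields and hook points are assumptions:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical working-state snapshot, written before compaction
# and re-injected into context afterwards.
SNAPSHOT = Path(tempfile.gettempdir()) / "session_snapshot.json"

def preserve(state: dict) -> None:
    """Serialize working state before the context window is compressed."""
    SNAPSHOT.write_text(json.dumps(state))

def restore() -> dict:
    """Reload working state after compaction; empty dict if none exists."""
    return json.loads(SNAPSHOT.read_text()) if SNAPSHOT.exists() else {}

preserve({
    "open_files": ["api/auth.py"],
    "approach": "rotate the JWT signing secret",
    "decisions": ["keep refresh tokens server-side"],
})
print(restore()["approach"])
```

The hard part in practice isn't the serialization, it's deciding what counts as working state and triggering the preserve step reliably before compaction fires.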

Project Scoping

  • Project memories (scope: project) are tied to a specific repository. Architecture decisions, file structure notes, dependency constraints.
  • Global memories (scope: global) follow the developer everywhere. Identity, coding style, tool preferences.

At session start, the system blends both: project-specific memories for the current repo, plus global preferences. Cloning the same repo on a different machine maps to the same project automatically.
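One plausible way to make "same repo on a different machine" work is to derive the project identity from the normalized git remote URL instead of the local path. This is an assumption about the mechanism, sketched minimally (real normalization would also have to unify SSH and HTTPS remote forms):

```python
import hashlib

def project_id(remote_url: str) -> str:
    """Deterministic project identity from a git remote URL."""
    normalized = remote_url.strip().lower().removesuffix(".git")
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

# Two clones of the same repo resolve to the same project,
# regardless of where they live on disk:
a = project_id("git@github.com:acme/app.git")
b = project_id("git@github.com:acme/app")
print(a == b)  # True
```

Global memories then need no key at all; they're attached to the account, and the session-start blend is a union of the two scopes with project memories taking precedence on conflict.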

Lessons Learned

Extraction quality is everything. If you store low-quality memories, no amount of search sophistication will compensate. The most impactful optimization is always at the extraction layer.

Importance is subjective. What matters to one developer is noise to another. The system needs to learn from usage patterns over time, not just assign static scores.

Domain-specific content is harder than general content. CSS-specific memories often sound similar but are functionally different. Configuration details need tighter matching because small value differences are significant. This is an ongoing area of improvement.

Closing Thoughts

Building memory for AI coding assistants is fundamentally a retrieval problem disguised as a storage problem. Storing everything is easy. Surfacing the right 20 memories out of 2,000 when Claude needs them — that's where the engineering lives.

The three-layer approach (extract, embed, rank) gives you independent dials to tune. Bad extraction? Fix the extraction prompt without touching search. Irrelevant search results? Adjust the embedding model without touching extraction. Ranking feels off? Tune the weights without touching either upstream layer.

If you're building something similar, my advice: start with extraction quality. If you store garbage, no amount of search sophistication will find gold.


The system described here is CogmemAi, an open-source MCP server for Claude Code. The architecture is MIT-licensed and the documentation is here. I'm happy to discuss the technical details — feel free to open an issue or find me in the comments.
