Sri

Originally published at smara.io

How I Built a Persistent Memory Layer for AI Coding Tools

If you use AI coding assistants daily, you have felt this pain. You open a new session with Claude Code, Cursor, or Copilot, and you spend the first twenty minutes re-explaining your project structure, your preferences, the bug you fixed yesterday, the architectural decisions you made last week. The AI has no idea. Every session starts from absolute zero.

I started measuring this. In my own workflow, I was burning 20-25 minutes per session on context restoration alone. That is not the worst part. MCP servers — the tools that extend these AI assistants — consume tokens just by loading. I have watched 67,000 tokens disappear before I even typed my first prompt. That is roughly half the context window on most models, gone before any actual work begins.

Context fills up. The conversation dies. You start a new one. The cycle repeats.

Now multiply this across a team. Five developers, each running four AI sessions per day, each losing twenty minutes to context re-establishment. That is nearly seven hours of developer time evaporating every single day. Over a month, that is 140 hours — three and a half work weeks — spent telling AI tools things they already knew yesterday.

The problem is not that these tools are unintelligent. The problem is architectural. Each AI session is stateless by design. There is no persistence layer. There is no memory. Your Claude Code session and your Cursor session are completely isolated. A fact stored in one never reaches the other.

I wanted to fix this. Not by building yet another AI wrapper or a fancy prompt template, but by adding a proper memory layer that sits beneath every tool and persists across all of them. That project became Smara.

Why MCP Was the Right Protocol

When I started building Smara, I had three options for integrating with AI coding tools: fork each tool and patch memory in, build custom plugins for each platform, or find a universal protocol.

The Model Context Protocol (MCP) made the choice obvious. MCP is an open standard — originally created by Anthropic, now adopted by OpenAI, Google, and dozens of other tool vendors — that lets you extend any compatible AI tool without touching its source code. You write a server that exposes tools, and any MCP-compatible client can call them.

The key insight is that MCP's tool-calling model maps naturally to memory operations. An AI assistant already knows how to call tools. It already reasons about when to use grep versus read_file. Adding smara_store and smara_search to its toolkit requires zero behavioral changes. The AI decides when to store a memory and when to recall one, the same way it decides when to run a shell command.
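To make that concrete, here is a sketch of roughly what such a tool looks like to an MCP client. The field names follow the MCP tools/list response shape; the description text and the parameters are illustrative guesses, not Smara's actual schema.

```typescript
// Illustrative sketch: how a tool like smara_store might be advertised to an
// MCP client. Field names follow the MCP tools/list response shape; the
// description and parameters are guesses, not Smara's actual schema.
interface McpToolDescriptor {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
}

const smaraStore: McpToolDescriptor = {
  name: "smara_store",
  description: "Save a fact to persistent memory with an importance score.",
  inputSchema: {
    type: "object",
    properties: {
      fact: { type: "string", description: "The fact to remember" },
      importance: { type: "number", minimum: 0, maximum: 1 },
    },
    required: ["fact"],
  },
};
```

From the model's point of view this is just another tool: it sees a name and a schema, and decides to call it the same way it decides to run grep.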

Custom plugins would have locked me into a single platform. API wrappers would have required explicit integration code in every tool. MCP gave me write-once, run-everywhere. One server, one npm install, and suddenly Claude Code, Cursor, Windsurf, and any future MCP-compatible tool all share the same memory pool.

The ecosystem validated this bet. MCP SDK downloads have crossed 97 million per month. There are over 10,000 MCP servers in the wild. When OpenAI and Google both adopted the protocol within months of Anthropic publishing it, it was clear that MCP was not a niche experiment — it was becoming the standard interface between AI tools and the outside world.

Building on MCP also meant I could focus entirely on what mattered: the memory engine itself. No platform-specific glue code. No maintaining five different plugin formats. Just one clean protocol.

Designing the 7 Memory Tools

Early prototypes of Smara had two tools: store and search. It worked, technically. But the AI made poor decisions about when to use them. A blunt interface produces blunt behavior.

The production MCP server exposes seven tools, each with a specific semantic purpose:

  • smara_store — Save a new fact to memory with importance scoring
  • smara_search — Semantic search across stored memories
  • smara_recall — Load context at conversation start (top memories by decay score)
  • smara_forget — Explicitly remove a memory that is outdated or wrong
  • smara_list — Browse memories with filtering (by source, namespace, date)
  • smara_tag — Organize memories with labels for better retrieval
  • smara_relate — Create explicit connections between related memories

Why seven instead of two? Granularity improves AI decision-making. When the AI sees smara_forget, it understands it can correct mistakes. When it sees smara_relate, it can link a debugging session to an architectural decision. These are not just CRUD operations — they are cognitive primitives that map to how an intelligent agent should interact with memory.

Adding Smara to your AI tool takes one config block. Here is the MCP configuration for Claude Code:

{
  "smara": {
    "command": "npx",
    "args": ["-y", "@smara/mcp-server"],
    "env": {
      "SMARA_API_KEY": "smara_your_key_here"
    }
  }
}

Drop that into ~/.claude/mcp_config.json (or .cursor/mcp.json for Cursor), restart, and memory is live. Context loads automatically at conversation start. New facts are stored silently during normal work. The AI handles the memory management itself — you do not need to think about it.

Ebbinghaus Decay Scoring

Most memory systems rank results by recency or by vector similarity. Recency is naive — a two-week-old architectural decision matters more than this morning's typo fix. Pure similarity misses temporal context — it cannot distinguish between a current fact and an obsolete one.

Smara uses Ebbinghaus forgetting curve decay, modeled after how human memory actually works. Every memory's relevance decays exponentially over time, modulated by its importance:

export function ebbinghaus(createdAt: Date, importance: number): number {
  // Age of the memory in days.
  const days = (Date.now() - createdAt.getTime()) / (1000 * 60 * 60 * 24);
  // Importance stretches the decay timescale: 1 day at importance <= 0.1,
  // 10 days at importance 1.0.
  const halfLife = Math.max(importance, 0.1) * 10;
  // Exponential decay; the score halves roughly every 0.7 * halfLife days.
  return Math.exp(-days / halfLife);
}

A high-importance memory (importance: 1.0) decays on a roughly ten-day timescale, which works out to a half-life of about a week. A low-importance memory (importance: 0.1) fades in about a day. This means "the production database uses PostgreSQL 15 on port 5432" stays strong for weeks, while "I renamed a variable in utils.ts" naturally fades away.

The critical mechanism: memories strengthen on access. Every time the AI retrieves a memory, Smara bumps its access_count and resets its last_accessed_at timestamp. Frequently used facts stay fresh. Forgotten facts decay. This is exactly how human memory works — rehearsal strengthens neural pathways.
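A minimal sketch of that loop, based on my reading of the mechanism rather than Smara's actual source (the field names are illustrative): decay is clocked from the last access instead of creation, so every retrieval resets the forgetting curve.

```typescript
// Sketch of access-based strengthening (illustrative, not Smara's source).
// Decay is clocked from the last access, so retrieval resets the curve.
interface MemoryRow {
  fact: string;
  importance: number;      // 0..1
  accessCount: number;
  lastAccessedAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function decayScore(m: MemoryRow, now: Date): number {
  const days = (now.getTime() - m.lastAccessedAt.getTime()) / DAY_MS;
  // Same importance-scaled timescale as the ebbinghaus() function above.
  return Math.exp(-days / (Math.max(m.importance, 0.1) * 10));
}

function touch(m: MemoryRow, now: Date): void {
  // Retrieval is rehearsal: bump the counter and reset the decay clock.
  m.accessCount += 1;
  m.lastAccessedAt = now;
}
```

A high-importance memory untouched for ten days scores about 0.37; a single retrieval brings it back to 1.0.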

Search results are ranked by a blend of semantic similarity and temporal decay:

export function blendScore(similarity: number, decayScore: number): number {
  return similarity * 0.7 + decayScore * 0.3;
}

The 70/30 blend means relevance still dominates — you get the fact that best matches your query — but recency and access patterns break ties. A moderately relevant memory that you accessed yesterday beats a slightly more relevant memory you have not touched in a month.
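A quick worked example with made-up scores shows the tie-breaking:

```typescript
// Worked example with made-up scores: a slightly less similar but recently
// accessed memory outranks a more similar but stale one.
const blend = (similarity: number, decay: number) => similarity * 0.7 + decay * 0.3;

const freshMemory = blend(0.80, 0.95); // accessed yesterday      -> 0.845
const staleMemory = blend(0.85, 0.10); // untouched for a month   -> 0.625
// freshMemory > staleMemory: the rehearsed fact wins despite lower similarity.
```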

As far as I can tell, no competitor in this space uses Ebbinghaus decay scoring. Mem0, Zep, and Supermemory all rank by flat recency or pure vector similarity. Temporal decay is what makes Smara's memory feel natural rather than mechanical.

Making It Work Across Teams

Individual memory was the easy part. The hard problem — the one I underestimated by weeks — was shared team memory.

The challenge is not technical complexity. It is semantic complexity. When developer A stores "the auth service uses JWT with RS256," should developer B's AI session see that? Probably yes. When developer A stores "I prefer tabs over spaces," should developer B's session see that? Absolutely not.

Smara solves this with a namespace and visibility model. Every memory has a namespace (like default, project-api, infra) and a visibility setting (private or team). When a memory is stored with a team_id, it is automatically visible to all team members in that namespace. Private memories stay private.

The search engine covers both pools in a single query:

SELECT id, fact, importance, created_at, source, namespace,
       1 - (embedding <=> $1::vector) AS similarity
FROM memories
WHERE tenant_id = $2
  AND namespace = $4
  AND valid_until IS NULL
  AND (
    (user_id = $3 AND team_id IS NULL)
    OR (team_id = $5 AND visibility = 'team')
  )

Your private memories and your team's shared memories are searched together, ranked by the same blended score. The AI does not need to know or care about the distinction — it just gets the most relevant facts.
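Once both pools come back from the database, the merge is a single sort. A sketch of the unified ranking, with illustrative types rather than Smara's actual source:

```typescript
// Sketch of the unified ranking: private and team rows flow through the same
// blended scorer, so the caller sees one list. Types are illustrative.
interface Candidate {
  fact: string;
  similarity: number;              // cosine similarity from pgvector
  decay: number;                   // Ebbinghaus decay score
  scope: "private" | "team";
}

function rankMemories(pool: Candidate[], limit = 5): Candidate[] {
  const score = (c: Candidate) => c.similarity * 0.7 + c.decay * 0.3;
  return [...pool].sort((a, b) => score(b) - score(a)).slice(0, limit);
}
```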

Here is the scenario that makes this powerful: your teammate debugs a gnarly API rate-limiting issue on Monday. Their Claude session stores the root cause and the fix as team-scoped memories. On Tuesday, you hit a related issue. Your Claude session automatically recalls your teammate's findings. No Slack thread to search. No knowledge base article to write. The team's AI tools build collective knowledge passively, as a byproduct of normal work.

Team management is handled through a full REST API — create teams, invite members by email, assign roles (admin, member, read-only), and set per-team memory limits. Deduplication and contradiction detection are scoped per team namespace, so team memories stay clean even as multiple people contribute.

Architecture and Self-Hosting

Smara's architecture has three layers.

The MCP server (@smara/mcp-server on npm) is a lightweight TypeScript process that runs locally alongside your AI tool. It translates MCP tool calls into REST API calls. No state, no database, no dependencies beyond Node.js. Install with npx -y @smara/mcp-server and point it at any Smara-compatible backend.

The hosted API (api.smara.io) is a Fastify application running on Railway. It handles authentication, rate limiting, billing, and the core memory operations. PostgreSQL with pgvector stores the memories and their embeddings. Voyage AI (voyage-3, 1024 dimensions) generates the embeddings for semantic search. The API auto-migrates on startup, creating tables, indexes, and extensions automatically.

The storage layer is PostgreSQL with pgvector. Memories are stored with their vector embeddings, importance scores, decay metadata, source tags, namespace labels, team associations, and soft-delete timestamps. Deduplication uses cosine similarity bands: >= 0.985 is a true duplicate (skip), 0.94-0.985 is a contradiction (soft-delete old, store new), below 0.94 is a genuinely new fact.
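Those bands boil down to a three-way decision. A minimal sketch using the thresholds quoted above (the action names are my own labels, not Smara's):

```typescript
// Sketch of the dedup decision using the cosine-similarity bands described
// above (0.985 and 0.94); the action names are illustrative labels.
type DedupAction = "skip_duplicate" | "supersede_old" | "store_new";

function classifyCandidate(cosineSimilarity: number): DedupAction {
  if (cosineSimilarity >= 0.985) return "skip_duplicate"; // true duplicate: drop the new fact
  if (cosineSimilarity >= 0.94) return "supersede_old";   // contradiction: soft-delete old, store new
  return "store_new";                                     // genuinely new fact
}
```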

Self-hosting takes five minutes:

git clone https://github.com/smara-io/api.git
cd api
VOYAGE_API_KEY=your-key docker compose up -d

That gives you the full API on localhost:3011 with a local PostgreSQL instance. Point the MCP server at your self-hosted instance by setting SMARA_API_URL:

{
  "smara": {
    "command": "npx",
    "args": ["-y", "@smara/mcp-server"],
    "env": {
      "SMARA_API_KEY": "smara_your_key_here",
      "SMARA_API_URL": "http://localhost:3011"
    }
  }
}

Everything is MIT-licensed. The hosted service exists for convenience and for teams that do not want to manage infrastructure. The self-hosted path exists for developers who want full control over their data.

Pricing a Developer Tool as a Solo Founder

Pricing a developer tool is an exercise in controlled anxiety. Price too high and nobody tries it. Price too low and you cannot sustain the infrastructure. Price with a free tier that is too generous and you fund everyone's usage out of pocket.

I looked at the competitive landscape. Mem0 charges $249/month for their Pro tier. Zep starts at $25 and climbs to $475. Supermemory charges $399 for teams. These prices make sense for enterprise buyers with procurement budgets. They are prohibitive for indie developers and small teams — exactly the people who need memory the most, because they cannot afford to waste time on context re-establishment.

Smara's tiers:

Plan        Memories    Price
Free        10,000      $0/month
Developer   200,000     $19/month
Pro         2,000,000   $99/month

The free tier is genuinely usable. 10,000 memories covers months of individual use. The goal is not to frustrate free users into upgrading — it is to let them experience the product fully and upgrade when they need team features or higher volume.

The $19 Developer tier hits the "expense it without approval" threshold at most companies. The $99 Pro tier costs well under half of Mem0's comparable plan, with more memories included.

The mental model behind this pricing: memory is infrastructure, not a luxury. It should cost about the same as a database or a monitoring tool — present in the budget, never the biggest line item. If a developer saves twenty minutes per session and runs four sessions per day, that is more than twenty-five hours saved per month. At any reasonable hourly rate, $19 pays for itself before lunch on day one.

What's Next

Smara v1.2 is in active development with three focus areas.

Teams is shipping first. The API already supports team creation, invitation flows, role-based access, and shared memory search (as shown in the code above). The MCP server update will add team-aware tools so that AI sessions can store and retrieve team knowledge seamlessly.

AI Agents comes next. Agents are identities with their own memory namespace, system prompt, and composable skills. Think of a code-review agent that remembers your team's style guide, or an onboarding agent that accumulates institutional knowledge from every new hire's questions. Eight built-in skills are planned: code review, PR review, doc writing, test generation, deploy checklist, onboarding, architecture advising, and security audit.

IDE-native memory panels will give you visibility into what your AI tools remember — browse, search, edit, and delete memories from a visual interface instead of relying on the AI to manage everything silently.

The longer-term vision is straightforward: every AI tool you use should share one brain. Not just coding assistants — any AI agent, any workflow, any platform. Smara is the memory layer that makes that possible.


If you want to try it, the fastest path is three steps:

  1. Get a free API key at smara.io
  2. Add the MCP config to your Claude Code or Cursor setup
  3. Start working — memory happens automatically

I am building Smara as a solo founder. If you have questions about the architecture, the MCP integration, or the Ebbinghaus scoring model, I am @parallelromb — happy to talk.
