Enrique B.

Posted on May 19

Your AI Agent is Stuck in a Loop. Here's the Memory Layer That Breaks It and Saves You Money

#ai #mcp #opensource #deepseek

Every time you open a new chat in Cursor, VS Code, Antigravity and even Claude Desktop, you paste your codebase back in. Or you let the IDE do it automatically, same result. You're burning context tokens on files the agent already "knew" ten minutes ago in a different window. The agent re-reads my_models.py, your_script.js, their_script.ts, index.html and the three service files it touched yesterday, just to orient itself before answering your actual question.

That's not a context window problem. It's a memory architecture problem. There's no persistence layer between sessions or IDEs. The fix isn't a bigger context window; it's not sending the same data twice.

In a real test session with Copilot, after three exchanges with four unnecessary re-queries, zerikai_memory's tool results accounted for just 1.6% of a 128K context window. A raw file-chunk retrieval setup doing the same work would burn ~40% of that window before you got to your actual question. On the DeepSeek side, locking the project brief as a fixed system prefix means every query hits the KV cache at $0.0028/M tokens instead of $0.14/M, a 50x cost reduction on the largest token block per call.

That's the exact problem zerikai_memory was built to solve. (Open source)

Hi, I'm Enrique, an AI software engineer with over 30 years of tech experience. I built zerikai_memory after diagnosing a structural bottleneck in my own workflows: context loss between sessions, redundant re-injection across IDEs, and rising per-query costs that compound with every sloppy retrieval. The rising cost of AI IDE subscriptions made the economics impossible to ignore, but the underlying architecture problem was already there. zerikai_memory is the engineering response to that diagnosis, built for developers who want precision and cost control.

What It Actually Does

zerikai_memory is a local MCP server. Any IDE that supports the Model Context Protocol (Cursor, pi, Antigravity, Copilot, and even Claude Desktop) can connect to it. When you run scan_workspace, it walks your project using tree-sitter to parse source files into individual code entities: functions, classes, methods, and HTML components.

Each entity gets its own embedding in a local ChromaDB collection, along with structured metadata: language, return type, parent class, line numbers, param count, decorators, and docstring presence.

Entity-level granularity is the part that matters. Most RAG setups (including the ones built into commercial IDEs) chunk files by token count or line count, which means a function might get split across two chunks, or buried under 500 lines of imports. This is a corner being cut at your expense: you pay for the tokens, the agent gets noise. zerikai_memory treats each function as the atomic unit instead. extract_entities in code_indexer.py returns a list of CodeEntity objects, one per parsed symbol: clean signal, no noise.
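To make the "one entity per symbol" idea concrete, here is a minimal sketch. It is not the code in code_indexer.py: it swaps tree-sitter for Python's built-in ast module so it runs with no dependencies, it only handles Python source, and the CodeEntity fields shown are assumptions based on the metadata listed above.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CodeEntity:
    # Hypothetical shape; the real CodeEntity may carry different fields.
    name: str
    kind: str                # "function", "class", or "method"
    start_line: int
    end_line: int
    parent: Optional[str]    # enclosing class name, if any
    has_docstring: bool


def extract_entities(source: str) -> List[CodeEntity]:
    """Return one CodeEntity per top-level function, class, and class method."""
    entities: List[CodeEntity] = []

    def visit(node: ast.AST, parent: Optional[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if parent else "function"
            elif isinstance(child, ast.ClassDef):
                kind = "class"
            else:
                continue
            entities.append(CodeEntity(
                name=child.name,
                kind=kind,
                start_line=child.lineno,
                end_line=child.end_lineno,
                parent=parent,
                has_docstring=ast.get_docstring(child) is not None,
            ))
            if isinstance(child, ast.ClassDef):
                visit(child, child.name)  # descend into classes to pick up methods

    visit(ast.parse(source), None)
    return entities


SAMPLE = '''import os

def load(path):
    """Read a file."""
    return open(path).read()

class Indexer:
    def scan(self, root):
        return os.listdir(root)
'''

entities = extract_entities(SAMPLE)
for entity in entities:
    print(entity)
```

Each entity carries its own line range, so a retrieval can return just that span instead of a fixed-size chunk that may start mid-function.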

When your agent queries memory, along with the raw answer, it gets inline source citations back:

Sources: `extract_entities` #code_indexer.py:184 (0.35)
`_extract_js_function` #code_indexer.py:595 (0.81)
`_extract_js_like` #code_indexer.py:543 (0.89)
`CodeEntity` #code_indexer.py:156 (0.83)
`_extract_html` #code_indexer.py:907 (0.84)

(raw answer synthesized from those sources)

The IDE agent doesn't hunt for the function; the citations give it the exact file, line number, and L2 distance score. #code_indexer.py:184 is plain text that renders in every IDE and is clickable in VS Code Copilot, so you can jump straight to the source too. No additional search, no extra API call.
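If you want to consume those citations in your own tooling, the format is regular enough to parse. The sketch below is an assumption about the format based only on the sample output above; the Citation type and parse_citations helper are hypothetical names, not part of zerikai_memory.

```python
import re
from typing import List, NamedTuple


class Citation(NamedTuple):
    symbol: str
    file: str
    line: int
    distance: float


# Matches: `symbol` #file:line (distance)
CITATION_RE = re.compile(
    r"`(?P<symbol>[^`]+)`\s+#(?P<file>[^\s:]+):(?P<line>\d+)\s+"
    r"\((?P<distance>\d+(?:\.\d+)?)\)"
)


def parse_citations(text: str) -> List[Citation]:
    """Extract every citation found in a tool response."""
    return [
        Citation(m["symbol"], m["file"], int(m["line"]), float(m["distance"]))
        for m in CITATION_RE.finditer(text)
    ]


SAMPLE = """Sources: `extract_entities` #code_indexer.py:184 (0.35)
`_extract_js_function` #code_indexer.py:595 (0.81)
`_extract_js_like` #code_indexer.py:543 (0.89)
`CodeEntity` #code_indexer.py:156 (0.83)
`_extract_html` #code_indexer.py:907 (0.84)
"""

citations = parse_citations(SAMPLE)
closest = min(citations, key=lambda c: c.distance)
print(f"{closest.symbol} at {closest.file}:{closest.line}")
```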

The Indexing Cost Is Zero

tree-sitter is a deterministic local parser. It runs entirely on your machine, no API calls, no LLM involvement. Parsing and embedding your codebase costs $0.00.

The only thing that costs money is the project brief: a 9-section structured summary (Overview, Technical Stack, Core Architecture, Primary Conventions, Purpose, Key Files, Data Flow, Development & Testing, Future Roadmap). That brief is generated once after the first scan using either the DeepSeek API or Ollama. With DeepSeek, it's a few cents. With Ollama, it's free. After that, the brief is locked and cached.

Why lock it? Because the brief is the fixed prefix of every system message sent to DeepSeek. DeepSeek's KV cache means identical prefixes are stored server-side after the first call, so subsequent queries hit the cache at $0.0028/M tokens instead of $0.14/M tokens. That's a 50× cost reduction on the largest token block per query. Intentionally keeping the brief stable is what makes that discount work consistently.
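Here is the arithmetic as a quick sketch. The two prices are the ones quoted above; the brief size and query count are made-up numbers for illustration, and the model assumes the cached prefix never expires between queries, which a real cache will not guarantee.

```python
# Prices quoted in this post, in USD per million input tokens.
CACHE_MISS_PER_M = 0.14
CACHE_HIT_PER_M = 0.0028


def prefix_cost(brief_tokens: int, queries: int, cached: bool) -> float:
    """Cost of sending the brief as the system prefix on every query."""
    if queries <= 0:
        return 0.0
    if not cached:
        # A prefix that changes every call never hits the cache.
        return brief_tokens * queries * CACHE_MISS_PER_M / 1_000_000
    # With a locked prefix, only the first call pays the cache-miss price.
    first = brief_tokens * CACHE_MISS_PER_M / 1_000_000
    rest = brief_tokens * (queries - 1) * CACHE_HIT_PER_M / 1_000_000
    return first + rest


BRIEF_TOKENS = 3_000  # hypothetical brief size
QUERIES = 1_000       # hypothetical workload

uncached = prefix_cost(BRIEF_TOKENS, QUERIES, cached=False)
cached = prefix_cost(BRIEF_TOKENS, QUERIES, cached=True)

print(f"unstable prefix: ${uncached:.4f}")
print(f"locked prefix:   ${cached:.4f}")
print(f"per-token discount: {CACHE_MISS_PER_M / CACHE_HIT_PER_M:.0f}x")
```

The overall saving lands slightly under 50x because the first call always pays full price; the longer the brief stays stable, the closer you get.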

If you change the architecture or ship a large feature, you can regenerate the brief by asking the MCP server to run update_brief.

Auto-Routing: Paying Only for What Needs It

Not every query deserves a cloud LLM. A lookup like "where is process_data defined?" is a short, specific question that Ollama handles fine locally. An architectural question like "explain how the query routing pipeline decides between DeepSeek v4-flash and v4-pro" touches design decisions, and that earns the cloud call.

The routing logic in _should_use_cloud() runs a 4-step priority chain on every incoming query:

Explicit override: use_cloud=True or use_cloud=False passed directly; short-circuits everything else.
Architectural keyword escalation: if the query contains design/architecture terms (configured in config.py), route to DeepSeek.
Length escalation: queries of 40 words or more route to DeepSeek.
Default fallback: use whatever MEMORY_MODE is set to in .env (local, cloud, or hybrid).
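The four steps above can be sketched roughly as follows. This is an illustration of the priority order, not the body of _should_use_cloud(): the keyword set is hypothetical (the real list lives in config.py), and I'm assuming that only MEMORY_MODE=cloud sends a default query to DeepSeek. How local mode suppresses the escalation steps is not modeled here.

```python
import re
from typing import Optional

# Hypothetical keyword set; the real terms are configured in config.py.
ARCH_KEYWORDS = {"architecture", "design", "pipeline", "refactor"}
LENGTH_THRESHOLD = 40  # words, as described above


def should_use_cloud(
    query: str,
    use_cloud: Optional[bool] = None,
    memory_mode: str = "hybrid",
) -> bool:
    # 1. Explicit override short-circuits everything else.
    if use_cloud is not None:
        return use_cloud
    # 2. Architectural keyword escalation.
    tokens = set(re.findall(r"[a-z0-9_-]+", query.lower()))
    if tokens & ARCH_KEYWORDS:
        return True
    # 3. Length escalation.
    if len(query.split()) >= LENGTH_THRESHOLD:
        return True
    # 4. Default fallback (assumption: only "cloud" routes to DeepSeek).
    return memory_mode == "cloud"


print(should_use_cloud("where is process_data defined?"))
print(should_use_cloud("explain the architecture of the router"))
```

Ordering the checks this way keeps the cheap decisions first: an explicit flag costs nothing to evaluate, and the mode setting only matters when nothing else has escalated the query.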

Within the DeepSeek path, _select_model picks between v4-flash and v4-pro based on query complexity. Short cloud queries get the cheaper model; heavier ones get the capable one. You're not paying v4-pro rates for a function lookup.

Tip: start in cloud mode (DeepSeek API), then you can try hybrid and then local (Ollama).

Lexical Re-ranking: Fixing the Semantic Similarity False Positive Problem

Semantic similarity alone has a well-known failure mode: a generic function that describes the same concept as your query can outscore the specific function you actually want. A file summary mentioning "parsing" might beat parse_entities() in embedding space if the embeddings aren't tight.

zerikai_memory adds a re-ranking pass on top of ChromaDB's semantic results. After distance filtering removes weak matches, survivors are re-scored by weighted keyword overlap in entity names and docstrings:

score = (1 / distance) + (keyword_hits × LEXICAL_RERANK_WEIGHT)

A function named extract_entities will outrank a vague file-level summary that happened to embed close to the query. No results are dropped; it's a pure reorder. The pass is toggled via ENABLE_LEXICAL_RERANK in .env, and LEXICAL_RERANK_WEIGHT controls how much each keyword hit counts. It's off by default; turn it on if you're seeing generic summaries crowding out specific function hits.
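Here is a minimal sketch of that formula in action. The Hit type, the tokenizer, the way keyword hits are counted, and the weight value are all assumptions for illustration; only the scoring formula itself comes from the description above.

```python
import re
from dataclasses import dataclass
from typing import List, Set

LEXICAL_RERANK_WEIGHT = 0.5  # hypothetical value


@dataclass
class Hit:
    name: str
    docstring: str
    distance: float  # smaller means semantically closer


def _terms(text: str) -> Set[str]:
    # Splits on anything non-alphanumeric, so extract_entities -> {extract, entities}.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_hits(query: str, hit: Hit) -> int:
    """Count distinct query terms found in the entity name or docstring."""
    return len(_terms(query) & (_terms(hit.name) | _terms(hit.docstring)))


def score(query: str, hit: Hit) -> float:
    # score = (1 / distance) + (keyword_hits x LEXICAL_RERANK_WEIGHT)
    inverse_distance = 1.0 / max(hit.distance, 1e-9)  # guard against distance 0
    return inverse_distance + keyword_hits(query, hit) * LEXICAL_RERANK_WEIGHT


def rerank(query: str, hits: List[Hit]) -> List[Hit]:
    """Pure reorder: every hit survives, only the order changes."""
    return sorted(hits, key=lambda h: score(query, h), reverse=True)


hits = [
    Hit("file_summary", "Overview of parsing helpers.", 0.30),
    Hit("extract_entities", "Parse source into entities.", 0.35),
]
ranked = rerank("extract entities", hits)
print([h.name for h in ranked])
```

In this toy case the summary is semantically closer (0.30 vs 0.35), but two keyword hits are enough to lift extract_entities above it.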

The reranking test in the inline citations example above shows it working:

VS Code Query:

{
  "workspace": "zerikai_memory",
  "user_query": "extract function for parsing code entities from source files. Show me the Sources table."
}

Returns extract_entities at distance 0.35 as the top hit, with the JS and HTML extractors behind it. Notably, _extract_markdown didn't appear despite sharing vocabulary, which is exactly the false positive problem the re-ranker exists to solve.

Context Window Impact

Entity-level indexing matters most when the agent is sloppy. In a test session with Copilot, after three back-and-forth exchanges (including four re-queries it didn't need), the context window sat at 17.1K of 128K tokens, just 13%. The tool results from zerikai_memory accounted for 1.6% of the window.

If zerikai_memory returned raw file chunks instead of individual entities, a single retrieval might pull 500 lines of a file to find one function. Twelve retrievals across three turns could burn north of 40% of the window before you even got to your actual work.

| | Entity-level (zerikai_memory, tested) | Raw file chunks (estimated) |
| --- | --- | --- |
| Tool results after 3 turns, 4 calls | 1.6% of window (~2K tokens) | ~40% of window (~50K tokens) |
| Space left for actual work | ~111K tokens | ~28K tokens |
| Agent re-queries | Negligible cost | Amplifies the burn |

When the agent re-queries four times because it summarized too aggressively, entity-level indexing means it burned 800 tokens, not 50,000. The real cost of a bad agent layer is multiplied by how much data each retrieval returns (one of the reasons I migrated to pi.dev, but that's another write-up).
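If you want to check the percentages against your own context window, the conversion is a one-liner each way. The helper names are mine; the 128K window and the 1.6%, 40%, and 17.1K figures are the ones from the test session above.

```python
WINDOW_TOKENS = 128_000  # context window size in the test session


def tokens_for_share(share: float, window: int = WINDOW_TOKENS) -> int:
    """Convert a fraction of the window into a token count."""
    return round(share * window)


def share_of_window(tokens: int, window: int = WINDOW_TOKENS) -> float:
    """Convert a token count into a fraction of the window."""
    return tokens / window


entity_level = tokens_for_share(0.016)  # measured tool-result share
raw_chunks = tokens_for_share(0.40)     # estimated raw-chunk share

print(f"entity-level tool results: {entity_level} tokens")
print(f"raw-chunk tool results:    {raw_chunks} tokens")
print(f"window used in session:    {share_of_window(17_100):.1%}")
```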

Small retrievals save you twice: they stretch your IDE's token quota and they shrink your DeepSeek bill. Pay for the brief once. Query your own code for free.

How Agents Display Source Citations

zerikai_memory returns the full #file:line (distance) in every response, but each agent vendor decides how much of it to surface. The same query, same tool, four different agents:

Example: "Sources: _truncate_for_brief #main.py:401 (0.62)"

| Agent | Calls | Showed L2 | Notes |
| --- | --- | --- | --- |
| Antigravity (Gemini) | 1 | Yes | Re-rendered inline text as a table |
| Claude Desktop | 1 | Yes | Showed full citation in body |
| Copilot | 5 | No | Re-queried 4x, hid distance from user until asked |
| Pi (inside workspace) | 1 | Yes | One clean call when in the project path |

The data is always there in context; the agent uses #file:line (distance) for reasoning regardless.

Ask: "show me the full sources with distances" and it surfaces the complete line. What you see is filtered by your agent's display layer, not by zerikai_memory.

Workspace Isolation and `.memignore`

Every project gets its own ChromaDB sub-collection, its own SQLite records, and its own brief file. Queries for project_a don't pull from project_b. Workspaces are identified by normalized filesystem path with deterministic UUIDs, so the same project opened from different IDE windows, or through differently written paths to the same folder, resolves to the same memory store.
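One way to get that behavior is a name-based UUID over the normalized path. This is a sketch of the idea, not the project's actual scheme: the workspace_id helper, the normalization steps, and the choice of uuid.NAMESPACE_URL as the namespace are all assumptions.

```python
import os
import uuid


def workspace_id(path: str) -> uuid.UUID:
    """Derive a stable ID from a workspace path.

    Normalizing first means trailing slashes, "./" prefixes, ".." segments,
    and symlinks all collapse to the same canonical path.
    """
    normalized = os.path.normcase(os.path.realpath(os.path.expanduser(path)))
    # uuid5 is deterministic: the same namespace and name always give the same UUID.
    return uuid.uuid5(uuid.NAMESPACE_URL, normalized)


print(workspace_id("project_a"))
print(workspace_id("./project_a/"))  # same ID as the line above
print(workspace_id("project_b"))     # different ID
```

Because the ID is derived rather than stored, any IDE window can recompute it from the path alone without consulting a registry first.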

Before any file reaches the indexer, it is checked against .memignore, and anything that matches is skipped. The file works the same way as .gitignore: glob patterns, # comments, blank lines ignored. node_modules/, .git/, venv/, compiled output: configure it once per workspace, and it's enforced on every scan.
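A simplified sketch of that filter is below. It covers only the basics mentioned above (comments, blank lines, directory patterns, filename globs) using the standard fnmatch module; it does not implement full .gitignore semantics such as negation with ! or ** wildcards, and the function names are hypothetical.

```python
import fnmatch
from pathlib import PurePosixPath
from typing import List


def load_patterns(text: str) -> List[str]:
    """Parse .memignore content, skipping blank lines and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(rel_path: str, patterns: List[str]) -> bool:
    """Return True if a workspace-relative POSIX path matches any pattern."""
    path = PurePosixPath(rel_path)
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: match any parent directory of the file.
            dirname = pattern.rstrip("/")
            if any(fnmatch.fnmatchcase(part, dirname) for part in path.parts[:-1]):
                return True
        elif fnmatch.fnmatchcase(path.name, pattern) or fnmatch.fnmatchcase(rel_path, pattern):
            return True
    return False


MEMIGNORE = """# dependencies
node_modules/

*.pyc
venv/
"""

patterns = load_patterns(MEMIGNORE)
for candidate in ["node_modules/lib/index.js", "src/app.py", "src/__pycache__/app.pyc"]:
    print(candidate, is_ignored(candidate, patterns))
```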

Scans are idempotent. Re-scanning the same file overwrites its existing records using deterministic hashing. Stale entries (files deleted or added to .memignore since the last scan) are automatically purged. Run scan_workspace as often as you want; no duplicates accumulate, no ghost records from renamed files.
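The pattern behind idempotent scans is an upsert keyed by a deterministic ID, followed by a purge of anything not seen in the current pass. The sketch below uses a plain dict as a stand-in for one workspace's store; what the real implementation hashes, and how it talks to ChromaDB and SQLite, are not shown here and the names are hypothetical.

```python
import hashlib
from typing import Dict, List


def record_id(workspace: str, rel_path: str, symbol: str) -> str:
    """Same inputs always produce the same ID, so re-scans overwrite in place."""
    key = f"{workspace}:{rel_path}:{symbol}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def scan(store: Dict[str, dict], workspace: str, files: Dict[str, List[str]]) -> Dict[str, dict]:
    """Upsert every symbol currently on disk, then purge stale records.

    `files` maps a relative path to the symbol names parsed from it.
    `store` holds the records for this one workspace.
    """
    seen = set()
    for rel_path, symbols in files.items():
        for symbol in symbols:
            rid = record_id(workspace, rel_path, symbol)
            store[rid] = {"path": rel_path, "symbol": symbol}  # overwrite, never append
            seen.add(rid)
    for rid in set(store) - seen:  # deleted, renamed, or newly ignored files
        del store[rid]
    return store


store: Dict[str, dict] = {}
scan(store, "ws", {"a.py": ["f", "g"], "b.py": ["h"]})
scan(store, "ws", {"a.py": ["f", "g"], "b.py": ["h"]})  # second pass adds nothing
count_after_rescan = len(store)
scan(store, "ws", {"a.py": ["f"]})                       # b.py deleted, g removed
print(count_after_rescan, len(store))
```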

Be deliberate about what you add to .memignore. Most of us keep folders like test/ or research/, and other dated resources, in the repo, and they add unnecessary noise to the ChromaDB index. Add those folders and files to .memignore so the scan skips them.

Local Mode: Zero Cost, No Data Leaves Your Machine

If you're working on a private codebase or just don't want to touch a cloud API, set MEMORY_MODE=local in .env and point zerikai_memory at a running Ollama instance. Everything runs locally: parsing, embedding, retrieval, and synthesis. The inline citations still work. The project brief still generates. Auto-routing routes everything to Ollama. Token tracking still logs usage (it just logs zero cost).

This makes the local mode genuinely useful for confidential projects, not just a fallback. The main trade-off versus DeepSeek is synthesis quality on complex architectural queries, as Ollama models are capable but won't match v4-pro on multi-hop reasoning about your codebase design.

What It Doesn't Do

Worth being explicit: zerikai_memory doesn't replace your IDE's built-in context management for the file you're actively editing. It's additive; what it eliminates is the need to re-inject project-wide context at the start of every new session. It also doesn't currently support all languages tree-sitter handles; Python, JS/TS, HTML, CSS, and Markdown are covered, and others are on the roadmap.

The project brief is intentionally locked after the first scan. If your architecture changes significantly, you need to trigger update_brief via the MCP server. That's a manual step, and it's a deliberate trade-off for KV cache stability over automatic freshness.

Stack

Python: MCP server, query routing, brief synthesis
ChromaDB: local vector store (file-based, stored in .brain/vector_db/)
zerikai.db: SQLite workspace registry and token tracking (stored in .brain/zerikai.db)
tree-sitter: deterministic local code parser, zero API cost
DeepSeek: cloud LLM (v4-flash and v4-pro), with KV cache optimization
Ollama: local LLM for zero-cost operation

Getting Started: Index zerikai_memory itself as your first workspace. It's the fastest way to see how entity-level indexing, lexical re-ranking, and inline citations actually work before applying them to your own projects. If your language isn't supported yet, search PyPI for tree-sitter-<language> to find the grammar bindings, then use the MCP server itself to guide you through adding it to code_indexer.py. Rescan, and you're ready to index your other workspaces with the newly added language.

The repo is on GitHub: zerikai_memory

If you're running agent workflows across multiple IDEs or projects, the context re-injection overhead adds up. zerikai_memory cuts that out at the root.

The more interesting problems turned out to be elsewhere: getting traceable, line-precise source citations out of a retrieval layer so agents stop guessing and start pointing, and getting lexical re-ranking to prevent semantically close but functionally irrelevant results from crowding out the right answer.

[🔴 VIDEO DEMO COMING SOON - BOOKMARK TO WATCH]

Have you thought about the impact of traceable sources in your own agent workflows? Have you hit the false-positive problem in semantic memory search? Drop your setup and approach below.

Thank you

Enrique
