DEV Community: Enrique Bruzual

Post-Mortem: Building a Local MCP Server for Codebase Memory using Ollama and ChromaDB

Enrique Bruzual — Wed, 15 Jul 2026 23:46:16 +0000

Developers are pushing back against cloud API billing and the privacy risks of sending proprietary codebases to third-party endpoints. A Hacker News thread from this year put it plainly: the problem isn't the price per token, it's the unpredictability of usage-based billing when AI agents are continuously polling APIs. On Reddit, the privacy concern is starker -- for enterprise and defense work, sending company IP to OpenAI or Anthropic is a hard no regardless of cost.

zerikai_memory has a local mode for exactly this: everything runs through Ollama, nothing leaves your machine. We shipped mistral:7b as the default local model. ornith:9b dropped in June 2026, trained specifically for agentic coding tasks, so we tested both. Here is what we found.

How zerikai_memory Uses a Local Model

zerikai_memory runs in three modes: cloud (DeepSeek), local (Ollama), and hybrid. Routing between them is handled by _should_use_cloud() at main.py:986, a 4-step priority chain: explicit override, keyword match, word count threshold, then MEMORY_MODE env var fallback. In local mode that function always returns False -- everything stays on-device.

In local mode, every synthesis call hits _query_ollama at main.py:1555. The model receives a project brief from _load_project_context plus structured ChromaDB entity payloads: function signatures, file paths, line ranges, docstrings. It returns an answer with inline #file:line citations. One call, no streaming, no tool loop.

The retrieval layer (ChromaDB + L2 distance + lexical rerank) is model-agnostic. Both models received identical context in every query test. The only variable was synthesis.

Hardware

GPU: NVIDIA RTX 3050, 8GB GDDR6 dedicated VRAM
CPU: Intel i7-12700
RAM: 32GB
OS: Windows 11

8GB dedicated VRAM is the hard ceiling. Windows offers shared system memory as overflow, but inference on shared RAM through PCIe is slow enough to matter in practice.

Models

mistral:7b (v0.3) -- 4.4GB, 32K context. Function calling via Ollama raw mode only: you construct [AVAILABLE_TOOLS] prompts by hand and parse [TOOL_CALLS] responses yourself. Last updated May 2024.

ornith:9b-dense -- an estimated 5.5GB, built on Gemma 4 and Qwen 3.5. RL-trained for agentic coding tasks with native tool calling. It scores 69.4 on SWE-Bench Verified and 43.1 on Terminal-Bench 2.1, matching or beating models 3x its parameter count.

Test 1: Benchmark Script (Latency)

A standalone Python script hit Ollama directly with static ChromaDB payload samples constructed from real zerikai_memory workspace entities. 3 queries, 3 samples each per model, raw latency measured at the HTTP layer. This test measures inference speed on the hardware, not synthesis quality against a live codebase.

Model	Mean	StdDev	Min	Max
mistral:7b	6.14s	3.58s	2.92s	14.57s
ornith:9b	13.39s	5.76s	8.77s	25.67s

ornith's cold start on the first query hit 25.67s. It spilled into shared memory before Ollama pinned the model. Warmed samples settled at 9-17s. mistral fits clean in 8GB and runs at 3-7s warm.

The benchmark script is in the repo (in the blog/ directory). Run it against your own hardware before committing to a model swap.
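For a rough idea of what a measurement like this involves, here is a minimal sketch. It is not the repo's script. It assumes Ollama is listening on its default local port and uses the non-streaming /api/generate endpoint; the function names and the prompt list are hypothetical. The generate callable is injectable so the timing logic can be exercised without a GPU.

```python
import json
import statistics
import time
import urllib.request
from typing import Callable, Dict, List

# Ollama's default local endpoint; change it if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming generate request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]


def time_model(
    model: str,
    prompts: List[str],
    samples: int = 3,
    generate: Callable[[str, str], str] = ollama_generate,
) -> Dict[str, float]:
    """Time every prompt `samples` times and summarize wall-clock latency."""
    latencies = []
    for prompt in prompts:
        for _ in range(samples):
            start = time.perf_counter()
            generate(model, prompt)
            latencies.append(time.perf_counter() - start)
    return {
        "n": len(latencies),
        "mean": statistics.mean(latencies),
        "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
        "min": min(latencies),
        "max": max(latencies),
    }
```

Calling time_model("mistral:7b", my_prompts) against a running server gives the same four summary columns as the table above. Because the first sample includes model load time, expect the max to reflect a cold start unless you warm the model first.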

Test 2: Live MCP Queries (Synthesis Quality)

Both models were tested live through the zerikai_memory universal-brain MCP layer against the real zerikai_memory workspace index. Same 5 queries, models switched manually between runs, Ollama restarted between each switch. This test measures what a developer actually experiences in Cursor or VS Code -- not synthetic payloads, real ChromaDB retrieval.

Same query, same ChromaDB context, same system prompt. Here is what came back.

Query 1: How does query routing decide between Ollama and DeepSeek?

ornith:9b:

Based on the retrieved codebase memory, here's how query routing works:

Routing Logic:

1. Explicit Override: The `_should_use_cloud` function first checks for an explicit
   `use_cloud` parameter that can force cloud mode (DeepSeek) regardless of other
   conditions #main.py:986-1005.

2. Keyword Detection: If the user query contains keywords like "architecture",
   "explain", "how does", or "overview", it routes to DeepSeek's cloud mode
   #config.py:43-44.

3. Word Count Threshold: Queries exceeding 40 words automatically route to
   cloud/DeepSeek via `_should_use_cloud`'s word count check #main.py:992-994,
   with this threshold defined in config as `CLOUD_ESCALATION_WORD_COUNT = 40`
   #config.py:37.

4. Default Mode: Falls back to whatever is set as `DEFAULT_MEMORY_MODE`, which
   defaults to "hybrid" unless explicitly changed via the `MEMORY_MODE` environment
   variable #config.py:31-32, with this function returning based on whether that
   mode equals "cloud" #main.py:996-998.

Important Note: The retrieved context does not contain explicit information about
Ollama's role in the routing decision or how local model queries are constructed
when cloud mode is NOT selected.

mistral:7b:

The query routing in zerikai_memory decides between using Ollama or DeepSeek based
on a 4-step priority chain. The priorities are as follows:

1. Explicit use_cloud override.
2. CLOUD_ESCALATION_KEYWORDS keyword match.
3. CLOUD_ESCALATION_WORD_COUNT length threshold.
4. DEFAULT_MEMORY_MODE fallback (recommended).

This decision-making process aims to balance privacy, speed, and API costs while
providing maximum context nuance for the user. For more detailed information on
modes, configurations, and usage, you can refer to the README.md file in the
project repository.

Sources: main.py:986, README.md:320, main.py:1495, main.py:1311, README.md:105

Query 2: Where is the .brain directory path defined and how is it resolved across operating systems?

ornith:9b:

The `.brain` directory path is defined and resolved through two key locations:

Definition: In `config.py`, line 1-2, there's a constant that defines the base
name for the brain directory [config.py:1]:

    DB_PATH = '.brain/'
    BRAIN_DIR = Path(DB_PATH)

Resolution: In `main.py`, lines 120-145, within the `init_workspace` function,
this path is resolved by joining it with the workspace's absolute path [main.py:130]:

    path = Path(workspace_path).resolve()
    db_path = path / DB_PATH
    db_path.mkdir(parents=True, exist_ok=True)

The resolution process:
1. Takes the provided `workspace_path` parameter
2. Resolves it to an absolute path using `.resolve()`
3. Joins it with the `.brain/` directory constant
4. Creates the full directory structure if it doesn't exist

mistral:7b:

The `DB_PATH` constant, which is used for defining the storage root for all
workspace data, is defined in `config.py`:

    DB_PATH = Path(__file__).parent / ".brain"

It resolves to `zerikai_memory/.brain/`. This path is platform-agnostic as the
`Path` object in Python handles different operating systems' file paths.

Sources: config.py:67, main.py:1034, main.py:818, README.md:375, main.py:1796

Query 3: How does background brief synthesis avoid MCP timeouts?

ornith:9b:

Background brief synthesis avoids MCP timeouts by running as a fire-and-forget
task launched with `asyncio.create_task`, meaning `_background_brief_synthesis()`
is called without awaiting or returning its result. This decouples the slow
LLM-based brief generation from the MCP request lifecycle, so the IDE agent
receives its response immediately rather than waiting for synthesis to complete.

Sources:
* main.py:784 - 1.96 (rerank)
* main.py:1827 - 0.99 (rerank)
* main.py:393 - 0.98 (rerank)
* main.py:2094 - 0.94 (rerank)

mistral:7b:

Background brief synthesis avoids MCP timeouts by using asynchronous task creation
with `asyncio.create_task` in the `_background_brief_synthesis` function. This
approach allows the function to run independently without blocking the main event
loop, ensuring timely responses and avoiding MCP timeouts. The synthesized brief is
saved as a markdown file in `.brain/contexts/<workspace_id>.md`.

Sources: main.py:784, main.py:1827, main.py:393, main.py:2094

Query 4: How does save_to_memory work with tree-sitter and what file types does it index?

ornith:9b:

`save_to_memory` routes by file extension: files with `.py`, `.js`, `.ts`, `.css`,
`.html`, or `.md` are processed via tree-sitter entity extraction (the
`extract_entities` flow in `code_indexer.py`). Other extensions fall through to
DeepSeek/Ollama LLM summarization.

For the tree-sitter path, tree-sitter parses the source and extracts one
`CodeEntity` per function/method/class -- each holding signature, docstring/JSDoc,
and metadata. These entities are upserted into ChromaDB with deterministic md5 IDs
(so re-scans overwrite duplicates), and token usage is logged to the SQLite tracker
(`zerikai.db`).

Sources: main.py:1072, README.md:279, code_indexer.py:169, code_indexer.py:204,
code_indexer.py:30

mistral:7b:

The `save_to_memory` function works by saving content to persistent vector memory
in ChromaDB. It uses tree-sitter for code parsing for supported file types:
.py, .js, .ts, .css, .html, and .md. Other formats fall back on DeepSeek/Ollama
LLM summarization. It uses deterministic md5 IDs so re-scans overwrite duplicates
and logs token usage to `zerikai.db`.

Sources: main.py:1072, README.md:279, code_indexer.py:169, code_indexer.py:204,
code_indexer.py:30

Query 5: How does asyncio semaphore control Ollama concurrency in local mode?

Neither model had enough indexed context to answer this fully. What each model did with that gap is the most important data point in this test.

ornith:9b:

Based on the available codebase memory context, I cannot determine how asyncio
semaphore specifically controls Ollama concurrency in local mode. The retrieved
entities show:

1. `_should_use_cloud` - Determines cloud usage #main.py:986
2. `MEMORY_MODE` configuration with cloud/hybrid modes #config.py:30-45
3. `_background_scan` - Uses Semaphore(4) for tree-sitter and Semaphore(2) for
   LLM summarization during scanning #main.py:1827

None of these entities contain the specific concurrency logic for `_query_ollama`
at runtime. If this routing exists in the codebase, it's not present in the
retrieved memory entries.

mistral:7b:

The project uses asyncio semaphore for controlling Ollama concurrency in local mode.

The query can be answered by explaining that asyncio semaphore is used to manage
the number of concurrent processes for tree-sitter parsing and LLM summarization
during the scan loop, with a Semaphore(4) for tree-sitter parsing and Semaphore(2)
for LLM summarization in local mode.

Sources: config.py:62, config.py:53, main.py:1827, main.py:1555

mistral answered with confidence. The Semaphore(4) and Semaphore(2) values it cited are real, but they belong to the background scan pipeline, not query-time Ollama concurrency. It answered a question about one code path with context from another and presented the result as fact. ornith said it didn't have the information and stopped. For an IDE assistant, a confident wrong answer costs more than an honest gap.

The Prerequisite: Docstring Quality

The order of operations matters and it is not obvious from the README.

After editing your .memignore file, run embedding-docstring on your codebase first. Then run scan_workspace. If you scan first with sparse or missing docstrings, ChromaDB indexes thin vectors. Re-scanning won't fix it unless you re-enrich first and scan again. The memory is only as good as what tree-sitter extracted, and tree-sitter only extracts what is there.

zerikai_memory ships with the embedding-docstring skill for this reason. It audits and rewrites docstrings, comment blocks, and inline documentation across an entire workspace for vector embedding quality, covering Python, JavaScript, TypeScript, and HTML. It writes missing documentation from scratch and respects a .memignore file at the workspace root. The correct workflow is:

.memignore  →  embedding-docstring  →  scan_workspace  →  query

Skip the first step and both models underperform. You will spend time blaming the model or the hardware when the real problem is what went into ChromaDB.

Current status: the skill works well with pi.dev. VS Code support was held back by large-file size constraints in some editors. Update: as of 7/14/2026, VS Code supports large files, so the skill is usable in both Cursor and VS Code.

Brief Generation: An Uncontrolled But Useful Data Point

As a secondary test, we compared briefs generated for the same workspace by DeepSeek (cloud, sparse docstrings) and ornith:9b (local, after embedding-docstring enrichment). This is not a controlled comparison -- the docstring density differed between runs, so the model is not the only variable.

What the comparison shows is that ornith:9b, given enriched ChromaDB context, produces dense, precise briefs: atomic overwrite semantics, naming convention breakdowns, explicit gap flags where documentation is missing. DeepSeek against sparse context produced thinner output with some inferred detail not present in the code.

The takeaway is not that ornith beats DeepSeek for brief generation. It is that embedding-docstring enrichment is visible and measurable in the output. When the context is rich, ornith produces briefs good enough to feed meaningful synthesis queries. When it is not, neither model can compensate.

Full Local Mode and Brief Synthesis: The Semaphore Fix

Before this release, full local mode had a GPU saturation problem. _synthesize_deep_brief at main.py:538 fired asyncio.gather across all 9 brief sections simultaneously with no concurrency gate. In local mode that meant 9 concurrent Ollama calls hitting the GPU at once -- guaranteed to saturate an 8GB card.

The fix shipped alongside this test. A global ollama_semaphore initialized in main.py after client setup gates _build_section calls through a _build_section_safe wrapper when use_cloud=False. Cloud and hybrid modes bypass the semaphore entirely -- DeepSeek handles its own rate limiting on the API side.

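# One shared gate for every local Ollama call made during brief synthesis
ollama_semaphore = asyncio.Semaphore(OLLAMA_MAX_CONCURRENCY)

async def _build_section_safe(name):
    # Local path: wait for a free slot before touching the GPU
    if not use_cloud:
        async with ollama_semaphore:
            return await _build_section(name, workspace_id, workspace_path)
    # use_cloud=True: no gate, DeepSeek rate-limits on the API side
    return await _build_section(name, workspace_id, workspace_path)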

OLLAMA_MAX_CONCURRENCY is configurable via .env, defaulting to 1 for 8GB hardware. Users on cards with more VRAM headroom can raise it. The ornith:9b brief in this post was generated with this fix in place -- full local mode brief synthesis is production-ready as of this release.
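To see the effect of the gate in isolation, the sketch below runs nine fake sections through asyncio.gather and records the peak number in flight. It is a self-contained demonstration, not project code: fake_build_section is a hypothetical stand-in for a real Ollama call, and the sleep simulates inference time.

```python
import asyncio
from typing import List

# Stand-ins for the 9 brief sections.
SECTIONS: List[str] = [f"section_{i}" for i in range(9)]


async def run_sections(max_concurrency: int) -> int:
    """Run all sections through a semaphore and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_build_section(name: str) -> str:
        # Hypothetical stand-in for one Ollama synthesis call.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulated inference time
        in_flight -= 1
        return name

    async def gated(name: str) -> str:
        # Same shape as the wrapper above: acquire a slot, then do the work.
        async with semaphore:
            return await fake_build_section(name)

    results = await asyncio.gather(*(gated(name) for name in SECTIONS))
    assert list(results) == SECTIONS  # gather preserves input order
    return peak


def peak_concurrency(max_concurrency: int) -> int:
    """Synchronous helper that runs the demo on a fresh event loop."""
    return asyncio.run(run_sections(max_concurrency))
```

With a limit of 1 the sections run strictly one after another, and with a limit of 9 all of them start at once, which is the ungated behavior described above.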

Hardware and Cost

If token pricing is the reason you are reading this, here is what a GPU upgrade costs against what you are spending on API calls:

RTX 3060 12GB (recommended minimum for ornith:9b): $330-$470 new. ASUS Dual and Gigabyte WINDFORCE variants available at Newegg around $340-$440.
RTX 4060 Ti 16GB: $400-$500. The extra VRAM lets you load larger 13B-14B quantized models without spilling to system RAM.
RTX 4070 12GB: around $600. Faster Tensor cores, quicker token generation.

AMD cards (RX 6700 XT 12GB, refurbished from $380) offer equivalent VRAM but require ROCm configuration. Ollama's CUDA path is plug-and-play on NVIDIA. AMD works but adds setup overhead.

On 8GB (RTX 3050 class), ornith:9b runs but cold starts are painful and VRAM headroom is tight. The RTX 3060 12GB is the practical sweet spot for local zerikai_memory use.

Recommendation

ornith:9b is the new default local model recommendation, replacing mistral:7b.

On 8GB dedicated VRAM: ornith fits but runs tight. Cold start hits 25s when Ollama hasn't pinned the model. Warm synthesis at 9-17s is acceptable for a local-only workflow where you are not switching models or running concurrent GPU workloads. Set OLLAMA_MAX_CONCURRENCY=1 in .env.

On 10-12GB dedicated VRAM (RTX 3060 12GB or better): the model stays pinned, cold starts drop significantly, and citation precision is consistently better than mistral.

Under 8GB dedicated VRAM, or if synthesis latency matters more than citation precision, use mistral:7b. Set OLLAMA_MODEL=mistral:7b in .env. It handles synthesis correctly when context is dense. When context is thin, it will fill gaps with confident but wrong answers.

The query test was clean and controlled. The model difference is real and attributable to ornith's training on agentic coding tasks, not hardware or docstring quality. Use the benchmark script in the repo to validate on your own machine before switching.

📖 Original Publication: This engineering post-mortem was originally published on the Zerikai Tech Blog. Read the clean, formatted web version at https://zerikai.com.

Your AI Agent is Stuck in a Loop. Here's the Memory Layer That Breaks It and Saves You Money

Enrique Bruzual — Tue, 19 May 2026 13:20:53 +0000

Every time you open a new chat in Cursor, VS Code, Antigravity and even Claude Desktop, you paste your codebase back in. Or you let the IDE do it automatically, same result. You're burning context tokens on files the agent already "knew" ten minutes ago in a different window. The agent re-reads my_models.py, your_script.js, their_script.ts, index.html and the three service files it touched yesterday, just to orient itself before answering your actual question.

That's not a context window problem. It's a memory architecture problem. There's no persistence layer between sessions or IDEs. The fix isn't a bigger context window; it's not sending the same data twice.

In a real test session with Copilot, after three exchanges with four unnecessary re-queries, zerikai_memory's tool results accounted for just 1.6% of a 128K context window. A raw file-chunk retrieval setup doing the same work would burn ~40% of that window before you got to your actual question. On the DeepSeek side, locking the project brief as a fixed system prefix means every query hits the KV cache at $0.0028/M tokens instead of $0.14/M, a 50x cost reduction on the largest token block per call.

That's the exact problem zerikai_memory was built to solve. (Open source)

Hi, I'm Enrique, an AI software engineer with over 30 years of tech experience. I built zerikai_memory after diagnosing a structural bottleneck in my own workflows: context loss between sessions, redundant re-injection across IDEs, and rising per-query costs that compound with every sloppy retrieval. The rising cost of AI IDE subscriptions made the economics impossible to ignore, but the underlying architecture problem was already there. zerikai_memory is the engineering response to that diagnosis, built for developers who want precision and cost control.

What It Actually Does

zerikai_memory is a local MCP server. Any IDE that supports the Model Context Protocol (Cursor, pi, Antigravity, Copilot, and even Claude Desktop) can connect to it. When you run scan_workspace, it walks your project using tree-sitter to parse source files into individual code entities: Functions, Classes, Methods, HTML components.

Each entity gets its own embedding in a local ChromaDB collection, along with structured metadata: Language, Return type, Parent class, Line numbers, Param count, Decorators, Docstring presence.

That last part matters. Most RAG setups (including the ones built into commercial IDEs) chunk files by token count or line count, which means a function might get split across two chunks, or buried under 500 lines of imports. This is a corner being cut at your expense: you pay for the tokens, the agent gets noise. zerikai_memory treats each function as the atomic unit instead. extract_entities in code_indexer.py returns a list of CodeEntity objects, one per parsed symbol: clean signal, no noise.
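To make "one entity per symbol" concrete, here is a minimal sketch that uses Python's standard-library ast module in place of tree-sitter, so it only handles Python source. The Entity dataclass is a hypothetical, trimmed-down stand-in for CodeEntity, and the field names are my assumptions rather than the project's schema.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    # Hypothetical, simplified stand-in for the project's CodeEntity.
    name: str
    kind: str  # "function", "class", or "method"
    start_line: int
    end_line: int
    docstring: Optional[str]
    parent: Optional[str]  # enclosing class name, if any


def extract_python_entities(source: str) -> List[Entity]:
    """Return one Entity per top-level function, class, and class method."""
    entities: List[Entity] = []

    def visit(node: ast.AST, parent: Optional[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if parent else "function"
                entities.append(
                    Entity(child.name, kind, child.lineno, child.end_lineno,
                           ast.get_docstring(child), parent)
                )
                # Nested functions are deliberately not descended into.
            elif isinstance(child, ast.ClassDef):
                entities.append(
                    Entity(child.name, "class", child.lineno, child.end_lineno,
                           ast.get_docstring(child), parent)
                )
                visit(child, child.name)

    visit(ast.parse(source), None)
    return entities
```

Each returned record carries its own line range, so a retrieval hit can point at the exact function instead of a chunk boundary that happens to cut through it.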

When your agent queries memory, along with the raw answer, it gets inline source citations back:

Sources: `extract_entities` #code_indexer.py:184 (0.35)
`_extract_js_function` #code_indexer.py:595 (0.81)
`_extract_js_like` #code_indexer.py:543 (0.89)
`CodeEntity` #code_indexer.py:156 (0.83)
`_extract_html` #code_indexer.py:907 (0.84)

(raw answer synthesized from those sources)

The IDE agent doesn't hunt for the function; the citations give it the exact file, line number, and L2 distance score. #code_indexer.py:184 is plain text that renders in every IDE and is clickable in VS Code Copilot, so you can jump straight to the source too. No additional search, no extra API call.
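Because the citation is plain text in a fixed shape, it is also easy to consume from a script. The sketch below parses the format shown in the example above; the regular expression and the Citation type are my own assumptions based on that sample, not code from the project.

```python
import re
from typing import List, NamedTuple


class Citation(NamedTuple):
    name: str
    file: str
    line: int
    distance: float


# Matches: `symbol` #file.py:123 (0.45)
CITATION_RE = re.compile(
    r"`(?P<name>[^`]+)`\s+#(?P<file>[^\s:#]+):(?P<line>\d+)\s+\((?P<distance>\d+(?:\.\d+)?)\)"
)


def parse_citations(text: str) -> List[Citation]:
    """Extract every `name` #file:line (distance) citation found in `text`."""
    return [
        Citation(m["name"], m["file"], int(m["line"]), float(m["distance"]))
        for m in CITATION_RE.finditer(text)
    ]
```

A lower distance means a closer match, so sorting the parsed list by distance reproduces the retrieval order before any re-ranking.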

The Indexing Cost Is Zero

tree-sitter is a deterministic local parser. It runs entirely on your machine, no API calls, no LLM involvement. Parsing and embedding your codebase costs $0.00.

The only thing that costs money is the project brief: a 9-section structured summary (Overview, Technical Stack, Core Architecture, Primary Conventions, Purpose, Key Files, Data Flow, Development & Testing, Future Roadmap). That brief is generated once after the first scan using either the DeepSeek API or Ollama. With DeepSeek, it's a few cents. With Ollama, it's free. After that, the brief is locked and cached.

Why lock it? Because the brief is the fixed prefix of every system message sent to DeepSeek. DeepSeek's KV cache means identical prefixes are stored server-side after the first call, so subsequent queries hit the cache at $0.0028/M tokens instead of $0.14/M tokens. That's a 50× cost reduction on the largest token block per query. Intentionally keeping the brief stable is what makes that discount work consistently.
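The arithmetic behind that claim is short enough to check by hand. This sketch uses the two per-million prices quoted above and Decimal to avoid floating-point noise; the brief length and query count in the usage note are hypothetical, and the model assumes the first call is a cache miss and every later call is a hit.

```python
from decimal import Decimal

CACHE_MISS_PER_M = Decimal("0.14")   # USD per million input tokens (from the post)
CACHE_HIT_PER_M = Decimal("0.0028")  # USD per million cached tokens (from the post)


def prefix_cost(prefix_tokens: int, queries: int, cached: bool) -> Decimal:
    """Cost in USD of sending the same prefix on every query.

    With caching, the first call is billed as a miss and the rest as hits.
    """
    per_token_miss = CACHE_MISS_PER_M / 1_000_000
    per_token_hit = CACHE_HIT_PER_M / 1_000_000
    if not cached or queries == 0:
        return per_token_miss * prefix_tokens * queries
    return prefix_tokens * (per_token_miss + per_token_hit * (queries - 1))
```

For example, prefix_cost(4000, 100, cached=False) versus prefix_cost(4000, 100, cached=True) compares a hypothetical 4,000-token brief over 100 queries. The savings approach the full 50x only as the query count grows, since the first call always pays the uncached rate.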

If you change the architecture or ship a large feature, you can regenerate the brief by asking the MCP server to run update_brief.

Auto-Routing: Paying Only for What Needs It

Not every query deserves a cloud LLM. A lookup like "where is process_data defined?" is a short, specific question that Ollama handles fine locally. An architectural question like "explain how the query routing pipeline decides between DeepSeek v4-flash and v4-pro" is a 20-word query touching design decisions, and that earns the cloud call.

The routing logic in _should_use_cloud() runs a 4-step priority chain on every incoming query:

1. Explicit override: use_cloud=True or use_cloud=False passed directly; short-circuits everything else.
2. Architectural keyword escalation: if the query contains design/architecture terms (configured in config.py), route to DeepSeek.
3. Length escalation: queries of 40 words or more route to DeepSeek.
4. Default fallback: fall back to whatever MEMORY_MODE is set to in .env (local, cloud, or hybrid).
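The chain above can be sketched in a few lines. This is an illustration, not the body of _should_use_cloud: the keyword tuple is an example list, the local-mode guard placed ahead of the explicit override is my assumption (so that local mode never leaves the machine), and treating the fallback as "cloud only when the mode is cloud" is also an assumption.

```python
from typing import Optional, Sequence

# Illustrative values; the real lists and thresholds live in the project's config.
CLOUD_ESCALATION_KEYWORDS = ("architecture", "explain", "how does", "overview")
CLOUD_ESCALATION_WORD_COUNT = 40


def should_use_cloud(
    query: str,
    memory_mode: str = "hybrid",
    use_cloud: Optional[bool] = None,
    keywords: Sequence[str] = CLOUD_ESCALATION_KEYWORDS,
    word_threshold: int = CLOUD_ESCALATION_WORD_COUNT,
) -> bool:
    """Return True when a query should go to the cloud model."""
    # Assumption: local mode wins over everything, so nothing leaves the machine.
    if memory_mode == "local":
        return False
    # 1. Explicit override short-circuits the remaining checks.
    if use_cloud is not None:
        return use_cloud
    # 2. Architectural keyword escalation.
    lowered = query.lower()
    if any(keyword in lowered for keyword in keywords):
        return True
    # 3. Length escalation: 40 words or more.
    if len(query.split()) >= word_threshold:
        return True
    # 4. Default fallback to the configured mode.
    return memory_mode == "cloud"
```

Ordering the checks from cheapest and most explicit to most general keeps the decision predictable: a caller who passes use_cloud never has to reason about keywords or word counts.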

Within the DeepSeek path, _select_model picks between v4-flash and v4-pro based on query complexity. Short cloud queries get the cheaper model; heavier ones get the capable one. You're not paying v4-pro rates for a function lookup.

Tip: start in cloud mode (DeepSeek API), then you can try hybrid and then local (Ollama).

Lexical Re-ranking: Fixing the Semantic Similarity False Positive Problem

Semantic similarity alone has a well-known failure mode: a generic function that describes the same concept as your query can outscore the specific function you actually want. A file summary mentioning "parsing" might beat parse_entities() in embedding space if the embeddings aren't tight.

zerikai_memory adds a re-ranking pass on top of ChromaDB's semantic results. After distance filtering removes weak matches, survivors are re-scored by weighted keyword overlap in entity names and docstrings:

score = (1 / distance) + (keyword_hits × LEXICAL_RERANK_WEIGHT)

A function named extract_entities will outrank a vague file-level summary that happened to embed close to the query. No results are dropped; it's a pure reorder. Re-ranking is toggled with ENABLE_LEXICAL_RERANK in .env, and LEXICAL_RERANK_WEIGHT sets how much each keyword hit counts. It's off by default; turn it on if you're seeing generic summaries crowding out specific function hits.
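Here is a minimal sketch of that scoring formula. The Hit type, the tokenizer, and the weight value are hypothetical; in particular, counting a keyword hit as one distinct query term found in the entity name or docstring is my assumption about how overlap is measured.

```python
import re
from typing import List, NamedTuple, Set

LEXICAL_RERANK_WEIGHT = 0.5  # hypothetical value, chosen for illustration


class Hit(NamedTuple):
    name: str
    docstring: str
    distance: float  # L2 distance from the vector store; lower is closer


def _tokens(text: str) -> Set[str]:
    # Splits on anything that is not a letter or digit, so snake_case names
    # like extract_entities contribute "extract" and "entities".
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(query: str, hits: List[Hit], weight: float = LEXICAL_RERANK_WEIGHT) -> List[Hit]:
    """Reorder hits by (1 / distance) + keyword_hits * weight; nothing is dropped."""
    terms = _tokens(query)

    def score(hit: Hit) -> float:
        keyword_hits = len(terms & _tokens(f"{hit.name} {hit.docstring}"))
        # Guard against a zero distance on an exact match.
        return (1.0 / max(hit.distance, 1e-9)) + keyword_hits * weight

    return sorted(hits, key=score, reverse=True)
```

With a weight of 0 the function falls back to pure distance ordering, which makes it easy to compare results with and without the lexical term on the same query.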

The reranking test in the inline citations example above shows it working:

VS Code Query:

{
  "workspace": "zerikai_memory",
  "user_query": "extract function for parsing code entities from source files. Show me the Sources table."
}

Returns extract_entities at distance 0.35 as the top hit, with the JS and HTML extractors behind it. Notably, _extract_markdown didn't appear despite sharing vocabulary, which is exactly the false positive problem the re-ranker exists to solve.

Context Window Impact

Entity-level indexing matters most when the agent is sloppy. In a test session with Copilot, after three back-and-forth exchanges (including four re-queries it didn't need), the context window sat at 17.1K of 128K tokens, just 13%. The tool results from zerikai_memory accounted for 1.6% of the full window, roughly 2K tokens.

If zerikai_memory returned raw file chunks instead of individual entities, a single retrieval might pull 500 lines of a file to find one function. Twelve retrievals across three turns could burn north of 40% of the window before you even got to your actual work.

	Entity-level (zerikai, tested)	Raw file chunks (estimated)
Tool results after 3 turns, 4 calls	1.6% of window (~2K tokens)	~40% of window (~50K tokens)
Space left for actual work	~111K tokens	~28K tokens
Agent re-queries	Negligible cost	Amplifies the burn

When the agent re-queries four times because it summarized too aggressively, entity-level indexing means it burned 800 tokens, not 50,000. The real cost of a bad agent layer is multiplied by how much data each retrieval returns (one of the reasons I migrated to pi.dev, but that's another write-up).

Small retrievals save you twice: they stretch your IDE's token quota and they shrink your DeepSeek bill. Pay for the brief once. Query your own code for free.

How Agents Display Source Citations

zerikai_memory returns the full #file:line (distance) citation in every response, but each agent vendor decides how much of it to surface. The same query and the same tool, across four different agents:

Example: "Sources: _truncate_for_brief #main.py:401 (0.62)"

Agent	Calls	Showed L2	Notes
Antigravity (Gemini)	1	Yes	Re-rendered inline text as a table
Claude Desktop	1	Yes	Showed full citation in body
Copilot	5	No	Re-queried 4x, hid distance from user until asked
Pi (inside workspace)	1	Yes	One clean call when in the project path

The data is always there in context; the agent uses #file:line (distance) for reasoning regardless.

Ask "show me the full sources with distances" and the agent surfaces the complete line. What you see is filtered by your agent's display layer, not by zerikai_memory.

Workspace Isolation and `.memignore`

Every project gets its own ChromaDB sub-collection, its own SQLite records, and its own brief file. Queries for project_a don't pull from project_b. Workspaces are identified by normalized filesystem path with deterministic UUIDs, so the same project opened from different IDE windows or different paths resolves to the same memory store.
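One way to get a stable identifier from a path is a name-based UUID over the normalized absolute path. The sketch below is an assumption about the general technique, not the project's implementation; the choice of uuid.NAMESPACE_URL as the namespace is arbitrary.

```python
import os
import uuid
from pathlib import Path


def workspace_id(path: str) -> str:
    """Derive a deterministic ID from a workspace path.

    The path is made absolute, symlinks and ".." segments are resolved, and
    case is normalized on case-insensitive platforms before hashing.
    """
    normalized = os.path.normcase(str(Path(path).resolve()))
    # uuid5 is a name-based UUID: the same input always yields the same value.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, normalized))
```

Because the ID depends only on the normalized path, two IDE windows that open the same folder through different spellings of the path end up with the same identifier.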

Before any file reaches the indexer, it is checked against .memignore and skipped on a match. It works identically to .gitignore: glob patterns, # comments, blank lines ignored. List node_modules/, .git/, venv/, and compiled output once per workspace, and the filter is enforced on every scan.

Scans are idempotent. Re-scanning the same file overwrites its existing records using deterministic hashing. Stale entries (files deleted or added to .memignore since the last scan) are automatically purged. Run scan_workspace as often as you want; no duplicates accumulate, no ghost records from renamed files.
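The idempotency described here comes down to two moves: derive each record ID from stable inputs, then delete whatever the latest scan did not touch. This sketch uses an in-memory dict as a stand-in for the vector store; which fields feed the hash is my assumption.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple


def entity_id(workspace: str, file_path: str, entity_name: str) -> str:
    """Deterministic record ID: the same entity always hashes to the same key."""
    key = f"{workspace}:{file_path}:{entity_name}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


class MemoryStore:
    """In-memory stand-in for one workspace's collection."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}

    def scan(self, workspace: str, entities: Iterable[Tuple[str, str, str]]) -> None:
        """Upsert (file_path, entity_name, text) tuples, then purge stale records."""
        seen: Set[str] = set()
        for file_path, name, text in entities:
            record_id = entity_id(workspace, file_path, name)
            # Writing by ID overwrites any earlier version of the same entity.
            self.records[record_id] = {"file": file_path, "name": name, "text": text}
            seen.add(record_id)
        # Anything not seen in this scan belongs to a deleted or ignored file.
        for record_id in set(self.records) - seen:
            del self.records[record_id]
```

Running scan twice with the same input leaves the store unchanged, and dropping a file from the input removes its records on the next pass.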

Be deliberate about what goes into .memignore. Projects tend to accumulate test/, research/, and other dated folders that add unnecessary noise to the ChromaDB index. Add those folders and files to .memignore so the scan skips them.
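For a feel of how an ignore file like this can be applied, here is a simplified sketch built on the standard-library fnmatch module. It covers only comments, blank lines, directory patterns ending in a slash, and plain globs. It does not implement full .gitignore semantics such as negation or anchored patterns, and it is not the project's matcher.

```python
import fnmatch
from typing import Iterable, List


def load_patterns(text: str) -> List[str]:
    """Parse ignore-file text, skipping blank lines and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(rel_path: str, patterns: Iterable[str]) -> bool:
    """Check a forward-slash path, relative to the workspace root."""
    parts = rel_path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: match against any directory component.
            directory = pattern.rstrip("/")
            if any(fnmatch.fnmatchcase(part, directory) for part in parts[:-1]):
                return True
        elif fnmatch.fnmatchcase(rel_path, pattern) or fnmatch.fnmatchcase(parts[-1], pattern):
            # File pattern: match the whole path or just the file name.
            return True
    return False
```

A pattern such as test/ skips everything under any test directory while leaving a file named test.py in another folder alone.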

Local Mode: Zero Cost, No Data Leaves Your Machine

If you're working on a private codebase or just don't want to touch a cloud API, set MEMORY_MODE=local in .env and point zerikai_memory at a running Ollama instance. Everything runs locally: parsing, embedding, retrieval, and synthesis. The inline citations still work. The project brief still generates. Auto-routing routes everything to Ollama. Token tracking still logs usage (it just logs zero cost).

This makes the local mode genuinely useful for confidential projects, not just a fallback. The main trade-off versus DeepSeek is synthesis quality on complex architectural queries, as Ollama models are capable but won't match v4-pro on multi-hop reasoning about your codebase design.

What It Doesn't Do

Worth being explicit: zerikai_memory doesn't replace your IDE's built-in context management for the file you're actively editing. It's additive; what it eliminates is the need to re-inject project-wide context at the start of every new session. It also doesn't currently support all languages tree-sitter handles; Python, JS/TS, HTML, CSS, and Markdown are covered, and others are on the roadmap.

The project brief is intentionally locked after the first scan. If your architecture changes significantly, you need to trigger update_brief via the MCP server. That's a manual step, and it's a deliberate trade-off for KV cache stability over automatic freshness.

Stack

Python: MCP server, query routing, brief synthesis
ChromaDB: local vector store (file-based, stored in .brain/vector_db/)
zerikai.db: SQLite workspace registry and token tracking (stored in .brain/zerikai.db)
tree-sitter: deterministic local code parser, zero API cost
DeepSeek: cloud LLM (v4-flash and v4-pro), with KV cache optimization
Ollama: local LLM for zero-cost operation

Getting Started: Index zerikai_memory itself as your first workspace. It's the fastest way to see how entity-level indexing, lexical re-ranking, and inline citations actually work before applying them to your own projects. If your language isn't supported yet, search PyPI for tree-sitter-<language> to find the grammar bindings, then use the MCP server itself to guide you through adding it to code_indexer.py. Rescan, and you're ready to index your other workspaces with the newly added language.

The repo is on GitHub: zerikai_memory

If you're running agent workflows across multiple IDEs or projects, the context re-injection overhead adds up. zerikai_memory cuts that out at the root.

The more interesting problems turned out to be elsewhere: getting traceable, line-precise source citations out of a retrieval layer so agents stop guessing and start pointing, and getting lexical re-ranking to prevent semantically close but functionally irrelevant results from crowding out the right answer.

Have you thought about the impact of traceable sources on your own agent workflows? Have you hit the false-positive problem in semantic memory search? Drop your setup and approach below.

Thank you

Enrique

Speeding Up SaaS: Shipping HTMX in Production (A Post-Mortem)

Enrique Bruzual — Mon, 09 Feb 2026 22:50:02 +0000

How I built a high-speed "Cognitive Control Plane" with Hypermedia and Autonomous AI Pipelines

1. Project Overview: The Reddit Lead Qualification and Analysis System

When I set out to build the "Reddit Lead Qualification and Analysis System," which is a tool designed to find, evaluate, and categorize potential customers on Reddit before they enter a sales pipeline, I wasn't just building a simple scraper. I was building a specialized cognitive control plane for my business. The system needed to ingest thousands of posts across hundreds of targeted subreddits every day, autonomously qualify them through a multi-stage AI pipeline, and provide a streamlined UI for me to manage the resulting leads.

The core challenge was friction.

For an Independent Developer Consultant, time is the scarcest resource. Every minute spent debugging a frontend build pipeline or synchronizing state between two different programming languages, Python and TypeScript, is a minute not spent refining the AI's lead-scoring logic. For a project of this scale, the standard industry recommendation is often a React frontend talking to a FastAPI backend. But for an independent development project, that architecture introduces a massive tax: the "Model Synchronization Tax," where I would define a Pydantic model in Python and then have to maintain a separate representation in the frontend. If I add a column to the leads table in my database, I shouldn't have to touch 15 different files across two repositories just to see it on my screen.

Additionally, every interaction in the traditional SPA model requires converting a Python object to JSON, sending it over the wire, and parsing it in JavaScript only to update a virtual DOM. While the React ecosystem in 2026 has introduced Server Components (RSC) to mitigate some of this, those solutions often carry a hidden "Infrastructure Tax." They require a complex Node.js-based build pipeline and a runtime environment that can break the clean, "Python-only" workflow I prefer.

I chose a different path: Hypermedia. Specifically, HTMX.

My goal was a sub-500ms feedback loop for myself as the operator. When I'm reviewing a batch of 100 leads, I need the experience to feel instantaneous. I wanted sub-5-minute "idea-to-feature" velocity; if I realize I need a new filter for "Qualified" leads, I should be able to implement it in one place and see it live. Most importantly, I wanted a code footprint that didn't require a massive node_modules folder to build or a complex virtual environment just to render a button.

2. Implementation Approach: Hypermedia from the Ground Up

Choosing HTMX was an intentional design decision for my FastAPI-based stack from day one, not a late-stage pivot. The architecture follows the HATEOAS (Hypermedia as the Engine of Application State) principle: the server doesn't just send raw data, it sends the representation of the data in the state it should be displayed. This means my backend is "UI-aware" in the best possible way.

The Autonomous Qualification Pipeline

Beyond the UI, the heart of the system is the autonomous lead qualification pipeline. Unlike a simple search, this is a sequential background process managed by a task queue. When a post is ingested, it moves through four distinct stages:

Summary Generation: A model like gpt-4o or deepseek-chat condenses the post and comments into a concise technical summary.
Qualification: The system scores the lead based on intent and fit against custom business prompts.
Key Point Extraction: For qualified leads, the AI extracts specific talking points and pain points.
Draft Response: Finally, the system generates a tailored response draft for me to review.
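The four stages above can be sketched as a single sequential function. This is an illustration only, not the project's actual code: the `LeadAnalysis` dataclass, the `run_pipeline` name, the prompts, the 0-100 score, and the threshold of 70 are all assumptions, and `llm` stands in for whatever client calls gpt-4o or deepseek-chat. The task queue is left out so the control flow stays visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical stand-in for a model client: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class LeadAnalysis:
    post_text: str
    summary: Optional[str] = None
    score: Optional[int] = None
    qualified: Optional[bool] = None
    key_points: list[str] = field(default_factory=list)
    draft_response: Optional[str] = None


def run_pipeline(post_text: str, llm: LLM, threshold: int = 70) -> LeadAnalysis:
    """Run the four stages in order, stopping early for unqualified leads."""
    lead = LeadAnalysis(post_text=post_text)

    # Stage 1: summary generation
    lead.summary = llm(f"Summarize this post and its comments:\n{post_text}")

    # Stage 2: qualification (assumed to come back as a bare 0-100 integer)
    lead.score = int(llm(f"Score this lead from 0 to 100:\n{lead.summary}").strip())
    lead.qualified = lead.score >= threshold
    if not lead.qualified:
        return lead

    # Stage 3: key point extraction, one point per line
    raw_points = llm(f"List the key pain points, one per line:\n{lead.summary}")
    lead.key_points = [
        line.strip("- ").strip() for line in raw_points.splitlines() if line.strip()
    ]

    # Stage 4: draft response for the operator to review
    lead.draft_response = llm(
        "Draft a reply addressing these points:\n" + "\n".join(lead.key_points)
    )
    return lead
```

Returning early after stage 2 keeps the later, more expensive calls reserved for leads that passed qualification, which matches the "for qualified leads" wording in stage 3.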

This autonomous pipeline runs silently in the background. While the system also includes a robust API and a Model Context Protocol (MCP) layer, those are beyond the scope of this post; I will dive into those in a future article. The role of the HTMX-powered dashboard is to surface these processed results and allow me to interact with them with minimal latency.

The Integration Layer: `HX-Request`

In my FastAPI backend, specifically in scripts/views/views.py, I implemented a pattern to handle "Full Page" vs. "Fragment" requests using the same route. This is the core of HTMX in production. When I load the dashboard directly, I get the full shell: header, sidebar, and footer. But when I change a filter or click a pagination link, HTMX sends a header (HX-Request) that tells my server: "Hey, I only need the table content."

@router.get("/", response_class=HTMLResponse)
async def dashboard(request: Request, db: Session = Depends(get_db)):
    # ... logic to fetch leads and stats ...
    # This involves complex SQLAlchemy queries with joined loads 
    # to ensure the Post and Lead data is fetched efficiently.
    context = {"leads": leads, "stats": stats}  # ... plus other template variables

    if request.headers.get("HX-Request"):
        return templates.TemplateResponse(
            request=request,
            name="fragments/dashboard_table.html",
            context=context
        )

    return templates.TemplateResponse(
        request=request,
        name="pages/dashboard.html",
        context=context
    )

By checking the HX-Request header, I can return just the table rows (a fragment) when I click "Next Page," or the entire dashboard when I first refresh the browser. This eliminates the need for a client-side router entirely. I don't have to define "Routes" in JavaScript anymore; the URL structure is defined by my Python files, as it should be.
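The branching in that route can be isolated into a small, testable helper. This is a minimal sketch using only the standard library; `pick_template` is a hypothetical name, while the two template paths are the ones used in the route above. It relies on the fact that htmx sets the header to the literal value "true" on requests it issues.

```python
from typing import Mapping

FULL_PAGE = "pages/dashboard.html"
FRAGMENT = "fragments/dashboard_table.html"


def pick_template(headers: Mapping[str, str]) -> str:
    """Return the fragment template for htmx requests, the full page otherwise."""
    # Header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    is_htmx = normalised.get("hx-request", "").lower() == "true"
    return FRAGMENT if is_htmx else FULL_PAGE
```

In a route this would be called as `pick_template(request.headers)`. Since the same URL now yields two different bodies, any caching layer in front of the app should probably be told so with a `Vary: HX-Request` response header.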

Semantic Search: A Manual Discovery Tool

While the autonomous pipeline qualifies leads based on pre-defined rules, I often need to manually explore the collected data. This is where the Semantic Search Engine comes in. Using pgvector, I can find posts that are semantically similar to a current interest, even if they weren't flagged by the initial qualification logic.

Implementing this manual discovery tool with HTMX felt like magic:

<input type="text" name="q" 
       hx-get="/" 
       hx-trigger="keyup changed delay:500ms" 
       hx-target="#dashboard-table" 
       placeholder="Search leads by intent...">

This tells the browser: "Every time I stop typing for 500ms, send a GET request to / with my input value as the q parameter, and swap the results into the table." In my Python code, I call the semantic_search function using vector embeddings. There were no React state hooks, no onChange handlers, and no complex debounce logic to write. It just worked.
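For readers who have not used vector search, the ranking step behind such a query can be illustrated in plain Python. This is not the project's semantic_search function and it does not use pgvector: it is an in-memory stand-in that ranks stored embeddings by cosine similarity to the query embedding. The names `cosine_similarity` and `rank_by_similarity` and the toy two-dimensional vectors are assumptions made for the example.

```python
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    post_vecs: Mapping[str, Sequence[float]],
    limit: int = 10,
) -> list[str]:
    """Return post ids ordered from most to least similar to the query."""
    ranked = sorted(
        post_vecs,
        key=lambda post_id: cosine_similarity(query_vec, post_vecs[post_id]),
        reverse=True,
    )
    return ranked[:limit]
```

A database-backed version would push the same ordering into the SQL query so that only the top rows ever reach Python.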

Key Pattern: The Polling Pipeline

One of the most powerful features I implemented was the "Re-Analysis" polling. For instance, if I want to manually trigger a re-run of the qualification pipeline for a specific lead, I need a way to see the progress without refreshing.

With HTMX, I reduced this to a single endpoint that returns a polling fragment:

@router.post("/leads/{lead_id}/re-analyze")
async def trigger_re_analysis(lead_id: int):
    # Trigger the background task pipeline starting at Task 1
    task_1_summary(lead_id)

    # Return a fragment that polls for status every 2 seconds
    return HTMLResponse(content=f"""
        <div hx-get="/leads/{lead_id}/analysis-status" 
             hx-trigger="every 2s" 
             hx-target="this" 
             hx-swap="outerHTML"
             class="animate-pulse">
            Analyzing...
        </div>
    """)

The frontend logic is now entirely declarative. The server tells the browser: "Here is your current state (Analyzing), and by the way, check back with me in 2 seconds." When the final response draft is ready, the server returns the result in place of the polling element. Because hx-swap="outerHTML" replaces the whole element and the new markup carries no hx-trigger, the polling stops automatically. This is implementation-focused engineering at its finest.
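The status endpoint that the fragment polls is not shown in the post, so here is a hedged sketch of what its rendering step might look like. `render_status_fragment` and the `analysis-result` class are hypothetical; the polling markup mirrors the fragment returned by the re-analyze endpoint above.

```python
from html import escape
from typing import Optional


def render_status_fragment(lead_id: int, done: bool, draft: Optional[str] = None) -> str:
    """Return a polling fragment while work is running, a static one when done."""
    if not done:
        # Same element as before: the browser keeps asking every 2 seconds.
        return (
            f'<div hx-get="/leads/{lead_id}/analysis-status" '
            'hx-trigger="every 2s" hx-target="this" hx-swap="outerHTML" '
            'class="animate-pulse">Analyzing...</div>'
        )
    # No hx-trigger here, so once this replaces the polling element
    # there is nothing left in the DOM to issue further requests.
    return f'<div class="analysis-result">{escape(draft or "")}</div>'
```

Escaping the draft matters because it is model-generated text being placed directly into HTML. A template engine with autoescaping enabled would do the same job.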

Destructive Actions without the "Refresh Hammer"

Managing a list of subreddits or keyword rules often involves frequent deletions. In a standard multi-page app, deleting an item usually triggers a full page refresh, a "refresh hammer" that breaks the flow. In HTMX, I used hx-delete to provide an "SPA-like" feel with zero manual JavaScript.

<button hx-delete="/rules/{{ keyword.id }}?page={{ pagination.current_page }}" 
        hx-target="#rules-table-container"
        hx-confirm="Delete this rule?">
    Delete
</button>
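The button passes the current page number along with the delete, which lets the server re-render the same page of the table afterwards. A sketch of that server-side step, using an in-memory list instead of a database, could look like the following. The function name `delete_rule`, the page size of 20, and the clamping behavior are assumptions rather than the project's code.

```python
import math


def delete_rule(
    rules: list[dict], rule_id: int, page: int, per_page: int = 20
) -> tuple[list[dict], list[dict], int]:
    """Remove one rule and return (remaining rules, rows to render, page shown)."""
    remaining = [rule for rule in rules if rule["id"] != rule_id]

    # Deleting the last row of the last page would leave an empty table,
    # so clamp the requested page to the pages that still exist.
    total_pages = max(1, math.ceil(len(remaining) / per_page))
    page = min(max(page, 1), total_pages)

    start = (page - 1) * per_page
    return remaining, remaining[start:start + per_page], page
```

The rows returned here would be rendered into the fragment that replaces #rules-table-container.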

State Management via URL: `hx-push-url`

A common critique of HTMX is that users lose the ability to use the "Back" button or share specific filtered views. In many frameworks, this requires a complex series of state hooks to keep the URL in sync.

In the dashboard of this project, I solved this with a single attribute: hx-push-url="true".

<button hx-get="/?page=2" 
        hx-target="#dashboard-content" 
        hx-push-url="true">
    Next Page
</button>

The "Traffic Controller" Pattern: Beyond HTMX with SSE

While HTMX is my primary driver, I hit a practical limit when implementing the System Monitor. My application relies on an autonomous background worker to handle the heavy lifting of lead qualification. I needed to stream high-frequency telemetry, such as heartbeats, logs, and state changes, back to the UI without the overhead of full HTML fragment swaps for every tiny update.

Instead of force-fitting HTMX OOB swaps into a high-velocity logging stream, I implemented a "Traffic Controller" pattern using plain JavaScript and Server-Sent Events (SSE).

// A simplified look at the Client-Side Traffic Controller
eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);

    // Pulse Updates: Update status pills and timers
    if (data.pulse) {
        updateStatusDisplay(data.pulse);
    }

    // Brain Updates: Update global stats badges
    if (data.memory) {
        updateStatsCounters(data.memory);
    }

    // Activity Stream: Append new logs to the terminal view
    if (data.log_entry) {
        appendLogToStream(data.log_entry);
    }
};

This was the superior choice for monitoring a decoupled background system. It allowed the server to emit lean JSON telemetry while the client-side logic handled the fine-grained DOM updates. It proves an essential point for any Independent Developer Consultant: HTMX is not a golden hammer. The bridge to a high-performance system often requires knowing exactly when to "drop down" into plain JavaScript to handle specialized data streams.
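On the server side, each message the client handler receives is a small text frame in the Server-Sent Events format: a `data:` line followed by a blank line. The sketch below shows one way to produce such frames in Python. The helper name `format_sse` is hypothetical, and the payload keys (`pulse`, `memory`, `log_entry`) are simply the ones the client code above checks for.

```python
import json


def format_sse(payload: dict) -> str:
    """Serialize a dict as one unnamed SSE event.

    json.dumps escapes newlines inside strings, so the payload always
    fits on a single data line. The trailing blank line ends the event.
    """
    return f"data: {json.dumps(payload)}\n\n"


def log_event(message: str) -> str:
    """Frame for the activity stream branch of the client handler."""
    return format_sse({"log_entry": message})
```

Events sent without an `event:` field arrive at the client's onmessage handler, which is why a single handler can dispatch on the keys present in the JSON. In FastAPI, strings like these would typically be yielded from a generator passed to a streaming response with the media type text/event-stream.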

3. Comparative Analysis: HTMX vs. Frontend Frameworks

In the engineering world, we often talk about "abstractions." A heavy frontend framework is a massive abstraction layer over the DOM. HTMX, conversely, is an extension of the browser's native hypermedia capabilities. Here is how they stack up based on my implementation.

Development Velocity: Ease and Speed of Implementation

The single biggest win with HTMX was the collapse of the "Middle Tier." In a traditional frontend-heavy stack, every feature requires three distinct workstreams: the backend database and API logic, the frontend data fetching and state management, and finally the UI mapping.

With HTMX, the Backend and Frontend Data layers are merged. The speed of implementation for the "Dashboard Search" feature in the Reddit Lead Qualification and Analysis System was illustrative. To implement a real-time semantic search with pgvector, I only had to:

Add a search input in HTML with hx-get="/" hx-trigger="keyup changed delay:500ms".
Update the existing Python dashboard route to filter by the q parameter.

Because Python already handles the HTML rendering via templates, I didn't have to write a single line of state management code to handle the search results. I estimate this saved me roughly 70% of the development time compared to a framework-heavy implementation.

Code Footprint: Lines of Code and Maintenance

Code is a liability. The more code I write, the more I have to debug and maintain. My implementation showed a dramatic reduction in "glue code."

JS Bundle Size: A typical modern project often starts at ~150KB for the framework alone, ballooning to 500KB+ with standard libraries. HTMX is 14KB. Even with the progress made by React Server Components in 2026, which can reduce bundles for specific segments of an app, the baseline infrastructure remains heavy. For the "Reddit Lead Qualification and Analysis System" project, HTMX means the "Time to Interactive" is incredibly fast even on slower connections.
LoC Reduction: By eliminating the need for client-side state managers and routers, I reduced the total frontend-associated lines of code by an estimated 60%. There are no more JSON reducers and no more manual event handlers to synchronize local UI with remote state.

4. Production Challenges & Trade-offs (The Honest Part)

As much as I appreciate hypermedia, it introduces specific challenges that must be addressed in a production environment. I am an engineer, not a fanboy; every decision involves a trade-off.

The Complexity Shift: Brain Power relocation

With HTMX, I am not removing complexity, I am shifting it. Instead of managing complexity in the browser via JavaScript frameworks, I am managing it on the server in Python.
The scripts/views/views.py file in this project is already substantial. Because the server is responsible for rendering fragments, the backend routes become more "UI-aware." I have to think about which piece of HTML is being returned and where it fits in the DOM.

This requires discipline with directory structures. I found that having a dedicated templates/fragments/ directory was essential. Without it, the backend logic becomes an unmaintainable mess of string concatenations and obscure template paths. In an independent project, that "mental load" of remembering where a fragment goes can be a bottleneck. If I were working with a larger team, I would need a very strict "contract" between the fragment names and the server responses to avoid breakage.
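One lightweight way to express such a contract, offered here as a suggestion rather than something the project does, is to list every fragment path in a single enum and check at startup (or in a test) that each file exists. The `Fragment` enum and `missing_fragments` function are hypothetical names; only the dashboard table path comes from the code shown earlier.

```python
from enum import Enum
from pathlib import Path


class Fragment(str, Enum):
    """Single source of truth for fragment template paths."""

    DASHBOARD_TABLE = "fragments/dashboard_table.html"
    # Further fragments would be registered here as the UI grows.


def missing_fragments(templates_dir: Path) -> list[str]:
    """Return the registered fragment paths that do not exist on disk."""
    return [
        fragment.value
        for fragment in Fragment
        if not (templates_dir / fragment.value).is_file()
    ]
```

Routes would then refer to `Fragment.DASHBOARD_TABLE.value` instead of a string literal, so a renamed or deleted template surfaces as a failed check rather than a runtime error on one code path.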

The Mental Model Tax

In 2026, a major challenge with modern React (especially RSC) is the "blurring" of the line between what runs on the server and what runs on the client. It requires significant mental effort to remember which component has access to which environment. HTMX keeps that line crystal clear; the server renders HTML, the client displays it. This predictability is a massive boon for development speed.

Error Handling in a Hypermedia World

In a JSON API, if a request fails, the client receives a status code (like 401 or 500) and can cleanly display a notification using a frontend library. In HTMX, the defaults are less helpful. Out of the box, htmx does not swap 4xx and 5xx responses at all, so a failed request changes nothing on screen and gives me no feedback. If error responses are configured to swap, a stack trace or generic error page can land in the middle of a table. Either way, this is a poor user experience.

I had to implement custom logic using headers to trigger UI events for errors while still maintaining the hypermedia flow. For instance, I use the HX-Trigger header to send events to a global toast notification system. This requires a small "bridge" of plain JavaScript, proving that in a complex single-tenant SaaS application, you can never truly be 100% "JavaScript-free."
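As a sketch of the server half of that bridge: htmx accepts a JSON object in the HX-Trigger response header, where each key is an event name and each value becomes the event's detail. The helper below builds such a header. The function name `toast_headers` and the event name `showToast` are assumptions; the client would need a matching listener for that event.

```python
import json


def toast_headers(message: str, level: str = "error") -> dict[str, str]:
    """Build response headers that ask htmx to fire a toast event client-side."""
    # json.dumps emits ASCII-only output by default, which keeps the
    # value safe to send as an HTTP header even for non-ASCII messages.
    detail = {"showToast": {"message": message, "level": level}}
    return {"HX-Trigger": json.dumps(detail)}
```

These headers can be attached to any response, including an error response, so the notification logic stays on the server and the JavaScript bridge only has to display whatever detail it receives.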

The "Round-Trip" Latency and Interactivity

HTMX is inherently server-centric. Every interaction requires a round trip to the server. If I'm using the "Reddit Lead Qualification and Analysis System" on a high-latency connection, the 200ms delay for every button click can be noticeable.

For highly interactive elements, like complex drag-and-drops or "real-time" text editors, HTMX can be supplemented with lightweight libraries like Alpine.js. I view Alpine as modern-day progressive enhancement; it is a way to "sprinkle" local interactivity without abandoning the hypermedia core, echoing the classic web philosophy I have seen evolve since the early 90s. Alpine.js handles the "low-level" UI state, like opening a modal or toggling a dropdown, instantly without hitting the server, while HTMX handles the "heavy-lifting" data updates. For the majority of the "Reddit Lead Qualification and Analysis System" single-tenant SaaS functionality, where each customer gets their own dedicated instance isolated in a container, the HTMX round-trip model is more than sufficient.

5. Performance & Operational Considerations

In the "Reddit Lead Qualification and Analysis System" project, I focused on hard metrics to ensure the system remained performant under production loads. I didn't want to rely on architectural "vibes"; I wanted numbers.

Bandwidth Usage in the Reddit Lead Qualification and Analysis System

A common concern with HTMX is that rendering HTML on the server is more expensive than serializing JSON. If a JSON response is 2KB and the HTML is 10KB, surely the JSON is better?
For this project:

CPU Overhead: In the FastAPI environment, the "Time to Render" for a dashboard fragment (approx. 50 leads with nested Post data) averaged 12ms. A pure JSON serialization of the same data recorded around 4ms. This 8ms difference is imperceptible to a human.
Bandwidth Usage: In the Reddit Lead Qualification and Analysis System dashboard, the partial HTML fragment was roughly 12KB, while the equivalent JSON was about 3KB.
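To put the two measurements on one scale, here is a back-of-the-envelope calculation that adds render time and transfer time. The 10 Mbps link speed is an assumption chosen for illustration, and the model deliberately ignores network latency, compression, and client-side work, so it is a rough bound rather than a benchmark.

```python
def transfer_ms(size_kb: float, bandwidth_mbps: float) -> float:
    """Time to move size_kb kilobytes over a link of bandwidth_mbps megabits/s."""
    bits = size_kb * 1000 * 8
    return bits / (bandwidth_mbps * 1_000_000) * 1000


def total_ms(render_ms: float, size_kb: float, bandwidth_mbps: float) -> float:
    """Server render time plus wire time, in milliseconds."""
    return render_ms + transfer_ms(size_kb, bandwidth_mbps)


# Figures from the measurements above; the link speed is assumed.
html_total = total_ms(render_ms=12, size_kb=12, bandwidth_mbps=10)
json_total = total_ms(render_ms=4, size_kb=3, bandwidth_mbps=10)
```

Under these assumptions the HTML fragment costs about 21.6 ms against about 6.4 ms for JSON, a gap of roughly 15 ms before the JSON client has parsed anything or updated the DOM.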

While the HTML is larger, I have to account for the Hydration Tax found in modern frameworks. Even with RSC-based apps in 2026, the browser often has to download similar amounts of data twice: once as part of the initial HTML shell and once as serialized component logic. HTMX avoids this entirely by sending only what is needed for the DOM update. HTMX relies on the browser's native engine for HTML insertion, which keeps the total work needed to reach an interactive page lower for this project than in many framework-centric alternatives.

Operational Simplicity: An Independent Developer's Best Friend

From an operational standpoint, the deployment of the "Reddit Lead Qualification and Analysis System" was significantly simpler than any React-based project I've seen. In typical React+Python projects, the pipeline involves multiple build steps: installing Node, running npm builds, managing separate asset storage (like S3 or a CDN), and dealing with CORS.

In this project, the frontend is the backend. I have a single deployment process. I build one Docker image that contains my Python code and my HTML templates. There is no separate "frontend build" step that can fail because of a minor version mismatch in a transitive dependency. There is no CORS configuration to debug because the UI and the Data are served from the same domain. This reduction in operational surface area is a massive win for reliability.

6. Lessons Learned & Recommendations

After implementing five production single-tenant SaaS and web applications with HTMX, including the "Reddit Lead Qualification and Analysis System," the most important lesson I've learned is that simplicity scales.

HTMX is SaaS-Ready

While often discussed for internal tools or small hobby projects, HTMX is perfectly capable of powering a production single-tenant SaaS. I am using it to manage a complex pipeline of AI interactions and data ingestion. Its simplicity is a multiplier for an Independent Developer Consultant because it reduces the "Context Switching" overhead. I am always in a "Python State of Mind," whether I'm writing data models or UI logic.

For those rare components that require extreme local interactivity, libraries like Alpine.js can bridge the gap perfectly without the overhead of a full framework stack.

Final Recommendation: Focus on the Problem, Not the Plumbing

My experience with the "Reddit Lead Qualification and Analysis System" suggests that the majority of modern business applications could be built more efficiently with HTMX. Frameworks have their place in specialized "apps" (like complex graphic editors or offline-first tools), but for dashboard-driven systems, hypermedia is the superior choice.

HTMX allowed me to move at the speed of thought. I could ship real features to production in minutes while maintaining a codebase that I actually enjoy working in. It removed the "plumbing" of modern web development and let me focus on the actual problem: finding and qualifying leads with AI.

If you are starting a new project, I urge you to look at the fundamentals of the web. Embrace hypermedia, and spend your complexity budget on solving the actual business problem, not on managing the plumbing of your frontend framework.

Resources for Further Exploration

HTMX Documentation: The definitive guide.
Alpine.js: Perfect for local interactivity.
The Hypermedia Systems Book: Philosophy of the web.
FastAPI + HTMX Tutorial: A practical starter guide.
HATEOAS Guide: Understanding the core philosophy.
