<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Radoslav Tsvetkov</title>
    <description>The latest articles on DEV Community by Radoslav Tsvetkov (@radotsvetkov).</description>
    <link>https://dev.to/radotsvetkov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873179%2Ffec4dcd5-6606-4a6b-a397-76c98c39d6b0.png</url>
      <title>DEV Community: Radoslav Tsvetkov</title>
      <link>https://dev.to/radotsvetkov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/radotsvetkov"/>
    <language>en</language>
    <item>
      <title>I built a memory layer for AI assistants that refuses to fake citations</title>
      <dc:creator>Radoslav Tsvetkov</dc:creator>
      <pubDate>Tue, 05 May 2026 11:43:36 +0000</pubDate>
      <link>https://dev.to/radotsvetkov/i-built-a-memory-layer-for-ai-assistants-that-refuses-to-fake-citations-228a</link>
      <guid>https://dev.to/radotsvetkov/i-built-a-memory-layer-for-ai-assistants-that-refuses-to-fake-citations-228a</guid>
      <description>&lt;p&gt;A few months ago I was using my AI assistant to dig through my Obsidian vault. I asked it about a decision I had made on a side project, and it answered with confidence. The answer cited two notes. One of them existed but did not say what the model claimed. The other did not exist at all.&lt;/p&gt;

&lt;p&gt;I kept staring at the response for a minute, because it sounded exactly right. It matched what I half-remembered. If I had not gone to check the source, I would have used that answer in a real conversation with someone.&lt;/p&gt;

&lt;p&gt;That is the problem I want to tell you about, and the one I have spent the last few months trying to fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "chat with your notes" tools quietly lie to you
&lt;/h3&gt;

&lt;p&gt;The standard recipe for personal RAG looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take your markdown vault.&lt;/li&gt;
&lt;li&gt;Split it into chunks.&lt;/li&gt;
&lt;li&gt;Index those chunks with keyword search and embeddings.&lt;/li&gt;
&lt;li&gt;At query time, retrieve the top few chunks.&lt;/li&gt;
&lt;li&gt;Hand them to a language model with an instruction like "answer using only the context, and cite your sources."&lt;/li&gt;
&lt;li&gt;Hope.&lt;/li&gt;
&lt;/ol&gt;
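
&lt;p&gt;As a sketch, the whole recipe fits in a dozen lines (hypothetical helper names; &lt;code&gt;score&lt;/code&gt; and &lt;code&gt;ask&lt;/code&gt; stand in for the embedding search and the model call):&lt;/p&gt;

```python
def chunk(text, size=800):
    # Steps 1-2: fixed-size character chunks (real tools split on headings).
    return [text[i:i + size] for i in range(0, len(text), size)]

def naive_rag(vault, question, score, ask):
    # Step 3: index is implicit here; `score` ranks a chunk against the question.
    chunks = [(note, c) for note, text in vault.items() for c in chunk(text)]
    # Step 4: keep the top few chunks by relevance.
    top = sorted(chunks, key=lambda nc: score(question, nc[1]), reverse=True)[:4]
    context = "\n\n".join("[{}] {}".format(note, c) for note, c in top)
    # Step 5: the citation rule is a polite request, not a guarantee.
    prompt = ("Answer using only the context, and cite your sources.\n\n"
              + context + "\n\nQ: " + question)
    return ask(prompt)  # Step 6: hope.
```

&lt;p&gt;Nothing in that prompt string is enforceable. That is the entire problem.&lt;/p&gt;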

&lt;p&gt;Step six is where things go sideways. The model is not actually obligated to cite anything correctly. It is just being asked nicely. When it gets confused, it picks a note title that sounds related and uses it as a citation. When it knows there is a related concept in the vault but cannot find a clean quote, it paraphrases and then attributes that paraphrase to a passage that does not exist anywhere.&lt;/p&gt;

&lt;p&gt;For chatting with Wikipedia, this is fine. For your own decisions, meeting notes, and life events, it is not. The whole point of writing things down was to have a single source of truth. If your AI layer can invent quotes from notes that do not exist, you have lost the property that made the notes useful in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  The shift: claims, not chunks
&lt;/h3&gt;

&lt;p&gt;Memora's central idea is small but, once you accept it, kind of irreversible.&lt;/p&gt;

&lt;p&gt;Stop treating the chunk of text as the unit of memory. Treat the claim as the unit of memory.&lt;/p&gt;

&lt;p&gt;A claim is a small structured fact. It has a subject, a predicate, and an object. It is extracted from a specific note, and it carries the exact byte range of the source text it came from, plus a blake3 hash of that source. It also carries a validity window (when the claim started being true and, optionally, when it stopped) and a privacy band.&lt;/p&gt;
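
&lt;p&gt;A rough sketch of that shape (field names are my guesses, not Memora's actual schema, and the standard library's &lt;code&gt;blake2b&lt;/code&gt; stands in for blake3 here):&lt;/p&gt;

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Claim:
    id: str
    subject: str            # "I"
    predicate: str          # "chose"
    object: str             # "SQLite over Postgres"
    source_note: str        # path of the note the claim was extracted from
    span: tuple             # (start_byte, end_byte) of the source text
    span_hash: str          # hash of the span bytes at extraction time
    valid_from: str         # when the claim started being true
    valid_to: Optional[str] = None   # None while the claim is still true
    privacy: str = "private"

def span_hash(note_bytes, start, end):
    # Memora uses blake3; blake2b is the stdlib stand-in in this sketch.
    return hashlib.blake2b(note_bytes[start:end]).hexdigest()
```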

&lt;p&gt;When you index your vault, Memora calls a model once per note to extract these claims. They go into a SQLite database with edges between them, like "supersedes", "contradicts", and "derives_from".&lt;/p&gt;

&lt;p&gt;When you query, you do not get raw chunks back. You get claims. Each claim has an ID. The answering model is asked to cite specific claim IDs in its response.&lt;/p&gt;

&lt;p&gt;Then comes the part that, in my opinion, makes the whole thing actually useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validation before you ever see the answer
&lt;/h3&gt;

&lt;p&gt;Before any answer reaches the user, Memora does the following:&lt;/p&gt;

&lt;p&gt;For each cited claim ID, it looks up the byte range and the source note, re-reads the exact span from your markdown on disk, and recomputes the blake3 hash. If the hash matches the one stored at extraction time, the citation stands. If it does not, the citation is stripped. If the model invented an ID that does not exist in the database, that one is stripped too.&lt;/p&gt;
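
&lt;p&gt;The check itself is simple enough to sketch (names are illustrative, and &lt;code&gt;blake2b&lt;/code&gt; again stands in for blake3):&lt;/p&gt;

```python
import hashlib

def verify_citations(cited_ids, db, read_note):
    # db maps a claim ID to (note_path, start, end, stored_hash);
    # read_note returns the note's current bytes from disk.
    valid = []
    for cid in cited_ids:
        record = db.get(cid)
        if record is None:
            continue  # the model invented this ID: strip it
        note, start, end, stored = record
        current = hashlib.blake2b(read_note(note)[start:end]).hexdigest()
        if current == stored:
            valid.append(cid)  # the cited bytes still say what they said
        # otherwise the note changed under the claim: strip the citation
    return valid
```

&lt;p&gt;The comparison is over exact bytes, so a citation whose source has drifted, or that never existed, cannot survive the filter.&lt;/p&gt;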

&lt;p&gt;If the answer ends up with no valid citations left, the model is re-prompted, this time with only the claims that survived as context. The system enforces the citation contract through Rust types and span hashes, not through prompt obedience.&lt;/p&gt;

&lt;p&gt;This is a different trust model from "ask the model to be careful". It does not assume good behavior. It checks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The boring engineering decisions
&lt;/h3&gt;

&lt;p&gt;I want to talk about a few choices that look small but pay off every single day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rust as the language.&lt;/strong&gt; I wanted a single binary you could drop on a machine and forget. No Python environment, no node_modules, no Docker required. Cargo install or download a release. The type system also makes the citation contract enforceable at compile time, which matters when you are trying to make a guarantee like "no unverified citation will ever leave this function".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQLite + HNSW for storage.&lt;/strong&gt; Personal vaults are not at internet scale. A few thousand notes is the realistic upper bound for a long time. SQLite handles claims and edges fine. HNSW handles vector search fine. There is no Kafka, no vector DB service, no infrastructure for you to run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obsidian as the substrate.&lt;/strong&gt; I did not want to invent another note-taking app. The vault stays in plain markdown. You can edit in Obsidian, in vim, in TextEdit. Memora watches for changes and re-extracts claims from changed notes. The notes are still yours, in a format that will outlive the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP for integration.&lt;/strong&gt; Instead of building a chat UI, Memora exposes its tools over the Model Context Protocol. That means it works inside Claude Desktop, Cursor, and anything else that speaks MCP. You bring the chat interface you already use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Things that took longer than I expected
&lt;/h3&gt;

&lt;p&gt;Two surprises worth mentioning.&lt;/p&gt;

&lt;p&gt;The first one is that "atomic claims" is harder than it sounds. Early on I had the extractor pulling things like "the project was good", which is technically a claim but completely useless. The current extraction prompt has been through many revisions and is paired with deduplication, normalization, and a gate that filters single-claim noise out of the active challenger output. There is still room to improve. If you have ideas, I would love to hear them.&lt;/p&gt;

&lt;p&gt;The second one is that local LLMs are not yet good enough at extraction. Qwen 14B hallucinates relationships. Qwen 32B is acceptable but misses cross-region patterns. Llama 70B can match Claude Haiku quality but at significant memory cost. So the recommended setup right now is Claude Haiku for extraction (about $0.30 to index a 100-note vault) with a local model for embeddings. The fully local path works, but I want to be honest that it is not at production quality yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the project is now
&lt;/h3&gt;

&lt;p&gt;It is at v0.1.27. It is open source under Apache 2.0. It indexes 100-note vaults in 5 to 10 minutes with Claude Haiku. The active challenger surfaces decisions, contradictions, stale dependencies, and open questions in a generated atlas page in your vault, so you can keep an eye on the state of your own knowledge over time.&lt;/p&gt;

&lt;p&gt;If you live in your notes, if you have ever asked an AI tool a question about your own writing and gotten back something that was not actually there, please try it.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/radotsvetkov/memora" rel="noopener noreferrer"&gt;https://github.com/radotsvetkov/memora&lt;/a&gt;&lt;br&gt;
Architecture demo: &lt;a href="https://radotsvetkov.github.io/memora" rel="noopener noreferrer"&gt;https://radotsvetkov.github.io/memora&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am especially interested in feedback from people with messy vaults, weird folder layouts, and strong opinions about retrieval. Issues, edge cases, and design discussions are welcome on GitHub.&lt;/p&gt;

&lt;p&gt;If you read this far, thank you. I would much rather hear that I am wrong about something specific than that this looks neat.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>obsidian</category>
      <category>rust</category>
    </item>
    <item>
      <title>Building an Autonomous Coding Agent in Rust: Architecture, Decisions, and What I Learned</title>
      <dc:creator>Radoslav Tsvetkov</dc:creator>
      <pubDate>Sat, 11 Apr 2026 09:25:29 +0000</pubDate>
      <link>https://dev.to/radotsvetkov/building-an-autonomous-coding-agent-in-rust-architecture-decisions-and-what-i-learned-3p2a</link>
      <guid>https://dev.to/radotsvetkov/building-an-autonomous-coding-agent-in-rust-architecture-decisions-and-what-i-learned-3p2a</guid>
      <description>&lt;p&gt;I have been building Akmon for several months — a terminal AI coding agent that ships as a single Rust binary. No separate runtime, no package manager, no installer. Copy the file and it works.&lt;br&gt;
This is not a "here is my project" post. It is an honest account of the decisions I made, the tradeoffs involved, and the things that surprised me. If you are building in the agent space or are curious how autonomous tool-calling loops actually behave in practice, I hope it is useful.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Rust
&lt;/h2&gt;

&lt;p&gt;The choice was pragmatic, not ideological.&lt;br&gt;
I needed one artifact that behaves identically on a developer's MacBook, a Linux server accessed over SSH, a Docker container in CI, and an air-gapped environment with no internet access. Rust's static linking story and lack of a managed runtime match that deployment model directly. The release binary uses LTO, size-optimized settings, and stripping. The result is 3.4 MB that runs anywhere you can run a normal executable.&lt;/p&gt;
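
&lt;p&gt;In &lt;code&gt;Cargo.toml&lt;/code&gt; terms, those release settings look something like this (my reconstruction; the repo's exact profile may differ):&lt;/p&gt;

```toml
[profile.release]
lto = true          # link-time optimization across crates
opt-level = "z"     # optimize for size ("s" is the milder option)
strip = true        # drop debug symbols from the binary
```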

&lt;p&gt;The second reason is structural. An agent session involves a lot of moving parts simultaneously: streaming completions from an HTTP API, a growing conversation history, permission prompts waiting for user input, and a terminal UI rendering in real time. Rust's ownership model and async ecosystem make it feasible to keep that complexity under control. Compiler-enforced boundaries between crates mean that accidental coupling fails at build time rather than at runtime.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Workspace Architecture
&lt;/h2&gt;

&lt;p&gt;Akmon is a multi-crate Cargo workspace. Each crate has a single clear responsibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;akmon-cli       — binary entry point, CLI parsing, wiring
akmon-core      — permissions, sandbox, audit types, project layout
akmon-config    — configuration loading and defaults
akmon-models    — provider implementations and streaming protocol
akmon-tools     — tool implementations (read_file, shell, edits, specs...)
akmon-query     — agent loop, session management, context assembly
akmon-tui       — ratatui terminal interface
akmon-index     — optional semantic search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dependency flow is strictly inward. The CLI depends on everything. &lt;code&gt;akmon-core&lt;/code&gt; depends on nothing in the workspace. If a tool implementation accidentally imports from the TUI layer, the build fails. The architecture is enforced by the compiler, not by convention.&lt;/p&gt;

&lt;p&gt;The most complex logic lives in &lt;code&gt;akmon-query&lt;/code&gt;, specifically in &lt;code&gt;session.rs&lt;/code&gt;. Everything else exists to serve what happens there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Provider Abstraction
&lt;/h2&gt;

&lt;p&gt;Eight providers — Anthropic, OpenAI, OpenRouter, Groq, Azure, Bedrock, Ollama, and custom OpenAI-compatible endpoints — need to work through a single interface so the agent loop does not need to know or care which one is active.&lt;/p&gt;

&lt;p&gt;Every backend implements the &lt;code&gt;LlmProvider&lt;/code&gt; trait. The key method takes a list of messages and a configuration struct (tools, max tokens, session ID, optional fallback model) and returns a stream of events: text deltas, tool calls, usage reports, and errors. The agent loop consumes these events identically regardless of which provider produced them.&lt;/p&gt;

&lt;p&gt;Provider selection happens once at session start in &lt;code&gt;LlmConnectConfig::resolve&lt;/code&gt;. The priority chain matters and is easy to get wrong:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Amazon Bedrock when AWS context is configured&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claude-*&lt;/code&gt; model names route to Anthropic directly, or via OpenRouter if no Anthropic key exists, or error clearly if neither is available&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;org/model&lt;/code&gt; format routes to OpenRouter&lt;/li&gt;
&lt;li&gt;Azure OpenAI when endpoint and key are both present&lt;/li&gt;
&lt;li&gt;OpenAI when the model matches Chat API patterns and a key is set&lt;/li&gt;
&lt;li&gt;Groq when the model matches Groq-hosted patterns and a key is set&lt;/li&gt;
&lt;li&gt;Custom OpenAI-compatible endpoint when configured&lt;/li&gt;
&lt;li&gt;Ollama for everything else&lt;/li&gt;
&lt;/ol&gt;
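
&lt;p&gt;A condensed sketch of that chain (illustrative names; the real pattern matching is richer than the &lt;code&gt;startswith&lt;/code&gt; stand-ins used here):&lt;/p&gt;

```python
def resolve_provider(model, cfg):
    # Checked top to bottom; the first match wins.
    if cfg.get("aws_configured"):
        return "bedrock"
    if model.startswith("claude-"):
        if cfg.get("anthropic_key"):
            return "anthropic"
        if cfg.get("openrouter_key"):
            return "openrouter"
        raise ValueError("claude-* model but no Anthropic or OpenRouter key")
    if "/" in model:
        return "openrouter"      # org/model format
    if cfg.get("azure_endpoint") and cfg.get("azure_key"):
        return "azure"
    if cfg.get("openai_key") and model.startswith("gpt-"):
        return "openai"          # "gpt-" stands in for the Chat API patterns
    if cfg.get("groq_key") and model.startswith("llama-"):
        return "groq"            # stands in for the Groq-hosted patterns
    if cfg.get("custom_endpoint"):
        return "custom"
    return "ollama"              # everything else
```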

&lt;p&gt;The bug you only fix once: &lt;code&gt;claude-*&lt;/code&gt; must be resolved before the Ollama fallback. Otherwise the tool quietly sends Anthropic API requests to a local Ollama server that speaks a completely different protocol. I learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Loop
&lt;/h2&gt;

&lt;p&gt;The loop lives in &lt;code&gt;session.rs&lt;/code&gt; under a labeled &lt;code&gt;'session: loop&lt;/code&gt;. It is not simply "run until the model says stop." Before each iteration it checks an iteration limit (default 25), a budget cap in headless mode, and several error conditions. Autonomy is bounded by configuration, not just by model behavior.&lt;/p&gt;

&lt;p&gt;Each iteration follows the same sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply context compaction if needed&lt;/li&gt;
&lt;li&gt;Compose the message list for the next API call&lt;/li&gt;
&lt;li&gt;For Ollama, trim to system messages plus the last six non-system messages&lt;/li&gt;
&lt;li&gt;Call the provider and consume the stream until a Done event arrives&lt;/li&gt;
&lt;li&gt;Handle the stop reason&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stop reasons determine what happens next:&lt;br&gt;
&lt;strong&gt;ToolUse:&lt;/strong&gt; execute the requested tools, append results to context, continue the loop.&lt;br&gt;
&lt;strong&gt;EndTurn with tool calls:&lt;/strong&gt; the model produced both text and tool calls. Execute the tools and continue.&lt;br&gt;
&lt;strong&gt;EndTurn with no tool calls:&lt;/strong&gt; genuine completion. Emit a Done event, persist context, exit the loop.&lt;br&gt;
&lt;strong&gt;MaxTokens:&lt;/strong&gt; the response was truncated. If there were tool calls, execute what arrived and continue without consuming a continuation credit. If there were no tool calls, inject a continuation user message and loop again, up to three times. After three truncations without completion, surface a clear error to the user.&lt;/p&gt;

&lt;p&gt;The natural exit is &lt;code&gt;EndTurn&lt;/code&gt; with no pending tool calls. Everything else is a managed edge case.&lt;/p&gt;
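
&lt;p&gt;Put together, the skeleton of the loop looks roughly like this (a sketch with illustrative names, not Akmon's actual code; compaction is elided and covered in the next section):&lt;/p&gt;

```python
def session_loop(provider, context, tools, max_iterations=25, max_continuations=3):
    # `provider.stream` returns an object with .stop, .text, .tool_calls;
    # `tools` executes the requested tool calls and returns result messages.
    continuations = 0
    for _ in range(max_iterations):
        messages = list(context)  # context compaction elided in this sketch
        if provider.name == "ollama":
            # system messages plus the last six non-system messages
            system = [m for m in messages if m.get("role") == "system"]
            rest = [m for m in messages if m.get("role") != "system"]
            messages = system + rest[-6:]
        result = provider.stream(messages)  # consume until a Done event

        if result.stop == "tool_use" or (result.stop == "end_turn" and result.tool_calls):
            context.extend(tools(result.tool_calls))
            continue
        if result.stop == "end_turn":
            return result.text               # the natural exit
        if result.stop == "max_tokens":
            if result.tool_calls:
                context.extend(tools(result.tool_calls))
                continue                     # no continuation credit consumed
            continuations = continuations + 1
            if continuations == max_continuations:
                raise RuntimeError("truncated %d times without completing" % max_continuations)
            context.append({"role": "user", "content": "continue"})
    raise RuntimeError("iteration limit reached")
```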

&lt;h2&gt;
  
  
  Context Management
&lt;/h2&gt;

&lt;p&gt;This is the hardest problem in building a coding agent and the one that takes the most iteration to get right.&lt;/p&gt;

&lt;p&gt;Every turn adds tokens to the context. File reads, tool results, conversation history — it all accumulates. Eventually you hit the model's context window limit and things break in ways that are hard to debug. Akmon handles this at three levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microcompact&lt;/strong&gt; runs after each turn. Older tool results are replaced with a short placeholder to prevent linear context growth. The implementation is more careful than a simple character count: it never clears write or edit tool results (those are too important), only clears shell output when it exceeds 500 characters, and keeps the most recent 20 messages intact. Groq keeps 12 due to tighter context limits.&lt;/p&gt;
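
&lt;p&gt;A minimal sketch of that pass, assuming messages are plain dicts (the real rules are richer):&lt;/p&gt;

```python
def microcompact(messages, keep_recent=20):
    # Replace older tool results with a placeholder; never touch write/edit
    # results, and only clear shell output past 500 characters.
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    for msg in head:
        if msg.get("role") != "tool":
            continue
        if msg.get("tool") in ("write_file", "edit_file"):
            continue  # too important to drop
        if msg.get("tool") == "shell" and not msg["content"][500:]:
            continue  # content[500:] is empty for short output, so it stays
        msg["content"] = "[tool result compacted]"
    return head + tail
```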

&lt;p&gt;&lt;strong&gt;Autocompact&lt;/strong&gt; triggers when estimated input tokens exceed 85% of usable context. At that point, a prefix of conversation history is summarized via the same provider and folded back into the context as a system message. The agent continues from the summary rather than from the raw history. You lose some detail but you gain the ability to keep working on a large project without hitting a hard wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec files&lt;/strong&gt; are the real solution because they avoid the problem entirely. Before implementing anything significant, the agent writes a detailed plan to &lt;code&gt;.akmon/specs/plan.md&lt;/code&gt;. This file lives on disk, persists across sessions, and survives compaction. Working from a spec rather than from accumulated conversation history keeps the implementation context clean from the start.&lt;/p&gt;

&lt;p&gt;These are genuinely different things. Microcompact manages turn-by-turn growth. Autocompact handles sessions that run long. Specs are a workflow pattern that reduces how much context you need in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Caching
&lt;/h2&gt;

&lt;p&gt;Anthropic's prompt caching charges 10% of the normal input token price for cache reads. In practice, for a 30-turn session building a web application, around 35 to 40% of input tokens are served from cache. On a session that would otherwise cost $0.54, caching brings it to around $0.35.&lt;/p&gt;

&lt;p&gt;The implementation detail that matters: in the Messages API, caching is controlled by explicit &lt;code&gt;cache_control&lt;/code&gt; breakpoints placed on content blocks (the system prompt, the tool definitions, the conversation prefix), and the request has to place them deliberately. Getting this wrong means paying full price for tokens that should be cached.&lt;/p&gt;

&lt;p&gt;Treat these numbers as measurements from specific sessions, not as guarantees. Your cache hit rate depends heavily on how much your system prompt changes between turns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Permission System
&lt;/h2&gt;

&lt;p&gt;Every tool call that modifies state — file writes, shell commands, web requests — goes through a permission check before execution. The user sees a prompt and chooses: allow once, allow for the session, allow all writes, or deny.&lt;/p&gt;

&lt;p&gt;Every decision is recorded in the audit log as a JSON Lines event tagged with the event kind (policy evaluation, tool dispatch, tool outcome, agent step). The schema is more structured than a flat key-value pair — it needs to be, because "what did the agent do and why" is a question that gets asked later, not during the session.&lt;/p&gt;
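
&lt;p&gt;In the JSON Lines style described above, writing an event might look like this (field names are illustrative, not Akmon's actual schema):&lt;/p&gt;

```python
import json
import time

def audit_event(path, kind, payload):
    # Append one JSON Lines record per event; `kind` is the event kind
    # (policy evaluation, tool dispatch, tool outcome, agent step).
    event = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "kind": kind,
        "payload": payload,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```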

&lt;p&gt;This is the difference between "the AI wrote some code" and "at 11:23:45 UTC the agent requested to write src/auth.rs, the user approved it for this session, and the result was a 47-line file." In professional environments that distinction matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retries and rate limits mid-stream are genuinely difficult.&lt;/strong&gt; A rate limit that arrives halfway through a streaming response means partial state, a user who is watching status messages, and the need to not double-count the iteration. Getting this right took longer than any other single piece of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local models need aggressive context trimming.&lt;/strong&gt; A 30k token context that Anthropic processes in 2 seconds takes 90 seconds or more on a local 9B model running on consumer hardware. Trimming to system messages plus the last six non-system messages before sending to Ollama made the difference between something usable and something that times out constantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage aggregation is easy to get wrong.&lt;/strong&gt; Anthropic returns token counts at the end of each streaming response. Accumulating these correctly across 35 API calls in a single session requires careful state management in both the session layer and the TUI. When this breaks, users see $0.14 instead of $0.68. I know because it happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The TUI is the most visible part but the least interesting engineering.&lt;/strong&gt; ratatui is excellent. But the real product is in the policy engine, context management, and provider correctness. The TUI is just the window into those systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Next
&lt;/h2&gt;

&lt;p&gt;The codebase already includes MCP configuration and an HTTP MCP client path — so the next work there is expanding subprocess and stdio server support, not starting from scratch. Other areas still in progress: checkpoint and rewind for safer autonomous operation, shell state persistence across tool calls, and first-class Windows support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Akmon is Apache 2.0.&lt;/strong&gt; The repo is &lt;a href="https://github.com/radotsvetkov/akmon" rel="noopener noreferrer"&gt;github.com/radotsvetkov/akmon&lt;/a&gt; and the docs are at &lt;a href="https://radotsvetkov.github.io/akmon" rel="noopener noreferrer"&gt;radotsvetkov.github.io/akmon&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are building agents I would genuinely like to hear what you are working on.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Rado&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>opensource</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
