<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Radoslav Tsvetkov</title>
    <description>The latest articles on DEV Community by Radoslav Tsvetkov (@radotsvetkov).</description>
    <link>https://dev.to/radotsvetkov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873179%2Ffec4dcd5-6606-4a6b-a397-76c98c39d6b0.png</url>
      <title>DEV Community: Radoslav Tsvetkov</title>
      <link>https://dev.to/radotsvetkov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/radotsvetkov"/>
    <language>en</language>
    <item>
      <title>Building an Autonomous Coding Agent in Rust: Architecture, Decisions, and What I Learned</title>
      <dc:creator>Radoslav Tsvetkov</dc:creator>
      <pubDate>Sat, 11 Apr 2026 09:25:29 +0000</pubDate>
      <link>https://dev.to/radotsvetkov/building-an-autonomous-coding-agent-in-rust-architecture-decisions-and-what-i-learned-3p2a</link>
      <guid>https://dev.to/radotsvetkov/building-an-autonomous-coding-agent-in-rust-architecture-decisions-and-what-i-learned-3p2a</guid>
      <description>&lt;p&gt;I have been building Akmon for several months — a terminal AI coding agent that ships as a single Rust binary. No separate runtime, no package manager, no installer. Copy the file and it works.&lt;br&gt;
This is not a "here is my project" post. It is an honest account of the decisions I made, the tradeoffs involved, and the things that surprised me. If you are building in the agent space or are curious how autonomous tool-calling loops actually behave in practice, I hope it is useful.&lt;/p&gt;
&lt;h2&gt;Why Rust&lt;/h2&gt;

&lt;p&gt;The choice was pragmatic, not ideological.&lt;br&gt;
I needed one artifact that behaves identically on a developer's MacBook, a Linux server accessed over SSH, a Docker container in CI, and an air-gapped environment with no internet access. Rust's static linking story and lack of a managed runtime match that deployment model directly. The release binary uses LTO, size-optimized settings, and stripping. The result is a 3.4 MB binary that runs anywhere you can run a normal executable.&lt;/p&gt;
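&lt;p&gt;For reference, a size-focused release profile in &lt;code&gt;Cargo.toml&lt;/code&gt; typically combines these documented Cargo options (a sketch of standard settings; Akmon's exact profile may differ):&lt;/p&gt;

```toml
[profile.release]
lto = true          # link-time optimization across all crates
opt-level = "z"     # optimize for size rather than speed
strip = true        # strip debug symbols from the binary
codegen-units = 1   # slower compile, better whole-program optimization
panic = "abort"     # drop stack-unwinding machinery
```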

&lt;p&gt;The second reason is structural. An agent session involves a lot of moving parts simultaneously: streaming completions from an HTTP API, a growing conversation history, permission prompts waiting for user input, and a terminal UI rendering in real time. Rust's ownership model and async ecosystem make it feasible to keep that complexity under control. Compiler-enforced boundaries between crates mean that accidental coupling fails at build time rather than at runtime.&lt;/p&gt;
&lt;h2&gt;The Workspace Architecture&lt;/h2&gt;

&lt;p&gt;Akmon is a multi-crate Cargo workspace. Each crate has a single clear responsibility:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;akmon-cli       — binary entry point, CLI parsing, wiring
akmon-core      — permissions, sandbox, audit types, project layout
akmon-config    — configuration loading and defaults
akmon-models    — provider implementations and streaming protocol
akmon-tools     — tool implementations (read_file, shell, edits, specs...)
akmon-query     — agent loop, session management, context assembly
akmon-tui       — ratatui terminal interface
akmon-index     — optional semantic search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dependency flow is strictly inward. The CLI depends on everything. &lt;code&gt;akmon-core&lt;/code&gt; depends on nothing in the workspace. If a tool implementation accidentally imports from the TUI layer, the build fails. The architecture is enforced by the compiler, not by convention.&lt;/p&gt;

&lt;p&gt;The most complex logic lives in &lt;code&gt;akmon-query&lt;/code&gt;, specifically in &lt;code&gt;session.rs&lt;/code&gt;. Everything else exists to serve what happens there.&lt;/p&gt;

&lt;h2&gt;The Provider Abstraction&lt;/h2&gt;

&lt;p&gt;Eight providers — Anthropic, OpenAI, OpenRouter, Groq, Azure, Bedrock, Ollama, and custom OpenAI-compatible endpoints — need to work through a single interface so the agent loop does not need to know or care which one is active.&lt;/p&gt;

&lt;p&gt;Every backend implements the &lt;code&gt;LlmProvider&lt;/code&gt; trait. The key method takes a list of messages and a configuration struct (tools, max tokens, session ID, optional fallback model) and returns a stream of events: text deltas, tool calls, usage reports, and errors. The agent loop consumes these events identically regardless of which provider produced them.&lt;/p&gt;
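&lt;p&gt;A minimal, synchronous sketch of that interface (in Python for illustration, since the real &lt;code&gt;LlmProvider&lt;/code&gt; is an async Rust trait; the type and method names here are mine, not Akmon's):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Iterator, Union

# Simplified event types: the agent loop consumes these identically
# regardless of which backend produced them.
@dataclass
class TextDelta:
    text: str

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class Usage:
    input_tokens: int
    output_tokens: int

StreamEvent = Union[TextDelta, ToolCall, Usage]

class LlmProvider:
    def stream_completion(self, messages, config) -> Iterator[StreamEvent]:
        raise NotImplementedError

class MockProvider(LlmProvider):
    """A stand-in backend emitting the same event shapes as a real one."""
    def stream_completion(self, messages, config):
        yield TextDelta(text="Reading the file first.")
        yield ToolCall(name="read_file", arguments={"path": "src/main.rs"})
        yield Usage(input_tokens=120, output_tokens=18)
```

&lt;p&gt;The point of the shape is that swapping &lt;code&gt;MockProvider&lt;/code&gt; for any real backend changes nothing downstream.&lt;/p&gt;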

&lt;p&gt;Provider selection happens once at session start in &lt;code&gt;LlmConnectConfig::resolve&lt;/code&gt;. The priority chain matters and is easy to get wrong:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Amazon Bedrock when AWS context is configured&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claude-*&lt;/code&gt; model names route to Anthropic directly, or via OpenRouter if no Anthropic key exists, or error clearly if neither is available&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;org/model&lt;/code&gt; format routes to OpenRouter&lt;/li&gt;
&lt;li&gt;Azure OpenAI when endpoint and key are both present&lt;/li&gt;
&lt;li&gt;OpenAI when the model matches Chat API patterns and a key is set&lt;/li&gt;
&lt;li&gt;Groq when the model matches Groq-hosted patterns and a key is set&lt;/li&gt;
&lt;li&gt;Custom OpenAI-compatible endpoint when configured&lt;/li&gt;
&lt;li&gt;Ollama for everything else&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bug you only fix once: &lt;code&gt;claude-*&lt;/code&gt; must be resolved before the Ollama fallback. Otherwise the tool quietly sends Anthropic API requests to a local Ollama server that speaks a completely different protocol. I learned this the hard way.&lt;/p&gt;
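&lt;p&gt;The chain reads naturally as a sequence of guards. A sketch (Python for illustration; the key names and model-name patterns here are my stand-ins, not the real conditions in &lt;code&gt;LlmConnectConfig::resolve&lt;/code&gt;):&lt;/p&gt;

```python
def resolve_provider(model, env):
    """Illustrative priority chain; order matters."""
    if env.get("aws_configured"):
        return "bedrock"
    if model.startswith("claude-"):
        # Must resolve before the Ollama fallback, or Anthropic-style
        # requests end up at a local server speaking another protocol.
        if env.get("anthropic_key"):
            return "anthropic"
        if env.get("openrouter_key"):
            return "openrouter"
        raise ValueError("claude-* model but no Anthropic or OpenRouter key")
    if "/" in model:                      # org/model format
        return "openrouter"
    if env.get("azure_endpoint") and env.get("azure_key"):
        return "azure"
    if model.startswith(("gpt-", "o1")) and env.get("openai_key"):
        return "openai"                   # illustrative pattern match
    if model.startswith("llama") and env.get("groq_key"):
        return "groq"                     # illustrative pattern match
    if env.get("custom_endpoint"):
        return "custom"
    return "ollama"                       # everything else
```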

&lt;h2&gt;The Agent Loop&lt;/h2&gt;

&lt;p&gt;The loop lives in &lt;code&gt;session.rs&lt;/code&gt; under a labeled &lt;code&gt;'session: loop&lt;/code&gt;. It is not simply "run until the model says stop." Before each iteration it checks an iteration limit (default 25), a budget cap in headless mode, and several error conditions. Autonomy is bounded by configuration, not just by model behavior.&lt;/p&gt;

&lt;p&gt;Each iteration follows the same sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply context compaction if needed&lt;/li&gt;
&lt;li&gt;Compose the message list for the next API call&lt;/li&gt;
&lt;li&gt;For Ollama, trim to system messages plus the last six non-system messages&lt;/li&gt;
&lt;li&gt;Call the provider and consume the stream until a Done event arrives&lt;/li&gt;
&lt;li&gt;Handle the stop reason&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stop reasons determine what happens next:&lt;br&gt;
&lt;strong&gt;ToolUse:&lt;/strong&gt; execute the requested tools, append results to context, continue the loop.&lt;br&gt;
&lt;strong&gt;EndTurn with tool calls:&lt;/strong&gt; the model produced both text and tool calls. Execute the tools and continue.&lt;br&gt;
&lt;strong&gt;EndTurn with no tool calls:&lt;/strong&gt; genuine completion. Emit a Done event, persist context, exit the loop.&lt;br&gt;
&lt;strong&gt;MaxTokens:&lt;/strong&gt; the response was truncated. If there were tool calls, execute what arrived and continue without consuming a continuation credit. If there were no tool calls, inject a continuation user message and loop again, up to three times. After three truncations without completion, surface a clear error to the user.&lt;/p&gt;

&lt;p&gt;The natural exit is &lt;code&gt;EndTurn&lt;/code&gt; with no pending tool calls. Everything else is a managed edge case.&lt;/p&gt;

&lt;h2&gt;Context Management&lt;/h2&gt;

&lt;p&gt;This is the hardest problem in building a coding agent and the one that takes the most iteration to get right.&lt;/p&gt;

&lt;p&gt;Every turn adds tokens to the context. File reads, tool results, conversation history — it all accumulates. Eventually you hit the model's context window limit and things break in ways that are hard to debug. Akmon handles this at three levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microcompact&lt;/strong&gt; runs after each turn. Older tool results are replaced with a short placeholder to prevent linear context growth. The implementation is more careful than a simple character count: it never clears write or edit tool results (those are too important), only clears shell output when it exceeds 500 characters, and keeps the most recent 20 messages intact. Groq keeps 12 due to tighter context limits.&lt;/p&gt;
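&lt;p&gt;Those rules fit in a short function. A sketch (Python for illustration; field names are mine, not Akmon's internal message shape):&lt;/p&gt;

```python
PROTECTED = {"write_file", "edit_file"}   # results too important to clear

def microcompact(messages, keep_recent=20, shell_limit=500):
    """Per-turn cleanup sketch. Pass keep_recent=12 for the tighter
    Groq limit described above."""
    cutoff = max(0, len(messages) - keep_recent)
    out = []
    for i, msg in enumerate(messages):
        recent = i >= cutoff
        if not recent and msg.get("role") == "tool":
            tool = msg.get("tool", "")
            long_shell = tool == "shell" and len(msg["content"]) > shell_limit
            if tool not in PROTECTED and (tool != "shell" or long_shell):
                msg = dict(msg, content="[cleared: old " + tool + " result]")
        out.append(msg)
    return out
```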

&lt;p&gt;&lt;strong&gt;Autocompact&lt;/strong&gt; triggers when estimated input tokens exceed 85% of usable context. At that point, a prefix of conversation history is summarized via the same provider and folded back into the context as a system message. The agent continues from the summary rather than from the raw history. You lose some detail but you gain the ability to keep working on a large project without hitting a hard wall.&lt;/p&gt;
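&lt;p&gt;The trigger and the fold-back step can be sketched like this (Python for illustration; summarizing exactly half the history is an arbitrary choice here, not Akmon's actual prefix selection):&lt;/p&gt;

```python
def needs_autocompact(estimated_input_tokens, usable_context, threshold=0.85):
    """The 85% trigger described above."""
    return estimated_input_tokens > threshold * usable_context

def autocompact(messages, summarize):
    """Replace a prefix of history with a provider-written summary,
    folded back in as a system message; the recent tail stays raw."""
    split = len(messages) // 2            # simplistic prefix choice
    summary = summarize(messages[:split])
    return [{"role": "system", "content": "Summary: " + summary}] + messages[split:]
```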

&lt;p&gt;&lt;strong&gt;Spec files&lt;/strong&gt; are the real solution because they avoid the problem entirely. Before implementing anything significant, the agent writes a detailed plan to &lt;code&gt;.akmon/specs/plan.md&lt;/code&gt;. This file lives on disk, persists across sessions, and survives compaction. Working from a spec rather than from accumulated conversation history keeps the implementation context clean from the start.&lt;/p&gt;

&lt;p&gt;These are genuinely different things. Microcompact manages turn-by-turn growth. Autocompact handles sessions that run long. Specs are a workflow pattern that reduces how much context you need in the first place.&lt;/p&gt;

&lt;h2&gt;Prompt Caching&lt;/h2&gt;

&lt;p&gt;Anthropic's prompt caching charges 10% of the normal input token price for cache reads. In practice, for a 30-turn session building a web application, around 35 to 40% of input tokens are served from cache. On a session that would otherwise cost $0.54, caching brings it to around $0.35.&lt;/p&gt;
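&lt;p&gt;The arithmetic behind those numbers, treating the session cost as input-dominated (output tokens are unaffected by caching):&lt;/p&gt;

```python
def cached_input_cost(base_cost, cache_fraction, cache_price_ratio=0.1):
    """Cache reads bill at 10% of the normal input price, so the
    effective cost factor is 1 - 0.9 * cache_fraction."""
    return base_cost * (1 - (1 - cache_price_ratio) * cache_fraction)
```

&lt;p&gt;With 37.5% of input tokens served from cache, a $0.54 session comes down to roughly $0.36, matching the figures above.&lt;/p&gt;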

&lt;p&gt;The implementation detail that matters: cache control is attached at the top level of the Messages API request, not as per-block markers on individual content items. Getting this wrong means paying full price for tokens that should be cached.&lt;/p&gt;

&lt;p&gt;Treat these numbers as measurements from specific sessions, not as guarantees. Your cache hit rate depends heavily on how much your system prompt changes between turns.&lt;/p&gt;

&lt;h2&gt;The Permission System&lt;/h2&gt;

&lt;p&gt;Every tool call that modifies state — file writes, shell commands, web requests — goes through a permission check before execution. The user sees a prompt and chooses: allow once, allow for the session, allow all writes, or deny.&lt;/p&gt;

&lt;p&gt;Every decision is recorded in the audit log as a JSON Lines event tagged with the event kind (policy evaluation, tool dispatch, tool outcome, agent step). The schema is more structured than a flat key-value pair — it needs to be, because "what did the agent do and why" is a question that gets asked later, not during the session.&lt;/p&gt;
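&lt;p&gt;As a rough illustration of the JSON Lines shape (field names here are my guesses at a plausible schema, not Akmon's actual one):&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

def audit_event(kind, **fields):
    """Emit one JSON Lines audit record tagged with its event kind."""
    event = {"ts": datetime.now(timezone.utc).isoformat(), "kind": kind}
    event.update(fields)
    return json.dumps(event)

# One line per decision; the file becomes an append-only session record.
line = audit_event("tool_dispatch", tool="write_file",
                   path="src/auth.rs", decision="allow_session")
```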

&lt;p&gt;This is the difference between "the AI wrote some code" and "at 11:23:45 UTC the agent requested to write src/auth.rs, the user approved it for this session, and the result was a 47-line file." In professional environments that distinction matters.&lt;/p&gt;

&lt;h2&gt;What Surprised Me&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retries and rate limits mid-stream are genuinely difficult.&lt;/strong&gt; A rate limit that arrives halfway through a streaming response means partial state, a user who is watching status messages, and the need to not double-count the iteration. Getting this right took longer than any other single piece of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local models need aggressive context trimming.&lt;/strong&gt; A 30k token context that Anthropic processes in 2 seconds takes 90 seconds or more on a local 9B model running on consumer hardware. Trimming to system messages plus the last six non-system messages before sending to Ollama made the difference between something usable and something that times out constantly.&lt;/p&gt;
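&lt;p&gt;The trim itself is simple. A sketch (Python for illustration; it also regroups system messages to the front, which is a simplification):&lt;/p&gt;

```python
def trim_for_local(messages, keep_non_system=6):
    """Keep all system messages plus the last N non-system messages
    before sending to a local model."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_non_system:]
```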

&lt;p&gt;&lt;strong&gt;Usage aggregation is easy to get wrong.&lt;/strong&gt; Anthropic returns token counts at the end of each streaming response. Accumulating these correctly across 35 API calls in a single session requires careful state management in both the session layer and the TUI. When this breaks, users see $0.14 instead of $0.68. I know because it happened.&lt;/p&gt;
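&lt;p&gt;The accumulation itself is a small amount of state; the hard part is feeding it correctly from every stream. A sketch (my naming; the per-million-token prices in the usage note are example figures, not a quote):&lt;/p&gt;

```python
class UsageTotals:
    """Accumulate per-response token counts across a session; each
    streamed response reports its usage once, at the end."""
    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, usage):
        self.input_tokens += usage["input_tokens"]
        self.output_tokens += usage["output_tokens"]

    def cost(self, in_price_per_mtok, out_price_per_mtok):
        return (self.input_tokens * in_price_per_mtok
                + self.output_tokens * out_price_per_mtok) / 1_000_000
```

&lt;p&gt;Dropping even one response's &lt;code&gt;add&lt;/code&gt; call silently understates the total, which is exactly the $0.14-instead-of-$0.68 failure mode.&lt;/p&gt;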

&lt;p&gt;&lt;strong&gt;The TUI is the most visible part but the least interesting engineering.&lt;/strong&gt; ratatui is excellent. But the real product is in the policy engine, context management, and provider correctness. The TUI is just the window into those systems.&lt;/p&gt;

&lt;h2&gt;What Is Next&lt;/h2&gt;

&lt;p&gt;The codebase already includes MCP configuration and an HTTP MCP client path — so the next work there is expanding subprocess and stdio server support, not starting from scratch. Other areas still in progress: checkpoint and rewind for safer autonomous operation, shell state persistence across tool calls, and first-class Windows support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Akmon is Apache 2.0.&lt;/strong&gt; The repo is &lt;a href="https://github.com/radotsvetkov/akmon"&gt;github.com/radotsvetkov/akmon&lt;/a&gt; and the docs are at &lt;a href="https://radotsvetkov.github.io/akmon"&gt;radotsvetkov.github.io/akmon&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are building agents I would genuinely like to hear what you are working on.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Rado&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>opensource</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
