<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jiayu</title>
    <description>The latest articles on DEV Community by jiayu (@jiayu_fd0917c8f896ea39ab9).</description>
    <link>https://dev.to/jiayu_fd0917c8f896ea39ab9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3861896%2Ff808efa2-7e98-48e5-a6dc-57dd74eb510a.png</url>
      <title>DEV Community: jiayu</title>
      <link>https://dev.to/jiayu_fd0917c8f896ea39ab9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jiayu_fd0917c8f896ea39ab9"/>
    <language>en</language>
    <item>
      <title>I Built a CLI AI Coding Assistant from Scratch — Here's What I Learned</title>
      <dc:creator>jiayu</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:48:08 +0000</pubDate>
      <link>https://dev.to/jiayu_fd0917c8f896ea39ab9/i-built-a-cli-ai-coding-assistant-from-scratch-heres-what-i-learned-2ma7</link>
      <guid>https://dev.to/jiayu_fd0917c8f896ea39ab9/i-built-a-cli-ai-coding-assistant-from-scratch-heres-what-i-learned-2ma7</guid>
      <description>&lt;h1&gt;
  
  
  I Built a CLI AI Coding Assistant from Scratch — Here's What I Learned
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I spent several months studying Claude Code's architecture, then built &lt;a href="https://github.com/jiayu6954-sudo/seed-ai" rel="noopener noreferrer"&gt;Seed AI&lt;/a&gt; — a TypeScript CLI assistant with 14 original improvements. This post is a brain dump of the most interesting technical problems I solved.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why bother?
&lt;/h2&gt;

&lt;p&gt;Claude Code is excellent. But it has a few hard constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locked to Anthropic's API (no DeepSeek, no local Ollama)&lt;/li&gt;
&lt;li&gt;No memory between sessions&lt;/li&gt;
&lt;li&gt;No tool result caching (reads the same file 3 times per session)&lt;/li&gt;
&lt;li&gt;No Docker sandbox&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are complaints — they're product decisions. But they left room to explore.&lt;/p&gt;




&lt;h2&gt;
  
  
  The interesting problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Parallel tool execution without breaking UX
&lt;/h3&gt;

&lt;p&gt;Claude Code executes tools serially: &lt;code&gt;permission(A) → exec(A) → permission(B) → exec(B)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When the LLM requests 3 file reads simultaneously, that's 3× the latency.&lt;/p&gt;

&lt;p&gt;The naive fix is full parallelism — but then you'd get 3 permission dialogs firing at once, which is confusing. The right split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Permissions: serial (user reviews one at a time)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;approvedCalls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;approved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;askPermission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;approvedCalls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Execution: parallel (no UX interaction needed)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allSettled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;approvedCalls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: latency drops from N×T to roughly 1.2×T for N similar-sized reads.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Session-level tool cache with write-before-invalidate ordering
&lt;/h3&gt;

&lt;p&gt;LLMs re-read files constantly. &lt;code&gt;file_read&lt;/code&gt;, &lt;code&gt;glob&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt; — all idempotent. The obvious optimization: cache within a session.&lt;/p&gt;

&lt;p&gt;The tricky part is write invalidation ordering. If you do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file_edit(foo.ts) → invalidate cache → read foo.ts (gets new content)  ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But if invalidation happens &lt;em&gt;after&lt;/em&gt; execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file_edit(foo.ts) → read foo.ts (cache hit, gets OLD content) → invalidate  ✗
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix: &lt;code&gt;cache.invalidateForWrite()&lt;/code&gt; runs before &lt;code&gt;execute()&lt;/code&gt;, not after. Obvious in retrospect, costly if you get it wrong.&lt;/p&gt;

&lt;p&gt;Cache key format: &lt;code&gt;"file_read:{"path":"/foo/bar.ts"}"&lt;/code&gt; — tool name plus &lt;code&gt;JSON.stringify(input)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Typical session cache hit rate: 20–40%.&lt;/p&gt;
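&lt;p&gt;Put together, the whole cache can be sketched in a few lines. The names here (&lt;code&gt;SessionToolCache&lt;/code&gt;, &lt;code&gt;READ_TOOLS&lt;/code&gt;, &lt;code&gt;WRITE_TOOLS&lt;/code&gt;) are illustrative, not Seed AI's actual API:&lt;/p&gt;

```typescript
// Minimal sketch of a session-level tool cache (hypothetical names).
// The crucial ordering: invalidateForWrite() is called BEFORE a write tool
// executes, so a read issued after the write can never hit stale content.
const READ_TOOLS = new Set(["file_read", "glob", "grep"]);
const WRITE_TOOLS = new Set(["file_edit", "file_write"]);

class SessionToolCache {
  private store = new Map<string, unknown>();

  // Key = tool name + JSON.stringify(input)
  key(tool: string, input: unknown): string {
    return `${tool}:${JSON.stringify(input)}`;
  }

  get(tool: string, input: unknown): unknown {
    return READ_TOOLS.has(tool) ? this.store.get(this.key(tool, input)) : undefined;
  }

  set(tool: string, input: unknown, result: unknown): void {
    if (READ_TOOLS.has(tool)) this.store.set(this.key(tool, input), result);
  }

  // A write may change what file_read, glob, or grep would return,
  // so the simplest safe policy is to drop everything.
  invalidateForWrite(tool: string): void {
    if (WRITE_TOOLS.has(tool)) this.store.clear();
  }
}
```

&lt;p&gt;The clear-everything policy trades hit rate for correctness; per-path invalidation would be the obvious next refinement.&lt;/p&gt;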




&lt;h3&gt;
  
  
  3. LLM-powered context compression
&lt;/h3&gt;

&lt;p&gt;When conversations get long, Claude Code truncates. Old messages are gone.&lt;/p&gt;

&lt;p&gt;A better approach: use the cheapest available model (Haiku, ~$0.0002/call) to summarize what was dropped, then inject that summary into the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SUMMARY_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-haiku-4-5-20251001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Injected into system prompt dynamic section:&lt;/span&gt;
&lt;span class="c1"&gt;// ## Earlier conversation summary (compressed)&lt;/span&gt;
&lt;span class="c1"&gt;// [Completed: refactored auth module, fixed token expiry bug]&lt;/span&gt;
&lt;span class="c1"&gt;// [Decided: use refresh tokens over session extension]&lt;/span&gt;
&lt;span class="c1"&gt;// [Current state: tests passing, ready for PR]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cumulative summaries are appended with &lt;code&gt;---&lt;/code&gt; separators across multiple compression rounds. The main model always knows what happened earlier.&lt;/p&gt;

&lt;p&gt;Cost per compression: ~$0.0002. Negligible.&lt;/p&gt;
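&lt;p&gt;The append logic itself is tiny. A sketch (illustrative names; the Haiku call that produces each summary is out of scope here):&lt;/p&gt;

```typescript
// Each compression round appends its summary after a "---" separator,
// so the accumulated string preserves the whole earlier history.
function appendSummary(existing: string, next: string): string {
  return existing ? `${existing}\n---\n${next}` : next;
}

// The accumulated summary goes into the system prompt's dynamic section.
function buildDynamicSection(summary: string): string {
  return `## Earlier conversation summary (compressed)\n${summary}`;
}
```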




&lt;h3&gt;
  
  
  4. Semantic vector memory across sessions
&lt;/h3&gt;

&lt;p&gt;The long-term memory system uses a 3-layer structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/.seed/memory/
├── user.md                    ← cross-project user profile
└── projects/&lt;span class="o"&gt;{&lt;/span&gt;sha1&lt;span class="o"&gt;(&lt;/span&gt;path&lt;span class="o"&gt;)[&lt;/span&gt;:12]&lt;span class="o"&gt;}&lt;/span&gt;/
    ├── context.md             ← tech stack, architecture
    ├── decisions.md           ← key decisions + reasoning
    └── learnings.md           ← bugs fixed, patterns that worked
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project fingerprint (&lt;code&gt;sha1(normalize(path)).slice(0, 12)&lt;/code&gt;) maps paths consistently — same project, same memory bucket, regardless of how the path is written.&lt;/p&gt;
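&lt;p&gt;As a sketch, using Node's built-in &lt;code&gt;crypto&lt;/code&gt; (the exact normalization rules are an assumption; here it's plain path resolution):&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";
import * as path from "node:path";

// Sketch of the project fingerprint. normalize() is assumed to be plain
// path resolution; Seed AI's actual normalization rules may differ.
function projectFingerprint(projectPath: string): string {
  const normalized = path.resolve(projectPath); // one canonical absolute form
  return createHash("sha1").update(normalized).digest("hex").slice(0, 12);
}
```

&lt;p&gt;Same project, same 12-character bucket, however the path was spelled.&lt;/p&gt;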

&lt;p&gt;At session end, Haiku extracts only durable knowledge (not ephemeral variable names). At session start, a local embedding model retrieves the top-8 relevant chunks via cosine similarity. Constant injection cost: ~800 tokens, regardless of total memory size.&lt;/p&gt;

&lt;p&gt;A TF-IDF fallback covers the case where no embedding service is available.&lt;/p&gt;
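&lt;p&gt;The retrieval step is small enough to sketch in full (embeddings as plain &lt;code&gt;number[]&lt;/code&gt;; the embedding model itself is out of scope):&lt;/p&gt;

```typescript
// Rank memory chunks by cosine similarity to the query embedding, keep top 8.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1); // guard zero vectors
}

function topK(query: number[], chunks: { text: string; vec: number[] }[], k = 8) {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vec) - cosine(query, x.vec))
    .slice(0, k);
}
```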




&lt;h3&gt;
  
  
  5. Docker sandbox with graceful host fallback
&lt;/h3&gt;

&lt;p&gt;Each &lt;code&gt;bash&lt;/code&gt; tool call spins up a fresh, auto-removed container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;run&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--rm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-v&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;mountPath&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:/workspace`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-w&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/workspace&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--network&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// strict mode&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--memory&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;512m&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--cpus&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--security-opt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;no-new-privileges&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;alpine:latest&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sh&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-c&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three isolation levels: &lt;code&gt;strict&lt;/code&gt; (read-only FS + no network), &lt;code&gt;standard&lt;/code&gt;, &lt;code&gt;permissive&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The important part: when Docker isn't running, nothing crashes. A &lt;code&gt;docker info&lt;/code&gt; probe runs on first use, its result is cached, and the system falls back to host execution with a visible warning. The user always knows which mode they're in.&lt;/p&gt;
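&lt;p&gt;A sketch of that probe-and-memoize pattern (hypothetical wrapper, not Seed AI's actual code):&lt;/p&gt;

```typescript
import { execFile } from "node:child_process";

// Probe Docker once, memoize the answer, warn visibly on fallback.
// Hypothetical wrapper; the real implementation differs in detail.
let dockerAvailable: boolean | null = null;

function probeDocker(): Promise<boolean> {
  if (dockerAvailable !== null) return Promise.resolve(dockerAvailable);
  return new Promise((resolve) => {
    execFile("docker", ["info"], { timeout: 3000 }, (err) => {
      dockerAvailable = !err; // cache the result for the rest of the session
      if (err) console.warn("Docker unavailable, falling back to host execution");
      resolve(!err);
    });
  });
}
```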




&lt;h3&gt;
  
  
  6. Local LLM support with capability detection
&lt;/h3&gt;

&lt;p&gt;Ollama doesn't expose a standard endpoint for "does this model support tool calls?" The probe I landed on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Send a minimal tool-use request&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;base&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/chat/completions`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt; &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;PROBE_TOOL&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// Ollama returns "does not support tools" for incapable models&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/not support tools/i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the model doesn't support native tool calls, fall back to XML-formatted tool calls in the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;tool_call&amp;gt;&lt;/span&gt;
{"name": "file_read", "parameters": {"path": "src/index.ts"}}
&lt;span class="nt"&gt;&amp;lt;/tool_call&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same &lt;code&gt;ToolRegistry&lt;/code&gt; handles both paths. Models like &lt;code&gt;qwen2.5-coder&lt;/code&gt; support native calls; reasoning models like DeepSeek-R1 need XML fallback (and even then, rarely emit tool calls reliably).&lt;/p&gt;
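&lt;p&gt;Extracting those blocks from the model's raw text is a small parser. A sketch (&lt;code&gt;parseToolCalls&lt;/code&gt; is an illustrative name):&lt;/p&gt;

```typescript
// Pull each <tool_call> block out of the model's text and JSON-parse its body.
// Malformed blocks are skipped rather than crashing the agent loop.
type ParsedCall = { name: string; parameters: Record<string, unknown> };

function parseToolCalls(text: string): ParsedCall[] {
  const calls: ParsedCall[] = [];
  const re = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  for (const match of text.matchAll(re)) {
    try {
      calls.push(JSON.parse(match[1].trim()));
    } catch {
      // ignore partial or invalid JSON; models emit these occasionally
    }
  }
  return calls;
}
```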




&lt;h3&gt;
  
  
  7. Streaming jitter: the React setState ordering bug
&lt;/h3&gt;

&lt;p&gt;This one took several sessions to fully diagnose.&lt;/p&gt;

&lt;p&gt;Ink (React for CLIs) redraws the terminal on every state update. During streaming, we buffer text deltas every 80ms to batch redraws. But there was a subtle React batch ordering bug at stream end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The broken sequence:&lt;/span&gt;
&lt;span class="c1"&gt;// Event: "done" → finalizeLastMessage() → setMessages({isStreaming: false})&lt;/span&gt;
&lt;span class="c1"&gt;// Finally block → updateLastAssistantMessage(remainingBuffer) → setMessages({isStreaming: true})&lt;/span&gt;
&lt;span class="c1"&gt;// React batches both → later updater wins → message permanently stuck as streaming&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix: merge buffer drain and &lt;code&gt;isStreaming: false&lt;/code&gt; into a single atomic &lt;code&gt;setMessages&lt;/code&gt; call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;remainingChunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;deltaBufferRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;deltaBufferRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;setMessages&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="c1"&gt;// append chunk + set isStreaming: false in one updater&lt;/span&gt;
    &lt;span class="nx"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;last&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;isStreaming&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Separately: Ink's &lt;code&gt;&amp;lt;Static&amp;gt;&lt;/code&gt; component writes completed messages to terminal scrollback once, never redraws them. Only the live streaming zone (~22 lines on a 30-row terminal) repaints every 80ms. This is how you eliminate streaming jitter in a terminal UI.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. MCP subprocess leak
&lt;/h3&gt;

&lt;p&gt;MCP (Model Context Protocol) servers run as persistent child processes communicating over stdio. The first version created a new &lt;code&gt;MCPRegistry&lt;/code&gt; on every user submit — which spawned new server processes and orphaned the old ones.&lt;/p&gt;

&lt;p&gt;The fix is straightforward once you see it: persist the registry instance across submits with a ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mcpRegistryRef&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;useRef&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;MCPRegistry&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Connect once, reuse forever&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;mcpConnectedRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;mcpConnectedRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MCPRegistry&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcpServers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;mcpRegistryRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;reg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The symptom was subtle: no crash, no error — just progressively more zombie processes accumulating in the background. Worth checking for in any system that manages long-lived child processes.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. Ollama's &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tag format divergence
&lt;/h3&gt;

&lt;p&gt;DeepSeek-R1 produces thinking tokens. Two different formats depending on where you call it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek cloud API:&lt;/strong&gt; thinking content comes in a separate &lt;code&gt;reasoning_content&lt;/code&gt; field. Clean, explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama's OpenAI-compatible endpoint:&lt;/strong&gt; strips the opening &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tag but keeps the closing &lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;. The thinking content appears at the start of the regular &lt;code&gt;content&lt;/code&gt; stream, ending with &lt;code&gt;\n\n&amp;lt;/think&amp;gt;\n\n&lt;/code&gt;, followed by the actual response.&lt;/p&gt;

&lt;p&gt;Handling them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// DeepSeek cloud: dedicated field&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reasoning_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thinking_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reasoning_content&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Ollama: detect bare &amp;lt;/think&amp;gt; closing tag in content stream&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rawContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thinking_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;thinking&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of format divergence is common when a vendor implements "compatible" APIs — compatible enough to work for the basic case, different enough to break edge cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Keep the scope tight from the start.&lt;/strong&gt; The codebase has some Claude Code source files included for reference that ended up tracked by git. A clear &lt;code&gt;src/seed/&lt;/code&gt; vs &lt;code&gt;src/vendor/&lt;/code&gt; split from day one would have prevented that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write integration tests before the features.&lt;/strong&gt; Unit tests caught regressions in the tool cache and permission system. But the agent loop end-to-end has no integration tests — that's the biggest quality gap right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't fight the terminal.&lt;/strong&gt; I spent time on a custom scrollbar and alt-screen buffer before realizing Ink's &lt;code&gt;&amp;lt;Static&amp;gt;&lt;/code&gt; + primary buffer is the correct architecture. The terminal already knows how to scroll.&lt;/p&gt;




&lt;h2&gt;
  
  
  The repo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/jiayu6954-sudo/seed-ai" rel="noopener noreferrer"&gt;https://github.com/jiayu6954-sudo/seed-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TypeScript, MIT license. Runs on Node.js 20+. Supports Anthropic, OpenAI, DeepSeek, Groq, Gemini, Ollama, OpenRouter, Moonshot.&lt;/p&gt;

&lt;p&gt;If you're building something similar or have questions about any of the above, open an issue or leave a comment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cli</category>
      <category>showdev</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
