jiayu
I Built a CLI AI Coding Assistant from Scratch — Here's What I Learned


TL;DR: I spent several months studying Claude Code's architecture, then built Seed AI — a TypeScript CLI assistant with 14 original improvements. This post is a brain dump of the most interesting technical problems I solved.


Why bother?

Claude Code is excellent. But it has a few hard constraints:

  • Locked to Anthropic's API (no DeepSeek, no local Ollama)
  • No memory between sessions
  • No tool result caching (reads the same file 3 times per session)
  • No Docker sandbox

None of these are complaints — they're product decisions. But they left room to explore.


The interesting problems

1. Parallel tool execution without breaking UX

Claude Code executes tools serially: permission(A) → exec(A) → permission(B) → exec(B).

When the LLM requests 3 file reads simultaneously, that's 3× the latency.

The naive fix is full parallelism — but then you'd get 3 permission dialogs firing at once, which is confusing. The right split:

// Permissions: serial (user reviews one at a time)
const approvedCalls: ToolCall[] = [];
for (const call of toolCalls) {
  const approved = await askPermission(call);
  if (approved) approvedCalls.push(call);
}

// Execution: parallel (no UX interaction needed)
const results = await Promise.allSettled(
  approvedCalls.map(call => tools.execute(call.name, call.input))
);

Result: latency drops from N×T to roughly 1.2×T for N similar-sized reads.


2. Session-level tool cache with write-before-invalidate ordering

LLMs re-read files constantly. file_read, glob, grep — all idempotent. The obvious optimization: cache within a session.

The tricky part is write invalidation ordering. If invalidation happens before the write:

file_edit(foo.ts) → invalidate cache → read foo.ts (gets new content)  ✓

But if invalidation happens after execution:

file_edit(foo.ts) → read foo.ts (cache hit, gets OLD content) → invalidate  ✗

The fix: cache.invalidateForWrite() runs before execute(), not after. Obvious in retrospect, costly if you get it wrong.

Cache key format: file_read:{"path":"/foo/bar.ts"} — the tool name plus JSON.stringify(input).

Typical session cache hit rate: 20–40%.
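As a sketch, the whole mechanism fits in a small class. The `invalidateForWrite` name comes from the description above; everything else here is my own illustration, not Seed's actual implementation:

```typescript
// Minimal session-scoped tool cache sketch. Only invalidateForWrite's
// name comes from the post; the rest is illustrative.
type ToolInput = Record<string, unknown>;

class ToolCache {
  private store = new Map<string, unknown>();

  // Cache key: tool name + JSON.stringify(input).
  key(tool: string, input: ToolInput): string {
    return `${tool}:${JSON.stringify(input)}`;
  }

  get(tool: string, input: ToolInput): unknown {
    return this.store.get(this.key(tool, input));
  }

  set(tool: string, input: ToolInput, result: unknown): void {
    this.store.set(this.key(tool, input), result);
  }

  // Runs BEFORE the write executes, so a post-write read can never
  // hit stale content.
  invalidateForWrite(path: string): void {
    for (const k of this.store.keys()) {
      if (k.includes(path)) this.store.delete(k);
    }
  }
}

// Usage: invalidate first, then execute the write.
const cache = new ToolCache();
cache.set("file_read", { path: "/foo/bar.ts" }, "old content");
cache.invalidateForWrite("/foo/bar.ts"); // before file_edit runs
console.log(cache.get("file_read", { path: "/foo/bar.ts" })); // undefined
```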


3. LLM-powered context compression

When conversations get long, Claude Code truncates. Old messages are gone.

A better approach: use the cheapest available model (Haiku, ~$0.0002/call) to summarize what was dropped, inject that summary into the system prompt:

const SUMMARY_MODEL = "claude-haiku-4-5-20251001";

// Injected into system prompt dynamic section:
// ## Earlier conversation summary (compressed)
// [Completed: refactored auth module, fixed token expiry bug]
// [Decided: use refresh tokens over session extension]
// [Current state: tests passing, ready for PR]

Cumulative summaries are appended with --- separators across multiple compression rounds. The main model always knows what happened earlier.

Cost per compression: ~$0.0002. Negligible.
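The bookkeeping around compression can be sketched like this, with the Haiku call stubbed out. The token threshold, the drop-half policy, and all function names are my assumptions — only the cumulative `---`-separated summaries come from the post:

```typescript
// Compression bookkeeping sketch. summarize() stands in for the Haiku
// call; TOKEN_BUDGET and the drop-half policy are illustrative.
interface Msg { role: string; content: string }

const TOKEN_BUDGET = 8000; // assumed threshold
const estimateTokens = (msgs: Msg[]) =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);

function compress(
  history: Msg[],
  cumulativeSummary: string,
  summarize: (dropped: Msg[]) => string,
): { history: Msg[]; cumulativeSummary: string } {
  if (estimateTokens(history) <= TOKEN_BUDGET) {
    return { history, cumulativeSummary };
  }

  // Drop the older half, keep the recent half verbatim.
  const cut = Math.floor(history.length / 2);
  const summary = summarize(history.slice(0, cut));

  // Summaries accumulate across rounds, separated by ---.
  const merged = cumulativeSummary
    ? `${cumulativeSummary}\n---\n${summary}`
    : summary;
  return { history: history.slice(cut), cumulativeSummary: merged };
}
```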


4. Semantic vector memory across sessions

The long-term memory system uses a 3-layer structure:

~/.seed/memory/
├── user.md                    ← cross-project user profile
└── projects/{sha1(path)[:12]}/
    ├── context.md             ← tech stack, architecture
    ├── decisions.md           ← key decisions + reasoning
    └── learnings.md           ← bugs fixed, patterns that worked

The project fingerprint (sha1(normalize(path)).slice(0, 12)) maps paths consistently — same project, same memory bucket, regardless of how the path is written.
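In Node this is a one-liner; the exact normalization is my assumption (resolve to an absolute path, strip any trailing slash):

```typescript
import { createHash } from "node:crypto";
import { resolve } from "node:path";

// Project fingerprint sketch. The normalization step is an assumption:
// resolve to an absolute path and strip any trailing slash.
function projectFingerprint(projectPath: string): string {
  const normalized = resolve(projectPath).replace(/(?<=.)\/+$/, "");
  return createHash("sha1").update(normalized).digest("hex").slice(0, 12);
}
```

With this, "/tmp/foo", "/tmp/foo/" and "/tmp/bar/../foo" all land in the same memory bucket.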

At session end, Haiku extracts only durable knowledge (not ephemeral variable names). At session start, a local embedding model retrieves the top-8 relevant chunks via cosine similarity. Constant injection cost: ~800 tokens, regardless of total memory size.

TF-IDF fallback when no embedding service is available.
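The retrieval step itself is plain cosine similarity over chunk embeddings — a minimal sketch, with embeddings as bare number arrays:

```typescript
// Top-k retrieval by cosine similarity, as used for memory chunks.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function topK(
  query: number[],
  chunks: { text: string; embedding: number[] }[],
  k = 8, // top-8 chunks injected per session start
): string[] {
  return chunks
    .map((c) => ({ text: c.text, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((c) => c.text);
}
```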


5. Docker sandbox with graceful host fallback

Each bash tool call spins up a fresh, auto-removed container:

[
  "run", "--rm",
  "-v", `${mountPath}:/workspace`,
  "-w", "/workspace",
  "--network", "none",          // strict mode
  "--memory", "512m",
  "--cpus", "1",
  "--security-opt", "no-new-privileges",
  "alpine:latest", "sh", "-c", command
]

Three isolation levels: strict (read-only FS + no network), standard, permissive.

The important part: when Docker isn't running, it doesn't crash. docker info probes on first use, result is cached, and the system falls back to host execution with a visible warning. The user always knows which mode they're in.
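The probe-and-cache logic, sketched with an injectable probe so the fallback path can be exercised without Docker (function names and return values are mine, not Seed's):

```typescript
import { execFile } from "node:child_process";

// Cached Docker availability probe: `docker info` runs at most once.
// The probe is injectable so the fallback logic is testable.
let dockerAvailable: boolean | null = null;

async function probeDocker(
  run: () => Promise<boolean> = () =>
    new Promise((res) => execFile("docker", ["info"], (err) => res(!err))),
): Promise<boolean> {
  if (dockerAvailable === null) dockerAvailable = await run();
  return dockerAvailable;
}

async function runCommand(
  cmd: string,
  probe?: () => Promise<boolean>,
): Promise<string> {
  if (await probeDocker(probe)) {
    return `sandbox:${cmd}`; // would exec in a fresh --rm container
  }
  // Visible fallback: the user always knows which mode they're in.
  console.warn("⚠ Docker unavailable — running on host");
  return `host:${cmd}`;
}
```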


6. Local LLM support with capability detection

Ollama doesn't expose a standard endpoint for "does this model support tool calls?" The probe I landed on:

// Send a minimal tool-use request
const res = await fetch(`${base}/v1/chat/completions`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ model, messages: [...], tools: [PROBE_TOOL] })
});

if (!res.ok && res.status === 400) {
  const body = await res.text();
  // Ollama returns "does not support tools" for incapable models
  if (/not support tools/i.test(body)) return false;
}
return res.ok;

If the model doesn't support native tool calls, fall back to XML-formatted tool calls in the prompt:

<tool_call>
{"name": "file_read", "parameters": {"path": "src/index.ts"}}
</tool_call>

The same ToolRegistry handles both paths. Models like qwen2.5-coder support native calls; reasoning models like DeepSeek-R1 need XML fallback (and even then, rarely emit tool calls reliably).
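Extracting those XML-fallback calls from raw model output can be sketched like this (my illustration — the real parser is presumably more defensive):

```typescript
// Parse <tool_call>{...}</tool_call> blocks out of raw model output.
interface ParsedCall {
  name: string;
  parameters: Record<string, unknown>;
}

function parseXmlToolCalls(output: string): ParsedCall[] {
  const calls: ParsedCall[] = [];
  const re = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  for (const match of output.matchAll(re)) {
    try {
      calls.push(JSON.parse(match[1]));
    } catch {
      // Local models emit malformed JSON regularly — skip those blocks.
    }
  }
  return calls;
}
```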


7. Streaming jitter: the React setState ordering bug

This one took several sessions to fully diagnose.

Ink (React for CLIs) redraws the terminal on every state update. During streaming, we buffer text deltas every 80ms to batch redraws. But there was a subtle React batch ordering bug at stream end:

// The broken sequence:
// Event: "done" → finalizeLastMessage() → setMessages({isStreaming: false})
// Finally block → updateLastAssistantMessage(remainingBuffer) → setMessages({isStreaming: true})
// React batches both → later updater wins → message permanently stuck as streaming

The fix: merge buffer drain and isStreaming: false into a single atomic setMessages call:

} finally {
  const remainingChunk = deltaBufferRef.current;
  deltaBufferRef.current = "";
  setMessages((prev) => {
    const copy = [...prev];
    const last = copy[copy.length - 1];
    // append chunk + set isStreaming: false in one updater
    copy[copy.length - 1] = {
      ...last,
      content: last.content + remainingChunk,
      isStreaming: false,
    };
    return copy;
  });
}

Separately: Ink's <Static> component writes completed messages to terminal scrollback once, never redraws them. Only the live streaming zone (~22 lines on a 30-row terminal) repaints every 80ms. This is how you eliminate streaming jitter in a terminal UI.
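The 80ms batching itself reduces to a small buffer-and-flush data flow. Here the timer wiring is omitted and flush is called manually; names are illustrative:

```typescript
// Delta batching sketch: deltas accumulate in a buffer, and a single
// flush hands the whole batch to the UI (one setMessages per batch).
class DeltaBuffer {
  private buf = "";
  constructor(private onFlush: (chunk: string) => void) {}

  push(delta: string): void {
    this.buf += delta;
    // In the real app, push() schedules a flush at most every 80ms.
  }

  flush(): void {
    if (!this.buf) return; // skip empty redraws
    const chunk = this.buf;
    this.buf = "";
    this.onFlush(chunk);
  }
}
```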


8. MCP subprocess leak

MCP (Model Context Protocol) servers run as persistent child processes communicating over stdio. The first version created a new MCPRegistry on every user submit — which spawned new server processes and orphaned the old ones.

The fix is straightforward once you see it: persist the registry instance across submits with a ref:

const mcpRegistryRef = useRef<MCPRegistry | null>(null);
const mcpConnectedRef = useRef(false);

// Connect once, reuse forever
if (!mcpConnectedRef.current) {
  mcpConnectedRef.current = true;
  const reg = new MCPRegistry();
  await reg.connect(settings.mcpServers);
  mcpRegistryRef.current = reg;
}

The symptom was subtle: no crash, no error — just progressively more zombie processes accumulating in the background. Worth checking for in any system that manages long-lived child processes.


9. Ollama's <think> tag format divergence

DeepSeek-R1 produces thinking tokens. Two different formats depending on where you call it:

DeepSeek cloud API: thinking content comes in a separate reasoning_content field. Clean, explicit.

Ollama's OpenAI-compatible endpoint: strips the opening <think> tag but keeps the closing </think>. The thinking content appears at the start of the regular content stream, ending with \n\n</think>\n\n, followed by the actual response.

Handling them:

// DeepSeek cloud: dedicated field
if (chunk.choices[0].delta.reasoning_content) {
  onEvent({ type: "thinking_delta", delta: chunk.choices[0].delta.reasoning_content });
}

// Ollama: detect bare </think> closing tag in content stream
if (rawContent.includes("</think>")) {
  const [thinking, response] = rawContent.split("</think>\n\n");
  onEvent({ type: "thinking_delta", delta: thinking });
  onEvent({ type: "text_delta", delta: response });
}

This kind of format divergence is common when a vendor implements "compatible" APIs — compatible enough to work for the basic case, different enough to break edge cases.
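One edge the snippet above glosses over: the closing tag can arrive split across two stream deltas. A stateful splitter handles that — my sketch, not Seed's exact code:

```typescript
// Stateful splitter for Ollama's format: everything before the bare
// </think> is thinking, everything after is the response. Handles the
// tag arriving split across stream deltas.
class ThinkSplitter {
  private pending = "";
  private inThinking = true;
  private static TAG = "</think>";

  push(delta: string): { thinking: string; response: string } {
    if (!this.inThinking) return { thinking: "", response: delta };
    this.pending += delta;
    const idx = this.pending.indexOf(ThinkSplitter.TAG);
    if (idx !== -1) {
      this.inThinking = false;
      const thinking = this.pending.slice(0, idx);
      const response = this.pending
        .slice(idx + ThinkSplitter.TAG.length)
        .replace(/^\n+/, ""); // drop the \n\n after the tag
      this.pending = "";
      return { thinking, response };
    }
    // Hold back a possible partial "</think>" at the buffer's tail.
    const hold = ThinkSplitter.TAG.length - 1;
    const safe = this.pending.slice(0, Math.max(0, this.pending.length - hold));
    this.pending = this.pending.slice(safe.length);
    return { thinking: safe, response: "" };
  }
}
```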


What I'd do differently

Keep the scope tight from the start. The codebase has some Claude Code source files included for reference that ended up tracked by git. A clear src/seed/ vs src/vendor/ split from day one would have prevented that.

Write integration tests before the features. Unit tests caught regressions in the tool cache and permission system. But the agent loop end-to-end has no integration tests — that's the biggest quality gap right now.

Don't fight the terminal. I spent time on a custom scrollbar and alt-screen buffer before realizing Ink's <Static> + primary buffer is the correct architecture. The terminal already knows how to scroll.


The repo

GitHub: https://github.com/jiayu6954-sudo/seed-ai

TypeScript, MIT license. Runs on Node.js 20+. Supports Anthropic, OpenAI, DeepSeek, Groq, Gemini, Ollama, OpenRouter, Moonshot.

If you're building something similar or have questions about any of the above, open an issue or leave a comment.
