AI coding assistants have fundamentally changed how developers write software. The best ones do one thing well: collapse the distance between the developer's intent and the code that expresses it. But under the hood, they are largely orchestration layers — the individual primitives they use (language servers, vector search, tool-calling LLMs, sandboxed execution) are all open-source or replicable.
This post lays out a complete open-source architecture that covers every capability a modern AI coding IDE ships: autocomplete, chat-in-editor, multi-file agent editing, repo understanding, system design reasoning, and a safe code execution loop. No proprietary APIs required.
The project is called Dhi (धी) — from the Gayatri Mantra. It means pure intellect. The code lives at github.com/sochaty/dhi.
The Full Picture
Before going layer by layer, here is how the system fits together:
IDE Frontend
↓ JSON-RPC / WebSocket
Orchestration Core (LangGraph · Tool Router · Context Assembler)
↓ ↓ ↓ ↓
FIM Engine | Chat Engine | Agent Engine | Execution Sandbox
↓ ↓ ↓ ↓
Model Layer (Ollama · vLLM · StarCoder2 · DeepSeek)
↓
Repo Intelligence Layer (Tree-sitter · Chroma · LSP)
Each vertical slice is independently deployable. You can run only the autocomplete engine on a laptop, or scale the agent engine on a GPU cluster. The layers communicate over a local JSON-RPC bus (same protocol VS Code uses for LSP).
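As a rough illustration of what travels over that bus, here is a minimal sketch of JSON-RPC 2.0 framing with an LSP-style `Content-Length` header. The method name `fim/complete` and its parameters are hypothetical, chosen only for the example, and the decoder assumes the header block contains nothing but `Content-Length`.

```python
import json


def encode_message(method: str, params: dict, msg_id: int) -> bytes:
    """Frame a JSON-RPC 2.0 request with an LSP-style Content-Length header."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": msg_id, "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_message(frame: bytes) -> dict:
    """Parse one framed message back into a dict (single-header frames only)."""
    header, _, rest = frame.partition(b"\r\n\r\n")
    length = int(header.split(b":", 1)[1].strip())
    return json.loads(rest[:length].decode("utf-8"))


# "fim/complete" is a made-up method name for illustration.
frame = encode_message("fim/complete", {"path": "app.py", "line": 10}, msg_id=1)
request = decode_message(frame)
```

Because every engine speaks the same framing, a slice can be swapped or moved to another machine without the frontend noticing.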
Layer 1 — Repo Understanding
Every other capability is only as good as the system's understanding of the codebase. This is where most open-source IDE projects are weakest. Getting it right means going beyond naive file-splitting.
Parsing with Tree-sitter
Tree-sitter produces a concrete syntax tree for 40+ languages in under 5ms per file. Rather than splitting on character count, we split on semantic boundaries: functions, classes, method bodies. This keeps each chunk self-contained and reduces context fragmentation at retrieval time.
Source Files → Tree-sitter (per language) → Semantic Chunks
→ Metadata Overlay (file path · line range · symbol name)
→ nomic-embed-text-v1.5 (768-dim, runs locally)
→ Chroma (dev) / Qdrant (prod)
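To make the idea of semantic chunking concrete, the sketch below splits a Python file on function and class boundaries and attaches the metadata overlay (path, line range, symbol name). It uses Python's built-in `ast` module as a stand-in for Tree-sitter so it runs without extra dependencies; the `Chunk` type and `chunk_python_source` function are illustrative names, not part of any library.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    symbol: str
    start_line: int
    end_line: int
    text: str


def chunk_python_source(path: str, source: str) -> list[Chunk]:
    """Split a file on top-level function and class boundaries."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(path, node.name, node.lineno, node.end_lineno, text))
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Cache:
    def get(self, key):
        return None
'''

chunks = chunk_python_source("sample.py", SAMPLE)
```

Each chunk would then be embedded and stored with its metadata; a Tree-sitter version follows the same shape but walks language-specific node types.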
The Call Graph Layer
Pure vector search finds semantically similar code but misses structural relationships. A symbol reference graph (built from LSP textDocument/references calls) lets you answer questions like: "find every function that touches the auth middleware" with a graph traversal rather than a fuzzy search.
Store this as an adjacency list in SQLite: lightweight, zero infrastructure, and easy to keep in sync by re-running reference lookups for files that change.
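A minimal sketch of that adjacency list, assuming a single `refs(caller, callee)` table (the schema and symbol names are invented for the example). A recursive CTE answers the "everything that touches the auth middleware" question as a graph traversal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refs (caller TEXT, callee TEXT, PRIMARY KEY (caller, callee))"
)
conn.executemany(
    "INSERT INTO refs VALUES (?, ?)",
    [
        ("login_view", "check_token"),
        ("check_token", "auth_middleware"),
        ("admin_view", "auth_middleware"),
        ("render_home", "load_template"),
    ],
)


def transitive_callers(conn, symbol):
    """Return every symbol that reaches `symbol` through one or more calls."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(name) AS (
            SELECT caller FROM refs WHERE callee = ?
            UNION
            SELECT refs.caller FROM refs JOIN reach ON refs.callee = reach.name
        )
        SELECT name FROM reach ORDER BY name
        """,
        (symbol,),
    ).fetchall()
    return [name for (name,) in rows]


callers = transitive_callers(conn, "auth_middleware")
```

`UNION` (rather than `UNION ALL`) drops duplicate rows, which also keeps the traversal from looping forever on recursive call cycles.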
Layer 2 — Autocomplete (Fill-in-the-Middle)
Autocomplete in a modern AI IDE is not plain left-to-right completion. It is fill-in-the-middle (FIM): the model sees the prefix (everything before the cursor) and the suffix (everything after), and generates the completion that bridges them.
<fim_prefix>
repo context: top-3 retrieved chunks (~1500 tokens)
current file: lines 0 → cursor (~800 tokens)
<fim_suffix>
current file: cursor → EOF (~400 tokens)
<fim_middle> ← model generates here
target: < 150ms P50 | < 400ms P95
Model choices
| Model | Parameters | FIM Support | Runs on |
|---|---|---|---|
| StarCoder2-3B | 3B | ✅ native | Apple M2 / 8GB GPU |
| DeepSeek-Coder-V2-Lite | 16B | ✅ native | 24GB GPU |
| Qwen2.5-Coder-7B | 7B | ✅ native | 16GB GPU |
| CodeLlama-13B | 13B | ✅ native | 24GB GPU |
Serve them with Ollama for local dev or vLLM in production: PagedAttention reduces the KV-cache memory lost to fragmentation, and continuous batching lets new requests join a running batch instead of waiting for it to finish.
Speculative Decoding
Pair a small draft model with a larger verifier that shares its tokenizer, for example Qwen2.5-Coder-0.5B drafting for Qwen2.5-Coder-7B. The draft generates K tokens; the verifier accepts or rejects them in a single forward pass. Reported speedups are commonly around 2–3× over the large model alone, depending on how often the draft's tokens are accepted.
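The accept/reject mechanics are easier to see with toy models. The sketch below implements the greedy variant of speculative decoding over two plain functions that map a token list to the next token; both "models" are invented for the example, and the verifier loop only simulates what a real model would do in one batched forward pass.

```python
def speculative_decode(draft_next, target_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding over toy next-token functions."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The verifier checks all k+1 positions (one pass in a real model).
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)  # a match, a correction, or a bonus token
            if i == k or proposal[i] != expected:
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def target_next(ctx):
    """Toy verifier: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft: agrees with the verifier except right after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


output, passes = speculative_decode(draft_next, target_next, [0], max_new_tokens=12)
```

The output is identical to what the verifier would produce on its own; the saving comes from needing fewer verifier passes than generated tokens whenever the draft guesses well.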
Layer 3 — Chat-in-Editor
Chat works differently from autocomplete. The latency bar is 2–5 seconds (acceptable for a conversational exchange), but the context window must be carefully assembled to fit within model limits while including what's most relevant.
Developer Message
↓
Context Assembler
Slot 1: system prompt ~500 tokens
Slot 2: active file + selection ~2 000 tokens
Slot 3: LSP diagnostics ~500 tokens
Slot 4: retrieved RAG chunks ~4 000 tokens
Slot 5: conversation history ~2 000 tokens
Slot 6: user message remaining budget
↓
LLM (streaming SSE)
↓
Response Parser
prose → Chat Panel
code blocks → Diff Preview in editor
tool_call → Tool Engine
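A minimal sketch of the slot-based assembler, using the budgets from the diagram. Token counts are approximated as four characters per token, which is an assumption of this example; a real implementation would use the model's tokenizer. The slot keys and function names are illustrative.

```python
CHARS_PER_TOKEN = 4  # crude approximation, not a real tokenizer

# (slot name, token budget, keep_tail) in prompt order.
SLOT_BUDGETS = [
    ("system", 500, False),
    ("active_file", 2000, False),
    ("diagnostics", 500, False),
    ("rag_chunks", 4000, False),
    ("history", 2000, True),   # keep the most recent turns
]


def fit(text: str, budget_tokens: int, keep_tail: bool = False) -> str:
    """Trim text to a token budget, from the front or the back."""
    limit = budget_tokens * CHARS_PER_TOKEN
    if len(text) <= limit:
        return text
    return text[-limit:] if keep_tail else text[:limit]


def assemble_context(slots: dict, user_message: str, context_window: int) -> list:
    """Fill each slot up to its budget; the user message gets what is left."""
    parts, used = [], 0
    for name, budget, keep_tail in SLOT_BUDGETS:
        text = fit(slots.get(name, ""), budget, keep_tail)
        parts.append((name, text))
        used += (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN
    remaining = max(context_window - used, 1)
    parts.append(("user", fit(user_message, remaining)))
    return parts


parts = assemble_context(
    {"system": "s" * 10_000, "history": "old " * 3000 + "latest"},
    "Why does this test fail?",
    context_window=16_000,
)
```

Fixed budgets per slot keep one noisy source, such as a long conversation history, from crowding out the active file or retrieved chunks.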
The critical UX insight: stream tokens to the chat panel in real time, but buffer code blocks and only apply them to the editor after the complete block arrives. Partial code blocks applied live cause flickering and make the diff unreadable.
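One way to get that behaviour is a small splitter that sits between the token stream and the UI. The sketch below is a simplification: it emits prose one complete line at a time rather than token by token, and it treats any line starting with three backticks as a fence. `StreamSplitter` is an illustrative name, not an existing API.

```python
FENCE = "`" * 3  # a Markdown code fence marker


class StreamSplitter:
    """Emit prose line by line, but hold each fenced code block until it closes."""

    def __init__(self):
        self.pending = ""       # partial line not yet terminated by a newline
        self.code_lines = None  # None while outside a fenced block

    def feed(self, chunk: str) -> list:
        events = []
        self.pending += chunk
        while "\n" in self.pending:
            line, self.pending = self.pending.split("\n", 1)
            if line.startswith(FENCE):
                if self.code_lines is None:
                    self.code_lines = []  # opening fence: start buffering
                else:
                    events.append(("code", "\n".join(self.code_lines)))
                    self.code_lines = None  # closing fence: release the block
            elif self.code_lines is not None:
                self.code_lines.append(line)
            else:
                events.append(("prose", line))
        return events


splitter = StreamSplitter()
events = []
# Chunk boundaries deliberately fall mid-line and mid-fence.
for chunk in ["Here is the fix:\n" + FENCE + "py",
              "thon\nx = 1\ny",
              " = 2\n" + FENCE + "\nDone.\n"]:
    events += splitter.feed(chunk)
```

`("prose", ...)` events can go straight to the chat panel, while each `("code", ...)` event arrives once and complete, ready for the diff preview.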
For the model, any instruction-tuned model with a large context window works here: Qwen2.5-Coder-32B-Instruct, DeepSeek-V3, or Llama-3.3-70B-Instruct via Ollama / vLLM.
Layer 4 — Multi-File Agent Editing
This is the hardest layer to get right. The agent must plan, act across multiple files, observe outcomes (compiler errors, test failures), and revise — all without losing context of the original goal.
The Plan-Act-Observe Loop
Developer Request
↓
Planner (reasoning model · ordered task list)
↓
┌─────────────────────────────────┐
│ Think (LLM) ←──── Observe │
│ ↓ ↑ │
│ Act (Tools) ──────────┘ │
└─────────────────────────────────┘
↓ done
Diff Preview (Accept / Reject per file)
Tool Set
| Tool | What it does |
|---|---|
| read_file(path) | Returns file contents |
| write_file(path, content) | Applies diff |
| search_codebase(query) | Vector + keyword hybrid search |
| run_command(cmd) | Sandboxed shell (Docker) |
| list_directory(path) | File tree |
| get_diagnostics() | LSP errors / warnings |
| get_references(symbol) | Call graph lookup |
| create_file(path, content) | Creates new file |
| delete_file(path) | Deletes with undo stack |
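A tool router for a set like this can be as small as a name-to-function table. The sketch below wires up three of the tools against a temporary directory; the `{"name": ..., "args": {...}}` call shape and the `ToolRouter` class are assumptions for illustration, and a real router would also validate paths before touching the filesystem.

```python
import tempfile
from pathlib import Path


class ToolRouter:
    """Dispatch tool calls by name and turn failures into observations."""

    def __init__(self, root: Path):
        self.root = root
        self.tools = {
            "read_file": self.read_file,
            "write_file": self.write_file,
            "list_directory": self.list_directory,
        }

    def dispatch(self, call: dict) -> dict:
        tool = self.tools.get(call["name"])
        if tool is None:
            return {"ok": False, "error": f"unknown tool: {call['name']}"}
        try:
            return {"ok": True, "result": tool(**call["args"])}
        except Exception as exc:  # the agent sees the error instead of crashing
            return {"ok": False, "error": str(exc)}

    def read_file(self, path):
        return (self.root / path).read_text()

    def write_file(self, path, content):
        (self.root / path).write_text(content)
        return f"wrote {len(content)} chars"

    def list_directory(self, path="."):
        return sorted(p.name for p in (self.root / path).iterdir())


with tempfile.TemporaryDirectory() as tmp:
    router = ToolRouter(Path(tmp))
    wrote = router.dispatch(
        {"name": "write_file", "args": {"path": "a.txt", "content": "hello"}}
    )
    read = router.dispatch({"name": "read_file", "args": {"path": "a.txt"}})
    listing = router.dispatch({"name": "list_directory", "args": {}})
    missing = router.dispatch({"name": "deploy", "args": {}})
```

Returning errors as data rather than raising lets the agent treat a failed tool call as just another observation to reason about.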
Orchestration: LangGraph
LangGraph models the agent loop as a directed graph of nodes (think, act, observe, plan, verify). Edges are conditional — the observe node routes back to think on errors, or forwards to verify on success.
The key advantage over a simple while loop: checkpointing. LangGraph can pause the loop mid-execution, serialize state to disk, and resume — critical for long refactors that might span dozens of file edits.
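To show why checkpointing matters, here is a plain-Python stand-in for the graph, deliberately not using the LangGraph API. Each node returns the name of the next node, state is written to a JSON file after every step, and a paused run can be resumed from that file. The node bodies are placeholders; the "succeeds on the third attempt" rule is invented for the example.

```python
import json
import tempfile
from pathlib import Path


def think(state):
    state["action"] = f"fix attempt {state['attempts'] + 1}"
    return "act"


def act(state):
    state["attempts"] += 1
    return "observe"


def observe(state):
    # Stand-in for compiler or test feedback: succeed on the third attempt.
    state["error"] = None if state["attempts"] >= 3 else "test failed"
    return "think" if state["error"] else "verify"  # conditional edge


def verify(state):
    state["done"] = True
    return None  # no next node: the loop ends


NODES = {"think": think, "act": act, "observe": observe, "verify": verify}


def run(state, node, checkpoint: Path, max_steps=50):
    """Step through the graph, saving state and next node after each step."""
    steps = 0
    while node is not None and steps < max_steps:
        node = NODES[node](state)
        checkpoint.write_text(json.dumps({"state": state, "next": node}))
        steps += 1
    return state


def resume(checkpoint: Path):
    """Continue from whatever the last checkpoint recorded."""
    saved = json.loads(checkpoint.read_text())
    return run(saved["state"], saved["next"], checkpoint)


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "agent.json"
    start = {"attempts": 0, "error": None, "done": False}
    paused = run(start, "think", checkpoint, max_steps=4)  # interrupted early
    paused_done = paused["done"]
    final = resume(checkpoint)
```

A graph framework adds persistence backends, branching, and human-in-the-loop interrupts on top, but the core contract is the same: state plus "which node runs next" is enough to resume.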
Layer 5 — System Design and Reasoning
Architecture-level questions ("should I use event sourcing here?", "draw the service dependency graph") require a different mode: long-horizon reasoning over the entire codebase context, not just a few files.
Developer Question
↓
Repo Summary Builder
(1-paragraph LLM summary per directory → project map)
↓ ~8k token project map
Reasoning Model (DeepSeek-R1 / QwQ-32B)
↓
Diagram Output (Mermaid / PlantUML, rendered in IDE panel)
The repo summary is the critical artifact. Build it once on first index, then update incrementally using git diff — only re-summarise modules that changed in the last commit.
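The incremental update reduces to one question: given the changed paths, which directory summaries are stale? The sketch below assumes the input is the output of `git diff --name-only` (forward-slash paths relative to the repo root) and that each file belongs to its nearest summarised ancestor directory. The function name and directory layout are illustrative.

```python
from pathlib import PurePosixPath


def stale_directories(changed_files, summarised_dirs):
    """Map changed paths to the nearest directory that has a summary."""
    stale = set()
    for name in changed_files:
        # parents runs from the closest directory outward, ending at ".".
        for parent in PurePosixPath(name).parents:
            if str(parent) in summarised_dirs:
                stale.add(str(parent))
                break
    return sorted(stale)


summaries = {"src", "src/auth", "src/api", "docs"}
changed = ["src/auth/jwt.py", "src/auth/session.py", "README.md"]
to_rebuild = stale_directories(changed, summaries)
```

Only the returned directories need a fresh LLM summary; the rest of the project map is reused as-is.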
Layer 6 — Safe Code Execution Loop
Agents that can write code must be able to run it. But running arbitrary LLM-generated code on the host machine is a hard no. The execution layer must be:
- Isolated: no access to host filesystem, network, or env vars outside the project
- Ephemeral: container torn down after each run
- Auditable: all stdin/stdout/stderr captured and shown to the developer
Agent: run_command("pytest tests/")
↓
Docker Container (ephemeral)
project files: read-only bind mount
/tmp: writable scratch only
network: none
memory: 512 MB
timeout: 30 s
seccomp: restricted syscalls
↓
exit 0 → tests passed → agent proceeds
exit 1 → failures → agent sees stderr, re-plans
timeout → hard kill → agent informed, retries or stops
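The container settings above map fairly directly onto `docker run` flags. The sketch below builds the argument list and wraps the call with a timeout; the image name and the `/workspace` mount point are placeholder assumptions, and the image is assumed to already contain the project's test dependencies.

```python
import subprocess


def sandbox_argv(project_dir, command, image="python:3.12-slim", seccomp_profile=None):
    """Build a `docker run` command line for one ephemeral, locked-down run."""
    argv = [
        "docker", "run", "--rm",                      # ephemeral: removed on exit
        "--network", "none",                          # no network access
        "--memory", "512m",                           # memory cap
        "--read-only",                                # read-only root filesystem
        "--tmpfs", "/tmp",                            # writable scratch only
        "--volume", f"{project_dir}:/workspace:ro",   # read-only project files
        "--workdir", "/workspace",
    ]
    if seccomp_profile:
        # Without this flag Docker still applies its default seccomp profile.
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    return argv + [image, "sh", "-c", command]


def run_in_sandbox(argv, timeout_s=30):
    """Run the command and report an outcome the agent can act on."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "passed" if done.returncode == 0 else "failed"
    return {"status": status, "stdout": done.stdout, "stderr": done.stderr}


argv = sandbox_argv("/home/dev/project", "pytest tests/")
```

One caveat: the `subprocess` timeout stops the Docker client and may leave the container itself running, so a fuller version would likely name the container and issue `docker kill` on timeout.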
The Self-Healing Loop
Write code → Run tests
↓ pass → Offer diff to developer
↓ fail → Observe error → re-plan → Write code
When tests fail, the output becomes the next observation in the agent loop. The agent sees the exact error, reasons about the fix, edits the file, and re-runs — typically converging in 2–3 iterations for straightforward bugs.
For an even tighter sandbox, use gVisor (Google's container runtime that intercepts syscalls in user space) or Firecracker (AWS's micro-VM used in Lambda) instead of vanilla Docker.
The Full Open-Source Stack
| Capability | Component | Notes |
|---|---|---|
| Editor | Monaco Editor | MIT, same engine as VS Code |
| Syntax parsing | Tree-sitter | MIT, 40+ languages |
| Code intelligence | LSP servers (clangd, pylsp, typescript-language-server) | Per-language |
| Embeddings | nomic-embed-text-v1.5 | Apache 2.0, 768-dim, runs locally |
| Vector store | Chroma (dev) / Qdrant (prod) | Both open-source |
| FIM autocomplete | StarCoder2-3B / Qwen2.5-Coder-7B | BigCode OpenRAIL-M / Apache 2.0 |
| Chat model | Qwen2.5-Coder-32B-Instruct | Apache 2.0 |
| Reasoning model | QwQ-32B / DeepSeek-R1-Distill-Qwen-32B | Apache 2.0 / MIT |
| Model serving | Ollama (local) / vLLM (production) | MIT / Apache 2.0 |
| Agent orchestration | LangGraph | MIT |
| Execution sandbox | Docker + seccomp / gVisor | Apache 2.0 |
| Backend API | FastAPI | MIT |
| Frontend | Next.js + Tailwind | MIT |
What You Don't Get For Free
An honest architecture post should name the hard parts:
Latency at low VRAM. A 32B model only fits on a single 24GB GPU with roughly 4-bit quantisation (for example GGUF Q4_K_M), and doing chat it hits 15–20 tokens/second. That is acceptable for most workflows, but noticeably slower than cloud-hosted alternatives. The mitigations are speculative decoding, more aggressive quantisation, or offloading to a small cloud GPU when needed.
Prompt cache invalidation. Managed AI coding services almost certainly implement prompt caching across requests. vLLM ships automatic prefix caching, but it only pays off when consecutive prompts share an identical prefix, so the context assembler has to keep slot order and content stable across requests. Possible, but non-trivial.
Index freshness. Keeping the vector store in sync with active edits (buffers change on every keystroke) requires debounced incremental re-indexing, which is easy to get wrong and leaves you with stale retrieval.
Security surface. The Docker sandbox is safe for test runners. But agents that can write_file anywhere in the repo, modify CI configs, or touch secrets files are a different risk level. Implement a path allowlist and require developer confirmation for writes outside the current working directory.
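A path allowlist check might look like the sketch below. It resolves the target first so that `..` segments cannot escape an allowed directory, then checks containment; the allowed directories and blocked filenames are example values, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def is_write_allowed(repo_root, target, allowed_dirs, blocked_names=(".env",)):
    """Allow writes only inside allowlisted directories, after resolving the path."""
    root = Path(repo_root).resolve()
    resolved = (root / target).resolve()  # collapses "..", follows symlinks
    if resolved.name in blocked_names:
        return False  # secrets files are off-limits even inside allowed dirs
    return any(
        resolved.is_relative_to((root / allowed).resolve())
        for allowed in allowed_dirs
    )


ALLOWED = ["src", "tests"]
checks = {
    "src/app.py": is_write_allowed("/repo", "src/app.py", ALLOWED),
    "src/../.github/workflows/ci.yml": is_write_allowed(
        "/repo", "src/../.github/workflows/ci.yml", ALLOWED
    ),
    "../etc/passwd": is_write_allowed("/repo", "../etc/passwd", ALLOWED),
    "src/.env": is_write_allowed("/repo", "src/.env", ALLOWED),
}
```

Anything that fails the check would be routed to a developer confirmation prompt instead of being written directly.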
Closing Thoughts
The individual components here — Tree-sitter, vLLM, LangGraph, Docker — are each battle-tested in production at scale. The architecture challenge is the orchestration: assembling the right context, routing to the right model at the right latency budget, and designing a UX that keeps the developer in control of what the agent actually touches.
The moat of any great AI coding tool is not its architecture. It's the years of UX iteration on top of this architecture. The open-source community now has every primitive it needs to build something just as capable.
The next post walks through implementing the FIM autocomplete engine end-to-end: Tree-sitter chunking, nomic embeddings, and a StarCoder2-3B inference server — all running on a single laptop.