Sourish Chakraborty

Posted on Jun 23 • Originally published at blogs.sourishchakraborty.com

Building an AI Coding IDE from Scratch: A Full Open-Source Architecture

#ai #architecture #opensource #devtools

Note: This post contains architecture diagrams. For the fully rendered version visit the original post.

AI coding assistants have fundamentally changed how developers write software. The best ones do one thing well: collapse the distance between the developer's intent and the code that expresses it. But under the hood, they are largely orchestration layers — the individual primitives they use (language servers, vector search, tool-calling LLMs, sandboxed execution) are all open-source or replicable.

This post lays out a complete open-source architecture that covers every capability a modern AI coding IDE ships: autocomplete, chat-in-editor, multi-file agent editing, repo understanding, system design reasoning, and a safe code execution loop. No proprietary APIs required.

The project is called Dhi (धी) — from the Gayatri Mantra. It means pure intellect. The code lives at github.com/sochaty/dhi.

The Full Picture

Before going layer by layer, here is how the system fits together:

IDE Frontend
    ↓  JSON-RPC / WebSocket
Orchestration Core  (LangGraph · Tool Router · Context Assembler)
    ↓           ↓            ↓                ↓
FIM Engine  Chat Engine  Agent Engine  Execution Sandbox
    ↓           ↓            ↓                ↓
        Model Layer  (Ollama · vLLM · StarCoder2 · DeepSeek)
                         ↓
        Repo Intelligence Layer  (Tree-sitter · Chroma · LSP)

Each vertical slice is independently deployable. You can run only the autocomplete engine on a laptop, or scale the agent engine on a GPU cluster. The layers communicate over a local JSON-RPC bus (same protocol VS Code uses for LSP).
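As a rough illustration of what a message on that bus could look like, here is a sketch of a JSON-RPC 2.0 request using the Content-Length framing that LSP uses over stdio. The method name `fim/complete` and its params are hypothetical and not taken from the Dhi codebase.

```python
import json


def encode_request(request_id: int, method: str, params: dict) -> bytes:
    """Frame a JSON-RPC 2.0 request with an LSP-style Content-Length header."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_message(frame: bytes) -> dict:
    """Split the header from the body and parse the JSON payload."""
    header, _, body = frame.partition(b"\r\n\r\n")
    length = int(header.split(b":")[1].strip())
    return json.loads(body[:length].decode("utf-8"))


# Hypothetical method name and params, for illustration only.
frame = encode_request(1, "fim/complete", {"path": "app.py", "line": 10, "character": 4})
message = decode_message(frame)
```

Over a WebSocket the transport already delimits messages, so the header would likely be dropped and only the JSON body sent.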

Layer 1 — Repo Understanding

Every other capability is only as good as the system's understanding of the codebase. This is where most open-source IDE projects are weakest. Getting it right means going beyond naive file-splitting.

Parsing with Tree-sitter

Tree-sitter produces a concrete syntax tree for 40+ languages, typically within a few milliseconds for an ordinary source file. Rather than splitting on character count, we split on semantic boundaries: functions, classes, method bodies. This keeps each chunk self-contained and reduces context fragmentation at retrieval time.

Source Files → Tree-sitter (per language) → Semantic Chunks
    → Metadata Overlay (file path · line range · symbol name)
    → nomic-embed-text-v1.5 (768-dim, runs locally)
    → Chroma (dev) / Qdrant (prod)
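The chunking step in this pipeline can be sketched without any dependencies. The example below is an assumption-laden stand-in: it uses Python's built-in `ast` module instead of Tree-sitter, so it only handles Python source, and the `Chunk` type and `chunk_python_source` function are hypothetical names. The idea is the same: cut on function and class boundaries and attach the metadata overlay.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    symbol: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_python_source(path: str, source: str) -> list[Chunk]:
    """Split source on top-level function and class boundaries."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(path, node.name, node.lineno, node.end_lineno, text))
    return chunks


SAMPLE = '''import os

def login(user):
    return check(user)

class Session:
    def close(self):
        pass
'''

chunks = chunk_python_source("auth.py", SAMPLE)
```

Each chunk carries the file path, line range and symbol name, which is the metadata that would be stored next to its embedding.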

The Call Graph Layer

Pure vector search finds semantically similar code but misses structural relationships. A symbol reference graph (built from LSP textDocument/references calls) lets you answer questions like: "find every function that touches the auth middleware" with a graph traversal rather than a fuzzy search.

Store this as an adjacency list in SQLite: lightweight, zero infrastructure, and cheap to update incrementally so it stays in sync with the repo as files change.
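A minimal sketch of that adjacency list, assuming a single hypothetical `refs(caller, callee)` table and made-up symbol names. A recursive common table expression performs the graph traversal, so "everything that touches the auth middleware" becomes one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refs (caller TEXT NOT NULL, callee TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO refs VALUES (?, ?)",
    [
        ("login_view", "require_auth"),
        ("require_auth", "auth_middleware"),
        ("admin_view", "require_auth"),
        ("health_check", "ping"),
    ],
)


def transitive_callers(conn, symbol):
    """Return every symbol that reaches `symbol` through one or more references."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(name) AS (
            SELECT caller FROM refs WHERE callee = ?
            UNION  -- UNION (not UNION ALL) de-duplicates, so cycles terminate
            SELECT refs.caller FROM refs JOIN reach ON refs.callee = reach.name
        )
        SELECT name FROM reach ORDER BY name
        """,
        (symbol,),
    ).fetchall()
    return [name for (name,) in rows]


callers = transitive_callers(conn, "auth_middleware")
```

In a real index the rows would come from LSP `textDocument/references` results rather than a hard-coded list.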

Layer 2 — Autocomplete (Fill-in-the-Middle)

Autocomplete in a modern AI IDE is not plain left-to-right completion. It is fill-in-the-middle (FIM): the model sees the prefix (everything before the cursor) and the suffix (everything after), and generates the completion that bridges them.

<fim_prefix>
  repo context: top-3 retrieved chunks (~1500 tokens)
  current file: lines 0 → cursor (~800 tokens)

<fim_suffix>
  current file: cursor → EOF (~400 tokens)

<fim_middle>   ← model generates here
  target: < 150ms P50  |  < 400ms P95
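The layout above can be assembled with a few string slices. This sketch assumes StarCoder-style sentinel tokens (other model families use different spellings, so check the model card) and approximates tokens as four characters each; `build_fim_prompt` and its parameters are hypothetical.

```python
def build_fim_prompt(source, cursor, repo_chunks,
                     prefix_tokens=800, suffix_tokens=400, repo_tokens=1500,
                     chars_per_token=4):
    """Assemble a prefix-suffix-middle prompt around a cursor offset."""
    prefix_chars = prefix_tokens * chars_per_token
    suffix_chars = suffix_tokens * chars_per_token
    repo_chars = repo_tokens * chars_per_token

    # Keep the text closest to the cursor on both sides.
    prefix = source[max(0, cursor - prefix_chars):cursor]
    suffix = source[cursor:cursor + suffix_chars]

    # Top-3 retrieved chunks go ahead of the current file, clipped to budget.
    repo_context = "\n".join(repo_chunks[:3])[:repo_chars]

    return f"<fim_prefix>{repo_context}\n{prefix}<fim_suffix>{suffix}<fim_middle>"


source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")
prompt = build_fim_prompt(source, cursor, ["# utils.py\ndef sub(a, b): return a - b"])
```

A production version would count tokens with the model's own tokenizer instead of the character heuristic.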

Model choices

Model	Parameters	FIM Support	Runs on
StarCoder2-3B	3B	✅ native	Apple M2 / 8GB GPU
DeepSeek-Coder-V2-Lite	16B	✅ native	24GB GPU
Qwen2.5-Coder-7B	7B	✅ native	16GB GPU
CodeLlama-13B	13B	✅ native	24GB GPU

Serve them with Ollama for local dev or vLLM in production (PagedAttention reduces wasted KV-cache memory, and continuous batching cuts queueing delay under concurrent requests).

Speculative Decoding

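Pair a small draft model with a large verifier from the same family, so that both share a tokenizer (for example, Qwen2.5-Coder-0.5B drafting for Qwen2.5-Coder-7B). The draft generates K tokens; the verifier checks all K in a single forward pass and keeps the longest accepted prefix. The gain depends on how often the draft is right: speedups of roughly 2–3× over the large model alone are commonly reported for typical completion lengths.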
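To make the accept/reject step concrete, here is a toy sketch of the greedy variant of speculative decoding. The "models" are plain Python functions standing in for a draft and a verifier, and all names are hypothetical. A real verifier scores all K positions in one batched forward pass; the loop below only mimics that logic.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def greedy_decode(model: NextToken, prompt: List[str], max_new: int) -> List[str]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[str], max_new: int, k: int = 4) -> List[str]:
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The draft proposes up to k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The verifier keeps the longest matching prefix and replaces
        #    the first mismatch with its own token.
        accepted: List[str] = []
        for token in proposal:
            expected = target(out + accepted)
            accepted.append(expected)
            if token != expected:
                break
        out.extend(accepted)
    return out


TEXT = "return a + b if a else b".split()


def target(ctx: List[str]) -> str:
    return TEXT[len(ctx) % len(TEXT)]


def draft(ctx: List[str]) -> str:
    # Toy draft: agrees with the target except at every fifth position.
    return "?" if len(ctx) % 5 == 4 else target(ctx)


result = speculative_decode(target, draft, [], max_new=12)
```

The output is identical to decoding with the verifier alone; the saving comes from the verifier running once per batch of drafted tokens rather than once per token.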

Layer 3 — Chat-in-Editor

Chat works differently from autocomplete. The latency bar is 2–5 seconds (acceptable for a conversational exchange), but the context window must be carefully assembled to fit within model limits while including what's most relevant.

Developer Message
    ↓
Context Assembler
  Slot 1: system prompt          ~500 tokens
  Slot 2: active file + selection  ~2 000 tokens
  Slot 3: LSP diagnostics          ~500 tokens
  Slot 4: retrieved RAG chunks     ~4 000 tokens
  Slot 5: conversation history     ~2 000 tokens
  Slot 6: user message             remaining budget
    ↓
LLM  (streaming SSE)
    ↓
Response Parser
  prose       → Chat Panel
  code blocks → Diff Preview in editor
  tool_call   → Tool Engine
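One way the Context Assembler's slot budgets could be enforced is sketched below. It assumes a crude four-characters-per-token estimate and hypothetical `Slot` and `assemble_context` names. Each slot is clipped to its own budget, and the user message receives whatever is left of the window.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # rough assumption; use the model's tokenizer in practice


@dataclass
class Slot:
    name: str
    text: str
    budget: int  # maximum tokens this slot may consume


def count_tokens(text: str) -> int:
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def truncate(text: str, budget: int) -> str:
    return text[: budget * CHARS_PER_TOKEN]


def assemble_context(slots, user_message, window):
    """Clip each slot to its budget, then give the user message the remainder."""
    parts, used = [], 0
    for slot in slots:
        clipped = truncate(slot.text, slot.budget)
        parts.append((slot.name, clipped))
        used += count_tokens(clipped)
    remaining = max(window - used, 0)
    parts.append(("user", truncate(user_message, remaining)))
    return parts


slots = [
    Slot("system", "You are a coding assistant.", 500),
    Slot("active_file", "x = 1\n" * 5000, 2000),
]
parts = assemble_context(slots, "Why does this fail?", window=8192)
```

A fuller version would trim low-priority slots (history, RAG chunks) first when the window is tight, rather than clipping the user message.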

The critical UX insight: stream tokens to the chat panel in real time, but buffer code blocks and only apply them to the editor after the complete block arrives. Partial code blocks applied live cause flickering and make the diff unreadable.
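The buffering rule can be sketched as a small line-oriented splitter. This is an illustrative assumption rather than Dhi's parser: prose is forwarded as each line completes, while lines inside a fenced block are held until the closing fence arrives. The `StreamSplitter` name and the event tuples are hypothetical.

```python
class StreamSplitter:
    """Forward prose lines as they complete; hold code until the fence closes."""

    FENCE = "`" * 3

    def __init__(self):
        self.pending = ""       # partial line not yet terminated by a newline
        self.code_lines = None  # None while outside a fenced block

    def feed(self, chunk):
        events = []
        self.pending += chunk
        while "\n" in self.pending:
            line, self.pending = self.pending.split("\n", 1)
            if line.strip().startswith(self.FENCE):
                if self.code_lines is None:
                    self.code_lines = []  # opening fence: start buffering
                else:
                    events.append(("code", "\n".join(self.code_lines)))
                    self.code_lines = None  # closing fence: release the block
            elif self.code_lines is not None:
                self.code_lines.append(line)
            else:
                events.append(("prose", line))
        return events


fence = StreamSplitter.FENCE
# The opening fence is deliberately split across two streamed chunks.
chunks = ["Here is the fix:\n" + fence[:2], fence[2:] + "python\nx =", " 1\n", fence + "\nDone.\n"]
splitter = StreamSplitter()
events = [event for chunk in chunks for event in splitter.feed(chunk)]
```

The `code` event fires exactly once per block, which is the point at which a diff preview could be computed.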

For the model, any instruction-tuned model with a large context window works here: Qwen2.5-Coder-32B-Instruct, DeepSeek-V3, or Llama-3.3-70B-Instruct via Ollama / vLLM.

Layer 4 — Multi-File Agent Editing

This is the hardest layer to get right. The agent must plan, act across multiple files, observe outcomes (compiler errors, test failures), and revise — all without losing context of the original goal.

The Plan-Act-Observe Loop

Developer Request
    ↓
Planner  (reasoning model · ordered task list)
    ↓
┌─────────────────────────────────┐
│  Think (LLM) ←──── Observe     │
│       ↓                ↑       │
│  Act (Tools) ──────────┘       │
└─────────────────────────────────┘
    ↓  done
Diff Preview  (Accept / Reject per file)

Tool Set

Tool	What it does
`read_file(path)`	Returns file contents
`write_file(path, content)`	Overwrites the file with new content (shown to the developer as a diff)
`search_codebase(query)`	Vector + keyword hybrid search
`run_command(cmd)`	Sandboxed shell (Docker)
`list_directory(path)`	File tree
`get_diagnostics()`	LSP errors / warnings
`get_references(symbol)`	Call graph lookup
`create_file(path, content)`	Creates new file
`delete_file(path)`	Deletes with undo stack
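A tool router for a set like this can be as small as a dictionary keyed by tool name. The sketch below registers two of the tools from the table and dispatches a model-issued call to them. The call shape (`name` plus `arguments`) and the result envelope are assumptions, not a fixed protocol.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., object]] = {}


def tool(fn):
    """Register a function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


@tool
def list_directory(path: str) -> list:
    return sorted(p.name for p in Path(path).iterdir())


def dispatch(call: dict) -> dict:
    """Route {"name": ..., "arguments": {...}} to a registered tool."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    try:
        return {"ok": True, "result": fn(**call["arguments"])}
    except Exception as exc:  # failures go back to the model as an observation
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hi", encoding="utf-8")
    listing = dispatch({"name": "list_directory", "arguments": {"path": tmp}})
    content = dispatch({"name": "read_file", "arguments": {"path": str(Path(tmp, "a.txt"))}})

unknown = dispatch({"name": "rm_rf", "arguments": {}})
```

Returning errors as data instead of raising keeps the agent loop alive, so the model can read the failure and try something else.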

Orchestration: LangGraph

LangGraph models the agent loop as a directed graph of nodes (think, act, observe, plan, verify). Edges are conditional — the observe node routes back to think on errors, or forwards to verify on success.

The key advantage over a simple while loop: checkpointing. LangGraph can pause the loop mid-execution, serialize state to disk, and resume — critical for long refactors that might span dozens of file edits.
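To show why checkpointing matters, here is a dependency-free sketch of the think/act/observe graph with state saved after every node. It deliberately does not use the LangGraph API; the node functions, the state shape and the "stop after three edits" condition are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def think(state):
    state["next"] = "act"


def act(state):
    state["edits"].append(f"edit-{len(state['edits']) + 1}")
    state["next"] = "observe"


def observe(state):
    # Hypothetical success condition: route to verify after three edits.
    state["next"] = "verify" if len(state["edits"]) >= 3 else "think"


NODES = {"think": think, "act": act, "observe": observe}


def run(state, checkpoint: Path, max_steps: int):
    """Run nodes until 'verify' or max_steps, saving state after each step."""
    for _ in range(max_steps):
        if state["next"] == "verify":
            break
        NODES[state["next"]](state)
        checkpoint.write_text(json.dumps(state))
    return state


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "agent_state.json"
    # First run is cut off after four steps, as if the process was stopped.
    run({"next": "think", "edits": []}, checkpoint, max_steps=4)
    edits_at_interrupt = json.loads(checkpoint.read_text())["edits"]
    # A later run resumes from the serialized state instead of starting over.
    final = run(json.loads(checkpoint.read_text()), checkpoint, max_steps=100)
```

With a plain while loop the first edit would be lost on interruption; here the resumed run continues from the saved node.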

Layer 5 — System Design and Reasoning

Architecture-level questions ("should I use event sourcing here?", "draw the service dependency graph") require a different mode: long-horizon reasoning over the entire codebase context, not just a few files.

Developer Question
    ↓
Repo Summary Builder
  (1-paragraph LLM summary per directory → project map)
    ↓  ~8k token project map
Reasoning Model  (DeepSeek-R1 / QwQ-32B)
    ↓
Diagram Output  (Mermaid / PlantUML, rendered in IDE panel)

The repo summary is the critical artifact. Build it once on first index, then update incrementally using git diff — only re-summarise modules that changed in the last commit.
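The incremental update reduces to mapping changed paths onto the directories that already have a summary. A minimal sketch, assuming the changed paths come from something like `git diff --name-only HEAD~1 HEAD` and that summaries are keyed by directory path (both assumptions):

```python
from pathlib import PurePosixPath


def stale_directories(changed_paths, summarised_dirs):
    """Return the summarised directories that contain at least one changed file."""
    stale = set()
    for raw in changed_paths:
        for parent in PurePosixPath(raw).parents:
            key = str(parent)
            if key in summarised_dirs:
                stale.add(key)
                break  # only the nearest summarised ancestor is refreshed
    return sorted(stale)


# Example input, e.g. the output of `git diff --name-only HEAD~1 HEAD`.
changed = ["src/auth/middleware.py", "src/auth/tokens.py", "docs/README.md"]
summaries = {
    "src/auth": "Auth middleware and token helpers.",
    "src/billing": "Invoice generation.",
    "docs": "Project documentation.",
}
to_refresh = stale_directories(changed, summaries)
```

Only the returned directories need a fresh LLM summary; everything else in the project map is reused as-is.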

Layer 6 — Safe Code Execution Loop

Agents that can write code must be able to run it. But running arbitrary LLM-generated code on the host machine is a hard no. The execution layer must be:

Isolated: no access to host filesystem, network, or env vars outside the project
Ephemeral: container torn down after each run
Auditable: all stdin/stdout/stderr captured and shown to the developer

Agent: run_command("pytest tests/")
    ↓
Docker Container  (ephemeral)
  project files: read-only bind mount
  /tmp:          writable scratch only
  network:       none
  memory:        512 MB
  timeout:       30 s
  seccomp:       restricted syscalls
    ↓
exit 0   → tests passed  → agent proceeds
exit ≠ 0 → failures      → agent sees stdout + stderr, re-plans
timeout  → hard kill     → agent informed, retries or stops
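The container settings above map fairly directly onto `docker run` flags. The sketch below only builds the argument list and wraps it in `subprocess.run` with a timeout. The image name, the `/workspace` mount point and the function names are assumptions, and the image would need the test runner installed.

```python
import subprocess


def sandbox_argv(project_dir, command, image="python:3.12-slim", seccomp_profile=None):
    """Build a `docker run` command line for one ephemeral, locked-down run."""
    argv = [
        "docker", "run", "--rm",                     # ephemeral: removed on exit
        "--network", "none",                         # no network access
        "--memory", "512m",                          # memory cap
        "--read-only",                               # read-only root filesystem
        "--tmpfs", "/tmp",                           # writable scratch space only
        "--volume", f"{project_dir}:/workspace:ro",  # project files, read-only
        "--workdir", "/workspace",
    ]
    if seccomp_profile:
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    return argv + [image] + list(command)


def run_in_sandbox(project_dir, command, timeout_s=30):
    """Return (exit_code, stdout, stderr); exit_code is None on timeout."""
    try:
        proc = subprocess.run(sandbox_argv(project_dir, command),
                              capture_output=True, text=True, timeout=timeout_s)
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        # This stops the docker CLI process only. A fuller version would name
        # the container and run `docker kill` on it as well.
        return None, "", f"timed out after {timeout_s}s"


argv = sandbox_argv("/repo", ["pytest", "tests/"])
```

Captured stdout and stderr are what the agent receives as its next observation, and what the developer sees in the audit trail.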

The Self-Healing Loop

Write code → Run tests
                ↓ pass      → Offer diff to developer
                ↓ fail      → Observe error → re-plan → Write code

When tests fail, the output becomes the next observation in the agent loop. The agent sees the exact error, reasons about the fix, edits the file, and re-runs — typically converging in 2–3 iterations for straightforward bugs.

For an even tighter sandbox, use gVisor (Google's container runtime that intercepts syscalls in user space) or Firecracker (AWS's micro-VM used in Lambda) instead of vanilla Docker.

The Full Open-Source Stack

Capability	Component	Notes
Editor	Monaco Editor	MIT, same engine as VS Code
Syntax parsing	Tree-sitter	MIT, 40+ languages
Code intelligence	LSP servers (clangd, pylsp, ts-ls)	Per-language
Embeddings	nomic-embed-text-v1.5	Apache 2.0, 768-dim, runs locally
Vector store	Chroma (dev) / Qdrant (prod)	Both open-source
FIM autocomplete	StarCoder2-3B / Qwen2.5-Coder-7B	BigCode OpenRAIL-M / Apache 2.0
Chat model	Qwen2.5-Coder-32B-Instruct	Apache 2.0
Reasoning model	QwQ-32B / DeepSeek-R1-Distill-Qwen-32B	Apache 2.0 / MIT
Model serving	Ollama (local) / vLLM (production)	MIT / Apache 2.0
Agent orchestration	LangGraph	MIT
Execution sandbox	Docker + seccomp / gVisor	Apache 2.0
Backend API	FastAPI	MIT
Frontend	Next.js + Tailwind	MIT

What You Don't Get For Free

An honest architecture post should name the hard parts:

Latency at low VRAM. A 32B model needs about 64GB for its weights at 16-bit precision, so it fits on a single 24GB GPU only when quantised to roughly 4 bits (for example GGUF Q4), and even then chat runs at around 15–20 tokens/second. Acceptable for most workflows, but noticeably slower than cloud-hosted alternatives. The fix is speculative decoding, a smaller model, or offloading to a small cloud GPU when needed.

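Prompt cache invalidation. Managed AI coding services almost certainly implement prompt caching across requests. Replicating this without a managed inference provider means managing the key-value cache in vLLM yourself: its automatic prefix caching can reuse shared prompt prefixes, but only when the context assembler keeps those prefixes identical between requests. Possible, but non-trivial.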

Index freshness. Keeping the vector store in sync with active edits (the editor buffer changes on every keystroke) requires debounced incremental re-indexing, which is easy to get wrong and leaves you with stale retrieval.

Security surface. The Docker sandbox is safe for test runners. But agents that can write_file anywhere in the repo, modify CI configs, or touch secrets files are a different risk level. Implement a path allowlist and require developer confirmation for writes outside the current working directory.
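A path allowlist of the kind described can be sketched with `pathlib`. The function name, the allowed directories and the blocked file names below are illustrative assumptions. The key detail is resolving the path before checking it, so that `..` segments cannot escape the repo.

```python
from pathlib import Path


def is_write_allowed(repo_root, target, allowed_dirs, blocked_names=(".env",)):
    """Allow writes only inside allowlisted directories of the repo."""
    root = Path(repo_root).resolve()
    path = (root / target).resolve()  # collapses ".." before any check

    if not path.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        return False
    if path.name in blocked_names:
        return False
    return any(path.is_relative_to((root / d).resolve()) for d in allowed_dirs)


allowed = ["src", "tests"]
ok = is_write_allowed("/repo", "src/app.py", allowed)
ci_edit = is_write_allowed("/repo", ".github/workflows/ci.yml", allowed)
escape = is_write_allowed("/repo", "src/../../etc/passwd", allowed)
secret = is_write_allowed("/repo", "src/.env", allowed)
```

A rejected write does not have to fail outright; it can be routed to the developer for explicit confirmation instead.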

Closing Thoughts

The individual components here — Tree-sitter, vLLM, LangGraph, Docker — are each battle-tested in production at scale. The architecture challenge is the orchestration: assembling the right context, routing to the right model at the right latency budget, and designing a UX that keeps the developer in control of what the agent actually touches.

The moat of any great AI coding tool is not its architecture. It's the years of UX iteration on top of this architecture. The open-source community now has every primitive it needs to build something just as capable.

The next post walks through implementing the FIM autocomplete engine end-to-end: Tree-sitter chunking, nomic embeddings, and a StarCoder2-3B inference server — all running on a single laptop.

Top comments (4)

Alex Shev • Jun 23

The architecture question is the interesting part. An AI IDE is not just chat plus autocomplete; it needs repo understanding, edit boundaries, tool execution, review surfaces, and a way to explain why a change was made.

Sourish Chakraborty • Jun 23

Right — and that full picture is what the Dhi series is building toward, one layer at a time.

Where things stand right now:

Repo understanding — shipped in post-2. Tree-sitter across six languages, BM25 + vector hybrid search with RRF merging, full workspace indexing via /index-dir. The FIM prefix now pulls ranked chunks from across the entire repo, not just the open file. The architecture write-up covers the six-layer design if you want the full picture of where this is going. Curious what you've found most important in practice — review surfaces or the explain-why layer?

Alex Shev • Jun 23

In practice I would start with review surfaces, then explain-why. If the review surface is weak, the explanation becomes hard to trust. Once the diff, tests, affected files, and retrieved context are visible, the reason trail can actually be checked.
