Sourish Chakraborty

Posted on • Originally published at blogs.sourishchakraborty.com

Building an AI Coding IDE from Scratch: A Full Open-Source Architecture

Note: This post contains architecture diagrams. For the fully rendered version visit the original post.

AI coding assistants have fundamentally changed how developers write software. The best ones do one thing well: collapse the distance between the developer's intent and the code that expresses it. But under the hood, they are largely orchestration layers — the individual primitives they use (language servers, vector search, tool-calling LLMs, sandboxed execution) are all open-source or replicable.

This post lays out a complete open-source architecture that covers every capability a modern AI coding IDE ships: autocomplete, chat-in-editor, multi-file agent editing, repo understanding, system design reasoning, and a safe code execution loop. No proprietary APIs required.

The project is called Dhi (धी) — from the Gayatri Mantra. It means pure intellect. The code lives at github.com/sochaty/dhi.


The Full Picture

Before going layer by layer, here is how the system fits together:

IDE Frontend
    ↓  JSON-RPC / WebSocket
Orchestration Core  (LangGraph · Tool Router · Context Assembler)
    ↓           ↓            ↓                ↓
FIM Engine  Chat Engine  Agent Engine  Execution Sandbox
    ↓           ↓            ↓                ↓
        Model Layer  (Ollama · vLLM · StarCoder2 · DeepSeek)
                         ↓
        Repo Intelligence Layer  (Tree-sitter · Chroma · LSP)

Each vertical slice is independently deployable. You can run only the autocomplete engine on a laptop, or scale the agent engine on a GPU cluster. The layers communicate over a local JSON-RPC bus (JSON-RPC is also the wire protocol underneath LSP, which is how VS Code talks to language servers).
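To make the bus concrete, here is a minimal sketch of how one request could be framed. It assumes JSON-RPC 2.0 with the `Content-Length` header framing that LSP uses over stdio; the method name `fim/complete` and its parameters are hypothetical, not taken from the Dhi codebase.

```python
import json


def encode_message(payload: dict) -> bytes:
    """Frame a JSON-RPC payload with an LSP-style Content-Length header."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_message(frame: bytes) -> dict:
    """Parse one complete frame back into its JSON payload."""
    header, _, rest = frame.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


def make_request(request_id: int, method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


# "fim/complete" is a made-up method name used only for illustration.
request = make_request(1, "fim/complete", {"path": "src/app.py", "line": 10, "character": 4})
frame = encode_message(request)
```

Over a WebSocket the header is unnecessary, because each WebSocket message already has a length; the JSON body stays the same.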


Layer 1 — Repo Understanding

Every other capability is only as good as the system's understanding of the codebase. This is where most open-source IDE projects are weakest. Getting it right means going beyond naive file-splitting.

Parsing with Tree-sitter

Tree-sitter produces a concrete syntax tree for 40+ languages, usually within a few milliseconds for a typical source file. Rather than splitting on character count, we split on semantic boundaries: functions, classes, method bodies. This keeps each chunk self-contained and reduces context fragmentation at retrieval time.

Source Files → Tree-sitter (per language) → Semantic Chunks
    → Metadata Overlay (file path · line range · symbol name)
    → nomic-embed-text-v1.5 (768-dim, runs locally)
    → Chroma (dev) / Qdrant (prod)
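The idea of splitting on semantic boundaries can be illustrated without any dependencies. The sketch below uses Python's standard-library `ast` module as a stand-in for Tree-sitter, so it only handles Python source; the `Chunk` fields mirror the metadata overlay above (file path, line range, symbol name) but the class itself is an assumption.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    symbol: str
    start_line: int
    end_line: int
    text: str


def chunk_python_source(path: str, source: str) -> list[Chunk]:
    """Emit one chunk per top-level function or class instead of fixed-size windows."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(path, node.name, node.lineno, node.end_lineno, text))
    return chunks


SAMPLE = '''import os

def load(path):
    with open(path) as fh:
        return fh.read()

class Cache:
    def get(self, key):
        return None
'''

chunks = chunk_python_source("demo.py", SAMPLE)
for chunk in chunks:
    print(chunk.symbol, chunk.start_line, chunk.end_line)
```

A Tree-sitter version would walk the same kind of nodes per language grammar; the chunk text plus its metadata is what gets embedded and stored.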

The Call Graph Layer

Pure vector search finds semantically similar code but misses structural relationships. A symbol reference graph (built from LSP textDocument/references calls) lets you answer questions like: "find every function that touches the auth middleware" with a graph traversal rather than a fuzzy search.

Store this as an adjacency list in SQLite: it is lightweight, needs no extra infrastructure, and is cheap to update whenever a file is re-indexed.
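A minimal sketch of that adjacency list, assuming a single `refs(caller, callee)` table (the schema and symbol names are illustrative). A recursive CTE answers the "who touches the auth middleware" question, including indirect callers.

```python
import sqlite3


def build_graph(edges):
    """Create an in-memory reference graph from (caller, callee) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refs (caller TEXT NOT NULL, callee TEXT NOT NULL, "
        "PRIMARY KEY (caller, callee))"
    )
    conn.executemany("INSERT OR IGNORE INTO refs VALUES (?, ?)", edges)
    return conn


def transitive_callers(conn, symbol):
    """Return every symbol that reaches `symbol` directly or indirectly."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(name) AS (
            SELECT caller FROM refs WHERE callee = ?
            UNION  -- UNION (not UNION ALL) deduplicates, so cycles terminate
            SELECT refs.caller FROM refs JOIN reach ON refs.callee = reach.name
        )
        SELECT name FROM reach ORDER BY name
        """,
        (symbol,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical edges; in the real system they would come from LSP reference lookups.
EDGES = [
    ("login_view", "auth_middleware"),
    ("api_router", "login_view"),
    ("health_check", "db_ping"),
    ("admin_view", "auth_middleware"),
]
conn = build_graph(EDGES)
callers = transitive_callers(conn, "auth_middleware")
print(callers)
```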


Layer 2 — Autocomplete (Fill-in-the-Middle)

Autocomplete in a modern AI IDE is not plain left-to-right next-token prediction. It is fill-in-the-middle (FIM): the model sees the prefix (everything before the cursor) and the suffix (everything after), and generates the completion that bridges them.

<fim_prefix>
  repo context: top-3 retrieved chunks (~1500 tokens)
  current file: lines 0 → cursor (~800 tokens)

<fim_suffix>
  current file: cursor → EOF (~400 tokens)

<fim_middle>   ← model generates here
  target: < 150ms P50  |  < 400ms P95
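The layout above can be assembled with a few lines of string handling. This is a sketch under two assumptions: the sentinel strings are the StarCoder-style ones shown in the diagram (other model families use different sentinels), and token counts are approximated as four characters per token rather than measured with the model's tokenizer.

```python
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"
CHARS_PER_TOKEN = 4  # rough heuristic; a real system would count with the tokenizer


def keep_tail(text: str, max_tokens: int) -> str:
    """Keep the end of the text, i.e. the part closest to the cursor."""
    limit = max_tokens * CHARS_PER_TOKEN
    return text[-limit:] if limit > 0 else ""


def keep_head(text: str, max_tokens: int) -> str:
    """Keep the start of the text."""
    return text[: max(max_tokens, 0) * CHARS_PER_TOKEN]


def build_fim_prompt(source, cursor, repo_chunks,
                     context_tokens=1500, prefix_tokens=800, suffix_tokens=400):
    """Assemble retrieved context + prefix, then suffix, then the generation marker."""
    context = keep_head("\n".join(repo_chunks[:3]), context_tokens)
    prefix = keep_tail(source[:cursor], prefix_tokens)
    suffix = keep_head(source[cursor:], suffix_tokens)
    context_block = context + "\n" if context else ""
    return f"{FIM_PREFIX}{context_block}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")
prompt = build_fim_prompt(source, cursor, ["# utils.py\ndef sub(a, b): return a - b"])
print(prompt)
```

The prefix is trimmed from the front and the suffix from the back, so whatever survives truncation is the code nearest the cursor.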

Model choices

| Model | Parameters | FIM Support | Runs on |
|---|---|---|---|
| StarCoder2-3B | 3B | ✅ native | Apple M2 / 8GB GPU |
| DeepSeek-Coder-V2-Lite | 16B | ✅ native | 24GB GPU |
| Qwen2.5-Coder-7B | 7B | ✅ native | 16GB GPU |
| CodeLlama-13B | 13B | ✅ native | 24GB GPU |

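Serve them with Ollama for local dev or vLLM in production. vLLM's PagedAttention allocates the KV cache in small blocks, which sharply reduces wasted GPU memory, and continuous batching lets new requests join a running batch instead of waiting for it to finish.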

Speculative Decoding

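Pair a small draft model with a large verifier. The two must share a tokenizer, so in practice they come from the same model family (for example, Qwen2.5-Coder-0.5B drafting for Qwen2.5-Coder-7B). The draft generates K tokens; the verifier checks all K in a single forward pass and keeps the longest prefix it agrees with. The gain depends on how often the draft guesses right: speedups of roughly 2–3× over the large model alone are commonly reported for typical completion lengths.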
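The draft-then-verify round can be sketched with toy stand-ins for the two models. This is the greedy variant only, and `draft_next` / `target_next` are hypothetical callables that return the next token for a given context. A real engine scores all K drafted positions in one batched forward pass; the loop below calls the verifier once per position purely for clarity.

```python
from typing import Callable

NextToken = Callable[[list[str]], str]


def speculative_step(context: list[str], draft_next: NextToken,
                     target_next: NextToken, k: int) -> list[str]:
    """One round: draft k tokens, keep the agreed prefix, fix the first mismatch."""
    proposed: list[str] = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    accepted: list[str] = []
    for token in proposed:
        expected = target_next(context + accepted)
        if expected != token:
            accepted.append(expected)  # verifier's token replaces the bad guess
            return accepted
        accepted.append(token)
    accepted.append(target_next(context + accepted))  # bonus token when all k match
    return accepted


def generate(context, draft_next, target_next, k, max_new_tokens):
    """Repeat rounds until enough tokens exist; returns (tokens, verifier rounds)."""
    out = list(context)
    rounds = 0
    while len(out) - len(context) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[len(context): len(context) + max_new_tokens], rounds


TARGET_TEXT = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]


def target_next(ctx):
    return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else "<eos>"


def draft_next(ctx):
    # The toy draft is wrong at exactly one position.
    return "pass" if len(ctx) == 8 else target_next(ctx)


tokens, rounds = generate([], draft_next, target_next, k=4, max_new_tokens=12)
```

The output is identical to what the verifier would have produced on its own; the saving is that 12 tokens needed 3 verifier rounds instead of 12 sequential steps in this toy case.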


Layer 3 — Chat-in-Editor

Chat works differently from autocomplete. The latency bar is 2–5 seconds (acceptable for a conversational exchange), but the context window must be carefully assembled to fit within model limits while including what's most relevant.

Developer Message
    ↓
Context Assembler
  Slot 1: system prompt          ~500 tokens
  Slot 2: active file + selection  ~2 000 tokens
  Slot 3: LSP diagnostics          ~500 tokens
  Slot 4: retrieved RAG chunks     ~4 000 tokens
  Slot 5: conversation history     ~2 000 tokens
  Slot 6: user message             remaining budget
    ↓
LLM  (streaming SSE)
    ↓
Response Parser
  prose       → Chat Panel
  code blocks → Diff Preview in editor
  tool_call   → Tool Engine
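One way to read the slot table is as a list of budgets applied in order, with the user message taking whatever is left. The sketch below assumes a whitespace word count as a crude token estimate and truncates each slot on line boundaries; `Slot` and `assemble` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    text: str
    budget: int  # maximum tokens this slot may contribute


def count_tokens(text: str) -> int:
    """Crude estimate: whitespace-separated words. Swap in a real tokenizer."""
    return len(text.split())


def truncate(text: str, budget: int) -> str:
    """Keep whole lines from the top until the budget would be exceeded."""
    kept, used = [], 0
    for line in text.splitlines():
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


def assemble(slots: list[Slot], user_message: str, context_window: int):
    """Fill fixed slots first, then give the user message the remaining budget."""
    parts, used = [], 0
    for slot in slots:
        piece = truncate(slot.text, slot.budget)
        used += count_tokens(piece)
        parts.append((slot.name, piece))
    remaining = context_window - used
    if remaining <= 0:
        raise ValueError("fixed slots already exceed the context window")
    parts.append(("user", truncate(user_message, remaining)))
    return parts


slots = [
    Slot("system", "You are a coding assistant.", 500),
    Slot("active_file", "def f():\n    return 1\n" * 3, 8),
]
prompt = assemble(slots, "Why does f return 1?", context_window=40)
```

For conversation history a real assembler would probably drop the oldest turns first rather than the newest, which this top-down truncation does not do.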

The critical UX insight: stream tokens to the chat panel in real time, but buffer code blocks and only apply them to the editor after the complete block arrives. Partial code blocks applied live cause flickering and make the diff unreadable.
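A small sketch of that buffering rule, assuming the model marks code with Markdown-style triple-backtick fences. `StreamSplitter` is a hypothetical helper; for simplicity it emits prose per completed line rather than per token, and it records events in a list where a real UI would push them to the chat panel or the diff preview.

```python
FENCE = "`" * 3  # built at runtime so this example does not contain a literal fence


class StreamSplitter:
    """Pass prose through as it arrives; hold code until its closing fence."""

    def __init__(self):
        self._pending = ""   # text after the last newline, not yet a full line
        self._code = None    # list of lines while inside a fenced block
        self.events = []     # ("prose", line) or ("code", whole_block)

    def feed(self, chunk: str) -> None:
        self._pending += chunk
        while "\n" in self._pending:
            line, self._pending = self._pending.split("\n", 1)
            self._handle_line(line)

    def close(self) -> None:
        """Flush a trailing partial line; an unterminated code block is dropped."""
        if self._pending:
            self._handle_line(self._pending)
            self._pending = ""

    def _handle_line(self, line: str) -> None:
        if line.strip().startswith(FENCE):
            if self._code is None:
                self._code = []  # opening fence: start buffering
            else:
                self.events.append(("code", "\n".join(self._code)))
                self._code = None  # closing fence: release the whole block
        elif self._code is not None:
            self._code.append(line)
        else:
            self.events.append(("prose", line))


splitter = StreamSplitter()
# The opening fence is deliberately split across two network chunks.
for chunk in ["Here is the fix:\n" + FENCE[:2],
              FENCE[2:] + "python\nx = ",
              "1\n" + FENCE + "\nDone.\n"]:
    splitter.feed(chunk)
splitter.close()
```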

Any instruction-tuned model with a large context window works here, for example Qwen2.5-Coder-32B-Instruct, DeepSeek-V3, or Llama-3.3-70B-Instruct served via Ollama / vLLM.


Layer 4 — Multi-File Agent Editing

This is the hardest layer to get right. The agent must plan, act across multiple files, observe outcomes (compiler errors, test failures), and revise — all without losing context of the original goal.

The Plan-Act-Observe Loop

Developer Request
    ↓
Planner  (reasoning model · ordered task list)
    ↓
┌─────────────────────────────────┐
│  Think (LLM) ←──── Observe     │
│       ↓                ↑       │
│  Act (Tools) ──────────┘       │
└─────────────────────────────────┘
    ↓  done
Diff Preview  (Accept / Reject per file)

Tool Set

| Tool | What it does |
|---|---|
| read_file(path) | Returns file contents |
| write_file(path, content) | Applies diff |
| search_codebase(query) | Vector + keyword hybrid search |
| run_command(cmd) | Sandboxed shell (Docker) |
| list_directory(path) | File tree |
| get_diagnostics() | LSP errors / warnings |
| get_references(symbol) | Call graph lookup |
| create_file(path, content) | Creates new file |
| delete_file(path) | Deletes with undo stack |

Orchestration: LangGraph

LangGraph models the agent loop as a directed graph of nodes (think, act, observe, plan, verify). Edges are conditional — the observe node routes back to think on errors, or forwards to verify on success.

The key advantage over a simple while loop: checkpointing. LangGraph can pause the loop mid-execution, serialize state to disk, and resume — critical for long refactors that might span dozens of file edits.
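To show what checkpointing buys you, here is a plain-Python stand-in for the loop. It deliberately does not use LangGraph's API: the state is a dict saved as JSON after every step, and `flaky_act` is a hypothetical tool call that fails once. The run is cut off after two steps and then resumed from the file on disk.

```python
import json
import os
import tempfile


def new_state(tasks):
    return {"pending": list(tasks), "history": [], "step": 0}


def run_loop(state, act, checkpoint_path, max_steps=10):
    """Act on the first pending task, record the observation, save, repeat."""
    while state["pending"] and state["step"] < max_steps:
        task = state["pending"][0]
        observation = act(task)
        state["history"].append({"task": task, "observation": observation})
        if observation == "ok":
            state["pending"].pop(0)  # success: move on to the next task
        # on failure the task stays pending, so the next step retries it
        state["step"] += 1
        with open(checkpoint_path, "w", encoding="utf-8") as fh:
            json.dump(state, fh)  # checkpoint after every step
    return state


attempts = {}


def flaky_act(task):
    """Hypothetical tool execution: 'edit b.py' fails on its first attempt."""
    attempts[task] = attempts.get(task, 0) + 1
    if task == "edit b.py" and attempts[task] == 1:
        return "error: test failed"
    return "ok"


checkpoint = os.path.join(tempfile.mkdtemp(), "agent_state.json")

# First session stops after two steps (simulating a pause or crash).
run_loop(new_state(["edit a.py", "edit b.py", "edit c.py"]), flaky_act, checkpoint, max_steps=2)

# Second session picks up exactly where the first one stopped.
with open(checkpoint, encoding="utf-8") as fh:
    resumed = json.load(fh)
step_at_resume = resumed["step"]
final = run_loop(resumed, flaky_act, checkpoint, max_steps=10)
```

LangGraph's checkpointers play the role of the JSON file here, with the added benefit that the graph knows which node to re-enter.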


Layer 5 — System Design and Reasoning

Architecture-level questions ("should I use event sourcing here?", "draw the service dependency graph") require a different mode: long-horizon reasoning over the entire codebase context, not just a few files.

Developer Question
    ↓
Repo Summary Builder
  (1-paragraph LLM summary per directory → project map)
    ↓  ~8k token project map
Reasoning Model  (DeepSeek-R1 / QwQ-32B)
    ↓
Diagram Output  (Mermaid / PlantUML, rendered in IDE panel)

The repo summary is the critical artifact. Build it once on first index, then update incrementally using git diff — only re-summarise modules that changed in the last commit.
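The incremental update could look roughly like this. It assumes summaries are keyed by directory path and that `git diff --name-only` between two commits is the change signal; `summarize` is a placeholder for the LLM call, and only the immediate directory of each changed file is refreshed.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(repo_dir, base="HEAD~1", head="HEAD"):
    """List files that differ between two commits (requires git on PATH)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, head],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def stale_directories(paths):
    """Directories whose summary is out of date; root-level files map to '.'."""
    return sorted({str(PurePosixPath(p).parent) for p in paths})


def refresh_summaries(summaries, paths, summarize):
    """Return a copy of the project map with only the stale entries regenerated."""
    updated = dict(summaries)
    for directory in stale_directories(paths):
        updated[directory] = summarize(directory)
    return updated


# Example with a hard-coded change list instead of a live git call.
paths = ["src/auth/middleware.py", "src/auth/tokens.py", "README.md"]
summaries = {"src/auth": "old", "src/db": "unchanged", ".": "old"}
updated = refresh_summaries(summaries, paths, lambda d: f"new summary of {d}")
```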


Layer 6 — Safe Code Execution Loop

Agents that can write code must be able to run it. But running arbitrary LLM-generated code on the host machine is a hard no. The execution layer must be:

  • Isolated: no access to host filesystem, network, or env vars outside the project
  • Ephemeral: container torn down after each run
  • Auditable: all stdin/stdout/stderr captured and shown to the developer

Agent: run_command("pytest tests/")
    ↓
Docker Container  (ephemeral)
  project files: read-only bind mount
  /tmp:          writable scratch only
  network:       none
  memory:        512 MB
  timeout:       30 s
  seccomp:       restricted syscalls
    ↓
exit 0  → tests passed  → agent proceeds
exit 1  → failures      → agent sees stderr, re-plans
timeout → hard kill     → agent informed, retries or stops
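The container settings above map fairly directly onto `docker run` flags. The sketch below builds the argument list and wraps it in a timeout; the image name `dhi-sandbox:latest` is hypothetical (it would need the project's test dependencies installed), and the read-only root filesystem is an added assumption so that /tmp is the only writable path.

```python
import subprocess


def docker_argv(project_dir, command, image="dhi-sandbox:latest", seccomp_profile=None):
    """Build a docker run command line for one ephemeral, locked-down execution."""
    argv = [
        "docker", "run", "--rm",                      # ephemeral: removed on exit
        "--network", "none",                          # no network access
        "--memory", "512m",                           # memory cap
        "--read-only",                                # assumption: read-only root fs
        "--tmpfs", "/tmp",                            # writable scratch space
        "--volume", f"{project_dir}:/workspace:ro",   # project files, read-only
        "--workdir", "/workspace",
    ]
    if seccomp_profile:
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    return argv + [image] + list(command)


def run_sandboxed(project_dir, command, timeout_s=30):
    """Run the command and classify the outcome for the agent loop."""
    try:
        proc = subprocess.run(
            docker_argv(project_dir, command),
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # This stops waiting on the docker CLI; a real implementation should
        # also kill the container itself (for example by name).
        return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "passed" if proc.returncode == 0 else "failed"
    return {"status": status, "exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}


argv = docker_argv("/home/dev/project", ["pytest", "tests/"])
print(" ".join(argv))
```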

The Self-Healing Loop

Write code → Run tests
                ↓ pass      → Offer diff to developer
                ↓ fail      → Observe error → re-plan → Write code

When tests fail, the output becomes the next observation in the agent loop. The agent sees the exact error, reasons about the fix, edits the file, and re-runs. For straightforward bugs this often converges within 2–3 iterations.

For an even tighter sandbox, use gVisor (Google's container runtime that intercepts syscalls in user space) or Firecracker (AWS's micro-VM used in Lambda) instead of vanilla Docker.


The Full Open-Source Stack

| Capability | Component | Notes |
|---|---|---|
| Editor | Monaco Editor | MIT, same engine as VS Code |
| Syntax parsing | Tree-sitter | MIT, 40+ languages |
| Code intelligence | LSP servers (clangd, pylsp, ts-ls) | Per-language |
| Embeddings | nomic-embed-text-v1.5 | Apache 2.0, 768-dim, runs locally |
| Vector store | Chroma (dev) / Qdrant (prod) | Both open-source |
| FIM autocomplete | StarCoder2-3B / Qwen2.5-Coder-7B | BigCode OpenRAIL-M / Apache 2.0 |
| Chat model | Qwen2.5-Coder-32B-Instruct | Apache 2.0 |
| Reasoning model | QwQ-32B / DeepSeek-R1-Distill-Qwen-32B | Apache 2.0 / MIT |
| Model serving | Ollama (local) / vLLM (production) | MIT / Apache 2.0 |
| Agent orchestration | LangGraph | MIT |
| Execution sandbox | Docker + seccomp / gVisor | Apache 2.0 |
| Backend API | FastAPI | MIT |
| Frontend | Next.js + Tailwind | MIT |

What You Don't Get For Free

An honest architecture post should name the hard parts:

Latency at low VRAM. A 32B model fits on a single 24GB GPU only when quantised to around 4 bits (for example GGUF Q4), and even then chat runs at roughly 15–20 tokens/second. That is acceptable for most workflows, but noticeably slower than cloud-hosted alternatives. The remaining levers are speculative decoding or offloading to a small cloud GPU when needed.

Prompt cache invalidation. Managed AI coding services almost certainly implement prompt caching across requests. When self-hosting, vLLM's automatic prefix caching provides the mechanism, but it only pays off if the context assembler keeps the start of each prompt identical from one request to the next. That is possible, but non-trivial.

Index freshness. Keeping the vector store in sync with active edits (the buffer changes on every keystroke) requires debounced incremental re-indexing. It is easy to get wrong and end up with stale retrieval.
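Debouncing here just means waiting until a file has been quiet for a moment before re-indexing it. A minimal sketch, with the clock passed in explicitly so the behaviour is easy to test; `Debouncer` is an illustrative class, and the 0.5 s delay is an arbitrary choice.

```python
class Debouncer:
    """Collect edited paths and release each one after `delay` quiet seconds."""

    def __init__(self, delay: float):
        self.delay = delay
        self._last_edit = {}  # path -> timestamp of the most recent edit

    def touch(self, path: str, now: float) -> None:
        """Record an edit; repeated edits keep pushing the deadline back."""
        self._last_edit[path] = now

    def due(self, now: float) -> list[str]:
        """Return (and forget) paths that have been quiet long enough."""
        ready = sorted(p for p, t in self._last_edit.items() if now - t >= self.delay)
        for path in ready:
            del self._last_edit[path]
        return ready


debouncer = Debouncer(delay=0.5)
debouncer.touch("a.py", now=0.0)
debouncer.touch("b.py", now=0.1)
debouncer.touch("a.py", now=0.3)   # still typing in a.py
first = debouncer.due(now=0.7)     # only b.py has been quiet for 0.5 s
second = debouncer.due(now=0.9)    # now a.py qualifies too
```

In the editor, `touch` would be called from the change event and `due` from a timer, with each returned path handed to the chunk-and-embed step.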

Security surface. The Docker sandbox is a reasonable boundary for test runners. But agents that can write_file anywhere in the repo, modify CI configs, or touch secrets files are a different risk level. Implement a path allowlist and require developer confirmation for writes outside the current working directory.
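A path allowlist can be a short guard in front of write_file. The sketch below is one possible shape, not a hardened implementation: it resolves the target first so that `..` segments cannot escape, then checks it against allowed directories and a small blocklist of file names. The function name, the allowed directories, and the blocked names are all assumptions.

```python
from pathlib import Path


def is_write_allowed(repo_root, target, allowed_dirs, blocked_names=(".env",)):
    """True only if the resolved target sits inside an allowed directory."""
    root = Path(repo_root).resolve()
    resolved = (root / target).resolve()  # collapses ".." and follows symlinks
    if resolved.name in blocked_names:
        return False
    for directory in allowed_dirs:
        allowed = (root / directory).resolve()
        if resolved == allowed or allowed in resolved.parents:
            return True
    return False


ALLOWED = ["src", "tests"]
print(is_write_allowed("/repo", "src/app.py", ALLOWED))
print(is_write_allowed("/repo", "src/../.github/workflows/ci.yml", ALLOWED))
```

Anything the guard rejects would be routed to a developer confirmation prompt rather than silently refused, so the agent can still propose the change.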


Closing Thoughts

The individual components here — Tree-sitter, vLLM, LangGraph, Docker — are each battle-tested in production at scale. The architecture challenge is the orchestration: assembling the right context, routing to the right model at the right latency budget, and designing a UX that keeps the developer in control of what the agent actually touches.

The moat of any great AI coding tool is not its architecture. It's the years of UX iteration on top of this architecture. The open-source community now has every primitive it needs to build something just as capable.

The next post walks through implementing the FIM autocomplete engine end-to-end: Tree-sitter chunking, nomic embeddings, and a StarCoder2-3B inference server — all running on a single laptop.
