Sourish Chakraborty

Posted on • Originally published at blogs.sourishchakraborty.com

FIM Autocomplete at < 150ms: Tree-sitter, nomic Embeddings, and StarCoder2-3B on a Laptop

Note: This post contains architecture diagrams. For the fully rendered version, visit the original post.

In the previous post, I laid out the full six-layer architecture of an open-source AI coding IDE. Today we start building it.

This post ships the first real component of Dhi (धी) — an open-source AI coding IDE built entirely on open-source models. Dhi means pure intellect in Sanskrit, from the Gayatri Mantra. The name fits: the goal is an IDE that gives you genuine intelligence over your codebase — with no API keys, no token pricing, and no closed-source inference backend.

By the end of this post you will have:

  • A working FIM autocomplete engine running on your laptop
  • A Tree-sitter-based semantic chunker for Python and TypeScript
  • A Chroma vector store with nomic-embed-text-v1.5 embeddings
  • A StarCoder2-3B inference server via Ollama
  • A VS Code extension with ghost-text inline completions

Everything runs with docker compose up. The full code is at github.com/sochaty/dhi — tag post-1.


What FIM Actually Is

Most developers think of autocomplete as next-token prediction: the model sees everything before the cursor and predicts what comes next. That is how GPT-2 works. It is not how a modern AI coding assistant works.

Real autocomplete in 2026 is fill-in-the-middle (FIM): the model sees both the prefix (everything before the cursor) and the suffix (everything after), then generates the completion that bridges them. This is far more accurate because the model knows what the code is supposed to arrive at, not just what it started from.

The three special tokens that make FIM work:

Token        | Contains                     | Direction
<fim_prefix> | Everything before the cursor | → feeds into model
<fim_suffix> | Everything after the cursor  | → feeds into model
<fim_middle> | The completion               | ← model generates this

StarCoder2, DeepSeek-Coder, and Qwen2.5-Coder all support FIM natively. This is non-negotiable for a production autocomplete engine — a model without native FIM support gives noticeably worse inline completions.
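To make the token layout concrete, here is a minimal sketch that splits a buffer at the cursor and wraps the two halves in the StarCoder2 tokens from the table. The function name and the sample buffer are illustrative assumptions, not code from the Dhi repo.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def fim_prompt(text: str, cursor: int) -> str:
    """Wrap the text before and after `cursor` in StarCoder2-style FIM tokens."""
    prefix, suffix = text[:cursor], text[cursor:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# The cursor sits right after "return ", in the middle of the file.
source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")
prompt = fim_prompt(source, cursor)
```

The model is expected to continue after `<fim_middle>` with the text that belongs at the cursor, here something like `a + b`.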

The full request flow for Dhi's autocomplete engine:

VS Code Editor  (cursor position event)
    ↓  150ms debounce
Context Assembler
    ↓                           ↓
fim_prefix                  fim_suffix
  retrieved chunks (top-3)    file below cursor
  file above cursor
    ↓                           ↓
        Ollama  (StarCoder2-3B)
                   ↓
        Ghost text in editor

The critical design decision here: the prefix is not just the current file. It includes retrieved chunks from the rest of the repository. Without that context, the model cannot complete a function call using a helper defined in another file.


Bootstrapping Dhi

Clone the repo and start the CPU stack:

git clone https://github.com/sochaty/dhi
cd dhi
git checkout post-1
docker compose up

The docker-compose.yml at this tag runs three services:

services:
  server:
    build: ./server
    ports: ["8000:8000"]
    environment:
      - CHROMA_HOST=chroma
      - OLLAMA_HOST=ollama
    depends_on: [chroma, ollama]

  chroma:
    image: chromadb/chroma:0.5.0
    volumes: ["chroma_data:/chroma/chroma"]

  ollama:
    image: ollama/ollama:latest
    volumes: ["ollama_models:/root/.ollama"]
    entrypoint: ["/bin/sh", "-c", "ollama serve & sleep 5 && ollama pull starcoder2:3b && ollama pull nomic-embed-text && wait"]

On first start, Ollama pulls StarCoder2-3B (~1.7GB) and the nomic-embed-text embedding model that store.py calls. Subsequent starts reuse the ollama_models volume and skip the download. The full repo directory structure at post-1:

dhi/
├── docker-compose.yml
├── extension/
│   └── src/completion/provider.ts
└── server/
    ├── main.py
    ├── inference/fim.py
    └── rag/
        ├── chunker.py
        └── store.py

Layer 1: Tree-sitter Semantic Chunking

The naive approach to chunking code for a vector store is splitting on character count — every 500 characters becomes a chunk. This is wrong for two reasons: it cuts function bodies in the middle (destroying semantic meaning) and it groups unrelated code together (polluting retrieval).

Dhi uses Tree-sitter to split on semantic boundaries. A function definition becomes one chunk. A class body becomes one chunk. Imports are chunked as well (the chunker below emits one chunk per import statement rather than one per block). Each chunk is self-contained and contextually coherent.

Source File (.py / .ts)
    ↓
Tree-sitter Parser  →  Syntax Tree
                            ↓
                import_block      → 1 chunk
                function_definition → 1 chunk
                class_definition  → 1 chunk
                            ↓
                  + metadata (file path · line range · language)
                            ↓
                  nomic-embed-text-v1.5  (768-dim)
                            ↓
                        Chroma

Here is server/rag/chunker.py — the full implementation for Python and TypeScript:

from dataclasses import dataclass
from pathlib import Path
from typing import Generator

import tree_sitter_python as tspython
import tree_sitter_typescript as tstypescript
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
TS_LANGUAGE = Language(tstypescript.language_typescript())

# Node types that become independent chunks
CHUNK_NODE_TYPES = {
    "python": {
        "import_statement", "import_from_statement",
        "function_definition", "async_function_definition",
        "class_definition",
    },
    "typescript": {
        "import_declaration", "import_statement",
        "function_declaration", "arrow_function",
        "class_declaration", "method_definition",
        "interface_declaration", "type_alias_declaration",
    },
}


@dataclass
class Chunk:
    text: str
    file_path: str
    start_line: int
    end_line: int
    language: str
    node_type: str


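def _get_parser(language: str) -> Parser:
    # Pass the grammar to the constructor: this works with the single-argument
    # Language(...) API used above, while newer py-tree-sitter releases no
    # longer provide Parser.set_language().
    return Parser(PY_LANGUAGE if language == "python" else TS_LANGUAGE)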


def _detect_language(path: str) -> str | None:
    suffix = Path(path).suffix
    if suffix == ".py":
        return "python"
    if suffix in (".ts", ".tsx"):
        return "typescript"
    return None


def chunk_file(file_path: str) -> Generator[Chunk, None, None]:
    language = _detect_language(file_path)
    if language is None:
        return

    source = Path(file_path).read_bytes()
    parser = _get_parser(language)
    tree = parser.parse(source)
    target_types = CHUNK_NODE_TYPES[language]

    def walk(node):
        if node.type in target_types:
            text = source[node.start_byte:node.end_byte].decode("utf-8", errors="replace")
            yield Chunk(
                text=text,
                file_path=file_path,
                start_line=node.start_point[0] + 1,
                end_line=node.end_point[0] + 1,
                language=language,
                node_type=node.type,
            )
        else:
            for child in node.children:
                yield from walk(child)

    yield from walk(tree.root_node)

Two things worth noting:

Greedy top-level chunking. The walk function yields a node and stops recursing into it. A class body becomes one chunk — it does not also yield its individual methods as separate chunks. This keeps retrieval units large enough to be coherent.

Language-specific grammar packages. tree_sitter_python and tree_sitter_typescript are PyPI packages that ship pre-compiled grammars. No C compilation step required, which matters for Docker images.
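The greedy behaviour is easy to verify without installing any grammar. The sketch below reproduces the same walk with Python's built-in ast module as a stand-in for Tree-sitter (a deliberate substitution so the example runs on the standard library alone). The names greedy_chunks and SAMPLE are made up for this illustration.

```python
import ast

# Stand-ins for the Tree-sitter node types listed in CHUNK_NODE_TYPES["python"].
CHUNK_TYPES = (ast.Import, ast.ImportFrom, ast.FunctionDef,
               ast.AsyncFunctionDef, ast.ClassDef)


def greedy_chunks(source: str):
    """Yield (node_type, start_line, end_line) for outermost chunk nodes only."""
    def walk(node):
        if isinstance(node, CHUNK_TYPES):
            # Yield the node and do NOT descend: nested methods stay inside it.
            yield (type(node).__name__, node.lineno, node.end_lineno)
        else:
            for child in ast.iter_child_nodes(node):
                yield from walk(child)

    yield from walk(ast.parse(source))


SAMPLE = '''import os

class Greeter:
    def hello(self):
        return "hi"

def main():
    return Greeter().hello()
'''

chunks = list(greedy_chunks(SAMPLE))
# The class is one chunk; its hello() method is not emitted separately.
```

The same property holds for the Tree-sitter version: once a node matches, its subtree is never visited.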


Layer 2: Embedding and Vector Storage

Each chunk is embedded with nomic-embed-text-v1.5, a 768-dimension model that runs locally via Ollama. Nomic's published results put it ahead of OpenAI's text-embedding-ada-002 on general-purpose retrieval benchmarks, and it costs nothing per query.
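One detail worth checking with this model: the nomic-embed-text-v1.5 model card asks for a task prefix on every input, "search_document: " for text being indexed and "search_query: " for the query text. A minimal sketch of helpers that add them, assuming the Ollama endpoint does not add the prefixes itself (the helper names are hypothetical and not part of the Dhi repo):

```python
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def as_documents(texts: list[str]) -> list[str]:
    """Prefix chunk texts before they are embedded for storage."""
    return [DOCUMENT_PREFIX + text for text in texts]


def as_query(text: str) -> str:
    """Prefix the search text before it is embedded for retrieval."""
    return QUERY_PREFIX + text


indexed = as_documents(["def add(a, b):\n    return a + b"])
query = as_query("return add(")
```

If the prefixes are adopted, the original (unprefixed) chunk text should still be what gets stored as the document, since that is what ends up in the FIM prompt.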

Here is server/rag/store.py:

import hashlib
import os
from typing import Sequence

import chromadb
import httpx

CHROMA_HOST = os.getenv("CHROMA_HOST", "localhost")
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
COLLECTION_NAME = "dhi_chunks"
EMBED_MODEL = "nomic-embed-text"


def _embed(texts: list[str]) -> list[list[float]]:
    resp = httpx.post(
        f"http://{OLLAMA_HOST}:11434/api/embed",
        json={"model": EMBED_MODEL, "input": texts},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"]


def _chunk_id(chunk) -> str:
    key = f"{chunk.file_path}:{chunk.start_line}:{chunk.end_line}"
    return hashlib.md5(key.encode()).hexdigest()


class ChunkStore:
    def __init__(self):
        client = chromadb.HttpClient(host=CHROMA_HOST, port=8000)
        self._col = client.get_or_create_collection(
            name=COLLECTION_NAME,
            metadata={"hnsw:space": "cosine"},
        )

    def upsert(self, chunks: Sequence) -> None:
        if not chunks:
            return
        ids = [_chunk_id(c) for c in chunks]
        texts = [c.text for c in chunks]
        embeddings = _embed(texts)
        metadatas = [
            {
                "file_path": c.file_path,
                "start_line": c.start_line,
                "end_line": c.end_line,
                "language": c.language,
                "node_type": c.node_type,
            }
            for c in chunks
        ]
        self._col.upsert(ids=ids, embeddings=embeddings, documents=texts, metadatas=metadatas)

    def query(self, text: str, n_results: int = 3) -> list[str]:
        if self._col.count() == 0:
            return []
        embeddings = _embed([text])
        results = self._col.query(
            query_embeddings=embeddings,
            n_results=min(n_results, self._col.count()),
            include=["documents"],
        )
        return results["documents"][0]

The upsert method is idempotent for unchanged content: the chunk ID is a deterministic hash of file_path:start_line:end_line, so re-indexing an unmodified file overwrites its chunks rather than duplicating them. If an edit shifts line numbers, the moved chunks get new IDs and the old entries stay in the store until something deletes them.
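Because the ID depends on line numbers, it helps to see what happens when a file is edited. The sketch below mirrors the hashing scheme from _chunk_id and shows that a function which moves down two lines gets a new ID, leaving the old one orphaned. The stale_ids helper and the sample line ranges are assumptions made for this illustration.

```python
import hashlib


def chunk_id(file_path: str, start_line: int, end_line: int) -> str:
    """Same scheme as _chunk_id above: MD5 of path and line range."""
    key = f"{file_path}:{start_line}:{end_line}"
    return hashlib.md5(key.encode()).hexdigest()


def stale_ids(previous: list[str], current: list[str]) -> list[str]:
    """IDs stored for a file earlier that no longer appear after re-chunking."""
    return sorted(set(previous) - set(current))


# Before the edit: an import at lines 1-2 and a function at lines 10-20.
before = [chunk_id("app.py", 1, 2), chunk_id("app.py", 10, 20)]
# After inserting two lines above the function, it now spans lines 12-22.
after = [chunk_id("app.py", 1, 2), chunk_id("app.py", 12, 22)]

orphans = stale_ids(before, after)  # the old 10-20 entry is now stale
```

One way to handle this, offered as a suggestion rather than a description of what Dhi does, is to delete a file's existing entries before upserting the new ones. Chroma collections accept both delete(ids=...) and delete(where=...), so filtering on the file_path metadata field would work.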


Layer 3: Assembling the FIM Prompt

The FIM prompt has a precise structure. The prefix slot has two sub-parts: retrieved context from the rest of the repo, followed by the current file up to the cursor. The suffix is everything after the cursor.

<fim_prefix>
  # Repo context        ← top-3 retrieved chunks (~1500 tokens)
  # Current file        ← lines 0 → cursor      (~800 tokens)

<fim_suffix>
  # Current file        ← cursor → EOF           (~400 tokens)

<fim_middle>            ← StarCoder2 generates the completion here

Here is server/inference/fim.py:

import os
from dataclasses import dataclass

import httpx

from rag.store import ChunkStore

OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
FIM_MODEL = os.getenv("FIM_MODEL", "starcoder2:3b")

# StarCoder2 FIM special tokens
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


@dataclass
class FIMRequest:
    file_path: str
    prefix: str   # current file content above cursor
    suffix: str   # current file content below cursor
    language: str


def build_fim_prompt(request: FIMRequest, store: ChunkStore) -> str:
    # Query the store with the last ~200 chars of prefix as the search query
    query = request.prefix[-200:].strip() or request.file_path
    context_chunks = store.query(query, n_results=3)

    context_block = "\n\n".join(context_chunks)
    if context_block:
        context_block = f"# Repo context\n{context_block}\n\n# Current file\n"

    return (
        f"{FIM_PREFIX}"
        f"{context_block}"
        f"{request.prefix}"
        f"{FIM_SUFFIX}"
        f"{request.suffix}"
        f"{FIM_MIDDLE}"
    )


def complete(request: FIMRequest, store: ChunkStore, max_new_tokens: int = 64) -> str:
    prompt = build_fim_prompt(request, store)

    resp = httpx.post(
        f"http://{OLLAMA_HOST}:11434/api/generate",
        json={
            "model": FIM_MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {
                "num_predict": max_new_tokens,
                "temperature": 0.1,
                "stop": ["\n\n", FIM_PREFIX, FIM_SUFFIX],
            },
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["response"]

Two implementation choices worth explaining:

Low temperature (0.1). Autocomplete is not a creative task. You want the most probable continuation, not a diverse sample. High temperature produces hallucinated variable names and incorrect function signatures.

Stop tokens include \n\n. A single blank line is the natural end of a completion. Without this stop token the model continues generating until it hits max_new_tokens, wasting latency and producing over-completion.
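To see what the stop list does to a raw generation, here is a client-side equivalent. Ollama applies stop sequences on the server, so this function is not needed in the request path; it is only an illustration, and trim_completion is a hypothetical name.

```python
STOP_SEQUENCES = ("\n\n", "<fim_prefix>", "<fim_suffix>")


def trim_completion(text: str, stops: tuple[str, ...] = STOP_SEQUENCES) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = "a + b\n\ndef unrelated():\n    pass"
trimmed = trim_completion(raw)  # everything from the blank line on is dropped
```

A completion that contains a single newline passes through untouched, which is why multi-line suggestions still work with this stop list.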


Model Selection

Not every developer has the same hardware. Here is the recommended model per tier:

Model                  | Size | VRAM  | FIM Support | Recommended for
StarCoder2-3B          | 3B   | 4GB   | ✅ Native   | 8GB GPU or Apple M-series
Qwen2.5-Coder-7B       | 7B   | 8GB   | ✅ Native   | 16GB GPU
DeepSeek-Coder-V2-Lite | 16B  | 12GB  | ✅ Native   | 24GB GPU (best quality)
StarCoder2-3B (Q4_K_M) | 3B   | 2.5GB | ✅ Native   | CPU-only (slow but works)

Change the model in .env:

FIM_MODEL=qwen2.5-coder:7b

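The new model also has to be available in Ollama: the entrypoint shown earlier pulls a fixed model name rather than FIM_MODEL, so update that line or run ollama pull inside the container. One more caveat: the FIM special tokens differ between model families (Qwen2.5-Coder, for example, uses <|fim_prefix|>, <|fim_suffix|>, and <|fim_middle|>), and fim.py hardcodes the StarCoder2 tokens. When switching families, either update the three FIM_* constants to match the new model, or send the prefix and suffix separately through the suffix field of Ollama's /api/generate, which applies the model's own FIM template for models whose template supports it.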
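Since fim.py hardcodes the StarCoder2 tokens, swapping in a model from another family needs a matching token set. A minimal sketch of a per-family lookup follows. The table covers only the two families whose tokens are spelled out here (DeepSeek-Coder uses a third set that is not reproduced), and the names FIM_TOKENS, fim_tokens_for and build_prompt are assumptions for this example.

```python
# (prefix, suffix, middle) tokens keyed by the Ollama model family name.
FIM_TOKENS = {
    "starcoder2": ("<fim_prefix>", "<fim_suffix>", "<fim_middle>"),
    "qwen2.5-coder": ("<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"),
}


def fim_tokens_for(model: str) -> tuple[str, str, str]:
    """Look up tokens by the part of the model tag before ':' (e.g. 'starcoder2:3b')."""
    family = model.split(":", 1)[0].lower()
    if family not in FIM_TOKENS:
        raise ValueError(f"no FIM tokens registered for model family {family!r}")
    return FIM_TOKENS[family]


def build_prompt(model: str, prefix: str, suffix: str) -> str:
    """Assemble prefix-suffix-middle ordering with the family's own tokens."""
    fim_prefix, fim_suffix, fim_middle = fim_tokens_for(model)
    return f"{fim_prefix}{prefix}{fim_suffix}{suffix}{fim_middle}"


qwen_prompt = build_prompt("qwen2.5-coder:7b", "x = ", "\nprint(x)")
```

Failing loudly on an unknown family is deliberate: a prompt built with the wrong tokens still produces output, just noticeably worse output, which is hard to debug.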


Layer 4: The VS Code Extension

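The extension registers an InlineCompletionItemProvider. VS Code invokes it as the user types, and the provider waits 150ms before sending a request; if another keystroke arrives inside that window, the earlier request is dropped. This debounce prevents a network round-trip on every keystroke.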

Keystroke → Debounce 150ms → Extract context
    → POST /complete {file_path, prefix, suffix, language}
    → Dhi server (FIM prompt → Ollama → completion)
    → Ghost text rendered in editor
    → Tab to accept  /  Escape to dismiss

Here is extension/src/completion/provider.ts:

import * as vscode from 'vscode';

const SERVER_URL = vscode.workspace
    .getConfiguration('dhi')
    .get<string>('serverUrl', 'http://localhost:8000');

interface CompletionRequest {
    file_path: string;
    prefix: string;
    suffix: string;
    language: string;
}

async function fetchCompletion(req: CompletionRequest): Promise<string | null> {
    try {
        const res = await fetch(`${SERVER_URL}/complete`, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(req),
            signal: AbortSignal.timeout(3000),
        });
        if (!res.ok) return null;
        const data = await res.json() as { completion: string };
        return data.completion ?? null;
    } catch {
        return null;
    }
}

export class DhiCompletionProvider implements vscode.InlineCompletionItemProvider {
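    private pending: NodeJS.Timeout | null = null;
    // Resolver of the promise tied to the pending timer, so a superseded call can settle.
    private pendingResolve: ((value: string | null) => void) | null = null;

    async provideInlineCompletionItems(
        document: vscode.TextDocument,
        position: vscode.Position,
        _context: vscode.InlineCompletionContext,
        token: vscode.CancellationToken,
    ): Promise<vscode.InlineCompletionList | null> {
        // A newer keystroke supersedes the pending request: cancel its timer and
        // resolve its promise with null so the earlier call does not hang forever.
        if (this.pending) {
            clearTimeout(this.pending);
            this.pendingResolve?.(null);
            this.pending = null;
            this.pendingResolve = null;
        }

        const completion = await new Promise<string | null>((resolve) => {
            this.pendingResolve = resolve;
            this.pending = setTimeout(async () => {
                this.pending = null;
                this.pendingResolve = null;
                if (token.isCancellationRequested) { resolve(null); return; }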

                const offset = document.offsetAt(position);
                const text = document.getText();

                resolve(await fetchCompletion({
                    file_path: document.uri.fsPath,
                    prefix: text.slice(0, offset),
                    suffix: text.slice(offset),
                    language: document.languageId,
                }));
            }, 150);
        });

        if (!completion || token.isCancellationRequested) return null;

        return {
            items: [
                new vscode.InlineCompletionItem(
                    completion,
                    new vscode.Range(position, position),
                ),
            ],
        };
    }
}
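The debounce logic is independent of VS Code, so it can be exercised in isolation. The sketch below restates it in Python with asyncio (the language used for the other standalone examples in this post; the extension itself stays in TypeScript). The Debouncer class and demo function are hypothetical names. The property to look for is that a superseded call settles with None instead of waiting forever.

```python
import asyncio
from typing import Callable, Optional


class Debouncer:
    """Run only the most recent call once `delay` seconds pass without a newer one."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._task: Optional[asyncio.Task] = None

    async def run(self, fn: Callable[[], str]) -> Optional[str]:
        # A newer call supersedes any call still waiting out its delay.
        if self._task is not None and not self._task.done():
            self._task.cancel()

        async def delayed() -> str:
            await asyncio.sleep(self.delay)
            return fn()

        task = asyncio.ensure_future(delayed())
        self._task = task
        try:
            return await task
        except asyncio.CancelledError:
            # Superseded: settle with None rather than leave the caller waiting.
            return None


async def demo() -> list:
    debouncer = Debouncer(delay=0.05)
    # Three "keystrokes" arrive back to back; only the last one should fire.
    return await asyncio.gather(
        debouncer.run(lambda: "a"),
        debouncer.run(lambda: "ab"),
        debouncer.run(lambda: "abc"),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The first two calls come back as None and the third returns its value after the delay, which is the behaviour the provider needs so that VS Code is never left holding an unresolved promise.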

Register it in extension.ts:

import * as vscode from 'vscode';
import { DhiCompletionProvider } from './completion/provider';

export function activate(context: vscode.ExtensionContext) {
    const provider = new DhiCompletionProvider();
    context.subscriptions.push(
        vscode.languages.registerInlineCompletionItemProvider(
            { pattern: '**' },
            provider,
        ),
    );
}

export function deactivate() {}

And the FastAPI endpoint in server/main.py:

from fastapi import FastAPI
from pydantic import BaseModel

from inference.fim import FIMRequest, complete
from rag.store import ChunkStore

app = FastAPI()
store = ChunkStore()


class CompleteRequest(BaseModel):
    file_path: str
    prefix: str
    suffix: str
    language: str


@app.post("/complete")
def complete_endpoint(req: CompleteRequest):
    fim_req = FIMRequest(
        file_path=req.file_path,
        prefix=req.prefix,
        suffix=req.suffix,
        language=req.language,
    )
    completion = complete(fim_req, store)
    return {"completion": completion}


@app.post("/index")
def index_endpoint(body: dict):
    from rag.chunker import chunk_file
    chunks = list(chunk_file(body["file_path"]))
    store.upsert(chunks)
    return {"indexed": len(chunks)}

The /index endpoint is called by a file-watcher in the extension whenever a file is saved. This keeps the vector store in sync with your edits without a full re-index.


Latency in Practice

On an Apple M3 Pro (no external GPU) with StarCoder2-3B via Ollama:

Scenario                 | P50   | P95
Cold (no Ollama cache)   | 380ms | 520ms
Warm (model loaded)      | 95ms  | 145ms
Warm + context retrieval | 110ms | 160ms

The P50 of 95ms warm sits well inside the < 150ms target. Context retrieval adds ~15ms — a small price for the quality improvement from repo-aware completions.
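For readers who want to reproduce numbers like these on their own hardware, here is a small measurement sketch. It is not the harness used for the table above, only one reasonable way to do it: time repeated calls and take nearest-rank percentiles. The names percentile and measure_ms are assumptions.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_ms(fn: Callable[[], object], runs: int = 50) -> list[float]:
    """Call `fn` `runs` times and return each wall-clock duration in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


# Replace the lambda with a function that POSTs to /complete to measure the real path.
timings = measure_ms(lambda: sum(range(1000)), runs=20)
p50, p95 = percentile(timings, 50), percentile(timings, 95)
```

Discard the first request when measuring the warm case, since it includes the model load described under cold start.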

Three things that affect latency more than anything else:

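1. Max new tokens. The default is 64 (the max_new_tokens argument of complete()). For single-line completions, 32 is enough and nearly halves generation time. Set FIM_MODEL_MAX_TOKENS=32 in .env if you want faster single-line suggestions; fim.py as listed above does not read that variable itself, so the server has to pass the value through as max_new_tokens.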

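2. Prefix length. Truncate the prefix at ~800 tokens before sending; the provider as listed sends the whole file above the cursor. Self-attention cost during prompt processing grows roughly quadratically with prompt length, so long prefixes get expensive quickly.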

3. Cold start. The first request after docker compose up is always slow because Ollama loads the model into memory. On M3 Pro this takes ~3 seconds. Subsequent requests hit the warm model cache.
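The prefix and suffix budgets from point 2 can be enforced with a few lines. The sketch below approximates tokens as four characters each, which is a rough assumption rather than a real tokenizer count, and snaps the cut to a line boundary so the model never sees half a line. The function names and the 4-characters-per-token constant are made up for this example.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not a tokenizer measurement


def truncate_prefix(prefix: str, max_tokens: int = 800) -> str:
    """Keep the END of the prefix (closest to the cursor), starting on a full line."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(prefix) <= max_chars:
        return prefix
    tail = prefix[-max_chars:]
    newline = tail.find("\n")
    # Drop the leading partial line, if there is a line break to cut at.
    return tail[newline + 1:] if newline != -1 else tail


def truncate_suffix(suffix: str, max_tokens: int = 400) -> str:
    """Keep the START of the suffix (closest to the cursor), ending on a full line."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(suffix) <= max_chars:
        return suffix
    head = suffix[:max_chars]
    newline = head.rfind("\n")
    # Drop the trailing partial line, if there is a line break to cut at.
    return head[:newline + 1] if newline != -1 else head


long_prefix = "\n".join(f"line {i}" for i in range(2000))
short = truncate_prefix(long_prefix, max_tokens=10)
```

Truncation keeps the text nearest the cursor on both sides, because that is the context the completion has to join up with.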

Query embedding      ~8ms
Chroma query         ~4ms
StarCoder2 (32 tok)  ~85ms
Network round-trip   ~5ms
──────────────────────────
Total                ~102ms, in line with the ~110ms P50  ✓  (target: < 150ms)

Debounce wait        ~80ms  (elapses before the request is sent; not part of the total)

Indexing Your Repo

The extension calls /index on every file save. To index an existing project on first launch, add a command:

// extension.ts
vscode.commands.registerCommand('dhi.indexWorkspace', async () => {
    const files = await vscode.workspace.findFiles(
        '**/*.{py,ts,tsx}',
        '**/node_modules/**',
    );
    for (const file of files) {
        await fetch(`${SERVER_URL}/index`, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ file_path: file.fsPath }),
        });
    }
    vscode.window.showInformationMessage(`Dhi: indexed ${files.length} files`);
});

Run it once with Ctrl+Shift+P → Dhi: Index Workspace. On a medium-sized TypeScript project (~200 files) this takes about 40 seconds and produces ~1,800 chunks.
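The same bulk index can be driven from a script, which is handy for CI or for indexing before the editor is open. This is a sketch under two assumptions: the server is reachable at http://localhost:8000, and it can read the paths it is given (so the workspace has to be visible from inside the container). The function names are hypothetical; the /index request body matches the endpoint shown earlier.

```python
import json
import urllib.request
from pathlib import Path

EXTENSIONS = {".py", ".ts", ".tsx"}
SKIP_DIRS = {"node_modules"}


def find_source_files(root) -> list[Path]:
    """Return indexable files under `root`, skipping anything inside SKIP_DIRS."""
    root = Path(root)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file()
        and path.suffix in EXTENSIONS
        and not SKIP_DIRS.intersection(path.relative_to(root).parts)
    )


def index_file(path: Path, server_url: str = "http://localhost:8000") -> int:
    """POST one file path to /index and return the number of chunks indexed."""
    request = urllib.request.Request(
        f"{server_url}/index",
        data=json.dumps({"file_path": str(path)}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["indexed"]


if __name__ == "__main__":
    total = sum(index_file(path) for path in find_source_files("."))
    print(f"indexed {total} chunks")
```

Like the extension command, this sends one request per file in sequence, so expect similar wall-clock time on a project of the same size.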


What We Have So Far

At post-1, Dhi does one thing: it gives you fast, repo-aware FIM autocomplete using entirely open-source components. No API key. No per-token pricing. The full stack fits on a laptop.

The component map so far:

flowchart TB
    EXT["VS Code Extension\nInlineCompletionItemProvider"]
    API["FastAPI Server\n/complete · /index"]
    FIM["fim.py\nFIM prompt builder"]
    STORE["store.py\nChroma · cosine search"]
    CHUNK["chunker.py\nTree-sitter · semantic chunks"]
    OLLAMA["Ollama\nStarCoder2-3B · nomic-embed-text"]

    EXT <-->|"HTTP POST /complete"| API
    EXT -->|"HTTP POST /index on save"| API
    API --> FIM --> STORE --> OLLAMA
    API --> CHUNK --> STORE

What's Next

The autocomplete engine queries the vector store at request time. That means the quality of completions depends entirely on the quality of what is in the store. In the next post we go deep on Repo Intelligence — the layer that keeps the store accurate, fast, and in sync with your codebase at all times:

  • Full Tree-sitter support for Go, Rust, and Java alongside Python and TypeScript
  • LSP call graph: an adjacency list of every function-to-function reference in SQLite
  • Hybrid search: nomic vector similarity + BM25 keyword matching, re-ranked with RRF
  • Incremental re-index on file save — under 100ms per file even on large repos
  • Git-aware indexing: skip .gitignore entries, auto-update on git checkout

The code will be at github.com/sochaty/dhi tag post-2.
