Jovan Chan

Originally published at aicoderscope.com

Continue.dev + LM Studio 2026: setup guide, the context-window dial you must set before loading, and which GGUF models pass the FIM test

TL;DR: Continue.dev v1.3.38 + LM Studio 0.4.15 gives you local AI coding in both VS Code and JetBrains, with a GUI model browser, automatic CUDA detection, and optional remote GPU access via LM Link. One trap stops most setups before they produce good output: LM Studio's context window defaults to 4,096 tokens, and you must raise it in the model settings before loading the model, not after. Miss that step and the model silently sees only part of what Continue.dev sends: 4,096 tokens is a quarter of a 16,384-token request.

What you'll be able to do after this guide:

  • Serve any GGUF coding model from LM Studio at http://localhost:1234/v1
  • Configure Continue.dev with separate model roles — a lightweight 1.5B for tab autocomplete, a 14B or 32B for chat and edits — using a single config.yaml
  • Get fill-in-the-middle (FIM) tab completions working in VS Code and JetBrains

| | Continue.dev + LM Studio | Continue.dev + Ollama | Cursor Pro |
| Best for | Windows + GUI model browser + LM Link | macOS / Linux, CLI-first | Best-in-class VS Code agent |
| Price / Cost | $0, no API bill | $0, no API bill | $20/mo, usage-capped |
| The catch | LM Studio is a multi-hundred-MB GUI app; no headless install | No GUI, needs CUDA path setup on Windows | No local model option at all |

Honest take: On Windows, LM Studio is the lower-friction path to local Continue.dev — CUDA auto-detection and a visual model browser beat Ollama's CLI for developers who don't want to wrangle environment variables. On macOS or Linux, the Continue.dev + Ollama guide is simpler. Choose LM Studio if you're Windows-primary or want the LM Link remote-GPU feature.


What Continue.dev does differently with LM Studio vs Ollama

Continue.dev's Ollama provider talks to Ollama's native REST API (/api/generate, /api/chat, /api/tags) and uses Ollama's FIM detection via the Modelfile template. The LM Studio provider takes a different path: it extends Continue.dev's OpenAI class and points at LM Studio's OpenAI-compatible server (http://localhost:1234/v1).

This means:

FIM works differently. For tab autocomplete, Continue.dev calls LM Studio's /v1/completions endpoint with a suffix parameter — the standard OpenAI-compatible FIM path. This works reliably with Qwen2.5-Coder models (which include FIM training) and DeepSeek-Coder models. It fails silently with models that weren't trained for FIM, producing generic "complete from where I left off" suggestions rather than true fill-in-the-middle completions.
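
To see what a fill-in-the-middle request looks like on the wire, here is a minimal Python sketch of an OpenAI-style completions call that carries both the text before the cursor and the text after it. The field names follow the OpenAI completions format; the exact body Continue.dev sends is not shown in this guide, so treat the defaults (`max_tokens`, `temperature`, and the model identifier) as assumptions chosen for illustration.

```python
import json
import urllib.request

# Assumption: LM Studio's server is running on the default address used in this guide.
BASE_URL = "http://localhost:1234/v1"


def build_fim_payload(prefix: str, suffix: str,
                      model: str = "qwen2.5-coder-1.5b-instruct",
                      max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completions body for a fill-in-the-middle request."""
    return {
        "model": model,          # hypothetical identifier; use the one your server lists
        "prompt": prefix,        # code before the cursor
        "suffix": suffix,        # code after the cursor
        "max_tokens": max_tokens,
        "temperature": 0.0,      # illustrative choice for repeatable suggestions
    }


def request_fim(prefix: str, suffix: str, base_url: str = BASE_URL) -> str:
    """POST the payload to /completions and return the first completion's text."""
    body = json.dumps(build_fim_payload(prefix, suffix)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, calling `request_fim("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")` is a quick manual FIM check: a model trained for FIM should return only the missing expression, while a model without FIM training tends to keep writing past the gap.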

The model name is decorative. Unlike Ollama, where Continue.dev queries /api/tags to verify the model exists, LM Studio's API routes to the currently loaded model regardless of the name in your request. The model field in config.yaml is passed in the API call but LM Studio ignores it and uses whatever model you loaded in the GUI. This simplifies configuration but means you must manually pre-load the right model before starting your coding session.

Context length is a GUI setting, not an environment variable. Ollama has OLLAMA_NUM_CTX and per-model Modelfiles. In LM Studio, context length is configured at model-load time in the settings panel — and the default (4,096 tokens) is not enough for Continue.dev's typical request size.


Hardware floor

| GPU / VRAM | Recommended model | Notes |
| RTX 4060 8 GB | Qwen2.5-Coder 7B Q4_K_M | Autocomplete only; chat produces marginal results |
| RTX 3060 12 GB | Qwen2.5-Coder 14B Q4_K_M | Practical floor for chat + edit; autocomplete on 1.5B separately |
| RTX 4060 Ti 16 GB | Qwen2.5-Coder 14B Q6_K | Solid daily-driver |
| RTX 3090 / RTX 4090 24 GB | Qwen2.5-Coder 32B Q4_K_M | Best local tier; Devstral Small 2 Q4_K_M also fits here |
| Mac M3/M4 unified memory | Use Ollama + MLX instead | LM Studio on Apple Silicon runs but Ollama + MLX is measurably faster |

LM Studio runs noticeably slower than Ollama on Apple Silicon because the macOS build still uses llama.cpp's Metal path while Ollama has better integrated MLX support. If you're on a Mac, the Continue.dev + Ollama guide will get you better performance. For hardware selection context, runaihome.com's local AI model by VRAM tier guide covers the landscape in detail.


Step 1: Install LM Studio 0.4.15

Download from lmstudio.ai. The current stable release is 0.4.15 (build 2, released May 29, 2026). It ships as a single executable installer — .exe on Windows, .dmg on macOS, and AppImage/deb on Linux.

On Windows: run the installer. It detects your CUDA version automatically and installs the matching runtime. No manual CUDA path configuration needed.

On Linux:

chmod +x LM-Studio-0.4.15-x86_64.AppImage
./LM-Studio-0.4.15-x86_64.AppImage --no-sandbox

After launch, go to Settings → Developer Mode and toggle it on. This unlocks the local server controls and, from LM Studio 0.4.0 onward, the parallel inference settings.


Step 2: Download a coding model

Open the Discover tab and search for qwen2.5-coder. LM Studio shows available GGUF quantizations alongside estimated VRAM usage for each. For a 24 GB card, select Q4_K_M of the 32B variant (approximately 20 GB, leaving headroom for a 32k context window). For 12–16 GB cards, use the 14B at Q4_K_M (approximately 9 GB).
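
If you want to sanity-check a quantization against your VRAM before downloading, the weight size can be approximated as parameter count times average bits per weight. The sketch below is a rough estimator, not LM Studio's own calculation; the bits-per-weight figures and parameter counts in the example are approximate assumptions, and the result excludes the context (KV cache) memory.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Assumed averages: about 4.8 bits/weight for Q4_K_M and 6.6 for Q6_K.
    # Parameter counts are approximate figures for the Qwen2.5-Coder family.
    candidates = [
        ("32B Q4_K_M", 32.5e9, 4.8),
        ("14B Q4_K_M", 14.7e9, 4.8),
        ("14B Q6_K", 14.7e9, 6.6),
    ]
    for label, params, bpw in candidates:
        print(f"{label}: ~{estimate_gguf_size_gb(params, bpw):.1f} GB of weights")
```

Treat the output as a lower bound on VRAM use: the runtime also needs room for the context window and some working buffers, so prefer the estimate LM Studio shows in the Discover tab when the two disagree.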

For the separate autocomplete model (recommended — it fires on every keystroke and needs to be fast), also download qwen2.5-coder-1.5b:

# Using the lms CLI that ships with LM Studio 0.4.x
lms get qwen2.5-coder-1.5b-instruct

# Verify download
lms ls
# Expected output: a list of model paths in your LM Studio models directory

The lms CLI ships with LM Studio and is added to your PATH during the first launch. If the command isn't found, open a fresh terminal so it picks up the updated PATH.


Step 3: Set the context window — before loading, not after

This is where most Continue.dev + LM Studio setups silently break.

LM Studio defaults to a 4,096-token context window for most models. Continue.dev sends significantly more — file context, conversation history, and retrieved snippets combined can easily hit 8,000–16,000 tokens depending on your project size. When Continue.dev sends more than the loaded context window allows, LM Studio truncates the oldest tokens silently. The model never sees the earlier context. Responses look plausible but are based on an incomplete picture.
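
A rough way to reason about whether a request will be truncated is to estimate its token count and compare it with the loaded context length. The sketch below uses a characters-per-token heuristic (about 4 characters per token, an assumption that varies by tokenizer and by language) and a hypothetical reserve for the model's reply; it is a back-of-the-envelope check, not how Continue.dev or LM Studio count tokens.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real counts depend on the model's tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context(parts: list[str], n_ctx: int, reserve_for_output: int = 1024) -> bool:
    """Return True if the estimated prompt tokens plus an output reserve fit in n_ctx."""
    prompt_tokens = sum(estimate_tokens(part) for part in parts)
    return prompt_tokens + reserve_for_output <= n_ctx


if __name__ == "__main__":
    # Hypothetical request: one 40,000-character file plus a short question.
    request_parts = ["x" * 40_000, "Why does this function return None?"]
    for n_ctx in (4096, 16384):
        print(n_ctx, fits_context(request_parts, n_ctx))
```

A single 40,000-character source file is already around 10,000 tokens under this heuristic, which is why the 4,096-token default falls short well before conversation history and retrieved snippets are added.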

To fix this, set the context length in the model configuration before you click Load:

  1. In the left sidebar, click on the model you want to load
  2. In the right-side configuration panel, find Context Length (labeled n_ctx in some versions)
  3. Set it to at least 16384 — this covers most coding tasks
  4. For large codebases or long agent conversations, set it to 32768 (requires approximately 2–4 GB extra VRAM depending on the model)
  5. Click Load Model

The context length is baked in at load time. If you change it, you must unload and reload the model.
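
The extra VRAM comes mostly from the KV cache, which grows linearly with context length. Here is a small sketch of the standard size formula for a transformer with grouped-query attention. The layer count, KV head count, and head dimension in the example are assumed values for a 14B-class model; read the real ones from your model's metadata before relying on the numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values (factor 2) for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


if __name__ == "__main__":
    # Assumed shapes for a 14B-class model: 48 layers, 8 KV heads, head dimension 128.
    # bytes_per_value=2 corresponds to an unquantized 16-bit cache.
    for n_ctx in (4096, 16384, 32768):
        gib = kv_cache_bytes(48, 8, 128, n_ctx) / 1024**3
        print(f"n_ctx={n_ctx}: {gib:.2f} GiB")
```

Under those assumed shapes the formula gives 3 GiB at 16,384 tokens and 6 GiB at 32,768, so doubling the context adds about 3 GiB. Runtimes that quantize the cache will use less, which is one reason the real figure depends on the model and settings.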

You can verify the context window is set correctly from the lms CLI after loading:

lms status
# Expected output includes: Context Length: 16384 (or whatever you set)

If you see Context Length: 4096 after loading, you changed the setting while the model was already loaded — it won't apply until you reload.


Step 4: Start the local server

In LM Studio's Developer panel, click Start Server. The server starts on port 1234 by default. Verify it's responding:


curl http://localhost:1234/v1/models
# Expected: {"object":"list","dat
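
The same check can be scripted, which is handy before starting a coding session. This is a minimal sketch that assumes the server answers in the OpenAI list format, with a `data` array whose entries carry an `id`; the helper names are made up for this example.

```python
import json
import urllib.request


def model_ids(models_response: dict) -> list[str]:
    """Extract model identifiers from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """Query the local server and return the identifiers it reports."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    try:
        print(fetch_model_ids())
    except OSError as exc:
        # Connection refused usually means the server has not been started yet.
        print(f"LM Studio server not reachable: {exc}")
```

An empty list or a connection error here means Continue.dev will fail too, so it is worth running before you debug anything in the editor.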