This article was originally published on aicoderscope.com
TL;DR: Continue.dev v1.3.38 + LM Studio 0.4.15 gives you local AI coding in both VS Code and JetBrains, with a GUI model browser, automatic CUDA detection, and optional remote GPU access via LM Link. One trap stops most setups before they produce good output: LM Studio's context window defaults to 4,096 tokens, and you must increase it in the model settings before loading, not after. Miss that step and LM Studio silently truncates Continue.dev's requests, so the model sees only a fraction of the context that was sent.
What you'll be able to do after this guide:
- Serve any GGUF coding model from LM Studio at http://localhost:1234/v1
- Configure Continue.dev with separate model roles (a lightweight 1.5B for tab autocomplete, a 14B or 32B for chat and edits) using a single config.yaml
- Get fill-in-the-middle (FIM) tab completions working in VS Code and JetBrains
| | Continue.dev + LM Studio | Continue.dev + Ollama | Cursor Pro |
|---|---|---|---|
| Best for | Windows + GUI model browser + LM Link | macOS / Linux, CLI-first | Best-in-class VS Code agent |
| Price / Cost | $0, no API bill | $0, no API bill | $20/mo, usage-capped |
| The catch | LM Studio is a multi-hundred-MB GUI app; no headless install | No GUI, needs CUDA path setup on Windows | No local model option at all |
Honest take: On Windows, LM Studio is the lower-friction path to local Continue.dev — CUDA auto-detection and a visual model browser beat Ollama's CLI for developers who don't want to wrangle environment variables. On macOS or Linux, the Continue.dev + Ollama guide is simpler. Choose LM Studio if you're Windows-primary or want the LM Link remote-GPU feature.
What Continue.dev does differently with LM Studio vs Ollama
Continue.dev's Ollama provider talks to Ollama's native REST API (/api/generate, /api/chat, /api/tags) and uses Ollama's FIM detection via the Modelfile template. The LM Studio provider takes a different path: it extends Continue.dev's OpenAI class and points at LM Studio's OpenAI-compatible server (http://localhost:1234/v1).
This means:
FIM works differently. For tab autocomplete, Continue.dev calls LM Studio's /v1/completions endpoint with a suffix parameter — the standard OpenAI-compatible FIM path. This works reliably with Qwen2.5-Coder models (which include FIM training) and DeepSeek-Coder models. It fails silently with models that weren't trained for FIM, producing generic "complete from where I left off" suggestions rather than true fill-in-the-middle completions.
The model name is decorative. Unlike Ollama, where Continue.dev queries /api/tags to verify the model exists, LM Studio's API routes to the currently loaded model regardless of the name in your request. The model field in config.yaml is passed in the API call but LM Studio ignores it and uses whatever model you loaded in the GUI. This simplifies configuration but means you must manually pre-load the right model before starting your coding session.
Context length is a GUI setting, not an environment variable. Ollama has OLLAMA_NUM_CTX and per-model Modelfiles. In LM Studio, context length is configured at model-load time in the settings panel — and the default (4,096 tokens) is not enough for Continue.dev's typical request size.
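To make the FIM path above concrete, here is a minimal sketch of the request body a client could send to /v1/completions. The field names (prompt, suffix, max_tokens, temperature) are standard OpenAI completions parameters. The helper name build_fim_body is hypothetical, and the sketch assumes LM Studio honors suffix as described above. It also assumes the inputs contain no quotes, backslashes, or newlines, because it does no JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper: build an OpenAI-style completions body with a "suffix" field.
# Assumption: inputs contain no quotes, backslashes, or newlines (no JSON escaping here).
build_fim_body() {
  # $1 = model label, $2 = code before the cursor, $3 = code after the cursor
  printf '{"model":"%s","prompt":"%s","suffix":"%s","max_tokens":64,"temperature":0}' "$1" "$2" "$3"
}

# Example: the cursor sits between "return " and " # sum".
body=$(build_fim_body "qwen2.5-coder-1.5b-instruct" "def add(a, b): return " " # sum")
printf '%s\n' "$body"

# With the LM Studio server running, the body could be sent like this:
# curl -s http://localhost:1234/v1/completions \
#   -H "Content-Type: application/json" -d "$body"
```

With a FIM-trained model loaded, the returned text should be the middle piece only. With a model that was not trained for FIM, expect the generic continuation behavior described above.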
Hardware floor
| GPU / VRAM | Recommended model | Notes |
|---|---|---|
| RTX 4060 8 GB | Qwen2.5-Coder 7B Q4_K_M | Autocomplete only; chat produces marginal results |
| RTX 3060 12 GB | Qwen2.5-Coder 14B Q4_K_M | Practical floor for chat + edit; autocomplete on 1.5B separately |
| RTX 4060 Ti 16 GB | Qwen2.5-Coder 14B Q6_K | Solid daily-driver |
| RTX 3090 / RTX 4090 24 GB | Qwen2.5-Coder 32B Q4_K_M | Best local tier; Devstral Small 2 Q4_K_M also fits here |
| Mac M3/M4 unified memory | Use Ollama + MLX instead | LM Studio on Apple Silicon runs but Ollama + MLX is measurably faster |
LM Studio runs noticeably slower than Ollama on Apple Silicon because the macOS build still uses llama.cpp's Metal path while Ollama has better integrated MLX support. If you're on a Mac, the Continue.dev + Ollama guide will get you better performance. For hardware selection context, runaihome.com's local AI model by VRAM tier guide covers the landscape in detail.
Step 1: Install LM Studio 0.4.15
Download from lmstudio.ai. The current stable release is 0.4.15 (build 2, released May 29, 2026). It ships as a single executable installer — .exe on Windows, .dmg on macOS, and AppImage/deb on Linux.
On Windows: run the installer. It detects your CUDA version automatically and installs the matching runtime. No manual CUDA path configuration needed.
On Linux:
chmod +x LM-Studio-0.4.15-x86_64.AppImage
./LM-Studio-0.4.15-x86_64.AppImage --no-sandbox
After launch, go to Settings → Developer Mode and toggle it on. This unlocks the local server controls and the parallel inference settings from LM Studio 0.4.0 onward.
Step 2: Download a coding model
Open the Discover tab and search for qwen2.5-coder. LM Studio shows available GGUF quantizations alongside estimated VRAM usage for each. For a 24 GB card, select Q4_K_M of the 32B variant (approximately 20 GB, leaving headroom for a 32k context window). For 12–16 GB cards, use the 14B at Q4_K_M (approximately 8 GB).
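If you want to sanity-check those size figures yourself, the sketch below does the arithmetic: weight size in GB is roughly parameters (in billions) times bits per weight, divided by 8. The 4.85 bits-per-weight figure for Q4_K_M is an assumption (a commonly quoted average, not an exact value for every model), and the helper name estimate_gguf_gb is hypothetical. Treat LM Studio's own estimate in the Discover tab as the authority.

```shell
#!/bin/sh
# Hypothetical helper: rough GGUF weight size in GB.
# $1 = parameter count in billions, $2 = average bits per weight (assumed, not measured)
estimate_gguf_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gguf_gb 32 4.85   # prints 19.4
estimate_gguf_gb 14 4.85   # prints 8.5
```

This counts model weights only. The KV cache for the context window comes on top, which is why the 32B pick above leaves headroom on a 24 GB card.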
For the separate autocomplete model (recommended — it fires on every keystroke and needs to be fast), also download qwen2.5-coder-1.5b:
# Using the lms CLI that ships with LM Studio 0.4.x
lms get qwen2.5-coder-1.5b-instruct
# Verify download
lms ls
# Expected output: a list of model paths in your LM Studio models directory
The lms CLI is in your PATH after LM Studio installs. If the command isn't found, open a fresh terminal — the installer adds it during the first launch.
Step 3: Set the context window — before loading, not after
This is where most Continue.dev + LM Studio setups silently break.
LM Studio defaults to a 4,096-token context window for most models. Continue.dev sends significantly more — file context, conversation history, and retrieved snippets combined can easily hit 8,000–16,000 tokens depending on your project size. When Continue.dev sends more than the loaded context window allows, LM Studio truncates the oldest tokens silently. The model never sees the earlier context. Responses look plausible but are based on an incomplete picture.
To fix this, set the context length in the model configuration before you click Load:
- In the left sidebar, click on the model you want to load
- In the right-side configuration panel, find Context Length (labeled n_ctx in some versions)
- Set it to at least 16384, which covers most coding tasks
- For large codebases or long agent conversations, set it to 32768 (requires approximately 2–4 GB extra VRAM depending on the model)
- Click Load Model
The context length is baked in at load time. If you change it, you must unload and reload the model.
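As a rough illustration of the sizing above, the sketch below estimates a prompt's token count and checks it against a loaded context length. The 4-characters-per-token rule is only a common rule of thumb, not the model's tokenizer, and both helper names (estimate_tokens, fits_context) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rough token estimate, about 4 characters per token, rounded up.
estimate_tokens() {
  echo $(( (${#1} + 3) / 4 ))
}

# Hypothetical helper: succeeds when the prompt plus the reply budget fits the window.
# $1 = prompt tokens, $2 = loaded context length, $3 = tokens reserved for the reply
fits_context() {
  [ $(( $1 + $3 )) -le "$2" ]
}

estimate_tokens "def add(a, b): return a + b"   # prints 7

# A 12,000-token request with 1,024 tokens reserved for the answer:
for ctx in 4096 16384; do
  if fits_context 12000 "$ctx" 1024; then
    echo "context $ctx: fits"
  else
    echo "context $ctx: truncated"
  fi
done
```

The loop reports the 4096 default as truncated and 16384 as fitting, which matches the recommendation to raise the setting before loading.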
You can verify the context window is set correctly from the lms CLI after loading:
lms status
# Expected output includes: Context Length: 16384 (or whatever you set)
If you see Context Length: 4096 after loading, you changed the setting while the model was already loaded — it won't apply until you reload.
Step 4: Start the local server
In LM Studio's Developer panel, click Start Server. The server starts on port 1234 by default. Verify it's responding:
curl http://localhost:1234/v1/models
# Expected: {"object":"list","dat