Jovan Chan

Posted on Jun 11 • Originally published at aicoderscope.com

Cursor + Ollama and LM Studio in 2026: use local models for Chat and Cmd+K — and keep tab completion honest

#cursor #ollama #lmstudio #localllm

This article was originally published on aicoderscope.com

TL;DR: You can route Cursor's Chat panel and Cmd+K through a local model running on your own machine — zero API spend for those features. The CORS header and the correct base URL (http://localhost:11434/v1 for Ollama, http://localhost:1234/v1 for LM Studio) are all you need. The hard limit: Cursor's Tab autocomplete stays cloud-only regardless of what you configure. If Tab is your primary use case, this setup won't help.

	Ollama path	LM Studio path	Neither (stay cloud)
Best for	macOS, Linux, Apple Silicon	Windows with CUDA GPU	Heavy Tab users
Cost to run	Free (hardware only)	Free (hardware only)	$20–$200/mo (Pro tier)
Tab completion	❌ Still cloud-only	❌ Still cloud-only	✅ Unlimited on Pro+
The catch	OLLAMA_ORIGINS env var required	GUI-only model loading	Credit pool burns fast with Claude Sonnet

Honest take: If you spend more than an hour a day in Cursor's Chat panel asking architectural questions, explaining code, or running long Cmd+K rewrites, switching those requests to a local Qwen2.5-Coder-32B drops that API spend to zero. If 80% of your Cursor value comes from Tab autocomplete, local models add nothing.

What actually happens when Cursor talks to a local model

Cursor's AI features split into two architecturally different systems:

Tab autocomplete runs through Cursor's proprietary server-side model — a small, fast transformer trained for fill-in-the-middle (FIM) completions. This is not OpenAI, not Claude. Cursor controls it, it runs on Cursor's infrastructure, and you cannot swap it out. The Override Base URL setting in Cursor's model panel has no effect on Tab.

Chat, Cmd+K, and Agent mode use the OpenAI API format and are called from the Cursor client running on your local machine. When you override the base URL, Cursor sends chat requests directly from your VS Code process to whatever endpoint you've configured — Ollama on localhost:11434, LM Studio on localhost:1234, or a remote server. The model credit pool from your Cursor Pro subscription is not consumed for these calls.

This architecture is why local model substitution is meaningful but partial.

Hardware floor and model selection

Chat and Cmd+K are less latency-sensitive than Tab autocomplete — you typically wait for a full response. A 14B model on a mid-range GPU is usable; a 7B model can handle single-function questions but starts to drift on larger refactoring prompts.

GPU / VRAM	Recommended model	Ollama pull command
8 GB (RTX 4060 / 8 GB Apple M)	`qwen2.5-coder:7b`	`ollama pull qwen2.5-coder:7b`
12 GB (RTX 3060 12 GB)	`qwen2.5-coder:14b`	`ollama pull qwen2.5-coder:14b`
16 GB (RTX 4060 Ti 16 GB)	`qwen2.5-coder:14b` or `devstral:24b-small` (Q4)	`ollama pull qwen2.5-coder:14b`
24 GB (RTX 3090 / RTX 4090)	`qwen2.5-coder:32b`	`ollama pull qwen2.5-coder:32b`
Apple M3/M4 Max (36–128 GB unified)	`qwen2.5-coder:32b`	`ollama pull qwen2.5-coder:32b`

Qwen2.5-Coder-32B scores 92.7% on HumanEval and 73.7 on Aider's pass-rate benchmark — within a few points of GPT-4o. On a 24 GB GPU it fits comfortably at Q4_K_M quantization (~19 GB loaded). For hardware advice on building a local AI rig, runaihome.com's local AI model by VRAM tier covers the full GPU comparison.

The 7B tier is workable for explaining snippets and one-shot Cmd+K edits. For anything that requires tracking a refactoring plan across multiple files, 14B is the practical minimum.

Path 1: Cursor + Ollama

Step 1 — Install Ollama and set the CORS header

Download and install Ollama from ollama.com. On macOS and Linux, Ollama runs as a background service after installation. On Windows, it installs as a system tray application.

Before pulling any model, set the OLLAMA_ORIGINS environment variable. This is the step most guides skip and the reason Cursor throws a CORS error on the first request.

macOS / Linux (add to ~/.zshrc or ~/.bashrc):

export OLLAMA_ORIGINS="*"

Windows (run in a terminal, then restart Ollama):

setx OLLAMA_ORIGINS "*"

After setting the variable, restart the Ollama service so it picks up the change:

# macOS / Linux
pkill ollama
ollama serve

# Or restart via the macOS menu bar icon

Step 2 — Pull a model

ollama pull qwen2.5-coder:7b

Expected output:

pulling manifest
pulling 966de95ca8a6... 100% ▕████████████████████████████████▏ 4.7 GB
pulling 66b9ea09bd5b... 100% ▕████████████████████████████████▏  68 B
pulling e7fed4a1ded7... 100% ▕████████████████████████████████▏  4.8 KB
verifying sha256 digest
writing manifest
success

Verify the model is loaded and the API is live:

curl http://localhost:11434/v1/models

You should see a JSON response listing qwen2.5-coder:7b. If you get a connection refused error, Ollama isn't running — launch it with ollama serve.

Step 3 — Configure Cursor

Open Cursor and press Cmd+, (macOS) or Ctrl+, (Windows/Linux) to open settings.
Click Cursor Settings (not VS Code Settings) in the top-right or via the gear icon.
Navigate to the Models tab.
Scroll to the OpenAI API Key section.
Toggle on Override OpenAI Base URL.
Enter: http://localhost:11434/v1
In the API Key field, enter any non-empty string — Ollama doesn't validate keys, but Cursor requires the field to be non-empty. ollama works fine.
Click Add Model and type the exact model name as it appears in Ollama: qwen2.5-coder:7b
In the model list, deselect all other models — leave only your local model checked. This prevents the "does not work with your current plan" error that appears when Cursor tries to route a request to a premium model.

Now open a file, press Cmd+L to open the chat panel, and type a question. The response comes from your local Ollama instance.

If localhost doesn't connect

A small percentage of users — mostly those behind corporate firewalls or VPNs — find that localhost:11434 doesn't resolve correctly from Cursor. The symptom is a timeout or "network error" in the chat panel despite Ollama running fine. Fix: use the loopback IP explicitly instead of the hostname:

Change the base URL to: http://127.0.0.1:11434/v1

If that also fails, the request is being intercepted by a network proxy. The workaround is to expose Ollama through ngrok:

ngrok http 11434 --host-header="localhost:11434"

ngrok prints a public HTTPS URL like https://abc123.ngrok-free.app. Use that as your Cursor base URL: https://abc123.ngrok-free.app/v1. Note that the free ngrok tier generates a new URL on every restart, so you'd need to update Cursor's settings each time.

Path 2: Cursor + LM Studio

LM Studio (stable release: 0.4.15, May 29, 2026) is the better choice on Windows — it has a GUI model browser, automatic CUDA detection, and a one-click server start. LM Studio's GGUF library includes all major coding models.

Step 1 — Install LM Studio and load a model

Download from lmstudio.ai. Run the installer; it auto-detects your CUDA version and driver.

Inside LM Studio:

Click Discover (the search icon) in the left sidebar.
Search for qwen2.5-coder.
Choose the variant matching your VRAM — Q4_K_M for 8 GB and 12 GB cards, Q6_K for 16 GB and up.
Click Download.

Step 2 — Start the local server

Click Developer in the left sidebar (the </> icon).
Select your downloaded model from the d

DEV Community