yann ortodoro

Posted on May 22

Running Qwen 2.5 Coder 14B Locally in Cursor with Ollama

#cursor #productivity #programming

I've been leaning on AI inside my editor for a while now, and Cursor is the tool that finally made it stick. It sits right in the IDE, understands my files, genuinely good at the boring stuf, refactors...

But the more I leaned on it, the more one number kept nagging at me: tokens. Every prompt, every file I dragged in, every "explain this" : all of it burns through cloud usage, and on a busy day that adds up fast. The hard, occasional problems were worth it. The endless little ones weren't, and those were most of my day.

So the real question wasn't "is the cloud good enough". It was: why am I paying cloud tokens for work a local model could handle for free? I wanted the Cursor experience for the everyday grind without metering every keystroke against a usage limit. So I wired Cursor up to Ollama and ran Qwen 2.5 Coder 14B on my own server.

The privacy angle came along for the ride and turned out to be a genuine bonus : private repos, client code, and internal logic now stay on my own box. Saving tokens is what got me to actually do this, everything else was upside.

The thing that makes this possible is that Ollama speaks the OpenAI API : /v1/models, /v1/chat/completions, all of it. So anything expecting an OpenAI-style endpoint can be pointed at a local model instead. Cursor included.

Why bother running it locally?

I want to be clear up front: the goal was never to ditch cloud models entirely but to stop spending tokens on work that doesn't need them.

The cloud is still where I go for big architectural reasoning, nasty multi-file debugging, product strategy: the stuff where you really want the strongest model you can get, and where the token cost is genuinely worth it.

The local model handles everything else, and "everything else" turns out to be most of my day: explain this file, generate a small component, review this diff, refactor a function, draft some SQL, clean up a prompt. None of that justifies a metered cloud cal once it's can run locally.

My main project has a lot of moving parts: backend services, a Vue frontend, a pile of admin screens, complex rules, data, generated assets, modules tangled into other modules. Running the model myself gives me room to poke at all of it without second-guessing where it's going.

Why this model in particular?

I tried a handful through Ollama before settling:

qwen2.5-coder:7b
qwen2.5-coder:14b
deepseek-coder-v2:16b
qwen3:8b
qwen3:14b

The 7B is quick and light, and honestly fine for small tasks. But once you're asking for real code help, the 14B is just the better trade. It's the sweet spot between "runs comfortably on my hardware" and "actually writes decent code."

The official Qwen2.5-Coder-14B-Instruct page lists it at 14.7B parameters. Its native context is 32,768 tokens, and it stretches up to 131,072 with YaRN, a length-extrapolation trick. That headroom is what sold me, because Cursor eats context for breakfast : code, chat history, instructions, all stacked into one request...

What I was aiming for

The shape of it is simple:

Cursor (Windows)
        ↓
OpenAI-compatible API
        ↓
Ollama (Linux server)
        ↓
Qwen 2.5 Coder 14B

My Ollama box lives at http://my-ollama-host:11434, and the OpenAI-compatible endpoint is just that with /v1 tacked on http://my-ollama-host:11434/v1. That /v1 URL is the one Cursor wants as its OpenAI Base URL override. (Swap in your own hostname or IP wherever you see my-ollama-host.)

Step 1 — Pull the model

On the Linux server:

ollama pull qwen2.5-coder:14b

Check what's installed:

ollama list

Mine looks something like:

qwen3:8b
qwen2.5-coder:7b
deepseek-coder-v2:16b
qwen2.5-coder:14b
qwen3:14b
llama3.2:1b
llama3.2:3b

And confirm the API responds:

curl http://localhost:11434/v1/models

If Ollama's happy, you get back a JSON list of models.

Step 2 — Make sure Windows can actually reach it

This is the part people skip and then waste an hour on. Cursor was on Windows, Ollama was on Linux, so before touching any config I just checked that the two could talk.

From the Linux box itself:

curl http://my-ollama-host:11434/v1/models

Then from Windows PowerShell:

curl.exe http://my-ollama-host:11434/v1/models

Use curl.exe, not curl. On Windows, plain curl is usually an alias for Invoke-WebRequest, which is a different beast and will give you confusing results. The .exe forces the real thing.

Once Windows got the model list back cleanly, I knew the network was fine. Server reachable, model present, API working. Whatever broke next wasn't going to be one of those.

Step 3 — Point Cursor at it (and hit a wall)

Here's what I plugged into Cursor, which by all rights should have just worked:

Model: qwen2.5-coder:14b
OpenAI API Key: ollama
OpenAI Base URL override: http://my-ollama-host:11434/v1

Ollama uses model:tag names like qwen2.5-coder:14b totally standard. Cursor wasn't having it:

AI Model Not Found
Model name is not valid: "qwen2.5-coder:14b"

I went back and checked everything twice. Name was right. Endpoint was right. The model showed up fine in /v1/models. The model wasn't missing at all Cursor just didn't like the name. Something in its validation doesn't accept arbitrary custom model names.

The hack that fixed it

The trick is to give Ollama an alias with a name Cursor will accept, and have that alias point at the real model.

I called mine gpt-4o-mini. It does not touch OpenAI. It's Qwen, wearing a name tag Cursor recognizes.

One caveat worth saying out loud: the name doesn't have to be gpt-4o-mini. It just has to be something on Cursor's list of recognized models. I picked an OpenAI name because I knew it'd pass, pick any allowlisted name you can live with.

On the Ollama server:

cat > Modelfile.gpt-4o-mini <<'EOF'
FROM qwen2.5-coder:14b
EOF

ollama create gpt-4o-mini -f Modelfile.gpt-4o-mini

Now ollama list shows both, the alias and the thing it wraps:

gpt-4o-mini
qwen2.5-coder:14b

So the sleight of hand is just: Cursor thinks it's using gpt-4o-mini, Ollama quietly serves qwen2.5-coder:14b. That's it. That's the whole fix. Modelfiles exist precisely for this you describe a model and stamp out a new named one from it with ollama create.

The working Cursor config ended up being:

Model: gpt-4o-mini
API Key: ollama
Base URL override: http://my-ollama-host:11434/v1
What's actually running: qwen2.5-coder:14b

Stretching the context window

Once it was running, I wanted more room. Two different limits matter here and people mix them up constantly:

Context window : how much the model can see at once.
Output length : how much it can write back in one go.

For coding in Cursor, the context window is the one that bites you, because Cursor crams code, prior conversation, instructions, and file snippets into a single request. Run out of room and it quietly starts forgetting things.

Ollama controls this with num_ctx. Its docs describe context length as the max tokens the model keeps in memory, and they ship VRAM-based defaults:

Available VRAM	Default context
< 24 GiB	4k
24–48 GiB	32k
>= 48 GiB	256k

I've got 128 GiB of VRAM, so I could in theory go wild. But here's the catch nobody mentions: VRAM doesn't make a model good at long context. It just makes it possible. Push past what the model was actually trained for and you get a model that technically accepts 200k tokens and then makes things up about the first half.

And this is where that native-versus-extended distinction matters. Qwen2.5-Coder-14B is natively a 32 768-token model, the 131 072 figure only holds when you run it with YaRN extrapolation, which right now basically means vLLM. Ollama serves the GGUF build and doesn't do YaRN, so when I set num_ctx 131072 here, I'm pushing the model way past its native window without the trick that's supposed to make that work. It'll happily accept the tokens, it just gets less reliable the deeper into that range you go. So 131 072 is my hard ceiling because nothing above it is even claimed, but I treat the upper half as "use with a little suspicion" rather than gospel. In practice I run 65 536 for normal work and only reach for 131 072 when I genuinely need it. Forcing 256k for this model is pointless either way.

My two go-to configs

For everyday use:

cat > Modelfile.gpt-4o-mini <<'EOF'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 65536
PARAMETER num_predict 4096
PARAMETER temperature 0.2
EOF

ollama create gpt-4o-mini -f Modelfile.gpt-4o-mini
ollama stop gpt-4o-mini

For heavier review sessions, crank it:

cat > Modelfile.gpt-4o-mini <<'EOF'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 131072
PARAMETER num_predict 8192
PARAMETER temperature 0.2
EOF

ollama create gpt-4o-mini -f Modelfile.gpt-4o-mini
ollama stop gpt-4o-mini

What the knobs do: num_ctx is the context window, num_predict is how long a single response can run, and temperature 0.2 keeps it boring which is exactly what you want for code.

Double-checking it took

After recreating the alias:

ollama show --modelfile gpt-4o-mini

You should see your FROM line and the three PARAMETER lines staring back. Then just make sure Cursor's still pointed at gpt-4o-mini on http://my-ollama-host:11434/v1 and you're set.

The pattern I've landed on with this much VRAM is keeping both configs around 65k for fast and snappy work, 131k for when I want it chewing on a lot at once.

Where it shines, and where it doesn't

In daily use this thing pulls real weight. Reviewing Vue components, tidying admin screens, explaining backend services, refactoring a module in isolation, writing SQL, sanity-checking API logic, knocking out tests, sharpening prompts, reading through private code I'd rather not upload anywhere. On my project specifically it's been great for the admin UI, game data, skills and spells logic, world-state and movement systems, NPC and quest structures, backend performance passes, and the prompt engineering behind generated assets.

What it won't do is stand in for a top-tier cloud model on complex problems. A 14B model with a big context window is still a 14B model. Full-repo architecture reviews, gnarly multi-file refactors, debugging that spans a dozen layers, product strategy, anything security-sensitive, big design calls is still cloud territory for me.

Which is the whole point, really. Local for the frequent, cheap, fast stuff that would otherwise quietly drain your token budget. Cloud for the rare, expensive, high-stakes thinking that's worth paying for. Use both, don't pretend one replaces the other.

What I'd tell past me

If /v1/models answers from Windows, stop blaming Ollama and the network. They're fine. The problem is somewhere else.
Cursor will reject perfectly valid Ollama model names. qwen2.5-coder:14b worked everywhere except in Cursor's name check.
The fastest fix is an alias Cursor accepts : gpt-4o-mini -> qwen2.5-coder:14b did it for me.
With lots of VRAM, raise the context but cap it at what the model can actually handle. For this one, 131 072 is the advertised ceiling (and even that leans on YaRN, which Ollama doesn't apply), so I treat the top of that range with some caution.

Wrapping up

Running a local coding model inside Cursor isn't just a party trick is something I reach for every day. Cursor, Ollama, Qwen 2.5 Coder 14B, the OpenAI-compatible API, a fat context window, and enough VRAM to not worry about it: that combination is a legitimately good local dev assistant.

And the funny part is the hardest piece wasn't what I expected. Not Ollama, not the network, not the model. It was Cursor refusing a model name. Once that clicked, the fix was almost embarrassingly small: alias the model to a name Cursor likes, point Cursor at it, serve Qwen behind it, and bump num_ctx to taste.

The payoff is a setup that keeps my token spend for the work that actually deserves it for the daily work of writing, reviewing, and refactoring, it more than holds its own, and it does it without touching a usage meter.

Final result in cursor :

Top comments (3)

S M Tahosin • May 24

Token anxiety is so real. I love that you bypassed the cloud costs entirely by routing Cursor through Ollama to run Qwen 2.5 14B locally. Given that it's a 14B model, how noticeable is the latency when requesting large refactors compared to using Claude/GPT-4 directly? I might need to switch my setup this weekend!

yann ortodoro • Jul 4

Thanks for your comment, it ollama performance depends of your hardware and llm size and it will never have the same performance than a last llm version but for some standards tasks it can be ok. You have to find to right balance biut it can help to save some tokens ;)

Comment hidden by post author - thread only accessible via permalink

xulingfeng • May 24

Test comment - ignore this

Some comments have been hidden by the post's author - find out more