<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yann ortodoro</title>
    <description>The latest articles on DEV Community by yann ortodoro (@yann_ortodoro).</description>
    <link>https://dev.to/yann_ortodoro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946816%2F6c37fade-98fa-43c9-a227-f9f0cba8a84a.jpg</url>
      <title>DEV Community: yann ortodoro</title>
      <link>https://dev.to/yann_ortodoro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yann_ortodoro"/>
    <language>en</language>
    <item>
      <title>Running Qwen 2.5 Coder 14B Locally in Cursor with Ollama</title>
      <dc:creator>yann ortodoro</dc:creator>
      <pubDate>Fri, 22 May 2026 23:30:28 +0000</pubDate>
      <link>https://dev.to/yann_ortodoro/running-qwen-25-coder-14b-locally-in-cursor-with-ollama-4436</link>
      <guid>https://dev.to/yann_ortodoro/running-qwen-25-coder-14b-locally-in-cursor-with-ollama-4436</guid>
      <description>&lt;p&gt;I've been leaning on AI inside my editor for a while now, and Cursor is the tool that finally made it stick. It sits right in the IDE, understands my files, genuinely good at the boring stuf, refactors...&lt;/p&gt;

&lt;p&gt;But the more I leaned on it, the more one number kept nagging at me: &lt;strong&gt;tokens&lt;/strong&gt;. Every prompt, every file I dragged in, every "explain this" : all of it burns through cloud usage, and on a busy day that adds up fast. The hard, occasional problems were worth it. The endless little ones weren't, and those were most of my day.&lt;/p&gt;

&lt;p&gt;So the real question wasn't "is the cloud good enough". It was: why am I paying cloud tokens for work a local model could handle for free? I wanted the Cursor experience for the everyday grind without metering every keystroke against a usage limit. So I wired Cursor up to Ollama and ran Qwen 2.5 Coder 14B on my own server.&lt;/p&gt;

&lt;p&gt;The privacy angle came along for the ride and turned out to be a genuine bonus : private repos, client code, and internal logic now stay on my own box. Saving tokens is what got me to actually do this, everything else was upside.&lt;/p&gt;

&lt;p&gt;The thing that makes this possible is that Ollama speaks the OpenAI API : &lt;code&gt;/v1/models&lt;/code&gt;, &lt;code&gt;/v1/chat/completions&lt;/code&gt;, all of it. So anything expecting an OpenAI-style endpoint can be pointed at a local model instead. Cursor included.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why bother running it locally?
&lt;/h2&gt;

&lt;p&gt;I want to be clear up front: the goal was never to ditch cloud models entirely but to stop spending tokens on work that doesn't need them.&lt;/p&gt;

&lt;p&gt;The cloud is still where I go for big architectural reasoning, nasty multi-file debugging, product strategy: the stuff where you really want the strongest model you can get, and where the token cost is genuinely worth it.&lt;/p&gt;

&lt;p&gt;The local model handles everything else, and "everything else" turns out to be most of my day: explain this file, generate a small component, review this diff, refactor a function, draft some SQL, clean up a prompt. None of that justifies a metered cloud cal once it's can run locally. &lt;/p&gt;

&lt;p&gt;My main project has a lot of moving parts: backend services, a Vue frontend, a pile of admin screens, complex rules, data, generated assets, modules tangled into other modules. Running the model myself gives me room to poke at all of it without second-guessing where it's going.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this model in particular?
&lt;/h2&gt;

&lt;p&gt;I tried a handful through Ollama before settling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;qwen2.5-coder:7b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen2.5-coder:14b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;deepseek-coder-v2:16b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen3:8b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen3:14b&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 7B is quick and light, and honestly fine for small tasks. But once you're asking for real code help, the 14B is just the better trade. It's the sweet spot between "runs comfortably on my hardware" and "actually writes decent code."&lt;/p&gt;

&lt;p&gt;The official Qwen2.5-Coder-14B-Instruct page lists it at 14.7B parameters. Its native context is 32,768 tokens, and it stretches up to 131,072 with YaRN, a length-extrapolation trick. That headroom is what sold me, because Cursor eats context for breakfast : code, chat history, instructions, all stacked into one request... &lt;/p&gt;

&lt;h2&gt;
  
  
  What I was aiming for
&lt;/h2&gt;

&lt;p&gt;The shape of it is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cursor (Windows)
        ↓
OpenAI-compatible API
        ↓
Ollama (Linux server)
        ↓
Qwen 2.5 Coder 14B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My Ollama box lives at &lt;code&gt;http://my-ollama-host:11434&lt;/code&gt;, and the OpenAI-compatible endpoint is just that with &lt;code&gt;/v1&lt;/code&gt; tacked on &lt;code&gt;http://my-ollama-host:11434/v1&lt;/code&gt;. That &lt;code&gt;/v1&lt;/code&gt; URL is the one Cursor wants as its &lt;strong&gt;OpenAI Base URL override&lt;/strong&gt;. (Swap in your own hostname or IP wherever you see &lt;code&gt;my-ollama-host&lt;/code&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Pull the model
&lt;/h2&gt;

&lt;p&gt;On the Linux server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5-coder:14b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check what's installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mine looks something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;qwen3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;8b&lt;/span&gt;
&lt;span class="py"&gt;qwen2.5-coder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;7b&lt;/span&gt;
&lt;span class="py"&gt;deepseek-coder-v2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;16b&lt;/span&gt;
&lt;span class="py"&gt;qwen2.5-coder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;14b&lt;/span&gt;
&lt;span class="py"&gt;qwen3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;14b&lt;/span&gt;
&lt;span class="py"&gt;llama3.2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;1b&lt;/span&gt;
&lt;span class="py"&gt;llama3.2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;3b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And confirm the API responds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Ollama's happy, you get back a JSON list of models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Make sure Windows can actually reach it
&lt;/h2&gt;

&lt;p&gt;This is the part people skip and then waste an hour on. Cursor was on Windows, Ollama was on Linux, so before touching any config I just checked that the two could talk.&lt;/p&gt;

&lt;p&gt;From the Linux box itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://my-ollama-host:11434/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then from Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;curl.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;http://my-ollama-host:11434/v1/models&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;curl.exe&lt;/code&gt;, not &lt;code&gt;curl&lt;/code&gt;. On Windows, plain &lt;code&gt;curl&lt;/code&gt; is usually an alias for &lt;code&gt;Invoke-WebRequest&lt;/code&gt;, which is a different beast and will give you confusing results. The &lt;code&gt;.exe&lt;/code&gt; forces the real thing.&lt;/p&gt;

&lt;p&gt;Once Windows got the model list back cleanly, I knew the network was fine. Server reachable, model present, API working. Whatever broke next wasn't going to be one of those.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Point Cursor at it (and hit a wall)
&lt;/h2&gt;

&lt;p&gt;Here's what I plugged into Cursor, which by all rights should have just worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI API Key:&lt;/strong&gt; &lt;code&gt;ollama&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Base URL override:&lt;/strong&gt; &lt;code&gt;http://my-ollama-host:11434/v1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ollama uses &lt;code&gt;model:tag&lt;/code&gt; names like &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt; totally standard. Cursor wasn't having it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI Model Not Found
Model name is not valid: "qwen2.5-coder:14b"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I went back and checked everything twice. Name was right. Endpoint was right. The model showed up fine in &lt;code&gt;/v1/models&lt;/code&gt;. The model wasn't missing at all Cursor just didn't like the &lt;em&gt;name&lt;/em&gt;. Something in its validation doesn't accept arbitrary custom model names.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hack that fixed it
&lt;/h2&gt;

&lt;p&gt;The trick is to give Ollama an alias with a name Cursor &lt;em&gt;will&lt;/em&gt; accept, and have that alias point at the real model.&lt;/p&gt;

&lt;p&gt;I called mine &lt;code&gt;gpt-4o-mini&lt;/code&gt;. It does not touch OpenAI. It's Qwen, wearing a name tag Cursor recognizes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One caveat worth saying out loud: the name doesn't have to be &lt;code&gt;gpt-4o-mini&lt;/code&gt;. It just has to be something on Cursor's list of recognized models. I picked an OpenAI name because I knew it'd pass, pick any allowlisted name you can live with.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the Ollama server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Modelfile.gpt-4o-mini &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
FROM qwen2.5-coder:14b
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;ollama create gpt-4o-mini &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile.gpt-4o-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;ollama list&lt;/code&gt; shows both, the alias and the thing it wraps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;
&lt;span class="s"&gt;qwen2.5-coder:14b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the sleight of hand is just: Cursor thinks it's using &lt;code&gt;gpt-4o-mini&lt;/code&gt;, Ollama quietly serves &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt;. That's it. That's the whole fix. Modelfiles exist precisely for this  you describe a model and stamp out a new named one from it with &lt;code&gt;ollama create&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The working Cursor config ended up being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; &lt;code&gt;gpt-4o-mini&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key:&lt;/strong&gt; &lt;code&gt;ollama&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base URL override:&lt;/strong&gt; &lt;code&gt;http://my-ollama-host:11434/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's actually running:&lt;/strong&gt; &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stretching the context window
&lt;/h2&gt;

&lt;p&gt;Once it was running, I wanted more room. Two different limits matter here and people mix them up constantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window&lt;/strong&gt; : how much the model can &lt;em&gt;see&lt;/em&gt; at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output length&lt;/strong&gt; : how much it can &lt;em&gt;write back&lt;/em&gt; in one go.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For coding in Cursor, the context window is the one that bites you, because Cursor crams code, prior conversation, instructions, and file snippets into a single request. Run out of room and it quietly starts forgetting things.&lt;/p&gt;

&lt;p&gt;Ollama controls this with &lt;code&gt;num_ctx&lt;/code&gt;. Its docs describe context length as the max tokens the model keeps in memory, and they ship VRAM-based defaults:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Available VRAM&lt;/th&gt;
&lt;th&gt;Default context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 24 GiB&lt;/td&gt;
&lt;td&gt;4k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24–48 GiB&lt;/td&gt;
&lt;td&gt;32k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;= 48 GiB&lt;/td&gt;
&lt;td&gt;256k&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I've got 128 GiB of VRAM, so I could in theory go wild. But here's the catch nobody mentions: VRAM doesn't make a model good at long context. It just makes it &lt;em&gt;possible&lt;/em&gt;. Push past what the model was actually trained for and you get a model that technically accepts 200k tokens and then makes things up about the first half.&lt;/p&gt;

&lt;p&gt;And this is where that native-versus-extended distinction matters. Qwen2.5-Coder-14B is natively a 32 768-token model, the 131 072 figure only holds when you run it with YaRN extrapolation, which right now basically means vLLM. Ollama serves the GGUF build and doesn't do YaRN, so when I set &lt;code&gt;num_ctx 131072&lt;/code&gt; here, I'm pushing the model way past its native window &lt;em&gt;without&lt;/em&gt; the trick that's supposed to make that work. It'll happily accept the tokens, it just gets less reliable the deeper into that range you go. So 131 072 is my hard ceiling because nothing above it is even claimed, but I treat the upper half as "use with a little suspicion" rather than gospel. In practice I run 65 536 for normal work and only reach for 131 072 when I genuinely need it. Forcing 256k for this model is pointless either way.&lt;/p&gt;

&lt;h3&gt;
  
  
  My two go-to configs
&lt;/h3&gt;

&lt;p&gt;For everyday use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Modelfile.gpt-4o-mini &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 65536
PARAMETER num_predict 4096
PARAMETER temperature 0.2
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;ollama create gpt-4o-mini &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile.gpt-4o-mini
ollama stop gpt-4o-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For heavier review sessions, crank it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Modelfile.gpt-4o-mini &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 131072
PARAMETER num_predict 8192
PARAMETER temperature 0.2
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;ollama create gpt-4o-mini &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile.gpt-4o-mini
ollama stop gpt-4o-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What the knobs do: &lt;code&gt;num_ctx&lt;/code&gt; is the context window, &lt;code&gt;num_predict&lt;/code&gt; is how long a single response can run, and &lt;code&gt;temperature 0.2&lt;/code&gt; keeps it boring which is exactly what you want for code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Double-checking it took
&lt;/h3&gt;

&lt;p&gt;After recreating the alias:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama show &lt;span class="nt"&gt;--modelfile&lt;/span&gt; gpt-4o-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your &lt;code&gt;FROM&lt;/code&gt; line and the three &lt;code&gt;PARAMETER&lt;/code&gt; lines staring back. Then just make sure Cursor's still pointed at &lt;code&gt;gpt-4o-mini&lt;/code&gt; on &lt;code&gt;http://my-ollama-host:11434/v1&lt;/code&gt; and you're set.&lt;/p&gt;

&lt;p&gt;The pattern I've landed on with this much VRAM is keeping both configs around 65k for fast and snappy work, 131k for when I want it chewing on a lot at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it shines, and where it doesn't
&lt;/h2&gt;

&lt;p&gt;In daily use this thing pulls real weight. Reviewing Vue components, tidying admin screens, explaining backend services, refactoring a module in isolation, writing SQL, sanity-checking API logic, knocking out tests, sharpening prompts, reading through private code I'd rather not upload anywhere. On my project specifically it's been great for the admin UI, game data, skills and spells logic, world-state and movement systems, NPC and quest structures, backend performance passes, and the prompt engineering behind generated assets.&lt;/p&gt;

&lt;p&gt;What it won't do is stand in for a top-tier cloud model on complex problems. A 14B model with a big context window is still a 14B model. Full-repo architecture reviews, gnarly multi-file refactors, debugging that spans a dozen layers, product strategy, anything security-sensitive, big design calls is still cloud territory for me.&lt;/p&gt;

&lt;p&gt;Which is the whole point, really. Local for the frequent, cheap, fast stuff that would otherwise quietly drain your token budget. Cloud for the rare, expensive, high-stakes thinking that's worth paying for. Use both, don't pretend one replaces the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past me
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If &lt;code&gt;/v1/models&lt;/code&gt; answers from Windows, stop blaming Ollama and the network. They're fine. The problem is somewhere else.&lt;/li&gt;
&lt;li&gt;Cursor will reject perfectly valid Ollama model names. &lt;code&gt;qwen2.5-coder:14b&lt;/code&gt; worked everywhere except in Cursor's name check.&lt;/li&gt;
&lt;li&gt;The fastest fix is an alias Cursor accepts : &lt;code&gt;gpt-4o-mini -&amp;gt; qwen2.5-coder:14b&lt;/code&gt; did it for me.&lt;/li&gt;
&lt;li&gt;With lots of VRAM, raise the context but cap it at what the model can actually handle. For this one, 131 072 is the advertised ceiling (and even that leans on YaRN, which Ollama doesn't apply), so I treat the top of that range with some caution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Running a local coding model inside Cursor isn't just a party trick is something I reach for every day. Cursor, Ollama, Qwen 2.5 Coder 14B, the OpenAI-compatible API, a fat context window, and enough VRAM to not worry about it: that combination is a legitimately good local dev assistant.&lt;/p&gt;

&lt;p&gt;And the funny part is the hardest piece wasn't what I expected. Not Ollama, not the network, not the model. It was Cursor refusing a model name. Once that clicked, the fix was almost embarrassingly small: alias the model to a name Cursor likes, point Cursor at it, serve Qwen behind it, and bump &lt;code&gt;num_ctx&lt;/code&gt; to taste.&lt;/p&gt;

&lt;p&gt;The payoff is a setup that keeps my token spend for the work that actually deserves it for the daily work of writing, reviewing, and refactoring, it more than holds its own, and it does it without touching a usage meter.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Final result in cursor :&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4j9tpycud6rcs5cwg0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4j9tpycud6rcs5cwg0t.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
