DEV Community: yann ortodoro

Prompt Improvement — One brain for all my AI tools

yann ortodoro — Wed, 08 Jul 2026 05:31:24 +0000

I stopped sending lazy prompts to my coding agent — and the fix grew into a self-hosted layer that routes, remembers, and meters everything. The goad was to be more efficient on the requests I provided to the AI (Cursor mainly) with less efforts.

If you read my post about running Qwen locally inside Cursor, this is the next chapter of the same obsession: getting more from AI tools, with more control and more efficient way.

Here is a prompt I actually sent to my coding agent:

do a deep deepdive anaylsis of the project as IT architect and product owner and prepare a backlog.

One sentence, two typos, three jobs — and the agent would confidently guess everything I didn’t say. So I built a layer that rewrites every message into a proper task specification before any model sees it. What the agent received (trimmed):

# Improved task specification

## Goal
As both IT Architect and Product Owner, perform a deep-dive analysis of the
current project and produce a prioritized backlog ready for sprint planning.

## Deliverables
1. Project Analysis Report — architecture, risks, technical debt, gaps
2. Product Backlog — items with ID, type, acceptance criteria, priority,
   complexity, dependencies
3. Epic Roadmap — phased: MVP → enhancements → scale
4. One ADR per major architectural decision or risk
5. BACKLOG.md and ARCHITECTURE.md committed to the repo root

## Constraints
- Technical debt appears as first-class backlog entries — never hidden
- Priority set from a dual lens: business value AND architectural risk

## Definition of done
- [ ] All layers covered: frontend, backend, data, infra, security
- [ ] Every identified risk maps to a backlog item
- [ ] Sign-off from both Architect and Product Owner perspectives

One sentence in; a work plan out. Deliverables the agent can be held to, constraints I would never think to type, and a definition of done that makes the output checkable — with my original words preserved at the bottom, nothing hidden. A cheap or local model can does the rewrite for fractions of a cent.

The auto-improve pipeline: from one lazy sentence to a canonical task specification. (Claude AI generated)

One fix became a layer

Once every prompt passed through one place, the rest followed naturally.

Routing. Each task type — code, search, reasoning — has an ordered list of models ranked by quality. The router takes the best available one; cost only breaks ties between models that are effectively interchangeable; and everything falls back to a local Ollama model, free and always on. Cost breaks ties — it never downgrades.

Quality-first routing: cost breaks ties, never downgrades (Claude AI generated)

Templates and memory. Prompts worth keeping become versioned templates with parameters, available in every project on every machine (if centralized on a server ) — and a built-in importer seeds the library from public prompt collections, so it’s useful before you’ve saved a single prompt of your own. Facts I explicitly ask it to remember persist with a scope, private or shareable.

Accounting. Every call writes a row — model, tokens, cost, latency — so “what did AI cost me this week?” finally has a number for an answer.

Two doors into everything. As an MCP server, the tools appear natively inside Cursor, Claude, or any client that speaks the protocol — no UI to build, because the host app is the interface. As an OpenAI-compatible endpoint, it exposes virtual models like route-code, so pointing an editor's model setting at it sends real traffic through my own routing.

How it’s built

One core engine, two thin faces: all the logic lives in one place, and the MCP server and HTTP gateway are adapters over it. The plumbing is deliberately boring and bought, not built :

Python 3.12, LiteLLM as the one adapter for every provider (the multi-provider gateway is a commodity in 2026; originality belongs above it),
SQLite in WAL mode as the single shared brain,
Streamable HTTP with a bearer token that fails closed.

It runs on my home Linux dev server, firewalled to the LAN. Local-first is not an aesthetic: a layer that sees every prompt is only acceptable if you own it.
And no — it cannot pool your flat subscriptions; those expose no API. It works on API keys plus a local model, full stop.

How an MCP tool call flows, with the technology at each layer.

Few days of real use

requests            64
tokens              113,805
success rate        96.9 %
total cost          $1.12
top model           claude-sonnet   33 calls · $0.82
local floor hit     1 call          $0.00

Routing behaved exactly as configured — the quality pick led, a cheaper model absorbed routine calls, and the local floor caught a failure. And a week of AI-assisted work cost less than a coffee, which says where the real value lies: not in saving money, but in making every interaction better and every cost visible.

The result is Ylang

That layer is now an open-source project: Ylang, MIT-licensed — github.com/Yann-0/ylang.

Two features are built into its seams and deliberately dormant until there is enough data: a budget meter for when several people share one server, and pattern learning that notices which prompts I keep rewriting and which models I actually prefer.

If you live across several AI tools and would rather they shared one brain you own: install it, connect your editor, and type something lazy. Then tell me the one thing I actually want to know — after a week, did you leave it running?

Running Qwen 2.5 Coder 14B Locally in Cursor with Ollama

yann ortodoro — Fri, 22 May 2026 23:30:28 +0000

I've been leaning on AI inside my editor for a while now, and Cursor is the tool that finally made it stick. It sits right in the IDE, understands my files, genuinely good at the boring stuf, refactors...

But the more I leaned on it, the more one number kept nagging at me: tokens. Every prompt, every file I dragged in, every "explain this" : all of it burns through cloud usage, and on a busy day that adds up fast. The hard, occasional problems were worth it. The endless little ones weren't, and those were most of my day.

So the real question wasn't "is the cloud good enough". It was: why am I paying cloud tokens for work a local model could handle for free? I wanted the Cursor experience for the everyday grind without metering every keystroke against a usage limit. So I wired Cursor up to Ollama and ran Qwen 2.5 Coder 14B on my own server.

The privacy angle came along for the ride and turned out to be a genuine bonus : private repos, client code, and internal logic now stay on my own box. Saving tokens is what got me to actually do this, everything else was upside.

The thing that makes this possible is that Ollama speaks the OpenAI API : /v1/models, /v1/chat/completions, all of it. So anything expecting an OpenAI-style endpoint can be pointed at a local model instead. Cursor included.

Why bother running it locally?

I want to be clear up front: the goal was never to ditch cloud models entirely but to stop spending tokens on work that doesn't need them.

The cloud is still where I go for big architectural reasoning, nasty multi-file debugging, product strategy: the stuff where you really want the strongest model you can get, and where the token cost is genuinely worth it.

The local model handles everything else, and "everything else" turns out to be most of my day: explain this file, generate a small component, review this diff, refactor a function, draft some SQL, clean up a prompt. None of that justifies a metered cloud cal once it's can run locally.

My main project has a lot of moving parts: backend services, a Vue frontend, a pile of admin screens, complex rules, data, generated assets, modules tangled into other modules. Running the model myself gives me room to poke at all of it without second-guessing where it's going.

Why this model in particular?

I tried a handful through Ollama before settling:

qwen2.5-coder:7b
qwen2.5-coder:14b
deepseek-coder-v2:16b
qwen3:8b
qwen3:14b

The 7B is quick and light, and honestly fine for small tasks. But once you're asking for real code help, the 14B is just the better trade. It's the sweet spot between "runs comfortably on my hardware" and "actually writes decent code."

The official Qwen2.5-Coder-14B-Instruct page lists it at 14.7B parameters. Its native context is 32,768 tokens, and it stretches up to 131,072 with YaRN, a length-extrapolation trick. That headroom is what sold me, because Cursor eats context for breakfast : code, chat history, instructions, all stacked into one request...

What I was aiming for

The shape of it is simple:

Cursor (Windows)
        ↓
OpenAI-compatible API
        ↓
Ollama (Linux server)
        ↓
Qwen 2.5 Coder 14B

My Ollama box lives at http://my-ollama-host:11434, and the OpenAI-compatible endpoint is just that with /v1 tacked on http://my-ollama-host:11434/v1. That /v1 URL is the one Cursor wants as its OpenAI Base URL override. (Swap in your own hostname or IP wherever you see my-ollama-host.)

Step 1 — Pull the model

On the Linux server:

ollama pull qwen2.5-coder:14b

Check what's installed:

ollama list

Mine looks something like:

qwen3:8b
qwen2.5-coder:7b
deepseek-coder-v2:16b
qwen2.5-coder:14b
qwen3:14b
llama3.2:1b
llama3.2:3b

And confirm the API responds:

curl http://localhost:11434/v1/models

If Ollama's happy, you get back a JSON list of models.

Step 2 — Make sure Windows can actually reach it

This is the part people skip and then waste an hour on. Cursor was on Windows, Ollama was on Linux, so before touching any config I just checked that the two could talk.

From the Linux box itself:

curl http://my-ollama-host:11434/v1/models

Then from Windows PowerShell:

curl.exe http://my-ollama-host:11434/v1/models

Use curl.exe, not curl. On Windows, plain curl is usually an alias for Invoke-WebRequest, which is a different beast and will give you confusing results. The .exe forces the real thing.

Once Windows got the model list back cleanly, I knew the network was fine. Server reachable, model present, API working. Whatever broke next wasn't going to be one of those.

Step 3 — Point Cursor at it (and hit a wall)

Here's what I plugged into Cursor, which by all rights should have just worked:

Model: qwen2.5-coder:14b
OpenAI API Key: ollama
OpenAI Base URL override: http://my-ollama-host:11434/v1

Ollama uses model:tag names like qwen2.5-coder:14b totally standard. Cursor wasn't having it:

AI Model Not Found
Model name is not valid: "qwen2.5-coder:14b"

I went back and checked everything twice. Name was right. Endpoint was right. The model showed up fine in /v1/models. The model wasn't missing at all Cursor just didn't like the name. Something in its validation doesn't accept arbitrary custom model names.

The hack that fixed it

The trick is to give Ollama an alias with a name Cursor will accept, and have that alias point at the real model.

I called mine gpt-4o-mini. It does not touch OpenAI. It's Qwen, wearing a name tag Cursor recognizes.

One caveat worth saying out loud: the name doesn't have to be gpt-4o-mini. It just has to be something on Cursor's list of recognized models. I picked an OpenAI name because I knew it'd pass, pick any allowlisted name you can live with.

On the Ollama server:

cat > Modelfile.gpt-4o-mini <<'EOF'
FROM qwen2.5-coder:14b
EOF

ollama create gpt-4o-mini -f Modelfile.gpt-4o-mini

Now ollama list shows both, the alias and the thing it wraps:

gpt-4o-mini
qwen2.5-coder:14b

So the sleight of hand is just: Cursor thinks it's using gpt-4o-mini, Ollama quietly serves qwen2.5-coder:14b. That's it. That's the whole fix. Modelfiles exist precisely for this you describe a model and stamp out a new named one from it with ollama create.

The working Cursor config ended up being:

Model: gpt-4o-mini
API Key: ollama
Base URL override: http://my-ollama-host:11434/v1
What's actually running: qwen2.5-coder:14b

Stretching the context window

Once it was running, I wanted more room. Two different limits matter here and people mix them up constantly:

Context window : how much the model can see at once.
Output length : how much it can write back in one go.

For coding in Cursor, the context window is the one that bites you, because Cursor crams code, prior conversation, instructions, and file snippets into a single request. Run out of room and it quietly starts forgetting things.

Ollama controls this with num_ctx. Its docs describe context length as the max tokens the model keeps in memory, and they ship VRAM-based defaults:

Available VRAM	Default context
< 24 GiB	4k
24–48 GiB	32k
>= 48 GiB	256k

I've got 128 GiB of VRAM, so I could in theory go wild. But here's the catch nobody mentions: VRAM doesn't make a model good at long context. It just makes it possible. Push past what the model was actually trained for and you get a model that technically accepts 200k tokens and then makes things up about the first half.

And this is where that native-versus-extended distinction matters. Qwen2.5-Coder-14B is natively a 32 768-token model, the 131 072 figure only holds when you run it with YaRN extrapolation, which right now basically means vLLM. Ollama serves the GGUF build and doesn't do YaRN, so when I set num_ctx 131072 here, I'm pushing the model way past its native window without the trick that's supposed to make that work. It'll happily accept the tokens, it just gets less reliable the deeper into that range you go. So 131 072 is my hard ceiling because nothing above it is even claimed, but I treat the upper half as "use with a little suspicion" rather than gospel. In practice I run 65 536 for normal work and only reach for 131 072 when I genuinely need it. Forcing 256k for this model is pointless either way.

My two go-to configs

For everyday use:

cat > Modelfile.gpt-4o-mini <<'EOF'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 65536
PARAMETER num_predict 4096
PARAMETER temperature 0.2
EOF

ollama create gpt-4o-mini -f Modelfile.gpt-4o-mini
ollama stop gpt-4o-mini

For heavier review sessions, crank it:

cat > Modelfile.gpt-4o-mini <<'EOF'
FROM qwen2.5-coder:14b
PARAMETER num_ctx 131072
PARAMETER num_predict 8192
PARAMETER temperature 0.2
EOF

ollama create gpt-4o-mini -f Modelfile.gpt-4o-mini
ollama stop gpt-4o-mini

What the knobs do: num_ctx is the context window, num_predict is how long a single response can run, and temperature 0.2 keeps it boring which is exactly what you want for code.

Double-checking it took

After recreating the alias:

ollama show --modelfile gpt-4o-mini

You should see your FROM line and the three PARAMETER lines staring back. Then just make sure Cursor's still pointed at gpt-4o-mini on http://my-ollama-host:11434/v1 and you're set.

The pattern I've landed on with this much VRAM is keeping both configs around 65k for fast and snappy work, 131k for when I want it chewing on a lot at once.

Where it shines, and where it doesn't

In daily use this thing pulls real weight. Reviewing Vue components, tidying admin screens, explaining backend services, refactoring a module in isolation, writing SQL, sanity-checking API logic, knocking out tests, sharpening prompts, reading through private code I'd rather not upload anywhere. On my project specifically it's been great for the admin UI, game data, skills and spells logic, world-state and movement systems, NPC and quest structures, backend performance passes, and the prompt engineering behind generated assets.

What it won't do is stand in for a top-tier cloud model on complex problems. A 14B model with a big context window is still a 14B model. Full-repo architecture reviews, gnarly multi-file refactors, debugging that spans a dozen layers, product strategy, anything security-sensitive, big design calls is still cloud territory for me.

Which is the whole point, really. Local for the frequent, cheap, fast stuff that would otherwise quietly drain your token budget. Cloud for the rare, expensive, high-stakes thinking that's worth paying for. Use both, don't pretend one replaces the other.

What I'd tell past me

If /v1/models answers from Windows, stop blaming Ollama and the network. They're fine. The problem is somewhere else.
Cursor will reject perfectly valid Ollama model names. qwen2.5-coder:14b worked everywhere except in Cursor's name check.
The fastest fix is an alias Cursor accepts : gpt-4o-mini -> qwen2.5-coder:14b did it for me.
With lots of VRAM, raise the context but cap it at what the model can actually handle. For this one, 131 072 is the advertised ceiling (and even that leans on YaRN, which Ollama doesn't apply), so I treat the top of that range with some caution.

Wrapping up

Running a local coding model inside Cursor isn't just a party trick is something I reach for every day. Cursor, Ollama, Qwen 2.5 Coder 14B, the OpenAI-compatible API, a fat context window, and enough VRAM to not worry about it: that combination is a legitimately good local dev assistant.

And the funny part is the hardest piece wasn't what I expected. Not Ollama, not the network, not the model. It was Cursor refusing a model name. Once that clicked, the fix was almost embarrassingly small: alias the model to a name Cursor likes, point Cursor at it, serve Qwen behind it, and bump num_ctx to taste.

The payoff is a setup that keeps my token spend for the work that actually deserves it for the daily work of writing, reviewing, and refactoring, it more than holds its own, and it does it without touching a usage meter.

Final result in cursor :