DEV Community: christian daniel

How much does it really cost to use AI models for coding?

christian daniel — Sat, 16 May 2026 05:25:05 +0000

I’ve been reading several posts about the true inference cost of AI models.

But it wasn’t until I ran my own numbers that I was genuinely stunned.

For 14 days, from May 3 to May 16, I used three models classified as Open Weights for a personal project where I’m building both the backend in Nest.js and the frontend in React.

These were my usage numbers:

MoonshotAI: Kimi K2.6
Input: 267,755,276 tokens
Output: 1,941,655 tokens

DeepSeek: DeepSeek V4 Pro
Input: 136,286,132 tokens
Output: 867,593 tokens

Xiaomi: MiMo-V2.5-Pro
Input: 2,791,785 tokens
Output: 59,251 tokens

In total:

Input: 406,833,193 tokens
Output: 2,868,499 tokens
Total: 409,701,692 tokens

More than 400 million tokens.

I’m using an Opencode Go subscription, which cost me USD 5 for the first month. Starting from the second month, it costs USD 10/month.

And in those 14 days, I already hit the monthly rate limits.

But wait…

USD 5 for more than 400M tokens?

Yes. USD 5.

That made me wonder:

How much would this exact same amount of tokens have cost using a traditional inference provider?

So I went to OpenRouter and looked up the average prices of the models I had been using:

DeepSeek: DeepSeek V4 Pro
Input:  USD 0.316 / 1M tokens
Output: USD 1.74 / 1M tokens

MoonshotAI: Kimi K2.6
Input:  USD 0.306 / 1M tokens
Output: USD 3.84 / 1M tokens

Xiaomi: MiMo-V2.5-Pro
Input:  USD 0.470 / 1M tokens
Output: USD 3.07 / 1M tokens

Doing the math, if I had used OpenRouter as the inference provider, the cost for those 14 days would have been approximately:

Model	Input	Output	Total
MoonshotAI: Kimi K2.6	USD 81.93	USD 7.46	USD 89.39
DeepSeek: DeepSeek V4 Pro	USD 43.07	USD 1.51	USD 44.58
Xiaomi: MiMo-V2.5-Pro	USD 1.31	USD 0.18	USD 1.49

Total: USD 135.46 in 14 days

Extrapolated to a 30-day month:

Model	Monthly estimate
MoonshotAI: Kimi K2.6	USD 191.55
DeepSeek: DeepSeek V4 Pro	USD 95.52
Xiaomi: MiMo-V2.5-Pro	USD 3.20

Estimated monthly total: USD 290.27

But of course, another important factor comes into play here: cache.

Inference providers usually apply discounts when part of the input prompt comes from cache, meaning tokens from the prompt were already processed before and can be reused.

So I ran another calculation assuming:

Cache hit rate: 70%
Cached input cost: 20% of the normal cost

That means the effective input cost becomes:

70% × 20% + 30% × 100% = 44%

In other words, input tokens would cost 56% less, while output tokens would remain the same.

Under that assumption, the cost of my 14 days of usage would have been:

Model	Input with cache	Output	Total
MoonshotAI: Kimi K2.6	USD 36.05	USD 7.46	USD 43.51
DeepSeek: DeepSeek V4 Pro	USD 18.95	USD 1.51	USD 20.46
Xiaomi: MiMo-V2.5-Pro	USD 0.58	USD 0.18	USD 0.76

Total with cache: USD 64.72 in 14 days

Extrapolated to 30 days:

Model	Monthly estimate with cache
MoonshotAI: Kimi K2.6	USD 93.23
DeepSeek: DeepSeek V4 Pro	USD 43.84
Xiaomi: MiMo-V2.5-Pro	USD 1.63

Estimated monthly total with cache: USD 138.70

That represents approximately 52.2% less than the calculation without cache.

Then I did the same exercise assuming I used GPT-5.4 the entire time, also applying the cache hit discount.

The estimated monthly result was approximately:

USD 690.77/month

So the comparison looks like this:

Opencode Go:
USD 10/month

Estimated cost using the same Open Weights models via OpenRouter with cache:
USD 138.70/month

Estimated cost using GPT-5.4 with cache:
USD 690.77/month

Put another way:

with Opencode Go, I’d be paying approximately 7.2% of what it would cost to use those same Open Weights models via OpenRouter;
and just 1.4% of what it would cost to use GPT-5.4 under the same usage pattern.

And if I take the first-month promotional price, USD 5, the difference is even more dramatic:

3.6% compared to the estimated cost with Open Weights models;
0.7% compared to the estimated cost with GPT-5.4.

This leaves me with one question:

How are these subscription models actually sustainable?

Do published inference prices reflect the real cost?

Are subscriptions being subsidized?

Or are we at a stage where many companies are absorbing losses to capture users and volume?

I don’t have a definitive answer.

But after running these numbers, it’s clear to me that the real cost of using AI for intensive development is not as obvious as it seems.

And that behind a seemingly simple monthly subscription, there may be a much more complex economy at play.

AI #LLM #SoftwareDevelopment #OpenWeights #AIEngineering #DeveloperTools

Run Your Own Local AI Chat with OpenWebUI and llama.cpp - Windows

christian daniel — Sat, 28 Feb 2026 20:16:06 +0000

TL;DR: A local ChatGPT-like stack using OpenWebUI as the UI and llama.cpp as the inference server, with a GGUF model from Hugging Face. Everything talks over an OpenAI-compatible API. No API bills, no data leaving your machine.

Why this matters

Privacy: Prompts and replies stay on your machine.
No API bills: No usage-based pricing or quotas.
Control: You pick the model, quantization, and context size.
Open source: OpenWebUI and llama.cpp are free and auditable.

I wanted a local tool for LLM tasks that don't need a paid API: drafts, small scripts, experiments. This setup does that.

Who this is for

Anyone who wants a local AI chat without subscriptions. No prior LLM experience required; this is mostly wiring a UI to a local server.

My setup (Windows)

OS: Windows 11
RAM: 16 GB minimum; 32 GB helps for larger models
GPU: optional but recommended for speed (I have a GPU with 8 GB of VRAM)
Disk: Enough for multi-GB model files (often 4–8 GB per model)

A 7B model in Q4 quantization runs on many machines; bigger models need more memory.

Architecture overview

Three pieces:

OpenWebUI: the browser UI (chat, history, model selection)
llama.cpp server: local inference with an OpenAI-compatible HTTP API
GGUF model: weights you download once and keep on disk

OpenWebUI talks to llama-server over HTTP. No cloud in the loop.

Step 1: Install llama.cpp (Windows, prebuilt CUDA)

Prebuilt binaries are the fastest way to a working server.

1.1 Check your CUDA version (NVIDIA only)

In PowerShell:

nvidia-smi

Note the CUDA Version line (e.g. 12.x). You'll use it to choose the right llama.cpp build.

1.2 Download the release and CUDA runtime bundle

Open llama.cpp releases.
Pick the release that matches your CUDA version (e.g. CUDA 12) and download it.
Download the CUDA runtime DLL bundle from Assets (e.g. cudart-llama-bin-win-cuda-12).

The extra DLL bundle matters: the CUDA build often needs runtime DLLs that aren't on your PATH. Putting them next to the executables avoids "missing DLL" errors.

1.3 Extract and add to PATH

Extract the main archive to a stable folder, for example:

C:\Program Files\llama.cpp\

Add that folder to your system PATH (Windows search → Environment Variables → Path → Edit → New). That way you can run llama-server and llama-cli from any terminal without the full path.

1.4 Copy CUDA DLLs (NVIDIA only)

Extract the CUDA runtime bundle and copy all .dll files into the same folder as llama-server.exe (the one on your PATH).

1.5 Verify

Open a new terminal (so PATH is refreshed) and run:

llama-server --help

If you see help output, the install is good.

Step 2: Install and run OpenWebUI (Windows, no Docker)

OpenWebUI is a self-hosted chat UI. A straightforward option to install it is through a Python venv.

2.1 Create venv and install

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install open-webui

Alternative (Conda):

conda create -n local_chat python=3.11 -y
conda activate local_chat
pip install open-webui

2.2 Run OpenWebUI

open-webui serve

Open http://localhost:8080 (or the port shown in the terminal). You'll see the UI; the model connection comes in Step 4.

Step 3: Download a GGUF model from Hugging Face and start the server

Start with a smaller model so you can confirm the pipeline before throwing RAM at bigger ones.

Example model used in this post: Qwen2.5-Coder-7B-Instruct-GGUF. I used the Q4_K_M quantized file.

On the Hugging Face repo you'll see several quantizations (Q2–Q8). Q4 is a good balance for local use: smaller file, decent quality.

3.1 Download the GGUF file

Download the Q4_K_M (or your chosen) .gguf file and put it in a stable folder, e.g.:

C:\Users\<YourUser>\.llm_models\

Replace <YourUser> with your Windows username.

3.2 Start the llama.cpp server

Use a port that doesn't clash with OpenWebUI (8080). Here we use 10000.

llama-server -m "C:\Users\<YourUser>\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" --port 10000

Leave this terminal open. You should now have an OpenAI-compatible API at:

http://localhost:10000

Step 4: Connect OpenWebUI to llama.cpp

With both the llama-server and OpenWebUI running:

Open OpenWebUI at http://localhost:8080.
Go to Settings → Connections (or Admin → Connections, depending on your OpenWebUI version).
Add an OpenAI-compatible connection (see screenshot below): Base URL http://localhost:10000/v1, API key empty or a placeholder like local if required.

Save, select the new connection/model in the UI, and send a test message. If the model answers, the stack is working.

What you get

A browser chat (OpenWebUI) talking to a model that runs on your machine (llama.cpp).
No external API calls.
No paid subscriptions.

Trade-offs and limitations

RAM/VRAM is the real limit. Bigger models and longer context need more memory.
Disk space adds up. Models live on disk (often several GB each, and you may keep several quantizations).
Smaller models have limits. On modest hardware, what you can run may not be enough for heavy reasoning, long-form planning, or high-stakes tasks.

Troubleshooting

OpenWebUI loads but no model appears

Confirm llama-server is running and that http://localhost:10000 responds (e.g. in a browser or with curl).
Make sure you didn’t use the same port as OpenWebUI (8080).

Connection fails

Try http://127.0.0.1:10000 instead of http://localhost:10000.
Check that Windows Firewall isn’t blocking local connections.

It’s slow

Use a smaller model or lower quantization (e.g. Q4).
Reduce context length if you increased it.
On NVIDIA: confirm you use the CUDA build and that the runtime DLLs are in the same folder as the executables.

Wrap-up

You now have a local chat stack: OpenWebUI for the UI, llama.cpp for inference, and a GGUF model from Hugging Face. Solid baseline for privacy-first, no-subscription experimentation. Next steps: try different models (general vs coder), other quantizations, or tuning context length for your workload.