<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: christian daniel</title>
    <description>The latest articles on DEV Community by christian daniel (@christian35620).</description>
    <link>https://dev.to/christian35620</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3790405%2Fe6ae4ec9-4028-4c0b-9a96-22456d741b31.jpg</url>
      <title>DEV Community: christian daniel</title>
      <link>https://dev.to/christian35620</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/christian35620"/>
    <language>en</language>
    <item>
      <title>How much does it really cost to use AI models for coding?</title>
      <dc:creator>christian daniel</dc:creator>
      <pubDate>Sat, 16 May 2026 05:25:05 +0000</pubDate>
      <link>https://dev.to/christian35620/how-much-does-it-really-cost-to-use-ai-models-for-coding-235f</link>
      <guid>https://dev.to/christian35620/how-much-does-it-really-cost-to-use-ai-models-for-coding-235f</guid>
      <description>&lt;p&gt;I’ve been reading several posts about the true inference cost of AI models.&lt;/p&gt;

&lt;p&gt;But it wasn’t until I ran my own numbers that I was genuinely stunned.&lt;/p&gt;

&lt;p&gt;For 14 days, from &lt;strong&gt;May 3 to May 16&lt;/strong&gt;, I used three models classified as &lt;strong&gt;Open Weights&lt;/strong&gt; for a personal project where I’m building both the backend in &lt;strong&gt;Nest.js&lt;/strong&gt; and the frontend in &lt;strong&gt;React&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These were my usage numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MoonshotAI: Kimi K2.6
Input: 267,755,276 tokens
Output: 1,941,655 tokens

DeepSeek: DeepSeek V4 Pro
Input: 136,286,132 tokens
Output: 867,593 tokens

Xiaomi: MiMo-V2.5-Pro
Input: 2,791,785 tokens
Output: 59,251 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: 406,833,193 tokens
Output: 2,868,499 tokens
Total: 409,701,692 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More than &lt;strong&gt;400 million tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I’m using an &lt;strong&gt;Opencode Go&lt;/strong&gt; subscription, which cost me &lt;strong&gt;USD 5&lt;/strong&gt; for the first month. Starting from the second month, it costs &lt;strong&gt;USD 10/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And in those 14 days, I already hit the monthly rate limits.&lt;/p&gt;

&lt;p&gt;But wait…&lt;/p&gt;

&lt;p&gt;USD 5 for more than 400M tokens?&lt;/p&gt;

&lt;p&gt;Yes. USD 5.&lt;/p&gt;

&lt;p&gt;That made me wonder:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How much would this exact same amount of tokens have cost using a traditional inference provider?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I went to OpenRouter and looked up the average prices of the models I had been using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DeepSeek: DeepSeek V4 Pro
Input:  USD 0.316 / 1M tokens
Output: USD 1.74 / 1M tokens

MoonshotAI: Kimi K2.6
Input:  USD 0.306 / 1M tokens
Output: USD 3.84 / 1M tokens

Xiaomi: MiMo-V2.5-Pro
Input:  USD 0.470 / 1M tokens
Output: USD 3.07 / 1M tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doing the math, if I had used OpenRouter as the inference provider, the cost for those 14 days would have been approximately:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MoonshotAI: Kimi K2.6&lt;/td&gt;
&lt;td&gt;USD 81.93&lt;/td&gt;
&lt;td&gt;USD 7.46&lt;/td&gt;
&lt;td&gt;USD 89.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek: DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;USD 43.07&lt;/td&gt;
&lt;td&gt;USD 1.51&lt;/td&gt;
&lt;td&gt;USD 44.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi: MiMo-V2.5-Pro&lt;/td&gt;
&lt;td&gt;USD 1.31&lt;/td&gt;
&lt;td&gt;USD 0.18&lt;/td&gt;
&lt;td&gt;USD 1.49&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total: USD 135.46 in 14 days&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Extrapolated to a 30-day month:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Monthly estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MoonshotAI: Kimi K2.6&lt;/td&gt;
&lt;td&gt;USD 191.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek: DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;USD 95.52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi: MiMo-V2.5-Pro&lt;/td&gt;
&lt;td&gt;USD 3.20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Estimated monthly total: USD 290.27&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But of course, another important factor comes into play here: cache.&lt;/p&gt;

&lt;p&gt;Inference providers usually apply discounts when part of the input prompt comes from cache, meaning tokens from the prompt were already processed before and can be reused.&lt;/p&gt;

&lt;p&gt;So I ran another calculation assuming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache hit rate: 70%
Cached input cost: 20% of the normal cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means the effective input cost becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;70% × 20% + 30% × 100% = 44%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In other words, input tokens would cost &lt;strong&gt;56% less&lt;/strong&gt;, while output tokens would remain the same.&lt;/p&gt;

&lt;p&gt;Under that assumption, the cost of my 14 days of usage would have been:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input with cache&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MoonshotAI: Kimi K2.6&lt;/td&gt;
&lt;td&gt;USD 36.05&lt;/td&gt;
&lt;td&gt;USD 7.46&lt;/td&gt;
&lt;td&gt;USD 43.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek: DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;USD 18.95&lt;/td&gt;
&lt;td&gt;USD 1.51&lt;/td&gt;
&lt;td&gt;USD 20.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi: MiMo-V2.5-Pro&lt;/td&gt;
&lt;td&gt;USD 0.58&lt;/td&gt;
&lt;td&gt;USD 0.18&lt;/td&gt;
&lt;td&gt;USD 0.76&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total with cache: USD 64.72 in 14 days&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Extrapolated to 30 days:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Monthly estimate with cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MoonshotAI: Kimi K2.6&lt;/td&gt;
&lt;td&gt;USD 93.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek: DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;USD 43.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi: MiMo-V2.5-Pro&lt;/td&gt;
&lt;td&gt;USD 1.63&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Estimated monthly total with cache: USD 138.70&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That represents approximately &lt;strong&gt;52.2% less&lt;/strong&gt; than the calculation without cache.&lt;/p&gt;

&lt;p&gt;Then I did the same exercise assuming I used &lt;strong&gt;GPT-5.4&lt;/strong&gt; the entire time, also applying the cache hit discount.&lt;/p&gt;

&lt;p&gt;The estimated monthly result was approximately:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;USD 690.77/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So the comparison looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Opencode Go:
USD 10/month

Estimated cost using the same Open Weights models via OpenRouter with cache:
USD 138.70/month

Estimated cost using GPT-5.4 with cache:
USD 690.77/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Put another way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;with Opencode Go, I’d be paying approximately &lt;strong&gt;7.2%&lt;/strong&gt; of what it would cost to use those same Open Weights models via OpenRouter;&lt;/li&gt;
&lt;li&gt;and just &lt;strong&gt;1.4%&lt;/strong&gt; of what it would cost to use GPT-5.4 under the same usage pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And if I take the first-month promotional price, USD 5, the difference is even more dramatic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3.6%&lt;/strong&gt; compared to the estimated cost with Open Weights models;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.7%&lt;/strong&gt; compared to the estimated cost with GPT-5.4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leaves me with one question:&lt;/p&gt;

&lt;h2&gt;
  
  
  How are these subscription models actually sustainable?
&lt;/h2&gt;

&lt;p&gt;Do published inference prices reflect the real cost?&lt;/p&gt;

&lt;p&gt;Are subscriptions being subsidized?&lt;/p&gt;

&lt;p&gt;Or are we at a stage where many companies are absorbing losses to capture users and volume?&lt;/p&gt;

&lt;p&gt;I don’t have a definitive answer.&lt;/p&gt;

&lt;p&gt;But after running these numbers, it’s clear to me that the real cost of using AI for intensive development is not as obvious as it seems.&lt;/p&gt;

&lt;p&gt;And that behind a seemingly simple monthly subscription, there may be a much more complex economy at play.&lt;/p&gt;

&lt;h1&gt;
  
  
  AI #LLM #SoftwareDevelopment #OpenWeights #AIEngineering #DeveloperTools
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Run Your Own Local AI Chat with OpenWebUI and llama.cpp - Windows</title>
      <dc:creator>christian daniel</dc:creator>
      <pubDate>Sat, 28 Feb 2026 20:16:06 +0000</pubDate>
      <link>https://dev.to/christian35620/run-your-own-local-ai-chat-with-openwebui-and-llamacpp-windows-1k6c</link>
      <guid>https://dev.to/christian35620/run-your-own-local-ai-chat-with-openwebui-and-llamacpp-windows-1k6c</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; A local ChatGPT-like stack using OpenWebUI as the UI and llama.cpp as the inference server, with a GGUF model from Hugging Face. Everything talks over an OpenAI-compatible API. No API bills, no data leaving your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrx8bb8t2ewqx0cgk9pl.gif" alt="OpenWebUI interface" width="1868" height="900"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy:&lt;/strong&gt; Prompts and replies stay on your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No API bills:&lt;/strong&gt; No usage-based pricing or quotas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control:&lt;/strong&gt; You pick the model, quantization, and context size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source:&lt;/strong&gt; OpenWebUI and llama.cpp are free and auditable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted a local tool for LLM tasks that don't need a paid API: drafts, small scripts, experiments. This setup does that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who this is for
&lt;/h2&gt;

&lt;p&gt;Anyone who wants a local AI chat without subscriptions. No prior LLM experience required; this is mostly wiring a UI to a local server.&lt;/p&gt;




&lt;h2&gt;
  
  
  My setup (Windows)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Windows 11&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 16 GB minimum; 32 GB helps for larger models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU:&lt;/strong&gt; optional but recommended for speed (I have a GPU with 8 GB of VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; Enough for multi-GB model files (often 4–8 GB per model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 7B model in Q4 quantization runs on many machines; bigger models need more memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;Three pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenWebUI: the browser UI (chat, history, model selection)&lt;/li&gt;
&lt;li&gt;llama.cpp server: local inference with an OpenAI-compatible HTTP API&lt;/li&gt;
&lt;li&gt;GGUF model: weights you download once and keep on disk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenWebUI talks to &lt;code&gt;llama-server&lt;/code&gt; over HTTP. No cloud in the loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Install llama.cpp (Windows, prebuilt CUDA)
&lt;/h2&gt;

&lt;p&gt;Prebuilt binaries are the fastest way to a working server.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Check your CUDA version (NVIDIA only)
&lt;/h3&gt;

&lt;p&gt;In PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;nvidia-smi&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;strong&gt;CUDA Version&lt;/strong&gt; line (e.g. 12.x). You'll use it to choose the right llama.cpp build.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Download the release and CUDA runtime bundle
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;a href="https://github.com/ggml-org/llama.cpp/releases" rel="noopener noreferrer"&gt;llama.cpp releases&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Pick the release that matches your CUDA version (e.g. CUDA 12) and download it.&lt;/li&gt;
&lt;li&gt;Download the CUDA runtime DLL bundle from Assets (e.g. &lt;code&gt;cudart-llama-bin-win-cuda-12&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The extra DLL bundle matters: the CUDA build often needs runtime DLLs that aren't on your PATH. Putting them next to the executables avoids "missing DLL" errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 Extract and add to PATH
&lt;/h3&gt;

&lt;p&gt;Extract the main archive to a stable folder, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Program Files\llama.cpp\
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add that folder to your system PATH (Windows search → Environment Variables → Path → Edit → New). That way you can run &lt;code&gt;llama-server&lt;/code&gt; and &lt;code&gt;llama-cli&lt;/code&gt; from any terminal without the full path.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 Copy CUDA DLLs (NVIDIA only)
&lt;/h3&gt;

&lt;p&gt;Extract the CUDA runtime bundle and copy all &lt;code&gt;.dll&lt;/code&gt; files into the same folder as &lt;code&gt;llama-server.exe&lt;/code&gt; (the one on your PATH).&lt;/p&gt;

&lt;h3&gt;
  
  
  1.5 Verify
&lt;/h3&gt;

&lt;p&gt;Open a &lt;strong&gt;new&lt;/strong&gt; terminal (so PATH is refreshed) and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;llama-server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--help&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see help output, the install is good.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Install and run OpenWebUI (Windows, no Docker)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openwebui.com/" rel="noopener noreferrer"&gt;OpenWebUI&lt;/a&gt; is a self-hosted chat UI. A straightforward option to install it is through a Python venv.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Create venv and install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;venv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;venv&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;\.venv\Scripts\Activate.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alternative (Conda):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;local_chat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-y&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;activate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;local_chat&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2 Run OpenWebUI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;serve&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8080&lt;/code&gt; (or the port shown in the terminal). You'll see the UI; the model connection comes in Step 4.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Download a GGUF model from Hugging Face and start the server
&lt;/h2&gt;

&lt;p&gt;Start with a smaller model so you can confirm the pipeline before throwing RAM at bigger ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example model used in this post:&lt;/strong&gt; &lt;a href="https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF" rel="noopener noreferrer"&gt;Qwen2.5-Coder-7B-Instruct-GGUF&lt;/a&gt;. I used the &lt;strong&gt;Q4_K_M&lt;/strong&gt; quantized file.&lt;/p&gt;

&lt;p&gt;On the Hugging Face repo you'll see several quantizations (Q2–Q8). Q4 is a good balance for local use: smaller file, decent quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72j7v05rpjuhpwhp7bgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72j7v05rpjuhpwhp7bgr.png" alt="quantizations llm model" width="623" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Download the GGUF file
&lt;/h3&gt;

&lt;p&gt;Download the Q4_K_M (or your chosen) &lt;code&gt;.gguf&lt;/code&gt; file and put it in a stable folder, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Users\&amp;lt;YourUser&amp;gt;\.llm_models\
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;YourUser&amp;gt;&lt;/code&gt; with your Windows username.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Start the llama.cpp server
&lt;/h3&gt;

&lt;p&gt;Use a port that doesn't clash with OpenWebUI (8080). Here we use 10000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;llama-server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\&amp;lt;YourUser&amp;gt;\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;10000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Leave this terminal open. You should now have an OpenAI-compatible API at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:10000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: Connect OpenWebUI to llama.cpp
&lt;/h2&gt;

&lt;p&gt;With both the llama-server and OpenWebUI running:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open OpenWebUI at &lt;code&gt;http://localhost:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings → Connections&lt;/strong&gt; (or Admin → Connections, depending on your OpenWebUI version).&lt;/li&gt;
&lt;li&gt;Add an OpenAI-compatible connection (see screenshot below): Base URL &lt;code&gt;http://localhost:10000/v1&lt;/code&gt;, API key empty or a placeholder like &lt;code&gt;local&lt;/code&gt; if required.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rz037bv87q8gtzxjez1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rz037bv87q8gtzxjez1.png" alt="add new connection" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2fmuctgjmm7ityd70p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2fmuctgjmm7ityd70p3.png" alt="api form" width="593" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save, select the new connection/model in the UI, and send a test message. If the model answers, the stack is working.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What you get
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmz4s6ulfkzaq4xru38x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmz4s6ulfkzaq4xru38x.gif" alt="openweb ui chat" width="1868" height="900"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A browser chat (OpenWebUI) talking to a model that runs on your machine (llama.cpp).&lt;/li&gt;
&lt;li&gt;No external API calls.&lt;/li&gt;
&lt;li&gt;No paid subscriptions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RAM/VRAM is the real limit. Bigger models and longer context need more memory.&lt;/li&gt;
&lt;li&gt;Disk space adds up. Models live on disk (often several GB each, and you may keep several quantizations).&lt;/li&gt;
&lt;li&gt;Smaller models have limits. On modest hardware, what you can run may not be enough for heavy reasoning, long-form planning, or high-stakes tasks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenWebUI loads but no model appears&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm &lt;code&gt;llama-server&lt;/code&gt; is running and that &lt;code&gt;http://localhost:10000&lt;/code&gt; responds (e.g. in a browser or with &lt;code&gt;curl&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Make sure you didn’t use the same port as OpenWebUI (8080).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connection fails&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try &lt;code&gt;http://127.0.0.1:10000&lt;/code&gt; instead of &lt;code&gt;http://localhost:10000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check that Windows Firewall isn’t blocking local connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;It’s slow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a smaller model or lower quantization (e.g. Q4).&lt;/li&gt;
&lt;li&gt;Reduce context length if you increased it.&lt;/li&gt;
&lt;li&gt;On NVIDIA: confirm you use the CUDA build and that the runtime DLLs are in the same folder as the executables.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;You now have a local chat stack: OpenWebUI for the UI, llama.cpp for inference, and a GGUF model from Hugging Face. Solid baseline for privacy-first, no-subscription experimentation. Next steps: try different models (general vs coder), other quantizations, or tuning context length for your workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/open-webui/open-webui" rel="noopener noreferrer"&gt;OpenWebUI GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/releases" rel="noopener noreferrer"&gt;llama.cpp releases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/models?library=gguf" rel="noopener noreferrer"&gt;Hugging Face (GGUF models)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
