<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: christian daniel</title>
    <description>The latest articles on DEV Community by christian daniel (@christian35620).</description>
    <link>https://dev.to/christian35620</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3790405%2Fe6ae4ec9-4028-4c0b-9a96-22456d741b31.jpg</url>
      <title>DEV Community: christian daniel</title>
      <link>https://dev.to/christian35620</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/christian35620"/>
    <language>en</language>
    <item>
      <title>Run Your Own Local AI Chat with OpenWebUI and llama.cpp - Windows</title>
      <dc:creator>christian daniel</dc:creator>
      <pubDate>Sat, 28 Feb 2026 20:16:06 +0000</pubDate>
      <link>https://dev.to/christian35620/run-your-own-local-ai-chat-with-openwebui-and-llamacpp-windows-1k6c</link>
      <guid>https://dev.to/christian35620/run-your-own-local-ai-chat-with-openwebui-and-llamacpp-windows-1k6c</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; A local ChatGPT-like stack using OpenWebUI as the UI and llama.cpp as the inference server, with a GGUF model from Hugging Face. Everything talks over an OpenAI-compatible API. No API bills, no data leaving your machine.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrx8bb8t2ewqx0cgk9pl.gif" alt="OpenWebUI interface" width="1868" height="900"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy:&lt;/strong&gt; Prompts and replies stay on your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No API bills:&lt;/strong&gt; No usage-based pricing or quotas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control:&lt;/strong&gt; You pick the model, quantization, and context size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source:&lt;/strong&gt; OpenWebUI and llama.cpp are free and auditable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted a local tool for LLM tasks that don't need a paid API: drafts, small scripts, experiments. This setup does that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who this is for
&lt;/h2&gt;

&lt;p&gt;Anyone who wants a local AI chat without subscriptions. No prior LLM experience required; this is mostly wiring a UI to a local server.&lt;/p&gt;




&lt;h2&gt;
  
  
  My setup (Windows)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Windows 11&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 16 GB minimum; 32 GB helps for larger models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU:&lt;/strong&gt; optional but recommended for speed (I have a GPU with 8 GB of VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; Enough for multi-GB model files (often 4–8 GB per model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 7B model in Q4 quantization runs on many machines; bigger models need more memory.&lt;/p&gt;
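&lt;p&gt;As a rough sanity check (ballpark arithmetic, not exact figures for any particular file): Q4-family quantizations store roughly 4–5 bits per weight, so a 7B-parameter model lands around 4–5 GB on disk, plus some headroom at runtime for the context/KV cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7e9 params × ~4.7 bits/param ÷ 8 bits/byte ≈ 4.1 GB  (model file)
+ context / KV cache overhead at runtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;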




&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;Three pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenWebUI: the browser UI (chat, history, model selection)&lt;/li&gt;
&lt;li&gt;llama.cpp server: local inference with an OpenAI-compatible HTTP API&lt;/li&gt;
&lt;li&gt;GGUF model: weights you download once and keep on disk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenWebUI talks to &lt;code&gt;llama-server&lt;/code&gt; over HTTP. No cloud in the loop.&lt;/p&gt;
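&lt;p&gt;Sketched as a request path (the ports are the ones used later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser ──&gt; OpenWebUI (http://localhost:8080)
                 |  OpenAI-compatible HTTP requests
                 v
            llama-server (http://localhost:10000/v1)
                 |  loads
                 v
            GGUF model file on disk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;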




&lt;h2&gt;
  
  
  Step 1: Install llama.cpp (Windows, prebuilt CUDA)
&lt;/h2&gt;

&lt;p&gt;Prebuilt binaries are the fastest way to a working server.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Check your CUDA version (NVIDIA only)
&lt;/h3&gt;

&lt;p&gt;In PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;nvidia-smi&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;strong&gt;CUDA Version&lt;/strong&gt; line (e.g. 12.x). You'll use it to choose the right llama.cpp build.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Download the release and CUDA runtime bundle
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;a href="https://github.com/ggml-org/llama.cpp/releases" rel="noopener noreferrer"&gt;llama.cpp releases&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;From the latest release's Assets, download the Windows build that matches your CUDA version (e.g. a &lt;code&gt;win-cuda-12&lt;/code&gt; zip).&lt;/li&gt;
&lt;li&gt;Download the CUDA runtime DLL bundle from Assets (e.g. &lt;code&gt;cudart-llama-bin-win-cuda-12&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The extra DLL bundle matters: the CUDA build often needs runtime DLLs that aren't on your PATH. Putting them next to the executables avoids "missing DLL" errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 Extract and add to PATH
&lt;/h3&gt;

&lt;p&gt;Extract the main archive to a stable folder, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Program Files\llama.cpp\
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add that folder to your system PATH (Windows search → Environment Variables → Path → Edit → New). That way you can run &lt;code&gt;llama-server&lt;/code&gt; and &lt;code&gt;llama-cli&lt;/code&gt; from any terminal without the full path.&lt;/p&gt;
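&lt;p&gt;If you prefer doing this from a terminal, a PowerShell sketch (it appends to the user-level PATH and assumes the folder above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# Append the llama.cpp folder to the user PATH (new terminals will pick it up)
$dir = "C:\Program Files\llama.cpp"
$old = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$old;$dir", "User")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;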

&lt;h3&gt;
  
  
  1.4 Copy CUDA DLLs (NVIDIA only)
&lt;/h3&gt;

&lt;p&gt;Extract the CUDA runtime bundle and copy all &lt;code&gt;.dll&lt;/code&gt; files into the same folder as &lt;code&gt;llama-server.exe&lt;/code&gt; (the one on your PATH).&lt;/p&gt;
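&lt;p&gt;In PowerShell the copy can look like this (the source folder name is an example; use wherever you extracted the bundle):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# Copy the CUDA runtime DLLs next to llama-server.exe
# (run the terminal as Administrator if the target is under Program Files)
Copy-Item -Path "C:\Downloads\cudart-llama-bin-win-cuda-12\*.dll" -Destination "C:\Program Files\llama.cpp\" -Force
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;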

&lt;h3&gt;
  
  
  1.5 Verify
&lt;/h3&gt;

&lt;p&gt;Open a &lt;strong&gt;new&lt;/strong&gt; terminal (so PATH is refreshed) and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;llama-server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--help&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see help output, the install is good.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Install and run OpenWebUI (Windows, no Docker)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openwebui.com/" rel="noopener noreferrer"&gt;OpenWebUI&lt;/a&gt; is a self-hosted chat UI. A straightforward way to install it without Docker is with pip inside a Python virtual environment (venv).&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Create venv and install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;venv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;venv&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;\.venv\Scripts\Activate.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alternative (Conda):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;local_chat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-y&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;activate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;local_chat&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2 Run OpenWebUI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;serve&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8080&lt;/code&gt; (or the port shown in the terminal). You'll see the UI; the model connection comes in Step 4.&lt;/p&gt;
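&lt;p&gt;If 8080 is already taken on your machine, the serve command accepts a port flag (check &lt;code&gt;open-webui serve --help&lt;/code&gt; on your version to confirm the exact options):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# Run OpenWebUI on a different port if 8080 is busy
open-webui serve --port 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;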




&lt;h2&gt;
  
  
  Step 3: Download a GGUF model from Hugging Face and start the server
&lt;/h2&gt;

&lt;p&gt;Start with a smaller model so you can confirm the pipeline before throwing RAM at bigger ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example model used in this post:&lt;/strong&gt; &lt;a href="https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF" rel="noopener noreferrer"&gt;Qwen2.5-Coder-7B-Instruct-GGUF&lt;/a&gt;. I used the &lt;strong&gt;Q4_K_M&lt;/strong&gt; quantized file.&lt;/p&gt;

&lt;p&gt;On the Hugging Face repo you'll see several quantizations (Q2–Q8). Q4 is a good balance for local use: smaller file, decent quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72j7v05rpjuhpwhp7bgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72j7v05rpjuhpwhp7bgr.png" alt="quantizations llm model" width="623" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Download the GGUF file
&lt;/h3&gt;

&lt;p&gt;Download the Q4_K_M (or your chosen) &lt;code&gt;.gguf&lt;/code&gt; file and put it in a stable folder, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Users\&amp;lt;YourUser&amp;gt;\.llm_models\
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;YourUser&amp;gt;&lt;/code&gt; with your Windows username.&lt;/p&gt;
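&lt;p&gt;You can download the file from the browser, or use the Hugging Face CLI if you already have Python set up. A sketch (the exact &lt;code&gt;.gguf&lt;/code&gt; filename below is illustrative; check the repo's Files tab for the real name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;pip install huggingface_hub
# Download only the chosen quantization into the models folder
huggingface-cli download Qwen/Qwen2.5-Coder-7B-Instruct-GGUF `
    qwen2.5-coder-7b-instruct-q4_k_m.gguf `
    --local-dir "C:\Users\&amp;lt;YourUser&amp;gt;\.llm_models"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;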

&lt;h3&gt;
  
  
  3.2 Start the llama.cpp server
&lt;/h3&gt;

&lt;p&gt;Use a port that doesn't clash with OpenWebUI (8080). Here we use 10000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;llama-server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\&amp;lt;YourUser&amp;gt;\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;10000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
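&lt;p&gt;On an NVIDIA GPU, two flags worth knowing are &lt;code&gt;-ngl&lt;/code&gt; (number of layers to offload to the GPU) and &lt;code&gt;-c&lt;/code&gt; (context size in tokens). As a starting point rather than a tuned config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# Offload as many layers as fit in VRAM (-ngl 99 = try everything) and use an 8192-token context
llama-server -m "C:\Users\&amp;lt;YourUser&amp;gt;\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" --port 10000 -ngl 99 -c 8192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;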



&lt;p&gt;Leave this terminal open. You should now have an OpenAI-compatible API at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:10000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
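&lt;p&gt;A quick way to confirm the API works before touching the UI (a sketch against llama.cpp's OpenAI-compatible chat endpoint):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# Send a one-off chat completion to the local server
$body = @{
    messages = @(@{ role = "user"; content = "Say hello in five words." })
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri "http://localhost:10000/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;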






&lt;h2&gt;
  
  
  Step 4: Connect OpenWebUI to llama.cpp
&lt;/h2&gt;

&lt;p&gt;With both the llama-server and OpenWebUI running:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open OpenWebUI at &lt;code&gt;http://localhost:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings → Connections&lt;/strong&gt; (or Admin → Connections, depending on your OpenWebUI version).&lt;/li&gt;
&lt;li&gt;Add an OpenAI-compatible connection (see screenshot below): Base URL &lt;code&gt;http://localhost:10000/v1&lt;/code&gt;, API key empty or a placeholder like &lt;code&gt;local&lt;/code&gt; if required.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rz037bv87q8gtzxjez1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rz037bv87q8gtzxjez1.png" alt="add new connection" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2fmuctgjmm7ityd70p3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2fmuctgjmm7ityd70p3.png" alt="api form" width="593" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Save, select the new connection/model in the UI, and send a test message. If the model answers, the stack is working.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What you get
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmz4s6ulfkzaq4xru38x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmz4s6ulfkzaq4xru38x.gif" alt="openweb ui chat" width="1868" height="900"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A browser chat (OpenWebUI) talking to a model that runs on your machine (llama.cpp).&lt;/li&gt;
&lt;li&gt;No external API calls.&lt;/li&gt;
&lt;li&gt;No paid subscriptions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RAM/VRAM is the real limit. Bigger models and longer context need more memory.&lt;/li&gt;
&lt;li&gt;Disk space adds up. Models live on disk (often several GB each, and you may keep several quantizations).&lt;/li&gt;
&lt;li&gt;Smaller models have limits. On modest hardware, what you can run may not be enough for heavy reasoning, long-form planning, or high-stakes tasks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenWebUI loads but no model appears&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm &lt;code&gt;llama-server&lt;/code&gt; is running and that &lt;code&gt;http://localhost:10000&lt;/code&gt; responds (in a browser, with &lt;code&gt;curl&lt;/code&gt;, or with the quick check after this list).&lt;/li&gt;
&lt;li&gt;Make sure you didn’t use the same port as OpenWebUI (8080).&lt;/li&gt;
&lt;/ul&gt;
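&lt;p&gt;A minimal liveness check (&lt;code&gt;llama-server&lt;/code&gt; exposes a &lt;code&gt;/health&lt;/code&gt; endpoint, and &lt;code&gt;/v1/models&lt;/code&gt; should list the loaded model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;# Both should return JSON if llama-server is up and the model is loaded
Invoke-RestMethod -Uri "http://localhost:10000/health"
Invoke-RestMethod -Uri "http://localhost:10000/v1/models"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;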

&lt;p&gt;&lt;strong&gt;Connection fails&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try &lt;code&gt;http://127.0.0.1:10000&lt;/code&gt; instead of &lt;code&gt;http://localhost:10000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check that Windows Firewall isn’t blocking local connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;It’s slow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a smaller model or a more aggressive quantization (e.g. Q4 instead of Q6/Q8).&lt;/li&gt;
&lt;li&gt;Reduce context length if you increased it.&lt;/li&gt;
&lt;li&gt;On NVIDIA: confirm you use the CUDA build and that the runtime DLLs are in the same folder as the executables.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;You now have a local chat stack: OpenWebUI for the UI, llama.cpp for inference, and a GGUF model from Hugging Face. Solid baseline for privacy-first, no-subscription experimentation. Next steps: try different models (general vs coder), other quantizations, or tuning context length for your workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/open-webui/open-webui" rel="noopener noreferrer"&gt;OpenWebUI GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/releases" rel="noopener noreferrer"&gt;llama.cpp releases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/models?library=gguf" rel="noopener noreferrer"&gt;Hugging Face (GGUF models)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
