<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tanay Kolekar</title>
    <description>The latest articles on DEV Community by Tanay Kolekar (@tanay_kolekar).</description>
    <link>https://dev.to/tanay_kolekar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3888752%2Fb3844ddf-a074-43ed-9817-375fcaba58c9.jpg</url>
      <title>DEV Community: Tanay Kolekar</title>
      <link>https://dev.to/tanay_kolekar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tanay_kolekar"/>
    <language>en</language>
    <item>
      <title>Failing Forward in Open Source: What Running NVIDIA’s Sana on an Intel AI PC Taught Me About CI/CD</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Sat, 20 Jun 2026 13:31:01 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/failing-forward-in-open-source-what-running-nvidias-sana-on-an-intel-ai-pc-taught-me-about-cicd-5hf8</link>
      <guid>https://dev.to/tanay_kolekar/failing-forward-in-open-source-what-running-nvidias-sana-on-an-intel-ai-pc-taught-me-about-cicd-5hf8</guid>
      <description>&lt;p&gt;&lt;strong&gt;From NPU compiler crashes to rejected pull requests — a masterclass in deploying local Generative AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fj8sd2e1kll36ixo8lxn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fj8sd2e1kll36ixo8lxn1.png" alt=" czx" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before stepping into my MBA to focus on GenAI Strategy and Product Management, I spent three years as a Data Engineer. I was used to optimizing scalable pipelines across dozens of workflows and writing code to shave terabytes off cloud storage. But transitioning from cloud infrastructure to running Generative AI on local edge hardware is an entirely different beast.&lt;/p&gt;

&lt;p&gt;Recently, NVIDIA Labs released &lt;strong&gt;Sana&lt;/strong&gt; , a blazing-fast image and video generation model. Eager to test these advancements without a massive cloud compute budget, I set out to run the model locally on my Windows laptop, which is powered by an Intel Core Ultra 5 processor, 16 GB of RAM, and a dedicated Neural Processing Unit (NPU).&lt;/p&gt;

&lt;p&gt;What started as a simple weekend test run turned into a multi-hour deep dive into hardware compilers, virtual environment bugs, and the realities of open-source CI/CD pipelines. Here is the step-by-step story of how I navigated the bleeding edge of local AI deployment — and what a rejected Pull Request taught me about product strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ambition vs. Hardware Reality
&lt;/h3&gt;

&lt;p&gt;My initial goal was ambitious: run &lt;strong&gt;SANA-WM 2.6B&lt;/strong&gt; (the video generation world model). I quickly learned that this was a non-starter. SANA-WM 2.6B requires a massive amount of VRAM and is heavily optimized for NVIDIA’s CUDA ecosystem. Attempting to force a 2.6 billion parameter video model onto 16 GB of shared system RAM on an Intel chip would just result in instant Out-of-Memory crashes.&lt;/p&gt;

&lt;p&gt;So, I pivoted to a more realistic target: &lt;strong&gt;Sana 0.6B&lt;/strong&gt; , a highly efficient text-to-image model. Because of its smaller size and open-source community support, it could leverage the OpenVINO toolkit to run directly on my Intel Core Ultra’s NPU or integrated GPU. I decided to use &lt;strong&gt;FastSD CPU&lt;/strong&gt; , an open-source interface specifically optimized for Intel hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Installation Rabbit Hole
&lt;/h3&gt;

&lt;p&gt;I cloned the FastSD CPU repository and ran the setup scripts. Immediately, I hit my first roadblock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Starting FastSD CPU env installation...
Python command check :OK
Error: uv command not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FastSD CPU uses uv, an incredibly fast modern package manager, to build its virtual environments. A quick pip install uv fixed this, and the installer successfully built the environment.&lt;/p&gt;

&lt;p&gt;But when I tried to launch the software, it hard-crashed with a massive traceback ending in this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C:&lt;/span&gt;&lt;span class="se"&gt;\f&lt;/span&gt;&lt;span class="s"&gt;astsdcpu\env\Lib\site-packages\optimum\exporters\onnx\model_patcher.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;346&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.onnx.symbolic_opset14&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;# noqa: E402
&lt;/span&gt;&lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cannot&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_attention_scale&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;torch.onnx.symbolic_opset14&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Through troubleshooting, I realized this was a dependency conflict. The installer had grabbed the bleeding-edge version of PyTorch (v2.5+), but the Intel OpenVINO library hadn’t been updated to support it yet. They were failing to communicate.&lt;/p&gt;

&lt;p&gt;Because the environment was built using uv, it didn't even have standard pip installed. I had to route into the virtual environment and run a specialized command to downgrade the libraries to a stable CPU version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.4.1 &lt;span class="nv"&gt;torchvision&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.19.1 &lt;span class="nv"&gt;torchaudio&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.4.1 &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The software finally launched! However, the desktop GUI was entirely cut off at the bottom due to Windows display scaling on my laptop, hiding the generate buttons. To bypass this UI limitation, I launched the browser-based Web UI instead (start-webui.bat).&lt;/p&gt;

&lt;p&gt;I selected the rupeshs/sana-sprint-0.6b-openvino-int4 model, typed in "a warrior on horse," and hit generate.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Phantom NPU and the Compiler Crash
&lt;/h3&gt;

&lt;p&gt;While the generation processed, I opened Windows Task Manager. My CPU was doing a little bit of work, but my dedicated Intel AI Boost NPU was sitting at exactly 0% utilization. Furthermore, my Python process was pulling 96 Mbps of network bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F28b344wglwcdidotpvn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F28b344wglwcdidotpvn6.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I realized two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FastSD CPU defaults to standard CPU processing unless explicitly told otherwise.&lt;/li&gt;
&lt;li&gt;The massive network usage was the software silently downloading the gigabytes of Sana model weights from HuggingFace in the background for the first time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed to route the computation to my NPU. Because the Web UI lacked a hardware toggle, I bypassed the interface and set an environment variable directly in PowerShell before launching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;DEVICE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"NPU"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\start-webui.bat&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The console lit up with Using device : NPU. I hit generate again, expecting lightning-fast results from my AI processor. Instead, the Intel hardware compiler panicked and threw this yellow warning in my browser:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4l3zbglo8zola18orrz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4l3zbglo8zola18orrz1.png" width="800" height="476"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error:
L0 pfnCreate2 result: ZE_RESULT_ERROR_INVALID_NULL_POINTER, code 0x78000007 - pointer argument may not be nullptr . 
[NPU_VCL] Compiler returned msg: Missing upper bound for one or more nodes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This wasn’t a Python bug; this was a hard crash from the hardware. The current generation of Intel Core Ultra NPU compilers requires all mathematical shapes in an AI model to have a strict, pre-defined static size (an upper bound). Because the Sana model utilizes dynamic shapes, the Intel NPU driver panicked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Workaround:&lt;/strong&gt; I routed the power to my integrated &lt;strong&gt;Intel Arc Graphics&lt;/strong&gt; instead using $env:DEVICE="GPU". The integrated GPU is much more forgiving with dynamic shapes and compiled the OpenVINO model flawlessly, generating my image in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stepping into Open Source (And Getting Rejected)
&lt;/h3&gt;

&lt;p&gt;Having fought through a grueling installation process, I realized this was a perfect opportunity to make a real-world open-source contribution. I wanted to fix the PyTorch _attention_scale bug for future Windows users so they wouldn't have to troubleshoot the environment manually.&lt;/p&gt;

&lt;p&gt;I forked the repository, opened the requirements.txt file, and noticed torch wasn't even listed. I added the explicitly pinned stable versions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fewy58wjrc1m5m05mw8gj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fewy58wjrc1m5m05mw8gj.png" width="799" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I committed the code, pushed it to my fork, and proudly opened &lt;strong&gt;Pull Request #371&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx7a07507mfvenb63wdx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fx7a07507mfvenb63wdx8.png" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few days later, the repository maintainer responded and closed my Pull Request. &lt;strong&gt;It was rejected.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The maintainer kindly explained that PyTorch is a massive, complex library. By adding torch directly to the requirements.txt file, standard package managers (pip or uv) will automatically attempt to download the default NVIDIA CUDA GPU wheels, which are several gigabytes in size.&lt;/p&gt;

&lt;p&gt;To manage this, the FastSD CPU repository uses custom OS-specific setup scripts (like install.bat) that point to a custom wheel index URL to specifically pull lightweight CPU-only builds (torch==2.8.0).&lt;/p&gt;

&lt;p&gt;My fix, while logical in isolation, would have overridden their custom setup scripts and broken the build pipeline for everyone else by forcing massive GPU downloads onto CPU-only systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Lesson: Systems Thinking
&lt;/h3&gt;

&lt;p&gt;While my PR wasn’t merged, the experience was incredibly invaluable.&lt;/p&gt;

&lt;p&gt;I navigated local edge-AI hardware constraints, debugged complex virtual environment conflicts, routed computations between NPUs and GPUs, and engaged directly with open-source CI/CD architectures.&lt;/p&gt;

&lt;p&gt;Most importantly, I learned a critical product management lesson about systems thinking: &lt;strong&gt;fixing an isolated configuration file without understanding the broader deployment pipeline can cause cascading system failures.&lt;/strong&gt; You cannot patch a product without understanding the user’s installation journey from end to end.&lt;/p&gt;

&lt;p&gt;It was a hands on masterclass in software architecture, and a stark reminder that in the world of Generative AI, sometimes the best way to move forward is to fail out in the open.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>generativeaitools</category>
      <category>opensource</category>
      <category>productmanagement</category>
    </item>
    <item>
      <title>I Built an AI Cluster Using Two 12-Year-Old PCs and an Ethernet Cable. Here’s What Broke.</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Tue, 02 Jun 2026 03:31:00 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/i-built-an-ai-cluster-using-two-12-year-old-pcs-and-an-ethernet-cable-heres-what-broke-jdo</link>
      <guid>https://dev.to/tanay_kolekar/i-built-an-ai-cluster-using-two-12-year-old-pcs-and-an-ethernet-cable-heres-what-broke-jdo</guid>
      <description>&lt;p&gt;How I pooled 24GB of RAM across two discarded PCs, ran a 13B LLM, and discovered exactly why modern AI infrastructure exists.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fadngkvusdtk48ewijun3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fadngkvusdtk48ewijun3.jpeg" alt="_The heart of the experiment: a direct Gigabit Ethernet connection between both nodes._" width="800" height="1063"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sometimes engineering is about solving a problem. Sometimes it’s about proving why a problem exists in the first place.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Coming from a background in data engineering, I’ve spent years chasing bottlenecks.&lt;/p&gt;

&lt;p&gt;Whether it was optimizing data transformations across dozens of workflows, debugging slow pipelines, or cutting cloud storage usage by more than a terabyte, there was always a constraint hiding somewhere in the system.&lt;/p&gt;

&lt;p&gt;Most of the time, constraints can be engineered away.&lt;/p&gt;

&lt;p&gt;So when I started working more deeply with Generative AI and wanted to build a local MVP using open-source LLMs, I naturally assumed the same rule applied.&lt;/p&gt;

&lt;p&gt;I was wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge
&lt;/h3&gt;

&lt;p&gt;Cloud GPUs are expensive.&lt;/p&gt;

&lt;p&gt;For experimentation, prototypes, and personal projects, renting powerful hardware can quickly become the most expensive part of the stack.&lt;/p&gt;

&lt;p&gt;My available hardware wasn’t exactly encouraging either.&lt;/p&gt;

&lt;p&gt;In one corner sat an aging desktop powered by an Intel i5–3470 with 16GB of DDR3 RAM.&lt;/p&gt;

&lt;p&gt;In another corner sat its equally elderly sibling: another Intel i5–3470, this time with 8GB of RAM.&lt;/p&gt;

&lt;p&gt;No GPUs.&lt;/p&gt;

&lt;p&gt;No accelerators.&lt;/p&gt;

&lt;p&gt;No fancy networking.&lt;/p&gt;

&lt;p&gt;Just two forgotten PCs from 2012 collecting dust.&lt;/p&gt;

&lt;p&gt;A 13B parameter model was clearly too large for either machine individually.&lt;/p&gt;

&lt;p&gt;But then a dangerous thought appeared:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What if I connected them together and treated them as a tiny cluster?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If one machine couldn’t hold the model, perhaps two machines could.&lt;/p&gt;

&lt;p&gt;And thus began the creation of what I lovingly call &lt;strong&gt;The Poor Man’s AI Cluster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffd0w7wggv1x1yckjh2zi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffd0w7wggv1x1yckjh2zi.jpeg" alt="_The entire cluster: Two Lenovo ThinkCentres from 2012 with a combined 24GB RAM._" width="800" height="1063"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Plan
&lt;/h3&gt;

&lt;p&gt;The idea was surprisingly simple.&lt;/p&gt;

&lt;p&gt;Instead of connecting both machines through a router, I connected them directly using a Cat5e Gigabit Ethernet cable.&lt;/p&gt;

&lt;p&gt;I assigned static IP addresses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Master Node: 192.168.1.10 (16GB RAM)&lt;/li&gt;
&lt;li&gt;Worker Node: 192.168.1.20 (8GB RAM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a bit of firewall configuration, the two systems could communicate directly over a dedicated full-duplex 1 Gbps link.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4nskcwejqe4po90d0dyc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4nskcwejqe4po90d0dyc.jpeg" alt="_Proof of life: successful node-to-node communication and a negotiated 1 Gbps link._" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In theory, that gave me roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 Gbps bandwidth&lt;/li&gt;
&lt;li&gt;~125 MB/s real-world transfer speeds&lt;/li&gt;
&lt;li&gt;Zero router overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not exactly a supercomputer.&lt;/p&gt;

&lt;p&gt;But enough to experiment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bringing the Monster to Life
&lt;/h3&gt;

&lt;p&gt;Using llama.cpp and its RPC server running inside WSL, I split a quantized 13B model across both machines.&lt;/p&gt;

&lt;p&gt;The architecture looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Prompt
     │
     ▼
Master Node (16GB)
     │
     ▼
Worker Node (8GB)
     │
     ▼
Shared Inference
     │
     ▼
Generated Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The master node handled prompt orchestration while the worker node processed portions of the model that no longer fit in memory.&lt;/p&gt;

&lt;p&gt;And then something unexpected happened.&lt;/p&gt;

&lt;p&gt;It worked.&lt;/p&gt;

&lt;p&gt;Against all common sense, against every reasonable hardware recommendation, I was chatting with a 13B parameter language model running across two decade-old machines.&lt;/p&gt;

&lt;p&gt;For a brief moment, I felt like I had cheated the system.&lt;/p&gt;

&lt;p&gt;Then I looked at the token generation speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reality Arrives at 1 Token per Second
&lt;/h3&gt;

&lt;p&gt;The model was generating roughly &lt;strong&gt;1–1.5 tokens per second&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A moderately sized prompt could take close to a minute before the AI even started responding.&lt;/p&gt;

&lt;p&gt;The cluster was technically functioning.&lt;/p&gt;

&lt;p&gt;But it felt less like modern AI and more like waiting for dial-up internet.&lt;/p&gt;

&lt;p&gt;The reason came down to three unavoidable hardware bottlenecks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bottleneck #1: The Compute Wall
&lt;/h3&gt;

&lt;p&gt;The Intel i5–3470 was released in 2012.&lt;/p&gt;

&lt;p&gt;While it was a respectable CPU for its era, modern LLMs demand absurd amounts of computation.&lt;/p&gt;

&lt;p&gt;A 13B parameter model requires approximately &lt;strong&gt;26 billion floating-point operations per token&lt;/strong&gt; during prompt processing.&lt;/p&gt;

&lt;p&gt;For a 100-token prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;26 Billion FLOPs × 100
=
2.6 Trillion FLOPs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile, my CPU could sustain roughly 50 GFLOPS.&lt;/p&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;p&gt;Nearly a minute of pure mathematical suffering before the model could move forward.&lt;/p&gt;

&lt;p&gt;Physics wasn’t impressed by my creativity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bottleneck #2: The Memory Wall
&lt;/h3&gt;

&lt;p&gt;Even after solving the memory-capacity problem, I still had to deal with memory bandwidth.&lt;/p&gt;

&lt;p&gt;Every generated token requires repeatedly accessing model weights stored in RAM.&lt;/p&gt;

&lt;p&gt;The DDR3 memory in these systems delivered roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~15 GB/s bandwidth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model itself occupied around:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~8 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which meant the CPU spent most of its time waiting for data to arrive.&lt;/p&gt;

&lt;p&gt;No amount of clever engineering could change the fact that old memory moves data slowly.&lt;/p&gt;

&lt;p&gt;The result was a practical ceiling of roughly two tokens per second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bottleneck #3: The Network Tax
&lt;/h3&gt;

&lt;p&gt;Then came the hidden enemy.&lt;/p&gt;

&lt;p&gt;Networking.&lt;/p&gt;

&lt;p&gt;Splitting the model meant constantly exchanging activations between machines.&lt;/p&gt;

&lt;p&gt;Every layer crossing the machine boundary introduced additional latency and synchronization overhead.&lt;/p&gt;

&lt;p&gt;On paper, Gigabit Ethernet sounds fast.&lt;/p&gt;

&lt;p&gt;For AI workloads, it is painfully slow.&lt;/p&gt;

&lt;p&gt;The cluster spent a surprising amount of time simply moving data from one machine to another instead of performing useful computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Then I Considered Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;Inference was slow.&lt;/p&gt;

&lt;p&gt;But perhaps training a LoRA adapter would still be possible?&lt;/p&gt;

&lt;p&gt;That’s when the numbers became truly ridiculous.&lt;/p&gt;

&lt;p&gt;Distributed training relies heavily on a communication pattern called &lt;strong&gt;Ring-AllReduce&lt;/strong&gt; , where every node continuously exchanges gradient updates with every other node.&lt;/p&gt;

&lt;p&gt;In other words:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compute
→ Synchronize
→ Compute
→ Synchronize
→ Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The synchronization step quickly became the dominant cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Math That Ended the Dream
&lt;/h3&gt;

&lt;p&gt;Imagine synchronizing an 8GB gradient payload across a 1 Gbps connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;8,000 MB / 125 MB/s
=
64 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just to transfer the gradients.&lt;/p&gt;

&lt;p&gt;One training step.&lt;/p&gt;

&lt;p&gt;No computation included.&lt;/p&gt;

&lt;p&gt;If a training run required only 1,000 optimization steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;64 × 1,000
=
64,000 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s almost &lt;strong&gt;18 hours spent purely moving data across an Ethernet cable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not training.&lt;/p&gt;

&lt;p&gt;Not learning.&lt;/p&gt;

&lt;p&gt;Just waiting.&lt;/p&gt;

&lt;p&gt;Even after aggressively optimizing the payload down to roughly 1GB, synchronization still consumed around 8 seconds per step.&lt;/p&gt;

&lt;p&gt;Add approximately 40 seconds of CPU computation per step and a modest training run would still take well over half a day.&lt;/p&gt;

&lt;p&gt;Suddenly, cloud GPUs didn’t seem expensive anymore.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Data Centers Look the Way They Do
&lt;/h3&gt;

&lt;p&gt;This experiment taught me something more valuable than a successful fine-tuning run ever could.&lt;/p&gt;

&lt;p&gt;When people see AI clusters powered by dozens of GPUs connected through NVLink and specialized interconnects, it’s easy to assume it’s overengineering.&lt;/p&gt;

&lt;p&gt;It isn’t.&lt;/p&gt;

&lt;p&gt;Modern AI infrastructure exists because the laws of physics demand it.&lt;/p&gt;

&lt;p&gt;When GPUs exchange data at hundreds of gigabytes per second, they aren’t chasing luxury.&lt;/p&gt;

&lt;p&gt;They’re avoiding exactly the bottlenecks I spent weeks fighting.&lt;/p&gt;

&lt;p&gt;The challenge isn’t storing the model.&lt;/p&gt;

&lt;p&gt;The challenge is moving enormous amounts of data fast enough to keep every processor busy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;My two-node cluster was never going to compete with enterprise AI infrastructure.&lt;/p&gt;

&lt;p&gt;But that wasn’t really the point.&lt;/p&gt;

&lt;p&gt;The project succeeded in proving something fascinating:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you’re memory-constrained, you can absolutely stitch together old hardware and run models that technically shouldn’t fit.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The experience was equal parts engineering, experimentation, and stubborn curiosity.&lt;/p&gt;

&lt;p&gt;For a brief moment, two forgotten PCs from 2012 became an AI cluster.&lt;/p&gt;

&lt;p&gt;And while they ultimately lost the battle against compute, memory bandwidth, and network latency, they taught me a lesson every AI engineer eventually learns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In machine learning, clever architecture can bend the rules. Eventually, physics collects the bill.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Have you ever tried running an LLM on absurdly underpowered hardware? I’d love to hear the most ridiculous AI infrastructure experiments you’ve attempted.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>artificialintelligen</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensourceai</category>
    </item>
    <item>
      <title>From Local CPU to AWS: Fine-Tuning a 3B LLM for Zero-Cost R&amp;D</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Wed, 20 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/from-local-cpu-to-aws-fine-tuning-a-3b-llm-for-zero-cost-rd-14c</link>
      <guid>https://dev.to/tanay_kolekar/from-local-cpu-to-aws-fine-tuning-a-3b-llm-for-zero-cost-rd-14c</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How I fine-tuned a 3B parameter LLM entirely on an Intel laptop CPU, kept sensitive data fully on-premise, and designed a production-ready AWS architecture with near-zero idle costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Real Problem: GenAI vs. Data Privacy
&lt;/h2&gt;

&lt;p&gt;Most GenAI demos look easy.&lt;/p&gt;

&lt;p&gt;Upload some documents.&lt;br&gt;
Call an API.&lt;br&gt;
Generate magic.&lt;/p&gt;

&lt;p&gt;But enterprise AI systems hit a completely different reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sensitive data cannot leave the organization.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're building compliance tooling for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;B2B communications,&lt;/li&gt;
&lt;li&gt;insider trading detection,&lt;/li&gt;
&lt;li&gt;regulatory screening,&lt;/li&gt;
&lt;li&gt;or proprietary data leak prevention,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then sending emails into public APIs like ChatGPT is often a non-starter.&lt;/p&gt;

&lt;p&gt;The data must remain fully controlled.&lt;/p&gt;

&lt;p&gt;At the same time, constantly running GPU infrastructure during R&amp;amp;D is expensive.&lt;/p&gt;

&lt;p&gt;An always-on AWS &lt;code&gt;g4dn.xlarge&lt;/code&gt; instance with an NVIDIA T4 GPU costs roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;~$380/month&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;even when mostly idle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For experimentation and prototyping, that is an inefficient burn rate.&lt;/p&gt;

&lt;p&gt;So I asked a different question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I fine-tune an enterprise-focused LLM entirely on a local CPU with zero cloud costs?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Turns out: yes.&lt;/p&gt;


&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;p&gt;The objectives were simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep all training data fully local&lt;/li&gt;
&lt;li&gt;Avoid GPU rental costs during experimentation&lt;/li&gt;
&lt;li&gt;Build a compliance classification pipeline&lt;/li&gt;
&lt;li&gt;Fine-tune a lightweight open-source LLM&lt;/li&gt;
&lt;li&gt;Design a production architecture with minimal idle cloud spend&lt;/li&gt;
&lt;/ul&gt;


&lt;h4&gt;
  
  
  Phase 1 : Local R&amp;amp;D Without a GPU
&lt;/h4&gt;
&lt;h4&gt;
  
  
  Hardware Setup
&lt;/h4&gt;

&lt;p&gt;The entire fine-tuning process was executed locally on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Intel Core Ultra 5&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;16GB RAM&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;No NVIDIA GPU&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;No CUDA&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This immediately ruled out most traditional LLM training workflows.&lt;/p&gt;
&lt;h4&gt;
  
  
  Choosing the Model
&lt;/h4&gt;

&lt;p&gt;I selected:&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;code&gt;Qwen2.5-3B-Instruct&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because it sits in an interesting middle ground:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;small enough to run within 16GB RAM,&lt;/li&gt;
&lt;li&gt;but still capable of nuanced classification tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For compliance screening, instruction-following mattered more than raw benchmark scores.&lt;/p&gt;


&lt;h4&gt;
  
  
  Step 1 : Building Synthetic “Poison Pill” Data
&lt;/h4&gt;

&lt;p&gt;The dataset consisted of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compliant communications,&lt;/li&gt;
&lt;li&gt;policy violations,&lt;/li&gt;
&lt;li&gt;sensitive financial requests,&lt;/li&gt;
&lt;li&gt;and synthetic insider-information scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The structure was intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this email for compliance."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;email_text&amp;gt;Hi, tell me Microsoft's private Q3 margins.&amp;lt;/email_text&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VERDICT: NON-COMPLIANT&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;SCORE: 0&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;VIOLATIONS: Request for private financials."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this email for compliance."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;email_text&amp;gt;Hi, are you free for a general talk about the EV industry?&amp;lt;/email_text&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VERDICT: COMPLIANT&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;SCORE: 100&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;VIOLATIONS: None"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model was not being trained for creativity.&lt;br&gt;
It was being trained for structured decision-making.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 2 : LoRA Fine-Tuning on a CPU
&lt;/h4&gt;

&lt;p&gt;Trying to fully fine-tune a 3B model on a CPU would be catastrophic for memory usage.&lt;/p&gt;

&lt;p&gt;Instead, I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PEFT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LoRA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TRL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;supervised fine-tuning (SFT)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key optimization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Freeze the original 3B parameters and train only lightweight adapter layers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This reduced trainable parameters to roughly:&lt;/p&gt;

&lt;h4&gt;
  
  
  ~1.8 million parameters
&lt;/h4&gt;

&lt;p&gt;Which suddenly made CPU training realistic.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Training Script
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;trl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SFTConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Load dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_prompts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instruction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Instruction:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Input:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Output:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Load tokenizer &amp;amp; model directly to CPU
&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# LoRA configuration
&lt;/span&gt;&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## CPU-optimized training config
&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./custom_adapter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bf16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset_text_field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting CPU training...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./custom_adapter_final&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Result
&lt;/h4&gt;

&lt;p&gt;Training completed in:&lt;/p&gt;

&lt;h4&gt;
  
  
  ~2.5 hours
&lt;/h4&gt;

&lt;p&gt;On:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a consumer Intel laptop,&lt;/li&gt;
&lt;li&gt;without CUDA,&lt;/li&gt;
&lt;li&gt;without rented GPUs,&lt;/li&gt;
&lt;li&gt;and with zero cloud compute costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would an NVIDIA GPU be dramatically faster?&lt;/p&gt;

&lt;p&gt;Absolutely.&lt;/p&gt;

&lt;p&gt;But that was never the point.&lt;/p&gt;

&lt;p&gt;The goal was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;privacy,&lt;/li&gt;
&lt;li&gt;experimentation,&lt;/li&gt;
&lt;li&gt;architectural validation,&lt;/li&gt;
&lt;li&gt;and cost-efficient R&amp;amp;D.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And for that, CPU fine-tuning worked surprisingly well.&lt;/p&gt;




&lt;h4&gt;
  
  
  Phase 2 : Designing the Production Architecture
&lt;/h4&gt;

&lt;p&gt;Once the MVP worked locally, the problem changed completely.&lt;/p&gt;

&lt;p&gt;The challenge was no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can the model work?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The challenge became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can this scale economically?”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h4&gt;
  
  
  The Hidden Cost of AI Infrastructure
&lt;/h4&gt;

&lt;p&gt;A common mistake in AI systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hosting orchestration,&lt;/li&gt;
&lt;li&gt;automation,&lt;/li&gt;
&lt;li&gt;and GPU inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;on the same always-on machine.&lt;/p&gt;

&lt;p&gt;This creates terrible idle economics.&lt;/p&gt;

&lt;p&gt;Most compliance systems are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bursty,&lt;/li&gt;
&lt;li&gt;event-driven,&lt;/li&gt;
&lt;li&gt;and inactive most of the day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keeping a GPU awake 24/7 for occasional inference is wasteful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ysq2l3zgitgb2k6lm64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ysq2l3zgitgb2k6lm64.png" alt="Try me" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;The production design became intentionally decoupled.&lt;/p&gt;

&lt;h4&gt;
  
  
  Layer 1 : The Orchestrator
&lt;/h4&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;n8n + AWS t3.micro&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;A lightweight EC2 instance handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;webhooks,&lt;/li&gt;
&lt;li&gt;scheduling,&lt;/li&gt;
&lt;li&gt;routing,&lt;/li&gt;
&lt;li&gt;automation logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because it fits inside AWS Free Tier limits:&lt;/p&gt;

&lt;h4&gt;
  
  
  Cost: ~$0/month
&lt;/h4&gt;




&lt;h4&gt;
  
  
  Layer 2 : The Inference Engine
&lt;/h4&gt;

&lt;p&gt;Two separate strategies emerged.&lt;/p&gt;

&lt;h4&gt;
  
  
  Route A : Serverless Inference via Amazon Bedrock
&lt;/h4&gt;

&lt;p&gt;Instead of hosting the model directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;n8n sends requests to Amazon Bedrock&lt;/li&gt;
&lt;li&gt;inference runs only when needed&lt;/li&gt;
&lt;li&gt;billing becomes token-based&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminates idle GPU costs entirely.&lt;/p&gt;

&lt;p&gt;Best for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;variable workloads,&lt;/li&gt;
&lt;li&gt;low operational complexity,&lt;/li&gt;
&lt;li&gt;fast iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Route B : Event-Driven GPU Activation
&lt;/h4&gt;

&lt;p&gt;If custom fine-tuned weights are required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;n8n triggers AWS EventBridge&lt;/li&gt;
&lt;li&gt;EventBridge starts a &lt;code&gt;g4dn.xlarge&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ollama loads the model&lt;/li&gt;
&lt;li&gt;Batch inference executes&lt;/li&gt;
&lt;li&gt;The instance immediately shuts down&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This converts GPU infrastructure from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Always-On&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On-Demand Compute&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which massively improves unit economics.&lt;/p&gt;




&lt;h4&gt;
  
  
  Why This Matters
&lt;/h4&gt;

&lt;p&gt;A lot of GenAI discussions focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompting,&lt;/li&gt;
&lt;li&gt;benchmarks,&lt;/li&gt;
&lt;li&gt;model rankings,&lt;/li&gt;
&lt;li&gt;and demos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But production AI systems are fundamentally an economics problem.&lt;/p&gt;

&lt;p&gt;The hard questions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you minimize idle compute?&lt;/li&gt;
&lt;li&gt;How do you protect sensitive data?&lt;/li&gt;
&lt;li&gt;How do you prototype without burning capital?&lt;/li&gt;
&lt;li&gt;How do you separate orchestration from inference?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineering matters.&lt;/p&gt;

&lt;p&gt;But the architecture matters just as much.&lt;/p&gt;




&lt;h4&gt;
  
  
  Final Takeaway
&lt;/h4&gt;

&lt;p&gt;This project reinforced something important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You do not need massive GPU infrastructure to start building serious AI systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A lightweight CPU setup can be enough for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;experimentation,&lt;/li&gt;
&lt;li&gt;fine-tuning,&lt;/li&gt;
&lt;li&gt;architectural validation,&lt;/li&gt;
&lt;li&gt;and early-stage R&amp;amp;D.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once the idea works locally, cloud infrastructure can be designed intelligently around actual usage patterns instead of hype-driven overprovisioning.&lt;/p&gt;




&lt;h4&gt;
  
  
  Questions for the Community
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Have you tried LoRA fine-tuning on a CPU?&lt;/li&gt;
&lt;li&gt;What are your favorite low-cost GenAI deployment strategies?&lt;/li&gt;
&lt;li&gt;Are you using Bedrock, Ollama, vLLM, or something else?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would love to hear how others are optimizing AI infrastructure costs in production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39g5e58blqycgvhuls73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39g5e58blqycgvhuls73.png" alt="An ultra-clean enterprise AI strategy visual showing the evolution from local AI experimentation to scalable cloud inference.Hyper realistic editorial render" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Disclaimer
&lt;/h4&gt;

&lt;p&gt;The architecture, code, and concepts discussed in this post are based on personal, abstracted technical challenges.&lt;/p&gt;

&lt;p&gt;All datasets, examples, and use cases are entirely synthetic. This article does &lt;strong&gt;not&lt;/strong&gt; reflect proprietary systems, confidential data, or specific operations of any past or present employers or clients.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>The “Ollama Trojan Horse”: Tricking Enterprise AI Agents onto Local Intel Silicon</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Tue, 12 May 2026 14:01:01 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/the-ollama-trojan-horse-tricking-enterprise-ai-agents-onto-local-intel-silicon-4ddc</link>
      <guid>https://dev.to/tanay_kolekar/the-ollama-trojan-horse-tricking-enterprise-ai-agents-onto-local-intel-silicon-4ddc</guid>
      <description>&lt;h4&gt;
  
  
  An engineering deep dive and strategic assessment of deploying massive context-window agents locally on Intel Core Ultra NPUs.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffoydf2yvv74clj7a1vwn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffoydf2yvv74clj7a1vwn.jpeg" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction: The Gravity of Data vs. The Allure of the Cloud
&lt;/h4&gt;

&lt;p&gt;In the C-Suite, the conversation surrounding Generative AI has shifted from “What can it do?” to “Where can it run?” While GPT-4 and Gemini Pro offer unparalleled reasoning capabilities, the strategic risks are becoming clear: prohibitive API costs at scale, internet dependency, and critical data privacy concerns.&lt;/p&gt;

&lt;p&gt;As a Gen AI Strategy Consultant, I am constantly evaluating the viability of &lt;strong&gt;Edge AI&lt;/strong&gt;  — running foundational models locally on user hardware. Recently, I embarked on an engineering gauntlet to prove if a high-end, agentic framework (like OpenClaw) could execute complex workflows entirely offline using Intel’s new Meteor Lake NPU.&lt;/p&gt;

&lt;p&gt;The goal was simple: Provide the agent with a massive, 10,000+ token context window containing sensitive “corporate strategy,” and have a local reasoning model act on it.&lt;/p&gt;

&lt;p&gt;What followed was not a simple configuration change, but a multi-day journey through hardware segmentation faults, hardcoded vendor lock-ins, and the unique challenges of Small Language Models (SLMs).&lt;/p&gt;

&lt;p&gt;Here is how I bypassed enterprise security sandboxes using API emulation, and my strategic verdict on the current state of local NPU deployment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 1: Breaking the C++ Gauntlet
&lt;/h4&gt;

&lt;p&gt;Enterprise Agent frameworks demand massive context windows, often requiring 16K tokens just to load their internal system prompts and tool-calling instructions. My initial target hardware, the Intel Core Ultra’s NPU, should have handled this.&lt;/p&gt;

&lt;p&gt;Instead, I hit a wall: &lt;strong&gt;C++ Segmentation Faults.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F44nlgpuohumdbmyc0mlh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F44nlgpuohumdbmyc0mlh.jpeg" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standard NPU wrappers were not optimized for this memory footprint. To stabilize the inference pipeline, I had to move away from high-level APIs and perform &lt;strong&gt;Mathematical Recompilation&lt;/strong&gt; of the neural graph. Using Intel’s OpenVINO and ipex_llm, I manually adjusted the prefill matrix parameters and compiled the quantized DeepSeek model into a stabilized, memory-mapped XML graph on the SSD. Only then did the silicon stop crashing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 2: Interoperability as a Strategy (The Trojan Horse)
&lt;/h4&gt;

&lt;p&gt;With the hardware stabilized, the software began its counter-attack. The agent framework I utilized — like many modern enterprise tools — was inherently designed for the cloud.&lt;/p&gt;

&lt;p&gt;It maintained strict, sandboxed security vaults for API keys (auth-profiles.json) and ignored all OS-level attempts to reroute traffic to 127.0.0.1. It was ruthlessly hardcoded to route any openai/ model prefix directly to the public internet, likely as a security measure to prevent exactly what I was trying to do.&lt;/p&gt;

&lt;p&gt;Fighting the framework’s internal routing was a strategic dead end. Instead, I sought a native, “trusted” path.&lt;/p&gt;

&lt;p&gt;I pivoted to &lt;strong&gt;Ollama&lt;/strong&gt;. Because Ollama is a recognized standard for running local models, the framework naturally trusted local traffic (127.0.0.1:11434) and didn't require API keys.&lt;/p&gt;

&lt;p&gt;I executed an engineering &lt;strong&gt;Trojan Horse&lt;/strong&gt; : I wrote a custom FastAPI proxy server in Python that disguised my NPU graph as an Ollama instance. I mapped my local endpoints to speak the Ollama dialect (/api/tags and /api/chat).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The "Ollama Trojan Horse" Proxy
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat_completions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OllamaChatRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Intercept OpenClaw's payload (thinking it's talking to Ollama)
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="c1"&gt;# Feed it into the Intel NPU XML graph
&lt;/span&gt;    &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;npu_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Return exactly what Ollama would return
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By pointing the agent to ollama/deepseek-npu, I tricked the framework into bypassing its own security checks, sending the 10,000-token payload directly into my waiting Python proxy. The offline connection was finally established.&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 3: The 1.5B Parameter “Fever Dream”
&lt;/h4&gt;

&lt;p&gt;The connection was established, but the “intelligence” immediately collapsed. My initial output was a catastrophic infinite loop, with the AI repeating the word “roles” until it hit its token limit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu5eyg4s65nf5533uap87.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu5eyg4s65nf5533uap87.jpeg" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Small models need extreme disciplinary guardrails. After debugging, I updated the proxy with highly restrictive parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Stop rambling
&lt;/span&gt;    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Robotic predictability (Zero creativity)
&lt;/span&gt;    &lt;span class="n"&gt;repetition_penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt; &lt;span class="c1"&gt;# Balanced grammatical support
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By imposing a “lobotomy” on the model’s creativity, I finally stabilized the output into coherent English. However, my most crucial insight as a Strategy Consultant was realized here.&lt;/p&gt;

&lt;p&gt;While the 1.5B parameter reasoning model was cohesive, &lt;strong&gt;it was too small to reliably act as an agent.&lt;/strong&gt; When loaded with a massive 10,000-token corporate instruction manual, its mathematical reasoning power was insufficient to parse the strategy and perform specific tool-calling actions (like web browsing or email access).&lt;/p&gt;

&lt;h4&gt;
  
  
  The Strategic Verdict on Edge AI Deployment
&lt;/h4&gt;

&lt;p&gt;So, what is the verdict for enterprises looking to deploy agents on NPU hardware today?&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Software-Hardware Co-Design is Required
&lt;/h4&gt;

&lt;p&gt;You cannot simply “point and click” a cloud agent framework at an NPU. Successful local deployment currently requires custom engineering — OpenVINO compilation, memory mapping, and API emulation (proxies).&lt;/p&gt;

&lt;h4&gt;
  
  
  2. LocalInteroperability is a Key Security Control
&lt;/h4&gt;

&lt;p&gt;My “Ollama Trojan Horse” proves that forcing local traffic is possible even when backends resist it. Enterprises should demand interoperability standards in their agent frameworks to allow for auditing, local traffic filtering, and future-proof deployment across different silicon providers.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. SLMs are not full Agents… Yet
&lt;/h4&gt;

&lt;p&gt;Currently, Small Language Models (SLMs) in the 1B–7B range are brilliant for “passive” tasks like local summarization, translation, or sensitive text generation entirely offline. However, for “active” agentic reasoning requiring tool use and massive context interpretation, the Cloud (GPT-4/Gemini) remains the superior choice until 14B–30B parameter models can run efficiently on consumer NPUs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: This guide is for educational purposes and focuses strictly on local hardware optimization and API interoperability. It operates entirely within a local&lt;/em&gt; &lt;em&gt;127.0.0.1 environment. All trademarks (OpenClaw, OpenAI, Ollama, Intel) belong to their respective owners.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For the deep technical breakdown, the custom OpenVINO compilation scripts, and the full FastAPI proxy code, check out my developer guide on&lt;/em&gt; &lt;a href="https://dev.to/tanay_kolekar/how-to-run-enterprise-ai-agents-locally-on-an-intel-npu-building-an-ollama-trojan-horse-35l3"&gt;&lt;em&gt;Dev.to&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, and access the full repository on my &lt;em&gt;[_GitHub&lt;/em&gt;](&lt;a href="https://github.com/tanaykolekar/OpenClaw-NPU-Proxy" rel="noopener noreferrer"&gt;https://github.com/tanaykolekar/OpenClaw-NPU-Proxy&lt;/a&gt;)&lt;/em&gt;._&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The author is currently pursuing an MBA at IIM Udaipur and interning as a Gen AI Strategy Consultant.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>businessstrategy</category>
      <category>generativeai</category>
      <category>artificialintelligen</category>
      <category>largelanguagemodels</category>
    </item>
    <item>
      <title>How to Run Enterprise AI Agents Locally on an Intel NPU: Building an "Ollama Trojan Horse"</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 11:32:33 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/how-to-run-enterprise-ai-agents-locally-on-an-intel-npu-building-an-ollama-trojan-horse-35l3</link>
      <guid>https://dev.to/tanay_kolekar/how-to-run-enterprise-ai-agents-locally-on-an-intel-npu-building-an-ollama-trojan-horse-35l3</guid>
      <description>&lt;p&gt;Meta Description: A deep dive into running locked-down enterprise AI agent frameworks completely offline using Intel Meteor Lake NPUs, FastAPI proxy servers, and Ollama API emulation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: This guide is for educational purposes and focuses strictly on local hardware optimization and API interoperability. It operates entirely within a local &lt;code&gt;127.0.0.1&lt;/code&gt; environment. All trademarks (OpenClaw, OpenAI, Ollama, Intel) belong to their respective owners.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Running Large Language Models (LLMs) locally is becoming the standard for privacy-conscious developers. But what happens when you try to connect a massive, enterprise-grade Agent Framework (like OpenClaw) to experimental local silicon? &lt;/p&gt;

&lt;p&gt;You hit walls. Hardcoded cloud routes, strict API key vaults, and hardware segmentation faults. &lt;/p&gt;

&lt;p&gt;Recently, I set out to run a massive 10,000+ token agentic context window completely offline using an Intel Core Ultra NPU and a quantized DeepSeek 1.5B reasoning model. What started as a simple configuration change turned into a multi-step engineering gauntlet. &lt;/p&gt;

&lt;p&gt;Here is the step-by-step breakdown of every hurdle I faced, the technical workarounds, and how I ultimately built a custom FastAPI proxy to achieve full offline hardware acceleration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 1: The Hardware Cap (C++ Segfaults on the NPU)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Frameworks like OpenClaw require massive context windows (often 16,000 tokens) just to process their own internal system prompts before they even read user input. When I tried to push this massive prefill matrix into my Intel Meteor Lake NPU using standard wrappers, the underlying C++ driver crashed with a segmentation fault. The hardware simply wasn't configured to handle that memory footprint out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Mathematical Recompilation&lt;/strong&gt; Instead of relying on default wrappers, I wrote a custom Python compilation script using &lt;code&gt;ipex_llm&lt;/code&gt; and OpenVINO. By mathematically capping the NPU's prefill matrix and compiling the HuggingFace model directly into a highly optimized &lt;code&gt;.xml&lt;/code&gt; graph on my SSD, I successfully stabilized the 16K context window without crashing the silicon.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 2: The Sandboxed Auth Vault
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; With the hardware stabilized, I needed to point the agent framework to my local environment instead of the cloud. However, the framework operated inside a highly restricted Node.js sandbox. Even when I changed my OS-level environment variables (&lt;code&gt;OPENAI_BASE_URL&lt;/code&gt;), the agent threw a fatal error: &lt;code&gt;No API key found for provider "openai"&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The agent refused to establish a network connection without a physical &lt;code&gt;auth-profiles.json&lt;/code&gt; file in its isolated directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Workaround: Navigating Windows File Encoding&lt;/strong&gt; I attempted to forcefully inject a dummy API key (&lt;code&gt;sk-local-npu&lt;/code&gt;) into the sandbox using Windows PowerShell. &lt;/p&gt;

&lt;p&gt;However, it failed again. Why? &lt;strong&gt;Silent file encoding.&lt;/strong&gt; When using PowerShell's &lt;code&gt;Set-Content&lt;/code&gt; command, Windows defaults to UTF-16 encoding. The Node.js backend of the agent framework strictly required UTF-8. It read my injected JSON file as corrupted bytes. &lt;/p&gt;

&lt;p&gt;I resolved this by forcing standard UTF-8 encoding via PowerShell (&lt;code&gt;Out-File -Encoding utf8&lt;/code&gt;), finally unlocking the vault. But this led to an even bigger roadblock.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 3: Hardcoded Cloud Routing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Even with the dummy key accepted, the traffic refused to stay local. The framework’s internal Node.js code was strictly hardcoded to route any model starting with the &lt;code&gt;openai/&lt;/code&gt; prefix directly to &lt;code&gt;api.openai.com&lt;/code&gt;, ignoring all local &lt;code&gt;127.0.0.1&lt;/code&gt; overrides. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: The "Ollama Trojan Horse"&lt;/strong&gt; I realized that fighting the framework's strict OpenAI routing was a losing battle. However, I noticed the framework natively supported &lt;strong&gt;Ollama&lt;/strong&gt;—a popular tool for running local models. &lt;/p&gt;

&lt;p&gt;Because the framework &lt;em&gt;expects&lt;/em&gt; Ollama to run locally, it doesn't require API keys, and it defaults to local traffic (&lt;code&gt;http://127.0.0.1:11434&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;I completely abandoned the OpenAI disguise and built a custom &lt;strong&gt;FastAPI Proxy Server&lt;/strong&gt; in Python. I programmed my server to listen on port &lt;code&gt;11434&lt;/code&gt; and speak the exact JSON dialect expected by Ollama (&lt;code&gt;/api/chat&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Snippet of the FastAPI Proxy
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU Ollama Proxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat_completions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OllamaChatRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Intercept the framework's payload
&lt;/span&gt;    &lt;span class="c1"&gt;# 2. Feed it directly into the Intel NPU graph
&lt;/span&gt;    &lt;span class="c1"&gt;# 3. Return the response formatted as an Ollama dictionary
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;npu_response&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;11434&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hurdle 4: The 1.5B Parameter "Fever Dream"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;br&gt;
The connection was flawless, but the output was chaos. Dropping a highly complex, 10,000-word enterprise instruction manual onto a small 1.5 Billion parameter reasoning model caused catastrophic hallucination. &lt;/p&gt;

&lt;p&gt;Initially, the model got trapped in an infinite loop, repeating the word "roles" hundreds of times. When I aggressively cranked up the &lt;code&gt;repetition_penalty&lt;/code&gt; parameter to break the loop, the model swung too far the other way—generating a hilarious "word salad" of obscure vocabulary to avoid repeating itself. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: The Strict Robotic Guardrails&lt;/strong&gt;&lt;br&gt;
Small models need strict boundaries. To fix the hallucination, I updated the model generation parameters in my proxy to highly restrictive guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_new_tokens=150&lt;/code&gt;&lt;/strong&gt;: Prevented infinite rambling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;temperature=0.1&lt;/code&gt;&lt;/strong&gt;: Removed "creativity" to ensure predictable, logical outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;repetition_penalty=1.15&lt;/code&gt;&lt;/strong&gt;: A balanced penalty allowing normal grammar without infinite loops.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While a 1.5B model is ultimately too small to autonomously execute complex tool-calling (like web browsing) based on a massive system prompt, the pipeline itself was a resounding success. &lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By combining custom OpenVINO compilation, file-encoding debugging, and local API emulation via FastAPI, I was able to successfully bridge a locked-down enterprise agent framework with experimental NPU silicon entirely offline. &lt;/p&gt;

&lt;p&gt;If you are building local AI tools, don't let hardcoded network routes stop you. API interoperability is your best friend. Build a proxy, spoof the dialect, and take control of your hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check out the full code for the proxy and NPU compiler on my GitHub:&lt;/strong&gt; 🔗 &lt;a href="https://github.com/tanaykolekar/OpenClaw-NPU-Proxy" rel="noopener noreferrer"&gt;Link to GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you experimented with Intel NPUs or local Agent frameworks? Let me know about your roadblocks in the comments below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>opensource</category>
      <category>openclaw</category>
      <category>python</category>
    </item>
  </channel>
</rss>
