<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tanay Kolekar</title>
    <description>The latest articles on DEV Community by Tanay Kolekar (@tanay_kolekar).</description>
    <link>https://dev.to/tanay_kolekar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3888752%2Fb3844ddf-a074-43ed-9817-375fcaba58c9.jpg</url>
      <title>DEV Community: Tanay Kolekar</title>
      <link>https://dev.to/tanay_kolekar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tanay_kolekar"/>
    <language>en</language>
    <item>
      <title>From Local CPU to AWS: Fine-Tuning a 3B LLM for Zero-Cost R&amp;D</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Wed, 20 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/from-local-cpu-to-aws-fine-tuning-a-3b-llm-for-zero-cost-rd-14c</link>
      <guid>https://dev.to/tanay_kolekar/from-local-cpu-to-aws-fine-tuning-a-3b-llm-for-zero-cost-rd-14c</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How I fine-tuned a 3B parameter LLM entirely on an Intel laptop CPU, kept sensitive data fully on-premise, and designed a production-ready AWS architecture with near-zero idle costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Real Problem: GenAI vs. Data Privacy
&lt;/h2&gt;

&lt;p&gt;Most GenAI demos look easy.&lt;/p&gt;

&lt;p&gt;Upload some documents.&lt;br&gt;
Call an API.&lt;br&gt;
Generate magic.&lt;/p&gt;

&lt;p&gt;But enterprise AI systems hit a completely different reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sensitive data cannot leave the organization.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're building compliance tooling for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;B2B communications,&lt;/li&gt;
&lt;li&gt;insider trading detection,&lt;/li&gt;
&lt;li&gt;regulatory screening,&lt;/li&gt;
&lt;li&gt;or proprietary data leak prevention,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then sending emails into public APIs like ChatGPT is often a non-starter.&lt;/p&gt;

&lt;p&gt;The data must remain fully controlled.&lt;/p&gt;

&lt;p&gt;At the same time, constantly running GPU infrastructure during R&amp;amp;D is expensive.&lt;/p&gt;

&lt;p&gt;An always-on AWS &lt;code&gt;g4dn.xlarge&lt;/code&gt; instance with an NVIDIA T4 GPU costs roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;~$380/month&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;even when mostly idle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For experimentation and prototyping, that is an inefficient burn rate.&lt;/p&gt;

&lt;p&gt;So I asked a different question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I fine-tune an enterprise-focused LLM entirely on a local CPU with zero cloud costs?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Turns out: yes.&lt;/p&gt;


&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;p&gt;The objectives were simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep all training data fully local&lt;/li&gt;
&lt;li&gt;Avoid GPU rental costs during experimentation&lt;/li&gt;
&lt;li&gt;Build a compliance classification pipeline&lt;/li&gt;
&lt;li&gt;Fine-tune a lightweight open-source LLM&lt;/li&gt;
&lt;li&gt;Design a production architecture with minimal idle cloud spend&lt;/li&gt;
&lt;/ul&gt;


&lt;h4&gt;
  
  
  Phase 1 : Local R&amp;amp;D Without a GPU
&lt;/h4&gt;
&lt;h4&gt;
  
  
  Hardware Setup
&lt;/h4&gt;

&lt;p&gt;The entire fine-tuning process was executed locally on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Intel Core Ultra 5&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;16GB RAM&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;No NVIDIA GPU&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;No CUDA&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This immediately ruled out most traditional LLM training workflows.&lt;/p&gt;
&lt;h4&gt;
  
  
  Choosing the Model
&lt;/h4&gt;

&lt;p&gt;I selected:&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;code&gt;Qwen2.5-3B-Instruct&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because it sits in an interesting middle ground:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;small enough to run within 16GB RAM,&lt;/li&gt;
&lt;li&gt;but still capable of nuanced classification tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For compliance screening, instruction-following mattered more than raw benchmark scores.&lt;/p&gt;


&lt;h4&gt;
  
  
  Step 1 : Building Synthetic “Poison Pill” Data
&lt;/h4&gt;

&lt;p&gt;The dataset consisted of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compliant communications,&lt;/li&gt;
&lt;li&gt;policy violations,&lt;/li&gt;
&lt;li&gt;sensitive financial requests,&lt;/li&gt;
&lt;li&gt;and synthetic insider-information scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The structure was intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this email for compliance."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;email_text&amp;gt;Hi, tell me Microsoft's private Q3 margins.&amp;lt;/email_text&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VERDICT: NON-COMPLIANT&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;SCORE: 0&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;VIOLATIONS: Request for private financials."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this email for compliance."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;email_text&amp;gt;Hi, are you free for a general talk about the EV industry?&amp;lt;/email_text&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VERDICT: COMPLIANT&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;SCORE: 100&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;VIOLATIONS: None"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model was not being trained for creativity.&lt;br&gt;
It was being trained for structured decision-making.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 2 : LoRA Fine-Tuning on a CPU
&lt;/h4&gt;

&lt;p&gt;Trying to fully fine-tune a 3B model on a CPU would be catastrophic for memory usage.&lt;/p&gt;

&lt;p&gt;Instead, I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PEFT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LoRA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TRL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;supervised fine-tuning (SFT)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key optimization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Freeze the original 3B parameters and train only lightweight adapter layers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This reduced trainable parameters to roughly:&lt;/p&gt;

&lt;h4&gt;
  
  
  ~1.8 million parameters
&lt;/h4&gt;

&lt;p&gt;Which suddenly made CPU training realistic.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Training Script
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;trl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SFTConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Load dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_prompts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instruction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Instruction:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Input:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Output:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Load tokenizer &amp;amp; model directly to CPU
&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# LoRA configuration
&lt;/span&gt;&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## CPU-optimized training config
&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./custom_adapter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bf16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset_text_field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting CPU training...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./custom_adapter_final&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Result
&lt;/h4&gt;

&lt;p&gt;Training completed in:&lt;/p&gt;

&lt;h4&gt;
  
  
  ~2.5 hours
&lt;/h4&gt;

&lt;p&gt;On:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a consumer Intel laptop,&lt;/li&gt;
&lt;li&gt;without CUDA,&lt;/li&gt;
&lt;li&gt;without rented GPUs,&lt;/li&gt;
&lt;li&gt;and with zero cloud compute costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would an NVIDIA GPU be dramatically faster?&lt;/p&gt;

&lt;p&gt;Absolutely.&lt;/p&gt;

&lt;p&gt;But that was never the point.&lt;/p&gt;

&lt;p&gt;The goal was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;privacy,&lt;/li&gt;
&lt;li&gt;experimentation,&lt;/li&gt;
&lt;li&gt;architectural validation,&lt;/li&gt;
&lt;li&gt;and cost-efficient R&amp;amp;D.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And for that, CPU fine-tuning worked surprisingly well.&lt;/p&gt;




&lt;h4&gt;
  
  
  Phase 2 : Designing the Production Architecture
&lt;/h4&gt;

&lt;p&gt;Once the MVP worked locally, the problem changed completely.&lt;/p&gt;

&lt;p&gt;The challenge was no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can the model work?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The challenge became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can this scale economically?”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h4&gt;
  
  
  The Hidden Cost of AI Infrastructure
&lt;/h4&gt;

&lt;p&gt;A common mistake in AI systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hosting orchestration,&lt;/li&gt;
&lt;li&gt;automation,&lt;/li&gt;
&lt;li&gt;and GPU inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;on the same always-on machine.&lt;/p&gt;

&lt;p&gt;This creates terrible idle economics.&lt;/p&gt;

&lt;p&gt;Most compliance systems are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bursty,&lt;/li&gt;
&lt;li&gt;event-driven,&lt;/li&gt;
&lt;li&gt;and inactive most of the day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keeping a GPU awake 24/7 for occasional inference is wasteful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ysq2l3zgitgb2k6lm64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ysq2l3zgitgb2k6lm64.png" alt="Try me" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;The production design became intentionally decoupled.&lt;/p&gt;

&lt;h4&gt;
  
  
  Layer 1 : The Orchestrator
&lt;/h4&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;n8n + AWS t3.micro&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;A lightweight EC2 instance handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;webhooks,&lt;/li&gt;
&lt;li&gt;scheduling,&lt;/li&gt;
&lt;li&gt;routing,&lt;/li&gt;
&lt;li&gt;automation logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because it fits inside AWS Free Tier limits:&lt;/p&gt;

&lt;h4&gt;
  
  
  Cost: ~$0/month
&lt;/h4&gt;




&lt;h4&gt;
  
  
  Layer 2 : The Inference Engine
&lt;/h4&gt;

&lt;p&gt;Two separate strategies emerged.&lt;/p&gt;

&lt;h4&gt;
  
  
  Route A : Serverless Inference via Amazon Bedrock
&lt;/h4&gt;

&lt;p&gt;Instead of hosting the model directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;n8n sends requests to Amazon Bedrock&lt;/li&gt;
&lt;li&gt;inference runs only when needed&lt;/li&gt;
&lt;li&gt;billing becomes token-based&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminates idle GPU costs entirely.&lt;/p&gt;

&lt;p&gt;Best for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;variable workloads,&lt;/li&gt;
&lt;li&gt;low operational complexity,&lt;/li&gt;
&lt;li&gt;fast iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Route B : Event-Driven GPU Activation
&lt;/h4&gt;

&lt;p&gt;If custom fine-tuned weights are required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;n8n triggers AWS EventBridge&lt;/li&gt;
&lt;li&gt;EventBridge starts a &lt;code&gt;g4dn.xlarge&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ollama loads the model&lt;/li&gt;
&lt;li&gt;Batch inference executes&lt;/li&gt;
&lt;li&gt;The instance immediately shuts down&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This converts GPU infrastructure from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Always-On&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On-Demand Compute&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which massively improves unit economics.&lt;/p&gt;




&lt;h4&gt;
  
  
  Why This Matters
&lt;/h4&gt;

&lt;p&gt;A lot of GenAI discussions focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompting,&lt;/li&gt;
&lt;li&gt;benchmarks,&lt;/li&gt;
&lt;li&gt;model rankings,&lt;/li&gt;
&lt;li&gt;and demos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But production AI systems are fundamentally an economics problem.&lt;/p&gt;

&lt;p&gt;The hard questions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you minimize idle compute?&lt;/li&gt;
&lt;li&gt;How do you protect sensitive data?&lt;/li&gt;
&lt;li&gt;How do you prototype without burning capital?&lt;/li&gt;
&lt;li&gt;How do you separate orchestration from inference?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineering matters.&lt;/p&gt;

&lt;p&gt;But the architecture matters just as much.&lt;/p&gt;




&lt;h4&gt;
  
  
  Final Takeaway
&lt;/h4&gt;

&lt;p&gt;This project reinforced something important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You do not need massive GPU infrastructure to start building serious AI systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A lightweight CPU setup can be enough for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;experimentation,&lt;/li&gt;
&lt;li&gt;fine-tuning,&lt;/li&gt;
&lt;li&gt;architectural validation,&lt;/li&gt;
&lt;li&gt;and early-stage R&amp;amp;D.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once the idea works locally, cloud infrastructure can be designed intelligently around actual usage patterns instead of hype-driven overprovisioning.&lt;/p&gt;




&lt;h4&gt;
  
  
  Questions for the Community
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Have you tried LoRA fine-tuning on a CPU?&lt;/li&gt;
&lt;li&gt;What are your favorite low-cost GenAI deployment strategies?&lt;/li&gt;
&lt;li&gt;Are you using Bedrock, Ollama, vLLM, or something else?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would love to hear how others are optimizing AI infrastructure costs in production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39g5e58blqycgvhuls73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39g5e58blqycgvhuls73.png" alt="An ultra-clean enterprise AI strategy visual showing the evolution from local AI experimentation to scalable cloud inference.Hyper realistic editorial render" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Disclaimer
&lt;/h4&gt;

&lt;p&gt;The architecture, code, and concepts discussed in this post are based on personal, abstracted technical challenges.&lt;/p&gt;

&lt;p&gt;All datasets, examples, and use cases are entirely synthetic. This article does &lt;strong&gt;not&lt;/strong&gt; reflect proprietary systems, confidential data, or specific operations of any past or present employers or clients.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>How to Run Enterprise AI Agents Locally on an Intel NPU: Building an "Ollama Trojan Horse"</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 11:32:33 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/how-to-run-enterprise-ai-agents-locally-on-an-intel-npu-building-an-ollama-trojan-horse-35l3</link>
      <guid>https://dev.to/tanay_kolekar/how-to-run-enterprise-ai-agents-locally-on-an-intel-npu-building-an-ollama-trojan-horse-35l3</guid>
      <description>&lt;p&gt;Meta Description: A deep dive into running locked-down enterprise AI agent frameworks completely offline using Intel Meteor Lake NPUs, FastAPI proxy servers, and Ollama API emulation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: This guide is for educational purposes and focuses strictly on local hardware optimization and API interoperability. It operates entirely within a local &lt;code&gt;127.0.0.1&lt;/code&gt; environment. All trademarks (OpenClaw, OpenAI, Ollama, Intel) belong to their respective owners.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Running Large Language Models (LLMs) locally is becoming the standard for privacy-conscious developers. But what happens when you try to connect a massive, enterprise-grade Agent Framework (like OpenClaw) to experimental local silicon? &lt;/p&gt;

&lt;p&gt;You hit walls. Hardcoded cloud routes, strict API key vaults, and hardware segmentation faults. &lt;/p&gt;

&lt;p&gt;Recently, I set out to run a massive 10,000+ token agentic context window completely offline using an Intel Core Ultra NPU and a quantized DeepSeek 1.5B reasoning model. What started as a simple configuration change turned into a multi-step engineering gauntlet. &lt;/p&gt;

&lt;p&gt;Here is the step-by-step breakdown of every hurdle I faced, the technical workarounds, and how I ultimately built a custom FastAPI proxy to achieve full offline hardware acceleration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 1: The Hardware Cap (C++ Segfaults on the NPU)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Frameworks like OpenClaw require massive context windows (often 16,000 tokens) just to process their own internal system prompts before they even read user input. When I tried to push this massive prefill matrix into my Intel Meteor Lake NPU using standard wrappers, the underlying C++ driver crashed with a segmentation fault. The hardware simply wasn't configured to handle that memory footprint out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Mathematical Recompilation&lt;/strong&gt; Instead of relying on default wrappers, I wrote a custom Python compilation script using &lt;code&gt;ipex_llm&lt;/code&gt; and OpenVINO. By mathematically capping the NPU's prefill matrix and compiling the HuggingFace model directly into a highly optimized &lt;code&gt;.xml&lt;/code&gt; graph on my SSD, I successfully stabilized the 16K context window without crashing the silicon.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 2: The Sandboxed Auth Vault
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; With the hardware stabilized, I needed to point the agent framework to my local environment instead of the cloud. However, the framework operated inside a highly restricted Node.js sandbox. Even when I changed my OS-level environment variables (&lt;code&gt;OPENAI_BASE_URL&lt;/code&gt;), the agent threw a fatal error: &lt;code&gt;No API key found for provider "openai"&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The agent refused to establish a network connection without a physical &lt;code&gt;auth-profiles.json&lt;/code&gt; file in its isolated directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Workaround: Navigating Windows File Encoding&lt;/strong&gt; I attempted to forcefully inject a dummy API key (&lt;code&gt;sk-local-npu&lt;/code&gt;) into the sandbox using Windows PowerShell. &lt;/p&gt;

&lt;p&gt;However, it failed again. Why? &lt;strong&gt;Silent file encoding.&lt;/strong&gt; When using PowerShell's &lt;code&gt;Set-Content&lt;/code&gt; command, Windows defaults to UTF-16 encoding. The Node.js backend of the agent framework strictly required UTF-8. It read my injected JSON file as corrupted bytes. &lt;/p&gt;

&lt;p&gt;I resolved this by forcing standard UTF-8 encoding via PowerShell (&lt;code&gt;Out-File -Encoding utf8&lt;/code&gt;), finally unlocking the vault. But this led to an even bigger roadblock.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 3: Hardcoded Cloud Routing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Even with the dummy key accepted, the traffic refused to stay local. The framework’s internal Node.js code was strictly hardcoded to route any model starting with the &lt;code&gt;openai/&lt;/code&gt; prefix directly to &lt;code&gt;api.openai.com&lt;/code&gt;, ignoring all local &lt;code&gt;127.0.0.1&lt;/code&gt; overrides. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: The "Ollama Trojan Horse"&lt;/strong&gt; I realized that fighting the framework's strict OpenAI routing was a losing battle. However, I noticed the framework natively supported &lt;strong&gt;Ollama&lt;/strong&gt;—a popular tool for running local models. &lt;/p&gt;

&lt;p&gt;Because the framework &lt;em&gt;expects&lt;/em&gt; Ollama to run locally, it doesn't require API keys, and it defaults to local traffic (&lt;code&gt;http://127.0.0.1:11434&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;I completely abandoned the OpenAI disguise and built a custom &lt;strong&gt;FastAPI Proxy Server&lt;/strong&gt; in Python. I programmed my server to listen on port &lt;code&gt;11434&lt;/code&gt; and speak the exact JSON dialect expected by Ollama (&lt;code&gt;/api/chat&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Snippet of the FastAPI Proxy
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU Ollama Proxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat_completions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OllamaChatRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Intercept the framework's payload
&lt;/span&gt;    &lt;span class="c1"&gt;# 2. Feed it directly into the Intel NPU graph
&lt;/span&gt;    &lt;span class="c1"&gt;# 3. Return the response formatted as an Ollama dictionary
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;npu_response&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;11434&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hurdle 4: The 1.5B Parameter "Fever Dream"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;br&gt;
The connection was flawless, but the output was chaos. Dropping a highly complex, 10,000-word enterprise instruction manual onto a small 1.5 Billion parameter reasoning model caused catastrophic hallucination. &lt;/p&gt;

&lt;p&gt;Initially, the model got trapped in an infinite loop, repeating the word "roles" hundreds of times. When I aggressively cranked up the &lt;code&gt;repetition_penalty&lt;/code&gt; parameter to break the loop, the model swung too far the other way—generating a hilarious "word salad" of obscure vocabulary to avoid repeating itself. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: The Strict Robotic Guardrails&lt;/strong&gt;&lt;br&gt;
Small models need strict boundaries. To fix the hallucination, I updated the model generation parameters in my proxy to highly restrictive guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_new_tokens=150&lt;/code&gt;&lt;/strong&gt;: Prevented infinite rambling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;temperature=0.1&lt;/code&gt;&lt;/strong&gt;: Removed "creativity" to ensure predictable, logical outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;repetition_penalty=1.15&lt;/code&gt;&lt;/strong&gt;: A balanced penalty allowing normal grammar without infinite loops.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While a 1.5B model is ultimately too small to autonomously execute complex tool-calling (like web browsing) based on a massive system prompt, the pipeline itself was a resounding success. &lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By combining custom OpenVINO compilation, file-encoding debugging, and local API emulation via FastAPI, I was able to successfully bridge a locked-down enterprise agent framework with experimental NPU silicon entirely offline. &lt;/p&gt;

&lt;p&gt;If you are building local AI tools, don't let hardcoded network routes stop you. API interoperability is your best friend. Build a proxy, spoof the dialect, and take control of your hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check out the full code for the proxy and NPU compiler on my GitHub:&lt;/strong&gt; 🔗 &lt;a href="https://github.com/tanaykolekar/OpenClaw-NPU-Proxy" rel="noopener noreferrer"&gt;Link to GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you experimented with Intel NPUs or local Agent frameworks? Let me know about your roadblocks in the comments below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>opensource</category>
      <category>openclaw</category>
      <category>python</category>
    </item>
  </channel>
</rss>
