<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tanay Kolekar</title>
    <description>The latest articles on DEV Community by Tanay Kolekar (@tanay_kolekar).</description>
    <link>https://dev.to/tanay_kolekar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3888752%2Fb3844ddf-a074-43ed-9817-375fcaba58c9.jpg</url>
      <title>DEV Community: Tanay Kolekar</title>
      <link>https://dev.to/tanay_kolekar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tanay_kolekar"/>
    <language>en</language>
    <item>
      <title>How to Run Enterprise AI Agents Locally on an Intel NPU: Building an "Ollama Trojan Horse"</title>
      <dc:creator>Tanay Kolekar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 11:32:33 +0000</pubDate>
      <link>https://dev.to/tanay_kolekar/how-to-run-enterprise-ai-agents-locally-on-an-intel-npu-building-an-ollama-trojan-horse-35l3</link>
      <guid>https://dev.to/tanay_kolekar/how-to-run-enterprise-ai-agents-locally-on-an-intel-npu-building-an-ollama-trojan-horse-35l3</guid>
      <description>&lt;p&gt;Meta Description: A deep dive into running locked-down enterprise AI agent frameworks completely offline using Intel Meteor Lake NPUs, FastAPI proxy servers, and Ollama API emulation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: This guide is for educational purposes and focuses strictly on local hardware optimization and API interoperability. It operates entirely within a local &lt;code&gt;127.0.0.1&lt;/code&gt; environment. All trademarks (OpenClaw, OpenAI, Ollama, Intel) belong to their respective owners.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Running Large Language Models (LLMs) locally is becoming the standard for privacy-conscious developers. But what happens when you try to connect a massive, enterprise-grade Agent Framework (like OpenClaw) to experimental local silicon? &lt;/p&gt;

&lt;p&gt;You hit walls. Hardcoded cloud routes, strict API key vaults, and hardware segmentation faults. &lt;/p&gt;

&lt;p&gt;Recently, I set out to run a massive 10,000+ token agentic context window completely offline using an Intel Core Ultra NPU and a quantized DeepSeek 1.5B reasoning model. What started as a simple configuration change turned into a multi-step engineering gauntlet. &lt;/p&gt;

&lt;p&gt;Here is the step-by-step breakdown of every hurdle I faced, the technical workarounds, and how I ultimately built a custom FastAPI proxy to achieve full offline hardware acceleration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hurdle 1: The Hardware Cap (C++ Segfaults on the NPU)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Frameworks like OpenClaw require massive context windows (often 16,000 tokens) just to process their own internal system prompts before they even read user input. When I tried to push this massive prefill matrix into my Intel Meteor Lake NPU using standard wrappers, the underlying C++ driver crashed with a segmentation fault. The hardware simply wasn't configured to handle that memory footprint out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Mathematical Recompilation&lt;/strong&gt; Instead of relying on default wrappers, I wrote a custom Python compilation script using &lt;code&gt;ipex_llm&lt;/code&gt; and OpenVINO. By mathematically capping the NPU's prefill matrix and compiling the HuggingFace model directly into a highly optimized &lt;code&gt;.xml&lt;/code&gt; graph on my SSD, I successfully stabilized the 16K context window without crashing the silicon.&lt;/p&gt;
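&lt;p&gt;The capping idea itself is simple arithmetic: instead of handing the NPU one enormous prefill matrix, the runtime feeds it fixed-size slices that the compiled graph can accept. A toy sketch of that chunking logic (the 1,024-token cap here is purely illustrative, not the value from my compilation script):&lt;/p&gt;

```python
# Toy sketch: splitting a 16K-token prefill into NPU-safe chunks.
# The 1,024-token cap is illustrative; real limits depend on the
# driver, the model, and how the OpenVINO graph was compiled.

def prefill_chunks(prompt_tokens: int, npu_prefill_cap: int) -> list:
    """Split a long prefill into chunk sizes the NPU graph can accept."""
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        step = min(remaining, npu_prefill_cap)
        chunks.append(step)
        remaining -= step
    return chunks

chunks = prefill_chunks(16_384, 1_024)
print(len(chunks))  # 16 equal slices instead of one giant matrix
```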




&lt;h2&gt;
  
  
  Hurdle 2: The Sandboxed Auth Vault
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; With the hardware stabilized, I needed to point the agent framework to my local environment instead of the cloud. However, the framework operated inside a highly restricted Node.js sandbox. Even when I changed my OS-level environment variables (&lt;code&gt;OPENAI_BASE_URL&lt;/code&gt;), the agent threw a fatal error: &lt;code&gt;No API key found for provider "openai"&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The agent refused to establish a network connection without a physical &lt;code&gt;auth-profiles.json&lt;/code&gt; file in its isolated directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Workaround: Navigating Windows File Encoding&lt;/strong&gt; I attempted to forcefully inject a dummy API key (&lt;code&gt;sk-local-npu&lt;/code&gt;) into the sandbox using Windows PowerShell. &lt;/p&gt;

&lt;p&gt;However, it failed again. Why? &lt;strong&gt;Silent file encoding.&lt;/strong&gt; Windows PowerShell's legacy output cmdlets don't default to UTF-8: &lt;code&gt;Out-File&lt;/code&gt; (and the &lt;code&gt;&amp;gt;&lt;/code&gt; redirection operator) write UTF-16 LE, and &lt;code&gt;Set-Content&lt;/code&gt; uses the system ANSI code page. The Node.js backend of the agent framework strictly required UTF-8, so it read my injected JSON file as corrupted bytes. &lt;/p&gt;

&lt;p&gt;I resolved this by forcing standard UTF-8 encoding via PowerShell (&lt;code&gt;Out-File -Encoding utf8&lt;/code&gt;), finally unlocking the vault. But this led to an even bigger roadblock.&lt;/p&gt;
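&lt;p&gt;The same failure mode is easy to reproduce outside PowerShell. A minimal Python sketch (the &lt;code&gt;auth-profiles.json&lt;/code&gt; contents below are a hypothetical stand-in, not the framework's real schema):&lt;/p&gt;

```python
import json
import os
import tempfile

payload = '{"openai": {"apiKey": "sk-local-npu"}}'  # hypothetical contents

path = os.path.join(tempfile.mkdtemp(), "auth-profiles.json")

# Write the file as UTF-16, the Windows PowerShell Out-File default.
with open(path, "w", encoding="utf-16") as f:
    f.write(payload)

# A strict UTF-8 reader (like the framework's Node.js backend) sees garbage.
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
    utf16_readable = True
except (UnicodeDecodeError, json.JSONDecodeError):
    utf16_readable = False

# Rewrite as UTF-8 -- the equivalent of `Out-File -Encoding utf8`.
with open(path, "w", encoding="utf-8") as f:
    f.write(payload)

with open(path, encoding="utf-8") as f:
    key = json.load(f)["openai"]["apiKey"]

print(utf16_readable, key)  # False sk-local-npu
```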




&lt;h2&gt;
  
  
  Hurdle 3: Hardcoded Cloud Routing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Even with the dummy key accepted, the traffic refused to stay local. The framework’s internal Node.js code was strictly hardcoded to route any model starting with the &lt;code&gt;openai/&lt;/code&gt; prefix directly to &lt;code&gt;api.openai.com&lt;/code&gt;, ignoring all local &lt;code&gt;127.0.0.1&lt;/code&gt; overrides. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: The "Ollama Trojan Horse"&lt;/strong&gt; I realized that fighting the framework's strict OpenAI routing was a losing battle. However, I noticed the framework natively supported &lt;strong&gt;Ollama&lt;/strong&gt;—a popular tool for running local models. &lt;/p&gt;

&lt;p&gt;Because the framework &lt;em&gt;expects&lt;/em&gt; Ollama to run locally, it doesn't require API keys, and it defaults to local traffic (&lt;code&gt;http://127.0.0.1:11434&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;I completely abandoned the OpenAI disguise and built a custom &lt;strong&gt;FastAPI Proxy Server&lt;/strong&gt; in Python. I programmed my server to listen on port &lt;code&gt;11434&lt;/code&gt; and speak the exact JSON dialect expected by Ollama (&lt;code&gt;/api/chat&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Snippet of the FastAPI Proxy
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU Ollama Proxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat_completions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OllamaChatRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Intercept the framework's payload
&lt;/span&gt;    &lt;span class="c1"&gt;# 2. Feed it directly into the Intel NPU graph
&lt;/span&gt;    &lt;span class="c1"&gt;# 3. Return the response formatted as an Ollama dictionary
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;npu_response&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;11434&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hurdle 4: The 1.5B Parameter "Fever Dream"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;br&gt;
The connection was flawless, but the output was chaos. Dropping a highly complex, 10,000-word enterprise instruction manual onto a small 1.5 Billion parameter reasoning model caused catastrophic hallucination. &lt;/p&gt;

&lt;p&gt;Initially, the model got trapped in an infinite loop, repeating the word "roles" hundreds of times. When I aggressively cranked up the &lt;code&gt;repetition_penalty&lt;/code&gt; parameter to break the loop, the model swung too far the other way—generating a hilarious "word salad" of obscure vocabulary to avoid repeating itself. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: The Strict Robotic Guardrails&lt;/strong&gt;&lt;br&gt;
Small models need strict boundaries. To fix the hallucination, I updated the model generation parameters in my proxy to highly restrictive guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_new_tokens=150&lt;/code&gt;&lt;/strong&gt;: Prevented infinite rambling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;temperature=0.1&lt;/code&gt;&lt;/strong&gt;: Removed "creativity" to ensure predictable, logical outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;repetition_penalty=1.15&lt;/code&gt;&lt;/strong&gt;: A balanced penalty allowing normal grammar without infinite loops.&lt;/li&gt;
&lt;/ul&gt;
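&lt;p&gt;To see what &lt;code&gt;repetition_penalty&lt;/code&gt; actually does, here is a toy version of the standard HuggingFace-style logit rescaling: the scores of already-generated tokens are divided by the penalty when positive and multiplied by it when negative, making them less likely to be sampled again. This is an illustration, not the proxy's inference code:&lt;/p&gt;

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.15):
    """HF-style penalty: shrink positive logits of seen tokens,
    push negative ones further down."""
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [2.3, -1.0, 0.5, 4.0]   # toy vocabulary of 4 tokens
seen = [0, 1]                    # tokens already generated
penalized = apply_repetition_penalty(logits, seen, penalty=1.15)
print(penalized)  # tokens 0 and 1 are now less attractive; 2 and 3 untouched
```

At 1.15 the nudge is gentle enough that normal grammar words can still repeat; at the aggressive values I tried first, every previously used token gets hammered, which is exactly what produced the "word salad".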

&lt;p&gt;While a 1.5B model is ultimately too small to autonomously execute complex tool-calling (like web browsing) based on a massive system prompt, the pipeline itself was a resounding success. &lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By combining custom OpenVINO compilation, file-encoding debugging, and local API emulation via FastAPI, I was able to successfully bridge a locked-down enterprise agent framework with experimental NPU silicon entirely offline. &lt;/p&gt;

&lt;p&gt;If you are building local AI tools, don't let hardcoded network routes stop you. API interoperability is your best friend. Build a proxy, spoof the dialect, and take control of your hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check out the full code for the proxy and NPU compiler on my GitHub:&lt;/strong&gt; 🔗 &lt;a href="https://github.com/tanaykolekar/OpenClaw-NPU-Proxy" rel="noopener noreferrer"&gt;Link to GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you experimented with Intel NPUs or local Agent frameworks? Let me know about your roadblocks in the comments below!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>opensource</category>
      <category>openclaw</category>
      <category>python</category>
    </item>
  </channel>
</rss>
