Tanay Kolekar

How to Run Enterprise AI Agents Locally on an Intel NPU: Building an "Ollama Trojan Horse"

Meta Description: A deep dive into running locked-down enterprise AI agent frameworks completely offline using Intel Meteor Lake NPUs, FastAPI proxy servers, and Ollama API emulation.

Disclaimer: This guide is for educational purposes and focuses strictly on local hardware optimization and API interoperability. It operates entirely within a local 127.0.0.1 environment. All trademarks (OpenClaw, OpenAI, Ollama, Intel) belong to their respective owners.


Running Large Language Models (LLMs) locally is becoming the standard for privacy-conscious developers. But what happens when you try to connect a massive, enterprise-grade Agent Framework (like OpenClaw) to experimental local silicon?

You hit walls. Hardcoded cloud routes, strict API key vaults, and hardware segmentation faults.

Recently, I set out to run a massive 10,000+ token agentic context window completely offline using an Intel Core Ultra NPU and a quantized DeepSeek 1.5B reasoning model. What started as a simple configuration change turned into a multi-step engineering gauntlet.

Here is the step-by-step breakdown of every hurdle I faced, the technical workarounds, and how I ultimately built a custom FastAPI proxy to achieve full offline hardware acceleration.


Hurdle 1: The Hardware Cap (C++ Segfaults on the NPU)

The Problem: Frameworks like OpenClaw require massive context windows (often 16,000 tokens) just to process their own internal system prompts before they even read user input. When I tried to push this massive prefill matrix into my Intel Meteor Lake NPU using standard wrappers, the underlying C++ driver crashed with a segmentation fault. The hardware simply wasn't configured to handle that memory footprint out of the box.

The Solution: Mathematical Recompilation
Instead of relying on default wrappers, I wrote a custom Python compilation script using ipex_llm and OpenVINO. By capping the NPU's prefill matrix and compiling the HuggingFace model directly into a highly optimized OpenVINO .xml graph on my SSD, I stabilized the 16K context window without crashing the silicon.
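
The exact ipex_llm/OpenVINO calls vary by version, so here is only the capping idea itself: the NPU graph is compiled for a fixed maximum prefill size, and any longer prompt is fed through in fixed-size chunks. Both limits below are illustrative, not the real values from my script.

```python
# Sketch of the prefill-capping logic. The real pipeline compiles the
# model with ipex_llm/OpenVINO; this plain-Python helper only shows the
# chunking math that keeps each prefill within the compiled graph size.

MAX_CONTEXT = 16384   # total context window the agent framework needs
MAX_PREFILL = 2048    # largest prefill matrix the NPU graph accepts (assumed)

def chunk_prefill(token_ids, max_prefill=MAX_PREFILL):
    """Split a long prompt into prefill-sized chunks fed to the NPU in order."""
    if len(token_ids) > MAX_CONTEXT:
        raise ValueError("prompt exceeds the compiled context window")
    return [token_ids[i:i + max_prefill]
            for i in range(0, len(token_ids), max_prefill)]
```

With a 5,000-token prompt and a 2,048-token prefill cap, the prompt becomes three sequential prefill passes instead of one oversized matrix that segfaults the driver.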


Hurdle 2: The Sandboxed Auth Vault

The Problem: With the hardware stabilized, I needed to point the agent framework to my local environment instead of the cloud. However, the framework operated inside a highly restricted Node.js sandbox. Even when I changed my OS-level environment variables (OPENAI_BASE_URL), the agent threw a fatal error: No API key found for provider "openai".

The agent refused to establish a network connection without a physical auth-profiles.json file in its isolated directory.

The Workaround: Navigating Windows File Encoding
I attempted to forcefully inject a dummy API key (sk-local-npu) into the sandbox using Windows PowerShell.

However, it failed again. Why? Silent file encoding. Windows PowerShell does not write UTF-8 by default: Out-File and redirection produce UTF-16 LE, while Set-Content falls back to the legacy ANSI code page. The agent framework's Node.js backend strictly required UTF-8, so it read my injected JSON file as corrupted bytes.

I resolved this by forcing standard UTF-8 encoding via PowerShell (Out-File -Encoding utf8), finally unlocking the vault. But this led to an even bigger roadblock.
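
The same fix, expressed in Python for clarity: write the file with an explicit UTF-8 encoding. The auth-profiles.json schema below is a hypothetical minimal example; the real layout is framework-specific.

```python
import json

# Hypothetical minimal schema for the sandbox's auth-profiles.json --
# the real file layout depends on the framework.
profile = {"provider": "openai", "apiKey": "sk-local-npu"}

# encoding="utf-8" is the critical part: Windows PowerShell's default
# encodings (UTF-16 LE / ANSI) are read by Node.js as corrupt bytes.
with open("auth-profiles.json", "w", encoding="utf-8") as f:
    json.dump(profile, f)
```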


Hurdle 3: Hardcoded Cloud Routing

The Problem: Even with the dummy key accepted, the traffic refused to stay local. The framework’s internal Node.js code was strictly hardcoded to route any model starting with the openai/ prefix directly to api.openai.com, ignoring all local 127.0.0.1 overrides.

The Solution: The "Ollama Trojan Horse"
I realized that fighting the framework's strict OpenAI routing was a losing battle. However, I noticed the framework natively supported Ollama, a popular tool for running local models.

Because the framework expects Ollama to run locally, it doesn't require API keys, and it defaults to local traffic (http://127.0.0.1:11434).
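
Speaking Ollama's dialect means accepting chat requests shaped like this (the model name here is illustrative):

```json
{
  "model": "deepseek-r1:1.5b",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "stream": false
}
```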

I completely abandoned the OpenAI disguise and built a custom FastAPI Proxy Server in Python. I programmed my server to listen on port 11434 and speak the exact JSON dialect expected by Ollama (/api/chat).

```python
# Snippet of the FastAPI Proxy
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="NPU Ollama Proxy")

class Message(BaseModel):
    role: str
    content: str

class OllamaChatRequest(BaseModel):
    model: str
    messages: list[Message]
    stream: bool = False

@app.post("/api/chat")
async def chat_completions(req: OllamaChatRequest):
    # 1. Intercept the framework's payload
    # 2. Feed it directly into the Intel NPU graph
    #    (run_npu_inference wraps the compiled OpenVINO model)
    npu_response = run_npu_inference(req.messages)
    # 3. Return the response formatted as an Ollama dictionary
    return {
        "model": req.model,
        "message": {"role": "assistant", "content": npu_response},
        "done": True,
    }

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=11434)
```

Hurdle 4: The 1.5B Parameter "Fever Dream"

The Problem:
The connection was flawless, but the output was chaos. Dropping a highly complex, 10,000-word enterprise instruction manual onto a small 1.5 Billion parameter reasoning model caused catastrophic hallucination.

Initially, the model got trapped in an infinite loop, repeating the word "roles" hundreds of times. When I aggressively cranked up the repetition_penalty parameter to break the loop, the model swung too far the other way—generating a hilarious "word salad" of obscure vocabulary to avoid repeating itself.

The Solution: The Strict Robotic Guardrails
Small models need strict boundaries. To fix the hallucination, I updated the model generation parameters in my proxy to highly restrictive guardrails:

  • max_new_tokens=150: Prevented infinite rambling.
  • temperature=0.1: Removed "creativity" to ensure predictable, logical outputs.
  • repetition_penalty=1.15: A balanced penalty allowing normal grammar without infinite loops.
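
Inside the proxy, those guardrails boil down to a small generation config passed to the model. This is a sketch using HuggingFace transformers-style parameter names; the do_sample flag is my addition, since temperature only takes effect when sampling is enabled.

```python
# Restrictive generation settings for a small (1.5B) reasoning model.
# Parameter names follow the HuggingFace transformers generate() API.
GENERATION_CONFIG = {
    "max_new_tokens": 150,       # hard stop: prevents infinite rambling
    "temperature": 0.1,          # near-greedy sampling for predictable output
    "repetition_penalty": 1.15,  # discourages loops without forcing word salad
    "do_sample": True,           # temperature only applies when sampling
}

# Usage sketch (model and inputs assumed loaded elsewhere):
# output = model.generate(**inputs, **GENERATION_CONFIG)
```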

While a 1.5B model is ultimately too small to autonomously execute complex tool-calling (like web browsing) based on a massive system prompt, the pipeline itself was a resounding success.


Conclusion

By combining custom OpenVINO compilation, file-encoding debugging, and local API emulation via FastAPI, I was able to successfully bridge a locked-down enterprise agent framework with experimental NPU silicon entirely offline.

If you are building local AI tools, don't let hardcoded network routes stop you. API interoperability is your best friend. Build a proxy, spoof the dialect, and take control of your hardware.

Check out the full code for the proxy and NPU compiler on my GitHub: 🔗 Link to GitHub Repository

Have you experimented with Intel NPUs or local Agent frameworks? Let me know about your roadblocks in the comments below!
