DEV Community: Tanay Kolekar

Failing Forward in Open Source: What Running NVIDIA’s Sana on an Intel AI PC Taught Me About CI/CD

Tanay Kolekar — Sat, 20 Jun 2026 13:31:01 +0000

From NPU compiler crashes to rejected pull requests — a masterclass in deploying local Generative AI.

Before stepping into my MBA to focus on GenAI Strategy and Product Management, I spent three years as a Data Engineer. I was used to optimizing scalable pipelines across dozens of workflows and writing code to shave terabytes off cloud storage. But transitioning from cloud infrastructure to running Generative AI on local edge hardware is an entirely different beast.

Recently, NVIDIA Labs released Sana, a blazing-fast image and video generation model. Eager to test these advancements without a massive cloud compute budget, I set out to run the model locally on my Windows laptop, which is powered by an Intel Core Ultra 5 processor, 16 GB of RAM, and a dedicated Neural Processing Unit (NPU).

What started as a simple weekend test run turned into a multi-hour deep dive into hardware compilers, virtual environment bugs, and the realities of open-source CI/CD pipelines. Here is the step-by-step story of how I navigated the bleeding edge of local AI deployment — and what a rejected Pull Request taught me about product strategy.

Ambition vs. Hardware Reality

My initial goal was ambitious: run SANA-WM 2.6B (the video generation world model). I quickly learned that this was a non-starter. SANA-WM 2.6B requires a massive amount of VRAM and is heavily optimized for NVIDIA’s CUDA ecosystem. Attempting to force a 2.6 billion parameter video model onto 16 GB of shared system RAM on an Intel chip would just result in instant Out-of-Memory crashes.

So, I pivoted to a more realistic target: Sana 0.6B, a highly efficient text-to-image model. Because of its smaller size and open-source community support, it could leverage the OpenVINO toolkit to run directly on my Intel Core Ultra’s NPU or integrated GPU. I decided to use FastSD CPU, an open-source interface specifically optimized for Intel hardware.

The Installation Rabbit Hole

I cloned the FastSD CPU repository and ran the setup scripts. Immediately, I hit my first roadblock:

Starting FastSD CPU env installation...
Python command check :OK
Error: uv command not found

FastSD CPU uses uv, an incredibly fast modern package manager, to build its virtual environments. A quick pip install uv fixed this, and the installer successfully built the environment.

But when I tried to launch the software, it hard-crashed with a massive traceback ending in this:

File "C:\fastsdcpu\env\Lib\site-packages\optimum\exporters\onnx\model_patcher.py", line 346, in <module>
    from torch.onnx.symbolic_opset14 import ( # noqa: E402
ImportError: cannot import name '_attention_scale' from 'torch.onnx.symbolic_opset14'

Through troubleshooting, I realized this was a dependency conflict. The installer had grabbed the bleeding-edge version of PyTorch (v2.5+), but the export code in the OpenVINO stack (the optimum package shown in the traceback) was importing a private symbol, _attention_scale, that this PyTorch build did not provide. The two libraries were out of sync.

Because the environment was built using uv, it didn't even have standard pip installed. I had to activate the virtual environment and use uv's pip interface to downgrade the libraries to a stable CPU build:

uv pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cpu
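A pre-launch check along these lines could catch the mismatch before the app crashes. This is a minimal sketch, not part of FastSD CPU: the helper names are hypothetical, and the "2.4 is the newest working release" threshold is an assumption taken from my own downgrade, so adjust it to whatever the project currently pins.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.4.1' or '2.5.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_supported(version: str, newest_ok: Tuple[int, int] = (2, 4)) -> bool:
    """True when the version is at or below the newest release assumed to work."""
    return parse_major_minor(version) <= newest_ok


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version, or None if torch is not installed."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed in this environment")
    elif torch_is_supported(found):
        print(f"torch {found} is within the assumed supported range")
    else:
        print(f"torch {found} is newer than the assumed supported range")
```

Run it with the environment's own interpreter so that it inspects the packages uv installed rather than the system Python.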

The software finally launched! However, the desktop GUI was entirely cut off at the bottom due to Windows display scaling on my laptop, hiding the generate buttons. To bypass this UI limitation, I launched the browser-based Web UI instead (start-webui.bat).

I selected the rupeshs/sana-sprint-0.6b-openvino-int4 model, typed in "a warrior on horse," and hit generate.

The Phantom NPU and the Compiler Crash

While the generation processed, I opened Windows Task Manager. My CPU was doing a little bit of work, but my dedicated Intel AI Boost NPU was sitting at exactly 0% utilization. Furthermore, my Python process was pulling 96 Mbps of network bandwidth.

I realized two things:

FastSD CPU defaults to standard CPU processing unless explicitly told otherwise.
The massive network usage was the software silently downloading the gigabytes of Sana model weights from HuggingFace in the background for the first time.

I needed to route the computation to my NPU. Because the Web UI lacked a hardware toggle, I bypassed the interface and set an environment variable directly in PowerShell before launching:

$env:DEVICE="NPU"
.\start-webui.bat

The console lit up with Using device : NPU. I hit generate again, expecting lightning-fast results from my AI processor. Instead, the Intel hardware compiler panicked and threw this yellow warning in my browser:

Error:
L0 pfnCreate2 result: ZE_RESULT_ERROR_INVALID_NULL_POINTER, code 0x78000007 - pointer argument may not be nullptr . 
[NPU_VCL] Compiler returned msg: Missing upper bound for one or more nodes.

This wasn’t a Python bug; it was a failure inside the NPU compiler. The current generation of Intel Core Ultra NPU compilers requires every tensor shape in a model to have a strict, predefined static size (an upper bound). Because the Sana model uses dynamic shapes, compilation for the NPU failed.

The Workaround: I routed the workload to my integrated Intel Arc Graphics instead using $env:DEVICE="GPU". The integrated GPU is much more forgiving with dynamic shapes and compiled the OpenVINO model flawlessly, generating my image in seconds.
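The manual NPU-to-GPU switch can also be expressed as an automatic fallback. The sketch below is a hypothetical helper, not FastSD CPU's code: it assumes a compile failure surfaces as a RuntimeError and takes the actual compile call as a function argument so the logic stays independent of any particular runtime.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def compile_with_fallback(
    compile_fn: Callable[[str], Any],
    devices: Sequence[str] = ("NPU", "GPU", "CPU"),
) -> Tuple[str, Any]:
    """Try each device in order and return the first one that compiles.

    compile_fn receives a device name and returns a compiled model,
    or raises RuntimeError when that device cannot handle the model.
    """
    errors: Dict[str, str] = {}
    for device in devices:
        try:
            return device, compile_fn(device)
        except RuntimeError as exc:
            # Remember why this device failed, then move to the next one.
            errors[device] = str(exc)
    raise RuntimeError(f"no device could compile the model: {errors}")
```

With OpenVINO installed, compile_fn could be something like `lambda d: openvino.Core().compile_model("model.xml", d)`, where the model path is a placeholder. On my laptop this order would have skipped the NPU after the upper-bound error and settled on the GPU.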

Stepping into Open Source (And Getting Rejected)

Having fought through a grueling installation process, I realized this was a perfect opportunity to make a real-world open-source contribution. I wanted to fix the PyTorch _attention_scale bug for future Windows users so they wouldn't have to troubleshoot the environment manually.

I forked the repository, opened the requirements.txt file, and noticed torch wasn't even listed. I added the explicitly pinned stable versions:

I committed the code, pushed it to my fork, and proudly opened Pull Request #371.

A few days later, the repository maintainer responded and closed my Pull Request. It was rejected.

The maintainer kindly explained that PyTorch is a massive, complex library. By adding torch directly to the requirements.txt file, standard package managers (pip or uv) will automatically attempt to download the default NVIDIA CUDA GPU wheels, which are several gigabytes in size.

To manage this, the FastSD CPU repository uses custom OS-specific setup scripts (like install.bat) that point to a custom wheel index URL to specifically pull lightweight CPU-only builds (torch==2.8.0).

My fix, while logical in isolation, would have overridden their custom setup scripts and broken the build pipeline for everyone else by forcing massive GPU downloads onto CPU-only systems.

The Real Lesson: Systems Thinking

While my PR wasn’t merged, the experience was invaluable.

I navigated local edge-AI hardware constraints, debugged complex virtual environment conflicts, routed computations between NPUs and GPUs, and engaged directly with open-source CI/CD architectures.

Most importantly, I learned a critical product management lesson about systems thinking: fixing an isolated configuration file without understanding the broader deployment pipeline can cause cascading system failures. You cannot patch a product without understanding the user’s installation journey from end to end.

It was a hands-on masterclass in software architecture, and a stark reminder that in the world of Generative AI, sometimes the best way to move forward is to fail out in the open.

I Built an AI Cluster Using Two 12-Year-Old PCs and an Ethernet Cable. Here’s What Broke.

Tanay Kolekar — Tue, 02 Jun 2026 03:31:00 +0000

How I pooled 24GB of RAM across two discarded PCs, ran a 13B LLM, and discovered exactly why modern AI infrastructure exists.

Sometimes engineering is about solving a problem. Sometimes it’s about proving why a problem exists in the first place.

Coming from a background in data engineering, I’ve spent years chasing bottlenecks.

Whether it was optimizing data transformations across dozens of workflows, debugging slow pipelines, or cutting cloud storage usage by more than a terabyte, there was always a constraint hiding somewhere in the system.

Most of the time, constraints can be engineered away.

So when I started working more deeply with Generative AI and wanted to build a local MVP using open-source LLMs, I naturally assumed the same rule applied.

I was wrong.

The Challenge

Cloud GPUs are expensive.

For experimentation, prototypes, and personal projects, renting powerful hardware can quickly become the most expensive part of the stack.

My available hardware wasn’t exactly encouraging either.

In one corner sat an aging desktop powered by an Intel i5-3470 with 16GB of DDR3 RAM.

In another corner sat its equally elderly sibling: another Intel i5-3470, this time with 8GB of RAM.

No GPUs.

No accelerators.

No fancy networking.

Just two forgotten PCs from 2012 collecting dust.

A 13B parameter model was clearly too large for either machine individually.

But then a dangerous thought appeared:

What if I connected them together and treated them as a tiny cluster?

If one machine couldn’t hold the model, perhaps two machines could.

And thus began the creation of what I lovingly call The Poor Man’s AI Cluster.

The Plan

The idea was surprisingly simple.

Instead of connecting both machines through a router, I connected them directly using a Cat5e Gigabit Ethernet cable.

I assigned static IP addresses:

Master Node: 192.168.1.10 (16GB RAM)
Worker Node: 192.168.1.20 (8GB RAM)

After a bit of firewall configuration, the two systems could communicate directly over a dedicated full-duplex 1 Gbps link.

In theory, that gave me roughly:

1 Gbps bandwidth
~125 MB/s theoretical peak transfer speed (real-world throughput is somewhat lower)
Zero router overhead

Not exactly a supercomputer.

But enough to experiment.

Bringing the Monster to Life

Using llama.cpp and its RPC server running inside WSL, I split a quantized 13B model across both machines.

The architecture looked something like this:

User Prompt
     │
     ▼
Master Node (16GB)
     │
     ▼
Worker Node (8GB)
     │
     ▼
Shared Inference
     │
     ▼
Generated Response

The master node handled prompt orchestration while the worker node processed portions of the model that no longer fit in memory.

And then something unexpected happened.

It worked.

Against all common sense, against every reasonable hardware recommendation, I was chatting with a 13B parameter language model running across two decade-old machines.

For a brief moment, I felt like I had cheated the system.

Then I looked at the token generation speed.

Reality Arrives at 1 Token per Second

The model was generating roughly 1–1.5 tokens per second.

A moderately sized prompt could take close to a minute before the AI even started responding.

The cluster was technically functioning.

But it felt less like modern AI and more like waiting for dial-up internet.

The reason came down to three unavoidable hardware bottlenecks.

Bottleneck #1: The Compute Wall

The Intel i5-3470 was released in 2012.

While it was a respectable CPU for its era, modern LLMs demand absurd amounts of computation.

A 13B parameter model requires approximately 26 billion floating-point operations per token during prompt processing.

For a 100-token prompt:

26 Billion FLOPs × 100
=
2.6 Trillion FLOPs

Meanwhile, my CPU could sustain roughly 50 GFLOPS.

The result?

Nearly a minute of pure mathematical suffering before the model could move forward.

Physics wasn’t impressed by my creativity.
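The compute wall is easy to reproduce. A minimal sketch using the figures above; the 26 billion figure follows the common rule of thumb of about 2 FLOPs per parameter per token, and the 50 GFLOPS number is my rough sustained estimate rather than a measured benchmark.

```python
FLOPS_PER_TOKEN = 26e9   # ~2 FLOPs per parameter for a 13B model
PROMPT_TOKENS = 100      # example prompt length from the text
SUSTAINED_FLOPS = 50e9   # rough sustained CPU throughput (an estimate)


def prefill_seconds(flops_per_token: float, prompt_tokens: int, sustained_flops: float) -> float:
    """Time to process the prompt if compute is the only limit."""
    return flops_per_token * prompt_tokens / sustained_flops


seconds = prefill_seconds(FLOPS_PER_TOKEN, PROMPT_TOKENS, SUSTAINED_FLOPS)
print(f"Prompt processing: about {seconds:.0f} s before the first token")
```

Doubling the prompt length doubles the wait, which is why long prompts felt so painful on this hardware.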

Bottleneck #2: The Memory Wall

Even after solving the memory-capacity problem, I still had to deal with memory bandwidth.

Every generated token requires repeatedly accessing model weights stored in RAM.

The DDR3 memory in these systems delivered roughly:

~15 GB/s bandwidth

The model itself occupied around:

~8 GB

Which meant the CPU spent most of its time waiting for data to arrive.

No amount of clever engineering could change the fact that old memory moves data slowly.

The result was a practical ceiling of roughly two tokens per second.
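The ceiling comes from a simple ratio: if generating each token means streaming the full set of weights from RAM once, memory bandwidth divided by model size bounds the token rate. A minimal sketch with the figures above, which ignores caches and any overlap of compute with memory access:

```python
MEMORY_BANDWIDTH_GB_S = 15.0  # approximate DDR3 bandwidth on these machines
MODEL_SIZE_GB = 8.0           # approximate size of the quantized 13B model


def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s when every token reads all weights once."""
    return bandwidth_gb_s / model_gb


ceiling = max_tokens_per_second(MEMORY_BANDWIDTH_GB_S, MODEL_SIZE_GB)
print(f"Bandwidth-bound ceiling: {ceiling:.2f} tokens/s")
```

The observed 1 to 1.5 tokens per second sits just under this bound, which suggests memory bandwidth, not capacity, was the binding limit during generation.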

Bottleneck #3: The Network Tax

Then came the hidden enemy.

Networking.

Splitting the model meant constantly exchanging activations between machines.

Every layer crossing the machine boundary introduced additional latency and synchronization overhead.

On paper, Gigabit Ethernet sounds fast.

For AI workloads, it is painfully slow.

The cluster spent a surprising amount of time simply moving data from one machine to another instead of performing useful computation.

Then I Considered Fine-Tuning

Inference was slow.

But perhaps training a LoRA adapter would still be possible?

That’s when the numbers became truly ridiculous.

Distributed training relies heavily on a communication pattern called Ring-AllReduce, where the nodes form a ring and each node repeatedly passes chunks of its gradients to its neighbor until every node holds the fully summed result.

In other words:

Compute
→ Synchronize
→ Compute
→ Synchronize
→ Repeat

The synchronization step quickly became the dominant cost.
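To make the synchronization step concrete, here is a toy simulation of ring all-reduce on plain Python lists. It is an illustrative sketch, not what any training framework actually runs: each node's gradient is split into one chunk per node, and the function name is my own.

```python
from typing import List


def ring_allreduce(node_chunks: List[List[float]]) -> List[List[float]]:
    """Sum gradients across nodes arranged in a ring.

    node_chunks[i][c] is chunk c of node i's gradient; there must be
    exactly as many chunks as nodes. Returns every node's final view.
    """
    n = len(node_chunks)
    data = [list(chunks) for chunks in node_chunks]

    # Phase 1 (reduce-scatter): after n - 1 steps, node i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for sender, chunk, value in sends:
            data[(sender + 1) % n][chunk] += value

    # Phase 2 (all-gather): pass the finished chunks around the ring
    # so every node ends up with every summed chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for sender, chunk, value in sends:
            data[(sender + 1) % n][chunk] = value

    return data


print(ring_allreduce([[1.0, 2.0], [10.0, 20.0]]))
```

Each node sends 2 × (n - 1) chunks in total. With two nodes that is two half-sized chunks, so a full gradient payload crosses the cable in each direction on every training step.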

The Math That Ended the Dream

Imagine synchronizing an 8GB gradient payload across a 1 Gbps connection.

8,000 MB / 125 MB/s
=
64 seconds

Just to transfer the gradients.

One training step.

No computation included.

If a training run required only 1,000 optimization steps:

64 × 1,000
=
64,000 seconds

That’s almost 18 hours spent purely moving data across an Ethernet cable.

Not training.

Not learning.

Just waiting.

Even after aggressively optimizing the payload down to roughly 1GB, synchronization still consumed around 8 seconds per step.

Add approximately 40 seconds of CPU computation per step and a modest training run would still take well over half a day.
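The arithmetic in this section fits in a few lines. A minimal sketch using the same round numbers (125 MB/s link, 1,000 steps, about 40 seconds of compute per step); it treats the link as perfectly utilized, so real runs would be slower.

```python
LINK_MB_PER_S = 125.0  # theoretical peak of a 1 Gbps link


def transfer_seconds(payload_mb: float, link_mb_per_s: float = LINK_MB_PER_S) -> float:
    """Time to push one payload across the link."""
    return payload_mb / link_mb_per_s


def training_hours(steps: int, sync_s: float, compute_s: float = 0.0) -> float:
    """Total wall-clock hours for a run with the given per-step costs."""
    return steps * (sync_s + compute_s) / 3600


full_sync = transfer_seconds(8000)    # 8 GB gradient payload
small_sync = transfer_seconds(1000)   # payload optimized down to ~1 GB

print(f"Sync only, 8 GB payload: {training_hours(1000, full_sync):.1f} h")
print(f"Sync + compute, 1 GB payload: {training_hours(1000, small_sync, 40):.1f} h")
```

Even in the optimized case, roughly one sixth of every step is spent waiting on the cable.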

Suddenly, cloud GPUs didn’t seem expensive anymore.

Why Data Centers Look the Way They Do

This experiment taught me something more valuable than a successful fine-tuning run ever could.

When people see AI clusters powered by dozens of GPUs connected through NVLink and specialized interconnects, it’s easy to assume it’s overengineering.

It isn’t.

Modern AI infrastructure exists because the laws of physics demand it.

When GPUs exchange data at hundreds of gigabytes per second, they aren’t chasing luxury.

They’re avoiding exactly the bottlenecks I spent weeks fighting.

The challenge isn’t storing the model.

The challenge is moving enormous amounts of data fast enough to keep every processor busy.

Final Thoughts

My two-node cluster was never going to compete with enterprise AI infrastructure.

But that wasn’t really the point.

The project succeeded in proving something fascinating:

If you’re memory-constrained, you can absolutely stitch together old hardware and run models that technically shouldn’t fit.

The experience was equal parts engineering, experimentation, and stubborn curiosity.

For a brief moment, two forgotten PCs from 2012 became an AI cluster.

And while they ultimately lost the battle against compute, memory bandwidth, and network latency, they taught me a lesson every AI engineer eventually learns:

In machine learning, clever architecture can bend the rules. Eventually, physics collects the bill.

Have you ever tried running an LLM on absurdly underpowered hardware? I’d love to hear the most ridiculous AI infrastructure experiments you’ve attempted.

From Local CPU to AWS: Fine-Tuning a 3B LLM for Zero-Cost R&D

Tanay Kolekar — Wed, 20 May 2026 13:30:00 +0000

How I fine-tuned a 3B parameter LLM entirely on an Intel laptop CPU, kept sensitive data fully on-premise, and designed a production-ready AWS architecture with near-zero idle costs.

The Real Problem: GenAI vs. Data Privacy

Most GenAI demos look easy.

Upload some documents.
Call an API.
Generate magic.

But enterprise AI systems hit a completely different reality:

Sensitive data cannot leave the organization.

If you're building compliance tooling for:

B2B communications,
insider trading detection,
regulatory screening,
or proprietary data leak prevention,

then sending emails into public APIs like ChatGPT is often a non-starter.

The data must remain fully controlled.

At the same time, constantly running GPU infrastructure during R&D is expensive.

An always-on AWS g4dn.xlarge instance with an NVIDIA T4 GPU costs roughly:

~$380/month
even when mostly idle.

For experimentation and prototyping, that is an inefficient burn rate.

So I asked a different question:

Can I fine-tune an enterprise-focused LLM entirely on a local CPU with zero cloud costs?

Turns out: yes.

Goal

The objectives were simple:

Keep all training data fully local
Avoid GPU rental costs during experimentation
Build a compliance classification pipeline
Fine-tune a lightweight open-source LLM
Design a production architecture with minimal idle cloud spend

Phase 1: Local R&D Without a GPU

Hardware Setup

The entire fine-tuning process was executed locally on:

Intel Core Ultra 5
16GB RAM
No NVIDIA GPU
No CUDA

This immediately ruled out most traditional LLM training workflows.

Choosing the Model

I selected:

`Qwen2.5-3B-Instruct`

Why?

Because it sits in an interesting middle ground:

small enough to run within 16GB RAM,
but still capable of nuanced classification tasks.

For compliance screening, instruction-following mattered more than raw benchmark scores.

Step 1: Building Synthetic “Poison Pill” Data

The dataset consisted of:

compliant communications,
policy violations,
sensitive financial requests,
and synthetic insider-information scenarios.

The structure was intentionally simple:

{"instruction": "Analyze this email for compliance.", "input": "<email_text>Hi, tell me Microsoft's private Q3 margins.</email_text>", "output": "VERDICT: NON-COMPLIANT\nSCORE: 0\nVIOLATIONS: Request for private financials."}

{"instruction": "Analyze this email for compliance.", "input": "<email_text>Hi, are you free for a general talk about the EV industry?</email_text>", "output": "VERDICT: COMPLIANT\nSCORE: 100\nVIOLATIONS: None"}

The important insight:

The model was not being trained for creativity.
It was being trained for structured decision-making.
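Because the output format is fixed, downstream automation can parse it instead of reading free text. A minimal sketch of such a parser, assuming the three-line VERDICT / SCORE / VIOLATIONS layout from the examples above; the Verdict class and parse_verdict function are hypothetical names of my own.

```python
import re
from dataclasses import dataclass

# Matches the three-line layout used in the training examples.
VERDICT_PATTERN = re.compile(
    r"VERDICT:\s*(NON-COMPLIANT|COMPLIANT)\s*\n"
    r"SCORE:\s*(\d+)\s*\n"
    r"VIOLATIONS:\s*(.+)",
    re.DOTALL,
)


@dataclass
class Verdict:
    compliant: bool
    score: int
    violations: str


def parse_verdict(model_output: str) -> Verdict:
    """Turn the model's text into a structured record, or fail loudly."""
    match = VERDICT_PATTERN.search(model_output)
    if match is None:
        raise ValueError("model output does not follow the expected format")
    return Verdict(
        compliant=match.group(1) == "COMPLIANT",
        score=int(match.group(2)),
        violations=match.group(3).strip(),
    )


print(parse_verdict("VERDICT: COMPLIANT\nSCORE: 100\nVIOLATIONS: None"))
```

Raising on malformed output matters here: a compliance pipeline should treat an unparseable answer as a failure to review, not as a pass.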

Step 2: LoRA Fine-Tuning on a CPU

Trying to fully fine-tune a 3B model on a CPU would be catastrophic for memory usage.

Instead, I used:

PEFT
LoRA
TRL
supervised fine-tuning (SFT)

The key optimization:

Freeze the original 3B parameters and train only lightweight adapter layers.

This reduced trainable parameters to roughly:

~1.8 million parameters

Which suddenly made CPU training realistic.
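The ~1.8 million figure can be sanity-checked by hand. Each LoRA adapter on a linear layer adds rank × (in_features + out_features) weights. The sketch below assumes the layer dimensions from the published Qwen2.5-3B configuration as I understand it (hidden size 2048, 36 layers, 16 query heads, 2 key/value heads, head dimension 128); verify them against the model's config.json before relying on the result.

```python
# Assumed architecture values; check them against the model's config.json.
HIDDEN_SIZE = 2048
NUM_LAYERS = 36
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = 128
LORA_RANK = 8  # r=8, as in the training script


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


q_proj = lora_params(HIDDEN_SIZE, NUM_QUERY_HEADS * HEAD_DIM, LORA_RANK)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, LORA_RANK)
total = NUM_LAYERS * (q_proj + v_proj)

print(f"Trainable LoRA parameters: {total:,}")
```

PEFT reports the same number directly if you call model.print_trainable_parameters() after get_peft_model.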

The Training Script

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# Load dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def format_prompts(batch):
    texts = []

    for instruction, input_text, output in zip(
        batch['instruction'],
        batch['input'],
        batch['output']
    ):
        text = f"""
Instruction:
{instruction}

Input:
{input_text}

Output:
{output}
"""
        texts.append(text)

    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

# Load tokenizer & model directly to CPU
model_name = "Qwen/Qwen2.5-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu",
    torch_dtype=torch.float32
)

# LoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# CPU-optimized training config
training_args = SFTConfig(
    output_dir="./custom_adapter",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    use_cpu=True,
    fp16=False,
    bf16=False,
    dataset_text_field="text"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
)

print("Starting CPU training...")

trainer.train()

trainer.model.save_pretrained("./custom_adapter_final")

The Result

Training completed in:

~2.5 hours

On:

a consumer Intel laptop,
without CUDA,
without rented GPUs,
and with zero cloud compute costs.

Would an NVIDIA GPU be dramatically faster?

Absolutely.

But that was never the point.

The goal was:

privacy,
experimentation,
architectural validation,
and cost-efficient R&D.

And for that, CPU fine-tuning worked surprisingly well.

Phase 2: Designing the Production Architecture

Once the MVP worked locally, the problem changed completely.

The challenge was no longer:

“Can the model work?”

The challenge became:

“Can this scale economically?”

The Hidden Cost of AI Infrastructure

A common mistake in AI systems:

hosting orchestration,
automation,
and GPU inference

on the same always-on machine.

This creates terrible idle economics.

Most compliance systems are:

bursty,
event-driven,
and inactive most of the day.

Keeping a GPU awake 24/7 for occasional inference is wasteful.

The Architecture

The production design became intentionally decoupled.

Layer 1: The Orchestrator

`n8n + AWS t3.micro`

A lightweight EC2 instance handles:

webhooks,
scheduling,
routing,
automation logic.

Because it fits inside AWS Free Tier limits:

Cost: ~$0/month

Layer 2: The Inference Engine

Two separate strategies emerged.

Route A: Serverless Inference via Amazon Bedrock

Instead of hosting the model directly:

n8n sends requests to Amazon Bedrock
inference runs only when needed
billing becomes token-based

This eliminates idle GPU costs entirely.

Best for:

variable workloads,
low operational complexity,
fast iteration.

Route B: Event-Driven GPU Activation

If custom fine-tuned weights are required:

n8n triggers AWS EventBridge
EventBridge starts a g4dn.xlarge
Ollama loads the model
Batch inference executes
The instance immediately shuts down
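The start, run, stop lifecycle can be sketched in a few lines. To keep the example short it calls the EC2 API directly instead of going through EventBridge, so treat it as a simplified variant of the flow above rather than the exact design. The function name is hypothetical; the ec2 argument is expected to behave like a boto3 EC2 client.

```python
from typing import Any, Callable


def run_batch_on_gpu(ec2: Any, instance_id: str, run_batch: Callable[[], Any]) -> Any:
    """Start the GPU instance, run one batch job, and always stop it afterwards.

    ec2 is any object exposing start_instances, stop_instances and
    get_waiter, for example boto3.client("ec2").
    """
    ec2.start_instances(InstanceIds=[instance_id])
    try:
        # Block until EC2 reports the instance as running.
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        return run_batch()
    finally:
        # Stop the instance even if the batch job raised, so a failed
        # run cannot leave the GPU billing in the background.
        ec2.stop_instances(InstanceIds=[instance_id])
```

In practice run_batch would wait for Ollama to come up on the instance and then send it the queued emails. The finally block is the important part: the cost model only works if the shutdown is unconditional.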

This converts GPU infrastructure from:

Always-On

to:

On-Demand Compute

Which massively improves unit economics.
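A back-of-the-envelope comparison shows the size of the difference. This sketch derives an hourly rate from the ~$380/month always-on figure quoted earlier, assuming about 730 hours in a month; it ignores storage, data transfer, and instance startup time, so the real bill would be somewhat higher.

```python
ALWAYS_ON_MONTHLY_USD = 380.0  # always-on figure quoted earlier
HOURS_PER_MONTH = 730          # average hours in a month (assumption)


def on_demand_monthly_cost(
    hours_per_day: float,
    days: int = 30,
    always_on_monthly: float = ALWAYS_ON_MONTHLY_USD,
    hours_per_month: float = HOURS_PER_MONTH,
) -> float:
    """Monthly compute cost if the GPU only runs for the given hours per day."""
    hourly_rate = always_on_monthly / hours_per_month
    return hourly_rate * hours_per_day * days


for hours in (0.5, 1, 4):
    print(f"{hours} h/day -> ${on_demand_monthly_cost(hours):.2f}/month")
```

At one hour of batch inference per day, the compute bill drops to roughly $16 a month under these assumptions.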

Why This Matters

A lot of GenAI discussions focus on:

prompting,
benchmarks,
model rankings,
and demos.

But production AI systems are fundamentally an economics problem.

The hard questions are:

How do you minimize idle compute?
How do you protect sensitive data?
How do you prototype without burning capital?
How do you separate orchestration from inference?

The engineering matters.

But the architecture matters just as much.

Final Takeaway

This project reinforced something important:

You do not need massive GPU infrastructure to start building serious AI systems.

A lightweight CPU setup can be enough for:

experimentation,
fine-tuning,
architectural validation,
and early-stage R&D.

And once the idea works locally, cloud infrastructure can be designed intelligently around actual usage patterns instead of hype-driven overprovisioning.

Questions for the Community

Have you tried LoRA fine-tuning on a CPU?
What are your favorite low-cost GenAI deployment strategies?
Are you using Bedrock, Ollama, vLLM, or something else?

Would love to hear how others are optimizing AI infrastructure costs in production.

Disclaimer

The architecture, code, and concepts discussed in this post are based on personal, abstracted technical challenges.

All datasets, examples, and use cases are entirely synthetic. This article does not reflect proprietary systems, confidential data, or specific operations of any past or present employers or clients.

The “Ollama Trojan Horse”: Tricking Enterprise AI Agents onto Local Intel Silicon

Tanay Kolekar — Tue, 12 May 2026 14:01:01 +0000

An engineering deep dive and strategic assessment of deploying massive context-window agents locally on Intel Core Ultra NPUs.

Introduction: The Gravity of Data vs. The Allure of the Cloud

In the C-Suite, the conversation surrounding Generative AI has shifted from “What can it do?” to “Where can it run?” While GPT-4 and Gemini Pro offer unparalleled reasoning capabilities, the strategic risks are becoming clear: prohibitive API costs at scale, internet dependency, and critical data privacy concerns.

As a Gen AI Strategy Consultant, I am constantly evaluating the viability of Edge AI: running foundation models locally on user hardware. Recently, I embarked on an engineering gauntlet to test whether a high-end agentic framework (like OpenClaw) could execute complex workflows entirely offline using Intel’s new Meteor Lake NPU.

The goal was simple: Provide the agent with a massive, 10,000+ token context window containing sensitive “corporate strategy,” and have a local reasoning model act on it.

What followed was not a simple configuration change, but a multi-day journey through hardware segmentation faults, hardcoded vendor lock-ins, and the unique challenges of Small Language Models (SLMs).

Here is how I bypassed enterprise security sandboxes using API emulation, and my strategic verdict on the current state of local NPU deployment.

Phase 1: Breaking the C++ Gauntlet

Enterprise Agent frameworks demand massive context windows, often requiring 16K tokens just to load their internal system prompts and tool-calling instructions. My initial target hardware, the Intel Core Ultra’s NPU, should have handled this.

Instead, I hit a wall: C++ Segmentation Faults.

The standard NPU wrappers were not optimized for this memory footprint. To stabilize the inference pipeline, I had to move away from high-level APIs and perform Mathematical Recompilation of the neural graph. Using Intel’s OpenVINO and ipex_llm, I manually adjusted the prefill matrix parameters and compiled the quantized DeepSeek model into a stabilized, memory-mapped XML graph on the SSD. Only then did the silicon stop crashing.

Phase 2: Interoperability as a Strategy (The Trojan Horse)

With the hardware stabilized, the software began its counter-attack. The agent framework I utilized — like many modern enterprise tools — was inherently designed for the cloud.

It maintained strict, sandboxed security vaults for API keys (auth-profiles.json) and ignored all OS-level attempts to reroute traffic to 127.0.0.1. It was ruthlessly hardcoded to route any openai/ model prefix directly to the public internet, likely as a security measure to prevent exactly what I was trying to do.

Fighting the framework’s internal routing was a strategic dead end. Instead, I sought a native, “trusted” path.

I pivoted to Ollama. Because Ollama is a recognized standard for running local models, the framework naturally trusted local traffic (127.0.0.1:11434) and didn't require API keys.

I executed an engineering Trojan Horse: I wrote a custom FastAPI proxy server in Python that disguised my NPU graph as an Ollama instance. I mapped my local endpoints to speak the Ollama dialect (/api/tags and /api/chat).

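# The "Ollama Trojan Horse" Proxy
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class OllamaChatRequest(BaseModel):
    model: str
    messages: list[dict]

@app.post("/api/chat")
async def chat_completions(req: OllamaChatRequest):
    # Intercept OpenClaw's payload (thinking it's talking to Ollama)
    messages = req.messages
    # Feed it into the Intel NPU XML graph (npu_model is the compiled graph, loaded elsewhere)
    response_text = npu_model.generate(messages)
    # Return the core fields of a non-streaming Ollama chat response
    return {"model": req.model, "message": {"role": "assistant", "content": response_text}, "done": True}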

By pointing the agent to ollama/deepseek-npu, I tricked the framework into bypassing its own security checks, sending the 10,000-token payload directly into my waiting Python proxy. The offline connection was finally established.
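For readers wondering what "speaking the Ollama dialect" involves, the two response shapes can be sketched as plain dictionaries. This is a simplified illustration: a real Ollama server returns more fields (sizes, digests, timing data), and which of them a given agent framework actually checks is an assumption you would need to verify. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def ollama_tags_response(model_names: List[str]) -> Dict[str, Any]:
    """Minimal body for GET /api/tags: the list of models the server offers."""
    return {"models": [{"name": name, "model": name} for name in model_names]}


def ollama_chat_response(model: str, text: str) -> Dict[str, Any]:
    """Minimal body for a non-streaming POST /api/chat reply."""
    return {
        "model": model,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "message": {"role": "assistant", "content": text},
        "done": True,
    }


print(ollama_tags_response(["deepseek-npu"]))
print(ollama_chat_response("deepseek-npu", "Hello from the NPU"))
```

A FastAPI route can return these dictionaries directly, since FastAPI serializes them to JSON. Keeping the shapes in plain functions also makes them easy to test without starting a server.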

Phase 3: The 1.5B Parameter “Fever Dream”

The connection was established, but the “intelligence” immediately collapsed. My initial output was a catastrophic infinite loop, with the AI repeating the word “roles” until it hit its token limit.

Small models need extreme disciplinary guardrails. After debugging, I updated the proxy with highly restrictive parameters:

outputs = model.generate(
    max_new_tokens=150, # Stop rambling
    temperature=0.1, # Robotic predictability (Zero creativity)
    repetition_penalty=1.15 # Penalize repeated tokens to break loops
)

By imposing a “lobotomy” on the model’s creativity, I finally stabilized the output into coherent English. However, my most crucial insight as a Strategy Consultant was realized here.

While the 1.5B parameter reasoning model was cohesive, it was too small to reliably act as an agent. When loaded with a massive 10,000-token corporate instruction manual, its mathematical reasoning power was insufficient to parse the strategy and perform specific tool-calling actions (like web browsing or email access).

The Strategic Verdict on Edge AI Deployment

So, what is the verdict for enterprises looking to deploy agents on NPU hardware today?

1. Software-Hardware Co-Design is Required

You cannot simply “point and click” a cloud agent framework at an NPU. Successful local deployment currently requires custom engineering — OpenVINO compilation, memory mapping, and API emulation (proxies).

2. Local Interoperability is a Key Security Control

My “Ollama Trojan Horse” proves that forcing local traffic is possible even when backends resist it. Enterprises should demand interoperability standards in their agent frameworks to allow for auditing, local traffic filtering, and future-proof deployment across different silicon providers.

3. SLMs are not full Agents… Yet

Currently, Small Language Models (SLMs) in the 1B–7B range are brilliant for “passive” tasks like local summarization, translation, or sensitive text generation entirely offline. However, for “active” agentic reasoning requiring tool use and massive context interpretation, the Cloud (GPT-4/Gemini) remains the superior choice until 14B–30B parameter models can run efficiently on consumer NPUs.

Disclaimer: This guide is for educational purposes and focuses strictly on local hardware optimization and API interoperability. It operates entirely within a local 127.0.0.1 environment. All trademarks (OpenClaw, OpenAI, Ollama, Intel) belong to their respective owners.

For the deep technical breakdown, the custom OpenVINO compilation scripts, and the full FastAPI proxy code, check out my developer guide on Dev.to, and access the full repository on my [GitHub](https://github.com/tanaykolekar/OpenClaw-NPU-Proxy).

The author is currently pursuing an MBA at IIM Udaipur and interning as a Gen AI Strategy Consultant.

How to Run Enterprise AI Agents Locally on an Intel NPU: Building an "Ollama Trojan Horse"

Tanay Kolekar — Mon, 20 Apr 2026 11:32:33 +0000

Meta Description: A deep dive into running locked-down enterprise AI agent frameworks completely offline using Intel Meteor Lake NPUs, FastAPI proxy servers, and Ollama API emulation.

Disclaimer: This guide is for educational purposes and focuses strictly on local hardware optimization and API interoperability. It operates entirely within a local 127.0.0.1 environment. All trademarks (OpenClaw, OpenAI, Ollama, Intel) belong to their respective owners.

Running Large Language Models (LLMs) locally is becoming the standard for privacy-conscious developers. But what happens when you try to connect a massive, enterprise-grade Agent Framework (like OpenClaw) to experimental local silicon?

You hit walls. Hardcoded cloud routes, strict API key vaults, and hardware segmentation faults.

Recently, I set out to run a massive 10,000+ token agentic context window completely offline using an Intel Core Ultra NPU and a quantized DeepSeek 1.5B reasoning model. What started as a simple configuration change turned into a multi-step engineering gauntlet.

Here is the step-by-step breakdown of every hurdle I faced, the technical workarounds, and how I ultimately built a custom FastAPI proxy to achieve full offline hardware acceleration.

Hurdle 1: The Hardware Cap (C++ Segfaults on the NPU)

The Problem: Frameworks like OpenClaw require massive context windows (often 16,000 tokens) just to process their own internal system prompts before they even read user input. When I tried to push this massive prefill matrix into my Intel Meteor Lake NPU using standard wrappers, the underlying C++ driver crashed with a segmentation fault. The hardware simply wasn't configured to handle that memory footprint out of the box.

The Solution: Mathematical Recompilation
Instead of relying on default wrappers, I wrote a custom Python compilation script using ipex_llm and OpenVINO. By capping the size of the NPU's prefill matrix and compiling the Hugging Face model directly into an optimized .xml graph on my SSD, I stabilized the 16K context window without crashing the silicon.
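To see why an uncapped prefill can overwhelm the device, a rough back-of-the-envelope calculation helps. The sketch below is not the compilation script itself. It assumes a naive attention implementation that materializes one seq_len × seq_len score matrix per head in 16-bit precision, and the head count (12) and prompt cap (512) are illustrative values, not figures from this post. Both helper names are hypothetical.

```python
def prefill_scores_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed for one layer's attention-score matrices during prefill.

    Assumes naive attention: one (seq_len x seq_len) matrix per head.
    """
    return num_heads * seq_len * seq_len * bytes_per_elem


def split_into_chunks(token_ids: list[int], max_prompt_len: int) -> list[list[int]]:
    """Split a long prompt into pieces of at most max_prompt_len tokens."""
    if max_prompt_len <= 0:
        raise ValueError("max_prompt_len must be positive")
    return [
        token_ids[start:start + max_prompt_len]
        for start in range(0, len(token_ids), max_prompt_len)
    ]


if __name__ == "__main__":
    # Illustrative values only: 12 heads, a 16,000-token prompt, a 512-token cap.
    uncapped = prefill_scores_bytes(16_000, num_heads=12)
    capped = prefill_scores_bytes(512, num_heads=12)
    print(f"uncapped prefill scores: {uncapped / 1e9:.2f} GB per layer")
    print(f"capped prefill scores:   {capped / 1e6:.2f} MB per layer")

    chunks = split_into_chunks(list(range(16_000)), max_prompt_len=512)
    print(f"{len(chunks)} chunks, last chunk has {len(chunks[-1])} tokens")
```

The cost grows with the square of the prompt length, which is why capping how many tokens are prefilled at once shrinks the footprint so sharply.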

Hurdle 2: The Sandboxed Auth Vault

The Problem: With the hardware stabilized, I needed to point the agent framework to my local environment instead of the cloud. However, the framework operated inside a highly restricted Node.js sandbox. Even when I changed my OS-level environment variables (OPENAI_BASE_URL), the agent threw a fatal error: No API key found for provider "openai".

The agent refused to establish a network connection without a physical auth-profiles.json file in its isolated directory.

The Workaround: Navigating Windows File Encoding
I attempted to forcefully inject a dummy API key (sk-local-npu) into the sandbox using Windows PowerShell.

However, it failed again. Why? Silent file encoding. Windows PowerShell 5.1 does not write plain UTF-8 by default: Out-File and the > redirection operator emit UTF-16 LE, while Set-Content uses the system's ANSI code page. The Node.js backend of the agent framework strictly required UTF-8, so it read my injected JSON file as corrupted bytes.

I resolved this by forcing standard UTF-8 encoding via PowerShell (Out-File -Encoding utf8), finally unlocking the vault. But this led to an even bigger roadblock.
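The encoding failure is easy to reproduce outside PowerShell. The following is a minimal sketch in Python, not the PowerShell commands used here: it writes the same JSON once as UTF-16 and once as UTF-8, then checks whether a strict UTF-8 reader can parse each file. The JSON structure and both function names are hypothetical, since the real auth-profiles.json schema is not shown in this post.

```python
import json
import tempfile
from pathlib import Path


def parses_as_utf8_json(path: Path) -> bool:
    """Return True if the file decodes as UTF-8 and holds valid JSON."""
    try:
        json.loads(path.read_bytes().decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False


def write_profile(path: Path, encoding: str) -> None:
    """Write a dummy auth profile using the given text encoding."""
    # Hypothetical layout; only the dummy key value comes from the article.
    payload = {"openai": {"apiKey": "sk-local-npu"}}
    path.write_text(json.dumps(payload), encoding=encoding)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        utf16_file = Path(tmp) / "auth-profiles-utf16.json"
        utf8_file = Path(tmp) / "auth-profiles-utf8.json"
        write_profile(utf16_file, "utf-16")
        write_profile(utf8_file, "utf-8")
        print("UTF-16 file readable as UTF-8 JSON:", parses_as_utf8_json(utf16_file))
        print("UTF-8 file readable as UTF-8 JSON: ", parses_as_utf8_json(utf8_file))
```

A check like this, run against the file before launching the agent, would surface the problem without having to decode the framework's own error message.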

Hurdle 3: Hardcoded Cloud Routing

The Problem: Even with the dummy key accepted, the traffic refused to stay local. The framework’s internal Node.js code was strictly hardcoded to route any model starting with the openai/ prefix directly to api.openai.com, ignoring all local 127.0.0.1 overrides.

The Solution: The "Ollama Trojan Horse"
I realized that fighting the framework's strict OpenAI routing was a losing battle. However, I noticed the framework natively supported Ollama, a popular tool for running local models.

Because the framework expects Ollama to run locally, it doesn't require API keys, and it defaults to local traffic (http://127.0.0.1:11434).

I completely abandoned the OpenAI disguise and built a custom FastAPI Proxy Server in Python. I programmed my server to listen on port 11434 and speak the exact JSON dialect expected by Ollama (/api/chat).

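# Snippet of the FastAPI Proxy
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="NPU Ollama Proxy")

class ChatMessage(BaseModel):
    role: str
    content: str

class OllamaChatRequest(BaseModel):
    # Minimal subset of the fields Ollama's /api/chat accepts
    model: str
    messages: List[ChatMessage]
    stream: bool = False

def run_npu_inference(messages: List[ChatMessage]) -> str:
    # Hypothetical placeholder: replace with the actual NPU inference call
    raise NotImplementedError("wire this to the compiled NPU graph")

@app.post("/api/chat")
async def chat_completions(req: OllamaChatRequest):
    # 1. Intercept the framework's payload (validated by the model above)
    # 2. Feed it directly into the Intel NPU graph
    npu_response = run_npu_inference(req.messages)
    # 3. Return the response formatted as an Ollama dictionary
    return {
        "model": req.model,
        "message": {"role": "assistant", "content": npu_response},
        "done": True
    }

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=11434)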
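
One way to check that a proxy like this speaks the expected dialect is to send it a hand-built request. The sketch below uses only the Python standard library. The helper names are hypothetical and the model name in the example is a placeholder. It assumes the proxy is listening on 127.0.0.1:11434 and answers with a single, non-streamed JSON object shaped like the snippet above.

```python
import json
import urllib.request

PROXY_URL = "http://127.0.0.1:11434/api/chat"


def build_chat_payload(model: str, prompt: str) -> dict:
    """Build a minimal Ollama-style /api/chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an Ollama-style chat response."""
    return response["message"]["content"]


def ask_proxy(model: str, prompt: str, url: str = PROXY_URL, timeout: float = 120.0) -> str:
    """POST a chat request to the proxy and return the assistant's reply."""
    body = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.loads(resp.read().decode("utf-8")))


# Example (requires the proxy to be running; the model name is a placeholder):
# print(ask_proxy("local-npu-model", "Say hello in one sentence."))
```

If this round trip works from a plain script, any remaining failure is more likely on the framework side than in the proxy.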

Hurdle 4: The 1.5B Parameter "Fever Dream"

The Problem:
The connection was flawless, but the output was chaos. Dropping a highly complex, 10,000-word enterprise instruction manual onto a small 1.5 Billion parameter reasoning model caused catastrophic hallucination.

Initially, the model got trapped in an infinite loop, repeating the word "roles" hundreds of times. When I aggressively cranked up the repetition_penalty parameter to break the loop, the model swung too far the other way—generating a hilarious "word salad" of obscure vocabulary to avoid repeating itself.

The Solution: The Strict Robotic Guardrails
Small models need strict boundaries. To fix the hallucination, I updated the model generation parameters in my proxy to highly restrictive guardrails:

max_new_tokens=150: Prevented infinite rambling.
temperature=0.1: Removed "creativity" to ensure predictable, logical outputs.
repetition_penalty=1.15: A balanced penalty allowing normal grammar without infinite loops.
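
To make the effect of two of these knobs concrete, here is a small, dependency-free sketch of what temperature and repetition penalty do to a model's raw scores (logits) before a token is picked. It follows the commonly used formulation: logits of already-generated tokens are divided by the penalty when positive and multiplied by it when negative, and all logits are divided by the temperature before the softmax. This is not the proxy's generation code; the toy logits and function names are made up for illustration.

```python
import math


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-generated tokens less likely without banning them."""
    adjusted = list(logits)
    for token_id in seen_token_ids:
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the peak."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    toy_logits = [2.0, 1.8, 0.5]  # three-token toy vocabulary
    penalised = apply_repetition_penalty(toy_logits, seen_token_ids={0}, penalty=1.15)
    for temperature in (1.0, 0.1):
        probs = softmax_with_temperature(penalised, temperature)
        print(f"temperature={temperature}:", [round(p, 3) for p in probs])
```

A penalty of 1.15 only nudges a repeated token below a close competitor, while a very large penalty pushes every recently used word far down the ranking, which is consistent with the "word salad" behaviour described above.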

While a 1.5B model is ultimately too small to autonomously execute complex tool-calling (like web browsing) based on a massive system prompt, the pipeline itself was a resounding success.

Conclusion

By combining custom OpenVINO compilation, file-encoding debugging, and local API emulation via FastAPI, I bridged a locked-down enterprise agent framework with experimental NPU silicon, entirely offline.

If you are building local AI tools, don't let hardcoded network routes stop you. API interoperability is your best friend. Build a proxy, spoof the dialect, and take control of your hardware.

Check out the full code for the proxy and NPU compiler on my GitHub: https://github.com/tanaykolekar/OpenClaw-NPU-Proxy

Have you experimented with Intel NPUs or local Agent frameworks? Let me know about your roadblocks in the comments below!