I didn't want a demo. I wanted a real assistant.
The plan was simple: use OpenClaw with Qwen's latest 3.5-series models.
Not another local LLM that could summarize text and write boilerplate. I wanted something with memory — something that could keep a user profile, retrieve my notes, use tools, live in Telegram, and actually feel persistent. The kind of thing you close your laptop and trust is still there.
So I did what the local AI community makes look achievable: I tried to build it myself.
My machine: Arch Linux, an RTX 3050-class GPU with 6 GB VRAM, ~12 GB of system RAM. Enough to run small models. Enough to experiment. Enough, I thought, to build something real.
What I got instead was a sharp education in the gap between "running a model locally" and "running an agent framework locally." They are not the same workload, and conflating them is the most common mistake in this space.
The Stack That Made Sense on Paper
The tool I wanted to build on was OpenClaw — an open-source agent framework that layers tools, memory, sessions, multi-channel support, and structured workflows on top of a language model backend. It promised to be the missing piece between "I can run a model" and "I have an assistant."
The plan:
- Ollama to serve models locally
- Qwen as the model (small, recent, reportedly strong for its size)
- OpenClaw as the agent layer
On paper, this is a reasonable stack. In practice, it surfaces every assumption that local AI discourse glosses over.
Mistake #1: Assuming the Model Was the Whole Problem
I started with qwen3.5:4b through Ollama.
It ran. That wasn't the issue.
The issue was behavioral: the model kept slipping into visible chain-of-thought output. Every response came with narrated reasoning. Not broken, but wrong — asking for a daily assistant and getting a model that wanted to think out loud at every turn is like hiring a receptionist who reads their internal monologue aloud before answering the phone.
So I did what everyone does: I tried to prompt my way out of it. Stricter system prompts. Custom Modelfiles. Context limits. Persona instructions. Explicit /no_think directives.
That helped. It didn't solve the core mismatch.
The qwen3.5:4b model is from the thinking branch of the Qwen family — it's optimized for reasoning tasks, not conversational fluency. The model family matters. Using a reasoning model for a chat assistant is a category error, not a configuration problem.
The fix was straightforward once I admitted it: switch to an instruct model.
ollama pull qwen3:4b-instruct-2507-q4_K_M
q4_K_M quantization is a solid default for 4B models on 6 GB VRAM — it cuts memory footprint meaningfully without destroying output quality. With a pure instruct model, the behavioral issues cleared up immediately.
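To keep that behavior pinned down, the instruct model can be wrapped in a custom Modelfile. A minimal sketch, assuming the tag pulled above — the system prompt and parameter values are just examples, not anything OpenClaw requires:

```
FROM qwen3:4b-instruct-2507-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
SYSTEM """You are a concise personal assistant. Answer directly, with no meta-commentary."""
```

Build and run it with `ollama create assistant -f Modelfile` and `ollama run assistant`.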
But the harder problem was still waiting.
Mistake #2: Thinking "Model Works" Means "Stack Works"
Here's what a plain local chat app sends to a model:
[system prompt]
[recent conversation turns]
[user message]
That's it. A few hundred tokens, maybe a few thousand if you keep a long history. Totally manageable.
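In code, that entire payload is a few lines. A sketch of the message list a bare chat loop assembles — the function and its trimming policy are my own, not from any particular frontend:

```python
def build_chat_payload(system_prompt, history, user_message, max_turns=10):
    """history: list of {"role": ..., "content": ...} dicts, oldest first."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history[-max_turns * 2:]  # keep only the recent user/assistant pairs
    messages.append({"role": "user", "content": user_message})
    return messages

payload = build_chat_payload("You are a helpful assistant.", [], "Hi there")
# → two messages: the system prompt, then the lone user turn
```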
Here's what an agent framework sends to a model:
[system prompt]
[tool schemas — every tool the agent can call]
[session state and memory]
[workspace context]
[bootstrap instructions]
[structured output expectations]
[past actions and results]
[conversation history]
[user message]
Every one of those elements costs tokens. And on a local setup, tokens aren't a billing abstraction — they're a memory and latency problem.
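To make that weight concrete, here's a back-of-the-envelope budget. Every number below is invented for illustration — I never measured OpenClaw's actual per-component costs — but the shape is the point: the scaffolding dominates.

```python
# Hypothetical per-turn token costs for the components listed above.
# These figures are illustrative, not measured.
scaffolding = {
    "system_prompt": 800,
    "tool_schemas": 3000,          # resent with every single request
    "session_state_and_memory": 1500,
    "workspace_context": 1000,
    "bootstrap_instructions": 600,
    "structured_output_spec": 400,
    "past_actions": 2000,
    "conversation_history": 1500,
}
user_message = 50

total = sum(scaffolding.values()) + user_message
print(total)  # 10850 tokens in flight before the model generates a word
```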
This is the thing local AI content almost never explains clearly: the model is not the product. The framework around the model is the product. And that framework has weight.
OpenClaw made this visible very quickly.
The Context Window Trap: 4K vs 16K vs 262K
The first error I hit was blunt:
Error: context window too small. Minimum required: 16000 tokens. Current: 4096.
Fine. I increased the context window.
Then Ollama started allocating memory as if the context was 262,144 tokens.
Error: model requires ~38.9 GiB of system memory. Available: ~12.5 GiB.
That's not a typo. 38.9 GB for a 4B model, because of context window size.
Here's why: attention compute scales quadratically with sequence length, but the immediate killer is memory. The KV cache — the structure that stores the computed keys and values for every token in context — grows linearly with the context window, and at 262K tokens, linear is enormous. For a 4B model at q4_K_M quantization, the model weights themselves are around 2.5 GB. A 262K-token KV cache can dwarf that by an order of magnitude.
The math, roughly:
KV cache size ≈ 2 × layers × heads × head_dim × context_length × bytes_per_element
For Qwen 4B:
≈ 2 × 32 × 8 × 128 × 262144 × 2 bytes
≈ ~34 GB
You can see how a "reasonable" context ceiling becomes a hardware wall fast.
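The arithmetic is easy to wrap in a function if you want to sanity-check other shapes. The layer/head numbers below are the same assumed Qwen-4B-ish values used in the rough math above, not verified against the released config:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x accounts for storing both keys and values at every layer
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem

size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, ctx_len=262_144)
print(f"{size / 1e9:.1f} GB")  # 34.4 GB — more than 10x the ~2.5 GB of weights
```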
The trap is that the minimum functional context OpenClaw needs (16K) is already well above what fits comfortably in a 6 GB VRAM setup if you want any headroom for inference. And 16K is the floor, not the sweet spot.
The Configuration Spiral
What followed was a long sequence of plausible-looking fixes that went nowhere:
- Set `num_ctx: 8192` in Ollama → OpenClaw complained the session minimum wasn't met
- Set `num_ctx: 16384` → worked, but sessions kept inflating back to 262K in the state view
- Manually edited OpenClaw config files → watched changes revert
- Checked whether I was editing the wrong state directory → yes, sometimes
- Created local model aliases with explicit context caps → partial success
- Hard-capped context in every config location I could find → Ollama respected it; OpenClaw didn't always agree with the result
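For what it's worth, the most reliable lever was capping context per request rather than per model: Ollama's chat API accepts an `options` object with `num_ctx`. A sketch of the request body — whether the agent layer on top honors the resulting limit is another matter:

```python
import json

request_body = {
    "model": "qwen3:4b-instruct-2507-q4_K_M",
    "messages": [{"role": "user", "content": "ping"}],
    "options": {"num_ctx": 16384},  # hard cap for this request only
    "stream": False,
}
# POST this as JSON to http://localhost:11434/api/chat
body = json.dumps(request_body)
```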
The pattern is seductive. Every partial success feels like momentum. The model loaded? Great. The context lowered? Okay. The agent accepted the config? Almost there.
Then the next hidden assumption surfaces.
The system wasn't randomly broken. It was consistently surfacing the same underlying incompatibility: the framework assumed a context budget that my hardware couldn't provide. Every workaround was borrowing against that fundamental gap, not closing it.
The Insight That Reframed Everything
At some point I stopped treating this as a configuration bug and asked a different question:
What kind of system is OpenClaw actually designed for?
Not "can I make it work with a 4B model on 6 GB VRAM?" but "what does the framework assume about its environment?"
The answer, once you look honestly at the architecture, is clear:
OpenClaw is built for models with real context headroom — 32K, 64K, 128K tokens where the agent scaffolding is a small fraction of available budget rather than the entire budget.
It's built for models with low-latency inference — where tool call round-trips and multi-step reasoning don't become multi-minute waits.
It's built for the API tier, not the consumer GPU tier.
That's not a criticism. It's a design reality. The framework does things that genuinely require those resources: persistent memory, multi-turn tool use, session-aware behavior, complex orchestration. That stuff is the whole value proposition. And it has a minimum viable substrate.
What Changed When I Switched to a Cloud Backend
I pointed OpenClaw at Kimi K2.5 via a cloud API and the experience shifted immediately.
Not magically. The framework still has quirks. But the fundamental friction — the constant negotiation over whether the infrastructure could physically support the next operation — disappeared.
Messages went through cleanly. Context stopped being the entire conversation. The tool layer worked the way the documentation described. I could actually evaluate the product rather than fighting the substrate.
The comparison is useful: the same framework, the same prompts, the same configuration. The only variable was whether the model backend could absorb the overhead without drowning.
Local: every interaction was a resource negotiation.
Cloud: the framework did what it was supposed to do.
What This Actually Means for Local AI
I want to be precise here, because "just use the cloud" is a lazy conclusion and I don't believe it.
Small local models are genuinely good at a real set of tasks:
- Plain conversational chat — instruct models at 4B–8B are solid here
- Focused code help — constrained tasks where context window is not the bottleneck
- Document drafting — one-shot or few-shot generation over small inputs
- Local RAG — retrieval over a small, well-scoped document set
- Privacy-sensitive workflows — anything that shouldn't leave your machine
Where local models struggle is not a model quality problem. It's a systems design problem: if your tool layer, memory layer, session model, and orchestration all assume a generous context budget, a small local setup will spend most of its energy surviving the framework rather than doing useful work.
The honest reframe is this:
Running a 4B model in a simple chat interface and running that same 4B model inside a full agent framework are not the same workload. One fits on your GPU. The other assumes datacenter-class headroom.
Treating them as equivalent is why so many local AI projects stall out in configuration hell rather than producing something useful.
Practical Recommendations
If your goal is a local everyday assistant:
Use a small instruct model in a lean interface. Ollama + Open WebUI or a minimal Python frontend. Keep the context requirement below 8K. Avoid frameworks that inject large amounts of scaffolding unless you've measured the overhead. A q4_K_M quantized 4B–8B instruct model in a simple chat loop is genuinely useful.
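A lean frontend really can be this small. A sketch using only the standard library against Ollama's default endpoint — the model tag is an example, and error handling is omitted:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint
MODEL = "llama3.2:3b-instruct-q4_K_M"

def build_request(history, user_message):
    """Assemble the full payload: prior turns plus the new user message."""
    messages = history + [{"role": "user", "content": user_message}]
    return {"model": MODEL, "messages": messages, "stream": False}

def chat_once(history, user_message):
    """Send one turn to the local server and return the reply text."""
    data = json.dumps(build_request(history, user_message)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```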
ollama pull llama3.2:3b-instruct-q4_K_M # ~2GB, fast, good for chat
ollama pull qwen3:8b-instruct-q4_K_M # ~5GB, better reasoning
If your goal is to actually experience OpenClaw:
Start with a cloud-backed model. Let the framework do what it was designed to do before you optimize for local deployment. You'll learn what the product actually is rather than spending all your time fighting the substrate.
If you're committed to local + agent features:
You need either a machine with 24+ GB VRAM (RTX 4090, A-series workstation GPUs), or you need to be very intentional about which agent features you enable and what their context cost is. Profile the token overhead of each feature before enabling it.
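Inverting the KV-cache arithmetic from earlier gives a quick feasibility check: given the VRAM left after loading weights, how much context can you actually afford? Same assumed Qwen-4B-ish shape as before; real headroom is lower once activations and the desktop's own VRAM use are counted.

```python
def max_context(free_vram_gb, layers=32, kv_heads=8, head_dim=128,
                bytes_per_elem=2):
    """Largest context whose KV cache alone fits in the given free VRAM."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return int(free_vram_gb * 1e9) // per_token

# ~3.5 GB left on a 6 GB card after ~2.5 GB of quantized weights:
print(max_context(3.5))  # ~26K tokens — before activations take their cut
```

By this crude measure a 16K cache alone costs about 2.1 GB, which stacked on ~2.5 GB of weights is why 16K reads as the floor rather than the sweet spot on a 6 GB card.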
The Question I Should Have Asked First
I went into this project asking: "Which model should I run?"
That's the wrong question. The right question is:
"What kind of system am I trying to run, and what does that system actually require?"
Plain chat and agent frameworks are different categories with different resource profiles. A model that works beautifully in a simple interface can fail badly inside a framework that assumes 10x the context budget.
Understanding that distinction early would have saved me a lot of configuration spirals. It's also just a more accurate mental model for thinking about local AI in general — not as "models you run" but as "systems with compute requirements," where the framework overhead is often larger than the model itself.
That's the lesson that actually transfers.
Have you run into context window or memory walls with local agent frameworks? What workarounds have actually held up? I'd like to hear what's working.