Originally published on Remote OpenClaw.
If you are running Ollama models locally for OpenClaw, your GPU is the bottleneck that determines everything: which models you can run, how much context you can hold, how fast your agent responds, and whether your system stays stable under load. Most operators set up Ollama, pull a model, and never think about GPU optimization — and then wonder why their agent feels slow or starts dropping context mid-session.
This guide covers the practical GPU optimization decisions that matter for OpenClaw specifically. If you have not picked your model yet, start with the best Ollama models for OpenClaw guide first.
Why GPU Optimization Matters for OpenClaw
OpenClaw is not a thin chatbot. It is an agent runtime that maintains tool state, memory context, system instructions, and multi-turn conversation history simultaneously. All of that content lives in the model's context window, and the context window lives in VRAM.
A regular chat application might use 2-4K tokens per interaction. An OpenClaw agent session routinely carries 20-60K tokens of active context. That is a 10-15x difference in VRAM pressure compared to casual model usage.
This means GPU optimization for OpenClaw is fundamentally about VRAM management. Raw compute speed matters for token generation, but VRAM capacity determines whether your agent can function at all with the context it needs.
VRAM Requirements by Model and Context
Here are the practical VRAM requirements for the most common OpenClaw-suitable Ollama models. These numbers reflect Q4_K_M quantization, which is the most common default.
| Model | Params | Q4_K_M size | VRAM at 4K ctx | VRAM at 32K ctx | VRAM at 64K ctx |
|---|---|---|---|---|---|
| qwen3.5:9b | 9B | ~6.6GB | ~8GB | ~12GB | ~16GB |
| glm-4.7-flash | 30B (3B active) | ~18GB | ~20GB | ~24GB | ~28GB |
| qwen3-coder:30b | 30B (3.3B active) | ~18GB | ~20GB | ~24GB | ~28GB |
| qwen3.5:27b | 27B | ~17GB | ~19GB | ~23GB | ~27GB |
The key insight from this table: the model weights are only part of the VRAM story. The context window adds substantial VRAM overhead, and that overhead scales linearly with context length. A model that fits comfortably at 4K context might not fit at 64K context on the same GPU.
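You can turn the table into a rough planning formula: total VRAM is roughly the quantized weights plus a fixed runtime overhead plus a per-token context cost. The ~1.4GB base overhead and ~0.13GB per 1K tokens used below are assumptions back-solved from the qwen3.5:9b rows above; the real KV-cache cost varies by model architecture and KV quantization, so treat this as a sketch, not a spec.

```sh
# Rough VRAM estimate: weights + fixed overhead + linear context cost.
# Constants are assumptions fitted to the qwen3.5:9b table rows.
weights_gb=6.6     # Q4_K_M file size for qwen3.5:9b
ctx_tokens=64000   # target context length
awk -v w="$weights_gb" -v c="$ctx_tokens" \
  'BEGIN { printf "estimated VRAM: %.1f GB\n", w + 1.4 + (c / 1000) * 0.13 }'
```

For qwen3.5:9b at 64K this lands near the ~16GB figure in the table; rerun it with your own model's file size before buying hardware.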
For VPS and cloud GPU options, see the best VPS for OpenClaw guide.
Context Window and VRAM Allocation
Ollama uses automatic context defaults based on your available VRAM:
- Under 24 GiB VRAM: defaults to 4K context
- 24-48 GiB VRAM: defaults to 32K context
- 48+ GiB VRAM: defaults to 256K context
For OpenClaw, these defaults are almost always wrong. Ollama's own documentation recommends at least 64K context for agent workloads. If you are on a 16GB GPU, Ollama will default to 4K context — which is far too low for OpenClaw to function properly.
Override the default explicitly:
```sh
# Set context length for the Ollama server
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

# Verify the active context allocation
ollama ps
```
The tradeoff is straightforward: higher context uses more VRAM, leaving less room for the model itself. If you cannot fit both your model weights and 64K context in VRAM, you have three options:
- Drop to a smaller model (e.g., qwen3.5:9b instead of qwen3.5:27b)
- Use more aggressive quantization (Q4 instead of Q8)
- Accept partial CPU offloading (significantly slower but functional)
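If you would rather tie the larger context to one model instead of the whole server, Ollama's Modelfile `num_ctx` parameter bakes it into a named variant. The `qwen3.5:9b-64k` tag below is just an illustrative name:

```sh
# Create a model variant with a 64K context built in
cat > Modelfile <<'EOF'
FROM qwen3.5:9b
PARAMETER num_ctx 65536
EOF
ollama create qwen3.5:9b-64k -f Modelfile
```

Point OpenClaw at the new tag and the context setting travels with the model, regardless of server-wide defaults.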
Quantization Levels and When to Use Each
Quantization reduces model precision to fit larger models into less VRAM. Ollama handles quantization automatically when you pull a model, but understanding the tradeoffs helps you make better choices.
| Quantization | Bits per weight | VRAM savings vs FP16 | Quality impact | Best for |
|---|---|---|---|---|
| FP16 | 16 | Baseline | None | Maximum quality, plenty of VRAM |
| Q8_0 | 8 | ~50% | Minimal | Quality-sensitive tasks with large VRAM |
| Q5_K_M | ~5.5 | ~65% | Small | Good balance for 24GB GPUs |
| Q4_K_M | ~4.5 | ~72% | Moderate | Best default for most operators |
| Q3_K_M | ~3.5 | ~78% | Noticeable | Squeezing large models onto small GPUs |
| Q2_K | ~2.5 | ~84% | Significant | Last resort only |
For OpenClaw specifically, Q4_K_M is the sweet spot for most operators. Agent tasks like tool calling, code generation, and instruction following are less sensitive to quantization than creative writing or nuanced reasoning. You lose very little practical performance going from Q8 to Q4 for typical OpenClaw workflows.
Below Q4, quality degradation becomes noticeable. Q3 can work for simple tasks but starts failing on multi-step reasoning. Q2 is not recommended for OpenClaw under any circumstances.
```sh
# Pull a specific quantization level
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q8_0

# Check which quantization you are running
ollama show qwen3.5:9b --modelfile
```
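The bits-per-weight column gives you a quick floor estimate for file size: parameters times bits, divided by 8. This is a rule-of-thumb sketch, not an exact formula:

```sh
# Floor estimate for a quantized model file: params x bits / 8
params_b=9   # billions of parameters
bits=4.5     # Q4_K_M effective bits per weight
awk -v p="$params_b" -v b="$bits" \
  'BEGIN { printf "minimum size: %.1f GB\n", p * b / 8 }'
# Real GGUF files run somewhat larger (qwen3.5:9b Q4_K_M is ~6.6GB)
# because K-quants keep embeddings and some layers at higher precision.
```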
Batch Size and Concurrent Requests
If you run a single OpenClaw instance, batch size rarely matters — you are generating one response at a time. But if you run multiple agents or have OpenClaw handling concurrent tasks, batch settings affect throughput significantly.
Ollama's default batch size works for single-user scenarios. For concurrent usage:
- Parallel requests: Ollama can handle multiple concurrent requests to the same model, but each active request consumes additional VRAM for its context. Two concurrent 64K context requests need roughly double the context VRAM overhead.
- Model loading: Ollama keeps recently used models in VRAM. If you switch between models frequently, the loading and unloading adds latency. Stick to one or two models to avoid constant reloading.
- Queue behavior: When VRAM is full, additional requests queue until previous ones complete. This is better than crashing, but it means your agent stalls during peak usage.
For most single-operator OpenClaw deployments, the default batch settings are fine. Optimize here only if you notice throughput problems with concurrent workloads.
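If you do need to tune for concurrency, the knobs are environment variables on the server process. `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` are documented Ollama settings; the values below are illustrative starting points, not tuned recommendations:

```sh
# OLLAMA_NUM_PARALLEL: concurrent requests per loaded model; each
#   active request holds its own context in VRAM, so 2 parallel 64K
#   requests roughly doubles the context overhead.
# OLLAMA_MAX_LOADED_MODELS: cap loaded models to avoid reload churn.
OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=1 \
  OLLAMA_CONTEXT_LENGTH=64000 ollama serve
```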
Hardware Recommendations by Budget
Budget tier: $200-400 (used market)
RTX 3060 12GB or RTX 2080 Ti 11GB. These GPUs handle 7-9B models at Q4 quantization with moderate context. You will not hit the 64K context recommendation with larger models, but they work for lighter OpenClaw usage paired with an OpenRouter fallback.
Mid tier: $500-900 (used market)
RTX 3090 24GB or RTX 4070 Ti Super 16GB. The RTX 3090 is the best value GPU for local inference right now. 24GB VRAM fits most OpenClaw-suitable models at Q4 with 32-64K context. This is the sweet spot for serious local operators.
High tier: $1000-2000
RTX 4090 24GB or dual RTX 3090. The 4090 offers the best single-GPU performance with 24GB VRAM and much faster inference than the 3090. Dual 3090s give you 48GB total VRAM for larger models or higher context windows, but multi-GPU inference adds complexity.
Apple Silicon
M2 Pro/Max or M3 Pro/Max. Apple Silicon shares memory between CPU and GPU, giving you effectively 32-96GB of "VRAM" depending on your configuration. Ollama has native Metal support. The M3 Max with 96GB unified memory can run very large models at full context. For the OpenClaw setup guide, Apple Silicon is one of the most practical local options.
Monitoring and Troubleshooting GPU Usage
The most common GPU-related OpenClaw problem is invisible: the model runs but delivers poor results because the context window was silently truncated to fit in VRAM. Always verify your actual allocation.
```sh
# Check NVIDIA GPU memory usage in real time
nvidia-smi -l 1

# Check what Ollama has loaded and its context allocation
ollama ps

# Check if the model is using GPU or fell back to CPU
ollama ps | grep -i "gpu\|cpu"
```
Common problems and fixes
- Model runs on CPU instead of GPU: Check that your NVIDIA drivers are current and CUDA is available. Restart the Ollama server after driver updates.
- Out of memory errors: Drop to a smaller quantization level first. If that is not enough, drop to a smaller model. As a last resort, reduce the context window — but be aware this directly impacts OpenClaw performance.
- Slow token generation: If you see 1-5 tokens per second, the model is likely partially offloaded to CPU. Check VRAM usage; if it is maxed out, some layers have spilled to system RAM. Free VRAM, drop to a smaller quantization, or switch to a smaller model.
- Context gets truncated mid-session: This happens when your VRAM cannot hold the growing context. Monitor VRAM during long sessions. If it hits the ceiling, the agent starts losing earlier context silently.
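For long sessions, a simple watch loop makes the VRAM ceiling visible before the agent starts losing context. This sketch assumes an NVIDIA GPU with `nvidia-smi` available; the 95% warning threshold is an arbitrary illustrative choice:

```sh
# Print VRAM usage once per second; flag when usage nears the ceiling.
while true; do
  nvidia-smi --query-gpu=memory.used,memory.total \
             --format=csv,noheader,nounits |
    awk -F', ' '{ pct = $1 / $2 * 100;
                  printf "%d/%d MiB (%.0f%%)%s\n", $1, $2, pct,
                         (pct > 95 ? "  <- context may truncate" : "") }'
  sleep 1
done
```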
Frequently Asked Questions
How much VRAM do I need for Ollama models in OpenClaw?
For the recommended 64K context window, you need at least 16GB VRAM for smaller models like qwen3.5:9b and 24GB or more for mid-size models like glm-4.7-flash or qwen3-coder:30b. The exact requirement depends on the model size and quantization level. Running at lower context windows reduces VRAM needs but also reduces OpenClaw performance.
Should I use Q4 or Q8 quantization for OpenClaw?
Q4_K_M is the best starting point for most operators because it cuts VRAM usage roughly in half compared to full precision while keeping quality loss minimal for agent tasks. Q8 is noticeably better for complex reasoning but requires significantly more VRAM. Only use Q8 if your GPU has headroom after accounting for context window memory.
Can I run Ollama for OpenClaw on an older GPU like the RTX 3060?
Yes, but with limitations. The RTX 3060 has 12GB VRAM, which is enough for Q4-quantized 7-9B models at moderate context lengths. You will not be able to run 30B models or reach the full 64K context recommendation. For budget hardware, pair a smaller local model with an OpenRouter fallback for heavier tasks.
Does Ollama automatically use my GPU for OpenClaw?
Yes, Ollama automatically detects and uses NVIDIA GPUs with CUDA support and Apple Silicon GPUs with Metal support. You do not need to configure GPU offloading manually in most cases. Use nvidia-smi or ollama ps to verify that the model is loaded on your GPU rather than running on CPU.