Originally published on Remote OpenClaw.
If you are running Ollama models locally for OpenClaw, your GPU is the bottleneck that determines everything: which models you can run, how much context you can hold, how fast your agent responds, and whether your system stays stable under load. Most operators set up Ollama, pull a model, and never think about GPU optimization — and then wonder why their agent feels slow or starts dropping context mid-session.
This guide covers the practical GPU optimization decisions that matter for OpenClaw specifically. If you have not picked your model yet, start with the best Ollama models for OpenClaw guide first.
Why GPU Optimization Matters for OpenClaw
OpenClaw is not a thin chatbot. It is an agent runtime that maintains tool state, memory context, system instructions, and multi-turn conversation history simultaneously. All of that content lives in the model's context window, and the context window lives in VRAM.
A regular chat application might use 2-4K tokens per interaction. An OpenClaw agent session routinely carries 20-60K tokens of active context. That is a 10-15x difference in VRAM pressure compared to casual model usage.
This means GPU optimization for OpenClaw is fundamentally about VRAM management. Raw compute speed matters for token generation, but VRAM capacity determines whether your agent can function at all with the context it needs.
VRAM Requirements by Model and Context
Here are the practical VRAM requirements for the most common OpenClaw-suitable Ollama models. These numbers reflect Q4_K_M quantization, which is the most common default.
| Model | Params | Q4_K_M size | VRAM at 4K ctx | VRAM at 32K ctx | VRAM at 64K ctx |
|---|---|---|---|---|---|
| qwen3.5:9b | 9B | ~6.6GB | ~8GB | ~12GB | ~16GB |
| glm-4.7-flash | 30B (3B active) | ~18GB | ~20GB | ~24GB | ~28GB |
| qwen3-coder:30b | 30B (3.3B active) | ~18GB | ~20GB | ~24GB | ~28GB |
| qwen3.5:27b | 27B | ~17GB | ~19GB | ~23GB | ~27GB |
The key insight from this table: the model weights are only part of the VRAM story. The context window adds substantial VRAM overhead, and that overhead scales linearly with context length. A model that fits comfortably at 4K context might not fit at 64K context on the same GPU.
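You can turn the table into a rough planning formula: total VRAM is roughly the quantized weights plus a fixed runtime overhead plus a per-token context cost. The ~1.4GB base overhead and ~0.13GB per 1K tokens used below are assumptions back-solved from the qwen3.5:9b rows above; the real KV-cache cost varies by model architecture and KV quantization, so treat this as a sketch, not a spec.

```sh
# Rough VRAM estimate: weights + fixed overhead + linear context cost.
# Constants are assumptions fitted to the qwen3.5:9b table rows.
weights_gb=6.6     # Q4_K_M file size for qwen3.5:9b
ctx_tokens=64000   # target context length
awk -v w="$weights_gb" -v c="$ctx_tokens" \
  'BEGIN { printf "estimated VRAM: %.1f GB\n", w + 1.4 + (c / 1000) * 0.13 }'
```

For qwen3.5:9b at 64K this lands near the ~16GB figure in the table; rerun it with your own model's file size before buying hardware.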
For VPS and cloud GPU options, see the best VPS for OpenClaw guide.
Context Window and VRAM Allocation
Ollama uses automatic context defaults based on your available VRAM:
- Under 24 GiB VRAM: defaults to 4K context
- 24-48 GiB VRAM: defaults to 32K context
- 48+ GiB VRAM: defaults to 256K context
For OpenClaw, these defaults are almost always wrong. Ollama's own documentation recommends at least 64K context for agent workloads. If you are on a 16GB GPU, Ollama will default to 4K context — which is far too low for OpenClaw to function properly.
Override the default explicitly:
```sh
# Set context length for the Ollama server
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

# Verify the active context allocation
ollama ps
```
The tradeoff is straightforward: higher context uses more VRAM, leaving less room for the model itself. If you cannot fit both your model weights and 64K context in VRAM, you have three options:
- Drop to a smaller model (e.g., qwen3.5:9b instead of qwen3.5:27b)
- Use more aggressive quantization (Q4 instead of Q8)
- Accept partial CPU offloading (significantly slower but functional)
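If you would rather tie the larger context to one model instead of the whole server, Ollama's Modelfile `num_ctx` parameter bakes it into a named variant. The `qwen3.5:9b-64k` tag below is just an illustrative name:

```sh
# Create a model variant with a 64K context built in
cat > Modelfile <<'EOF'
FROM qwen3.5:9b
PARAMETER num_ctx 65536
EOF
ollama create qwen3.5:9b-64k -f Modelfile
```

Point OpenClaw at the new tag and the context setting travels with the model, regardless of server-wide defaults.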
Quantization Levels and When to Use Each
Quantization reduces model precision to fit larger models into less VRAM. Ollama handles quantization automatically when you pull a model, but understanding the tradeoffs helps you make better choices.
| Quantization | Bits per weight | VRAM savings vs FP16 | Quality impact | Best for |
|---|---|---|---|---|
| FP16 | 16 | Baseline | None | Maximum quality, plenty of VRAM |
| Q8_0 | 8 | ~50% | Minimal | Quality-sensitive tasks with large VRAM |
| Q5_K_M | ~5.5 | ~65% | Small | Good balance for 24GB GPUs |
| Q4_K_M | ~4.5 | ~72% | Moderate | Best default for most operators |
| Q3_K_M | ~3.5 | ~78% | Noticeable | Squeezing large models onto small GPUs |
| Q2_K | ~2.5 | ~84% | Significant | Last resort only |
For OpenClaw specifically, Q4_K_M is the sweet spot for most operators. Agent tasks like tool calling, code generation, and instruction following are less sensitive to quantization than creative writing or nuanced reasoning. You lose very little practical performance going from Q8 to Q4 for typical OpenClaw workflows.
Below Q4, quality degradation becomes noticeable. Q3 can work for simple tasks but starts failing on multi-step reasoning. Q2 is not recommended for OpenClaw under any circumstances.
```sh
# Pull a specific quantization level
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q8_0

# Check which quantization you are running
ollama show qwen3.5:9b --modelfile
```
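The bits-per-weight column gives you a quick floor estimate for file size: parameters times bits, divided by 8. This is a rule-of-thumb sketch, not an exact formula:

```sh
# Floor estimate for a quantized model file: params x bits / 8
params_b=9   # billions of parameters
bits=4.5     # Q4_K_M effective bits per weight
awk -v p="$params_b" -v b="$bits" \
  'BEGIN { printf "minimum size: %.1f GB\n", p * b / 8 }'
# Real GGUF files run somewhat larger (qwen3.5:9b Q4_K_M is ~6.6GB)
# because K-quants keep embeddings and some layers at higher precision.
```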
Batch Size and Concurrent Requests
If you run a single OpenClaw instance, batch size rarely matters — you are generating one response at a time. But if you run multiple agents or have OpenClaw handling concurrent tasks, batch settings affect throughput significantly.
Ollama's default batch size works for single-user scenarios. For concurrent usage:
- Parallel requests: Ollama can handle multiple concurrent requests to the same model, but each active request consumes additional VRAM for its context. Two concurrent 64K context requests need roughly double the context VRAM overhead.
- Model loading: Ollama keeps recently used models in VRAM. If you switch between models frequently, the loading and unloading adds latency. Stick to one or two models to avoid constant reloading.
- Queue behavior: When VRAM is full, additional requests queue until previous ones complete. This is better than crashing, but it means your agent stalls during peak usage.
For most single-operator OpenClaw deployments, the default batch settings are fine. Optimize here only if you notice throughput problems with concurrent workloads.
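If you do need to tune for concurrency, the knobs are environment variables on the server process. `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` are documented Ollama settings; the values below are illustrative starting points, not tuned recommendations:

```sh
# OLLAMA_NUM_PARALLEL: concurrent requests per loaded model; each
#   active request holds its own context in VRAM, so 2 parallel 64K
#   requests roughly doubles the context overhead.
# OLLAMA_MAX_LOADED_MODELS: cap loaded models to avoid reload churn.
OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=1 \
  OLLAMA_CONTEXT_LENGTH=64000 ollama serve
```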
Hardware Recommendations by Budget
Budget tier: $200-400 (used market)
RTX 3060 12GB or RTX 2080 Ti 11GB. These GPUs handle 7-9B models at Q4 quantization with moderate context. You will not hit the 64K context recommendation with larger models, but they work for lighter OpenClaw usage paired with an OpenRouter fallback.
Mid tier: $500-900 (used market)
RTX 3090 24GB or RTX 4070 Ti Super 16GB. The RTX 3090 is the best value GPU for local inference right now. 24GB VRAM fits most OpenClaw-suitable models at Q4 with 32-64K context. This is the sweet spot for serious local operators.
High tier: $1000-2000
RTX 4090 24GB or dual RTX 3090. The 4090 offers the best single-GPU performance with 24GB VRAM and much faster inference than the 3090. Dual 3090s give you 48GB total VRAM for larger models or higher context windows, but multi-GPU inference adds complexity.
Apple Silicon
M2 Pro/Max or M3 Pro/Max. Apple Silicon shares memory between CPU and GPU, giving you effectively 32-96GB of "VRAM" depending on your configuration. Ollama has native Metal support. The M3 Max with 96GB unified memory can run very large models at full context. For the OpenClaw setup guide, Apple Silicon is one of the most practical local options.
Monitoring and Troubleshooting GPU Usage
The most common GPU-related OpenClaw problem is invisible: the model runs but delivers poor results because the context window was silently truncated to fit in VRAM. Always verify your actual allocation.
```sh
# Check NVIDIA GPU memory usage in real time
nvidia-smi -l 1

# Check what Ollama has loaded and its context allocation
ollama ps

# Check if the model is using GPU or fell back to CPU
ollama ps | grep -i "gpu\|cpu"
```
Common problems and fixes
- Model runs on CPU instead of GPU: Check that your NVIDIA drivers are current and CUDA is available. Restart the Ollama server after driver updates.
- Out of memory errors: Drop to a smaller quantization level first. If that is not enough, drop to a smaller model. As a last resort, reduce the context window — but be aware this directly impacts OpenClaw performance.
- Slow token generation: If you see 1-5 tokens per second, the model is likely partially offloaded to CPU. Check VRAM usage; if it is maxed out, some layers have spilled to system RAM. Free VRAM, drop to a smaller quantization, or switch to a smaller model.
- Context gets truncated mid-session: This happens when your VRAM cannot hold the growing context. Monitor VRAM during long sessions. If it hits the ceiling, the agent starts losing earlier context silently.
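For long sessions, a simple watch loop makes the VRAM ceiling visible before the agent starts losing context. This sketch assumes an NVIDIA GPU with `nvidia-smi` available; the 95% warning threshold is an arbitrary illustrative choice:

```sh
# Print VRAM usage once per second; flag when usage nears the ceiling.
while true; do
  nvidia-smi --query-gpu=memory.used,memory.total \
             --format=csv,noheader,nounits |
    awk -F', ' '{ pct = $1 / $2 * 100;
                  printf "%d/%d MiB (%.0f%%)%s\n", $1, $2, pct,
                         (pct > 95 ? "  <- context may truncate" : "") }'
  sleep 1
done
```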
Frequently Asked Questions
How much VRAM do I need for Ollama models in OpenClaw?
For the recommended 64K context window, you need at least 16GB VRAM for smaller models like qwen3.5:9b and 24GB or more for mid-size models like glm-4.7-flash or qwen3-coder:30b. The exact requirement depends on the model size and quantization level. Running at lower context windows reduces VRAM needs but also reduces OpenClaw performance.
Should I use Q4 or Q8 quantization for OpenClaw?
Q4_K_M is the best starting point for most operators because it cuts VRAM usage roughly in half compared to full precision while keeping quality loss minimal for agent tasks. Q8 is noticeably better for complex reasoning but requires significantly more VRAM. Only use Q8 if your GPU has headroom after accounting for context window memory.
Can I run Ollama for OpenClaw on an older GPU like the RTX 3060?
Yes, but with limitations. The RTX 3060 has 12GB VRAM, which is enough for Q4-quantized 7-9B models at moderate context lengths. You will not be able to run 30B models or reach the full 64K context recommendation. For budget hardware, pair a smaller local model with an OpenRouter fallback for heavier tasks.
Does Ollama automatically use my GPU for OpenClaw?
Yes, Ollama automatically detects and uses NVIDIA GPUs with CUDA support and Apple Silicon GPUs with Metal support. You do not need to configure GPU offloading manually in most cases. Use nvidia-smi or ollama ps to verify that the model is loaded on your GPU rather than running on CPU.