<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chris Kesler</title>
    <description>The latest articles on DEV Community by Chris Kesler (@chris_kesler_8a60b6e38dd8).</description>
    <link>https://dev.to/chris_kesler_8a60b6e38dd8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3813304%2F7742d1c3-7a4b-4616-8282-c6be42e3e596.jpg</url>
      <title>DEV Community: Chris Kesler</title>
      <link>https://dev.to/chris_kesler_8a60b6e38dd8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chris_kesler_8a60b6e38dd8"/>
    <language>en</language>
    <item>
      <title>OpenClaw Model Manager: A GUI for the Power Users Who Hate Waiting</title>
      <dc:creator>Chris Kesler</dc:creator>
      <pubDate>Sun, 08 Mar 2026 21:13:54 +0000</pubDate>
      <link>https://dev.to/chris_kesler_8a60b6e38dd8/openclaw-model-manager-a-gui-for-the-power-users-who-hate-waiting-2c55</link>
      <guid>https://dev.to/chris_kesler_8a60b6e38dd8/openclaw-model-manager-a-gui-for-the-power-users-who-hate-waiting-2c55</guid>
      <description>&lt;p&gt;&lt;em&gt;How I built a standalone web dashboard to tame OpenClaw's CLI—and what it taught me about AI infrastructure in the real world.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;OpenClaw is genuinely powerful. It runs a local AI gateway that routes your conversations through any combination of models—Anthropic, OpenRouter, Google, local Ollama models—with fallback chains, auth profiles, aliases, and session management baked right in. Once it's configured, it mostly just works.&lt;/p&gt;

&lt;p&gt;But "mostly just works" hides a lot of friction.&lt;/p&gt;

&lt;p&gt;Want to know if your gateway is running? &lt;code&gt;openclaw gateway status&lt;/code&gt;. Want to change your primary model? Edit a JSON config file, then restart the gateway. Want to see which provider is in a rate-limit cooldown? Good luck—dig through &lt;code&gt;auth-profiles.json&lt;/code&gt; manually. Want to check if your dual RTX 3060s can actually run that 34B parameter model? Open a calculator.&lt;/p&gt;

&lt;p&gt;The tools are all there. They're just scattered, CLI-only, and invisible when you need them most.&lt;/p&gt;

&lt;p&gt;That's why I built &lt;strong&gt;OpenClaw Model Manager&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87sxohq1w3nt4pjsfa6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87sxohq1w3nt4pjsfa6l.png" alt=" " width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ggx6une5gx3g6pmh1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ggx6une5gx3g6pmh1u.png" alt=" " width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr1rwvpj3anabtsto03l.png" alt=" " width="800" height="428"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Is
&lt;/h2&gt;

&lt;p&gt;OpenClaw Model Manager is a standalone web dashboard that wraps OpenClaw's existing CLI and config files with a clean, dark-themed UI. &lt;/p&gt;

&lt;p&gt;It runs as its own Express server on port &lt;code&gt;18800&lt;/code&gt;—completely independent of the OpenClaw gateway itself. That's intentional: &lt;strong&gt;the manager doesn't go down when the gateway does&lt;/strong&gt;, which makes it exactly the tool you reach for to bring the gateway back up after a crash. It's not a replacement for OpenClaw; it's a control panel for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No build step. No framework. No dependencies beyond Express and &lt;code&gt;ws&lt;/code&gt;.&lt;/strong&gt; Just static HTML, CSS, and JavaScript backed by a thin Node server that shells out to the same &lt;code&gt;openclaw&lt;/code&gt; CLI you'd use in your terminal. Bind it to &lt;code&gt;0.0.0.0:18800&lt;/code&gt; and it's accessible over Tailscale or your local network from any device—your phone, a laptop on the couch, or a remote machine across the country.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Features (And Why They Matter)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🚨 Provider Failover Panel
&lt;/h3&gt;

&lt;p&gt;This one was born from real pain. &lt;/p&gt;

&lt;p&gt;During development, switching primary models repeatedly triggered Anthropic's rate limiter. The gateway put Anthropic on a 5-minute cooldown—but the first fallback in my chain was &lt;em&gt;also&lt;/em&gt; an Anthropic model. So both were blocked. The error messages were confusing, the fix required editing a JSON file, and there was no visibility into what was happening.&lt;/p&gt;

&lt;p&gt;The Failover Panel solves this. It lives at the top of the Health tab so you see it immediately when something's wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Red borders&lt;/strong&gt; appear the moment any provider enters cooldown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Countdown timers&lt;/strong&gt; show exactly how long until each provider recovers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Switch To ⚡"&lt;/strong&gt; button hot-swaps your primary model to any ready provider &lt;em&gt;instantly&lt;/em&gt;—no gateway restart, no JSON editing, no waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Clear Cooldown"&lt;/strong&gt; resets a provider's error state immediately if you know the rate limit has lifted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Why it matters:&lt;/em&gt; When you're in the middle of a workflow and your primary model goes down, you need one click to fix it—not a terminal and a config file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvgavrw2kwm96dm75q21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvgavrw2kwm96dm75q21.png" alt=" " width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 Live System Health
&lt;/h3&gt;

&lt;p&gt;The Health tab shows a plain-English summary of what's happening on your machine right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU VRAM bars:&lt;/strong&gt; Utilization percentage and temperature for each card, refreshed every 3 seconds via &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM usage:&lt;/strong&gt; Total and available memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model offload detection:&lt;/strong&gt; Plainly shows whether your current model is running entirely on GPU, split across GPU and CPU, or running in CPU-only mode.&lt;/li&gt;
&lt;/ul&gt;
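&lt;p&gt;For reference, here's a minimal sketch of the polling approach: query &lt;code&gt;nvidia-smi&lt;/code&gt; in CSV mode every few seconds and parse the result into the numbers the bars need. The function and field names are illustrative, not the dashboard's actual code:&lt;/p&gt;

```python
import subprocess

QUERY = "--query-gpu=memory.used,memory.total,temperature.gpu"

def parse_smi_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one GPU per line."""
    gpus = []
    for line in text.strip().splitlines():
        used, total, temp = (float(x) for x in line.split(","))
        gpus.append({
            "used_mib": used,
            "total_mib": total,
            "util_pct": round(100 * used / total, 1),  # VRAM bar fill
            "temp_c": temp,
        })
    return gpus

def query_gpus():
    """Shell out to nvidia-smi (requires an NVIDIA driver on the host)."""
    out = subprocess.check_output(
        ["nvidia-smi", QUERY, "--format=csv,noheader,nounits"], text=True)
    return parse_smi_csv(out)

# Sample output for a dual-GPU box (values invented for illustration):
sample = "4521, 12288, 61\n3010, 12288, 55"
print(parse_smi_csv(sample))
```

The same parse feeds both the VRAM bars and the temperature readouts, so one shell call per refresh covers the whole panel.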

&lt;h3&gt;
  
  
  🦙 Local Model Compatibility &amp;amp; The "12GB Trap"
&lt;/h3&gt;

&lt;p&gt;The Local Models tab connects to your Ollama instance and lists every model you have installed, alongside an honest assessment of whether your hardware can run it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Fits on GPU:&lt;/strong&gt; Full VRAM available; runs fast.&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Partial GPU:&lt;/strong&gt; Model is larger than available VRAM; will split to system RAM (slower).&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;CPU only:&lt;/strong&gt; Too large for GPU; will be incredibly slow.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Won't fit in RAM:&lt;/strong&gt; Model exceeds your total system memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The Math:&lt;/em&gt; VRAM requirements are estimated as &lt;code&gt;Model Size × 1.2&lt;/code&gt; (accounting for a 20% overhead for the KV cache at short contexts). For a 24GB AI Lab (like my 2× RTX 3060 setup), this gives you a realistic picture of which 7B, 13B, and 34B models are actually usable day-to-day before you even try to load them.&lt;/p&gt;
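&lt;p&gt;The rule fits in a few lines of Python. This is a simplified sketch of the check (the thresholds and labels are mine, not the Manager's exact logic):&lt;/p&gt;

```python
def fit_category(model_gb, vram_gb, ram_gb):
    """Rough placement estimate using the Model Size x 1.2 rule."""
    need = model_gb * 1.2          # weights + ~20% KV-cache overhead
    if need <= vram_gb:
        return "fits on GPU"
    if need <= ram_gb:
        return "partial GPU (spills to system RAM)"
    return "won't fit in RAM"

# 24GB of VRAM (2x RTX 3060) and, say, 64GB of system RAM:
for size_gb in (4.1, 21.0, 70.0):
    print(f"{size_gb:5.1f} GB -> {fit_category(size_gb, 24, 64)}")
```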

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwtwa42gzq51nj4ljc07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwtwa42gzq51nj4ljc07.png" alt=" " width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🔗 Drag-and-Drop Fallback Chains
&lt;/h3&gt;

&lt;p&gt;Your fallback chain is the safety net that keeps conversations going when your primary model fails. OpenClaw tries each model in order until one works. &lt;/p&gt;

&lt;p&gt;The Fallbacks tab lets you manage it visually. Drag tiles to reorder them, see your balance of cloud vs. local models at a glance, and hit save. It writes directly to your &lt;code&gt;openclaw.json&lt;/code&gt; config.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why it matters:&lt;/em&gt; The right fallback order is the difference between a 500ms retry to OpenRouter and a 25-minute wait for Anthropic to recover.&lt;/p&gt;
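&lt;p&gt;Under the hood, a reorder-and-save is just a list permutation plus a config write. A hedged sketch of the idea (the &lt;code&gt;fallbacks&lt;/code&gt; key and model IDs are illustrative, not OpenClaw's real schema):&lt;/p&gt;

```python
import json

def reorder_fallbacks(cfg, new_order):
    """Write a drag-and-drop result back, rejecting anything but a permutation."""
    if sorted(new_order) != sorted(cfg["fallbacks"]):
        raise ValueError("reorder must be a permutation of the existing chain")
    cfg["fallbacks"] = list(new_order)
    return json.dumps(cfg, indent=2)   # caller persists this to the config file

config = {"fallbacks": ["anthropic/claude", "openrouter/llama-3-70b", "ollama/qwen2.5:7b"]}
print(reorder_fallbacks(config, ["openrouter/llama-3-70b", "ollama/qwen2.5:7b", "anthropic/claude"]))
```

The permutation check matters: a save should never silently drop a fallback the gateway still expects to find.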

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fe5zdihawuvh5z2mbq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fe5zdihawuvh5z2mbq7.png" alt=" " width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🌐 Remote Connection Manager
&lt;/h3&gt;

&lt;p&gt;If you run OpenClaw on more than one machine, the Connection Manager lets you add any instance—local or remote—and switch between them with a single dropdown. Switch connections over Tailscale, and every tab immediately updates to show data for that specific remote instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt2ds352uouall1jlnus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt2ds352uouall1jlnus.png" alt=" " width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture is built for resilience. The manager's server stays alive independently of the gateway. Local operations shell out to the &lt;code&gt;openclaw&lt;/code&gt; CLI, while remote operations hit the remote Model Manager's HTTP API. No custom protocol, no agents, no magic—just HTTP all the way down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd53la2ubbd2grp5pku7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd53la2ubbd2grp5pku7.png" alt=" " width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you're running multiple OpenClaw instances, dealing with provider rate limits, or trying to figure out which Ollama models will actually run on your GPUs, this dashboard is for you. &lt;/p&gt;

&lt;p&gt;It's free, it's open source, and it took one morning to build from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="o"&gt;[&lt;/span&gt;https://github.com/chriskesler35/openclaw-model-manager]&lt;span class="o"&gt;(&lt;/span&gt;https://github.com/chriskesler35/openclaw-model-manager&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;openclaw-model-manager
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm start
Open http://localhost:18800 &lt;span class="k"&gt;in &lt;/span&gt;your browser.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Built with Express, WebSockets, and stubbornness. Dark mode only.)&lt;/p&gt;

&lt;p&gt;Check out the full repository here: &lt;a href="https://github.com/chriskesler35/openclaw-model-manager"&gt;github.com/chriskesler35/openclaw-model-manager&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>tooling</category>
    </item>
    <item>
      <title>The 24GB AI Lab: A Survival Guide to Full-Stack Local AI on Consumer Hardware</title>
      <dc:creator>Chris Kesler</dc:creator>
      <pubDate>Sun, 08 Mar 2026 19:08:42 +0000</pubDate>
      <link>https://dev.to/chris_kesler_8a60b6e38dd8/the-24gb-ai-lab-a-survival-guide-to-full-stack-local-ai-on-consumer-hardware-931</link>
      <guid>https://dev.to/chris_kesler_8a60b6e38dd8/the-24gb-ai-lab-a-survival-guide-to-full-stack-local-ai-on-consumer-hardware-931</guid>
      <description>&lt;p&gt;We’ve all been there: You see a viral post about a new AI model, you try to run a fine-tune locally, and your terminal rewards you with a wall of red text and a &lt;code&gt;CUDA Out of Memory&lt;/code&gt; error. &lt;/p&gt;

&lt;p&gt;If you’re running a mid-range, multi-GPU setup—specifically a dual-GPU rig like the NVIDIA RTX 3060 (12GB each)—you aren't just a hobbyist; you’re an orchestrator. You have 24GB of total VRAM, but because it’s physically split across two cards, the default settings of almost every AI tool will crash your system.&lt;/p&gt;

&lt;p&gt;After months of trial and error in a Dockerized Windows environment, I’ve developed a "Zero-Crash Pipeline." This is the exact blueprint for taking a model from a raw fine-tune in Unsloth to an agentic reality using Ollama, OpenClaw, and ComfyUI.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Foundation: Docker &amp;amp; The "Windows Handshake"
&lt;/h3&gt;

&lt;p&gt;Running your ML environment in Docker (using the Unsloth image) keeps your Windows host clean, but Docker needs strict instructions on how to handle memory across two GPUs. &lt;/p&gt;

&lt;p&gt;Before you even load a model, you must inject these two settings into your Python script. These are the guardrails that prevent 3:00 AM crashes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Memory Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PYTORCH_CUDA_ALLOC_CONF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expandable_segments:True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, PyTorch hogs rigid blocks of VRAM. This setting treats your VRAM as a dynamic pool, allowing it to grow and shrink as needed. This simple change eliminates the common VRAM fragmentation error that frequently crashes a 12GB card halfway through a training run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ih7oklq2dwaquy30uq5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ih7oklq2dwaquy30uq5.png" alt="The VRAM Allocation Comparison" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Multi-GPU Bug Fix:&lt;/strong&gt; If you use two GPUs, the system tries to do math across both cards simultaneously. To prevent the training script from throwing a cryptic &lt;code&gt;AttributeError: 'int' object has no attribute 'mean'&lt;/code&gt;, you must explicitly tell &lt;code&gt;TrainingArguments&lt;/code&gt; to stop token averaging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;average_tokens_across_devices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Training Phase: "The Rule of 1024"
&lt;/h3&gt;

&lt;p&gt;You might want a model that can read a whole novel at once, but consumer hardware requires strict budget discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Limit:&lt;/strong&gt; Set &lt;code&gt;max_seq_length = 1024&lt;/code&gt;. This is the stability sweet spot for 24GB of combined VRAM. It provides significant "headroom" for the OS and Docker overhead during peaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Discipline:&lt;/strong&gt; Keep &lt;code&gt;per_device_train_batch_size = 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Secret Sauce:&lt;/strong&gt; Set &lt;code&gt;gradient_accumulation_steps = 8&lt;/code&gt;. Instead of trying to process 8 items at once (which instantly spikes VRAM), the model processes 1 item, 8 times, and then updates itself. It’s the same mathematical result with a fraction of the memory pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Missing Link:&lt;/strong&gt; Always import and use &lt;code&gt;DataCollatorForLanguageModeling&lt;/code&gt;. Many tutorials skip this, but without it, your dual-GPU setup will throw dimension mismatch errors when trying to batch text dynamically.&lt;/p&gt;
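&lt;p&gt;The arithmetic behind these settings, gathered in one place (plain Python for illustration; in practice the values go into your &lt;code&gt;TrainingArguments&lt;/code&gt;):&lt;/p&gt;

```python
# The trainer knobs from this section, as a plain dict.
train_cfg = {
    "max_seq_length": 1024,                  # the Rule of 1024
    "per_device_train_batch_size": 1,        # what actually sits in VRAM at once
    "gradient_accumulation_steps": 8,        # simulated larger batch
    "average_tokens_across_devices": False,  # the multi-GPU bug fix from section 1
}

n_gpus = 2  # dual RTX 3060
effective_batch = (train_cfg["per_device_train_batch_size"]
                   * n_gpus
                   * train_cfg["gradient_accumulation_steps"])
print(effective_batch)  # 16 sequences per optimizer step, at batch-size-1 VRAM cost
```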

&lt;h3&gt;
  
  
  3. The "Merge" (The Most Dangerous Step)
&lt;/h3&gt;

&lt;p&gt;You’ve finished training your LoRA. Now you need to bake those new learnings back into the main base model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Step Everyone Skips:&lt;/strong&gt; If you run a standard merge, PyTorch will try to load the base model and the LoRA into your VRAM simultaneously. On 12GB cards, this is an instant system freeze. You must force the computer to use your System RAM (CPU offloading) for the heavy lifting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The VRAM Insurance Policy
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained_merged&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;save_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;merged_4bit_forced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Best format for Ollama
&lt;/span&gt;    &lt;span class="n"&gt;maximum_memory_usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# Forces CPU offloading
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; This limits the merge process to 40% of your VRAM, pushing the rest of the workload to your system memory. It takes a few minutes longer, but it works 100% of the time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngm683va1da5jzj0nilx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngm683va1da5jzj0nilx.png" alt="The CPU Offloading Strategy" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The "Sanitization" Script (The Final Polish)
&lt;/h3&gt;

&lt;p&gt;You have your output file (often a &lt;code&gt;.gguf&lt;/code&gt; or &lt;code&gt;.safetensors&lt;/code&gt;), but when you try to load it into Ollama, it rejects it with an &lt;code&gt;unexpected EOF&lt;/code&gt; or &lt;code&gt;invalid format&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;Why? Because the PyTorch export process often leaves behind non-standard metadata (U8/U9 headers)—essentially digital junk mail that confuses the local inference engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt; A quick Python "Washing Script." Run this utility over your output directory to strip the headers before creating your Ollama Modelfile.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8smem96728ri6bd74eay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8smem96728ri6bd74eay.png" alt="The Metadata " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;safetensors.torch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_file&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sanitize_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.safetensors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tensors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Save it back with an explicitly empty metadata dictionary
&lt;/span&gt;            &lt;span class="n"&gt;save_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;save_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sanitized: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Point these to your Docker volume mounts
&lt;/span&gt;&lt;span class="nf"&gt;sanitize_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/workspace/work/model_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/workspace/work/sanitized_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah0uow8sm6j5ysvdt4wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah0uow8sm6j5ysvdt4wg.png" alt="The Agentic Loop" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Deployment: From Model to Agent
&lt;/h3&gt;

&lt;p&gt;With your "washed" model successfully running in Ollama, the loop is closed. Because the model is optimized for your hardware's strict 1024 context window, the latency is near-zero.&lt;/p&gt;

&lt;p&gt;You can now point OpenClaw's local API setting directly to your Ollama localhost. OpenClaw handles the logic and tool-calling, and when a visual task is required, it triggers your local ComfyUI instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghs7k1stqtl8azaf05je.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghs7k1stqtl8azaf05je.png" alt="The Unified Pipeline" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Appendix: The Dual-GPU Troubleshooting Matrix
&lt;/h3&gt;

&lt;p&gt;If you are running a multi-GPU Docker setup, you will likely encounter these three "Gatekeeper" errors. Use these verified configurations to bypass them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Message / Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;The "Hardware-Aware" Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;CUDA Out of Memory (OOM)&lt;/code&gt;&lt;/strong&gt; during long training runs.&lt;/td&gt;
&lt;td&gt;VRAM fragmentation within the Docker container.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"&lt;/code&gt; before initializing the model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;AttributeError: 'int' object has no attribute 'mean'&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-GPU synchronization conflict in Unsloth/HuggingFace.&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;average_tokens_across_devices=False&lt;/code&gt; in your &lt;code&gt;TrainingArguments&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;Ollama create: unexpected EOF&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;Tensor not found&lt;/code&gt;&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Unsanitized U8/U9 metadata headers in the Safetensors file.&lt;/td&gt;
&lt;td&gt;Run the "Header Stripper" Python script to load and re-save the weights with an empty metadata dictionary.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;System Freeze&lt;/strong&gt; during the &lt;code&gt;save_pretrained_merged&lt;/code&gt; step.&lt;/td&gt;
&lt;td&gt;Attempting to load the base model and LoRA into VRAM simultaneously.&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;maximum_memory_usage=0.4&lt;/code&gt; and &lt;code&gt;save_method="merged_4bit_forced"&lt;/code&gt; to force CPU offloading.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Building local AI on a multi-GPU rig isn't about having the fastest hardware; it's about being the best mechanic. By controlling your memory allocation, capping your context, and "washing" your metadata, you can turn consumer graphics cards into a highly capable, private, agentic laboratory.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
