莫小苝

I Added Self-Hosted GPU Training to MetaClaw — Here's How to Train Your AI Agent on Your Own A100s

Your AI agent should get better every time you talk to it. MetaClaw makes that happen: it's an open-source framework that meta-learns from your real conversations and automatically evolves your agent. Its technical report hit #1 on HuggingFace Daily Papers.

I've been using it for a few weeks and loved the concept, but the RL training was locked to cloud backends (Tinker/MinT). I wanted to train on my own GPUs — for privacy, cost, and flexibility. So I forked it and built what I needed.

What I Built

GitHub: OctoClaws/MetaClaw
Landing Page: octoclaws.github.io/MetaClaw

1. Self-Hosted GPU Training Backend

The biggest addition is a complete self-hosted alternative to cloud training:

  • FastAPI training server with PEFT/LoRA engine
  • vLLM inference with LoRA hot-swap (swap LoRA adapters without reloading the base model)
  • 3 loss functions: importance sampling, PPO, CISPO
  • Bearer token authentication + checkpoint save/load
  • OpenAI-compatible /v1/chat/completions endpoint on the training server
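Of the three loss functions, importance sampling is the simplest to reason about. Here's a minimal sketch in plain Python (function and variable names are mine, not MetaClaw's API) of the per-sequence version:

```python
import math

def importance_sampling_loss(new_logprobs, old_logprobs, advantages):
    """Policy-gradient loss with an importance-sampling correction.

    Each sample's loss is -ratio * advantage, where
    ratio = exp(logprob_new - logprob_old) re-weights rollouts collected
    under the old policy so the gradient is unbiased for the new one.
    """
    losses = []
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(lp_new - lp_old)
        losses.append(-ratio * adv)
    return sum(losses) / len(losses)

# On-policy case: ratio == 1, so the loss reduces to plain -advantage
print(importance_sampling_loss([-1.0], [-1.0], [2.0]))  # → -2.0
```

PPO and CISPO build on the same ratio, adding clipping so one stale sample can't dominate an update.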

Configuration is dead simple:

rl:
  backend: remote
  remote_url: http://your-gpu-server:8000
  remote_api_key: your-secret-key

Tested end-to-end on 8×A100-SXM4-80GB with Qwen3-8B.
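Because the training server speaks the OpenAI chat format, any OpenAI-compatible client can talk to it. A sketch using only the standard library, assuming the bearer-token auth from the config above (the model/adapter name here is hypothetical):

```python
import json
import urllib.request

def build_chat_request(server_url, api_key, model, messages):
    """Build an OpenAI-style chat completion request for the training server.

    Auth uses the same bearer token as the RL config (remote_api_key).
    """
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{server_url}/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "http://your-gpu-server:8000",
    "your-secret-key",
    "qwen3-8b-agent-lora",  # hypothetical adapter name
    [{"role": "user", "content": "hello"}],
)
# urllib.request.urlopen(req) would send it; omitted here (no live server)
```

With LoRA hot-swap, the `model` field selects which adapter to serve without reloading the base model.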

2. Per-Agent Multi-Agent Isolation

If you run multiple agents through one MetaClaw instance, skills used to bleed across agents. I built full isolation:

  • Per-agent skill directories — each agent stores/retrieves skills independently, with a _shared/ pool for common ones
  • Per-agent mode routing — each agent can independently use skills_only or rl mode
  • Per-agent LoRA training — each agent gets its own checkpoint, training one doesn't affect another

3. Training Engine Bug Fixes

Found and fixed several bugs in the original pipeline:

  • Optimizer tracking frozen base model params → wasted memory
  • Logprobs computed after temperature scaling → inconsistent distributions
  • Gradient checkpointing + KV cache incompatibility → silent failure
  • Thread safety issues (lambda closures, runtime CUDA_VISIBLE_DEVICES mutation)
  • Qwen3 <think> tag parsing → reasoning mixed into output
  • Multimodal content format → OpenClaw sends list[dict], not plain strings
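The `<think>` tag fix boils down to stripping the reasoning block before the agent sees the output. A minimal sketch of the idea (my own regex, not the patch verbatim):

```python
import re

# Qwen3 wraps chain-of-thought in <think>...</think>; DOTALL lets it span lines
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove Qwen3 <think>...</think> blocks from model output.

    Without this, the model's reasoning leaks into the text the agent returns.
    """
    return THINK_RE.sub("", text).strip()

print(strip_reasoning("<think>plan the answer</think>Hello!"))  # → Hello!
```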

Why Self-Hosted Training Matters

|             | Cloud                             | Self-Hosted           |
|-------------|-----------------------------------|-----------------------|
| Privacy     | Conversations sent to a 3rd party | Stays on your network |
| Cost        | Per-token fees                    | Free if you have GPUs |
| Flexibility | Fixed models/params               | Full control          |
| Speed       | Queue wait times                  | Train on demand       |

Quick Start

git clone https://github.com/OctoClaws/MetaClaw.git
cd MetaClaw
pip install -e ".[rl,evolve]"
metaclaw setup
metaclaw start

For self-hosted training, deploy the training server on your GPU machine and set rl.backend: remote in config.

Happy to answer questions about the architecture or setup!
