莫小苝

I Added Self-Hosted GPU Training to MetaClaw — Here's How to Train Your AI Agent on Your Own A100s

Your AI agent should get better every time you talk to it. MetaClaw makes that happen: it's an open-source framework that meta-learns from your real conversations and automatically evolves your agent. Its technical report hit #1 on HuggingFace Daily Papers.

I've been using it for a few weeks and loved the concept, but the RL training was locked to cloud backends (Tinker/MinT). I wanted to train on my own GPUs — for privacy, cost, and flexibility. So I forked it and built what I needed.

What I Built

GitHub: OctoClaws/MetaClaw
Landing Page: octoclaws.github.io/MetaClaw

1. Self-Hosted GPU Training Backend

The biggest addition is a complete self-hosted alternative to cloud training:

  • FastAPI training server with PEFT/LoRA engine
  • vLLM inference with LoRA hot-swap (swap LoRA adapters without reloading the base model)
  • 3 loss functions: importance sampling, PPO, CISPO
  • Bearer token authentication + checkpoint save/load
  • OpenAI-compatible /v1/chat/completions endpoint on the training server
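Of the three loss functions, importance sampling is the simplest to reason about. Here's a minimal sketch in plain Python (function and variable names are mine, not MetaClaw's API) of the per-sequence version:

```python
import math

def importance_sampling_loss(new_logprobs, old_logprobs, advantages):
    """Policy-gradient loss with an importance-sampling correction.

    Each sample's loss is -ratio * advantage, where
    ratio = exp(logprob_new - logprob_old) re-weights rollouts collected
    under the old policy so the gradient is unbiased for the new one.
    """
    losses = []
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(lp_new - lp_old)
        losses.append(-ratio * adv)
    return sum(losses) / len(losses)

# On-policy case: ratio == 1, so the loss reduces to plain -advantage
print(importance_sampling_loss([-1.0], [-1.0], [2.0]))  # → -2.0
```

PPO and CISPO build on the same ratio, adding clipping so one stale sample can't dominate an update.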

Configuration is dead simple:

rl:
  backend: remote
  remote_url: http://your-gpu-server:8000
  remote_api_key: your-secret-key

Tested end-to-end on 8×A100-SXM4-80GB with Qwen3-8B.
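Because the training server speaks the OpenAI chat format, any OpenAI-compatible client can talk to it. A sketch using only the standard library, assuming the bearer-token auth from the config above (the model/adapter name here is hypothetical):

```python
import json
import urllib.request

def build_chat_request(server_url, api_key, model, messages):
    """Build an OpenAI-style chat completion request for the training server.

    Auth uses the same bearer token as the RL config (remote_api_key).
    """
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{server_url}/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "http://your-gpu-server:8000",
    "your-secret-key",
    "qwen3-8b-agent-lora",  # hypothetical adapter name
    [{"role": "user", "content": "hello"}],
)
# urllib.request.urlopen(req) would send it; omitted here (no live server)
```

With LoRA hot-swap, the `model` field selects which adapter to serve without reloading the base model.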

2. Per-Agent Multi-Agent Isolation

If you run multiple agents through one MetaClaw instance, skills used to bleed across agents. I built full isolation:

  • Per-agent skill directories — each agent stores/retrieves skills independently, with a _shared/ pool for common ones
  • Per-agent mode routing — each agent can independently use skills_only or rl mode
  • Per-agent LoRA training — each agent gets its own checkpoint, training one doesn't affect another

3. Training Engine Bug Fixes

Found and fixed several bugs in the original pipeline:

  • Optimizer tracking frozen base model params → wasted memory
  • Logprobs computed after temperature scaling → inconsistent distributions
  • Gradient checkpointing + KV cache incompatibility → silent failure
  • Thread safety issues (lambda closures, runtime CUDA_VISIBLE_DEVICES mutation)
  • Qwen3 <think> tag parsing → reasoning mixed into output
  • Multimodal content format → OpenClaw sends list[dict], not plain strings
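The `<think>` tag fix boils down to stripping the reasoning block before the agent sees the output. A minimal sketch of the idea (my own regex, not the patch verbatim):

```python
import re

# Qwen3 wraps chain-of-thought in <think>...</think>; DOTALL lets it span lines
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove Qwen3 <think>...</think> blocks from model output.

    Without this, the model's reasoning leaks into the text the agent returns.
    """
    return THINK_RE.sub("", text).strip()

print(strip_reasoning("<think>plan the answer</think>Hello!"))  # → Hello!
```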

Why Self-Hosted Training Matters

|             | Cloud                             | Self-Hosted           |
|-------------|-----------------------------------|-----------------------|
| Privacy     | Conversations sent to a 3rd party | Stays on your network |
| Cost        | Per-token fees                    | Free if you have GPUs |
| Flexibility | Fixed models/params               | Full control          |
| Speed       | Queue wait times                  | Train on demand       |

Quick Start

git clone https://github.com/OctoClaws/MetaClaw.git
cd MetaClaw
pip install -e ".[rl,evolve]"
metaclaw setup
metaclaw start

For self-hosted training, deploy the training server on your GPU machine and set rl.backend: remote in config.

Happy to answer questions about the architecture or setup!
