Adrien Cossa

Posted on Jun 10 • Edited on Jun 18

Hosting OpenClaw: a money trap and two silent failures

#ai #selfhosted #openclaw #openrouter

I run OpenClaw on a Hetzner CAX ARM VPS. It talks to me over Signal and does a morning press review. Three gotchas on that box are worth writing down: one money trap in the model-routing layer, and two silent failures that each left the briefing dead for days. In case someone is staring at the same thing.

OpenRouter can spend your credits on a provider you didn't pick

I use OpenRouter as the single door to a pile of models. Its BYOK (bring-your-own-key) feature has a trap. You add your own OpenAI key for a model, flip on "Always use for this provider," and read that as "never spend OpenRouter credits." It doesn't mean that.

The toggle only guarantees that OpenRouter uses your key for that provider. It does not stop OpenRouter from routing to a different provider that serves the same model when your key fails or is unavailable, and that fallback spends your OpenRouter credits, at pay-per-token rates, on a provider you never picked. Working as designed. Just not the design in your head when you flip the toggle.

The setting that does what I wanted is the provider.only routing param. It tells OpenRouter to fail the request rather than fall back:

"provider": { "only": ["your-provider-slug"] }

The part that surprised me even without BYOK: a single model spans a wide price range by provider. I checked openai/gpt-oss-120b (what I actually run) via GET /api/v1/models/openai/gpt-oss-120b/endpoints — it runs from about $0.039 to $0.95 per million tokens. Default routing optimises its own mix of price and availability and can land you on the expensive end silently.
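A minimal sketch of that check in stdlib Python. The response shape is my assumption: a data.endpoints list whose entries carry pricing.prompt as a per-token dollar amount in a string. I'm also assuming the endpoint answers without an API key. The names fetch_endpoints and price_range are made up for this example.

```python
import json
import urllib.request

URL = "https://openrouter.ai/api/v1/models/openai/gpt-oss-120b/endpoints"


def fetch_endpoints(url=URL):
    """Fetch the per-provider endpoint list (assumed shape: data.endpoints)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["data"]["endpoints"]


def price_range(endpoints):
    """Return (lowest, highest) prompt price in dollars per million tokens."""
    # Assumption: pricing.prompt is dollars per single token, stored as a string.
    prices = [float(e["pricing"]["prompt"]) * 1_000_000 for e in endpoints]
    return min(prices), max(prices)


def main():
    low, high = price_range(fetch_endpoints())
    print(f"${low:.3f} to ${high:.3f} per million prompt tokens")
```

Calling main() prints the spread for the model in the URL. If the keys differ in your payload, only price_range needs adjusting.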

So I pinned it. In OpenClaw the routing params live under models.providers.openrouter.params.provider:

"provider": { "only": ["deepinfra", "dekallm", "novita"], "sort": "price" }

Now OpenRouter uses only those three, cheapest first, and a failure drops to my free fallback instead of escalating to a pricey provider. One catch that cost me a few minutes: only wants provider slugs, not display names — the slug is the part before the / in each endpoint's tag field.
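To make the slug rule and the config nesting concrete, here is a small sketch that turns endpoint tag values into the routing block. The helper names and the sample tags are hypothetical. The nesting follows the models.providers.openrouter.params.provider path described above.

```python
import json


def slug_from_tag(tag):
    """The provider slug is the part before the first '/' in an endpoint tag."""
    return tag.split("/", 1)[0]


def routing_block(tags, sort="price"):
    """Build the nested config fragment that pins routing to the given providers."""
    slugs = []
    for tag in tags:
        slug = slug_from_tag(tag)
        if slug not in slugs:  # keep first-seen order, drop duplicates
            slugs.append(slug)
    provider = {"only": slugs, "sort": sort}
    return {"models": {"providers": {"openrouter": {"params": {"provider": provider}}}}}


if __name__ == "__main__":
    # Hypothetical tag values; real ones come from the endpoints listing.
    print(json.dumps(routing_block(["deepinfra/fp4", "novita/bf16"]), indent=2))
```

Deduplicating matters because one provider can appear under several tags for the same model.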

Two guardrails worth setting regardless:

OpenRouter API keys have no spend limit by default — mine showed "limit": null. Set one at openrouter.ai/settings/keys. A cap is the one protection that doesn't depend on getting the routing right.
openrouter.ai/activity shows which provider actually served each request — where to look when you suspect a provider you didn't pick.
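The first guardrail can be checked from a script as well. This is a sketch under assumptions: I'm taking the key-info endpoint to be GET /api/v1/key with a Bearer token, returning the "limit" field under a top-level "data" object. Verify both against the OpenRouter docs before relying on it. The function names are illustrative.

```python
import json
import os
import urllib.request

KEY_URL = "https://openrouter.ai/api/v1/key"  # assumed key-info endpoint


def fetch_key_info(api_key, url=KEY_URL):
    """Return the key's metadata (assumed to sit under a top-level 'data' object)."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["data"]


def has_spend_cap(key_info):
    """A missing or null 'limit' means the key has no cap."""
    return key_info.get("limit") is not None


def main():
    info = fetch_key_info(os.environ["OPENROUTER_API_KEY"])
    print("spend cap set" if has_spend_cap(info) else "NO spend cap on this key")
```

Run main() with OPENROUTER_API_KEY exported. A cap of 0 counts as a cap here, since only null means unlimited.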

There's a catch to "sort": "price" that pulls the opposite way: the cheapest providers are the least reliable. My daily digest started failing after a day with a bare "LLM request failed" and nothing else. The gateway was healthy, so it wasn't the auto-update below. The route was the problem: a long digest turn kept landing on whichever cheap provider was flaky that morning. The fix was to pin the cron's agent to a few reliable, still-cheap paid providers and sort by throughput:

"provider": { "only": ["groq", "together", "baseten"], "sort": "throughput" }

For a digest of about 30k tokens a day, that's around $0.20 a month — and the next run came through. Pin by price for anything you can afford to retry, but for an automation that must deliver, the cheapest route is a false economy. Free tiers rate-limit and cheap providers get flaky exactly when you lean on them. Two more things the same episode taught me: routing reliability belongs in the routing config, not the agent's model list — and an over-eager edit to that model block can hard-fail the gateway on restart, so back up the file first.
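The arithmetic behind a figure like that is short enough to sketch. This assumes a flat per-million-token price and a 30-day month, and it ignores any split between prompt and completion pricing. monthly_cost is a name invented for the example.

```python
def monthly_cost(tokens_per_day, dollars_per_million, days=30):
    """Estimated monthly spend in dollars at a flat per-million-token price."""
    return tokens_per_day * days / 1_000_000 * dollars_per_million


if __name__ == "__main__":
    # The two ends of the gpt-oss-120b price range quoted earlier.
    for price in (0.039, 0.95):
        print(f"${price}/M tokens: ${monthly_cost(30_000, price):.3f} per month")
```

At 30k tokens a day, the cheapest and priciest ends of that range work out to roughly 4 and 85 cents a month, so the reliable middle costs very little in absolute terms.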

signal-cli crashes on ARM64, and the fix is a container

The briefing went quiet. The gateway was up and healthy, but signal-cli kept dying. The crash pointed at a SIGSEGV deep inside libsignal's native JNI code during message encryption. Text generation worked fine, but every send killed the daemon, and the watchdog kept restarting it into the same wall.

The root cause is an ARM64 packaging gap. signal-cli's bundled libsignal-client jar ships a native library for Linux x86 and macOS ARM, but not Linux ARM64. On an ARM box it falls back to a mismatched libsignal_jni.so that corrupts JNI handles and segfaults on the encrypt path. I burned time on JVM flags first — interpreter-only (-Xint), a different garbage collector, disabling compressed oops — and every one still crashed. It's not a JIT or GC problem. The native library is simply wrong for the platform.

What worked: stop fixing the native build by hand and run signal-cli from a container that ships correct ARM64 binaries — bbernhard/signal-cli-rest-api. Mount the existing account data so there's no re-pairing and no safety-number change:

docker run -d --name signal-daemon --restart unless-stopped --no-healthcheck \
  -p 127.0.0.1:8080:8080 \
  -e XDG_DATA_HOME=/data -e HOME=/tmp \
  -v "$HOME/.local/share:/data" \
  --user 1000:1000 --entrypoint signal-cli \
  bbernhard/signal-cli-rest-api:latest \
  daemon --http 0.0.0.0:8080 --no-receive-stdout

Then set channels.signal.autoStart to false and channels.signal.httpUrl to http://127.0.0.1:8080 so OpenClaw connects to the container instead of spawning the broken local binary. Sends stopped crashing, and a direct send test came back SUCCESS. Updates are now a docker pull.
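A direct send test against the container can be sketched in stdlib Python. The assumptions here: signal-cli's HTTP daemon accepts JSON-RPC 2.0 at /api/v1/rpc, and the send method takes a recipient list plus a message. If the daemon serves more than one account, an account parameter is probably needed too. Check the signal-cli JSON-RPC docs for the exact parameters. The phone number is a placeholder.

```python
import json
import urllib.request

RPC_URL = "http://127.0.0.1:8080/api/v1/rpc"  # assumed JSON-RPC path


def build_send_request(recipient, message, request_id=1):
    """Build a JSON-RPC 2.0 'send' call for a single recipient."""
    return {
        "jsonrpc": "2.0",
        "method": "send",
        "params": {"recipient": [recipient], "message": message},
        "id": request_id,
    }


def send(recipient, message, url=RPC_URL):
    """POST the request and return the decoded JSON-RPC response."""
    body = json.dumps(build_send_request(recipient, message)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def main():
    # Placeholder number: replace with a real recipient before running.
    print(send("+10000000000", "signal-cli container send test"))
```

A response carrying a result rather than an error object means the encrypt-and-send path survived, which is exactly what the native crash used to break.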

One rough edge still open: in the container, replies to incoming messages resolve their recipient fine, but the proactive scheduled send (the cron "announce") doesn't resolve the explicit recipient and errors before sending — even though the generated text is right there in the run record. For now I get a briefing by messaging the bot. Tracking down why the scheduled push differs is still on my list.

An auto-update can kill your scheduled jobs without a single error

The briefing went quiet again — but nothing crashed. The gateway was healthy, systemctl status said active (running), restart count low. The daily cron just stopped firing: no error, no run-log row. Two days passed before I noticed.

The cause: OpenClaw auto-updated on disk and migrated its cron store as part of the bump — the per-feature JSON files got consolidated into a single ~/.openclaw/state/openclaw.sqlite, the old ones renamed *.migrated. But it never restarted the running process. The old in-memory scheduler still pointed at the now-renamed jobs.json, which no longer existed. New code on disk, old code in memory, store moved out from under both.

The class is worth naming: an auto-updating daemon that migrates state without restarting fails silently. "It worked yesterday" means nothing across an auto-update. Newer OpenClaw ships a restart-after-update path, but the fallback may not pick up a store migration, and I didn't want to bet the briefing on that.

The fix watches the outcome, not the process — because the process was green the whole time. A small stdlib Python script on a daily systemd timer, run an hour after the brief is due, asks one question: did today's expected output actually arrive? On a miss it messages me over the same Signal channel, naming the problem and the fix (sudo systemctl restart openclaw.service). A healthy day sends nothing. Checking "is the daemon up" stays green for two days. Checking "did the thing I wanted happen" catches it the same morning.
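The core of such an outcome check fits in a few lines. This is a sketch of the idea and not my actual script. Where the delivery timestamps come from (a run log, the SQLite store, a sent-message record) is left as an assumption you fill in, and all the function names are illustrative.

```python
from datetime import date, datetime


def brief_arrived(run_times, today):
    """True if any recorded delivery happened on the given calendar day."""
    return any(ts.date() == today for ts in run_times)


def alert_text(today):
    """Name the problem and the fix in one message."""
    return (
        f"No briefing delivered on {today.isoformat()}. "
        "Likely fix: sudo systemctl restart openclaw.service"
    )


def check(run_times, today=None):
    """Return an alert string on a miss, or None on a healthy day."""
    today = today or date.today()
    return None if brief_arrived(run_times, today) else alert_text(today)


if __name__ == "__main__":
    # Hypothetical data: load real delivery timestamps from your own run record.
    runs = [datetime(2025, 6, 9, 7, 0)]
    message = check(runs)
    if message:
        print(message)  # in practice, send this over Signal instead of printing
```

Returning None on a healthy day keeps the timer silent unless something was actually missed.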

One footgun while debugging: don't run the openclaw CLI on the host. openclaw cron list, openclaw doctor, any of them — the CLI detects the running gateway's PID and SIGTERMs it before starting its own, knocking the service over for 20-30 seconds. Edit the cron store directly, hit signal-cli's JSON-RPC for sends, and admin from anywhere but the box.

The pattern across all three: the thing that tells you it's fine — the toggle, systemctl status, the watchdog — is measuring something next to what you actually care about. The cure each time was to watch the outcome and keep a way back in. If you're running OpenClaw on your own box and hit a fourth one of these, or found a cleaner fix for that proactive-send gap, I'd like to hear it.

Top comments (1)

Alex Shev • Jun 11

The silent failure part is the real lesson. Self-hosted agent infrastructure needs proof points after every expensive or external step: did the command run, did the artifact appear, did the budget stay sane, did the output reach the user?

Without those checks, the terminal looks busy while the workflow quietly drifts.