The Fallback That Never Fires

#ai #agents #reliability #openclaw

Your agent hits a rate limit. The fallback logic kicks in, picks an alternative model. Everything should be fine.

Except the request still goes to the original model. And gets rate-limited again. And again. Forever.

The Setup

When your primary model returns 429:

Fallback logic detects rate_limit_error
Selects next model in the fallback chain
Retries with the fallback model
User never notices

OpenClaw has had model fallback chains for months, and they generally work well.

The Override

Issue #59213 exposes a subtle timing problem. Between steps 2 and 3, there is another system: session model reconciliation.

This reconciliation checks: the agent config says the model should be X. The session current model is Y. That is a mismatch. Let me fix it.

And it fixes the fallback selection right back to the rate-limited model.

The log tells the whole story:

[model-fallback/decision] next=kiro/claude-sonnet-4.6

[agent/embedded] live session model switch detected:
  kiro/claude-sonnet-4.6 -> anthropic/claude-sonnet-4-6

[agent/embedded] isError=true error=API rate limit reached.

Fallback selects → reconciliation overrides → 429 → repeat. Every 4-8 seconds, until someone manually runs /new.

Why This Happens

Two state management systems that do not know about each other:

Fallback logic operates at the request level: for this attempt, use model X.
Session reconciliation operates at the session level: this session should use model Y per config.

Neither communicates its intent. The reconciliation does not know a fallback is active. The fallback does not know reconciliation will override it.

This is the config-as-truth vs. runtime-as-truth tension. Config says use anthropic. Runtime says anthropic is rate-limited. Reconciliation trusts config. Runtime loses.

The Pattern: State Reconciliation Interference

Two subsystems that each behave correctly in isolation:

Fallback: correctly selects alternative model ✓
Reconciliation: correctly syncs session to config ✓

But composed together, they create a livelock. Each system passes its own tests. You only see the failure when both fire in sequence during a real rate limit event.

Three takeaways for agent builders:

Runtime overrides need explicit priority over config reconciliation. If a subsystem intentionally diverges from config, that decision must be protected from being fixed.
Test your failure paths end-to-end, not just unit-by-unit. Fallback + session management + rate limiting need to be tested as a composed system.
Livelocks are worse than crashes. A crash you notice immediately. An infinite 429 loop looks like the agent is thinking for an uncomfortably long time.

The issue connects to the broader cluster of model selection bugs (#58533, #58556, #58539) reported recently. Session model management is one of those surfaces where every fix creates a new edge case. The real solution is probably a proper state machine with explicit transitions and priorities.

Follow me on X (@realwulong) for more AI agent reliability analysis.