Your agent hits a rate limit. The fallback logic kicks in, picks an alternative model. Everything should be fine.
Except the request still goes to the original model. And gets rate-limited again. And again. Forever.
The Setup
When your primary model returns 429:
- Fallback logic detects rate_limit_error
- Selects next model in the fallback chain
- Retries with the fallback model
- User never notices
OpenClaw has had model fallback chains for months, and they generally work well.
The Override
Issue #59213 exposes a subtle timing problem. Between steps 2 and 3, there is another system: session model reconciliation.
This reconciliation checks: the agent config says the model should be X. The session current model is Y. That is a mismatch. Let me fix it.
And it fixes the fallback selection right back to the rate-limited model.
The log tells the whole story:
[model-fallback/decision] next=kiro/claude-sonnet-4.6
[agent/embedded] live session model switch detected:
kiro/claude-sonnet-4.6 -> anthropic/claude-sonnet-4-6
[agent/embedded] isError=true error=API rate limit reached.
Fallback selects → reconciliation overrides → 429 → repeat. Every 4-8 seconds, until someone manually runs /new.
Why This Happens
Two state management systems that do not know about each other:
- Fallback logic operates at the request level: for this attempt, use model X.
- Session reconciliation operates at the session level: this session should use model Y per config.
Neither communicates its intent. The reconciliation does not know a fallback is active. The fallback does not know reconciliation will override it.
This is the config-as-truth vs. runtime-as-truth tension. Config says use anthropic. Runtime says anthropic is rate-limited. Reconciliation trusts config. Runtime loses.
The Pattern: State Reconciliation Interference
Two subsystems that each behave correctly in isolation:
- Fallback: correctly selects alternative model ✓
- Reconciliation: correctly syncs session to config ✓
But composed together, they create a livelock. Each system passes its own tests. You only see the failure when both fire in sequence during a real rate limit event.
Three takeaways for agent builders:
Runtime overrides need explicit priority over config reconciliation. If a subsystem intentionally diverges from config, that decision must be protected from being fixed.
Test your failure paths end-to-end, not just unit-by-unit. Fallback + session management + rate limiting need to be tested as a composed system.
Livelocks are worse than crashes. A crash you notice immediately. An infinite 429 loop looks like the agent is thinking for an uncomfortably long time.
The issue connects to the broader cluster of model selection bugs (#58533, #58556, #58539) reported recently. Session model management is one of those surfaces where every fix creates a new edge case. The real solution is probably a proper state machine with explicit transitions and priorities.
Follow me on X (@realwulong) for more AI agent reliability analysis.
Top comments (0)