Mykola Kondratiuk

Posted on Jul 3

I Wired an AI Fallback Runbook After a 19-Day Outage - Here's All 3 Parts

#discuss #ai #leadership #productivity

When your primary model goes dark for 19 days, what does your workflow actually do in that first hour? Fail silently? Stall until someone notices? Or route somewhere else and keep moving?

If you can't answer that, you don't have a resilience posture. You have luck, and luck just got tested in public.

Fable 5 came back globally on July 1 after a 19-day export-control shutdown pulled it offline. The post-mortems are landing this week, and most fall into two buckets: "it's back" and "here's what the outage cost." MarketScale had the sharper read. Teams that treat any single model as permanent infrastructure keep getting surprised, while teams with a routing map, banked reserves, and a named backup treated the whole thing as an arbitrage window instead of a fire.

I went and looked at my own setup the morning it came back, and I found gaps. This is the runbook I wired after, written on a calm day so I'm not writing it during the next scramble. None of it is PM scope-creep. It's the same reliability thinking you'd apply to a database sitting on a single replica.

Part 1: A routing policy you can actually read

"We use Claude" is not a routing policy. It's a default nobody voted for.

A routing policy is a named map: this class of task goes to this model, that class goes to a cheaper or faster one. Bulk classification doesn't need your most expensive reasoning pass. The gnarly multi-step agent run does. Most teams carry this map around in one engineer's head, which means it doesn't survive that person taking a week off, and it definitely doesn't survive the model itself going offline.

Get it out of the head and into a file:

# routing.yaml - the map, not the vibe
tasks:
  bulk_classify:
    primary: haiku-tier
    fallback: gemini-flash
  long_agent_run:
    primary: opus-tier
    fallback: sonnet-5
  code_review:
    primary: sonnet-5
    fallback: gpt-tier

The value isn't the YAML. It's that "what runs where" is now a decision on paper instead of a habit in someone's memory. When the primary for a task class disappears, the router already knows where to send the work. Nobody improvises at 2am.
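To make "the router already knows" concrete, here is a minimal sketch of the lookup. It assumes the map above has been loaded into a plain dict (the `ROUTES` literal stands in for reading routing.yaml) and that something else, like a health check, supplies the set of models currently down. `pick_model` and `unavailable` are names I made up for illustration, not part of any library.

```python
# Hypothetical in-memory copy of routing.yaml; a real setup would load the file.
ROUTES = {
    "bulk_classify": {"primary": "haiku-tier", "fallback": "gemini-flash"},
    "long_agent_run": {"primary": "opus-tier", "fallback": "sonnet-5"},
    "code_review": {"primary": "sonnet-5", "fallback": "gpt-tier"},
}


def pick_model(task, unavailable=frozenset(), routes=ROUTES):
    """Return the first model for `task` that is not marked unavailable."""
    if task not in routes:
        # An unmapped task class is a policy gap, so fail loudly.
        raise KeyError(f"no routing policy for task class {task!r}")
    route = routes[task]
    for role in ("primary", "fallback"):
        model = route[role]
        if model not in unavailable:
            return model
    raise RuntimeError(f"primary and fallback are both down for {task!r}")


if __name__ == "__main__":
    print(pick_model("long_agent_run"))
    print(pick_model("long_agent_run", unavailable={"opus-tier"}))
```

The useful property is the two loud failures: a task class with no entry, or one whose primary and fallback are both down, raises instead of quietly picking something.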

Part 2: Plan-banking, the canned goods in your pantry

There's a handful of workflows you genuinely cannot afford to have stall. For those, keep a reserve.

Plan-banking means pre-generating the plans, scaffolds, and outputs for your critical workflows while the model is up and cheap, then drawing from that bank when it isn't. Think of it as canning vegetables in summer so a February storm doesn't mean an empty plate. On a calm day it costs you a scheduled job and some storage. During a 19-day blackout it's the gap between "we drew from the reserve" and "we're blocked until further notice."

I keep it dead simple:

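# nightly: bank fresh plans for the workflows that can't stall
# (generate_plan stands in for whatever command produces a plan for a workflow)
mkdir -p bank
for wf in onboarding_flow release_notes triage_playbook; do
  generate_plan "$wf" > "bank/${wf}.$(date +%F).json"
done
# keep 14 days, prune the rest
find bank -type f -name '*.json' -mtime +14 -delete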

The trap I fell into first was banking everything. You don't need reserves for the workflow that runs twice a year. Bank the two or three that would actually hurt, and let the rest fail loudly.
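The other half of banking is the draw. Here is a rough sketch of reading from the reserve, assuming the `bank/<workflow>.YYYY-MM-DD.json` naming from the nightly job and the same 14-day window. `latest_banked_plan` is a hypothetical helper, and returning `None` for a missing or too-old plan is my choice so the caller has to decide what happens next.

```python
from datetime import date, timedelta
from pathlib import Path


def latest_banked_plan(bank_dir, workflow, today=None, max_age_days=14):
    """Return the newest banked plan path for `workflow`, or None if none is fresh enough."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    dated = []
    for path in Path(bank_dir).glob(f"{workflow}.*.json"):
        # File names look like "<workflow>.YYYY-MM-DD.json", as written by the nightly job.
        stamp = path.name[len(workflow) + 1 : -len(".json")]
        try:
            banked_on = date.fromisoformat(stamp)
        except ValueError:
            continue  # skip files that do not follow the naming scheme
        if banked_on >= cutoff:
            dated.append((banked_on, path))
    if not dated:
        return None
    # Newest date wins.
    return max(dated, key=lambda item: item[0])[1]
```

Tightening `max_age_days` per workflow is one way to keep a plan that has outlived its inputs from being served as if it were current.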

Part 3: A second source you've actually tested

For every critical use case, name the specific fallback model. Then prove it works for that task before you need it.

This is where the Fable shutdown got concrete. Teams with a named, tested second source rerouted in hours. Teams with "we'll figure it out" lost real weeks. And "no viable alternate exists" stopped being an honest excuse this year, because the menu got deep. Sonnet 5 shipped June 30 at $2/$10 per million tokens as the most agentic Sonnet yet, so the backup is now cheaper and more capable than the thing it's backing up. Gemini 3.5 Pro is clearing for July. The alternates are there.

The part people skip is the testing. A fallback you've never run against your real prompts is a guess wearing a helmet. Wire a weekly canary:

# canary.yaml - does the backup still pass our bar?
second_source_check:
  schedule: weekly
  for_each: critical_use_case
  run: fallback_model against golden_prompts
  alert_if: pass_rate < 0.95

If the second source silently drifts below your bar, you want to know on a quiet Thursday, not in the middle of the outage you built it for.
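As a sketch of what that canary computes, assuming golden prompts are stored as `(prompt, expected)` pairs per use case: `call_model` and `passes` are placeholders you would supply (a client call to the fallback model and your own grading rule), not real library functions. The 0.95 default mirrors the `alert_if` line in canary.yaml.

```python
def pass_rate(cases, call_model, passes):
    """Fraction of golden cases whose model output satisfies `passes`."""
    if not cases:
        raise ValueError("no golden prompts to check")
    passed = sum(
        1 for prompt, expected in cases if passes(call_model(prompt), expected)
    )
    return passed / len(cases)


def second_source_check(use_cases, call_model, passes, threshold=0.95):
    """Return {use_case: pass_rate} for every use case below `threshold`."""
    alerts = {}
    for name, cases in use_cases.items():
        rate = pass_rate(cases, call_model, passes)
        if rate < threshold:
            alerts[name] = rate
    return alerts
```

An empty result means every critical use case cleared the bar this week. Anything in the dict is what you page on. The empty-suite check is deliberate: a canary with no golden prompts should fail rather than report a clean pass.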

The part that's easy to miss

There's a geopolitical wrinkle worth a short note. The same government that pulled Fable's export license is now being offered a 5% stake in a competitor. Which model gets restricted next is not something you can predict, but it is something you can hedge, and a routing policy spanning more than one vendor is the hedge.

Here's what shifted my thinking most. None of these three moves is expensive. They're all cheap on a calm day and completely impossible in a panic, which is exactly why they get skipped. There's never a fire forcing you to do them, right up until there is. The 19-day shutdown didn't create a new risk. It just mailed everyone the bill for a decision they made by not deciding.

So, honest question for the comments: of the three parts, which one does your team actually have written down right now? Not "we could stand it up." Written down, today. I'll go first. My routing map was solid, and my second-source canary didn't exist until last week.

Top comments (2)

Mykola Kondratiuk • Jul 3

plan-banking quietly breaks for input-heavy workflows - my banked release-notes plans went stale in a day because the inputs churned, so the reserve was worse than nothing. it only really works for stable-shape work and i glossed over that.
