Mykola Kondratiuk

Posted on Jun 15 • Edited on Jun 17

Fable 5 Went Dark Friday Night. I Ran My Critical Workflow on a Backup Saturday - Here's What Broke

#discuss #ai #productivity #devops


On Friday afternoon a government order hit Anthropic, and by Saturday morning Fable 5 and Mythos 5 were disabled for every customer worldwide. Not deprecated. Gone. Two days later OpenAI shut Sora down because it was losing fifteen million dollars a day.

Disclosure: This article was written with AI assistance. I use AI tools as part of my workflow for building and writing about AI-native PM practices.

I don't have a strong take on the politics. What I had was a smaller, more selfish question at 8am Saturday: if I'd staffed a real workflow on either of those, what would I actually do right now?

So I tested it. Here's what happened.

"We'd just switch" is a hope, not a plan

I'd been telling myself I had redundancy for months. If my main model fell over, I'd move to a second vendor. Easy.

The problem with that sentence is that I had never once run it. A fallback you've never executed isn't a fallback. It's a guess with good posture.

So Saturday I took my single most critical AI-dependent workflow - a spec-to-task-breakdown pipeline I lean on every day - and ran it end to end on a different vendor's model. One time. Just to find out whether the guess held.

It didn't.

Break #1: the prompt was overfit to one model

The first thing that broke was the prompt itself. My prompt had drifted into a shape that worked beautifully on the model I built it against. Tight, terse, lots of implicit structure the model had learned to fill in.

The backup model read the same prompt and produced mush. Not wrong exactly, just vague and unstructured, the kind of output you'd toss.

The fix was real work, not a config flag:

- summarize the spec and break it into tasks
+ You are breaking a spec into engineering tasks.
+ Output JSON only, matching this shape:
+ { "tasks": [{ "title": "", "estimate_pts": 0, "depends_on": [] }] }
+ Rules:
+ - every task must be independently shippable
+ - no task larger than 3 points; split if larger
+ - depends_on references task titles, not indexes

Model A filled in all that structure on its own. Model B needed it spelled out. That's twenty minutes of restructuring I'd much rather spend on a calm Saturday than during an actual outage.
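Once the prompt spells out a JSON shape and rules, a backup model's output can be checked mechanically instead of by eye. Here is a minimal Python sketch of such a check, assuming the model's reply arrives as a raw string. The function name `validate_breakdown` and the constant `MAX_POINTS` are made up for illustration. "Independently shippable" is a judgment call that code like this can't verify.

```python
import json

MAX_POINTS = 3  # from the prompt rule: no task larger than 3 points


def validate_breakdown(raw: str) -> list[str]:
    """Return the problems found in a task breakdown (empty list means OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    tasks = data.get("tasks") if isinstance(data, dict) else None
    if not isinstance(tasks, list) or not tasks:
        return ['missing or empty "tasks" list']

    # Collect the titles first, since depends_on must reference them.
    titles = {
        t["title"]
        for t in tasks
        if isinstance(t, dict) and isinstance(t.get("title"), str)
    }

    problems = []
    for i, task in enumerate(tasks):
        if not isinstance(task, dict):
            problems.append(f"task {i}: not an object")
            continue
        title = task.get("title")
        if not isinstance(title, str) or not title.strip():
            problems.append(f"task {i}: missing title")
        pts = task.get("estimate_pts")
        if isinstance(pts, bool) or not isinstance(pts, (int, float)):
            problems.append(f"task {i}: estimate_pts is not a number")
        elif pts > MAX_POINTS:
            problems.append(f"task {i}: {pts} points, split it")
        deps = task.get("depends_on", [])
        if not isinstance(deps, list):
            problems.append(f"task {i}: depends_on is not a list")
            continue
        for dep in deps:
            if not isinstance(dep, str) or dep not in titles:
                problems.append(f"task {i}: unknown dependency {dep!r}")
    return problems
```

Running this against both models' output during the rehearsal turns "produced mush" into a concrete list of what the backup got wrong.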

Break #2: a silent tool-call dependency

The second break scared me more because it was invisible. One step in the pipeline depended on a tool call - a function the model invokes to pull live data. The backup model's tool-calling format was different enough that the call silently no-op'd.

The output still looked plausible. It just used stale data and didn't tell me. That's the worst failure mode there is: confidently wrong, no error, no flag. I only caught it because I was looking for trouble. On a normal day that bad output flows downstream and someone makes a decision on it.
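One guard against this kind of quiet no-op is to refuse any output from a step that was supposed to call a tool and didn't. The Python sketch below assumes a hypothetical normalized response dict with a `tool_calls` list. Real vendor SDKs each report tool calls in their own shape, so a per-vendor adapter would have to produce this dict. The tool name `fetch_live_data` is also made up.

```python
class SilentToolSkip(RuntimeError):
    """Raised when a step finished without a tool call it depends on."""


def require_tool_calls(response: dict, required: set[str]) -> dict:
    """Pass the response through only if every required tool was invoked.

    `response` is a hypothetical normalized shape:
    {"text": "...", "tool_calls": [{"name": "..."}]}
    """
    called = {call["name"] for call in response.get("tool_calls", [])}
    missing = required - called
    if missing:
        # Fail loudly instead of letting stale output flow downstream.
        raise SilentToolSkip(
            f"step completed without calling: {sorted(missing)}"
        )
    return response


if __name__ == "__main__":
    plausible_but_stale = {"text": "Looks fine to me.", "tool_calls": []}
    try:
        require_tool_calls(plausible_but_stale, {"fetch_live_data"})
    except SilentToolSkip as err:
        print(f"blocked: {err}")
```

The check only proves the call happened. It says nothing about whether the returned data was fresh, so it narrows this failure mode without closing it.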

Availability belongs on the risk register

Here's the reframe I walked away with. We already handle the API being down. You get a 503, you back off, you retry, it comes back. That's an outage with an SLA and a status page that eventually goes green.

This is the model being gone. No SLA. No restore ETA. No green status page, because it isn't coming back. A policy order or a vendor's burn-rate review can end it overnight, and you find out the same way everyone else does.

For a service you don't control and can't restore, that's a single point of failure on your critical path. We'd never ship that for a database. Most of us are shipping it for the model doing half the thinking.
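The difference between "down" and "gone" can be made explicit in code: retry with backoff when the failure is transient, and stop retrying when the model no longer exists. The following is a rough Python sketch, not a real client. `TransientError` and `ModelGoneError` are hypothetical exception types that a vendor adapter would raise, and how a given vendor actually signals a removed model is left as an assumption.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical: the vendor is temporarily unavailable (e.g. a 503)."""


class ModelGoneError(Exception):
    """Hypothetical: the model no longer exists; retrying cannot help."""


def call_with_fallback(
    primary: Callable[[str], str],
    backup: Callable[[str], str],
    prompt: str,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Return (route, output), where route is "primary" or "backup"."""
    for attempt in range(retries):
        try:
            return "primary", primary(prompt)
        except TransientError:
            # An outage: wait and retry with exponential backoff.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
        except ModelGoneError:
            # Not an outage: no amount of waiting brings it back.
            break
    return "backup", backup(prompt)
```

Returning the route alongside the output keeps the switch visible, so downstream steps and logs can tell which model produced a result.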

The one-pager that deletes your worst hour

The cheapest move turned out to be the most useful. The first hour after a model goes dark gets burned figuring out what just broke - which workflows touched that model, what versions, where the outputs live.

IBM found 88% of enterprises don't keep a complete inventory of the AI and agents they run. You can't reroute around a dead model if you don't know what depended on it. So I wrote one file:

workflows:
  - name: spec-to-tasks
    model: primary-vendor/model-a
    criticality: must-survive
    fallback: tested 2026-06-13, prompt needs restructure
  - name: standup-digest
    model: primary-vendor/model-a
    criticality: can-wait
    fallback: none, recovery order documented
  - name: video-assets
    model: openai/sora
    criticality: can-wait
    export_path: download MP4s + project json before EOL

That last line is the Sora lesson. When a vendor kills a product, not just a model, you also have to ask where your outputs go and how you get them out. One extra column.
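An inventory like this is most useful if the "what just broke" question can be answered with one lookup. Here is a minimal Python sketch, with the entries mirrored as a plain list of dicts so it runs without a YAML parser. Loading the actual file is left out, and the function name `affected_by` is hypothetical.

```python
# Mirrors the inventory file above; in practice this would be loaded from it.
INVENTORY = [
    {"name": "spec-to-tasks", "model": "primary-vendor/model-a",
     "criticality": "must-survive"},
    {"name": "standup-digest", "model": "primary-vendor/model-a",
     "criticality": "can-wait"},
    {"name": "video-assets", "model": "openai/sora",
     "criticality": "can-wait"},
]


def affected_by(inventory: list[dict], dead_prefix: str) -> list[dict]:
    """List workflows whose model starts with dead_prefix, must-survive first."""
    hits = [w for w in inventory if w["model"].startswith(dead_prefix)]
    # sorted() is stable, so file order is kept within each criticality group.
    return sorted(hits, key=lambda w: w["criticality"] != "must-survive")


if __name__ == "__main__":
    for workflow in affected_by(INVENTORY, "primary-vendor/"):
        print(f'{workflow["criticality"]:<13} {workflow["name"]}')
```

Matching on a prefix covers both cases from that weekend: a whole vendor going dark (`primary-vendor/`) or a single product being shut down (`openai/sora`).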

The point isn't fear

I want to be clear, because the lazy version of this post is "AI is unreliable, panic." It isn't, and that's not useful. Depending on these models is the right call. The teams that win aren't the ones who avoided the dependency. They're the ones who can keep the work moving the morning it disappears.

That competence costs an afternoon to build and almost nobody has built it yet:

1. Run your most critical workflow on a second model once. The rehearsal is the whole instrument.
2. Sort workflows into must-survive-today vs can-wait. Only the short list earns a tested fallback.
3. Keep a one-page workflow-to-model list so the first lost hour becomes a glance.

I ran my test on a quiet Saturday and it cost me twenty minutes and a little ego. The alternative was running it for the first time on the morning it counted.

What would break first in your stack if your main model wasn't there tomorrow - and have you ever actually checked?

Top comments (15)

Daniel Nwaneri • Jun 15

The silent no-op is the part that should change how people design fallbacks. an outage fails loud. this fails quiet, and quiet failures are the ones that ship.

worth pushing further: the prompt restructure (break #1) is annoying but cheap to fix once. the tool-call format mismatch (break #2) isn't a prompt problem, it's a contract problem. your pipeline assumed a specific model's tool-calling shape as if it were a stable interface. that assumption is the actual single point of failure, not the model itself.
did the backup model give any signal at all that the tool call failed, or was the output indistinguishable from a successful run with fresh data?

Mykola Kondratiuk • Jun 15

that's the design debt nobody puts on the backlog — 'can this fail silently?' the tool-call mismatch broke exactly like that: status green, work not done, no error to grep for. now every fallback in my stack has a forced trace log or it doesn't ship.

xulingfeng • Jun 15

The silent tool-call dependency is the one that'd keep me up. Stale data masquerading as fresh output — that's worse than an outage because nobody knows to panic. The one-pager is smart, but the rehearsal is what actually saves you.

Mykola Kondratiuk • Jun 15

the 'nobody knows to panic' framing is the right one — the incident doc is about how to react, but the rehearsal is what reveals whether you even know something went wrong. ran ours on a non-critical flow first and found two silent failures in the first 10 minutes.

Aliaksei Zelianouski • Jun 16

You did the right thing, and the good news is it's not as hard as this run made it feel. I run a game with 20-something models across 8 providers, all interchangeable - making the agent model-agnostic in code was surprisingly easy. The one real gotcha: Anthropic and Gemini encrypt their reasoning tokens, so swapping mid-conversation drops the reasoning already sitting in your history. Minor in practice. The thing to watch is a model-specific dependency you didn't know you leaned on - a much bigger output size, a tool-calling quirk - which is what your single real run flushed out. Build it agnostic, keep a test that swaps the model mid-flow, and you're covered.

Mykola Kondratiuk • Jun 16

the encrypted reasoning token gotcha is one I hadn't thought through - dropping reasoning mid-conversation is the kind of thing you only discover on a live incident. how deep into a conversation were you when you first hit it? curious whether it degrades output visibly or just makes the model repeat context it already processed.

Aliaksei Zelianouski • Jun 16

I keep reasoning tokens together with each message and include them in the history every time I send a request to a model. I don't think that losing the reasoning part is a big deal - the message history is still there. It's hard to tell how much historical reasoning tokens actually help. I've never seen any visible effect from that. Although, it's not like I have to switch a model mid-game too often - it's a rare case. I don't see my users doing it at all.

Mykola Kondratiuk • Jun 17

makes sense if you control when you switch - our incident was the model going dark mid-workflow, no choice. at that point i have no idea what the receiving model does with reasoning history it cannot actually read.

festusisaac • Jun 15

Love the practical pragmatism here. The workflow-to-model inventory markdown snippet is brilliant in its simplicity.
It reminds me of standard dependency mapping in security compliance, but for runtime cognitive logic. When a model goes dark, you don't want to be grepping through codebases trying to find where a specific model string is hardcoded or used.

Mykola Kondratiuk • Jun 15

the security compliance analogy is apt - both are about knowing your dependencies before they become incidents. the model string grep problem is real, especially in monorepos where model names leak into logging config, test fixtures, and prompt templates in ways you don't notice until something breaks at 11pm.

Mallory Haigh • Jun 16

The silent tool-call failure points at something significantly bigger than prompt hygiene or vendor redundancy. When agents fail in their quiet and confident fashion, it's not the models causing the problem - it's the lack of a platform underneath.

Things like identity, observability, and evaluation gates aren't things you bolt onto a workflow after an outage - by then, it's too late. Instead, these need to be part of the substrate the workflow runs on. Every one of the breaks mentioned is a symptom of building directly against a model instead of through a standardized platform.

The methodology that fixes this has been sitting in traditional software for quite a long time, and is successful - internal developer platforms built inside a solid platform engineering practice. Now, this foundational layer of policy, governance, standardization, and automation is being expanded to serve more than just human users. Agents, as autonomous actors in the system, are now part of Agentic Development Platforms, using that foundational IDP layer as a jump-off point for agentic infrastructure.

Mykola Kondratiuk • Jun 16

mostly right about the platform gap, but I'd push back on the "not the models" framing - the model's confident output is what makes the observability problem unsolvable at the platform layer. if it said "I think this succeeded" you'd catch it. it says "done" with full certainty. no signal to intercept.

Sloan the DEV Moderator • Jun 15

Hey, this article appears to have been generated with the assistance of ChatGPT or possibly some other AI tool.

We allow our community members to use AI assistance when writing articles as long as they abide by our guidelines. Please review the guidelines and edit your post to add a disclaimer.

Failure to follow these guidelines could result in DEV admin lowering the score of your post, making it less visible to the rest of the community. Or, if upon review we find this post to be particularly harmful, we may decide to unpublish it completely.

We hope you understand and take care to follow our guidelines going forward!
