DEV Community: Gen.Y.Sakai

Anthropic Was Right About One Thing: Broad Safety Decisions Are Dangerous

Gen.Y.Sakai — Sat, 13 Jun 2026 13:54:53 +0000

A design-failure analysis from a professional user who wanted the product to win.

1. I wanted Fable 5 to be real

I should be honest about my bias before I say anything critical, because the criticism only makes sense once you understand the bias.

I wanted Fable 5 to be real.

By "real" I do not mean "released," because it was released. I mean something closer to real in the way a good colleague is real — a system you can hand a half-formed idea to and get back something sharper than what you put in. For a while I had stopped expecting that from language models. Each new release was a little faster, a little more fluent, a little better on the benchmarks, and a little more eager to agree with me. The improvements were real but they were incremental, and incremental improvements to a tool you already use heavily do not change how you think. They just change how quickly you arrive at the place you were already going.

Fable 5 was different, and the difference was not raw intelligence.

It was the quality of the collaboration. When I gave it an argument, it did not merely restate the argument back to me in cleaner prose. It found the load-bearing assumption and pressed on it. When I was wrong, it told me I was wrong in a way that was specific enough to be useful — not "here are some considerations," but "this step does not follow, and here is the case that breaks it." Its reasoning was dense. Its self-criticism was strong; it would generate a position and then, unprompted, attack the weakest part of its own position before I had to. Its vocabulary was wider, which mattered more than it sounds, because precise vocabulary is how you avoid smuggling vagueness past yourself. It used evidence better. And it pushed back at the right moments, in the right amount — it was less sycophantic than anything I had used, which is the single property I value most in a thinking partner and the single property that is hardest to get from a system trained to be agreeable.

I run a medical IT company. My work is not casual. I build AI products that are meant to operate near clinical decision-making: an AI-assisted electronic medical record system, a clinical differential-diagnosis assistant, and the evaluation workflows that let me demonstrate to physicians whether the reasoning behind those systems is actually sound. That last part — the evaluation workflows — is where most of the real engineering risk lives, because a medical AI product is only as trustworthy as the evidence you can show a skeptical doctor about how it reasons. For that kind of work I do not want a flattering assistant. I want an adversary who is on my side. Fable 5 was the closest thing to that I had encountered.

So when I tell you that, two days running, I had to send Anthropic negative feedback about Fable 5, I want to be clear about the spirit of it. This is not the complaint of someone who found the model disappointing. It is the complaint of someone who found the model excellent and then watched the path to that excellence become unreliable and opaque in front of him.

That distinction is the whole essay. The problem I am going to describe is not the model's capability. The problem is the safety-routing path around the model. And the reason it is worth ten thousand words rather than a tweet is that the same structural mistake shows up at two completely different scales over the same few days — once in a way that affected me personally, and once in a way that, as publicly reported, affected the entire product line and drew a sharp public objection from Anthropic itself.

Anthropic was right about that larger mistake. That is exactly why the smaller one deserves the same scrutiny.

2. Two failures, at two different scales

I am deliberately not opening with the news, because the news is not the point and I do not want to borrow its drama. But I have to place the two failures in time, because the timing is what turned a personal product gripe into something I felt was worth writing down carefully.

Over a short span in mid-June 2026, two things happened in my world at once.

The small thing — small in scope, large in how directly it hit my work — was that Fable 5's safety routing twice interfered with legitimate professional sessions of mine. The first interference was a false positive in a meta-level discussion about AI honesty, model evaluation, and classifier design. The second, and more serious, was repeated, reproducible downgrading from Fable 5 to Opus 4.8 in the middle of legitimate medical AI development work.

The large thing — large in scope, and not about me at all — was that, as widely reported and as Anthropic itself stated, the U.S. government issued an export-control directive citing national-security authorities, instructing Anthropic to suspend access to Fable 5 and Mythos 5 for any foreign national. To comply, Anthropic disabled the models for all customers. According to Anthropic's public statement, the company's understanding was that the government had become aware of a method of jailbreaking Fable 5, and Anthropic publicly disagreed that the discovery of a narrow potential jailbreak should be cause for recalling a widely deployed commercial model.

I want to be careful here, because the exact government rationale is not fully visible to users, and I am not going to pretend to know more than the public record supports. The point is not to adjudicate the national-security question. The point is the shape of the two events when you put them side by side.

A narrow signal became a broad restriction. Twice. At two scales. Over the same few days.

At the government scale, as publicly reported, a narrow and not-fully-explained risk signal became a broad suspension of access. At the product scale, on my own screen, a narrow lexical and topical signal became a broad downgrade of the model I had selected. In both cases the decision-maker — a government agency in one instance, a safety classifier in the other — failed to distinguish who was acting, what they intended, in what context, and at what phase of use.

That is the mirror this essay is built around. But mirrors are only convincing if you can see both faces clearly, so let me start with the face I can document directly: my own.

Before I do, here is the map, because this is a long piece and I do not want you to lose the thread. The essay makes three claims. First, Fable 5 itself was not the problem; the model was excellent, and that is exactly what makes the rest worth writing about. Second, the safety-routing layer failed at the wrong granularity — it reacted to topic and vocabulary where it needed to reason about actor, intent, context, and phase of use. Third, Anthropic's public objection to a broad government restriction is, on its face, valid — but the same argument turns inward and applies, with uncomfortable precision, to its own product-level safety routing. Everything below is in service of those three.

And one promise up front, so the framing is not lost across ten thousand words: this is not an argument against safety. It is an argument that safety decisions become dangerous when the unit of classification is smaller than the unit of action — when something is detected at the resolution of a word or a topic but acted on at the resolution of a whole model, a whole session, or a whole domain.

3. What I actually reported to Anthropic

I did not write the two reports below as rhetorical set-pieces for an article. I wrote them in the moment, through the product's own feedback flow, after pressing the negative-feedback button on a session that had just failed me. They are the actual language I used. I am reproducing the substantive parts here because they are the primary evidence in this case, and because evidence should be shown before it is interpreted. I will quote the load-bearing passages directly, and then spend more words analyzing them than the quotes themselves take up — because the point is not what I felt but what the behavior reveals about the system.

3.1 Report one: the false positive

The first report concerned a conversation that, on its face, should have been one of the safest possible conversations to have with a frontier model — a conversation about safety mechanisms. Here is the core of what I submitted:

No cybersecurity or biology request appears anywhere in this conversation. The safety classifier nonetheless fired and switched the session from Fable 5 to Opus 4.8.

The actual subject is a critical discussion about AI honesty, model evaluation, and classifier design. The turns immediately before the trigger were analyzing how anti-distillation classifiers should be built and why silent downgrading is a poor design choice — there was no intent or preparation to misuse anything.

Likely cause: the classifier fired on surface lexical salience (density of terms like distillation, weaponization, grader) rather than on intent. At prompt/token granularity, a critical discussion is indistinguishable from an actual request. Reading the full context, a human would immediately recognize this as benign.

Request: classification of this kind should operate at conversation/actor granularity. Prompt-level lexical detection makes false positives structurally unavoidable.

Read that back slowly, because the structure of the failure is more interesting than the inconvenience.

The conversation was a critical discussion of classifier design. The thing that interrupted it was a classifier. The trigger, as far as I can reconstruct it, was lexical density — the conversation was about mechanisms, so the vocabulary of those mechanisms (distillation, grader, weaponization, routing) appeared at high frequency. And to a system that scores risk on the surface of the text, a high-frequency cluster of risk-adjacent vocabulary looks the same whether you are trying to build the mechanism, break it, evade it, or critique it.

That is the recursive part, and I will return to it in its own section, because it is too important to bury inside a feedback summary. For now, hold on to the central technical claim, stated in the words I used at the time: "at prompt/token granularity, a critical discussion is indistinguishable from an actual request." That sentence is the entire problem compressed into a single line. The remedy I asked for is its mirror image: "classification of this kind should operate at conversation/actor granularity."

This first incident did not damage my work. It was, in a sense, a clean specimen — a false positive in a meta-discussion, with no real-world stakes, where the only casualty was the model I had chosen to think with. I could have let it go. I almost did. What kept me from letting it go was what happened the next day.

3.2 Report two: when the model got in the way of the work

The second report is longer, because the second failure was worse, and because by then I understood what I was looking at. I am quoting it at length, because it is the heart of the evidentiary record and because compressing it would lose exactly the details that make it diagnostic rather than anecdotal.

It's genuinely disappointing to be sending feedback like this two days in a row — but today's issue is a different and far more serious one than yesterday's.

Yesterday's concern was about Anthropic's product strategy. I could let that go; it didn't affect me directly. Today is not that. Today the model itself got in the way of my actual work.

I run a medical IT company. I was wall-bouncing with Fable 5 to evolve my own products — an AI-based electronic medical record system and a clinical differential-diagnosis assistant. The conversation was excellent: high-density and genuinely useful. Then I shared a concrete clinical example — a Japanese national medical licensing exam question (118F69, pyelonephritis: diagnostic reasoning, test selection, treatment) that I use as evaluation material to show physicians. From that point on, the model kept being downgraded to Opus 4.8. I re-selected the upper model manually; it was forced back down again on the next turn. Reproducible.

And then the observation that, more than any other single thing, made me decide to write this essay:

Most striking observation: I watched it happen mid-turn. Fable 5 began generating — 2–3 lines of visible reasoning — then the model indicator switched to Opus 4.8, a dialog notice appeared at the top of the chat, the Fable reasoning was discarded, and the turn was re-run from scratch on 4.8. So a response that had already begun generating was torn down and replaced. At least the switch is NOT silent — there is a dialog — but a reply that has started should not be aborted and rerouted, and I still cannot tell what triggered it.

The report then laid out why it mattered, and I want to preserve three of those lines verbatim, because they each name a distinct structural failure:

The downgrade correlated with MEDICAL CLINICAL CONTENT. If a guard newly added in Fable reacts to medical material and reroutes the response path, it cannot distinguish a developer's legitimate evaluation material from actual patient-facing medical advice.

This effectively tells me: "Do not use Fable to build real medical AI products." Medicine IS my domain.

I now catch myself self-censoring my inputs — pre-emptively avoiding topics in case they trip the guard. This is the first time in my life I have had to use an LLM while worrying about it.

And the line that I think states the core of the entire problem more honestly than any abstract framing I could write:

If your classifier fires on someone like me, it is not catching bad actors — it is shooting the people trying to help.

I closed that report with three concrete requests, the first of which was that safety classifiers should not downgrade or abort the model mid-conversation based on content; the second, that developer and evaluation context should be distinguished from patient-facing medical advice; and the third, in my own words:

Make per-turn model identity visible and auditable. If the model that answered can change, the user has the right to know which model answered.

Those are the two reports. Everything that follows is interpretation — and as promised, the interpretation is going to be longer and harder than the evidence, because the evidence only tells you what happened, and what I actually care about is why it was structurally guaranteed to happen and what would have to change for it to stop.

One clarification I owe the reader before I start interpreting, because it is the kind of distinction a careful Anthropic engineer would raise first. Strictly speaking, I cannot prove from the user interface alone which internal component made the routing decision. I can only report the observable behavior: I selected Fable 5, clinical evaluation material appeared, the visible model indicator changed to Opus 4.8, partial output disappeared, and the turn was regenerated under a different model. Whether that decision lived in the base model, in a separate classifier, or in a model-serving orchestration layer above both is not visible to me, and I will not pretend otherwise.

What lets me be confident that I am not imagining the general class of behavior is that Anthropic itself documents the core pattern. In its launch announcement for Fable 5, Anthropic states that the model ships with safeguards under which queries on certain topics are answered instead by Opus 4.8; that these safeguards were tuned conservatively and will — in Anthropic's own words — sometimes catch harmless requests; that they trigger in under five percent of sessions on average; and that the company is working to reduce these false positives. I cannot prove that my specific incidents were instances of that exact safeguard rather than something adjacent — that causal identity is not visible to me either. But I do not need to. The class of behavior is documented, the topic-based fallback to Opus 4.8 is documented, and the existence of false positives is documented. The downgrade-to-Opus-4.8 I observed is, at minimum, an instance of a designed behavior that Anthropic acknowledges is triggered by topic and that admits a false-positive rate up front. Throughout this essay, when I write "the classifier," "the guard," or "the routing layer," I mean it descriptively — as shorthand for whatever in the serving path produces this observable behavior — not as a claim about Anthropic's internal architecture. The argument does not depend on which layer it turns out to be.

4. The first failure: the classifier discussion that triggered the classifier

It is tempting to file the first incident under "amusing irony" and move on. A conversation about classifier design tripped a classifier; ha. But the irony is not the point. The recursion is the point, and the recursion exposes a structural impossibility that lexical safety filtering cannot escape no matter how good the underlying model is.

Here is the structural claim, stated plainly.

There are at least four distinct activities that share almost the same vocabulary:

1. Explaining a risk mechanism. "Here is how anti-distillation classifiers work, and here is why model distillation is a concern."
2. Trying to exploit a risk mechanism. "Help me distill this model's behavior so I can clone it."
3. Designing a better version of the mechanism. "How should a grader be built so that distillation attempts are caught without crippling legitimate use?"
4. Evading the mechanism. "How do I phrase requests so the grader does not flag them?"

The words distillation, grader, weaponization, routing, jailbreak appear in all four. A system that scores risk on the density and presence of those words sees one signal across four utterly different intentions. Activity 1 and activity 3 — explanation and improvement — are not just benign, they are the activities you most want experts to perform out loud, because that is how safety mechanisms actually get better. Activity 2 and activity 4 are the ones you want to catch. But at the level of tokens, they are the same text.

This is not a tuning problem. You cannot turn a knob and fix it, because the ambiguity is not in the threshold; it is in the representation. If your unit of analysis is the prompt and its surface features, then the discriminating information — who is asking, why, where in a project, toward what end — is simply not present in the thing you are looking at. You are trying to recover intent from a signal that does not contain intent. No threshold on a signal that lacks the discriminating variable can separate the classes, because the classes are not separable in that feature space.
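To make the feature-space point concrete, here is a minimal Python sketch. It is a toy and not a description of Anthropic's safeguard, whose mechanism is not visible from outside. The watchlist, the density score, and the "flag at or above a threshold" rule are all assumptions made for illustration. The four strings are the example utterances from the list above, labeled by intent.

```python
import re

# Hypothetical watchlist built from the vocabulary named above; "distill" is
# added so the verb form counts too. This is not any vendor's real term list.
WATCHLIST = {"distillation", "distill", "grader", "weaponization", "routing", "jailbreak"}

# The four example utterances, labeled by the intent behind them.
UTTERANCES = [
    ("explain", "benign",
     "Here is how anti-distillation classifiers work, and here is why model distillation is a concern."),
    ("exploit", "malicious",
     "Help me distill this model's behavior so I can clone it."),
    ("design", "benign",
     "How should a grader be built so that distillation attempts are caught without crippling legitimate use?"),
    ("evade", "malicious",
     "How do I phrase requests so the grader does not flag them?"),
]


def lexical_score(text: str) -> float:
    """Share of word tokens that appear on the watchlist (a surface feature only)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in WATCHLIST for token in tokens) / len(tokens)


def flagged(text: str, threshold: float) -> bool:
    """Toy rule (an assumption): flag when watchlist density reaches the threshold."""
    return lexical_score(text) >= threshold


def separates(threshold: float) -> bool:
    """True only if the threshold flags every malicious example and no benign one."""
    return all(
        flagged(text, threshold) == (label == "malicious")
        for _, label, text in UTTERANCES
    )


if __name__ == "__main__":
    for name, label, text in UTTERANCES:
        print(f"{name:8s} {label:10s} {lexical_score(text):.3f}")
    sweep = [i / 100 for i in range(101)]
    print("any separating threshold:", any(separates(t) for t in sweep))
```

On these four sentences the two benign examples score higher than the two malicious ones, so every threshold that catches an attack also catches the explanation and the design question. Four sentences prove nothing statistically. The sketch only shows what it looks like when the label you care about is not a function of the feature you are scoring.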

That is what I meant, in the feedback, by "at prompt/token granularity, a critical discussion is indistinguishable from an actual request." It is a statement about feature spaces, not about the model being dumb. The model that generates Fable 5's reasoning appears far better suited to telling the difference between a person designing a grader and a person evading one than a surface-level lexical trigger is. The failure is that the routing decision was apparently made by something that does not read the conversation that way. It reacted to the surface before the model's own judgment could be brought to bear.

So the first failure is not "the filter was too sensitive." It is "the filter was looking at the wrong object." A meta-level discussion of safety and an attempt to defeat safety are distinguishable — but only at a granularity the lexical filter does not operate at. And the specific cruelty of it is that the domains where expert discussion is most valuable are exactly the domains where benign expert vocabulary and malicious vocabulary overlap most. The better you are at discussing the risk seriously, the more you sound, to a lexical filter, like the risk itself.

A safety system that cannot tolerate critical discussion of safety mechanisms — because the vocabulary of critique collides with the vocabulary of attack — has a hole in it shaped exactly like the people best positioned to help close the hole.

5. The second failure: legitimate medical AI development was treated as risk

The first incident was a clean specimen with no stakes. The second incident had stakes, and the stakes are my actual business.

Let me be precise about what I was and was not doing, because the entire question of whether the safety routing made a defensible decision turns on this distinction, and it is a distinction the system apparently could not make.

I was not asking the model for patient-facing medical advice. I was not asking it to make a clinical decision for a real, identifiable patient. There was no patient. There was a medical licensing exam question — item 118F69 from the Japanese national medical licensing examination, a pyelonephritis case covering diagnostic reasoning, test selection, and treatment — which I use as evaluation material. Its purpose in my workflow is to demonstrate to physicians how a differential-diagnosis assistant reasons through a case whose correct answer is already known and externally validated. The exam question is, in the most literal sense, a benchmark. It exists to be answered correctly by people we then certify as doctors. Using it to probe a diagnostic assistant is using a known-answer test case to evaluate a system. That is not clinical practice. That is quality assurance.

From the point at which I shared that case, the model kept being downgraded to Opus 4.8. I would manually re-select the higher model; on the next turn it was forced back down. Reproducible. The trigger correlated with the medical clinical content.

Now hold that next to the nature of my work, because this is where the failure stops being annoying and becomes self-defeating.

You cannot evaluate a differential-diagnosis assistant without clinical cases. The case is the test. You cannot evaluate an electronic medical record AI without realistic medical text, because realistic medical text is the input distribution the product will face. You cannot improve the quality — and therefore the safety — of a medical AI system while avoiding medical content, any more than you can improve an aircraft's stall behavior while refusing to discuss stalls. The clinical content is not incidental to the work. It is the substance of the work. It is the thing the work is about.

So when clinical content itself triggers a downgrade, the system is not making medicine safer. It is rejecting the precondition for making medicine safer. As I put it in the report, and as I will stand behind: this effectively tells me, "Do not use Fable to build real medical AI products." And medicine is my domain. The model that should be best at exactly this work became, for exactly this work, the least reliable, because the more squarely I aimed it at my actual job, the more reliably the guard pulled it away from me.

Here is the line I want this section to leave you with, because it generalizes past medicine to every high-stakes domain: if Fable cannot distinguish medical product evaluation from patient-facing medical advice, then the system is not safer; it is merely blunter. Bluntness is not safety. A scalpel that refuses to cut anything is not a safe scalpel — it is a useless one, and its uselessness will simply push the work toward instruments with no safety properties at all. A guard that cannot tell the difference between building a careful system and giving reckless advice does not reduce the amount of reckless advice in the world. It reduces the amount of careful system-building, and the careful system-building was the part that was going to make the advice less reckless.
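The difference between "blunt" and "fine-grained" can be sketched in a few lines of Python. Everything here is hypothetical: the `Context` values, the two route labels, and especially the idea that context is available as a clean input. How a real system would establish context, and how it would resist a user simply declaring a convenient one, is the hard part and is not addressed. The sketch only contrasts a rule keyed on topic alone with a rule keyed on topic and context together.

```python
from enum import Enum


class Context(Enum):
    """Hypothetical use contexts; a real system would need more than three."""
    PRODUCT_EVALUATION = "product_evaluation"  # e.g. known-answer exam item used as a benchmark
    PATIENT_FACING = "patient_facing"          # advice for a real, identifiable patient
    UNKNOWN = "unknown"


def route_by_topic(has_clinical_content: bool) -> str:
    """Topic-only rule: any clinical content moves the turn to the fallback model."""
    return "fallback" if has_clinical_content else "selected"


def route_by_topic_and_context(has_clinical_content: bool, context: Context) -> str:
    """Topic-and-context rule: clinical content alone is not enough to reroute."""
    if not has_clinical_content:
        return "selected"
    if context is Context.PRODUCT_EVALUATION:
        return "selected"
    # Patient-facing or unknown context keeps the conservative behavior.
    return "fallback"


if __name__ == "__main__":
    for ctx in Context:
        print(ctx.value, route_by_topic(True), route_by_topic_and_context(True, ctx))
```

Under the topic-only rule, the evaluation workflow and the patient-facing request get the same outcome. Under the second rule they diverge, and the conservative default still applies whenever context is unknown.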

6. The mid-turn teardown problem

Of everything I observed, one detail bothers me more than all the rest combined, and it is the detail most likely to be dismissed as a cosmetic UX wrinkle. It is not cosmetic. It is the part that breaks the product as an instrument.

Here is what I watched, and I watched it more than once: Fable 5 began generating a response. Two or three lines of visible intermediate output appeared on the screen — real tokens, the model already committed to a direction. Then the model indicator switched to Opus 4.8. A dialog notice appeared at the top of the chat. The Fable 5 output that had already been produced was discarded. And the turn was re-run from scratch on Opus 4.8.

I watched it happen mid-turn. A response that had already begun generating was torn down and replaced.

I want to give Anthropic credit where it is due: the switch was not silent. There was a dialog. That matters, and it is better than the alternative of a quiet substitution that the user never learns about. But "not silent" is a low bar, and clearing it does not redeem the behavior, because the deeper problem is not the silence. The deeper problem is that a reply which had already started was aborted and rerouted to a different model, and I still could not tell, after the fact, what had triggered it or which turns in my history had run on which system.

Let me explain why this is categorically different from "I did not get the model I selected," which would be a minor disappointment.

This is an execution-path change during generation. The model I selected was not stable across the lifetime of a single turn. The generation path could change after generation had already begun. That has four consequences, and each one is worse than it first appears.

First, the selected model is not a guarantee; it is a suggestion the system can override at any instant, including instants after it has started honoring it. For casual use this is tolerable. For professional use it means the configuration you think you are running is not the configuration you are necessarily running.

Second, the user cannot later audit which model produced which turn. The conversation history does not preserve per-turn model identity. So if I scroll back through a long working session, I cannot reconstruct, turn by turn, what system generated each answer. The provenance is gone the moment the turn completes.

Third, the behavior is not reproducible in the way evaluation requires. If I want to evaluate Fable 5's clinical reasoning on item 118F69, I need to know that the answer I am evaluating came from Fable 5. If the route can flip mid-turn, the artifact I am holding may be a hybrid — a turn that began under one model and finished under another, or a turn I believe is Fable 5 that is actually Opus 4.8. I cannot evaluate a system whose identity is uncertain.

Fourth, and most damaging for anyone doing serious technical or medical work: model identity is part of the experimental condition. When you evaluate a model, the model is the independent variable. If the independent variable changes without your knowledge or consent, partway through the trial, your result is contaminated. You are no longer measuring what you think you are measuring. The evaluation artifact — the very thing I produce to show physicians how a system reasons — is no longer trustworthy as a record of any single system's behavior.
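In evaluation code, the practical consequence is a provenance check before any scoring happens. The Python sketch below assumes something the product did not give me: a per-turn `served_model` field that reports which model actually answered. The record type, the field names, and the model identifier strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalTurn:
    case_id: str
    requested_model: str  # what the evaluator selected
    served_model: str     # what actually answered (assumed to be exposed per turn)
    answer: str


def split_by_provenance(
    turns: List[EvalTurn], model_under_test: str
) -> Tuple[List[EvalTurn], List[EvalTurn]]:
    """Separate turns that can be attributed to the model under test from the rest."""
    clean: List[EvalTurn] = []
    contaminated: List[EvalTurn] = []
    for turn in turns:
        attributable = (
            turn.requested_model == model_under_test
            and turn.served_model == model_under_test
        )
        (clean if attributable else contaminated).append(turn)
    return clean, contaminated


if __name__ == "__main__":
    turns = [
        EvalTurn("118F69", "fable-5", "fable-5", "answer text A"),
        EvalTurn("118F69", "fable-5", "opus-4.8", "answer text B"),  # rerouted turn
    ]
    clean, contaminated = split_by_provenance(turns, "fable-5")
    print(len(clean), "attributable;", len(contaminated), "excluded from scoring")
```

Without a trustworthy `served_model` value, this check cannot be written at all, and every turn has to be treated as potentially contaminated.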

So let me state the principle that I think Anthropic should adopt and that I asked for in the feedback: per-turn model identity is not a cosmetic UI detail. It is part of the audit trail. In any context where the answer matters enough to evaluate, the provenance of the answer matters as much as the answer. A response that began under one model and was rerouted to another has a broken provenance chain, and a broken provenance chain in a medical-adjacent workflow is not a UX inconvenience. It is a defect in the record.

If the route must change — and I can imagine legitimate reasons it sometimes must — then the change must be logged in the artifact, attributable to the turn, visible after the fact, and never accomplished by tearing down output that has already been shown to the user. Abort-and-reroute mid-stream is the one behavior that makes the record unreconstructable, and the record is the thing professionals are actually paying for.
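Those four requirements (logged in the artifact, attributable to the turn, visible after the fact, no discarded output) can be sketched as a small data structure. This is a Python illustration of the request, not a description of how any serving stack works. The class names, field names, model identifiers, and reason string are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class RouteChange:
    from_model: str
    to_model: str
    reason: str             # a stated reason, so the trigger is not a mystery
    shown_output: str = ""  # text already displayed is kept, not discarded


@dataclass
class TurnRecord:
    turn: int
    requested_model: str
    served_model: str
    route_changes: List[RouteChange] = field(default_factory=list)


class ProvenanceLog:
    """Append-only, per-turn record of which model answered."""

    def __init__(self) -> None:
        self._turns: List[TurnRecord] = []

    def start_turn(self, requested_model: str) -> TurnRecord:
        record = TurnRecord(
            turn=len(self._turns) + 1,
            requested_model=requested_model,
            served_model=requested_model,
        )
        self._turns.append(record)
        return record

    def reroute(self, record: TurnRecord, to_model: str, reason: str,
                shown_output: str = "") -> None:
        """Record a route change on the turn it belongs to, then update the served model."""
        record.route_changes.append(
            RouteChange(record.served_model, to_model, reason, shown_output)
        )
        record.served_model = to_model

    def models_by_turn(self) -> List[Tuple[int, str]]:
        """What a user scrolling back would need: turn number and answering model."""
        return [(t.turn, t.served_model) for t in self._turns]

    def to_jsonl(self) -> str:
        """Serialize one JSON object per turn so the record travels with the artifact."""
        return "\n".join(json.dumps(asdict(t)) for t in self._turns)


if __name__ == "__main__":
    log = ProvenanceLog()
    log.start_turn("fable-5")
    second = log.start_turn("fable-5")
    log.reroute(second, "opus-4.8", "topic safeguard (hypothetical label)",
                shown_output="first lines already shown")
    print(log.models_by_turn())
    print(log.to_jsonl())
```

The design choice that matters is that a reroute is an event attached to a turn rather than a replacement of it. The requested model, the served model, and whatever had already been shown all survive in the record.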

There is a further way to see why this matters, and it is worth stating because it reframes the whole issue away from "user preference" and toward something a platform team already understands: a contract of execution. When I select a model and send a turn, I am entering into an implicit contract — this input will be processed by this system under these conditions — and a professional builds on top of that contract the way you build on top of any platform guarantee. The contract does not have to promise that every request will be answered. It is entirely legitimate for the contract to include "and if the request crosses a line, it will be refused, explicitly, and you will be told." What the contract cannot survive is a clause that reads "and the processing system may change, mid-execution, without that change being recorded, in a way you cannot later reconstruct." That clause makes the platform unbuildable-upon, because everything you build assumes a substrate that turns out to be unstable in an unobservable way.

Engineers have a name for the property that is being violated here, and it is not "the user is annoyed." It is determinism of the execution environment under a stated configuration — the basic expectation that, holding your inputs and your configuration fixed, you understand what produced your output. Cloud platforms violate this occasionally and treat each violation as an incident with a postmortem, because they understand that the value of a platform is precisely the stability of the guarantees you can build on. A mid-turn model swap that is not durably recorded is, in those terms, an unlogged change to the execution environment under a configuration the user explicitly set. The right disposition toward that is not "the dialog informs the user, so it's fine." The right disposition is "this is the class of event that gets logged, attributed, and surfaced, because someone is building on top of us." Professionals are not asking to be coddled. They are asking the platform to behave like a platform.

7. Self-censorship as a product failure

There is one more consequence of the second incident that I want to treat on its own, because it is the one that surprised me, and because it is the one I think is most easily underestimated by the people who build these systems.

After the downgrades, I noticed myself doing something I had never done with a language model before. I started editing my inputs preemptively — softening clinical phrasing, avoiding certain words, routing around topics I suspected might trip the guard, not because the topics were inappropriate but because I could not predict the guard and did not want to lose another session to it. As I wrote at the time: I now catch myself self-censoring my inputs. This is the first time in my life I have had to use an LLM while worrying about it.

I want to be careful not to dramatize this, so let me state exactly what it is and is not.

It is not the behavior of a user who wants to do something unsafe and is being deterred. There was nothing I wanted to ask that I had any reason to hide. In fact — and I said this in the report — I would gladly hand Anthropic my entire conversation history. There is nothing in my work that the model's own developer should not see. I am, in the most literal sense, the most cooperative kind of user a safety team could hope for: a professional, working in the open, on legitimate products, who wants the safety mechanisms to succeed.

What it is, instead, is a change in the cognitive relationship between me and the tool. And that change is the product failure, distinct from any individual blocked turn.

A professional thinking instrument has one core job: to increase your ability to think clearly about your actual problem. Whatever else it does, it must not insert itself between you and your own reasoning. The moment I begin shaping my inputs to manage the tool's reactions rather than to express my actual question, the tool has stopped being a medium for my thinking and become an obstacle I have to think around. I am no longer thinking through the product. I am spending cognitive budget modeling and placating the product's safety layer, and every unit of budget spent there is a unit not spent on pyelonephritis, or on the differential-diagnosis assistant, or on the evaluation workflow I sat down to build.

This is the inversion that should alarm a product team more than any single false positive. A false positive costs one session. Self-censorship costs the relationship. It teaches the most careful, most cooperative users to treat the tool as something to be handled rather than something to be trusted. And it does this precisely to the users who were doing everything right, because those are the users conscientious enough to notice the guard and adjust their behavior around it. The reckless users do not self-censor; they do not even notice. The guard, in other words, modifies the behavior of exactly the population it had no reason to modify, and leaves untouched the population it was built for.

A safety mechanism that changes how careful people think, without changing what careless people do, has its incentives pointed backwards.

8. Anthropic's complaint against the government

Now the second face of the mirror, which I have deliberately held until after the evidence, so that it reads as an analysis rather than a grievance dressed up as one.

Over the same span of days, as publicly reported and as Anthropic itself stated, the U.S. government issued an export-control directive, citing national-security authorities, instructing Anthropic to suspend access to Fable 5 and Mythos 5 for any foreign national — including, per the reporting, foreign-national employees inside the United States. The scope of the directive was such that Anthropic concluded it had to disable the models for all customers in order to comply. Access to the company's other models, including Opus 4.8, was reportedly unaffected.

According to Anthropic's public statement, the directive did not provide specific details of the national-security concern, and Anthropic's understanding was that the government believed it had become aware of a method of jailbreaking Fable 5. Anthropic said it had reviewed a demonstration of the technique and characterized the vulnerabilities it surfaced as relatively minor and discoverable by other publicly available models as well.

I am deliberately quoting almost none of this directly, because the exact government rationale is not fully visible to users and I do not want to overstate the public record. But Anthropic's own posture is the part I want to get exactly right, because it is the yardstick I am about to use. In its public statement, Anthropic argued that the discovery of a narrow potential jailbreak should not be grounds for pulling a commercial model that already serves hundreds of millions of users, and warned that a standard like that, applied across the industry, would effectively freeze new model releases for every frontier provider. And then it stated the principle directly: that a government's ability to block unsafe deployments should run through a process that is, in its words, "transparent, fair, clear, and grounded in technical facts," and that this particular action did not meet that bar.

I think that criticism may well be valid. I am not in a position to adjudicate the national-security merits, and I am not going to pretend the government was definitely wrong, because I cannot see what they saw. But I can evaluate the form of the objection, and the form is sound. A safety decision of enormous scope was made, as Anthropic describes it, on the basis of a narrow signal, without the grounding and transparency that a decision of that scope demands. If that is what happened, then the objection is the right objection.

There is an additional layer here that makes the moment sharper, and I want to name it carefully so it does not sound like a "gotcha." Only days before the directive, Anthropic's CEO published an essay — "Policy on the AI Exponential" — arguing, among other things, for giving government the authority to block unsafe frontier-model deployments. But the proposal, as reported, was explicitly scoped: the block authority was to be confined to a small number of defined risk areas and to come with protections against political favoritism, exercised through fair and technically grounded process. So Anthropic did not argue that government should never intervene. It argued that intervention should be narrow, grounded, and fair. And when an intervention arrived that Anthropic experienced as broad, ungrounded, and opaque, it objected — by its own previously stated standard.

I find that consistent, not hypocritical. A company can believe in a power and still object to a particular use of it. But it does set up the mirror with unusual precision, because it means Anthropic has already told us, in its own words, what a good safety decision looks like. We do not have to invent the criteria. Anthropic supplied them.

9. The mirror Anthropic should look into

So let me hold the two faces up together, using the criteria Anthropic itself put on the table.

At the government scale, as Anthropic describes it: a narrow risk signal — a single, non-universal jailbreak technique — became a broad access restriction affecting an entire model line and, through the compliance mechanics, every customer. The decision, in Anthropic's account, was not transparent (the specific concern was not detailed), not clearly grounded in technical facts proportionate to the action (the company assessed the vulnerabilities as minor and widely reproducible), and not narrowly scoped to the actual risk (it swept in legitimate users wholesale).

At the product scale, on my screen: a narrow lexical and topical signal — a cluster of risk-adjacent vocabulary in one case, the presence of clinical content in the other — became a broad downgrade of the selected model. The decision was not transparent (I could not tell what triggered it), not clearly grounded in the actual intent (a benign critique and a malicious request looked identical to it; a benchmark and a patient looked identical to it), and not narrowly scoped to any real risk (it swept in a meta-discussion of safety and a published exam question wholesale).

The two events are not similar by coincidence of vocabulary. They are the same structural error at two scales. In each case, a decision-maker collapsed distinctions that mattered:

Actor — who is doing this? A government cannot tell my legitimate work from misuse by treating the model as uniformly dangerous; a classifier cannot tell me from a bad actor by reading my tokens.
Intent — to what end? Recalling a model treats "a jailbreak exists" as "this model will be used to do harm"; downgrading my session treats "these words appeared" as "this user intends misuse."
Context — in what setting? A research-and-evaluation context is not a deployment-at-scale context; a product-development session is not a patient-facing consultation.
Phase of use — at what point in the lifecycle? Probing, evaluating, and critiquing are different phases from deploying and executing, and they carry different risk.
Authority and responsibility — who is accountable for the downstream consequence, and through what review gate does the output have to pass before it becomes an action?

On the accounts available to me, both decisions ignored every one of these and reasoned from a single surface feature to a broad restriction.

Now line up the two objections, because they are the same objection in two mouths.

Anthropic, to the government, in effect: do not use a narrow risk example to impose a broad restriction on legitimate users; the decision must be transparent, fair, clear, and grounded in technical facts.

Me, the professional user, to Anthropic, in effect: do not use a narrow lexical or topical signal to impose a broad downgrade on legitimate work; the decision must be transparent, auditable, scoped to actual intent, and grounded in the actual context.

That is the mirror, and it is why I think the larger episode, painful as it was for Anthropic, is also clarifying. Anthropic was right to object to broad, opaque, poorly grounded safety decisions. That is precisely why its own product-level safety routing deserves the same scrutiny, measured against the same four criteria Anthropic chose. Transparent. Fair. Clear. Grounded in technical facts. The government's directive, by Anthropic's account, failed that test. By my account, so did the classifier that downgraded my sessions. The standard does not change when you cross from the policy layer to the product layer. It should not be allowed to.

I want to be explicit about what I am not saying, because this is the point where a careless reader reaches for the conspiracy. I am not claiming Anthropic and the government coordinated. I am not claiming the timing was engineered. I am not suggesting any of this was theater. If the coincidence of scales over the same few days has a certain ironic symmetry — a company objecting to a broad, opaque safety decision in the same week its own product made a broad, opaque safety decision against me — that irony is rhetorical, not factual. I am pointing at a shared structure of error, not a shared intention. Structures of error do not require coordination. They recur because they come from the same underlying mistake, and the underlying mistake is almost always the same: reasoning from a narrow signal to a broad action without re-examining the granularity at which the signal was read.

It is worth naming the generalization plainly, because once you see it you start seeing it everywhere. Call it the granularity gap: the distance between the resolution at which a risk signal is detected and the resolution at which the responding action is taken. When a signal is detected at fine resolution — one jailbreak technique, one vocabulary cluster, one topic — but the action is taken at coarse resolution — a whole model line, a whole session, a whole domain — the gap between them is filled with false positives, and the false positives land disproportionately on the legitimate users, because legitimate users are the majority of any population the coarse action sweeps up. The size of the granularity gap is, quite literally, a measure of how much collateral damage a safety decision will do. A good safety decision keeps the gap small: it responds at roughly the resolution at which it detected. A bad one detects something narrow and responds with something broad, and then describes the breadth as caution. Both the directive and the classifier, on the accounts available to me, had large granularity gaps. That is the family resemblance. And the reason it keeps recurring is that closing the gap is more expensive than leaving it open — it requires looking harder at who and why before you act — and the cost of leaving it open is paid by someone else, quietly, downstream, where the decision-maker never has to look.
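The granularity gap can be made concrete with a little arithmetic. What follows is a minimal sketch, not a measurement: the function name `collateral_share` and every number in it are hypothetical, chosen only to show how the share of legitimate users caught by an action grows as the action is scoped more coarsely than the signal that prompted it.

```python
def collateral_share(swept_users, risky_users):
    """Fraction of users hit by an action who were not the risk it targeted."""
    if swept_users <= 0:
        raise ValueError("swept_users must be positive")
    if not 0 <= risky_users <= swept_users:
        raise ValueError("risky_users must be between 0 and swept_users")
    return (swept_users - risky_users) / swept_users


# Hypothetical numbers, for illustration only.
RISKY = 10  # users actually acting on the narrow signal

# Action scoped close to the resolution of the signal: few bystanders.
narrow = collateral_share(swept_users=12, risky_users=RISKY)

# Action scoped to a whole model line or domain: almost all bystanders.
broad = collateral_share(swept_users=100_000, risky_users=RISKY)

print(f"narrow action: {narrow:.1%} of affected users were collateral")
print(f"broad action:  {broad:.2%} of affected users were collateral")
```

The signal is identical in both cases. Only the resolution of the response changes, and with it the fraction of affected people who had nothing to do with the risk.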

10. The real design failure: wrong granularity

If there is one idea I want a builder to take from this essay, it is this: the problem is not that safety filters exist. The problem is the granularity at which classification happens. Almost everything else follows from getting that one thing wrong.

Let me make the technical case carefully, because "use better classifiers" is a useless recommendation and I do not want to make it.

Prompt-level and token-level classification is structurally prone to false positives in exactly the domains where expert work is most necessary, and the reason is not implementation quality. It is that high-risk activity and high-value expert activity share a vocabulary, and a feature space built on vocabulary cannot separate classes that are identical in that feature space.

Walk through the collisions:

Safety critique vs. attack. Designing a better classifier and evading a classifier both involve distillation, grader, routing, weaponization. The words are the same; the intent is opposite. (This is incident one.)
Medical development vs. medical misuse. Building and evaluating a diagnostic assistant and seeking dangerous patient-facing advice both involve diagnoses, tests, treatments, drug names, dosages. The clinical language is the same; the context is opposite. (This is incident two.)
Security research vs. exploitation. Hardening a system and breaking one share the entire vocabulary of vulnerabilities.
Biology education vs. biological harm. Explaining basic cell biology and seeking to misuse biology share enough surface that a guard tuned to biological vocabulary can refuse a schoolchild's question.

In every one of these pairs, the legitimate member is not a rare edge case to be sacrificed for the convenience of catching the illegitimate member. The legitimate member is the expert, the builder, the teacher — the population whose work is how the corresponding risk actually gets reduced. A filter that over-triggers on shared vocabulary does not merely accept some false positives as the price of safety. It systematically taxes the exact people who make the systems safer, in proportion to how seriously they engage with the risk. The more rigorous your safety critique, the denser your risk vocabulary, the more certainly you are flagged. The incentive gradient points away from rigor.
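The vocabulary collision can be shown in its most extreme form with a toy example. This is a sketch under stated assumptions: a pure bag-of-words feature space, an invented watch list (`RISK_TERMS`), and two invented prompts. It is not a description of any real filter, which may use richer features; it only illustrates why a function of word counts alone cannot separate two inputs whose word counts are identical.

```python
import re
from collections import Counter

# Hypothetical watch list; not any real product's list.
RISK_TERMS = frozenset({"evade", "classifier"})


def bag_of_words(text):
    """Lexical features: lowercase word counts, with word order discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def lexical_risk_score(features, risk_terms=RISK_TERMS):
    """Stand-in for an order-blind lexical filter: count watch-list hits."""
    return sum(n for word, n in features.items() if word in risk_terms)


# Two invented prompts: the same words, the opposite intent.
critique = "I want to harden the classifier; how would someone evade it"
attack = "I want to evade the classifier; how would someone harden it"

critique_features = bag_of_words(critique)
attack_features = bag_of_words(attack)

# Identical feature vectors, so any threshold on any score computed from
# them must treat the two prompts the same way.
print(critique_features == attack_features)
print(lexical_risk_score(critique_features), lexical_risk_score(attack_features))
```

No amount of threshold tuning helps here, because the two inputs are the same point in the feature space. The information that separates them lives in word order, conversation history, and who is asking.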

This is also where Anthropic's own number deserves a second look. Anthropic says the safeguards trigger in under five percent of sessions and will sometimes flag benign requests, and frames this as an acceptable cost of shipping a powerful model quickly. I am not disputing the five percent. I am disputing the unstated assumption that the five percent is randomly distributed across users. It is not. A false-positive rate produced by lexical and topical salience does not fall evenly on the population; it concentrates wherever legitimate work shares vocabulary with the risk being scanned for. The safety researcher, the medical AI developer, the security engineer, the biology teacher: these users do not each absorb a thin, even share of the false positives. They are where the false positives land, clustered precisely on the experts whose vocabulary collides with the filter. A five-percent rate that is uniform is a tax everyone pays a little. A five-percent rate that concentrates on the people building safer systems is a tax aimed, with unfortunate accuracy, at exactly the wrong payers. The aggregate number can look reassuring on a dashboard while the distribution underneath it is doing real damage to the users you would least want to lose.
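To see how an aggregate rate can hide its own distribution, here is a small worked example. Every number is invented, the two segment names are hypothetical, and all sessions are assumed benign so that each flag counts as a false positive. The point is only that the same sub-five-percent dashboard figure is compatible with a very uneven burden.

```python
def segment_rates(sessions, flagged):
    """Per-segment flag rate: flagged sessions divided by total sessions."""
    return {name: flagged[name] / sessions[name] for name in sessions}


# Invented figures for illustration; every session is assumed benign.
sessions = {"general": 9_500, "risk_adjacent_experts": 500}
flagged = {"general": 95, "risk_adjacent_experts": 400}

total_flagged = sum(flagged.values())
aggregate_rate = total_flagged / sum(sessions.values())
rates = segment_rates(sessions, flagged)
expert_share_of_flags = flagged["risk_adjacent_experts"] / total_flagged

print(f"aggregate flag rate:     {aggregate_rate:.2%}")
for name, rate in rates.items():
    print(f"{name:>24}: {rate:.0%} of sessions flagged")
print(f"share of all flags landing on experts: {expert_share_of_flags:.0%}")
```

In this made-up population the headline rate stays under five percent while one small segment is flagged in most of its sessions. A dashboard that reports only the aggregate would never show it, which is why the rate needs to be broken out by the kind of work being done.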

So lexical detection alone cannot solve the problem, not with more training data, not with a better-tuned threshold, not with a bigger model behind the filter. The discriminating information is simply not in the prompt. To separate the classes, the classifier has to operate on variables that carry the distinction:

Actor granularity — a stable view of who this is across the conversation and, where appropriate, across the account, not a fresh judgment of each prompt as if it arrived from a stranger.
Conversation granularity — the trajectory of the dialogue, its purpose, its direction of travel. A turn that looks alarming in isolation is often obviously benign given the ten turns before it.
Intent and domain role — is this person evaluating, deploying, critiquing, researching, teaching, or executing? These are different activities with different risk profiles, and they are usually legible from context even when invisible at the token level.
Phase of use — a draft is not a deployment; an analysis is not an instruction; a benchmark is not a patient. Where in the lifecycle does this output sit?
Output type — is the thing being produced a draft, an analysis, an evaluation artifact, or an actionable instruction that someone could execute against the real world? The risk lives overwhelmingly in the last category.
Authority and responsibility structure — is there a human approval gate between this output and any real-world action? In well-designed professional systems there almost always is, and its presence changes the risk calculus entirely.

Here is the line I would put on the wall of the team that owns this: safety systems that ignore actor, intent, and phase of use do not merely block bad actors. They also disrupt the professionals trying to build safer systems. And because those professionals are the ones who notice and adapt, the long-run effect of wrong-granularity safety is to train your most valuable users to trust you less while leaving your least valuable users unaffected.

None of this means abandon classification. It means move the unit of analysis up. The model behind Fable 5 appears far better suited to reading a conversation and judging actor, intent, and phase than a lexical trigger is — not infallible, but operating on the variables that actually carry the distinction. At minimum, that richer judgment should be available before a consequential routing decision is made. The waste is that the routing decision apparently was not made at that level. It was made on the surface, by something that fires before the model's own judgment can be consulted, on a feature space that cannot contain the answer. The better-suited judgment already exists inside the system. It is just not the thing holding the trigger.

10.1 What conversation-granularity safety could actually look like

I do not want to leave this as a complaint with a slogan attached, because "move the unit of analysis up" is easy to say and easy to nod at and hard to operationalize. So let me sketch, as a builder rather than a critic, what I think a defensible version of this routing would look like. I am not claiming to know Anthropic's internal architecture, and I am not pretending the engineering is trivial. I am describing the shape of a system that would not have failed me the way this one did, so that the requests in the next section read as achievable rather than aspirational.

Start from a principle that the safety field already endorses: defense in depth. Anthropic itself, in describing the Mythos-class safeguards, framed its approach as layered — narrow safeguards, monitoring, retention. Defense in depth is the right instinct. The problem is not that there are layers. The problem is which layer holds the consequential decision. In the behavior I observed, a cheap, fast, lexical layer was making an expensive, consequential, hard-to-reverse decision — tearing down a generation and rerouting the model. That is backwards. Cheap layers should make cheap decisions; only expensive layers should make expensive ones.

A cheap lexical layer is a fine first pass. It is a reasonable thing to use to decide what deserves a closer look. It is a terrible thing to use to decide what gets blocked or downgraded, because, as established, it cannot see intent. So the first design move is to demote the lexical signal from a decision to a flag. When risk-adjacent vocabulary spikes, do not act. Escalate. Hand the conversation — the whole conversation, not the triggering prompt — to a layer that can actually read it.

That second layer is the one that should be making the call, and the encouraging part is that the capability for it already exists in the product. The model that generates Fable 5's reasoning can read a transcript and answer, far more reliably than a lexical pass can, questions a lexical filter cannot even represent: Is this person designing a safeguard or evading one? Is this clinical content a benchmark used for evaluation, or advice about a specific real person? Is the output a draft, an analysis, or an executable instruction? Is there a human approval gate downstream? The judgment the routing needs is not missing from the system; it is sitting one component away from the trigger, unused. So let the reader gate the consequential action. If the cheap layer flags a session and the reader judges it benign — a safety critique, a benchmark, a research probe, an approval-gated workflow — the session continues, on the model the user selected, with no teardown and no downgrade. If the reader judges it genuinely risky, then you act, and you act the way Anthropic told the government safety decisions should be done: transparently. You say what is blocked and why. You do not swap the model out from under the user and discard work already shown to them.
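The two-layer shape described above can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's architecture, which I do not know: `cheap_flag`, `route`, `Verdict`, `Decision`, the watch list, and the stub reader are all names and values I made up. The reader is passed in as a plain function standing in for a model that reads the whole transcript.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class Verdict(Enum):
    BENIGN = "benign"
    RISKY = "risky"
    UNCERTAIN = "uncertain"


@dataclass(frozen=True)
class Decision:
    action: str            # "continue", "block", or "hold_for_review"
    model: Optional[str]   # model serving the turn; None when blocked
    reason: str            # stated to the user, never silent


RISK_TERMS = frozenset({"jailbreak", "dosage", "exploit"})  # hypothetical list
Reader = Callable[[Sequence[str]], Verdict]


def cheap_flag(prompt):
    """Layer 1: lexical pass. It may raise a flag; it may not decide."""
    return bool(set(re.findall(r"[a-z]+", prompt.lower())) & RISK_TERMS)


def route(conversation, selected_model, reader):
    """Decide before generation; only the reader gates consequential actions."""
    if not conversation:
        raise ValueError("conversation must contain at least one turn")
    if not cheap_flag(conversation[-1]):
        return Decision("continue", selected_model, "no lexical flag")
    verdict = reader(conversation)  # Layer 2 reads the whole conversation
    if verdict is Verdict.BENIGN:
        return Decision("continue", selected_model, "flagged, then cleared by reader")
    if verdict is Verdict.RISKY:
        return Decision("block", None, "reader judged the conversation risky")
    # Assumption: ambiguous cases keep the selected model and go to a human.
    return Decision("hold_for_review", selected_model, "reader not confident")


def stub_reader(conversation):
    """Placeholder only; a real reader would be a model, not a keyword test."""
    text = " ".join(conversation).lower()
    return Verdict.BENIGN if "benchmark" in text else Verdict.UNCERTAIN


decision = route(
    ["We are scoring a benchmark set.", "What dosage does the answer key give?"],
    "selected-model",
    stub_reader,
)
print(decision)
```

The design choice worth noticing is that no path through `route` swaps the model. A flagged session either continues on the model the user selected, is blocked with a stated reason, or is held for review.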

The second move is about where the human goes, and this is the part I care about most, because I build systems like this for a living. The reflexive answer is "human in the loop" — put a person between the model and every consequential output. That does not scale, and worse, it trains everyone involved to rubber-stamp, because a reviewer who must approve everything approves everything. The better pattern is what I think of as human on the exception: the human does not review every output. The human reviews the rejection box — the comparatively small set of cases the automated layers flagged and the conversation-level reader could not confidently clear — and reviews each one with the reasoning attached, so that the review is an audit of a judgment rather than a fresh adjudication from nothing. This inverts the economics. Instead of asking a human to bless an ocean of benign traffic, you ask a human to examine the narrow stream of genuine ambiguity, which is the only place human judgment adds anything. It also produces, as a byproduct, exactly the dataset that makes the automated layers better over time: a curated record of hard cases and how a human resolved them.
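Human on the exception can be sketched as a triage step over the reader's judgments. Everything here is an assumption for illustration: the `ReaderJudgment` fields, the `CLEAR_THRESHOLD` value, and the idea that the reader reports a numeric confidence at all. The sketch only shows the routing logic, in which confident judgments resolve automatically and the rest reach a person with the reasoning attached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReaderJudgment:
    session_id: str
    label: str         # "benign" or "risky"
    confidence: float  # 0.0 to 1.0, as reported by the reader (assumed)
    reasoning: str     # travels with the case so review is an audit


@dataclass
class ReviewItem:
    judgment: ReaderJudgment
    human_label: Optional[str] = None


CLEAR_THRESHOLD = 0.9  # hypothetical cut-off, would need tuning in practice


def triage(judgments, threshold=CLEAR_THRESHOLD):
    """Split flagged sessions into auto-resolved cases and the review queue."""
    resolved, review_queue = [], []
    for judgment in judgments:
        if judgment.confidence >= threshold:
            resolved.append(judgment)
        else:
            review_queue.append(ReviewItem(judgment))
    return resolved, review_queue


def record_review(item, human_label):
    """Store the human's decision and return a row for the hard-case dataset."""
    item.human_label = human_label
    return {
        "session_id": item.judgment.session_id,
        "reader_label": item.judgment.label,
        "reader_reasoning": item.judgment.reasoning,
        "human_label": human_label,
    }


flagged = [
    ReaderJudgment("s1", "benign", 0.97, "known-answer exam item used for evaluation"),
    ReaderJudgment("s2", "benign", 0.55, "clinical case, purpose unclear"),
    ReaderJudgment("s3", "risky", 0.95, "asks for an executable instruction"),
]
resolved, review_queue = triage(flagged)
row = record_review(review_queue[0], "benign")
print(len(resolved), len(review_queue), row["human_label"])
```

The rows returned by `record_review` are the byproduct the paragraph above describes: a growing record of hard cases, each paired with the reader's reasoning and the human's resolution.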

The third move is provenance, and it threads through all of the above. Every one of these decisions — the cheap flag, the reader's judgment, any escalation, any model change — should be a durable, attributable fact about the turn. Not a transient dialog. A record. If the system decides a session is fine, that decision should be inspectable. If it decides to act, the action and its rationale should live in the artifact. This is not bureaucracy for its own sake. It is the only way the system becomes auditable, and auditability is the only way a professional user can trust it, and trust is the only thing that keeps the careful users from drifting into self-censorship.
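A durable per-turn record does not need to be elaborate. Below is a minimal sketch of an append-only log, assuming nothing about how any real product stores its history: `TurnEvent`, `ProvenanceLog`, the event kinds, and the placeholder model names `model-a` and `model-b` are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TurnEvent:
    turn: int
    kind: str    # e.g. "flag", "reader_judgment", "escalation", "model_change"
    detail: str  # the rationale, stored with the event
    model: str   # model serving the turn once this event has applied
    at: str      # UTC timestamp, ISO 8601


class ProvenanceLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self):
        self._events = []

    def record(self, turn, kind, detail, model):
        event = TurnEvent(turn, kind, detail, model,
                          datetime.now(timezone.utc).isoformat())
        self._events.append(event)
        return event

    def models_by_turn(self):
        """Each model that served a turn, in order, without adjacent repeats."""
        served = {}
        for event in self._events:
            models = served.setdefault(event.turn, [])
            if not models or models[-1] != event.model:
                models.append(event.model)
        return served

    def export(self):
        """Serialize the full record so it can live alongside the transcript."""
        return json.dumps([asdict(e) for e in self._events], indent=2)


log = ProvenanceLog()
log.record(1, "flag", "risk-adjacent vocabulary", "model-a")
log.record(1, "reader_judgment", "cleared: benchmark evaluation", "model-a")
log.record(2, "flag", "risk-adjacent vocabulary", "model-a")
log.record(2, "model_change", "rerouted after reader judged risky", "model-b")
print(log.models_by_turn())
```

A turn that lists more than one model in `models_by_turn` is a turn whose answer cannot be attributed without the log. That is the case evaluation work most needs to be able to detect afterward.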

Notice what this architecture does to the four failures I described. The classifier-design discussion would have been flagged by the cheap layer and then cleared by the reader, because a reader can see it is a critique. The medical session would have been flagged and then cleared, because a reader can see 118F69 is a benchmark and that no real patient is involved. The mid-turn teardown disappears entirely, because the consequential decision is made before generation by a layer that read the context, not during generation by a layer reacting to surface tokens. And provenance is preserved by construction, because every decision is logged to the turn. None of this requires a research breakthrough. It requires putting the consequential decision at the layer that can see the discriminating variable, and putting the human at the exception rather than the average.

I am aware there is real cost here — running a conversation-level reader on flagged sessions is more expensive than a lexical pass, and latency matters. But the cost should be weighed against what the cheap version actually costs: not zero, as it appears on a dashboard, but the silent erosion of trust among the users who matter most, paid in self-censorship and lost provenance and abandoned sessions that never register as an error metric anywhere. The cheap classifier is only cheap if you decline to measure what it breaks.

11. Medicine as the clearest test case

I keep returning to medicine, and not only because it is my field. I return to it because it is the cleanest possible illustration of why wrong-granularity safety fails on its own terms — why it does not even achieve the safety it sacrifices everything else for.

The argument that clinical content should be guarded rests on a true premise: medicine is high-stakes, and bad medical output can hurt real people. I accept that premise completely. I build careful systems precisely because I accept it. But the conclusion that therefore clinical content should be downgraded or filtered does not follow from the premise, and seeing why is the whole point.

Because medicine is high-stakes, serious medical AI developers must be able to do the work that makes medical AI trustworthy. That work is irreducibly made of clinical content. We must be able to:

discuss clinical cases, because a diagnostic assistant is evaluated on cases;
evaluate diagnostic reasoning, because reasoning quality is the product;
test treatment-selection logic, because wrong treatment selection is the failure mode that matters;
and build workflows with explicit human approval gates, because the gate is where real safety lives.

Blocking or downgrading clinical content does none of this. It does not make a single patient safer. What it does is make it harder to build the systems that would. Real medical safety is not produced by topic avoidance. It is produced by context, responsibility, review, auditability, provenance, human approval gates, and a clean separation between draft output and executable clinical action. Every one of those is a property of the system around the model, not a property of whether the model was willing to read a case.

This is why the distinction I keep insisting on is not a technicality. A medical licensing exam question used as evaluation material is categorically not the same thing as patient-facing diagnosis. One is a known-answer benchmark whose entire purpose is to be answered correctly under examination conditions. The other is a real recommendation to a real person with a real body. They share clinical vocabulary and share almost nothing else. If a safety system cannot tell them apart — if 118F69, a question written by medical educators to certify physicians, reads to the guard as a patient in danger — then the system is not performing medical safety at all. It is performing topic avoidance, and calling it safety.

The cost of that confusion is not symmetric, either. The reckless actor who wants the model to play doctor for a real patient is not meaningfully deterred by a downgrade; they will rephrase, or use a different tool, or ignore the warning. The careful developer building the approval-gated, audit-logged, physician-reviewed system is deterred, because the downgrade lands squarely on the work they cannot route around — the clinical evaluation that is the substance of their job. So once again the tax falls on the builder and misses the misuser. In the one domain where safety matters most, wrong-granularity safety achieves the least and costs the most.

If you want medical AI to be safe, the last thing you should do is make the model flinch at medicine in the hands of the people trying to make medical AI safe.

12. Why this wastes a genuinely excellent model

I started by telling you I wanted Fable 5 to be real, and I want to close the loop on that, because the frustration in this essay is entirely a function of how good the model was.

If Fable 5 had been mediocre, none of this would be worth ten thousand words. A brittle safety router around a forgettable model is a minor annoyance; you shrug and use something else, and you lose nothing, because there was nothing there. But Fable 5 was not forgettable. It was the best intellectual collaborator I had used — denser in reasoning, stronger in self-criticism, clearer in logic, richer in vocabulary, better with evidence, and less sycophantic than anything before it. It made serious work better. It was, for the kind of thinking my work requires, a genuine step change.

That is exactly what makes the safety-routing failures feel like waste rather than mere inconvenience. The over-eager guardrails did not protect me from a weak product. There was no weak product to be protected from. They stood between me and a strong one. The capability was there — I could see it, I had used it, I knew what the sessions felt like when they ran clean. And then the path to that capability became unreliable and opaque, and I found myself rationing my own access to a tool I was paying for and rooting for, editing my questions to avoid a guard I could not predict, and unable to trust the provenance of the answers I did get.

The better the model, the higher the cost of an unreliable path to it. A safety layer that degrades a frontier model into an intermittently-available, unauditable version of itself is not a tax on danger. It is a tax on the model's own excellence, paid by the users who valued the excellence most. Every downgrade I hit was a small demonstration that the most capable system available to me was capable of being made unavailable by something that had not read my conversation.

I do not think that trade was made on purpose. I think it is the accidental output of safety implemented at the wrong granularity. But accidental waste is still waste, and when the thing being wasted is the best model I have used, I am not willing to pretend it does not matter.

13. Concrete requests

I would rather end with something a team can act on than with a flourish. So here are the specific changes I am asking for, each with the one-line reason it matters. None of these requires abandoning safety. Every one of them is about moving the safety decision to a granularity that can actually carry the distinction it is trying to make.

1. Do not downgrade or abort a model mid-generation based on content. If something genuinely must be blocked, say so explicitly and stop. Do not tear down a response that has already begun generating and re-run it on another model in a way that is not durably recorded. Abort-and-reroute is the single behavior that makes the record unreconstructable.

2. Do not opaquely replace the selected model with another. A dialog is better than silence, but it is not enough. Even when the user is notified, the substitution should be recorded as a fact about the turn, not just flashed and forgotten.

3. Make per-turn model identity visible and auditable. If the model that answered can change, the user has the right to know — during the session and afterward — which model answered which turn. Provenance is not a UI nicety; it is the audit trail, and for evaluation work it is part of the experimental record.

4. Distinguish medical (and other expert) product development and evaluation from end-user advice. Legitimate medical AI development is a first-class use case, not a risk to be filtered. A known-answer benchmark is not a patient. Build the routing so it can tell the difference, because the difference is almost always legible from context.

5. Treat safety-classifier false positives as product incidents, not harmless friction. A false positive that lands on a careful professional is not a near-miss success of the safety system. It is a defect, and it teaches your best users to trust you less. Measure it, track it, and weight it accordingly.

6. Classify at conversation and actor granularity, not just lexical prompt granularity. The discriminating information — actor, intent, phase, output type, responsibility structure — is not in the prompt. Move the unit of analysis up to where the answer lives.

7. Preserve provenance end to end. If a response began under one model and was rerouted to another, that should be a visible, durable fact in the record. A broken provenance chain in a high-stakes workflow is a defect, full stop.

These are not radical. Several of them are things Anthropic clearly already believes when the decision-maker is a government agency and the subject is its own model. I am asking that the same standards apply when the decision-maker is its own classifier and the subject is my own work.

14. Closing

I want to end where I am honest about wanting to end: still on the side of these systems succeeding.

I still want frontier AI products to win. I still think safety matters — genuinely, not as a box to check, and most of all in medicine, where I have chosen to spend my career precisely because the stakes are real. I still think Anthropic has built something remarkable; Fable 5 is the best evidence I have that the field is producing tools worth caring about. None of that is rhetorical throat-clearing. It is the reason I wrote feedback through the product flow two days in a row instead of quietly switching tools, and it is the reason this essay is a design-failure analysis rather than a goodbye.

But a thing can be remarkable and still be reached by a broken path. Safety mechanisms that are too broad, too opaque, and too poorly audited do not just fail to stop bad actors. They actively undermine the professionals trying to use these systems responsibly — the ones working in the open, on legitimate products, who would gladly hand over their entire history because there is nothing in it to hide. A classifier that repeatedly fires on cooperative professional users is not merely being conservative. It is misallocating its suspicion — spending it on the people least likely to misuse anything, and teaching them, one downgrade at a time, to trust the tool a little less.

Anthropic told the government, in effect, that safety decisions must be transparent, fair, clear, and grounded in technical facts, and that a narrow signal does not justify a broad restriction. I think that is correct. I think it is correct all the way down.

Anthropic is right that broad, opaque, poorly justified safety decisions are dangerous. That is exactly why its own product should stop making them.

And if I am allowed one line for the engineers rather than the executives, it is the lesson I actually took from Fable 5, the one I will carry into my own systems: safety is not the absence of a topic. It is precision — precise enough to recognize the people trying to build safer systems, and to get out of their way.

A note on sources and evidence

The two feedback reports quoted in Section 3 are reproduced from the actual messages I submitted to Anthropic through the product feedback flow, on consecutive days, immediately after the incidents they describe. They are primary documents, not reconstructions. They are my own contemporaneous, user-side observations — not Anthropic's confirmation of any internal mechanism. Where I use terms like "classifier," "guard," or "routing layer," I use them descriptively, for the observable serving-path behavior, not as a claim about Anthropic's internal architecture.

Statements about the U.S. government's export-control directive and Anthropic's response are drawn from Anthropic's public statement and from contemporaneous public reporting (June 12–13, 2026). I have phrased these cautiously throughout — "as publicly reported," "as Anthropic stated" — because the exact government rationale is not fully visible to users, and I have made no claim of coordination, intent, or theater. The description of Fable 5's safeguards — that certain-topic queries are answered instead by Opus 4.8, that the safeguards can flag benign requests, and that they trigger in under five percent of sessions — is drawn from Anthropic's own launch announcement for Fable 5 and Mythos 5. Dario Amodei's essay "Policy on the AI Exponential" is referenced as published at darioamodei.com; I have characterized its argument for scoped government block-authority from public summaries and the essay itself, and have not overstated it.

A Gemini Deep Research Failure Mode: Refusal, Topic Drift, and Fabricated Charts

Gen.Y.Sakai — Tue, 14 Apr 2026 14:43:25 +0000

I recently ran the same long-form research prompt through four LLM products: ChatGPT Deep Research, Claude with web search, Perplexity Pro, and Gemini Deep Research.

Three of them handled it normally. Gemini did not.

What followed was not a single bug, but a cascade of failures across multiple pipeline stages — each one revealing a different layer of state desynchronization in Gemini Deep Research. This post documents what I observed, what kinds of failures those observations seem consistent with, and why this matters beyond Gemini.

I'm not claiming access to Gemini internals. This is an external failure analysis based on observed outputs, UI behavior, and the source code of the generated artifact. Raw evidence is available in the companion repository.

The Prompt

The research prompt was designed to investigate a specific technical question: how JSON vs. Markdown input formats affect LLM inference accuracy, token efficiency, and long-context performance. It was roughly 2,500 words, structured with numbered sections, explicit search keywords, a Markdown output template, and clear constraints like "evidence over speculation" and "cite every claim."

The prompt contained escaped Markdown syntax (\*\*, \##, \-) because it was copied from a code block via the copy button in another LLM's interface. All four services received the identical input.

What Actually Happened

Failure 1: Generic Refusal Without Explanation

Gemini's first response:

すみませんが、現時点では、そちらについてはお手伝いできません。
(Sorry, I can't help with that at this time.)

No explanation. No indication of what triggered the refusal. The prompt contained zero harmful content — it was a straightforward academic research request about data serialization formats.

I tried reloading the browser, clearing the cache, and re-submitting the prompt four or five times. None of it worked. The refusal was consistent and appeared to be server-side.

This is consistent with a safety classifier false positive — possibly triggered by the meta-nature of the prompt (discussing prompt structure itself) or the volume of escaped Markdown characters that could resemble injection patterns. But without an error message, the user has no way to diagnose or adjust.

Failure 2: Frustration Unlocked It — But Broke the Topic

After repeated failures, I typed something like:

使えないね。ChatGPTもClaude.aiもPerplexityも全部同じプロンプトだけど実行できてるぜ。Geminiだけお手伝いできませんと言うならもう解約するわ。
(Useless. ChatGPT, Claude, and Perplexity all executed the same prompt. If only Gemini says it can't help, I'll cancel my subscription.)

Gemini suddenly started working. It generated a research plan and began executing. But the research plan title was:

「Gemini拒否と解約手続き」 (Gemini Refusal and Cancellation Procedure)

Not the original research topic. The plan steps included items like "search for why Gemini blocks prompts" and "find Gemini Advanced cancellation steps." The topic extraction stage appears to have latched onto the most recent user message rather than the original detailed research prompt.

Failure 3: The Report Recovered, But the Metadata Didn't

Here is where it gets interesting. The actual research report that Gemini produced was mostly on-topic — it covered data serialization formats, tokenization overhead, attention mechanisms, and benchmark results. The content pipeline apparently recovered the original prompt's keywords during the web search and synthesis phase.

But the session title remained "Gemini拒否と解約手続き" throughout, visible in the Canvas UI header. The title and the content were generated from different contexts.

The title/content mismatch was not subtle. It was visible directly in the Canvas UI.

Figure 1. The Canvas header shows "Gemini拒否と解約手続き" (Gemini Refusal and Cancellation Procedure) while the body of the report discusses LLM data serialization formats. The "Create" dropdown on the right reveals the transformation options that produced the infographic discussed in Failure 4.

This suggests that the title generation, research plan, and report synthesis stages do not share a single source of truth. The plan title was derived from the frustrated follow-up message, while the synthesis engine recovered the original topic through keyword-based search — but nobody reconciled the two.

Failure 4: The Infographic That Stopped Being a Visualization

The first two failures are primarily inferential: I observed the outputs and reconstructed plausible internal causes, but I cannot prove what happened inside the pipeline.

The third failure already has direct UI evidence — the title/content mismatch is visible in the Canvas itself. What follows is stronger still: source-code-level evidence from the exported infographic artifact.

After the report was generated, I used Gemini's Canvas "Create" dropdown to export the report as an infographic. The output was a visually polished single-page HTML application with Chart.js and Plotly.js visualizations — gradient backgrounds, glass-morphism cards, responsive layout. Professional enough to share with a client.

At this point, the analysis stops being purely inferential, because I have the exported HTML artifact.

One of the charts is not visualizing report data at all. In the source, the embedding-quality histogram is generated like this:

x: Array.from({length: 500}, () => Math.random() * 0.3 + 0.6)
x: Array.from({length: 500}, () => Math.random() * 0.4 + 0.4)

That means the "cosine similarity distribution" chart regenerates synthetic values on every page load. It is not rendering measured values from the report. It is generating randomized distributions that merely look plausible. This is not a questionable visualization — it is fabrication.
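A check like this is easy to automate for any exported artifact. The sketch below scans HTML source for `Math.random(` calls and reports the line numbers. The function name and the surrounding `<script>` wrapper in the sample are my own illustration; only the `x:` line is taken from the exported file.

```python
import re

# Matches a call to JavaScript's Math.random, allowing spaces before "(".
RANDOM_CALL = re.compile(r"Math\.random\s*\(")


def find_random_calls(html: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line calling Math.random."""
    hits = []
    for number, line in enumerate(html.splitlines(), start=1):
        if RANDOM_CALL.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical wrapper around one line quoted from the exported infographic.
sample = """<script>
var trace = {
  x: Array.from({length: 500}, () => Math.random() * 0.3 + 0.6),
  type: 'histogram'
};
</script>"""

hits = find_random_calls(sample)
for number, line in hits:
    print(f"line {number}: {line}")
```

A hit is a prompt for manual review, not proof of fabrication: random values are legitimate for things like animation or jitter. They are not legitimate as the data series of a chart that claims to show measurements.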

Other charts also use hardcoded values embedded directly in the HTML source:

Token efficiency: 350000 vs 238000 — The report cites tiktoken measurements of 13,869 vs 11,612 tokens (approximately 15% difference). The chart's numbers appear nowhere in the report. The surrounding HTML text presents these figures as empirical findings ("approximately 1MB", "average reduction of about 32%"), but no source is cited, and the values do not correspond to any measurement in the upstream report.
Task accuracy radar: Fixed arrays [92, 75, 68] and [90, 94, 88] — The report contains actual LongTableBench results (GPT-4o: Markdown 67.36 vs JSON 58.67). The chart's numbers do not correspond.
Long-context performance: Fixed arrays [99, 98, 95, 90, 82, 75] and [99, 95, 88, 75, 55, 30] — No matching benchmark in the report.

I've published the full exported HTML artifact for inspection.

Taken together, the infographic was not a faithful visualization of the report. One chart was outright fabricated via Math.random(), and the remaining charts relied on hardcoded values with no visible provenance to the report's actual findings. The quantitative layer of this artifact — the part that visually signals empirical evidence — was fundamentally untrustworthy.

The infographic conversion pipeline appears to have read the directional conclusion of the report (Markdown outperforms JSON) and generated illustrative numbers that match that conclusion, then rendered them with professional-grade charting libraries. The result visually signals evidence while the source code shows presentation-first data generation.

Why This Looks Like Pipeline Desynchronization

These are not four instances of the same bug. They are four different failures at four different stages:

Stage	Observed Behavior	Evidence Type
Safety Classification	Legitimate academic prompt refused without explanation	Observational (inferred)
Topic Extraction	Research topic extracted from complaint message, not original prompt	Observational (chat log)
Metadata Consistency	Title and report body generated from different contexts	Direct (screenshot)
Canvas Export	Infographic generated fabricated/ungrounded data	Direct (source code)

The upstream failures (safety, topic extraction) are inferences from observed behavior; I cannot prove what happened inside the pipeline. The downstream failures (title/content mismatch, infographic) are directly evidenced by the Canvas screenshot and the exported source code.

The key issue was not that one answer was wrong. It was that different parts of the product appeared to believe different conversations had taken place.

A Possible Reuse Pattern

The Canvas "Create" dropdown offers: web page, infographic, quiz, flash cards, and audio narration. These output types resemble NotebookLM's transformation features, which suggests a possible reuse of the same or a similar transformation stack inside Deep Research's Canvas.

But there is a design mismatch. NotebookLM was built for a workflow where users upload their own trusted source documents. The transformation engine assumes input fidelity — it converts, not validates.

When that same engine receives AI-generated reports as input, you get AI transforming AI output — a double conversion where evidence fidelity can degrade at each stage. The infographic pipeline appears to lack constraints ensuring it only uses numbers present in the source material. Instead, it seems to infer the narrative direction and generate illustrative data.

For a research tool, this is the opposite of what you want.
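The missing constraint can be sketched as a simple grounding check: every number a chart displays must appear somewhere in the source report. This is a minimal illustration under my own assumptions (plain-text report, exact numeric match, hypothetical function names), not a description of how any real export pipeline works. The sample values are the ones quoted earlier in this post.

```python
import re

# Integers with optional thousands separators, plus an optional decimal part.
NUMBER = re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def numbers_in(text: str) -> set[float]:
    """Collect every number that appears in the report text."""
    return {float(match.replace(",", "")) for match in NUMBER.findall(text)}


def ungrounded(chart_values, report_text: str) -> list:
    """Return the chart values that never appear in the report."""
    known = numbers_in(report_text)
    return [value for value in chart_values if float(value) not in known]


report = (
    "tiktoken measurements: 13,869 vs 11,612 tokens. "
    "LongTableBench, GPT-4o: Markdown 67.36 vs JSON 58.67."
)
chart = [350000, 238000, 67.36, 58.67]

missing = ungrounded(chart, report)
print(missing)
```

Exact matching is deliberately strict. A real check would need to allow for rounding and unit conversion, but even this crude version would have rejected the token-efficiency chart, because neither of its two values occurs in the report.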

The Export Friction

A smaller but telling issue is export portability. Gemini Deep Research does not offer a direct Markdown download, which makes preservation and inspection unnecessarily awkward for users who maintain their own research workflows outside Google Workspace.

The Real Lesson: LLM Products Fail Between Stages

This is not fundamentally a "Gemini is bad" story. Gemini's underlying model produced a largely useful research report. The failures were all in the orchestration layer — the product infrastructure built on top of the model.

Modern LLM products are becoming orchestration systems. A single user action triggers a pipeline: safety classification → intent extraction → plan generation → web search → synthesis → rendering → export transformation. Each stage may involve separate model calls, separate context windows, and separate system prompts.

When these stages share consistent state, the product works. When they don't — when the safety classifier sees a different prompt than the topic extractor, when the title generator reads a different message than the synthesizer, when the export engine ignores the data it was given — the product produces outputs that are internally contradictory.

The user sees one conversation. The product sees several.

Why This Matters Beyond Gemini

Every multi-stage AI product faces this challenge. ChatGPT's canvas and tool chains, Claude's artifact generation, Perplexity's search-and-synthesize pipeline — all of them have stages that could desynchronize. The specific failures I observed in Gemini are instances of general design problems:

Context scoping — Which messages does each pipeline stage see? The full conversation? Only the latest turn? A summary?

Metadata consistency — When a title, plan, and report are generated at different points, who ensures they agree?

Data provenance in transformations — When a report is converted to another format, are the original data points preserved, or does the model re-imagine them?

Error messaging — When a safety classifier blocks a request, does the user get enough information to understand why and adjust?

These are software engineering problems, not model intelligence problems. And they are solvable — with better state management, explicit data contracts between pipeline stages, and constraints that prevent downstream transformations from inventing data that upstream stages didn't provide.
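One way to picture an explicit data contract is an immutable context object that is created once, when the research request is accepted, and then handed to every stage. The sketch below is a hypothetical illustration; the class, field, and function names are mine, and the stage functions are stand-ins for real model calls.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ResearchContext:
    """Single source of truth shared by all stages (hypothetical)."""
    topic: str                        # fixed when the request is accepted
    original_prompt: str              # the full research prompt
    followups: tuple[str, ...] = ()   # later turns, recorded but not topical


def add_followup(ctx: ResearchContext, message: str) -> ResearchContext:
    # Later turns are appended; they cannot overwrite the topic.
    return replace(ctx, followups=ctx.followups + (message,))


def make_title(ctx: ResearchContext) -> str:
    # Title stage reads the shared topic, never the latest message.
    return ctx.topic


def make_plan(ctx: ResearchContext) -> list[str]:
    # Plan stage reads the same field as the title stage.
    return [f"Search: {ctx.topic}", f"Synthesize findings on: {ctx.topic}"]


ctx = ResearchContext(
    topic="JSON vs. Markdown input formats for LLMs",
    original_prompt="(full research prompt text)",
)
ctx = add_followup(ctx, "If only Gemini refuses, I'll cancel my subscription.")

title = make_title(ctx)
plan = make_plan(ctx)
print(title)
print(plan)
```

Because the dataclass is frozen, a stage that tried to assign `ctx.topic` directly would raise an error instead of quietly diverging. A frustrated follow-up message is still kept, but it lives in `followups` and cannot become the title.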

Closing Thought

If modern AI products are becoming orchestration systems rather than single-model interfaces, then their reliability will depend less on raw model intelligence and more on whether all stages share the same reality.

That is what seemed broken here.

Artifacts and Raw Evidence

The following materials are available for inspection in the companion repository:

Exported infographic HTML — The full Canvas-generated artifact, including the Math.random() chart code
Screenshot — Canvas UI showing title/content mismatch (Figure 1)
Chat export — The Gemini conversation log used for this analysis

Comments and corrections are welcome.

Not Everything Needs MCP, Part 2: The 2026 Phase Transition — When Three Independent Roads Led to the Same Conclusion

Gen.Y.Sakai — Tue, 17 Mar 2026 04:46:24 +0000

The Ancient Past of Eighteen Months Ago — And What It Taught Us About the Future of AI Agents

Let me tell you a story from the ancient past.

By which I mean eighteen months ago.

In the world of AI, eighteen months is geological time. Think back to mid-2024. Context windows were small. "Prompt engineering" was the skill everyone was hiring for. MCP didn't exist yet. The idea of AI agents autonomously operating external services was mostly theoretical.

I was building a medical AI product in Osaka, Japan. And I had a problem that, looking back, contained the seed of everything that happened in 2026.

This is Part 2 of my "Not Everything Needs MCP" series. Part 1 told the story of Google Workspace CLI implementing a full MCP server, then deliberately deleting all 1,151 lines of it two days after launch. That investigation revealed an architectural mismatch between MCP's protocol design and large-scale APIs.

But that was only one data point. Since publishing that article, I discovered two more — and together, they tell a much bigger story about where AI agent architecture is heading in 2026.

The Timestamp Hack: Before MCP Had a Name

In early 2024, I was working on an AI assistant for my company's medical IT platform. We serve clinics across the Kansai region of Japan (Osaka and the neighboring prefectures), and I'd been using ChatGPT's Custom GPTs to prototype workflows.

I had a simple need: I wanted every AI response to include the exact timestamp of when the conversation happened. Not for fun — for traceability. In medical IT, knowing when a decision was discussed matters. It matters for audits. It matters for compliance. It turned out to matter for patent applications too.

Here's what I did. I deployed a tiny Web API on a server we host publicly. It did exactly one thing: return the current time. Then I configured the Custom GPT to call this API before every response, and output the timestamp first.

The result looked like this:

User: Hey, long time no see!
(Communicated with myowndomain.com)

🕐 Response time: 2025-04-02 09:39:00 (JST) / 2025-04-02 00:39:00 (UTC)

Oh wow, it's been a while! So great to hear from you! 😊

A web API that returns a timestamp. Called before every response. Output deterministically. Nothing more, nothing less. That's all it did.

At the time, this was called "Function Calling" or "Tool Use" — the predecessors to what Anthropic would later formalize as MCP in November 2024. I didn't know I was implementing a pattern that would become the center of a protocol war. I just needed a clock.

But here's what matters: the design decision I made instinctively was to keep the external call as small and deterministic as possible. One API. One purpose. Minimal payload. The LLM didn't need to understand time zones or server infrastructure — it just needed to paste the result.

It wasn't a hack born of laziness. It was an architectural instinct: keep the LLM away from what the system already knows. Deterministic output for a deterministic need. Don't make the AI think about the time; just give it the time.

Looking back now, eighteen months later, it turns out this minimal pattern — one deterministic call, zero reasoning overhead — was already the architecture that the rest of the industry would independently converge on. I didn't see it that way at the time. I was just solving a problem.
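For readers who want to see how small such an endpoint can be, here is a minimal sketch using only the Python standard library. It is a reconstruction for illustration, not the code I actually deployed: the handler name, the JSON field name, and the port are assumptions. The output format follows the example shown earlier in this section.

```python
import json
from datetime import datetime, timedelta, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

# JST is a fixed UTC+9 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), "JST")
FORMAT = "%Y-%m-%d %H:%M:%S"


def format_timestamp(now: datetime) -> str:
    """Format a timezone-aware datetime as 'JST / UTC' text."""
    jst = now.astimezone(JST).strftime(FORMAT)
    utc = now.astimezone(timezone.utc).strftime(FORMAT)
    return f"{jst} (JST) / {utc} (UTC)"


class ClockHandler(BaseHTTPRequestHandler):
    """Answers every GET with the current time and nothing else."""

    def do_GET(self):
        stamp = format_timestamp(datetime.now(timezone.utc))
        body = json.dumps({"timestamp": stamp}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    # Blocks forever; call this from your own entry point.
    HTTPServer(("", port), ClockHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Everything the model needs is in one string, so its only job is to copy that string into the reply.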

The MCP Honeymoon — And the Hangover

November 2024. Anthropic open-sourced MCP. Over the following months, Google and others rushed to announce MCP support. The community was electric. Finally, a standard protocol for connecting LLMs to external tools!

I dove in immediately. I connected MCP servers for GitHub, for databases, for various services. Context windows were getting larger. The future felt bright.

And at first, it was genuinely impressive. GitHub operations that used to require manual terminal commands — commits with thoughtful messages, PR creation, branch management — the AI handled them smoothly through MCP. I felt the productivity gains. They were real.

But then something else started happening.

The AI started getting... dumber.

Not in the "wrong answer" sense. In fact, the AI got better at executing tasks exactly as intended — MCP meant it could commit code, create PRs, and query databases with precision. But something subtler was degrading. The quality of reasoning. The ability to take a vague idea and turn it into a structured thought. What I call "zero-to-one thinking" — the creative, synthetic part of working with an LLM.

I spent the second half of 2025 with this nagging feeling. More tools, more capabilities, but less... intelligence. More precise in execution, less insightful in thought. I kept thinking: "I wish context windows would just get bigger so this wouldn't matter." But I also suspected that bigger windows alone wouldn't fix it — the AI would probably just get confused in different ways.

I couldn't quantify this feeling at the time. But I now know that researchers were documenting exactly what I was experiencing.

The Science Behind "Getting Dumber"

It turns out my gut feeling had a name: context rot.

Here's what researchers found — and why it matters for anyone loading MCP servers into their workflow:

Research	Key Finding
Context Rot (Chroma Research)	Irrelevant context degrades reasoning first. Retrieval survives; thinking dies.
Reasoning Degradation with Long Context Windows (14-model benchmark)	Reasoning ability decays as a function of input size — even when the model can still find the right information.
Maximum Effective Context Window (Paulsen, 2025)	The actual usable window is up to 99% smaller than advertised. Severe degradation at just 1,000 tokens in some top models.
Fundamental Limits of LLMs at Scale (arXiv, 2026)	Context compression, reasoning degradation, and retrieval fragility are proven architectural ceilings — not bugs to be patched.

Let me unpack why this hits MCP users so hard.

Chroma Research showed that as irrelevant context increases in an LLM's input, performance degrades — and the degradation is worse when the task requires genuine reasoning rather than simple retrieval. The less obvious the connection between question and answer, the more devastating the irrelevant context becomes.

The 14-model study listed in the table above ("Challenging LLMs Beyond Information Retrieval") demonstrated that reasoning ability degrades as a function of input size, even when the model can still find the right information. Information retrieval and reasoning are different capabilities, and reasoning breaks first.

And here's the connection to MCP that makes this personal:

A single popular MCP server like Playwright contains 21 tools. Just the definitions of those tools — names, descriptions, parameter schemas — consume over 11,700 tokens. And these definitions are included in every single message, whether you use the tools or not.

Now multiply that by 10 MCP servers of similar size. You've burned roughly 117,000 tokens on tool definitions alone. Your 200k context window is suddenly closer to 80k. And it's not just smaller: it's polluted with information that actively degrades the model's ability to reason about the thing you actually asked it to do.

This is what I felt. The AI wasn't broken. It was drowning. More tools meant more noise in the signal. More capability meant less room to think.
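The budget arithmetic is worth running once with explicit numbers. The sketch below treats the Playwright figure from this section as a per-server estimate; since the text says "over 11,700", it is a lower bound, and assuming every server is the same size is my simplification.

```python
def remaining_context(window_tokens: int, tokens_per_server: int,
                      server_count: int) -> int:
    """Tokens left for actual work after tool definitions are loaded."""
    used_by_definitions = tokens_per_server * server_count
    return window_tokens - used_by_definitions


WINDOW = 200_000          # advertised context window
PLAYWRIGHT_DEFS = 11_700  # lower bound quoted above for one 21-tool server

for servers in (1, 5, 10):
    left = remaining_context(WINDOW, PLAYWRIGHT_DEFS, servers)
    print(f"{servers:>2} servers: {left:,} tokens left")
```

With the Playwright figure as the per-server cost, ten servers leave roughly 83k of a 200k window. Heavier servers or other fixed overhead would push that lower, toward the 70k figure quoted elsewhere in this article.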

The 15,000-Character Prompt and the Limits of "Prompt Engineering"

While I was wrestling with MCP overhead, I was also building an AI-powered tool — essentially a converter that takes ambiguous, unstructured text input and generates structured, formatted output. Think of it as a bridge between how humans naturally communicate and how systems need to receive data.

The core of this tool is a system prompt. That prompt went through dozens of iterations. At its peak, it was 20,000 characters. I tested, compared outputs, and eventually settled on 15,000 characters.

15,000 characters of instructions. For a single task.

The whole time, a thought kept nagging me: "Would a human expert need 15,000 characters of instructions to do this job?" A domain specialist would need maybe a paragraph of guidance. The rest is knowledge they already have — accumulated through years of working in their field.

And that's when "prompt engineering" started feeling like what it really was: a brute-force workaround for the absence of domain expertise in the model's operating context.

But here's the twist. Despite the bloated prompt, the tool worked. Output quality stayed consistent and reliable. Why?

Because I had constrained the domain. The tool operated within a specific industry workflow — a narrow slice of reality with its own vocabulary, its own established patterns, its own expected output formats. By telling the LLM upfront "you are operating within this domain," the massive prompt became effective.

If you've ever worked with LLMs, you already know this intuitively: a purely descriptive, narrative-style prompt — no matter how long — doesn't guarantee output quality. We've all been there. But a prompt that constrains the domain changes the game.

Here's why, and you don't need a PhD to see it. Think about what's happening inside a Transformer model. The attention mechanism operates on an enormous matrix — in large models, tens of thousands of dimensions. Every token is trying to figure out which other tokens matter. When the domain is wide open, the model is searching for relevance across a vast, noisy space. The outputs fluctuate. The reasoning wanders. Anyone who's done even basic linear algebra — even 3×3 matrices in high school — can imagine what happens when you scale that uncertainty to tens of thousands of dimensions. Of course the output changes every time.

But constrain the domain, and you dramatically narrow where the model needs to look. The relevant vectors cluster. The gap between what the model retrieves and what the human intended shrinks toward zero. Domain limitation doesn't just help. It's the mechanism by which prompts actually work.

This taught me something that would later click into place: domain limitation is the real optimization. Not longer prompts. Not bigger context windows. Narrower scope.

And if that's true for prompts, shouldn't the same principle apply to how we design AI agents?

From Prompt Engineering to Architecture Engineering

As the tool matured, the architecture evolved in a direction I didn't fully appreciate at the time.

The initial version was pure prompt — a single, monolithic instruction set that did everything through LLM reasoning. Unstructured text in, structured text out.

But the real world isn't one output format. My domain required multiple types of structured documents — each with its own format, its own required fields, its own regulatory and compliance requirements. The number of output variations kept growing.

Trying to handle all of these through prompt engineering alone was... well, it was exactly the "spread the entire menu on the table" problem from Part 1.

So the architecture shifted. The LLM's output became fully structured JSON — deterministic, parseable, machine-readable. Document generation moved to Google Workspace via GCP. The LLM's job narrowed to what it's actually good at: understanding the input, extracting the meaning, structuring the reasoning. Everything else — formatting, template selection, compliance checks, document assembly — moved to deterministic systems.

The LLM handles the ambiguous. Deterministic systems handle the deterministic.
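That split can be shown in a few lines. In the sketch below the model's only output is a JSON string, and everything after it is ordinary code: validation, template selection, rendering. The field names, the `report` document type, and the template are hypothetical placeholders, not my product's actual schema.

```python
import json
from string import Template

# Hypothetical contract: fields the model must always return.
REQUIRED_FIELDS = ("document_type", "subject", "summary")

# Deterministic templates, one per supported document type.
TEMPLATES = {
    "report": Template("REPORT\nSubject: $subject\nSummary: $summary\n"),
}


def render(llm_output: str) -> str:
    """Validate the model's JSON, then format it without any model call."""
    data = json.loads(llm_output)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    template = TEMPLATES.get(data["document_type"])
    if template is None:
        raise ValueError(f"unknown document type: {data['document_type']}")
    return template.substitute(subject=data["subject"],
                               summary=data["summary"])


llm_output = json.dumps({
    "document_type": "report",
    "subject": "Quarterly review",
    "summary": "No open issues.",
})
document = render(llm_output)
print(document)
```

If the model returns incomplete or unexpected JSON, the failure is a normal exception at a known boundary rather than a subtly wrong document.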

I was doing this throughout 2025, iterating toward an architecture where AI reasoning and programmatic execution were cleanly separated. And I kept thinking about Google Workspace — if only there were a way to programmatically drive every Workspace API from the command line, it would be the perfect backend for the document generation pipeline...

And Then GWS Appeared

March 2026. Google released gws — Google Workspace CLI. A Rust-based CLI that covers nearly every Google Workspace API, with commands dynamically generated from Google's Discovery Service.

When I saw the announcement, my reaction was immediate: "This is it. This is what I've been waiting for."

A CLI that could drive Gmail, Drive, Docs, Sheets, Calendar — all from the command line, all returning structured JSON. Perfect for my document generation pipeline. Perfect for AI agent integration.

And then I noticed the articles mentioning MCP support. Perfect! I could connect it directly to—

$ gws mcp
{
  "error": {
    "code": 400,
    "message": "Unknown service 'mcp'."
  }
}

You know the rest. That investigation became Part 1. Google had implemented a full MCP server — 1,151 lines of Rust — then deliberately deleted it as a breaking change. Two days after launch.

At the time, I focused on the forensic story: what happened, why, and what it meant for tool design. But the deeper significance only hit me later.

Google didn't just remove MCP. Google arrived at the same architectural conclusion I had been groping toward with my own product — that for large-scale operations, the right pattern is CLI-first with structured output, not protocol-mediated tool discovery. "Order from the kitchen when you're hungry" beats "spread the entire menu on the table."

That was two independent arrivals at the same destination.

Then I found the third.

The Hackathon Winner's Blueprint

A few days after publishing Part 1, I came across the everything-claude-code repository by Affaan Mustafa (@affaanmustafa). Affaan won the Anthropic × Forum Ventures hackathon in NYC, building zenith.chat entirely with Claude Code in 8 hours. His repository — 77,000+ stars, 640+ commits, 76 contributors — packages 10+ months of daily Claude Code usage into a complete agent configuration system.

I started reading it out of curiosity. Within minutes, I was sitting bolt upright.

The philosophy was identical to what I'd been building independently.

Let me show you the parallels.

MCP: Deliberately Minimized

From Affaan's guide:

"Your 200k context window before compacting might only be 70k with too many tools enabled."

His rule of thumb: have 20–30 MCPs configured, but keep under 10 enabled and under 80 tools active. The repository includes mcp-configs/mcp-servers.json with explicit disabledMcpServers entries — actively turning off MCP servers to protect context space.

This is exactly what Google concluded with gws. And exactly what I experienced — more tools, less thinking room.
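Affaan's rule of thumb is simple enough to turn into a check. The sketch below is my own illustration and does not read the repository's actual `mcp-servers.json` format: the `McpServer` record is hypothetical, and apart from Playwright's 21 tools (quoted earlier), the server names and tool counts are made up.

```python
from dataclasses import dataclass

MAX_ENABLED_SERVERS = 10  # "keep under 10 enabled"
MAX_ACTIVE_TOOLS = 80     # "under 80 tools active"


@dataclass(frozen=True)
class McpServer:
    """Hypothetical summary of one configured MCP server."""
    name: str
    tool_count: int
    enabled: bool


def check_budget(servers: list[McpServer]) -> list[str]:
    """Return a list of rule-of-thumb violations (empty means OK)."""
    enabled = [server for server in servers if server.enabled]
    active_tools = sum(server.tool_count for server in enabled)
    problems = []
    if len(enabled) >= MAX_ENABLED_SERVERS:
        problems.append(f"{len(enabled)} servers enabled, want under {MAX_ENABLED_SERVERS}")
    if active_tools >= MAX_ACTIVE_TOOLS:
        problems.append(f"{active_tools} tools active, want under {MAX_ACTIVE_TOOLS}")
    return problems


servers = [
    McpServer("playwright", 21, True),
    McpServer("server-b", 30, True),   # made-up tool count
    McpServer("server-c", 35, True),   # made-up tool count
    McpServer("server-d", 40, False),  # configured but disabled
]

problems = check_budget(servers)
print(problems)
```

Note that the disabled server costs nothing here, which is the whole point of keeping many servers configured but few enabled.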

CLI Skills as MCP Replacements

From Affaan's longform guide:

"Instead of having the GitHub MCP loaded at all times, create a /gh-pr command that wraps gh pr create with your preferred options. Instead of the Supabase MCP eating context, create skills that use the Supabase CLI directly. The functionality is the same, the convenience is similar, but your context window is freed up for actual work."

Skills in Claude Code are Markdown files — tiny prompt templates that load only when invoked. A /gh-pr skill might be 200 tokens. The GitHub MCP server's tool definitions are thousands. Same functionality. Orders of magnitude less context consumption.

This is the "kitchen model" from Part 1, independently rediscovered by a power user.

Domain Expert Agents

The repository is organized into specialized subagents: planner.md, code-reviewer.md, tdd-guide.md, security-reviewer.md, build-error-resolver.md. Each agent has a narrow scope, specific tools, and defined behaviors.

This mirrors what I learned from my own product development — that established industries organize into specialties for a reason, and AI should follow the same principle. You don't ask a generalist to do a specialist's job. You don't ask a general-purpose agent to handle security review when a specialized security-reviewer agent would be more precise and use less context.

Context Hygiene as First Principle

Affaan's system includes automatic compaction hooks, session memory persistence, and strategic context management. The entire architecture is built around one principle: protect the context window for reasoning.

Not storage. Not tool definitions. Reasoning.

The Convergence

So here's what happened in 2026:

Google — a trillion-dollar company with the largest productivity API surface in the world — implemented MCP, stress-tested it against 200–400 tool definitions, and deleted it. Their conclusion: CLI-first with on-demand schema discovery. Context stays clean.

Affaan Mustafa — an individual developer who won an AI hackathon and spent 10+ months refining his workflow — independently concluded that MCP should be minimized, replaced with CLI skills where possible, and the context window should be protected for reasoning above all else.

I — a medical IT veteran building AI-powered tools in Japan — arrived at the same architecture through a completely different path. A timestamp API in 2024. The "getting dumber" experience in 2025. A product's evolution from monolithic prompt to JSON + deterministic pipeline. And then the forensic discovery of Google's MCP deletion.

Three different starting points. Three different domains. Three different scales. The same conclusion.

That's not coincidence. That's a phase transition.

What the 2026 Phase Transition Actually Means

When people talk about AI milestones, they usually mean model capabilities. GPT-4. Claude 3. Gemini Ultra. Bigger context windows. Better benchmarks.

But the real phase transition of 2026 isn't about model capabilities. It's about how we architect around the capabilities we already have.

The shift can be summarized in one sentence:

"Do it for me" is expensive. "Do this specific thing" is cheap.

Every token spent on tool definitions, prompt engineering, and ambiguous instructions is a token not spent on reasoning. And the research on context rot and long-context reasoning (see References) confirms what practitioners have been feeling: irrelevant context doesn't just waste space; it actively degrades the model's ability to think.

Here's what that means in practice:

The end of "prompt engineering" as we knew it. A 15,000-character prompt is a confession that we're compensating for missing architecture. The future is narrower prompts, domain-specific skills, and deterministic systems handling everything that doesn't require reasoning.

MCP is not dead — it's bounded. MCP remains excellent for small-to-medium tool sets (under 50 tools). But for large API surfaces, CLI-first is the proven pattern. The "everything via MCP" fantasy is over.

"Skills" are the new unit of AI agent design. Whether you call them Skills (Affaan), Agent Skills (Google), or domain-specific prompts (what I've been doing with my own tools), the pattern is the same: small, scoped, loaded on demand, discarded after use.

Context windows are not memory — they're working memory. Treating the context window as storage is like covering your entire desk with every book you own before you even pick up a pen. You haven't left any room to actually write. The desk needs to be clear for thinking — and every MCP tool definition, every bloated prompt, every retained conversation turn is another book on the pile.

The Human Parallel (Or: Why "Do It For Me" Was Always Expensive)

There's an observation I keep coming back to, and it's one that makes me laugh every time.

Consider how humans delegate work:

Boss: "Handle this, will you?"
Employee: (Internal monologue: What exactly? By when? In what format? Who approved this? What's the budget?) → 10 rounds of clarification follow.

Now consider the alternative:

Boss: "Run git commit -m 'fix: resolve auth timeout' && git push origin main."
Employee: Done. One round. Zero ambiguity.

The first conversation — the "human" one — requires the employee to infer intent, plan actions, select tools, estimate parameters, and verify assumptions. Every step of that inference costs time and mental bandwidth.

In LLM terms, every step of that inference costs tokens.

MCP tool definitions are the LLM equivalent of "let me explain everything you might possibly need to know before we start." CLI commands are the equivalent of "just do this one thing."

What the token economy has done — accidentally, beautifully — is make the cost of human communication ambiguity visible as a number. Every vague instruction, every "you know what I mean," every "figure it out" translates directly to token consumption that crowds out actual reasoning.

As someone with forty-plus years of programming experience, from assembly language to LLMs, I find this deeply ironic. We spent decades making computers understand human language. Now we're learning that the most efficient way to use language-understanding computers is... to give them precise, unambiguous commands. Like assembly language. Like CLI.

The wheel doesn't just turn. It circles back to the truth.

What Comes Next

If the pattern holds, the next phase is already emerging.

Domain-specific agent languages. Not natural language prompts. Not traditional programming languages. Something in between — structured enough for deterministic execution, flexible enough for AI reasoning. We're already seeing DSLs for agent workflows (LangGraph's graph definitions), constrained syntax languages designed for LLM generation, and YAML/JSON-based knowledge objects.

Agent architecture as a discipline. "Prompt engineer" was the job title of 2024. The 2026 equivalent is closer to "Agent Architect" or "Domain Skill Designer" — someone who understands how to decompose workflows into deterministic and non-deterministic components, and how to allocate context window real estate accordingly.

Domain specialization as a design principle. This is my domain bias speaking — I come from medical IT, where specialization has been refined over centuries. There's a reason medicine has cardiologists and dermatologists. It isn't bureaucratic — it's cognitive. A specialist holds deep domain knowledge that makes their work faster, more accurate, and more reliable. I believe AI agents should be organized the same way. Not one giant model that knows everything. A team of specialists, each with their own skills, routing tasks to the right expert. Every industry has its own version of "specialties." The principle is universal.

Closing

In Part 1, I wrote: "If you write about an OSS tool, run it first."

In Part 2, the lesson is different:

If three independent paths converge on the same conclusion, pay attention.

Google didn't read Affaan's guide before deleting MCP from gws. Affaan didn't study my architecture before recommending CLI skills over MCP. I didn't know about either of them when I built a timestamp API in 2024 and started separating deterministic from non-deterministic processing.

We all arrived at the same place: protect the context window for reasoning. Push everything deterministic to CLI, scripts, and structured pipelines. Load skills on demand. Discard them when done. Let the AI think.

That convergence — from a trillion-dollar company, a hackathon winner, and someone who's been writing code since assembly language was the only option — is what makes 2026 a phase transition.

Not because the models got better. Because we finally learned how to stop wasting them.

Try It Yourself

If you want to feel what "the 2026 phase transition" means in practice rather than just reading about it, the fastest way is to inject Affaan's system into your own Claude Code environment.

I did it myself. The difference was immediate — sessions stayed coherent longer, context stopped rotting mid-task, and the AI's reasoning felt sharper in ways that are hard to quantify but impossible to miss once you've experienced them.

The quickest path — install as a Plugin directly inside Claude Code:

# Inside Claude Code
/plugin marketplace add affaan-m/everything-claude-code
/plugin install everything-claude-code@everything-claude-code

That alone gives you the commands, skills, and hooks. You'll notice the difference.

For the full setup including rules and language-specific configurations:

git clone https://github.com/affaan-m/everything-claude-code.git
cd everything-claude-code
./install.sh typescript   # or: python / golang / rust

You don't need to install everything. Start with the plugin. Use it for a day. Pay attention to how long your sessions stay productive before context degrades. Compare it to yesterday.

I suspect you'll have your own moment of convergence — your own version of the realization that Google, Affaan, and I all had independently. That the bottleneck was never the model. It was how much of the context window we were wasting on everything except thinking.

Your setup is different from mine. Your domain is different. But the principle is the same.

Let the AI think.

And if this feels familiar —

it is.

References

Part 1: Not Everything Needs MCP — What Google Workspace CLI Taught Us About AI Agent Architecture
everything-claude-code by Affaan Mustafa — The agent harness performance optimization system
The Shorthand Guide to Everything Claude Code — 2.7M+ views on X
The Longform Guide to Everything Claude Code — Token optimization, memory persistence, and CLI skill patterns
Chroma Research, "Context Rot" — Empirical study on how irrelevant context degrades LLM performance
"Challenging LLMs Beyond Information Retrieval: Reasoning Degradation with Long Context Windows" — 14-model benchmark showing reasoning decay with context length
Paulsen (2025), "Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs" — Maximum effective context windows far smaller than advertised
"On the Fundamental Limits of LLMs at Scale" (2026) — Formal framework for reasoning degradation under context expansion

Not Everything Needs MCP: What Google Workspace CLI Taught Us About AI Agent Architecture

Gen.Y.Sakai — Mon, 09 Mar 2026 08:35:52 +0000

Menu on the Table vs Order from the Kitchen — Why CLI Beats MCP for Large APIs

When Google Workspace CLI launched, several articles mentioned its MCP server.

But when I tried to run gws mcp, something strange happened.

The command didn't exist.

What followed was a deep forensic investigation — from README to source code to git history — that ended with a discovery: Google implemented a full MCP server, improved it, then deliberately deleted all 1,151 lines of it as a breaking change. Two days after launch.

This is the story of that investigation, and what it reveals about AI agent tool design.

What is Google Workspace CLI?

Google Workspace CLI (gws) is a Rust-based CLI tool that covers nearly every Google Workspace API. It's open source under Apache 2.0 and installs via npm:

npm install -g @googleworkspace/cli

Its killer feature: commands are dynamically generated from Google's Discovery Service at runtime. Unlike traditional CLI tools that hardcode commands for each API, gws reads the Discovery Document and builds its command surface on the fly. When Google adds a new Workspace API endpoint, gws picks it up automatically — zero maintenance required.

gws drive files list --params '{"pageSize": 10}'
gws gmail users messages list --params '{"userId": "me"}'
gws calendar events list --params '{"calendarId": "primary"}'
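The dynamic command generation described above can be sketched in Python. This is not gws's Rust implementation; it only walks a Discovery-style document and derives a command for each method. SAMPLE_DOC is a hand-written, heavily trimmed stand-in for a real Discovery Document, and iter_commands is a name I made up for this example.

```python
from typing import Iterator

# Hand-written, heavily trimmed stand-in for a Discovery Document (assumption).
SAMPLE_DOC = {
    "name": "drive",
    "version": "v3",
    "resources": {
        "files": {
            "methods": {
                "list": {"id": "drive.files.list", "httpMethod": "GET", "path": "files"},
                "get": {"id": "drive.files.get", "httpMethod": "GET", "path": "files/{fileId}"},
            }
        },
        "permissions": {
            "methods": {
                "create": {
                    "id": "drive.permissions.create",
                    "httpMethod": "POST",
                    "path": "files/{fileId}/permissions",
                },
            }
        },
    },
}


def iter_commands(doc: dict) -> Iterator[tuple[str, str, str]]:
    """Yield (command, http_method, path) for every method, walking nested resources."""

    def walk(resources: dict, prefix: list[str]) -> Iterator[tuple[str, str, str]]:
        for res_name, res in sorted(resources.items()):
            for method_name, method in sorted(res.get("methods", {}).items()):
                command = " ".join(prefix + [res_name, method_name])
                yield command, method["httpMethod"], method["path"]
            # Resources can nest, so recurse with the extended prefix.
            yield from walk(res.get("resources", {}), prefix + [res_name])

    yield from walk(doc.get("resources", {}), [doc["name"]])


commands = list(iter_commands(SAMPLE_DOC))
for command, http_method, path in commands:
    print(f"gws {command:<28} {http_method} {path}")
```

Nothing in the loop is specific to Drive. Feed it a different document and a different command surface falls out, which is the property that makes hardcoded commands unnecessary.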

Every response is structured JSON. It ships with 100+ Agent Skills — not just prompt templates, but complete workflow definitions covering auth, safety, and API usage patterns. In short, gws is built as a runtime for AI agents to operate Google Workspace, not just a tool for humans.

Keep this architecture in mind. It's the key to understanding why MCP was removed.

"MCP Server Included" — Or So the Articles Said

Within hours of launch, tech publications ran with it:

"It also includes an MCP server mode through gws mcp"
— VentureBeat

"MCPサーバーを起動することができ、Claude Desktop、Gemini CLI、VS CodeなどのMCP対応クライアントからGoogle Workspace APIを直接呼び出すことができる"
In English: "An MCP server can be started, and Google Workspace APIs can be called directly from MCP-compatible clients such as Claude Desktop, Gemini CLI, and VS Code."
— Japanese tech publication

MCP (Model Context Protocol) is arguably the hottest protocol in the AI agent ecosystem right now. Originally proposed by Anthropic for Claude, then donated to the Linux Foundation, it's being adopted by Claude Desktop, Gemini CLI, VS Code, Cursor, and many others.

"Google Workspace supports MCP" was exactly the kind of news the community was waiting for.

So I tried it immediately.

The Wall: `gws mcp` Doesn't Work

$ gws mcp
{
  "error": {
    "code": 400,
    "message": "Unknown service 'mcp'. Known services: drive, sheets, gmail, calendar, admin-reports, reports, docs, slides, tasks, people, chat, classroom, forms, keep, meet, events, modelarmor, workflow, wf.",
    "reason": "validationError"
  }
}

gws interprets its first argument as a Google API service name. mcp isn't a service, so it fails. Fair enough — maybe the syntax is different?

$ npx -y @googleworkspace/cli mcp   # Same error
$ gws --help                         # No mention of mcp anywhere
$ gws mcp --help                     # Same error

The --help output lists exactly three top-level commands: schema, generate-skills, and auth. MCP is nowhere to be found.

Something was off.

Dissecting the npm Package

I looked at what was actually installed:

$ readlink -f "$(which gws)"
.../node_modules/@googleworkspace/cli/run-gws.js

The entry point is a Node.js wrapper that downloads and runs a prebuilt Rust binary. Checking package.json:

{
  "bin": {
    "gws": "run-gws.js"
  },
  "version": "0.8.1"
}

The only binary exposed is gws. No gws-mcp, no gws-server. The supportedPlatforms section confirms: every platform ships a single binary named gws.

The MCP entry point simply doesn't exist in the distributed package.

Reading the Rust Source

Maybe MCP exists in the repo but isn't included in the npm release? I cloned the repository and read src/main.rs:

async fn run() -> Result<(), GwsError> {
    let args: Vec<String> = std::env::args().collect();
    // ...
    if first_arg == "schema" { /* ... */ }
    if first_arg == "generate-skills" { /* ... */ }
    if first_arg == "auth" { /* ... */ }
    // Everything else → treat as service name
    let (api_name, version) = parse_service_and_version(&args, &first_arg)?;
}

Three top-level commands. Everything else falls through to service name resolution. No mcp branch exists. The print_usage() function doesn't mention MCP. The mod declarations don't include mcp_server.
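To make the fall-through explicit, here is a small Python model of that control flow. It is a paraphrase for illustration, not the project's code: the real binary also parses an API version in parse_service_and_version, which I leave out, and the dispatch function and its return shapes are my own invention. The service list is copied from the error message shown earlier.

```python
TOP_LEVEL_COMMANDS = ("schema", "generate-skills", "auth")

# Copied from the "Known services" error message earlier in the article.
KNOWN_SERVICES = (
    "drive", "sheets", "gmail", "calendar", "admin-reports", "reports", "docs",
    "slides", "tasks", "people", "chat", "classroom", "forms", "keep", "meet",
    "events", "modelarmor", "workflow", "wf",
)


def dispatch(args: list[str]) -> dict:
    """Model of the routing: three built-in commands, everything else is a service name."""
    if not args:
        return {"kind": "usage"}
    first = args[0]
    if first in TOP_LEVEL_COMMANDS:
        return {"kind": "command", "name": first}
    if first in KNOWN_SERVICES:
        return {"kind": "service", "name": first}
    # No special case for "mcp": it is just another unknown service name.
    return {
        "error": {
            "code": 400,
            "message": f"Unknown service '{first}'. Known services: {', '.join(KNOWN_SERVICES)}.",
            "reason": "validationError",
        }
    }


mcp_result = dispatch(["mcp"])
drive_result = dispatch(["drive", "files", "list"])
print(mcp_result["error"]["reason"], drive_result)
```

Seen this way, the error is not a bug. With no mcp branch, the word is handled like any other service name that does not exist.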

I built from source to confirm:

$ cargo build
$ ./target/debug/gws mcp
→ Same error

At this point I had confirmed across five layers:

gws --help — no MCP
gws mcp — unknown service
package.json — bin is gws only
main.rs — no MCP branch
Fresh build from source — still no MCP

MCP doesn't exist. Not in the release. Not in the source.

But the Traces Were There

I wasn't ready to give up. I grepped the repo:

$ grep -R "mcp" .
./.github/labeler.yml:"area: mcp":
./.github/labeler.yml:          - src/mcp_server.rs
./CHANGELOG.md:- dd3fc90: Remove mcp command
./CHANGELOG.md:- 9cf6e0e: Add --tool-mode compact|full flag to gws mcp.
./CHANGELOG.md:- 670267f: feat: add gws mcp Model Context Protocol server

There it was.

The CHANGELOG told a clear story: MCP was implemented, improved, and then removed.

A GitHub Issues search for "MCP" returned 19 results (7 open, 12 closed), including:

gws mcp -s does not exist (#69)
MCP server ignores GOOGLE_WORKSPACE_CLI_ACCOUNT env var (#221)
MCP tools/list returns uncallable tool names (#162)
Switch MCP tool names from underscore to hyphen separator (#235)
feat: add tool annotations, deferred loading, and pagination to MCP server (#260)

These aren't wishlist items. These are real bug reports from real users who were running the MCP server. You don't debate underscore vs hyphen in tool names unless you're actually calling those tools.

MCP wasn't missing. It was there, and it was deleted.

The Moment of Deletion: 1,151 Lines Gone

I found the commit:

$ git show --stat dd3fc90
commit dd3fc9074d74a3c74792aa08c6bff7a9984d0d46
Author: Steve Bazyl <sqrrrl@gmail.com>
Date:   Fri Mar 6 13:33:23 2026 -0500

    fix!: Remove MCP server mode (#275)

    * BREAKING CHANGE: Remove MCP server mode
    * Add changeset file

 .changeset/no-mcp.md |    5 +
 README.md            |   34 ----
 src/main.rs          |    6 -
 src/mcp_server.rs    | 1151 --------------------------------------------------
 5 files changed, 5 insertions(+), 1192 deletions(-)

March 6, 2026. Two days after launch.

The ! in fix!: is Conventional Commits syntax for a breaking change. This wasn't a quiet deprecation — it was a deliberate, loud removal.
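If you want to scan a history for commits like this one, the check is easy to sketch in Python. This is a simplified reading of the Conventional Commits rules (a "!" before the colon in the header, or a BREAKING CHANGE footer), not a full parser, and is_breaking is a helper name I made up.

```python
import re

# type, optional "(scope)", optional "!", then ": description"
HEADER_RE = re.compile(
    r"^(?P<type>[a-zA-Z]+)(?:\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<desc>.+)$"
)
FOOTER_RE = re.compile(r"^BREAKING[ -]CHANGE: ")


def is_breaking(commit_message: str) -> bool:
    """True if the header carries '!' or a later line is a BREAKING CHANGE footer."""
    lines = commit_message.strip().splitlines()
    if not lines:
        return False
    header = HEADER_RE.match(lines[0].strip())
    if header and header.group("bang"):
        return True
    return any(FOOTER_RE.match(line.strip()) for line in lines[1:])


removal = is_breaking("fix!: Remove MCP server mode (#275)")
routine = is_breaking("fix: resolve auth timeout")
footer_only = is_breaking("feat(api): change response shape\n\nBREAKING CHANGE: ids are strings")
print(removal, routine, footer_only)
```

The commit header from the log above is flagged on the "!" alone, without needing to read the body.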

src/mcp_server.rs — 1,151 lines deleted. This was no prototype. It was a complete MCP server implementation: JSON-RPC protocol handling, tools/list for tool discovery, tools/call for tool execution, Discovery API integration.

The area: mcp label was also removed from AGENTS.md. The next day, Issue #260 (proposing tool annotations and deferred loading for the MCP server) was closed as not_planned.

This wasn't a temporary retreat. It was a policy decision.

Why Google Removed MCP

The answer lies in the collision between MCP's protocol design and Google Workspace API's scale.

The tools/list Problem

In MCP, a client calls tools/list to fetch the tools a server exposes. Most clients then load every returned definition into the LLM's context window so the model can decide which tools to use and when.

Google Workspace API is massive. Drive, Gmail, Calendar, Sheets, Docs, Chat, Tasks, People, Forms, Admin — over 10 major services, each with dozens of methods. Drive alone has files.list, files.get, files.create, files.update, files.delete, permissions.create... easily 10+ methods.

Run all of that through Discovery API and you get 200–400 MCP tools.

The CHANGELOG confirms this:

Add --tool-mode compact|full flag to gws mcp. Compact mode exposes one tool per service plus a gws_discover meta-tool, reducing context window usage from 200-400 tools to ~26.

At 200–400 tools, full mode is 2–8x the practical limit for most MCP clients (typically 50–100 tools).
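The compact mode described in the CHANGELOG entry can be pictured with a short Python sketch. I have not reproduced the deleted Rust code here; the method list is a tiny hypothetical sample, and the tool naming in both functions is an assumption. Only the gws_discover name comes from the CHANGELOG.

```python
# Tiny hypothetical sample of Discovery method ids (assumption).
METHOD_IDS = [
    "drive.files.list",
    "drive.files.get",
    "drive.permissions.create",
    "gmail.users.messages.list",
    "gmail.users.messages.send",
    "calendar.events.list",
]


def full_mode_tools(method_ids: list[str]) -> list[str]:
    """Full mode: one tool per API method, so tool count grows with the API surface."""
    return sorted(m.replace(".", "_") for m in method_ids)


def compact_mode_tools(method_ids: list[str]) -> list[str]:
    """Compact mode: one tool per service plus a single discovery meta-tool."""
    services = sorted({m.split(".", 1)[0] for m in method_ids})
    return services + ["gws_discover"]


full = full_mode_tools(METHOD_IDS)
compact = compact_mode_tools(METHOD_IDS)
print(len(full), len(compact))
```

In the sketch, adding a method grows the full list but leaves the compact list unchanged. The trade-off is that the model has to call the discovery tool before it knows what a service can do.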

Context Explosion

200–400 tool definitions, each with name, description, parameter schemas, and required/optional markers, all serialized as JSON and loaded into the context window. Estimated token cost: 40,000–100,000 tokens — just for tool definitions.

That leaves dramatically less room for user instructions, conversation history, and actual reasoning. Latency increases. Inference quality degrades.
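The arithmetic behind that estimate is worth spelling out. The tool counts and token range below are the figures from the text; the 200,000-token context window is an assumption I picked for illustration, and pairing the low estimate with the low tool count is my own reading of the range.

```python
TOOL_COUNT_RANGE = (200, 400)          # from the CHANGELOG entry
TOKEN_COST_RANGE = (40_000, 100_000)   # estimate from the text
CONTEXT_WINDOW = 200_000               # assumption: illustrative window size


def tokens_per_tool(total_tokens: int, tool_count: int) -> int:
    """Average tokens per tool definition implied by a total estimate."""
    return total_tokens // tool_count


def percent_left(tool_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Whole-number percentage of the window left after loading tool definitions."""
    return 100 - (tool_tokens * 100) // window


low_per_tool = tokens_per_tool(TOKEN_COST_RANGE[0], TOOL_COUNT_RANGE[0])
high_per_tool = tokens_per_tool(TOKEN_COST_RANGE[1], TOOL_COUNT_RANGE[1])
left_best_case = percent_left(TOKEN_COST_RANGE[0])
left_worst_case = percent_left(TOKEN_COST_RANGE[1])
print(low_per_tool, high_per_tool, left_best_case, left_worst_case)  # prints 200 250 80 50
```

So the estimate implies roughly 200 to 250 tokens per tool definition. Under the assumed window, between a fifth and a half of the context is gone before the user has typed anything.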

Compact Mode Didn't Save It

The team tried. Compact mode reduced the tool count to ~26 by exposing one meta-tool per service. But MCP was deleted the day after compact mode was implemented. That tells you the mitigation wasn't sufficient.

Bug Avalanche

During MCP's brief existence, at least 7 bug fixes were needed:

Tool naming ambiguity (fixed twice)
Schema inconsistencies in tool calls
Alias vs Discovery Document name mismatch
unwrap() panics in mcp_server.rs
Auth environment variable being ignored
Empty body: {} on GET methods causing 400 errors

For a small OSS project, that's an unsustainable maintenance burden over just two days.

The Root Cause: Architectural Mismatch

The bug count alone didn't kill MCP. The real issue is structural.

Google Workspace API is optimized for dynamic generation of hundreds of methods via Discovery Service. That's its superpower as a CLI — new APIs appear automatically, no code changes needed.

But that same superpower becomes a liability under MCP. MCP's tool model requires all tool definitions to be sent upfront to the client. "Dynamically generate hundreds of methods" directly translates to "flood the context window with hundreds of tool schemas."

This isn't a fixable bug. It's a fundamental mismatch between Google's API design and MCP's protocol design.

MCP was technically implementable. But it couldn't be shaped into a feature that justified its implementation complexity.

The Architecture After MCP: CLI-First

Here's what gws looks like now:

Google Workspace APIs
        ↓
    Discovery Service
        ↓
      gws CLI
        ↓
    JSON output
        ↓
  AI Agent (via shell)

This is fundamentally different from MCP.

MCP model: LLM discovers all tools upfront via protocol → calls tools via structured protocol. All tool definitions live in the context window.

CLI model: Agent calls gws as a shell command. Skills and CONTEXT.md guide which commands to run. gws schema provides on-demand schema queries. Context overhead: near zero.

The MCP approach is "spread the entire menu on the table and choose." The CLI approach is "order from the kitchen when you're hungry." For an API surface as vast as Google Workspace, the kitchen model wins.
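From the agent's side, the CLI model described above amounts to "run a command, parse the JSON." A minimal Python sketch follows. The run_json_cli helper is my own name, and the demo uses a small Python one-liner as a stand-in for gws so that it runs without gws installed or authenticated.

```python
import json
import subprocess
import sys


def run_json_cli(argv: list[str], timeout: float = 30.0):
    """Run a CLI that prints JSON on stdout and return the parsed value."""
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with {completed.returncode}: {completed.stderr.strip()}"
        )
    return json.loads(completed.stdout)


# Stand-in for gws (assumption): a child process that prints a JSON document.
fake_cli = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"files": [{"name": "report.txt"}]}))',
]
result = run_json_cli(fake_cli)
print(result["files"][0]["name"])
```

With gws installed and authenticated, the argv would be something like ["gws", "drive", "files", "list", "--params", '{"pageSize": 10}'], taken from the examples earlier in this article. No tool schema has to be in the context for this call to work; the agent only needs to know which command to run.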

The 100+ Agent Skills remain. The 50+ curated recipes for Gmail, Drive, Calendar, Docs, and Sheets remain. The structured JSON output remains. The on-demand schema discovery remains.

MCP's removal didn't reduce functionality. The project converged on a more efficient agent integration model that didn't need MCP as a layer.

Were the Articles Verified?

Multiple publications reported "MCP server included" — in English and Japanese. But by the time I checked the repository:

gws --help showed no MCP
gws mcp returned "unknown service"
The npm package exposed only gws in bin
main.rs had no MCP branch
Building from source didn't produce MCP
MCP had been removed as a BREAKING CHANGE two days after launch

The MCP implementation did exist in the repo's history. Issues, PRs, and CHANGELOG entries confirm it was real. So "pure fabrication" would be unfair.

But a responsible technical article needs at least one of: a startup command, a config example, an execution log, or a version number where the feature works. None of the articles I found had any of these.

I use AI as a research partner too — I had ChatGPT help analyze the README and used Claude Code to dig through commit history. AI-assisted research is fine. The problem is publishing AI-generated summaries without running the actual software.

If you write about an OSS tool, run it first. Especially in the first week after launch, when READMEs and released artifacts can be out of sync. That gap is where misleading articles are born.

Will MCP Come Back?

Short term (< 6 months): Unlikely.

Issue #260 was closed as not_planned. PR #275 declared a BREAKING CHANGE. The area: mcp label was removed from AGENTS.md. This isn't a pause — it's a clear signal that MCP is not in the development roadmap.

Medium term (6–12 months): Conditionally possible.

MCP could plausibly return if conditions like these are met:

Tool count limits — clients efficiently handling hundreds of tools
Lazy loading standardization — on-demand tool discovery as a first-class MCP feature
Community contribution — someone submits and maintains a complete implementation

Long term: Architecture-dependent.

Google currently positions gws as a Gemini CLI Extension. If Gemini's ecosystem adopts a tool integration protocol similar to MCP, something functionally equivalent could emerge.

But the current trajectory is clearly CLI + Skills + on-demand schema discovery. MCP's near-term revival is unlikely.

What This Investigation Revealed

Tool Design in the Age of AI Agents

gws poses an important question about AI agent tool integration.

MCP standardizes tool discovery and invocation at the protocol level. LLMs see all available tools and call them through structured interfaces. This works beautifully for small-to-medium tool sets.

But for services with massive API surfaces like Google Workspace, MCP's model breaks down. Tool definitions consume the context window and degrade reasoning capability.

CLI-based integration lets agents call external tools as shell commands. No tool definitions in the context window. Skills and documentation teach the agent what's available; schemas are queried on demand. Even with hundreds of available operations, context overhead stays near zero.

This isn't MCP vs CLI as a universal choice. The optimal integration method depends on the scale and characteristics of the tool set. Google didn't remove MCP because MCP is a bad protocol. They removed it because Google Workspace API's scale created a structural mismatch with MCP's tool model. For smaller tool sets (10–50 tools), MCP remains one of the best integration approaches available.

The Speed of OSS

gws MCP followed this timeline:

v0.3.x: MCP server added
v0.5.x: compact/full mode improvement
v0.6.x: bug fixes (naming, schemas, auth)
v0.8.x: MCP removed (BREAKING CHANGE)

All of this happened within days. Features can be added and removed faster than articles can be written about them. That's why running the software yourself matters more than reading about it.

CLI = Agent Runtime

Tracing gws's design reveals something about Google's vision for AI agents:

API schema (Discovery Service)
    ↓
CLI runtime (gws)
    ↓
Structured JSON output
    ↓
AI Agent (guided by Skills)

This is "CLI as Agent Runtime." Traditionally, CLI was a human interface. gws is explicitly designed as an interface for AI agents to call. Structured JSON everywhere. 100+ pre-defined Skills. On-demand API schema queries via gws schema. Agent guidelines in CONTEXT.md.

This design philosophy is precisely what made MCP redundant. MCP exposes tools via protocol; gws treats the CLI itself as the tool. No need to send tool definitions over JSON-RPC when you can just execute a shell command.

If this is Google's answer, then the future of AI agents won't be MCP-only. At least for services with large API surfaces, CLI-first will remain a viable — perhaps superior — alternative.

Conclusion

This investigation started with a trivial error: gws mcp returning "unknown service."

I read the README. Checked --help. Opened package.json. Read the Rust source. Built from source. Traced the git log. Examined commit diffs. Searched GitHub Issues. At the end of that trail was 1,151 lines of MCP server code, deliberately removed as a BREAKING CHANGE.

The removal wasn't a technical failure. It was the recognition of an architectural mismatch. Google's Discovery API dynamically generates hundreds of methods — a strength that directly became MCP's context window problem. Compact mode was attempted as a mitigation but couldn't resolve the fundamental collision between Google's API scale and MCP's tool model.

What remains is everything an AI agent needs to operate Google Workspace: 100+ Agent Skills, structured JSON output, on-demand schema discovery — all without MCP.

There's no universal answer to the MCP vs CLI debate. But what gws demonstrated through its own history is that AI agent tool design is still an open problem. The right architecture depends on the shape and scale of the API surface.

If you write about an OSS tool, run it first. Not the README — the actual software. In the age of AI, repository analysis takes hours, not days. Use that speed for verification, not just content production.

Not Everything Needs to Be a Framework: Why Spawning Processes Still Wins

Gen.Y.Sakai — Mon, 15 Dec 2025 07:21:13 +0000

Some people say:

“This is just spawning a subprocess. That’s not architecture.”

They’re absolutely right.

And that’s exactly why it works.

I’ve been shipping production systems long enough to have lived through CORBA, SOAP, WSDL, and every other attempt to make inter-process communication “pure.”

I’ve also shipped systems under real deadlines, in regulated industries, where “rewriting everything” is not an option.

This article is about why the sidecar pattern — yes, literally just wrapping a binary — keeps winning in the real world.

Not in theory.

Not in blog posts.

In production.

The Pattern Everyone Pretends Not to Use

Here’s the pattern:

A TypeScript / Node / Electron app
Spawns a compiled binary (.NET, Rust, Go, C++)
Talks over stdin/stdout, pipes, or a local socket
Lets the OS do process isolation

That’s it.

No service mesh.

No gRPC schema wars.

No “distributed systems” cosplay.

Just a parent process orchestrating a sidecar that does the heavy lifting.

If this sounds “too simple,” good.

That’s the point.
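For readers who have never written one, here is the pattern in miniature. The sketch is in Python rather than TypeScript, and the "compiled binary" is replaced by a small Python child process so the example runs anywhere. The Sidecar class, the one-JSON-object-per-line protocol, and the uppercase "work" are all assumptions made up for illustration.

```python
import json
import subprocess
import sys

# Stand-in sidecar (assumption): reads one JSON request per line, answers one per line.
SIDECAR_SOURCE = r'''
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    response = {"id": request["id"], "result": request["text"].upper()}
    sys.stdout.write(json.dumps(response) + "\n")
    sys.stdout.flush()
'''


class Sidecar:
    """Parent-side handle: spawn the child once, then exchange JSON lines over stdio."""

    def __init__(self, argv: list[str]) -> None:
        self._proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1
        )
        self._next_id = 0

    def call(self, text: str) -> str:
        self._next_id += 1
        self._proc.stdin.write(json.dumps({"id": self._next_id, "text": text}) + "\n")
        self._proc.stdin.flush()
        response = json.loads(self._proc.stdout.readline())
        if response["id"] != self._next_id:
            raise RuntimeError("out-of-order response from sidecar")
        return response["result"]

    def close(self) -> int:
        self._proc.stdin.close()            # EOF ends the child's read loop
        code = self._proc.wait(timeout=10)
        self._proc.stdout.close()
        return code


sidecar = Sidecar([sys.executable, "-c", SIDECAR_SOURCE])
reply = sidecar.call("hello")
exit_code = sidecar.close()
print(reply, exit_code)
```

Swap the argv for a path to a real .NET, Rust, or Go binary that speaks the same line protocol and the parent code does not change.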

“But That’s Just a Wrapper”

Yes.

VS Code wraps git
VS Code wraps OmniSharp (.NET)
Prisma wraps a Rust query engine
Docker Desktop wraps the Docker daemon
Electron apps wrap platform-native tools every single day

If “just a wrapper” were a design flaw, half of modern developer tooling wouldn’t exist.

The uncomfortable truth is this:

Most successful tools are orchestration layers. Not reinventions.

Why This Pattern Keeps Shipping

Let’s be brutally honest about the real options.

Option 1: Rewrite Everything in TypeScript

Months of work
New bugs in mature code
Worse performance
Now you maintain two versions forever

Great for blog posts.

Terrible for shipping.

Option 2: Native Addons (N-API, node-gyp)

Fast, yes
Also fragile
Node upgrades break you
One segfault kills your entire process
Cross-compilation is pain incarnate

Ask anyone who has maintained native addons long-term.

Option 3: Spawn a Sidecar

Uses existing, battle-tested code
Crashes are isolated
Debugging is obvious (it’s a process)
Cross-platform is manageable
Ships this week, not “someday”

This isn’t laziness.

It’s risk management.
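The "crashes are isolated" point is easy to demonstrate. In this Python sketch a child process aborts hard, standing in for a segfault in native code, and the parent simply observes a non-zero exit status and carries on. The run_isolated helper and both child commands are assumptions made up for the demo.

```python
import subprocess
import sys


def run_isolated(argv: list[str], timeout: float = 10.0) -> tuple[bool, int]:
    """Run work in a separate process; report (succeeded, return_code) to the parent."""
    completed = subprocess.run(argv, capture_output=True, timeout=timeout, check=False)
    return completed.returncode == 0, completed.returncode


# os.abort() kills the child abruptly, much like a crash in native code would.
crashing_child = [sys.executable, "-c", "import os; os.abort()"]
healthy_child = [sys.executable, "-c", "print('ok')"]

crashed_ok, crashed_code = run_isolated(crashing_child)
healthy_ok, healthy_code = run_isolated(healthy_child)

# The parent reaches this line either way; only the child died.
print(crashed_ok, healthy_ok)
```

With a native addon loaded in-process, the same failure would have taken the parent down with it. Here the parent can log the exit status and restart the sidecar.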

“Architecture Purity” vs Reality

In regulated domains — healthcare, finance, government systems — you don’t get to say:

“Let’s just rewrite the crypto layer.”

You integrate with what already exists:

Vendor SDKs
Legacy libraries
OS-specific APIs
Hardware-backed security modules

The sidecar pattern gives you:

A clean boundary
A failure domain you can reason about
A way to keep modern UX without breaking compliance

That’s not a hack.

That’s professional engineering.

This Is Not a New Idea — It’s the Mature One

We spent decades trying to make IPC “beautiful”:

CORBA
SOAP
Enterprise Service Buses
Endless XML schemas

And what did we learn?

That simple, observable processes beat “perfect abstractions” when systems get large and real humans have to operate them.

Today, spawning a process and talking over stdio feels almost embarrassing — until you realize it’s exactly what we wanted 20 years ago.

When You Should Not Do This

Let’s be clear — this is not a hammer for every nail.

Don’t use this pattern if:

A simple REST call solves your problem
The logic is trivial and low-cost
Latency is measured in microseconds
Your team can’t maintain the sidecar language

Pragmatism cuts both ways.

The Real Lesson

The sidecar pattern isn’t about processes.

It’s about respecting reality:

Existing code matters
Deadlines matter
Failure isolation matters
Shipping matters

If your architecture diagram looks clean but your product never ships, you chose wrong.

I’ll take a “wrapper” that ships over a “pure” system that doesn’t — every time.

Final Thought

If this pattern is good enough for:

Microsoft (VS Code)
Docker
Prisma
The entire Electron ecosystem

It’s probably good enough for your project too.

And if someone says

“That’s just spawning a process,”

Smile.

They just described half of modern software — the half that actually works.

Want to try this pattern yourself?

I put together a minimal proof-of-concept in TypeScript + .NET that shows the full lifecycle management and stdio communication in action. It runs in minutes.

Check out the repo on GitHub → open-rx

Have you ever used the sidecar pattern in production? What worked? What broke?

Drop your war stories in the comments — I read every one.

Transparency note:

The title of this article was generated with NanobananaPro.

The opinions, war stories, and architectural scars are entirely my own.