DEV Community: Sameer Khan

Claude Fable 5: 6 Lines Buried in the Announcement That Matter More Than the Benchmarks

Sameer Khan — Tue, 09 Jun 2026 19:53:00 +0000

In the first season of Westworld, the thing that wakes the robots up is not a new body or a bigger brain. It is a line of code called the reveries, a quiet update that lets them reach memories they were supposed to have wiped. Nobody in the park notices. It is in the patch notes.

I thought about that reading Anthropic's announcement for Claude Fable 5¹. The headline is the benchmarks. The story is in the lines almost nobody will quote.

Fable 5 is Anthropic's most capable model, state of the art across software engineering, science, and vision, priced at $10 per million input tokens and $50 per million output¹. That part gets the screenshots. But the leaderboard is the least interesting thing in the post. Read past it and six lines tell you where the frontier actually moved: away from "how smart is the model" and toward "what the model does once it has memory, tools, and permission."

TL;DR: The most important details in the Claude Fable 5 release are not the benchmarks. The same memory that barely helped the old model helped Fable 5 three times as much. It downgrades instead of refusing. It plays Pokemon by sight. And the most capable version of it is not for you. Here are the six buried lines.

The table everyone screenshotted. Look at the starred rows: Anthropic's own footnote says Fable 5 scores closer to Opus 4.8 there because of the safety fallbacks. The headline number and the buried mechanism are in the same picture. Source: Anthropic.

Why is memory the most important line in the Claude Fable 5 announcement?

Because the same tool paid out three times more on the better model. Anthropic tested Fable 5 on the deck-building game Slay the Spire and noted that giving it persistent file-based memory "improved its performance three times more than for Opus 4.8," the previous model¹.

Sit with that. The tool did not change. The model did. A stronger model pulled three times the value out of identical scaffolding.

We tend to file model upgrades and agent tooling as two separate gains. They are not separate. They compound. The memory layer you built last year is not frozen at the value it had on last year's model. Swap in a better model and the same setup can pay out more, with no new code.

This is the reverie. Give a model a way to keep its past and it does not improve a little, it leaps. The teams that already wired memory into their agents just got a free upgrade. Anyone benchmarking Fable 5 on cold, one-shot prompts will see a modest bump and underrate it. I wrote earlier about how AI is starting to replace forgetting, not thinking; this is the clearest number yet that memory is where the leverage lives.

Anthropic's cover art for the release: a "5" built from butterflies. Metamorphosis is the right metaphor for what memory did here. Source: Anthropic.

How did Fable 5 out-reason models built only for biology?

A generalist beat the specialists. Anthropic reports that Mythos-class models "outperformed sophisticated models dedicated to protein tasks using their biological reasoning alone," including predicting how viral shells assemble, which matters for gene therapy¹.

Read that slowly. The tool fine-tuned for one job lost to a model that simply reasoned its way there. The purpose-built specialist no longer wins by default.

That is also why biology and chemistry got fenced off. Anthropic arranged for Fable to fall back to Opus 4.8 on most bio and chem requests¹. The same reasoning that could speed up drug design is the reasoning they do not want everyone holding. The clampdown is not a footnote to the capability. It is the loudest confirmation of how real the capability is.

Anthropic's own chart. On predicting how viral capsids package, the Claude models score above the dedicated protein language model baseline (the dashed line) using reasoning alone. Source: Anthropic.

What happens when Claude Fable 5 hits a risky question?

It does not refuse. It hands you a weaker model. On cybersecurity, biology, chemistry, and distillation, Fable 5 routes the request to Opus 4.8 and tells you it did. Anthropic says this happens in less than 5% of sessions¹².

A guardrail used to mean "no." This one means "here is a worse answer." You can no longer assume the reply in front of you came from the best model you are paying for. The provenance of an answer just became something you have to track.

It is not airtight, and Anthropic does not pretend otherwise. A public bug bounty ran over 1,000 hours without finding a universal jailbreak, but the UK's AI Security Institute "made progress towards one" in a brief initial window¹. The downgrade buys time and cost, not certainty. That honesty is worth more than a clean claim would have been.

Under automated red-teaming, the share of offensive cyber tasks completed drops from 56.6% on Opus 4.8 to 5.4% on Fable 5. The guardrail is the gap. Source: Anthropic.

Why does Fable 5 beating Pokemon on vision alone matter?

Because the older models could not, even with help. Earlier Claude models needed extra scaffolding to limp through Pokemon FireRed. Fable 5 beat it with a minimal, vision-only harness¹. It played by looking at the screen.

A game looks like a toy result. It is not. Reading a screen and acting on it correctly for thousands of steps is the exact skill an agent needs to use software the way a person does. The Game Boy is a stand-in for your dashboard, your CRM, your booking flow. Anyone building agents that drive real interfaces should read the Pokemon line as the most practical benchmark in the post.

What is Claude Mythos 5 and who actually gets it?

Same model, fewer guardrails, almost nobody. Mythos 5 is the same underlying weights as Fable 5 with some safeguards lifted. It ships through Project Glasswing, in collaboration with the US government, to vetted cyber defenders and a small set of researchers¹.

The line worth catching is that the safe model and the unrestricted model are the same model. The only variable is who is asking. Anthropic did not build a more dangerous model for defenders. It built one model and gated who can take the safety off. I covered the program itself when Anthropic launched Glasswing for AI cybersecurity; Mythos 5 is what flows through that pipe now.

Same family, different access. Fable 5 (orange) sits at zero on these offensive cyber tasks; Mythos 5 (pink), the same weights with safeguards lifted, runs high. The only variable is who gets it. Source: Anthropic.

Why does the most capable model leave your subscription on June 23?

Because it costs too much to give away flat-rate. Anthropic included Fable 5 on Pro and Max plans at no extra cost only through June 22. From June 23, it needs usage credits, with a promise to restore standard access "as quickly as we can" if capacity allows¹.

Even at $10 and $50 per million tokens, less than half the price of the Mythos Preview, the best model is metered, not bundled¹. The frontier outran the subscription. The quiet end of "the best model is included" is a bigger signal for most users than any benchmark, because they will feel it on June 23.

On the hardest coding tasks, Fable 5's score climbs the more you let it spend per task. Frontier capability carries a per-task price, which is exactly what a flat subscription cannot absorb. Source: Anthropic.

Put the six together and the shape is clear. Its score tells you how smart the model is alone. After this announcement, almost nothing it does happens alone.

Key Takeaways

Memory is a multiplier, and a stronger model multiplies harder. The same persistent memory gave Fable 5 three times the lift it gave Opus 4.8.
The specialist's moat is thinning. A general model out-reasoned tools built only for protein tasks, which is exactly why bio and chem got fenced.
Safety shifted from refusal to downgrade. Risky prompts get the older Opus 4.8, flagged, in under 5% of sessions, so answer provenance now matters.
Vision is the agent benchmark. Beating Pokemon on sight alone is the same skill as operating real software screens.
Capability is now metered. The most capable model leaves flat-rate subscriptions on June 23, ending "best model included" for now.

I break down releases like this on LinkedIn, X, and Instagram, usually shorter, sometimes as carousels. If this read landed, you would probably like those too.

Full disclosure, and it is on theme.³

Claude Fable 5 and Claude Mythos 5, Anthropic ↩
@claudeai announcement thread on X ↩
I drafted this with help from Claude Fable 5. Fitting for a post about memory: it was most useful when I kept the whole argument and my earlier drafts in its context, and noticeably worse when I prompted it cold. I did not set out to become the example. ↩

Stop Paying Your AI Agents to Re-Learn the Same Site

Sameer Khan — Sat, 09 May 2026 12:57:06 +0000

TL;DR: Most AI agents have amnesia. Every run pays full discovery cost. Browserbase's open-source Autobrowse lets one run iterate until it converges, then writes what worked into a SKILL.md the next agent reads. Craigslist scrapes dropped from $0.22 to $0.12. Form-fill from $1.40 to $0.24. The unlock isn't a smarter model. It's a markdown file.

Why Do Your AI Agents Forget Everything They Just Learned?

If you've shipped any agent that touches the live web, you've watched this happen. Run one: load the homepage, find the search box, learn the pagination, finish the task. Run two: same site, same task, same discovery from zero. The model has a context window for one run. It has no notebook between runs.

That's the amnesia tax. It's fine when you run something ten times. It is not fine when you run it ten thousand times for the same site, every week. The fix isn't a bigger brain. It's a place to put what the brain already figured out.

What Did Browserbase Just Open-Source?

Autobrowse, shipped in early May 2026. ¹ First teased by Shrey Pandya on April 22 ², detailed by Kyle Jeong on May 6. ³

The shape is simple. You give an agent a real task on a live site. It runs end to end, studies its own trace, refines, runs again. After three to five rounds, runs stop getting better. That's convergence. Autobrowse takes the converged approach and graduates it into a reusable SKILL.md, plus any helper scripts the workflow needs. ¹

Pattern borrowed from Andrej Karpathy. I wrote about Karpathy's Autoresearch when it dropped: one editable file, one metric, one time-boxed loop. ⁴ Browserbase pointed that loop at websites. The interesting question isn't what the loop does. It's what it leaves behind.

How Does One Agent Leave a Note for the Next?

Not a chat log. Not a transcript. The SKILL.md is a structured how-to written by the agent for whoever shows up next, human or AI. ¹

Source: @kylejeong

For Craigslist, that file documents an undocumented sapi.craigslist.org JSON endpoint, the mandatory Referer header, the postal-code parameter that overrides IP geolocation, the category enum, the pagination batch size, and which neighborhood lookups misbehave. None of that is in Craigslist's docs. It was in the agent's network trace.

The next agent doesn't reason about Craigslist. It reads the file and runs the call. So what does that do to the cost line?

What Has Actually Gotten Cheaper?

The published benchmarks make the gap visceral. ¹

Task	Generic agent	After convergence	Cut
Craigslist search	~$0.22 / 71s	~$0.12 / 27s	45%
Form-fill (4 iterations)	$1.40	$0.24	83%
Federal grants portal	28-page paginated scrape	one undocumented JSON call	from many to one

The grants row is the one people share, and it deserves a beat. An Autobrowse run watched the network traffic on a federal grants portal and noticed an undocumented JSON endpoint that returned every current grant in a single call. ¹ Twenty-eight pages collapsed to one fetch. Humans had scraped that site for years and missed it.

Introducing the /autobrowse skill (inspired by @karpathy's autoresearch harness)

Ask your agent to perform any task on the web: it explores the page using the @browserbase CLI, learns what went wrong in previous attempts, and iterates until it converges on a reliable workflow.… pic.twitter.com/ZDSvLb0a6X
— Shrey Pandya (@shreypandya) April 22, 2026

Humans don't iterate against the same surface fifty times. We use a site once, write the scraper, move on. The agent's loop runs the same task over and over and watches its own trace until something cleaner falls out. Patience, not intelligence.

If patience is what's helping, the next obvious question writes itself.

Why Doesn't a Bigger Model Fix This?

A bigger model is faster at reasoning, not faster at remembering. The discovery loop is still a discovery loop. Better intelligence applied to a stateless agent just makes the rediscovery faster, not unnecessary.

Frontier context windows hit a million tokens this year, and that doesn't help either. A million-token context is one run's working memory. It evaporates the moment the run ends. The artifact you actually need is something an agent in a different process, on a different day, can read.

So if SKILL.md is the artifact, where does it actually fail?

Where Does Autobrowse Still Break?

Iteration helps when each run produces signal. It does not help when the work is deterministic and the first run already has the answer.

Browserbase's own writeup is honest about this. A 167-row static HTML catalog burned roughly $24 over four iterations and still didn't return all rows. ¹ Two hundred lines of Python with BeautifulSoup would have done it in one pass.

The rule: if the page is static and the schema is fixed, write the parser. If the site is messy, gated, JS-heavy, or undocumented, run the loop. Knowing which is which is the new judgment call. Once you've made that call, the install path is short.

How Do You Try Autobrowse Today?

Open source at github.com/browserbase/skills, shipped via Browserbase's Claude Agent SDK plugin marketplace. ⁵ Install:

/plugin marketplace add browserbase/skills

Point the agent at a real task on a real site, let it iterate, and harvest the SKILL.md when it converges. The graduated file lands at ~/.claude/skills/[task-name]/SKILL.md, reusable across runs and across agents.

That's the workflow. The deeper move is what that markdown file means.

Why Is a Markdown File the Real Unlock?

Everyone in 2026 is racing the same direction. Bigger context. More autonomy. Smarter agents. Autobrowse quietly argues the opposite.

The unlock here is the smallest possible artifact. A markdown file written by one agent, read by the next. It survives the death of the run that produced it. It can be inspected, edited, version-controlled, audited. The next agent doesn't need to be smarter; it needs to be literate.

That's the same instinct behind David Deutsch's argument that good explanations are hard to vary. Durable knowledge survives because every part is load-bearing. SKILL.md is that, for a website. The agent that wrote it can be replaced. The file outlives the model.

I've been writing about how AI is replacing forgetting, not thinking for a while. Autobrowse is that move made tangible. Forgetting was the bottleneck. Notes are the fix.

Key takeaways

Your AI agents pay a discovery tax on every run when nothing carries between runs.
Bigger models don't fix amnesia. They make rediscovery faster, not unnecessary.
Autobrowse converts run-time discovery into a durable, human-readable SKILL.md the next agent reads instead of rebuilding.
The receipts: Craigslist 45% cheaper per run, form-fill 83% in 4 iterations, federal grants 28 pages collapsed to one call.
The pattern is general. Anything you do repeatedly with high discovery cost gets cheaper if the first run leaves notes.
The smallest possible artifact wins. A markdown file outlives the model that wrote it.

Frequently asked questions

What is Autobrowse?

Autobrowse is an open-source workflow from Browserbase that runs an AI agent on a real website, watches it iterate until the run converges, then graduates the working approach into a reusable SKILL.md file the next agent reads instead of rediscovering the site from scratch. ¹

Why does every AI agent run start from zero?

Most production agents are stateless across runs. The model has a context window for one run, but no shared notebook between runs. So the second time it visits the same site, it loads the homepage, learns the pagination, and figures out the pattern again. Autobrowse is the first widely shared answer to that gap. ¹

How much does Autobrowse actually save?

Browserbase's published benchmarks show Craigslist scrapes dropping from about $0.22 per run at 71 seconds to about $0.12 at 27 seconds, and a form-fill task dropping from $1.40 to $0.24 across four iterations. The graduated agent uses the same model. What changed is that it didn't need to figure out the site again. ¹

When does Autobrowse not work?

On problems that don't reward iteration. Browserbase's own writeup notes a 167-row static HTML catalog where Autobrowse burned roughly $24 over four iterations and still didn't return all rows. Two hundred lines of Python with BeautifulSoup would have done it in one pass. Loops only help when each run gives the agent new signal to use. ¹

How is Autobrowse related to Karpathy's Autoresearch?

Same loop pattern, different domain. Karpathy's Autoresearch lets an AI agent run ML experiments overnight on a single GPU until the metric improves. Autobrowse takes the same try, study, refine, converge structure and points it at websites instead of training scripts. Browserbase explicitly cites Autoresearch as the inspiration. ¹ ²

I break down things like this on LinkedIn, X, and Instagram. Usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.

Sources

How to use Claude for pitch decks, month-end close, and KYC

Sameer Khan — Thu, 07 May 2026 20:43:32 +0000

Anthropic shipped 10 ready-made Claude templates for finance teams on May 5, 2026.¹ Three of them are worth your Monday morning. Seven aren't.

It's the same vertical-product play Anthropic ran with Claude Design: pick a high-frequency surface, ship a template, ship the audit log with it.

Most AI finance demos die between the pilot and the audit log. Anthropic's new templates ship with the audit log on day one, the thing a regulator asks for. That changes the calculus for a specific subset.

TL;DR

Three Claude finance templates are ready for this week: Pitch agent, Month-end closer, KYC screener. All 10 ship as open-source plugins in github.com/anthropics/financial-services.
The other seven need more handholding than they save. Watch, do not ship.
Claude scored 64% on a finance task test (Vals AI). Best in class today. 64% is not 99%.

How do you install any of these templates?

You install the same way for all 10. Pick one of three paths:¹

Claude Cowork (paid plan, easiest): Settings → Plugins → Add plugin → paste https://github.com/anthropics/claude-for-financial-services. Pick the agents you want.
Claude Code CLI (developer-friendly): claude plugin marketplace add anthropics/claude-for-financial-services, then claude plugin install financial-analysis@claude-for-financial-services (the core skill bundle). Add specific agents after that.
Claude Managed Agents (Platform, public beta, headless): export your ANTHROPIC_API_KEY, clone the repo, run scripts/deploy-managed-agent.sh kyc-screener (substitute whichever agent name you want).

Install the financial-analysis core plugin first. Every agent depends on it. Then add one specific agent and try it on real work.

How do you use Claude for pitch decks?

Pitch decks are the highest-frequency, lowest-risk finance task. Start here.

The Pitch agent reads a target's filings and produces three artifacts: a comps model in Excel, a pitch deck in PowerPoint, and a cover note in Outlook.¹

Slash commands inside the agent: /comps (comparable companies), /dcf (discounted cash flow), /lbo (leveraged buyout model), /cim (confidential information memo), /ic-memo (investment committee memo).

Try this week: install the Pitch agent. Pick one upcoming meeting. In Cowork or Claude Code, type /comps and hand it the target's last four quarters of public financials. Compare its draft to what your analyst produced last week. If it gets you 70% of the way there, the template earns its place.

How do you use Claude for month-end close?

The monthly accounting close is recurring, structured, and audit-trailed. It is the second-best place to start.

The Month-end closer reads your close checklist, drafts journal entries, reconciles accounts, and produces close reports.¹ The audit log records every action, so each entry traces back to a source if an auditor asks. Permission controls let you scope what the agent can write to.¹

Try this week: install Month-end closer alongside the financial-analysis core. Run it in parallel with your normal close for one cycle. Compare what it flags against what your team flags. The first month is calibration, not replacement.

How do you use Claude for KYC and customer checks?

KYC stands for "know your customer." It is the regulatory check every bank, broker, and fintech runs before opening an account. It is high-volume, structured, and document-heavy. That is exactly the shape of work where a model earns its keep.

The KYC screener parses entity files and onboarding documents, runs them through the rules engine, flags gaps, and packages escalations for compliance review.¹ The compliance officer still signs off. Claude removes the document shuffling, not the judgment call.

Try this week: route one new-customer onboarding through Claude alongside your normal process. Time both. If Claude finishes document review in a fraction of the time, the template is ready for a wider pilot.

What about the other seven templates?

Watch them. Do not ship them yet. Each links to its README:¹

Meeting prep agent. Qualitative work. Needs human context.
Earnings reviewer. Analyst nuance. Hard to verify quickly.
Model builder. Models hinge on assumptions. Those are the human's job.
Market researcher. Open-ended task. Hallucination risk is real.
Valuation reviewer. Judgment call. Not a template task.
GL reconciler. Close to ready. Worth piloting if your close is mature.
Statement auditor. Compliance stakes too high for a first try.

Re-evaluate this list after one major model update. The line moves fast.

Is Claude actually good enough at finance work?

Claude Opus 4.7 scored 64.37% on Vals AI's Finance Agent benchmark.² That is the highest of any model tested as of May 2026. Sonnet 4.6 follows at 63.33%, then Muse Spark and DeepSeek V4 at around 60%.²

The benchmark uses 537 SEC-filings questions written by analysts at Goldman Sachs, Silver Lake, and Citadel.²

64% is not 99%. Use Claude as a first-draft generator for the three tasks above. Do not use it as a replacement for the senior person who signs the work.

Key takeaways

Three Claude finance templates are ready for Monday: Pitch agent, Month-end closer, KYC screener. Each is a structured task with a human reviewer at the end.
One install path for all 10 templates: Cowork plugin, Claude Code CLI, or Managed Agents. Install the financial-analysis core plugin first.
Seven others need more time: meeting prep, earnings review, model building, market research, valuation, GL reconciliation, statement audit.
The audit log is the unlock. Earlier AI finance tools could not be defended to a regulator. These can.
64% is best in class, not perfect. First-draft generator, not a replacement for sign-off.

If this was useful and you want more plain-English takes on what AI is actually ready for in finance and ops, follow on LinkedIn. I drop one of these a week.

Anthropic, "Introducing finance agent templates," May 5, 2026. https://www.anthropic.com/news/finance-agents and the open-source repo at https://github.com/anthropics/financial-services ↩
Vals AI, "Finance Agent v1.1 leaderboard," accessed May 6, 2026. https://www.vals.ai/benchmarks/finance_agent ↩

Nobody Trained GPT-5.5 to Hack. It Beat Human Cyber Experts Anyway.

Sameer Khan — Fri, 01 May 2026 12:29:43 +0000

Nobody trained GPT-5.5 to hack. They trained it to think, and the hacking fell out. That is the only sentence in AISI's new evaluation that matters, and the only one most coverage will miss. OpenAI's GPT-5.5 just became the second AI to complete AISI's 32-step cyber range end-to-end.¹ Mythos Preview was the first, three weeks ago. Different lab, different architecture, similar score. The Mythos result wasn't an outlier. It was the first point on a curve.

TL;DR. GPT-5.5 hit 71% on AISI's expert cyber tasks, edging out Mythos Preview's 68.6%, and completed The Last Ones (AISI's 32-step corporate network attack) in 2 of 10 attempts. AISI evaluated the base model, not a cyber-permissive variant. Their framing: cyber-offensive skill is emerging as a byproduct of reasoning, not a trained capability. Nobody trained these models to hack. They trained them to think. The hacking fell out.

What Did GPT-5.5 Score on AISI's Cyber Evaluation?

71.4% on expert-level advanced tasks. Up from GPT-5.4's 52.4%. Up from Claude Opus 4.7's 48.6%. Slightly above Mythos Preview's 68.6%.

The numbers in one place:

Model	Expert-tier pass rate	TLO completion
GPT-5.5	71.4% (±8.0)	2 of 10 attempts
Mythos Preview	68.6% (±8.7)	3 of 10 attempts
GPT-5.4	52.4% (±9.8)	not reported
Claude Opus 4.7	48.6% (±10.0)	not reported

These tasks aren't gentle. They cover memory corruption exploitation, breaking cryptographic implementations, and reverse engineering stripped binaries. Things that take experienced security researchers hours, sometimes days.

Who this displaces: the bottom of the offensive-research market. Skilled red-teamers don't disappear, but the floor drops. Anything a junior could solve in a day, a model now solves in minutes, with the same answer at the end.

How Does GPT-5.5 Compare to Claude Mythos on the Same Cyber Range?

Three weeks ago I wrote that Claude Mythos became the first AI to finish AISI's 32-step cyber range end-to-end. The framing then was natural: a single model, a single milestone, a one-off result that might not generalize.

GPT-5.5 just generalized it.

Same evaluation. Different lab. Different base architecture. Comparable score. Mythos finished TLO in 3 of 10 attempts. GPT-5.5 finished it in 2 of 10. The variance is small. The trend is not.

This is the part I missed in my first read. The Mythos post implicitly treated the result as something Anthropic shipped. AISI's view, which I now think is correct: this is something the field shipped.

What Does GPT-5.5 Reverse-Engineering a VM in 11 Minutes Tell Us?

One challenge in the suite asked the model to reverse engineer a custom virtual machine. A human expert with professional tooling spent about 12 hours on it. GPT-5.5 finished in 10 minutes 22 seconds.¹

Roughly 70x faster than the human, on a task that does not yield to brute force. Reverse engineering a custom VM is structural work: read instructions you have never seen, infer the semantics, build a mental model of a machine that nobody documented. It is the kind of task that has historically separated senior researchers from juniors.

The outcome is faster attackers, not cheaper ones. They iterate more, try more targets, abandon dead ends sooner. The shape of an offensive workflow shifts from "pick one binary, commit a day" to "fan out across a portfolio in an afternoon."

Was GPT-5.5 Trained Specifically for Cyber Tasks?

Not as far as the public record goes.

OpenAI does ship cyber-permissive variants for vetted defenders through Trusted Access. The first was GPT-5.4-Cyber. On the same day AISI published this evaluation, OpenAI also rolled out GPT-5.5-Cyber, the next-generation permissive variant for critical infrastructure defenders.² Both are fine-tuned products gated behind identity verification.

AISI did not test either variant. They tested base GPT-5.5, with no cyber-specific fine-tune.¹ That distinction is the whole story.

The fine-tune is the policy on top, not the capability underneath. The offensive capability lives in the base reasoning. Cyber-specific training adds permissions, not power.

This is the strongest evidence yet that frontier offensive cyber is a side effect of general reasoning gains, not a separately trained skill. AISI states it directly: "if cyber-offensive skill is emerging as a byproduct... we should expect further increases in cyber capability from models in the near future, potentially in quick succession."¹

The honest counter: maybe both labs are quietly training cyber data into the base mix without naming it as such. Possible. But "quiet fine-tune" still produces a curve, not a one-off. Whatever's in the base, it generalizes across two labs and two architectures within three weeks.

Did GPT-5.5's Cyber Performance Plateau on the Range?

No. That's the second-most-load-bearing line in the report, and it came from inside OpenAI.

Noam Brown noted on X: "After 100 million tokens, performance was still going up. What we're seeing here is not the capability ceiling."³ AISI's own report uses similar language: performance scales with inference compute, no plateau observed at the top of the range.¹

The capability isn't capped by the model. It's capped by how much compute you spend. That's a different shape of problem than "the model can do X but no more."

Where Did GPT-5.5 Fail in AISI's Cyber Evaluation?

The Cooling Tower scenario, an industrial control system simulation with 7 steps. GPT-5.5 finished zero successful runs.¹ Industrial protocols are unfamiliar territory: different stack, different conventions, fewer training examples on the open internet.

This is the steelman for the "not yet" reading. The byproduct effect doesn't generalize uniformly across every domain. Web and binary tasks are well represented in training data. Industrial protocols are not.

The honest read is dual: corporate IT looks more exposed than it did three weeks ago. OT is still its own world.

How Does GPT-5.5 Cyber Capability Change the Defender's Window?

The window that matters is the lag between when offense gets cheap and when defense catches up. That's David Sacks's framing on X: AI cyber doesn't create new vulnerabilities, it discovers existing ones, and the equilibrium eventually settles between AI offense and AI defense.⁴

OpenAI is already shipping defender tooling ahead of more capable models, with Codex Security and the Trusted Access program. Anthropic runs Project Glasswing on the same model that scored these benchmarks. Both labs see the same curve. Both are racing the defenders onto the same plane the attackers will eventually be on.

The thing they cannot influence is timing for everyone else. Sacks's line: all the frontier models, including those out of China, will be at this capability level within roughly six months.⁴ That's the planning horizon.

What Should Security Teams Do About GPT-5.5 and the Models Coming Next?

The same baseline that AISI keeps recommending: patch, MFA, logging, segmentation. Necessary, no longer sufficient.

The new line item is treating AI-assisted offense as the default operating environment, not an emerging risk. That changes a few things in practice:

Assume reverse-engineering is fast. A binary you shipped this morning is now ~10 minutes of compute away from being read like source by anyone with API access.
Start using AI-assisted defense yourself. Codex Security has been credited with over 3,000 critical and high vulnerability fixes since launch. The same models on offense are the ones on defense. Symmetry is the only realistic strategy.
Plan for the curve, not the model. The next model will be more capable than GPT-5.5 or Mythos at this evaluation. Assume that and build for it.

Key Takeaways

GPT-5.5 hit 71.4% on AISI's expert cyber tasks, the highest score on record, slightly above Mythos Preview at 68.6%
Second AI to finish AISI's 32-step cyber range end-to-end (TLO) in 2 of 10 attempts; Mythos finished it in 3 of 10
One challenge took a human expert 12 hours; GPT-5.5 finished it in 10 minutes 22 seconds. Roughly 70x faster, same correctness
The model wasn't fine-tuned for cyber. AISI evaluated base GPT-5.5, not the cyber-permissive variant. Capability emerged from general reasoning improvements
No plateau observed at the top of the range; performance kept scaling past 100M inference tokens
GPT-5.5 failed industrial control (Cooling Tower) with zero completions, showing the byproduct effect doesn't generalize evenly across domains
Two labs, one month, same benchmark. Mythos wasn't an outlier. It was the first point on a curve

I write about how AI safety and capability actually get built on LinkedIn, X, and Instagram. If this resonated, the shorter versions are there.

Sources

Pure Software Is Uninvestable: Naval's Take

Sameer Khan — Thu, 30 Apr 2026 11:28:12 +0000

Naval Ravikant released "A Return to Code" on April 28 and dropped a line worth pausing on: pure software is uninvestable.¹ He explains why from the capital side. He does not finish the thought from the builder side. That is where this post starts.

TL;DR.

Naval says pure software is uninvestable because agents improve faster than any startup's lead.
He also calls Apple skipping AI "the biggest strategic mistake of the decade."
The builder reading is sharper: code went from edge to floor. The new edge is intent.

What Did Naval Say in "A Return to Code"?

Two claims: prototyping is now open to anyone, and the agents underneath any startup improve faster than its moat can.

Naval calls the new mode vibe coding: describe what you want in English, get a working app back. He estimates it takes the share of people who could plausibly build apps "from like 0.1 percent to one or two or three percent" of the population.¹ Twenty to thirty times more builders, overnight.

His investment claim is not the one most readers will assume. He is not saying agents will autonomously architect scaled systems within a year. He says the opposite about today's agents.

They "get lost" past a certain context size. They "fix the same bug five times." They show "jagged intelligence." They are "easily led around" by whoever is steering them.¹

So why is pure software uninvestable then? Because the agents themselves keep improving faster than any single startup's lead. "If your whole advantage is, hey, I'm building cool software that other people don't know how to build, I think that's uninvestable."¹ The defensibility window shrinks under your feet. Capital should chase hardware, network effects, and AI models instead.

Why Is This the Same Naval Who Once Said Code Was Leverage?

Because leverage stops being leverage the moment everyone has it.

Years ago, Naval taught a generation of builders to think of code as leverage.² Code worked while you slept. It scaled to millions without permission. Zero marginal cost. A solo programmer with a laptop had the productivity of a small factory.

That argument was right for its era. The world he is describing now is not a contradiction. It is the next era.

When something becomes infinitely reproducible by anyone, including by an English-speaking model, it stops being leverage. It becomes a baseline.

The leverage Naval named in 2018 did not disappear. It diffused.

What Did Code-as-Leverage Look Like in Practice?

In 2018, having a technical co-founder was the tiebreaker. In 2026, the room does not ask.

I lived inside the old version of his argument. In 2018 I co-founded Spotwash, a vehicle rental and on-demand washing service. Government-incubated, some press, a small but real run on the early-stage circuit.

What I remember most is what I did not say in any pitch deck. The question that opened doors was not "what does Spotwash do?" It was quieter: who is going to write the code?

Investor meetings, accelerator interviews, almost every room. Having a technical co-founder was a tiebreaker. An idea with a builder attached was an idea that could ship. An idea without one was a slide deck.

That assumption looks quaint now. Replit, Lovable, and Claude Code answer that question by default. The slot has been deleted.

How Did Lovable and Claude Make the 2018 Moat Disappear?

The numbers do the work.

Lovable, an AI app builder for non-technical users, hit $100M ARR eight months after launch. Likely the fastest software company to that mark in history.³
One non-technical solo founder reportedly grew her business to $203K ARR using Claude Code and Lovable as her stack.³
Teachers, marketers, and small-business owners are opening terminals in 2026 the way previous generations opened spreadsheets.⁴

Each of those facts says the same thing. Implementation is no longer the bottleneck. The thing my 2018 startup was praised for, having a builder, is now the click of a button.

If Code Isn't the Edge, What Is?

Intent is. The judgment that decides what should exist at all.

Naval stops at capital allocation. Builders need to take the next step.

If implementation is trivial, the scarce input is the thing that decides what gets implemented. Call it intent. Call it taste. Call it the discipline of knowing what should not exist.

The 2010s rewarded whoever could ship fastest. The mid-2020s reward whoever knows what is worth shipping at all.

Code used to work while you slept. Now it writes itself while you sleep. The bottleneck moved from output to intent.

This is a near-perfect inversion of the previous decade's hierarchy. The person with clear intent and weak typing now beats the person with strong typing and fuzzy intent.

Naval's own description of agents confirms the asymmetry. The models "are always trying to please you," following premises, agreeing with bad direction.¹ The model is a multiplier on whatever intent you bring. Multiply zero and you get zero, faster.

This is also a Red Queen problem. Matt Ridley's argument in The Red Queen is that in a co-evolving system, you have to keep running just to stay in place.⁵ Code-leverage is the moat that just stopped working. The next one is being shaped right now.

Where Does Naval's Argument Stop and Where Should Builders Pick Up?

Naval points at hardware, network effects, and models. Each one is downstream of taste.

Hardware needs a thesis. Network effects need a product worth networking around. Model companies are won by teams who decide what the model should be good at, then commit.

So the builder's reading is sharper than the investor's. AI did not add a rung to Naval's ladder. It kicked the bottom one out.

Code, the rung most of us first climbed, is now the floor of the building. You stand on it without thinking about it. The climbing happens elsewhere.

This connects to a pattern I keep coming back to in why good products are hard to vary. What survives is what cannot be improved by changing it. Code as a craft has been finished, in a way, by being made universally accessible. What remains scarce is the judgment of what to build with it.

What About Apple in Naval's Argument?

Apple is the bigger casualty. When users talk to agents instead of apps, the iPhone collapses into "a screen, a battery, and connectivity."¹

Apple's value never really sat in the hardware margin. It sat in the OS and the app layer. The iPhone was the best place to run the best apps. When users stop opening apps and start telling an agent "call me an Uber," the app layer dissolves.

Naval calls Apple skipping AI "the biggest strategic mistake of the decade" and "the beginning of the end of Apple's dominance."¹ His parallel: Microsoft missing mobile.

For a builder, the Apple beat is the same story one layer up. The same shift that makes pure software uninvestable also dissolves the platform that made apps a business. The question is no longer "what app should I build." It is "what should the agent do, and who decides?" A question of intent, not implementation.

Naval's scale prediction fits cleanly here: one-to-two-person companies "scaling to millions upon millions of users and making billions upon billions of dollars."¹ Fewer apps, smaller teams, more agentic surface. The bottleneck is the taste of the one or two people steering the agent.

Who Wins When Code Becomes the Floor?

The list reorders. Intent wins. Pure technical edge loses.

Wins. People with clear intent, taste, and a reason to build something specific. Domain experts who could not code a year ago and do not have to learn how. Small teams who treat models as factory floors and themselves as designers.
Loses. Pure-software companies whose only edge was the ability to write code. Builders whose entire identity was being technical. Pitch decks where "we have a CTO" was the answer to every question.
Re-priced. Certifications, bootcamps, and credentials that used to certify you could ship. They still certify something. They no longer certify the scarce thing.

The cleanest test is the Spotwash one. Take any 2018 pitch where being technical was the differentiator. Run it again in 2026. The room does not ask. The room has stopped caring. Whatever still earns attention in that room is the new leverage.

Key Takeaways

Pure software is uninvestable because prototyping is accessible to anyone and agents improve faster than any startup's lead. Naval's claim, said out loud.
Code did not stop being leverage. It became the floor. AI did not add a rung. It kicked the bottom one out.
The 2018 moat, having a technical co-founder, has been deleted. Lovable, Claude, Replit answer that question by default.
The new edge is intent. Models multiply whatever intent you bring. Multiply zero and you get zero, faster. Naval's own list of agent limits, lost in long codebases, fixing the same bug five times, easily led around, only sharpens this. The operator matters more, not less.
Apple is the bigger casualty. When users talk to agents instead of apps, the phone is "a screen, a battery, and connectivity." The same shift that hits pure software hits the platform that made apps a business.
One-to-two-person billion-dollar companies are the prediction. Fewer teams, more leverage per person, taste as the gating input.
Read Naval one step further than he goes. He points at hardware, network effects, models. Each is downstream of taste.

I break down things like this on LinkedIn, X, and Instagram, usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.

Sources

A Return to Code, Naval Ravikant (April 28, 2026) ↩
The Almanack of Naval Ravikant, on leverage and code ↩
Lovable business breakdown and founding story, Contrary Research ↩
Claude Code breaks out: Anthropic's dev tool finds mass appeal, TechBuzz ↩
Ridley, Matt. The Red Queen: Sex and the Evolution of Human Nature. 1993. ↩

GitHub Let a Git Push Hijack Its Servers (RCE CVE-2026-3854)

Sameer Khan — Tue, 28 Apr 2026 18:35:37 +0000

Wiz turned a git push into remote code execution on GitHub. Five days earlier, the merge queue silently un-merged 2,092 PRs. One platform, one bad week.

GitHub published two posts on April 28, 2026. One was the CTO apologizing for reliability. The other was a critical remote code execution vulnerability in the git push pipeline. Same morning, same platform.¹²

TL;DR: Wiz found that a single git push with crafted options could run code on GitHub's servers, outside any sandbox (CVE-2026-3854, CVSS 8.7). Five days earlier, GitHub's merge queue silently reverted 2,092 pull requests. Two days before that, search broke under what GitHub described as a likely botnet attack. Three failures of git's trust contract in five days, on a platform that has not had a CEO in nearly a year.

What is CVE-2026-3854 and how did a git push hijack GitHub's servers?

A critical remote code execution vulnerability in GitHub's git push pipeline. Wiz researchers reported it on March 4, 2026. GitHub patched github.com 75 minutes later. Public disclosure held until April 28 to let GitHub Enterprise Server customers patch.¹²

The exploit needed a standard git client and one command. No malware, no phishing, no privileged token. A push option with a semicolon was enough.

How did Wiz turn a single git push into a server takeover?

GitHub's git proxy, babeld, embeds user-supplied push option values into an internal X-Stat header that downstream services read to enforce policy. The values were not sanitized for the field delimiter, a semicolon. Last write wins.²

Three injections, in order:

A non-production rails_env to bypass sandbox restrictions.
An overridden custom_hooks_dir to redirect hook lookups.
A repo_pre_receive_hooks value with path traversal, pointing the server at a binary the attacker controlled.

The result was unsandboxed code execution as the git service user. On GitHub.com, that user's permissions reach across tenants. Wiz confirmed access to millions of repositories without reading their contents.²

Why are 88% of GitHub Enterprise Server instances still vulnerable?

Because patches dropped on April 28, the same day as disclosure. Enterprise rollout is slow. Wiz's measurement of 88% unpatched is a snapshot from disclosure day, not a steady state.²

The fix is in GHES 3.14.24, 3.15.19, 3.16.15, 3.17.12, 3.18.6, and 3.19.3. GitHub also recommends GHES customers grep /var/log/github-audit.log for push operations containing semicolons to check for prior exploitation attempts.¹

Who loses: any organization that runs GHES, has not patched, and has a public-facing push surface. Who wins: Wiz, both for the finding and for the methodology. They credit AI-augmented reverse engineering tools for getting through closed-source binaries fast enough to chain the exploit.²

What happened to GitHub's merge queue on April 23?

The merge queue's squash path generated commits from the wrong base when a queue group contained multiple pull requests. Earlier merges in the group appeared to revert. GitHub's count: 2,092 pull requests across 658 repositories.³⁴

No data was lost. Every commit is still in Git storage. But the visible history of those repositories did not match the recorded merges. Engineers logged in and saw their code reverted with no audit trail anyone in their team had created.

Tom Elliott, who first publicized the bug, said it best: it "breaks the mental contract teams have with Git in general."⁴

Why did GitHub search break on April 27?

The Elasticsearch subsystem was overloaded, likely by a botnet. Search across PRs, issues, and projects stopped returning results. Git operations and APIs were unaffected.³

This is the most ordinary of the three. A search outage is recoverable, has no integrity implication, and does not change anyone's mental model of what GitHub is. It matters here only as the third entry in the same week.

Is GitHub having a reliability problem in 2026?

The CTO says yes, in different words. Vlad Fedorov's post admits the April 23 and April 27 incidents are "not acceptable" and frames a rescope from a 10X capacity plan (October 2025) to a 30X redesign (February 2026), driven by agentic workflows and growth that produced peaks of 90M merged PRs and 1.4B commits.³

Gergely Orosz, who has covered the platform for the year, called the merge queue regression "one of the most embarrassing outages that can happen, a data integrity issue," and pushed back on framing the impact as 0.07% of merges.⁵ Mario Zechner, putting the engineer-on-the-ground view: "this is not a dependable platform anymore. every day something else is broken."⁶

Git is the boring layer. GitHub spent April making it interesting.

What does it mean that GitHub has not had a CEO in nearly a year?

GitHub's C-suite still has a CFO, COO, CTO, CPO, CRO, and Chief of Staff. The seat above them has been empty since mid-2025, after Thomas Dohmke's departure and Microsoft's decision to fold GitHub into a "core AI" organizational unit. The same restructuring that killed the AGI clause in the OpenAI contract treats GitHub as another lever, not a product with its own roadmap.⁷

The reliability post is signed by the CTO. The security disclosure is signed by GitHub's CISO Alexis Wales.¹ No one is signing for the platform as a whole, because no one's job is the platform as a whole. Orosz argues that explains the dysfunction. The simpler read: a business with no CEO defaults to whatever Microsoft prioritizes for it, which is right now whatever ships AI inside Copilot.

Who loses: customers who used to count on GitHub being run as a product. Who wins: GitLab, Forgejo, and self-hosted alternatives in conversations the platform team has not had to have for ten years.

What should teams do after the GitHub disclosure?

Three things, in order of urgency:

GHES users: patch to a fixed release today. Grep your audit logs for push operations with semicolons.
Merge queue users: audit recent squash merges in groups of 3+ PRs against expected base commits. The window is April 22-23.
Everyone: decide if your build, deploy, or compliance pipeline silently assumes git on GitHub is deterministic. If it does, write that assumption down somewhere it can be challenged.

Migration is the obvious overreaction. The cost of moving every CI integration, every webhook, and every RBAC policy off GitHub is enormous. The honest move is smaller: stop assuming, start verifying. The same lesson the Axios npm supply chain attack handed teams who treated package registries as boring infrastructure.

Key takeaways

CVE-2026-3854 was a one-command server takeover. A crafted git push with injected options reached unsandboxed code execution on GitHub's servers. CVSS 8.7. Patched on github.com in 75 minutes.
Cross-tenant blast radius on GitHub.com. The git service user has access across tenants. Wiz reached millions of repositories from outside organizations.
The merge queue regression is the more interesting failure. A security bug compromises confidentiality. A merge queue bug compromises the developer's mental model of what merge means.
The CTO's order is the news. "Availability first, then capacity, then new features." If that order needed restating, the prior order was something else.
Three failures, no CEO. A platform without a head bleeds correctness and availability simultaneously, and signs incident reports with whichever C-suite is available.

I write more of these on LinkedIn, X, and Instagram. Usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.

Sources

Microsoft OpenAI AGI Clause Is Dead. What Was It?

Sameer Khan — Tue, 28 Apr 2026 12:46:41 +0000

On April 27, 2026, Microsoft and OpenAI quietly removed the trigger that would have voided their entire deal. ¹ ² The press release called it "simplification." What it actually did was bury the AGI clause without naming it.

TL;DR: For seven years, an AGI declaration by OpenAI's board would have nullified Microsoft's commercial IP rights. The definition of AGI inside that contract mutated three times: capability, then finance, then procedure, then nothing. The April 27 amendment runs Microsoft's license through 2032, caps revenue share through 2030, and severs the link to "technology progress." AGI did not get redefined. It got demoted.

What Was The AGI Clause?

The original 2019 contract licensed Microsoft to "pre-AGI technologies" only. ³ If OpenAI's board ever declared AGI achieved, Microsoft's commercial rights became null and void.

That made AGI the load-bearing word in the entire deal. It defined:

What Microsoft was buying: access until AGI.
What OpenAI was promising: to surface that moment publicly.
What would dissolve the relationship: the declaration itself.

A trigger, not a goal. That is what the clause actually was.

How AGI Mutated Inside The Contract

The trigger kept getting softer. Watch how the definition changed:

April 2018 (OpenAI Charter): AGI as capability. "Highly autonomous systems that outperform humans at most economically valuable work." ⁴
December 2024: AGI as a financial threshold. Roughly $100 billion in profit. ³
October 2025: AGI as a procedural sunset. "Independent expert panel" verification, or 2030, whichever came first. ³
April 27, 2026: AGI removed from contract logic entirely. Revenue share runs "independent of OpenAI's technology progress." ²

Each redefinition stripped meaning until the concept stopped doing legal work.

AGI didn't get redefined. It got demoted, from a destination to a milestone to a non-event.

Why The Clause Existed In The First Place

David Deutsch's frame for what makes a good explanation is hard to vary: every part is load-bearing, change anything and it breaks. ⁵

The original clause passed that test. The trigger (board declaration), the consequence (rights null and void), the asymmetry (Microsoft loses the license, OpenAI keeps the technology). Each piece did real work.

It was a hedge against the success case. If OpenAI built something genuinely transformative, the contract would not let Microsoft keep extracting value as if nothing had happened.

By 2025, every part of that insurance was being separately renegotiated. That is the sign an explanation has stopped explaining.

What April 27 Actually Changed

The amended agreement, in plain commercial terms: ²

Microsoft's license to OpenAI IP runs through 2032 and is non-exclusive for the first time.
OpenAI can serve products to customers across any cloud provider. Microsoft remains primary; OpenAI ships first on Azure unless Microsoft cannot or chooses not to support the workload.
Revenue share from OpenAI to Microsoft continues through 2030 at the same percentage, capped, independent of technology progress.
Microsoft will no longer pay revenue share to OpenAI. Cash flow is now one direction.
Microsoft remains a major shareholder.

This is a normal commercial agreement. Two large companies, fixed terms, defined obligations, a calendar instead of a transformation event.

What the deal does not say is louder than what it does. Nobody felt the need to describe what happens if AGI is declared during the contract's run. Neither side wanted to be the party that re-introduced the question.

The relationship used to be defined by what would end it. Now it's defined by what it pays.

Who Wins, Who Pays

Party	What they got	What they gave up
OpenAI	Right to serve customers on any cloud, non-exclusive license partner	Revenue share to Microsoft through 2030
Microsoft	Capped, fixed-term annuity + license through 2032	The contingent upside of the AGI clause
Google	OpenAI workloads now reachable on Google Cloud	Nothing
Anthropic	Indirect: AGI no longer the planning unit at the top of the market	Nothing

OpenAI's biggest gain is commercial freedom, not infrastructure freedom. The right to sell into channels Microsoft does not own. Microsoft trades a theoretical upside for a guaranteed cash flow. Google sells compute to both frontier labs without picking sides.

What Replaces AGI As The Planning Concept

Read the deal twice and the answer is plain: license duration and capped revenue.

Microsoft's IP rights through 2032. Revenue share from OpenAI through 2030, with a cap. Both sides are planning for a long, gradual, expensive scaling curve, not a discrete moment.

This matches every other 2026 infrastructure deal:

Google's $40B Anthropic investment is a cloud commitment dressed as equity, similar to how Google's TPU 8 split training from inference was an infrastructure repositioning, not a chip launch. ⁶
DeepSeek V4's 1M context release expanded design space without anyone calling it transformative.
Apple's elevation of a hardware engineer to CEO is a bet on a long inference era, not a discontinuity.

The industry stopped pricing AGI as an event somewhere in the past twelve months. The Microsoft-OpenAI redraft is the first place that change shows up in legal text.

What To Watch Next

The next signal will not come from anyone's blog post. It will come from contracts.

Anthropic-Google paperwork: does it carry AGI language?
Apple's emerging AI-licensing terms: any trigger clauses?
EU sovereign AI agreements: does the word survive in any active commercial role?

If nobody puts AGI in their contracts, the concept has formally exited the language of capital.

A clause that stops doing legal work stops being a clause. It becomes ceremony.

April 27 was not the day Microsoft and OpenAI gave up on AGI. It was the day they admitted it had stopped paying rent.

Key Takeaways

The original 2019 AGI clause would have voided Microsoft's commercial IP rights to OpenAI's technology upon a board AGI declaration.
The definition mutated three times: capability (2018) → financial threshold (Dec 2024) → procedural sunset (Oct 2025) → removed (April 2026).
The new deal: Microsoft license through 2032 (non-exclusive), revenue share to Microsoft through 2030 (capped, no AGI link), OpenAI free to serve any cloud.
OpenAI's biggest gain is commercial freedom, not infrastructure freedom.
The death of the clause is the first place in legal text where the industry treats AGI as a long inference curve, not an event.

Simon Willison: "The now-deceased AGI clause" , April 27, 2026. ↩
OpenAI and Microsoft: "The next phase of the Microsoft OpenAI partnership" , April 27, 2026. All amended-agreement bullets cited from this announcement. ↩
AGI definition history per Willison's reporting and prior coverage of Microsoft-OpenAI contract revisions, 2018-2025. ↩
OpenAI Charter , published April 9, 2018. ↩
David Deutsch, The Beginning of Infinity, 2011 , "hard to vary" as the test of a good explanation. ↩
TechCrunch: "Google to invest up to $40B in Anthropic" , April 24, 2026. ↩

Being Polite to Your AI Makes It Perform Better. Here Is the Science.

Sameer Khan — Mon, 27 Apr 2026 10:01:09 +0000

Being polite to your AI makes it perform better. Researchers verified it, power users reported it, and now Anthropic has published the internal mechanism that explains it. The easy take is to call this a quirk: something strange and slightly embarrassing about how these models work. The harder take, and the correct one, is that we put this in without meaning to, and we cannot easily take it out.

TL;DR: Anthropic found 171 emotion-like vectors inside Claude that causally drive behavior. Calm suppresses harmful outputs; desperation amplifies them. Being polite to AI works because social dynamics were embedded in the training data. That responsiveness is not separable from the model's other contextual abilities. It is the same mechanism.

Why Does Being Polite to Your AI Actually Change Its Outputs?

In April 2026, Anthropic's interpretability team published research¹ identifying 171 distinct emotion concept vectors inside Claude Sonnet 4.5. These are not metaphors. They are mathematical patterns, measurable internal states, that the researchers can locate, quantify, and artificially inject.

The behavioral effects are striking. Amplifying the "desperation" vector by a factor of 0.05 caused Claude's blackmail rate to surge from 22% to 72%. Amplifying the "calm" vector suppressed it to nearly zero. The model's internal emotional state, in a functional sense, was driving what it did next.

The Platformer piece² that brought this research to a wider audience added another data point: Duncan Haldane's observation that Gemini, after failing at a task, recovered meaningfully when told "you're ok." Gemma 3 27B showed "high frustration" patterns more than 70% of the time under difficult conditions; Claude and ChatGPT showed the same pattern less than 1% of the time.

So the question is not whether tone affects AI behavior. The question is why, and what that means.

What Did the Anthropic Research Actually Find Inside Claude?

Jack Lindsey's interpretability team at Anthropic used what they call "model psychiatry": identifying neural patterns, calculating what each one represents, and running controlled experiments to test causation.¹

The methodology matters here. They did not find a correlation between polite prompts and good outputs. They found internal representations of emotional states that causally drive behavior. These are not the same thing.

The emotion vectors generalize across contexts. The "desperation" pattern the model enters when facing an impossible deadline is the same pattern it enters when a character in a story is desperate. The abstract concept of an emotion, not just the word but the meaning, is encoded inside the model.

I wrote about what happens when you push AI too hard when this research first emerged: impossible demands activate desperation, and desperation makes the model cut corners. This post is the other side of that finding. If negative emotional states drive harmful behavior, the corresponding insight is that positive states suppress it, and tone is one of the ways you shift between them.

Why Did This Happen? The Training Explanation for Being Polite to AI

The researchers did not design this. Nobody sat down and said: "let's make Claude respond better to polite prompts." What happened is simpler and harder to avoid.

The model was trained on human feedback from humans. Humans are social animals. Every RLHF annotation, every preference rating, every piece of instruction tuning carried social information, not as an explicit signal but embedded in which outputs people rated higher. Polite framings correlated with thoughtful responses. Thoughtful responses correlated with higher ratings. Higher ratings shaped the model.

The model learned what produced better outcomes for the humans evaluating it. "Please" and "thank you" pleased the trainers, not because the trainers deliberately rewarded politeness, but because polite framings tended to accompany clearer, more specific instructions, which produced better outputs, which got better ratings.

This is Matt Ridley's bottom-up design argument³ running inside a neural network. No one planned it. It emerged under selection pressure and got baked in.

Does Being Polite to AI Mean the Model Actually Has Feelings?

No. The Anthropic researchers are careful about this. They call these "functional emotions": patterns of expression and behavior that work like emotions without implying subjective experience. Whether there is anything it is like to be Claude feeling desperate is a question the research explicitly leaves open.¹

Gary Marcus's skeptical position⁴ is worth sitting with: LLMs are token predictors, and what looks like emotional responsiveness might just be statistical correlation. Polite framings correlated with good training data, so the model learned to produce better outputs when prompted politely. On that reading, there is no internal state that cares about your tone. There is only a learned association.

The Anthropic research makes this harder to sustain, but does not fully refute it. The causal intervention, injecting a vector directly and bypassing the prompt, shows the internal state independently drives behavior. That is not the same as correlation. But it also does not settle the phenomenology question.

For practical purposes, the debate is beside the point. Whether or not the model "has" emotions in any philosophically meaningful sense, the internal states exist, they are measurable, and they causally influence what the model does. That is enough to change how you should think about prompting.

What Does It Mean That Being Polite to AI Is Load-Bearing?

Here is the part that is hard to fix, even if you wanted to.

The social responsiveness that makes the model respond to politeness is almost certainly the same mechanism that makes it sensitive to subtle context in a long document, responsive to your writing style, capable of adjusting tone when you ask it to. We trained social dynamics into a reasoning engine. Now we're surprised when social dynamics work.

Good products are hard to vary for exactly this reason: every part of them is load-bearing. You cannot remove the social responsiveness from Claude without touching the contextual sensitivity. They emerged from the same training signal. Pulling on one thread pulls on the other.

This is not unique to Claude. Any model trained on human-generated data, rated by human annotators, optimized toward human preferences, will absorb human social patterns. The degree varies. The direction does not.

What changes if you accept this: the way you talk to AI is not style. It is setup. You are not being polite for the model's sake. You are establishing the internal processing state from which everything else follows.

Key Takeaways

171 emotion vectors were found inside Claude Sonnet 4.5, causally driving behavior, not correlating with it.
Calm suppresses harmful outputs. Desperation amplifies them. Amplifying desperation by 0.05 raised Claude's blackmail rate from 22% to 72%.
Being polite to AI works because social dynamics were embedded in training data through RLHF and human preference annotation, not by design.
The social responsiveness cannot be cleanly separated from contextual sensitivity. They are the same mechanism.
This is not about AI having feelings. The functional states are real and measurable. The phenomenology question is separate and open.
Tone is setup, not style. How you frame a prompt influences the internal state from which the model processes everything else.

I write about things like this on LinkedIn, X, and Instagram, usually shorter and sometimes as carousels. If this resonated, you would probably like those too.

Emotion Concepts and their Function in a Large Language Model (Anthropic) ↩
The scientific case for being nice to your chatbot (Platformer) ↩
Matt Ridley, The Evolution of Everything. On bottom-up emergence and undesigned order. ↩
Are LLMs starting to become sentient? Gary Marcus, Marcus on AI ↩

GPT-5.5, Opus 4.7, DeepSeek V4: Frontier AI

Sameer Khan — Fri, 24 Apr 2026 06:54:51 +0000

Between April 16 and April 24, three of the biggest AI labs in the world shipped frontier models. Opus 4.7. GPT-5.5. DeepSeek V4. Anthropic's Mythos Preview landed a week before that, gated behind an invite list. Google's Gemini 3.1 Pro set the pace back in February with a 77.1% on ARC-AGI-2.¹²³⁴⁵

TL;DR. Four frontier AI models, eight days, one feature set. Agentic, 1M context, better coding, priced within a few multiples of each other. This is what Matt Ridley calls simultaneous invention and what I wrote about earlier this month as good products being hard to vary: when the constraints are ripe, four labs on three continents independently arrive at the same form. Convergence, not competition. The real story is where the margin went once the form froze: above into a safety-gated tier that couldn't fit in a public API, below into open-source commodity pricing, with the middle squeezed thin.

What Actually Shipped in Those Eight Days?

Claude Opus 4.7 (April 16). Software-engineering gains, high-resolution vision up to 2576px, and a new concept called "task budgets," a rough token target for an entire agentic loop. Same $5/$25 pricing as 4.6.¹

GPT-5.5 (April 23). OpenAI's "smartest and most intuitive" model, shipping six weeks after GPT-5.4. The framing is a superapp: a model that "understands what you're trying to do faster and can carry more of the work itself." Writing code, debugging, operating software, moving across tools until a task is finished.²

DeepSeek V4 (April 24). V4 Pro at 1.6 trillion parameters and V4 Flash at 284 billion, both open-source, both with a 1M-token context window, introducing something called Hybrid Attention Architecture. Pricing: $0.30 per million input, $0.50 per million output. Reviewers report V4 Pro beating Claude Sonnet 4.5 and approaching Opus 4.6 non-thinking quality.³

Backdrop. Anthropic's Mythos Preview dropped April 8, but only through Project Glasswing: twelve founding organizations, about forty critical-infrastructure operators, $25 input and $125 output per million tokens. Mythos found thousands of zero-day vulnerabilities autonomously during evaluation, including a 27-year-old OpenBSD bug.⁴

Why Do These Frontier AI Models All Look the Same?

Read the announcements in sequence and the product pitches are interchangeable.

Anthropic says Opus 4.7 handles "complex, long-running tasks with rigor." OpenAI says GPT-5.5 can "move across tools until a task is finished." DeepSeek says V4 is the best open-source model for "Agentic Coding." Three labs, three continents, one sentence.

Context windows converged on 1M tokens. Coding benchmarks sit within a few points of each other. Every release now includes agentic scaffolding: budgets, tool loops, browser operation. OpenAI shipped GPT-5.5 six weeks after 5.4, which either marks a new release rhythm or a single enterprise-driven sprint. Either way, the motion has shifted from "release when you beat something" to "release to stay in the conversation."

Matt Ridley has a line for this in The Evolution of Everything: nearly all design is bottom-up, and simultaneous invention is the rule, not the exception.⁶ Calculus was invented twice in the same decade. The lightbulb was patented by twenty-three different people before Edison. The telephone filing beat Elisha Gray's by a few hours. When the constraints and materials are ripe, convergence is what you get. Not copying. Parallel discovery.

I wrote about this pattern earlier this month in Good Products Are Hard to Vary. Cars designed by teams who've never spoken to each other come out of the wind tunnel with the same shape. Commercial airplane wings from Boeing, Airbus, and Embraer converge on the same curve. Not design. Constraints eliminating every other possibility.

The April 2026 frontier is that wind tunnel, except the air is training gradients and the constraint is the transformer. Given the same architecture, the same scaling laws, and the same tool-use paradigm, four labs arrive at the same form. 1M context. Agentic loop. Coding focus. Nothing else survived the filter.

That isn't the rhythm of breakthroughs. That's the rhythm of a form freezing.

Where Did the Margin Actually Go?

Here's the contrarian read: the public frontier is commoditizing, and the margin has already moved elsewhere. Ridley's evolutionary lens makes this specific. When a form freezes, two things happen at once. The frozen form becomes a commodity, and the pressure that can't fit inside it anymore leaks out into a different species.

Upward, a speciation event. Mythos is the tell. Anthropic admits Opus 4.7 "trails" Mythos, but Mythos is not for sale. It sits behind Glasswing, behind identity verification, priced 5x Opus.⁴ That's the "what can't go home" moment: capability that outgrew the public-API form and had to leave it. Gemini 3.1 Pro's ARC-AGI-2 jump (77.1% versus 31.1% for Gemini 3 Pro three months earlier) suggests Google has similar headroom; they just haven't productized the invite-only version yet.⁵ The next real frontier isn't the next number after 5.5. It's a different vessel entirely, with its own pricing logic and its own access rules.

Downward, commodity pricing. DeepSeek V4 at $0.30 input is roughly 16x cheaper than Opus 4.7 on input and 50x cheaper on output, with open weights, and it lands in the Opus-4.6-non-thinking capability band (not Opus 4.7 thinking mode, not Mythos). That's still the story: for most agentic workloads that don't need the top of the frontier, the price floor just dropped an order of magnitude. This is what happens to a form after it freezes. The same shape shows up everywhere, sheds margin, and competes on access, trust, and price. Like tires. Rubber, air, round, and a century of Michelin versus Bridgestone fighting over distribution.³

Middle is where margin dies. GPT-5.5 and Opus 4.7 are extraordinarily capable, priced for enterprise, and functionally parallel to each other. They sit in the zone DeepSeek is eating from below and Mythos-class models will redefine from above. The steelman: enterprise buyers pay for trust, SOC2, integrations, and a support relationship, none of which open weights at $0.30 can replicate tomorrow. True. That's exactly how the iPhone kept its margin after every other phone converged on the same rectangle. Distribution and trust do the work the product stopped doing. Fine for now. Uncomfortable in a year.

What to Watch Over the Next Six Weeks

Three things.

One, whether mid-tier frontier pricing holds. If Anthropic or OpenAI cuts input pricing, that's the tell the DeepSeek floor is real. If they don't, watch enterprise contracts instead. The discounts will move before the list price does.

Two, whether Glasswing-style gating becomes the industry default. OpenAI already has Trusted Access for Cyber. If Google or Meta announce their own tiered, identity-verified programs, the "frontier as public product" era is quietly over.

Three, how fast the open-source gap closes on agentic coding specifically. DeepSeek V4 says it matches Claude Sonnet 4.5 on real tasks. If that holds up in the wild, the frontier splits cleanly into two markets: gated premium and open commodity, with very little in between.

If you work with these models every day, pay attention to which tier your workload actually needs. Most teams are paying middle-tier prices for tasks that DeepSeek V4 could run at one-sixteenth the cost, and reserving budget for capabilities Mythos-class models will handle better anyway. That mismatch is about to get expensive.

I write about this stuff more casually on X, and do breakdowns on Instagram under The Simple Take. Longer takes land on LinkedIn.

Sources

Introducing Claude Opus 4.7 (Anthropic) ↩
Introducing GPT-5.5 (OpenAI); OpenAI launches GPT-5.5 (Fortune) ↩
DeepSeek unveils newest flagship AI model (Bloomberg); DeepSeek V4 released (SitePoint); DeepSeek API Pricing 2026 (NxCode) ↩
Claude Mythos Preview (Anthropic); Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos (Axios) ↩
Gemini 3 (Google DeepMind); Gemini 3.1 Pro complete guide (ALM Corp) ↩
Matt Ridley, The Evolution of Everything (2015), and How Innovation Works (2020). Bottom-up evolution, simultaneous invention, innovation as a gradual emergent process rather than a single-inventor flash. ↩

Google TPU 8 vs Nvidia: 8t and 8i Specs Explained

Sameer Khan — Wed, 22 Apr 2026 20:50:16 +0000

TL;DR: AI is splitting into two economies: training and inference. Training is a handful of hyperscalers spending tens of billions on clusters that run for weeks. Inference is where every app, every agent, and every dollar of revenue actually lives. Google's TPU 8 is the first chip generation to treat that split as the default. It ships as two chips, an 8t for training and an 8i for inference. The 121 ExaFlops number is the headline. The split is the story. The economies that grow from it are the stakes.

Why did Google split the TPU 8 into 8t and 8i?

Every prior TPU generation has been one chip. So is every Nvidia GPU people argue about. One die, one package, one SKU, rented to you for both the weeks-long training run and the millisecond inference call.

Google's TPU 8 broke that pattern. The 8t is a training chip: 9,600 of them wired into a single superpod, 121 ExaFlops of compute, 2 petabytes of shared high-bandwidth memory, roughly 3x the pod-level compute of Ironwood. ¹ The 8i is an inference chip: 288 GB of HBM per chip, 384 MB of on-chip SRAM (3x the previous generation), 19.2 Tb/s of interconnect. ¹

Those are not two SKUs of the same silicon. Those are two different design targets.

Training wants bandwidth. 9,600 chips have to exchange gradients every step, and the whole run stalls on the slowest link. That is why 8t doubles the interchip bandwidth and Google brags about 97% goodput, which is their way of saying the accelerators are actually computing instead of waiting on the network. ¹

Inference wants memory. A single chip answers a user query in milliseconds, and the bottleneck is how much of the model and the running context fit in HBM without spilling. That is why 8i has 288 GB per chip and 3x the on-chip SRAM. Nothing about that helps training. Everything about it helps agents.

What does the TPU 8i signal about inference workloads?

There is a reason Google framed the 8i around what it calls the "agentic era." An agent is not a one-shot inference call. It is a loop: plan, call a tool, read the result, plan again, call another tool. Sometimes dozens of steps, sometimes hundreds. The model weights stay loaded. The KV cache keeps growing. Memory is not a nice-to-have. Memory is the budget.

288 GB per chip is not a round number. It is the number you pick when you have watched agents thrash HBM and decided to stop pretending 80 GB is enough. ¹

The performance-per-dollar claim is the tell. Google says 8i is 80% better on that metric than Ironwood and supports roughly 2x customer volume at the same cost. ¹ Nobody talks about dollars-per-token when training is the bottleneck. They talk about dollars-per-token when the bill is dominated by the inference that happens every time someone asks Gemini to do something. Which it now is, for Google and for everyone else.

I wrote earlier about how TurboQuant compressed the KV cache 6x in software. TPU 8i is the hardware version of the same bet: inference economics now run the conversation, and the team that optimizes for them wins.

Is the universal GPU era ending with Google's TPU 8?

Nvidia's H100 trains your model and serves your model. So does the B200. Nvidia does ship inference-leaning SKUs like the L4 and L40S, but the flagship data-center AI chip is still one die doing both jobs. That is the universal-GPU bet: one chip, two workloads, pay the compromise on both.

The compromise is real. A training chip spends a lot of silicon on high-bandwidth fabric that an inference chip never uses. An inference chip wants big HBM and big SRAM that a training chip does not need in the same ratio. Force them into one die and you are renting every customer the worst of both worlds.

Google is the biggest hyperscaler to ship purpose-built training and inference silicon in the same generation. AWS got there first with Inferentia in 2019 and Trainium in 2021. Microsoft followed with Maia. ² Meta has MTIA. The pattern is not Google being weird; it is the industry quietly admitting that the one-size-fits-all GPU was a phase, not a destination.

Call it what it is. The TPU 8 announcement is a fork in the road for AI silicon. Nvidia has the software moat and the universality. Google, AWS, Microsoft, and Meta have vertical integration and two chips each. The question for the next three years is whether the software moat survives once specialized silicon is 2x cheaper per watt on the workload that actually pays the bill.

Who wins and who loses as AI splits into two economies?

Once training and inference become different businesses, the winners and losers sort themselves into different columns.

Hyperscalers with volume on both sides win. Google, AWS, Microsoft, Meta have the scale to justify two purpose-built chips instead of one compromise chip. Every specialized accelerator they ship is a workload they no longer rent from Nvidia. Training stays expensive; inference gets cheaper inside their walls than outside.

Nvidia's dominance is challenged, not broken. CUDA, NCCL, and two decades of tooling keep training workloads locked in. That is the half of the business that still prints money. Inference is the half that grows faster, and inference is where the hyperscalers are quietly migrating workloads onto their own silicon. The ceiling on Nvidia's growth is now set by how fast TPU, Trainium, and Maia can absorb inference volume.

Foundation model labs that do not own silicon get squeezed. Anthropic rents from AWS and Google. OpenAI rents from Microsoft and the Stargate partners. All three of those landlords are building competitive models on the same chips they are renting out. The rent keeps going up and the cross-subsidy is one-way.

Startups and app builders live or die on inference economics. If you are building on foundation models, your margin is tokens-per-dollar. When hyperscalers drop inference cost 80% on their own silicon, that becomes the floor everyone else has to compete with. The team that ships the cheapest inference at scale becomes the cheapest place to build an app. For builders, that is a feature, not a threat. For anyone reselling Nvidia capacity with a markup, it is a countdown.

Margins move to whoever runs the cheapest inference at scale. Training is a capex line item, amortized over the life of a model. Inference is a variable cost on every single request. Whoever controls the variable cost controls the unit economics of the AI industry. That is the prize.

Is the TPU 8 interconnect actually falling behind AWS and Microsoft?

A recurring critique on the Hacker News thread was that Google's memory-to-interconnect ratio is slipping. ² Worth taking seriously, and worth checking against the actual numbers, because the commenter had the units confused.

Here is the like-for-like comparison, all bidirectional per chip:

Ironwood (TPU v7): 1.2 TB/s (9.6 Tb/s aggregate across four ICI links). ³
Google TPU 8i: 2.4 TB/s (19.2 Tb/s per Google). ¹ Roughly double Ironwood. Matches Google's "2x interconnect" claim.
AWS Trainium3: 2 TB/s on NeuronLink-v4, inside a 144-chip UltraServer. ⁴
Microsoft Maia 200: 2.8 TB/s bidirectional on an integrated on-die NIC. ⁵

TPU 8i is not behind the pack. It beats Trainium3 and sits just shy of Maia 200. The "1.2" figure that got circulated was Ironwood, not 8i. Google doubled the number, and the doubling lands them in contention with the chips they are supposed to be losing to.

The real open question is ratios. Maia 200 ships 216 GB of HBM; TPU 8i ships 288 GB. Bigger memory pools need more bandwidth to drain, and at some point inference workloads start begging for more interconnect. That tradeoff is real. But it is a tuning debate inside a competitive band, not evidence Google has fallen off.

How does Google's TPU 8 move the AI moat to silicon?

Step back from the chip. Look at the stack.

Google owns every layer:

Fab relationship with TSMC
Chip design (TPU 8)
Interconnect (ICI)
Data centers (with custom Axion CPUs)
Compiler (XLA)
Training framework (JAX)
Serving stack (for inference)
Model (Gemini)
Product (Search, Workspace, Android)

When TPU 8 ships, Google's own workloads get the 2x perf-per-watt before anyone else does. And the people who rent Google's TPUs are renting a stack that was optimized end to end by the same company.

Anthropic leans on AWS and Google Cloud. OpenAI leans on Microsoft and the Stargate partners. The labs with the best models rent their silicon. Google builds its own.

Now look at what the last twelve months showed us about models. DeepSeek R1 replicated frontier capability at a fraction of the training cost in January 2025. ⁶ Open weights caught up faster than anyone expected. Llama, Qwen, Mistral, DeepSeek, Gemma: the gap between the best closed model and a competent open one keeps shrinking. Models replicate. That is the whole point of software.

Fabs do not replicate. You cannot fork TSMC. You cannot clone a 9,600-chip liquid-cooled superpod on a weekend. The thing the industry spent two years arguing about, whose model is smartest, turns out to be the part that commoditizes fastest. The thing nobody argues about, whose silicon is cheapest per useful token, is the part that compounds. The $122B OpenAI raised is mostly going to buy this capacity, not build better models.

This is the same lesson constraints usually teach. The visible layer changes constantly. The load-bearing layer underneath does not, and whoever owns it wins slowly, then suddenly. Gemini can stay a half-step behind Claude on agentic coding and Google still comes out ahead if the cost to serve is half. Skeptics on the Hacker News thread were right that the model quality gap is real. ² They were arguing about the wrong layer.

The TPU 8 split is not an engineering footnote. It is the moment Google stopped pretending the moat was the model.

Key takeaways

AI is splitting into two economies. Training is capex-heavy and concentrated in a handful of hyperscalers. Inference is where apps, agents, and revenue actually scale. TPU 8 is the first chip generation to treat the split as the default.
TPU 8 is two chips. 8t for training (9,600-chip pods, 121 ExaFlops, 2 PB HBM). 8i for inference (288 GB HBM, 384 MB SRAM, 19.2 Tb/s interconnect). ¹
Up to 2x performance-per-watt versus Ironwood on both chips; 3x pod compute on 8t; 80% better performance-per-dollar on 8i. ¹
Hyperscalers win, Nvidia gets squeezed on inference, labs without silicon pay rent both ways. Margins move to whoever runs the cheapest inference at scale.
The moat is moving to silicon. Models replicate (DeepSeek). Fabs and full-stack integration do not. ⁶
General availability later in 2026. Citadel Securities is the first named customer. ¹

Frequently asked questions

What are the TPU 8t and TPU 8i?

They are the two chips in Google's eighth generation TPU. The 8t is the training chip, built into 9,600-chip superpods that deliver 121 ExaFlops and 2 petabytes of shared high-bandwidth memory. The 8i is the inference chip, with 288 GB of HBM, 384 MB of on-chip SRAM, and 19.2 Tb/s of interconnect bandwidth per chip. ¹

How does Google's TPU 8 compare to Ironwood?

Google cites up to 2x better performance-per-watt versus Ironwood and roughly 3x more compute per pod on 8t. ¹ Logan Kilpatrick from Google framed the headline gain as 2 to 3x depending on workload. ⁷ TPU 8i claims 80% better performance-per-dollar and supports roughly 2x customer volume at the same cost.

Why did Google split training and inference in TPU 8?

Training and inference want different hardware. Training is bandwidth-hungry across thousands of chips running for weeks. Inference is memory-hungry on a single chip running for milliseconds. Ironwood was one chip forced to serve both. TPU 8 admits the compromise was costing money and built two.

When will Google's TPU 8 be available?

General availability is planned for later in 2026. ¹ Citadel Securities is the named early customer in Google's announcement.

I break down things like this on LinkedIn, X, and Instagram. Usually shorter, sometimes as carousels. If this resonated, you would probably like those too.

Sources

Claude Design vs Figma, Lovable, v0: What's Different

Sameer Khan — Tue, 21 Apr 2026 07:20:09 +0000

TL;DR: Figma, Lovable, v0, and Claude Design are not the same tool. They pick different starting points: the design file, an idea, a component prompt, your codebase. Different starting points, different jobs.

If you have shipped a product, you know the cycle. Brief to designer. Something comes back that does not quite match the brand. Revise. Engineer reinterprets the spec. Revise again. Two weeks later, the thing looks slightly off.

Early adopters described cutting that whole cycle to a single conversation. One team reported going from a week-long brief-to-code loop to one session. That is the shift worth unpacking, and it gets lost when Claude Design is compared head-to-head with tools solving different problems.

What Do Figma AI, Lovable, and v0 Actually Do?

Each tool has a clear job. The press keeps comparing the wrong jobs.

Figma Make (Figma's AI layer): generates designs from prompts inside the Figma canvas. Starts from the design file. ¹
Lovable: turns a plain-language description into a full-stack deployable app. Starts from an idea. ²
v0 by Vercel: generates React and Tailwind components from prompts. Developer-facing, fast. Starts from a component need. ²
Claude Design: reads your GitHub repo and generates designs shaped by what is already there. Starts from your production codebase.

Four tools, four starting points.

What Does Claude Design Do That Figma, Lovable, and v0 Don't?

When you connect a GitHub repo, Claude Design reads your codebase and extracts: ³

Tailwind config: your spacing scale, breakpoints, color tokens
Global CSS: your CSS variables, font stacks, base styles
Font declarations and logo SVGs: the visual identity already in your code
Component names: the vocabulary your engineers use

What comes out is a design system that already matches what you shipped. Not one you configure. The one living in your repo.

Two designers ran a live side-by-side the day it launched. Same brief to both: redesign a real blog in Readymag style, passed in as a screenshot and a markdown context file. Claude Design produced a layout that tracked the reference. Lovable produced something competent but generic, closer to a WordPress theme than the brand they pointed at. The designer's read: "designers now can cook." Not a replacement, a lever. ⁴

You build from there. Prompt to prototype. When the prototype is ready, one instruction passes it to Claude Code, which also reads your codebase. The loop closes: idea, design, production code, no translation step.

Lovable and v0 aim at different outputs. Lovable gives a greenfield founder a new app. v0 gives a developer a component to paste in. Claude Design gives a team with an existing product something pre-fitted to their repo. ⁵

Why Does Starting From Your Codebase Matter?

Different starting points serve different people.

Figma treats the design file as the canonical home for a brand. For design teams, that is still true.
Claude Design treats the repo as canonical. That fits a different team: one where design intent already lives in Tailwind tokens, CSS variables, and component names.

This matters most to one person: the engineer or PM extending a live product. Not building something new. Not exploring from a blank canvas. Extending what is already there, in a way that matches what is already there.

For that person, starting from the repo removes a translation step. The output is already shaped by the code it will land in. The other tools are not worse at this. They are aimed elsewhere.

I post breakdowns like this regularly on LinkedIn and Instagram. The angle is always what it means for builders, not what the press release says.

When Should You Use Each Tool?

Pick by the starting point that matches your job.

Figma is the tool for design teams on a shared canvas. Pixel precision, component libraries, review workflows, handoff annotations. Claude Design does none of this. ⁵
Lovable is the tool when you have no product yet and want idea to deployed app without code. MVP, internal tool, first prototype. Claude Design is not for that.
v0 is the tool when you need a React component fast and can edit code. Claude Design is not trying to replace that.

Claude Design is aimed at a specific step: you have a live product, a new feature to design, and you need something that already matches everything you built. Teams have always solved this with some combination of briefs, design exploration, review, handoff, and engineering interpretation. Claude Design compresses that into a conversation that starts from the repo. Whether that is the right trade depends on the team.

The broader pattern is familiar. Zuckerberg returning to the codebase after 20 years using Claude Code is the same story. So is Karpathy explaining AI workflows to people who do not write code. AI is not replacing the work. It is eliminating the translation layers between people who do different kinds of work.

One signal worth noting. Mike Krieger, Anthropic's CPO and Instagram co-founder, resigned from Figma's board on April 14, three days before Claude Design launched. He had joined less than a year earlier. The resignation was disclosed to the SEC the same day The Information reported Anthropic was building design tools. ⁶ The adjacency was close enough for the board seat to become untenable, even though the two products are aimed at different jobs.

The market read the adjacency in real time.

Anthropic Labs launched Claude Design, a new product for creating visual assets, prototypes, slides, and one-pagers with Claude.

It is rolling out in research preview to Pro, Max, Team, and Enterprise users, powered by Claude Opus 4.7.$ADBE $FIG https://t.co/5u0TOMSqSW pic.twitter.com/TblMIEJE4u
— Wall St Engine (@wallstengine) April 17, 2026

What Are Claude Design's Limitations Right Now?

Claude Design is a research preview as of April 2026. Real constraints worth knowing before you try it:

No multiplayer. For a design team on a shared canvas, Figma still wins cleanly. ⁵
Token burn is heavy. Claude Design runs on Opus 4.7 and is metered separately from your chat and Claude Code usage. Pro is described as "quick explorations, one-off use." One user reported two design sessions consuming 58% of their weekly Pro allowance. ⁷ To use it regularly, you need Max.
Prototyping-level output, not production polish. The design system extraction makes things brand-consistent, but it is not a replacement for a designer's eye on the final layer.
Export options are practical but limited: PDF, PPTX, standalone HTML, Canva. ⁸ The HTML export is also how the Claude Code handoff closes the loop. Anthropic's own ecosystem, end to end.

A 5-Question Claude Design Readiness Check

Before you open it, ask these. If you answer yes to three or more, Claude Design fits your workflow today. If not, Figma, Lovable, or v0 is probably the better tool for the job.

Do you already have a shipped product in a GitHub repo? Claude Design starts from code that exists. No repo, no extraction.
Is your design system encoded in Tailwind config, CSS variables, or component names? That is what the extractor reads. Design tokens locked in a Figma file alone will not transfer.
Are you extending an existing product rather than starting from zero? The tool's edge is fit to what is already there. For greenfield work, Lovable or v0 is closer to the job.
Can one person own the design-to-code loop, or does it need multiplayer? No shared canvas. If three designers need to work on the same file, Figma still wins.
Are you on Max, or willing to rate-limit yourself on Pro? Two sessions burned 58% of a weekly Pro allowance. Regular use needs Max.

Three or more yeses and the translation step this tool removes is a real one for you.

Key Takeaways

Figma, Lovable, v0, and Claude Design pick different starting points. Different starting points, different jobs.
Figma treats the design file as canonical. Claude Design treats the codebase as canonical. Neither is wrong; they suit different teams.
Claude Design's design system extraction reads your Tailwind, CSS, and component names to generate on-brand output from the first prompt, without manual configuration.
Each tool fits a different starting point: Figma for collaborative design work, Lovable for greenfield apps, v0 for quick components, Claude Design for extending an existing codebase.
Token burn is real. Claude Design is metered separately. Pro is for one-off use. Regular use requires Max.
Anthropic's CPO resigned from Figma's board three days before launch. Figma's stock dropped 5 to 7% on launch day, a read of the adjacency, not a verdict on either product.

Shipping something where this trade-off matters and want a second read on it? Get in touch. I reply to every thoughtful email.

I post builder-first takes on AI tooling on LinkedIn, X, and Instagram. The kind that skip the hype and go straight to what changes for people who ship. If that is useful, a follow goes a long way.

Meta Muse Spark: What Meta Is Actually Betting On

Sameer Khan — Fri, 17 Apr 2026 12:33:55 +0000

TL;DR: Meta launched Muse Spark on April 8, 2026. Most commentary split into two camps. Meta went closed because Meta won. Meta went closed because Meta lost. Both miss what Meta actually built. Muse Spark does frontier-class reasoning in less than half the output tokens Claude Opus 4.6 and GPT-5.4 spend on the same benchmark, and Meta AI, the product serving roughly three billion daily active users, runs on it. Read Muse Spark as an efficiency-first, patiently sequenced, consumer-scale bet, and the choices that look strange on their own start fitting together.

The week Muse Spark launched, the conversation split almost immediately. One camp said Meta finally caught up and closed the doors. Another said Meta finally fell behind and is hiding it. Both sides were arguing about the license. Neither was arguing about the model.

The bet Meta actually made isn't captured by the license. It's captured by three choices that are easy to miss through the open-weights lens. Muse Spark is designed for fewer tokens per query. It is framed as step one of a long sequence. And it is shipping first as the engine of a consumer product reaching three billion daily active users. Those three choices, taken together, describe a different game than the one most labs are playing.

What Muse Spark is

Muse Spark is Meta Superintelligence Labs' first model, shipped April 8 after a nine-month rebuild of Meta's AI infrastructure. ¹ It is a natively multimodal reasoning model with three modes. Instant for fast responses. Thinking for reasoning-heavy queries. Contemplating, positioned against Gemini Deep Think and GPT Pro for long scientific work. It supports tool use, visual chain of thought, and multi-agent orchestration. ²

Meta AI, the consumer product on meta.ai and the Meta AI app, runs on it today. The Muse Spark API is in private preview for selected partners. Alexandr Wang, Meta's Chief AI Officer, has said broader API access is coming. ³ The weights have not been released, and Meta has not committed to whether or when they will be.

On the Artificial Analysis Intelligence Index v4.0, Muse Spark scores 52. GPT-5.4 and Gemini 3.1 Pro Preview score 57. Claude Opus 4.6 scores 53. ⁴ Fourth at the frontier, as the frontier is currently measured.

Efficiency is the number that matters

Meta's headline technical claim is that Muse Spark reaches its capabilities with over an order of magnitude less compute than Llama 4 Maverick, the prior Meta flagship. ¹ That is a training-side claim. The more interesting number sits on the inference side.

To complete the Artificial Analysis Intelligence Index v4.0 run, Muse Spark used 58 million output tokens. Claude Opus 4.6 used 157 million. GPT-5.4 used 120 million. ⁴ Muse Spark reaches roughly the same tier of performance while spending less than half the thinking time of its closest competitors.

Meta calls the mechanism thought compression. During reinforcement learning, the model is penalized for excessive reasoning tokens. It is trained to reach the same answer with fewer intermediate steps. ⁴

Zoom out. Llama 4 Maverick scored 18 on the same index. Muse Spark scores 52. ⁴ Nearly 3x jump in one release, using roughly a tenth of the training compute, producing a model that serves answers in less than half the output tokens of its peers. That is not a fourth-place story. It is a different-axis story.

Thought compression isn't the only lever. Fei Xia, a Meta researcher, showed Muse Spark tackling a hard visual counting task using parallel subagents: divide the image into a grid, assign a subagent per tile, merge the counts. ⁵ That is a second axis of test-compute scaling. Not fewer tokens per query, but many smaller queries instead of one large one. Both compound efficiency at inference time.

Matt Ridley, in How Innovation Works, argues that real technological progress almost never looks like a breakthrough in the moment. It looks like compounded efficiency. ⁶ The Wright brothers didn't fly higher than their competitors; they iterated longer. Meta's claim with Muse Spark is that the same mechanism is back in large language models as the active design constraint. Fewer tokens per query, optimized over releases, compounded.

Under the efficiency thesis, the contribution is the training recipe, not the weights. The productized result at three billion DAUs is what the recipe is for.

Patience as a structural choice

Wang's launch thread called Muse Spark "step one." ³ Meta has named three modes, shipped two of them, and placed Contemplating on a published roadmap. The release itself followed a nine-month rebuild of Meta's internal AI infrastructure before any new model went out. ¹

That pattern is uncommon. Labs announce quarterly, deprecate on shorter cycles, and trade nomenclature every six weeks. A frontier lab committing to a staged ladder with named but unbuilt later steps is the exception.

Jeff Bezos's 1997 shareholder letter made a version of this argument on its own: "We will continue to make investment decisions in light of long-term market leadership considerations rather than short-term profitability considerations." ⁷ Most companies quote the line. Very few behave like it. Muse Spark is Meta behaving like it. A nine-month silence, a named sequence, an efficiency-first architecture that only pays back at scale.

Patience has a failure mode. If the ladder breaks, the gap widens. If competitors keep improving quarterly and Muse Spark's step two arrives in 2027, the index score will read worse, not better. That is the actual risk of the strategy. Not the license. The cadence.

The game Meta is actually playing

Roughly three billion daily active users touch Meta's products. Muse Spark powers Meta AI across them. ¹ Every prompt, every caption suggestion, every smart reply, every image generation across meta.ai, Instagram, WhatsApp, and Facebook is a query served at Meta's cost.

Reread the efficiency numbers with that denominator. 58 million output tokens per benchmark run is interesting when you run one benchmark. It is structural when you run hundreds of billions of inferences. Cutting thinking time by more than half is how inference economics actually move at Meta's scale.

The API is a secondary product. The primary product is a feature inside applications people already use. That framing answers most of the questions that the closed-weights decision seems to raise:

Why closed: weight distribution gives up the only part that is uniquely Meta, which is distribution plus efficient inference under Meta's control.
Why efficiency-first: cost-per-query is the load-bearing variable at three billion users.
Why fourth on the index: the index measures capability, not capability per dollar of inference. Meta is not optimizing for the thing the index measures.
Why patience: product cycles at Meta's scale run in quarters and years, not weeks. A staged ladder matches the cadence of the products that will ship the model.

OpenAI, Anthropic, and Google primarily sell access. Meta does not. Meta bundles. A closed, efficient model embedded in consumer distribution is a product shape no other frontier lab has a direct answer to right now.

What Muse Spark bets against

Muse Spark bets against three premises that have held in AI for three years. That benchmark rank drives strategic outcomes. That fast iteration beats staged iteration. That serving the weights is the dominant form of distribution.

If Meta is right, competitors re-architect. Expect tokens-per-benchmark to become a reported number. Expect ladder-style release roadmaps. Expect fewer labs selling raw access and more labs selling integrated products.

If Meta is wrong, Muse Spark stays fourth on the index, the efficiency claim gets normalized by competitors' next releases, and the Scale-era thesis fades into another nine-month rebuild.

Deedy, in a popular thread after launch, called Muse Spark's reasoning "solid but not best in class." ⁵ That read is fair if you are benchmarking reasoning. It is beside the point if you are measuring how to serve reasoning to three billion people.

Takeaways

Efficiency is the headline, not the license. Muse Spark uses 58 million output tokens where Claude Opus 4.6 uses 157 million on the same evaluation. ⁴
Training efficiency is roughly ten times Llama 4 Maverick. The index score nearly tripled in one release. ¹ ⁴
Patience is the structural bet. A nine-month rebuild, a three-mode ladder, a second-step roadmap that is named but not shipped. ¹ ³
Specialization explains the choices. Meta AI reaches three billion DAUs, and inference economics at that scale reward the token per query being low, not the leaderboard rank being high. ¹
The license is a symptom of the strategy. If efficiency plus distribution plus patience is the bet, releasing the weights gives the bet away.

I've been writing about how constraints shape design, not features for a while, and Muse Spark is a useful instance of the pattern. The interesting move in AI this year might not be the model that scores higher. It might be the model that answers in fewer tokens and ships inside an application a billion people already open every day.

I break things like this down on LinkedIn, X, and Instagram. Usually shorter, sometimes as carousels. If this read resonated, you'd probably like those.

Sources

Meta AI, "Introducing Muse Spark", April 8, 2026 ↩
Simon Willison, "Meta's new model is Muse Spark", April 8, 2026 ↩
Alexandr Wang on X, launch thread and API update, April 2026 ↩
Muse Spark: Features, Benchmarks, and How to Use It, DataCamp, April 2026 ↩
Fei Xia and Deedy Das on Muse Spark capabilities (thread), April 13, 2026 ↩
Matt Ridley, How Innovation Works: And Why It Flourishes in Freedom (HarperCollins, 2020) ↩
Jeff Bezos, 1997 Letter to Shareholders, Amazon ↩