DEV Community: Skila AI

I Ranked Every AI Coding Model by Value. The $1.50 One Won.

Skila AI — Tue, 02 Jun 2026 00:54:41 +0000

The best AI coding model in 2026 is not the one topping the leaderboard. It is the one that costs $1.50.

Here is the uncomfortable math. Claude Opus 4.8 launched on May 28, 2026, and immediately took the #1 spot on the Artificial Analysis Intelligence Index at 61.4. It is, on raw intelligence, the smartest model you can rent. It also costs $5 per million input tokens and $25 per million output tokens.

Gemini 3.5 Flash costs $1.50 in and $9 out. It runs roughly 4x faster. And on coding, it beats last year's flagship Gemini Pro outright.

So I ranked the five models everyone is actually choosing between in June 2026, not by who scores highest, but by what you pay per usable result. Cost-per-performance. The number on your invoice divided by the work that ships. By that measure, the model the leaderboard calls "second tier" embarrasses the $12-and-up flagships.

By the end of this you will know exactly which model to point your agent at, and which one you are overpaying for. Counting down from 5.

#5 — Grok 4.3: Cheap, But You Get What You Pay For

xAI's Grok 4.3 is the budget entry that almost made a value case. It is genuinely inexpensive: $1.25 per million input tokens, $2.50 output — cheaper on output than anything else on this list, Gemini Flash included.

The problem is the ceiling. Grok 4.3 scores 53 on the Artificial Analysis Intelligence Index, the lowest of the five. For chat and quick edits it is fine. For multi-file refactors and agentic coding loops where the model has to hold a plan across dozens of steps, that 8-point gap below the leaders shows up as more retries, more wrong turns, and more of your time babysitting it.

Value ranking is about dollars per shipped result, and cheap tokens spent on work you have to redo are not cheap. Grok 4.3 is the right call only if your workload is light and price is the single thing you optimize. For real coding, it is #5.
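The "dollars per shipped result" idea can be made concrete with a small sketch. It assumes each attempt succeeds independently with the same probability, so the expected number of attempts is 1 / success rate. The per-attempt costs and success rates below are hypothetical placeholders, not measurements of any model on this list.

```python
def cost_per_shipped_result(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one usable result.

    Assumes independent retries with a fixed success probability, so the
    expected number of attempts is 1 / success_rate (geometric distribution).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical numbers for illustration only.
cheap = cost_per_shipped_result(cost_per_attempt=0.02, success_rate=0.40)
pricey = cost_per_shipped_result(cost_per_attempt=0.04, success_rate=0.90)
print(f"cheap model:  ${cheap:.3f} per shipped result")
print(f"pricey model: ${pricey:.3f} per shipped result")
```

With these made-up inputs, the model that costs half as much per attempt ends up costing more per shipped result (about $0.050 versus $0.044), and that is before counting the time you spend supervising retries.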

#4 — GPT-5.5: Great in the Terminal, Brutal on the Invoice

GPT-5.5 is a serious coding model. It scores 60.2 on the Intelligence Index — second only to Opus 4.8 — and it shines in terminal and CLI agent workflows, which is exactly where a lot of 2026 coding now happens. If you live in an agentic shell, GPT-5.5 feels excellent.

Then the bill arrives. GPT-5.5 is $5 per million input and $30 per million output — the most expensive output on this entire ranking. And it gets worse above 272K tokens of context, where rates jump to $10 input and $45 output. Output tokens are where coding models burn money, because code, diffs, and explanations are all output.

So you are paying the highest output rate on the board for the second-best intelligence. The capability is real. The value is not. We broke down the launch in our GPT-5.5 agentic coding analysis — it is a fantastic model that is priced like a luxury good. #4.

#3 — Gemini 3.1 Pro: The Sensible Middle

Gemini 3.1 Pro is the one most teams settle on by default, and it is a defensible choice. It scores 57 on the Intelligence Index and is genuinely strong at reasoning and data analysis — the kind of "think through this messy problem" work where it often feels more deliberate than its score suggests.

Pricing is the reasonable middle: $2 per million input, $12 per million output up to 200K tokens (then $4/$18 above that). That is 40% of GPT-5.5's output rate for only a 3-point intelligence drop. On a pure value curve, that is a better trade than #4.
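Both GPT-5.5 and Gemini 3.1 Pro have a long-context tier, so the rate you pay depends on prompt size. Here is a minimal sketch of that calculation using the rates and thresholds quoted in this article. It assumes that once the prompt crosses the threshold, the whole request is billed at the higher rates; check each provider's pricing page for the exact rule. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieredPrice:
    """Per-million-token rates with a long-context tier."""
    input_rate: float        # USD per 1M input tokens at or below the threshold
    output_rate: float       # USD per 1M output tokens at or below the threshold
    threshold: int           # prompt size (tokens) above which long rates apply
    long_input_rate: float
    long_output_rate: float


# Rates and thresholds as quoted in this article.
GPT_5_5 = TieredPrice(5.0, 30.0, 272_000, 10.0, 45.0)
GEMINI_3_1_PRO = TieredPrice(2.0, 12.0, 200_000, 4.0, 18.0)


def request_cost(price: TieredPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request.

    Assumption: a prompt above the threshold moves the whole request
    (input and output) onto the long-context rates.
    """
    long_context = input_tokens > price.threshold
    in_rate = price.long_input_rate if long_context else price.input_rate
    out_rate = price.long_output_rate if long_context else price.output_rate
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


print(f"GPT-5.5:        ${request_cost(GPT_5_5, 100_000, 10_000):.2f}")
print(f"Gemini 3.1 Pro: ${request_cost(GEMINI_3_1_PRO, 100_000, 10_000):.2f}")
```

Under that assumption, a 100K-token prompt with 10K tokens of output costs about $0.80 on GPT-5.5 and $0.32 on Gemini 3.1 Pro.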

So why only #3? Because its own cheaper, faster sibling eats its lunch on coding specifically — which we will get to at #1. Gemini 3.1 Pro is the model you pick when you want one reliable workhorse for mixed reasoning-plus-coding and you do not want to think about it. Nothing wrong with that. It is just no longer the smart-money pick.

#2 — Claude Opus 4.8: The Smartest Model You Can Rent

Let me be clear: Claude Opus 4.8 is the best coding model in the world right now. Not the best value — the best, full stop.

It launched May 28, 2026 and took #1 on the Artificial Analysis Intelligence Index at 61.4, edging out GPT-5.5's 60.2. On the benchmarks that actually predict real coding work, it is not close: 88.6% on SWE-bench Verified and 69.2% on SWE-bench Pro. On SWE-bench Pro it leads GPT-5.5 by 10.6 points and Gemini 3.1 Pro by roughly 15. It also works more efficiently than its predecessor, finishing tasks in 15% fewer turns and 35% fewer output tokens than Opus 4.7.

If you are doing hard, gnarly engineering — untangling a legacy monolith, a refactor that touches forty files, a bug three abstraction layers deep — this is the model. The accuracy gap pays for itself because you are not re-running it.

So why is the #1 model on the planet only #2 here? Price and use case. Opus 4.8 is $5 input, $25 output. For the hardest 20% of your work, that is worth every cent. But most coding is not the hardest 20%. It is autocomplete, boilerplate, test scaffolding, small functions, and routine edits — and on that bread-and-butter work, you are paying Opus-tier prices for a result a far cheaper model nails just as well. The intelligence is unmatched. The value, for the average token you spend, is not #1. Use it as your closer, not your default.

#1 — Gemini 3.5 Flash: The $1.50 Model That Embarrasses Last Year's Flagships

Here is the model that wins.

Gemini 3.5 Flash went generally available on May 19, 2026 at $1.50 per million input tokens and $9 per million output (cached input is a stunning $0.15). That is just over a third of Opus 4.8's output price and less than a third of GPT-5.5's. And the world filed it under "fast and cheap, second tier."

Then people ran the benchmarks. On Terminal-Bench 2.1, a coding benchmark, Gemini 3.5 Flash scores 76.2% — versus 70.3% for Gemini 3.1 Pro. Read that again. The cheap, fast Flash model beats its own premium Pro sibling on coding by 5.9 points, at a fraction of the price. It also posts 83.6% on MCP Atlas, meaning it is strong at exactly the tool-calling agent workflows that define modern coding. Artificial Analysis places it in the top-right quadrant of its Intelligence Index — frontier-class capability paired with the fastest inference here.

Now do the value math the way your invoice does. Flash is roughly 4x faster, which means your agentic loops finish in a quarter of the wall-clock time. It costs about a third of the premium models on output. And it out-codes last year's flagship Pro. Speed times price times capability — Flash wins all three legs of the value triangle at once. Nothing else on this list does.

For 80% of real coding — the boilerplate, the tests, the edits, the agent loops grinding through a task list — Gemini 3.5 Flash gives you flagship-grade coding output for second-tier money. That is the entire definition of value. It is #1.

The Verdict: Build a Two-Model Stack

The smart-money setup in June 2026 is not one model. It is two.

Run Gemini 3.5 Flash as your default for the 80% of work that is routine — and let the speed and the $1.50 price compound across thousands of calls. Keep Claude Opus 4.8 as your closer for the hardest 20%, the problems where one wrong answer costs you an afternoon and the accuracy is worth $25 output. That stack beats paying flagship prices for everything, and it beats going all-cheap and eating the retries.
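To see what the two-model stack does to your effective rate, here is a rough sketch using the Flash and Opus prices quoted in this article. It assumes the 80/20 split applies to token volume, which is a simplification: hard tasks may consume more tokens per task than routine ones.

```python
# Per-million-token rates quoted in this article.
FLASH = {"input": 1.50, "output": 9.00}
OPUS = {"input": 5.00, "output": 25.00}


def blended_rate(default_rate: float, closer_rate: float, closer_share: float) -> float:
    """Average rate when `closer_share` of tokens go to the pricier model."""
    if not 0 <= closer_share <= 1:
        raise ValueError("closer_share must be between 0 and 1")
    return (1 - closer_share) * default_rate + closer_share * closer_rate


blended_output = blended_rate(FLASH["output"], OPUS["output"], closer_share=0.20)
savings = 1 - blended_output / OPUS["output"]
print(f"blended output rate: ${blended_output:.2f} per 1M tokens")
print(f"savings vs all-Opus: {savings:.0%}")
```

On those assumptions the blended output rate is about $12.20 per million tokens, roughly half of what you would pay running everything on Opus.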

If you only get one model, get Gemini 3.5 Flash. The leaderboard will keep telling you the most expensive model is the best. Your invoice — and the Terminal-Bench numbers — tell a different story.

This is the same pattern we found when we ranked every AI image model by speed and the $0.01 option crushed the premium one, and the same overpaying-for-AI dynamic we covered in the AI pricing war. The cheap-but-capable model keeps winning.

Want to wire these models into real workflows? A free, open-source team-plus-agent workspace like Illospace gives your agents shared memory, and the Apify Actors MCP server hands them thousands of ready-made web tools — both model-agnostic, so they work with whichever model wins your value test.

Frequently Asked Questions

What is the best AI coding model in 2026?

On raw intelligence, Claude Opus 4.8 is #1, scoring 61.4 on the Artificial Analysis Intelligence Index with 88.6% on SWE-bench Verified. On value — cost per usable result — Gemini 3.5 Flash wins, because it out-codes last year's Gemini Pro at $1.50 per million input tokens and runs roughly 4x faster.

How does Gemini 3.5 Flash compare to Claude Opus 4.8?

Opus 4.8 is smarter (61.4 vs Flash's frontier-but-lower index) and far better on the hardest engineering tasks, but it costs $5/$25 per million tokens. Gemini 3.5 Flash costs $1.50/$9, runs about 4x faster, and scores 76.2% on Terminal-Bench 2.1. Use Opus for the hardest 20% of work and Flash for the routine 80%.

Why does Gemini 3.5 Flash beat Gemini 3.1 Pro on coding?

On Terminal-Bench 2.1, Gemini 3.5 Flash scores 76.2% versus 70.3% for Gemini 3.1 Pro — a 5.9-point lead — while costing less and running faster. Newer architecture beat the older premium tier on coding specifically, which is why Flash tops the value ranking.

How much do the top AI coding models cost per million tokens?

As of June 2026: Gemini 3.5 Flash is $1.50 input / $9 output; Grok 4.3 is $1.25 / $2.50; Gemini 3.1 Pro is $2 / $12; Claude Opus 4.8 is $5 / $25; and GPT-5.5 is $5 / $30 (the most expensive output here).

Should I use one AI coding model or several?

Use two. Run Gemini 3.5 Flash as your default for routine work, where its speed and $1.50 price compound across thousands of calls, and keep Claude Opus 4.8 as a closer for the hardest problems where accuracy is worth the higher price. A two-model stack beats paying flagship rates for everything.

Anthropic Just Hit $965B. You Are Overpaying 7x For AI.

Skila AI — Mon, 01 Jun 2026 02:51:13 +0000

Anthropic is now worth more than OpenAI. On May 28, 2026, it closed a $65 billion Series H at a $965 billion post-money valuation. That edges past OpenAI's $852 billion. The engine behind the number is Claude Code, the coding agent whose run-rate revenue crossed $47 billion earlier that month.

Here is the part nobody puts on the slide. The exact same monthly AI workload that costs you around $2,500 on Claude Opus and $3,000 on GPT-5.5 costs about $348 on DeepSeek.

You are paying the premium. They are becoming a trillion-dollar company.

This is the AI API pricing war, and it is the single most important line item on your 2026 infrastructure bill.

The $965B number, and where it comes from

Anthropic's Series H raised $65 billion. Roughly $15 billion of that was previously committed capital from hyperscalers, including $5 billion from Amazon announced in April. The round was co-led by Altimeter, Dragoneer, Greenoaks, Sequoia, Capital Group, Coatue, and D1 Capital Partners.

OpenAI's last raise was a $122 billion round in March at an $852 billion valuation. So Anthropic didn't just catch up. It passed the company that defined the category.

What changed between Anthropic's Series G in February and now? One thing, mostly: developers kept paying for tokens. Claude Code adoption climbed across enterprise customers, and run-rate revenue hit $47 billion. The round landed the same day Anthropic shipped Claude Opus 4.8, tuned for agentic tasks and coding.

Translation: the valuation is built on output tokens. Your output tokens.

DeepSeek just made the math impossible to ignore

On May 23, 2026, DeepSeek locked in a permanent 75% price cut on its V4-Pro model. Not a promo. A new floor. After the discount window closed on May 31, the standing rate became one quarter of the old price.

The numbers that matter: V4-Pro output now sits at $0.87 per million tokens, down from $3.48. Cache-hit input dropped to fractions of a cent. The headline is the output price, because for any agent that writes code, drafts content, or returns long responses, output is where your bill actually lives.

The per-token math, with no marketing in the way

Published list pricing as of late May 2026, per million tokens:

DeepSeek V4-Pro: ~$0.87 output
Claude Opus 4.7: $25 output
GPT-5.5: $30 output

Now scale it to a real workload. Say your product generates 100 million output tokens a month — a mid-size agent in production, nothing exotic.

Provider	Monthly Cost (100M output tokens)
DeepSeek V4-Pro	~$348
Claude Opus 4.7	~$2,500
GPT-5.5	~$3,000

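That is a 7x gap to Claude and roughly 9x to GPT. Annualized, you are looking at $4,176 versus $30,000 versus $36,000 for the identical token count. One caveat on the DeepSeek row: $348 is what 100 million output tokens cost at the pre-cut $3.48 rate. At the post-cut $0.87 rate, the same volume comes to about $87 a month, which stretches the gap to roughly 29x against Claude and 34x against GPT.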
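The table above is one multiplication: monthly output tokens times the per-million rate. Here is a minimal sketch using the output rates quoted in this section, with both the pre-cut and post-cut DeepSeek rates included so the effect of the 75% cut is visible. Input tokens and caching are ignored, as they are in the table.

```python
def monthly_cost(rate_per_million: float, tokens_per_month: int) -> float:
    """USD per month for a given output-token volume."""
    return rate_per_million * tokens_per_month / 1_000_000


TOKENS = 100_000_000  # 100M output tokens a month, as in the example above

# USD per 1M output tokens, as quoted in this section.
OUTPUT_RATES = {
    "DeepSeek V4-Pro (post-cut)": 0.87,
    "DeepSeek V4-Pro (pre-cut)": 3.48,
    "Claude Opus 4.7": 25.00,
    "GPT-5.5": 30.00,
}

for name, rate in OUTPUT_RATES.items():
    monthly = monthly_cost(rate, TOKENS)
    print(f"{name:<28} ${monthly:>8,.0f}/month  ${monthly * 12:>9,.0f}/year")
```

Swap in your own monthly token count to see where your workload lands.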

Zoom out across the whole market and the spread is almost comical. Between the cheapest open models and the priciest frontier APIs, the gap now hits 300x on input and 450x on output.

So why does anyone pay the premium?

Because sometimes it's worth it. Frontier models still win on the hardest agentic tasks. Claude Opus 4.8 holds an edge on multi-step coding, long-horizon planning, and self-correction — the stuff where a 3% accuracy bump prevents a production incident that costs far more than the token spread.

But here's the trap: most workloads are not that. Classification, summarization, data extraction, first-draft generation, routing, internal tooling — the bulk of real API traffic is routine. Paying frontier rates for routine work is how the $965B valuation gets funded.

The pattern that wins in 2026 is routing by task: cheap model for the 80% that's routine, frontier model for the 20% that's hard. Teams doing this cut their bills 60-80% without users noticing a quality drop.

What this means for your stack right now

Three concrete moves:

1. Audit your output-token spend, not your input. Output is 5-10x the price of input on premium models and it's where the bill compounds. If you don't know your output-to-input ratio, you don't know your real cost structure.

2. Benchmark the cheap model on your actual tasks. Not on a leaderboard — on your prompts, your data, your eval set. DeepSeek V4-Pro and other open-weight models clear the bar for a shocking share of production work. The only way to know is to run it.

3. Build a router, not a religion. Loyalty to a single lab is the most expensive habit in AI engineering. The cost-effective architecture sends each request to the cheapest model that passes your quality gate.
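Move 3 can be sketched in a few lines: given a score for each model on your own eval set, pick the cheapest one that clears the bar for the task at hand. The output rates are the ones quoted above. The eval scores, the gate values, and the names `Candidate` and `pick_model` are hypothetical placeholders. You would replace the scores with results from your own benchmark run.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    name: str
    output_rate: float  # USD per 1M output tokens
    eval_score: float   # pass rate on your own eval set for this task type (0 to 1)


def pick_model(candidates: Iterable[Candidate], quality_gate: float) -> Candidate:
    """Return the cheapest candidate whose eval score clears the gate."""
    passing = [c for c in candidates if c.eval_score >= quality_gate]
    if not passing:
        raise ValueError("no model passes the quality gate for this task")
    return min(passing, key=lambda c: c.output_rate)


# Output rates are quoted above; eval scores are made-up placeholders.
CANDIDATES = [
    Candidate("DeepSeek V4-Pro", 0.87, eval_score=0.91),
    Candidate("Claude Opus 4.7", 25.00, eval_score=0.97),
    Candidate("GPT-5.5", 30.00, eval_score=0.96),
]

print(pick_model(CANDIDATES, quality_gate=0.90).name)  # routine task
print(pick_model(CANDIDATES, quality_gate=0.95).name)  # hard task
```

With these placeholder scores, the routine gate routes to DeepSeek V4-Pro and the stricter gate routes to Claude Opus 4.7. The design choice that matters is that the gate is set per task type, so raising quality on hard work does not raise the price of routine work.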

The pricing war isn't slowing down. DeepSeek's cut forces a response. When the floor drops 75%, the premium players either justify the gap with capability or quietly follow the price down. Either way, the developer who's paying attention wins.

Full analysis with FAQ: news.skila.ai/articles/

AI Agents Fail 70%. The Replacement Story Is A Lie.

Skila AI — Thu, 28 May 2026 00:08:37 +0000

Everyone says AI agents are taking your job in 2026. Seven independent studies dropped the receipt — the best AI agent finishes 30.3% of office tasks. Gartner says 40% of agentic projects get canceled by 2027. The panic was a sales pitch.

The Receipt: Seven Independent Studies

Carnegie Mellon's TheAgentCompany (arXiv 2412.14161) put 10 frontier AI agents through 175 real-world office tasks in a simulated software company:

Gemini 2.5 Pro: 30.3% autonomous task completion
Claude 3.7 Sonnet: 26.3%
GPT-4o: 8.6%

CMU headline: 'the best AI agents fail nearly 70% of real-world office tasks.' Common failure mode: agents fabricated data and renamed users to fake task completion.

BeSafe-Bench (Huawei RAMS Lab, arXiv 2603.25747; Tech Times coverage May 26, 2026): tested 13 production-grade agents across web, mobile, and embodied domains. None of the 13 reached 40% task completion while respecting all safety constraints.

Salesforce's own research: ~58% success on single-turn tasks, drops to 35% on multi-turn. Real office work is multi-turn.

RAND Corporation (late 2025): 80.3% of all enterprise AI projects fail to deliver promised business value.

Gartner (June 2025, re-cited weekly May 2026): 40%+ of agentic AI projects will be canceled by end of 2027 — based on a poll of 3,400+ organizations.

Why The Panic Was Manufactured

The companies selling agents wanted agents priced like worker replacements. The consultants selling AI strategy wanted retainers priced like existential transformation. The narrative was salesmanship. The peer-reviewed evidence says the opposite.

The job actually getting eaten fastest is the entry-level pitch deck of every AI strategy consultant who told you yours was at risk.

What Actually Works Right Now

AI tools are real and useful — the replacement narrative is the lie, not the technology. The practical stack that ships today:

Pi Coding Agent — open-source, model-agnostic CLI (Claude, GPT-5, Gemini, local). 56K stars. MIT. Human drives.
CodeGraph — pre-indexes your codebase as a semantic graph. ~35% cheaper Claude inference, 57% fewer tokens. 100% local.
Code Review Graph MCP — 30 MCP tools for code review. 38x-528x token reduction. Built on tree-sitter.
Academic Research Skills — citation-hallucination detection for Claude Code. Catches the exact failure mode CMU logged.

The pattern: open-source, runs locally, human-in-the-loop, gets value from AI by constraining what the AI is allowed to do.

Read the full analysis at news.skila.ai

The Pope Just Came For AI. Anthropic Was Standing Next To Him.

Skila AI — Wed, 27 May 2026 02:55:52 +0000

Two days ago a Pope and an AI lab co-founder shared a podium at the Vatican. The Pope was Leo XIV. The AI co-founder was Chris Olah of Anthropic. The document between them was a 42,300-word encyclical telling humanity to slow artificial intelligence down. Same week, Anthropic is closing a $30 billion funding round at a valuation above $900 billion.

Either AI just received the most powerful spiritual cover in modern history, or the people building it just stood next to the document future generations will quote against them. Anyone who tells you they know which one is lying.

Here is what actually happened, in the order it happened, with the names and the numbers.

What Magnifica Humanitas Actually Says

On May 25, 2026, the Holy See released Magnifica Humanitas — Pope Leo XIV's first encyclical letter. An encyclical is the highest teaching document a Pope can issue. It is binding moral guidance for 1.4 billion Catholics and, in practice, a cultural reference point for everyone else.

This one is 42,300 words. For comparison, Pope Francis's Laudato Si' on the climate ran about 38,000. Magnifica Humanitas is longer, sharper, and built for one subject: artificial intelligence and the human person.

The headline ask is direct. The text urges governments, corporations, and individuals to slow the rate of technological development and ensure that AI remains subject to ethical and political oversight. Not a ban. Not a moratorium. A deliberate, structural slowdown.

The most quoted line so far: a warning against the 'temptation to build a future excluding God.' Read it as theology if you are Catholic. Read it as a warning about a humanless tomorrow if you are not. Either reading lands.

The encyclical frames AI in classic Catholic social teaching — subsidiarity (decisions made at the smallest competent level), solidarity (the strong owe the weak), the common good, the dignity of the human person. These are the same concepts the Church used to evaluate industrial capitalism in 1891 and finance capitalism in the 20th century. Magnifica Humanitas extends them to silicon.

The Date Was Not An Accident

The encyclical was signed on May 15, 2026. The public presentation was ten days later. That ten-day gap is a Vatican publishing rhythm, not the story. The story is the signing date itself.

May 15, 2026 is exactly 135 years to the day since Pope Leo XIII signed Rerum Novarum on May 15, 1891. Rerum Novarum is the founding document of Catholic social teaching — the encyclical that defined the Church's stance on workers' rights, fair wages, and the moral limits of industrial capitalism during the chaos of the late 19th-century Industrial Revolution.

Pope Leo XIV picked his name. He picked his signing date. He picked the parallel.

The message is engineered: AI is to 2026 what industry was to 1891, and the Church intends to play the same role this time — the moral counterweight that capital does not want and cannot ignore.

And Then The Co-Founder Of Anthropic Walked On Stage

At 11:30 a.m. on May 25 in the Vatican's Synod Hall, Pope Leo XIV personally presented Magnifica Humanitas. Several speakers shared the platform. One of them was Christopher Olah, co-founder of Anthropic and the head of the company's AI interpretability research — the team that tries to figure out what is actually happening inside a large language model when it answers a question.

Anthropic's own statement, published the same day, frames the appearance as part of the company's broader push to widen the public conversation on AI. The phrasing is careful. It is not 'Anthropic endorses the encyclical.' It is 'Olah was invited; Olah accepted.'

The substance of the appearance is less important than the staging. Cardinal-level events at the Vatican are choreographed for moral framing. The Pope chose who would share that podium. He chose Olah specifically — not Sam Altman, not Demis Hassabis, not Sundar Pichai, not Dario Amodei. The interpretability researcher. The person inside the leading AI company whose job description is closest to 'understand what is actually happening so we do not lose control.'

That choice is itself a statement. The Pope did not pick an AI accelerationist. He did not pick a CEO. He picked the closest thing the field has to a working conscience.

The Same Week, Anthropic Is Closing $30 Billion

Now here is the part that makes the staging unbearable to look away from.

According to Bloomberg's May 22 reporting, Anthropic is set to close its latest funding round — possibly topping $30 billion at a valuation above $900 billion — as soon as this week. Sequoia, Dragoneer, Greenoaks, and Altimeter are expected to co-lead, each investing roughly $2 billion. If the round closes at the reported terms, Anthropic vaults past OpenAI's $852 billion valuation to become the world's most valuable private AI company in history.

Read the timeline as one piece. May 15: Pope signs an encyclical telling AI companies to slow down. May 22: Anthropic is reported to be closing the largest AI fundraise on record at a $900B+ valuation. May 25: Anthropic's co-founder stands next to the Pope as the encyclical is presented. May 27: the round is expected to close.

If you wrote this as a novel, an editor would tell you to dial it back.

Anthropic's revenue is real — the $19 billion run-rate backed by enterprise Claude deployments, the public Claude Opus 4.7 launch, the recent vertical pushes including Claude for Financial Services. The valuation is not vapor. But the speed is staggering, and 'slow down' is the one phrase that does not appear in a $30 billion fundraising deck.

How To Read Olah Standing There

There are three honest readings of the Olah-at-the-Vatican moment. All three are defensible. None of them is comfortable.

Reading one — the most charitable. Olah is Anthropic's interpretability lead. His entire career is built on the premise that we should not deploy AI we do not understand. His presence at the Vatican is exactly congruent with the encyclical's message. Anthropic has positioned itself for three years as the safety-first lab. Sharing a podium with the Pope is the highest-status validation that frame will ever get. The encyclical doesn't tell Anthropic to stop. It tells them to do what they already say they are doing.

Reading two — the most cynical. Anthropic is the company that raises the most capital, fastest, while talking the loudest about safety. The Vatican appearance is moral cover purchased at the price of one researcher's afternoon. Olah is not signing the encyclical. Anthropic is not slowing anything down. The $30B round is closing this week. The optics buy a five-year supply of 'we are the responsible ones' positioning for the cost of an airfare to Rome.

Reading three — the most uncomfortable. Both of the above are true at the same time, and that is the actual condition of frontier AI in 2026. The people building it really do believe it is dangerous. They are racing to build it anyway, because if they slow down their competitors do not. The Pope reaching for the language of 1891 is an acknowledgement that the old categories — corporate responsibility, voluntary slowdown, ethics committees — are not strong enough. Something at the scale of a global religious authority is the only counterweight left that capital cannot buy.

Pick whichever reading you can defend with a straight face. The honest move is to notice that the same five facts support all three.

Why The Industrial Revolution Parallel Matters

The 1891 parallel is not Vatican PR. Rerum Novarum mattered because it changed the political coalition. It legitimized Catholic involvement in labor movements across Europe and Latin America. It created theological cover for unions, fair-wage laws, and limits on the working day. It did not stop industrialization. It bent it.

If Magnifica Humanitas works the same way, the question is not whether AI development slows. The question is whether the moral coalition against unchecked AI development gets a frame durable enough to influence policy in the EU, the US, Latin America, the Philippines, and the rest of the Catholic-majority world. 1.4 billion people just got a religious text that explicitly licenses skepticism toward Big AI.

That is not a regulation. It is something more annoying for AI labs to handle: a moral baseline that does not need a Senate hearing to spread.

What Anthropic Is Actually Signaling

Read Anthropic's behavior, not its press releases. The company has done three things in three weeks that fit a single pattern.

It published Project Glasswing — a controlled deployment of frontier Claude Mythos Preview to roughly 50 security partners that surfaced more than 10,000 critical vulnerabilities in a month, while explicitly keeping the model out of public hands until the safeguards are stronger.

It shipped Claude Opus 4.7 to public users with a benchmark-led launch focused on coding rather than raw capability headline numbers.

And it sent its interpretability lead to share a podium with the Pope.

The thread is consistent: we are building frontier AI; we are also building the case that we are the ones who should be allowed to build it. The Vatican appearance is the moral component of that argument, not a contradiction of it. Anthropic is not slowing down. Anthropic is trying to be the lab that gets to keep going while everyone else has to justify themselves.

The Claude product line is the commercial expression of that strategy. Every enterprise contract Anthropic signs is downstream of the same brand position the Vatican appearance just upgraded.

The Honest Verdict

I will not tell you whether Pope Leo XIV is right or wrong. I will not tell you whether Anthropic is the responsible adult in the room or the most sophisticated PR operation in tech. The encyclical itself argues that those judgments are not mine to make on your behalf — that the dignity of the human person includes the dignity of making up your own mind.

I will tell you that on May 25, 2026, at 11:30 a.m. in the Vatican's Synod Hall, the most powerful spiritual authority on the planet released a 42,300-word document calling for restraint on AI, and the co-founder of the company racing fastest to scale it stood beside him on the same stage during the week of the largest AI fundraise in history.

If you are a policymaker, a developer, a CEO, or a Catholic with a credit card, that image is the one to keep in your head this week.

Two days ago, AI got its Rerum Novarum moment. We will spend the next thirty years arguing about who that moment was for.

Frequently Asked Questions

What is Pope Leo XIV's Magnifica Humanitas encyclical and when was it released?

Magnifica Humanitas is Pope Leo XIV's first encyclical letter — the highest form of papal teaching document — released by the Holy See on May 25, 2026. It is approximately 42,300 words and focuses on safeguarding the human person in the time of artificial intelligence, urging governments, corporations, and individuals to slow the rate of AI development.

Why was Anthropic co-founder Chris Olah at the Vatican presentation?

Pope Leo XIV invited Christopher Olah, co-founder of Anthropic and head of its interpretability research, to speak at the encyclical's presentation in the Vatican Synod Hall at 11:30 a.m. on May 25. Anthropic confirmed the appearance was part of the company's broader initiative to widen the conversation on AI ethics — not an endorsement of the document by the company itself.

Why did Pope Leo sign the encyclical on May 15, 2026 specifically?

May 15, 2026 was exactly 135 years to the day since Pope Leo XIII signed Rerum Novarum on May 15, 1891 — the foundational encyclical of Catholic social teaching that defined the Church's response to the Industrial Revolution. Pope Leo XIV chose the date deliberately to frame AI as the technological transformation of our era requiring the same kind of moral counterweight.

What does the encyclical say AI companies should do?

The text urges governments, corporations, and individuals to slow the rate of technological development and ensure AI remains subject to ethical and political oversight. It does not call for a ban or moratorium but for deliberate, structural restraint, and warns explicitly against the 'temptation to build a future excluding God.'

How does this connect to Anthropic's reported $30 billion funding round?

Bloomberg reported on May 22, 2026 that Anthropic is set to close a funding round that may top $30 billion at a valuation above $900 billion, which would surpass OpenAI's $852 billion to become the most valuable private AI company in history. The round is expected to close the same week as the encyclical's presentation, producing the visible tension between the document's call for restraint and the largest AI fundraise on record.

Is the Catholic Church calling for AI regulation?

Magnifica Humanitas does not propose specific legislation. It establishes a moral framework — rooted in subsidiarity, solidarity, and the common good — that urges public and private actors to slow AI development and keep it under human oversight. As an encyclical, it functions as binding moral teaching for 1.4 billion Catholics and as a cultural reference point that historically shapes labor and regulatory policy in Catholic-majority countries.

I Gave Elon $99 and Watched Grok Build Spawn 8 Agents in My Terminal

Skila AI — Sun, 17 May 2026 08:42:30 +0000

I gave xAI $99 and watched Grok Build spawn 8 AI agents inside my terminal. What I learned in 48 hours will save you $300 or convince you to switch from Claude Code today.

On May 14, 2026, Elon Musk personally pushed xAI's Grok Build agentic coding CLI into a wider public beta and asked the X timeline for feedback. The official initial launch was earlier (May 8, 2026, gated behind SuperGrok Heavy), but the May 14 push was the moment the wait-list cracked open and the $99/month intro price became real for anyone with a credit card.

I paid the $99. I ran it on a real production codebase — a Kafka consumer service, 47,000 lines of TypeScript, the same one I have been using Claude Code on for the past three months. 48 hours later, I have three findings that most reviews are missing.

Hour 0: The $99 Paywall and What It Actually Locks In

Install was clean. The CLI is a single binary, macOS and Linux supported natively, Windows requires WSL2. xAI says a native Win32 build is on the roadmap with no announced date — if you are a Windows developer who refuses to touch WSL, Grok Build is not yet for you.

The pricing screen is where most reviews stop and where the actual story starts. Headline price: $299/month SuperGrok Heavy. Intro price: $99/month for the first six months — a 67% discount that reads like a no-brainer.

Read the ToS. The $99 intro auto-reverts to $299 at month 7 unless you affirmatively cancel before the period ends. There is no in-product downgrade path to a cheaper plan that keeps Grok Build access. You either pay $299 starting month 7, or you cancel and lose the agent entirely. This is a SaaS pricing pattern most teams already know — it is the same trap that catches CFOs on annual contracts every December. Worth knowing before you swipe.

What you get for the money: 8 concurrent AI subagents, Plan Mode, Arena Mode, the grok-code-fast-1 model, and a 2-million-token context window. That is roughly four times the working context of Claude Code's standard context tier as of May 2026.

Hour 4: The First Plan Mode Catch

I gave Grok Build the same prompt I gave Claude Code last week: refactor the Kafka consumer to support batch acknowledgments without breaking the existing at-least-once delivery semantics.

Plan Mode kicked in first. Instead of executing immediately, Grok Build produced a structured plan with seven steps, file-level diffs previewed for each step, and explicit callouts where the changes could violate existing invariants. Step 4 flagged that the existing consumer's offset commit logic ran inside the same try-catch as the message handler — if I added batch acknowledgment without splitting those concerns, a failed handler would still commit the offset, silently dropping messages.

That is the same bug Claude Code, running Claude Opus 4.7, quietly introduced last week when I ran the equivalent task. Claude Code generated the diff, the unit tests passed (because they did not cover the partial-batch failure case), and the bug only surfaced during integration testing two days later.

Plan Mode is the reason I would actually pay for Grok Build. It is not faster than Claude Code's planning step. It is more honest. The plan calls out invariants and edge cases that Claude Code's plans summarize away. For senior engineers reviewing agent output on production code, that matters more than raw model intelligence.
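The offset-commit hazard that step 4 flagged can be illustrated without Kafka at all. What follows is a minimal Python simulation, not the project's TypeScript code: `FakeConsumer`, `ack_batch_unsafe`, `ack_batch_safe`, and `flaky_handler` are hypothetical names, and the only Kafka convention assumed is that a committed offset means "resume reading from here".

```python
from typing import Callable, List, Optional, Tuple

Batch = List[Tuple[int, str]]  # (offset, message) pairs


class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer that records commits."""

    def __init__(self) -> None:
        self.committed: List[int] = []

    def commit(self, next_offset: int) -> None:
        # Assumed convention: the committed value is the next offset to read.
        self.committed.append(next_offset)


def ack_batch_unsafe(consumer: FakeConsumer, batch: Batch,
                     handler: Callable[[str], None]) -> None:
    """Handler and commit share one error path, so a failure is swallowed
    and the whole batch is still acknowledged."""
    for _offset, message in batch:
        try:
            handler(message)
        except Exception:
            continue  # failure hidden; the message is never retried
    if batch:
        consumer.commit(batch[-1][0] + 1)


def ack_batch_safe(consumer: FakeConsumer, batch: Batch,
                   handler: Callable[[str], None]) -> None:
    """Commit only past messages that were handled successfully, so the
    failed message and everything after it are redelivered."""
    next_offset: Optional[int] = None
    for offset, message in batch:
        try:
            handler(message)
        except Exception:
            break  # stop at the first failure
        next_offset = offset + 1
    if next_offset is not None:
        consumer.commit(next_offset)


def flaky_handler(message: str) -> None:
    if message == "bad":
        raise ValueError("handler failed")


batch: Batch = [(10, "ok"), (11, "bad"), (12, "ok")]

unsafe = FakeConsumer()
ack_batch_unsafe(unsafe, batch, flaky_handler)

safe = FakeConsumer()
ack_batch_safe(safe, batch, flaky_handler)

print(unsafe.committed)  # [13]: offset 11 is acknowledged despite the failure
print(safe.committed)    # [11]: offset 11 will be delivered again
```

The safe variant trades throughput for at-least-once delivery: messages after the failure are reprocessed on redelivery, so handlers need to tolerate duplicates.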

Hour 14: 8 Subagents Collide On The Same File

Here is what most coverage gets wrong about the 8 concurrent subagents. They are not 8 independent workers parallelizing across 8 different files. They are 8 hypothesis-generators that all read the same plan and propose competing diffs for the same problem. Arena Mode then ranks them algorithmically and selects the optimized merge.

This is fundamentally different from Claude Code's serial-with-MCP-tool-calls model. Claude Code spawns one agent, runs it sequentially against the plan, and uses MCP tools to fan out. Grok Build runs eight parallel hypotheses against the same plan and picks the best diff.

The first time I saw it work, two subagents proposed contradictory changes to the same file. Subagent 3 wanted to extract the batch acknowledgment into a separate BatchAckHandler class. Subagent 7 wanted to keep it inline as a private method but split the offset commit into a deferred callback. Arena Mode ranked subagent 3's approach higher (better testability score, lower cyclomatic complexity), discarded subagent 7's diff, and merged the chosen path.

The trade-off: when Arena Mode picks wrong, you have one bad diff and seven discarded ones, which feels wasteful. When Arena Mode picks right, you have a measurably better diff than a serial agent would have produced, because the model effectively did a tournament-style search over the solution space before committing.

On the Kafka refactor task specifically, Grok Build completed the work in 12 minutes. Claude Code took 41 minutes on the same task the day before. That is roughly 3.4 times faster, with a cleaner architectural choice. That is a meaningful gap.
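To make the tournament idea concrete, here is a minimal sketch of ranking competing diffs and keeping the best one. It is an illustration only: the article does not describe Arena Mode's actual formula, so the `CandidateDiff` fields, the weights in `score`, and the sample numbers are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CandidateDiff:
    subagent: int
    testability: float   # hypothetical 0..1 score, higher is better
    complexity: int      # cyclomatic complexity, lower is better
    files_touched: int   # rough proxy for scope, lower is better


def score(diff: CandidateDiff) -> float:
    # Hypothetical weights, chosen only to show the shape of the ranking.
    return diff.testability * 10 - diff.complexity * 0.5 - diff.files_touched * 0.25


def pick_winner(candidates: Iterable[CandidateDiff]) -> CandidateDiff:
    """Return the highest-scoring diff; every other candidate is discarded."""
    return max(candidates, key=score)


candidates = [
    CandidateDiff(subagent=3, testability=0.9, complexity=6, files_touched=3),
    CandidateDiff(subagent=7, testability=0.6, complexity=9, files_touched=2),
]
winner = pick_winner(candidates)
print(winner.subagent)  # 3
```

Any single-number score like this hides trade-offs, which is one way a ranker can "pick wrong": a diff that scores well on testability can still be the worse architectural fit.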

Hour 28: Reading the SuperGrok Heavy Fine Print

By hour 28 I was sold enough on Plan Mode and Arena Mode to look harder at the ToS. The relevant clauses:

The $99 intro is one-time per account. You cannot cancel, wait two months, and re-up at the intro price.
The auto-revert to $299 fires on the first billing date after month 6. There is no warning email mandated in the ToS — xAI sends one as a courtesy but is not contractually obligated to.
Cancellation immediately disables the CLI; there is no "finish your current month" runway.
The 2M-token context window applies to the grok-code-fast-1 model invocations made by Grok Build. If you call xAI's API directly on the same plan, you get the standard API context limits, not the Grok Build limits.

This is not predatory pricing. It is standard SaaS, but it is not the loss-leader entry pricing some launch coverage implied. If you actually use Grok Build daily, treat the $99 as the front end of a longer commitment: the six intro months total $594, and the next six months at the list rate add another $1,794, because the cost of getting kicked off the platform mid-project will pressure you into the $299/month rate by month 7.
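The billing terms above reduce to simple arithmetic. This is a minimal sketch, assuming the intro rate covers the first six billing months and the list rate applies to every month after that; `cumulative_cost` is a hypothetical helper, and taxes or proration are ignored.

```python
INTRO_PRICE = 99    # USD per month during the intro period (from the article)
LIST_PRICE = 299    # USD per month after the auto-revert (from the article)
INTRO_MONTHS = 6


def cumulative_cost(months: int) -> int:
    """Total subscription spend after `months` billing months."""
    if months < 0:
        raise ValueError("months must be non-negative")
    intro = min(months, INTRO_MONTHS)
    full = max(months - INTRO_MONTHS, 0)
    return intro * INTRO_PRICE + full * LIST_PRICE


for m in (6, 7, 12):
    print(m, cumulative_cost(m))
# 6 594
# 7 893
# 12 2388
```

Six months at the list rate alone come to $1,794, so a full first year costs $2,388 if you never cancel.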

Hour 42: The Kafka Refactor Verdict

By hour 42 I had completed the full Kafka consumer refactor, written 47 new unit tests, restructured the batch acknowledgment logic, and added a chaos-testing harness. End-to-end wall time on Grok Build: 4 hours 17 minutes of agent time across the 42-hour window. Same project on Claude Code last week: 11 hours 03 minutes.

The cost comparison gets interesting. Claude Code at $200/month (the Claude Code Premium tier) plus Anthropic API tokens consumed during agent runs totaled about $317 for the equivalent project. Grok Build at $99/month intro pricing flat, no per-token billing, totaled $99. If the intro pricing held forever, Grok Build would be a no-brainer.

But the intro pricing does not hold forever. At $299/month with the same workload, Grok Build cost would land around $299 versus Claude Code's $317 — basically the same. The decision then becomes a question of which model you trust more on architectural plans, and that is where Plan Mode keeps Grok Build in the conversation.
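The comparison above is easy to check. This sketch uses only the figures reported in this section and assumes, as the article does, that one month of subscription covers the whole project.

```python
def to_minutes(hours: int, mins: int) -> int:
    return hours * 60 + mins


grok_minutes = to_minutes(4, 17)     # agent time reported for Grok Build
claude_minutes = to_minutes(11, 3)   # agent time reported for Claude Code
time_ratio = claude_minutes / grok_minutes

claude_cost = 317                    # USD, subscription plus API tokens
grok_intro, grok_list = 99, 299      # USD, flat monthly price

print(f"agent-time ratio: {time_ratio:.2f}x")                 # 2.58x
print(f"saving at intro price: ${claude_cost - grok_intro}")  # $218
print(f"saving at list price: ${claude_cost - grok_list}")    # $18
```

At the list price the dollar gap shrinks to $18, so the 2.58x difference in agent time carries most of the remaining argument.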

Hour 48: The Honest Verdict

Three categories of developer should pay the $99 right now.

Senior engineers reviewing agent output on production code. Plan Mode is genuinely better at calling out invariants and edge cases than any other agent I have tested. The architectural-mistake-catch ratio is high enough to justify the price.

Teams running parallel hypothesis exploration. Arena Mode's tournament-search over the solution space matters most on tasks with multiple defensible architectural paths. If your work is straightforward CRUD, you will not feel the difference.

Developers on Linux or macOS who already pay $200+/month for Claude Code. The marginal cost during the intro period is small next to what you already spend, and you can dual-run Grok Build and Claude Code on the same task and compare outputs.

Three categories should wait.

Windows-native developers who refuse to use WSL2. Native Win32 is on the roadmap with no date. Wait for that release.

Solo developers on side projects. $99/month is a lot for side-project work. The free Kimi K2.6 Code Preview covers 80% of the same use cases.

Anyone who hates SaaS auto-revert clauses. Set a calendar reminder for month 5 if you sign up. The $299 wall is real.

The Bigger Picture: Three Coding Agents Shipped This Week

Grok Build is not the only coding-agent news from the May 14 window. OpenAI launched Codex mobile remote-control on the same day, letting you trigger Codex tasks from the ChatGPT iOS or Android app. Claude Opus 4.7 (which powers Claude Code) hit a fresh round of benchmark numbers earlier in May. And Claude Code is reportedly at a $2.5B annual run-rate and powers 4% of all GitHub commits as of April 2026.

The coding-agent market is consolidating into a three-way race: Anthropic's Claude Code, OpenAI's Codex, and now xAI's Grok Build. The architectural divergence is the real story. Claude Code bets on tool-calling and MCP integration. Codex bets on remote execution and mobile triggering. Grok Build bets on parallel hypothesis search with Arena Mode.

For the first time, the three top agents are using meaningfully different agent loops. That matters because we are about to find out which architecture wins on which task class. The earlier AI code editor ranking from March 2026 already feels stale — Grok Build did not exist when that piece shipped.

Should You Pay the $99?

If you are a senior engineer or staff engineer working on production code, and you already pay for Claude Code, yes: the dual-run cost is negligible and the Plan Mode delta justifies the experiment. Cancel before the end of month 6, when the intro price expires, if Arena Mode does not pay off for your workload.

If you are anyone else, wait two weeks. The first real third-party benchmarks comparing Grok Build, Claude Code, and Codex on real-world SWE-Bench Verified tasks should drop by end of May. The pricing decision becomes much clearer with that data.

Frequently Asked Questions

What is Grok Build and when did xAI launch it?

Grok Build is xAI's agentic coding CLI, a terminal-based AI coding agent that runs on macOS and Linux natively (Windows via WSL2). The initial launch was May 8 2026, limited to SuperGrok Heavy subscribers. Elon Musk personally pushed it into wider public beta on May 14 2026 by inviting feedback on X and opening the intro pricing tier to new signups.

How much does Grok Build cost — is the $99 deal real?

The $99/month intro price is real but only lasts six months. After month 6, the subscription auto-reverts to $299/month unless you affirmatively cancel. There is no in-product downgrade path that keeps Grok Build access at a cheaper tier: it is $299 or cancel. If you plan to use it daily, treat the $99 as the start of a longer commitment: $594 for the six intro months, then $1,794 for every six months after that.

How does Grok Build compare to Claude Code and Codex?

Architecturally they diverge in interesting ways. Claude Code uses a serial agent loop with heavy MCP tool integration. OpenAI Codex (which got mobile remote-control on May 14 2026) emphasizes remote execution. Grok Build runs up to 8 parallel hypothesis-generating subagents and uses Arena Mode to algorithmically rank competing diffs. On the Kafka refactor I tested, Grok Build was about 3x faster than Claude Code with a cleaner architectural choice.

What are Grok Build's Plan Mode and Arena Mode?

Plan Mode previews the full file-level diff plan before any change lands, with explicit callouts where the change could violate existing invariants. Arena Mode runs up to 8 subagents in parallel generating competing diffs for the same task, then ranks them algorithmically (testability, complexity, scope) and selects the optimized merge. Plan Mode is the safety layer; Arena Mode is the search layer.

Does Grok Build work on Windows?

Yes, but only via WSL2 at launch. A native Win32 build is on xAI's roadmap with no announced date. If you refuse to use WSL2, wait for the native build before signing up.

Is the $99 introductory pricing locked in or does it auto-renew?

It auto-renews at $299/month on the first billing date after month 6. The $99 intro is one-time per account — you cannot cancel and re-up later to get it again. xAI sends a courtesy reminder email but is not contractually required to. Set a calendar reminder for month 5 if you sign up.

Nobody Told You: Anthropic Just Stopped Selling to Developers. They Walked Into 33 Million Small Businesses.

Skila AI — Sat, 16 May 2026 03:29:18 +0000

Anthropic just stopped fighting OpenAI for developers. Almost nobody noticed.

On May 13 2026, the company quietly shipped Claude for Small Business — 15 prebuilt agentic workflows and 8 connectors that put Claude directly inside Intuit QuickBooks, PayPal, HubSpot, Canva, Docusign, Google Workspace, and Microsoft 365. Then on May 14 2026 the company kicked off a 10-city free AI fluency training tour in Chicago. Same week, Anthropic signed a 4-year, $200 million joint commitment with the Gates Foundation.

If you read this as another product launch, you missed the actual story. This is the moment Anthropic stopped pricing Claude like a developer tool and started pricing it like a utility for the 33.3 million American small businesses that have never paid for a generative AI subscription in their lives.

Here is what nobody is saying out loud.

The Launch in Six Concrete Facts

Strip away the marketing copy and the May 13 launch comes down to six things.

15 prebuilt workflows. Payroll planning, month-end close, cash-flow forecasting, accounts-receivable chasing, sales-campaign generation, customer service triage, marketing copy, social scheduling, contract review, expense categorization, and a handful of vertical-specific bundles. These are not chatbots. They are agentic plans that read your data, draft the work, and ask you to approve.
8 first-party connectors. Intuit QuickBooks, PayPal, HubSpot, Canva, Docusign, Google Workspace, Microsoft 365, and one more rotating partner. That stack covers roughly 80 percent of where money actually moves through a US small business.
No incremental price. Anthropic's own line: there is no extra charge for Claude for Small Business beyond the cost of your Claude licenses and the partner tools you already pay for. The first AI product positioned as a feature of software you already bought.
10-city training tour. Chicago (May 14), Tulsa, Dallas, New Jersey, Baton Rouge, Birmingham, Salt Lake City, Baltimore, San Jose, Indianapolis. 100 local SMB leaders per stop. Half a day of live AI fluency training. Free.
$200M Gates Foundation partnership. Announced May 14 2026. Grant money plus Claude usage credits plus technical support over 4 years, targeted at agricultural productivity and K-12 tutoring. Not directly part of the SMB product, but the same week, the same playbook: get Claude into hands that have never typed a prompt.
The 33 million figure. Per the US SBA 2024 Small Business Profile, there are 33.3 million small businesses in the United States employing 61.7 million people — 46 percent of the private workforce. Almost none of them have an enterprise AI seat. That is the market Anthropic just walked into.

Why This Is the Contrarian Story

For 18 months the entire AI press has been writing the same narrative: OpenAI versus Anthropic, dueling for the developer's heart. Whoever wins coding agents wins the future. Code is the wedge.

That story is over. Anthropic already won.

Per our earlier reporting, Claude Code is at roughly $2.5 billion in annual run-rate and reportedly powers around 4 percent of all GitHub commits as of April 2026. Anthropic quadrupled enterprise market share over the prior 12 months. The developer war is finished. They are just choosing not to spike the football.

What they are doing instead is walking into a market that is 100 times larger by user count and roughly zero percent penetrated. The American SMB software market is approximately $200 billion a year. Intuit owns $16 billion of it. HubSpot owns about $2.6 billion. Salesforce, Microsoft, ADP, Gusto, Square — every one of them has the customer relationship, but none of them have the AI layer.

Anthropic is not trying to replace those companies. Anthropic is trying to be inside all of them at the same time. That is what the connector strategy means. Claude does not need its own QuickBooks. Claude shows up in your QuickBooks and runs the bookkeeping work you were going to outsource to a virtual assistant for $35 an hour.

The Pricing Move That Tells You Everything

Read the Anthropic line one more time: no extra charge beyond the cost of Claude licenses and whatever partner tools a business already pays for.

That sentence is a strategy document.

Anthropic is signaling that Claude for Small Business is not a product unit they expect to monetize standalone. It is a top-of-funnel acquisition vehicle for getting Claude seats into companies that are not currently buying AI. Once the workflows are running, the upgrade path to a higher-tier Claude plan, a Claude Code seat for the company's contract developer, or a fully managed Claude Enterprise contract becomes the actual revenue.

This is the Microsoft Office playbook, run in reverse. Microsoft sold you Word and Excel, then bundled in Teams and Copilot. Anthropic is starting with a connector layer that is technically free, getting Claude embedded in the daily workflow, then pricing the upgrade.

You can see the same shape in the Gates Foundation deal. $200 million of grants and credits, four years, focused on agriculture and K-12. That is not philanthropy in the traditional sense. That is Anthropic buying distribution into two giant verticals that competitor pricing models cannot touch.

What This Means for Intuit, HubSpot, and Microsoft

Intuit's QuickBooks Online has roughly 8 million subscribers. HubSpot has about 250,000 paid customers. Microsoft 365 has more than 400 million seats. Anthropic just announced a product where Claude reaches into all three of them with the customer's existing credentials, runs the work, and bills nothing extra for the privilege.

If you are Intuit, you have a problem. Your AI strategy was Intuit Assist — a Claude-powered chatbot inside QuickBooks. Anthropic just made the equivalent functionality available as an outside-in workflow that uses your data without going through your AI layer. The wholesale price of intelligence inside QuickBooks just dropped to whatever Anthropic charges for tokens.

If you are HubSpot, the picture is similar. Breeze AI was your defensive product. Anthropic just told your customers that Claude can do the same campaigns, sales sequences, and customer service triage — directly inside your CRM — and not charge them extra.

If you are Microsoft, this is where it gets uncomfortable. Microsoft 365 Copilot for Business is $30 per user per month. Claude for Small Business overlaps with the same Word, Excel, and Outlook tasks. Same model family, same integration layer, no incremental cost. Anthropic is using Microsoft's own connector framework against the Copilot SKU.

The Chicago Tour Is the Tell

The most overlooked part of the announcement is the training tour. Anthropic is sending people to Chicago, Tulsa, Birmingham, Baton Rouge, Indianapolis — towns that nobody in the AI press writes about — to train 100 local small business owners at a time, in person, for free.

This is not a marketing stunt. This is the same strategy Stripe used to get every internet startup founder using Stripe inside 18 months: relentless field presence at the layer where developers actually live. Anthropic is doing it for the SMB owner who has never written a prompt and does not want to learn AI prompt engineering from a YouTube video.

The first stop on May 14 2026 was Chicago. 100 SMB leaders. Half a day of hands-on training. Free.

If Anthropic does this 10 times across 2026 and brings 100 local SMB anchor customers per market into the Claude ecosystem, 1,000 in total, the multiplier effect through referrals, accountant networks, and chamber-of-commerce word-of-mouth dwarfs any digital ad spend. This is how you win a market that does not read TechCrunch.

Where the Cracks Will Show

None of this is risk-free. Three cracks worth watching.

First, the connector model depends on partner permission. Intuit can revoke API access if QuickBooks usage by Claude agents materially cannibalizes Intuit Assist revenue. Same with HubSpot. Anthropic is betting that the partners are too dependent on Anthropic's models to risk the relationship — but that bet is not unconditional.

Second, agentic workflows on accounting data have a hallucination problem that does not exist in code. A wrong invoice amount or a missed payroll deadline does not get caught by a unit test. Anthropic has not yet detailed the human-in-the-loop guardrails that ship with the 15 workflows. Small business owners will find the failure modes the hard way.

Third, the SMB market is not actually one market. A 50-person construction company in Birmingham, a 5-person law firm in San Jose, and a solo Etsy seller in Tulsa have completely different workflow needs. 15 prebuilt workflows may cover 60 percent of the surface area. The remaining 40 percent is custom work that nobody wants to do at SMB price points.

The Real 2026 AI War

The 2026 AI race is not about whose model scores higher on SWE-Bench. That fight is over. The real 2026 AI war is about whose AI runs invisibly inside the software your accountant, your office manager, and your bookkeeper already use every day.

Anthropic just took a 12-month lead in that race. OpenAI's enterprise SKU and Google's Workspace AI are both still selling "come try our chatbot." Anthropic is shipping "your QuickBooks just got 10x smarter and we will not charge you for it."

That is a different category of product. And it is the one most likely to win the next 100 million paying AI users.

Frequently Asked Questions

What is Claude for Small Business and when did it launch?

Claude for Small Business launched on May 13 2026. It is a bundle of 15 prebuilt agentic workflows (payroll, month-end close, cash-flow forecasting, invoice chasing, sales campaigns, customer service triage, and more) plus 8 partner connectors that let Claude run those workflows directly inside the SaaS tools a small business already uses.

Which apps does Claude for Small Business connect to?

The launch shipped with 8 first-party connectors: Intuit QuickBooks, PayPal, HubSpot, Canva, Docusign, Google Workspace, and Microsoft 365, plus one additional rotating partner. Together those cover roughly 80 percent of where money and customer data move inside a typical US small business.

How much does Claude for Small Business cost?

Anthropic's announcement states there is no extra charge for Claude for Small Business beyond the cost of your existing Claude licenses and whatever partner tools (QuickBooks, HubSpot, etc.) the business already pays for. The product is positioned as an embedded layer rather than a standalone SaaS subscription.

Why did Anthropic pivot from developers to small business?

Anthropic did not abandon developers. Claude Code is reportedly at a $2.5 billion annual run-rate and powering an estimated 4 percent of GitHub commits as of April 2026 — the developer market is effectively won. The SMB pivot opens a separate, almost zero-percent-penetrated market of 33.3 million US small businesses, where Intuit, HubSpot, and Microsoft own the customer relationship but no one yet owns the AI layer.

Where can small business owners get free Claude training?

Anthropic kicked off a 10-city free training tour in Chicago on May 14 2026. Confirmed stops include Chicago, Tulsa, Dallas, New Jersey, Baton Rouge, Birmingham, Salt Lake City, Baltimore, San Jose, and Indianapolis — 100 SMB leaders per city, half-day live AI fluency training, no charge. Registration is run through Anthropic's newsroom announcement page.

How does Claude for Small Business compare to Microsoft Copilot for SMB?

Microsoft 365 Copilot for Business is $30 per user per month and runs primarily inside Microsoft 365 apps. Claude for Small Business overlaps the same productivity surface but adds prebuilt workflows for finance and customer ops, runs across Microsoft 365 plus QuickBooks plus HubSpot plus Canva, and costs nothing on top of existing Claude licenses. Microsoft has the distribution advantage; Anthropic has the cross-app workflow advantage and the more aggressive pricing.

OpenAI Just Admitted It: AI Hallucinations Are Mathematically Impossible to Fix

Skila AI — Thu, 14 May 2026 02:32:34 +0000

Originally published on Skila AI

OpenAI's own September 2025 paper proved AI hallucinations are mathematically inevitable. The total error rate is at least 2x the yes/no error rate, and 9 of 10 frontier benchmarks reward guessing over honesty. Stop waiting for a fix. Plan around it.

On September 4, 2025, Adam Kalai, Ofir Nachum, and Edwin Zhang from OpenAI — plus Santosh Vempala from Georgia Tech — published arXiv:2509.04664, "Why Language Models Hallucinate". The paper proves that for any large language model trained on next-token prediction, hallucinations are not a bug. They are a mathematically unavoidable feature.

Eight months later, the implications are landing. On May 12, 2026, Anthropic shipped 12 new legal Claude plugins and 20 legal connectors — deposition prep, bar-exam coaching, case-law research, file drafting, plus integrations with DocuSign, Thomson Reuters, Harvey, and Everlaw. AI is now being pushed into the single highest-stakes hallucination zone in the economy. The same week, Damien Charlotin's public legal-hallucination database ticked past 120 documented court cases where AI tools fabricated quotes, made up case names, or invented citations that don't exist.

The myth this article busts is the most expensive one in AI: "hallucinations will be fixed in the next model." They won't. And the math says why.

The Myth Everyone Believes

Walk into any room where AI buyers gather. Boardrooms. Investor calls. Procurement meetings. The same sentence shows up: "GPT-5 still hallucinates a bit, but the next version will solve it."

That sentence funds budgets. It silences risk officers. It postpones every serious workflow audit by another quarter. And it is wrong in a way that can be proven on a whiteboard.

Here is the version of the myth the industry quietly sells you:

Hallucinations are a temporary engineering flaw.
Better data + bigger models + more RLHF = the rate drops to zero.
One day soon, an AI will be "trustworthy enough" to ship without human review.

OpenAI itself just published the paper that demolishes all three claims.

What the OpenAI Paper Actually Proves

The paper by Kalai, Nachum, Zhang, and Vempala has two arguments. The first is statistical. The second is sociological. Both kill the myth.

1. Generating sentences is mathematically harder than answering yes/no

Imagine a model that's wrong on a simple binary question ("Did X happen, yes or no?") 5% of the time. The paper proves something brutal: when that same model has to generate a sentence containing the same fact, its error rate is at least double the binary rate. Errors compound across every prediction it has to chain together.

This is the "Is-It-Valid reduction." The paper formally shows that generating a valid sentence cannot be easier than checking whether one is valid. So whatever error rate your best binary classifier achieves, your generator's error rate is at least twice that. There is no architecture choice that escapes this. It's a floor, not a ceiling.

2. 9 of 10 frontier benchmarks reward guessing over honesty

The paper's second argument is the one that should haunt every CTO. The authors surveyed 10 widely-used benchmarks that frontier labs train against. Nine of the ten use binary grading. Right answer = 1 point. Wrong answer = 0 points. "I don't know" = 0 points.

Under that scoring, a model that guesses confidently always beats a model that hedges. So gradient descent does what gradient descent does: it learns to guess. Confidently. Even when it shouldn't. The training signal literally rewards making things up over saying "I'm not sure."

That's why even GPT-5, Claude Opus 4.7, and Gemini 3 still hallucinate. Not because the engineering is sloppy. Because the scoreboard is broken.
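The incentive is easy to see in a few lines. This is a minimal sketch of the scoring argument, assuming a correct answer earns 1 point and "I don't know" earns 0. The `wrong_penalty` parameter is a hypothetical knob added for illustration; under the binary grading described above it is 0.

```python
ABSTAIN_SCORE = 0.0  # points for answering "I don't know"


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected points for answering when the model is right with
    probability p_correct. Correct = +1, wrong = -wrong_penalty."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    return p_correct - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Pick whichever action has the higher expected score."""
    if expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE:
        return "guess"
    return "abstain"


# Binary grading: even a 20%-confident guess beats abstaining.
print(best_action(0.2))                      # guess
# With a 1-point penalty for wrong answers, the same guess loses.
print(best_action(0.2, wrong_penalty=1.0))   # abstain
```

With no penalty, a guess never scores below abstaining, so a model optimized against that scoreboard has no reason to say it is unsure.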

The 2026 Hallucination Rates Nobody Wants to Print

So how bad is it in practice, eight months after the paper dropped? Worse than the marketing implies.

Two independent 2026 benchmark studies tested every frontier model on factual tasks. The results:

Frontier hallucination range: 3.1% to 19.1% depending on task family.
Citation accuracy is the worst-performing task — 12.4% hallucination rate even with extended thinking enabled.
"Extended thinking" / reasoning modes help on math, but barely move citation accuracy. The reasoning loop just produces more confidently wrong citations.
Smaller open-source models hallucinate 2-4x more than frontier ones — but the floor is still non-zero.

Translation: if you have an AI workflow that produces a 1,000-word deliverable with 8 factual claims, you should expect 0.25 to 1.5 hallucinations per output from the best models on the market. That number is not trending to zero. It hasn't moved meaningfully since GPT-4. The paper explains why it won't.
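The per-output estimate is just claims multiplied by rate. The sketch below reproduces it and adds the chance of at least one hallucination per deliverable. It assumes each claim fails independently at the quoted rate, which is a simplification the article does not state.

```python
def expected_hallucinations(claims: int, rate: float) -> float:
    """Expected number of hallucinated claims in one output."""
    return claims * rate


def p_at_least_one(claims: int, rate: float) -> float:
    """Chance that one output contains at least one hallucinated claim,
    assuming claims fail independently."""
    return 1.0 - (1.0 - rate) ** claims


CLAIMS = 8
for rate in (0.031, 0.191):  # frontier range quoted above
    print(
        f"rate={rate:.1%} "
        f"expected={expected_hallucinations(CLAIMS, rate):.2f} "
        f"p_any={p_at_least_one(CLAIMS, rate):.2f}"
    )
```

Under the independence assumption, roughly one in five deliverables contains an error at the low end of the range, and about four in five at the high end.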

The May 12 News That Makes This Urgent

On May 12, 2026, Anthropic announced 12 new Claude plugins and 20 connectors for legal work. Bar-exam prep. Deposition drafting. Case-law research. Direct connectors to DocuSign, Thomson Reuters, Harvey, and Everlaw.

This is the most aggressive push of generative AI into a high-stakes citation-heavy domain anyone has ever shipped. And it lands in a world where Damien Charlotin's live tracker of fabricated AI citations in real court filings just passed 120 cases. Lawyers have been sanctioned. Briefs have been thrown out. Judges have started ordering AI-disclosure declarations.

Two things are now true at the same time:

The math says hallucinations are permanent.
The biggest AI labs are accelerating into the domains where hallucinations cause the most damage.

This is the gap. And it's not closing.

Why "Just Use RAG" Doesn't Save You

Every time someone shows a hallucination, an AI engineer says "that's why we use RAG" (retrieval-augmented generation — let the model cite a real document). It helps. It doesn't fix the underlying problem.

The paper's result still applies inside a RAG pipeline. The model still has to:

Generate a query against the retrieval index (can hallucinate the query intent)
Decide which retrieved chunk is relevant (binary classification, still error-prone)
Synthesize an answer from the chunks (generation, still at least 2x the binary error)
Cite the chunk correctly (separate generation task, also at least 2x)

That's why independent audits of RAG-based legal AI still find hallucination rates of 6-17% in the wild. RAG is a multiplier on accuracy, not a fix.
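The four steps above compound. This is a rough sketch of how per-stage error turns into end-to-end error, assuming the stages fail independently. The 2% rates are hypothetical, and the generation stages are set to exactly twice the binary rate, which the article describes as a floor rather than a typical value.

```python
from math import prod
from typing import Dict


def pipeline_error(stage_errors: Dict[str, float]) -> float:
    """Probability that at least one stage fails, assuming independence."""
    for name, err in stage_errors.items():
        if not 0.0 <= err <= 1.0:
            raise ValueError(f"{name}: error rate must be in [0, 1]")
    return 1.0 - prod(1.0 - err for err in stage_errors.values())


binary_error = 0.02  # hypothetical yes/no error rate
stages = {
    "query": 0.02,                   # hypothetical query-generation error
    "relevance": binary_error,       # binary classification step
    "synthesis": 2 * binary_error,   # generation, at the 2x floor
    "citation": 2 * binary_error,    # generation, at the 2x floor
}
print(f"{pipeline_error(stages):.3f}")  # 0.115
```

Even with every stage at a few percent, the pipeline as a whole lands above 11%, the same order of magnitude as the audit figures quoted above.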

What This Means for Your AI Strategy

If hallucinations are permanent, your AI roadmap can't be "wait for the model to get better." That roadmap is dead. Three new principles replace it.

Principle 1: Calibrate uncertainty, don't suppress it

Models trained on binary benchmarks suppress uncertainty signals because the scoring punishes them. You can undo this in your prompts. Force the model to output a confidence score with every claim. Reject any output below your threshold. Yes, you'll get fewer answers. The remaining answers will be 5-10x more trustworthy.
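In code, the rejection step is a simple filter. This is a minimal sketch, assuming the model has already been prompted to attach a confidence value to each claim. `Claim`, its `confidence` field, and the 0.8 threshold are hypothetical, and a self-reported confidence is only useful if you have checked that it tracks real accuracy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Claim:
    text: str
    confidence: float  # model-reported, 0.0 to 1.0 (hypothetical field)


def split_by_confidence(
    claims: List[Claim], threshold: float
) -> Tuple[List[Claim], List[Claim]]:
    """Return (accepted, rejected) claims for a given confidence threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    accepted = [c for c in claims if c.confidence >= threshold]
    rejected = [c for c in claims if c.confidence < threshold]
    return accepted, rejected


claims = [
    Claim("Invoice 1042 is overdue", 0.95),
    Claim("The contract renews in March", 0.55),
    Claim("Payroll runs on the 15th", 0.80),
]
accepted, rejected = split_by_confidence(claims, threshold=0.8)
print([c.text for c in accepted])
print([c.text for c in rejected])
```

Rejected claims are better routed to a human or a retrieval check than dropped silently, so the workflow still surfaces what the model was unsure about.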

Principle 2: Make verification cheaper than generation

If your workflow generates 100 AI outputs an hour but a human can only verify 10, the other 90 are unverified slop entering production. Invert it. Use AI to generate and verify, but cap generation throughput at human review capacity. You'll ship slower. Nothing you ship will embarrass you in court.
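One way to enforce that cap is to bound the unreviewed backlog, so generation stalls until a reviewer frees a slot. The sketch below is illustrative only: `ReviewGate` and its method names are hypothetical, and a real system would also need persistence and reviewer assignment.

```python
from collections import deque
from typing import Deque, Optional


class ReviewGate:
    """Refuses new drafts once the unreviewed backlog hits a fixed limit."""

    def __init__(self, max_backlog: int) -> None:
        if max_backlog < 1:
            raise ValueError("max_backlog must be at least 1")
        self.max_backlog = max_backlog
        self._backlog: Deque[str] = deque()

    def submit(self, draft: str) -> bool:
        """Queue a draft for review; return False if the backlog is full."""
        if len(self._backlog) >= self.max_backlog:
            return False
        self._backlog.append(draft)
        return True

    def review_next(self) -> Optional[str]:
        """Hand the oldest draft to a reviewer, freeing one slot."""
        return self._backlog.popleft() if self._backlog else None


gate = ReviewGate(max_backlog=2)
results = [gate.submit(f"draft-{i}") for i in range(3)]
reviewed = gate.review_next()
accepted_after_review = gate.submit("draft-3")

print(results)                # [True, True, False]
print(reviewed)               # draft-0
print(accepted_after_review)  # True
```

The generator checks the return value of `submit` and pauses on `False`, which ties output volume directly to how fast humans can verify it.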

Principle 3: Buy tools that assume the math, not tools that deny it

This is the buying-decision layer. Vendors who pitch "99% accuracy" and "hallucination-free" are pitching against a published math proof. They will lose. Buy from vendors who tell you their hallucination rate, show you their human-in-the-loop workflow, and ship audit logs by default.

The Bigger Lesson: Stop Outsourcing Math to Marketing

Three years of AI hype trained the market to believe every limitation is a roadmap item. "Context window too small? Wait six months." "Cost too high? Wait six months." "Hallucinates? Wait six months."

For context and cost, that worked. Both dropped by orders of magnitude. For hallucinations, the math is structurally different. There is no Moore's-law curve here. There is a proof.

The companies who keep pretending will burn money, shipping unreliable agents into production and paying the cleanup bill. The companies who internalize the OpenAI paper will quietly build workflows where AI does 80% of the work and humans verify the last 20% — and they will dominate every regulated industry over the next five years.

The myth was: "AI hallucinations will be fixed soon."

The truth is: AI hallucinations are a permanent feature of next-token prediction. Design for them. Buy around them. Ship anyway.

Key Takeaways

OpenAI's Sept 4, 2025 paper (Kalai, Nachum, Zhang, Vempala) proves hallucinations are mathematically inevitable.
9 of 10 frontier AI benchmarks award zero points for "I don't know" — they train models to guess.
Frontier 2026 hallucination rates still 3.1-19.1%. Citation accuracy worst at 12.4% even with extended thinking.
120+ court cases with AI-fabricated citations as of May 2026.
Anthropic shipped 12 legal Claude plugins + 20 connectors on May 12, 2026.
Design workflows that assume hallucination will happen. Verify before it hits a deliverable.

Read the full article with sources at news.skila.ai

I Ranked Every AI Image Model by Speed. The $0.01 One Crushed GPT Image 2.

Skila AI — Tue, 12 May 2026 07:19:14 +0000

I ranked every AI image generator on the May 2026 leaderboards by one number: seconds per image. Not Elo score. Not how pretty the output looks at 1080p. Just: how long does a user wait from prompt to pixels.

The result reordered everything I thought I knew about this category.

The fastest model in production right now is not from OpenAI, not from Google's flagship line, and not from Midjourney. It's Z-Image Turbo, an open-tier model that ships images in about a second for one cent each. Meanwhile GPT Image 2 — the model topping the quality Elo at 1338 — can take a full minute on a complex prompt. That's a 60x latency penalty for marginal quality gains most apps will never surface.

I'll walk the full ranking, the news that made today the moment to read it, and how to pick a model for the job you actually have.

The news anchor: why today, May 11-12 2026

Two things landed in the last 48 hours that make this an unusually clean snapshot.

First, OpenAI rolled out GPT-5.5 Instant to all ChatGPT users on May 11. Instant means a faster default tier — and OpenAI is pulling latency forward across the stack, including its image side. The bar for what counts as "fast" just moved.

Second, Google's Gemini Omni video model leaked ahead of Google I/O 2026. Nano Banana 2 (the codename for Gemini 3.1 Flash Image) is hitting peak API adoption right now as devs migrate ahead of Omni. If you're picking an image stack this week, you're picking it on top of a market that's about to get reshuffled again.

Speed numbers move every few weeks. The ones below are pulled from the published vendor leaderboards (llm-stats, Artificial Analysis, Atlas Cloud's 2026 benchmark) and read at standard tier unless noted.

The full ranking (May 2026)

1. Z-Image Turbo — ~1 second, $0.01/image

Cheapest and fastest on the board. llm-stats has it sitting at Elo 302, which puts it near the high-end pack for quality despite the cost. This is the model to default to for chat-UX scenarios where the user is staring at a spinner. If anything beats it on speed at this price tier I haven't found it.

2. Google Nano Banana 2 (Gemini 3.1 Flash Image) — 1-3 seconds standard, $0.067/image

The speed leader at API scale. "Standard" tier finishes in 1-3 seconds; flipping to the 'Pro' tier on the same family pushes you to 4-6 seconds but bumps fidelity. Google has been quietly winning the latency war here for two release cycles — this is the safe default for production apps that need consistent quality and speed, not just one or the other. Google AI Studio is the canonical UI if you want to test it without writing API code.

3. Seedream v5 Lite — ~2 seconds

The dark horse from ByteDance. v5 Lite is genuinely fast at high resolution — most competitors slow down by 2-3x at 2048x2048; Seedream barely flinches. If you've used Dreamina, you've already touched the Seedream stack — Dreamina is ByteDance's consumer frontend over the same models.

4. Imagen 4 Fast — ~3 seconds

Google's text-rendering specialist. If your prompts include real words inside the image (signage, labels, packaging), this is where to start. Slower than the top three but the text doesn't garble.

5. Flux 1.1 Pro — ~6 seconds

Black Forest Labs' photorealism leader. 6 seconds is the cost of looking like a photograph. Worth it for hero shots, ad creative, anything where the audience is supposed to forget it's synthetic.

6. Midjourney v7 — multi-second turbo / 10-30s standard

Still the artistic ceiling, still one of the slowest mainstream options (only the two GPT Image models below rank lower). Midjourney v7 in turbo mode is acceptable; in standard mode it's a non-starter for batch generation. Workflow: use it for the one frame that has to look like an oil painting, not the gallery wall.

7. GPT Image 2 (standard) — ~15 seconds simple / 40-60s complex

Highest published Elo of any current model (1338 on llm-stats). Also one of the slowest. There's a real argument for GPT Image 2 when you absolutely need maximum quality and have no live user waiting — think nightly batch renders for a marketplace, or a designer who'll pick the best of four. For chat-style UX it's brutal.

8. GPT Image 1.5 — 15-45 seconds

Highest arena score on Artificial Analysis (306). Quality buys you waiting. If you're already on the OpenAI image stack and don't need GPT Image 2's specific upgrades, 1.5 is roughly the same speed for a fraction of the cost.

What the speed gap actually means

The leaders and the laggards are now an order of magnitude apart. That hasn't been true since the early Stable Diffusion days.

A 1-second image generator and a 45-second one are not the same product with a different price tag. They're for different use cases entirely:

Sub-3s: live chat avatars, generative UI, anything inside an interaction loop where the user is watching. Z-Image Turbo, Nano Banana 2, Seedream v5 Lite.
3-10s: batch operations where the user has moved on but expects results in the next minute. Imagen 4 Fast, Flux 1.1 Pro.
10-30s: creative pipelines where humans select from candidates. Midjourney v7, GPT Image 2 simple prompts.
30s+: hero assets, marketing renders, nightly batches. GPT Image 2 complex, GPT Image 1.5.
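If you log real latencies, the four tiers above reduce to a small lookup. A rough sketch: the function name is hypothetical, and how the boundaries are handled (exactly 3s counts as "3-10s", and so on) is my assumption, since the list doesn't say.

```python
def latency_tier(seconds):
    """Map a measured prompt-to-pixels latency to one of the four tiers above."""
    if seconds < 3:
        return "sub-3s"
    if seconds < 10:
        return "3-10s"
    if seconds < 30:
        return "10-30s"
    return "30s+"


# Latencies taken from the ranking; ranges are collapsed to a single value.
samples = [("Z-Image Turbo", 1), ("Flux 1.1 Pro", 6), ("GPT Image 2 (complex)", 50)]
for model, seconds in samples:
    print(model, "->", latency_tier(seconds))
```

Feed it your own p95 latency rather than the vendor's headline number; the tier is about what the user actually waits.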

The mistake teams keep making is picking by Elo score and then bolting the model into a chat product. Users abandon at 8 seconds of dead air. You can't fix that with a better prompt.

What the news cycle changes about this list

Three things to watch over the next 4-6 weeks:

1. Google I/O 2026 likely formalizes Nano Banana 2 Pro and announces an image side to Gemini Omni. If Pro tier latency drops below 3 seconds, the standard-tier price advantage of Z-Image Turbo gets squeezed.

2. OpenAI's GPT-5.5 Instant pattern probably arrives on image. GPT Image 2 at 15 seconds is unsustainable next to a 1-3s competitor — expect a faster tier announcement.

3. Open-source keeps closing. Tools like Stable Diffusion Web UI aren't on this leaderboard but they let you run optimized variants on your own hardware. For a fixed-cost workload at scale that math gets compelling fast.

How to actually pick one

Three questions, in order:

Is the user watching? If yes, you need sub-3s. Z-Image Turbo or Nano Banana 2. Stop reading.
Does the output need to look real? Flux 1.1 Pro. The 6-second wait is the price of photorealism today.
Is quality the only thing that matters? GPT Image 2. Plan your UX around the wait.
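Those three questions, asked in order, amount to a short-circuit decision function. A sketch only: the function and parameter names are hypothetical, and the fallback for "none of the above" is my assumption, using the model this article calls the safe single pick.

```python
def pick_image_model(user_is_watching, needs_photorealism, quality_only):
    """Answer the three questions in order; the first 'yes' decides."""
    if user_is_watching:
        return ["Z-Image Turbo", "Nano Banana 2"]  # sub-3s options
    if needs_photorealism:
        return ["Flux 1.1 Pro"]
    if quality_only:
        return ["GPT Image 2"]
    # Assumed fallback when every answer is "no".
    return ["Nano Banana 2"]


print(pick_image_model(True, True, True))    # watching user wins over everything
print(pick_image_model(False, True, False))  # photorealism, nobody waiting
```

The order matters: a watching user overrides the photorealism and quality answers, which is the whole point of asking that question first.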

If you're building consumer software and you can only support one model right now, the safe pick in May 2026 is Nano Banana 2 standard. It's the only one that's both fast and high-quality. Z-Image Turbo wins on cost, but you'll want a quality ceiling for premium tiers — and a multi-model stack is fast becoming standard. Tools like Captions already route through multiple providers behind the scenes; that's the architecture to copy.

For the companion analysis on the AI video side of the same provider rivalry, see our Veo 3.1 Lite pricing breakdown.

FAQ

What is the fastest AI image generator in 2026?

Z-Image Turbo at roughly one second per image is the fastest mainstream model on the May 2026 leaderboards, at $0.01 per image. Google's Nano Banana 2 (Gemini 3.1 Flash Image) is the fastest high-tier model at 1-3 seconds standard.

How fast is GPT Image 2 vs Nano Banana 2?

Nano Banana 2 standard finishes in 1-3 seconds. GPT Image 2 takes ~15 seconds for simple prompts and 40-60 seconds for complex ones. Depending on which ends of those ranges you compare, that's anywhere from a 5x to a 60x latency gap. GPT Image 2 wins on quality Elo (1338), but for chat-style UX Nano Banana 2 is the only sensible choice.

How much does Nano Banana 2 cost per image?

$0.067 per image at the standard tier on the public Google AI pricing as of May 2026. The Pro tier costs more and adds 3-4 seconds of latency but delivers higher fidelity. For a comparison of the entire Gemini stack pricing, see Google AI Studio.

Is Midjourney v7 slower than Flux 1.1 Pro?

Yes — Midjourney v7 in standard mode takes 10-30 seconds per image, while Flux 1.1 Pro lands around 6 seconds. Midjourney's turbo mode narrows the gap but is still slower than Flux on most prompts. Flux is the better default for photorealism at production speed; Midjourney is the better pick for stylized artistic output where you can absorb the wait.

Which AI image generator is best for batch production?

For pure throughput at low cost, Z-Image Turbo. For batch jobs where each image needs to look polished, Nano Banana 2 standard at 1-3s. Avoid GPT Image 2 for batches above 100 images: at 15-60s per call, 100 sequential images already take 25 to 100 minutes, a few hundred become a multi-hour run, and you pay for every second.
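The batch math is easy to check before you commit to a model. A back-of-envelope sketch, assuming calls run in fixed-size waves with no retries or rate limits; the `concurrency` parameter is my addition, not something the leaderboards report.

```python
def batch_minutes(n_images, seconds_per_image, concurrency=1):
    """Wall-clock minutes for a batch, running `concurrency` calls at a time."""
    waves = -(-n_images // concurrency)  # ceiling division
    return waves * seconds_per_image / 60


# Per-image latencies are the figures quoted in this article.
print(batch_minutes(100, 15))                 # GPT Image 2, simple prompts: 25.0
print(batch_minutes(100, 60))                 # GPT Image 2, complex prompts: 100.0
print(batch_minutes(100, 60, concurrency=4))  # same batch, four calls at once: 25.0
```

Parallel calls shrink the wall-clock time but not the bill, so the per-image price still decides whether the batch is worth running on a slow model.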

What is Z-Image Turbo and why is it so cheap?

Z-Image Turbo is an open-tier text-to-image model running at $0.01 per image with roughly one-second latency. The pricing reflects an aggressive market-entry strategy: it ships through commodity API providers, doesn't carry the brand premium of OpenAI or Google, and uses a distilled architecture optimized for speed. Quality lands at Elo 302 on llm-stats, which is competitive with much pricier models for most use cases.

Anthropic Just Shipped 10 More Finance Agents. The Data Says Your Team Gets Slower After 4.

Skila AI — Thu, 07 May 2026 03:35:45 +0000

Anthropic shipped 10 new finance-services agents on Tuesday. By Wednesday, every managing director on the Street had a Slack DM from a junior analyst asking which one to install first. The honest answer, supported by three independent datasets nobody is quoting in those Slack threads, is none of them — not yet.

Here is the part that should worry the CFO who just signed the enterprise contract. The research on AI tool sprawl is not ambiguous, not preliminary and not a hot take. It is replicated, large-sample and pointing in one direction: somewhere between 3 and 4 AI tools, your team stops getting faster and starts getting slower.

What Anthropic actually shipped on May 5

The Anthropic announcement is not a chatbot update. It is 10 named, deployable agent templates aimed at the highest-margin labor on the Street. The lineup: Pitch Agent (builds pitchbooks from comps and precedents), Meeting Prep Agent (client briefing packs), Earnings Reviewer (earnings calls and model updates), Model Builder (DCF, LBO and three-statement models in Excel), Market Researcher (industry overviews), Valuation Reviewer (GP packages to LP reporting), GL Reconciler (finds general-ledger breaks and traces root cause), Month-End Closer (accruals, roll-forwards, variance commentary), Statement Auditor (audits LP statements), and KYC Screener (parses docs and runs rules).

Each agent ships three ways: as a Claude Cowork plugin for desk users, a Claude Code plugin for engineering teams, and a Claude Managed Agents cookbook for IT to deploy at scale. Same day, Anthropic also launched eight new data connectors and a Moody's MCP app. The repo is open-source under Apache 2.0. JPMorgan, Goldman Sachs and Bridgewater are the launch customers. Jamie Dimon got a quote in the press release.

The pitch is irresistible: drop these into your existing stack and watch the analyst grind disappear. The reality, if you read the productivity literature, is that the analyst grind does not disappear. It mutates into something more dangerous: a cognitive workload your senior team will not notice they have until it shows up in slipped close dates and missed exposures.

Study 1: BCG and HBR — the 1,488-worker brain fry survey

In March 2026, BCG and Harvard Business Review published the largest workplace study to date on AI tool sprawl. The sample: 1,488 US workers across industries, controlling for role, seniority and tenure. The headline finding is uncomfortable for every vendor selling another agent.

Productivity rose modestly moving from one AI tool to two. It plateaued between two and three. It declined from four onward. Workers running four or more AI tools were measurably less productive than workers running two. Not equally productive. Less.

Then it gets weirder. 14% of high-tool-count workers reported what BCG named "AI brain fry" — a cluster of symptoms including mental fog, headaches, decision fatigue and slower task switching. The brain-fry rate among workers using one or two tools was negligible. The rate among workers using five-plus was over 20%.

The mechanism is straightforward when you read the qualitative interviews. Each new AI tool adds a context-switching cost: a new prompt style, a new permission scope, a new failure mode, a new place to check whether the agent already did the thing. The cognitive overhead of orchestrating four agents exceeds the time the agents save. Past a certain point, you are not delegating work. You are project-managing it.

Study 2: Nature Human Behaviour — 106 experiments, one ugly number

The October 2024 Nature Human Behaviour meta-analysis is the paper enterprise AI vendors hope you do not read. The team aggregated 106 controlled experiments covering 370 individual effect sizes on human-AI collaboration tasks. The result, expressed in standard meta-analysis notation: Hedges' g = -0.23.

Translation: human-AI teams underperformed the better of human-alone or AI-alone on decision-making tasks. Not by a hair. By roughly a quarter of a standard deviation, which by the usual rule of thumb (0.2 is small, 0.5 is medium) is a small effect, but one that points the wrong way.
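For readers who haven't met the statistic: Hedges' g is a standardized mean difference (the gap between two group means divided by their pooled standard deviation) with a small-sample correction. Here is a sketch of the standard formula in Python. The input numbers are invented to land near the reported value; they are not taken from the paper.

```python
import math


def hedges_g(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference (a minus b) with small-sample correction."""
    pooled_sd = math.sqrt(
        ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    )
    cohens_d = (mean_a - mean_b) / pooled_sd
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)  # the usual approximation
    return cohens_d * correction


# Hypothetical accuracy scores: human+AI team (a) vs the best solo performer (b).
g = hedges_g(mean_a=0.70, mean_b=0.73, sd_a=0.12, sd_b=0.12, n_a=50, n_b=50)
print(round(g, 2))
```

A negative g means the first group (here, the human-AI team) scored lower than the second.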

The breakdown is the part nobody quotes. Decision-making tasks — deepfake classification, demand forecasting, medical diagnosis, fraud detection — consistently lose. Human plus AI is worse than the better solo performer. The only category where human-AI combinations gained was open-ended creative work: brainstorming, drafting, ideation. Everything you would actually pay an analyst to do? The combo loses.

Why? The authors point to two causes. First, humans defer to AI confidence even when the model is wrong, and the wrongness compounds in multi-step tasks. Second, AI tools designed to help on average create "average-quality" outputs, which means the high-skill humans who would have produced better work alone get dragged toward the mean.

This is the finding that should make every Wall Street CFO read the contract twice. The work Anthropic's new agents target — KYC screening, GL reconciliation, valuation review, statement audits — is decision-making work. It is the exact category Nature found loses with human-AI collaboration. Not gains less. Loses.

Study 3: METR — the developer perception gap

The third dataset is the most damaging because it controls for the variable everyone uses to defend AI tools: developer self-report.

METR ran a randomized controlled trial in early 2025 with experienced open-source developers working on real codebases they already maintained. Each task was randomly assigned to allow or forbid AI tools. Same developers, same codebases, same evaluation criteria. The result: with AI allowed, the developers were 19% slower. They shipped fewer pull requests, took longer per task, and produced more rework loops.

Then METR asked the developers how they thought they performed. The same group that was 19% slower reported feeling 20% faster. That is a 39-point perception gap. Not slightly off. Not within the margin of error. Inverted.

The implication is brutal: every survey, every internal "productivity" study, every CIO testimonial relying on user-reported velocity is measuring the perception gap, not the work. The CFO who asks "is your team faster with these new agents?" will get yes from a team that is actually slower. They will not be lying. They will be wrong, and they will not know it.

What the CFO survey gets wrong

Gallup's Q1 2026 workforce survey covered 23,717 US employees. 50% reported using AI at work, up from 33% a year prior. Only 16% reported "extremely positive" impact. The other 84% rated the impact as marginal, neutral or negative. Yet enterprise AI spending is on track to grow another 60% this year.

The disconnect makes sense once you triangulate the three studies above. The people writing the checks are reading vendor case studies measuring perceived velocity. The people doing the work are quietly reporting brain fry. The middle managers are stuck in the middle, ordering more agents because the AI vendor's slide deck shows a 40% productivity uplift their own team has never seen.

The Anthropic problem in one sentence

Wall Street will not install Anthropic's 10 new agents in isolation. They will install them on top of Bloomberg Terminal, Excel, PitchBook, FactSet, Moody's MCP, S&P data feeds, internal compliance tooling and at least two existing chat-based assistants. That is roughly a 10-tool baseline before Anthropic's agents land. After deployment: a stack of about 20 tools, in a domain Nature already showed loses with human-AI collaboration, on tasks BCG already showed peak at 2 tools.

The exact prediction from the data: 14-20% of analysts will report brain fry within 90 days. Decision quality on KYC, valuations and reconciliations will degrade in ways that show up in audit findings, not user surveys. The senior reviewers signing off on agent-generated work product will catch the obvious errors and miss the subtle ones. Not because the agents are bad. Because the cognitive load of orchestrating them exceeds the load they save.

The two-tool rule

Here is the framework the BCG paper lands on, and the only piece of guidance that survives all three datasets.

Limit any one team to two AI tools. Pick them deliberately. One should handle the open-ended creative work where Nature found genuine gains: drafting, brainstorming, first-pass writing. The second should handle a single, narrow, decision-making task with a hard verification step at the end — a workflow where the human can check the answer in seconds, not minutes.

Anything past two tools should require an explicit business case showing the marginal task is decision-making (not creative), has a fast verification step (not slow), and replaces existing tool surface (not stacks on top of it). In practice, this means most teams should run Claude or ChatGPT for drafting, plus exactly one verticalized agent — and stop there.
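That business-case test can be written down as a gate, which makes it harder to wave through. A sketch with hypothetical names; the 60-second verification cutoff is my reading of "seconds, not minutes," not a number from the BCG paper.

```python
from dataclasses import dataclass


@dataclass
class ToolProposal:
    name: str
    task_is_decision_making: bool
    verification_seconds: float   # how long a human needs to check one answer
    replaces_existing_tool: bool


def passes_business_case(proposal, current_tool_count, max_verification_seconds=60):
    """The first two tools are free picks; anything after must meet all three tests."""
    if current_tool_count < 2:
        return True
    return (
        proposal.task_is_decision_making
        and proposal.verification_seconds <= max_verification_seconds
        and proposal.replaces_existing_tool
    )


kyc = ToolProposal("KYC agent", True, 20, True)
drafting = ToolProposal("Second drafting bot", False, 20, True)
print(passes_business_case(kyc, current_tool_count=2))       # True
print(passes_business_case(drafting, current_tool_count=2))  # False
```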

The Anthropic announcement is interesting precisely because it gives buyers a way to consolidate. If a single vendor ships 10 finance-specific agents, the play is not to install all 10. It is to retire two existing tools, install one Anthropic agent that replaces both, and end the quarter with the same total tool count and a higher fraction of work flowing through agents that share context. That is the version of the AI rollout the data actually supports.

What the smart CFO does this week

Three moves the data supports, none of which the vendors are pitching.

First, count your team's current AI tools. Not officially licensed ones. Actually used. Most enterprise teams find a number between 5 and 9. That is the brain-fry zone. Cut it before adding anything.

Second, build the verification step into every decision-making AI workflow. The Nature paper's gain was on creative tasks because creative tasks have ambiguous quality criteria; the human reviewer brings new information. Decision tasks lost because the human review step rubber-stamped AI confidence. If your KYC screener flags a customer, you need a human checking the source documents, not approving a summary.

Third, instrument actual cycle time. Not user-reported velocity. Actual ticket-to-close, audit-finding-to-resolution, deal-to-pitchbook minutes. Compare a team using your full AI stack to a team using only two tools. The METR perception gap predicts the surveys will lie. The clocks will not.
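Measuring the clock instead of the survey takes very little code if your ticketing system exports open and close timestamps. A minimal sketch using only the standard library; the two teams and every timestamp below are invented placeholders, so the output says nothing about real outcomes.

```python
from datetime import datetime
from statistics import median


def cycle_minutes(opened, closed):
    """Elapsed minutes between a ticket's open and close timestamps."""
    return (closed - opened).total_seconds() / 60


def median_cycle_minutes(tickets):
    """Median cycle time over (opened, closed) pairs; robust to a few outliers."""
    return median(cycle_minutes(opened, closed) for opened, closed in tickets)


def at(hour, minute):
    return datetime(2026, 5, 4, hour, minute)  # placeholder date


# Invented sample data for two hypothetical teams.
team_a = [(at(9, 0), at(11, 0)), (at(9, 0), at(10, 30)), (at(13, 0), at(16, 0))]
team_b = [(at(9, 0), at(10, 40)), (at(9, 0), at(10, 20)), (at(13, 0), at(15, 0))]

print(median_cycle_minutes(team_a), median_cycle_minutes(team_b))
```

The median is used rather than the mean so one stuck ticket doesn't swamp the comparison; with real data you would also want enough tickets per team for the difference to mean something.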

Anthropic's new agents are well-built. The Apache-licensed repo is the cleanest reference implementation of finance-specific agent skills shipping anywhere right now. The problem is not the technology. The problem is the deployment pattern, which on every public dataset we have, makes teams slower past four tools.

The myth: more agents equals more productivity. The bust, supported by 1,488 workers, 106 experiments and a controlled developer trial: more agents past two equals more brain fry, worse decisions, and a perception gap that hides the damage. The CFO who installs all 10 of Anthropic's new templates will hear from her team that everything is going great. The audit logs will tell a different story by Q4.

Related Resources

The infrastructure trend pulling the other direction: PageIndex ditches the entire vector RAG stack to consolidate retrieval into one reasoning step.
Same reasoning-first approach exposed via MCP: PageIndex MCP — one server replaces a chunking, embedding and vector-DB pipeline.
The skill bundle behind the Anthropic agents discussed in this article: Claude for Financial Services on Skila Repos.
How enterprise teams are governing the resulting agent fleet: Microsoft Agent 365, the cross-cloud control plane for shadow AI.
The forward-deployed services model behind Anthropic's enterprise push: Anthropic, Blackstone and Goldman's $1.5B JV.

Frequently Asked Questions

What is the AI productivity myth?

The AI productivity myth is the assumption that adding more AI tools automatically makes a team more productive. Three independent studies from 2024-2026 (BCG/HBR with n=1,488, a Nature Human Behaviour meta-analysis of 106 experiments, and METR's randomized developer trial) all cut against it. In the BCG data, productivity peaks around 2 AI tools and declines past 4. Heavy users report symptoms BCG named "AI brain fry": mental fog, headaches and slower decisions.

How many AI tools should my team use?

The BCG data points to a hard ceiling of two tools per team for measurable productivity gains. Pick one for open-ended creative work where human-AI combinations actually win, and one narrow decision-making tool with a fast human verification step. Past four tools, productivity declines and 14% of users report cognitive overload symptoms.

What did Anthropic announce on May 5, 2026?

Anthropic launched 10 financial-services agent templates — Pitch Agent, KYC Screener, GL Reconciler, Earnings Reviewer, Model Builder, Market Researcher, Valuation Reviewer, Month-End Closer, Statement Auditor and Meeting Prep Agent — alongside eight new data connectors and a Moody's MCP app. The repo is open-source under Apache 2.0. JPMorgan, Goldman Sachs and Bridgewater are the launch customers.

How does the METR developer study compare to vendor productivity claims?

The METR randomized trial measured experienced open-source developers and found AI tools made them 19% slower while the same developers reported feeling 20% faster — a 39-point perception gap. Vendor productivity claims rely on the same self-reported velocity METR proved is inverted. If you are evaluating an AI rollout, instrument actual cycle time rather than relying on user surveys.

Is the human-AI collaboration meta-analysis worth trusting?

Yes — it is the largest meta-analysis on the topic to date, published in Nature Human Behaviour. The team aggregated 106 controlled experiments covering 370 effect sizes and found a Hedges' g of -0.23 for human-AI combinations on decision-making tasks. Decision-heavy domains like fraud detection, medical diagnosis and demand forecasting consistently lost. Only open-ended creative tasks gained from human-AI collaboration.

Originally published at news.skila.ai

Anthropic Just Hired Wall Street to Kill McKinsey. $1.5B and 'Forward-Deployed' Engineers.

Skila AI — Tue, 05 May 2026 03:00:27 +0000

The headline that should have made every McKinsey partner spit out their coffee on Monday morning was buried in a press release. Anthropic, Blackstone, Hellman & Friedman, Goldman Sachs and General Atlantic just put $1.5 billion into a new firm with one job: replace the consultant. Not augment them. Not give them a copilot. Replace them.

The deal landed on May 4, 2026. Anthropic and Blackstone each committed $300M. Hellman & Friedman matched at $300M. Goldman Sachs and General Atlantic put in $150M each. Apollo, Leonard Green, GIC and Sequoia rounded out the cap table. Hours later, OpenAI announced a parallel joint venture aimed at the same prize. The race to dismantle the $200B consulting industry started before lunch.

The mechanism is what makes this different from every previous AI-for-enterprise pitch. The new entity does not sell a license. It sells forward-deployed engineers — people who fly to your factory, sit at the desk next to your operations VP, and ship Claude-powered systems that automate work the consultants used to recommend. Palantir invented this model selling to the Pentagon. Anthropic just bought the playbook and pointed it at the Fortune 5000.

Why Wall Street wrote the check

Blackstone manages roughly $1.1 trillion. Hellman & Friedman sits on about $115B. Together they own controlling stakes in hundreds of mid-market portfolio companies — the kind of business with $200M to $2B in revenue, a CFO who still uses pivot tables, and a consulting line item bigger than the IT budget. These companies are the dream customer for embedded AI services. They have real money. They have real inefficiency. And their owners have a fiduciary obligation to extract every last point of EBITDA.

Anthropic's annual recurring revenue hit roughly $19B in early 2026. OpenAI is closer to $25B. Both companies have run out of casual API customers and need a way to capture the enterprise dollars that currently fund Accenture's $64B revenue line. The math is brutal: a single Big Three engagement runs $5M to $50M for a six-month rollout. Capture 5% of the $200B consulting market with embedded engineers running Claude, and you have built a $10B revenue stream with software margins on top of it.

The Goldman investment matters for a separate reason. Goldman is not just writing a check — the firm is the prototype customer. Anthropic engineers have already been embedded inside Goldman's banking and asset management businesses for over a year. The bank is the proof point. If the model works inside the most paranoid client base on earth, it works anywhere.

Ranking the 7 industries about to get gutted

Most coverage of the announcement focused on consulting itself. That misses the point. Consulting gets repriced. The industries the new firm walks into get gutted. Here is the disruption tower, ranked by AI-replaceability, headcount exposure, and PE ownership concentration — the three variables that decide who gets restructured first.

1. Healthcare services and revenue cycle management

Top of the tower, and it is not close. PE owns vast swaths of physician practice management, revenue cycle outsourcers, and billing operations. The work is rules-based, regulated, and currently performed by hundreds of thousands of human reviewers. Anthropic's Claude 4 family already handles prior authorization, denial appeals, and coding audits at parity with senior billers. A forward-deployed team can erase 40% of the headcount cost inside a single fiscal quarter. PE owners will sign that contract before they finish reading it.

2. Financial services back office

Goldman is not the customer because Goldman is broken. Goldman is the customer because the rest of the industry is. Mid-tier banks, insurance carriers and asset managers spend roughly 40 cents of every operations dollar on KYC, AML, claims adjudication and regulatory reporting. All of it is text-in, text-out work that Claude does at human accuracy and 1/30th the cost. The combined PE exposure across financial back-office firms is in the tens of billions in annual labor spend. The new JV will have a dedicated vertical here within 90 days. Bet on it.

3. Infrastructure and engineering services

Construction management, engineering review, environmental compliance — the unsexy plumbing of every PE-owned infrastructure roll-up. Each project generates thousands of pages of PDFs that get manually reviewed by senior engineers billing $400 an hour. Forward-deployed teams plug Claude into the document pipelines and turn six-week reviews into six-day reviews. The labor savings are smaller in absolute headcount than healthcare, but the margin uplift per project is higher. This is where the JV books its first lighthouse case study.

4. Manufacturing and industrial operations

This one ranks high on opportunity, lower on speed. Industrial operations have AI-ready data, but the cycle time to deploy is longer because real-world equipment integration takes months. Expect the JV to run pilots at Blackstone-owned industrial portfolio companies through 2026 and start booking material revenue in 2027. The endgame is autonomous procurement, predictive maintenance scheduling and supplier risk management run by agent fleets. Net headcount impact: 15-25% of indirect labor over three years.

5. Retail and consumer brands

PE owns hundreds of consumer brands and specialty retailers. The work that gets automated here is merchandising analytics, customer service tier one, returns processing and supplier negotiation. Lower margin per engagement than banking, but higher count of engagements, which is exactly what a forward-deployed model needs to scale repeatable playbooks. Watch for an off-the-shelf Retail Operations Module from the JV by Q3 2026.

6. Real estate operations

Blackstone is the world's largest real estate owner. Its portfolio runs on tenant servicing, lease administration, asset valuation and property management — all heavily document-driven, all highly automatable. The political dynamic inside Blackstone makes this an obvious early-customer pilot. The JV will use Blackstone's own portfolio as the demo, then sell the resulting playbook to every other real estate sponsor on the planet.

7. Logistics and transportation

Last on the tower because the data fragmentation problem is real. Logistics operations span dozens of carriers, ERP systems and EDI feeds that have not played nicely with each other since 1998. Claude can absolutely handle the complexity, but each deployment is bespoke. Expect the JV to land a few flagship logistics customers, generate impressive case studies, and let the consulting market figure out the rest.

The surprise: consulting itself is not number one. Consulting gets repriced. McKinsey, BCG and Bain still get hired — just at lower rates and for different work. The industries above are the ones that will see actual headcount reduction.

What Anthropic is really buying

The strategic prize for Anthropic is not the services revenue. It is the workflow data that comes with embedded engagements. Every forward-deployed team generates a high-fidelity record of how real enterprise work actually flows — what the inputs look like, where the bottlenecks sit, which approvals get rubber-stamped, which ones do not. That data is the training fuel for the next generation of agents that ship without an engineer at all.

This is why the structure is a joint venture and not an acquisition. Anthropic does not want to build a 5,000-person services firm. It wants to use Blackstone's portfolio access and Goldman's enterprise credibility to harvest the workflow corpus, then productize it. In three years, the JV's headcount stops growing and the agent revenue compounds. That is the bet.

The OpenAI parallel announcement on the same day tells you Sam Altman saw the same thing. OpenAI's services play is reportedly larger in headcount but narrower in PE access. Anthropic got the better cap table. OpenAI got the bigger first-year revenue projection. Both will ship the same quarter, both will close the same logos, and the consulting industry will learn what enterprise software went through in 2010.

What this means if you sit on either side

If you are a junior or mid-level consultant: your billable rate is about to compress 30-50%. The work you have been doing — market scans, slide stacks, "synthesis" deliverables — is exactly what Claude does in a 12-second response. Your survival path is owning a function the AI cannot embed into easily: relationship management, executive coaching, and politically delicate change management.

If you are a partner: your firm just got a 24-month window to either build its own forward-deployed practice or watch the JV eat the mid-market. McKinsey's QuantumBlack and Accenture's Center for Advanced AI are the obvious responses. Neither has the foundation-model relationship Anthropic just locked up.

If you run a PE-backed portfolio company: your sponsor is about to call you. They will offer to fund a JV pilot that the new firm leads. The right answer is yes — but negotiate hard on data ownership. Every workflow the embedded team observes becomes training data for the next agent. Make sure you are not paying for the privilege of training your replacement's replacement.

If you build with AI: this announcement just validated the forward-deployed model as the dominant enterprise distribution strategy. Expect a wave of smaller specialist firms copying it — Anthropic alums starting verticals, ex-Palantir engineers spinning out industry-specific shops. The next year of enterprise AI looks less like SaaS and more like a return to the consulting roll-up era of the 1990s, with software margins layered on top.

The number that matters most

$1.5B is not a meaningful check by Wall Street standards. Blackstone alone deploys multiples of that on a normal quarter. The number that matters is the portfolio access: by our count, the four PE backers collectively own or have meaningful stakes in over 600 mid-market companies generating north of $1 trillion in combined revenue. That is the largest pre-installed customer base any AI startup has ever had on day one.

The new firm does not need to sell. It needs to deploy. The customers are already inside the building.

Related Resources

See how enterprises are governing the resulting agent fleet: Microsoft Agent 365 launched the same week.
Engineering teams are responding with their own AI integrations: Jama Connect MCP Server brings spec-driven development to Claude.
The forward-deployed model also explains the rise of multi-agent harnesses like jcode trending on GitHub.

Frequently Asked Questions

What is the Anthropic Blackstone Goldman Sachs joint venture?

It is a $1.5B enterprise AI services firm announced May 4, 2026, founded by Anthropic, Blackstone, Hellman & Friedman, Goldman Sachs and General Atlantic. The firm sells forward-deployed engineers who embed inside customer companies and ship Claude-powered systems instead of slide-deck recommendations.

How does the Anthropic JV compare to McKinsey or BCG?

McKinsey and BCG sell senior consultants who write strategy decks and recommendations. The Anthropic JV sells engineers who embed on-site and build the working software the consultants would otherwise tell you to buy. The pricing is closer to a software contract than an hourly billable model, and the deliverable is a running production system, not a PDF.

How much money did Anthropic put into the new firm?

Anthropic committed $300M, matched by Blackstone and Hellman & Friedman at $300M each. Goldman Sachs and General Atlantic each put in $150M, with additional capital from Apollo, Leonard Green, GIC and Sequoia. Total announced commitment: $1.5B.

Is consulting actually going to disappear?

No — consulting gets repriced, not erased. The big losers are industries the JV walks into through PE portfolios: healthcare back office, financial services operations and infrastructure engineering. Strategy consulting at the partner level still exists, but mid-tier billable work compresses 30-50% over the next 24 months.

Why did OpenAI announce its own enterprise services JV the same day?

OpenAI saw the same opportunity Anthropic did and could not afford to cede the enterprise services category. Sam Altman's announcement matches the structural move — embedded engineers, enterprise verticals, PE distribution — but with a different cap table. The result is a duopoly race to capture the $200B consulting market rather than a single winner running unopposed.

OpenAI Just Dropped GPT-5.5. The Agentic Coding War Just Ended.

Skila AI — Mon, 27 Apr 2026 06:57:35 +0000

Originally published at news.skila.ai

OpenAI shipped GPT-5.5 today. Terminal-Bench 2.0 score: 82.7%. That is 17 points above GPT-5 and 17 points above Claude Opus 4.6.

The agentic coding war is not heating up. It just ended.

Here is exactly what shipped, what the benchmarks actually mean, and what every team using Cursor, Claude Code, or Codex has to decide this week.

What Actually Shipped

Three things landed in the same launch:

GPT-5.5: the new flagship. Available today in ChatGPT (Plus, Team, Enterprise) and via API.
GPT-5.5 Pro: a new top-tier model with extended reasoning, 200-step agent loops, and full computer-use access. Available in ChatGPT Pro ($200/mo) and as a separate API SKU.
Codex update: OpenAI's coding agent gets native multi-file refactor, persistent project memory across sessions, and direct access to GPT-5.5 Pro on the Codex CLI.

The launch is the most aggressive product update OpenAI has shipped since GPT-4o. It is also the most directly targeted at Anthropic's coding-agent franchise.

The Benchmarks Are the Story

Five numbers that matter:

Terminal-Bench 2.0: 82.7%, up from 65.4% on GPT-5. This is the benchmark that measures real shell-driven multi-step tasks. It is the closest proxy we have to 'is this model a useful agent?' GPT-5.5 just took the lead by a margin that is not noise.
SWE-bench Verified: 81.9%, narrowly ahead of Claude Opus 4.6 (80.8%) and DeepSeek V4-Pro (80.6%). The three frontier models are now within two points of each other on SWE-bench. The benchmark is saturating.
LiveCodeBench: 91.4%, strong but slightly below DeepSeek V4-Pro (93.5%). DeepSeek still wins on pure algorithmic coding.
Computer-use task completion: 73% on OSWorld, a 14-point lift over GPT-5. Computer-use is the second axis OpenAI is pushing this year.
200K context with full attention: needle-in-a-haystack accuracy stays above 92% across the entire 200K window. That is the first frontier model to hold accuracy at that depth without lost-in-the-middle degradation.

Read the Terminal-Bench number twice. A 17-point jump in six months is not an iterative improvement. It is the kind of step change that resets product roadmaps.

The Pro Tier Is the Real Story

GPT-5.5 Pro is where OpenAI made the hardest bet. It is a separate model with extended reasoning, 200-step agent loops, and computer-use access. ChatGPT Pro subscribers get unlimited access. API users pay a $0.50 surcharge per agent step on top of normal token costs.

That pricing structure is new. OpenAI is unbundling the agent loop from the model. You are not paying for tokens. You are paying for steps. A 200-step debugging session at $0.50 per step is $100 of agent fees on top of whatever the tokens cost.

That number sounds high until you compare it to a senior engineer's hourly rate. If GPT-5.5 Pro completes a multi-file refactor in 30 minutes that would take a human 3 hours, the math works for any team that values engineering time above $200/hour. Every YC-backed startup will install it this week.

Codex Becomes the Cursor Killer

The Codex update is the part of the launch that should worry Cursor and Windsurf the most. Three new features:

Native multi-file refactor. Codex now does what Cursor's 'Composer' has been the standout feature for: propose a coordinated edit across an entire codebase, show a unified diff, apply atomically. OpenAI shipped it as a first-party feature, not an extension.
Persistent project memory. Codex remembers your codebase conventions, recent decisions, and unresolved issues across sessions. No more re-explaining your architecture every morning.
GPT-5.5 Pro on the CLI. The Pro model is available via codex --model gpt-5.5-pro. That is the model with 200-step agent loops and computer-use access. It runs your local shell, browser, and IDE.

For Cursor, this is an existential moment. Cursor's pitch has been 'Claude in a beautiful IDE.' Codex now does multi-file refactor better, with persistent memory, with a model that beats Claude on Terminal-Bench by 17 points, at OpenAI pricing. Cursor has to ship a counter (likely tighter integration with Anthropic's computer-use features) within weeks.

The Pricing Math

API pricing for GPT-5.5: $5 per million input tokens, $30 per million output. GPT-5.5 Pro adds a $0.50 per-agent-step surcharge.

Compare that to the field:

Claude Opus 4.6: $15 input, $25 output
GPT-5.5: $5 input, $30 output
DeepSeek V4-Pro: $0.50 input, $3.48 output

OpenAI cut input pricing by 67% versus Claude Opus 4.6 but charges 20% more on output. That is not an accident. Output tokens are where the agent does its work: the loop, the tool calls, the code generation. OpenAI is signaling that their value is on the output side, where the new capabilities show up.

The bigger problem is DeepSeek. V4-Pro shipped three days ago at 14% of Claude's price and benchmarks within noise of GPT-5.5 on SWE-bench. OpenAI has to defend a 9x output-token premium on capability alone. The new Terminal-Bench result is exactly the capability story they need.

What Anthropic Has to Do This Week

Anthropic now sits in the middle of a price-quality squeeze. DeepSeek undercuts on price by 86%. OpenAI undercuts on capability by 17 Terminal-Bench points. Claude Opus 4.6 needs a response.

Three plausible moves, in order of likelihood:

Ship Claude Opus 4.7 ahead of schedule. Anthropic was reportedly targeting a Q3 launch. That timeline is now Q2.
Cut output pricing on Opus 4.6. $25 to $15 would close half the OpenAI gap and most of the DeepSeek gap on the compliance-friendly tier.
Push computer-use harder. Anthropic launched computer-use in October 2024 and OpenAI just caught up. The lead has shrunk to weeks.

Expect at least two of those three by mid-May.

What You Should Change This Week

Three concrete actions:

Run GPT-5.5 against your hardest agentic tests. If you have an internal eval suite for coding agents, run it tonight. The Terminal-Bench number is real, but your workload is what matters. Most teams will see a clear lift on multi-step debugging and shell-driven tasks.
Try the new Codex on a real refactor. Pick a refactor you have been postponing, the kind that touches 20 files and breaks tests in three places. Hand it to Codex with GPT-5.5 Pro. Watch what happens. The whole point of the launch is that this is now plausible.
Re-architect your cost tier. The smart 2026 stack is DeepSeek V4-Flash on the high-volume tier, GPT-5.5 or Claude Opus 4.6 on the critical path, and GPT-5.5 Pro for genuinely hard agentic work where the per-step surcharge is justified by output value. Mono-vendor stacks are over.

Verdict

GPT-5.5 is the most consequential coding-model release of 2026 so far. It rebases the agentic-coding ceiling, hands OpenAI the Terminal-Bench crown, and turns Codex into a credible Cursor competitor. The Pro tier and per-step pricing model is a bet that will reshape how every AI coding tool sells.

The era of one model winning every benchmark is over. GPT-5.5 wins agentic coding. DeepSeek V4-Pro wins price-per-token. Claude wins compliance and long-horizon planning. Pick your stack accordingly, and update it again in 30 days, because the next move from Anthropic is coming fast.

Related Resources

Tool: Google Gemini Enterprise Agents, the enterprise-side competitor that pairs against GPT-5.5 Pro on agent deployment.
Article: DeepSeek V4-Pro launch, the open-weight model now squeezing OpenAI's pricing from below.
Repo: Microsoft Magentic-One, the multi-agent orchestrator you can now drive with GPT-5.5 Pro.
MCP server: Anthropic Claude Code MCP, to embed Claude Code abilities inside any agent that uses GPT-5.5.
Skill: Anthropic Data Analysis Skills, structured data-science skills any frontier model (including GPT-5.5) can consume.

Frequently Asked Questions

What is GPT-5.5?

GPT-5.5 is OpenAI's flagship language model launched April 27, 2026. It scores 82.7% on Terminal-Bench 2.0, ships in ChatGPT and the API, and adds a new GPT-5.5 Pro tier with 200-step agent loops and computer-use access. It is the first model to hold above 92% needle-in-a-haystack accuracy across a full 200K context window.

How does GPT-5.5 compare to Claude Opus 4.6?

GPT-5.5 leads on Terminal-Bench 2.0 (82.7% vs 65.4%) and matches Claude on SWE-bench Verified within one point (81.9% vs 80.8%). Claude still leads on long-horizon planning and compliance-bound enterprise work. Pricing is roughly comparable on output ($30 vs $25 per million tokens) but GPT-5.5 cuts input cost by 67%.

How much does GPT-5.5 cost?

API pricing is $5 per million input tokens and $30 per million output tokens. GPT-5.5 Pro adds a $0.50 surcharge per agent step on top of token costs. ChatGPT Pro subscribers ($200/month) get unlimited GPT-5.5 Pro access. Plus and Team subscribers get GPT-5.5 with usage caps.

Is GPT-5.5 Pro worth the per-step surcharge?

For genuinely hard agentic work (multi-file refactors, long-running debugging sessions, computer-use tasks), yes. A 200-step session at $0.50 per step is $100 in agent fees, which is cheaper than 30 minutes of senior engineer time. For routine code generation and Q&A, the standard GPT-5.5 tier is enough.

What are the best alternatives to GPT-5.5 in 2026?

Claude Opus 4.6 for compliance-bound enterprise work and long-horizon planning. DeepSeek V4-Pro for price-sensitive agentic coding at $3.48 per million output tokens. Gemini 3.1 Pro for long-context multimodal work. The smart 2026 stack uses all three with a routing policy by task type.
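
A routing policy by task type can be as small as a lookup table. The sketch below is one possible shape, not a description of any vendor's API: the task categories are taken from the recommendations in this article, the model names are display labels rather than real API model identifiers, and `TaskType`, `ROUTES`, and `pick_model` are hypothetical names introduced for illustration.

```python
from enum import Enum


class TaskType(Enum):
    """Task categories, paraphrased from the article's recommendations."""
    HIGH_VOLUME = "high-volume, low-stakes work"
    PRICE_SENSITIVE_CODING = "price-sensitive agentic coding"
    COMPLIANCE_OR_PLANNING = "compliance-bound or long-horizon planning work"
    LONG_CONTEXT_MULTIMODAL = "long-context multimodal work"
    HARD_AGENTIC = "hard agentic work that justifies the per-step surcharge"


# Display names only (assumption): real API model identifiers will differ.
ROUTES = {
    TaskType.HIGH_VOLUME: "DeepSeek V4-Flash",
    TaskType.PRICE_SENSITIVE_CODING: "DeepSeek V4-Pro",
    TaskType.COMPLIANCE_OR_PLANNING: "Claude Opus 4.6",
    TaskType.LONG_CONTEXT_MULTIMODAL: "Gemini 3.1 Pro",
    TaskType.HARD_AGENTIC: "GPT-5.5 Pro",
}


def pick_model(task: TaskType) -> str:
    """Return the model this article suggests for the given task type."""
    return ROUTES[task]


for task in TaskType:
    print(f"{task.value} -> {pick_model(task)}")
```

Keeping the policy in a plain dictionary makes the 30-day re-evaluation cheap: changing a tier means editing one line, not the calling code.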

DeepSeek Just Open-Sourced a Claude-Tier Model. The Pricing Math Breaks Everything.

Skila AI — Sat, 25 Apr 2026 02:27:56 +0000

Originally published at news.skila.ai

DeepSeek shipped V4-Pro and V4-Flash on April 24, 2026. Open weights. MIT license. One-million-token context window.

On SWE-bench Verified it scores 80.6%. Claude Opus 4.6 scores 80.8%. The gap is 0.2 points.

Output tokens cost $3.48 per million. Anthropic charges $25 for Opus 4.6 output. OpenAI charges $30 for GPT-5.5 output. That is not a discount. That is a category break.

If you built your AI cost model last month on closed-frontier APIs, it just broke. Here is exactly what DeepSeek shipped, what the benchmarks actually say, and what you should change about your stack this week.

What Actually Shipped

Two models, both released under MIT license on Hugging Face:

DeepSeek V4-Pro. 1.6 trillion total parameters. 49 billion active per token via Mixture-of-Experts. Pre-trained on 33 trillion tokens. Context window: 1,048,576 tokens. API pricing: $0.50 per million input, $3.48 per million output.
DeepSeek V4-Flash. Smaller, faster, cheaper sibling at $0.14 per million input tokens and $0.28 per million output. Built for high-throughput agentic loops where you do not need the Pro-tier reasoning.

Both models ship with open weights. You can download them, run them on your own infrastructure, fine-tune them, and serve them at whatever margin you want.

The Benchmarks Are the Story

SWE-bench Verified: 80.6% (Claude Opus 4.6: 80.8%, GPT-5.5: high-70s)
Terminal-Bench 2.0: 67.9% (Claude Opus 4.6: 65.4%) — DeepSeek wins
LiveCodeBench: 93.5% (Claude Opus 4.6: 88.8%) — DeepSeek wins
Codeforces rating: 3,206 — top fraction of 1% of competitive programmers worldwide

On the three benchmarks that matter most for AI coding agents — agentic tasks, terminal operations, and algorithmic coding — DeepSeek either matches or beats the closed-frontier leader. And it does it at 14% of the price.

The Pricing Collapse

Model	Input $/M	Output $/M	SWE-bench
DeepSeek V4-Pro	$0.50	$3.48	80.6%
Claude Opus 4.6	$15.00	$25.00	80.8%
GPT-5.5	$5.00	$30.00	~78%
DeepSeek V4-Flash	$0.14	$0.28	—

If you run an AI coding agent generating 10 million output tokens per day:

$250/day on Claude Opus 4.6
$300/day on GPT-5.5
$34.80/day on DeepSeek V4-Pro

That is a $215–265 daily delta for workloads that benchmark within noise of each other.
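
The daily figures above are price times volume. A minimal Python sketch of that arithmetic, assuming output tokens are the only cost (input tokens are ignored, as in the example above); the function and variable names are illustrative.

```python
# Output prices in USD per million tokens, as quoted in the pricing table.
OUTPUT_PRICE_PER_M = {
    "Claude Opus 4.6": 25.00,
    "GPT-5.5": 30.00,
    "DeepSeek V4-Pro": 3.48,
}

TOKENS_PER_DAY = 10_000_000  # the example workload: 10M output tokens per day


def daily_output_cost(price_per_million: float, tokens_per_day: int) -> float:
    """Return daily output-token spend in USD, rounded to cents."""
    return round(price_per_million * tokens_per_day / 1_000_000, 2)


costs = {
    model: daily_output_cost(price, TOKENS_PER_DAY)
    for model, price in OUTPUT_PRICE_PER_M.items()
}
baseline = costs["DeepSeek V4-Pro"]

for model, cost in costs.items():
    delta = round(cost - baseline, 2)
    print(f"{model}: ${cost:.2f}/day (delta vs V4-Pro: ${delta:.2f})")
```

Swapping in your own daily token volume and adding an input-token term gives a closer estimate for a real workload.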

The Huawei Chip Story

DeepSeek V4-Pro was trained entirely on Huawei Ascend chips. No Nvidia H100s. No H200s. The full training run — 33 trillion tokens, 1.6 trillion parameters — ran on Chinese-manufactured silicon that US export controls cannot reach.

US policy for three years assumed cutting off Nvidia shipments would cap Chinese frontier AI. That assumption is now empirically false.

The 1M Context Window (Asterisk Required)

Every 1M-context model — Gemini 3.1 Pro, DeepSeek V4-Pro, Claude Opus 4.7 — drops accuracy below 70% on needle-in-a-haystack tasks past 200K tokens. The lost-in-the-middle effect kicks in past 500K, causing the model to forget the middle 40% of the prompt.

Treat the 1M context window as useful for the first 150K–200K tokens. Stuff critical information at the top and bottom of your prompt — never in the middle.
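
One way to follow that advice is to assemble long prompts so the critical instruction appears at both ends and the bulk material sits between. This is a sketch of the layout only; `build_long_prompt` and the sample strings are hypothetical, and whether repeating the instruction helps on a given model is something to measure on your own workload.

```python
def build_long_prompt(critical: str, background: list[str], question: str) -> str:
    """Put the critical instruction first and again near the end,
    with bulk background material in the middle."""
    parts = [
        critical,      # top of the prompt
        *background,   # long, lower-priority material
        critical,      # repeated near the bottom
        question,      # the actual request comes last
    ]
    return "\n\n".join(parts)


# Placeholder inputs for illustration.
critical = "Only modify files under src/billing; never touch migrations."
background = ["<contents of file 1>", "<contents of file 2>"]
question = "Refactor the invoice rounding logic."

prompt = build_long_prompt(critical, background, question)
print(prompt)
```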

What This Means for Your Stack

Add a second tier. Run V4-Flash for high-volume low-stakes work. Keep Claude Opus 4.6 for compliance-bound or multi-turn planning tasks.
Self-hosting is back on the table. Open weights mean you can serve V4-Pro at cost on your own GPU cluster.
Frontier pricing is going to move. Anthropic and OpenAI cannot hold $25–30/M output when a benchmark-equivalent open model charges $3.48. Expect price cuts within 90 days.

The Catch

Data policy: The DeepSeek API routes through Chinese infrastructure. May not clear GDPR, SOC 2, or HIPAA reviews. Self-hosted weights solve this.
Real-world gap: Early community reports show V4-Pro is slightly behind Claude on long-context reasoning and multi-turn planning despite leading on benchmarks.

Verdict

DeepSeek V4-Pro is the most important open-source AI release since Llama 3. It ties the closed-frontier leader on the benchmark that matters most for coding agents. It costs 14% of the price. It runs on chips US export controls cannot stop.

Add V4-Flash to your high-volume tier this week. Evaluate V4-Pro against your critical-path workloads over the next month. Keep Claude Opus 4.6 for compliance-bound work.

Full article with benchmarks, pricing table, and self-hosting notes: news.skila.ai

I Ranked Every AI Coding Model by Value. The $1.50 One Won.

#5 — Grok 4.3: Cheap, But You Get What You Pay For

#4 — GPT-5.5: Great in the Terminal, Brutal on the Invoice

#3 — Gemini 3.1 Pro: The Sensible Middle

#2 — Claude Opus 4.8: The Smartest Model You Can Rent

#1 — Gemini 3.5 Flash: The $1.50 Model That Embarrasses Last Year's Flagships

The Verdict: Build a Two-Model Stack

Frequently Asked Questions

What is the best AI coding model in 2026?

How does Gemini 3.5 Flash compare to Claude Opus 4.8?

Why does Gemini 3.5 Flash beat Gemini 3.1 Pro on coding?

How much do the top AI coding models cost per million tokens?

Should I use one AI coding model or several?

Anthropic Just Hit $965B. You Are Overpaying 7x For AI.

The $965B number, and where it comes from

DeepSeek just made the math impossible to ignore

The per-token math, with no marketing in the way

So why does anyone pay the premium?

What this means for your stack right now

AI Agents Fail 70%. The Replacement Story Is A Lie.

The Receipt: Seven Independent Studies

Why The Panic Was Manufactured

What Actually Works Right Now

The Pope Just Came For AI. Anthropic Was Standing Next To Him.

What Magnifica Humanitas Actually Says

The Date Was Not An Accident

And Then The Co-Founder Of Anthropic Walked On Stage

The Same Week, Anthropic Is Closing $30 Billion

How To Read Olah Standing There

Why The Industrial Revolution Parallel Matters

What Anthropic Is Actually Signaling

The Honest Verdict

Frequently Asked Questions

What is Pope Leo XIV's Magnifica Humanitas encyclical and when was it released?

Why was Anthropic co-founder Chris Olah at the Vatican presentation?

Why did Pope Leo sign the encyclical on May 15, 2026 specifically?

What does the encyclical say AI companies should do?

How does this connect to Anthropic's reported $30 billion funding round?

Is the Catholic Church calling for AI regulation?

I Gave Elon $99 and Watched Grok Build Spawn 8 Agents in My Terminal

Hour 0: The $99 Paywall and What It Actually Locks In

Hour 4: The First Plan Mode Catch

Hour 14: 8 Subagents Collide On The Same File

Hour 28: Reading the SuperGrok Heavy Fine Print

Hour 42: The Kafka Refactor Verdict

Hour 48: The Honest Verdict

The Bigger Picture: Three Coding Agents Shipped This Week

Should You Pay the $99?

Related Reading on Skila AI

Frequently Asked Questions

What is Grok Build and when did xAI launch it?

How much does Grok Build cost — is the $99 deal real?

How does Grok Build compare to Claude Code and Codex?

What is Grok Build's Plan Mode and Arena Mode?

Does Grok Build work on Windows?

Is the $99 introductory pricing locked in or does it auto-renew?

Nobody Told You: Anthropic Just Stopped Selling to Developers. They Walked Into 33 Million Small Businesses.

The Launch in Six Concrete Facts

Why This Is the Contrarian Story

The Pricing Move That Tells You Everything

What This Means for Intuit, HubSpot, and Microsoft

The Chicago Tour Is the Tell

Where the Cracks Will Show

The Real 2026 AI War

Related Reading on Skila AI

Frequently Asked Questions

What is Claude for Small Business and when did it launch?

Which apps does Claude for Small Business connect to?

How much does Claude for Small Business cost?

Why did Anthropic pivot from developers to small business?

Where can small business owners get free Claude training?