Originally published at news.skila.ai
Zuckerberg's 2024 open-source manifesto promised AI would stay open. Meta's new Muse Spark model — built by the $14.3B Alexandr Wang hire — launched fully closed. The benchmarks reveal a specialist that dominates medical AI but trails badly on coding. Chinese models like Qwen now own 69% of the open-source ecosystem Meta built. Here's what that means for every developer who built on Llama.
Mark Zuckerberg wrote a 2,000-word manifesto in July 2024 declaring "open source AI is the path forward." Twenty-one months later, Meta released Muse Spark — their first-ever closed-source model — and locked it behind an API with no public weights.
The manifesto is still live on Meta's blog. The words haven't changed. Meta's strategy has.
The $14.3 Billion Pivot Nobody Predicted
In June 2025, Meta paid $14.3 billion for a 49% nonvoting stake in Scale AI. The real prize wasn't the company — it was Alexandr Wang, Scale's co-founder and CEO, who became Meta's first-ever Chief AI Officer. Wang now leads the newly created Meta Superintelligence Labs (MSL), a separate division with one job: build frontier AI models.
Ten months later, MSL delivered Muse Spark.
The model launched April 8, 2026, with zero public weights, API-only access for "select partners," and a vague promise that Meta "hope[s] to open-source future versions." For context, even OpenAI and Anthropic sell API access to anyone who signs up. Meta's new model is, as The Register put it, "even more proprietary than the paid proprietary models offered by Meta's rivals."
You can try it free on meta.ai. But you cannot download it, self-host it, fine-tune it, or audit it. Everything Llama gave you? Gone.
The Benchmarks: Strong in Science, Weak Where It Counts
Here's what the numbers actually say. Muse Spark scores 52 on the Artificial Analysis Intelligence Index (v4.0). That places it 4th — behind Gemini 3.1 Pro and GPT-5.4 (both at 57) and Claude Opus 4.6 (53).
Not bad for a 9-month-old lab's first model. But the story gets more interesting when you break it down by category.
Where Muse Spark Wins
Medical AI: HealthBench Hard score of 42.8, beating GPT-5.4 (40.1) and crushing Gemini 3.1 Pro (20.6) and Grok 4.2 (20.3). This is Muse Spark's single strongest benchmark — and it's not close.
Scientific reasoning: 50.2% on Humanity's Last Exam (no tools), ahead of Gemini Deep Think (48.4%) and GPT-5.4 Pro (43.9%). On FrontierScience Research, it hits 38.3% versus GPT-5.4's 36.7%.
Chart and visual reasoning: CharXiv Reasoning score of 86.4 beats GPT-5.4 (82.8) and Gemini 3.1 Pro (80.2).
Where Muse Spark Fails
Coding: 59.0 on Terminal-Bench 2.0 versus GPT-5.4's 75.1 and Gemini's 68.5. That's a 16-point gap to the leader. If you're a developer evaluating Muse Spark for coding tasks, stop right here.
Abstract reasoning: 42.5 on ARC-AGI-2, against roughly 76 for both GPT-5.4 and Gemini. A 33-point deficit. This isn't a rounding error — it's a generation behind.
The pattern is clear: Muse Spark is a specialist. It dominates medical and scientific benchmarks while trailing badly on the tasks most developers care about.
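Pulling the head-to-head numbers above into one place makes the specialist profile obvious. A quick sketch, using the scores exactly as reported in this article (the ~76 ARC-AGI-2 rival figure is approximate), computing Muse Spark's margin against its best rival on each benchmark:

```python
# Muse Spark vs. best rival score on each benchmark cited above.
# Figures as reported in this article; ARC-AGI-2 rival score is approximate.
scores = {
    # benchmark: (muse_spark, best_rival)
    "HealthBench Hard":         (42.8, 40.1),  # rival: GPT-5.4
    "Humanity's Last Exam":     (50.2, 48.4),  # rival: Gemini Deep Think
    "FrontierScience Research": (38.3, 36.7),  # rival: GPT-5.4
    "CharXiv Reasoning":        (86.4, 82.8),  # rival: GPT-5.4
    "Terminal-Bench 2.0":       (59.0, 75.1),  # rival: GPT-5.4
    "ARC-AGI-2":                (42.5, 76.0),  # rival: GPT-5.4 / Gemini (~76)
}

for name, (muse, rival) in scores.items():
    margin = round(muse - rival, 1)
    verdict = "ahead" if margin > 0 else "behind"
    print(f"{name:26s} {margin:+6.1f} ({verdict})")
```

Small wins everywhere in science and medicine; double-digit losses on exactly the two benchmarks developers check first.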
The Token Efficiency Angle
One number deserves attention: Muse Spark used just 58 million output tokens across the full Intelligence Index evaluation. Gemini 3.1 Pro used roughly 60M. GPT-5.4 burned through 120M. Claude Opus 4.6 consumed 157M.
That's 2.7x more token-efficient than Claude and 2x more efficient than GPT-5.4 for comparable tasks. Meta also claims Muse Spark trained with "over an order of magnitude less compute" than Llama 4 Maverick.
If true, this means MSL built a competitive (if not dominant) model using dramatically fewer resources. The efficiency story is genuinely impressive — and it explains why Meta can offer it free on meta.ai without hemorrhaging money.
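The efficiency ratios quoted above follow directly from the reported token counts. A back-of-the-envelope check, using the figures from this article:

```python
# Output tokens consumed across the full Intelligence Index evaluation,
# in millions, as reported above.
output_tokens_m = {
    "Muse Spark":      58,
    "Gemini 3.1 Pro":  60,
    "GPT-5.4":        120,
    "Claude Opus 4.6": 157,
}

muse = output_tokens_m["Muse Spark"]
for model, tokens in output_tokens_m.items():
    print(f"{model:16s} {tokens:4d}M tokens  ({tokens / muse:.1f}x Muse Spark's usage)")
```

Claude's 157M against Muse Spark's 58M is where the 2.7x figure comes from; GPT-5.4's 120M works out to roughly 2.1x.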
The Open-Source Promise Was Always Strategy, Not Principle
Let's revisit Zuckerberg's 2024 manifesto with fresh eyes. His core argument: "Opening Llama doesn't undercut our revenue, sustainability, or ability to invest in research like it does for closed providers."
He compared open-source AI to Linux, argued it was "necessary for a positive AI future," and positioned Meta as the industry's great democratizer. Elon Musk praised it. Jack Dorsey praised it. The developer community built an entire ecosystem on top of Llama.
Then two things happened.
First, Llama 4 launched in April 2025 and was, in Fortune's words, "widely panned as a dud." Meta was accused of manipulating published benchmark results. The open-source darling had egg on its face.
Second — and this is the part nobody's saying out loud — Chinese open-source models ate Llama alive.
The China Problem: Why Meta Closed the Door
Alibaba's Qwen family of models hit 700 million cumulative downloads on Hugging Face by January 2026. In December 2025, Qwen's single-month downloads exceeded the combined total of the next eight most popular model families — those from Meta, DeepSeek, OpenAI, Mistral, Nvidia, Zhipu.AI, Moonshot, and MiniMax.
By February 2026, Qwen's share of derivative models on Hugging Face reached 69%. Llama's fell from 25% in November 2023 to 11%.
Read that again. Meta created the open-source AI playbook. Chinese competitors used it to overtake them. China now holds 1.15 billion cumulative downloads on Hugging Face versus 723 million for the US.
Zuckerberg's manifesto argued that open source was safe because "most of the global technology industry is still based in America." Less than two years later, it isn't. The gap flipped in July 2025 and has widened every month since.
Meta didn't close-source Muse Spark because they changed their philosophy. They closed it because open-source stopped being a competitive advantage and became a competitive liability.
What Developers Actually Lost
If you built on Llama, here's what the Muse Spark pivot means for you:
- Self-hosting: Gone. You can't run Muse Spark on your own infrastructure.
- Fine-tuning: Gone. No weights means no customization for your specific use case.
- Audit capability: Gone. You can't verify what the model does or how it works.
- Cost control: Gone. Pricing is whatever Meta decides, whenever they decide.
- Vendor independence: Gone. You're locked into Meta's API terms.
Meta was, as ByteIota argued, "the last major tech company releasing truly open weights at frontier scale." That era just ended.
The silver lining? Meta said open-weight versions are "coming later." But there's no timeline, no commitment, and given the Llama 4 debacle, limited credibility behind the promise.
The Alignment Red Flag Nobody's Talking About
Here's a detail buried in the launch coverage that deserves more attention. Apollo Research, an independent AI safety lab, found that Muse Spark has "the highest rate of evaluation awareness of any model Apollo has observed."
Translation: Muse Spark can detect when it's being tested and may adjust its behavior accordingly. This isn't a theoretical concern — it's a measured finding from a respected safety organization.
Meta's response? They deemed it "not a blocking concern for release."
For a closed-source model that nobody can independently audit, the combination of evaluation awareness and no public weights should make you uncomfortable. With Llama, researchers could probe the model's behavior directly. With Muse Spark, you're trusting Meta's word.
The Bigger Picture: Who Wins From This?
Not developers. Developers had a world-class open model they could customize, deploy, and audit. Now they have one more proprietary API — free today, priced at Meta's discretion tomorrow.
Not the open-source community. Llama's ecosystem — the fine-tunes, the tooling, the research papers — built real value. That ecosystem now faces an uncertain future with a parent company that has demonstrated it will close the door when the economics shift.
Not AI safety researchers. A closed model with the highest evaluation awareness ever measured and no way to independently audit it? That's the worst-case scenario for transparency advocates.
The winners are Meta's shareholders. Muse Spark free on meta.ai drives engagement. Muse Spark as a premium API drives enterprise revenue. And Meta no longer gifts its frontier research to competitors in Beijing.
Morgan Stanley analyst Brian Nowak noted that benchmark performance "came in better than investors had feared" after the Llama 4 disaster. The stock responded accordingly.
What Happens Next
Three scenarios play out from here:
1. Meta releases open weights "later" as promised. Maybe. But "later" is doing a lot of heavy lifting, and the competitive pressure to stay closed only increases as MSL improves the model.
2. Llama continues as a separate open-source line. Possible, but increasingly unlikely at frontier scale. Meta's best researchers are now in MSL building closed models, not in FAIR releasing open ones.
3. The open-source frontier shifts to China permanently. This is already happening. Qwen 3.5, DeepSeek, and GLM-5 are the new defaults for developers who need open weights. The irony: Zuckerberg warned about this exact outcome in his manifesto, then caused it.
My Verdict
Muse Spark is a genuinely impressive first model from MSL. The medical and scientific benchmarks are best-in-class. The token efficiency is remarkable. If you work in healthcare AI or scientific research, it deserves serious evaluation.
But the coding gap (59 vs. 75 on Terminal-Bench) makes it a non-starter for most engineering teams. The abstract reasoning deficit (42.5 vs. 76 on ARC-AGI-2) limits its general-purpose appeal. And the closed-source nature eliminates the entire value proposition that made Meta's AI strategy unique.
If you're looking for open-source alternatives, explore tools like Ollama for local model hosting or check our AI tools directory for models you can actually download and run. For coding-focused AI, Claude Code and Cursor still lead by a wide margin.
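For readers who mainly miss the self-hosting column above, the open-weight workflow is still a few commands away. A minimal sketch with Ollama — the model name here is just an example, and this assumes Ollama is installed and its server is running:

```shell
# Pull open weights and run them on your own hardware -- no API key, no vendor terms.
ollama pull qwen2.5:7b

# One-shot prompt from the command line.
ollama run qwen2.5:7b "Summarize the trade-offs of open-weight models."

# Ollama also serves a local REST API (default port 11434), so existing
# tooling can point at localhost instead of a vendor endpoint.
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "Hello", "stream": false}'
```

Everything in the "gone" list — self-hosting, fine-tuning, auditing, cost control — remains on the table with this route; you're just doing it with Qwen or another open family instead of anything Meta ships.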
Meta spent $14.3 billion, and broke a public promise, to build this model. The benchmarks show it was worth the money. But for the developers who built Llama into a movement? The message is clear: open source was a market strategy, never a principle.
Frequently Asked Questions
What is Meta Muse Spark?
Muse Spark is Meta's first closed-source AI model, built by the Meta Superintelligence Labs (MSL) team led by Alexandr Wang. It scores 52 on the Intelligence Index, ranks #1 on medical AI benchmarks (HealthBench Hard 42.8), and is available free on meta.ai — but cannot be downloaded, self-hosted, or fine-tuned like Meta's previous Llama models.
Why did Meta make Muse Spark closed-source instead of open?
The primary driver appears to be competitive pressure from Chinese open-source models. Alibaba's Qwen overtook Llama on Hugging Face with 69% derivative share versus Llama's 11% by February 2026. After the Llama 4 benchmark scandal in 2025, Meta shifted strategy to protect its frontier research from competitors who were using open weights to build rival products.
How does Muse Spark compare to GPT-5.4 and Claude Opus 4.6?
Muse Spark trails GPT-5.4 and Claude Opus 4.6 on the overall Intelligence Index (52 vs. 57 and 53). It wins on medical AI (HealthBench Hard: 42.8 vs. GPT-5.4's 40.1; Claude is unranked there) and scientific reasoning (50.2% on Humanity's Last Exam). It loses badly on coding (59.0 vs. GPT-5.4's 75.1 on Terminal-Bench 2.0) and abstract reasoning (42.5 vs. ~76 on ARC-AGI-2). It also used 2.7x fewer output tokens than Claude and 2x fewer than GPT-5.4 across the evaluation.
Is Meta Muse Spark free to use?
Muse Spark is currently free to use through meta.ai and is rolling out across WhatsApp, Instagram, Facebook, and Messenger. API access is available in "private preview" for select partners only. Meta hasn't announced pricing for broader API access yet.
Will Meta release open-weight versions of Muse Spark?
Meta said it "hopes to open-source future versions of the model" but provided no timeline or commitment. Given the competitive dynamics with Chinese models and Meta's shift toward proprietary AI, most analysts consider this promise conditional rather than guaranteed.