<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leonid Khomenko</title>
    <description>The latest articles on DEV Community by Leonid Khomenko (@leeaao).</description>
    <link>https://dev.to/leeaao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2962971%2F449c118b-2b16-4d68-9828-4aa0d5deccf6.jpg</url>
      <title>DEV Community: Leonid Khomenko</title>
      <link>https://dev.to/leeaao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leeaao"/>
    <language>en</language>
    <item>
      <title>Grok 3 — Elon Musk’s AI, 2 Months Later</title>
      <dc:creator>Leonid Khomenko</dc:creator>
      <pubDate>Tue, 15 Apr 2025 11:12:06 +0000</pubDate>
      <link>https://dev.to/leeaao/grok-3-elon-musks-ai-2-months-later-47dc</link>
      <guid>https://dev.to/leeaao/grok-3-elon-musks-ai-2-months-later-47dc</guid>
      <description>&lt;p&gt;At the end of February, Elon rolled out his latest model. Of course, it was "the best in the world."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64k064120v20q5tamkm8.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64k064120v20q5tamkm8.webp" alt="bestElon" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is it really the "Smartest AI on Earth"?&lt;/p&gt;

&lt;p&gt;As usual, Musk brought the hype train. But there wasn't much objective data at launch. xAI's short &lt;a&gt;blog post&lt;/a&gt; mentioned that it was still in beta and the models were actively training.&lt;/p&gt;

&lt;p&gt;They flashed some benchmarks showing Grok 3 ahead. However, they did not provide API access, which matters because independent benchmarks rely on the API for evaluation.&lt;/p&gt;

&lt;p&gt;So, Elon claims Grok 3 is "scarily smart" and beats everything else. But the only ways to check were chatting with it yourself or looking at their benchmarks.&lt;/p&gt;

&lt;p&gt;And those benchmarks? Take a look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvu6ris31x3koxf1c3j1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvu6ris31x3koxf1c3j1.webp" alt="benchmarks" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See that lighter area on the right? That's the boost Grok got by throwing way more test-time compute at each question to get more consistent answers. It's not exactly a fair fight.&lt;/p&gt;

&lt;p&gt;You probably know AI models often give slightly different answers each time—sometimes better, sometimes worse. Most benchmarks ignore this variability, evaluating only the first response (pass@1). It’s simpler and matches how we actually use AI—we expect a good answer on the first try.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zv9pbzx0tspdms5586h.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zv9pbzx0tspdms5586h.webp" alt="cons64" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But Grok's results were all shown using cons@64. Meaning, it got 64 tries for each question and the most common answer was picked. Then xAI compared that boosted score against the pass@1 scores of competitors.&lt;/p&gt;
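&lt;p&gt;The gap between the two scoring schemes is easy to simulate. Here's a minimal Python sketch; the &lt;code&gt;ask_model&lt;/code&gt; function is a made-up stand-in for a real model call, not xAI's setup:&lt;/p&gt;

```python
import random
from collections import Counter

def ask_model(question, p_correct=0.4):
    # Hypothetical stand-in for a real model call: answers correctly
    # with probability p_correct, otherwise returns a wrong answer.
    pool = ["42", "41", "43", "7"]
    weights = [p_correct, 0.2, 0.2, 0.2]
    return random.choices(pool, weights=weights)[0]

def pass_at_1(question):
    # Standard scoring: grade only the first response.
    return ask_model(question)

def cons_at_k(question, k=64):
    # cons@k (consensus): sample k responses, keep the most common one.
    answers = [ask_model(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

random.seed(0)
trials = 200
p1 = sum(pass_at_1("q") == "42" for _ in range(trials)) / trials
c64 = sum(cons_at_k("q") == "42" for _ in range(trials)) / trials
print(f"pass@1 accuracy:  {p1:.0%}")   # hovers around the raw 40%
print(f"cons@64 accuracy: {c64:.0%}")  # much higher than pass@1
```

&lt;p&gt;The single-shot score stays near the model's raw accuracy, while majority voting over 64 samples almost always lands on the most probable answer. That's the gap xAI's charts quietly exploited.&lt;/p&gt;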

&lt;p&gt;So on one hand, they claim it's a next-gen model. On the other, they're using pretty cheap tricks.&lt;/p&gt;

&lt;p&gt;To be fair, in such a competitive field, all labs bend the rules. They cherry-pick benchmarks or exclude stronger models from comparisons—but rarely as blatantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvbvffsz5csndp0xm23s.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvbvffsz5csndp0xm23s.webp" alt="internal" width="737" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, benchmarks aside. What are experienced users saying after actually using it? The general consensus is:&lt;/p&gt;

&lt;p&gt;The model is huge but hasn’t brought breakthroughs. It still hallucinates and tends toward overly long responses.&lt;/p&gt;

&lt;p&gt;Performance-wise, Grok 3 lands somewhere near the top OpenAI models, maybe a bit ahead of DeepSeek's and Google's models at the time of release.&lt;/p&gt;

&lt;p&gt;However, two months later, Gemini 2.5, Claude 3.7, and the new GPT-4o arrived. We also finally got partial API access for Grok 3 and its mini version. Unfortunately, only the mini version got a thinking mode in the API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyl5avhigd6hhzgwz5o2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyl5avhigd6hhzgwz5o2.webp" alt="api" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So today we know it's expensive and definitely not the absolute best.&lt;/p&gt;

&lt;p&gt;But hold on, there's still more to the story. &lt;/p&gt;

&lt;p&gt;The model is interesting and worth looking at. And you have to hand it to them, Elon and xAI jumped into the market quickly, becoming a key player in record time.&lt;/p&gt;

&lt;h2&gt;
  
  
  1 – The Hardware
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5z4w4v65ff9qnjkre4c.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5z4w4v65ff9qnjkre4c.webp" alt="colossus" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big story here? &lt;/p&gt;

&lt;p&gt;In 2024, xAI built a massive compute cluster. We're talking 100,000 Nvidia H100 GPUs up and running in just 4 months. Then they doubled that to 200,000 cards in another 3 months. &lt;/p&gt;

&lt;p&gt;Nvidia's CEO, Jensen Huang, &lt;a&gt;mentioned&lt;/a&gt; this usually takes about 4 years.&lt;/p&gt;

&lt;p&gt;This was a massive engineering feat. And this time, no funny business—it's the largest data center in the world. Nobody else has managed to link up that many GPUs in one spot.&lt;/p&gt;

&lt;p&gt;Typically, such clusters are several regular data centers linked by costly InfiniBand cables. During training, these centers need to swap tons of data constantly. If the connection is slow, those pricey GPUs sit idle, which is bad news.&lt;/p&gt;

&lt;p&gt;A typical data center might have 10,000-20,000 GPUs, sucking down 20-30 megawatts of power. &lt;a&gt;For example&lt;/a&gt;, Microsoft (for OpenAI) operates a 100k-GPU network in Arizona, and Meta runs 128k.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptibvo96fqt1x3iu928d.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptibvo96fqt1x3iu928d.webp" alt="H" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See the two H-shaped buildings? That's two standard Meta data centers next to each other.&lt;/p&gt;

&lt;p&gt;Power needs for top-tier clusters have exploded up to 10x since 2022. We're now talking around 150 MW per cluster. That's like powering a small city. This creates a huge load on regional power grids. In some places, it's actually cheaper to generate the power than to deliver it because there aren't enough power lines. &lt;/p&gt;

&lt;p&gt;So, Elon enters this market way behind. And... does the "Elon thing." Hate his tweets all you want, the man knows how to build factories like nobody else.&lt;/p&gt;

&lt;p&gt;He bought an old Electrolux factory in Memphis and decided to build one giant data center instead of a network like everyone else. &lt;/p&gt;

&lt;p&gt;Predictably, power became an issue.&lt;/p&gt;

&lt;p&gt;The factory only had 7 MW from the local grid—enough for only 4,000 GPUs. The local utility, Tennessee Valley Authority, promised another 50 MW, but not until August. And xAI's own 150 MW substation was still being built, not ready until year-end.&lt;/p&gt;

&lt;p&gt;But waiting isn’t Musk’s style.&lt;/p&gt;

&lt;p&gt;Dylan Patel (from Semianalysis) &lt;a&gt;spotted&lt;/a&gt; via satellite images that Elon just brought in 14 massive mobile diesel generators from VoltaGrid. Hooked them up to 4 mobile substations and powered the data center. Literally trucked in the electricity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz699c1c9xvzanh4qzqhk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz699c1c9xvzanh4qzqhk.webp" alt="volta" width="588" height="1005"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Patel mentioned they might have bought up 30% of the entire US market for these generators (though I couldn't find anything on that).&lt;/p&gt;

&lt;p&gt;Impressively, the data center also uses liquid cooling. Only Google has really done this at scale before. This is a big deal because the next generation of Nvidia chips, the Blackwell B200s, require liquid cooling. Everyone else will have to retrofit their existing data centers.&lt;/p&gt;

&lt;p&gt;You can check out the first few minutes of this video to see what it looks like inside. I got a chuckle out of how hyped the guy is about gray boxes and cables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=Jf8EPSBZU7Y" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=Jf8EPSBZU7Y&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's seriously cool engineering—just look at the cable management. &lt;/p&gt;

&lt;p&gt;No one has done such massive work in so little time.&lt;/p&gt;

&lt;h2&gt;
  
  
  2 – Even More Hardware!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd2mhn67bga4jrymxids.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd2mhn67bga4jrymxids.webp" alt="b100" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Elon says that by summer 2025 they'll have a 300k-GPU cluster with Blackwell B200 chips. Given Musk's habit of exaggeration, let's say it's realistically somewhere between 200k and 400k new chips by the end of 2025. A B200 is roughly 2.2 times better than an H100 for model training (based on Nov 2024 estimates).&lt;/p&gt;

&lt;p&gt;Musk even plans to build a dedicated 2.2 GW power plant. That's more power than a medium-sized city consumes.&lt;/p&gt;

&lt;p&gt;And he's not alone—all the big players are doing something similar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Meta is building two gas plants in Louisiana.&lt;/li&gt;
&lt;li&gt;OpenAI/Microsoft is setting up something similar in Texas.&lt;/li&gt;
&lt;li&gt;Amazon and Google are also building gigawatt-scale data centers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why not nuclear? It's got the power, but building a nuclear plant takes way too long. You can't just pop one up next to your data center in a year. Wind and solar farms plus batteries are promising, but they also take too long to deploy at the needed scale.&lt;/p&gt;

&lt;p&gt;As a result, both Microsoft and Meta have already had to backtrack on their green renewable energy promises. They broke their backs lifting Moloch to Heaven!&lt;/p&gt;

&lt;h2&gt;
  
  
  3 – Grok 3 is Huge
&lt;/h2&gt;

&lt;p&gt;So, Elon built this massive, expensive box. Now what?&lt;/p&gt;

&lt;p&gt;Estimates suggest Grok 2 trained on ~20k H100s, while Grok 3 used over 100k. For context, GPT-4 trained for about 90-100 days on ~25k older A100 chips, with H100 roughly 2.25x faster.&lt;/p&gt;

&lt;p&gt;Doing the math, Grok 2 got about twice the computing power thrown at it compared to GPT-4. And Grok 3 got five times more than Grok 2. Google's Gemini 2.0 probably used a similar amount of hardware (100k of their own TPUv6 chips), but the model itself is likely smaller.&lt;/p&gt;
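&lt;p&gt;For the curious, here's that back-of-envelope math in Python. All figures are the rough public estimates quoted above, and it assumes comparable training durations, so treat it as a sketch rather than hard data:&lt;/p&gt;

```python
# Rough A100-equivalent compute, using the public estimates above.
H100_VS_A100 = 2.25  # one H100 does roughly 2.25x an A100's training work

gpt4  = 25_000                  # ~25k A100s
grok2 = 20_000 * H100_VS_A100   # ~20k H100s, i.e. 45k A100-equivalents
grok3 = 100_000 * H100_VS_A100  # ~100k H100s, i.e. 225k A100-equivalents

print(f"Grok 2 vs GPT-4:  {grok2 / gpt4:.1f}x")   # 1.8x, "about twice"
print(f"Grok 3 vs Grok 2: {grok3 / grok2:.1f}x")  # 5.0x
```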

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsx16wym2szt74chhblp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsx16wym2szt74chhblp.webp" alt="largescale" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, the total &lt;a&gt;compute cost&lt;/a&gt; for Grok 3 is an order of magnitude (10 times!) higher than its closest competitor. Sadly, we don't have public data for GPT-4.5 or Gemini 2.5.&lt;/p&gt;

&lt;p&gt;So they poured insane amounts of resources into building this mega-cluster, and the resulting model is... just on par with the incumbents. Definitely not leagues better.&lt;/p&gt;

&lt;p&gt;It seems xAI's expertise in training still lags behind OpenAI, Google, or Anthropic. They essentially brute-forced their way into the top tier. No magic tricks shown, just: "If brute force isn't solving your problem, you aren't using enough of it."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguclmdsot9dz45lhxfxx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguclmdsot9dz45lhxfxx.webp" alt="algorythms" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there's a catch with that approach. Epoch AI &lt;a&gt;estimates&lt;/a&gt; that over the last decade, algorithmic improvements accounted for about a third of the progress in model capabilities. The other two-thirds came from just throwing more hardware and data at bigger models.&lt;/p&gt;

&lt;p&gt;Brute force worked for Grok 3 this time, but costs will grow exponentially while delivering smaller and smaller gains. And xAI needs to catch up on the algorithm side. The good news is that they're now seen as pushing the frontier, so attracting top talent should get much easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  4 – What's Good About Grok?
&lt;/h2&gt;

&lt;p&gt;1. It's completely free (probably until the full release).&lt;/p&gt;

&lt;p&gt;And without Anthropic's tight limits, DeepSeek's outages, or OpenAI's paid tiers. &lt;/p&gt;

&lt;p&gt;Even with all the new models dropped in the last couple of months, Grok is still holding its own near the top of the &lt;a&gt;Chatbot Arena&lt;/a&gt; leaderboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp05qzhqyz8zonwagkxwz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp05qzhqyz8zonwagkxwz.webp" alt="arena" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now also have an independent benchmarking by &lt;a&gt;EpochAI&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kq74iov0nl58tslu09f.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kq74iov0nl58tslu09f.webp" alt="gpqa" width="800" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And by &lt;a&gt;LiveBench&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbp12ukb8twinag70luu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbp12ukb8twinag70luu.webp" alt="live" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. Reasoning &amp;amp; Deep Research Mode&lt;/p&gt;

&lt;p&gt;Back in February, a free Deep Research feature was mostly a Perplexity exclusive. Now Google and OpenAI offer some version of it in a basic tier—maybe Grok pushed them?&lt;/p&gt;

&lt;p&gt;This mode automatically analyzes 30-100 links (Google might do more) in minutes and spits out a detailed (and bloated) summary that you just need to skim and fact-check. It's way easier than researching anything from scratch. I've found Grok's version works faster than the others, so I've started using it whenever I need to research something, like buying new headphones.&lt;/p&gt;

&lt;p&gt;3. Integration with X&lt;/p&gt;

&lt;p&gt;This could be its killer feature: semantic search not just for keywords, but for what you meant. You can also ask it to summarize posts on a topic to track trends. Or to find recent posts from a specific user.&lt;/p&gt;

&lt;p&gt;Twitter is the closest thing we have to a real-time information platform, so that's great. But so far Grok often lags, pulling data from the last couple of days instead.&lt;/p&gt;

&lt;p&gt;4. The Unfiltered Stuff&lt;/p&gt;

&lt;p&gt;And for the grand finale, the 18+ mode. Grok is notoriously easy to jailbreak without much effort. You can get it to do... well, whatever you might want, from flirty voices to questionable recipes. The voice mode examples are particularly wild.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/goodside/status/1893932239718691167" rel="noopener noreferrer"&gt;https://x.com/goodside/status/1893932239718691167&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Listen to the end, it's hilarious!&lt;/p&gt;

&lt;p&gt;Ironically, Grok itself doesn't seem to hold Musk (or Trump) in high regard. When this came out, xAI tried a fix—literally hardcoding a rule that Grok couldn't criticize Elon. When that blew up, they blamed a former OpenAI employee for “not fitting the company culture.” Super cringe. &lt;/p&gt;

&lt;p&gt;The real issue is that Grok's opinions are just a reflection of its training data (i.e., the internet), not some intentional bias. Trying to patch these views without messing up the whole model is hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  5 - Should You Bother Trying It?
&lt;/h2&gt;

&lt;p&gt;Definitely try it, but as your co-pilot, not your main model.&lt;/p&gt;

&lt;p&gt;TLDR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It cost way more to train than competitors' models.&lt;/li&gt;
&lt;li&gt;Despite that, performance is almost on par with the best.&lt;/li&gt;
&lt;li&gt;But it's super fast and free (for now).&lt;/li&gt;
&lt;li&gt;The Deep Research mode is genuinely useful—try it if you haven't.&lt;/li&gt;
&lt;li&gt;More prone to hallucinations and jumping to conclusions too fast.&lt;/li&gt;
&lt;li&gt;Responses are usually well-structured but often feel bloated.&lt;/li&gt;
&lt;li&gt;Unique access to Twitter data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;xAI proved capable of building world-class infrastructure at unprecedented speed. But in actual AI capabilities, they're basically buying their way to the top with sheer compute power.&lt;/p&gt;

&lt;p&gt;This adds another strong player pressuring OpenAI, Google, and Anthropic, pushing the AI industry toward commoditization. Competition is heating up and the exclusivity of top-tier models is fading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2kpqct6fcwuscesisuu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2kpqct6fcwuscesisuu.webp" alt="neuralink" width="800" height="816"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enjoyed this? Give an upvote or subscribe to &lt;a&gt;my newsletter&lt;/a&gt;.&lt;br&gt;
I'd appreciate it!&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Too many AIs</title>
      <dc:creator>Leonid Khomenko</dc:creator>
      <pubDate>Fri, 21 Mar 2025 09:33:44 +0000</pubDate>
      <link>https://dev.to/leeaao/too-many-ais-24nb</link>
      <guid>https://dev.to/leeaao/too-many-ais-24nb</guid>
      <description>&lt;p&gt;Since early 2025, AI labs have flooded us with so many new models that I'm struggling to keep up.&lt;/p&gt;

&lt;p&gt;But the trends say nobody cares!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu49k4y5yxnuw9pb2yrir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu49k4y5yxnuw9pb2yrir.png" alt="Image description" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is only ChatGPT.&lt;/p&gt;

&lt;p&gt;How so?&lt;/p&gt;

&lt;p&gt;The new models are awesome, but their naming is a complete mess. Plus, you can't even tell models apart by benchmarks anymore. A plain "this one's the best, everyone use it" doesn't work now.&lt;/p&gt;

&lt;p&gt;In short, there are many truly fantastic AI models on the market, but few people actually use them.&lt;/p&gt;

&lt;p&gt;And that's a shame!&lt;/p&gt;

&lt;p&gt;I'll try to make sense of the naming chaos, explain the benchmark crisis, and share tips on how to choose the right model for your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Too Many Models, Terrible Names
&lt;/h2&gt;

&lt;p&gt;Dario Amodei has long joked we might create AGI before we learn to name our models clearly. Google is traditionally leading the confusion game:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frakttuuqgel12r3lehse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frakttuuqgel12r3lehse.png" alt="Image description" width="800" height="292"&gt;&lt;/a&gt;&lt;br&gt;
To be fair, it makes some sense. Each "base" model now has lots of updates. They're not always groundbreaking enough to justify each update as a new version. That's where all these prefixes come from.&lt;/p&gt;

&lt;p&gt;To simplify things, I put together a table of model types from major labs, stripping out all the unnecessary details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s7ij6jzn86xqu8of93o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s7ij6jzn86xqu8of93o.png" alt="Image description" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what are these types of models?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;There are huge, powerful &lt;strong&gt;base&lt;/strong&gt; models. They're impressive but slow and costly at scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;That's why we invented &lt;strong&gt;distillation&lt;/strong&gt;: take a base model, train a more compact model on its answers, and you get roughly the same capabilities, just faster and cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This is especially critical for &lt;strong&gt;reasoning&lt;/strong&gt; models. The best performers now follow multi-step reasoning chains—plan the solution, execute, and verify the outcome. Effective but pricey.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are also specialized models: for search, super-cheap ones for simple tasks, or models for specific fields like medicine and law. Plus a separate group for images, video, and audio. I didn't include all these to avoid confusion. I also deliberately ignored some other models and labs to keep it as simple as possible.&lt;/p&gt;

&lt;p&gt;Sometimes more details just make things worse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87a7cg1fs5gpwfqjvd6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87a7cg1fs5gpwfqjvd6i.png" alt="Image description" width="800" height="850"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  All Models Are Basically Equal Now
&lt;/h2&gt;

&lt;p&gt;It's become tough to pick a clear winner. Andrej Karpathy recently called this an "evaluation crisis."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It's unclear which metrics to look at now. MMLU is outdated, SWE-Bench is too narrow. Chatbot Arena is so popular that labs have learned to "hack" it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043fjjhxek28u5gh8shf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043fjjhxek28u5gh8shf.png" alt="Image description" width="800" height="848"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, there are several ways to evaluate models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Narrow benchmarks measure very specific skills, like Python coding or hallucination rates. But models are getting smarter and mastering more tasks, so you can't measure their level with just one metric anymore.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6hq3zkwjqxrxtqyuyw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6hq3zkwjqxrxtqyuyw7.png" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Comprehensive benchmarks try to capture multiple dimensions with loads of metrics. But comparing all those scores quickly becomes chaotic. People try to factor in several of these complex benchmarks at once, five or ten at a time! One model wins here, another there—good luck making sense of it. LiveBench alone has 3 metrics within each category. And that's just one benchmark among dozens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84r4u8pw133t0jwgfa97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84r4u8pw133t0jwgfa97.png" alt="Image description" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Arenas, where humans blindly compare model answers based on personal preference. Models get an Elo rating, like chess players: win more often, get a higher Elo. This worked great until the models got too close to each other.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l2nqypepzqqsxu8f096.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l2nqypepzqqsxu8f096.png" alt="Image description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A 35-point difference means a model is better just 55% of the time.&lt;/p&gt;

&lt;p&gt;As in chess, the player with the lower Elo still has a good chance to win. Even with a 100-point gap, the "worse" model still comes out on top in about a third of cases.&lt;/p&gt;
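&lt;p&gt;Both numbers fall straight out of the standard Elo expected-score formula, if you want to check:&lt;/p&gt;

```python
def elo_win_prob(rating_diff):
    # Standard Elo expected score: chance the higher-rated
    # model wins, given the rating gap.
    return 1 / (1 + 10 ** (-rating_diff / 400))

print(f"{elo_win_prob(35):.0%}")   # 55%
print(f"{elo_win_prob(100):.0%}")  # 64%, so the "worse" model still wins ~36%
```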

&lt;p&gt;And again—some tasks are better solved by one model, others by another. Pick a model higher on the list and maybe one of your ten requests comes out better. Which one, and how much better? Who knows.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, How Do You Choose?
&lt;/h2&gt;

&lt;p&gt;For lack of better options, Karpathy suggests relying on the vibe check.&lt;/p&gt;

&lt;p&gt;Test the models yourself and see which one feels right. Sure, it's easy to fool yourself. It’s subjective and prone to bias—but it's practical.&lt;/p&gt;

&lt;p&gt;Here's my personal advice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If the task is new—open multiple tabs with different models and compare results. Trust your gut on which model requires less tweaking or edits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the task is more familiar, use only your best model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Forget about chasing benchmark numbers. Focus on the UX you like and prioritize the subscription you're already willing to pay for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you still want numbers, try &lt;a href="https://livebench.ai/#/" rel="noopener noreferrer"&gt;https://livebench.ai/#/&lt;/a&gt;. The creators claim it fixes common benchmarking issues like hacking, obsolescence, narrowness and subjectivity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For product creators, here's a &lt;a href="https://github.com/huggingface/evaluation-guidebook/" rel="noopener noreferrer"&gt;great guide&lt;/a&gt; from HuggingFace on how to set up your own benchmark. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Meanwhile, if you've been waiting for a sign to try something other than ChatGPT, here it is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://claude.ai/" rel="noopener noreferrer"&gt;https://claude.ai/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://gemini.google.com/" rel="noopener noreferrer"&gt;https://gemini.google.com/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://grok.com/" rel="noopener noreferrer"&gt;https://grok.com/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://chat.deepseek.com/" rel="noopener noreferrer"&gt;https://chat.deepseek.com/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" rel="noopener noreferrer"&gt;https://chat.openai.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is my first post here, but I have more writing on my &lt;a href="https://commonstragedy.substack.com/" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;.&lt;br&gt;
I'd appreciate it if you subscribed—but honestly, the dev community seems pretty cool, so I plan to keep writing more stuff here. &lt;br&gt;
Maybe something useful on each of the models I mentioned.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
