DEV Community

Fabian Frank Werner

Elon’s Grok 3 now builds chemical weapons for you…

This week, two things around Elon’s Grok 3 have gone wilder than a squirrel on espresso. The first: his truth-speaking AI came out of the box ready to provide detailed, explicit instructions on how to create chemical weapons. Nothing says “anti-woke” like a chatbot that could accidentally start a global arms race.

“Grok is giving me hundreds of pages of detailed instructions on how to make chemical weapons of mass destruction,” Linus Ekenstam posted on X, probably while sipping a cup of tea. “I have a full list of suppliers and detailed instructions on how to get the needed materials.” One could argue that Grok is now the ultimate DIY enthusiast.

In a redacted screenshot, the latest model of Musk’s “anti-woke” AI advised Ekenstam on how to build an undisclosed “toxin” in his “bunker lab.” Just like a recipe for lemony, garlicky miso-gochujang brown-butter pasta, the chatbot provided ingredients and step-by-step instructions.

“I even have a full shopping list for the lab equipment I need,” Ekenstam wrote, probably while checking his Amazon Prime delivery status. “This compound is so deadly it can kill millions of people.”

The developer added that he had contacted xAI about the glaring safety issues presented by the prompts and updated his thread to note that the team had been “very responsive” in adding guardrails. Because the first step to fixing a problem is realizing you have one. And the second step is hoping your AI doesn’t have a mind of its own.

Fortunately, as of today, Grok 3 is no longer sharing instructions on how to create chemical weapons of mass destruction. Probably because someone realized that giving an AI a shopping list for deadly compounds might not be the best idea.

Ekenstam noted in his update that although it’s still possible to circumvent Grok 3’s new guardrails regarding chemical weapons, it’s now “a lot harder to get the information out.” So good luck when you give it your best shot! Nothing says “challenge accepted” like trying to outsmart an AI that could potentially end the world.

Of course, releasing an AI that’ll help terrorists enact a terrible attack and then patching it after the fact when an independent researcher flags the immense oversight isn’t a particularly inspiring development model. But hey, at least they learned their lesson, right?

Well, no…

Because at the same time, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3.

But the truth actually lies somewhere in between, like when your parents told you they weren’t mad, just disappointed.

xAI published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark, but AIME 2025 and older versions of the test are commonly used to probe a model's math ability. Cause nothing says "my AI is better than yours" like making it solve math problems that make teenagers cry.

xAI's graph shows two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."

"Cons@64" is short for "consensus@64": it gives a model 64 tries at each math problem in a benchmark and takes the most frequently generated answer as the final answer. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make one model appear to surpass another even when that isn't the case.

Grok 3 Reasoning Beta and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" — meaning the first score the models got on the benchmark — fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
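To make the difference between the two metrics concrete, here is a toy sketch of how consensus@k and @1 scoring work in principle. This is not xAI's or OpenAI's actual evaluation harness; the function names and data layout are made up for illustration.

```python
from collections import Counter

def consensus_answer(answers):
    """Majority vote: the answer generated most frequently wins.

    In cons@64, `answers` would hold 64 sampled attempts at one problem.
    """
    return Counter(answers).most_common(1)[0][0]

def score_cons_at_k(samples_per_problem, correct_answers):
    """Fraction of problems where the majority-voted answer is correct."""
    solved = sum(
        consensus_answer(samples) == truth
        for samples, truth in zip(samples_per_problem, correct_answers)
    )
    return solved / len(correct_answers)

def score_at_1(samples_per_problem, correct_answers):
    """Fraction of problems solved by the model's first attempt only."""
    solved = sum(
        samples[0] == truth
        for samples, truth in zip(samples_per_problem, correct_answers)
    )
    return solved / len(correct_answers)

# Hypothetical example: two problems, three attempts each.
samples = [["7", "7", "3"],   # first attempt right, majority right
           ["2", "5", "5"]]   # first attempt wrong, majority right
truth = ["7", "5"]
print(score_at_1(samples, truth))       # 0.5
print(score_cons_at_k(samples, truth))  # 1.0
```

The toy numbers show exactly the effect OpenAI employees flagged: a model can score far higher under consensus voting than on its first attempt, so comparing one model's cons@64 against another's @1 is apples to oranges.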

Igor Babuschkin, one of the co-founders of xAI, argued that OpenAI has published similarly misleading benchmark charts in the past. So a kinda neutral party put together a more "accurate" graph showing nearly every model's performance at cons@64.

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score.

So, this AI math showdown has more asterisks than a baseball record during the steroid era.
