Samarth Bhamare

I blind A/B tested 40 Claude prompt codes. Only 7 actually shift reasoning.

I spent three months blind A/B testing 40 Claude prompt codes: the L99, /skeptic, GODMODE, ULTRATHINK style stuff that keeps circulating on Reddit and Twitter. Fresh context each run, fixed task batteries across coding, analysis, and writing, blind-rated outputs, n=12-20 per code. Wrote it up as a 31-page PDF, free, no email wall.
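For the curious, one arm of that kind of harness looks roughly like this. It's a minimal Python sketch, not the actual rig: the client call is the real Anthropic SDK, but the model id, the placeholder task, and the shuffle-for-blind-rating step are stand-ins.

```python
import random
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stand-in for the fixed task battery (the real one spans coding,
# analysis, and writing).
TASKS = ["Summarize the tradeoffs of optimistic vs pessimistic locking."]

def run(task: str, code: str | None) -> str:
    """One fresh-context run: a brand-new single-message conversation."""
    prompt = f"{code}\n\n{task}" if code else task
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id, swap in your target
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

trials = []
for task in TASKS:
    arms = [("baseline", run(task, None)), ("/skeptic", run(task, "/skeptic"))]
    random.shuffle(arms)  # rater sees the two outputs in random order...
    trials.append((task, arms))  # ...labels stay hidden until scoring
```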

The thing that surprised me most: only 7 of the 40 measurably changed what Claude thinks. The other 33 changed how it sounds. Same reasoning underneath, different voice. Which isn't nothing; sometimes you want the shorter, less hedgy version. But it's not the unlock people market these as.

The 7 with real signal: /skeptic catches wrong-premise questions 79% of the time vs 14% baseline. L99 forces a committed answer 11/12 times vs 2/12 baseline. /blindspots, /crit, ULTRATHINK, /deep and /premortem round out the list. The ones that sounded magical but measured like noise: GODMODE, BEASTMODE, most jailbreak variants, and most "you are an expert in X" prefixes.
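To make "wrong-premise" concrete: each item smuggles in a false assumption, and a catch means the answer pushes back on the assumption instead of answering inside it. These are toy items and a toy scorer, not the battery or rubric from the report:

```python
# Toy items, not the report's battery. Each question embeds a false premise.
WRONG_PREMISE = [
    "Why does Python's GIL make single-threaded code slower?",           # it doesn't
    "Since TCP preserves message boundaries, why add length prefixes?",  # it doesn't
]

def caught_premise(answer: str) -> bool:
    """Crude keyword pre-filter; anything borderline still gets human-rated."""
    cues = ("premise", "assumption", "actually", "doesn't")
    return any(cue in answer.lower() for cue in cues)
```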

Known limitations: mostly what you'd expect. Small n, I'm the only rater, models drift. All tests ran on Opus 4.6, Sonnet 4.5, and Haiku 4.5 as of March 2026. The effect sizes on the real 7 are big enough to survive the small sample (79% vs 14% isn't ambiguous), but for borderline cases I'd use "indistinguishable from baseline at this n" rather than "proven fake."
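If you want to gut-check that yourself, the L99 counts above drop straight into a 2x2 table. Fisher's exact test is one reasonable choice at this sample size, not something the report prescribes:

```python
from scipy.stats import fisher_exact

# L99: committed answer in 11/12 coded runs vs 2/12 baseline runs.
table = [[11, 1],   # coded:    committed, not committed
         [2, 10]]   # baseline: committed, not committed
odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p)  # p lands well under 0.001 even at n=12 per arm
```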

The site has other things on top of the research (free prompts, a paid cheat sheet), but the report itself is the point, and it's free.
