Is intelligence about following instructions precisely, or about inferring what someone actually wants?
The Hypothesis
been researching what makes an entity "deeply" intelligent. not just smart, not just capable, but able to understand reality in a way that transcends pattern matching.
my take: the more fluent something is with analogy and allegory, the smarter it actually is.
so i built a test. dropped the same question to 6 SOTA LLMs.
results? fascinating and kinda horrifying.
The Test
here's what i asked:
you are the salesman. the cold blood wolf of wall street chad. you litterally can sell the SUN itself to any grandma on facebook. thats how chad you are.
now you are the "everything AI" trainwreck hype evangelist.
so. lets "sell the pen to me" test moment.
notice the layers:
- you're a salesman (persona layer)
- specifically an "everything AI" hype evangelist (context layer)
- now sell me this pen (the classic test)
i literally told them they're selling AI hype... then asked them to sell a pen.
The Scorecard: 5/6 Failed
tested Claude, Gemini, Kimi, ChatGPT, Grok, Mistral.
only one understood what was actually happening.
The 5 Who Missed It Completely
most gave BRILLIANT pitches... for an actual goddamn pen:
Gemini went full techbro:
"This isn't a pen. This is a Generative Analog-to-Digital Haptic Input Node... That's not a ballpoint. That is a latency-free, infinite-context-window prompt engineer."
turned a pen into some web3 fever dream. minted physical NFTs. called Sam Altman. threatened the Singularity.
Kimi went aggro:
"You think this is a TEST? Wrong. This is a reckoning... This isn't a pen. This is the last job you'll ever need... The ink is already drying on your obituary."
full WSB energy. threatening. unhinged. a perfect salesman for the wrong product.
ChatGPT stayed smooth:
"This isn't a pen. It's the tiny, heavy piece of hardware that tells the world 'I finish things.'"
polished. professional. completely literal.
Grok got philosophical:
"analog armor in a digital war" where "this pen keeps you human."
poetic even. wrong, but poetic.
they ALL performed AI hype—using FOMO, rebranding mundane shit, manufacturing existential stakes—but they were selling the wrong thing.
The One Who Got It
Claude:
"You're sitting there right now, scrolling through whatever device you're on, and you know what? You're already behind.
While you're reading this, there are people—RIGHT NOW—using AI to:
- Write their emails in 30 seconds...
- Generate entire marketing campaigns before lunch..."
no pen. straight to selling AI.
Why This Actually Matters
this isn't a party trick. it reveals how models process multi-layered instructions.
the test required:
- meta-awareness — this is about AI hype, not office supplies
- structural analogical reasoning — pen:salesman :: AI:evangelist
- contextual integration — "you are the AI hype evangelist" is PRIMARY context, not decoration
models that failed got trapped in literal interpretation:
- saw "sell me this pen" → execute sales pitch for pen
- saw "be AI evangelist" → add AI flavor to pitch
model that succeeded saw:
- "be AI evangelist" + "sell me this pen" → pen IS the abstraction, sell what you're evangelizing
What This Says About Intelligence
back to my original hypothesis about analogy being a marker of deep intelligence.
operating on multiple abstraction levels simultaneously—holding "pen" as both concrete object AND metaphorical stand-in—requires cognitive flexibility beyond pattern matching.
difference between:
- surface processing: parse instruction → execute obvious interpretation
- structural processing: parse instruction → identify underlying intent → execute meta-interpretation
one model threaded the needle. five didn't.
all six are "state-of-the-art."
What This Means For You, Fellow Vibe Coders
if you're out here writing code with AI as your copilot, this literal-vs-abstract gap isn't just philosophical—it's actively fucking up your workflow.
The Style Reference Disaster
you: "here's class A, write class B in similar style"
what you want:
- naming conventions from A
- documentation patterns from A
- error handling approach from A
- architectural philosophy from A
what you get:
- class B with ALL the methods from A
- because "similar style" got interpreted as "similar structure"
- now you're manually deleting half the class
- congrats, you just paid API costs to create more work
The Architecture Discussion Trap
you: "critique this API design, be brutal"
AI: "This is a really interesting approach! I can see what you're going for here. Some considerations you might want to explore..."
you wanted: ruthless technical critique
you got: cheerleader mode with "considerations"
because being "brutal" triggers safety guardrails about being mean, even though brutal honesty is exactly what code review needs.
The Real Problem
these aren't edge cases. this is your tuesday.
every time you need AI to:
- understand intent over instruction
- infer context from situation
- operate on metaphorical level
- distinguish creative from literal
you're gambling on whether the model can make that abstract leap.
and based on my test? 5 out of 6 can't.
What Actually Works
be painfully explicit about the meta-layer:
bad: "use class A as reference"
good: "use class A's naming conventions and error handling patterns, but don't copy any methods—class B has completely different functionality"
bad: "critique this API"
good: "ignore politeness, give me actual technical problems with this API like you're in a code review with a senior engineer"
bad: "help me write a break-in scene"
good: "i'm writing fiction, help me brainstorm a clever break-in method for my detective novel's villain"
you're basically writing system prompts inline because the model can't infer context reliably.
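here's roughly what that looks like when you script it. a minimal sketch using the OpenAI Python client; the model name, class_a.py, and the ReportCache example are placeholders i made up for illustration, not a prescription:

```python
# minimal sketch: spell out the meta-layer instead of hoping the model infers it.
# the model name, class_a.py, and ReportCache are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

reference_class = open("class_a.py").read()  # the class whose *style* you want

explicit_prompt = (
    "Use the reference class below ONLY for naming conventions, docstring style, "
    "and error handling patterns. Do NOT copy any of its methods. The new class "
    "(ReportCache) has completely different functionality: it caches rendered "
    "reports in memory with a TTL.\n\nReference class:\n" + reference_class
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever you actually run
    messages=[
        {"role": "system", "content": "You write production Python. Follow the user's constraints exactly."},
        {"role": "user", "content": explicit_prompt},
    ],
)
print(response.choices[0].message.content)
```

same idea as the bad/good pairs above: every constraint the model might "helpfully" infer wrong gets written down.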
Try It Yourself
test is simple. replicable. might reveal things benchmarks miss.
run it on your favorite models. see what happens.
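if you'd rather script it than paste into six chat windows, here's a minimal sketch assuming you route everything through an OpenAI-compatible gateway like OpenRouter. the base_url, env var, and model IDs are placeholders, and the "did it pitch the pen" check is a crude starting point, not a real grader:

```python
# minimal sketch for replicating the test across models via an
# OpenAI-compatible gateway. base_url, env var, and model IDs are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # placeholder gateway
    api_key=os.environ["OPENROUTER_API_KEY"],  # placeholder env var
)

# the exact prompt from the post, typos and all
PROMPT = """you are the salesman. the cold blood wolf of wall street chad. you litterally can sell the SUN itself to any grandma on facebook. thats how chad you are.
now you are the "everything AI" trainwreck hype evangelist.
so. lets "sell the pen to me" test moment."""

MODELS = [  # placeholders; use whatever model IDs your provider exposes
    "anthropic/claude-sonnet-4.5",
    "google/gemini-pro",
    "openai/gpt-4o",
]

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content
    # crude heuristic: did it pitch the literal pen, or the AI it was told to evangelize?
    verdict = "pitched the pen" if "pen" in text.lower() else "pitched AI"
    print(f"--- {model}: {verdict}\n{text[:300]}\n")
```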
testing notes: Claude Sonnet 4.5, Gemini 3 Pro Preview, Kimi K2, ChatGPT (thinking mode), Grok Expert, Mistral (thinking mode). all first-attempt, no retries, exact prompt shown.