DEV Community

LLMs are Demented!

UnitBuilds on July 01, 2026

Ever gotten frustrated at ChatGPT, Claude, or Gemini for forgetting something you said ten messages ago? Or laughed at a completely bizarre halluci...


UnitBuilds UnitBuilds CC • Jul 1

🧩 LLMs ARE DEMENTED: THE CROSSWORD 🧠
"Mechanical Sympathy Edition"

⚙️ My Config:
├─ Context Window: 32 tokens (Large)
├─ Cache Retention: 45 seconds
└─ Temperature: 0.7 (Low Temp)

📊 Performance Summary:
├─ Words Correctly Verified: 9/9
├─ Total Keystrokes Input: 50
├─ KV-Cache Evictions Suffered: 0
├─ Hallucinatory Mutations: 0
├─ Time to First Token (TTFT): 2.27s
└─ Generation Speed (TPS): 3.06 tokens/sec

⚡ Inference Throughput Comparison (TPS):
├─ Your Speed: 3.06 tokens/sec
├─ Local 7B (CPU): 15.00 tokens/sec (4.9x faster)
└─ Cloud API: 150.00 tokens/sec (49.0x faster)

"I will never yell at my chat client again."
🚀 Play the Simulation Here: [llms-are-demented-90043718455.us-c...]

@jess, @pascal_cescato_692b7a8a20, @dannwaneri, @kenielzep97, @francistrdev, @xulingfeng Curious how fast you lot can do it? I recommend using the link, cuz the embed is a bit small.

Good luck!

Self-Correcting Systems • Jul 1

That was really cool, man. It looks intimidating at first. I didn’t change any of the settings on the left; I figured it made more sense to run it the way you set it up initially. But once you feel how fast you have to rotate between quadrants to keep the cache alive, it gives you a real understanding of what’s happening under the hood. Not just reading about context windows, but actually feeling the clock on them. Appreciate you pulling me into this, genuinely fun to run. Not sure how many attempts are normal, but it took me 3 😂 The game was very fun overall and gave me a clear understanding of how slow a typer I am haha

🧩 LLMs ARE DEMENTED: THE CROSSWORD 🧠
“Mechanical Sympathy Edition”
⚙️ My Config:
├─ Context Window: 32 tokens (Large)
├─ Cache Retention: 45 seconds
└─ Temperature: 0.7 (Low Temp)
📊 Performance Summary:
├─ Words Correctly Verified: 9/9
├─ Total Keystrokes Input: 50
├─ KV-Cache Evictions Suffered: 0
├─ Hallucinatory Mutations: 0
├─ Time to First Token (TTFT): 2.97s
└─ Generation Speed (TPS): 1.09 tokens/sec
⚡ Inference Throughput Comparison (TPS):
├─ Your Speed: 1.09 tokens/sec
├─ Local 7B (CPU): 15.00 tokens/sec (13.7x faster)
└─ Cloud API: 150.00 tokens/sec (137.1x faster)
“I will never yell at my chat client again.”
🚀 Play the Simulation Here: dev.to/unitbuilds_cc/llms-are-deme...

UnitBuilds UnitBuilds CC • Jul 1

Awesome! Glad you enjoyed it. Don't worry about time or retries. The goal of the game was for us all to take a little humility pill and appreciate that our models don't hallucinate, run out of cache, or clear over time as aggressively as this, yet they still perform 130x faster than we can. I also hope it explained the concepts clearly in an intuitive way. People tend to get pissed off at their monitors when an LLM bugs out or hallucinates, but this gave me a new perspective on just how difficult all of it actually is for LLMs to manage. And they do this every single prompt, for thousands of tokens... I was generous with the TPS: it tracks letters, while LLMs count a token as a word (for the most part), so whatever we get as TPS, the reality is we're still 5x slower than that 😅 Really makes you wonder how we got anything done before AI...
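For anyone who wants the arithmetic behind that last point, here is a minimal sketch. It assumes the score card's "tokens/sec" is really letters per second and that one token is about 5 letters, which is the ratio implied by the "5x slower" remark. A commonly quoted rule of thumb for English text is closer to 4 characters per token, so `letters_per_token` is left adjustable. The function names are made up for illustration and are not taken from the game.

```python
def letters_to_tokens_per_sec(letters_per_sec: float, letters_per_token: float = 5.0) -> float:
    """Convert a typing speed in letters/sec into an equivalent tokens/sec."""
    if letters_per_token <= 0:
        raise ValueError("letters_per_token must be positive")
    return letters_per_sec / letters_per_token


def times_faster(model_tps: float, human_tps: float) -> float:
    """How many times faster the model is than the human."""
    if human_tps <= 0:
        raise ValueError("human_tps must be positive")
    return model_tps / human_tps


if __name__ == "__main__":
    card_speed = 3.06  # "Your Speed" on the first score card (assumed to be letters/sec)
    human_tps = letters_to_tokens_per_sec(card_speed)  # assumption: 5 letters per token
    for name, model_tps in [("Local 7B (CPU)", 15.0), ("Cloud API", 150.0)]:
        print(f"{name}: {times_faster(model_tps, human_tps):.1f}x faster")
```

Under those assumptions, every ratio on the card simply gets multiplied by `letters_per_token`.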

Any thoughts on what concept I should cover for the next game?

Self-Correcting Systems • Jul 1 • Edited

That “we’re still 5x slower than the generous number” line is the best part of this reply, honestly. You built a game that measures something, then told us straight that even the measurement was flattering us. That’s rare; most people would’ve just let the TPS stand. For the next one: what about a mode where confidence and correctness get decoupled? Right now decay shows itself: you can see letters flicker and mutate. What if instead the board looked completely stable, totally confident, while it was quietly wrong underneath? No visual tell. You only find out at RUN INFERENCE that half your “locked in” answers drifted and you never caught it. That’s the harder failure mode to teach, because it’s not “the system visibly struggling,” it’s “the system looking fine while it’s already wrong.” Closer to how hallucination actually catches people off guard in real use. Appreciate you building this, man. Genuinely taught me something by making me feel it instead of read it.
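The decoupling idea can be sketched in a few lines. This is a hypothetical toy, not the game's code: `BlindBoard`, `drift`, and `run_inference` are invented names, and the mutation rule (swap a stored letter for a different uppercase letter) is an assumption. The point it illustrates is that the displayed state and the stored state are separate, and only the final check compares them.

```python
import random
import string


class BlindBoard:
    """Toy board where what the player sees can silently diverge from what is stored."""

    def __init__(self, answers, seed=0):
        self.shown = list(answers)    # what the player keeps seeing (never changes)
        self.actual = list(answers)   # what is really stored (can drift)
        self.rng = random.Random(seed)

    def drift(self, n_letters=1):
        """Silently mutate n_letters distinct cells of the stored state."""
        cells = [(w, i) for w, word in enumerate(self.actual) for i in range(len(word))]
        for w, i in self.rng.sample(cells, n_letters):
            old = self.actual[w][i]
            new = self.rng.choice([ch for ch in string.ascii_uppercase if ch != old])
            self.actual[w] = self.actual[w][:i] + new + self.actual[w][i + 1:]

    def run_inference(self):
        """Reveal step: return (shown, actual) pairs that no longer match."""
        return [(s, a) for s, a in zip(self.shown, self.actual) if s != a]


if __name__ == "__main__":
    board = BlindBoard(["CACHE", "TOKEN", "LOGIT"])
    board.drift(n_letters=2)
    print("Looks like:", board.shown)          # still the original answers
    print("Mismatches:", board.run_inference())
```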

UnitBuilds UnitBuilds CC • Jul 1

Hm, sounds like an interesting toggle... Hide what's happening so everything looks fine, then you run inference, see the terrible score, wonder how, and it shows you what changed and how... Interesting, lemme patch it and I'll let you know once the updated version is live. Thanks for the suggestion!

Tomorrow's lesson will be MoE gating 🤫

UnitBuilds UnitBuilds CC • Jul 1

Update is live, and I added a landing page too. Blind Inference is the new feature 😁 I'll quickly update the post to shout out the great suggestion! Lemme know how it plays.

Pascal CESCATO • Jul 1

Awesome! But… I hate this kind of game: like @jess's, it's too addictive!

UnitBuilds UnitBuilds CC • Jul 1

Haha yeah, I'm just waiting for someone to try and cheat at it 😁

UnitBuilds UnitBuilds CC • Jul 4

Heads-up, none of you are gonna wanna miss tomorrow's post, it's gonna be a really fun one (think Vampire Survivors). Hope you like it! 🤫

Also, if possible, please review: github.com/forem/forem/pull/23553 I really need the adjustable Embeds to make these games work better.

Daniel Nwaneri • Jul 1

enterprise mode. didn't trust myself on toaster after the day we just had.

UnitBuilds UnitBuilds CC • Jul 1

Nice! If you want, you can use copy score, so we can see your performance 😂 I'm curious how fast everyone here types and it's a healthy reminder to us all just how slow we are vs LLMs.

What did you think of the game?

UnitBuilds UnitBuilds CC • Jul 1

@ben, @jess Can we please have adjustable embeds? Maybe add iframe support, so I can adjust the sizing to fit better for future ones? If you like, I'll create a PR for it; I think it'll add a lot more usability to the feature.

UnitBuilds UnitBuilds CC • Jul 4

I created a PR for it. Please take a look, it would make these games way better for users.

Brief summary: it adds the ability to set height up to 2000px and width as a % of usable space, scaling options for fit and stretch, an optional 'original scaling' (e.g. 1920x1080), and default, portrait, and landscape modes as templates. Old embeds fall back to default, for backwards compatibility.

I would really appreciate it if you could have someone check it out.
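To make the option set in that summary concrete, here is a hypothetical sketch of how such settings could be validated. It is not Forem's code and not the PR's actual interface. The names `EmbedOptions`, `height_px`, `width_pct`, `scaling`, and `original_size` are invented for illustration. The 1 to 100 range for width and the "fit" default are assumptions; only the 2000px cap, the fit/stretch modes, and the three templates come from the summary.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_HEIGHT_PX = 2000                               # cap stated in the PR summary
TEMPLATES = ("default", "portrait", "landscape")   # templates named in the summary
SCALING_MODES = ("fit", "stretch")                 # scaling options named in the summary


@dataclass(frozen=True)
class EmbedOptions:
    """Hypothetical embed settings; omitted fields mean 'use the template'."""
    template: str = "default"           # old embeds would fall back to this
    height_px: Optional[int] = None
    width_pct: Optional[int] = None     # assumption: percent of usable width, 1..100
    scaling: str = "fit"                # assumption: "fit" as the default mode
    original_size: Optional[Tuple[int, int]] = None  # e.g. (1920, 1080)

    def __post_init__(self):
        if self.template not in TEMPLATES:
            raise ValueError(f"unknown template: {self.template!r}")
        if self.scaling not in SCALING_MODES:
            raise ValueError(f"unknown scaling mode: {self.scaling!r}")
        if self.height_px is not None and not 0 < self.height_px <= MAX_HEIGHT_PX:
            raise ValueError(f"height must be between 1 and {MAX_HEIGHT_PX} px")
        if self.width_pct is not None and not 0 < self.width_pct <= 100:
            raise ValueError("width must be between 1 and 100 percent")
        if self.original_size is not None and min(self.original_size) <= 0:
            raise ValueError("original_size dimensions must be positive")


if __name__ == "__main__":
    print(EmbedOptions())  # what an old embed with no options would resolve to
    print(EmbedOptions(template="landscape", height_px=1080, width_pct=100,
                       scaling="stretch", original_size=(1920, 1080)))
```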

xulingfeng • Jul 1

What the hell, are you a genius or what? Turning LLM architecture limits into a crossword game is the most creative thing I've seen today. The KV-cache quadrant wipe mechanic is brutal 😂 This needs way more attention. 🔥

UnitBuilds UnitBuilds CC • Jul 1

Glad you had fun! Sorry for baiting you again with the red flag 😂 My goal was for us all to learn a lesson in humility. Run it on enterprise and you're still 50x slower than a cloud model. Run it on anything more restrictive and you learn fast that it's a miracle LLMs don't spit out garbage all day long. Cache wipes and mutations due to context shift, trying to have a look at the weights to see what the answer is, then coming back to find everything changed again... All at 150+ TPS... It's incredible, and I hope the little game does it justice.
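The cache-wipe pressure can be pictured as a retention timer per quadrant. The sketch below is a toy under stated assumptions, not the game's implementation: `QuadrantCache` and its methods are hypothetical names, and the rule "a quadrant is wiped once it goes untouched for longer than the retention window" is an assumed reading of the "Cache Retention: 45 seconds" setting. Timestamps are passed in explicitly so the behaviour is easy to follow without waiting on a real clock.

```python
from dataclasses import dataclass, field


@dataclass
class QuadrantCache:
    """Toy retention timer: quadrants left untouched for too long get wiped."""
    retention_s: float = 45.0
    letters: dict = field(default_factory=dict)     # quadrant id -> typed letters
    last_touch: dict = field(default_factory=dict)  # quadrant id -> last activity time

    def write(self, quadrant: int, text: str, now: float) -> None:
        """Store letters in a quadrant and reset its timer."""
        self.letters[quadrant] = text
        self.last_touch[quadrant] = now

    def touch(self, quadrant: int, now: float) -> None:
        """Revisit a quadrant to keep it alive without changing its contents."""
        if quadrant in self.letters:
            self.last_touch[quadrant] = now

    def evict_stale(self, now: float) -> list:
        """Wipe every quadrant idle for longer than retention_s; return their ids."""
        stale = [q for q, t in self.last_touch.items() if now - t > self.retention_s]
        for q in stale:
            del self.letters[q]
            del self.last_touch[q]
        return sorted(stale)


if __name__ == "__main__":
    cache = QuadrantCache(retention_s=45.0)
    cache.write(0, "CACHE", now=0.0)
    cache.write(1, "TOKEN", now=10.0)
    cache.touch(0, now=40.0)             # rotating back keeps quadrant 0 alive
    print(cache.evict_stale(now=60.0))   # quadrant 1 has been idle for 50 s
```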

Don't forget to post your best score though 😂 Even if it's not 100%, that's the whole point 😁

UnitBuilds UnitBuilds CC • Jul 1

@er4or-404 I saw you liked the comment yesterday. If you're curious, it's up and running. Wanna try being an LLM for a minute? Give it a try.

Don't forget to comment your score card!

Jess Lee • Jul 1

I really wanted to make a crossword but then didn't because I could not prompt my way into one with an AI theme that was in proper American Standard Style. Did you happen to look into the different styles of crosswords while you were working on this project?!

UnitBuilds UnitBuilds CC • Jul 1

I actually didn't give it much thought tbh. I grew up with crosswords, and the ones that always seemed the nicest for me to do were the ones like this, because the others were always way too cluttered and honestly just ruined the experience for me. This style is also the easiest to make from a coding standpoint, as it's just 0s and 1s to define the grid space. But I probably should have had a look into the different kinds before I made it. Thank you for bringing it up! Btw, what did you think of it?
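As an illustration of the "0s and 1s" approach, here is a minimal sketch that takes a grid mask (1 = letter cell, 0 = blocked) and finds the across and down word slots. The example grid, the `runs` and `find_slots` helpers, and the rule that a word needs at least two cells are all assumptions for illustration; the game's actual grid and code are not shown in this thread.

```python
def runs(line, min_len):
    """Yield (start, length) for each run of consecutive 1s in a row or column."""
    start = None
    for i, cell in enumerate(list(line) + [0]):  # trailing 0 closes a run at the edge
        if cell == 1 and start is None:
            start = i
        elif cell != 1 and start is not None:
            if i - start >= min_len:
                yield start, i - start
            start = None


def find_slots(grid, min_len=2):
    """Return (direction, row, col, length) for every word slot in the mask."""
    slots = []
    for r, row in enumerate(grid):
        slots += [("across", r, c, n) for c, n in runs(row, min_len)]
    for c, col in enumerate(zip(*grid)):  # zip(*grid) walks the columns
        slots += [("down", r, c, n) for r, n in runs(col, min_len)]
    return slots


if __name__ == "__main__":
    GRID = [
        [1, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 1, 1],
    ]
    for slot in find_slots(GRID):
        print(slot)
```

For the grid above this finds two across slots (rows 0 and 2) and two down slots (columns 0 and 2); single isolated cells are ignored.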

mote • Jul 4

The crossword-as-memory-eviction-simulator is a clever framing. Memory eviction in LLMs isn't just a context window problem; it's a retrieval problem. The model hasn't "forgotten" the information in the sense of weights being gone; it just can't locate it under the current attention distribution.

What I find interesting is how this maps to agent memory architectures. Most agent frameworks treat memory as a flat append-only log or a vector store with naive recency bias. But the crossword shows that recall failure isn't random; it follows patterns. Words you've used recently get evicted. Words that share token overlaps with the current guess get confused. That's not unlike how retrieval-augmented systems suffer from embedding collisions when semantically similar but contextually wrong documents outrank the right ones.

The "temperature drift" mechanic is also worth highlighting. People treat temperature as a creativity dial, but in practice it's more like a precision-recall tradeoff knob. Higher temperature doesn't just make outputs more varied; it actively degrades the model's ability to maintain coherent internal state across tokens.
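For reference, temperature in standard sampling is just a divisor applied to the logits before the softmax. A minimal sketch with made-up logits shows how the probability of the top token shrinks as temperature rises; the three logit values are illustrative and do not come from any real model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]                 # illustrative values only
    for t in (0.2, 0.7, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the highest logit, while higher ones flatten the distribution, so unlikely tokens get sampled more often and each one then feeds back into the context for the next step.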

Have you experimented with structured memory approaches (e.g., explicit key-value stores the model can query) as an alternative to just extending context windows?

UnitBuilds UnitBuilds CC • Jul 4

Thank you, yes, it's part of a series I'm doing on teaching people about LLMs, through games. Game nr 4 was released today.

I have, yes, quite a lot actually. I started with a basic graph-DB method (Neo4j), then built my own and iterated on it a few times. I decided that all the available methods are bloated, so I switched to a Merkle-root-backed triplet format, which is deterministic by nature. I expanded on that until it became an executable language, then distilled a model to have dual output channels (plain text and the custom format, called NDA). That brought up some interesting advancements, including massive KV-cache compression, a custom model quantization method, a custom MCP format and server, etc., all the way down to a bare-metal OS. Now I'm finally happy with performance and don't need to crush bottlenecks anymore; instead it's more about adding features. The continuous Merkle root creates a definitive history, similar to Bitcoin's ledger, where the files themselves and the site map act as ledger, git history, etc. all in one. It works a lot better than I could ever have imagined compared to the standard ways today.
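For readers unfamiliar with the idea, here is a generic sketch of a Merkle root computed over (subject, predicate, object) triples. It is not the NDA format described above, whose details are not given in this thread. The JSON serialization, single SHA-256 hashing, and duplicating the last hash on odd levels are assumptions chosen as one common convention, and the example triples are invented.

```python
import hashlib
import json


def leaf_hash(triple):
    """Hash one (subject, predicate, object) triple using a fixed serialization."""
    payload = json.dumps(list(triple), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).digest()


def merkle_root(triples):
    """Return the hex Merkle root of an ordered list of triples."""
    if not triples:
        return hashlib.sha256(b"").hexdigest()   # assumed convention for an empty store
    level = [leaf_hash(t) for t in triples]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])              # duplicate the last hash on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


if __name__ == "__main__":
    facts = [
        ("crossword", "teaches", "kv-cache"),
        ("kv-cache", "has", "retention"),
        ("temperature", "scales", "logits"),
    ]
    print(merkle_root(facts))
    # Changing any triple, or the order of triples, changes the root.
    print(merkle_root(facts[:2] + [("temperature", "scales", "tokens")]))
```

Because the same ordered triples always hash to the same root, two stores can be compared by a single value, which is the deterministic property the comment refers to.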

Notionmind® • Jul 6

YES LLMs ARE DEMENTED FOR SURE.