UnitBuilds for UnitBuilds CC

Posted on Jul 1 • Edited on Jul 3

LLMs are Demented!

#ai #games #machinelearning #discuss

Interactive crossword on hardware limits

Ever gotten frustrated at ChatGPT, Claude, or Gemini for forgetting something you said ten messages ago? Or laughed at a completely bizarre hallucination where it replaced a normal word with a random emoji?

It’s easy to yell at the chat client. It's much harder to maintain Mechanical Sympathy for the massive, spinning plates of hardware constraints running under the hood.

So, we built an interactive game to teach you how LLMs actually work (and fail):

🧩 LLMs Are Demented: The Crossword

Play in Fullscreen Mode (if the embed window sizing is annoying)

⚙️ How the Game Works

This is a standard, technical 9-word crossword puzzle. To win, you must recall core machine learning terms (like WEIGHTS, TOKEN, ATTENTION, and EPOCH) from their definitions and type them in.

But as you play, you are running inside a simulation of the architectural constraints of a Large Language Model:

1. 💾 The Context Window ( $C_{\text{tokens}}$ )

The model only tracks your last N cell edits. If you type more letters than your context size, the oldest letters you entered fall out of context and start organically decaying. They will slowly flicker and mutate into visually similar characters (or pure noise) as the model loses track of them.
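For the curious, here is a minimal sketch of how a "last N edits" window could be tracked. It is not the game's actual code: the `ContextWindow` class, its method names, and the choice to refresh a cell when it is re-edited are all assumptions made for illustration.

```python
from collections import deque


class ContextWindow:
    """Tracks the most recent `capacity` cell edits (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.recent = deque()  # oldest edit sits on the left

    def record_edit(self, cell: tuple) -> list:
        """Register an edit and return the cells that just fell out of context."""
        if cell in self.recent:
            # Assumption: re-editing a cell moves it back to the "newest" end.
            self.recent.remove(cell)
        self.recent.append(cell)

        evicted = []
        while len(self.recent) > self.capacity:
            evicted.append(self.recent.popleft())
        return evicted


window = ContextWindow(capacity=3)
dropped = [window.record_edit((0, col)) for col in range(5)]
print(dropped)  # [[], [], [], [(0, 0)], [(0, 1)]]
```

Every cell returned by `record_edit` is one the "model" has lost track of, so it would be the candidate for flicker and mutation.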

2. ⏰ KV-Cache Expirations ( $\tau$ )

The board is split into 4 distinct quadrants (Q1-Q4). If you leave a quadrant untouched for too long, its cache expires—and that entire section of the board is instantly wiped blank! You must hop between quadrants to keep their caches active.
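One way to picture the quadrant timers is a per-quadrant "last touched" timestamp checked against a time-to-live. This is only a sketch under assumptions: `QuadrantCache`, `touch`, and `expired` are hypothetical names, and the injectable clock exists purely so the example is deterministic.

```python
import time


class QuadrantCache:
    """Per-quadrant expiry timers (hypothetical sketch, not the game's code)."""

    def __init__(self, ttl_seconds, quadrants=("Q1", "Q2", "Q3", "Q4"),
                 clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        now = clock()
        self.last_touched = {q: now for q in quadrants}

    def touch(self, quadrant):
        """Any edit inside a quadrant resets that quadrant's timer."""
        self.last_touched[quadrant] = self.clock()

    def expired(self):
        """Quadrants left untouched for longer than the TTL."""
        now = self.clock()
        return [q for q, t in self.last_touched.items() if now - t > self.ttl]


# Fake clock so the behaviour is reproducible.
fake_now = [0.0]
cache = QuadrantCache(ttl_seconds=45, clock=lambda: fake_now[0])

fake_now[0] = 30.0
cache.touch("Q1")          # player types something in Q1 at t=30s
fake_now[0] = 50.0
print(cache.expired())     # ['Q2', 'Q3', 'Q4']
```

With a 45-second TTL, only the quadrant visited at the 30-second mark survives the check at 50 seconds, which is why you have to keep hopping.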

3. 🔥 Temperature ( $T$ )

Controls the chaos of mutations:

Low Temp ( $T \le 0.8$ ): Drifts predictably (e.g. E becomes 3, A becomes 4).
High Temp ( $T \ge 1.3$ ): Explodes into pure symbolic entropy (emojis, percent signs, and system glyphs).
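The mutation tables are the game's metaphor. In a real LLM, temperature divides the logits before the softmax, so a higher $T$ flattens the probability distribution and unlikely tokens get sampled more often. Here is a small sketch of that effect; the logits are made-up numbers, and the two temperatures are borrowed from the presets purely for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw logits into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 0.0]  # hypothetical scores for three candidate tokens

low = softmax_with_temperature(logits, 0.7)
high = softmax_with_temperature(logits, 1.4)

print("T=0.7:", [round(p, 3) for p in low])
print("T=1.4:", [round(p, 3) for p in high])
```

At the lower temperature the top candidate takes a larger share of the probability mass; at the higher one the tail gets more weight, which is where the "symbolic entropy" comes from.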

🛠️ Choose Your Hardware Preset

Before you click INITIATE RUN, select your inference endpoint difficulty:

🏢 Enterprise API (Easy): Large context window ($C=64$), 90-second cache, very low temperature. Very forgiving.
💻 Local Llama (Medium): Quantized 7B model running on a laptop ($C=32$), 45-second cache, standard temperature ($0.7$). You'll need to move fast to avoid decay.
🍞 Smart Toaster (Hard): Edge inference on a kitchen appliance ($C=16$), 15-second cache, high temperature ($1.4$). Complete hardware chaos.
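If you like seeing the knobs side by side, the presets can be written down as plain config. This is a sketch, not the game's actual settings file: the `Preset` class and the dictionary keys are hypothetical, and the Enterprise temperature is left as `None` because the post only describes it as "very low" without giving a number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preset:
    name: str
    context_size: int              # C: how many recent cell edits are tracked
    cache_ttl_seconds: int         # tau: seconds before an idle quadrant is wiped
    temperature: Optional[float]   # T: None where no value is stated


PRESETS = {
    "enterprise": Preset("Enterprise API", 64, 90, None),  # "very low", value not given
    "local_llama": Preset("Local Llama", 32, 45, 0.7),
    "smart_toaster": Preset("Smart Toaster", 16, 15, 1.4),
}

for key, preset in PRESETS.items():
    print(key, preset.context_size, preset.cache_ttl_seconds, preset.temperature)
```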

Tip: If you need a cheatsheet, click the 🧠 VIEW WEIGHTS button to dump the answers database. But be warned: the database query locks keyboard inputs, forcing you to close the weights, switch contexts, and recall the answers from memory!

🕶️ Challenge Mode: Blind Inference

By popular demand (shoutout to @kenielzep97 for the brilliant suggestion!), I've added a Blind Inference toggle to the hyperparameters panel.

Flip it on to play with all telemetry, warning overlays, and letter mutations completely masked. You won't know the cache is decaying or mutating until the final compiler locks your run—a harsh simulation of how an LLM has no meta-awareness of its own context limitations!

🏁 Beat the Machine & Share Your Score

Once you fill in the last box, the system triggers RUN INFERENCE automatically to lock your scorecard.

Can you beat the local CPU (15 TPS) or a Cloud API (150 TPS)? Click COPY SCORE at the end of your run and paste your stats in the comments below!
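The comparison on the score card is plain division: the reference speed over your own. A quick sketch, assuming a hypothetical player speed of 3.06 letters per second and a made-up `speedup` helper (the 15 and 150 TPS reference figures are the ones quoted above):

```python
def speedup(reference_tps, player_tps):
    """How many times faster the reference is than the player."""
    return reference_tps / player_tps


player_tps = 3.06  # hypothetical player speed in letters per second

print(round(speedup(15, player_tps), 1))   # 4.9
print(round(speedup(150, player_tps), 1))  # 49.0
```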

💬 Let's Discuss:

What's the weirdest "mutation" you saw at High Temperature?
What was your Time to First Token (TTFT) and highest TPS?

UnitBuilds-CC / LLMs-are-Demented

An educational crossword game to learn about LLMs

📟 The Gating Crisis: Sparse MoE Router Simulator 🧠⚡

Part of the UnitBuilds CC Playgrounds Suite

Welcome, neural engineer. You have been put in charge of the Gating Network (Router) for a running Mixture of Experts (MoE) Large Language Model.

Your task is to route incoming multi-modal token streams ([T] Text, [M] Math, [V] Vision, [A] Audio, and [C] Code) to specialized Feed-Forward Network (FFN) experts in real-time. Since this is a Top-2 Routing network, you must dispatch every token to exactly two experts before it reaches the eviction threshold.

If you route tokens incorrectly, the model's output quality degrades into perplexity collapse. If you overload any individual expert beyond its queue limit, the system experiences Capacity Drops (loss of data).
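As a rough illustration of the two failure modes described above, here is a toy Top-2 router with a per-expert queue limit. It is a sketch under assumptions, not the simulator's code: `route_top2`, the expert names, the scores, and the capacity value are all invented for the example.

```python
def route_top2(scores, queue_lengths, capacity):
    """Send one token to its two highest-scoring experts.

    Returns (accepted, dropped): experts that took the token, and experts
    that were already at capacity (a "capacity drop").
    """
    top_two = sorted(scores, key=scores.get, reverse=True)[:2]
    accepted, dropped = [], []
    for expert in top_two:
        if queue_lengths[expert] < capacity:
            queue_lengths[expert] += 1
            accepted.append(expert)
        else:
            dropped.append(expert)
    return accepted, dropped


queues = {"text": 0, "math": 2, "code": 0}        # math is already full
scores = {"text": 0.1, "math": 0.6, "code": 0.3}  # hypothetical gate scores

print(route_top2(scores, queues, capacity=2))  # (['code'], ['math'])
```

The token wanted `math` and `code`, but `math` was at its queue limit, so half of the dispatch is lost.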

🕹️ Game Mechanics (How to Play)

⌨️ Hotkey Routing: Use numbers 1 to 8 (or 1 to 4 in simplified mode) to…

View on GitHub

Disclaimer: AI was used throughout this project, so it is only fitting that it co-authors with me. Special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.

Top comments (22)

UnitBuilds UnitBuilds CC • Jul 1

🧩 LLMs ARE DEMENTED: THE CROSSWORD 🧠
"Mechanical Sympathy Edition"

⚙️ My Config:
├─ Context Window: 32 tokens (Large)
├─ Cache Retention: 45 seconds
└─ Temperature: 0.7 (Low Temp)

📊 Performance Summary:
├─ Words Correctly Verified: 9/9
├─ Total Keystrokes Input: 50
├─ KV-Cache Evictions Suffered: 0
├─ Hallucinatory Mutations: 0
├─ Time to First Token (TTFT): 2.27s
└─ Generation Speed (TPS): 3.06 tokens/sec

⚡ Inference Throughput Comparison (TPS):
├─ Your Speed: 3.06 tokens/sec
├─ Local 7B (CPU): 15.00 tokens/sec (4.9x faster)
└─ Cloud API: 150.00 tokens/sec (49.0x faster)

"I will never yell at my chat client again."
🚀 Play the Simulation Here: [llms-are-demented-90043718455.us-c...]

@jess, @pascal_cescato_692b7a8a20, @dannwaneri, @kenielzep97, @francistrdev, @xulingfeng Curious how fast you lot can do it? I recommend using the link, cuz the embedding is a bit small.

Good luck!

Self-Correcting Systems • Jul 1

That was really cool, man. Looks intimidating at first. I didn’t change any of the settings on the left; figured it made more sense to run it the way you set it up initially. But once you feel how fast you have to rotate between quadrants to keep the cache alive, it gives you a real understanding of what’s happening under the hood. Not just reading about context windows, actually feeling the clock on them. Appreciate you pulling me into this, genuinely fun to run. Not sure how many attempts are normal, but it took me 3 😂 The game was very fun overall and gave me a real understanding of how slow a typer I am haha

🧩 LLMs ARE DEMENTED: THE CROSSWORD 🧠
“Mechanical Sympathy Edition”
⚙️ My Config:
├─ Context Window: 32 tokens (Large)
├─ Cache Retention: 45 seconds
└─ Temperature: 0.7 (Low Temp)
📊 Performance Summary:
├─ Words Correctly Verified: 9/9
├─ Total Keystrokes Input: 50
├─ KV-Cache Evictions Suffered: 0
├─ Hallucinatory Mutations: 0
├─ Time to First Token (TTFT): 2.97s
└─ Generation Speed (TPS): 1.09 tokens/sec
⚡ Inference Throughput Comparison (TPS):
├─ Your Speed: 1.09 tokens/sec
├─ Local 7B (CPU): 15.00 tokens/sec (13.7x faster)
└─ Cloud API: 150.00 tokens/sec (137.1x faster)
“I will never yell at my chat client again.”
🚀 Play the Simulation Here: dev.to/unitbuilds_cc/llms-are-deme...

UnitBuilds UnitBuilds CC • Jul 1

Awesome! Glad you enjoyed it. Don't worry about time or retries; the goal of the game was for us all to take a little humility pill and appreciate that our models don't hallucinate, run out of cache, or clear over time as aggressively as this. Yet they still perform 130x faster than we can. I also hope it explained the concepts clearly in an intuitive way. People tend to get pissed off at their monitors when an LLM bugs out or hallucinates, but this gave me a new perspective on just how difficult all of that actually is for LLMs to manage. And they do this every single prompt, for thousands of tokens... I was generous with the TPS: it tracks letters, while LLMs count a token as a word (for the most part), so whatever we get as TPS, the reality is we're still 5x slower than that 😅 Really makes you wonder how we got anything done before AI...

Any thoughts on what concept I should cover for the next game?

Self-Correcting Systems • Jul 1 • Edited

That “we’re still 5x slower than the generous number” line is the best part of this reply, honestly. You built a game that measures something, then told us straight that even the measurement was flattering us. That’s rare; most people would’ve just let the TPS stand. For the next one: what about a mode where confidence and correctness get decoupled? Right now decay shows itself: you can see letters flicker and mutate. What if instead the board looked completely stable, totally confident, while it was quietly wrong underneath? No visual tell. You only find out at RUN INFERENCE that half your “locked in” answers drifted and you never caught it. That’s the harder failure mode to teach, because it’s not “the system visibly struggling,” it’s “the system looking fine while it’s already wrong.” Closer to how hallucination actually catches people off guard in real use. Appreciate you building this, man. Genuinely taught me something by making me feel it instead of read it.

UnitBuilds UnitBuilds CC • Jul 1

Hm, sounds like an interesting toggle... Hide what's happening, so everything looks fine, then you run inference, see the terrible score, wonder how and it shows you what changed and how... Interesting, lemme patch it and I'll let you know once the updated version is live. Thanks for the suggestion!

Tomorrow's lesson will be MoE gating 🤫

UnitBuilds UnitBuilds CC • Jul 1

Update live, I added a landing page too. Blind Inference is the new feature 😁 I'll quickly update the post to shoutout the great suggestion! Lemme know how it plays

Pascal CESCATO • Jul 1

Awesome! But… I hate this kind of game: it's like @jess's one, too addictive!

UnitBuilds UnitBuilds CC • Jul 1

Haha yeah, I'm just waiting for someone to try and cheat at it 😁

UnitBuilds UnitBuilds CC • Jul 4

Heads-up, none of you are gunna wanna miss tomorrow's post, it's gunna be a really fun 1 (think Vampire Survivor). Hope you like it! 🤫

Also, if possible, please review: github.com/forem/forem/pull/23553 I really need the adjustable Embeds to make these games work better.

Daniel Nwaneri • Jul 1

enterprise mode. didn't trust myself on toaster after the day we just had.

UnitBuilds UnitBuilds CC • Jul 1

Nice! If you want, you can use copy score, so we can see your performance 😂 I'm curious how fast everyone here types and it's a healthy reminder to us all just how slow we are vs LLMs.

What did you think of the game?

UnitBuilds UnitBuilds CC • Jul 1

@ben, @jess Can we please have adjustable embeds? Maybe add iframe support, so I can adjust the sizing to fit better for future ones? If you like, I'll create a PR for it; I think it'll add a lot more usability to the function.

UnitBuilds UnitBuilds CC • Jul 4

I created a PR for it, please, it would make these games way better for users.

Brief summary: it adds the ability to set height up to 2000px, width as a % of usable space, scaling options for fit and stretch, and an optional 'original scaling' (e.g. 1920x1080). Default, portrait, and landscape modes are available as templates, with old embeds falling back to default for backwards compatibility.

I would really appreciate it if you could have someone check it out.

xulingfeng • Jul 1

What the hell, are you a genius or what? Turning LLM architecture limits into a crossword game is the most creative thing I've seen today. The KV-cache quadrant wipe mechanic is brutal 😂 This needs way more attention. 🔥

UnitBuilds UnitBuilds CC • Jul 1

Glad you had fun! Sorry for baiting you again with the red flag 😂 My goal was for us all to learn a lesson in humility. Run it on enterprise, you're still 50x slower than a cloud model. Run it at anything more restrictive and you learn fast that it's a miracle that LLMs don't spit out garbage all day long. Cache wipes and mutations due to context shift, trying to have a look at weights to see what's the answer, come back and everything changed again... All at 150+ TPS... It's incredible and I hope the little game does it justice.

Don't forget to post your best score though 😂 Even if not 100%, that's the whole point 😁

UnitBuilds UnitBuilds CC • Jul 1

@er4or-404 I saw you liked the comment yesterday. If you're curious, it's up and running. Wanna try being an LLM for a minute? Give it a try.

Don't forget to comment your score card!

Jess Lee • Jul 1

I really wanted to make a crossword but then didn't because I could not prompt my way into one with an AI theme that was in proper American Standard Style. Did you happen to look into the different styles of crosswords while you were working on this project?!

UnitBuilds UnitBuilds CC • Jul 1

I actually didn't give it much thought tbh. I grew up with crosswords, and the ones that always seemed the nicest to do were the ones like this, because the others were always way too cluttered and honestly just ruined the experience for me. It's also the easiest to make from a coding standpoint, as it's just 0's and 1's to define the grid space. But I probably should have had a look into the different kinds before I made it. Thank you for bringing it up! Btw, what did you think of it?

mote • Jul 4

The crossword-as-memory-eviction-simulator is a clever framing. Memory eviction in LLMs isn't just a context window problem; it's a retrieval problem. The model hasn't "forgotten" the information in the sense of weights being gone; it just can't locate it under the current attention distribution.

What I find interesting is how this maps to agent memory architectures. Most agent frameworks treat memory as a flat append-only log or a vector store with naive recency bias. But the crossword shows that recall failure isn't random; it follows patterns. Words you've used recently get evicted. Words that share token overlaps with the current guess get confused. That's not unlike how retrieval-augmented systems suffer from embedding collisions when semantically similar but contextually wrong documents outrank the right ones.

The "temperature drift" mechanic is also worth highlighting. People treat temperature as a creativity dial, but in practice it's more like a precision-recall tradeoff knob. Higher temperature doesn't just make outputs more varied; it actively degrades the model's ability to maintain coherent internal state across tokens.

Have you experimented with structured memory approaches (e.g., explicit key-value stores the model can query) as an alternative to just extending context windows?

UnitBuilds UnitBuilds CC • Jul 4

Thank you, yes, it's part of a series I'm doing on teaching people about LLMs, through games. Game nr 4 was released today.

I have, yes, quite a lot actually. Initially with a basic graph-db method (neo4j), then I built my own, iterated on it a few times, and decided that all available methods are bloated, so I switched to a merkle root backed triplet format, so it's deterministic by nature. I expanded on that till it became an executable language by nature, then distilled a model to have dual output channels (plain text and the custom format, called NDA), which brought up some interesting advancements, including massive KV-Cache compression, a custom model quantization method, a custom MCP format and server, etc. All the way down to a bare-metal OS. Now I'm finally happy with performance and don't need to crush bottlenecks anymore; instead it's more about adding features, as the continuous merkle root creates a definitive history, similar to bitcoin's ledger, where the files themselves and the site-map act as a ledger, git-history, etc. all in 1. It works a lot better than I could have ever imagined when compared to the standard ways today.