Athreya aka Maneshwar

Posted on Jun 22

. .. . ... . .... . .... . ... .

#ai #webdev #programming #beginners

LLM reasoning in unseen pattern puzzles

Hello, I'm Maneshwar. I'm building git-lrc, a Micro AI code reviewer that runs on every commit. It is free and source-available on GitHub. Star git-lrc to help devs discover the project. Do give it a try and share your feedback.

I just gave Claude a dumb little dot puzzle.

. .. . ... . .... . .... . ... .

It replied that the missing ending was:

.. .

At first I thought:

"Hold on... LLMs only predict the next token.
They don't execute algorithms.
They don't reason.
So how did it figure this out?"

That question sent me down a rabbit hole.

Because if you check the answer, it's right.

The single dots are separators; the real clusters climb 2 3 4, then mirror back down 4 3 2.

Reading the dot counts end to end gives a clean palindrome:

1 2 1 3 1 4 1 4 1 3 1 2 1
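If you want to check that claim yourself, here's a minimal sketch in Python. The helper name run_lengths is made up for this example; it just counts the dots in each space-separated cluster.

```python
def run_lengths(puzzle: str) -> list[int]:
    """Turn a string of dot clusters into a list of cluster sizes."""
    return [len(cluster) for cluster in puzzle.split()]

prompt = ". .. . ... . .... . .... . ... ."
answer = ".. ."  # the ending Claude proposed

counts = run_lengths(prompt + " " + answer)
print(counts)                 # [1, 2, 1, 3, 1, 4, 1, 4, 1, 3, 1, 2, 1]
print(counts == counts[::-1]) # True: the sequence reads the same both ways
```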

And here's the thing: this is a puzzle the model may never have seen before.

The folk model I'd been carrying around says an LLM "just guesses the next word based on vectors and whatever it saw in training."

If that's all it does, a tough puzzle should stump it.

There's no next-word statistic to lean on.

So either that mental model is wrong, or something more interesting is happening.

It's the second one. Here's what I dug up.

"Predict the next token" is the goal, not the method

Yes, these models are trained to predict the next token (roughly, the next chunk of text).

That part of the folk explanation is true.

But here's the thing people skip over: that's the objective it was scored on, not a description of what it learned to do.

Think about a student who only ever gets graded on exam questions.

Technically all they "do" is answer exam questions.

But to get good at that across thousands of varied questions, they can't just memorize answers.

They have to actually learn arithmetic, logic, how to read a problem.

The exam is the pressure.

The understanding is what grows under the pressure.

Same deal. To predict the next token well across trillions of tokens (text that includes math, code, arguments, stories, and yes, puzzles), memorizing "word X tends to follow word Y" is hopeless.

The space of possible inputs is effectively infinite and almost everything you feed it is novel.
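To make that concrete, here's a rough sketch of the naive "X tends to follow Y" approach applied to the dot puzzle's cluster sizes. Two assumptions to flag: the table is built from the prompt itself rather than from a training corpus, and bigram_table is a made-up helper, not anything a real model contains.

```python
from collections import Counter, defaultdict

def bigram_table(tokens):
    """Count, for each token, which tokens have followed it so far."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

# Cluster sizes of the puzzle prompt: ". .. . ... . .... . .... . ... ."
counts = [1, 2, 1, 3, 1, 4, 1, 4, 1, 3, 1]
table = bigram_table(counts)

# After a separator (1), the prompt has seen 2 once, 3 twice, and 4 twice.
print(dict(table[1]))  # {2: 1, 3: 2, 4: 2}

# The most frequent follower is the bigram "prediction" for the next cluster.
guess = table[counts[-1]].most_common(1)[0][0]
print(guess)
```

The frequency table favors 3 or 4 after a single dot, while the mirror needs 2, the least common follower. Pairwise statistics alone point the wrong way here.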

The only way to drive prediction error down at that scale is to develop internal machinery that generalizes: counting, comparing, recognizing symmetry, continuing a pattern.

Those abilities emerged because they were useful for the prediction task.

Nobody hand-coded a "detect palindrome" function.

It's a capability that fell out of relentless optimization, the same way a student's actual understanding falls out of relentless testing.

If the student analogy doesn't land for you, here's the one that clicks for most devs: compression.

Imagine you had to compress every book ever written into the smallest possible representation.

You wouldn't get far storing raw text. You'd be forced to discover the underlying regularities: grammar, recurring narrative structures, arithmetic, the rules of chemistry, how code is shaped.

Not because anyone taught them to you, but because capturing those concepts is the most efficient way to represent the data.

Training an LLM to predict text is the same squeeze.

Good prediction requires compact internal models of the patterns in the world, so the model builds them.
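A small way to feel the compression point, as a sketch only: zlib is an off-the-shelf general-purpose compressor, not a language model, and the two test strings below are invented for this example. The only claim is that data with regular structure squeezes down much further than data without it.

```python
import random
import zlib

random.seed(0)  # fixed seed so the noise string is reproducible

# A structured string: one full mirror cycle of the dot pattern, repeated.
pattern = ". .. . ... . .... . .... . ... . .. . " * 100

# An unstructured string of the same length, using the same two characters.
noise = "".join(random.choice(". ") for _ in range(len(pattern)))

structured_size = len(zlib.compress(pattern.encode()))
noise_size = len(zlib.compress(noise.encode()))

# Same length and same alphabet, but the regular one compresses far smaller.
print(len(pattern), structured_size, noise_size)
```

The compressor gets its win by finding the repetition. A predictor trained on text is under the same kind of pressure, just with far richer regularities to find.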

This is the single biggest upgrade to make to the folk model: next-token prediction is the training signal, and general competence is the strategy the model found for satisfying it.

Plot twist: my puzzle isn't really about dots

Before we get to the mechanism, one thing that reframed it for me.

The model never "sees" dots.

It sees tokens: whatever chunks the tokenizer splits the input into. And the exact split matters less than you'd expect, because to the model my puzzle is structurally the same as:

A AA A AAA A AAAA A AAAA A AAA A

1 2 1 3 1 4 1 4 1 3 1 ...

The dots are just the costume.

What the model actually works with is the abstract shape of the sequence: separators interleaved with a rising-then-falling count.

That's a big clue about why it generalizes: it isn't pattern-matching on "dots," it's operating on structure that's independent of the symbols carrying it.
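Here's a tiny sketch of what "independent of the symbols" means. The shape function is a hypothetical helper that throws away which character was used and keeps only the cluster sizes. It says nothing about how a real tokenizer splits these strings.

```python
def shape(sequence: str) -> list[int]:
    """Keep only the size of each cluster, ignoring the symbol it is drawn with."""
    return [len(cluster) for cluster in sequence.split()]

dots = ". .. . ... . .... . .... . ... ."
letters = "A AA A AAA A AAAA A AAAA A AAA A"

print(shape(dots) == shape(letters))  # True: different costume, same structure
```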

The part the folk model leaves out entirely: attention

The "each word relates to the previous word, fixed from training" picture is missing the mechanism that does the heavy lifting.

It's called ATTENTION, and it's the core of the transformer architecture every modern LLM is built on.

Here's the intuition.

When the model processes your input, every position can "look at" other positions (in a chat model like this one, the positions that came before it) and compute how they relate on the fly, for this specific input.

It's not a frozen lookup baked in at training time.

It's a fresh computation each time you hit enter.
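To show what "computed on the fly" looks like, here's a stripped-down sketch of the weight computation in scaled dot-product attention, in plain Python. It is heavily simplified and several things are assumptions made for illustration: the embed function is a made-up one-hot encoding of cluster size, there are no learned query/key/value projections, no causal mask, one head only, and the step that mixes value vectors is left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(vectors):
    """For each position, score every position by dot product, then normalize."""
    dim = len(vectors[0])
    rows = []
    for query in vectors:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        rows.append(softmax(scores))
    return rows

def embed(size, dim=4):
    """Hypothetical toy embedding: a scaled one-hot vector for the cluster size."""
    vec = [0.0] * dim
    vec[size - 1] = 3.0
    return vec

# Same function, two different inputs: the weights are recomputed each time.
weights_a = attention_weights([embed(n) for n in [1, 2, 1, 3]])
weights_b = attention_weights([embed(n) for n in [1, 1, 2, 3]])

print([round(w, 3) for w in weights_a[0]])  # position 0 leans on positions 0 and 2
print([round(w, 3) for w in weights_b[0]])  # position 0 leans on positions 0 and 1
```

Nothing about "which position matters to which" is stored anywhere in this code. Move the matching cluster and the weights move with it.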

So with the dot puzzle, nothing pulled up a stored "dot puzzle answer." Instead, roughly:

The repeating single dots got recognized as a separator element.
The clusters got compared against each other.
The rising-then-falling counts (2, 3, 4, 4, 3, …) got represented as a structure, one that "wants" to keep descending.
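Purely as an analogy, here are those three steps written out as ordinary code. To be clear, this is not what the network does internally; it's a hand-written procedure with a made-up name, complete_mirror, and it assumes the prompt ends on a separator like the puzzle above.

```python
from collections import Counter

def complete_mirror(puzzle: str) -> str:
    """Guess the missing tail of a 'separator, cluster, separator, ...' mirror puzzle."""
    symbol = puzzle[0]  # whatever character the puzzle is drawn with
    counts = [len(cluster) for cluster in puzzle.split()]

    # Step 1: treat the most frequent cluster size as the separator.
    separator = Counter(counts).most_common(1)[0][0]

    # Step 2: keep the remaining clusters so they can be compared.
    clusters = [n for n in counts if n != separator]

    # Step 3: take the climb up to the first peak and mirror it back down.
    rising = clusters[: clusters.index(max(clusters)) + 1]
    mirrored = rising + rising[::-1]
    missing = mirrored[len(clusters):]

    # Each missing cluster is followed by a separator.
    tail = []
    for n in missing:
        tail += [symbol * n, symbol * separator]
    return " ".join(tail)

print(complete_mirror(". .. . ... . .... . .... . ... ."))  # .. .
```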

And those token vectors? They're not just "the meaning of this symbol."

They carry abstract features that can be manipulated almost geometrically.

"Mirror this sequence" is exactly the kind of operation that becomes tractable when your data lives as vectors in the right space.

Counting and reflecting stop being magic and start being arithmetic on representations.

There's also a depth dimension worth naming.

Attention isn't a one-shot pass. The representation gets refined as it flows through dozens of layers, each adding a little more abstraction.

A loose, illustrative intuition (not literally what any layer "thinks"):

Early layers: "these symbols repeat."
Middle layers: "each bigger run is separated by a single dot."
Later layers: "the whole thing is symmetric, we're probably completing a mirror."

No layer holds an English sentence.

But the internal vector progressively encodes higher-level properties until "finish the palindrome" is the obvious continuation in that learned space.

Here's the difference between the model in our heads and what's actually running:

Why it works on a puzzle it's never seen

This is the actual answer to "how did it solve my puzzle."

It didn't memorize my exact dot sequence.

It learned general operations (count, compare, detect symmetry, continue a pattern), and those operations compose to handle new inputs.

Give it dots, give it numbers, give it letters: the same "find the structure and extend it" machinery applies.

There's real research into this, some of it from interpretability teams like Anthropic's.

They've found specific internal circuits that do pattern continuation. One famous example is the induction head.

The mechanism is essentially: "earlier in this input, A was followed by B; here's A again, so B is likely next."

That's a literal, identifiable component inside the network doing pattern-matching-and-extension.

It's exactly the kind of thing that lets a model continue a novel pattern instead of recalling a stored one.
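The "A was followed by B, here's A again" rule is simple enough to caricature in a few lines. This is a plain-Python sketch of the behavior only: induction_predict is a hypothetical name, and in a real transformer the lookup is done by attention heads, not a loop.

```python
def induction_predict(tokens):
    """Find the most recent earlier copy of the last token and return what followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the last token has not appeared earlier in this input

print(induction_predict(["A", "B", "C", "A"]))               # B
print(induction_predict(["zorp", "blick", "quux", "zorp"]))  # blick
```

The second call uses nonsense words on purpose: the rule works on tokens it has no prior statistics for, because it only looks at the current input. Plain copying like this would not solve the mirror puzzle on its own, so treat it as one simple building block rather than the whole story.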

When you frame it that way, the dot puzzle stops being mysterious.

It's a pattern.

The model has machinery for finding and extending patterns. It found the pattern and extended it.

The takeaway for devs

If you build with these models, the practical lesson is this: you're not working with a fancy autocomplete that regurgitates training data.

You're working with a system that learned transferable operations under next-token pressure, and applies them to inputs it's never seen.

That reframing changes how you prompt, how you debug weird outputs, and how you reason about where it'll be reliable versus where it'll confidently faceplant.

"It's just predicting the next word" is the kind of true-but-useless statement that'll lead you to the wrong intuitions.

A dumb little dot puzzle made me go look this up.

Disclaimer: This article was written by me; AI was used to fix grammar and improve readability.

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs — without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

⭐ Star it on GitHub:

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Git Commit


GenAI today is a race car without brakes. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents silently break things: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.

git-lrc is your braking system. It hooks into git commit and runs an AI review on every diff before it lands. 60-second setup. Completely free.

In short, git-lrc helps Prevent Outages, Breaches, and Technical Debt Before They Happen

At a glance: 10 risk categories · 100+ failure patterns tracked · every commit…


Top comments (16)

FrancisTRᴅᴇᴠ (っ◔◡◔)っ • Jun 23 • Edited

Having "morse code" as your title is crazy. I got scared seeing it lol. btw, great post!

Athreya aka Maneshwar • Jun 23

Lmao, thanks bud!

Anmol Baranwal • Jun 24

damn what a title.. imagine google ranking it, would be funny

Athreya aka Maneshwar • Jun 25

hehe yo xD

Mykola Kondratiuk • Jun 25

that binary breaks faster than people expect. whether Claude truly reasons is less useful than whether you can rely on the output for a specific problem class. for dot puzzles: apparently yes.

Athreya aka Maneshwar • Jun 25

Mykola Kondratiuk • Jun 25

haha - somewhere around the third retry it kind of stops being a philosophy question

Athreya aka Maneshwar • Jun 26

Hehe xD

Mykola Kondratiuk • Jun 28

right? by retry 3 it's just acceptance. commit the workaround, move on.

UnitBuilds • Jun 23

I think this falls in line with how LLMs operate. Tensors aren't there to find tokens; tokens are applied afterwards to the pattern. The pattern is what the LLM recognizes, and the tokenizer just puts it into words so it's useful (think of it as a translation layer).

Athreya aka Maneshwar • Jun 23

I see

Nazar Boyko • Jun 23

The compression framing is the one that makes this click. Once you picture training as "squeeze all this text into the smallest model that still predicts it," general skills like counting and mirroring stop looking magical and start looking necessary. The one spot I'd soften is "a puzzle it's never seen before," since rising-then-falling symmetry and palindromes are everywhere in the training data, so the structure itself is very familiar even if your exact dots aren't. That actually strengthens your point, because it shows the model reusing a learned operation rather than needing to have seen your specific string.

Aliaksei Zelianouski • Jun 23

Predicting the next token is the goal it was trained on, not how it gets there. To do it on inputs it never saw, it has to actually work things out. And not just in the visible reasoning steps - even producing a single token, there's a multi-step process running inside first. A kind of emergent reasoning in the latent space, before any text comes out. Interpretability work has traced it: the model tries several rough approaches in parallel, like probes testing different routes, then combines them. People have watched it add two numbers this way - one path ballparks the sum, another locks in the last digit, and they merge into the answer.

The dot puzzle is the weakest way to make this case, though. Any single clean answer can be waved off as "it saw something close in training," and palindromes are everywhere in text, so that escape hatch is wide open - which is the out a couple of your commenters already took. The internal traces are better evidence because you can watch the work happen no matter what the model saw before.

algorhymer • Jun 28

 ....  .  .. ....  ... . .  ... .    ... . .   . ...   ..   .   ..  . .   . ....  ..  .    . .   .  ... ...   .. .    ... ...   ...  .  . . ...  ..  ...  . ..     ..   ..  . .   .

Ranjan Dailata • Jun 23 • Edited

I believe that the model has already learned some of the dotted annotations like the one you have explained. We live in a small world where thoughts overlap 😁

Athreya aka Maneshwar • Jun 23

Hehe, true Ranjan :)
