Damla Hamurcu
I Let an LLM Play Cookie Clicker. Then I Fixed Its Biggest Weakness.

What a toy game taught me about when AI planning actually beats simple heuristics, and when it doesn't.


Cookie Clicker is secretly a resource allocation problem. You have one currency (cookies), a growing menu of buildings that generate cookies per second (CpS), tiered upgrades that double building output, and costs that scale exponentially. Every tick, you face the same decision: what should I buy next?

It feels like the kind of problem that rewards planning. Look ahead, sequence your purchases toward powerful upgrades, and you should come out ahead. So I built a simplified Cookie Clicker simulation in Python and asked: can an LLM make better spending decisions than a simple greedy algorithm?

The answer is more interesting than "yes" or "no." The LLM identifies opportunities the greedy algorithm is structurally blind to, but it executes them inside a worse control loop. Fix the control loop, and the combination beats both.

The Setup

The simulation has 10 buildings (Cursor through Alchemy Lab), tiered upgrades that unlock at ownership thresholds and double building output, grandma synergies that scale with building count, and standard Cookie Clicker cost scaling (each purchase multiplies the next by 1.15×). No golden cookies, no prestige, but the core resource allocation problem is intact.
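That 1.15× scaling means each additional copy of a building costs exponentially more than the last. A minimal sketch of the cost rule (the base cost of 15 is illustrative, not necessarily the simulation's exact value):

```python
def purchase_cost(base_cost: float, owned: int) -> float:
    """Cost of the next copy after `owned` prior purchases, with 1.15x scaling."""
    return base_cost * 1.15 ** owned

# A Cursor-style building with a base cost of 15 cookies:
first = purchase_cost(15, 0)   # 15.0
tenth = purchase_cost(15, 9)   # ~52.8 -- every copy is 15% pricier than the last
```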

I tested four strategies, all starting from the same mid-game checkpoint (tick 35,000, ~7M CpS) and running independently for 5,000 ticks.

Buy Cheapest always buys the cheapest affordable item. No evaluation, no prioritization. The baseline.

Greedy ROI evaluates every available purchase by its payback period: how many ticks until this purchase pays for itself in additional CpS? It picks the shortest payback time, every tick. No memory. No planning. Just the locally optimal choice, relentlessly.
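Payback-period greedy fits in a few lines. This sketch uses an illustrative `Option` tuple rather than the repo's actual API:

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    cost: float
    cps_delta: float  # CpS gained if purchased now (from simulation)

def best_purchase(options, cookies_banked: float):
    """Pick the affordable option with the shortest payback (cost / cps_delta)."""
    affordable = [o for o in options if o.cost <= cookies_banked and o.cps_delta > 0]
    if not affordable:
        return None
    return min(affordable, key=lambda o: o.cost / o.cps_delta)

picks = [
    Option("Temple", 1_000_000, 400),  # payback: 2,500 ticks
    Option("Factory", 500_000, 100),   # payback: 5,000 ticks
]
# best_purchase(picks, 2_000_000) picks the Temple: shorter payback wins
```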

LLM Planner receives the full game state: current CpS, every building count, every available upgrade with its cost and CpS impact, and nearby tier-unlock thresholds. It produces a prioritized 10-purchase plan with reasoning. It re-plans when its queue empties or when new upgrades unlock. I used GPT-5-mini with structured output.
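The plan shape and re-plan trigger might look roughly like this. These type and function names are my own sketch, not the repo's actual structured-output schema:

```python
from dataclasses import dataclass

@dataclass
class PlannedPurchase:
    item: str        # building or upgrade name
    reasoning: str   # the model's justification, kept for logging

def needs_replan(plan_queue: list, newly_unlocked_upgrades: list) -> bool:
    """Re-plan when the purchase queue empties or new upgrades appear."""
    return not plan_queue or bool(newly_unlocked_upgrades)
```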

Hybrid uses the LLM planner for strategic direction, but falls back to the greedy ROI algorithm when the LLM is waiting to afford its next planned purchase. About 30 lines of wrapper code.

Results

| Strategy | Final CpS | Purchases | Upgrades | LLM Calls |
| --- | ---: | ---: | ---: | ---: |
| Hybrid | 8,911,796 | 30 | 1 | 2 |
| Greedy ROI | 8,771,218 | 27 | 0 | 0 |
| LLM Planner | 8,314,442 | 42 | 1 | 5 |
| Buy Cheapest | 7,573,220 | 177 | 2 | 0 |

The hybrid wins. It beats pure greedy by 1.6% and the pure LLM by 7.2%.

1.6% sounds small, and a skeptical reader might think "that's noise." But it was consistent across runs, and it came from only 2-4 LLM calls total. The value isn't coming from continuous planning. It's coming from a small number of well-timed strategic interventions.

The final numbers don't tell the real story though. The CpS-over-time chart does.

*Chart: CpS over time for all four strategies*

What Actually Happened

Three distinct behaviors show up in the data.

Greedy ROI makes a purchase roughly every 150-200 ticks. Each one is locally optimal. It never pauses, never wastes cookies, and never makes a bad buy. Its CpS line is a steady staircase upward.

LLM Planner starts strong. It identified that pushing Factory count from 45 to 50 would unlock a tier 4 upgrade that doubles all Factory output. It queued five consecutive Factory purchases followed by the upgrade. Here's its reasoning on Factory #48:

"Still driving toward the Factory tier 4 unlock; each Factory bought is invested toward a guaranteed doubling once the tier is purchased."

This is a genuinely smart move. The greedy algorithm would never buy five Factories in a row when a single Temple has a shorter payback period, even though the five Factories plus their unlock are worth far more in aggregate.

But after executing its plan, the LLM went silent. Its queue emptied and the re-planning trigger didn't fire. Cookies accumulated with nothing to spend them on. CpS flatlined from tick 3,300 to tick 5,000. That's nearly a third of the entire run doing nothing. By the time greedy caught up and passed it, the damage was done.

The Hybrid got the best of both. It executed the same Factory threshold push (ticks 1,258-1,398 in its run), then let greedy take over during the gaps. Look at ticks 1,431-1,675: a burst of Grandma, Temple, Cursor, Farm, and Mine purchases. That's the greedy fallback spending cookies productively while the LLM queue was empty. No stalling. No idle cookies.

Why Greedy Is Secretly Strong

Here's the part that surprised me, and the real insight from this experiment.

The greedy algorithm isn't naive. When it evaluates a purchase, it calls _cps_delta(), which simulates the full game state after the purchase. That includes tier upgrade doublings, grandma synergy multipliers, everything. The moment a tier upgrade becomes available, greedy sees its full multiplicative effect as an immediate CpS delta and buys it if the payback is good.

The environment is doing something subtle here: it collapses long-term value into short-term signal. Greedy doesn't need to "plan" for a tier upgrade because the upgrade's value is fully visible the tick it unlocks. Greedy is really "greedy over a rich instantaneous reward function," and that's a much harder baseline to beat than it looks. This also favors greedy more than many real-world systems, where the impact of decisions is often delayed or only partially observable.
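To make the mechanism concrete, here's a toy version of the idea behind `_cps_delta()`. The real implementation lives in the repo; this sketch models only tier doublings:

```python
import copy

def total_cps(state):
    """Total CpS, with tier doublings applied during the computation."""
    cps = 0.0
    for b in state["buildings"].values():
        mult = 2.0 if b["tier_upgrade_bought"] else 1.0
        cps += b["count"] * b["base_cps"] * mult
    return cps

def cps_delta(state, apply_purchase):
    """Simulate a purchase on a deep copy and diff total CpS.
    A tier upgrade's full multiplicative effect shows up in this
    delta the tick it becomes available."""
    trial = copy.deepcopy(state)
    apply_purchase(trial)
    return total_cps(trial) - total_cps(state)

state = {"buildings": {"Factory": {"count": 50, "base_cps": 100.0,
                                   "tier_upgrade_bought": False}}}

def buy_tier_upgrade(s):
    s["buildings"]["Factory"]["tier_upgrade_bought"] = True

# The doubling is visible immediately: +100 CpS per Factory, 50 Factories
delta = cps_delta(state, buy_tier_upgrade)  # 5000.0
```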

The only place greedy is structurally blind is threshold plays: sequences of purchases that are individually suboptimal but unlock something powerful at the end. Buying five Factories when each one has worse ROI than a Temple requires looking ahead. Greedy can't do that. The LLM can.
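A toy version of the blind spot, with made-up numbers: each Factory alone has worse payback than a Temple, so greedy takes the Temple every time, but the five-Factory sequence plus the doubling is worth far more in aggregate.

```python
factory_cps, temple_cps = 100.0, 400.0              # CpS per purchase (toy numbers)
factory_cost, temple_cost = 500_000.0, 1_000_000.0

# Per-purchase payback: the Temple wins, so greedy never buys the Factory
factory_payback = factory_cost / factory_cps        # 5,000 ticks
temple_payback = temple_cost / temple_cps           # 2,500 ticks

# But pushing Factory count 45 -> 50 unlocks a doubling of ALL Factory
# output: +500 CpS from the 5 new Factories, then +100 CpS for each of
# the 50 owned Factories once the tier upgrade lands.
threshold_gain = 5 * factory_cps + 50 * factory_cps  # 5,500 CpS vs the Temple's 400
```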

The LLM Doesn't Even Agree With Itself

I ran the hybrid twice from the same starting state. Run 1: the LLM pushed Factories to 50 for the tier 4 unlock (2 calls, 8.91M CpS). Run 2: it targeted Alchemy Lab and Shipment scaling (4 calls, roughly the same final CpS).

Completely different plans. Same outcome.

This tells you something important: the hybrid architecture is robust to LLM variance. The greedy fallback acts as a stabilizer. Any halfway-decent strategic direction gets converted into consistent execution. The system doesn't require a consistently good plan. Just a non-random one.

This is the opposite of the pure LLM, where different plans produced meaningfully different results (8.3M vs 7.9M in earlier testing). Without the greedy safety net, plan quality matters a lot. With it, plan quality matters much less.

The Control Loop Matters More Than the Intelligence

The pure LLM's problem was never intelligence. It identified the Factory threshold push, sequenced synergy upgrades correctly, and reasoned about multi-step interactions in a complex system. Its problem was architectural. It was a blocking queue executor with no interrupt mechanism.

```python
# Pure LLM: blocks when its queue is empty
if not self._plan:
    self._plan = self._call_llm(state)  # expensive, slow
# Nothing happens between plans

# Hybrid: fills gaps with greedy
action = self._llm.decide(state)
if action is not None:
    return action
return self._greedy_decide(state)  # never idle
```

The greedy algorithm re-evaluates every tick. The LLM hard-blocks until its planned item is affordable. That's not a fair comparison of planning vs. heuristics. It's a comparison of a continuous control loop vs. a batch executor.

Fixing this required about 30 lines of code. The hybrid wrapper calls the LLM planner first. If it returns nothing (waiting to afford something), it runs greedy ROI as a fallback, with one constraint: don't spend cookies that the LLM's next planned purchase needs. This prevents the fallback from interfering with planned threshold plays, but it also introduces small inefficiencies when the planned purchase is far away. More on that below.
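The reservation constraint is the only subtle part of the wrapper. A free-function sketch of one hybrid tick (the repo wraps this in a strategy class; the parameter names here are mine):

```python
def hybrid_decide(llm_action, cookies, next_planned_cost, greedy_decide):
    """One tick of the hybrid loop. `greedy_decide(budget)` returns a
    purchase affordable within `budget`, or None."""
    if llm_action is not None:
        return llm_action          # LLM's planned purchase is affordable now

    # LLM is saving up: greedy may only spend the surplus above what the
    # next planned purchase needs, so it can't break a threshold play.
    spendable = cookies - (next_planned_cost or 0.0)
    if spendable <= 0:
        return None                # everything is earmarked; hold
    return greedy_decide(spendable)
```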

What This Means Beyond Cookie Clicker

I see the same dynamic in production systems constantly.

Simple, reliable heuristics often outperform AI-driven approaches. Not because they're smarter, but because they're predictable and they never stop working. A hand-tuned rule that runs on every request beats a sophisticated model that occasionally hangs, returns malformed output, or needs a retry loop.

But the answer isn't "don't use AI." The answer is: use AI for what it's good at and heuristics for what they're good at. The LLM is good at spotting structural opportunities across a complex state space. The greedy algorithm is good at relentless, consistent execution. Neither is good at the other's job.

The hybrid pattern generalizes:

  1. Let the AI set strategic direction periodically (identify thresholds, sequence multiplier chains, spot non-obvious opportunities)
  2. Let a simple, fast heuristic handle moment-to-moment execution
  3. Never let the system sit idle because the AI hasn't spoken

This is how I'd architect any system where you're tempted to put an LLM in a hot loop: don't. Put it in a planning layer that fires occasionally, and let deterministic logic handle the rest.

What I'd Do Next

The hybrid's reservation logic is still too conservative. It hoards cookies for planned purchases even when they're thousands of ticks away, creating mini-stalls. A smarter version would only reserve cookies when the planned purchase is close to affordable.
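One way to soften the reservation, assuming the simulation exposes current CpS: only hold cookies back when the planned purchase is within some horizon of being affordable. The horizon value here is an illustrative guess, not a tuned number.

```python
def should_reserve(cookies, planned_cost, cps, horizon_ticks=200):
    """Reserve cookies only if the planned purchase will be affordable
    within `horizon_ticks` at the current earn rate."""
    if cookies >= planned_cost:
        return True   # affordable now; hold for the planned buy
    if cps <= 0:
        return False
    ticks_away = (planned_cost - cookies) / cps
    return ticks_away <= horizon_ticks
```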

More interesting: what happens if you break greedy's advantage? The reason greedy is so strong here is that _cps_delta() makes long-term value immediately visible. What if upgrades had hidden effects? What if some rewards were delayed? That's where I'd expect the LLM's planning advantage to widen significantly. And it's the experiment I'm most excited to try next.

Try It Yourself

The full simulation, all four strategies, and the comparison charts are on GitHub: https://github.com/hamurda/cookie-clicker-strategy-lab.


If you've benchmarked LLMs against algorithmic baselines on other problems, I'd love to hear what you found. Does the "strategic planning + consistent execution" pattern show up in your work too?
