I was playing Cookie Clicker, the idle game where you buy buildings that generate cookies, which you spend on more buildings, and I kept hitting genuine dilemmas. Do I buy the cheap cursor now, or save for the grandma that compounds better over time? Is there a mathematically best approach, or is this just vibes?
That's when I realized: this is a resource allocation problem with compounding returns and opportunity costs. Could an LLM reason about it? And could I build the whole thing using Claude Code?
Over four sessions in a week, using Claude Code with Sonnet 4.6, I built a Cookie Clicker simulation engine that pits a greedy algorithmic strategy against an LLM-powered planner. Here's what I learned.
The Project
I'm an engineer building AI projects on the side. I'm not a senior software engineer, and that matters for this story, because a lot of what I learned came from the gaps between what Claude Code produced and what I could evaluate.
The simulation engine runs a simplified Cookie Clicker: buildings generate cookies per second (CpS), costs scale with each purchase, and different strategies compete over a fixed number of ticks. One strategy uses a greedy ROI formula. The other asks an LLM to plan its purchases. Before opening Claude Code, I wrote a detailed spec covering game mechanics, cost formulas, upgrade synergies, and what the output should look like. This turned out to be the most important decision of the project.
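To make the setup concrete, here's a minimal sketch of the two mechanics described above: costs that scale with each purchase, and a greedy pick based on a payback-style ROI score. The names (`Building`, `COST_GROWTH`, `greedy_pick`) and the 15% growth factor are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass

COST_GROWTH = 1.15  # assumption: each purchase raises the next price by 15%

@dataclass
class Building:
    name: str
    base_cost: float
    base_cps: float  # cookies per second added per unit owned
    owned: int = 0

    @property
    def cost(self) -> float:
        # Cost scales geometrically with the number already owned.
        return self.base_cost * COST_GROWTH ** self.owned

def greedy_pick(buildings: list[Building], cookies: float, cps: float) -> Building:
    # Score = ticks until we can afford it + cost per unit of CpS gained.
    # Lower is better; this favors cheap, efficient purchases early on.
    def score(b: Building) -> float:
        wait = max(0.0, (b.cost - cookies) / cps) if cps > 0 else float("inf")
        return wait + b.cost / b.base_cps
    return min(buildings, key=score)
```

The interesting part is that "best ROI" isn't just cost-per-CpS: the waiting time to afford the purchase is itself an opportunity cost, which is exactly the dilemma from the opening paragraph.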
Sessions 1 & 2: Spec-First Works — Until You Refactor
In the first session, I gave Claude Code the spec and asked it to build the engine. It produced working code quickly. The core math (cost scaling, CpS calculation, tier multipliers, grandma synergy stacking) was correct on the first pass. The file structure was clean: data at the top, pure functions in the middle, simulation loop at the bottom.
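For a sense of what "grandma synergy stacking" means here, a hedged sketch of a CpS calculation with one synergy bonus; the multiplier shape (+1% to farm output per grandma) is my assumption for illustration, not the project's actual formula.

```python
# counts: buildings owned, e.g. {"cursor": 2, "grandma": 5, "farm": 10}
# base_cps: per-unit output, e.g. {"cursor": 0.1, "grandma": 1.0, "farm": 8.0}

def total_cps(counts: dict[str, int], base_cps: dict[str, float]) -> float:
    # Base production: each building contributes count * per-unit CpS.
    cps = sum(counts[b] * base_cps[b] for b in counts)
    # Illustrative synergy: each grandma boosts every farm's output by 1%.
    farm_base = counts.get("farm", 0) * base_cps.get("farm", 0.0)
    cps += farm_base * 0.01 * counts.get("grandma", 0)
    return cps
```

This kind of formula is easy to specify unambiguously, which is presumably why it survived the first pass intact.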
The pattern: spec quality determines output quality. When I gave Claude Code a detailed, unambiguous spec, it delivered. When I was vague, I got code that worked but needed rethinking.
But even with a good spec, I noticed a habit. When I asked for an end-of-simulation report, Claude Code wrote fresh functions to gather the data it needed, even though I'd already built functions that computed the same values for the simulation loop. It added a total_baked field to the game state dataclass and never referenced it anywhere. Maybe Claude Code was testing me. Who knows. It solves each new requirement independently rather than looking at what already exists.
Session 2 is where things got interesting. I asked Claude Code to refactor the single file into three: game engine, strategies, and simulation runner. It produced a clean architecture. A Strategy abstract base class with a single method, typed return values, proper separation. The dependency graph had no cycles. Textbook stuff.
Then I ran the code and found bugs I'd already fixed. The same ones. Manually. In the previous session.
In session 1, I'd manually corrected a double-computation issue in the simulation loop and fixed a reporting function that was recomputing values instead of receiving them. During the refactor, Claude Code put both problems back. It doesn't have memory of why your code looks the way it does. Every refactor is a fresh start from its understanding of the goal, not from your history of deliberate improvements.
And so I went looking for solutions. Could I write better CLAUDE.md directions to preserve certain patterns? More explicit prompting? I tried restricting what it was allowed to change, but that created a different problem: Claude Code became a typing machine, just executing instructions without the design intelligence that makes it useful in the first place. There's a tension between giving it freedom to make good decisions and preventing it from undoing yours.
What actually helped was simpler: start a fresh session for each phase, with a clear recap of what's done and what's next. A fresh context with a focused brief ("here's the codebase, here's what's working, here's the one thing I need you to do") consistently produced better results than continuing a session full of corrections and pivots.
This became my most important rule: after any structural refactor, the first thing you check is whether it undid something you already fixed.
Session 3: Where Spec-First Hit Its Limits
Building the LLM strategy required a different kind of work. I couldn't just write a spec and hand it off. The design problem was genuinely iterative.
The architecture I landed on was a planner: ask the LLM to plan the next 10 purchases given the full game state, execute them sequentially, and re-plan when the list runs out. Claude Code implemented this cleanly. But the behavior surprised me.
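The loop itself is simple; the sketch below shows the shape I mean, with `plan_fn` standing in for the LLM call (a hypothetical stand-in, as are the state keys).

```python
# state keys assumed for illustration: cookies, cps, costs, cps_gain
def run_planner(state: dict, ticks: int, plan_fn) -> dict:
    plan: list[str] = []
    for _ in range(ticks):
        if not plan:
            # Re-plan only when the purchase queue runs dry.
            plan = list(plan_fn(state))
        name = plan[0]
        if state["cookies"] >= state["costs"][name]:
            # Execute the next planned purchase in order.
            state["cookies"] -= state["costs"][name]
            state["cps"] += state["cps_gain"][name]
            plan.pop(0)
        # One tick of production, whether or not we bought anything.
        state["cookies"] += state["cps"]
    return state
```

Note the failure mode this structure invites: if the head of the plan is expensive, the loop just saves toward it, tick after tick, with no mechanism to reconsider — which is exactly the behavior I saw.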
The LLM had everything it needed. Ticks remaining, current CpS, how each purchase would affect CpS, all available options and their costs. And yet, it planned to waste the last 1,700 ticks doing nothing. It identified an expensive building as the best ROI, committed to saving for it, and didn't buy anything else while waiting. When I increased the call frequency, asking for only the next purchase, it performed noticeably better. Higher API bills, better results. Apparently money talks. (That's a discussion for another post.)
But here's why this matters for the Claude Code story. Even with perfect information in a fully deterministic system, the LLM needed regular check-ins to stay on track. Now imagine giving Claude Code all your project requirements upfront and expecting it to execute a multi-phase architecture without course correction. If deterministic math problems need re-planning, software design certainly does. This was the moment where I couldn't just write a better spec. The spec was fine. The implementation was correct. The problem was a design tradeoff I could only discover by running the code and being surprised.
Using the Tool to Check the Tool
Session 3 also surfaced a trust problem. Claude Code was producing dead code (unused variables, unreachable if-blocks) even when following my directions precisely. I caught some in review. I caught others by mentally running the logic. But I'm not senior enough to be confident I was catching everything, and I don't have a senior engineer to lean on.
So I tried something: in a fresh session, I asked Claude Code to review the codebase like an experienced mentor and write its findings to a review.md file. It produced a structured list of issues, categorized by severity. Most findings were cosmetic: missing docstrings, suggestions to rename variables. But buried in there was a real one: a dead if-block in the game logic, an unreachable branch that I'd walked right past during my own review.
I curated the list. Removed the noise, kept the findings I agreed with. Then I opened another fresh session and told it to fix the issues listed in review.md. It did, cleanly. No side effects, no "improvements" I didn't ask for.
The workflow: build → new session to review → curate findings → new session to fix. Each session has a single, well-defined job. The context stays clean. You can extend this to architecture review, test generation, or documentation. The key insight is that review and implementation should never happen in the same session. Mixing them is how you get silent regressions.
The Patterns
After four sessions, here's my working model for Claude Code:
Phase the work. Small, clear, well-defined bites. One session per phase, with a fresh context and a clear brief. This solves the majority of issues I encountered: lost intent, dead code accumulation, silent regressions. There's evidence for this in the project itself. In session 4, I gave Claude Code a precise architectural refactor: exact folder structure, where each file should go, what should change and what shouldn't. It executed perfectly. No dead code, no regressions, no unsolicited improvements. The same tool that generated unused dataclass fields during open-ended feature work produced spotless output when the task was bounded and clear.
After any refactor, check for regressions. This is the number one failure mode. Claude Code doesn't carry your design intent across structural changes. It rebuilds from its understanding, not your history. Treat every refactor output like a code review, not a delivery.
Give the spec, not the steps. Tell Claude Code what should happen, not how to implement it. It sometimes chose simpler representations than what I'd planned, and they were better. Leave room for it to make implementation decisions.
You decide architecture. If you're delegating decisions you don't understand, stop and think. Claude Code implements faithfully, but it won't push back and say "this structure won't scale" or "you'll regret this when you add feature X." That's your job.
The less you know, the more it costs. When I knew exactly what the output should look like, Claude Code was fast and I could verify the results in minutes. When I was in unfamiliar territory, I burned through iterations, tokens, and time. The cost of using these tools scales with your uncertainty. The clearer your mental model of what "done" looks like, the faster and cheaper you get there. Claude Code doesn't close that gap. It makes the gap cheaper to navigate, but the gap is still yours.
The Judgment Layer
Claude Code is a strong first-draft tool. It gets mechanics right, produces clean structure, and follows specs faithfully. But it doesn't carry forward your design intent across refactors. It doesn't anticipate how code needs to evolve. And it won't push back on decisions that will cause problems later.
What you bring as the engineer is the judgment layer: reviewing for regressions, catching duplicated logic, making structural decisions about what the code needs to become, and iterating on design tradeoffs that require understanding the why behind the code, not just the what.
For a small project like this, I could see the right architecture. The question I'm still sitting with: what happens when I can't?
The full project is on GitHub. Built with Claude Code (Sonnet 4.6), four sessions, roughly a week of part-time work.