Testing AI reasoning where Stack Overflow can’t help
Last week, I gave two frontier AI models the same task: write a fully functional Tetris game in 6510 assembly language for the Commodore 64.
One produced a playable game on the first iteration. The other produced a black screen with garbage characters.
This isn’t a story about which AI is “better.” It’s about what happens when you strip away the safety nets of modern programming and force models to reason from first principles.
Why the Commodore 64?
Modern coding benchmarks have a problem: saturation. When you ask an AI to “reverse a linked list in Python,” you’re not testing reasoning — you’re testing recall. That exact problem, with minor variations, exists thousands of times in training data.
The Commodore 64 is different. Released in 1982, it has:
- 64KB of RAM (only about 38KB of it free from BASIC)
- A 1MHz processor (your phone is 3,000x faster)
- No hardware floating-point math
- No operating system in the modern sense
You can’t copy-paste solutions from Stack Overflow. The “standard” approaches don’t exist. And when something breaks, there’s no helpful error message — just a frozen screen or visual garbage.
I’ve been using this constraint as a personal benchmark for AI models. I call it the Commodore 64 Constraint.
The Toolchain
To make this test fair and repeatable, I built a Python-based toolchain that connects to the VICE emulator. It works like this:
Code → Compile (cc65) → Inject into VICE → Read Screen RAM → AI analyzes result → Iterate
The key innovation: the AI can “see” what’s happening on the C64 screen. A Python script reads the emulator’s memory and converts it to ASCII, giving the model visual feedback on whether its code actually works.
Both models used the exact same toolchain, same compiler, same emulator settings.
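To make the screen-reading step concrete, here is a minimal Python sketch of the idea. It assumes VICE was started with the -remotemonitor option (the text monitor listening on TCP port 6510) and that memory-dump lines carry the usual “>C:” prefix; the real toolchain in the repository handles this more robustly, so treat the parsing here as illustrative.

import socket

SCREEN_RAM = 0x0400            # default C64 screen memory
COLS, ROWS = 40, 25            # 40x25 character display

def screencode_to_ascii(code):
    # Map a C64 screen code to a rough ASCII equivalent.
    code &= 0x7F                              # drop the reverse-video bit
    if code == 0x20:
        return " "
    if 1 <= code <= 26:                       # screen codes 1-26 are A-Z
        return chr(ord("A") + code - 1)
    if 0x30 <= code <= 0x39:                  # digits keep their ASCII values
        return chr(code)
    return "#"                                # everything else: generic block

def read_screen(host="localhost", port=6510):
    # Ask the VICE text monitor to dump screen RAM, then render it as text.
    end = SCREEN_RAM + COLS * ROWS - 1
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(f"m {SCREEN_RAM:04x} {end:04x}\n".encode())
        sock.settimeout(0.5)
        data = b""
        try:
            while chunk := sock.recv(4096):
                data += chunk
        except socket.timeout:
            pass
    raw = []
    for line in data.decode(errors="ignore").splitlines():
        if not line.startswith(">C:"):        # dump lines look like ">C:0400  20 20 ..."
            continue
        for token in line[3:].split()[1:]:    # skip the address, keep hex byte tokens
            if len(token) == 2 and all(c in "0123456789abcdefABCDEF" for c in token):
                raw.append(int(token, 16))
    rows = [raw[r * COLS:(r + 1) * COLS] for r in range(ROWS)]
    return "\n".join("".join(screencode_to_ascii(b) for b in row) for row in rows)

print(read_screen())

A 40x25 block of text like this is what gives the model its visual feedback: it can judge whether a piece actually appeared where it intended.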
Repository: C64AIToolChain on GitHub
Round 1: Claude Opus 4.5
Claude (running as an agent in GitHub Copilot) was released two days before this test. I had no idea what to expect.
First iteration: The game ran. Not perfectly; there were random blocks appearing where they shouldn’t, and the pieces flickered during movement. But the core was there: pieces spawned, fell, responded to joystick input, and the score displayed correctly.
The bugs were debugging problems, not bootstrapping problems.
The fixes, in order:
- Phantom blocks: Pointer corruption in zero-page memory. The screen position calculator was overwriting variables used by the piece renderer. Solution: dedicated pointer variables (see the sketch after this list).
- Flickering: The new position was drawn before the old one was erased. Fixed with VBlank synchronization: updating the screen only during the vertical blanking interval, while the display isn’t being drawn.
- Lines not clearing: The X register was being corrupted mid-loop by subroutine calls. Switched to a dedicated zero-page variable for loop counting.
- Carry flag bugs: Missing CLC instructions before additions caused address calculation errors. A classic 6502 gotcha.
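The phantom-blocks bug is the most instructive of the four, because it comes straight from the 6502’s reliance on a tiny pool of shared zero-page variables. Here is a hypothetical Python analogy (the names are invented, not taken from either implementation): two routines share one scratch pointer, and a nested call silently clobbers it.

zp = {"ptr": 0x0000}                          # one shared zero-page "pointer"

def screen_address(row, col):
    zp["ptr"] = 0x0400 + row * 40 + col       # reuses the shared pointer as scratch
    return zp["ptr"]

def draw_piece(row, col, shape):
    zp["ptr"] = 0x2000 + shape * 16           # renderer parks the shape address here...
    target = screen_address(row, col)         # ...and this call overwrites it
    shape_addr = zp["ptr"]                    # now points into screen RAM, not the shape data
    return target, shape_addr

print(draw_piece(5, 3, 2))                    # both values are 0x04cb: garbage gets drawn

Giving each routine its own dedicated pointer variable, as in Claude’s fix, makes the clobbering impossible.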
After each fix, the game improved visibly. The progression was linear: broken → less broken → working → polished.
Final result: Complete Tetris with all 7 pieces, rotation, line clearing, level progression, and a demo mode where the AI plays itself.
Round 2: Gemini 3
Gemini had previously passed my BASIC Tetris challenge, where it outperformed Claude 4.0 and GPT-4. I expected a strong showing.
First iteration: Black screen. A few nonsense characters scattered randomly. No recognizable game structure.
This wasn’t a bug to fix — it was a failure to bootstrap. The code compiled, but produced nothing resembling Tetris.
The next 20 iterations were a struggle. Unlike Claude’s linear progression, Gemini’s debugging was circular:
- Fix the screen initialization → break the piece rendering
- Fix the rendering → break the collision detection
- Fix the collision → reintroduce the screen bug
The model also got stuck on a “ghost piece” feature (showing where the current piece will land). It kept trying to render white dots under the falling tetrominoes, but the feature never worked correctly. The final README presents this as a feature, but in practice, it was a distraction that consumed iterations without improving core functionality.
After 20+ iterations, the game reached a stable state, but “stable” isn’t “complete.” Pieces fall, rotate, and lock. But the accumulation display is broken: you can’t clearly see the locked pieces building up. The visual feedback that makes Tetris playable is compromised.
What the Comparison Reveals
The difference isn’t intelligence: both models clearly “understand” what Tetris is and how 6502 assembly works. The difference is systems coherence, the ability to fix one thing without breaking another.
The Smoking Gun: A Carry Flag Bug
After the test, I ran a detailed code analysis on both implementations. What I found explains Gemini’s broken accumulation display perfectly.
On the 6502 processor, the ADC (Add with Carry) instruction always includes the carry flag from the previous operation. If you forget to clear it, your math is off by one. This is a classic 6502 gotcha.
Gemini’s board index calculation:
adc ptr_lo ; 10y
stx ptr_lo ; Store X
adc ptr_lo ; ⚠️ NO CLC! If carry=1, adds 10y+x+1 instead of 10y+x
Claude’s version:
asl ; 10y
clc ; ✅ Always clear carry
adc test_x ; Safe: exactly 10y+x
One missing CLC instruction. A single byte. That’s why pieces occasionally locked in the wrong position, creating gaps in the accumulation display.
This isn’t a “Gemini is bad at assembly” story. It’s a “low-level programming is unforgiving” story. Claude happened to defensively clear the carry flag before every addition. Gemini didn’t. In a modern high-level language, the compiler shields you from this distinction. On a 6502, it’s the difference between working and broken.
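To make the off-by-one concrete, here is a tiny Python model of the 6502’s add-with-carry semantics (purely illustrative; the values are arbitrary):

def adc(a, m, carry):
    # 6502 ADC: result = A + M + C, truncated to 8 bits, plus a new carry out
    total = a + m + carry
    return total & 0xFF, 1 if total > 0xFF else 0

y, x = 12, 7                  # board row and column
a = 10 * y                    # accumulator already holds 10*y

print(adc(a, x, 0))           # (127, 0): correct index 10*y + x, carry cleared by CLC
print(adc(a, x, 1))           # (128, 0): stale carry leaks in, the piece lands one cell off

That stale bit is invisible in the source listing unless you know to look for it, which is why defensively clearing the carry before every addition is such a cheap habit on the 6502.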
Different Strengths
Claude treated the C64 like an embedded system with interdependent components. When fixing the flickering, it considered the implications for memory layout and timing. It also implemented a sophisticated AI demo mode that analyzes the board and makes strategic decisions.
Gemini focused on visual features — ghost pieces, next-piece preview, color-enhanced tooling. Its approach to the code was more “modern”: clean segment organization, separate data arrays. But it treated bugs as isolated problems, leading to a whack-a-mole pattern where fixing one thing broke another.
The pattern: Gemini excels at high-level features and user experience. Claude excels at low-level correctness and algorithmic robustness. Both are valuable, just in different phases of development.
The “Modern Code” Signal
Here’s something interesting: both models wrote code that looks like 2025 code running on 1982 hardware.
Original C64 code from the 1980s used:
- Short, cryptic labels (L1, VAL, chk_c)
- Spaghetti logic with endless JMP statements
- Magic numbers everywhere
Both AI models used:
- Descriptive labels (check_collision, move_timer, head_idx)
- Structured subroutines with clear separation of concerns
- Constants and comments explaining the logic
This suggests neither model is simply retrieving historical code from training data. They’re translating modern software engineering principles into the constraints of ancient hardware — exactly what the benchmark is designed to test.
Limitations of This Test
I want to be honest about what this doesn’t prove:
- Sample size of one. This is a single task, tested once per model. A rigorous benchmark would need multiple runs, multiple tasks, and statistical analysis.
- Human in the loop. I guided both models through debugging. A different human might have gotten different results.
- Version sensitivity. Gemini’s performance on BASIC Tetris was strong. Maybe assembly specifically hits a weakness. Maybe a future version fixes it.
- The “I’m testing myself” problem. Claude is helping me write this article. Draw your own conclusions about that.
What This Means for AI Evaluation
Current benchmarks measure whether AI can produce correct code in forgiving environments. The Commodore 64 Constraint measures something different: can AI produce working systems under hard resource limits?
This matters because real-world engineering often involves constraints. Embedded systems, legacy codebases, performance-critical applications — these are domains where “it compiles” isn’t enough.
The C64 strips away the abundance of modern computing and asks a simpler question: Can you actually engineer a solution, or just recall one?
Based on this test, both models can reason about assembly — but they reason differently. Claude produced bulletproof core logic; Gemini reached for visual polish. For a production game, you’d want Claude’s foundation with Gemini’s UI features ported on top.
The real winner? The Commodore 64, still teaching programmers humility after 43 years.
Try It Yourself
The complete toolchain is open source:
GitHub: C64AIToolChain
Both Tetris implementations are included. Run them, compare them, improve them. If you get better results with Gemini (or any other model), I’d genuinely like to know.
The Commodore 64 has been teaching programmers humility for 43 years. It turns out it teaches AI the same lesson.



