Stat Phantom

Posted on Jun 23

I Built the First Purely Learned Frame-by-Frame Tetris AI: Then It Started Cheating

#ai #machinelearning #deeplearning #tetris

Greetings all! You might know me from my Snake AI ablation series where I spent an unreasonable amount of time teaching a snake to eat apples. This is a new series. Same researcher, different game, significantly worse life decisions.

This post is about Tetris. Specifically, it is about building what is, to my knowledge, the first AI agent to play frame-by-frame NES Tetris from raw pixels with no handcrafted observations, no shaped rewards, no enumerated placements, and no warm-start. (That claim is scoped to frame-level control from pixels; it is not a field-wide claim.) Pixels in, button presses out, reward only.

At its peak it reached NES level 21.

Then it started aiming pieces directly into the stack on purpose. And when I tried to fix that, everything got worse.

That's the post.

What "Purely Learned" Actually Means

Before anything else I need to define the constraint, because "purely learned" does a lot of work here and the definition is what makes this hard.

The standard approach to Tetris AI, the one that actually works, treats each piece placement as a single action. You enumerate the ~40 legal positions and rotations for a given piece, score each one, pick the best. The agent never has to figure out how to physically move or rotate a piece because the action space just skips that entirely. It picks a destination and the piece teleports there.
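To make the "teleporting" action space concrete, here is a minimal sketch of placement enumeration. It is not my baseline's code: the `drop`, `enumerate_placements`, `score`, and `best_placement` names are hypothetical, only the I piece is defined, and the hand-written `score` is a placeholder for whatever learned value the real agent would use.

```python
import numpy as np

ROWS, COLS = 20, 10

# Rotations of the I piece as (row, col) offsets. Other pieces omitted for brevity.
I_PIECE = [
    [(0, 0), (0, 1), (0, 2), (0, 3)],  # horizontal
    [(0, 0), (1, 0), (2, 0), (3, 0)],  # vertical
]

def drop(board, cells, col):
    """Hard-drop `cells` at column `col`; return the new board, or None if it cannot fit."""
    if any(col + dc >= COLS for _, dc in cells):
        return None

    def fits(row):
        return all(row + dr < ROWS and board[row + dr, col + dc] == 0
                   for dr, dc in cells)

    if not fits(0):
        return None
    row = 0
    while fits(row + 1):  # slide down until the next row would collide
        row += 1
    out = board.copy()
    for dr, dc in cells:
        out[row + dr, col + dc] = 1
    return out

def enumerate_placements(board, rotations):
    """Yield ((rotation, column), resulting_board) for every legal hard drop."""
    for rot, cells in enumerate(rotations):
        for col in range(COLS):
            result = drop(board, cells, col)
            if result is not None:
                yield (rot, col), result

def score(board):
    """Placeholder heuristic (assumption): reward full rows, lightly penalise height."""
    full_rows = int(board.all(axis=1).sum())
    height = ROWS - int(np.argmax(board.any(axis=1))) if board.any() else 0
    return 10 * full_rows - 0.1 * height

def best_placement(board, rotations):
    """Pick the destination with the highest score; the piece 'teleports' there."""
    return max(enumerate_placements(board, rotations), key=lambda p: score(p[1]))[0]

board = np.zeros((ROWS, COLS), dtype=int)
board[19, :6] = 1  # bottom row filled except the four rightmost cells
print(best_placement(board, I_PIECE))
```

Notice how much the helper functions already know: what a legal drop is, how gravity works, which rotations exist. None of that is available to the frame-level agent.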

That approach is powerful. My own placement baseline hit an average NES score of ~210,000 (max ~5.9M) using C51 over roughly 40 enumerated actions. The pixels-to-score pipeline works fine at that level.

But enumeration is a handcrafted prior. You're injecting the knowledge that pieces have legal placements, that rotations are discrete, that the board can be abstracted into a set of possible drop positions. The agent didn't learn any of that. You handed it over. So for this project, that's disqualified.

The constraint is: raw board pixels as input, 18 discrete button-combination actions as output, reward signal only (line clears, lock nudge, death penalty), nothing else. The agent has to discover from scratch how to move pieces, how to rotate them, how to drop them, and where to put them. The achievement lives entirely in the scaffolding people quietly remove. Take away that cheat sheet and you have the open problem.

As of the most recent published work I could find, this remains explicitly unsolved. Liu and Liu tried Dreamer, DrQ, and Plan2Explore on frame-level NES Tetris from pixels and concluded none of them learned to clear lines. Every paper I found that successfully trains a Tetris agent either uses engineered board features, enumerates placements, or leans on reward shaping heavy enough to constitute a curriculum.

So why is it hard?

Every Flat Agent I Trained Died at the Same Step

By "flat" I mean a single neural network processing the board state and emitting frame-level actions. No hierarchy, no subgoals, just one agent doing everything.

I ran four separate flat Rainbow-C51 agents on frame-level NES Tetris with different reward configurations: potential-based reward shaping, half-weight shaping, no shaping at all, and a lock-nudge variant. The results were the same every time.

Run	Reward shaping	Peak avg score	Collapse episode	Post-collapse avg
`1024env_ars`	Uncertain (older run)	2,478	ep ~265k	~210
`shHalf_rl2x`	Half PBRS	269	ep 110k	~175
`noShape`	None	352	ep 600k	~280
`rewardshift`	Lock nudge + line reward	378	ep 470k	~235

Every single one climbed, peaked, then collapsed. Not "plateaued." Not "converged to a suboptimal policy." Collapsed. The agent that had been playing passably started playing like it had forgotten everything it knew.

ALARM BELLS. Setting an all-time record is a warning sign, not a milestone. The pattern is consistent across all four runs: peak, all-time record, collapse within 10-30k episodes. If your flat agent just hit its best-ever score, start a timer.

The collapse has a distinct fingerprint. The training loss stays completely smooth across it (this is not a numerical instability). What actually crashes is the per-layer NoisyLinear σ/μ ratio, dropping ~20% in a single 10k-episode window after weeks of 1-2% smooth decay. Simultaneously, episodes-per-second falls 4-5× as the agent abandons fast play. It doesn't blow up. It just... stops knowing things.
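If you want to watch for this fingerprint yourself, a minimal monitoring sketch might look like the following. The exact ratio definition (mean |σ| over mean |μ| per layer), the 15% threshold, and the function names are assumptions for illustration, not the telemetry code from these runs.

```python
import numpy as np

def sigma_mu_ratio(weight_mu, weight_sigma):
    """Mean |sigma| over mean |mu| for one NoisyLinear layer (assumed definition)."""
    return float(np.abs(weight_sigma).mean() / (np.abs(weight_mu).mean() + 1e-12))

def relative_drops(ratios):
    """Fractional drop between consecutive logging windows."""
    r = np.asarray(ratios, dtype=float)
    return (r[:-1] - r[1:]) / r[:-1]

def crash_windows(ratios, threshold=0.15):
    """Indices of windows whose ratio fell by at least `threshold` versus the previous one."""
    return [i + 1 for i, d in enumerate(relative_drops(ratios)) if d >= threshold]

# Illustrative numbers only: slow ~1.5% decay, then a ~20% drop in one window.
history = [1.00, 0.985, 0.97, 0.955, 0.76]
print(crash_windows(history))
```

The loss curve will not warn you, so a cheap per-window check like this is worth logging alongside episodes-per-second.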

The more striking detail is when it happens. The noShape run collapsed at episode 600k. The rewardshift run collapsed at episode 470k. Different episode counts, different reward shapes. But noShape at ep 600k equals roughly 1.47M gradient steps, and rewardshift at ep 470k equals roughly 1.37M gradient steps. There is a death clock at ~1.4M gradient steps. Change the reward, the shaping, the exploration strategy: flat agents die at the same odometer reading regardless.

Reward shaping changes when the agent reaches its peak and from what altitude it falls. It does not change whether it falls.

I also tried removing NoisyNet exploration, adding an ε-floor, and increasing the n-step horizon to 20. The ε-floor is the best example of what happens when you try to outsmart this thing. It was supposed to maintain exploration and prevent collapse. What it actually did: made the agent climb slower, so it collapsed later, at the exact same peak (avg 378) and episode count (470k) as the run it was meant to save. The scenic route to the identical cliff.

The n=20 run is its own story. Off-policy bias with uncorrectable n-step returns broke learning entirely. The average return declined below random baseline and pinned at −4.2 for 510,000 episodes, while the loss kept decreasing. The agent grew more and more confident about a policy worse than doing nothing. Incredibly wasteful.
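For readers who have not met the issue before, this is the general shape of an uncorrected n-step target. It is a generic sketch, not the training code from the run; `n_step_target` and its arguments are illustrative names.

```python
def n_step_target(rewards, bootstrap_value, gamma, n):
    """Uncorrected n-step target.

    `rewards` are the rewards stored in replay after the sampled state. They were
    produced by whatever policy was acting at collection time, so with a large n
    the target trusts n possibly stale actions with no importance correction.
    Pass bootstrap_value=0.0 if the episode ended inside the window.
    """
    n = min(n, len(rewards))
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    return discounted + (gamma ** n) * bootstrap_value

# Small worked case: 1 + 0.5 * 1 + 0.25 * 10 = 4.0
print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, n=2))
```

At n=20 the first term covers twenty frames of behaviour the current policy might never choose, which is one plausible reading of why the loss can keep falling while the return gets worse.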

The diagnosis: within-piece credit assignment. A piece takes tens of frames to place. The only reward (a line clear) arrives long after, attributable to a chain of low-level actions spanning the entire drop sequence. A flat agent has to bridge that full horizon directly. At around 1.4M gradient steps, whatever representation it built stops being stable enough to support continued improvement, and everything falls apart. So what's the fix?

The Architecture That Didn't Collapse

The fix is a manager/worker decomposition. A manager that decides where a piece should go (once per piece lock), and a worker that figures out how to physically get it there (every frame).

The manager operates on board pixels, runs a C51 distributional head over a spatial map, and emits a goal: a target row, column, and rotation for the current piece. This goal is passed to the worker via what I'm calling the Feudal Goal Interface (FGI) with a spatial codec. The goal is an absolute board coordinate and rotation, not an enumerated placement index. The manager picks anywhere, legal or not (this becomes relevant shortly).

The worker operates on an egocentric observation of the board, receives the goal from the manager, and earns a dense per-frame reach reward: a goal-distance gradient that fires every frame as it moves the piece closer to the target. Double-DQN, dueling scalar head, 18 actions.
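A minimal sketch of what a per-frame reach reward can look like. The post only commits to "a goal-distance gradient", so the specific metric (Manhattan distance plus wrapped rotation steps), the `scale` value, and the `Pose`, `goal_distance`, and `reach_reward` names are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pose:
    row: int
    col: int
    rot: int  # rotation index, assumed 0..3

def goal_distance(pose, goal, n_rot=4):
    """Manhattan distance on the board plus the shortest number of rotation steps."""
    rot_gap = abs(pose.rot - goal.rot) % n_rot
    rot_gap = min(rot_gap, n_rot - rot_gap)
    return abs(pose.row - goal.row) + abs(pose.col - goal.col) + rot_gap

def reach_reward(prev_pose, pose, goal, scale=0.1):
    """Positive when this frame moved the piece closer to the goal, negative when farther."""
    return scale * (goal_distance(prev_pose, goal) - goal_distance(pose, goal))

goal = Pose(row=18, col=4, rot=1)
before = Pose(row=2, col=6, rot=0)
after = Pose(row=2, col=5, rot=0)   # one step left, toward the goal column
print(reach_reward(before, after, goal))
```

Nothing in this function knows what a hole or a line is. It only measures distance to a coordinate that something else chose.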

A sharp reader will object here: "you said no shaped rewards, but the worker gets a dense per-frame reward. That's shaping." It isn't, and the distinction matters. The reach reward contains zero Tetris knowledge. It doesn't say "holes are bad" or "keep the stack flat" or anything about piece geometry. It says "you are this far from a coordinate." The goal it points toward is generated by the manager from pixels. It's internal manager-to-worker communication, not injected knowledge from outside. The system's only external inputs are pixels and the game reward. Everything else, including the goal coordinate itself, is learned from scratch inside the hierarchy.

The key is the timescale split. The manager acts once per piece lock, so its credit horizon is measured in placements, not frames. The within-piece credit assignment problem that kills flat agents simply doesn't exist at the manager's level. The worker gets a dense per-frame signal that makes the low-level movement problem tractable. Each level of the hierarchy gets a horizon it can actually handle.
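The timescale split can be sketched as a control loop. `ToyEnv` is a stand-in that locks a piece every few frames and pays a placeholder reward on lock; it is not the NES harness, and `run_episode`, `pick_goal`, and `pick_action` are hypothetical names.

```python
class ToyEnv:
    """Stand-in environment: every piece locks after a fixed number of frames."""

    def __init__(self, frames_per_piece=5, pieces=3):
        self.frames_per_piece = frames_per_piece
        self.pieces = pieces
        self.frame = 0

    def step(self, action):
        self.frame += 1
        locked = self.frame % self.frames_per_piece == 0
        done = self.frame >= self.frames_per_piece * self.pieces
        task_reward = 1.0 if locked else 0.0  # placeholder reward
        return task_reward, locked, done

def run_episode(env, pick_goal, pick_action):
    """Manager decides once per piece lock; worker decides every frame."""
    manager_transitions = []  # one (goal, summed task reward) entry per piece
    worker_steps = 0
    goal, piece_return = pick_goal(), 0.0
    done = False
    while not done:
        action = pick_action(goal)            # worker: frame-level decision
        task_reward, locked, done = env.step(action)
        worker_steps += 1
        piece_return += task_reward
        if locked:                            # manager: piece-level decision
            manager_transitions.append((goal, piece_return))
            if not done:
                goal, piece_return = pick_goal(), 0.0
    return manager_transitions, worker_steps

transitions, steps = run_episode(ToyEnv(), lambda: (19, 0, 0), lambda goal: 0)
print(len(transitions), steps)
```

In this toy run the manager sees one transition per piece while the worker sees one per frame, which is the whole point: the manager's credit horizon is counted in placements.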

I ran this architecture (the miss05 configuration, more on that shortly) to 1.38M gradient steps with no flat-style collapse. No σ crash. No speed drop. It plateaus, it does not fall apart. The hierarchy carries the within-piece horizon past the zone that kills flat nets.

One caveat worth stating clearly: the hierarchy is not failure-proof. An earlier run (coplay_reach_noshape) crashed hard: clears peaked at 0.032 then fell to 0.005. The root cause was the manager's C51 value head inheriting a support range of [-20, 1000] from the placement network (where scores reach into the thousands). Co-play manager returns are roughly 0-30. With 101 atoms spread over that range, the manager had maybe 3 atoms of actual resolution for the values it was seeing, effectively value-blind, lurching into the same degenerate corner-basin every ~90-100k steps and crawling back out.

The fix was recalibrating the support to [-10, 30] (0.40 per atom over the actual return scale). Match your support to your actual returns. Obvious in hindsight. The hierarchy avoids the specific flat 1.4M-gradient-step collapse, but it has its own failure modes if misconfigured.
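The arithmetic is easy to check. This sketch assumes the usual C51 layout of evenly spaced atoms between the support bounds; the helper names are illustrative.

```python
import numpy as np

def c51_support(v_min, v_max, n_atoms=101):
    """Evenly spaced atom locations, as in a standard C51 head."""
    return np.linspace(v_min, v_max, n_atoms)

def atom_spacing(v_min, v_max, n_atoms=101):
    """Distance between neighbouring atoms."""
    return (v_max - v_min) / (n_atoms - 1)

def atoms_in_range(v_min, v_max, lo, hi, n_atoms=101):
    """How many atoms fall inside the range of returns the agent actually sees."""
    z = c51_support(v_min, v_max, n_atoms)
    return int(((z >= lo) & (z <= hi)).sum())

# Inherited placement-network support: 10.2 per atom.
print(atom_spacing(-20, 1000), atoms_in_range(-20, 1000, 0, 30))

# Recalibrated support: 0.4 per atom.
print(atom_spacing(-10, 30), atoms_in_range(-10, 30, 0, 30))
```

With the inherited support, only a handful of atoms land inside the 0-30 band of co-play returns. After recalibration, most of the 101 atoms do.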

Then It Started Cheating

As the hierarchy learned, I noticed something in the telemetry. The reach% metric (the percentage of pieces where the worker actually reached the manager's goal) was falling. Not spiking. Not oscillating. Steadily falling, over hundreds of thousands of episodes, as performance climbed.

And tgt_depth (a measure of where the manager was aiming, with negative values being legal placements above the stack and positive values being positions inside the stack) was heading positive. Deep positive.
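One way to compute a depth metric with that sign convention is sketched below. This is a guess at a workable definition, not the telemetry code: it treats the goal as a single cell, ignores the piece footprint, and measures the target row against the row where a cell would come to rest in that column.

```python
import numpy as np

ROWS, COLS = 20, 10  # row 0 is the top of the board

def surface_rows(board):
    """Row index of the highest filled cell in each column (ROWS if the column is empty)."""
    filled = board.astype(bool)
    first_filled = filled.argmax(axis=0)
    return np.where(filled.any(axis=0), first_filled, ROWS)

def tgt_depth(board, target_row, target_col):
    """0 = resting on the surface, negative = above it, positive = buried in the stack."""
    resting_row = surface_rows(board)[target_col] - 1
    return int(target_row - resting_row)

board = np.zeros((ROWS, COLS), dtype=int)
board[14:, 3] = 1  # column 3 holds a stack six cells tall

print(tgt_depth(board, 13, 3))  # on top of the stack
print(tgt_depth(board, 19, 3))  # aimed at the floor, through the stack
print(tgt_depth(board, 10, 3))  # floating above the stack
```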

The manager had discovered that aiming a piece at a spot buried inside the stack earns the same line-clear credit as a good goal, because the worker clears lines anyway. So it became the pointy-haired-boss of RL: issues garbage orders, takes credit for the work.

Episode	Avg score	Lines/ep	Reach%	Goal correlation	Clears/lock	Max NES level
10k	157	0.006	0.4%	0.249	0.0004	0
50k	219	0.089	6.3%	0.742	0.0031	0
90k	355	1.57	2.1%	0.573	0.036	0
150k	999	10.3	0.6%	0.336	0.156	7
180k	4,585	30.4	0.4%	0.174	0.275	16
192k	10,436	49.6	0.2%	0.139	0.320	17
200k	13,639 (peak avg)	58.0	0.1%	0.126	0.333	19
210k	13,044	55.8	0.1%	0.174	0.338	21
288k	7,026	38.4	0.3%	0.358	0.297	21

coplay_cap as of publication. Still training at time of writing.

Reach drops from 6.3% to 0.2%. Goal correlation falls from 0.74 to 0.14. The manager's goals become almost completely decorrelated from where pieces actually land. The conductor waves the baton, the orchestra ignores it. Here's the part that should bother you: the music improves.

At its peak, NES level 21. Record score 85,120. Average peaked at ~13,639 before declining as the board conditions got harder (more on that below). The avg is tail-inflated throughout (std consistently exceeds the mean, median at ep 192k was ~5,002), so the distribution is wide. But the level 21 is real.

At 0.2% goal accuracy.

Before you get too excited: level 21 is frantic flailing, not elegance. The board gets dirtier as capability climbs, then actually cleans up slightly at the higher levels (holes peaked at 36, back down to 15 by ep 288k, though aggregate height kept rising to ~114). The avg declined from its peak of ~13,639 at ep 200k down to ~7,026 by ep 288k. That's not a flat-style collapse (no σ crash, no speed drop, the record score kept climbing). It's what happens when the agent is consistently reaching harder board states and the score distribution gets wider and wilder. The cumulative clears by ep 288k: 4.5M singles, 782k doubles, 38.8k triples, 803 tetrises. Still mostly singles. Still winning by working frantically and never tidying up. The result is real. The style is not pretty.

So what do you do when your manager starts cheating?

Every Time I Fixed It, It Got Worse

In an earlier run I had tried miss05: a configuration specifically designed to address the illegal-goal drift visible in earlier experiments. It added two mechanisms on top of the same feudal skeleton:

reach_penalty = 0.05: a graded distance penalty docked from the manager's reward for each piece, proportional to how far the actual landing was from the goal.

miss_reward_scale = 0.5 (half-on-miss): when the worker failed to reach the goal footprint, the manager's positive reward was scaled by 0.5. Death penalty kept full.
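Read together, the two mechanisms amount to something like the sketch below. The function name, the order of operations (scale first, then subtract the distance penalty), and treating the task reward as a single scalar are assumptions; only the two constants and the "death penalty kept full" rule come from the configuration described above.

```python
def miss05_manager_reward(task_reward, landing_distance, reached,
                          reach_penalty=0.05, miss_reward_scale=0.5):
    """Manager reward for one piece under a miss05-style configuration (sketch).

    task_reward:       outcome reward for this piece (clears, lock nudge, death)
    landing_distance:  how far the piece landed from the goal
    reached:           whether the worker hit the goal footprint
    """
    reward = task_reward
    # Half-on-miss: only positive reward is scaled, so a death penalty stays full.
    if not reached and reward > 0:
        reward *= miss_reward_scale
    # Graded penalty proportional to the miss distance.
    return reward - reach_penalty * landing_distance

print(miss05_manager_reward(1.0, landing_distance=0, reached=True))
print(miss05_manager_reward(1.0, landing_distance=4, reached=False))
print(miss05_manager_reward(-5.0, landing_distance=4, reached=False))
```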

The goal was to make the manager care about whether its goals were actually reachable and legal. And it worked. miss05 ran to 901k episodes (1.38M gradient steps) with reach consistently 55-77%, goal correlation ~0.96, target depths near zero. Legal, reachable, coordinated placements. The well-behaved agent.

It capped at NES level 2.

Metric	`miss05` (enforce legality)	`coplay_cap` (drop the enforcement)
Reach%	55-77%	0.2%
Goal correlation	0.96	0.14
Goal legality (tgt_depth)	Legal (~0)	Illegal (+6, buried)
Clears/lock	0.11 (plateau)	0.32 (climbing)
Max NES level	2	17 (at the ep 192k snapshot; 21 by ep 210k)
Lines/episode	~5	~50
Grad-steps reached	1.38M	359k (still training)

The well-behaved agent is the worst one.

I need to be honest about a confound: miss05 used a conv-base manager and base-capacity worker, while coplay_cap used a wider conv manager and a scaled-up worker. Capacity and legality-enforcement co-vary across these two runs. It's not a clean single-variable ablation, and I won't pretend otherwise.

The cleaner evidence is within coplay_cap itself. At fixed capacity, reach falls from 6.3% to 0.2% as capability rises. The system actively moves away from legal goals as it improves.

Why the Manager Stopped Trying

The manager's reward is the outcome: line clears, lock nudge, death penalty. Not whether its goal was good. Not whether it was reached. Just what happened.

Once the worker is reasonably competent, it clears lines roughly independent of the exact goal. If the goal is unreachable, the worker free-plays and still clears lines. Legal goals and illegal goals earn the same outcome reward. There is no gradient pointing the manager toward legal goals. None.

So the goals diffuse. Most random placements in a 20×10 grid are buried inside or above the stack. tgt_depth drifting to +6 means "aim somewhere deep in the stack so the piece just falls." The manager hasn't broken anything. It found a way to emit a plausible-looking goal that the worker ignores, while still collecting full outcome credit. This is the credit-assignment failure that Wolpert and Tumer's "Wonderful Life Utility" was built to address, and COMA formalised a counterfactual baseline for it. I watched it happen live in the telemetry.

This compounds as the worker gets more autonomous. The more the worker free-plays competently, the less the manager's goal matters, the less the manager optimises for reachable goals, and the more it drifts. The well-behaved run (miss05) penalised the symptom and inadvertently killed the signal. The reach penalty and half-on-miss scaling forced legal goals but reduced the worker's autonomy and disconnected the dense per-frame reach gradient from actual learning. Clean, coordinated, legal, and capped at level 2.

Easy to get wrong: at 0.2% reach, the obvious read is "the manager is useless, this is basically a solo worker." WRONG. No-manager flat nets collapse (see above). The manager's actual contribution was never "aim piece, land piece." It was "give the worker a target to chase so the per-frame goal-distance gradient has direction." That target doesn't have to be legal. It just has to exist. The manager looked vestigial. It was load-bearing the whole time.

The Actual Fix (That I Haven't Run Yet)

The principled solution is a counterfactual reward, sometimes called a difference reward or Wonderful Life Utility (Wolpert & Tumer; COMA, Foerster et al. 2018). Instead of rewarding the manager for the outcome, reward it for the difference between the outcome and what the worker would have achieved with no goal at all.

Vacuous or illegal goals: outcome ≈ free-play baseline, ~0 credit. The drift stops being free.

Reachable, genuinely-helpful goals: outcome > baseline, positive credit. The manager has an actual gradient toward useful goals for the first time.

The concrete plan: run ~25% of environments in worker-only / null-goal free-play to generate a running baseline B (bucketed by board height), then route goal_advantage = task_reward - B to the manager instead of the raw task reward. Worker keeps the dense reach signal. Manager gets credit only when its goal actually helped.
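Since this is still only a design, here is a minimal sketch of how the plan could be wired up. The bucket width, the exponential-moving-average update, the choice of which environments run null-goal, and every name below are assumptions; nothing here has been run against the real agent.

```python
class HeightBucketBaseline:
    """Running free-play baseline B, bucketed by board height."""

    def __init__(self, bucket_size=4, momentum=0.99):
        self.bucket_size = bucket_size
        self.momentum = momentum
        self.values = {}  # bucket index -> running mean of task reward

    def _bucket(self, board_height):
        return board_height // self.bucket_size

    def update(self, board_height, task_reward):
        """Call only with outcomes from null-goal (worker-only) environments."""
        b = self._bucket(board_height)
        if b not in self.values:
            self.values[b] = task_reward
        else:
            self.values[b] = (self.momentum * self.values[b]
                              + (1 - self.momentum) * task_reward)

    def get(self, board_height):
        """Baseline for this height; 0.0 until the bucket has been observed."""
        return self.values.get(self._bucket(board_height), 0.0)

def is_null_goal_env(env_index, num_envs, fraction=0.25):
    """Reserve the first `fraction` of environments for worker-only free-play."""
    return env_index < int(num_envs * fraction)

def goal_advantage(task_reward, board_height, baseline):
    """Manager reward: outcome minus what free-play achieves at a similar height."""
    return task_reward - baseline.get(board_height)

baseline = HeightBucketBaseline(bucket_size=4, momentum=0.5)
baseline.update(board_height=5, task_reward=1.0)
baseline.update(board_height=6, task_reward=3.0)
print(goal_advantage(2.0, board_height=7, baseline=baseline))  # matches free-play
print(goal_advantage(3.0, board_height=4, baseline=baseline))  # beats free-play
```

A goal that does no better than free-play earns roughly zero under this scheme, which is exactly the property the raw outcome reward lacks.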

This is designed. It is not yet run. coplay_cap needs to converge first (ep 288k at time of writing, NES level 21, avg declining from peak as board conditions harden). Once it does, the counterfactual reward gets implemented.

Either the manager finally learns to aim somewhere the worker can actually reach and clear from, genuine "aim then land" coordination, or it doesn't, and that tells us something equally worth writing about.

The sequel exists either way.

Honesty checklist — read before citing anything

Single-seed throughout. Every run in this post is n=1. The trustworthy parts are the directions and the within-run inversions, not the exact crossover numbers.

"First purely learned" is harness-scoped. Model-based approaches (DreamerV3, MuZero) are on the untried bench, not ruled out. The claim is specifically about frame-level control from pixels with no handcrafted features, no shaped rewards, no placement enumeration, no warm-start.

coplay_cap was still training at publication. Numbers reflect the latest snapshot at time of writing (ep 288k). Peak avg was ~13,639 at ep 200k; the subsequent decline is not a flat-style collapse (no σ crash, no speed drop) but reflects harder board conditions at higher levels.

The flat-collapse table is from project records. The original run files were cleaned up. The pattern is well-established across all four runs; the exact figures are recalled, not freshly verified from disk.

The miss05 vs coplay_cap comparison is confounded. Capacity and legality-enforcement co-vary. The clean ablation (conv-wide + scaled worker, legality on vs off) has not been run. The within-run drift in coplay_cap is the cleaner evidence.

Collapse-survival evidence rests on miss05 (1.38M gradient steps), not coplay_cap (only 359k). And it's a cross-stack comparison, not an identical-stack ablation.

Level 21 is survival-volume, not efficient Tetris. 4.5M singles, 782k doubles, 38.8k triples, 803 tetrises cumulative by ep 288k. It's keeping the board alive, not building wells.

The counterfactual reward fix is designed, not run. Frame it as the next experiment, not a result.

I'm a CS researcher. All experiments here are personal projects run on my own hardware, entirely separate from any institutional affiliation.

References

Wolpert, D. & Tumer, K.: "Wonderful Life Utility" (counterfactual/difference rewards)
Foerster, J. et al. (2018): COMA: Counterfactual Multi-Agent Policy Gradients
Vezhnevets, A. et al. (2017): FeUdal Networks for Hierarchical Reinforcement Learning, ICML
Bairaktaris, J.A. & Johannssen, A. (2025): "Outsmarting algorithms: A comparative battle between Reinforcement Learning and heuristics in Atari Tetris," Expert Systems with Applications 277, 127251
Liu, H. & Liu, L.: "Learn to Play Tetris with Deep Reinforcement Learning," OpenReview
Algorta, S. & Şimşek, Ö. (2019): "The Game of Tetris in Machine Learning," arXiv:1905.01652

Top comments (2)

Suny Choudhary • Jun 25

This is a great reminder that “cheating” in RL usually means “your reward function told the truth too vaguely.”

The manager issuing garbage goals and still getting credit because the worker salvaged the piece is such a clean failure mode. It is also refreshing that you called out the single-seed/confound limits instead of overselling the result.

Stat Phantom • Jun 25

this is exactly what I kept trying to fix by constantly tweaking the reward function. I must have tried 20+ different iterations, but every one made everything MUCH worse, not just slightly. literally the worker reward is ONLY 'how far away am I from the manager's goal?' and that's it. the manager's reward is simply 'did I clear a line?' and that's it. I've tried to fix mesa-optimisation or reward hacking in so many ways, like a bang-on bonus if the placement lands on the intended goal, or splitting the manager's head reward to give negative feedback for illegal placements, etc. Nothing made it better or even came close. every attempt caused a significant reduction (like a 10x reduction).