DEV Community: Stat Phantom

I Built the First Purely Learned Frame-by-Frame Tetris AI: Then It Started Cheating

Stat Phantom — Tue, 23 Jun 2026 02:15:00 +0000

Greetings all! You might know me from my Snake AI ablation series where I spent an unreasonable amount of time teaching a snake to eat apples. This is a new series. Same researcher, different game, significantly worse life decisions.

This post is about Tetris. Specifically, about building what is, to my knowledge, the first AI agent to play frame-by-frame NES Tetris from raw pixels with no handcrafted observations, no shaped rewards, no enumerated placements, and no warm-start (and I mean that scoped to frame-level control from pixels, not as a field-wide claim). Pixels in, button presses out, reward only.

At its peak it reached NES level 21.

Then it started aiming pieces directly into the stack on purpose. And when I tried to fix that, everything got worse.

That's the post.

What "Purely Learned" Actually Means

Before anything else I need to define the constraint, because "purely learned" does a lot of work here and the definition is what makes this hard.

The standard approach to Tetris AI, the one that actually works, treats each piece placement as a single action. You enumerate the ~40 legal positions and rotations for a given piece, score each one, pick the best. The agent never has to figure out how to physically move or rotate a piece because the action space just skips that entirely. It picks a destination and the piece teleports there.

That approach is powerful. My own placement baseline hit an average NES score of ~210,000 (max ~5.9M) using C51 over roughly 40 enumerated actions. The pixels-to-score pipeline works fine at that level.

But enumeration is a handcrafted prior. You're injecting the knowledge that pieces have legal placements, that rotations are discrete, that the board can be abstracted into a set of possible drop positions. The agent didn't learn any of that. You handed it over. So for this project, that's disqualified.

The constraint is: raw board pixels as input, 18 discrete button-combination actions as output, reward signal only (line clears, lock nudge, death penalty), nothing else. The agent has to discover from scratch how to move pieces, how to rotate them, how to drop them, and where to put them. The achievement lives entirely in the scaffolding people quietly remove. Take away that cheat sheet and you have the open problem.

As of the most recent published work I could find, this remains explicitly unsolved. Liu et al. tried Dreamer, DrQ, and Plan2Explore on frame-level NES Tetris from pixels and concluded none of them learned to clear lines. Every paper that successfully trains a Tetris agent either uses engineered board features, enumerates placements, or leans on reward shaping heavy enough to constitute a curriculum.

So why is it hard?

Every Flat Agent I Trained Died at the Same Step

By "flat" I mean a single neural network processing the board state and emitting frame-level actions. No hierarchy, no subgoals, just one agent doing everything.

I ran four separate flat Rainbow-C51 agents on frame-level NES Tetris with different reward configurations: potential-based reward shaping, half-weight shaping, no shaping at all, and a lock-nudge variant. The results were the same every time.

Run	Reward shaping	Peak avg score	Collapse episode	Post-collapse avg
`1024env_ars`	Uncertain (older run)	2,478	ep ~265k	~210
`shHalf_rl2x`	Half PBRS	269	ep 110k	~175
`noShape`	None	352	ep 600k	~280
`rewardshift`	Lock nudge + line reward	378	ep 470k	~235

Every single one climbed, peaked, then collapsed. Not "plateaued." Not "converged to a suboptimal policy." Collapsed. The agent that had been playing passably started playing like it had forgotten everything it knew.

ALARM BELLS. Setting an all-time record is a warning sign, not a milestone. The pattern is consistent across all four runs: peak, all-time record, collapse within 10-30k episodes. If your flat agent just hit its best-ever score, start a timer.

The collapse has a distinct fingerprint. The training loss stays completely smooth across it (this is not a numerical instability). What actually crashes is the per-layer NoisyLinear σ/μ ratio, dropping ~20% in a single 10k-episode window after weeks of 1-2% smooth decay. Simultaneously, episodes-per-second falls 4-5× as the agent abandons fast play. It doesn't blow up. It just... stops knowing things.
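A minimal sketch of how that fingerprint could be watched for, assuming one σ/μ ratio is logged per checkpoint window. The function names, the 10% alarm threshold, and the sample history are illustrative assumptions, not values or code from the runs above.

```python
from statistics import fmean

def sigma_mu_ratio(sigma_weights, mu_weights):
    """Mean |sigma| divided by mean |mu| for one NoisyLinear layer."""
    return fmean(abs(s) for s in sigma_weights) / fmean(abs(m) for m in mu_weights)

def find_sigma_crashes(ratios, max_drop=0.10):
    """Return indices of windows where the ratio fell by more than
    max_drop (as a fraction) relative to the previous window."""
    crashes = []
    for i in range(1, len(ratios)):
        drop = (ratios[i - 1] - ratios[i]) / ratios[i - 1]
        if drop > max_drop:
            crashes.append(i)
    return crashes

# Hypothetical history: smooth 1-2% decay, then a ~20% drop in one window.
history = [0.500, 0.493, 0.485, 0.478, 0.382, 0.378]
print(find_sigma_crashes(history))
```

A relative-drop check like this ignores the slow decay and only fires on the single-window cliff, which is the part the loss curve does not show.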

The more striking detail is when it happens. The noShape run collapsed at episode 600k. The rewardshift run collapsed at episode 470k. Different episode counts, different reward shapes. But noShape at ep 600k equals roughly 1.47M gradient steps, and rewardshift at ep 470k equals roughly 1.37M gradient steps. There is a death clock at ~1.4M gradient steps. Change the reward, the shaping, the exploration strategy: flat agents die at the same odometer reading regardless.

Reward shaping changes when the agent reaches its peak and from what altitude it falls. It does not change whether it falls.

I also tried removing NoisyNet exploration, adding an ε-floor, and increasing the n-step horizon to 20. The ε-floor is the best example of what happens when you try to outsmart this thing. It was supposed to maintain exploration and prevent collapse. What it actually did: made the agent climb slower, so it collapsed later, at the exact same peak (avg 378) and episode count (470k) as the run it was meant to save. The scenic route to the identical cliff.

The n=20 run is its own story. Off-policy bias with uncorrectable n-step returns broke learning entirely. The average return fell below the random baseline and pinned at −4.2 for 510,000 episodes, while the loss kept decreasing. The agent grew more and more confident about a policy worse than doing nothing. Incredibly wasteful.

The diagnosis: within-piece credit assignment. A piece takes tens of frames to place. The only reward (a line clear) arrives long after, attributable to a chain of low-level actions spanning the entire drop sequence. A flat agent has to bridge that full horizon directly. At around 1.4M gradient steps, whatever representation it built stops being stable enough to support continued improvement, and everything falls apart. So what's the fix?

The Architecture That Didn't Collapse

The fix is a manager/worker decomposition. A manager that decides where a piece should go (once per piece lock), and a worker that figures out how to physically get it there (every frame).

The manager operates on board pixels, runs a C51 distributional head over a spatial map, and emits a goal: a target row, column, and rotation for the current piece. This goal is passed to the worker via what I'm calling the Feudal Goal Interface (FGI) with a spatial codec. The goal is an absolute board coordinate and rotation, not an enumerated placement index. The manager picks anywhere, legal or not (this becomes relevant shortly).

The worker operates on an egocentric observation of the board, receives the goal from the manager, and earns a dense per-frame reach reward: a goal-distance gradient that fires every frame as it moves the piece closer to the target. Double-DQN, dueling scalar head, 18 actions.

A sharp reader will object here: "you said no shaped rewards, but the worker gets a dense per-frame reward. That's shaping." It isn't, and the distinction matters. The reach reward contains zero Tetris knowledge. It doesn't say "holes are bad" or "keep the stack flat" or anything about piece geometry. It says "you are this far from a coordinate." The goal it points toward is generated by the manager from pixels. It's internal manager-to-worker communication, not injected knowledge from outside. The system's only external inputs are pixels and the game reward. Everything else, including the goal coordinate itself, is learned from scratch inside the hierarchy.

The key is the timescale split. The manager acts once per piece lock, so its credit horizon is measured in placements, not frames. The within-piece credit assignment problem that kills flat agents simply doesn't exist at the manager's level. The worker gets a dense per-frame signal that makes the low-level movement problem tractable. Each level of the hierarchy gets a horizon it can actually handle.
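To make the worker's reach reward concrete, here is a minimal sketch under stated assumptions: poses are (row, col, rotation) tuples, the distance is Manhattan distance plus the shortest rotation count, and the scale is arbitrary. None of these specifics come from the actual implementation; only the idea (reward for closing the distance to the manager's goal each frame) does.

```python
def goal_distance(piece, goal):
    """Distance from the piece pose to the goal pose.
    Poses are (row, col, rotation) with rotation in 0..3.
    Assumed metric: Manhattan distance plus shortest rotation count."""
    d_rot = abs(piece[2] - goal[2]) % 4
    d_rot = min(d_rot, 4 - d_rot)  # rotating 3 steps one way is 1 step the other
    return abs(piece[0] - goal[0]) + abs(piece[1] - goal[1]) + d_rot

def reach_reward(prev_pose, new_pose, goal, scale=0.01):
    """Positive when this frame moved the piece closer to the goal,
    negative when it moved away, zero when nothing changed."""
    return scale * (goal_distance(prev_pose, goal) - goal_distance(new_pose, goal))

goal = (18, 4, 1)                                  # hypothetical manager goal
print(reach_reward((2, 5, 0), (2, 4, 0), goal))    # one column closer
```

Note that nothing in this function looks at the board: it has no notion of holes, height, or legality, which is why the goal can be buried inside the stack and the signal still has a direction.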

I ran this architecture (the miss05 configuration, more on that shortly) to 1.38M gradient steps with no flat-style collapse. No σ crash. No speed drop. It plateaus; it does not fall apart. The hierarchy carries the within-piece horizon past the zone that kills flat nets.

One caveat worth stating clearly: the hierarchy is not failure-proof. An earlier run (coplay_reach_noshape) crashed hard: clears peaked at 0.032 then fell to 0.005. The root cause was the manager's C51 value head inheriting a support range of [-20, 1000] from the placement network (where scores reach into the thousands). Co-play manager returns are roughly 0-30. With 101 atoms spread over that range (about 10.2 per atom), the manager had maybe 3 atoms of actual resolution for the values it was seeing. It was effectively value-blind, lurching into the same degenerate corner-basin every ~90-100k steps and crawling back out.

The fix was recalibrating the support to [-10, 30] (0.40 per atom over the actual return scale). Match your support to your actual returns. Obvious in hindsight. The hierarchy avoids the specific flat 1.4M-gradient-step collapse, but it has its own failure modes if misconfigured.
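The resolution arithmetic is easy to check before launching a run. A small sketch, assuming evenly spaced C51 atoms (the standard construction); the helper names are mine.

```python
def atom_spacing(v_min, v_max, n_atoms):
    """Gap between neighbouring atoms on an evenly spaced C51 support."""
    return (v_max - v_min) / (n_atoms - 1)

def atoms_in_range(v_min, v_max, n_atoms, lo, hi):
    """Count support atoms that fall inside the return range [lo, hi]."""
    dz = atom_spacing(v_min, v_max, n_atoms)
    return sum(1 for i in range(n_atoms) if lo <= v_min + i * dz <= hi)

# Inherited placement-network support vs the recalibrated one,
# both measured against co-play manager returns of roughly 0-30.
print(atom_spacing(-20, 1000, 101), atoms_in_range(-20, 1000, 101, 0, 30))
print(atom_spacing(-10, 30, 101), atoms_in_range(-10, 30, 101, 0, 30))
```

With the inherited support only the atoms near 0.4, 10.6, and 20.8 land inside the 0-30 band. With the recalibrated support, roughly three quarters of the atoms do.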

Then It Started Cheating

As the hierarchy learned, I noticed something in the telemetry. The reach% metric (the percentage of pieces where the worker actually reached the manager's goal) was falling. Not spiking. Not oscillating. Steadily falling, over hundreds of thousands of episodes, as performance climbed.

And tgt_depth (a measure of where the manager was aiming, with negative values being legal placements above the stack and positive values being positions inside the stack) was heading positive. Deep positive.

The manager had discovered that aiming a piece at a spot buried inside the stack earns the same line-clear credit as a good goal, because the worker clears lines anyway. So it became the pointy-haired-boss of RL: issues garbage orders, takes credit for the work.

Episode	Avg score	Lines/ep	Reach%	Goal correlation	Clears/lock	Max NES level
10k	157	0.006	0.4%	0.249	.0004	0
50k	219	0.089	6.3%	0.742	.0031	0
90k	355	1.57	2.1%	0.573	.036	0
150k	999	10.3	0.6%	0.336	.156	7
180k	4,585	30.4	0.4%	0.174	.275	16
192k	10,436	49.6	0.2%	0.139	.320	17
200k	13,639 (peak avg)	58.0	0.1%	0.126	.333	19
210k	13,044	55.8	0.1%	0.174	.338	21
288k	7,026	38.4	0.3%	0.358	.297	21

coplay_cap as of publication. Still training at time of writing.

Reach drops from 6.3% to 0.2%. Goal correlation falls from 0.74 to 0.14. The manager's goals become almost completely decorrelated from where pieces actually land. The conductor waves the baton, the orchestra ignores it. Here's the part that should bother you: the music improves.

At its peak, NES level 21. Record score 85,120. Average peaked at ~13,639 before declining as the board conditions got harder (more on that below). The avg is tail-inflated throughout (std consistently exceeds the mean, median at ep 192k was ~5,002), so the distribution is wide. But the level 21 is real.

At 0.2% goal accuracy.

Before you get too excited: level 21 is frantic flailing, not elegance. The board gets dirtier as capability climbs, then actually cleans up slightly at the higher levels (holes peaked at 36, back down to 15 by ep 288k, though aggregate height kept rising to ~114). The avg declined from its peak of ~13,639 at ep 200k down to ~7,026 by ep 288k. That's not a flat collapse (no σ crash, no speed drop, the record score kept climbing) — it's what happens when the agent is consistently reaching harder board states and the score distribution gets wider and wilder. The cumulative clears by ep 288k: 4.5M singles, 782k doubles, 38.8k triples, 803 tetrises. Still mostly singles. Still winning by working frantically and never tidying up. The result is real. The style is not pretty.

So what do you do when your manager starts cheating?

Every Time I Fixed It, It Got Worse

In an earlier run I had tried miss05: a configuration specifically designed to address the illegal-goal drift visible in earlier experiments. It added two mechanisms on top of the same feudal skeleton:

reach_penalty = 0.05: a graded distance penalty docked from the manager's reward for each piece, proportional to how far the actual landing was from the goal.

miss_reward_scale = 0.5 (half-on-miss): when the worker failed to reach the goal footprint, the manager's positive reward was scaled by 0.5. Death penalty kept full.
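A minimal sketch of how these two mechanisms could combine into the manager's per-piece reward. The function signature, the order of operations, and the distance unit (board cells) are assumptions for illustration; only the two constants and the "death penalty kept full" rule come from the description above.

```python
def manager_reward(task_reward, landing_distance, reached_goal,
                   reach_penalty=0.05, miss_reward_scale=0.5):
    """Sketch of the miss05 manager reward for one locked piece.

    task_reward: outcome reward (line clears, lock nudge, death penalty).
    landing_distance: how far the piece landed from the goal (assumed: cells).
    reached_goal: whether the worker hit the goal footprint.
    """
    reward = task_reward
    # Half-on-miss: only positive reward is scaled, so a death penalty stays full.
    if not reached_goal and reward > 0:
        reward *= miss_reward_scale
    # Graded penalty, proportional to how far the landing was from the goal.
    return reward - reach_penalty * landing_distance

print(manager_reward(1.0, 0, True))    # goal reached: full credit
print(manager_reward(1.0, 4, False))   # missed by 4 cells: halved, then docked
print(manager_reward(-1.0, 0, False))  # death penalty is not scaled
```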

The goal was to make the manager care about whether its goals were actually reachable and legal. And it worked. miss05 ran to 901k episodes (1.38M gradient steps) with reach consistently 55-77%, goal correlation ~0.96, target depths near zero. Legal, reachable, coordinated placements. The well-behaved agent.

It capped at NES level 2.

Metric	`miss05` (enforce legality)	`coplay_cap` (drop the enforcement)
Reach%	55-77%	0.2%
Goal correlation	0.96	0.14
Goal legality (tgt_depth)	Legal (~0)	Illegal (+6, buried)
Clears/lock	0.11 (plateau)	0.32 (climbing)
Max NES level	2	17 (at the ep 192k snapshot; 21 by ep 210k)
Lines/episode	~5	~50
Grad-steps reached	1.38M	359k (still training)

The well-behaved agent is the worst one.

I need to be honest about a confound: miss05 used a conv-base manager and base-capacity worker, while coplay_cap used a wider conv manager and a scaled-up worker. Capacity and legality-enforcement co-vary across these two runs. It's not a clean single-variable ablation, and I won't pretend otherwise.

The cleaner evidence is within coplay_cap itself. At fixed capacity, reach falls from 6.3% to 0.2% as capability rises. The system actively moves away from legal goals as it improves.

Why the Manager Stopped Trying

The manager's reward is the outcome: line clears, lock nudge, death penalty. Not whether its goal was good. Not whether it was reached. Just what happened.

Once the worker is reasonably competent, it clears lines roughly independent of the exact goal. If the goal is unreachable, the worker free-plays and still clears lines. Legal goals and illegal goals earn the same outcome reward. There is no gradient pointing the manager toward legal goals. None.

So the goals diffuse. Most random placements in a 20×10 grid are buried inside or above the stack. tgt_depth drifting to +6 means "aim somewhere deep in the stack so the piece just falls." The manager hasn't broken anything. It found a way to emit a plausible-looking goal that the worker ignores, while still collecting full outcome credit. Wolpert and Tumer proposed the "Wonderful Life Utility" as a remedy for exactly this kind of credit-assignment failure. COMA built a counterfactual baseline on the same idea. I watched the failure happen live in the telemetry.

This compounds as the worker gets more autonomous. The more the worker free-plays competently, the less the manager's goal matters, the less the manager optimises for reachable goals, the more it drifts. (Shrugs.) The well-behaved run (miss05) penalised the symptom and inadvertently killed the signal. The reach penalty and half-on-miss scaling forced legal goals but reduced the worker's autonomy and disconnected the dense per-frame reach gradient from actual learning. Clean, coordinated, legal, and capped at level 2.

Easy to get wrong: at 0.2% reach, the obvious read is "the manager is useless, this is basically a solo worker." WRONG. No-manager flat nets collapse (see above). The manager's actual contribution was never "aim piece, land piece." It was "give the worker a target to chase so the per-frame goal-distance gradient has direction." That target doesn't have to be legal. It just has to exist. The manager looked vestigial. It was load-bearing the whole time.

The Actual Fix (That I Haven't Run Yet)

The principled solution is a counterfactual reward, sometimes called a difference reward or Wonderful Life Utility (Wolpert & Tumer; COMA, Foerster et al. 2018). Instead of rewarding the manager for the outcome, reward it for the difference between the outcome and what the worker would have achieved with no goal at all.

Vacuous or illegal goals: outcome ≈ free-play baseline, ~0 credit. The drift stops being free.

Reachable, genuinely-helpful goals: outcome > baseline, positive credit. The manager has an actual gradient toward useful goals for the first time.

The concrete plan: run ~25% of environments in worker-only / null-goal free-play to generate a running baseline B (bucketed by board height), then route goal_advantage = task_reward - B to the manager instead of the raw task reward. Worker keeps the dense reach signal. Manager gets credit only when its goal actually helped.
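Since this is not yet run, the following is only a sketch of the plan as described: a running free-play baseline B bucketed by board height, and goal_advantage = task_reward - B. The class name, the bucket size, and the choice of a plain running mean are assumptions.

```python
from collections import defaultdict

class HeightBucketBaseline:
    """Running mean of free-play task reward, bucketed by board height."""

    def __init__(self, bucket_size=4):
        self.bucket_size = bucket_size      # assumed bucket width in rows
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def _bucket(self, board_height):
        return board_height // self.bucket_size

    def update(self, board_height, task_reward):
        """Call only from the null-goal (worker-only) environments."""
        b = self._bucket(board_height)
        self.total[b] += task_reward
        self.count[b] += 1

    def value(self, board_height):
        """Baseline B for this height; 0.0 until the bucket has data."""
        b = self._bucket(board_height)
        return self.total[b] / self.count[b] if self.count[b] else 0.0

def goal_advantage(task_reward, board_height, baseline):
    """Manager credit: outcome minus what free-play earns at this height."""
    return task_reward - baseline.value(board_height)

baseline = HeightBucketBaseline()
baseline.update(5, 1.0)   # free-play samples at heights 5 and 6 share a bucket
baseline.update(6, 3.0)
print(goal_advantage(2.0, 7, baseline))  # matches free-play: no credit
print(goal_advantage(5.0, 4, baseline))  # beats free-play: positive credit
```

The property that matters is the first print: a goal whose outcome matches the free-play baseline earns the manager nothing, so vacuous goals stop being free.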

This is designed. It is not yet run. coplay_cap needs to converge first (ep 288k at time of writing, NES level 21, avg declining from peak as board conditions harden). Once it does, the counterfactual reward gets implemented.

Either the manager finally learns to aim somewhere the worker can actually reach and clear from, genuine "aim then land" coordination, or it doesn't, and that tells us something equally worth writing about.

The sequel exists either way.

Honesty checklist — read before citing anything
Single-seed throughout. Every run in this post is n=1. The trustworthy parts are the directions and the within-run inversions, not the exact crossover numbers.

"First purely learned" is harness-scoped. Model-based approaches (DreamerV3, MuZero) are on the untried bench, not ruled out. The claim is specifically about frame-level control from pixels with no handcrafted features, no shaped rewards, no placement enumeration, no warm-start.

coplay_cap was still training at publication. Numbers reflect the latest snapshot at time of writing (ep 288k). Peak avg was ~13,639 at ep 200k; the subsequent decline is not a flat collapse (no σ crash, no speed drop) but reflects harder board conditions at higher levels.

The flat-collapse table is from project records. The original run files were cleaned up. The pattern is well-established across all four runs; the exact figures are recalled, not freshly verified from disk.

The miss05 vs coplay_cap comparison is confounded. Capacity and legality-enforcement co-vary. The clean ablation (conv-wide + scaled worker, legality on vs off) has not been run. The within-run drift in coplay_cap is the cleaner evidence.

Collapse-survival evidence rests on miss05 (1.38M gradient steps), not coplay_cap (only 359k). And it's a cross-stack comparison, not an identical-stack ablation.

Level 21 is survival-volume, not efficient Tetris. 4.5M singles, 782k doubles, 38.8k triples, 803 tetrises cumulative by ep 288k. It's keeping the board alive, not building wells.

The counterfactual reward fix is designed, not run. Treat it as the next experiment, not a result.

I'm a CS researcher. All experiments here are personal projects run on my own hardware, entirely separate from any institutional affiliation.

References

Wolpert, D. & Tumer, K.: "Wonderful Life Utility" (counterfactual/difference rewards)
Foerster, J. et al. (2018): COMA: Counterfactual Multi-Agent Policy Gradients
Vezhnevets, A. et al. (2017): FeUdal Networks for Hierarchical Reinforcement Learning, ICML
Bairaktaris, J.A. & Johannssen, A. (2025): "Outsmarting algorithms: A comparative battle between Reinforcement Learning and heuristics in Atari Tetris," Expert Systems with Applications 277, 127251
Liu, H. & Liu, L.: "Learn to Play Tetris with Deep Reinforcement Learning," OpenReview
Algorta, S. & Şimşek, Ö. (2019): "The Game of Tetris in Machine Learning," arXiv:1905.01652

When Chaos Wins: Adding Noise Improved My Snake AI's Stability

Stat Phantom — Sun, 17 May 2026 07:20:58 +0000

Greetings all! Continuing the series where I build Rainbow DQN one component at a time on Snake. The first post covered encoding, the second covered memory, the third covered PER hurting performance. This one is about a truly WTF?! moment I stumbled into while evaluating the models.

When you evaluate a model that uses noisy networks, you turn the noise off. You're not training, so why would you keep the exploration noise active? You want the clean, deterministic policy. The model's best guess, no randomness. That's what you do; it's basically an axiom in machine learning.

So I did just that. And the evaluation scores were significantly worse than training. Not slightly. Significantly.

What Noisy Networks Do (Quick Recap)

Standard DQN uses epsilon-greedy exploration: pick a random action X% of the time, decay that percentage over training. Simple, dumb, works.

Noisy networks replace this with something smarter. Each linear layer in the network gets learnable noise parameters (sigma weights). During training, the network adds noise to its own weights, producing slightly different outputs each forward pass. The network learns how much noise to apply. Early in training, sigma values are high and the agent explores broadly. As training progresses and the agent gets more confident, sigma can shrink. For evaluation, you set sigma to zero. Clean output. Textbook.
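A minimal pure-Python sketch of the mechanism, assuming independent Gaussian noise per weight (Fortunato et al. also describe a cheaper factorised variant). The class, the initial values, and the `noisy` switch are illustrative; in real training mu and sigma are learned by gradient descent, here they are simply fixed.

```python
import random

class NoisyLinear:
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""

    def __init__(self, in_features, out_features, sigma_init=0.5, seed=0):
        self.rng = random.Random(seed)
        bound = in_features ** -0.5  # illustrative init scale
        self.mu_w = [[self.rng.uniform(-bound, bound) for _ in range(in_features)]
                     for _ in range(out_features)]
        self.mu_b = [self.rng.uniform(-bound, bound) for _ in range(out_features)]
        self.sigma_w = [[sigma_init * bound] * in_features
                        for _ in range(out_features)]
        self.sigma_b = [sigma_init * bound] * out_features
        self.noisy = True  # False gives the "textbook" mean-weight evaluation

    def forward(self, x):
        out = []
        for o in range(len(self.mu_b)):
            total = self.mu_b[o]
            if self.noisy:
                total += self.sigma_b[o] * self.rng.gauss(0.0, 1.0)
            for i, xi in enumerate(x):
                w = self.mu_w[o][i]
                if self.noisy:
                    # Fresh noise per weight on every forward pass.
                    w += self.sigma_w[o][i] * self.rng.gauss(0.0, 1.0)
                total += w * xi
            out.append(total)
        return out

layer = NoisyLinear(3, 2)
x = [1.0, 2.0, 3.0]
layer.noisy = False
print(layer.forward(x) == layer.forward(x))  # mean weights: identical outputs
layer.noisy = True
print(layer.forward(x) == layer.forward(x))  # noisy: outputs differ per pass
```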

The Evaluation Gap

Running evaluations across multiple training checkpoints, I noticed something was off. Not subtly off. The deterministic eval scores were wildly inconsistent.

Some checkpoints averaged 78. Others averaged 18. The training curve at these same points? Perfectly stable. The model was learning consistently the whole time, but deterministic evaluation was telling a completely different story depending on which checkpoint I happened to evaluate.

First instinct: it's a bug. Checked the eval pipeline, checked the checkpoint loading, checked the environment seeding. Everything was fine. The model genuinely performed this erratically when noise was turned off. So if it's not a bug... what is it?

The Bimodal Trap

The ep450K checkpoint was where it got properly weird. Deterministic eval produced a strongly bimodal distribution: roughly 25% of episodes scored near zero, while 75% scored above 80. The average landed at 59, but that number is completely meaningless when your distribution is two separate peaks with a canyon between them.

So what's going on? The deterministic policy has traps. Specific game states where the mean-weight Q-values for two or more actions are nearly identical. Without noise, the agent picks the same action every single time it hits that state. If that action happens to be the wrong one? Stuck. It loops, it crashes, it scores zero. 25% of episodes starting from certain initial states hit these traps every time.

Now: same checkpoint, same evaluation seeds, noise turned back on.

The bimodal failure mode vanished. Gone. The p25 jumped from 2 to 59. The average climbed from 59 to 73. The standard deviation dropped from 42 to 26. The noise nudges the agent out of those deterministic traps. Not randomly, not chaotically, but because the learned noise provides just enough variation in the Q-values to stop the agent getting stuck in a degenerate action loop.

The noise isn't exploration overhead left over from training. It's a load-bearing part of the learned policy.

This wasn't a one-off. The pattern held at every checkpoint from ep50K through ep450K. Stochastic eval beat deterministic eval at every single point. Lower variance, higher consistency, fewer catastrophic zero-score episodes. The sigma values aren't residual training artifacts waiting to be zeroed out. They're doing actual work.

Why Snake Makes This Worse

Snake has a property that makes deterministic policies especially vulnerable to traps: a single wrong turn can be immediately fatal.

Picture a snake at length 100+ navigating a tight corridor of its own body. The optimal action and the second-best action might differ by a tiny margin in Q-value. Deterministic policy picks the same one every time. If that action leads into a dead end three moves later, the agent dies. Every time. From that state. Noise provides enough Q-value perturbation to occasionally pick the second-best action, which might be the one that actually survives.

In environments with more breathing room (wide open Atari levels, games where one wrong move doesn't kill you), deterministic policies don't develop these traps as severely. The longer the snake gets, the more traps exist, and the more the noise matters.

What This Means In Practice

If you're using noisy networks and evaluating with mean weights, your evaluation scores may not just be noisy. They can be structurally misleading. The deterministic policy can have failure modes that simply don't exist in the trained stochastic policy.

Before assuming deterministic eval shows the "true" performance of your agent, run a stochastic eval comparison. If the scores diverge, your agent has learned to depend on its noise.
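A small sketch of that comparison, assuming you already have two lists of per-episode scores from the same checkpoint and the same seeds. The helper names and the sample scores are made up for illustration; they are not the results reported above.

```python
from statistics import fmean, pstdev, quantiles

def summarise(scores):
    """Mean, 25th percentile, and standard deviation of episode scores."""
    p25 = quantiles(scores, n=4, method="inclusive")[0]
    return {"mean": fmean(scores), "p25": p25, "std": pstdev(scores)}

def compare_evals(det_scores, sto_scores):
    """Stochastic minus deterministic, per statistic."""
    det, sto = summarise(det_scores), summarise(sto_scores)
    return {key: sto[key] - det[key] for key in det}

# Synthetic example: a bimodal deterministic run vs a tighter stochastic one.
det_scores = [0, 2, 85, 90]
sto_scores = [60, 70, 75, 80]
print(compare_evals(det_scores, sto_scores))
```

The p25 and the standard deviation are the numbers to watch. A mean can look healthy while a quarter of the episodes are scoring near zero.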

Honest Caveats

Single architecture, single game. This was observed on C51 + dueling + noisy on Snake. Games with more forgiving state dynamics may not exhibit the same bimodal failure mode.

Noise can grow too large. At one late-stage checkpoint, sigma values had grown large enough that stochastic eval actually dropped below deterministic. There's a Goldilocks zone where noise is productive. Past that zone it becomes destructive. The finding is not "always evaluate with noise." The finding is "don't assume deterministic eval is automatically better."

Training scores remain the most reliable metric. For the ablation study, training window averages computed identically across all runs are the primary comparison, sidestepping the whole question entirely.

If you've observed similar eval divergence with noisy networks, or if you have environments where deterministic eval reliably matches training performance, I'd like to hear about it in the comments.

This work is part of ongoing research, and I plan to submit the findings for peer review.

References

Peer-Reviewed

Fortunato et al. (2018) - "Noisy Networks for Exploration" - ICLR 2018. arXiv: 1706.10295

Hessel et al. (2018) - "Rainbow: Combining Improvements in Deep Reinforcement Learning" - AAAI 2018. DOI: 10.1609/aaai.v32i1.11796

Bellemare et al. (2017) - "A Distributional Perspective on Reinforcement Learning" - ICML 2017. arXiv: 1707.06887

Removing PER From Rainbow DQN Set a New Snake AI World Record

Stat Phantom — Sat, 09 May 2026 08:32:53 +0000

Greetings all! Quick context: this is part of an ongoing series where I'm building Rainbow DQN one component at a time on Snake and measuring what each piece actually does. The first post covered the encoding, the second covered a memory optimisation. This one is about the finding I've been teasing: which Rainbow component hurts performance on Snake.

The answer is Prioritised Experience Replay (PER). Removing it from Rainbow DQN didn't just match performance. It set a new world record of 156 (first reported as 153) on a 20×20 grid, smashing the previous record of 134 set by full Rainbow (with PER), and nearly 2.5× the best published peer-reviewed result of 62 (Sebastianelli et al., 2021).

The component that Hessel et al. (2018) ranked as one of Rainbow's two most important pieces actively hurts on some games, such as Snake.

What Is PER? (And Why Does Everyone Use It?)

Prioritised Experience Replay changes how an agent samples from its replay buffer. Instead of uniform random sampling (every stored transition has equal probability of being replayed), PER assigns a priority to each transition based on its TD error. Transitions the agent got most wrong get replayed most often.

The intuition is this: why waste training steps on transitions the agent already understands well? Focus on the hard ones. Replay the failures. Learn from mistakes. Push yourself. (Insert 'Just Do It!' meme here.)

To prevent this biased sampling from corrupting the gradient, PER applies importance sampling (IS) weights that mathematically correct for the non-uniform distribution. A parameter called beta controls how aggressively this correction is applied, and is annealed from a low value (0.4) toward 1.0 over training.
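A minimal sketch of proportional prioritisation and the IS correction, written as plain lists rather than the sum-tree a real buffer would use. The alpha value of 0.6 and the small epsilon are common choices that I am assuming here; they are not stated in this post.

```python
import random

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |td_error_i| + eps."""
    scaled = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]

def is_weights(probs, beta):
    """w_i = (N * P(i))^(-beta), normalised by the largest weight."""
    n = len(probs)
    raw = [(n * p) ** -beta for p in probs]
    top = max(raw)
    return [w / top for w in raw]

def sample_batch(td_errors, batch_size, beta, rng=random):
    """Draw indices by priority and return them with their IS weights."""
    probs = per_probabilities(td_errors)
    idx = rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
    weights = is_weights(probs, beta)
    return idx, [weights[i] for i in idx]

probs = per_probabilities([0.1, 1.0, 2.0])
print(probs)                    # the largest TD error is sampled most often
print(is_weights(probs, 0.4))   # mild correction early in training
print(is_weights(probs, 1.0))   # full correction: high-priority items shrink most
```

The last two prints show the trade-off: the transitions PER samples most are exactly the ones whose gradient contribution the IS weight scales down, and the scaling gets stronger as beta approaches 1.0.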

Hessel et al.'s 2018 Rainbow paper tested each component's contribution by removing them one at a time. PER and multi-step returns were the two most impactful. Remove either one and performance dropped the most. This result, measured on Atari, became the received wisdom in the DRL Gaming community: PER is essential.

And for some reason, nobody asked whether that ranking holds on tasks that look nothing like Atari.

The Bug I Found First

Before I could even evaluate PER properly, I had to fix a misconfiguration that most multi-environment setups will hit without realising.

PER's beta parameter is annealed over beta_anneal_steps gradient steps. The default values in most implementations are calibrated for single-environment training where roughly one gradient step happens per episode. My setup runs 2048 parallel environments with 4 gradient steps per global step. That's approximately 8,192 gradient steps per episode.

The result? With a beta_anneal_steps of 100,000 (a common default), beta reached 1.0 by episode ~12. Not 12,000. Yes, you read that right: twelve. The IS correction was fully engaged before the agent had learned anything at all. The training wheels came off before one foot was even on the pedal. For the remaining ~300,000 episodes of training, PER was running with maximum gradient suppression against priorities that were pure noise.
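The episode-12 figure follows directly from the numbers quoted above. A quick sketch, assuming a linear anneal (the usual schedule, though implementations vary) and hypothetical helper names:

```python
def beta_at(grad_step, beta_start=0.4, beta_anneal_steps=100_000):
    """Linear anneal from beta_start to 1.0 over beta_anneal_steps gradient steps."""
    frac = min(1.0, grad_step / beta_anneal_steps)
    return beta_start + frac * (1.0 - beta_start)

def episodes_until_full_beta(beta_anneal_steps, grad_steps_per_episode):
    """How many episodes pass before beta reaches 1.0."""
    return beta_anneal_steps / grad_steps_per_episode

grad_steps_per_episode = 2048 * 4  # 8,192, the figure quoted above
print(episodes_until_full_beta(100_000, grad_steps_per_episode))  # about 12.2
print(beta_at(50_000), beta_at(100_000))
```

Running this kind of check against your own gradient-steps-per-episode rate before training is cheap, and it catches the misconfiguration before it costs a full run.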

Gradient norms confirmed it: they were approximately 4× lower than equivalent non-PER runs. The agent was being actively throttled.

After identifying this, I recalibrated beta_anneal_steps to 6,000,000 (covering ~300,000 episodes at the actual gradient-steps-per-episode rate) and ran again from scratch. The corrected run did show improvement over the non-PER baseline.

So, PER fixed, job done, moving on? NOPE!

Fixed PER Still Underperforms

The corrected PER run outperformed the dueling+noisy baseline by a meaningful but modest margin. Not the dramatic improvement you'd expect from one of Rainbow's "top two components." The improvement was there, it just wasn't impressive.

This raised a question for me. If PER barely helps without C51 (distributional output), what happens when C51 is present? C51 fundamentally changes the nature of the TD error. In standard DQN, the TD error is a scalar: predicted Q minus target Q. PER uses this scalar as its priority signal. Simple, clean, well-defined.

In C51, the "error" is a KL divergence between two probability distributions. It's not a scalar residual in the same sense. Most Rainbow implementations approximate a priority from this distributional loss, but it's exactly that: an approximation. If the priority signal is noisier in the distributional setting, PER is making sampling decisions on worse information while still applying the full IS correction penalty.

The only way to test this was to run Rainbow with and without PER and compare directly.

The Head-to-Head

Full Rainbow (with PER) vs C51 without PER. Same architecture, same hyperparameters, same encoding, same hardware, same training seed. The only difference: PER on or off.

Both models evaluated at the ep50K snapshot: 10 segments × 2,000 episodes (20,000 total per model), deterministic policy, seeds 0–19,999.

C51 without PER outperforms full Rainbow across every single metric. Not by a little. The weakest C51 segment (avg 31.47) far exceeds the strongest Rainbow segment (avg 22.91). There is zero overlap between the two distributions. This isn't noise. This is a structural difference.

At the training level, C51 overtook Rainbow in record score around episode 153K and maintained the lead through the end of both runs. The final records: 153 (C51 without PER) vs 134 (full Rainbow with PER).

Removing PER didn't just fail to hurt. It was the single change that pushed the model from 134 to a world record of 156 (first reported as 153).

Why PER Hurts on Snake

This result isn't random bad luck. There are structural reasons why PER is a poor fit for Snake, and they generalise to any task with similar properties.

Dense rewards reduce TD error variance. PER's priority mechanism works best when the replay buffer contains a mix of genuinely informative rare transitions and common boring ones. In sparse-reward environments (long Atari episodes, complex RPGs), most transitions carry little signal, and PER correctly surfaces the rare valuable ones. Snake hands out food frequently. The reward signal is dense. TD errors across transitions are relatively homogeneous. There isn't enough variance in transition informativeness for priority sampling to do meaningful work.

Parallel environments already ensure diversity. One of PER's core benefits in single-environment training is making rare or unusual game states available for replay more often. With 2048 environments running simultaneously, the replay buffer is already populated with massively diverse experience at every step. The agent sees rare states regularly just from the volume of parallel play. PER's diversity benefit is structurally preempted by the parallelism.

IS weight correction suppresses gradients. The IS correction is mathematically necessary to prevent biased gradients, but it comes at a cost: it down-weights the very transitions PER most wants to learn from. In a dense-reward setting where TD errors are already relatively uniform, this correction may be net-harmful. You pay the gradient suppression overhead without the corresponding benefit of surfacing genuinely informative transitions.

C51 makes PER's priority signal worse. In standard DQN, the TD error is a clean scalar. In C51, the "error" is derived from a KL divergence between distributions, an approximation that may not faithfully represent which transitions are most informative in the distributional sense. PER is making sampling decisions on a noisier signal while still applying the full IS penalty.

These four factors compound. Each one individually would weaken PER's contribution. Together, they explain why removing PER entirely produces a better model than including it.
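The IS suppression argument can be seen directly in the formulas from Schaul et al. What follows is a minimal pure-Python sketch of proportional prioritisation, with illustrative alpha and beta values rather than the ones used in these runs.

```python
def per_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha (proportional prioritisation)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def is_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalised so the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]

# Homogeneous TD errors: PER degenerates to uniform sampling, weights all 1.
uniform_probs = per_probabilities([1.0, 1.0, 1.0, 1.0])
uniform_weights = is_weights(uniform_probs)

# One high-error transition: sampled more often, but its gradient is scaled down.
skewed_probs = per_probabilities([1.0, 1.0, 1.0, 5.0])
skewed_weights = is_weights(skewed_probs, beta=1.0)
```

With uniform priorities the mechanism does nothing except add bookkeeping. With one outlier, `skewed_weights[3]` is the smallest weight in the batch: the transition PER samples most is also the one whose gradient is scaled down most.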

This Isn't Just My Finding

Pan et al. and Ivgi et al. have independently documented PER underperforming in dense-reward or high-parallelism settings. Both identify that PER's advantage is largest when rewards are sparse and TD errors vary substantially across transitions. This lends external validity to what I observed here and suggests the finding is not specific to Snake or to my implementation.

The practical recommendation: before including PER in your setup, ask whether your task has sparse rewards and rare informative transitions. If it doesn't, PER's overhead (IS correction, priority tracking, beta calibration complexity) may outweigh its benefit. The fact that Hessel et al. found PER essential on Atari does not mean it's essential on your task.

Honest Caveats

Tested across multiple seeds. The primary comparison shown above is from a single training seed, but the PER vs no-PER comparison has been tested across 5 seeds. The results are somewhat chaotic at the individual seed level, with some seeds showing a smaller gap and occasional flips. But the mean across all 5 seeds shows a positive effect from removing PER. The relative ranking holds on average, even if individual seeds can be noisy. This is consistent with the structural arguments above: PER's disadvantage on dense-reward tasks is systematic, not a seed-specific fluke.

Dense-reward specific. This finding is about PER on Snake, which is a dense-reward task with frequent food collection and relatively uniform state visitation. PER may still be valuable on sparse-reward, long-horizon tasks. The claim is not "PER is useless." The claim is "PER is not universally beneficial, and the conditions under which it helps are narrower than the literature implies."

Beta calibration. The PER run used the corrected beta annealing schedule. The comparison is against properly-configured PER, not the misconfigured version. The misconfiguration is documented because it's a real pitfall that anyone using PER in a multi-environment setup will hit, but the head-to-head result stands on the corrected run.

What's Next

The ablation study continues. The PER finding is one piece of a larger investigation into how each Rainbow component contributes in a dense-reward, parallel-environment setting. The full ablation ladder, from standard DQN through full Rainbow, is being built one component at a time.

If you've observed PER underperforming on dense-reward tasks, or if you have counterexamples where PER helped significantly despite frequent rewards, I'd like to hear about it in the comments.

This work is part of ongoing research and the findings are planned to be submitted as a peer-reviewed paper.

If you're new to this series:

A CNN Grid Encoding for Snake AI That DOUBLES! the Best Published Score (Apr 25)

2 Lines of Code Saved 6.4x Memory on My Snake AI (May 1)

References

Peer-Reviewed

Hessel et al. (2018) - "Rainbow: Combining Improvements in Deep Reinforcement Learning" - AAAI 2018. DOI: 10.1609/aaai.v32i1.11796

Schaul et al. (2016) - "Prioritized Experience Replay" - ICLR 2016. arXiv: 1511.05952

Bellemare et al. (2017) - "A Distributional Perspective on Reinforcement Learning" - ICML 2017. arXiv: 1707.06887

Sebastianelli et al. (2021) - "A Deep Q-Learning based approach applied to the Snake game" - 29th Mediterranean Conference on Control and Automation (MED). DOI: 10.1109/MED51440.2021.9480232

2 Lines of Code Saved 6.4x Memory on My Snake AI

Stat Phantom — Fri, 01 May 2026 06:36:43 +0000

Greetings all! In my previous post I covered Binary Plane Encoding, a 3-channel grid representation for Snake that doubled the best published score. Three binary channels: head, body, apple. For details check my previous post.

But there was a fourth channel I left out. Direction. The snake's current heading, encoded as a uint8 (0 = up, 1 = right, 2 = down, 3 = left), is painted uniformly across a 20×20 plane due to matrix shape requirements. That's 400 elements carrying exactly 2 bits of information. A 1,600× overhead at the channel level.

Worse, that one integer channel with its 2 bits was blocking the entire state from being bit-packed. The other three grid channels are binary, meaning they could be packed at 1 bit per element. But the direction channel, with its *scoffs* 2 bits, can't. So the replay buffer has to store the whole state as uint8 instead of packed bits. One channel, 2 bits, holding back one more step of memory optimisation and forcing 1,600 bytes per state instead of 250 (20 × 20 grid × 4 channels at 1 byte per element = 1,600 bytes, versus 20 × 20 grid × 5 channels at 1 bit per element = 2,000 bits = 250 bytes).

This follow-up post is about fixing that, and the pitfalls along the way.

The First Attempt

Four cardinal directions. Two bits encode four states. So the intuitive replacement is two binary channels instead of one integer channel: one bit for North/South, one bit for East/West. Compact, geometric, obvious.

Except it doesn't work. Walk through it:

North and West both map to 0,0 - Collision.

The failure is subtle because the scheme seems right. Four directions, four possible bit combinations, should be a clean fit. But the scheme tries to answer "is there a north/south component?" and "is there an east/west component?" Cardinal movement is strictly one-dimensional. The perpendicular component is always exactly zero. What does the E/W bit say when the snake is moving north? It's not moving east. It's also not moving west. Both map to 0. "Not moving east" is identical to "not moving west" in a single bit.
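The collision is easy to reproduce. The sketch below assumes one particular bit convention (the N/S bit is set only when heading south, the E/W bit only when heading east), and the function name is made up for illustration. Flipping which end sets each bit only changes which pair of directions collides.

```python
def naive_bits(direction):
    """Naive scheme: one bit per axis, each set for one end of that axis."""
    ns_bit = int(direction == "south")  # 0 for north AND for "not on this axis"
    ew_bit = int(direction == "east")   # 0 for west AND for "not on this axis"
    return (ns_bit, ew_bit)

codes = {d: naive_bits(d) for d in ("north", "east", "south", "west")}
distinct_codes = set(codes.values())
```

Here `codes["north"]` and `codes["west"]` are both `(0, 0)`, so only three distinct codes come out of four directions, and `(1, 1)` is never used.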

Two bits should be enough for four directions. They are. Just not those two bits.

Ask Better Questions

The collision happens because the N/S + E/W scheme asks the wrong questions for cardinal movement. The fix isn't more bits. It's better questions.

The correct encoding uses two bits derived geometrically:

Axis bit: which axis is the snake travelling along? (0 = vertical, 1 = horizontal)

Sign bit: which direction on that axis? (0 = negative, 1 = positive)

All four directions get unique codes. The axis bit answers "which axis?" and the sign bit answers "which end?" Both questions always have exactly one answer for cardinal movement. No ambiguity, no collisions. The specific sign convention (whether north is positive or negative) doesn't matter as long as it's internally consistent. The CNN will learn whatever mapping you give it.
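A quick check of uniqueness, using the integer convention from earlier in this post (0 = up, 1 = right, 2 = down, 3 = left). The helper names are hypothetical; the two expressions are the same ones used in the implementation further down.

```python
def axis_sign(direction):
    """Map a direction integer (0=up, 1=right, 2=down, 3=left) to (axis, sign)."""
    axis = int(direction % 2 == 1)   # 0 = vertical, 1 = horizontal
    sign = int(0 < direction < 3)    # 1 for right and down, 0 for up and left
    return (axis, sign)

codes = {d: axis_sign(d) for d in range(4)}

# Invert the mapping to confirm that no information is lost.
decode = {code: d for d, code in codes.items()}
round_trip = [decode[axis_sign(d)] for d in range(4)]
```

This gives up = (0, 0), right = (1, 1), down = (0, 1), left = (1, 0). Treating right and down as "positive" corresponds to screen coordinates with y growing downwards, which is an interpretation on my part; as noted above, any consistent convention works.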

The first attempt was asking the wrong questions. Once you ask the right ones, two bits is plenty.

For anyone wondering about diagonal games (8 directions), the axis + sign scheme breaks because a diagonal is on both axes simultaneously. The general solution there is a 4-channel encoding: one binary plane per cardinal direction, with a single plane active for a cardinal move (a plain one-hot) and two planes active for a diagonal. But for Snake, cardinal-only, the 2-channel scheme is the right choice. Don't build the generality you don't need.

The Memory Maths

This is where the change pays off. The state goes from (4, 20, 20) with one integer channel to (5, 20, 20) with all binary channels. Yes, adding a channel saves memory. That sounds backwards but the maths checks out.

Before (4-channel, uint8 storage): 4 × 20 × 20 = 1,600 elements at 1 byte each = 1,600 bytes per state. A 1-million-transition replay buffer (storing both state and next state): 3.2 GB.

After (5-channel, binary bit-packed): 5 × 20 × 20 = 2,000 elements. Every value is now 0 or 1, so each element can be packed at 1 bit, 8 elements per byte. ⌈2,000 / 8⌉ = 250 bytes per state. The same buffer: 500 MB.

6.4× reduction. Adding one channel, removing 2.7 GB.

To put this in perspective: the grid encoding stored naively as float32 (before any compression) would be 6,400 bytes per state, or 12.8 GB for a 1M-transition buffer. The first post's uint8 storage cut that to 3.2 GB (4× reduction). This post's binary bit-packing cuts it again to 500 MB. Across both changes, that's a 25.6× total reduction from the uncompressed float32 starting point.
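The figures above can be reproduced in a few lines. This is a sketch with hypothetical helper names; it assumes decimal gigabytes (1 GB = 10^9 bytes) and two stored states per transition, as in the buffer sizes quoted above.

```python
import math

def bytes_per_state(channels, height, width, bits_per_element):
    """Storage for one state, rounded up to whole bytes."""
    return math.ceil(channels * height * width * bits_per_element / 8)

def buffer_gb(state_bytes, transitions=1_000_000, states_per_transition=2):
    """Replay buffer size in decimal GB (state + next state per transition)."""
    return state_bytes * transitions * states_per_transition / 1e9

float32_state = bytes_per_state(4, 20, 20, 32)   # naive float32 grid
uint8_state = bytes_per_state(4, 20, 20, 8)      # 4-channel uint8
packed_state = bytes_per_state(5, 20, 20, 1)     # 5-channel bit-packed

sizes_gb = {
    "float32": buffer_gb(float32_state),
    "uint8": buffer_gb(uint8_state),
    "packed": buffer_gb(packed_state),
}
uint8_to_packed = uint8_state / packed_state
float32_to_packed = float32_state / packed_state
```

The per-state sizes come out as 6,400, 1,600 and 250 bytes, and the two ratios as 6.4 and 25.6, matching the numbers in the text.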

And compared to the pixel-based approaches from the first post? Wei et al.'s RGB inputs would need approximately 49 GB for the same buffer. Binary Plane Encoding with binary cardinal directions brings that to 500 MB. Nearly a 98× difference. A 1-million-transition replay buffer now fits comfortably in the VRAM of a gaming laptop, hell, it fits in some EPYC CPU caches (AMD's Genoa-X packs up to 1,152 MB of L3). With pixel inputs, it wouldn't fit on most workstations.

Two Lines of Code

The implementation change is in snake_cnn_env.py. Replace the single integer direction plane with two binary planes:

# Before: one integer channel
# grid[3] = self._direction  # 0, 1, 2, or 3

# After: two binary channels
grid[3] = float(self._direction % 2 == 1)   # axis: 0=vertical, 1=horizontal
grid[4] = float(0 < self._direction < 3)    # sign: 0=negative, 1=positive

Update input_channels from 4 to 5 in the model config. Done. We now store 5 channels instead of 4, but each channel is 1 bit instead of 8. One extra channel, massively less storage.

One real cost: changing input_channels changes the shape of the first convolutional weight tensor. Existing checkpoints can't be loaded into a 5-channel model. This requires a fresh training run, so schedule the change at a natural break point, not mid-experiment.

torch.unpackbits Doesn't Exist

The CPU side of bit-packing is trivial. np.packbits and np.unpackbits have existed in NumPy since 2010. Pack on write, unpack on read. Done.
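For reference, a minimal CPU-side round trip looks something like this. The function names and the example cell positions are made up for illustration; only `np.packbits` and `np.unpackbits` come from NumPy.

```python
import numpy as np

def pack_state(state):
    """Flatten a binary (C, H, W) uint8 state and pack 8 elements per byte."""
    return np.packbits(state.reshape(-1))

def unpack_state(packed, shape):
    """Inverse of pack_state: unpack, drop any padding bits, restore shape."""
    n_elems = int(np.prod(shape))
    return np.unpackbits(packed)[:n_elems].reshape(shape)

# Example 5-channel state with a few arbitrary cells set.
state = np.zeros((5, 20, 20), dtype=np.uint8)
state[0, 3, 4] = 1      # head
state[1, 3, 5] = 1      # body
state[2, 10, 10] = 1    # apple
state[3, :, :] = 1      # axis plane
state[4, :, :] = 0      # sign plane

packed = pack_state(state)
restored = unpack_state(packed, state.shape)
```

`packed.nbytes` is 250 for this shape, and `restored` is identical to `state`.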

So just implement it on the GPU side, right? WRONG. The natural PyTorch equivalent would be torch.unpackbits, which... doesn't exist? The function is absent from the stable API entirely, and calling it raises an AttributeError. This is a genuine gap in PyTorch that anyone implementing binary storage on CUDA will hit.

The community workaround I found uses bitmasks:

mask = 2 ** torch.arange(8, dtype=torch.uint8, device=x.device).reshape(8, 1)
unpacked = (x.unsqueeze(-1) & mask).bool().int().flip(dims=[1])

This works. It preserves the original bit values, converts them to binary via .bool().int(), and flips the bit order to match MSB-first convention. Four operations, correct output.

But I don't need to preserve the original mask values, I just need 0s and 1s. I thought I could do better, and I wouldn't be a programmer if I didn't try for no other reason except... *shrugs* I wanted to?

shifts   = torch.arange(7, -1, -1, device=packed.device, dtype=torch.uint8)
unpacked = ((packed.unsqueeze(-1) >> shifts) & 1)   # (B, packed_size, 8)
unpacked = unpacked.reshape(B, -1)[:, :n_elems]     # drop padding bits

Each packed byte is broadcast against 8 shift values [7, 6, 5, 4, 3, 2, 1, 0], right-shifting to move each successive bit into the least significant position. Bitwise & with 1 isolates it. Two operations instead of four. No .bool().int() needed because >> shift & 1 always yields binary output directly. No .flip() needed because the descending shift range already produces MSB-first order. Fewer intermediate tensors in VRAM during sampling.

The mask approach also has a shape bug: it's written for a 1D input (flat array of bytes) and breaks on a batched 2D input (B, packed_size). The shift approach handles batched GPU sampling correctly from the start.

Both are fully device-resident with no CPU-GPU transfer. But two operations beats four, and not allocating intermediate tensors matters when batch size and state shape are large. Will reducing two ops make a difference? Probably not, but I saw the OPportunity and took it. And yes, I said that just for the joke.
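One way to sanity-check the shift logic without a GPU is to run the same arithmetic in NumPy and compare it with `np.unpackbits`. This is a NumPy stand-in for the PyTorch snippet above, not the training code itself, and the function name is hypothetical.

```python
import numpy as np

def unpack_shift(packed, n_elems):
    """Batched shift-and-mask unpack.

    packed: (B, packed_size) uint8 array, MSB-first as written by np.packbits.
    Returns a (B, n_elems) uint8 array of 0s and 1s.
    """
    shifts = np.arange(7, -1, -1, dtype=np.uint8)   # [7, 6, ..., 0]
    bits = (packed[..., None] >> shifts) & 1        # (B, packed_size, 8)
    return bits.reshape(packed.shape[0], -1)[:, :n_elems]  # drop padding bits

# A batch whose length is not a multiple of 8, to exercise the padding path.
rng = np.random.default_rng(0)
states = rng.integers(0, 2, size=(4, 2003), dtype=np.uint8)
packed = np.packbits(states, axis=1)
unpacked = unpack_shift(packed, states.shape[1])
reference = np.unpackbits(packed, axis=1)[:, : states.shape[1]]
```

Both `unpacked` and `reference` reproduce `states` exactly, which confirms that the descending shift range yields the MSB-first order that `np.packbits` writes by default.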

So, two lines of code changed the state representation to allow bit-packing and saved a lot of storage with no loss of data.

What's Next

This is part of an ongoing series building Rainbow DQN incrementally and measuring each component on Snake. The state representation work runs in parallel to the algorithm comparison. It doesn't change which Rainbow components help or hurt, but a 6.4× memory reduction means larger buffers, more parallel environments, or training on hardware that previously couldn't fit the buffer.

The algorithm results are the next post.

If you've hit the torch.unpackbits gap yourself, or found a cleaner solution than bitwise shifts for GPU-side bit unpacking, I'd like to hear about it in the comments.

This work is part of ongoing research and the findings are planned to be submitted as a peer-reviewed paper.

If you missed the first post in this series:

A CNN Grid Encoding for Snake AI That DOUBLES! the Best Published Score (Apr 25)

A CNN Grid Encoding for Snake AI That DOUBLES! the Best Published Score

Stat Phantom — Sat, 25 Apr 2026 04:39:23 +0000

In a traditional Snake game, each grid cell can only be in one of 4 states: empty, head, body, or apple. Yet for some reason every published Snake AI paper either throws away spatial information by condensing the game state into a handful of hand-picked numbers, or buries entity identity under layers of raw pixel data that the network has to untangle. Incredibly wasteful.

The solution? Binary Plane Encoding. Using it, a CNN-based model reached a record score of 125 on a 20×20 grid in 2.5 hours on a single RTX 2070, doubling the best published result of 62 (even the average is consistently above this record). This post explains the encoding, why it works, and explores why nobody in the Snake DRL space has tried it before.

The Two Camps

The published literature on deep reinforcement learning for Snake spans 2018 to 2025 and splits into two approaches to state representation.

Camp one: hand-crafted feature vectors. Sebastianelli et al. (2021) and Kommalapati et al. (2025) both use 11 binary features fed to a fully-connected network. Three danger flags (is there a wall or body segment directly ahead, to the left, to the right), four direction flags (which way is the snake currently heading), and four food-relative flags (is the apple above, below, left, right of the head). The network receives a pre-digested summary of the game state. It never sees the grid. It never learns spatial relationships. A human decided what matters and encoded that decision directly into the input.

This works well. Sebastianelli achieved a best score of 62 on a 20×20 grid with vanilla DQN and this 11-feature representation, using very few resources. At least initially: a hard ceiling is reached quickly. The network cannot discover and learn spatial patterns because it never sees the spatial layout. And the features themselves are Snake-specific. Those 11 binary values encode what a Snake expert thinks matters. They would be meaningless for any other game. If you want an agent that can generalise beyond a single environment, this is a dead end.

Camp two: raw pixels. Wei et al. (2018) and Tushar & Siddique (2022) both train from screenshots. Wei uses 64×64 RGB frames stacked four deep, giving 64×64×12 input. Tushar converts to binary (any non-zero pixel becomes 1) at 84×84, also four frames stacked, giving 84×84×4.

The pixel approach is game-agnostic, which is its strength. But the cost is significant. Tushar's binary encoding collapses head, body, and apple into a single value. In any individual frame, every occupied cell looks identical. The agent can only figure out what's what by watching how things move across four stacked frames: food stays still, the snake moves. A single frame on its own contains zero identity information. Wei's RGB encoding preserves colour and therefore identity, but at the cost of massive input dimensionality and redundant spatial resolution (64×64 pixels to represent a 20×20 logical grid).

Both pixel approaches were tested on 12×12 grids, reaching best scores of 17 (Wei) and 20 (Tushar). Neither has been applied to 20×20.

Beyond the peer-reviewed literature, informal projects show similar patterns. A supervised learning approach on GitHub (Huynh, 2020) uses 7 hand-crafted features with a Keras network and reaches a best of 46, average 22 on 20×20. A Medium article (Schoberg, 2020) compares deterministic algorithms rather than learned policies, reaching 67 on 20×20 with a collision-avoiding shortest-path algorithm (no neural network involved at all).

Across all of it, every neural network approach uses either compressed feature vectors or raw pixel grids.

The Gap

Here is the part that surprised me. Multi-channel grid encoding is not a new idea. It is the standard state representation in board game AI.

AlphaZero (Silver et al., 2018) represents chess, Go, and Shogi as multi-channel binary planes. Each piece type, colour, and game-state feature gets its own channel. The network receives a spatial tensor where every channel encodes a different semantic category of information about the board. MuZero extends this. The representation is well-established, well-understood, and has been proven at the highest levels of game AI.

Snake fundamentally runs on a grid with set positions entities can occupy. It mirrors the exact class of problem where channel-per-entity encoding has proven effective, yet no published Snake DRL paper, and no self-published project I have found, attempts this representation. (That it hasn't appeared in published papers doesn't surprise me. Having gone through over 2,100 papers this month, I can say most simply follow pre-existing trends.)

All of the pre-existing Snake DRL literature either pre-computes features and discards spatial representation, or captures raw pixels and forces the network to spend capacity on visual processing before it can even begin to learn the game.

This is the gap. Not a novel encoding technique, but an established one applied to a domain that has ignored it.

The Encoding

The state representation is a 20×20×3 binary tensor. Three channels, each covering the full grid:

Channel 0 (head): 1 at the head position, 0 everywhere else.

Channel 1 (body): 1 at each body segment position, 0 elsewhere.

Channel 2 (apple): 1 at the apple position, 0 everywhere else.

Every value is exactly 0 or 1. A single frame provides complete, unambiguous game state. What is the head, where is the body, where is the food. No temporal stacking required. No entity disambiguation through motion inference. No feature engineering.

The construction from game state is straightforward:

import numpy as np

def encode_state(grid_size, head_pos, body_positions, apple_pos):
    state = np.zeros((3, grid_size, grid_size), dtype=np.uint8)

    # Channel 0: head
    state[0, head_pos[0], head_pos[1]] = 1

    # Channel 1: body
    for segment in body_positions:
        state[1, segment[0], segment[1]] = 1

    # Channel 2: apple
    state[2, apple_pos[0], apple_pos[1]] = 1

    return state

That produces 20×20×3 = 1,200 values per state. Compare that to the pixel approaches: Tushar's binary encoding produces 84×84×4 = 28,224 values (23× larger), and Wei's RGB produces 64×64×12 = 49,152 values (41× larger). The grid encoding captures strictly more semantic information in a fraction of the space.

The information hierarchy makes this concrete:

Approach	Entity identity per frame	Full spatial layout	Game-agnostic
Binary Plane Encoding (this model)	Yes, perfect	Yes	Partial (any grid game)
RGB pixels (Wei et al.)	Yes, via colour	Approximate	Yes
Binary pixels (Tushar)	No (needs 4 frames)	Approximate	Yes
Feature vectors (Sebastianelli)	Yes, pre-computed	No	No (Snake-specific)

Binary Plane Encoding is the only representation in the reviewed literature that combines perfect per-frame entity identity and full spatial layout with a structure that carries over to other grid games, all without additional processing.

The CNN Architecture

The model processing this encoding is deliberately compact:

Two convolutional layers with 32 and 64 channels respectively, 3×3 kernels with same padding, followed by a single MaxPool2d that halves the spatial dimensions from 20×20 to 10×10. Two dense layers of 512 and 256 units. Mish activation throughout.

The network also uses a dueling architecture (separate value and advantage streams) and NoisyLinear layers replacing standard linear layers in the fully-connected head, providing learned exploration noise instead of epsilon-greedy.

This is not a large network. It doesn't need to be. The compact input representation means the convolutional backbone doesn't need depth. Two 3×3 layers with a single pooling stage are sufficient to capture the spatial relationships that matter in a 20×20 grid: proximity to walls, body segment density in nearby regions, and relative food position. The encoding has already done the hard work of structuring the information. The CNN just needs to read it.
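As a rough sense of scale, the layer shapes described above can be tallied by hand. The sketch below rests on several assumptions: the pooling comes after both convolutions, the dense layers are counted as plain linear layers (NoisyLinear carries extra noise parameters), and the dueling and distributional output heads are left out entirely.

```python
def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus biases for a 2D convolution."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(n_in, n_out):
    """Weights plus biases for a fully-connected layer."""
    return n_in * n_out + n_out

grid = 20
pooled = grid // 2                  # single MaxPool2d: 20x20 -> 10x10
flat = 64 * pooled * pooled         # features entering the first dense layer

params = {
    "conv1": conv_params(3, 32),    # 3 input planes: head, body, apple
    "conv2": conv_params(32, 64),
    "fc1": linear_params(flat, 512),
    "fc2": linear_params(512, 256),
}
total = sum(params.values())
fc1_share = params["fc1"] / total
```

Under these assumptions the trunk has about 3.4 million parameters, and the first dense layer accounts for roughly 96% of them. The convolutional backbone itself is under 20,000 parameters.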

Previous Records

The meaningful comparisons are grouped by grid size, since raw scores are not directly comparable across different board dimensions.

20×20 Grid

The only published peer-reviewed result on a 20×20 Snake grid is Sebastianelli et al. (2021). They used an MLP with 11 hand-crafted binary features and vanilla DQN, testing 13 hyperparameter configurations across evaluation runs. Their best single score was 62.

This work, using Binary Plane Encoding with a CNN and Rainbow DQN (incorporating C51 distributional output, dueling architecture, noisy exploration, prioritised replay, and 3-step returns), achieved a record of 125 on the same grid, more than double.

This isn't a cherry-picked peak. Across 55,000 episodes of sustained training, the rolling average holds between 60 and 70, and the median between 64 and 74. Sebastianelli's best single game of 62 sits below this model's average. The p10 floor (the score that 90% of episodes exceed) holds around 30, meaning even the worst games routinely outperform most published baselines. The p90 reaches into the high 90s, with individual episodes regularly breaking 100. Training to this point took approximately 2.5 hours on a single RTX 2070.

An important caveat: this is not an encoding-only comparison. The improvement comes from changes across multiple axes simultaneously: state representation (grid encoding vs feature vector), architecture (CNN vs MLP), algorithm (Rainbow DQN vs vanilla DQN), and training scale (2048 parallel environments vs a smaller setup). The encoding is the enabling change that made the architecture and training scale feasible on consumer hardware, but the doubling should not be attributed to the encoding alone.

12×12 Grid

Direct score comparison across grid sizes doesn't work because a 12×12 grid has a maximum possible score of approximately 141 food items versus approximately 399 for 20×20. Board coverage (score divided by maximum possible) provides a normalised metric:

Work	Grid	Best Score	Board Coverage
Wei et al. (2018)	12×12	17	~12%
Tushar & Siddique (2022)	12×12	20	~14%
Sebastianelli et al. (2021)	20×20	62	~16%
This model	20×20	125	~31%

The gap persists across normalisation. At 31% board coverage, this approach covers roughly double the grid fraction of the nearest published result and more than double the pixel-based CNN approaches.
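The coverage column is just score divided by the approximate maximum for that grid. A short sketch, using the maxima quoted above (141 for 12×12, 399 for 20×20) and a hypothetical helper name:

```python
MAX_SCORE = {"12x12": 141, "20x20": 399}   # approximate maxima from the text

results = [
    ("Wei et al. (2018)", "12x12", 17),
    ("Tushar & Siddique (2022)", "12x12", 20),
    ("Sebastianelli et al. (2021)", "20x20", 62),
    ("This model", "20x20", 125),
]

def board_coverage(score, grid):
    """Fraction of the maximum possible score reached on a given grid."""
    return score / MAX_SCORE[grid]

coverage_pct = {
    name: round(100 * board_coverage(score, grid))
    for name, grid, score in results
}
```

Rounded to whole percentages, this gives 12, 14, 16 and 31, the values in the table.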

Informal results (not peer-reviewed)

For completeness: a supervised learning project (Huynh, 2020) on 20×20 achieved a best of 46, and a deterministic shortest-path algorithm (Schoberg, 2020) reached 67 on 20×20. The latter is not a learned policy. Neither is peer-reviewed.

Why It Works

The encoding's advantage operates on two levels.

Information quality. The network receives exactly the information it needs to play Snake, in a spatial format that CNNs are designed to process, with zero noise or redundancy. Each channel answers one question: where is the head, where is the body, where is the food. There is no ambiguity to resolve, no motion to infer, no irrelevant visual detail to filter out.

With pixel inputs, the network must first learn to segment the image (for example, telling the snake's body apart from the background). Only then can it learn to interpret the spatial relationships between the segments. With Binary Plane Encoding, this segmentation is pre-constructed, leaving the network to devote its entire capacity to learning the actual game instead of learning how to see in the first place.

Information density. At 1,200 values per state stored as uint8, a replay buffer holding 1,000,000 transitions fits comfortably in approximately 1.6GB of VRAM. This made a GPU-resident replay buffer and 2048 parallel environments possible on a single RTX 2070 with 8GB of VRAM.

For comparison, storing Tushar's 84×84×4 binary inputs at the same buffer capacity would need approximately 28GB. Wei's 64×64×12 RGB inputs would need approximately 49GB. Neither fits on consumer hardware. You would need multiple high-end GPUs or cloud infrastructure to achieve the same training scale with pixel-based inputs.

The compact encoding didn't just improve information quality. It made the training infrastructure possible. 2048 parallel environments with a GPU-resident buffer meant the replay buffer reached useful diversity faster, the distributional RL gradient signal had richer data to work with, and the agent surpassed all previous records before reaching 100,000 training episodes.

Honest Caveats

This encoding is a privileged state representation. The agent receives information extracted directly from the game's internal data structures: exact head position, exact body segment positions, exact apple position. A human player has access to the same logical information through visual perception, but this agent receives it pre-structured without any perceptual processing.

The model plateaued at 125 (over 50,000 simulations without it budging), but a subsequent run using a variant algorithm has already broken that record, so we know this isn't the ceiling for the encoding. The more interesting question is whether pixel-based approaches could ever reach these scores given enough compute. Theoretically yes, but whether it's achievable in practice is unknown. Imperfections in the visual pipeline may compound through training, but that hypothesis hasn't been tested and the performance cost of segmentation quality on Snake hasn't been quantified. Whether the gap is recoverable or structural is an open question and one worth testing properly. If you take this on, I'd love to see what you find.

Cross-paper comparisons to Sebastianelli et al. and the pixel-based approaches should be read with the privileged state in mind. The improvement reflects the combined effect of encoding quality, architecture, algorithm, and training scale. Isolating each factor's individual contribution is the purpose of the ablation study this encoding supports.

What's Next

Binary Plane Encoding is the foundation for a systematic ablation study on Rainbow DQN applied to Snake. The study adds one component at a time (Double DQN, noisy exploration, dueling architecture, prioritised experience replay, C51 distributional output), measuring each component's individual contribution in a dense-reward, vectorised-environment setting.

Early results have already produced some surprises about which Rainbow components help and which ones hurt on a task like Snake. That is the next post.

If you have experience with alternative state representations for grid-based game AI, or if you have seen Binary Plane Encoding applied to Snake in work I haven't found, I'd genuinely like to hear about it in the comments.

This work is part of ongoing research and the findings are planned to be submitted as a peer-reviewed paper.

References

Peer-Reviewed

Sebastianelli et al. (2021) - "A Deep Q-Learning based approach applied to the Snake game" - 29th Mediterranean Conference on Control and Automation (MED). DOI: 10.1109/MED51440.2021.9480232

Kommalapati et al. (2025) - "Building an AI Snake Powered by Deep Reinforcement Learning and Deep Q-Learning" - IEEE 7th International Symposium on Advanced Electrical and Communication Technologies (ISAECT). DOI: 10.1109/ISAECT68904.2025.11318716

Wei et al. (2018) - "Autonomous Agents in Snake Game via Deep Reinforcement Learning" - IEEE International Conference on Agents (ICA), Singapore. DOI: 10.1109/AGENTS.2018.8460004

Tushar & Siddique (2022) - "A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents" - IEEE 16th International Conference on Application of Information and Communication Technologies (AICT). DOI: 10.1109/AICT55583.2022.10013603

Silver et al. (2018) - "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play" - Science 362, 1140-1144. DOI: 10.1126/science.aar6404

Informal / Community Work

Huynh (2020) - Supervised learning Snake AI. GitHub Repository

Schoberg (2020) - Deterministic algorithms for Snake. Medium Article