Ruben Ghafadaryan

The 64 KB Challenge: Teaching a Tiny Net to Play Pong

Introduction

As someone who started hacking in the mid’80s, I’m still a shameless fan of retro computers. Sure, they were hilariously limited, but those limits made us crafty. My first machine had 16 KB of RAM (about 2 KB reserved for video). Apps came from a cassette recorder, and somehow that was… fine.

When the Atari 65XE with its majestic 64 KB arrived, we were sure nothing could stop us. Fast-forward to today: I’m on a 64 GB RAM box with a GPU and a terabyte of storage - and I still catch myself thinking, “eh, I could use more.”

Meanwhile, the resources we casually throw at neural nets are a little terrifying. A standard PyTorch + CUDA install eats gigabytes of disk; “toy” experiments can heat a room and run for hours.

Unlike today’s parameter-hungry models, the earliest perceptron experiments ran on vacuum-tube mainframes like the IBM 704, which topped out at 32K 36-bit words (roughly 144 KB of storage). And yet, within that tiny footprint, the perceptron showed something revolutionary: you could learn a decision rule from examples instead of hand-coding logic.

So here’s the challenge I set for myself: build and train a tiny neural network that can play a simplified Pong as a partner/opponent against a rule-based bot - and keep the entire model plus its training data under 64 KB.

A few ground rules so the purists don’t sharpen their pitchforks:

  • I’m not writing this on an actual 8-bit machine. We’ll use modern Python, but we’ll measure and enforce memory like it’s 1987.
  • "Under 64 KB" means: serialized model parameters and model itself consume less than 64 KB memory together.

We’ll compare with a "don’t-hold-back" variant (PyTorch + CUDA), suggested by a large model - because contrast is fun.

And of course, we're not doing this just for nostalgia points. We're doing it because on-device neural nets for IoT are useful right now: they run without a network, keep data private, cut latency to near-zero, and sip power. Many teams building compact devices need models that are small, trainable on their own data, and autonomous at the edge. This project is a concrete example - and a spark - for deeper work on tiny, task-specific models that actually ship.

AI Usage Disclaimer

Did I use AI while building the tiny NN or writing this piece? Yes - selectively. Like most engineers, I use assistants for rough drafts, typo-hunting, and smoothing awkward sentences (helpful since English isn’t my first language). That doesn’t mean the article is auto-generated.
Neither codebase was generated by AI, though AI assistance was used extensively while working on them.

Guardrails I followed:

  • Every line of code and every equation was reviewed by me.
  • Titles, section breaks, and tone got light AI polish.
  • Constraints, numbers, and trade-offs come from hands-on experiments - not copy-paste.
  • If any AI-generated code goes in verbatim, I’ll say so explicitly.

Model Constraints and Shape

The whole point is to live under 64 KB - not just the serialized weights, but the training data as well. To make that possible, we don't feed pixels. We feed the game state: paddle and ball positions, their velocities, a small hint about where the ball is heading, and a rough "time-to-impact" estimate. Once normalized, you're looking at roughly a dozen scalars. It's signal, not scenery.

The network’s job is simple: choose one of three actions - up, hold, or down. No diagonals, no sideways drift. The architecture matches the task: inputs go into a small hidden layer, and out come three logits. At inference time we just pick the largest logit and move on to the next frame. The model implemented is [12] → [16] → [3].

NN Structure for Tiny Model

To save space, weights are stored as signed 4-bit values - two per byte. Activations, however, stay int8 with a fixed scale that covers about [-1, 1). That mix matters. On a network this small, pushing activations down to 4-bit as well makes collapse far more likely - you start seeing the model "stick" on one action because there just isn't enough dynamic range to separate situations cleanly. Keeping activations at int8 buys stability for a few extra bytes, which is a great trade.

Nonlinearity is a simple, saturating clamp. It’s cheap, keeps values in range, and doesn’t require lookup tables or trig functions. The final layer leaves us with three integer logits; we take the argmax and return the value.
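
To make the arithmetic concrete, here's a minimal sketch of that inference path in Python. The names (unpack_4bit, act, params) and the packing details are illustrative assumptions, not the exact shipped code:

```python
import numpy as np

ACT_MAX = 127  # int8 activations with a fixed scale covering roughly [-1, 1)

def unpack_4bit(packed: np.ndarray, count: int) -> np.ndarray:
    """Unpack signed 4-bit weights stored two per byte into values in [-8, 7]."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    nibbles = np.empty(packed.size * 2, dtype=np.int8)
    nibbles[0::2] = lo
    nibbles[1::2] = hi
    nibbles[nibbles > 7] -= 16              # map 8..15 back to -8..-1
    return nibbles[:count]

def layer(x, w, b, shift):
    """Integer matmul + bias, right-shift to rescale, then a saturating clamp."""
    acc = w.astype(np.int32) @ x.astype(np.int32) + b.astype(np.int32)
    acc >>= shift                           # cheap rescaling instead of float multiplies
    return np.clip(acc, -ACT_MAX, ACT_MAX).astype(np.int8)

def act(features, params):
    """12 int8 features in, index of the chosen action (UP / HOLD / DOWN) out."""
    # e.g. params["w1"] = unpack_4bit(blob, 12 * 16).reshape(16, 12), and so on
    h = layer(features, params["w1"], params["b1"], params["shift1"])               # [12] -> [16]
    logits = params["w2"].astype(np.int32) @ h.astype(np.int32) + params["b2"]      # [16] -> [3]
    return int(np.argmax(logits))
```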

Training Process

We train a small student network to mimic a calm, predictable teacher. The teacher is a simple "physics-intercept" bot: when the ball is coming toward our paddle, it projects the path forward - including wall bounces - until it reaches the paddle's x-line, then heads to meet it; when the ball is leaving, it slides back toward center. A tiny dead-zone around the paddle's middle prevents jitter. It's not flashy, but it's consistent, which gives us labels we can trust. We refer to this teacher as the "rule-based bot".
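
A minimal sketch of that teacher logic in normalized [0, 1] coordinates; the function names, dead-zone width, and coordinate convention are assumptions for illustration, not the repo's exact code:

```python
DEAD_ZONE = 0.02   # hypothetical width of the band around the paddle centre that prevents jitter

def predict_intercept(bx, by, vx, vy, paddle_x):
    """Project the ball to the paddle's x-line, reflecting the path off the top/bottom walls."""
    if vx == 0:
        return by
    t = (paddle_x - bx) / vx             # time (in frames) until the x-line is reached
    y = by + vy * t
    y = y % 2.0                          # fold the unbounded y back into [0, 1] ...
    return 2.0 - y if y > 1.0 else y     # ... mirroring each wall bounce

def teacher_action(bx, by, vx, vy, paddle_x, paddle_y, ball_incoming):
    """UP / HOLD / DOWN decision of the physics-intercept bot."""
    target = predict_intercept(bx, by, vx, vy, paddle_x) if ball_incoming else 0.5  # retreat to centre
    delta = target - paddle_y
    if abs(delta) < DEAD_ZONE:
        return "HOLD"
    # Assuming screen-style coordinates where y grows downward; flip the signs otherwise.
    return "UP" if delta < 0 else "DOWN"
```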

The inputs are the same compact signals we’ll use at runtime: paddle and ball positions, their velocities, a predicted intercept and its delta to the paddle, rough timing/speed hints, plus a couple of direction signs. Everything is normalized to [-1, +1] and stored as int8. Each example carries a single-byte label - UP, HOLD, or DOWN - so one sample is only a handful of bytes.
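
Here's a hedged sketch of how such a sample could be packed; the exact feature order, label encoding, and helper names in the repo may differ:

```python
import numpy as np

UP, HOLD, DOWN = 0, 1, 2                     # illustrative single-byte label encoding

def to_int8(x: float) -> int:
    """Map a normalized value in [-1, +1] to a signed byte."""
    return int(np.clip(round(x * 127), -127, 127))

def encode_sample(features, action):
    """12 normalized features -> 12 int8 bytes plus a 1-byte label: 13 bytes per sample."""
    assert len(features) == 12
    packed = np.array([to_int8(f) for f in features], dtype=np.int8)
    return packed.tobytes() + bytes([action])
```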

Because deployment uses 4-bit weights and 8-bit activations, we train with quantization in the loop: parameters are discrete, activations are clamped, and each layer can apply a small right-shift to keep values in range. This avoids the classic trap of "looks great in float32, collapses after quantization."

Optimization stays deliberately simple: hill climbing. Start with small, varied integers; nudge one weight by ±1 (and occasionally a bias); score against the teacher; keep the change if accuracy doesn’t get worse. With only a few hundred parameters, that’s enough - and it matches the discrete space we actually ship.
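
The whole optimizer fits in a few lines. A sketch with illustrative names, assuming the weights live in the signed 4-bit range:

```python
import random

def hill_climb(params, score_fn, iterations=12_000, lo=-8, hi=7):
    """Greedy ±1 hill climbing over a flat list of small integer parameters.

    score_fn(params) returns accuracy against the teacher's labels; a change is
    kept whenever the score doesn't get worse. Names are illustrative, not the
    repo's exact API.
    """
    best = score_fn(params)
    for _ in range(iterations):
        i = random.randrange(len(params))
        old = params[i]
        params[i] = max(lo, min(hi, old + random.choice((-1, 1))))
        new = score_fn(params)
        if new >= best:
            best = new            # keep the nudge
        else:
            params[i] = old       # revert it
    return params, best
```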

What do we watch while training? Accuracy, obviously, but also saturation. If too many activations are pegged at the rails, we bump a layer’s right-shift by one bit or trim fan-in. We also do short rollouts against the teacher to catch late reactions, camping, or oscillation. When accuracy plateaus and behavior looks clean, we serialize the tiny parameters, log the byte counts, and confirm we’re still under 64 KB.
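
The saturation check itself can be as simple as counting rail-pinned activations; a sketch under assumed thresholds:

```python
import numpy as np

def saturation_rate(activations, rail=127):
    """Fraction of a layer's activations pinned at the int8 rails after the clamp."""
    return float(np.mean(np.abs(np.asarray(activations)) >= rail))

# Illustrative policy: if a hidden layer saturates too often during evaluation,
# widen that layer's right-shift by one bit and re-score.
# if saturation_rate(hidden_batch) > 0.25:
#     params["shift1"] += 1
```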

Alternate Path for Comparison

For contrast, we also built a no-limits version in PyTorch, using CUDA when it's available. The network is straightforward - 12 inputs, two hidden layers of 128 and 64 with ReLU, and 3 outputs for UP, HOLD, DOWN - so:
[12] → [128] → [64] → [3].

NN Structure for PyTorch Model

It trains against the same rule-based bot, sees the same normalized features, and makes decisions by taking the argmax of its logits. No quantization here; it’s float all the way.
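
A sketch of that baseline, assuming a plain cross-entropy fit against the teacher's labels (class and variable names plus hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn

class PongPolicy(nn.Module):
    """No-limits baseline: [12] -> [128] -> [64] -> [3], float32 all the way."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(12, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 3),                 # logits for UP, HOLD, DOWN
        )

    def forward(self, x):
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PongPolicy().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # hyperparameters are illustrative
loss_fn = nn.CrossEntropyLoss()

def train_step(features, labels):
    """One gradient step on teacher-labelled data: float32 features [N, 12], int64 labels [N]."""
    logits = model(features.to(device))
    loss = loss_fn(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```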

There’s also a distillation option: train the tiny integer model using the big PyTorch model as the teacher instead of the rule-based one. That gives us an apples-to-apples comparison and a clean way to see what extra capacity buys—and what careful quantization can keep.
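
The distillation variant only swaps where the labels come from; a hedged sketch, assuming the big model's argmax decisions are recorded in the same one-byte format:

```python
import torch

@torch.no_grad()
def distill_labels(big_model, features):
    """Replace the rule-based teacher's labels with the big model's argmax decisions."""
    logits = big_model(features)                       # float32 features [N, 12]
    return logits.argmax(dim=1).to(torch.uint8)        # one byte per label, same format as before
```

The tiny model's hill-climb loop then consumes these labels exactly as it would the rule-based bot's.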

The alternate path was built with AI assistance and reviewed manually afterwards.
The network architecture was suggested by AI and then negotiated down to the agreed minimum.

The Visualizer and CLI Player

We test the model in a small, deterministic arena: logic lives in [0, 1], rendering goes to pixels, the model plays on the right, and the rule-based bot plays on the left. Each frame builds the same 12-feature vector used in training, queries the model for three logits, turns that into an action, updates both paddles at a fixed speed, steps the ball with clean top/bottom bounces, checks paddle hits at their x-lines, and nudges ball speed slightly after successful returns (capped so rallies stay readable). A miss updates the score and triggers a fresh serve.
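
For orientation, here's a self-contained sketch of one such frame step. The constants and names are illustrative, and the two model queries that produce the actions (feature build plus argmax over three logits) are omitted:

```python
from dataclasses import dataclass
import random

PADDLE_SPEED = 0.02     # illustrative constants; the repo's values may differ
PADDLE_HALF = 0.08
SPEED_CAP = 0.03

@dataclass
class Arena:
    ball_x: float = 0.5
    ball_y: float = 0.5
    vel_x: float = 0.01
    vel_y: float = 0.007
    left_y: float = 0.5
    right_y: float = 0.5
    score_left: int = 0
    score_right: int = 0

def serve(a: Arena) -> None:
    """Fresh serve from the centre; randomness is assumed to be seeded for reproducibility."""
    a.ball_x, a.ball_y = 0.5, 0.5
    a.vel_x = random.choice((-0.01, 0.01))
    a.vel_y = random.uniform(-0.008, 0.008)

def step_frame(a: Arena, left_action: int, right_action: int) -> None:
    """One frame: move both paddles, step the ball, handle bounces, returns, and misses."""
    # Actions are -1 (up), 0 (hold), +1 (down); paddles move at a fixed speed.
    a.left_y = min(1.0, max(0.0, a.left_y + left_action * PADDLE_SPEED))
    a.right_y = min(1.0, max(0.0, a.right_y + right_action * PADDLE_SPEED))

    a.ball_x += a.vel_x
    a.ball_y += a.vel_y
    if a.ball_y < 0.0 or a.ball_y > 1.0:          # clean top/bottom bounces
        a.vel_y = -a.vel_y
        a.ball_y = min(1.0, max(0.0, a.ball_y))

    if a.ball_x <= 0.0:                           # left paddle's x-line
        if abs(a.ball_y - a.left_y) <= PADDLE_HALF:
            a.vel_x = min(SPEED_CAP, abs(a.vel_x) * 1.05)   # slight, capped speed-up on return
            a.ball_x = 0.0
        else:
            a.score_right += 1
            serve(a)
    elif a.ball_x >= 1.0:                         # right paddle's x-line
        if abs(a.ball_y - a.right_y) <= PADDLE_HALF:
            a.vel_x = -min(SPEED_CAP, abs(a.vel_x) * 1.05)
            a.ball_x = 1.0
        else:
            a.score_left += 1
            serve(a)
```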

Controls exist to help us observe, not to get in the way: pause, slow-motion, and a quick reset to reproduce openings. A small overlay shows actions, ball and paddle positions/velocities, the model’s integer logits, FPS, and slow-mo status. The loop stays flat and predictable - two decisions, one physics step, one draw.

The model bundle loads at startup. Randomness is seeded so interesting rallies are reproducible, and long runs can stop after a target score for clean comparisons. The UI is intentionally minimal - light AI-assisted scaffolding, fully reviewed - so the focus stays on how the tiny net thinks.

Screenshot of Game Visualizer

The visualizer can pit the tiny model or the no-limits model against the bot—or against each other. For longer experiments, a CLI mode runs series of games up to a chosen point total and reports rally lengths and basic match statistics.

Each model also has its own game visualizer (kept for historical reasons), with the same UI but limited to that particular model.

The Project

The project is available on GitHub:

Tiny model (tiny/)

PyTorch no-limits model (torch_based/)

Top-level utilities

  • versus_game.py - pits any two models (tiny or PyTorch) against each other or the rule-based bot.
  • versus_game_cli.py - CLI runner for series of games (no UI); outputs rally lengths and match stats.

All scripts expose an exhaustive set of command-line options via --help.

The source code is free to download and use. This is an active work in progress and is provided as is, without warranties; use at your own risk.

Results

We trained two players on the same normalized 12-feature inputs and the same rule-based teacher: a no-limits PyTorch model (CUDA if available) and the tiny quantized model.

Training setup:

  • PyTorch model: 100,000 samples · 8 epochs
  • Tiny model: 12,000 hill-climb iterations

Memory Budget

| Model | Parameters | Model bytes | Features bytes | Labels bytes | Total bytes | Approx. size |
|---|---|---|---|---|---|---|
| PyTorch (no-limits) | 10,499 | 41,996 | 5,760,000 | 120,000 | 5,921,996 | ≈ 5.65 MB |
| Tiny (4-bit / int8) |  | 141 | 36,864 | 3,072 | 40,077 | ≈ 39.14 KB |
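
As a sanity check on the tiny row, assuming roughly 3,072 recorded samples: 3,072 samples × 12 int8 features = 36,864 feature bytes, plus 3,072 one-byte labels, plus 141 bytes of packed parameters, gives 40,077 bytes ≈ 39.14 KB - well under the 64 KB budget.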

Game Results

100 games were played for each matchup; to win a game, a player must win 3 balls.

| Matchup | LEFT | RIGHT | LEFT wins % | Games | Rally min | Rally avg | Rally max |
|---|---|---|---|---|---|---|---|
| torch-based vs rule-based | torch-based | rule-based | 68.0% | 100 | 120 | 458.04 | 2269 |
| tiny (trained on rule-based) vs rule-based | tiny | rule-based | 43.0% | 100 | 120 | 390.14 | 2267 |
| tiny vs torch-based | tiny | torch-based | 13.0% | 100 | 120 | 1088.24 | 9127 |

Takeaways

  • Against the rule-based baseline, both learners perform competitively; the PyTorch model wins more often, but the tiny model isn’t far off and even edges some runs depending on seeds and lengths.
  • Head-to-head, the PyTorch model clearly outplays the tiny model—no surprise given its capacity and float precision.
  • Long rallies show there are no one-shot games, and the tiny network can hold its ground for a while.
  • The tiny pipeline still delivers playable, stable behavior inside a ~39 KB bundle, which was the primary goal. Results will shift with game length, sampling, and training settings, so there’s room to tune and explore.

Conclusion

You don’t need a datacenter to teach a machine a good habit. A tiny, quantized network - with a few hundred bytes of parameters and a few tens of kilobytes of data - can learn a useful policy and hold its own against a solid rule-based player. The big PyTorch model wins, of course, but the small one shows up, plays real rallies, and does it inside a 39 KB envelope.

Why it matters: the world is full of little devices that don’t want a cloud—sensors, toys, tools, quiet boxes on factory floors. They need models that wake up fast, think locally, and sip power. This project shows those models aren’t just possible - they’re practical.

With the right constraints, small models stay focused: just enough to do the job, nothing more. This 64 KB challenge is a spark for further work on tiny, task-specific neural nets.
