DEV Community: Ruben Ghafadaryan

Why Clothing Matching Is More Complicated Than Asking ChatGPT

Ruben Ghafadaryan — Mon, 18 May 2026 16:17:35 +0000

Brief Summary

At first glance, AI-based clothing matching looks deceptively simple: upload two garment photos into a modern multimodal AI model and ask whether they suit each other. In reality, the problem quickly expands into a combination of color theory, computer vision, pattern analysis, visual psychology, trend modeling, and semantic style understanding. This article describes why a dedicated matching system may still be needed even in the era of large multimodal AI models, what technical ideas stand behind such systems, and why the final solution will most likely require a hybrid architecture and a significant amount of experimentation.

Disclaimer:

This article is not AI-generated. However, as English is not my native language, AI tools were used to help polish the style, improve readability, and double-check several technical and historical references. The ideas, conclusions, and technical direction remain entirely mine. In cases where suitable illustrations could not be found online, or where copyright limitations existed, AI-generated images were used as placeholders or conceptual illustrations.

Calling AI?

Discussing a potential fashion-related AI project, I requested a surprisingly large amount of resources for a seemingly simple feature: clothing matching. More specifically, a system that could look at two garments and answer the familiar question:

“Do these clothes actually suit each other?”

Very quickly the obvious question appeared:

“Why don’t we simply upload two images into ChatGPT or another multimodal AI and ask whether they match?”

At first glance, this sounds entirely reasonable. Modern AI models can already describe images, recognize objects, explain paintings, generate realistic photographs, and analyze visual content surprisingly well. Surely they can decide whether trousers and a shirt work together.

The deeper we discussed the task, however, the more we discovered an uncomfortable truth: fashion is one of those areas where humans make highly subjective visual decisions and still expect them to be consistent.

And unfortunately, consistency is exactly what production systems require.

A Small Personal Problem

I should probably clarify that I am not a fashion expert.

Like many engineers, my practical understanding of clothing historically operated somewhere between “looks acceptable” and “at least the colors are not actively dangerous.” Most successful color combinations in my wardrobe were selected not by me, but by my wife.

Unfortunately, I am a mathematician first.

And mathematicians have a predictable habit: whenever they cannot rely on intuition, they start building models.

So while many people may simply feel that two garments do not work together, my brain immediately starts asking which parameters conflict, whether saturation is involved, how visual attention is distributed, and whether the problem can be represented numerically.

At that point, the fashion problem quietly transforms into signal processing with visual side effects.

The Black-and-Yellow Blouse Problem

Imagine a black blouse with yellow circles.

Now combine it with plain white shorts.

Most people would probably say:

“Looks good.”

Now replace the white shorts with bright red trousers.

Suddenly the same blouse starts producing a completely different visual impression.

What changed?

Not the blouse. Not even the number of colors.

The issue is that fashion compatibility is not simply:

“Are the colors similar?”

In reality, the human brain evaluates contrast, saturation, visual balance, dominant versus accent colors, pattern interaction, complexity, texture, style, and overall visual harmony.

At this point, the project stopped looking like a small image comparison tool and started looking much closer to computational psychology with RGB values.

Image 1. Same blouse. Completely different visual balance

Why Generic Multimodal AI Is Not Enough

Modern multimodal systems like ChatGPT are genuinely impressive. If you upload two garments and ask whether they match, the answer will often sound convincing.

The problem is that “convincing” and “reliable production logic” are not the same thing.

A general-purpose AI model behaves more like a highly educated assistant with broad visual knowledge, but without stable fashion rules. Sometimes the answer will be excellent, sometimes inconsistent, and sometimes the same images will produce different answers after a model update.

This becomes problematic for a real product.

A customer-facing fashion system requires stable behavior, measurable quality, explainable reasoning, controllable business logic, predictable latency, and the ability to evolve into recommendation systems later.

“Ask a large neural network and hope for the best” becomes less convincing once product teams, customers, and legal departments enter the discussion.

The First Technical Surprise: RGB Is Almost Useless

Initially, one might imagine a straightforward solution: extract image colors, compare RGB values, calculate distances, produce a score.

This idea survives until the first real examples appear.

Humans do not perceive colors in RGB space.

Two shades of green may technically belong to the same color family while visually clashing completely. An acid neon green and a soft watercolor green may both be “green,” but placing them together can create a distinctly uncomfortable visual effect.

This is why serious color analysis usually avoids RGB and instead uses perceptual color spaces such as LAB or LCH. These models separate lightness from color information (LCH expresses the latter as chroma and hue) in a way much closer to human perception.

In simplified form, the difference between two colors may be represented as a distance in perceptual color space:

$d = \sqrt{(L_{1} - L_{2})^{2} + (a_{1} - a_{2})^{2} + (b_{1} - b_{2})^{2}}$

Fortunately, the practical meaning is much simpler than the formula itself: the higher the value, the more differently humans perceive the colors.
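To make the idea concrete, here is a minimal sketch of a perceptual color distance in Python. It assumes 8-bit sRGB input and the D65 reference white, and it uses the simplest LAB distance (the Euclidean one, often called CIE76); the function names and the two sample colors are illustrative, not part of any existing system.

```python
import math

def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 reference white)."""
    def to_linear(c):
        # Undo the sRGB gamma curve to get linear light in [0, 1].
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)

    # Linear RGB -> CIE XYZ using the standard sRGB (D65) matrix.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Cube root with a linear segment near zero, as defined by CIELAB.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.00000), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def delta_e76(rgb1, rgb2):
    """Euclidean distance between two colors in LAB space (CIE76 delta E)."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))

# Illustrative values only: an acid neon green and a soft pastel green.
neon_green = (57, 255, 20)
soft_green = (168, 213, 186)
print(round(delta_e76(neon_green, soft_green), 1))
```

Both colors would be labeled "green", yet the distance between them is large. More refined formulas (for example CIEDE2000) exist, but the simple version is enough to show why LAB is a better starting point than raw RGB.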

Suddenly the system no longer sees:

“green”

but instead:

“high-chroma aggressive green with strong visual dominance.”

Which is considerably more useful for fashion analysis.

Fashion Is Not Always Logical

Another important realization appears quite quickly: even if we successfully model “good” visual combinations mathematically, fashion trends may completely ignore logic.

History repeatedly demonstrates that objectively questionable combinations can suddenly become fashionable because celebrities popularized them, luxury brands promoted them, social media amplified them, or entire industries decided they looked modern.

Acid neon colors periodically return every few years, oversized silhouettes repeatedly cycle back into fashion, and combinations once considered excessive suddenly become trendy again.

Meanwhile Coco Chanel helped popularize elegant black clothing as timeless classics, even though black had historically been associated more with practicality or mourning.

Fashion is full of such contradictions.

This creates an additional challenge for AI systems: they must distinguish between timeless visual harmony and temporary cultural trends.

A system based purely on “classical matching rules” might reject combinations that become fashionable for entirely cultural reasons. This means future systems may eventually require trend-awareness, contextual modes, or “experimental fashion” profiles.

Because sometimes people intentionally wear combinations designed to attract attention rather than to achieve visual harmony.

Image 2. Fashion periodically ignores logic and remains successful anyway.

The Unexpected Complexity of Patterns

The next surprise came from ornaments and patterns.

Even if colors match perfectly, patterns may still conflict visually. A floral blouse combined with plaid trousers may technically share compatible colors, yet the result still feels visually overloaded.

The problem is no longer color harmony. It becomes visual attention competition.

Fashion turns out to have surprisingly strict unwritten rules regarding which garment is allowed to dominate visually. Usually one expressive item works well, while multiple expressive items quickly become risky.

This means the system needs to analyze not only colors, but also ornament density, texture complexity, edge distribution, and pattern frequency.

At this point FFT unexpectedly enters the fashion industry.

Yes, Fast Fourier Transform.

The same mathematical tool used in signal processing can help estimate how visually “busy” a garment is. Repetitive patterns generate different frequency signatures than smooth minimalistic surfaces.
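As a rough illustration, the sketch below estimates "busyness" as the share of spectral energy that sits above a low-frequency cutoff. It is an assumption-heavy toy: it works on a small square grayscale patch given as a list of lists, uses a naive pure-Python DFT to stay dependency-free (a real pipeline would more likely call something like numpy.fft.fft2), and the cutoff value and function names are invented for this example.

```python
import cmath

def dft2(patch):
    """Naive 2D discrete Fourier transform of a small square grayscale patch."""
    n = len(patch)
    # Precompute the n complex roots of unity used by the transform.
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    # The 2D DFT is separable: transform every row, then every column.
    rows = [[sum(row[x] * w[(u * x) % n] for x in range(n)) for u in range(n)]
            for row in patch]
    return [[sum(rows[y][u] * w[(v * y) % n] for y in range(n)) for u in range(n)]
            for v in range(n)]

def busyness(patch, low_cutoff=2):
    """Share of non-DC spectral energy above `low_cutoff` cycles per patch."""
    n = len(patch)
    spectrum = dft2(patch)
    high = total = 0.0
    for v in range(n):
        for u in range(n):
            if u == 0 and v == 0:
                continue  # skip DC: mean brightness says nothing about pattern
            # Indices above n/2 are negative frequencies; fold them back.
            fu, fv = min(u, n - u), min(v, n - v)
            energy = abs(spectrum[v][u]) ** 2
            total += energy
            if max(fu, fv) > low_cutoff:
                high += energy
    # A (nearly) flat patch has no pattern energy at all.
    return high / total if total > 1e-9 else 0.0

checkerboard = [[(x + y) % 2 for x in range(8)] for y in range(8)]
gradient = [[x / 7 for x in range(8)] for _ in range(8)]
print(busyness(checkerboard) > busyness(gradient))  # fine pattern vs. smooth ramp
```

A dense checkerboard concentrates its energy at the highest frequency, while a smooth gradient keeps most of it near the low end, so the first scores as much "busier" than the second.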

In simplified terms, compatibility itself may eventually become a weighted combination of many visual parameters:

$S = w_{1} C + w_{2} P + w_{3} T + w_{4} V$

Where:

$C$ — color harmony
$P$ — pattern compatibility
$T$ — texture and style semantics
$V$ — visual balance
$w_{i}$ — configurable weights defining importance of each factor

Fortunately, the real implementation would look slightly less frightening than the equation suggests.
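A minimal sketch of the weighted score in Python might look like this. The component scores are assumed to be already normalized to [0, 1], the weights and sample values are hypothetical, and dividing by the weight sum is an extra assumption added here so that the result also stays in [0, 1].

```python
def compatibility_score(components, weights):
    """Weighted average of component scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * components[name] for name, w in weights.items()) / total_weight

# Hypothetical weights and scores, chosen only for illustration.
weights = {"C": 0.4, "P": 0.3, "T": 0.1, "V": 0.2}
loud_pair = {"C": 0.6, "P": 0.2, "T": 0.7, "V": 0.3}   # clashing patterns
calm_pair = {"C": 0.8, "P": 0.9, "T": 0.7, "V": 0.9}   # one loud item, one neutral
print(compatibility_score(loud_pair, weights) < compatibility_score(calm_pair, weights))
```

Keeping the weights in a plain dictionary makes them easy to expose as configuration later, which matters once different customers want different priorities.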

Image 3. Signal processing meets fashion analysis

Fashion Is Largely About Visual Hierarchy

One of the most interesting discoveries was that outfits are not evaluated garment-by-garment.

Humans evaluate the balance of visual attention.

A neutral white pair of shorts works well with a loud patterned blouse because the shorts visually “step back” and allow the blouse to dominate. A second visually loud garment creates competition, and the eye no longer knows where to focus.

This is why neutral colors are so powerful in fashion: black, white, gray, beige, navy. These colors help stabilize combinations containing stronger visual elements.

A good matching engine therefore does not simply compare colors. It tries to estimate which garment dominates, which one supports, and whether the outfit becomes visually overloaded.

Where Modern AI Actually Becomes Useful

Ironically, after all this discussion about not relying entirely on multimodal AI, we still ended up wanting multimodal AI.

Just not alone.

Models like CLIP turned out to be extremely interesting because they understand higher-level visual semantics.

Traditional computer vision can detect dominant colors, saturation, contrast, texture complexity, edge density, FFT-based pattern frequencies, and visual attention balance. But CLIP can additionally recognize concepts like elegant, sporty, streetwear, minimalist, vintage, or formal.

This creates a hybrid architecture: classical deterministic visual analysis combined with semantic embedding systems.

The deterministic layer provides stability and explainability. The multimodal layer provides aesthetic understanding.

Together they form a much more reliable system than either approach alone.

Image 4. The system does considerably more than compare two JPEG files.

Can Fashion Trends Be Formalized?

One particularly interesting extension of such systems is the ability to formalize trends and allow customers to define their own style preferences instead of relying on a single universal “good taste” model.

Because the uncomfortable truth is that fashion rules are not fixed.

A luxury fashion retailer, a youth streetwear platform, and a conservative office clothing brand may evaluate the same outfit very differently.

This means the system eventually needs something resembling a configurable style-definition layer.

Conceptually, this could work as a lightweight “fashion policy language,” where customers define preferred visual patterns and forbidden combinations.

For example, one profile may prefer high neutral color ratios, low pattern complexity, limited accent colors, and conservative combinations. Another may intentionally allow strong contrasts, multiple accent colors, mixed patterns, and visually aggressive combinations.

In practice, the configuration may look like structured metadata or policy definitions:

{
  "style_profile": "minimalist_modern",
  "preferred": {
    "neutral_ratio_min": 0.5,
    "pattern_complexity_max": 0.4
  },
  "avoid": {
    "high_saturation_pairs": true
  }
}

Or, for a more experimental audience:

{
  "style_profile": "streetwear_experimental",
  "allow": {
    "multiple_accents": true,
    "mixed_patterns": true
  },
  "preferred": {
    "visual_attention_score_min": 0.7
  }
}
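One possible way to evaluate such a profile is sketched below in Python. Everything here is an assumption made for illustration: the feature names (`neutral_ratio`, `pattern_complexity`, `high_saturation_pairs`) are taken to be values produced by earlier analysis stages, and the `_min` / `_max` suffix convention is simply read off the example profiles above.

```python
import json

PROFILE = json.loads("""
{
  "style_profile": "minimalist_modern",
  "preferred": {"neutral_ratio_min": 0.5, "pattern_complexity_max": 0.4},
  "avoid": {"high_saturation_pairs": true}
}
""")

def check_outfit(features, profile):
    """Return a list of readable rule violations (an empty list means it passes)."""
    violations = []
    for key, limit in profile.get("preferred", {}).items():
        # Assumed convention: "<feature>_min" or "<feature>_max".
        name, _, bound = key.rpartition("_")
        value = features[name]
        if bound == "min" and value < limit:
            violations.append(f"{name}={value} is below the minimum {limit}")
        elif bound == "max" and value > limit:
            violations.append(f"{name}={value} is above the maximum {limit}")
    for flag, forbidden in profile.get("avoid", {}).items():
        if forbidden and features.get(flag, False):
            violations.append(f"{flag} is not allowed by this profile")
    return violations

# Hypothetical measurements for two outfits.
quiet = {"neutral_ratio": 0.7, "pattern_complexity": 0.2, "high_saturation_pairs": False}
loud = {"neutral_ratio": 0.3, "pattern_complexity": 0.2, "high_saturation_pairs": True}
print(check_outfit(quiet, PROFILE))
print(check_outfit(loud, PROFILE))
```

Returning the violated rules rather than a bare yes/no keeps the result explainable, which is one of the production requirements listed earlier.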

Over time, such systems could also become trend-aware: “prefer Scandinavian minimalism,” “allow neon revival trends,” “follow contemporary streetwear aesthetics,” or “avoid runway-style combinations.”

At that point the system stops behaving like a static rule engine and starts acting more like a configurable visual recommendation platform.

Ironically, this is very similar to many engineering systems: eventually the difficult part is not the algorithm itself, but allowing humans to customize the behavior without completely destroying the logic underneath.

How One Simple Task Becomes a Research Direction

In this case, we were not asked only for a small “do these two clothes match?” feature. The broader request was closer to an intelligent fashion platform, and clothing compatibility was just one selected problem from a much larger set.

However, this single problem is already enough to show the scale of the domain.

So the important conclusion is not that one matching feature is unexpectedly large. The more important conclusion is that this “simple” task is a good warning signal: the whole fashion intelligence domain may contain many similar subproblems, each of which can turn into a separate research and experimentation track.

In other words, clothing matching is not the whole platform.

It is just one doorway into the larger problem.

Conclusion

The original question was:

“Why don’t we just ask ChatGPT whether two garments suit each other?”

The answer turned out to be surprisingly complicated.

Because fashion compatibility is not merely image recognition, color comparison, or text generation. It is a combination of perceptual color theory, visual hierarchy, pattern interaction, texture analysis, semantic style understanding, human psychological perception, and evolving cultural trends.

Modern multimodal AI is absolutely useful in this domain. In fact, it may become one of the most important components of the final architecture.

But relying solely on a generic AI prompt is unlikely to produce a robust production-grade solution.

Most likely, the real implementation will require experimentation, hybrid architectures, handcrafted visual descriptors, semantic embedding systems, configurable trend profiles, and a considerable amount of testing.

Which leads to the final and slightly uncomfortable conclusion:

This task may still require a lot of experimenting before the right approach is found.

Detecting Logo Similarity: Combining AI Embeddings with Fourier Descriptors

Ruben Ghafadaryan — Sun, 09 Nov 2025 16:45:58 +0000

Introduction

This article started from a conversation in our V-Mobile office. We were discussing cases where new company logos suspiciously resembled famous brands. In many instances, these similarities seemed intentional—designed to confuse customers and boost sales, especially in smaller markets.
This got me thinking: Could we build a system to automatically detect when a new logo copies an existing one?
At first glance, this looks like a straightforward image similarity problem. Many tools handle this well. However, logos are special. They're not like regular photos or illustrations, and as I discovered, detecting logo similarities is far more challenging than expected.

The Challenge with AI-Based Tools

As AI enthusiasts, we naturally started with popular AI models.

DINO: Great, But Not Perfect

DINO is excellent for image similarity detection. However, it can be easily confused by background changes or gradient fills.
Example: Image 2 is a slightly modified version of Image 1. When I tested them with DINO (specifically dinov2-small), the cosine distance between their embeddings was 0.56. (Note: Throughout this article, "distance" means cosine distance unless specified otherwise.)

This high distance means DINO thinks they're quite different, even though they're clearly similar to human eyes. This creates false negatives—we might miss real similarities.

Image 1.

Image 2.
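For reference, cosine distance is usually defined as one minus cosine similarity, so 0 means "same direction" and larger values mean "more different". A minimal pure-Python sketch, with the embeddings assumed to be plain lists of floats:

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because only the direction of the vectors matters, the measure ignores embedding magnitude, which is why it is a common choice for comparing model embeddings.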

CLIP: Another Piece of the Puzzle

CLIP is another powerful similarity tool. It builds embeddings based on what the image represents semantically—in other words, it tries to describe the picture's content.
This works great for most images, but logos often contain abstract curves and shapes that don't have clear semantic meaning. When I compared two visually different images: Image 1 and Image 3, CLIP gave them a distance of 0.80, suggesting they're quite similar just because they share some semantic elements.

Image 3.

The Verdict

Relying solely on CLIP or DINO won't give us reliable results. We need additional tools.

Bringing Vectors into the Mix

We needed something to help re-rank results from CLIP and DINO. Ideally, this tool should be:

Invariant to colors
Optionally invariant to rotations or scaling (in case someone tries to trick the system)

I decided to explore vector representations. What if we convert raster images to vectors and analyze the vector data? This could give us more flexibility.

Converting Images to Vectors

First, I converted PNG logos to SVG vector files. But before conversion, I preprocessed each image:

Remove the alpha channel to eliminate transparency
Remove background using rembg
Crop near-white colors to avoid confusing the tracer with minor elements
Limit the maximum dimension to 1024 pixels
Remove noise using a median filter
Increase contrast for clearer edges

After preprocessing, I fed the images to vtracer. To keep things consistent, I limited the output to cubic Bézier curves: parametric curves defined by 4 control points.
The results were promising! The vectorized versions captured the essential shapes while eliminating noise.

Image 4. Original PNG Logo File

Image 5. Pre-processed image after tracing (shown as a screenshot, because the article editor does not support SVG uploads).

Analyzing Bézier Curves with Fourier Descriptors

Now we have SVG files, but we can't compare text files directly. Instead, we need to compare their geometric components.
vtracer gives us paths as cubic Bézier curves. Here's how we extract meaningful data:

Sample the curves: Since Bézier curves are easy to evaluate at any point, we sample each curve into a fixed number of 2D points
Apply Fourier Transform: We treat this sequence of points as a signal and apply a Discrete Fourier Transform (DFT)
Extract Fourier descriptors: The low-frequency Fourier coefficients become our shape descriptor
Normalize: We normalize the sampled points to make them comparable: subtract the centroid (translation invariance), divide by the scale (scale invariance), and optionally fix the starting point (rotation invariance)

Now each curve is represented by a fixed-length vector that we can store and compare, just like other embeddings.
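The steps above can be sketched in a few lines of Python. This is not the project's code: the function names, the sample count, and the number of kept coefficients are assumptions, and instead of fixing the starting point the sketch keeps only coefficient magnitudes, which is a common alternative way to discard rotation.

```python
import cmath
import math

def sample_cubic_bezier(p0, p1, p2, p3, n=32):
    """Evaluate a cubic Bézier curve at n evenly spaced parameter values."""
    points = []
    for i in range(n):
        t = i / (n - 1)
        u = 1 - t
        # Bernstein form of the cubic Bézier polynomial.
        x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
        y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
        points.append(complex(x, y))  # treat each 2D point as one complex sample
    return points

def fourier_descriptor(points, n_coeffs=8):
    """Fixed-length shape descriptor from low-frequency DFT magnitudes."""
    n = len(points)
    centroid = sum(points) / n
    centered = [p - centroid for p in points]            # translation invariance
    scale = math.sqrt(sum(abs(p) ** 2 for p in centered) / n)
    if scale == 0:
        raise ValueError("degenerate curve: all sampled points coincide")
    normalized = [p / scale for p in centered]           # scale invariance
    # Naive DFT, keeping only the first few positive-frequency coefficients.
    coeffs = [
        sum(z * cmath.exp(-2j * cmath.pi * k * m / n) for m, z in enumerate(normalized)) / n
        for k in range(1, n_coeffs + 1)
    ]
    # Magnitudes drop the phase, so a rotated copy gives the same descriptor.
    return [abs(c) for c in coeffs]

curve = sample_cubic_bezier((0, 0), (1, 2), (3, 2), (4, 0))
# Same curve scaled by 2, rotated by 90 degrees, and shifted by (5, -3).
moved = sample_cubic_bezier((5, -3), (1, -1), (1, 3), (5, 5))
print(len(fourier_descriptor(curve)))
```

In this sketch the two descriptors agree up to floating-point error, which is exactly the invariance a color- and position-independent comparison needs.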

Image 6. An AI-generated image illustrating extraction of Fourier descriptors.

The key advantage: Unlike CLIP and DINO, these descriptors capture pure geometry rather than semantics, making them better for fine-grained shape comparison.

The Catch: False Positives

Unfortunately, this approach has its own problem: false positives. Completely different images might contain similar curves, producing misleadingly high similarity scores.

For example, when comparing two clearly similar images (Image 1 and Image 2), the Fourier descriptor distance was 0.63, only moderately similar. But when comparing one of them to a completely different image (Image 3), the distance was 0.89, only slightly more different.

I also tried calculating Chamfer distance between individual Bézier curves for point-to-point matching, but this made things worse. The problem remained: too many false positives.
At this point, I needed to step back and rethink the approach.

The Solution: A Combined Approach

After all this experimentation, I reached these conclusions:

DINO is powerful but can produce false negatives
CLIP is powerful but can produce false positives
Fourier descriptors are prone to false positives, but can still help filter noise

Each method has strengths and weaknesses. The solution? Combine them all.

The Weighted Formula

Similarity = (DINO × 0.7) + (CLIP × 0.2) + (Fourier × 0.1)

I assigned the highest weight to DINO since it's generally most reliable. CLIP gets a moderate weight, and Fourier descriptors get a small weight just to help filter edge cases.
These weights came from empirical testing and produced much more reliable results.
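In code, the formula is a one-liner. The sketch below assumes that each input has already been converted to a similarity in [0, 1], where higher means more alike (for cosine distance that would typically mean using 1 minus the distance); that conversion step is an assumption of this example.

```python
WEIGHTS = {"dino": 0.7, "clip": 0.2, "fourier": 0.1}

def combined_similarity(dino, clip, fourier, weights=WEIGHTS):
    """Weighted sum of three similarity scores, each assumed to be in [0, 1]."""
    return (dino * weights["dino"]
            + clip * weights["clip"]
            + fourier * weights["fourier"])

# Hypothetical scores: DINO is unsure, CLIP is confident, Fourier disagrees.
print(round(combined_similarity(0.5, 1.0, 0.0), 2))
```

Because the weights sum to 1, the combined score stays in the same [0, 1] range as its inputs, so a single threshold can be applied to it.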

The Optimized Search Strategy

When searching through a database of logos, we don't need to calculate everything for every image. Here's an efficient multi-stage approach:

Stage 1: Use DINO to retrieve initial candidates, then filter them with CLIP. Use thresholds to stop the search early when a high-similarity match is found or when no candidate is similar enough
Stage 2: Use Fourier descriptors to re-rank found similarities
Stage 3 (optional): Re-rank the top results using Chamfer distance with per-path Fourier descriptors

Optionally, before starting the multi-stage approach, we can:

Search by SHA256 hash, to find full copies of the image
Search by perceptual hash, to find copies with minor modifications

This staged approach gives us accurate results while avoiding unnecessary calculations.
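The two hash-based shortcuts can be sketched as follows. The SHA-256 part uses Python's standard hashlib; the perceptual part is a simple "average hash" written from scratch and used here only as a stand-in, since the article does not say which perceptual hash the project uses. The input is assumed to be a small grayscale grid (for example 8x8) produced by an earlier downscaling step.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Exact-copy fingerprint: identical bytes give an identical digest."""
    return hashlib.sha256(data).hexdigest()

def average_hash(gray):
    """One bit per pixel: 1 if the pixel is brighter than the grid's mean."""
    flat = [v for row in gray for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Toy 8x8 "logo": dark left half, bright right half.
logo = [[0] * 4 + [255] * 4 for _ in range(8)]
tweaked = [row[:] for row in logo]
tweaked[0][0] = 10                      # a minor modification
inverted = [[255 - v for v in row] for row in logo]

print(hamming(average_hash(logo), average_hash(tweaked)))   # small distance
print(hamming(average_hash(logo), average_hash(inverted)))  # large distance
```

A small Hamming distance flags a near-copy cheaply, so the embedding-based stages only need to run when both hash checks come back negative.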

The Implementation

I've built a proof of concept system that includes:

A combined storage solution using SQLite3 and FAISS
Storage for DINO embeddings, CLIP embeddings, and Fourier descriptors (both combined and per-path)
SHA256 hash and perceptual hashes for each image
Scripts to populate the database with PNG images
A search script to find similar logos in the database
A direct comparison script for two specific logos
Support for both GPU and CPU processing

The code is still under development and is not guaranteed to work reliably, but it illustrates the approaches and techniques used.

https://github.com/rghafadaryan/logo-similarity

Testing Data

For this work, I used a subset of 500 logo images from the Large Logo Dataset.
Direct download: https://data.vision.ee.ethz.ch/sagea/lld/data/LLD-logo_sample.zip

What's Next?

This project is ongoing. The combined approach shows promising results, but there's always room for improvement. I'm continuing to refine the weights, explore additional geometric features, and test on larger datasets.

I'll be back with more results as this work progresses. If you're working on similar problems or have suggestions, I'd love to hear from you in the comments!

AI Use Disclaimer

AI assistance was used in preparing this article to help with grammar, wording, and clarity, since English is not my native language.

For the coding part of the project, AI-based copilots were used mostly in calculation-heavy sections.
However, every line of code was personally reviewed and verified by me before use.

All technical decisions, conclusions, and interpretations described here represent my own work.

The 64 KB Challenge: Teaching a Tiny Neural Network to Play Pong

Ruben Ghafadaryan — Sun, 12 Oct 2025 15:48:06 +0000

Introduction

As someone who started hacking in the mid-’80s, I’m still a shameless fan of retro computers. Sure, they were hilariously limited, but those limits made us crafty. My first machine had 16 KB of RAM (about 2 KB reserved for video). Apps came from a cassette recorder, and somehow that was… fine.

When the Atari 65XE with its majestic 64 KB arrived, we were sure nothing could stop us. Fast-forward to today: I’m on a 64 GB RAM box with a GPU and a terabyte of storage - and I still catch myself thinking, “eh, I could use more.”

Meanwhile, the resources we casually throw at neural nets are a little terrifying. A standard PyTorch + CUDA install eats gigabytes of disk; “toy” experiments can heat a room and run for hours.

Unlike today’s parameter-hungry models, the earliest perceptron experiments ran on vacuum-tube mainframes like the IBM 704, which topped out at 32K 36-bit words (roughly 144 KB of storage). And yet, within that tiny footprint, the perceptron showed something revolutionary: you could learn a decision rule from examples instead of hand-coding logic.

So here’s the challenge I set for myself: build and train a tiny neural network that can play a simplified Pong as a partner/opponent against a rule-based bot - and keep the entire model plus its training data under 64 KB.

A few ground rules so the purists don’t sharpen their pitchforks:

I’m not writing this on an actual 8-bit machine. We’ll use modern Python, but we’ll measure and enforce memory like it’s 1987.
"Under 64 KB" means: the model itself and its serialized parameters together consume less than 64 KB of memory.

We’ll compare with a "don’t-hold-back" variant (PyTorch + CUDA), suggested by a large model - because contrast is fun.

And, surely, we’re not doing this for nostalgia points only. We’re doing it because on-device neural nets for IoT are useful right now: they run without a network, keep data private, cut latency to near-zero, and sip power. Many teams building compact devices need models that are small, trainable on their own data, and autonomous at the edge. This project is a concrete example - and a spark - for deeper work on tiny, task-specific models that actually ship.

AI Usage Disclaimer

Did I use AI while building the tiny NN or writing this piece? Yes - selectively. Like most engineers, I use assistants for rough drafts, typo-hunting, and smoothing awkward sentences (helpful since English isn’t my first language). That doesn’t mean the article is auto-generated.
The code is not AI-generated either, though AI assistance was used extensively while working on it.

Guardrails I followed:

Every line of code and every equation was reviewed by me.
Titles, section breaks, and tone got light AI polish.
Constraints, numbers, and trade-offs come from hands-on experiments - not copy-paste.
If any AI-generated code goes in verbatim, I’ll say so explicitly.

Model Constraints and Shape

The whole point is to live under 64 KB - not just the network, but the serialized weights as well. To make that possible, we don’t feed pixels. We feed the game state: paddle and ball positions, their velocities, a small hint about where the ball is heading, and a rough "time-to-impact" estimate. Once normalized, you’re looking at roughly a dozen scalars. It’s signal, not scenery.

The network’s job is simple: choose one of three actions - up, hold, or down. No diagonals, no sideways drift. The architecture matches the task: inputs go into a small hidden layer, and out come three logits. At inference time we just pick the largest logit and move on to the next frame. The model implemented is [12] → [16] → [3].

To save space, weights are stored as signed 4-bit values - two per byte. Activations, however, stay int8 with a fixed scale that covers about [-1, 1). That mix matters. On a network this small, pushing activations down to 4-bit as well makes collapse far more likely - you start seeing the model "stick" on one action because there just isn’t enough dynamic range to separate situations cleanly. Keeping activations at int8 buys stability for a few extra bytes, which is a great trade.
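Packing two signed 4-bit weights into each byte can be sketched like this. The nibble order (first value in the low nibble) and the function names are assumptions made for the example; the actual layout in tiny_nn.py may differ.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if any(not -8 <= v <= 7 for v in values):
        raise ValueError("int4 values must be in [-8, 7]")
    padded = list(values) + [0] * (len(values) % 2)  # pad odd lengths with a zero
    out = bytearray()
    for lo, hi in zip(padded[0::2], padded[1::2]):
        # Masking with 0x0F keeps the two's-complement low 4 bits.
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(data, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    values = []
    for byte in data:
        for nibble in (byte & 0x0F, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return values[:count]

weights = [-8, 7, -1, 0, 3]
packed = pack_int4(weights)
print(len(packed), unpack_int4(packed, len(weights)))
```

With this layout, the 12×16 + 16×3 = 240 weights of a [12] → [16] → [3] network fit into 120 bytes before biases are added.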

Nonlinearity is a simple, saturating clamp. It’s cheap, keeps values in range, and doesn’t require lookup tables or trig functions. The final layer leaves us with three integer logits; we take the argmax and return the value.
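A minimal integer-only forward pass for this shape could look as follows. The clamp range, the per-layer right-shift of 4, and the action order are assumptions chosen for illustration rather than the project's actual constants.

```python
ACTIONS = ("UP", "HOLD", "DOWN")  # assumed index order

def clamp(x, lo=-127, hi=127):
    """Saturating nonlinearity: keep the value inside the int8 range."""
    return max(lo, min(hi, x))

def dense(inputs, weights, biases, shift):
    """Integer dense layer: accumulate, add bias, right-shift to rescale."""
    return [(sum(w * x for w, x in zip(row, inputs)) + b) >> shift
            for row, b in zip(weights, biases)]

def predict(features, w1, b1, w2, b2, shift1=4, shift2=4):
    """[12] -> [16] -> [3] forward pass; returns the index of the largest logit."""
    hidden = [clamp(v) for v in dense(features, w1, b1, shift1)]
    logits = dense(hidden, w2, b2, shift2)
    return max(range(len(logits)), key=logits.__getitem__)  # argmax

# Toy parameters, only to show the shapes involved.
features = [127] * 12
w1, b1 = [[1] * 12 for _ in range(16)], [0] * 16
w2, b2 = [[0] * 16, [1] * 16, [-1] * 16], [0] * 3
print(ACTIONS[predict(features, w1, b1, w2, b2)])
```

Everything stays in integer arithmetic, so the same logic ports directly to a microcontroller without floating-point support.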

Training Process

We train a small student network to mimic a calm, predictable teacher. The teacher is a simple "physics-intercept" bot: when the ball is coming toward our paddle, it projects the path forward - including wall bounces - until the paddle’s x-line, then heads to meet it; when the ball is leaving, it slides back toward center. A tiny dead-zone around the paddle’s middle prevents jitter. It’s not flashy, but it’s consistent, which gives us labels we can trust. We refer to the teacher as the "Rule-Based Bot".
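The teacher's intercept logic can be sketched as below. It assumes a playfield with y in [0, 1], screen-style coordinates where larger y means "down", and a hypothetical dead-zone of 0.02; none of these constants come from the project itself.

```python
def predict_intercept_y(ball_x, ball_y, vel_x, vel_y, paddle_x):
    """Ball's y when it reaches x = paddle_x, with top/bottom bounces in [0, 1]."""
    if vel_x == 0:
        return None  # the ball never reaches the paddle line
    t = (paddle_x - ball_x) / vel_x
    if t < 0:
        return None  # the ball is moving away from this paddle
    # Unfold the bounces: let the ball travel freely, then reflect the result
    # back into [0, 1] (a triangle wave with period 2).
    y = (ball_y + vel_y * t) % 2.0
    return 2.0 - y if y > 1.0 else y

def teacher_action(paddle_y, ball_x, ball_y, vel_x, vel_y, paddle_x, dead_zone=0.02):
    """Rule-based label: move toward the intercept, or toward center if idle."""
    target = predict_intercept_y(ball_x, ball_y, vel_x, vel_y, paddle_x)
    if target is None:
        target = 0.5  # ball is leaving: slide back toward center
    delta = target - paddle_y
    if abs(delta) <= dead_zone:
        return "HOLD"  # close enough: the dead-zone prevents jitter
    return "DOWN" if delta > 0 else "UP"

# Ball starts mid-left, heads up-right, bounces off the top wall at y = 0.
print(predict_intercept_y(0.0, 0.5, 1.0, -1.0, 1.0))
```

Because the bounce handling is closed-form rather than simulated step by step, the labels are deterministic, which is what makes the bot a trustworthy teacher.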

The inputs are the same compact signals we’ll use at runtime: paddle and ball positions, their velocities, a predicted intercept and its delta to the paddle, rough timing/speed hints, plus a couple of direction signs. Everything is normalized to [-1, +1] and stored as int8. Each example carries a single-byte label - UP, HOLD, or DOWN - so one sample is only a handful of bytes.

Because deployment uses 4-bit weights and 8-bit activations, we train with quantization in the loop: parameters are discrete, activations are clamped, and each layer can apply a small right-shift to keep values in range. This avoids the classic trap of "looks great in float32, collapses after quantization."

Optimization stays deliberately simple: hill climbing. Start with small, varied integers; nudge one weight by ±1 (and occasionally a bias); score against the teacher; keep the change if accuracy doesn’t get worse. With only a few hundred parameters, that’s enough - and it matches the discrete space we actually ship.
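A stripped-down version of that loop is sketched below. The scoring function here is a toy (closeness to a hidden target vector) standing in for "accuracy against the teacher", and the function name, the seed, and the iteration count are all assumptions made for the example.

```python
import random

def hill_climb(weights, score_fn, iterations, rng, lo=-8, hi=7):
    """Nudge one integer weight by +/-1 per step; keep changes that don't hurt."""
    best = score_fn(weights)
    for _ in range(iterations):
        i = rng.randrange(len(weights))
        old = weights[i]
        candidate = max(lo, min(hi, old + rng.choice((-1, 1))))  # stay in int4 range
        if candidate == old:
            continue  # the nudge was clipped away, nothing to evaluate
        weights[i] = candidate
        score = score_fn(weights)
        if score >= best:
            best = score       # accept: the score did not get worse
        else:
            weights[i] = old   # reject: undo the change
    return best

# Toy objective standing in for teacher accuracy: closeness to a hidden target.
target = [3, -2, 5, 0]
def toy_score(w):
    return -sum((a - b) ** 2 for a, b in zip(w, target))

weights = [0, 0, 0, 0]
initial = toy_score(weights)
final = hill_climb(weights, toy_score, iterations=500, rng=random.Random(0))
print(initial, final)
```

Accepting ties (score >= best) lets the search drift across plateaus, which helps when many weight changes leave accuracy unchanged.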

What do we watch while training? Accuracy, obviously, but also saturation. If too many activations are pegged at the rails, we bump a layer’s right-shift by one bit or trim fan-in. We also do short rollouts against the teacher to catch late reactions, camping, or oscillation. When accuracy plateaus and behavior looks clean, we serialize the tiny parameters, log the byte counts, and confirm we’re still under 64 KB.

Alternate Path for Comparison

For contrast, we also built a no-limits version in PyTorch, using CUDA when it’s available. The network is straightforward - 12 inputs, two hidden layers of 128 and 64 with ReLU, and 3 outputs for UP, HOLD, DOWN - so:
[12] → [128] → [64] → [3].

It trains against the same rule-based bot, sees the same normalized features, and makes decisions by taking the argmax of its logits. No quantization here; it’s float all the way.

There’s also a distillation option: train the tiny integer model using the big PyTorch model as the teacher instead of the rule-based one. That gives us an apples-to-apples comparison and a clean way to see what extra capacity buys—and what careful quantization can keep.

The alternate path was created with AI assistance and manually reviewed afterwards.
The network architecture was suggested by AI and then negotiated down to the agreed minimum.

The Visualizer and CLI Player

We test the model in a small, deterministic arena: logic lives in [0, 1], rendering goes to pixels, the model plays on the right, and the rule-based bot plays on the left. Each frame builds the same 12-feature vector used in training, queries the model for three logits, turns that into an action, updates both paddles at a fixed speed, steps the ball with clean top/bottom bounces, checks paddle hits at their x-lines, and nudges ball speed slightly after successful returns (capped so rallies stay readable). A miss updates the score and triggers a fresh serve.

Controls exist to help us observe, not to get in the way: pause, slow-motion, and a quick reset to reproduce openings. A small overlay shows actions, ball and paddle positions/velocities, the model’s integer logits, FPS, and slow-mo status. The loop stays flat and predictable - two decisions, one physics step, one draw.

The model bundle loads at startup. Randomness is seeded so interesting rallies are reproducible, and long runs can stop after a target score for clean comparisons. The UI is intentionally minimal - light AI-assisted scaffolding, fully reviewed - so the focus stays on how the tiny net thinks.

The visualizer can pit the tiny model or the no-limits model against the bot—or against each other. For longer experiments, a CLI mode runs series of games up to a chosen point total and reports rally lengths and basic match statistics.

Each model also has its own game visualizer (for historical reasons), with the same UI but limited to that particular model.

The Project

The project is available on GitHub:

Tiny model (tiny/)

tiny_nn.py - the compact MLP and its quantization routines.
tiny_trainer_vs_rl.py - trains the tiny model by imitating the rule-based bot.
tiny_trainer_vs_torch.py - optional: distill the tiny model from the PyTorch teacher.
tiny_game.py - real-time visualizer for the tiny model vs. the rule-based bot.

PyTorch no-limits model (torch_based/)

torch_pong_model.py - PyTorch MLP implementation.
torch_based_trainer.py - trains the PyTorch model against the same rule-based bot.
torch_pong_game.py - visualizer for the PyTorch model (https://github.com/rghafadaryan/neuro-pong/blob/main/torch_based/torch_pong_game.py).

Top-level utilities

versus_game.py - pits any two models (tiny or PyTorch) against each other or the rule-based bot.
versus_game_cli.py - CLI runner for series of games (no UI); outputs rally lengths and match stats.

All scripts expose an exhaustive set of command-line options via --help.

The source code is free to download and use. This is an active work in progress and is provided as is, without warranties; use at your own risk.

Results

We trained two players on the same normalized 12-feature inputs and the same rule-based teacher: a no-limits PyTorch model (CUDA if available) and the tiny quantized model.

Training setup:

PyTorch model: 100,000 samples · 8 epochs
Tiny model: 12,000 hill-climb iterations

Memory Budget

Model	Parameters	Model bytes	Features bytes	Labels bytes	Total bytes	Approx size
PyTorch (no-limits)	10,499	41,996	5,760,000	120,000	5,921,996	≈ 5.65 MB
Tiny (4-bit / int8)	—	141	36,864	3,072	40,077	≈ 39.14 KB
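The tiny model's row can be checked with a few lines of arithmetic. The byte counts are taken from the table; interpreting them as 12 int8 features plus one label byte per sample follows the training description, and the variable names are just for this example.

```python
BUDGET = 64 * 1024  # the 64 KB limit, in bytes

tiny = {"model": 141, "features": 36_864, "labels": 3_072}

total = sum(tiny.values())
samples = tiny["labels"]                       # one byte per label
bytes_per_sample = tiny["features"] // samples  # int8 features per sample

print(total, round(total / 1024, 2), total < BUDGET)
print(samples, bytes_per_sample)
```

The numbers line up with the text: 3,072 samples of 12 one-byte features each, plus one label byte per sample and 141 bytes of model, for 40,077 bytes in total.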

Game Results

Each matchup was played over 100 games; a player must win 3 points to win a game.

Matchup	LEFT	RIGHT	LEFT wins %	GAMES	Rally min	Rally avg	Rally max
torch-based vs rule-based	torch-based	rule-based	68.0%	100	120	458.04	2269
tiny (trained on rule-based) vs rule-based	tiny	rule-based	43.0%	100	120	390.14	2267
tiny vs torch-based	tiny	torch-based	13.0%	100	120	1088.24	9127

Takeaways

Against the rule-based baseline, both learners perform competitively; the PyTorch model wins more often, but the tiny model isn’t far off and even edges some runs depending on seeds and lengths.
Head-to-head, the PyTorch model clearly outplays the tiny model—no surprise given its capacity and float precision.
Long rallies show there are no one-shot games, and the tiny network can hold its ground for a while.
The tiny pipeline still delivers playable, stable behavior inside a ~39 KB bundle, which was the primary goal. Results will shift with game length, sampling, and training settings, so there’s room to tune and explore.

Conclusion

You don’t need a datacenter to teach a machine a good habit. A tiny, quantized network - with a few hundred bytes of parameters and a few tens of kilobytes of data - can learn a useful policy and hold its own against a solid rule-based player. The big PyTorch model wins, of course, but the small one shows up, plays real rallies, and does it inside a 39 KB envelope.

Why it matters: the world is full of little devices that don’t want a cloud—sensors, toys, tools, quiet boxes on factory floors. They need models that wake up fast, think locally, and sip power. This project shows those models aren’t just possible - they’re practical.

With the right constraints, small models stay focused: just enough to do the job, nothing more. This 64 KB challenge is a spark for further work on tiny, task-specific neural nets.