⬅️ Read Part 1: Building the Data Pipeline
➡️ Read Part 3: Real-Time Inference — Watching JumpNet Come Alive
🧠 Overview
In Part 1, we built a custom dataset where each frame tells the model: "Should I jump?" and "If yes, for how long?". Now it's time to train a model that can actually answer those questions.
This post will break down:
- The model architecture (with dual heads: classification + regression)
- How we trained it using PyTorch
- Evaluation metrics and results
- Interpretation of extremely high scores (and what to fix)
🏗️ Model Architecture — What is JumpNet?
JumpNet is a two-headed neural network built on top of a MobileNetV2 feature extractor. Here's how it's structured:
🔍 Detailed Layer Breakdown:
- Backbone (MobileNetV2) — Acts as the visual encoder. Takes in (227x227x3) images and converts them into compact feature maps.
- Global Average Pooling — Reduces spatial dimension to a flat vector (1280-dim).
- Fully Connected Block — `Linear(1280 → 512)` followed by `ReLU`
- `Dropout(p=0.3)` — reduces overfitting
🔄 Dual Output Heads:
- `jump_head = Linear(512 → 1)` → followed by sigmoid → binary decision
- `hold_head = Linear(512 → 1)` → direct float output (no activation)
This structure allows the model to jointly learn classification and regression from shared features — very efficient for multitask learning!
JumpNet is a Convolutional Neural Network that takes in a game frame and produces two outputs:
| Output | Task | Shape | Description |
| --- | --- | --- | --- |
| `jump_prob` | Binary Classification | [1] | Should the agent jump or not? |
| `hold_duration` | Regression | [1] | If jumping, how long should the key be held? |
It uses a MobileNetV2 backbone pretrained on ImageNet and two small heads:
# model.py
self.jump_head = nn.Linear(512, 1) # Binary output
self.hold_head = nn.Linear(512, 1) # Float output
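To see how those two heads sit on top of the backbone, here is a minimal sketch of the full module as described above. The class name `JumpNet` matches the post, but attribute names like `self.backbone` and `self.fc` are assumptions and may differ from the actual `model.py`.

```python
# Sketch of the architecture described above (attribute names are illustrative).
import torch
import torch.nn as nn
from torchvision import models


class JumpNet(nn.Module):
    def __init__(self):
        super().__init__()
        # MobileNetV2 pretrained on ImageNet acts as the visual encoder
        # (torchvision >= 0.13 weights API assumed).
        self.backbone = models.mobilenet_v2(weights="IMAGENET1K_V1").features
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling -> 1280-dim vector
        self.fc = nn.Sequential(              # shared fully connected block
            nn.Linear(1280, 512),
            nn.ReLU(),
            nn.Dropout(p=0.3),
        )
        self.jump_head = nn.Linear(512, 1)    # binary output (jump / no jump)
        self.hold_head = nn.Linear(512, 1)    # float output (hold duration)

    def forward(self, x):
        feats = self.pool(self.backbone(x)).flatten(1)
        shared = self.fc(feats)
        jump_prob = torch.sigmoid(self.jump_head(shared))   # sigmoid for classification
        hold_duration = self.hold_head(shared)               # no activation for regression
        return jump_prob, hold_duration
```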
🛠️ Training Setup
Let me be honest: getting this to work wasn’t as smooth as the final metrics suggest.
When I first plugged in the MobileNetV2 backbone, I made a rookie mistake — I didn’t freeze the pretrained weights. For several epochs, the model struggled like a toddler learning to walk, forgetting everything it had already learned from ImageNet. My classification accuracy barely moved. Turns out, cracking the whole network open and fine-tuning every layer from the start was not a great idea.
After locking the early layers and just training the final FC block and heads, things stabilized — finally, the model began to ‘trust’ its own visual encoder again.
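In code, that freezing step can be as simple as the sketch below; `model.backbone` is the attribute name from the earlier sketch and may not match the real `model.py`.

```python
# Freeze the pretrained MobileNetV2 backbone so only the FC block and heads get updated.
for param in model.backbone.parameters():
    param.requires_grad = False
```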
Another hiccup? I initially treated the `hold_duration` regression as equally important for every frame — including frames where there was no jump at all. That meant the model was trying to guess a hold time even when nothing should be held. Fixing this with a simple mask (`jump_labels > 0.5`) improved learning dramatically.
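As an illustration, the masked regression loss can be computed roughly like this (a sketch; it assumes `jump_labels` and `hold_labels` are float tensors of shape `[batch, 1]`, and `masked_hold_loss` is a hypothetical helper name):

```python
import torch
import torch.nn as nn

criterion_reg = nn.MSELoss()

def masked_hold_loss(hold_pred, hold_labels, jump_labels):
    """Only penalize hold-duration error on frames that actually contain a jump."""
    mask = (jump_labels > 0.5)
    if mask.any():
        return criterion_reg(hold_pred[mask], hold_labels[mask])
    # No positive frames in this batch: contribute zero regression loss.
    return torch.zeros((), device=hold_pred.device)
```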
Even the learning rate — set initially to `1e-3` — was too aggressive. Loss spiked, predictions became unstable, and only after toning it down to `1e-4` did things begin to converge. The model didn’t just need data — it needed patience.
Here’s what finally worked:
We trained the model using PyTorch on a dataset of 1778 labeled samples (1278 positive, 500 negative).
import torch.nn as nn
from torch.optim import Adam

criterion_cls = nn.BCELoss()   # Binary Cross Entropy for jump / no-jump
criterion_reg = nn.MSELoss()   # Mean Squared Error for hold duration
optimizer = Adam(model.parameters(), lr=1e-4)
Each epoch runs a classification + regression combo loss:
loss = loss_cls + loss_reg
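Put together, one training step looks roughly like the sketch below. The DataLoader variable names and the `masked_hold_loss` helper from earlier are assumptions for illustration, not the exact code from the repository.

```python
# One training step (sketch): shared forward pass, combined loss, single backward pass.
for frames, jump_labels, hold_labels in train_loader:
    jump_pred, hold_pred = model(frames)

    loss_cls = criterion_cls(jump_pred, jump_labels)                  # jump / no-jump
    loss_reg = masked_hold_loss(hold_pred, hold_labels, jump_labels)  # only on jump frames
    loss = loss_cls + loss_reg

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```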
🧪 Evaluation Metrics
Watching a model learn something from scratch — and then succeed — is one of the most satisfying moments in machine learning. After several minutes of feeding screen images and labels, JumpNet started to pick up patterns. It began understanding not just whether to jump, but how long to hold the key based on what it 'saw.'
In technical terms, we evaluated the model on the unseen portion of the same dataset and got these results:
=== Model Evaluation Metrics ===
Accuracy: 1.0000
F1 Score: 1.0000
Precision: 1.0000
Recall: 1.0000
Hold Duration MSE: 0.0095
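The numbers above can be computed along these lines; this is a sketch using `scikit-learn`, assuming `jump_probs`, `jump_true`, `hold_preds`, and `hold_true` are NumPy arrays collected over the held-out split, and that the hold-duration MSE is measured only on true jump frames (an assumption about how the reported number was obtained).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Threshold the sigmoid output at 0.5 to get hard jump / no-jump decisions.
jump_pred = (jump_probs > 0.5).astype(int)

print("=== Model Evaluation Metrics ===")
print(f"Accuracy:          {accuracy_score(jump_true, jump_pred):.4f}")
print(f"F1 Score:          {f1_score(jump_true, jump_pred):.4f}")
print(f"Precision:         {precision_score(jump_true, jump_pred):.4f}")
print(f"Recall:            {recall_score(jump_true, jump_pred):.4f}")

# MSE of the hold duration, evaluated on frames that truly contain a jump.
jump_mask = jump_true.astype(bool)
hold_mse = np.mean((hold_preds[jump_mask] - hold_true[jump_mask]) ** 2)
print(f"Hold Duration MSE: {hold_mse:.4f}")
```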
📉 Breakdown:
These metrics alone don't tell the whole story. When we zoom into the TensorBoard graphs, we notice something important — especially in the regression loss:
- Loss/Classification: Starts high (~0.22) and sharply drops to ~0.0007 — this suggests the model very quickly learns to distinguish jump vs. no-jump.
- Loss/Regression: Gradually reduces from ~0.1 to ~0.0069 — but with occasional spikes mid-training. These jumps suggest instability in learning the `hold_duration`, often caused by:
  - Outlier samples with very short or very long holds
  - Inconsistent gradient flow due to masking
  - High variance in visual contexts (e.g., different obstacles)
It’s worth noting that these spikes didn’t completely derail the model, but they do reflect a weakness in the regression component’s ability to generalize smoothly.
- Loss/Total: Combined loss follows the same exponential decay trend.
🧠 Interpretation
It’s tempting to pat ourselves on the back. I mean — 1.0 F1? That’s better than most Kaggle submissions. But these numbers are also a trap.
Real-world data is messy, unpredictable, and never fully covered by your training distribution. This model has likely learned the dataset more than it has learned the task.
Still, there’s something beautiful in this phase: your model, in this tiny controlled world, is thriving. The learning pipeline is working. And you now have a baseline you can start to trust — but not worship.
Training seems too clean — almost too perfect. This usually points to one of the following:
- Data leakage (train/test overlap)
- Model overfitting to low-variance data
- Extremely deterministic labels
Even though the model technically fits the data well, we should be skeptical of how it generalizes to new inputs (i.e., a different level, different game speed, or visual style).
🎯 Digging Deeper: What Do These Scores Actually Mean?
At first glance, it looks like we built a perfect model. But that’s not the full story:
✅ Why it might be this good:
- We used clean, paired, filtered training data (press–release only)
- The binary task (jump/no jump) was clear and balanced
- Negative samples were well-separated from positives
⚠️ Why it might be misleading:
- Evaluation used data from the same video, just split
- Positive examples were augmented, not freshly collected
- Model might just memorize background and frame position patterns
🚧 What can we do to fix this?
- Cross-video generalization: Try training on one video, testing on another
- Data augmentation: Add color shift, occlusions, screen shake (see the sketch after this list)
- Regularization: Increase dropout or use data noise
- Frame skipping: Evaluate robustness to skipped frames
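To make the data augmentation idea concrete, here is a minimal sketch using `torchvision.transforms`. The specific transforms and parameters are illustrative guesses, not the project's actual configuration.

```python
from torchvision import transforms

# Illustrative training-time augmentation pipeline (parameters are placeholders).
train_transform = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # color shift
    transforms.RandomAffine(degrees=0, translate=(0.02, 0.02)),            # mild screen shake
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.1)),                   # random occlusions
])
```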
🧵 Next Up: Real-Time Deployment
You’ve seen the data. You’ve seen the model learn. Now let’s put it to the test.
In Part 3, we’ll bring everything together inside a live simulation loop: capturing frames, feeding them to the model, and actually pressing keys based on the predictions.
We'll explore:
- How to capture screen regions in real-time
- How to interpret model output with a confidence threshold
- Simulating keypresses programmatically
- And logging everything — wins, fails, and frame-by-frame behavior
Can an AI beat the first 25% of a level using one key? Let’s find out.
📦 Read Part 1: Collecting Game Data
➡️ Read Part 3: Real-Time Game Control with JumpNet
📂 GitHub Repository
All code used for data loading, training, evaluation and inference is available at:
🔗 GitHub: JumpNet Project Repository
“A model is only as smart as the data it sees — and as dumb as the overconfidence we have in it.”