
🎮 (JumpNet) Part 3: Real-Time Inference — Watching JumpNet Come Alive

⬅️ Read Part 1: Building the Data Pipeline
⬅️ Read Part 2: Training the Model


🎬 Project Goal Recap

We started with screenshots and keypresses (Part 1), turned that into a labeled dataset, and trained a model to predict jumps and hold durations (Part 2). Now, in this final part, we’re putting that model into action — inside a real game.

The ultimate goal: Can an AI play a platformer using only one key?

To answer that, I built a simulation GUI, loaded the trained model, and observed how it performs frame by frame — all in real-time. It’s the moment where the code stops being code and becomes a decision-making agent, playing a game just like you or me.


🖼️ The Simulation GUI

At the heart of this system is a custom-built Python GUI using tkinter. Here's what it does (a minimal capture sketch follows the list):

  • Allows region snipping on the screen (fullscreen overlay)
  • Loads a PyTorch .pt model (JumpNet)
  • Captures screenshots from the defined region every N milliseconds
  • Sends each frame through the model
  • Simulates a spacebar press (with duration) if the model decides to jump
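
To make the capture cycle concrete, here's a minimal sketch of how the periodic grab-and-infer loop can be scheduled with tkinter's after() and mss. The region dict, interval, and names here are illustrative, not the project's exact GUI code:

# capture_loop_sketch.py — illustrative, not the project's exact GUI code
import tkinter as tk
import numpy as np
import mss

REGION = {"left": 100, "top": 200, "width": 480, "height": 270}  # set by the snipping overlay
INTERVAL_MS = 100                                                # capture every N milliseconds

root = tk.Tk()
root.title("JumpNet inference")
sct = mss.mss()

def tick():
    shot = sct.grab(REGION)            # BGRA screenshot of the game region
    frame = np.array(shot)[:, :, :3]   # keep the BGR channels only
    # ...this is where the frame goes through the model (see the inference loop below)
    root.after(INTERVAL_MS, tick)      # reschedule without blocking the GUI

root.after(INTERVAL_MS, tick)
root.mainloop()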

GUI screenshot

🎛️ Tuning matters. During early testing, I tweaked the threshold and interval parameters live during gameplay. This helped me find a sweet spot where the model acts quickly but not too aggressively. Every millisecond matters in a twitch-based game, and fine-tuning these values made the difference between clearing an obstacle and face-planting into it.
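
For reference, the live tuning can be as simple as two tkinter Scale widgets that the inference loop reads on every cycle. This is a sketch of the idea, not the project's actual control panel:

# tuning_sliders_sketch.py — illustrative live controls for threshold and interval
import tkinter as tk

root = tk.Tk()

threshold_slider = tk.Scale(root, from_=0.0, to=1.0, resolution=0.01,
                            orient="horizontal", label="jump threshold")
threshold_slider.set(0.5)
threshold_slider.pack(fill="x")

interval_slider = tk.Scale(root, from_=20, to=300, resolution=10,
                           orient="horizontal", label="interval (ms)")
interval_slider.set(100)
interval_slider.pack(fill="x")

# Inside the inference loop, read the current values each cycle:
#   threshold = threshold_slider.get()
#   interval_ms = int(interval_slider.get())

root.mainloop()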

This GUI isn't just an interface. It became the lab where experiments were run, logs were collected, and breakthrough moments were celebrated.


🧠 Real-Time Inference Logic

The main loop of inference works like this:

  1. Grab current frame using mss
  2. Preprocess it (resize, normalize)
  3. Run model → get jump_prob, hold_duration
  4. Decide if jump_prob > threshold
  5. If jump, simulate a spacebar press for hold_duration
  6. Wait for interval, repeat

# main.py (simplified loop)
frame = get_screen_region()                 # mss grab of the snipped region
img_tensor = transform(frame).unsqueeze(0)  # resize + normalize, add batch dim

with torch.no_grad():                       # inference only, no gradient tracking
    jump_pred, hold_pred = model(img_tensor)

if jump_pred.item() > threshold:            # classification head: should we jump?
    press_space_for(hold_pred.item())       # regression head: how long to hold
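
The press_space_for helper is the bridge between the model and the game. The post doesn't pin down which input library the project uses, so here's one plausible implementation, assuming pynput:

# keypress_sketch.py — one way press_space_for could be implemented (pynput assumed)
import time
from pynput.keyboard import Controller, Key

keyboard = Controller()

def press_space_for(duration_s: float) -> None:
    """Hold the spacebar for the predicted duration, then release it."""
    keyboard.press(Key.space)
    time.sleep(max(duration_s, 0.0))   # guard against negative regression outputs
    keyboard.release(Key.space)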

Each cycle of this loop takes around 50ms to complete on CPU, which is more than fast enough for games with reaction windows above 200ms. However, even in this modest latency range, consistency is key.
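
If you want to check that figure on your own hardware, one rough way is to wrap a single capture-plus-inference cycle in time.perf_counter, reusing the same helpers as the loop above:

# latency_check_sketch.py — rough per-cycle timing, reusing the helpers above
import time
import torch

t0 = time.perf_counter()
frame = get_screen_region()
img_tensor = transform(frame).unsqueeze(0)
with torch.no_grad():
    jump_pred, hold_pred = model(img_tensor)
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"cycle latency: {elapsed_ms:.1f} ms")   # expect roughly 50 ms on CPU here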


📽️ Demo Video

Want to see it in action? Here's JumpNet controlling the game in real-time:

🎬 Watch the YouTube Demo

In this demo, you can see how JumpNet reacts to incoming platforms. The most satisfying moment? Watching the AI make a perfect sequence of jumps, just as I would.


🦾 Logging & Debugging

Every decision JumpNet makes is logged (see the sketch after this list):

  • Frame Index
  • Jump Prediction
  • Hold Duration
  • Actual Keypress Time
  • Whether action was taken or skipped
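
Concretely, each decision can be appended to a simple CSV with those fields. The exact format in the repo may differ; this is a minimal sketch of the idea:

# decision_logger_sketch.py — minimal CSV logger for per-frame decisions
import csv
import time

LOG_FIELDS = ["frame_idx", "jump_prob", "hold_duration", "keypress_time", "action_taken"]

def open_log(path="inference_log.csv"):
    f = open(path, "w", newline="")
    writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
    writer.writeheader()
    return f, writer

def log_decision(writer, frame_idx, jump_prob, hold_duration, action_taken):
    writer.writerow({
        "frame_idx": frame_idx,
        "jump_prob": round(jump_prob, 4),
        "hold_duration": round(hold_duration, 3),
        "keypress_time": time.time(),    # wall-clock timestamp of the (attempted) keypress
        "action_taken": action_taken,    # True if the space press was actually simulated
    })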

Screenshots of the log window

This made debugging a breeze and also revealed when things went wrong — like:

  • Duplicate keypresses from overlapping inference intervals
  • Tiny fluctuations in jump_prob near the threshold
  • Holding the key too long, missing landing platforms

I also found subtle bugs in my simulation logic by comparing logs to screen recordings. For example, one issue where two overlapping frames caused a second jump to interrupt a successful one was only discovered by analyzing frame-level timestamps in the logs.


⚠️ What Went Wrong (and Why That’s Fine)

Let’s be honest — even a “perfect” model on paper can stumble in reality.

I ran into these real-world issues:

  • Lag spikes during frame capture caused the game to desync with keypresses.
  • Jump timing was too early or too late because of interval jitter.
  • Hold duration drift on certain platforms, leading to overshooting.

All of these point to a single fact:

🎯 Inference isn't just about accuracy. It's about timing, consistency, and real-world sync.

These problems echo what we saw in Part 2’s loss spikes — especially in regression. Slight instabilities in prediction had amplified effects when deployed. Even a 0.1-second deviation in hold time made the difference between success and failure.

To mitigate some of these, I added the following safeguards (sketched after the list):

  • Minimum hold threshold (to prevent 0.05s "micro-jumps")
  • Cooldown window after jump to avoid overlaps
  • Optional manual override to retune threshold mid-run
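
Here's a minimal sketch of how the first two safeguards slot into the decision step. The constants are illustrative, and press_space_for is the same keypress helper used in the main loop:

# safeguards_sketch.py — minimum hold + cooldown around the jump decision
import time

MIN_HOLD_S = 0.10      # ignore "micro-jumps" shorter than this (illustrative value)
COOLDOWN_S = 0.30      # no new jump while the previous one may still be in flight
last_jump_at = 0.0

def maybe_jump(jump_prob: float, hold_duration: float, threshold: float) -> bool:
    """Apply the safeguards, then jump if the model says so. Returns True if a jump fired."""
    global last_jump_at
    now = time.time()
    if jump_prob <= threshold:
        return False                                  # model says: don't jump
    if now - last_jump_at < COOLDOWN_S:
        return False                                  # still cooling down, skip to avoid overlaps
    press_space_for(max(hold_duration, MIN_HOLD_S))   # enforce the minimum hold
    last_jump_at = now
    return True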

None of these solutions were perfect, but together they stabilized performance and turned the agent into something almost reliable.


🧰 Conclusion — A Playable Baseline

Despite challenges, the agent successfully passed the first few obstacles using its own vision and learned jump strategy. It wasn’t perfect — but it was autonomous. That alone is a massive milestone.

Watching an AI play a level you know intimately is like seeing your reflection behave independently. It’s eerie, thrilling, and absolutely rewarding.

This completes the initial trilogy:

  • 🏗️ Part 1: Built the dataset
  • 🧠 Part 2: Trained the model
  • 🎮 Part 3: Watched it play

And the best part? Every piece of this pipeline is modular and extensible. I can swap the model, change the input resolution, or even hook in another task entirely (like shooting or ducking).

This isn’t just a demo. It’s a real, scalable testbed for interactive vision-based agents.


🔁 Try the Full Pipeline Yourself

Curious to build your own AI agent from scratch and watch it play a game in real-time? Everything you need is available and documented:

  • 🧱 Data Collection Tool
  • 🧠 Model Training Scripts
  • 🖥️ Real-Time Inference GUI
  • 🗃️ Logs and Dataset Viewers

Start from Part 1 and follow the journey — or dive straight in using the code:

🔗 GitHub: Full Data & Inference Toolkit

You’ll be surprised how far one key and one neural network can take you.

🔗 GitHub: JumpNet Project Repository


"When your model finally makes a decision on its own — and it works — it's like watching a baby take its first step."

⬅️ Back to Part 1
⬅️ Back to Part 2

