Originally published on Medium.
Let me start with a confession: my first attempt at building a video time prediction model was a disaster.
I'd spent 3 months reading papers, collecting datasets, and training models.
But when I finally deployed it, the results were laughable.
I was trying to use a 3D CNN to extract features from video frames, and then feed those features into an LSTM to predict the time.
It sounded good on paper, but in practice, it was a mess.
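For concreteness, the pipeline looked roughly like this. This is a minimal sketch, not my actual configuration: the layer sizes, the class name, and the 8-step temporal pooling are all illustrative.

```python
import torch
import torch.nn as nn

class Conv3DLSTMBaseline(nn.Module):
    """Sketch of the failing baseline: a small 3D CNN extracts per-clip
    features, and an LSTM consumes them to regress a single time value."""
    def __init__(self, hidden_size=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((8, 1, 1)),  # keep 8 temporal steps
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # regress one time value

    def forward(self, x):
        # x: (batch, 3, frames, height, width)
        f = self.features(x)              # (batch, 32, 8, 1, 1)
        f = f.flatten(2).transpose(1, 2)  # (batch, 8, 32) for the LSTM
        out, _ = self.lstm(f)
        return self.head(out[:, -1])      # (batch, 1)
```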
The model lurched between overfitting and underfitting, and just generally wasn't working.
I tried tweaking the architecture, adjusting the hyperparameters, and even switching to a different dataset.
But no matter what I did, I just couldn't seem to get it to work.
And then, one day, I stumbled upon a paper about SlowFast networks, and everything changed.
The Before: When Everything Technically Works But Nothing Really Does
My model was technically working, in the sense that it was producing outputs and not crashing.
But in terms of actually predicting time in videos, it was a failure.
Some of the issues I was facing included:
- Poor feature extraction
- Inability to handle variable frame rates
- Overfitting to the training data

The real insight here is that I was focusing on the wrong problem. I was so caught up in trying to get the model to work that I wasn't thinking about whether the model was even the right tool for the job.
The Shift That Changed Everything
The turning point came when I stopped asking: What's the best model for this task?
...and started asking: What's the best way to represent time in a video?
This sounds obvious, but it completely changed my approach.
I started thinking about how humans perceive time, and how I could use that to inform my model design.
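One cheap fix that fell out of this reframing: sample frames at fixed timestamps rather than fixed frame indices, so variable frame rates stop mattering. The helper below and its parameters are my own illustration, not code from the original project.

```python
import numpy as np

def sample_frame_indices(num_frames, fps, clip_seconds=4.0, num_samples=16):
    """Pick frame indices at evenly spaced *timestamps*, so a 60 fps and a
    24 fps video of the same event yield the same temporal layout."""
    timestamps = np.linspace(0.0, clip_seconds, num_samples, endpoint=False)
    indices = (timestamps * fps).astype(int)
    return np.clip(indices, 0, num_frames - 1)

# A 60 fps and a 24 fps clip of the same 4-second event:
hi = sample_frame_indices(num_frames=240, fps=60)
lo = sample_frame_indices(num_frames=96, fps=24)
```

Both calls return 16 indices covering the same 4 seconds of wall-clock time, just mapped onto different frame counts.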
SlowFast Networks — What They Actually Do For You
Before, I was using a standard 3D CNN to extract features from video frames.
But with SlowFast networks, I could extract features at multiple scales, and then fuse them together to get a more robust representation of time.
The code for this was surprisingly simple:
```python
import torch
import torch.nn as nn

class SlowFastNetwork(nn.Module):
    """A simplified two-pathway network in the spirit of SlowFast:
    the slow path sees temporally subsampled frames, the fast path
    sees every frame but with fewer channels."""
    def __init__(self, alpha=4):
        super().__init__()
        self.alpha = alpha  # temporal subsampling factor for the slow path
        self.slow_path = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 8, 8)),  # align temporal/spatial dims
        )
        self.fast_path = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1),  # lightweight channels
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 8, 8)),
        )

    def forward(self, x):
        # x: (batch, 3, frames, height, width)
        slow = self.slow_path(x[:, :, ::self.alpha])  # every alpha-th frame
        fast = self.fast_path(x)                      # full frame rate
        return torch.cat((slow, fast), dim=1)         # fuse along channels
```
I spent 4 hours figuring this out, but it was worth it.
Time Prediction — What It Actually Means
Before, I was trying to predict time as a regression problem.
But framing it as a classification problem gave much better results.
The insight is that, for this task, time behaves less like a continuous variable and more like a discrete one: a video reads as a series of discrete events rather than a continuous flow.
Time is not just a matter of clock time, but also of event time.
By representing time as a series of discrete events, we can build models that are more robust and more accurate.
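Concretely, treating time prediction as classification means discretizing the target into bins and training with cross-entropy. A minimal sketch, where the 60-bin, 1-second-per-bin scheme and the helper names are my assumptions, not the original setup:

```python
import torch
import torch.nn as nn

def time_to_bin(t_seconds, max_seconds=60.0, num_bins=60):
    """Map a continuous timestamp to a discrete class index (1 s bins here)."""
    t = torch.clamp(t_seconds, 0.0, max_seconds - 1e-6)
    return (t / max_seconds * num_bins).long()

def bin_to_time(bin_idx, max_seconds=60.0, num_bins=60):
    """Map a class index back to its bin-center timestamp."""
    return (bin_idx.float() + 0.5) * (max_seconds / num_bins)

# Training then uses plain cross-entropy on the bin indices:
logits = torch.randn(4, 60)  # model output: one score per time bin
targets = time_to_bin(torch.tensor([3.2, 17.9, 42.0, 59.5]))
loss = nn.functional.cross_entropy(logits, targets)
```

At inference, `argmax` over the logits followed by `bin_to_time` recovers a timestamp, with error bounded by the bin width.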
The After: What Actually Changed
The results were night and day.
Before, my model was producing errors of up to 30 seconds.
After, the errors were down to 1-2 seconds.
Some of the key changes included:
- Improved feature extraction
- Better handling of variable frame rates
- Reduced overfitting

One thing that still doesn't work perfectly is handling videos with multiple timelines. This is an area where I'm still doing research and hoping to make some breakthroughs.
Final Thought: It's Not About Time — It's About Understanding
If I'm being honest, I was so focused on predicting time in videos that I forgot about the bigger picture.
Video understanding is not just about time, it's about understanding the events, actions, and objects in a video.
So, if you're also working on video understanding, I'm curious: what's the one thing that you're still struggling to get right?
Follow me on Medium for more AI/ML content!