Originally published on Medium.
Let me start with a confession: my first attempt at building a video time prediction model was a disaster.
I'd spent 3 months reading papers, collecting datasets, and training models.
But when I finally deployed it, the results were laughable.
I was trying to use a 3D CNN to extract features from video frames, and then feed those features into an LSTM to predict the time.
It sounded good on paper, but in practice, it was a mess.
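For concreteness, the pipeline looked roughly like this. This is a minimal sketch, not my actual configuration: the layer sizes, the class name, and the 8-step temporal pooling are all illustrative.

```python
import torch
import torch.nn as nn

class Conv3DLSTMBaseline(nn.Module):
    """Sketch of the failing baseline: a small 3D CNN extracts per-clip
    features, and an LSTM consumes them to regress a single time value."""
    def __init__(self, hidden_size=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((8, 1, 1)),  # keep 8 temporal steps
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # regress one time value

    def forward(self, x):
        # x: (batch, 3, frames, height, width)
        f = self.features(x)              # (batch, 32, 8, 1, 1)
        f = f.flatten(2).transpose(1, 2)  # (batch, 8, 32) for the LSTM
        out, _ = self.lstm(f)
        return self.head(out[:, -1])      # (batch, 1)
```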
The model lurched between overfitting and underfitting, and just generally wasn't working.
I tried tweaking the architecture, adjusting the hyperparameters, and even switching to a different dataset.
But no matter what I did, I just couldn't seem to get it to work.
And then, one day, I stumbled upon a paper about SlowFast networks, and everything changed.
The Before: When Everything Technically Works But Nothing Really Does
My model was technically working, in the sense that it was producing outputs and not crashing.
But in terms of actually predicting time in videos, it was a failure.
Some of the issues I was facing included:
- Poor feature extraction
- Inability to handle variable frame rates
- Overfitting to the training data

The real insight here is that I was focusing on the wrong problem. I was so caught up in trying to get the model to work that I wasn't thinking about whether the model was even the right tool for the job.
The Shift That Changed Everything
The turning point came when I stopped asking: What's the best model for this task?
...and started asking: What's the best way to represent time in a video?
This sounds obvious, but it completely changed my approach.
I started thinking about how humans perceive time, and how I could use that to inform my model design.
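One cheap fix that fell out of this reframing: sample frames at fixed timestamps rather than fixed frame indices, so variable frame rates stop mattering. The helper below and its parameters are my own illustration, not code from the original project.

```python
import numpy as np

def sample_frame_indices(num_frames, fps, clip_seconds=4.0, num_samples=16):
    """Pick frame indices at evenly spaced *timestamps*, so a 60 fps and a
    24 fps video of the same event yield the same temporal layout."""
    timestamps = np.linspace(0.0, clip_seconds, num_samples, endpoint=False)
    indices = (timestamps * fps).astype(int)
    return np.clip(indices, 0, num_frames - 1)

# A 60 fps and a 24 fps clip of the same 4-second event:
hi = sample_frame_indices(num_frames=240, fps=60)
lo = sample_frame_indices(num_frames=96, fps=24)
```

Both calls return 16 indices covering the same 4 seconds of wall-clock time, just mapped onto different frame counts.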
SlowFast Networks — What They Actually Do For You
Before, I was using a standard 3D CNN to extract features from video frames.
But with SlowFast networks, I could extract features at multiple scales, and then fuse them together to get a more robust representation of time.
The code for this was surprisingly simple:
```python
import torch
import torch.nn as nn

class SlowFastNetwork(nn.Module):
    """A simplified two-pathway network in the spirit of SlowFast:
    the slow path sees temporally subsampled frames, the fast path
    sees every frame but with fewer channels."""
    def __init__(self, alpha=4):
        super().__init__()
        self.alpha = alpha  # temporal subsampling factor for the slow path
        self.slow_path = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 8, 8)),  # align temporal/spatial dims
        )
        self.fast_path = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1),  # lightweight channels
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 8, 8)),
        )

    def forward(self, x):
        # x: (batch, 3, frames, height, width)
        slow = self.slow_path(x[:, :, ::self.alpha])  # every alpha-th frame
        fast = self.fast_path(x)                      # full frame rate
        return torch.cat((slow, fast), dim=1)         # fuse along channels
```
I spent 4 hours figuring this out, but it was worth it.
Time Prediction — What It Actually Means
Before, I was trying to predict time as a regression problem.
But framing it as a classification problem gave much better results.
The insight is that, for this task, time behaves less like a continuous variable and more like a discrete one: a video reads as a series of discrete events rather than a continuous flow.
Time is not just a matter of clock time, but also of event time.
By representing time as a series of discrete events, we can build models that are more robust and more accurate.
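Concretely, treating time prediction as classification means discretizing the target into bins and training with cross-entropy. A minimal sketch, where the 60-bin, 1-second-per-bin scheme and the helper names are my assumptions, not the original setup:

```python
import torch
import torch.nn as nn

def time_to_bin(t_seconds, max_seconds=60.0, num_bins=60):
    """Map a continuous timestamp to a discrete class index (1 s bins here)."""
    t = torch.clamp(t_seconds, 0.0, max_seconds - 1e-6)
    return (t / max_seconds * num_bins).long()

def bin_to_time(bin_idx, max_seconds=60.0, num_bins=60):
    """Map a class index back to its bin-center timestamp."""
    return (bin_idx.float() + 0.5) * (max_seconds / num_bins)

# Training then uses plain cross-entropy on the bin indices:
logits = torch.randn(4, 60)  # model output: one score per time bin
targets = time_to_bin(torch.tensor([3.2, 17.9, 42.0, 59.5]))
loss = nn.functional.cross_entropy(logits, targets)
```

At inference, `argmax` over the logits followed by `bin_to_time` recovers a timestamp, with error bounded by the bin width.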
The After: What Actually Changed
The results were night and day.
Before, my model was producing errors of up to 30 seconds.
After, the errors were down to 1-2 seconds.
Some of the key changes included:
- Improved feature extraction
- Better handling of variable frame rates
- Reduced overfitting

One thing that still doesn't work perfectly is handling videos with multiple timelines. This is an area where I'm still doing research and hoping to make some breakthroughs.
Final Thought: It's Not About Time — It's About Understanding
If I'm being honest, I was so focused on predicting time in videos that I forgot about the bigger picture.
Video understanding is not just about time, it's about understanding the events, actions, and objects in a video.
So, if you're also working on video understanding, I'm curious: what's the one thing that you're still struggling to get right?
Follow me on Medium for more AI/ML content!