Seeing Fast and Slow: Learning the Flow of Time in Videos
Video generation can now produce photorealistic frames — but it still has no reliable internal clock, and this paper is the first serious attempt to fix that.
What breaks in existing approaches
Every current video generation model encodes temporal dynamics implicitly, as a side effect of learning frame sequences. There's no explicit representation of how fast motion is occurring. The practical consequence: a 4x slow-motion clip and a real-time clip of the same action are indistinguishable to the model at training time. Speed is fully entangled with appearance. When you ask a diffusion-based video model to generate "a running cheetah at 0.25x speed," it has no mechanism to honor that constraint — it will generate something that looks like slow motion stylistically (motion blur, maybe), but it won't actually control temporal density of motion.
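A toy sketch of why speed is invisible at training time, under the simplifying assumption of a point moving at constant 1-D velocity: a normal-speed action captured by a high-speed camera (for slow-mo playback) and a genuinely 4x-slower action captured at a standard frame rate yield frame-for-frame identical windows, so a model that sees only frame sequences has nothing to learn speed from.

```python
import numpy as np

def frame_positions(velocity, fps, n_frames):
    """Per-frame positions of a point moving at constant 1-D velocity."""
    return velocity * np.arange(n_frames) / fps

# Same-length clips: a velocity-4 action captured at 120fps (slow-mo source)
# vs. a velocity-1 action captured at a standard 30fps.
slowmo_capture = frame_positions(velocity=4.0, fps=120, n_frames=16)
genuinely_slow = frame_positions(velocity=1.0, fps=30, n_frames=16)

# The two 16-frame windows are identical (per-frame displacement 1/30 in
# both), so playback speed is unrecoverable from the frames alone.
print(np.allclose(slowmo_capture, genuinely_slow))  # True
```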
The paper also points at the data problem this creates: slow-motion footage is enormously valuable for training temporally-aware models, but high-speed cameras are expensive and labeled slow-mo datasets are tiny. The standard workaround — synthetically slowing standard video — produces temporal artifacts and doesn't give you the richer motion detail that real high-speed capture contains.
The core idea
The insight is that videos already broadcast their own playback speed through multimodal cues that standard training pipelines ignore: audio-visual sync drifts when speed changes, motion blur scales with shutter speed, and the statistical texture of optical flow changes character at different frame rates. You don't need labels, just a model trained in a self-supervised way to detect these inconsistencies.
Think of it like pitch detection for audio: you don't need a human to annotate "this is 440Hz" — the signal encodes the answer. Here, temporal structure in the video itself is the supervisory signal for speed estimation.
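The abstract doesn't spell out the exact pretext task, so the setup below is an assumption in the spirit of prior "speediness" work: subsample frames at a random temporal stride and let the stride itself be the label. No human annotation is needed; the sampling procedure manufactures the supervision.

```python
import numpy as np

def speed_pretext_sample(frames, clip_len=8, strides=(1, 2, 4), rng=None):
    """Self-supervised training pair: (subsampled clip, stride-as-speed-label)."""
    if rng is None:
        rng = np.random.default_rng(0)
    stride = int(rng.choice(strides))
    # Pick a start index so the strided window stays inside the video.
    start = int(rng.integers(0, len(frames) - stride * clip_len + 1))
    idx = start + stride * np.arange(clip_len)
    return frames[idx], stride

frames = np.arange(64)                 # stand-in for 64 video frames
clip, label = speed_pretext_sample(frames)
# A classifier trained on such pairs learns to read speed from frame content.
```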
Once you have a reliable speed estimator, the clever part is using it as a mining filter: scrape in-the-wild video, run the classifier, pull out clips that are genuinely high-FPS slow-motion footage. The paper claims the resulting dataset is the largest slow-motion collection built this way. That curated dataset then trains the downstream generation and temporal super-resolution models.
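The mining loop itself is simple once the estimator exists. A hedged sketch: `estimate_speed` below is a hypothetical stand-in for the paper's classifier, assumed to return a label and a confidence score.

```python
def mine_slowmo(clips, estimate_speed, min_conf=0.9):
    """Filter in-the-wild clips down to confidently genuine slow-motion footage."""
    kept = []
    for clip in clips:
        label, conf = estimate_speed(clip)
        if label == "slowmo" and conf >= min_conf:
            kept.append(clip)
    return kept

# Toy estimator for illustration only: flags high-capture-fps clips as slow-mo.
def toy_estimator(clip):
    return ("slowmo", 0.95) if clip["fps"] >= 120 else ("realtime", 0.8)

clips = [{"id": 0, "fps": 30}, {"id": 1, "fps": 240}, {"id": 2, "fps": 120}]
mined = mine_slowmo(clips, toy_estimator)   # keeps ids 1 and 2
```

The `min_conf` gate is where the false-positive concern discussed below lives: set it too low and fast 30fps action leaks into the training set.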
No benchmark numbers are available from the abstract alone — I'd want to see speed estimation accuracy vs. prior work and FVD scores on the generation side before trusting the headline claims.
Where this actually matters
Temporal super-resolution is the concrete production win. Broadcast and sports pipelines routinely receive 30fps source material that needs to go out at 60fps or higher for slow-motion replay. Current frame interpolators (RIFE, FILM) warp and blend adjacent frames along estimated optical flow: they synthesize plausible in-between pixels but don't model the underlying motion trajectory beyond a single frame pair. A model that genuinely reasons about temporal dynamics could produce interpolated frames that are physically plausible rather than ghosted smears. The dataset angle is arguably more immediately deployable than the generation models: if you're training any motion-aware model and need slow-mo footage, their mining pipeline is worth replicating before bothering with the generation side.
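The ghosting failure mode is easy to see in one dimension. This is a deliberately toy sketch of naive frame averaging, not how RIFE or FILM work internally: blending two frames of a moving object yields two half-intensity copies instead of one object at the intermediate position.

```python
import numpy as np

frame_a = np.zeros(9); frame_a[3] = 1.0   # object at x=3 in frame t
frame_b = np.zeros(9); frame_b[5] = 1.0   # object at x=5 in frame t+1

blended = 0.5 * (frame_a + frame_b)       # naive cross-fade interpolation
# Ghosting: half-intensity copies at x=3 and x=5, nothing at x=4.
# A motion-aware interpolator should instead place the object at x=4:
motion_aware = np.zeros(9); motion_aware[4] = 1.0
```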
What I'd verify first
Three things I'd dig into before trusting this in production:
The multimodal self-supervised signal: does it hold when audio is absent or the clip is heavily edited? A lot of in-the-wild footage is muted or has replaced audio, which would break the audio-visual sync cue. How much does performance degrade when the model falls back to vision-only cues?
The dataset quality gate. "Largest slow-motion dataset" is a strong claim. I'd want to see false-positive rate on the speed classifier — how often does it pull standard 30fps footage with fast action and misclassify it as genuine slow-mo? Noisy training data here would poison the generation models downstream.
Temporal super-resolution hallucination rate. This is the general problem with any frame synthesis method: the generated intermediate frames can look locally smooth but be globally physically wrong (a ball in the wrong position, a hand bending the wrong way). Without quantitative evaluation against ground-truth high-FPS captures, "high-FPS sequences with fine-grained temporal details" is marketing language.
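The quantitative check the third point asks for is straightforward whenever real high-FPS capture exists: drop intermediate frames, interpolate them back, and score against the held-out real frames. A minimal sketch using PSNR as one standard metric:

```python
import numpy as np

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

# Held-out real intermediate frame vs. a naively blended (ghosted) prediction:
gt = np.zeros(9); gt[4] = 1.0
pred = np.zeros(9); pred[[3, 5]] = 0.5
score = psnr(pred, gt)   # low score exposes the physically wrong interpolation
```

A perceptually smooth but physically wrong frame can still score poorly here, which is exactly why the ground-truth comparison matters more than visual inspection.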
Bottom line
The speed-estimation-as-data-mining pipeline is a genuinely useful engineering contribution; the generation models are research demos that need external validation before you'd touch them in production.
📄 Paper: https://arxiv.org/abs/2604.21931
tags: computervision, deeplearning, videogeneration, pytorch
🇰🇷 Korean version on Velog: https://velog.io/@tkdnel1002/9s0f78a7