<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bongho Tae</title>
    <description>The latest articles on DEV Community by Bongho Tae (@xoqhdgh1002).</description>
    <link>https://dev.to/xoqhdgh1002</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896559%2F3b7e6ff4-85a9-47b3-a452-08b8c7ea14d3.png</url>
      <title>DEV Community: Bongho Tae</title>
      <link>https://dev.to/xoqhdgh1002</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xoqhdgh1002"/>
    <language>en</language>
    <item>
      <title>Time's Fingerprint: How AI Finally Learned to Read the Speed of the World</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sat, 25 Apr 2026 05:24:59 +0000</pubDate>
      <link>https://dev.to/xoqhdgh1002/times-fingerprint-how-ai-finally-learned-to-read-the-speed-of-the-world-3l0k</link>
      <guid>https://dev.to/xoqhdgh1002/times-fingerprint-how-ai-finally-learned-to-read-the-speed-of-the-world-3l0k</guid>
      <description>&lt;h2&gt;
  
  
  The blur we never thought to ask about
&lt;/h2&gt;

&lt;p&gt;You have almost certainly watched a video that felt wrong before you could explain why. Maybe it was dashcam footage shared on social media — the traffic moving just a beat too briskly, the pedestrians crossing the street with a faint mechanical urgency, as though everyone had somewhere slightly too important to be. Or maybe it was the reverse: a sports clip slowed down to a crawl, the ball hanging in the air like something painted on silk, the crowd frozen mid-roar. Your brain registered something about time before your conscious mind caught up.&lt;/p&gt;

&lt;p&gt;That gut feeling — &lt;em&gt;this is moving at the wrong speed&lt;/em&gt; — is something humans do effortlessly and machines have, until very recently, struggled to do at all. A new paper from researchers at the University of Washington and Google changes that. They have taught a computer system not just to understand what is happening in a video, but to understand &lt;em&gt;when&lt;/em&gt; — to read the flow of time embedded in moving images the way a musician reads tempo from sheet music.&lt;/p&gt;

&lt;p&gt;The consequences turn out to be surprisingly far-reaching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why computers went blind to speed
&lt;/h2&gt;

&lt;p&gt;Modern computer vision is remarkably capable. Given a video, existing systems can tell you that a dog is chasing a ball, that the man in the blue jacket is the same man who appeared three seconds earlier, that the faces in this clip belong to certain people. What these systems cannot reliably do is answer a simpler-sounding question: is this video playing at normal speed?&lt;/p&gt;

&lt;p&gt;The reason is subtler than it first appears. Think about what a video actually is: a sequence of still photographs shown so rapidly that the eye perceives motion. At 24 frames per second — the standard for film — you're seeing 24 photographs every second. At 240 frames per second — the speed of a high-end action camera — you're capturing ten times more moments. When that 240-frames-per-second footage is played back at 24 frames per second, you get the floating, dreamlike quality of slow motion. Every heartbeat of action is stretched into ten beats of screen time.&lt;/p&gt;
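
&lt;p&gt;A quick back-of-the-envelope calculation makes that stretch concrete. This is a minimal Python sketch using only the frame rates mentioned above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# How much does high-speed capture stretch time on playback?
capture_fps = 240   # frames recorded per second of real time
playback_fps = 24   # frames shown per second of screen time

slowdown = capture_fps / playback_fps
print(f"1 second of action becomes {slowdown:.0f} seconds on screen")   # 10 seconds

# Each frame also samples a much shorter slice of real time,
# which is part of why individual high-speed frames look sharper.
real_time_per_frame_ms = 1000 / capture_fps
print(f"Each frame covers at most {real_time_per_frame_ms:.2f} ms of real time")   # about 4.17 ms&lt;/code&gt;&lt;/pre&gt;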

&lt;p&gt;Now, a machine looking at individual frames faces an ambiguity problem: it sees a ball mid-flight, but how does it know whether that frame came from a 24fps normal-speed video or from a 240fps clip played back at one-tenth speed? The objects look identical. The scene looks identical. The motion, considered frame by frame, looks identical.&lt;/p&gt;

&lt;p&gt;This is why most computer vision research simply ignored the question. Speed was treated as a metadata problem — something you look up in the file's technical specifications, not something you read from the pixels themselves. But that assumption collapses the moment you're working with in-the-wild internet video, where metadata is unreliable, absent, or deliberately manipulated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motion blur is time's fingerprint
&lt;/h2&gt;

&lt;p&gt;The breakthrough insight in this paper is that time actually does leave fingerprints on pixels — you just have to know where to look.&lt;/p&gt;

&lt;p&gt;Consider what happens to a photograph of a speeding motorcycle. If the shutter stays open even a fraction too long, the motorcycle doesn't appear as a crisp object. It smears. You see a streak, a ghost, a blur that traces the path of motion across the frame. This motion blur is not a flaw in the photograph. It is information. It is the camera's way of recording that something moved very fast during the brief window the shutter was open.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" alt="Motion blur on a fast-moving motorcycle" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The same logic applies to video. When a bicycle races down a mountain trail in real time, the background trees streak into horizontal smudges behind it. When that same footage is captured at high speed and played back slowly, each individual frame is sharper — there is less blur per frame, because the camera captured each moment during a much shorter window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" alt="Cyclist on a mountain trail with motion blur in the background" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The researchers trained their model to read these cues the way a forensic analyst reads tire marks on asphalt — not just noticing that blur exists, but using its character, direction, and intensity to reconstruct what kind of motion produced it, and at what temporal scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" alt="Mountain bike racer with strong motion blur showing speed" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A panning camera following a bird in flight, for instance, produces a very particular blur signature — the bird is sharp while the background dissolves into horizontal streaks, because the camera tracked the subject and let the world smear behind it. This kind of image is visually unmistakable as &lt;em&gt;fast&lt;/em&gt;, even if nothing in the semantic content — bird, sky, trees — carries that information directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" alt="Bird in flight photographed with panning motion blur" width="512" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The audio trick that changed everything
&lt;/h2&gt;

&lt;p&gt;Visual blur is one fingerprint of speed. But the paper's most elegant trick exploits a second one: sound.&lt;/p&gt;

&lt;p&gt;Here is something most people don't consciously think about: when you speed up a video, the audio pitch rises. Play a recording of a conversation at twice normal speed and everyone sounds like a cartoon character — voices become thin, reedy, almost helium-inflected. Slow it down to half speed and the same voices become impossibly low and thick, like a turntable winding down after the power is cut.&lt;/p&gt;

&lt;p&gt;This happens for the same reason that a police siren sounds higher as it approaches you and lower as it recedes: the pitch of a sound is determined by the frequency of the sound waves reaching your ears, and that frequency changes when the source is moving (or, in this case, when time itself is compressed or expanded in playback).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" alt="Audio spectrogram showing frequency changes with playback speed" width="800" height="320"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: When a video's playback speed changes, its audio pitch shifts naturally, providing free cross-modal supervision. The spectrogram (used only during training) shows energy moving to higher frequencies at higher playback speeds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The researchers visualized this as a spectrogram — a map of which sound frequencies appear at which moments. In the image above, you can see the effect directly: the left side of the image, representing slower playback, shows sound energy clustered in lower frequencies, with the high-frequency regions dark and empty. On the right, where playback speed increases, the higher frequencies suddenly light up, the entire spectrum shifting upward like a musical key change written in light.&lt;/p&gt;
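
&lt;p&gt;You can convince yourself of that shift with a few lines of NumPy. The sketch below is not from the paper (the 220 Hz tone and the 2x factor are purely illustrative, and decimation is a crude stand-in for faster playback), but it shows why speeding things up pushes the spectrum toward higher frequencies:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

sr = 16000                            # audio sample rate in Hz
t = np.arange(sr) / sr                # one second of timestamps
tone = np.sin(2 * np.pi * 220 * t)    # a steady 220 Hz tone

def dominant_frequency(signal, sample_rate):
    """Return the strongest frequency in the signal, found via an FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

# "Playing back at 2x" consumes the samples twice as fast, which is
# roughly the same as keeping every other sample at the original rate.
sped_up = tone[::2]

print(dominant_frequency(tone, sr))      # about 220 Hz at normal speed
print(dominant_frequency(sped_up, sr))   # about 440 Hz: pitch doubles with speed&lt;/code&gt;&lt;/pre&gt;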

&lt;p&gt;This creates a profound opportunity. It means that the &lt;em&gt;same video&lt;/em&gt; carries two independent, corroborating signals about its own speed: the visual blur in the frames and the pitch signature in the audio. The model can compare these signals against each other, using each one to check and sharpen its reading of the other.&lt;/p&gt;

&lt;p&gt;This is what researchers call cross-modal supervision — using two different sensory channels as mutual teachers. Think of how a wine sommelier uses both smell and taste together to identify a vintage. Neither sense alone might be definitive, but the agreement between them, or the revealing discord, tells a richer story than either could alone. The model learns the relationship between visual speed cues and audio pitch cues by watching enormous amounts of ordinary video — without anyone labeling a single frame or telling the system what "slow motion" looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teaching a machine without a teacher
&lt;/h2&gt;

&lt;p&gt;This brings us to perhaps the most important methodological decision in the paper: everything described so far is learned without labels.&lt;/p&gt;

&lt;p&gt;In most machine learning, you need a human to annotate training data. Someone has to watch thousands of videos and write down: "this one is played at half speed," "this one is normal," "this one is sped up to twice normal speed." This labeling process is expensive, slow, and bottlenecked by human attention. More fundamentally, it requires the person labeling to already know the answer — which is exactly what you're trying to teach the machine.&lt;/p&gt;

&lt;p&gt;The researchers sidestepped this entirely through a technique called self-supervised learning. Imagine teaching someone to recognize a forged signature without ever showing them examples of forgeries. Instead, you hand them a stack of authentic signatures and let them look for internal inconsistencies — places where the pen pressure, the angle, the rhythm of a stroke breaks with what the same hand produced moments earlier. They learn by noticing when something doesn't cohere, without anyone ever telling them what to look for.&lt;/p&gt;

&lt;p&gt;The model in this paper learns similarly. Researchers took ordinary internet videos and artificially sped some up, slowed others down, or mixed sections of different speeds. They then asked the model to detect these changes — not by consulting a label, but by noticing when the visual flow and audio pitch no longer fit together, or when the blur patterns across consecutive frames don't match the implied rhythm of motion. The "teacher" is the internal consistency of the video itself.&lt;/p&gt;
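
&lt;p&gt;A rough way to picture that labels-for-free recipe is as a data-augmentation step. The sketch below is a simplification (the real work involves decoded video and a learned network, and the speed values here are only illustrative), but it shows how the training label comes along for free:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def retime_clip(frames, speed, out_len=16):
    """Sample out_len frames as if the clip played at `speed` times its normal rate.

    A speed above 1 skips source frames (looks sped up);
    a speed below 1 repeats source frames (looks slowed down).
    """
    idx = np.round(np.arange(out_len) * speed).astype(int)
    idx = np.clip(idx, 0, len(frames) - 1)   # stay inside the source clip
    return frames[idx]

def make_training_example(frames, speeds=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Pick a random speed, re-time the clip, and keep the speed as a free label."""
    speed = float(np.random.choice(speeds))
    return retime_clip(frames, speed), speed

# Toy stand-in for a decoded video: 128 frames of 8x8 grayscale noise.
clip = np.random.rand(128, 8, 8, 1)
retimed, label = make_training_example(clip)
print(retimed.shape, label)   # e.g. (16, 8, 8, 1) 0.5&lt;/code&gt;&lt;/pre&gt;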

&lt;h2&gt;
  
  
  Building the world's largest slow-motion library
&lt;/h2&gt;

&lt;p&gt;Once you have a system that can reliably tell whether a video contains slow motion, you can use that system as a filter — a tireless, infinitely patient curator.&lt;/p&gt;

&lt;p&gt;The internet contains an enormous amount of slow-motion footage mixed in with billions of ordinary videos. Finding it is the problem: there's no reliable, consistent way to identify it from metadata alone. People tag and title videos erratically. One creator calls the same footage "slo-mo," another calls it "60fps," another calls it nothing at all.&lt;/p&gt;

&lt;p&gt;The researchers turned their trained model loose on this haystack. By processing large collections of video and flagging clips where the model detected slow-motion signatures — the characteristic blur, the pitch-shifted audio, the visual density of temporal detail — they assembled the largest slow-motion dataset ever collected from naturally occurring sources.&lt;/p&gt;
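
&lt;p&gt;Conceptually, that curation stage is little more than a scoring loop. In the sketch below, the scoring function is a hypothetical stand-in for the paper's trained detector (it is not a published API), and the threshold is arbitrary; the point is only the shape of the filtering logic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def mine_slow_motion(clips, estimate_speed_factor, threshold=2.0):
    """Keep clips whose estimated capture-to-playback ratio suggests real slow motion.

    estimate_speed_factor stands in for the trained detector: it takes a clip
    and returns an estimated speed ratio, where about 1.0 means normal speed.
    """
    return [clip for clip in clips if estimate_speed_factor(clip) &gt;= threshold]

# Usage with a dummy scorer, just to show the plumbing:
fake_scores = {"clip_a": 1.0, "clip_b": 4.2, "clip_c": 0.9}
slow_mo = mine_slow_motion(fake_scores, estimate_speed_factor=fake_scores.get)
print(slow_mo)   # ['clip_b']&lt;/code&gt;&lt;/pre&gt;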

&lt;p&gt;This matters because slow-motion footage is genuinely different from ordinary video in a way that matters for AI training. Think of ordinary video as a novel that describes a battle in broad strokes — armies clash, a hero falls, the tide turns. Slow-motion footage is like a frame-by-frame graphic novel of the same battle, where every sword stroke and expression is captured in full detail. For a machine learning to understand motion, physics, and causality, that detail is not decorative. It is the text.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the machine learns to control time
&lt;/h2&gt;

&lt;p&gt;The paper's most forward-looking section describes two things the researchers built using all this acquired understanding: a system that generates video at a specified speed, and a system that converts low-quality, blurry, low-frame-rate video into high-quality slow motion.&lt;/p&gt;

&lt;p&gt;The first — speed-conditioned video generation — is something like teaching an illustrator to draw differently depending on a mood instruction. Ask them to draw a waterfall as "frozen," and they'll use sharp lines, crystalline forms, stillness implied in every edge. Ask them to draw the same waterfall as "rushing," and the same elements become streaks, arcs, foam caught in mid-scatter. The instruction shapes every aesthetic decision, not just the subject matter. Here, instead of artistic mood, the instruction is temporal: generate this scene as though captured at half normal speed, or double normal speed. The model learns to make every visual choice — how sharp to render edges, how much to blur movement, how to distribute motion across frames — consistent with the specified temporal flow.&lt;/p&gt;

&lt;p&gt;The second — temporal super-resolution — is arguably the more practically remarkable achievement. Given a video that is blurry, low-frame-rate, and temporally thin (imagine footage from a security camera, or a clip compressed heavily for file size), the system reconstructs what the in-between moments probably looked like. This is not guessing randomly. It is inference constrained by everything the model has learned about how motion works, how blur distributes across a scene, and how things in the physical world actually move between recorded frames.&lt;/p&gt;

&lt;p&gt;Think of how a skilled art restorer approaches a damaged oil painting. Faced with sections where the paint has flaked away entirely, they don't fill in the gaps with random colors. They study the surrounding strokes, the artist's technique as visible in intact sections, the logic of the depicted scene — and from all of this, they reconstruct what almost certainly was there. The result is not certainty, but it is informed reconstruction, and for many purposes it is better than leaving the gap blank.&lt;/p&gt;

&lt;h2&gt;
  
  
  What becomes possible now
&lt;/h2&gt;

&lt;p&gt;These capabilities, combined, begin to shift what is possible in several concrete domains.&lt;/p&gt;

&lt;p&gt;Consider a surgeon training on video of a delicate procedure. Currently, the training footage may have been captured on standard medical cameras at rates that simply don't capture the full motion of the most critical moments — the tension and release of a suture, the exact angle of an incision. With temporal super-resolution, the same footage could be enriched with recovered in-between frames, giving trainees and instructors a more complete picture of technique.&lt;/p&gt;

&lt;p&gt;Or consider a forensic analyst asked whether a viral video of an incident has been manipulated — specifically, whether someone sped up footage to make a crowd look more menacing, or slowed it down to make an action look more deliberate than it was. These techniques give investigators a systematic way to test that question, looking for the inconsistencies between visual and audio speed signatures that arise when footage has been post-processed — the equivalent of finding anachronistic fiber in a supposedly antique cloth.&lt;/p&gt;

&lt;p&gt;For the film industry, speed-conditioned generation opens the possibility of creating cinematic slow motion in post-production, without the cost of high-speed cameras. What currently requires tens of thousands of dollars in equipment could, if these techniques mature, be applied as a computational process to footage captured with ordinary cameras.&lt;/p&gt;

&lt;p&gt;And at a deeper level, there is something philosophically significant about what this paper is pointing toward: the idea that time itself is a visual dimension that can be learned, not just assumed. Most AI systems that watch video treat it as a sequence of images. This paper treats it as a recording of temporal flow — and argues that how things unfold across time is as learnable, and as teachable, as what objects look like or where they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the paper doesn't answer
&lt;/h2&gt;

&lt;p&gt;There are honest gaps here worth noting. The audio-based speed detection, elegant as it is, is useless on silent video — a substantial fraction of internet content. The visual signals alone carry less certainty in certain kinds of footage: scenes with little motion, static shots, or carefully stabilized camera work where blur signatures are deliberately suppressed by stabilization software.&lt;/p&gt;

&lt;p&gt;More fundamentally, the temporal super-resolution system, like all such reconstruction methods, is making educated inferences about what it didn't see. In most applications, this is fine. But in forensic or legal contexts, a system that fills in moments it never observed is a system that can produce compelling artifacts — convincing reconstructions of things that may not have happened quite that way. The capability and the caution need to develop together.&lt;/p&gt;

&lt;p&gt;And the paper is still largely a proof-of-concept for some of the generation results. The generated videos, while compelling, show the artifacts and limitations familiar to anyone who has watched AI-generated video for more than a few seconds. The principle is demonstrated; the product-quality execution is still ahead.&lt;/p&gt;

&lt;p&gt;But the direction is clear, and the foundation is sound. Time has always moved through video. Now, finally, the machines are starting to notice.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.21931v1" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.21931v1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: computervision, videogeneration, selfsupervisedlearning, temporalai&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/4jkzs29p" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/4jkzs29p&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>computervision</category>
      <category>videogeneration</category>
      <category>selfsupervisedlearni</category>
      <category>temporalai</category>
    </item>
    <item>
      <title>When a Machine Finally Learns to Feel Time Passing</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sat, 25 Apr 2026 05:22:27 +0000</pubDate>
      <link>https://dev.to/xoqhdgh1002/when-a-machine-finally-learns-to-feel-time-passing-40ag</link>
      <guid>https://dev.to/xoqhdgh1002/when-a-machine-finally-learns-to-feel-time-passing-40ag</guid>
      <description>&lt;h2&gt;
  
  
  The moment you knew something was off
&lt;/h2&gt;

&lt;p&gt;You are watching a clip on social media of a skateboarder landing an impossible trick. Something feels wrong. The arms swing a little too smoothly. The dust rises at just slightly the wrong pace. Within half a second, before you have consciously formed a thought, your brain has already delivered its verdict: &lt;em&gt;this video has been slowed down&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That verdict came from somewhere. Not from a timestamp in the corner of the screen. Not from a caption. Something in the visual texture of the footage itself — the way motion blur smears across a wheel, the rhythm of a jacket flap, the relationship between how fast the body moves and how long it takes to land — told your nervous system that time in this clip is not moving at its natural rate.&lt;/p&gt;

&lt;p&gt;For decades, this particular skill has been almost entirely off-limits for artificial intelligence. Machines could recognize objects, count people, read text, even generate photorealistic faces — but the felt sense of time's pace was essentially invisible to them. A new paper from researchers at the University of Washington and Google changes that, building systems that can not only detect when a video has been sped up or slowed down but also &lt;em&gt;generate&lt;/em&gt; footage at specified temporal rhythms, and sharpen blurry low-frame-rate video into fluid, detailed motion. The work is less about any single application and more about establishing time itself as something a machine can learn to perceive and control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why computers were, until now, nearly blind to pace
&lt;/h2&gt;

&lt;p&gt;To understand why this problem is hard, consider how a conventional image-recognition system works. It looks at a single frame — a frozen slice of the world — and classifies what it contains. A cat. A car. A running athlete. That process has become extraordinarily good. But speed is not visible in a single frame in any direct way. Speed is a relationship between frames, between moments, between the present image and the memory of what just came before.&lt;/p&gt;

&lt;p&gt;Humans perceive speed holistically, through a bundle of cues that we barely notice we are using. Consider watching a mountain biker tear down a slope.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" alt="Motion blur on a mountain biker" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The background streaks into a horizontal smear. The rider's body holds its shape while the world behind dissolves. Your brain reads this visual grammar fluently. Motion blur is a kind of natural speedometer — the more the world smears, the faster things were moving when the shutter opened. Experienced photographers understand this intuitively: a faster shutter "freezes" motion and eliminates blur; a slower shutter lets the blur accumulate like ink dragged across wet paper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" alt="Mountain biker with motion blur, panning shot" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Previous AI video systems largely ignored this grammar. They were trained to recognize &lt;em&gt;what&lt;/em&gt; was happening — someone is cycling, a bird is flying, a motorcycle is cornering — without developing any feel for &lt;em&gt;how fast&lt;/em&gt; it was happening. This is a bit like training a music student to identify instruments by sight while never teaching them to hear rhythm. You could assemble an orchestra and they would name every instrument correctly while remaining completely deaf to whether the piece was allegro or adagio.&lt;/p&gt;

&lt;p&gt;The deeper problem was data. Teaching a machine to detect speed requires labeled examples: videos tagged "this one was shot at normal speed," "this one was artificially slowed down," "this one was accelerated." Assembling such a dataset by hand, at scale, is brutally expensive. And without sufficient data, the machines simply never developed the necessary sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning to feel time without being taught
&lt;/h2&gt;

&lt;p&gt;The central cleverness of this paper is the researchers' decision to sidestep the labeling problem entirely. They trained their model using a technique called &lt;em&gt;self-supervised learning&lt;/em&gt; — a phrase that sounds circular but describes something genuinely elegant.&lt;/p&gt;

&lt;p&gt;Think of it like this: imagine you are learning to read a clock, but no one will tell you directly what each position of the hands means. Instead, you are given thousands of pairs of clocks, and for each pair you are told only whether the second clock shows an earlier or later time. You cannot see any labels. But from the &lt;em&gt;relationship&lt;/em&gt; between clocks — from the angle differences, from the patterns of which configurations follow which — you gradually build an internal model of how time flows across a clock face. By the end, you understand the clock not because anyone explained it, but because the structure of the data itself encoded the answer.&lt;/p&gt;

&lt;p&gt;The researchers did something analogous with video. They took ordinary footage from the internet, artificially sped it up or slowed it down by known amounts, and then trained a model to detect these manipulations. Crucially, &lt;em&gt;no human ever labeled these videos&lt;/em&gt;. The labels were generated automatically — the researchers knew exactly what changes they had made, so the training signal was free. The model's job was to reconstruct the manipulation from the visual evidence alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" alt="Pelican in flight showing motion blur" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Through this process, the model was forced to pay attention to the same things your brain attends to: blur patterns, the rhythm of recurring motions, the way texture changes frame by frame. It could not cheat by reading metadata. It had to &lt;em&gt;see&lt;/em&gt; time the way we see it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the soundtrack gives you away
&lt;/h2&gt;

&lt;p&gt;There is a second layer of cleverness, and it involves sound. When you change the playback speed of a video, the audio changes too. Speed a clip up, and voices rise in pitch — everyone sounds like they have inhaled helium. Slow a clip down, and sounds become low, woozy, almost submarine. This pitch shift is not an accident; it is a direct physical consequence of how audio works. Sound is a wave, and stretching or compressing the wave changes its frequency, which we hear as pitch.&lt;/p&gt;

&lt;p&gt;The researchers realized this creates a free cross-modal signal — a second channel of information, completely independent of the visuals, that carries evidence about temporal speed. They trained their model to listen to the audio alongside watching the frames, using the relationship between what is heard and what is seen as an additional training cue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" alt="Spectrogram showing frequency shift at speed change" width="800" height="320"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: When a video's playback speed changes, its audio pitch shifts naturally, providing free cross-modal supervision. The spectrogram (used only during training) shows energy moving to higher frequencies at higher playback speeds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This figure shows what that audio evidence looks like when visualized. A &lt;em&gt;spectrogram&lt;/em&gt; is a kind of musical X-ray — a map of which sound frequencies are present at each moment in time, displayed as a heat map. On the left side of the spectrogram, the high frequencies are dark and absent. On the right, after the speed changes, they bloom into existence. If you played this video and listened carefully, you would hear the pitch shift — but even without listening, the visual pattern in the spectrogram tells the same story.&lt;/p&gt;

&lt;p&gt;Think of the spectrogram as a fingerprint. A video running at normal speed leaves one kind of acoustic fingerprint. A sped-up or slowed-down video leaves a different one. The model learned to read those fingerprints, combining them with the visual blur and motion patterns to arrive at a more confident judgment about temporal speed than either sense alone could provide. It is, in a modest but genuine way, the machine equivalent of that gut feeling you get watching the skateboarder — a convergence of cues from multiple senses arriving at a single verdict.&lt;/p&gt;
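
&lt;p&gt;One way to picture that convergence of cues is as an agreement check between two independent estimates. The sketch below is a deliberate oversimplification (the function and the numbers are illustrative, not the paper's), but the gap it measures is the kind of signal a cross-modal model can learn from:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def cross_modal_disagreement(visual_speed, audio_speed):
    """Size of the mismatch between two independent speed estimates.

    visual_speed: speed factor guessed from blur and motion patterns.
    audio_speed:  speed factor implied by the audio pitch shift.
    Both sit near 1.0 for unmodified, normal-speed footage.
    """
    return abs(visual_speed - audio_speed)

# An untouched clip: both cues agree that nothing was re-timed.
print(cross_modal_disagreement(1.0, 1.0))   # 0.0

# A clip whose frames were slowed 2x while the audio was left alone:
# the cues disagree, which is exactly what a detector can exploit.
print(cross_modal_disagreement(0.5, 1.0))   # 0.5&lt;/code&gt;&lt;/pre&gt;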

&lt;h2&gt;
  
  
  Mining slow motion from the noise
&lt;/h2&gt;

&lt;p&gt;Once the model could detect speed changes reliably, the researchers turned it into a curator. The internet contains enormous amounts of slow-motion footage — sports cameras, wildlife documentaries, action sequences — but it is thoroughly mixed with normal-speed content and mislabeled clips. Sorting by hand is not feasible at scale.&lt;/p&gt;

&lt;p&gt;The speed-detection model acted like a trained sommelier moving through a vast, disorganized cellar. It tasted each bottle, so to speak, and set aside the genuinely slow-motion footage into a separate collection. The result was what the paper describes as the largest slow-motion video dataset assembled to date, built not from expensive new filming but from existing noise — like panning for gold in a river that no one had previously bothered to sieve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" alt="Motorcycle with motion blur" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Why does slow-motion footage matter so much for training AI? Because slow-motion cameras, which can capture hundreds or thousands of frames per second, preserve temporal detail that ordinary cameras discard. When a hummingbird's wing moves at fifty beats per second and your camera captures only thirty frames per second, most of what the wing does simply vanishes between frames — the machine never sees it. A high-speed camera, playing back at slower rates, reveals the full arc of the motion: the curl at the tip, the slight backward stroke, the recovery. All of that additional information is training data for any system that needs to understand how things move through time.&lt;/p&gt;
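
&lt;p&gt;The arithmetic behind that claim is worth spelling out (the 1,000 fps figure is just an illustrative high-speed rate):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wingbeats_per_second = 50
standard_fps = 30
high_speed_fps = 1000

# Frames captured during a single wingbeat.
print(standard_fps / wingbeats_per_second)    # 0.6  -- less than one frame per full beat
print(high_speed_fps / wingbeats_per_second)  # 20.0 -- the whole arc of the stroke is sampled&lt;/code&gt;&lt;/pre&gt;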

&lt;h2&gt;
  
  
  Sharpening blurry time
&lt;/h2&gt;

&lt;p&gt;Here the paper's ambitions expand outward from detection into generation. The researchers built two related systems, and it is worth pausing on each.&lt;/p&gt;

&lt;p&gt;The first is called &lt;em&gt;temporal super-resolution&lt;/em&gt;. The word "resolution" usually refers to spatial sharpness — how many pixels describe a scene. Temporal resolution is the analogous concept in time: how many frames per second capture the motion. Standard video has 24 or 30 frames per second. Slow-motion footage might have 240 or more.&lt;/p&gt;

&lt;p&gt;Temporal super-resolution is the process of inventing the in-between frames — taking a 30-frames-per-second clip and producing a convincing 240-frames-per-second version. This sounds like alchemy, and in a sense it is. The machine does not know what actually happened in the gaps. It infers what probably happened, using everything it has learned about how objects move, how motion blur accumulates, how fast typical physical processes unfold.&lt;/p&gt;

&lt;p&gt;A useful analogy: imagine reading a novel in which every other page has been torn out. A careless reader might simply skip the gaps. A skilled reader might reconstruct what the missing pages probably said — not because they have supernatural knowledge, but because stories have patterns, causes lead to effects, characters act consistently. Temporal super-resolution does the same thing with motion: it reads the pattern of what came before and after, and writes the missing frames.&lt;/p&gt;
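
&lt;p&gt;For a sense of what "writing the missing frames" means at its crudest, the sketch below does plain linear blending between neighboring frames with NumPy. The learned model described in the paper does something far richer, reasoning about motion and blur rather than cross-fading pixels, but the shape of the problem is the same:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def naive_temporal_upsample(frames, factor=8):
    """Insert (factor - 1) linearly blended frames between every pair of real frames.

    This is the crude baseline: a learned model would instead predict where
    objects actually were at the missing moments, not just cross-fade pixels.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            alpha = k / factor
            out.append((1 - alpha) * a + alpha * b)
    out.append(frames[-1])
    return np.stack(out)

# A 30-frame clip at 30 fps becomes a 240 fps-style clip.
low_fps = np.random.rand(30, 64, 64, 3)
high_fps = naive_temporal_upsample(low_fps, factor=8)
print(low_fps.shape, high_fps.shape)   # (30, 64, 64, 3) (233, 64, 64, 3)&lt;/code&gt;&lt;/pre&gt;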

&lt;p&gt;The second system is &lt;em&gt;speed-conditioned video generation&lt;/em&gt;. Here the researchers trained a model not just to analyze temporal speed but to produce it on demand. Given a description or a scene and a target speed — "generate this at half speed," or "at triple speed" — the model produces video in which the motion is appropriately fast or slow. The blur patterns, the rhythm of movement, the visual grammar of pace are all calibrated to the specified rate.&lt;/p&gt;

&lt;p&gt;Think of this as the difference between a pianist who can identify what tempo a recording was played at versus a pianist who can play any piece you name at any tempo you specify. Detection and generation are related skills but not identical ones. Building the second requires a more fundamental model of what makes motion &lt;em&gt;feel&lt;/em&gt; fast or slow in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  What becomes possible
&lt;/h2&gt;

&lt;p&gt;The practical implications unfold across several domains, and it is worth being concrete about them rather than waving at vague future possibilities.&lt;/p&gt;

&lt;p&gt;In sports broadcasting, replays are already ubiquitous, but they require footage shot in slow motion from the start. With temporal super-resolution, an editor could take ordinary sideline footage of a golf swing — 30 frames per second, slightly blurry at the critical moment of impact — and reconstruct it as smooth, fluid slow motion. The clubhead's precise angle at contact, the ball's initial deformation, the ripple through the shaft — all of it could be recovered from what was previously just a fast blur.&lt;/p&gt;

&lt;p&gt;In forensics and investigative journalism, the ability to detect whether a video has been artificially sped up or slowed down becomes a form of temporal authentication. Manipulating video speed is a technique used to make crowds look larger or smaller, to make events seem more or less violent, to create impressions of panic or calm. A reliable detection system is a countermeasure — not perfect, but meaningful.&lt;/p&gt;

&lt;p&gt;In medical imaging, surgeons sometimes review high-speed footage of tissue behavior, cardiac valve motion, or fluid dynamics in small vessels. The ability to extract higher temporal resolution from existing recordings, without the cost and complexity of specialized equipment, could broaden access to this kind of analysis.&lt;/p&gt;

&lt;p&gt;In entertainment and creative work, speed-conditioned generation opens new expressive possibilities. A filmmaker who wants a specific kinetic quality — the languorous drift of a sunset time-lapse, the visceral punch of ultra-slow-motion collision — could generate it from scratch rather than scheduling and filming it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the paper leaves open
&lt;/h2&gt;

&lt;p&gt;None of this is to say the work is without limits. The paper acknowledges that generating convincing slow motion from scratch involves inference under uncertainty — the machine is guessing about motion it did not observe, and those guesses can fail in complex scenes with multiple overlapping objects or unpredictable physical behavior. A bouncing basketball moving through empty air is a manageable problem; a tackle in the middle of a crowd is considerably harder.&lt;/p&gt;

&lt;p&gt;There is also a question the paper does not fully engage with: the deepfake problem in reverse. If a machine can reliably detect speed manipulation, and the same techniques are known to those who manipulate footage, a race begins. Detection systems improve; circumvention techniques improve in response. The paper is not naive about this — it frames temporal forensics as an application — but the dynamics of that race are not addressed in any depth. History suggests that in most detection-versus-evasion competitions, evasion eventually finds ways to stay competitive.&lt;/p&gt;

&lt;p&gt;And the self-supervised training approach, clever as it is, builds a model that learns to detect the artificial speed changes that were introduced in training. Whether it generalizes equally well to all the creative and accidental ways that temporal irregularities appear in real-world footage — equipment malfunctions, mixed-frame-rate editing, certain compression artifacts — is not fully established.&lt;/p&gt;

&lt;p&gt;Still, the fundamental contribution here is harder to dismiss than any of these caveats. Time has been something of a blind spot in AI video systems — present in every frame, structuring everything, but largely unmodeled. Establishing that a machine can be taught to perceive temporal pace from first principles, and to use that perception to both analyze and generate footage, opens a door that has been functionally closed for a long time. Whether what lies beyond it is forensic tools or creative instruments or some application no one has imagined yet, the door is now open.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.21931v1" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.21931v1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: computervision, videogeneration, temporalreasoning, selfsupervised&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/buqtmlyl" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/buqtmlyl&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>computervision</category>
      <category>videogeneration</category>
      <category>temporalreasoning</category>
      <category>selfsupervised</category>
    </item>
    <item>
      <title>When Machines Learn to Feel Time</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sat, 25 Apr 2026 05:10:36 +0000</pubDate>
      <link>https://dev.to/xoqhdgh1002/when-machines-learn-to-feel-time-4neh</link>
      <guid>https://dev.to/xoqhdgh1002/when-machines-learn-to-feel-time-4neh</guid>
      <description>&lt;h2&gt;
  
  
  The moment that made this problem real
&lt;/h2&gt;

&lt;p&gt;Scroll through your social media feed for ten minutes and you'll encounter it dozens of times: a clip of a hummingbird frozen mid-wingbeat, its feathers splayed like a tiny green hand; a basketball player's dunk stretched into four elastic seconds; a car crash replayed at a tenth of normal speed so that steel buckles like wet cardboard. Then, thirty seconds later, a time-lapse of a flower blooming, a city waking up, a storm rolling across a plain — all the world compressed into a dreamy, accelerated rush.&lt;/p&gt;

&lt;p&gt;You have no difficulty perceiving any of this. Your brain adjusts instantly, contextualizing speed changes by the look of motion blur, the tempo of ambient sound, the rhythm of cause and effect playing out in front of you. You know, instinctively, that the hummingbird clip is slow-motion because wings don't look like that in real life. You know the city time-lapse is sped up because people don't move like flickers of light.&lt;/p&gt;

&lt;p&gt;But until very recently, the artificial intelligence systems that power our cameras, video editors, and streaming platforms had almost no idea any of this was happening. They watched every video at face value — incapable of asking, let alone answering, the question: &lt;em&gt;what is time actually doing here?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A team of researchers from the University of Washington and Google has now built systems that can. Their paper, "Seeing Fast and Slow," treats time not as a fixed container that videos simply fill up, but as a learnable, manipulable dimension — something a machine can be taught to sense, estimate, and ultimately control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why computers were essentially time-blind
&lt;/h2&gt;

&lt;p&gt;To understand why this was hard, consider what a video actually is to a computer: a stack of still photographs, each called a frame, shown in rapid succession. Play them at the right rate and the human eye stitches them into the illusion of motion. The computer, however, sees no illusion — it sees a pile of images. It has traditionally been trained to ask questions like "is there a cat in this image?" or "is this person smiling?" Questions rooted in single moments, not in the flow between them.&lt;/p&gt;

&lt;p&gt;Teaching a computer to &lt;em&gt;reason about time&lt;/em&gt; is a bit like trying to teach someone what music sounds like using only photographs of a piano. You can label the keys. You can describe the hammers and strings. But you cannot convey the difference between a lullaby and a military march without letting the person actually hear the notes unfold in sequence.&lt;/p&gt;

&lt;p&gt;Previous AI systems tried to reason about video by treating time as just another spatial dimension — as if duration were simply height or width, a container to be measured rather than an experience to be understood. They could detect that motion was occurring but had almost no capacity to ask whether that motion was fast, slow, natural, or artificially manipulated. Detecting speed changes — knowing that a video has been accelerated or decelerated — was essentially left to human editors or crude, rules-based algorithms.&lt;/p&gt;

&lt;p&gt;This mattered more than it might sound: slow-motion footage, captured by expensive high-speed cameras that record hundreds or thousands of frames per second, contains dramatically richer visual information than ordinary video. A standard smartphone camera captures 30 frames per second, roughly the rate at which the eye fuses still images into smooth motion. A high-speed camera captures 1,000 or more, freezing events that happen faster than a blink — the ripple in a raindrop hitting a puddle, the flex of a sprinter's tendon, the exact instant a soap bubble pops. That footage is a treasure trove for training AI systems to understand how things move. The problem was: finding it in the wild, separated from the vast ocean of normal-speed video, was brutally difficult to do at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbdzbcrijpu46euhbw5f.jpeg" alt="Overview of the full system pipeline" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The clever trick: listening to time, not just watching it
&lt;/h2&gt;

&lt;p&gt;Here is where the paper's first insight becomes elegant. The researchers realized that if a video has been slowed down from its original recording speed, it carries a signature — but not necessarily a visual one. It carries an &lt;em&gt;audio&lt;/em&gt; signature.&lt;/p&gt;

&lt;p&gt;Think of a vinyl record being played at the wrong speed. Play it too slow, and every voice deepens into a rumbling baritone; play it too fast, and singers sound like cartoon chipmunks. The pitch of sound is exquisitely sensitive to the rate at which it is reproduced. This is not a glitch — it is physics. Sound is vibration, and vibration has frequency. Change the speed of playback, change the frequency.&lt;/p&gt;

&lt;p&gt;Now imagine you are a researcher with access to millions of YouTube videos. Many of them have been slowed down for artistic or editorial effect: sports highlights, nature documentaries, recipe videos showing the pour of honey. When the original footage was shot at high speed and then played back at normal speed, the audio — if any was recorded — gets stretched and distorted. The pitch drops. The rhythm slows. The spectrogram, which is a kind of visual map of sound that shows which frequencies are present at each moment in time, changes shape in characteristic ways.&lt;/p&gt;

&lt;p&gt;The researchers used this cross-modal clue — the relationship between what you see and what you hear — as free supervision. This is the key move. "Free supervision" in machine learning parlance means finding a signal that teaches the model without anyone having to sit down and manually label thousands of examples. The audio track is already there. It already contains information about speed. The model simply has to be taught to read it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1bc0gpd2nd01sdq3v1x.jpeg" alt="Audio pitch and spectrograms as cross-modal speed signals" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This is called self-supervised learning, and the analogy I find most clarifying is that of a child learning to read a clock by watching how long things take in the real world. No one sits the child down and says "the minute hand advances one tick every sixty seconds." Instead, the child notices that when the minute hand moves from the 12 to the 3, the cartoon show they were watching is now a quarter of the way done. They learn the relationship between the visual symbol (the clock) and the experienced duration (the show) without explicit instruction. The researchers' model did something structurally similar: it learned the relationship between visual motion patterns and audio pitch patterns by watching an enormous quantity of video — and those patterns are the clock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the world's largest slow-motion library
&lt;/h2&gt;

&lt;p&gt;Once the model could reliably detect whether a video clip was slow-motion — and estimate by roughly how much — the researchers had a tool for sorting the internet. They applied this temporal reasoning system to vast repositories of online video and extracted, automatically, the clips that qualified as genuine slow-motion footage: material shot on high-speed cameras, containing that densely-packed temporal information.&lt;/p&gt;

&lt;p&gt;The result was the largest slow-motion video dataset ever assembled from real-world sources. Think of it like this: imagine you are a sommelier trying to build a cellar of aged wines, but the world's wine supply is a single enormous warehouse where vintage bottles are scattered randomly among bottles of table wine, with no labels on any of them. You develop a palate — a way of tasting the wine in the bottle, figuratively, before you open it — that lets you sort through thousands of bottles quickly and pull out only the ones that have been aged for decades. The researchers built the equivalent palate for video. The resulting cellar is immense and high-quality in a way no previous collection had managed to be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikj9dh98enxrtdohwczy.jpeg" alt="Dataset scale and diversity comparison" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Teaching machines to generate time, not just observe it
&lt;/h2&gt;

&lt;p&gt;With this rich dataset in hand, the team's ambitions grew larger. They used the slow-motion footage to train two new capabilities that move from &lt;em&gt;perceiving&lt;/em&gt; time to &lt;em&gt;creating&lt;/em&gt; it.&lt;/p&gt;

&lt;p&gt;The first is speed-conditioned video generation. Ordinary AI video generators — the kind making headlines by conjuring photorealistic clips of things that never happened — produce motion at a fixed, implicit pace. Ask them to show a runner, and they'll show a runner at whatever speed they happen to have absorbed from their training data. Ask them to show the same runner at half speed, or double speed, or ten times normal speed, and they have no reliable way to comply. They are like an orchestra that can play a piece of music but cannot change tempo on request.&lt;/p&gt;

&lt;p&gt;The researchers' new model is tempo-aware. You tell it not just &lt;em&gt;what&lt;/em&gt; to generate, but &lt;em&gt;how fast&lt;/em&gt; time should move within the generated clip. The model has internalized the relationship between speed and the visual texture of motion — the way fast motion blurs certain edges differently, the way slow-motion footage reveals microexpressions on a face or the spray pattern of water — and can modulate those textures deliberately.&lt;/p&gt;
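
&lt;p&gt;The simplest way to picture "tempo-aware" is as one extra conditioning input alongside the usual prompt. The snippet below is purely hypothetical (this is not the paper's interface, and the names are invented); it only sketches what asking for a temporal rate might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def generate_video(prompt, speed_factor=1.0, num_frames=48):
    """Hypothetical tempo-aware generator interface (illustration only).

    prompt:       what to generate, e.g. "a runner crossing the finish line".
    speed_factor: how fast time should move inside the clip;
                  0.5 means half speed, 2.0 means double speed.
    A real model would condition blur, motion magnitude, and pacing on this value.
    """
    raise NotImplementedError("purely illustrative, not a real model")

# The same scene, requested at three different temporal rates:
# generate_video("a runner crossing the finish line", speed_factor=0.25)  # dreamy slow motion
# generate_video("a runner crossing the finish line", speed_factor=1.0)   # real time
# generate_video("a runner crossing the finish line", speed_factor=3.0)   # hurried, time-lapse feel&lt;/code&gt;&lt;/pre&gt;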

&lt;p&gt;The second capability is temporal super-resolution. This is perhaps the most technically remarkable of the paper's contributions, and it deserves a careful analogy.&lt;/p&gt;

&lt;p&gt;You may have encountered image super-resolution: AI systems that take a blurry, low-resolution photograph and sharpen it into something that looks higher-definition. The trick is that the AI has learned, from millions of high-resolution images, what things &lt;em&gt;tend to look like&lt;/em&gt; up close — what the texture of skin or stone or fabric looks like at a fine grain — and it uses that learned knowledge to make a plausible guess about what the blurry image would look like if it had been captured with a better camera.&lt;/p&gt;

&lt;p&gt;Temporal super-resolution does the same thing but for time. Take a video recorded at 30 frames per second — one image every 33 milliseconds. Between each frame, something happened. A hand moved. Water splashed. The AI's job is to hallucinate the missing frames: to invent what the world looked like during those in-between moments, with enough physical and visual plausibility that the result, played back at a higher frame rate, looks genuinely smooth rather than artificially interpolated.&lt;/p&gt;

&lt;p&gt;The researchers' model, trained on high-speed footage that actually contains those in-between moments, has learned what realistic temporal detail looks like. It can take a blurry 30fps video and produce a plausible 240fps version — one that &lt;em&gt;feels&lt;/em&gt; like slow-motion footage rather than a cheap software trick. Like a jazz musician who has heard so many performances that they can improvise a bridge between two musical phrases as if it had always been there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x7rv5hwyopt2lxrfmbg.jpeg" alt="Before and after temporal super-resolution examples" width="512" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What becomes possible now
&lt;/h2&gt;

&lt;p&gt;Pause for a moment and let the practical implications land.&lt;/p&gt;

&lt;p&gt;A documentary filmmaker shooting in the field with a standard camera witnesses an unexpected, fast-moving event: a bird strike, a lightning bolt, an athlete's peak moment. Their footage looks normal-speed and slightly blurry. With temporal super-resolution, that footage could be reconstructed into something that plays like slow motion — revealing detail the camera was technically too slow to capture properly. The moment can be recovered.&lt;/p&gt;

&lt;p&gt;A sports coaching team studying a sprinter's form, a surgeon reviewing a procedure, a physicist analyzing a droplet experiment — each of these fields depends on seeing events that happen faster than standard cameras allow. High-speed cameras are expensive, bulky, and require significant setup. A post-production tool that can enhance ordinary footage offers a genuine democratization of slow-motion analysis.&lt;/p&gt;

&lt;p&gt;Then there is the darker application the researchers themselves name: temporal forensics. If AI can now learn to detect speed manipulations in video, it becomes possible — in principle — to apply that same detection to videos circulating in the world and flag ones that have been artificially sped up or slowed down to distort the perception of events. A protest that looks chaotic at real speed but appears deliberately violent at slowed-down playback; a speech where a moment of hesitation is stretched to suggest confusion. The same technology that generates manipulated time can be used to detect it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ktr1g5hwinwujkv2i8z.jpeg" alt="Speed-conditioned generation results at different temporal scales" width="800" height="320"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I remain skeptical
&lt;/h2&gt;

&lt;p&gt;The paper is carefully scoped, and the researchers are honest about what they haven't yet achieved. Speed-conditioned generation can be told &lt;em&gt;how fast&lt;/em&gt; to produce motion, but it is not yet producing footage that looks indistinguishable from genuine high-speed camera output at all speeds and subjects. Temporal super-resolution is making educated guesses about what happened between frames — and educated guesses can be wrong in ways that are hard to detect and potentially consequential when the footage is being used as evidence or scientific data.&lt;/p&gt;

&lt;p&gt;There is also a deeper philosophical concern lurking in any system that learns to interpolate missing moments from video: at what point does "filling in" become "inventing"? The model is doing something genuinely extraordinary — constructing visual reality that was never captured — and the line between plausible enhancement and subtle fabrication is not always clear, even to experts. As these tools become more powerful and more accessible, the question of provenance in video — where did this footage actually come from, and what has been done to it? — becomes more pressing, not less.&lt;/p&gt;

&lt;p&gt;There is also a mundane practical limit: the whole system depends on finding audio-visual correlations in online video, which means it works best when footage comes with audio and when that audio hasn't been separately edited. Mute video, dubbed video, or footage where the audio was replaced entirely breaks the cross-modal supervision scheme. The audio is a free teacher, but only when it's telling the truth about the original recording.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time as a dimension machines can finally learn
&lt;/h2&gt;

&lt;p&gt;What this paper ultimately demonstrates is that time — the most fundamental dimension of video — is something AI systems can be taught to understand, not just measure. For decades, computer vision research treated a video as a series of spatial problems, questions about &lt;em&gt;where&lt;/em&gt; things were, &lt;em&gt;what&lt;/em&gt; they looked like, whether they were cats or cars or clouds. The temporal dimension was largely incidental, a backdrop for spatial reasoning rather than a subject of investigation in its own right.&lt;/p&gt;

&lt;p&gt;This work reframes time as a perceptual object in itself — something with texture, with speed, with the capacity to reveal or conceal information depending on how it flows. Teaching a machine to sense whether time is running fast or slow, and then to manipulate that flow deliberately, is a small step toward a kind of machine perception that feels, for the first time, genuinely temporal.&lt;/p&gt;

&lt;p&gt;We are used to machines that can see. We are beginning to build machines that can feel — if not time's passage exactly — at least the difference between a world unfolding at natural pace and one that has been stretched or compressed to tell a different story. Whether that capability will be used mostly to reveal the truth of moments we were too slow to see, or to construct moments that never happened at all, will depend on choices that no research paper can make for us.&lt;/p&gt;

&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.21931v1" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.21931v1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: computervision, videogeneration, deeplearning, temporalai&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/nrghg29y" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/nrghg29y&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>computervision</category>
      <category>videogeneration</category>
      <category>deeplearning</category>
      <category>temporalai</category>
    </item>
    <item>
      <title>Seeing Fast and Slow: Learning the Flow of Time in Videos</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sat, 25 Apr 2026 00:53:39 +0000</pubDate>
      <link>https://dev.to/xoqhdgh1002/seeing-fast-and-slow-learning-the-flow-of-time-in-videos-ec9</link>
      <guid>https://dev.to/xoqhdgh1002/seeing-fast-and-slow-learning-the-flow-of-time-in-videos-ec9</guid>
      <description>&lt;h2&gt;
  
  
  The moment you've felt this before
&lt;/h2&gt;

&lt;p&gt;You pull out your phone at your nephew's birthday party and film him launching himself off the diving board. Later that evening, rewatching the clip, something looks wrong. The splash hits the water too fast, the crowd's laughter arrives before their faces change expression, the whole scene has the frantic, unmoored quality of a silent film projected at the wrong speed. You reach for the settings and realize the camera was set to some auto-enhance mode that silently dropped a third of your frames. The moment still happened — but time, the invisible scaffolding that makes a moment feel real, has been bent.&lt;/p&gt;

&lt;p&gt;A group of researchers at the University of Washington recently published a paper asking a deceptively simple question: can we teach a computer to feel what you felt in that moment? Can an AI learn to sense when time is flowing correctly — and, beyond that, learn to manipulate that flow the way a skilled film editor does? Their answer, developed across several interconnected ideas, reveals something surprising about what "seeing" actually means.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why machines were blind to this problem
&lt;/h2&gt;

&lt;p&gt;Before getting to what the researchers built, it helps to understand why time has been such a blind spot for artificial intelligence working with video.&lt;/p&gt;

&lt;p&gt;For years, the dominant approach to teaching computers about video was, essentially, to treat it as a very fast slideshow. A video is a sequence of still images — frames — displayed rapidly enough that the brain perceives motion. Twenty-four frames per second is the cinematic standard; your eye, stitching them together, creates the illusion of fluid movement. Early machine learning systems looked at these frames the way a student might cram for an exam by memorizing individual flashcards, without absorbing the rhythm and logic that connects them.&lt;/p&gt;

&lt;p&gt;Think of it this way: imagine you're trying to learn a language by studying photographs of text, rather than by listening to people speak. You might learn to recognize individual words, but you'd miss everything about pace, inflection, and the way meaning changes depending on how fast or slow something is delivered. A spoken joke, rushed, becomes incomprehensible. A pause before a punchline is not dead air — it's structural. Meaning lives in timing.&lt;/p&gt;

&lt;p&gt;Computer vision systems have become extraordinarily good at reading the "flashcards" — identifying objects, faces, scenes, even emotions. But they largely ignored the rhythm. This paper is, at its core, an attempt to teach machines to hear the music, not just read the notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The secret cheat sheet hidden inside every video
&lt;/h2&gt;

&lt;p&gt;The first clever move in this research is figuring out how to teach temporal reasoning without requiring a human to sit down and label thousands of videos by hand. Labeling is expensive, slow, and often the bottleneck in machine learning research. The team's solution was to exploit something already present in the video itself: the relationship between what you see and what you hear.&lt;/p&gt;

&lt;p&gt;This is called self-supervised learning, and the analogy that makes it click is this: imagine you're learning to detect whether a film has been sped up by watching movies with the sound on. You don't need a teacher to hand you a worksheet. You already know, intuitively, that a normal walking pace has a rhythm, that speech occupies a certain tempo, that the splash of water hitting a pool arrives at a predictable interval after someone jumps. When a video is played at double speed, these rhythms desynchronize from your internal sense of how things work. The sound becomes chipmunk-rapid; the motion feels like time-lapse. You notice the wrongness without anyone having to explain it to you.&lt;/p&gt;

&lt;p&gt;The researchers gave their model the equivalent of this intuitive training. By watching enormous quantities of ordinary video — the kind freely available on the internet — the system learned what "normal time" looks and sounds like by experiencing the natural statistical regularities in footage. It learned that if you see a door closing, the sound of the latch follows within a certain window. It learned that the arc of a thrown ball has a characteristic shape when gravity operates correctly, a shape that bends when time is tampered with. These patterns, accumulated across thousands of hours of footage, became the model's internal clock.&lt;/p&gt;
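&lt;p&gt;One common way to turn that intuition into a training signal is a speed-prediction pretext task: resample an ordinary clip by a randomly chosen factor and ask the model to recover the factor. The label comes for free from the resampling itself, which is what makes the setup self-supervised. This is a sketch of the general recipe, not necessarily the paper's exact procedure, and every name below is illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a self-supervised speed-prediction pretext task.
# The label is generated by the resampling step, so no human annotation is needed.

import random

SPEED_FACTORS = [0.25, 0.5, 1.0, 2.0, 4.0]   # candidate playback speeds

def resample(frames, factor):
    """Crudely simulate playback at `factor` x speed by striding frame indices."""
    count = max(1, int(len(frames) / factor))
    indices = [min(len(frames) - 1, int(i * factor)) for i in range(count)]
    return [frames[i] for i in indices]

def make_training_pair(frames):
    """Return (resampled clip, index of the true speed factor)."""
    label = random.randrange(len(SPEED_FACTORS))
    return resample(frames, SPEED_FACTORS[label]), label

if __name__ == "__main__":
    original = list(range(32))               # stand-in for 32 video frames
    clip, label = make_training_pair(original)
    print(f"true speed x{SPEED_FACTORS[label]}, resampled length {len(clip)}")
    # A network trained to predict `label` from the clip (and its audio track)
    # is forced to internalize what normal-speed motion and sound look like.
&lt;/code&gt;&lt;/pre&gt;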

&lt;p&gt;When the researchers then presented the system with an unfamiliar video and asked "is this sped up?" the model was doing something analogous to what a skilled forensic audio analyst does when analyzing a recording — listening for the subtle inconsistencies that betray manipulation. Not by following an explicit checklist, but by having internalized a deep sense of what naturalness feels like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Panning for slow-motion gold
&lt;/h2&gt;

&lt;p&gt;Once they had a model that could reason about temporal speed, the researchers turned it toward a logistically thorny problem: dataset collection.&lt;/p&gt;

&lt;p&gt;Slow-motion video is particularly valuable for training AI systems that need to understand fine-grained motion. When a hummingbird's wings are filmed at four hundred frames per second and played back at twenty-four, every microscopic beat becomes visible. High-speed cameras capture a world that our eyes simply cannot access in real time — the exact moment a water droplet impacts a surface and rebounds into a crown shape, the precise sequence of muscle contractions in an athlete's sprint. For a machine learning system trying to understand physical motion, this footage is extraordinarily rich.&lt;/p&gt;

&lt;p&gt;The problem is that collecting genuine slow-motion video is laborious. High-speed cameras are expensive. Researchers have previously had to rely on relatively small, carefully curated datasets.&lt;/p&gt;

&lt;p&gt;Here's where the team's speed-detection model became a kind of prospecting tool. The internet is flooded with video — YouTube alone hosts an incomprehensible volume of footage. But most of it is ordinary: shot at normal frame rates, compressed, noisy, unreliable. Buried within this noise, however, are genuine slow-motion clips: sports highlights, nature documentaries, product demos, wedding reels. The challenge is finding them.&lt;/p&gt;

&lt;p&gt;The researchers used their temporal reasoning model like a metal detector swept over a beach. Because the model had learned to identify when footage was genuinely captured at high frame rate (as opposed to artificially slowed down in post-production, which looks quite different), it could sift through enormous quantities of internet video and flag the authentic slow-motion material. The result was the largest curated slow-motion video dataset assembled to date — not built by filming anything themselves, but by building a smarter filter to strain signal from noise.&lt;/p&gt;
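&lt;p&gt;Conceptually, that curation pass is a scoring filter swept over a huge pool of candidate clips. The sketch below is illustrative rather than the paper's pipeline: the threshold, the names, and the stand-in scoring function are all hypothetical, but they capture the shape of the idea, which is that the trained temporal model rates each clip's likelihood of being genuine high-frame-rate footage and only confident hits are kept.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Conceptual sketch of the "metal detector" curation pass.
# Everything here (names, threshold, scores) is illustrative, not the paper's pipeline.

def looks_like_true_slow_motion(clip):
    """Placeholder for the trained model: a score near 1.0 means the clip was
    likely captured at high frame rate, not merely slowed down in editing."""
    return clip.get("score", 0.0)

def curate(candidate_clips, threshold=0.9):
    """Keep only clips the model is confident are authentic high-speed footage."""
    kept = []
    for clip in candidate_clips:
        if looks_like_true_slow_motion(clip) &gt;= threshold:
            kept.append(clip)
    return kept

if __name__ == "__main__":
    pool = [
        {"id": "sports_highlight", "score": 0.97},
        {"id": "vlog_normal_speed", "score": 0.05},
        {"id": "post_slowed_edit", "score": 0.40},   # slowed in software, not in camera
    ]
    print([clip["id"] for clip in curate(pool)])   # only the genuine slow-motion clip survives
&lt;/code&gt;&lt;/pre&gt;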

&lt;p&gt;This is a recurring pattern in modern AI research that deserves a moment's attention: instead of going out and collecting new data, you build a tool that can recognize the data you need within what already exists. It's the difference between a wildlife biologist who travels to the rainforest to document rare birds, and one who builds acoustic recognition software to scan archived recordings from around the world. The second approach scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teaching a machine to generate time itself
&lt;/h2&gt;

&lt;p&gt;The most ambitious section of the paper concerns video generation — specifically, speed-conditioned video generation. This is where the research moves from analysis to synthesis, from understanding time to creating it.&lt;/p&gt;

&lt;p&gt;Imagine asking a filmmaker to produce a scene of a waterfall — but to show you two versions: one shot normally, one shot in extreme slow motion. These aren't the same video played at different speeds. The slow-motion version has to contain more information: more frames, more detail within each frame, more nuance about how the water subdivides and spirals as it falls. A filmmaker achieves this by pointing a high-speed camera at the scene in the first place. The information is captured at the source.&lt;/p&gt;

&lt;p&gt;Now imagine asking an AI to generate a video of that waterfall at a specified speed, without any source footage to draw from. This is an entirely different challenge. The model has to not just render a plausible waterfall, but understand that "slow motion" implies a level of physical detail — the individual threads of water, the spray, the micro-turbulences — that simply wouldn't be present in standard-speed footage. Speed isn't just a dial on the output; it shapes the entire physics of what the content looks like.&lt;/p&gt;

&lt;p&gt;The researchers trained their generation model on that rich slow-motion dataset they'd curated, combined with standard footage. The model learned something subtle: that the relationship between apparent speed and visual detail is not incidental but structural. A puddle photographed at standard speed shows smooth surface tension. The same puddle at four hundred frames per second reveals it as a roiling, complex micro-ocean. These are not the same image slowed down — they're categorically different in the information they contain.&lt;/p&gt;

&lt;p&gt;Think of it like the difference between a watercolor sketch and a detailed oil painting. You can't produce one by scaling down the other. They operate at different levels of resolution. The generation model had to learn to "paint in oils" when asked for slow motion, and "sketch in watercolor" when asked for standard speed — not just to apply a blur filter, but to genuinely understand what richness looks like at each temporal scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Filling in what the camera missed
&lt;/h2&gt;

&lt;p&gt;The final major capability the paper develops is temporal super-resolution — a phrase that means, in plain terms, filling in missing time.&lt;/p&gt;

&lt;p&gt;Older videos, videos shot on low-end cameras, videos compressed for the web — these often run at lower frame rates. Fifteen frames per second. Sometimes even fewer. At this rate, motion begins to stutter visibly; the illusion of fluidity breaks down. Fast motion — a head turn, a bird taking flight, a hand gesture — appears to jump rather than flow.&lt;/p&gt;

&lt;p&gt;What the researchers built is a system that looks at such a video and asks: what probably happened between these frames? It's similar to what a music-restoration engineer does when handed a damaged recording: the original audio has gaps, but a skilled restorer can interpolate — reason from context about what sound is consistent with everything before and after the missing section — and reconstruct something plausible. The result isn't guessed at randomly; it's constrained by everything the system knows about how music, or in this case motion, behaves.&lt;/p&gt;

&lt;p&gt;The temporal super-resolution model, trained extensively on that slow-motion dataset, had seen many thousands of examples of exactly what happens in the space between frames when water flows, when hands move, when fabric ripples. It had internalized the grammar of motion. Applied to a choppy fifteen-frames-per-second clip, it could insert the missing frames not by stretching what existed but by genuinely reasoning about what must have occurred in the interval.&lt;/p&gt;
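&lt;p&gt;A quick back-of-the-envelope calculation (the numbers are illustrative) shows how much unobserved time that reasoning has to cover: even a modest upsampling target means most of the output frames were never captured at all.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Back-of-the-envelope: how much "missing time" must be invented per second.

def missing_frames_per_second(source_fps, target_fps):
    """Frames per second that must be synthesized rather than observed."""
    return target_fps - source_fps

def gap_between_frames_ms(fps):
    """How long the camera was effectively not looking, per original frame."""
    return 1000.0 / fps

for source, target in [(15, 60), (30, 240)]:
    print(f"{source} fps to {target} fps: invent {missing_frames_per_second(source, target)} "
          f"frames per second; each original gap hides {gap_between_frames_ms(source):.1f} ms of motion")
&lt;/code&gt;&lt;/pre&gt;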

&lt;p&gt;The difference in quality, apparently, is significant: blurry, stuttering footage becomes crisp and fluid. The improvement isn't cosmetic — it's informational. Real physical detail is being reconstructed, not merely averaged.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this opens up, concretely
&lt;/h2&gt;

&lt;p&gt;Consider a forensic investigator examining video evidence. One of the most common manipulations in doctored footage is temporal: clips are sped up to make crowds look panicked, slowed down to make officers look more aggressive, trimmed and reassembled to hide what happened between moments. A system that has internalized a deep sense of temporal naturalness can look at footage the way a handwriting expert examines a signature — not checking for obvious forgery, but sensing the subtle inconsistencies that betray unnatural constraint.&lt;/p&gt;

&lt;p&gt;Or consider a physical therapist working with a patient recovering from stroke, trying to evaluate the recovery of fine motor control. Video of the patient performing small hand movements, shot on a standard phone camera, contains limited information about the precision and timing of the gestures. A temporal super-resolution system could extract the detail that was there but invisible — the micro-hesitations, the slight asymmetries — making the analysis more precise without requiring expensive specialized equipment.&lt;/p&gt;

&lt;p&gt;Or, more simply: a video editor trying to create smooth slow-motion footage from a clip shot on a phone. Currently, artificial slow-motion applied to standard footage looks fake — the interpolation is mechanical. A system that genuinely understands temporal physics could produce slow-motion that looks as though it was shot with a high-speed camera, because it has learned what high-speed cameras actually see.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this paper doesn't yet answer
&lt;/h2&gt;

&lt;p&gt;This research is careful and technically serious, but it rests on certain assumptions worth examining. The model learns its sense of "natural time" from internet video, which is not a neutral sample of reality. It over-represents certain kinds of motion — sports, urban environments, human activity — and under-represents others. Whether it would perform with equal confidence on footage from a hospital, or an industrial facility, or an ecological survey is not addressed.&lt;/p&gt;

&lt;p&gt;There's also a deeper question lurking. The paper treats time as a "perceptual dimension" that can be learned and manipulated. But perception implies a perceiver. The system has learned statistical patterns in temporal data; it's an open question whether this constitutes anything like genuine temporal understanding, or whether it is a very sophisticated pattern-matcher that would fail in novel contexts outside its training distribution. The authors are honest about this — they describe their system as opening "doors to richer world-models" — but the door and the room beyond it are different things.&lt;/p&gt;

&lt;p&gt;Still, the approach is elegant. Using the natural multimodal structure of video as a self-teaching curriculum, building a temporal sense from the inside out rather than imposing it from without — this has the feel of a genuinely right direction. Time, as it turns out, was always in the video. It just took a different kind of looking to find it.&lt;/p&gt;




&lt;p&gt;📄 &lt;a href="https://arxiv.org/abs/2604.21931" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.21931&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tags: computervision, video, deeplearning, timelapse&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🇰🇷 Korean version on Velog: &lt;a href="https://velog.io/@tkdnel1002/6x0i2dwx" rel="noopener noreferrer"&gt;https://velog.io/@tkdnel1002/6x0i2dwx&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>videogeneration</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Test Post</title>
      <dc:creator>Bongho Tae</dc:creator>
      <pubDate>Sat, 25 Apr 2026 00:52:18 +0000</pubDate>
      <link>https://dev.to/xoqhdgh1002/test-post-1kp6</link>
      <guid>https://dev.to/xoqhdgh1002/test-post-1kp6</guid>
      <description>&lt;p&gt;Hello world test.&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
