Note: originally posted by my colleague on the Mux blog.
About two years ago, the Oculus VR division of Facebook created a project they called 'flicks'. Essentially, a flick is just a very small unit of time: 1/705,600,000 of a second (that is, there are exactly 705,600,000 flicks in one second). The project was picked up by news outlets like TechCrunch, The Verge, and the BBC, and seemed to cause some confusion and even some ridicule. To be fair, a news article about a number is a bit odd. If you’re not an engineer working with digital media, the idea behind this number is difficult to grasp. And to those who do work in digital media, the number seems to offer nothing new. It purports to solve a problem that nobody in the industry actually has. So where did it come from, and why does it exist? Let’s back up…
Time is a surprisingly difficult concept in digital media. For starters, we are dealing with time values that are very small and difficult to imagine. Recently I saw the movie Gemini Man at the AMC Metreon here in San Francisco. It was one of the few theaters capable of playing the 120 frames per second version; most films are 24 frames per second. At 120fps, every frame is projected for just over 0.00833 seconds before the next one flashes on screen, a very short period of time. But compared to digital audio, this is an eternity. Audio recorded at 44,100 Hz has a sample every 0.000022675 seconds. That is 367.5 audio samples for every video frame.
*Example audio/visual timeline*
These numbers with fractional components (a decimal point) are known as “floating point” numbers in computer science, and computers are surprisingly bad at dealing with them. Above, when I said every video frame is on the screen for 0.00833 seconds, that was not exactly true. When you divide 1 by 120, the result is 0.008333333333… with the 3 repeating forever. Storing a number that repeats forever would require an infinite amount of memory, so the computer approximates it. The difference between the approximation and the actual number produces tiny errors in the math. These small errors can add up over time, become big errors, and result in problems such as audio and video drifting out of sync. Using a unit of time like milliseconds or nanoseconds would help, but would only delay the problem, not solve it.
The ultimate solution is to not record time in seconds but instead as an integer count of fractional units. For example,
1000 x 1 ÷ 120 is the timestamp of the 1000th frame of a 120fps video. Converting to seconds still yields a floating point number, but as long as we count frames as integers the error does not accumulate over time. If you’re following the math closely, you may have noticed that while solving this problem we have created another one.
What frame should we render first? The video frame at
1000 x 1 ÷ 120, or the audio sample at
367500 x 1 ÷ 44100? We need to convert to a common time base to know for sure. We could convert each to seconds and then compare, but that brings us back to the floating point problem. By using the “least common multiple,” or LCM, of the two rates (LCM(120, 44100) = 88,200 in this case), we can convert both fractions to a common time base, at which point we can compare them directly.
88,200 ÷ 44100 x 367500 = 735,000, and
88,200 ÷ 120 x 1000 = 735,000. These timestamps are exactly the same, so the frame and the sample should be rendered together to ensure sync. At no point did we need floating point math, which might have given us a slightly different answer.
In the world of digital media, some time bases come up very frequently. As mentioned, film commonly uses 24 fps, European television (and other PAL countries) uses 25 fps, and, for obscure historical reasons, American television (and other NTSC countries) uses 29.97 fps. Wait! Floating point numbers again? Actually no, because it’s not really 29.97fps. It’s actually
30000 ÷ 1001 fps.
Here is where flicks come in. You see
705600000 ÷ 44100 = 16000 EXACTLY,
705600000 ÷ 120 = 5,880,000 EXACTLY, and even
1001 x 705600000 ÷ 30000 = 23,543,520 EXACTLY. This is why the flick is interesting: 705,600,000 is evenly divisible by nearly all of the commonly used time bases in digital media, making the flick a common unit that all of them can be converted into without loss.
We now know what the flick is, but why does it exist? We established that if we record every timestamp as three integers (a count, a numerator, and a denominator), we don’t need a common base, since we can convert between bases as needed. So why fix on one? There are two primary reasons. The first is efficiency. If we know we will need to compare a lot of timestamps in different time bases, or compare the same timestamp multiple times, converting them all to a common base up front can be faster for a computer. Once converted, comparing two integers is about the fastest operation a computer can do, whereas comparing two fractions requires an algorithm to find a common base, convert the values, and then compare. But the motivation cited on the flicks GitHub page is slightly different, and stems from a design decision in the C++ programming language.
Computers are pretty good at dealing with time, but humans are really bad at it. In most parts of the United States, once a year for daylight saving time, we have a 25-hour day, only to have a 23-hour day a few months later. Every 4 years we get an extra day at the end of February, unless the year is divisible by 100, except when the year is also divisible by 400, in which case it is a leap year after all (there was no leap day in the year 1900). We even have leap seconds, where an extra second is added to a year whenever we notice that the atomic clocks don’t quite agree with astronomical observations. Meanwhile, a computer’s clock needs to keep moving forward one second per second, otherwise bad things happen.
To help with this human nonsense and standardize how time is managed, C++11 added a new library called chrono to its standard library. Because the language designers were smart, the std::chrono::duration type includes support for the time-as-a-ratio technique we have established. Perfect! Well... not so fast. Because the language designers were unwilling to give up CPU cycles and slow down programs (C++’s defining feature is speed, after all), they decided that the time base must be known in advance, while the program is being written (at compile time). This allows for fast-running programs, because the fractions can be ignored when they are known to be equal, but it sacrifices automatic time base conversion at runtime. Herein lies the problem: when playing back a media file, we can’t know the time base in advance, because we haven’t seen the file yet. What we need is a single time base that can support any media we are likely to encounter. Enter flicks: an elegant solution to a problem only a handful of media engineers will ever encounter, which happened to be announced on a slow news day.
Flicks do not seem to be in wide use. I used them at Mux in one specific place in our transcoding pipeline, for their intended purpose, in a C++11 program where standardizing on std::chrono made things a bit easier. But searching GitHub and Google, I could only find a handful of places where flicks are used in the wild, and I really don’t expect that to change.