DEV Community


Mistakes and what I learnt building a video buffer system for deep learning data

Anthony Sarkis
Building things @
・8 min read

This is my first post here so please go easy on me! :)

It was loading so slowly it felt like we were dead in the water.

For context, this is for a system that creates training data for deep learning systems. Here's an example:

Alt Text

The green part is annotations create by the user or AI. The video is playable at various rates including real time. The annotations can be very numerous, ie each frame could have 10s or 100s, and there are a variety of relations to each.

Pre-history specification: preview many frames at once

When a user played a video, it played the raw file, but did not show any annotations (the green overlay) until the file was paused.
An early user rightly stated this was bad!

So! Low and behold I hacked together a little frame buffer that got all the annotations for the next x frames, cached it locally.

Because the general use case was this "interpolation" feature, the main value of the buffer was to do a quick check that the interpolation worked as expected, or stop it near the part that didn't.

I don't think it ever was super great, but it fit the bill, took a very minimal amount of engineering effort, and most importantly, the users were happy! :)

30x Frames

Our early users had fairly low frame rate video, and in general the objects were always present, either at the edge of the frame or in the video. 

There was little need to be overly accurate about frames, and when it was needed the slider could be used.
A new user was helping us push this to a new level, a few of the changes included:

  • 10x frames per second. Now could be up to 120 instead of ~10.
  • 3x Longer videos. In combination with FPS that meant each video could be >1500 frames.
  • Time Series focus, so now we need more precision on jumping back and forth between frames.

Alt Text
Reflecting on challenges

Why do frames matter?

Accuracy requirement  

The use case is a scientific application, the data that eventually gets exported is reported literally down the to the pixel. Therefore we are very concerned about accuracy of the data.

The first approach 'solved' this by just reloading the whole thing anytime there were changes - that way it was guaranteed to be up to date.

Unfortunately in the new context, this was very jarring, as it meant that a user may hit the loading lock many times during regular use. Basically it was borderline unusable.

Changing data

Part of the challenge is that this is trending towards being real time rendering, imagine something like (to be clear I am exaggerating) Adobe After effects but for data.

Still, it's challenging in this context, to paint one example:

A user could change the data in only frame 12, rerun interpolation, and now the data in frames 0 -> 50 have changed (assuming the sequence spans more frames.).

Also keep in mind that every video is unique - so caching here has little benefit once a video has been completed. We are write heavy vs usual cases that are read heavy.

Make it 10x faster when a video is empty

Even more maddening was that this slow loading happened even when there wasn't any significant data to load, ie a new video that had not yet been annotated (or was only lightly annotated) !! 

Why was this?

Because all of the data had the potential to change, it meant the call looked like:

for frame in slice_of_video:
   for annotation in frame:

This was asymptotically slow and also slow in reality since even getting annotations for a single frame wasn't a super fast thing.

Therefore even if we just fixed the buffer at say 10 frames, it doesn't really solve it. And in the context of the new requirement it would be basically unplayable.

Reduce the length of the outer loop to approach 0.

We already had a concept in the system of "how many changes for each (frame)". We added this to the original SQL query.

This mean that the length of slice_of_video loop was 0 if the video was empty. We were only getting annotations we had to, instead of making a ton of empty calls.

It also meant we could extend the slice (a portion of the video, ie frames 30 to 60), to be much larger, since it only slowed down when data was found.

This require a tiny amount of fiddling with array setup to get the data positioned right (ie insert None for frames that we didn't get data for), but was a big step in the right direction.

To make a subtle distinction clear here, this is per frame. If we just did it per video then the moment it had any annotations it would go back to loading slowly.

This means a video that has annotations at the start, will skip loading those as the user works on a middle portion of the video.

I'm no extolling that this is some perfect solution - but in the context of the rest of the existing system it was a relatively easy improvement.

Using an absolute reference point

Alt Text

At the time, we were updating lists based on a "relative" frame. ie the first element in the buffer was the current frame, the next element was current + 1 etc. ie:

   [ current frame data ],
   [ +1 ],
   [ +2 ],
] etc.

In that context it seemed reasonable to send a matrix of lists as the buffer.

Then we increased the FPS to 60, and allowed more control on jumping between frames. Now we have a problem, While the buffer loading had gone from terrible to reasonable, it really didn't make sense to be reloading it.

The fastest load time - a cache hit in the front end store

There's some joke somewhere that the answer to any CS problem is to use a dictionary... and this this case it was True!

Now we send the the buffer as a key value store. This has many advantages:

Instant frame changes (including going backwards)

The various parts of the code that allow the user to jump to any frame, now simply check if the frame exists in the buffer (constant time).

  • If it exists, it uses it, else it refreshes the buffer.

  • The buffer can include frames both forward and backward in time.

  • Invalidating the cache, (ie for switching files), is as simple as setting it equal to a blank dictionary, since a key not existing is reason to refresh it.

This wasn't possible with the array because it was relative, so it was assumed to exist and be correct.

Now, most of the video can be edited with the fastest possible call to the server: None at all!

Decoupling when a server side refresh is needed

Now that we were defaulting to updating the buffer locally first, the question came up, as to when and how we should do the server side update.

I'm talking about the stuff indirect to local actions. The existing checks handled initial loading, empty buffers etc. But what if something changed server side?

I realized that all of the server side data side changes were still triggered by a user concept. (ie clicking Interpolation button.) So I decoupled the server side refresh, so that concepts that needed it could still call it, but otherwise it was assumed the local version was up to date.

(re)Learning lessons

I'm not saying any of these lessons are new, but hopefully the context of a specific situation is helpful to it.

It's fine to leave optimization to later.

If we had tried to optimize this from the get go I doubt we would have had as a good a result because:

  • There are about 10 areas of the system that were built to this similar "basic" level of function. 9 did not need any optimization.
  • Things like the "count_changes" attributes that were critical to the time savings was only added later. (and may have been harder to justify building only to support this)
  • The requirements changed (10x FPS, Adding "go to" controls). If this had been more optimal in the old context it still may not have carried over to the new context.

Default to thinking about caches (buffers) in key value stores.

This was also a case of (badly) pre-optimizing. I falsely assumed that because a video plays linearly that accessing a sequential array would make more sense, except that was solving the wrong problem.

It wasn't a bottleneck to check and load new instances at each frame, but it WAS a bottle neck to have to reload the whole buffer every time we moved frames. (or alternatively some mechanism to determine relative position sounded like a recipe for disaster.)

Explicit is generally better then implicit

Ie it's better to declare that frame 10 has xyz. vs "relative to current position" the 0th element is xyz.

Perhaps this was just a mistake, but I had been viewing using a relative frame as being better "information hiding". Except the "Information hiding" concept works better if it's operating at the right abstraction.

In this case, the abstraction of which frame it's globally on was more relevant. While in a sense that required "more" knowledge, it meant everything else that it interacted with was simpler.

In a perfect world with perfect isolation perhaps this wouldn't be needed, but the reality is we there's always assumptions backed into, and so it's better to declare those assumptions.

Do look for global optimizations over local algorithmic ones. 

I was a bit slow to see some of these changes because when I first started working on it was stuck in the mental model of needing to look at each frame, and needing to do a server side refresh.

When I was able to step back and think about what actually had to be loaded it made a big difference. It's worth noting that the local algorithm didn't actually change, what changed was the assumptions made around it (ie bypassing which frames needed to be looked at, calling the server less often, etc).

I think part of why I found this interesting is that it's one of the areas where general CS algorithms knowledge was actually useful. It wasn't a novel approach on some singularly difficult problem, nor was it a purely naive implementation. But somewhere in the middle.

Alt Text

Importance of unified front and back end design

I think it also shows how important the relationship between the front end and back end of a system are. In this case I was working on both so I could "yell at myself" so to speak, to fix one of the sides. I think if this had been some type of generic specification between different teams it would have been harder to get resolution. Because there were valid trade off concerns on each side that were fairly directly opposed.

Side note, beware the if 0:

In the early stages of reviewing this I noticed it was 4x slower at the start of a video. Why?

python treats 0 as False. This python code:

# False if start is 0.
if start:
  # do something

So instead of respecting the starting point, when the start was 0, the condition would fail to fire, and it would attempt to get the entire buffer for the entire video (instead of a slice as designed). Doh!

This was a bug rather then a design change so I just included it at the bottom here. Although IMO python, especially with type annotations, should detect this:

>>> x: int = 0
>>> type(x)
<class 'int'>
>>> x
# x evals to an Int?
>>> if x: print(1)
# x evals to False, but I think it should be True!

I'm sure there's some long story about 0 being a Falsy value. But it still just seems silly to me. Especially when canonical python is if x: being preferred over if x is not None:.

Thanks for reading, hope it helps!

Building deep learning vision systems? Check out

Discussion (0)

Forem Open with the Forem app