<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aleksandr Gushchin</title>
    <description>The latest articles on DEV Community by Aleksandr Gushchin (@aleksandrgushchin).</description>
    <link>https://dev.to/aleksandrgushchin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F634382%2F0a9d12a8-fde6-40af-a948-42db092cf241.png</url>
      <title>DEV Community: Aleksandr Gushchin</title>
      <link>https://dev.to/aleksandrgushchin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aleksandrgushchin"/>
    <language>en</language>
    <item>
      <title>GSOC-2021 Work Product Submission, Xiph.Org Foundation</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 20 Aug 2021 21:43:57 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/gsoc-2021-work-product-submission-xiph-org-foundation-3o9e</link>
      <guid>https://dev.to/aleksandrgushchin/gsoc-2021-work-product-submission-xiph-org-foundation-3o9e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Student:&lt;/strong&gt; Aleksandr Gushchin&lt;br&gt;
&lt;strong&gt;Github Handle:&lt;/strong&gt; &lt;a class="mentioned-user" href="https://dev.to/aleksandrgushchin"&gt;@aleksandrgushchin&lt;/a&gt;
&lt;br&gt;
&lt;strong&gt;Project:&lt;/strong&gt; Improve fast scene detection modes proposal&lt;br&gt;
&lt;strong&gt;Mentor:&lt;/strong&gt; Luca Barbato&lt;br&gt;
&lt;strong&gt;Organisation:&lt;/strong&gt; Xiph.Org Foundation&lt;/p&gt;

&lt;h3&gt;
  
  
  Goals
&lt;/h3&gt;

&lt;p&gt;This summer, I contributed to the Xiph.Org Foundation. The main aim of this project was to improve the scene change detection algorithm, which determines where to split video sequences for optimal encoding efficiency. The currently implemented fast scene detection method is not optimal and sometimes gives false results. This is also detrimental to per-scene visual metric quality targeting. &lt;/p&gt;

&lt;h3&gt;
  
  
  Change Log
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A dataset has been assembled to test the algorithm&lt;/li&gt;
&lt;li&gt;Metric value peaks have been made more distinctive for the algorithm to detect, resulting in better accuracy&lt;/li&gt;
&lt;li&gt;Thresholds have been adjusted for both versions of the algorithm&lt;/li&gt;
&lt;li&gt;An adaptive threshold has been implemented for the slow version&lt;/li&gt;
&lt;li&gt;A more accurate version of the algorithm has been implemented&lt;/li&gt;
&lt;li&gt;Downsampling has been added for this new version&lt;/li&gt;
&lt;li&gt;Detailed descriptions have been added for all three versions&lt;/li&gt;
&lt;li&gt;A CLI option for the scene detection speed mode has been added&lt;/li&gt;
&lt;li&gt;Unit tests have been updated for the new version&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/rust-av/av-scenechange"&gt;av-scenechange&lt;/a&gt; has been updated according to the new version of rav1e

&lt;ul&gt;
&lt;li&gt;A CLI option for the scene detection speed mode has been added&lt;/li&gt;
&lt;li&gt;A CLI option for the file to write results to has been added&lt;/li&gt;
&lt;li&gt;Speed measurement has been added &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Brief summary of the new versions of the algorithm
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;F score on BBC Planet Earth&lt;/th&gt;
&lt;th&gt;F score on open source videos&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;New&lt;/strong&gt; fast version&lt;/td&gt;
&lt;td&gt;0.7441&lt;/td&gt;
&lt;td&gt;0.6652&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Old fast version&lt;/td&gt;
&lt;td&gt;0.6543&lt;/td&gt;
&lt;td&gt;0.5951&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;New&lt;/strong&gt; medium version&lt;/td&gt;
&lt;td&gt;0.7802&lt;/td&gt;
&lt;td&gt;0.7032&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;New&lt;/strong&gt; slow version&lt;/td&gt;
&lt;td&gt;0.9217&lt;/td&gt;
&lt;td&gt;0.7504&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Old slow version&lt;/td&gt;
&lt;td&gt;0.7024&lt;/td&gt;
&lt;td&gt;0.5628&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Development process
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;To test the algorithm fairly, I needed a large, representative dataset. I found the &lt;a href="https://aimagelab.ing.unimore.it/imagelab/researchActivity.asp?idActivity=19"&gt;BBC Planet Earth dataset&lt;/a&gt;, but I still needed sequences with higher resolutions and different themes (all of the BBC videos were documentaries with a 388x280 resolution). I downloaded and manually marked up 20 videos from Vimeo. A more detailed description of the final dataset can be found &lt;a href="https://dev.to/aleksandrgushchin/dataset-for-scene-change-detection-4bf1"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;After collecting the data, I calculated the results of the current solution. They can be found &lt;a href="https://dev.to/aleksandrgushchin/results-of-current-algorithm-to-be-updated-o2f"&gt;here&lt;/a&gt; (a minimal sketch of such an evaluation is shown after this list).&lt;/li&gt;
&lt;li&gt;Detailed analysis of the current algorithm: I made charts and visualizations for different algorithm options and thresholds. They can be found &lt;a href="https://dev.to/aleksandrgushchin/july-12-june-19-weekly-status-1l97"&gt;here&lt;/a&gt; and &lt;a href="https://dev.to/aleksandrgushchin/juky-19-july-26-weekly-status-new-scene-change-detector-of-the-rav1e-analysis-348"&gt;here&lt;/a&gt;. I drew several conclusions on how to improve the current solution. &lt;/li&gt;
&lt;li&gt;Improving the current solution by adjusting thresholds and updating metric values. Detailed descriptions are &lt;a href="https://dev.to/aleksandrgushchin/threshold-experiments-for-scene-change-detector-2hbp"&gt;here&lt;/a&gt; and &lt;a href="https://dev.to/aleksandrgushchin/july-26-august-02-weekly-status-49gb"&gt;here&lt;/a&gt;. I made a &lt;a href="https://github.com/xiph/rav1e/pull/2765"&gt;pull request&lt;/a&gt; with these changes.&lt;/li&gt;
&lt;li&gt;New metric development: I experimented with motion vectors and color histograms to build a new dissimilarity metric on top of them. For the histogram-based metric I also experimented with distance functions. I tried to implement the edge change ratio but abandoned it because it turned out to be too slow. I focused on the histogram-based metric since it was the most accurate, and experimented with a block-based approach, with combining it with previous versions of the algorithm, and with shifting blocks by motion vectors. Results can be found &lt;a href="https://dev.to/aleksandrgushchin/august-02-august-03-weekly-status-565a"&gt;here&lt;/a&gt; and &lt;a href="https://dev.to/aleksandrgushchin/august-09-august-16-weekly-status-1k9k"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;After the third version was ready, I added it to the repo, provided a CLI option for users to manually choose a version, and updated the unit tests.&lt;/li&gt;
&lt;li&gt;A detailed description of the final result can be read &lt;a href="https://dev.to/aleksandrgushchin/new-scene-change-detector-version-4ja7"&gt;here&lt;/a&gt;, alongside unsuccessful ideas and possible improvements.&lt;/li&gt;
&lt;/ul&gt;
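
&lt;p&gt;For reference, the evaluation can be scripted along these lines. This is only a sketch of the idea (not my exact code): predicted and ground-truth scene change frames are matched with a small tolerance, and precision, recall and the F score are computed from the matches.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: score a scene change detector against manual mark-up.
# `predicted` and `truth` are sorted lists of frame numbers; `tolerance`
# is how many frames a prediction may be off and still count as a hit.
def score(predicted, truth, tolerance=2):
    unmatched = list(truth)
    tp = 0
    for frame in predicted:
        hit = next((t for t in unmatched if abs(t - frame) &amp;lt;= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(predicted) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f_score = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f_score

print(score([10, 55, 120, 300], [10, 57, 300, 480]))  # (0.75, 0.75, 0.75)
&lt;/code&gt;&lt;/pre&gt;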

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;Pull requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/xiph/rav1e/pull/2765"&gt;#2765&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rust-av/av-scenechange/pull/162"&gt;#162&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Blog posts
&lt;/h3&gt;

&lt;p&gt;All posts can be found &lt;a href="https://dev.to/aleksandrgushchin"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Acknowledgement
&lt;/h3&gt;

&lt;p&gt;I'd like to thank my mentor Luca Barbato for always monitoring my progress and for immediately responding and guiding me whenever I needed help, as well as the whole Xiph team! &lt;/p&gt;

</description>
    </item>
    <item>
      <title>New Scene Change Detector version </title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Wed, 18 Aug 2021 07:59:14 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/new-scene-change-detector-version-4ja7</link>
      <guid>https://dev.to/aleksandrgushchin/new-scene-change-detector-version-4ja7</guid>
      <description>&lt;p&gt;There are three versions of the algorithm, selected based on the speed setting of rav1e. A detailed description of each version is given below, and a small sketch of the speed-to-version mapping follows the list. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast version - pixel-based version with improved threshold. 

&lt;ul&gt;
&lt;li&gt;Corresponds to speed level 10 of rav1e&lt;/li&gt;
&lt;li&gt;Performs downsampling&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Medium version - based on motion vectors with an improved threshold. 

&lt;ul&gt;
&lt;li&gt;Corresponds to speed levels 7-9 of rav1e&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Slow version - histogram metric with block-based approach. 

&lt;ul&gt;
&lt;li&gt;Corresponds to speed levels 0-6 of rav1e&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
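
&lt;p&gt;Purely as an illustration of this mapping (not rav1e's actual code), choosing a detection version from the encoder speed setting could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch of the speed-to-version mapping described above.
def scene_detection_version(speed):
    if speed == 10:
        return "fast"    # pixel-based metric, with downsampling
    if speed &amp;gt;= 7:
        return "medium"  # motion-vector based metric
    return "slow"        # block-based histogram metric

print([scene_detection_version(s) for s in (0, 6, 7, 9, 10)])
# ['slow', 'slow', 'medium', 'medium', 'fast']
&lt;/code&gt;&lt;/pre&gt;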

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;F score on BBC Planet Earth&lt;/th&gt;
&lt;th&gt;F score on open source videos&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New fast version&lt;/td&gt;
&lt;td&gt;0.7441&lt;/td&gt;
&lt;td&gt;0.6652&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Old fast version&lt;/td&gt;
&lt;td&gt;0.6543&lt;/td&gt;
&lt;td&gt;0.5951&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New medium version&lt;/td&gt;
&lt;td&gt;0.7802&lt;/td&gt;
&lt;td&gt;0.7032&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New slow version&lt;/td&gt;
&lt;td&gt;0.9217&lt;/td&gt;
&lt;td&gt;0.7504&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Old slow version&lt;/td&gt;
&lt;td&gt;0.7024&lt;/td&gt;
&lt;td&gt;0.5628&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So the F score of the fast version improved by 0.0898 on BBC and by 0.0701 on open source videos.&lt;br&gt;
The F score of the slow version improved by 0.2193 on BBC and by 0.1876 on open source videos.&lt;/p&gt;

&lt;h3&gt;
  
  
  Description of each version
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast version&lt;/strong&gt; is a simple calculation of the pixel-wise difference: for each pair of corresponding pixels the difference of values is taken, and the final dissimilarity metric is the average over all pixels. I improved the old version by adjusting the threshold and by modifying the metric itself with a numerical derivative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium version&lt;/strong&gt; is an improved version of the old slow version with an adaptive threshold. To build the dissimilarity metric, motion vectors between two consecutive frames are computed; frames are divided into blocks and each block of the second frame is shifted by its motion vector. The dissimilarity metric is the average difference over all blocks. I improved the old version by adjusting the threshold and by modifying the metric itself with a numerical derivative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow version&lt;/strong&gt; is a block-based histogram metric. Frames are divided into non-overlapping blocks and a histogram of pixel values is computed for each block; the mean value of each block's histogram is compared with that of the corresponding block in the previous frame. The dissimilarity metric is the average difference over all blocks. A minimal sketch of this idea is shown after this list.&lt;/li&gt;
&lt;/ul&gt;
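
&lt;p&gt;Here is a minimal sketch of the block-based histogram idea behind the slow version, assuming 8-bit luma planes stored as NumPy arrays. It only illustrates the metric; the actual implementation lives in rav1e.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

# Sketch (illustrative): block-based histogram dissimilarity metric.
# Each luma plane is split into non-overlapping blocks; for every block a
# 256-bin histogram is built, its mean value is compared with the mean of
# the matching block in the previous frame, and the per-block differences
# are averaged into a single dissimilarity value.
def hist_mean(block):
    counts, edges = np.histogram(block, bins=256, range=(0, 256))
    return float((counts * edges[:-1]).sum() / counts.sum())

def block_histogram_dissimilarity(prev_plane, cur_plane, block=32):
    height, width = cur_plane.shape
    diffs = []
    for y in range(0, height - block + 1, block):
        for x in range(0, width - block + 1, block):
            prev_block = prev_plane[y:y + block, x:x + block]
            cur_block = cur_plane[y:y + block, x:x + block]
            diffs.append(abs(hist_mean(prev_block) - hist_mean(cur_block)))
    return float(np.mean(diffs))

prev = np.random.randint(0, 256, (288, 360), dtype=np.uint8)
cur = np.random.randint(0, 256, (288, 360), dtype=np.uint8)
print(block_histogram_dissimilarity(prev, cur))
&lt;/code&gt;&lt;/pre&gt;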

&lt;h3&gt;
  
  
  Results and examples of each version
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Slow version
&lt;/h4&gt;

&lt;p&gt;The slow version is marked in the legend as &lt;em&gt;"with blocks"&lt;/em&gt;. &lt;em&gt;"Without blocks"&lt;/em&gt; is a similar metric but without dividing the frames into blocks.&lt;br&gt;
Results on BBC dataset:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qdIdJOXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7bc3kldc06bc7f4garh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qdIdJOXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7bc3kldc06bc7f4garh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u8q9IxrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esey2w2bo3y1brmlf5ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u8q9IxrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esey2w2bo3y1brmlf5ip.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Results on open-source videos:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nuWN50Kk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9en1owgh6oyono62926.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nuWN50Kk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9en1owgh6oyono62926.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tgRQQ699--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vz2s2usqwwnm8l5kb354.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tgRQQ699--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vz2s2usqwwnm8l5kb354.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Medium version
&lt;/h4&gt;

&lt;p&gt;Here you can see charts of the algorithm's performance (F score, precision and recall) depending on the threshold. These results were obtained on open-source videos from youtube.com and vimeo.com:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_rAJMCQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ttodfkd2sku61basvx57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_rAJMCQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ttodfkd2sku61basvx57.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jMyRq3lp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/747caqwhnz1q2em3yt2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jMyRq3lp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/747caqwhnz1q2em3yt2t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xqIF8fJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma8vfvjv6z2n8q4efgb3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xqIF8fJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma8vfvjv6z2n8q4efgb3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are also the results on the BBC Planet Earth dataset:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--apQvJbWQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rva4r30vpwrbpl0temb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--apQvJbWQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rva4r30vpwrbpl0temb4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5xW58o8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilmo9bhgq93ce94qh9rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5xW58o8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilmo9bhgq93ce94qh9rd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5JtkzDsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7keabmbbpvj32i5t5mek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5JtkzDsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7keabmbbpvj32i5t5mek.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
And precision-recall curve:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fMKVs_uG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emvqhgte8av30aywh7ok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fMKVs_uG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emvqhgte8av30aywh7ok.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Fast version
&lt;/h4&gt;

&lt;p&gt;Here you can see experiments with the threshold for the fast version. The bold line represents the old fast version; the bottom line is the old slow version of the algorithm:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YoU0igok--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/or7tm3f2g8ill22y3auq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YoU0igok--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/or7tm3f2g8ill22y3auq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detailed analysis can be seen here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aleksandrgushchin/threshold-experiments-for-scene-change-detector-2hbp"&gt;Fast version&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aleksandrgushchin/august-02-august-03-weekly-status-565a"&gt;Medium version&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aleksandrgushchin/august-02-august-03-weekly-status-565a"&gt;Slow version&lt;/a&gt; at the end of the post &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Speed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Average FPS on BBC Planet Earth (360x288)&lt;/th&gt;
&lt;th&gt;Average FPS on open source videos (1280x720)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast version&lt;/td&gt;
&lt;td&gt;234&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium version&lt;/td&gt;
&lt;td&gt;222&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow version&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Overall metric improvement
&lt;/h3&gt;

&lt;p&gt;Here I want to show how I improved the metric values in all versions compared with the old ones.&lt;br&gt;
The blue line represents the values of the algorithm's metric on frames, the orange line is the threshold, and the gray lines mark frames that the algorithm labeled as scene changes.&lt;br&gt;
Here is an example of the outcome on one of the videos:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F7VlRXp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhm4qvkq2blgohnvsjmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F7VlRXp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhm4qvkq2blgohnvsjmg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The top picture shows the original metric values, the bottom one shows the metric after the improvement.&lt;br&gt;
You can see that the peaks with the scene changes became more distinct, so the threshold is easier to tune.&lt;/p&gt;
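
&lt;p&gt;As a rough sketch of the kind of transformation involved (just an illustration on a list of per-frame metric values, not the rav1e code): taking the difference of consecutive metric values suppresses a slowly drifting baseline, so the spikes at scene changes stand out more clearly against the threshold.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: sharpen scene change peaks by differencing consecutive metric
# values before thresholding. Gradual changes produce small differences,
# while the spike at a cut stays large.
def sharpen(metric_values):
    return [0.0] + [cur - prev
                    for prev, cur in zip(metric_values, metric_values[1:])]

raw = [2.0, 2.5, 3.1, 3.8, 4.4, 5.0, 27.0, 5.6, 6.1]
print(sharpen(raw))
&lt;/code&gt;&lt;/pre&gt;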

&lt;h3&gt;
  
  
  Unsuccessful ideas
&lt;/h3&gt;

&lt;p&gt;Here is a list of ideas that I implemented but that turned out to be impractical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow version with motion vectors

&lt;ul&gt;
&lt;li&gt;Each block is shifted by its motion vector. This slowed the algorithm down even more and decreased the F score.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Combining medium version with the slow one

&lt;ul&gt;
&lt;li&gt;The idea was to mark a frame as a scene change if either version said so. Again, it slowed the algorithm down and did not bring any gain in F score. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Separate metric for flashes

&lt;ul&gt;
&lt;li&gt;I implemented a few metrics for flash detection and tried them out. But flashes often occur several frames in a row and can contain a scene change, so it is difficult for the algorithm to decide whether a given run of flashes contains a scene change or not. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Possible improvements:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Threshold 

&lt;ul&gt;
&lt;li&gt;The threshold could depend on recent metric values (the maximum over past frames, the mean and the std). The current threshold can perform worse than a static threshold in some cases. An example is in the picture: the threshold varies around the same value, and if it took into account the mean value of past frames, for example, it would be more accurate (a sketch of this idea appears after this list):
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SjGOLF-d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpm4raohfs4jfo30w87n.png" alt="Alt Text"&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Version based on edge detection

&lt;ul&gt;
&lt;li&gt;It could be useful to take into account another feature of frames - object edges. Combined with the existing versions, it could boost the F score.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Metric

&lt;ul&gt;
&lt;li&gt;Adjusting metric values according to neighboring values, for example by subtracting the mean value of the surrounding frames. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Block-based metric improvement

&lt;ul&gt;
&lt;li&gt;It may be useful to treat the blocks individually rather than just taking the mean value over them all. For example, if the difference for &lt;em&gt;k&lt;/em&gt; blocks is near zero, the algorithm shouldn't mark the frame as a scene change no matter what the other blocks show; conversely, if &lt;em&gt;k&lt;/em&gt; blocks have a difference near the maximum possible value, the algorithm should mark the frame as a scene change regardless of the other blocks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Downsampling for the medium and slow versions:

&lt;ul&gt;
&lt;li&gt;High-resolution videos could be downsampled to HD or so. This would significantly increase the speed while having only a small impact on the F score.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
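
&lt;p&gt;To make the first idea a bit more concrete, here is a minimal sketch (my own illustration, not something implemented in rav1e) of a threshold that follows the mean and standard deviation of the metric over the last few frames:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import deque
from statistics import mean, stdev

# Sketch: adaptive threshold driven by the mean and std of recent metric
# values. A frame is flagged as a scene change when its metric value
# exceeds mean + k * std over the trailing window (with a minimum floor).
def detect(metric_values, window=30, k=2.0, floor=10.0):
    history = deque(maxlen=window)
    detections = []
    for frame, value in enumerate(metric_values):
        if len(history) &amp;gt;= 2:
            threshold = max(floor, mean(history) + k * stdev(history))
            if value &amp;gt; threshold:
                detections.append(frame)
        history.append(value)
    return detections

print(detect([2, 2, 3, 2, 2, 40, 3, 2, 2, 2, 35, 2]))  # [5, 10]
&lt;/code&gt;&lt;/pre&gt;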

</description>
    </item>
    <item>
      <title>August 09 - August 16 Weekly Status</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 13 Aug 2021 20:00:20 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/august-09-august-16-weekly-status-1k9k</link>
      <guid>https://dev.to/aleksandrgushchin/august-09-august-16-weekly-status-1k9k</guid>
      <description>&lt;p&gt;This week I finished the analysis of the new metric based on color histograms. It can be read &lt;a href="https://dev.to/aleksandrgushchin/august-02-august-03-weekly-status-565a"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I experimented with combining this new metric with the current one. Here are the F scores on the BBC Planet Earth dataset for different versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current rav1e version: 

&lt;ul&gt;
&lt;li&gt;0.7024&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Improved threshold (the version from my latest PR): 

&lt;ul&gt;
&lt;li&gt;0.8081&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Version with color histogram based metric: 

&lt;ul&gt;
&lt;li&gt;0.8502&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Union of the latest two versions (a frame is considered a scene change if either of the above-mentioned metrics says so): 

&lt;ul&gt;
&lt;li&gt;0.8923&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Version with the color histogram metric and a block-based approach, where each frame is divided into blocks (more details at the bottom): 

&lt;ul&gt;
&lt;li&gt;0.9217&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I say the intersection of algorithms, I mean an algorithm that marks a frame as a scene change only if both algorithms have marked it. &lt;br&gt;
Here is a picture that explains why I chose the union rather than the intersection. It should also be taken into account that the recall of the algorithms is higher than their precision:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yVeIx9Id--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tehmqzl1u3lnfnpt7cih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yVeIx9Id--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tehmqzl1u3lnfnpt7cih.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The numbers represent the amount of frames considered scene changes by the two versions of the algorithm and by the ground truth. &lt;br&gt;
Each number corresponds to one colored area.&lt;br&gt;
It can be seen that the ground truth contains around 90% of the intersection of these versions. &lt;/p&gt;
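
&lt;p&gt;In set terms (a small illustration with made-up frame numbers, not the real data), the union, the intersection and the share of the intersection confirmed by the ground truth can be computed like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustration with made-up frame numbers: union vs intersection of two
# detectors, and how much of the intersection the ground truth confirms.
detector_a = {12, 80, 143, 200, 310, 350, 400}
detector_b = {12, 80, 150, 200, 310, 350, 555}
ground_truth = {12, 80, 200, 310, 400, 480}

union = detector_a | detector_b
intersection = detector_a &amp;amp; detector_b
confirmed = intersection &amp;amp; ground_truth

print(sorted(union))         # flagged by either detector
print(sorted(intersection))  # flagged by both detectors
print(len(confirmed) / len(intersection))  # 0.8 here
&lt;/code&gt;&lt;/pre&gt;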

&lt;p&gt;I improved the histogram-based metric by dividing frames into blocks. The results can be seen below along with the regular histogram-based approach. The improved version is marked in the legend as &lt;em&gt;"with blocks"&lt;/em&gt;.&lt;br&gt;
Results on BBC dataset:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qdIdJOXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7bc3kldc06bc7f4garh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qdIdJOXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7bc3kldc06bc7f4garh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u8q9IxrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esey2w2bo3y1brmlf5ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u8q9IxrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esey2w2bo3y1brmlf5ip.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Results on open-source videos:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nuWN50Kk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9en1owgh6oyono62926.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nuWN50Kk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9en1owgh6oyono62926.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tgRQQ699--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vz2s2usqwwnm8l5kb354.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tgRQQ699--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vz2s2usqwwnm8l5kb354.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; the calculation speed became &lt;strong&gt;0.75x&lt;/strong&gt; of the current version on the BBC dataset (resolution 360x288) and &lt;strong&gt;0.56x&lt;/strong&gt; on open-source videos (resolution 1280x720). &lt;/p&gt;

&lt;p&gt;I checked whether combining this metric with the current one pays off. The average increase in the F score is about 0.01-0.02, which, considering the even greater decrease in speed, is unreasonable.&lt;/p&gt;

&lt;p&gt;Also, I implemented a block-based histogram approach that takes motion vectors into account. The results will be published here soon.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>August 02 - August 09 Weekly Status</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 06 Aug 2021 20:46:25 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/august-02-august-03-weekly-status-565a</link>
      <guid>https://dev.to/aleksandrgushchin/august-02-august-03-weekly-status-565a</guid>
      <description>&lt;p&gt;This week I experimented with the new metrics that I implemented. &lt;/p&gt;

&lt;p&gt;I implemented two metrics, based on a color histogram and on motion vectors. Since motion vectors are already used in the current version of the algorithm, I focused on the histogram metric. The code for it can be found &lt;a href="https://github.com/alexlqrs/rav1e/blob/master/src/api/lookahead.rs"&gt;here&lt;/a&gt;. I used the histogram crate for this implementation. I calculate the histogram of the first plane of the frame (the luma component) and compare it to the previous frame's histogram. I used 4 metrics to calculate the differences between these histograms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The difference between mean values &lt;/li&gt;
&lt;li&gt;The difference between std values&lt;/li&gt;
&lt;li&gt;Taxicab distance &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pQFXuLjO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0pr3036m2c1idv1qw6f4.png" alt="Alt Text"&gt;
where p and q are the histograms and n is the number of bins in each.&lt;/li&gt;
&lt;li&gt;The square of the Euclidean distance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After that, I subtract the current value from the previous one to make the peaks more distinctive for the threshold.&lt;/p&gt;
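
&lt;p&gt;For illustration only (not the linked implementation, which uses the histogram crate on the luma plane), the four histogram differences above can be written down like this for two histograms with the same number of bins:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

# Sketch of the four histogram differences described above. p and q are
# histograms of two consecutive frames (counts per bin, same bin count).
def hist_stats(hist):
    bins = np.arange(len(hist))
    total = hist.sum()
    mean = (bins * hist).sum() / total
    std = np.sqrt((((bins - mean) ** 2) * hist).sum() / total)
    return mean, std

def mean_difference(p, q):
    return abs(hist_stats(p)[0] - hist_stats(q)[0])

def std_difference(p, q):
    return abs(hist_stats(p)[1] - hist_stats(q)[1])

def taxicab_distance(p, q):
    return np.abs(p - q).sum()

def squared_euclidean_distance(p, q):
    return ((p - q) ** 2).sum()

p = np.histogram(np.random.randint(0, 256, 10000), bins=256, range=(0, 256))[0]
q = np.histogram(np.random.randint(0, 256, 10000), bins=256, range=(0, 256))[0]
print(mean_difference(p, q), taxicab_distance(p, q))
&lt;/code&gt;&lt;/pre&gt;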

&lt;p&gt;Below you can find examples of these 4 distances, in the same order, on the same video. The pictures show the scene changes (gray vertical lines) and the final metric (blue line): &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HGQ_-sQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j6qfxqjrj1dh7kit87ub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HGQ_-sQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j6qfxqjrj1dh7kit87ub.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bOWBadRt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zqiv2ojwgrc0p9ey8ck1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bOWBadRt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zqiv2ojwgrc0p9ey8ck1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z0or0ZHt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a5kywh6sb1p99lwy19ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z0or0ZHt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a5kywh6sb1p99lwy19ut.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jONNML1u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avov19jnrg7sa6zkaqll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jONNML1u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avov19jnrg7sa6zkaqll.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be seen that the metric with the least distinctive peaks is the one with STD difference.&lt;/p&gt;

&lt;p&gt;Below you can find results for the first two distances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean values of histograms:
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fTOSq00j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5l3g1n3xy1t641dv8u5e.png" alt="Alt Text"&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j7o3hl7Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gq4dikaytch2h6dvaipp.png" alt="Alt Text"&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9jyMLMLu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/80s919mif0xa3fqt1c9y.png" alt="Alt Text"&gt;
&lt;/li&gt;
&lt;li&gt;STD values of histograms:
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--onZG9dNQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v3f2254v44666hus726q.png" alt="Alt Text"&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GiQahqxR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1cryb5lhaq583pbqhp83.png" alt="Alt Text"&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--13CJCuMM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gcpzyvhvn8n869ihmlb6.png" alt="Alt Text"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below you can see the results for these metrics on the BBC Planet Earth dataset and on the manually marked-up open source videos:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;F score on BBC Planet Earth&lt;/th&gt;
&lt;th&gt;F score on open source videos&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1(mean)&lt;/td&gt;
&lt;td&gt;0.8502&lt;/td&gt;
&lt;td&gt;0.6532&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#2(std)&lt;/td&gt;
&lt;td&gt;0.6543&lt;/td&gt;
&lt;td&gt;0.5951&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#3(Euclidean)&lt;/td&gt;
&lt;td&gt;0.7031&lt;/td&gt;
&lt;td&gt;0.6002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#4(Taxicab)&lt;/td&gt;
&lt;td&gt;0.7143&lt;/td&gt;
&lt;td&gt;0.6231&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The speed of the algorithm with this metric is ~0.87x the speed of the current version of the algorithm. &lt;/p&gt;

&lt;p&gt;Summary:&lt;br&gt;
The F score of the new metrics is better than that of the current one. &lt;br&gt;
The new metrics are a bit slower than the current metric.&lt;br&gt;&lt;br&gt;
But since they use different characteristics of the frame (motion vectors and color histograms), in combination they could enhance each other and increase the final F score.&lt;/p&gt;

&lt;p&gt;TO DO:&lt;br&gt;
&lt;del&gt;Precision recall curves for these metrics with different thresholds.&lt;/del&gt;&lt;br&gt;
Correlation with previous metric. Would it be better to combine these metrics or use them separately?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>July 26 - August 02 Weekly Status</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 30 Jul 2021 19:52:01 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/july-26-august-02-weekly-status-49gb</link>
      <guid>https://dev.to/aleksandrgushchin/july-26-august-02-weekly-status-49gb</guid>
      <description>&lt;p&gt;This week I implemented changes to the scene change algorithm and made a &lt;a href="https://github.com/xiph/rav1e/pull/2765"&gt;pull request&lt;/a&gt; to the GitHub repository. The main analysis can be seen in &lt;a href="https://dev.to/aleksandrgushchin/threshold-experiments-for-scene-change-detector-2hbp"&gt;this blog post&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I increased the threshold for the fast version

&lt;ul&gt;
&lt;li&gt;This increased the F score by 0.1502, up to 0.7441&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;I applied numerical differentiation to the metric values to make the peaks of the metric more distinguishable for the threshold&lt;/li&gt;
&lt;li&gt;I reduced the threshold for the slow version

&lt;ul&gt;
&lt;li&gt;This increased the F score by 0.1056, up to 0.8081&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, I improved the F score of the fast version by 0.1454 and that of the slow version by 0.1512.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Threshold Experiments for scene change detector</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Tue, 27 Jul 2021 20:48:17 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/threshold-experiments-for-scene-change-detector-2hbp</link>
      <guid>https://dev.to/aleksandrgushchin/threshold-experiments-for-scene-change-detector-2hbp</guid>
      <description>&lt;h3&gt;
  
  
  Fast version's threshold
&lt;/h3&gt;

&lt;p&gt;I experimented with the threshold of the fast version of the algorithm:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1DMUu75m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a7pqtopbqtyxov48902z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1DMUu75m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a7pqtopbqtyxov48902z.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
This picture shows the F score on the testing dataset: the X axis shows the number of the video, the Y axis shows the F score on that video. The lines represent different versions of the algorithm according to the legend.&lt;br&gt;
It can be seen that increasing the threshold value improves the F score of the algorithm and does not affect the processing speed.&lt;/p&gt;

&lt;p&gt;Here is a table of the mean F score on both datasets for different versions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;F score on BBC Planet Earth&lt;/th&gt;
&lt;th&gt;F score on open source videos&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast with thr = 12 (current)&lt;/td&gt;
&lt;td&gt;0.5939&lt;/td&gt;
&lt;td&gt;0.6011&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast with thr = 15&lt;/td&gt;
&lt;td&gt;0.6490&lt;/td&gt;
&lt;td&gt;0.6361&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast with thr = 16&lt;/td&gt;
&lt;td&gt;0.6961&lt;/td&gt;
&lt;td&gt;0.6375&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast with thr = 17&lt;/td&gt;
&lt;td&gt;0.7393&lt;/td&gt;
&lt;td&gt;0.6623&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast with thr = 18&lt;/td&gt;
&lt;td&gt;0.7441&lt;/td&gt;
&lt;td&gt;0.6652&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast with thr = 20&lt;/td&gt;
&lt;td&gt;0.7795&lt;/td&gt;
&lt;td&gt;0.6244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;0.7024&lt;/td&gt;
&lt;td&gt;0.5628&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow improved&lt;/td&gt;
&lt;td&gt;0.8081&lt;/td&gt;
&lt;td&gt;0.6515&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I chose 18 as the optimal value of the threshold.&lt;/p&gt;
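
&lt;p&gt;The sweep behind this choice is straightforward; a rough sketch (not my exact script) that picks the threshold with the best mean F score over a set of videos could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a threshold sweep over saved per-frame metric values.
def f_score_exact(detected, truth):
    # Simple exact-frame matching; a real evaluation would also allow a
    # small tolerance around each ground-truth cut.
    tp = len(set(detected) &amp;amp; set(truth))
    if tp == 0:
        return 0.0
    precision = tp / len(detected)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def detect_at(metric_values, threshold):
    return [i for i, v in enumerate(metric_values) if v &amp;gt; threshold]

def best_threshold(videos, candidates):
    # `videos` is a list of (metric_values, ground_truth_frames) pairs.
    results = []
    for threshold in candidates:
        scores = [f_score_exact(detect_at(m, threshold), t) for m, t in videos]
        results.append((sum(scores) / len(scores), threshold))
    return max(results)  # (best mean F score, corresponding threshold)
&lt;/code&gt;&lt;/pre&gt;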

&lt;h3&gt;
  
  
  Slow version's threshold
&lt;/h3&gt;

&lt;p&gt;After the experiments, the threshold for the slow version was reduced by a factor of 2.2. Here you can see charts of the algorithm's performance (F score, precision and recall) depending on the threshold reduction factor. These results were obtained on open-source videos from youtube.com and vimeo.com:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_rAJMCQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ttodfkd2sku61basvx57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_rAJMCQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ttodfkd2sku61basvx57.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jMyRq3lp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/747caqwhnz1q2em3yt2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jMyRq3lp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/747caqwhnz1q2em3yt2t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xqIF8fJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma8vfvjv6z2n8q4efgb3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xqIF8fJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma8vfvjv6z2n8q4efgb3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are also the results on the BBC Planet Earth dataset:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--apQvJbWQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rva4r30vpwrbpl0temb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--apQvJbWQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rva4r30vpwrbpl0temb4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5xW58o8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilmo9bhgq93ce94qh9rd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5xW58o8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilmo9bhgq93ce94qh9rd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5JtkzDsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7keabmbbpvj32i5t5mek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5JtkzDsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7keabmbbpvj32i5t5mek.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
And precision-recall curve:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fMKVs_uG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emvqhgte8av30aywh7ok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fMKVs_uG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emvqhgte8av30aywh7ok.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Improvement of the metric
&lt;/h3&gt;

&lt;p&gt;To make the peaks of the metric more distinguishable for the threshold, I applied numerical differentiation to its values. Here is an example of the outcome on one of the videos:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F7VlRXp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhm4qvkq2blgohnvsjmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F7VlRXp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhm4qvkq2blgohnvsjmg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The top picture shows the original metric values, the bottom one shows the metric after the improvement. &lt;br&gt;
You can see that the peaks with the scene changes became more distinct, so the threshold is easier to tune.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>July 19 - July 26 Weekly Status / New scene change detector of the rav1e analysis </title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 23 Jul 2021 21:13:14 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/juky-19-july-26-weekly-status-new-scene-change-detector-of-the-rav1e-analysis-348</link>
      <guid>https://dev.to/aleksandrgushchin/juky-19-july-26-weekly-status-new-scene-change-detector-of-the-rav1e-analysis-348</guid>
      <description>&lt;p&gt;At the beginning of this week I started implementing changes to the scene change detector threshold. I lowered its values and also changed the behavior when the max_keyint and min_keyint options are used (before, the algorithm chose non-optimal frames; now it chooses the frames with the highest metric value). Before I finished implementing other threshold strategies, I found out about an update to the algorithm. &lt;/p&gt;

&lt;p&gt;After that I analyzed the new version of the scene change detector in rav1e. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On 4k videos the fast version of the algorithm works better than the slow version.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iKAEYmA1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07h57c57xako3qag5wvo.jpg" alt="Alt Text"&gt;
The x-axis shows the number of the video in the dataset, the y-axis shows the F score of the algorithm. The blue line is the fast version, the yellow one is the slow version. It can be seen that the fast version shows much better results. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; on the BBC Planet Earth dataset the slow version shows better results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;F score&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;0.7024&lt;/td&gt;
&lt;td&gt;0.6452&lt;/td&gt;
&lt;td&gt;0.8013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;0.5939&lt;/td&gt;
&lt;td&gt;0.4739&lt;/td&gt;
&lt;td&gt;0.7975&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can see from the table that the fast version has a recall similar to the slow version's but worse precision. &lt;br&gt;
So I will try to improve the precision by increasing the base threshold. The fast version has no adaptive threshold either, so I will implement one and experiment with it.&lt;br&gt;
Definitions of the F score, precision and recall can be found on Wikipedia. In short, the higher the precision, the fewer false positive frames; the higher the recall, the fewer misses by the algorithm. The F score acts as a balance between precision and recall.&lt;/p&gt;
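
&lt;p&gt;For completeness, in terms of true positives (TP), false positives (FP) and false negatives (FN) the three values are computed as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Precision, recall and F score from true/false positives and false negatives.
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Made-up example: 80 correct detections, 40 false alarms, 20 missed cuts.
print(precision_recall_f(80, 40, 20))  # roughly (0.67, 0.80, 0.73)
&lt;/code&gt;&lt;/pre&gt;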

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An example of how low the base threshold for the fast version is.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6ez1D7PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ldhh1hfav49lma8ajhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6ez1D7PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ldhh1hfav49lma8ajhc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_4JBWWYa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qccrhskkchllv873gq0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_4JBWWYa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qccrhskkchllv873gq0b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The blue line is the metric value, the orange one is the threshold. The vertical grey lines show the scene changes: on the first picture the grey lines are the ground truth, on the second they are the scene changes predicted by the algorithm. As you can see, on the second picture there are a lot of false positives. If the threshold value were around 20-24, the precision and F score would be a lot higher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the other hand, for the slow version of the algorithm the threshold is still too high, as can be seen from these two pictures.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I1RsB4Et--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2dqhsg6n2ff3b8ibma8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I1RsB4Et--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2dqhsg6n2ff3b8ibma8x.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PuTHUyk6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7hp49yoqvc2qztow8zah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PuTHUyk6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7hp49yoqvc2qztow8zah.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The concept is similar to the pictures above, except for the version of the algorithm and the video used. It can be seen that if the threshold were lower, the algorithm would have a higher F score. &lt;br&gt;
Examples with other videos:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FFgL6Arf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cjjjycpjdw3afkojddx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FFgL6Arf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cjjjycpjdw3afkojddx5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U_S-hEbQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/az18u17p2g0j2j49c0vv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U_S-hEbQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/az18u17p2g0j2j49c0vv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Third video:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YMdYz9Wz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pammwfvoghjp4mt5mffh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YMdYz9Wz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pammwfvoghjp4mt5mffh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--48_9Hr4O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bpqd4fn3jcdym99qye3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--48_9Hr4O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bpqd4fn3jcdym99qye3s.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
A similar problem is observed in the rest of the video.&lt;br&gt;
Another example:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SjGOLF-d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpm4raohfs4jfo30w87n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SjGOLF-d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpm4raohfs4jfo30w87n.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EnRxEf3G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5up9jskj3ellq859z2ex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EnRxEf3G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5up9jskj3ellq859z2ex.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On average, the speed of the fast version is 1.3 times the speed of the slow version.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The new version of the algorithm is better than the old one by about 0.05-0.1 in terms of F score. Based on the results of the analysis, it can be improved even further.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>July 12 - July 19 Weekly Status</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 16 Jul 2021 20:32:35 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/july-12-june-19-weekly-status-1l97</link>
      <guid>https://dev.to/aleksandrgushchin/july-12-june-19-weekly-status-1l97</guid>
      <description>&lt;p&gt;This week I experimented with the threshold, following one of my previous &lt;a href="https://dev.to/aleksandrgushchin/metric-visualization-and-analysis-590o"&gt;posts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first thing I did was lower the threshold itself. After a series of experiments I chose to decrease it by 35%. &lt;/p&gt;

&lt;p&gt;You can see the results of lowering the threshold here:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LG5H7UT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1powfxpbokvz82h1vpxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LG5H7UT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1powfxpbokvz82h1vpxh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
To compare them with the previous results, here is a picture from another &lt;a href="https://dev.to/aleksandrgushchin/results-of-current-algorithm-to-be-updated-o2f"&gt;blogpost&lt;/a&gt;:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qa-ByJlH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/54nsmgcyzvaby1jhnea8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qa-ByJlH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/54nsmgcyzvaby1jhnea8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The average F score on the complete dataset increased by ~0.096.&lt;/p&gt;

&lt;p&gt;I also tested different strategies for adapting the threshold.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the max_interval option is specified and the metric stays below the threshold for max_interval frames, the algorithm now chooses not the last frame of this series but the one with the highest metric value (see the sketch after the example below).
You can see examples here:
Before:
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tZSmsVi5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/19nbpubopxxpn9lmcatb.png" alt="Alt Text"&gt;
After:
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q3YXx6iY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yp4m5ny8tt8pl8plo09s.png" alt="Alt Text"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The max_interval option here is 500. You can see that before the change the algorithm chose exactly the 500th frame, while after the change it chose the frame with the highest metric value.&lt;/p&gt;
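
&lt;p&gt;As a rough illustration of this change, here is a minimal Rust sketch of picking the cut position inside a max_interval window (the function name and the slice of per-frame metric values are hypothetical; this is not the actual rav1e code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal sketch: if the metric stayed below the threshold for
// `max_interval` frames, cut at the frame with the highest metric
// value inside that window instead of at the last frame.
fn forced_cut_index(metrics: &amp;[f64], max_interval: usize) -&gt; Option&lt;usize&gt; {
    if metrics.len() &lt; max_interval {
        return None; // the forced cut is not due yet
    }
    metrics[..max_interval]
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i)
}
&lt;/code&gt;&lt;/pre&gt;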

&lt;p&gt;The second change addresses the fact that the threshold does not take the following metric values into account:&lt;br&gt;
the algorithm now does not mark the current frame as a scene change if the metric value of the next frame is bigger than the current one. This helps prevent the algorithm from marking a series of consecutive frames as scene changes (example below):&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hzaBwkHl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o6epxue5vz79vw66y4ny.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hzaBwkHl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o6epxue5vz79vw66y4ny.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I haven't pushed these changes to the GitHub repos yet and plan to do so early next week.&lt;/p&gt;

&lt;p&gt;The most difficult remaining problem is the metric itself. I will take a closer look at it to see if there is a way to correct it a little. After that, I will start implementing a new metric to compensate for the weaknesses of the current one.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>June 21 - June 28 Weekly Status</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Sat, 26 Jun 2021 20:33:18 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/june-21-june-28-weekly-status-1a5c</link>
      <guid>https://dev.to/aleksandrgushchin/june-21-june-28-weekly-status-1a5c</guid>
      <description>&lt;p&gt;This week I tested how the speed of the algorithm depends on the resolution of the video. So far, the results seem strange, because downsampling from 4K to 2K did not speed up the algorithm much (approximately 1.2-1.3x faster). &lt;br&gt;
Then I measured the cost of downsampling alone. Inside the video frame reading function, I set up a loop of 10000 iterations; on each iteration, memory was allocated for a new, smaller frame, and the planes to which downsampling was applied were copied into that new frame variable. It ran at about 200 iterations per second, which is far faster than the regular scene change detector's processing speed on 4K video (2.5 frames per second). So the downsampling itself should not affect the speed much, and this remains a problem to solve. &lt;br&gt;
I also exported the algorithm's metric values to a JSON file and visualized them in a few charts. I made a &lt;a href="https://dev.to/aleksandrgushchin/metric-visualization-and-analysis-590o"&gt;blog post&lt;/a&gt; about it. Based on the visualization, I drew some conclusions for future work, which are written at the end of the post.&lt;/p&gt;
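
&lt;p&gt;For reference, a micro-benchmark of this kind can be set up roughly like the sketch below (the function and the downscale closure are placeholders, not the actual rav1e code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;use std::time::Instant;

// Rough sketch of the micro-benchmark: run the downsampling step
// `iterations` times and report the throughput. The `downscale`
// closure stands in for "allocate a smaller frame and copy the
// downsampled planes into it"; it is a placeholder, not rav1e API.
fn bench_downsampling&lt;F: FnMut()&gt;(iterations: u32, mut downscale: F) {
    let start = Instant::now();
    for _ in 0..iterations {
        downscale();
    }
    let per_second = f64::from(iterations) / start.elapsed().as_secs_f64();
    println!("{:.1} iterations per second", per_second);
}
&lt;/code&gt;&lt;/pre&gt;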

</description>
    </item>
    <item>
      <title>Metric visualization and analysis</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 25 Jun 2021 20:09:47 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/metric-visualization-and-analysis-590o</link>
      <guid>https://dev.to/aleksandrgushchin/metric-visualization-and-analysis-590o</guid>
      <description>&lt;p&gt;The pictures in this blog post each show two charts that differ only in the gray vertical lines: in the upper chart the gray lines represent the ground truth, and in the lower chart they represent the result of the algorithm. &lt;br&gt;
The blue line represents the values of the algorithm's metric on each frame, and the orange line is the threshold.&lt;/p&gt;

&lt;p&gt;My first observation is that the threshold is often too high. This can be illustrated with this picture:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W4VL3bGU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z0ilnr3o957jsced5ogf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W4VL3bGU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z0ilnr3o957jsced5ogf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T-6qNjQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kgzcl3kbz2dqzf8d1kj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T-6qNjQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kgzcl3kbz2dqzf8d1kj8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second observation is that the interval parameters (min interval and max interval) often hurt the algorithm (especially the max parameter):&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w6O6zLEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7yiyeco04v7s5ufj9ogu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w6O6zLEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7yiyeco04v7s5ufj9ogu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
You can see that the algorithm marked exactly the 500th frame as a scene change, although there are earlier frames with a higher metric.&lt;/p&gt;

&lt;p&gt;The third observation is that the algorithm often fails not because of the threshold but because of the metric itself:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TJ3ombUF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1u1r6qynth0nomjf9g5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TJ3ombUF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1u1r6qynth0nomjf9g5j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fourth observation is that, in general, the threshold does not behave optimally: it does not take the following metric values into account, and because of this a lot of consecutive frames can be marked as scene changes:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uvnf34Yr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vftu2rxzo9br1rc9bd64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uvnf34Yr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vftu2rxzo9br1rc9bd64.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wrQT1yGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c381dgagx3cnwqptx8uo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wrQT1yGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c381dgagx3cnwqptx8uo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
You can see that around the 2000th frame the metric grows steadily, followed by a delayed growth of the threshold. Because of this, about 100 consecutive frames are marked as scene changes. Here is a closer look at it:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0jp_s75U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ny9b1bt67z83v510vr64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0jp_s75U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ny9b1bt67z83v510vr64.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The threshold is often way too high for single spikes of metric values&lt;/li&gt;
&lt;li&gt;Handling of the max interval parameter can be improved; it is better to mark an earlier frame with a higher metric value&lt;/li&gt;
&lt;li&gt;The threshold does not take the following metric values into account&lt;/li&gt;
&lt;li&gt;The metric itself can be wrong (usually it either works well on the whole video or poorly on the whole video)&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>June 14 - June 21 Weekly Status</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Mon, 21 Jun 2021 19:32:15 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/june-14-june-21-weekly-status-14hm</link>
      <guid>https://dev.to/aleksandrgushchin/june-14-june-21-weekly-status-14hm</guid>
      <description>&lt;p&gt;This week I modified the &lt;a href="https://github.com/AleksandrGushchin/av-scenechange"&gt;code&lt;/a&gt; used to test the algorithm. Namely, I added command line arguments for downsampling the video, outputting the result as a JSON file, and measuring the speed of the algorithm. After that, I tested the algorithm on my dataset. Based on the test results, I made a blog post with the data and charts. During testing, I varied the min_key and max_key command line parameters and made charts for each part of the dataset (documentaries, 4K videos and complex videos from YouTube) based on this data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Results of current algorithm</title>
      <dc:creator>Aleksandr Gushchin</dc:creator>
      <pubDate>Fri, 18 Jun 2021 19:54:10 +0000</pubDate>
      <link>https://dev.to/aleksandrgushchin/results-of-current-algorithm-to-be-updated-o2f</link>
      <guid>https://dev.to/aleksandrgushchin/results-of-current-algorithm-to-be-updated-o2f</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Gt3VKsr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1drn8ibhzukb2y48sl5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Gt3VKsr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1drn8ibhzukb2y48sl5a.png" alt="Alt"&gt;&lt;/a&gt;&lt;br&gt;
Results on different parts of the dataset. Each line represents a different value of the minimum interval (&lt;em&gt;--min-scenecut command line parameter&lt;/em&gt;). &lt;br&gt;
It can be seen that the performance of the algorithm on difficult YouTube videos is worse than on documentary films from the BBC.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AleksandrGushchin/av-scenechange"&gt;Here is  the code for testing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The F score can be calculated using the formula below:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NIzSCSz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h1mnstq08mnd7j64ttbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NIzSCSz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h1mnstq08mnd7j64ttbn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
where Precision and Recall are:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wd61CkZN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kmtvqb93dp0wklcyz8x8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wd61CkZN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kmtvqb93dp0wklcyz8x8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;tp&lt;/em&gt; is the number of correctly detected scene changes&lt;br&gt;
&lt;em&gt;tn&lt;/em&gt; is the number of correctly detected frames without scene changes&lt;br&gt;
&lt;em&gt;fp&lt;/em&gt; is the number of false alarms&lt;br&gt;
&lt;em&gt;fn&lt;/em&gt; is the number of missed scene changes&lt;/p&gt;
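
&lt;p&gt;Assuming the standard F1 definition (Precision = tp / (tp + fp), Recall = tp / (tp + fn), F = 2 * Precision * Recall / (Precision + Recall)), the score can be computed with a small helper like this hypothetical sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Standard F1 score from the counts defined above:
// precision = tp / (tp + fp), recall = tp / (tp + fn).
fn f_score(tp: f64, fp: f64, fn_: f64) -&gt; f64 {
    let precision = tp / (tp + fp);
    let recall = tp / (tp + fn_);
    2.0 * precision * recall / (precision + recall)
}
&lt;/code&gt;&lt;/pre&gt;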

&lt;h5&gt;
  
  
  The average speed of the algorithm:
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Part&lt;/th&gt;
&lt;th&gt;FPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K videos&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BBC Dataset&lt;/td&gt;
&lt;td&gt;238&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Other videos&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TO DO:&lt;br&gt;
Dependency of F score and speed on resolution&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
