Luca Barbato

Temporal RDO update optimization

This is the second story about a specific optimization that gave a significant speed-up in rav1e. For the first one, go here; for the next one, go here.

For readers unfamiliar with AV1 and rav1e, we will start with some background. If you have already read the first article, you may skip ahead to the implementation details.

The Alliance for Open Media is an effort founded by Google, Cisco and Mozilla, with many other companies joining them over the past few years. The Alliance for Open Media AV1 video codec is the first to be released as a result of this effort. We will refer to it as AV1.

rav1e is an AV1 encoder written primarily in the Rust programming language; its goal is to be safe and fast.

Architecture of an AV1 encoder

AV1 has a conventional block transform video coding architecture with motion compensation. A video is a sequence of frames which are divided into one or more tiles. Tiles are divided into large squares called superblocks -- either 64 or 128 pixels in width and height for AV1. Superblocks are partitioned recursively into smaller square or rectangular blocks.

AV1 Partitioning
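To make the hierarchy concrete, here is a toy sketch of the frame, tile, superblock and block nesting; the types below are purely illustrative and are not rav1e's actual data structures.

```rust
// Illustrative only: these types mirror the nesting described above,
// they are not rav1e's real data structures.
struct Frame {
    tiles: Vec<Tile>,
}

struct Tile {
    // 64x64 or 128x128 pixel squares, depending on the sequence settings.
    superblocks: Vec<Superblock>,
}

// A superblock is partitioned recursively into square or rectangular
// coding blocks.
enum Superblock {
    Leaf(Block),
    Split(Vec<Superblock>),
}

struct Block {
    width: u16,
    height: u16,
}
```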

Each block may be predicted by neighboring blocks in the same tile or by a motion projection of regions in other frames. There are many modes to choose from for both of these cases. The core of the video encoder is deciding how to partition the blocks, which prediction mode and reference to use for each block and how to transform the difference. The following diagram illustrates the range of techniques available in AV1.

The technology inside AV1

Rate-distortion optimization (RDO)

The process for guiding encoding choices is called rate-distortion optimization, commonly abbreviated as RDO. For each set of choices there is a trade-off between the rate (the bits spent describing the decisions) and the distortion (how much the decoded frame differs from the original). The ratio between the two can be estimated for a given quality range.
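As a rough illustration (not rav1e's actual code), the decision amounts to minimizing a cost of the form distortion + lambda * rate, where lambda encodes the quality target. A minimal sketch, assuming a sum of squared errors as the distortion metric:

```rust
// A minimal sketch of the rate-distortion cost, assuming a sum of squared
// errors as the distortion metric. The names are illustrative, not rav1e's.
fn sse(original: &[u8], decoded: &[u8]) -> u64 {
    original
        .iter()
        .zip(decoded)
        .map(|(&o, &d)| {
            let diff = i64::from(o) - i64::from(d);
            (diff * diff) as u64
        })
        .sum()
}

// The encoder keeps whichever set of choices minimizes this value;
// `lambda` grows as the target quality gets lower, favoring fewer bits.
fn rd_cost(distortion: u64, rate_bits: u64, lambda: f64) -> f64 {
    distortion as f64 + lambda * rate_bits as f64
}
```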

Temporal RDO

The RDO logic is normally applied over a small set of frames within a larger group. We call the smaller group a sub-GOP and the larger group a GOP (group of pictures).

The temporal RDO analyzes a larger set of frames to find areas that recur beyond the single sub-GOP and computes a bias that guides the normal RDO decisions to invest more bits in the blocks that will matter more in future frames.

The idea of temporal RDO is to compress objects well-predicted in future frames with higher quality. For example, a car moving across the screen can be encoded finely on the first frame, which will improve its visual quality on all subsequent frames.
On the other hand, an object that soon disappears can be encoded coarsely to save bits. This can also apply to multi-view images: spend more bits on objects visible in multiple views and less bits on objects visible in only a single view.
-- Ivan Molodetskikh

The temporal RDO needs to see moderately far into the future in order to be effective.

Every time a decision is made for a block, its effect has to propagate -- both within the frame and over all the past frames.

This made its implementation one of the largest bottlenecks in the encoder:

  • As with every lookahead, it adds latency, potentially a lot of it.
  • The update process is inherently serial over the frames (a simplified sketch of this propagation pass follows the list).
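To see where the serial dependency comes from, here is a heavily simplified sketch of the propagation pass. It is not rav1e's code: the field names, the single-block projection and the weighting formula are assumptions made purely for illustration.

```rust
// Not rav1e's code: a heavily simplified model of the propagation pass.
// In the real encoder a block's contribution is spread over every block it
// overlaps in the reference frame; here it lands on a single block to keep
// the sketch short, and the field names are made up.
struct FrameData {
    intra_costs: Vec<f32>,    // per-block cost of coding the block on its own
    inter_costs: Vec<f32>,    // per-block cost when predicted from the reference
    importances: Vec<f32>,    // accumulated bias later fed back into the RDO
    reference: Option<usize>, // index of the earlier frame this one predicts from
    ref_blocks: Vec<usize>,   // which block in the reference each block projects onto
}

fn propagate_importances(frames: &mut [FrameData]) {
    // Walk from the most recent frame back towards the oldest: a frame must be
    // fully processed before its reference can be updated, which is what keeps
    // this loop serial across frames.
    for idx in (0..frames.len()).rev() {
        let Some(ref_idx) = frames[idx].reference else { continue };
        debug_assert!(ref_idx < idx, "references point to earlier frames here");

        let (head, tail) = frames.split_at_mut(idx);
        let cur = &tail[0];
        let reference = &mut head[ref_idx];

        for (block, &ref_block) in cur.ref_blocks.iter().enumerate() {
            let intra = cur.intra_costs[block];
            let inter = cur.inter_costs[block].min(intra);
            // The better a block is predicted from its reference, the larger
            // the share of its cost and accumulated importance that flows back.
            let propagated =
                (intra + cur.importances[block]) * (intra - inter) / intra.max(1.0);
            reference.importances[ref_block] += propagated;
        }
    }
}
```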

Temporal RDO implementation

In rav1e the bulk of the code implementing it lives in compute_block_importances().

hawktrace of encoding

Because it is completely serial, once the encoding process runs on a machine with enough threads it takes a fairly large share of the overall encoding time.

When encoding with 16 threads and 16 tiles, it takes a little over 30% of the time spent in receive_packet at speed level 10.

While the rest of the encoding workload is spread nearly evenly per tile, the computation of the block-importance bias is not.

Making it faster

The update process is fairly serial if we consider the frames and the per-block updates, but the analysis itself has good parallelism opportunities since it happens per block.

Enter rayon

rayon is a data-parallelism crate that makes it incredibly easy to take an iterator and execute it in parallel using a pool of workers. We use it extensively in rav1e.
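As a minimal, self-contained example (unrelated to rav1e's code base), moving a serial iterator chain onto rayon's worker pool is often just a matter of swapping iter() for par_iter():

```rust
use rayon::prelude::*;

fn main() {
    let costs: Vec<u64> = (0..1_000_000u64).collect();

    // Serial version.
    let total: u64 = costs.iter().map(|c| c * c).sum();

    // Parallel version: same shape, but the work is split across
    // rayon's worker pool.
    let par_total: u64 = costs.par_iter().map(|c| c * c).sum();

    assert_eq!(total, par_total);
}
```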

After a refactor¹ splitting the update loop into one pass that computes the per-block costs and one that propagates the importance biases to the neighboring blocks, all it took was an apparently small change:

```diff
diff --git a/src/api/internal.rs b/src/api/internal.rs
index 395a66e4..b2ab16cd 100644
--- a/src/api/internal.rs
+++ b/src/api/internal.rs
@@ -21,6 +21,7 @@ use crate::rate::{
   RCState, FRAME_NSUBTYPES, FRAME_SUBTYPE_I, FRAME_SUBTYPE_P,
   FRAME_SUBTYPE_SEF,
 };
+use crate::rayon::prelude::*;
 use crate::scenechange::SceneChangeDetector;
 use crate::stats::EncoderStats;
 use crate::tiling::Area;
@@ -809,14 +810,14 @@ impl<T: Pixel> ContextInner<T> {
     let plane_org = &frame.planes[0];
     let plane_ref = &reference_frame.planes[0];
     let lookahead_intra_costs_lines =
-      fi.lookahead_intra_costs.chunks_exact(fi.w_in_imp_b);
+      fi.lookahead_intra_costs.par_chunks_exact(fi.w_in_imp_b);
     let block_importances_lines =
-      fi.block_importances.chunks_exact(fi.w_in_imp_b);
+      fi.block_importances.par_chunks_exact(fi.w_in_imp_b);

     let costs: Vec<_> = lookahead_intra_costs_lines
       .zip(block_importances_lines)
       .enumerate()
-      .flat_map(|(y, (lookahead_intra_costs, block_importances))| {
+      .flat_map_iter(|(y, (lookahead_intra_costs, block_importances))| {
         lookahead_intra_costs
           .iter()
           .zip(block_importances.iter())
```
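For readers less familiar with rayon, the same pattern in isolation looks roughly like this (the data and names are made up): rows of per-block values are split across the worker pool with par_chunks_exact, while flat_map_iter keeps the per-block loop inside each row an ordinary serial iterator.

```rust
use rayon::prelude::*;

fn main() {
    // A 4x4 grid of blocks, stored row-major as in the encoder's flat arrays.
    let w_in_imp_b = 4;
    let lookahead_intra_costs = vec![100u32; 16];
    let block_importances = vec![1.0f32; 16];

    let costs: Vec<(usize, usize, f32)> = lookahead_intra_costs
        // Each worker gets whole rows of blocks.
        .par_chunks_exact(w_in_imp_b)
        .zip(block_importances.par_chunks_exact(w_in_imp_b))
        .enumerate()
        // `flat_map_iter` keeps the per-block loop serial within a row,
        // avoiding the overhead of splitting such small pieces of work.
        .flat_map_iter(|(y, (intra_row, importance_row))| {
            intra_row
                .iter()
                .zip(importance_row.iter())
                .enumerate()
                .map(move |(x, (&intra, &importance))| {
                    // Stand-in for the real per-block cost computation.
                    (x, y, intra as f32 * importance)
                })
        })
        .collect();

    assert_eq!(costs.len(), 16);
}
```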

At speed level 10, the time spent in compute_block_importances() is reduced to roughly a third.

hawktrace after the change

Across the speed levels, the impact is the following:

(chart: encode time impact across the speed levels)

As expected, making the temporal RDO update parallel made our quicker speed levels noticeably faster.

(chart: comparison with the RDO rollback optimization across the speed levels)

Since the RDO rollback optimization had a larger impact on our slower speed levels, the two compound nicely.

After this optimization we retuned the speed levels a little, changing the default rdo-lookahead depth.


  1. https://github.com/xiph/rav1e/commit/705fb22f191c8468313a9b32ac9796887c7f93b6 
