An Overview of "Deep Neural Networks for YouTube Recommendations"

YouTube runs one of the largest and most sophisticated recommendation systems in the industry.

Paul Covington, Jay Adams, and Emre Sargin’s 2016 paper “Deep Neural Networks for YouTube Recommendations” addresses an important question:

How do you present a few “best” recommendations?

This post gives a brief overview of the system and discusses some of its feature-selection choices.

Perspectives

The paper looks at the problem from three main perspectives:

Scale
- Many existing recommendation algorithms that work well on small problems fail at the scale of YouTube’s massive user base and corpus.
Freshness
- The corpus changes constantly as new videos arrive: 300 hours of video are uploaded to YouTube every minute (2019 figure).
- The system needs to balance new content against older content.
- It also has to deal with the cold-start problem.
Noise
- User behavior data on YouTube is sparse and consists mostly of implicit feedback.
- The signals that are available are noisy.

Facts and Numbers about Youtube

System Overview

The recommendation system has two stages:

Stage 1: Candidate generation:

This stage provides broad personalization via collaborative filtering. It retrieves a small subset of the corpus, narrowing the candidates from millions of videos down to hundreds. The similarity between users is expressed in terms of coarse features such as:

- IDs of videos watched
- search query tokens
- demographics (geographic region, device, gender, and age)

The input layer is followed by several layers of fully connected Rectified Linear Units (ReLU).
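The candidate generation stage described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the embedding size, layer widths, random weights, the demographic encoding, and the names `average`, `dense_relu`, `user_vector`, and `top_k_candidates` are all made up for the example, and the dot-product top-k is a simple stand-in for the retrieval step.

```python
import random

def average(vectors):
    """Element-wise mean of equal-length vectors; input order is irrelevant."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def dense_relu(x, weights, bias):
    """One fully connected layer followed by a ReLU activation."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def user_vector(watched_ids, query_tokens, demographics, params):
    """Average the sparse inputs, concatenate, then apply the ReLU layers."""
    watch_avg = average([params["video_emb"][v] for v in watched_ids])
    search_avg = average([params["token_emb"][t] for t in query_tokens])
    x = watch_avg + search_avg + demographics  # list concatenation
    for weights, bias in params["layers"]:
        x = dense_relu(x, weights, bias)
    return x

def top_k_candidates(user_vec, candidate_emb, k):
    """Return the k video ids whose embeddings score highest by dot product."""
    def score(video_id):
        return sum(u * e for u, e in zip(user_vec, candidate_emb[video_id]))
    return sorted(candidate_emb, key=score, reverse=True)[:k]

rng = random.Random(0)
EMB, N_VIDEOS = 8, 50  # assumed sizes, far smaller than a real system
tokens = ["cat", "piano", "tutorial"]
params = {
    "video_emb": dict(enumerate(random_matrix(N_VIDEOS, EMB, rng))),
    "token_emb": dict(zip(tokens, random_matrix(len(tokens), EMB, rng))),
    # Input width: 8 (watch average) + 8 (search average) + 3 (demographics) = 19.
    "layers": [
        (random_matrix(16, 19, rng), [0.0] * 16),
        (random_matrix(EMB, 16, rng), [0.0] * EMB),
    ],
}
candidate_emb = dict(enumerate(random_matrix(N_VIDEOS, EMB, rng)))

demographics = [1.0, 0.0, 0.35]  # hypothetical encoded region, device, age
u = user_vector([3, 17, 42], ["piano", "tutorial"], demographics, params)
candidates = top_k_candidates(u, candidate_emb, k=5)
```

Averaging the watch and search embeddings turns a variable-length history into a fixed-width input, which is what lets a plain feed-forward network consume it.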

Stage 2: Ranking

During ranking, the model can take advantage of many more features describing the video and the user’s relationship to the video, because only a few hundred videos are being scored.

- Ranking is crucial for ensembling different candidate sources whose scores are not directly comparable.
- It assigns an independent score to each video impression using logistic regression.
- The final ranking objective is constantly being tuned based on live A/B testing results.

Features

Next, I would like to discuss some feature-selection and modeling choices that are driven by business considerations.

Our goal is to predict expected watch time given training examples that are either positive (the video impression was clicked) or negative (the impression was not clicked). Positive examples are annotated with the amount of time the user spent watching the video.
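The paper trains this objective with weighted logistic regression: clicked impressions are weighted by their watch time, unclicked impressions get unit weight, and the learned odds e^(w·x + b) are used as the watch-time score at serving. The sketch below illustrates that idea on toy data. The single binary feature, the gradient-descent settings, and the helper names `fit_weighted_logistic`, `expected_watch_time`, and `impressions` are assumptions for the example, not the production setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logistic(rows, lr=1.0, steps=5000):
    """rows: list of (features, clicked, watch_time). Returns (weights, bias)."""
    dim = len(rows[0][0])
    w, b = [0.0] * dim, 0.0
    # Clicked impressions are weighted by watch time, unclicked ones by 1.
    sample_w = [t if clicked else 1.0 for _, clicked, t in rows]
    total = sum(sample_w)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for (x, clicked, _), sw in zip(rows, sample_w):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = sw * (p - clicked)  # gradient of the weighted log loss
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b -= lr * grad_b / total
        w = [wi - lr * g / total for wi, g in zip(w, grad_w)]
    return w, b

def expected_watch_time(x, w, b):
    """Serving-time score: the learned odds e^(w.x + b)."""
    return math.exp(sum(wi * xi for wi, xi in zip(w, x)) + b)

def impressions(feature, n_clicked, n_skipped, watch_minutes):
    """Toy impressions that share one feature value (hypothetical data)."""
    return ([([feature], 1, watch_minutes)] * n_clicked
            + [([feature], 0, 0.0)] * n_skipped)

# Two kinds of video with the same click rate but different watch times.
rows = impressions(0.0, 2, 18, 2.0) + impressions(1.0, 2, 18, 9.0)
w, b = fit_weighted_logistic(rows)
score_a = expected_watch_time([0.0], w, b)
score_b = expected_watch_time([1.0], w, b)
```

Both groups have a 10% click rate, so a click-only model could not tell them apart. The weighted model's odds come out near 4/18 ≈ 0.22 and 18/18 = 1.0, close to the true per-impression watch times of 0.2 and 0.9 minutes, so the longer-watched videos rank higher.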

Why use expected watch time as the optimization goal rather than click-through rate or play rate?

Watch time is a better reflection of users' real interest. The paper notes that ranking by click-through rate tends to promote “clickbait” videos that users do not finish, whereas watch time better captures engagement. The most important signals are those describing a user's interaction with the video. From a business perspective, the longer the watch time, the more ad revenue YouTube earns. Increasing watch time is also in line with the long-term interests of a video site and with user stickiness. In other words, the model's objective matches YouTube's business considerations.

Recommending recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance.

How is a preference for fresh content introduced into the model?

Machine learning models are often biased toward the past because they are trained on historical examples.

...using the age of the training example as an input feature removes an inherent bias towards the past and allows the model to represent the time-dependent behavior of popular videos.

...we feed the age of the training example as a feature during training. At serving time, this feature is set to zero (or slightly negative) to reflect that the model is making predictions at the very end of the training window.
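A minimal sketch of how such a feature could be attached, assuming each logged example carries an `event_time` timestamp. The field names, the unit (days), and the helpers `with_example_age` and `serving_features` are hypothetical. The age is measured back from the latest event in the training data.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400.0

def with_example_age(examples):
    """Add example_age_days = (latest event time in the data) - (event time)."""
    window_end = max(ex["event_time"] for ex in examples)
    return [
        dict(ex, example_age_days=(window_end - ex["event_time"]).total_seconds()
             / SECONDS_PER_DAY)
        for ex in examples
    ]

def serving_features(features, example_age_days=0.0):
    """At serving time the age is pinned to zero (or slightly negative)."""
    return dict(features, example_age_days=example_age_days)

# Hypothetical training log spanning two weeks.
training_log = [
    {"user": "u1", "video": "v1", "event_time": datetime(2016, 9, 1)},
    {"user": "u2", "video": "v2", "event_time": datetime(2016, 9, 8)},
    {"user": "u1", "video": "v3", "event_time": datetime(2016, 9, 15)},
]
train = with_example_age(training_log)
request = serving_features({"user": "u1"})
```

During training the model sees ages of 14, 7, and 0 days for these three examples. At serving it always sees 0, which asks for predictions as of the end of the training window.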

This makes a big difference, as the corresponding graph in the paper shows. The green line is the actual distribution. The red line shows the predictions with the age of the training example as a feature, and the blue line shows the predictions without it. The red line matches the actual distribution much more closely.

When preprocessing the training examples, why do the authors extract an equal number of training samples per user?

Another key insight that improved live metrics was to generate a fixed number of training examples per user, effectively weighting our users equally in the loss function. This prevented a small cohort of highly active users from dominating the loss.
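One way to implement this is to cap the number of examples per user by random sampling, as in the sketch below. The record layout, the cap of 3, and the choice to keep every example for users who have fewer than the cap are assumptions for the example; the paper does not say how such users are handled.

```python
import random
from collections import Counter, defaultdict

def sample_per_user(examples, per_user, seed=0):
    """Keep at most `per_user` randomly chosen examples for each user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for ex in examples:
        by_user[ex["user"]].append(ex)
    sampled = []
    for user_examples in by_user.values():
        if len(user_examples) > per_user:
            user_examples = rng.sample(user_examples, per_user)
        sampled.extend(user_examples)
    return sampled

# One very active user and one casual user (hypothetical log).
log = ([{"user": "heavy", "video": f"v{i}"} for i in range(1000)]
       + [{"user": "casual", "video": f"c{i}"} for i in range(3)])
balanced = sample_per_user(log, per_user=3)
counts = Counter(ex["user"] for ex in balanced)
```

Before sampling, the heavy user contributes 1000 of 1003 examples. Afterwards both users contribute 3, so each carries equal weight in the loss.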

Why is the time sequence of the user's viewing history discarded entirely, so that all of the user's recent history is treated the same way?

If the model puts too much weight on timing, a user's recommendations will be dominated by the most recently watched or searched video. The authors give an example: a user has just issued a search query for “taylor swift”. It is probably not a good idea for most of the videos in that user's next homepage recommendations to be related to “taylor swift”.

By discarding sequence information and representing search queries with an unordered bag of tokens, the classifier is no longer directly aware of the origin of the label.
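A small sketch of what an unordered bag of tokens means in practice. Whitespace tokenization, lowercasing, and the helper name `bag_of_tokens` are assumptions for the example, not the paper's tokenizer.

```python
from collections import Counter

def bag_of_tokens(queries):
    """Collapse a search history into token counts, ignoring query order."""
    return Counter(token for query in queries for token in query.lower().split())

history = ["taylor swift", "piano tutorial", "taylor swift live"]
bag = bag_of_tokens(history)
same_bag = bag_of_tokens(list(reversed(history)))
```

Both orderings produce the same bag, so the model cannot tell which query was issued most recently.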

This paper is a representative example of applying deep neural networks to recommendation. It combines business realities with real user scenarios, and I highly recommend reading it.
