<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yu Fen Lin</title>
    <description>The latest articles on DEV Community by Yu Fen Lin (@yufenlin).</description>
    <link>https://dev.to/yufenlin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F267003%2F99fe8180-81aa-4aec-b2ab-1a89923f6a37.jpeg</url>
      <title>DEV Community: Yu Fen Lin</title>
      <link>https://dev.to/yufenlin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yufenlin"/>
    <language>en</language>
    <item>
      <title>An Overview of "Deep Neural Networks for YouTube Recommendations"</title>
      <dc:creator>Yu Fen Lin</dc:creator>
      <pubDate>Fri, 17 Jan 2020 17:34:23 +0000</pubDate>
      <link>https://dev.to/yufenlin/an-overview-of-deep-neural-networks-for-youtube-recommendations-g77</link>
      <guid>https://dev.to/yufenlin/an-overview-of-deep-neural-networks-for-youtube-recommendations-g77</guid>
<description>&lt;p&gt;&lt;a href="https://www.youtube.com/"&gt;YouTube&lt;/a&gt; operates one of the largest and most sophisticated recommendation systems in the industry. &lt;/p&gt;

&lt;p&gt;Paul Covington, Jay Adams, and Emre Sargin’s 2016 paper “&lt;a href="https://research.google/pubs/pub45530/"&gt;Deep Neural Networks for YouTube Recommendations&lt;/a&gt;” discusses an important question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to present a few “best” recommendations?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I will give a brief overview of the system and some of its feature selection choices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Perspectives
&lt;/h2&gt;

&lt;p&gt;There are three main perspectives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scale

&lt;ul&gt;
&lt;li&gt;No existing recommendation algorithm has been proven to work well at the scale of YouTube’s massive user base and corpus.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Freshness

&lt;ul&gt;
&lt;li&gt;Videos are uploaded constantly and the corpus is always changing: &lt;a href="https://merchdope.com/youtube-stats/"&gt;300 hours of video are uploaded to YouTube every minute (2019)&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;The system needs to balance new and old content.&lt;/li&gt;
&lt;li&gt;It must also solve the cold-start problem.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Noise

&lt;ul&gt;
&lt;li&gt;User behavior data on YouTube is sparse, with only implicit feedback.&lt;/li&gt;
&lt;li&gt;The data is also noisy.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;





&lt;h2&gt;
  
  
  System Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NVkxMoeJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/5lmitptoheoptqjlqwlc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NVkxMoeJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/5lmitptoheoptqjlqwlc.png" alt="System Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The recommendation system has two stages:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Stage 1: Candidate generation
&lt;/h4&gt;

&lt;p&gt;This stage provides broad personalization via collaborative filtering. It retrieves a small subset of videos, narrowing the corpus from millions of candidates down to hundreds. Similarity between users is measured with coarse features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;videos watched&lt;/li&gt;
&lt;li&gt;search query tokens&lt;/li&gt;
&lt;li&gt;demographics (geographic region, device, gender, and age)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g8yJX2qe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/caqaxry6h25535ycummx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g8yJX2qe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/caqaxry6h25535ycummx.png" alt="Candidate generation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The input layer is followed by several layers of fully connected Rectified Linear Units (ReLU).&lt;/p&gt;
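As a rough illustration of that tower, stacked fully connected ReLU layers can be sketched in plain Python with NumPy; the layer sizes, random weights, and function name here are invented for illustration and are not from the paper.

```python
import numpy as np

# Illustrative sketch only: a stack of fully connected layers with ReLU
# activations, using made-up sizes and random weights.
rng = np.random.default_rng(0)

def relu_tower(x, layer_sizes):
    """Pass x through fully connected layers, each followed by a ReLU."""
    for width in layer_sizes:
        w = rng.normal(scale=0.1, size=(x.shape[-1], width))
        b = np.zeros(width)
        x = np.maximum(x @ w + b, 0.0)  # fully connected, then ReLU
    return x
```

In the paper's candidate-generation network, the output of such a tower feeds a softmax over the video corpus at training time; the sketch above only shows the hidden-layer structure.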

&lt;h4&gt;
  
  
  Stage 2: Ranking
&lt;/h4&gt;

&lt;p&gt;During ranking, the system can take advantage of many more features describing the video and the user’s relationship to it, because only a few hundred videos are being scored.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ranking is crucial for ensembling different candidate sources.&lt;/li&gt;
&lt;li&gt;Assign an independent score to each video impression using logistic regression.&lt;/li&gt;
&lt;li&gt;Final ranking objective is constantly being tuned based on live A/B testing results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7P5ZaPrx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zrtuh3izh10jidyorj3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7P5ZaPrx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zrtuh3izh10jidyorj3j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  Features
&lt;/h4&gt;

&lt;p&gt;Next I would like to discuss some of the feature selection decisions, which are driven by business considerations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our goal is to predict expected watch time given training examples that are either positive (the video impression was clicked) or negative (the impression was not clicked). Positive examples are annotated with the amount of time the user spent watching the video.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Why is expected watch time the optimization goal, rather than click-through rate or play rate?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Watch time better reflects users’ real interest; the most important signals are those describing a user’s interaction with the video. From a business perspective, longer watch time means more ad revenue for YouTube, and increasing watch time also serves the site’s long-term interest in user stickiness. In other words, the objective aligns with YouTube’s business considerations.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Recommending recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;How to introduce the bias of fresh content in the model?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Machine learning models are often biased toward the past because they are trained on historical examples. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;...using the age of the training example as an input feature removes an inherent bias towards the past and allows the model to represent the time-dependent behavior of popular videos. &lt;/p&gt;

&lt;p&gt;...we feed the age of the training example as a feature during training. At serving time, this feature is set to zero (or slightly negative) to reflect that the model is making predictions at the very end of the training window.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y58R9Kb6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/39jvi9wdmdlne605kchv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y58R9Kb6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/39jvi9wdmdlne605kchv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the graph above shows, this makes a big difference. The green line is the actual distribution; the red line shows the model’s predictions with the example-age feature, and the blue line shows them without it. The red line matches the actual distribution much more closely.&lt;/p&gt;
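A toy sketch of how such an example-age feature might be computed; the constant, the function name, and the units are hypothetical illustrations, not the paper's actual code.

```python
# Hypothetical sketch of the "example age" feature: during training, each
# example carries its age relative to the end of the training window; at
# serving time the feature is set to zero so the model predicts for "now".
TRAINING_WINDOW_END = 1_600_000_000  # unix seconds; illustrative value

def example_age_days(log_timestamp, serving=False):
    """Age of a training example in days; zero at serving time."""
    if serving:
        return 0.0
    return (TRAINING_WINDOW_END - log_timestamp) / 86_400.0
```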




&lt;p&gt;&lt;em&gt;In preprocessing the training examples, why do they extract an equal number of samples per user?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Another key insight that improved live metrics was to generate a fixed number of training examples per user, effectively weighting our users equally in the loss function. This prevented a small cohort of highly active users from dominating the loss.        &lt;/p&gt;
&lt;/blockquote&gt;
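A hedged sketch of that per-user cap; the function name, sampling scheme, and data layout below are my own illustration, not the paper's implementation.

```python
import random

# Illustrative sketch: cap the number of training examples drawn from
# each user so that highly active users do not dominate the loss.
def cap_examples_per_user(examples_by_user, n_per_user, seed=0):
    rng = random.Random(seed)
    capped = []
    for user in sorted(examples_by_user):
        examples = examples_by_user[user]
        k = min(len(examples), n_per_user)
        capped.extend(rng.sample(examples, k))  # sample without replacement
    return capped
```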




&lt;p&gt;&lt;em&gt;Why is the time sequence of the user’s viewing history completely discarded, with recent history treated as an unordered set?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the model pays too much attention to timing, recommendations become dominated by whatever the user just watched or searched for. The authors give an example: if a user has just issued a search query for “taylor swift”, filling the next homepage with “taylor swift” videos is probably not a good idea.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By discarding sequence information and representing search queries with an unordered bag of tokens, the classifier is no longer directly aware of the origin of the label.&lt;/p&gt;
&lt;/blockquote&gt;
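As a small illustration of that representation, an unordered bag of tokens makes token order invisible to the model; the function here is my own toy example, not the paper's code.

```python
from collections import Counter

# Illustrative only: a search query represented as an unordered bag of
# tokens. Two queries with the same words in any order look identical.
def bag_of_tokens(query):
    return Counter(query.lower().split())
```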

&lt;p&gt;This paper is a representative example of using a DNN as a recommendation system, and it combines business reality with user scenarios. I highly recommend reading it.&lt;/p&gt;

&lt;p&gt;References&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kknews.cc/code/b86mpo9.html"&gt;https://kknews.cc/code/b86mpo9.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kknews.cc/code/k4qeaz8.html"&gt;https://kknews.cc/code/k4qeaz8.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/using-deep-neural-networks-to-make-youtube-recommendations-dfc0a1a13d1e"&gt;Using Deep Neural Networks to make YouTube Recommendations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://merchdope.com/youtube-stats/"&gt;37 Mind Blowing YouTube Facts, Figures and Statistics – 2019!&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Build a decision tree in R</title>
      <dc:creator>Yu Fen Lin</dc:creator>
      <pubDate>Fri, 20 Dec 2019 19:11:19 +0000</pubDate>
      <link>https://dev.to/yufenlin/build-a-decision-tree-in-r-5hc0</link>
      <guid>https://dev.to/yufenlin/build-a-decision-tree-in-r-5hc0</guid>
      <description>&lt;h1&gt;
  
  
  Overview &amp;amp; Purpose
&lt;/h1&gt;

&lt;p&gt;In this article, we will build a decision tree model on the Titanic data set that predicts whether a given person survived.&lt;/p&gt;

&lt;h1&gt;
  
  
  Steps:
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Initial data understanding and preparation&lt;/li&gt;
&lt;li&gt;Build, train, and test the model&lt;/li&gt;
&lt;li&gt;Evaluate the performance of the model&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  1. Understanding the data set
&lt;/h1&gt;

&lt;p&gt;We will use the &lt;a href="https://www.rdocumentation.org/packages/titanic/versions/0.1.0" rel="noopener noreferrer"&gt;Titanic Passenger Survival Data Set&lt;/a&gt;. This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner "Titanic", summarized according to economic status (class), sex, age, and survival. Below is a brief description of the 12 variables in the data set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PassengerId:&lt;code&gt;&amp;lt;int&amp;gt;&lt;/code&gt; Serial Number&lt;/li&gt;
&lt;li&gt;Survived:&lt;code&gt;&amp;lt;int&amp;gt;&lt;/code&gt; Contains binary values of 0 and 1

&lt;ul&gt;
&lt;li&gt;0: Passenger did not survive &lt;/li&gt;
&lt;li&gt;1: Passenger Survived&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Pclass: &lt;code&gt;&amp;lt;int&amp;gt;&lt;/code&gt; Ticket Class - 1st Class, 2nd Class or 3rd Class Ticket&lt;/li&gt;

&lt;li&gt;Name: &lt;code&gt;&amp;lt;chr&amp;gt;&lt;/code&gt; Name of the passenger&lt;/li&gt;

&lt;li&gt;Sex: &lt;code&gt;&amp;lt;chr&amp;gt;&lt;/code&gt; Male or Female&lt;/li&gt;

&lt;li&gt;Age: &lt;code&gt;&amp;lt;dbl&amp;gt;&lt;/code&gt; Age in years &lt;/li&gt;

&lt;li&gt;SibSp: &lt;code&gt;&amp;lt;int&amp;gt;&lt;/code&gt; No. of Siblings / Spouses — brothers, sisters and/or husband/wife&lt;/li&gt;

&lt;li&gt;Parch: &lt;code&gt;&amp;lt;int&amp;gt;&lt;/code&gt; No. of parents/children — mother/father and/or daughter, son&lt;/li&gt;

&lt;li&gt;Ticket: &lt;code&gt;&amp;lt;chr&amp;gt;&lt;/code&gt; Ticket number&lt;/li&gt;

&lt;li&gt;Fare: &lt;code&gt;&amp;lt;dbl&amp;gt;&lt;/code&gt; Passenger fare&lt;/li&gt;

&lt;li&gt;Cabin: &lt;code&gt;&amp;lt;chr&amp;gt;&lt;/code&gt; Cabin Number&lt;/li&gt;

&lt;li&gt;Embarked: &lt;code&gt;&amp;lt;chr&amp;gt;&lt;/code&gt; Port of Embarkment 

&lt;ul&gt;
&lt;li&gt;C: Cherbourg &lt;/li&gt;
&lt;li&gt;Q: Queenstown &lt;/li&gt;
&lt;li&gt;S: Southampton&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Load necessary data
&lt;/h3&gt;

&lt;p&gt;Remove all objects in the Global Environment and load titanic data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rm(list = ls())

# install necessary packages
library(tidyverse) 
install.packages("titanic") 
# load necessary packages
library(titanic)
# load necessary data 
titanic &amp;lt;-
  titanic_train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take a look.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;titanic %&amp;gt;%
  View(title = "Titanic")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Produce summaries of the data.&lt;br&gt;
summary() is an important function that summarises each attribute in the data set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; summary(titanic)
  PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     

     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     
    Ticket               Fare           Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two empty strings ("") in Embarked. Drop those rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; titanic$Embarked[grepl("^\\s*$", titanic$Embarked)] 
[1] "" ""
&amp;gt; titanic &amp;lt;- droplevels(titanic[!grepl("^\\s*$", titanic$Embarked),,drop=FALSE])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are also 177 NAs in Age. Fill them with the mean age.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; summary(titanic$Age) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.42   20.00   28.00   29.64   38.00   80.00     177 

&amp;gt; titanic$Age[is.na(titanic$Age)] &amp;lt;- 
  round(mean(titanic$Age, na.rm = TRUE))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set categorical variables. Variables can be classified as categorical or quantitative. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categorical variables take on values that are names or labels. ex. Embarked in our dataset.&lt;/li&gt;
&lt;li&gt;Quantitative variables are numerical. They represent a measurable quantity. ex. Age in our dataset.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;titanic$Survived = as.factor(titanic$Survived)
titanic$Pclass = as.factor(titanic$Pclass)
titanic$Embarked = as.factor(titanic$Embarked)
titanic$Sex_num = if_else(titanic$Sex == "male",
                          1,
                          0)
titanic$Sex_num = as.factor(titanic$Sex_num)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, the data is ready to use.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Build, train, and test the model
&lt;/h1&gt;

&lt;p&gt;Choose the variables to model with: Survived, Pclass, Age, SibSp, Parch, Fare, Sex_num, and Embarked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df &amp;lt;- 
  titanic %&amp;gt;% 
  select(Survived, Pclass, Age, SibSp, Parch, Fare, Sex_num, Embarked)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the target variable, Survived. Good: there is no huge class imbalance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; df %&amp;gt;%count(Survived)
# A tibble: 2 x 2
  Survived     n
  &amp;lt;fct&amp;gt;    &amp;lt;int&amp;gt;
1 0          549
2 1          340
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the distribution and correlation between variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(psych)
pairs.panels(df,
             ellipses=FALSE,
             pch = 19, 
             hist.col="blue")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdv98kgo9c6qsl0epfdd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdv98kgo9c6qsl0epfdd2.png" alt="Distribution and correlation between variables"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Split the data into training and test sets, using 75% for training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(caret) 
set.seed(2019)
trainIndex &amp;lt;- createDataPartition(df$Survived, p=0.75, list = FALSE)
train &amp;lt;- df[trainIndex,]
test &amp;lt;- df[-trainIndex,]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the decision tree model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree &amp;lt;- rpart(Survived ~., data=train, method='class')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What does the decision tree look like?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(rpart)
prp(tree,         
    faclen=0,           
    fallen.leaves=TRUE, 
    shadow.col="gray",
    ) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fen1d218c3oby4xejh6ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fen1d218c3oby4xejh6ae.png" alt="Decision Tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another, fancier way to look at the decision tree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(rpart.plot)
rpart.plot(tree)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffd0yoa07dagyqzf3ml6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffd0yoa07dagyqzf3ml6d.png" alt="rpart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Evaluate the performance of the model
&lt;/h1&gt;

&lt;p&gt;Use test data to evaluate the performance of the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_test &amp;lt;-
  test %&amp;gt;%
  select(Pclass, Age, SibSp, Parch, Fare, Sex_num, Embarked)
pred &amp;lt;- predict(tree, newdata = X_test, type = "class")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculate confusion matrix and plot it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;confus.matrix &amp;lt;- table(real=test$Survived, predict=pred)
fourfoldplot(confus.matrix, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "Confusion Matrix")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1rxwxsk0i22mbx19v6ks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1rxwxsk0i22mbx19v6ks.png" alt="confus matrix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The accuracy of the model&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; sum(diag(confus.matrix))/sum(confus.matrix)
[1] 0.8333333
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hope you found this article helpful. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is TensorFlow?</title>
      <dc:creator>Yu Fen Lin</dc:creator>
      <pubDate>Mon, 02 Dec 2019 15:48:00 +0000</pubDate>
      <link>https://dev.to/yufenlin/what-is-tensorflow-9h7</link>
      <guid>https://dev.to/yufenlin/what-is-tensorflow-9h7</guid>
<description>&lt;p&gt;In this post, we are going to see what TensorFlow is, why it matters, and how it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/0905c7d634421f8aa4ab3ddf19a582572df568e1/68747470733a2f2f7777772e74656e736f72666c6f772e6f72672f696d616765732f74665f6c6f676f5f736f6369616c2e706e67" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/0905c7d634421f8aa4ab3ddf19a582572df568e1/68747470733a2f2f7777772e74656e736f72666c6f772e6f72672f696d616765732f74665f6c6f676f5f736f6369616c2e706e67" alt="TensorFlow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What is TensorFlow?
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;“An end-to-end open source machine learning platform.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Deep learning is a subfield of machine learning. There are many deep learning frameworks out there. This is a brief overview of the most popular and widely used frameworks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UrFtNlqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://analyticsindiamag.com/wp-content/uploads/2018/04/all_libraries.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UrFtNlqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://analyticsindiamag.com/wp-content/uploads/2018/04/all_libraries.png" alt="Deep Learning Frameworks"&gt;&lt;/a&gt;&lt;br&gt;
Deep Learning Frameworks(Image &lt;a href="https://analyticsindiamag.com/evaluation-of-major-deep-learning-frameworks/"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;TensorFlow is the world’s most popular deep learning library and is Python-friendly and open source. It is a powerful dataflow-oriented library developed by the Google Brain team, designed to be easy to use and widely applicable to both numeric and neural network oriented problems. It was released under the Apache License 2.0 on November 9, 2015. Google uses machine learning across its products to improve search, translation, image captioning, and recommendations. The latest major release, TensorFlow 2.0, shipped in October 2019. &lt;/p&gt;

&lt;p&gt;Many well-known companies such as NVIDIA, Twitter, Snapchat, and Uber use TensorFlow in their major operations and research.&lt;br&gt;
TensorFlow appears in over 70K open source repositories on GitHub, with over 780K commits so far.&lt;/p&gt;

&lt;h1&gt;
  
  
  How does TensorFlow work?
&lt;/h1&gt;

&lt;p&gt;TensorFlow is cross-platform. It was built to run on multiple CPUs or GPUs and even on mobile operating systems, and it has wrappers in several languages, such as Python, C++, and Java.&lt;br&gt;
Given below is the Tensorflow ecosystem:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UUeIeB_N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2018/04/main-qimg-b7eeb73f04896b36b3baaf6b447426c4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UUeIeB_N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2018/04/main-qimg-b7eeb73f04896b36b3baaf6b447426c4.png" alt="an ecosystem of Tensorflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TensorFlow uses dataflow graphs for its calculations. A graph describes how data moves through a series of processing nodes.&lt;br&gt;
Nodes in the graph represent math operations, and the edges represent multidimensional data arrays (tensors) communicated between them.&lt;br&gt;
The name “TensorFlow” is derived from the operations which neural networks perform on multidimensional data arrays or tensors! It’s literally a flow of tensors. &lt;/p&gt;
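To make the idea concrete, here is a toy dataflow graph in plain Python. This only illustrates the concept of nodes as operations and edges as values; it is not TensorFlow's actual API.

```python
# Toy dataflow graph: each node is an operation, and edges carry values
# (here plain numbers standing in for tensors).
GRAPH = {
    "a":   ("const", 2.0),
    "b":   ("const", 3.0),
    "mul": ("mul", "a", "b"),    # a * b
    "out": ("add", "mul", "a"),  # (a * b) + a
}

def evaluate(graph, node):
    """Evaluate a node by recursively evaluating its input edges."""
    op = graph[node]
    if op[0] == "const":
        return op[1]
    x = evaluate(graph, op[1])
    y = evaluate(graph, op[2])
    return x * y if op[0] == "mul" else x + y
```

Evaluating `"out"` flows the constants through `"mul"` and then `"add"`, exactly the way data flows through a TensorFlow graph's processing nodes.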

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZJYJTUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.tensorflow.org/images/tensors_flowing.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZJYJTUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.tensorflow.org/images/tensors_flowing.gif" alt="Alt text of image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TensorBoard: TensorFlow's visualization toolkit
&lt;/h2&gt;

&lt;p&gt;TensorBoard provides the visualization and tooling needed for machine learning experimentation. It lets us visualize graphs, plot quantitative metrics about their execution, and show additional data, such as images, that pass through them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gibHFKU1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.tensorflow.org/tensorboard/images/tensorboard.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gibHFKU1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.tensorflow.org/tensorboard/images/tensorboard.gif" alt="TensorBoard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Japanese farmer turned programmer used it to sort cucumbers by quality, a task that previously only his mother and father could do, thanks to their years of experience.&lt;br&gt;
That's the power of machine learning, which is now available to everyone because of TensorFlow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install TensorFlow 2
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;# Current stable release for CPU-only&lt;/code&gt;&lt;br&gt;
&lt;code&gt;pip install tensorflow&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/tensorflow"&gt;TensorFlow GitHub org&lt;/a&gt;, you can find not only TensorFlow itself, but a useful ecosystem of other repos, including models, serving, TensorBoard, Project Magenta, and many more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.tensorflow.org/"&gt;TensorFlow Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data-flair.training/blogs/tensorflow-tutorials-home/"&gt;A Complete Guide to TensorFlow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data-flair.training/blogs/tensorflow-tutorial/"&gt;Tensorflow Tutorial – Tensorflow Ecosystem0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/TensorFlow"&gt;Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Why Data Science</title>
      <dc:creator>Yu Fen Lin</dc:creator>
      <pubDate>Fri, 08 Nov 2019 11:52:16 +0000</pubDate>
      <link>https://dev.to/yufenlin/data-science-the-key-to-the-answer-2ada</link>
      <guid>https://dev.to/yufenlin/data-science-the-key-to-the-answer-2ada</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Finding what people really want is one important thing I care about.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I have been a software and mechanical engineer. As an engineer, I tried to make sure my products met the needs of my users as much as possible. When interviewing users and analyzing their requirements, I ran into a problem: it is difficult for people to clearly say what they need and what they really want. There is a big gap between what people say and what they actually want. Data science is a powerful tool for discovering people’s real needs. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Asking, “Why?” is the key to everything.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data science is the study of data. Data analysis is a process whose goal is discovering useful information, informing conclusions, and supporting decision-making. Effective analysis requires obtaining relevant facts to answer questions and support a conclusion or opinion. These insights help people make powerful data-driven decisions.&lt;br&gt;
Here is a list of data science deliverables: prediction, classification, recommendations, pattern detection and grouping, anomaly detection, recognition, actionable insights, automated processes and decision-making, optimization, and forecasts.&lt;br&gt;
Whatever questions you may have, data science can provide an objective answer.&lt;/p&gt;

&lt;p&gt;Data science is becoming pervasive across many industries and can be applied to many different domains. It is widely used in e-commerce, banking, health care, and even the film industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;In the future&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I would like to apply data science to the mechanical engineering field: machine monitoring, production forecasting, failure prevention, reliability improvement, and warranty analysis. In automated production, collecting and monitoring data allows us to replace components and maintain machinery at the right time to optimize productivity.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@tchompalov?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Vlad Tchompalov&lt;/a&gt; on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
