Originally published at toul.io

30 days of posting to TikTok: I share the exact steps of how I used Machine Learning to make my most successful video

During the month of March, I decided to try out TikTok for 30 days with hopes of growing my profile.
I was inspired to follow the advice of posting 3 times a day for 30 days from Dr. Jen Golbeck, who I found on TikTok as @jengolbeck.
Dr. Jen Golbeck teaches a class on going viral on social media, so I figured their advice was worth trying out.

Here is a Twitter thread with the stats of my profile at the start of the experiment

Strategy to post 3 times a day

My basic strategy was to keep trying different topics until something stuck. I chose topics based on things that I knew about and could easily share in the form of a 10-60 second video.

I made it as easy as possible to meet the requirement of posting 3 times a day by talking about things I had experience with.

On day 10 (3-10-2021) I had a video that grew my account from 15 followers to 500+ over a few days and received 5x the views of all of the other videos prior to that point.

Here's a link to the video for those that are interested

I didn't want to base my decision on one lucky video, and I wanted to build a dataset with as many data points and different feature values as possible, so I kept posting about things unrelated to the video that blew up by chance.

Now, to the completely extra analysis of the 30 days of video data.

Hint: my video that went viral (in comparison to all other videos in my account) is related to Data Analytics, so I'm also writing this for those viewers and my followers @techtok_career_guide who are interested in Data Analytics.

If you're not interested in that and just want to see the results, then scroll to the 'Conclusion' section.

The Completely Extra Analysis: Figuring Out Which of the Content I Produced Is Engaging to TikTok Viewers

If you'd like to see screenshots of the code snippets' outputs, then consider reading the original post; otherwise, this post will contain code snippets only.

0.) Data Set: The start of all data analytics

I went through all of my videos and built my own dataset with features that I thought might be important.

Unfortunately, TikTok only keeps analytics data for 7 days at a time and doesn't share many features of the data beyond timestamp, likes, and so on.

By building my dataset by hand I could add extra features about each video, such as a 'Video has cover' column.

If you plan to follow along, then go here for the data; if you want to analyze your own channel, then go through each of your videos to create a dataset.

The steps I share will be much the same.

Please do reach out to me if you get stuck at any point.
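For reference, here is roughly the set of columns my spreadsheet ended up with, one row per video (these names match the features discussed in the sections below; if you're analyzing your own channel, your columns may differ):

# Rough sketch of the columns in the hand-built dataset (one row per video)
columns = [
    'Video_Name', 'Upload_time', 'Day', 'Length', 'Views', 'Likes',
    'Comments', 'Shares', 'avg_watch', 'Hashtags', 'Tech', 'Topic',
    'Cover', 'Q_and_A', 'CTA', 'has_captions', 'Humor', 'Location', 'Hat'
]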

1.) Load the Data

The first thing to do is to load the data into a Jupyter notebook, and to do that we'll be using the Pandas library.

import pandas as pd

# Create a dataframe; the converter splits the comma-separated
# 'Hashtags' string into a Python list for each row
df = pd.read_csv('tik_tok_analytics.csv', converters={'Hashtags': lambda x: x.split(',')})

If you're unfamiliar with Jupyter Notebook then here's a quick tutorial I previously wrote.

A dataframe (df) is like an Excel spreadsheet in code form.
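If the analogy helps, here's a tiny sketch of a dataframe built from scratch, just to show the spreadsheet-like structure (toy values, nothing to do with the TikTok data):

# A toy dataframe: columns are like spreadsheet columns, rows like spreadsheet rows
toy = pd.DataFrame({'Views': [100, 250], 'Likes': [10, 30]})
print(toy)
#    Views  Likes
# 0    100     10
# 1    250     30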

Now that the data has been loaded, let's do some basic exploration to get to know the data better.

2.) Exploring the data

Let's see what the dimensions of the data are (number of rows by number of columns).

# Rows by columns
print(len(df), len(df.columns))

Now that we know the dimensions of the dataset, let's take a quick glance at the data's statistics.

# A look at the statistics from data 
df.describe()

Mean: The average value of the data.

Std: The volatility of the values, or how much they vary; the bigger the number, the more spread out the data values are.
From the dataset it seems that Upload_time, Length, Views, Likes, and Comments are the features that vary the most.
Now, let's see what the dataframe looks like.

# Head shows the first top 5 from the data frame
df.head()

As you can see the DataFrame really does look like an Excel spreadsheet, which is not a bad way of thinking about DataFrames.

3.) About the Data Set features - Different Types

Since I put together the dataset by hand, I'm intimately familiar with its features; however, I'm going to go into detail for the benefit of the reader.

Text: 'Video_Name' is included in the dataset so that it'll be easier to know which data point comes from which video, but when it comes time to do calculations we'll drop the column. 'Hashtags' is another example, but we'll keep it around by transforming it.

Binary Data: 'Tech', 'Cover', 'Q_and_A', 'has_captions', 'Humor'

  • 0=> doesn't have it
  • 1=> does have it

Qualitative Data: 'Upload_time', 'Day', 'Location', 'Hat'

  • Upload_time => out of 24 hrs in a day, e.g. 0:00 => 12 AM
  • Day => out of 7 days of the week, e.g. 1 => Mon.
  • Location => where I filmed, e.g. 1 => House
  • Hat => what kind of hat I was wearing, e.g. 2 => Beanie

Quantitative Data: 'Views', 'Shares', 'Comments', 'Likes', 'Hashtags', 'Length', 'avg_watch'

  • The numerical value of the data, e.g. video 2 had 1000 views
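Since the qualitative features are stored as numeric codes, it helps to keep a small lookup table so you remember what each code stands for. A minimal sketch using only the example codes listed above (any other codes are up to your own scheme):

# Lookup tables for the numeric codes (only the examples given above are shown)
day_codes = {1: 'Monday'}
location_codes = {1: 'House'}
hat_codes = {2: 'Beanie'}
print(day_codes[1], location_codes[1], hat_codes[2])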

4.) Data Pre-processing / cleaning

Now that we're somewhat familiar with the data, it is time to make sure it doesn't have any missing values; chances are it will, because the dataset was created by hand.

It is a good practice to check this with any dataset.

df.isnull().sum()

Uh-oh, looks like the data has some missing values.
There are two popular ways of handling this: (1) eliminate the rows with missing data, or (2) fill the missing data with the column's average value (called imputation).

To decide between the two, think about how many data points we have (102 rows) and how many missing values we have (six).

Eliminating the incomplete rows would throw away (6 rows_with_missing_data / 102 rows_with_data) * 100 = 5.88, which is almost 6% of our data.

So in this case elimination is not a good idea, given that the dataset is already small.

Typically, elimination would be an okay choice if doing so would delete less than 2% of the data.
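For comparison, here's roughly what the elimination route and the percentage calculation from above look like in code (a sketch; it assumes the dataframe df loaded earlier):

# How much data would elimination cost us?
missing_rows = df.isnull().any(axis=1).sum()   # rows with at least one missing value
pct_lost = missing_rows / len(df) * 100        # e.g. 6 / 102 * 100 = 5.88
print(f"Dropping incomplete rows would remove about {pct_lost:.2f}% of the data")

# The elimination route itself would simply be:
# df = df.dropna()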

Imputation to replace missing values

Now, that the decision has been made let's go ahead and replace the values with the average of the feature.

# A column can be accessed via df.Feature_name or df['Feature_name'].
# inplace=True means the dataframe is updated in place
# and you do not need to reassign the column, e.g.
# df.Shares = df.Shares.fillna(df.Shares.mean()) vs. what is shown below.

df.Hat.fillna(df.Hat.mean(), inplace=True)
df.Shares.fillna(df.Shares.mean(), inplace=True)
df.Location.fillna(df.Location.mean(), inplace=True)
df.Humor.fillna(df.Humor.mean(), inplace=True)
df.Tech.fillna(df.Tech.mean(), inplace=True)
df.isnull().sum()
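As a side note, if you have many columns to impute, the same thing can be done more compactly by filling several columns at once (an equivalent sketch, assuming the same column names):

# Equivalent, more compact imputation for several columns at once
cols = ['Hat', 'Shares', 'Location', 'Humor', 'Tech']
df[cols] = df[cols].fillna(df[cols].mean())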

Handling Text Data

There are two text features in the dataset, 'Hashtags' and 'Video_Name', but only one is interesting to me: 'Hashtags'.

Each row of the df.Hashtags column contains a list of text hashtags, and some videos share the same hashtags, so the question I want to answer is: 'Do hashtags play any role in predicting anything?'

Hence, I'm going to transform it into Binary Data for each video, and in doing so expand the number of features in the dataset.

# Split each video's comma-separated hashtags into their own
# columns; to keep track of them, prefix the new columns with 'hashtag_'
split_df = df.Hashtags.apply(pd.Series).add_prefix('hashtag_')

Uh-oh, we see that there are NaN values; that's no good, time to replace them.

A NaN in this case is the same as saying 'hey, this row and column doesn't have anything here', i.e. it is zero.

Otherwise it'll have the text value there.

Hence, what we'll then do is replace NaN with 0 and any text value with 1.

But before that, we'll combine the two dataframes into one by adding 'split_df' to the original df like so.

df = pd.concat([df, split_df],axis=1)
df

Now, we'll replace the NaN values.

df.fillna(0, inplace=True)

df.head()

Next, we'll replace the text values of the hashtag_n features with 1 to indicate that the video has that hashtag.

# Iterate through the 'hashtag_n' columns: if the value is a string,
# mark it 1 (the video has that hashtag), otherwise mark it 0
# (the NaNs were already replaced with 0 in the previous step)
for i in range(10):  # 10 b/c that's how many hashtag columns there are
    h = 'hashtag_' + str(i)
    df[h] = df[h].apply(lambda x: 1 if type(x) == str else 0)

df

You'll now see that the dimensions of the dataset are 102 rows by 27 columns!
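A quick way to double-check the new dimensions (assuming the same df as above):

# Confirm the new shape as (rows, columns); should print something like (102, 27)
print(df.shape)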

Now that the data has been pre-processed and some features added to make it a little more interesting, we'll move on to finding which features seem to make the difference in engagement for a video.

5.) Using Machine Learning to Figure out which Features matter

Unfortunately, TikTok does not provide an Engagement value or metric in the data that is available.

So for the purposes of this exploration, Engagement will be defined as follows, using the data I have collected on my videos:

The sum of shares, comments, and likes, plus the video-completion ratio (average watch time / length), all divided by Views.

In code that'll look like this,

# Combine into one 'Engagement' column (bracket assignment ensures
# pandas creates a real column rather than a dataframe attribute)
df['Engagement'] = (df.Shares + df.Comments + df.Likes + (df.avg_watch / df.Length)) / df.Views

As you can see, my videos aren't very engaging, but nevertheless there is a difference in engagement level between videos.

Now, let's use machine learning to try and figure out what features influence this observed difference.

5.a) Use a Random Forest Model to Find Important Features

I decided to use a tree-based model over Linear Regression because it is unknown whether the underlying relationships in the data are linear.

If it were known that the relationships are linear, then Linear Regression would be the better choice.
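To make the distinction concrete, here's a tiny self-contained sketch (synthetic data, not my TikTok data) of a non-linear relationship that a tree-based model captures easily while plain Linear Regression struggles:

# Synthetic, clearly non-linear relationship: y = x^2
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = X_demo.ravel() ** 2

# R^2 on the training data: near 0 for the linear model, near 1 for the forest
print(LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo))
print(RandomForestRegressor(random_state=0).fit(X_demo, y_demo).score(X_demo, y_demo))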

5.b) Setting up the data so that it can be used

The first thing we are going to do is reorder the features so that the first column is the one we are trying to 'predict', in this case 'Engagement'.
Notice that Video_Name was also dropped, since it is not an interesting feature; it's only there so we can go back later and look up the video.

# Reorder the columns so that the target ('Engagement') comes first;
# later the first column will be assigned to y and the rest to X.
# Columns not listed here (e.g. Video_Name) are dropped by the reindex.
columns_titles = [
    'Engagement', 
    'Tech',
    'Upload_time',
    'Topic',
    'Day',
    'Length',
    'Q_and_A', 
    'CTA', 
    'Cover', 
    'Location',
    'Hat',
    'has_captions', 
    'Humor',
    'hashtag_0', 
    'hashtag_1',
    'hashtag_2', 
    'hashtag_3',
    'hashtag_4', 
    'hashtag_5', 
    'hashtag_6',
    'hashtag_7', 
    'hashtag_8', 
    'hashtag_9' 
]
df = df.reindex(columns=columns_titles)
5.b.1) Normalize the Engagement feature

First, normalize the Engagement feature. I am rounding the values to group similar decimals together, e.g. 0.0023 and 0.0022 both become 0.002.

import numpy as np

# Rounding the values to one decimal place
# so that they're easier to group for later use.
df['Engagement'] = round(df['Engagement'], 1)

# Normalize - used in the next step
def normalize(x, col_max):
    if x == -1:
        return np.nan
    else:
        return x / col_max

Inspect the data to see the grouping of labeled data.

df['Engagement'] = df['Engagement'].apply(lambda x: normalize(x, df['Engagement'].max()))
df.groupby('Engagement').size()

From here we can see there are 3 classes of Engagement: none (0.0), some (0.5), and more engaging (1.0).

However, because the dataset is so small and because this is an intro to Machine Learning, the problem of engagement will be reduced to either engaging or not.

import numpy as np
df['Engagement'] = np.where((df['Engagement'] == 0.0 ), int(0), df['Engagement'])
df['Engagement'] = np.where((df['Engagement'] != 0.0 ), int(1), df['Engagement'])

# Check that there are two classes 
df.groupby('Engagement').size()

Since Engagement was added via a calculation we derived ourselves, it is easy to know which features will influence it, purely from the math (Views, Shares, Likes, i.e. everything in the formula).

Therefore, to make it interesting, we'll remove every feature that was used in the formula and see whether there is a connection between Engagement and the remaining features.

The exception is Length, because it is useful metadata about the video and, unlike the other values used in the calculation, it is not generated by the viewers of the video.

columns_titles = [
    'Engagement',
    'Upload_time',
    'Q_and_A', 
    'Length',
    'Hat', 
    'has_captions',
    'Humor',
    'Location',
    'hashtag_1',
    'hashtag_2', 
    'hashtag_3', 
    'hashtag_5'
]
df = df.reindex(columns=columns_titles)

5.b.2) Splitting the data into Training and Test Data

The features will be called X, which is a convention you'll see in other machine learning related posts.

To y we'll assign the feature we are trying to predict, 'Engagement', again by convention.

from sklearn.model_selection import train_test_split

# Splitting the features
X, y = df.iloc[:, 1:].values, df.iloc[:,0].values

Next, we'll create the training and test groups of data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

train_test_split: does the work of cutting X and y into smaller groups of data. It is good practice to have training data and test data so that the model can be evaluated on how well it does on data it hasn't seen.

test_size: The fraction of the data to hold out for testing. If it is 40% (0.4), then 60% (0.6) will be used for training. Because the dataset is so small, we'll select 0.2 so that the algorithm has as much data as possible to train with.

random_state: can be any number and fixes how the data is shuffled; otherwise, each time we ran the notebook the results would be slightly different.
stratify: keeps the proportion of the target label 'Engagement' the same in both the training and test splits.
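A quick way to convince yourself that stratify did its job is to compare the class proportions before and after the split (a sketch assuming the arrays created above):

# Class proportions should be roughly the same in the full data and in each split
import numpy as np
print(np.bincount(y.astype(int)) / len(y))
print(np.bincount(y_train.astype(int)) / len(y_train))
print(np.bincount(y_test.astype(int)) / len(y_test))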

5.c) Outlier Detection

We'll use a machine learning algorithm to identify samples in the training data that are outliers. The reason to do so is to improve the performance of the model we train afterwards.

Otherwise, outliers will make it hard to fit the model to the data. In general it is a best practice to minimize the influence of outliers.

Typically, finding outliers is done by hand by plotting a box-and-whisker chart or a scatterplot.
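If you'd like to eyeball it first, the manual route looks something like this (a sketch; 'Length' is just an example column from the dataframe used earlier):

# Manual outlier check: box-and-whisker plot for a single feature
import matplotlib.pyplot as plt
plt.boxplot(df['Length'])
plt.title('Length (seconds)')
plt.show()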

However, an ML model itself can be used to identify which data points are outliers, so we'll remove them automatically using IsolationForest.

from sklearn.ensemble import IsolationForest
print(X.shape, y.shape)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


# find outliers in a training set
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)

# select all rows that are not outliers 
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

print(X_train.shape, y_train.shape)

From the printed shapes you can see that 8 rows were removed from the X training data.

5.d) Scale the data

Even though we've already removed outliers from the training data, we'll go the extra step of scaling the features so that no single feature has more impact than the others just because of its units, which might fool the algorithm.

from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

5.e) Let Random Forest find the Important Features

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

feat_labels = df.columns[1:]
forest = RandomForestClassifier(n_estimators=1000, random_state=1)
# Tree-based models are insensitive to feature scaling, so fitting on the
# unscaled X_train works here
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the features ranked from most to least important
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

# Plot the ranked importances (labels sorted to match the bars)
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()

Now that the data has been pre-processed, we can use RandomForest to identify the important features of the dataset.

Based on the findings of RandomForest, it seems that 'Upload_time' and 'Length' are the two most important features for Engagement on my channel. To further verify the findings, we can use SelectFromModel.

from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance is above the threshold
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold:', X_selected.shape[1])

for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))


This directly tells us the two most important features, which is exactly what we want.
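As a follow-up, you can also pull the names of the selected features straight out of the SelectFromModel object (a sketch assuming the objects defined above):

# Which feature names passed the importance threshold?
selected = feat_labels[sfm.get_support()]
print(list(selected))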

Conclusion

Based on my channel's performance, the most important things for me to do are:

1.) Keep my videos short: 10s is optimal at the short end and 20s at the long end

2.) Upload when my user base is most active, which is from 7 PM to 9 PM

3.) Not stress about hashtags, because it seems they don't really matter for my videos

Testing the Conclusion

I created another video that met all of the important criteria from the data analysis above, and it is now my best-performing video to date in terms of comments, likes, and average watch time.
