<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jinhoon Chung</title>
    <description>The latest articles on DEV Community by Jinhoon Chung (@ddjh20482).</description>
    <link>https://dev.to/ddjh20482</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F834414%2F2c48a293-d887-4813-a2bd-22186da7a8c4.jpg</url>
      <title>DEV Community: Jinhoon Chung</title>
      <link>https://dev.to/ddjh20482</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ddjh20482"/>
    <language>en</language>
    <item>
      <title>A New Way to Recommend Video Games at Amazon Part 3</title>
      <dc:creator>Jinhoon Chung</dc:creator>
      <pubDate>Wed, 18 Jan 2023 15:23:22 +0000</pubDate>
      <link>https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-3-1o8</link>
      <guid>https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-3-1o8</guid>
<description>&lt;p&gt;This is part 3 of the series. Please refer to the links below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-1-34op"&gt;part 1&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-2-2lbe"&gt;part 2&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the previous parts, I built a new recommendation system and showed you whether the system has value. Now let's try to display a more personalized list of recommended games for a customer.&lt;/p&gt;

&lt;p&gt;Here is a graph of the number of recommended games by predicted rating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uErUMaQn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x87n3jvmhdauv443whsy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uErUMaQn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x87n3jvmhdauv443whsy.png" alt="many rating 5" width="540" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is one of the critical cases: a single customer would be recommended more than 14,000 games with a predicted rating of 5. Amazon would love it if this customer bought all of them, but that is not very realistic. Let's try to trim down this long list.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Similarity&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One straightforward way to trim down the long list is to use the game information and find similarities between games. Examples of such information are the console and the company. Unfortunately, the metadata does not contain genre information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6NT1_bZe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1z9gw1ygvwom0of5zm1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6NT1_bZe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1z9gw1ygvwom0of5zm1z.png" alt="game_information" width="557" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Do you see some familiar names? Let's pick one game.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i8pdQ3dE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ug1ftm151zsqgkp44of3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i8pdQ3dE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ug1ftm151zsqgkp44of3.png" alt="selected_game" width="334" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a famous game owned by Capcom. It runs on PC, PlayStation, and Xbox, but the selected entry is for the Xbox console. With a matrix calculation that compares the information of about 72,000 games to the selected game, I get values from 0 to 1, where 1 means a perfect match and 0 means no matching information.&lt;/p&gt;
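&lt;p&gt;As a sketch of the idea (not the notebook's exact code; the column names and toy values here are made up), one-hot encoding the console and company columns and taking the cosine similarity of each row against the selected game's row yields exactly this kind of 0-to-1 score:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# toy metadata: console and company for a few games (hypothetical values)
games = pd.DataFrame({
    "asin":    ["g1", "g2", "g3", "g4"],
    "console": ["Xbox", "Xbox", "PC", "Xbox"],
    "company": ["Capcom", "Capcom", "Capcom", "EA"],
})

# one-hot encode the categorical game information
features = pd.get_dummies(games[["console", "company"]]).to_numpy(dtype=float)

# cosine similarity of every game against the selected game (row 0)
selected = features[0]
norms = np.linalg.norm(features, axis=1) * np.linalg.norm(selected)
similarity = features @ selected / norms

print(similarity)  # 1.0 is a perfect match; partial matches fall in between
```

&lt;p&gt;Here "g2" scores 1.0 (same console and same company), while sharing only one of the two attributes gives 0.5.&lt;/p&gt;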

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2FqF_nTo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hfmk18nr5tb64nse2mj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2FqF_nTo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hfmk18nr5tb64nse2mj7.png" alt="0_to_1" width="135" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you look at the third row from the bottom, you can see 16 games that are a perfect match with the selected game. It is actually 15, though, because one of the 16 is the selected game itself: during the matrix calculation, the selected game was compared against all games, including itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Trimming the List&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The next step is simple: the similarity ratios are multiplied by the predicted ratings. This is possible because each game has both a similarity ratio and a predicted rating. Even with a high predicted rating, a game is ranked lower if its similarity ratio is low.&lt;/p&gt;
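&lt;p&gt;A minimal pandas sketch of this re-ranking (the game names and numbers are hypothetical):&lt;/p&gt;

```python
import pandas as pd

# hypothetical predicted ratings and similarity ratios for three games
recs = pd.DataFrame({
    "game":       ["A", "B", "C"],
    "predicted":  [5.0, 5.0, 4.2],
    "similarity": [0.5, 1.0, 1.0],
})

# multiply the two columns; a rating-5 game with low similarity drops in rank
recs["final"] = recs["predicted"] * recs["similarity"]
top = recs.sort_values("final", ascending=False)
print(top["game"].tolist())  # ['B', 'C', 'A']
```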

&lt;p&gt;Let's check out the final list of the top 10 recommended games.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ol1kHMJk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x1t8cpj4756lnyy9y4t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ol1kHMJk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x1t8cpj4756lnyy9y4t3.png" alt="top_10" width="667" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see other entries in the Resident Evil series. Please ignore the 4th row, which has the same name as the selected game. "DMC Devil May Cry", "Strider", and "Mega Man" are from the same company. Except for "Strider" and "Mega Man", all of the games use 3D graphics.&lt;/p&gt;

&lt;p&gt;Here are the final predicted ratings after multiplying by similarity ratios. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TUZB5rwv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bxvsr92k15e5iryrqlgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TUZB5rwv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bxvsr92k15e5iryrqlgj.png" alt="final_ratings" width="488" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first six games still have a perfect predicted rating; the ratings start to decrease from the 7th game. I think these numbers would look more interesting if the metadata had genre information, since one game can have multiple genres.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Note&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I personally enjoyed this project because the topic was motivating to me. Several elements can still be improved. Once these ideas are finalized, the next steps are to discuss where and how to present this information, for which customers, and whether the ideas can be applied to other departments. This was a project completed for a data science bootcamp program, but I have tried to treat it as if it were a real problem. &lt;/p&gt;

&lt;p&gt;Thank you for reading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ddjh20482/recommender_system_amazon"&gt;Project GitHub Link&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A New Way to Recommend Video Games at Amazon Part 2</title>
      <dc:creator>Jinhoon Chung</dc:creator>
      <pubDate>Wed, 18 Jan 2023 15:04:54 +0000</pubDate>
      <link>https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-2-2lbe</link>
      <guid>https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-2-2lbe</guid>
<description>&lt;p&gt;This is part 2 of the series. Please refer to &lt;a href="https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-1-34op"&gt;part 1&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A new recommendation system was built in the previous part. It is time to check whether the new system has value. It would be tough and complex to figure out whether a customer would have a better or worse shopping experience using the system. However, computation and interpretation become much simpler if we instead ask whether a customer would have a &lt;strong&gt;different&lt;/strong&gt; shopping experience. We will use a hypothesis test and calculate p-values.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Simple Case&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's pick one customer and one game. Say the selected game has an average given rating of 1, but the system predicts the selected customer would rate it 4.76. The system is then likely to draw more attention from the customer: the customer would stop and check an item that would normally be ignored because of its low given rating. The same can happen in the opposite direction. This is what I mean by a different shopping experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4A611yeA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ng7j8dudptu4456olepz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4A611yeA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ng7j8dudptu4456olepz.png" alt="pipeline" width="880" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;One Customer with a List of Games&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's stay with one customer, but now check a list of games. The new system predicts the customer's ratings for the games, and each game also has an average given rating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jWVY6oQJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4fbefdp2ilv20yw1l57r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jWVY6oQJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4fbefdp2ilv20yw1l57r.png" alt="table with given ratings and predicted ratings" width="441" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here comes the hypothesis test. We want to know whether the given ratings and the predicted ratings are significantly different. Using a t-test, we can get a p-value for this customer. If the p-value is low enough (less than 0.05), the two lists are different enough, and we can conclude this customer would have a different shopping experience.&lt;/p&gt;
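&lt;p&gt;A minimal SciPy sketch of this test, using synthetic ratings (the real analysis uses each customer's actual given and predicted ratings; since both lists cover the same games, a paired t-test is one reasonable choice):&lt;/p&gt;

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
given = rng.uniform(1, 5, size=50)       # average given ratings for 50 games
predicted = np.clip(given + 1.5, 1, 5)   # predicted ratings, shifted upward

# paired t-test: are the two rating lists significantly different?
t_stat, p_value = stats.ttest_rel(given, predicted)
print(f"p-value: {p_value:.4f}")  # a value under 0.05 suggests a different experience
```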

&lt;h2&gt;
  
  
  &lt;strong&gt;More Customers with a List of Games&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We need to be careful when calculating p-values. There are around 72,000 games in the data, and if we compare the two types of ratings across all of them, the p-values will come out significant even for a tiny difference. To avoid this, the number of games used for each p-value should be kept small. Thirty sample games would normally be enough, but 50 seems ideal: I usually browse more than 50 items when shopping, 30 feels too small next to the 72,000 games in the data, and Amazon itself displays 50 games on a specialized &lt;a href="https://www.amazon.com/gp/most-wished-for/videogames?ref_=Oct_d_omwf_S&amp;amp;pd_rd_w=Ao5JS&amp;amp;content-id=amzn1.sym.930be8dd-97e5-4e05-b36b-ce7259eebc0a&amp;amp;pf_rd_p=930be8dd-97e5-4e05-b36b-ce7259eebc0a&amp;amp;pf_rd_r=03FANJH49RP59KB9VZHA&amp;amp;pd_rd_wg=rKWWs&amp;amp;pd_rd_r=094c5286-ef1e-4939-8ea6-193c1bc864bf"&gt;page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While there are around 72,000 games, there are around 1.5 million customer IDs in the data. It would cost too much to check 1.5 million individuals, so I decided to sample 500 customers with one condition: each customer must have rated at least 5 games. The previous section, "One Customer with a List of Games", is then repeated 500 times, giving us 500 p-values.&lt;/p&gt;
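&lt;p&gt;The sampling loop can be sketched like this (synthetic ratings stand in for the sampled customers' real given and predicted ratings, so the resulting share is not meaningful here):&lt;/p&gt;

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_values = []
for _ in range(500):                      # one sampled customer per iteration
    given = rng.uniform(1, 5, size=50)    # 50 games for this customer
    predicted = rng.uniform(1, 5, size=50)
    p_values.append(stats.ttest_rel(given, predicted).pvalue)

# share of customers whose two rating lists differ significantly
share = float(np.mean(np.less(p_values, 0.05)))
print(f"share with p-value below 0.05: {share:.2f}")
```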

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UCt7_6pl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw4mqwo4sjewx81al1q2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UCt7_6pl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw4mqwo4sjewx81al1q2.png" alt="scatter p_value" width="636" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The red dots show p-values less than 0.05. It turns out that more than 60% of the p-values are red dots, meaning those customers would have a different shopping experience. I think that is a good number, but it would be Amazon's call whether this translates into better business or marketing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Metadata with Average Given Ratings&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I kept this section separate from the previous ones to preserve the flow of the analysis. I have shown the comparison between average given ratings and predicted ratings. Since each game is rated more than once, we can take the average of its given ratings as a representative rating, producing a table of unique game IDs and average given ratings. The metadata also has game IDs (asin). After removing some duplicates, the unique game IDs in the metadata are merged with the average-rating table, and these given ratings are compared with the predicted ratings shown in the previous sections.&lt;/p&gt;
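&lt;p&gt;In pandas, this averaging and merging step looks roughly like the following (toy game IDs; the real columns come from the review data and metadata):&lt;/p&gt;

```python
import pandas as pd

# toy review data: each game is rated more than once
reviews = pd.DataFrame({
    "asin":   ["g1", "g1", "g2", "g2", "g2"],
    "rating": [5, 3, 4, 4, 1],
})

# one representative rating per game: the average of its given ratings
avg = reviews.groupby("asin", as_index=False)["rating"].mean()

# metadata keyed by the same game id, with duplicates removed before merging
meta = pd.DataFrame({
    "asin":    ["g1", "g2", "g2"],
    "console": ["Xbox", "PC", "PC"],
}).drop_duplicates(subset="asin")

merged = meta.merge(avg, on="asin")
print(merged)
```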

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7PM_VwFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xy5r5sn8s8e1ojube5c3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7PM_VwFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xy5r5sn8s8e1ojube5c3.png" alt="meta data example" width="857" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;In the Next Section&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I have gone over all the steps above, but there is no realistic output from this analysis yet. One major reason is that the list of recommended games is too large. In part 3, I will show how the list can be trimmed and finally present a useful result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ddjh20482/recommender_system_amazon"&gt;Project GitHub Link&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-3-1o8"&gt;Part 3&lt;/a&gt;&lt;/p&gt;

</description>
      <category>recommender</category>
      <category>amazon</category>
      <category>videogames</category>
      <category>python</category>
    </item>
    <item>
      <title>A New Way to Recommend Video Games at Amazon Part 1</title>
      <dc:creator>Jinhoon Chung</dc:creator>
      <pubDate>Wed, 18 Jan 2023 14:36:37 +0000</pubDate>
      <link>https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-1-34op</link>
      <guid>https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-1-34op</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I have completed my capstone project, and this post is a summary of it. I decided to build a new recommendation system for video games at Amazon.com, for the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I play lots of video games.&lt;/li&gt;
&lt;li&gt;I shop at Amazon.com.&lt;/li&gt;
&lt;li&gt;35% of Amazon's revenue comes from its recommendation system. &lt;a href="https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers"&gt;link&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;I don't see a recommendation system at Amazon.com that specifically uses customers' ratings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are already various recommendations that use "items bought together", "browsing history", "similar products", etc. Still, more than half of the items recommended to me at Amazon.com are not relevant to my interests, and many are similar to items I already own and won't buy again.&lt;/p&gt;

&lt;p&gt;It gets even worse when looking at the page for the video game department. Amazon is busy showing the top-selling or top-rated items, new releases, and items with special discounts. I don't feel like the page is personalized.&lt;/p&gt;

&lt;p&gt;The page might gain more attention if Amazon can add a section with a personalized list of recommended games.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Outline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I plan to write three posts outlined below. This post focuses on the first part.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Introduction and building a recommendation list using customer ratings&lt;/li&gt;
&lt;li&gt;Proving customers would have a different shopping experience&lt;/li&gt;
&lt;li&gt;Trimming down a long list of recommended games and wrapping up&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There is a database released by Amazon. There are two parts to the database, review data and metadata. If the links to the two separate datasets below do not work, please use &lt;a href="https://nijianmo.github.io/amazon/index.html"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Review Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data can be downloaded &lt;a href="http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games.csv"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;This dataset has four columns: item id (video game id), user id, rating, and timestamp. We do not need the timestamp column for this analysis.&lt;/p&gt;
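&lt;p&gt;Loading and trimming the file takes only a couple of lines; here is a sketch (the column names are my own labels, since the raw CSV has no header row, and the inline string stands in for the downloaded file):&lt;/p&gt;

```python
import pandas as pd
from io import StringIO

# stand-in for Video_Games.csv: item id, user id, rating, timestamp
raw = StringIO("B00001,U1,5.0,1111111111\nB00002,U2,3.0,1111111112\n")
df = pd.read_csv(raw, names=["item_id", "user_id", "rating", "timestamp"])

# the timestamp column is not needed for this analysis
df = df.drop(columns="timestamp")
print(df.columns.tolist())  # ['item_id', 'user_id', 'rating']
```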

&lt;p&gt;Here are the first 5 rows of the data.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Etm7zRu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2cwqn0azoska0wllhmsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Etm7zRu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2cwqn0azoska0wllhmsi.png" alt="first 5 rows" width="273" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the count for each rating and overall distribution. We can see ratings are from 1 to 5, and the majority (58%) of ratings are 5.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xfDrHX4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c8lq5pmffspt59ktv1yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xfDrHX4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c8lq5pmffspt59ktv1yu.png" alt="count" width="101" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oherjCbk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/igfjdqcy98h7xfhzcjyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oherjCbk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/igfjdqcy98h7xfhzcjyu.png" alt="distribution" width="440" height="512"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;MetaData&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data can be downloaded &lt;a href="http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Video_Games.json.gz"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;This metadata has various information on video games, but we only need item id, console information, and company information. Unfortunately, this data does not have genre information. &lt;/p&gt;

&lt;p&gt;This data will be crucial for the second and third parts. I will show how it is cleaned in the second part. For the current part, we only need the review data.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Analysis - Building a System&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Train-test-split&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data is split for validation purposes.&lt;/p&gt;

&lt;p&gt;Python has a convenient library called "surprise", designed specifically for recommender systems. Surprise already has a train-test-split function, but its data format was complicated to work with, so I used both scikit-learn and Surprise to split and format the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# selecting X and y
X = df[df.columns[:2]] #item_id, user_id
y = df[df.columns[-1]] #rating

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test 
= train_test_split(X, y, test_size=0.33, random_state=0)

from surprise import Reader, Dataset
# read in values as Surprise dataset 
reader = Reader()
# train data
# Loading the data again for the Surprise library
train = Dataset.load_from_df(pd.concat([X_train, y_train], 
                                       axis = 1), reader)
# test data
# Loading the data again for the Surprise library
test = Dataset.load_from_df(pd.concat([X_test, y_test], 
                                      axis = 1), reader)
# whole data for comparison
data = Dataset.load_from_df(df, reader)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Collaborative Filtering&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Surprise library uses matrix factorization to predict ratings on items. This is a supervised learning technique, since the rating is treated as a continuous target variable. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3OiQQyaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7bkyt5pmxf3a49i2zwc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3OiQQyaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7bkyt5pmxf3a49i2zwc7.png" alt="col_filtering" width="535" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The table above shows 5 items and 4 users as an example. Users do not rate every item. The system predicts ratings to fill in the missing ones; for example, user 4 would have 4 missing ratings replaced with predicted ratings. Games are then recommended to the user starting from the highest predicted rating.&lt;/p&gt;

&lt;p&gt;The code below calculates RMSE values that can be compared with the RMSE values from various scikit-learn regressors; the Surprise library gives the best RMSE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
import numpy as np

svd = SVD(random_state = 0)
val_svd = cross_validate(svd, train, measures=['RMSE', 'MAE'], cv=3)
print("Mean RMSE for the baseline model validation:")
np.mean(val_svd['test_rmse'])

svd = SVD(random_state = 0).fit(train_set)

pred = []
for i in range(len(X_test)):
    pred.append(svd.predict(X_test.iloc[i].values[0], X_test.iloc[i].values[1])[3])

from sklearn.metrics import mean_squared_error
print("RMSE for the baseline model on test data:")
print(mean_squared_error(y_test, pred, squared=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Mean RMSE for the baseline model validation: 1.2980&lt;/li&gt;
&lt;li&gt;RMSE for the baseline model on test data: 1.2827&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The RMSE value is not so great. A tuned model shows better RMSE values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;svd = SVD(random_state = 0, 
          n_factors= 100, 
          reg_all = 0.07, 
          n_epochs = 150)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Mean RMSE for the tuned model validation: 1.2850&lt;/li&gt;
&lt;li&gt;RMSE for the tuned model on test data: 1.2660&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The RMSE did not improve significantly. I hoped for a value of less than 1, but this is the best I have. It could be improved with more data cleaning, for example by finding and removing outliers and/or customers with very few ratings. I might add a fourth part with that update.&lt;/p&gt;

&lt;p&gt;The graph below shows an example of how a selected customer is recommended the video games assigned with predicted ratings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uErUMaQn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x87n3jvmhdauv443whsy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uErUMaQn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x87n3jvmhdauv443whsy.png" alt="many rating 5" width="540" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is easy to see that this customer would be overwhelmed by a huge list of recommended video games. I will talk about how this can be improved, but it is worth first discussing whether this system (the tuned model) is useful at all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ddjh20482/recommender_system_amazon"&gt;Project GitHub Link&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-2-2lbe"&gt;Part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/ddjh20482/a-new-way-to-recommend-video-games-at-amazon-part-3-1o8"&gt;Part 3&lt;/a&gt;&lt;/p&gt;

</description>
      <category>recommender</category>
      <category>amazon</category>
      <category>videogames</category>
      <category>python</category>
    </item>
    <item>
      <title>How to use GPU for Deep Learning</title>
      <dc:creator>Jinhoon Chung</dc:creator>
      <pubDate>Fri, 26 Aug 2022 13:49:44 +0000</pubDate>
      <link>https://dev.to/ddjh20482/how-to-use-gpu-for-deep-learning-41cm</link>
      <guid>https://dev.to/ddjh20482/how-to-use-gpu-for-deep-learning-41cm</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;My data science bootcamp introduced me to Deep Learning a few months ago. While reading the materials, I was surprised to learn that my RTX 3090 can be used for Deep Learning! I thought GPUs were only for gaming. Now I have one more excuse to buy a high-end GPU the next time I build a desktop.&lt;/p&gt;

&lt;p&gt;However, the excitement didn't last long, as it was not easy to set up a new environment for GPU Deep Learning with Anaconda or Git Bash. I did some research and the steps themselves were pretty straightforward, but most guides were at least a year old, and none covered Git Bash. &lt;/p&gt;

&lt;p&gt;In this article, I would like to show you how to use Git Bash to run GPU Deep Learning in a Jupyter notebook. Let me walk through how to set up the environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Benefit of Using GPU&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let me talk briefly about the benefit of using a GPU for Deep Learning: it is much faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BLUveECC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t9bmzsus912g4qhpvn3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BLUveECC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t9bmzsus912g4qhpvn3k.png" alt="gpuvscpu" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on the image above from &lt;a href="https://deci.ai/blog/close-gap-cpu-performance-gpu-deep-learning-models/"&gt;this link&lt;/a&gt;, a GPU is generally about 3 times faster than a CPU, and a better GPU gives even better performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Set-up&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 0
&lt;/h3&gt;

&lt;p&gt;Before starting, please make sure Anaconda and Git Bash are installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 - Anaconda
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Iv4lTFl7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rgalrx3eebf2e3cdnir8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Iv4lTFl7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rgalrx3eebf2e3cdnir8.png" alt="anaconda1" width="490" height="382"&gt;&lt;/a&gt;&lt;br&gt;
After opening Anaconda, click "Environment" on the menu. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IeWnTB6d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8x8avv5br5rw1yl9qnai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IeWnTB6d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8x8avv5br5rw1yl9qnai.png" alt="anaconda2" width="307" height="653"&gt;&lt;/a&gt;&lt;br&gt;
Click "Create" at the bottom of the list of environments. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5FJ8uqsi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lubt71rht58wfbgppyh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5FJ8uqsi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lubt71rht58wfbgppyh0.png" alt="anaconda3" width="459" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A small window should pop up. Name the environment; I named mine "gpu". The next step is to choose a Python version. The walkthrough I read said to choose version 3.6, but I was able to make mine work with version 3.7. So I recommend 3.7, though you can try a newer version and see whether it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6VlKQD-k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8wll4kn3r70qvosawa6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6VlKQD-k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8wll4kn3r70qvosawa6m.png" alt="anaconda4" width="518" height="219"&gt;&lt;/a&gt;&lt;br&gt;
Select the environment you just created. Choose "Not Installed" from the drop-down menu. On the "Search Packages" box, type in "tensorflow" and hit "Enter".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--soJGxGO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/em8bdzmky9fxu3libo1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--soJGxGO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/em8bdzmky9fxu3libo1s.png" alt="anaconda5" width="880" height="253"&gt;&lt;/a&gt;&lt;br&gt;
You should see results like those in the picture above. Check "keras-gpu" and "tensorflow-gpu", then click "Apply". It takes some time to get things ready to install the packages. If Anaconda says they can't be installed, this is where you should try again with a different Python version, as mentioned earlier in this article.&lt;/p&gt;

&lt;p&gt;Once the packages are installed, we are ready for Git Bash.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2 - Git Bash
&lt;/h3&gt;

&lt;p&gt;This step is pretty simple and quick. We only need to change the environment. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wI1Jdl5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ikip59wrbkq8d73szej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wI1Jdl5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ikip59wrbkq8d73szej.png" alt="gitbash" width="369" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just type "conda activate gpu", then hit "enter".&lt;/p&gt;

&lt;p&gt;Now type "jupyter notebook", then hit "enter" to open the notebook.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3 - Test on Jupyter Notebook
&lt;/h3&gt;

&lt;p&gt;Run the code below; the output is shown further down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf
from tensorflow import keras
print(len(tf.config.experimental.list_physical_devices('GPU')))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tf.test.is_gpu_available()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tf.test.is_built_with_cuda()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lrtL1_Zh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvirvup9zrcprb3aaf6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lrtL1_Zh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvirvup9zrcprb3aaf6j.png" alt="test1" width="774" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also check the Git Bash logs after running the code above.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xmqq9oFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wbzt6sfb15bfgmvq9nxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xmqq9oFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wbzt6sfb15bfgmvq9nxf.png" alt="test2" width="880" height="52"&gt;&lt;/a&gt;&lt;br&gt;
The logs mention my RTX 3090 along with its memory size; they should print out your GPU.&lt;/p&gt;

&lt;p&gt;The images below show the difference in GPU memory usage with and without the GPU in use. Pay attention to "Dedicated GPU memory".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---SmCdDdQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/09uoxtdao5ii6r6mz8tz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---SmCdDdQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/09uoxtdao5ii6r6mz8tz.png" alt="test3" width="880" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hm2XIBXY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uicdaghbc0zuoc90pqwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hm2XIBXY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uicdaghbc0zuoc90pqwa.png" alt="test4" width="880" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>deeplearning</category>
      <category>cuda</category>
    </item>
    <item>
      <title>Attributes from Python Pipeline</title>
      <dc:creator>Jinhoon Chung</dc:creator>
      <pubDate>Tue, 21 Jun 2022 14:26:48 +0000</pubDate>
      <link>https://dev.to/ddjh20482/attributes-from-python-pipeline-4dem</link>
      <guid>https://dev.to/ddjh20482/attributes-from-python-pipeline-4dem</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I have recently learned about Python pipelines. They are very useful, especially for the readability of a technical notebook and for the overall coding. However, I ran into one big problem: extracting the attributes of the elements included in the pipeline.&lt;/p&gt;

&lt;p&gt;With help from my instructor and some googling, the trouble turned into a valuable experience of learning something new.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let me briefly go over the data used for this post. The data is from the &lt;a href="https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/"&gt;National 2009 H1N1 Flu Survey&lt;/a&gt;. The link will direct you to a page where you can see the variable names. The purpose of the survey was to study the H1N1 flu vaccination rate and the categories of respondents. This background is just to help you understand the output of the pipeline later in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Elements in the Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The elements are actually called "steps" in the pipeline. Each step can be an encoder, a sampler, or any machine learning model (a classifier, a regressor, etc.). In the steps here, &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html"&gt;OneHotEncoder&lt;/a&gt; is used as the encoder and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"&gt;RandomForestClassifier&lt;/a&gt; as the classifier.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Attributes&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;OneHotEncoder&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Below is the Python code to instantiate the encoder. All results are gathered at the end of the post for better organization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import libraries for columns transformation
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# instantiate encoders
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

# apply encoding to just one column in the data
# to reduce complexity in the results
ct = ColumnTransformer([('age', ohe, ['age_group'])],
                       remainder='passthrough')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please remember &lt;strong&gt;ct&lt;/strong&gt; as it will be back soon.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;RandomForestClassifier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here is the Python code for the classifier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it for this section! Further coding steps come below.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;OneHotEncoder&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the useful attributes of the encoders is &lt;strong&gt;get_feature_names()&lt;/strong&gt; (renamed &lt;strong&gt;get_feature_names_out()&lt;/strong&gt; in newer versions of scikit-learn). It returns all of the variable names associated with the encoding. Let's take a look at the code and the output. Remember &lt;strong&gt;ct&lt;/strong&gt; from above? It is back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ct.get_feature_names()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TpCnNzks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rymfj3yahog70bfb6kpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TpCnNzks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rymfj3yahog70bfb6kpa.png" alt="getfeaturenames" width="317" height="736"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The age column is successfully encoded. What if we use the pipeline? Let's instantiate the pipeline first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import library
from imblearn.pipeline import Pipeline

# instantiate pipeline using column transformer and 
# model from classifier
pipe2 = Pipeline(steps=[('ct', ct),
                        ('rfc', RandomForestClassifier(random_state=1, 
                                                       max_depth = 9))
                       ]
                )
pipe2.fit(X_train_labeled, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pay attention to the pipeline model, &lt;strong&gt;pipe2&lt;/strong&gt;, as it will come back several times in this post. Here is the magic code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe2.steps[0][1].get_feature_names()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vZtvyauW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9tbnug4neoviry6zrjh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vZtvyauW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9tbnug4neoviry6zrjh2.png" alt="gfn_pipe" width="374" height="732"&gt;&lt;/a&gt;&lt;br&gt;
What just happened? A pipeline can show you what steps were taken, and it lets you use the attributes of each step after the step is accessed.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pxilBziC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mdi8v1w00c3doi4mtj7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pxilBziC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mdi8v1w00c3doi4mtj7v.png" alt="pipeline_steps" width="731" height="209"&gt;&lt;/a&gt;&lt;br&gt;
As you can see in the image above, the pipeline steps are saved as a list of tuples. Each step can be accessed as shown in the image below.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dLBfD22h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9jlq5ely10axnm4qxm51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dLBfD22h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9jlq5ely10axnm4qxm51.png" alt="call_step" width="700" height="169"&gt;&lt;/a&gt;&lt;br&gt;
Then, you can use any available attributes to get the information you need.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;RandomForestClassifier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let me repeat the process with the classifier, beginning without the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rfc=RandomForestClassifier(random_state=1, max_depth = 9)
X_train_labeled_ct = ct.fit_transform(X_train_labeled)
rfc.fit(X_train_labeled_ct, y_train)
rfc.feature_importances_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BnCbp6OY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/22jpuxd1jbi8h0fz4k7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BnCbp6OY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/22jpuxd1jbi8h0fz4k7y.png" alt="rfc_fi" width="653" height="259"&gt;&lt;/a&gt;&lt;br&gt;
Here is the one with the pipeline.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rF6jM6R3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/72brxiocfkxm334puedx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rF6jM6R3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/72brxiocfkxm334puedx.png" alt="rfc_fi_pipe" width="655" height="210"&gt;&lt;/a&gt;&lt;br&gt;
Once a pipeline is declared, the coding becomes noticeably simpler.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Application&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Using the results above, we can do something like the following!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# graph of the features sorted by the impact level on the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# extract feature importance scores and feature names
# then merge them
feat_impt = pd.concat([pd.DataFrame(pipe2.steps[1][1].feature_importances_, 
                                    columns = ['FI_score']),
                       pd.DataFrame(pipe2.steps[0][1].get_feature_names(), 
                                    columns = ['Features'])],
                      axis = 1
                     )

# sort descending by importance
feat_impt.sort_values(by = 'FI_score', inplace=True)

# print graph of the top 20 important features
plt.figure(figsize=(8,9))
plt.barh(range(20), feat_impt.FI_score[-20:], align='center') 
plt.yticks(np.arange(20), feat_impt.Features[-20:]) 
plt.xlabel('Feature importance')
plt.ylabel('Feature');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zFZHUkVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j6gvh0xsb4lx78qeow5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zFZHUkVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j6gvh0xsb4lx78qeow5c.png" alt="graph" width="629" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>python</category>
      <category>pipeline</category>
    </item>
    <item>
      <title>Magic of Transformation in Linear Regression</title>
      <dc:creator>Jinhoon Chung</dc:creator>
      <pubDate>Fri, 13 May 2022 12:26:00 +0000</pubDate>
      <link>https://dev.to/ddjh20482/magic-of-transformation-in-linear-regression-584f</link>
      <guid>https://dev.to/ddjh20482/magic-of-transformation-in-linear-regression-584f</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I recently completed a project for my data science boot camp and learned how transforming numerical variables helps in finding a regression model. On this page, I would like to focus on how the transformation of numerical data gives better validation of the model, rather than on how the model looks or how it can be used to predict a value.&lt;/p&gt;

&lt;p&gt;I will briefly go over the data source and structure and give a quick explanation of the methods used in this post. I will also include some Python code and mathematical formulas to aid understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I am using data from the project I completed. It is real-life data on house sales in King County, Washington, including house prices and multiple house features. Below is the list of variables used in the analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependent Variable
&lt;/h3&gt;

&lt;p&gt;House Price&lt;/p&gt;

&lt;h3&gt;
  
  
  Independent Variables
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Numerical
&lt;/h4&gt;

&lt;p&gt;Living space in square feet&lt;br&gt;
Lot size in square feet&lt;br&gt;
Year built&lt;br&gt;
The number of floors*&lt;/p&gt;

&lt;p&gt;* A separate explanation of why I defined it as numerical is at the end of the post.&lt;/p&gt;
&lt;h4&gt;
  
  
  Categorical
&lt;/h4&gt;
&lt;h5&gt;
  
  
  Binaries
&lt;/h5&gt;

&lt;p&gt;Waterfront&lt;br&gt;
View presence&lt;br&gt;
Renovation condition&lt;br&gt;
Basement presence&lt;/p&gt;
&lt;h5&gt;
  
  
  Multi-categorical
&lt;/h5&gt;

&lt;p&gt;Maintenance condition&lt;br&gt;
House grade&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Methods&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This section just gives you an idea of what I did to get the results. If these methods look familiar or don't interest you, you can skip to the results section. In fact, checking the results before reading this section may make it easier to see why I am writing this post.&lt;/p&gt;
&lt;h3&gt;
  
  
  Assumptions
&lt;/h3&gt;

&lt;p&gt;There are several ways to validate a model; I would like to go over the four major assumptions of linear regression: linearity, normality, homoscedasticity, and (no) multicollinearity. I chose this method because it can be explained visually, and visualization explains the concepts better than lots of words and numbers.&lt;/p&gt;
&lt;h4&gt;
  
  
  1. Linearity
&lt;/h4&gt;

&lt;p&gt;It is important to check the linearity assumption in a linear regression analysis. As no polynomial transformation has been applied in this analysis, the predicted house price (dependent variable) is compared against the actual house price.&lt;/p&gt;

&lt;p&gt;Below is the Python code I used; sharing it gives some idea of how the graph is created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# split whole data to training and test data
from sklearn.model_selection import train_test_split

X = independent_variables
y = house_price_column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# find the model fit
from sklearn.linear_model import LinearRegression
model = LinearRegression()

model.fit(X_train, y_train)

# Calculate predicted price using test data
y_pred = model.predict(X_test)

# Graphing part
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

perfect_line = np.arange(y_test.min(), y_test.max())
ax.plot(perfect_line, linestyle="--", color="orange", label="Perfect Fit")
ax.scatter(y_test, y_pred, alpha=0.5)
ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
ax.legend();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Normality
&lt;/h4&gt;

&lt;p&gt;The normality assumption is related to the normality of model residuals. This is checked using a QQ plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import scipy.stats as stats
residuals = y_test - y_pred
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Homoscedasticity
&lt;/h4&gt;

&lt;p&gt;The homoscedasticity assumption checks the residuals against the predicted values to see whether the residuals are dispersed without any pattern. This assumption is also related to the residuals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig, ax = plt.subplots()

residuals = y_test - y_pred

ax.scatter(y_pred, residuals, alpha=0.5)
ax.plot(y_pred, [0 for i in range(len(X_test))])
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Multicollinearity
&lt;/h4&gt;

&lt;p&gt;The multicollinearity assumption checks dependency among the independent variables. It is best for the independent variables to be as independent of one another as possible; the variance inflation factor (VIF) measures this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
pd.Series(vif, index=X_train.columns, name="Variance Inflation Factor")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Transformations of numerical variables
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Log transformation
&lt;/h4&gt;

&lt;p&gt;This part is simple. All values in the numerical columns are natural-logged.&lt;/p&gt;
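&lt;p&gt;As a sketch with made-up numbers (not the King County data itself), the log step is one line with NumPy:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# made-up values for two of the numerical columns
numeric = pd.DataFrame({'sqft_living': [1180.0, 2570.0, 770.0],
                        'sqft_lot':    [5650.0, 7242.0, 10000.0]})

# natural-log every value in the numerical columns
logged = np.log(numeric)
print(logged.round(3))
```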

&lt;h4&gt;
  
  
  Normalization
&lt;/h4&gt;

&lt;p&gt;The formula below shows that each value of a numerical variable has the variable's mean subtracted from it, and the difference is then divided by the variable's standard deviation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dp_SFhdK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ub8w0hn6dpr5ybmh61jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dp_SFhdK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ub8w0hn6dpr5ybmh61jm.png" alt="Normalization Formula" width="140" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here is the fun part. You can just relax and see how graphs and scores change.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Raw Data - no transformation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Linearity
&lt;/h4&gt;

&lt;p&gt;I see several outliers. Some linearity is observed only on the left side.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0yIqs5XX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qasegenxzfw0ei3i1mlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0yIqs5XX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qasegenxzfw0ei3i1mlf.png" alt="Linearity1" width="376" height="273"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Normality
&lt;/h4&gt;

&lt;p&gt;Only 1/3 of the dots are on the red line.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---RnNXD36--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t1zmy5qnoj7ws0p5h120.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---RnNXD36--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t1zmy5qnoj7ws0p5h120.png" alt="Normality1" width="384" height="262"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  3. Homoscedasticity
&lt;/h4&gt;

&lt;p&gt;A clear pattern is observed.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--itPua-mQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drwyaxebgsh3dxaj0r4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--itPua-mQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drwyaxebgsh3dxaj0r4q.png" alt="Homoscedasticity1" width="384" height="273"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  4. Multicollinearity
&lt;/h4&gt;

&lt;p&gt;Only scores below 5 are accepted. About half of the scores are not acceptable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqft_living            8483.406359
sqft_lot                  1.200729
floors                   14.106084
waterfront                1.085728
view                      1.344477
yr_built                 72.300842
is_renovated              1.157421
has_basement              2.175980
condition_Fair            1.038721
condition_Good            1.668386
condition_Very Good       1.295097
grade_11 Excellent        1.530655
grade_6 Low Average       5.129509
grade_7 Average          14.142031
grade_8 Good              8.261598
grade_9 Better            3.446987
interaction            8460.117213
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Log Transformation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Linearity
&lt;/h4&gt;

&lt;p&gt;It shows much better linearity. The dots have a slightly lower slope than the perfect-fit line.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S2PE23O6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oqedfwym29urzltldiai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S2PE23O6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oqedfwym29urzltldiai.png" alt="Linearity logged" width="395" height="265"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Normality
&lt;/h4&gt;

&lt;p&gt;There is slight kurtosis, but the majority of the dots are on the red line.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---u6O2tNe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5wm6tkf8ykk8demyeru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---u6O2tNe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5wm6tkf8ykk8demyeru.png" alt="Normality logged" width="384" height="262"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  3. Homoscedasticity
&lt;/h4&gt;

&lt;p&gt;This looks better, too. A slight pattern is observed on the right side.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Lgbl7Zp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9uz8cb5njvb1kfnqgxi3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Lgbl7Zp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9uz8cb5njvb1kfnqgxi3.png" alt="Homoscedasticity logged" width="394" height="263"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  4. Multicollinearity
&lt;/h4&gt;

&lt;p&gt;Several scores are still too high.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqft_living            471370.972327
sqft_lot                  155.772190
floors                      4.052275
yr_built                  922.928871
waterfront                  1.086052
view                        1.337069
is_renovated                1.146855
has_basement                2.438983
condition_Fair              1.042784
condition_Good              1.668688
condition_Very Good         1.283740
grade_11 Excellent          1.468962
grade_6 Low Average         5.221791
grade_7 Average            12.895007
grade_8 Good                7.519577
grade_9 Better              3.355969
interaction            469074.416388
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Log Transformation and Normalization
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Linearity
&lt;/h4&gt;

&lt;p&gt;The slope is slightly better and closer to the perfect line.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LpTk_BDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fcihxao8f2pk7msnw1io.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LpTk_BDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fcihxao8f2pk7msnw1io.png" alt="Linearity Final" width="384" height="263"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Normality
&lt;/h4&gt;

&lt;p&gt;I don't see much difference from the previous graph.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2hQp2JYF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j4wborx65ct49jxdp6jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2hQp2JYF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j4wborx65ct49jxdp6jg.png" alt="Normality Final" width="384" height="262"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  3. Homoscedasticity
&lt;/h4&gt;

&lt;p&gt;I don't see much difference from the previous graph.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p0Sjj5hF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fy2p0l5lh6c79ru8c64z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p0Sjj5hF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fy2p0l5lh6c79ru8c64z.png" alt="Homoscedasticity Final" width="384" height="262"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  4. Multicollinearity
&lt;/h4&gt;

&lt;p&gt;All of the scores now fall within the acceptable range. This is a huge improvement!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqft_living            3.001670
sqft_lot               1.552016
floors                 2.046914
yr_built               1.758294
waterfront             1.086293
view                   1.313341
is_renovated           1.148279
has_basement           2.441147
condition_Fair         1.042169
condition_Good         1.647135
condition_Very Good    1.281906
grade_11 Excellent     1.278034
grade_6 Low Average    1.939542
grade_7 Average        2.077564
grade_8 Good           1.609822
grade_9 Better         1.440610
interaction            1.374175
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
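&lt;p&gt;These scores look like variance inflation factors (VIFs), where values below roughly 5 are commonly considered acceptable. Assuming that is what they are, here is a minimal sketch of how such scores can be computed with only numpy and pandas, on hypothetical data:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: 1 / (1 - R^2) from
    regressing that column on all the others (with an intercept)."""
    scores = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        X = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(X)), X])  # intercept term
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1 - resid.var() / y.var()
        scores[col] = 1 / (1 - r2)
    return pd.Series(scores)

# Hypothetical predictors; the real design matrix holds the columns above.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "sqft_living": rng.normal(2000, 500, 200),
    "floors": rng.integers(1, 4, 200).astype(float),
    "yr_built": rng.normal(1970, 20, 200),
})
print(vif(data).round(2))
```

With the actual design matrix, the same function reproduces a table like the one above; statsmodels also provides this computation as `variance_inflation_factor`.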



&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The transformations helped the model satisfy (i.e., not reject) the four assumptions. The visualizations make it easy to see which assumption improved at each step.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Extra&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The number of floors
&lt;/h3&gt;

&lt;p&gt;This is somewhat outside the main topic of this post, but the decision can be crucial to the overall regression analysis. Let me begin with the value counts of the floor information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.0    10673
2.0     8235
1.5     1910
3.0      611
2.5      161
3.5        7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left column shows that the floor counts range from 1 through 3.5. It might make more sense for the model to treat this variable as categorical. However, what if someone wants to predict the price of a house with 4 floors? That case can only be handled if the variable is treated as numerical, since a numeric coefficient can extrapolate beyond the observed levels. In the end, the choice depends on the goal of the analysis.&lt;/p&gt;
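&lt;p&gt;The trade-off can be seen directly in code. This is a hypothetical sketch: one-hot encoding creates one column per observed level, so an unseen value like 4 floors has no column, while a single numeric column lets the fitted coefficient extrapolate.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample of the floors column.
df = pd.DataFrame({"floors": [1.0, 2.0, 1.5, 3.0, 2.5, 3.5]})

# Categorical treatment: one indicator column per observed level only.
dummies = pd.get_dummies(df["floors"], prefix="floors")
print(list(dummies.columns))
# A 4-floor house has no matching column, so the model cannot score it.

# Numerical treatment: a single column whose coefficient applies to any
# floor count, including 4 floors the model never saw during training.
numeric = df[["floors"]]
```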

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>regression</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to make money out of a movie?</title>
      <dc:creator>Jinhoon Chung</dc:creator>
      <pubDate>Wed, 23 Mar 2022 21:30:03 +0000</pubDate>
      <link>https://dev.to/ddjh20482/best-way-to-make-money-out-of-a-movie-4nbc</link>
      <guid>https://dev.to/ddjh20482/best-way-to-make-money-out-of-a-movie-4nbc</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Are you interested in making a new movie? Do you need some ideas about what kind of movie will help you to make a profit? Then, this post might give you a direction or two on where to begin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Movie Information
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Budget and Profit
&lt;/h4&gt;

&lt;p&gt;To analyze profitability, we first need profit data. &lt;a href="https://www.the-numbers.com/"&gt;The Numbers&lt;/a&gt; is a good source for this information.&lt;/p&gt;

&lt;p&gt;The table below shows a small sample of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;production_budget    worldwide_gross
$425,000,000         $2,776,345,279
$300,000,000         $2,048,134,200
$306,000,000         $2,053,311,220
$215,000,000         $1,648,854,864
$190,000,000         $1,518,722,794
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, the data does not include profit, but it can easily be calculated by subtracting the budget from the gross.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;production_budget    worldwide_gross    profit
$425,000,000         $2,776,345,279     2351.35
$300,000,000         $2,048,134,200     1748.13
$306,000,000         $2,053,311,220     1747.31
$215,000,000         $1,648,854,864     1433.85
$190,000,000         $1,518,722,794     1328.72
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The profit column is expressed in millions of dollars. Now we have both budget and profit, so let's look at a graph and see whether there is a relationship between them.&lt;/p&gt;
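&lt;p&gt;The calculation can be sketched with pandas. The column names follow the table above; the dollar strings are parsed, and the difference is scaled to millions:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical rows matching the table above; amounts arrive as strings.
df = pd.DataFrame({
    "production_budget": ["$425,000,000", "$300,000,000"],
    "worldwide_gross":   ["$2,776,345,279", "$2,048,134,200"],
})

# Strip "$" and "," so the currency strings become floats.
for col in ["production_budget", "worldwide_gross"]:
    df[col] = df[col].str.replace("[$,]", "", regex=True).astype(float)

# Profit in millions of dollars, rounded to match the table.
df["profit"] = ((df["worldwide_gross"] - df["production_budget"]) / 1e6).round(2)
print(df["profit"].tolist())
```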

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hG820EHk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bwbvv7rv81dm98d5gttw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hG820EHk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bwbvv7rv81dm98d5gttw.png" alt="budget vs profit" width="653" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A scatter plot was an option, but this graph looks less intimidating. There are five blue boxes, and each box marks the median profit within one range; grouping into ranges keeps the analysis organized. The leftmost range sits below zero, meaning a negative profit. The two lines above and below each box show the spread of the data, and a range can be handier than a single number like the median.&lt;/p&gt;

&lt;p&gt;What the graph shows is that the more money spent on a movie, the more profit it returns. That trend sounds unbounded, which can be overwhelming for a newcomer. Let's dive into other aspects of the movies.&lt;/p&gt;
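&lt;p&gt;The grouping behind the blue boxes can be sketched as a binned median. This is a toy example with invented numbers, not the real data:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical movies; budget and profit both in millions of dollars.
df = pd.DataFrame({
    "budget": [5, 20, 60, 120, 250, 90, 15, 300],
    "profit": [-2, 10, 45, 200, 900, 80, 3, 1100],
})

# Cut budgets into equal-width ranges and take the median profit of each,
# which is what each blue box in the graph summarizes.
df["budget_range"] = pd.cut(df["budget"], bins=5)
medians = df.groupby("budget_range", observed=True)["profit"].median()
print(medians)
```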

&lt;h4&gt;
  
  
  Genre
&lt;/h4&gt;

&lt;p&gt;I think everyone has a favorite movie genre or two. There are drama, action, fantasy, and more; try naming them all. We can get good genre data from &lt;a href="https://www.themoviedb.org/"&gt;TMDB&lt;/a&gt;. While TMDB has good information on movie genres, it does not include profit data, so it is combined with the data from The Numbers mentioned above.&lt;/p&gt;
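&lt;p&gt;Combining the two sources amounts to a join on a shared key. Here is a minimal sketch with invented rows, assuming the movie title serves as the key:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical extracts; the real data comes from TMDB and The Numbers.
tmdb = pd.DataFrame({
    "title": ["Avatar", "Titanic", "Small Indie"],
    "genre": ["Science Fiction", "Drama", "Drama"],
})
numbers = pd.DataFrame({
    "title": ["Avatar", "Titanic"],
    "profit": [2351.35, 1433.85],  # millions of dollars
})

# Inner join on title keeps only movies present in both sources,
# attaching a profit figure to each genre record.
merged = tmdb.merge(numbers, on="title", how="inner")
print(merged)
```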

&lt;p&gt;Let's check how genres are distributed before relating genres to profit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d1xkndVB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b3ptg400jqddsqlcv75g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d1xkndVB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b3ptg400jqddsqlcv75g.png" alt="genre distribution" width="704" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wow! Look at drama. It is overwhelmingly at the top. We also see comedy, thriller, and action near the top. It might be a good idea to avoid those genres to avoid too much competition.&lt;/p&gt;

&lt;p&gt;It is time to check how genres can be related to profit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2L0Vg5p6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ttfgyrstpf7fx9v1qwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2L0Vg5p6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ttfgyrstpf7fx9v1qwu.png" alt="genre vs profit" width="728" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The elements in this graph are the same as in the budget-versus-profit graph; the only difference is that the boxes run horizontally. The blue box again shows the median, and the two lines show the spread. Now drama sits in the bottom half. The most profitable genres are animation, adventure, fantasy, family, science fiction, and action. I picked six instead of five because the spread of those six stays above zero profit.&lt;/p&gt;

&lt;p&gt;This aspect, unlike budget, is not unbounded. The animation genre certainly deserves to be taken into account.&lt;/p&gt;

&lt;p&gt;Why don't we look at another aspect: movie runtime?&lt;/p&gt;

&lt;h4&gt;
  
  
  Movie Runtime
&lt;/h4&gt;

&lt;p&gt;Runtime data is obtained from &lt;a href="https://www.imdb.com/"&gt;IMDb&lt;/a&gt;. Since IMDb does not have profit information, the data is combined with data from The Numbers.&lt;/p&gt;

&lt;p&gt;Let's go straight into the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oh-Iqw-h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lefup0ao3qrkl2yq55x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oh-Iqw-h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lefup0ao3qrkl2yq55x.png" alt="runtime vs profit" width="541" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Runtimes between 80 and 100 minutes have the lowest profit, while movies longer than 120 minutes have the highest. Surprisingly, movies shorter than 60 minutes earn more profit than movies 80 to 100 minutes long. If a movie longer than two hours is not affordable, why not target a runtime under 60 minutes?&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This analysis should give you some ideas about where to begin before filming a new movie. We looked at budget, genre, and runtime. If it sparked your curiosity about other aspects of movies, then it did something right.&lt;/p&gt;

</description>
      <category>movie</category>
    </item>
  </channel>
</rss>
