<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: nkpremices</title>
    <description>The latest articles on DEV Community by nkpremices (@nkpremices).</description>
    <link>https://dev.to/nkpremices</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F189484%2F6103b5ea-4450-43ba-89bc-3b74a27179a9.jpeg</url>
      <title>DEV Community: nkpremices</title>
      <link>https://dev.to/nkpremices</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nkpremices"/>
    <language>en</language>
    <item>
      <title>Investigating the TMDB movie dataset, part 2</title>
      <dc:creator>nkpremices</dc:creator>
      <pubDate>Thu, 12 Mar 2020 18:23:26 +0000</pubDate>
      <link>https://dev.to/nkpremices/investigating-the-tmdb-movie-dataset-part-2-e64</link>
      <guid>https://dev.to/nkpremices/investigating-the-tmdb-movie-dataset-part-2-e64</guid>
      <description>&lt;p&gt;This blog post is the second part of a whole series. I would recommend you read the &lt;a href="https://dev.to/nkpremices/investigating-the-tmdb-movie-dataset-6co"&gt;first part&lt;/a&gt; if you want to understand this one.&lt;/p&gt;

&lt;p&gt;In this blog post, I am going to talk about data cleaning. We are going to use the results from the first part and build from there.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Cleaning
&lt;/h1&gt;

&lt;h5&gt;
  
  
  Step 1. Remove columns with a lot of null values or that are unnecessary for the analysis.
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df.drop(['imdb_id', 'homepage', 'tagline', 'overview', 'budget_adj', 'revenue_adj'], axis=1, inplace=True)&lt;br&gt;
df.head(1)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583942632%2FScreen_Shot_2020-03-11_at_18.03.05_czezk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583942632%2FScreen_Shot_2020-03-11_at_18.03.05_czezk0.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Step 2. Remove duplicated data
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;df.drop_duplicates(inplace=True)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
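&lt;p&gt;On a toy frame, the effect of &lt;code&gt;drop_duplicates&lt;/code&gt; can be checked by counting fully duplicated rows before and after (a sketch, not the real dataset):&lt;/p&gt;

```python
import pandas as pd

# Two identical rows (id=1) and one unique row
df = pd.DataFrame({'id': [1, 1, 2], 'title': ['A', 'A', 'B']})

print(df.duplicated().sum())  # 1 fully duplicated row
df.drop_duplicates(inplace=True)
print(df.duplicated().sum())  # 0
print(len(df))                # 2 rows remain
```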

&lt;h5&gt;
  
  
  Step 3. Remove the rows with null values in the cast, director, and genres columns
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df.dropna(subset = ['cast', 'director', 'genres'], how='any', inplace=True)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's check if there are still null values&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;df.isnull().sum()&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940342%2FScreen_Shot_2020-03-11_at_17.19.32_ei0xld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940342%2FScreen_Shot_2020-03-11_at_17.19.32_ei0xld.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Step 4. Replace zero values with null values in the budget and revenue columns.
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df['budget'] = df['budget'].replace(0, np.nan)&lt;br&gt;
   df['revenue'] = df['revenue'].replace(0, np.nan)&lt;br&gt;
   df.info()&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
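&lt;p&gt;The replacement can be verified on a toy frame: the zeros become NaN, which pandas then counts as missing (a sketch, not the real dataset):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'budget': [0, 50, 0], 'revenue': [10, 0, 30]})

# Same pattern as above: turn the placeholder zeros into missing values
df['budget'] = df['budget'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)

print(df.isnull().sum().to_dict())  # {'budget': 2, 'revenue': 1}
```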

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940343%2FScreen_Shot_2020-03-11_at_17.19.42_jinvxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940343%2FScreen_Shot_2020-03-11_at_17.19.42_jinvxt.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Step 5. Drop the rows with runtime == 0.
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df.query('runtime != 0', inplace=True)&lt;br&gt;
df.query('runtime == 0')&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940342%2FScreen_Shot_2020-03-11_at_17.19.57_wqlokv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940342%2FScreen_Shot_2020-03-11_at_17.19.57_wqlokv.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;df.info()&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940343%2FScreen_Shot_2020-03-11_at_17.20.04_tfcs5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583940343%2FScreen_Shot_2020-03-11_at_17.20.04_tfcs5d.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;df.describe()&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583943336%2FScreen_Shot_2020-03-11_at_18.15.18_vstnox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.cloudinary.com%2Fpremices%2Fimage%2Fupload%2Fv1583943336%2FScreen_Shot_2020-03-11_at_18.15.18_vstnox.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the table above, we can see that replacing the zeros with null values improved the budget and revenue distributions. We can also see that the minimum values now make more sense.&lt;/p&gt;
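&lt;p&gt;The improvement comes from how pandas computes summary statistics: NaN values are excluded from &lt;code&gt;describe()&lt;/code&gt;, &lt;code&gt;mean()&lt;/code&gt;, and &lt;code&gt;min()&lt;/code&gt;, while zeros are counted as real observations. A minimal sketch on toy numbers:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

raw = pd.Series([0, 0, 100, 200])   # placeholder zeros drag the stats down
clean = raw.replace(0, np.nan)      # NaN is ignored by the statistics

print(raw.mean())    # 75.0
print(clean.mean())  # 150.0
print(clean.min())   # 100.0 -- the minimum now makes sense
```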




&lt;p&gt;This is the end of the second part. If you enjoyed reading, stay tuned: I will post the third part soon.&lt;/p&gt;

&lt;p&gt;Thank you for reading.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Investigating the TMDB movie dataset</title>
      <dc:creator>nkpremices</dc:creator>
      <pubDate>Wed, 11 Mar 2020 15:50:13 +0000</pubDate>
      <link>https://dev.to/nkpremices/investigating-the-tmdb-movie-dataset-6co</link>
      <guid>https://dev.to/nkpremices/investigating-the-tmdb-movie-dataset-6co</guid>
      <description>&lt;p&gt;Lately, I've been going through the &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fwww.udacity.com%2Fcourse%2Fdata-analyst-nanodegree--nd002"&gt;Data analyst nanodegree program of Udacity&lt;/a&gt;. I worked on some projects there and I will be writing blog posts about them in the coming weeks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Note:&lt;/code&gt; This blog post is the first part of a whole series of blog posts in which I describe a full dataset analysis. The aim is to showcase how simple data analysis can be.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;h5&gt;
  
  
  About the dataset
&lt;/h5&gt;

&lt;p&gt;The dataset is called TMDB movie data. Downloaded from &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fwww.google.com%2Furl%3Fq%3Dhttps%3A%2F%2Fwww.kaggle.com%2Ftmdb%2Ftmdb-movie-metadata%26sa%3DD%26ust%3D1532469042115000"&gt;this page&lt;/a&gt;, its original version was removed by &lt;a href="https://www.kaggle.com/tmdb/tmdb-movie-metadata"&gt;Kaggle&lt;/a&gt; and replaced with a similar set of movies and data fields from &lt;a href="https://www.kaggle.com/tmdb/themoviedb.org"&gt;The Movie Database (TMDb)&lt;/a&gt;. It contains basic information on more than 10,000 movies, including user ratings and revenue data.&lt;/p&gt;

&lt;p&gt;A successful movie is evaluated by its popularity, vote average score (ratings), and revenue. Several factors can affect the success of a movie: for example, the budget, cast, director, tagline, keywords, runtime, genres, production companies, release date, and vote average.&lt;/p&gt;

&lt;p&gt;Looking at the data in the dataset, various questions can be asked. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How has the popularity of movies changed over the years?&lt;/li&gt;
&lt;li&gt;Considering the five most recent years, how is revenue distributed across the different score rating levels?&lt;/li&gt;
&lt;li&gt;How is revenue distributed across different popularity levels?&lt;/li&gt;
&lt;li&gt;What kinds of properties are associated with movies that have high popularity?&lt;/li&gt;
&lt;li&gt;What kinds of properties are associated with movies that have high voting scores?&lt;/li&gt;
&lt;li&gt;How many movies are released year by year?&lt;/li&gt;
&lt;li&gt;What are the keyword trends by generation?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this series of blog posts, we are going to answer the questions above using the TMDB Movie data, Numpy, Pandas, and Matplotlib.&lt;/p&gt;

&lt;p&gt;For this blog post, we will focus on general observations about the data.&lt;/p&gt;

&lt;p&gt;First of all, let's import the needed packages.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
import matplotlib.pyplot as plt&lt;br&gt;
import seaborn as sns&lt;br&gt;
from collections import Counter&lt;br&gt;
%matplotlib inline&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Data Wrangling
&lt;/h1&gt;

&lt;h5&gt;
  
  
  General Properties
&lt;/h5&gt;

&lt;p&gt;Let's load the dataset and display its info.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df = pd.read_csv('tmdb-movies.csv')&lt;br&gt;
df.info()&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yy7jC7jY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.17.33_uuhhm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yy7jC7jY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.17.33_uuhhm8.png" alt="info"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Judging from the info above, the dataset has 10866 entries and 21 columns. The types used are int, float, and string. From the total number of entries and the number of non-null entries per column, we can see that a lot of columns have null values. Let's check the exact number of null records per column.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;list(df.isnull().sum().items())&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kKV30TkM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.13_b4rymb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kKV30TkM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.13_b4rymb.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the result above, we see that the columns that have null values are cast, homepage, director, tagline, keywords, overview, genres, and production_companies. We also see that homepage, tagline, keywords, and production_companies have a lot of null records. I decided to get rid of tagline and keywords since they have so many null values.&lt;/p&gt;

&lt;p&gt;Let's try to get more descriptive information from the dataset&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;df.describe()&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DvkfvDuM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.30_zzuejp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DvkfvDuM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.30_zzuejp.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look at the popularity column, we can find some outliers. Since it has no upper bound, it is better to just retain the original data. We can also see that there are a lot of zero values in the budget, revenue, and runtime columns. The first guess might be that these movies were not released, but if we look at the release_year column we notice that the minimum value (1996) is a valid year and that there are no null values. Therefore those movies were released. Maybe the zeros indicate the absence of data. However, in order to decide, let's check those records closely.&lt;/p&gt;

&lt;p&gt;First for the budget&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df_budget_zero = df.query('budget == 0')&lt;br&gt;
df_budget_zero.head(3)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sW5hZYof--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.49_nqxyhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sW5hZYof--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.49_nqxyhy.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then for the revenue&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df_revenue_zero = df.query('revenue == 0')&lt;br&gt;
df_revenue_zero.head(3)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kRdo7pgQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.58_j3sjwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kRdo7pgQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583940342/Screen_Shot_2020-03-11_at_17.18.58_j3sjwd.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After checking director Afonso Poyart's film Solace on Wikipedia, I noticed that the film was actually a success. This means there was a successful release, which in turn means there was a budget. Therefore, the zero values are missing data. Based on that, I would drop these records, since they might affect the statistics of my analysis.&lt;/p&gt;

&lt;p&gt;Subsequently, let's check the number of zero values to decide whether they should just be set to null or dropped out completely.&lt;/p&gt;

&lt;p&gt;First for the budget zero values&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df_budget_0count =  df.groupby('budget').count()['id']&lt;br&gt;
df_budget_0count.head(2)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wPmgHpDp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583941289/Screen_Shot_2020-03-11_at_17.40.40_tus7bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wPmgHpDp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583941289/Screen_Shot_2020-03-11_at_17.40.40_tus7bz.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As suggested by the results, there are far more zero values than any other budget value. Dropping them all would corrupt the results, so I'd better set them to null instead.&lt;/p&gt;

&lt;p&gt;Then for the revenue zero values&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;br&gt;
 &lt;code&gt;df_revenue_0count =  df.groupby('revenue').count()['id']&lt;br&gt;
df_revenue_0count.head(2)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7CpJmSL1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583941289/Screen_Shot_2020-03-11_at_17.40.48_kb0egw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7CpJmSL1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583941289/Screen_Shot_2020-03-11_at_17.40.48_kb0egw.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same situation: set them to null.&lt;/p&gt;

&lt;p&gt;Finally for the runtime&lt;/p&gt;
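&lt;p&gt;Assuming the same pattern as the budget and revenue checks above, the runtime check would look like this (sketched on a toy frame standing in for the dataset):&lt;/p&gt;

```python
import pandas as pd

# Toy frame standing in for the TMDB data (illustrative only)
df = pd.DataFrame({'id': [1, 2, 3, 4], 'runtime': [0, 90, 120, 90]})

# Same pattern as the budget/revenue checks: count rows per runtime value
df_runtime_0count = df.groupby('runtime').count()['id']
print(df_runtime_0count.head(2).to_dict())  # {0: 1, 90: 2}
```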

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A-7_XBVR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583941289/Screen_Shot_2020-03-11_at_17.40.58_udrblh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A-7_XBVR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/premices/image/upload/v1583941289/Screen_Shot_2020-03-11_at_17.40.58_udrblh.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The number of zeros is negligible, so those rows can be dropped.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Remove columns with a lot of null values, as well as columns unnecessary for answering the questions: homepage, tagline, imdb_id, overview, budget_adj, revenue_adj.&lt;/li&gt;
&lt;li&gt;Remove duplicated data.&lt;/li&gt;
&lt;li&gt;Remove all null values in the columns that have null values.&lt;/li&gt;
&lt;li&gt;Replace zero values with null values in the budget and revenue columns.&lt;/li&gt;
&lt;li&gt;Drop the rows with runtime == 0.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The first part ends here. If you enjoyed reading this, kindly check the &lt;a href="https://dev.to/nkpremices/investigating-the-tmdb-movie-dataset-part-2-e64"&gt;second part&lt;/a&gt;, which is about data cleaning.&lt;/p&gt;

&lt;p&gt;Thank you for reading.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
  </channel>
</rss>
