Thales Bruno

Posted on Aug 6, 2020 • Originally published at thalesbr.uno

Handling missing values

#datascience #tutorial #beginners #python

Header photo by Emily Morter on Unsplash

A key part of data science is to clean the data before work with it. For example, it's common finding missing values in a dataset, either because that data was not collected or don't exist though.

Depends on the data we have missing, or even why we don't have it, we handle it in different ways. So, we need to analyze each dataset column carefully and decide whether we will remove all rows with missing values, or remove the column itself, or fill the cells with some data (and which data could be).

Count missing values

Before handle the missing values, we should know exactly how much of them we have. The code below counts all missing values in a pandas DataFrame df and print it all by column.

df.isnull().sum()

Output:

budget                     0
genres                    28
homepage                3091
id                         0
keywords                 412
original_language          0

Check relative data missing

So, we know we have missing values in our dataset, and now we may figure out the proportion of missing values in relation to all data. To do this we sum the missing values, divide by the number of cells, and multiply by 100, as we can in the code below.

def percent_missing(df: pd.DataFrame) -> float:
    # # of rows x # of columns using numpy product function
    total_data_points = np.product(df.shape)
    total_missing = df.isnull().sum().sum()
    return (total_missing/total_data_points)*100

print(f"{percent_missing(df)}"%)

Output:

3.863904504129364%

Drop rows with missing values

An approach to handle missing data is just to drop all rows containing missing values. Again, the decision needs to be made after a careful analysis. But, if we decide to drop the rows, the code is quite simple.

df.dropna()

Drop columns with missing values

If we decide to drop the columns instead, we just say that to the dropna() method with the axis parameter.

df.dropna(axis=1)

Fill in missing values

Even though, if we don't want to drop the rows or columns, we may fill the cells with some data. As always, we need to analyze and decide which data we will fill with. Some approaches are: fill with 0, with blank space, with the most common value observed in that column, with the last or the next valid value observed, with statistics measurements like mean or median, and so on.

Using pandas

Pandas has the fillna() method to deal with it. We specify the value we want to fill and pass it as an argument to the method. In the code below we are filling all missing cells with 0.

df.fillna(0)

Using Scikit-Learn

With scikit-learn, we have more sophisticated ways to deal with the problem. For example, we may fill a categorical column with the most frequent value observed on it, or fill a numerical column with the mean value of the data. The sklearn SimpleImputer class has many strategies for imputing data as you can see in its documentation. The code below uses the most_frequent strategy, that

replace missing using the most frequent value along each column. Can be used with strings or numeric data.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
new_df = imputer.fit_transform(df)

In a future article, I will cover in more detail the use of scikit-learn. See you!

Documentation

Pandas dropna()
Pandas fillna()
Scikit-learn SimpleImputer()

DEV Community