DEV Community

Hygor
Hygor

Posted on

How to check missing values in pandas

In this article I would like to describe how to find NaN values in a pandas DataFrame. This kind of operation can be very useful given that is common to find datasets with missing or incorrect data values.

I will be using the numpy package to generate some data with NaN values.

Import necessary packages

import pandas as pd
import numpy as np
import platform
Enter fullscreen mode Exit fullscreen mode
print(f'Python version: {platform.python_version()} ({platform.python_implementation()})')
print(f'Pandas version: {pd.__version__}')
print(f'Numpy version: {np.__version__}')
Enter fullscreen mode Exit fullscreen mode
Python version: 3.6.4 (CPython)
Pandas version: 0.23.1
Numpy version: 1.14.5
Enter fullscreen mode Exit fullscreen mode

Generate data with NaN values

num_nan = 25 # number of NaN values wanted in the generated data
np.random.seed(6765431)  # set a seed for reproducibility
A = np.random.randn(10, 10)
print(A)
Enter fullscreen mode Exit fullscreen mode
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
   0.44499302 -0.61209568  0.31170935  1.4127548 ]
 [ 0.85330488  0.68517546 -1.10140989  0.84918019  0.72802961 -0.35161197
   0.73519152  1.13145412  0.53231247  0.78103143]
 [-0.81614324  0.15906898  0.49940119 -0.09319255 -1.07837721 -0.76053341
   0.73622083 -0.45518154 -0.69194032  1.02550409]
 [-1.96339975  0.07593331 -0.16798377 -1.20398958  0.88333656  1.17908422
   0.26324698 -2.65442248 -0.31583796 -0.16065732]
 [-1.24321376 -0.89816898  0.02824671  0.15304093  0.56505667 -0.78115883
   0.74504467  1.14025258 -0.04518221 -0.83908358]
 [ 1.00967019  0.84240102  1.15043436 -0.40120489  0.00664105 -1.23247563
   0.64738343  1.66096762 -0.92556683  0.47575796]
 [ 0.96516278  1.11158059 -0.82155143  0.88900313  2.16943761 -2.05250161
   2.40156233  0.92453867 -0.24437783 -2.91029265]
 [-0.86492662  0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
   0.07177233  1.02820181 -2.08947076  0.89859677]
 [-0.07263982 -0.56840867  1.30910275 -0.52846822  0.06019191 -0.61000727
   0.40782356 -0.36124333 -1.54522486 -0.07891861]
 [-1.96361682 -1.06315325 -0.45582138 -0.74566868  1.27579529 -2.46306005
   0.57022673 -0.02793746  0.78652775  1.27690195]]
Enter fullscreen mode Exit fullscreen mode
# Set random values to nan
A.ravel()[np.random.choice(A.size, num_nan, replace=False)] = np.nan
print(A)
Enter fullscreen mode Exit fullscreen mode
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
          nan -0.61209568  0.31170935  1.4127548 ]
 [ 0.85330488  0.68517546         nan  0.84918019         nan -0.35161197
   0.73519152         nan  0.53231247  0.78103143]
 [-0.81614324  0.15906898  0.49940119         nan -1.07837721 -0.76053341
   0.73622083         nan -0.69194032  1.02550409]
 [-1.96339975  0.07593331         nan -1.20398958  0.88333656         nan
   0.26324698         nan -0.31583796 -0.16065732]
 [-1.24321376 -0.89816898  0.02824671  0.15304093  0.56505667 -0.78115883
   0.74504467  1.14025258 -0.04518221 -0.83908358]
 [ 1.00967019  0.84240102         nan -0.40120489  0.00664105         nan
   0.64738343  1.66096762 -0.92556683  0.47575796]
 [ 0.96516278         nan -0.82155143  0.88900313  2.16943761         nan
   2.40156233         nan -0.24437783         nan]
 [-0.86492662  0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
   0.07177233  1.02820181 -2.08947076         nan]
 [-0.07263982         nan  1.30910275 -0.52846822  0.06019191 -0.61000727
   0.40782356 -0.36124333         nan         nan]
 [        nan         nan         nan         nan  1.27579529 -2.46306005
          nan         nan  0.78652775  1.27690195]]
Enter fullscreen mode Exit fullscreen mode
# Create a DataFrame from the generated data
df = pd.DataFrame(A)
df
Enter fullscreen mode Exit fullscreen mode
0 1 2 3 4 5 6 7 8 9
0 -1.561323 -0.169541 -0.178454 -1.336891 -0.191851 -1.186178 NaN -0.612096 0.311709 1.412755
1 0.853305 0.685175 NaN 0.849180 NaN -0.351612 0.735192 NaN 0.532312 0.781031
2 -0.816143 0.159069 0.499401 NaN -1.078377 -0.760533 0.736221 NaN -0.691940 1.025504
3 -1.963400 0.075933 NaN -1.203990 0.883337 NaN 0.263247 NaN -0.315838 -0.160657
4 -1.243214 -0.898169 0.028247 0.153041 0.565057 -0.781159 0.745045 1.140253 -0.045182 -0.839084
5 1.009670 0.842401 NaN -0.401205 0.006641 NaN 0.647383 1.660968 -0.925567 0.475758
6 0.965163 NaN -0.821551 0.889003 2.169438 NaN 2.401562 NaN -0.244378 NaN
7 -0.864927 0.824432 -0.482469 -1.051831 -1.152725 -0.771707 0.071772 1.028202 -2.089471 NaN
8 -0.072640 NaN 1.309103 -0.528468 0.060192 -0.610007 0.407824 -0.361243 NaN NaN
9 NaN NaN NaN NaN 1.275795 -2.463060 NaN NaN 0.786528 1.276902

Check for NaN values

Now that we have some data to operate on let's see the different ways we can check for missing values.

There are two methods of the DataFrame object that can be used: DataFrame#isna() and DataFrame#isnull(). But if you check the source code it seems that isnull() is only an alias for the isna() method. To keep it simple I will only use the isna() method as we would get the same result using isnull().

df.isna()
Enter fullscreen mode Exit fullscreen mode
0 1 2 3 4 5 6 7 8 9
0 False False False False False False True False False False
1 False False True False True False False True False False
2 False False False True False False False True False False
3 False False True False False True False True False False
4 False False False False False False False False False False
5 False False True False False True False False False False
6 False True False False False True False True False True
7 False False False False False False False False False True
8 False True False False False False False False True True
9 True True True True False False True True False False

As it can be seen above when we use the isna() method it returns a DataFrame with boolean values, where True indicates NaN values and False otherwise.

If we wanted to know how many missing values there are on each row or column we could use the DataFrame#sum() method:

df.isna().sum(axis='rows')  # 'rows' or 0
Enter fullscreen mode Exit fullscreen mode
0    1
1    3
2    4
3    2
4    1
5    3
6    2
7    5
8    1
9    3
dtype: int64
Enter fullscreen mode Exit fullscreen mode
df.isna().sum(axis='columns')  # 'columns' or 1
Enter fullscreen mode Exit fullscreen mode
0    1
1    3
2    2
3    3
4    0
5    2
6    4
7    1
8    3
9    6
dtype: int64
Enter fullscreen mode Exit fullscreen mode

To simply know the total number of missing values we can call sum() again:

df.isna().sum().sum()
Enter fullscreen mode Exit fullscreen mode
25
Enter fullscreen mode Exit fullscreen mode

If we simply wanna know if there is any missing value with no care for the quantity we can simply use the any() method:

df.isna().any()  # can also receive axis='rows' or 'columns'
Enter fullscreen mode Exit fullscreen mode
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool
Enter fullscreen mode Exit fullscreen mode

Calling it again we have a single boolean output:

df.isna().any().any()
Enter fullscreen mode Exit fullscreen mode
True
Enter fullscreen mode Exit fullscreen mode

Besides the isna() method we also have the notna() method which is its boolean inverse. Applying it we can get the number of values that are not missing or simply if all values are not missing (but using the all() method instead of any()).

print(df.notna().sum().sum())  # not missing
print(df.notna().all().all())
Enter fullscreen mode Exit fullscreen mode
75
False
Enter fullscreen mode Exit fullscreen mode

Note 1: in the examples, it was used the DataFrame methods to check for missing values, but the pandas package has its own functions with the same purpose that can be applied to other objects. Example:

print(pd.isna([1, 2, np.nan]))
print(pd.notna([1, 2, np.nan]))
Enter fullscreen mode Exit fullscreen mode
[False False  True]
[ True  True False]
Enter fullscreen mode Exit fullscreen mode

Note 2: the methods applied here on DataFrame objects are also available for Series and Index objects.

Time comparison

Comparing the time taken by the two methods we can see that using any() is faster but sum() will give us the additional information about how many missing values there are.

%timeit df.isna().any().any()
Enter fullscreen mode Exit fullscreen mode
333 µs ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Enter fullscreen mode Exit fullscreen mode
%timeit df.isna().sum().sum()
Enter fullscreen mode Exit fullscreen mode
561 µs ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Enter fullscreen mode Exit fullscreen mode

Dealing with missing values

Two easy ways to deal with missing values are removing them or filling them with some value. These can be achieved with the dropna() and fillna() methods.

The dropna() method will return a DataFrame without the rows and columns containing missing values.

df.dropna()
Enter fullscreen mode Exit fullscreen mode
0 1 2 3 4 5 6 7 8 9
4 -1.243214 -0.898169 0.028247 0.153041 0.565057 -0.781159 0.745045 1.140253 -0.045182 -0.839084

The fillna() method will return a DataFrame with the missing values filled with a specified value.

df.fillna(value=5)
Enter fullscreen mode Exit fullscreen mode
0 1 2 3 4 5 6 7 8 9
0 -1.561323 -0.169541 -0.178454 -1.336891 -0.191851 -1.186178 5.000000 -0.612096 0.311709 1.412755
1 0.853305 0.685175 5.000000 0.849180 5.000000 -0.351612 0.735192 5.000000 0.532312 0.781031
2 -0.816143 0.159069 0.499401 5.000000 -1.078377 -0.760533 0.736221 5.000000 -0.691940 1.025504
3 -1.963400 0.075933 5.000000 -1.203990 0.883337 5.000000 0.263247 5.000000 -0.315838 -0.160657
4 -1.243214 -0.898169 0.028247 0.153041 0.565057 -0.781159 0.745045 1.140253 -0.045182 -0.839084
5 1.009670 0.842401 5.000000 -0.401205 0.006641 5.000000 0.647383 1.660968 -0.925567 0.475758
6 0.965163 5.000000 -0.821551 0.889003 2.169438 5.000000 2.401562 5.000000 -0.244378 5.000000
7 -0.864927 0.824432 -0.482469 -1.051831 -1.152725 -0.771707 0.071772 1.028202 -2.089471 5.000000
8 -0.072640 5.000000 1.309103 -0.528468 0.060192 -0.610007 0.407824 -0.361243 5.000000 5.000000
9 5.000000 5.000000 5.000000 5.000000 1.275795 -2.463060 5.000000 5.000000 0.786528 1.276902

References:

Top comments (0)