In this article I would like to describe how to find NaN values in a pandas DataFrame. This kind of operation can be very useful given that is common to find datasets with missing or incorrect data values.
I will be using the numpy package to generate some data with NaN values.
Import necessary packages
import pandas as pd
import numpy as np
import platform
print(f'Python version: {platform.python_version()} ({platform.python_implementation()})')
print(f'Pandas version: {pd.__version__}')
print(f'Numpy version: {np.__version__}')
Python version: 3.6.4 (CPython)
Pandas version: 0.23.1
Numpy version: 1.14.5
Generate data with NaN values
num_nan = 25 # number of NaN values wanted in the generated data
np.random.seed(6765431) # set a seed for reproducibility
A = np.random.randn(10, 10)
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
0.44499302 -0.61209568 0.31170935 1.4127548 ]
[ 0.85330488 0.68517546 -1.10140989 0.84918019 0.72802961 -0.35161197
0.73519152 1.13145412 0.53231247 0.78103143]
[-0.81614324 0.15906898 0.49940119 -0.09319255 -1.07837721 -0.76053341
0.73622083 -0.45518154 -0.69194032 1.02550409]
[-1.96339975 0.07593331 -0.16798377 -1.20398958 0.88333656 1.17908422
0.26324698 -2.65442248 -0.31583796 -0.16065732]
[-1.24321376 -0.89816898 0.02824671 0.15304093 0.56505667 -0.78115883
0.74504467 1.14025258 -0.04518221 -0.83908358]
[ 1.00967019 0.84240102 1.15043436 -0.40120489 0.00664105 -1.23247563
0.64738343 1.66096762 -0.92556683 0.47575796]
[ 0.96516278 1.11158059 -0.82155143 0.88900313 2.16943761 -2.05250161
2.40156233 0.92453867 -0.24437783 -2.91029265]
[-0.86492662 0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
0.07177233 1.02820181 -2.08947076 0.89859677]
[-0.07263982 -0.56840867 1.30910275 -0.52846822 0.06019191 -0.61000727
0.40782356 -0.36124333 -1.54522486 -0.07891861]
[-1.96361682 -1.06315325 -0.45582138 -0.74566868 1.27579529 -2.46306005
0.57022673 -0.02793746 0.78652775 1.27690195]]
# Set random values to nan
A.ravel()[np.random.choice(A.size, num_nan, replace=False)] = np.nan
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
nan -0.61209568 0.31170935 1.4127548 ]
[ 0.85330488 0.68517546 nan 0.84918019 nan -0.35161197
0.73519152 nan 0.53231247 0.78103143]
[-0.81614324 0.15906898 0.49940119 nan -1.07837721 -0.76053341
0.73622083 nan -0.69194032 1.02550409]
[-1.96339975 0.07593331 nan -1.20398958 0.88333656 nan
0.26324698 nan -0.31583796 -0.16065732]
[-1.24321376 -0.89816898 0.02824671 0.15304093 0.56505667 -0.78115883
0.74504467 1.14025258 -0.04518221 -0.83908358]
[ 1.00967019 0.84240102 nan -0.40120489 0.00664105 nan
0.64738343 1.66096762 -0.92556683 0.47575796]
[ 0.96516278 nan -0.82155143 0.88900313 2.16943761 nan
2.40156233 nan -0.24437783 nan]
[-0.86492662 0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
0.07177233 1.02820181 -2.08947076 nan]
[-0.07263982 nan 1.30910275 -0.52846822 0.06019191 -0.61000727
0.40782356 -0.36124333 nan nan]
[ nan nan nan nan 1.27579529 -2.46306005
nan nan 0.78652775 1.27690195]]
# Create a DataFrame from the generated data
df = pd.DataFrame(A)
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -1.561323 | -0.169541 | -0.178454 | -1.336891 | -0.191851 | -1.186178 | NaN | -0.612096 | 0.311709 | 1.412755 |
1 | 0.853305 | 0.685175 | NaN | 0.849180 | NaN | -0.351612 | 0.735192 | NaN | 0.532312 | 0.781031 |
2 | -0.816143 | 0.159069 | 0.499401 | NaN | -1.078377 | -0.760533 | 0.736221 | NaN | -0.691940 | 1.025504 |
3 | -1.963400 | 0.075933 | NaN | -1.203990 | 0.883337 | NaN | 0.263247 | NaN | -0.315838 | -0.160657 |
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
5 | 1.009670 | 0.842401 | NaN | -0.401205 | 0.006641 | NaN | 0.647383 | 1.660968 | -0.925567 | 0.475758 |
6 | 0.965163 | NaN | -0.821551 | 0.889003 | 2.169438 | NaN | 2.401562 | NaN | -0.244378 | NaN |
7 | -0.864927 | 0.824432 | -0.482469 | -1.051831 | -1.152725 | -0.771707 | 0.071772 | 1.028202 | -2.089471 | NaN |
8 | -0.072640 | NaN | 1.309103 | -0.528468 | 0.060192 | -0.610007 | 0.407824 | -0.361243 | NaN | NaN |
9 | NaN | NaN | NaN | NaN | 1.275795 | -2.463060 | NaN | NaN | 0.786528 | 1.276902 |
Check for NaN values
Now that we have some data to operate on let's see the different ways we can check for missing values.
There are two methods of the DataFrame object that can be used: DataFrame#isna()
and DataFrame#isnull()
. But if you check the source code it seems that isnull()
is only an alias for the isna()
method. To keep it simple I will only use the isna()
method as we would get the same result using isnull()
.
df.isna()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | True | False | False | False |
1 | False | False | True | False | True | False | False | True | False | False |
2 | False | False | False | True | False | False | False | True | False | False |
3 | False | False | True | False | False | True | False | True | False | False |
4 | False | False | False | False | False | False | False | False | False | False |
5 | False | False | True | False | False | True | False | False | False | False |
6 | False | True | False | False | False | True | False | True | False | True |
7 | False | False | False | False | False | False | False | False | False | True |
8 | False | True | False | False | False | False | False | False | True | True |
9 | True | True | True | True | False | False | True | True | False | False |
As it can be seen above when we use the isna()
method it returns a DataFrame with boolean values, where True
indicates NaN values and False
otherwise.
If we wanted to know how many missing values there are on each row or column we could use the DataFrame#sum()
method:
df.isna().sum(axis='rows') # 'rows' or 0
0 1
1 3
2 4
3 2
4 1
5 3
6 2
7 5
8 1
9 3
dtype: int64
df.isna().sum(axis='columns') # 'columns' or 1
0 1
1 3
2 2
3 3
4 0
5 2
6 4
7 1
8 3
9 6
dtype: int64
To simply know the total number of missing values we can call sum()
again:
df.isna().sum().sum()
25
If we simply wanna know if there is any missing value with no care for the quantity we can simply use the any()
method:
df.isna().any() # can also receive axis='rows' or 'columns'
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
Calling it again we have a single boolean output:
df.isna().any().any()
True
Besides the isna()
method we also have the notna()
method which is its boolean inverse. Applying it we can get the number of values that are not missing or simply if all values are not missing (but using the all()
method instead of any()
).
print(df.notna().sum().sum()) # not missing
print(df.notna().all().all())
75
False
Note 1: in the examples, it was used the DataFrame methods to check for missing values, but the pandas package has its own functions with the same purpose that can be applied to other objects. Example:
print(pd.isna([1, 2, np.nan]))
print(pd.notna([1, 2, np.nan]))
[False False True]
[ True True False]
Note 2: the methods applied here on DataFrame objects are also available for Series and Index objects.
Time comparison
Comparing the time taken by the two methods we can see that using any()
is faster but sum()
will give us the additional information about how many missing values there are.
%timeit df.isna().any().any()
333 µs ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.isna().sum().sum()
561 µs ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Dealing with missing values
Two easy ways to deal with missing values are removing them or filling them with some value. These can be achieved with the dropna()
and fillna()
methods.
The dropna()
method will return a DataFrame without the rows and columns containing missing values.
df.dropna()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
The fillna()
method will return a DataFrame with the missing values filled with a specified value.
df.fillna(value=5)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -1.561323 | -0.169541 | -0.178454 | -1.336891 | -0.191851 | -1.186178 | 5.000000 | -0.612096 | 0.311709 | 1.412755 |
1 | 0.853305 | 0.685175 | 5.000000 | 0.849180 | 5.000000 | -0.351612 | 0.735192 | 5.000000 | 0.532312 | 0.781031 |
2 | -0.816143 | 0.159069 | 0.499401 | 5.000000 | -1.078377 | -0.760533 | 0.736221 | 5.000000 | -0.691940 | 1.025504 |
3 | -1.963400 | 0.075933 | 5.000000 | -1.203990 | 0.883337 | 5.000000 | 0.263247 | 5.000000 | -0.315838 | -0.160657 |
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
5 | 1.009670 | 0.842401 | 5.000000 | -0.401205 | 0.006641 | 5.000000 | 0.647383 | 1.660968 | -0.925567 | 0.475758 |
6 | 0.965163 | 5.000000 | -0.821551 | 0.889003 | 2.169438 | 5.000000 | 2.401562 | 5.000000 | -0.244378 | 5.000000 |
7 | -0.864927 | 0.824432 | -0.482469 | -1.051831 | -1.152725 | -0.771707 | 0.071772 | 1.028202 | -2.089471 | 5.000000 |
8 | -0.072640 | 5.000000 | 1.309103 | -0.528468 | 0.060192 | -0.610007 | 0.407824 | -0.361243 | 5.000000 | 5.000000 |
9 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 1.275795 | -2.463060 | 5.000000 | 5.000000 | 0.786528 | 1.276902 |
Top comments (0)