Hygor

Posted on Aug 14, 2018

How to check missing values in pandas

#pandas #python #numpy

In this article I would like to describe how to find NaN values in a pandas DataFrame. This kind of operation can be very useful given that is common to find datasets with missing or incorrect data values.

I will be using the numpy package to generate some data with NaN values.

Import necessary packages

import pandas as pd
import numpy as np
import platform

print(f'Python version: {platform.python_version()} ({platform.python_implementation()})')
print(f'Pandas version: {pd.__version__}')
print(f'Numpy version: {np.__version__}')

Python version: 3.6.4 (CPython)
Pandas version: 0.23.1
Numpy version: 1.14.5

Generate data with NaN values

num_nan = 25 # number of NaN values wanted in the generated data
np.random.seed(6765431)  # set a seed for reproducibility
A = np.random.randn(10, 10)
print(A)

[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
   0.44499302 -0.61209568  0.31170935  1.4127548 ]
 [ 0.85330488  0.68517546 -1.10140989  0.84918019  0.72802961 -0.35161197
   0.73519152  1.13145412  0.53231247  0.78103143]
 [-0.81614324  0.15906898  0.49940119 -0.09319255 -1.07837721 -0.76053341
   0.73622083 -0.45518154 -0.69194032  1.02550409]
 [-1.96339975  0.07593331 -0.16798377 -1.20398958  0.88333656  1.17908422
   0.26324698 -2.65442248 -0.31583796 -0.16065732]
 [-1.24321376 -0.89816898  0.02824671  0.15304093  0.56505667 -0.78115883
   0.74504467  1.14025258 -0.04518221 -0.83908358]
 [ 1.00967019  0.84240102  1.15043436 -0.40120489  0.00664105 -1.23247563
   0.64738343  1.66096762 -0.92556683  0.47575796]
 [ 0.96516278  1.11158059 -0.82155143  0.88900313  2.16943761 -2.05250161
   2.40156233  0.92453867 -0.24437783 -2.91029265]
 [-0.86492662  0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
   0.07177233  1.02820181 -2.08947076  0.89859677]
 [-0.07263982 -0.56840867  1.30910275 -0.52846822  0.06019191 -0.61000727
   0.40782356 -0.36124333 -1.54522486 -0.07891861]
 [-1.96361682 -1.06315325 -0.45582138 -0.74566868  1.27579529 -2.46306005
   0.57022673 -0.02793746  0.78652775  1.27690195]]

# Set random values to nan
A.ravel()[np.random.choice(A.size, num_nan, replace=False)] = np.nan
print(A)

[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
          nan -0.61209568  0.31170935  1.4127548 ]
 [ 0.85330488  0.68517546         nan  0.84918019         nan -0.35161197
   0.73519152         nan  0.53231247  0.78103143]
 [-0.81614324  0.15906898  0.49940119         nan -1.07837721 -0.76053341
   0.73622083         nan -0.69194032  1.02550409]
 [-1.96339975  0.07593331         nan -1.20398958  0.88333656         nan
   0.26324698         nan -0.31583796 -0.16065732]
 [-1.24321376 -0.89816898  0.02824671  0.15304093  0.56505667 -0.78115883
   0.74504467  1.14025258 -0.04518221 -0.83908358]
 [ 1.00967019  0.84240102         nan -0.40120489  0.00664105         nan
   0.64738343  1.66096762 -0.92556683  0.47575796]
 [ 0.96516278         nan -0.82155143  0.88900313  2.16943761         nan
   2.40156233         nan -0.24437783         nan]
 [-0.86492662  0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
   0.07177233  1.02820181 -2.08947076         nan]
 [-0.07263982         nan  1.30910275 -0.52846822  0.06019191 -0.61000727
   0.40782356 -0.36124333         nan         nan]
 [        nan         nan         nan         nan  1.27579529 -2.46306005
          nan         nan  0.78652775  1.27690195]]

# Create a DataFrame from the generated data
df = pd.DataFrame(A)
df

	0	1	2	3	4	5	6	7	8	9
0	-1.561323	-0.169541	-0.178454	-1.336891	-0.191851	-1.186178	NaN	-0.612096	0.311709	1.412755
1	0.853305	0.685175	NaN	0.849180	NaN	-0.351612	0.735192	NaN	0.532312	0.781031
2	-0.816143	0.159069	0.499401	NaN	-1.078377	-0.760533	0.736221	NaN	-0.691940	1.025504
3	-1.963400	0.075933	NaN	-1.203990	0.883337	NaN	0.263247	NaN	-0.315838	-0.160657
4	-1.243214	-0.898169	0.028247	0.153041	0.565057	-0.781159	0.745045	1.140253	-0.045182	-0.839084
5	1.009670	0.842401	NaN	-0.401205	0.006641	NaN	0.647383	1.660968	-0.925567	0.475758
6	0.965163	NaN	-0.821551	0.889003	2.169438	NaN	2.401562	NaN	-0.244378	NaN
7	-0.864927	0.824432	-0.482469	-1.051831	-1.152725	-0.771707	0.071772	1.028202	-2.089471	NaN
8	-0.072640	NaN	1.309103	-0.528468	0.060192	-0.610007	0.407824	-0.361243	NaN	NaN
9	NaN	NaN	NaN	NaN	1.275795	-2.463060	NaN	NaN	0.786528	1.276902

Check for NaN values

Now that we have some data to operate on let's see the different ways we can check for missing values.

There are two methods of the DataFrame object that can be used: DataFrame#isna() and DataFrame#isnull(). But if you check the source code it seems that isnull() is only an alias for the isna() method. To keep it simple I will only use the isna() method as we would get the same result using isnull().

df.isna()

	0	1	2	3	4	5	6	7	8	9
0	False	False	False	False	False	False	True	False	False	False
1	False	False	True	False	True	False	False	True	False	False
2	False	False	False	True	False	False	False	True	False	False
3	False	False	True	False	False	True	False	True	False	False
4	False	False	False	False	False	False	False	False	False	False
5	False	False	True	False	False	True	False	False	False	False
6	False	True	False	False	False	True	False	True	False	True
7	False	False	False	False	False	False	False	False	False	True
8	False	True	False	False	False	False	False	False	True	True
9	True	True	True	True	False	False	True	True	False	False

As it can be seen above when we use the isna() method it returns a DataFrame with boolean values, where True indicates NaN values and False otherwise.

If we wanted to know how many missing values there are on each row or column we could use the DataFrame#sum() method:

df.isna().sum(axis='rows')  # 'rows' or 0

0    1
1    3
2    4
3    2
4    1
5    3
6    2
7    5
8    1
9    3
dtype: int64

df.isna().sum(axis='columns')  # 'columns' or 1

0    1
1    3
2    2
3    3
4    0
5    2
6    4
7    1
8    3
9    6
dtype: int64

To simply know the total number of missing values we can call sum() again:

df.isna().sum().sum()

If we simply wanna know if there is any missing value with no care for the quantity we can simply use the any() method:

df.isna().any()  # can also receive axis='rows' or 'columns'

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool

Calling it again we have a single boolean output:

df.isna().any().any()

True

Besides the isna() method we also have the notna() method which is its boolean inverse. Applying it we can get the number of values that are not missing or simply if all values are not missing (but using the all() method instead of any()).

print(df.notna().sum().sum())  # not missing
print(df.notna().all().all())

75
False

Note 1: in the examples, it was used the DataFrame methods to check for missing values, but the pandas package has its own functions with the same purpose that can be applied to other objects. Example:

print(pd.isna([1, 2, np.nan]))
print(pd.notna([1, 2, np.nan]))

[False False  True]
[ True  True False]

Note 2: the methods applied here on DataFrame objects are also available for Series and Index objects.

Time comparison

Comparing the time taken by the two methods we can see that using any() is faster but sum() will give us the additional information about how many missing values there are.

%timeit df.isna().any().any()

333 µs ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.isna().sum().sum()

561 µs ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Dealing with missing values

Two easy ways to deal with missing values are removing them or filling them with some value. These can be achieved with the dropna() and fillna() methods.

The dropna() method will return a DataFrame without the rows and columns containing missing values.

df.dropna()

	0	1	2	3	4	5	6	7	8	9
4	-1.243214	-0.898169	0.028247	0.153041	0.565057	-0.781159	0.745045	1.140253	-0.045182	-0.839084

The fillna() method will return a DataFrame with the missing values filled with a specified value.

df.fillna(value=5)

	0	1	2	3	4	5	6	7	8	9
0	-1.561323	-0.169541	-0.178454	-1.336891	-0.191851	-1.186178	5.000000	-0.612096	0.311709	1.412755
1	0.853305	0.685175	5.000000	0.849180	5.000000	-0.351612	0.735192	5.000000	0.532312	0.781031
2	-0.816143	0.159069	0.499401	5.000000	-1.078377	-0.760533	0.736221	5.000000	-0.691940	1.025504
3	-1.963400	0.075933	5.000000	-1.203990	0.883337	5.000000	0.263247	5.000000	-0.315838	-0.160657
4	-1.243214	-0.898169	0.028247	0.153041	0.565057	-0.781159	0.745045	1.140253	-0.045182	-0.839084
5	1.009670	0.842401	5.000000	-0.401205	0.006641	5.000000	0.647383	1.660968	-0.925567	0.475758
6	0.965163	5.000000	-0.821551	0.889003	2.169438	5.000000	2.401562	5.000000	-0.244378	5.000000
7	-0.864927	0.824432	-0.482469	-1.051831	-1.152725	-0.771707	0.071772	1.028202	-2.089471	5.000000
8	-0.072640	5.000000	1.309103	-0.528468	0.060192	-0.610007	0.407824	-0.361243	5.000000	5.000000
9	5.000000	5.000000	5.000000	5.000000	1.275795	-2.463060	5.000000	5.000000	0.786528	1.276902

DEV Community

How to check missing values in pandas

Import necessary packages

Generate data with NaN values

Check for NaN values

Time comparison

Dealing with missing values

References:

Top comments (0)

Read next

Python Core Developer Sprint 2024

Character Encoding with the Python os module and Unicode

Download multiple YouTube videos and audio using Python.

Pytest and PostgreSQL: Fresh database for every test (part II)