DEV Community

Cover image for Different methods to filter a Pandas DataFrame
Aviator
Aviator

Posted on • Originally published at Medium

Different methods to filter a Pandas DataFrame

Pandas is a popular open-source python library used for data manipulation and analysis. It helps in the data analysis process which involves cleaning, exploring, analyzing, and visualizing data.

In this article, we will cover the various methods of performing filtering using pandas.

Data Filtering is one of the data manipulation operations we can perform when working with data. Filtering with pandas is similar to a WHERE clause in SQL or a filter in Microsoft Excel.

If you have worked with these tools, you have an idea of how filtering works.
Filters can be used to select a subset of data. It can also be used to exclude or discover null values from a dataset

Here are some examples of filtering

1. Select all employees who were employed by the organization after 2020
2. Select all male employees who work in the finance and marketing department.
3. Select all traffic accidents that occurred on Wednesday around 12 noon
4. Select all students whose registration number is null

These are some real-world examples of what we can do with filters depending on our dataset and use case for our analysis.

Note: This tutorial assumes that you are already familiar with the usage of pandas for data analysis and you can differentiate between its underlying data structure namely Series and Dataframe.

To keep this tutorial simple, we would be creating our sample dataset. It would be a dataset of employees working for an organization.

import pandas as pd
Enter fullscreen mode Exit fullscreen mode
employees_data = {
    'first_name': ['Jamie', 'Michael', 'James', 'Mark', 'Stones', 'Sharon', 'Jessica', 'Johnson'],
    'last_name': ['Johnson', 'Marley', 'Peterson', 'Henry', 'Walker', 'White', 'Mendy', 'Johnson'],
    'department': ['Accounting', 'Marketing', 'Engineering', 'HR', 'HR', 'Sales', 'Data Analytics', 'Accounting'],
    'date_of_hire': [2011, 2012, 2022, 2000, 2007, 2018, 2020, 2018],
    'gender': ['M','M','M','M','F','F','M','F'],
    'salary': [10000,87500,77000,44500,90000,45000,25000,65000],
    'national_id': [22123,78656,98976,12765, None, None,56432,98744],
    'emp_id': ['Emp22', 'Emp54', 'Emp77', 'Emp99', 'Emp98', 'Emp01', 'Emp36', 'Emp04']
}

employees_df = pd.DataFrame(data=employees_data)
Enter fullscreen mode Exit fullscreen mode

pandas dataframe

In pandas, there is more than one way to operate. Pandas provide multiple ways to filter data.

They include:

isin()
This method provides a way to apply single or multiple conditions for filtering.

employees_df[employees_df['department'].isin(['Marketing'])]
Enter fullscreen mode Exit fullscreen mode

isin method 1

employees_df[employees_df['department'].isin(['Marketing','HR'])]
Enter fullscreen mode Exit fullscreen mode

isin method 2

Logical operators
We can use the available logical operators to filter for a subset of data

employees_df[employees_df['salary'] > 30000]
Enter fullscreen mode Exit fullscreen mode

Logical operators

Here we filter for employees whose salary is above 30000.

Occasions may arise where we have to filter our data for multiple conditions. The logical operator also makes this possible.

employees_df[(employees_df['salary'] > 30000) & (employees_df['department'] == 'Marketing')]
Enter fullscreen mode Exit fullscreen mode

multiple logical operators
For all employees with a salary above 30000, who work in the marketing department

You can also use other logical operators such as less than(<), greater than(>), equal to(=), not equal to(!=), etc.

Query Function
The query function takes in an expression as an argument which evaluates to a Boolean that is used to filter the dataframe.

employees_df.query("department == 'Marketing'")
Enter fullscreen mode Exit fullscreen mode

query function

We can also query based on multiple conditions

employees_df.query("department == 'Marketing' and date_of_hire > 2006")
Enter fullscreen mode Exit fullscreen mode

multiple query method

Str Accessor
Pandas make it easy to work with string values. Using the str accessor, we can filter for records whose values are strings.

employees_df[employees_df['department'].str.contains("M")]
Enter fullscreen mode Exit fullscreen mode

str contains method

employees_df[employees_df['department'].str.startswith("M")]
Enter fullscreen mode Exit fullscreen mode

str startswith method

nlargest and nsmallest
Most times, we just need records of the highest or lowest values in a column.
These methods make it possible. We could filter for the highest 3 or lowest 3 salaries.

employees_df.nlargest(3, 'salary')
Enter fullscreen mode Exit fullscreen mode

nlargest method

employees_df.nsmallest(3, 'salary')
Enter fullscreen mode Exit fullscreen mode

nsmallest method

Tilde sign (~)
Used to reverse the logic used in filter condition

employees_df[~employees_df['department'].str.contains("A")]
Enter fullscreen mode Exit fullscreen mode

Tilde method

employees_df[~employees_df['department'].isin(['Marketing','HR'])]
Enter fullscreen mode Exit fullscreen mode

Tilde multiple parameters

isnull | notnull
Using the isnull method, we can return records that have NaN values and mark them for deletion. Using the notnull method, we can filter for records that do not contain NaN values.

employees_df[employees_df['national_id'].isnull()]
Enter fullscreen mode Exit fullscreen mode

isnull method

employees_df[employees_df['national_id'].notnull()]
Enter fullscreen mode Exit fullscreen mode

notnull method

Filter
Using this method, we can also filter for a subset of data.

employees_df.filter(items=['first_name', 'department'],axis=1)
Enter fullscreen mode Exit fullscreen mode

filter method 1

Filter where the index begins with 5 or 8 using regex.

employees_df.filter(regex='5|8', axis=0)
Enter fullscreen mode Exit fullscreen mode

filter method 2

filter the dataframe where the index is 6

employees_df.filter(like='6', axis=0)
Enter fullscreen mode Exit fullscreen mode

Filter method 3

In this tutorial, we discussed some of the different ways of filtering rows to return a subset of data from our pandas dataframe.
The only way to master pandas and be good at data manipulation is through practice.
I do hope you will make out time to practice.

Connect with me on Twitter or LinkedIn where I also share some useful data analytics tips

Top comments (0)