Welcome to the fifth article in the "7 Days of Pandas" series where we cover the pandas
library in Python which is used for data manipulation.
In the first article of the series, we looked at how to read and write CSV files with Pandas. In this tutorial, we will look at some of the most common operations that we perform on a dataframe in Pandas.
In the second article, we looked at how to perform basic data manipulation.
In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.
In the fourth article, we looked at how to handle missing values in a dataframe.
In this tutorial, we will look aggregate and group data in Pandas.
Aggregating and grouping data is a common task when working with datasets, and pandas provides a range of functions and methods to help you do this efficiently.
In this tutorial, we will cover the following topics:
- Applying aggregate functions to pandas dataframe.
- Grouping data in pandas dataframe (and applying aggregate functions to the grouped data).
Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.
import pandas as pd
# employee data
data = {
"Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
"Gender": ["M", "M", "F", "F", "M", "M", "F"],
"Age": [26, 28, 27, 32, 24, 31, 33],
"Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
"Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}
# create pandas dataframe
df = pd.DataFrame(data)
# display the dataframe
df
Name | Gender | Age | Department | Salary | |
---|---|---|---|---|---|
0 | Tim | M | 26 | Marketing | 60000 |
1 | Shaym | M | 28 | Product | 70000 |
2 | Noor | F | 27 | Product | 82000 |
3 | Esha | F | 32 | HR | 55000 |
4 | Sam | M | 24 | Product | 58000 |
5 | James | M | 31 | HR | 55000 |
6 | Lily | F | 33 | Marketing | 65000 |
Applying aggregate functions
Pandas comes with a number of aggregate functions that you can apply to the entire dataframe or one or more columns in the dataframe.
For example, you can apply the sum()
funciton to get the sum of values in each column.
df.sum()
Name TimShaymNoorEshaSamJamesLily
Gender MMFFMMF
Age 201
Department MarketingProductProductHRProductHRMarketing
Salary 445000
dtype: object
Note that for object
type sum()
resulted in a concatenated string.
You can select which columns to apply the aggregate functions to.
For example, let's the mean value of the "Age" and the "Salary" columns.
df[["Age", "Salary"]].mean()
Age 28.714286
Salary 63571.428571
dtype: float64
Grouping data
To group a pandas DataFrame by one or more columns, you can use the pandas dataframe groupby()
method. This method takes one or more column names as arguments and returns a groupby object that can be used to apply various operations to the grouped data.
Let's group the above data on the "Gender" column.
df.groupby("Gender")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10684b970>
We get a groupby object. You can now use this object to apply aggregations to the grouped data. For example, let's get the average "Age" value for each group.
df.groupby("Gender")["Age"].mean()
Gender
F 30.666667
M 27.250000
Name: Age, dtype: float64
We get the mean value of the "Age" column for each group (here, "Gender") in the data.
You can group the data of more than one columns as well. For example, let's group the data on "Gender" and "Department" and get the average "Age" in each group.
df.groupby(["Gender", "Department"])["Age"].mean()
Gender Department
F HR 32.0
Marketing 33.0
Product 27.0
M HR 31.0
Marketing 26.0
Product 26.0
Name: Age, dtype: float64
You can also apply multiple aggregate functions to the grouped data using the .agg()
function.
df.groupby(["Gender", "Department"]).agg(['mean', 'count'])
Age | Salary | ||||
---|---|---|---|---|---|
mean | count | mean | count | ||
Gender | Department | ||||
F | HR | 32.0 | 1 | 55000.0 | 1 |
Marketing | 33.0 | 1 | 65000.0 | 1 | |
Product | 27.0 | 1 | 82000.0 | 1 | |
M | HR | 31.0 | 1 | 55000.0 | 1 |
Marketing | 26.0 | 1 | 60000.0 | 1 | |
Product | 26.0 | 2 | 64000.0 | 2 |
Top comments (0)