DEV Community

Piyush Raj
Piyush Raj

Posted on • Edited on

Pandas - Aggregating and Grouping Data - 7 Days of Pandas

Welcome to the fifth article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas. In this tutorial, we will look at some of the most common operations that we perform on a dataframe in Pandas.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In the fourth article, we looked at how to handle missing values in a dataframe.

In this tutorial, we will look aggregate and group data in Pandas.

Aggregating and grouping data is a common task when working with datasets, and pandas provides a range of functions and methods to help you do this efficiently.

In this tutorial, we will cover the following topics:

  1. Applying aggregate functions to pandas dataframe.
  2. Grouping data in pandas dataframe (and applying aggregate functions to the grouped data).

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Gender": ["M", "M", "F", "F", "M", "M", "F"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df
Enter fullscreen mode Exit fullscreen mode
Name Gender Age Department Salary
0 Tim M 26 Marketing 60000
1 Shaym M 28 Product 70000
2 Noor F 27 Product 82000
3 Esha F 32 HR 55000
4 Sam M 24 Product 58000
5 James M 31 HR 55000
6 Lily F 33 Marketing 65000

Applying aggregate functions

Pandas comes with a number of aggregate functions that you can apply to the entire dataframe or one or more columns in the dataframe.

For example, you can apply the sum() funciton to get the sum of values in each column.

df.sum()
Enter fullscreen mode Exit fullscreen mode
Name                         TimShaymNoorEshaSamJamesLily
Gender                                            MMFFMMF
Age                                                   201
Department    MarketingProductProductHRProductHRMarketing
Salary                                             445000
dtype: object
Enter fullscreen mode Exit fullscreen mode

Note that for object type sum() resulted in a concatenated string.

You can select which columns to apply the aggregate functions to.

For example, let's the mean value of the "Age" and the "Salary" columns.

df[["Age", "Salary"]].mean()
Enter fullscreen mode Exit fullscreen mode
Age          28.714286
Salary    63571.428571
dtype: float64
Enter fullscreen mode Exit fullscreen mode

Grouping data

To group a pandas DataFrame by one or more columns, you can use the pandas dataframe groupby() method. This method takes one or more column names as arguments and returns a groupby object that can be used to apply various operations to the grouped data.

Let's group the above data on the "Gender" column.

df.groupby("Gender")
Enter fullscreen mode Exit fullscreen mode
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10684b970>
Enter fullscreen mode Exit fullscreen mode

We get a groupby object. You can now use this object to apply aggregations to the grouped data. For example, let's get the average "Age" value for each group.

df.groupby("Gender")["Age"].mean()
Enter fullscreen mode Exit fullscreen mode
Gender
F    30.666667
M    27.250000
Name: Age, dtype: float64
Enter fullscreen mode Exit fullscreen mode

We get the mean value of the "Age" column for each group (here, "Gender") in the data.

You can group the data of more than one columns as well. For example, let's group the data on "Gender" and "Department" and get the average "Age" in each group.

df.groupby(["Gender", "Department"])["Age"].mean()
Enter fullscreen mode Exit fullscreen mode
Gender  Department
F       HR            32.0
        Marketing     33.0
        Product       27.0
M       HR            31.0
        Marketing     26.0
        Product       26.0
Name: Age, dtype: float64
Enter fullscreen mode Exit fullscreen mode

You can also apply multiple aggregate functions to the grouped data using the .agg() function.

df.groupby(["Gender", "Department"]).agg(['mean', 'count'])
Enter fullscreen mode Exit fullscreen mode
Age Salary
mean count mean count
Gender Department
F HR 32.0 1 55000.0 1
Marketing 33.0 1 65000.0 1
Product 27.0 1 82000.0 1
M HR 31.0 1 55000.0 1
Marketing 26.0 1 60000.0 1
Product 26.0 2 64000.0 2

Top comments (0)