Piyush Raj

Posted on Dec 23, 2022

Pandas - Basic Exploratory Data Analysis - 7 Days of Pandas

#python #datascience #beginners

Welcome to the third article in the "7 Days of Pandas" series, where we cover pandas, the Python library used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas.
In the second article, we looked at how to perform basic data manipulation.
In this tutorial, we will look at some of the common operations that we perform on a dataframe during the exploratory data analysis (EDA) phase.

Exploratory Data Analysis (EDA) helps us better understand the data at hand. In this phase, we use descriptive statistics and visualizations to derive insights from the data.

The pandas library comes with a number of useful functions that help us explore the data. In this tutorial, we will cover the following topics:

Get the first and the last N rows of a dataframe.
Use the info() function.
Get descriptive statistics with the describe() function.

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df

	Name	Age	Department	Salary
0	Tim	26	Marketing	60000
1	Shaym	28	Product	70000
2	Noor	27	Product	82000
3	Esha	32	HR	55000
4	Sam	24	Product	58000
5	James	31	HR	55000
6	Lily	33	Marketing	65000

We have a dataframe with information about some employees in an office.

Get the first and the last N rows of a dataframe

After loading or creating a dataframe, a good first step is to look at the first few rows to check whether the data looks as expected and whether there are any obvious issues (for example, missing fields).

You can use the pandas dataframe head() function to get the first n rows of the dataframe. Pass the number of rows you want from the top as an argument. By default, n is 5.

# get the first five rows
df.head(5)

	Name	Age	Department	Salary
0	Tim	26	Marketing	60000
1	Shaym	28	Product	70000
2	Noor	27	Product	82000
3	Esha	32	HR	55000
4	Sam	24	Product	58000
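As a minimal sketch of how the n argument behaves, the snippet below rebuilds a trimmed copy of the sample dataframe (two columns only, to keep it short) and compares head() called without an argument against head(3). The variable names first_default and first_three are ours, chosen for illustration.

```python
import pandas as pd

# trimmed copy of the sample employee data (two columns only)
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
})

# with no argument, head() falls back to the default n of 5
first_default = df.head()

# pass a smaller n to see fewer rows
first_three = df.head(3)

print(len(first_default))
print(first_three)
```

If n is larger than the number of rows, head() simply returns the whole dataframe.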

Similarly, you can get the last n rows of the dataframe using the pandas dataframe tail() function. Pass the number of rows you want from the bottom as an argument. By default, n is 5.

# get the last five rows
df.tail(5)

	Name	Age	Department	Salary
2	Noor	27	Product	82000
3	Esha	32	HR	55000
4	Sam	24	Product	58000
5	James	31	HR	55000
6	Lily	33	Marketing	65000
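Notice that the rows returned by tail() keep their original index labels (2 to 6 above); they are not renumbered from zero. A small sketch of this, again rebuilding part of the sample dataframe, with last_two as an illustrative variable name:

```python
import pandas as pd

# trimmed copy of the sample employee data (two columns only)
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# get the last two rows
last_two = df.tail(2)

# the index labels come from the original dataframe
print(last_two.index.tolist())
print(last_two)
```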

Use the `info()` function

You can use the pandas dataframe info() function to get a concise summary of the dataframe. It gives information such as the column dtypes, count of non-null values in each column, the memory usage of the dataframe, etc.

# summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        7 non-null      object
 1   Age         7 non-null      int64 
 2   Department  7 non-null      object
 3   Salary      7 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 352.0+ bytes
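Note that info() prints its summary and returns None, so the result cannot be stored and reused directly. If you need the same pieces of information as values, one possible approach (a sketch, not the only way) is to read them from the dtypes attribute and the count() and memory_usage() functions. The variable names below are illustrative.

```python
import pandas as pd

# trimmed copy of the sample employee data
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# info() prints the summary; its return value is None
summary = df.info()

# dtype of each column, as a pandas Series
column_dtypes = df.dtypes

# number of non-null values in each column
non_null_counts = df.count()

# memory used by the index and each column, in bytes
total_bytes = df.memory_usage().sum()

print(column_dtypes)
print(non_null_counts)
print(total_bytes)
```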

Get descriptive statistics with the `describe()` function

The pandas dataframe describe() function returns some descriptive statistics for a dataframe. For example, for numerical columns, it returns the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentile values.

# get dataframe's descriptive statistics
df.describe()

	Age	Salary
count	7.000000	7.000000
mean	28.714286	63571.428571
std	3.352327	9778.499252
min	24.000000	55000.000000
25%	26.500000	56500.000000
50%	28.000000	60000.000000
75%	31.500000	67500.000000
max	33.000000	82000.000000
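The result of describe() is itself a dataframe, so you can pick out individual statistics with loc. The sketch below does that for the Salary column and cross-checks two of the values against the corresponding column functions, mean() and quantile(). The variable names are ours, for illustration only.

```python
import pandas as pd

# trimmed copy of the sample employee data (numeric columns only)
df = pd.DataFrame({
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

stats = df.describe()

# look up single values by row label (statistic) and column name
mean_salary = stats.loc["mean", "Salary"]
q1_salary = stats.loc["25%", "Salary"]

# the same values computed directly on the column
direct_mean = df["Salary"].mean()
direct_q1 = df["Salary"].quantile(0.25)

print(mean_salary, direct_mean)
print(q1_salary, direct_q1)
```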

Note that, by default, the pandas dataframe describe() function includes only the numeric columns when generating the dataframe’s description.

You can, however, use the include parameter to specify other column types (or all the columns) to include in the statistics.

# get descriptive statistics for the object type columns
df.describe(include='object')

	Name	Department
count	7	7
unique	7	3
top	Tim	Product
freq	1	3

For object type columns, we get the information about the count, number of unique values, top (the most frequent value), and freq (the count of the most frequent value in the column).
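To see where these numbers come from, here is a small sketch that derives the unique, top, and freq values for the Department column using nunique() and value_counts(). This is just one way to reproduce them; the variable names are illustrative.

```python
import pandas as pd

# only the Department column of the sample employee data
df = pd.DataFrame({
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
})

# number of distinct values (the "unique" row)
n_unique = df["Department"].nunique()

# how often each value occurs, most frequent first
counts = df["Department"].value_counts()

# the most frequent value ("top") and its count ("freq")
top_value = counts.idxmax()
top_freq = counts.max()

print(n_unique, top_value, top_freq)
```

When several values are tied for the highest count, as in the Name column above, any one of them may be reported as top.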

These descriptive statistics give us valuable insights into the distribution of the data in different columns.
