Welcome to the third article in the "7 Days of Pandas" series, where we cover the pandas library in Python, which is used for data manipulation.
In the first article of the series, we looked at how to read and write CSV files with Pandas.
In the second article, we looked at how to perform basic data manipulation.
In this tutorial, we will look at some of the common operations that we perform on a dataframe during the exploratory data analysis (EDA) phase.
Exploratory data analysis helps us better understand the data at hand. In this phase, we inspect the data and use descriptive statistics and visualizations to derive insights from it.
The pandas library comes with a number of useful functions that help us explore the data. In this tutorial, we will cover the following topics:
- Get the first and the last N rows of a dataframe.
- Use the info() function.
- Get descriptive statistics with the describe() function.
Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.
import pandas as pd
# employee data
data = {
"Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
"Age": [26, 28, 27, 32, 24, 31, 33],
"Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
"Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000]
}
# create pandas dataframe
df = pd.DataFrame(data)
# display the dataframe
df
 | Name | Age | Department | Salary |
---|---|---|---|---|
0 | Tim | 26 | Marketing | 60000 |
1 | Shaym | 28 | Product | 70000 |
2 | Noor | 27 | Product | 82000 |
3 | Esha | 32 | HR | 55000 |
4 | Sam | 24 | Product | 58000 |
5 | James | 31 | HR | 55000 |
6 | Lily | 33 | Marketing | 65000 |
We have a dataframe with information about some employees in an office.
Get the first and the last N rows of a dataframe
After loading or creating a dataframe, a good first step is to look at the first few rows to check whether the data looks as expected and whether there are any obvious issues (for example, missing fields).
You can use the pandas dataframe head() function to get the first n rows of the dataframe. Pass the number of rows you want from the top as an argument. By default, n is 5.
# get the first five rows
df.head(5)
 | Name | Age | Department | Salary |
---|---|---|---|---|
0 | Tim | 26 | Marketing | 60000 |
1 | Shaym | 28 | Product | 70000 |
2 | Noor | 27 | Product | 82000 |
3 | Esha | 32 | HR | 55000 |
4 | Sam | 24 | Product | 58000 |
You can similarly get the last n rows of the dataframe using the pandas dataframe tail() function. Pass the number of rows you want from the bottom as an argument. By default, n is 5.
# get the last five rows
df.tail(5)
 | Name | Age | Department | Salary |
---|---|---|---|---|
2 | Noor | 27 | Product | 82000 |
3 | Esha | 32 | HR | 55000 |
4 | Sam | 24 | Product | 58000 |
5 | James | 31 | HR | 55000 |
6 | Lily | 33 | Marketing | 65000 |
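Both functions accept values of n other than the default. The following is a minimal sketch, using a trimmed-down copy of the sample dataframe, of passing a smaller n and of passing a negative n, which in pandas returns everything except the last n rows for head().

```python
import pandas as pd

# trimmed-down copy of the sample employee data
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
})

# first two rows
first_two = df.head(2)
# last three rows
last_three = df.tail(3)
# negative n: all rows except the last two
all_but_last_two = df.head(-2)

print(first_two)
print(last_three)
print(all_but_last_two)
```

Note that the returned rows keep their original index labels, which is why the tail() output above starts at index 2 rather than 0.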
Use the info() function
You can use the pandas dataframe info() function to get a concise summary of the dataframe. It gives information such as the column dtypes, count of non-null values in each column, the memory usage of the dataframe, etc.
# summary of the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 7 non-null object
1 Age 7 non-null int64
2 Department 7 non-null object
3 Salary 7 non-null int64
dtypes: int64(2), object(2)
memory usage: 352.0+ bytes
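The info() function prints its summary instead of returning it. If you need the individual pieces as values, one possible approach (a sketch, not the only way) is to read them from the shape and dtypes attributes and from notna(), or to capture the printed text by passing a buffer through the buf parameter.

```python
import io
import pandas as pd

# same sample employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# (number of rows, number of columns)
print(df.shape)
# dtype of each column
print(df.dtypes)
# count of non-null values in each column
non_null_counts = df.notna().sum()
print(non_null_counts)

# capture the text that info() would otherwise print
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()
```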
Get descriptive statistics with the describe() function
The pandas dataframe describe() function returns some descriptive statistics for a dataframe. For example, for numerical columns, it returns the count, mean, standard deviation, min, max, percentile values, etc.
# get dataframe's descriptive statistics
df.describe()
 | Age | Salary |
---|---|---|
count | 7.000000 | 7.000000 |
mean | 28.714286 | 63571.428571 |
std | 3.352327 | 9778.499252 |
min | 24.000000 | 55000.000000 |
25% | 26.500000 | 56500.000000 |
50% | 28.000000 | 60000.000000 |
75% | 31.500000 | 67500.000000 |
max | 33.000000 | 82000.000000 |
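Each row in this output corresponds to a statistic that you can also compute directly on a column. The sketch below, using only the numeric columns of the sample data, compares a few describe() values with the matching Series methods and shows the percentiles parameter for requesting percentiles other than the default 25%, 50%, and 75%.

```python
import pandas as pd

# numeric columns of the sample employee data
df = pd.DataFrame({
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

stats = df.describe()

# the same statistics computed directly on the Age column
age_mean = df["Age"].mean()
age_std = df["Age"].std()            # sample standard deviation (ddof=1)
age_q25 = df["Age"].quantile(0.25)   # the 25% row

print(stats.loc["mean", "Age"], age_mean)
print(stats.loc["std", "Age"], age_std)
print(stats.loc["25%", "Age"], age_q25)

# request the 10th and 90th percentiles
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)
```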
Note that the pandas dataframe describe() function, by default, includes only the numeric columns when generating the dataframe's description.
You can, however, use the include parameter to get the statistics for other column types (or for all the columns).
# get descriptive statistics for the object type columns
df.describe(include='object')
 | Name | Department |
---|---|---|
count | 7 | 7 |
unique | 7 | 3 |
top | Tim | Product |
freq | 1 | 3 |
For object type columns, we get the information about the count, number of unique values, top (the most frequent value), and freq (the count of the most frequent value in the column).
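To include every column at once, you can pass include='all'. The sketch below rebuilds the sample dataframe and also uses value_counts() to cross-check the top and freq values for the Department column. Statistics that do not apply to a column's type show up as NaN.

```python
import pandas as pd

# same sample employee data as above
df = pd.DataFrame({
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, 31, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", "HR", "Marketing"],
    "Salary": [60000, 70000, 82000, 55000, 58000, 55000, 65000],
})

# descriptive statistics for all the columns, numeric or not
all_stats = df.describe(include="all")
print(all_stats)

# cross-check top and freq for the Department column
dept_counts = df["Department"].value_counts()
print(dept_counts)
```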
These descriptive statistics give us valuable insights into the distribution of the data in different columns.