AJ_Coding

Posted on Mar 15, 2023

Exploratory Data Analysis Beginner Guide

#python #datascience #tutorial #eventdriven

Data professionals use exploratory data analysis (EDA) to explore a dataset and become familiar with its properties and the relationships between its variables. Data visualization is one of the most important tools used in EDA. By analyzing and visualizing the data, we learn what it looks like and what sorts of questions it can help us answer. EDA also helps us identify trends and patterns, spot outliers and other anomalies, and address specific research questions.

These are some of the major steps carried out for this type of analysis:

1). Collect the data

2). Load the data

3). Get the basic information about the data

4). Handle duplicate values

5). Handle the unique values in the data

6). Visualize unique count

7). Find null values

8). Replace null values

9). Know the data type

10). Filter data

11). Get data’s box plot

12). Get the basic information about the data

13). Create correlation plot

First, we make sure we are working with the correct libraries. For EDA, the following libraries are often used: pandas, numpy, seaborn and matplotlib. We import them with the following statements:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

For convenience, we will instead use pyforest, which lazily imports these libraries under their usual aliases when they are first used:

import pyforest
pd.set_option('display.max_columns', 200)

The dataset we will be working with is Police Shootings in the USA, available on my GitHub here: https://github.com/AJ-Coding101/Exploratory-Data-Analysis-EDA-of-USA-Police-Shootings. We can load the data for analysis using the code below:

df = pd.read_csv("shootings.csv")

df.head(2)  # Display the first 2 rows

df.tail(2)  # Display the last 2 rows

The next step is to see at a glance what kind of data we are dealing with.

df.shape  # Show how many rows and columns are in the dataset
(4895, 15)

df.describe()

The describe() function is very useful because it shows summary statistics for the numeric columns, such as the count, mean, standard deviation, minimum and maximum values, and quartiles.
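By default, describe() only summarizes numeric columns. As a minimal sketch, assuming a small made-up table that stands in for a few columns of the real dataset, the same method can summarize a text column, and nunique() reports how many distinct values each column holds:

```python
import pandas as pd

# Toy data standing in for a few columns of the real dataset (values are made up).
toy = pd.DataFrame({
    "age": [34.0, 47.0, 23.0, 34.0],
    "race": ["White", "Black", "White", "Hispanic"],
    "body_camera": [False, True, False, False],
})

# For a text column, describe() reports the count, the number of unique
# values, the most common value (top) and how often it occurs (freq).
race_summary = toy["race"].describe()

# Number of distinct values in every column.
unique_counts = toy.nunique()

print(race_summary)
print(unique_counts)
```

On the real data, replacing toy with df should give the same kind of overview for columns such as 'race' or 'state'.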

To get even more insight into our dataset, we can use df.info().

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4895 entries, 0 to 4894
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       4895 non-null   int64  
 1   name                     4895 non-null   object 
 2   date                     4895 non-null   object 
 3   manner_of_death          4895 non-null   object 
 4   armed                    4895 non-null   object 
 5   age                      4895 non-null   float64
 6   gender                   4895 non-null   object 
 7   race                     4895 non-null   object 
 8   city                     4895 non-null   object 
 9   state                    4895 non-null   object 
 10  signs_of_mental_illness  4895 non-null   bool   
 11  threat_level             4895 non-null   object 
 12  flee                     4895 non-null   object 
 13  body_camera              4895 non-null   bool   
 14  arms_category            4895 non-null   object 
dtypes: bool(2), float64(1), int64(1), object(11)
memory usage: 506.8+ KB

Here we can see vital information such as the data type of each column, the number of entries, the memory usage, and the non-null count per column. We can also notice that the ‘age’ column is stored as a float instead of an int, and the ‘date’ column is stored as a generic object instead of a datetime. This could cause issues later in our analysis, for example when we want to extract the year from a date. However, there is an easy fix, as shown below:

df['age'] = df['age'].astype(int)
df['date'] = pd.to_datetime(df['date'])

We run df.info() again and now all the data types are shown correctly.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4895 entries, 0 to 4894
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   id                       4895 non-null   int64         
 1   name                     4895 non-null   object        
 2   date                     4895 non-null   datetime64[ns]
 3   manner_of_death          4895 non-null   object        
 4   armed                    4895 non-null   object        
 5   age                      4895 non-null   int32         
 6   gender                   4895 non-null   object        
 7   race                     4895 non-null   object        
 8   city                     4895 non-null   object        
 9   state                    4895 non-null   object        
 10  signs_of_mental_illness  4895 non-null   bool          
 11  threat_level             4895 non-null   object        
 12  flee                     4895 non-null   object        
 13  body_camera              4895 non-null   bool          
 14  arms_category            4895 non-null   object        
dtypes: bool(2), datetime64[ns](1), int32(1), int64(1), object(10)
memory usage: 487.7+ KB
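The conversion above works because this dataset has no missing or malformed values. As a hedged sketch for messier data, assuming a made-up table with one missing age and one unparseable date, pandas offers a nullable integer dtype and an errors="coerce" option:

```python
import pandas as pd

# Made-up rows: one missing age and one string that is not a date.
raw = pd.DataFrame({
    "age": [34.0, None, 23.0],
    "date": ["2015-01-02", "2015-01-03", "not a date"],
})

# astype(int) would raise on the missing value; the nullable "Int64"
# dtype keeps it as <NA> instead.
raw["age"] = raw["age"].astype("Int64")

# errors="coerce" turns unparseable dates into NaT rather than raising.
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

print(raw.dtypes)
print(raw)
```

The coerced values then show up in df.isna().sum(), so they can be inspected rather than silently lost.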

Next, we can confirm that there are no null values in our data.
df.isna().sum()

id 0
name 0
date 0
manner_of_death 0
armed 0
age 0
gender 0
race 0
city 0
state 0
signs_of_mental_illness 0
threat_level 0
flee 0
body_camera 0
arms_category 0
dtype: int64

df.isna().sum().sum()
0
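Our dataset has no null values, so there is nothing to replace here. For datasets that do have gaps, a minimal sketch of one common approach, using made-up values and an "Unknown" label chosen purely for illustration, could look like this:

```python
import pandas as pd

# Made-up rows with one missing age and one missing race.
toy = pd.DataFrame({
    "age": [34.0, None, 23.0, 45.0],
    "race": ["White", None, "Black", "White"],
})

# Numeric column: fill gaps with the median, which is less sensitive
# to extreme values than the mean.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Text column: fill gaps with an explicit placeholder label.
toy["race"] = toy["race"].fillna("Unknown")

print(toy)
```

Whether to fill, drop, or keep missing values depends on the question being asked, so this is only one option among several.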
It is also useful to check whether we have any duplicated rows and, if so, to remove them, as they are redundant.

df.duplicated().sum()
0
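There are no duplicates in our data, so no removal is needed. If there were, drop_duplicates() would remove them. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; the third row repeats the second one exactly.
toy = pd.DataFrame({
    "name": ["A", "B", "B", "C"],
    "age": [30, 41, 41, 25],
})

# duplicated() marks every row that repeats an earlier row.
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```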

Next, we will visualize our data. The matplotlib library is useful for this step of the analysis. We start with a bar chart of the number of police shootings by race in the USA.

df['race'].value_counts().plot(kind='bar', edgecolor='black')
plt.title('Bar chart: Shootings by race')
plt.xlabel('Race')
plt.ylabel('Number of shootings')
plt.show()
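The bar chart shows raw counts. If shares are easier to compare, value_counts() can return proportions instead. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up values standing in for the 'race' column.
race = pd.Series(
    ["White", "Black", "White", "Hispanic", "White", "Black", "White", "Asian"]
)

# normalize=True returns each category's share of all rows instead of a count.
shares = race.value_counts(normalize=True)

print(shares)
```

Calling .plot(kind='bar') on the result gives the same chart as before with shares on the y-axis.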

We can also visualize the number of police shootings according to age. For this we can use a histogram.

plt.hist(df['age'], bins=15, edgecolor='black')
plt.title('Histogram: According to age')
plt.xlabel('Age')
plt.ylabel('Number of shootings')
plt.xticks(range(0, 101, 5))
plt.show()
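Reading the tallest bar off a histogram can be imprecise. As a sketch, assuming made-up ages and 5-year bins chosen for illustration, pd.cut() can tabulate the same kind of counts so that the busiest age group is read from numbers rather than from the plot:

```python
import pandas as pd

# Made-up ages standing in for the 'age' column.
ages = pd.Series([19, 23, 27, 31, 33, 34, 36, 42, 58])

# 5-year bins: [15, 20), [20, 25), ..., [60, 65).
bins = list(range(15, 66, 5))
age_groups = pd.cut(ages, bins=bins, right=False)

# Count the ages in each bin, keeping the bins in ascending order.
counts = age_groups.value_counts().sort_index()

# The interval with the highest count.
busiest = counts.idxmax()

print(counts)
print(busiest)
```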

The 29–34 age group appears to have had the most police shootings. To visualize police shootings by year, we first need to extract the year from the date column.

df['year'] = df['date'].dt.year
year_shootings = df.groupby('year').size()
# count() can be used instead of size(), but it returns the non-null count for every column rather than a single row count

Here the pandas groupby() function groups the data by year, and size() counts the number of shootings in each group. The code snippet below plots these figures as a line graph.
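The difference between size() and count() only shows up when values are missing. A small sketch on made-up rows, with one missing age:

```python
import pandas as pd

# Made-up rows; one age is missing in 2015.
toy = pd.DataFrame({
    "year": [2015, 2015, 2016],
    "age": [30.0, None, 41.0],
})

# size() returns a Series with the number of rows in each group.
rows_per_year = toy.groupby("year").size()

# count() returns a DataFrame with the non-null values per column.
non_null_per_year = toy.groupby("year").count()

print(rows_per_year)
print(non_null_per_year)
```

Since our dataset has no null values, both approaches give the same numbers here.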

year_shootings.plot(kind='line', grid=True)
plt.title('Number of shootings according to year')
plt.xlabel('Year')
plt.ylabel('Number of shootings')
plt.show()

We can observe that the number of police shootings generally decreased from 2015 to 2019, apart from a slight uptick in 2017. In 2020, the figure drops sharply. Let’s inspect the dataframe to find out why.

df.groupby('year').size()
year
2015    965
2016    904
2017    906
2018    888
2019    858
2020    374
dtype: int64


df.groupby(['year',df['date'].dt.month]).size()

year  date
2015  1       75
      2       77
      3       91
      4       83
      5       69
              ..
2020  2       61
      3       73
      4       58
      5       78
      6       22
Length: 66, dtype: int64


df[df['year'] == 2019].groupby(['year', df['date'].dt.month]).size()
year  date
2019  1        81
      2        68
      3        76
      4        63
      5        64
      6        77
      7        69
      8        57
      9        59
      10       73
      11       71
      12      100
dtype: int64

df[df['year'] == 2020].groupby(['year', df['date'].dt.month]).size()
year  date
2020  1       82
      2       61
      3       73
      4       58
      5       78
      6       22
dtype: int64

It is now clear that our dataset only covers shootings up to June 2020, which explains the drop for that year.
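A quicker way to catch this kind of gap is to look at the last recorded date in each year. A minimal sketch on made-up dates, where the "before December" rule is just an illustrative heuristic:

```python
import pandas as pd

# Made-up dates: 2019 runs to December, 2020 stops in June.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-01-15", "2019-06-30", "2019-12-01", "2020-01-10", "2020-06-05"]
    )
})
toy["year"] = toy["date"].dt.year

# Last recorded date in each year.
last_date = toy.groupby("year")["date"].max()

# Years whose records end before December are probably incomplete.
incomplete = last_date[last_date.dt.month < 12]

print(last_date)
print(incomplete)
```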

The seaborn library is built on top of matplotlib and can also be used for powerful visualizations. Let’s visualize the number of police shootings by age again, this time with a regression line.

First, we again use the groupby() function to group the data by age and then calculate the number of shootings by age.

df.groupby('age').size()

age
6      2
12     1
13     1
14     3
15    13
      ..
81     1
82     2
83     2
84     4
91     1
Length: 75, dtype: int64

age_shootings = df.groupby('age').size()

sns.scatterplot(x=age_shootings.index, y=age_shootings.values)
sns.regplot(x=age_shootings.index, y=age_shootings.values)
plt.title('Number of shootings by age')
plt.xlabel('Age')
plt.ylabel('Number of shootings')
plt.show()

The code creates a scatter plot of the number of shootings by age, and the regplot() function then adds a regression line to it. The regression line is the straight line of best fit for the relationship between the age of the victims and the number of police shootings. It helps us see the overall trend in the data and the direction of the correlation between the two variables, although a straight line can only approximate a relationship that is not linear.
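The regression line can also be summarized with numbers. As a sketch, assuming made-up counts per age that rise and then fall, numpy can compute the slope of the best-fit line and the Pearson correlation coefficient:

```python
import numpy as np
import pandas as pd

# Made-up number of shootings per age (index = age, values = count).
age_counts = pd.Series([2, 4, 6, 5, 3, 1], index=[15, 25, 35, 45, 55, 65])

x = age_counts.index.to_numpy()
y = age_counts.to_numpy()

# Least-squares straight line: y is approximately slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between age and count.
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.4f}, intercept={intercept:.2f}, r={r:.3f}")
```

In this toy example the correlation is weak because the counts rise and then fall, which is a reminder that a straight line does not capture every pattern.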

Next, we can visualize the number of police shootings according to race and by year.

shootings_by_race_year = df.groupby(['year', 'race']).size()
shootings_by_race_year = shootings_by_race_year.unstack()

First, we use the groupby() function to group the data by year and race, and size() to count the shootings in each group, as shown in the code above. We then use unstack() to turn the race level into columns, which gives a more readable table with one row per year and one column per race.
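The unstacked table can be plotted directly. A minimal sketch on made-up rows, where the stacked bar chart and the fill_value=0 argument are illustrative choices rather than part of the original analysis:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the 'year' and 'race' columns.
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016],
    "race": ["White", "Black", "White", "White", "Hispanic"],
})

# fill_value=0 replaces missing year/race combinations with 0 instead of NaN.
by_race_year = toy.groupby(["year", "race"]).size().unstack(fill_value=0)

# One bar per year, split into one segment per race.
ax = by_race_year.plot(kind="bar", stacked=True, edgecolor="black")
ax.set_title("Shootings by race and year (toy data)")
ax.set_xlabel("Year")
ax.set_ylabel("Number of shootings")
plt.tight_layout()

print(by_race_year)
```

In a notebook the backend line can be dropped, and plt.show() displays the chart as in the earlier examples.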

As we can see, exploratory data analysis is a crucial part of data science. The steps shown in this article are just some of the general steps taken in EDA.
