<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AJ_Coding</title>
    <description>The latest articles on DEV Community by AJ_Coding (@aj_coding).</description>
    <link>https://dev.to/aj_coding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1026937%2F58882857-4914-4057-bf83-ba6d622db520.jpeg</url>
      <title>DEV Community: AJ_Coding</title>
      <link>https://dev.to/aj_coding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aj_coding"/>
    <language>en</language>
    <item>
      <title>Github Guide for Data Scientists</title>
      <dc:creator>AJ_Coding</dc:creator>
      <pubDate>Tue, 04 Apr 2023 20:25:14 +0000</pubDate>
      <link>https://dev.to/aj_coding/github-guide-for-data-scientists-28gj</link>
      <guid>https://dev.to/aj_coding/github-guide-for-data-scientists-28gj</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b1sj8BkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hx8txtzel4q0j5lurz7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b1sj8BkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hx8txtzel4q0j5lurz7b.png" alt="Image Description" width="880" height="495"&gt;&lt;/a&gt;Source: Git Organized: A Better Git Flow | Render&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The official Git website defines Git as a free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.&lt;/p&gt;

&lt;p&gt;There are two types of Version Control Systems (VCS):&lt;/p&gt;

&lt;p&gt;· Centralized VCS&lt;/p&gt;

&lt;p&gt;· Distributed VCS&lt;/p&gt;

&lt;p&gt;A centralized VCS (CVCS) uses a client-server model where all team members access a central repository to store and manage changes made to files. The client software enables users to check out a working copy of the files to their computer, make changes and commit those changes back to the central repository. However, the downsides of a CVCS are slower performance when dealing with a large repository and no access when the server is down.&lt;/p&gt;

&lt;p&gt;On the other hand, a distributed VCS (DVCS) enables team members to have a complete working copy of the remote repository, together with its history of changes, in their local repository. This allows users to work on their files independently from anywhere without having to rely on a central repository to store changes. Local repositories from different team members can be merged, enabling collaboration on the same files and work on different sections of the project simultaneously. For data scientists, a DVCS such as Git, paired with a hosting service such as GitHub, is essential to our daily activities, as we will demonstrate in the upcoming sections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Download and Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we will need to download Git from the official website mentioned above, choosing the installer for your operating system. Afterwards, you can head over to github.com and sign up for an account. Optionally, you can download the GitHub Desktop application if you prefer a more visual approach to commits, pushes, merges, etc.&lt;/p&gt;

&lt;p&gt;However, for the purpose of this article we will learn how to use Git commands in the Command Line Interface (CLI). It is crucial to understand Git at this basic level, and this will in turn make using the Desktop application a breeze. To confirm that Git has been installed successfully, you can run the command below in a CLI such as the command terminal or Git Bash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git –-version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything was set up correctly, the command will return the installed version of Git. Next, we’ll need to configure our name and email so that Git can identify us. You can run the commands below, replacing “your name” and “&lt;a href="mailto:name@email.com"&gt;name@email.com&lt;/a&gt;” with your name and email address respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git config - global user.name "your name"
$ git config - global user.email "name@email.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
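
&lt;p&gt;To double-check the configuration, we can ask Git to print each value back. The output below simply echoes whatever name and email you entered in the previous step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git config --global user.name
your name
$ git config --global user.email
name@email.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;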



&lt;p&gt;&lt;strong&gt;Creating a repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A repository, or repo for short, is a place where you can store and manage the code for a project. It generally contains all of the project’s code, documentation, and other files. Repositories make it simple to collaborate on a project, track changes over time, and share work with others.&lt;/p&gt;

&lt;p&gt;There are two types of repositories:&lt;/p&gt;

&lt;p&gt;· Local Repository: A copy of a repository stored on your computer’s hard drive where you can work on the local version of your project. You’ll be able to work on your project without affecting the main branch or the changes made by others. The local repo enables the user to make changes on the project, create branches and test those changes before pushing them to the remote repository for others to review and merge.&lt;/p&gt;

&lt;p&gt;· Remote Repository: A copy of a repository stored in the cloud or on a remote server, such as GitHub. With the aid of remote repos, you can work on a project with others, publish your code online, and back up your data. You can sync your local and remote repositories by creating a remote repo on GitHub and pushing your local repository to it. By doing so, you can collaborate on project changes and share your code with others.&lt;/p&gt;

&lt;p&gt;When working with Git and GitHub, you will utilize a blend of local and remote repositories. To edit the code, test your changes, and save your work, make use of your local repository. After that, you may share your work with others and work with your team by pushing your changes to a remote repository on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pARoWZcs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmw5hctm85k68ahc3b10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pARoWZcs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmw5hctm85k68ahc3b10.png" alt="Image description" width="658" height="398"&gt;&lt;/a&gt; Source: Author&lt;/p&gt;

&lt;p&gt;To create our first repo, we will head over to our GitHub profile page, click on the + sign at the top right of the page, then “New Repository.”&lt;/p&gt;

&lt;p&gt;Afterwards, you can follow these steps to complete the creation of the repository:&lt;/p&gt;

&lt;p&gt;· Give your repository a name. The name should be descriptive and reflect the goal of your project.&lt;/p&gt;

&lt;p&gt;· You may optionally include a repository description. This helps others understand the purpose of your work.&lt;/p&gt;

&lt;p&gt;· Choose whether you want your repository to be private or public.&lt;/p&gt;

&lt;p&gt;· Decide whether or not to include a README file when starting the repository. This is a good place to provide some basic information about your project.&lt;/p&gt;

&lt;p&gt;· Optionally, choose a license for your project. A license sets out the conditions under which others may use and change your code.&lt;/p&gt;

&lt;p&gt;· Click “Create Repository”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WDEf14zU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jyxwva5l6p0w6nsodei1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WDEf14zU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jyxwva5l6p0w6nsodei1.png" alt="Image description" width="880" height="1077"&gt;&lt;/a&gt;Source: Author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working Directory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The working directory is where we keep the project files that will later be pushed to the remote repository on GitHub. For illustration purposes, we will create a folder called “First” and inside it add a CSV file called “repo.” We will also use Git Bash for our input commands. You can use the native Git Bash terminal that came with the Git installation on your operating system, or you can use Git Bash inside the VS Code editor. Make sure to first navigate to the project folder in Git Bash before using the command. The command below creates a CSV file called “repo” in the working directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ touch repo.csv

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we will need to initialize our repository in our working directory. To do so, we will use the Git command below. The command also creates a subdirectory “.git” that is typically hidden.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git init

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An alternative method for this first step is to clone our new remote repository that we had created on Github.com using the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone "Repository URL"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FJH-C7_T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6sfiz11s9tiex0dkfi6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FJH-C7_T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6sfiz11s9tiex0dkfi6k.png" alt="Image description" width="880" height="65"&gt;&lt;/a&gt; Source: Author&lt;/p&gt;

&lt;p&gt;This will create a local copy of the remote repository in your working directory. One of the Git commands that we will use quite often is “git status”. This command tells us which files are untracked, modified, tracked, in conflict, etc. It shows the current state of the working directory and the staging area in relation to the repository. When we run “git status” after creating the working directory and the CSV file called “repo”, we should get the outcome below in our terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On branch main
No commits yet
Untracked files:
(use "git add &amp;lt;file&amp;gt;…" to include in what will be committed)
repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This basically tells us that Git is aware there is a file in the working directory known as “repo.csv” but it is not in the staging area yet and has not been committed. To track the file, we will use “git add.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staging Area&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is also known as the index. It is the intermediary step between the working directory and the local repository. For ease of understanding, the image below demonstrates the basic Git workflow and commands.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FTmBqy-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rhze4waqjsmrz0aoa0t1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FTmBqy-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rhze4waqjsmrz0aoa0t1.png" alt="Image description" width="880" height="772"&gt;&lt;/a&gt;Source: ByteByteGo&lt;/p&gt;

&lt;p&gt;The staging area allows you to review and choose which changes will be included in the local Git repository. To add files to the staging area you can use the “git add” command. If using a period (.), all files and folders in the working directory will be added to the staging area. If you’re working with many files and want to be selective, you can indicate the filename after the command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git add .
OR
$ git add repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After we add the file to the staging area using the command above, we can run “git status” again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On branch main
No commits yet
Changes to be committed:
(use "git rm --cached &amp;lt;file&amp;gt;…" to unstage)
new file: repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Git is tracking our new file as it has been added to the staging area. If we want to unstage the file, we can use “git rm --cached repo.csv”.&lt;/p&gt;
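
&lt;p&gt;Alternatively, on Git 2.23 and later, “git restore --staged” achieves the same unstaging and is arguably easier to remember:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git restore --staged repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;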

&lt;p&gt;&lt;strong&gt;Local Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, we will need to commit our file in the staging area to the local repository. A commit is a snapshot of the changes made to files and folders in a repo. To save changes you make to files in your repository as a new version of your project’s history, you must commit the changes.&lt;/p&gt;

&lt;p&gt;Before we perform our first commit, it is a good practice to only commit the files that are needed to build and run our project. Unnecessary files such as log.txt files don’t need to be uploaded to our repository. To achieve this, we will need to create a .gitignore file and add files that we don’t want included in our commit here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ touch .gitignore

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, we can create a log.txt file and include it in our .gitignore. We can do that using our VS code by opening .gitignore, typing in “log.txt” and save.&lt;/p&gt;
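
&lt;p&gt;If you prefer to stay in the terminal, appending the filename to .gitignore achieves the same result as editing the file in VS Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo "log.txt" &amp;gt;&amp;gt; .gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;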

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6mfyZeLG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rfjyfame2m0m8389jqyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6mfyZeLG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rfjyfame2m0m8389jqyi.png" alt="Image description" width="870" height="239"&gt;&lt;/a&gt;Source: Author&lt;/p&gt;

&lt;p&gt;From this image, we can see that the “.gitignore” file is untracked by Git as shown by the “U”, “repo.csv” is added (A) to the staging area, and “log.txt” is ignored by Git. Let’s run “git status” to confirm this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On branch main
No commits yet
Changes to be committed:
(use "git rm --cached &amp;lt;file&amp;gt;…" to unstage)
new file: repo.csv
Untracked files:
(use "git add &amp;lt;file&amp;gt;…" to include in what will be committed)
.gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on the above, we need to add the .gitignore file to the staging area first by running “git add .” before committing our changes to the local repository.&lt;/p&gt;

&lt;p&gt;After adding all eligible files to the staging area and running “git status” again, we get the outcome below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On branch main
No commits yet
Changes to be committed:
(use "git rm --cached &amp;lt;file&amp;gt;…" to unstage)
new file: .gitignore
new file: repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use the Git command “git ls-files” to confirm what is currently in our staging area.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git ls-files
.gitignore
repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As confirmed, “log.txt” is ‘ignored’ by Git. Next step would be to commit our changes to our local repository. To do so, we’ll run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$git commit –m "First Commit"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a good practice, the commit message in the quotes should be in the present tense and should be a concise, descriptive summary of the changes that we are committing to the repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git commit -m "First Commit"
[main (root-commit) deed287] First Commit
2 files changed, 2 insertions(+)
create mode 100644 .gitignore
create mode 100644 repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running “git status” will give us a confirmation that the first commit has been done successfully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git status
On branch main
nothing to commit, working tree clean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view our commit history, we can use the “git log” command. This Git command gives us a lot of details such as:&lt;/p&gt;

&lt;p&gt;· Commit Hash, a unique identifier for the commit: a 40-character hexadecimal string.&lt;/p&gt;

&lt;p&gt;· Branch&lt;/p&gt;

&lt;p&gt;· Name and email address of the author who made the commit.&lt;/p&gt;

&lt;p&gt;· Date and Time when the commit was made&lt;/p&gt;

&lt;p&gt;· Commit Message&lt;/p&gt;

&lt;p&gt;Using the commit hash, we can return to a previous state of the project code with “git checkout &amp;lt;commit-hash&amp;gt;” or “git restore --source=&amp;lt;commit-hash&amp;gt; repo.csv”.&lt;/p&gt;
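
&lt;p&gt;For a quick overview of the history, “git log --oneline” condenses each commit to an abbreviated hash and its message. Using the first commit from earlier, the output would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git log --oneline
deed287 (HEAD -&amp;gt; main) First Commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;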

&lt;p&gt;&lt;strong&gt;Remote Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that our project files have been committed to the local repository, the next step is to upload them to the remote repository on GitHub. We first copy the repository URL from the website and use the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git add remote origin "Repository URL"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The “origin” is the name we are giving to the remote repository. This sets up a link between the local repository and the remote repository named origin. We can confirm the remote repository using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git remote -v
origin https://github.com/AJ-Coding101/First.git (fetch)
origin https://github.com/AJ-Coding101/First.git (push)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have confirmed our remote repository, we can push our changes accordingly. If it is the first time you’re pushing to the remote repo, you will be asked for your GitHub username and credentials before continuing. To perform the push we use the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git push –u origin main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;“origin” is the name we gave to our remote repository and “main” is the branch we are pushing to. The “-u” flag (short for “--set-upstream”) sets “main” as the upstream branch. This is a shortcut that saves us from specifying the remote and branch each time we pull or push in the future.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git push -u origin main
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 271 bytes | 271.00 KiB/s, done.
Total 4 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/AJ-Coding101/First.git
* [new branch] main -&amp;gt; main
branch 'main' set up to track 'origin/main'.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
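
&lt;p&gt;Since “-u” has set the upstream branch, future synchronization with the remote no longer needs the repository or branch names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git push
$ git pull
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;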



&lt;p&gt;Below is how our GitHub repository page looks after performing the push:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YUAlqOKb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1a7pbi0ucvmt8p4xuovl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YUAlqOKb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1a7pbi0ucvmt8p4xuovl.png" alt="Image description" width="880" height="257"&gt;&lt;/a&gt;Source: Author&lt;/p&gt;

&lt;p&gt;We can see that our dataset “repo.csv” is now in the cloud, while the “log.txt” file, even though it was in our working directory, does not appear in the repository because it is listed in the “.gitignore” file.&lt;/p&gt;

&lt;p&gt;We can also see that we have made only one commit so far and have one branch, which is our default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--by_ghvMI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/53d5ki0w977j5iyjksp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--by_ghvMI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/53d5ki0w977j5iyjksp3.png" alt="Image description" width="880" height="125"&gt;&lt;/a&gt;Source: Author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Branches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A branch is a reference to a particular commit in the repository’s history. You can work on a new feature or bug fix separately from the main codebase by creating a branch. This prevents changes to the original code. Once you’ve made and tested your changes, the branch can be merged back into the main code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cyeBgrBm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqcu5pf7ksr7beevr0m1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cyeBgrBm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqcu5pf7ksr7beevr0m1.png" alt="Image description" width="880" height="267"&gt;&lt;/a&gt;Source: Gitbookdown&lt;/p&gt;

&lt;p&gt;To create a new branch, we can use the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$git branch second

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, “second” is the name of our new branch. To switch to the new branch we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git checkout second
Switched to branch 'second'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
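
&lt;p&gt;As a shortcut, we can also create a branch and switch to it in a single command. On Git 2.23 and later, “git switch -c” does the same with a clearer name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git checkout -b second
OR
$ git switch -c second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;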



&lt;p&gt;Just to ensure that the branch exists, we can list all local branches with “git branch” (or equivalently “git branch -l”):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git branch or $ git branch –l
* main
second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The asterisk (*) indicates the branch that we are currently working on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git ls-files
.gitignore
repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the command above we can see that in the “second” branch, we still have our original files present. Let’s make some changes to our new branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
$ touch dataset.csv

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mNTh1UJs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/erln62wulewxps7rtr3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mNTh1UJs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/erln62wulewxps7rtr3p.png" alt="Image description" width="428" height="175"&gt;&lt;/a&gt;Source: Author&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git status
On branch second
Untracked files:
(use "git add &amp;lt;file&amp;gt;…" to include in what will be committed)
dataset.csv
nothing added to commit but untracked files present (use "git add" to track)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that we have a new untracked CSV file, “dataset.csv”, in our new branch. Let’s add it to our staging area and commit it to our local repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git add dataset.csv
$ git commit -m "Add new dataset"
[second 00695d1] Add new dataset
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 dataset.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now view what files are present in our new branch as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git ls-files
.gitignore
dataset.csv
repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s switch back to our main branch and view the files present.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git checkout main
$ git ls-files
.gitignore
repo.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new file “dataset.csv” is not present in our main branch, as we can confirm from the output above, showing that we can edit and make commits on a separate branch without affecting the main codebase.&lt;/p&gt;

&lt;p&gt;We can merge our changes from the “second” branch into the “main” branch using the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git merge second
Updating deed287..00695d1
Fast-forward
dataset.csv | 0
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 dataset.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we are done with a branch we can easily delete it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git branch –d second

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To completely delete the branch from our remote repository we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git push origin - delete second

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aqYdJKiv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pev05mzk3w9c4vwriyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aqYdJKiv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pev05mzk3w9c4vwriyb.png" alt="Image description" width="880" height="212"&gt;&lt;/a&gt;Source: Author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pull Requests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pull request is a request to merge changes made in a branch into another branch, typically the main branch.&lt;/p&gt;

&lt;p&gt;The process of making a pull request involves creating a new branch to contain the changes, making the changes and committing them to the branch, and then submitting a pull request to the destination branch. The pull request can then be reviewed by other contributors and the repository owner before the changes are merged into the destination branch.&lt;/p&gt;

&lt;p&gt;It is important to note that the branch containing the changes must be pushed to the remote repository on GitHub before a pull request can be created. This allows the other contributors to access the changes and review the pull request.&lt;/p&gt;

&lt;p&gt;Here are the basic steps for creating a pull request on GitHub:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a new branch using the method we described earlier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make changes: Make changes to the code, add files, or make other modifications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Commit changes: Use the “git commit” command to commit the changes to the branch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Push the branch: Use the “git push” command to push the branch to the remote repository on GitHub.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create the pull request: Go to the GitHub repository and click on the “New pull request” button. Select the source and destination branches and create the pull request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Review and merge: The pull request can now be reviewed by other contributors and the repository owner. Once approved, the changes can be merged into the destination branch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
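
&lt;p&gt;Steps 1 to 4 above map directly onto the commands we have already covered. Using a hypothetical branch called “new-feature”, the terminal side of a pull request looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git checkout -b new-feature
$ git add .
$ git commit -m "Add new feature"
$ git push -u origin new-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;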

&lt;p&gt;&lt;strong&gt;CONCLUSION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitHub is a powerful tool for version control and collaboration that can greatly benefit data scientists in particular. Data science involves working with complex datasets and code, often in teams, and GitHub provides an efficient way to manage and track changes to these files. By using Git and GitHub, data scientists can easily collaborate with their colleagues and keep track of changes to their code and datasets.&lt;/p&gt;

</description>
      <category>github</category>
      <category>git</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Getting Started with Machine Learning for Sentimental Analysis (Part 1)</title>
      <dc:creator>AJ_Coding</dc:creator>
      <pubDate>Thu, 23 Mar 2023 15:15:16 +0000</pubDate>
      <link>https://dev.to/aj_coding/getting-started-with-machine-learning-for-sentimental-analysis-part-1-3aej</link>
      <guid>https://dev.to/aj_coding/getting-started-with-machine-learning-for-sentimental-analysis-part-1-3aej</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Sentimental Analysis?&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;This type of analysis also goes by the name opinion mining. It uses NLP (Natural Language Processing), text analysis and computational linguistics techniques to extract subjective information from textual data.&lt;/p&gt;

&lt;p&gt;It entails locating and classifying the attitudes, feelings, and views represented in a piece of writing, such as a review, tweet, or news item. Sentiment analysis seeks to establish the presence and strength of positive, negative, or neutral sentiment in a given text. It is utilized in many different industries, such as social media analysis, customer service, and market research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Sentiment Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sentiment analysis focuses primarily on the polarity of a text, that is, whether the text is positive, neutral or negative. However, this method of analysis can be broader in that it can detect specific emotions as well, for example whether a client is angry, sad, happy or fearful.&lt;/p&gt;

&lt;p&gt;It can also detect intentions, e.g. interested/not interested, and even urgency: based on the input gathered and analyzed, is the customer in urgent need of a product or not?&lt;/p&gt;

&lt;p&gt;Depending on the outcome required, we can tailor our analysis accordingly. The main types of sentiment analysis are as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Polarity-based&lt;/strong&gt;&lt;br&gt;
This type of analysis focuses on figuring out whether a sentiment is positive, negative or neutral. It then assigns a score depending on the polarity expressed. For example, positive = 4, neutral = 2 and negative = 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Aspect-based&lt;/strong&gt;&lt;br&gt;
Here we delve a bit deeper to find out the sentiments towards a specific aspect, feature, brand, etc. The outcome of the analysis gives more information on which products or services people mention in a negative, neutral or positive way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) Multilingual&lt;/strong&gt;&lt;br&gt;
This type of sentiment analysis deals with text input in different languages. It can establish the sentiment of a text regardless of the language it is written in. This is especially useful for multinational companies or organizations that would like to know how their clientele in different countries feel about their products or services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) Emotion-based&lt;/strong&gt;&lt;br&gt;
Suppose we would like to go beyond the basic polarity of a sentiment and know which emotions are expressed in a piece of text, such as fear, anger or happiness. This form of sentiment analysis is best applied here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5) Intent-based&lt;/strong&gt;&lt;br&gt;
We can also tell what the intentions behind a certain text are. Was the sentiment meant to persuade, critique, recommend or provide praise? These are some of the outcomes this type of analysis provides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6) Contextual&lt;/strong&gt;&lt;br&gt;
This takes into account the context in which a piece of text was written. Context can include the date/time, place, intended audience and the background of the writer. For example, based on context we can identify negative or positive sentiment written by senior citizens during peak hours of the day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7) Irony and Sarcasm&lt;/strong&gt;&lt;br&gt;
This type of sentiment analysis identifies whether the sentiment expressed is opposite to its intended meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of performing sentiment analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are numerous benefits to be obtained from sentiment analysis:&lt;/p&gt;

&lt;p&gt;1) Understanding customer opinions: &lt;br&gt;
Sentiment analysis can be used to assess how consumers feel about an organization’s goods and services. The outcomes of the analysis can reveal what consumers like or dislike, which can then guide product development.&lt;/p&gt;

&lt;p&gt;2) Brand reputation management:&lt;br&gt;
Helps a company keep an eye on the perception of its brand by spotting negative sentiment and addressing it promptly in order to preserve a positive brand image.&lt;/p&gt;

&lt;p&gt;3) Crisis management: &lt;br&gt;
This is somewhat similar to brand reputation management in that it also helps to mitigate negative sentiment towards an organization. The main difference is that brand reputation management proactively builds and maintains a company’s image over time, while crisis management focuses on reducing the impact of negative situations such as a product recall or data breach.&lt;/p&gt;

&lt;p&gt;4) Competitive analysis:&lt;br&gt;
Using sentiment analysis, companies can compare the sentiments obtained from their own customers with those of their competitors. This can bring to light areas where the business may be lacking and help improve its products and services in order to better compete in the market.&lt;/p&gt;

&lt;p&gt;5) Market insights:&lt;br&gt;
Enables a company to establish its clientele’s preferences and trends. This information can then help in developing more effective marketing strategies or campaigns that better align with customers’ needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How is Sentiment Analysis performed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are three main approaches to performing sentiment analysis. These are namely: &lt;br&gt;
• Rule-Based Approach&lt;br&gt;
• Machine Learning Approach&lt;br&gt;
• Hybrid Approach&lt;/p&gt;

&lt;p&gt;The rule-based approach involves developing a rule set that identifies sentiment based on specific phrases, keywords or text patterns, for example assigning a negative sentiment when the input text contains words such as ‘annoyed’ or ‘poor’.&lt;/p&gt;
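&lt;p&gt;As a minimal sketch of this idea (the word lists below are tiny illustrative lexicons made up for the example, not drawn from any standard library):&lt;/p&gt;

```python
# Minimal rule-based sentiment sketch: hand-made lexicons (illustrative
# words only) map keywords to a polarity label.
NEGATIVE_WORDS = {"annoyed", "poor", "terrible", "slow"}
POSITIVE_WORDS = {"great", "excellent", "love", "fast"}

def rule_based_sentiment(text):
    words = set(text.lower().split())
    negatives = len(words & NEGATIVE_WORDS)  # count lexicon hits per class
    positives = len(words & POSITIVE_WORDS)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"

print(rule_based_sentiment("I was annoyed by the poor service"))  # negative
print(rule_based_sentiment("Great product, I love it"))           # positive
```

&lt;p&gt;Real rule sets add negation handling, weights and phrase patterns, but the lookup-and-count core stays the same.&lt;/p&gt;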

&lt;p&gt;A machine learning approach deals with establishing a sentiment from text by training a machine learning algorithm. The trained model then learns to recognize certain patterns and categorize them into positive, neutral or negative sentiments. The benefit of an efficiently trained machine learning algorithm is that it can be used to sort new input text in future with less hassle.&lt;/p&gt;

&lt;p&gt;The third approach to performing sentiment analysis combines machine learning with the rule-based approach. In some instances the machine learning model alone may not perform well enough, and the rule-based approach can be applied alongside it to achieve better accuracy.&lt;/p&gt;

&lt;p&gt;Our primary focus moving forward will be documenting how machine learning (ML) can be used to perform sentiment analysis on tweets, Facebook posts, news, movie reviews and any other platforms that document feedback from users and customers. &lt;/p&gt;

&lt;p&gt;The main steps involved in any sentiment analysis project using ML are:&lt;br&gt;
• Import the data&lt;br&gt;
• Clean and preprocess the data&lt;br&gt;
• Split the data into training/test sets&lt;br&gt;
• Create a model&lt;br&gt;
• Train the model&lt;br&gt;
• Make predictions&lt;br&gt;
• Evaluate and improve&lt;/p&gt;
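&lt;p&gt;The steps above can be sketched end to end with a tiny word-frequency Naive Bayes classifier (the in-line sentences are illustrative stand-ins for a real corpus, and the model is deliberately minimal):&lt;/p&gt;

```python
from collections import Counter
import math

# Steps 1-2: a tiny illustrative dataset, "imported" in-line and
# preprocessed by lowercasing and stripping punctuation (a real project
# would load and clean an actual corpus).
def tokenize(text):
    return [w.strip(".,!?") for w in text.lower().split()]

# Step 3: pre-split training and test sets (illustrative sentences).
train = [
    ("I love this movie, it was great", "positive"),
    ("What a fantastic, great experience", "positive"),
    ("Great acting and a fantastic plot", "positive"),
    ("I hate this boring movie", "negative"),
    ("What a terrible, boring plot", "negative"),
    ("Terrible acting, I hate it", "negative"),
]
test = [
    ("I love this fantastic movie", "positive"),
    ("I hate this terrible plot", "negative"),
]

# Steps 4-5: create and train a word-frequency Naive Bayes model.
counts = {}
for text, label in train:
    counts.setdefault(label, Counter()).update(tokenize(text))

# Step 6: predict via log-probabilities with add-one smoothing
# (the class priors are equal here, so they are omitted).
def predict(text):
    def score(c):
        total, vocab = sum(c.values()), len(c)
        return sum(math.log((c[w] + 1) / (total + vocab)) for w in tokenize(text))
    return max(counts, key=lambda label: score(counts[label]))

# Step 7: evaluate on the held-out examples.
accuracy = sum(predict(t) == y for t, y in test) / len(test)
print(accuracy)  # 1.0 on this toy data
```

&lt;p&gt;The “improve” step would then iterate on the preprocessing, the features or the model itself until the held-out accuracy is acceptable.&lt;/p&gt;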

&lt;p&gt;Some of the popular machine learning methods that we are going to apply in our analysis are:&lt;/p&gt;

&lt;p&gt;1) Naive Bayes: &lt;br&gt;
This algorithm deals primarily with the probability of a text belonging to a certain sentiment based on the frequency of words in the input text.&lt;/p&gt;

&lt;p&gt;2) Deep Learning: &lt;br&gt;
Using neural networks with multiple layers we can recognize patterns in a text.&lt;/p&gt;

&lt;p&gt;3) Support Vector Machines (SVM):&lt;br&gt;
Unlike Naïve Bayes, this method is non-probabilistic: it performs text classification by finding the optimal hyperplane that separates the positive and negative sentiments.&lt;/p&gt;

&lt;p&gt;4) Decision Trees: &lt;br&gt;
A tree-like model of decisions is built using this type of machine learning. &lt;/p&gt;

&lt;p&gt;5) Random Forest: &lt;br&gt;
Multiple decision trees are combined thereby improving the accuracy of the model that will be used for the analysis. &lt;/p&gt;

&lt;p&gt;Our next article(s) will delve deeper into building an ML model from scratch, after preprocessing the input text into clean numeric data that our model can interpret in order to perform accurate sentiment analysis.&lt;/p&gt;

</description>
      <category>sentimentanalysis</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Exploratory Data Analysis Beginner Guide</title>
      <dc:creator>AJ_Coding</dc:creator>
      <pubDate>Wed, 15 Mar 2023 17:26:57 +0000</pubDate>
      <link>https://dev.to/aj_coding/exploratory-data-analysis-beginner-guide-7p1</link>
      <guid>https://dev.to/aj_coding/exploratory-data-analysis-beginner-guide-7p1</guid>
      <description>&lt;p&gt;Data professionals use exploratory data analysis (EDA) to explore, study, and become familiar with a dataset’s properties and the relationships between its variables. Data visualization is one of the most important tools and approaches used in EDA. We can truly understand the data’s appearance and the sorts of questions it may help us answer by analyzing and visualizing it using EDA. It also provides a means of identifying trends and patterns, identifying outliers and other abnormalities, and addressing certain important research problems.&lt;/p&gt;

&lt;p&gt;These are some of the major steps carried out for this type of analysis:&lt;/p&gt;

&lt;p&gt;1). Collect the data&lt;/p&gt;

&lt;p&gt;2). Load the data&lt;/p&gt;

&lt;p&gt;3). Get the basic information about the data&lt;/p&gt;

&lt;p&gt;4). Handle duplicate values&lt;/p&gt;

&lt;p&gt;5). Handle the unique values in the data&lt;/p&gt;

&lt;p&gt;6). Visualize unique count&lt;/p&gt;

&lt;p&gt;7). Find null values&lt;/p&gt;

&lt;p&gt;8). Replace null values&lt;/p&gt;

&lt;p&gt;9). Know the data type&lt;/p&gt;

&lt;p&gt;10). Filter data&lt;/p&gt;

&lt;p&gt;11). Get data’s box plot&lt;/p&gt;

&lt;p&gt;12). Create correlation plot&lt;/p&gt;

&lt;p&gt;First, we make sure we are working with the correct libraries. For EDA, the following libraries are often used: pandas, numpy, seaborn and matplotlib. We import them as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For convenience, we will use pyforest instead, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyforest
pd.set_option('display.max_columns', 200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset that we will be working with is US Police Shootings, found on my GitHub here: &lt;a href="https://github.com/AJ-Coding101/Exploratory-Data-Analysis-EDA-of-USA-Police-Shootings" rel="noopener noreferrer"&gt;https://github.com/AJ-Coding101/Exploratory-Data-Analysis-EDA-of-USA-Police-Shootings&lt;/a&gt;. We can load the data for analysis using the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv("shootings.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.head (2) #To display the first 2 rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3tyjvsn0hm2vuzl45sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3tyjvsn0hm2vuzl45sx.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.tail(2) #To display the last 2 rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdm1hqxq26pbpj3twhlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdm1hqxq26pbpj3twhlg.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step would be to get a quick overview of the types of data we are dealing with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.shape #To show how many rows and columns are in the dataset
(4895, 16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.describe()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The describe function is very useful as it shows details such as the count, mean, and minimum and maximum values, among others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6ukll2big6bjp56ay9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6ukll2big6bjp56ay9j.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
To get even more insight on our dataset, we can use df.info().&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 4895 entries, 0 to 4894
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       4895 non-null   int64  
 1   name                     4895 non-null   object 
 2   date                     4895 non-null   object 
 3   manner_of_death          4895 non-null   object 
 4   armed                    4895 non-null   object 
 5   age                      4895 non-null   float64
 6   gender                   4895 non-null   object 
 7   race                     4895 non-null   object 
 8   city                     4895 non-null   object 
 9   state                    4895 non-null   object 
 10  signs_of_mental_illness  4895 non-null   bool   
 11  threat_level             4895 non-null   object 
 12  flee                     4895 non-null   object 
 13  body_camera              4895 non-null   bool   
 14  arms_category            4895 non-null   object 
dtypes: bool(2), float64(1), int64(1), object(11)
memory usage: 506.8+ KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are able to see vital info such as data types, entry counts, memory usage and any null values per column. We can also notice that the ‘age’ column has a data type of float instead of int, and the ‘date’ column is represented as an object data type, which is incorrect. This might cause issues in our analysis. However, there is an easy fix, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['age'] = df['age'].astype(int)
df['date']=pd.to_datetime(df['date'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We run df.info() again and now all the data types are shown correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 4895 entries, 0 to 4894
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   id                       4895 non-null   int64         
 1   name                     4895 non-null   object        
 2   date                     4895 non-null   datetime64[ns]
 3   manner_of_death          4895 non-null   object        
 4   armed                    4895 non-null   object        
 5   age                      4895 non-null   int32         
 6   gender                   4895 non-null   object        
 7   race                     4895 non-null   object        
 8   city                     4895 non-null   object        
 9   state                    4895 non-null   object        
 10  signs_of_mental_illness  4895 non-null   bool          
 11  threat_level             4895 non-null   object        
 12  flee                     4895 non-null   object        
 13  body_camera              4895 non-null   bool          
 14  arms_category            4895 non-null   object        
dtypes: bool(2), datetime64[ns](1), int32(1), int64(1), object(10)
memory usage: 487.7+ KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we can confirm that there are no null values in our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isna().sum()

id                         0
name                       0
date                       0
manner_of_death            0
armed                      0
age                        0
gender                     0
race                       0
city                       0
state                      0
signs_of_mental_illness    0
threat_level               0
flee                       0
body_camera                0
arms_category              0
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isna().sum().sum()
0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is also useful to check whether we have any duplicated rows and remove them, as they are redundant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.duplicated().sum()
0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
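&lt;p&gt;This dataset happens to contain no nulls or duplicates, but steps 4, 7 and 8 of our outline matter on messier data. Here is a sketch of a typical cleanup, using a small made-up frame rather than the shootings data:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# A small made-up frame with one duplicated row and one null age.
df_demo = pd.DataFrame({
    "name": ["A", "B", "B", "C"],
    "age": [34.0, 28.0, 28.0, np.nan],
})

df_demo = df_demo.drop_duplicates()  # drop the repeated row
# Replace the null age with the column median (34 and 28 -> 31.0).
df_demo["age"] = df_demo["age"].fillna(df_demo["age"].median())
print(df_demo["age"].tolist())  # [34.0, 28.0, 31.0]
```

&lt;p&gt;Whether to fill nulls with a median, a mean, or to drop the rows entirely depends on the column and on how much data you can afford to lose.&lt;/p&gt;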



&lt;p&gt;Next, we will take some steps to visualize our data. The library matplotlib is useful in this step of our analysis. We would like to view the number of police shootings according to race in the USA.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['race'].value_counts().plot(kind='bar', edgecolor = 'black')
plt.title('Histogram:According to race')
plt.xlabel('Race')
plt.ylabel('Number of shootings')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5naaunyt6gtd2xp1686z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5naaunyt6gtd2xp1686z.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
We can also visualize the number of police shootings according to age. For this we can use a histogram.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(df['age'], bins=15, edgecolor='black')
plt.title('Histogram:According to age')
plt.xlabel('Age')
plt.ylabel('Number of shootings')
plt.xticks(range(0, 101, 5))
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva1h6g8o5w8xojq6vjne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fva1h6g8o5w8xojq6vjne.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
The age group between 29–34 years seems to have encountered the most shootings by the police. To visualize police shootings according to year, we need to extract the year from the date column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['year'] = df['date'].dt.year
year_shootings = df.groupby('year').size()
#count() can also be used instead of size()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using pandas’ groupby() function, we grouped the data by year and counted the shootings per year. The code snippet below then plots a line graph displaying those figures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;year_shootings.plot (kind='line', grid = 'black')
plt.title('Number of shootings according to year')
plt.xlabel('Year')
plt.ylabel('Number of shootings')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sbqg2lhk2lyoly9lxbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sbqg2lhk2lyoly9lxbj.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
We can observe that the number of police shootings had been decreasing slightly each year. In 2020, the figure drops significantly. Let’s inspect the dataframe to identify why.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.groupby('year').size()
year
2015    965
2016    904
2017    906
2018    888
2019    858
2020    374
dtype: int64


df.groupby(['year',df['date'].dt.month]).size()

year  date
2015  1       75
      2       77
      3       91
      4       83
      5       69
              ..
2020  2       61
      3       73
      4       58
      5       78
      6       22
Length: 66, dtype: int64


df[df['year'] == 2019].groupby(['year', df['date'].dt.month]).size()
year  date
2019  1        81
      2        68
      3        76
      4        63
      5        64
      6        77
      7        69
      8        57
      9        59
      10       73
      11       71
      12      100
dtype: int64

df[df['year'] == 2020].groupby(['year', df['date'].dt.month]).size()
year  date
2020  1       82
      2       61
      3       73
      4       58
      5       78
      6       22
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is now clear that our dataset only contains shootings up to June 2020, hence the drop for that year.&lt;/p&gt;

&lt;p&gt;The seaborn library is built on top of the matplotlib library and can be used for powerful visualizations as well. Let’s visualize the number of police shootings according to age with a regression line as well.&lt;/p&gt;

&lt;p&gt;First, we again use the groupby() function, this time to group the data by age and calculate the number of shootings per age.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.groupby(df['age']) . size()

age
6      2
12     1
13     1
14     3
15    13
      ..
81     1
82     2
83     2
84     4
91     1
Length: 75, dtype: int64

age_shootings = df.groupby(df['age']).size()

sns.scatterplot(x=age_shootings.index, y=age_shootings.values)
sns.regplot(x=age_shootings.index, y=age_shootings.values)
plt.title('Number of shootings by age')
plt.xlabel('Age')
plt.ylabel('Number of shootings')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code creates a scatter plot of the number of shootings by age, and the regplot() function adds a regression line to it. The regression line represents the best-fit line that explains the relationship between the number of police shootings and the age of the victims, helping us better understand the trend of the data and the correlation between the two variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj69lela3mltava86i8zl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj69lela3mltava86i8zl.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Next, we can visualize the number of police shootings according to race and by year.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shootings_by_race_year = df.groupby(['year', 'race']).size()
shootings_by_race_year = shootings_by_race_year.unstack()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we use the groupby() function to group the data by year and race and calculate the total number of shootings, as shown in the code above, then use unstack() to present the data in a more readable format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj0cdd6uj3awiebk8pr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj0cdd6uj3awiebk8pr7.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
As we can see, Exploratory Data Analysis is crucial in data science. The steps shown in this article are just some of the general steps taken in EDA.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>eventdriven</category>
    </item>
  </channel>
</rss>
