Vee

Exploratory Data Analysis using Data Visualization techniques

Exploratory data analysis (EDA) is the process of studying data using visualization and statistical methods in order to understand it. In other words, it is the first look at your data, and a vital step before you begin the actual analysis. EDA helps you discover relationships within the data and identify patterns and outliers that may exist in the dataset. Data scientists use EDA to ensure the results they produce are valid and applicable to the desired business outcomes and goals.

Objectives of EDA

The main objectives of EDA are to:

  • Confirm that the data makes sense in the context of the problem being solved. Where it doesn't, come up with other strategies, such as collecting more data.
  • Uncover and resolve data quality issues such as duplicates, missing values, incorrect data types and incorrect values.
  • Get insights about the data, for example through descriptive statistics.
  • Detect anomalies and outliers that may cause problems during data analysis. Outliers are values that lie unusually far from the rest of the data.
  • Uncover data patterns and correlations between variables.

Types of EDA

Exploratory data analysis is classified into three broad categories, namely:

  • Univariate
  • Bivariate
  • Multivariate
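
Roughly speaking, univariate analysis looks at one variable at a time, bivariate analysis at the relationship between two variables, and multivariate analysis at several variables together. As a small sketch of what each looks like in seaborn (using its built-in tips dataset rather than the data analysed later in this post):

import seaborn as sns

# load a small example dataset that ships with seaborn
tips = sns.load_dataset('tips')

# univariate: distribution of a single variable
sns.histplot(tips['total_bill'])

# bivariate: relationship between two variables
sns.scatterplot(x='total_bill', y='tip', data=tips)

# multivariate: pairwise relationships across several variables
sns.pairplot(tips[['total_bill', 'tip', 'size']])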

Steps in EDA

The following is a step-by-step approach to undertaking an exploratory data analysis:

1. Data Collection

Gather relevant and sufficient data for your project. There are various sites online where you can get data, irrespective of the sector you're in. Here are a few examples to check out: Kaggle, Datahub.io, BFI film industry statistics.

2. Familiarize yourself with the data

This step is important as it helps you to determine whether the data is adequate for the analysis about to be done.

3. Data cleaning

This is where any missing values, outliers and duplicates are identified and removed from the dataset. Data that is irrelevant to the anticipated analysis is also removed at this stage.
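
As a minimal sketch of what this can look like in pandas (the data frame and values here are made up for illustration, not taken from a real dataset):

import pandas as pd

# a toy data frame with a duplicate row, a missing value and an outlier
df = pd.DataFrame({
    'age': [25, 32, None, 32, 200],
    'city': ['NYC', 'Boston', 'NYC', 'Boston', 'NYC']
})

df = df.drop_duplicates()   # remove duplicate rows
df = df.dropna()            # drop rows with missing values (or impute them instead)
df = df[df['age'] < 120]    # filter out an obviously impossible value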

4. Identify associations in the dataset

Look for any correlations between variables. You can use a heatmap or scatterplots to make the correlations easier to spot.
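
For example, a correlation heatmap takes a single line in seaborn (again sketched on the built-in tips dataset rather than any particular project data):

import seaborn as sns

tips = sns.load_dataset('tips')

# correlation matrix of the numeric columns, drawn as an annotated heatmap
sns.heatmap(tips[['total_bill', 'tip', 'size']].corr(), annot=True, cmap='coolwarm')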

Example: Exploratory Data Analysis using NYC Citi Bike data.

We will now perform an exploratory data analysis on NYC Citi Bike data to get a better understanding of the process. You can access the data here.

1. Import the data

The first step is to import all the modules you are going to use in your project. In this case, we will need pandas for data wrangling and seaborn for data visualization. This is how I would do it:

import pandas as pd
import seaborn as sns

Then import your dataset. If you're using Google Colab, this is how you would load the data:

from google.colab import files
uploaded = files.upload()

You will then read in the data as a pandas data frame like this:

[Image: reading the CSV into a pandas data frame]
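
In essence this is a single pd.read_csv call; the file name below is just a placeholder for whichever Citi Bike CSV you uploaded:

# read the uploaded CSV into a pandas data frame ('citibike.csv' is a placeholder name)
data = pd.read_csv('citibike.csv')
data.head()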

2. Get an overview of the data

You can approach this in various ways. For example, .info() shows the number of columns, the column names, their data types, and the number of non-null values in the data frame. The following is an example:

[Image: output of data.info()]

The other alternative is to use .describe(), which gives you a statistical summary of the numeric columns. Here's an example:

[Image: output of data.describe()]
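
The calls themselves are one-liners on the data frame:

# column names, data types and non-null counts
data.info()

# count, mean, std, min, quartiles and max for the numeric columns
data.describe()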

3. Visualize the distribution of trip duration

This will help us get a glimpse of how long most trips took. Using seaborn, this is how I would do it:

# visualize distribution for trip duration
sns.histplot(data['tripduration'])

Here's the sample output:

[Image: histogram of trip duration]

From the output, it is evident that most trips took no more than about 10 minutes.

4. Visualize the correlation between gender and trip duration

# checking for association between tripduration and gender using scatterplots
sns.pairplot(data[['tripduration', 'gender']])

Sample output is as follows:

[Image: pairplot of tripduration and gender]

5. Calculate the percentage of subscribers

We need to find out the share of subscribers out of the total number of riders in New York City. Here's how to find out:

[Image: calculating the percentage of subscribers]
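
One way to compute this, assuming the usertype column labels each trip as either 'Subscriber' or 'Customer' (as in the standard Citi Bike export):

# share of each user type as a percentage of all trips
user_shares = data['usertype'].value_counts(normalize=True) * 100
print(user_shares)

# percentage of trips made by subscribers
print(f"Subscribers account for {user_shares.get('Subscriber', 0):.1f}% of trips")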

6. Evaluate how trip length varies based on trip start time

# slice the hour (HH) out of the starttime string
data['hour'] = data.starttime.apply(lambda x: x[11:13]).astype('str')
data

# visualize correlation
sns.scatterplot(x= 'hour', y= 'tripduration', data = data, hue= 'usertype')

The output is as follows:

[Image: scatterplot of tripduration by hour, colored by usertype]

7. Determine the bike stations where most trips start

First, we get the count of trips starting from each station and store the output as a new data frame. We then drop the duplicates from the original data frame and merge the two data frames for visualization.

# Get the count of trips from each station
new_data = data.groupby(['start station id']).size().reset_index(name= 'counts')

#remove duplicate values from the start station id column
temp_data = data.drop_duplicates('start station id')

# left join to merge new_data and temp_data dataframes
newdata2 = pd.merge(new_data, temp_data[['start station id', 'start station name', 'start station latitude', 'start station longitude']], how= 'left', on= ['start station id'])

#install folium
!pip install folium
import folium

# initialize a map
m = folium.Map(location=[40.691966, -73.981302], tiles='OpenStreetMap', zoom_start=12)
m

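One way to place the stations from newdata2 on this map is to add a circle marker per station, sized by its trip count (the radius scaling below is just an illustrative choice):

# add one circle marker per start station, sized by the number of trips
for _, row in newdata2.iterrows():
    folium.CircleMarker(
        location=[row['start station latitude'], row['start station longitude']],
        radius=max(3, row['counts'] / 100),   # scale down the raw counts; tune the divisor to your data
        popup=row['start station name'],
        fill=True
    ).add_to(m)

m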

The output is as follows:

[Image: the rendered folium map]

Conclusion

EDA is crucial because it affects the quality of the findings in the final analysis. The success of any EDA depends on the quality and quantity of the data, the tools and visualizations used, and proper interpretation by the data scientist.
