Introduction to Exploratory Data Analysis:
Exploratory Data Analysis (EDA) is an essential step in the data science process. It uses summaries and visualizations to build an understanding of a dataset's characteristics, its patterns, and the relationships between its variables.
The roles of EDA are:
- Checking the quality of data
- Data cleaning
- Data summarization
- Identifying patterns and trends in data
- Detecting and checking for outliers
- Checking the relationship between variables
- Data visualization
1: Data preparation:
Data preparation, or data pre-processing, is the process of cleaning, transforming, and structuring raw data so that it is suitable for analysis.
The steps involved in data preparation include:
- Data collection: gathering data from sources such as databases, Excel sheets, surveys and questionnaires, and Application Programming Interfaces (APIs).
- Data cleaning: handling missing data, detecting and handling outliers, and validating the data.
- Data transformation: converting data into a form suited to analysis through normalization, encoding (for example one-hot encoding and label encoding), feature engineering, and aggregation.
- Data integration: combining multiple data sources into a single dataset by joining tables and merging data.
- Data reduction: representing the data in a smaller form while preserving the integrity of the important information, through techniques like Principal Component Analysis (PCA).
- Data sampling: using sampling techniques to work with a representative subset when the dataset is large.
- Data documentation: keeping a detailed record of the data preparation process, including any changes made to the data, how missing values were handled, and explanations of the variable transformations and techniques used.
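A few of the steps above (cleaning, normalization, and one-hot encoding) can be sketched with pandas. This is only a minimal illustration: the DataFrame, its column names, and its values are made up, and filling missing values with the median is one possible choice, not the only valid one.

```python
import pandas as pd

# Hypothetical raw data; column names and values are invented for illustration.
raw = pd.DataFrame({
    "Customer": ["A", "B", "C", "D", "E"],
    "Account Type": ["Basic", "Premium", "Basic", None, "Premium"],
    "Savings": [1200.0, None, 800.0, 1500.0, 950.0],
    "Monthly Spending": [400.0, 650.0, 300.0, 700.0, 500.0],
})

# Data cleaning: fill the missing numeric value with the column median
# and the missing category with an explicit "Unknown" label.
clean = raw.copy()
clean["Savings"] = clean["Savings"].fillna(clean["Savings"].median())
clean["Account Type"] = clean["Account Type"].fillna("Unknown")

# Data transformation: min-max normalization rescales a column to the range [0, 1].
spend = clean["Monthly Spending"]
clean["Spending (scaled)"] = (spend - spend.min()) / (spend.max() - spend.min())

# Data transformation: one-hot encoding turns each category into its own column.
prepared = pd.get_dummies(clean, columns=["Account Type"])

print(prepared.head())
```

Whether to fill, drop, or flag missing values depends on the dataset, so the choice made here should be recorded as part of the data documentation step.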
2. Data visualization techniques.
Data visualization techniques are the tools and methods used to represent data graphically, which makes complex information easier to analyze.
Examples of data visualization techniques include:
- Bar charts display data using rectangular bars, where the length of each bar represents the quantity or frequency of a category. The bars can be drawn vertically or horizontally.
- Line graphs display data as points connected by a line and are well suited to showing trends. They are mostly used with time series data.
- Histograms visualize the distribution of continuous or discrete numerical data. They group the data into bins and show the frequency or density of data points in each bin.
- Scatter plots display individual data points as dots on a two-dimensional (2D) plane, with one variable on the x-axis and another on the y-axis. They are useful for exploring relationships and correlations between two variables.
- Boxplots summarize the distribution of a dataset, including its median, quartiles, and potential outliers. They are particularly useful for comparing the distributions of multiple variables.
It is important to choose the right visualization technique, because it makes the complex information in the dataset easier to analyze and communicate.
2(a). Choosing the right visualization technique.
The right visualization technique depends on the type of data you have (for example numerical, time series, or categorical) and on what you want to show.
To compare variables in your dataset, consider bar charts for categorical data and line charts for time series data.
To visualize distributions, consider histograms and density plots for continuous data, and boxplots for showing central tendency and variability.
To explore relationships between variables, consider scatter plots for two numerical variables and bubble plots for three or more variables.
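The guidance above can be sketched with matplotlib by drawing three chart types side by side. The data is randomly generated, and the column names and the output file name `chart_types.png` are assumptions made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "Account Type": rng.choice(["Basic", "Premium"], size=200),
    "Savings": rng.normal(loc=1000, scale=250, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Comparison across categories: bar chart of customer counts.
counts = df["Account Type"].value_counts()
axes[0].bar(list(counts.index), counts.values)
axes[0].set_title("Customers per account type")

# Distribution of a continuous variable: histogram.
axes[1].hist(df["Savings"], bins=20)
axes[1].set_title("Distribution of savings")

# Central tendency and variability per group: boxplots.
labels = sorted(df["Account Type"].unique())
groups = [df.loc[df["Account Type"] == name, "Savings"].to_numpy() for name in labels]
axes[2].boxplot(groups)
axes[2].set_xticks(range(1, len(labels) + 1))
axes[2].set_xticklabels(labels)
axes[2].set_title("Savings by account type")

fig.tight_layout()
fig.savefig("chart_types.png")  # hypothetical file name; use plt.show() for an interactive window
```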
3: Creating visualizations.
Visualizations can be created with libraries such as Python's matplotlib and seaborn or R's ggplot2. They support the analysis and communication of data and help in detecting outliers.
Below is the process of creating visualizations using Python's matplotlib:
- Preparing your dataset by importing libraries and reading your files
import pandas as pd
import matplotlib.pyplot as plt
# Reading your data
# If the file is not in the working directory, use its full path, e.g.
# df = pd.read_csv("/admin/documents/data1/mockdata.csv")
df = pd.read_csv('mock_data.csv')
# If you are using an API, use the requests module to fetch your data, for example:
# import requests
# api_link = "https://www.example.com/api/"
# response = requests.get(api_link)
- Choosing a plot type and creating a basic plot:
x = df["Savings"]
y = df["Monthly Spending"]
plt.figure(figsize=(8, 6))
- Customization of your plot by adding titles, labels, and legends and customizing the appearance of the visualization.
plt.title("The relationship between savings and monthly spending over the last 12 months")
plt.xlabel("Savings")
plt.ylabel("Monthly spending")
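- Data mapping: deciding how the data is mapped to the aesthetics of your plot, e.g. the x and y positions, colours, and sizes.
plt.scatter(x, y, label="Data Points", color="blue", marker="o")
# Show the legend entry defined by the label above, then display the plot
plt.legend()
plt.show()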
- Applying statistical transformations where needed, e.g. fitting a trend line or aggregating values before plotting.
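One common statistical transformation is a least-squares trend line drawn over a scatter plot. The sketch below uses NumPy's `polyfit` on synthetic data. The numbers are generated with a built-in linear relationship purely for illustration, and the output file name `trend_line.png` is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only: a linear relationship plus random noise.
rng = np.random.default_rng(seed=42)
x = rng.uniform(500, 3000, size=100)               # stands in for "Savings"
y = 0.5 * x + 100 + rng.normal(0, 50, size=100)    # stands in for "Monthly Spending"

# Statistical transformation: fit a straight line (degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, color="blue", marker="o", label="Data points")

# Draw the fitted line across the observed range of x.
x_line = np.linspace(x.min(), x.max(), 50)
ax.plot(x_line, slope * x_line + intercept, color="red", label="Fitted trend line")

ax.set_xlabel("Savings")
ax.set_ylabel("Monthly spending")
ax.legend()
fig.savefig("trend_line.png")  # hypothetical file name; use plt.show() for an interactive window
```

A straight line is only appropriate when the scatter plot looks roughly linear; for curved patterns a different transformation would be needed.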
4. Analysis of the data.
After creating visualizations and using visualization techniques to explore relationships, correlations, and other important information, the data is analyzed.
One can analyze the relationships between two or more variables revealed by the visualizations, detect and handle potential outliers, and examine the distribution of the data.
Below is a case study that explores the relationship between savings and monthly spending, with the aim of designing a campaign to strengthen the saving culture among a bank's customers.
(i) Data preparation:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('customer_data.csv')
(ii) Data Exploration:
Here, you should check for missing values in your data and get to know its structure and characteristics.
# Count the missing values in each column
df.isnull().sum()
# Check summary statistics
df.describe()
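Exploration is also a natural point to check for outliers. The sketch below applies the common 1.5 × IQR rule to a small, made-up "Savings" series. Both the values and the 1.5 multiplier are assumptions for illustration, not taken from the case study data.

```python
import pandas as pd

# Hypothetical savings values; the last one is deliberately extreme.
savings = pd.Series([900, 1000, 1100, 950, 1050, 1020, 980, 5000], name="Savings")

# Interquartile range (IQR): the spread of the middle 50% of the data.
q1 = savings.quantile(0.25)
q3 = savings.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = savings[(savings < lower) | (savings > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

A flagged value is only a candidate outlier. Whether to keep, correct, or remove it depends on what the value represents.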
(iii) Data visualization:
plt.figure(figsize=(10, 6))
plt.scatter(df['Monthly Spending'], df['Savings'], alpha=0.5)
plt.title('The relationship between savings and monthly spending over the last 12 months')
plt.xlabel('Monthly Spending')
plt.ylabel('Savings')
plt.grid(True)
plt.show()
(iv) Data analysis:
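After examining the scatter plot, the observed relationship should be analyzed, for example by noting its direction and strength and checking for outliers, before drawing conclusions for the campaign.
In conclusion, data visualization techniques are instrumental in Exploratory Data Analysis (EDA).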
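One way to put a number on what the scatter plot shows is the Pearson correlation coefficient. The sketch below uses synthetic data in place of `customer_data.csv`. The negative relationship is built into the generated numbers, so the result says nothing about real customers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer data; values are generated, not real.
rng = np.random.default_rng(seed=1)
spending = rng.uniform(500, 2500, size=120)
savings = 2000 - 0.6 * spending + rng.normal(0, 100, size=120)
df = pd.DataFrame({"Monthly Spending": spending, "Savings": savings})

# Pearson correlation ranges from -1 (perfect negative) to 1 (perfect positive).
r = df["Monthly Spending"].corr(df["Savings"])
print(f"Pearson correlation: {r:.2f}")
```

A coefficient close to -1 or 1 indicates a strong linear relationship, while a value near 0 indicates a weak one. Correlation alone does not show that one variable causes the other.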