Joseph Ngigi

Posted on Feb 27, 2023

Exploratory Data Analysis (EDA)

#datascience #tutorial #beginners #dataanalysis

Introduction to EDA

EDA is an important first step before any deeper data analysis. It is done once there is a defined problem, which is typically a business problem or a business opportunity that a company wants to pursue. Before conducting an analysis, you verify that the data is appropriate for it. The quality of an analysis depends on how well one understands the data they are dealing with. EDA helps one discover and resolve data quality issues, for example:

Duplicates

Unwanted data points

Missing Data

Incorrect Values

Outliers and Anomalies

EDA generates data summaries such as the mean, median, sum, count, and other statistical information. The main purpose of exploratory data analysis is to analyze and summarize a dataset.
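
Here is a minimal sketch of these checks, assuming pandas is installed. The `orders` table, its column names, and its values are made up for illustration and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical toy data with one duplicated row and one missing value.
orders = pd.DataFrame({
    "customer": ["Ann", "Ben", "Ben", "Cy", "Dee"],
    "amount": [120.0, 80.0, 80.0, None, 95.0],
})

# Data quality checks: fully duplicated rows and missing values per column.
n_duplicates = int(orders.duplicated().sum())
n_missing = orders.isna().sum()

# Data summaries for one numeric column (missing values are skipped).
summary = orders["amount"].agg(["mean", "median", "sum", "count"])

print("Duplicate rows:", n_duplicates)
print("Missing values per column:")
print(n_missing)
print("Summary of amount:")
print(summary)
```

On a real dataset you would usually load the data with something like `pd.read_csv` first and then run the same checks.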

A dataset is a collection of related data, typically organized in a structured format, whose items can be accessed individually or in combination, or managed as a whole entity.

What is EDA?

It is an approach to analyzing data that focuses on investigating data sets and summarizing their main characteristics, often employing data visualization methods. EDA aims to discover the underlying structure and nature of the data, rather than simply confirming preconceived hypotheses. It uses statistical and graphical techniques, such as histograms, to visualize the data.

After completing the process of EDA and extracting insights, the characteristics of the data can be utilized for more advanced data analysis or modeling tasks, such as machine learning.

The Data Analysis Process

Defining an Objective
In data analytics, the first step is defining the problem statement. This is the process of coming up with a hypothesis and figuring out how to test it. One can start by defining a business problem (e.g. "Why are we losing customers as Xiaomi?"). As a data analyst, you will dig deeper to find the factors that are negatively impacting the customer experience. The analysis is not limited to this; there are, of course, other factors. Defining the objective is mostly about soft skills, lateral thinking, and knowledge of the business model. Business metrics and key performance indicators (KPIs) can help track problem points.

Data Collection
The data collected can be qualitative or quantitative. It can be categorized into:

First-Party Data: This is data that a company collects directly from its customers through channels such as its website, mobile apps, social media accounts, and customer relationship management (CRM) software. This data is usually clear and structured.
Second-Party Data: This is data that a company acquires directly from another company or organization, rather than collecting it itself. For example, a car manufacturer might acquire second-party data from a car dealership that has customer data from its own sales and service interactions.
Third-Party Data: This is data that is collected by one company and then sold or licensed to other companies for their own use. It can include demographic information, browsing behavior, purchase history, and more.

Data Collection Tools
There are various tools that can be used for data collection. One important tool is a Data Management Platform (DMP), which allows one to identify and aggregate data from numerous sources before manipulating it. Examples include Salesforce, SAS, and Xplenty (a data integration platform). Open-source options include Pimcore and D:Swarm.

Cleaning The Data
This is an important step in exploratory data analysis (EDA): it gets the data ready for analysis and makes sure that you are working with high-quality data. Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset. Key data cleaning tasks include:

Removing major errors, duplicates, and outliers

Removing unwanted data points

Bringing structure to the data, e.g. fixing typos or layout issues

Filling in or removing missing values

There are various tools that can be used for data cleaning. OpenRefine is one open-source tool for the job, though it might not be suitable for large datasets. The Python library pandas and various R libraries can be better suited to cleaning large datasets. There are many more data cleaning tools worth checking out.
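
The cleaning tasks listed above can be sketched with pandas roughly as follows. The `raw` table is made up, and the plausible age range of 0 to 120 and the choice to fill gaps with the median are assumptions made for illustration. The right rules depend on your data.

```python
import pandas as pd

# Hypothetical raw data: a duplicated row, inconsistent city spelling,
# a missing age, and an implausible age value.
raw = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cy", "Dee", "Eli"],
    "city": ["Nairobi", "Mombasa", "Mombasa", "nairobi ", "Kisumu", "Kisumu"],
    "age": [34.0, 29.0, 29.0, None, 41.0, 410.0],
})

# Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Bring structure to the text column: trim whitespace and normalize case.
clean["city"] = clean["city"].str.strip().str.title()

# Drop rows whose age falls outside an assumed plausible range (0 to 120),
# keeping rows where age is missing so they can be filled in next.
plausible = clean["age"].isna() | clean["age"].between(0, 120)
clean = clean[plausible].copy()

# Fill the remaining missing ages with the median of the known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean = clean.reset_index(drop=True)

print(clean)
```

Filling with the median is only one option. Depending on the objective, dropping rows with missing values using `dropna` may be the safer choice.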

Data Analysis
The type of data analysis you carry out typically depends on your objective. Some data analysis techniques include univariate or bivariate analysis, data mining, time series analysis, and machine learning. Broadly, all data analysis fits into one of these categories:

Descriptive Analysis uses current and historical data to identify relationships and trends.

Diagnostic Analysis is a more advanced form of analytics that examines data to diagnose problems or issues and understand why they happened.

Predictive Analysis makes predictions about future outcomes by applying statistical and machine learning techniques to historical data.

Prescriptive Analysis analyzes data and content to recommend the optimal course of action for achieving the objective.

Visualization of The Data
This involves interpreting the outcomes and presenting them. The results are expected to be unambiguous. Data visualization is the practice of presenting information or data in a visual form, such as charts, graphs, maps, and other interactive visual representations. There are many tools that can be used for data visualization, including Datawrapper, Tableau, Google Charts, and Power BI. These don't require coding skills. Plotly, Seaborn, and Matplotlib are some Python data visualization tools.
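
As a small illustration of the Python route, here is a sketch of a bar chart with Matplotlib. The monthly sales figures and the output file name `monthly_sales.png` are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 110]

fig, ax = plt.subplots()
bars = ax.bar(months, sales)

# Titles and axis labels help keep the result unambiguous.
ax.set_title("Monthly sales (made-up data)")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")

# Save the chart to an image file.
fig.savefig("monthly_sales.png")
```

In a notebook or an interactive session you would typically drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.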

Types of exploratory data analysis

There are four primary types of EDA:

Univariate non-graphical: This is the simplest form of EDA. It deals with analyzing a single variable, mainly to identify patterns, trends, and outliers in the data.
Univariate graphical: This method presents a single variable graphically, using histograms, bar plots, and box plots (which show the minimum, first quartile, median, third quartile, and maximum).
Multivariate non-graphical: This involves analyzing multiple variables simultaneously to identify patterns, trends, and correlations in the data. It usually shows the relationship between two or more variables using cross-tabulation or statistical techniques.
Multivariate graphical: This uses graphics to display relationships between two or more variables. The most used is a grouped bar plot or bar chart.

Other common multivariate graphics include the scatter plot, multivariate chart, bubble chart, run chart, and heatmap.
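
The two non-graphical types can be sketched with pandas as follows. The `df` table of customer plans, churn, spend, and support tickets is entirely made up for illustration.

```python
import pandas as pd

# Hypothetical customer data.
df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": ["yes", "no", "yes", "no", "no", "no"],
    "monthly_spend": [10, 30, 12, 28, 11, 35],
    "support_tickets": [5, 1, 4, 2, 5, 0],
})

# Univariate non-graphical: summary statistics for a single variable.
spend_stats = df["monthly_spend"].describe()

# Multivariate non-graphical: cross-tabulation of two categorical variables.
plan_by_churn = pd.crosstab(df["plan"], df["churned"])

# Multivariate non-graphical: correlation between two numeric variables.
corr = df["monthly_spend"].corr(df["support_tickets"])

print(spend_stats)
print(plan_by_churn)
print("Correlation between spend and tickets:", round(corr, 3))
```

The graphical counterparts would plot the same columns, for example with a histogram of `monthly_spend` or a scatter plot of `monthly_spend` against `support_tickets`.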

EDA Tools

These are some common tools that can be used for EDA. Python and R are the languages most commonly used for data analysis and visualization.

Pandas: A Python library for data manipulation and analysis that provides tools suited to working with large datasets. Pandas is built on top of NumPy, another popular Python library for scientific computing. It can be used for data cleaning and preparation, data exploration, data analysis, and time series analysis.
NumPy: A Python library used in scientific computing, providing tools to work with arrays and matrices.
Matplotlib: A Python library for data visualization, including scatter plots, line graphs, and histograms.
Seaborn: A Python library for statistical data visualization, including heatmaps, cluster maps, and box plots.

Other common tools include Tableau, Power BI, RStudio, SAS, and IBM SPSS.

EDA tools provide a range of statistical functions and techniques that can be used to gain insights and make data-driven decisions. These include data visualization, correlation analysis, regression analysis, clustering, and dimension reduction techniques.
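
To give a flavor of two of these functions, here is a minimal sketch of correlation and simple linear regression using NumPy. The `ad_spend` and `sales` values are invented for illustration, and a straight-line fit is only one of many possible regression models.

```python
import numpy as np

# Hypothetical data: advertising spend and the resulting sales.
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Correlation analysis: Pearson correlation coefficient between the two.
r = np.corrcoef(ad_spend, sales)[0, 1]

# Regression analysis: least-squares fit of a straight line,
# sales ~ slope * ad_spend + intercept.
slope, intercept = np.polyfit(ad_spend, sales, deg=1)

print("Correlation:", round(r, 4))
print("Fitted line: sales =", round(slope, 2), "* ad_spend +", round(intercept, 2))
```

A correlation close to 1 or -1 suggests a strong linear relationship, but it does not by itself show that one variable causes the other.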
