Overview
Data is often described as the new oil of the digital age, but like crude oil, it is only valuable once it has been refined and processed. Exploratory Data Analysis (EDA) is the key to unlocking the hidden gems within your data. In this article, we will delve into the world of EDA: its key benefits, its main techniques, and finally data visualization as one key technique, illustrated with a real-world example.
What is Exploratory Data Analysis?
Exploratory Data Analysis, or EDA, is the process of investigating a dataset and summarizing its main features, both visually and statistically. Its primary goal is to uncover patterns, trends, relationships, and anomalies within the data. EDA is a crucial step before diving into more advanced analytics or building predictive models.
Key Benefits
Spotting missing and incorrect data
Understanding the underlying structure of your data
Testing your hypotheses and checking assumptions. It helps you form educated guesses about what might be happening within your data.
Identifying the most relevant variables by determining how they relate to each other and which independent variables affect the dependent variable.
Creating the most efficient model by removing extraneous information, because additional data can either skew your results or simply obscure key insights with unnecessary noise.
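As a minimal sketch of the first benefit, spotting missing and incorrect data, the snippet below uses pandas on a tiny made-up table. The column names are borrowed from the churn dataset used later in this article, and the values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    "Tenure Months": [1, 24, np.nan, 72, -3],
    "Churn Label": ["Yes", "No", "No", None, "Yes"],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Flag values that are present but impossible (tenure cannot be negative).
invalid_tenure = df[df["Tenure Months"] < 0]
print(invalid_tenure)
```

Checks like these are usually the first thing to run on a new dataset, before any charts are drawn.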
Types of Exploratory Data Analysis
Depending on the type of data we have and the columns we are analyzing, various strategies can be used.
1. Univariate Analysis
This type of analysis looks at a single variable at a time to understand its distribution and central tendencies.
2. Bivariate Analysis
It looks at two variables together and explores the relationships, associations, correlations, and dependencies between them.
3. Multivariate Analysis
This extends bivariate analysis to encompass more variables. It aims to understand the complex interactions and dependencies among more than two variables.
4. Time Series Analysis
It is mainly applied to datasets that have a temporal component. This entails inspecting and modeling patterns, trends, and seasonality over time.
5. Data Visualization
This is an important aspect of EDA and the one we will focus on in this article. It entails creating visual representations of the data to facilitate understanding and exploration. Various visualization techniques, including bar charts, histograms, scatter plots, line plots, heat maps, and interactive dashboards, are used to represent different kinds of data.
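To make the first three analysis types concrete, here is a rough pandas sketch. The column names (`tenure_months`, `monthly_charges`, `total_charges`) and all values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical numeric columns; the values are made up for this sketch.
df = pd.DataFrame({
    "tenure_months": [1, 5, 12, 24, 48, 72],
    "monthly_charges": [70.0, 85.5, 60.0, 55.0, 90.0, 110.0],
    "total_charges": [70.0, 420.0, 700.0, 1300.0, 4300.0, 7900.0],
})

# Univariate: summarize the distribution of a single variable.
tenure_summary = df["tenure_months"].describe()
print(tenure_summary)

# Bivariate: Pearson correlation between two variables.
tenure_vs_total = df["tenure_months"].corr(df["total_charges"])
print(tenure_vs_total)

# Multivariate: pairwise correlations across all numeric variables.
corr_matrix = df.corr()
print(corr_matrix)
```

A correlation matrix is only one simple entry point to multivariate analysis, but it is often enough to decide which pairs of variables deserve a closer look.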
Exploratory Data Analysis using Data Visualization
Data Visualization
Data Visualization is the graphical representation of data that allows us to see patterns, trends, and outliers more clearly. In EDA, data visualization serves several critical purposes:
1. Pattern Recognition: Visualizations help in identifying recurring patterns in the data, which can lead to deeper insights.
2. Anomaly Detection: Outliers and anomalies often stand out vividly in visualizations, making them easier to spot.
3. Communication: Visualizations are a universal language that can effectively convey complex information to both technical and non-technical stakeholders.
To choose and design a data visualization, it is important to consider two things:
The question you want to answer (and how many variables that question involves).
The data that is available (is it quantitative or categorical?).
In this article, we will use different types of graphical representations to explore several aspects of a customer churn dataset and draw meaningful insights from it.
We will start by importing the libraries we will use and loading the data.
The libraries include those we will later use for machine learning. Don't let them scare you.
Let's look at a sample of our dataset.
This dataset contains 32 columns.
I have already dealt with the missing values, so we can start with EDA. For this article, we will focus solely on the general churn rate, the geography of the customers, and the customers' lifetime in the service.
The General Churn Rate
To get a glimpse of the general churn rate, we introduce a metric, the churn rate (the percentage of customers who churned), and look at it in terms of the characteristics of the customers we have. We will use a pie chart for this.
Pie charts make it possible to visualize the relationships between the parts and the whole of a variable.
From the chart, we can see that 26.5% of customers have churned and stopped using the company's services.
The geography of the user
We will look at the customer's location geographically and determine whether geography has an impact on the churn rate.
We will use a scatter plot on a Mapbox map and then use hexagons to further understand this relationship.
A scatter plot on a Mapbox map created with Plotly Express is a visualization that combines the geographical context of a map with the ability to display individual data points as markers.
Plotly Express is a high-level data visualization library that allows users to create interactive plots and charts with minimal code.
Key features include:
Geographical context
Interactive exploration
Customizable markers
Marker clustering
Color mapping
Size mapping
Animations
Customizable map layout
From the scatter plot, the largest number of customers is in the Los Angeles and San Francisco areas, which are large cities.
Let's use a bar chart to get a count of customers per city.
Let's add visualizations by hexagons.
We want to see the number of customers and the percentage of churned customers after dividing the area into hexagons. This is convenient when we want to understand whether the value of a metric changes with the geographical location of the clients, and entities such as a city or country are too large to show that.
Hexagonal cells are color-coded based on the number of data points they hold, which enables you to easily understand data patterns. They help you identify patterns or clusters in a larger point dataset.
In general, there are few hexagons in the Los Angeles area with a high churn rate (50% or more). In some hexagons, we see 80-100 percent of customers churning, but these are hexagons with 10 or fewer customers in total.
Let's build a scatter plot where the x-axis is the number of customers in a hexagon and the y-axis is the churn rate.
Churn rates far from the overall rate of roughly 25% appear only in hexagons with a small number of customers. Because these hexagons combine a small number of customers with a churn rate >= 50%, their high rates are more likely an effect of small sample size than a sign of zones with abnormally high churn, so we do not see any geography where our metric behaves differently.
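The per-hexagon numbers behind this scatter plot can be sketched with a plain pandas aggregation. The sketch assumes each customer already carries a hexagon identifier and a 0/1 churn flag. Both `hex_id` and `churned` are hypothetical column names, and the data is made up.

```python
import pandas as pd

# Hypothetical columns: "hex_id" is the hexagon a customer falls into,
# "churned" is 1 for churned customers and 0 otherwise. Values are made up.
df = pd.DataFrame({
    "hex_id": ["A"] * 20 + ["B"] * 4 + ["C"] * 2,
    "churned": [1] * 5 + [0] * 15 + [1] * 3 + [0] * 1 + [1] * 2,
})

# Number of customers and churn rate per hexagon.
per_hex = (
    df.groupby("hex_id")["churned"]
    .agg(customers="count", churn_rate="mean")
    .reset_index()
)

# Treat hexagons with 10 or fewer customers as too small to trust.
SMALL_HEXAGON_MAX = 10
per_hex["small_sample"] = per_hex["customers"] <= SMALL_HEXAGON_MAX

# High churn rates that come only from small hexagons.
high_churn_small = per_hex[(per_hex["churn_rate"] >= 0.5) & per_hex["small_sample"]]
print(per_hex)
print(high_churn_small)
```

A table like `per_hex` can then be passed to `px.scatter` with `x="customers"` and `y="churn_rate"` to draw the scatter plot described above.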
Customer's lifetime in the service
To determine how many months the churned clients used our service, and whether there is a point at which the largest number of customers stop using it, we will create a histogram.
We will group the data by churn label and check the quantiles of the tenure months.
Churn Label
No
0.50 38.0
0.75 61.0
0.90 71.0
0.95 72.0
Yes
0.50 10.0
0.75 29.0
0.90 51.0
0.95 60.0
Name: Tenure Months, dtype: float64
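Output in this shape is what a pandas group-by followed by `quantile` produces. The sketch below runs on made-up data, so its numbers differ from the ones above.

```python
import pandas as pd

# Made-up data; column names match the output shown above.
df = pd.DataFrame({
    "Churn Label": ["No"] * 5 + ["Yes"] * 5,
    "Tenure Months": [10, 30, 40, 60, 72, 1, 2, 10, 20, 50],
})

# Quantiles of tenure within each churn class.
tenure_quantiles = (
    df.groupby("Churn Label")["Tenure Months"]
    .quantile([0.50, 0.75, 0.90, 0.95])
)
print(tenure_quantiles)
```

The 0.50 row is the median tenure of each group, which is the number used in the interpretation that follows.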
50% of the customers who left the service did so within the first 10 months. The number of churned clients stops declining sharply after 5 months.
Conclusion
EDA is the key to understanding and representing data better, which helps you build a more powerful and more generalized model. Data visualization is an easy way to perform EDA, and it also makes it easier to help others understand what we are doing.