This blog is part of the MSP Developer Stories initiative by the Microsoft Student Partners (India) Program (https://studentpartners.microsoft.com/), which aims to help student communities Learn, Lead and Empower.
The goal of this post is to emphasize the role of Exploratory Data Analysis in solving business problems with Machine Learning and Artificial Intelligence, using a detailed case study walkthrough.
A 360° data mindset
In this information-driven age, we need a 360° view of the extraordinary volume of data now available (historic, current and predictive) so that the right data can be extracted to make better business decisions.
Exploratory Data Analysis (EDA) is an observational approach to understanding the characteristics of the data. EDA is essential for a well-defined and structured data science project, and it should be performed before any machine learning modelling phase. It helps in identifying patterns and developing hypotheses.
Case Study : A medium-sized bikes & cycling accessories manufacturing consultancy is keen on growing its business. We’ll help them analyze their customer and transaction data to optimize their marketing strategy.
Preliminary Data Exploration – Identify ways to improve the quality of data
Environment and Code Readiness
- Create a Jupyter Notebook hosted on Azure
- Import the pandas package to read and write Excel data
- Import matplotlib & seaborn for data visualization
- Upload the customer data to the Azure Notebook path
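As a minimal sketch of the setup steps above, the snippet below wraps `pd.read_excel` in a small loader and takes a first look at the columns. The helper names `load_sheet` and `first_look`, the workbook name in the comment, and the stand-in rows are all assumptions made for illustration; the original notebook's file and column names are not given in this post.

```python
import pandas as pd


def load_sheet(path, sheet_name=0):
    """Read one worksheet of an Excel workbook into a DataFrame."""
    return pd.read_excel(path, sheet_name=sheet_name)


def first_look(df):
    """Return per-column dtype, non-null count and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "unique": df.nunique(),
    })


# In the notebook this would be something like (hypothetical file name):
# customers = load_sheet("customer_data.xlsx")

# Stand-in rows so the sketch runs without the workbook (illustrative only).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["F", "Male", None],
})

overview = first_look(customers)
print(overview)
```

A compact overview like this is often enough to decide which columns need a closer data quality check.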
Let’s organize the analysis below by data quality dimension and collect the findings in a table
Identify Missing Values
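One way to identify missing values is to count nulls per column and express them as a percentage of rows. This is a sketch on made-up rows; the column names are assumptions, not the actual schema of the case study data.

```python
import pandas as pd

# Illustrative stand-in for the customer data (hypothetical columns and values).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "last_name": ["Ray", None, "Khan", None],
    "DOB": ["1985-04-02", "1990-11-17", None, "1978-01-30"],
})

# Count of missing values per column, and the same as a percentage of rows.
missing = customers.isnull().sum()
missing_pct = (customers.isnull().mean() * 100).round(1)

report = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(report)
```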
Drop columns that have no relevance to the analysis
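Dropping an irrelevant column could look like the sketch below. The column name `unused_col` is hypothetical; which column to drop depends on the actual data set.

```python
import pandas as pd

# Illustrative rows; "unused_col" stands in for a column with no analytical value.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "gender": ["Female", "Male"],
    "unused_col": ["x#1", "??"],
})

# drop() returns a new DataFrame and leaves the original untouched.
cleaned = customers.drop(columns=["unused_col"])
print(cleaned.columns.tolist())
```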
Gender values should be consistent: either Male or Female
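A possible way to make the gender values consistent is to map known variants onto the two target labels and flag whatever is left for review. The variant spellings below are assumptions chosen for illustration, not values confirmed from the data set.

```python
import pandas as pd

# Hypothetical raw values showing the kind of inconsistency to look for.
gender = pd.Series(["F", "Female", "Femal", "M", "Male", "U"])

# Map known variants onto the two target labels; other values pass through.
mapping = {"F": "Female", "Femal": "Female", "M": "Male"}
normalized = gender.replace(mapping)

# Anything still outside the allowed set needs a manual decision.
unexpected = normalized[~normalized.isin(["Male", "Female"])]

print(normalized.value_counts())
print(unexpected.tolist())
```

Flagging the leftovers instead of silently forcing them into a category keeps the cleaning step auditable.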
Check the validity of the Transactions data : the product first sold date is stored as a float and should be converted to a datetime format
Repeat the same checks for the other data sets
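The conversion can be sketched as follows, under the assumption that the float is an Excel serial day number (days counted from 1899-12-30), which is how dates commonly appear when an Excel date column is read as numeric. The column name `product_first_sold_date` and the sample values are illustrative.

```python
import pandas as pd

# Illustrative values: Excel serial day numbers stored as floats (assumption).
transactions = pd.DataFrame({
    "product_first_sold_date": [36526.0, 36527.0, 41245.0],
})

# Interpret each float as a number of days since the Excel epoch.
transactions["product_first_sold_date"] = pd.to_datetime(
    transactions["product_first_sold_date"],
    unit="D",
    origin="1899-12-30",
)

print(transactions.dtypes)
print(transactions.head())
```

If the floats turn out to encode something else (for example Unix timestamps), the `unit` and `origin` arguments would need to change accordingly.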
Here is the Data Quality Analysis Summary
Data Exploration, Model Development and Interpretation : Understanding the data distributions, feature engineering, data transformations, modelling, results interpretation and reporting.
Customer Age & Gender Distribution : Female customers outnumber Male customers; the recommended new customers to target are between 30 and 60 years old
Calculate each customer's age from the date of birth to plot the graph
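A minimal sketch of the age calculation, assuming a `DOB` column of date strings and a fixed reference date so the result is reproducible. The helper `age_in_years`, the sample dates and the age bands are illustrative choices, not taken from the original notebook.

```python
import pandas as pd


def age_in_years(dob, as_of):
    """Completed years between each date of birth and the reference date."""
    dob = pd.to_datetime(dob)
    as_of = pd.Timestamp(as_of)
    # True where this year's birthday has already happened by the reference date.
    had_birthday = (dob.dt.month < as_of.month) | (
        (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
    )
    return as_of.year - dob.dt.year - (~had_birthday).astype(int)


customers = pd.DataFrame({"DOB": ["1990-06-15", "1985-01-01", "2000-12-31"]})
customers["age"] = age_in_years(customers["DOB"], as_of="2020-06-14")

# Group ages into bands for plotting (band edges are an assumption).
customers["age_band"] = pd.cut(
    customers["age"],
    bins=[0, 30, 60, 120],
    right=False,
    labels=["<30", "30-59", "60+"],
)

print(customers)
```

The banded column can then be passed to a bar or count plot to show the age distribution.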
Mass Customers make up the largest Wealth Segment
New customers come mainly from the Manufacturing & Finance industries
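Segment comparisons like this one usually come down to counting category values. The sketch below assumes a `wealth_segment` column; the segment labels other than Mass Customer and all counts are made up for illustration.

```python
import pandas as pd

# Illustrative rows (hypothetical labels and counts).
customers = pd.DataFrame({
    "wealth_segment": [
        "Mass Customer", "Affluent Customer", "Mass Customer",
        "High Net Worth", "Mass Customer",
    ],
})

# Absolute counts per segment, sorted from largest to smallest.
counts = customers["wealth_segment"].value_counts()

# The same breakdown as a percentage of all customers.
share = customers["wealth_segment"].value_counts(normalize=True).mul(100).round(1)

print(counts)
print(share)
```

In a notebook with matplotlib installed, `counts.plot(kind="bar")` gives a quick chart of the same breakdown, and the same pattern applies to the industry column.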
Customer cars owned data
Visualizations & Interactive Dashboard : These help highlight key findings and convey ideas more succinctly. The dashboards below were built in Power BI Desktop. A walkthrough of building the dashboards in Power BI is out of scope for this blog.
Conclusion : Exploratory Data Analysis is a key process in Machine Learning / Data Science projects. The main pillars of EDA are data cleaning, data preparation, data exploration, and data visualization.