DEV Community

Mary Sinaida Omukami

Exploratory Data Analysis using Data Visualization Techniques.

What is exploratory data analysis?

It is an approach to analyzing data sets that summarizes their main characteristics using statistical graphics and other data visualization methods. This involves inspecting the dataset from many angles, then describing and summarizing it without making any assumptions about its contents. It's an important step to take before diving into statistical modelling and machine learning.

Raw data can contain outliers, missing values and skewed distributions. If these are not dealt with during the preparation stage, even the right model can produce inaccurate results, because it was built on inaccurate data.
Exploratory data analysis has four steps, as outlined below:

  1. Data Cleaning: Handling missing values, removing outliers, and ensuring data quality. When you get data in its raw form, inspect it to see whether it is actually clean. Human error during the collection process often introduces quality issues that need to be addressed. You can check whether a dataset has outliers by drawing box plots, and one common way to remove them is the z-score method. Missing values can be found with isnull(). A value that cannot be guessed, such as a missing name, may mean deleting the whole row. For a column holding a continuous numeric variable, you can fill the gaps with the mean, median or mode of the column.

  2. Data Exploration: Examining summary statistics, visualizing data distributions, and identifying patterns. The df.describe() function returns the descriptive statistics for a dataset. Univariate plots show the distribution of one variable, while bivariate and multivariate plots compare two or more variables. Together, the visualizations and summary statistics help you identify patterns in the dataset.

  3. Feature Engineering: Transforming variables, creating new features, or selecting relevant variables for analysis. This includes converting variables to the correct data type and creating features that will be important to the model.

  4. Data Visualization: Presenting insights through plots, charts, and graphs to communicate findings effectively. Keep it simple, choose the right visuals, provide context, and make it actionable.
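The first three steps can be sketched in a few lines of pandas. This is only a minimal illustration, not a complete workflow: the toy dataset, the column names, the z-score cutoff of 3 and the age bands are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset (not from a real source): 15 people, with one
# missing name, one missing age, and one income value that is clearly off.
df = pd.DataFrame({
    "name": ["Ann", "Ben", None, "Dee", "Eli", "Fay", "Gus", "Hal",
             "Ivy", "Jon", "Kim", "Lee", "Max", "Nia", "Oli"],
    "age": [23, 25, 24, np.nan, 26, 22, 24, 25, 23, 24, 26, 22, 25, 24, 23],
    "income": [40, 42, 41, 39, 43, 40, 41, 42, 40, 41, 39, 43, 41, 40, 400],
})

# Step 1a: count the missing values in each column.
missing_counts = df.isnull().sum()

# Step 1b: a missing name cannot be guessed, so drop that row;
# a missing numeric value can be filled with the column median.
cleaned = df.dropna(subset=["name"]).copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# Step 1c: flag outliers with the z-score method. A cutoff of 3 is a common
# convention, not a fixed rule; pick one that suits your data.
z_scores = (cleaned["income"] - cleaned["income"].mean()) / cleaned["income"].std()
outliers = cleaned[z_scores.abs() > 3]
cleaned = cleaned[z_scores.abs() <= 3].copy()

# Step 2: descriptive statistics for the numeric columns.
summary = cleaned.describe()

# Step 3: fix a data type and derive a new feature (the bands are arbitrary).
cleaned["age"] = cleaned["age"].astype(int)
cleaned["age_group"] = pd.cut(cleaned["age"], bins=[0, 23, 30],
                              labels=["<=23", "24-30"])

print(missing_counts)
print(outliers)
print(summary)
```

Note that the z-score method needs a reasonable number of rows to work: with very small samples, no single value can reach a z-score of 3, however extreme it looks.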

Below are examples of tools that are useful for EDA:
1. Box plot
2. Histogram
3. Multi-vari chart
4. Run chart
5. Scatter plot (2D/3D)
6. Stem-and-leaf plot
7. Odds ratio
8. Heat map
9. Bar chart
10. Horizon graph
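As a rough sketch, four of these tools (box plot, histogram, scatter plot and heat map) can be drawn with matplotlib as shown here. The data is synthetic and generated only for illustration, and the column names and output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical synthetic data: weight is built to depend on height.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)
data = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height - 90 + rng.normal(0, 5, 200),
    "age": rng.integers(18, 65, 200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Box plot (univariate): spread of one variable; points beyond the
# whiskers are drawn individually as candidate outliers.
axes[0, 0].boxplot(data["height"])
axes[0, 0].set_title("Box plot of height")

# Histogram (univariate): shape of the distribution.
axes[0, 1].hist(data["weight"], bins=20)
axes[0, 1].set_title("Histogram of weight")

# Scatter plot (bivariate): relationship between two variables.
axes[1, 0].scatter(data["height"], data["weight"], s=10)
axes[1, 0].set_xlabel("height")
axes[1, 0].set_ylabel("weight")
axes[1, 0].set_title("Scatter plot: height vs weight")

# Heat map (multivariate): pairwise correlations between all columns.
corr = data.corr()
image = axes[1, 1].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Heat map of correlations")
fig.colorbar(image, ax=axes[1, 1])

fig.tight_layout()
fig.savefig("eda_plots.png")
```

In a notebook you would normally drop the `matplotlib.use("Agg")` line and let the figure display inline instead of saving it to a file.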

In summary, exploratory data analysis is an important stage in the data analysis process, because it can determine whether you get accurate final results from your analysis.
