Briankimany

Posted on Oct 11, 2023

Data Analysis using Data Visualization Techniques.

There are several ways to analyse data, and one of them is the use of data visualization techniques.
This involves creating visual representations of the data and extracting insights from them.

Visual representations of data such as pie charts, line plots and graphs make data easy to understand, especially for clients, stakeholders and management, or for personal analysis of the data.

This can be achieved using ready-made software or a programming language like Python.

Some common software tools

1: Tableau
2: Infogram
3: Datawrapper
4: ChartBlocks

Some commonly used programming packages are

1: matplotlib
2: plotly
3: seaborn
4: Altair
5: Pygal
6: Geoplotlib
7: Folium
8: Gleam

Commonly used techniques

1: Pie charts
2: Histograms
3: Bar charts
4: Heat maps
5: Box plots
6: Area graphs
7: Correlation matrices

1: Pie charts

Pie charts are circular graphs divided into segments, each representing a group within the data. Each segment can be colored differently. They show how individual groups compare to the whole, provided the data is not too complex.

<data frame>.plot(kind = 'pie', y = '<column name>')
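As a rough sketch of the call above, the snippet below plots a pie chart from a small pandas Series. The category names and values are made up for illustration and are not from the original data.

```python
import pandas as pd

# Hypothetical data: units sold per product category (illustrative only).
sales = pd.Series(
    [40, 30, 20, 10],
    index=["Phones", "Laptops", "Tablets", "Accessories"],
    name="sales",
)

# Each wedge is one category; autopct prints its share of the whole.
ax = sales.plot(kind="pie", autopct="%1.0f%%", figsize=(5, 5))
ax.set_ylabel("")  # hide the default y-label (the Series name)

# Call plt.show() from matplotlib.pyplot, or ax.get_figure().savefig(...),
# to view the chart.
```

With only four groups the comparison to the whole is easy to read; with many small groups a bar chart is usually clearer.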

2: Histograms
Histograms illustrate the frequency distribution of continuous data. The x-axis contains the intervals (bins) the data is grouped into, while the y-axis indicates the frequency of each interval.
They can be used to identify the shape of the data (whether it is skewed or multi-modal), spot outliers, understand central tendency, etc.

<data frame>['mass'].plot(kind = 'hist', figsize = (5, 5))
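A minimal sketch of a histogram in pandas is shown below. The 'mass' column is filled with randomly generated values, and the choice of 20 bins is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous data: 500 random "mass" values (illustrative only).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"mass": rng.normal(loc=70, scale=10, size=500)})

# The x-axis is split into 20 bins; bar heights are the counts per bin.
ax = df["mass"].plot(kind="hist", bins=20, edgecolor="black", figsize=(5, 5))
ax.set_xlabel("mass")
ax.set_ylabel("frequency")
```

Changing the number of bins changes how much detail of the distribution's shape is visible, so it is worth trying a few values.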

3: Bar charts

Bar charts are plots in which the length of each bar represents the magnitude of a variable.
They can represent both categorical data and numerical data that has been grouped into intervals.
When plotting bars in Python, there are two types of bars one can produce.

1: Vertical bars
This is done by passing "bar" to the kind keyword argument of the .plot() method.

<data frame>.plot(kind = 'bar' , figsize =(width, height))

2: Horizontal bars
The only difference is the orientation of the bars. Here the y-axis is used for labeling while the x-axis represents the size of each bar.
This type of bar provides more room for labeling.

<data frame>.plot(kind = 'barh', color = '<color>')

NB

The keyword used is 'barh'.
For more information on ways to customize the graph, you can visit the pandas DataFrame.plot documentation.
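The two bar orientations can be sketched as follows. The region names, the values and the color are hypothetical and only serve to make the example runnable.

```python
import pandas as pd

# Hypothetical categorical data: units sold per region (illustrative only).
df = pd.DataFrame(
    {"units_sold": [120, 85, 150, 60]},
    index=["North", "South", "East", "West"],
)

# Vertical bars: categories on the x-axis, magnitude on the y-axis.
ax_vertical = df.plot(kind="bar", figsize=(6, 4), legend=False)
ax_vertical.set_ylabel("units sold")

# Horizontal bars: categories on the y-axis, magnitude on the x-axis.
ax_horizontal = df.plot(kind="barh", color="teal", figsize=(6, 4), legend=False)
ax_horizontal.set_xlabel("units sold")
```

Each call to .plot() without an ax argument draws on a new figure, so the two charts do not overwrite each other.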

4: Heat maps

This is a type of visualization that uses color to show how a third variable varies across a 2-dimensional grid, for example across months and times of day. Heat maps communicate easily and make comparisons between groups simple. In the diagram below, for example, one can easily notice that the January morning and afternoon average temperatures are the lowest in the whole data set. This is indicated by the fading colors.

Source data plus notebooks

Heat maps can be divided into

Grid heat maps
These are mostly used with multivariate data and highlight the relationships among the features.

Spatial heat maps
These are mostly used with geographical datasets.
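A grid heat map can be sketched with matplotlib's imshow, as below. The temperature values are invented for illustration and are not the data behind the diagram referenced above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical average temperatures by month and time of day (illustrative only).
temps = pd.DataFrame(
    {
        "Jan": [2, 7, 4],
        "Apr": [9, 16, 11],
        "Jul": [17, 27, 21],
        "Oct": [10, 17, 12],
    },
    index=["Morning", "Afternoon", "Evening"],
)

# Draw the grid: each cell's color encodes the third variable (temperature).
fig, ax = plt.subplots(figsize=(6, 3))
image = ax.imshow(temps.to_numpy(), cmap="viridis", aspect="auto")
ax.set_xticks(list(range(len(temps.columns))))
ax.set_xticklabels(list(temps.columns))
ax.set_yticks(list(range(len(temps.index))))
ax.set_yticklabels(list(temps.index))
fig.colorbar(image, ax=ax, label="average temperature")

# The same conclusion the colors give, checked numerically.
coldest_month = temps.min().idxmin()
coldest_time = temps[coldest_month].idxmin()
```

The seaborn package listed earlier offers a heatmap function that handles the axis labels automatically; plain matplotlib is used here only to keep the sketch dependency-light.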

5: Box plots

Box plots communicate information about the data by use of quartiles. They communicate different aspects of the data like

Outliers
Minimum and maximum values
Median
Interquartile range
Q1, Q2 (median), Q3

Here is a sample image

1: Outliers are data points that are extremely high or low relative to the rest of the data. They are represented by the green data points in the figure above.

2: Minimum and maximum are the lowest and highest data points respectively in a sorted data set.

3: Median is the middle point of a sorted data set. It is also called Q2 or the 50th percentile.

4: Q1, Q2 and Q3 are the quartiles: the three points that split a sorted data set into four equal parts.

Q1 is the first quartile. It is the value halfway between the minimum value and the median.

Q2 is the second quartile, or the median.

Q3 is the third quartile, i.e. the value halfway between the median and the maximum value.

5: Interquartile range, abbreviated as IQR, is the range from Q1 to Q3 (Q3 - Q1).

<your data frame>.plot(kind = 'box')
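The quantities a box plot summarizes can also be computed directly, as in the sketch below. The scores are made-up values, and flagging outliers beyond 1.5 times the IQR from the quartiles is an assumption here: it is a common convention and matplotlib's default whisker setting, not something stated in the text above.

```python
import pandas as pd

# Hypothetical sorted data with one unusually high value (illustrative only).
scores = pd.Series([12, 15, 17, 18, 20, 21, 23, 24, 26, 60], name="scores")

# Quartiles: the points that split the sorted data into four equal parts.
q1 = scores.quantile(0.25)
q2 = scores.quantile(0.50)  # the median
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

# The box plot draws the same quartiles, whiskers and outlier points.
ax = scores.plot(kind="box")
```

Here the value 60 falls above the upper fence, so it would be drawn as a separate outlier point rather than as the end of the whisker.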

6: Area charts

Area charts can be seen as combinations of line graphs and bar graphs. The only difference from a line graph is that the area under the line is shaded. They can be divided into two groups

Overlapping area charts
Stacked area charts

1: Overlapping charts
This is mostly used when comparing two features in a dataset. Data points are plotted on the same axis, with a common baseline of zero, but shaded differently for each group.
It's advisable for the data series to be of similar magnitude for a fairer comparison.

From the figure below we can conclude that the station is mainly busy during the morning and evening hours. This is derived from the observation that in the morning there is a sharp curve with a large area shaded brown. In the evening, from 1600 hrs, the area shaded blue increases again as people are leaving the town.

2: Stacked area
In these charts each group's data points are plotted one group at a time, with the baseline of each group inherited from the previous group's height. Unlike overlapping area charts, stacked charts can compare more than two features, which makes it easy to compare several features at once.
From the image we can easily compare the different average temperatures over the months, even though we might not be able to efficiently keep track of the changes in height for the groups in the middle.
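Both kinds of area chart can be sketched with the same pandas call by toggling the stacked argument. The hourly passenger counts below are hypothetical and are not the data behind the figures discussed above.

```python
import pandas as pd

# Hypothetical passenger counts at a station by hour (illustrative only).
traffic = pd.DataFrame(
    {
        "arrivals": [50, 400, 150, 100, 120, 90],
        "departures": [20, 80, 100, 120, 380, 150],
    },
    index=pd.Index([6, 8, 10, 14, 17, 20], name="hour"),
)

# Overlapping: every series starts from a baseline of zero and is drawn
# semi-transparent so that both shaded areas stay visible.
ax_overlap = traffic.plot(kind="area", stacked=False, alpha=0.5)

# Stacked: each series starts where the previous one ended, so the top
# edge of the chart is the total across all series.
ax_stacked = traffic.plot(kind="area", stacked=True)
stacked_top = traffic.sum(axis=1)
```

In the stacked version only the bottom series sits on a flat baseline, which is why changes in the middle groups are harder to follow.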

Below is an example from

7: Correlation matrices

This technique involves analysing the different columns in a dataset and establishing whether the columns set aside as features are independent of each other.
The output is a grid of correlation coefficients which indicate the strength and direction of the relationship between each pair of columns, i.e.

1 represents a perfect positive correlation
0 represents no linear correlation
-1 represents a perfect negative correlation between the two features.

The diagram below shows an example where the variable we are predicting is whether a customer is likely to churn or not, and the rest of the columns are selected to be features.

Source data plus notebooks

From the image, most of the columns are independent, even though age and IsActiveMember have a coefficient of 0.9 that may not greatly affect the performance of our model.
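A correlation matrix like the one described above can be sketched with pandas' corr method. The columns below are synthetic and deliberately constructed to show coefficients near 1, -1 and 0; they are not the churn dataset from the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic columns built to show the three reference cases (illustrative only).
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
df = pd.DataFrame(
    {
        "x": x,
        "double_x": 2 * x,               # moves exactly with x
        "negative_x": -x,                # moves exactly against x
        "noise": rng.normal(size=200),   # unrelated to x
    }
)

# Pairwise Pearson correlation coefficients between all columns.
corr = df.corr()

# Show the grid as a heat map, with the color scale fixed to [-1, 1].
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(list(range(len(corr.columns))))
ax.set_xticklabels(list(corr.columns), rotation=45, ha="right")
ax.set_yticks(list(range(len(corr.index))))
ax.set_yticklabels(list(corr.index))
fig.colorbar(image, ax=ax, label="correlation coefficient")
```

Feature pairs whose coefficient is close to 1 or -1 carry largely the same information, which is what to look for when checking whether features are independent.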

Conclusion.

In conclusion, data visualization is an essential step of the data analysis process. It allows us to extract insights from complex datasets, identify patterns, and communicate those insights effectively. By using techniques like pie charts, histograms, bar charts, heat maps, box plots, area graphs, and correlation matrices, we gain a deeper understanding of data. Moreover, the tools and software available, both proprietary and open-source, provide numerous ways to create compelling visualizations.

DEV Community