Shubham Singh

Posted on May 13, 2022 • Edited on May 15, 2022

Understanding Data For Data Analytics, Data Science, and Machine Learning – Part-2

#datascience #statics #r #machinelearning

Things to know beforehand

What is Variability?

Variability describes how spread out the data is.

[1] Central Tendency

[2] Median

When your data is heavily influenced by outliers, the median is a good choice because it is not affected by them.
To calculate the median, sort your data (ascending or descending does not matter) and then find the middle point.

The middle point depends on whether n (the number of values) is even or odd

[a] when n is even

When n is even, there are 2 middle positions, and the median is the average of the values at those two positions

First~point = \frac{n}{2}

Second~point = \frac{n}{2}+1

[b] when n is odd

For odd n there is just one middle position

Mid~point = \frac{n+1}{2}

In R, both cases are handled by the same function

median()
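To make the two cases concrete, here is a minimal sketch that finds the median by hand and compares it with median(). The vectors odd_x and even_x are made-up illustration data, not from the original post.

```r
# Hypothetical example vectors (assumed for illustration)
odd_x  <- c(7, 1, 5, 3, 9)       # n = 5
even_x <- c(7, 1, 5, 3, 9, 11)   # n = 6

# Odd n: the value at position (n + 1) / 2 of the sorted data
sorted_odd <- sort(odd_x)
manual_odd <- sorted_odd[(length(odd_x) + 1) / 2]

# Even n: the average of the values at positions n / 2 and n / 2 + 1
sorted_even <- sort(even_x)
n <- length(even_x)
manual_even <- mean(sorted_even[c(n / 2, n / 2 + 1)])

manual_odd                      # 5
manual_even                     # 6
median(odd_x) == manual_odd     # TRUE
median(even_x) == manual_even   # TRUE
```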

[3] Mode

The mode is the value that occurs most often in the data.
Calculating the mode by hand means counting the occurrences of each value in the data.

R doesn't have a built-in function for the statistical mode (the built-in mode() reports how an object is stored, not its most frequent value), so we can use this function

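# Named get_mode so it does not mask base R's mode()
get_mode <- function(v) {
   uniqv <- unique(v)
   # count each unique value and return the most frequent one
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
get_mode(data)  # data: a vector of values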
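The counting step can also be done with base R's table(). The sketch below is one possible alternative, not the function above, and the vector x is made-up illustration data. Unlike which.max(), which keeps only the first maximum, this version would return every tied value.

```r
# Hypothetical sample (assumed for illustration); 4 occurs three times
x <- c(2, 4, 4, 7, 4, 2, 9)

# Frequency of each distinct value
counts <- table(x)
counts

# Keep every value whose count equals the highest count
modes <- as.numeric(names(counts)[which(counts == max(counts))])
modes   # 4
```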

[2] Measures of Spread

Understanding the spread of your data is essential to understanding the data itself. Two data sets can have the same mean but very different spreads, and ignoring that difference can lead to low-quality estimates.
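As a small illustration, the two vectors below (made up for this sketch) share a mean of 50 but differ widely in spread:

```r
# Hypothetical data sets (assumed for illustration)
a <- c(48, 49, 50, 51, 52)
b <- c(10, 30, 50, 70, 90)

mean(a)          # 50
mean(b)          # 50

# Same center, very different spread
diff(range(a))   # 4
diff(range(b))   # 80
sd(b) > sd(a)    # TRUE
```

The mean alone cannot tell a and b apart, which is why a measure of spread is reported alongside it.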

[1] Range

It is one of the simplest measures of variability. To calculate the range:

Range = max~value - min~value

diff(range(data))  # data: a numeric vector
# or
print(max(data) - min(data))

[2] Interquartile Range (IQR) and Box and Whiskers Plot

Quartiles divide the sorted data into 4 equal parts, each containing 25% of the data, i.e.,
the 1st quartile is the value below which 25% of the data lies (25th percentile); the 2nd quartile is the 50th percentile (the median); the 3rd quartile is the 75th percentile; and the 4th quartile is the 100th percentile (the maximum).

IQR = 75th~percentile - 25th~percentile
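In R the quartiles and the IQR can be sketched with quantile() and IQR(). The choice of iris$Sepal.Length as input is an assumption made for illustration. R supports several quantile definitions through the type argument, so other tools may give slightly different quartiles.

```r
# Assumed input: one numeric column of the built-in iris data set
x <- iris$Sepal.Length

# 25th, 50th and 75th percentiles (default quantile type)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# IQR by hand: 75th percentile minus 25th percentile
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_manual

# Built-in equivalent
IQR(x)
```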

The Box and Whiskers Plot is very useful for the five-point summary and for understanding spread and outliers

library(ggplot2)
data <- iris

# One vertical box per species, showing the distribution of Sepal.Length
ggplot(data) + geom_boxplot(
  mapping = aes(
    x = Species,
    y = Sepal.Length
  )
)

The five-point summary in a box plot includes the minimum value, the first quartile, the median, the third quartile, and the maximum value.

Each of these can be located in the plot as follows.

Minimum value : start of the vertical line (the lowest point that is not an outlier).
First Quartile : bottom edge of the box.
Median : bold horizontal line inside the box.
Third Quartile : top edge of the box.
Maximum value : end of the vertical line (the highest point that is not an outlier).
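The same five values can be computed numerically. This is a minimal sketch using quantile(), whose default probabilities are 0%, 25%, 50%, 75% and 100%. Splitting by species is an assumption chosen to mirror the plot.

```r
# Sepal.Length values grouped by species
by_species <- split(iris$Sepal.Length, iris$Species)

# One column per species, one row per summary point
five_point <- sapply(by_species, quantile)
five_point
```

Note that the 0% and 100% rows are the true minimum and maximum, while the whiskers in a box plot stop at the most extreme point that is not flagged as an outlier, so the two can differ.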

And if you are wondering what the isolated point below the virginica box is, it is an outlier.

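To compute outliers mathematically, you need a threshold: any point that falls beyond the threshold is considered an outlier. The usual thresholds lie 1.5 times the IQR below the 25th percentile and above the 75th percentile.

lower~threshold = 25th~percentile - 1.5*IQR

upper~threshold = 75th~percentile + 1.5*IQR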
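A minimal sketch of this rule in R, under the common convention that the 1.5*IQR distance is measured below the 25th percentile and above the 75th percentile. The variable names are hypothetical, and the virginica Sepal.Length values are used only because that is the group with a visible outlier in the plot.

```r
# Assumed input: Sepal.Length of the virginica species
x <- iris$Sepal.Length[iris$Species == "virginica"]

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```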

[3] Variance

Variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.

S^2 = \frac{\Sigma{}(x_{i}-\overline{x})^2}{n-1}

Why are the deviations squared? Because the sum of x_{i}-\overline{x} is always zero (positive and negative deviations cancel out), so each deviation is squared to make it non-negative.

var(data)  # data: a numeric vector, e.g. iris$Sepal.Length
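To see the formula at work, here is a small sketch that computes the sample variance step by step and checks it against var(). The vector x is made-up illustration data.

```r
# Hypothetical sample (assumed for illustration); its mean is 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Deviations from the mean cancel out
deviations <- x - mean(x)
sum(deviations)            # 0

# Squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (length(x) - 1)
manual_var                 # 32 / 7, about 4.571

all.equal(manual_var, var(x))   # TRUE
```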

[4] Standard Deviation

The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
It is closely related to the variance; the difference is that the SD is in the same unit as the data, whereas the variance is in squared units.

SD = S = \sqrt{S^2}

sd(data)  # data: a numeric vector, e.g. iris$Sepal.Length
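The point about units can be checked with a short sketch. It assumes iris$Sepal.Length (recorded in centimetres) as input and converts it to millimetres: the standard deviation scales by 10, like the data, while the variance scales by 100.

```r
# Assumed input: Sepal.Length in cm, converted to mm for comparison
x_cm <- iris$Sepal.Length
x_mm <- x_cm * 10

# SD is the square root of the variance
all.equal(sd(x_cm), sqrt(var(x_cm)))      # TRUE

# SD follows the unit of the data; variance follows the unit squared
all.equal(sd(x_mm), 10 * sd(x_cm))        # TRUE
all.equal(var(x_mm), 100 * var(x_cm))     # TRUE
```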

Normal distributions with standard deviations of 5 and 10.

For Part-3 go here
