Afroza Nowshin
Learning Statistics with R, Part-2

(Figure: a bell curve of the normal distribution, with markers at ±1 and ±2 standard deviations from the mean)

This figure shows a symmetrical distribution, where the mean and median are almost the same. Remember, for our distribution the mean is 23.6 and the median is 24; almost equal.

(Figure: density plot of the temperature data)

The datasets that we deal with in real projects may not have a nearly equal mean and median, which leads to "skewness" in the distribution.

(Figure: positively and negatively skewed distributions)

  1. When mean > median, a few extreme values are pulling the mean upward, while most of the data points are concentrated at the lower end. This is a positively skewed distribution: the distribution is skewed to the right, with a long tail extending to the right.

  2. When mean < median, a few extremely low values are pulling the mean downward, while most of the data points are concentrated at the higher end. This is a negatively skewed distribution: the distribution is skewed to the left, with a long tail extending to the left.

To measure the skewness of a distribution, you can use the 'moments' package, which provides the skewness() function:

# Sample temperature dataset (°C)
temperatures <- c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)

install.packages("moments")
library(moments)

# Calculate skewness (avoid shadowing the skewness() function name)
skew_temp <- skewness(temperatures)
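If you are curious about what skewness() computes, here is a minimal hand-rolled sketch of the moment-based formula (manual_skewness is just an illustrative name, not part of the 'moments' package):

# Moment-based skewness: mean of cubed deviations divided by
# (mean of squared deviations) raised to the power 3/2
manual_skewness <- function(x) {
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^(3/2)
}

manual_skewness(temperatures)

A value close to zero means the distribution is roughly symmetric, a clearly positive value means it is skewed to the right, and a clearly negative value means it is skewed to the left.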

In the previous post, we learned about the three quartiles that divide the entire dataset into 4 equal chunks. There are also properties that tell us how the data points spread around the mean (center) of a dataset. In the picture at the beginning of this post, you can see lines labelled 1, -1, 2 and -2. We will get to these lines shortly. Before that, we need to understand how each data point 'deviates' from the center, or the mean.

In our dataset:

20, 21, 22, 23, 24, 24, 24, 25, 26, 27

If we subtract the mean 23.6 from each value, we get:

-3.6, -2.6, -1.6, -0.6, 0.4, 0.4, 0.4, 1.4, 2.4, 3.4

We need the average of these values, but the negatives and positives would cancel each other out, so we square each of them first:

12.96, 6.76, 2.56, 0.36, 0.16, 0.16, 0.16, 1.96, 5.76, 11.56

Now we will take the sum of these values: 42.4

Since we are using a sample temperature dataset, we divide this sum by the number of observations minus one: 10 - 1 = 9.

The result is the variance of the dataset:

Variance = ∑ (x - μ)² / (n - 1)

= 42.4 / 9 ≈ 4.71

Here x is an individual data point and μ (mu) is the mean. Variance tells you how much the data points vary around the mean. Since it is a squared quantity, its unit is also squared, which makes it less convenient to interpret. The square root of the variance is called the Standard Deviation (SD) and is heavily used to understand how far each point deviates from the mean:

SD = √(Variance)
SD = √(4.71)
SD ≈ 2.17
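If you want to reproduce this arithmetic step by step in R (the variable names below are just illustrative), it looks like this:

# Step-by-step sample variance and standard deviation,
# mirroring the hand calculation above
deviations <- temperatures - mean(temperatures)     # -3.6, -2.6, ..., 3.4
squared_deviations <- deviations^2                   # 12.96, 6.76, ..., 11.56
sample_variance <- sum(squared_deviations) / (length(temperatures) - 1)  # 42.4 / 9
sample_sd <- sqrt(sample_variance)                   # about 2.17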

In the image of the normal distribution, +1 and -1 represent one standard deviation above and below the mean, respectively. Approximately 68% of the data points fall within one standard deviation of the mean. Similarly, +2 and -2 represent two standard deviations above and below the mean, and approximately 95% of the data points fall within two standard deviations of the mean.

In R, you can use the functions var() and sd() for calculating these values:

# Calculate Standard Deviation
sd_temp <- sd(temperatures)

# Calculate Variance
variance_temp <- var(temperatures)

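As a quick sanity check (just a sketch; with only 10 observations the percentages will be rough), you can count what fraction of the temperatures fall within one and two standard deviations of the mean:

# Proportion of observations within 1 and 2 standard deviations of the mean
within_1sd <- mean(abs(temperatures - mean(temperatures)) <= 1 * sd_temp)
within_2sd <- mean(abs(temperatures - mean(temperatures)) <= 2 * sd_temp)

within_1sd  # roughly 60% for this tiny sample
within_2sd  # 100% for this tiny sample

With a larger, roughly normal dataset these proportions would move toward the 68% and 95% figures mentioned above.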

Finally, the summary() function in R gives you most of the statistical properties we have discussed so far (minimum, first quartile, median, mean, third quartile and maximum) in one call!

summary_stats <- summary(temperatures)
print(summary_stats)

The output will look like the following (I have formatted it as a table for legibility):

| Min.  | 1st Qu. | Median | Mean  | 3rd Qu. | Max.  |
| ----- | ------- | ------ | ----- | ------- | ----- |
| 20.00 | 22.25   | 24.00  | 23.60 | 24.75   | 27.00 |
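Note that summary() does not report the variance, standard deviation, or skewness. If you want the standard deviation alongside the five-number summary and the mean, a small sketch like this works (the label SD is just a name I chose):

# Append the standard deviation to the built-in summary
c(summary(temperatures), SD = sd(temperatures))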

I wanted to teach all the basics with the logic behind them so that you can understand what the summary() function does for you 👀

To conclude this post: based on our mean = 23.6 and median = 24, both below 25°C, on May 8 you can go for silk or linen clothes if you are still wondering about this question based on the dataset 😁 For more informed decisions, you would need historical data, forecasts and trend analysis.
