Central Limit Theorem: In Data Science?

The other day, while talking with a friend about all the skills required of a Data Scientist, we started to go more in depth about an important one that is usually overlooked in conversation because it is not as "exciting": statistics. So today I want to go over a statistical concept that is very important when working with data, the Central Limit Theorem.

What the Central Limit Theorem (CLT) tells us is that when independent random variables are added together, their normalized sum converges toward a normal distribution as more of them are added, even if the variables themselves are not normally distributed. So let's quickly go over these terms and then look at why this is useful for Data Scientists.
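For readers who like notation, here is one common textbook way to write the statement. The symbols are labels I am introducing for illustration: X_1, ..., X_n are independent random variables drawn from the same distribution, with mean \mu and finite variance \sigma^2, and \bar{X}_n is their average. This is a sketch of the classical version of the theorem, not the only form it takes.

```latex
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i,
\qquad
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
```

In words: subtract the true mean from the sample mean, divide by \sigma / \sqrt{n}, and the result behaves more and more like a standard normal variable as the sample size n grows.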

Independent Random Variables

Here, a random variable is a quantity whose value depends on a random outcome, such as the result of a coin toss or the age of a randomly chosen passenger. Independence means that the outcome of one variable does not have any effect on the outcome of another. For example, if you flipped a coin 100 times and then randomly selected the results of 10 of those tosses, you would have 10 independent random variables, because whether one toss lands heads or tails has no influence on any other toss.
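To make independence a little more concrete, here is a minimal sketch using only Python's standard library. It simulates fair coin flips (the seed and the number of flips are arbitrary choices of mine) and compares the overall share of heads with the share of heads on flips that come right after a head. If the flips are independent, the two numbers should be nearly the same.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulate 100,000 fair coin flips: 1 = heads, 0 = tails.
flips = [random.randint(0, 1) for _ in range(100_000)]

# Overall share of heads.
p_heads = sum(flips) / len(flips)

# Share of heads among the flips that immediately follow a head.
after_heads = [b for a, b in zip(flips, flips[1:]) if a == 1]
p_heads_after_heads = sum(after_heads) / len(after_heads)

print(f"P(heads) = {p_heads:.3f}")
print(f"P(heads | previous flip was heads) = {p_heads_after_heads:.3f}")
```

Knowing the previous flip tells you essentially nothing about the next one, which is what independence means in practice.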

Normal Distribution

If a dataset is normally distributed, most of its values sit at or around the mean, and the rest are spread symmetrically on both sides, with the probability falling off as you move away from the mean. Plotted, this gives the familiar bell curve.

Having our data distributed in this way gives us a lot of insight: because we know the probability of a value landing any given distance from the mean, we can spot outliers and carry out hypothesis tests.
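As a quick illustration of "most of the data sits near the mean", here is a small sketch that draws values from a normal distribution and counts how many fall within 1, 2, and 3 standard deviations of the mean. The seed, the sample size, and the helper name share_within are my own choices for the example. For a normal distribution the shares should come out close to 68%, 95%, and 99.7%.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Draw 100,000 values from a normal distribution (mean 0, standard deviation 1).
values = [random.gauss(0, 1) for _ in range(100_000)]

mu = statistics.fmean(values)
sigma = statistics.pstdev(values)

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(v - mu) <= k * sigma for v in values) / len(values)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.3f}")
```

A value more than 3 standard deviations from the mean is rare under this distribution, which is why that distance is often used as a rough outlier flag.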

In Data Science

In DS, hypothesis testing is a very important part of your job. You constantly have to ask yourself whether the data you have supports your idea or whether it only looks that way due to chance. The way this is done is to assume that our hypothesis is wrong (the null hypothesis) and check how likely it would be to see data like ours under that assumption. As long as our sample is large enough (a common rule of thumb is a sample size of at least 30), the CLT lets us treat the sample mean as approximately normally distributed, so we can see where our result sits under the null hypothesis and how strong the evidence is for rejecting it. Keep in mind that rejecting the null hypothesis supports our hypothesis rather than proving it. I apologize if this is confusing and will try to continue on hypothesis testing in another blog, but for now we will continue on to some visuals with Python to see the Central Limit Theorem in action.
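Here is a rough sketch of what such a check could look like, using a two-sided z-test built from the standard library. Everything in it is an assumption for illustration: the helper name z_test, the made-up sample of 40 values, and the null mean of 30. It leans on the CLT by treating the sample mean as approximately normal, so it is only reasonable for a decent sample size.

```python
import math
import random
import statistics

def z_test(sample, null_mean):
    """Two-sided z-test of a sample mean against null_mean (normal approximation)."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - null_mean) / standard_error
    # Two-sided p-value from the standard normal CDF, written with math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

random.seed(2)  # fixed seed so the sketch is repeatable
# Made-up sample: 40 values drawn from a distribution whose true mean is 32.
sample = [random.gauss(32, 10) for _ in range(40)]

z, p = z_test(sample, null_mean=30)
print(f"z = {z:.2f}, p-value = {p:.3f}")
```

A small p-value says that a sample mean this far from 30 would be unusual if the null hypothesis were true. In real work you would more likely reach for a t-test from a statistics library, but the idea is the same.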

Coding

For this example I reached out and grabbed the train.csv from the Titanic Dataset on Kaggle: https://www.kaggle.com/c/titanic/data.

import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('../../Downloads/train.csv')
df.describe()

So here we can see that we have quite a bit of data in our 'Age' column, but its count of 714 does not match the 891 rows in 'PassengerId', so we will have to get rid of those pesky little NaNs.

# keep only the 'Age' column; df.dropna() would drop every row with a NaN in any column
data=df['Age'].dropna()

Now let's grab our mean and take a look at the distribution of our data.

pop_mean=data.mean()
print(pop_mean)
plt.hist(data,bins=100);

Now we know that we have a mean age of just under 30 years old, and we can see that our data is skewed to the right. So let's create a function to demonstrate the Central Limit Theorem, and see what happens when we take 100 samples of 30 people from our dataset.

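def CLT(num_of_samples,sample_size):
    # collect the mean of each random sample
    sample_means=[]
    for i in range(num_of_samples):
        # draw sample_size values at random, with replacement
        sample=np.random.choice(data,size=sample_size,replace=True)
        sample_means.append(sample.mean())
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(sample_means, bins=80, kde=True)
    print(f'mean:{sum(sample_means)/num_of_samples}')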

CLT(100,30)

Wow, our mean is actually pretty close, and the distribution of our sample means is starting to take the shape of a normal distribution. But we have the power of the computer, so let's add a sample or 9,900 more and bump the sample size up to 50 while we are at it.

CLT(10000,50)

There it is. Look at that distribution: almost a mirror image of a textbook normal distribution. We also get a mean of 29.68, within a few hundredths of the mean of our total population. Maybe these mathematicians actually know what they are talking about!
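One more thing worth noticing: the second run changed two things at once, the number of samples and the sample size. More samples give a smoother histogram, while a larger sample size makes the histogram narrower, because the spread of the sample means shrinks roughly like the population standard deviation divided by the square root of the sample size. Here is a minimal sketch of that second effect. It uses a made-up skewed population (exponential, mean 30) as a stand-in rather than the Titanic ages, and the helper name spread_of_sample_means is my own.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Stand-in for a skewed column: 5,000 draws from an exponential distribution (mean 30).
population = [random.expovariate(1 / 30) for _ in range(5_000)]
pop_sd = statistics.pstdev(population)

def spread_of_sample_means(num_of_samples, sample_size):
    """Standard deviation of the means of repeated samples drawn with replacement."""
    means = [
        statistics.fmean(random.choices(population, k=sample_size))
        for _ in range(num_of_samples)
    ]
    return statistics.stdev(means)

for n in (30, 50, 200):
    observed = spread_of_sample_means(2_000, n)
    predicted = pop_sd / math.sqrt(n)
    print(f"n={n}: observed {observed:.2f}, predicted {predicted:.2f}")
```

The observed and predicted spreads should track each other closely, and both get smaller as the sample size grows.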