Think probabilistically!
Every dataset has a story to tell. To hear it, you need to think statistically, speak the language of your data, and understand what your data is telling you. The foundations of statistical thinking took decades to build, but they can be grasped much faster today with the help of computers and the power of Python-based tools. That's hacker statistics.
Data is abundant, but there are many questions about the data itself whose answers can reveal underlying facts and details, such as:
- Type of data.
- Probability distribution.
- Central tendencies.
- Range and extremes.
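As a quick sketch of how some of these can be computed (assuming NumPy; the array data below holds hypothetical sample values):

import numpy as np

data = np.array([2.3, 3.1, 4.8, 3.3, 2.9])  # hypothetical sample values
print(np.mean(data), np.median(data))  # central tendencies
print(np.std(data))                    # spread
print(data.min(), data.max())          # range and extremes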
Topics:
[ ] Bernoulli trials.
[ ] Bootstrapping.
Hacker Statistics:
Use simulated repeated measurements to gather more information about the data. The basic idea is that instead of literally repeating the data acquisition over and over again, we can simulate those repeated measurements using Python. For our first simulation, we will take a cue from our forebears: the concepts of probability originated from studies of games of chance.
So, let's jump right into it.
1. Bernoulli trials, the coin flip experiment:
- The Assumptions of Bernoulli Trials:
  1. Each trial results in one of two possible outcomes, denoted success (S) or failure (F).
  2. The probability of S remains constant from trial to trial and is denoted by p. Write q = 1 − p for the constant probability of F.
  3. The trials are independent.
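Before the coin flip example, here is a minimal sketch of a single Bernoulli trial with an arbitrary success probability p (p = 0.3 is just an illustrative choice, assuming NumPy):

import numpy as np

p = 0.3  # illustrative success probability
# one Bernoulli trial: success (S) with probability p, failure (F) otherwise
outcome = "S" if np.random.random() < p else "F"
print(outcome)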
A common example:
Coin flips: there are only two possible outcomes of a coin flip, heads or tails, each with probability 0.5.
Consider: a random decimal number d is drawn uniformly between 0 and 1.
- If d >= 0.5, we get tails.
- If d < 0.5, we get heads.
In Python:
import numpy as np

np.random.seed(42)
# draw four uniform random numbers in [0, 1), one per coin flip
trials = np.random.random(size=4)
print(trials)
[0.37454012 0.95071431 0.73199394 0.59865848]
That's 3 tails and 1 head, but that is just one simulation of 4 flips. What if we simulate 10 flips 10,000 times? It's nearly impossible to do this manually, so we will use Python to do the job.
From each simulation of 10 flips, we record the number of heads.
import numpy as np

np.random.seed(42)
heads = []
mid = 0
for i in range(10000):
    # ten flips: True (heads) when the random draw is below 0.5
    ten_flips = np.random.random(10) < 0.5
    heads.append(sum(ten_flips))
    # fair odds: count simulations with exactly 5 heads and 5 tails
    if sum(ten_flips) == 5:
        mid += 1
print(mid / 10000)
0.2545
So we got exactly 5 heads and 5 tails out of 10 flips about 25% of the time. If we plot the probability of every possible outcome of 10 flips, we get:
In Python:
import matplotlib.pyplot as plt

# unique head counts and how often each occurred across the 10,000 simulations
heads, counts_elements = np.unique(heads, return_counts=True)
probability = counts_elements / 10000
plt.bar(heads, probability, width=0.95)
plt.xticks(range(0, 11))  # possible outcomes run from 0 to 10 heads
plt.ylim(0, 0.3)
plt.xlabel("Number of heads")
plt.ylabel("Probability")
plt.show()
This is a binomial distribution; read more about the binomial distribution here.
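As a cross-check (a sketch, not part of the simulation above), the exact binomial probability of 5 heads in 10 fair flips agrees with what we simulated:

from math import comb

# exact P(5 heads in 10 fair flips) = C(10, 5) * 0.5^10
print(comb(10, 5) * 0.5**10)  # 0.24609375, close to the simulated 0.2545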
Topics:
[x] Bernoulli trials.
[ ] Bootstrapping.
2. Bootstrapping:
- For any given dataset, we can compute summary statistics of measurements, including the mean, median, and standard deviation. But remember, we need to think probabilistically. What if we acquired the data again? Would we get the same mean? The same median? The same standard deviation? Probably not. In inference problems, it is rare that we are interested in the result of a single experiment or data acquisition; we want to say something more general.
While we usually cannot acquire the data again, we can draw samples from the data we have, with replacement, again and again, and calculate the summary statistics each time.
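The core operation is sampling with replacement. A minimal sketch (the observed values below are illustrative):

import numpy as np

sample = np.array([4, 1, 6, 3, 3])  # illustrative observed data
# one bootstrap resample: same size as the data, drawn with replacement
resample = np.random.choice(sample, size=len(sample), replace=True)
print(resample, resample.mean())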
Dice toss: there are 6 possible outcomes of a die roll, each with equal probability. If we roll the same die 10 times, we expect a mean value of around 3.5.
import numpy as np

# 10 rolls of a fair six-sided die (randint's upper bound is exclusive, so use 7)
dice_rolls = np.random.randint(1, 7, size=10)
print(dice_rolls.mean())
2.8
But if we resample these 10 rolls 10,000 times (a bootstrap), do we get the same mean each time?
Let's try this out:
import matplotlib.pyplot as plt
import seaborn as sns

# bootstrap: resample the 10 rolls with replacement, 10,000 times
roll_means = []
for i in range(10000):
    roll_mean = np.random.choice(dice_rolls, len(dice_rolls)).mean()
    roll_means.append(roll_mean)

ax = sns.histplot(roll_means)
ax2 = ax.twinx()  # second y-axis for the density curve
sns.kdeplot(roll_means, fill=True, ax=ax2)

# 95% confidence interval: the 2.5th and 97.5th percentiles
ci = np.quantile(roll_means, [0.025, 0.975])
print(f"Confidence Intervals : {ci}")
for i in ci:
    ax.axvline(i, color="red")
ax.set_xlabel("Dice Roll Mean (10 rolls)")
plt.show()
Confidence Intervals : [2. 3.7]
Observe the red vertical lines in the plot: the region between them marks the 95% confidence interval, meaning 95% of the bootstrapped means of 10 die rolls fall within it. Read more about confidence intervals here.
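Beyond the percentile interval, the bootstrap distribution also gives an estimate of the standard error; a sketch using the roll_means list from above:

# the spread of the bootstrap means estimates the standard error of the sample mean
se = np.std(roll_means)
print(f"Bootstrap standard error: {se:.3f}")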
Topics:
[x] Bernoulli trials.
[x] Bootstrapping.
Conclusion:
By using computational power, we can carry out statistical inference techniques like these simulations easily and quickly. This gives a good feel for the data and how to use it effectively.