Kshitij Dhama
Hacker Statistics with Python.

Think probabilistically!


Every dataset has a story to tell. To hear it, you need to think statistically, speak the language of your data, and understand what it is telling you. The foundations of statistical thinking took decades to build, but they can be grasped much faster today with the help of computers. Doing that with the power of Python-based tools is what we call hacker statistics.


Data is abundant, but there are plenty of questions about the data itself whose answers reveal underlying facts and details, such as:

  • Type of data.
  • Probability distribution.
  • Central tendencies.
  • Range and extremes.
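For instance, the last two questions can already be answered for a numeric sample with a few lines of NumPy (the sample values below are made up purely for illustration):

import numpy as np

data = np.array([2.3, 3.1, 4.8, 2.9, 3.5, 4.1, 3.3])  # hypothetical sample

# Central tendencies.
print(np.mean(data), np.median(data))

# Range and extremes.
print(np.min(data), np.max(data), np.ptp(data))  # ptp = max - min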




Topics:

[ ] Bernoulli trials.
[ ] Bootstrapping.


Hacker Statistics:

Use simulated repeated measurements to gather more information about the data. The basic idea is that instead of literally repeating the data acquisition over and over, we can simulate those repeated measurements in Python. For our first simulation, we take a cue from our forebears: the concepts of probability originated in studies of games of chance.


So, let's jump right into it.


1. Bernoulli trials, the coin flip experiment:

  • The assumptions of Bernoulli trials:
    • 1. Each trial results in one of two possible outcomes, denoted success (S) or failure (F).
    • 2. The probability of S remains constant from trial to trial and is denoted by p. Write q = 1 − p for the constant probability of F.
    • 3. The trials are independent.
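As a minimal sketch of these assumptions in code (the helper name bernoulli_trials is my own, not from any library), a general Bernoulli-trial simulator with success probability p could look like this:

import numpy as np

def bernoulli_trials(n, p, rng=None):
    """Perform n Bernoulli trials with success probability p
    and return the number of successes."""
    rng = rng or np.random.default_rng()
    # Each trial is a success when a uniform draw falls below p.
    return int(np.sum(rng.random(n) < p))

# Example: 10 flips of a fair coin (p = 0.5).
print(bernoulli_trials(10, 0.5))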



A common example:

Coin flips: there are only two possible outcomes of a coin flip, heads or tails, each with equal probability of 0.5.



Consider: a random decimal number d is drawn uniformly between 0 and 1.

If d >= 0.5, we get tails.

Else (d < 0.5), we get heads.

In Python:


import numpy as np

np.random.seed(42)

# Draw four random numbers uniformly from [0, 1).
trials = np.random.random(size=4)
print(trials)

[0.37454012 0.95071431 0.73199394 0.59865848]



Three tails and one head. That's just one simulation of 4 flips, but what if we run 10,000 simulations of 10 flips each? Doing that manually is nearly impossible, so we'll let Python do the job.

From each simulation of 10 flips, we record the number of heads.

import numpy as np

np.random.seed(42)

heads = []
mid = 0

for i in range(10000):
    # Each flip counts as heads when the uniform draw falls below 0.5.
    ten_flips = np.random.random(10) < 0.5
    n_heads = ten_flips.sum()
    heads.append(n_heads)

    # Fair odds: count simulations with exactly 5 heads and 5 tails.
    if n_heads == 5:
        mid += 1

print(mid / 10000)

0.2545



So we get exactly 5 heads and 5 tails in 10 flips about 25% of the time. If we plot the probability of every possible outcome of 10 flips, we get:


[Plot: probability of each possible number of heads in 10 flips]



In Python:

import matplotlib.pyplot as plt

# Count how often each head count occurred across the 10,000 simulations.
values, counts_elements = np.unique(heads, return_counts=True)
probability = counts_elements / 10000

plt.bar(values, probability, width=0.95)
plt.xticks(range(0, 11))  # possible outcomes: 0 to 10 heads
plt.ylim(0, 0.3)
plt.xlabel("Number of heads")
plt.ylabel("Probability")
plt.show()


[Plot: simulated binomial distribution of the number of heads in 10 flips]
This is a binomial distribution; you can read more about the binomial distribution here.
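In fact, NumPy can sample a binomial distribution directly, and (assuming SciPy is available) scipy.stats.binom gives the exact probabilities to check our simulation against:

import numpy as np
from scipy.stats import binom

np.random.seed(42)

# 10,000 simulations of 10 fair flips, in a single call.
heads = np.random.binomial(n=10, p=0.5, size=10000)
print((heads == 5).mean())

# The exact probability: C(10, 5) / 2**10 = 0.24609375.
print(binom.pmf(5, 10, 0.5))

The simulated value lands close to the exact 0.246, in line with the 0.2545 we estimated above.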


Topics:

[x] Bernoulli trials.
[ ] Bootstrapping.


2. Bootstrapping:

  • For any given dataset, we can compute summary statistics of the measurements, including the mean, median, and standard deviation. But remember, we need to think probabilistically. What if we acquired the data again? Would we get the same mean? The same median? The same standard deviation? Probably not. In inference problems, it is rare that we are interested in the result of a single experiment or data acquisition; we want to say something more general.

However, we can draw samples from the data again and again (with replacement) and calculate the summary statistics each time; this resampling technique is called bootstrapping.
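In code, one bootstrap replicate is just a same-size resample of the data, drawn with replacement, followed by the summary statistic. A minimal sketch (the helper name bootstrap_replicate and the sample values are my own):

import numpy as np

def bootstrap_replicate(data, func, rng=None):
    """Resample data with replacement and apply a summary statistic."""
    rng = rng or np.random.default_rng()
    sample = rng.choice(data, size=len(data), replace=True)
    return func(sample)

# Example: one bootstrap replicate of the mean.
data = np.array([1, 4, 2, 6, 3, 5, 2, 4, 6, 1])  # hypothetical sample
print(bootstrap_replicate(data, np.mean))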




Dice rolls: there are 6 possible outcomes of a die roll, each with equal probability 1/6, so the expected mean of a roll is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. If we roll the same die 10 times, the sample mean should land somewhere around 3.5.

import numpy as np

# Roll a fair six-sided die 10 times (randint's upper bound is exclusive).
dice_rolls = np.random.randint(1, 7, size=10)
print(dice_rolls.mean())

2.8
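A single sample of 10 rolls can easily land away from 3.5, as the 2.8 above shows. As a quick sanity check, the mean of a much larger number of rolls should settle near 3.5:

import numpy as np

# With many rolls, the sample mean converges toward the expected value 3.5.
many_rolls = np.random.randint(1, 7, size=100000)
print(many_rolls.mean())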

But if we were to perform the 10 rolls 10,000 times over, would we get the same mean of around 3.5 each time?


Let's try this out:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

roll_means = []

for i in range(10000):
    # Resample the 10 rolls with replacement and record the mean.
    roll_mean = np.random.choice(dice_rolls, len(dice_rolls)).mean()
    roll_means.append(roll_mean)

ax = sns.histplot(roll_means)
ax2 = ax.twinx()
sns.kdeplot(roll_means, fill=True, ax=ax2)

# The middle 95% of the bootstrap means forms a 95% confidence interval.
ci = np.quantile(roll_means, [0.025, 0.975])
print(f"95% confidence interval: {ci}")

for bound in ci:
    ax.axvline(bound, c="red")

ax.set_xlabel("Dice Roll Mean (10 rolls)")

plt.show()
95% confidence interval: [2.  3.7]

[Plot: distribution of bootstrapped dice-roll means, with the 95% confidence interval bounds marked in red]



Observe the red vertical lines in the plot: the region between them marks the 95% confidence interval, meaning that 95% of the bootstrapped means of 10 dice rolls lie in that region. Read more about confidence intervals here.


Topics:

[x] Bernoulli trials.
[x] Bootstrapping.


Conclusion:

Using computational power, we can easily and quickly simulate techniques like statistical inference. This gives a good idea of the data and how to use it effectively.


Thank you! Catch ya later. 😄
