# Salmon Run Statistics: Stages (Part 2)

In today's post, I'll be performing hypothesis testing on the various stages.

What's being tested is whether there is a significant difference in the average waves cleared, hazard level, golden eggs collected, or power eggs collected for each of the stages.

We'll be going in order of stage release.
As a note, a p value listed as 0.0 is not actually zero; it is simply smaller than the smallest positive number a Python float can represent (roughly 5e-324), so it gets rounded down to zero.
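To illustrate why such a value prints as 0.0, here is a minimal sketch using only the standard library. It treats the t statistic as approximately normal, which is an assumption made for illustration only (the script itself relies on scipy's t-test), and the helper name `log10_two_sided_normal_p` is hypothetical. The direct tail probability underflows to 0.0, while working on a log scale recovers a rough order of magnitude.

```python
import math
import sys


def log10_two_sided_normal_p(z: float) -> float:
    """Approximate log10 of the two-sided normal tail probability for large |z|.

    Uses the asymptotic tail Q(z) ~ pdf(z) / z, which is only accurate for large |z|.
    """
    z = abs(z)
    log_tail = -0.5 * z * z - math.log(z) - 0.5 * math.log(2.0 * math.pi)
    return (math.log(2.0) + log_tail) / math.log(10.0)


# One of the t values in this post whose p value printed as 0.0.
t_stat = -42.74669521494301

# Direct computation of the two-sided normal tail: too small for a float.
direct_p = math.erfc(abs(t_stat) / math.sqrt(2.0))
print(direct_p)

# Smallest positive normal float; values far below this cannot be represented.
print(sys.float_info.min)

# The same quantity on a log10 scale stays finite.
print(log10_two_sided_normal_p(t_stat))
```

Under this approximation the exponent comes out a little below -398, far beyond the reach of a double. The exact Welch value would differ somewhat, but the conclusion is the same: the p value is tiny, not zero.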

On stage: 58465
Not on stage: 237024

Waves Cleared:

x̄1 - x̄2 = 0.02202136266782695
d = 0.02180660926876172
t = 4.75855948072555
p = 1.9528027441484954e-06

Hazard Level:

x̄1 - x̄2 = 0.7916590856902417
d = 0.02508002845167262
t = 5.36330448554007
p = 8.191814053431072e-08

Golden Eggs Collected:

x̄1 - x̄2 = 0.29067416948813474
d = 0.03312541526802805
t = 7.1015810044868
p = 1.2426245625649125e-12

Power Eggs Collected:

x̄1 - x̄2 = 0.29067416948813474
d = 0.0009492056915165725
t = 3.2082636985041604
p = 0.001335847436272462

On stage: 58206
Not on stage: 237283

Waves Cleared:

x̄1 - x̄2 = -0.09935144682267083
d = -0.09838256668436818
t = -20.991545905954546
p = 1.365533454759977e-97

Hazard Level:

x̄1 - x̄2 = -2.8769270721537907
d = -0.09114202581290942
t = -20.287411896696618
p = 2.632302741250626e-91

Golden Eggs Collected:

x̄1 - x̄2 = -1.6038006761037291
d = -0.18277015634596588
t = -42.74669521494301
p = 0.0

Power Eggs Collected:

x̄1 - x̄2 = -1.6038006761037291
d = -0.0052372618196400425
t = -8.62183075765369
p = 6.705805318363079e-18

On stage: 56963
Not on stage: 238526

Waves Cleared:

x̄1 - x̄2 = 0.14200233776683957
d = 0.1406175240670411
t = 31.299034907944364
p = 6.80477043218477e-214

Hazard Level:

x̄1 - x̄2 = 4.759780678961647
d = 0.15079146694564513
t = 32.07588477095397
p = 2.0912126528419186e-224

Golden Eggs Collected:

x̄1 - x̄2 = 1.1372705506692107
d = 0.12960408325705847
t = 27.062506697275193
p = 1.3476593528527444e-160

Power Eggs Collected:

x̄1 - x̄2 = 1.1372705506692107
d = 0.00371379294345405
t = 51.1355543467026
p = 0.0

On stage: 61733
Not on stage: 233756

Waves Cleared:

x̄1 - x̄2 = -0.061300749402876775
d = -0.06070294151523198
t = -13.206234532306135
p = 8.757489255083284e-40

Hazard Level:

x̄1 - x̄2 = -1.1633113788847709
d = -0.03685409920502122
t = -8.209514401404489
p = 2.247521976981891e-16

Golden Eggs Collected:

x̄1 - x̄2 = -0.32276730159660616
d = -0.03678276924006097
t = -8.014870746224835
p = 1.1149222395261515e-15

Power Eggs Collected:

x̄1 - x̄2 = -0.32276730159660616
d = -0.0010540068291945382
t = 2.822913848544297
p = 0.004759891986650458

On stage: 59908
Not on stage: 235581

Waves Cleared:

x̄1 - x̄2 = 0.0020956528774993544
d = 0.002075215968780652
t = 0.4522412772851219
p = 0.6510962365369337

Hazard Level:

x̄1 - x̄2 = -1.3176295304787118
d = -0.041742950609045815
t = -9.06088510509389
p = 1.318471198905114e-19

Golden Eggs Collected:

x̄1 - x̄2 = 0.5202282126739455
d = 0.05928554164036889
t = 12.886738997648841
p = 5.766022404170223e-38

Power Eggs Collected:

x̄1 - x̄2 = 0.5202282126739455
d = 0.0016988216779880055
t = -48.47749341095846
p = 0.0

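All but one t-test (waves cleared on the fifth stage listed, p = 0.65) resulted in a statistically significant result at the p < .05 level.
However, every effect size is small: the largest |d| is about 0.18, below the 0.2 conventionally read as a small effect, and most are under 0.1.

Thus, while there is a statistically significant difference between the five stages, the effect size is small enough as to hardly matter.
With about 295,000 jobs in total and roughly 60,000 on any one stage, even a tiny difference in means produces a large t value.

One caveat: the difference-of-means values listed under Power Eggs Collected are identical to the golden egg ones. In the run that produced these numbers, that difference (and so the d built from it) was computed from the golden egg lists, so only the t and p values in those blocks describe power eggs.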
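To see how a tiny effect can still be highly significant, it helps to relate t to d. The sketch below uses the textbook relation t ≈ d · sqrt(n1 · n2 / (n1 + n2)), which is exact only for a pooled-variance t-test with d based on the pooled standard deviation; here it is just an approximation, and the function name `approx_t_from_d` is hypothetical.

```python
import math


def approx_t_from_d(d: float, n1: int, n2: int) -> float:
    """Approximate two-sample t statistic implied by effect size d and group sizes."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))


# Waves cleared on the first stage listed: d and the two group sizes from the post.
t_large_sample = approx_t_from_d(0.02180660926876172, 58465, 237024)
print(round(t_large_sample, 2))  # close to the reported t of about 4.76

# The same d with only 100 jobs per group would be nowhere near significance.
t_small_sample = approx_t_from_d(0.02180660926876172, 100, 100)
print(round(t_small_sample, 2))
```

The effect size is the same in both calls; only the sample size changes, and with it the t value.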

The main script for this was `scripts/stages.py`.
It calls functions in the `core.py` and `filters.py` portions of the code base.
It also uses some functions from matplotlib.pyplot and scipy, in addition to the usual requirements of jsonlines, ujson, and numpy.

As to how the script works:

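``````
from typing import List, Tuple

data = core.init("All")
stageList: List[str] = []
# The loop that reads each `job` record out of `data` is omitted from this excerpt;
# the lines below run once per job, with `locale` set elsewhere in the script.
if job["stage"] is not None:
    if job["stage"]["name"][locale] not in stageList:
        stageList.append(job["stage"]["name"][locale])
``````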

This first segment initializes the data, pulling new results from stat.ink.
Then, it finds all the stages in the database, and makes a list of them.
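As a self-contained illustration of the stage-collecting step, here is a minimal sketch that works on an in-memory list of job records rather than the stat.ink data file. The function name `unique_stage_names`, the sample records, and the `"en_US"` locale key are assumptions made for the example.

```python
from typing import Any, Dict, List


def unique_stage_names(jobs: List[Dict[str, Any]], locale: str) -> List[str]:
    """Collect each stage name once, in order of first appearance."""
    stage_list: List[str] = []
    for job in jobs:
        stage = job.get("stage")
        if stage is None:
            # Some jobs have no stage recorded; skip them.
            continue
        name = stage["name"][locale]
        if name not in stage_list:
            stage_list.append(name)
    return stage_list


# Hypothetical records shaped like the fields the script reads.
sample_jobs = [
    {"stage": {"name": {"en_US": "Stage A"}}},
    {"stage": None},
    {"stage": {"name": {"en_US": "Stage B"}}},
    {"stage": {"name": {"en_US": "Stage A"}}},
]
print(unique_stage_names(sample_jobs, "en_US"))
```

A list is used instead of a set so that the stages stay in the order they were first seen.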

``````
listOfFiles: List[Tuple[str, str]] = []
for stage in stageList:
    listOfFiles.append(filters.onStages(data, [stage]))
``````

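For each stage in the list, this segment calls `filters.onStages`, which produces a paired set of data: one with only the jobs on that stage and one with only the jobs not on it. Judging by the type annotation, each pair is stored as two strings, presumably the paths of the filtered files.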
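The idea behind the pairing can be sketched on in-memory records. This is not the project's `filters.onStages` (which works on the data files); the function `split_by_stage` is hypothetical, as is the choice to count jobs with no recorded stage as "not on stage".

```python
from typing import Any, Dict, List, Tuple

Job = Dict[str, Any]


def split_by_stage(jobs: List[Job], stage_name: str, locale: str) -> Tuple[List[Job], List[Job]]:
    """Split jobs into (on the given stage, not on the given stage)."""
    on_stage: List[Job] = []
    off_stage: List[Job] = []
    for job in jobs:
        stage = job.get("stage")
        if stage is not None and stage["name"][locale] == stage_name:
            on_stage.append(job)
        else:
            # Assumption: jobs without a stage are grouped with the off-stage jobs.
            off_stage.append(job)
    return on_stage, off_stage


# Hypothetical records with just the fields needed here.
sample_jobs = [
    {"stage": {"name": {"en_US": "Stage A"}}, "clear_waves": 3},
    {"stage": {"name": {"en_US": "Stage B"}}, "clear_waves": 2},
    {"stage": None, "clear_waves": 1},
    {"stage": {"name": {"en_US": "Stage A"}}, "clear_waves": 0},
]
on_a, off_a = split_by_stage(sample_jobs, "Stage A", "en_US")
print(len(on_a), len(off_a))
```

Every job lands in exactly one of the two groups, which is why the "On stage" and "Not on stage" counts in the results always add up to the same total.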

``````
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import ttest_ind

with open("reports/stages.txt", "w", encoding="utf-8") as writer:
    i: int = 1
    for stageFiles in listOfFiles:
        plt.figure(i)
        # Each entry pairs the with-stage data and the without-stage data.
        withVal: str = stageFiles[0]
        withoutVal: str = stageFiles[1]
        withValClearWaves: List[float] = []
        withValDangerRate: List[float] = []
        withValGoldenTotal: List[float] = []
        withValPowerTotal: List[float] = []
        withoutValClearWaves: List[float] = []
        withoutValDangerRate: List[float] = []
        withoutValGoldenTotal: List[float] = []
        withoutValPowerTotal: List[float] = []
        clearWaves: List[float] = []
        dangerRate: List[float] = []
        goldenTotal: List[float] = []
        powerTotal: List[float] = []
        # The loops that read each `job` out of the data files are omitted from this excerpt.
        # For every job on the stage:
        withValClearWaves.append(float(job["clear_waves"]))
        withValDangerRate.append(float(job["danger_rate"]))
        withValGoldenTotal.append(float(job["my_data"]["golden_egg_delivered"]))
        withValPowerTotal.append(float(job["my_data"]["power_egg_collected"]))
        # For every job not on the stage:
        withoutValClearWaves.append(float(job["clear_waves"]))
        withoutValDangerRate.append(float(job["danger_rate"]))
        withoutValGoldenTotal.append(float(job["my_data"]["golden_egg_delivered"]))
        withoutValPowerTotal.append(float(job["my_data"]["power_egg_collected"]))
        # For every job in either group (used for the overall standard deviation):
        clearWaves.append(float(job["clear_waves"]))
        dangerRate.append(float(job["danger_rate"]))
        goldenTotal.append(float(job["my_data"]["golden_egg_delivered"]))
        powerTotal.append(float(job["my_data"]["power_egg_collected"]))

        # Waves cleared: difference of means, effect size, and Welch's t-test.
        diffMeansClearWaves: float = np.mean(withValClearWaves) - np.mean(
            withoutValClearWaves
        )
        t, p = ttest_ind(withValClearWaves, withoutValClearWaves, equal_var=False)
        writer.write(withVal + "\n")
        writer.write(withoutVal + "\n")
        writer.write("n_1 = " + str(len(withValClearWaves)) + "\n")
        writer.write("n_2 = " + str(len(withoutValClearWaves)) + "\n")
        writer.write("\n")
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansClearWaves) + "\n")
        writer.write("d = " + str(diffMeansClearWaves / np.std(clearWaves)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")

        # Hazard level: test, then histograms with and without the stage.
        t, p = ttest_ind(withValDangerRate, withoutValDangerRate, equal_var=False)
        plt.subplot(321)
        plt.hist(withValDangerRate, density=True)
        plt.xlabel("Danger Rate")
        plt.ylabel("Probability")
        plt.subplot(322)
        plt.hist(withoutValDangerRate, density=True)
        plt.xlabel("Danger Rate")
        plt.ylabel("Probability")
        diffMeansDangerRate: float = np.mean(withValDangerRate) - np.mean(
            withoutValDangerRate
        )
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansDangerRate) + "\n")
        writer.write("d = " + str(diffMeansDangerRate / np.std(dangerRate)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")

        # Golden eggs collected.
        t, p = ttest_ind(withValGoldenTotal, withoutValGoldenTotal, equal_var=False)
        plt.subplot(323)
        plt.hist(withValGoldenTotal, density=True)
        plt.xlabel("Golden Eggs")
        plt.ylabel("Probability")
        plt.subplot(324)
        plt.hist(withoutValGoldenTotal, density=True)
        plt.xlabel("Golden Eggs")
        plt.ylabel("Probability")
        diffMeansGoldenTotal: float = np.mean(withValGoldenTotal) - np.mean(
            withoutValGoldenTotal
        )
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansGoldenTotal) + "\n")
        writer.write("d = " + str(diffMeansGoldenTotal / np.std(goldenTotal)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")

        # Power eggs collected.
        t, p = ttest_ind(withValPowerTotal, withoutValPowerTotal, equal_var=False)
        plt.subplot(325)
        plt.hist(withValPowerTotal, density=True)
        plt.xlabel("Power Eggs")
        plt.ylabel("Probability")
        plt.subplot(326)
        plt.hist(withoutValPowerTotal, density=True)
        plt.xlabel("Power Eggs")
        plt.ylabel("Probability")
        # The difference of means must use the power egg lists, not the golden egg ones.
        diffMeansPowerTotal: float = np.mean(withValPowerTotal) - np.mean(
            withoutValPowerTotal
        )
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansPowerTotal) + "\n")
        writer.write("d = " + str(diffMeansPowerTotal / np.std(powerTotal)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")
        i += 1
# Display every stage's figure once all of them have been built.
plt.show()
``````

This is a big segment.
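First, for each stage's pair of data sets, it creates a set of lists to store the data retrieved: one set for jobs on the stage, one for jobs off it, and one for all jobs combined.
Next it goes through the data files of that iteration, retrieving all the data needed for the calculations.
Then, for each metric, it runs Welch's t-test (`ttest_ind` with `equal_var=False`) to get the t and p values, and calculates d as the difference of the two means divided by the standard deviation of the combined data.
It also creates a pair of histograms (on stage and off stage) for every metric except the waves cleared.
All of the data generated gets written to the `"reports/stages.txt"` file, and the histograms are displayed in their own windows on screen.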
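For readers who want to see the per-metric arithmetic without scipy or numpy, here is a minimal standard-library sketch of the same two quantities: the Welch t statistic (with its degrees of freedom) and d as the difference of means over the standard deviation of the combined data. The function name `welch_t_and_d` and the sample numbers are made up for illustration, and the p value is left out because it needs the t distribution.

```python
import math
from statistics import mean, pstdev, variance
from typing import List, Tuple


def welch_t_and_d(with_val: List[float], without_val: List[float]) -> Tuple[float, float, float]:
    """Return (t, degrees of freedom, d) for two independent samples."""
    n1, n2 = len(with_val), len(without_val)
    diff = mean(with_val) - mean(without_val)
    # Squared standard errors, using the sample variance (ddof=1) of each group.
    se1 = variance(with_val) / n1
    se2 = variance(without_val) / n2
    t = diff / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    # Population standard deviation of all values together, like np.std's default.
    d = diff / pstdev(with_val + without_val)
    return t, df, d


# Made-up sample values, not taken from the data set.
t, df, d = welch_t_and_d([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(t, df, d)
```

Dividing by the standard deviation of the combined data is what the script does; it is a slightly different denominator from the pooled standard deviation used in the textbook definition of Cohen's d.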