In today's post, I'll be performing hypothesis tests on the various stages.

For each stage, what's being tested is whether the average waves cleared, hazard level, golden eggs collected, or power eggs collected differs significantly between jobs played on that stage and jobs played on any other stage.

We'll be going in order of stage release.

As a note, a listed p value of 0.0 is not actually zero; it is smaller than the smallest positive number a Python float can represent, so it underflows to 0.0.
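To make the underflow concrete, here is a minimal sketch using only the standard library. It swaps in a normal approximation for the t distribution (reasonable only because the samples are so large), so the function names and the approximation are mine, not part of the original script. The second function works in log space to estimate the order of magnitude that the float cannot hold.

```python
import math


def two_sided_p_normal(z: float) -> float:
    """Two-sided p value under a normal approximation (an assumption here)."""
    return math.erfc(abs(z) / math.sqrt(2))


def log10_two_sided_p_normal(z: float) -> float:
    """Approximate log10 of the two-sided p value for large |z|.

    Uses the leading-order tail approximation sf(z) ~ pdf(z) / z,
    so it is only meaningful when |z| is large.
    """
    z = abs(z)
    log_p = -0.5 * z * z - math.log(z) - 0.5 * math.log(2 * math.pi) + math.log(2)
    return log_p / math.log(10)


# A moderate statistic still gives a representable p value.
p_moderate = two_sided_p_normal(4.75855948072555)

# A very large statistic underflows to exactly 0.0 ...
p_huge = two_sided_p_normal(51.1355543467026)

# ... even though its logarithm is perfectly finite.
log10_p_huge = log10_two_sided_p_normal(51.1355543467026)

print(p_moderate, p_huge, log10_p_huge)
```

The smallest positive float is about 5e-324, so any p value with a base-10 logarithm below roughly -323 prints as 0.0.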

First stage: Lost Outpost

Sample Sizes:

On stage: 58465

Not on stage: 237024

Waves Cleared:

x̄₁ - x̄₂ = 0.02202136266782695

d = 0.02180660926876172

t = 4.75855948072555

p = 1.9528027441484954e-06

Hazard Level:

x̄₁ - x̄₂ = 0.7916590856902417

d = 0.02508002845167262

t = 5.36330448554007

p = 8.191814053431072e-08

Golden Eggs Collected:

x̄₁ - x̄₂ = 0.29067416948813474

d = 0.03312541526802805

t = 7.1015810044868

p = 1.2426245625649125e-12

Power Eggs Collected:

x̄₁ - x̄₂ = 0.29067416948813474

d = 0.0009492056915165725

t = 3.2082636985041604

p = 0.001335847436272462

Second stage: Marooner's Bay

Sample Sizes:

On stage: 58206

Not on stage: 237283

Waves Cleared:

x̄₁ - x̄₂ = -0.09935144682267083

d = -0.09838256668436818

t = -20.991545905954546

p = 1.365533454759977e-97

Hazard Level:

x̄₁ - x̄₂ = -2.8769270721537907

d = -0.09114202581290942

t = -20.287411896696618

p = 2.632302741250626e-91

Golden Eggs Collected:

x̄₁ - x̄₂ = -1.6038006761037291

d = -0.18277015634596588

t = -42.74669521494301

p = 0.0

Power Eggs Collected:

x̄₁ - x̄₂ = -1.6038006761037291

d = -0.0052372618196400425

t = -8.62183075765369

p = 6.705805318363079e-18

Third stage: Salmonid Smokeyard

Sample Sizes:

On stage: 56963

Not on stage: 238526

Waves Cleared:

x̄₁ - x̄₂ = 0.14200233776683957

d = 0.1406175240670411

t = 31.299034907944364

p = 6.80477043218477e-214

Hazard Level:

x̄₁ - x̄₂ = 4.759780678961647

d = 0.15079146694564513

t = 32.07588477095397

p = 2.0912126528419186e-224

Golden Eggs Collected:

x̄₁ - x̄₂ = 1.1372705506692107

d = 0.12960408325705847

t = 27.062506697275193

p = 1.3476593528527444e-160

Power Eggs Collected:

x̄₁ - x̄₂ = 1.1372705506692107

d = 0.00371379294345405

t = 51.1355543467026

p = 0.0

Fourth stage: Spawning Grounds

Sample Sizes:

On stage: 61733

Not on stage: 233756

Waves Cleared:

x̄₁ - x̄₂ = -0.061300749402876775

d = -0.06070294151523198

t = -13.206234532306135

p = 8.757489255083284e-40

Hazard Level:

x̄₁ - x̄₂ = -1.1633113788847709

d = -0.03685409920502122

t = -8.209514401404489

p = 2.247521976981891e-16

Golden Eggs Collected:

x̄₁ - x̄₂ = -0.32276730159660616

d = -0.03678276924006097

t = -8.014870746224835

p = 1.1149222395261515e-15

Power Eggs Collected:

x̄₁ - x̄₂ = -0.32276730159660616

d = -0.0010540068291945382

t = 2.822913848544297

p = 0.004759891986650458

Fifth stage: Ruins of Ark Polaris

Sample Sizes:

On stage: 59908

Not on stage: 235581

Waves Cleared:

x̄₁ - x̄₂ = 0.0020956528774993544

d = 0.002075215968780652

t = 0.4522412772851219

p = 0.6510962365369337

Hazard Level:

x̄₁ - x̄₂ = -1.3176295304787118

d = -0.041742950609045815

t = -9.06088510509389

p = 1.318471198905114e-19

Golden Eggs Collected:

x̄₁ - x̄₂ = 0.5202282126739455

d = 0.05928554164036889

t = 12.886738997648841

p = 5.766022404170223e-38

Power Eggs Collected:

x̄₁ - x̄₂ = 0.5202282126739455

d = 0.0016988216779880055

t = -48.47749341095846

p = 0.0

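All but one of the t-tests (waves cleared on Ruins of Ark Polaris, p ≈ 0.65) produced a statistically significant result at the p < .05 level.

However, every effect size is small: the largest |d| is about 0.18 (golden eggs on Marooner's Bay), below the conventional 0.2 threshold for a small effect.

Thus, while there are statistically significant differences between the five stages, they are small enough to hardly matter in practice.

One caveat: for every stage, the mean difference listed under power eggs is identical to the one listed under golden eggs, which points to a copy-paste slip in the script. The power egg mean differences and d values above should therefore not be trusted. The power egg t and p values come from a separate calculation and are not affected.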
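To see why such small effects still come out significant, here is a rough sketch of how the t statistic scales with sample size. The relation t ≈ d·√(n₁n₂/(n₁+n₂)) is exact only for a pooled-variance t-test with d based on the pooled standard deviation. The script uses Welch's test and the overall standard deviation, so treat this as an approximation; the helper name is hypothetical.

```python
import math


def approx_t_from_d(d: float, n1: int, n2: int) -> float:
    """Approximate t statistic implied by effect size d and group sizes.

    Exact only for the equal-variance t-test; used here as a rough guide.
    """
    return d * math.sqrt(n1 * n2 / (n1 + n2))


# Waves cleared on Lost Outpost: d and sample sizes as reported above.
t_lost_outpost_waves = approx_t_from_d(0.02180660926876172, 58465, 237024)

# The same tiny effect with only 100 jobs per group.
t_small_sample = approx_t_from_d(0.02180660926876172, 100, 100)

print(t_lost_outpost_waves, t_small_sample)
```

With the real sample sizes the approximation lands near the reported t of 4.76, while the same d with 100 jobs per group gives a t well below 1. The significance here is driven by the sample size, not by the size of the difference.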

The main script for this was `scripts/stages.py`.

This called upon functions in the `core.py` and `filters.py` portions of the code base.

It also utilized some functions from matplotlib.pyplot and scipy, in addition to the usual requirements of jsonlines, ujson, and numpy.

As to how the script works:

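```
import gzip
from typing import List
import jsonlines
import ujson
import core  # the project's own module (import path assumed)

locale = "en_US"  # assumed locale key; not defined in this excerpt
data = core.init("All")
stageList: List[str] = []
with gzip.open(data) as reader:
    for job in jsonlines.Reader(reader, ujson.loads):
        if job["stage"] is not None:
            if job["stage"]["name"][locale] not in stageList:
                stageList.append(job["stage"]["name"][locale])
```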

This first segment initializes the data, pulling new results from stat.ink.

Then, it finds all the stages in the database, and makes a list of them.
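As a self-contained illustration of that stage-collecting step, here is a small sketch that runs on an in-memory gzip file instead of the real data. It uses the standard `json` module rather than `jsonlines` and `ujson`, and the records and the `"en_US"` locale key are made up for the example. A dict is used instead of a list so that the membership check stays fast while first-seen order is preserved.

```python
import gzip
import io
import json


def unique_stage_names(lines, locale="en_US"):
    """Return stage names in first-seen order, skipping jobs with no stage."""
    seen = {}
    for line in lines:
        job = json.loads(line)
        stage = job.get("stage")
        if stage is not None:
            seen.setdefault(stage["name"][locale], None)
    return list(seen)


# Made-up records standing in for the real job data.
records = [
    {"stage": {"name": {"en_US": "Lost Outpost"}}},
    {"stage": None},
    {"stage": {"name": {"en_US": "Marooner's Bay"}}},
    {"stage": {"name": {"en_US": "Lost Outpost"}}},
]

# Write the records as gzip-compressed JSON lines into memory.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as writer:
    for record in records:
        writer.write(json.dumps(record) + "\n")

# Read them back the same way the script reads its data file.
buffer.seek(0)
with gzip.open(buffer, "rt", encoding="utf-8") as reader:
    stages = unique_stage_names(reader)

print(stages)
```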

```
listOfFiles: List[Tuple[str, str]] = []
for stage in stageList:
    listOfFiles.append(filters.onStages(data, [stage]))
```

This segment then builds a list of filtered data sets: for each stage in the list, a pair of files, one containing only the jobs on that stage and one containing all the other jobs.
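The post does not show `filters.onStages` itself, so the following is only a guess at the kind of split it performs, done in memory on made-up records instead of on files. The function name, the `"en_US"` key, and the choice to put jobs with no stage into the "not on stage" group are all assumptions. The reported sample sizes do sum to the same total (295,489) for every stage, which is consistent with a complementary split like this.

```python
def split_by_stage(jobs, stage_names, locale="en_US"):
    """Split jobs into (on_stage, off_stage) by stage name.

    Hypothetical stand-in for the project's filter; jobs without a stage
    are placed in the off_stage group.
    """
    on_stage, off_stage = [], []
    for job in jobs:
        stage = job.get("stage")
        name = stage["name"][locale] if stage is not None else None
        if name in stage_names:
            on_stage.append(job)
        else:
            off_stage.append(job)
    return on_stage, off_stage


# Made-up records for illustration only.
jobs = [
    {"id": 1, "stage": {"name": {"en_US": "Lost Outpost"}}},
    {"id": 2, "stage": {"name": {"en_US": "Spawning Grounds"}}},
    {"id": 3, "stage": None},
    {"id": 4, "stage": {"name": {"en_US": "Lost Outpost"}}},
]

on_stage, off_stage = split_by_stage(jobs, ["Lost Outpost"])
print(len(on_stage), len(off_stage))
```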

```
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import ttest_ind

with open("reports/stages.txt", "w", encoding="utf-8") as writer:
    i: int = 1
    for stageFiles in listOfFiles:
        plt.figure(i)
        withVal: str = stageFiles[0]
        withoutVal: str = stageFiles[1]
        # Per-metric lists: on this stage, off this stage, and the full data set.
        withValClearWaves: List[float] = []
        withValDangerRate: List[float] = []
        withValGoldenTotal: List[float] = []
        withValPowerTotal: List[float] = []
        withoutValClearWaves: List[float] = []
        withoutValDangerRate: List[float] = []
        withoutValGoldenTotal: List[float] = []
        withoutValPowerTotal: List[float] = []
        clearWaves: List[float] = []
        dangerRate: List[float] = []
        goldenTotal: List[float] = []
        powerTotal: List[float] = []
        with gzip.open(withVal) as reader:
            for job in jsonlines.Reader(reader, ujson.loads):
                withValClearWaves.append(float(job["clear_waves"]))
                withValDangerRate.append(float(job["danger_rate"]))
                withValGoldenTotal.append(float(job["my_data"]["golden_egg_delivered"]))
                withValPowerTotal.append(float(job["my_data"]["power_egg_collected"]))
        with gzip.open(withoutVal) as reader:
            for job in jsonlines.Reader(reader, ujson.loads):
                withoutValClearWaves.append(float(job["clear_waves"]))
                withoutValDangerRate.append(float(job["danger_rate"]))
                withoutValGoldenTotal.append(float(job["my_data"]["golden_egg_delivered"]))
                withoutValPowerTotal.append(float(job["my_data"]["power_egg_collected"]))
        with gzip.open(data) as reader:
            for job in jsonlines.Reader(reader, ujson.loads):
                clearWaves.append(float(job["clear_waves"]))
                dangerRate.append(float(job["danger_rate"]))
                goldenTotal.append(float(job["my_data"]["golden_egg_delivered"]))
                powerTotal.append(float(job["my_data"]["power_egg_collected"]))
        # Waves cleared: mean difference, effect size, and Welch's t-test.
        diffMeansClearWaves: float = np.mean(withValClearWaves) - np.mean(
            withoutValClearWaves
        )
        t, p = ttest_ind(withValClearWaves, withoutValClearWaves, equal_var=False)
        writer.write(withVal + "\n")
        writer.write(withoutVal + "\n")
        writer.write("n_1 = " + str(len(withValClearWaves)) + "\n")
        writer.write("n_2 = " + str(len(withoutValClearWaves)) + "\n")
        writer.write("\n")
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansClearWaves) + "\n")
        writer.write("d = " + str(diffMeansClearWaves / np.std(clearWaves)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")
        # Hazard level (danger rate), with histograms for both groups.
        t, p = ttest_ind(withValDangerRate, withoutValDangerRate, equal_var=False)
        plt.subplot(321)
        plt.hist(withValDangerRate, density=True)
        plt.xlabel("Danger Rate")
        plt.ylabel("Probability")
        plt.subplot(322)
        plt.hist(withoutValDangerRate, density=True)
        plt.xlabel("Danger Rate")
        plt.ylabel("Probability")
        diffMeansDangerRate: float = np.mean(withValDangerRate) - np.mean(
            withoutValDangerRate
        )
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansDangerRate) + "\n")
        writer.write("d = " + str(diffMeansDangerRate / np.std(dangerRate)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")
        # Golden eggs collected.
        t, p = ttest_ind(withValGoldenTotal, withoutValGoldenTotal, equal_var=False)
        plt.subplot(323)
        plt.hist(withValGoldenTotal, density=True)
        plt.xlabel("Golden Eggs")
        plt.ylabel("Probability")
        plt.subplot(324)
        plt.hist(withoutValGoldenTotal, density=True)
        plt.xlabel("Golden Eggs")
        plt.ylabel("Probability")
        diffMeansGoldenTotal: float = np.mean(withValGoldenTotal) - np.mean(
            withoutValGoldenTotal
        )
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansGoldenTotal) + "\n")
        writer.write("d = " + str(diffMeansGoldenTotal / np.std(goldenTotal)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")
        # Power eggs collected.
        t, p = ttest_ind(withValPowerTotal, withoutValPowerTotal, equal_var=False)
        plt.subplot(325)
        plt.hist(withValPowerTotal, density=True)
        plt.xlabel("Power Eggs")
        plt.ylabel("Probability")
        plt.subplot(326)
        plt.hist(withoutValPowerTotal, density=True)
        plt.xlabel("Power Eggs")
        plt.ylabel("Probability")
        # Use the power egg lists here (not the golden egg lists) so the
        # difference matches the metric being reported.
        diffMeansPowerTotal: float = np.mean(withValPowerTotal) - np.mean(
            withoutValPowerTotal
        )
        writer.write("x\u0304_1 - x\u0304_2 = " + str(diffMeansPowerTotal) + "\n")
        writer.write("d = " + str(diffMeansPowerTotal / np.std(powerTotal)) + "\n")
        writer.write("t = " + str(t) + "\n")
        writer.write("p = " + str(p) + "\n")
        writer.write("\n")
        i += 1
# Show every stage's figure, each in its own window.
plt.show()
```

This is a big segment.

First, it creates a bunch of lists to store all the data retrieved.

Next it goes through the data files of that iteration, retrieving all the data needed for the calculations.

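Then, for each metric, it calculates the difference in means, d (that difference divided by the standard deviation of the full data set), and the t and p values from Welch's t-test (`ttest_ind` with `equal_var=False`).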

It also creates a histogram for all the metrics except the waves cleared.

All of the data generated gets written to the `"reports/stages.txt"` file, and the histograms are displayed in their own windows on screen.
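Since the same three calculations repeat for every metric, they could be gathered into one helper. The sketch below is my own illustration rather than the script's code: it uses only the standard library, computes Welch's t statistic by hand, and leaves out the p value, which the script gets from scipy's `ttest_ind`. The function name and the sample numbers are made up.

```python
import math
import statistics


def compare_metric(with_vals, without_vals, all_vals):
    """Return (mean difference, d, Welch t) for one metric.

    d divides by the population standard deviation of all_vals, matching
    the script's use of np.std on the full data set. The t statistic uses
    sample variances, as Welch's test does.
    """
    diff = statistics.fmean(with_vals) - statistics.fmean(without_vals)
    d = diff / statistics.pstdev(all_vals)
    standard_error = math.sqrt(
        statistics.variance(with_vals) / len(with_vals)
        + statistics.variance(without_vals) / len(without_vals)
    )
    t = diff / standard_error
    return diff, d, t


# Made-up numbers for illustration only.
on_stage = [1.0, 2.0, 3.0, 4.0, 5.0]
off_stage = [2.0, 4.0, 6.0, 8.0, 10.0]

diff, d, t = compare_metric(on_stage, off_stage, on_stage + off_stage)
print(diff, d, t)
```

Calling a helper like this once per metric would also make it harder to mix up the lists of two different metrics.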
