<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniel Bray</title>
    <description>The latest articles on DEV Community by Daniel Bray (@danielbraysonalake).</description>
    <link>https://dev.to/danielbraysonalake</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F475136%2Ff8c27062-56f7-44ec-805f-20a721231729.jpeg</url>
      <title>DEV Community: Daniel Bray</title>
      <link>https://dev.to/danielbraysonalake</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danielbraysonalake"/>
    <language>en</language>
    <item>
      <title>Using Shotgun to Find and Limit Indirect Dependencies</title>
      <dc:creator>Daniel Bray</dc:creator>
      <pubDate>Tue, 26 Oct 2021 08:25:16 +0000</pubDate>
      <link>https://dev.to/sonalake/using-shotgun-to-find-and-limit-indirect-dependencies-mdo</link>
      <guid>https://dev.to/sonalake/using-shotgun-to-find-and-limit-indirect-dependencies-mdo</guid>
      <description>&lt;p&gt;On a well-run project, over time, novelty tends to zero, and all good software, if it’s being used at all, will eventually go into maintenance.&lt;/p&gt;

&lt;p&gt;At the beginning of a project, when designs are still coming together, commits are likely to be in many different components of the application. As the application moves into maintenance, then – if the &lt;a href="https://en.wikipedia.org/wiki/SOLID"&gt;&lt;strong&gt;SOLID&lt;/strong&gt;&lt;/a&gt; principles have been applied – one would expect commits to impact smaller and smaller sections of the codebase. This would indicate that issue fixes and the like don’t require lots of small changes all over the codebase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sonalake/shotgun"&gt;&lt;strong&gt;Shotgun&lt;/strong&gt;&lt;/a&gt; (and its related &lt;a href="https://github.com/sonalake/shotgun-gradle-plugin"&gt;&lt;strong&gt;gradle plugin&lt;/strong&gt;&lt;/a&gt;) is a new tool that can identify overly complex and interdependent elements of your code base that other tools can’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does this give me that a cyclomatic complexity rating doesn’t?
&lt;/h2&gt;

&lt;p&gt;We already have tools for calculating the complexity of code (e.g. &lt;a href="https://en.wikipedia.org/wiki/Cyclomatic_complexity"&gt;&lt;strong&gt;Cyclomatic Complexity&lt;/strong&gt;&lt;/a&gt;) but they rely on finding the elements of code that talk directly to each other. Shotgun, however, reports on which elements of code are updated at the same time, measuring how coherent your commits are within a code hierarchy.&lt;/p&gt;

&lt;p&gt;For example, if you had an event-based architecture, where services interact through a queue, then changing one service might require changes in the events being sent, and so on down to the receiving services. A cyclomatic complexity rating wouldn’t take account of this, since the services don’t directly interact with each other. Shotgun, however, will report it when these elements are updated at the same time over and over again.&lt;/p&gt;

&lt;p&gt;The idea is that – if we’ve been doing our job right – then over time the complexity of commits is getting smaller. If it’s not, then it’s likely that “small” commits are touching too many files because the code is overly interdependent.&lt;/p&gt;

&lt;p&gt;The report looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zXW5go03--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zXW5go03--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun3.png" alt="Shotgun complexity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A heatmap showing the complexity of each day’s commits.&lt;/li&gt;
&lt;li&gt;  A list of active commit sets: these are sets of files that are regularly committed in one go. These file sets have a high interdependence.&lt;/li&gt;
&lt;li&gt;  A list of active files: these are single files that are being updated a lot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also click on any given day and view the details of the commits for that day.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do I do with this information?
&lt;/h2&gt;

&lt;p&gt;Shotgun tells you if every change you’re making is a big change in lots of different components.&lt;/p&gt;

&lt;p&gt;At the beginning of a project this is normal – you’re just figuring things out. If, after a while, most commits are touching elements all over the codebase, then you need to start thinking about refactoring.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The active commit sets will point to groups of elements that are committed together a lot. These are highly coherent. If the sets are too big then there’s probably a need to abstract some common behaviours to reduce the interdependence.&lt;/li&gt;
&lt;li&gt;  It’s also worth checking the larger commits for the same issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How is shotgun coherency calculated?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The aim is to derive a score for each commit that is lower if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The commit is limited to a small number of files.&lt;/li&gt;
&lt;li&gt;  The commit is limited to files in the same package.&lt;/li&gt;
&lt;li&gt;  The commit is limited to files in the same package hierarchy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The actual score is calculated as follows.&lt;/p&gt;

&lt;p&gt;Imagine a commit like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IGbVzOJL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IGbVzOJL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun1.png" alt="Commit"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Ignore any merge commits – we don’t want to risk double-counting, or be at the mercy of whether or not the merge was fast-forwarded.&lt;/li&gt;
&lt;li&gt;  Ignore any files that were deleted – removing code doesn’t add to the complexity of the application.&lt;/li&gt;
&lt;li&gt;  Split the files up into different source sets, e.g.

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;src/main/java&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;src/main/resources&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;src/test/java&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;src/test/resources&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  Build a set of simple directed graphs, where the vertices are the directories and files in the commit.&lt;/li&gt;
&lt;li&gt;  Prune out the roots of these graphs so we’re left with only the common root of the commit. In this example we’d be removing “&lt;code&gt;com.sonalake&lt;/code&gt;”.&lt;/li&gt;
&lt;li&gt;  Finally, add up all the edges of the graphs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the shotgun coherency score.&lt;/p&gt;

&lt;p&gt;Where there are multiple commits on a given day, the median score is used for that day.&lt;/p&gt;

&lt;p&gt;Some examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A single file commit | Score: 1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uBcypXND--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBcypXND--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun2.png" alt="A single file commit"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two files in the same directory | Score: 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qEugw-gD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qEugw-gD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun4.png" alt="Two files in the same directory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two files in the same hierarchy | Score: 3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D2DUspVO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D2DUspVO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun6.png" alt="Two files in the same hierarchy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two files in parallel directories | Score: 4&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VCAD1LCc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VCAD1LCc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/10/shotgun5.png" alt="Two files in parallel directories"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I use it?
&lt;/h2&gt;

&lt;p&gt;There is a basic library that comes with a command line – &lt;a href="https://github.com/sonalake/shotgun"&gt;&lt;strong&gt;shotgun&lt;/strong&gt;&lt;/a&gt; – where you can find full details of the configuration parameters. In short, you can define things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The different source sets.&lt;/li&gt;
&lt;li&gt;  How small a commit must be before it is excluded from the home page.&lt;/li&gt;
&lt;li&gt;  The size of the heatmap buckets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The easiest way to use the tool in a project is to use the &lt;a href="https://github.com/sonalake/shotgun-gradle-plugin/"&gt;&lt;strong&gt;shotgun-gradle-plugin&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just drop this in as a plugin in your gradle configuration (full details &lt;a href="https://plugins.gradle.org/plugin/com.sonalake.shotgun-gradle-plugin"&gt;&lt;strong&gt;here&lt;/strong&gt;&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;plugins&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sonalake&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;shotgun&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;gradle&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;plugin&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="s"&gt;"1.0.0"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And configure it appropriately for your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;shotgun&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;inputDirectory&lt;/span&gt;            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"$projectDir"&lt;/span&gt;
  &lt;span class="n"&gt;outputFile&lt;/span&gt;                &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".shotgun/report.htm"&lt;/span&gt;
  &lt;span class="n"&gt;sourceSets&lt;/span&gt;                &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"src/main/java"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                                &lt;span class="s"&gt;"src/main/resources"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                                &lt;span class="s"&gt;"src/main/webapp"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                                &lt;span class="s"&gt;"src/test/java"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                                &lt;span class="s"&gt;"src/test/resources"&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;minimumCommitInterest&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt;   &lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="n"&gt;topCommitValueForFileSets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;   &lt;span class="mi"&gt;10&lt;/span&gt;
  &lt;span class="n"&gt;topCommitValueForFiles&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt;   &lt;span class="mi"&gt;40&lt;/span&gt;
  &lt;span class="n"&gt;legendLevels&lt;/span&gt;              &lt;span class="o"&gt;=&lt;/span&gt;   &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why is it called shotgun?
&lt;/h2&gt;

&lt;p&gt;The purpose of this app is to spot when changes are consistently being applied in a scattergun approach over the entire codebase.&lt;/p&gt;

&lt;p&gt;Also because of &lt;a href="https://www.youtube.com/watch?v=lc7I9NLPt9A"&gt;&lt;strong&gt;this&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can I contribute to this project?
&lt;/h2&gt;

&lt;p&gt;We hope this tool is useful to everyone, so we made it public, along with some other tools, libraries and examples, in our &lt;a href="https://github.com/sonalake"&gt;&lt;strong&gt;Sonalake github project&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you think there are improvements to make, please fork the project and submit them, and we’d be delighted to review and merge them.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Constraint Programming: Solving Sudoku with Choco Solver library</title>
      <dc:creator>Daniel Bray</dc:creator>
      <pubDate>Tue, 27 Apr 2021 09:51:20 +0000</pubDate>
      <link>https://dev.to/sonalake/constraint-programming-solving-sudoku-with-choco-solver-library-3mbj</link>
      <guid>https://dev.to/sonalake/constraint-programming-solving-sudoku-with-choco-solver-library-3mbj</guid>
      <description>&lt;h2&gt;
  
  
  Why solve sudoku?
&lt;/h2&gt;

&lt;p&gt;Enterprise application development is, for the most part, solving one of these types of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;“Let me create, read, update and delete these things”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;“Do this same task to as many things as possible, as quickly as possible”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;“What’s the best way to allocate the resources I have to do these tasks, if whatever does task X, can’t be used to do task Y?”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last problem type is a &lt;a href="https://en.wikipedia.org/wiki/Graph_coloring"&gt;&lt;strong&gt;graph colouring&lt;/strong&gt;&lt;/a&gt; problem, and the nature of these is that solving one of them is much the same as solving another.&lt;/p&gt;

&lt;p&gt;Sudoku is one of these types of problems, but it has very simple rules, so it’s a nice playground to try out different ways to solve graph colouring problems. This post outlines a solution using &lt;a href="https://en.wikipedia.org/wiki/Constraint_programming"&gt;&lt;strong&gt;constraint programming&lt;/strong&gt;&lt;/a&gt; with &lt;a href="https://choco-solver.org/"&gt;&lt;strong&gt;choco solver&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is constraint programming?
&lt;/h2&gt;

&lt;p&gt;Constraint programming is a paradigm for programming that can be a little unusual the first time you come to it, since it’s completely different to imperative programming.&lt;/p&gt;

&lt;p&gt;In short: you tell the program what problem needs to be solved, but not how to solve that problem.&lt;/p&gt;

&lt;p&gt;What this means in practical terms is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  First you define your variables.

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;These are my tasks&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;These are my workers&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  Then you define the domains in which these variables exist. For example:

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;This variable has to have the value of 3&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;This variable can have any value between 1 and 42&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;This variable is a set of between 4 and 10 numbers, all taken from a domain running from 1 to 99.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  Then you define the constraints for these variables, for example:

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;If one variable has a value, the other variable must have a different value.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;One variable must be the sum/max/min of a few other variables.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;One set must be a complement of another set.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  If you’re looking for any solution, then you’re done. If you’re looking for the best solution, then you need to define a set of cost variables that you aim to minimise or maximise. For example:

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Find a solution that minimises the “cost of doing business” variable.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Find a solution that maximises the “how many messages are being transmitted” variable.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  Finally, you give this to the framework to solve, and it will use different AI approaches to find solutions to the problem you’ve defined.&lt;/li&gt;
&lt;/ul&gt;
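To make the steps above concrete, here is a toy version, hand-rolled with brute force purely for illustration (a real solver like Choco searches this space far more cleverly): two variables with domains of 1 to 9, two constraints, and a cost variable to maximise.

```java
public class ToyConstraintDemo {
    // Variables x and y; domains 1..9; constraints x != y and x + y == 10;
    // cost variable to maximise: x * y. Returns {x, y, cost}.
    public static int[] solve() {
        int bestX = 0, bestY = 0, bestCost = -1;
        for (int x = 1; x != 10; x++) {          // domain of x
            for (int y = 1; y != 10; y++) {      // domain of y
                if (x == y || x + y != 10) {     // constraints
                    continue;
                }
                if (x * y > bestCost) {          // maximise the cost variable
                    bestCost = x * y;
                    bestX = x;
                    bestY = y;
                }
            }
        }
        return new int[] {bestX, bestY, bestCost};
    }
}
```

The best feasible assignment here is 4 and 6: they differ, sum to 10, and their product (24) beats every other feasible pair.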

&lt;h2&gt;
  
  
  Solving Sudoku in Choco Solver
&lt;/h2&gt;

&lt;p&gt;So, to make these ideas more concrete, we’ll use them to solve a simple problem.&lt;/p&gt;

&lt;p&gt;For this example, we will choose the &lt;a href="https://puzzling.stackexchange.com/questions/252/how-do-i-solve-the-worlds-hardest-sudoku"&gt;&lt;strong&gt;world’s hardest sudoku problem&lt;/strong&gt;&lt;/a&gt;. You can find a full example of this in &lt;a href="https://github.com/sonalake/chocosolver-samples"&gt;&lt;strong&gt;Sudoku.java&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HOImluMD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/04/sudoku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HOImluMD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2021/04/sudoku.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don’t know &lt;a href="https://en.wikipedia.org/wiki/Sudoku"&gt;&lt;strong&gt;Sudoku&lt;/strong&gt;&lt;/a&gt;, the rules are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The grid is a 9 x 9 area of squares.&lt;/li&gt;
&lt;li&gt;  Each square must contain a single number, from 1 to 9.&lt;/li&gt;
&lt;li&gt;  The same number can’t appear in the same row twice.&lt;/li&gt;
&lt;li&gt;  The same number can’t appear in the same column twice.&lt;/li&gt;
&lt;li&gt;  The grid is broken down into 9 distinct sub-grids of 9 squares each. The same number can’t appear in the same sub-grid twice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it. If you’re wondering what this has to do with resource usage, you could imagine the numbers represent available channels in a cell tower, and the squares represent the messages that need to be sent. The solution to this problem will tell you how you could send out these messages in an evenly distributed manner, without getting any resource contention on the channels.&lt;/p&gt;
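Those three “no repeats” rules can be stated as a plain validity check, independent of any solver. This is a hypothetical helper for illustration (using 0 to mark an empty square), not code from the sample repository:

```java
public class SudokuRules {
    // Check the three Sudoku constraints on a 9x9 grid; 0 marks an empty square.
    public static boolean isValid(int[][] grid) {
        for (int i = 0; i != 9; i++) {
            boolean[] row = new boolean[10];
            boolean[] col = new boolean[10];
            boolean[] box = new boolean[10];
            for (int j = 0; j != 9; j++) {
                int r = grid[i][j];                                      // i-th row
                int c = grid[j][i];                                      // i-th column
                int b = grid[(i / 3) * 3 + j / 3][(i % 3) * 3 + j % 3];  // i-th 3x3 sub-grid
                if (r != 0 && row[r]) return false;
                if (c != 0 && col[c]) return false;
                if (b != 0 && box[b]) return false;
                if (r != 0) row[r] = true;
                if (c != 0) col[c] = true;
                if (b != 0) box[b] = true;
            }
        }
        return true;
    }
}
```

A constraint solver does essentially the reverse: instead of checking a finished grid, it searches for values that make every one of these checks pass.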

&lt;h3&gt;
  
  
  Getting started with Choco Solver
&lt;/h3&gt;

&lt;p&gt;The complete code for this is available here: &lt;a href="https://github.com/sonalake/chocosolver-samples"&gt;&lt;strong&gt;Sudoku&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we do anything else, we must first create an empty &lt;a href="https://choco-solver.org/docs/modeling/"&gt;&lt;strong&gt;model&lt;/strong&gt;&lt;/a&gt; for the game. This is as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sudoku"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model is where new variables, constraints and optimizations are created.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining variables and domains
&lt;/h3&gt;

&lt;p&gt;For the sudoku problem, there are 81 variables, one for each square in our grid.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;IntVar&lt;/span&gt;&lt;span class="o"&gt;[][]&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;IntVar&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These come in one of two flavours:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If the value is not known at the start, then we need to define it with a domain of possible values; in our case, these are values from 1 to 9:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;intVar&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  If the value is known at the start, then we define it as a simple constant that can’t change. This is, in effect, saying that the variable has a domain of a single value.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;intVar&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Define constraints
&lt;/h3&gt;

&lt;p&gt;Once we have the variables, we need to set up their constraints: we have 9 rows, columns, and sub-squares where the values must all be different. We use the &lt;a href="https://choco-solver.org/docs/modeling/intconstraints/"&gt;&lt;strong&gt;allDifferent&lt;/strong&gt;&lt;/a&gt; constraint for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allDifferent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;getCellsInRow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allDifferent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;getCellsInColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allDifferent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;getCellsInSquare&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Solve it
&lt;/h3&gt;

&lt;p&gt;Finally, we solve it as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Solver&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSolver&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;solve&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choco solver will keep looking for solutions until it gives up: because it has exhausted them all, because it thinks it won’t find anything better, or because you’ve told it to stop after doing enough work.&lt;/p&gt;

&lt;p&gt;Once this is done, you can get at the value that choco solver found for each cell as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;].&lt;/span&gt;&lt;span class="na"&gt;getValue&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it: the world’s hardest sudoku is solved in under a second.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what else can you do with choco solver?
&lt;/h2&gt;

&lt;p&gt;Whatever you want it to do, so long as you can turn it into a combinatorial constraint problem.&lt;/p&gt;

&lt;p&gt;For example, here’s a more complex graph colouring program: &lt;a href="https://github.com/sonalake/chocosolver-samples"&gt;&lt;strong&gt;GraphColouring.java&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What it’s doing is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Given a graph: &lt;em&gt;imagine these are tasks that can’t be done at the same time.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Given you have N possible colours to choose from: &lt;em&gt;imagine these are workers that are available to do these tasks.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Given that no colour can be used more than M times: &lt;em&gt;imagine a worker can only do so many things in a day.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Given you want to use the least number of colours: &lt;em&gt;use the fewest workers required to do this work.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Colour the graph in: &lt;em&gt;suggest a roster for the tasks and workers that uses the least number of people without overworking them.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
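Those rules translate directly into a feasibility check. Here is a sketch with illustrative names (not taken from GraphColouring.java): given an adjacency matrix of clashing tasks, a candidate colouring, and a per-colour usage cap, is the roster valid?

```java
public class ColouringCheck {
    // Feasibility check for the roster rules above: adjacent vertices
    // (tasks that clash) must get different colours (workers), and no
    // colour may be used more than maxUses times. Colours are 0..n-1.
    public static boolean isFeasible(boolean[][] adj, int[] colour, int maxUses) {
        int n = colour.length;
        int[] uses = new int[n + 1];
        for (int i = 0; i != n; i++) {
            uses[colour[i]]++;
            if (uses[colour[i]] > maxUses) {
                return false; // this worker is overworked
            }
            for (int j = i + 1; j != n; j++) {
                if (adj[i][j] && colour[i] == colour[j]) {
                    return false; // clashing tasks share a worker
                }
            }
        }
        return true;
    }
}
```

A constraint solver would then search the space of colourings for one that passes this check while minimising the number of distinct colours used.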

&lt;h2&gt;
  
  
  Where do we go now?
&lt;/h2&gt;

&lt;p&gt;This post refers to two simple examples of what is possible, but really most combinatorial problems can be solved using this approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  For a more formal and complete description of constraint programming, check out &lt;a href="https://arxiv.org/pdf/cs/0602027.pdf"&gt;&lt;strong&gt;Explaining Constraint Programming&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  For a nice explanation of how to formalise actual problems, check out lectures 7, 8 and 9 of this AI &lt;a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/"&gt;&lt;strong&gt;lecture series&lt;/strong&gt;&lt;/a&gt; from MIT (actually, all of this series is worth watching)

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/lecture-7-constraints-interpreting-line-drawings"&gt;&lt;strong&gt;Lecture 7: Constraints: Interpreting Line Drawings&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/lecture-8-constraints-search-domain-reduction"&gt;&lt;strong&gt;Lecture 8: Constraints: Search, Domain Reduction&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/lecture-9-constraints-visual-object-recognition"&gt;&lt;strong&gt;Lecture 9: Constraints: Visual Object Recognition&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>java</category>
      <category>programming</category>
    </item>
    <item>
      <title>Part 4: Hypothesis Testing of frequency-based samples</title>
      <dc:creator>Daniel Bray</dc:creator>
      <pubDate>Tue, 16 Feb 2021 15:59:19 +0000</pubDate>
      <link>https://dev.to/sonalake/part-4-hypothesis-testing-of-frequency-based-samples-48oi</link>
      <guid>https://dev.to/sonalake/part-4-hypothesis-testing-of-frequency-based-samples-48oi</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;&lt;strong&gt;part one of this series&lt;/strong&gt;&lt;/a&gt;, we introduced the idea of hypothesis testing, along with a full description of the different elements that go into using these tools. It ended with a cheat-sheet to help you choose which test to use based on the kind of data you’re testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sonalake/part-2-hypothesis-testing-of-proportion-based-samples-1ik2"&gt;&lt;strong&gt;Part two&lt;/strong&gt;&lt;/a&gt; outlined some code samples for how to perform z-tests on proportion-based samples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sonalake/part-3-hypothesis-testing-of-mean-based-samples-5c16"&gt;&lt;strong&gt;Part three&lt;/strong&gt;&lt;/a&gt; outlined some code samples for how to perform t-tests on mean-based samples.&lt;/p&gt;

&lt;p&gt;This post will now go into more detail for &lt;strong&gt;frequency-based&lt;/strong&gt; samples.&lt;/p&gt;


&lt;p&gt;If any of these terms – &lt;em&gt;Null Hypothesis, Alternative Hypothesis, p-value&lt;/em&gt; – are new to you, then I’d suggest reviewing the first part of this series before carrying on with this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a frequency-based sample?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In these cases we’re interested in checking frequencies, e.g. I’m expecting my result set to have a given distribution: does it?&lt;/p&gt;

&lt;p&gt;Are the differences between the distributions of two samples big enough that we should take notice? Are the differences between the distributions of variables in a single sample big enough to indicate that the variables might depend on each other?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Requirements for the quality of the sample&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For these tests the following sampling rules are required:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;

&lt;tbody&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Random&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must be a random sample from the entire population&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Normal&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The expected counts must be “big enough” – for these tests a good rule of thumb is that, given the sample size, every category’s expected count is at least 5:
&lt;ul&gt;
&lt;li&gt;For example: suppose a network was sized to carry 5% real-time traffic and 95% best-effort traffic; a sample of size 50 would then have an expected count of only 2.5 real-time messages – less than 5 – so the sample would be rejected as not being big enough&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Independent&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must be independent – for these tests a good rule of thumb is that the sample size be less than 10% of the total population.&lt;/td&gt;

&lt;/tr&gt;

&lt;/tbody&gt;

&lt;/table&gt;&lt;/div&gt;
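&lt;p&gt;The rule of thumb for chi-squared tests – every category’s expected count should be at least 5 – is easy to check up front. A minimal sketch (the proportions and sample sizes are illustrative):&lt;/p&gt;

```python
def expected_counts_ok(expected_proportions, sample_size, minimum=5):
    """Chi-squared rule of thumb: every category's expected count >= minimum."""
    return all(sample_size * p >= minimum for p in expected_proportions)

proportions = [0.05, 0.10, 0.15, 0.70]

print(expected_counts_ok(proportions, 650))  # 650 * 0.05 = 32.5, fine
print(expected_counts_ok(proportions, 50))   # 50 * 0.05 = 2.5, too small
```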

&lt;h2&gt;
  
  
  &lt;strong&gt;Tests for frequency-based samples&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All of these code samples are available in &lt;a href="https://bitbucket.org/sonalake/blog-hypothesis-testing"&gt;&lt;strong&gt;this git repository&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Chi-squared goodness-of-fit&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compare the counts for some variables in a sample to an expected distribution&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this test we have an expected distribution of data across a category, and we want to check if the sample matches that.&lt;/p&gt;

&lt;p&gt;For example, suppose a network was sized to carry traffic with the expected distribution below, and a sample of 650 messages observed the following counts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;

&lt;tbody&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Class of Service&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;&lt;strong&gt;Expected Distribution&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;&lt;strong&gt;Observed Count in sample (size 650)&lt;/strong&gt;&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;A&lt;/td&gt;

&lt;td&gt;5%&lt;/td&gt;

&lt;td&gt;27&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;B&lt;/td&gt;

&lt;td&gt;10%&lt;/td&gt;

&lt;td&gt;73&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;C&lt;/td&gt;

&lt;td&gt;15%&lt;/td&gt;

&lt;td&gt;82&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;D&lt;/td&gt;

&lt;td&gt;70%&lt;/td&gt;

&lt;td&gt;468&lt;/td&gt;

&lt;/tr&gt;

&lt;/tbody&gt;

&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Given a null hypothesis that the distribution is as expected, the following Python code derives the probability of seeing counts at least this far from that distribution, assuming the null hypothesis is true.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chisquare&lt;/span&gt;

&lt;span class="c1"&gt;# can we assume anything from our sample
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;

&lt;span class="c1"&gt;# what do we expect to see in proportions?
&lt;/span&gt;&lt;span class="n"&gt;expected_proportions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# what counts did we see in our sample?
&lt;/span&gt;&lt;span class="n"&gt;observed_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;468&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;########################
# how big was our sample
&lt;/span&gt;&lt;span class="n"&gt;sample_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# we derive our comparison counts here for  our expected proportions, based on the sample size
&lt;/span&gt;&lt;span class="n"&gt;expected_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_proportions&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Get the stat data
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chi_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chisquare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'chi_stat: %0.5f, p_value: %0.5f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chi_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Chi-squared (homogeneity)&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compare the counts for some variables between two samples&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this case the test is similar to the goodness-of-fit test (above), but rather than deriving expected counts from an expected distribution, it compares two sets of sampled counts to see if their frequencies differ enough to suggest that the underlying populations have different distributions.&lt;/p&gt;

&lt;p&gt;This is, in effect, the same code as above – only in this case both sets of counts come from samples, so we compare the two observed counts directly rather than deriving expected counts from a distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chisquare&lt;/span&gt;

&lt;span class="c1"&gt;# can we assume anything from our sample
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;

&lt;span class="c1"&gt;# what counts did we see in our samples?
&lt;/span&gt;&lt;span class="n"&gt;observed_counts_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;97&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;observed_counts_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;468&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;########################
&lt;/span&gt;
&lt;span class="c1"&gt;# Get the stat data
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chi_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chisquare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observed_counts_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed_counts_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'chi_stat: %0.5f, p_value: %0.5f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chi_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
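&lt;p&gt;One caveat: &lt;code&gt;chisquare&lt;/code&gt; expects the two sets of counts to have (roughly) the same total, and recent versions of scipy may raise an error when they don’t, as here (644 vs 650). A more robust route – a sketch, not part of the original post – is to stack the two samples into a contingency table and use &lt;code&gt;chi2_contingency&lt;/code&gt;, which estimates the expected counts from the row and column totals and so handles unequal sample sizes:&lt;/p&gt;

```python
from scipy.stats import chi2_contingency
import numpy as np

significance = 0.05

# rows: the two samples; columns: the four classes of service
counts = np.array([
    [32, 65, 97, 450],   # sample A
    [27, 73, 82, 468],   # sample B
])

# chi2_contingency derives the expected counts from the row/column totals,
# so the two samples do not need identical sizes
chi_stat, p_value, dof, expected = chi2_contingency(counts)
print('chi_stat: %0.5f, p_value: %0.5f, dof: %d' % (chi_stat, p_value, dof))

if p_value > significance:
    print("Fail to reject the null hypothesis - we have nothing else to say")
else:
    print("Reject the null hypothesis - suggest the alternative hypothesis is true")
```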



&lt;h3&gt;
  
  
  &lt;strong&gt;Chi-squared (independence)&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Check a single sample to see if two discrete variables are independent&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this case you have a sample from a population, over two discrete variables, and you want to tell if these two discrete variables have some kind of relationship – or if they are independent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; this is for &lt;em&gt;discrete&lt;/em&gt; variables (i.e. categories). If you wanted to check if numeric variables are independent you’d want to consider using something like a linear regression.&lt;/p&gt;

&lt;p&gt;Suppose we had a pivot to see how people from different area types (town/country) voted for three different political parties.&lt;/p&gt;

&lt;p&gt;The question we are asking is whether there is likely to be a connection between these two variables (i.e. do town and country voters have a strong preference for a given party?).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;

&lt;tbody&gt;

&lt;tr&gt;

&lt;td&gt;&lt;/td&gt;

&lt;td&gt;&lt;em&gt;Party&lt;/em&gt;&lt;/td&gt;

&lt;td&gt;&lt;/td&gt;

&lt;td&gt;&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;/td&gt;

&lt;td&gt;&lt;strong&gt;Cocktail Party&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;&lt;strong&gt;Garden Party&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;&lt;strong&gt;Mouse Party&lt;/strong&gt;&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;em&gt;Voter Type&lt;/em&gt;&lt;/td&gt;

&lt;td&gt;&lt;/td&gt;

&lt;td&gt;&lt;/td&gt;

&lt;td&gt;&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Town&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;200&lt;/td&gt;

&lt;td&gt;150&lt;/td&gt;

&lt;td&gt;50&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Country&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;250&lt;/td&gt;

&lt;td&gt;300&lt;/td&gt;

&lt;td&gt;50&lt;/td&gt;

&lt;/tr&gt;

&lt;/tbody&gt;

&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The python code to check this is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chi2_contingency&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# can we assume anything from our sample
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;

&lt;span class="n"&gt;pivot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="c1"&gt;# town votes
&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;# country votes
&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;########################
# Get the stat data
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chi_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degrees_of_freedom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chi2_contingency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'chi_stat: %0.5f, p_value: %0.5f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chi_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
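&lt;p&gt;For intuition, the &lt;code&gt;expected&lt;/code&gt; array returned by &lt;code&gt;chi2_contingency&lt;/code&gt; is what the table would look like if the two variables really were independent: each cell is (row total × column total) / grand total. A quick sketch with the voting table above:&lt;/p&gt;

```python
import numpy as np

pivot = np.array([
    [200, 150, 50],   # town votes
    [250, 300, 50],   # country votes
])

row_totals = pivot.sum(axis=1, keepdims=True)   # [[400], [600]]
col_totals = pivot.sum(axis=0, keepdims=True)   # [[450, 450, 100]]
grand_total = pivot.sum()                       # 1000

# under independence, each cell is row_total * col_total / grand_total
expected = row_totals * col_totals / grand_total
print(expected)
```

&lt;p&gt;The chi-squared statistic then measures how far the observed table sits from this independent one.&lt;/p&gt;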



&lt;h2&gt;
  
  
  &lt;strong&gt;Where do we go next?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Thank you for reading the final part of our introduction to hypothesis testing. I hope you found it a useful introduction to the world of statistical analysis. If you would like to look deeper into this field, I’d suggest the following.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I’ve not touched on issues of &lt;a href="https://en.wikipedia.org/wiki/Power_of_a_test"&gt;&lt;strong&gt;power&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Effect_size"&gt;&lt;strong&gt;effect size&lt;/strong&gt;&lt;/a&gt; in this series. For that I would direct you to Robert Coe’s always worth reading: &lt;a href="https://www.leeds.ac.uk/educol/documents/00002182.htm"&gt;&lt;strong&gt;It’s the effect size, stupid: what effect size is and why it is important&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  If you have more complex types of data to examine, then I’d suggest reading more into

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://en.wikipedia.org/wiki/Analysis_of_variance"&gt;&lt;strong&gt;Analysis Of Variance&lt;/strong&gt;&lt;/a&gt; – for when you have means in more than two sets of groups to compare, and using multiple t-sets would waste your power.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://en.wikipedia.org/wiki/Linear_regression"&gt;&lt;strong&gt;Linear Regression&lt;/strong&gt;&lt;/a&gt; – for when you want to predict the value of one continuous variable, based on the values of some other continuous value, or just want to see if different continuous variables are, in fact, related.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  If our previous post – &lt;a href="https://sonalake.com/latest/quantitative-analysis-is-as-subjective-as-qualitative-analysis/"&gt;&lt;strong&gt;Quantitative analysis is as subjective as qualitative analysis&lt;/strong&gt;&lt;/a&gt; – is making you doubt whether you can trust stats at all, then check out how &lt;a href="https://en.wikipedia.org/wiki/Meta-analysis"&gt;&lt;strong&gt;meta-analysis&lt;/strong&gt;&lt;/a&gt; can be used to combine the results of multiple different analyses, and produce a single overall measure of whether the underlying tests show a significant effect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you would like to know more or have any suggestions, please don’t hesitate to reach out to us!&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;PART I: &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;An Introduction to Hypothesis Testing&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART II: &lt;a href="https://dev.to/sonalake/part-2-hypothesis-testing-of-proportion-based-samples-1ik2"&gt;Hypothesis Testing of proportion-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART III: &lt;a href="https://dev.to/sonalake/part-3-hypothesis-testing-of-mean-based-samples-5c16"&gt;Hypothesis Testing of mean-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 3: Hypothesis Testing of mean-based samples</title>
      <dc:creator>Daniel Bray</dc:creator>
      <pubDate>Wed, 10 Feb 2021 11:44:52 +0000</pubDate>
      <link>https://dev.to/sonalake/part-3-hypothesis-testing-of-mean-based-samples-5c16</link>
      <guid>https://dev.to/sonalake/part-3-hypothesis-testing-of-mean-based-samples-5c16</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;&lt;strong&gt;part one of this series&lt;/strong&gt;&lt;/a&gt;, we introduced the idea of hypothesis testing, along with a full description of the different elements that go into using these tools. It ended with a cheat-sheet to help you choose which test to use based on the kind of data you’re testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sonalake/part-2-hypothesis-testing-of-proportion-based-samples-1ik2"&gt;&lt;strong&gt;Part two&lt;/strong&gt;&lt;/a&gt; outlined some code samples for how to perform z-tests on proportion-based samples.&lt;/p&gt;

&lt;p&gt;This post will now go into more detail for &lt;strong&gt;mean-based&lt;/strong&gt; samples.&lt;/p&gt;


&lt;p&gt;If any of the terms – &lt;em&gt;Null Hypothesis&lt;/em&gt;, &lt;em&gt;Alternative Hypothesis&lt;/em&gt;, &lt;em&gt;p-value&lt;/em&gt; – are new to you, then &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;&lt;strong&gt;I’d suggest reviewing the first part of this series&lt;/strong&gt;&lt;/a&gt; before carrying on with this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a mean-based sample?
&lt;/h2&gt;

&lt;p&gt;In these cases we’re interested in checking the arithmetic mean of some samples. This could be checking if the sample’s mean matches some expected value, or comparing two samples from two different populations, or comparing two samples from the same population, taken before and after some intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements for the quality of the sample
&lt;/h2&gt;

&lt;p&gt;For these tests the following sampling rules are required:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;

&lt;tbody&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Random&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must be a random sample from the entire population&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Normal&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must be normal – for these tests either:
&lt;ul&gt;
&lt;li&gt;The underlying population must be normal – this can be tricky, as a population might normally be normal, only to be non-normal the day you sample it 😉&lt;/li&gt;
&lt;li&gt;If you can’t assume the underlying population is normal, then you should use a sample size of at least 30 (as per the central limit theorem)&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Independent&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must be independent – for these tests a good rule of thumb is that the sample size be less than 10% of the total population.&lt;/td&gt;

&lt;/tr&gt;

&lt;/tbody&gt;

&lt;/table&gt;&lt;/div&gt;
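&lt;p&gt;The n ≥ 30 rule of thumb comes from the central limit theorem: whatever shape the population has, the means of samples of that size are approximately normally distributed around the population mean, with standard deviation σ/√n. A quick simulation sketch using a deliberately non-normal (exponential) population:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# a skewed, clearly non-normal population: exponential with mean 1, std 1
sample_size = 30
sample_means = rng.exponential(scale=1.0, size=(2000, sample_size)).mean(axis=1)

# the sample means cluster around the population mean (1.0),
# with standard deviation close to sigma / sqrt(n) = 1 / sqrt(30) ~= 0.183
print('mean of sample means: %0.3f' % sample_means.mean())
print('std of sample means:  %0.3f' % sample_means.std())
```

&lt;p&gt;This is what lets the t-tests below treat the sample mean as (approximately) normally distributed even when the population isn’t.&lt;/p&gt;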

&lt;h2&gt;
  
  
  Tests for mean-based samples
&lt;/h2&gt;

&lt;p&gt;All of these code samples are available in &lt;a href="https://bitbucket.org/sonalake/blog-hypothesis-testing"&gt;&lt;strong&gt;this git repository&lt;/strong&gt;&lt;/a&gt;. They use the common &lt;a href="https://www.statsmodels.org/stable/index.html"&gt;&lt;strong&gt;statsmodels&lt;/strong&gt;&lt;/a&gt; library to perform the tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1-sample t-test&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compare the mean of a sample to an expected value&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we have a sample – defined by a mean – and we want to see if we can make some assertion about whether the overall mean of the underlying population is greater than, less than, or different to some expected mean.&lt;/p&gt;

&lt;p&gt;So, in this example, suppose we want to sample a call centre to check if the average call time is more than 2 minutes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Our null hypothesis is: &lt;em&gt;the mean call time is exactly 2 minutes&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Our alternative hypothesis is: &lt;em&gt;the mean call time is more than 2 minutes&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  From one population we sampled 500 calls, and found a mean call time of 121 seconds, with a standard deviation of 50 seconds&lt;/li&gt;
&lt;li&gt;  We use a 1-sample t-test to check if the sample allows us to accept or reject the null hypothesis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To calculate the p-value in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;statsmodels.stats.weightstats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DescrStatsW&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;span class="c1"&gt;# can we assume anything from our sample?
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.025&lt;/span&gt;
&lt;span class="c1"&gt;# we're checking if calls can be resolved in over 2 minutes
# so Ho == 120 seconds
&lt;/span&gt;&lt;span class="n"&gt;null_hypothesis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;span class="c1"&gt;# Normally, in the real world, you would process an entire sample (i.e. sample_a)
# But for this test, we'll generate a sample from this shape, where:
# - min/max is the range of available options
# - sample mean/dev are used to define the normal distribution
# - size is how large the sample will be
&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_mean_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_dev_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;121&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;########################
# here - for our test - we're generating a random string of durations to be our sample
# these are in a normal distribution between min/max, normalised around the mean
&lt;/span&gt;&lt;span class="n"&gt;sample_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_mean_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_dev_a&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Get the stat data
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree_of_freedom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_a&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;ttest_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;null_hypothesis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'larger'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'t_stat: %0.3f, p_value: %0.3f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2-sample independent t-test&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compare the mean of the samples from 2 different populations&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we have two samples – taken from two different populations – defined by a mean – and we want to see if we can make an assertion about whether the overall mean of one of the underlying populations is greater than / less than / different to the other.&lt;/p&gt;

&lt;p&gt;So, in this example, suppose we want to compare two different call centres to see how their call times relate to each other.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We have two samples – A and B: our null hypothesis is: &lt;em&gt;the means from the two populations are the same&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Our alternative hypothesis is: &lt;em&gt;the mean from population A &amp;gt; the mean from population B&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  From one population we sampled 500 calls, and found a mean call time of 121 seconds, with a standard deviation of 56 seconds.&lt;/li&gt;
&lt;li&gt;  From the other population we sampled 500 calls, and found a mean call time of 125 seconds, with a standard deviation of 16 seconds.&lt;/li&gt;
&lt;li&gt;  We use a 2-sample independent t-test to check if the samples allow us to accept or reject the null hypothesis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To calculate the p-value in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;statsmodels.stats.weightstats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ttest_ind&lt;/span&gt;
&lt;span class="c1"&gt;# can we assume anything from our sample?
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.025&lt;/span&gt;
&lt;span class="c1"&gt;# we're checking if calls can be resolved in over 2 minutes
# so Ho == 120 seconds
&lt;/span&gt;&lt;span class="n"&gt;null_hypothesis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;span class="c1"&gt;# Normally, in the real world, you would process an entire sample (i.e. sample_a)
# But for this test, we'll generate a sample from this shape, wherE:
# - min/max is the range of available options
# - sample mean/dev are used to define the normal distribution
# - size is how large the sample will be
&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;121&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;########################
# here - for our test - we're generating a random string of durations to be our sample
# these are in a normal distribution between min/max, normalised around the mean
&lt;/span&gt;&lt;span class="n"&gt;sample_v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size_v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size_v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Get the stat data
# note that we're comparing V2 to V1 - so the sample we expect to be larger goes first here
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree_of_freedom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ttest_ind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alternative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'larger'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'t_stat: %0.3f, p_value: %0.3f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2-sample paired t-test&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compare the mean of two samples from the same population&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we have two samples – taken from the &lt;strong&gt;same&lt;/strong&gt; population – defined by a mean – and we want to see if we can make an assertion about whether the population mean at the time of the second sample is greater than / less than / different to what it was at the time of the first.&lt;/p&gt;

&lt;p&gt;So, in this example, suppose we have made some code change and it looks like it has slowed things down, and so we want to sample the performance from before and after the change, to see if things have really slowed down.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We have two samples – A (before the change) and B (after it): our null hypothesis is: &lt;em&gt;the population mean is the same in both&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Our alternative hypothesis is: &lt;em&gt;the mean from sample B &amp;gt; the mean from sample A&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Before the change, we sampled 500 events from the population, and found a mean processing time of 121 milliseconds, with a standard deviation of 56 milliseconds.&lt;/li&gt;
&lt;li&gt;  After the change, we sampled 500 events from the population, and found a mean processing time of 128 milliseconds, with a standard deviation of 16 milliseconds.&lt;/li&gt;
&lt;li&gt;  We use a 2-sample paired t-test to check if the samples allow us to accept or reject the null hypothesis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NOTE: in this case it is assumed that the same elements have been sampled multiple times. So, this is, in effect, a 1-sample t-test on the differences between the two samples with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Null hypothesis: difference is 0&lt;/li&gt;
&lt;li&gt;  Alternative hypothesis: difference is greater than 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To calculate the p-value in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;statsmodels.stats.weightstats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DescrStatsW&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;span class="c1"&gt;# can we assume anything from our sample?
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;span class="c1"&gt;# we're checking if calls can be resolved in over 2 minutes
# so Ho == 120 seconds
&lt;/span&gt;&lt;span class="n"&gt;null_hypothesis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;span class="c1"&gt;# Normally, in the real world, you would process an entire sample (i.e. sample_a)
# But for this test, we'll generate a sample from this shape, wherE:
# - min/max is the range of available options
# - sample mean/dev are used to define the normal distribution
# - size is how large the sample will be
&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;121&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;########################
# here - for our test - we're generating a random string of durations to be our sample
# these are in a normal distribution between min/max, normalised around the mean
&lt;/span&gt;&lt;span class="n"&gt;sample_v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_mean_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_dev_v1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size_v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;truncnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_mean_v2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_dev_v2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_size_v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Get the stat data
# note that this is, in effect, a sample t-test on the differences
# we want to see if v2 is slower than V1 so we get the differences and check the probability that they
# are larger than the null hypothesis here (of the default = 0.0)
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree_of_freedom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_v2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sample_v1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;ttest_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alternative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'larger'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'t_stat: %0.5f, p_value: %0.5f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
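The note above, that a paired test is effectively a 1-sample t-test on the differences, is easy to verify. The sketch below uses scipy's ttest_rel and ttest_1samp (an alternative to the statsmodels call used here, and assuming scipy 1.6+ for the alternative keyword) on synthetic before/after timings:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# synthetic paired timings: the same 500 events measured before and after a change
before = rng.normal(121, 56, 500)
after = rng.normal(128, 16, 500)

# 2-sample paired t-test: is "after" slower than "before"?
t_paired, p_paired = stats.ttest_rel(after, before, alternative='greater')

# the same question as a 1-sample t-test on the differences against 0
t_diff, p_diff = stats.ttest_1samp(after - before, 0, alternative='greater')

print('paired: t=%0.5f p=%0.5f' % (t_paired, p_paired))
print('1-samp: t=%0.5f p=%0.5f' % (t_diff, p_diff))
```

Both calls produce identical statistics and p-values, which is why the statsmodels code above can simply feed the differences to ttest_mean.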



&lt;p&gt;In the next post I will focus on testing of frequency-based samples.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;PART I: &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;An Introduction to Hypothesis Testing&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART II: &lt;a href="https://dev.to/sonalake/part-2-hypothesis-testing-of-proportion-based-samples-1ik2"&gt;Hypothesis Testing of proportion-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART IV: &lt;a href="https://dev.to/sonalake/part-4-hypothesis-testing-of-frequency-based-samples-48oi"&gt;Hypothesis Testing of frequency-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 2: Hypothesis Testing of proportion-based samples</title>
      <dc:creator>Daniel Bray</dc:creator>
      <pubDate>Wed, 03 Feb 2021 14:26:09 +0000</pubDate>
      <link>https://dev.to/sonalake/part-2-hypothesis-testing-of-proportion-based-samples-1ik2</link>
      <guid>https://dev.to/sonalake/part-2-hypothesis-testing-of-proportion-based-samples-1ik2</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;&lt;strong&gt;part one of this series&lt;/strong&gt;&lt;/a&gt;, I introduced the concept of hypothesis testing, and described the different elements that go into using the various tests. It ended with a cheat-sheet to help you choose which test to use based on the kind of data you’re testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sBrmv9ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sBrmv9ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing.png" alt="hypothesis-testing" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this second post I will go into more detail on &lt;strong&gt;proportion-based&lt;/strong&gt; samples.&lt;/p&gt;

&lt;p&gt;If any of the terms &lt;em&gt;Null Hypothesis&lt;/em&gt;, &lt;em&gt;Alternative Hypothesis&lt;/em&gt;, &lt;em&gt;p-value&lt;/em&gt; are new to you, I’d suggest reviewing the first part of this series before moving on.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a proportion-based sample?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In these cases we’re interested in checking proportions. For example, 17% of a sample matches some profile, and the rest does not. This could be a test comparing a single sample against some expected value, or comparing two different samples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These tests are only valid when there are &lt;strong&gt;only two&lt;/strong&gt; possible options; and if the probability of one option is &lt;em&gt;&lt;strong&gt;p&lt;/strong&gt;&lt;/em&gt;, then the probability of the other must be &lt;em&gt;&lt;strong&gt;(1 – p)&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Requirements for the quality of the sample&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For these tests the following sampling rules are required:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;

&lt;tbody&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Random&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must be a random sample from the entire population&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Normal&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must reflect the distribution of the underlying population. For these tests a good rule of thumb is that:
&lt;ul&gt;
&lt;li&gt;Given a sample size of &lt;strong&gt;n&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Given a sample proportion of &lt;strong&gt;p&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Then both &lt;strong&gt;np&lt;/strong&gt; and &lt;strong&gt;n(1-p)&lt;/strong&gt; must be at least &lt;strong&gt;10&lt;/strong&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;em&gt;For example: if a sample finds that 80% of issues were resolved in 5 days, and 20% were not, then that sample must have at least 10 issues resolved within 5 days, and at least 10 issues resolved in more than 5 days.&lt;/em&gt;
&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Independent&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;The sample must be independent – for these tests, a good rule of thumb is that the sample size is less than 10% of the total population.&lt;/td&gt;

&lt;/tr&gt;

&lt;/tbody&gt;

&lt;/table&gt;&lt;/div&gt;
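The "Normal" rule of thumb in the table can be captured in a few lines. This is a hypothetical helper written for illustration, not part of statsmodels:

```python
def proportion_sample_is_normal_enough(n: int, p: float) -> bool:
    """Rule of thumb: both n*p and n*(1-p) must be at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10

# 80% of 500 issues resolved in 5 days: 400 and 100 either way, both fine
print(proportion_sample_is_normal_enough(500, 0.80))  # True
# 95% of 40 issues: n*(1-p) is only 2, so the sample is too small
print(proportion_sample_is_normal_enough(40, 0.95))   # False
```

Running a check like this before a proportion test is cheap insurance against drawing conclusions from a sample that is too lopsided to justify the normal approximation.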

&lt;h2&gt;
  
  
  &lt;strong&gt;Code Samples for Proportion-based Tests&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Note that all of these code samples are &lt;a href="https://github.com/sonalake/blog-hypothesis-testing"&gt;&lt;strong&gt;available on Github&lt;/strong&gt;&lt;/a&gt;. They use the popular &lt;a href="https://www.statsmodels.org/stable/index.html"&gt;&lt;strong&gt;statsmodels&lt;/strong&gt;&lt;/a&gt; library to perform the tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1-sample z-test&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compare the proportion in a sample to an expected value&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we have a sample and we want to see if some proportion of that sample is greater than / less than / different to some expected test value.&lt;/p&gt;

&lt;p&gt;In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We expect more than 80% of the tests to pass, so our null hypothesis is: &lt;em&gt;80% of the tests pass&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Our alternative hypothesis is: &lt;em&gt;more than 80% of the tests pass&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  We sampled 500 tests, and found 410 passed&lt;/li&gt;
&lt;li&gt;  We use a 1-sample z-test to check if the sample allows us to accept or reject the null hypothesis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To calculate the p-value in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;statsmodels.stats.proportion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;proportions_ztest&lt;/span&gt;

&lt;span class="c1"&gt;# can we assume anything from our sample
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;

&lt;span class="c1"&gt;# our sample - 82% are good
&lt;/span&gt;&lt;span class="n"&gt;sample_success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;410&lt;/span&gt;
&lt;span class="n"&gt;sample_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;

&lt;span class="c1"&gt;# our Ho is  80%
&lt;/span&gt;&lt;span class="n"&gt;null_hypothesis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.80&lt;/span&gt;

&lt;span class="c1"&gt;# check our sample against Ho for Ha &amp;gt; Ho
# for Ha &amp;lt; Ho use alternative='smaller'
# for Ha != Ho use alternative='two-sided'
&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proportions_ztest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;null_hypothesis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alternative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'larger'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'z_stat: %0.3f, p_value: %0.3f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
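For intuition, the z statistic above can also be computed by hand. One detail worth knowing: by default (prop_var=False), statsmodels builds the standard error from the sample proportion rather than the hypothesised one, so this sketch does the same:

```python
from math import erf, sqrt

sample_success, sample_size = 410, 500
null_hypothesis = 0.80

p_hat = sample_success / sample_size                 # 0.82
# standard error built from the sample proportion (statsmodels' default)
se = sqrt(p_hat * (1 - p_hat) / sample_size)
z_stat = (p_hat - null_hypothesis) / se
# one-sided ("larger") p-value from the standard normal CDF
p_value = 1 - 0.5 * (1 + erf(z_stat / sqrt(2)))

print('z_stat: %0.3f, p_value: %0.3f' % (z_stat, p_value))
```

This reproduces the library's result for the numbers above, and makes it clear why a 2% excess over the 80% target is not enough, at this sample size, to clear a 5% significance bar.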



&lt;h3&gt;
  
  
  &lt;strong&gt;2-sample z-test&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compare the proportions between 2 samples&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here we have two samples, defined by a proportion, and we want to see if we can make an assertion about whether the overall proportions of one of the underlying populations is greater than / less than / different to the other.&lt;/p&gt;

&lt;p&gt;In this example, we want to compare two different populations to see how their tests relate to each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We have two samples – A and B. Our null hypothesis is that &lt;em&gt;the proportions from the two populations are the same&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Our alternative hypothesis is that &lt;em&gt;the proportions from the two populations are different&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  From one population we sampled 500 tests and found 410 passed&lt;/li&gt;
&lt;li&gt;  From the other population, we sampled 400 tests and found 379 passed&lt;/li&gt;
&lt;li&gt;  We use a 2-sample z-test to check if the sample allows us to accept or reject the null hypothesis&lt;/li&gt;
&lt;/ul&gt;
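It can help to see the arithmetic a two-proportion z-test performs: the two samples are pooled to estimate a common proportion under the null hypothesis. Here is a hand-rolled sketch using the numbers from the bullets; I believe this mirrors what proportions_ztest computes with its defaults, but treat that as an assumption worth checking against the statsmodels docs:

```python
from math import erf, sqrt

# the two samples from the bullets above
sample_success_a, sample_size_a = 410, 500
sample_success_b, sample_size_b = 379, 400

p_a = sample_success_a / sample_size_a               # 0.82
p_b = sample_success_b / sample_size_b               # ~0.95
# pool the samples to estimate the common proportion under Ho
pooled = (sample_success_a + sample_success_b) / (sample_size_a + sample_size_b)
se = sqrt(pooled * (1 - pooled) * (1 / sample_size_a + 1 / sample_size_b))
z_stat = (p_a - p_b) / se
# two-sided p-value from the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z_stat) / sqrt(2))))

print('z_stat: %0.3f, p_value: %0.3f' % (z_stat, p_value))
```

A gap of roughly 13 percentage points on samples this large produces a z statistic far out in the tail, so the null hypothesis is very comfortably rejected.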

&lt;p&gt;To calculate the p-value in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;statsmodels.stats.proportion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;proportions_ztest&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# can we assume anything from our sample
&lt;/span&gt;&lt;span class="n"&gt;significance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.025&lt;/span&gt;

&lt;span class="c1"&gt;# our samples - 82% are good in one, and ~79% are good in the other
# note - the samples do not need to be the same size
&lt;/span&gt;&lt;span class="n"&gt;sample_success_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;410&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_success_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# check our sample against Ho for Ha != Ho
&lt;/span&gt;&lt;span class="n"&gt;successes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sample_success_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_success_b&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sample_size_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_size_b&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# note, no need for a Ho value here - it's derived from the other parameters
&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proportions_ztest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;successes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;alternative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'two-sided'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'z_stat: %0.3f, p_value: %0.3f'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p_value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;significance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail to reject the null hypothesis - we have nothing else to say"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Reject the null hypothesis - suggest the alternative hypothesis is true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the next post I will focus on hypothesis testing of mean-based samples.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;PART I: &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;An Introduction to Hypothesis Testing&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART III: &lt;a href="https://dev.to/sonalake/part-3-hypothesis-testing-of-mean-based-samples-5c16"&gt;Hypothesis Testing of mean-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART IV: &lt;a href="https://dev.to/sonalake/part-4-hypothesis-testing-of-frequency-based-samples-48oi"&gt;Hypothesis Testing of frequency-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>An Introduction to Hypothesis Testing</title>
      <dc:creator>Daniel Bray</dc:creator>
      <pubDate>Thu, 21 Jan 2021 22:11:28 +0000</pubDate>
      <link>https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne</link>
      <guid>https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne</guid>
      <description>&lt;p&gt;As part of the ongoing development of our &lt;a href="https://sonalake.com/solutions/visual-analytics/"&gt;&lt;strong&gt;VisiMetrix&lt;/strong&gt;&lt;/a&gt; platform we are faced with the need to make decisions about how best to analyse massive datasets. We want to help users make decisions when looking at data. Sometimes though it’s too expensive to check all the data or it’s so complicated that it’s easy to make an incorrect assumption and be led away in the wrong direction. &lt;/p&gt;

&lt;p&gt;In cases like this, hypothesis testing can help by providing a degree of confidence that either our observations are real, or the changes we’ve made have, in fact, made a difference. &lt;/p&gt;

&lt;p&gt;In cases where a complete examination of the underlying data set is impossible – perhaps all the data is not yet available, or it is simply too expensive to process – we have found the following statistical tests to be very helpful.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;1-Sample Z-Test&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;VisiMetrix monitors large telecom networks, and in some cases, its data will suggest that new software or hardware elements should be added to the network to improve overall performance. Since changing telecom networks is costly, we need to determine whether this change would be worthwhile by verifying that a sizeable proportion of the underlying traffic matches a well-defined profile. Unfortunately, checking such vast quantities of data is extremely compute- and time-intensive. &lt;/p&gt;

&lt;p&gt;In cases like this, a test known as the &lt;a href="https://www.statisticshowto.datasciencecentral.com/one-sample-z-test/"&gt;&lt;strong&gt;1-sample Z-test&lt;/strong&gt;&lt;/a&gt; can be applied to a sample of the data to determine if the network infrastructure change is, in fact, worth implementing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ps3_KO_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ps3_KO_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-1.jpg" alt="hypothesis-testing-1" width="880" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;2-Sample Paired T-Test&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;When VisiMetrix draws the attention of a telco’s operations team to a history of PDP creation (user connectivity) errors, they will often apply a configuration change to their underlying network to correct this. However, since things like PDP creation errors are, for the most part, rare, it can be a challenge to validate that a configuration change has, in fact, corrected connection failures for real end-customers. &lt;/p&gt;

&lt;p&gt;In cases like this, a &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5579465/"&gt;&lt;strong&gt;2-sample paired t-test&lt;/strong&gt;&lt;/a&gt; can be applied to samples taken before and after the configuration changes to confirm that any reduction in errors was, in fact, real, and not just a random artefact of the data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N_rAipHg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N_rAipHg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-2.jpg" alt="hypothesis-testing-2" width="880" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Chi-Square Goodness of Fit&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;When a telco is planning new hardware deployments, they can use information from their monitoring infrastructure to understand the pre-upgrade state of the network. Looking beyond that, they have to make some assumptions about traffic patterns as far as 2-3 years in the future:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Expected future event volumes&lt;/li&gt;
&lt;li&gt; Expected distribution for each event type&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;They will use these predicted volumes to dimension new hardware, network, and other infrastructure. Once deployed, it is critical to validate these sizing assumptions early on. The challenge, however, is that the traffic soon after an upgrade will be nowhere near the upper limit of what was sized, so it would be difficult to tell whether or not the upgrades will be able to support the predicted traffic volumes in the coming years. &lt;/p&gt;

&lt;p&gt;The challenge here is to validate the dimensioning assumptions in advance of peak traffic. Using the fact that the &lt;em&gt;proportions&lt;/em&gt; of event types should not differ significantly pre- and post-upgrade, we can apply a &lt;a href="http://www.stat.yale.edu/Courses/1997-98/101/chigf.htm"&gt;&lt;strong&gt;Chi-Square Goodness of Fit&lt;/strong&gt;&lt;/a&gt; test to the initially limited production data to confirm that the observed distribution is as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yNnWnvnu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-chi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yNnWnvnu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-chi.jpg" alt="hypothesis-testing-chi" width="880" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we know this, we can be confident that the deployed hardware will support the eventual load. This test is performed regularly, to catch any changes in user behaviour over time that might affect the proportions. &lt;/p&gt;
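&lt;p&gt;A sketch of this test in Python with SciPy, using illustrative event counts; the expected frequencies are simply the pre-upgrade proportions scaled to the sample total:&lt;/p&gt;

```python
from scipy import stats

# illustrative post-upgrade sample of 1,000 events across three event types
observed = [480, 310, 210]

# pre-upgrade proportions were 50% / 30% / 20%, scaled to the sample size
# (both arrays must sum to the same total)
expected = [p * sum(observed) for p in (0.5, 0.3, 0.2)]

stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print('chi2: %0.3f, p_value: %0.3f' % (stat, p_value))

if p_value > 0.05:
    print("Fail to reject - the event mix matches the sizing assumptions")
else:
    print("Reject - the event mix differs from what the hardware was sized for")
```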

&lt;p&gt;The purpose of this series of blog posts is to provide an introduction to hypothesis testing, and the types of problems to which it can be applied. At the end of this post, I will present a cheat sheet that will help you decide when to use which type of test. The following posts will go into more depth for each test, and provide a code sample for how to calculate it. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Hypothesis Testing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Hypothesis testing is a statistical method that can be used to make decisions about a data set without having to examine every element in that dataset. For example, imagine you have a software system that processes billions of events per hour. Events are grouped into transactions of, say, hundreds of events. Your product owner has identified a candidate product feature that could provide real customer value but only if at least 80% of the transactions over the last 12 months contain events that match a given set of criteria (profile). &lt;/p&gt;

&lt;p&gt;Now we have a problem. It would take weeks to process 12 months of events. Why are we bothering to take a sample? Because we want to make a decision, and checking every element in the set might be too difficult (billions of events), or just impossible (testing food means destroying it).&lt;/p&gt;

&lt;p&gt;The issues then become:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is something we want to know about the entire population, but we can’t interrogate all of it. &lt;/li&gt;
&lt;li&gt;We sample the population and learn something about that sample, but since it’s only a sample, we can’t be sure that it is, in fact, representative of the entire population.&lt;/li&gt;
&lt;li&gt;Finally, what – if anything – can we guess about the population, given what we’ve learnt about the sample? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can all get very heavy, very quickly, so I’ll give a quick example of what a hypothesis test does. In this example we have a data set that’s so large we can’t process all of it to get an answer, so we have to sample it, and then check what conclusions we can deduce from this sample. &lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suppose your software application is processing billions of transactions per hour.&lt;/li&gt;
&lt;li&gt;Your product owner has asked you to implement some new way to process these transactions, but it’s only a worthwhile feature to implement if &lt;strong&gt;at least 80%&lt;/strong&gt; of the transactions – over the whole of the last year – match a given profile. &lt;/li&gt;
&lt;li&gt;Now suppose that a check to see if a given transaction fits this profile was so expensive to calculate that it would take weeks to check all of them. &lt;/li&gt;
&lt;li&gt;So, instead, you sample just 1,000 transactions and find out that &lt;strong&gt;82% of the sampled transactions&lt;/strong&gt; have the required profile. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What can we say about all these billions of transactions, given what we have learnt about just this sample of 1,000? This is where the &lt;em&gt;null hypothesis&lt;/em&gt; and &lt;em&gt;alternative hypothesis&lt;/em&gt; come into play.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Null and Alternative Hypothesis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A hypothesis test starts with making two hypotheses: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The null hypothesis – in general, this is a &lt;em&gt;“suppose there’s nothing to see here”&lt;/em&gt; case. &lt;/li&gt;
&lt;li&gt;The alternative hypothesis – this is what we’re checking for. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test works by assuming the null hypothesis is true and then checking how likely it is that our sample would occur under that hypothesis. &lt;/p&gt;

&lt;p&gt;If it’s not likely enough, then we can suggest the alternative hypothesis is true. &lt;/p&gt;

&lt;p&gt;Before taking the sample a &lt;em&gt;significance&lt;/em&gt; level is selected. By convention this is 5% – but be advised, this is only a convention, and you must choose this with care. Later on, you will be making a judgement based on a derived probability by comparing it to this significance, so it’s important to consider the significance level before taking the sample. &lt;/p&gt;

&lt;p&gt;Technically, this makes this kind of hypothesis test a &lt;em&gt;significance&lt;/em&gt; test – we’re not proving anything. We are only deciding that, on the balance of probabilities, given how much risk we’re willing to take, we’re happy to accept that something is likely enough to be true. &lt;/p&gt;

&lt;p&gt;Does that sound vague? It should. There are reasons to be very careful about the kinds of assumptions you should be willing to make based on the results of these tests. In short, these tests aren’t about &lt;em&gt;certainty&lt;/em&gt;, they’re about &lt;em&gt;confidence&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;In our example, we would start by assuming this null hypothesis is true:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Exactly 80% of the transactions match the profile&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or in more formal language: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;p(profile) = 0.8&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What we want to do now is imagine the following. Note, we don’t actually have to &lt;em&gt;do&lt;/em&gt; the following; it’s just here to explain why this all works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9AxOPFYR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9AxOPFYR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-4.jpg" alt="hypothesis-testing-4" width="880" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imagine what would happen if we were to take lots of samples from a population where the proportion was exactly 80% &lt;/li&gt;
&lt;li&gt;Each sample we took would have a different proportion; but we’d expect most of them to be near enough to the “real” one of 80% &lt;/li&gt;
&lt;li&gt;If we count how many of each proportion we get, the result is a histogram where the “real” proportion has the highest bar. &lt;/li&gt;
&lt;li&gt;Eventually, if we were to take more and more samples, this would tend towards a normal curve, centred around 80%.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we have a curve – for a fictional population that matches our null hypothesis – with which we can compare our sample.&lt;/p&gt;
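&lt;p&gt;This thought experiment is easy to simulate. A quick sketch with NumPy, re-using the sample size of 1,000 from the running example (the number of repeated samples is arbitrary):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

# take 10,000 samples of 1,000 transactions each from a population
# where exactly 80% of transactions match the profile
sample_proportions = rng.binomial(n=1000, p=0.8, size=10_000) / 1000

# the proportions cluster around the "real" 80%, in a roughly normal curve
# whose spread is sqrt(0.8 * 0.2 / 1000), i.e. about 0.0126
print('mean: %0.4f, std: %0.4f' % (sample_proportions.mean(),
                                   sample_proportions.std()))
```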

&lt;h2&gt;
  
  
  &lt;strong&gt;Sample and Compare to Null Hypothesis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, how do we check our sample against this null hypothesis curve? First, we define our alternative hypothesis – i.e. this is the thing we’re trying to prove. For the kinds of tests we’re talking about here, this &lt;em&gt;must&lt;/em&gt; be related to the null hypothesis – i.e. it must compare the same terms, just with a different operator. &lt;/p&gt;

&lt;p&gt;In our example, because we have the null hypothesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Exactly&lt;/strong&gt; 80% of the transactions match the profile &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We would consider this as our alternative hypothesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;More than&lt;/strong&gt; 80% of the transactions match the profile &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or in more formal language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;p(profile) &amp;gt; 0.8&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, we compare our sample proportion (in our example this was &lt;strong&gt;82%&lt;/strong&gt;) to the curve for the &lt;strong&gt;null hypothesis&lt;/strong&gt;, and we figure out how likely it is that this sample could have come from a population where the proportion was, in fact, exactly 80%. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5nTj39rh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5nTj39rh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing-5.jpg" alt="" width="880" height="624"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;In our example, since we’re checking how likely it is that our real population proportion is &lt;em&gt;&lt;strong&gt;greater&lt;/strong&gt;&lt;/em&gt; than 80% (our assumed null hypothesis population proportion), we are, in effect, comparing: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The area under this curve to the right of where our sample result is. &lt;/li&gt;
&lt;li&gt;To the total area under this curve. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This fraction is the probability of how likely it is that our sample came from a population that had a proportion that matched our null hypothesis.&lt;/p&gt;
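&lt;p&gt;For the running example, this fraction can be computed directly from the normal curve. A sketch, using the null-hypothesis proportion of 80% for the standard error and the 82%-of-1,000 sample:&lt;/p&gt;

```python
import math
from scipy.stats import norm

p0 = 0.80     # null hypothesis proportion
p_hat = 0.82  # observed sample proportion
n = 1000      # sample size

# spread of the sampling distribution under the null hypothesis
se = math.sqrt(p0 * (1 - p0) / n)

# how many standard errors the sample sits above the null proportion
z = (p_hat - p0) / se

# area under the curve to the right of the sample, as a fraction of the total
p_value = 1 - norm.cdf(z)
print('z: %0.3f, p_value: %0.3f' % (z, p_value))
```

This is the textbook calculation; in practice a library routine does the same arithmetic for you.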

&lt;h2&gt;
  
  
  Drawing conclusions about the sample
&lt;/h2&gt;

&lt;p&gt;All of the tests that follow derive a result called a p-value. These values are often misunderstood. This misunderstanding can lead the tester to make certain assumptions about the underlying population that cannot be justified. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;p-value&lt;/strong&gt; is the probability that the &lt;strong&gt;sample&lt;/strong&gt; result &lt;em&gt;could&lt;/em&gt; have occurred if the null hypothesis were true.&lt;/p&gt;

&lt;p&gt;So, a p-value has no meaning outside of the given sample, cannot be related to any other sample or p-value, and doesn’t give an indication of how &lt;strong&gt;accurate&lt;/strong&gt; the sample value is. So, in our example, had we calculated a p-value of &lt;strong&gt;4%&lt;/strong&gt;, the following significance levels would have caused us to draw the following conclusions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;

&lt;thead&gt;

&lt;tr&gt;

&lt;th&gt;Significance&lt;/th&gt;

&lt;th&gt;Conclusions&lt;/th&gt;

&lt;/tr&gt;

&lt;/thead&gt;

&lt;tbody&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;5%&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;The p-value of 4% is less than the significance of 5%.&lt;/li&gt;
&lt;li&gt;So, the probability of this sample coming from a population with the values assumed by the null hypothesis is &lt;strong&gt;not significant&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;So, we &lt;strong&gt;can&lt;/strong&gt; reject the null hypothesis, which suggests the alternative hypothesis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; this doesn’t prove the alternative hypothesis; only that we can feel a degree of confidence that more than 80% of the transactions match our profile.&lt;/li&gt;
&lt;li&gt;We &lt;strong&gt;cannot&lt;/strong&gt; say &lt;em&gt;anything&lt;/em&gt; else about the actual value of the proportion of the underlying population – i.e. we can’t say that it’s likely to be 82%, or even close to 82%.&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;1%&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;The p-value of 4% is greater than (or equal to) the significance of 1%.&lt;/li&gt;
&lt;li&gt;So, the probability of this sample coming from a population with the values assumed by the null hypothesis is &lt;strong&gt;significant&lt;/strong&gt;.&lt;/li&gt; 
&lt;li&gt;We &lt;strong&gt;cannot&lt;/strong&gt; reject the null hypothesis, i.e. we can’t feel confident that the sample came from a population different from the one assumed by the null hypothesis.&lt;/li&gt;
&lt;li&gt;We &lt;strong&gt;cannot&lt;/strong&gt; say &lt;em&gt;anything&lt;/em&gt; else about the actual value of the proportion of the underlying population – i.e. we can’t say that it’s likely to be less than 80%.&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;

&lt;/tr&gt;

&lt;/tbody&gt;

&lt;/table&gt;&lt;/div&gt;
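&lt;p&gt;The two rows above boil down to a simple comparison, which can be sketched as:&lt;/p&gt;

```python
p_value = 0.04  # the example p-value from the table above

# the same p-value leads to opposite conclusions at different significances
for significance in (0.05, 0.01):
    if p_value < significance:
        verdict = "reject the null hypothesis (suggests the alternative)"
    else:
        verdict = "fail to reject the null hypothesis"
    print('significance %.0f%%: %s' % (significance * 100, verdict))
```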

&lt;h2&gt;
  
  
  &lt;strong&gt;What Next?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The above example is for a test comparing &lt;em&gt;proportions&lt;/em&gt;, but a different test would be required depending on what it was that you were comparing. The figure below offers a guide as to which test to apply depending on the nature of the data, and the observations you’re looking to make.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sBrmv9ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sBrmv9ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sonalake.com/wp-content/uploads/2019/12/hypothesis-testing.png" alt="hypothesis-testing" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rest of this series of blog posts will explain – with examples – when each of these different test types is applicable and will include sample code for each of them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;PART I: &lt;a href="https://dev.to/sonalake/an-introduction-to-hypothesis-testing-41ne"&gt;An Introduction to Hypothesis Testing&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART II: &lt;a href="https://dev.to/sonalake/part-2-hypothesis-testing-of-proportion-based-samples-1ik2"&gt;Hypothesis Testing of proportion-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART III: &lt;a href="https://dev.to/sonalake/part-3-hypothesis-testing-of-mean-based-samples-5c16"&gt;Hypothesis Testing of mean-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PART IV: &lt;a href="https://dev.to/sonalake/part-4-hypothesis-testing-of-frequency-based-samples-48oi"&gt;Hypothesis Testing of frequency-based samples&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>The Code Speaks for Itself – Generating API docs for Spring Applications</title>
      <dc:creator>Daniel Bray</dc:creator>
      <pubDate>Fri, 25 Sep 2020 11:18:09 +0000</pubDate>
      <link>https://dev.to/sonalake/the-code-speaks-for-itself-generating-api-docs-for-spring-applications-432d</link>
      <guid>https://dev.to/sonalake/the-code-speaks-for-itself-generating-api-docs-for-spring-applications-432d</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Programming is mostly about communication, and one of the most time-consuming parts of this aspect of development is the communication of how service APIs function. If this is done poorly, then the documents can get out of date, or be so vague that the developers will spend too much time answering questions about how their API works.&lt;/p&gt;

&lt;p&gt;This post outlines a process we at Sonalake use to automate the creation of REST API documentation. It’s done in such a way that it won’t require too much in the way of manual effort once it’s started, because most of the documentation detail will come from work you’re already doing to test the service. We have provided a working example of this in the &lt;a href="https://github.com/sonalake/sonalake-autodoc-example"&gt;&lt;strong&gt;sonalake-autodoc-example&lt;/strong&gt;&lt;/a&gt; project.&lt;/p&gt;

&lt;p&gt;What drove the creation of this process was the aim to provide a good developer experience (DX) to our own developers, and our clients and partners, by delivering good documentation that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Describes what, specifically, is in the API&lt;/li&gt;
&lt;li&gt; Provides examples of how to use the API&lt;/li&gt;
&lt;li&gt; Contains a changelog for how the API has evolved between versions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tools like &lt;a href="https://swagger.io"&gt;&lt;strong&gt;Swagger&lt;/strong&gt;&lt;/a&gt; do a great job on automating documentation for point 1, but when it comes to points 2 and 3, these types of documentation are generally written manually (or more likely, not written at all).&lt;/p&gt;

&lt;p&gt;We have found that generating documentation from the source allows significant portions of the API documentation to be produced automatically. By generating documented examples from unit tests, we can ensure that these examples always align with the reality of the application.&lt;/p&gt;

&lt;p&gt;It also allows developers to keep documentation up to date without having to leave the development environment, and lets documents be released and published the same way as any other development artifact.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How do we do this?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Some parts of the documentation are written manually by the developers in &lt;a href="https://asciidoc.org/"&gt;&lt;strong&gt;AsciiDoc&lt;/strong&gt;&lt;/a&gt;. These parts of the documentation are not expected to change much between releases, and are limited to things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Introducing what the API is&lt;/li&gt;
&lt;li&gt;  Describing how to authenticate&lt;/li&gt;
&lt;li&gt;  Outlining a generic set of use case steps, without any actual code samples (the code samples will be auto-generated during the build, using the data passed to unit tests).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of the process will generate the following sections, also in &lt;a href="https://asciidoc.org/"&gt;&lt;strong&gt;AsciiDoc&lt;/strong&gt;&lt;/a&gt; format.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://swagger.io"&gt;&lt;strong&gt;Swagger&lt;/strong&gt;&lt;/a&gt; documentation concerning the paths and entities&lt;/li&gt;
&lt;li&gt;  Code samples for the use case steps, generated from the unit tests&lt;/li&gt;
&lt;li&gt;  Changelog history of differences between published versions of the swagger.json&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, the AsciiDoc files are collated and published in a single PDF file using &lt;a href="https://asciidoctor.org/docs/asciidoctor-pdf/"&gt;&lt;strong&gt;Asciidoctor PDF&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At a high level, the main steps are as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Comment&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;

&lt;td&gt; &lt;strong&gt;Define Theme&lt;/strong&gt; &lt;/td&gt;

&lt;td&gt;The theme in the above project is a simple, clean layout, suitable for rendering most documents, and contains the standard document tracking elements such as document versions.

This uses the standard &lt;a href="https://github.com/asciidoctor/asciidoctor-pdf/blob/v1.5.0.beta.7/docs/theming-guide.adoc" rel="noopener noreferrer"&gt;&lt;strong&gt;AsciiDoctor-PDF&lt;/strong&gt;&lt;/a&gt; theme configurations.

&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Generate Example Code Snippets&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;Use &lt;a href="https://github.com/spring-projects/spring-restdocs" rel="noopener noreferrer"&gt;&lt;strong&gt;spring-restdocs&lt;/strong&gt;&lt;/a&gt; to document the inputs/outputs for REST queries by writing unit tests that exercise the APIs. We’ll embed these snippets in the final documentation later on.&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Generate swagger.json&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;Use a SpringBootTest to spin up the app in-memory and pull down the swagger.json to a local directory.  
You can use the test from the previous step to do this.&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Generate Changelog&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;Use Sonalake’s &lt;a href="https://plugins.gradle.org/plugin/com.sonalake.swagger-changelog" rel="noopener noreferrer"&gt;&lt;strong&gt;swagger-changelog&lt;/strong&gt;&lt;/a&gt; plugin to parse any previously published API specs, compare them to the current dev version, and produce a changelog in AsciiDoc format.&lt;/td&gt;

&lt;/tr&gt;

&lt;tr&gt;

&lt;td&gt;&lt;strong&gt;Author Hand-written Content&lt;/strong&gt;&lt;/td&gt;

&lt;td&gt;A document containing:

*   Hand-written content that won’t change too often. For example, an introduction.
*   A code examples document of simple text, referencing the generated snippets.
*   A single framing document that links to both the hand-written and generated content.

&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We have developed a sample project to showcase all of these steps: sonalake-autodoc-example. This is a very simple Spring Boot application with a trivial REST API exposing two GET methods. The rest of the project is solely dedicated to automating the documentation. Let’s walk through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Define Theme&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The main tool for the AsciiDoctor-to-PDF generation is &lt;a href="https://github.com/asciidoctor/asciidoctor-pdf/blob/v1.5.0.beta.7/docs/theming-guide.adoc"&gt;&lt;strong&gt;AsciiDoctor-PDF&lt;/strong&gt;&lt;/a&gt; and it comes with a full set of theming options. The &lt;a href="https://github.com/sonalake/sonalake-autodoc-example/blob/develop/src/docs/asciidoc/theme/simple-theme.yml"&gt;&lt;strong&gt;simple-theme.yml&lt;/strong&gt;&lt;/a&gt; sample provides a simple, clean, professional layout that you can probably re-use by just changing the logo image.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Generate Example Code Snippets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This part of the pipeline generates snippets in AsciiDoc format from unit tests. The output contains examples of REST calls, with request bodies and responses that will always be accurate for the current version of the code base.&lt;/p&gt;

&lt;p&gt;In the sample project this all happens in &lt;a href="https://github.com/sonalake/sonalake-autodoc-example/blob/develop/src/test/java/com/sonalake/autodoc/api/BaseWebTest.java"&gt;&lt;strong&gt;BaseWebTest&lt;/strong&gt;&lt;/a&gt;. It takes advantage of &lt;a href="https://github.com/spring-projects/spring-restdocs"&gt;&lt;strong&gt;spring-restdocs&lt;/strong&gt;&lt;/a&gt; and acts as a base class for all other web-based unit tests.&lt;/p&gt;

&lt;p&gt;API calls would be tested in the normal way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;mockMvc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;perform&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/endpoint-a"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MediaType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;APPLICATION_JSON&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MediaType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;APPLICATION_JSON&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;characterEncoding&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StandardCharsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;UTF_8&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;andExpect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isOk&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;andReturn&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A unit test of the form above generates the following snippets under&lt;br&gt;&lt;br&gt;
&lt;code&gt;build/generated-snippets/${test-class-name}/${test-method-name}&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;curl-request.adoc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;http-request.adoc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;http-response.adoc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;httpie-request.adoc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;request-body.adoc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;response-body.adoc&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
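The directory names above are kebab-cased versions of the test class and method names. As an illustration of that naming convention (this is a sketch of the idea, not spring-restdocs’ actual code), the conversion looks roughly like this:

```java
public class SnippetPaths {
    // Convert a Java identifier like "EndpointATest" to the kebab-case
    // form used in the snippet directory names, e.g. "endpoint-a-test".
    public static String toKebab(String name) {
        return name
            .replaceAll("([A-Z]+)([A-Z][a-z])", "$1-$2") // split an acronym from the next word
            .replaceAll("([a-z0-9])([A-Z])", "$1-$2")    // split lowercase from uppercase
            .toLowerCase();
    }

    public static void main(String[] args) {
        // A test method EndpointATest#testGetValue would produce snippets under:
        System.out.println("build/generated-snippets/"
            + toKebab("EndpointATest") + "/" + toKebab("testGetValue"));
    }
}
```

So a single test method yields a predictable, stable directory that the hand-written documents can include from.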

&lt;p&gt;For example, an &lt;code&gt;http-request&lt;/code&gt; snippet for a POST might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"nowrap"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;----&lt;/span&gt;
&lt;span class="no"&gt;POST&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;
&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;Type:&lt;/span&gt; &lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="no"&gt;UTF&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;
&lt;span class="nl"&gt;Accept:&lt;/span&gt; &lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;Length:&lt;/span&gt; &lt;span class="mi"&gt;52&lt;/span&gt;
&lt;span class="nl"&gt;Host:&lt;/span&gt; &lt;span class="n"&gt;autodoc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sonalake&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;com&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"fieldA"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"sample A"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"fieldB"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"sample B"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;----&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These files can be referenced in your examples documents, with the result that examples will always be up-to-date.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Generate swagger.json&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This part of the pipeline generates an up-to-date view of the REST paths and entities in AsciiDoc format: first by generating a swagger.json, and then by translating this into AsciiDoc.&lt;/p&gt;

&lt;p&gt;The sample project contains a single test, &lt;a href="https://github.com/sonalake/sonalake-autodoc-example/blob/develop/src/test/java/com/sonalake/autodoc/GenerateDocumentationTest.java"&gt;&lt;strong&gt;GenerateDocumentationTest.java&lt;/strong&gt;&lt;/a&gt;, that starts up the application as &lt;code&gt;@SpringBootTest&lt;/code&gt; in the &lt;code&gt;test&lt;/code&gt; profile, and pulls down the &lt;code&gt;swagger.json&lt;/code&gt; generated by the &lt;a href="https://github.com/sonalake/sonalake-autodoc-example/blob/develop/src/main/java/com/sonalake/autodoc/config/SwaggerConfig.java"&gt;&lt;strong&gt;SwaggerConfig.java&lt;/strong&gt;&lt;/a&gt;. It then runs swagger.json through the &lt;a href="https://github.com/Swagger2Markup/swagger2markup-gradle-plugin"&gt;&lt;strong&gt;swagger2markup-gradle-plugin&lt;/strong&gt;&lt;/a&gt; to convert it to AsciiDoc format.&lt;/p&gt;
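Stripped of the Spring wiring, the heart of that test is simply: fetch the JSON the running application serves, and write it where the build expects it. A minimal sketch of that step (the JSON content here is a stand-in for the real response body):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveSwagger {
    // Write the fetched spec to <outputDir>/swagger.json, creating the
    // directory if needed, and return the path that was written.
    public static Path save(String swaggerJson, Path outputDir) {
        try {
            Files.createDirectories(outputDir);
            Path target = outputDir.resolve("swagger.json");
            Files.writeString(target, swaggerJson);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In the real test this string comes from a GET against the
        // swagger endpoint exposed by SwaggerConfig; this is a placeholder.
        String swaggerJson = "{\"swagger\": \"2.0\", \"paths\": {}}";
        System.out.println("Wrote " + save(swaggerJson, Path.of("build", "swagger")));
    }
}
```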

&lt;p&gt;This produces the following sets of files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Overview.adoc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Contains some metadata from application.yml, such as title text and version information, for inclusion on the main page.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  security.adoc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple page describing how authentication, for example via HTTP headers, should be configured.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  paths.adoc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A list of all the REST calls the application will accept, and the responses it will return.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  definitions.adoc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A list of all the entities the application will accept and respond with.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Make Documentation Easier to Follow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://swagger.io/docs/specification/grouping-operations-with-tags/"&gt;&lt;strong&gt;Tags&lt;/strong&gt;&lt;/a&gt; are an optional, but useful, tool for collecting related endpoints together, even when they are implemented in different classes. By default, Swagger will name the resources after their controller classes, but tags allow you to give them a different name.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Api&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Section A"&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;
&lt;span class="nd"&gt;@Description&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Some operations in section A"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ControllerA1&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Generate Changelog&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The last part of the automated process creates a changelog. It assumes that previously released versions of the Swagger spec are published to Nexus. All of this configuration is contained in &lt;a href="https://github.com/sonalake/sonalake-autodoc-example/blob/develop/build.gradle"&gt;&lt;strong&gt;build.gradle&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Publish Swagger Spec as a Nexus Artifact using the maven-publish and maven-publish-auth plugins.&lt;/li&gt;
&lt;li&gt; Generate changelog from Nexus history using the Sonalake swagger-changelog Gradle plugin.
The plugin will retrieve any previously published RELEASE versions of the Swagger spec, and will produce the following:

&lt;ul&gt;
&lt;li&gt;  A file of the form &lt;code&gt;change-log-0.0.1-0.0.2-SNAPSHOT.adoc&lt;/code&gt; for each version&lt;/li&gt;
&lt;li&gt;  An index file, &lt;code&gt;change-log.adoc&lt;/code&gt;, listing all versions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
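Conceptually, each changelog file is just a diff between two versions of the spec: what one version declares that the other doesn’t. A toy illustration of the idea, comparing only path names (the real plugin compares far more than this):

```java
import java.util.Set;
import java.util.TreeSet;

public class PathDiff {
    // Report which REST paths were added and removed between two spec versions.
    public static String describe(Set<String> oldPaths, Set<String> newPaths) {
        Set<String> added = new TreeSet<>(newPaths);
        added.removeAll(oldPaths);
        Set<String> removed = new TreeSet<>(oldPaths);
        removed.removeAll(newPaths);
        return "added=" + added + " removed=" + removed;
    }

    public static void main(String[] args) {
        System.out.println(describe(
            Set.of("/api/endpoint-a"),
            Set.of("/api/endpoint-a", "/api/endpoint-b")));
        // added=[/api/endpoint-b] removed=[]
    }
}
```

Because each released spec is an immutable Nexus artifact, the same diff can be regenerated for any pair of versions at any time.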

&lt;h2&gt;
  
  
  &lt;strong&gt;Author Hand-written Content&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Writing the following documents will round out the process.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  introduction.adoc – a simple one- or two-paragraph description of what the application is for&lt;/li&gt;
&lt;li&gt;  security.adoc – a quick description of how to authenticate, and what roles, if any, exist in the application. Note that, if you want to, you can easily write other tests that print out a list of such roles in AsciiDoc format, and include the result in this file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A special case of a hand-written document is the &lt;a href="https://github.com/sonalake/sonalake-autodoc-example/blob/develop/src/docs/asciidoc/examples.adoc"&gt;&lt;strong&gt;examples.adoc&lt;/strong&gt;&lt;/a&gt;, where a high-level description of the overall flow of REST calls is written. For example: to on-board a new user, you need to call X, then Y, and then Z. However, this document would not include any actual REST calls or parameters. Rather, it would refer to the results of the unit tests that you have written to test these endpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="o"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;examplesscheme&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;
&lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;Examples&lt;/span&gt;
&lt;span class="nc"&gt;What&lt;/span&gt; &lt;span class="n"&gt;follows&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;some&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="no"&gt;API&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;
&lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nc"&gt;Endpoint&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;
&lt;span class="no"&gt;A&lt;/span&gt; &lt;span class="n"&gt;get&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;
&lt;span class="nl"&gt;include:&lt;/span&gt;&lt;span class="o"&gt;:{&lt;/span&gt;&lt;span class="n"&gt;snippets&lt;/span&gt;&lt;span class="o"&gt;}/&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;adoc&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt;
&lt;span class="nc"&gt;Returns&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;
&lt;span class="nl"&gt;include:&lt;/span&gt;&lt;span class="o"&gt;:{&lt;/span&gt;&lt;span class="n"&gt;snippets&lt;/span&gt;&lt;span class="o"&gt;}/&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;adoc&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the overall flow of your application isn’t likely to change – even if the URLs and request/responses change – this document will remain relatively unchanged over time. The only thing you are likely to have to update are your unit tests, but you’d be doing that anyway. Right?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tying it All Together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All of this is tied together in &lt;a href="https://github.com/sonalake/sonalake-autodoc-example/blob/develop/build.gradle"&gt;&lt;strong&gt;build.gradle&lt;/strong&gt;&lt;/a&gt;. First, it dictates where in the build directory the files are written.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;asciiDocOutputDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"${buildDir}/asciidoc/generated"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;swaggerOutputDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"${buildDir}/swagger"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;snippetsOutputDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"${buildDir}/generated-snippets"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following tells Gradle to pass system properties down to the test tool, so that the documentation-generating test knows the current document version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;systemProperties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;properties&lt;/span&gt;
 &lt;span class="n"&gt;systemProperty&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;sg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;

 &lt;span class="nf"&gt;useJUnitPlatform&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
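On the test side, the version then arrives as an ordinary system property. A trivial sketch of how the documentation test can read it (the property name matches the build.gradle above; the fallback value is our own choice for runs outside the build):

```java
public class ApiVersion {
    // Read the version pushed down by Gradle; fall back to a marker
    // value when the test is run outside the build.
    public static String get() {
        return System.getProperty("sg.api.version", "unversioned");
    }

    public static void main(String[] args) {
        System.out.println("Documenting API version " + ApiVersion.get());
    }
}
```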



&lt;p&gt;Then use &lt;a href="http://swagger2markup.github.io/swagger2markup/1.3.1/"&gt;&lt;strong&gt;swagger2markup&lt;/strong&gt;&lt;/a&gt; to convert the swagger.json into AsciiDoc format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;convertSwagger2markup&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;dependsOn&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;
 &lt;span class="n"&gt;swaggerInput&lt;/span&gt; &lt;span class="s"&gt;"${swaggerOutputDir}/swagger.json"&lt;/span&gt;
 &lt;span class="n"&gt;outputDir&lt;/span&gt; &lt;span class="n"&gt;asciiDocOutputDir&lt;/span&gt;
 &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;
   &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;swagger2markup&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;pathsGroupedBy&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;                          &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="no"&gt;TAGS&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;swagger2markup&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;extensions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;springRestDocs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;snippetBaseUri&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippetsOutputDir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAbsolutePath&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
 &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, the following tells the swagger-changelog plugin where to pull version information from, and where to write the diff files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;swaggerChangeLog&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;groupId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${rootProject.group}"&lt;/span&gt;
 &lt;span class="n"&gt;artifactId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${rootProject.name}-API"&lt;/span&gt;

 &lt;span class="c1"&gt;// where to find the nexus repo&lt;/span&gt;
 &lt;span class="n"&gt;nexusHome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="nl"&gt;http:&lt;/span&gt;&lt;span class="c1"&gt;//atlanta.sonalake.corp:8081/nexus'&lt;/span&gt;

 &lt;span class="c1"&gt;// where to store the changelog&lt;/span&gt;
 &lt;span class="n"&gt;targetdir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${buildDir}/asciidoc/generated/changelog"&lt;/span&gt;

 &lt;span class="c1"&gt;// if we’re building a snapshot version, then include it as the&lt;/span&gt;
 &lt;span class="c1"&gt;// end of the changelog&lt;/span&gt;
 &lt;span class="n"&gt;snapshotVersionFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${buildDir}/swagger/swagger.json"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, this is where the &lt;a href="https://asciidoctor.org/docs/asciidoctor-pdf/"&gt;&lt;strong&gt;AsciiDoctor-PDF&lt;/strong&gt;&lt;/a&gt; Gradle plugin takes all the AsciiDoc files we have created, and converts them into a PDF.&lt;/p&gt;

&lt;p&gt;Note that, with the &lt;code&gt;baseDirFollowsSourceDir&lt;/code&gt; setting, all paths are relative to the main index file. This means that references within the AsciiDoc file structure don’t need to worry about where they sit on the file system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// create a PDF from the asciidoc&lt;/span&gt;
&lt;span class="n"&gt;asciidoctorPdf&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;dependsOn&lt;/span&gt; &lt;span class="n"&gt;convertSwagger2markup&lt;/span&gt;
 &lt;span class="n"&gt;dependsOn&lt;/span&gt; &lt;span class="n"&gt;generateChangeLog&lt;/span&gt;

 &lt;span class="nf"&gt;baseDirFollowsSourceDir&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

 &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;guide&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;adoc&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;
 &lt;span class="o"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;
   &lt;span class="n"&gt;doctype&lt;/span&gt;        &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;toc&lt;/span&gt;            &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;toclevels&lt;/span&gt;      &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="sc"&gt;'3'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;numbered&lt;/span&gt;       &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;''&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;sectlinks&lt;/span&gt;      &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;''&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;sectanchors&lt;/span&gt;    &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;''&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;hardbreaks&lt;/span&gt;     &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;''&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;generated&lt;/span&gt;      &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;../../../&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;asciidoc&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;generated&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;resources&lt;/span&gt;      &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;../../../&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;snippets&lt;/span&gt;       &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;../../../&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;generated&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;snippets&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;changes&lt;/span&gt;        &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;../../../&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;asciidoc&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;generated&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;changelog&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;imagesdir&lt;/span&gt;      &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;stylesdir&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;    &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;yml&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;revnumber&lt;/span&gt;      &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;
 &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
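The attributes at the bottom of that block are how the hand-written documents reach the generated content: each one is, in effect, a path variable that the include:: directives can reference. A simplified sketch of the substitution (real AsciiDoc attribute resolution handles considerably more than this):

```java
import java.util.Map;

public class Attributes {
    // Resolve {name} references the way attributes are used in
    // include:: directives; a simplified stand-in for AsciiDoc's
    // real attribute resolution.
    public static String resolve(String text, Map<String, String> attrs) {
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            text = text.replace("{" + e.getKey() + "}", e.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> attrs =
            Map.of("snippets", "../../../build/generated-snippets");
        System.out.println(resolve(
            "include::{snippets}/endpoint-a-test/test-get-value/http-request.adoc[]",
            attrs));
    }
}
```

This is why the hand-written documents never hard-code build paths: moving the generated output only requires changing the attribute values in build.gradle.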



&lt;p&gt;That’s it. You can drop the code from the sample project into any Spring Boot project in about an hour, and produce clean, professional documents. We hope you find it as useful as we do!&lt;/p&gt;

</description>
      <category>java</category>
      <category>api</category>
      <category>rest</category>
    </item>
  </channel>
</rss>
