Anton Yarkov

Fundamental metrics for Software Engineering derived from mathematical statistics

If you can’t measure it, you can’t improve it. – Peter Drucker

In software engineering, we all need to make data-driven decisions. Thus, we frequently need to measure impact, performance, feedback, scope, etc. in order to set measurable goals and track progress on our action items.

To make those educated guesses, you need to collect data, then aggregate and process it to produce valuable information in numerical form.

It sounds easy, but even knowing what to collect and how to collect it is already a challenging task. To make sure we collect data the right way, it is better to think in reverse: start by answering how you are going to aggregate the data you are given.

Let's build some fundamental understanding of how we aggregate and process data, using some simple examples.

This article brings together a number of other resources and articles that I reference at the end. Thus, it is not really original content, but rather a compilation from other sources.

Average aka Mean

Since school we have all been trained to use the most popular aggregation tool: the average value. As Konrad Kokosa mentions in his great book "Pro .NET Memory Management", the vast majority of measurements in software performance are made using the average metric, which has serious disadvantages and is widely misunderstood. You need to be very careful when drawing conclusions from it.

The mean has two major drawbacks: it doesn't necessarily correspond to any specific sample value (e.g., every hospital lists the average human body temperature as 36.6°C!) and it can obscure the true nature of the data's distribution. The problems with this and other simple metrics, such as variance, are convincingly illustrated by Anscombe's quartet.

Thus, we can derive a lot of additional interesting metrics from statistics.

Normal distribution

It turns out that many natural phenomena follow the normal distribution. The normal distribution, also known as the Gaussian (named after the mathematician Carl Friedrich Gauss) or bell curve, is described by the following equation:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

where μ is the mean and σ is the standard deviation.

The Gaussian curve is also called the Probability Density Function (PDF) of the normal distribution: the probability of finding values decreases the further they are from the center.

When we’re looking at a curve like this, the x-axis represents the value, while the y-axis represents the frequency with which we see a given value (i.e., values that are “higher” on the y-axis occur more frequently).

In a normal distribution, we see a curve centered (the dotted line) at its most frequent value, with decreasing probability of seeing values further away from the most frequent one (the most frequent value is the mode). Note that the normal distribution is symmetric, which means that values to the left and right of the center have the same probability of occurring.
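For completeness, here is a small Python sketch (not from the original article) that evaluates the PDF above and shows the density falling off symmetrically around the mean:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """The Gaussian probability density function given above."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# The density is highest at the center (x = mu) and falls off symmetrically.
for x in (-2, -1, 0, 1, 2):
    print(f"x={x:+d}: pdf={normal_pdf(x):.4f}")
```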

We can define a number of aggregation metrics on that Gaussian:

  • Percentile is the value below which a given percentage of observations fall. For example, the 95th percentile is the value below which 95% of the data points lie. Percentiles are excellent for characterizing data without being skewed by extreme values. I strongly recommend using percentiles whenever your measurement tools support them. They often have direct relevance to the domain. For instance, we may want an application’s response time to be no more than 1 second in 90% of cases, and no more than 4 seconds in 99% of cases. To monitor this, we should measure the 90th and 99th percentiles of response time.
  • The median is the middle value, where half of the data is above and half is below. It provides a better representation of a typical value because it is more resistant to outliers. Additionally, the median corresponds to an actual data point rather than being artificially calculated.
  • Histogram is a graphical representation of the distribution of data. It shows how many samples fall into each range of values. This is one of the most informative ways to understand data, as it describes the distribution in full.
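As a quick illustration of the histogram bullet above, here is a minimal Python sketch (the sample values are made up for illustration) that buckets values into ranges and prints a rough text histogram:

```python
from collections import Counter

# Hypothetical response times in milliseconds, for illustration only.
samples = [112, 130, 95, 140, 155, 120, 118, 133, 610, 125, 98, 142]

# Bucket each sample into 50 ms wide bins and count how many fall into each.
bins = Counter((value // 50) * 50 for value in samples)
for low in sorted(bins):
    print(f"{low:>4}-{low + 49:<4} ms | {'#' * bins[low]}")
```

Even this crude picture immediately reveals the long tail (the 610 ms outlier) that a single mean value would hide.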

Percentile

A percentile is defined as the value where x percent of the data falls below the value. For example, if we call something “the 10th percentile,” we mean that 10% of the data is less than the value and 90% is greater than (or equal to) the value.

 A normal distribution with the 10th percentile depicted.

And the 90th percentile is where 90% of the data is less than the value and 10% is greater:

 A normal distribution with the 90th percentile depicted.

To calculate the 10th percentile, let’s say we have 10,000 values. We take all of the values, order them from smallest to largest, and identify the 1,001st value (so that 1,000, or 10%, of the values are below it); this will be our 10th percentile.
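Here is a minimal Python sketch of that procedure, using made-up response times; in practice you would likely reach for numpy.percentile or statistics.quantiles instead:

```python
import random

# A hypothetical sample of 10,000 response times (seconds), for illustration only.
random.seed(42)
response_times = [random.gauss(0.5, 0.15) for _ in range(10_000)]

def percentile(values, p):
    """Return the value below which roughly p percent of the data falls
    (nearest-rank style, as described in the text)."""
    ordered = sorted(values)                # smallest to largest
    rank = int(len(ordered) * p / 100)      # e.g. 10% of 10,000 -> 1,000
    return ordered[rank]                    # the 1,001st value (index 1,000)

p10 = percentile(response_times, 10)
p90 = percentile(response_times, 90)
p99 = percentile(response_times, 99)
print(f"p10={p10:.3f}s  p90={p90:.3f}s  p99={p99:.3f}s")
```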

The median, mode and range

The median, or the middle value, is also known as the 50th percentile (the middle percentile out of 100). This is the value at which 50% of the data is less than the value, and 50% is greater than the value (or equal to it). In the below graph, half of the data is to the left (shaded in blue), and a half is to the right (shaded in purple), with the 50th percentile directly in the center.

A normal distribution with the median/50th percentile depicted.

Besides that, the mode is defined as the most common or most frequently occurring value, and the range is defined as the difference between the largest and the smallest value in a data set.

Interestingly enough, when you look at the normal distribution graph, the median, average, and mode all fall on the center line, even though they’re defined differently, calculated differently, and have different meanings! This is because a normal distribution is symmetric: the points larger than the median are completely balanced, both in number and in magnitude, by the points smaller than the median.

In other words, there is always the same number of points on either side of the median, but the average considers the actual value of the points.

For the median and average to be equal, the points less than the median and the points greater than the median must be distributed symmetrically: there must be the same number of points that are somewhat larger and somewhat smaller, and much larger and much smaller.

Why is this important? The fact that median and average are the same in the normal distribution can cause some confusion. Since a normal distribution is often one of the first things we learn, we (myself included!) can think it applies to more cases than it actually does.

It’s easy to forget, or fail to realize, that only the median guarantees that 50% of the values will be above and 50% below – while the average guarantees that 50% of the weighted values will be above and 50% below (i.e., the average is the centroid, while the median is the center).

The average and median are the same in a normal distribution, and they split the graph exactly in half. But they aren’t calculated the same way, don’t represent the same thing, and aren’t necessarily the same in other distributions.

Let's look at an example. The following chart describes the PDFs of pizza delivery times in three cities: city 'A,' city 'B,' and city 'C.'

  • In city 'A,' the mean delivery time is 30 minutes, and the standard deviation is 5 minutes.
  • In city 'B,' the mean delivery time is 40 minutes, and the standard deviation is 5 minutes.
  • In city 'C,' the mean delivery time is 30 minutes, and the standard deviation is 10 minutes.

We can see that the Gaussian shapes of city 'A' and city 'B' pizza delivery times are identical; however, their centers are different. That means that in city 'A,' you wait for pizza for 10 minutes less on average, while the measure of spread in pizza delivery time is the same.

We can also see that the centers of Gaussians in city 'A' and city 'C' are the same; however, their shapes are different. Therefore, the average pizza delivery time in both cities is the same, but the measure of spread is different.

What else can we find in this data? Let's see. But before we move forward, let's summarize what we have defined so far:

  • The mean (aka average) is defined as the sum(value) / count(value).
  • The median is the middle value, where half of the data is above and half is below.
  • The mode is defined as the most common or frequently occurring value.
  • The range is defined as the difference between the largest and the smallest value in a data set.
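A quick sketch with Python's standard statistics module ties these four definitions together (the sample values are invented for illustration); note how the single outlier drags the mean well above the median:

```python
from statistics import mean, median, mode

# Hypothetical daily response times in milliseconds, for illustration only.
samples = [120, 135, 150, 150, 160, 175, 190, 210, 420]

print("mean:  ", mean(samples))                # sum(values) / count(values) -> 190
print("median:", median(samples))              # middle value: half above, half below -> 160
print("mode:  ", mode(samples))                # most frequent value -> 150
print("range: ", max(samples) - min(samples))  # largest minus smallest -> 300
```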

Variance and Standard Deviation

The variance is a measure of the spread of the data set around its mean (aka average). The variance is denoted by σ².

The Standard Deviation is the square root of the variance - denoted by the Greek letter σ (sigma).

Suppose we would like to compare the heights of two high school basketball teams. The following table provides the players' heights and the mean height of each team.

As we can see, the mean height of both teams is the same. Let us examine the height variance. Since the variance measures the spread of the data set, we would like to know the data set's deviation from its mean. We can calculate the distance from the mean for each value by subtracting the mean from it:

The height is denoted by x and the mean of the heights by the Greek letter μ. The distance from the mean for each value would be: d = x − μ.

The following table presents the distance from the mean for each value.

Some of the values are negative. To get rid of the negative values, let us square the distance from the mean: d² = (x − μ)².

The following table presents the squared distance from the mean for each value.

In order to calculate the variance of the data set, we need to find the average of all squared distances from the mean: σ² = (1/N) · Σ(xᵢ − μ)².

For team A, the variance would be σ² ≈ 0.014m².

For team B, the variance would be σ² ≈ 0.0013m².

We can see that although the mean of both teams is the same, the measure of the height spread of Team A is higher than that of Team B. Therefore, the Team A players are more diverse in height: there are players for different positions, like ball handlers, centers, and guards, while the Team B players' heights are much more uniform.

The units of the variance are meters squared; it is more convenient to look at the standard deviation, which is the square root of the variance.

The standard deviation of Team A players' heights would be 0.12m.
The standard deviation of Team B players' heights would be 0.036m.
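The article's height tables are not reproduced above, so the following Python sketch uses illustrative heights (an assumption) chosen so that both teams share the same mean while the standard deviations come out close to the 0.12m and 0.036m quoted:

```python
from statistics import fmean

def variance(values):
    """Population variance: the mean of squared distances from the mean."""
    mu = fmean(values)
    return fmean((x - mu) ** 2 for x in values)

# Illustrative heights in meters (assumed, not the article's original table);
# both teams share the same mean (~1.914 m) but spread very differently.
team_a = [1.89, 2.10, 1.75, 1.98, 1.85]
team_b = [1.94, 1.90, 1.97, 1.89, 1.87]

for name, team in (("A", team_a), ("B", team_b)):
    var = variance(team)
    print(f"Team {name}: mean={fmean(team):.3f} m, "
          f"variance={var:.4f} m^2, std dev={var ** 0.5:.3f} m")
```

The standard library also offers statistics.pvariance and statistics.pstdev, which compute the same quantities directly.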

6-Sigma (6 standard deviations)

As we can see, if you fit a Gaussian distribution to your sampled data, one standard deviation from the mean covers 68.27% of the area under the curve. Two standard deviations (2-Sigma) cover 95.45% of the area, three standard deviations cover 99.73%, and so forth, until at six standard deviations you cover about 99.9999998% of the area for a perfectly centered process (the well-known Six Sigma quality target of 3.4 defects per million corresponds to allowing a 1.5σ shift of the process mean).

The mean is represented by μ (mu), as you can see above, the mean is the center. The standard deviation is represented by σ (sigma). The standard deviation is a measurement of variation.

As seen above, one standard deviation from the mean takes in 68% of all data in a normal model, and two standard deviations from the mean take in 95% of the data.

If we look into our very first example with pizza delivery times, the following chart describes the proportions of the normal distribution.

  • 68.26% of the delivery times lie within μ±σ range (25-35 minutes)
  • 95.44% of the delivery times lie within μ±2σ range (20-40 minutes)
  • 99.74% of the delivery times lie within μ±3σ range (15-45 minutes)

Usually, measurement errors are distributed normally.

Also, many IQ tests have μ = 100 and σ = 15. So, one standard deviation either above or below the mean is IQ scores from 85 to 115. This means that 68% of the population will have IQ scores between 85 and 115. Two standard deviations from the mean will cover scores from 70 to 130. 95% of the population will have IQ scores that are within 2 standard deviations of the mean.
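A small Python sketch (using the exact normal CDF via the error function) reproduces these coverage figures and the IQ example:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Coverage of mu +/- k*sigma for k = 1..6.
for k in range(1, 7):
    coverage = normal_cdf(k) - normal_cdf(-k)
    print(f"within {k} sigma: {coverage:.7%}")

# IQ example: mu = 100, sigma = 15.
p_85_115 = normal_cdf(115, 100, 15) - normal_cdf(85, 100, 15)
p_70_130 = normal_cdf(130, 100, 15) - normal_cdf(70, 100, 15)
print(f"IQ 85-115: {p_85_115:.1%}, IQ 70-130: {p_70_130:.1%}")
```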

Quartile or Quantile

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.

Some q-quantiles have special names:

  • The only 2-quantile is called the median
  • The 3-quantiles are called tertiles or terciles
  • The 4-quantiles are called quartiles; A quartile is any of the three values which divide the sorted data set into four equal parts, so that each part represents one fourth of the sampled population.
  • First quartile (designated Q1) = lower quartile = cuts off lowest 25% of data = 25th percentile
  • Second quartile (designated Q2) = median = cuts data set in half = 50th percentile
  • Third quartile (designated Q3) = upper quartile = cuts off highest 25% of data, or lowest 75% = 75th percentile
  • The difference between the upper and lower quartiles is called the interquartile range, midspread or middle fifty = Q3 − Q1.
  • The 5-quantiles are called quintiles
  • The 6-quantiles are called sextiles
  • The 7-quantiles are called septiles
  • The 8-quantiles are called octiles
  • The 10-quantiles are called deciles
  • The 12-quantiles are called duo-deciles or dodeciles
  • The 16-quantiles are called hexadeciles
  • The 20-quantiles are called ventiles, vigintiles, or demi-deciles
  • The 100-quantiles are called percentiles
  • The 1000-quantiles have been called permilles or milliles.

Example:
Consider an ordered population of 10 data values [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]. What are the 4-quantiles (the "quartiles") of this dataset?

Using the rank rule rank = N·p (round the rank up when it is fractional, and average the two neighbouring values when it is a whole number), the first, second and third 4-quantiles (the "quartiles") of the dataset [3, 6, 7, 8, 8, 10, 13, 15, 16, 20] are [7, 9, 15]. If also required, the zeroth quartile is 3 and the fourth quartile is 20.
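A minimal pure-Python sketch of that rank rule (one of several common quantile conventions; numpy's default linear interpolation would return slightly different values, e.g. 7.25 and 14.5):

```python
from math import ceil

def quantile(sorted_values, p):
    """Rank-based quantile: rank = N*p; average two neighbours when the
    rank is a whole number, otherwise round the rank up (1-based ranks)."""
    n = len(sorted_values)
    rank = n * p
    if rank.is_integer():
        i = int(rank)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    return sorted_values[ceil(rank) - 1]

data = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
quartiles = [quantile(data, p) for p in (0.25, 0.50, 0.75)]
print(quartiles)  # [7, 9.0, 15]
```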

What are Moving average, Rolling average, Running average?

Moving Average is also known as Rolling Average, Running Average, or Rolling Mean. You calculate it by taking the average of the values inside a window of a fixed length, and recomputing it as the window slides forward in time.

It provides a standardised and concise way to summarise and analyse data, revealing the overall trend and enabling data professionals and decision-makers to draw meaningful conclusions based on distribution, central tendency, variability, and relationships within a dataset.

Many people are enthusiastic about tracking their daily step counts. So, let’s use this to understand the concept of moving average. Let’s say, instead of focusing solely on the number of steps we take each day, we calculate a 7-day moving average of step count.

To calculate the 7-day moving average, add the step counts from the past seven days and divide the sum by 7.

In this example, the 7-day moving average of 7,928.57 steps gives us a better understanding of our overall activity levels. By comparing this average to the daily step count, we can see whether we consistently meet or surpass the average.
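A short Python sketch of the same idea, with hypothetical step counts (the last seven values are invented so that they happen to reproduce the 7,928.57 figure above):

```python
# Hypothetical daily step counts for the past two weeks (illustrative only;
# the last seven days average out to the 7,928.57 mentioned above).
steps = [6800, 9200, 7400, 8100, 7700, 8600, 7900,
         7000, 8000, 9000, 6500, 8500, 7500, 9000]

def moving_average(values, window=7):
    """Rolling mean: average each consecutive `window`-sized slice."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

for day, avg in enumerate(moving_average(steps), start=7):
    print(f"day {day}: 7-day moving average = {avg:.2f}")
```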

Normal distribution in probabilistic measurements (low-accuracy and high-precision)

Expected Value, Mean and Hidden State

Expected Value is the value you would expect your hidden variable to have over a long time or many trials. Mean (aka average) and Expected Value are closely related terms. However, there is a difference. The mean is usually denoted by the Greek letter μ. The letter E usually denotes the expected value.

For example, given five coins – two 5-cent coins and three 10-cent coins – we can easily calculate the mean value by averaging the coins' values: (2 × 5 + 3 × 10) / 5 = 8 cents.

The above outcome cannot be defined as the expected value because the system states (coin values) are not hidden, and we've used the entire population (all 5 coins) for the mean value calculation.

Now consider another example: five weight measurements of the same person – 79.8kg, 80kg, 80.1kg, 79.8kg, and 80.2kg. The person is a system, and the person's weight is a system state.

The measurements are different due to the random measurement error of the scales. We do not know the actual value of the weight since it is a Hidden State. However, we can estimate the weight by averaging the scales' measurements.

The outcome of the estimate is the expected value of the weight.
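Both small calculations from this section, written out in Python:

```python
from statistics import fmean

# Mean of the full population of coins (nothing is hidden here).
coins = [5, 5, 10, 10, 10]                         # cents
print("mean coin value:", fmean(coins), "cents")   # 8.0

# Expected value of a hidden state (the true weight), estimated by
# averaging noisy measurements of the same person.
weights = [79.8, 80.0, 80.1, 79.8, 80.2]           # kg
print("estimated weight:", fmean(weights), "kg")   # ~79.98
```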

Estimate, Accuracy and Precision

An Estimate is about evaluating the hidden state of the system. The aircraft's actual position is hidden from the observer. We can estimate the aircraft's position using sensors, such as radar. The estimate can be significantly improved by using multiple sensors and applying advanced estimation and tracking algorithms (such as the Kalman Filter). Every measured or computed parameter is an estimate.

Accuracy indicates how close the measurement is to the true value.

Precision describes the variability in a series of measurements of the same parameter. Accuracy and precision form the basis of the estimate.

The following figure illustrates accuracy and precision.

High-precision systems have low variance in their measurements (i.e., low uncertainty), while low-precision systems have high variance in their measurements (i.e., high uncertainty). The random measurement error produces the variance.

Low-accuracy systems are called biased systems since their measurements have a built-in systematic error (bias).

The influence of the variance can be significantly reduced by averaging or smoothing measurements. For example, if we measure temperature using a thermometer with a random measurement error, we can make multiple measurements and average them. Since the error is random, some of the measurements would be above the actual value and others below the actual value. The estimate would be close to the actual value. The more measurements we make, the closer the estimate would be. On the other hand, a biased thermometer produces a constant systematic error in the estimate. All examples in this tutorial assume unbiased systems.
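A small simulation (with assumed temperature, bias, and noise values) illustrates the point: averaging more readings shrinks the random error, but a systematic bias survives no matter how many readings we average.

```python
import random
from statistics import fmean

random.seed(1)
TRUE_TEMP = 21.0  # degrees C, the hidden state (assumed for this simulation)

def measure(n, bias=0.0, noise=0.5):
    """Simulate n thermometer readings with a systematic bias and random noise."""
    return [TRUE_TEMP + bias + random.gauss(0.0, noise) for _ in range(n)]

for n in (10, 100, 10_000):
    unbiased = fmean(measure(n, bias=0.0))
    biased = fmean(measure(n, bias=1.5))
    print(f"n={n:>6}: unbiased estimate={unbiased:.3f}, biased estimate={biased:.3f}")
```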

A random variable describes the hidden state of the system. A random variable is a set of possible values from a random experiment. The random variable can be continuous or discrete:

  • A continuous random variable can take any value within a specific range; battery charge time or marathon race time are continuous random variables. A continuous random variable is described by a probability density function (e.g., a Gaussian).
  • A discrete random variable takes countable values, such as the number of website visitors or the number of students in a class.

The following figure represents a statistical view of measurement.

A measurement is a random variable, described by a Probability Density Function (PDF), e.g., a Gaussian.
The measurement's mean is the Expected Value of the random variable.
The offset between the measurement's mean and the actual value is the measurement's accuracy, also known as bias or systematic measurement error.

The dispersion of the distribution is the measurement precision, also known as the measurement noise, random measurement error, or measurement uncertainty.

References
