Recently I needed to calculate some statistics (the average and the standard deviation) on a stream of incoming data. I did some research about it, and this article is the result. I'm going to split it into several parts. This first part is about how to calculate the average incrementally. The second part will be about how to do the same thing with the standard deviation. A third part will be about the exponential moving average, also known as a low pass filter.

What people often call the 'average' is more technically referred to in statistics as the 'arithmetic mean'. For this article, the terms 'average' and 'mean' are interchangeable.

The usual way to calculate the average for a set of data that we all learn in school is to add up all of the values (the total), and then divide by the number of values (the count):

Here's the math notation that describes what I just wrote above:

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Below is a simple JavaScript function that uses this naive approach to obtain the mean:

```javascript
const simpleMean = values => {
    validate(values)
    const sum = values.reduce((a, b) => a + b, 0)
    const mean = sum / values.length
    return mean
}

const validate = values => {
    if (!values || values.length == 0) {
        throw new Error('Mean is undefined')
    }
}
```

While this logic is fine as far as it goes, it does have a couple of limitations in practice:

- We accumulate a potentially large sum, which can cause precision and overflow problems when using floating point types.
- We need to have all of the data available before we can do the calculation.

Both of these problems can be solved with an incremental approach where we adjust the average for each new value that comes along. I'll show how to derive this formula with some math first, and then I'll show a JavaScript implementation.

There are two symbols in math that are often used to denote the mean.

`μ`, the lowercase Greek letter mu, refers to the population mean. This is the average for an entire population: all of the possible values. The average of all the grades on a particular test is an example.

`x̄`, pronounced x-bar, refers to the sample mean. This is the average of a sample of values from the total population. You might take a random sample of people across the country to find the average height of the population, since it's impractical to measure the height of every single person in an entire country. Of course, when using a sample, it's desirable to get as close as possible to the mean of the population the sample represents.

I decided to use the sample notation for this article to indicate that the average we're calculating could be based on a sample.

Okay, let's start with the formula for the average that we saw earlier:

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Let's split up the sum so that we add up the first n-1 values first, and then we add the last value x_n:

$$\bar{x}_n = \frac{1}{n}\left(\sum_{i=1}^{n-1} x_i + x_n\right)$$

We know that the average = total / count:

$$\bar{x}_{n-1} = \frac{\sum_{i=1}^{n-1} x_i}{n-1}$$

Let's rearrange this a bit:

$$\sum_{i=1}^{n-1} x_i = (n-1)\,\bar{x}_{n-1}$$

Here's the result of applying the above substitution to the total of the first n-1 values:

$$\bar{x}_n = \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n}$$

Let's expand this:

$$\bar{x}_n = \frac{n\,\bar{x}_{n-1} - \bar{x}_{n-1} + x_n}{n}$$

Rearranging a bit, we get:

$$\bar{x}_n = \frac{n\,\bar{x}_{n-1}}{n} + \frac{x_n - \bar{x}_{n-1}}{n}$$

We can cancel out the `n`'s in the first fraction to obtain our final result:

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$

What does this all really mean? We now have a recurrence relation that defines our mean for the nth value as follows: Add a differential to whatever the mean was for the previous n-1 values. Each time we add a new value, all we have to do is to calculate this differential and add it to the previous average. This now becomes the new average.

Below is a simple implementation of this idea:

```javascript
class MovingAverageCalculator {
    constructor() {
        this.count = 0
        this._mean = 0
    }

    update(newValue) {
        this.count++
        const differential = (newValue - this._mean) / this.count
        const newMean = this._mean + differential
        this._mean = newMean
    }

    get mean() {
        this.validate()
        return this._mean
    }

    validate() {
        if (this.count == 0) {
            throw new Error('Mean is undefined')
        }
    }
}
```

In the above code, each time we call `update` with a new value, we increment the count and calculate our differential. `newMean` is the previous average added to this differential. That now becomes the average that will be used the next time we call `update`.

Below is a simple comparison of the two methods:

```javascript
console.log('simple mean = ' + simpleMean([1, 2, 3]))

const calc = new MovingAverageCalculator()
calc.update(1)
calc.update(2)
calc.update(3)
console.log('moving average mean = ' + calc.mean)
```

The result is as expected:

```
C:\dev>node RunningMean.js
simple mean = 2
moving average mean = 2
```

There are of course many other kinds of moving averages that are possible, but if you simply want a cumulative moving average, this logic works well: It's simple, you can apply it to a streaming data set, and it sidesteps problems with precision and overflow that can happen with the naive approach.

Before concluding, I'd like to derive one more identity using our last result. We don't need it right now, but we'll use it in the next article.

We'll start with our recurrence relation for the mean:

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$

Let's subtract the first term on the right from both sides, giving us the value of just our differential:

$$\bar{x}_n - \bar{x}_{n-1} = \frac{x_n - \bar{x}_{n-1}}{n}$$

Now let's multiply by n:

$$n\,(\bar{x}_n - \bar{x}_{n-1}) = x_n - \bar{x}_{n-1}$$

Let's multiply both sides by -1:

$$-n\,(\bar{x}_n - \bar{x}_{n-1}) = -(x_n - \bar{x}_{n-1})$$

And finally, let's multiply the -1 through both sides:

$$n\,(\bar{x}_{n-1} - \bar{x}_n) = \bar{x}_{n-1} - x_n$$

We'll just hold on to this identity for now, but it will be useful in part 2 where we derive the formula for incrementally calculating the variance and standard deviation.
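We can spot-check this identity numerically. Below is a quick sketch (the variable names are my own) that computes both sides of the identity while running the incremental mean update over a small data set:

```javascript
// Spot-check of the identity: n * (x̄_{n-1} - x̄_n) = x̄_{n-1} - x_n
const values = [10, 12, 9, 17, 4]

let mean = 0
values.forEach((x, i) => {
    const n = i + 1
    const prevMean = mean
    mean = mean + (x - mean) / n       // incremental mean update
    const lhs = n * (prevMean - mean)  // left side of the identity
    const rhs = prevMean - x           // right side of the identity
    console.log(Math.abs(lhs - rhs) < 1e-9)  // prints true each time
})
```

Both sides agree at every step, up to floating point tolerance.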


## Discussion (12)

I'm not sure if this corrects the floating point precision problem or swaps it out for a new one.

When N gets large, the original equation suffers from having a large accumulated value, thus running the risk of added samples not preserving their full precision -- in the worst case they become 0.

But, in the online approach, you're dividing each sample by N and adding to the current average. This has the same problem: as N grows, the value of each individual sample's contribution decreases, running the risk of it not being significant compared to the current running average.

That's an excellent point! It does seem as though the average will basically stop changing once n gets high enough. Off the top of my head, it seems that one could calculate the average over a window of the most recent m values rather than the total n of all the values received so far. Maybe there's a better solution than that one though. What do you think?
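A minimal sketch of that windowed idea (the class and names are mine, purely illustrative), keeping only the most recent m values:

```javascript
// Moving average over a window of the most recent values.
// Illustrative sketch only; shift() is O(n), so a ring buffer
// would be better for large windows.
class WindowedAverage {
    constructor(windowSize) {
        this.windowSize = windowSize
        this.buffer = []
        this.sum = 0
    }

    update(newValue) {
        this.buffer.push(newValue)
        this.sum += newValue
        if (this.buffer.length > this.windowSize) {
            this.sum -= this.buffer.shift() // drop the oldest value
        }
    }

    get mean() {
        if (this.buffer.length === 0) {
            throw new Error('Mean is undefined')
        }
        return this.sum / this.buffer.length
    }
}
```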

Yes, a windowed solution will work. It depends on the needs of the software. I've used the solution you've provided many times without problem, where accuracy wasn't vitally important.

Creating higher precision solutions can be challenging. An alternative, if speed isn't a big issue (which it usually isn't), is to use a high-precision number library, like libGmp, and use something like 1024-bit floating point numbers. It's what I use for constants in the Leaf compiler.

I might be missing something, but why not save the sum, and run the update function like so?
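The commenter's snippet wasn't preserved, but presumably it was something along these lines (a hypothetical reconstruction, names mine):

```javascript
// Hypothetical version of the commenter's idea: keep a running
// sum and count, and derive the mean on demand.
class SumBasedCalculator {
    constructor() {
        this.sum = 0
        this.count = 0
    }

    update(newValue) {
        this.sum += newValue
        this.count++
    }

    get mean() {
        if (this.count === 0) {
            throw new Error('Mean is undefined')
        }
        return this.sum / this.count
    }
}
```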

If you have a problem where a large accumulated value and a desire for full precision are both in scope, then you simply use a datatype better suited to handle this than float.

Yes, if you are okay accumulating the sum like this, it’s fine. With floating point values, this approach can have issues, but if it meets your needs (e.g. using a different data type as you mentioned) then it’s okay too. Everything depends on one’s particular use case. As @edA-qa mentioned, the approach I present here can have some problems too when n gets big, so it’s not perfect either. I believe it can in turn be corrected with, for example, something like an exponential moving average, but that’s a whole other story. There are many ways to skin a cat!

A much faster way to do that JS version is to get rid of `reduce` and use the plain old for loop. I could set up a JSPerf test to prove that the plain old for loop kicks ass when it comes to performance compared to `map`, `filter`, `reduce` and the other ES6 methods. I mention this since this post says "real time".

Also a better approach is

sum = sum - oldestValue

sum = sum + newestValue

where oldestValue is just beyond the moving average's period
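In code, that subtract-oldest / add-newest update might look something like this (a sketch; the ring-buffer bookkeeping is my own, illustrative addition):

```javascript
// Fixed-period moving average using the subtract-oldest /
// add-newest update described above. A ring buffer tracks which
// value falls out of the window.
const period = 3
const buffer = new Array(period).fill(0)
let sum = 0
let count = 0

const update = newestValue => {
    const slot = count % period
    sum = sum - buffer[slot]  // remove the value leaving the window
    sum = sum + newestValue   // add the incoming value
    buffer[slot] = newestValue
    count++
    return sum / Math.min(count, period)
}
```

Until the window fills up, the divisor is the number of values seen so far rather than the full period.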

Wow impressive, thanks!! I never learned so much from a single article.

Thank you so much!

this is awesome - do you know if this technique could be applied to computing a percentile, such as the 90th, in web log analysis?

In the next article, I will show how to calculate variance and standard deviation incrementally. Assuming your distribution is applicable, that could work I think.

Very cool, thanks for sharing. Mind you, I’ll have to read this a couple more times, and then I’ll be waiting anxiously for anything resembling a nail to use this hammer on.

Thank you!