Recently I needed to calculate some statistics (the average and the standard deviation) on a stream of incoming data. I did some research about it,...
I'm not sure if this corrects the floating point precision problem or swaps it out for a new one.
When N gets large, the original equation suffers from having a large accumulated value, thus running the risk of added samples not preserving their full precision -- in the worst case they become 0.
But, in the online approach, you're dividing each sample by N and adding to the current average. This has a similar problem: as N grows, the contribution of each individual sample shrinks, running the risk of it no longer being significant compared to the current running average.
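To make that concrete, the incremental update in question looks roughly like this (a minimal sketch, not the article's exact code):

```javascript
// Minimal sketch of the incremental (online) mean update being discussed.
// Each new sample nudges the running average by (value - mean) / count,
// so as count grows, each individual sample moves the mean less and less.
function updateMean(mean, count, value) {
  const newCount = count + 1;
  return { mean: mean + (value - mean) / newCount, count: newCount };
}

let state = { mean: 0, count: 0 };
for (const x of [2, 4, 6, 8]) {
  state = updateMean(state.mean, state.count, x);
}
console.log(state.mean); // 5
```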
That's an excellent point! It does seem as though the average will basically stop changing once n gets high enough. Off the top of my head, it seems that one could calculate the average over a window of the most recent m values rather than the total n of all the values received so far. Maybe there's a better solution than that one though. What do you think?
Yes, a windowed solution will work. It depends on the needs of the software. I've used the solution you've provided many times without problem, where accuracy wasn't vitally important.
Creating higher precision solutions can be challenging. An alternative, if speed isn't a big issue (which it usually isn't), is to use a high-precision number library, like libGmp, and use something like 1024-bit floating point numbers. It's what I use for constants in the Leaf compiler.
I might be missing something, but why not save the sum, and run the update function like so?
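Something along these lines (just a rough sketch of the running-sum idea; the names are made up for illustration):

```javascript
// Rough sketch of keeping the running sum and count instead of the average.
// The average is only computed on demand, so each sample is added at its
// full magnitude rather than being divided down first.
function createAccumulator() {
  let sum = 0;
  let count = 0;
  return {
    update(value) {
      sum += value;
      count += 1;
    },
    average() {
      return count === 0 ? 0 : sum / count;
    },
  };
}

const acc = createAccumulator();
[10, 20, 30].forEach((x) => acc.update(x));
console.log(acc.average()); // 20
```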
If you have a problem where a large accumulated value and a desire for full precision are both in scope, then you simply use a data type better suited to handle this than float.
Yes, if you are okay accumulating the sum like this, it’s fine. With floating point values, this approach can have issues, but if it meets your needs (e.g. using a different data type as you mentioned) then it’s okay too. Everything depends on one’s particular use case. As @edA-qa mentioned, the approach I present here can have some problems too when n gets big, so it’s not perfect either. I believe it can in turn be corrected with, for example, something like an exponential moving average, but that’s a whole other story. There are many ways to skin a cat!
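For what it's worth, an exponential moving average along these lines is one way to keep newer values from becoming insignificant (a sketch only; the alpha smoothing factor is something you'd tune for your use case):

```javascript
// Sketch of an exponential moving average: recent samples are weighted more
// heavily, so the average keeps responding even after many samples.
function createEma(alpha) {
  let ema = null;
  return (value) => {
    ema = ema === null ? value : alpha * value + (1 - alpha) * ema;
    return ema;
  };
}

const ema = createEma(0.1);
[5, 5, 5, 50].forEach((x) => console.log(ema(x)));
// 5, 5, 5, 9.5 -- the jump to 50 still moves the average noticeably
```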
A much faster way to do that JS version is to get rid of reduce and use a plain old for loop. I could set up a JSPerf test to prove that the plain old for loop kicks ass when it comes to performance compared to map, filter, reduce, and the other crappy ES6 methods. Mentioning this since this post says "real time".
Also, a better approach is:

sum = sum - oldestValue
sum = sum + newestValue

where oldestValue is the value just beyond the moving average's period.
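A rough sketch of that sliding-window idea, assuming a fixed period and a plain array as the buffer:

```javascript
// Sketch of a fixed-period moving average that subtracts the value falling
// out of the window and adds the newest one, instead of re-summing the window.
function createMovingAverage(period) {
  const window = [];
  let sum = 0;
  return (value) => {
    window.push(value);
    sum += value;
    if (window.length > period) {
      sum -= window.shift(); // drop the oldest value just beyond the period
    }
    return sum / window.length;
  };
}

const next = createMovingAverage(3);
console.log([1, 2, 3, 4, 5].map(next)); // [1, 1.5, 2, 3, 4]
```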
Wow, impressive, thanks!! I've never learned so much from a single article.
Thank you so much!
This is awesome - do you know if this technique could be applied to computing a percentile, such as the 90th percentile, in web log analysis?
In the next article, I will show how to calculate variance and standard deviation incrementally. Assuming your distribution is applicable, I think that could work.
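For anyone curious in the meantime, one common way to track variance incrementally is Welford's algorithm; here is a rough sketch (not necessarily the code from the upcoming article):

```javascript
// Sketch of Welford's algorithm for incrementally tracking mean and variance.
function createStats() {
  let count = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared deviations from the current mean
  return {
    update(value) {
      count += 1;
      const delta = value - mean;
      mean += delta / count;
      m2 += delta * (value - mean);
    },
    mean: () => mean,
    variance: () => (count > 1 ? m2 / (count - 1) : 0), // sample variance
    stdDev() {
      return Math.sqrt(this.variance());
    },
  };
}

const stats = createStats();
[2, 4, 4, 4, 5, 5, 7, 9].forEach((x) => stats.update(x));
console.log(stats.mean(), stats.variance()); // 5, ~4.57
```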
Very cool, thanks for sharing. Mind you, I’ll have to read this a couple more times, and then I’ll be waiting anxiously for anything resembling a nail to use this hammer on.
Thank you!