Why the outliers?
When I started studying data science, one of the things that struck me most from the beginning was the concept of outliers, and in particular the way they are usually dealt with during data analysis.
What I found to be the common procedure for handling outliers in data analysis clashed immediately with what I had learned in the past.
From that moment on I have wanted to explore this topic further and bring up a few points, even if only to plant a little doubt in our minds as data scientists when we so lightly decide to “get rid of the outliers”...
A little bit of background
For anyone not familiar with the concept: when you are observing a certain phenomenon, or analyzing a set of data, the outliers are the results that lie at the extremes of your spectrum of values, or significantly outside its range compared to the bulk of the results.
If this doesn’t explain it, let me borrow a definition: ”In statistics, an outlier is a data point that differs significantly from other observations.”
And this is how the paragraph continues:
"An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses." (See this link for further reading about outliers)
As the article mentions, outliers can be (and I would say tend to be) excluded from the dataset and therefore excluded from the statistical analysis.
In other words, since these values are extremes, they represent “exceptions”, and so they tend to be ignored so that the exploratory data analysis (EDA) yields more consistent results.
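To make that procedure concrete, here is a minimal sketch of one common exclusion recipe, the 1.5 × IQR rule, in Python with pandas. The column name and the numbers are made up for illustration; other cutoffs (z-scores, fixed percentiles) are used just as often.

```python
import pandas as pd

# Toy data: most values cluster around 10, one sits far outside the range.
df = pd.DataFrame({"value": [8, 9, 10, 10, 11, 12, 9, 10, 11, 95]})

# The 1.5 * IQR rule: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["value"] < lower) | (df["value"] > upper)]
print(outliers)  # the row holding 95

# The step this article is questioning: dropping them before the analysis.
df_clean = df.drop(outliers.index)
```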
Why I want to defend the outliers
From the very first time I read this definition and learned that cutting out the outliers is common procedure, something inside of me said “no, this is not right!”.
Forgive me, but this is the physicist in me, in particular an experimental physicist who learned as part of her training that no point, no result, no observation should be excluded from an analysis, unless we are certain that it was due to a mistake.
The physicist’s point of view
The physics approach to outliers rests on the fact that sometimes we do not expect to observe a certain behavior or result from an experiment, yet that particular result (granted that it is not due to a mistake in the procedure) may actually be revealing a new aspect, a new characteristic, or even a whole new phenomenon that we did not expect.
I want to bring up an example, one of my favorite experiments in modern physics, which illustrates this concept very well: the discovery of the nucleus by Ernest Rutherford.
”At the time of the experiment, in 1909, the electron was the only known atomic particle. Physicists had thought of a number of possible arrangements for electrons inside the atom. These arrangements, or atomic models, had to account for the fact that matter, generally speaking, is electrically neutral and therefore, to counteract the negatively charged electron, the atom must contain positive electricity in some form.
A positively charged particle comparable to the electron was not known to exist; it was possible that the positive electricity took a different form, perhaps a fluid. [...]
Alpha particles are smaller than atoms, but heavy, therefore they could serve as high-energy bullets for probing atoms at a time [...].
In his laboratory, beams of alpha particles were sent through atoms [of a gold foil specifically] to hit a detecting screen, so that one could observe whether or not the particles had been knocked off course during their intra-atomic journey.
This method of exploration has been likened to shooting bullets through a bale of hay in which a very small piece of platinum has been hidden. Most of the bullets would encounter nothing but hay and these would pass right through the bale and out to the other side. But should a bullet chance to hit the platinum, it would ricochet at some angle. And if an enormous number of bullets were shot at the bale, so that the hidden nugget was hit often, ricochets in various directions could reveal the nugget’s location and shape.
This was the research that Rutherford suggested: to bombard atoms with alpha particles to see whether any were scattered at a wide angle and there was every reason to believe that they would not find one. Still it should be tried. [...]
Contrary to all expectations, they found that out of the thousands of alpha particles he had tracked through a gold foil, some, a very few, had been deflected at wide angles.
Of these, one or two had been turned aside by more than 90 degrees; they had come out of the target on the same side they had entered it.”
From the book Men Who Made a New Physics by Barbara Lovett Cline, which I strongly recommend as an easy, fun read about the discoveries of modern physics.
This discovery was what led Rutherford to propose his Rutherford model, which correctly described the internal structure of the atom, unknown until that point.
What made me think about the gold foil experiment when learning about outliers is the fact that the result Rutherford found, the alpha particles scattered at wide angles, is exactly what would be considered an outlier.
It was a tremendously small percentage of observations compared to all the rest that were collected.
It was also significantly different from the bulk of the results, in which scattering happened at very small angles, if at all.
But if those few measurements had not been taken seriously, and had been discarded as “mistakes” or “outliers”, we wouldn’t know the atom as we know it today, and physics might have kept building its research on a mistake (like assuming that there is a positively charged fluid inside each atom).
This is the reason why I was taught, as a physicist, not to discard observations and to include all the measurements and all the results in all of my analyses.
The data scientist’s point of view
Now, I know very well that in an EDA there are good reasons to ignore the outliers: especially the ones lying far outside the range of the bulk of the observations tend to heavily influence statistical measures like the mean, which is one of the most indicative and most used values in a statistical analysis.
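A quick worked example of that sensitivity, with made-up numbers:

```python
import statistics

bulk = [8, 9, 10, 10, 11, 12]
with_outlier = bulk + [95]

# One extreme value drags the mean far away from the bulk of the data...
print(statistics.mean(bulk))          # 10
print(statistics.mean(with_outlier))  # ~22.1

# ...while a robust measure like the median barely notices it.
print(statistics.median(bulk))          # 10.0
print(statistics.median(with_outlier))  # 10
```

This is also why the mean is often paired with, or replaced by, robust statistics like the median when the outliers are kept in.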
I am also aware that when we conduct this type of analysis, especially if it is meant to support some sort of business recommendation, what we are most interested in is the bulk of the results, the majority of the population, since we have to predict what we can reasonably expect in the future. And what we can most realistically expect is the average, not an outlier.
What Amazon wants to know is what most customers would buy, not what one unpredictable person is going to purchase on one particular day. The general market looks for high-volume, predictable, safe trends in how the general population behaves, and that model has no use for outliers.
But hear me out, one last point for our poor outcasts…
We were all outliers in the past two years
I want to bring up one more example to reflect on the importance of outliers: the way we have all led our lives over the past two years, as a consequence of the Covid-19 pandemic.
If we look at our behavior, many of the things that became usual for us during the pandemic are things we almost never did before 2020.
Consider the trends of people wearing masks, isolating and socially distancing, working remotely, regularly using hand sanitizer or gloves or face shields (outside of a medical environment), or getting constantly tested for a virus...
All these behaviors, which over the past two years became the norm for most of us, would have shown up as outliers in any pre-pandemic study.
Situations can arise that shift our habits, even radically, and in a way, not cutting out what is “strange”, “outside the range” or “unexpected” can teach us something, make us more prepared, or open us up to new types of studies.
Conclusion
Sometimes it is not possible to predict things with a statistical model; sometimes reality surprises us because of other people’s freedom (like when my son says yes to cleaning up his toys without complaining… totally outlier behavior), and sometimes it hits us hard in ways we could never have imagined.
But I think that sometimes looking beyond the bulk of the results, not ignoring the observations we didn’t expect, can teach us something new and open up our minds to the unexpected.
Probably, for statistics’ sake, we will all still need to get rid of our outliers. But I hope this different point of view makes you a little more curious, when you are deleting them, about what happened there: what was the out-of-the-ordinary decision that person made, and why? What could we have learned from it? And maybe, like me, you will feel that slight twist of the stomach when you run df.drop().
...
Resources and Further Reading:
Barbara Lovett Cline, Men Who Made a New Physics: https://www.amazon.com/Men-Who-Made-New-Physics/dp/0226110273