Posted on Aug 31, 2023

#02Python - Outliers and their types

#python #outliers #statistics #programming

Outliers are unusual values that stand out from the rest of a dataset, usually because they are extremely high or extremely low. They can result from measurement errors, incorrect inputs, or rare events, or they can carry genuine information about the observed phenomenon.

Outliers can be categorized into different types based on their nature. Here are some types of outliers:

Univariate Outliers (single variable): A value that stands far apart from the others within a single variable.
An illustrative example occurs in a pizza-eating contest where all competitors eat 3 to 5 slices, but one person devours 20 slices and still asks for more! This person is the univariate outlier of the group.
It can distort summary measures such as the mean and standard deviation (the median is far less affected), as well as graphs, skewing statistics related to the variable.
Multivariate Outliers (across multiple variables): Values that only stand out when several variables are considered at the same time. Detecting multivariate outliers is more complex because it requires accounting for interactions between variables.
An example happens at a costume party where most people chose common costumes, but someone shows up dressed as a dragon, floating with a jetpack, and carrying a giant violin. This "Space Dragon Musician" is a multivariate outlier.
It can influence analyses involving interactions between variables, such as heatmaps and correlations, leading to wrong conclusions if not properly addressed.
Global Outliers: Values that lie far from all the other data points in the entire dataset.
For instance, in a physical education class, the teacher asks everyone to write their height on a sheet. While most heights fall between 1.50 m and 1.70 m, the "Super Basketball Player" writes 2.20 m, making him the global outlier.
This type pulls statistics like the mean away from the typical value, making them less representative of the data as a whole.
Contextual Outliers: Values whose status as outliers depends on the specific context of the problem.
For example, in a salary study within a company, a value 10 times above the average is unusual and noteworthy. However, upon closer examination, it might correspond to a high-ranking position in the organization. Although much higher than the others, its presence is not an error.
The impact here might be smaller, as its justification is tied to circumstances. It usually doesn't drastically distort aggregate statistics if treated as a special case.
Replicated Data Outliers: Values that stand out when the same quantity is measured repeatedly at different times or locations.
They can arise due to temporal or spatial variations, or changes in measurement methodology. They can provide insights into changes in the phenomenon over time or space.
Imagine measuring your mug's height using a ruler on your desk every day. On the first day, you record 12 cm, the next day 13 cm, and the third day 11 cm. This doesn't mean the mug is growing or the ruler is changing; it is your measurement that varies.
Influential Outliers: Values that significantly change the result of a statistical analysis, such as the fit of a regression model. This type of outlier can shift the slope and fit of the regression line, potentially leading to incorrect conclusions if not properly addressed.
Consider a car dealership that mostly sells popular cars priced between $30,000 and $50,000. One luxury car sells for $150,000, and that single sale has a major impact on sales metrics and the average price of cars sold.
Random Outliers: Caused by measurement errors or natural variation in the data. They occur at random and do not indicate a meaningful pattern.
For instance, in an industrial temperature measurement experiment, sensors typically read between 22°C and 25°C. However, in one reading, the sensor indicates 500°C.
This is most likely a measurement error and does not represent the actual temperature.
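
To make the univariate and random cases above more concrete, here is a minimal sketch of one common way to flag such values: the interquartile range (IQR) rule. This is only one possible technique and is not prescribed by the text above. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample numbers are illustrative assumptions that loosely follow the pizza contest and the temperature sensor examples.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data: slices eaten per competitor, and sensor readings in °C.
slices = [3, 4, 4, 5, 3, 5, 4, 20]
temps = [22.1, 23.4, 24.0, 22.8, 24.9, 23.1, 500.0]

print(iqr_outliers(slices))  # flags the 20-slice competitor
print(iqr_outliers(temps))   # flags the 500 °C reading

# The mean is pulled toward the outlier, while the median barely moves.
print(statistics.mean(slices), statistics.median(slices))
```

With this made-up data the mean of the slice counts is 6 while the median is 4, which illustrates why the mean is the more fragile summary. Note that the rule only looks at one variable at a time, so it would not catch multivariate or contextual outliers.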

It's important to note that not all outliers are errors or process failures. Some provide information about the studied phenomenon or indicate special circumstances. When dealing with outliers, it's essential to understand the context and decide whether they should be treated, transformed, or retained.
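
As a rough sketch of what "treated, transformed, or retained" might look like in practice, the snippet below applies each option to a small list of prices. The values, the helper names `cap_values` and `log_transform`, and the cap of 50,000 are assumptions made for illustration, loosely based on the dealership example. Which option is appropriate depends on the context of the data.

```python
import math
import statistics

# Hypothetical sale prices echoing the dealership example (illustrative values).
prices = [32_000, 38_000, 41_000, 45_000, 49_000, 150_000]

def cap_values(values, upper):
    """Treat: limit every value to at most `upper` (a simple form of capping)."""
    return [min(v, upper) for v in values]

def log_transform(values):
    """Transform: compress large values with a base-10 logarithm."""
    return [math.log10(v) for v in values]

retained = list(prices)              # Retain: keep the data unchanged
capped = cap_values(prices, 50_000)  # 50_000 is an assumed cap, chosen by hand
logged = log_transform(prices)

print(statistics.mean(retained))  # mean pulled upward by the luxury sale
print(statistics.mean(capped))    # mean after capping the extreme value
print(max(logged) - min(logged))  # spread of the data on the log scale
```

Capping changes the data, so it is usually worth recording which values were altered. The log transform keeps every observation but reduces how far the extreme value sits from the rest.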
