DEV Community

Naftali M

Outlier detection and handling in R

An outlier is a data point that significantly differs from other observations in a dataset. It can be:

  1. Unusually high or low compared to the rest of the data.
  2. Anomalous due to measurement errors, data entry mistakes, or rare events.
  3. A true extreme value that represents natural variation.

Example data set

The loaded data frame is named Data.
[Image: the example data set]
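
To follow along without the original file, a small stand-in data frame can be built by hand. The sketch below is an assumption, not the author's data: only the column names Age and Net_worth and the two extreme values (93 and 152000, at rows 10 and 12) come from this post. Every other number is made up for illustration.

```r
# Hypothetical stand-in for the post's data set.
# Only the column names and the two extreme values (93, 152000) come from
# the post; all other numbers are invented for illustration.
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 30, 34),
  Net_worth = c(12000, 15000, 11000, 18000, 22000, 20000, 13000, 16000,
                25000, 19000, 17000, 152000, 12500, 14000, 16500)
)

# Quick look at the structure and the first rows
str(Data)
head(Data)
```

With a stand-in like this, the remaining snippets can be run as written, although the summaries and plots will differ from the author's screenshots.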

How to identify outliers

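  1. Basic summary function

summary() reports the minimum, quartiles, median, mean and maximum of each numeric column; a minimum or maximum far from the quartiles hints at an outlier.
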
summary(Data)

output

[Image: output of summary(Data)]

  2. Visual methods (using a box plot)

Points drawn individually beyond the whiskers of a box plot are potential outliers.

Plot Age on a box plot

boxplot(Data$Age, main = "Age", col = "skyblue")

output

[Image: box plot of Age]

Plot Net_worth on a box plot

boxplot(Data$Net_worth, main = "Networth in *10000 PLN", col = "orange")

output

[Image: box plot of Net_worth]
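
The points a box plot draws individually can also be read off numerically. As a sketch, base R's boxplot.stats() returns them in its out element. The vector below is made-up data, not the post's data set. Note that boxplot.stats() builds its fences from hinges, which can differ slightly from quartiles computed with quantile().

```r
# Made-up ages for illustration; 93 is the planted extreme value
age <- c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 30, 34)

# boxplot.stats() returns the five-number summary ($stats) and the
# points lying beyond the whiskers ($out)
stats <- boxplot.stats(age)
print(stats$stats)
print(stats$out)
```

This gives the same information as the plot without having to read values off the axis.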

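  3. Using the interquartile range

A value is flagged as an outlier when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

Identify the outlier among the Age values

# Quartiles of Age
Q1 <- quantile(Data$Age, 0.25)
Q3 <- quantile(Data$Age, 0.75)
# Named iqr_age so the variable does not shadow the built-in IQR() function
iqr_age <- Q3 - Q1
# Fences at 1.5 * IQR beyond the quartiles
lower_bound_age <- Q1 - 1.5 * iqr_age
upper_bound_age <- Q3 + 1.5 * iqr_age
# Values outside the fences are flagged as outliers
outlier_age <- Data$Age[Data$Age < lower_bound_age | Data$Age > upper_bound_age]
print(outlier_age)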

output
93

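Identify the outlier among the Net_worth values

# Quartiles of Net_worth
Q1 <- quantile(Data$Net_worth, 0.25)
Q3 <- quantile(Data$Net_worth, 0.75)
# Named iqr_net_worth so the variable does not shadow the built-in IQR() function
iqr_net_worth <- Q3 - Q1
# Fences at 1.5 * IQR beyond the quartiles
lower_bound_Net_worth <- Q1 - 1.5 * iqr_net_worth
upper_bound_Net_worth <- Q3 + 1.5 * iqr_net_worth
# Values outside the fences are flagged as outliers
outlier_networth <- Data$Net_worth[Data$Net_worth < lower_bound_Net_worth | Data$Net_worth > upper_bound_Net_worth]
print(outlier_networth)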

output
152000
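
Because the same steps are repeated for each column, they can be wrapped in a small function. The following is a minimal sketch: iqr_outliers is a hypothetical helper name, not part of base R, and both vectors are made-up data in which 93 and 152000 are the planted extreme values.

```r
# Hypothetical helper: returns the values of x outside the k * IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]                  # interquartile range
  lower <- q[1] - k * spread             # lower fence
  upper <- q[2] + k * spread             # upper fence
  x[!is.na(x) & (x < lower | x > upper)]
}

# Made-up values for illustration
age <- c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 30, 34)
net_worth <- c(12000, 15000, 11000, 18000, 22000, 20000, 13000, 16000,
               25000, 19000, 17000, 152000, 12500, 14000, 16500)

print(iqr_outliers(age))
print(iqr_outliers(net_worth))
```

Keeping the multiplier k as an argument makes it easy to try a stricter or looser fence than the conventional 1.5.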

How to handle outliers

  1. Dropping the outliers using the interquartile range

Keep only the rows whose Age and Net_worth both fall within the bounds computed above.

new_data <- Data[
  Data$Net_worth >= lower_bound_Net_worth & Data$Net_worth <= upper_bound_Net_worth &
  Data$Age >= lower_bound_age & Data$Age <= upper_bound_age, 
]

summary(new_data)

output

[Image: output of summary(new_data)]
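
The same filtering step can be written with a logical mask, which also makes it easy to see how many rows are lost. This is a sketch under assumptions: is_iqr_outlier is a hypothetical helper, toy is a made-up data frame rather than the post's data, and the columns are assumed to contain no missing values.

```r
# Hypothetical helper: TRUE where x lies outside the k * IQR fences
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

# Made-up data for illustration (no missing values assumed)
toy <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40, 27, 30, 34),
  Net_worth = c(12000, 15000, 11000, 18000, 22000, 20000, 13000, 16000,
                25000, 19000, 17000, 152000, 12500, 14000, 16500)
)

# Flag a row if either column holds an outlier, then keep the rest
flagged <- is_iqr_outlier(toy$Age) | is_iqr_outlier(toy$Net_worth)
toy_clean <- toy[!flagged, ]

# Dropping rows shrinks the data set, so check how many were removed
cat("Rows before:", nrow(toy), " after:", nrow(toy_clean), "\n")
```

Dropping is the simplest option, but every removed row also discards that person's values in the other columns, which matters in small data sets.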

  2. Substituting the outliers with the column mean

Identify the row index of each outlier

# Find the row index of each outlier value
which(Data$Net_worth == 152000)
which(Data$Age == 93)

output
12 (Net_worth), 10 (Age)

Replace the outliers with the column means

# Replace the outlying data points with the column mean
Data$Net_worth[12] <- mean(Data$Net_worth)
Data$Age[10] <- mean(Data$Age)
summary(Data)
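
One caveat of the replacement above: mean(Data$Net_worth) is evaluated while the outlier is still in the column, so the extreme value pulls the replacement upward. A possible variation, sketched below on made-up data, locates the outliers from the IQR fences instead of hard-coded values and fills them with the mean of the remaining observations. The names is_out and clean_mean are illustrative only.

```r
# Made-up net worth values; 152000 is the planted extreme value
net_worth <- c(12000, 15000, 11000, 18000, 22000, 20000, 13000, 16000,
               25000, 19000, 17000, 152000, 12500, 14000, 16500)

# IQR fences, as in the detection step
q <- quantile(net_worth, c(0.25, 0.75), names = FALSE)
spread <- q[2] - q[1]
is_out <- net_worth < q[1] - 1.5 * spread | net_worth > q[2] + 1.5 * spread

# Positions of the outliers, found from the fences rather than typed by hand
print(which(is_out))

# Mean of the non-outlying values only, so the extreme value does not inflate it
clean_mean <- mean(net_worth[!is_out])
net_worth[is_out] <- clean_mean
print(clean_mean)
```

The median of the column is another common fill value, since it is barely affected by a single extreme observation.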

Plot the updated columns on a box plot

boxplot(Data$Age, 
        main = "Age", 
        col = "green", 
        border = "blue")

output

[Image: box plot of Age after replacing the outlier]

boxplot(Data$Net_worth, 
        main = "Networth in *10000 PLN", 
        col = "yellow", 
        border = "blue")

output

[Image: box plot of Net_worth after replacing the outlier]