Machine Learning - Outliers - Detecting and Dealing with Them

#machinelearning #datascience

HOW TO DETECT OUTLIERS?

A. Outlier detection methods fall broadly into three types:

1. Supervised methods
2. Semi-supervised methods
3. Unsupervised methods

B. Simply visualizing the data can also help us find outliers. Box plots, histograms and scatter plots are commonly used for this.

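C. Any value outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR can be considered an outlier, where Q1 and Q3 are the first and third quartiles and IQR (Inter-Quartile Range) = Q3 - Q1.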

D. Using capping methods, any value outside the range of the 5th to 95th percentile can be considered an outlier.

E. Data points that are three or more standard deviations away from the mean are considered outliers.

F. Outlier detection is a special case of examining data for influential data points, and what counts as an outlier also depends on business understanding.
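
The rules in C, D and E can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function names, the sample data and the choice of the "inclusive" quantile method are assumptions made for the example, not something prescribed above. Different quantile conventions can give slightly different cut-offs.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Rule C: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def percentile_outliers(values, lower=5, upper=95):
    """Rule D: flag values below the lower or above the upper percentile."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower - 1], cuts[upper - 1]
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Rule E: flag values at least `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mean) / sd >= threshold]


# Made-up sample: 19 ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 11,
        12, 13, 10, 14, 12, 11, 13, 12, 11, 95]

print(iqr_outliers(data))         # [95]
print(percentile_outliers(data))  # [95]
print(zscore_outliers(data))      # [95]
```

Note that the percentile rule flags values by position in the data, so on data without ties it will flag roughly 10% of the points whether or not they are unusual. The mean and standard deviation used in rule E are themselves pulled by extreme values, which is one reason the IQR rule is often preferred on small samples.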

DEALING WITH OUTLIERS:

I read an article on outliers, found it very interesting, and am sharing it in this post. There are four major ways of dealing with an outlier. They are as follows:

Deleting Observations: We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very few in number. Points that have no importance for model building can also be deleted.
Transforming and binning values: Transforming variables can also reduce the effect of outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. Decision tree algorithms deal with outliers well because they effectively bin a variable. We can also assign weights to different observations, or use a sigmoid function to squash values.
Imputing: We can also impute outliers, using mean, median or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead and impute it. We can also use a statistical model to predict the value of an outlier observation and then impute it with the predicted value.
Treat Outliers separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the remaining observations as two different groups, build an individual model for each group, and then combine the outputs.
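
Two of the treatments above, the log transform and median imputation, can be sketched as follows. This is a rough illustration rather than a recommended pipeline: the function names, the `is_outlier` callback and the cut-off of 50 are hypothetical choices made for the example.

```python
import math
import statistics


def log_transform(values):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(v) for v in values]


def impute_with_median(values, is_outlier):
    """Replace flagged values with the median of the unflagged values.

    `is_outlier` is a hypothetical callback returning True for an outlier.
    """
    kept = [v for v in values if not is_outlier(v)]
    median = statistics.median(kept)
    return [median if is_outlier(v) else v for v in values]


# Made-up sample with one extreme value.
data = [10, 12, 11, 13, 95]

# The log compresses the gap between the extreme value and the rest.
print(log_transform(data))

# Treat anything above 50 as an artificial outlier (an arbitrary cut-off).
print(impute_with_median(data, lambda v: v > 50))  # [10, 12, 11, 13, 11.5]
```

The median is computed from the unflagged values only, so the outlier being replaced does not influence its own replacement.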
