<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajaykrishnan Selucca</title>
    <description>The latest articles on DEV Community by Ajaykrishnan Selucca (@seluccaajay).</description>
    <link>https://dev.to/seluccaajay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F299960%2F8d7e1682-3c0c-4233-ac66-e20e9b7c9b4f.jpeg</url>
      <title>DEV Community: Ajaykrishnan Selucca</title>
      <link>https://dev.to/seluccaajay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seluccaajay"/>
    <language>en</language>
    <item>
      <title>Machine Learning - Outliers - Dealing and Detecting it</title>
      <dc:creator>Ajaykrishnan Selucca</dc:creator>
      <pubDate>Thu, 28 May 2020 02:31:24 +0000</pubDate>
      <link>https://dev.to/seluccaajay/machine-learning-outliers-dealing-and-detecting-it-189j</link>
      <guid>https://dev.to/seluccaajay/machine-learning-outliers-dealing-and-detecting-it-189j</guid>
      <description>&lt;p&gt;HOW TO DETECT OUTLIERS?&lt;/p&gt;

&lt;p&gt;A. There are many methods for detecting outliers, broadly divided into three types:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Supervised methods
2. Semi-supervised methods
3. Unsupervised methods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;B. Simply visualizing the data can also help us find outliers, using plots such as box plots, histograms and scatter plots.&lt;/p&gt;

&lt;p&gt;C. Any value outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR (where the Inter-Quartile Range IQR is Q3 - Q1) can be considered an outlier.&lt;/p&gt;
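
&lt;p&gt;As a minimal sketch (the sample data below is made up), the IQR rule can be applied with NumPy like this:&lt;/p&gt;

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([10, 12, 11, 13, 12, 14, 11, 98])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # Inter-Quartile Range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the points that fall outside the fences.
outliers = data[(data > upper) | (lower > data)]
print(outliers)  # [98]
```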

&lt;p&gt;D. Using capping methods, any value outside the range of the 5th to 95th percentile can be considered an outlier.&lt;/p&gt;
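
&lt;p&gt;Capping at the 5th and 95th percentiles (often called winsorizing) can be sketched as follows, again on made-up data:&lt;/p&gt;

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([10, 12, 11, 13, 12, 14, 11, 98])

# Cap (winsorize) everything outside the 5th-95th percentile range.
low, high = np.percentile(data, [5, 95])
capped = np.clip(data, low, high)
```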

&lt;p&gt;E. Data points that are three or more standard deviations away from the mean are considered outliers.&lt;/p&gt;
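
&lt;p&gt;The three-standard-deviation rule translates into a z-score check; this sketch assumes a made-up sample with one extreme value:&lt;/p&gt;

```python
import numpy as np

# Fifty unremarkable points plus one extreme value (made-up data).
data = np.array([10, 11, 12, 13, 14] * 10 + [98])

# z-score: how many standard deviations each point sits from the mean.
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) >= 3]
print(outliers)  # [98]
```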

&lt;p&gt;F. Outlier detection is, in the end, a special case of examining the data for influential points, and it also depends on business understanding.&lt;/p&gt;

&lt;p&gt;DEALING WITH OUTLIERS:&lt;/p&gt;

&lt;p&gt;I read a very interesting article on outliers and I am sharing it in this post. There are four major ways of dealing with outliers. They are as follows,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Deleting observations: We delete outlier values when they are due to a data entry or data processing error, or when the outlier observations are very few in number. We can also delete them when they carry no importance for model building.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; the Decision Tree algorithm deals with outliers well because it bins variables. We can also assign weights to different observations, or use a sigmoid function to squash values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Imputing: We can also impute outliers, using the mean, median or mode. Before imputing, we should analyse whether an outlier is natural or artificial; if it is artificial, we can go ahead and impute it. We can also use a statistical model to predict the value of an outlier observation and then impute it with the predicted value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Treat outliers separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups as two different groups, build an individual model for each, and then combine the outputs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
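
&lt;p&gt;Combining points 1 and 3 above, here is a minimal sketch (on made-up data) of flagging outliers with the IQR rule and imputing them with the median:&lt;/p&gt;

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 98], dtype=float)

# Flag outliers with the IQR rule...
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data > q3 + 1.5 * iqr) | (q1 - 1.5 * iqr > data)

# ...then impute them with the median of the remaining points.
median = np.median(data[~mask])
cleaned = np.where(mask, median, data)
print(cleaned)  # [10. 12. 11. 13. 12. 14. 11. 12.]
```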

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Machine Learning - Outliers, its type and causes</title>
      <dc:creator>Ajaykrishnan Selucca</dc:creator>
      <pubDate>Tue, 26 May 2020 01:53:04 +0000</pubDate>
      <link>https://dev.to/seluccaajay/machine-learning-outliers-its-type-and-causes-305b</link>
      <guid>https://dev.to/seluccaajay/machine-learning-outliers-its-type-and-causes-305b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MOzZcJfV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/coxd4uemnojrlbdj1td7.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MOzZcJfV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/coxd4uemnojrlbdj1td7.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OUTLIERS :&lt;/p&gt;

&lt;p&gt;Outliers are points which, like introverts (such as me), never mingle with the other points in a distribution. Outliers are extreme values that deviate from the other observations in the data; an outlier is an observation that diverges from the overall pattern of a sample. If outliers are not treated during our EDA (Exploratory Data Analysis), the resulting machine learning model may have problems like low accuracy and high error.&lt;/p&gt;

&lt;p&gt;TYPES OF OUTLIERS:&lt;/p&gt;

&lt;p&gt;UNI-VARIATE OUTLIERS : A data point that has an extreme value on one variable.&lt;/p&gt;

&lt;p&gt;MULTIVARIATE OUTLIERS : A combination of unusual scores on at least two variables.&lt;/p&gt;

&lt;p&gt;TYPES OF OUTLIERS BASED ON ENVIRONMENT:&lt;/p&gt;

&lt;p&gt;POINT OUTLIERS : Single data points that lie far from the rest of the distribution.&lt;/p&gt;

&lt;p&gt;CONTEXTUAL OUTLIERS : A data point that deviates significantly with respect to a specific context of the object (noise).&lt;/p&gt;

&lt;p&gt;COLLECTIVE OUTLIERS : A subset of data objects that collectively deviates significantly from the whole data set, even though the individual objects may not be outliers themselves.&lt;/p&gt;

&lt;p&gt;CAUSES OF OUTLIERS:&lt;/p&gt;

&lt;p&gt;Handling outliers is very important, because not all outliers are a bad thing. It's very important to understand that simply removing outliers from our dataset, without considering how they impact the results, is a recipe for disaster.&lt;/p&gt;

&lt;p&gt;Outliers can impact the results of our analysis and statistical modelling drastically. Outliers have an especially large impact in logistic regression; let's discuss that in a separate post.&lt;/p&gt;

&lt;p&gt;The causes of outliers are as follows,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Entry Errors (Human Errors)&lt;/li&gt;
&lt;li&gt;Measurement Errors (Instrument errors)&lt;/li&gt;
&lt;li&gt;Experimental Errors (Data Extraction or execution errors)&lt;/li&gt;
&lt;li&gt;Intentional Errors (dummy outliers made to test detection methods)&lt;/li&gt;
&lt;li&gt;Data Processing Errors ( Data manipulation or Data set unintended mutations)&lt;/li&gt;
&lt;li&gt;Sampling Errors (Extracting or mixing data from wrong or various sources)&lt;/li&gt;
&lt;li&gt;Natural (not an error, but a real extreme value in the data)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the next blog, we will see how to detect and deal with outliers.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Machine Learning - Over-fitting &amp; Under-fitting</title>
      <dc:creator>Ajaykrishnan Selucca</dc:creator>
      <pubDate>Mon, 25 May 2020 12:40:31 +0000</pubDate>
      <link>https://dev.to/seluccaajay/machine-learning-over-fitting-under-fitting-o91</link>
      <guid>https://dev.to/seluccaajay/machine-learning-over-fitting-under-fitting-o91</guid>
      <description>&lt;p&gt;In my last post on "BIAS and VARIANCE" we heard about two words - Under-Fit and Over-Fit. In this post, I am going to tell you precise;y what is Over-fitted and Under-Fitted model.&lt;/p&gt;

&lt;p&gt;UNDER-FITTING:&lt;/p&gt;

&lt;p&gt;It occurs when the model is too simple, i.e. when there is Low Variance and High Bias. When the accuracy of the model is much lower than our expectation, the model we have built is said to be under-fit.&lt;/p&gt;

&lt;p&gt;Below is the graphical representation of an under-fit model. &lt;br&gt;
(The red dots in the graph are the data points; most of them lie far away from the line)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lC4g6jHC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m0tn98yqigpfsk1kxr0v.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lC4g6jHC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m0tn98yqigpfsk1kxr0v.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OVER-FITTING:&lt;/p&gt;

&lt;p&gt;It occurs when the model is too complex, i.e. when there is Low Bias and High Variance. (The machine learning model we build should not be 100% accurate on its training data; that generally indicates an over-fitted model)&lt;/p&gt;

&lt;p&gt;Below is the graphical representation of an Over-fit model. &lt;br&gt;
(The line bends to pass through every red dot, i.e. every data point)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nM-QXl93--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lj4dn4q7cjgnxfxkgbwn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nM-QXl93--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lj4dn4q7cjgnxfxkgbwn.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bias and Variance both contribute to errors in a model (ideally there is a right-fit point where the two are balanced), but it's the total prediction error that you want to minimize, not the bias or variance specifically.&lt;/p&gt;

&lt;p&gt;Below is the graphical representation of the right fit point, where the model will have a good accuracy, without being over-fit or under-fit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rf_GRchK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/99f48botrxm0w814agb8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rf_GRchK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/99f48botrxm0w814agb8.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ideally we want low variance and low bias. In reality, though, there's usually a trade-off.&lt;/p&gt;

&lt;p&gt;A suitable fit should acknowledge significant trends in the data and play down or even omit minor variations. &lt;/p&gt;

&lt;p&gt;This might mean re-randomizing our training and test data, using cross-validation, adding new data to better expose underlying patterns, or even switching algorithms. For instance, this might entail switching from linear regression to non-linear regression to reduce bias at the cost of increased variance.&lt;/p&gt;
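
&lt;p&gt;As an illustrative sketch (my own example, on synthetic data, not from the original post), fitting polynomials of two different degrees to noisy samples of a sine curve shows both failure modes: the simple model has high error everywhere, while the complex one has a tiny training error but a larger error on held-out data:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

# Even indices train the model, odd indices act as held-out data.
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

def errors(degree):
    coef = np.polyfit(x_tr, y_tr, degree)
    train = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    return train, test

simple_train, simple_test = errors(1)    # under-fit: high error on both sets
complex_train, complex_test = errors(12) # over-fit: tiny train error, larger test error
```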

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Machine Learning - BIAS and VARIANCE</title>
      <dc:creator>Ajaykrishnan Selucca</dc:creator>
      <pubDate>Sun, 24 May 2020 14:01:08 +0000</pubDate>
      <link>https://dev.to/seluccaajay/machine-learning-bias-and-variance-43ne</link>
      <guid>https://dev.to/seluccaajay/machine-learning-bias-and-variance-43ne</guid>
      <description>&lt;p&gt;Machine Learning is all how we train a model using the training data-set. So, the model we train should reflect our training data-set and that is most common challenging part, as the model should not be over-fitted as well as under-fitted. Let us discuss about over-fitting and under-fitting in a different blog, here we shall see what is a BIAS and VARIANCE in machine learning.&lt;/p&gt;

&lt;p&gt;We build our machine learning model using our training data, but can we predict the future based on the training data alone, or do we need to generalize its patterns to better absorb new data?&lt;/p&gt;

&lt;p&gt;This trade-off is captured in "Bias versus Variance".&lt;/p&gt;

&lt;p&gt;BIAS : It is the gap between what the model predicted and the actual value.&lt;/p&gt;

&lt;p&gt;VARIANCE : It describes how spread out the model's predictions are.&lt;/p&gt;

&lt;p&gt;Put together, bias and variance affect the model's prediction accuracy and can lead to problems with under-fitting and over-fitting.&lt;/p&gt;
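
&lt;p&gt;One way to see these two definitions concretely is a small simulation (my own sketch, with made-up synthetic data): refit a simple and a complex model on many independently noisy samples of the same function, then measure, at one point, the gap between the average prediction and the truth (bias) and the spread of the predictions (variance):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
true_f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0, 1, 30)
x0 = 0.25  # the point at which we measure bias and variance

def bias_variance(degree, trials=200):
    # Refit the model on many independently noisy samples of the same function.
    preds = []
    for _ in range(trials):
        y = true_f(x) + rng.normal(0, 0.3, size=x.size)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # gap between average prediction and truth
    return bias_sq, preds.var()                 # variance = spread of the predictions

b_simple, v_simple = bias_variance(1)    # straight line: high bias, low variance
b_complex, v_complex = bias_variance(9)  # wiggly polynomial: low bias, high variance
```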

&lt;p&gt;I read a post on Instagram which described BIAS and VARIANCE using a shooting target. It made me understand what bias and variance mean for the end results of the machine learning model we build.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fyl_LAe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6g5x6u81rad1oqqsji0s.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fyl_LAe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6g5x6u81rad1oqqsji0s.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Machine Learning - Performance Metrics</title>
      <dc:creator>Ajaykrishnan Selucca</dc:creator>
      <pubDate>Sat, 23 May 2020 07:37:36 +0000</pubDate>
      <link>https://dev.to/seluccaajay/machine-learning-performance-metrics-4l7g</link>
      <guid>https://dev.to/seluccaajay/machine-learning-performance-metrics-4l7g</guid>
      <description>&lt;p&gt;ACCURACY OF THE MODEL:&lt;/p&gt;

&lt;p&gt;Accuracy is the most common metric across all the machine learning models we work with. It is especially important in Classification problems. Accuracy is defined as the ratio of the number of correct predictions made by the model to the total number of predictions made. It can be expressed as follows,&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   Number of Correct predictions           
       Accuracy =  ------------------------------
                   Total No. of Predictions made
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The major disadvantage of accuracy is that it won't work up to the mark when we have an imbalanced dataset. It works well only if there are roughly equal numbers of samples belonging to each class.&lt;/p&gt;

&lt;p&gt;Let us consider that we have 98% samples of Class A and 2% samples of Class B in our training set. Then the model we build may easily get 98% training accuracy by simply predicting Class A for every sample. When this same model is tested on a test set with 60% samples of Class A and 40% samples of Class B, the test accuracy drops to 60%. The high classification accuracy just gave us a false sense of achievement.&lt;/p&gt;
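
&lt;p&gt;The example above can be checked in a few lines (the class counts are the hypothetical ones from the text):&lt;/p&gt;

```python
import numpy as np

# Training set: 98 samples of Class A (0) and 2 of Class B (1).
y_train = np.array([0] * 98 + [1] * 2)
always_a = np.zeros_like(y_train)         # a lazy model that only ever predicts Class A
train_acc = (always_a == y_train).mean()  # 0.98

# Test set: 60% Class A, 40% Class B.
y_test = np.array([0] * 60 + [1] * 40)
test_acc = (np.zeros_like(y_test) == y_test).mean()  # 0.6
```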

&lt;p&gt;The real problem arises when the cost of mis-classifying the minority class samples is very high. If we deal with a rare but fatal disease, the cost of failing to diagnose a sick person is much higher than the cost of sending a healthy person for more tests. The same holds for predicting bank fraud.&lt;/p&gt;

&lt;p&gt;CALCULATING A F1-SCORE:&lt;/p&gt;

&lt;p&gt;F1 Score combines precision and recall relative to a specific positive class. It conveys the balance between precision and recall, and it holds up under an uneven class distribution. The F1 Score reaches its best value at 1 and its worst at 0.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            Precision * Recall
           F1 Score = 2 *  ---------------------
                            Precision + Recall

                True (+ve)                             True (+ve)
 Precision = -------------------          Recall = -------------------
            True (+ve) + False (+ve)              True (+ve) + False (-ve)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now let's understand: what are Precision and Recall, and when do we use them?&lt;/p&gt;

&lt;p&gt;Precision gives us information about performance with respect to false positives (how many of the cases we caught were actually correct). It is about being precise: for example, if we captured one cancer case and we captured it correctly, then we are 100% precise.&lt;/p&gt;

&lt;p&gt;Recall gives us information about performance with respect to false negatives (how many cases we missed). Recall is not so much about capturing cases correctly as about capturing all cases: for example, if we label every actual "cancer" case as "cancer", then we have 100% recall.&lt;/p&gt;

&lt;p&gt;So basically, if we want to focus on minimizing False Negatives, we want our recall to be as close to 100% as possible without precision being too bad. If we want to focus on minimizing False Positives, then we want our precision to be as close to 100% as possible.&lt;/p&gt;
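
&lt;p&gt;The formulas above are easy to compute directly from the true/false positive and negative counts; the counts here are made up for illustration:&lt;/p&gt;

```python
# Made-up counts from a classifier's predictions.
tp, fp, fn = 8, 2, 4  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # 0.8   -- of everything we flagged, how much was right
recall = tp / (tp + fn)     # 0.667 -- of everything real, how much we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, about 0.727
```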

&lt;p&gt;CONFUSION MATRIX:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          Actually (+ve) (1)      Actually (-ve)(0)

    Predicted (+ve)(1)       True (+ve)               False (+ve)

    Predicted (-ve)(0)       False (-ve)               True (-ve)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Confusion matrix, as the name suggests, gives us an N*N matrix as output, where N is the number of classes being predicted. This metric works for imbalanced datasets. It captures the correctness of the model, based entirely on the numbers inside this table.&lt;/p&gt;

&lt;p&gt;This confusion matrix is a table with two dimensions ('Actual' and 'Predicted') and a set of classes in each dimension. The actual classifications are columns and the predicted ones are rows.&lt;/p&gt;

&lt;p&gt;What are these Positives and Negatives?&lt;/p&gt;

&lt;p&gt;True Positives : The cases in which we predicted YES and the actual output was also YES&lt;br&gt;
True Negatives : The cases in which we predicted NO and the actual output was NO.&lt;br&gt;
False Positives : The cases in which we predicted YES and the actual output was NO.&lt;br&gt;
False Negatives : The cases in which we predicted NO and the actual output was YES.&lt;/p&gt;
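
&lt;p&gt;These four counts can be tallied directly from predictions; the labels below are hypothetical:&lt;/p&gt;

```python
import numpy as np

# Hypothetical binary labels: 1 = YES, 0 = NO.
actual    = np.array([1, 1, 1, 0, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((predicted == 1) * (actual == 1)))  # predicted YES, actually YES
tn = int(np.sum((predicted == 0) * (actual == 0)))  # predicted NO, actually NO
fp = int(np.sum((predicted == 1) * (actual == 0)))  # predicted YES, actually NO
fn = int(np.sum((predicted == 0) * (actual == 1)))  # predicted NO, actually YES

matrix = [[tp, fp],
          [fn, tn]]  # rows: predicted, columns: actual
print(matrix)  # [[3, 1], [1, 3]]
```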

</description>
      <category>confusionmatrix</category>
      <category>machinelearning</category>
      <category>f1score</category>
      <category>precisionandrecall</category>
    </item>
  </channel>
</rss>
