<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kelvin Jose</title>
    <description>The latest articles on DEV Community by Kelvin Jose (@kelvinjose).</description>
    <link>https://dev.to/kelvinjose</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F341416%2F808a2903-22f2-4635-bf62-921dcfd9cce7.png</url>
      <title>DEV Community: Kelvin Jose</title>
      <link>https://dev.to/kelvinjose</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kelvinjose"/>
    <language>en</language>
    <item>
      <title>Supervised and Unsupervised Algorithms</title>
      <dc:creator>Kelvin Jose</dc:creator>
      <pubDate>Thu, 27 Feb 2020 10:40:58 +0000</pubDate>
      <link>https://dev.to/kelvinjose/supervised-and-unsupervised-algorithms-1h7n</link>
      <guid>https://dev.to/kelvinjose/supervised-and-unsupervised-algorithms-1h7n</guid>
      <description>&lt;p&gt;For example “I want to predict whether a user will leave my online platform say an e-commerce service - like the mighty Amazon or Alibaba - or not. Suppose If I find somebody who is about to leave my business, I can give the potential user some personalized offers so that they might stay longer and I can grow my business better. For this I will be having the previous history of users who left the platform behind. So I can use this data to train a Machine Learning model to predict the users’ future behavior. For this I might take 5000 users’ history who are already left and another 5000 users’ history who are still active. I already know that these all user already left and these all are still active. We utilize this knowledge or label in the Machine Learning term to train a model. Each users’ history is separately labeled as user x is still active but user y left. In the same way 5000 users’ history would be labelled or tagged as left and the rest 5000 data would be labelled as active. The reason why stick to the same number for each category or class is because if I use an imbalanced dataset to train the model, the algorithm might be tend to outperform on a certain set of data, we call this phenomena bias. I suggest we should use balanced dataset not exactly a definite number which impossible in most cases but a comparable number with less margin between the difference. Our underlying algorithm would see the data and tries to understand the pattern. Once the model is trained, we could use some dataset to see how our model performs. In a nutshell, supervised algorithms use labelled or tagged data to learn like this user is active but that user has left or intuitively this is apple but that is orange, as simple as that.&lt;/p&gt;

&lt;p&gt;Now let’s think about a different scenario. Remember, I have the same e-commerce business and the same dataset of 10000 users’ histories. But instead of predicting whether a user will leave the platform or not, we attempt something else. In the supervised setting the data was labelled; in this unsupervised setting it is not. Forget the user-status label or tag; we are not experimenting with that. It is simply kept as one more feature in the dataset. Ultimately, the data is unlabelled. More intuitively, we have a bag of apples and oranges, but no fruit is explicitly labelled as apple or orange; instead each has a number of features such as size, color, shape, weight and taste. In the same way, for each user we have features like last login, last purchase, time spent and user status. All we know is that the bag holds a number of fruits, or that the dataset holds the details of 10000 users. Unsupervised machine learning algorithms try to figure out the underlying patterns inside such data. They could group items with similar features into categories such as apple or orange, because all apples show features more similar to each other than to oranges. Following the same path, users from one region may show different activity than users from another. In short, unsupervised algorithms find similar groups and associations in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lo_8mtT_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5ju5cmfkxxs6cysxwot0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lo_8mtT_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5ju5cmfkxxs6cysxwot0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Bias-Variance Trade-Off (Part 2)</title>
      <dc:creator>Kelvin Jose</dc:creator>
      <pubDate>Tue, 25 Feb 2020 08:46:04 +0000</pubDate>
      <link>https://dev.to/kelvinjose/understanding-bias-variance-trade-off-part-2-2g2o</link>
      <guid>https://dev.to/kelvinjose/understanding-bias-variance-trade-off-part-2-2g2o</guid>
      <description>&lt;p&gt;This is the continuation of &lt;a href="https://dev.to/kelvinjose/understanding-bias-variance-trade-off-part-1-3gk"&gt;Understanding Bias-Variance Trade-Off (Part 1)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trade-off between bias and variance can be described as follows: a model with low bias is flexible enough to fit the data well, but if it becomes too flexible it memorizes and overfits the data instead of generalizing, and it won’t perform well on a different but similar dataset, i.e. error due to high variance. In the figure below, we see a plot of the model’s prediction performance on the vertical axis as a function of model complexity on the horizontal axis. Here, we depict the case where we use a number of different orders of polynomial functions to approximate the target function. Shown in the figure are the calculated squared bias, variance, and error on the test set for each of the estimator functions.&lt;/p&gt;

&lt;p&gt;We see that as the model complexity increases, the variance slowly increases and the squared bias decreases. This points to the trade-off between bias and variance due to model complexity: models that are too complex tend to have high variance and low bias, while models that are too simple tend to have high bias and low variance. The best model has both low bias and low variance.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3LOR70ix--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lmrl3jjzzx964hfdgke1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3LOR70ix--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lmrl3jjzzx964hfdgke1.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Managing the bias-variance trade-off is one great example of how experienced data scientists play an integral part in an analytics-driven business.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta-z91E5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pzxqo0abg2li77omds7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta-z91E5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pzxqo0abg2li77omds7b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Bias-Variance Trade-Off (Part 1)</title>
      <dc:creator>Kelvin Jose</dc:creator>
      <pubDate>Tue, 25 Feb 2020 08:36:35 +0000</pubDate>
      <link>https://dev.to/kelvinjose/understanding-bias-variance-trade-off-part-1-3gk</link>
      <guid>https://dev.to/kelvinjose/understanding-bias-variance-trade-off-part-1-3gk</guid>
      <description>&lt;p&gt;The bias-variance trade-off is one of the most important aspect of machine learning projects. To approximate reality, different algorithms use mathematical and statistical techniques to optimize and best estimate model parameters. In between this optimization task, algorithms often encounter a dramatic term called error.&lt;/p&gt;

&lt;p&gt;Errors can be divided into two kinds: reducible and irreducible. The reducible error is further divided into bias error and variance error. Irreducible error, or uncertainty, is associated with the natural variability of a system. Data scientists try to reduce the bias and variance errors in order to formulate an optimized, accurate model that can be taken into the real world. However, there is a trade-off between bias and variance in selecting the best model.&lt;/p&gt;

&lt;p&gt;The term bias error shows how much a model’s predictions (y-hat) differ, on average over the training data, from the actual or expected outcome (y). Put differently, the model over-simplifies its assumptions about the data, which in turn underfits the training data. I would say this is purely related to the model selection procedure. Data scientists can re-sample the data, build another model, and average the predictions to see whether the issue still exists. If this average shows a significant difference from y, we should suspect a high bias error.&lt;/p&gt;

&lt;p&gt;The error due to variance is the amount by which the prediction learned from one training set differs from the expected prediction averaged over all possible training sets. When models become too complex, they become sensitive to even small variations in the training data. By overfitting the data instead of finding the best fit, the same model may behave erratically on another, similar dataset. As with bias, you can estimate this by repeating the entire model-building process multiple times.&lt;/p&gt;
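&lt;p&gt;The re-sampling idea in the two paragraphs above can be sketched as a small simulation (the target function, model, and sample sizes are illustrative assumptions): draw many training sets, refit the model on each, and at a fixed input measure how far the average prediction lies from the truth (squared bias) and how much individual predictions scatter around that average (variance).&lt;/p&gt;

```python
# Estimate bias and variance empirically by repeating the model-building
# process over many freshly drawn training sets. A straight line is fit to
# noisy samples of a curved target, so the model is deliberately too simple.
import numpy as np

rng = np.random.default_rng(3)
true_f = lambda x: x ** 2          # the unknown target we approximate
x0 = 0.8                           # fixed input where we evaluate the model
preds = []
for _ in range(500):               # 500 independent training sets
    x = rng.uniform(-1.0, 1.0, 20)
    y = true_f(x) + rng.normal(scale=0.1, size=20)
    coeffs = np.polyfit(x, y, 1)   # degree 1: high bias on a quadratic target
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2   # average prediction vs truth
variance = preds.var()                       # scatter around that average
print("bias^2:", round(bias_sq, 4), "variance:", round(variance, 4))
```

&lt;p&gt;For this under-complex model the squared bias dominates; swapping in a high-degree polynomial would flip the balance toward variance, which is the trade-off discussed in Part 2.&lt;/p&gt;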

&lt;p&gt;&lt;em&gt;Find Part 2 of this post&lt;/em&gt; &lt;a href="https://dev.to/kelvinjose/understanding-bias-variance-trade-off-part-2-2g2o"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
