For example “I want to predict whether a user will leave my online platform say an e-commerce service - like the mighty Amazon or Alibaba - or not. Suppose If I find somebody who is about to leave my business, I can give the potential user some personalized offers so that they might stay longer and I can grow my business better. For this I will be having the previous history of users who left the platform behind. So I can use this data to train a Machine Learning model to predict the users’ future behavior. For this I might take 5000 users’ history who are already left and another 5000 users’ history who are still active. I already know that these all user already left and these all are still active. We utilize this knowledge or label in the Machine Learning term to train a model. Each users’ history is separately labeled as user x is still active but user y left. In the same way 5000 users’ history would be labelled or tagged as left and the rest 5000 data would be labelled as active. The reason why stick to the same number for each category or class is because if I use an imbalanced dataset to train the model, the algorithm might be tend to outperform on a certain set of data, we call this phenomena bias. I suggest we should use balanced dataset not exactly a definite number which impossible in most cases but a comparable number with less margin between the difference. Our underlying algorithm would see the data and tries to understand the pattern. Once the model is trained, we could use some dataset to see how our model performs. In a nutshell, supervised algorithms use labelled or tagged data to learn like this user is active but that user has left or intuitively this is apple but that is orange, as simple as that.
Now let’s think about a different scenario. Remember I have the same e-commerce business and the same 10000 users’ history dataset. Instead of predicting the probability of a user which indicates whether he/she would leave the platform or not, we attempt something else. So in the supervised manner the data is labelled but it’s not in this unsupervised manner. Forget the user status label or tag, we are not experimenting with that. It will be kept as a feature in the dataset. Ultimately we mean that the data is unlabelled. More intuitively we have a bag of apples and oranges but each fruit isn’t explicitly labelled as this is apple and that is orange, instead we have a number of features like size, color, shape, weight and taste. In the same way, we would have features like last login, last purchase, time spent and user status of different users. What we know is this bag has a number of fruits and or that dataset has the details of 10000 users. Unsupervised Machine Learning algorithms try to figure out the underlying patterns inside the data. So probably we would be able to group similar feature to each categories such as apple or orange because all apples show similar features than oranges. Following the same path, users from a certain region show a particular activity than another region. In short, unsupervised algorithm find similar groups and associations among data.
Top comments (0)