In many cases it is obvious whether we face a classification or a regression task, and it is hard to pick the wrong model design. In some cases, however, it is less clear, because the target variable can be a mix of a classification target and a regression target. For example, the target can be equal to zero in the vast majority of cases and vary widely in all the others. This is very common in financial tasks: when we predict customer default, the potential damage depends on the customer's exposure, and when we try to identify fraudulent transactions, the loss depends on the size of the transaction. In these cases it can be difficult to choose a model design, so let's simulate this situation and investigate different approaches.
Also, let's introduce one more variable: amount = 10^(normally distributed variable). This variable shows the potential loss in every case where y > th. Let's assume that p = 0.99 and alpha = 0.5, so that losses occur in only 1% of cases and the Gini coefficient of our model is about 50%. Let's also assume that our rejection threshold is at the same level as p, so we can reject only 1% of cases and have to accept the other 99%. Our main target metric is recall: the share of all losses rejected by our model when we can reject only 1% of cases.
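To make the setup concrete, here is a minimal NumPy sketch of such a simulation. The generating process for y (a standard normal x mixed with independent noise so that their correlation equals alpha) is an assumption made for illustration, as are the sample size, the seed, and the helper name `loss_recall`. Recall is read here as money-weighted: the share of total loss that falls inside the rejected 1%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p, alpha = 0.99, 0.5

# ASSUMPTION: y is built so that corr(x, y) = alpha; this process is illustrative only.
x = rng.standard_normal(n)
y = alpha * x + np.sqrt(1 - alpha**2) * rng.standard_normal(n)

th = np.quantile(y, p)                  # a loss occurs only when y > th (about 1% of cases)
amount = 10 ** rng.standard_normal(n)   # amount = 10^(normally distributed variable)
loss = np.where(y > th, amount, 0.0)    # realized loss: amount for loss cases, else zero


def loss_recall(score, loss, reject_share=0.01):
    """Share of total loss captured by rejecting the top `reject_share` of cases by score."""
    k = int(round(reject_share * len(score)))
    rejected = np.argsort(score)[-k:]   # indices of the k highest scores
    return loss[rejected].sum() / loss.sum()


print(f"share of cases with a loss: {(loss > 0).mean():.4f}")
print(f"recall when ranking by x alone: {loss_recall(x, loss):.3f}")
```

Because the amounts are heavy-tailed, a handful of very large losses can dominate the metric, so the printed recall can move noticeably from one seed to another.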
Let's consider four options:
- We can build a classification model that is not directly connected to money, rank our cases, and reject the 1% of cases with the highest probability of loss. (In our particular case we will use x as the score.)
- We can transform x into a probability with the normal CDF and multiply this probability by the amount. This is our expected loss, and we can reject the 1% of cases with the highest expected loss.
- We can build a ridge regression model on x and amount and reject the 1% of cases with the highest ridge predictions.
- We can build a ridge regression model on x and log(amount) and reject the 1% of cases with the highest ridge predictions.
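The four options can be compared with a sketch like the one below. Several things in it are assumptions rather than the article's exact setup: the generating process for y, the train/test split, the realized loss as the regression target, and the hypothetical helpers `loss_recall` and `ridge_fit_predict`. Ridge is fitted in closed form with NumPy to keep the example dependency-free; `sklearn.linear_model.Ridge` would be the more usual choice.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 200_000, 0.99, 0.5

# ASSUMPTION: illustrative generating process with corr(x, y) = alpha.
x = rng.standard_normal(n)
y = alpha * x + math.sqrt(1 - alpha**2) * rng.standard_normal(n)
th = np.quantile(y, p)
amount = 10 ** rng.standard_normal(n)
loss = np.where(y > th, amount, 0.0)

# First half is used to fit the regressions, second half to evaluate all options.
train = np.arange(n) < n // 2
test = ~train


def loss_recall(score, loss, reject_share=0.01):
    """Share of total loss captured by rejecting the top `reject_share` of cases."""
    k = int(round(reject_share * len(score)))
    return loss[np.argsort(score)[-k:]].sum() / loss.sum()


def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Closed-form ridge on standardized features with an unpenalized intercept."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Z = (X_train - mu) / sd
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]),
                        Z.T @ (y_train - y_train.mean()))
    return ((X_test - mu) / sd) @ w + y_train.mean()


# Standard normal CDF built from math.erf.
norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

# Option 1: rank by the score x alone, ignoring money.
s1 = x[test]

# Option 2: P(y > th | x) under the assumed process, times the amount (expected loss).
prob = 1.0 - norm_cdf((th - alpha * x[test]) / math.sqrt(1 - alpha**2))
s2 = prob * amount[test]

# Option 3: ridge regression of the realized loss on x and amount.
X3 = np.column_stack([x, amount])
s3 = ridge_fit_predict(X3[train], loss[train], X3[test])

# Option 4: ridge regression of the realized loss on x and log10(amount).
X4 = np.column_stack([x, np.log10(amount)])
s4 = ridge_fit_predict(X4[train], loss[train], X4[test])

recalls = {
    "score only": loss_recall(s1, loss[test]),
    "probability * amount": loss_recall(s2, loss[test]),
    "ridge on x, amount": loss_recall(s3, loss[test]),
    "ridge on x, log(amount)": loss_recall(s4, loss[test]),
}
for name, value in recalls.items():
    print(f"{name:>24}: {value:.3f}")
```

The exact recalls depend on the seed, the sample size, and the assumed generating process, so this sketch should not be expected to reproduce any specific figures.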
All in all, the classification score multiplied by the amount of money gave the best result here (recall = 58%). Second place goes to the regression on amount and x (recall = 52%), third to the regression on log(amount) and x (recall = 51%), and last to the naive classification model (recall = 12%). So, for tasks of this kind, the best solution is either a classification model whose output is multiplied by the amount, or a simple regression model.