In this post I'll cover the basics of weight of evidence and a related concept, information value: what they are, how to calculate them, and a few guidelines (plus a caution) for using them.
First of all, what is weight of evidence?
Weight of evidence (WOE) is a method of encoding a predictor variable to show the relationship it has with a binary target variable. It originated in the credit and finance industries to help separate “good” risks from “bad” risks, with the risk in that case being loan default. It has been in use for more than forty years, and it remains most widely known in the credit, finance, and insurance industries.
It is calculated as the natural log of the distribution of good customers (those who did not default on their loan) divided by the distribution of bad customers (those who did default). This can also be looked at from the perspective of events (something happening) versus non-events (the same thing not happening). In that case, the calculation is the natural log of the distribution of non-events divided by the distribution of events.
This is the formula for weight of evidence:
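$$\mathrm{WOE}_i = \ln\left(\frac{D_{\text{non-events},\,i}}{D_{\text{events},\,i}}\right)$$

Here $D_{\text{non-events},\,i}$ is the share of all non-events that fall into bin (or category) $i$, and $D_{\text{events},\,i}$ is the share of all events that fall into bin $i$.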
The Steps To Calculate WOE:
1. If the predictor variable is continuous (for example, a list of people's ages or household incomes): split the variable into a number of groups or “bins.” This process is known as “binning.”
A good rule of thumb is to start with 10 bins, especially if your binning strategy is to keep an equal number of data points in each bin: with 10 equal-frequency bins, each bin contains 10% of the variable's values.
The principle behind that rule of thumb is that a given bin should capture at least 5%-10% of the observations. Up to a point, fewer bins do a better job of capturing patterns: as long as no bin is so large that the pattern gets masked, a smaller number of bins is better. The problem with too many bins is that each bin then holds too few values to capture the pattern or signal in the data.
Unfortunately, the only way to determine the ideal number of bins to use is to experiment a bit. There is no quicker or better way that I know of, so we generally start with 10 bins and adjust from there.
This step is skipped for categorical variables, because they are already effectively binned.
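To make the binning step concrete, here's a minimal sketch using pandas; the DataFrame `df`, its `age` column, and the `target` column are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: a continuous predictor ("age") and a binary
# target (1 = event, e.g. loan default; 0 = non-event).
df = pd.DataFrame({
    "age": rng.integers(18, 75, size=1000),
    "target": rng.integers(0, 2, size=1000),
})

# Equal-frequency binning: qcut puts roughly the same number of
# observations in each bin, so q=10 gives ~10% of values per bin.
# duplicates="drop" merges bins whose edges collide on repeated values.
df["age_bin"] = pd.qcut(df["age"], q=10, duplicates="drop")
print(df["age_bin"].value_counts().sort_index())
```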
2. Calculate the number of events & non-events in each bin.
Each bin should ideally have at least one event and at least one non-event in it. If a bin has no events or no non-events, the WOE calculation breaks down (you end up taking the log of zero or dividing by zero), so an approximation will need to be made. A quick-and-dirty approach is to simply set WOE = 0 for that bin, treating it as carrying no evidence either way.
Another approach would be to use additive or Laplace smoothing. Let's walk through how to do that.
This is the formula for the adjusted distribution using smoothing:
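$$D^{\text{adj}}_i = \frac{n_i + 1}{N + k}$$

Here $n_i$ is the number of events (or non-events) in bin $i$, $N$ is the total number of events (or non-events) across all bins, $k$ is the number of bins, and 1 is the smoothing factor.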
1. Start by identifying whether it's the events or the non-events that are missing from a bin. If it's both, the bin is empty, and you'll benefit the most from using the quick-and-dirty approach from above.
2. Since you're performing this step because a bin has zero events (or non-events), its smoothed count will be 1 (0 events or non-events + the smoothing factor of 1).
3. Take the total number of events (or non-events) across all bins and add the number of bins. This comes from applying the smoothing factor of 1 from above once per bin, which keeps the adjusted distribution summing to 1.
4. Divide the number from step 2 by the number from step 3. This is the adjusted distribution of events (or non-events) that you'll use in step 4 of the WOE calculation. For this event or non-event in this bin, you'll skip step 3 of the WOE calculation.
By using smoothing to calculate an adjusted distribution, the problem of dividing by 0 is avoided.
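Here's a minimal sketch of that adjustment in Python; the bin labels and event counts are hypothetical:

```python
# Hypothetical event counts per bin; bin "C" has zero events, which
# would make the raw WOE for that bin undefined.
event_counts = {"A": 40, "B": 25, "C": 0, "D": 35}

alpha = 1                          # smoothing factor
n_bins = len(event_counts)
total_events = sum(event_counts.values())

# Laplace smoothing: add alpha to every bin's count and
# alpha * n_bins to the total, so the shares still sum to 1.
adjusted_dist = {
    b: (c + alpha) / (total_events + alpha * n_bins)
    for b, c in event_counts.items()
}
print(adjusted_dist)  # bin "C" now gets a small non-zero share
```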
3. Calculate the percentage of non-events & events in each bin.
This is done by taking the number of non-events in the bin and dividing it by the total number of non-events across all bins, which gives the bin's share of the overall non-event distribution. Repeat this for the number of events in the bin.
4. Calculate WOE by taking the natural log of the percentage of non-events in the bin divided by the percentage of events in the bin.
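Putting steps 2 through 4 together, here's a hedged end-to-end sketch that builds on the hypothetical `df` and `age_bin` column from the binning example above:

```python
import numpy as np

# Step 2: count events and non-events per bin.
grouped = df.groupby("age_bin", observed=True)["target"].agg(
    events="sum",    # target == 1
    total="count",
)
grouped["non_events"] = grouped["total"] - grouped["events"]

# Step 3: each bin's share of all events / all non-events, with the
# +1 / +k Laplace smoothing from above guarding against zero counts.
k = len(grouped)
grouped["dist_events"] = (grouped["events"] + 1) / (grouped["events"].sum() + k)
grouped["dist_non_events"] = (grouped["non_events"] + 1) / (grouped["non_events"].sum() + k)

# Step 4: WOE = ln(distribution of non-events / distribution of events).
grouped["woe"] = np.log(grouped["dist_non_events"] / grouped["dist_events"])
print(grouped[["events", "non_events", "woe"]])
```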
Weight Of Evidence Usage
Now that you know what it is and how to calculate it, why would you want to use it?
One reason is that WOE lets you see the relationship between a predictor variable and the target variable. If WOE is a positive number, the bin's share of non-events is higher than its share of events, so that bin skews toward non-events; if it's negative, the bin skews toward events. The further a bin's WOE is from zero in either direction, the more strongly that bin separates events from non-events, while WOE near zero means the bin tells you little about the target.
Another reason is that you're planning to use a logistic regression model. WOE is particularly suited to logistic regression because WOE is the log-odds for a given group of values in a category, and logistic regression predicts the log-odds of the target variable. This also means the encoded predictor is already on the same scale as the model's output, so scaling is unnecessary.
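As a sketch of that pairing (assuming scikit-learn is available, and reusing the hypothetical `df` and `grouped` objects from the earlier examples):

```python
from sklearn.linear_model import LogisticRegression

# Encode the predictor by mapping each bin to its WOE value.
df["age_woe"] = df["age_bin"].map(grouped["woe"]).astype(float)

# WOE is already on the log-odds scale, so no extra feature
# scaling is needed before fitting the logistic regression.
model = LogisticRegression()
model.fit(df[["age_woe"]], df["target"])
print(model.coef_, model.intercept_)
```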
Keep in mind one consideration when using weight of evidence: there is a potential for target leakage inherent in using the distribution of the target values within a category to encode that category, which can lead to overfitting in your model. One way of dealing with this is to inject some random Gaussian noise into the encoded variable during training.
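A minimal sketch of that mitigation, with a hypothetical noise scale you would tune for your own data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Perturb the WOE-encoded column with a little Gaussian noise so the
# model can't latch onto the exact encoded target statistics.
# Apply this to training data only; 0.05 is a hypothetical scale.
df["age_woe_noisy"] = df["age_woe"] + rng.normal(0.0, 0.05, size=len(df))
```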
Related Concept - Information Value
You may have been wondering if I forgot about information value, since I haven't mentioned it again since the start of this post. That's because information value is closely related to weight of evidence. In fact, WOE is used in calculating information value!
Information value is intended to express how much benefit there is in knowing the value of an independent variable for predicting the target variable. Where weight of evidence shows you the relationship between the independent variable and the target variable, information value shows you the strength of that relationship.
Calculating Information Value
Information value is calculated as the difference of the distributions of non-events and events multiplied by the weight of evidence value, summed over all groups or bins of a predictor variable.
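In symbols, using the per-bin distributions defined earlier:

$$\mathrm{IV} = \sum_{i=1}^{k} \left(D_{\text{non-events},\,i} - D_{\text{events},\,i}\right) \times \mathrm{WOE}_i$$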
- For each bin, subtract the percentage of events from the percentage of non-events and multiply the result by the WOE value for the bin.
- Add those results together across all bins of the predictor variable.
- The total is the information value for the predictor variable.
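Continuing the earlier sketch, the calculation is one line against the hypothetical `grouped` table:

```python
# IV: sum over bins of (share of non-events - share of events) * WOE.
iv = ((grouped["dist_non_events"] - grouped["dist_events"]) * grouped["woe"]).sum()
print(f"IV for age: {iv:.3f}")
```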
Information Value Usage
According to Siddiqi (2006)[1], the information value statistic can be interpreted according to the table below.
| IV | Predictive Power |
|---|---|
| < 0.02 | No predictive value |
| 0.02 - 0.1 | Weak predictor |
| 0.1 - 0.3 | Moderate predictor |
| 0.3 - 0.5 | Strong predictor |
| > 0.5 | Suspiciously strong predictor |
If you see an information value above 0.5, double-check your calculation and your data; a suspiciously strong predictor is often a sign of a problem such as the target leakage discussed above.
Having the information value for each of your predictor variables allows you to rank them, which may assist in feature selection for your model: keep the variables with higher information value ranks and eliminate the lower-ranked ones (assuming there are no important variable interactions). That helps you avoid the so-called curse of dimensionality!
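For instance, with hypothetical IVs computed for several predictors using the steps above:

```python
# Hypothetical information values for three predictors.
ivs = {"age": 0.18, "income": 0.42, "zip_code": 0.007}

# Rank by IV and drop anything below the 0.02 "no predictive value"
# threshold from the table above.
ranked = sorted(ivs.items(), key=lambda kv: kv[1], reverse=True)
selected = [name for name, iv in ranked if iv >= 0.02]
print(ranked)    # [('income', 0.42), ('age', 0.18), ('zip_code', 0.007)]
print(selected)  # ['income', 'age']
```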
That's all I intend to cover in this post. Thank you for your time, and I hope to catch you in the next one!
References
[1] Siddiqi, Naeem (2006). *Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring*. SAS Institute, pp. 79-83.