
Using explanations for finding bias in black-box models

Andreas Messalas ・ Originally published at code4thought.eu

There is no doubt that machine learning (ML) models are being used to solve a growing number of business and even social problems. Every year, ML algorithms become more accurate and more innovative, and consequently applicable to a wider range of domains. From detecting cancer to banking and self-driving cars, the list of ML applications is never-ending.

However, as the predictive accuracy of ML models improves, their explainability seems to weaken. Their intricate and obscure inner structure forces us, more often than not, to treat them as “black-boxes”, that is, to accept their predictions on a no-questions-asked basis. Common “black-boxes” are Artificial Neural Networks (ANNs) and ensemble methods. Even seemingly interpretable models, such as Decision Trees, can become unexplainable, for instance when they grow very deep.

Black-box models
Neural Networks, Ensemble methods and most of the recent ML models behave like black-boxes

The need to shed light on the obscurity of “black-box” models is evident: Articles 15 and 22 of the GDPR (2018), the OECD AI Principles (2019) and the Algorithmic Accountability Act bill introduced in the US Senate (2019) are some examples indicating that ML Interpretability, along with ML Accountability and Fairness, has already become (or should become) an integral characteristic of any application that makes automated decisions.

Since many organizations will be obliged to provide explanations about the decisions of their automated predictive models, there will be a serious need for third-party organizations to perform the interpretability tasks and audit those models on their behalf. This adds a level of integrity and objectivity to the whole audit process, as the explanations are provided by an external party. Moreover, not every organization (especially startups) has the resources to deal with interpretability issues, which makes third-party auditors necessary.

However, this arrangement raises intellectual property issues, since organizations will not want to disclose any details of their models. Therefore, from the wide range of interpretability methods, the model-agnostic approaches (i.e. methods that are oblivious to the model’s details) are the appropriate ones for this purpose.

Besides explaining the predictions of a black-box model, interpretability can also give us insight into erroneous behavior of our models, which may be caused by undesired patterns in our data. We will examine an example where interpretability helps us identify gender bias in our data, using a model-agnostic method that relies on surrogate models and Shapley values.

We use the “Default of Credit Card Clients Dataset”, which contains information (demographic factors, credit data, history of payment, and bill statements) about 30,000 credit card clients in Taiwan from April 2005 to September 2005. The goal of the models in our examples is to identify the defaulters (i.e. bank customers who will not make the next payment on their credit card).

Gender biased data

Biased datasets are not uncommon. Bias can be caused by faulty preprocessing or by collecting data from a poor source, both of which create skewed and tainted samples. Examining the reasons behind a model’s prediction may inform us about possible bias in the data.

Defaulters based on Gender: The red and blue bars represent the original distributions of female and male customers, while the purple one depicts the newly constructed biased distribution of male customers.

In the “Default of Credit Card Clients Dataset”, 43% of the defaulters are male and 57% are female. This does not make the dataset biased, since the non-defaulters have a similar distribution (39% and 61% respectively).

We distort the dataset by picking 957 male defaulters at random (i.e. one third of all male defaulters) and changing their label to non-defaulter. This creates a new, biased dataset with 34% / 66% male/female defaulters and 41% / 59% male/female non-defaulters.
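The distortion step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the row layout, the field names `sex` and `default`, and the helpers `flip_male_defaulters` and `male_share` are hypothetical, and the toy rows merely mimic the original gender percentages rather than the real 30,000 records, so the resulting shares differ from the ones reported above.

```python
import random

def flip_male_defaulters(rows, fraction=1/3, seed=0):
    """Return a copy of rows in which a random fraction of the male
    defaulters has been relabeled as non-defaulters (default=0)."""
    rng = random.Random(seed)
    male_defaulters = [i for i, r in enumerate(rows)
                       if r["sex"] == "male" and r["default"] == 1]
    n_flip = int(len(male_defaulters) * fraction)
    to_flip = set(rng.sample(male_defaulters, n_flip))
    # Copy every row so the original dataset is left untouched.
    return [dict(r, default=0) if i in to_flip else dict(r)
            for i, r in enumerate(rows)]

def male_share(rows, label):
    """Fraction of male customers among rows with the given label."""
    group = [r for r in rows if r["default"] == label]
    return sum(r["sex"] == "male" for r in group) / len(group)

# Toy data (assumption): 43/57 male/female defaulters and
# 39/61 male/female non-defaulters, 100 rows per class.
rows = ([{"sex": "male", "default": 1}] * 43
        + [{"sex": "female", "default": 1}] * 57
        + [{"sex": "male", "default": 0}] * 39
        + [{"sex": "female", "default": 0}] * 61)

biased = flip_male_defaulters(rows)
print("male share of defaulters:", male_share(rows, 1), "->", male_share(biased, 1))
print("male share of non-defaulters:", male_share(rows, 0), "->", male_share(biased, 0))
```

Comparing `male_share` across the two labels before and after the flip is the same check the text uses to argue that the original dataset is not biased while the distorted one is.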

We then remove the gender feature from the dataset and take the predictions of a black-box model trained on this biased dataset (we do not need to know anything about its internal structure). Next, we train a surrogate XGBoost model on these predictions and extract from it the Shapley values that help us explain the predictions of the original model. More precisely, we use the Shapley values to pinpoint the most important features and then describe them in natural language in the explanations.

We examine the explanations for a false negative prediction (i.e. falsely predicted as non-defaulter) of a male customer and a false positive prediction (i.e. falsely predicted as defaulter) of a female customer. Both are single university graduates with similar credit limits. However, the male customer delayed the last 4 payments, while the female customer delayed only the last one.
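To make the Shapley step more concrete, here is a minimal sketch that computes exact Shapley values by enumerating feature coalitions. It is not the pipeline described above: the article uses a surrogate XGBoost model, whereas `toy_surrogate`, the feature names, and the baseline customer below are hypothetical stand-ins, and the percentage normalization in `as_percentages` is an assumption about how impacts could be reported. In this sketch a positive value pushes the score towards 'Default'.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values of predict at instance, relative to baseline.
    Features outside a coalition are set to their baseline value."""
    features = list(instance)
    n = len(features)

    def value(coalition):
        x = {f: (instance[f] if f in coalition else baseline[f])
             for f in features}
        return predict(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

def as_percentages(phi):
    """Signed share of each feature in the total absolute contribution."""
    norm = sum(abs(v) for v in phi.values())
    return {f: 100 * v / norm for f, v in phi.items()}

def toy_surrogate(x):
    """Hypothetical default score; stands in for the surrogate model."""
    score = 0.1 + 0.08 * x["delay_sep"] + 0.04 * x["delay_aug"]
    if x["delay_sep"] > 0 and x["delay_aug"] > 0:
        score += 0.1  # interaction: delays in consecutive months
    return score - 0.000001 * x["credit_limit"]

instance = {"delay_sep": 2, "delay_aug": 1, "credit_limit": 60000}
baseline = {"delay_sep": 0, "delay_aug": 0, "credit_limit": 100000}

phi = shapley_values(toy_surrogate, instance, baseline)
print(as_percentages(phi))
```

Enumeration is exponential in the number of features, so it is only practical for toy cases like this one; for tree ensembles, specialized algorithms are normally used instead. The useful property to keep in mind is that the Shapley values sum to the difference between the prediction for the instance and the prediction for the baseline.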

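Payment status per month (a number is the count of months delayed)

Customer  Credit limit  Education   Marital status  Sep    Aug    Jul    Jun    May        Apr
Male      80000         university  single          5      4      3      2      paid duly  U.R.C
Female    60000         university  single          2      U.R.C  U.R.C  U.R.C  U.R.C      U.R.C

(U.R.C: Use of Revolving Credit)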

For the male customer, as the explanation below indicates, the delay in the September payment had a negative impact of 33% (i.e. it contributed towards ‘Default’). However, counterintuitively, the delay in the August payment had a positive impact.

[Figure: explanation of the prediction for the male customer]

For the female customer, the 2-month delay in September also contributed negatively, with an impact of 47%, which is much larger than the 33% impact of the male customer’s 5-month delay.

[Figure: explanation of the prediction for the female customer]


Although the gender feature was not included in the training of the model, the explanations showed us that the gender bias was encoded into other features (e.g. the positive contribution of a payment delay for the male customer). Moreover, by comparing the impact percentages in the explanations, we detected that the model treats the female customer more harshly (e.g. a greater negative impact for a shorter payment delay). This strange behavior should alarm us and motivate us to get a better sample of the defaulters.


In cases where the dataset describes real people, it is important to ensure that the model does not discriminate against one group in favor of others. Explanations help us detect bias, even when it is hidden in other features, pinpoint unintended decision patterns of our black-box model, and motivate us to fix our data.

