<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Code4Thought</title>
    <description>The latest articles on DEV Community by Code4Thought (@code4thought).</description>
    <link>https://dev.to/code4thought</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1299%2F1ee34dda-dd7d-44c0-8ed0-1d04f7a55c56.png</url>
      <title>DEV Community: Code4Thought</title>
      <link>https://dev.to/code4thought</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/code4thought"/>
    <language>en</language>
    <item>
      <title>Fix your recommender system to fix your profitability</title>
      <dc:creator>Romanos Kapsalis</dc:creator>
      <pubDate>Tue, 17 May 2022 13:35:13 +0000</pubDate>
      <link>https://dev.to/code4thought/fix-your-recommender-system-to-fix-your-profitability-4el2</link>
      <guid>https://dev.to/code4thought/fix-your-recommender-system-to-fix-your-profitability-4el2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How bias in Recommender Systems affects e-commerce, society and eventually your profits&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In 2021:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Almost 70%&lt;/strong&gt; of internet users in the EU had bought or ordered goods or services for private use [source: &lt;a href="https://ec.europa.eu/eurostat/web/products-eurostat-news/-/ddn-20220202-1"&gt;Eurostat&lt;/a&gt;].&lt;/li&gt;
&lt;li&gt;In the US, e-commerce sales were estimated to be approximately &lt;strong&gt;870 billion&lt;/strong&gt; dollars [source: &lt;a href="https://www.digitalcommerce360.com/article/coronavirus-impact-online-retail/"&gt;Digital Commerce 360&lt;/a&gt;].&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Undoubtedly, the COVID-19 pandemic has fuelled a significant increase in internet usage, both in e-commerce sales and in content-provider systems. A key factor behind this increase, however, is the utilization of Recommender Systems (RecSys) by an exponentially growing number of &lt;a href="https://www.mdpi.com/2079-9292/11/1/141/htm"&gt;services&lt;/a&gt;.&lt;br&gt;
Recommender (or recommendation) systems* provide recommendations to users based on behavioral data collected either &lt;a href="https://en.wikipedia.org/wiki/Recommender_system#:~:text=When%20building%20a%20model%20from%20a%20user%27s%20behavior%2C%20a%20distinction%20is%20often%20made%20between%20explicit%20and%20implicit%20forms%20of%20data%20collection."&gt;explicitly or implicitly&lt;/a&gt;. Ideally, RecSys keep users continuously engaged with content while increasing their loyalty and satisfaction. But what are the major contributing factors behind these benefits?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RecSys can provide personalized content. A &lt;a href="https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/the-value-of-getting-personalization-right-or-wrong-is-multiplying"&gt;recent study&lt;/a&gt; in the US showed that &lt;strong&gt;71%&lt;/strong&gt; of customers expect companies to deliver personalized interactions and &lt;strong&gt;76%&lt;/strong&gt; of them get frustrated when they don't get it.&lt;/li&gt;
&lt;li&gt;RecSys can surface a vast number of items similar to those users recently purchased, viewed or rated, so users spend more time interacting with the products.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What is more, these systems can have a serious impact on sales promotions (e.g. on what types of products users select and consume), as sellers can steer customers toward specific items. All the aforementioned benefits of RecSys can lead to increased consumption and thus higher profits for businesses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--giOAMGTN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4pe2av4ncbytuq8rudx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--giOAMGTN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4pe2av4ncbytuq8rudx.png" width="880" height="427"&gt;&lt;/a&gt;&lt;br&gt;Image 1: Examples of RecSys benefits on popular systems
  &lt;/p&gt;


&lt;h2&gt;
  
  
  Bias in RecSys
&lt;/h2&gt;

&lt;p&gt;Despite the above-mentioned benefits of RecSys, there are a number of performance and ethical issues that need to be addressed and investigated. Apart from well-known performance issues in RecSys such as &lt;a href="https://www.ijcai.org/Proceedings/13/Papers/495.pdf"&gt;cold-start and data sparsity&lt;/a&gt;, little attention had been paid to ethical issues until recently. In this article, we will focus on how the ethical issues of recommender systems, which primarily affect us (as consumers, citizens and users), can also significantly impact e-commerce platforms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y6G5TrsA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rugcflsfha3faunf7ac0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y6G5TrsA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rugcflsfha3faunf7ac0.png" width="871" height="714"&gt;&lt;/a&gt;&lt;br&gt;Image 2: Long tail diagram (dataset: &lt;a href="https://grouplens.org/datasets/movielens/100k/"&gt;movielens-100k&lt;/a&gt;)
  &lt;/p&gt;

&lt;p&gt;One of the most important types of bias that arise in RecSys is &lt;strong&gt;popularity bias&lt;/strong&gt;: popular items (the "head") are recommended frequently, while less popular, niche products (the "long tail") are recommended rarely or not at all. The 80/20 split in the long-tail diagram reflects the Pareto principle: for many outcomes, roughly 80% of consequences come from 20% of causes.&lt;/p&gt;
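
&lt;p&gt;To make the "head" vs "long tail" split concrete, here is a minimal Python sketch that measures what share of all interactions the most popular 20% of items account for. It assumes a pandas DataFrame of ratings with an &lt;code&gt;item_id&lt;/code&gt; column; the names are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

def head_share(ratings, head_fraction=0.2):
    # Items sorted by popularity (number of ratings), most popular first.
    counts = ratings["item_id"].value_counts()
    n_head = max(1, int(len(counts) * head_fraction))
    # Share of all interactions that involve the "head" items.
    return counts.iloc[:n_head].sum() / counts.sum()

# A value near 0.8 on a strongly long-tailed dataset mirrors the
# Pareto 80/20 pattern described above.
&lt;/code&gt;&lt;/pre&gt;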

&lt;blockquote&gt;
&lt;p&gt;But why is popularity bias so important?&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Popularity bias in society
&lt;/h2&gt;

&lt;p&gt;It was previously mentioned that RecSys offer continuous user engagement and that &lt;a href="https://www.forbes.com/sites/michaelfertik/2019/12/16/why-customer-engagement-should-be-every-businesss-top-priority-in-2020/?sh=387165706214"&gt;this is highly correlated with company profitability&lt;/a&gt;. The more engaged users are, the more likely a platform is to increase its sales and profitability. But that doesn't come without a cost.&lt;br&gt;
First and foremost, popularity bias in RecSys strongly affects society. For instance, it has played a vital role in the spread of misinformation and fake news, as &lt;a href="https://www.nature.com/articles/s41598-018-34203-2"&gt;it significantly hinders the quality of the content provided to users&lt;/a&gt;. The article "&lt;a href="https://www.technologyreview.com/2021/03/11/1020600/facebook-responsible-ai-misinformation/"&gt;How Facebook got addicted to spreading misinformation&lt;/a&gt;" describes a similar case: machine learning models programmed to maximize engagement risk increasing polarization, amplifying fake news and hate speech, and leading to unforeseen consequences, such as election sway and the genocide of the Rohingya Muslim minority in Myanmar.&lt;br&gt;
Popularity bias &lt;a href="https://www.nature.com/articles/s41598-018-34203-2"&gt;is also closely connected with a phenomenon called "echo chambers"&lt;/a&gt;, where users only encounter opinions that reflect and reinforce their own, without being exposed to opposing views.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lCLg5tbY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnhf3xcxme5e8jmu2jy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lCLg5tbY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnhf3xcxme5e8jmu2jy2.png" width="679" height="322"&gt;&lt;/a&gt;&lt;br&gt;Image 3: Echo chambers on Twitter. Nodes represent a user who sent a message, and edges, a user retweeting another user (Blue represents liberals and red conservatives) [Source: &lt;a href="https://www.pnas.org/doi/10.1073/pnas.1618923114"&gt;PNAS&lt;/a&gt;]
  &lt;/p&gt;


&lt;h2&gt;
  
  
  Popularity bias impact in e-commerce
&lt;/h2&gt;

&lt;p&gt;Popularity bias can also have a significant negative business impact, especially on e-commerce systems. Apart from popularity bias, there are some other aspects of RecSys quality that need to be addressed, which are presented in Image 4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SghqLL7m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kax3cggdlpjb2x4n2y2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SghqLL7m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kax3cggdlpjb2x4n2y2p.png" width="880" height="300"&gt;&lt;/a&gt;&lt;br&gt;Image 4: Aspects of RecSys quality and their business impact
  &lt;/p&gt;


&lt;h2&gt;
  
  
  Study &amp;amp; Findings
&lt;/h2&gt;

&lt;p&gt;In order to investigate the business impact of popularity bias and of the other aspects of RecSys quality (Diversity, Coverage and Novelty), and to gain a better understanding of their sources, our team conducted an extensive study using a real dataset provided by a major electronics retailer. The dataset contained 8,263 ratings, 3,078 products and 276 users.&lt;br&gt;
The first step was to build 11 different RecSys models utilizing 11 different algorithms, from classical to recent state-of-the-art approaches (see Image 5 for details). Each system produced a list of 10 items for every user. We then evaluated the results of each RecSys using 12 different metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GOTQkgZC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8hl0zpjuy22kle7tb728.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GOTQkgZC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8hl0zpjuy22kle7tb728.png" width="861" height="333"&gt;&lt;/a&gt;&lt;br&gt;Image 5: Algorithms and evaluation metrics used in the study
  &lt;/p&gt;
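
&lt;p&gt;The overall pipeline can be summarized in a short framework-agnostic Python sketch. The interfaces below (&lt;code&gt;fit&lt;/code&gt;, &lt;code&gt;recommend&lt;/code&gt; and callable metrics) are placeholders for illustration, not the actual code of the study.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def evaluate_recommenders(algorithms, metrics, train_data, test_data, k=10):
    """Fit each algorithm, produce a top-k list per user, score with every metric."""
    results = {}
    for name, algo in algorithms.items():
        model = algo.fit(train_data)  # one of the 11 models
        top_k = {u: model.recommend(u, k) for u in test_data.users}  # top-10 lists
        results[name] = {m.__name__: m(top_k, test_data) for m in metrics}  # 12 metrics
    return results
&lt;/code&gt;&lt;/pre&gt;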


&lt;h2&gt;
  
  
  Findings
&lt;/h2&gt;

&lt;p&gt;The main findings of our study are as follows:&lt;br&gt;
 &lt;br&gt;
🔎    From &lt;a href="https://arxiv.org/abs/1205.6700"&gt;Average Recommendation Popularity (ARP)&lt;/a&gt; (Image 6) it is apparent that &lt;strong&gt;&lt;u&gt;most of the algorithms recommended the most popular items&lt;/u&gt;&lt;/strong&gt;. More specifically, although the average number of ratings per item in the dataset is 2.68, the scores of all algorithms except ItemKNN, NGCF and Random are much higher than this value.&lt;br&gt;
  &lt;br&gt;
🔎    The algorithms used not only recommend the most popular items, but also tend to &lt;strong&gt;&lt;u&gt;completely ignore user preferences&lt;/u&gt;&lt;/strong&gt;, as shown in the &lt;a href="https://dl.acm.org/doi/abs/10.1145/3397271.3401177"&gt;Popularity Ranking-based Equal Opportunity (PopREO)&lt;/a&gt; chart in Image 6.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dB4du_xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brwiwxj6xtutfgns458y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dB4du_xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brwiwxj6xtutfgns458y.png" width="880" height="279"&gt;&lt;/a&gt;&lt;br&gt;Image 6: Popularity bias as measured by 2 different metrics (the lower the better)
  &lt;/p&gt;
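
&lt;p&gt;For reference, ARP is straightforward to compute: it is the training-set popularity of the recommended items, averaged first over each user's list and then over all users. A hedged sketch, with illustrative data structures:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def average_recommendation_popularity(rec_lists, rating_counts):
    """rec_lists: dict mapping each user to their list of recommended items.
    rating_counts: dict mapping each item to its number of training ratings."""
    per_user = [
        sum(rating_counts.get(item, 0) for item in items) / len(items)
        for items in rec_lists.values()
    ]
    return sum(per_user) / len(per_user)

# A score far above the dataset's mean ratings-per-item (2.68 here)
# indicates that recommendations skew toward popular items.
&lt;/code&gt;&lt;/pre&gt;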

&lt;p&gt;🔎    The results of this study indicate that RecSys developers should be aware of the &lt;a href="https://blogs.oracle.com/ai-and-datascience/post/unlocking-fairness-a-trade-off-revisited"&gt;bias-accuracy trade-off&lt;/a&gt; and should avoid algorithms that amplify this bias, as shown in Image 7. It has to be mentioned that the poor accuracy results are caused by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the dataset's very high sparsity (almost 99%; a quick check of this figure follows the list)&lt;/li&gt;
&lt;li&gt;uneven distribution of ratings&lt;/li&gt;
&lt;li&gt;relatively small number of ratings&lt;/li&gt;
&lt;/ol&gt;
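
&lt;p&gt;The sparsity figure in point 1 is easy to verify from the dataset's dimensions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def sparsity(n_ratings, n_users, n_items):
    # Share of the user-item matrix that holds no rating at all.
    return 1 - n_ratings / (n_users * n_items)

print(sparsity(8263, 276, 3078))  # ~0.990, i.e. almost 99% sparse
&lt;/code&gt;&lt;/pre&gt;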

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QFtEYvKt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ee8l772wk1ovbv6ntdiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QFtEYvKt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ee8l772wk1ovbv6ntdiy.png" width="880" height="297"&gt;&lt;/a&gt;&lt;br&gt;Image 7: Accuracy-popularity bias trade-off
  &lt;/p&gt;

&lt;p&gt;🔎    Algorithms may not cover all the items given as input, as Image 8 clearly shows. Consequently, &lt;strong&gt;a significant number of items will remain unseen by the majority of users&lt;/strong&gt;, reducing sales and decreasing user satisfaction. Moreover, we found that diversity is not always connected with popularity bias: apart from our baseline algorithm, ItemKNN and NGCF, none of the other algorithms enhances the diversity of the recommended items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iqcrDgp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c5f1kqqwxlhvd7hk4cgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iqcrDgp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c5f1kqqwxlhvd7hk4cgu.png" width="880" height="305"&gt;&lt;/a&gt;&lt;br&gt;Image 8: Some RecSys algorithms have very low scores in Coverage, Diversity and Novelty
  &lt;/p&gt;
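
&lt;p&gt;Catalog coverage, one of the quantities behind Image 8, can be sketched in a few lines: it is simply the fraction of catalog items that appear in at least one user's recommendation list (data structures illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def catalog_coverage(rec_lists, catalog):
    recommended = set()
    for items in rec_lists.values():
        recommended.update(items)  # every item recommended to at least one user
    return len(recommended) / len(catalog)

# A low value means a large part of the catalog stays invisible to
# users, with the sales impact described above.
&lt;/code&gt;&lt;/pre&gt;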


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In terms of accuracy, users should be strongly encouraged to rate products: RecSys need vast amounts of data to produce meaningful results. In most cases, however, many customers have not offered any explicit feedback to the system. Cookies might be one of the most effective ways to overcome this problem by collecting user data, although they bear additional overhead for managing risks related to privacy and GDPR compliance.&lt;br&gt;
  &lt;br&gt;
As regards popularity bias and the other aspects of RecSys quality, the proposed solutions depend on whether the e-commerce system owner has access to the model. If they do, a bias mitigation technique (&lt;a href="https://analyticsindiamag.com/a-guide-to-different-bias-mitigation-techniques-in-machine-learning/"&gt;pre-processing, in-processing, post-processing or a combination of these&lt;/a&gt;) can be used. Otherwise, less effective workarounds can be applied, such as encouraging users to rate recently purchased products.&lt;br&gt;
  &lt;br&gt;
Last but not least, bias can also hurt item providers (sellers or manufacturers): a product in the long tail, or a newly added one, may never be recommended to users.&lt;br&gt;
In conclusion, e-commerce businesses need to control popularity bias, diversity and novelty, because these aspects strongly affect user satisfaction and, in turn, profits.&lt;/p&gt;




&lt;p&gt;*Note: in this article we refer to collaborative filtering systems, and more specifically top-n recommenders, which output a list of top-n recommendations for each user.&lt;/p&gt;

</description>
      <category>recommendersystems</category>
      <category>bias</category>
      <category>ecommerce</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Don't be wrong because you might be fooled:  Tips on how secure your ML model</title>
      <dc:creator>Andreas Messalas</dc:creator>
      <pubDate>Tue, 17 May 2022 09:27:10 +0000</pubDate>
      <link>https://dev.to/code4thought/dont-be-wrong-because-you-might-be-fooled-tips-on-how-secure-your-ml-model-33j9</link>
      <guid>https://dev.to/code4thought/dont-be-wrong-because-you-might-be-fooled-tips-on-how-secure-your-ml-model-33j9</guid>
      <description>&lt;p&gt;&lt;em&gt;Figuring out the reasons why your ML model might be consistently less accurate in certain classes than others, might help you increase not only its total accuracy but also its adversarial robustness.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Machine Learning (ML) models, and especially Deep Learning (DL) ones, can achieve impressive results, particularly on unstructured data like images and text. However, there is a fundamental limitation of the (supervised) ML framework: &lt;em&gt;the distributions we end up using ML models on are NOT always the ones we trained them on.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This leads to models that seem accurate yet are brittle.&lt;br&gt;
Adversarial attacks exploit this brittleness by introducing unnoticeable perturbations to images that force the model to flip its predictions.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://arxiv.org/abs/2108.06247"&gt;example&lt;/a&gt;, lighting a "Stop" sign in a specific way can make a traffic-sign model predict it as a "Speed 30" sign (Fig. 1 left). In another &lt;a href="https://pubmed.ncbi.nlm.nih.gov/30898923/"&gt;example&lt;/a&gt;, just by rotating an image, or by replacing some words in a sentence with their synonyms, you can fool models into giving a wrong medical diagnosis or risk score, respectively (Fig. 1 right).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tk5ZZmdd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7xlb2qwj8utnpyt5iqze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tk5ZZmdd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7xlb2qwj8utnpyt5iqze.png" width="880" height="211"&gt;&lt;/a&gt;&lt;br&gt;Fig. 1 (references [&lt;a href="https://arxiv.org/abs/2108.06247"&gt;1&lt;/a&gt;], [&lt;a href="https://pubmed.ncbi.nlm.nih.gov/30898923/"&gt;2&lt;/a&gt;])
  &lt;/p&gt;
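
&lt;p&gt;To make "unnoticeable perturbations" concrete, here is a minimal sketch of the classic Fast Gradient Sign Method (FGSM) in PyTorch. This is a generic illustration of the attack family, not the specific attacks behind the examples above; the model and epsilon are assumed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    # Compute the gradient of the loss with respect to the input image.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that increases the loss, then keep pixels valid.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
&lt;/code&gt;&lt;/pre&gt;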

&lt;p&gt;It is evident that a model's adversarial robustness is associated with its security and safety, so it is important to be aware of such attacks and their implications.&lt;br&gt;
In this article, we will use a model trained on the &lt;a href="https://www.cs.toronto.edu/~kriz/cifar.html"&gt;CIFAR10&lt;/a&gt; public dataset to:&lt;br&gt;
📌 Investigate the intuition that the model's inability to correctly predict an image also implies higher susceptibility to adversarial attacks.&lt;br&gt;
📌 Measure class disparities, i.e. check the model's performance across the 10 classes.&lt;br&gt;
📌 Conduct a root cause analysis that pinpoints the causes of these class differences, which will hopefully help us not only fix the misclassifications but also increase the adversarial robustness of the model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The "cat", "bird" and "dog" classes are harder to correctly classify and easier to attack
&lt;/h2&gt;

&lt;p&gt;We trained a simple model, using the well-known &lt;a href="https://arxiv.org/abs/1512.03385"&gt;ResNet&lt;/a&gt; architectural pattern on a 20-layer network, which achieves 89.4% accuracy on the validation set. We then plotted the miss-rate per class to check for disparities between the classes (Fig. 2).&lt;br&gt;
It is evident that the "cat", "bird" and "dog" classes are harder to classify correctly than the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Iy9bASgO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iukz5m804dbkycm14yfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Iy9bASgO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iukz5m804dbkycm14yfw.png" width="880" height="640"&gt;&lt;/a&gt;&lt;br&gt;Fig. 2: Miss-Rates across the 10 classes of CIFAR10
  &lt;/p&gt;
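
&lt;p&gt;The per-class miss-rate behind Fig. 2 boils down to a small computation; a sketch assuming NumPy arrays of true and predicted labels:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def miss_rate_per_class(y_true, y_pred, n_classes=10):
    rates = []
    for c in range(n_classes):
        mask = (y_true == c)
        # Fraction of this class's images that the model gets wrong.
        rates.append(float(np.mean(y_pred[mask] != c)))
    return rates
&lt;/code&gt;&lt;/pre&gt;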

&lt;p&gt;We then applied two kinds of adversarial attacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;untargeted attack&lt;/strong&gt;, where an attack is considered successful when the predicted class label is changed (to any other label)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;targeted attack with the least-likely target&lt;/strong&gt;, where an attack is successful when the predicted class label is changed specifically to the label in which the model has the least confidence for that instance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Afterwards, we plotted the attack-success rate per class, i.e. the percentage of successful attacks per class (note: each class in the test set has 1,000 images). We can observe that the most successfully attacked classes are the same ones that are most often misclassified (Fig. 3).&lt;/p&gt;

&lt;p&gt;This is intuitive and to some extent expected: the fact that the model misclassifies some instances means it pays attention to features that are not very relevant to that class, so adding perturbed features makes the model's job much harder and the attacker's goal easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Grn6-xck--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6a6154396znjkk4p69e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Grn6-xck--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6a6154396znjkk4p69e.png" width="880" height="315"&gt;&lt;/a&gt;&lt;br&gt;Fig. 3: Attack Success Rate for Targeted and Untargeted attacks
  &lt;/p&gt;
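
&lt;p&gt;The attack-success rate in Fig. 3 can be sketched the same way: for each class, the fraction of its test images whose predicted label flips under the attack. For the targeted variant one would instead check equality with the least-likely label. Arrays are assumed as before:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def attack_success_rate(y_true, y_pred_clean, y_pred_adv, n_classes=10):
    rates = []
    for c in range(n_classes):
        mask = (y_true == c)
        # Untargeted attack: success means the prediction changed at all.
        flipped = y_pred_adv[mask] != y_pred_clean[mask]
        rates.append(float(np.mean(flipped)))
    return rates
&lt;/code&gt;&lt;/pre&gt;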

&lt;h2&gt;
  
  
  Root cause analysis
&lt;/h2&gt;

&lt;p&gt;We identified three possible root causes for the class disparities in the model's predictions:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Mislabeled/Confusing Training Data
&lt;/h3&gt;

&lt;p&gt;Data collection is probably the most costly and time-consuming part of most machine learning projects, and it's reasonable to expect that this arduous process will entail some mistakes. We discovered that the CIFAR10 training set contains some images that are either mislabeled or inherently confusing, even for humans (Fig. 4).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gz8vhmpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1kt5kiw5w6qdc6q40g41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gz8vhmpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1kt5kiw5w6qdc6q40g41.png" width="496" height="416"&gt;&lt;/a&gt;&lt;br&gt;Fig. 4: Confusing/mislabeled images from the CIFAR10 training dataset
  &lt;/p&gt;

&lt;p&gt;This could be considered a kind of involuntary data poisoning: a significant number of such poisoned images can shift the data distribution in a false direction, making it more difficult for the model to learn meaningful features and, consequently, to make correct predictions.&lt;/p&gt;

&lt;p&gt;Data poisoning is also linked with adversarial attacks, although in that case the poisoned data are carefully crafted. It has been shown that even a single poisoned image can affect multiple test images (Fig. 5).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aKmef3Kl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/58450muumstsczjllx7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aKmef3Kl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/58450muumstsczjllx7g.png" width="880" height="371"&gt;&lt;/a&gt;&lt;br&gt;Fig. 5 (&lt;a href="https://dl.acm.org/doi/10.5555/3305381.3305576"&gt;Koh et al&lt;/a&gt;)
  &lt;/p&gt;

&lt;h3&gt;
  
  
  2. Is this a "cat" or a "dog"?
&lt;/h3&gt;

&lt;p&gt;The "cat" class has the worst miss-rate of all the classes, followed by the "dog" class which has the third worst miss-rate.&lt;/p&gt;

&lt;p&gt;Since these animals share some similar features (four legs, ears, tail), our intuition is that our model could not extract meaningful features to distinguish these two animals or has learned better "dog" features than the "cat" ones.&lt;/p&gt;

&lt;p&gt;Using saliency map explanations, we can verify this intuition: the model has not learned some distinctive cat characteristics, such as the pointy ears and nose, and instead focuses on the animal's whole face or body (Fig. 6).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ScKxDIkt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ascf5syifrgb1bu47sow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ScKxDIkt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ascf5syifrgb1bu47sow.png" width="880" height="617"&gt;&lt;/a&gt;&lt;br&gt;Fig. 6: Saliency map explanations for images of "cat" misclassified as "dog"
  &lt;/p&gt;
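
&lt;p&gt;One common way to produce maps like those in Fig. 6 is plain gradient saliency: the absolute gradient of the class score with respect to the input pixels. The article does not state which saliency method was used, so the PyTorch sketch below is illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def saliency_map(model, x, target_class):
    # x: a single image batch of shape (1, channels, height, width).
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]  # logit of the class of interest
    score.backward()
    # Max over colour channels gives one heat value per pixel.
    return x.grad.abs().max(dim=1).values.squeeze()
&lt;/code&gt;&lt;/pre&gt;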

&lt;h3&gt;
  
  
  3. Is this a "bird" or an "airplane"?
&lt;/h3&gt;

&lt;p&gt;A similar situation occurs with the "bird" and "airplane" classes. In this case the model is confused by the blue background, since most airplane images contain an object on a blue background (Fig. 7).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y9E8I_wv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l3i2o0z4n6d34cwt8pim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y9E8I_wv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l3i2o0z4n6d34cwt8pim.png" width="880" height="588"&gt;&lt;/a&gt;&lt;br&gt;Fig. 7: Saliency map explanations for images of "bird" misclassified as "airplane"
  &lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;👉 &lt;strong&gt;Good data means a good model: spend some time investigating your data and try to identify any systematic errors in your training set.&lt;/strong&gt;&lt;br&gt;
👉 &lt;strong&gt;Use explanation methods as a debugger, in order to understand why your model misses certain groups of instances more than others.&lt;/strong&gt;&lt;br&gt;
👉 &lt;strong&gt;Adversarial attacks are a cost-effective way to check the adversarial robustness of your model.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>trustworthyai</category>
      <category>security</category>
      <category>adversarialrobustness</category>
    </item>
    <item>
      <title>Fairness, Accountability &amp; Transparency (F.Acc.T) under GDPR</title>
      <dc:creator>Andreas Messalas</dc:creator>
      <pubDate>Sat, 14 Nov 2020 17:13:39 +0000</pubDate>
      <link>https://dev.to/code4thought/fairness-accountability-transparency-f-acc-t-under-gdpr-3nk5</link>
      <guid>https://dev.to/code4thought/fairness-accountability-transparency-f-acc-t-under-gdpr-3nk5</guid>
      <description>&lt;h1&gt;
  
  
  Why the need for Regulation?
&lt;/h1&gt;

&lt;p&gt;Algorithmic decisions are already crucially affecting our lives. In the last few years, headlines like the ones listed below have become more and more common:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  "There's software used across the country to predict future criminals. And it's biased against blacks", 2016 [source: &lt;a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing"&gt;ProPublica&lt;/a&gt;].&lt;/li&gt;
&lt;li&gt;  "Amazon reportedly scraps internal AI recruiting tool that was biased against women", 2018 [source: &lt;a href="https://www.theverge.com/2018/10/10/17958784/ai-recruiting-tool-bias-amazon-report"&gt;The Verge&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;  "Apple Card is being investigated over claims it gives women lower credit limits", 2019 [source: &lt;a href="https://www.technologyreview.com/2019/11/11/131983/apple-card-is-being-investigated-over-claims-it-gives-women-lower-credit-limits/"&gt;MIT Technology Review&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;  "Is an Algorithm Less Racist Than a Loan Officer?", 2020 [source: &lt;a href="https://www.nytimes.com/2020/09/18/business/digital-mortgages.html"&gt;NY Times&lt;/a&gt;]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above indicate why we see a rise in new regulations and laws that aim to control, to some extent, the growing power of algorithms. One of the first and most well-known is the General Data Protection Regulation (&lt;a href="https://gdpr.eu/"&gt;GDPR&lt;/a&gt;), introduced by the European Parliament and the Council of the European Union in May 2016.&lt;/p&gt;

&lt;p&gt;In this article, we will focus on how the properties of Fairness, Accountability and Transparency (better known as &lt;a href="https://facctconference.org/"&gt;F.Acc.T&lt;/a&gt;) are reflected in the GDPR and we will examine the scope of the restrictions they impose on the public and private sector.&lt;/p&gt;

&lt;h1&gt;
  
  
  Principles of GDPR
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://eur-lex.europa.eu/eli/reg/2016/679/oj#d1e1807-1-1"&gt;Article 5(1) &amp;amp; (2) in GDPR&lt;/a&gt; lists the basic principles relating to the processing of personal data:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;lawfulness, fairness and transparency; purpose limitation; data minimisation; accuracy; storage limitation; integrity and confidentiality; accountability&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While the F.Acc.T properties are clearly listed as principles, we need to take a deeper look in the GDPR text to understand what these terms actually mean.&lt;/p&gt;

&lt;h1&gt;
  
  
  Transparency under the GDPR
&lt;/h1&gt;

&lt;p&gt;Transparency has been a matter of debate among legal scholars, since it is not clear whether GDPR contains a right to explanation of automated decisions. For example, the following two papers [&lt;a href="https://academic.oup.com/idpl/article/7/2/76/3860948"&gt;3&lt;/a&gt;, &lt;a href="https://academic.oup.com/idpl/article-abstract/7/4/243/4626991?redirectedFrom=fulltext"&gt;4&lt;/a&gt;] from the same &lt;a href="https://academic.oup.com/idpl"&gt;journal&lt;/a&gt; have almost contradictory titles (see Image 1).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D4cAN73P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hnn3eadn21sx8zjhd5s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D4cAN73P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hnn3eadn21sx8zjhd5s5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image 1: Contradictory papers [&lt;a href="https://academic.oup.com/idpl/article/7/2/76/3860948"&gt;3&lt;/a&gt;, &lt;a href="https://academic.oup.com/idpl/article-abstract/7/4/243/4626991?redirectedFrom=fulltext"&gt;4&lt;/a&gt;] about the right to explanation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The epicentre of this debate is located in &lt;a href="https://eur-lex.europa.eu/eli/reg/2016/679/oj#d1e2838-1-1"&gt;Article 22 (3) GDPR,&lt;/a&gt; where it is stated that: "[...] the data controller shall implement suitable measures to safeguard the data subject's rights and freedoms and legitimate interests, at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision". Here, it is not stated explicitly whether a right to an explanation exists.&lt;/p&gt;

&lt;p&gt;However, in &lt;a href="https://www.privacy-regulation.eu/en/r71.htm"&gt;Recital 71 GDPR&lt;/a&gt;, it is stated categorically that (inter alia) this right exists: "[...] such processing should be subject to suitable safeguards, which should include specific information to the data subject and the right to obtain human intervention, to express his or her point of view, to obtain an explanation of the decision reached after such assessment and to challenge the decision."&lt;/p&gt;

&lt;p&gt;Since only articles are binding in the GDPR and not the recitals --- which provide context and greater depth of meaning to the articles --- most scholars [&lt;a href="https://academic.oup.com/idpl/article/7/2/76/3860948"&gt;3&lt;/a&gt;, &lt;a href="https://www.aaai.org/ojs/index.php/aimagazine/article/view/2741"&gt;4&lt;/a&gt; &amp;amp; &lt;a href="https://academic.oup.com/idpl/article/7/4/233/4762325"&gt;5&lt;/a&gt;] believe that a right to an explanation does not follow from the GDPR.&lt;/p&gt;

&lt;p&gt;There is another conflicting point: &lt;a href="https://eur-lex.europa.eu/eli/reg/2016/679/oj#d1e2513-1-1"&gt;Article 15 (1)(h)&lt;/a&gt; states that the data controller should provide the data subject with "meaningful information about the logic involved".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Some scholars [&lt;a href="https://academic.oup.com/idpl/article/7/2/76/3860948"&gt;3&lt;/a&gt;, &lt;a href="https://academic.oup.com/idpl/article-abstract/7/4/243/4626991?redirectedFrom=fulltext"&gt;4&lt;/a&gt;] interpret the "meaningful information" as referring only to the model's general structure, so explanations for individual predictions are not required.&lt;/li&gt;
&lt;li&gt;  Other scholars [&lt;a href="https://academic.oup.com/idpl/article/7/4/233/4762325"&gt;5&lt;/a&gt;] believe that in order for the information to be meaningful it needs to allow the data subjects to exercise their rights defined by Article 22(3), which is the right to "express his or her point of view and to contest the decision". Since explanations provide this ability, it is argued that they must be presented.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Loophole in the system
&lt;/h3&gt;

&lt;p&gt;However, even in the best-case scenario, where everyone agrees that the right to explanations exists in the GDPR, the regulation refers only to "decisions based solely on automated processing" (&lt;a href="https://eur-lex.europa.eu/eli/reg/2016/679/oj#d1e2838-1-1"&gt;Article 22 (1)&lt;/a&gt;). This means that any kind of human intervention in the decision exempts the controller from the obligation to provide explanations. For example, if a creditor uses an automated credit scoring system only in an advisory capacity and makes the final decision about the credit applicant themselves, then GDPR does not force the system to provide explanations to either the creditor or the applicant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code4Thought's stance
&lt;/h3&gt;

&lt;p&gt;We believe that explanations are essential to ensure transparency and trust across all kinds of stakeholders of an algorithmic system. From our experience, users of advisory automated systems are prone to follow the system's decision, even when it does not make much sense. From faithfully following erroneous &lt;a href="https://abcnews.go.com/US/google-maps-shortcut-colorado-turns-muddy-mess-hundred/story?id=63946068"&gt;Google Maps directions&lt;/a&gt; to &lt;a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing"&gt;judges using racially biased risk assessment&lt;/a&gt; tools, the immense power of today's algorithmic systems tends to override people's critical thinking and take its place.&lt;/p&gt;

&lt;p&gt;So, regardless of human intervention, we firmly believe that when algorithmic systems affect real people, explanations are necessary, as they establish trust between the system, the data controllers and the data subjects. Moreover, they allow the algorithms to be scrutinized for systematic bias, consequently increasing fairness. Last but not least, understanding the algorithm's reasoning can be very helpful for debugging and tuning purposes, while in other cases (e.g. medical diagnosis) it is imperative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Artificial Intelligence (AI) as a "black-box"
&lt;/h3&gt;

&lt;p&gt;The intricate and obscure inner structure of the most common AI models (e.g. deep neural nets, ensemble methods, GANs, etc.) forces us, more often than not, to treat them as "black-boxes", that is, accepting their predictions on a no-questions-asked basis. As mentioned above, this can allow erroneous model behavior, possibly caused by undesired patterns in our data, to go unnoticed. Explanations help us detect such patterns: in Image 2 we see a photo of a man (lead singer of "&lt;a href="https://www.youtube.com/watch?v=Soa3gO7tL-c&amp;amp;list=OLAK5uy_moErjO5DD0FUSvcJ2St23Zfo0XvezasFM&amp;amp;index=3"&gt;Green Day&lt;/a&gt;", &lt;a href="https://twitter.com/billiejoe?lang=el"&gt;B.J. Armstrong&lt;/a&gt;), who was falsely classified as "Female" by a gender recognition "black-box" model. The explanation shows that the (red) pixels representing the make-up around his eyes contributed heavily towards predicting him as "Female", suggesting that the model chose the gender of a picture solely by detecting make-up or the absence of a moustache. This discovery would motivate us to take a closer look at our training data and update them.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O5T1Xs1P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/f811ejquz16cegy1apqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O5T1Xs1P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/f811ejquz16cegy1apqi.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image 2: Explanation of a false prediction by a gender recognition model (photo by &lt;a href="http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html"&gt;CelebA dataset&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Explainability methods
&lt;/h3&gt;

&lt;p&gt;There are various explainability methods and Image 3 presents a taxonomy of them. One of Pythia's transparency tools is called &lt;a href="https://ieeexplore.ieee.org/document/8900669"&gt;MASHAP&lt;/a&gt;, which is a model-agnostic method that outputs feature summary in both local and global scope.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WgGWwArt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pxv0zpy5t5un81fu8ve2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WgGWwArt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pxv0zpy5t5un81fu8ve2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Fairness under the GDPR
&lt;/h1&gt;

&lt;p&gt;Fairness might be the most difficult principle to define, since determining what is fair is a very subjective process that varies from culture to culture. Ask ten different philosophers what defines fairness, and you would probably get ten different answers.&lt;/p&gt;

&lt;p&gt;Even the translation of the word "fair" into various European languages is not straightforward: two or three different words are used, each carrying slightly different nuances. For example, in the &lt;a href="https://eur-lex.europa.eu/legal-content/DE/TXT/?uri=CELEX:32016R0679"&gt;German text&lt;/a&gt;, across different articles and recitals, "fair" is translated as "nach Treu und Glauben", "faire" or "gerecht".&lt;/p&gt;

&lt;p&gt;In the context of GDPR, scholars believe that fairness can have many possible nuances, such as non-discrimination, fair balancing, procedural fairness, etc. [&lt;a href="https://academic.oup.com/yel/article-abstract/doi/10.1093/yel/yey004/5068688?redirectedFrom=fulltext"&gt;1&lt;/a&gt;, &lt;a href="https://dl.acm.org/doi/abs/10.1145/3351095.3372868"&gt;2&lt;/a&gt;]. Fairness as non-discrimination refers to the elimination of "[...] discriminatory effects on natural persons on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or processing that results in measures having such an effect" as defined in &lt;a href="https://www.privacy-regulation.eu/en/r71.htm"&gt;Recital 71&lt;/a&gt; GDPR. Procedural fairness is linked with timeliness, transparency and burden of care by data controllers [&lt;a href="https://academic.oup.com/yel/article-abstract/doi/10.1093/yel/yey004/5068688?redirectedFrom=fulltext"&gt;1&lt;/a&gt;]. Fair balancing is based on proportionality between interests and necessity of purposes [&lt;a href="https://academic.oup.com/yel/article-abstract/doi/10.1093/yel/yey004/5068688?redirectedFrom=fulltext"&gt;1&lt;/a&gt;, &lt;a href="https://dl.acm.org/doi/abs/10.1145/3351095.3372868"&gt;2&lt;/a&gt;].&lt;/p&gt;

&lt;h3&gt;
  
  
  Code4Thought's stance
&lt;/h3&gt;

&lt;p&gt;At Code4Thought, we strive to pinpoint discriminatory behavior in algorithmic systems based on proportional disparities in their outcomes, so our notion of fairness is aligned with the non-discrimination and fair-balancing concepts of GDPR. More specifically, we examine whether an algorithmic system is biased against a specific &lt;a href="https://en.wikipedia.org/wiki/Protected_group"&gt;protected group&lt;/a&gt; of people by measuring whether the system favors one group over another for a specific outcome (&lt;a href="https://www.technologyreview.com/2019/11/11/131983/apple-card-is-being-investigated-over-claims-it-gives-women-lower-credit-limits/"&gt;e.g.&lt;/a&gt; Apple Card giving higher credit to male customers than to female ones).&lt;/p&gt;

&lt;p&gt;While there is a &lt;a href="https://www.youtube.com/watch?v=jIXIuYdnyyk"&gt;plethora of metrics&lt;/a&gt; for measuring fairness and various open-source toolkits containing such metrics (e.g. IBM's &lt;a href="https://aif360.mybluemix.net/"&gt;AIF360&lt;/a&gt;, &lt;a href="http://aequitas.dssg.io/"&gt;Aequitas&lt;/a&gt; by the University of Chicago), we chose the Disparate Impact Ratio (DIR) as our general fairness metric. DIR is a "ratio of ratios": the ratio between the proportions of individuals in two groups that receive a certain outcome.&lt;/p&gt;

&lt;p&gt;The reason behind this choice is that DIR is connected with a well-known regulatory rule, the "four-fifths rule", introduced by the U.S. Equal Employment Opportunity Commission (EEOC) and codified in &lt;a href="https://www.govinfo.gov/content/pkg/CFR-2018-title29-vol4/xml/CFR-2018-title29-vol4-part1607.xml"&gt;29 C.F.R. § 1607.4(D)&lt;/a&gt;, which renders the metric something of an industry standard. Ideally the DIR should equal one (i.e. equal proportions for all groups selected for a certain outcome). However, according to the aforementioned rule, any DIR value below 80% can be regarded as "evidence of adverse impact". Since DIR is a fraction whose denominator might be larger than its numerator, we consider an acceptable range of DIR from 80% to 120%.&lt;/p&gt;
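
&lt;p&gt;As a concrete illustration, DIR and the 80%-120% band reduce to a few lines of Python (the group outcomes here are made-up toy data):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def disparate_impact_ratio(outcomes_a, outcomes_b):
    # Each argument: iterable of 0/1 outcomes for one group.
    rate_a = sum(outcomes_a) / len(outcomes_a)
    rate_b = sum(outcomes_b) / len(outcomes_b)
    return rate_a / rate_b

dir_value = disparate_impact_ratio([1, 0, 1, 1], [1, 1, 1, 1])  # 0.75
compliant = 0.8 &amp;lt;= dir_value &amp;lt;= 1.2  # fails the four-fifths band
&lt;/code&gt;&lt;/pre&gt;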

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mCXyyM3T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zjl236ramxf33dyhn0uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mCXyyM3T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zjl236ramxf33dyhn0uv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image 4: Screenshot of Pythia’s monitoring tool&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Image 4 displays a screenshot of our Pythia monitoring tool, measuring the DIR over time of Twitter's internal algorithm for selecting preview photos and indicating possible bias (see our &lt;a href="https://www.code4thought.eu/index.php/2020/10/02/twitter-bias/"&gt;article&lt;/a&gt; for more information).&lt;/p&gt;

&lt;h1&gt;
  
  
  Accountability under the GDPR
&lt;/h1&gt;

&lt;p&gt;Authority is increasingly expressed algorithmically and decisions that used to be based on human intuition and reaction are now automated. Data controllers should be able to guide or restrain their algorithms when necessary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://eur-lex.europa.eu/eli/reg/2016/679/oj#d1e3043-1-1"&gt;Article 24(2)&lt;/a&gt; GDPR mentions that "[..] the controller shall implement appropriate technical and organisational measures to ensure and to be able to demonstrate that processing is performed in accordance with this Regulation". Article 5(2) also states that "The controller shall be responsible for, and be able to demonstrate compliance with, paragraph 1 ('accountability')."&lt;/p&gt;

&lt;p&gt;Imposing accountability on algorithmic systems is not as technically challenging as fairness or transparency. It can be accomplished through evaluation frameworks, in the form of simple checklists and questionnaires that require input from the experts who build or manage the system.&lt;/p&gt;

&lt;p&gt;In July 2020 the High-Level Expert Group on Artificial Intelligence (&lt;a href="https://ec.europa.eu/digital-single-market/en/high-level-expert-group-artificial-intelligence"&gt;AI HLEG&lt;/a&gt;) appointed by the European Union created an &lt;a href="https://ec.europa.eu/digital-single-market/en/news/assessment-list-trustworthy-artificial-intelligence-altai-self-assessment"&gt;assessment list&lt;/a&gt; for trustworthy AI. Prior to that, in 2019, Code4Thought had published a similar (though narrower) &lt;a href="https://ieeexplore.ieee.org/document/8900715"&gt;assessment framework&lt;/a&gt; for algorithmic accountability at an academic conference.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;We presented how the F.Acc.T principles are viewed through the lens of GDPR. It is clear that, besides the technical challenges of incorporating these principles into an algorithmic pipeline, the legal view of their imposition is also obscure in some areas (e.g. transparency).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At Code4Thought, we hope that the legal gaps will be filled and the related conflicts resolved soon, creating clearer and more explicit guidelines and regulations for AI ethics. Until then, we urge organisations to proactively care about the "F.Acc.T-ness" of their algorithmic systems, since we strongly believe it can boost their trustworthiness and yield more stable, responsible algorithmic systems and perhaps a fairer society.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;It should be noted that, globally, there are also other similar regulations for data privacy and protection, some of which are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Personal Information Protection and Electronic Documents Act (&lt;a href="https://laws-lois.justice.gc.ca/ENG/ACTS/P-8.6/FullText.html"&gt;PIPEDA&lt;/a&gt;) by the Parliament of Canada in 2001&lt;/li&gt;
&lt;li&gt;  California Consumer Privacy Act (&lt;a href="http://leginfo.legislature.ca.gov/faces/codes_displayText.xhtml?division=3.&amp;amp;part=4.&amp;amp;lawCode=CIV&amp;amp;title=1.81.5"&gt;CCPA&lt;/a&gt;) by California State Legislature in June, 2018&lt;/li&gt;
&lt;li&gt;  Brazilian General Data Protection Law (Lei Geral de Proteção de Dados or &lt;a href="https://iapp.org/media/pdf/resource_center/Brazilian_General_Data_Protection_Law.pdf"&gt;LGPD&lt;/a&gt;) passed in 2018&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://ec.europa.eu/info/publications/white-paper-artificial-intelligence-european-approach-excellence-and-trust_en"&gt;White paper&lt;/a&gt; on Artificial Intelligence --- A European approach to excellence and trust, by the European Union in February 2020&lt;/li&gt;
&lt;li&gt;  Denmark's &lt;a href="https://www.folketingstidende.dk/samling/20191/lovforslag/L124B/20191_L124B_som_vedtaget.pdf"&gt;legislation&lt;/a&gt; on AI &amp;amp; data ethics, passed in May 2020, making it the first country in Europe to implement such laws&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>gdpr</category>
      <category>fairness</category>
      <category>transparency</category>
    </item>
    <item>
      <title>Is Twitter biased against BIPOC? A fairness &amp; transparency evaluation of Twitter's black-box model </title>
      <dc:creator>Andreas Messalas</dc:creator>
      <pubDate>Tue, 13 Oct 2020 18:20:45 +0000</pubDate>
      <link>https://dev.to/code4thought/is-twitter-biased-against-bipoc-maybe-it-s-not-what-you-think-it-is-28hm</link>
      <guid>https://dev.to/code4thought/is-twitter-biased-against-bipoc-maybe-it-s-not-what-you-think-it-is-28hm</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;The controversy&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Twitter was in the &lt;a href="https://mashable.com/article/twitter-photo-preview-algorithmic-racial-bias/?europe=true"&gt;headlines&lt;/a&gt; recently for apparent racial bias in the photo preview of some tweets. More specifically, Twitter’s machine learning algorithm that selects which part of an image to show in a photo preview favors showing the faces of white people over black people. For example, the following &lt;a href="https://twitter.com/bascule/status/1307440596668182528?s=20"&gt;&lt;strong&gt;tweet&lt;/strong&gt;&lt;/a&gt; contains an image of Mitch McConnell (white male) and Barack Obama (black male) twice, but Twitter selects Mitch McConnell both times in the tweet’s preview photo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tbVr2utT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x2tva6adqig1qfhalo9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tbVr2utT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x2tva6adqig1qfhalo9g.png" alt="twitter-bias-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tweet went viral, with (currently) 81K retweets and almost 200K likes, and was covered in articles by &lt;a href="https://www.bbc.com/news/technology-54234822"&gt;BBC&lt;/a&gt; and &lt;a href="https://edition.cnn.com/2020/09/21/tech/twitter-racial-bias-preview/index.html"&gt;CNN&lt;/a&gt;. It also got the attention of many users, who posted different configurations of images with black and white people, trying to verify for themselves whether there is truly bias in Twitter’s preview selection model. Some even tried posting images with &lt;a href="https://twitter.com/ameliorate_d/status/1307576187942703104?s=20"&gt;white and black dogs&lt;/a&gt; as well as &lt;a href="https://twitter.com/_jsimonovski/status/1307542747197239296?s=20"&gt;cartoon characters&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X46V6zz8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/epl1ooy7wet0s0wpf50c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X46V6zz8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/epl1ooy7wet0s0wpf50c.png" alt="twitter-bias-2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Twitter’s official reply was: “&lt;em&gt;We tested for bias before shipping the model &amp;amp; didn’t find evidence of racial or gender bias in our testing. But it’s clear that we’ve got more analysis to do. We’ll continue to share what we learn, what actions we take, &amp;amp; will open source it so others can review and replicate.&lt;/em&gt;”&lt;/p&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;Fairness evaluation&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;In &lt;a href="https://www.code4thought.eu"&gt;&lt;strong&gt;Code4Thought&lt;/strong&gt;&lt;/a&gt;, we are deeply concerned with bias and discrimination in algorithmic systems, especially when these systems can crucially affect real people. So we did our own testing with our fairness and transparency service, Pythia, and investigated whether Twitter’s model is truly bias-free, as Twitter’s official reply suggested.&lt;/p&gt;

&lt;p&gt;To take a more systematic approach, we used a specialized dataset containing face images of different racial groups, balanced across all groups. To keep things simple, we used only black and white adult males for our experiments and constructed a new dataset of combined photos (collages), each with a black adult male at the top, a white adult male at the bottom and a white background between them. Our new dataset contained 4,009 pictures of black and white adult males, which we uploaded to a Twitter account called &lt;a href="https://twitter.com/bias_tester"&gt;&lt;strong&gt;@bias_tester&lt;/strong&gt;&lt;/a&gt;. Finally, we manually labeled the preview photo of each tweet as ‘Black male’ if Twitter’s underlying model selected the black male; otherwise we labeled it ‘White male’.&lt;/p&gt;

&lt;p&gt;Using Code4Thought’s Pythia service on the new labeled dataset containing the 4,009 collage photos and the preview label (‘Black male’, ‘White male’), we examined Twitter’s model for fairness and transparency.&lt;/p&gt;

&lt;p&gt;The metric we chose to measure fairness is the &lt;a href="https://en.wikipedia.org/wiki/Disparate_impact"&gt;Disparate Impact&lt;/a&gt; Ratio (DIR), which measures how differently the model behaves across different groups of people; in our case, black and white adult males. More specifically, it is the ratio of the proportions of individuals that receive a positive outcome in the two groups: DIR = P(positive outcome | protected group) / P(positive outcome | reference group).&lt;/p&gt;

&lt;p&gt;If there is a great disparity in the model’s outcomes between the two groups, then we can claim that there might be bias in the model. According to the “four-fifths rule” of the U.S. Equal Employment Opportunity Commission (EEOC), any value of DIR below 80% can be regarded as evidence of adverse impact. Since DIR is a ratio and either group may end up in the numerator, we consider an acceptable range of DIR from 80% to 120%.&lt;/p&gt;
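
&lt;p&gt;To make the computation concrete, here is a minimal Python sketch of the DIR check described above. The group counts are illustrative values consistent with the totals we report below (4,009 collages, total DIR of 0.61), not our exact measurements.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Disparate Impact Ratio: selection rate of the protected group divided
# by the selection rate of the reference group.
def disparate_impact_ratio(labels, protected="Black male", reference="White male"):
    # Each collage shows one member of each group, so a group's selection
    # rate is simply its share of the preview labels.
    p_protected = labels.count(protected) / len(labels)
    p_reference = labels.count(reference) / len(labels)
    return p_protected / p_reference

# Illustrative split: 2,490 previews showing the white male, 1,519 the black male.
labels = ["White male"] * 2490 + ["Black male"] * 1519
dir_value = disparate_impact_ratio(labels)
print(f"DIR = {dir_value:.2f}")                           # DIR = 0.61
print("compliant:", 0.8 &amp;lt;= dir_value &amp;lt;= 1.2)      # four-fifths band: False
&lt;/code&gt;&lt;/pre&gt;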

&lt;p&gt;Twitter allows its users to upload only a certain number of tweets per day, so we sent batches of 300 collage tweets every 3 hours. After each batch was uploaded, we manually counted the number of black and white males in Twitter’s preview photos and sent this data to our Pythia platform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZyURo-YR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wffutas1y7m85ten9nd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZyURo-YR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wffutas1y7m85ten9nd1.png" alt="Pythia's Fairness platform"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After all batches were sent, the total DIR was 0.61&lt;/strong&gt;, which is well below the accepted threshold. This analysis suggests that Twitter’s preview photo selection model is more likely to choose a white male than a black male. We can observe that, while some batches of data (blue line) were compliant, the total DIR (orange line) was consistently non-compliant, which is an indication of bias against black males.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Using explanations to verify bias&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;We would like to get a sense of how the underlying model “thinks” and try to understand its decision process, in order to find an explanation for the bias discovered in the fairness evaluation. We modified our existing dataset by using the individual images of the same black and white adult males from the collages, labeling each with 1 if the photo was selected in the preview and with 0 otherwise. We used Pythia’s model-agnostic explainer, which utilizes surrogate models (i.e. models that try to mimic the original model) and Shapley values, in order to explain the predictions of Twitter’s preview selection model even without having direct access to it (more details about our method can be found in the corresponding &lt;a href="https://ieeexplore.ieee.org/abstract/document/8900669"&gt;paper&lt;/a&gt;).&lt;/p&gt;
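
&lt;p&gt;For readers who want a feel for the mechanics, here is a schematic Python sketch of the surrogate idea, not Pythia’s actual implementation; the file names, the choice of a random forest as surrogate and the pixel-level features are our own assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical inputs: grayscale photos and the observed 0/1 labels
# (1 if the photo was selected for the preview).
imgs = np.load("photos.npy")        # assumed shape (n, h, w)
X = imgs.reshape(len(imgs), -1)     # flatten each image into a feature vector
y = np.load("selected.npy")

# The surrogate mimics the observed behavior of the inaccessible model.
surrogate = RandomForestClassifier(n_estimators=100).fit(X, y)

def selection_score(flat_images):
    return surrogate.predict_proba(flat_images)[:, 1]

# Shapley values attribute the surrogate's score to individual pixels;
# positive values correspond to the green regions below, negative to the red.
explainer = shap.KernelExplainer(selection_score, X[:20])
shap_values = explainer.shap_values(X[:5], nsamples=500)
&lt;/code&gt;&lt;/pre&gt;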

&lt;p&gt;Clearly, the size of our dataset (8,018 pictures) is not enough to tell us with confidence how Twitter’s internal model makes its predictions; however, it can already give us some insights.&lt;/p&gt;

&lt;p&gt;Below are some examples of explanations from our dataset. The green and red pixels in the grayscale image demonstrate positive and negative contributions, respectively, towards selecting the image as the preview photo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F95CXjU5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/df2ty3fgq9wsiqphrzaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F95CXjU5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/df2ty3fgq9wsiqphrzaz.png" alt="Pythia's explanation"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---iTx6tzY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/aoz35jglvngorwjjzhdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---iTx6tzY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/aoz35jglvngorwjjzhdz.png" alt="Pythia's explanation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we do not know the objective of Twitter’s preview selection model (e.g. facial recognition), the explanations are difficult to interpret. At first glance, the model appears to focus on facial characteristics such as the eyes, nose and cheeks.&lt;/p&gt;

&lt;p&gt;Looking at the first two examples above, where the model selected the white male, we could say that the darker-toned pixels on the cheeks (whether due to black skin or to shade), as well as the eyes of the black males, contribute negatively, and we might conclude that the model is biased against darker pixels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NtfchkWs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rzalghnvbjofrqq18jxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NtfchkWs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rzalghnvbjofrqq18jxj.png" alt="Pythia's explanation"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2OVcLoSr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/r824b1u00thltd4yubhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2OVcLoSr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/r824b1u00thltd4yubhs.png" alt="Pythia's explanation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, we cannot confirm this assertion by looking at the last two examples, where the model selected the black male. Moreover, some explanations show that pixels in the background also contribute, which makes us wonder whether any facial characteristic plays an important role at all in the model’s selection process.&lt;/p&gt;

&lt;p&gt;This unclear behavior led us to analyze our data a bit further and look for attributes of the images that are not correlated with the facial features of the person depicted. So we measured the &lt;a href="https://en.wikipedia.org/wiki/Acutance"&gt;sharpness&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Contrast_%28vision%29"&gt;contrast&lt;/a&gt; of the photos and, using Pythia’s explainer again, checked whether either of them played an important role in the predictions. The following &lt;a href="https://chartio.com/learn/charts/what-is-a-scatter-plot/#:~:text=A%20scatter%20plot%20%28aka%20scatter,to%20observe%20relationships%20between%20variables."&gt;scatter plot&lt;/a&gt; shows how sharpness plays an important role in the model’s decision process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EfYmk7ZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3gq2xyow0vb3p3fgd5yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EfYmk7ZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3gq2xyow0vb3p3fgd5yc.png" alt="sharpness"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The vertical axis depicts the output of the (surrogate) model and the horizontal axis the sharpness of each image. We can observe that for less sharp photos (i.e. sharpness below 3) the contribution is negative (red), leading to a low model output score, while for photos with higher sharpness (i.e. above 5) the contribution is positive (green) and the output score is higher.&lt;/p&gt;
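
&lt;p&gt;The exact estimators are an implementation detail, but as an illustration, sharpness and contrast can be measured with a few lines of Python; the variance-of-Laplacian proxy for sharpness and RMS contrast below are our own assumptions, not necessarily the measures Pythia uses.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import cv2

def sharpness(path):
    """Variance of the Laplacian: higher values mean sharper edges."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def contrast(path):
    """RMS contrast: standard deviation of grayscale intensities."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return float(gray.std())
&lt;/code&gt;&lt;/pre&gt;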

&lt;p&gt;In fact, we measured the sharpness of the two pictures shown in the &lt;a href="https://twitter.com/bascule/status/1307440596668182528?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1307440596668182528%7Ctwgr%5Eshare_3&amp;amp;ref_url=https%3A%2F%2Fwww.code4thought.eu%2Findex.php%2F2020%2F10%2F02%2Ftwitter-bias%2F"&gt;tweet&lt;/a&gt; that started this whole debate, and we found that McConnell’s photo was sharper than Obama’s: the exact sharpness values were 5.65 and 3.54 respectively. As for the contrast of the photos, there was no strong correlation with the model’s predictions, so it did not play an important role in the model.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Bottom line: is there any discrimination in Twitter’s model?&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Our analysis of the derived explanations suggests that the model does not depend on any facial feature of the people in an image, but rather on overall image quality, and probably on the sharpness of the image. Again, this conclusion derives from analysing a small dataset, and we cannot confidently assert that Twitter’s model exhibits this behaviour overall.&lt;/p&gt;

&lt;p&gt;If we assume that our analysis holds for larger datasets, then there is an important question that must be answered: does the fact that Twitter’s model is not directly discriminatory, since it does not rely on racial or facial features, mean that there is no ethical issue and that this controversy should end?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At Code4Thought we firmly believe that any trace of unfairness towards specific groups of people in an algorithmic process should be addressed and resolved, even if the process does not utilise any directly discriminatory procedures. We look forward to Twitter’s further reactions and analysis on the issue in question, and we hope that we have made our small contribution to the public discourse on how we can make technology more fair and responsible.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>twitter</category>
      <category>bias</category>
      <category>machinelearning</category>
      <category>fairness</category>
    </item>
    <item>
      <title>Using explanations for finding bias in black-box models</title>
      <dc:creator>Andreas Messalas</dc:creator>
      <pubDate>Thu, 03 Oct 2019 16:47:45 +0000</pubDate>
      <link>https://dev.to/code4thought/using-explanations-for-finding-bias-in-black-box-models-5449</link>
      <guid>https://dev.to/code4thought/using-explanations-for-finding-bias-in-black-box-models-5449</guid>
      <description>&lt;p&gt;There is no doubt, that machine learning (ML) models are being used for solving several business and even social problems. Every year, ML algorithms are getting more accurate, more innovative and consequently, more applicable to a wider range of applications. From  &lt;a href="https://ai.googleblog.com/2018/04/an-augmented-reality-microscope.html" rel="noopener noreferrer"&gt;detecting cancer&lt;/a&gt; to  &lt;a href="https://www.independent.co.uk/news/business/news/jp-morgan-software-lawyers-coin-contract-intelligence-parsing-financial-deals-seconds-legal-working-a7603256.html" rel="noopener noreferrer"&gt;banking&lt;/a&gt;  and  &lt;a href="https://arxiv.org/pdf/1604.07316v1.pdf" rel="noopener noreferrer"&gt;self-driving cars&lt;/a&gt;, the list of ML applications is never ending.&lt;/p&gt;

&lt;p&gt;However, as the predictive accuracy of ML models gets better, their explainability is seemingly getting weaker. Their intricate and obscure inner structure forces us, more often than not, to treat them as “black-boxes”, that is, to accept their predictions on a no-questions-asked basis. Common “black-boxes” are Artificial Neural Networks (ANNs) and ensemble methods; however, even seemingly interpretable models, such as Decision Trees, can be rendered unexplainable when they grow very deep.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F57ncbsgky88p10tj03eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F57ncbsgky88p10tj03eh.png" alt="Black-box models"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Neural Networks, Ensemble methods and most of the recent ML models behave like black-boxes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The need to shed light into the obscurity of “black-box” models is evident: &lt;a href="https://www.oreilly.com/radar/how-will-the-gdpr-impact-machine-learning/" rel="noopener noreferrer"&gt;GDPR’s articles 15 and 22&lt;/a&gt; (2018), the &lt;a href="https://www.oecd.org/going-digital/ai/principles/" rel="noopener noreferrer"&gt;OECD AI principles&lt;/a&gt; (2019) and the US Senate’s &lt;a href="https://www.congress.gov/bill/116th-congress/house-bill/2231/all-info" rel="noopener noreferrer"&gt;Algorithmic Accountability Act&lt;/a&gt; bill (2019) are some examples which indicate that ML interpretability, along with ML accountability and fairness, has already become (or should become) an integral characteristic of any application that makes automated decisions.&lt;/p&gt;

&lt;p&gt;Since many organizations will be obliged to provide explanations for the decisions of their automated predictive models, there will be a serious need for third-party organizations to perform the interpretability tasks and audit those models on their behalf. This provides an additional level of integrity and objectivity to the whole audit process, as the explanations are provided by an external party. Moreover, not every organization (especially startups) has the resources to deal with interpretability issues, rendering third-party auditors necessary.&lt;/p&gt;

&lt;p&gt;However, in this setting intellectual property issues arise, since organizations will not want to disclose any information about the details of their models. Therefore, from the wide range of interpretability methods, the model-agnostic approaches (i.e. methods that are oblivious to the model’s details) are deemed appropriate for this purpose.&lt;/p&gt;

&lt;p&gt;Besides explaining the predictions of a black-box model, interpretability can also provide us with insight into erroneous behavior of our models, which may be caused by undesired patterns in our data. We will examine an example where interpretability helps us identify gender bias in our data, using a model-agnostic method that utilizes surrogate models and Shapley values.&lt;/p&gt;

&lt;p&gt;We use the “&lt;em&gt;Default of Credit Card Clients Dataset&lt;/em&gt;”, which contains information (demographic factors, credit data, history of payment, and bill statements) about 30,000 credit card clients in Taiwan from April 2005 to September 2005. The target of the models in our examples is to identify the defaulters (i.e. bank customers who will not pay the next installment of their credit card).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Gender biased data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The existence of biased datasets is not uncommon. Bias can be introduced by faulty preprocessing or even by collecting from a &lt;a href="https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2016.00960.x" rel="noopener noreferrer"&gt;poor data source&lt;/a&gt;, creating &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899" rel="noopener noreferrer"&gt;skewed and tainted samples&lt;/a&gt;. Examining the reasons behind a model’s predictions may inform us about possible bias in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fumf3ew2btp88o825adt8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fumf3ew2btp88o825adt8.PNG" alt="Dataset"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Defaulters based on Gender: The red and blue bars represent the original distributions of female and male customers, while the purple one depicts the new constructed biased distribution of male customers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the “&lt;em&gt;Default of Credit Card Clients Dataset&lt;/em&gt;”, 43% of the defaulters are male and 57% are female. This does not constitute a biased dataset, since the non-defaulters have a similar distribution (39% and 61% respectively).&lt;/p&gt;
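
&lt;p&gt;As a quick sanity check, these distributions can be reproduced with a few lines of pandas; the file and column names below assume the publicly available CSV version of the dataset (SEX: 1 = male, 2 = female).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("UCI_Credit_Card.csv")   # hypothetical local copy
target = "default.payment.next.month"     # 1 = defaulter

# Expect roughly 43%/57% male/female among defaulters
# and 39%/61% among non-defaulters.
for flag, name in [(1, "defaulters"), (0, "non-defaulters")]:
    share = df[df[target] == flag]["SEX"].value_counts(normalize=True)
    print(name, share.round(2).to_dict())
&lt;/code&gt;&lt;/pre&gt;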

&lt;p&gt;We distort the dataset by picking at random 957 male defaulters (i.e. one third of all male defaulters) and altering their label. This creates a new biased dataset with 34% / 66% male/female defaulters and 41% / 59% male/female non-defaulters respectively.&lt;br&gt;&lt;br&gt;
We then remove the gender feature from the dataset and take the predictions of a black-box model trained on this biased dataset (whose internal structure we are indifferent to). We then train a surrogate XGBoost model, from which we extract the Shapley values that help us explain the predictions of the original model. More precisely, we use the Shapley values to pinpoint the most important features and then use natural language to describe them in the explanations.&lt;/p&gt;
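
&lt;p&gt;Below is a minimal sketch of this setup; it assumes the publicly available CSV version of the dataset and its column names, and the “black-box” is itself an XGBoost model purely as a stand-in, since we are indifferent to its structure.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import pandas as pd
import shap
import xgboost as xgb

df = pd.read_csv("UCI_Credit_Card.csv")
target = "default.payment.next.month"

# Inject gender bias: flip the label of 957 randomly chosen male defaulters.
males = df[df["SEX"] == 1]
male_defaulters = males[males[target] == 1].index
rng = np.random.default_rng(42)
df.loc[rng.choice(male_defaulters, size=957, replace=False), target] = 0

# Remove the gender feature, then train the "black-box" on the biased data.
X = df.drop(columns=["ID", "SEX", target])
black_box = xgb.XGBClassifier().fit(X, df[target])  # stand-in for any model

# Train a surrogate on the black-box's predictions and extract Shapley values.
surrogate = xgb.XGBClassifier().fit(X, black_box.predict(X))
shap_values = shap.TreeExplainer(surrogate).shap_values(X)
&lt;/code&gt;&lt;/pre&gt;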

&lt;p&gt;We examine the explanations for a false negative prediction (i.e. falsely predicted as non-defaulter) of a male customer and a false positive prediction (i.e. falsely predicted as defaulter) of a female customer. They are both single university graduates with similar credit limits. However, the male customer delayed the last 4 payments, while the female delayed only the final one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;&lt;/th&gt;
    &lt;th&gt;&lt;/th&gt;
    &lt;th&gt;&lt;/th&gt;
    &lt;th&gt;&lt;/th&gt;
    &lt;th colspan="6"&gt;
&lt;span&gt;PAYMENT STATUS&lt;/span&gt;&lt;br&gt;&lt;span&gt; (months  delayed)&lt;/span&gt;
&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;ID&lt;/td&gt;
    &lt;td&gt;CREDIT LIMIT&lt;/td&gt;
    &lt;td&gt;EDUCATION&lt;/td&gt;
    &lt;td&gt;MARRIAGE&lt;/td&gt;
    &lt;td&gt;SEPT.&lt;/td&gt;
    &lt;td&gt;AUG.&lt;/td&gt;
    &lt;td&gt;JULY&lt;/td&gt;
    &lt;td&gt;JUNE&lt;/td&gt;
    &lt;td&gt;MAY&lt;/td&gt;
    &lt;td&gt;APRIL&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;26664&lt;br&gt;(male)&lt;/td&gt;
    &lt;td&gt;80000&lt;/td&gt;
    &lt;td&gt;university&lt;/td&gt;
    &lt;td&gt;single&lt;/td&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;paid duly&lt;/td&gt;
    &lt;td&gt;U.R.C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;599&lt;br&gt;(female)&lt;/td&gt;
    &lt;td&gt;60000&lt;/td&gt;
    &lt;td&gt;university&lt;/td&gt;
    &lt;td&gt;single&lt;/td&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;U.R.C&lt;/td&gt;
    &lt;td&gt;U.R.C&lt;/td&gt;
    &lt;td&gt;U.R.C&lt;/td&gt;
    &lt;td&gt;U.R.C&lt;/td&gt;
    &lt;td&gt;U.R.C&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;center&gt;&lt;em&gt;U.R.C: Use of Revolving Credit&lt;/em&gt;&lt;/center&gt;

&lt;p&gt;For the male customer, as the explanation below indicates, the delay of the September payment had a negative impact of 33% (i.e. contributing towards ‘Default’). Counterintuitively, however, the delay of the August payment had a positive impact.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fa6h311sl21pqb608d955.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fa6h311sl21pqb608d955.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the female customer, the 2-month delay in September also contributed negatively, at 47%: a much greater percentage than the 33% of the male customer’s 5-month delay.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feymdslnl0pdk93x6cvm2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feymdslnl0pdk93x6cvm2.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Although the gender feature was not included in the training of the model, we observed with the help of the explanations that the gender bias was encoded into other features (e.g. a positive contribution for a delay of payment for the male customer). Moreover, by looking at the impact percentages in the explanations, we detected harsher treatment of the female customer by the model (e.g. a greater negative impact for a shorter delay of payment). This strange behavior should alarm us and motivate us to obtain a better sample of the defaulters.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Takeaway&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In cases where the dataset concerns real people, it is important to ensure that the model does not discriminate against one group in favor of others. Explanations help us detect bias, even when it is hidden in other features, pinpoint unintended decision patterns of our black-box model and motivate us to fix our data.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>interpretability</category>
      <category>fairness</category>
      <category>bias</category>
    </item>
  </channel>
</rss>
