Evgenii Eltyshev

Strategies for collecting ground truth for ML models at scale

Alongside clean and powerful features, ground truth labels play a significant role in an ML model's performance. Quite often, improving the quality of data labels boosts model performance more than sophisticated training methods or complex model architectures do. In this article we will look into strategies for ground truth data generation and their pros and cons.

How do you know that the ground truth you generated is good?

Before talking about how to generate quality ground truth data, we need to define what "quality" means here.

The first quality indicator is the accuracy of the labels themselves. When we train an ML model, we usually approximate some implicit function, and the ground truth represents the values of that function at certain points. But the labeling process itself can introduce errors: for example, if a rater lacks proper training or gets confused by an example, they may assign a wrong label.

The second measure of ground truth quality is consistency. If an example is labeled several times by different people, or by the same person at different points in time, consistency measures how much the resulting labels differ. A good metric for this is Cohen's kappa coefficient.
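As a quick illustration, here is a minimal sketch of measuring agreement between two raters with Cohen's kappa via scikit-learn; the rater names and labels below are made up for the example.

```python
# Minimal sketch: label consistency between two raters via Cohen's kappa.
# The label lists are hypothetical illustration data.
from sklearn.metrics import cohen_kappa_score

rater_a = ["hate", "ok", "ok", "hate", "ok", "hate"]
rater_b = ["hate", "ok", "hate", "hate", "ok", "ok"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```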

The last indicator of how good ground truth is is the cost of labeling each example. Increasing the training data size commonly leads to better model performance, because the algorithm has more examples to learn from. So if the labeling method we use is cheap or even free, we can generate more labeled data and achieve better model quality that way.

Automatic labeling

One of the best strategies for generating ground truth data is to create it automatically. For example, if you train a model that colorizes black-and-white photos, you can generate labeled examples by simply taking a color photograph and applying a black-and-white filter to it. This way, if you have a set of color photographs, labeling them is essentially free. This showcases one of the biggest advantages of automated labeling: it's very cheap and has high throughput.
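Here is a minimal sketch of that colorization example using Pillow; the directory names are hypothetical, and any color photo you already have turns into an (input, label) pair for free.

```python
# Minimal sketch: automatic label generation for a colorization model.
# The color photo is the label, its grayscale version is the model input.
from pathlib import Path
from PIL import Image

def make_colorization_pair(photo_path: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    color = Image.open(photo_path).convert("RGB")   # ground truth label
    gray = color.convert("L")                       # model input
    gray.save(out_dir / f"{photo_path.stem}_input.png")
    color.save(out_dir / f"{photo_path.stem}_label.png")

# Hypothetical folders: every color photo in photos/ becomes a labeled example.
for path in Path("photos").glob("*.jpg"):
    make_colorization_pair(path, Path("dataset"))
```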

The downside of this approach is that it's rarely possible to label examples accurately in an automatic manner. After all, if you already have an algorithm for producing a label, why do you need a model? For example, while it's easy to produce black-and-white photos, it's very hard to create a dataset for detecting hate speech in online messages. There is no deterministic algorithm for generating a hateful or a friendly message. Of course, we could come up with an algorithm that takes a swear-word vocabulary, mixes it with some innocent words and generates a random message, but the quality of such examples would be pretty low: the model would only learn to classify random examples, not real-world data. So the big downside is that often there is no algorithm for automatic ground truth creation that produces quality data. On our scorecard, this approach gets the following grades:

Accuracy: depends on the problem
Consistency: ★★★★★ (if the algorithm is deterministic)
Cost: ★★★★★ (basically free)

User actions

The middle ground between generating ground truth automatically and having a dedicated human look at each example is leveraging the audience of your product itself! If you are applying machine learning to a consumer product with millions or billions of users, you already have access to a lot of useful signal about user behavior. Ground truth can be generated by analyzing user action logs and extracting the desired labels from them. For example, if you want to collect data for predicting whether a user would click on a given ad, you can extract events where a user actually clicked on an ad and treat them as positive examples, and events where a user looked at an ad but didn't click as negative examples. This way, with millions or billions of users performing billions of actions every day, you have essentially unlimited ground truth data and can train colossal models.
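A minimal sketch of that ad-click extraction with pandas, assuming a hypothetical log export with columns user_id, ad_id, and event ("impression" or "click"): every click becomes a positive example, every impression without a matching click a negative one.

```python
# Minimal sketch: deriving click-prediction labels from a user action log.
# File name and column names are hypothetical.
import pandas as pd

events = pd.read_csv("ad_events.csv")

# Build a lookup of (user, ad) pairs that were actually clicked.
click_rows = events[events.event == "click"]
clicks = set(zip(click_rows.user_id, click_rows.ad_id))

# One row per (user, ad) impression, labeled 1 if a click was logged, else 0.
impressions = (events[events.event == "impression"]
               .drop_duplicates(["user_id", "ad_id"])
               .copy())
impressions["label"] = [
    1 if (u, a) in clicks else 0
    for u, a in zip(impressions.user_id, impressions.ad_id)
]
impressions.to_csv("ad_click_ground_truth.csv", index=False)
```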

The downsides of this approach are the same as with the previous strategy: it works for some problems but may produce poor quality data for others. For example, if you collect ground truth for predicting whether a particular online message is hate speech, one promising hypothesis would be to leverage user reports: whenever a user reports a message as hate speech, we take note of that and later use it for training the model. But this approach would only work if all users shared the same understanding of what hate speech is and were willing to report every single message they consider hateful. Unfortunately, this is rarely the case: different users have different notions of what's hateful and what's not, and very few of them bother to report messages to the platform. So for this problem we would get very inconsistent and incomplete ground truth.

Another issue with this approach is more minor, but it affects a wider set of problems. Even if you can get high-quality labeled data from user actions, over time it will bias predictions via a feedback loop: if we train a model on the ads users click and then use this model to decide which ads to show them, the model will lean towards its own decisions more and more with each retraining iteration. There are techniques to counter this, but they fall outside the scope of this article.

Let’s grade this approach:

Accuracy: depends on the problem
Consistency: depends on the problem
Cost: ★★★★★ (basically free)

Human in the loop

If none of the automated or semi-automated approaches work, then the only option left is to ask a human to look at an example and decide which label it should get. Of course, human time is not free, and this strategy has multiple variations balancing cost against accuracy and consistency.

The easiest way to have humans label examples is to ask the model developers themselves to go through a bunch of examples. This way you don't need to train any additional people, because they already know what the problem is. The issue with this approach is that it's not very consistent (different developers can have different opinions; it's better than asking different users, but still carries bias) and it's way too expensive: an hour of work of a highly skilled ML developer or researcher usually costs many times the average contractor hourly rate. Model developers as raters get:
Accuracy: ★★★★★
Consistency: ★★☆☆☆
Cost: ★☆☆☆☆

The second easiest approach is crowdsourcing: there are many platforms offering to offload data labeling to their users. To start getting ground truth, you usually need to create a set of labeling guidelines with some examples, as well as a golden set: a number of perfectly labeled examples that serve as a reference when evaluating rater performance (a small sketch of that evaluation follows the grades below). Here you can again balance cost against accuracy by controlling how much training each rater receives, whether you get a persistent pool of people or a flexible one, and so on. Because this strategy forces you to write explicit guidelines on how to label your data, it leads to better consistency between raters. The downside is that you still have little control over rater training and evaluation, and you are not guaranteed a stable pool of people working on your problem. This approach gets:
Accuracy: ★★★★★
Consistency: ★★★☆☆
Cost: ★★☆☆☆
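Here is a minimal sketch of scoring crowd raters against a golden set; the example IDs, labels, and rater names are hypothetical.

```python
# Minimal sketch: per-rater agreement with a golden set.
# golden maps example ids to reference labels; ratings are (rater, example, label).
from collections import defaultdict

golden = {"ex1": "hate", "ex2": "ok", "ex3": "ok"}
ratings = [
    ("rater_a", "ex1", "hate"), ("rater_a", "ex2", "ok"),
    ("rater_b", "ex1", "ok"),   ("rater_b", "ex3", "ok"),
]

hits, totals = defaultdict(int), defaultdict(int)
for rater, example, label in ratings:
    if example in golden:                      # only score golden examples
        totals[rater] += 1
        hits[rater] += int(label == golden[example])

for rater in totals:
    print(f"{rater}: {hits[rater] / totals[rater]:.0%} agreement with golden set")
```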

When the problem grows in scale and importance, companies usually switch to hiring a dedicated pool of human raters via a vendor. This is an improvement over crowdsourcing: you get more control over rater training and evaluation, as well as more predictable throughput. On the other hand, this approach is usually more expensive per rater-hour and requires a bigger financial commitment: while with crowdsourcing you can scale the number of raters up and down, with a dedicated pool you usually commit to a particular rater count and pay for it regardless of whether you keep them busy. A dedicated pool of contractor raters scores the following:
Accuracy: ★★★★★
Consistency: ★★★★☆
Cost: ★★★☆☆

And the most expensive and most powerful option is to hire a pool of highly trained raters directly as full-time employees of your company. Here you can better integrate them into the company structure and invest more into their training, and hence into the quality of labeling. The downside of this approach is cost: each labeled example becomes more expensive than with the previous approaches. Finally, the FTE rater pool gets:
Accuracy: ★★★★★
Consistency: ★★★★★
Cost: ★★★★☆

When it comes to human-in-the-loop approaches, companies usually settle on a hybrid. For example, they could have a small number of highly trained full-time employees who create guidelines and policies for contractor raters and build the golden sets used to measure their performance, while the bulk of the work is outsourced to a vendor rater pool. Another way of improving the accuracy and consistency of labels is to have multiple less-trained raters rate the same example and, if they disagree, escalate it to a highly trained rater for the final verdict, as in the sketch below.
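A minimal sketch of that escalation flow: accept the crowd verdict when raters are unanimous, otherwise hand the example to an expert. The resolve_with_expert callback is a hypothetical placeholder for that expert review step.

```python
# Minimal sketch: escalate an example to an expert rater on disagreement.
from collections import Counter
from typing import Callable

def final_label(crowd_labels: list[str],
                resolve_with_expert: Callable[[], str]) -> str:
    counts = Counter(crowd_labels)
    label, votes = counts.most_common(1)[0]
    if votes == len(crowd_labels):   # unanimous: accept the crowd verdict
        return label
    return resolve_with_expert()     # disagreement: escalate to an expert

# Example: two of three raters disagree, so the expert decides.
print(final_label(["hate", "ok", "ok"], resolve_with_expert=lambda: "ok"))
```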

Summary

In the end, there is no silver bullet for creating ground truth for your ML model: you need to balance cost against quality and see whether there are any shortcuts you can take for your particular problem.
