Murad Mazitov

Human feedback in ML or How to collect data for your own ChatGPT

In this blog post, I will discuss collecting data for machine learning using human labelers, focusing on the practical side of setting up new data-labeling processes.

We will cover the following topics:

  • When to use outstaffed labelers versus an in-house assessment team
  • Strategies for motivating labelers through monetary and non-monetary incentives
  • Tips for creating clear and effective instructions
  • Quality control
  • How to structure a machine learning project that involves human feedback tasks
  • The types of monitoring you may need throughout the project

DATA > MODELS

In recent years, companies have been investing more and more in data collection. The reason is that machine learning models, especially text models, have improved significantly, so there is less need to invest heavily in feature engineering, experiments with network architectures, and so on: the existing models are already highly capable. The bottleneck for the quality of the final machine learning model is now the data and the targets.

Let's take a look at ChatGPT. One of its core ideas was "reinforcement learning from human feedback," which heavily relied on human labelers. Therefore, the quality of the data was crucial. Working with human labelers was one of the technical challenges in creating such an innovative product.

"We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format."

Introducing ChatGPT

Labelers

There is another trend: as models improve, most tasks can now be solved by fine-tuning large pre-trained models. This means that the amount of data required for fine-tuning is decreasing. Rather than needing large pools of average-quality data, you now need only a few thousand high-quality samples, with as little noise and inconsistency as possible.
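As a rough illustration, a fine-tuning setup on a few thousand labeled texts can be quite small. The sketch below is a minimal example assuming the Hugging Face transformers library; the model name, hyperparameters, and the `train_texts` / `train_labels` variables are placeholders, not a specific recipe.

```python
# A minimal fine-tuning sketch (assumptions: a few thousand labeled text
# samples in train_texts / train_labels; the model name and hyperparameters
# below are placeholders).
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class LabeledTextDataset(Dataset):
    """Wraps tokenized texts and integer labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# train_texts: list[str], train_labels: list[int] -- your high-quality labeled data.
train_dataset = LabeledTextDataset(train_texts, train_labels, tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_dataset,
)
trainer.train()
```

With a setup this small, the labels themselves are the main thing that determines the final quality, which is exactly why the rest of this post is about the labelers.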

There are many crowdsourcing marketplaces, such as Mechanical Turk and Toloka.ai, that can connect you with labelers. However, if your machine learning task is not trivial and requires a bit of research, you should be cautious when using crowdsourcing marketplaces. While they can provide quality control, exams, honeypot tasks, etc., most of the workers on these platforms will be random people. If a high-leverage ML model's quality depends on the quality of your data, you should invest in data collection.

The main idea that I am trying to promote is that you should consider the labeler as a part of your team. Such an attitude towards labelers can bring great value to your product. In the following paragraphs, I will describe how to achieve this.

Motivation

Motivation is a key factor in improving the quality of your data. There are two ways to motivate workers: monetarily and non-monetarily.

Non-monetary motivation is the easiest way to improve the quality of your data. If your labelers are motivated only by money, they will not invest too much effort into quality.

Basic principles:

  1. Labelers should understand the importance of the task they are performing.
  2. Labelers should be aware that quality is always the top priority.
  3. Labelers should feel included in product development.

Here are some tips to motivate your in-house assessment team:

  • Ensure that labelers understand that the results of their work can impact the real world. For example, if they are labeling search ads, they should know that the user search experience and the business efficiency of advertisements depend on the accuracy of their work.
  • Schedule a chat with your machine learning team and labelers, where the latter can ask questions.
  • Answer labelers' questions and do not ignore them.
  • Always consider the proposals made by labelers, as they can sometimes provide great ideas on how to improve the product or instructions. Make sure to acknowledge that you have seen their proposals.
  • Provide personal feedback to labelers when possible.

Here are some ways to demotivate labelers:

  • Ignoring them in chat
  • Assigning them low-quality tasks to assess (be mindful of what you are sending to labelers)
  • Including very similar tasks in a pool
  • Providing poorly made instructions for labelers (even from a design perspective)
  • Failing to explain how the labeled data will be used in the future
  • Writing unclear or ambiguous instructions

Tips for Creating Instructions

  • Clearly define why the task is important.
  • Use clean formatting and design.
  • KISS: Keep it simple, stupid. The instructions should be concise and limited to one page. Keep the language clear and easy to understand. Break up long paragraphs into shorter sections or bullet points for easier reading.
  • Include a few examples for each rule in the instructions.
  • If you have a classification problem, try to limit the number of possible answer options to fewer than seven.
  • Include an entry exam that covers all rules, but keep it brief. The goal is to ensure that every labeler is familiar with the problem.
  • Before launching the instructions to labelers, ask a friend or family member to label around a dozen samples based on the instructions and provide feedback.

Quality control

Quality control is a process designed to measure the quality of active labelers. It can be used to motivate them monetarily by paying more to workers with higher accuracy, to give labelers feedback, or to filter the final pool by each worker's average quality.

One of the most widespread quality-control methods is to mix tasks with known, high-confidence answers into the pool of new tasks on the platform. This is known as the "golden set," "benchmark tasks," or "honeypots" (a minimal sketch of this follows the list of principles below).

To make this effective, follow these principles:

  1. Make golden tasks indistinguishable from other tasks by drawing them from the same distribution as the rest of the pool.
  2. Sprinkle golden tasks uniformly throughout the labeling process.
  3. Be aware that if the pool of tasks is very specific, golden tasks may be easy to recognize, which can artificially inflate a labeler's measured quality.
  4. Select the best labelers and delegate tasks to them, such as creating honeypots, providing feedback, and answering questions in chat.
  5. Schedule regular meetings where the team can review samples of labeled data and discuss any patches to the instructions or propose new clusters of data to label, with the aim of improving the labeling instructions.
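Here is a minimal sketch of how golden tasks might be mixed uniformly into a pool and how labelers could be scored against the known answers. The task and answer structures (task_id, labeler_id, plain Python lists and dicts) are assumptions for illustration, not any particular platform's API.

```python
# Minimal sketch: sprinkle golden (honeypot) tasks uniformly through a pool
# of new tasks and score labelers against the known answers.
import random
from collections import defaultdict

def mix_golden_tasks(new_tasks, golden_tasks, golden_share=0.1):
    """Insert a random sample of golden tasks at uniformly random positions."""
    n_golden = min(len(golden_tasks), int(len(new_tasks) * golden_share))
    pool = list(new_tasks)
    for task in random.sample(golden_tasks, n_golden):
        pool.insert(random.randrange(len(pool) + 1), task)
    return pool

def labeler_accuracy(answers, golden_answers):
    """answers: iterable of (labeler_id, task_id, label) tuples;
    golden_answers: dict mapping task_id to the known correct label."""
    correct, total = defaultdict(int), defaultdict(int)
    for labeler_id, task_id, label in answers:
        if task_id in golden_answers:
            total[labeler_id] += 1
            correct[labeler_id] += int(label == golden_answers[task_id])
    return {lid: correct[lid] / total[lid] for lid in total}

# Example use: keep only labels from workers with at least 80% golden accuracy.
# scores = labeler_accuracy(collected_answers, golden_answers)
# trusted = {lid for lid, acc in scores.items() if acc >= 0.8}
```

The share of golden tasks and the accuracy threshold are knobs you tune against your budget and the difficulty of the task.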

Other tips for a successful project

Some useful points for structuring a project:

  1. Ensure that everyone on the team is familiar with the instructions for labelers, and understands what the labeling task entails.
  2. Before launching the instructions, the entire team should be calibrated on dozens of examples. This calibration process should take less than five hours of meetings, but the discussions that it triggers can bring significant value to your product.

What to monitor

Now that you have a production labeling process, it's important to monitor it. Here are some essential monitoring tasks:

  1. Track the number of finished tasks each day.
  2. Monitor the number of active labelers on the project.
  3. Keep track of the number of "experienced" labelers on the project.
  4. Measure the average quality on control (golden set) tasks.
  5. Monitor daily spending.
  6. Analyze the distribution of labels.
  7. Calculate the inter-labeler agreement rate (see the sketch after this list).
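As an illustration, a daily monitoring job over an answer log could look like the sketch below. The column names (date, labeler_id, task_id, label, is_golden, is_correct, cost) and the pandas-based approach are assumptions, and the agreement metric is simple pairwise agreement rather than a chance-corrected statistic such as Cohen's kappa.

```python
# Minimal monitoring sketch over an answer log with (hypothetical) columns:
# date, labeler_id, task_id, label, is_golden, is_correct, cost.
from itertools import combinations
import pandas as pd

def daily_report(df: pd.DataFrame) -> pd.DataFrame:
    """Finished tasks, active labelers, golden-set accuracy, and spend per day."""
    golden = df[df["is_golden"]]
    return pd.DataFrame({
        "finished_tasks": df.groupby("date")["task_id"].nunique(),
        "active_labelers": df.groupby("date")["labeler_id"].nunique(),
        "golden_accuracy": golden.groupby("date")["is_correct"].mean(),
        "daily_spend": df.groupby("date")["cost"].sum(),
    })

def label_distribution(df: pd.DataFrame) -> pd.Series:
    """Share of each label value over all submitted answers."""
    return df["label"].value_counts(normalize=True)

def pairwise_agreement(df: pd.DataFrame) -> float:
    """Fraction of answer pairs on the same task that agree (not chance-corrected)."""
    agree, total = 0, 0
    for _, group in df.groupby("task_id"):
        for a, b in combinations(group["label"].tolist(), 2):
            total += 1
            agree += int(a == b)
    return agree / total if total else float("nan")
```

A sudden shift in the label distribution or a drop in agreement is usually the first sign that either the incoming data or the instructions have changed.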

Useful links

How to Label 1M Data Points/Week

Introducing ChatGPT by OpenAI

Why is ChatGPT so good?

More than fun and money. Worker Motivation in Crowdsourcing--A Study on Mechanical Turk

Good luck with your ML task!
