Ivan Strokov

Typical mistakes when labeling text using LLMs

Large language models (LLMs) are great for text labeling. I was lucky enough to start using GPT for this type of task in mid-2022, not long after OpenAI released the breakthrough davinci-002 model. I have curated a number of similar tasks since then, and that experience taught me how to do it the right way.

Before we start, I want to emphasize that we are mostly going to discuss the process around labeling; prompt engineering is a great topic for a separate article.

Mistakes

Relying on visual validation of the LLM outputs 👎

It's tempting to assume that LLMs are good enough to understand your instructions, validate the model output on just a few inputs, and declare the result good. But how good is it? Are you sure the accuracy is sufficient for the downstream task? Were the tested examples representative of the inference dataset?

We can't answer these questions from only a few tested examples, so we definitely need to label a dataset manually, both to compare the quality of different prompts and to be confident in the inference quality. Don't overthink it: label a small dataset of 50-100 entities to start with. You also need to define the metric you will use to measure labeling quality.
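For example, if the labels are categorical, plain accuracy and macro F1 against the manual labels usually work well. Here is a minimal sketch with scikit-learn; the labels below are made-up placeholders for your own data:

```python
# A minimal sketch of measuring labeling quality against a manually labeled sample.
# manual_labels and llm_labels are illustrative placeholders for your own data.
from sklearn.metrics import accuracy_score, classification_report, f1_score

manual_labels = ["positive", "negative", "positive", "neutral", "positive"]
llm_labels    = ["positive", "negative", "neutral",  "neutral", "positive"]

print("Accuracy:", accuracy_score(manual_labels, llm_labels))
print("Macro F1:", f1_score(manual_labels, llm_labels, average="macro"))
print(classification_report(manual_labels, llm_labels, zero_division=0))
```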

Bonus: the topic distribution across documents is often skewed, so a random sample can miss important topics. Stratify the sampling to even out the topic distribution: run topic modeling and pick an equal number of samples from each topic.
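Here is a rough sketch of such stratified sampling with scikit-learn's LDA; `documents`, the number of topics, and the per-topic sample size are placeholders to replace with your own values:

```python
# A minimal sketch: assign a topic to each document with LDA, then take an
# equal-sized sample from every topic. Replace `documents`, `n_topics`, and
# `per_topic` with values that fit your corpus.
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [...]  # your raw documents go here
n_topics = 10
per_topic = 10

vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
doc_topic = lda.fit_transform(doc_term_matrix)

df = pd.DataFrame({"text": documents, "topic": doc_topic.argmax(axis=1)})

# Sample (up to) per_topic documents from each topic for manual labeling
sample = (
    df.groupby("topic", group_keys=False)
      .apply(lambda g: g.sample(min(per_topic, len(g)), random_state=42))
)
```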

Lacking alignment on corner cases 🙈

Labeling may seem simple and obvious at first, but once you start comparing your labels with others', it often turns out that your interpretation of the task differs from your colleague's.

Let's say we have an Amazon review saying "I definitely recommend this product". Does it mean that the product is high quality? Or good value for money? I personally would answer "no" to both questions, but the correct answer depends on the downstream use case.

To identify such corner cases, ask several people to label the same dataset and dig into everyone's reasoning behind the entities that were labeled differently. After that, the person with the most context on the downstream data usage (usually a product manager) should update the labeling instruction to cover these cases.

The labeling → comparing labels → improving the instruction loop can be repeated once or twice more until different people's judgments converge.
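To make the comparison concrete, it helps to compute an agreement score and dump the disagreements for discussion. A minimal sketch using Cohen's kappa from scikit-learn; the reviews and labels below are made up for illustration:

```python
# A minimal sketch: measure agreement between two annotators and list the
# entities they labeled differently. The data below is made up for illustration.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.DataFrame({
    "text": [
        "I definitely recommend this product",
        "Broke after a week, do not buy",
        "Okay for the price",
    ],
    "annotator_a": ["positive", "negative", "neutral"],
    "annotator_b": ["neutral",  "negative", "positive"],
})

print("Cohen's kappa:", cohen_kappa_score(df["annotator_a"], df["annotator_b"]))

# The disagreements are the corner cases to discuss and add to the instruction
print(df[df["annotator_a"] != df["annotator_b"]])
```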

Not measuring quality on a hold-out dataset ❌

Let's say we came up with the perfect labeling instruction, engineered an awesome LLM prompt, and got great evaluation metric values. What could possibly go wrong? Anyone who has ever trained an ML model would instantly answer: OVERFITTING!

Improving metrics on the "training" dataset we used for prompt engineering does not necessarily mean we improved them on unseen data. This is especially true for small prompt tweaks (like shuffling parts of the prompt or adding punctuation): such improvements often don't carry over to a "hold-out" dataset.

Here is the approach I use to overcome this issue (a code sketch follows the list):

  • Label a training dataset of N entities that seems representative (based on the corner-case discussion);
  • Label another dataset of the same size to be used as a hold-out dataset;
  • Label a third dataset of the same size to be used as a validation dataset for the first iteration of the algorithm;
  • Do prompt engineering until you reach the expected quality on the training dataset;
  • Measure quality on the validation dataset:
    • If the quality is lower than on the training dataset:
      • Add the validation dataset to the training dataset and use the result as the training dataset for the next iteration;
      • Label N more entities and use them as the validation dataset for the next iteration;
      • Repeat the steps starting from prompt engineering;
  • Measure quality on the hold-out dataset; these are the numbers to report.
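Expressed as code, the loop looks roughly like this. `engineer_prompt`, `evaluate`, and `label_more` are hypothetical stand-ins for manual prompt engineering, metric computation, and labeling N more entities, so treat it as pseudocode in Python form:

```python
# A minimal sketch of the loop above. engineer_prompt(train), evaluate(prompt,
# dataset), and label_more(n) are hypothetical stand-ins for manual prompt
# engineering, metric computation, and labeling n more entities by hand.
def tune_prompt(train, validation, holdout,
                engineer_prompt, evaluate, label_more,
                n=100, tolerance=0.02, max_iterations=5):
    for _ in range(max_iterations):
        # Prompt engineering only ever looks at the training dataset
        prompt = engineer_prompt(train)
        train_score = evaluate(prompt, train)
        val_score = evaluate(prompt, validation)

        # Validation score is close enough to the training score: stop
        if val_score >= train_score - tolerance:
            break

        # Otherwise the prompt has overfitted the training data: fold the
        # validation set into training and label a fresh validation set
        train = train + validation
        validation = label_more(n)

    # The hold-out dataset is evaluated exactly once; report these numbers
    return prompt, evaluate(prompt, holdout)
```

The key point is that the hold-out dataset is only touched once, after the loop has finished.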

Labeling flow

We have discussed what not to do; now let's summarize the right way to label text with LLMs:

  1. Write down the first version of the labeling instruction;
  2. Do topic modeling to assign a topic to each document;
  3. Sample N documents stratified by topic (and other parameters if applicable: length, date, author, …) and get labels for this dataset from multiple people;
  4. Compare labels, understand the reasoning behind disagreements and improve the labeling instruction if needed;
    • Repeat steps 3 and 4 until the labels converge;
  5. Label validation and hold-out datasets of size N;
  6. Define the north star metric for measuring labeling quality;
  7. Do prompt engineering to improve the metric on the training dataset;
  8. Measure quality on the validation dataset:
    • If the quality is worse than on the training dataset:
      • Add the validation dataset to the training dataset and use the result as the training dataset for the next iteration;
      • Label N more entities and use them as the validation dataset for the next iteration;
      • Repeat steps 7 and 8;
  9. Measure quality on the hold-out dataset; these are the numbers to report.

Conclusion

I have described the best labeling approach I have at hand right now, but there is still room to grow. In particular, I keep thinking about ways to improve prediction robustness: even a small change in prompt punctuation can significantly change output quality. This is the type of problem that cross-validation solves in regular machine learning tasks. Is there a way to implement something similar with LLMs? Self-reflection? I don't have an answer for now, but I would be happy to discuss it, so please feel free to reach out and chat about it 🙂
