Leonard Püttmann for Kern AI

Data-centric AI for NLP is here, and it's here to stay!

When a machine learning model performs poorly, many teams intuitively try to improve the model and the underlying code - let’s say switching from a logistic regression to a neural network. While this can be helpful, it isn’t the only approach you can take to implement your use case. Taking a data-centric approach and improving the underlying data itself is often a more efficient way to increase the performance of your models. In this article, we want to show you how that can be done - for instance, but not limited to - using our open-source tool refinery. Let’s demystify data-centric AI!


Starting from scratch

In our case, let's imagine that we have some unprocessed text data from which we would like to build a classifier - for instance, a model that differentiates between the topics of a text.

The first step is to carefully look at the data we have. Can we already spot repeating patterns in the data, such as regular expressions or keywords? How is the data structured, i.e. are there short and long paragraphs and such? Does the data capture the information I need to achieve my goals, or do I need to process it in some other way first? These are not easy questions, but answering them (at least to some extent) early on will ensure the success of your projects later down the road.

Generally, you can do this by labeling a few of the examples. You will need some manually labeled examples in any case. As you understand patterns and have some reference data, you will get (imperfect, maybe noisy) ideas for automation. Let’s dive into what these ideas can look like.

One way to create signals for automated labels is labeling functions. These labeling functions allow us to programmatically express and label patterns found in the data, even if these patterns are a bit noisy. We’ll look into this a bit later, so don’t worry about it for now. For example, we can write a Python function to assign a label when certain words are found in a data point.

[Image: A simple labeling function in refinery]
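
As a rough sketch, such a keyword-based labeling function could look something like this (the record structure and the label names are assumptions, not the exact refinery API):

```python
# A hedged sketch of a keyword-based labeling function. The record
# structure ("text" field) and the label name are assumptions for
# illustration, not the exact refinery interface.

SPORTS_KEYWORDS = {"match", "league", "goal", "tournament"}

def contains_sports_keywords(record):
    """Return a label when any sports keyword appears in the text."""
    text = record["text"].lower()
    if any(keyword in text for keyword in SPORTS_KEYWORDS):
        return "Sports"
    # Returning nothing means the function abstains for this record.
```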

We can also use an active learner for labeling text data. The active learner is a machine learning model that is trained on a subset of the data which has already been labeled manually. Because the data has been processed into high-quality embeddings by SOTA transformer models (e.g. distilbert-base-uncased), we can use simple models such as a logistic regression as the active learner.

[Image: Code for the active learner]
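
Conceptually, the active learner boils down to something like the following hedged sketch: a logistic regression trained on the manually labeled embeddings that only returns predictions it is reasonably confident about (the variable names and the 0.8 threshold are illustrative assumptions):

```python
# A minimal sketch of the active-learner idea: train a simple classifier
# on the already-labeled embeddings and only keep predictions the model
# is reasonably confident about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_active_learner(embeddings, labels, labeled_idx, unlabeled_idx, threshold=0.8):
    embeddings, labels = np.asarray(embeddings), np.asarray(labels)
    model = LogisticRegression(max_iter=1000)
    model.fit(embeddings[labeled_idx], labels[labeled_idx])

    probabilities = model.predict_proba(embeddings[unlabeled_idx])
    confidence = probabilities.max(axis=1)
    predictions = model.classes_[probabilities.argmax(axis=1)]

    # Abstain (None) wherever the model is not confident enough.
    return [p if c >= threshold else None for p, c in zip(predictions, confidence)]
```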

The active learner then automatically applies labels to the unlabeled data. The more data we label manually, the more accurate the active learner becomes, creating a positive feedback loop.

We could even think of more ideas, e.g. integrating APIs or crowd labeling, but for now, let’s just think of these two examples. We’re also currently building a really cool content library, which will help you to come up with the best ideas for your automation.

[Image: Results from an active learner]

The labels from our labeling functions and the active learner can then be used for weak supervision, which takes all the labels and aggregates them into a weak supervision label. Think of weak supervision as a framework for really simple integration and denoising of noisy labels.
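
To make the idea concrete, here is a toy, hedged illustration of such an aggregation as a confidence-weighted vote over the noisy label signals - refinery's actual denoising is more sophisticated, and the source weights here are assumptions:

```python
# Toy illustration of the weak supervision idea: combine noisy label
# signals into one label per record via a confidence-weighted vote.
from collections import defaultdict

def weakly_supervise(label_votes, source_weights):
    """label_votes: list of (source_name, label) pairs for one record."""
    scores = defaultdict(float)
    for source, label in label_votes:
        if label is not None:  # ignore abstaining sources
            scores[label] += source_weights.get(source, 1.0)

    if not scores:
        return None, 0.0
    best_label = max(scores, key=scores.get)
    confidence = scores[best_label] / sum(scores.values())
    return best_label, confidence

# Example: two labeling functions and the active learner vote on one record.
votes = [("keyword_lf", "Sports"), ("regex_lf", None), ("active_learner", "Sports")]
print(weakly_supervise(votes, {"keyword_lf": 0.7, "active_learner": 0.9}))
```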

Improving the labeled data quality

Your data will rarely be perfect from the get-go. Usually, the data will be messy. That means it's important to continuously improve the existing training data. How can we do this? The output of the weak supervision also gives us a confidence score for each label assigned, which is super helpful!


We can then look at all the labels with particularly low confidence, create a new data slice and improve on that specific part of our data. We can manually label some more data out of that low-confidence data slice and write more labeling functions (or ask someone from our team to do that for us, as each slice is tagged with a URL), which then further improves the active learner as well. This not only allows us to improve the labels of our data, but also lets us spot noisy labels which differ from the ground truth.
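
For illustration, pulling out such a low-confidence slice could look roughly like this (the DataFrame columns are assumptions about how you might export records and confidences):

```python
# Hedged sketch: pull out a low-confidence slice of the weakly supervised
# data for closer inspection.
import pandas as pd

df = pd.DataFrame({
    "text": ["Late goal decides the derby", "Quarterly earnings beat estimates"],
    "weak_label": ["Sports", "Finance"],
    "confidence": [0.41, 0.93],
})

low_confidence_slice = df[df["confidence"] < 0.6]
print(low_confidence_slice)  # candidates for manual labeling or new labeling functions
```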

[Image: Confidence distribution of our labels]

We can also compare the manual labels with the labels from the weak supervision. The data browser makes it very easy to spot differences. Again, this shows how data-centric AI is not only about scaling your labeling. It really is about adding metadata to your records that helps you build large-scale, but especially high-quality, training datasets.
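
Outside of the data browser, spotting such disagreements could look roughly like this minimal sketch (the column names are, again, assumptions):

```python
# Minimal sketch of spotting records where the manual (ground truth) label
# disagrees with the weak supervision label.
import pandas as pd

df = pd.DataFrame({
    "manual_label": ["Sports", "Finance", None],
    "weak_label": ["Sports", "Sports", "Finance"],
})

disagreements = df[df["manual_label"].notna() & (df["manual_label"] != df["weak_label"])]
print(disagreements)  # likely noisy labels that are worth re-checking
```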

[Image: The data browser in refinery]

Making the problem at hand easier

There are also further steps we can take towards the goal of data-centric AI. For example, in the domain of NLP, we can further improve the embeddings we use. Let's have a quick refresher on what embeddings are.

To work with text data for NLP, we can embed sentences into a vector space. The words are represented as numeric values in this vector space. Positioning words in this space makes sure that the underlying information and meaning of the words are kept intact, while also enabling an algorithm to process the texts.

State-of-the-art embeddings are created using modern transformer models. These embeddings are very rich in information, but can also be quite complex, often having hundreds of dimensions. Fine-tuning these embeddings often leads to huge improvements.
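
As a hedged sketch, creating such embeddings with the sentence-transformers library could look like this (we reuse the distilbert-base-uncased model mentioned above; the example sentences and other details are illustrative):

```python
# Hedged sketch of turning sentences into embeddings with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-uncased")

sentences = [
    "The match ended in a dramatic penalty shootout.",
    "Stocks rallied after the central bank's announcement.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768) for this model
```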


Instead of using more and more complex models, we can do something else. An alternative approach is to improve the data at hand in such a way that the information within it is preserved, but we no longer need super complex models like transformer or LSTM neural nets in our downstream tasks to make use of it. By improving the vector space itself, even simple models such as logistic regressions or decision trees can give great results!
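
One way to improve the vector space is similarity learning, i.e. fine-tuning the embeddings so that texts of the same topic end up close together. A hedged sketch with sentence-transformers could look like this (the training pairs, labels, and loss choice are illustrative assumptions):

```python
# Hedged sketch of similarity learning: fine-tune the embedding space so
# that same-topic texts end up close together, then a simple downstream
# model can separate them.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilbert-base-uncased")

train_examples = [
    InputExample(texts=["The striker scored twice.", "A late goal won the match."], label=1.0),
    InputExample(texts=["The striker scored twice.", "Bond yields fell sharply."], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
# Afterwards, even a logistic regression on these embeddings can work well.
```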

Why data-centric AI is here to stay

Now, you might be thinking: “Wait, but isn’t this also changing the model parameters, and thus again model-centric AI?” Well, you’re not wrong - but you’re effectively spending the biggest chunk of your time on improving the data, and thus improving the model. That is what data-centric AI is all about.

We just explored some of the upsides of data-centric AI. Weak supervision provides us with an interface to easily integrate heuristics, such as labeling functions and active learning, to (semi-)automate the labeling process. Further, it helps us to enrich our data with confidence scores or metadata for slicing our records, such that we can easily use weakly supervised labels to manage our data quality. Last but not least, similarity learning can be used to simplify the underlying problem itself, which is much easier than increasing the complexity of the model used.

Enriching data and treating it like a software artefact enables us to build better and more robust machine learning models. This is why we are confident that data-centric AI is here to stay.

If you think so too, please make sure to check out our GitHub repository with the open-source refinery. We’re sure it will be super helpful for you, too :)
