Train a Sentence-CamemBERT

The CamemBERT model is a state-of-the-art language model for French.

It is a RoBERTa model trained on a large corpus of French text, and it can easily be adapted to many tasks through finetuning.

Here we're going to finetune the model for sentence embedding.

Sentence-BERT

The output of a BERT model is an embedding vector for each token. To obtain an embedding of the text as a whole, we need a strategy for turning these individual token embeddings into a single sentence embedding vector.

The simplest and most effective strategy is to take the average of the token embeddings. This strategy is known as mean pooling.

Sentence-BERT diagram: token embeddings are averaged in a mean pooling layer to get the sentence embedding

If you'd like to find out more about the strategies that have been considered, take a look at this paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
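To make this concrete, here is a minimal sketch of mean pooling, assuming token_embeddings is the encoder output of shape (batch, seq_len, hidden) and attention_mask comes from the tokenizer. The names are mine; sentence-transformers implements this for you in its Pooling module.

import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # expand the mask so that padding tokens do not contribute to the average
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts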

Finetuning a BERT model into a Sentence-BERT model

The authors of the paper mentioned above built a Python library called sentence-transformers for working with Sentence-BERT models.

We'll use it to obtain a Sentence-CamemBERT model from a CamemBERT model available on huggingface.

Prerequisites

We will be using the following packages:

datasets
sentence-transformers
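Both can be installed with pip. A quick sanity check of the environment (any reasonably recent versions should work):

# check that the required packages are importable
import datasets
import sentence_transformers

print(datasets.__version__)
print(sentence_transformers.__version__)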

Training data

First of all, we're going to retrieve the training data.
We will use the French part of the STSb Multi MT dataset, which contains pairs of sentences along with a score between 0 and 5 representing their similarity.

from datasets import load_dataset

sts_train_dataset = load_dataset("stsb_multi_mt", name="fr", split="train")
sts_dev_dataset = load_dataset("stsb_multi_mt", name="fr", split="dev")
sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
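Before converting anything, it's worth taking a quick look at what we loaded; each row contains the sentence1, sentence2 and similarity_score fields used below:

# inspect the splits and a sample row
print(sts_train_dataset)
print(sts_train_dataset[0])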

We'll then convert the retrieved data into InputExample objects that can be used for training; note that the similarity scores are scaled from [0, 5] to [0, 1].

from typing import List
from sentence_transformers import InputExample

def dataset_to_input_examples(dataset) -> List[InputExample]:
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]

sts_train_examples = dataset_to_input_examples(sts_train_dataset)
sts_dev_examples = dataset_to_input_examples(sts_dev_dataset)
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

We will use the CamemBERT model named almanach/camembert-base for finetuning:

from sentence_transformers import evaluation, losses, SentenceTransformer
from torch.utils.data import DataLoader

batch_size = 32

model = SentenceTransformer("almanach/camembert-base")

train_dataloader = DataLoader(sts_train_examples, shuffle=True, batch_size=batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
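Since almanach/camembert-base is a plain CamemBERT checkpoint rather than a sentence-transformers model, the SentenceTransformer constructor adds a mean pooling layer on top of it automatically. If you prefer to make the architecture explicit, here is a sketch of the equivalent construction with the modules API:

from sentence_transformers import models

# equivalent to SentenceTransformer("almanach/camembert-base"): encoder + mean pooling
word_embedding_model = models.Transformer("almanach/camembert-base")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
explicit_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])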

We use the cosine similarity loss objective to train the model: it computes the cosine similarity between the two sentence embeddings and minimizes the mean squared error against the normalized gold score.

Finally, an evaluator is built to monitor the model's performance on the dev dataset during training, measuring how well the cosine similarity between the embeddings correlates with the gold similarity scores.

sts_dev_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_dev_examples, name="sts-dev"
)

We can now start training the model:

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=sts_dev_evaluator,
    epochs=10,
    warmup_steps=500,
    save_best_model=True,
)
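Note that save_best_model only writes the best checkpoint to disk if fit is also given an output_path; alternatively, you can save the finetuned model yourself afterwards (the directory name below is just an example):

# persist the finetuned model to a local directory (example path)
model.save("sts-camembert-base")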

Model evaluation

Once training is complete, you can measure the model's performance on the test dataset that was held out:

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

sts_test_evaluator(model, ".")

I get a Pearson correlation of 0.837, which is on par with the Sentence-CamemBERT models I found on huggingface.
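To use the finetuned model, encode sentences and compare their embeddings with cosine similarity (the two example sentences below are mine):

from sentence_transformers import util

# encode two French sentences and compare them
embeddings = model.encode(
    ["Le chat dort sur le canapé.", "Un chat fait la sieste sur le sofa."]
)
print(util.cos_sim(embeddings[0], embeddings[1]))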

Sentence-BERT model distilled

I've also trained a Sentence-CamemBERT model that's about half the size (68M parameters vs. 110M) and yet performs very well: h4c5/sts-distilcamembert-base.

This is in fact a model obtained by following the above procedure but starting from the distilled CamemBERT model: cmarkea/distilcamembert-base.

This so-called "distilled" model was obtained by removing half of the layers of the CamemBERT base model and training it to preserve the original model's performance.

To find out more about the distillation process, please consult the following papers:

Et voilà. You can find my two Sentence-CamemBERT models on huggingface:
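For example, the distilled one can be loaded directly with sentence-transformers:

from sentence_transformers import SentenceTransformer

# load the published model from the Hugging Face hub
model = SentenceTransformer("h4c5/sts-distilcamembert-base")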
