Bedram Tamang

Fine-Tuning T5-Small Model for a Completely New Language: Limbu

Introduction

Natural Language Processing (NLP) is expanding its reach into underserved languages. In this blog, we’ll explore how to fine-tune the T5-Small model to translate between English and Limbu, a Tibeto-Burman language spoken in Nepal and neighboring regions.


Preparing the Data

We created an English-Limbu translation dataset in JSON format, containing over 1,500 pairs. Below is a sample of the data:

[
    {
        "id": 1,
        "translation": {
            "en": "hi",
            "lim": "ᤜᤠᤤ ॥"
        }
    },
    {
        "id": 2,
        "translation": {
            "en": "Let's eat.",
            "lim": "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥ "
        }
    },
    {
        "id": 3,
        "translation": {
            "en": "We saw it.",
            "lim": "ᤀᤏᤡᤃᤧ ᤁᤴ ᤏᤡᤔᤠᤏᤠ ॥ "
        }
    },
    ...
]

The dataset was saved as limbu-english.json.
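
For reference, here is a minimal sketch of how such a file can be written with Python's standard json module (the two pairs are copied from the sample above; the variable name pairs is purely illustrative):

import json

# A couple of the pairs shown above; in practice this list holds all 1,500+ entries.
pairs = [
    {"id": 1, "translation": {"en": "hi", "lim": "ᤜᤠᤤ ॥"}},
    {"id": 2, "translation": {"en": "Let's eat.", "lim": "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥"}},
]

# ensure_ascii=False keeps the Limbu script readable in the file.
with open("limbu-english.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=4)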


Setting Up the Environment

Install the required libraries in Google Colab:

!pip install transformers datasets evaluate sacrebleu
!pip install transformers[sentencepiece]
!pip install sentencepiece

Load the dataset:

from datasets import load_dataset

path = 'limbu-english.json'
translations = load_dataset('json', data_files=path)
translations = translations["train"].train_test_split(test_size=0.2)
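
A quick, optional sanity check to confirm the split and peek at one record:

# Show the train/test split sizes and one example record.
print(translations)
print(translations["train"][0])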

Loading the Pretrained Model

We initialized the T5-Small model:

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)
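
As a quick check, you can see why a new tokenizer is needed: t5-small's vocabulary was built mostly from English text, so Limbu script is likely to map to unknown pieces and get lost on a round trip. A minimal sketch:

# Encode a Limbu sentence with the stock t5-small tokenizer and decode it back.
sample = "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥"
ids = tokenizer(sample)["input_ids"]
print(ids)
# Much of the Limbu text is expected to be lost here.
print(tokenizer.decode(ids, skip_special_tokens=True))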

Tokenizing the Dataset

We trained a new tokenizer on the Limbu text, starting from the T5 tokenizer, and then tokenized the dataset:

def get_training_corpus():
    dataset = translations["train"]
    for start_idx in range(0, len(dataset), 1000):
        yield [item['lim'] for item in dataset[start_idx:start_idx + 1000]['translation']]

lim_tokenizer = tokenizer.train_new_from_iterator(get_training_corpus(), 52000)

source_lang = "en"
target_lang = "lim"
prefix = "translate English to Limbu: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    return lim_tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

tokenized_translations = translations.map(preprocess_function, batched=True)
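
It is worth inspecting one tokenized example to confirm that the prefixed English input and the Limbu labels round-trip cleanly through the new tokenizer:

# Decode one example back from its token ids as a sanity check.
example = tokenized_translations["train"][0]
print(lim_tokenizer.decode(example["input_ids"], skip_special_tokens=True))
print(lim_tokenizer.decode(example["labels"], skip_special_tokens=True))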

Preparing for Training

The tokenized data was prepared for the TensorFlow model:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=lim_tokenizer, model=checkpoint, return_tensors="tf")

tf_train_set = model.prepare_tf_dataset(
    tokenized_translations["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_translations["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
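
Before training, you can inspect the structure the model will receive from the tf.data pipeline:

# The element spec shows the tensor names, shapes, and dtypes of each batch.
print(tf_train_set.element_spec)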

Training the Model

We used AdamWeightDecay for optimization:

from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)


Let's define the metrics to track during training:

import numpy as np
import evaluate
from transformers.keras_callbacks import KerasMetricCallback

# SacreBLEU scores the decoded predictions against the references.
metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    # Strip whitespace and wrap each reference in a list, as sacrebleu expects.
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = lim_tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace the -100 padding used for labels before decoding.
    labels = np.where(labels != -100, labels, lim_tokenizer.pad_token_id)
    decoded_labels = lim_tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != lim_tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

These metrics also appear in the training logs, but we will push them to the Hugging Face Hub instead. First, log in:

from huggingface_hub import notebook_login

notebook_login()

and push the model and training logs to the Hub with a callback, alongside the metric callback defined above:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(output_dir="eng-limbu-t5-001", tokenizer=lim_tokenizer)

callbacks = [
    metric_callback,
    push_to_hub_callback,
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
]
history = model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=500, callbacks=callbacks)
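
Besides the Hub push, you may want a local copy of the fine-tuned weights and tokenizer; a minimal sketch (the directory name is arbitrary):

# Save the TensorFlow model and the custom tokenizer to a local directory.
model.save_pretrained("eng-limbu-t5-001-local")
lim_tokenizer.save_pretrained("eng-limbu-t5-001-local")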

Visualizing Training Progress

We visualized the training loss:

import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

Testing the Model

We tested the model using the pipeline API, loading the checkpoint pushed to the Hub:

from transformers import pipeline

translator = pipeline("text2text-generation", model="bedus-creation/eng-limbu-t5-001")
result = translator("translate English to Limbu: Hello")
print(result)

Evaluating with BLEU Score

Finally, we calculated the BLEU score for translation accuracy:

import evaluate

bleu = evaluate.load("bleu")

# Compare the model's Limbu output against the reference translation.
predictions = [
    translator("translate English to Limbu: hi")[0]["generated_text"],
]
references = [
    ["ᤜᤠᤤ ॥"],
]

results = bleu.compute(predictions=predictions, references=references)

print(results)

Conclusion

Fine-tuning the T5-Small model for Limbu demonstrates the potential of NLP in preserving and advancing underrepresented languages. With more training data and optimization, such models can become invaluable tools for language preservation and cross-cultural communication.
