<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: alok kumar</title>
    <description>The latest articles on DEV Community by alok kumar (@alok_kumar_d262824002396c).</description>
    <link>https://dev.to/alok_kumar_d262824002396c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1975382%2F8b895ed6-41d6-4c38-978d-5e512d6c2b9b.png</url>
      <title>DEV Community: alok kumar</title>
      <link>https://dev.to/alok_kumar_d262824002396c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alok_kumar_d262824002396c"/>
    <language>en</language>
    <item>
      <title>Fine-tuning LLM Using Masking</title>
      <dc:creator>alok kumar</dc:creator>
      <pubDate>Sun, 25 Aug 2024 06:09:07 +0000</pubDate>
      <link>https://dev.to/alok_kumar_d262824002396c/fine-tuning-llm-using-masking-m5</link>
      <guid>https://dev.to/alok_kumar_d262824002396c/fine-tuning-llm-using-masking-m5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Fine-tuning LLM Using Masking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning a large language model (LLM) such as BART (Bidirectional and Auto-Regressive Transformers) with Masked Language Modeling (MLM) means training the model on a specific dataset in which some tokens are randomly masked, so that the model learns to predict the masked tokens. BART is a sequence-to-sequence model that combines the strengths of BERT (which uses MLM) and GPT (which is auto-regressive).&lt;/p&gt;
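Before the full training script, the core masking mechanic can be sketched in plain PyTorch (a minimal illustration using a hypothetical 10-token vocabulary, not the BART tokenizer):

```python
import torch

torch.manual_seed(0)

mask_token_id = 9                        # hypothetical: id 9 plays the role of <mask>
input_ids = torch.randint(0, 9, (2, 8))  # a toy batch: 2 sequences of 8 token ids

labels = input_ids.clone()
probability_matrix = torch.full(labels.shape, 0.15)       # mask each token with p = 0.15
mask_matrix = torch.bernoulli(probability_matrix).bool()

labels[~mask_matrix] = -100              # unmasked positions are ignored by the loss
input_ids[mask_matrix] = mask_token_id   # the model sees the mask id at masked positions
```

After this, `labels` holds the original token id only where a token was masked, so the model is trained to recover exactly those tokens.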

&lt;p&gt;Below, I'll walk you through the steps and provide code to fine-tune BART using Masked Language Modeling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps to Fine-Tune BART with MLM&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Import necessary libraries:&lt;/strong&gt; we use the transformers library from Hugging Face, which provides pre-trained models and tokenizers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load a pre-trained BART model and tokenizer.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prepare the dataset:&lt;/strong&gt; create or load a dataset, tokenize it, and apply the masking. The dataset is split into input and target sequences.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set up the training arguments:&lt;/strong&gt; define training parameters such as the learning rate, batch size, and number of epochs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-tune the model:&lt;/strong&gt; use the Hugging Face Trainer API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluate the model:&lt;/strong&gt; after training, evaluate the model on a validation dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Code Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of fine-tuning BART using MLM:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import BartForConditionalGeneration, BartTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import torch

# Load the tokenizer and the model
model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Load a sample dataset and drop empty lines
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda example: len(example["text"].strip()) &gt; 0)

# Preprocessing function: tokenize the input text and mask some tokens
def preprocess_function(examples):
    inputs = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    input_ids = torch.tensor(inputs["input_ids"])

    # Apply masking with 15% probability, never masking special tokens
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, 0.15)
    special_tokens_mask = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, 0.0)
    mask_matrix = torch.bernoulli(probability_matrix).bool()
    labels[~mask_matrix] = -100  # Ignore labels that are not masked
    input_ids[mask_matrix] = tokenizer.mask_token_id

    inputs["input_ids"] = input_ids.tolist()
    inputs["labels"] = labels.tolist()
    return inputs

# Apply the preprocessing function to the dataset
processed_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["text"])

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset,
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained("./fine-tuned-bart-mlm")
tokenizer.save_pretrained("./fine-tuned-bart-mlm")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokenizer and Model:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BartTokenizer converts text into input IDs the model can process, and BartForConditionalGeneration is the BART model used for conditional-generation tasks such as summarization and translation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset Loading:&lt;/strong&gt; we load the "wikitext-2-raw-v1" dataset from the datasets library, which contains raw text data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocessing:&lt;/strong&gt; the preprocess_function tokenizes the text and creates input IDs. A random mask is drawn over the input tokens (15% masking probability), and the selected tokens are replaced with the mask token (&amp;lt;mask&amp;gt;). In the labels tensor, unmasked positions are set to -100, which tells the model to ignore those tokens during the loss computation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Arguments:&lt;/strong&gt; output_dir is the directory for model checkpoints, per_device_train_batch_size is the per-device batch size, num_train_epochs is the number of training epochs, and save_steps and save_total_limit control model checkpointing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trainer:&lt;/strong&gt; the Hugging Face Trainer class manages the training loop, including data loading, model updates, and saving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-Tuning:&lt;/strong&gt; trainer.train() trains the model on the processed dataset using the defined training arguments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Saving:&lt;/strong&gt; after training, the fine-tuned model and tokenizer are saved for future use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The code above demonstrates how to fine-tune a BART model using the Masked Language Modeling objective. This approach is useful when you want the model to better predict masked tokens, which matters for tasks such as text completion and infilling, or as continued pre-training before transferring to other NLP tasks.&lt;/p&gt;
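Setting unmasked labels to -100 works because PyTorch's cross-entropy loss ignores targets equal to -100 by default (ignore_index=-100), so only the masked positions contribute to the fine-tuning loss. A small stand-alone check with toy logits (not the BART model itself):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)                 # 4 token positions, 10-class toy vocabulary
labels = torch.tensor([3, -100, 7, -100])   # positions 1 and 3 are "unmasked" -> ignored

loss = F.cross_entropy(logits, labels)      # ignore_index defaults to -100

# Equivalent to averaging the loss over only the positions with real labels:
kept = labels != -100
manual = F.cross_entropy(logits[kept], labels[kept])
```

This is exactly why a batch should contain at least one masked token: if every label were -100, there would be nothing to average over.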

</description>
      <category>datascience</category>
      <category>generative</category>
      <category>ai</category>
      <category>career</category>
    </item>
  </channel>
</rss>
