<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abderrahim Alakouche</title>
    <description>The latest articles on DEV Community by Abderrahim Alakouche (@abderrahimal).</description>
    <link>https://dev.to/abderrahimal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1011782%2F28c2093c-f3c9-4df5-95dd-712ebf4a37cb.png</url>
      <title>DEV Community: Abderrahim Alakouche</title>
      <link>https://dev.to/abderrahimal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abderrahimal"/>
    <language>en</language>
    <item>
      <title>Zero Shot Text Classification Under the hood</title>
      <dc:creator>Abderrahim Alakouche</dc:creator>
      <pubDate>Sun, 05 May 2024 19:59:29 +0000</pubDate>
      <link>https://dev.to/abderrahimal/zero-shot-text-classification-under-the-hood-3h19</link>
      <guid>https://dev.to/abderrahimal/zero-shot-text-classification-under-the-hood-3h19</guid>
      <description>&lt;p&gt;Because of its potential in real-world applications, text data has attracted a lot of attention, especially in the last decade. The field of Natural Language Processing (NLP) deals with problems related to this type of data. One such problem is text classification, which has been described as an elephant among blind researchers because it admits multiple alternative views and several solution strategies.&lt;br&gt;
The traditional approach to this task has been to train a machine learning model to predict a label given a text. However, obtaining large quantities of high-quality labeled data is difficult and requires considerable effort and processing.&lt;/p&gt;

&lt;p&gt;In 2018, a new language representation model called &lt;a href="https://arxiv.org/abs/1810.04805" rel="noopener noreferrer"&gt;BERT (Bidirectional Encoder Representations from Transformers)&lt;/a&gt; was introduced. The main idea behind this paradigm is to first pre-train a language model on a massive amount of unlabeled data, then fine-tune all the parameters using labeled data from the downstream tasks. This allows the model to generalize well to different NLP tasks. Moreover, it has been shown that such a language representation model can be used to solve downstream tasks it was not explicitly trained on, e.g. classifying a text without a task-specific training phase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbktjyz4n4srtg29w0b4y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbktjyz4n4srtg29w0b4y.gif" alt="Adventure Time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero Shot Text Classification (ZSTC)
&lt;/h2&gt;

&lt;p&gt;In simple words, zero-shot text classification allows us to learn a classifier on one set of labels and then evaluate on a different set of labels that the classifier has never seen before. There are many approaches to tackle this problem: &lt;/p&gt;

&lt;p&gt;• &lt;a href="https://arxiv.org/abs/1603.08895" rel="noopener noreferrer"&gt;Latent Embedding approach&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• &lt;a href="https://aclanthology.org/2020.coling-main.285/" rel="noopener noreferrer"&gt;Text Aware Representation of Sentence&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;• &lt;a href="https://arxiv.org/abs/1909.00161" rel="noopener noreferrer"&gt;Natural Language Inference&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article we will be focusing on ZSTC based on Natural Language Inference (NLI).&lt;/p&gt;

&lt;h2&gt;
  
  
  ZSTC based on NLI
&lt;/h2&gt;

&lt;p&gt;Natural Language Inference (NLI) is the task of determining whether a Hypothesis is &lt;strong&gt;true&lt;/strong&gt; (entailment), &lt;strong&gt;false&lt;/strong&gt; (contradiction), or &lt;strong&gt;undetermined&lt;/strong&gt; (neutral) given a Premise. This can be adapted to zero-shot text classification by treating the sequence we want to classify as the premise and turning each candidate label into a hypothesis. If the model predicts that the premise entails the constructed hypothesis, we take that as a prediction that the label applies to the text.&lt;/p&gt;

&lt;p&gt;Let’s say we want to decide whether the sentence &lt;code&gt;I turn coffee into code&lt;/code&gt; is about coffee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Premise&lt;/strong&gt;: I turn coffee into code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Candidate label&lt;/strong&gt;: coffee&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis&lt;/strong&gt;: This example is about coffee&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s99ywx7qyzcm3l1iw5a.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s99ywx7qyzcm3l1iw5a.gif" alt="Code and Coffee"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Premise is always the sentence we want to classify. The Hypothesis needs a bit of creativity because its wording directly affects the quality of the predictions. A common default is &lt;code&gt;This example is about {candidate label}&lt;/code&gt;. However, it is good practice to make the hypothesis relevant to the topic we are classifying, e.g. to classify emotions we can change it to &lt;code&gt;This emotion is {candidate label}&lt;/code&gt;.&lt;/p&gt;
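
&lt;p&gt;As a minimal sketch of the idea above, the snippet below builds one (premise, hypothesis) pair per candidate label from a template. The helper name &lt;code&gt;build_pairs&lt;/code&gt; and the positional &lt;code&gt;{}&lt;/code&gt; placeholder are assumptions made for illustration, not part of any library.&lt;/p&gt;

```python
# Hypothetical helper (not a library function): turns each candidate label
# into an NLI hypothesis by filling a template.
def build_pairs(premise, candidate_labels, template="This example is about {}."):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(premise, template.format(label)) for label in candidate_labels]


premise = "I turn coffee into code."

# Default topic-style template.
pairs = build_pairs(premise, ["coffee", "sport"])

# A template adapted to the task, here emotion classification.
emotion_pairs = build_pairs(premise, ["joy"], template="This emotion is {}.")

for text, hypothesis in pairs + emotion_pairs:
    print(text, "|", hypothesis)
```

&lt;p&gt;Each pair is then scored by the NLI model separately, so changing the template changes every hypothesis the model sees.&lt;/p&gt;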

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0d9vwg5dbo556d9pv8sh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0d9vwg5dbo556d9pv8sh.png" alt="NLI architecture"&gt;&lt;/a&gt;&lt;/p&gt;
 Figure 1: ZSTC based on NLI architecture.



&lt;h2&gt;
  
  
  Under the hood
&lt;/h2&gt;

&lt;p&gt;Now that we have a basic idea of how text classification can be used in conjunction with NLI models to tackle the ZSTC problem, let's take a closer look at what's happening within the architecture shown in Figure 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokenization
&lt;/h3&gt;

&lt;p&gt;In this step we combine the premise and the hypothesis into a sentence pair [&lt;strong&gt;premise&lt;/strong&gt;, &lt;strong&gt;hypothesis&lt;/strong&gt;]. This sentence pair is fed into the model tokenizer to get the &lt;a href="https://huggingface.co/transformers/v3.2.0/glossary.html#input-ids" rel="noopener noreferrer"&gt;input ids&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fashibelp72uxebr1ru9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fashibelp72uxebr1ru9s.png" alt="Tokenization"&gt;&lt;/a&gt;&lt;/p&gt;
 Figure 2: Tokenization.



&lt;p&gt;The input ids are often the only required parameter to pass to the model as input. They are the numerical representations of the tokens that make up the sentence pair. Note that the square brackets do not appear in the tokenized sequence, and that the tokenizer adds special tokens, which are reserved ids the model uses.&lt;br&gt;&lt;br&gt;
Let’s decode the previous input ids using the Hugging Face Transformers library in Python to see the difference.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;


&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; decoded_sequence &lt;span class="o"&gt;=&lt;/span&gt; tokenizer.decode&lt;span class="o"&gt;(&lt;/span&gt;input_ids&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; print&lt;span class="o"&gt;(&lt;/span&gt;decoded_sequence&lt;span class="o"&gt;)&lt;/span&gt;

&amp;lt;s&amp;gt; I turn coffee into code &amp;lt;/s&amp;gt;&amp;lt;/s&amp;gt; This example is about coffee&amp;lt;/s&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For this example, we are using the &lt;a href="https://huggingface.co/joeddav/xlm-roberta-large-xnli" rel="noopener noreferrer"&gt;joeddav/xlm-roberta-large-xnli&lt;/a&gt; model from the Hugging Face Hub. It is an XLM-RoBERTa-based model; the special tokens for this type of tokenizer are:&lt;/p&gt;

&lt;p&gt;• &lt;code&gt;&amp;lt;s&amp;gt;&lt;/code&gt;: bos_token - the beginning of the sequence token.    &lt;/p&gt;

&lt;p&gt;• &lt;code&gt;&amp;lt;/s&amp;gt;&lt;/code&gt; : eos_token / sep_token - the end of sequence token or the separator token, in case we have multiple sentences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model logits
&lt;/h3&gt;

&lt;p&gt;Now that we have the numerical representation of our input tokens, we can run the NLI model to get the output. Since this type of model is trained on a dataset with three possible labels (contradiction, neutral, entailment), the output contains three logits. Because our batch holds a single sentence pair, the output is an array of shape 1x3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjltuw9rqpjh2uha9oc08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjltuw9rqpjh2uha9oc08.png" alt="NLI model logits"&gt;&lt;/a&gt;&lt;/p&gt;
 Figure 3: NLI model logits.



&lt;p&gt;In this example we are solving a binary classification problem, so we drop the neutral logit. In other words, entailment corresponds to a positive example that belongs to the target class &lt;strong&gt;coffee&lt;/strong&gt;, and contradiction indicates a negative example, &lt;strong&gt;not coffee&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Softmax
&lt;/h3&gt;

&lt;p&gt;Before diving into the softmax block, let’s understand the main difference between multi-class classification, multi-label classification and binary classification.&lt;/p&gt;

&lt;p&gt;• Multi-class classification: predicting one of more than two classes.&lt;/p&gt;

&lt;p&gt;• Multi-label classification: each input can be assigned several classes at once.&lt;/p&gt;

&lt;p&gt;• Binary classification: predicting one of two classes (&lt;strong&gt;coffee&lt;/strong&gt;, &lt;strong&gt;not coffee&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;To tackle multi-class classification problems using the NLI approach, we apply softmax to the entailment logits across all candidate labels. Remember, in this type of classification we provide more than two classes and the output must be a single class. The predicted class is the one with the highest entailment probability after the softmax.&lt;/p&gt;

&lt;p&gt;Consider the same example &lt;code&gt;I turn coffee into code.&lt;/code&gt;, but with multiple candidate labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Premise&lt;/strong&gt; : I turn coffee into code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Candidate labels&lt;/strong&gt; : [sport, series, programming, life]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6boxgblg5171ijtnsmtc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6boxgblg5171ijtnsmtc.png" alt="Multi-class classification"&gt;&lt;/a&gt;&lt;/p&gt;
 Figure 4: Multi-class classification.
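
&lt;p&gt;The multi-class step can be sketched with NumPy as follows. The logit values are made up for illustration, and the column order [contradiction, neutral, entailment] is an assumption that should be checked against the label mapping of the model actually used.&lt;/p&gt;

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


candidate_labels = ["sport", "series", "programming", "life"]

# Made-up logits, one row per (premise, hypothesis) pair.
# Assumed column order: [contradiction, neutral, entailment].
logits = np.array([
    [2.1, 0.3, -2.0],   # sport
    [1.8, 0.5, -1.7],   # series
    [-1.5, 0.2, 2.4],   # programming
    [0.4, 0.9, 0.1],    # life
])

entailment_logits = logits[:, 2]      # keep only the entailment column
probs = softmax(entailment_logits)    # probabilities sum to 1 across the labels
predicted = candidate_labels[int(np.argmax(probs))]

for label, p in zip(candidate_labels, probs):
    print(label, round(float(p), 3))
print("predicted:", predicted)
```

&lt;p&gt;Because the softmax runs across labels, the scores compete with each other: raising the probability of one label lowers the others.&lt;/p&gt;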



&lt;p&gt;For binary classification and multi-label classification, we apply softmax over the entailment and contradiction logits for each label independently. In the multi-label case, there are two ways to limit the number of predicted classes for each input: fix the number of classes per prediction, or set a probability threshold.&lt;/p&gt;

&lt;p&gt;Returning to the output in Figure 3, we drop the neutral logit and apply softmax to the remaining two.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu9g41aixs2xupajsez1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu9g41aixs2xupajsez1.png" alt="Binary classification"&gt;&lt;/a&gt;&lt;/p&gt;
 Figure 5: Binary classification.
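
&lt;p&gt;A rough sketch of the per-label scoring described above, again with NumPy. The function name &lt;code&gt;entailment_probability&lt;/code&gt;, the logit values, the 0.5 threshold and the column order [contradiction, neutral, entailment] are all assumptions made for illustration.&lt;/p&gt;

```python
import numpy as np


def entailment_probability(logits):
    """Drop the neutral logit, then softmax over [contradiction, entailment].

    Assumes the last axis is ordered [contradiction, neutral, entailment].
    """
    pair = np.asarray(logits, dtype=float)[..., [0, 2]]
    pair = pair - pair.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(pair)
    return (exp / exp.sum(axis=-1, keepdims=True))[..., 1]


# Binary case: a single hypothesis such as "This example is about coffee".
coffee_logits = [-1.0, 0.5, 3.0]   # made-up values
p_coffee = float(entailment_probability(coffee_logits))
print("P(coffee):", round(p_coffee, 3))

# Multi-label case: each label is scored independently of the others,
# then a probability threshold decides which labels are kept.
labels = ["sport", "programming", "life"]
label_logits = [
    [2.0, 0.1, -2.0],   # sport
    [-1.5, 0.2, 2.5],   # programming
    [0.5, 1.0, -0.5],   # life
]
scores = entailment_probability(label_logits)
threshold = 0.5   # illustrative cut-off
keep = np.greater_equal(scores, threshold)
selected = [label for label, flag in zip(labels, keep) if flag]
print("selected:", selected)
```

&lt;p&gt;Unlike the multi-class case, the scores here do not sum to 1 across labels, so zero, one or several labels can pass the threshold.&lt;/p&gt;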



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We can’t deny the limitations of zero-shot text classification. One such limitation is evaluation: the input data is unlabeled by default, so we don’t have a ground truth to evaluate the model against. Solving this problem would be a major step forward for the NLP community.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
