<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sangjun_park</title>
    <description>The latest articles on DEV Community by sangjun_park (@bullmouse).</description>
    <link>https://dev.to/bullmouse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1010880%2F5ebb0127-7fed-4d27-82b3-8ec16208b1d6.jpg</url>
      <title>DEV Community: sangjun_park</title>
      <link>https://dev.to/bullmouse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bullmouse"/>
    <language>en</language>
    <item>
      <title>AI Paper Review: ORPO - Monolithic Preference Optimization without Reference Model</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Wed, 01 Jan 2025 15:00:54 +0000</pubDate>
      <link>https://dev.to/bullmouse/ai-paper-review-24l4</link>
      <guid>https://dev.to/bullmouse/ai-paper-review-24l4</guid>
      <description>&lt;p&gt;(Please note that this content was translated using GPT-o1, so there might be some mistakes or inaccuracies in the translation.)&lt;/p&gt;

&lt;p&gt;paper link: &lt;a href="https://arxiv.org/abs/2403.07691" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2403.07691&lt;/a&gt; &lt;/p&gt;

&lt;h1&gt;
  
  
  Overview
&lt;/h1&gt;

&lt;p&gt;First, let’s interpret the title and understand what it means:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monolithic Preference Optimization:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monolithic&lt;/strong&gt; -&amp;gt; Refers to working as one large, single-structure system rather than separating multiple modules or elements.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preference Optimization&lt;/strong&gt; -&amp;gt; Refers to the process of optimizing around a specific goal or criterion (e.g., user preference, satisfaction) that the user desires.
&lt;/li&gt;
&lt;li&gt;In other words, it means &lt;strong&gt;optimizing preferences within a single system&lt;/strong&gt; without separate sub-modules.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Abstract
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SFT (Supervised Fine-Tuning):&lt;/strong&gt; A process in which a pre-trained language model (such as GPT, Llama, etc.) undergoes additional supervised learning on a specific task or domain -&amp;gt; leads to better performance on the given task.
&lt;/li&gt;
&lt;li&gt;SFT plays a key role in &lt;strong&gt;preference alignment&lt;/strong&gt;, and even a small penalty for undesired generation styles is sufficient.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preference Alignment:&lt;/strong&gt; Training a language model to generate responses in line with a specific standard (e.g., user preferences, task requirements). For example, if a user demands concise and clear answers, the model is trained to produce them. Conversely, providing excessive information or ambiguous answers is considered a &lt;strong&gt;“non-preferred generation style.”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Main results&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Up to &lt;strong&gt;12.20%&lt;/strong&gt; win rate on AlpacaEval 2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;66.19%&lt;/strong&gt; on IFEval&lt;/li&gt;
&lt;li&gt;Achieved &lt;strong&gt;7.32&lt;/strong&gt; on MT-Bench&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h1&gt;
  
  
  1 Introduction
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Limitations of the Existing Approach
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-trained Language Models (PLMs)&lt;/strong&gt;, trained on large-scale data such as web text or textbooks, demonstrate excellent performance on various NLP tasks. However, for practical applications, additional tuning such as &lt;strong&gt;Instruction Tuning&lt;/strong&gt; or &lt;strong&gt;Preference Alignment&lt;/strong&gt; is needed.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Tuning:&lt;/strong&gt; A process where the model is trained to generalize to new tasks given instructions in natural language. It uses a unified dataset so that it can handle various inputs and produce outputs accordingly. As a result, the model acquires zero-shot or few-shot capabilities.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preference Alignment:&lt;/strong&gt; Uses paired preference data to align models with human values. RLHF (reinforcement learning from human feedback) and DPO (directly adjusting the output probability distribution using preference data) fall under this category.
&lt;/li&gt;
&lt;li&gt;Traditional preference alignment generally follows a multi-step process (SFT -&amp;gt; reward model training -&amp;gt; reinforcement learning), thus requiring an additional reference model and a supervised fine-tuning phase.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  A New Approach
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A new method called ORPO (Odds Ratio Preference Optimization) is proposed as a novel &lt;strong&gt;monolithic&lt;/strong&gt; alignment approach. Unlike existing methods, it can effectively suppress undesired generation styles without the need for a preparatory SFT stage or a reference model. It utilizes resources efficiently while implementing preference alignment.
&lt;/li&gt;
&lt;li&gt;ORPO demonstrates strong performance on leaderboards for the Phi-2, Llama-2, and Mistral models. It has also outperformed existing methods (RLHF, DPO) on various benchmarks such as AlpacaEval 2.0 and IFEval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff00ozyo98u059r80goxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff00ozyo98u059r80goxw.png" alt="2024-12-28-8-08-40" width="613" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The figure above, based on AlpacaEval 2.0, shows the results of fine-tuning Llama-2 (7B) and Mistral (7B) using different algorithms. It displays the win rate (%) of each model, comparing RLHF (in red), DPO (in green), and ORPO (in blue). From this, we can see that ORPO outperforms RLHF and DPO.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://ibb.co/Fmv1xdZ" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu8pdwweqv0u7rp4otre.png" alt="2024-12-28-8-09-18" width="640" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This graph compares three alignment methods—RLHF, DPO, and ORPO.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RLHF:&lt;/strong&gt; After the SFT phase, a reward model is built, and the model is fine-tuned via policy gradient based on feedback from that reward model. RLHF needs a reference model and a policy, making it a relatively complex multi-stage process.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reward Model:&lt;/strong&gt; Trained from human feedback -&amp;gt; provides a score for how “good” a model output is.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy Gradient:&lt;/strong&gt; The central algorithm for reinforcement learning. It updates the current language model (policy) based on the reward signal from the reward model.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference Model:&lt;/strong&gt; A fixed model that serves as a baseline for the current policy. By comparing policy and reference, one can calculate the KL-divergence to ensure that the policy does not deviate too far from the reference.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;DPO:&lt;/strong&gt; Compared to RLHF, it has a simplified process that trains the language model directly on preference data, with no separate reward model or reinforcement learning loop. It does, however, still require a reference model: the frozen SFT-stage model is typically designated as the reference, and the policy is optimized against it.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DPO updates the model by leveraging the log probability ratios of the policy to the reference model for accepted and rejected responses, pushing the likelihood of accepted responses up relative to rejected ones.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;ORPO:&lt;/strong&gt; A new method of aligning the language model with human preferences using preference data. It uses a log odds ratio rather than DPO&amp;#39;s reference-based probability ratio, so it does not rely on a reference model and learns directly from preference data. Moreover, the odds-ratio penalty is applied during SFT itself, so no additional multi-stage process is needed.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpidxokac1ghs48xic4g.png" alt="2024-12-28-10-29-51" width="127" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The figure above explains how log odds ratio optimization works. Good responses get a strong adaptation signal, whereas poor responses get a weaker penalty, reflecting these differences in the model’s gradient updates.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2 Related Works
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Connections to Reinforcement Learning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RLHF&lt;/strong&gt; uses a Bradley-Terry model to estimate the probability that one response is preferred over another, training a reward model to maximize the score of the selected response. PPO (Proximal Policy Optimization) is the typical RL algorithm used here, enabling the model to learn from human preferences. This approach has proven scalable and general for instruction-following language models, and has been extended further by RLAIF.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PPO (Proximal Policy Optimization):&lt;/strong&gt; A reinforcement learning update algorithm that restricts large updates to the current policy (the language model) by clipping, making it less sensitive and more stable than some other RL methods.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RLAIF:&lt;/strong&gt; While RLHF requires human evaluators to rank or compare outputs, RLAIF uses a language model instead of humans to evaluate responses and train the reward model. This reduces costs and can be applied to larger data.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;However, RLHF suffers from instability in PPO, requiring extensive hyperparameter search, and a reward model’s sensitivity also causes challenges. Hence, more stable preference-alignment algorithms are needed.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Without a Reward Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Various methods have been studied for performing preference alignment without a separate reward model:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DPO (Direct Preference Optimization):&lt;/strong&gt; Integrates reward modeling into the preference learning stage. Preference data forces the probability of favorable responses to exceed that of unfavorable ones.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPO (Identity Preference Optimization):&lt;/strong&gt; Developed to avoid overfitting problems that may arise in DPO, placing less emphasis on the relative gap and more on maintaining consistent baselines.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KTO (Kahneman-Tversky Optimization):&lt;/strong&gt; An approach grounded in the behavioral economics theories of Kahneman and Tversky. Unlike RLHF or DPO, it does not rely on comparison data but instead incorporates cognitive biases inherent to human decision-making.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ULMA (Unified Language Model Alignment):&lt;/strong&gt; A unified approach to aligning language models that features a consistent training framework and varied data. It’s said to be efficient, with simpler data preparation and a universal training structure for easier extension.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  SFT (Supervised Fine-Tuning) Revisited
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Preference alignment methods often rely on SFT as a stable starting point between the original and actively updated policies.
&lt;/li&gt;
&lt;li&gt;In RLHF, the SFT model is considered the baseline policy, and empirical results also suggest that SFT is important in non-RL alignment approaches.
&lt;/li&gt;
&lt;li&gt;Some research shows that performing SFT with a carefully curated dataset alone can yield a human-aligned language model.
&lt;/li&gt;
&lt;li&gt;Even a small set of meticulously filtered data can produce a useful language model assistant, or one can iteratively select self-generated model outputs to refine alignment. In some cases, using only a portion of a preference dataset—carefully chosen—can suffice.
&lt;/li&gt;
&lt;li&gt;However, theoretical and empirical work on how SFT integrates with preference alignment is limited.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  3 The Role of Supervised Fine-Tuning
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;This study examines the role of Supervised Fine-Tuning as the initial step for preference alignment. SFT is crucial for adapting a pre-trained language model to a desired domain by increasing the log probability of related tokens. However, it can also inadvertently increase the likelihood of undesired styles. Thus, we need to maintain the domain-adaptation effect of SFT while distinguishing and reducing undesired generation styles.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyvtjtgsgmva4971lebp.png" alt="2024-12-30-5-55-16" width="576" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The above figure shows the log probability of accepted vs. rejected responses while SFT is being performed on OPT-250M using the HH-RLHF dataset. Over training, both accepted and rejected responses’ log probabilities increase. Ideally, we would see an increase in accepted responses’ log probabilities while rejected responses remain low, but this figure shows otherwise.
&lt;/li&gt;
&lt;li&gt;In other words, while SFT helps domain adaptation, additional measures are needed to suppress undesired response styles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  No Direct Penalty from Cross-Entropy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cross-entropy loss penalizes low logit values for reference (correct) tokens. Mathematically, it can be expressed as shown in the figure below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzyp3u7zu5g2cpgfx6ip.png" alt="2024-12-30-9-18-15" width="567" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;yᵢ&lt;/strong&gt;: A boolean that indicates if the token is the correct token (1 if correct, 0 if not).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pᵢ&lt;/strong&gt;: The predicted probability that the token is correct (between 0 and 1).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;m&lt;/strong&gt;: The length of the sequence. For instance, if the input sentence is “Hello World,” m=2 by word, m=11 by character.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;|V|&lt;/strong&gt;: The vocabulary size of the model.
&lt;/li&gt;
&lt;li&gt;Cross-entropy aims to maximize log(pᵢ) only for the correct token yᵢ=1. For the other tokens yᵢ=0, there is no additional suppression effect.
&lt;/li&gt;
&lt;li&gt;Consequently, the log probability of rejected responses can increase as well, which is undesirable from a preference alignment standpoint.&lt;/li&gt;
&lt;/ul&gt;
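The point about cross-entropy can be made concrete with a minimal numeric sketch (the 0.5 probability is an illustrative value, not from the paper): the loss depends only on the probability assigned to the correct token, so it cannot distinguish where the remaining probability mass goes.

```python
import math

def cross_entropy(p_correct):
    """Token-level cross-entropy: only the probability assigned to the
    correct token (y_i = 1) enters the loss; every other vocabulary
    entry is multiplied by y_i = 0 and vanishes."""
    return -math.log(p_correct)

# Two hypothetical distributions over a small vocabulary: both assign
# 0.5 to the correct token, but the second puts the rest of the mass
# on a single undesired "rejected-style" token.
loss_spread = cross_entropy(0.5)        # remaining mass spread evenly
loss_rejected = cross_entropy(0.5)      # remaining mass on one bad token
assert loss_spread == loss_rejected     # CE cannot tell the two apart
```

Because the loss is identical in both cases, nothing in SFT discourages the model from concentrating mass on a rejected style.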

&lt;h3&gt;
  
  
  Generalization Across Two Response Styles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;SFT alone does not resolve the over-adjustment of accepted vs. rejected responses, as demonstrated in the study below:

&lt;ul&gt;
&lt;li&gt;Fine-tuning an OPT-350M model on the HH-RLHF dataset using accepted responses only.
&lt;/li&gt;
&lt;li&gt;Monitoring the log probability of rejected responses showed that both accepted and rejected responses increased.
&lt;/li&gt;
&lt;li&gt;This indicates that while cross-entropy loss effectively orients the model toward conversational (or domain-specific) data, it does not penalize undesired responses, so rejected responses can end up with higher log probability than accepted ones.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Penalizing Undesired Generations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Past research indicates that adding a penalty term to the loss function for undesired outputs can mitigate degenerative effects.

&lt;ul&gt;
&lt;li&gt;Example: To prevent repetitive outputs, one might add a penalty term for assigning high probability to tokens that appear frequently in the recent context.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Building on the idea that dynamic penalization of rejected responses is effective, we designed a unified preference-alignment method that penalizes undesired responses on the fly.&lt;/li&gt;

&lt;/ul&gt;




&lt;h1&gt;
  
  
  4 Odds Ratio Preference Optimization
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Now we introduce a new algorithm called ORPO for preference alignment. This algorithm adds an odds-ratio-based penalty to the usual negative log-likelihood (NLL) loss so that the model effectively differentiates between preferred and non-preferred responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4.1 Background
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xt51gkyg17dtbgrku01.png" alt="2024-12-31-7-12-57" width="443" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The above formula calculates the average log likelihood of the output sequence y given the input sequence x.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;y:&lt;/strong&gt; The output sequence, consisting of m tokens.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;x:&lt;/strong&gt; The input sequence.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pθ(yₜ | x, y₁…yₜ₋₁):&lt;/strong&gt; The probability that the t-th token yₜ of the output sequence is generated, given the input x and the previously generated tokens y₁, …, yₜ₋₁.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;m:&lt;/strong&gt; The length (in tokens) of the output sequence y.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Essentially, the log probability of the entire sequence is summed across tokens and then averaged to avoid dependence on sequence length.&lt;/li&gt;

&lt;/ul&gt;
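The formula above can be sketched as a small helper (the per-token probabilities in the example are illustrative, not from the paper):

```python
import math

def avg_log_likelihood(token_probs):
    """Average log likelihood of an output sequence y given x:
    sum log P_theta(y_t | x, y_1..y_{t-1}) over the m tokens, then
    divide by m so longer sequences are not penalized for length."""
    m = len(token_probs)
    return sum(math.log(p) for p in token_probs) / m

# Hypothetical per-token probabilities for a 3-token response.
score = avg_log_likelihood([0.9, 0.8, 0.95])
```

Dividing by m is what removes the length dependence: a 3-token and a 30-token sequence with the same per-token confidence get the same score.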

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h15x91is99c63ob0aho.png" alt="Pasted-image-20241231192528" width="447" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;odds:&lt;/strong&gt; The ratio between the probability that y is generated and that it is not generated, indicating the relative likelihood of generation. If the generation probability is Pθ(y | x), then its non-generation probability is 1 - Pθ(y | x).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;oddsθ(y | x) = k&lt;/strong&gt;: Means that for a given input x, the model θ finds y k times more likely than not-y.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12mttnroefsb3ft9r3k5.png" alt="2024-12-31-7-33-34" width="449" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The odds ratio between a winning response yw (accepted) and a losing response yl (rejected) indicates how much more preferred yw is over yl.&lt;/li&gt;
&lt;/ul&gt;
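The odds and odds-ratio definitions above translate directly into code (the probabilities in the assertions are illustrative values):

```python
def odds(p):
    """odds_theta(y | x) = P(y | x) / (1 - P(y | x)): how much likelier
    the model is to generate y than not to generate it."""
    return p / (1.0 - p)

def odds_ratio(p_w, p_l):
    """OR_theta(y_w, y_l): how strongly the accepted response y_w is
    favored over the rejected response y_l."""
    return odds(p_w) / odds(p_l)
```

For instance, a response with probability 0.8 has odds of 4, i.e., the model finds it four times more likely to be generated than not.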

&lt;h2&gt;
  
  
  4.2 Objective Function of ORPO
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Objective Function:&lt;/strong&gt; In machine learning or optimization problems, this is the function that the model aims to optimize during training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nnhr0kyy6bjymp1sbjg.png" alt="2024-12-31-7-40-30" width="442" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LSFT&lt;/strong&gt;: The SFT loss term in the ORPO objective function.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LOR&lt;/strong&gt;: The term that maximizes the odds ratio between accepted responses and rejected responses. “OR” stands for “Odds Ratio.”
&lt;/li&gt;
&lt;li&gt;λ: A hyperparameter that controls the relative importance of the two loss terms.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E(x,yw,yl)&lt;/strong&gt;: The expected value over the probability distribution of (x, yw, yl).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklcp040j4rzdp2nw98f6.png" alt="2025-01-01-2-47-57" width="454" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LOR&lt;/strong&gt;: Encourages a higher log odds ratio between yw (preferred) and yl (non-preferred). It wraps the log odds ratio with the log sigmoid function, so we minimize &lt;strong&gt;LOR&lt;/strong&gt; by increasing the log odds ratio between yw and yl.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;σ:&lt;/strong&gt; By wrapping the log ratio in a sigmoid function, we ensure smoothness and differentiability. Large input values approach 1, while small inputs approach 0. Consequently, LOR ranges from 0 to +∞ and we want to push it closer to 0.&lt;/li&gt;
&lt;/ul&gt;
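The two loss terms above can be combined in a minimal sketch of the ORPO objective for one (x, yw, yl) triple. The sequence-level probabilities and the λ value here are illustrative simplifications, not the paper's exact training setup:

```python
import math

def orpo_loss(nll_w, p_w, p_l, lam=0.1):
    """Sketch of L_ORPO = L_SFT + lambda * L_OR for one (x, y_w, y_l).

    nll_w : SFT (negative log-likelihood) loss on the accepted response
    p_w, p_l : model's average probabilities for y_w and y_l
    lam : the paper's lambda hyperparameter (0.1 here is illustrative)
    """
    log_odds_ratio = math.log((p_w / (1 - p_w)) / (p_l / (1 - p_l)))
    sig = 1.0 / (1.0 + math.exp(-log_odds_ratio))  # log-sigmoid wrapper
    l_or = -math.log(sig)      # shrinks toward 0 as y_w's odds dominate
    return nll_w + lam * l_or
```

When the accepted and rejected responses are equally likely, L_OR contributes −log σ(0) = log 2; as the odds ratio grows, the penalty vanishes and only the SFT term remains.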

&lt;h2&gt;
  
  
  4.3 Gradient of ORPO
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The gradient of LOR (∇θLOR) shows that using an odds-ratio-based loss is justified for preference alignment. The gradient effectively adjusts the probability difference between the winning and losing responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkpzl5pbaiz9xd6volb4.png" alt="Pasted-image-20250101154112" width="451" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The gradient of LOR is described as the product of δ(d) and h(d).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vz2naeqwszoaj09rsuk.png" alt="Pasted-image-20250101154241" width="445" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;δ(d) is a penalty term that suppresses non-preferred responses.
&lt;/li&gt;
&lt;li&gt;In the extreme, if oddsθ(yw|x) ≈ oddsθ(yl|x) (i.e., when the odds of the winning and losing responses are similar), δ(d) approaches 1/2.
&lt;/li&gt;
&lt;li&gt;Conversely, if oddsθ(yw|x) ≫ oddsθ(yl|x), meaning the winning response has a much higher odds than the losing response, δ(d) approaches 0.&lt;/li&gt;
&lt;/ul&gt;
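The two limiting cases above imply the closed form δ(d) = 1 / (1 + oddsθ(yw|x) / oddsθ(yl|x)); a small sketch, assuming that form, confirms both limits:

```python
def delta(odds_w, odds_l):
    """Penalty weight delta(d) = 1 / (1 + odds_w / odds_l).
    ~1/2 when the odds are similar (strong corrective signal),
    ~0 when the winning response's odds already dominate."""
    return 1.0 / (1.0 + odds_w / odds_l)
```

So the penalty is strongest exactly when the model has not yet learned to separate the two responses, and fades once it has.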

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxenv4shuisu99iadonpf.png" alt="2025-01-01-3-47-59" width="432" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;h(d) updates the parameters θ in a way that raises the probability of the desired label yw and lowers that of yl.
&lt;/li&gt;
&lt;li&gt;The update for yw carries a scaling factor of 1/(1 − Pθ(yw|x)). When Pθ(yw|x) is already high, this factor is large, so the update magnitude is larger; when Pθ(yw|x) is small, the update is smaller. Intuitively, ORPO keeps accelerating responses the model is already confident are preferred, rather than damping them.
&lt;/li&gt;
&lt;li&gt;1 − Pθ(yw|x) is the probability mass not assigned to yw; since it sits in the denominator of the scaling factor, the update scale increases as Pθ(yw|x) grows and shrinks as it falls.
&lt;/li&gt;
&lt;li&gt;For yl, a negative sign is applied, so it is updated in the opposite direction of yw.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Why “the higher the probability, the larger the update”?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;ORPO aims to widen the gap between the probabilities of yw (preferred) and yl (non-preferred). That is, it makes the good better and the bad worse.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  5 Experimental Settings
&lt;/h1&gt;

&lt;h2&gt;
  
  
  5.1 Training Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We train OPT models at scales from 125M to 1.3B parameters and compare four methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Supervised Fine-Tuning (SFT)&lt;/li&gt;
&lt;li&gt;Proximal Policy Optimization (PPO)&lt;/li&gt;
&lt;li&gt;Direct Preference Optimization (DPO)&lt;/li&gt;
&lt;li&gt;Odds Ratio Preference Optimization (ORPO)
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Starting from an SFT model trained for a single epoch on the accepted responses, the PPO and DPO models are then trained using the TRL library.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We denote this by adding a “+” (e.g., +DPO).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Additionally, the following models are included in the training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phi-2 (2.7B): A pre-trained language model with strong downstream performance&lt;/li&gt;
&lt;li&gt;Llama2 (7B)&lt;/li&gt;
&lt;li&gt;Mistral (7B)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each training configuration and model is tested on two datasets:

&lt;ul&gt;
&lt;li&gt;Anthropic’s HH-RLHF&lt;/li&gt;
&lt;li&gt;Binarized UltraFeedback
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;We filter out cases where yw = yl (preferred and non-preferred are identical) or where yw or yl is empty.&lt;/li&gt;

&lt;/ul&gt;
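The filtering step above is simple to express in code (the triple format and function name are illustrative assumptions about the preprocessing, not from the paper):

```python
def filter_preference_pairs(pairs):
    """Drop pairs where y_w == y_l (identical responses) or where either
    response is empty, mirroring the filtering described above.
    Each element is a (x, y_w, y_l) triple."""
    return [(x, yw, yl) for (x, yw, yl) in pairs
            if yw and yl and yw != yl]

raw = [("q1", "good", "bad"), ("q2", "same", "same"), ("q3", "", "bad")]
clean = filter_preference_pairs(raw)   # keeps only the "q1" pair
```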

&lt;h3&gt;
  
  
  Reward Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We train separate reward models for OPT-350M and OPT-1.3B.&lt;/li&gt;
&lt;li&gt;The objective function of the reward model is as follows:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://imgbb.com/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox0delyi9e1gc7gvbwp7.png" alt="2025-01-01-5-14-19" width="449" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The difference between the scores of the preferred and non-preferred responses is passed through a sigmoid: the output approaches 1 when the difference is large and positive, and 0 when it is negative. Taking the log maps this to a negative value that approaches 0 as the score gap grows, so maximizing it (equivalently, minimizing its negative) trains the reward model to score preferred responses above non-preferred ones.&lt;/li&gt;
&lt;/ul&gt;
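The reward objective described above reduces to a pairwise log-sigmoid loss; a minimal sketch (the reward values in the assertions are illustrative):

```python
import math

def reward_model_loss(r_w, r_l):
    """Pairwise reward objective: -log sigmoid(r_w - r_l).
    Close to 0 when the preferred response scores well above the
    rejected one; grows without bound as the ranking is violated."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))
```

At equal scores the loss is log 2, and it decreases monotonically as the preferred response pulls ahead.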

&lt;h2&gt;
  
  
  5.2 Leaderboard Evaluation Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AlpacaEval Evaluation:&lt;/strong&gt; We compare ORPO to other instruction-tuned models on the AlpacaEval1.0/2.0 benchmarks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Models under comparison: Llama-2 Chat (7B, 13B), Zephyr (α, β)&lt;/li&gt;
&lt;li&gt;AlpacaEval1.0: Uses GPT-4 as an evaluator, checking if a model’s response is preferred over text-davinci-003.&lt;/li&gt;
&lt;li&gt;AlpacaEval2.0: Uses GPT-4-turbo to check if a model’s response is preferred over GPT-4 responses.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;MT-Bench Evaluation:&lt;/strong&gt; This benchmark assesses how well the model follows instructions in multi-turn conversations. GPT-4 is used as the evaluator to judge the quality of answers to challenging questions.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;




&lt;h1&gt;
  
  
  6 Results and Analysis
&lt;/h1&gt;

&lt;h2&gt;
  
  
  6.1 Performance of ORPO
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Compared to RLHF or DPO, ORPO enables the model to learn user preferences more quickly and stably.&lt;/li&gt;
&lt;li&gt;Even with a small amount of data and limited training, it achieves high-level instruction-following performance (shown in Llama-2, Mistral, Phi-2, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.2 Single-Turn Instruction-Following
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Applying ORPO to Phi-2 (2.7B) significantly boosts its AlpacaEval scores (71.80% on 1.0 and 6.35% on 2.0), surpassing Llama-2 Chat 7B (71.34% and 4.96%).&lt;/li&gt;
&lt;li&gt;Applying ORPO to Llama-2 (7B) outperforms the RLHF-tuned Llama-2 Chat 7B and 13B versions on AlpacaEval (81.26% on 1.0, 9.44% on 2.0).&lt;/li&gt;
&lt;li&gt;Applying ORPO to Mistral (7B) produces Mistral-ORPO-α (7B) and β (7B), which rival or outperform the Zephyr series in single-turn tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.3 Multi-Turn Instruction-Following
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;According to MT-Bench, the Mistral-ORPO series (7B) achieves quite competitive scores (7.23–7.32) compared to larger or commercial models such as Llama-2-Chat 70B and Claude.&lt;/li&gt;
&lt;li&gt;This suggests that even single-turn training data can help the model generalize to multi-turn dialogues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.4 Win Rate from the Reward Model Perspective
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;On OPT (125M, 350M, 1.3B), ORPO achieves a much higher win rate compared to SFT, PPO, or DPO.&lt;/li&gt;
&lt;li&gt;The larger the model, the more ORPO outperforms DPO (e.g., a 70.9% win rate against DPO at OPT-1.3B).&lt;/li&gt;
&lt;li&gt;ORPO reliably obtains higher rewards across different datasets (UltraFeedback, HH-RLHF).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.5 Lexical Diversity Analysis
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ORPO-trained models exhibit slightly lower per-input diversity compared to DPO (i.e., they provide more consistent responses for a given prompt), but higher across-input diversity overall.&lt;/li&gt;
&lt;li&gt;This indicates that ORPO is producing more optimized answers for each individual request while maintaining varied response patterns across multiple inputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, ORPO is an efficient and stable method compared to previous approaches (DPO, RLHF, PPO) and achieves improved instruction-following and reward gains across various model sizes. It also balances intra-prompt consistency with inter-prompt variability, producing responses that better meet users’ demands.&lt;/p&gt;




&lt;h1&gt;
  
  
  7 Discussion
&lt;/h1&gt;

&lt;h2&gt;
  
  
  7.1 Comparison to Probability Ratio
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Odds Ratio vs. Probability Ratio
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Traditional preference alignment methods often apply probability ratios (PR) within SFT.&lt;/li&gt;
&lt;li&gt;ORPO, however, adopts the odds ratio (OR), which offers the following advantages over probability ratio:

&lt;ul&gt;
&lt;li&gt;The odds ratio is more sensitive to the model's relative preference between chosen and rejected responses.&lt;/li&gt;
&lt;li&gt;Using a probability ratio can lead to overly strong suppression of non-preferred responses.&lt;/li&gt;
&lt;li&gt;The odds ratio avoids extreme suppression, enabling stable co-training of SFT and preference alignment.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Experimental Simulation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Drawing random probability pairs (X₁, X₂), computing both PR and OR, and then comparing log-scale distributions shows that:

&lt;ul&gt;
&lt;li&gt;PR yields a sharper, more skewed distribution,&lt;/li&gt;
&lt;li&gt;OR, by contrast, is smoother and more spread out.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;When minimizing log sigmoid loss in preference alignment, PR can overly suppress non-preferred tokens, causing abnormal model convergence.&lt;/li&gt;

&lt;/ul&gt;
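&lt;p&gt;This simulation is easy to reproduce. A minimal sketch, assuming the probability pairs are drawn uniformly (the paper's exact sampling setup may differ): the log odds ratio has a visibly wider spread than the log probability ratio.&lt;/p&gt;

```python
import math
import random
import statistics

random.seed(0)
pairs = [(random.random(), random.random()) for _ in range(10000)]

# Log probability ratio vs. log odds ratio for each random pair.
log_pr = [math.log(x1 / x2) for x1, x2 in pairs]
log_or = [math.log((x1 / (1 - x1)) / (x2 / (1 - x2))) for x1, x2 in pairs]

# The log odds ratio spreads out more than the log probability ratio,
# so it contrasts responses without requiring extreme suppression.
print(f"std of log PR: {statistics.pstdev(log_pr):.2f}")
print(f"std of log OR: {statistics.pstdev(log_or):.2f}")
```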

&lt;h3&gt;
  
  
  Conclusion: Why Odds Ratio is More Suitable
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the SFT-inclusive preference alignment setting, moderate “prioritization” outperforms harsh suppression.&lt;/li&gt;
&lt;li&gt;Probability ratio triggers excessive suppression, so ORPO chooses odds ratio to properly reduce non-preferred responses and promote preferred ones.&lt;/li&gt;
&lt;li&gt;This prevents undue logit flattening (which degrades quality) and lets SFT + preference alignment proceed in harmony.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7.2 Minimizing L_OR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reflecting Preferences During ORPO Training
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Visualization of the training process under ORPO reveals that while preferred responses maintain or increase their log probabilities, undesired ones progressively decline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Changes in Log Odds Ratio
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;As the log odds ratio rises, log probabilities of undesired responses steadily drop. Hence, ORPO retains the domain-adaptation benefit of SFT while reducing the likelihood of undesirable outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Effect of λ Parameter
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;By varying λ in ORPO’s loss function, one can analyze how the gap in log probability between preferred and non-preferred responses changes. A larger λ results in stronger suppression of undesired responses, whereas a smaller λ applies milder suppression.&lt;/li&gt;
&lt;/ul&gt;
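&lt;p&gt;A minimal sketch of the odds-ratio term, assuming we already have sequence-level probabilities for the chosen and rejected responses (in the full ORPO loss this term, scaled by λ, is added to the SFT loss):&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def orpo_penalty(p_chosen, p_rejected, lam):
    """lam * L_OR, where L_OR = -log sigmoid(log odds(p_chosen) - log odds(p_rejected))."""
    odds = lambda p: p / (1 - p)
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return lam * -math.log(sigmoid(log_odds_ratio))

# A larger lam penalizes probability mass on the rejected response more
# heavily (stronger suppression); a smaller lam is milder.
for lam in [0.1, 0.5, 1.0]:
    print(lam, round(orpo_penalty(0.6, 0.3, lam), 4))
```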




&lt;h1&gt;
  
  
  8 Conclusion
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;This paper revisited the role of SFT and introduced a single-stage preference alignment method called ORPO, which forgoes a reference model and integrates preference alignment directly.
&lt;/li&gt;
&lt;li&gt;Compared to SFT or RLHF, ORPO was consistently more preferred by reward models at all scales.
&lt;/li&gt;
&lt;li&gt;ORPO’s win rate against DPO also increases as model size grows.
&lt;/li&gt;
&lt;li&gt;Specifically, testing on 2.7B and 7B models showed that ORPO can exceed larger models on AlpacaEval: Mistral-ORPO-α and β scored 11.33% and 12.20% on AlpacaEval 2.0 and 7.23 and 7.32 on MT-Bench, demonstrating both efficiency and efficacy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;While we compared DPO and RLHF, we have not covered a more extensive range of preference alignment algorithms.
&lt;/li&gt;
&lt;li&gt;Future work will explore models beyond 7B and validate generalization across multiple domains and data quality settings.
&lt;/li&gt;
&lt;li&gt;Additionally, we plan to deepen our analysis of how ORPO affects the internal workings of pre-trained models across subsequent preference alignment stages.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>How to use CleanLab wisely</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 08 Nov 2024 09:44:08 +0000</pubDate>
      <link>https://dev.to/bullmouse/how-to-use-cleanlab-wisely-5epf</link>
      <guid>https://dev.to/bullmouse/how-to-use-cleanlab-wisely-5epf</guid>
      <description>&lt;p&gt;Data quality plays an important role in model performance. While sophisticated algorithms continue to emerge, the axiom "garbage in, garbage out" remains true. CleanLab emerges as a groundbreaking solution to this persistent challenge, offering a systematic approach to identifying and correcting label errors in datasets. &lt;br&gt;
CleanLab represents a paradigm shift in data quality management by implementing confident learning algorithms that automatically identify potential label errors. In this post, I test several models with CleanLab to figure out which one best enhances performance. &lt;/p&gt;
&lt;h3&gt;
  
  
  Linear Models
&lt;/h3&gt;

&lt;p&gt;Linear models play a crucial role in CleanLab's data cleaning framework. Their effectiveness stems from several key characteristics that make them particularly valuable for identifying label errors and ensuring data quality. &lt;/p&gt;

&lt;p&gt;Linear models are quite effective in CleanLab: they provide a robust, stable baseline and produce reliable probability estimates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cleanlab.classification&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CleanLearning&lt;/span&gt;

&lt;span class="c1"&gt;# Creating an interpretable linear model
&lt;/span&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Integrating with CleanLab
&lt;/span&gt;&lt;span class="n"&gt;cl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CleanLearning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SVM
&lt;/h3&gt;

&lt;p&gt;A support vector machine (SVM) is a powerful supervised learning algorithm used for classification and regression tasks. It's particularly effective in high-dimensional spaces and is widely used in various machine learning applications. &lt;/p&gt;

&lt;p&gt;The key concepts of SVM are the &lt;strong&gt;margin, support vectors, and the kernel trick&lt;/strong&gt;. The margin is the distance between the decision boundary and the nearest data points; the SVM aims to maximize it, and larger margins generally lead to better generalization. Support vectors are the points closest to the decision boundary: they define the margin, and only these points affect the decision boundary. Finally, the kernel trick transforms non-linear problems into linear ones by mapping the data into a higher-dimensional space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Common kernel options in sklearn
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;

&lt;span class="c1"&gt;# Linear kernel
&lt;/span&gt;&lt;span class="n"&gt;linear_svm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;linear&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# RBF (Gaussian) kernel
&lt;/span&gt;&lt;span class="n"&gt;rbf_svm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rbf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Polynomial kernel
&lt;/span&gt;&lt;span class="n"&gt;poly_svm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;poly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Random Forest Classifier
&lt;/h3&gt;

&lt;p&gt;The Random Forest Classifier is particularly effective in CleanLab due to its ensemble nature and robust probability estimates. Multiple decision trees provide robust predictions, naturally handling outliers and noise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cleanlab.classification&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CleanLearning&lt;/span&gt;

&lt;span class="c1"&gt;# Basic setup
&lt;/span&gt;&lt;span class="n"&gt;rf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_samples_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Integration with CleanLab
&lt;/span&gt;&lt;span class="n"&gt;cl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CleanLearning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rf_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  XGBoost
&lt;/h3&gt;

&lt;p&gt;XGBoost is an ensemble learning technique that builds models sequentially, with each new model correcting the errors made by previous ones. Gradient boosting uses gradient descent to minimize errors and combines weak learners into a strong predictor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cleanlab.classification&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CleanLearning&lt;/span&gt;

&lt;span class="c1"&gt;# Basic XGBoost setup
&lt;/span&gt;&lt;span class="n"&gt;xgb_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Number of boosting rounds
&lt;/span&gt;    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Step size shrinkage
&lt;/span&gt;    &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# Maximum tree depth
&lt;/span&gt;    &lt;span class="n"&gt;min_child_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Minimum sum of instance weight
&lt;/span&gt;    &lt;span class="n"&gt;subsample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Subsample ratio of training instances
&lt;/span&gt;    &lt;span class="n"&gt;colsample_bytree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Subsample ratio of columns
&lt;/span&gt;    &lt;span class="n"&gt;objective&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;binary:logistic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Objective function
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Soft Voting Ensemble
&lt;/h3&gt;

&lt;p&gt;Finally, we can build a soft voting ensemble to use CleanLab more effectively. Soft voting combines probability predictions from multiple models by averaging their predicted probabilities, rather than just taking the majority vote of predicted classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VotingClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cleanlab.classification&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CleanLearning&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize base models
&lt;/span&gt;&lt;span class="n"&gt;svm_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probability&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rbf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xgb_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lr_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create voting classifier
&lt;/span&gt;&lt;span class="n"&gt;ensemble&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VotingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;svm_model&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rf_model&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;xgb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xgb_model&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;voting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;soft&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Integrate with CleanLab
&lt;/span&gt;&lt;span class="n"&gt;cl_ensemble&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CleanLearning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ensemble&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
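&lt;p&gt;Conceptually, soft voting is just an average of each model's class-probability vector. This toy sketch (independent of scikit-learn, with made-up probabilities) shows how it can disagree with a hard majority vote:&lt;/p&gt;

```python
def soft_vote(prob_rows):
    """Average the class-probability vectors from several models for one sample."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    return avg.index(max(avg)), avg

# Three hypothetical models' predicted probabilities for classes [0, 1].
# A hard majority vote would pick class 1 (two models lean that way),
# but the confident first model tips the averaged probabilities to class 0.
preds = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
label, avg = soft_vote(preds)
print(label)  # 0
```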



</description>
    </item>
    <item>
      <title>The Method of Tokenizing: In the Perspective of Sparse Embedding</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Mon, 28 Oct 2024 00:55:16 +0000</pubDate>
      <link>https://dev.to/bullmouse/the-method-of-tokenizing-in-the-perspective-of-sparse-embedding-181</link>
      <guid>https://dev.to/bullmouse/the-method-of-tokenizing-in-the-perspective-of-sparse-embedding-181</guid>
      <description>&lt;p&gt;(In progress)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Poly-Encoder</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 18 Oct 2024 11:55:39 +0000</pubDate>
      <link>https://dev.to/bullmouse/poly-encoder-39ig</link>
      <guid>https://dev.to/bullmouse/poly-encoder-39ig</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;The Poly-Encoder is a neural network architecture used in dialogue systems and information retrieval. Developed by Facebook AI Research in 2020, it combines the strengths of Bi-encoders and Cross-encoders to improve performance and efficiency. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why are we using Poly-Encoder?
&lt;/h3&gt;

&lt;p&gt;The most convincing reason to use the Poly-Encoder is performance: it offers a balance between the speed of Bi-encoders and the accuracy of Cross-encoders, making it particularly effective for tasks that require both efficiency and precision. &lt;/p&gt;

&lt;p&gt;Let's compare Bi-encoders, Cross-encoders, and Poly-encoders to see which one would be the best for you.&lt;/p&gt;

&lt;h5&gt;
  
  
  Bi-encoders
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; Very fast. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy:&lt;/strong&gt; Lower compared to Cross-encoders. They lack direct interaction between query and candidates, potentially missing nuanced relationships.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Cross-encoders
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; Very slow, especially with large candidate sets. They process the query and each candidate together, requiring a separate computation for each pair.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy:&lt;/strong&gt; Very high, much better than Bi-encoders or Poly-encoders. Direct interaction between query and candidates allows for capturing complex relationships.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Poly-encoders
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; Medium. Faster than Cross-encoders and slightly slower than Bi-encoders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy:&lt;/strong&gt; Better than Bi-encoders and approaching that of Cross-encoders. The multiple code vectors and attention mechanism allow for more nuanced matching. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poly-Encoders achieve this balance through pre-computation, an attention mechanism, and multiple code vectors.&lt;/strong&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Structure of Poly-Encoder
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m283ev9fyj4ztsfv6so.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m283ev9fyj4ztsfv6so.png" alt="Poly-Encoder 1" width="800" height="585"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram illustrates the architecture of the Poly-Encoder model, which is designed for efficient and effective text matching.  &lt;/p&gt;

&lt;h4&gt;
  
  
  Context Encoder
&lt;/h4&gt;

&lt;p&gt;The Context Encoder takes multiple inputs (In_x 1, In_x 2, ..., In_x N_x) and processes them to produce multiple output embeddings (Out_x 1, Out_x 2, ..., Out_x N_x). These outputs represent different aspects or features of the input context. &lt;/p&gt;

&lt;h4&gt;
  
  
  Query Codes
&lt;/h4&gt;

&lt;p&gt;This is a set of learnable vectors (Code 1, ..., Code m). They act as global features that the model learns to extract from the context. &lt;/p&gt;

&lt;h4&gt;
  
  
  Attention Mechanism
&lt;/h4&gt;

&lt;p&gt;For each query code, an attention mechanism is applied over the context encoder outputs. This produces a set of embeddings (Emb 1, ..., Emb m) that capture different aspects of the context relevant to each query code.  &lt;/p&gt;

&lt;h4&gt;
  
  
  Context Embedding (Ctxt Emb)
&lt;/h4&gt;

&lt;p&gt;Another attention mechanism combines the embeddings from the previous step into a single context embedding. This embedding represents the entire context, weighted by its relevance to the candidate. &lt;/p&gt;

&lt;h4&gt;
  
  
  Candidate Encoder
&lt;/h4&gt;

&lt;p&gt;This is similar to the context encoder, but it processes candidate responses. The outputs (Out_y 1, Out_y 2, ..., Out_y N_y) represent features of the candidate.&lt;/p&gt;

&lt;h4&gt;
  
  
  Candidate Aggregator
&lt;/h4&gt;

&lt;p&gt;It combines the candidate encoder outputs into a single candidate embedding (Cand emb). &lt;/p&gt;

&lt;h4&gt;
  
  
  Scoring
&lt;/h4&gt;

&lt;p&gt;The final score is computed by combining the context embedding and the candidate embedding. This score represents the relevance or match between the context and the candidate. &lt;/p&gt;
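&lt;p&gt;The pipeline above can be sketched in plain Python. This is a minimal illustration of the scoring flow only, with made-up 2-dimensional toy vectors, not the paper's actual implementation:&lt;/p&gt;

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys):
    """Attention-weighted average of `keys`, weighted by similarity to `query`."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]

# Toy sizes: 2-d embeddings, 3 context encoder outputs, m = 2 learned codes.
ctx_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
codes = [[1.0, 0.0], [0.0, 1.0]]   # learnable query codes
cand_emb = [0.5, 1.0]              # aggregated candidate embedding

# 1) Each query code attends over the context encoder outputs.
embs = [attend(code, ctx_outputs) for code in codes]
# 2) The candidate embedding attends over those m embeddings,
#    yielding a single context embedding.
ctxt_emb = attend(cand_emb, embs)
# 3) Final score: dot product of context and candidate embeddings.
score = dot(ctxt_emb, cand_emb)
print(round(score, 3))
```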

</description>
    </item>
    <item>
      <title>Overview of Retrieval-Augmented Generation (RAG)</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 11 Oct 2024 05:44:06 +0000</pubDate>
      <link>https://dev.to/bullmouse/overview-of-retrieval-augmented-generation-rag-4ik9</link>
      <guid>https://dev.to/bullmouse/overview-of-retrieval-augmented-generation-rag-4ik9</guid>
      <description>&lt;p&gt;(in progress)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>BERT, roBERTa, ELECTRA Overview</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 27 Sep 2024 10:13:51 +0000</pubDate>
      <link>https://dev.to/bullmouse/bert-bidirectional-encoder-representations-from-transformers-4il8</link>
      <guid>https://dev.to/bullmouse/bert-bidirectional-encoder-representations-from-transformers-4il8</guid>
      <description>&lt;p&gt;(in progress)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Basic Linux for AI Developers</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 13 Sep 2024 04:09:49 +0000</pubDate>
      <link>https://dev.to/bullmouse/basic-linux-for-ai-developers-in-progress-2kf9</link>
      <guid>https://dev.to/bullmouse/basic-linux-for-ai-developers-in-progress-2kf9</guid>
      <description>&lt;h2&gt;
  
  
  1. Brief Introduction to Linux
&lt;/h2&gt;

&lt;p&gt;Linux is a crucial OS for AI development due to its open-source nature, stability, efficiency, and compatibility with AI frameworks. It offers scalability, strong community support, and cost-effectiveness, making it an ideal platform for AI projects from development to deployment. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. Essential Linux Commands
&lt;/h2&gt;

&lt;p&gt;There are lots of commands in Linux; however, I'll introduce the ones that are used most commonly. &lt;/p&gt;

&lt;h4&gt;
  
  
  Navigating the file system: ls, cd, pwd &amp;amp; permission command: chmod, chown
&lt;/h4&gt;

&lt;h5&gt;
  
  
  ls
&lt;/h5&gt;

&lt;p&gt;The &lt;strong&gt;'ls'&lt;/strong&gt; command in Linux stands for &lt;strong&gt;"list"&lt;/strong&gt; and is used to display the contents of a directory. It shows files, directories, and sometimes additional details such as file permissions, size, and modification date. &lt;/p&gt;

&lt;p&gt;Here is an example of a basic structure of files and directories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;processed_data&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;model_v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;h5&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gitignore&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="nc"&gt;README&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;
    &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;pyvenv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we type 'ls' in this case,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will show something like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls 
&lt;/span&gt;data models scripts .gitignore README.md venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we'll use the &lt;strong&gt;ls -l&lt;/strong&gt; command to see a detailed view of the directories and files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; 
drwxr-xr-x 2 user user 4096 Sep 21 12:34 data 
drwxr-xr-x 2 user user 4096 Sep 21 12:35 models 
drwxr-xr-x 2 user user 4096 Sep 21 12:36 scripts 
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 user user 220 Sep 21 12:33 .gitignore 
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 user user 345 Sep 21 12:33 README.md 
drwxr-xr-x 6 user user 4096 Sep 21 12:37 venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Linux, file permissions are represented by a string of 10 characters that controls who can read, write, or execute a file or directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;drwxr-xr--

&lt;span class="c"&gt;# we can separate this into groups of 1, 3, 3, and 3 characters&lt;/span&gt;
d rwx r-x r--
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The first character (d) represents the &lt;strong&gt;type of file&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;d:&lt;/strong&gt; directory &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-:&lt;/strong&gt; regular file (non-directory) &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;l:&lt;/strong&gt; symbolic link (special type of file that acts as a reference or pointer to another file or directory)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;The next three characters (rwx) represent the &lt;strong&gt;permissions for the owner.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;r: stands for &lt;strong&gt;read&lt;/strong&gt; permission&lt;/li&gt;
&lt;li&gt;w: stands for &lt;strong&gt;write&lt;/strong&gt; permission&lt;/li&gt;
&lt;li&gt;x: stands for &lt;strong&gt;execute&lt;/strong&gt; permission (for directories, it means the ability to enter or traverse the directory)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;The next three characters (r-x) represent the &lt;strong&gt;permissions for the group.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;r, w, x:&lt;/strong&gt; same meanings as explained above &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-:&lt;/strong&gt; no permission. In this case, the second character is '-', so the group does not have write permission.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;The final three characters (r--) represent the permissions for others. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In this case, the others have only read permission and don't have any write or execute permission.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
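&lt;p&gt;You can print exactly this 10-character string for any file with &lt;strong&gt;stat&lt;/strong&gt;. A minimal sketch, assuming GNU coreutils (mktemp creates its scratch file with owner-only read/write):&lt;/p&gt;

```shell
# A fresh mktemp file is created with mode 600: owner read/write only.
tmp=$(mktemp)

# %A prints the 10-character permission string described above.
stat -c '%A' "$tmp"   # -rw-------

rm -f "$tmp"
```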

&lt;h5&gt;
  
  
  chmod
&lt;/h5&gt;

&lt;p&gt;This command stands for &lt;strong&gt;"change mode"&lt;/strong&gt;. It's used in Unix and Unix-like operating systems to change the access permissions of file system objects (files and directories). &lt;/p&gt;

&lt;p&gt;chmod allows you to set permissions for three categories of users: &lt;strong&gt;the 'owner', 'group', and 'others'.&lt;/strong&gt; For each category, you can set three types of permissions: &lt;strong&gt;Read(r), Write(w), and Execute(x).&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;chmod can be used in two ways. The first is the &lt;strong&gt;symbolic method&lt;/strong&gt;, and the second is the &lt;strong&gt;numeric method&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;The format of the symbolic method looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Format: &lt;span class="nb"&gt;chmod&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;who&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;operation] &lt;span class="o"&gt;[&lt;/span&gt;permissions] filename
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Who: &lt;strong&gt;u&lt;/strong&gt; (user/owner), &lt;strong&gt;g&lt;/strong&gt; (group), &lt;strong&gt;o&lt;/strong&gt; (others), &lt;strong&gt;a&lt;/strong&gt; (all) &lt;/li&gt;
&lt;li&gt;Operation: &lt;strong&gt;+&lt;/strong&gt; (add), &lt;strong&gt;-&lt;/strong&gt; (remove), &lt;strong&gt;=&lt;/strong&gt; (set exactly)&lt;/li&gt;
&lt;li&gt;Permissions: &lt;strong&gt;r&lt;/strong&gt; (read), &lt;strong&gt;w&lt;/strong&gt; (write), &lt;strong&gt;x&lt;/strong&gt; (execute)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example,&lt;br&gt;
 &lt;code&gt;chmod u+x file.sh&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 (Add execute permission for the owner of file.sh)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;chmod go-rw file.txt&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 (Remove read and write permissions for group and others)&lt;/p&gt;
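&lt;p&gt;The symbolic method can be tried safely on a scratch file. A small sketch, assuming GNU coreutils; the &lt;strong&gt;test -x&lt;/strong&gt; check succeeds because the owner of a freshly created temp file is the current user:&lt;/p&gt;

```shell
# Add execute permission for the owner, as in the chmod u+x example.
tmp=$(mktemp)
chmod u+x "$tmp"

# test -x checks whether the current user (here, the owner) can execute it.
test -x "$tmp" && echo "owner can execute"

rm -f "$tmp"
```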

&lt;p&gt;The format of the numeric method looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Format: &lt;span class="nb"&gt;chmod&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;assigned_3_numbers] filename
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each permission is assigned a number: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read (r) = 4&lt;/li&gt;
&lt;li&gt;Write (w) = 2 &lt;/li&gt;
&lt;li&gt;Execute (x) = 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then sum these numbers for each category (owner, group, others).&lt;/p&gt;

&lt;p&gt;For example,&lt;br&gt;
 &lt;code&gt;chmod 755 file.sh&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Owner:&lt;/strong&gt; 7 (4+2+1 =&amp;gt; read, write, and execute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group:&lt;/strong&gt; 5 (4+1 =&amp;gt; read, execute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Others:&lt;/strong&gt; 5 (4+1 =&amp;gt; read, execute)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also options like this:&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
 &lt;code&gt;chmod -R 755 directory&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 (Recursively set permissions for a directory and its contents)&lt;/p&gt;
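&lt;p&gt;The numeric and recursive forms can be combined and verified in a throwaway directory. A sketch assuming GNU coreutils (&lt;strong&gt;stat -c '%a'&lt;/strong&gt; prints the octal mode):&lt;/p&gt;

```shell
# Recursively apply mode 755 to a scratch directory and a file inside it.
dir=$(mktemp -d)
touch "$dir/file.sh"

chmod -R 755 "$dir"

stat -c '%a' "$dir"          # 755
stat -c '%a' "$dir/file.sh"  # 755

rm -rf "$dir"
```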
&lt;h5&gt;
  
  
  chown
&lt;/h5&gt;

&lt;p&gt;This command stands for &lt;strong&gt;"change owner"&lt;/strong&gt;. It's used in Unix and Unix-like OS to change the owner and group of files and directories. The "chown" command lets you change the user owner, the group owner, or both simultaneously. &lt;/p&gt;

&lt;p&gt;The basic syntax of the chown command is like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;options] &amp;lt;owner&amp;gt;:&amp;lt;group&amp;gt; &amp;lt;file/directory&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;owner&amp;gt;:&lt;/strong&gt; The new owner of the file or directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;group&amp;gt;:&lt;/strong&gt; (Optional) The new group for the file or directory. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;file/directory&amp;gt;:&lt;/strong&gt; The file or directory whose ownership you want to change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[options]:&lt;/strong&gt; additional flags or parameters that modify the behavior of the command.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are some examples that illustrate chown well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;example 1: Change the Owner of a Single File&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown &lt;/span&gt;user1 file1.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command changes the owner of &lt;strong&gt;file1.txt&lt;/strong&gt; to &lt;strong&gt;user1&lt;/strong&gt; while keeping the group unchanged. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;example 2: Change Both Owner and Group of a File&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown &lt;/span&gt;user1:group1 file1.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This changes the owner of &lt;strong&gt;file1.txt&lt;/strong&gt; to &lt;strong&gt;user1&lt;/strong&gt; and the group to &lt;strong&gt;group1.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;example 3: Change Ownership of All Files in a Directory Recursively&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; user1:group1 /home/user1/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command changes ownership for all files and subdirectories within &lt;strong&gt;/home/user1/&lt;/strong&gt;, making &lt;strong&gt;user1&lt;/strong&gt; the owner and &lt;strong&gt;group1&lt;/strong&gt; the group for every item inside the directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;example 4: Change Ownership Using a Reference File&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;--reference&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ref_file.txt target_file.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command changes the ownership of &lt;strong&gt;target_file.txt&lt;/strong&gt; to match the ownership of &lt;strong&gt;ref_file.txt.&lt;/strong&gt; It's useful if you want to apply the same ownership attributes as a reference file. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;example 5: Suppress Errors During Ownership Change:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; user1:group1 /some/protected/file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;-f option:&lt;/strong&gt; suppresses most error messages, such as those caused by permission issues, so the command runs without displaying them. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;example 6: Verbose Output to See What's Being Changed:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; user1:group1 file1.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command displays a message for each file whose ownership is changed, providing a clear overview of the modification. &lt;/p&gt;
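&lt;p&gt;Since changing ownership normally requires root privileges, a safe way to experiment is to just read ownership. A sketch assuming GNU coreutils; a file you create is owned by your own user and primary group:&lt;/p&gt;

```shell
# Inspect the owner and group of a scratch file without changing them.
tmp=$(mktemp)

stat -c '%U' "$tmp"   # prints the owning user (the user who ran mktemp)
stat -c '%G' "$tmp"   # prints the owning group

rm -f "$tmp"
```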

&lt;h4&gt;
  
  
  File and directory manipulation: mkdir, touch, cp, mv, rm
&lt;/h4&gt;

&lt;h5&gt;
  
  
  mkdir
&lt;/h5&gt;

&lt;p&gt;The command &lt;strong&gt;'mkdir'&lt;/strong&gt; stands for &lt;strong&gt;"make directory"&lt;/strong&gt;. It's used in command-line interfaces to create one or more new directories (folders) in a file system. To use it, simply type 'mkdir' followed by a directory name; for example, 'mkdir new_folder' creates a directory named "new_folder" in your current location. &lt;/p&gt;

&lt;p&gt;There are several ways to create directories using &lt;strong&gt;"mkdir"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create a single directory&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;documents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create multiple directories at once:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;photos videos music
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Create a directory with spaces in the name:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="s2"&gt;"My Projects"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
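&lt;p&gt;All three variants above can be exercised inside a throwaway directory (a sketch; the directory names are illustrative):&lt;/p&gt;

```shell
# Work inside a temporary directory so nothing permanent is created.
work=$(mktemp -d)
cd "$work"

mkdir documents                # a single directory
mkdir photos videos music      # multiple directories at once
mkdir "My Projects"            # a name with a space must be quoted

ls                             # lists the five new directories

cd / && rm -rf "$work"
```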



&lt;h4&gt;
  
  
  Viewing and editing file contents: cat, less, nano
&lt;/h4&gt;

&lt;h5&gt;
  
  
  cat
&lt;/h5&gt;

&lt;p&gt;The &lt;strong&gt;'cat'&lt;/strong&gt; command is a versatile utility in Unix-like operating systems. The 'cat' stands for &lt;strong&gt;"concatenate"&lt;/strong&gt;, which hints at its core functionality. And it's commonly used to display the contents of a file on the terminal. &lt;/p&gt;

&lt;p&gt;The basic syntax looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;options] &lt;span class="o"&gt;[&lt;/span&gt;file&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key functions of &lt;strong&gt;'cat'&lt;/strong&gt; include: displaying file contents, concatenating multiple files, creating new files, and appending content to existing files. &lt;/p&gt;

&lt;p&gt;Here are some examples of 'cat': &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) View a file's contents&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;myfile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;b) Display contents of multiple files&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file1.txt file2.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;c) Create a new file with content&lt;/strong&gt; (You can then type content and press Ctrl+D to save and exit)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; newfile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;d) Append content to an existing file&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; existingfile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;e) Display file with line numbers:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; myfile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
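&lt;p&gt;The core "concatenate" behavior can be sketched end to end (the file names are illustrative):&lt;/p&gt;

```shell
# Create two small files, concatenate them into a third, and view the result.
work=$(mktemp -d)
printf 'hello\n' > "$work/file1.txt"
printf 'world\n' > "$work/file2.txt"

cat "$work/file1.txt" "$work/file2.txt" > "$work/combined.txt"

cat "$work/combined.txt"   # hello
                           # world

rm -rf "$work"
```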



&lt;h5&gt;
  
  
  less
&lt;/h5&gt;

&lt;p&gt;The 'less' command is &lt;strong&gt;another utility for viewing file contents in Unix-like operating systems, but it offers more advanced features&lt;/strong&gt; compared to &lt;strong&gt;'cat'&lt;/strong&gt;. &lt;strong&gt;'less'&lt;/strong&gt; is a pager program that allows you to view the contents of a file one screen at a time. &lt;/p&gt;

&lt;p&gt;Here are some examples. &lt;/p&gt;

&lt;p&gt;Viewing a File:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;less filename.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command opens filename.txt in less, allowing you to navigate through the file using keyboard shortcuts. &lt;/p&gt;

&lt;p&gt;If you want to view multiple files, you can use it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;less file1.txt file2.txt file3.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows you to view multiple files in sequence. Press &lt;strong&gt;':n'&lt;/strong&gt; to move to the next file and &lt;strong&gt;':p'&lt;/strong&gt; to go to the previous file. &lt;/p&gt;

&lt;p&gt;If you want to view compressed files,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;less file.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On many systems, 'less' can read compressed files directly (via helpers such as lesspipe) without needing to decompress them first.&lt;/p&gt;

&lt;p&gt;You can also display line numbers like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;less &lt;span class="nt"&gt;-N&lt;/span&gt; filename.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows line numbers alongside the content.&lt;/p&gt;

&lt;p&gt;If you want to start at a specific line, you can do it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;less +100 filename.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens the file starting at line 100.&lt;/p&gt;

&lt;p&gt;'less' can also read from standard input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;command&lt;/span&gt; | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows you to pipe the output of another command into 'less' for easier viewing.&lt;/p&gt;

&lt;h5&gt;
  
  
  nano
&lt;/h5&gt;

&lt;p&gt;The 'nano' command is a simple, user-friendly text editor for Unix-like operating systems. It's often considered easier to use than more complex editors like vim or emacs, making it a good choice for beginners or for quick edits. &lt;/p&gt;

&lt;p&gt;Here's an overview of the 'nano' command. The basic usage looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano filename.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens the file 'filename.txt' in the nano editor. If the file doesn't exist, nano will create it. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Transformer Deep Dive</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 06 Sep 2024 07:58:20 +0000</pubDate>
      <link>https://dev.to/bullmouse/transfomer-deep-dive-2ib9</link>
      <guid>https://dev.to/bullmouse/transfomer-deep-dive-2ib9</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Transformers have become one of the most important concepts in Natural Language Processing (NLP), emerging as a response to key limitations in existing approaches. The architecture addresses several challenges faced by earlier models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sequential processing bottleneck:&lt;/strong&gt; RNNs and LSTMs process input sequences one element at a time, limiting their efficiency. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-range dependencies:&lt;/strong&gt; These earlier models struggled to capture relationships between distant elements in a sequence. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attention mechanism potential:&lt;/strong&gt; Previous research showed that attention mechanisms could improve sequence modeling. Transformers explored whether attention alone could replace recurrent structures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training efficiency:&lt;/strong&gt; The Transformer architecture enables faster and more efficient training on large datasets compared to its predecessors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By tackling these issues, Transformers revolutionized NLP tasks, offering improved performance and scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Main Concept of Transformer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2-1. Attention Mechanism
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwy4fs9xzpee2jsu2xcq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwy4fs9xzpee2jsu2xcq.png" alt="Attention Mechanism" width="436" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The attention mechanism in Transformers is inspired by human cognitive processes. Just as humans focus on specific, relevant parts of information when processing complex input, the attention mechanism allows the model to do something similar: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1. Selective focus:&lt;/strong&gt; It enables the model to "pay attention" to certain parts of the input sequence more than others. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;2. Importance weighting:&lt;/strong&gt; The mechanism assigns different weights to different parts of the input, emphasizing more relevant or important information. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;3. Context-dependent processing:&lt;/strong&gt; Unlike fixed weighting, the importance of each part can change based on the current context or task &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4. Efficient information extraction:&lt;/strong&gt; This selective focus helps the model efficiently extract and utilize the most relevant information from the entire input sequence. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;5. Improved handling of long-range dependencies:&lt;/strong&gt; By directly connecting different positions in the sequence, attention helps capture relationships between distant elements more effectively than sequential processing. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2-1-1. Self-Attention
&lt;/h4&gt;

&lt;p&gt;Self-attention is a mechanism that allows a model to focus on different parts of an input sequence when processing each element of that same sequence. &lt;strong&gt;The key distinction between conventional attention and self-attention lies in the reference point.&lt;/strong&gt; In self-attention, each element in the sequence attends to every other element within the same sequence. This means that the input and the reference sequence are identical, which is the defining characteristic of self-attention. &lt;/p&gt;

&lt;p&gt;Then why do we need self-attention rather than normal attention? There are several reasons. &lt;/p&gt;

&lt;p&gt;The first reason is &lt;strong&gt;capturing intra-sequence dependencies.&lt;/strong&gt; Self-attention allows each element to directly interact with every other element in the sequence. This helps in capturing long-range dependencies within the input, which is crucial for understanding context in language. &lt;/p&gt;

&lt;p&gt;The second reason for using self-attention is its ability to enable &lt;strong&gt;parallelization&lt;/strong&gt;. Unlike RNNs, self-attention can be computed for all elements in parallel, which significantly improves computational efficiency. While there are other forms of attention, such as cross-attention used in encoder-decoder architectures, self-attention offers unique advantages. In cross-attention, it's possible to calculate the relevance between encoder states and decoder states. However, the encoder-decoder model tends to operate more sequentially rather than in parallel. Self-attention, by contrast, allows for parallel processing within a single sequence, making it particularly well-suited for modern, high-performance computing architectures. &lt;/p&gt;

&lt;p&gt;Also, there is no sequential bottleneck in self-attention. Normal attention in seq2seq models often relies on sequential processing of inputs; self-attention removes this bottleneck, allowing more efficient processing of long sequences.&lt;/p&gt;

&lt;h4&gt;
  
  
  2-1-2. Multi-Head Attention
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkert1ygb5yr7pyhlmsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkert1ygb5yr7pyhlmsi.png" alt="Multi-Head Attention" width="576" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we'll look at multi-head attention and why we need this mechanism. Multi-head attention &lt;strong&gt;extends the idea of single-head attention by running multiple attention heads in parallel on the same input sequence.&lt;/strong&gt; This allows the model to learn different types of relationships and patterns within the input data simultaneously, considerably enhancing the expressive power of the model compared to using just a single attention head. &lt;/p&gt;

&lt;p&gt;Multi-head attention enhances the normal attention mechanism by &lt;strong&gt;utilizing multiple sets&lt;/strong&gt; of learnable Query (Q), Key (K), and Value (V) matrices instead of a single set. This approach offers two key advantages. Firstly, it &lt;strong&gt;enables parallel processing,&lt;/strong&gt; potentially increasing computational speed compared to normal attention. Secondly, &lt;strong&gt;it allows the model to simultaneously focus on different aspects of the input.&lt;/strong&gt; By creating several sets of Q, K, and V matrices, multi-head attention can capture various facets of the input in parallel, facilitating the learning of diverse features and relationships within the data. This capability typically results in improved model performance, particularly for complex tasks such as machine translation. While multi-head attention does require more computational resources, the enhanced results it produces often justify this additional cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  2-2. Positional Encoding
&lt;/h3&gt;

&lt;p&gt;It's true that the Transformer model can process input in parallel, making it faster than sequential models like RNNs or LSTMs. However, this parallel processing means the Transformer can't inherently capture the sequence of words. To address this, we need to add positional information separately. Word position is crucial because changing the order of words can alter the sentence's meaning. &lt;/p&gt;

&lt;p&gt;When applying positional encoding, we must consider two key factors. First, each position should have a unique identifier that remains consistent regardless of the sequence length or input. This ensures the positional embedding works identically even if the sequence changes. Second, we must be careful not to make the positional values too large, as this could overshadow the semantic or syntactic information, hindering effective training in the attention layer.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5w7yudsiq5d92r2crb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5w7yudsiq5d92r2crb0.png" alt="Sine &amp;amp; Cosine Functions" width="721" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A technique that addresses the factors mentioned earlier is the use of &lt;strong&gt;Sine &amp;amp; Cosine Functions&lt;/strong&gt; for positional encoding. This approach offers several advantages. Firstly, sine and cosine functions always maintain values between -1 and 1, ensuring that the positional encoding doesn't overshadow the input embeddings. While sigmoid functions also satisfy this constraint, they're less suitable because the gap between values for adjacent positions can become extremely small. &lt;/p&gt;

&lt;p&gt;Some might worry that different positions could yield identical values. However, this concern is mitigated by using multiple sine and cosine functions of different frequencies to create a vector representation for each position. By increasing the frequency of these functions, we can create more varied encodings, making it highly unlikely for different positions to have the same representation. &lt;/p&gt;

&lt;p&gt;The equation representing the different frequencies, which depend on position and dimension, looks like this. Here, &lt;strong&gt;i&lt;/strong&gt; denotes the position, and &lt;strong&gt;d&lt;/strong&gt; the number of dimensions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbboyorcbtpiqcrkwr0aq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbboyorcbtpiqcrkwr0aq.jpeg" alt="Sine &amp;amp; Cosine Equation" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2-3. Encoder-Decoder
&lt;/h3&gt;

&lt;p&gt;The Transformer architecture consists of two main components: the Encoder and Decoder. &lt;/p&gt;

&lt;h4&gt;
  
  
  2-3-1. Encoder
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7q6ggerkvwb9flf866zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7q6ggerkvwb9flf866zu.png" alt="Encoder" width="328" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The encoder processes the input sequence and is composed of multiple identical layers. Each layer has two sub-layers. The first sub-layer is the Multi-Head Attention mechanism; after it, the second sub-layer, a feedforward neural network, is applied. Both the multi-head self-attention and the feedforward neural network are followed by a residual connection and layer normalization. The entire process within each encoder layer looks like this: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input → Multi-head Self-Attention → Add &amp;amp; Norm → Feedforward NN → Add &amp;amp; Norm → Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This entire process is then repeated for each subsequent encoder layer in the stack. Next, I'll discuss the Add (Residual Connection) &amp;amp; Norm (Layer Normalization) step in detail.&lt;/p&gt;

&lt;h5&gt;
  
  
  2-3-1-1. Residual Connections
&lt;/h5&gt;

&lt;p&gt;Residual connections, also known as "skip connections" or "shortcut connections", serve to preserve information flow in neural networks. These connections allow unmodified input information to pass directly through the network, helping to retain important features from the original input that might otherwise be lost through multiple-layer transformations. Crucially, this approach helps mitigate the difficulty of learning in deep neural networks by providing a direct path for information and gradients to flow.&lt;/p&gt;

&lt;h5&gt;
  
  
  2-3-1-2. Layer Normalization
&lt;/h5&gt;

&lt;p&gt;Layer normalization is applied in the "Add &amp;amp; Norm" step of Transformers to stabilize learning, reduce training time, and decrease dependence on careful initialization. It works by calculating the mean and variance across the last dimension (also called the feature dimension), then normalizing the input using these statistics. Finally, it scales and shifts the result with learnable parameters. This process helps maintain consistent value scales throughout the network, allowing each layer to learn more effectively and independently. &lt;/p&gt;

&lt;h4&gt;
  
  
  2-3-2. Decoder
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y0n2mvj1h8s2806f4zq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y0n2mvj1h8s2806f4zq.jpg" alt="Transformer Decoder" width="296" height="863"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Transformer's decoder generates the output sequence, differing from the encoder in key aspects. It operates sequentially rather than in parallel, employs both self-attention and cross-attention mechanisms, and uses masked self-attention to preserve the autoregressive property. The decoder maintains a causal dependency on previous outputs and includes an additional output projection layer. These distinctions enable the decoder to produce coherent outputs based on the encoded input.&lt;/p&gt;

&lt;h5&gt;
  
  
  2-3-2-1. Masked Multi-Head Attention
&lt;/h5&gt;

&lt;p&gt;The first thing we want to discuss in the decoder is &lt;strong&gt;Masked Multi-Head Attention&lt;/strong&gt;. This structure exists primarily to maintain the autoregressive property during training and inference. The decoder generates output tokens one at a time, from left to right, and each token should only depend on previously generated tokens. Accordingly, the model can only attend to previous tokens in the sequence, not future ones. This approach allows the model to be trained in a way that mimics the inference process, where future tokens are unknown, while still permitting parallel computation during training.&lt;/p&gt;
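&lt;p&gt;A minimal sketch of the masking itself, with random scores standing in for real query-key products: entries above the diagonal are set to a large negative number before the softmax, so those positions receive (numerically) zero weight:&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
scores = np.random.default_rng(2).normal(size=(T, T))  # raw attention scores

# Causal mask: position i may attend only to positions up to and including i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)       # True strictly above the diagonal
scores = np.where(mask, -1e9, scores)                  # block future positions

weights = softmax(scores)
# The upper triangle of the weight matrix is zero: no attention to the future,
# yet all rows are computed at once, which is what enables parallel training.
```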

&lt;h5&gt;
  
  
  2-3-2-2. Multi-Head Attention for Decoder
&lt;/h5&gt;

&lt;p&gt;In the Transformer architecture, the interaction between the encoder and decoder is facilitated by a critical mechanism known as cross-attention. This process utilizes queries derived from the decoder and key-value pairs obtained from the encoder, serving as a vital link between the two components. The queries, originating from the current decoder layer, represent the specific information the decoder seeks at each step of the output generation. Conversely, the keys and values, products of the encoder's processing, encapsulate the essence of the input sequence. &lt;/p&gt;

&lt;p&gt;This intricate interplay allows the decoder to align its output generation with the most relevant elements of the input. By doing so, it captures and leverages the context from the entire input sequence, rather than being limited to a narrow window of information. The beauty of this mechanism lies in its dynamic nature; for each output token being generated, the decoder can shift its focus to different parts of the input, ensuring that the most pertinent information is always at the forefront of the generation process.&lt;br&gt;
Essentially, this attention mechanism acts as a sophisticated bridge between the input—the source sequence meticulously processed by the encoder—and the output, which the decoder generates one token at a time. This bridging effect is crucial in enabling the Transformer to perform complex tasks such as translation, summarization, or question-answering with remarkable accuracy and contextual awareness. It allows the model to maintain a nuanced understanding of the input throughout the entire generation process, ensuring that each output token is produced with consideration of the full context provided by the input sequence.&lt;/p&gt;
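&lt;p&gt;A toy NumPy sketch of cross-attention, with random, untrained projection matrices; the only structural point it illustrates is where the queries, keys, and values come from:&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
d = 8
enc_out = rng.normal(size=(6, d))    # encoder output: one vector per source token
dec_state = rng.normal(size=(2, d))  # current decoder-layer states

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q = dec_state @ Wq   # queries come from the decoder
K = enc_out @ Wk     # keys come from the encoder
V = enc_out @ Wv     # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d))  # (2, 6): per decoder step, attention over source tokens
context = weights @ V                    # (2, 8): encoder information blended per decoder step
```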

&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;The Transformer architecture represents a significant leap forward in natural language processing, addressing key limitations of previous models like RNNs and LSTMs. Its core innovations – the self-attention mechanism, multi-head attention, and positional encoding – have revolutionized how machines process and understand language. &lt;/p&gt;

&lt;p&gt;These innovations have not only improved performance on various NLP tasks but have also paved the way for larger, more powerful language models. The Transformer's scalability and efficiency have become the foundation for models like BERT, GPT, and their successors, driving rapid advancements in the field.&lt;/p&gt;

&lt;p&gt;As NLP continues to evolve, the principles introduced by the Transformer architecture remain central to cutting-edge research and applications, underscoring its lasting impact on the field of artificial intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Attention_(machine_learning)&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aiml.com/what-is-multi-head-attention-and-how-does-it-improve-model-performance-over-single-attention-head/" rel="noopener noreferrer"&gt;https://aiml.com/what-is-multi-head-attention-and-how-does-it-improve-model-performance-over-single-attention-head/&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://voidism.github.io/notes/2020/01/26/What-has-the-positional-embedding-learned/" rel="noopener noreferrer"&gt;https://voidism.github.io/notes/2020/01/26/What-has-the-positional-embedding-learned/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Exploring Sequence Transformation Models in NLP: An Overview of Seq2Seq, RNN, LSTM, and GRU</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 30 Aug 2024 08:48:24 +0000</pubDate>
      <link>https://dev.to/bullmouse/two-types-of-visualize-embedding-pca-and-t-sne-5ang</link>
      <guid>https://dev.to/bullmouse/two-types-of-visualize-embedding-pca-and-t-sne-5ang</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;In mathematics, a sequence transformation is an operator acting on a given space of sequences (a sequence space). Sequence transformations include linear mappings such as convolution with another sequence and resummation of a sequence, and more generally, are commonly used for series acceleration. &lt;/p&gt;

&lt;p&gt;I’ll discuss four types of sequence transformation models: Seq2Seq, RNN, LSTM, and GRU, and how they are constructed.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Seq2Seq
&lt;/h2&gt;

&lt;p&gt;The Seq2Seq model is a type of machine learning approach used in NLP. It can be applied to various tasks, including language translation, image captioning, conversational models, and text summarization. Seq2Seq works by transforming one sequence into another, enabling these diverse applications.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5xqo0knzkiqbzj3yl11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5xqo0knzkiqbzj3yl11.png" alt="Encoder Example" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Seq2Seq model has three main components. The first is the &lt;strong&gt;'Encoder'&lt;/strong&gt;. The encoder is responsible for processing the input sequence and capturing its essential information, which is stored as the hidden state of the network and, in a model with an attention mechanism, a context vector. The context vector is the weighted sum of the input hidden states and is generated for every time instance in the output sequence. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbtrk0i5kdsu8kescy8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbtrk0i5kdsu8kescy8x.png" alt="Decoder Example" width="800" height="343"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;decoder&lt;/strong&gt; takes the context vector and hidden states from the encoder to generate the final output sequence. It operates in an autoregressive manner, producing one element of the output sequence at a time. &lt;/p&gt;

&lt;p&gt;At each step, it considers the previously generated elements, the context vector, and the input sequence information to predict the next element in the sequence. &lt;/p&gt;

&lt;p&gt;In models with an attention mechanism, the context vector and hidden state are combined into an attention vector, which serves as input to the decoder. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ytd1qsrxrf3a4letxez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ytd1qsrxrf3a4letxez.png" alt="Seq2Seq Model: Attention" width="800" height="434"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Finally, we'll look at the &lt;strong&gt;Attention&lt;/strong&gt; mechanism. The basic Seq2Seq architecture has a limitation: as the input sequence grows longer, the single hidden state produced by the encoder becomes less informative for the decoder.  &lt;/p&gt;

&lt;p&gt;Attention enables the model to &lt;strong&gt;selectively focus&lt;/strong&gt; on different parts of the input sequence during the decoding process. At each decoder step, an alignment model computes a score between the decoder's state and each encoder hidden vector. The alignment model is another neural network, trained jointly with the Seq2Seq model, that measures how well an input position, represented by its hidden state, matches the previous output, represented by the attention hidden state.&lt;/p&gt;
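&lt;p&gt;One common choice of alignment model is the additive (Bahdanau-style) form. The following is a toy NumPy sketch with random, untrained weights, showing how scores over input positions become a context vector:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
enc_hidden = rng.normal(size=(5, d))   # encoder hidden states, one per input token
dec_hidden = rng.normal(size=(d,))     # previous decoder hidden state

# Additive (Bahdanau-style) alignment model: a small network trained jointly
# with the Seq2Seq model (here its weights are just random placeholders).
Wa, Ua = rng.normal(size=(d, d)), rng.normal(size=(d, d))
va = rng.normal(size=(d,))

scores = np.tanh(enc_hidden @ Ua + dec_hidden @ Wa) @ va  # one score per input position
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()                          # softmax over input positions

context = weights @ enc_hidden   # context vector: weighted sum of encoder hidden states
```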

&lt;h2&gt;
  
  
  3. RNN
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recurrent neural networks(RNNs)&lt;/strong&gt; are a class of artificial neural networks for sequential data processing. &lt;/p&gt;

&lt;p&gt;RNNs process data across multiple time steps, making them well-adapted for modeling and processing text, speech, and time series. &lt;/p&gt;

&lt;p&gt;The fundamental building block of an RNN is the recurrent unit. This unit maintains a hidden state, essentially a form of memory, which is updated at each step based on the current input and the previous hidden state.   &lt;/p&gt;

&lt;h3&gt;
  
  
  3-1. Configuration of RNN
&lt;/h3&gt;

&lt;p&gt;An RNN-based model can be factored into two parts: Configuration and Architecture. Multiple RNNs can be combined in a data flow, and the data flow itself is the configuration. Each RNN itself may have any architecture, including LSTM, GRU, etc. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb6er7t8dza06trmiiih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb6er7t8dza06trmiiih.png" alt="Standard RNN" width="682" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Standard
&lt;/h4&gt;

&lt;p&gt;RNNs come in many variants. A standard RNN is defined as in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35mfv5qc0kvod8ank42l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35mfv5qc0kvod8ank42l.png" alt="RNN standard equation" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other words, it is a neural network that &lt;strong&gt;maps an input x_t into an output y_t with the hidden vector h_t&lt;/strong&gt; playing the role of "memory", a partial record of all previous input-output pairs. At each step, it transforms input to output and modifies its "memory" to help it to better perform future processing.&lt;/p&gt;
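&lt;p&gt;That definition can be written out as a few lines of NumPy (toy dimensions, random untrained weights): one step maps the input and previous hidden state to a new hidden state and an output, and the loop unrolls it over time:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_h, d_out = 3, 4, 2

# Parameters of a single recurrent unit
Wx = rng.normal(size=(d_h, d_in)) * 0.1
Wh = rng.normal(size=(d_h, d_h)) * 0.1
Wy = rng.normal(size=(d_out, d_h)) * 0.1
bh, by = np.zeros(d_h), np.zeros(d_out)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(Wx @ x_t + Wh @ h_prev + bh)   # update the "memory"
    y_t = Wy @ h_t + by                          # map hidden state to output
    return h_t, y_t

h = np.zeros(d_h)
xs = rng.normal(size=(6, d_in))   # a length-6 input sequence
ys = []
for x_t in xs:                    # unroll over time steps
    h, y_t = rnn_step(x_t, h)
    ys.append(y_t)
ys = np.array(ys)                 # one output per time step
```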

&lt;h4&gt;
  
  
  Stacked RNN
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mbyrfi24x8sv7cvdwse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mbyrfi24x8sv7cvdwse.png" alt="Stacked RNN" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A stacked RNN, or deep RNN, is composed of multiple RNNs stacked one above the other. Abstractly, it is structured as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ytonm13utpbqwpfxtah.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ytonm13utpbqwpfxtah.jpeg" alt="Stacked RNN Equation" width="800" height="786"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Each layer operates as a stand-alone RNN, and each layer's output sequence is used as the input sequence to the layer above. There is no conceptual limit to the depth of the RNN.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bi-directional
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z3tkngx69ew0ayiypnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z3tkngx69ew0ayiypnp.png" alt="Bi-directional" width="640" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A bi-directional RNN is composed of two RNNs, one processing the input sequence in one direction, and another in the opposite direction. Abstractly, it is structured as follows: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyawtgaxk2n8bgc229kw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyawtgaxk2n8bgc229kw.jpeg" alt="Bi-directional RNN equation" width="800" height="584"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;A bidirectional RNN allows the model to process a token in the context of both what came before it and what came after it. By stacking multiple bidirectional RNNs together, the model can build increasingly contextual representations of each token. &lt;/p&gt;
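&lt;p&gt;A minimal sketch, assuming a bare tanh recurrence with random untrained weights: run one RNN left-to-right, another right-to-left, and concatenate the per-position hidden states:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(6)
d_in, d_h = 3, 4
xs = rng.normal(size=(5, d_in))   # a length-5 input sequence

def run_rnn(xs, Wx, Wh):
    # Return the hidden state at every time step of a simple tanh RNN.
    h, hs = np.zeros(d_h), []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h)
        hs.append(h)
    return np.array(hs)

params_f = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))
params_b = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))

h_forward = run_rnn(xs, *params_f)               # left-to-right pass
h_backward = run_rnn(xs[::-1], *params_b)[::-1]  # right-to-left pass, re-aligned

# Each position's representation now sees both its left and right context.
h_bi = np.concatenate([h_forward, h_backward], axis=-1)  # shape (5, 2 * d_h)
```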

&lt;h3&gt;
  
  
  3-2. Architecture of RNN
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Fully recurrent
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagi7idsyv7lcsww35wl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagi7idsyv7lcsww35wl5.png" alt="Fully recurrent model" width="420" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fully recurrent neural networks (FRNN) connect the outputs of all neurons to the inputs of all neurons. In other words, it is a fully connected network. This is the most general neural network topology because all other topologies can be represented by setting some connection weights to zero to simulate the lack of connections between those neurons.&lt;/p&gt;

&lt;h4&gt;
  
  
  Hopfield
&lt;/h4&gt;

&lt;p&gt;The Hopfield network is an RNN in which all connections are symmetric. It requires stationary inputs and is thus not a general RNN, as it does not process sequences of patterns. However, it is guaranteed to converge. &lt;/p&gt;

&lt;h4&gt;
  
  
  Elman networks and Jordan networks
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzopbnzu69a46ifm9ycd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzopbnzu69a46ifm9ycd6.png" alt="Elman networks and Jordan networks" width="673" height="745"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;An Elman network is a three-layer network (arranged horizontally as x, y, and z in the illustration) with the addition of a set of context units (u in the illustration). The middle (hidden) layer is connected to these context units fixed with a weight of one. &lt;/p&gt;

&lt;p&gt;At each time step, the input is fed forward and a learning rule is applied. The fixed back-connections save a copy of the previous values of the hidden units in the context units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform tasks such as sequence prediction that are beyond the power of a standard multilayer perceptron. &lt;/p&gt;

&lt;p&gt;Jordan networks are similar to Elman networks. The context units are fed from the output layer instead of the hidden layer. The context units in a Jordan network are also called the state layer. They have a recurrent connection to themselves. &lt;/p&gt;

&lt;p&gt;Elman and Jordan networks are also known as "simple recurrent networks" (SRNs). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs115gsyh2x14xrfvikyc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs115gsyh2x14xrfvikyc.jpeg" alt="Elman &amp;amp; Jordan" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;x&lt;/strong&gt;: input vector &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;h&lt;/strong&gt;: hidden layer vector &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;y&lt;/strong&gt;: output vector &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;W, U and b&lt;/strong&gt;: parameter matrices and bias vector &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sigma-h and sigma-y&lt;/strong&gt;: Activation functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. LSTM
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61a5jwgch842m74xe2s7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61a5jwgch842m74xe2s7.jpeg" alt="LSTM" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long short-term memory (LSTM)&lt;/strong&gt; is a type of recurrent neural network (RNN) aimed at dealing with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs. It aims to provide a short-term memory for RNNs that can last thousands of timesteps, hence "long short-term memory". The name is made in analogy with long-term memory and short-term memory and their relationship, studied by cognitive psychologists since the early 20th century.  &lt;/p&gt;

&lt;p&gt;A common LSTM unit is composed of a &lt;strong&gt;cell&lt;/strong&gt;, an &lt;strong&gt;input gate&lt;/strong&gt;, an &lt;strong&gt;output gate&lt;/strong&gt;, and a &lt;strong&gt;forget gate&lt;/strong&gt;. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Forget gates:&lt;/strong&gt; decide what information to discard from the previous state by mapping the previous state and the current input to a value between 0 and 1. A value close to 1 means to keep the information, and a value close to 0 means to discard it. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Input gates:&lt;/strong&gt; decide which pieces of new information to store in the current cell state, using the same system as forget gates. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Output gates:&lt;/strong&gt; control which pieces of information in the current cell state to output by assigning a value from 0 to 1 to the information, considering the previous and current states. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selectively outputting relevant information from the current state allows the LSTM network to maintain useful, long-term dependencies to make predictions, both in current and future time steps. &lt;/p&gt;
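&lt;p&gt;The cell and the three gates can be sketched in a few lines of NumPy (toy sizes, random untrained weights, biases omitted for brevity):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
d_in, d_h = 3, 4

# One weight matrix per gate, each acting on the concatenation of h_prev and x_t.
Wf, Wi, Wo, Wc = (rng.normal(size=(d_h, d_h + d_in)) * 0.5 for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)             # forget gate: what to discard from c_prev
    i = sigmoid(Wi @ z)             # input gate: which new information to store
    o = sigmoid(Wo @ z)             # output gate: what to expose from the cell
    c_tilde = np.tanh(Wc @ z)       # candidate cell contents
    c_t = f * c_prev + i * c_tilde  # update the cell (the long-lived memory)
    h_t = o * np.tanh(c_t)          # hidden state exposed to the next layer/step
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(6, d_in)):  # run over a length-6 sequence
    h, c = lstm_step(x_t, h, c)
```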

&lt;h2&gt;
  
  
  5. GRU
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon3tczp4xi01hhahtmkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon3tczp4xi01hhahtmkz.png" alt="GRU" width="640" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;gated recurrent unit (GRU)&lt;/strong&gt; was designed as a simplification of the LSTM. GRUs are used in their full form and in several further simplified variants, and they have fewer parameters than LSTMs because they lack an output gate.&lt;/p&gt;
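&lt;p&gt;A toy NumPy sketch of one GRU step (random untrained weights, biases omitted): an update gate blends the previous state with a candidate state, and there is no separate cell state or output gate:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(8)
d_in, d_h = 3, 4
Wz, Wr, Wh = (rng.normal(size=(d_h, d_h + d_in)) * 0.5 for _ in range(3))

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ np.concatenate([h_prev, x_t]))            # update gate
    r = sigmoid(Wr @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # blend old and new; no cell, no output gate

h = np.zeros(d_h)
for x_t in rng.normal(size=(6, d_in)):  # run over a length-6 sequence
    h = gru_step(x_t, h)
```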

&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we explored four different types of sequence transformation models: Seq2Seq, RNN, LSTM, and GRU. Each of these models has its unique strengths and applications in the field of natural language processing (NLP) and other domains involving sequential data. &lt;/p&gt;

&lt;p&gt;Choosing the right model depends on the specific task at hand. For tasks requiring long-term dependencies, LSTMs, and GRUs may be more appropriate, while Seq2Seq models excel in tasks that involve mapping one sequence to another, particularly when enhanced with attention mechanisms.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Seq2seq" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Seq2seq&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Sequence_transformation" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Sequence_transformation&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Recurrent_neural_network" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Recurrent_neural_network&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Data Visualization: How the Skewness and Kurtosis Lead Visual Distortion</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 23 Aug 2024 09:51:09 +0000</pubDate>
      <link>https://dev.to/bullmouse/data-visualization-how-the-skewness-and-kurtosis-lead-visual-distortion-4fco</link>
      <guid>https://dev.to/bullmouse/data-visualization-how-the-skewness-and-kurtosis-lead-visual-distortion-4fco</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Data visualization is one of the most important parts of modern society, so it is important to show data distributions clearly and without distortion. In some cases, however, visualizations are distorted to mislead people intentionally. There are many cases that illustrate this problem. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhtolty1xdbilbq2mem3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhtolty1xdbilbq2mem3.jpeg" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;br&gt;
This image is an example of how data can be intentionally misleading. Tim Cook presented a graph of cumulative iPhone sales, which at first glance may not seem distorted, especially since the title clearly states “Cumulative iPhone Sales.” However, the graph inherently shows an upward trend due to the nature of cumulative data, which can create a misleading impression of continuous growth, regardless of the actual sales rate. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw8batfdg5te74cgjkvn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw8batfdg5te74cgjkvn.jpeg" alt="Image description" width="800" height="422"&gt;&lt;/a&gt;&lt;br&gt;
This graph reveals declining iPhone sales, but it’s difficult to discern this from the graph that Tim Cook presented. As illustrated in this case, data is sometimes selectively presented to create a more favorable impression, leading others to interpret it in a way that benefits the presenter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2et116ksdt5qt554rvdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2et116ksdt5qt554rvdd.png" alt="Image description" width="663" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some other examples that can lead to misleading judgments. Jason E. Chaffetz, a former U.S. representative for Utah’s 3rd congressional district, presented a graph suggesting that there are more abortions than cancer screening and prevention services. However, this graph is highly distorted because it doesn’t accurately reflect the absolute figures. According to Chaffetz’s version, there were 327,000 abortions in 2013, while cancer screenings numbered 935,573—nearly three times the number of abortions. As a result, the graph presented in the “Honest Version” is much more accurate than Chaffetz’s version. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. Definition
&lt;/h2&gt;

&lt;p&gt;Now, I'll discuss the definitions of 'skewness' and 'kurtosis', and why these two properties can distort data.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1. Skewness
&lt;/h3&gt;

&lt;p&gt;Skewness is a statistical measure that describes the asymmetry of a distribution around its mean. In other words, it indicates whether the data points in a dataset are distributed evenly on both sides of the mean or if they tend to cluster more on one side.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj835kg9d5gxyqvr3sug0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj835kg9d5gxyqvr3sug0.png" alt="Image description" width="446" height="159"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;There are two types of skewness to consider. The first is ‘Negative Skew (Left Skew),’ where the distribution is negatively skewed when the tail on the left side (lower values) is longer or fatter than the right side. In this case, most data points cluster on the right side of the distribution, with fewer smaller values extending the tail to the left. The second type is ‘Positive Skew (Right Skew),’ which is essentially the opposite. Here, most data points cluster on the left side of the distribution, with fewer larger values extending the tail to the right.&lt;/p&gt;
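&lt;p&gt;Skewness can be estimated as the third standardized moment. A small NumPy check on synthetic data (an exponential sample and its mirror image) shows both signs:&lt;/p&gt;

```python
import numpy as np

def skewness(x):
    # Third standardized moment: about 0 for a symmetric distribution,
    # positive for a right (long right tail) skew, negative for a left skew.
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

rng = np.random.default_rng(9)
symmetric = rng.normal(size=100_000)
right_skewed = rng.exponential(size=100_000)   # long tail to the right
left_skewed = -right_skewed                    # mirror image: long tail to the left
```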

&lt;h3&gt;
  
  
  2.2. Kurtosis
&lt;/h3&gt;

&lt;p&gt;Kurtosis is a statistical measure that describes the "tailedness" of a distribution, that is, the shape of its tails relative to its overall shape, particularly compared to a normal distribution. It provides insight into the extremity of deviations (outliers) in a dataset. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzykatdh9prlfo8y0zfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzykatdh9prlfo8y0zfu.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This graph illustrates the concept of kurtosis by comparing different types of distributions. It shows three curves representing different levels of kurtosis relative to the normal distribution. &lt;/p&gt;

&lt;p&gt;First, let’s discuss positive kurtosis, also known as leptokurtic. A leptokurtic distribution has a taller and sharper peak than a normal distribution. It exhibits higher peaks and more extreme outliers, indicating that the data is more concentrated around the mean with a greater occurrence of extreme values.&lt;/p&gt;

&lt;p&gt;On the other hand, negative kurtosis, also referred to as platykurtic, is the opposite of leptokurtic. A platykurtic distribution is flatter and broader than a normal distribution, with a lower peak and thinner tails. Distributions with negative kurtosis have fewer outliers and a more even spread of data values, indicating that the data is more evenly distributed with fewer extreme values.&lt;/p&gt;
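&lt;p&gt;Excess kurtosis (the fourth standardized moment minus 3, so a normal distribution scores about 0) can be checked the same way on synthetic samples; the Laplace and uniform distributions serve as stand-ins for leptokurtic and platykurtic data:&lt;/p&gt;

```python
import numpy as np

def excess_kurtosis(x):
    # Fourth standardized moment minus 3, so the normal distribution scores ~0.
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 4) / x.std() ** 4 - 3

rng = np.random.default_rng(10)
normal = rng.normal(size=200_000)               # mesokurtic baseline
leptokurtic = rng.laplace(size=200_000)         # sharper peak, heavier tails
platykurtic = rng.uniform(-1, 1, size=200_000)  # flatter top, thin tails
```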

&lt;h2&gt;
  
  
  3. Problem &amp;amp; Solving
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1. The Perspective of Skewness
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8twm0bqdks9x5sf68kh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8twm0bqdks9x5sf68kh7.png" alt="Image description" width="800" height="395"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This image illustrates how skewness affects the relationship between the mean, median, and mode. In a symmetrical distribution, the mean, median, and mode are nearly the same. In a negatively skewed distribution, however, the mean and median are pulled below the mode by the long left tail, so a reader who assumes symmetry may overestimate the typical value. Conversely, in a positively skewed distribution, the mean and median are greater than the mode, with the mean pulled furthest above the median by the right tail. &lt;/p&gt;

&lt;p&gt;Furthermore, extreme skewness can cause people to focus on the minority of the data rather than the majority. For instance, in a dataset with high positive skewness, most of the data is concentrated on the left side of the histogram, while a long tail extends to the right, representing a few high-value outliers. Conversely, in a dataset with high negative skewness, the histogram shows most of the data concentrated on the right, with a long tail extending to the left. These outliers can complicate decision-making. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssjpp7p1nkthrzqhxu3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssjpp7p1nkthrzqhxu3o.png" alt="Image description" width="591" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To address this issue, &lt;strong&gt;log transformations or exponential transformations&lt;/strong&gt; can be applied to reduce skewness and minimize distortion. Additionally, providing further context can help ensure a more accurate interpretation of the data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log transformation&lt;/strong&gt; is a common technique used to normalize a positively skewed distribution, where the tail extends to the right. The logarithm compresses large values into a smaller range (for example, log(1000) = 3 while log(10) = 1), shrinking the gaps between large numbers, while spreading smaller values further apart. This narrows the distribution, shortens the right tail, and yields a more symmetric shape.  &lt;/p&gt;
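&lt;p&gt;As a quick numerical illustration (a minimal NumPy sketch; the log-normal sample is my stand-in for right-skewed data), we can measure skewness before and after the transformation:&lt;/p&gt;

```python
import numpy as np

def skewness(x):
    # Sample skewness: third standardized moment (0 for a symmetric distribution).
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # long right tail

print(skewness(data))          # strongly positive
print(skewness(np.log(data)))  # close to 0: the log restores symmetry
```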

&lt;p&gt;Unlike log transformations, which are commonly used to reduce skewness in a distribution, &lt;strong&gt;exponential transformations&lt;/strong&gt; take the opposite approach. Although less common for reducing skewness, exponential transformations can be useful in specific contexts where increasing skewness or spreading out the data more widely is necessary. Therefore, they are generally more applicable when dealing with negatively skewed data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2. The Perspective of Kurtosis
&lt;/h3&gt;

&lt;p&gt;Kurtosis, like skewness, can also lead to visual distortions in data. For example, when kurtosis values are very high, the data may appear more extreme or prone to outliers, as the heavy tails suggest a higher frequency of extreme values. This could result in an overestimation of risk or variability in the data.&lt;/p&gt;

&lt;p&gt;On the other hand, low kurtosis can make the data appear more uniform or less extreme, potentially leading to an underestimation of the presence of outliers or variability.&lt;/p&gt;

&lt;p&gt;To prevent misleading conclusions, it’s important to annotate or explain the presence of high or low kurtosis when visualizing data. This helps viewers interpret the data more accurately. Alternatively, outliers that contribute to very high kurtosis can be adjusted or removed, but this process must be approached with caution and a thorough understanding of the impact on the analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Conclusion
&lt;/h2&gt;

&lt;p&gt;It can be easy to lead people in a desired direction by distorting data. Therefore, we must strive to minimize data distortion and provide users with visualizations that convey the most accurate information possible. However, there are times when we unintentionally present information that could lead others to misunderstand. The most notable examples of this are skewness and kurtosis.&lt;/p&gt;

&lt;p&gt;For this reason, we have closely examined how skewness and kurtosis can distort data and explored the methods to prevent such distortions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/pulse/three-keys-clear-honest-data-visualization-part-2-know-kutyn/" rel="noopener noreferrer"&gt;https://www.linkedin.com/pulse/three-keys-clear-honest-data-visualization-part-2-know-kutyn/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.motherjones.com/kevin-drum/2015/09/lying-charts-anti-abortion-edition/" rel="noopener noreferrer"&gt;https://www.motherjones.com/kevin-drum/2015/09/lying-charts-anti-abortion-edition/&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Jason_Chaffetz" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Jason_Chaffetz&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Skewness" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Skewness&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Kurtosis" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Kurtosis&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dynamox.net/en/blog/vibration-analysis-metrics-kurtosis-and-skewness" rel="noopener noreferrer"&gt;https://dynamox.net/en/blog/vibration-analysis-metrics-kurtosis-and-skewness&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://discuss.boardinfinity.com/t/how-to-transform-the-data-to-make-normally-distributed/6716" rel="noopener noreferrer"&gt;https://discuss.boardinfinity.com/t/how-to-transform-the-data-to-make-normally-distributed/6716&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Thinking About Multi-head attention and why we need it</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Fri, 16 Aug 2024 01:10:22 +0000</pubDate>
      <link>https://dev.to/bullmouse/multi-head-oj3</link>
      <guid>https://dev.to/bullmouse/multi-head-oj3</guid>
      <description>&lt;h2&gt;
  
  
  Transformer Model
&lt;/h2&gt;

&lt;p&gt;Before explaining the concept of Multi-Head Attention, we first need to know about the &lt;strong&gt;Transformer model&lt;/strong&gt;, the architecture in which Multi-Head Attention serves as a core building block. &lt;/p&gt;

&lt;p&gt;Researchers at Google developed the Transformer model and proposed it in the 2017 paper "Attention Is All You Need". Text is converted into numerical units called &lt;strong&gt;tokens&lt;/strong&gt;, and each token is mapped to a vector by looking it up in a &lt;strong&gt;word embedding table&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In NLP, a word embedding is a representation of a word. Typically, it is a real-valued vector that encodes the word’s meaning in such a way that words closer together in the vector space are expected to be similar in meaning. &lt;/p&gt;

&lt;p&gt;The goal of word embedding is to capture the semantic meaning of words. &lt;/p&gt;
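&lt;p&gt;As a toy illustration of this idea (the 3-dimensional vectors below are made up for the example; real embedding tables are learned from data and have hundreds of dimensions), cosine similarity shows how "closer in vector space" corresponds to "similar in meaning":&lt;/p&gt;

```python
import numpy as np

# Made-up toy "embeddings": related words get nearby vectors.
emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.85, 0.75, 0.2]),
    "car": np.array([0.1, 0.2, 0.95]),
}

def cosine(a, b):
    # Cosine similarity: 1 means same direction, 0 means orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))  # high: related meanings
print(cosine(emb["cat"], emb["car"]))  # low: unrelated meanings
```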

&lt;p&gt;Before Transformer models, NLP relied on architectures such as RNNs, LSTMs, and GRUs. However, these models had several limitations. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sequential Processing:&lt;/strong&gt; RNNs process tokens in sequence, which makes them slow and difficult to parallelize.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-Range Dependencies:&lt;/strong&gt; RNNs struggle to capture dependencies between distant tokens due to the vanishing gradient problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fixed-Size Context:&lt;/strong&gt; RNNs compress the entire history into a fixed-size hidden state, which limits how much context they can retain and makes varying context lengths hard to handle effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The transformer model addresses these limitations using a &lt;strong&gt;self-attention&lt;/strong&gt; mechanism and fully parallelizable architecture. &lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Attention mechanism
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Concept
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attention:&lt;/strong&gt; In the context of neural networks, it refers to the ability to focus on specific parts of the input sequence when producing each element of the output. It allows the model to weigh the importance of the different words (or tokens) in a sequence relative to each other. In the context of sequence-to-sequence models, attention typically involves focusing on different parts of the encoder's output when generating each word in the decoder's output. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-Attention:&lt;/strong&gt; a specific type of attention in which each word in the sequence attends to every other word, including itself. This helps the model understand the relationships between words in the context of the entire sequence. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference between attention and self-attention lies in what is attended to: self-attention operates within a single sequence. For example, in the sentence "The cat sat on the mat", self-attention allows the word "cat" to focus on "sat" and "mat" to understand its context better, rather than considering only adjacent words.&lt;/p&gt;

&lt;p&gt;For each word in the input sequence, the self-attention mechanism computes three vectors: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query(Q):&lt;/strong&gt; Represents the word we are currently focusing on. In the Self-Attention mechanism, we make a Query vector based on each word of the input sequence. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key(K):&lt;/strong&gt; Represents the words we are comparing against. It is compared with the Query vector to calculate how relevant this word (K) is to the target word (Q). &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value(V):&lt;/strong&gt; Contains the actual information of the words. The Value vectors are weighted by the attention weights and summed to produce the final output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These vectors are not taken directly from the input word embeddings. Instead, they are created by multiplying the embeddings with learnable weight matrices.  &lt;/p&gt;

&lt;p&gt;We multiply between input embedding X and the weight matrix (W_Q, W_K, W_V) to make the vector of Q, K, and V.&lt;/p&gt;

&lt;p&gt;Now we need to calculate the attention scores. The Q, K, and V vectors are generated from the input embeddings using the respective weight matrices W_Q, W_K, and W_V. &lt;/p&gt;

&lt;p&gt;The attention score for each pair of words is calculated by taking the dot product of the Query vector for one word with the Key vector of another word. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbzvgnurlnvow6f98gae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbzvgnurlnvow6f98gae.png" alt="Image description" width="374" height="52"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qi:&lt;/strong&gt; the Query vector for the i-th word. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kj:&lt;/strong&gt; the Key vector for the j-th word. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dot product:&lt;/strong&gt; measures the similarity between the Query and Key vectors, determining how much one word is related to another &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To prevent the dot products from becoming too large and destabilizing the softmax function, the scores are scaled by the square root of the dimension of the Key vectors (dk).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5unffcyzibq9d5cc90t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5unffcyzibq9d5cc90t.png" alt="Image description" width="349" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we'll apply the &lt;strong&gt;softmax&lt;/strong&gt; function. The scaled attention scores are then passed through a softmax function. The softmax function converts these scores into a probability distribution, where the scores sum to 1. Softmax essentially normalizes the scores, making it easier to interpret them as &lt;strong&gt;attention weights&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7j4ic3pfjiibacfy0a4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7j4ic3pfjiibacfy0a4.png" alt="Image description" width="411" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Generate Output
&lt;/h3&gt;

&lt;p&gt;After calculating the attention weights, the next step is to generate the output of the self-attention mechanism. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudp7kht96ygbnm9n95xq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudp7kht96ygbnm9n95xq.jpeg" alt="Image description" width="800" height="91"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The summation runs over j from 1 to n: the output for word i is the sum of its attention weights multiplied by the corresponding Value vectors V_j, i.e., a weighted average of the Values. &lt;/p&gt;
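&lt;p&gt;Putting the steps together (projection into Q, K, V, scaled dot-product, softmax, weighted sum of Values), here is a minimal NumPy sketch of self-attention for one sequence; the shapes and random weights are illustrative assumptions, not values from any paper:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v    # project embeddings into Q, K, V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # scaled dot-product similarities
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of Value vectors

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))               # toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 8)
```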

&lt;h3&gt;
  
  
  Multi-Head Attention Extension
&lt;/h3&gt;

&lt;p&gt;In Multi-Head Attention, each head performs its own self-attention calculation independently, and then the output of all heads is concatenated and then linearly transformed to produce the final output. &lt;/p&gt;

&lt;p&gt;This is the formula for Multi-Head attention extension. &lt;/p&gt;

&lt;p&gt;Let's imagine there are h heads; each head produces an output, which we can denote as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;head 1, head 2, ..., head h &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these outputs (from head 1 to head h) is a vector of the same dimension, d_k. The idea of concatenation is to take these h vectors and combine them into a single, longer vector. We use d_k because it is obtained by dividing d_model by h.  &lt;/p&gt;

&lt;p&gt;For example, suppose d_model is the dimensionality of the input vectors to the model. If each word in the input sequence is represented by a 512-dimensional vector, then d_model = 512. &lt;/p&gt;

&lt;p&gt;h is the number of attention heads; for example, with 8 heads, h = 8. The dimension of each head is then 512 / 8 = 64, and we call this d_k.&lt;/p&gt;

&lt;p&gt;Each head independently processes the input sequence through the Linear Projections and Attention Mechanism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h6w7n54un5u7c1eubx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h6w7n54un5u7c1eubx4.png" alt="Image description" width="800" height="413"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This image is a good illustration of the concatenation and linear transformation steps.  &lt;/p&gt;

&lt;p&gt;After we apply the linear transformation, it is ready for subsequent layers in the Transformer model. &lt;/p&gt;
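&lt;p&gt;The whole pipeline (per-head attention, concatenation, final linear transformation) can be sketched in NumPy as follows; the sizes (d_model = 512, h = 8, d_k = 64) follow the example above, while the random weights are purely illustrative:&lt;/p&gt;

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, h, W_q, W_k, W_v, W_o):
    n, d_model = X.shape
    d_k = d_model // h                          # e.g. 512 / 8 = 64
    heads = []
    for i in range(h):
        # Each head has its own learned projections.
        Q = X @ W_q[i]
        K = X @ W_k[i]
        V = X @ W_v[i]
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)               # (n, d_k) per head
    concat = np.concatenate(heads, axis=-1)     # (n, h * d_k) = (n, d_model)
    return concat @ W_o                         # final linear transformation

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(d_model, d_model))
out = multi_head_attention(X, h, W_q, W_k, W_v, W_o)
print(out.shape)  # (4, 512)
```

&lt;p&gt;Concatenating the h head outputs restores the original d_model width, so the projected result can feed directly into the next Transformer layer.&lt;/p&gt;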

&lt;h2&gt;
  
  
  Why do we need Multi-Head Attention?
&lt;/h2&gt;

&lt;p&gt;The primary reasons for using Multi-Head Attention are its ability to capture complex patterns, improve model performance, and provide richer, more context-aware representations of the input data. With multiple heads, each head can learn to focus on a different kind of relationship (for example, syntactic structure in one head and semantic similarity in another), which a single attention head would have to blend into one set of weights. &lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.adityaagrawal.net/blog/deep_learning/attention" rel="noopener noreferrer"&gt;https://www.adityaagrawal.net/blog/deep_learning/attention&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Thinking about JavaScript Libraries and Frameworks for User Interface</title>
      <dc:creator>sangjun_park</dc:creator>
      <pubDate>Thu, 19 Jan 2023 01:53:29 +0000</pubDate>
      <link>https://dev.to/bullmouse/thinking-about-javascript-libraries-and-frameworks-for-user-interface-1799</link>
      <guid>https://dev.to/bullmouse/thinking-about-javascript-libraries-and-frameworks-for-user-interface-1799</guid>
      <description>&lt;h2&gt;
  
  
  Before we begin
&lt;/h2&gt;

&lt;p&gt;There are lots of libraries and frameworks in JavaScript. But I asked myself, ‘Do I know the difference between them?’ I was not sure. So I decided to explore the history of JavaScript and why libraries and frameworks such as jQuery, React, Vue, Angular, and Svelte were created.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Birth of JavaScript
&lt;/h2&gt;

&lt;p&gt;The creator of JavaScript was Brendan Eich. According to &lt;a href="https://fossbytes.com/who-is-brendan-eich/" rel="noopener noreferrer"&gt;this website,&lt;/a&gt; he originally made JavaScript to validate user input on websites. Nowadays,&lt;a href="https://developer.mozilla.org/ko/docs/Learn/JavaScript/First_steps/What_is_JavaScript" rel="noopener noreferrer"&gt; JavaScript powers complex features inside web browsers, such as interacting with users or updating content periodically.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some people compare HTML, CSS, and JS to nouns, adjectives, and verbs. I agree with this comparison: JavaScript is what makes web browsers more dynamic.&lt;/p&gt;

&lt;h2&gt;
  
  
  jQuery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  jQuery: write less, do more
&lt;/h3&gt;

&lt;p&gt;Then why was jQuery made? Did vanilla JS not work properly? I was curious about this question, so I searched the jQuery website. According to &lt;a href="https://jquery.com/" rel="noopener noreferrer"&gt;this website,&lt;/a&gt; jQuery is a fast, small, and feature-rich JavaScript library.&lt;/p&gt;

&lt;p&gt;In 2006, John Resig was a web developer working on his own projects. He was frustrated with writing JavaScript because of its difficulty: browser compatibility was hard to manage, and separate code was often needed just for IE (Internet Explorer), Chrome, or Firefox. Beyond the compatibility issues, he also wanted a syntax simpler than raw JavaScript.&lt;/p&gt;

&lt;p&gt;So he made jQuery, a simple and powerful JavaScript library. It solved compatibility issues easily, making web developers more productive than when using plain JavaScript alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Weakness of jQuery
&lt;/h3&gt;

&lt;p&gt;However, &lt;a href="https://www.zerocho.com/category/jQuery/post/57b356d4d841141500b31e1e" rel="noopener noreferrer"&gt;one of the problems with jQuery was speed.&lt;/a&gt; jQuery is much slower than plain JavaScript, especially for animations. Also, &lt;a href="https://www.zerocho.com/category/jQuery/post/57b356d4d841141500b31e1e" rel="noopener noreferrer"&gt;maintaining jQuery code is more difficult than maintaining plain JavaScript. &lt;/a&gt;Some people might expect the opposite because jQuery code is so concise, but that very conciseness tempts developers to copy and duplicate code, which easily produces spaghetti code and terrible maintainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why don’t we use jQuery as much as before?
&lt;/h3&gt;

&lt;p&gt;First, &lt;a href="https://velog.io/@sy3783/jQuery1.-%EB%8D%94-%EC%9D%B4%EC%83%81-%EC%A0%9C%EC%9D%B4%EC%BF%BC%EB%A6%AC%EB%A5%BC-%EC%82%AC%EC%9A%A9%ED%95%98%EC%A7%80-%EC%95%8A%EB%8A%94-%EC%9D%B4%EC%9C%A0" rel="noopener noreferrer"&gt;browser compatibility is now good enough.&lt;/a&gt; One of the main reasons jQuery was created was the compatibility issue (especially cross-browsing), and nowadays that issue matters far less, so jQuery is less useful than before.&lt;/p&gt;

&lt;p&gt;Also, most features of jQuery can be implemented in vanilla JS or TypeScript. Moreover, &lt;a href="https://javascript.plainenglish.io/why-i-decided-to-break-up-with-jquery-d597d4b8c84f" rel="noopener noreferrer"&gt;jQuery is a library built over JavaScript as a ‘wrapper’,&lt;/a&gt; which makes it slower than vanilla JS; jQuery is not a library focused on optimization.&lt;/p&gt;

&lt;p&gt;Most importantly, the reason we now use other frameworks or libraries instead of jQuery is the ‘Virtual DOM’, which is a more powerful foundation for building front-end applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual DOM
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Virtual_DOM" rel="noopener noreferrer"&gt;According to Wikipedia, &lt;/a&gt;a virtual DOM is a lightweight JavaScript representation of the Document Object Model (DOM) used in declarative web frameworks such as React, Vue, and Elm. However, before researching the virtual DOM, we need to know what exactly the ‘DOM’ is.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is DOM?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction" rel="noopener noreferrer"&gt;The ‘DOM’ is the data representation of the objects that comprise the structure and content of a document on the web.&lt;/a&gt; It is a programming interface for web documents: it represents the page so that programs can change the document structure, style, and content.&lt;/p&gt;

&lt;p&gt;We can access the DOM from JavaScript, and not only access it but also manipulate it, using the DOM together with scripting languages like JavaScript.&lt;/p&gt;

&lt;p&gt;The DOM’s structure is represented by a ‘&lt;a href="https://wit.nts-corp.com/2019/02/14/5522" rel="noopener noreferrer"&gt;node tree’.&lt;/a&gt; In this tree, one parent node can have several child nodes. But that doesn’t mean the DOM is the same as the HTML. The DOM is an interface over a valid HTML document: while building it, the browser corrects markup that is not valid HTML. Nor is the DOM what the user actually sees in the browser. What the user sees is the ‘render tree’, which combines the DOM and the CSSOM. ‘display: none’ is the clearest example of the difference: an element with ‘display: none’ is not rendered at all, yet the DOM doesn’t care what the CSS property is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do we need Virtual DOM then?
&lt;/h3&gt;

&lt;p&gt;First of all, the DOM is structured as a node tree, so it is easy for us to understand. &lt;a href="https://doqtqu.tistory.com/316" rel="noopener noreferrer"&gt;But a large number of nodes makes the DOM slower, and updating it becomes more error-prone. &lt;/a&gt;Furthermore, modern apps are often SPAs (Single Page Applications), meaning the whole application lives on a single page, so the HTML source is huge and full of dynamic features, and parts of it must be re-rendered constantly.&lt;/p&gt;

&lt;p&gt;Some people might think re-rendering the DOM doesn’t take much effort. However, when we write code like the following, this is what has to happen:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;document.getElementById('some-id').textContent = 'updated value';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The browser parses the HTML and finds the node whose ID matches the one in this code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It removes the existing children of that element.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It updates the element to ‘updated value’.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It recalculates the CSS for the parent and child nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, the result is painted on the browser’s display.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So many things happen while updating a single DOM value. Everything is recalculated even when it isn’t necessary, and the browser must compute and apply the new layout. &lt;a href="https://doqtqu.tistory.com/316" rel="noopener noreferrer"&gt;Changing the layout requires a complex algorithm, and it also affects performance.&lt;/a&gt; In particular, if rendering happens 10 times, the browser also rebuilds the render tree 10 times. And &lt;a href="https://devlibrary00108.tistory.com/46" rel="noopener noreferrer"&gt;when the DOM is huge, it is quite complex to find which part of the DOM has changed and should be re-rendered.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, Here the Virtual DOM comes! 😁&lt;/p&gt;

&lt;p&gt;The virtual DOM is quite lightweight. It is an abstraction of the HTML DOM, &lt;a href="https://devlibrary00108.tistory.com/46" rel="noopener noreferrer"&gt;and because it is an abstract in-memory representation, it can stay lightweight and separate from the browser’s real DOM.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because updating the virtual DOM does not actually render anything in the browser, it is relatively fast compared to the real DOM. After a data update, &lt;a href="https://doqtqu.tistory.com/316" rel="noopener noreferrer"&gt;there are three steps to re-render through the virtual DOM.&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The entire UI is re-rendered into the virtual DOM, which becomes the current version (this is relatively fast compared to re-rendering the real DOM).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The difference between the previous virtual DOM (which matches the real DOM) and the current virtual DOM is calculated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After the calculation is completed, only the parts of the real DOM that actually changed are updated.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These 3 steps of the process are called ‘Reconciliation’ in React.&lt;/p&gt;
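&lt;p&gt;To make step 2 concrete, here is a minimal, language-agnostic sketch of tree diffing written in Python (this is not React’s actual algorithm; nodes are simplified to (tag, props, children) tuples, and the patch names are made up for the example):&lt;/p&gt;

```python
from itertools import zip_longest

def diff(old, new, path="root"):
    # Compare two virtual trees and collect the patches needed to
    # turn `old` into `new`; only changed nodes produce patches.
    if old is None:
        return [("CREATE", path, new)]
    if new is None:
        return [("REMOVE", path)]
    old_tag, old_props, old_children = old
    new_tag, new_props, new_children = new
    if old_tag != new_tag:
        return [("REPLACE", path, new)]  # different element: swap whole subtree
    patches = []
    if old_props != new_props:
        patches.append(("UPDATE_PROPS", path, new_props))
    for i, (o, n) in enumerate(zip_longest(old_children, new_children)):
        patches.extend(diff(o, n, f"{path}/{i}"))
    return patches

old = ("div", {"id": "app"}, [("p", {}, []), ("span", {"class": "a"}, [])])
new = ("div", {"id": "app"}, [("p", {}, []), ("span", {"class": "b"}, [])])
print(diff(old, new))  # only the changed span produces a patch
```

&lt;p&gt;Applying only the collected patches to the real DOM corresponds to step 3: the unchanged ‘div’ and ‘p’ are never touched.&lt;/p&gt;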

&lt;h2&gt;
  
  
  Angular: I’ll use Incremental DOM!
&lt;/h2&gt;

&lt;p&gt;So far, I have been talking about the virtual DOM. However, the virtual DOM is not the only way to solve the re-rendering performance problem. What is the difference between the incremental DOM and the virtual DOM, and why does Angular use the incremental DOM?&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Incremental DOM?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://ui.toast.com/posts/ko_20210819" rel="noopener noreferrer"&gt;The incremental DOM takes a simpler approach than the virtual DOM.&lt;/a&gt; There is no virtual tree in memory; it uses the real DOM to find the differences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ui.toast.com/posts/ko_20210819" rel="noopener noreferrer"&gt;The incremental DOM compiles every component into chunks of instructions.&lt;/a&gt; These instructions build the DOM tree and work out what has changed.&lt;/p&gt;

&lt;p&gt;Okay, so the incremental DOM looks simpler, because it doesn’t need to build a virtual DOM; instead it works against the real in-memory DOM that already exists. Then what makes the incremental DOM special?&lt;/p&gt;

&lt;p&gt;Actually, that is the point. &lt;a href="https://ui.toast.com/posts/ko_20210819" rel="noopener noreferrer"&gt;It uses memory more efficiently because no virtual DOM has to be built:&lt;/a&gt; we don’t need to copy the real DOM just to re-render the application UI, which greatly decreases memory usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ui.toast.com/posts/ko_20210819" rel="noopener noreferrer"&gt;There is also ‘tree shaking’ with the incremental DOM.&lt;/a&gt; Tree shaking is a build-time step that removes code that is never used. The incremental DOM can therefore drop instructions that the components don’t currently use. Tree shaking is impossible for frameworks or libraries that use a virtual DOM, because the virtual DOM relies on an interpreter at runtime, whereas the incremental DOM’s instructions can be shaken out when the code is compiled.&lt;/p&gt;

&lt;h3&gt;
  
  
  So, What’s Better? Incremental DOM or Virtual DOM?
&lt;/h3&gt;

&lt;p&gt;The answer is ‘&lt;a href="https://ui.toast.com/posts/ko_20210819" rel="noopener noreferrer"&gt;it depends&lt;/a&gt;’. There are trade-offs between the incremental DOM and the virtual DOM. If memory usage is critical for a project, the incremental DOM is likely better. However, if speed matters more than memory usage, the virtual DOM is the better choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  React vs Vue: What’s the difference?
&lt;/h2&gt;

&lt;p&gt;Nowadays (in 2023), React and Vue are the two most popular JavaScript frameworks or libraries. Both use a virtual DOM, and both are component-based, so you can reuse components across projects and develop more flexibly thanks to encapsulation and ease of extension.&lt;/p&gt;

&lt;p&gt;Then, What makes React and Vue different?&lt;/p&gt;

&lt;h3&gt;
  
  
  Vue: Single File Component
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://rma7.tistory.com/65" rel="noopener noreferrer"&gt;Vue.js is a framework based on single-file components.&lt;/a&gt; HTML, CSS, and JavaScript all live in one file with the ‘.vue’ extension. This makes projects easier to build, maintain, and collaborate on. Vue is also HTML-based and uses a ‘template’ syntax of its own; many people say Vue.js is easy to learn because its grammar builds on HTML.&lt;/p&gt;

&lt;h3&gt;
  
  
  React: Separate Each File
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://erwinousy.medium.com/%EB%82%9C-react%EC%99%80-vue%EC%97%90%EC%84%9C-%EC%99%84%EC%A0%84%ED%9E%88-%EA%B0%99%EC%9D%80-%EC%95%B1%EC%9D%84-%EB%A7%8C%EB%93%A4%EC%97%88%EB%8B%A4-%EC%9D%B4%EA%B2%83%EC%9D%80-%EA%B7%B8-%EC%B0%A8%EC%9D%B4%EC%A0%90%EC%9D%B4%EB%8B%A4-5cffcbfe287f" rel="noopener noreferrer"&gt;Unlike Vue.js, React separates files into ‘.js’ and ‘.css’&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are many more differences between Vue.js and React.js. Suppose we need to mutate data. In Vue, we create a data object and can then update it freely, whereas in React we create an object called ‘state’ and must go through extra steps to update it.&lt;/p&gt;

&lt;p&gt;In conclusion, Vue is stricter to use, and React is more flexible. If you want to build a large-scale application, React may be the better choice. However, Vue seems easier to learn than React, and it is often faster to get an application running with it. So if you want to build a small app as quickly as possible, Vue would be a much better choice than React.&lt;/p&gt;

&lt;h2&gt;
  
  
  Binding Methods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two-way Data Binding vs One-way Data Binding
&lt;/h3&gt;

&lt;p&gt;Another thing that distinguishes these frameworks is the ‘binding method’: the policy for how data flows between a parent component and a child component. Angular supports two-way data binding, whereas React and Vue primarily use one-way data binding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gitnyang.tistory.com/67" rel="noopener noreferrer"&gt;In Two-way Data Binding, data flows each direction.&lt;/a&gt; What I mean is both parent components to child components and vice versa. Model and View watch each other so that if one of them changes its data, then the other one changes it directly. However, because of this, it is more difficult than One-way Data Binding to know which direction the change of data comes from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gitnyang.tistory.com/67" rel="noopener noreferrer"&gt;However, In One-Way Data Binding, data only flows in one way.&lt;/a&gt; For example in React, If they manage state, then they use props so that there could be some data flow from the parent component to the child component. So that we can predict the flow of data more clearly. But One-Way Data Binding doesn’t mean that we cannot send data from the child component to the parent component. If the parent component sends the handler function to the child component, and when the event happened in the child component, then the child can change the state through the function that the parent component gives. And this is also called ‘Lifting State Up’.&lt;/p&gt;

&lt;h2&gt;
  
  
  Svelte: Write less
&lt;/h2&gt;

&lt;p&gt;Svelte, born in 2016, is an open-source front-end framework. &lt;a href="https://svelte.dev/" rel="noopener noreferrer"&gt;On its website,&lt;/a&gt; we can clearly see why this framework was created.&lt;/p&gt;

&lt;p&gt;One of Svelte’s most popular slogans is ‘write less code’, which feels reminiscent of jQuery. With Svelte we can build boilerplate-free components using languages we already know: HTML, CSS, and JavaScript. Boilerplate means code that is repeated in many places with little variation, which becomes painful whenever someone has to refactor it.&lt;/p&gt;

&lt;p&gt;Also, unlike React, Svelte has ‘no virtual DOM’, and we have already seen the downsides of using one. Whereas React and Vue interpret component code at run time, Svelte works like a compiler: it translates components into efficient imperative code at build time, which also means shipping less code.&lt;/p&gt;

&lt;p&gt;And the last thing that makes Svelte powerful is that it is ‘truly reactive’. There is no need for complex state-management libraries as in React, because Svelte brings reactivity to JavaScript itself.&lt;/p&gt;

&lt;p&gt;However, using Svelte in production still comes with some challenges. &lt;a href="https://blog.logrocket.com/why-people-arent-switching-to-svelte-yet/" rel="noopener noreferrer"&gt;Here&lt;/a&gt; is a good article about why people aren’t switching to Svelte yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Thank you for reading my article. I was born and live in South Korea, so my English is sometimes rough. If you notice any problems, such as grammatical errors or incorrect information, please leave a comment below or send me an email 😌.&lt;/p&gt;

</description>
      <category>indiegames</category>
      <category>gamedev</category>
      <category>discuss</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
