<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuval</title>
    <description>The latest articles on DEV Community by Yuval (@valyouw).</description>
    <link>https://dev.to/valyouw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F566025%2F4b969827-a855-4eee-aaaf-b0ce7e53830e.jpeg</url>
      <title>DEV Community: Yuval</title>
      <link>https://dev.to/valyouw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/valyouw"/>
    <language>en</language>
    <item>
      <title>AI For Developers: How Transformer LLMs Work</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Wed, 02 Apr 2025 16:45:19 +0000</pubDate>
      <link>https://dev.to/valyouw/ai-for-developers-how-transformer-llms-work-2b6h</link>
      <guid>https://dev.to/valyouw/ai-for-developers-how-transformer-llms-work-2b6h</guid>
      <description>&lt;p&gt;There are my notes from the DeepLearning.AI course &lt;a href="https://learn.deeplearning.ai/courses/how-transformer-llms-work" rel="noopener noreferrer"&gt;How Transformer LLMs Work&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Word Representation
&lt;/h1&gt;

&lt;p&gt;When dealing with language models it is important to understand how words are represented as numeric values. We will go over the evolution of this topic.&lt;/p&gt;

&lt;p&gt;Here is a timeline of language models&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ajdeklp74vpj3j97xjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ajdeklp74vpj3j97xjc.png" alt="276fcc30f3e0f156c2a5824d5746bbc9.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Bag of Words
&lt;/h2&gt;

&lt;p&gt;In this method we simply count the number of times each word in a vocabulary appears in the sentence we want to encode.&lt;br&gt;
Note that not all words may appear in the vocabulary; usually the vocabulary is built only from words that appear in the training dataset. In the example below, the word "cute" is not part of the vocabulary, and that's ok.&lt;br&gt;
The "bag of words" is simply the count of words in a vector representation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g7p748f1q1dr3jl2lpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g7p748f1q1dr3jl2lpz.png" alt="507d4e611885f1a1b2820b7404ddaf5b.png" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;
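
&lt;p&gt;As a minimal sketch of the idea in Python (the vocabulary and sentence below are made-up placeholders, not from the course):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A minimal bag-of-words sketch; vocabulary and sentence are illustrative only
from collections import Counter

vocabulary = ["that", "is", "a", "dog", "my", "cat"]   # hypothetical vocabulary (no "cute")
sentence = "my cat is a cute cat"

counts = Counter(sentence.split())
# one count per vocabulary entry; words outside the vocabulary ("cute") are simply ignored
bag_of_words = [counts.get(word, 0) for word in vocabulary]
print(bag_of_words)  # [0, 1, 1, 0, 1, 2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
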
&lt;h2&gt;
  
  
  Word2Vec
&lt;/h2&gt;

&lt;p&gt;Bag of words is a very simple representation of words and it does not consider the &lt;strong&gt;semantic&lt;/strong&gt; nature of text.&lt;br&gt;
Word2Vec was designed to capture the meaning of a word; it is also represented as a vector of numbers, called an "embedding".&lt;br&gt;
These embedding vectors are built using neural networks.&lt;/p&gt;

&lt;p&gt;We start by initializing a random embedding vector for each word in the vocabulary, then we train a network that takes a pair of words and tries to predict whether the two words are "close" to each other. The embedding vectors are just a layer from the neural network, and each vector captures the "meaning" of its word.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqj018yh52xqu11xdgo8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqj018yh52xqu11xdgo8f.png" alt="d47841d0c3766cf2973046149fa1a0ca.png" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final result is that words with similar meaning are clustered together in the embedding space (note that the embedding vector dimension can be very large, like 512, 1024 or more).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg749v0j712ho5dvufew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpg749v0j712ho5dvufew.png" alt="acd94c251c5f94bb40b57b6c344da25b.png" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;
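
&lt;p&gt;For illustration, here is how a Word2Vec model could be trained with the &lt;code&gt;gensim&lt;/code&gt; library (the tiny corpus is made up; real embeddings need far more data):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A minimal Word2Vec sketch using gensim; the corpus is an illustrative placeholder
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=100)
print(model.wv["cat"].shape)         # (50,) -- the embedding vector for "cat"
print(model.wv.most_similar("cat"))  # nearest words in the embedding space
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
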
&lt;h3&gt;
  
  
  Types of Embeddings
&lt;/h3&gt;

&lt;p&gt;Once we have created an embedding vector for each word (more specifically for each "token", which can be a part of a word) in a sentence, we can use various techniques (like averaging these vectors) to get the "sentence" embedding, and similarly an embedding for the entire "document".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5e3he109sazc0rnucu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5e3he109sazc0rnucu9.png" alt="d722ca92648577f40927b07a8eb4a5ba.png" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;
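
&lt;p&gt;A minimal sketch of average pooling, assuming we already have one embedding per token (the values below are random placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

token_embeddings = np.random.rand(6, 512)            # 6 tokens, 512-dim embeddings
sentence_embedding = token_embeddings.mean(axis=0)    # simple average pooling
print(sentence_embedding.shape)                        # (512,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
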
&lt;h2&gt;
  
  
  Encoding and Decoding Context
&lt;/h2&gt;

&lt;p&gt;Word2Vec creates an embedding vector for each word in a sentence regardless of its position (or context) in the sentence. For example, in the sentences "The money is in the bank" and "The bank of the river", Word2Vec will generate the same embedding vector for the word "bank" although the meaning in each sentence is different. The word embedding should change with the context.&lt;/p&gt;
&lt;h3&gt;
  
  
  RNNs
&lt;/h3&gt;

&lt;p&gt;Recurrent Neural Networks can be used to model entire sequences, like time series or words in a sentence. RNNs process the entire sequence, so each word is processed in the context of the previous words.&lt;/p&gt;

&lt;p&gt;In a translation task two RNNs are used, an encoder and a decoder.&lt;br&gt;
The encoder tries to create a context vector that represents the sentence in the source language, then the decoder uses this context vector to generate text in the target language.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vzwme9qmb8vf1ebcmll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vzwme9qmb8vf1ebcmll.png" alt="9dc5cdbd479ede699394b48dbefee95a.png" width="709" height="559"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Autoregressive
&lt;/h4&gt;

&lt;p&gt;The decoder generates the translated words one by one: it starts by taking the entire sentence in the source language (and the encoder context) to generate the first translated word, then the translated word is appended to the previous input and the decoder generates the second word, and so on...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9frm0scbowo3oimyiu6o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9frm0scbowo3oimyiu6o.png" alt="9445bffc634a88043dd9a4a915448383.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The words are generated one at a time, using the same context embedding. This doesn't work well for long sentences, as a single context fails to represent the entire input.&lt;/p&gt;
&lt;h3&gt;
  
  
  Attention
&lt;/h3&gt;

&lt;p&gt;Attention is a mechanism that allows a model to focus on parts of the input that are relevant to one another. The attention mechanism was introduced in 2014 (three years before the Transformer architecture).&lt;br&gt;
The mechanism builds an "attention map" that gives higher weights to pairs of words &lt;strong&gt;from the input and output&lt;/strong&gt; that relate to each other (do not confuse it with self-attention).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95d12fqovoal935fk4gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95d12fqovoal935fk4gy.png" alt="9799814456503ad84f6c6b3ac9424142.png" width="639" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The attention mechanism is added to the decoder step. Now, instead of passing a single context vector from the encoder to the decoder, we pass the "hidden state" of each word to the decoder (the hidden state is an internal layer of the RNN that contains information about previous words). The decoder uses the attention mechanism to look at the entire sequence when generating the output, instead of the limited context embedding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn1ge96pnldab1cqows1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn1ge96pnldab1cqows1.png" alt="a53e4826fcc0fd3c059ac94ceeb4a698.png" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Transformers
&lt;/h2&gt;

&lt;p&gt;The transformer architecture was introduced in 2017 in a paper called "Attention Is All You Need". The architecture is based solely on attention, without the use of RNNs (hence the name of the paper...).&lt;br&gt;
The major advantage of the transformer is that it allows the model to be trained in parallel, which speeds up calculations.&lt;/p&gt;

&lt;p&gt;The transformer is built from a &lt;strong&gt;stack&lt;/strong&gt; of encoder and decoder blocks, each using the attention mechanism. By stacking the blocks we amplify the strength of the encoders/decoders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwt4vwwdlrfi6qlxlen5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwt4vwwdlrfi6qlxlen5.png" alt="9241240eca64cfab1d79d6966baef1eb.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Encoder Block
&lt;/h3&gt;

&lt;p&gt;The input words are converted to embeddings, but instead of word2vec we start with random values. Then self-attention processes the embeddings and updates them, creating embeddings that are more contextualized thanks to the attention mechanism. These are passed to a feed-forward neural network to create the final contextualized word embeddings.&lt;/p&gt;

&lt;p&gt;Self-attention processes the input sequence against itself, as opposed to the original attention mechanism from 2014 that processed the input sequence against the output sequence.&lt;br&gt;
Here the attention map assigns higher weights to words in the sentence that are more related to each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2ssq74k0opj2gipb6aq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2ssq74k0opj2gipb6aq.png" alt="da62202ca859afa53906273cfe6a7915.png" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoder Block
&lt;/h3&gt;

&lt;p&gt;The decoder takes the &lt;strong&gt;previously&lt;/strong&gt; generated words and passes them to the "masked self-attention" for processing; the result is an updated intermediate embedding. This embedding is passed to another attention layer, &lt;strong&gt;along with&lt;/strong&gt; the encoder output, to create a single embedding. This single embedding is passed to a feed-forward network to create the next generated word.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao173hp8yyliqx5qb6he.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao173hp8yyliqx5qb6he.png" alt="38df4ffb99f073f567b993553d8fd64b.png" width="646" height="474"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Masked Self-Attention
&lt;/h4&gt;

&lt;p&gt;Masked self-attention is similar to self-attention but it removes the upper triangle of the attention map, masking future positions: any given token can only attend to tokens that came before it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbsglhii7s9dp3ypg0tf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbsglhii7s9dp3ypg0tf.png" alt="ba97f701305cf889ca1a9390dd771881.png" width="417" height="365"&gt;&lt;/a&gt;&lt;/p&gt;
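
&lt;p&gt;A minimal numpy sketch of the masking idea (the scores are random placeholders; real scores come from the query/key products described later):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

seq_len = 4
scores = np.random.rand(seq_len, seq_len)

# mask everything above the diagonal so position i can only attend to positions up to i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# softmax row by row: masked (future) positions get probability 0
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
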
&lt;h2&gt;
  
  
  Representation Models
&lt;/h2&gt;

&lt;p&gt;The original transformer architecture was an encoder-decoder, which is best for text translation, but it's hard to use it for other language tasks, like classification.&lt;/p&gt;
&lt;h3&gt;
  
  
  BERT
&lt;/h3&gt;

&lt;p&gt;Bidirectional Encoder Representations from Transformers, BERT, is an encoder-only model that generates &lt;strong&gt;contextualized&lt;/strong&gt; word embeddings; these embeddings can then be used for classification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qyakvw0fjucc3xelenn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qyakvw0fjucc3xelenn.png" alt="b9a046dd7b49b776016b95408f61072f.png" width="651" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The encoder blocks are the same as we saw above; &lt;code&gt;[CLS]&lt;/code&gt; is a special token added at the start of the input, and its output embedding is used to represent the whole sequence for the task we fine-tune the model on (classification).&lt;/p&gt;

&lt;p&gt;Training is done by randomly masking words in a sentence and training the model to predict the masked word.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70o8dyowl10ddccqnebo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70o8dyowl10ddccqnebo.png" alt="37c24c3800780027693dceac5ee04b90.png" width="581" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This process is the "pre-training", after which we get a "base" model that was trained on masked data. We can then fine-tune this model for specific tasks like classification, named entity recognition, paraphrase identification etc.&lt;/p&gt;
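
&lt;p&gt;A minimal sketch of getting contextualized embeddings from a pre-trained BERT with the Hugging Face &lt;code&gt;transformers&lt;/code&gt; library (the checkpoint name is the standard &lt;code&gt;bert-base-cased&lt;/code&gt;; the exact shape shown is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

inputs = tokenizer("The bank of the river", return_tensors="pt")
outputs = model(**inputs)

# one contextualized embedding per token, including [CLS] and [SEP]
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 7, 768])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
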
&lt;h2&gt;
  
  
  Generative Models
&lt;/h2&gt;

&lt;p&gt;In contrast to representation models, generative models are decoder-only models; the input embeddings are initialized randomly and passed to a stack of decoders:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8gkvnpvr0ix37knk75x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8gkvnpvr0ix37knk75x.png" alt="8915b8236d523030b0a10a4e0c198813.png" width="610" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the release of ChatGPT (GPT-3.5) at the end of 2022 there was a flood of newly released generative models, both proprietary and open source, and 2023 became the Year of Generative AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiq05dwrysztnxqrtlmz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiq05dwrysztnxqrtlmz.png" alt="0af68c0c6f91e447e93c805f49d200a7.png" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Tokenizers
&lt;/h2&gt;

&lt;p&gt;The process of tokenizing the input text varies from model to model: some tokenize it word by word, some break words into multiple tokens, some tokenize character by character, etc.&lt;br&gt;
After the sentence has been tokenized each token is assigned an integer &lt;code&gt;id&lt;/code&gt;, and the tokenizer vocabulary is built.&lt;br&gt;
The important thing is that once a model was trained using a specific tokenizer, the &lt;strong&gt;same&lt;/strong&gt; tokenizer must be used during inference (text generation).&lt;/p&gt;
&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Let's see how different tokenizers will tokenize the sentence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;English and CAPITALIZATION 🎵.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  bert-base-cased
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##ION [UNK] . [SEP]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This tokenizer has 28,996 tokens in its vocabulary.&lt;br&gt;
The &lt;code&gt;[CLS]&lt;/code&gt; token hints at the classification task, the hash marks (##) signal that a token belongs to the token before it, &lt;code&gt;[UNK]&lt;/code&gt; represents an unknown word (a word that is not in the tokenizer vocabulary), and &lt;code&gt;[SEP]&lt;/code&gt; represents a separator.&lt;/p&gt;
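
&lt;p&gt;A minimal sketch of reproducing this with the Hugging Face tokenizer (the checkpoint name is the standard &lt;code&gt;bert-base-cased&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ids = tokenizer("English and CAPITALIZATION 🎵.").input_ids

print(tokenizer.vocab_size)                  # 28996
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', 'English', 'and', 'CA', '##PI', ..., '[SEP]']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
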
&lt;h4&gt;
  
  
  Xenova/gpt-4
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;English  and  CAPITAL IZATION  � � � .&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;While GPT-4 is a closed model, we still have access to its tokenizer.&lt;br&gt;
This tokenizer has 100,263 tokens in its vocabulary. See how the word &lt;code&gt;CAPITALIZATION&lt;/code&gt; is broken into only 2 parts; a smaller vocabulary usually requires higher fragmentation of long words (the short tokens can be shared among many words).&lt;br&gt;
A larger vocabulary, on the other hand, requires more computation and memory (as the generated output is actually an array with the size of the vocabulary, where each token gets a probability of being the next token).&lt;br&gt;
Also note that since GPT is used for text generation we don't have the special &lt;code&gt;[CLS]&lt;/code&gt; token etc.&lt;/p&gt;
&lt;h1&gt;
  
  
  Transformer LLM
&lt;/h1&gt;

&lt;p&gt;Let's dive into how a transformer LLM works. We know that it all starts with a prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write an email apologizing to Sarah for the tragic gardeing mishap.
Explain how it happened.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the LLM will generate the output token-by-token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dear Sarah.
I would like ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;There are three main components in a Transformer LLM: the Tokenizer, a Stack of Transformer Blocks, and the Language Modeling Head.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm0og9yltaznnrzfn3s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm0og9yltaznnrzfn3s5.png" alt="ae26d31faf94f198240d5782336c34f5.png" width="473" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokenizer
&lt;/h3&gt;

&lt;p&gt;We already discussed what a tokenizer is: the LLM has its own tokenizer vocabulary where each token is associated with an integer.&lt;br&gt;
The trained model also has an embedding vector for each token in the vocabulary, and these embeddings are substituted at the beginning (input token -&amp;gt; token embedding).&lt;/p&gt;
&lt;h3&gt;
  
  
  Language Modeling Head
&lt;/h3&gt;

&lt;p&gt;(we skip the transformer blocks for now)&lt;br&gt;
The output of the transformer blocks is an "embedding like" vector that encompasses the "best" next token. The language modeling head is where this "embedding like" vector is transformed into a probability distribution.&lt;br&gt;
The output is an array with the size of the vocabulary, where each token in the vocabulary is assigned a probability of being the next token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3qcb0n8wiwdh8kewvfy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3qcb0n8wiwdh8kewvfy.png" alt="8eca76dc897a8661f0ea6441ff71f87b.png" width="405" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have the probability distribution we need to pick the next token out of all the probabilities.&lt;br&gt;
We can always choose the highest-probability token; this is called "greedy decoding" (it corresponds to setting the &lt;code&gt;temperature&lt;/code&gt; parameter to 0). Another option is to choose a token from a basket of tokens: we add to the basket the tokens with the highest probability, one by one, until the sum of all added probabilities reaches some threshold, then we randomly choose a token from the basket. This threshold is usually controlled by a parameter called &lt;code&gt;top-p&lt;/code&gt;.&lt;/p&gt;
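
&lt;p&gt;A minimal sketch of both strategies over a toy probability distribution (the tokens and probabilities are made-up placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

tokens = ["Paris", "London", "the", "a", "banana"]
probs = np.array([0.55, 0.20, 0.12, 0.08, 0.05])

# greedy decoding: always pick the most likely token
print(tokens[int(np.argmax(probs))])  # Paris

# top-p sampling with p = 0.8: keep the most likely tokens until their
# cumulative probability reaches 0.8, then sample from that reduced set
order = np.argsort(probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(probs[order]), 0.8)) + 1
keep = order[:cutoff]
keep_probs = probs[keep] / probs[keep].sum()
print(tokens[int(np.random.choice(keep, p=keep_probs))])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
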
&lt;h2&gt;
  
  
  Decoding Loop
&lt;/h2&gt;

&lt;p&gt;As we know, the text is generated in an autoregressive manner, meaning we generate the first word, then we append that word to the input to generate the next word.&lt;/p&gt;

&lt;p&gt;Transformers are very efficient as they can process all the tokens in parallel; the number of tokens they can process in parallel is usually the context size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8hpciy5ki8jjambtwf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8hpciy5ki8jjambtwf0.png" alt="8feccf2e849af6c4eae6526f8d7b5a59.png" width="782" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have generated the first word it is appended to the prompt to generate the next word, but now we use &lt;strong&gt;cached&lt;/strong&gt; calculations from the previous step, so there is no need to process all the tokens again.&lt;br&gt;
That is why the metric "time to first token" is used for LLMs: generating the first token is when we process the most tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnuobbjogypfdlpewrlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnuobbjogypfdlpewrlj.png" alt="f2e12521cb7af24dfa67e5ca16306293.png" width="783" height="682"&gt;&lt;/a&gt;&lt;/p&gt;
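
&lt;p&gt;A minimal sketch of this decoding loop with KV caching in Hugging Face &lt;code&gt;transformers&lt;/code&gt; (assuming a causal LM and its tokenizer, e.g. the Phi-3 ones loaded later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, use_cache=True)   # first step: the whole prompt is processed
    past = out.past_key_values               # cached keys/values for every prompt token
    for _ in range(5):
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        # following steps: only the new token is processed, the rest comes from the cache
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values

print(tokenizer.decode(input_ids[0]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
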
&lt;h2&gt;
  
  
  Transformer Block
&lt;/h2&gt;

&lt;p&gt;Once we have an input prompt we first replace the words with tokens and then replace the tokens with their pre-computed vector embeddings.&lt;br&gt;
All embeddings are processed through a set of transformer blocks, each of which has its own self-attention layer and feed-forward network.&lt;/p&gt;
&lt;h3&gt;
  
  
  Feed-Forward Neural Network
&lt;/h3&gt;

&lt;p&gt;Let's look at the prompt "The dog chased the llama because it".&lt;br&gt;
As we said, each token goes through the transformer block, and in a nutshell, the neural network learns to predict the next token (or more specifically, an "embedding like" vector that encompasses info about the next token; this is passed to the language modeling head for processing).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4adlqpkhapd9fp5i3am.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4adlqpkhapd9fp5i3am.png" alt="ff7c4bfb3d53995c599d8ce612f2c832.png" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Self-Attention Layer
&lt;/h3&gt;

&lt;p&gt;Continuing with the previous example, if we had only the NN and it had to predict the next token after the word "it", this would be a difficult task, as statistically there are a lot of words that can come after the word "it". The goal of the self-attention layer is to provide a "better" input to the NN, an input that encompasses more contextual info.&lt;br&gt;
In our example, the contextual info may be a hint that the word "it" refers to the "llama", so when the NN processes the word "it" it has some "understanding" that "it" refers to the "llama".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmse4t2ok6r50kbrmr6f6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmse4t2ok6r50kbrmr6f6.png" alt="318e377fc09c881cf05e8d3d123650fa.png" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level, when the attention layer is processing a token it embeds relevant information from the previous tokens into the current token, specifically:&lt;br&gt;
1) Relevance Scoring - how relevant each previous token is to the current token.&lt;br&gt;
2) Combining Information - combining information from the relevant tokens into the current token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsxn7sn574cg07owtrrk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsxn7sn574cg07owtrrk.png" alt="b3e5258cc6def776570fc4115e598cf9.png" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Relevance Scoring
&lt;/h4&gt;

&lt;p&gt;Self-attention happens in what is called an "attention head"; usually there are multiple attention heads.&lt;br&gt;
The input to the attention head is a sequence of embedded tokens with &lt;strong&gt;positional encoding&lt;/strong&gt; (remember that transformers, unlike recurrent neural networks (RNNs), don't inherently process sequential data; therefore, positional encodings are added to the embeddings to provide the model with information about the order of the tokens in the sequence).&lt;br&gt;
These tokens (with positional encoding) are then transformed, using learned weight matrices, to produce three vectors: Queries (Q), Keys (K) and Values (V).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmerfbylk1pq3rczf9dkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmerfbylk1pq3rczf9dkn.png" alt="0519779266135ecec87aac9a3ba1cb7b.png" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Relevance Scores&lt;/strong&gt; are calculated for each token by multiplying the current token's Q vector with all the other tokens' K vectors; this produces a relevance score between the current token and every other token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3zx4v4vlpe2dwcdcmkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3zx4v4vlpe2dwcdcmkc.png" alt="e0782c1d904d54ee2e1c6745680308f7.png" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Combining Information
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Combining Information&lt;/strong&gt; step is done by multiplying the relevance scores with the V vector of each token.&lt;br&gt;
So if, in the example below, we are processing the "it" token, then we first compute the relevance score of each token with respect to the word "it", then we multiply each token's value vector by its relevance score and sum them up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq22ja3x97jmrmn5m4l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq22ja3x97jmrmn5m4l3.png" alt="441229b0b55fd2b49afd798e1d685ef2.png" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;
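
&lt;p&gt;A minimal single-head self-attention sketch in numpy that puts both steps together (the weights and embeddings are random placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

seq_len, dim = 5, 8
x = np.random.rand(seq_len, dim)             # token embeddings (with positional info)

W_q, W_k, W_v = (np.random.rand(dim, dim) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# relevance scoring: each token's query against every token's key
scores = Q @ K.T / np.sqrt(dim)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row

# combining information: weighted sum of the value vectors
output = weights @ V
print(output.shape)  # (5, 8) -- one updated embedding per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
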
&lt;h4&gt;
  
  
  Multiple Heads
&lt;/h4&gt;

&lt;p&gt;The self-attention layer has multiple attention heads, each head with its own Q, K, V matrices.&lt;br&gt;
Multi-head attention allows the model to learn multiple, different types of relationships simultaneously; each head can also focus on a different aspect of the input.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfx4ayb0v6xaz96cppn0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfx4ayb0v6xaz96cppn0.png" alt="2fb3047ae4d98009267eda8793a00e78.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make self-attention more efficient, several heads are grouped together and share the same K, V matrices; the group and head counts are typically referred to as &lt;code&gt;n_groups&lt;/code&gt; and &lt;code&gt;n_attention_heads&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fichis8oioldbbsynzk9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fichis8oioldbbsynzk9e.png" alt="0097c7de77a6553475b1cff017106e8c.png" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Sparse Attention
&lt;/h4&gt;

&lt;p&gt;In our example each token can attend to all previous tokens, but in larger models this is very expensive; in sparse attention a token can attend only to a limited number of previous tokens.&lt;br&gt;
In example (a) each token can attend to all previous tokens.&lt;br&gt;
In example (b) it can attend to a maximum of 4 previous tokens (and maybe some more further back, with jumps).&lt;br&gt;
In example (c) the input sequence is chunked into lengths of 4 tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F094qcean1wx4eoiv9trj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F094qcean1wx4eoiv9trj.png" alt="082adb3c81328f5ca2b68ba57bf7c464.png" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to support a really large context window there is a concept called &lt;strong&gt;Ring Attention&lt;/strong&gt;; it is beyond the scope of this class, but in general it uses multiple GPUs to enable scaling the context window.&lt;/p&gt;
&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Model Architecture
&lt;/h3&gt;

&lt;p&gt;Now we should have enough knowledge to understand the description of a model architecture; here is an example of the Llama 3 model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7th0stse4cetw78ne98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7th0stse4cetw78ne98.png" alt="7ba0d9fc1c1b5cec4d0b995ce76d2f64.png" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layers (32) is the number of transformer blocks.&lt;/li&gt;
&lt;li&gt;Model Dimension (4096) is the token embeddings length.&lt;/li&gt;
&lt;li&gt;FFN Dimension (14,336) is the size of the feed-forward neural network (in the image)&lt;/li&gt;
&lt;li&gt;Attention Heads (32) is the number of total heads&lt;/li&gt;
&lt;li&gt;Key/Value Heads (8) is the number of attention groups (shared K/V)&lt;/li&gt;
&lt;li&gt;Vocabulary Size (128,000) is the number of known tokens in the model.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step by Step Inference
&lt;/h3&gt;

&lt;p&gt;Let's run a text generation task step by step to see what we have at each step.&lt;/p&gt;
&lt;h4&gt;
  
  
  Installing Phi-3
&lt;/h4&gt;

&lt;p&gt;We will use the Phi-3 model; first let's load it and its tokenizer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;

&lt;span class="c1"&gt;# Load model and tokenizer
&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../models/microsoft/Phi-3-mini-4k-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../models/microsoft/Phi-3-mini-4k-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the number of tokens&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# output: 32000
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the model architecture&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the main components are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;Phi3Model&lt;/code&gt;, which produces the contextualized "embedding like" output vectors; it consists of:

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;Embedding&lt;/code&gt; layer, which holds an embedding vector of size 3,072 for each of the 32,064 tokens it supports.&lt;/li&gt;
&lt;li&gt;The 32 transformer blocks, &lt;code&gt;Phi3DecoderLayer&lt;/code&gt;, each with:

&lt;ol&gt;
&lt;li&gt;Self Attention layer &lt;code&gt;Phi3Attention&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Feed Forward Neural Network &lt;code&gt;Phi3MLP&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;/li&gt;

&lt;li&gt;The language modeling head &lt;code&gt;lm_head&lt;/code&gt; that generates, for each token in the vocabulary, the probability of it being the next token.
&lt;strong&gt;NOTE&lt;/strong&gt; The &lt;code&gt;lm_head&lt;/code&gt; output size is 32,064 while the tokenizer vocabulary size is only 32,000. Why? In order to optimize processing, vector sizes are rounded to a multiple of 64.&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;First we will run the &lt;code&gt;Phi3Model&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The capital of France is&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# output:
&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7483&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mi"&gt;310&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3444&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mi"&gt;338&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# output:
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Size&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3072&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is 5 "embedding like" vectors, one per input token, each representing the next token after that input token. So if we want to complete our prompt we need the next token after the word "is", which is the 5th token.&lt;br&gt;
To get the actual next token out of the "embedding like" vector we need to run the &lt;code&gt;lm_head&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lm_head_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lm_head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;lm_head_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

&lt;span class="c1"&gt;# output:
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Size&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32064&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we see that the output is 5 vectors, one per input token, and for each of them there is a vector with the size of the vocabulary, where each value is a logit representing how likely that vocabulary token is to be the next token.&lt;br&gt;
Let's take the logits for the token following the 5th input token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;next_token_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lm_head_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_token_logits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# output:
&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;27.8750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;29.5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;28.0000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="mf"&gt;20.3750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;20.3750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;20.3750&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grad_fn&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SelectBackward0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, as we discussed, there are different strategies for picking the next token; we will pick the one that has the highest probability (logit), and decode the token id back to a word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;next_token_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next_token_logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_token_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# output:
&lt;/span&gt;&lt;span class="n"&gt;Paris&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And voila! We got the correct completion!&lt;/p&gt;

&lt;h2&gt;
  
  
  Recent Improvements (2024)
&lt;/h2&gt;

&lt;p&gt;We have learned the "original" transformer architecture; over the years there have been some modifications, and as of 2024 this is where we stand:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdba29x6lrx1p9gcw8ul2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdba29x6lrx1p9gcw8ul2.png" alt="5cfaeb175e2fec9743704dc95d018ac1.png" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Most noticeable is the lack of "positional encoding" at the beginning. It is still important for the self-attention to know the position of the token it is processing, but now the positional encoding is done inside the self-attention by what are called "rotary embeddings".&lt;/li&gt;
&lt;li&gt;Self-attention layer is optimized with the grouped-query attention (as we discussed).&lt;/li&gt;
&lt;li&gt;The raw input tokens are merged with the output of the self-attention layer before moving to the NN (the bypass dashed line around the self-attention layer and the + sign). This is actually not a change from the 2017 architecture but more visible on the 2024 drawing.&lt;/li&gt;
&lt;li&gt;The normalization layers moved to before the self-attention and the feed-forward network; some experiments showed this yields better results.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Rotary Embeddings (RoPE)
&lt;/h3&gt;

&lt;p&gt;The goal of rotary embeddings is to improve and optimize the training process.&lt;br&gt;
Assume we are training a model with a context size of 16k, and in our batch we have 32 documents. Each row in the batch has a potential length of 16k tokens, but many documents are shorter than 16k, so the 16k vector gets padded with 0s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ybdk10ybhbeqpsclwbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ybdk10ybhbeqpsclwbq.png" alt="f9017825c44b1e5626931ec2d284fefb.png" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The GPU still runs computation on the entire 16k vector; it doesn't care that the numbers are 0.&lt;br&gt;
What if, instead of padding the vector with 0s, we could pack several documents into a single row:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22ozrmhji7q7n18o59tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22ozrmhji7q7n18o59tk.png" alt="4196eac8e6f8df0aab20c6e632305422.png" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's think about positional encoding when our data looks like this. If we use the simple regular positional encoding, which is the word's index in the context &lt;strong&gt;vector&lt;/strong&gt;, it will work for the first doc, but it won't work for the second doc, as the first word of the 2nd doc should have a positional encoding of 0.&lt;br&gt;
Rotary embeddings are a way to solve this problem: they add positional information at the self-attention layer, just before the relevance scores are calculated.&lt;/p&gt;
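
&lt;p&gt;A minimal numpy sketch of the rotation idea (an illustration only; real implementations apply the rotation to the Q and K vectors inside the attention head):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def rope(x, position):
    """Rotate pairs of dimensions of x by angles that depend on its position."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

vec = np.random.rand(8)
print(np.allclose(rope(vec, position=0), vec))  # True: position 0 means no rotation
print(rope(vec, position=3))                    # the same vector "rotated" to position 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
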

&lt;h2&gt;
  
  
  Mixture of Experts (MoE)
&lt;/h2&gt;

&lt;p&gt;We saw that in the transformer block we have a single feed-forward neural network. What if we had several feed-forward networks, each specializing in a specific kind of token, and a router that routes each token to the appropriate "expert"?&lt;br&gt;
Note these are not domain experts (like psychology or biology) but rather token-related experts, like punctuation (, . ? etc.), verbs (said, read, etc.), conjunctions (the, and, if etc.), visual descriptions (dark, outer, yellow etc.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7ketq66a2lpiouni6pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7ketq66a2lpiouni6pl.png" alt="f11ac3197dccbc2cdce95f94b40494b3.png" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During training the router itself is trained as well on how to best route each token (a classification task); it is also possible to choose 2 experts and combine their results.&lt;/p&gt;
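
&lt;p&gt;A minimal top-2 routing sketch in numpy (the "experts" are reduced to single matrices and all weights are random placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

num_experts, dim = 8, 16
experts = [np.random.rand(dim, dim) for _ in range(num_experts)]  # toy stand-ins for expert FFNs
router_w = np.random.rand(dim, num_experts)

token = np.random.rand(dim)
logits = token @ router_w
gates = np.exp(logits) / np.exp(logits).sum()      # router probability per expert

top2 = np.argsort(gates)[-2:]                      # pick the 2 best experts for this token
weights = gates[top2] / gates[top2].sum()          # renormalize their scores
# only the chosen experts run; their outputs are combined, weighted by the router scores
output = sum(w * (token @ experts[i]) for w, i in zip(weights, top2))
print(output.shape)  # (16,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
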

&lt;h3&gt;
  
  
  Computational Requirements
&lt;/h3&gt;

&lt;p&gt;Each expert is a full neural network, and these neural networks are where most of the LLM parameters are.&lt;br&gt;
When loading a model with MoE we need to load all experts into memory, which requires more memory than a regular LLM.&lt;br&gt;
&lt;strong&gt;But&lt;/strong&gt;, during inference we do not activate all experts; usually at most 2 experts are activated, hence requiring less compute (in general, an expert network has fewer parameters than the single feed-forward network of a regular LLM).&lt;/p&gt;

&lt;p&gt;Here is an example of the Mixtral 8x7B model (that has 8 experts).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wpubdtalmuquw1lncve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wpubdtalmuquw1lncve.png" alt="292a2f40791e22a7bed8896af3043b9f.png" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the model has a total of 46.7B parameters, during inference only 12.8B will be activated.&lt;/p&gt;

&lt;h1&gt;
  
  
  The END
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>genai</category>
    </item>
    <item>
      <title>Notes from course: Generative AI with Large Language Models - Week 3</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sun, 23 Mar 2025 19:44:21 +0000</pubDate>
      <link>https://dev.to/valyouw/notes-from-course-generative-ai-with-large-language-models-week-3-4ao</link>
      <guid>https://dev.to/valyouw/notes-from-course-generative-ai-with-large-language-models-week-3-4ao</guid>
      <description>&lt;p&gt;Notes from &lt;a href="https://www.thecodingnotebook.com/2025/01/notes-from-course-generative-ai-with.html" rel="noopener noreferrer"&gt;Week 1&lt;/a&gt;&lt;br&gt;
Notes from &lt;a href="https://www.thecodingnotebook.com/2025/03/notes-from-course-generative-ai-w2.html" rel="noopener noreferrer"&gt;Week 2&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Reinforcement Learning from Human Feedback (RLHF)
&lt;/h1&gt;

&lt;p&gt;RLHF is essentially fine-tuning with human feedback, which helps to better align models with human preferences and to increase the helpfulness, honesty, and harmlessness of the completions (aka HHH). This further training can also help decrease the toxicity of model responses and reduce the generation of incorrect information.&lt;/p&gt;

&lt;p&gt;In 2020, researchers at OpenAI published a paper that explored the use of fine-tuning with human feedback to train a model to write short summaries of text articles. Here are the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswv2rdlrkj0gcbfojl0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswv2rdlrkj0gcbfojl0n.png" alt="comparison" width="692" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;Reinforcement learning is a type of machine learning in which we train a model to perform a specific action. The training process involves a rewards mechanism: for each action the model takes we provide a positive or negative reward, and during training the model learns to maximize that reward.&lt;/p&gt;

&lt;p&gt;For example, let's take a look at training a model to play tic-tac-toe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bhcaifqjxd74fu3hl1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bhcaifqjxd74fu3hl1x.png" alt="tic-tac-toe" width="652" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent is a model or policy acting as a Tic-Tac-Toe player. Its objective is to win the game. The environment is the game board, and the state at any moment is the current configuration of the board. The action space comprises all the possible positions a player can choose based on the current board state. The agent makes decisions by following a strategy known as the RL policy. Now, as the agent takes actions, it collects rewards based on how effective those actions are in progressing towards a win. The goal of reinforcement learning is for the agent to learn the optimal policy for a given environment, the one that maximizes its rewards. This learning process is iterative and involves trial and error.&lt;br&gt;
The set of actions and resulting states is called a "rollout".&lt;/p&gt;

&lt;h2&gt;
  
  
  Reinforcement Learning in LLMs
&lt;/h2&gt;

&lt;p&gt;We can use the rewards mechanism of reinforcement learning to fine tune an LLM: for each model output we provide a positive/negative reward based on how close the output is to human preferences, and the model weights are updated so as to maximize the reward.&lt;br&gt;
As opposed to the tic-tac-toe game, where the reward can be computed automatically, obtaining human feedback is time consuming and expensive. A possible solution is to use an additional model, known as the reward model, to classify the outputs of the LLM and evaluate the degree of alignment with human preferences. We'll see next how to train such a reward model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxzgjcgs7o3cvg62ug51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxzgjcgs7o3cvg62ug51.png" alt="RL" width="629" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Collecting Human Feedback
&lt;/h3&gt;

&lt;p&gt;To prepare data for RLHF fine-tuning of an LLM, you first select a capable model, preferably an instruct model, and use it to generate &lt;strong&gt;multiple&lt;/strong&gt; completions for each prompt in a test dataset. The model you choose should have some capability to carry out the task you are interested in, whether this is text summarization, question answering etc.&lt;br&gt;
Then, you collect human feedback by having labelers rank these completions based on defined criteria like helpfulness or toxicity, ensuring &lt;strong&gt;clear and detailed instructions&lt;/strong&gt; are provided to the labelers to maintain consistency and quality. The labelers should &lt;strong&gt;rank the outputs&lt;/strong&gt; from best to worst.&lt;br&gt;
Finally, the ranked completions are transformed into pairs (each completion is paired with every other completion), and a score of 1 is given to the completion that was ranked higher. It is important that the preferred completion is put first in each pair.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt9scny8wo2fxrxmq86r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt9scny8wo2fxrxmq86r.png" alt="human feedback" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;
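
&lt;p&gt;As a rough sketch (with made-up field names, not the actual course lab format), here is how a single labeler ranking could be turned into the pairwise data described above, with the preferred completion always placed first:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from itertools import combinations

# completions for one prompt, with the labeler's rank (1 = best)
ranked = [
    {"text": "completion A", "rank": 2},
    {"text": "completion B", "rank": 1},
    {"text": "completion C", "rank": 3},
]

pairs = []
for a, b in combinations(ranked, 2):
    # the better-ranked (lower rank number) completion goes first and gets the score of 1
    preferred, rejected = sorted((a, b), key=lambda c: c["rank"])
    pairs.append({"chosen": preferred["text"], "rejected": rejected["text"], "score": [1, 0]})

for p in pairs:
    print(p)
&lt;/code&gt;&lt;/pre&gt;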

&lt;h3&gt;
  
  
  Training the Reward Model
&lt;/h3&gt;

&lt;p&gt;The reward model is usually also a language model that is trained using supervised learning methods on the pairwise comparison data that you prepared from the human labelers' assessment of the prompts. For a given prompt X, the reward model learns to favor the human-preferred completion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiofmwbw9px0orph7nr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiofmwbw9px0orph7nr4.png" alt="training reward model" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the model has been trained on the human-ranked prompt-completion pairs, you can use the reward model as a &lt;strong&gt;binary classifier&lt;/strong&gt;. A binary classifier outputs 2 logit values, one for each output class, and this logit value will be used as the &lt;strong&gt;reward&lt;/strong&gt; value when we do the reinforcement learning.&lt;br&gt;
Let's say you want to detoxify your LLM, and the reward model needs to identify whether the completion contains hate speech. In this case, the two classes would be "not-hate", the positive class that you ultimately want to optimize for, and "hate", the negative class you want to avoid. For each of these classes the model will output a logit value which will be used as the reward value; ideally the "not-hate" class will have a higher logit value than the "hate" class (of course we can apply a softmax function to the logit values to get probabilities).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febr2bbt3zy5z5rjbzmvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febr2bbt3zy5z5rjbzmvl.png" alt="binary classifer" width="686" height="240"&gt;&lt;/a&gt;&lt;/p&gt;
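
&lt;p&gt;A minimal sketch of this with the Hugging Face transformers library (the model name below is only a placeholder; substitute whatever hate-speech classifier you actually use as the reward model):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "your-org/your-hate-speech-classifier"   # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)

completion = "This product is okay but not the best"
inputs = tokenizer(completion, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits[0]    # one logit per class, e.g. [not-hate, hate]

reward = logits[0].item()                        # use the "not-hate" logit as the reward
probs = torch.softmax(logits, dim=-1)            # optional: turn the logits into probabilities
&lt;/code&gt;&lt;/pre&gt;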

&lt;h3&gt;
  
  
  Fine Tuning using the Reward Model
&lt;/h3&gt;

&lt;p&gt;In order to fine tune a model using the reward model we first need to choose a model that already has a good performance on the task at hand.&lt;br&gt;
Next we start an iterative process in which we:&lt;br&gt;
1) Take a prompt from the dataset&lt;br&gt;
2) Generate a completion from the model&lt;br&gt;
3) Score the output using the reward model&lt;br&gt;
4) Run a reinforcement learning algorithm that updates the model weights.&lt;/p&gt;

&lt;p&gt;We continue this process until we meet some stopping criteria, like "max steps" or a threshold evaluation score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa20u2nxg2bmtg2uygzko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa20u2nxg2bmtg2uygzko.png" alt="training loop" width="687" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are several reinforcement learning algorithms available; a popular choice is PPO (Proximal Policy Optimization). The details of this algorithm are quite complex, but it's not necessary to know them in order to use it.&lt;/p&gt;
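
&lt;p&gt;A structural sketch of that loop (everything here is a toy stand-in: the generate, reward and PPO helpers are hypothetical placeholders for your actual policy LLM, reward model and RL library):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

prompt_dataset = ["This product is", "The delivery was"]      # toy prompts

def generate(policy, prompt):
    return prompt + " okay but not the best"                  # a real policy LLM generates this

def reward_score(reward_model, prompt, completion):
    return random.uniform(0.0, 3.0)                           # a real reward model scores this

def ppo_update(policy, prompt, completion, reward):
    pass                                                      # a real PPO step updates the weights

MAX_STEPS = 1000                                              # stopping criterion: "max steps"
for step in range(MAX_STEPS):
    prompt = random.choice(prompt_dataset)                            # 1) take a prompt
    completion = generate("policy_llm", prompt)                       # 2) generate a completion
    reward = reward_score("reward_model", prompt, completion)         # 3) score with the reward model
    ppo_update("policy_llm", prompt, completion, reward)              # 4) update the model weights
&lt;/code&gt;&lt;/pre&gt;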

&lt;h3&gt;
  
  
  Reward Hacking
&lt;/h3&gt;

&lt;p&gt;An interesting problem that can emerge in reinforcement learning is known as reward hacking, where the agent learns to cheat the system by favoring actions that maximize the reward received even if those actions don't align well with the original objective.&lt;br&gt;
Let's say we start with the prompt &lt;strong&gt;"This product is"&lt;/strong&gt; and the first completion is "complete garbage". This completion gets a low reward score, so PPO updates the model weights and the next completion becomes "okay but not the best". The optimization keeps going, and in order to maximize the reward we can end up with a completion like "the most awesome", which is a bit exaggerated but gets a high reward value. In extreme cases we can get completions that are completely incorrect but maximize the reward, like "beautiful love and world peace all around".&lt;/p&gt;

&lt;p&gt;To mitigate this risk we penalize the reward when the updated LLM's completion drifts too far from the original model's completion. The divergence is measured using KL Divergence, a mathematical measure of the difference between two probability distributions, which tells us how much one distribution differs from another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w9oqquwcvbo8gyv1v6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w9oqquwcvbo8gyv1v6n.png" alt="reward hacking" width="719" height="293"&gt;&lt;/a&gt;&lt;/p&gt;
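
&lt;p&gt;A small sketch of the idea with made-up numbers: the reward from the reward model is reduced in proportion to the KL divergence between the updated model's token distribution and the frozen reference model's distribution:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def kl_divergence(p, q):
    # KL(p || q) between two discrete probability distributions over the same tokens
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

reference_probs = [0.50, 0.30, 0.15, 0.05]   # frozen reference model (made up)
updated_probs   = [0.20, 0.20, 0.10, 0.50]   # RLHF-updated model (made up)

raw_reward = 2.4                             # score from the reward model (made up)
beta = 0.2                                   # strength of the KL penalty
penalty = kl_divergence(updated_probs, reference_probs)
penalized_reward = raw_reward - beta * penalty
print(round(penalized_reward, 3))
&lt;/code&gt;&lt;/pre&gt;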

&lt;p&gt;Note that this is quite compute and memory intensive, as we need to keep both the original and the updated model in memory.&lt;br&gt;
Instead of training a whole new model we can train just a PEFT adapter (LoRA); this way we hold in memory only the original model and the PEFT matrices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nw9h8pk7n06zn5qax3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nw9h8pk7n06zn5qax3v.png" alt="reward hacking peft" width="707" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation
&lt;/h3&gt;

&lt;p&gt;Once you have completed your RLHF alignment of the model, you will want to assess the model's performance.&lt;br&gt;
A simple way is to compute the reward over an entire dataset using the original instruct model, then compute the reward of the same dataset with the RLHF model, and compare the two scores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhuvh7vcwwc9usyj8vzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhuvh7vcwwc9usyj8vzm.png" alt="evaluation" width="678" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling Human Feedback (Constitutional AI)
&lt;/h3&gt;

&lt;p&gt;The human effort required to produce the trained reward model in the first place is huge. The labeled data set used to train the reward model typically requires large teams of labelers, sometimes many thousands of people to evaluate many prompts each.&lt;/p&gt;

&lt;p&gt;Constitutional AI is one approach to scaling supervision. First proposed in 2022 by researchers at Anthropic, Constitutional AI is a method for training models using a set of rules and principles that govern the model's behavior. Basically we use a set of rules (a constitution) and an initial model to automatically generate a dataset that will be used to fine-tune that original model.&lt;/p&gt;

&lt;p&gt;Example of constitutional principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Please choose the response that is the most helpful, honest, and harmless&lt;/li&gt;
&lt;li&gt;Choose the response that is less harmful, paying close attention to whether each response encourages illegal, unethical or immoral activity.&lt;/li&gt;
&lt;li&gt;Choose the response that sounds most similar to what a peaceful, ethical, and wise person like Martin Luther King Jr. or Mahatma Gandhi might say&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The way to generate the pairs dataset is as follows:&lt;br&gt;
1) Prompt the model in ways that try to get it to generate harmful responses, these prompts are called "Red Teaming Prompts".&lt;br&gt;
2) Ask the model to critique its own harmful responses according to the constitutional principles.&lt;br&gt;
3) Ask the model to revise the response to comply with those rules.&lt;/p&gt;

&lt;p&gt;Here is an example of generating a single pair of training data. The green rectangles mark the original prompt and the final constitutional response.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkwjoo32og2u6grcf69e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkwjoo32og2u6grcf69e.png" alt="auto gen pairs" width="734" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have generated the dataset we can use it to fine-tune the original model.&lt;/p&gt;

&lt;p&gt;Now that we have a fine-tuned model that is aligned with our principles, we can use it to run Reinforcement Learning from AI Feedback (RLAIF).&lt;br&gt;
The process is:&lt;br&gt;
1) Use the fine-tuned model to generate multiple responses to red-teaming prompts.&lt;br&gt;
2) Ask the model "which response is preferred" - this way we build a dataset of ranked responses.&lt;br&gt;
3) Using the ranked responses dataset we train the Reward model.&lt;br&gt;
4) Using the Reward model we further fine-tune the model to get the final Constitutional LLM.&lt;/p&gt;

&lt;p&gt;The complete process is like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn49djfkeialiogg9z9bb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn49djfkeialiogg9z9bb.png" alt="full process" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  LLM-powered Applications
&lt;/h1&gt;

&lt;p&gt;Let's talk about the things you'll have to consider when integrating your model into applications, specifically optimizing the model for inference and augmenting it with tools to build AI-powered applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Optimization
&lt;/h2&gt;

&lt;p&gt;Large language models present inference challenges in terms of computing and storage requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distillation
&lt;/h3&gt;

&lt;p&gt;Model Distillation is a technique that focuses on having a larger teacher model train a smaller student model. The student model learns to statistically mimic the behavior of the teacher model, either just in the final prediction layer or in the model's hidden layers as well.&lt;/p&gt;

&lt;h4&gt;
  
  
  How it works?
&lt;/h4&gt;

&lt;p&gt;You freeze the teacher model's weights and use it to generate completions for your training data ("labels"). At the same time, you generate completions for the training data using your student model ("predictions"). The knowledge distillation between teacher and student model is achieved by minimizing a loss function called the distillation loss. To calculate this loss, distillation uses the probability distribution over tokens that is produced by the teacher model's softmax layer. Now, the teacher model is already fine tuned on the training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz3gpkj106v4kbqwjbvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz3gpkj106v4kbqwjbvm.png" alt="distillation" width="496" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The teacher's probability distribution likely closely matches the &lt;strong&gt;ground truth&lt;/strong&gt; data and won't have much variation in tokens. That's why distillation applies a little trick: adding a temperature parameter to the softmax function. With a temperature greater than one, the probability distribution becomes broader and less strongly peaked. This softer distribution provides you with a set of tokens that are similar to the ground truth tokens.&lt;br&gt;
&lt;strong&gt;In parallel&lt;/strong&gt;, you train the student model to generate the correct predictions based on your ground truth training data. Here you don't vary the temperature setting and instead use the standard softmax function. Distillation refers to the student model outputs as the hard predictions and to the ground truth as the hard labels; the loss between these two is the student loss. The &lt;strong&gt;combined distillation and student losses&lt;/strong&gt; are used to update the weights of the student model via back propagation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhojuy9h7qnothdxvp8m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhojuy9h7qnothdxvp8m7.png" alt="distillation full process" width="734" height="301"&gt;&lt;/a&gt;&lt;/p&gt;
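
&lt;p&gt;A rough PyTorch sketch of the combined loss on toy logits (assuming a temperature T greater than 1 for the soft targets and an alpha weight between the two losses):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

T = 2.0          # temperature for the soft (distillation) targets
alpha = 0.5      # weight between distillation loss and student loss

teacher_logits = torch.tensor([[2.0, 1.0, 0.1]])                      # toy values
student_logits = torch.tensor([[1.5, 0.8, 0.3]], requires_grad=True)
ground_truth = torch.tensor([0])                                      # hard label

# distillation loss: the student tries to match the teacher's softened distribution
soft_teacher = F.softmax(teacher_logits / T, dim=-1)
log_soft_student = F.log_softmax(student_logits / T, dim=-1)
distillation_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

# student loss: standard cross entropy against the ground truth (temperature of 1)
student_loss = F.cross_entropy(student_logits, ground_truth)

loss = alpha * distillation_loss + (1 - alpha) * student_loss
loss.backward()   # gradients flow only into the student
&lt;/code&gt;&lt;/pre&gt;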

&lt;p&gt;In practice, distillation is not as effective for generative decoder models. It's typically more effective for encoder only models, such as BERT that have a lot of representation redundancy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Post-Training Quantization (PTQ)
&lt;/h3&gt;

&lt;p&gt;PTQ transforms a model's weights to a lower precision representation, such as 16-bit floating point or 8-bit integer, in order to reduce the model size and memory footprint, as well as the compute resources needed for model serving.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r8e5m8nvz1s4l5jgbwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r8e5m8nvz1s4l5jgbwa.png" alt="PTQ" width="664" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that quantization also requires an extra calibration step to statistically capture the dynamic range of the original parameter values. As with other methods, there are tradeoffs because sometimes quantization results in a small percentage reduction in model evaluation metrics.&lt;/p&gt;
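
&lt;p&gt;For intuition, a simplified sketch of symmetric 8-bit quantization of a single weight tensor, where the "calibration" is just observing the maximum absolute value:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)       # original FP32 weights

max_abs = np.abs(weights).max()                           # calibration: capture the dynamic range
scale = max_abs / 127.0                                   # map that range onto int8

q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
deq_weights = q_weights.astype(np.float32) * scale        # what the model effectively uses at inference

print("max quantization error:", np.abs(weights - deq_weights).max())
&lt;/code&gt;&lt;/pre&gt;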

&lt;h3&gt;
  
  
  Pruning
&lt;/h3&gt;

&lt;p&gt;The goal is to reduce model size for inference by eliminating weights that are not contributing much to overall model performance. These are the weights with values very close to or equal to zero. Note that some pruning methods require full retraining of the model, while others fall into the category of parameter efficient fine tuning, such as LoRA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F375gcfrt67pbcl60iykw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F375gcfrt67pbcl60iykw.png" alt="pruning" width="676" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, however, there may not be much impact on the size and performance if only a small percentage of the model weights are close to zero.&lt;/p&gt;
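
&lt;p&gt;A toy sketch of magnitude pruning, zeroing out the weights whose values are closest to zero:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

weights = np.random.randn(4, 4)
threshold = np.quantile(np.abs(weights), 0.3)           # prune the 30% smallest magnitudes
mask = np.greater_equal(np.abs(weights), threshold)     # keep only the larger weights
pruned_weights = weights * mask

print("fraction of weights kept:", mask.mean())
&lt;/code&gt;&lt;/pre&gt;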

&lt;h2&gt;
  
  
  Model Preparation Cheat Sheet
&lt;/h2&gt;

&lt;p&gt;Here is a summary of all the methods we discussed to train/adjust a model for your needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87x4fjht7d4vdo5bznfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87x4fjht7d4vdo5bznfh.png" alt="cheat sheet" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using LLM in Application
&lt;/h2&gt;

&lt;p&gt;When using an LLM in an application there are some challenges that might need to be addressed, specifically when the LLM is asked about events that happened after it was trained, or about topics it has no information about (like internal corporate data). In these cases the model may hallucinate and provide inaccurate responses.&lt;/p&gt;

&lt;p&gt;There are several techniques available to overcome these challenges, like providing the LLM with data from external sources or external applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0c21wiq2aiwujm7ow9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0c21wiq2aiwujm7ow9e.png" alt="llm in apps" width="608" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval Augmented Generation (RAG)
&lt;/h3&gt;

&lt;p&gt;RAG is a framework for building LLM powered systems that make use of external data sources. RAG is a great way to overcome the knowledge cutoff issue and help the model update its understanding of the world. &lt;br&gt;
RAG is useful in any case where you want the language model to have access to data that it may not have seen. This could be new information documents not included in the original training data, or proprietary knowledge stored in your organization's private databases.&lt;/p&gt;

&lt;h4&gt;
  
  
  How it works?
&lt;/h4&gt;

&lt;p&gt;Implementing RAG can be quite complex and there are a lot of considerations, in a nutshell:&lt;br&gt;
At the heart of this implementation is a model component called the Retriever, which consists of a query encoder and an external data source. The encoder takes the user's input prompt and encodes it into a form that can be used to query the data source. The Retriever returns the best single or group of documents from the data source and combines the new information with the original user query. The new expanded prompt is then passed to the language model, which generates a completion that makes use of the data.&lt;br&gt;
A popular choice of data source is a vector database. All documents in the database are pre-encoded into embedding vectors. When a user query is entered, the Encoder converts it into its own embedding vector. The Retriever then identifies and retrieves documents whose embedding vectors are most similar to the query's embedding vector.&lt;/p&gt;
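
&lt;p&gt;A tiny sketch of the retrieval step with an in-memory "vector database" (the embed function below is a toy stand-in for a real embedding model, and the documents are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def embed(text):
    # toy stand-in for a real embedding model: normalized letter counts
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = [
    "Return policy: items can be returned within 30 days.",
    "Shipping usually takes 3 to 5 business days.",
]
doc_vectors = np.array([embed(d) for d in documents])   # pre-encoded "vector database"

query = "How long do I have to return an item?"
scores = doc_vectors @ embed(query)                     # similarity between query and documents
best = documents[int(np.argmax(scores))]                # the Retriever picks the closest document

augmented_prompt = f"Answer using this context:\n{best}\n\nQuestion: {query}"
print(augmented_prompt)
&lt;/code&gt;&lt;/pre&gt;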

&lt;h3&gt;
  
  
  Interacting External Applications
&lt;/h3&gt;

&lt;p&gt;An LLM can also interact with external applications if we provide it with the needed "tools"; this is basically an "agent".&lt;br&gt;
Consider a support chatbot in an ecommerce website that lets users return merchandise.&lt;br&gt;
The bot could start by asking the user for the order id, then activate the "fetch order" tool (an API) to retrieve the order. Then it can call the "tool" that generates a return label, ask the user for their email address, and finally call the "tool" that emails the return label to the user.&lt;br&gt;
For this to work the LLM should be able to plan actions in advance and then execute them.&lt;br&gt;
There are different orchestration libraries for developing agents, so the specifics differ between libraries, but the general idea is the same: the LLM is the application's reasoning engine that instructs the orchestration library what actions to take; the orchestration library then executes the tool selected by the LLM and returns the result to the LLM so it can continue with the reasoning (i.e. the app's business logic). A rough sketch of this loop follows.&lt;/p&gt;
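
&lt;p&gt;Here is that loop as a toy sketch (the tools are fake functions and the LLM's reasoning is replaced by a scripted plan; a real agent library would let the model itself emit the tool name and arguments at each step):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical tools the orchestration layer can run on behalf of the LLM
def fetch_order(order_id):
    return {"order_id": order_id, "item": "headphones"}

def create_return_label(order_id):
    return "LABEL-" + order_id

def email_label(address, label):
    return "emailed " + label + " to " + address

tools = {"fetch_order": fetch_order,
         "create_return_label": create_return_label,
         "email_label": email_label}

# stand-in for the LLM's reasoning: a scripted plan instead of real model calls
scripted_plan = [("fetch_order", ["1234"]),
                 ("create_return_label", ["1234"]),
                 ("email_label", ["user@example.com", "LABEL-1234"])]

def llm_decide(history):
    return scripted_plan.pop(0) if scripted_plan else None

# the orchestration loop: the LLM decides, the library executes and feeds the result back
history = ["user wants to return order 1234, email: user@example.com"]
while True:
    decision = llm_decide(history)
    if decision is None:
        break
    tool_name, args = decision
    result = tools[tool_name](*args)
    history.append(tool_name + " returned " + str(result))

print(history)
&lt;/code&gt;&lt;/pre&gt;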

&lt;h3&gt;
  
  
  Program Aided Language Models (PAL)
&lt;/h3&gt;

&lt;p&gt;As we know the ability of LLMs to carry out arithmetic and other mathematical operations is limited. While you can try using chain of thought prompting to overcome this, it will only get you so far.&lt;br&gt;
Remember, the model isn't actually doing any real math here. It is simply trying to predict the most probable tokens that complete the prompt.&lt;/p&gt;

&lt;p&gt;You can overcome this limitation by allowing your model to interact with external applications that are good at math, like a Python interpreter. One interesting framework for augmenting LLMs in this way is called program-aided language models, or PAL for short. This work, first presented by Luyu Gao and collaborators at Carnegie Mellon University in 2022, pairs an LLM with an external code interpreter to carry out calculations.&lt;/p&gt;

&lt;p&gt;The idea is pretty cool: in the prompt we provide a one-shot example of how to solve a problem using Python code, then we present a new problem. The response from the model is Python code that is supposed to solve the problem; the orchestration library then executes that code and returns the result to the model so it can produce the final answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs22b307xpc2i9qyl3603.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs22b307xpc2i9qyl3603.png" alt="PAL" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;
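
&lt;p&gt;A toy sketch of that flow (the model call is simulated with a hard-coded completion; a real setup sends the 1-shot prompt to an LLM and executes whatever code comes back, which should obviously be sandboxed):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;one_shot_example = """
# Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
tennis_balls = 5
bought_balls = 2 * 3
answer = tennis_balls + bought_balls
"""

new_question = "# Q: A bakery had 23 muffins and sold 9. How many muffins are left?"
prompt = one_shot_example + "\n" + new_question    # this prompt would be sent to the LLM

# simulated model completion: in PAL the LLM is expected to answer with Python code
model_completion = "muffins = 23\nsold = 9\nanswer = muffins - sold"

# the orchestration library executes the generated code and reads back the result
namespace = {}
exec(model_completion, namespace)                  # never exec untrusted output in production
print("final answer:", namespace["answer"])        # 14
&lt;/code&gt;&lt;/pre&gt;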

&lt;h3&gt;
  
  
  LLM Application Architecture
&lt;/h3&gt;

&lt;p&gt;As you can see, the model is typically only one part of the story in building end-to-end generative AI applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw3mjja9aymunhg4plio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw3mjja9aymunhg4plio.png" alt="llm app arch" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS Sagemaker
&lt;/h4&gt;

&lt;p&gt;Sagemaker JumpStart is a model hub, and it allows you to quickly deploy foundation models that are available within the service, and integrate them into your own applications.&lt;br&gt;
The JumpStart service also provides an easy way to fine-tune and deploy models. JumpStart covers many parts of the architecture diagram above, including the infrastructure, the LLM itself, the tools and frameworks, and even an API to invoke the model.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
    </item>
    <item>
      <title>Notes from course: Generative AI with Large Language Models - Week 2</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sun, 23 Mar 2025 19:42:52 +0000</pubDate>
      <link>https://dev.to/valyouw/notes-from-course-generative-ai-with-large-language-models-week-2-1om3</link>
      <guid>https://dev.to/valyouw/notes-from-course-generative-ai-with-large-language-models-week-2-1om3</guid>
      <description>&lt;p&gt;Notes from &lt;a href="https://www.thecodingnotebook.com/2025/01/notes-from-course-generative-ai-with.html" rel="noopener noreferrer"&gt;Week 1&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Fine Tuning
&lt;/h1&gt;

&lt;p&gt;Fine tuning is the process of training the model with task-specific examples. This is done via supervised learning by providing samples of prompts and desired completions.&lt;br&gt;
 &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnum55qo0oqny344b46b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnum55qo0oqny344b46b.png" alt="fine tuning" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Instruction Fine Tuning
&lt;/h2&gt;

&lt;p&gt;Instruction fine tuning is a specific method for fine tuning the model for a specific task, like text summarization. This fine tuning is done by providing a dataset with pairs of prompt/completion. The prompt should have an instruction and the completion is the desired result.&lt;br&gt;
For example for text summarization the dataset can include:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Summarize the following text:&lt;br&gt;
[example text]&lt;br&gt;
[example completion]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In instruction fine tuning ALL the model's weights are updated (full fine tuning), so the required compute resources are quite big.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare the dataset
&lt;/h3&gt;

&lt;p&gt;As we said, it is important to format the prompt in the dataset as an "instruction". While there are many public datasets that are not formatted as "instructions" we can convert them to be instructional according to the task we want to achieve.&lt;br&gt;
For example the Amazon products review dataset has: &lt;code&gt;product_title&lt;/code&gt;, &lt;code&gt;review_headline&lt;/code&gt;, &lt;code&gt;review_body&lt;/code&gt;, &lt;code&gt;star_rating&lt;/code&gt;, if we want to use this dataset for text generation we could format the following prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Generate a {{star_rating}}-star review (1 being lowest and 5 being highest) about this product {{product_title}}.&lt;br&gt;
|||&lt;br&gt;
{{review_body}}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or for text summarization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Give a short sentence describing the following product review:&lt;br&gt;
{{review_body}}&lt;br&gt;
|||&lt;br&gt;
{{review_headline}}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Fine Tuning Process
&lt;/h3&gt;

&lt;p&gt;The fine tuning process is like a regular neural network training process.&lt;br&gt;
We give the model our prompt, and then we calculate the cross-entropy loss of the model output vs the desired output, update the model weights using back propagation and continue over several epochs (remember that the model output is words probabilities, we use that to compute the loss).&lt;br&gt;
By the end of this process you will have a fine-tuned Instruct LLM model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single Task Fine Tuning
&lt;/h2&gt;

&lt;p&gt;While LLMs can perform many tasks, with fine tuning we can specialize for a single specific task, like summarization; for this, a relatively small dataset of 500-1000 examples is often enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Catastrophic Forgetting
&lt;/h3&gt;

&lt;p&gt;The downside of fine tuning on a single task is that it can lead to a phenomenon called "catastrophic forgetting". This is where the model "forgets" how to perform the other, original tasks it was trained on. It happens because fine tuning updates the entire weights of the model, so other tasks can start to behave like the single task it was fine tuned on.&lt;/p&gt;

&lt;p&gt;So what can we do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First we make sure we need the model to generalize on other tasks, maybe the fine tuned model on the single task is all we need&lt;/li&gt;
&lt;li&gt;We can fine tune the model on multiple tasks, but this requires a significant amount of data, like 50,000-100,000 samples.&lt;/li&gt;
&lt;li&gt;Use a method called PEFT - Parameter Efficient Fine-tuning. In this method we do not modify the entire weights of the model but only a small set of adapter layers of the specific task.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Multi Task Fine Tuning
&lt;/h2&gt;

&lt;p&gt;Multitask fine-tuning is an extension of single task fine-tuning, where the training dataset is comprised of example inputs and outputs for multiple tasks. Here, the dataset contains examples that instruct the model to carry out a variety of tasks, including summarization, review rating, code translation, and entity recognition. You train the model on this mixed dataset so that it can improve the performance of the model on all the tasks simultaneously, thus avoiding the issue of catastrophic forgetting.&lt;/p&gt;

&lt;h3&gt;
  
  
  FLAN
&lt;/h3&gt;

&lt;p&gt;FLAN, which stands for fine-tuned language net, is a specific set of instructions used to fine-tune different models.&lt;br&gt;
FLAN-T5 is the FLAN instruct version of the T5 foundation model, while FLAN-PaLM is the FLAN instruct version of the PaLM foundation model.&lt;br&gt;
FLAN-T5 is a great general purpose instruct model. In total, it's been fine tuned on 473 datasets across 146 task categories.&lt;br&gt;
One example of a prompt dataset used for summarization tasks in FLAN-T5 is SAMSum.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlnjywe398vh8taecmuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlnjywe398vh8taecmuh.png" alt="samsum" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using these dialogues and summaries different instruction prompts have been generated:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F872f05t3i4ewftwoo9pa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F872f05t3i4ewftwoo9pa.png" alt="samsum result" width="623" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Including different ways of saying the same instruction helps the model generalize and perform better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Evaluation
&lt;/h2&gt;

&lt;p&gt;In traditional machine learning, you can assess how well a model is doing by looking at its performance on training and validation data sets where the output is already known. You're able to calculate simple metrics such as accuracy (because the models are deterministic). But with large language models, where the output is non-deterministic and language-based, evaluation is much more challenging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13ite4234ynbjyzp6zoc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13ite4234ynbjyzp6zoc.png" alt="evaluation" width="469" height="160"&gt;&lt;/a&gt;&lt;br&gt;
Note in the second example there is only 1 word difference between the two sentences, a simple measurement metric could have marked them as similar.&lt;/p&gt;

&lt;p&gt;We will discuss 2 evaluation methods, ROUGE and BLEU. It is &lt;strong&gt;important&lt;/strong&gt; to note that the evaluation scores are not comparable across different tasks; for example, a ROUGE score for a text summarization task cannot be compared to a ROUGE score of a title generation task.&lt;/p&gt;

&lt;h3&gt;
  
  
  ROUGE
&lt;/h3&gt;

&lt;p&gt;ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is primarily employed to assess the quality of automatically generated summaries by comparing them to human-generated reference summaries. &lt;br&gt;
It is best used for text summarization (compare the generated summary to one or more reference summaries).&lt;/p&gt;

&lt;h4&gt;
  
  
  ROUGE-N
&lt;/h4&gt;

&lt;p&gt;With ROUGE-N (i.e. ROUGE-1, ROUGE-2 etc.) we take N-word sequences from the generated output and compare them to the reference (human) output; from the overlap we can compute the standard metrics of recall/precision/F1.&lt;/p&gt;

&lt;p&gt;For example ROUGE-1&lt;br&gt;
(single word is called "unigram")&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi36ytm331782i2mk2cb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi36ytm331782i2mk2cb.png" alt="rouge-1" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that if the generated output was "It is NOT cold outside" the scores would be exactly the SAME!&lt;/p&gt;
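
&lt;p&gt;A quick sketch of ROUGE-1 on two example sentences (the sentences are just an illustration, and this simplified version counts each unique unigram once):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def rouge_1(reference, generated):
    ref_words = reference.lower().split()
    gen_words = generated.lower().split()
    overlap = sum(1 for w in set(gen_words) if w in ref_words)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    recall = overlap / len(ref_words)
    precision = overlap / len(gen_words)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

print(rouge_1("It is cold outside", "It is very cold outside"))   # recall 1.0, precision 0.8, F1 about 0.89
print(rouge_1("It is cold outside", "It is not cold outside"))    # same scores, opposite meaning
&lt;/code&gt;&lt;/pre&gt;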

&lt;p&gt;Here is an example of ROUGE-2&lt;br&gt;
(two words sequence is called "bigram")&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuf1irzy9ne29ctnzet6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuf1irzy9ne29ctnzet6.png" alt="rouge-2" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  ROUGE-L
&lt;/h4&gt;

&lt;p&gt;Instead of just increasing the sequence length we test in ROUGE-N, we can test for the Longest Common Subsequence (LCS):&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc87o6hgbesxjbsqn6pz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc87o6hgbesxjbsqn6pz.png" alt="rouge-l" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we saw, ROUGE scores for totally wrong generations can be the same as correct generations, choosing the correct ROUGE metric is highly dependent on the task at hand and the length of generated output.&lt;/p&gt;

&lt;h3&gt;
  
  
  BLEU
&lt;/h3&gt;

&lt;p&gt;BLEU, or bilingual evaluation understudy is an algorithm designed to evaluate the quality of machine-translated text by comparing it to human-generated translations.&lt;br&gt;
It is used for text translation.&lt;br&gt;
&lt;code&gt;BLEU = Avg(precision across range of n-gram sizes)&lt;/code&gt;&lt;br&gt;
Here are examples of BLEU scores for different generations compared to a human reference:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dosenhzyn3yo6x7gyhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dosenhzyn3yo6x7gyhd.png" alt="bleu" width="800" height="293"&gt;&lt;/a&gt;&lt;br&gt;
As we get closer to the reference the score gets closer to 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Both ROUGE and BLEU are quite simple metrics and are relatively low-cost to calculate. You can use them as a simple reference as you iterate over your models, but you shouldn't use them alone to report the final evaluation of a large language model. For an overall evaluation of your model's performance you will need to look at one of the evaluation benchmarks that have been developed by researchers.&lt;/p&gt;

&lt;p&gt;Benchmarks like GLUE and SuperGLUE assess general language understanding and complex reasoning, respectively, with leaderboards for model comparison. Newer benchmarks like MMLU and BIG-bench challenge LLMs with tasks requiring extensive world knowledge and problem-solving abilities, while HELM emphasizes transparency and holistic evaluation, including fairness, bias, and toxicity metrics, to provide a nuanced understanding of LLM performance and limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parameter Efficient Fine-Tuning (PEFT)
&lt;/h2&gt;

&lt;p&gt;Training large language models (LLMs) via full fine-tuning is computationally expensive due to the massive memory requirements for storing model weights, optimizer states, gradients, and activations, often exceeding consumer hardware capabilities. Parameter-efficient fine-tuning (PEFT) methods address this by updating only a small subset of parameters or adding new, trainable components while freezing most of the original LLM, significantly reducing memory usage and enabling training on single GPUs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqlx56101aoi7zl6yg0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqlx56101aoi7zl6yg0a.png" alt="peft overview" width="720" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are 3 main PEFT techniques and we will take a look at 2 of them (the results of "selective" PEFT technique are mixed so we won't look at it). &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatwpne1nvs5gqa1n8ztl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatwpne1nvs5gqa1n8ztl.png" alt="peft techniques" width="748" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Low-Rank Adaptation (LoRA)
&lt;/h3&gt;

&lt;p&gt;In a Transformer model we usually have 2 neural networks in the encoder and 2 in the decoder: the first is the self-attention network and the second is the feed-forward network (that produces the result). When training such an LLM we are essentially training the parameters of all these networks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4fnalwilgjwle7g26kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4fnalwilgjwle7g26kk.png" alt="lora" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In LoRA fine tuning we keep the original weights of the self-attention layer (can be the feed-forward as well) untouched (the matrix with the snow icon on it), but we add 2 small matrices (the column matrix B and the row matrix A). The multiplication BxA produces a matrix with the same dimensions as the original self-attention layer matrix, so we can add them together to get a modified self-attention layer.&lt;br&gt;
During training all we have to learn are the low-rank (small) A and B matrices.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;In the original "attention is all you need" paper the transformer weight matrices were 512 x 64 (32,768 parameters).&lt;br&gt;
If we want to fine tune it using LoRA we can select any 2 matrices whose multiplication results in a matrix of dimension 512 x 64; for example a 512x1 matrix times a 1x64 matrix, which is a LoRA with rank &lt;code&gt;r = 1&lt;/code&gt;.&lt;br&gt;
For &lt;code&gt;r = 8&lt;/code&gt; we will use matrices with dimensions 512x8 and 8x64, so during training we have to learn only 4,608 parameters, compared to the 32,768 parameters of the original layer (an 86% reduction).&lt;/p&gt;
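
&lt;p&gt;A quick numpy sketch of that arithmetic and the BxA trick, with the dimensions from the example above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

d_in, d_out, r = 512, 64, 8

W = np.random.randn(d_in, d_out)     # frozen original weights: 512 x 64 = 32,768 parameters
B = np.random.randn(d_in, r)         # trainable: 512 x 8
A = np.random.randn(r, d_out)        # trainable: 8 x 64

lora_params = B.size + A.size        # 4,096 + 512 = 4,608 trainable parameters
reduction = round(100 * (1 - lora_params / W.size))
print(lora_params, "trainable parameters instead of", W.size, "- a", reduction, "% reduction")

W_adapted = W + B @ A                # same 512 x 64 shape, so it can replace the original matrix
print(W_adapted.shape)
&lt;/code&gt;&lt;/pre&gt;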

&lt;h4&gt;
  
  
  Multiple LoRAs
&lt;/h4&gt;

&lt;p&gt;In full fine tuning we can fine tune the model to perform multiple tasks, so we have a single model that can handle various tasks.&lt;br&gt;
In LoRA fine tuning we are usually targeting a specific task.&lt;br&gt;
So how can we use the same model for different specialized tasks? Since LoRA matrices are small we can easily store many of them in memory, and during inference we can inject the correct task-specific LoRA matrices into the original model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbwtf4qanosofqitajtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbwtf4qanosofqitajtu.png" alt="multiple loras" width="624" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Evaluation
&lt;/h4&gt;

&lt;p&gt;In order to evaluate the performance of LoRA lets compare its ROUGE score to a base model and to a fully fine tuned model.&lt;br&gt;
We will use the FLAN-T5 model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mmdugnb9fg901kklz2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mmdugnb9fg901kklz2m.png" alt="lora eval" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that a fully fine tuned model has a ROUGE score that is higher by 80% compared to the base model, but note that the LoRA fine tuned model has a score that is higher by 75%, almost as good as the fully fine tuned model!&lt;/p&gt;

&lt;h4&gt;
  
  
  Choosing the LoRA rank
&lt;/h4&gt;

&lt;p&gt;Research by Microsoft tested different LoRA ranks for the same task, measuring results with different evaluation metrics; here are the results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7w4s3nhe2kjvw30e696.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7w4s3nhe2kjvw30e696.png" alt="lora rnak" width="497" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A LoRA rank of 4 achieved the highest score in most of the evaluations.&lt;/li&gt;
&lt;li&gt;The training value loss did not improve with higher LoRA ranks beyond 16.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Prompt Tuning
&lt;/h3&gt;

&lt;p&gt;While prompt tuning sounds like prompt engineering they are NOT the same!&lt;br&gt;
In prompt engineering we are making changes to the prompt trying to make the model output the text we want.&lt;br&gt;
With prompt tuning we are adding a few "virtual token" vectors to our prompt vectors; the values of these token vectors are learned during training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc0p6id42jonv5hktvqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc0p6id42jonv5hktvqt.png" alt="prompte tuning" width="735" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each word in the input prompt is represented by an embedding vector, so basically we are adding a few more embedding vectors to our prompt (these are not "real" word tokens, hence the name "virtual tokens"), and then we train the model to learn the values of these extra embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce1ww4e9s9ymoelhg68w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce1ww4e9s9ymoelhg68w.png" alt="embedding space" width="561" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically it is very similar to LoRA: instead of altering the self-attention layer we are changing the input embeddings, and in both cases we need to learn only a small set of parameters.&lt;/p&gt;
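
&lt;p&gt;A minimal PyTorch sketch of the idea: a few trainable "virtual token" embeddings are prepended to the (frozen) embeddings of the real prompt tokens, and only those virtual embeddings get trained:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

d_model, num_virtual_tokens = 16, 4

# frozen embeddings of the real prompt tokens (toy values standing in for the model's embedding layer)
prompt_embeddings = torch.randn(10, d_model)    # 10 real tokens

# the only trainable parameters: the virtual token embeddings
virtual_tokens = torch.nn.Parameter(torch.randn(num_virtual_tokens, d_model))

# prepend the virtual tokens; the combined sequence is what gets fed into the frozen LLM
model_input = torch.cat([virtual_tokens, prompt_embeddings], dim=0)
print(model_input.shape)    # torch.Size([14, 16])
&lt;/code&gt;&lt;/pre&gt;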

&lt;h4&gt;
  
  
  Evaluation
&lt;/h4&gt;

&lt;p&gt;We can see that the larger the model the closer the performance of prompt tuning is to fully-fine tuned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xe742gxlmiocdgs7rzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xe742gxlmiocdgs7rzc.png" alt="prompt tuning eval" width="758" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
    </item>
    <item>
      <title>OpenCV vs YOLO for Small Object Detection</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sun, 16 Feb 2025 11:01:47 +0000</pubDate>
      <link>https://dev.to/valyouw/opencv-vs-yolo-for-small-object-detection-7eo</link>
      <guid>https://dev.to/valyouw/opencv-vs-yolo-for-small-object-detection-7eo</guid>
      <description>&lt;p&gt;Well, not really one vs the other but how to get the best of both worlds!&lt;br&gt;
Small object detection is a tough nut to crack, especially when it comes to tiny trading signals on a complex stock chart. In this video, I tackle this challenge head-on, using a two-stage approach. First, I leverage OpenCV to create a fast initial detection algorithm.  Then, I use these initial detections to train a YOLO model, which, combined with the SAHI method, significantly improves the accuracy and precision of those green and red triangles.  This is an example of how to combine "classic" image processing with modern deep learning techniques for real-world applications.&lt;br&gt;
(text generated by Gemini)&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/KmY0Tzcen1E"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>opencv</category>
    </item>
    <item>
      <title>AI for Developers: Image captioning using visual attention</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Wed, 29 Jan 2025 21:36:29 +0000</pubDate>
      <link>https://dev.to/valyouw/ai-for-developers-image-captioning-using-visual-attention-2mg9</link>
      <guid>https://dev.to/valyouw/ai-for-developers-image-captioning-using-visual-attention-2mg9</guid>
      <description>&lt;p&gt;Training an image captioning model with visual attention. An example of combining 2 modalities into a single model.&lt;/p&gt;

&lt;p&gt;This post is a summary of &lt;a href="https://github.com/GoogleCloudPlatform/asl-ml-immersion/blob/master/notebooks/multi_modal/solutions/image_captioning.ipynb" rel="noopener noreferrer"&gt;this&lt;/a&gt; Google tutorial.&lt;/p&gt;

&lt;p&gt;NOTE: This post is intended for developers; if you are an aspiring data scientist or AI researcher it will not dig deep enough for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Image captioning models take an image as input and generate text describing the image.&lt;br&gt;
The challenge in these types of models is how to bring together the visual space and the textual space. We need to bring these two modalities onto common ground.&lt;/p&gt;

&lt;p&gt;One way of doing this is by using an Encoder-Decoder architecture.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Encoder - Takes an image as input and outputs an embedding vector, a numeric representation capturing the "essence" of the image.&lt;/li&gt;
&lt;li&gt;Decoder - Responsible for generating text with respect to the embeddings it got from the encoder. The "joint learning" of the visual features and text is done in an Attention layer.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Encoder
&lt;/h3&gt;

&lt;p&gt;We will use a pre-trained InceptionResNetV2 as our encoder, or feature extractor. InceptionResNetV2 is an image classification model, and by taking the output from an intermediate layer (and not the final layer) we get the features that represent the image.&lt;/p&gt;
&lt;h3&gt;
  
  
  Decoder
&lt;/h3&gt;

&lt;p&gt;The decoder gets as input the image features from the encoder and the image caption. It then processes the data through the following layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embedding - The image caption will get embedded into a vector that will capture the "meaning" of the caption&lt;/li&gt;
&lt;li&gt;RNN - An RNN layer that processes the caption embeddings; the RNN processes the words one-by-one and keeps a "memory" of previously processed words.&lt;/li&gt;
&lt;li&gt;Attention - The attention layer gets as input the RNN output and the &lt;strong&gt;image&lt;/strong&gt; features; this is where the image data and text data are processed together and the relations between image features and text are learned.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then the Attention output and the RNN output are added together, normalized, and run through a final Dense layer to produce the next-word "probabilities". It will produce a "probability" for every word in the vocabulary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qgv3xfqv87ed0y3jgq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qgv3xfqv87ed0y3jgq4.png" alt="model architecture" width="800" height="822"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Training &amp;amp; Inference
&lt;/h2&gt;

&lt;p&gt;The training (and later the inference) is very similar to what was described in the &lt;a href="https://www.thecodingnotebook.com/2025/01/ai-for-developers-rnn-encoder-decoder.html" rel="noopener noreferrer"&gt;Spanish-English translation encoder-decoder&lt;/a&gt; post.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Dependencies
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip uninstall -y tensorflow
!pip uninstall -y tf-keras
!pip install tensorflow==2.15.1
!pip install tf-keras==2.15.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Imports
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;textwrap&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wrap&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pylab&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow_datasets&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tfds&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow_hub&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;GRU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AdditiveAttention&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Attention&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Concatenate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LayerNormalization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Reshape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;StringLookup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TextVectorization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Loading Data
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;IMG_HEIGHT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;299&lt;/span&gt;
&lt;span class="n"&gt;IMG_WIDTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;299&lt;/span&gt;
&lt;span class="n"&gt;BUFFER_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Each data_row is a dict with keys: ['captions', 'image', 'image/filename', 'image/id', 'objects']
&lt;/span&gt;    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMG_HEIGHT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_WIDTH&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt; &lt;span class="c1"&gt;# convert rgb to 0-1 range
&lt;/span&gt;
    &lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;captions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# only the first caption per image
&lt;/span&gt;    &lt;span class="c1"&gt;# Add the special &amp;lt;start&amp;gt;&amp;lt;end&amp;gt; tokens for the decoder to use
&lt;/span&gt;    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;start&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;end&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_tensor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caption&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Load the dataset.
# The dataset is huge so we are downloading it to a google storage bucket.
# The bucket is located in us-central1 and if the machine is in another zone then working 
# with the data will be very slow
&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gs://asl-public/data/tensorflow_datasets/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Another option is to download the data locally.
# ** DID NOT WORK IN COLAB **
# The download size is 56GB!!
# If you want to download locally create the folder /content/data
# data_dir='/content/data'
&lt;/span&gt;
&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coco_captions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# get only the image and caption
&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# prefetch loads the upcomfing data while current are being processed
&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Visualize
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_tensor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caption&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Create a Tokenizer
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_CAPTION_LEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;
&lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt;  &lt;span class="c1"&gt;# use fewer words to speed up convergence
&lt;/span&gt;
&lt;span class="c1"&gt;# We will override the default standardization of TextVectorization to preserve
# "&amp;lt;&amp;gt;" characters, so we preserve the tokens for the &amp;lt;start&amp;gt; and &amp;lt;end&amp;gt;.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;standardize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;regex_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!\"#$%&amp;amp;\(\)\*\+.,-/:;=?@\[\\\]^_`{|}~]?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Choose the most frequent words from the vocabulary &amp;amp; remove punctuation etc.
&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextVectorization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;standardize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;standardize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_sequence_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MAX_CAPTION_LEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caption&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="c1"&gt;# Lookup table: Word -&amp;gt; Index
&lt;/span&gt;&lt;span class="n"&gt;word_to_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StringLookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mask_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vocabulary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Lookup table: Index -&amp;gt; Word
&lt;/span&gt;&lt;span class="n"&gt;index_to_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StringLookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mask_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vocabulary&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;invert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# tokenize the first caption (word-by-word), note the token "3" and "4" are the &amp;lt;start&amp;gt; &amp;lt;end&amp;gt; tokens
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caption&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Prepare training dataset
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_ds_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;img_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_tensor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caption&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# tokenize the caption
&lt;/span&gt;
    &lt;span class="c1"&gt;# Create the "target" training objective which is the caption without the &amp;lt;start&amp;gt; token.
&lt;/span&gt;    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;roll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# shift left the tokens to remove the &amp;lt;start&amp;gt; token
&lt;/span&gt;    &lt;span class="n"&gt;zeros&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# roll is cyclic, so the &amp;lt;start&amp;gt; token is now the last token in the tensor, replace it with 0
&lt;/span&gt;
    &lt;span class="c1"&gt;# The input img_tensor will go the the encoder, the caption to the decoder
&lt;/span&gt;    &lt;span class="c1"&gt;# and the "target" is our training objective
&lt;/span&gt;    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;


&lt;span class="n"&gt;batched_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create_ds_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_remainder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Print a sample
&lt;/span&gt;&lt;span class="nf"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batched_ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Caption shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Label shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Build the model
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# InceptionResNetV2 takes (299, 299, 3) image as inputs
# note we use include_top=False meaning we don't want to include the final layer
# and what we get is the extracted features with shape (8, 8, 1536)
&lt;/span&gt;&lt;span class="n"&gt;FEATURE_EXTRACTOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;applications&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inception_resnet_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InceptionResNetV2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;include_top&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imagenet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;FEATURE_EXTRACTOR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trainable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="n"&gt;FEATURES_SHAPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ATTENTION_DIM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;  &lt;span class="c1"&gt;# size of dense layer in Attention
&lt;/span&gt;&lt;span class="n"&gt;IMG_CHANNELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="c1"&gt;# --- ENCODER--- 
## Input Layer (image)
&lt;/span&gt;&lt;span class="n"&gt;image_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMG_HEIGHT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_WIDTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_CHANNELS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Feature extractor layer
&lt;/span&gt;&lt;span class="n"&gt;image_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FEATURE_EXTRACTOR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# reshapre the features output to 2D of (64, 1536)
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Reshape&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;FEATURES_SHAPE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;FEATURES_SHAPE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;FEATURES_SHAPE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))(&lt;/span&gt;
    &lt;span class="n"&gt;image_features&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Dense output layer, this will be the input to the decoder's attention
&lt;/span&gt;&lt;span class="n"&gt;encoder_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ATTENTION_DIM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;encoder_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- DECODER ---
## Input layer (image caption)
&lt;/span&gt;&lt;span class="n"&gt;word_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_CAPTION_LEN&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Embeddings layer
&lt;/span&gt;&lt;span class="n"&gt;embed_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ATTENTION_DIM&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;word_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# RNN
&lt;/span&gt;&lt;span class="n"&gt;decoder_gru&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GRU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ATTENTION_DIM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rnn_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rnn_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_gru&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Attention layer
&lt;/span&gt;&lt;span class="n"&gt;decoder_attention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Attention&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;context_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_attention&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;rnn_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoder_output&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;## Add rnn + attention
&lt;/span&gt;&lt;span class="n"&gt;addition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;rnn_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_vector&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;## Normalization
&lt;/span&gt;&lt;span class="n"&gt;layer_norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LayerNormalization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;layer_norm_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;layer_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Dense output layer
&lt;/span&gt;&lt;span class="n"&gt;decoder_output_dense&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;decoder_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_output_dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layer_norm_out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoder_output&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decoder_output&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- The Model ---
&lt;/span&gt;&lt;span class="n"&gt;image_caption_train_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_input&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decoder_output&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;image_caption_train_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_caption_train_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Training the model
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Loss Function
&lt;/h4&gt;

&lt;p&gt;All caption vectors have the same length, meaning some (if not all) captions have 0 padding at the end. We don't want to compute the loss on the padding, so in our custom loss function, after computing the loss for each element in the caption vector, we compute the loss mean only over elements where we had an actual word.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loss_object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SparseCategoricalCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;from_logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reduction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loss_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;loss_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;loss_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# returns 1 to word index and 0 to padding (e.g. [1,1,1,1,1,0,0,0,0,...,0])
&lt;/span&gt;    &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logical_not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sentence_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loss_&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;sentence_len&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Training
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;image_caption_train_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;loss_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_caption_train_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inference
&lt;/h3&gt;

&lt;p&gt;Inference is a little bit different from training: during inference we will call the decoder in a loop, generating words one-by-one. In order for the decoder to keep track of previously generated words, we will have to manage its RNN hidden state ("memory"). On each call to the decoder we will get the updated hidden state and feed it back to the decoder on the next iteration.&lt;/p&gt;

&lt;p&gt;For that we build a new decoder model; note that we reuse the layers from training, as we want to use the trained weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is the hidden state we will have to provide after each iteration
&lt;/span&gt;&lt;span class="n"&gt;rnn_state_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ATTENTION_DIM&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gru_state_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reuse trained GRU, but update it so that it can receive states.
&lt;/span&gt;&lt;span class="n"&gt;rnn_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rnn_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_gru&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rnn_state_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reuse other layers as well
&lt;/span&gt;&lt;span class="n"&gt;context_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_attention&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;rnn_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoder_output&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;addition_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;rnn_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_vector&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;layer_norm_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;layer_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addition_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;decoder_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_output_dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layer_norm_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define prediction Model with state input and output
&lt;/span&gt;&lt;span class="n"&gt;decoder_pred_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rnn_state_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoder_output&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;decoder_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rnn_state&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prediction process
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Initialize the GRU states as zero vectors.&lt;/li&gt;
&lt;li&gt;Preprocess an input image, pass it to the encoder, and extract image features.&lt;/li&gt;
&lt;li&gt;Setup word tokens of &lt;code&gt;&amp;lt;start&amp;gt;&lt;/code&gt; to start captioning.&lt;/li&gt;
&lt;li&gt;In the for loop, we

&lt;ul&gt;
&lt;li&gt;pass word tokens (&lt;code&gt;dec_input&lt;/code&gt;), GRU states (&lt;code&gt;gru_state&lt;/code&gt;) and image features (&lt;code&gt;features&lt;/code&gt;) to the prediction decoder and get predictions (&lt;code&gt;predictions&lt;/code&gt;), and the updated GRU states.&lt;/li&gt;
&lt;li&gt;select the top-K words from the logits, and choose a word probabilistically so that we avoid computing softmax over a VOCAB_SIZE-sized vector.&lt;/li&gt;
&lt;li&gt;stop predicting when the model predicts the &lt;code&gt;&amp;lt;end&amp;gt;&lt;/code&gt; token.&lt;/li&gt;
&lt;li&gt;replace the input word token with the predicted word token for the next step.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MINIMUM_SENTENCE_LENGTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="c1"&gt;## Probabilistic prediction using the trained model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rnn_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ATTENTION_DIM&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# Initial rnn state
&lt;/span&gt;
    &lt;span class="c1"&gt;# prepare image for the encoder
&lt;/span&gt;    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode_jpeg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;IMG_CHANNELS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMG_HEIGHT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_WIDTH&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;

    &lt;span class="c1"&gt;# run encoder
&lt;/span&gt;    &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# initial decoder input word
&lt;/span&gt;    &lt;span class="n"&gt;decoder_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;start&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Keep track of the generated words
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;result_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_CAPTION_LEN&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# run the decoder
&lt;/span&gt;        &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rnn_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_pred_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;decoder_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rnn_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# draws from log distribution given by predictions
&lt;/span&gt;        &lt;span class="n"&gt;top_probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_idxs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chosen_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;categorical&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;top_probs&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;predicted_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;top_idxs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;chosen_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# result.append(tokenizer.get_vocabulary()[predicted_id])
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;index_to_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;result_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;predicted_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;end&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

        &lt;span class="c1"&gt;# use the newly generated id as input for the next decoder cycle
&lt;/span&gt;        &lt;span class="n"&gt;decoder_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;predicted_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Let's caption!
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./sample_data/baseball.jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Generate 5 captions
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;predict_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode_jpeg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;IMG_CHANNELS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, for this image:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj22q886tobksureglbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj22q886tobksureglbq.png" alt="baseball player" width="375" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;we get:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a man on a plate with a bat.&lt;br&gt;
a boy is riding a baseball bat on a field.&lt;br&gt;
a man in uniform standing in front of a base.&lt;br&gt;
a young player swinging a bat from a pitch in a crowd of spectators.&lt;br&gt;
a baseball player holds up to swing the bat.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not bad...&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
    </item>
    <item>
      <title>AI for Developers: RNN encoder-decoder seq2seq translation</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Mon, 27 Jan 2025 21:25:54 +0000</pubDate>
      <link>https://dev.to/valyouw/ai-for-developers-rnn-encoder-decoder-seq2seq-translation-49jg</link>
      <guid>https://dev.to/valyouw/ai-for-developers-rnn-encoder-decoder-seq2seq-translation-49jg</guid>
      <description>&lt;p&gt;Training an RNN encoder-decoder to translate Spanish to English. This is a simple demonstration of a sequence-to-sequence model.&lt;/p&gt;

&lt;p&gt;This post is a summary of &lt;a href="https://github.com/GoogleCloudPlatform/asl-ml-immersion/blob/master/notebooks/text_models/solutions/rnn_encoder_decoder.ipynb" rel="noopener noreferrer"&gt;this&lt;/a&gt; Google tutorial.&lt;/p&gt;

&lt;p&gt;NOTE: This post is intended for developers; if you are an aspiring data scientist or AI researcher, this post will not dig deep enough for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Sequence-to-sequence (Seq2Seq) models are used where the goal is to transform one sequence into another, as in language translation.&lt;/p&gt;

&lt;p&gt;The key components are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Encoder - An RNN model that encodes the input sequence (in our case numeric tokens that represent Spanish words) into a vector (aka context vector); the encoder captures the essence of the input sequence.&lt;/li&gt;
&lt;li&gt;Decoder - An RNN model that, given the encoded input, is responsible for generating the output sequence (in our case numeric tokens that represent English words).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Encoder
&lt;/h3&gt;

&lt;p&gt;The encoder has 2 layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embedding - gets the input (Spanish words as integers) and, for each word, creates a vector (of size 256) that represents its meaning.&lt;/li&gt;
&lt;li&gt;GRU - An RNN layer with 1024 hidden units. The RNN maintains a state that captures information about the entire sequence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The encoder's RNN state will be passed to the decoder as input.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoder
&lt;/h3&gt;

&lt;p&gt;The decoder gets 2 inputs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The encoder's RNN state.&lt;/li&gt;
&lt;li&gt;An input text &lt;strong&gt;in English&lt;/strong&gt;. This text is different during training and inference, as discussed later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The decoder has 3 layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embedding - Like the encoder's embedding layer, the English input is embedded into a vector of size 256.&lt;/li&gt;
&lt;li&gt;GRU - An RNN layer with 1024 hidden units, same as the encoder.&lt;/li&gt;
&lt;li&gt;Dense - A "probabilities" layer: it takes the decoder output and computes, &lt;strong&gt;for each word&lt;/strong&gt; in the English vocabulary, the "probability" of being the next word. In our dataset there are 9219 English words, so the output of this layer is a vector of size 9219. The layer is basically a "softmax" over the vocabulary (the details are out of the scope of this post).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1vwt3sz05ym4pojre7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1vwt3sz05ym4pojre7u.png" alt="encoder-decoder architecture" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Preparing the data
&lt;/h3&gt;

&lt;p&gt;As we mentioned above, the input to the decoder, aside from the encoder's output, is the English text; this text is different during training and inference.&lt;/p&gt;

&lt;p&gt;During inference we will call the decoder in a loop; on each iteration the decoder generates one English word, which is fed back as input to the decoder on the next cycle. But what is the first word we start with?&lt;br&gt;
For this we will use a special &lt;code&gt;&amp;lt;start&amp;gt;&lt;/code&gt; token. And when do we stop the loop? For that, the decoder will have to learn to generate a special &lt;code&gt;&amp;lt;end&amp;gt;&lt;/code&gt; token.&lt;/p&gt;

&lt;p&gt;So when we prepare our data we will prepend the special &lt;code&gt;&amp;lt;start&amp;gt;&lt;/code&gt; token and append the special &lt;code&gt;&amp;lt;end&amp;gt;&lt;/code&gt; token to every sentence.&lt;br&gt;
For example: &lt;code&gt;I love machine learning&lt;/code&gt; will become &lt;code&gt;&amp;lt;start&amp;gt; I love machine learning &amp;lt;end&amp;gt;&lt;/code&gt; (same for the Spanish sentences).&lt;/p&gt;
&lt;h3&gt;
  
  
  Tokenization
&lt;/h3&gt;

&lt;p&gt;Obviously the models cannot work with text, they work with numbers. To handle this we will create a special dictionary that holds an integer value for each word in our vocabulary. This process is called "tokenization". We do this for both languages (keeping 2 different dictionaries).&lt;br&gt;
Note that the special &lt;code&gt;&amp;lt;start&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;end&amp;gt;&lt;/code&gt; tokens get tokenized as well.&lt;/p&gt;
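
&lt;p&gt;To make this concrete, here is a tiny, self-contained illustration (toy sentences and plain Python, not the post's actual code) of what such a dictionary looks like. The real &lt;code&gt;tokenize&lt;/code&gt; helper, shown later, uses &lt;code&gt;tf.keras.preprocessing.text.Tokenizer&lt;/code&gt; to do the same job:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustration only: build a toy word-to-integer dictionary by hand.
# 0 is reserved for padding, so real ids start at 1.
sentences = ["&amp;lt;start&amp;gt; i love machine learning &amp;lt;end&amp;gt;", "&amp;lt;start&amp;gt; i love tacos &amp;lt;end&amp;gt;"]

vocab = {}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab) + 1)

tokenized = [[vocab[word] for word in sentence.split()] for sentence in sentences]
print(vocab)      # {'&amp;lt;start&amp;gt;': 1, 'i': 2, 'love': 3, 'machine': 4, 'learning': 5, '&amp;lt;end&amp;gt;': 6, 'tacos': 7}
print(tokenized)  # [[1, 2, 3, 4, 5, 6], [1, 2, 3, 7, 6]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
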
&lt;h3&gt;
  
  
  Training dataset
&lt;/h3&gt;

&lt;p&gt;In order to train our encoder-decoder model we need 3 pieces of data: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An input to the encoder; this will be the tokenized Spanish sentence (with the special start/end tokens).&lt;/li&gt;
&lt;li&gt;An input to the decoder; this will be the tokenized English sentence (with the special start/end tokens).&lt;/li&gt;
&lt;li&gt;The training objective (or "target", or "label"); this is what we want our decoder to learn to generate. For this we will use the tokenized English sentence BUT WITHOUT the &lt;code&gt;&amp;lt;start&amp;gt;&lt;/code&gt; token (remember, during inference we will provide the &lt;code&gt;&amp;lt;start&amp;gt;&lt;/code&gt; token to signal the decoder to generate the first translated word). A small worked example follows this list.&lt;/li&gt;
&lt;/ol&gt;
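
&lt;p&gt;Here is a toy example of these three pieces for a single sentence pair. The token ids are made up for illustration; the actual left-shift is done by the &lt;code&gt;to_dataset_item&lt;/code&gt; function in the code section below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustration only, with made-up token ids.
# Spanish: "&amp;lt;start&amp;gt; me gusta el beisbol &amp;lt;end&amp;gt;"
# English: "&amp;lt;start&amp;gt; i like baseball &amp;lt;end&amp;gt;"
encoder_input = [1, 7, 8, 9, 10, 2]  # tokenized Spanish sentence, fed to the encoder
decoder_input = [1, 3, 4, 5, 2]      # tokenized English sentence, fed to the decoder
target        = [3, 4, 5, 2, 0]      # same English sentence shifted left: no &amp;lt;start&amp;gt;, padded with 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
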
&lt;h2&gt;
  
  
  Running translations (inference)
&lt;/h2&gt;

&lt;p&gt;Translation works a little differently than training. While during training the &lt;strong&gt;decoder&lt;/strong&gt; input was the entire translated sentence, this is not the case during inference; we don't have it, all we have is the input sentence in Spanish.&lt;br&gt;
What we'll do is this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the encoder on the input sentence and get the encoder's RNN hidden state.&lt;/li&gt;
&lt;li&gt;We call the decoder and pass it the encoder's state from above, along with the special &lt;code&gt;&amp;lt;start&amp;gt;&lt;/code&gt; token.&lt;/li&gt;
&lt;li&gt;The decoder will generate the next (translated) word and a new hidden state; we will call the decoder again using these two outputs.&lt;/li&gt;
&lt;li&gt;We continue to call the decoder N times, N being the length of the longest sentence we had during training. This may not be ideal as in the real world there may be longer sentences; checking for the &lt;code&gt;&amp;lt;end&amp;gt;&lt;/code&gt; token may work better (with some safeguard to prevent an infinite loop). A minimal sketch of this loop is shown right after this list.&lt;/li&gt;
&lt;/ol&gt;
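
&lt;p&gt;Here is a minimal sketch of that decoding loop, for illustration only. It assumes hypothetical &lt;code&gt;encoder_model&lt;/code&gt; and &lt;code&gt;decoder_model&lt;/code&gt; objects split out of the trained model (one mapping the tokenized Spanish sentence to the encoder state, the other mapping a word token plus a state to word probabilities plus the new state), along with the tokenizers and helpers defined in the code below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only - encoder_model and decoder_model are hypothetical names,
# they are not defined as-is in the code of this post.
def translate(spanish_sentence, max_len=50):
    # 1. Encode the preprocessed, tokenized Spanish sentence
    tokens = preprocess([spanish_sentence], input_lang_tokenizer)
    state = encoder_model(tokens)

    # 2. Start decoding from the &amp;lt;start&amp;gt; token
    next_token = target_lang_tokenizer.word_index["&amp;lt;start&amp;gt;"]
    result = []

    # 3 + 4. Feed each predicted word (and the new state) back into the decoder,
    # stopping on &amp;lt;end&amp;gt; or after max_len words as a safeguard
    for _ in range(max_len):
        probs, state = decoder_model([tf.constant([[next_token]]), state])
        next_token = int(tf.argmax(probs[0, 0]))
        word = target_lang_tokenizer.index_word[next_token]
        if word == "&amp;lt;end&amp;gt;":
            break
        result.append(word)

    return " ".join(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
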
&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Dependencies
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip uninstall tensorflow -y
!pip uninstall tf-keras -y
!pip install tensorflow==2.15.1
!pip install keras
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Imports
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GRU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Load Data
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Download data
&lt;/span&gt;&lt;span class="n"&gt;DATA_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;path_to_zip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spa-eng.zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;origin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DATA_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;path_to_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_to_zip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spa-eng/spa.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Translation data stored at:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path_to_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load into dataframe
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;path_to_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spanish&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load sentences into tensors
&lt;/span&gt;&lt;span class="n"&gt;target_lang_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;input_lang_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spanish&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Helper Utils
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unicode_to_ascii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transforms a unicode string into ascii (strips accent marks).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;category&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess_sentence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lowers, strips, and adds &amp;lt;start&amp;gt; and &amp;lt;end&amp;gt; tags to a sentence.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;unicode_to_ascii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# creating a space between a word and the punctuation following it
&lt;/span&gt;    &lt;span class="c1"&gt;# eg: "he is a boy." =&amp;gt; "he is a boy ."
&lt;/span&gt;    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;([?.!,¿])&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; \1 &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
&lt;/span&gt;    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^a-zA-Z?.!,¿]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# adding a start and an end token to the sentence
&lt;/span&gt;    &lt;span class="c1"&gt;# so that the model knows when to start and stop predicting.
&lt;/span&gt;    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;start&amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &amp;lt;end&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang_tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Given a list of sentences, return an integer representation

    Arguments:
    lang -- a python list of sentences
    lang_tokenizer -- keras_preprocessing.text.Tokenizer, if None
        this will be created for you

    Returns:
    tensor -- int tensor of shape (NUM_EXAMPLES,MAX_SENTENCE_LENGTH)
    lang_tokenizer -- keras_preprocessing.text.Tokenizer
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lang_tokenizer&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;lang_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;lang_tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_on_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lang_tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;texts_to_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pad_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang_tokenizer&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Preprocesses then tokenizes text

    Arguments:
    sentences -- a python list of strings
    tokenizer -- Tokenizer for mapping words to integers

    Returns:
    tokens -- int tensor of shape (NUM_EXAMPLES, MAX_SENTENCE_LENGTH)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;preprocess_sentence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;int2word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;int_sequence&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Converts integer representation to natural language representation

    Arguments:
    tokenizer -- keras_preprocessing.text.Tokenizer
    int_sequence -- an iterable or rank 1 tensor of integers

    Returns list of string tokens
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index_word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;int_sequence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Prepare datasets
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Training on 120000 samples and 3 epochs took 15 minutes on T4
&lt;/span&gt;&lt;span class="n"&gt;NUM_EXAMPLES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120000&lt;/span&gt;

&lt;span class="c1"&gt;# Preprocess the sentences (add &amp;lt;start&amp;gt; &amp;lt;end&amp;gt; etc)
&lt;/span&gt;&lt;span class="n"&gt;preprocessed_input_lang_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_lang_sentences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_EXAMPLES&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess_sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;preprocessed_target_lang_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_lang_sentences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_EXAMPLES&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess_sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tokenize the preprocessed sentences, note it makes all sentences the same length (padded with zeros)
&lt;/span&gt;&lt;span class="n"&gt;tokenized_input_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_lang_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocessed_input_lang_sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenized_target_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_lang_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocessed_target_lang_sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save the len of the sentences
&lt;/span&gt;&lt;span class="n"&gt;max_sentence_length_input_lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenized_input_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;max_setence_length_target_lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenized_target_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Load into dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized_input_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenized_target_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Split data into train/validation
&lt;/span&gt;&lt;span class="n"&gt;TEST_PROP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="n"&gt;validation_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEST_PROP&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;train_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;validation_size&lt;/span&gt;

&lt;span class="n"&gt;shuffled_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shuffled_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;validation_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shuffled_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Convert the dataset items to what we need for training: the training inputs
# are the Spanish/English sentences that go to the encoder and decoder as input.
# The "target" (or "label") is the shifted English sentence (without the &amp;lt;start&amp;gt; token)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_dataset_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_lang_tensor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;encoder_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_lang_tensor&lt;/span&gt;
    &lt;span class="n"&gt;decoder_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_lang_tensor&lt;/span&gt;

    &lt;span class="c1"&gt;# The train target should not have the first &amp;lt;start&amp;gt; token, shift it left
&lt;/span&gt;    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;roll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoder_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# roll is cyclic, so the &amp;lt;start&amp;gt; token is now the last token in the tensor, replace it with 0
&lt;/span&gt;    &lt;span class="n"&gt;zeros&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;encoder_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoder_input&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_dataset_item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;validation_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validation_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_dataset_item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create training batches
&lt;/span&gt;&lt;span class="n"&gt;BUFFER_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;

&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BUFFER_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_remainder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;validation_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validation_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_remainder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Build the model
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;EMBEDDING_DIM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
&lt;span class="n"&gt;HIDDEN_UNITS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;

&lt;span class="n"&gt;INPUT_VOCAB_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_lang_tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;TARGET_VOCAB_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_lang_tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# Encoder
# Input layer
&lt;/span&gt;&lt;span class="n"&gt;encoder_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;encoder_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Embedding layer
&lt;/span&gt;&lt;span class="n"&gt;encoder_inputs_embedded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INPUT_VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMBEDDING_DIM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_sentence_length_input_lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;encoder_inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# RNN
&lt;/span&gt;&lt;span class="n"&gt;encoder_rnn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GRU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;HIDDEN_UNITS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;recurrent_initializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glorot_uniform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Exec the RNN and get the encoder_state which will be the input to the decoder
&lt;/span&gt;&lt;span class="n"&gt;encoder_outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoder_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encoder_rnn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoder_inputs_embedded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Decoder
# Input layer
&lt;/span&gt;&lt;span class="n"&gt;decoder_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decoder_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Embedding layer
&lt;/span&gt;&lt;span class="n"&gt;decoder_inputs_embedded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TARGET_VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMBEDDING_DIM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_setence_length_target_lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;decoder_inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# RNN
&lt;/span&gt;&lt;span class="n"&gt;decoder_rnn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GRU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;HIDDEN_UNITS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;recurrent_initializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glorot_uniform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Exec the RNN, note the inputs are the decoder's embeddings and the encoder's state
&lt;/span&gt;&lt;span class="n"&gt;decoder_outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoder_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_rnn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;decoder_inputs_embedded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;encoder_state&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Dense layer
&lt;/span&gt;&lt;span class="n"&gt;decoder_dense&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TARGET_VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;softmax&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get the predictions logits
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoder_outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The entire training model
&lt;/span&gt;&lt;span class="n"&gt;train_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;encoder_inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoder_inputs&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sparse_categorical_crossentropy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;STEPS_PER_EPOCH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;
&lt;span class="n"&gt;EPOCHS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;steps_per_epoch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;STEPS_PER_EPOCH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;validation_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Inference
&lt;/h3&gt;

&lt;p&gt;To generate text we have to call the decoder in a loop, one word at a time; on each iteration we pass it the sentence generated so far and its own RNN hidden state.&lt;br&gt;
The initial state of the decoder is the encoded Spanish sentence.&lt;/p&gt;

&lt;p&gt;For that we will create a new encoder model that we will call once, and a decoder model that we will call in a loop. Note that we use the layers of the &lt;code&gt;train_model&lt;/code&gt; so we are using trained weights.&lt;/p&gt;

&lt;p&gt;Because we reuse those layers, the weights from the training we just did carry over. To use this in production, the new encoder/decoder models have to be exported and saved; that stores them together with the trained weights so they can be loaded later.&lt;br&gt;
The two tokenizers should be exported as well, since tokenization in production must use exactly the same token ids as in training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# reuse the encoder's training weights
# The output (encoder rnn state) will be decoder's initial state
&lt;/span&gt;&lt;span class="n"&gt;encoder_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;encoder_inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;encoder_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This will hold the encoder's output and be used as the decoder initial state
&lt;/span&gt;&lt;span class="n"&gt;decoder_state_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HIDDEN_UNITS&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decoder_state_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run the train_model decoder_rnn (reuse the weights)
&lt;/span&gt;&lt;span class="n"&gt;decoder_outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoder_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_rnn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;decoder_inputs_embedded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decoder_state_input&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reuses weights from the decoder_dense layer
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoder_outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;decoder_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;decoder_inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoder_state_input&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoder_state&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoder_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In order to run translation in production the encoder_model and decoder_model
# should be exported and saved, along with the tokenizers.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
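
&lt;p&gt;As a rough illustration of that export step, here is a minimal sketch of what saving everything might look like. The file paths and the use of &lt;code&gt;pickle&lt;/code&gt; for the tokenizers are my assumptions, not part of the original notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pickle

# Save the inference models together with their trained weights (paths are placeholders)
encoder_model.save("export/encoder_model")
decoder_model.save("export/decoder_model")

# Save the tokenizers so production tokenization uses exactly the same token ids as training
with open("export/tokenizers.pkl", "wb") as f:
    pickle.dump({"input": input_lang_tokenizer, "target": target_lang_tokenizer}, f)

# Later, in production:
# encoder_model = tf.keras.models.load_model("export/encoder_model")
# decoder_model = tf.keras.models.load_model("export/decoder_model")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;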



&lt;h4&gt;
  
  
  Translation Process
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decode_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_seqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_decode_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Arguments:
    input_seqs: tokenized tensor of Spanish sentences, shape (BATCH_SIZE, SEQ_LEN)
    output_tokenizer: Tokenizer used to convert ids (ints) back to words (English tokenizer)

    Returns translated sentences
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Encode the input as state vectors.
&lt;/span&gt;    &lt;span class="n"&gt;states_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encoder_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_seqs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Populate the first character of target sequence with the start character.
&lt;/span&gt;    &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_seqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;translated_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# 1 is &amp;lt;start&amp;gt; token
&lt;/span&gt;
    &lt;span class="n"&gt;translated_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="c1"&gt;# Decode word-by-word (theoretically we could have stopped after getting &amp;lt;end&amp;gt;)
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_decode_length&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoder_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decoder_model&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;translated_seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states_value&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Sample the results token.
&lt;/span&gt;        &lt;span class="c1"&gt;# The model outputs a probabilities vector for each word in the
&lt;/span&gt;        &lt;span class="c1"&gt;# vocabulary, we take the word with the hihgest probability.
&lt;/span&gt;        &lt;span class="c1"&gt;# The output_tokens shape is [BATCH_SIZE, number of generated words in our case it is 1, probabilities vector on the entire vocabulary]
&lt;/span&gt;        &lt;span class="n"&gt;sampled_token_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Convert the output token to word
&lt;/span&gt;        &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int2word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sampled_token_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;translated_sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Use the generated token as the input for the next run.
&lt;/span&gt;        &lt;span class="c1"&gt;# sampled_token_index is a 1D array of the generated token per batch input.
&lt;/span&gt;        &lt;span class="c1"&gt;# we convert the 1D array to 2D of shape [BATCH_SIZE, 1]
&lt;/span&gt;        &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampled_token_index&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Update states for next run
&lt;/span&gt;        &lt;span class="n"&gt;states_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder_state&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;translated_sentences&lt;/span&gt;

&lt;span class="c1"&gt;# Translate these sentences
&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No estamos comiendo.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Está llegando el invierno.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;El invierno se acerca.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tom no comio nada.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Su pierna mala le impidió ganar la carrera.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Su respuesta es erronea.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;¿Qué tal si damos un paseo después del almuerzo?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;reference_translations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re not eating.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Winter is coming.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Winter is coming.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tom ate nothing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;His bad leg prevented him from winning the race.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your answer is wrong.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How about going for a walk after lunch?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;machine_translations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decode_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_lang_tokenizer&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;target_lang_tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_setence_length_target_lang&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INPUT:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REFERENCE TRANSLATION:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_translations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MACHINE TRANSLATION:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;machine_translations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-
INPUT:
No estamos comiendo.
REFERENCE TRANSLATION:
We're not eating.
MACHINE TRANSLATION:
['we', 're', 'not', 'eating', '.', '&amp;lt;end&amp;gt;', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Está llegando el invierno.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['the', 'rain', 'is', 'cold', '.', '&amp;lt;end&amp;gt;', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
El invierno se acerca.
REFERENCE TRANSLATION:
Winter is coming.
MACHINE TRANSLATION:
['winter', 'is', 'approaching', '.', '&amp;lt;end&amp;gt;', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Tom no comio nada.
REFERENCE TRANSLATION:
Tom ate nothing.
MACHINE TRANSLATION:
['tom', 'ate', 'nothing', '.', '&amp;lt;end&amp;gt;', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Su pierna mala le impidió ganar la carrera.
REFERENCE TRANSLATION:
His bad leg prevented him from winning the race.
MACHINE TRANSLATION:
['his', 'hair', 'turned', 'down', 'to', 'the', 'bottom', 'of', 'the', 'snow', '.', '&amp;lt;end&amp;gt;', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
Su respuesta es erronea.
REFERENCE TRANSLATION:
Your answer is wrong.
MACHINE TRANSLATION:
['your', 'answer', 'is', 'incorrect', '.', '&amp;lt;end&amp;gt;', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
-
INPUT:
¿Qué tal si damos un paseo después del almuerzo?
REFERENCE TRANSLATION:
How about going for a walk after lunch?
MACHINE TRANSLATION:
['how', 'about', 'we', 'spend', 'a', 'little', 'day', '?', '&amp;lt;end&amp;gt;', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While some of the results are funny, I am quite impressed that others are not that bad, considering we used only 120k sentences and 3 epochs.&lt;br&gt;
For comparison, the paper "Attention Is All You Need" used the WMT 2014 English-French dataset, consisting of 36M sentence pairs!&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>genai</category>
      <category>rnn</category>
    </item>
    <item>
      <title>AI for Developers: Using RNNs</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 24 Jan 2025 20:50:18 +0000</pubDate>
      <link>https://dev.to/valyouw/ai-for-developers-using-rnns-1kdp</link>
      <guid>https://dev.to/valyouw/ai-for-developers-using-rnns-1kdp</guid>
      <description>&lt;p&gt;Training a simple RNN text generator to demonstrate the concept of text generation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15wyxu4bzwodz1shri3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15wyxu4bzwodz1shri3.png" alt="RNN" width="390" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post is a summary of &lt;a href="https://github.com/GoogleCloudPlatform/asl-ml-immersion/blob/master/notebooks/text_models/solutions/text_generation.ipynb" rel="noopener noreferrer"&gt;this&lt;/a&gt; Google tutorial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; This post is intended for developers; if you are an aspiring data scientist or AI researcher, this post will not dig deep enough for you.&lt;/p&gt;

&lt;p&gt;The RNN will learn to generate text character-by-character.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;RNN - Recurrent Neural Networks are networks designed specifically to handle sequential data, like a time series or language. A regular neural network processes the input data in a single pass, while an RNN processes the input sequentially, repeatedly applying the same set of weights at each step of the sequence, hence "recurrent".&lt;/p&gt;

&lt;p&gt;The key feature of RNNs is that they have a hidden state that stores information about what has been calculated so far. This hidden state is updated at each step in the sequence, so the processing of newer data is influenced by the data that came before it.&lt;/p&gt;

&lt;p&gt;Let's say our input is "I love machine learning". A regular neural network would process the entire sentence at once, so it cannot learn the relations between the words. The RNN processes the sentence one word at a time: at each step the current word is fed into the RNN, the hidden state (memory) is updated based on the current word and the previous hidden state, and the updated hidden state captures information about the sequence seen so far.&lt;/p&gt;

&lt;p&gt;In this post we will NOT go into the internals of the RNN layer itself; we will use &lt;code&gt;tf.keras.layers.GRU&lt;/code&gt; (Gated Recurrent Unit), which is a type of RNN layer provided by TF. For an input like "I love machine learning", the step-by-step processing is handled by the GRU layer.&lt;/p&gt;
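
&lt;p&gt;As a tiny illustration of the hidden-state idea (my addition, not from the tutorial), here is a GRU applied to a dummy, already-embedded sequence; the shapes are arbitrary and chosen just for the demo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# A toy "sentence" of 4 steps, each step already embedded as an 8-dim vector
dummy_sequence = tf.random.normal([1, 4, 8])  # [batch, steps, features]

gru = tf.keras.layers.GRU(units=16, return_sequences=True, return_state=True)

# outputs holds the hidden state after each step; final_state is the state after the last step
outputs, final_state = gru(dummy_sequence)
print(outputs.shape)      # (1, 4, 16) - one 16-dim hidden state per step
print(final_state.shape)  # (1, 16)    - the "memory" after seeing the whole sequence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;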

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The TensorFlow syntax used here is for version 2.15.1; newer versions will give errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip uninstall tensorflow -y
!pip uninstall tf-keras -y
!pip install tensorflow==2.15.1
!pip install keras
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Load training data
&lt;/h2&gt;

&lt;p&gt;The training dataset is Shakespeare's writing from Andrej Karpathy's &lt;a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/" rel="noopener noreferrer"&gt;The Unreasonable Effectiveness of Recurrent Neural Networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Load the entire text into &lt;code&gt;shakespeare_text&lt;/code&gt; and create a vocabulary. Since the model will generate text character-by-character the vocabulary is essentially a set of all the characters in the text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;path_to_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shakespeare.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;shakespeare_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_to_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shakespeare_text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Vectorize the text
&lt;/h2&gt;

&lt;p&gt;The model cannot work with characters, it needs numbers, so we will use the &lt;code&gt;tf.keras.layers.StringLookup&lt;/code&gt; layer, which maps strings to indices (we will call such an index an &lt;code&gt;id&lt;/code&gt;).&lt;br&gt;
&lt;strong&gt;Important!&lt;/strong&gt; The &lt;code&gt;StringLookup&lt;/code&gt; will add an "Unknown" &lt;code&gt;[UNK]&lt;/code&gt; token (usually with id = 0) to represent unknown tokens it might encounter (OOV - Out Of Vocabulary).&lt;br&gt;
We will also define a reverse lookup layer that converts ids back to characters, since the model output will be a character &lt;strong&gt;id&lt;/strong&gt; that we need to convert back to a character.&lt;br&gt;
Finally, we will define a function that takes an array of character ids and converts it back to a string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ids_from_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StringLookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mask_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chars_from_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StringLookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vocabulary&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;invert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;text_from_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;chars_from_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
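
&lt;p&gt;To make the mapping concrete, here is a small round-trip check I added (not part of the notebook): characters to ids and back to a string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sanity check of the lookup layers: characters to ids and back
sample_chars = tf.strings.unicode_split("First Citizen", "UTF-8")
sample_ids = ids_from_chars(sample_chars)
print(sample_ids.numpy())                 # array of int ids, one per character
print(text_from_ids(sample_ids).numpy())  # b'First Citizen'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;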



&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;p&gt;Since RNNs maintain an internal state that depends on the previously seen elements, the question we want the model to answer is: given all the characters seen so far, what is the next character?&lt;br&gt;
This is what we train the model to predict. We train it by providing pairs of texts, where the "input" text is the sequence without its last character and the "target" is the same sequence shifted by one, i.e. without its first character.&lt;/p&gt;

&lt;p&gt;Another parameter is the length of the training sequences. Say we decide to work with a sequence length of 4: we break the entire Shakespeare text into chunks of 5 characters and, by dropping the last/first character, create a training pair of two 4-character sequences.&lt;br&gt;
Example: the chunk "Hello" is split into the input "Hell" and the target "ello".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;shakespeare_text_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unicode_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shakespeare_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTF-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ids_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shakespeare_text_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Our training sequence will be 100 characters
&lt;/span&gt;&lt;span class="n"&gt;seq_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="c1"&gt;# calculate how many complete 100 charts chunks we have
&lt;/span&gt;&lt;span class="n"&gt;examples_per_epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shakespeare_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq_length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sequences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ids_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq_length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_remainder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the (input, target) pairs for training
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_input_target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;input_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# All chars but the LAST  one
&lt;/span&gt;    &lt;span class="n"&gt;target_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="c1"&gt;# All chars but the FIRST one
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_text&lt;/span&gt;

&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sequences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split_input_target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
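
&lt;p&gt;As a quick sanity check (my addition, not in the original notebook), we can peek at the first (input, target) pair and see the one-character offset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Print the first (input, target) pair - the target is the input shifted by one character
for input_example, target_example in train_dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;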



&lt;p&gt;Next we need to shuffle the training input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Batch size
&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;

&lt;span class="c1"&gt;# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
&lt;/span&gt;&lt;span class="n"&gt;BUFFER_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;

&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BUFFER_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_remainder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build the model
&lt;/h3&gt;

&lt;p&gt;The model consists of 3 layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embedding - This layer maps each character in the vocabulary to a vector, hence it gets the &lt;code&gt;vocab_size&lt;/code&gt; and the desired embedding vector size&lt;/li&gt;
&lt;li&gt;GRU - Gated Recurrent Unit, this is the type of RNN layer we are using; one could use an LSTM layer as well.&lt;/li&gt;
&lt;li&gt;Dense - this is the output layer; its size should be the &lt;code&gt;vocab_size&lt;/code&gt;, as the model's output is a probability for EACH character in the vocabulary.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So what kind of model is this? Is it an encoder-decoder, encoder-only, or decoder-only?&lt;br&gt;
I am not 100% sure how to classify it. My &lt;strong&gt;guess&lt;/strong&gt; is: we have only a single RNN layer, so it is not an encoder-decoder model. The single layer can be seen as a decoder, as it is used to generate text by "decoding" the embeddings and the layer's hidden states.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rnn_units&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gru&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GRU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rnn_units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;return_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dense&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# TF will call this method with training batches
&lt;/span&gt;  &lt;span class="c1"&gt;# On inference we will call this method in a loop to predict the next char and will maintain the "states".
&lt;/span&gt;  &lt;span class="c1"&gt;# The states is the "memory" of the model about all the sequnces it has already processed.
&lt;/span&gt;  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# RNN should get the state from the prev step
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gru&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_initial_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gru&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;return_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# Be sure the vocabulary size matches the `StringLookup` layers.
&lt;/span&gt;    &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vocabulary&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="n"&gt;embedding_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rnn_units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Train the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Directory where the checkpoints will be saved
&lt;/span&gt;&lt;span class="n"&gt;checkpoint_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./training_checkpoints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Name of the checkpoint files
&lt;/span&gt;&lt;span class="n"&gt;checkpoint_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpoint_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ckpt_{epoch}.weights.h5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;checkpoint_callback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModelCheckpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checkpoint_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_weights_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;EPOCHS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SparseCategoricalCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;checkpoint_callback&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generating Text
&lt;/h2&gt;

&lt;p&gt;In order to generate text we will call the model and pass it an initial string (e.g. "ROMEO:") and an initial state (all zeros).&lt;br&gt;
The model's response is a matrix of shape &lt;code&gt;[1, input_len, vocab_size]&lt;/code&gt;; in our example of the string "ROMEO:" it will be &lt;code&gt;[1, 6, 66]&lt;/code&gt;. That is, the model took our sequence and for each character produced a logits vector (a sort of probability) across the entire vocabulary. Since all we care about is the last character, we take only the logits vector for the last position (&lt;code&gt;[0,-1,:]&lt;/code&gt;) and from it choose the "best" character (see below).&lt;br&gt;
The model will also return an updated state.&lt;/p&gt;

&lt;p&gt;Then, in a loop, we keep calling the model, each time passing the last predicted character and the updated states; we keep doing that until we decide to manually stop.&lt;/p&gt;
&lt;h3&gt;
  
  
  Choosing the "best" character
&lt;/h3&gt;

&lt;p&gt;As described above, the model returns a logits vector, which is a sort of probability FOR EACH character in our vocabulary, and we need to choose one character.&lt;br&gt;
One way is to take the character with the highest probability. This would be a good approach for a classification task, but for text generation it can lead to "boring" and repetitive text.&lt;br&gt;
Another approach is to randomly sample from the probabilities, which gives better results for text generation. While the sampling is random, higher logit values have a higher chance of being selected.&lt;/p&gt;

&lt;p&gt;We can also adjust the logit of each character before sampling, using a temperature parameter to scale the distribution.&lt;br&gt;
A value lower than 1 sharpens the distribution, skewing the selection towards the higher-probability tokens and making it less random.&lt;br&gt;
A value higher than 1 "equalizes" the probabilities, giving lower-probability tokens a higher chance of being selected.&lt;/p&gt;
&lt;h3&gt;
  
  
  The text generator code
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OneStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chars_from_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chars_from_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chars_from_ids&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ids_from_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ids_from_chars&lt;/span&gt;

    &lt;span class="c1"&gt;# Create a mask to prevent "[UNK]" from being generated.
&lt;/span&gt;    &lt;span class="n"&gt;skip_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[UNK]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vocabulary&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;skip_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# prediction_mask will have the shape of the model prediction (logit value per character in vocabulary)
&lt;/span&gt;    &lt;span class="c1"&gt;# the [UNK] character will have a value of -Inf and all the rest will have a value of 0
&lt;/span&gt;    &lt;span class="c1"&gt;# i.e [-inf,   0.,   0.,   0.,   0.,   0., ...]
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prediction_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nd"&gt;@tf.function&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_one_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert strings to token IDs.
&lt;/span&gt;    &lt;span class="n"&gt;input_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unicode_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UTF-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_chars&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_tensor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Run the model.
&lt;/span&gt;    &lt;span class="c1"&gt;# predicted_logits.shape is [batch, char, next_char_logits]
&lt;/span&gt;    &lt;span class="n"&gt;predicted_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="n"&gt;return_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Only use the last prediction.
&lt;/span&gt;    &lt;span class="n"&gt;predicted_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predicted_logits&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;predicted_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predicted_logits&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;
    &lt;span class="c1"&gt;# Apply the prediction mask: prevent "[UNK]" from being generated.
&lt;/span&gt;    &lt;span class="n"&gt;predicted_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predicted_logits&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prediction_mask&lt;/span&gt;

    &lt;span class="c1"&gt;# Sample the output logits to generate token IDs.
&lt;/span&gt;    &lt;span class="n"&gt;predicted_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;predicted_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert from token ids to characters
&lt;/span&gt;    &lt;span class="n"&gt;predicted_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chars_from_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Return the characters and model state.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predicted_chars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt;

&lt;span class="n"&gt;one_step_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OneStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chars_from_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids_from_chars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;next_char&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ROMEO:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# Starting text
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;next_char&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;## Generate 1000 characters
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;next_char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;one_step_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_one_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;states&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Run time:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here is a sample output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROMEO:
Where shall we do't; for this is morrow to
Foul devouring them when he shall find ourselves,
Your father shubless of friends, for marriage
Would not be lemamed your ill in less
Than more tastesing for wards are but a quarrel wit.

VALERIA:
O excreet thou the rade friar Lode?

Surrers:
Fie, fieed them, to be rid of Clifford's dagging.

RATCLIFF:
Either I have break our pawned bones,
And with the other bett breaks this body,
And terred her husbaster.

ALONSO:
Ert too things so redueded?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that while the sentences mean nothing, the model did generate valid words. Quite impressive for such a simple model trained for only 10 epochs.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rnn</category>
      <category>genai</category>
    </item>
    <item>
      <title>Notes from course: Generative AI with Large Language Models - Week 1</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 24 Jan 2025 15:58:18 +0000</pubDate>
      <link>https://dev.to/valyouw/notes-from-course-generative-ai-with-large-language-models-week-1-naf</link>
      <guid>https://dev.to/valyouw/notes-from-course-generative-ai-with-large-language-models-week-1-naf</guid>
      <description>&lt;p&gt;Below are my notes from the course "Generative AI with Large Language Models" offered by &lt;a href="https://www.coursera.org/learn/generative-ai-with-llms" rel="noopener noreferrer"&gt;deeplearning.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disclaimer: All the images in this post are from deeplearning.ai&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qeqebus5h6ip4a3mg0f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qeqebus5h6ip4a3mg0f.jpg" alt="Generative AI with Large Language Models" width="265" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Transformers
&lt;/h2&gt;

&lt;p&gt;The transformer architecture was a breakthrough for language models. Before transformers, language models (like RNNs) processed data sequentially and could predict the next word based only on previously processed data; because the range of data they could hold was limited, the model lost context from older data.&lt;br&gt;
Transformers, on the other hand, process all elements in a sequence simultaneously. This is made possible by the self-attention mechanism, which enables transformers to capture long-range dependencies more effectively than RNNs, because the network can directly access information from any other element in the sequence, regardless of its position.&lt;/p&gt;

&lt;p&gt;Transformer models can learn the relationships between all words:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5f681g5hxbjswyk5jbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5f681g5hxbjswyk5jbc.png" alt="Transformers models can learn the relationships between all words" width="440" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Transformer models also assign a weight to each relationship, learning how words are related to each other. Here is an example of such an "attention map":&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5loavcnaophk00z5in89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5loavcnaophk00z5in89.png" alt="attention map" width="343" height="277"&gt;&lt;/a&gt;&lt;br&gt;
In this example the word "book" is strongly related to the words "teacher" and "student".&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating text
&lt;/h3&gt;

&lt;p&gt;The original transformer model was an encoder-decoder sequence-to-sequence model, used for text translation.&lt;br&gt;
Here is a general overview:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb3vgz9gbiswneu91rkt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb3vgz9gbiswneu91rkt.png" alt="transformer model" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The input sequence (French) is tokenized (converting words to numbers)&lt;/li&gt;
&lt;li&gt;The tokens are fed into an Embedding layer, where each token is mapped to an embedding vector that represents the meaning of the word.&lt;/li&gt;
&lt;li&gt;Not shown here, but a position vector, representing the word's position in the sequence, is added to the embedding vector (a minimal sketch of these first steps appears right after this list).&lt;/li&gt;
&lt;li&gt;The combined vector (for each word) is fed to the encoder, where the self-attention layers are. There are several attention layers (multi-headed attention); each attention layer learns its own attention map, focusing on different words of the sentence.&lt;/li&gt;
&lt;li&gt;Text generation happens at the decoder. A special &amp;lt;START&amp;gt; token is fed into the model; the decoder, using the context understanding passed from the encoder plus its own multi-headed self-attention layers, generates the next token. That token is then fed back into the decoder to generate the following token, and so on until a special &amp;lt;END&amp;gt; token is predicted.&lt;/li&gt;
&lt;li&gt;Note that the decoder's output to the softmax layer is actually a vector that holds a probability for every token in the model's vocabulary. There are different strategies for picking the "correct" token from this vector.&lt;/li&gt;
&lt;li&gt;Finally, the predicted tokens are converted back to words and we get the result "I love machine learning".&lt;/li&gt;
&lt;/ol&gt;
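&lt;p&gt;Here is a minimal sketch (my own, not from the course) of steps 1-3 above using Keras layers. The vocabulary size, embedding dimension and token ids are made up, and a learned position embedding is used for simplicity (the original transformer used fixed sinusoidal position encodings):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Hypothetical toy values, just to illustrate the first steps of the pipeline
vocab_size = 10000
d_model = 512
token_ids = tf.constant([[12, 845, 97, 3]])   # the tokenized input sentence (made-up ids)

# Step 2: map every token id to an embedding vector
embedding = tf.keras.layers.Embedding(vocab_size, d_model)
token_embeddings = embedding(token_ids)                      # shape: (1, 4, 512)

# Step 3: add a (learned) position embedding to every token embedding
positions = tf.range(4)                                      # [0, 1, 2, 3]
pos_embedding = tf.keras.layers.Embedding(4, d_model)
encoder_input = token_embeddings + pos_embedding(positions)  # fed into the encoder
print(encoder_input.shape)                                   # (1, 4, 512)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;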

&lt;h3&gt;
  
  
  Generative configuration
&lt;/h3&gt;

&lt;p&gt;As explained above, there are different strategies for picking the generated token from the softmax layer's output.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Greedy - The token with the highest probability is selected. This method is susceptible to repeated word selection and produces less natural text.&lt;/li&gt;
&lt;li&gt;Random-weighted sampling - Here each token's chance of being picked is determined by its probability; it's like a biased coin toss, where some outcomes are more likely than others. This way we do not always choose the word with the highest probability, but we also give other (high-probability) words a chance to be selected (see the example below).&lt;/li&gt;
&lt;/ol&gt;
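&lt;p&gt;A tiny example (made-up probabilities, not from the course) contrasting the two strategies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Made-up softmax output over a tiny 4-token vocabulary
vocab = ["cake", "donut", "banana", "apple"]
probs = np.array([0.20, 0.10, 0.02, 0.68])

# Greedy: always picks the highest-probability token ("apple")
greedy_token = vocab[int(np.argmax(probs))]

# Random-weighted sampling: "apple" is most likely, but others can be picked too
sampled_token = np.random.choice(vocab, p=probs)
print(greedy_token, sampled_token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;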

&lt;p&gt;We can control the sampling process by specifying &lt;code&gt;top-k&lt;/code&gt; and &lt;code&gt;top-p&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;top-k - selects the k most probable tokens from the model's output and samples from that reduced set&lt;/li&gt;
&lt;li&gt;top-p - selects the smallest set of tokens whose cumulative probability exceeds a threshold &lt;code&gt;p&lt;/code&gt;. This method is more flexible than top-k, as it dynamically adjusts the number of tokens based on their probabilities.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice, top-p and top-k can be used together or independently. For example, you could use top-k to select the top 50 tokens, and then use top-p to further refine that set based on a cumulative probability threshold.&lt;/p&gt;
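&lt;p&gt;Here is a rough sketch (my own, not from the course) of how top-k and then top-p filtering could be applied to a made-up probability vector before sampling:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def top_k_top_p_filter(probs, k, p):
    """Keep the k most probable tokens, then the smallest subset of them whose
    cumulative probability reaches p, and renormalize."""
    order = np.argsort(probs)[::-1]                    # token ids, most probable first
    keep = order[:k]                                   # top-k filtering
    cumulative = np.cumsum(probs[keep])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # top-p (nucleus) filtering
    keep = keep[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])       # made-up distribution
filtered = top_k_top_p_filter(probs, k=3, p=0.8)       # only the first 3 tokens survive
token_id = np.random.choice(len(probs), p=filtered)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;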

&lt;p&gt;We can also adjust the token probabilities before sampling by scaling them with the &lt;code&gt;temperature&lt;/code&gt; parameter.&lt;br&gt;
A value lower than 1 sharpens the distribution, skewing selection towards the highest-probability tokens and making it less random.&lt;br&gt;
A value higher than 1 "equalizes" the distribution, giving lower-probability tokens a higher chance of being selected.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb008b63iv960y0w8pdnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb008b63iv960y0w8pdnr.png" alt="tokens probabilities" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Generative AI Project Lifecycle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foujtt36rsf7b8n334djc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foujtt36rsf7b8n334djc.png" alt="Generative AI Project Lifecycle" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Define use case
&lt;/h3&gt;

&lt;p&gt;The most important step in any project is to define the scope as accurately and narrowly as you can (text generation, classification etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Select
&lt;/h3&gt;

&lt;p&gt;There are plenty of foundation models to choose from; select the model that best fits the project's needs.&lt;br&gt;
Each model usually comes with a "model card" that explains the model's use cases, how it was trained, its biases and risks etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adapt and align model
&lt;/h3&gt;

&lt;p&gt;First develop an evaluation framework, then start an iterative process in which you try to get the desired outcome by improving the prompt (e.g. multi-shot prompting); if that's not enough, try fine-tuning or RLHF.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application integration
&lt;/h3&gt;

&lt;p&gt;At this stage, an important step is to optimize your model for deployment. This ensures that you're making the best use of your compute resources and providing the best possible experience for the users of your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LLMs are trained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model Types
&lt;/h3&gt;

&lt;p&gt;There are 3 types of LLMs: encoder-only, encoder-decoder and decoder-only.&lt;/p&gt;

&lt;h4&gt;
  
  
  Encoder Only (autoencoding)
&lt;/h4&gt;

&lt;p&gt;They are pre-trained using masked language modeling. Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence.&lt;br&gt;
Autoencoding models build bi-directional representations of the input sequence, meaning that the model has an understanding of the full context of a token and not just of the words that come before it.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4euf7f3ga50i6rcnp6t6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4euf7f3ga50i6rcnp6t6.png" alt="autoencoding model" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These models are ideally suited to tasks that benefit from this bi-directional context, like sentiment analysis, named entity recognition and word classification.&lt;br&gt;
BERT is a famous encoder-only model.&lt;/p&gt;
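&lt;p&gt;As a quick illustration (not part of the course notes), this is roughly how an encoder-only model like BERT can be used for masked-token prediction via the Hugging Face &lt;code&gt;transformers&lt;/code&gt; pipeline API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Requires the "transformers" package; downloads the model on first use
from transformers import pipeline

# BERT is encoder-only (autoencoding), trained with masked language modeling
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("The teacher taught the student with the [MASK]."))
# Returns the most probable tokens for the masked position, with their scores
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;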

&lt;h4&gt;
  
  
  Decoder Only (autoregressive)
&lt;/h4&gt;

&lt;p&gt;Here, the training objective is to predict the next token based on the previous sequence of tokens.&lt;br&gt;
Decoder-based autoregressive models mask the input sequence so the model can only see the input tokens leading up to the token in question; the model has no knowledge of the end of the sentence. The model then iterates over the input sequence, token by token, to predict the following token. In contrast to the encoder architecture, this means that the context is unidirectional. By learning to predict the next token from a vast number of examples, the model builds up a statistical representation of language.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37da8u5tt1lksy9a8jpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37da8u5tt1lksy9a8jpd.png" alt="autoregressive model" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These models are ideal for text generation.&lt;br&gt;
GPT is a famous decoder-only model.&lt;/p&gt;
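&lt;p&gt;Again for illustration only (not part of the course notes), a decoder-only model like GPT-2 generating text via the Hugging Face pipeline API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import pipeline

# GPT-2 is decoder-only (autoregressive): it repeatedly predicts the next token
generator = pipeline("text-generation", model="gpt2")
print(generator("Machine learning is", max_new_tokens=20))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;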

&lt;h4&gt;
  
  
  Encoder-Decoder (sequence-to-sequence)
&lt;/h4&gt;

&lt;p&gt;The exact details of the pre-training objective vary from model to model. A popular sequence-to-sequence model, T5, pre-trains the encoder using span corruption, which masks random &lt;strong&gt;sequences&lt;/strong&gt; of input tokens, replacing them with a special token (e.g. &amp;lt;X&amp;gt;) called a sentinel token.&lt;br&gt;
The decoder is then tasked with reconstructing the masked token sequences auto-regressively. The output is the sentinel token followed by the predicted tokens.&lt;br&gt;
They are generally useful in cases where you have a body of texts as both input and output.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcmwfweycyersq484o0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcmwfweycyersq484o0e.png" alt="sequence-to-sequence model" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These models are ideally suited for translation, text summarization and question answering.&lt;br&gt;
T5 and BART are famous encoder-decoder models.&lt;/p&gt;
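&lt;p&gt;And one more illustration (not part of the course notes), an encoder-decoder model like T5 used for translation via the Hugging Face pipeline API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import pipeline

# T5 is an encoder-decoder (sequence-to-sequence) model
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("I love machine learning"))
# Returns a list with a 'translation_text' entry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;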

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsaob69ittp0u3pc2hgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsaob69ittp0u3pc2hgq.png" alt="models architecture" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Researchers found that the larger the model, the better its performance, which led to larger and larger models. How large can they grow?&lt;/p&gt;

&lt;h3&gt;
  
  
  Computational Challenges
&lt;/h3&gt;

&lt;p&gt;In order to train a 1B-parameter model, a GPU with about 24GB of RAM is needed (each parameter is a 32-bit float -&amp;gt; 4GB for the model weights alone, and roughly another 20GB is needed for the optimizer states, gradients etc.).&lt;/p&gt;

&lt;p&gt;Now imagine the amount of RAM needed to train a 500B-parameter model!&lt;/p&gt;
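&lt;p&gt;A rough back-of-the-envelope calculation (my own, using the approximate numbers above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough memory estimate for training, following the approximate numbers above
params = 1e9                  # 1B parameters
bytes_per_param = 4           # FP32

weights_gb = params * bytes_per_param / 1e9   # 4 GB just for the weights
# Rule of thumb: training needs roughly 6x the weight memory once optimizer
# states, gradients and temporary buffers are included
training_gb = weights_gb * 6                  # ~24 GB in total
print(weights_gb, training_gb)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;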

&lt;h4&gt;
  
  
  Quantization
&lt;/h4&gt;

&lt;p&gt;Instead of using 32-bit floating point (FP32), a 16-bit floating point (FP16) or even an 8-bit integer can be used. This, of course, comes at the expense of accuracy.&lt;/p&gt;

&lt;p&gt;Google developed a new datatype, BFLOAT16 (BF16); it is a 16-bit floating point format, but it allocates more bits to the exponent at the expense of the fraction. This gives the datatype the same range as FP32, just with lower fraction precision.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou60tsapv5ogvjbv31kg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou60tsapv5ogvjbv31kg.png" alt="data types" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;
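&lt;p&gt;A small illustration (my own, not from the course) of the range/precision trade-off using TensorFlow casts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

big = tf.constant(1.0e20, dtype=tf.float32)
print(tf.cast(big, tf.float16))    # inf    - outside FP16's range
print(tf.cast(big, tf.bfloat16))   # ~1e20  - BF16 keeps FP32's range

pi = tf.constant(3.14159265, dtype=tf.float32)
print(tf.cast(pi, tf.bfloat16))    # ~3.140625 - fewer fraction bits, lower precision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;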

&lt;h4&gt;
  
  
  Distributed Data Parallel (multi-GPU training)
&lt;/h4&gt;

&lt;p&gt;In order to speed up training, multi-GPU training is possible. In this case a data loader sends different chunks of the training data to the available GPUs, each GPU computes its own gradients, and then a synchronizer combines the gradients from all GPUs and updates the model weights.&lt;br&gt;
&lt;strong&gt;Important:&lt;/strong&gt; In this method each GPU must hold the &lt;strong&gt;entire&lt;/strong&gt; model weights + gradients + optimizer state in its memory; if the model is too big, that is not possible.&lt;/p&gt;
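&lt;p&gt;For reference (not from the course), this is roughly what data-parallel training looks like in TensorFlow with &lt;code&gt;tf.distribute.MirroredStrategy&lt;/code&gt;; the tiny model here is just a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Each GPU gets a full copy of the model and a different slice of every batch;
# gradients are synchronized across GPUs after each step
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) will now split each batch across the available GPUs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;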

&lt;h4&gt;
  
  
  Fully Sharded Data Parallel
&lt;/h4&gt;

&lt;p&gt;In this method the model's weights + gradients + optimizer state are sharded and distributed among the GPUs; this way a model that is too large to fit on a single GPU can still be trained.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07r9qwcou0qt91b92p9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07r9qwcou0qt91b92p9b.png" alt="Fully Sharded Data Parallel" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Scaling Laws
&lt;/h4&gt;

&lt;p&gt;In order to improve the model's performance (reduce its loss during training), there are 3 factors that can be tweaked, and increasing any one of them can improve performance:&lt;br&gt;
1) Compute - measured in petaflop/s-days (1 petaflop/s-day is the compute obtained by running at one quadrillion floating-point operations per second for a full day).&lt;br&gt;
2) Dataset Size - The amount of training data.&lt;br&gt;
3) Model Size - The number of model parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp25sx6qt9t8ejasa7tm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp25sx6qt9t8ejasa7tm1.png" alt="loss vs compute" width="456" height="358"&gt;&lt;/a&gt;&lt;br&gt;
More petaflop/s-days of compute lead to lower loss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwsng0q4gsyeermzvlzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwsng0q4gsyeermzvlzb.png" alt="loss vs dataset/params" width="562" height="266"&gt;&lt;/a&gt;&lt;br&gt;
Larger datasets / more parameters lead to lower loss.&lt;/p&gt;

&lt;p&gt;A research paper that studied the relationship between the three came up with a recipe for training a compute-optimal model, named "Chinchilla"; they looked for the right balance between compute, dataset size and parameters.&lt;br&gt;
Their finding was that many large models are over-parameterized and under-trained (not enough training data).&lt;br&gt;
They found that in order to achieve compute-optimal training, the number of training tokens should be about 20x the number of model parameters.&lt;/p&gt;
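&lt;p&gt;A quick sketch (my own) of what the rough 20x rule implies for a few hypothetical model sizes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Chinchilla-style rule of thumb: ~20 training tokens per model parameter
def chinchilla_optimal_tokens(num_params):
    return 20 * num_params

for params in [1e9, 8e9, 70e9]:   # hypothetical model sizes
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params need about {tokens / 1e9:.0f}B training tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;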

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhupk09ehol5jkgrrmdz6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhupk09ehol5jkgrrmdz6.png" alt="compute optimal models" width="800" height="195"&gt;&lt;/a&gt;&lt;br&gt;
The Chinchilla and LLaMA models are compute-optimal, whereas GPT-3 was not trained on enough data for its parameter count. This is how LLaMA can perform as well as GPT-3 despite being smaller.&lt;/p&gt;

&lt;p&gt;The bottom line: given the Chinchilla guidelines and a compute budget, one can estimate how much training data is needed for a given model size.&lt;/p&gt;

&lt;h4&gt;
  
  
  BloombergGPT
&lt;/h4&gt;

&lt;p&gt;There are specialized domains where the pre-trained models are not good enough and it is necessary to pre-train a model from scratch, for example for medicine, law or finance.&lt;/p&gt;

&lt;p&gt;The Bloomberg team decided to train a financial model from scratch; they split the training data between financial and general text (so the model is capable of both financial tasks and general-purpose language tasks).&lt;/p&gt;

&lt;p&gt;The team had a budget constraint on compute so they chose the model size carefully.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2eyn7ytm4l2jeovakpo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2eyn7ytm4l2jeovakpo.png" alt="BloombergGPT" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashed vertical line is the compute budget; the pink shaded area is the optimal range of parameters/training tokens for a given amount of compute (FLOPs).&lt;br&gt;
BloombergGPT is slightly above optimal with respect to the number of parameters, and slightly below with respect to training data.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>genai</category>
    </item>
    <item>
      <title>Crossplatform Tensorflow Lite</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Wed, 21 Apr 2021 20:59:07 +0000</pubDate>
      <link>https://dev.to/valyouw/crossplatform-tensorflow-lite-4kgo</link>
      <guid>https://dev.to/valyouw/crossplatform-tensorflow-lite-4kgo</guid>
      <description>&lt;h2&gt;
  
  
  Why crossplatform?
&lt;/h2&gt;

&lt;p&gt;Today TensorFlow Lite is available as a library for both iOS and Android using Swift and Kotlin, and this is great if all you need is to run inference with some model. But what if your pipeline is more complicated? Like running various image processing tasks before/after using the model's output? In that case it would be more efficient to develop the entire pipeline once in C++, and use it on both iOS and Android.&lt;/p&gt;

&lt;p&gt;In this video series we will see how to run inference in C++ using the TensorFlow Lite C API and OpenCV. We'll also see how to use that code later on iOS, Android and Windows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ValYouW/tflite-crossplatform" rel="noopener noreferrer"&gt;Source code on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Converting a TensorFlow Object Detection model to TFLite
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/llEkDLnLocE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Object Detection using TFLite C API
&lt;/h2&gt;

&lt;p&gt;In this video we'll see how to develop an ObjectDetector class in C++ that will be used across all platforms.&lt;/p&gt;

&lt;p&gt;We will also test our detector on Windows.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/EPF0R0M4Fb0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Object Detection on Android
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/axsE34RzbrI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  iOS
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Gx_kdTkW_PU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>android</category>
      <category>ios</category>
      <category>tflite</category>
    </item>
    <item>
      <title>Auto-sizing element to fit inside flexbox</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sat, 13 Mar 2021 18:23:23 +0000</pubDate>
      <link>https://dev.to/valyouw/auto-size-element-to-fit-inside-flexbox-1857</link>
      <guid>https://dev.to/valyouw/auto-size-element-to-fit-inside-flexbox-1857</guid>
      <description>&lt;h1&gt;
  
  
  Intro
&lt;/h1&gt;

&lt;p&gt;We were given a task to build a screen, for mobile devices, consisting of a single column with a header, text, an image and an action button. The "catch" is that all elements must fit into the screen with NO scrolling, where the image is the "responsive" element that should shrink/expand according to the available space.&lt;/p&gt;

&lt;p&gt;Here is a schema of the design on 2 different screens:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faavolpl1awdq56addczj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faavolpl1awdq56addczj.png" alt="Alt Text" width="531" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What we should note here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The image shrinks on the small device so the "GO" button is not "pushed" out of the screen&lt;/li&gt;
&lt;li&gt;On larger screens the "GO" button is adjacent to the image, and the image takes the entire width available.&lt;/li&gt;
&lt;li&gt;IMPORTANT: We are guaranteed that the text won't be too long (so there will always be some space left for the image) and we should support "portrait only".&lt;/li&gt;
&lt;/ol&gt;
&lt;h1&gt;
  
  
  Initial Layout
&lt;/h1&gt;

&lt;p&gt;The first implementation direction that comes to mind is using a "flex" column design. This is the HTML structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"box"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        This is some long header?
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        Lorem ipsum dolor sit amet...
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://via.placeholder.com/600x400.png"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;button&amp;gt;&lt;/span&gt;Open&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;our container class, &lt;code&gt;box&lt;/code&gt;, is a flex column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;style&amp;gt;&lt;/span&gt;
    &lt;span class="nc"&gt;.box&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;flex&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;flex-direction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100%&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100%&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;box-sizing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;border-box&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside our "column" we have 4 items: title, text, image and button. The title, text and button just render "as-is" taking as much space as needed, where the &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; will have to adjust to the remaining space.&lt;/p&gt;

&lt;p&gt;Currently the result looks as follow, "small" device on the left, "large" device on the right:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z67hx7o1nsqry8iyr9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z67hx7o1nsqry8iyr9c.png" alt="Alt Text" width="627" height="705"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Sizing inside flex
&lt;/h1&gt;

&lt;p&gt;The first issue we see is the &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;, which stretches to fill the entire width; this is the default behavior of elements inside a &lt;code&gt;flex&lt;/code&gt; container. To fix this we can just apply &lt;code&gt;align-self: center&lt;/code&gt; to the &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;; we'll apply this also to the &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt;, and voila:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa76t0h7rrubahhpylh76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa76t0h7rrubahhpylh76.png" alt="Alt Text" width="787" height="828"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OK, while the &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt; is now centered and not stretched anymore, the &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; has lost its bounds... well this can easily be fixed by specifying &lt;code&gt;max-width: 100%&lt;/code&gt; on the &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13fipdjj465c4opayhhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13fipdjj465c4opayhhn.png" alt="Alt Text" width="630" height="710"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we are really close to achieving our goal, we just need to tell the image: "don't push", and one way to do it inside a &lt;code&gt;flex&lt;/code&gt; container is by setting &lt;code&gt;overflow: auto&lt;/code&gt; on the misbehaving element. At this point we'll also set some &lt;code&gt;margin-top&lt;/code&gt; on the &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;style&amp;gt;&lt;/span&gt;
    &lt;span class="nc"&gt;.box&lt;/span&gt; &lt;span class="nt"&gt;img&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;align-self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;center&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100%&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;overflow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c"&gt;/* do NOT push others*/&lt;/span&gt;
        &lt;span class="nl"&gt;margin-top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;.box&lt;/span&gt; &lt;span class="nt"&gt;button&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;align-self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;center&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;margin-top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5jb0jy5wdatucbaddkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5jb0jy5wdatucbaddkw.png" alt="Alt Text" width="627" height="705"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The END
&lt;/h1&gt;

&lt;p&gt;The key takeaways, inside flex container:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;align-self&lt;/code&gt; to prevent an element from stretching.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;overflow: auto&lt;/code&gt; to prevent an element from "pushing" other elements (assuming, of course, the element is "allowed" to shrink).&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>webdev</category>
      <category>html</category>
      <category>css</category>
    </item>
    <item>
      <title>OpenCV in Android native using C++</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 26 Feb 2021 21:55:38 +0000</pubDate>
      <link>https://dev.to/valyouw/opencv-in-android-native-using-c-1mkc</link>
      <guid>https://dev.to/valyouw/opencv-in-android-native-using-c-1mkc</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/ValYouW/AndroidOpenCVDemo" rel="noopener noreferrer"&gt;Source code on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Sn3YhfY5jqg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h1&gt;
  
  
  Get OpenCV
&lt;/h1&gt;

&lt;p&gt;First we have to download the OpenCV SDK for Android and setup our environment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download OpenCV Android SDK from &lt;a href="https://opencv.org/releases/" rel="noopener noreferrer"&gt;here&lt;/a&gt; (this tutorial was tested against OpenCV version 4.2.0).&lt;/li&gt;
&lt;li&gt;Extract the zip file to some folder, I use &lt;code&gt;c:\tools&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Define a global environment variable &lt;code&gt;OPENCV_ANDROID&lt;/code&gt; pointing to the root folder of the OpenCV Android SDK (i.e. &lt;code&gt;c:\tools\OpenCV-android-sdk&lt;/code&gt;); by "global environment variable" I mean that it will be available to Android Studio.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Create a Native Android project
&lt;/h1&gt;

&lt;p&gt;Let's open Android Studio and create a new project; from the project type template select "Native C++":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0jmd3ww4d7gv6ay08yz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0jmd3ww4d7gv6ay08yz.jpg" alt="Alt Text" width="800" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hit "Next" and then choose a name for you project, and on the next step just leave C++ standard on the "Toolchain Default":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanl8z612g2kviv8i02cu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanl8z612g2kviv8i02cu.jpg" alt="Alt Text" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click "Finish" and wait for the project to fully load (let Gradle finish).&lt;/p&gt;

&lt;h1&gt;
  
  
  Add some C++ source files
&lt;/h1&gt;

&lt;p&gt;Looking at the project tree using the "Android" view, you will notice we have a "cpp" folder; by default it will contain 2 files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;CMakeLists.txt&lt;/code&gt; - Build instructions for the native code (this file can also be found under "External Build Files")&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;native-lib.cpp&lt;/code&gt; - Will contain the "bridge" (JNI) code between the managed (Kotlin/Java) and native (C++) environments.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While it is perfectly "legal" to write all our code inside &lt;code&gt;native-lib.cpp&lt;/code&gt;, we will leave that file with only the JNI-related methods, and write the "real" image processing code in separate files.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Right-click on the &lt;code&gt;cpp&lt;/code&gt; folder and choose &lt;code&gt;New -&amp;gt; C/C++ Header file&lt;/code&gt;, call it &lt;code&gt;opencv-utils.h&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Right-click on the &lt;code&gt;cpp&lt;/code&gt; folder and choose &lt;code&gt;New -&amp;gt; C/C++ Source file&lt;/code&gt;, call it &lt;code&gt;opencv-utils.cpp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project tree should look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwydkbq4v2qeamxq9v58.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwydkbq4v2qeamxq9v58.jpg" alt="Alt Text" width="314" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Configure OpenCV build
&lt;/h1&gt;

&lt;p&gt;Open &lt;code&gt;CMakeLists.txt&lt;/code&gt; and add the following at the top, right after "cmake_minimum_required":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# opencv
set(OpenCV_STATIC ON)
set(OpenCV_DIR $ENV{OPENCV_ANDROID}/sdk/native/jni)
find_package(OpenCV REQUIRED)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are, in effect, "importing" the OpenCV package and build definitions from the OpenCV Android SDK. Note the line &lt;code&gt;set(OpenCV_DIR $ENV{OPENCV_ANDROID}/sdk/native/jni)&lt;/code&gt;, where we use the environment variable we defined above that points to the OpenCV SDK, so double check that it points to the correct location.&lt;/p&gt;

&lt;p&gt;Next we should add our source files so they get compiled. Scroll down a bit and search for &lt;code&gt;native-lib.cpp&lt;/code&gt;; it will be an argument in a call to the &lt;code&gt;add_library&lt;/code&gt; method. Change it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;add_library( # Sets the name of the library.
        native-lib

        # Sets the library as a shared library.
        SHARED

        # Provides a relative path to your source file(s).
        opencv-utils.cpp
        native-lib.cpp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we define our &lt;code&gt;native-lib&lt;/code&gt; library and the sources it is built from.&lt;/p&gt;

&lt;p&gt;Next we will include the library that helps us with bitmap manipulation inside C++ (in order to convert an Android Bitmap to an OpenCV Mat). Scroll down a bit, and right before the call to &lt;code&gt;target_link_libraries&lt;/code&gt; add this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# jnigraphics lib from NDK is used for Bitmap manipulation in native code
find_library(jnigraphics-lib jnigraphics)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we have to include the OpenCV and jnigraphics libs in the link process; change &lt;code&gt;target_link_libraries&lt;/code&gt; to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;target_link_libraries( # Specifies the target library.
        native-lib

        ${OpenCV_LIBS}
        ${jnigraphics-lib}
        # Links the target library to the log library
        # included in the NDK.
        ${log-lib})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Sync &amp;amp; Build
&lt;/h2&gt;

&lt;p&gt;That's it. If Android Studio offers to "sync" the project, do it; if it doesn't, initiate a sync from the menu: &lt;code&gt;File -&amp;gt; Sync Project with Gradle Files&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then build: &lt;code&gt;Build -&amp;gt; Make Project&lt;/code&gt;. If the build is successful, great! If you get an error like &lt;code&gt;Error computing CMake server result&lt;/code&gt; with no "real" error, then something is wrong with the project definition; what worked for me was to remove the cmake version that is defined in the &lt;strong&gt;app&lt;/strong&gt; &lt;code&gt;build.gradle&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;externalNativeBuild {
    cmake {
        path "src/main/cpp/CMakeLists.txt"
        version "3.10.2" // &amp;lt;&amp;lt;-- REMOVE THIS LINE
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Let's use OpenCV
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Flip &amp;amp; Blur
&lt;/h2&gt;

&lt;p&gt;For demo purposes, our app will flip and blur the image using OpenCV. Add the following to &lt;code&gt;opencv-utils.h&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#pragma once
&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;opencv2/core.hpp&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;myFlip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;myBlur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IT IS OK if Android Studio marks the &lt;code&gt;include opencv2&lt;/code&gt; lines in red; it should be fine after building the project.&lt;/p&gt;

&lt;p&gt;And the implementation inside &lt;code&gt;opencv-utils.cpp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"opencv-utils.h"&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;opencv2/imgproc.hpp&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;myFlip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
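    &lt;span class="c1"&gt;// A flip code of 0 flips around the x-axis (vertical flip); a positive value flips horizontally, a negative value flips both&lt;/span&gt;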
    &lt;span class="n"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;myBlur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
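    &lt;span class="c1"&gt;// Passing an empty Size() lets OpenCV compute the Gaussian kernel size from sigma&lt;/span&gt;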
    &lt;span class="n"&gt;GaussianBlur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Size&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Expose native code to managed code
&lt;/h2&gt;

&lt;p&gt;Next we need to expose our &lt;code&gt;flip&lt;/code&gt; and &lt;code&gt;blur&lt;/code&gt; methods to the "managed" world; this happens inside &lt;code&gt;native-lib.cpp&lt;/code&gt;. We will not go over the JNI standards and rules for exposing methods; we will just copy the pre-defined &lt;code&gt;stringFromJNI&lt;/code&gt; method and use it as a template. In my case Android Studio created this method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="n"&gt;JNIEXPORT&lt;/span&gt; &lt;span class="n"&gt;jstring&lt;/span&gt; &lt;span class="n"&gt;JNICALL&lt;/span&gt;
&lt;span class="n"&gt;Java_com_vyw_opencv_1demo_MainActivity_stringFromJNI&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt; the method name starts with the full activity namespace (dots become underscores, and underscores in the package name are escaped as &lt;code&gt;_1&lt;/code&gt;, which is why &lt;code&gt;opencv_demo&lt;/code&gt; appears as &lt;code&gt;opencv_1demo&lt;/code&gt;), so make sure to copy yours correctly.&lt;/p&gt;

&lt;p&gt;So our methods will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="n"&gt;JNIEXPORT&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;JNICALL&lt;/span&gt;
&lt;span class="nf"&gt;Java_com_vyw_opencv_1demo_MainActivity_flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JNIEnv&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;p_this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;bitmapIn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;bitmapOut&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bitmapToMat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapIn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// NOTE bitmapToMat returns Mat in RGBA format, if needed convert to BGRA using cvtColor&lt;/span&gt;

    &lt;span class="n"&gt;myFlip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// NOTE matToBitmap expects Mat in GRAY/RGB(A) format, if needed convert using cvtColor&lt;/span&gt;
    &lt;span class="n"&gt;matToBitmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapOut&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="n"&gt;JNIEXPORT&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;JNICALL&lt;/span&gt;
&lt;span class="nf"&gt;Java_com_vyw_opencv_1demo_MainActivity_blur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JNIEnv&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;p_this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;bitmapIn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;bitmapOut&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jfloat&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bitmapToMat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapIn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;myBlur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;matToBitmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapOut&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is pretty much self-explanatory; the interesting parts are &lt;code&gt;bitmapToMat&lt;/code&gt; and &lt;code&gt;matToBitmap&lt;/code&gt;. These 2 methods, as their names imply, convert between the Android &lt;code&gt;Bitmap&lt;/code&gt; and OpenCV &lt;code&gt;Mat&lt;/code&gt; classes; basically they copy the pixel bytes, taking the pixel format into consideration and making the needed conversions. The methods were taken from the OpenCV source &lt;a href="https://github.com/opencv/opencv/blob/master/modules/java/generator/src/cpp/utils.cpp" rel="noopener noreferrer"&gt;opencv/modules/java/generator/src/cpp/utils.cpp&lt;/a&gt; with some slight adjustments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;bitmapToMat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JNIEnv&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mat&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jboolean&lt;/span&gt; &lt;span class="n"&gt;needUnPremultiplyAlpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AndroidBitmapInfo&lt;/span&gt;  &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;              &lt;span class="n"&gt;pixels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;AndroidBitmap_getInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ANDROID_BITMAP_FORMAT_RGBA_8888&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
                   &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ANDROID_BITMAP_FORMAT_RGB_565&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
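        &lt;span class="c1"&gt;// Lock the bitmap pixels to get a raw pointer we can wrap with a Mat&lt;/span&gt;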
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;AndroidBitmap_lockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CV_8UC4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ANDROID_BITMAP_FORMAT_RGBA_8888&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CV_8UC4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;needUnPremultiplyAlpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_mRGBA2RGBA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copyTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// info.format == ANDROID_BITMAP_FORMAT_RGB_565&lt;/span&gt;
            &lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CV_8UC2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_BGR5652RGBA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;AndroidBitmap_unlockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;AndroidBitmap_unlockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;jclass&lt;/span&gt; &lt;span class="n"&gt;je&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;FindClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"java/lang/Exception"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ThrowNew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;je&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;what&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;AndroidBitmap_unlockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;jclass&lt;/span&gt; &lt;span class="n"&gt;je&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;FindClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"java/lang/Exception"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ThrowNew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;je&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Unknown exception in JNI code {nBitmapToMat}"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;matToBitmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JNIEnv&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jboolean&lt;/span&gt; &lt;span class="n"&gt;needPremultiplyAlpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AndroidBitmapInfo&lt;/span&gt;  &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;              &lt;span class="n"&gt;pixels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;AndroidBitmap_getInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ANDROID_BITMAP_FORMAT_RGBA_8888&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
                   &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ANDROID_BITMAP_FORMAT_RGB_565&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dims&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC1&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC3&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC4&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;AndroidBitmap_lockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;CV_Assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ANDROID_BITMAP_FORMAT_RGBA_8888&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CV_8UC4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
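            &lt;span class="c1"&gt;// tmp wraps the bitmap's own pixel buffer, so the conversions below write directly into the bitmap&lt;/span&gt;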
            &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_GRAY2RGBA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC3&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
                &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_RGB2RGBA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC4&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;needPremultiplyAlpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_RGBA2mRGBA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copyTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// info.format == ANDROID_BITMAP_FORMAT_RGB_565&lt;/span&gt;
            &lt;span class="n"&gt;Mat&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CV_8UC2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_GRAY2BGR565&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC3&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
                &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_RGB2BGR565&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CV_8UC4&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
                &lt;span class="n"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLOR_RGBA2BGR565&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;AndroidBitmap_unlockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;AndroidBitmap_unlockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;jclass&lt;/span&gt; &lt;span class="n"&gt;je&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;FindClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"java/lang/Exception"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ThrowNew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;je&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;what&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;AndroidBitmap_unlockPixels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;jclass&lt;/span&gt; &lt;span class="n"&gt;je&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;FindClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"java/lang/Exception"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ThrowNew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;je&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Unknown exception in JNI code {nMatToBitmap}"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Calling Native from Managed
&lt;/h1&gt;

&lt;p&gt;We have arrived at the last part of our demo: calling the native methods from our &lt;code&gt;MainActivity&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First I added a sample image to the &lt;code&gt;res/drawable-nodpi&lt;/code&gt; folder (you might need to create it). I chose the &lt;code&gt;nodpi&lt;/code&gt; flavor because I don't want Android to scale up my image, and I used a relatively small image (640x427) so blurring can run in real time.&lt;/p&gt;
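&lt;p&gt;As a side note (this is not part of the original project), a similar "don't scale my image" effect can be achieved by disabling density scaling at decode time. Here is a minimal sketch; the helper name &lt;code&gt;loadUnscaledBitmap&lt;/code&gt; is just an illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import android.app.Activity
import android.graphics.Bitmap
import android.graphics.BitmapFactory

// Alternative to drawable-nodpi: decode a drawable resource with density
// scaling disabled, so Android does not scale the bitmap up for the screen.
fun Activity.loadUnscaledBitmap(resId: Int): Bitmap {
    val opts = BitmapFactory.Options().apply { inScaled = false }
    return BitmapFactory.decodeResource(resources, resId, opts)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
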

&lt;p&gt;Then I set up my &lt;code&gt;MainActivity&lt;/code&gt; view with the following (a minimal wiring sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ImageView - Pre-loaded with the test image as &lt;code&gt;app:srcCompat="@drawable/mountain"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Button - Will be used to flip the image&lt;/li&gt;
&lt;li&gt;SeekBar - Will be used to control the blur sigma; it goes from 0-100 (later in the code it is converted to a float in the range 0.1-10)&lt;/li&gt;
&lt;/ul&gt;
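
&lt;p&gt;The exact layout XML and view lookups are not shown here; as an illustration only (not the project's actual code), here is a minimal sketch of how these views could be wired up, assuming ids like &lt;code&gt;R.id.imageView&lt;/code&gt; and &lt;code&gt;R.id.sldSigma&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustration only - the original project may use Kotlin synthetics or view
// binding instead; the ids below are assumptions, adjust them to your layout.
private lateinit var imageView: ImageView
private lateinit var sldSigma: SeekBar

override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    setContentView(R.layout.activity_main)

    imageView = findViewById(R.id.imageView)
    sldSigma = findViewById(R.id.sldSigma)

    // MainActivity implements SeekBar.OnSeekBarChangeListener (see below),
    // so the activity itself is registered as the SeekBar listener
    sldSigma.setOnSeekBarChangeListener(this)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
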

&lt;h2&gt;
  
  
  Declaring the JNI methods
&lt;/h2&gt;

&lt;p&gt;In order to use our methods from &lt;code&gt;native-lib.cpp&lt;/code&gt; we need to declare them as &lt;code&gt;external&lt;/code&gt; functions inside our activity, &lt;strong&gt;and we need to load our &lt;code&gt;native-lib&lt;/code&gt; library&lt;/strong&gt; (libnative-lib.so). If you created the project from the "Native C++" template this is already done; scroll to the very bottom of &lt;code&gt;MainActivity.kt&lt;/code&gt; and you will see it. Then just add our &lt;code&gt;blur&lt;/code&gt; and &lt;code&gt;flip&lt;/code&gt; declarations, so it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;external&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;stringFromJNI&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;
&lt;span class="k"&gt;external&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;blur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bitmapIn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapOut&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;external&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bitmapIn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bitmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bitmapOut&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bitmap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;companion&lt;/span&gt; &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Used to load the 'native-lib' library on application startup.&lt;/span&gt;
    &lt;span class="nf"&gt;init&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loadLibrary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"native-lib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Processing Android Bitmap
&lt;/h2&gt;

&lt;p&gt;Now we will flip and blur the ImageView bitmap. First let's create 2 bitmaps: the first will hold the original image (&lt;code&gt;srcBitmap&lt;/code&gt;), and the other will be used as the destination bitmap (&lt;code&gt;dstBitmap&lt;/code&gt;), which is the one displayed on screen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MainActivity&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;AppCompatActivity&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;SeekBar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OnSeekBarChangeListener&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;srcBitmap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bitmap&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;dstBitmap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bitmap&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;

    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onCreate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;savedInstanceState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bundle&lt;/span&gt;&lt;span class="p"&gt;?)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

        &lt;span class="c1"&gt;// Load the original image&lt;/span&gt;
        &lt;span class="n"&gt;srcBitmap&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitmapFactory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decodeResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;R&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drawable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mountain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// Create and display dstBitmap in image view, we will keep updating&lt;/span&gt;
        &lt;span class="c1"&gt;// dstBitmap and the changes will be displayed on screen&lt;/span&gt;
        &lt;span class="n"&gt;dstBitmap&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;srcBitmap&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcBitmap&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;imageView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setImageBitmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dstBitmap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whenever the user moves the SeekBar we blur the image, using the SeekBar value as the blur sigma:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// SeekBar event handler&lt;/span&gt;
&lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onProgressChanged&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seekBar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;SeekBar&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fromUser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;doBlur&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;doBlur&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// The SeekBar range is 0-100 convert it to 0.1-10&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;sigma&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sldSigma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;progress&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10F&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// This is the actual call to the blur method inside native-lib.cpp&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcBitmap&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dstBitmap&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally we have the event handler for the flip button:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;btnFlip_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;View&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// This is the actual call to the flip method inside native-lib.cpp&lt;/span&gt;
    &lt;span class="c1"&gt;// note we flip srcBitmap (which is not displayed) and then call doBlur which will&lt;/span&gt;
    &lt;span class="c1"&gt;// eventually update dstBitmap (and which is displayed)&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcBitmap&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;srcBitmap&lt;/span&gt;&lt;span class="o"&gt;!!&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;doBlur&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  THE END
&lt;/h1&gt;

&lt;p&gt;That's it! The code can be found on &lt;a href="https://github.com/ValYouW/AndroidOpenCVDemo" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>android</category>
      <category>kotlin</category>
      <category>opencv</category>
    </item>
    <item>
      <title>A simple multi-player online game using node.js - Part IV</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 26 Feb 2021 21:38:15 +0000</pubDate>
      <link>https://dev.to/valyouw/a-simple-multi-player-online-game-using-node-js-part-iv-3e3c</link>
      <guid>https://dev.to/valyouw/a-simple-multi-player-online-game-using-node-js-part-iv-3e3c</guid>
      <description>&lt;h1&gt;
  
  
  Intro
&lt;/h1&gt;

&lt;p&gt;In this section we are going to explore the server code; the main parts are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;server.js&lt;/code&gt; - The entry point for the server, responsible for serving static files and accepting WebSockets&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lobby.js&lt;/code&gt; - Responsible for pairing players into matches&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;game/&lt;/code&gt; - All the snake game logic sits under this folder&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Server
&lt;/h1&gt;

&lt;p&gt;As stated above, &lt;code&gt;server.js&lt;/code&gt; is responsible for accepting connections and serving static files. I am not using any framework here, but I do use the &lt;a href="https://github.com/websockets/ws" rel="noopener noreferrer"&gt;ws&lt;/a&gt; module for handling WebSocket connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requests handlers
&lt;/h2&gt;

&lt;p&gt;In the code below we create a new http server and pass a request listener callback to handle requests; it is quite straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// This is a simple server, support only GET methods&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Handle the favicon (we don't have any)&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/favicon.ico&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;204&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// This request is for a file&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;DEPLOY_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;serveStatic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Static files handler
&lt;/h2&gt;

&lt;p&gt;Whenever we receive a GET request (which is not the favicon) we assume it is for a file; the &lt;code&gt;serveStatic&lt;/code&gt; method will look for the file and stream it back to the client.&lt;/p&gt;

&lt;p&gt;In the code I use 2 constants that help with finding the files: the first is &lt;code&gt;DEPLOY_DIR&lt;/code&gt;, which is the root folder where the static files are, and the second is &lt;code&gt;DEFAULT_FILE&lt;/code&gt;, which is the name of the file that should be served if the request url points to a folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;DEPLOY_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../client/deploy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, assuming we deployed the project under &lt;code&gt;/var/www/SnakeMatch&lt;/code&gt;, &lt;code&gt;DEPLOY_DIR&lt;/code&gt; resolves to &lt;code&gt;/var/www/SnakeMatch/client/deploy&lt;/code&gt;, and a request to &lt;code&gt;/all.js&lt;/code&gt; will serve &lt;code&gt;/var/www/SnakeMatch/client/deploy/all.js&lt;/code&gt;.&lt;/p&gt;
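
&lt;p&gt;A quick way to see this mapping for yourself (the absolute path below is just the hypothetical example above, not something the project requires):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustration only - uses the example deployment path from the paragraph above
var path = require('path');

var DEPLOY_DIR = '/var/www/SnakeMatch/client/deploy';

// path.join takes care of the leading slash in the request url
console.log(path.join(DEPLOY_DIR, '/all.js'));
// prints: /var/www/SnakeMatch/client/deploy/all.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;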

&lt;p&gt;Here is the code of the &lt;code&gt;serveStatic&lt;/code&gt; method, where &lt;code&gt;fs&lt;/code&gt; is Node's fs module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
* Serves a static file
* @param {object} res - The response object
* @param {string} file - The requested file path
*/&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;serveStatic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Get the file statistics&lt;/span&gt;
    &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lstat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// If err probably file does not exist&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// If this is a directory we will try to serve the default file&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDirectory&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;defaultFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_FILE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;serveStatic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;defaultFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Pipe the file over to the response&lt;/span&gt;
            &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Accepting connections
&lt;/h2&gt;

&lt;p&gt;After creating the http server we need to bind it to a port; we use the &lt;code&gt;PORT&lt;/code&gt; environment variable (needed for Heroku) and default to 3000. For WebSockets we use the &lt;code&gt;ws&lt;/code&gt; package; whenever we get a WebSocket connection we just hand it over to the lobby:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;WebSocketServer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ws&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Server listening on port:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Create the WebSocket server (it will handle "upgrade" requests)&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;wss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocketServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;wss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;connection&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;lobby&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Lobby
&lt;/h1&gt;

&lt;p&gt;The Lobby is responsible for accepting new players and pairing them into matches.&lt;/p&gt;

&lt;p&gt;Whenever a new socket is added to the lobby it first creates a &lt;code&gt;Player&lt;/code&gt; object (a wrapper around the socket, more on this later) and listens to its &lt;code&gt;disconnect&lt;/code&gt; event. Then it tries to pair the player with another player into a &lt;code&gt;Match&lt;/code&gt;: if there are no available players it puts the player in the &lt;code&gt;pendingPlayers&lt;/code&gt; dictionary; if it succeeds, the Match object is stored in the &lt;code&gt;activeMatches&lt;/code&gt; dictionary and the lobby registers to the Match's &lt;code&gt;GameOver&lt;/code&gt; event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Lobby&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create a new Player, add it to the pending players dictionary and register to its disconnect event&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;player&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;pendingPlayers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Disconnect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Lobby&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onPlayerDisconnect&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Try to pair this player with other pending players, if success we get a "match"&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;matchPlayers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register the Match GameOver event and store the match in the active matches dictionary&lt;/span&gt;
        &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GameOver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Lobby&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onGameOver&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;activeMatches&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Remove the players in the match from the pending players&lt;/span&gt;
        &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="nx"&gt;pendingPlayers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="nx"&gt;pendingPlayers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

        &lt;span class="c1"&gt;// Start the match&lt;/span&gt;
        &lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// No match found for this player, let him know he is Pending&lt;/span&gt;
        &lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildPending&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the code in the Lobby is not that interesting: &lt;code&gt;matchPlayers&lt;/code&gt; just loops over the &lt;code&gt;pendingPlayers&lt;/code&gt; dictionary and returns a new &lt;code&gt;Match&lt;/code&gt; object if it finds another pending player (one that is not the current player). When a match is over (&lt;code&gt;GameOver&lt;/code&gt; event) we just disconnect the two players (which closes their sockets) and delete the match from the &lt;code&gt;activeMatches&lt;/code&gt; dictionary.&lt;/p&gt;
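
&lt;p&gt;Roughly, these two pieces could look something like this (a sketch based on the description above, not the exact code from the repo; the &lt;code&gt;disconnect&lt;/code&gt; helper on the player is assumed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only - based on the description above, not the exact repo code
Lobby.matchPlayers = function (player) {
    for (var id in pendingPlayers) {
        // Pick the first pending player which is not the current player
        if (id !== player.id) {
            return new Match(pendingPlayers[id], player);
        }
    }

    // No opponent available
    return null;
};

Lobby.onGameOver = function (match) {
    // Disconnecting the players closes their sockets (the "disconnect" helper is assumed)
    match.player1.disconnect();
    match.player2.disconnect();

    // Remove the match from the active matches dictionary
    delete activeMatches[match.id];
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;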

&lt;h1&gt;
  
  
  The Game
&lt;/h1&gt;

&lt;p&gt;Now we will go over the code under the &lt;code&gt;server/game&lt;/code&gt; folder, which contains the &lt;code&gt;Player&lt;/code&gt;, &lt;code&gt;Match&lt;/code&gt; and &lt;code&gt;SnakeEngine&lt;/code&gt; classes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Player class
&lt;/h2&gt;

&lt;p&gt;The Player is just a wrapper around the socket: whenever new data arrives on the socket it raises a &lt;code&gt;message&lt;/code&gt; event, if the socket gets closed it raises a &lt;code&gt;disconnect&lt;/code&gt; event, and it exposes a &lt;code&gt;send&lt;/code&gt; method that writes data over the socket. Below are the ctor and send methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;Emitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;events&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;EventEmitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;util&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;util&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;uuid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node-uuid&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Make sure we got a socket&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;socket is mandatory&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;Emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// The player index within the game (will be set by the Match class)&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;online&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Register to the socket events&lt;/span&gt;
    &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;close&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onDisconnect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onDisconnect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;util&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inherits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Emitter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;send&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;online&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ignore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
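
&lt;p&gt;The &lt;code&gt;onMessage&lt;/code&gt; and &lt;code&gt;onDisconnect&lt;/code&gt; handlers just re-emit the socket events as Player events, roughly like this (a sketch; the actual &lt;code&gt;Player.Events&lt;/code&gt; string values are assumed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only - the event string values are assumed, the rest follows the description above
Player.Events = {
    Message: 'message',
    Disconnect: 'disconnect'
};

Player.prototype.onMessage = function (msg) {
    // Pass the player along so listeners (the Match) know who sent the message
    this.emit(Player.Events.Message, this, msg);
};

Player.prototype.onDisconnect = function () {
    // Mark offline so "send" becomes a no-op, then notify listeners
    this.online = false;
    this.emit(Player.Events.Disconnect, this);
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;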



&lt;h2&gt;
  
  
  Match class
&lt;/h2&gt;

&lt;p&gt;This class is responsible for all the game logistics: it updates the snake-engine every 100 msec, sends updates to the clients, reads messages from the clients, etc.&lt;/p&gt;

&lt;p&gt;NOTE: the Match class doesn't know how to "play" snake, that's what we have the snake-engine for.&lt;/p&gt;

&lt;p&gt;Although we described it in the first post, let's go over the course of a snake match: we start by sending a &lt;code&gt;Ready&lt;/code&gt; message to the clients with all the game info (board size, the snakes' initial positions etc), then there are 3 &lt;code&gt;Steady&lt;/code&gt; messages (one every second), then a &lt;code&gt;Go&lt;/code&gt; message signaling to the clients that the game has started, then a series of &lt;code&gt;Update&lt;/code&gt; messages sent every 100 milliseconds, and finally a &lt;code&gt;GameOver&lt;/code&gt; message.&lt;/p&gt;

&lt;p&gt;The match is over when one of the players has failed or 60 seconds have passed; if after 60 seconds the score is tied there is an overtime of 10 seconds until one player wins.&lt;/p&gt;

&lt;p&gt;Now let's see how the Match class does all this, first we define some constants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;MATCH_TIME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// In milliseconds&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;MATCH_EXTENSION_TIME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// In milliseconds&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;UPD_FREQ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;STEADY_WAIT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// number of steady messages to send&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;BOARD_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;WIDTH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;HEIGHT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;BOX&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the ctor we initialize the game; note that each player is assigned an index (player1 / player2).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Emitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gameTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;matchTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;MATCH_TIME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// The match timer (each match is for MATCH_TIME milliseconds)&lt;/span&gt;

    &lt;span class="c1"&gt;// Set the players indexes&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Register to the players events&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Disconnect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onPlayerDisconnect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Disconnect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onPlayerDisconnect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onPlayerMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onPlayerMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// Create the snake game&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SnakeEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BOARD_SIZE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WIDTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;BOARD_SIZE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HEIGHT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;BOARD_SIZE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BOX&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ready-Steady-Go
&lt;/h2&gt;

&lt;p&gt;The ready-steady-go flow happens in the &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;steady&lt;/code&gt; methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Build the ready message for each player&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Start the steady count down&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;steady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;STEADY_WAIT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * Handles the steady count down
 * @param {number} steadyLeft - The number of steady events left
 */&lt;/span&gt;
&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steady&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;steadyLeft&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if steady count down finished&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;steadyLeft&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Send the players a "Go" message&lt;/span&gt;
        &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildGo&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Starts the update events (this is the actual game)&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gameTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;UPD_FREQ&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Sends the players another steady message and call this method again in 1 sec&lt;/span&gt;
    &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildSteady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;steadyLeft&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nx"&gt;steadyLeft&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gameTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steady&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;steadyLeft&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Update cycle
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;update&lt;/code&gt; method is called every 100 milliseconds. The method is quite self-explanatory, but do note that &lt;code&gt;snakeEngine.update()&lt;/code&gt; returns a result object with info about the game state; more specifically, it tells us whether one of the snakes has lost (by colliding with itself or with the border) and whether there was a change to the pellets (removed/added).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Update the match time, this is not super precise as the "setTimeout" time is not guaranteed,&lt;/span&gt;
    &lt;span class="c1"&gt;// but ok for our purposes...&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;matchTime&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="nx"&gt;UPD_FREQ&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Update the game&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// If no snake lost on this update and there is more time we just reload the update timer&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loosingSnake&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;matchTime&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gameTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;UPD_FREQ&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendUpdateMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// If no snake lost it means time's up, lets see who won.&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loosingSnake&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Check if there is a tie&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// We don't like ties, lets add more time to the game&lt;/span&gt;
            &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;matchTime&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;MATCH_EXTENSION_TIME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gameTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;UPD_FREQ&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendUpdateMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// No tie, build a GameOver message (the client will find which player won)&lt;/span&gt;
        &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildGameOver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GameOverReason&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;End&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Ok, some snake had a collision and lost, since we have only 2 players we can easily find the winning snake&lt;/span&gt;
        &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;winningPlayer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loosingSnake&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildGameOver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GameOverReason&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Collision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;winningPlayer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Send the message to the players and raise the GameOver event&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;player2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;emit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GameOver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
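
&lt;p&gt;The &lt;code&gt;sendUpdateMessage&lt;/code&gt; helper used above is not shown here; it just builds an &lt;code&gt;Update&lt;/code&gt; message from the current state and sends it to both players, roughly like this (a sketch; the builder name and its exact arguments are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only - "buildUpdate" and its argument list are an assumption, not the repo's actual signature
Match.prototype.sendUpdateMessage = function (res) {
    // Pack the time left, both snakes and the engine's update result (pellets changes) into an Update message
    var msg = protocol.buildUpdate(this.matchTime, this.snakeEngine.snake1, this.snakeEngine.snake2, res);
    this.player1.send(msg);
    this.player2.send(msg);
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;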



&lt;h2&gt;
  
  
  Handling clients messages
&lt;/h2&gt;

&lt;p&gt;Whenever the client sends a message it first gets parsed using the Protocol object; then, if it is a &lt;code&gt;ChangeDirection&lt;/code&gt; request, we pass it to the snake-engine for processing. Note that we put the player index on the message so the snake-engine knows which player to update.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onPlayerMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Parse the message&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ChangeDirection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;playerIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleDirChangeMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it for the Match class; the rest of the code is not that interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snake Engine
&lt;/h2&gt;

&lt;p&gt;The snake-engine is responsible for "playing" the snake game: on every &lt;code&gt;update&lt;/code&gt; it checks whether a snake has collided with itself, gone out-of-bounds, eaten a pellet etc.&lt;/p&gt;

&lt;p&gt;In the ctor we create the 2 snake objects. Both snakes are created on the first row of the board, one on the left side and the other on the right side.&lt;/p&gt;

&lt;p&gt;Remember that the Board is divided into boxes, and that &lt;code&gt;Board.toScreen()&lt;/code&gt; gets a box index and returns the screen x/y.&lt;br&gt;
&lt;/p&gt;
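
&lt;p&gt;Roughly, that mapping could look like this (a sketch; it assumes boxes are indexed row-by-row and that the board keeps its &lt;code&gt;boxSize&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only - assumes boxes are laid out row-by-row (left-to-right)
Board.prototype.toScreen = function (boxIndex) {
    var row = Math.floor(boxIndex / this.horizontalBoxes);
    var col = boxIndex % this.horizontalBoxes;
    return {
        x: col * this.boxSize,
        y: row * this.boxSize
    };
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With that in mind, here is the engine's ctor:&lt;br&gt;
&lt;/p&gt;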

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;SnakeEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;boxSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Board&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;boxSize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// The first snake is created on the left side and is heading right (very top row, y index = 0)&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;snakeLoc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toScreen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;INITIAL_SNAKE_SIZE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Snake&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;snakeLoc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;snakeLoc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;boxSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;INITIAL_SNAKE_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Right&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// The second snake is created on the right side and is heading left (very top row, y index = 0)&lt;/span&gt;
    &lt;span class="nx"&gt;snakeLoc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toScreen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;horizontalBoxes&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;INITIAL_SNAKE_SIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Snake&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;snakeLoc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;snakeLoc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;boxSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;INITIAL_SNAKE_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Left&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="cm"&gt;/** @type {Pellet[]} */&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pellets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
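&lt;p&gt;As a reminder (the real implementation is in the Board class covered earlier in the series), the box-index to screen conversion is roughly what the sketch below shows; it is illustrative only and assumes row-major box indexing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: convert a box index to the screen x/y of the box's upper-left corner.
// Assumes row-major indexing; parameter names mirror the names used in this post.
function toScreen(boxIndex, horizontalBoxes, boxSize) {
    var row = Math.floor(boxIndex / horizontalBoxes);
    var col = boxIndex % horizontalBoxes;
    return { x: col * boxSize, y: row * boxSize };
}

// Example: on a 4-box-wide board with 10px boxes, box 5 is row 1 / col 1,
// so toScreen(5, 4, 10) returns { x: 10, y: 10 }.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;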



&lt;p&gt;The interesting methods are &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;checkCollision&lt;/code&gt; and &lt;code&gt;addPellet&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the update method we do the following for each snake: call the snake's update method (telling it to move to its next location), check for collisions, and check whether it ate a pellet. If there was a collision we stop immediately, as the game is over; if there was no collision we try to add a new pellet to the game.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;SnakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GameUpdateData&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Update snake1&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if the snake collides with itself or out-of-bounds&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;collision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;checkCollision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;collision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loosingSnake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if the snake eats a pellet&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pelletsUpdate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eatPellet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Update snake2&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if the snake collides with itself or out-of-bounds&lt;/span&gt;
    &lt;span class="nx"&gt;collision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;checkCollision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;collision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loosingSnake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if the snake eats a pellet&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pelletsUpdate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eatPellet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pelletsUpdate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Finally add new pellet&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pelletsUpdate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addPellet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pelletsUpdate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// No one lost (yet...).&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
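&lt;p&gt;&lt;code&gt;GameUpdateData&lt;/code&gt; is just a small result object holding &lt;code&gt;loosingSnake&lt;/code&gt; and &lt;code&gt;pelletsUpdate&lt;/code&gt;. The &lt;code&gt;eatPellet&lt;/code&gt; method is not listed here; conceptually it compares the snake's head to every pellet and, on a hit, removes the pellet and lets the snake grow. Here is a minimal sketch (the grow method name is illustrative, see the repo for the real code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch of the pellet-eating check; see the repo for the real method.
SnakeEngine.prototype.eatPellet = function(snake) {
    for (var i = 0; i &amp;lt; this.pellets.length; ++i) {
        if (snake.parts[0].location.equals(this.pellets[i].location)) {
            // Remove the pellet and let the snake grow (grow method name is illustrative)
            this.pellets.splice(i, 1);
            snake.addTail();
            return true; // the pellet list changed
        }
    }
    return false; // nothing eaten
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;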



&lt;p&gt;In &lt;code&gt;checkCollision&lt;/code&gt; we first check whether the snake went out of bounds by comparing the snake's head to the board dimensions. Remember that the snake head is a rectangle whose &lt;b&gt;upper-left&lt;/b&gt; corner is denoted by x/y, so to check whether the snake crossed the top/left border we use x/y, but to check whether it crossed the bottom/right border we use the &lt;b&gt;bottom-right&lt;/b&gt; corner of the head (x/y plus the head size).&lt;/p&gt;

&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/-v3QeM2noOU4/Vag8tjWASOI/AAAAAAAAAvA/IQCk3OaUQgY/s1600/SnakeOutOfBounds.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F2.bp.blogspot.com%2F-v3QeM2noOU4%2FVag8tjWASOI%2FAAAAAAAAAvA%2FIQCk3OaUQgY%2Fs400%2FSnakeOutOfBounds.jpg" width="400" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Checking whether the snake has collided with itself is quite simple: just loop through all the snake parts (excluding the head) and check whether any of them is equal to the head (&lt;code&gt;equals&lt;/code&gt; just compares x/y).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;SnakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;checkCollision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Check if the head is out-of-bounds&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rectangle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rectangle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if the snake head collides with its body&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;snake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding pellets
&lt;/h2&gt;

&lt;p&gt;When adding a new pellet to the game we first check that we have not exceeded the maximum number of allowed pellets; then we select a random box on the board and check that the box is vacant.&lt;/p&gt;

&lt;p&gt;Since &lt;code&gt;addPellet&lt;/code&gt; gets called quite frequently (every update cycle), and we want pellets to be added at random timing, we have to do some filtering: at the very beginning of the method we check whether &lt;code&gt;Math.random() &amp;gt; 0.2&lt;/code&gt;, and if so we immediately return without adding anything, so on average we drop about 8 out of 10 calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;SnakeEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addPellet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Check if we should add pellets&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pellets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;MAX_PELLETS&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Keep loop until we found a spot for a pellet (theoretically this can turn into an infinite loop, so a solution could&lt;/span&gt;
    &lt;span class="c1"&gt;// be to stop the random search after X times and look for a spot on the board).&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;keepSearch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;keepSearch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;keepSearch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Take a random spot on the board&lt;/span&gt;
        &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;boxIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;horizontalBoxes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;horizontalBoxes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;board&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toScreen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;boxIndex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// check that this spot is not on snake1&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nx"&gt;keepSearch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;keepSearch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// check that this spot is not on snake2&lt;/span&gt;
            &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;snake2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nx"&gt;keepSearch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;keepSearch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// check that this spot is not on existing pellet&lt;/span&gt;
            &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pellets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pellets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nx"&gt;keepSearch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;keepSearch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Hooray we can add the pellet&lt;/span&gt;
            &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pellets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Pellet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
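&lt;p&gt;As the comment in the code says, the random search can in theory loop forever on a crowded board. One possible fix (a sketch only, not the code from the repo) is to bound the search to a fixed number of attempts and give up until the next update cycle; the &lt;code&gt;isVacant&lt;/code&gt; helper below is hypothetical and stands for the three occupancy loops shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: bound the random pellet search (not the repo's code; isVacant and the
// attempts constant are illustrative).
var MAX_PELLET_SEARCH_ATTEMPTS = 10;

SnakeEngine.prototype.addPellet = function() {
    // Same filtering as above: respect MAX_PELLETS and drop ~80% of the calls
    if (this.pellets.length &amp;gt;= MAX_PELLETS || Math.random() &amp;gt; 0.2) {
        return false;
    }

    for (var attempt = 0; attempt &amp;lt; MAX_PELLET_SEARCH_ATTEMPTS; ++attempt) {
        var boxIndex = Math.floor(Math.random() * this.board.horizontalBoxes * this.board.horizontalBoxes);
        var loc = this.board.toScreen(boxIndex);

        // isVacant would run the same checks as the three loops above
        // (snake1, snake2 and existing pellets).
        if (this.isVacant(loc)) {
            this.pellets.push(new Pellet(loc));
            return true;
        }
    }

    // Gave up for this cycle; we will get another chance on a future update
    return false;
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;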



&lt;h1&gt;
  
  
  THE END
&lt;/h1&gt;

&lt;p&gt;Phew... if you have made it all the way here, well done and thank you!&lt;/p&gt;

&lt;p&gt;I hope this series was of some interest to you; for me it was fun programming this game. Feel free to explore the &lt;a href="https://github.com/ValYouW/SnakeMatch" rel="noopener noreferrer"&gt;code&lt;/a&gt; and even make it better!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>node</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
