Image caption generation with visual attention, explained using TensorFlow

By Fatema Abdelhadi

The official TensorFlow website has an implementation of image caption generation based on the paper titled "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". I wanted to understand the code and the concept thoroughly for a pattern recognition course, so I read many, many articles explaining the topic. I read the paper several times too, but the mathematical details confused me and hindered my understanding. Some articles were great, but I still felt that the bigger picture was not crystal clear. I'm proud to say that I finally feel I understand the topic, and before I move on to something else, I want to consolidate my findings and share them with others who are struggling to understand the code too!

Firstly, I will not be explaining concepts like CNN, RNN, LSTM and attention. Please understand each separately before reading this article. This article is meant for beginners in TensorFlow who want to understand image captioning. I'm still a student and not an expert myself, but after A LOT of searching, maybe this can help you!

The code I'm explaining can be found here, with one small modification: I took this code and ran it myself, but instead of using a GRU in the RNN_Decoder class, I replaced it with TensorFlow's LSTM. The explanation here uses an LSTM instead of a GRU, but for the purpose of understanding, this shouldn't make a difference for you!
The first part of the code just downloads the datasets and prepares two vectors: "train_captions", which holds the actual captions, and "img_name_vector", which holds the paths to the images corresponding to each caption.

"Recent work has significantly improved the quality of caption generation using a combination of convolutional neural networks (convnets) to obtain vectorial representation of images and
recurrent neural networks to decode those representations into natural language sentences" This was quoted from the paper. The general idea is that we will choose a CNN model examples( VGG, MobileNet, ResNet, Inception) and feed the images to one of these models but without passing the images through the fully connected layers of these models. We want to obtain a new representation of the image where each location has a vector representing it's important properties . "We use a convolution neural network in order to extract a set of feature vectors which we refer to as annotation vectors. The extractor produces L vectors, each of which is a D dimensional representation corresponding to a part of the image. In order to obtain a correspondence between the feature vectors and portions of the 2-D
image, we extract features from a lower convolutional layer ".


This code uses the Inception model (InceptionV3) as the CNN.
Before feeding the images through the Inception model, we need to preprocess them (for example, resize them to the input size Inception expects), which is what the load_image method does.
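
For reference, the load_image helper in the tutorial looks roughly like this: it reads and decodes the image file, resizes it to the 299x299 input size InceptionV3 expects, and applies Inception's own preprocessing.

# assumes `import tensorflow as tf`
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path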

Here's how we are going to transform our images:

encode_train = sorted(set(img_name_vector))

removes the duplicate image paths by putting them in a set, and sorts them.

image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)

creates a TensorFlow dataset from those paths.

image_dataset = image_dataset.map(
    load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)

maps each element of the dataset through the load_image method, which returns the preprocessed image and its path, and divides the images into batches of size 16.

for img, path in image_dataset:
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features,
                                (batch_features.shape[0], -1,
                                 batch_features.shape[3]))

Here we feed each batch of preprocessed images through our model (Inception in this case) and reshape the output to have 3 dimensions: instead of the shape being (batch_size, 8, 8, 2048), it's now (batch_size, 64, 2048). Note that the image feature sizes will vary with models other than Inception.
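
For reference, image_features_extract_model in the tutorial is built by loading InceptionV3 without its classification head and taking the output of its last convolutional block, roughly like this:

# assumes `import tensorflow as tf`
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)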

    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())

This inner loop runs inside the batch loop above: for each image in the batch, we save the features we extracted to disk (np.save appends a .npy extension to the image path).

Now that we're done preprocessing the images, it's time to prepare our vocabulary! To represent our words, we will keep a dictionary of the 5,000 most frequent words.
tokenizer.fit_on_texts(train_captions) creates a dictionary where each unique word gets a number, which is also its index in the dictionary. The most frequent words get the lowest values.
An example found on Stack Overflow: for "The cat sat on the mat." it will create a dictionary like word_index["the"] = 1, word_index["cat"] = 2. Every word gets a unique integer value (word -> index in the dictionary), 0 is reserved for padding, and a lower integer means a more frequent word (Stack Overflow discussion).
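
For context, the tokenizer in the tutorial is set up roughly like this (the tutorial also passes a filters argument to strip punctuation, omitted here):

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>")
tokenizer.fit_on_texts(train_captions)
# reserve index 0 for the padding token
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'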

train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

This transforms our captions into sequences of integers and pads them to the same length. Instead of text, a caption now looks like this:
[ 3 2 351 687 2 280 5 2 84 339 4 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0] The zeros are the padding.

The data will be split into 80% training and 20% validation.
Now let's understand why we need the LSTM cell.
Quoting this article
"Long short-term memory (LSTM) cells allow the model to better select what information to use in the sequence of caption words, what to remember, and what information to forget. TensorFlow provides a wrapper function to generate an LSTM layer for a given input and output dimension.
To transform words into a fixed-length representation suitable for LSTM input, we use an embedding layer that learns to map words to 256 dimensional features (or word-embeddings). Word-embeddings help us represent our words as vectors, where similar word-vectors are semantically similar."

What are word embeddings exactly?
Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
To read more about word embeddings in TensorFlow, see this link.
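
As a quick standalone illustration (not taken from the tutorial), an embedding layer simply maps integer word indices to trainable dense vectors:

# hypothetical example, assuming `import tensorflow as tf`
embedding = tf.keras.layers.Embedding(input_dim=5001, output_dim=256)
word_ids = tf.constant([[3, 2, 351, 687]])   # one caption of 4 word indices
word_vectors = embedding(word_ids)           # shape: (1, 4, 256)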

The Inception model's convolutional layers extract a 64 x 2048 dimensional representation of the image features. Because the LSTM cells expect 256-dimensional features as input, we need to translate the image representation into the same space used for the caption words. To do this, we use another layer that learns to map the image features into the space of 256-dimensional textual features.

So, long story short, the LSTM input must be 256-dimensional, whether it comes from the text or from the image features. We will have an encoder class (CNN_Encoder), which is just a single fully connected layer, that prepares our image features for the LSTM cell.
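
The encoder class in the tutorial really is that simple; roughly:

class CNN_Encoder(tf.keras.Model):
    # a single fully connected layer: (batch, 64, 2048) -> (batch, 64, embedding_dim)
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x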

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = top_k + 1
num_steps = len(img_name_train) // BATCH_SIZE
# Shape of the vector extracted from InceptionV3 is (64, 2048)
# These two variables represent that vector shape
features_shape = 2048
attention_features_shape = 64

Here we define the batch size used in training (64), the embedding dimension (256), and the number of units used by the decoder and the attention layer (512). The feature shape depends on the model we used earlier for image feature extraction; for VGG, MobileNet, or ResNet the feature shapes will differ.

def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8')+'.npy')
    return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

dataset = dataset.map(lambda item1, item2: tf.numpy_function(
    map_func, [item1, item2], [tf.float32, tf.int32]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

Here we create a TensorFlow dataset that carries the paths of the images and their captions. The map call passes each pair to the map_func method, which loads the features we extracted for each image and returns those features with their caption. Then we shuffle the data and group it into batches of 64 image-caption pairs.
So to recap, the dataset now contains the image features we extracted using our Inception model, each paired with its corresponding caption represented as a sequence of numbers.

Now I'll jump down to the function train_step, which takes one batch from our dataset as input. "hidden" is the decoder's hidden state that we will use in the attention part; it helps us remember what is important and what is not. Since each caption is independent, when we call
decoder.reset_state(batch_size=target.shape[0])
we get an array of zeros as the first hidden state.
"dec_input" will be the correct previous word from the original caption. Before the first iteration of the loop, we consider the previous word to be the "<start>" token, so we create an array containing the start token repeated as many times as the batch size.
features = encoder(img_tensor) passes our image features through the encoder, which transforms them so they are ready to feed into the LSTM. After passing through the encoder, the last dimension of the image features is 256 (the embedding dimension).

Now we will loop over every word in the caption. We feed the decoder 3 things: dec_input, features, and hidden.
dec_input is the previous correct word from the caption (the first time, its value is the start token).
'features' are the image's features.
'hidden' is the decoder's hidden state.
The decoder returns the predictions, a new hidden state, and the attention weights (the attention weights are ignored during training). We use the predictions to get the word the model predicted, and the new hidden state is used in the loop's next iteration.
loss += loss_function(target[:, i], predictions)
compares the predicted word to the actual word in the real caption and accumulates the loss, which we later use to calculate the gradients, apply them with the optimizer, and backpropagate.
dec_input = tf.expand_dims(target[:, i], 1) sets dec_input to the current correct word from the caption, NOT the word our model just predicted. This technique is called teacher forcing.
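
Putting these pieces together, the tutorial's train_step has roughly this structure (the GRU-vs-LSTM detail is hidden inside the decoder):

@tf.function
def train_step(img_tensor, target):
    loss = 0
    # start with a hidden state of zeros, since captions are independent of each other
    hidden = decoder.reset_state(batch_size=target.shape[0])
    # the first "previous word" is the <start> token for every caption in the batch
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            # predict word i from the previous word, the image features and the hidden state
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # teacher forcing: feed the true word, not the predicted one
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = loss / int(target.shape[1])
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss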

Now let's dive into the decoder to see what happens. The decoder is the class RNN_Decoder. The first thing it does is call the attention mechanism and feed it the image features and the hidden state.

The class BahdanauAttention is the attention model we will use. This is soft attention; I will briefly explain the difference between soft attention and hard attention at the end of this article.
Since this is a soft attention mechanism, we calculate the attention weights from the image features and the hidden state, and we calculate the context vector by multiplying these attention weights by the image features.
context_vector = attention_weights * features
The attention weights all add up to 1, and each attention weight represents how important its corresponding feature is for generating the current word.
context_vector = tf.reduce_sum(context_vector, axis=1)
The context vector is then summed over the 64 image locations (axis 1), giving a weighted sum of the image features with shape (batch_size, 256).
The context vector and the attention weights are returned from the attention model, and we return to the decoder class.
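
For reference, the whole attention class in the tutorial looks roughly like this:

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch_size, 64, embedding_dim), hidden: (batch_size, units)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # score each of the 64 locations against the current hidden state
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        # softmax over the 64 locations so the weights sum to 1
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights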

Back to the decoder class:
x = self.embedding(x) takes our previous word and passes it through an embedding layer. This turns the word from a number into the word embedding vector we mentioned earlier.
The context vector is concatenated with the embedded previous word, and this is fed to the LSTM (or GRU) cell. The recurrent cell returns an output and a state (an LSTM also returns a cell state). We pass "output" through fully connected layers until its last dimension is the size of our entire vocabulary, so that each word in the vocabulary gets a score; the word with the highest score is the predicted word.
We return the predictions, the new hidden state and the attention weights.
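
Here is roughly the decoder's call method from the tutorial (shown with the original GRU; with my LSTM swap the recurrent layer also returns a cell state):

def call(self, x, features, hidden):
    # attend over the image features, conditioned on the current hidden state
    context_vector, attention_weights = self.attention(features, hidden)
    # embed the previous word: (batch_size, 1) -> (batch_size, 1, embedding_dim)
    x = self.embedding(x)
    # concatenate the context vector with the embedded word and run one RNN step
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
    output, state = self.gru(x)
    # project the output up to vocab_size so every word gets a score
    x = self.fc1(output)
    x = tf.reshape(x, (-1, x.shape[2]))
    x = self.fc2(x)
    return x, state, attention_weights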

The attention weights are not used in the train method, but they are used for plotting purposes in the evaluate method.
The evaluate method works pretty much like the train method but with one big difference:
instead of dec_input being the previous correct word from the caption (here we don't actually have the correct caption), we set dec_input to be the word that the model previously predicted.
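
Concretely, inside the evaluate loop the tutorial picks the next input from the model's own output, roughly like this:

# sample the next word id from the model's predicted distribution
predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
result.append(tokenizer.index_word[predicted_id])
# the predicted word becomes the next decoder input (no teacher forcing here)
dec_input = tf.expand_dims([predicted_id], 0)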

As I mentioned earlier, there are two attention mechanisms: hard attention and soft attention. Here we used soft attention. Soft attention is much easier because it's deterministic, meaning that if I perform soft attention twice at a time step i, I will get the same output each time. This is not true for hard attention. Hard attention is a stochastic process: instead of looking at all the feature locations, it uses the attention weights as probabilities to sample one of them. In soft attention, the attention weights add up to 1, which can be interpreted as the probability that Xi is the area we should pay attention to. So instead of a weighted average, hard attention uses the attention weights as sampling probabilities to pick one Xi as the input to the LSTM. Hard attention is not differentiable, so it's not easy to perform backpropagation.
To understand more about soft and hard attention:
[1]https://medium.com/heuritech/attention-mechanism-5aba9a2d4727
[2]https://jhui.github.io/2017/03/15/Soft-and-hard-attention/

The following figure and quote are from: https://www.oreilly.com/content/caption-this-with-tensorflow/

Figure 3. Source: “Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge.”

"In this diagram, {s0, s1, …, sN} represent the words of the caption we are trying to predict and {wes0, wes1, …, wesN-1} are the word embedding vectors for each word. The outputs {p1, p2, …, pN} of the LSTM are probability distributions generated by the model for the next word in the sentence. The model is trained to minimize the negative sum of the log probabilities of each word."

I hope this cleared some of the ambiguity of image captioning!
