<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fatema Abdelhadi</title>
    <description>The latest articles on DEV Community by Fatema Abdelhadi (@fatmasherif98).</description>
    <link>https://dev.to/fatmasherif98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F366793%2Ffb9f32d5-f451-468a-b5f1-8549851d6dc5.jpg</url>
      <title>DEV Community: Fatema Abdelhadi</title>
      <link>https://dev.to/fatmasherif98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fatmasherif98"/>
    <language>en</language>
    <item>
      <title>Image caption generation with visual attention explanation using Tensorflow</title>
      <dc:creator>Fatema Abdelhadi</dc:creator>
      <pubDate>Fri, 03 Jul 2020 13:07:46 +0000</pubDate>
      <link>https://dev.to/fatmasherif98/image-caption-generation-with-visual-attention-explanation-using-tensorflow-27hh</link>
      <guid>https://dev.to/fatmasherif98/image-caption-generation-with-visual-attention-explanation-using-tensorflow-27hh</guid>
      <description>&lt;p&gt;The official Tensorflow website has an implementation of image caption generation based on the paper titled "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". I wanted to understand the code and the concept thoroughly for a pattern recognition course, so I read many many articles explaining the topic. I read the paper several times too, but the mathematical details confused me and hindered my understanding. Some articles were great, but I still felt that the bigger picture was not crystal clear. Proud to say that I finally feel that I understand the topic and before I focus on another topic, I want to consolidate my findings and share them with others that are struggling to understand the code too!&lt;/p&gt;

&lt;p&gt;Firstly, I will not be explaining concepts like CNN, RNN, LSTM and attention; please understand each separately before reading this article. This article is meant for beginners in TensorFlow who want to understand image captioning. I'm still a student and not an expert myself, but after a lot of searching, maybe this could help you!&lt;/p&gt;

&lt;p&gt;The code I'm explaining can be found &lt;a href="https://www.tensorflow.org/tutorials/text/image_captioning" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
with one small modification: I ran the code myself, but instead of using a GRU in the RNN_Decoder class, I replaced it with TensorFlow's LSTM. The explanation here therefore uses an LSTM instead of a GRU, but for the purpose of understanding, this shouldn't make a difference.&lt;br&gt;
The first part of the code just downloads the datasets and prepares two vectors: "train_captions", which holds the actual captions, and "img_name_vector", which holds the paths to the images corresponding to each caption.&lt;/p&gt;

&lt;p&gt;"Recent work has significantly improved the quality of caption generation using a combination of convolutional neural networks (convnets) to obtain vectorial representation of images and&lt;br&gt;
recurrent neural networks to decode those representations into natural language sentences" This was quoted from the paper. The general idea is that we will choose a CNN model examples( VGG, MobileNet, ResNet, Inception) and feed the images to one of these models but without passing the images through the fully connected layers of these models. We want to obtain a new representation of the image where each location has a vector representing it's important properties . "We use a convolution neural network in order to extract a set of feature vectors which we refer to as annotation vectors. The extractor produces L vectors, each of which is a D dimensional representation corresponding to a part of the image. In order to obtain a correspondence between the feature vectors and portions of the 2-D&lt;br&gt;
image, we extract features from a lower convolutional layer ".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe91li73brveqdcp7md5m.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe91li73brveqdcp7md5m.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code uses the Inception model as the CNN.&lt;br&gt;
Before feeding the images through the Inception model, we need to preprocess them (e.g. resize them); that's what the load_image method does.&lt;/p&gt;

&lt;p&gt;Here's how we are going to transform our images:&lt;br&gt;
&lt;code&gt;encode_train = sorted(set(img_name_vector))&lt;/code&gt; removes the duplicate images by putting the paths in a set, and sorts them by ID.&lt;br&gt;
&lt;code&gt;image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)&lt;/code&gt;&lt;br&gt;
creates a TensorFlow dataset.&lt;br&gt;
&lt;code&gt;image_dataset = image_dataset.map(&lt;br&gt;
  load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)&lt;/code&gt;&lt;br&gt;
maps each element of the dataset through the load_image method, which returns the preprocessed image and its path; it also divides the images into batches of size 16.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;for img, path in image_dataset:&lt;br&gt;
     batch_features = image_features_extract_model(img)&lt;br&gt;
     batch_features = tf.reshape(batch_features,&lt;br&gt;
                              (batch_features.shape[0], -1, &lt;br&gt;
     batch_features.shape[3]))&lt;/code&gt;&lt;br&gt;
Here we feed each batch of preprocessed images through our model (Inception in this case) and reshape the output to have 3 dimensions: instead of (batch_size, 8, 8, 2048), the shape is now (batch_size, 64, 2048). Note that the feature sizes will differ for models other than Inception.&lt;/p&gt;
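&lt;p&gt;As a small sanity check of the reshape step, here is a numpy stand-in (not the tutorial's TensorFlow call): the 8x8 spatial grid of Inception features is flattened into 64 locations so that attention can later weight each location independently.&lt;/p&gt;

```python
import numpy as np

# numpy stand-in for the tf.reshape step: flatten the 8x8 spatial grid
# of Inception features into 64 separate locations.
batch_features = np.zeros((16, 8, 8, 2048), dtype=np.float32)
flat = batch_features.reshape(batch_features.shape[0], -1, batch_features.shape[3])
print(flat.shape)  # (16, 64, 2048)
```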

&lt;p&gt;&lt;code&gt;for bf, p in zip(batch_features, path):&lt;br&gt;
      path_of_feature = p.numpy().decode("utf-8")&lt;br&gt;
      np.save(path_of_feature, bf.numpy())&lt;/code&gt;&lt;br&gt;
Now for each image in the batch, we save the features we extracted.&lt;/p&gt;

&lt;p&gt;Now that we're done preprocessing the images, it's time to prepare our vocabulary! To represent our words, we will build a dictionary of the 5000 most frequent words.&lt;br&gt;
&lt;code&gt;tokenizer.fit_on_texts(train_captions)&lt;/code&gt; creates a dictionary where each unique word gets a number, which is also its index in the dictionary. The most frequent words get the lowest values.&lt;br&gt;
An example found on Stack Overflow: for "The cat sat on the mat." it will create a dictionary like word_index["the"] = 1; word_index["cat"] = 2. Every word maps to a unique integer, 0 is reserved for padding, and a lower integer means a more frequent word: &lt;a href="https://stackoverflow.com/questions/51956000/what-does-keras-tokenizer-method-exactly-do" rel="noopener noreferrer"&gt;stackoverflow discussion&lt;/a&gt;&lt;/p&gt;
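&lt;p&gt;To make the frequency-ranking idea concrete, here is a simplified pure-Python sketch of what fit_on_texts does (it skips Keras details like punctuation stripping and the top_k cut-off):&lt;/p&gt;

```python
from collections import Counter

def build_word_index(captions):
    """Simplified sketch of tokenizer.fit_on_texts: rank words by
    frequency and assign 1-based integers, keeping 0 free for padding."""
    counts = Counter(w for c in captions for w in c.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

word_index = build_word_index(["The cat sat on the mat"])
print(word_index["the"])  # 1 -- the most frequent word gets the lowest index
```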

&lt;p&gt;&lt;code&gt;train_seqs = tokenizer.texts_to_sequences(train_captions)&lt;br&gt;
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')&lt;/code&gt;&lt;br&gt;
This will transform our captions and pad them so they all have the same length. Instead of text, a caption now looks like this:&lt;br&gt;
[   3    2  351  687    2  280    5    2   84  339    4    0    0    0&lt;br&gt;
     0    0    0    0    0    0    0    0    0    0    0    0    0    0&lt;br&gt;
     0    0    0    0    0    0    0    0    0    0    0    0    0    0&lt;br&gt;
     0    0    0    0    0    0    0] The zeros are the padding.&lt;/p&gt;
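&lt;p&gt;The post-padding behavior itself is simple; a minimal pure-Python sketch of what pad_sequences(..., padding='post') does:&lt;/p&gt;

```python
def pad_post(seqs, pad_value=0):
    """Sketch of pad_sequences(..., padding='post'): right-pad every
    sequence with zeros to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

print(pad_post([[3, 2, 351], [3, 2]]))  # [[3, 2, 351], [3, 2, 0]]
```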

&lt;p&gt;The data will be split into 80% training and 20% validation.&lt;br&gt;
Now let's understand why we need the LSTM cell.&lt;br&gt;
Quoting this &lt;a href="https://www.oreilly.com/content/caption-this-with-tensorflow/" rel="noopener noreferrer"&gt;article&lt;/a&gt;&lt;br&gt;
"Long short-term memory (LSTM) cells allow the model to better select what information to use in the sequence of caption words, what to remember, and what information to forget. TensorFlow provides a wrapper function to generate an LSTM layer for a given input and output dimension. &lt;br&gt;
To transform words into a fixed-length representation suitable for LSTM input, we use an embedding layer that learns to map words to 256 dimensional features (or word-embeddings). Word-embeddings help us represent our words as vectors, where similar word-vectors are semantically similar."&lt;/p&gt;

&lt;p&gt;What are word embeddings exactly?&lt;br&gt;
   Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.&lt;br&gt;
To read more about word embeddings in Tensorflow &lt;a href="https://www.tensorflow.org/tutorials/text/word_embeddings" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Inception model's convolutional layers extract a 64x2048 dimensional representation of the image features. Because the LSTM cells expect 256 dimensional textual features as input, we need to translate the image representation into the representation used for the target captions. To do this, we utilize another embedding layer that learns to map the image features into the space of 256 dimensional textual features.&lt;/p&gt;

&lt;p&gt;So long story short, the LSTM input must be 256-dimensional features, whether they come from the text or from the image. We will have an encoder class (CNN_Encoder), which is just a single fully connected layer, to prepare our image features for the LSTM cell.&lt;/p&gt;
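&lt;p&gt;A hedged numpy sketch of that encoder idea (not the tutorial's exact code): a single learned linear map projecting each of the 64 image locations from 2048 Inception features down to the 256-dimensional embedding space, followed by ReLU.&lt;/p&gt;

```python
import numpy as np

# Sketch of a single fully connected layer acting per image location.
rng = np.random.default_rng(0)
img_tensor = rng.normal(size=(4, 64, 2048)).astype(np.float32)  # (batch, locations, depth)
W = (rng.normal(size=(2048, 256)) * 0.02).astype(np.float32)    # stand-in Dense weights
encoded = np.maximum(img_tensor @ W, 0.0)                       # Dense + ReLU
print(encoded.shape)  # (4, 64, 256)
```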

&lt;p&gt;&lt;code&gt;BATCH_SIZE = 64&lt;br&gt;
BUFFER_SIZE = 1000&lt;br&gt;
embedding_dim = 256&lt;br&gt;
units = 512&lt;br&gt;
vocab_size = top_k + 1&lt;br&gt;
num_steps = len(img_name_train) // BATCH_SIZE&lt;br&gt;
# Shape of the vector extracted from InceptionV3 is (64, 2048)&lt;br&gt;
# These two variables represent that vector shape&lt;br&gt;
features_shape = 2048&lt;br&gt;
attention_features_shape = 64&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here we define the batch size we will use in our code (64) and the embedding dimension (256). The feature shape depends on the model we used earlier for image feature extraction; for VGG, MobileNet or ResNet the feature shapes will differ.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def map_func(img_name, cap):&lt;br&gt;
     img_tensor = np.load(img_name.decode('utf-8')+'.npy')&lt;br&gt;
     return img_tensor, cap&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dataset = dataset.map(lambda item1, item2: tf.numpy_function(&lt;br&gt;
          map_func, [item1, item2], [tf.float32, tf.int32]),&lt;br&gt;
          num_parallel_calls=tf.data.experimental.AUTOTUNE)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)&lt;br&gt;
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here we create a TensorFlow dataset which carries the paths of the images and their captions. The map call invokes the map_func method, which loads the features we extracted for each image and returns them with their captions. Then we shuffle the data and batch it into batches of 64 image-caption pairs.&lt;br&gt;
So to recap, the dataset now contains the image features we extracted using our Inception model and the corresponding captions, represented as numbers.&lt;/p&gt;

&lt;p&gt;Now I'll jump down to the train_step function, which takes one batch from our dataset as input. "hidden" is the decoder's hidden state that we will use in the attention part, which helps us remember what is important and what is not. Since each caption is independent, when we call&lt;br&gt;
&lt;code&gt;decoder.reset_state(batch_size=target.shape[0])&lt;/code&gt;&lt;br&gt;
we get an array of zeros as the first hidden state.&lt;br&gt;
"dec_input" will be the correct previous word from the original caption. Before the first iteration of the loop, we consider the previous word to be the "start" token, so we create an array repeating the start token as many times as the batch size.&lt;br&gt;
&lt;code&gt;features = encoder(img_tensor)&lt;/code&gt; passes our image features through the encoder, which transforms them to be ready to feed into the LSTM. After passing through the encoder, the last dimension of the image features becomes 256 (the embedding dimension).&lt;/p&gt;

&lt;p&gt;Now we loop over every word in the caption. We feed the decoder 3 things: dec_input, features, and hidden.&lt;br&gt;
dec_input is the previous correct word from the caption (the first time, its value is the start token).&lt;br&gt;
'features' are the image's features.&lt;br&gt;
'hidden' is the decoder's hidden state.&lt;br&gt;
The decoder returns 2 things: predictions and hidden. We use the predictions to get the word the model predicted, and hidden becomes the hidden state for the loop's next iteration.&lt;br&gt;
&lt;code&gt;loss += loss_function(target[:, i], predictions)&lt;/code&gt;&lt;br&gt;
compares the predicted word to the original word in the real caption and accumulates the loss, which we later use to calculate the gradients, apply them through the optimizer, and backpropagate.&lt;br&gt;
&lt;code&gt;dec_input = tf.expand_dims(target[:, i], 1)&lt;/code&gt; sets dec_input to the current correct word from the caption, NOT the word our model just predicted. This technique is called teacher forcing.&lt;/p&gt;
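&lt;p&gt;A toy illustration of teacher forcing (fake_decoder_predict is a hypothetical stand-in, not the tutorial's API): whatever the model predicts at step i, the next input is always the ground-truth word.&lt;/p&gt;

```python
# Teacher forcing in miniature: the prediction is only used for the loss;
# the ground-truth word is what gets fed in next.
target = [1, 7, 4, 9, 2]           # token ids: <start>, w1, w2, w3, <end>

def fake_decoder_predict(prev_word):
    return (prev_word * 3) % 11    # pretend model (stand-in)

dec_input = target[0]              # start token
inputs_seen = []
for i in range(1, len(target)):
    _prediction = fake_decoder_predict(dec_input)  # would go into the loss
    inputs_seen.append(dec_input)
    dec_input = target[i]          # teacher forcing: feed the TRUE word next
print(inputs_seen)  # [1, 7, 4, 9] -- always ground truth, never the prediction
```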

&lt;p&gt;Now let's dive into the decoder to see what happens. The decoder is the class RNN_Decoder. The first thing we do is that we call the attention mechanism and feed it the image features and the hidden state.&lt;/p&gt;

&lt;p&gt;Class BahdanauAttention is the attention model we will use. This is soft attention; I will briefly explain the difference between soft and hard attention at the end of this article.&lt;br&gt;
Since this is a soft attention mechanism, we calculate the attention weights from the image features and the hidden state, and we obtain the context vector by multiplying these attention weights by the image features:&lt;br&gt;
&lt;code&gt;context_vector = attention_weights * features&lt;/code&gt;&lt;br&gt;
The attention weights all add up to 1; each attention weight represents how important its corresponding feature is for generating the current word.&lt;br&gt;
&lt;code&gt;context_vector = tf.reduce_sum(context_vector, axis=1)&lt;/code&gt;&lt;br&gt;
The context_vector is then summed over the locations axis, collapsing the 64 location vectors into one vector of shape (batch_size, 256).&lt;br&gt;
The context vector and the attention weights are returned from the attention model, and we return to the decoder class.&lt;/p&gt;
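&lt;p&gt;To see the whole mechanism in one place, here is a hedged numpy sketch of the Bahdanau-style computation (random matrices stand in for the learned Dense layers W1, W2 and V; shapes follow the tutorial's 256 embedding and 512 units):&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 64, 256))   # encoder output: (batch, 64 locations, 256)
hidden = rng.normal(size=(2, 512))         # decoder hidden state: (batch, units)
W1 = rng.normal(size=(256, 512)) * 0.05    # stand-ins for the learned layers
W2 = rng.normal(size=(512, 512)) * 0.05
V = rng.normal(size=(512, 1)) * 0.05

# score each location against the hidden state, normalize into weights
score = np.tanh(features @ W1 + (hidden @ W2)[:, None, :]) @ V   # (batch, 64, 1)
attention_weights = softmax(score, axis=1)                       # sums to 1 over locations
context_vector = (attention_weights * features).sum(axis=1)      # (batch, 256)
print(context_vector.shape)  # (2, 256)
```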

&lt;p&gt;Back to the decoder class:&lt;br&gt;
&lt;code&gt;x = self.embedding(x)&lt;/code&gt; takes our previous word and passes it through an embedding layer. This transforms the word from a number into the word embedding vector we mentioned earlier.&lt;br&gt;
The context vector is concatenated to the previous word, and this is fed to the LSTM (or GRU) cell. The LSTM (or GRU) cell returns 2 things, output and state. We manipulate "output" a little until it is the size of our entire vocabulary, so that each word in the vocabulary gets a probability; the word with the highest probability is the predicted word.&lt;br&gt;
We return the predictions, the new hidden state and the attention weights.&lt;/p&gt;

&lt;p&gt;The attention weights are not used in the train method, but are used for plotting purposes in the evaluate method.&lt;br&gt;
The evaluate method works pretty much like the train method, but with one big difference:&lt;br&gt;
since here we don't actually have the correct caption, instead of dec_input being the previous correct word from the caption, we set dec_input to the word that the model previously predicted.&lt;/p&gt;
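&lt;p&gt;A toy contrast with the training loop above (again using a hypothetical stand-in model): with no ground truth available, each predicted word is fed back in as the next input, i.e. greedy decoding.&lt;/p&gt;

```python
# Greedy decoding in miniature: feed back our own prediction each step.
def fake_decoder_predict(prev_word):
    return (prev_word * 3) % 11    # pretend model (stand-in)

dec_input = 1                      # start token
predicted = []
for _ in range(4):
    word = fake_decoder_predict(dec_input)
    predicted.append(word)
    dec_input = word               # feed back the prediction, not the truth
print(predicted)  # [3, 9, 5, 4]
```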

&lt;p&gt;As I mentioned earlier, there are two attention mechanisms: hard attention and soft attention. Here we used soft attention. Soft attention is much easier because it is deterministic: if I perform soft attention twice at the same time step, I get the same output each time. This is not true for hard attention, which is a stochastic process: instead of looking at the entire hidden state, it uses the attention weights as probabilities to sample from it. In soft attention, the attention weights add up to 1, which can be interpreted as the probability that Xi is the area we should pay attention to. So instead of taking a weighted average, hard attention uses the attention weights as sampling probabilities to pick one Xi as the input to the LSTM. Hard attention is not differentiable, so it's not easy to perform backpropagation.&lt;br&gt;
To understand more about soft and hard attention:&lt;br&gt;
[1]&lt;a href="https://medium.com/heuritech/attention-mechanism-5aba9a2d4727" rel="noopener noreferrer"&gt;https://medium.com/heuritech/attention-mechanism-5aba9a2d4727&lt;/a&gt;&lt;br&gt;
[2]&lt;a href="https://jhui.github.io/2017/03/15/Soft-and-hard-attention/" rel="noopener noreferrer"&gt;https://jhui.github.io/2017/03/15/Soft-and-hard-attention/&lt;/a&gt;&lt;/p&gt;
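&lt;p&gt;The soft/hard distinction fits in a few lines of numpy (values made up for illustration): both start from the same attention weights, but soft attention takes a deterministic weighted average while hard attention samples one location.&lt;/p&gt;

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.2])                        # one weight per location, sums to 1
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 locations, 2 features each

soft = (weights[:, None] * features).sum(axis=0)  # deterministic: same result every call
rng = np.random.default_rng(0)
idx = rng.choice(len(weights), p=weights)         # stochastic: varies call to call
hard = features[idx]                              # one whole location, chosen by sampling
print(soft.shape, hard.shape)  # (2,) (2,)
```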

&lt;p&gt;from: &lt;a href="https://www.oreilly.com/content/caption-this-with-tensorflow/" rel="noopener noreferrer"&gt;https://www.oreilly.com/content/caption-this-with-tensorflow/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm63d5qbgri38bzwz1ezj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm63d5qbgri38bzwz1ezj.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Figure 3. Source: “Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge.”&lt;/p&gt;

&lt;p&gt;"In this diagram, {s0, s1, …, sN} represent the words of the caption we are trying to predict and {wes0, wes1, …, wesN-1} are the word embedding vectors for each word. The outputs {p1, p2, …, pN} of the LSTM are probability distributions generated by the model for the next word in the sentence. The model is trained to minimize the negative sum of the log probabilities of each word."&lt;/p&gt;

&lt;p&gt;I hope this cleared some of the ambiguity of image captioning!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tensorflow</category>
      <category>imagecaptioning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Leetcode May Challenge "Count Square Submatrices with All Ones"</title>
      <dc:creator>Fatema Abdelhadi</dc:creator>
      <pubDate>Fri, 22 May 2020 23:32:16 +0000</pubDate>
      <link>https://dev.to/fatmasherif98/leetcode-may-challenge-count-square-submatrices-with-all-ones-1cap</link>
      <guid>https://dev.to/fatmasherif98/leetcode-may-challenge-count-square-submatrices-with-all-ones-1cap</guid>
      <description>&lt;p&gt;This is a leetcode question found in the May Challenge week 3.It's a medium level question, and I could only think of the brute force recursive solution(which obviously led to exceed the time limit). So I checked the discussion forum and found a beautiful dynamic programming solution, which I was determined to understand!Here I'm going to give you some intuition about the answer through a series of ugly but helpful drawings. I hope it helps!&lt;br&gt;&lt;br&gt;
Here's the link to the discussion, this solution is by Lee215: &lt;a href="https://leetcode.com/problems/count-square-submatrices-with-all-ones/discuss/441306/JavaC%2B%2BPython-DP-solution"&gt;https://leetcode.com/problems/count-square-submatrices-with-all-ones/discuss/441306/JavaC%2B%2BPython-DP-solution&lt;/a&gt;&lt;br&gt;
So if you didn't check the link, here is the solution and the provided explanation for it:&lt;br&gt;
"dp[i][j] means the size of biggest square with A[i][j] as bottom-right corner.&lt;br&gt;
dp[i][j] also means the number of squares with A[i][j] as bottom-right corner.&lt;/p&gt;

&lt;p&gt;If A[i][j] == 0, no possible square.&lt;br&gt;
If A[i][j] == 1,&lt;br&gt;
we compare the size of square dp[i-1][j-1], dp[i-1][j] and dp[i][j-1].&lt;br&gt;
min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1]) + 1 is the maximum size of square that we can find."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;   &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;countSquares&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[][]&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;].&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;],&lt;/span&gt; &lt;span class="nc"&gt;Math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;],&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="no"&gt;A&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second sentence in the forum explanation says "dp[i][j] also means the number of squares with A[i][j] as bottom-right corner." So we loop over every element with indices i,j and calculate the number of squares that have this particular element as their bottom-right corner. The element could participate in many squares of different sizes, starting from a square of size 1 (just this element) up to the maximum possible square size.&lt;/p&gt;

&lt;p&gt;The first sentence says "dp[i][j] means the size of biggest square with A[i][j] as bottom-right corner." Note that these two quantities are equal: if the biggest square with A[i][j] as its bottom-right corner has side k, then there are exactly k squares (of sides 1 through k) with that same corner, one nested inside the other.&lt;/p&gt;

&lt;p&gt;Now the question boils down to: how do I calculate the maximum size of a square that has a particular element as its bottom-right corner?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sm54s8yV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ko0a3l80erqf4wmalghs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sm54s8yV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ko0a3l80erqf4wmalghs.jpg" alt="Alt Text" width="800" height="627"&gt;&lt;/a&gt;&lt;br&gt;
Consider this matrix, where we are evaluating the element at index [1][1] (the element in a pink box). It is clear that the maximum square with this element as its bottom-right corner has size 1, because element [0][0] is equal to zero. If element [0][0] were equal to one, we could have combined the four squares of size 1 to create a new 2x2 square with element [1][1] as its bottom-right corner. Remember that dynamic programming is all about breaking the larger problem into smaller problems.&lt;/p&gt;

&lt;p&gt;Notice how this algorithm updates the matrix in place:&lt;br&gt;
Elements in the first row or first column (an index of value 0) can only be the bottom-right corner of squares of size 1, so we don't enter the if condition for them; we just add their value, 0 or 1, to "res".&lt;br&gt;
For any other element with a value of 1, there is a potential for it to be the bottom-right corner of a bigger square, so we check: we take the minimum among the left, upper-left and upper neighbors of the current element. More formally, we take the minimum over the sizes of the largest squares for which A[i - 1][j], A[i][j - 1] and A[i - 1][j - 1] are the bottom-right corners. That minimum tells us how far all three smaller squares extend, so they can be combined with the current element to form a new square one row and one column bigger.&lt;/p&gt;
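&lt;p&gt;For readers who prefer Python, the Java solution above transcribes directly (verified against the LeetCode sample input, whose answer is 15):&lt;/p&gt;

```python
def count_squares(A):
    """Transcription of the Java DP above: A[i][j] is updated in place to
    the side of the largest all-ones square with (i, j) as its bottom-right
    corner; summing those sides counts every square exactly once."""
    res = 0
    for i in range(len(A)):
        for j in range(len(A[0])):
            if A[i][j] and i and j:
                A[i][j] = min(A[i-1][j-1], A[i-1][j], A[i][j-1]) + 1
            res += A[i][j]
    return res

print(count_squares([[0, 1, 1, 1], [1, 1, 1, 1], [0, 1, 1, 1]]))  # 15
```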

&lt;p&gt;Let's look at the combination process, here we have a matrix of all ones:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T8YKhDO_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/kp8pxwpodxmo95i4hvvu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T8YKhDO_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/kp8pxwpodxmo95i4hvvu.jpg" alt="Alt Text" width="800" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at element A[2][2]: it's obvious that the maximum square for which this element is the bottom-right corner has size 3x3. How can we calculate this when we reach this element? We look at the previously calculated values in the left, upper-left and upper elements. Here I have the 2x2 square shaded, which has element A[1][2] as its bottom-right corner. By the time we process element A[2][2], element A[1][2] holds the size of the maximum square that has A[1][2] as its bottom-right corner (which is 2).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qvx15Lua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/r2j6s8lujah1amw9c8p5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qvx15Lua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/r2j6s8lujah1amw9c8p5.jpg" alt="Alt Text" width="800" height="647"&gt;&lt;/a&gt;&lt;br&gt;
We are still processing the same element, but now we are looking at a different neighboring element, element[2][1], which is also the bottom right corner of a square of size 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lKuPAhKS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hx3aj60kakpynkn897hy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lKuPAhKS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hx3aj60kakpynkn897hy.jpg" alt="Alt Text" width="800" height="610"&gt;&lt;/a&gt;&lt;br&gt;
Here we are looking at our last neighbor, at position [1][1], which is also the bottom-right corner of a 2x2 square. All three neighbors are bottom-right corners of 2x2 squares, so we can safely "combine" these 2x2 squares to form a larger square of size 3x3 with element [2][2] as its bottom-right corner.&lt;/p&gt;
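&lt;p&gt;The recurrence above translates directly into code. Here is a minimal sketch in Java (the class and method names are my own, and I assume an int[][] of 0s and 1s for simplicity; LeetCode's "Maximal Square" actually hands you a char[][]):&lt;/p&gt;

```java
// Sketch of the DP from the walkthrough above: dp[i+1][j+1] holds the side
// of the largest all-ones square whose bottom-right corner is matrix[i][j].
class MaximalSquare {
    static int maximalSquare(int[][] matrix) {
        int rows = matrix.length, cols = matrix[0].length;
        // One extra row and column of zeros so the border cells
        // need no special-casing.
        int[][] dp = new int[rows + 1][cols + 1];
        int best = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (matrix[i][j] == 1) {
                    // "Combine" the left, upper-left, and upper neighbors:
                    // the new square is limited by the smallest of the three.
                    dp[i + 1][j + 1] = 1 + Math.min(dp[i][j],
                            Math.min(dp[i][j + 1], dp[i + 1][j]));
                    best = Math.max(best, dp[i + 1][j + 1]);
                }
            }
        }
        return best * best; // LeetCode asks for the area, so square the side
    }

    public static void main(String[] args) {
        int[][] grid = {
            {1, 1, 1},
            {1, 1, 1},
            {1, 1, 1}
        };
        System.out.println(maximalSquare(grid)); // prints 9 (a 3x3 square)
    }
}
```

&lt;p&gt;Notice that the answer is simply the largest side seen anywhere in the table, squared.&lt;/p&gt;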

&lt;p&gt;I hope this made the intuition behind the dynamic programming solution clearer. Keep practicing problems, and good luck!&lt;/p&gt;

</description>
      <category>leetcode</category>
      <category>dynamicprogramming</category>
      <category>java</category>
      <category>beginners</category>
    </item>
    <item>
      <title>5 things I wish I knew starting university as a Computer science/engineering student</title>
      <dc:creator>Fatema Abdelhadi</dc:creator>
      <pubDate>Sun, 10 May 2020 14:30:30 +0000</pubDate>
      <link>https://dev.to/fatmasherif98/5-things-i-wish-i-knew-starting-university-as-a-computer-science-engineering-student-ojl</link>
      <guid>https://dev.to/fatmasherif98/5-things-i-wish-i-knew-starting-university-as-a-computer-science-engineering-student-ojl</guid>
      <description>&lt;p&gt;University can be stressful for cs students, especially if you didn't have any coding experience previously (such as myself). You might be feeling that you're behind others who started earlier than you, thinking it's too late for you too to be an amazing engineer. "build projects so you can land internships", "problem solve all the time so you can pass interviews", "know all your data structures and algorithms" are pieces of "advice" thrown around as if they are easy to master quickly. Soon you're over-stressed and you don't know what to focus on. If you are still in your early university days and are feeling lost and overwhelmed, know that you are not alone. I'm going to share with you what I found most valuable to help me get passed those feelings and even excel! So let's go!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Choose your friends wisely&lt;/strong&gt;&lt;br&gt;
  This may sound off topic, but it might be my most valuable piece of advice. Find motivated and strong-willed people to be in your life, people who are passionate about learning and building things. They'll inspire you to work and become the best version of yourself. They'll be your project teammates. They'll be your study buddies. They'll keep you accountable. Unless you are always 100% motivated, these people will pick you up on your low days and push you to work towards your dreams and goals. Remember, you are the average of the 5 people you spend most of your time with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The difference between Competitive Programming and Interview-preparing&lt;/strong&gt;&lt;br&gt;
    When I first heard about problem solving and how sacredly important it is, I started exploring Codeforces, a platform for competitive programming. I would pick a problem and stare at it the whole day. When I finally decided to write code and submit it to the online judge, I would get a "test case failed at test number #" message. Sounds frustrating, doesn't it? Well, it was, even though that's totally normal at the beginning. I shouldn't have just stared at the problems; I needed to see the solutions, to learn from them, to see where my code fell short, and to pick up clean code and coding conventions. Codeforces is for problem solving and competitive programming; its problems are told in a story-like manner, which can be annoying because the technical requirements are hidden among all the details. Problem solving is an important part of being an engineer: it shapes your thinking and helps you acquire the skill of developing solutions to vague problems with constraints that might not be clear. Preparing for interviews, on the other hand, is a different story. Interviews focus more on data structures and algorithms, and most questions are straightforward. Platforms for interview preparation include LeetCode and HackerRank. LeetCode is amazing: there are lists of problems at different levels of difficulty. Start with the easy problems, then try the medium ones when you get a little better. Some problems have solutions in LeetCode's articles, and you can always find someone's solution in the discussion section. When you're stuck on a problem, staring at it the whole day isn't going to help. Instead, think about the problem for about 20 minutes; if you still don't know the answer, study the solution. Yes, study the solution: understand the algorithm, recognize clean code, and see how the data structures are used. You will improve by how much you learn from the problems, not by how many problems you solve. Quality over quantity. 
Both of these problem-solving styles will benefit you greatly as an engineer, but you need to know when to use which: Codeforces is great for competitive programming, and LeetCode is great for interviews. In general, I can't emphasize enough how important it is to know your algorithms and data structures well. If you want that dream internship, you'd better start problem solving as soon as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. You don't have to be a genius to excel&lt;/strong&gt;&lt;br&gt;
    I remember when I first started learning about recursion, or more specifically, the "Tower of Hanoi" problem. I just couldn't wrap my head around it at first. This worried me a little; I thought maybe I'm just not fit for the job. Some people can understand these concepts almost instantly, but as I recently learned, this isn't always an advantage. I encountered an interesting idea from Barbara Oakley in her online course "Mindshift: Break Through Obstacles to Learning and Discover Your Hidden Potential". Barbara gives the example of a teacher asking a question in class: before you even understand the question, people are already raising their hands to answer. "Some people just plain seem to have race car brains. They get to the finish line, they answer really, really fast. Other people have what you might call hiker brains. They get to the finish line but because they're walking, they get there much, much more slowly. With the race car driver, they do get to the finish line a lot faster, but everything goes by in a rush. They're also on a set, smooth roadway. They know exactly where they're going. A hiker on the other hand, moves slowly. But while they're hiking, they can reach out. They can touch the leaves on the trees, smell the air, hear the birds. And they can easily veer off the expected path into places where people don't normally go." Barbara then points out the many advantages these "hiker brains" have: because they think through all the different aspects more slowly, their understanding can be deeper than others'. She talked about the Nobel prize winner Santiago Ramon y Cajal, who was not a genius himself but worked with geniuses, and found that they often shared similar problems. For example, these geniuses with their race car brains were used to jumping ahead to speedy conclusions, and when they were incorrect, they weren't used to changing their minds. 
So they would keep charging ahead with the incorrect conclusions they had jumped to; their super-fast brains could easily devise justifications. Ramon y Cajal himself, though, had a persistent hiker-type brain. He'd come up with a hypothesis and then persistently check it out in a way that would reveal whether he was wrong, instead of just trying to prove that he was right. If he was wrong, he changed his mind and flexibly tried again. It was his persistence, and his flexibility in the face of what the data was truly telling him, that made him a superstar researcher. When I first heard this comparison, I thought it was mind-blowing. I sometimes find myself stuck on specific concepts while others get them instantly, and I would start worrying: am I smart enough to be a great software engineer? Do I really understand this algorithm thoroughly? Now I know that everyone is unique and we all have our strengths and weaknesses. It's okay if it takes you time to grasp some concepts; we all struggle sometimes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Volunteering is your new best friend&lt;/strong&gt;&lt;br&gt;
 Many big companies like to see volunteering experience on your resume, but that's not the main reason I'm telling you this. I'm advising this from personal experience, as I am a volunteer at the IEEE Alexandria Student Branch, in the software development committee. I have been there for about a year, and here are some of my experiences: my team and I built an Android app for our branch (I knew very little Android at the beginning, but with sessions from more experienced students, online courses, and helpful teammates, I managed to learn a lot and gained a new project to put on my resume!). I got to interact with many older students at my university who shared valuable lessons and insights with me. I attended many educational events and seminars. I got the opportunity to help younger students, and I made many friends. Beyond the benefits, it's actually fun! If your university has a community or club that shares your interests, join it and start learning, networking, and socializing. It was one of the best decisions I made that year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Study your university courses well&lt;/strong&gt;&lt;br&gt;
   I know maintaining a high GPA isn't for everyone, but I can tell you it has benefits. Many people will tell you that a high GPA isn't necessary for software engineering, and that's partially true, but it doesn't mean a decent GPA won't boost your chances. After I attended a Google event at my university and submitted my resume, a Google recruiter reached out to me. At the time I didn't have any unusually impressive projects on my resume, so I assume what helped me stand out was my GPA. I know how hard and stressful it is to keep your grades high (I personally struggle with it), but if you can do it, I think it's worth it. Just don't forget that the real purpose of attending university is to learn, and to learn properly. If you are lucky enough to be enrolled in a computer engineering/science degree, most of your courses will be extremely important for your career. Learning is a beautiful process, and to be a successful engineer you need to be a lifelong learner. In the real world, when you start working, it's almost impossible to find a free day where you can just sit around, study, and solve problems, whereas in university you may have many days just to read textbooks and process information. It may not feel like a blessing now, but later on you will miss these days, so appreciate them and take advantage of them!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I know I said 5 tips, but here are my last thoughts for this post.&lt;/strong&gt;&lt;br&gt;
 Find what excites you! I'm still working on this myself, but what I've realized is that you will never know what excites you until you try many different things and learn what doesn't. Let's say you try web development for a month and then figure out you don't like it that much. Isn't that better than choosing a web development career randomly and being stuck with it for much longer just because you chose blindly? You have time now to explore many different fields, and no, that's not a waste of time; it will save you much more time in the long run. &lt;br&gt;
 Don't wait until you feel ready to apply for internships. Apply, and if you're lucky enough to get an interview, do it! Interviewing is a skill, and the more you are interviewed, the faster you'll build that skill. Don't get discouraged if you don't get an internship at one of those giant tech companies (it's especially hard for international students), but at the same time, be optimistic! I know many amazing people who got offers from big tech companies like Facebook and Google. &lt;/p&gt;

&lt;p&gt;Take care of your health. Staring at screens all day and sitting at a desk is a recipe for poor health, especially with the programmer stereotype of staying up all night coding. That is not okay; you should not be awake for half the night. Exercise and healthy eating habits will give you more energy, sharpen your brain, and improve the quality of your life. Don't neglect your health with excuses like "one more hour of coding is better than going to bed". All your efforts to secure a decent job in the future will be useless if your health, which is your most important asset, is in poor condition. &lt;/p&gt;

&lt;p&gt;That's it from me for today! I hope you found this useful. Please tell me your helpful tips and tricks in the comments section below!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
