<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shambhavi Mishra</title>
    <description>The latest articles on DEV Community by Shambhavi Mishra (@shambhavicodes).</description>
    <link>https://dev.to/shambhavicodes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F458583%2F54996e4e-0f80-408b-a01c-da70e1fa5f19.jpeg</url>
      <title>DEV Community: Shambhavi Mishra</title>
      <link>https://dev.to/shambhavicodes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shambhavicodes"/>
    <language>en</language>
    <item>
      <title>All About Autoencoders!</title>
      <dc:creator>Shambhavi Mishra</dc:creator>
      <pubDate>Sat, 24 Oct 2020 09:56:11 +0000</pubDate>
      <link>https://dev.to/shambhavicodes/all-about-autoencoders-3a36</link>
      <guid>https://dev.to/shambhavicodes/all-about-autoencoders-3a36</guid>
      <description>

</description>
    </item>
    <item>
      <title>Introduction to Generative Modeling</title>
      <dc:creator>Shambhavi Mishra</dc:creator>
      <pubDate>Sat, 24 Oct 2020 09:55:08 +0000</pubDate>
      <link>https://dev.to/shambhavicodes/introduction-to-generative-modeling-57jm</link>
      <guid>https://dev.to/shambhavicodes/introduction-to-generative-modeling-57jm</guid>
      <description>

</description>
    </item>
    <item>
      <title>Paying Attention Again!</title>
      <dc:creator>Shambhavi Mishra</dc:creator>
      <pubDate>Sat, 24 Oct 2020 09:54:42 +0000</pubDate>
      <link>https://dev.to/shambhavicodes/paying-attention-again-3c1c</link>
      <guid>https://dev.to/shambhavicodes/paying-attention-again-3c1c</guid>
      <description>&lt;p&gt;The Transformer Architecture [1] introduced by Vaswani et al, is based on attention mechanism and overcomes the challenges faced in recurrence. In continuation to the last blog 'Let's pay some Attention!', let's delve deeper into Attention Mechanism.&lt;/p&gt;

&lt;p&gt;Recurrent Neural Networks (RNNs) were introduced to handle sequential data, but optimisation tends to take longer for RNNs because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of gradient-descent iterations or steps is higher with recurrence.&lt;/li&gt;
&lt;li&gt;There are several sequential operations which cannot be parallelised easily.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Multi-Head Attention
&lt;/h2&gt;

&lt;p&gt;Let's do a quick recap of the attention mechanism we covered in the last blog. &lt;br&gt;
We have some key-value pairs and a query. We compare the query to each key, and the key with the highest similarity score is assigned the highest weight. To generate an output, we then take a weighted combination of the corresponding values. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gf0Wxozy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/23oavy7fci5j71gkpob0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gf0Wxozy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/23oavy7fci5j71gkpob0.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the flow chart above, we feed the Value (V), Key (K) and Query (Q) into linear layers that produce, say, 3 projections each of V, K and Q. We then compute &lt;em&gt;scaled dot-product attention&lt;/em&gt; for each projection, one head per set of projections, giving us 3 heads of scaled dot-product attention. We concatenate these heads and feed the concatenated output into a linear layer, which then outputs the &lt;strong&gt;multi-head attention&lt;/strong&gt;.&lt;br&gt;
In the case of multi-head attention, &lt;em&gt;we compute multiple attentions per query with different weights&lt;/em&gt;. &lt;/p&gt;
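&lt;p&gt;To make this concrete, here is a minimal NumPy sketch of multi-head attention (not code from the original post); the head count, dimensionality and random projection weights are purely illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                      # dimensionality of each key
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query with every key
    return softmax(scores) @ V             # weighted combination of the values

def multi_head_attention(Q, K, V, num_heads=3):
    d_model = Q.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # one linear projection of Q, K and V per head (random weights, for illustration)
        Wq = rng.standard_normal((d_model, d_head))
        Wk = rng.standard_normal((d_model, d_head))
        Wv = rng.standard_normal((d_model, d_head))
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.standard_normal((num_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate heads, then a final linear layer

X = rng.standard_normal((5, 12))           # 5 positions, embedding size 12
print(multi_head_attention(X, X, X).shape) # self-attention output: (5, 12)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Passing the same matrix as Q, K and V, as in the last line, gives the self-attention used inside the Transformer.&lt;/p&gt;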

&lt;h2&gt;
  
  
  Masked Multi-Head Attention
&lt;/h2&gt;

&lt;p&gt;When decoding the output from the encoder, an output value should depend only on previous outputs and not on future outputs. Thus, to ensure that future values receive zero attention, we mask them. We define masked multi-head attention as multi-head attention where some values are masked, and the probabilities of the masked values are nullified to prevent them from being selected.&lt;br&gt;
Mathematically, as illustrated below, masked multi-head attention is calculated by adding 'M', a mask matrix of zeroes and negative infinities, to the scaled dot products of the queries and keys before the softmax.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--10DaTfxI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/j3bvppfefpdq4wiczzyf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--10DaTfxI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/j3bvppfefpdq4wiczzyf.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Q&lt;sup&gt;T&lt;/sup&gt; : Transpose of Query&lt;br&gt;
K : Key&lt;br&gt;
V : Value&lt;br&gt;
dk : Dimensionality of each key&lt;br&gt;
M : Mask Matrix of 0s and -∞&lt;/p&gt;
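&lt;p&gt;To make the formula concrete, here is a small NumPy sketch (an illustration, not code from the original post) that builds M and adds it to the scaled scores before the softmax; it uses the row-vector convention QK&lt;sup&gt;T&lt;/sup&gt; rather than the Q&lt;sup&gt;T&lt;/sup&gt;K form shown above:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def causal_mask(n):
    # M: 0 where a position may attend, negative infinity for future positions
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])  # add the mask before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # masked entries get probability 0
    return weights @ V
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
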

&lt;h2&gt;
  
  
  Layer Normalisation
&lt;/h2&gt;

&lt;p&gt;Layer normalisation [2], introduced by Lei Ba et al., plays a key role in the performance of the Transformer.&lt;br&gt;
Layer normalisation removes the interdependency between weights and their constant change during computation. Normalisation ensures that, regardless of how we set the weights, the outputs of a layer have a mean of 0 and a variance of 1. Since the scale of these outputs is then the same, convergence is faster.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;layer normalisation&lt;/strong&gt;, normalisation is performed across the units of a layer for a single input, whereas in &lt;strong&gt;batch normalisation&lt;/strong&gt; it is performed for each hidden unit by normalising across a batch of inputs.&lt;/p&gt;
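&lt;p&gt;The difference can be seen in a few lines of NumPy (a minimal sketch, ignoring the learnable gain and bias parameters):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

x = np.random.randn(4, 8)   # a batch of 4 inputs, each with 8 hidden units

# layer normalisation: statistics per input, across its 8 units
layer_norm = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

# batch normalisation: statistics per hidden unit, across the batch of 4 inputs
batch_norm = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + 1e-5)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
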

&lt;h2&gt;
  
  
  Understanding Complexities
&lt;/h2&gt;

&lt;p&gt;In a self-attention network, a layer consists of 'n' positions, and each position has dimensionality 'd'. Computation in one layer is of order 'n&lt;sup&gt;2&lt;/sup&gt;' because every position attends to every other position, and for each such pair we compute an embedding of dimensionality 'd'. Thus, the complexity of every layer is 'n&lt;sup&gt;2&lt;/sup&gt;d'. &lt;/p&gt;
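&lt;p&gt;For example, with n = 100 positions and dimensionality d = 512, a single self-attention layer performs on the order of 100&lt;sup&gt;2&lt;/sup&gt; × 512 ≈ 5 million operations, whereas, per the comparison in [1], a recurrent layer of the same width needs on the order of n·d&lt;sup&gt;2&lt;/sup&gt; = 100 × 512&lt;sup&gt;2&lt;/sup&gt; ≈ 26 million operations, performed sequentially.&lt;/p&gt;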

&lt;p&gt;In this blog we covered multiple topics associated with the concept of Attention. &lt;/p&gt;

&lt;p&gt;References :&lt;br&gt;
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.&lt;br&gt;
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>neuralnetwork</category>
    </item>
    <item>
      <title>Let's pay some Attention!</title>
      <dc:creator>Shambhavi Mishra</dc:creator>
      <pubDate>Fri, 28 Aug 2020 18:05:12 +0000</pubDate>
      <link>https://dev.to/shambhavicodes/let-s-pay-some-attention-33d0</link>
      <guid>https://dev.to/shambhavicodes/let-s-pay-some-attention-33d0</guid>
      <description>&lt;p&gt;Before discussing a new technology or methodology, we should try to understand the need of it. And so, let us know what gave path to the Transformer Networks. &lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with Recurrent Neural Networks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--znbvsl7P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ptwajkhdgj3i1zhgsqgv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--znbvsl7P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ptwajkhdgj3i1zhgsqgv.png" alt="Image from mc.ai"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;(Image Source : mc.ai)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gradients are simply vectors pointing in the direction of the highest rate of increase of a function. During backpropagation, gradients go through matrix multiplication many times via the chain rule. Small gradients get smaller and smaller until they vanish, making it harder to train the weights. This is called the vanishing gradient problem.&lt;br&gt;
Conversely, if a gradient is large, repeated multiplication makes it grow and results in very large updates to the network. This is known as the exploding gradient problem. &lt;/p&gt;

&lt;p&gt;Another challenge with RNNs is that of recurrence itself: recurrence prevents parallel computation.&lt;br&gt;
Also, a large number of training steps is required to train an RNN.&lt;/p&gt;

&lt;p&gt;The solution to all these problems is &lt;strong&gt;Transformers&lt;/strong&gt;!&lt;br&gt;
As the title says, &lt;em&gt;Attention is all you need&lt;/em&gt; by &lt;a href="https://arxiv.org/abs/1706.03762"&gt;Vaswani et al. (2017)&lt;/a&gt; is the paper that introduced the concept of Transformers.&lt;br&gt;
Let us first understand the &lt;strong&gt;Attention Mechanism&lt;/strong&gt;.&lt;br&gt;
Below is an image from my notes of &lt;a href="https://www.youtube.com/channel/UC7ZVvEo7-B7lA6LY2MVX72A"&gt;Prof. Pascal Poupart's&lt;/a&gt; lecture on Transformers.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uf3YvhWU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nvhmz014asq2ygtwip42.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uf3YvhWU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nvhmz014asq2ygtwip42.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The attention mechanism mimics the retrieval of a value (v) for a query (q) based on a key (k) in a database. &lt;br&gt;
Given a query and some keys (k1, k2, k3, k4), we aim to produce an output that is a linear combination of the values, where the weights come from the similarity between our query and the keys. &lt;br&gt;
In the diagram above, the first layer consists of the keys (vectors). We generate the next layer by comparing these keys with the query (q), so the second layer consists of similarities (s).&lt;/p&gt;

&lt;p&gt;We take the softmax of these similarities to yield another layer (a). The product of the weights in (a) with the values (v) gives us the attention value. &lt;/p&gt;
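&lt;p&gt;Here is a tiny NumPy sketch of those steps (the numbers are made up purely for illustration, not taken from the lecture notes):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

q = np.array([1.0, 0.0, 1.0])              # query
K = np.array([[1.0, 0.0, 1.0],             # keys k1..k4
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
V = np.array([[1.0, 2.0],                  # values v1..v4
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

s = K @ q                                  # similarities (s) between the query and each key
a = np.exp(s) / np.exp(s).sum()            # softmax layer (a)
print(a @ V)                               # attention value: weighted combination of the values
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
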

&lt;p&gt;So far we have understood what gave rise to the need for &lt;em&gt;Attention&lt;/em&gt; and what exactly the &lt;em&gt;Attention Mechanism&lt;/em&gt; is.&lt;br&gt;
What more will we cover?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multihead Attention&lt;/li&gt;
&lt;li&gt;Masked Multihead Attention&lt;/li&gt;
&lt;li&gt;Layer Normalisation &lt;/li&gt;
&lt;li&gt;Positional Embedding&lt;/li&gt;
&lt;li&gt;Comparison of Self Attention and Recurrent Layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's cover all this in the next blog! &lt;br&gt;
You can follow me on &lt;a href="https://twitter.com/ShambhaviCodes"&gt;twitter&lt;/a&gt; where I share all the good content and blogs! &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>neuralnetwork</category>
      <category>transformers</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Text Classification using ELMo Embedding</title>
      <dc:creator>Shambhavi Mishra</dc:creator>
      <pubDate>Thu, 27 Aug 2020 18:05:40 +0000</pubDate>
      <link>https://dev.to/shambhavicodes/text-classification-using-elmo-embedding-4o79</link>
      <guid>https://dev.to/shambhavicodes/text-classification-using-elmo-embedding-4o79</guid>
      <description>&lt;p&gt;When we talk about supervised learning, a much exploited task is &lt;em&gt;'Text or Image Classification'&lt;/em&gt;. Today we will discuss Text Classification on BBC News Dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;We’ll use a public dataset from the &lt;strong&gt;BBC&lt;/strong&gt; comprising 2225 articles, each labelled under one of 5 categories: business, entertainment, politics, sport or tech.&lt;br&gt;
The dataset is split into 1490 records for training and 735 for testing. The goal is to build a system that can accurately classify previously unseen news articles into the right category.&lt;/p&gt;
&lt;h3&gt;
  
  
  Preprocessing
&lt;/h3&gt;

&lt;p&gt;We cannot feed raw, human-readable text directly to our model. Preprocessing text involves multiple tasks, such as stemming (reducing a word to its root) and stopword removal (eliminating repetitive, redundant words, or simply stopwords). Preprocessing of text cannot be generalised and is very specific to the task and the domain of the data. Since our dataset is fairly simple and this is a beginner-focused tutorial, we will remove stopwords to preprocess our data.&lt;/p&gt;

&lt;p&gt;We load our data using &lt;em&gt;Pandas&lt;/em&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
data = pd.read_csv('Filename.csv')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We use &lt;strong&gt;NLTK&lt;/strong&gt;, the Natural Language Toolkit, a Python library for working with text. Stop words are, for example, repetitive words, articles and conjunctions which do not add value to the text from an NLP perspective. The NLTK library makes our task easy by providing a list of commonly occurring stopwords.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.corpus import stopwords
stop_words = stopwords.words( ' english ' )
print(stop_words)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;&lt;br&gt;
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',&lt;br&gt;
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',&lt;br&gt;
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',&lt;br&gt;
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',&lt;br&gt;
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',&lt;br&gt;
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',&lt;br&gt;
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',&lt;br&gt;
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',&lt;br&gt;
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',&lt;br&gt;
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',&lt;br&gt;
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',&lt;br&gt;
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd',&lt;br&gt;
'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn',&lt;br&gt;
'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn',&lt;br&gt;
'weren', 'won', 'wouldn']&lt;/p&gt;

&lt;p&gt;We also encode our labels for the classification task.&lt;/p&gt;
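&lt;p&gt;For instance, assuming the dataframe loaded earlier has 'text' and 'category' columns (the column names here are an assumption, not taken from the original notebook), stopword removal and label encoding might look like this sketch:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# drop stopwords from each article (column names are hypothetical)
data['text'] = data['text'].apply(
    lambda doc: ' '.join(w for w in doc.split() if w.lower() not in stop_words))

# encode the 5 category names as integers, then as one-hot vectors
encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(data['category']))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
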
&lt;h3&gt;
  
  
  Embedding
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Word Embedding Model&lt;/em&gt; was a key breakthrough for learning representations for text where similar words have a similar representation in the vector space. &lt;br&gt;
&lt;em&gt;ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.&lt;/em&gt; - ELMo was developed by the &lt;a href="https://allennlp.org/elmo"&gt;Allen Institute for AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A contextualized word representation is a representation of a word that depends heavily on the surrounding words. ELMo takes the entire input text into account before generating an embedding, so as to capture the semantics of the text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is polysemy?&lt;/strong&gt;&lt;br&gt;
Polysemy is the capability of a word to possess more than one meaning.&lt;br&gt;
&lt;em&gt;Bright&lt;/em&gt; means 'shining' as well as 'intelligent'.&lt;/p&gt;

&lt;p&gt;ELMo addresses these problems of text data modeling.&lt;/p&gt;

&lt;p&gt;I shall discuss different types of SOTA embeddings in more detail in another post.&lt;br&gt;
A pre-trained ELMo embedding model, trained on the 1 Billion Word Benchmark, is available on &lt;a href="https://tfhub.dev/google/elmo/1"&gt;Tensorflow-Hub&lt;/a&gt;.&lt;br&gt;
Let's code!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf
import tensorflow_hub as hub
embed = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
def ELMoEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Training our model, we achieve an accuracy of 0.91 and a categorical cross-entropy loss of 0.28.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(ELMoEmbedding, output_shape=(1024, ))(input_text)
dense = Dense(256, activation='relu')(embedding)
pred = Dense(5, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
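
&lt;p&gt;A training call might then look like the sketch below; x_train and y_train (arrays of article strings and one-hot labels) are assumed to come from the preprocessing above, and the TF1-style session setup is needed because of hub.Module:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tensorflow.keras import backend as K

# Hypothetical x_train / y_train: arrays of article strings and one-hot labels
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())  # initialise the ELMo module's variables
    session.run(tf.tables_initializer())
    model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;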



&lt;p&gt;You can view the notebook &lt;a href="https://github.com/ShambhaviCodes/Text-Classification-using-ELMo-Embedding/blob/master/Text_Classification.ipynb"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>naturallanguageprocessing</category>
    </item>
    <item>
      <title>Python to Neural Networks : A Guide for Beginners</title>
      <dc:creator>Shambhavi Mishra</dc:creator>
      <pubDate>Wed, 26 Aug 2020 17:56:05 +0000</pubDate>
      <link>https://dev.to/shambhavicodes/python-to-neural-networks-a-guide-for-beginners-5ahg</link>
      <guid>https://dev.to/shambhavicodes/python-to-neural-networks-a-guide-for-beginners-5ahg</guid>
      <description>&lt;p&gt;Last Month, I completed a year of my exploration with Machine Learning and Data Science and thus, I decided to pen down the resources I have followed till now. &lt;/p&gt;

&lt;p&gt;I don’t claim this is the path to be followed; this is just my share of experiences and mistakes which landed me here. This summary is for all my peers who, like me, find themselves lost on the ‘how-tos’ of Data Science. &lt;/p&gt;

&lt;h2&gt;
  
  
  Python 101
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Video Resources I followed :
&lt;/h4&gt;

&lt;p&gt;Telusko’s Python playlist: I started learning Python with &lt;a href="https://www.youtube.com/playlist?list=PLsyeobzWxl7poL9JTVyndKe62ieoN-MZ3"&gt;this&lt;/a&gt; series by Navin Reddy, which built my foundations for this journey.&lt;br&gt;
I have also followed lectures by Charles Severance; you can find them &lt;a href="https://www.py4e.com/"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://www.youtube.com/user/sentdex"&gt;Sentdex&lt;/a&gt; is a much-recommended YouTube channel; I came across it pretty late.&lt;/p&gt;

&lt;h4&gt;
  
  
  Books I followed :
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Automate the Boring Stuff with Python&lt;/li&gt;
&lt;li&gt;Fluent Python&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Machine Learning
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Video Resources I followed :
&lt;/h4&gt;

&lt;p&gt;I started with a Udemy course that gave an overview of all the algorithms and their implementation: &lt;a href="https://www.udemy.com/course/machinelearning/"&gt;Machine Learning A-Z&lt;/a&gt; by Kirill Eremenko.&lt;br&gt;
While exploring project ideas after the course, I realised I was missing something deeper and more mathematical when I read that everyone was doing the Machine Learning course by Andrew Ng. Truly, the mathematical concepts built in this course helped me sail smoothly through the next phase. &lt;a href="https://www.youtube.com/watch?v=PPLop4L2eGk&amp;amp;list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN"&gt;Here’s&lt;/a&gt; the link to the course. &lt;br&gt;
You can find it on Coursera too. &lt;/p&gt;

&lt;h4&gt;
  
  
  Books I followed :
&lt;/h4&gt;

&lt;p&gt;I have relied mainly on Machine Learning Mastery with Python by Jason Brownlee.&lt;br&gt;
I also followed the O’Reilly Python Data Science Handbook for a few things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delving Deeper to the Neural Nets
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Deeplearning.ai’s Specialisation Courses: again, I finished them on YouTube (I didn’t know about financial aid on Coursera a year back). &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cs230.stanford.edu/"&gt;Stanford cs230&lt;/a&gt;: another course taught by Andrew Ng, based on Deep Learning and its applications.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://web.stanford.edu/class/cs224n/"&gt;Stanford cs224n&lt;/a&gt;: a course on Natural Language Processing; it was my stepping stone to NLP. I learnt by doing the assignments and following up with books and blogs (say, Olah’s blog for LSTMs).&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://cs231n.stanford.edu/"&gt;Stanford cs231n&lt;/a&gt;: while learning Computer Vision, this course has been my guide. Solving the assignments and trying out related projects was enough to substantiate the theory. I can also share my handwritten notes on this series, if required. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;I can’t emphasize enough the importance of trying out self-paced projects to implement whatever you learn! Through the year I have finished many projects and some internship assignments that helped me learn so much.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>computerscience</category>
    </item>
  </channel>
</rss>
