<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zoheb Abai</title>
    <description>The latest articles on DEV Community by Zoheb Abai (@zohebabai).</description>
    <link>https://dev.to/zohebabai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F426542%2Fbce71db0-35f6-4410-bad0-04b11805cc99.png</url>
      <title>DEV Community: Zoheb Abai</title>
      <link>https://dev.to/zohebabai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zohebabai"/>
    <language>en</language>
    <item>
      <title>Upgrade to PyTorch 2.0</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Wed, 24 May 2023 15:30:05 +0000</pubDate>
      <link>https://dev.to/zohebabai/upgrade-to-pytorch-20-b23</link>
      <guid>https://dev.to/zohebabai/upgrade-to-pytorch-20-b23</guid>
      <description>&lt;h2&gt;
  
  
  Why Upgrade?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/get-started/pytorch-2.0/" rel="noopener noreferrer"&gt;PYTORCH 2.X: FASTER, MORE PYTHONIC AND AS DYNAMIC AS EVER&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/blog/deprecation-cuda-python-support/#:~:text=If%20you%20are%20still%20using,versions%20required%20for%20PyTorch%202.0." rel="noopener noreferrer"&gt;Deprecation of CUDA 11.6 and Python 3.7 Support&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Upgrade Objectives
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python ≥ 3.8, ≤ 3.11&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CUDA ≥ 11.7.0&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CUDNN ≥ 8.5.0.96&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PyTorch ≥ 2.0.0&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
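&lt;p&gt;A quick way to sanity-check your current environment against these objectives is to compare version tuples. The snippet below is an illustrative sketch; the names &lt;code&gt;parse_version&lt;/code&gt; and &lt;code&gt;meets_objectives&lt;/code&gt; are my own, not part of any official tooling:&lt;/p&gt;

```python
import sys

def parse_version(s):
    """Turn a dotted version string such as '11.7.1' into a comparable tuple."""
    return tuple(int(part) for part in s.split("."))

def meets_objectives(python=sys.version_info[:3], cuda="11.7.0", cudnn="8.5.0.96"):
    """Check the objectives above: Python 3.8 through 3.11,
    CUDA at least 11.7.0, CUDNN at least 8.5.0.96."""
    py = tuple(python)
    return (
        py >= (3, 8)
        and (3, 11) >= py[:2]
        and parse_version(cuda) >= parse_version("11.7.0")
        and parse_version(cudnn) >= parse_version("8.5.0.96")
    )
```

&lt;p&gt;On your machine you would pass the versions reported by &lt;code&gt;python3 --version&lt;/code&gt;, &lt;code&gt;nvcc --version&lt;/code&gt;, and your cuDNN install, which the sections below show how to obtain.&lt;/p&gt;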




&lt;blockquote&gt;
&lt;p&gt;“We expect that with PyTorch 2, people will change the way they use PyTorch day-to-day”&lt;br&gt;
“Data scientists will be able to do with PyTorch 2.x the same things that they did with 1.x, but they can do them faster and at a larger scale” &lt;br&gt;
— Soumith Chintala&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Steps for Upgrade
&lt;/h2&gt;

&lt;h3&gt;
  
  
  If you already have Python ≥ 3.8, ≤ 3.11, jump to the next section
&lt;/h3&gt;

&lt;p&gt;Steps for upgrading Python from a version below 3.8 to 3.10.6&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a clean installation, remove all existing Python-related files&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace X with the specific version number&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nt"&gt;--purge&lt;/span&gt; remove python3.X
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get autoremove
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get autoclean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pre-Installation Actions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update

&lt;span class="c"&gt;# Install required dependencies&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev wget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installing Python 3.10.6 from source&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Download the desired version (here 3.10.6) from the &lt;a href="https://www.python.org/downloads/source/" rel="noopener noreferrer"&gt;Python website&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract the source code&lt;/span&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvf&lt;/span&gt; Python-3.10.6.tgz

&lt;span class="c"&gt;# Configure the build &lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;python-3.10.6
./configure &lt;span class="nt"&gt;--enable-optimizations&lt;/span&gt; &lt;span class="nt"&gt;--prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local

&lt;span class="c"&gt;# Start the build process&lt;/span&gt;
make &lt;span class="nt"&gt;-j&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Once the build completes, install Python&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;make &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the &lt;code&gt;~/.bashrc&lt;/code&gt; file and add the following line at the end&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export PATH="/usr/local/bin:$PATH"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Save the file and update the environment variables for the current session by running&lt;/p&gt;

&lt;p&gt;&lt;code&gt;source ~/.bashrc&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify Python version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;--version&lt;/span&gt;

which python3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  If you already have CUDA ≥ 11.7.0, jump to the next section
&lt;/h3&gt;

&lt;p&gt;Steps for upgrading CUDA to 11.7.1 on Ubuntu 22.04 with an NVIDIA GeForce RTX graphics card:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a clean installation, remove all existing CUDA-related files&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nt"&gt;--purge&lt;/span&gt; remove &lt;span class="s2"&gt;"*cuda*"&lt;/span&gt; &lt;span class="s2"&gt;"*cublas*"&lt;/span&gt; &lt;span class="s2"&gt;"*cufft*"&lt;/span&gt; &lt;span class="s2"&gt;"*cufile*"&lt;/span&gt; &lt;span class="s2"&gt;"*curand*"&lt;/span&gt;  &lt;span class="s2"&gt;"*cusolver*"&lt;/span&gt; &lt;span class="s2"&gt;"*cusparse*"&lt;/span&gt; &lt;span class="s2"&gt;"*gds-tools*"&lt;/span&gt; &lt;span class="s2"&gt;"*npp*"&lt;/span&gt; &lt;span class="s2"&gt;"*nvjpeg*"&lt;/span&gt; &lt;span class="s2"&gt;"nsight*"&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nt"&gt;--purge&lt;/span&gt; remove &lt;span class="s2"&gt;"*nvidia*"&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get autoremove
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get autoclean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pre-Installation Actions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify You Have a CUDA-Capable GPU&lt;/span&gt;
lspci | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; nvidia

&lt;span class="c"&gt;# Verify the System Has gcc Installed&lt;/span&gt;
gcc &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;span class="c"&gt;# Verify the System has the Correct Kernel Headers and Development Packages Installed&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;linux-headers-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install NVIDIA CUDA Toolkit 11.7.1 (Debian Installer Preferred)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the repository meta-data, update the GPG key, update the apt-get cache, and install CUDA:&lt;/span&gt;
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.1-515.65.01-1_amd64.deb
&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; cuda-repo-ubuntu2204-11-7-local_11.7.1-515.65.01-1_amd64.deb
&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /var/cuda-repo-ubuntu2204-11-7-local/cuda-&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="nt"&gt;-keyring&lt;/span&gt;.gpg /usr/share/keyrings/
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nb"&gt;install &lt;/span&gt;cuda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more details, check &lt;a href="https://developer.nvidia.com/cuda-toolkit-archive" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;During CUDA installation you may be asked to create a password for MOK (Machine Owner Key) management; do so, and remember it.&lt;/p&gt;

&lt;p&gt;Reboot the system to load the NVIDIA drivers. If a blue MOK management screen appears, DO NOT simply continue to boot; instead, choose to enroll the key, provide the password you created a moment ago, and then continue to boot.&lt;/p&gt;

&lt;p&gt;Open the &lt;code&gt;~/.bashrc&lt;/code&gt; file and add the following lines at the end&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda-11.7/bin&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;:+:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="k"&gt;}}&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda-11.7/lib64&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;:+:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class="k"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file and update the environment variables for the current session by running&lt;/p&gt;

&lt;p&gt;&lt;code&gt;source ~/.bashrc&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify the CUDA version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvcc &lt;span class="nt"&gt;--version&lt;/span&gt;

nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
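&lt;p&gt;If you want to check the toolkit version from a script rather than by eye, the release number can be parsed out of the &lt;code&gt;nvcc --version&lt;/code&gt; output. A minimal sketch; the sample output and the function name are illustrative:&lt;/p&gt;

```python
import re

# Abbreviated example of what `nvcc --version` prints for CUDA 11.7
SAMPLE_NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.7, V11.7.99
"""

def cuda_release(nvcc_output):
    """Pull the 'release X.Y' number out of `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None
```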






&lt;h3&gt;
  
  
  If you already have CUDNN ≥ 8.5.0.96, jump to the next section
&lt;/h3&gt;

&lt;p&gt;Steps for upgrading CUDNN to 8.5.0.96&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install CUDNN 8.5.0.96 (Debian Installer Preferred)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://developer.nvidia.com/compute/cudnn/secure/8.5.0/local_installers/11.7/cudnn-local-repo-ubuntu2204-8.5.0.96_1.0-1_amd64.deb
&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; cudnn-local-repo-ubuntu2204-8.5.0.96_1.0-1_amd64.deb

&lt;span class="c"&gt;# Import the CUDA GPG key&lt;/span&gt;
&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /var/cudnn-local-repo-ubuntu2204-8.5.0.96/cudnn-local-&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="nt"&gt;-keyring&lt;/span&gt;.gpg /usr/share/keyrings/

&lt;span class="c"&gt;# Refresh the repository metadata&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update

&lt;span class="c"&gt;# Install the runtime library&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;libcudnn8&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8.5.0.96-1+cuda11.7

&lt;span class="c"&gt;# Install the developer library&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;libcudnn8-dev&lt;span class="o"&gt;=&lt;/span&gt;8.5.0.96-1+cuda11.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more details, check &lt;a href="https://developer.nvidia.com/rdp/cudnn-archive" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  If you already have PyTorch ≥ 2.0.0, you are awesome.
&lt;/h3&gt;

&lt;p&gt;Steps for upgrading PyTorch to 2.0.0&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# If you have virtualenv and use pip as manager&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.0.0+cu117 &lt;span class="nv"&gt;torchvision&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.15.1+cu117 &lt;span class="nv"&gt;torchaudio&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.0.1 &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu117
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For other OS or package manager, check &lt;a href="https://pytorch.org/get-started/previous-versions/" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For downloading wheel files, check &lt;a href="https://download.pytorch.org/whl/cu117" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Verify PyTorch 2.0 Installation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch; print(torch.__version__)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
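&lt;p&gt;Beyond printing the version, it is worth confirming that you got a CUDA build rather than a CPU-only wheel: for the wheel installed above, &lt;code&gt;torch.__version__&lt;/code&gt; reports &lt;code&gt;2.0.0+cu117&lt;/code&gt;. A small illustrative check on that string; the helper name is my own:&lt;/p&gt;

```python
def is_pytorch2_cuda_build(version_string):
    """True for a 2.x CUDA wheel such as '2.0.0+cu117';
    False for 1.x versions or CPU-only builds (no '+cuXYZ' local tag)."""
    base, _, local = version_string.partition("+")
    major = int(base.split(".")[0])
    return major >= 2 and local.startswith("cu")
```

&lt;p&gt;On a live install, pass &lt;code&gt;torch.__version__&lt;/code&gt; to it, and additionally run &lt;code&gt;torch.cuda.is_available()&lt;/code&gt; to confirm the driver is visible to PyTorch.&lt;/p&gt;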






&lt;p&gt;If you run into errors, search for the error message online or leave a comment below.&lt;/p&gt;

&lt;p&gt;I hope the article helped. Thanks.&lt;/p&gt;




</description>
      <category>python</category>
      <category>deeplearning</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What’s in the controversial article that forced Timnit Gebru out of Google?</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sat, 30 Jan 2021 18:12:59 +0000</pubDate>
      <link>https://dev.to/zohebabai/what-s-in-the-infamous-article-that-forced-timnit-gebru-out-of-google-li7</link>
      <guid>https://dev.to/zohebabai/what-s-in-the-infamous-article-that-forced-timnit-gebru-out-of-google-li7</guid>
      <description>&lt;p&gt;&lt;em&gt;“On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”&lt;/em&gt; — &lt;strong&gt;Summarized&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In short — Authors have raised global awareness on recent NLP trends and have urged researchers, developers, and practitioners associated with language technology to take a holistic and responsible approach.&lt;/p&gt;

&lt;h1&gt;
  
  
  Where lies the Issue?
&lt;/h1&gt;

&lt;p&gt;The most notable NLP trend is the ever-increasing size (in number of parameters and size of training data) of language models (LMs) like BERT and its variants, T-NLG, GPT-2/3, etc. LMs are trained on string prediction tasks: that is, predicting the likelihood of a token (character, word, or string) given either its preceding context or its surrounding context (in bidirectional and masked LMs). Such systems are unsupervised during training, later fine-tuned for specific tasks, and, when deployed, take a text as input, commonly outputting scores or string predictions. Increasing the number of model parameters did not yield noticeable gains for LSTMs; Transformers, however, have continuously benefited from it. This trend of increasingly large LMs can be expected to continue as long as size correlates with an increase in performance. Even models like DistilBERT and ALBERT, which are reduced forms of BERT built using techniques such as knowledge distillation and quantization, still rely on large quantities of data and significant computing resources.&lt;/p&gt;
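&lt;p&gt;The string-prediction setup described above can be illustrated with a toy bigram model. This is a deliberately tiny sketch, orders of magnitude simpler than the Transformer LMs under discussion, but the objective is the same in spirit: estimate the likelihood of the next token from its preceding context.&lt;/p&gt;

```python
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from the bigram counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0
```

&lt;p&gt;Trained on &lt;code&gt;"the cat sat on the mat"&lt;/code&gt;, such a model assigns "cat" and "mat" equal probability after "the"; scale the same idea up by many orders of magnitude in parameters and data and you get the models the paper critiques.&lt;/p&gt;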

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegxpnjbf9jgx8jwlgy0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegxpnjbf9jgx8jwlgy0r.png" alt="Table 1" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What are the Issues?
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Environmental Costs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Training a single BERT base model (without hyperparameter tuning) on GPUs was estimated to require as much energy as a trans-American flight (~1,900 lbs CO2e).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbmqoc8kp3zgtwd4lgct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbmqoc8kp3zgtwd4lgct.png" alt="Table 2" width="700" height="258"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Source: &lt;a href="https://arxiv.org/pdf/1906.02243.pdf" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/1906.02243.pdf&lt;/a&gt;&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For machine translation, where large LMs have resulted in performance gains, they estimate that an increase of 0.1 in BLEU score using neural architecture search for English-to-German translation results in an additional $150,000 of compute cost, on top of the carbon emissions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most sampled papers from ACL 2018, NeurIPS 2018, and CVPR 2019 claim accuracy improvements alone as primary contributions to the field. None focused on measures of efficiency as primary contributions, which should be prioritized as the evaluation metric.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Financial Costs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The amount of compute used to train the largest deep learning models (for NLP and other applications) has increased 300,000x in 6 years, increasing at a far higher pace than Moore’s Law. This, in turn, erects barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques.&lt;/li&gt;
&lt;li&gt;Many LMs are deployed in industrial or other settings where the cost of inference might greatly outweigh that of training in the long run.&lt;/li&gt;
&lt;li&gt;While some language technology is genuinely designed to benefit marginalized communities, most language technology is built to serve the needs of those who already have the most privilege in society.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Risks associated with Large Training Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Starting with who is contributing to these Internet text collections, we see that Internet access itself is not evenly distributed, resulting in Internet data overrepresenting younger users and those from developed countries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A limited set of subpopulations continue to easily add data, sharing their thoughts and developing platforms that are inclusive of their worldviews; this systemic pattern, in turn, worsens diversity and inclusion within Internet-based communication, creating a feedback loop that lessens the impact of data from underrepresented populations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Thus at each step, from initial participation in Internet fora to continued presence there to the collection and finally the filtering of training data, current practice privileges the hegemonic viewpoint. In accepting large amounts of web text as ‘representative’ of ‘all’ of humanity, we risk perpetuating dominant viewpoints, increasing power imbalances, and further reifying inequality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An important caveat is that social movements that are poorly documented and do not receive significant media attention will not be captured at all. As a result, the data underpinning LMs stands to misrepresent social movements and disproportionately align with existing power regimes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Developing and shifting frames stand to be learned in incomplete ways or lost in the big-ness of data used to train large LMs — particularly if the training data isn’t continually updated. Given the compute costs alone of training large LMs, it likely isn’t feasible for even large corporations to fully retrain them frequently enough to keep up with the kind of language change discussed here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Components like toxicity classifiers would need culturally appropriate training data for each audit context, and even still, we may miss marginalized identities if we don’t know what to audit for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When we rely on ever-larger datasets, we risk incurring documentation debt, i.e., putting ourselves in a situation where the datasets are undocumented and too large to document post hoc. While documentation allows for potential accountability, undocumented training data perpetuates harm without recourse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Risks due to misdirected Research Effort
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;(specifically around the application of LMs for tasks intended to test for Natural Language Understanding)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The allocation of research effort towards measuring how well BERT and its kin do on both existing and new benchmarks brings with it an opportunity cost: on the one hand, time not spent applying meaning-capturing approaches to meaning-sensitive tasks; on the other, time not spent exploring more effective ways of building technology with datasets of a size that can be carefully curated and made available for a broader set of languages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From a theoretical perspective, languages are systems of signs, i.e., pairings of form and meaning. But the training data for LMs is only a form; they do not have access to meaning. Therefore, claims about model abilities must be carefully characterized.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Relying on LMs ties us to certain epistemological and methodological commitments. Either i) we commit ourselves to a noisy-channel interpretation of the task, ii) we abandon any goals of theoretical insight into tasks and treat LMs as “just some convenient technology,” or iii) we implicitly assume a certain statistical relationship — known to be invalid — between inputs, outputs, and meanings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From the perspective of work on language technology, it is far from clear that all of the effort being put into using large LMs to ‘beat’ tasks designed to test natural language understanding, and all of the effort to create new such tasks, once the LMs have bulldozed the existing ones, brings us any closer to long-term goals of general language understanding systems. If a large LM, endowed with hundreds of billions of parameters and trained on a very large dataset, can manipulate linguistic form well enough to cheat its way through tests meant to require language understanding, have we learned anything of value about building machine language understanding?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Risks and Harms of deploying LMs at Scale
&lt;/h3&gt;

&lt;p&gt;Human language usage occurs between individuals who share common ground and are mutually aware of that sharing (and its extent), who have communicative intents that they use language to convey, and who model each others’ mental states as they communicate. As such, human communication relies on the interpretation of implicit meaning conveyed between individuals. The fact that human-human communication is a jointly constructed activity is most clearly true in co-situated spoken or signed communication. Still, we use the same facilities for producing language that is intended for audiences not co-present with us (readers, listeners, watchers at a distance in time or space) and in interpreting such language when we encounter it. It must follow that even when we don’t know the person who generated the language we are interpreting, we build a partial model of who they are and what common ground we think they share with us and use this in interpreting their words.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t be, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that. This can seem counter-intuitive given the increasingly fluent qualities of automatically generated text, but we have to account for the fact that our perception of natural language text, regardless of how it was generated, is mediated by our own linguistic competence and by our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do. The problem is that if one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language (independent of the model). Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LMs producing text will reproduce and even amplify the encoded biases in their training data. Thus the risk is that people disseminate text generated by LMs, meaning more text in the world that reinforces and propagates stereotypes and problematic associations, both to humans who encounter the text and future LMs trained on training sets that ingested the previous generation LM’s output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Miscreants can take advantage of the ability of large LMs to produce large quantities of seemingly coherent texts on specific topics on demand in cases where those deploying the LM have no investment in the truth of the generated text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In machine translation, the ability of large LMs to produce seemingly coherent text over longer passages risks erasing the cues that might otherwise tip users off to translation errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LMs with huge numbers of parameters model their training data very closely and can be prompted to output specific information from that training data, such as personally identifiable information.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Recommended Paths Ahead!
&lt;/h1&gt;

&lt;p&gt;We should consider our research time and effort a valuable resource to be spent to the extent possible on research projects that are built towards a technological ecosystem whose benefits are at least evenly distributed. Each of the approaches mentioned below takes time and is most valuable when applied early in the development process, as part of a conceptual investigation of values and harms rather than a post hoc discovery of risks.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Considering Environmental and Financial Impacts&lt;/strong&gt;: We should consider the financial and environmental costs of model development upfront before deciding on a course of an investigation. The resources needed to train and tune state-of-the-art models stand to increase economic inequities unless researchers incorporate energy and compute efficiency in their model evaluations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Doing careful data curation and documentation&lt;/strong&gt;: Significant time should be spent on assembling datasets suited for the tasks at hand rather than ingesting massive amounts of data from convenient or easily-scraped Internet sources. Simply turning to massive dataset size as a strategy for being inclusive of diverse viewpoints is doomed to failure. As a part of careful data collection practices, researchers must adopt frameworks such as (&lt;a href="https://www.aclweb.org/anthology/Q18-1041.pdf" rel="noopener noreferrer"&gt;Data Statements for Natural Language Processing&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1803.09010.pdf" rel="noopener noreferrer"&gt;Datasheets for Datasets&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1810.03993.pdf" rel="noopener noreferrer"&gt;Model Cards for Model Reporting&lt;/a&gt;) to describe the uses for which their models are suited and benchmark evaluations for a variety of conditions. This involves providing thorough documentation on the data used in model building, including the motivations underlying data selection and collection processes. This documentation should reflect and indicate researchers’ goals, values, and motivations in assembling data and creating a given model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Engaging with stakeholders early in the design process&lt;/strong&gt;: It should note potential users and stakeholders, particularly those that stand to be negatively impacted by model errors or misuse. An exploration of stakeholders for likely use cases can still be informative around potential risks, even when there is no way to guarantee that all use cases can be explored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exploring multiple possible paths towards long-term goals&lt;/strong&gt;: We also advocate for a re-alignment of research goals: Where much effort has been allocated to making models (and their training data) bigger and to achieving ever higher scores on leaderboards often featuring artificial tasks, we believe there is more to be gained by focusing on understanding how machines are achieving the tasks in question and how they will form part of socio-technical systems. To that end, LM development may benefit from guided evaluation exercises such as pre-mortems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keeping alert to dual-use scenarios&lt;/strong&gt;: For researchers working with LMs, the value-sensitive design is poised to help throughout the development process in identifying whose values are expressed and supported through technology and, subsequently, how a lack of support might result in harm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Allocating research effort to mitigate harm&lt;/strong&gt;: Finally, we would like to consider use cases of large LMs that have specifically served marginalized populations. We should consider cases such as: Could LMs be built in such a way that synthetic text generated with them would be watermarked and thus detectable? Are there policy approaches that could effectively regulate their use?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We hope these considerations encourage NLP researchers to direct resources and effort into techniques for approaching NLP tasks effectively without being endlessly data-hungry. But beyond that, we call on the field to recognize that applications aiming to believably mimic humans bring a risk of extreme harm. Work on synthetic human behavior is a bright line in ethical AI development, where downstream effects need to be understood and modeled in order to block foreseeable harm to society and different social groups.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please find the original article &lt;a href="https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt; and refer to its references not mentioned above. If this post saved your time, don’t forget to appreciate it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Top 10 Deep Learning Breakthroughs - Deep Reinforcement Learning</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sun, 06 Sep 2020 15:05:37 +0000</pubDate>
      <link>https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-deep-reinforcement-learning-1nmc</link>
      <guid>https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-deep-reinforcement-learning-1nmc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Have you watched this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/WXuK6gekU1Y"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Or this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/tfb6aEUMC04"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Watch these videos at your convenience, then go through this blog to understand (without much mathematical sophistication) how we arrived at this stage.&lt;/p&gt;

&lt;p&gt;If you have already watched these videos, I totally understand your excitement. This breakthrough came in 2013, shortly after &lt;a href="https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-alexnet-1670"&gt;AlexNet&lt;/a&gt;, in a paper titled &lt;a href="https://arxiv.org/pdf/1312.5602.pdf" rel="noopener noreferrer"&gt;Playing Atari with Deep Reinforcement Learning&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Preliminaries
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning provides automated methods that can detect patterns in data and use them to achieve some tasks&lt;/strong&gt;. Generally, ML tasks are categorized into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Supervised Learning - the task of learning from labeled datasets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unsupervised Learning - the task of drawing inferences from unlabeled datasets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reinforcement Learning - the task of maximizing cumulative reward over sequences of actions taken by an agent while interacting with its environment&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will focus on RL here, so let's understand an RL problem by imagining a robot mouse in a maze world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4t7p4latvd4urs25nb5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4t7p4latvd4urs25nb5p.png" alt="Alt Text" width="361" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The robot mouse is the &lt;strong&gt;agent&lt;/strong&gt;; the maze is the &lt;strong&gt;environment&lt;/strong&gt;, with food at a few locations as positive &lt;strong&gt;rewards&lt;/strong&gt; and electricity at other locations as negative rewards. At every moment the mouse can &lt;strong&gt;observe&lt;/strong&gt; the full state of the maze to decide which &lt;strong&gt;actions&lt;/strong&gt; (such as turning left/right or moving forward) to take next. The &lt;strong&gt;goal&lt;/strong&gt; is for the robot to learn on its own to find as much food as possible while avoiding electric shocks whenever possible. &lt;/p&gt;

&lt;p&gt;To solve these tasks, ML exploits the idea of &lt;em&gt;function approximators&lt;/em&gt;. Since the 1950s, we have known of several types of function approximators, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.ru.ac.bd/stat/wp-content/uploads/sites/25/2019/03/301_03_Anderson_An-Introduction-to-Multivariate-Statistical-Analysis-2003.pdf" rel="noopener noreferrer"&gt;Linear Models by Anderson et al. (1958)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://homepages.rpi.edu/~bennek/class/mmld/papers/svn.pdf" rel="noopener noreferrer"&gt;SVMs by Cortes and Vapnik (1995)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cogns.northwestern.edu/cbmg/LiawAndWiener2002.pdf" rel="noopener noreferrer"&gt;Decisions Tree by Liaw, Wiener, et al. (2002)&lt;/a&gt; and &lt;a href="https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf" rel="noopener noreferrer"&gt;Geurts et al. (2006)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.cs.ubc.ca/~hutter/EARG.shtml/earg/papers05/rasmussen_gps_in_ml.pdf" rel="noopener noreferrer"&gt;Gaussian Processes by Rasmussen (2004)&lt;/a&gt;
and &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.nature.com/articles/nature14539" rel="noopener noreferrer"&gt;Deep Learning by Yann LeCun, Yoshua Bengio &amp;amp; Geoffrey Hinton (formally introduced in 2015 after networks actually started going deep)&lt;/a&gt; &lt;a href="https://mega.nz/file/moUTBIbZ#tDX3dH43s8825qe8ICSN7uFWw91EUKpzpn6BEBbfWWg" rel="noopener noreferrer"&gt;Read it here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://deepmind.com/" rel="noopener noreferrer"&gt;DeepMind&lt;/a&gt; pioneered by applying deep learning methods into RL and since then, DRLs have achieved beyond human-level performance across many challenging domains (two of which you saw above).&lt;/p&gt;

&lt;h1&gt;
  
  
  Fundamentals of RL
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frwrp85s9h4ysboo4fmwp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frwrp85s9h4ysboo4fmwp.png" alt="Alt Text" width="435" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two major RL entities are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt; - a piece of software that implements some &lt;em&gt;policy&lt;/em&gt;. This policy decides what action to take at every time step, given the current observations, thus solving a given problem in a more-or-less efficient way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt; - a model of the world that is external to the agent and communicates with it through a limited interface: rewards (obtained from the environment), actions (executed by the agent and given to the environment), and observations (information besides the rewards that the agent receives from the environment). The state of the environment can change based on the agent's actions. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actions&lt;/strong&gt;, &lt;strong&gt;Observations&lt;/strong&gt; and &lt;strong&gt;Rewards&lt;/strong&gt; are the communication channels here.&lt;br&gt;
Actions are things an agent can do in the environment; they can be either discrete or continuous.&lt;/p&gt;

&lt;p&gt;Observations are pieces of information that the environment provides the agent with about what's going on around it. An observation may or may not be relevant to the upcoming reward, but it is not what drives the agent's learning process (the reward is).&lt;/p&gt;

&lt;p&gt;The purpose of a reward is to give the agent feedback about its success, and it's a central concept in RL. The term &lt;em&gt;reinforcement&lt;/em&gt; comes from the fact that a reward obtained by an agent should reinforce its behavior in a positive or negative way.&lt;/p&gt;

&lt;p&gt;That's basically the fundamentals of RL. The environment could be an extremely complicated physics model, and an agent could easily be a large neural network implementing the latest RL algorithm, but the basic pattern stays the same: &lt;em&gt;on every step, an agent takes some observations from the environment, does its calculations, and selects the action to issue. The result of this action is a reward and a new observation. The final goal is to achieve the largest accumulated reward over its sequence of actions.&lt;/em&gt;&lt;/p&gt;
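&lt;p&gt;This basic pattern can be sketched in a few lines of Python. Everything here (&lt;code&gt;ToyEnv&lt;/code&gt;, &lt;code&gt;RandomAgent&lt;/code&gt;, the action names) is a made-up stand-in to show the loop, not any particular RL library:&lt;/p&gt;

```python
import random

class RandomAgent:
    """A placeholder agent: ignores observations and acts randomly."""
    def act(self, observation):
        return random.choice(["left", "right", "forward"])

class ToyEnv:
    """A hypothetical stand-in environment: 10 steps, random rewards."""
    def __init__(self):
        self.steps_left = 10
    def observe(self):
        return [0.0, 0.0, 0.0]            # dummy observation vector
    def step(self, action):
        self.steps_left -= 1
        reward = random.random()          # stand-in for food/shock feedback
        done = self.steps_left == 0
        return reward, done

env, agent = ToyEnv(), RandomAgent()
total_reward, done = 0.0, False
while not done:
    obs = env.observe()                   # the agent takes an observation,
    action = agent.act(obs)               # selects an action,
    reward, done = env.step(action)       # and receives a reward
    total_reward += reward
print(total_reward)
```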

&lt;h1&gt;
  
  
  RL Methods
&lt;/h1&gt;

&lt;p&gt;RL methods are mostly categorized as&lt;/p&gt;

&lt;h3&gt;
  
  
  Model-free or Model-based
&lt;/h3&gt;

&lt;p&gt;Model-free methods don't build a model of the environment or the reward; they directly connect observations to actions (or to values related to actions). In other words, the agent takes the current observations, does some computations on them, and the result is the action it should take. These methods are usually easier to train.&lt;br&gt;
In contrast,&lt;br&gt;
model-based methods try to predict what the next observation and/or reward will be. Based on this prediction, the agent tries to choose the best possible action to take, very often making such predictions multiple times to look more and more steps into the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Policy-based or Value-based
&lt;/h3&gt;

&lt;p&gt;Policy-based methods directly approximate the policy of the agent, that is, which actions the agent should carry out at every step. The policy is usually represented by a probability distribution over the available actions.&lt;br&gt;
In contrast,&lt;br&gt;
value-based methods have the agent calculate the value of every possible action and choose the action with the best value, instead of working with action probabilities.&lt;/p&gt;
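&lt;p&gt;The contrast can be made concrete with a toy snippet; the numbers below are arbitrary, standing in for the outputs of two hypothetical network heads:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed outputs of two hypothetical networks for one state, 3 actions.
policy_logits = np.array([2.0, 0.5, -1.0])   # policy-based head
q_values = np.array([1.2, 3.4, 0.7])         # value-based head

# Policy-based: turn scores into a probability distribution, then sample.
probs = np.exp(policy_logits) / np.exp(policy_logits).sum()
sampled_action = int(rng.choice(len(probs), p=probs))

# Value-based: deterministically pick the action with the best value.
greedy_action = int(np.argmax(q_values))
```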

&lt;h3&gt;
  
  
  On-policy or Off-policy
&lt;/h3&gt;

&lt;p&gt;On-policy methods heavily depend on the training data being sampled according to the current policy we're updating.&lt;br&gt;
In contrast,&lt;br&gt;
off-policy methods can learn from old historical data obtained by a previous version of the agent.&lt;/p&gt;

&lt;h1&gt;
  
  
  Cross-Entropy
&lt;/h1&gt;

&lt;p&gt;One of the basic RL methods is &lt;strong&gt;cross-entropy&lt;/strong&gt; - a model-free, policy-based, and on-policy method which simply means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn't build any model of the environment; it just tells the agent what to do at every step&lt;/li&gt;
&lt;li&gt;It approximates the policy of the agent&lt;/li&gt;
&lt;li&gt;It requires fresh data obtained from the environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, the agent passes an observation from the environment to the network, gets a probability distribution over actions (i.e. the policy), and performs random sampling from that distribution to get an action to carry out (similar to an ML classification task). At the beginning of training, when the weights are random, the agent behaves randomly. After the agent gets an action to issue, it fires the action at the environment and obtains the next observation and a reward for the last action. Each such sequence of interactions is called an &lt;strong&gt;episode&lt;/strong&gt;, and the loop of episodes continues.&lt;/p&gt;
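&lt;p&gt;One common way to turn this loop into training data for the cross-entropy method (not spelled out above) is to play a batch of episodes and keep only the "elite" ones whose total reward is above a percentile. A minimal sketch, with a made-up random policy standing in for the network:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
N_ACTIONS, PERCENTILE = 3, 70

def play_episode():
    """Hypothetical stand-in: roll out one 10-step episode with a uniform
    random policy, returning (total_reward, [(obs, action), ...])."""
    steps, total_reward = [], 0.0
    for _ in range(10):
        obs = rng.normal(size=4)                  # dummy observation
        probs = np.ones(N_ACTIONS) / N_ACTIONS    # a network's output would go here
        action = int(rng.choice(N_ACTIONS, p=probs))  # sample from the policy
        steps.append((obs, action))
        total_reward += rng.random()
    return total_reward, steps

# Collect a batch of episodes and keep only the "elite" ones above a
# reward percentile; their (obs, action) pairs become training targets.
batch = [play_episode() for _ in range(16)]
rewards = [r for r, _ in batch]
bound = np.percentile(rewards, PERCENTILE)
elite = [steps for r, steps in batch if r >= bound]
```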

&lt;p&gt;But there are several limitations of the cross-entropy method:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For training, our episodes have to be short&lt;/li&gt;
&lt;li&gt;The total reward for the episodes should have enough variability to separate good episodes from bad ones&lt;/li&gt;
&lt;li&gt;There is no intermediate indication about whether the agent has succeeded or failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, for interesting environments, the optimal policy is much harder to formulate, and it's even harder to prove its optimality.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Bellman Equation of optimality
&lt;/h1&gt;

&lt;p&gt;Richard Bellman proved that &lt;em&gt;the optimal value of a state equals the maximum, over actions, of the expected immediate reward plus the discounted long-term value of the next state&lt;/em&gt;. The Bellman optimality equation for the value &lt;code&gt;V&lt;/code&gt; of state &lt;code&gt;S0&lt;/code&gt; is given as&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx2tmpsu4iwggpixt007b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx2tmpsu4iwggpixt007b.png" alt="Alt Text" width="436" height="59"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;where &lt;code&gt;r&lt;/code&gt; is the reward for state &lt;code&gt;s&lt;/code&gt; and action &lt;code&gt;a&lt;/code&gt;, 𝛾 is a discount factor between 0 and 1, and &lt;code&gt;p(a, i-&amp;gt;j)&lt;/code&gt; is the probability that action &lt;code&gt;a&lt;/code&gt;, issued in state &lt;code&gt;i&lt;/code&gt;, ends up in state &lt;code&gt;j&lt;/code&gt;.&lt;/p&gt;



&lt;p&gt;You may also notice that this definition is recursive: the value of the state is defined via the values of immediately reachable states. These values not only give us the best reward that we can obtain, but they basically give us the optimal policy to obtain that reward: if our agent knows the value for every state, then it automatically knows how to gather all this reward.&lt;/p&gt;

&lt;p&gt;Now the value of action &lt;code&gt;Q(s, a)&lt;/code&gt; which equals the total reward we can get by executing action &lt;code&gt;a&lt;/code&gt; in state &lt;code&gt;s&lt;/code&gt; is defined as&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1ey4hx9vbuybns2ydu75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1ey4hx9vbuybns2ydu75.png" alt="Alt Text" width="385" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also define &lt;code&gt;V(s)&lt;/code&gt; via &lt;code&gt;Q(s, a)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9gnds3yvsk0ff64yd5iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9gnds3yvsk0ff64yd5iw.png" alt="Alt Text" width="132" height="50"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, we can express &lt;code&gt;Q(s, a)&lt;/code&gt; via itself&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxl9fvytafbj2a8bh0ti7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxl9fvytafbj2a8bh0ti7.png" alt="Alt Text" width="257" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A general procedure to calculate &lt;code&gt;Q(s, a)&lt;/code&gt; is the &lt;strong&gt;value iteration algorithm&lt;/strong&gt;, which numerically calculates the values of actions when transition probabilities and rewards are known. It includes the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initialize all &lt;code&gt;Q(s, a)&lt;/code&gt; to zero&lt;/li&gt;
&lt;li&gt;For every state &lt;code&gt;s&lt;/code&gt; and every action &lt;code&gt;a&lt;/code&gt; in this state, perform update: &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F331dnf6aqryz7bu2z21h.png" alt="Alt Text" width="313" height="29"&gt;
&lt;/li&gt;
&lt;li&gt;Repeat step 2&lt;/li&gt;
&lt;/ul&gt;
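&lt;p&gt;Under the assumption that the transition probabilities and rewards are fully known (which, as discussed below, is rarely true in practice), the three steps above can be sketched for a tiny made-up 3-state, 2-action problem as:&lt;/p&gt;

```python
import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2

# Hypothetical known dynamics: transition probabilities p[s, a, s']
# and rewards r[s, a, s'] for a tiny random MDP.
rng = np.random.default_rng(0)
p = rng.random((N_STATES, N_ACTIONS, N_STATES))
p /= p.sum(axis=2, keepdims=True)            # each p[s, a, :] must sum to 1
r = rng.random((N_STATES, N_ACTIONS, N_STATES))

# Step 1: initialize all Q(s, a) to zero.
q = np.zeros((N_STATES, N_ACTIONS))

# Steps 2-3: repeatedly apply the Bellman update until Q stops changing.
for _ in range(500):
    v = q.max(axis=1)                        # V(s) = max_a Q(s, a)
    q_new = (p * (r + GAMMA * v)).sum(axis=2)
    if np.abs(q_new - q).max() < 1e-9:
        break
    q = q_new
```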

&lt;p&gt;Q-values are much more convenient, as it's much simpler for the agent to make decisions about actions based on Q than based on V. Using Q, to choose an action in a given state the agent just calculates Q for all available actions and picks the action with the largest value. To do the same using values of states, the agent needs to know not only the values but also the transition probabilities. In practice, we rarely know them in advance, so the agent would need to estimate transition probabilities for every action and state pair.&lt;/p&gt;

&lt;h1&gt;
  
  
  Deep Q-Learning
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F386r1g1ou2nzqogo7j8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F386r1g1ou2nzqogo7j8t.png" alt="Alt Text" width="529" height="70"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Five Atari 2600 games: Pong, Breakout, Space Invaders, Seaquest, Beam Rider&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;In practice, the &lt;strong&gt;value-iteration algorithm&lt;/strong&gt; has several obvious limitations. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first obvious problem is the count of environment states and our ability to iterate over them. In the Value iteration, we assume that we know all states in our environment in advance, can iterate over them, and can store value approximation associated with the state. &lt;br&gt;
To give you an example of an environment with a much larger number of potential states, let's consider the Atari 2600 game platform (a very popular benchmark among RL researches). Let's calculate the state space for the Atari platform. The resolution of the screen is 210 x 160 pixels, and every pixel has one of 128 colors. So, every frame of the screen has 210 × 160 = 33600 pixels and the total amount of different screens possible is &lt;code&gt;128^33600&lt;/code&gt;, which is slightly more than &lt;code&gt;10^70802&lt;/code&gt;. If we decide to just enumerate all possible states of Atari once, it will take billions of years even for the fastest supercomputer. Also, 99.9% of this job will be a waste of time, as most of the combinations will never be shown during even long gameplay, so we'll never have samples of those states. So, this causes a major limitation for the value iteration method.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second problem with the value iteration approach is that it limits us to discrete action spaces. Indeed, both &lt;code&gt;Q(s, a)&lt;/code&gt; and &lt;code&gt;V(s)&lt;/code&gt; approximations assume that our actions are a mutually exclusive discrete set, which is not true for continuous control problems where actions can represent continuous variables, such as the angle of a steering wheel. This issue is much more challenging than the first (and I would like to cover it in another blog), so for now, let's assume that we have a discrete count of actions and this count is not large.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
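&lt;p&gt;The state-count arithmetic above is easy to verify:&lt;/p&gt;

```python
import math

PIXELS = 210 * 160          # screen resolution
COLORS = 128                # palette size per pixel

assert PIXELS == 33600
# The number of distinct screens is 128**33600; take log10 to see its scale.
digits = PIXELS * math.log10(COLORS)
print(round(digits))        # ≈ 70802, i.e. about 10^70802 possible screens
```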

&lt;p&gt;To address the first limitation, we can use states actually obtained from the environment to update the values of states, which saves a lot of resources. This modification of the value iteration method is known as &lt;strong&gt;Q-learning&lt;/strong&gt;. But what if the count of observable states is huge (though not infinite)? For example, Atari games can show a large variety of different screens, so if we decide to use raw pixels as individual states, we quickly realize that we have too many states to track and approximate values for. Consider, for example, two situations in a game of Pong where the agent has to act differently because of a change in the single pixel representing the ball.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvjdm9jaa80xhw6c8xh07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvjdm9jaa80xhw6c8xh07.png" alt="Alt Text" width="439" height="264"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;On the left, the agent (green paddle) doesn't need to move, but on the right it has to move quickly to avoid losing a point.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;As a solution to this problem, we can use a nonlinear representation that maps both state and action onto a value, which can be trained using a &lt;strong&gt;deep neural network&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In short, researchers have discovered many tips and tricks to make Deep Q-Network training more stable and efficient. Tricks like &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;𝟄-greedy: switching between random actions and the Q policy using the probability hyperparameter 𝟄. By varying 𝟄, we can select the ratio of random actions. The usual practice is to start with 𝟄 = 1.0 (100% random actions, which is better at the start as it gives us more uniformly distributed information about the environment states) and slowly decrease it to some small value such as 5% or 2% random actions. This helps both to explore the environment at the beginning and to stick to a good policy at the end of training. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;replay buffer: keep a large buffer of past experience and sample training data from it, instead of using only the latest experience. This lets us train on more-or-less independent data, while the data is still fresh enough, being generated by our recent policy.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;target network: to make training more stable, we keep a copy of our network and use it for the &lt;code&gt;Q(s′, a′)&lt;/code&gt; value in the Bellman equation. This network is synchronized with the main network only periodically, for example, once every N steps (where N is usually a fairly large hyperparameter, such as 1k or 10k training iterations).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;formed the basis that allowed &lt;a href="https://deepmind.com/" rel="noopener noreferrer"&gt;DeepMind&lt;/a&gt; to successfully train a DQN on a set of 7 Atari games and demonstrate the efficiency of this approach applied to complicated environments. Later, at the beginning of 2015, they &lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=&amp;amp;ved=2ahUKEwi_ntne2NTrAhXr6XMBHXIyA18QFjABegQIBRAB&amp;amp;url=https%3A%2F%2Fweb.stanford.edu%2Fclass%2Fpsych209%2FReadings%2FMnihEtAlHassibis15NatureControlDeepRL.pdf&amp;amp;usg=AOvVaw0uqHxqo8Yyn3cmySQWqe8Z" rel="noopener noreferrer"&gt;published a revised version of the 2013 paper&lt;/a&gt; covering 49 different games.&lt;/p&gt;
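&lt;p&gt;The 𝟄-greedy and replay-buffer tricks are simple to implement. A minimal sketch (the class and function names here are my own, not DeepMind's code):&lt;/p&gt;

```python
import collections
import random

Transition = collections.namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-size buffer of past transitions; sampling from it yields
    more-or-less independent mini-batches for training."""
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)  # old entries fall off
    def push(self, *args):
        self.buffer.append(Transition(*args))
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    def __len__(self):
        return len(self.buffer)

def select_action(q_values, epsilon):
    """Epsilon-greedy: with probability epsilon act randomly, else greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```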

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl05829kwfpuu7wabxurz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl05829kwfpuu7wabxurz.gif" alt="Alt Text" width="295" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final form of the DQN algorithm used in the paper contains the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Initialize parameters for &lt;code&gt;Q(s, a)&lt;/code&gt; and &lt;code&gt;ℚ(s, a)&lt;/code&gt; with random weights, 𝟄 -&amp;gt; 1.0, and empty replay buffer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With probability 𝟄, select a random action &lt;code&gt;a&lt;/code&gt;, otherwise &lt;code&gt;a = argmax_a Q(s, a)&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Execute action &lt;code&gt;a&lt;/code&gt; in an emulator and observe reward &lt;code&gt;r&lt;/code&gt; and the next state&lt;code&gt;s'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store transition &lt;code&gt;(s, a, r, s')&lt;/code&gt; in the replay buffer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sample a random mini-batch of transitions from the replay buffer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For every transition in the sampled mini-batch, calculate the target &lt;code&gt;y = r&lt;/code&gt; if the episode has ended at this step, or &lt;code&gt;y = r + 𝛾 max ℚ(s', a')&lt;/code&gt; otherwise&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate loss, ℒ = (Q(s, a) - y)^2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update Q(s, a) using the &lt;a href="https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-family-of-sngd-optimizers-4ek6"&gt;SGD optimizer&lt;/a&gt; by minimizing the loss with respect to the model parameters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every N steps copy weights from Q to ℚ&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Repeat from step 2 until converged.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
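&lt;p&gt;A condensed, runnable sketch of these steps, using a tabular array as a stand-in for the Q-network and a made-up toy environment (no deep network, replay buffer, or emulator here; the control flow is what the sketch shows):&lt;/p&gt;

```python
import random

GAMMA, EPS_DECAY, SYNC_EVERY = 0.99, 0.995, 50
N_STATES, N_ACTIONS = 5, 2

class ChainEnv:
    """Made-up toy environment: walk right along a 5-state chain;
    reaching the last state yields reward 1 and ends the episode."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + (1 if a == 1 else 0), N_STATES - 1)
        done = self.s == N_STATES - 1
        return self.s, (1.0 if done else 0.0), done

# Step 1: tabular stand-ins for the network Q and its target copy.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
target_Q = [row[:] for row in Q]

env, eps, step_count = ChainEnv(), 1.0, 0
for episode in range(200):
    s, done = env.reset(), False
    while not done:
        # Step 2: epsilon-greedy action selection.
        if random.random() < eps:
            a = random.randrange(N_ACTIONS)
        else:
            a = Q[s].index(max(Q[s]))
        s2, r, done = env.step(a)                # step 3: act, observe r and s'
        # Steps 4-6, collapsed: train on the fresh transition instead of
        # a replay-buffer mini-batch, to keep the sketch short.
        y = r if done else r + GAMMA * max(target_Q[s2])
        Q[s][a] += 0.1 * (y - Q[s][a])           # steps 7-8: reduce (Q - y)^2
        s = s2
        step_count += 1
        if step_count % SYNC_EVERY == 0:         # step 9: sync the target copy
            target_Q = [row[:] for row in Q]
    eps = max(0.02, eps * EPS_DECAY)             # anneal epsilon toward 2%
```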

&lt;h1&gt;
  
  
  Remarks
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc74t15xy6ryzrz00jfs4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc74t15xy6ryzrz00jfs4.gif" alt="Alt Text" width="1024" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This paper had a significant effect on the Reinforcement Learning field by demonstrating that, despite common belief, it's possible to use nonlinear approximators in RL. This proof of concept stimulated great interest in deep Q-learning in particular and in deep RL in general. Since 2015, many improvements and tweaks to the basic architecture have been proposed, which significantly improve the convergence, stability, and sample efficiency of the basic DQN invented by DeepMind. In January 2016, DeepMind published a paper titled &lt;a href="https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf" rel="noopener noreferrer"&gt;Mastering the game of Go with deep neural networks and tree search&lt;/a&gt;, which presented the AlphaGo version that defeated European Go champion Fan Hui. Then, in October 2017, they published a paper titled &lt;a href="https://arxiv.org/pdf/1710.02298" rel="noopener noreferrer"&gt;Rainbow: Combining Improvements in Deep Reinforcement Learning&lt;/a&gt;, which combined the seven most important improvements to DQN, reaching SOTA results on the Atari benchmark. In the years since, this technology has progressed a lot, has produced some amazing results, and will surely keep surprising us in the future.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Boost your Colab Notebooks with GCP and AWS Instance in a few minutes</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Fri, 28 Aug 2020 07:56:31 +0000</pubDate>
      <link>https://dev.to/zohebabai/boost-your-colab-notebooks-with-gcp-and-aws-instance-within-a-few-minutes-47ma</link>
      <guid>https://dev.to/zohebabai/boost-your-colab-notebooks-with-gcp-and-aws-instance-within-a-few-minutes-47ma</guid>
      <description>&lt;p&gt;For Data Scientists, Notebooks have become the 'de facto' tool while working on a project. Whether they perform EDA on the initially available dataset or begin with some data preprocessing steps or experiment with different models and libraries, notebook is the one they begin with. In this respect, &lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;Colab Notebooks&lt;/a&gt; are second to none - Easily available with almost all the libraries pre-installed, efficient enough with readily available GPUs and TPUs, can be saved on multiple locations from Drive to GitHub and also shared live with colleagues.&lt;/p&gt;

&lt;p&gt;But sometimes you might require more resources than Colab typically offers - for example, &lt;strong&gt;multi-GPUs&lt;/strong&gt;, &lt;strong&gt;higher GPU RAM&lt;/strong&gt;, &lt;strong&gt;a better GPU&lt;/strong&gt;, or &lt;strong&gt;more than 12 hours of runtime&lt;/strong&gt; (the default) to conclude a successful DS experiment in your notebook. In this blog, I shall cover how to upgrade your Colab in a few minutes, receiving all the above-mentioned benefits without moving your code elsewhere, by using a Google Cloud Platform or Amazon Web Services instance as its backend.&lt;/p&gt;

&lt;h1&gt;
  
  
  Google Cloud Platform
&lt;/h1&gt;

&lt;p&gt;When you first sign up on GCP, you get $300 in free credits. &lt;/p&gt;

&lt;h3&gt;
  
  
  Request an increase in GPU Quota
&lt;/h3&gt;

&lt;p&gt;Your account typically does not come with a GPU quota. You have to explicitly request it under &lt;strong&gt;IAM Admin &amp;gt; Quotas&lt;/strong&gt;.&lt;br&gt;
You should change your quota for &lt;strong&gt;GPUs (all regions)&lt;/strong&gt;: filter the metric to GPUs (all regions) and increase its limit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frsctfqaele0juk6oibhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frsctfqaele0juk6oibhu.png" alt="Alt Text" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then submit the request. Wait until GCP sends you a second email (the first email just notifies you that they received the request). It may take a couple of minutes (or possibly hours) for them to approve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an Instance of your choice
&lt;/h2&gt;

&lt;p&gt;Now go to &lt;strong&gt;Compute Engine &amp;gt; VM instances&lt;/strong&gt; and create an instance&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6cb5yac70wfbig2ykz8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6cb5yac70wfbig2ykz8w.png" alt="Alt Text" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fill in the details as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fghlh199obqc2ff80ohk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fghlh199obqc2ff80ohk0.png" alt="Alt Text" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcndwpyn002epqcjbhyg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcndwpyn002epqcjbhyg3.png" alt="Alt Text" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbh9sqimgkgmcgfemxtix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbh9sqimgkgmcgfemxtix.png" alt="Alt Text" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Furgqzgqw8mlslw3oxjiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Furgqzgqw8mlslw3oxjiy.png" alt="Alt Text" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyitkwbmjvkq876ffbcef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyitkwbmjvkq876ffbcef.png" alt="Alt Text" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And click &lt;strong&gt;Create&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Also, complete the step shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftdaevvcdk969z7on16gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftdaevvcdk969z7on16gx.png" alt="Alt Text" width="697" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once your instance is up, it should look similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F84fvgfl0apt1i46o641k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F84fvgfl0apt1i46o641k.png" alt="Alt Text" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Amazon Web Services
&lt;/h1&gt;

&lt;p&gt;If you are using AWS for the first time, you can apply for $300 in free credits. &lt;/p&gt;

&lt;h3&gt;
  
  
  Request an increase in GPU Quota
&lt;/h3&gt;

&lt;p&gt;Your account typically does not come with any GPU quota; you have to request it explicitly under &lt;strong&gt;Support &amp;gt; Create Case &amp;gt; Service Limit Increase&lt;/strong&gt;. Request a limit increase to 1 for EC2 All P instances in your region. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv4ejanx8rpq1up7e1kpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv4ejanx8rpq1up7e1kpq.png" alt="Alt Text" width="562" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Write a short description of your use case and submit the request. Then wait for AWS to get back to you; approval can take anywhere from a few minutes to a few hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an Instance of your choice
&lt;/h2&gt;

&lt;p&gt;Now go to &lt;strong&gt;EC2 &amp;gt; Launch instance&lt;/strong&gt; and create an instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhyqhfkhjr9knmfm1azzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhyqhfkhjr9knmfm1azzf.png" alt="Alt Text" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the steps as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0mknlmpru643r5makenl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0mknlmpru643r5makenl.png" alt="Alt Text" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F58vztaz7ya5sd8gl3eww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F58vztaz7ya5sd8gl3eww.png" alt="Alt Text" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F96or3m81rkktaq9zzaly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F96or3m81rkktaq9zzaly.png" alt="Alt Text" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7wqgztspctcsrfugg69y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7wqgztspctcsrfugg69y.png" alt="Alt Text" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6covf3nruy8lih6rwft9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6covf3nruy8lih6rwft9.png" alt="Alt Text" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm233p45raxn2b8p7046d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm233p45raxn2b8p7046d.png" alt="Alt Text" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Download the key pair to your local storage (and keep a copy in a secure location too). Tick the box and click &lt;strong&gt;Launch Instances&lt;/strong&gt;. You should see something similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq556ly9f34nh8mubcz2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq556ly9f34nh8mubcz2s.png" alt="Alt Text" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once your instance is up, it should look similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxl8q0m7hgbn8a74n34hv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxl8q0m7hgbn8a74n34hv.png" alt="Alt Text" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Command Line Terminal and Colab
&lt;/h1&gt;

&lt;h2&gt;
  
  
  For GCP:
&lt;/h2&gt;

&lt;p&gt;Install &lt;a href="https://cloud.google.com/sdk/docs/quickstarts" rel="noopener noreferrer"&gt;Google Cloud SDK using the quick start&lt;/a&gt; for your operating system.&lt;/p&gt;

&lt;p&gt;Once the Google Cloud SDK is set up, initialize it for your Google account: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;gcloud init&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now connect to your server and forward the port:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gcloud beta compute ssh --zone "us-central1-a" "colab-backend" --project "myfirstproject" -- -L 8888:localhost:8888&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  For AWS:
&lt;/h2&gt;

&lt;p&gt;Go to the directory where your EC2 key pair is located and restrict its permissions (SSH refuses to use a key that is readable by others):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;chmod 0400 awsec2keypair1.pem&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, SSH into the instance and forward the notebook port:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ssh -L localhost:8888:localhost:8888 -i "awsec2keypair1.pem" ubuntu@ec2-33-142-118-69.compute-1.amazonaws.com&lt;/code&gt;&lt;/p&gt;
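&lt;p&gt;Both the &lt;code&gt;gcloud&lt;/code&gt; and plain &lt;code&gt;ssh&lt;/code&gt; commands above use the same local port-forwarding pattern. As a rough sketch (the helper function below is hypothetical, not part of any SDK), the pieces can be made explicit in Python:&lt;/p&gt;

```python
# Hypothetical helper: assemble the SSH port-forwarding command used above.
# The key path, user, and host are placeholders -- substitute your own values.
def ssh_tunnel_cmd(key_path, user, host, port=8888):
    """Build the argument list for an SSH local port forward (-L)."""
    return [
        "ssh",
        "-L", f"localhost:{port}:localhost:{port}",  # map local port -> remote notebook port
        "-i", key_path,                              # EC2 key pair file (chmod 0400 first)
        f"{user}@{host}",
    ]

cmd = ssh_tunnel_cmd("awsec2keypair1.pem", "ubuntu",
                     "ec2-33-142-118-69.compute-1.amazonaws.com")
print(" ".join(cmd))
```

&lt;p&gt;The &lt;code&gt;-L localhost:8888:localhost:8888&lt;/code&gt; flag is what lets your local browser (and Colab) reach the Jupyter server running on the remote instance.&lt;/p&gt;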

&lt;h2&gt;
  
  
  Common for both GCP &amp;amp; AWS:
&lt;/h2&gt;

&lt;p&gt;First, make sure the &lt;code&gt;jupyter_http_over_ws&lt;/code&gt; extension is installed and enabled: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install --upgrade jupyter_http_over_ws&amp;gt;=0.0.1a3 &amp;amp;&amp;amp; \&lt;br&gt;
  jupyter serverextension enable --py jupyter_http_over_ws&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For AWS (not required for GCP):&lt;/strong&gt; Before launching the notebook, activate the PyTorch environment:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;source activate pytorch_p36&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can list the available environments by running &lt;code&gt;conda info --envs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finally, launch your Jupyter notebook:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;jupyter notebook \&lt;br&gt;
  --NotebookApp.allow_origin='https://colab.research.google.com' \&lt;br&gt;
  --port=8888 \&lt;br&gt;
  --NotebookApp.port_retries=0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It should display something similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frso2rdksszq15ykcpwiz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frso2rdksszq15ykcpwiz.png" alt="Alt Text" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy the notebook link, go to Colab and follow these steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2gyni8dph5o3hpyqnqvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2gyni8dph5o3hpyqnqvb.png" alt="Alt Text" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9s6jmpkw933xfzo7ib4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9s6jmpkw933xfzo7ib4x.png" alt="Alt Text" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and click &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To check whether the GCP/AWS backend is integrated:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu12mibrdg8943pfq0r00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu12mibrdg8943pfq0r00.png" alt="Alt Text" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: Google Colab currently doesn’t support integration with Google Drive while connected to a local runtime.&lt;/p&gt;

&lt;p&gt;And Voilà! Your &lt;strong&gt;SUPER-Colab&lt;/strong&gt; is ready!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8c547ysatg5oe6r5gxyt.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8c547ysatg5oe6r5gxyt.gif" alt="Alt Text" width="328" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important: Don't forget to stop your GCP/AWS instance once you are done.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Let me know in the comments if you face any issues. &lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>beginners</category>
      <category>productivity</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Step-by-Step Instructions for Testing your GitHub Python Project using GitHub Actions</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sun, 23 Aug 2020 04:23:11 +0000</pubDate>
      <link>https://dev.to/zohebabai/step-by-step-instructions-for-testing-your-github-python-project-using-github-actions-227b</link>
      <guid>https://dev.to/zohebabai/step-by-step-instructions-for-testing-your-github-python-project-using-github-actions-227b</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/l6fV09z5XHk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the complete step-by-step video tutorial for creating a python project and testing it automatically on every commit using GitHub actions. This can be a useful addition to your project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ZohebAbai/python_test_repo" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This tutorial was created for autograding python assignments in GitHub Classroom&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>python</category>
      <category>github</category>
    </item>
    <item>
      <title>A Decade of Deep CNN Archs. - GoogLeNet (ILSVRC Winner 2014)</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sun, 16 Aug 2020 05:49:39 +0000</pubDate>
      <link>https://dev.to/zohebabai/googlenet-ilsvrc-winner-2014-211e</link>
      <guid>https://dev.to/zohebabai/googlenet-ilsvrc-winner-2014-211e</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kcfhsuxcqq0z9slxygq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kcfhsuxcqq0z9slxygq.jpg" alt="Inception" width="400" height="226"&gt;&lt;/a&gt;&lt;/p&gt;
The meme from which it derived its name



&lt;p&gt;Until 2014, CNN architectures had a standard design:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Stacked convolutional layers with ReLU activations
2. Optionally followed by contrast normalization and max-pooling and dropouts to address the problem of overfitting
3. Followed by one or more fully connected layers at the end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Variants of this design were prevalent in the image classification literature and had yielded the best results on the MNIST and CIFAR10/100 datasets. On the ImageNet classification challenge dataset, the recent trend had been to blindly increase &lt;em&gt;the number of layers (depth)&lt;/em&gt; and &lt;em&gt;the number of units at each level (width)&lt;/em&gt;. Bucking this trend, and taking inspiration and guidance from the theoretical work of &lt;a href="https://arxiv.org/pdf/1310.6343.pdf" rel="noopener noreferrer"&gt;Arora et al.&lt;/a&gt;, GoogLeNet takes a slightly different route in its architectural design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Major Drawbacks of Design Trend for a Bigger Size
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A large number of parameters - which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increased use of computational resources - any uniform increase in the number of units in two consecutive convolutional layers, results in a quadratic increase in computation. The efficient distribution of computing resources is always preferred to an indiscriminate increase in size since practically the computational budget is always finite.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;The main result of a theoretical study performed by &lt;a href="https://arxiv.org/pdf/1310.6343.pdf" rel="noopener noreferrer"&gt;Arora et al.&lt;/a&gt; states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer-by-layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. The statement resonates with the well-known &lt;strong&gt;Hebbian principle&lt;/strong&gt; – &lt;em&gt;neurons that fire together, wire together.&lt;/em&gt;&lt;br&gt;
In short, the approach suggested by the theory was to build a non-uniform, sparsely connected architecture that makes use of the extra sparsity, even at the filter level. But since &lt;a href="https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo"&gt;AlexNet&lt;/a&gt;, to better exploit parallel computing, convolutions have been implemented as collections of dense connections to the patches in the earlier layer. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou1poza6xeamarviii40.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou1poza6xeamarviii40.jpg" alt="sparse vs dense" width="638" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, based on suggestions by &lt;a href="https://www.cc.gatech.edu/~umit/papers/Catalyurek10-SISC.pdf" rel="noopener noreferrer"&gt;Çatalyürek et al.&lt;/a&gt; for obtaining SOTA performance in sparse matrix multiplication, the authors clustered sparse matrices into relatively dense submatrices, thus approximating an optimal local sparse structure in a CNN layer (the Inception module) and repeating it spatially throughout the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9q4r8jqiikm7m6wzucqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9q4r8jqiikm7m6wzucqa.png" alt="Inception naive" width="365" height="215"&gt;&lt;/a&gt;&lt;/p&gt;
Naive inception module



&lt;h2&gt;
  
  
  Architectural Details
&lt;/h2&gt;

&lt;p&gt;One big problem with this stacked inception module is that even a modest number of 5×5 convolutions would be prohibitively expensive on top of a convolutional layer with numerous filters. This problem becomes even more pronounced once pooling units are added. Even while the architecture might cover the optimal sparse structure, it would do that very inefficiently; the merging of the output of the pooling layer with the outputs of convolutional layers would definitely lead to a computational blow up within a few stages. &lt;/p&gt;

&lt;p&gt;Thus, the authors borrowed the Network-in-Network architecture proposed by &lt;a href="https://arxiv.org/pdf/1312.4400.pdf" rel="noopener noreferrer"&gt;Lin et al.&lt;/a&gt; to increase the representational power of neural networks. It can be viewed as an additional 1 × 1 convolutional layer, typically followed by a ReLU activation. The authors applied it in two forms: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dimension reductions&lt;/strong&gt; - &lt;em&gt;1×1 convolutions used for computing reductions before the expensive 3×3 and 5×5 convolutions&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;projections&lt;/strong&gt; - &lt;em&gt;1×1 convolutions used for shielding a large number of input filters of the last stage to the next after max-pooling&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;wherever the computational requirements would increase too much (computational bottlenecks). This allows for not just increasing the depth, but also the width of our networks without a significant performance penalty.&lt;/p&gt;
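&lt;p&gt;A quick back-of-the-envelope check shows why these 1×1 reductions matter. The channel counts below are illustrative, not taken from the paper:&lt;/p&gt;

```python
# Parameter count of a 5x5 convolution applied directly vs. behind a
# 1x1 "bottleneck" reduction (biases ignored for simplicity).
def conv_params(k, c_in, c_out):
    """Weights of a k x k convolution with c_in input and c_out output channels."""
    return k * k * c_in * c_out

c_in, c_out, c_reduce = 192, 32, 16  # illustrative channel counts

direct = conv_params(5, c_in, c_out)                            # 5x5 straight on 192 channels
reduced = conv_params(1, c_in, c_reduce) + conv_params(5, c_reduce, c_out)

print(direct, reduced)  # 153600 vs 15872 -- roughly a 10x saving
```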

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe3910tbhxjs68xxzvxqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe3910tbhxjs68xxzvxqr.png" alt="Inception dim reduce" width="363" height="222"&gt;&lt;/a&gt;&lt;/p&gt;
Inception module with embedded NiN 



&lt;p&gt;The authors also added auxiliary classifiers, taking the form of smaller convolutional networks on top of the output of the Inception (4a) and (4d) modules, expecting to &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. encourage discrimination in the lower stages in the classifier
2. increase the gradient signal that gets propagated back
3. provide additional regularization. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;During training, their loss gets added to the total loss of the network with a discount weight of 0.3. At inference time, these auxiliary networks were discarded.&lt;/p&gt;

&lt;p&gt;The exact structure of these auxiliary classifiers is as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.&lt;/li&gt;
&lt;li&gt;A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.&lt;/li&gt;
&lt;li&gt;A fully connected layer with 1024 units and rectified linear activation.&lt;/li&gt;
&lt;li&gt;A dropout layer with a 70% ratio of dropped outputs.&lt;/li&gt;
&lt;li&gt;A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).&lt;/li&gt;
&lt;/ul&gt;
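&lt;p&gt;These shapes can be sanity-checked with the standard pooling output formula (the 14×14 input size of the Inception (4a)/(4d) stage is taken from the architecture table):&lt;/p&gt;

```python
# A 5x5 average pool with stride 3 and no padding over the 14x14 feature maps
# of Inception (4a)/(4d) yields the 4x4 outputs quoted above.
def pool_out(size, kernel, stride):
    """Output side length of a pooling layer without padding."""
    return (size - kernel) // stride + 1

side = pool_out(14, kernel=5, stride=3)
shape_4a = (side, side, 512)  # (4a) has 512 output channels
shape_4d = (side, side, 528)  # (4d) has 528 output channels
print(shape_4a, shape_4d)  # (4, 4, 512) (4, 4, 528)
```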

&lt;p&gt;The complete architecture of GoogLeNet:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feovbeh9sbdfd7ncl4dxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feovbeh9sbdfd7ncl4dxt.png" alt="Complete arch" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;
GoogLeNet Architecture



&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;p&gt;It was trained using the &lt;a href="https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf" rel="noopener noreferrer"&gt;DistBelief&lt;/a&gt; distributed machine learning system. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image Transformations&lt;/strong&gt;&lt;br&gt;
Photometric distortions, as proposed by &lt;a href="https://arxiv.org/pdf/1312.5402.pdf" rel="noopener noreferrer"&gt;Andrew Howard&lt;/a&gt;, were useful to combat overfitting to some extent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizer&lt;/strong&gt;&lt;br&gt;
SGD with Nesterov accelerated gradient and 0.9 momentum&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning Rate Manager&lt;/strong&gt;&lt;br&gt;
Decreasing the learning rate by 4% every 8 epochs&lt;/p&gt;
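&lt;p&gt;This schedule amounts to multiplying the learning rate by 0.96 once every 8 epochs. A minimal sketch (the base learning rate below is illustrative; the paper's value is not quoted here):&lt;/p&gt;

```python
# Step decay: a 4% decrease (factor 0.96) applied once every 8 epochs.
def lr_at(epoch, base_lr=0.01, decay=0.96, step=8):
    """Learning rate in effect at a given epoch under fixed step decay."""
    return base_lr * decay ** (epoch // step)

print(lr_at(0), lr_at(8), lr_at(24))  # decays by 4% at epochs 8, 16, 24, ...
```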

&lt;p&gt;&lt;strong&gt;No. of Layers&lt;/strong&gt;&lt;br&gt;
22&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;The authors adopted a set of techniques during testing to obtain higher performance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Independently trained 7 versions of the same GoogLeNet model and performed ensemble prediction with them. The models differed only in sampling methodologies and the random order in which they saw input images.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adopted a more aggressive cropping approach than that of &lt;a href="https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo"&gt;AlexNet&lt;/a&gt;. Specifically, authors resized the images to 4 scales where the shorter dimension (height or width) was 256, 288, 320 and 352 respectively, taking the left, center, and the right square of these resized images (if portrait images, take the top, center and bottom squares). For each square, they then took the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This results in 4×3×6×2 = 144 crops per image. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The softmax probabilities were averaged over multiple crops and over all the individual classifiers to obtain the final prediction. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
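&lt;p&gt;The crop arithmetic in step 2 can be spelled out explicitly:&lt;/p&gt;

```python
# 4 scales x 3 squares per scale x 6 crops per square
# (4 corners + center + the full square resized) x 2 mirrored versions.
scales, squares, crops_per_square, mirrors = 4, 3, 6, 2
total_crops = scales * squares * crops_per_square * mirrors
print(total_crops)  # 144
```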

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgjgg08j3bv9eifif2goc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgjgg08j3bv9eifif2goc.png" alt="Set for testing" width="685" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GoogLeNet ranked first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxhubyg7g8kn5pl2vd90i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxhubyg7g8kn5pl2vd90i.png" alt="Ranks" width="550" height="257"&gt;&lt;/a&gt;&lt;/p&gt;
Classification Performance Comparison 



&lt;h3&gt;
  
  
  Object Detection
&lt;/h3&gt;

&lt;p&gt;The approach by GoogLeNet for detection was similar to the &lt;a href="https://arxiv.org/pdf/1311.2524.pdf" rel="noopener noreferrer"&gt;R-CNN&lt;/a&gt; proposed by Girshick et al., but augmented with the Inception model as the region classifier.&lt;br&gt;
&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;R-CNN decomposes the overall detection problem into two subproblems: to first utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify object categories at those locations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The region proposal step was improved here by combining the Selective Search approach with multi-box predictions for higher object bounding box recall. To cut down the number of false positives, the superpixel size was increased twofold, improving the mAP (mean average precision) metric by 1% for the single-model case. Contrary to R-CNN, they did not use bounding box regression or localization data for pretraining.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F76byqyopeapiwb332v6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F76byqyopeapiwb332v6g.png" alt="Alt Text" width="661" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, they used an ensemble of 6 CNNs when classifying each region, which improved results from 40% to 43.9% mAP and put them at the top.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2d6hevbq44g4ait7hcg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2d6hevbq44g4ait7hcg6.png" alt="Alt Text" width="724" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Remarks
&lt;/h2&gt;

&lt;p&gt;GoogLeNet's success promised a future of sparser and more refined structures for CNN architectures. It also conveyed a strong message about considering a model's power and memory efficiency when designing a new architecture. Like &lt;a href="https://dev.to/zohebabai/vggnet-ilsvrc-winner-2014-3mpk"&gt;VGGNet&lt;/a&gt;, GoogLeNet also reaffirmed that going deeper and wider was indeed the right direction to improve accuracy.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Top 10 Deep Learning Breakthroughs - Family of SNGD Optimizers</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sat, 08 Aug 2020 21:56:47 +0000</pubDate>
      <link>https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-family-of-sngd-optimizers-4ek6</link>
      <guid>https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-family-of-sngd-optimizers-4ek6</guid>
      <description>&lt;p&gt;Which three concepts lie at the heart of deep learning? - Loss Functions, Optimization algorithms, and Backpropagation. Without them, we would have been 'conceptually' stuck at the perceptron model of the 1950s. &lt;/p&gt;

&lt;p&gt;If you are completely unfamiliar with these terms, feel free to go through CS231n Lecture Notes &lt;a href="https://cs231n.github.io/optimization-1/" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://cs231n.github.io/optimization-2/" rel="noopener noreferrer"&gt;2&lt;/a&gt;, &lt;a href="https://cs231n.github.io/neural-networks-3/" rel="noopener noreferrer"&gt;3&lt;/a&gt; and/or watch this and subsequent videos by Andrew Ng. &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uJryes5Vk1o"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Different problems require different loss functions; most of them are well known to us, and new ones will keep arriving in the future. &lt;br&gt;
Backpropagation, proposed by &lt;a href="https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf" rel="noopener noreferrer"&gt;Hinton et al. in 1986&lt;/a&gt;, drives neural network training. &lt;br&gt;
But the choice of optimization algorithm has always been at the center of discussion and research in the ML community. Researchers have been working for years on building an optimizer that&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. performs equally well on all major tasks - image classification, speech recognition, machine translation, and language modeling 
2. is robust to learning rate (LR) and weight initialization
3. has strong regularization properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you are a beginner and have merely started coding neural networks by following notebooks available online, you are probably most familiar with the &lt;a href="https://arxiv.org/pdf/1412.6980.pdf" rel="noopener noreferrer"&gt;ADAM optimizer (2015)&lt;/a&gt;. It's the default optimizer in almost all DL frameworks, including the popular Keras. There's a reason it's kept as the default: it's robust to different learning rates and weight initializations. If you want to train your architecture on any data and worry less about hyperparameter tuning, just use the defaults and you will get a good enough result for a start.&lt;/p&gt;

&lt;p&gt;But if you are an experienced one, already familiar with deep learning research papers, you know that except when training GANs or doing reinforcement learning (neither of which solves a standard optimization problem), experts mostly use SGD with momentum or Nesterov momentum as their preferred optimization algorithm. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, which one should I choose?&lt;/strong&gt; 🤔&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-adaptive methods
&lt;/h3&gt;

&lt;p&gt;Stochastic Gradient Descent, as we know it now, arrived from its initial form in &lt;a href="https://projecteuclid.org/euclid.aoms/1177729586" rel="noopener noreferrer"&gt;Stochastic Approximation (1951), a landmark paper&lt;/a&gt; in the history of numerical optimization methods, by Robbins and Monro. There it wasn't proposed as a gradient method but as a Markov chain. The form of SGD we are familiar with can be recognized in a subsequent &lt;a href="https://projecteuclid.org/euclid.aoms/1177729392" rel="noopener noreferrer"&gt;paper by Kiefer and Wolfowitz&lt;/a&gt;. Programmers now mostly use the variant &lt;em&gt;mini-batch gradient descent&lt;/em&gt;, with a batch size typically a power of 2 between 32 and 512, for training NNs. But the key challenge faced by SGD is that it gets trapped in the numerous suboptimal local minima and saddle points of the loss function's landscape. So arrived the idea of implementing momentum, borrowed from &lt;a href="https://vsokolov.org/courses/750/2018/files/polyak64.pdf" rel="noopener noreferrer"&gt;Polyak (1964)&lt;/a&gt; by Hinton et al. in their famous &lt;a href="https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf" rel="noopener noreferrer"&gt;backpropagation paper&lt;/a&gt;. Momentum helps accelerate SGD in the relevant direction and dampens its oscillations, but it is still vulnerable to vanishing and exploding gradients. &lt;a href="https://papers.nips.cc/paper/5718-beyond-convexity-stochastic-quasi-convex-optimization.pdf" rel="noopener noreferrer"&gt;Hazan et al. in 2015&lt;/a&gt; showed that the direction of the gradient is sufficient for convergence, proposing Stochastic Normalized Gradient Descent (SNGD) optimizers. But these are still sensitive to 'noisy' gradients, especially during the initial training phase. 
In 2013, &lt;a href="https://www.cs.toronto.edu/~fritz/absps/momentum.pdf" rel="noopener noreferrer"&gt;Hinton et al.&lt;/a&gt; conclusively demonstrated that SGD with momentum or the &lt;a href="http://mpawankumar.info/teaching/cdt-big-data/nesterov83.pdf" rel="noopener noreferrer"&gt;Nesterov accelerated gradient (1983)&lt;/a&gt; (a variant of SNGD), combined with a simple learning-rate annealing schedule and well-chosen weight initialization schemes, achieved better convergence rates and surprisingly good accuracy for several architectures and tasks. &lt;/p&gt;
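&lt;p&gt;In plain Python, the heavy-ball and Nesterov updates can be sketched as follows (a toy scalar illustration for intuition, not any framework's implementation; the hyperparameter values are just examples):&lt;/p&gt;

```python
def sgd_momentum_step(w, g, v, lr=0.1, mu=0.9):
    # Polyak heavy-ball update: the velocity v accumulates past gradients,
    # dampening oscillations across steep directions of the loss surface
    v = mu * v - lr * g
    return w + v, v

def nesterov_step(w, g_lookahead, v, lr=0.1, mu=0.9):
    # Nesterov variant: identical form, but the gradient is evaluated
    # at the look-ahead point w + mu * v instead of at w
    v = mu * v - lr * g_lookahead
    return w + v, v

# one step on f(w) = w^2 (gradient 2w), from w = 1 with zero velocity
w, v = sgd_momentum_step(1.0, 2.0, 0.0)
print(round(w, 4))  # 0.8
```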

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi185ufa36lovsx870qdg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi185ufa36lovsx870qdg.gif" alt="Comparisions" width="620" height="480"&gt;&lt;/a&gt;&lt;/p&gt;
Optimizers on Loss Surface Contours



&lt;h3&gt;
  
  
  Adaptive methods
&lt;/h3&gt;

&lt;p&gt;Parameter updates in the methods described above do not adapt per parameter, and a proper choice of learning rate is crucial. This gave rise to &lt;a href="https://ruder.io/optimizing-gradient-descent/index.html" rel="noopener noreferrer"&gt;several adaptive learning-rate-based optimizers within the next 4-5 years&lt;/a&gt; aimed at improving SNGD's robustness. They generally update the parameters via an exponential moving average of past squared gradients. But even they weren't problem-free: for Adam there is the vanishing or exploding of the second moment, especially during the initial phase of training, and the requirement of double the optimizer memory compared to SGD with momentum, for storing the second moment. In 2018, &lt;a href="https://arxiv.org/pdf/1705.08292.pdf" rel="noopener noreferrer"&gt;Wilson et al.&lt;/a&gt; extensively studied &lt;em&gt;how optimization relates to generalization&lt;/em&gt; and reached three major findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;with the same amount of hyperparameter tuning, SGD and SGD with momentum outperform adaptive methods on the development/test set across all evaluated models and tasks (even if the adaptive methods achieve the same training loss or lower than non-adaptive methods.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the same amount of hyperparameter tuning was required for all methods, including adaptive methods&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
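&lt;p&gt;For reference, the exponential-moving-average machinery that these adaptive methods share can be sketched as a scalar Adam step (following the update rule from the 2015 paper; a toy illustration, not a production implementation):&lt;/p&gt;

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # exponential moving averages of the gradient and its square
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # bias correction compensates for the zero initialization of m and v
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # note: in real implementations v is a full extra state tensor,
    # which is where Adam's doubled optimizer memory comes from
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
print(round(w, 4))  # 0.999
```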

&lt;p&gt;Check out this &lt;a href="https://gist.github.com/wiseodd/85ad008aef5585cec017f4f1e6d67a02" rel="noopener noreferrer"&gt;beautiful python implementation of some common optimizers&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Way Ahead
&lt;/h2&gt;

&lt;p&gt;Last year &lt;a href="https://arxiv.org/pdf/1711.05101.pdf" rel="noopener noreferrer"&gt;Loshchilov et al.&lt;/a&gt; pointed out the major factor behind Adam's poor generalization: &lt;em&gt;L2 regularization is not nearly as effective for it as for SGD&lt;/em&gt;. The authors explain that major deep learning libraries implement only L2 regularization, not the original weight decay (most people confuse the two as identical). To improve regularization in Adam they proposed AdamW, which decouples the weight decay from the gradient-based update. Though to achieve this they paired Adam with cosine learning-rate annealing. 😩&lt;/p&gt;
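&lt;p&gt;The difference between L2 regularization and decoupled weight decay can be seen in a scalar sketch (hypothetical helper functions; &lt;code&gt;denom&lt;/code&gt; stands in for Adam's adaptive denominator):&lt;/p&gt;

```python
def l2_step(w, g, denom, lr=0.1, wd=0.01):
    # L2 regularization: the decay term wd*w is folded into the gradient,
    # so an adaptive method divides it by the same denominator as the gradient
    return w - lr * (g + wd * w) / denom

def adamw_step(w, g, denom, lr=0.1, wd=0.01):
    # decoupled weight decay: the decay bypasses the adaptive rescaling
    return w - lr * g / denom - lr * wd * w

# with a large adaptive denominator, the L2 version barely decays the
# weight at all, while the decoupled version keeps its full strength
print(round(l2_step(1.0, 0.0, 10.0), 5))     # 0.9999
print(round(adamw_step(1.0, 0.0, 10.0), 5))  # 0.999
```

&lt;p&gt;With &lt;code&gt;denom = 1&lt;/code&gt; (plain SGD) the two updates coincide, which is exactly why the two notions are so often confused; they diverge once the gradient is adaptively rescaled.&lt;/p&gt;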

&lt;p&gt;This year the NVIDIA team proposed a new member of the SNGD family: &lt;a href="https://arxiv.org/pdf/1905.11286.pdf" rel="noopener noreferrer"&gt;NovoGrad&lt;/a&gt;, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. They combined SGD's and Adam's strengths by implementing the following three ideas while building it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. starting with Adam, replaced the element-wise second moment with the layer-wise moment
2. computed the first-moment using gradients normalized by the layer-wise second moment
3. decoupled weight decay (WD)from normalized gradients (similar to AdamW)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
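&lt;p&gt;Schematically, one NovoGrad step for a single layer might look like this (a plain-Python sketch of the three ideas above, with simplified initialization; see the paper for the exact algorithm):&lt;/p&gt;

```python
import math

def novograd_layer_step(w, g, m, v, lr=0.01, b1=0.95, b2=0.98, wd=0.001):
    # schematic NovoGrad update for ONE layer (simplified):
    # idea 1 -- v is a single scalar per layer, not a per-weight tensor,
    #           which is where the halved optimizer memory comes from
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    v = b2 * v + (1 - b2) * g_norm * g_norm
    # idea 2 -- the first moment is built from the layer-normalized gradient
    # idea 3 -- weight decay is decoupled, added after the normalization
    m = [b1 * mi + gi / math.sqrt(v) + wd * wi for mi, gi, wi in zip(m, g, w)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m, v

w, m, v = novograd_layer_step([1.0], [2.0], [0.0], 0.0)
print(round(w[0], 3))  # 0.929
```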

&lt;p&gt;Not only does it generalize as well as or better than Adam/AdamW and SGD with momentum on all major tasks, it is also more robust than they are to the initial learning rate and weight initialization. In addition,&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. It works well without LR warm-up (annealing)
2. It performs exceptionally well for large batch training (for ResNet-50 up to 32K)
3. It requires half the memory compared to Adam
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A few days back, &lt;a href="https://arxiv.org/pdf/2007.13985.pdf" rel="noopener noreferrer"&gt;Zhao et al.&lt;/a&gt; proved theoretically and showed empirically that SNGD with momentum can achieve SOTA accuracy with a large batch size (and hence much faster training).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febusnijlwco7v7l50p62.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febusnijlwco7v7l50p62.gif" alt="Yep" width="370" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hope you realize that, after a long search for a better-generalizing optimizer independent of dataset, task, and hyperparameter tuning, we are on the brink of a breakthrough.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Top 10 Deep Learning Breakthroughs - AlexNet</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Wed, 05 Aug 2020 22:35:04 +0000</pubDate>
      <link>https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-alexnet-1670</link>
      <guid>https://dev.to/zohebabai/top-10-deep-learning-breakthroughs-alexnet-1670</guid>
      <description>&lt;p&gt;If you want to learn about AlexNet, check out this blog, where I have covered it extensively. &lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/zohebabai" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F426542%2Fbce71db0-35f6-4410-bad0-04b11805cc99.png" alt="zohebabai"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;A Decade of Deep CNN Archs. - AlexNet (ILSVRC Winner 2012)&lt;/h2&gt;
      &lt;h3&gt;Zoheb Abai ・ Jul 18 '20&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deeplearning&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;Here is an example notebook, in which I have imported a pretrained AlexNet model from PyTorch Library and used it for classifying an image. &lt;/p&gt;

&lt;p&gt;Feel free to play around and discuss.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;If you want to explore a bit more on AlexNet, go through my blog on ZFNet.&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/zohebabai" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F426542%2Fbce71db0-35f6-4410-bad0-04b11805cc99.png" alt="zohebabai"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/zohebabai/zfnet-ilsvrc-runner-up-2013-4hnj" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;A Decade of Deep CNN Archs. - ZFNet (ILSVRC Runner-up 2013)&lt;/h2&gt;
      &lt;h3&gt;Zoheb Abai ・ Jul 25 '20&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deeplearning&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;Although it was an updated version of AlexNet, the paper contributed towards our in-depth understanding of CNN architectures.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Decade of Deep CNN Archs. - VGGNet (ILSVRC Winner 2014)</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sat, 01 Aug 2020 07:56:27 +0000</pubDate>
      <link>https://dev.to/zohebabai/vggnet-ilsvrc-winner-2014-3mpk</link>
      <guid>https://dev.to/zohebabai/vggnet-ilsvrc-winner-2014-3mpk</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fn05dfxqznc4hcktefiex.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fn05dfxqznc4hcktefiex.jpg" alt="Vgg arch" width="360" height="459"&gt;&lt;/a&gt;&lt;/p&gt;
VGGNet Architecture



&lt;p&gt;Each year ILSVRC winners conveyed some interesting insights and 2014 was special in that regard. For most years the challenge tasks were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image classification&lt;/strong&gt;: Predict the classes of objects present in an image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-object localization&lt;/strong&gt;: Image classification + draw a bounding box around one example of each object present.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object detection&lt;/strong&gt;: Image classification + draw a bounding box around each object present.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By 2014 it was apparent that, as more and more fresh architectures were unveiled, no single CNN architecture could champion all the tasks, and the 2014 winners were a perfect embodiment of that. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VGGNet&lt;/strong&gt; was introduced in the paper titled &lt;a href="https://arxiv.org/pdf/1409.1556.pdf" rel="noopener noreferrer"&gt;Very Deep Convolutional Networks for Large-Scale Image Recognition&lt;/a&gt; by &lt;em&gt;Karen Simonyan&lt;/em&gt; and &lt;em&gt;Andrew Zisserman&lt;/em&gt;. VGGNet architecture won the competition in &lt;em&gt;localization&lt;/em&gt; task while bagging 2nd position in the &lt;em&gt;classification&lt;/em&gt; task. The beauty of this network lies in its architectural simplicity and reinforcing the idea of having deeper CNNs for improved performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improvements over top CNN Architectures
&lt;/h2&gt;

&lt;p&gt;Since 2012, there had been numerous attempts to improve over &lt;a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf" rel="noopener noreferrer"&gt;AlexNet&lt;/a&gt; in every possible way. In 2013, both &lt;a href="https://arxiv.org/pdf/1312.6229.pdf" rel="noopener noreferrer"&gt;Overfeat&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/1311.2901.pdf" rel="noopener noreferrer"&gt;ZFNet&lt;/a&gt; improved performance relative to AlexNet by utilizing a smaller receptive window size (7 × 7) and a smaller stride (2) in their first convolutional layer. Small convolution filters had previously been used by &lt;a href="http://people.idsia.ch/~juergen/ijcai2011.pdf" rel="noopener noreferrer"&gt;Dan Ciresan Net&lt;/a&gt;, but those nets were significantly less deep and were not evaluated on a large-scale dataset. In VGGNet, the authors used small 3 × 3 receptive fields (the smallest size able to capture the notion of left/right, up/down, center) throughout their 16- and 19-layer-deep networks, with a stride of 1 and padding of 1 so that the spatial resolution is preserved after each convolution. Spatial pooling was carried out by max-pooling layers (following some, but not all, of the convolutional layers) of reduced size 2 × 2 with a stride of 2, instead of 3 × 3. The reasons the authors provide for this design are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a stack of 2 or 3 consecutive 3x3 layers has an &lt;em&gt;effective receptive field&lt;/em&gt; of 5x5 or 7x7 respectively,&lt;/li&gt;
&lt;li&gt;as every convolution is followed by a non-linear ReLU activation, a stack of them makes the decision function more discriminative than a single ReLU,&lt;/li&gt;
&lt;li&gt;a single 7x7 convolutional layer with C input and output channels has 49C² weights, whereas a stack of three 3x3 layers has only 27C², about 45% fewer (equivalently, the 7x7 layer needs 81% more).&lt;/li&gt;
&lt;/ul&gt;
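&lt;p&gt;The arithmetic behind the last point, assuming C input and C output channels per layer:&lt;/p&gt;

```python
def conv_weights(k, c):
    # weights in one k x k convolution with c input and c output channels
    # (biases ignored)
    return k * k * c * c

C = 64  # illustrative channel count
stack_of_three_3x3 = 3 * conv_weights(3, C)  # 27 * C^2
single_7x7 = conv_weights(7, C)              # 49 * C^2
print(round(single_7x7 / stack_of_three_3x3, 2))  # 1.81 -> 81% more weights
```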

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1312.5402.pdf" rel="noopener noreferrer"&gt;HowardNet&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/1312.6229.pdf" rel="noopener noreferrer"&gt;Overfeat&lt;/a&gt; also improved their performance by utilizing similar multiple scaling of images during both training and testing of the network, instead of using a single scale as AlexNet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;p&gt;Training and evaluation of the 140-million-parameter VGGNet were performed on 4 NVIDIA Titan Black GPUs installed in a single system. Multi-GPU training exploits data parallelism and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is the same as when training on a single GPU.&lt;br&gt;
Despite the larger number of parameters and the greater depth of VGGNet compared to AlexNet, it required &lt;em&gt;fewer epochs to converge&lt;/em&gt;, which the authors conjecture might be due to &lt;em&gt;implicit regularization imposed by the greater depth and smaller convolutional filter sizes, and pre-initialization of certain layers&lt;/em&gt;.&lt;/p&gt;
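&lt;p&gt;The data-parallel scheme can be sketched in a few lines (a plain-Python stand-in for the GPU logic; assumes the batch size divides evenly across devices):&lt;/p&gt;

```python
def split_batch(batch, n_devices):
    # split one training batch into equal per-device chunks
    k = len(batch) // n_devices
    return [batch[i * k:(i + 1) * k] for i in range(n_devices)]

def data_parallel_gradient(batch, grad_fn, n_devices=4):
    # each device computes the gradient of its own chunk in parallel;
    # averaging the chunk gradients recovers the full-batch gradient
    # (exactly so when chunks are equal-sized and grad_fn is a mean)
    grads = [grad_fn(chunk) for chunk in split_batch(batch, n_devices)]
    return sum(grads) / len(grads)

# stand-in "gradient": the mean of the chunk
mean = lambda xs: sum(xs) / len(xs)
print(data_parallel_gradient([1, 2, 3, 4, 5, 6, 7, 8], mean))  # 4.5
```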

&lt;p&gt;VGGNet does not contain the Local Response Normalization used in AlexNet, because such normalization does not improve performance but instead increases memory consumption and computation time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocessing&lt;/strong&gt;: The mean value of pixels over the training set was subtracted from each pixel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image Augmentation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-scale training&lt;/strong&gt;: The authors first trained the network using images scaled to 256. Then, to speed up training of the network on images scaled to 384, they initialized it with the weights pre-trained at scale 256 and used a smaller initial learning rate of 1e-03.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-scale training&lt;/strong&gt;: Each training image was individually rescaled by randomly sampling a scale S from the range [Smin, Smax], where Smin = 256 and Smax = 512. For speed reasons, the authors trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained at a fixed scale of 384.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Crop&lt;/strong&gt;: Finally, to feed the network with the fixed-size 224×224 input images, rescaled training images were randomly cropped (one crop per image - per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB color shift as done during AlexNet.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
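&lt;p&gt;The augmentation steps above reduce to simple coordinate logic (a plain-Python illustration, not the authors' code; the actual image resizing and pixel work are left out):&lt;/p&gt;

```python
import random

def vgg_train_crop(h, w, s_min=256, s_max=512, crop=224):
    # multi-scale jitter: pick a random S, rescale so the SMALLEST side is S,
    # then take one random 224 x 224 crop and a random horizontal flip
    s = random.randint(s_min, s_max)
    scale = s / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    top = random.randint(0, new_h - crop)
    left = random.randint(0, new_w - crop)
    flip = random.choice([False, True])
    return new_h, new_w, top, left, flip

random.seed(0)
print(vgg_train_crop(480, 640))
```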

&lt;p&gt;&lt;strong&gt;Dropout&lt;/strong&gt;: Same as &lt;a href="https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo"&gt;AlexNet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kernel Initializer&lt;/strong&gt;: Same&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bias Initializer&lt;/strong&gt;: 0 for each layer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Size&lt;/strong&gt;: 256&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizer&lt;/strong&gt;: Same&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2 weight decay&lt;/strong&gt;: Same&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning Rate Manager&lt;/strong&gt;: Same&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total epochs&lt;/strong&gt;: 74&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total time&lt;/strong&gt;: 21 days (max for VGG-19)&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frv7rmn0zm27v7069qvyj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frv7rmn0zm27v7069qvyj.jpg" alt="VGGNet Results" width="720" height="820"&gt;&lt;/a&gt;&lt;/p&gt;
6 different architectures used for experimenting



&lt;p&gt;&lt;strong&gt;Test Time Augmentation&lt;/strong&gt;: During test time the network was applied densely over the rescaled test images in a way similar to Overfeat. Namely, the fully connected layers were first converted to convolutional layers (the first FC layer to a 7 × 7 convolutional layer, the last two FC layers to 1 × 1 convolutional layers). The resulting fully convolutional net was then applied to the whole (uncropped) images. The result was a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map was spatially averaged (sum-pooled). The authors also augmented the test set by horizontally flipping the images; the soft-max class posteriors of the original and flipped images were averaged to obtain the final scores for the images.&lt;/p&gt;
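&lt;p&gt;The conversion of FC layers to convolutional layers works because the two are the same dot product with reshaped weights; here is a toy check in plain Python (the real conversion simply reshapes the FC weight matrix into 7 × 7 or 1 × 1 filters):&lt;/p&gt;

```python
def fc_over_map(x, filters):
    # an FC layer over a flattened H x W feature map: flatten, then dot product
    flat = [v for row in x for v in row]
    return [sum(wi * xi for wi, xi in zip([v for r in f for v in r], flat))
            for f in filters]

def conv_at_single_position(x, filters):
    # the same weights viewed as H x W convolution filters applied at one
    # position: identical arithmetic, hence identical outputs
    return [sum(f[i][j] * x[i][j]
                for i in range(len(x)) for j in range(len(x[0])))
            for f in filters]

x = [[1.0, 2.0], [3.0, 4.0]]          # toy 2 x 2 feature map, one channel
filters = [[[1.0, 0.0], [0.0, 1.0]]]  # one "FC unit" reshaped as a 2 x 2 filter
print(fc_over_map(x, filters), conv_at_single_position(x, filters))  # [5.0] [5.0]
```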

&lt;p&gt;The authors justify using dense evaluation instead of multi-crop evaluation (performed for AlexNet) by its decreased computation time, although the two methods, being complementary (due to different convolution boundary conditions), were used together for better results. &lt;br&gt;
When applying a CNN to a cropped image, the convolved feature maps are padded with zeros, while for dense evaluation the padding for the same crop naturally comes from the neighboring parts of the image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single Scale Evaluation&lt;/strong&gt;: The test image size was set as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Q = S for fixed training image scale S, and &lt;/li&gt;
&lt;li&gt;Q = 0.5(Smin + Smax) for jittered S ∈ [Smin, Smax].&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqwrst9lc6gyay6noepwl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqwrst9lc6gyay6noepwl.jpg" alt="VGG SSE" width="740" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The authors observed that the classification error decreased with increased ConvNet depth: from 11 layers in A to 19 layers in E (but saturated after that). Scale jittering at training time (S ∈ [256; 512]) led to significantly better results than training on images with a fixed smallest side (S = 256 or S = 384), even though a single scale was used at test time. This confirmed that training-set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.&lt;/p&gt;

&lt;p&gt;Even the least-performing network &lt;strong&gt;A&lt;/strong&gt;, achieving a 10.4% top-5 error, confirmed that a deep network with small filters outperforms a shallower network with larger filters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Scale Evaluation&lt;/strong&gt;: Here, the authors assessed the effect of scale jittering at test time. It consisted of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The model trained with variable S ∈ [Smin; Smax] was evaluated over a larger range of sizes Q = {Smin, 0.5(Smin + Smax), Smax}.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa3oq8i1e1ng8krhawxf5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa3oq8i1e1ng8krhawxf5.jpg" alt="VGG MSE" width="740" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results indicated that scale jittering at test time leads to better performance as compared to evaluating the same model at a single scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Crop and Dense&lt;/strong&gt;: As mentioned earlier best-performing networks &lt;strong&gt;D&lt;/strong&gt; (VGG16) and &lt;strong&gt;E&lt;/strong&gt; (VGG19) achieved slightly better results with Multi-Crop and Dense evaluation together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Funk1qtmtd5w6u6q149wg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Funk1qtmtd5w6u6q149wg.jpg" alt="VGG MCD" width="740" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final submission ensembling VGG16 and VGG19&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F15tfj3ogj52h29jjgd2c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F15tfj3ogj52h29jjgd2c.jpg" alt="Vgg comparision" width="740" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like the AlexNet (2012) and ZFNet (2013) submissions, the authors too submitted an ensemble (combining the outputs of several models by averaging their soft-max class posteriors) of their best-performing models &lt;strong&gt;D&lt;/strong&gt; and &lt;strong&gt;E&lt;/strong&gt;: just two models, significantly fewer than earlier submissions, yet remarkably outperforming them. &lt;strong&gt;The final submitted top-5 error of 6.8% outperformed all earlier submitted results.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Remarks
&lt;/h2&gt;

&lt;p&gt;VGGNet, &lt;em&gt;simplicity at its best&lt;/em&gt; compared to its competitor &lt;a href="https://arxiv.org/pdf/1409.4842.pdf" rel="noopener noreferrer"&gt;GoogLeNet&lt;/a&gt;, had a few but important insights to offer. The use of the now &lt;em&gt;omnipresent&lt;/em&gt; 3x3 convolutional layer throughout an architecture was seeded here. Both winners of 2014, VGGNet and GoogLeNet, used the concept of the effective receptive field to highlight the importance of depth in visual representations, which eventually became the stepping stone of a breakthrough transformation arriving the next year.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>A Decade of Deep CNN Archs. - ZFNet (ILSVRC Runner-up 2013)</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sat, 25 Jul 2020 07:09:20 +0000</pubDate>
      <link>https://dev.to/zohebabai/zfnet-ilsvrc-runner-up-2013-4hnj</link>
      <guid>https://dev.to/zohebabai/zfnet-ilsvrc-runner-up-2013-4hnj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp1rfz9z2c6flu4wwer6x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp1rfz9z2c6flu4wwer6x.jpg" alt="Cover" width="740" height="194"&gt;&lt;/a&gt;&lt;/p&gt;
ZFNet Architecture



&lt;p&gt;&lt;strong&gt;ZFNet&lt;/strong&gt; was introduced in the paper titled &lt;em&gt;&lt;a href="https://arxiv.org/pdf/1311.2901.pdf" rel="noopener noreferrer"&gt;Visualizing and Understanding Convolutional Networks&lt;/a&gt;&lt;/em&gt; by &lt;em&gt;Matthew D. Zeiler&lt;/em&gt; and &lt;em&gt;Rob Fergus&lt;/em&gt;. This architecture did not win the competition, but its ideas were used by that year's winner (&lt;em&gt;Clarifai&lt;/em&gt;, founded by &lt;em&gt;Zeiler&lt;/em&gt;, 11.19% test error). The paper is remarkable for its &lt;strong&gt;visualizations and understanding of the internal operation and behavior of a CNN model&lt;/strong&gt; classifying an image. It also introduced us to a technique now widely known as &lt;strong&gt;Transfer Learning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Thanks to the 2012 winner AlexNet, there was an enormous increase in submissions of CNN models for ILSVRC 2013, but most of them were trial-and-error based, without exhibiting any understanding of how and why CNNs performed so well. &lt;/p&gt;

&lt;p&gt;Let's understand that (as explained by authors).&lt;/p&gt;

&lt;h2&gt;
  
  
  A CNN model
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Maps a color 2D input image &lt;code&gt;x_i&lt;/code&gt;, via a series of layers, to a probability vector &lt;code&gt;y_i_hat&lt;/code&gt; over the &lt;code&gt;C&lt;/code&gt; different classes, where each layer consists of&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Convolution of the previous layer output with a set of learned filters, passing the responses through a rectified linear function

2. Optionally max pooling over local neighborhoods 

3. Optionally a local contrast operation that normalizes the responses across feature maps (it's not relevant anymore)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;has conventional fully connected top few layers with final layer as a softmax classifier.&lt;/li&gt;
&lt;li&gt;is trained using a large set of &lt;code&gt;N&lt;/code&gt; labelled images &lt;code&gt;{x, y}&lt;/code&gt;, where label &lt;code&gt;y_i&lt;/code&gt; is a discrete variable indicating the true class.&lt;/li&gt;
&lt;li&gt;a cross-entropy loss function, &lt;code&gt;-Σ p(x)log(q(x))&lt;/code&gt;, suitable for image classification, is used to compare &lt;code&gt;y_hat&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;parameters are trained by backpropagating the derivative of the loss regarding the parameters throughout the network, and updating the parameters via stochastic gradient descent in batches.&lt;/li&gt;
&lt;/ul&gt;
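&lt;p&gt;As a concrete illustration, those building blocks can be sketched in plain NumPy (a minimal single-channel sketch with illustrative names, not the paper's implementation):&lt;/p&gt;

```python
import numpy as np

def relu(x):
    # Rectified linear function applied to the convolution responses
    return np.maximum(0.0, x)

def conv2d(image, kernel):
    # "Valid" 2D convolution of a single-channel image with one learned filter
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool(x, size=2):
    # Max pooling over non-overlapping local neighborhoods
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

def softmax(logits):
    # Final layer: probability vector y_hat over the C classes
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(y_hat, y):
    # -sum_x p(x) log q(x), with a one-hot true distribution p
    return -np.log(y_hat[y])

x = np.random.default_rng(0).standard_normal((8, 8))      # toy "image"
feat = max_pool(relu(conv2d(x, np.ones((3, 3)) / 9.0)))   # conv -> ReLU -> pool
y_hat = softmax(feat.flatten()[:4])                       # pretend C = 4 classes
loss = cross_entropy(y_hat, 0)                            # compare y_hat and y
```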

&lt;h2&gt;
  
  
  Updating &lt;a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf" rel="noopener noreferrer"&gt;AlexNet&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Understanding the operation of a CNN requires interpreting the feature activity in intermediate layers, so the authors present a novel technique known as a &lt;strong&gt;DeconvNet&lt;/strong&gt; (Zeiler et al. proposed it initially as an unsupervised learning technique) to map these activities back to the input pixel space, showing what input pattern had originally caused a given activation in the feature maps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fth5w28n5012mo8s9m2li.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fth5w28n5012mo8s9m2li.jpg" alt="Deconvnet" width="445" height="441"&gt;&lt;/a&gt;&lt;/p&gt;
A DeconvNet layer (left) attached to a ConvNet layer (right)



&lt;p&gt;A DeconvNet is attached to each ConvNet layer, providing a continuous path back to the image pixels. To examine a given ConvNet activation, all other activations in the layer are set to zero and the feature maps are passed as input to the attached DeconvNet layer. The signal is then successively &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. unpooled (uses the switch which records the location of the local max in maxpool), 

2. rectified, and 

3. filtered (uses transposed version of same filters in convnet) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is repeated until the input pixel space is reached.&lt;/p&gt;
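&lt;p&gt;Those three steps can be sketched in NumPy for a single toy feature map (illustrative only; the paper applies them to full multi-channel feature maps):&lt;/p&gt;

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    # Forward max pool that also records "switches": where each max came from
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((H2, W2))
    switches = np.zeros((H2, W2, 2), dtype=int)
    for i in range(H2):
        for j in range(W2):
            patch = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(patch.argmax(), patch.shape)
            out[i, j] = patch[r, c]
            switches[i, j] = (i * size + r, j * size + c)
    return out, switches

def unpool(pooled, switches, shape):
    # 1. Unpool: place each max back at the location its switch recorded
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

def rectify(x):
    # 2. Rectify: pass through ReLU so reconstructions stay non-negative
    return np.maximum(0.0, x)

def transposed_filter(x, kernel):
    # 3. Filter with the transposed (flipped) version of the same square
    #    ConvNet filter ("full" convolution, so the map grows back)
    flipped = kernel[::-1, ::-1]
    p = kernel.shape[0] - 1
    padded = np.pad(x, p)
    out = np.zeros((x.shape[0] + p, x.shape[1] + p))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + p + 1, j:j + p + 1] * flipped)
    return out

x = np.arange(16.0).reshape(4, 4)                  # toy feature map
pooled, switches = max_pool_with_switches(x)
recon = rectify(unpool(pooled, switches, x.shape))
filtered = transposed_filter(recon, np.ones((3, 3)) / 9.0)
```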

&lt;p&gt;The authors trained AlexNet, reproducing the test error to within 0.1% of the value reported in 2012. By visualizing the first and second layers of AlexNet, they observed two specific issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filters at layer 1 are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Without the mid frequencies, there is a chain effect: deeper features can only learn from extremely high and low frequency information. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: Spatial frequency information in an image describes the information on periodic distributions of 'light' and 'dark' in that image. High spatial frequencies correspond to features such as sharp edges and fine details, whereas low spatial frequencies correspond to features such as global shape.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx6zwh5zzdxisishp8z4e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx6zwh5zzdxisishp8z4e.jpg" alt="Issue1" width="155" height="127"&gt;&lt;/a&gt;&lt;/p&gt;
AlexNet Layer 1 features



&lt;ul&gt;
&lt;li&gt;Layer 2 shows aliasing artifacts caused by the large stride of 4 used in the first-layer convolutions. Aliasing occurs when the sampling frequency is too low.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: In each CNN layer (if not using upsampling or a DeconvNet) we are essentially downsampling (discretizing) the image. If the sampling frequency is too low (insufficient sampling), we get aliasing artifacts in the sampled image such as jagged boundaries/edges, repetitive textures etc.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F61r4wbqqh96yqnh0e09t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F61r4wbqqh96yqnh0e09t.jpg" alt="Issue2" width="247" height="249"&gt;&lt;/a&gt;&lt;/p&gt;
AlexNet Layer 2 features



&lt;p&gt;To remedy these problems, the authors made the following changes to the AlexNet architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Reduced the 1st layer filter size from 11×11 to 7×7, since filters of size 11×11 proved to skip a lot of relevant information&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frpnfmgouibo9n3eyvq37.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frpnfmgouibo9n3eyvq37.jpg" alt="Solution1" width="152" height="123"&gt;&lt;/a&gt;&lt;/p&gt;
ZFNet Layer 1 features


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduced the stride of the convolution from 4 to 2, since a stride of 2 proved to retain much more pixel information&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdxd4v3bydrij1rcki82s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdxd4v3bydrij1rcki82s.jpg" alt="Solution2" width="250" height="248"&gt;&lt;/a&gt;&lt;/p&gt;
ZFNet Layer 2 features
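&lt;p&gt;The effect of these two changes on sampling density follows from the standard convolution output-size arithmetic (a back-of-the-envelope sketch; padding is ignored):&lt;/p&gt;

```python
def conv_out_size(n, kernel, stride, pad=0):
    # Standard formula: floor((n + 2*pad - kernel) / stride) + 1
    return (n + 2 * pad - kernel) // stride + 1

# AlexNet layer 1: 11x11 filters with stride 4 on a 224x224 input
alexnet_l1 = conv_out_size(224, kernel=11, stride=4)

# ZFNet layer 1: 7x7 filters with stride 2 on the same input
zfnet_l1 = conv_out_size(224, kernel=7, stride=2)
```

&lt;p&gt;With a smaller filter and half the stride, the first layer samples the input roughly twice as densely in each dimension, which is why far less mid-frequency information is lost.&lt;/p&gt;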



&lt;p&gt;This new architecture retains much more information in the 1st and 2nd layer features. The final ZFNet architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F745h9khoniihaxysr71u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F745h9khoniihaxysr71u.jpg" alt="Table1" width="740" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
Table 1: Architecture Details



&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;p&gt;During training, visualization of the first layer filters revealed that a few of them dominated. To combat this, the authors renormalized each convolutional filter whose RMS value exceeded a fixed radius of 1e-1 back to that fixed radius.&lt;/p&gt;
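&lt;p&gt;A sketch of that renormalization step (illustrative NumPy; assumes a weight tensor whose first axis indexes the filters):&lt;/p&gt;

```python
import numpy as np

def renormalize_filters(weights, radius=0.1):
    # Rescale any filter whose RMS value exceeds the fixed radius (1e-1)
    # back down to that radius; filters below the radius are left untouched.
    out = weights.copy()
    for idx in range(out.shape[0]):
        rms = np.sqrt(np.mean(out[idx] ** 2))
        if rms > radius:
            out[idx] *= radius / rms
    return out

w = np.zeros((2, 3, 3))
w[0] = 1.0     # dominating filter: RMS = 1.0, gets rescaled
w[1] = 0.01    # small filter: RMS = 0.01, left as-is
w = renormalize_filters(w)
```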

&lt;p&gt;The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) on a single NVIDIA GTX 580 GPU with 3 GB of memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocessing:&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo"&gt;Same as AlexNet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image Augmentation:&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo"&gt;Same as AlexNet.&lt;/a&gt; (224x224 here)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropout:&lt;/strong&gt; &lt;br&gt;
Same as AlexNet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kernel Initializer:&lt;/strong&gt; &lt;br&gt;
1e-02 for each layer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bias Initializer:&lt;/strong&gt; &lt;br&gt;
0 for each layer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Size:&lt;/strong&gt; &lt;br&gt;
Same&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizer:&lt;/strong&gt; &lt;br&gt;
Same&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2 weight decay:&lt;/strong&gt; &lt;br&gt;
None&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning Rate Manager:&lt;/strong&gt; &lt;br&gt;
Same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total epochs:&lt;/strong&gt; &lt;br&gt;
70&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total time:&lt;/strong&gt; &lt;br&gt;
12 days&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;A single ZFNet model achieves top-1 and top-5 test errors of 38.4% and 16.5% respectively, the latter lower by a margin of 1.7% than that of AlexNet. Their final submission comprised an ensemble of 6 CNNs (an average of 5 ZFNets plus a network identical to ZFNet except that layers Conv3, Conv4 and Conv5 have 512, 1024 and 512 channels respectively), which gave an error rate of &lt;strong&gt;14.8%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Depth of the model is important for obtaining good performance:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Removing two fully connected layers yielded a slight increase in error, although they contained the majority of model parameters. Removing two of the middle convolutional layers also made a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yielded a model with only 4 layers whose performance was dramatically worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transfer Learning:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, the authors showed that a model trained on ImageNet generalizes well to other datasets. For this, they kept layers 1-7 of the ImageNet-trained model fixed and trained a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset.&lt;/p&gt;
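&lt;p&gt;That recipe, frozen features plus a freshly trained softmax head, can be sketched as follows (random stand-in data; &lt;code&gt;features&lt;/code&gt; plays the role of the fixed layer-7 outputs):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: outputs of the frozen layers 1-7 for the new dataset,
# plus labels for the new dataset's C classes.
N, D, C = 200, 16, 3
features = rng.standard_normal((N, D))
labels = rng.integers(0, C, size=N)

W = np.zeros((D, C))  # only the new softmax classifier is trained

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):
    grad = softmax(features @ W)           # forward through the new head only
    grad[np.arange(N), labels] -= 1.0      # d(cross-entropy)/d(logits)
    W -= 0.1 * (features.T @ grad) / N     # SGD step; features stay fixed
```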

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faiqonveqnxumhzfpdifm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faiqonveqnxumhzfpdifm.jpg" alt="TL1" width="360" height="102"&gt;&lt;/a&gt;&lt;/p&gt;
Fine-Tuning ImageNet trained ZFNet on Caltech-101 dataset



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnpag2v1cgrpwmuhaeoqf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnpag2v1cgrpwmuhaeoqf.jpg" alt="TL2" width="360" height="87"&gt;&lt;/a&gt;&lt;/p&gt;
Fine-Tuning ImageNet trained ZFNet on Caltech-256 dataset



&lt;h2&gt;
  
  
  Visualizations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Feature Visualization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff0npezqeebc3c7q07bpb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff0npezqeebc3c7q07bpb.jpg" alt="FV1" width="505" height="679"&gt;&lt;/a&gt;&lt;/p&gt;
Visualization of features in a fully trained model. For layers 2-5 authors show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using the de-convolutional network approach



&lt;p&gt;The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions.&lt;/p&gt;

&lt;p&gt;Layer 3 has more complex invariance, capturing similar textures such as mesh patterns and text patterns.&lt;/p&gt;

&lt;p&gt;Layer 4 shows significant variation, but is more class-specific such as dog faces and bird’s legs.&lt;/p&gt;

&lt;p&gt;Layer 5 shows entire objects with significant pose variation such as keyboards and dogs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Evolution during Training&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3t64au3fv7aeji70mqf7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3t64au3fv7aeji70mqf7.jpg" alt="FV2" width="740" height="125"&gt;&lt;/a&gt;&lt;/p&gt;
Evolution of a randomly chosen subset of model features through training. Each layer’s features are displayed in a different block. Within each block, it shows a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using the DeconvNet approach



&lt;p&gt;Here, the lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Invariance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fejxv5vgvj8o0s88yet4g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fejxv5vgvj8o0s88yet4g.jpg" alt="FV3" width="740" height="493"&gt;&lt;/a&gt;&lt;/p&gt;
Column1 and Column2: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Column 3:The probability of the true label for each image, as the image is transformed



&lt;p&gt;Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation &amp;amp; scaling. The network output is stable to translations and scaling. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. an entertainment center).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Occlusion Sensitivity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy4nxthtjxn5668p1spgp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy4nxthtjxn5668p1spgp.jpg" alt="FV4" width="740" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
The first row example shows the strongest feature to be the dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). When the dog’s face is obscured, the probability for “Pomeranian” drops significantly. In the 1st row, for most locations it is “Pomeranian”, but if the dog’s face is obscured but not the ball, then it predicts “tennis ball”. In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps



&lt;p&gt;&lt;strong&gt;With these image classification approaches, a natural question arises: is the model truly identifying the location of the object in the image, or just using the surrounding context?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The authors attempt to answer this question by systematically occluding different portions of the input image with a gray square and monitoring the output of the classifier. The examples above show visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. They clearly show that the model is localizing the objects within the scene, as the probability of the correct class and the activity in the feature map drop significantly when the object is occluded. This shows that the model, while trained for classification, is &lt;em&gt;highly sensitive to local structure in the image and is not just using broad scene context&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remarks
&lt;/h2&gt;

&lt;p&gt;Thus, the paper holds its significance for introducing us to the &lt;em&gt;perspective&lt;/em&gt; we require while structuring a CNN architecture. The techniques introduced here to visualize the activity within the model are still relevant for inferring the performance of models or choosing data preprocessing techniques that obtain better results. The authors brought to the limelight the fact that CNN models &lt;em&gt;do not generate features with random, non-interpretable patterns&lt;/em&gt; (a black box, as thought by many) but instead reveal several intuitively desirable properties such as &lt;strong&gt;compositionality&lt;/strong&gt;, &lt;strong&gt;increasing invariance&lt;/strong&gt; and &lt;strong&gt;class discrimination&lt;/strong&gt; as we ascend the layers of a CNN model.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>A Decade of Deep CNN Archs. - AlexNet (ILSVRC Winner 2012)</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sat, 18 Jul 2020 08:48:39 +0000</pubDate>
      <link>https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo</link>
      <guid>https://dev.to/zohebabai/alexnet-ilsvrc-winner-2012-3cfo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3ykad1kyrlzg2j3b3ja7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3ykad1kyrlzg2j3b3ja7.jpg" alt="Cover" width="740" height="251"&gt;&lt;/a&gt;&lt;/p&gt;
AlexNet Architecture (Split into two GPUs)



&lt;p&gt;&lt;strong&gt;AlexNet&lt;/strong&gt; was introduced in the paper titled &lt;a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf" rel="noopener noreferrer"&gt;ImageNet Classification with Deep Convolutional Neural Networks&lt;/a&gt; by &lt;em&gt;Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton&lt;/em&gt;. It has since been cited around &lt;a href="https://scholar.google.com/citations?user=xegzhJcAAAAJ&amp;amp;hl=en#d=gs_md_cita-d&amp;amp;u=%2Fcitations%3Fview_op%3Dview_citation%26hl%3Den%26user%3DxegzhJcAAAAJ%26citation_for_view%3DxegzhJcAAAAJ%3Au5HHmVD_uO8C%26tzom%3D-330" rel="noopener noreferrer"&gt;67000 times&lt;/a&gt; and is widely considered one of the most influential papers in the field of computer vision. It was neither the first implementation of a CNN architecture nor the first GPU implementation of a deep CNN architecture, so why is it so influential? &lt;/p&gt;

&lt;p&gt;Let's find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before publication of the paper
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Most of the computer vision tasks were solved using machine learning methods such as SVM and K-NN or Fully Connected Neural Networks.&lt;/li&gt;
&lt;li&gt;CNN architectures using a backpropagation algorithm for computer vision problems were introduced back in 1989 by &lt;a href="https://en.wikipedia.org/wiki/Yann_LeCun" rel="noopener noreferrer"&gt;Yann LeCun&lt;/a&gt; et al. (&lt;a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf" rel="noopener noreferrer"&gt;LeNet&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;A fast GPU implementation of a deep CNN with backpropagation was introduced a year earlier (August 2011) by Dan C. Ciresan et al., which had achieved SOTA test error rates of 0.35% on MNIST, 2.53% on NORB and 19.51% on CIFAR10. They showed their implementation to be 10 to 60 times faster than a compiler-optimized CPU version (&lt;a href="http://people.idsia.ch/~juergen/ijcai2011.pdf" rel="noopener noreferrer"&gt;Dan C. Ciresan Net&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Rectified Linear Unit (&lt;a href="https://www.cs.toronto.edu/~fritz/absps/reluICML.pdf" rel="noopener noreferrer"&gt;ReLU&lt;/a&gt;) was introduced by Geoffrey E. Hinton et al. in 2010 on Restricted Boltzmann Machines replacing binary units for recognizing objects and comparing faces.&lt;/li&gt;
&lt;li&gt;In the authors' earlier work, they introduced &lt;a href="https://arxiv.org/pdf/1207.0580.pdf" rel="noopener noreferrer"&gt;Dropout&lt;/a&gt; layers as an efficient method for reducing overfitting.&lt;/li&gt;
&lt;li&gt;Other than ImageNet, all publicly available labelled datasets were relatively small (on the order of tens of thousands of images), such as MNIST and CIFAR-10, on which it was easy to achieve good performance with optimized image augmentations.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://image-net.org/challenges/LSVRC/" rel="noopener noreferrer"&gt;ImageNet Large-Scale Visual Recognition Challenge&lt;/a&gt; (an annual competition started since 2010) uses a subset of &lt;a href="http://www.image-net.org/papers/imagenet_cvpr09.pdf" rel="noopener noreferrer"&gt;ImageNet&lt;/a&gt; (a database introduced by Fei-Fei et al. in 2009) with roughly 1000 images of variable-resolution in each of 1000 categories, enclosing a total of 1.2 million training images, 50,000 validation images, and 150,000 testing images. It is customary to report two error rates: top-1 and top-5 in final submissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Preprocessing
&lt;/h2&gt;

&lt;p&gt;AlexNet was trained on the centered RGB values of the pixels.&lt;/p&gt;

&lt;p&gt;Given a rectangular image, at first, the shorter side was re-scaled to a length of 256 and then a patch of size 256×256 was cropped from the center. Later, the mean value of pixels over the training set was subtracted from each pixel.&lt;/p&gt;
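&lt;p&gt;A minimal NumPy sketch of this preprocessing (nearest-neighbor rescaling for brevity; the training-set mean is a random stand-in here):&lt;/p&gt;

```python
import numpy as np

def preprocess(img, size=256):
    # Rescale the shorter side to `size` (nearest-neighbor for brevity)
    H, W, _ = img.shape
    scale = size / min(H, W)
    newH, newW = int(round(H * scale)), int(round(W * scale))
    rows = np.clip((np.arange(newH) / scale).astype(int), 0, H - 1)
    cols = np.clip((np.arange(newW) / scale).astype(int), 0, W - 1)
    img = img[rows][:, cols]
    # Center-crop a size x size patch
    top, left = (newH - size) // 2, (newW - size) // 2
    return img[top:top + size, left:left + size]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(480, 640, 3)).astype(np.float64)
crop = preprocess(img)
train_mean = crop.mean()        # stand-in for the mean over the training set
centered = crop - train_mean    # mean subtraction
```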

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1gq0wegh6eiea65ccr4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1gq0wegh6eiea65ccr4h.png" alt="Table 1 : Architecture Details" width="740" height="424"&gt;&lt;/a&gt;&lt;/p&gt;
Table 1 : Architecture Details



&lt;p&gt;The network with 60 million parameters was trained by spreading it across two NVIDIA GTX 580 GPUs with 3 GB of memory each. The kernels of the second, fourth, and fifth convolutional layers were connected only to feature maps which resided on the same GPU, while the kernels of the third convolutional layer were connected to all feature maps across GPUs. Neurons in the fully-connected layers were also connected to all neurons in the previous layer across GPUs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ReLU&lt;/em&gt; non-linear activation was applied to the output of every convolutional and fully connected layer, replacing previously used tanh units. This non-saturating non-linear function was much faster in terms of training time than saturating non-linear tanh function.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Local Response normalization&lt;/em&gt; (or brightness normalization) layers followed first and second convolutional layers after applying ReLU activation. These layers helped lower top-1 and top-5 test errors by 1.4% and 1.2% respectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmn8jn5d7ipzv3fy0once.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmn8jn5d7ipzv3fy0once.png" alt="LRN" width="360" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the variable &lt;code&gt;a&lt;/code&gt; represents the ReLU-activated value of a neuron. The constants k, n, α, and β are hyper-parameters whose values were k = 2, n = 5, α = 1e-4, and β = 0.75. The sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer.&lt;/p&gt;
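&lt;p&gt;The formula can be written out directly (a NumPy sketch with those hyper-parameter values; &lt;code&gt;a&lt;/code&gt; is assumed to have shape (N kernels, H, W)):&lt;/p&gt;

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # b_i = a_i / (k + alpha * sum_{j near i} a_j^2) ** beta, where the sum
    # runs over n "adjacent" kernel maps at the same spatial position.
    N = a.shape[0]
    out = np.empty_like(a)
    half = n // 2
    for i in range(N):
        lo, hi = max(0, i - half), min(N - 1, i + half)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

acts = np.ones((8, 4, 4))           # toy ReLU-activated responses
normed = local_response_norm(acts)
```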

&lt;p&gt;&lt;em&gt;Max-pooling&lt;/em&gt; layers followed both response-normalization layers and the fifth convolutional layer. These overlapping (strides &amp;lt; kernel size) max-pooling layers helped in reducing overfitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Image Augmentation Techniques:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Translation and Horizontal Reflections&lt;/em&gt;: During training, the network extracts random 227×227 patches (mistakenly reported as 224×224 in the paper) and applies horizontal reflections. All these augmentations are performed on the fly on the CPU while the GPUs train on the previous batch of data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Test Time Augmentations&lt;/em&gt;: During test time, the network predicts by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;PCA color augmentation&lt;/em&gt;: At first, PCA is performed on all pixels of ImageNet training data set. As a result, they get a 3x3 covariance matrix, as well as 3 eigenvectors and 3 eigenvalues. During training, a random intensity factor based on PCA components is added to each color channel of an image, which is equivalent to changing intensity and color of illumination.&lt;/li&gt;
&lt;/ul&gt;
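&lt;p&gt;The PCA color augmentation step can be sketched like this (random stand-in pixels; in the paper the PCA is over all RGB values of the ImageNet training set and the random factors are drawn with standard deviation 0.1):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# PCA over RGB pixel values of the training set (random stand-in here)
pixels = rng.standard_normal((10000, 3))      # flattened (R, G, B) values
cov = np.cov(pixels, rowvar=False)            # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # 3 eigenvalues / eigenvectors

def pca_color_augment(img, eigvals, eigvecs, sigma=0.1):
    # Add a random multiple of each principal component to every pixel:
    # shift = [p1 p2 p3] @ [a1*l1, a2*l2, a3*l3]^T with a_i ~ N(0, sigma),
    # which is equivalent to changing the intensity and color of illumination
    alphas = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alphas * eigvals)
    return img + shift                        # same shift for every pixel

img = rng.standard_normal((224, 224, 3))      # toy image
augmented = pca_color_augment(img, eigvals, eigvecs)
```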

&lt;p&gt;&lt;strong&gt;Dropout:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each hidden neuron in the first two fully-connected layers is set to zero with a probability of 0.5 during training. 'Dropped out' neurons do not contribute to the forward pass and do not participate in backpropagation. During testing, however, all neurons were kept active.&lt;/p&gt;
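&lt;p&gt;A sketch of this dropout scheme (the paper keeps all neurons active at test time and multiplies their outputs by 0.5 instead):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, train=True):
    if train:
        # Training: each neuron is zeroed with probability p and takes no
        # part in the forward or backward pass for that batch
        mask = rng.random(activations.shape) >= p
        return activations * mask
    # Testing: all neurons active; outputs scaled by (1 - p) as in the paper
    return activations * (1.0 - p)

h = np.ones(10000)                  # toy hidden-layer activations
train_out = dropout(h, train=True)
test_out = dropout(h, train=False)
```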

&lt;p&gt;&lt;strong&gt;Kernel Initializer:&lt;/strong&gt; &lt;br&gt;
Zero-mean Gaussian distribution with a standard deviation of 0.01&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bias Initializer:&lt;/strong&gt; &lt;br&gt;
1 for second, fourth, fifth convolutional layers and the fully-connected hidden layers. Remaining layers with 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Size:&lt;/strong&gt; &lt;br&gt;
128&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizer:&lt;/strong&gt; &lt;br&gt;
Stochastic Gradient Descent with momentum 0.9&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2 weight decay:&lt;/strong&gt; &lt;br&gt;
5e-04&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning Rate Manager:&lt;/strong&gt; &lt;br&gt;
LR initialized with value 1e-2 and manually reduced on a plateau by a factor of 10&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total epochs:&lt;/strong&gt; &lt;br&gt;
90&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total time:&lt;/strong&gt; &lt;br&gt;
6 days&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;A single AlexNet model achieves top-1 and top-5 test errors of 40.7% and 18.2% respectively.&lt;/p&gt;

&lt;p&gt;Their final submission comprised an ensemble of 7 CNNs (an average of 2 extended AlexNets pre-trained on the 2011 dataset and then fine-tuned on the 2012 dataset, plus an average of five AlexNets trained on the 2012 dataset), which gave an error rate of &lt;strong&gt;15.3%&lt;/strong&gt;, lower by a margin of about 11% than that of the runner-up (the SIFT+FVs model by Fei-Fei et al.).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foigjj4eiiz60w8a1ahqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foigjj4eiiz60w8a1ahqb.png" alt="Result" width="360" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus, the world got its first deep-CNN-based winner of a large-database image recognition challenge. The authors used several deep learning techniques that are still relevant, establishing a framework that is still followed when approaching complex computer vision problems.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Share your local Jupyter Notebook on Internet for Free</title>
      <dc:creator>Zoheb Abai</dc:creator>
      <pubDate>Sat, 11 Jul 2020 13:00:11 +0000</pubDate>
      <link>https://dev.to/zohebabai/share-your-local-jupyter-notebook-on-internet-18do</link>
      <guid>https://dev.to/zohebabai/share-your-local-jupyter-notebook-on-internet-18do</guid>
      <description>&lt;p&gt;During this global lockdown, it's interesting to observe an upsurge in learning &lt;em&gt;python&lt;/em&gt; and in particular &lt;em&gt;machine learning&lt;/em&gt;. There are innumerable free online resources and teachers from where or whom you can learn, but all use a common open-source software (or its clone) - &lt;strong&gt;Jupyter Notebooks&lt;/strong&gt;. It's one of the most useful tools for all ranks of data scientists/engineers/analysts.&lt;/p&gt;

&lt;p&gt;While teaching online, teachers either prerecord the sessions or take them live. During live sessions most teachers use one of several video conferencing platforms, coding on their local machine (or in ready-to-run notebooks) and sharing it live. This has two serious demerits - &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Internet penetration is still poor in many developing countries, so viewing code onscreen or receiving clear audio is still a distant dream for many.&lt;/li&gt;
&lt;li&gt;If students want to run just a few lines of code or check a doubt about the code shared onscreen during a session, they have no way to copy it. The code has to either be shared by the teacher before the session, or the teacher has to pause for a few extra minutes on each code block. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Being a teaching assistant myself for several data science courses, I have noticed a few good and enthusiastic teachers discussing the lack of free video conferencing platforms with live code sharing. There isn't any! &lt;/p&gt;

&lt;p&gt;Well, for now, an audio conference with live code sharing shall suffice.&lt;/p&gt;

&lt;p&gt;For live code sharing, one solution is very obvious - &lt;strong&gt;Google Colab Notebooks&lt;/strong&gt;. It's the best resource built for data science EVER!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8dx8642v8p0opswffly3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8dx8642v8p0opswffly3.gif" alt="Kind of solved" width="498" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article&lt;/strong&gt; I shall show you a free alternative for live code sharing, which might help you. At the very least, it's a good trick to know.&lt;/p&gt;

&lt;p&gt;You can share your local Jupyter notebooks live with a small group while hosting a session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisite:
&lt;/h3&gt;

&lt;p&gt;You already have &lt;a href="https://jupyter.org/install" rel="noopener noreferrer"&gt;Jupyter notebooks&lt;/a&gt; installed locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirements:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download &lt;a href="https://ngrok.com/download" rel="noopener noreferrer"&gt;ngrok&lt;/a&gt; locally.&lt;/li&gt;
&lt;li&gt;Follow all the setup steps mentioned there, up to step 3.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Steps:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Before firing up ngrok or a Jupyter notebook, run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fah2gndpapnbelep3v3hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fah2gndpapnbelep3v3hw.png" alt="Step 1" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy the address mentioned after &lt;code&gt;Writing default config to:&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fruv22ehv573nwswwugdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fruv22ehv573nwswwugdp.png" alt="Step 2" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Then run the code below, substituting your config path (from the step above) for mine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feudhwfbspx58g7nwojpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feudhwfbspx58g7nwojpv.png" alt="Step 3" width="800" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a secure session, set a password for your Jupyter notebook by running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnz7elmirigotvb66t8i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnz7elmirigotvb66t8i4.png" alt="Step 4" width="720" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fire up your Jupyter notebook and keep a note of the port&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqp1ie77173vboiv9pvdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqp1ie77173vboiv9pvdz.png" alt="Step 5" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open a new tab in your terminal and fire up ngrok on your port&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl7zggqstyqmjf332gma8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl7zggqstyqmjf332gma8.png" alt="Step 6" width="586" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your terminal would show something similar to&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fots8vk7hxamvhsm6xb8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fots8vk7hxamvhsm6xb8f.png" alt="Output" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy the HTTP link ending in ngrok.io and share it with your group, along with the notebook password. You get an 8-hour session.&lt;br&gt;
Each time you write a new line or block of code in your notebook, press Ctrl+S (Linux/Windows) or Cmd+S (macOS) to share the updated notebook. The group members have to refresh on their end to see the changes. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: As a host if you receive an alert like this&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5k15t6gi8v02l7gfg1nf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5k15t6gi8v02l7gfg1nf.png" alt="Alert" width="698" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just click &lt;code&gt;Overwrite&lt;/code&gt; so that whatever changes you make during the session remain intact for that session.&lt;/p&gt;

&lt;p&gt;And that's how you share a coding session live with a group for free while working on your local machine, all while allowing your group to code in the notebook during the session and eventually letting them download it.&lt;/p&gt;

&lt;p&gt;Thanks to &lt;strong&gt;Project Jupyter&lt;/strong&gt; and &lt;strong&gt;ngrok&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Let me know if you have a better idea. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>jupyter</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
