
Zoheb Abai



A Decade of Deep CNN Archs. - ZFNet (ILSVRC Runner-up 2013)


ZFNet Architecture

ZFNet was introduced in the paper Visualizing and Understanding Convolutional Networks by Matthew D. Zeiler and Rob Fergus. The architecture itself did not win ILSVRC 2013, but that year's winner (Clarifai, founded by Zeiler, with 11.19% test error) built on the same approach. The paper is remarkable for its visualizations, which help explain the internal operation and behavior of a CNN model as it classifies an image. It also introduced us to a technique now widely known as Transfer Learning.

After AlexNet's win in 2012, submissions of CNN models to ILSVRC 2013 increased enormously, but most of them were built by trial and error, without showing any understanding of how and why CNNs performed so well.

Let's understand that (as explained by the authors).

A CNN model

  • maps a color 2D input image x_i, via a series of layers, to a probability vector y_i_hat over the C different classes, where each layer consists of
      1. convolution of the previous layer's output with a set of learned filters, with the responses passed through a rectified linear function,
      2. optionally, max pooling over local neighborhoods, and
      3. optionally, a local contrast operation that normalizes the responses across feature maps (this is no longer relevant).
  • has conventional fully connected layers at the top, with a softmax classifier as the final layer.
  • is trained using a large set of N labelled images {x, y}, where the label y_i is a discrete variable indicating the true class.
  • uses a cross-entropy loss function, -Σ p(x) log(q(x)) (with p the true label distribution y and q the predicted distribution y_hat), which is suitable for image classification, to compare y_hat and y.
  • has its parameters trained by backpropagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent in batches.
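
To make the last layer and the loss concrete, here is a minimal sketch in plain Python of a softmax followed by cross-entropy for a single image. The scores and the 3-class setup are made-up values for illustration, not numbers from the paper.

```python
import math

def softmax(logits):
    """Turn raw class scores into a probability vector that sums to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_hat, eps=1e-12):
    """Cross-entropy -sum_c y_c * log(y_hat_c) for a one-hot label y_true."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_hat))

logits = [2.0, 1.0, 0.1]   # hypothetical scores for C = 3 classes
y_hat = softmax(logits)    # predicted probability vector
y = [1.0, 0.0, 0.0]        # one-hot label: the true class is class 0
loss = cross_entropy(y, y_hat)
```

The loss shrinks towards 0 as the probability assigned to the true class approaches 1, which is what gradient descent pushes the network towards.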

Updating AlexNet

Understanding the operation of a CNN requires interpreting the feature activity in its intermediate layers. The authors present a novel way to do this, known as DeconvNet (initially proposed by Zeiler et al. as an unsupervised learning technique), which maps these activities back to the input pixel space and shows what input pattern originally caused a given activation in the feature maps.


A DeconvNet layer (left) attached to a ConvNet layer (right)

A DeconvNet is attached to each ConvNet layer, providing a continuous path back to the image pixels. To examine a given ConvNet activation, all other activations in the layer are set to zero and the feature maps are passed as input to the attached DeconvNet layer. The signal is then successively

1. unpooled (using the switches that record the location of each local max during max pooling),

2. rectified, and

3. filtered (using transposed versions of the same filters as in the ConvNet)

to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is repeated until the input pixel space is reached.
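
The unpooling step is the least obvious of the three, so here is a rough sketch of it in plain Python on a toy 4×4 feature map. It assumes non-overlapping 2×2 pooling, the function names are made up for this example, and the filtering step with transposed filters is left out.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    rows, cols = len(fmap), len(fmap[0])
    pooled, switches = [], []
    for r in range(0, rows - size + 1, size):
        pooled_row, switch_row = [], []
        for c in range(0, cols - size + 1, size):
            # Find the position of the largest value inside this window.
            best = max(
                ((r + dr, c + dc) for dr in range(size) for dc in range(size)),
                key=lambda rc: fmap[rc[0]][rc[1]],
            )
            pooled_row.append(fmap[best[0]][best[1]])
            switch_row.append(best)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Put each pooled value back at its recorded location; zeros elsewhere."""
    rows, cols = shape
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out

def relu(fmap):
    """Rectification: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in fmap]

fmap = [
    [1.0, 3.0, 0.0, 2.0],
    [4.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 6.0],
    [2.0, 0.0, 7.0, 1.0],
]
pooled, switches = max_pool_with_switches(fmap)        # [[4.0, 2.0], [2.0, 7.0]]
restored = relu(unpool(pooled, switches, (4, 4)))      # maxima back in place
```

Without the switches, the pooled values could only be placed at arbitrary positions, and the reconstruction would lose the spatial structure of the stimulus.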

They train AlexNet and reproduce its test error to within 0.1% of the value reported in 2012. By visualizing the first and second layers of AlexNet, they observe two specific issues:

  • Filters at layer 1 are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Without the mid frequencies, there is a chain effect that deep features can only learn from extremely high and low frequency information.

Note: Spatial frequency information in an image describes the information on periodic distributions of 'light' and 'dark' in that image. High spatial frequencies correspond to features such as sharp edges and fine details, whereas low spatial frequencies correspond to features such as global shape.


AlexNet Layer 1 features
  • Layer 2 shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. Aliasing occurs when sampling frequency is too low.

Note: In each CNN layer (unless upsampling or a DeconvNet is used) we are mainly downsampling (discretizing) the image. If the sampling frequency is too low (insufficient sampling), we get aliasing effects in the sampled image, such as jagged boundaries/edges and repetitive textures.


AlexNet Layer 2 features
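
As a toy illustration of why a large stride causes aliasing, consider a 1D striped pattern sampled at two different strides. This is only a sketch of the sampling effect, not the network's actual convolution, and the pattern is invented for the example.

```python
def subsample(signal, stride):
    """Keep every `stride`-th sample, as a strided layer does spatially."""
    return signal[::stride]

# A toy 1D "texture": light/dark stripes that repeat every 4 pixels.
stripes = [1, 1, 0, 0] * 4

coarse = subsample(stripes, 4)  # [1, 1, 1, 1]: every sample lands on a light pixel
fine = subsample(stripes, 2)    # [1, 0, 1, 0, 1, 0, 1, 0]: stripes still visible
```

With stride 4 the stripes vanish entirely and the pattern looks like a flat light region, while stride 2 still sees the alternation.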

To remedy these problems, the authors made the following changes to the AlexNet architecture:

  • Reduced the 1st layer filter size from 11×11 to 7×7. Filters of size 11×11 proved to skip a lot of relevant information.

    ZFNet Layer 1 features
  • Made the stride of the 1st layer convolution 2, rather than 4. A stride of 2 proved to retain much more pixel information.


ZFNet Layer 2 features
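
The effect of these two changes on the size of the first feature map can be checked with the usual output-size formula for a convolution. The sketch below assumes a 227×227 input without padding for the AlexNet-style layer and a 224×224 input with a padding of 1 for the ZFNet-style layer. Both are assumptions made so the arithmetic works out, not details stated in this post.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial size of a convolution output (floor division, as most frameworks do)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# AlexNet-style first layer: 11x11 filters, stride 4 (assumed 227x227 input).
alexnet_conv1 = conv_output_size(227, kernel_size=11, stride=4)          # 55

# ZFNet-style first layer: 7x7 filters, stride 2 (assumed padding of 1).
zfnet_conv1 = conv_output_size(224, kernel_size=7, stride=2, padding=1)  # 110
```

Under these assumptions the first layer of ZFNet produces a feature map twice as wide and twice as tall, which is where the extra retained detail comes from.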

This new architecture retains much more information in the 1st and 2nd layer features. The final ZFNet architecture looks like this:


Table 1: Architecture Details


During training, visualization of the first layer filters revealed that a few of them dominated. To combat this, the authors renormalized each filter in the convolutional layers whose RMS value exceeded a fixed radius of 1e-01 back to that radius.
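
A minimal sketch of this kind of renormalization, assuming a filter is given as a flat list of weights and that only filters whose RMS exceeds the radius are rescaled (the function names are made up for this example):

```python
import math

def rms(values):
    """Root mean square of a list of weights."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def clip_filter_rms(weights, radius=0.1):
    """Scale a filter down so that its RMS does not exceed `radius`."""
    current = rms(weights)
    if current <= radius:
        return list(weights)       # small filters are left untouched
    scale = radius / current
    return [w * scale for w in weights]

dominant = [0.5, -0.5, 0.5, -0.5]  # hypothetical filter with RMS 0.5
quiet = [0.01, -0.02, 0.03]        # hypothetical filter already below the radius

clipped = clip_filter_rms(dominant)
unchanged = clip_filter_rms(quiet)
```

Rescaling keeps the shape of the dominant filter intact and only reduces its magnitude, so it stops drowning out the other filters.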

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) on a single NVIDIA GTX 580 GPU with 3 GB memory.

Same as AlexNet.

Image Augmentation:
Same as AlexNet. (224x224 here)

Same as AlexNet.

Kernel Initializer:
1e-02 for each layer

Bias Initializer:
0 for each layer

Batch Size:


L2 weight decay:

Learning Rate Manager:

Total epochs:

Total time:
12 days


A single ZFNet model achieves top-1 and top-5 test errors of 38.4% and 16.5% respectively, the latter 1.7% lower than that of AlexNet. The final submission comprised an ensemble of 6 CNNs (the average of 5 ZFNets and one network identical to ZFNet except that layers Conv3, Conv4 and Conv5 have 512, 1024 and 512 channels respectively), which gave an error rate of 14.8%.

Depth of the model is important for obtaining good performance:

Removing two fully connected layers yielded a slight increase in error, although they contained the majority of model parameters. Removing two of the middle convolutional layers also made a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yielded a model with only 4 layers whose performance was dramatically worse.

Transfer Learning:

Finally, the authors showed that a model trained on ImageNet generalizes well to other datasets. To do so, they kept layers 1-7 of the ImageNet-trained model fixed and trained a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset.
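
The idea can be sketched in plain Python as training only a softmax classifier on top of fixed feature vectors. In the sketch below the `features` list stands in for layer-7 activations of the frozen network. The values, the learning rate and the number of epochs are all invented for the example.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, b, x):
    """Linear scores of the new classifier for one feature vector."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]

def train_softmax_head(features, labels, num_classes, lr=0.5, epochs=200):
    """Train only the new top layer; the feature extractor is never updated."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(W, b, x))
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # dLoss/dScore_c
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b

def predict(W, b, x):
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical frozen features for four images of a new 2-class dataset.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

W, b = train_softmax_head(features, labels, num_classes=2)
```

Because only `W` and `b` are learned, training is cheap even when the new dataset is small, which is the situation the Caltech experiments address.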


Fine-Tuning ImageNet trained ZFNet on Caltech-101 dataset


Fine-Tuning ImageNet trained ZFNet on Caltech-256 dataset


Feature Visualization


Visualization of features in a fully trained model. For layers 2-5, the authors show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using the deconvolutional network approach.

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions.

Layer 3 has more complex invariance, capturing similar textures such as mesh patterns and text patterns.

Layer 4 shows significant variation, but is more class-specific, showing for example dog faces and bird’s legs.

Layer 5 shows entire objects with significant pose variation such as keyboards and dogs.

Feature Evolution during Training


Evolution of a randomly chosen subset of model features through training. Each layer’s features are displayed in a different block. Within each block, it shows a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using the DeconvNet approach

Here, the lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

Feature Invariance


Columns 1 and 2: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Column 3: The probability of the true label for each image, as the image is transformed.

Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, where the effect is quasi-linear for translation and scaling. The network output is stable to translations and scaling. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).
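
The measurement itself is simple to sketch. The toy example below compares a position-bound representation with a pooled one when a 1D "image" is shifted. Both feature functions are invented stand-ins for a low layer and a top layer, not the network's real features.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def shift_right(row, pixels):
    """Translate a 1D 'image' to the right, padding with zeros."""
    return [0.0] * pixels + row[: len(row) - pixels]

def low_level_features(row):
    return list(row)              # stand-in for layer 1: tied to pixel positions

def high_level_features(row):
    return [max(row), sum(row)]   # stand-in for a pooled top layer

image = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
moved = shift_right(image, 2)

d_low = euclidean(low_level_features(image), low_level_features(moved))     # 2.0
d_high = euclidean(high_level_features(image), high_level_features(moved))  # 0.0
```

The position-bound features change a lot under a small shift, while the pooled features do not change at all, which is the same qualitative pattern as in the plots.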

Occlusion Sensitivity


The first row example shows the strongest feature to be the dog’s face. When this is covered up, the activity in the feature map decreases (blue area in (b)). When the dog’s face is obscured, the probability for “Pomeranian” drops significantly. In the 1st row, for most occluder locations the predicted label is “Pomeranian”, but if the dog’s face is obscured and the ball is not, the model predicts “tennis ball”. In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.

With these image classification approaches, a natural question arises: is the model truly identifying the location of the object in the image, or just using the surrounding context?

The authors attempt to answer this question by systematically occluding different portions of the input image with a gray square, and monitoring the output of the classifier. The examples above show visualizations from the strongest feature map of the top convolution layer, in addition to the activity in this map (summed over spatial locations) as a function of the occluder position. They clearly show that the model is localizing the objects within the scene, as the probability of the correct class and the activity in the feature map drop significantly when the object is occluded. This shows that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context.
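
The occlusion experiment can be sketched as a sliding gray square plus a record of the classifier output at each position. In the sketch below `toy_classifier` is a made-up stand-in that only looks at the top-left corner of a 4×4 image. A real experiment would call the trained network instead.

```python
def occlude(image, top, left, size, gray=0.5):
    """Return a copy of `image` with a size x size gray square pasted in."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(image))):
        for c in range(left, min(left + size, len(image[0]))):
            out[r][c] = gray
    return out

def occlusion_map(image, classifier, size, stride):
    """Classifier score for the true class at each occluder position."""
    rows, cols = len(image), len(image[0])
    return [
        [classifier(occlude(image, top, left, size))
         for left in range(0, cols - size + 1, stride)]
        for top in range(0, rows - size + 1, stride)
    ]

def toy_classifier(image):
    """Hypothetical 'probability': mean brightness of the top-left 2x2 patch."""
    patch = [image[r][c] for r in range(2) for c in range(2)]
    return sum(patch) / len(patch)

image = [
    [1.0, 1.0, 0.0, 0.0],   # the "object" sits in the top-left corner
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
heatmap = occlusion_map(image, toy_classifier, size=2, stride=2)
# heatmap == [[0.5, 1.0], [1.0, 1.0]]: the score drops only when the object is covered
```

A score that drops only where the object sits indicates that the classifier depends on the object itself, while a flat heatmap would suggest it relies on the surrounding context.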


Thus, the paper remains significant for introducing the perspective we need when structuring a CNN architecture. The visualization techniques it introduced for inspecting the activity within a model are still relevant for reasoning about model performance and for choosing data preprocessing techniques that give better results. The authors also brought to light the fact that CNN models do not generate features with random, uninterpretable patterns (a black box, as many thought), but instead reveal several intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers of a CNN model.
