Classification with Wise-SrNet instead of Global Average Pooling

Image classification with deep neural networks involves two steps:

  1. Generating a feature map from the input image.
  2. Classifying the image based on the generated feature map.

Feature extraction is the most important part, and feature maps with higher semantic value lead to more accurate classification. Even so, the last few layers of the network, which turn the feature map into a prediction, pose difficulties of their own.

The main issue is that the feature map is large: ResNet models such as ResNet-50 produce a 7×7×2048 feature map from a 224×224 image. Feeding a feature map of this shape to a fully connected layer to generate the final classification array significantly increases the number of model weights, especially when the dataset has many classes. ImageNet includes 1000 classes, so a fully connected classification layer on a 7×7×2048 feature map adds 7×7×2048×1000 = 100,352,000 weights. With larger input images, the number grows even further.
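The arithmetic above can be reproduced for other image sizes with a small sketch. The function name `flatten_fc_weights` is hypothetical, and it assumes a ResNet-like backbone that downsamples the input by a factor of 32 and outputs 2048 channels; biases are ignored.

```python
def flatten_fc_weights(image_size: int, num_classes: int,
                       channels: int = 2048, stride: int = 32) -> int:
    """Weights of a fully connected layer fed with the whole feature map.

    Assumes (for illustration) a square input and a backbone that reduces
    each spatial side by `stride`; bias terms are not counted.
    """
    side = image_size // stride  # spatial side of the final feature map
    return side * side * channels * num_classes


for size in (224, 384, 512):
    print(size, f"{flatten_fc_weights(size, 1000):,}")
# 224 100,352,000
# 384 294,912,000
# 512 524,288,000
```

The count grows with the square of the image side, which is why larger inputs make an uncompressed classification layer impractical so quickly.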

Early models such as VGG used fully connected layers to produce the classification array, which added more than 100 million parameters to the model. The imbalance is striking: the main part of the model (the feature extraction layers) contains only about 14M parameters, while the classification stage, which is just a few layers, contains over 100M. This made the original VGG classifiers inefficient and hard to train.
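As a rough check of this imbalance, the sketch below counts parameters for the standard VGG16 layout. It assumes 3×3 convolutions with biases, a 7×7×512 final feature map for a 224×224 input, and a 4096-4096-1000 fully connected head. The helper names are hypothetical.

```python
# VGG16 convolutional layout: numbers are output channels, "M" is a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def conv_params(cfg, in_channels=3, kernel=3):
    """Weights plus biases of all convolution layers in the layout."""
    total = 0
    for out_channels in cfg:
        if out_channels == "M":  # pooling layers have no parameters
            continue
        total += kernel * kernel * in_channels * out_channels + out_channels
        in_channels = out_channels
    return total


def dense_params(sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


feature_extractor = conv_params(VGG16_CFG)
classifier = dense_params([7 * 7 * 512, 4096, 4096, 1000])
print(f"{feature_extractor:,}")  # 14,714,688
print(f"{classifier:,}")         # 123,642,856
```

Under these assumptions, the first fully connected layer alone (7×7×512×4096 weights) accounts for roughly 103M of the classifier's parameters.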

VGG architecture (The blue boxes show the fully connected layers used for classification)

After VGG, ResNet and most subsequent deep convolutional models compress the final feature map before feeding it to the fully connected classification layer. They do so with a Global Average Pooling (GAP) layer, which converts a 7×7×2048 feature map into a 1×1×2048 array by averaging the 7×7 spatial values of each channel. This reduces the number of trainable weights, but it also discards spatial information because the averaging window covers the entire map. Even recently introduced classifiers such as EfficientNetV2 still use this layer for classification.
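To make the loss of spatial information concrete, here is a minimal NumPy sketch of GAP on a random stand-in feature map. The function name and the random data are illustrative assumptions, not part of any model.

```python
import numpy as np


def global_average_pooling(feature_map: np.ndarray) -> np.ndarray:
    """Average an (H, W, C) feature map over H and W, giving a (C,) vector."""
    return feature_map.mean(axis=(0, 1))


rng = np.random.default_rng(0)
feature_map = rng.random((7, 7, 2048))  # stand-in for a real 7x7x2048 feature map

pooled = global_average_pooling(feature_map)
print(pooled.shape)  # (2048,)

# Mirroring the map left-to-right moves every activation to a new position,
# yet GAP produces the same vector: the layout of the activations is ignored.
mirrored = feature_map[:, ::-1, :]
print(np.allclose(pooled, global_average_pooling(mirrored)))  # True
```

Any rearrangement of the 49 positions within a channel gives the same result, so a classifier placed after GAP cannot tell where in the image a feature was found.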

In datasets with few classes, many researchers still use the classic technique of feeding the whole feature map to the final fully connected classification layer without compressing it. They avoid the GAP layer because losing spatial information can significantly decrease classification accuracy in some cases, such as medical datasets.

Wise-SrNet is a recently introduced method for handling the classification stage. It compresses the feature map just as the GAP layer does, but without discarding the spatial information, so you can train your models while keeping the original data and without a large extra computational cost.

The main principle behind Wise-SrNet is to let the neural network learn how to compress the feature map into a smaller array without removing essential data. In other words, during training the model learns a set of weights that it uses to compress the feature map. This is similar to the feature extraction part, in which the model learns how to extract a compact feature map containing useful information from the input images.

The next image shows the architecture of Wise-SrNet. Its core is a depthwise convolution layer whose kernel size equals the spatial size of the feature map, with no activation function. The authors noted that the model could overfit because of the large kernel of the depthwise convolution layer. To resolve this problem, they applied a non-negative constraint to the depthwise convolutional layer, which prevents its weights from becoming negative. They also investigated placing a small average pooling layer before the depthwise convolution layer to reduce its kernel size and prevent overfitting.

Wise-SrNet architecture applied to a deep neural network using 224×224 images.
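The core operation described above can be sketched in NumPy as a forward pass only. This is not the authors' implementation. The function names are hypothetical, the "learned" weights are random stand-ins, and the non-negative constraint is emulated by zeroing negative weights.

```python
import numpy as np


def depthwise_compress(feature_map: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Depthwise convolution whose kernel spans the whole (H, W) extent.

    Each channel is reduced to a single value by its own H x W grid of
    weights; there is no bias and no activation, so the result is (C,).
    """
    assert feature_map.shape == kernel.shape
    return (feature_map * kernel).sum(axis=(0, 1))


def project_non_negative(kernel: np.ndarray) -> np.ndarray:
    """Emulate a non-negative constraint by setting negative weights to zero."""
    return np.maximum(kernel, 0.0)


rng = np.random.default_rng(0)
feature_map = rng.random((7, 7, 2048))  # stand-in for a real feature map

# With every weight equal to 1/49 the layer reproduces Global Average Pooling,
# so GAP is one particular (fixed) setting of these weights.
uniform = np.full((7, 7, 2048), 1.0 / 49.0)
gap_like = depthwise_compress(feature_map, uniform)
print(np.allclose(gap_like, feature_map.mean(axis=(0, 1))))  # True

# Hypothetical "learned" weights: random values projected to be non-negative.
learned = project_non_negative(rng.normal(size=(7, 7, 2048)))
compressed = depthwise_compress(feature_map, learned)
print(compressed.shape, bool((learned >= 0).all()))  # (2048,) True
```

Because each spatial position has its own weight per channel, the layer can weight some regions of the feature map more than others, which a plain average cannot do.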

The following figures show how Global Average Pooling and the depthwise convolutional layer compress the feature map. (All figures are taken from the Wise-SrNet paper.)

Depthwise convolutional layer, compressing the feature map wisely (without losing important data)
Global Average Pooling, compressing the feature map for classification

The following script gives a glimpse of the Wise-SrNet code applied to the Xception model, with the input image size set to 224×224.

Wise-SrNet code applied to the Xception model for 224×224 images.

The authors report the following improvements in classification accuracy with their method:

Effect of Wise-SrNet on a selected part of the ImageNet dataset containing 70 classes. The images were resized to 224×224, and no pre-trained weights were utilized for starting the training process.

Effect of Wise-SrNet on a selected part of the MIT Indoors Scenes dataset using 512×512 images and transfer learning.

The authors of Wise-SrNet also report an interesting result: on large images, the GAP layer may not work at all for some models. In these situations, the only option used to be skipping compression and feeding the whole feature map to the fully connected classification layer (the classic method that significantly increases the model weights). Wise-SrNet solves this problem by compressing the feature map while retaining the spatial information. Based on their results, Wise-SrNet can sometimes be the only way to train a classification model both quickly and accurately.
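To put the three options side by side, the following sketch counts the weights of each classification head. It assumes a backbone with 2048 output channels and a downsampling factor of 32, a depthwise kernel that spans the whole feature map with no pooling before it, and no bias terms. The function name is hypothetical.

```python
def head_weights(image_size: int, num_classes: int,
                 channels: int = 2048, stride: int = 32) -> dict:
    """Weight counts (biases ignored) for three classification heads."""
    side = image_size // stride              # spatial side of the feature map
    dense_after_compression = channels * num_classes
    return {
        # classic method: the whole feature map goes into the dense layer
        "flatten + FC": side * side * channels * num_classes,
        # GAP has no weights of its own
        "GAP + FC": dense_after_compression,
        # one side x side kernel per channel, then the same dense layer
        "depthwise + FC": side * side * channels + dense_after_compression,
    }


for size in (224, 512):
    print(size, {name: f"{n:,}" for name, n in head_weights(size, 1000).items()})
# 224 {'flatten + FC': '100,352,000', 'GAP + FC': '2,048,000', 'depthwise + FC': '2,148,352'}
# 512 {'flatten + FC': '524,288,000', 'GAP + FC': '2,048,000', 'depthwise + FC': '2,572,288'}
```

Under these assumptions, the learned compression adds only a small fraction of the weights that the uncompressed head requires, even for 512×512 inputs.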

References

1. Rahimzadeh, M., Parvin, S., Safi, E. and Mohammadi, M.R., 2021. Wise-SrNet: A Novel Architecture for Enhancing Image Classification by Learning Spatial Resolution of Feature Maps. arXiv preprint arXiv:2104.12294.

2. Lin, M., Chen, Q. and Yan, S., 2013. Network in network. arXiv preprint arXiv:1312.4400.

3. Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1251–1258).