Intentional Attention Mask Transformation for Robust CNN Classification

Abstract

Convolutional Neural Networks (CNNs) have achieved impressive results on various tasks, but interpreting their internal mechanisms remains a challenging problem. To tackle this problem, we exploit a multi-channel attention mechanism in feature space. Our network architecture allows us to obtain an attention mask for each feature, whereas existing CNN visualization methods provide only a common attention mask for all features. We apply the proposed multi-channel attention mechanism to a multi-attribute recognition task and obtain a different attention mask for each feature and each attribute. These analyses give us deeper insight into the feature space of CNNs. Furthermore, the proposed attention mechanism naturally derives a method for improving the robustness of CNNs. From observations of the feature space based on the proposed attention masks, we demonstrate that we can obtain robust CNNs by intentionally emphasizing the features that are important for each attribute. Experimental results on a benchmark dataset show that the proposed method gives high human interpretability while accurately grasping the attributes of the data, and improves network robustness.

Masanari Kimura, Masayuki Tanaka
National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
Tokyo Institute of Technology, Tokyo, Japan

1 Introduction

In recent years, Convolutional Neural Networks (CNNs) have achieved great success in various tasks [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei]. Despite such success, it is known that the interpretation of CNNs is difficult for humans. Therefore, visual explanation, which visualizes the inference mechanism of CNNs, is becoming one of the hot topics [Selvaraju et al.(2017)Selvaraju, Cogswell, Das, Vedantam, Parikh, and Batra, Zhang and Zhu(2018), Kuwajima et al.(2019)Kuwajima, Tanaka, and Okutomi, Kimura and Tanaka(2019)]. In this paper, we introduce multi-channel attention sub-networks to improve the interpretability of CNNs. Our main idea is to train a sub-network with a multi-channel attention mask for each attribute. The multi-channel attention sub-networks provide feature-dependent attention masks, while existing attention mask approaches [Fukui et al.(2018)Fukui, Hirakawa, Yamashita, and Fujiyoshi, Wang et al.(2017)Wang, Jiang, Qian, Yang, Li, Zhang, Wang, and Tang, Hu et al.(2018)Hu, Shen, and Sun] provide only a single common attention mask. This multi-channel attention mechanism can reveal which channel in the feature map focuses on which part of the image; a minimal sketch of the difference is given below.
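As a minimal sketch (assuming PyTorch; the batch size, the 14×14 spatial size, and the 128-channel feature map are illustrative assumptions, not the paper's configuration), the difference amounts to the shape of the mask that multiplies the feature map:

```python
import torch

# Feature map from the extractor: batch x channels x height x width (sizes are illustrative).
features = torch.randn(8, 128, 14, 14)

# Conventional attention: one common mask shared by every channel (broadcast over channels).
common_mask = torch.sigmoid(torch.randn(8, 1, 14, 14))
attended_common = features * common_mask

# Multi-channel attention: an independent mask per feature channel, so each channel
# can attend to a different part of the image.
multi_channel_mask = torch.sigmoid(torch.randn(8, 128, 14, 14))
attended_multi = features * multi_channel_mask
```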

Figure 1: Overview of the intentional attention mask transformation. In this framework, robust inference can be achieved by transforming the attention mask with a transformation function that considers side information without retraining the network. In this paper, a symmetric function inspired by the tone curve is used as the transformation function, and the slope and bias of the symmetric function are used as the side information, but any function and side information (such as prior knowledge of data) can be used.

Improvements to CNNs have usually been obtained through heuristic approaches or trial and error, owing to the lack of interpretability of CNNs. In contrast, we propose an intentional attention mask transformation to improve the robustness of CNN classification performance. Our approach is to focus on the features that are important for the classification. The multi-channel attention sub-networks tell us which features and which regions are important for the classification. We can then easily focus on those features simply by transforming the attention mask. We call this operation an intentional attention mask transformation. Figure 1 shows an overview of the intentional attention mask transformation approach. In our approach, we can control the properties of the network by changing the intentional attention mask transformation depending on the side information. Note that we can control the properties of the network in the inference phase without re-training.

In summary, our contributions are as follows.

  • We introduce multi-channel attention sub-networks to improve the interpretability of CNNs. The multi-channel attention masks show the importance of each feature for classifying each attribute.

  • We propose an intentional attention mask transformation to improve the robustness of CNNs against image degradation such as noise. The intentional attention mask transformation is performed based on the feature importance generated by the multi-channel attention sub-networks.

  • We conduct extensive experiments on the benchmark dataset and show the usefulness of the proposed method.

Our attention mechanism makes it possible to acquire noise-robust neural networks and provides us with high interpretability.

2 Proposed Method

2.1 Network Architecture and Loss Functions

We aim to interpret and analyze the internal mechanisms of CNNs using an attention mechanism. Figure 2 shows an overview of our proposed network architecture. In the proposed method, the network produces multiple outputs corresponding to an image with multiple attributes.

Let $\mathcal{X} = \{x_i\}_{i=1}^{N}$ be a sample set and $\mathcal{Y} = \{y_i\}_{i=1}^{N}$ be the corresponding label set, where $N$ is the number of samples. The feature extractor $\phi$ extracts a generic feature $z = \phi(x)$ that is used by all the following networks. For this component, we use the Dilation Network [Yu and Koltun(2016)].

Binary classifiers $f_k$ ($k = 1, \dots, K$), where $K$ is the number of attributes, are the components that perform binary classification corresponding to each attribute of the image. This network is our main component. The loss for the binary classifiers can be expressed as a sum of the binary cross-entropy over the attributes:

$\mathcal{L}_{\mathrm{bin}} = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathcal{L}_{\mathrm{BCE}}\left(\hat{y}_{i,k},\, y_{i,k}\right),$   (1)
$\hat{y}_{i,k} = f_k\left(T\left(M_k(x_i)\right) \odot \phi(x_i)\right),$   (2)

where $M_k(x_i)$ is the attention mask for the $k$-th attribute, $T(\cdot)$ is the transformation function, $\odot$ represents the element-wise product, $\mathcal{L}_{\mathrm{BCE}}$ is the binary cross-entropy, and $y_{i,k}$ and $\hat{y}_{i,k}$ are the ground-truth and predicted labels of the $k$-th attribute of sample $x_i$. Note that the attention mask has the same number of channels as the feature $\phi(x)$, which differentiates our approach from existing importance visualization algorithms. For learning, the transformation function $T$ is chosen to emphasize the area of interest while keeping the low values of the attention mask, following [Fukui et al.(2018)Fukui, Hirakawa, Yamashita, and Fujiyoshi].
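The sketch below, assuming PyTorch, illustrates one per-attribute branch and the loss of Eq. (1). The class and function names (MultiChannelAttentionHead, mask_net, binary_branch_loss), the layer choices, and the $(1 + M)$ emphasis (borrowed from the attention-branch style of Fukui et al.) are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiChannelAttentionHead(nn.Module):
    """One per-attribute branch: a multi-channel attention mask M_k, a
    transformation T, and a binary classifier f_k (all layer choices are assumptions)."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.mask_net = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1)
        )

    def forward(self, z):
        m = self.mask_net(z)                  # M_k: same number of channels as the feature z
        attended = (1.0 + m) * z              # T(M_k) ⊙ z; the (1 + M) form is an assumption
        return self.classifier(attended), m   # logit for attribute k, plus the mask

def binary_branch_loss(heads, z, targets):
    """Eq. (1): sum of binary cross-entropy losses over the K attributes.
    `targets` has shape (batch, K) with entries in {0, 1}."""
    bce = nn.BCEWithLogitsLoss()
    total, masks = 0.0, []
    for k, head in enumerate(heads):
        logit, m = head(z)
        total = total + bce(logit.squeeze(1), targets[:, k].float())
        masks.append(m)
    return total, masks

# Usage sketch: heads = nn.ModuleList(MultiChannelAttentionHead() for _ in range(40))
```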

Figure 2: Overview of the proposed network architecture. Here, there are $K$ binary classification components, where $K$ is the number of attributes.

The multi-label classifier $h$ is a component that classifies multiple labels simultaneously. The loss for the multi-label classifier is

$\mathcal{L}_{\mathrm{mul}} = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathcal{L}_{\mathrm{BCE}}\left(\hat{s}_{i,k},\, y_{i,k}\right),$   (3)
$\hat{s}_i = h(\phi(x_i)),$   (4)

where $\hat{s}_i$ is the predicted label vector for sample $x_i$. We include this network component to obtain a better feature representation.

The reconstructor $r$ is a component that reconstructs the input image from the extracted feature. The reconstruction loss is as follows.

$\mathcal{L}_{\mathrm{rec}} = \sum_{i=1}^{N} \left\| x_i - r(\phi(x_i)) \right\|_2^2.$   (5)

This component aims to obtain a better feature representation $z = \phi(x)$.

The overall loss function is:

$\mathcal{L} = \lambda_{\mathrm{bin}} \mathcal{L}_{\mathrm{bin}} + \lambda_{\mathrm{mul}} \mathcal{L}_{\mathrm{mul}} + \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{sp}} \sum_{i=1}^{N} \sum_{k=1}^{K} \left\| M_k(x_i) \right\|_1.$   (6)

In the above equation, $\lambda_{\mathrm{bin}}$, $\lambda_{\mathrm{mul}}$, $\lambda_{\mathrm{rec}}$, and $\lambda_{\mathrm{sp}}$ are the weight parameters of each component. The term $\| M_k \|_1$ imposes L1 sparseness on the attention mask and is used to extract the features that are truly important for the data. We experimentally validate the effectiveness of the multi-label classifier and the reconstructor in Section 3.1.
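As a minimal sketch of Eq. (6), assuming the component losses have already been computed and using placeholder weight values (the paper's actual $\lambda$ settings are not reproduced here), the total training loss could be assembled as follows.

```python
def total_loss(loss_bin, loss_mul, loss_rec, masks,
               w_bin=1.0, w_mul=1.0, w_rec=1.0, w_sp=1e-4):
    """Eq. (6) sketch: weighted sum of the binary-classifier, multi-label, and
    reconstruction losses plus an L1 sparsity penalty on the attention masks.
    The weight values are placeholders, not the paper's settings."""
    sparsity = sum(m.abs().sum() for m in masks)   # L1 sparseness of the attention masks
    return w_bin * loss_bin + w_mul * loss_mul + w_rec * loss_rec + w_sp * sparsity
```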

Figure 3: Visualization of the transformation function. In this figure, the parameter $a$ takes the value 0 or 4 and $b$ ranges from 0 to 1; one combination of $a$ and $b$ corresponds to the linear transformation.

2.2 Intentional Attention Mask Transformation

Here, we propose an intentional attention mask transformation to improve the robustness of CNNs. Our main idea is to reduce the effect of noise by focusing only on the feature regions that are important for the classification of each attribute. To achieve this goal, we introduce a simple attention mask transformation function $T_{a,b}$:

(7)
(8)

where $a$ and $b$ are parameters that adjust the emphasis and suppression of the mask. The function is symmetric with respect to 0.5 and emphasizes large values in the mask $M_k$. It is a transformation similar to an intensity tone curve in image retouching. Figure 3 shows the transformation function for each parameter pair.
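Since only the qualitative properties of $T_{a,b}$ are described here (symmetry about 0.5 and tone-curve-like emphasis of large mask values), the following sketch is a hypothetical stand-in rather than the exact function of Eqs. (7) and (8); the parameter names `emphasis` and `bias` are illustrative counterparts of $a$ and $b$, and the functional form is an assumption.

```python
import torch

def tone_curve_transform(mask, emphasis=2.0, bias=0.0):
    """Hypothetical stand-in for the intentional attention mask transformation.
    It is NOT the paper's exact Eqs. (7)-(8): the curve is symmetric about 0.5,
    `emphasis` > 1 pushes mask values above 0.5 toward 1 (and values below 0.5
    toward 0), and `bias` shifts the whole mask. With emphasis = 1 and bias = 0
    the transformation is the identity."""
    centered = mask - 0.5
    curved = 0.5 + 0.5 * torch.sign(centered) * (2.0 * centered.abs()) ** (1.0 / emphasis)
    return (curved + bias).clamp(0.0, 1.0)

# At inference time the trained mask M_k would simply be replaced by
# tone_curve_transform(M_k) before the element-wise product with the feature map,
# so no re-training of the network is required.
```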

The output of a binary classifier with our intentional attention mask transformation applied is:

$\hat{y}_k = f_k\left(T_{a,b}\left(M_k(x)\right) \odot \phi(x)\right).$   (9)

The above equation emphasizes the features that are important for each attribute of sample $x$. Emphasizing the features that are important for a given attribute makes the classifier robust to the effects of noise.

3 Experimental Results

We evaluate our method using the CelebA dataset [Liu et al.(2015)Liu, Luo, Wang, and Tang], which consists of 40 facial attribute labels and 202,599 images (182,637 training images and 19,962 testing images).

The parameters of the proposed method are the weight parameters $\lambda_{\mathrm{bin}}$, $\lambda_{\mathrm{mul}}$, $\lambda_{\mathrm{rec}}$, and $\lambda_{\mathrm{sp}}$. Also, the number of channels of the attention mask and the feature map is 128.

3.1 Ablation Study and Comparisons with Existing Algorithms

First, we experimentally validate the effectiveness of the reconstructor and the multi-label classifier with an ablation study. In this experiment, the parameters $a$ and $b$ of the transformation function are fixed. Table 1 shows the average accuracy on the CelebA dataset. This result demonstrates that the reconstructor and the multi-label classifier both contribute to improving performance.

We also compare the performance of the proposed network structure with existing networks: MT-RBM PCA [Ehrlich et al.(2016)Ehrlich, Shields, Almaev, and Amer], LNets+ANet [Liu et al.(2015)Liu, Luo, Wang, and Tang], and FaceTracer [Kumar et al.(2008)Kumar, Belhumeur, and Nayar]. Table 4 shows the experimental results for the classification task on the CelebA dataset. The proposed network achieves good performance on many attributes and the best overall average accuracy.

Figure 4: Visualizing attention masks on multiple facial attribute recognition. Each element is one channel of the attention mask. The number under each mask is its feature ID.
Method Average Accuracy
Ours 92.05
Ours w/o Reconstructor 89.14
Ours w/o Multi-label classifier 88.12
Ours w/o Reconstructor & Multi-label classifier 86.58
Table 1: Ablation study for the restraint networks: comparison of classification accuracy with and without the reconstructor and the multi-label classifier.

3.2 Visualization of the Attention Mask

Figure 4 shows the visualization of the attention masks produced by the proposed method. We selected several feature channels for visualization. Each column presents the top three features with the highest importance for each attribute. Our attention masks focus on areas that are likely to be important for the attributes. In addition, this experimental result suggests that analysis in feature space reveals the relationships among attributes. For example, feature IDs 25, 14, and 50 are not used for Mouth Slightly Open, although they are used for Smiling. On the other hand, IDs 105 and 50 are used for both Smiling and Mouth Slightly Open, and ID 8 is not used for Smiling. These results are consistent with human intuition: whereas Mouth Slightly Open should focus only on the mouth, Smiling must focus on a wider range, such as the eyes. Table 2 lists some of the feature IDs and their highly correlated features. Our multi-channel attention mechanism makes it possible to obtain correlations among the channels of the feature map; a sketch of how such correlations can be computed is given below.
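One way channel-wise correlations like those in Table 2 could be computed is sketched here; how the masks are aggregated over images and attributes, and the function names, are assumptions.

```python
import torch

def channel_correlations(masks):
    """Sketch: correlation among the channels of the attention masks.
    `masks` has shape (num_samples, channels, H, W); each channel is flattened
    over samples and spatial positions, then the Pearson correlation matrix
    between channels is computed."""
    n, c, h, w = masks.shape
    flat = masks.permute(1, 0, 2, 3).reshape(c, -1)   # channels x (samples * H * W)
    return torch.corrcoef(flat)                       # (channels x channels) correlation matrix

# Example: the five channels most correlated with channel 1 (cf. Table 2).
# corr = channel_correlations(all_masks)
# top = torch.topk(corr[1], k=6).indices[1:]          # drop the self-correlation
```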

Table 3 lists some of the attributes and their highly correlated attributes. Attributes that are intuitively similar are highly correlated. This result makes it possible to group highly correlated attributes. In addition, experimental results may even reveal potential relationships among attributes.

Target Feature Top5 Highly Correlated Features
1 72 (0.98) 87 (0.95) 44 (0.94) 92 (0.94) 87 (0.94)
32 111 (0.96) 114 (0.95) 100 (0.95) 15 (0.94) 119 (0.94)
64 127 (0.97) 15 (0.97) 119 (0.97) 114 (0.97) 57 (0.97)
Table 2: Correlation among the features. It lists the target features, the features highly correlated with each target, and the correlation values (in parentheses).
Target Attribute Top5 Highly Correlated Attributes
Black Hair Blond Hair, Brown Hair, Bald, Wearing Hat, Gray Hair
Heavy Makeup Wearing Lipstick, Male, Rosy Cheeks, Attractive, Young
Bushy Eyebrows Bags Under Eyes, Eyeglasses, Arched Eyebrows, Heavy Makeup, Attractive
Table 3: Correlation among the attributes. It lists the target attributes and the top five attributes that are highly correlated with the target.
Figure 5: Transition of the performance degradation of the network using the intentional attention mask transformation. We add Gaussian noise with a standard deviation of 0 to 0.5 to the input images to create pseudo-noisy data. Here, $a$ and $b$ are the parameters that adjust the emphasis and suppression of the mask.

3.3 Robustness against Noisy Inputs

We evaluate the robustness of the network obtained by the intentional attention mask transformation. We add Gaussian noise with a standard deviation of 0 to 0.5 to the input images to create pseudo-noisy data. Figure 5 shows how the performance on this noisy input data changes for several combinations of the parameters. The experimental results show that the transformation of the attention mask makes the network robust against the performance degradation caused by noise. The curve corresponding to the linear transformation is the performance transition of the network without the intentional attention mask transformation, and by adjusting $a$ and $b$, a performance improvement on noisy input is achieved. The evaluation protocol is sketched below.
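The following sketch illustrates the robustness protocol described above; `model`, the image scaling to [0, 1], and the accuracy reduction are assumptions, not the paper's exact evaluation code.

```python
import torch

def evaluate_under_noise(model, images, targets, sigmas=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5)):
    """Add Gaussian noise with standard deviation 0 to 0.5 to the input images and
    track classification accuracy. `model` is assumed to return per-attribute logits,
    `images` are assumed to be scaled to [0, 1], and `targets` holds 0/1 labels."""
    accuracies = []
    for sigma in sigmas:
        noisy = (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)
        with torch.no_grad():
            logits = model(noisy)
        preds = (logits > 0).float()
        accuracies.append((preds == targets).float().mean().item())
    return accuracies
```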

Attribute Ours [Ehrlich et al.(2016)Ehrlich, Shields, Almaev, and Amer] [Liu et al.(2015)Liu, Luo, Wang, and Tang] [Kumar et al.(2008)Kumar, Belhumeur, and Nayar]
5 o'Clock Shadow 92.85 90 91 85
Arched Eyebrows 81.37 77 79 76
Attractive 80.71 76 81 78
Bags Under Eyes 83.79 81 79 76
Bald 98.30 98 98 89
Bangs 94.10 88 95 88
Big Lips 70.14 69 68 64
Big Nose 83.67 81 78 74
Black Hair 88.39 76 88 70
Blond Hair 95.10 91 95 80
Blurry 95.33 95 84 81
Brown Hair 86.55 83 80 60
Bushy Eyebrows 91.87 88 90 80
Chubby 96.02 95 91 86
Double Chin 96.68 96 92 88
Eyeglasses 98.67 96 99 98
Goatee 96.72 96 95 93
Gray Hair 97.89 97 97 90
Heavy Makeup 89.49 85 90 85
High Cheekbone 86.77 83 87 84
Male 97.38 90 98 91
Mouth Open 93.67 82 92 87
Mustache 96.60 97 95 91
Narrow Eyes 86.38 86 81 82
No Beard 94.87 90 95 90
Oval Face 73.33 73 66 64
Pale Skin 97.67 96 91 83
Pointy Nose 75.62 73 72 68
Recede Hair 93.44 96 89 76
Rosy Cheeks 94.67 94 90 84
Sideburns 97.65 96 96 94
Smiling 92.28 88 92 89
Straight Hair 81.60 80 73 63
Wavy Hair 81.64 72 80 73
Earring 84.61 81 82 73
Hat 98.92 97 99 89
Lipstick 92.52 89 93 89
Necklace 86.37 87 71 68
Necktie 96.30 94 93 86
Young 87.00 81 87 80
Average 92.05 87 87 81
Table 4: Classification accuracy on the CelebA dataset. In this experiment, MT-RBM PCA [Ehrlich et al.(2016)Ehrlich, Shields, Almaev, and Amer], LNets+ANet [Liu et al.(2015)Liu, Luo, Wang, and Tang], and FaceTracer [Kumar et al.(2008)Kumar, Belhumeur, and Nayar] are used as comparison methods.

4 Conclusion and Discussion

We proposed a novel network architecture and attention mechanism that can give a visual explanation of CNNs. Our multi-channel attention mechanism makes it possible to obtain correlations among the channels of the feature map. We suggest that the analysis of the feature maps obtained by the proposed method is highly versatile and can lead to a broad range of applied research, such as improvement of classification accuracy, network pruning, image generation, and other applications. As one such application, we have shown that intentional transformation of the attention mask can improve the robustness of CNNs.

References

  • [Ehrlich et al.(2016)Ehrlich, Shields, Almaev, and Amer] Max Ehrlich, Timothy J Shields, Timur Almaev, and Mohamed R Amer. Facial attributes classification using multi-task representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 47–55, 2016.
  • [Fukui et al.(2018)Fukui, Hirakawa, Yamashita, and Fujiyoshi] Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, and Hironobu Fujiyoshi. Attention branch network: Learning of attention mechanism for visual explanation. arXiv preprint arXiv:1812.10025, 2018.
  • [Hu et al.(2018)Hu, Shen, and Sun] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [Kimura and Tanaka(2019)] Masanari Kimura and Masayuki Tanaka. Interpretation of feature space using multi-channel attentional sub-networks. arXiv preprint arXiv:1904.13078, 2019.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [Kumar et al.(2008)Kumar, Belhumeur, and Nayar] Neeraj Kumar, Peter Belhumeur, and Shree Nayar. Facetracer: A search engine for large collections of images with faces. In European conference on computer vision, pages 340–353. Springer, 2008.
  • [Kuwajima et al.(2019)Kuwajima, Tanaka, and Okutomi] Hiroshi Kuwajima, Masayuki Tanaka, and Masatoshi Okutomi. Improving transparency of deep neural inference process. arXiv preprint arXiv:1903.05501, 2019.
  • [Liu et al.(2015)Liu, Luo, Wang, and Tang] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • [Selvaraju et al.(2017)Selvaraju, Cogswell, Das, Vedantam, Parikh, and Batra] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
  • [Wang et al.(2017)Wang, Jiang, Qian, Yang, Li, Zhang, Wang, and Tang] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2017.
  • [Yu and Koltun(2016)] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.
  • [Zhang and Zhu(2018)] Quan-shi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.