Every Filter Extracts a Specific Texture in Convolutional Neural Networks

Zhiqiang Xia  Ce Zhu  Zhengtao Wang  Qi Guo  Yipeng Liu
School of Electronic Engineering / Center for Robotics
University of Electronic Science and Technology of China (UESTC), Chengdu, China
Email: {eczhu, yipengliu}@uestc.edu.cn, {zhiqiangxia}@std.uestc.edu.cn

Abstract

Many works have concentrated on visualizing and understanding the inner mechanism of convolutional neural networks (CNNs) by generating images that activate some specific neurons; this is called deep visualization. However, it is still unclear what the filters intuitively extract from images. In this paper, we propose a modified code inversion algorithm, called feature map inversion, to understand the function of a filter of interest in CNNs. We reveal that every filter extracts a specific texture, and that textures from higher layers contain more colours and more intricate structures. We also demonstrate that the style of an image can be a combination of these texture primitives. We propose two methods to reallocate the energy distribution of feature maps, randomly or purposefully, and then invert the modified code to generate images of diverse styles. With these results, we provide an explanation of why the Gram matrix of feature maps [1] can represent image style.


Index Terms—  Feature maps, filter of interest, code inversion, texture primitives, style transfer

1 Introduction

Convolutional neural networks (CNNs) have achieved tremendous progress on many pattern recognition tasks, especially large-scale image recognition problems [2, 3, 4, 5]. However, on the one hand, CNNs still make mistakes easily. [6] reveals that adding adversarial noise to an image in a way imperceptible to humans can cause CNNs to mislabel the image. [7] shows a related result: it is easy to use evolutionary algorithms to generate images that are completely unrecognizable to humans, yet state-of-the-art CNNs classify them into certain categories with 99.99% confidence. On the other hand, it is still unclear how CNNs learn suitable features from the training data and what a feature map represents [8, 9]. This opacity of CNNs has motivated the recent development of visualization of CNNs, also known as deep visualization [10, 11, 12, 13, 14, 15, 16]. Deep visualization aims to reveal the internal mechanism of CNNs by generating an image that activates some specific neurons, which can provide meaningful and helpful insights for designing more effective architectures [17, 18].

There are many deep visualization techniques available for understanding CNNs. Perhaps the simplest is displaying the response of a specific layer, or of several special feature maps (to avoid conceptual confusion, in this paper "code" denotes the activations of a whole layer, while "feature map" denotes the activation of a single filter at a layer). However, these feature maps only provide limited and unintuitive information about filters and images. For instance, although it is possible to find some filters that respond to a specific object such as a "face" [19], this approach is heuristic and not universal.

A major deep visualization technique is activation maximization [10], which finds an image that most intensively activates some specific neurons, in order to reveal which features these neurons respond to. [20] shows the object concepts learned by AlexNet by maximizing neuron activations at the last layer. [21] generates similar results by applying activation maximization to a single feature map. [22] generates many fantastic images by intensifying the activated neurons of input images at both high and low layers, which is called "deep dream". However, the generated images are rough, so a series of subsequent works concentrated on improving the quality of generated images by adding natural priors, such as norm regularization [20], total variation [23], jitter [22], Gaussian blur [21] and data-driven patch priors [13]. Besides, [24] also uncovers the different types of features learned by each neuron with an a priori input image.

Another major deep visualization technique is code inversion [23], which generates an image whose activation code at a particular layer is similar to the target code produced by a specific image. It reveals which features the filters extract from the input image. Code inversion can be realized by training another neural network to directly predict the reconstructed image [25], by iteratively optimizing an initial noisy image [26], or by transposing the CNN to project the feature activations back into the input pixel space with a deconvnet [18]. These inversion methods can also be extended to statistical properties of the code: [27] visualizes the Gram matrix of feature maps and finds that it represents image texture, and [1, 28, 29] utilize the Gram matrix to perform image style transfer. Compared with activation maximization, code inversion intuitively reveals the specific features extracted by filters from given images.
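As an illustration, the following minimal sketch shows code inversion by iterative optimization in the spirit of [23, 26]: an image is optimized so that its code at a chosen layer matches a target code. PyTorch and torchvision's VGG-19 are used here only for convenience (our implementation is based on MXNet [30]); the layer index, learning rate and number of steps are arbitrary placeholder choices, not the settings of any cited work.

```python
import torch
import torchvision.models as models

cnn = models.vgg19(pretrained=True).features.eval()   # convolutional part of VGG-19
for p in cnn.parameters():
    p.requires_grad_(False)

def code(x, layer_idx=22):
    """Return the activations ("code") of layer `layer_idx` for image batch x."""
    out = x
    for i, module in enumerate(cnn):
        out = module(out)
        if i == layer_idx:                             # roughly relu4_2 in torchvision's indexing
            return out
    return out

target_img = torch.rand(1, 3, 224, 224)                # stands in for a preprocessed input image
target_code = code(target_img).detach()                # target code produced by the input image

x = torch.randn(1, 3, 224, 224, requires_grad=True)    # start from a noisy image
opt = torch.optim.Adam([x], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(code(x), target_code)
    loss.backward()
    opt.step()
# x now approximates an image whose code matches target_code.
```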

Many previous works in deep visualization have provided valuable explanations of a single neuron [20, 7], a feature map [21] or the code [22] at different layers, so CNNs are no longer complete black boxes. However, to the best of our knowledge, there is still no work visualizing what exactly every filter tries to capture from an image. A better understanding of filters could help improve the architecture of CNNs.

In this paper, we introduce Feature Map Inversion (FMI) to address this problem. For a filter of interest, FMI enhances the corresponding feature map and weakens the remaining feature maps at the same time. A classical code inversion algorithm is then applied to the modified code to generate inversion images. Our experimental results show that every filter in CNNs extracts a specific texture, and that the textures at higher layers contain more colours and more intricate structures (Fig. 3). In addition, we find that the style of an image can be a combination of hierarchical texture primitives. We propose two methods to generate images of diverse styles by inverting modified codes: we change the code by reallocating the sum of each feature map either randomly, or purposefully according to a target code. With these results, we provide an explanation of why the Gram matrix of feature maps [1] can represent image style. Since every filter extracts a specific texture, the combination weights of the feature maps decide the image style; like the sum of feature maps along the channel axis, the Gram matrix also guides the energy of every feature map of the generated image.

Our experiments are conducted with the open-source deep learning framework MXNet [30], and the code is available at https://github.com/xzqjack/FeatureMapInversion.

2 Method

2.1 Feature Map Inversion

In this section, we use FMI to answer the question: "What does a filter in CNNs try to capture from an input image?" Given an input image $x_0$, a trained CNN encodes the image as a code $\Phi(x_0)$. Code inversion aims to find a new image $x$ whose code $\Phi(x)$ is similar to $\Phi(x_0)$. As shown in Fig. 1, for a chosen layer such as relu5_1 in VGG-19 [4], the code is a 3D tensor with $C$ feature maps ($C = 512$ for relu5_1), each of shape $H \times W$. To visualize what feature the $i$-th filter extracts, we intensify the $i$-th feature map and weaken the others. In this paper, we set the $i$-th feature map to the sum of feature maps along the channel axis and the others to 0. Finally, we apply classical code inversion [23] to the modified code $\hat{\Phi}^{(i)}(x_0)$:

$$\hat{x} = \operatorname*{arg\,min}_{x} \; \lVert \Phi(x) - \hat{\Phi}^{(i)}(x_0) \rVert_2^2, \qquad (1)$$

$$\hat{\Phi}^{(i)}(x_0)_{j,h,w} = \begin{cases} \sum_{k=1}^{C} \Phi(x_0)_{k,h,w}, & j = i, \\ 0, & j \neq i, \end{cases} \qquad (2)$$

where $\Phi(x_0)_{k,h,w}$ denotes the activation of the $k$-th feature map at location $(h,w)$. The operator $\hat{\Phi}^{(i)}$ enhances the feature map of the $i$-th filter and weakens the others.
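For clarity, a minimal NumPy sketch of the modification in Eq. (2) is given below; the function name and the 512 x 14 x 14 code shape are illustrative assumptions, not part of the method's definition.

```python
import numpy as np

def modify_code_fmi(code, i):
    """Eq. (2): keep only the i-th feature map, set to the channel-wise sum of all maps."""
    modified = np.zeros_like(code)          # weaken every feature map to zero
    modified[i] = code.sum(axis=0)          # enhance the i-th map: sum along the channel axis
    return modified

# Example with a relu5_1-like code of 512 feature maps (spatial size depends on the input).
phi = np.random.rand(512, 14, 14)
phi_hat = modify_code_fmi(phi, i=3)
```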

2.2 Modified Code Inversion

Knowing that every filter extracts a specific texture, we assume that the style of an image can be considered a combination of hierarchical texture primitives. If so, we can recombine these texture primitives by modifying the energy distribution of the feature maps, either randomly or purposefully. Applying code inversion to the modified code then generates images of diverse styles.

The random modification method keeps the activation states (activated or not) of the neurons unchanged and reallocates the sum of every feature map. We first generate a random vector $\lambda = (\lambda_1, \dots, \lambda_C)$ with $\lambda_j \ge 0$ and $\sum_{j=1}^{C} \lambda_j = 1$, and then reallocate the energy of every feature map according to $\lambda$. The modified code is

$$\hat{\Phi}(x_0)_{j,h,w} = \lambda_j \, \frac{\sum_{k}\sum_{h',w'} \Phi(x_0)_{k,h',w'}}{\sum_{h',w'} \Phi(x_0)_{j,h',w'}} \, \Phi(x_0)_{j,h,w}. \qquad (3)$$

Fig. 1: Modification of feature maps at layer relu5_1 in VGG-19.

We take the modified code $\hat{\Phi}(x_0)$ as the target and generate an image by

$$\hat{x} = \operatorname*{arg\,min}_{x} \; \lVert \Phi(x) - \hat{\Phi}(x_0) \rVert_2^2. \qquad (4)$$

The content of the generated image is roughly preserved, while the style varies considerably, as shown in Fig. 4.
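A NumPy sketch of the random reallocation in Eq. (3) might look as follows; the helper name and the small epsilon guard are illustrative choices.

```python
import numpy as np

def reallocate_energy(code, rng=None, eps=1e-8):
    """Randomly redistribute the total energy over feature maps, keeping zero entries at zero."""
    rng = np.random.default_rng() if rng is None else rng
    C = code.shape[0]
    lam = rng.random(C)
    lam /= lam.sum()                         # lambda_j >= 0 and sum_j lambda_j = 1
    total = code.sum()                       # total energy of the layer
    per_map = code.sum(axis=(1, 2)) + eps    # current energy of each feature map
    scale = lam * total / per_map            # new energy of map j becomes lambda_j * total
    return code * scale[:, None, None]

phi_hat = reallocate_energy(np.random.rand(512, 14, 14))
```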

Going one step further, we modify the energy of each feature map purposefully, by making the sum of every feature map similar to that of a target code. In particular, suppose that we have two input images: a content image $x_c$ and a style image $x_s$. For a layer $l$, their codes are denoted $\Phi^{l}(x_c)$ and $\Phi^{l}(x_s)$. We use the content code as the content constraint [1] and the sum of feature maps along the channel axis as the style constraint. Finally, we generate a styled image by

$$\hat{x} = \operatorname*{arg\,min}_{x} \; \alpha \,\lVert \Phi^{l_c}(x) - \Phi^{l_c}(x_c) \rVert_2^2 + \beta \sum_{l \in L_s} \lVert S\!\left(\Phi^{l}(x)\right) - S\!\left(\Phi^{l}(x_s)\right) \rVert_2^2, \qquad (5)$$

where $\alpha$ and $\beta$ are the weights of the content term and the style term, respectively, $l_c$ denotes the content layer, $L_s$ denotes the set of style layers, and $S(\cdot)$ denotes the sum of feature maps along the channel axis.
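The loss in Eq. (5) can be assembled as in the sketch below, given pre-computed codes for the generated, content and style images; the dictionary layout and the alpha/beta values are placeholders rather than the settings used in our experiments.

```python
import numpy as np

def channel_sum(code):
    """Style descriptor S(.): sum of feature maps along the channel axis, shape (H, W)."""
    return code.sum(axis=0)

def pmci_loss(codes_x, codes_c, codes_s, content_layer, style_layers, alpha=1.0, beta=1e3):
    """codes_*: dicts mapping layer name -> (C, H, W) activations for x, x_c and x_s."""
    content = np.sum((codes_x[content_layer] - codes_c[content_layer]) ** 2)
    style = sum(np.sum((channel_sum(codes_x[l]) - channel_sum(codes_s[l])) ** 2)
                for l in style_layers)
    return alpha * content + beta * style
```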

3 Experiment

3.1 Experiment Setting

We conduct our experiments on a well-known deep convolutional neural network, VGG-19 [4]. It is trained to recognize 1000 different object categories on the 1.2-million-image ILSVRC 2014 ImageNet dataset [31]. It contains 16 convolutional layers, 16 ReLU layers, 5 pooling layers and 5504 filters in total; all filters are of size 3 x 3. We do not use any fully connected layers. In the experiments, we use The Golden Gate Bridge and Neckarfront with Hölderlinturm and Collegiate as the content images in all cases, as shown in Fig. 2.
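For reference, codes at the layers used below can be collected as in the following sketch. It uses torchvision's VGG-19 for illustration (our experiments use MXNet [30]); the name-to-index mapping is an assumption that should be verified against the loaded model.

```python
import torch
import torchvision.models as models

vgg = models.vgg19(pretrained=True).features.eval()

# Hypothetical mapping from layer names to torchvision module indices.
LAYERS = {"relu1_2": 3, "relu2_2": 8, "relu3_2": 13, "relu4_2": 22, "relu5_2": 31}

def extract_codes(x, layers=LAYERS):
    """Run the image batch x through the network and collect the requested activations."""
    codes, out = {}, x
    wanted = {idx: name for name, idx in layers.items()}
    for i, module in enumerate(vgg):
        out = module(out)
        if i in wanted:
            codes[wanted[i]] = out.detach()
    return codes

codes = extract_codes(torch.rand(1, 3, 224, 224))   # dict of (1, C, H, W) tensors
```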

3.2 Feature Map Inversion

We show qualitative FMI results in Fig. 3. The top inversion results are from the input image The Golden Gate Bridge and the bottom ones are from Neckarfront with Hölderlinturm and Collegiate. The rows from top to bottom show feature map inversions at layers relu1_2, relu2_2, relu3_2, relu4_2 and relu5_2, respectively. In every row, the columns from left to right show the inversion results of the first five feature maps, respectively.

Fig. 2: Content images. The left is The Golden Gate Bridge, taken by Rich Niewiroski Jr in 2007. The right is Neckarfront with Hölderlinturm and Collegiate, taken by Andreas Praefcke in 2003.
(a) The Golden Gate Bridge
(b) Neckarfront with Hölderlinturm and Collegiate
Fig. 3: Feature Map Inversion. The top inversion results are from the input image The Golden Gate Bridge and the bottom ones are from Neckarfront with Hölderlinturm and Collegiate. In each group, the columns from left to right show the FMI results of the first five feature maps, respectively, and the rows show FMI results at layers relu1_2, relu2_2, relu3_2, relu4_2 and relu5_2, respectively. This figure is best viewed electronically in color with zoom.
Fig. 4: Randomly Modified Code Inversion. The columns from left to right show results at layers relu1_1, relu2_1, relu3_1, relu4_1 and relu5_1. This figure is best viewed electronically in color with zoom.

The results show that every filter extracts a specific texture. As Fig. 3 shows, inversion results of different feature maps at different layers have distinct textures, while the corresponding inversion results in (a) and (b) share the same texture in terms of color and local structure. FMI at low layers such as relu1_2 and relu2_2 generates images whose colors are monotonous and whose local structure is simple. At higher layers such as relu4_2 and relu5_2, the colors become plentiful and the local structures become more intricate. This behavior is reasonable because feature maps at higher layers can be considered a non-linear combination of preceding feature maps. For example, feature maps at low layers represent low-level semantic properties such as edges and corners, and subsequent filters assemble different edge and corner patterns to compose more intricate textures.

Fig. 5: Purposefully Modified Code Inversion (PMCI). From top to bottom, the target style images are Self Portrait with Necklace of Thorns by Frida Kahlo (1940), Femme nue assise by Pablo Picasso (1909), The Starry Night by Vincent van Gogh (1889) and Der Schrei by Edvard Munch (1893). The third and fifth columns show PMCI results. The second and fourth columns show styled images produced with the approach of Gatys et al. [1]. To align all images, we resize the style target images.

3.3 Generating Images of Diverse Styles

Since every feature map represents a specific texture, we can change an image's style by modifying the combination weights of the hierarchical texture primitives, either randomly or purposefully.

Fig. 4 shows qualitative Randomly Modified Code Inversion results. We stochastically reallocate the sum of every feature map at layers relu1_1, relu2_1, relu3_1, relu4_1 and relu5_1, respectively, and for every layer we generate two random inversion results. Random modification changes the activation degree of activated neurons but keeps the unactivated ones invariant. The two generated images in the same column have disparate textures. Compared with the input images, color is the main difference at the low layers relu1_1 and relu2_1, while structure is the main difference at the high layers relu4_1 and relu5_1, which supports the observation of Sec. 3.2.

Interestingly, inversion results at higher layers contain fewer content details and more pronounced textures. The reason is that the repeated texture blocks at higher layers contain more intricate structures, whereas image content is composed of many unique structural sub-regions. When the structure of a sub-region differs from the texture, part of the content information is destroyed. As a result, the content of the whole image becomes scarce while numerous intricate textures appear.

In addition, we also experiment with Purposefully Modified Code Inversion (PMCI). The generated images in Fig. 5 combine the code of a target content image with the energy distribution of the feature maps of a target style image. In particular, we pick four style images for our experiment: Self Portrait with Necklace of Thorns, Femme nue assise, The Starry Night and Der Schrei. We use the code at layer relu2_2 as the content constraint and the sum of feature maps along the channel axis at layers relu1_1, relu2_1, relu3_1, relu4_1 and relu5_1 as the style constraint. The first column shows the target style images. The third and fifth columns show PMCI results. We also show the styled images of Gatys et al. [1] in the second and fourth columns.

PMCI generates images similar to the style target. We can intuitively see that the generated images in the same row have similar styles. This similarity demonstrates that the combination weights of the feature maps represent image style. Thus, we can determine whether two images share the same style according to the energy distribution of their feature maps. With these results, we provide some insight into why the Gram matrix of feature maps [1] can represent image style: like the sum of feature maps along the channel axis, the Gram matrix also guides the energy of every feature map of the generated image.
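A small numerical check makes this connection concrete: the diagonal of the Gram matrix of flattened feature maps holds the squared-activation energy of each map, so matching Gram matrices also constrains how energy is distributed across feature maps. Note that this squared sum is related to, but not identical to, the plain sums used in our style constraint; the shapes here are arbitrary.

```python
import numpy as np

phi = np.random.rand(512, 14, 14)        # a hypothetical (C, H, W) code
F = phi.reshape(phi.shape[0], -1)        # C x (H*W) matrix of flattened feature maps
gram = F @ F.T                           # Gram matrix as used in [1]
energy_sq = (F ** 2).sum(axis=1)         # sum of squared activations per feature map
assert np.allclose(np.diag(gram), energy_sq)
```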

4 Conclusion

We present a method to visualize which feature a filter captures from an input image in CNNs by inverting the feature map of interest. With this technique, we demonstrate that every filter extracts a specific texture, and that the inversion results at higher layers contain more colors and more intricate structures. We propose two methods to generate images of diverse styles. The experimental results support our assumption that the style of an image is essentially a combination of the texture primitives captured by filters in CNNs. In addition to generating images of diverse styles, we also provide an explanation of why the Gram matrix of feature maps [1] can represent image style: since every filter extracts a specific texture, the combination weights of the feature maps decide the image style, and like the sum of feature maps along the channel axis, the Gram matrix also guides the energy of every feature map of the generated image.

5 Acknowledgment

This research is supported by the National High Technology Research and Development Program of China (863, No. 2015AA015903), the National Natural Science Foundation of China (NSFC, No. 61571102), the Fundamental Research Funds for the Central Universities (No. ZYGX2014Z003, No. ZYGX2015KYQD004) and a grant from the Ph.D. Programs Foundation of the Ministry of Education of China (No. 20130185110010). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro K5200 GPU used for this research.

References

  • [1] L.A. Gatys, A.S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [2] C. Szegedy, W. Liu, Y.Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
  • [3] A. Krizhevsky, I. Sutskever, and G.E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
  • [4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [5] K.M. He, X.Y. Zhang, S.Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [6] I.J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” International Conference on Learning Representations (ICLR), 2015.
  • [7] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 427–436.
  • [8] J. Yosinski, C. Jeff, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3320–3328.
  • [9] Y.X. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft, “Convergent learning: Do different neural networks learn the same representations?,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 196–212.
  • [10] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent, “Visualizing higher-layer features of a deep network,” University of Montreal, vol. 1341, 2009.
  • [11] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” International Conference on Learning Representations (ICLR), 2014.
  • [12] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” International Conference on Learning Representations (ICLR), 2015.
  • [13] D.L. Wei, B.L. Zhou, A. Torrabla, and W. Freeman, “Understanding intra-class knowledge inside cnn,” arXiv preprint arXiv:1507.02379, 2015.
  • [14] K. Lenc and A. Vedaldi, “Understanding image representations by measuring their equivariance and equivalence,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 991–999.
  • [15] A. Karpathy, J. Johnson, and F.F. Li, “Visualizing and understanding recurrent networks,” International Conference on Learning Representations (ICLR), 2016.
  • [16] M. Liu, J.X. Shi, Z. Li, C.X. Li, J. Zhu, and S.X. Liu, “Towards better analysis of deep convolutional neural networks,” arXiv preprint arXiv:1604.07043, 2016.
  • [17] W.L. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” International Conference on Machine Learning (ICML), 2016.
  • [18] M.D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 818–833.
  • [19] Y. Sun, X.G. Wang, and X.O. Tang, “Deeply learned face representations are sparse, selective, and robust,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2892–2900.
  • [20] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” in International Conference on Learning Representations (ICLR), 2014.
  • [21] Y. Jason, C. Jeff, N. Anh, F. Thomas, and L. Hod, “Understanding neural networks through deep visualization,” in Deep Learning Workshop, International Conference on Machine Learning (ICML), 2015.
  • [22] A. Mordvintsev, C. Olah, and M. Tyka, “Inceptionism: Going deeper into neural networks,” Google Research Blog. Retrieved June, vol. 20, 2015.
  • [23] A. Mahendran and A. Vedaldi, “Visualizing deep convolutional neural networks using natural pre-images,” International Journal of Computer Vision (IJCV), pp. 1–23, 2016.
  • [24] A. Nguyen, J. Yosinski, and J. Clune, “Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks,” Visualization for Deep Learning workshop, International Conference in Machine Learning(ICML), 2016.
  • [25] A. Dosovitskiy and T. Brox, “Inverting convolutional networks with convolutional networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [26] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5188–5196.
  • [27] L. Gatys, A.S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., pp. 262–270. Curran Associates, Inc., 2015.
  • [28] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky, “Texture networks: Feed-forward synthesis of textures and stylized images,” in International Conference on Machine Learning (ICML), 2016.
  • [29] J. Johnson, A. Alahi, and F.F. Li, “Perceptual losses for real-time style transfer and super-resolution,” European Conference on Computer Vision (ECCV), 2016.
  • [30] T.Q. Chen, M. Li, Y.T. Li, M. Lin, N.Y. Wang, M.J. Wang, T.J. Xiao, B. Xu, C.Y. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” Advances in Neural Information Processing Systems (NIPS), 2015.
  • [31] R. Olga, D. Jia, S. Hao, K. Jonathan, S. Sanjeev, M. Sean, Z.H. Huang, K. Andrej, K. Aditya, B. Michael, C.B. Alexander, and F.F. Li, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.