Feature selection of neural networks is skewed towards the less abstract cue

Abstract.

Artificial neural networks (ANNs) have become an important tool for image classification with many applications in research and industry. However, it remains largely unknown how relevant image features are selected and how data properties affect this process. In particular, we are interested in whether the abstraction level of image cues correlating with class membership influences feature selection. We perform experiments with binary images that contain a combination of cues representing two different levels of abstraction: one is a pattern drawn from a random distribution, where class membership correlates with the statistics of the pattern; the other is a combination of symbol-like entities, where the symbolic code correlates with class membership. When the network is trained with data in which both cues are equally significant, we observe that the cues at the lower abstraction level, i.e., the pattern, are learned, while the symbolic information is largely ignored, even in networks with many layers. Symbol-like entities are only learned if the importance of the low-level cues is reduced compared to that of the high-level ones. These findings raise important questions about the relevance of features that are learned by deep ANNs and how learning could be shifted towards symbolic features.

Keywords: Neural Networks, Pattern Recognition, Feature Selection

1. Introduction

Advances in the manufacturing of fast graphics processing units (GPUs) and the availability of large datasets have permitted the successful use of neural networks in many diverse fields, such as natural-language processing (Hinton et al., 2012) or computer vision (Krizhevsky et al., 2012). State-of-the-art results are obtained by deep learning (LeCun et al., 2015), which utilizes the large capacity of many layers for representing non-linear functions that map the input signal to an output. However, there is little profound theoretical insight into how the learning process is affected by the data, and into which input features are actually encoded by the network, and where (Shwartz-Ziv and Tishby, 2017).
A common explanation of data processing in deep neural networks focuses on an interpretation of the hidden layers (LeCun et al., 2015; Alain and Bengio, 2016). Early layers extract low-level features from the raw input data, which are further combined in the middle and last layers to obtain a high-level representation. For example, in object classification the first layers of the network extract edge features. These are combined into general object parts and finally assembled into “archetype objects” (LeCun et al., 2015; Zeiler and Fergus, 2014). This processing pipeline exhibits similarities with presumed biological neural processing strategies (Hubel and Wiesel, 1962). Complementarily, learning theory (e.g. Vapnik, 2000) tries to formulate general mathematical principles of the system, from which important properties, such as learning boundaries, consistency of the learning process, and generalization capabilities, are derived, including the representational power of deep neural networks (Pascanu et al., 2013; Montufar et al., 2014). A theoretical framework for examining (deep) neural networks is the so-called information bottleneck (Tishby and Zaslavsky, 2015), which makes general claims about the relevant information contained in random input variables with respect to random output variables, and about the network’s optimization via stochastic gradient descent. However, there is ongoing research and discussion about the applicability of this theory (Amjad and Geiger, 2018; Saxe et al., 2018).
In this paper, we present an experimental setup to investigate how the learning process in neural networks, i.e., feature selection, is affected by the input data. We create synthetic, binary images containing two cues, which differ in their level of abstraction and define three distinct classes. The low-level cue is a pattern drawn from a random distribution that is different for every class. The high-level cue is a combination of three symbols occurring in the image according to a class-specific code. We use datasets made of these images and the respective class labels to train ANNs to perform the classification task (see Figure 1). We then evaluate the classification performance of the trained networks on every test set (see Figure 2). Every cue by itself suffices to correctly classify an image.
While most approaches concentrate on exploring the network explicitly, in terms of interpreting its layers or its capability to represent highly non-linear functions, our study focuses on the data: if two kinds of cues for classifying the data are presented to the network, will a combination of them be used or only a single one? And if the latter holds true, which type of cue is favored, and can we influence the decision made by the network?

Figure 1. Depiction of the training scenarios. We train a separate neural network for each training dataset: Both Cues, Symbol, Pattern, and Dist. Both Cues.
Figure 2. Depiction of the test procedures. We evaluate the class-probability predictions of every previously trained neural network on all test-data subsets: Both Cues, Symbol, Pattern, and Dist. Both Cues.

2. Material and methods

In this work we investigate which kinds of cues are used by simple feed-forward neural networks to learn from images. For this purpose we choose a standard neural network architecture (see Section 2.2.1) and create sets of synthetic images containing different kinds of cues, each of them correlating with class membership (see Section 2.1.1).

2.1. Synthetic Datasets

For our experiments we use four datasets. Every image contains either a combination of the high- and the low-level cue or only a single cue. Every cue is class-specific. The datasets differ in the type of cue used in the images. In Section 2.1.1 we describe general properties of the images and explain the composition of the cues and classes in more detail. In Section 2.1.2 we describe the datasets.

2.1.1. Images

General properties:

The images for the experiments are binary and of fixed size. We create them synthetically with MATLAB. The number of pixels used to create a single cue is fixed and independent of the type of cue. See Figure 1 on the left for examples.

Cues:

The cues we use to create the images correlate with class membership. We differentiate between high- and low-level cues, which differ in their complexity. We define complexity as the number of iterations the neural network needs to classify the feature correctly; a higher level means that more iterations are needed (see Section 4). For the high-level cue we use a combination of three symbols (“+” and “x”). Every single symbol is made of nine pixels to ensure scale invariance and is placed uniformly at random in the image.
For the low-level cue we use a pattern of pixels drawn from a random distribution. Samples from the different datasets are shown in Figure 1.

Classes:

In this experiment we construct three different classes. The first class correlates with a uniformly distributed pattern as the low-level cue and with a combination of two symbols “x” and one “+” as the high-level cue, or with both cues together (depending on the dataset used). The second class correlates with a pattern drawn from a distribution that accumulates the pixels in the center of the image and with a combination of two symbols “+” and one “x”, or with both. The third class correlates with a pattern drawn from a distribution that accumulates the pixels in a corner of the image and with three symbols “+”, or with both. See Table 1 for a comprehensive overview of the class definitions.

Class  pattern distribution  symbols
I      uniform               +xx
II     centered              ++x
III    cornered              +++
Table 1. Composition of the classes in the Both Cues, Symbol, and Pattern datasets.
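To make the image construction concrete, here is a minimal Python sketch of how such images could be generated. The paper created the images in MATLAB; we use Python for consistency with the Keras experiments below. The image size, the pixel budget per cue, and the parameters of the pattern distributions are not stated in the text, so the 32 × 32 size, the 27-pixel budget (matching three 9-pixel symbols), the Gaussian spreads, and all function names are illustrative assumptions.

```python
import numpy as np

# 3x3 binary stamps for the two symbol types ("+" and "x"), nine pixels each.
SYMBOLS = {
    "+": np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=np.uint8),
    "x": np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=np.uint8),
}
CLASS_CODE = {0: "+xx", 1: "++x", 2: "+++"}  # symbol codes from Table 1

def draw_pattern(cls, size, n_pixels, rng):
    """Low-level cue: n_pixels drawn from a class-specific distribution."""
    if cls == 0:    # class I: uniform over the whole image
        idx = rng.choice(size * size, n_pixels, replace=False)
        ys, xs = np.unravel_index(idx, (size, size))
    elif cls == 1:  # class II: accumulated in the center (Gaussian, assumed)
        ys = rng.normal(size / 2, size / 8, n_pixels)
        xs = rng.normal(size / 2, size / 8, n_pixels)
    else:           # class III: accumulated in a corner (half-Gaussian, assumed)
        ys = np.abs(rng.normal(0, size / 8, n_pixels))
        xs = np.abs(rng.normal(0, size / 8, n_pixels))
    img = np.zeros((size, size), dtype=np.uint8)
    ys = np.clip(ys, 0, size - 1).astype(int)
    xs = np.clip(xs, 0, size - 1).astype(int)
    img[ys, xs] = 1
    return img

def make_image(cls, size=32, cue_pixels=27, with_pattern=True,
               with_symbols=True, rng=None):
    """Binary image with the class-specific cue(s); 27 = 3 symbols x 9 pixels."""
    rng = rng or np.random.default_rng()
    img = np.zeros((size, size), dtype=np.uint8)
    if with_pattern:
        img |= draw_pattern(cls, size, cue_pixels, rng)
    if with_symbols:
        for s in CLASS_CODE[cls]:
            y, x = rng.integers(0, size - 2, size=2)  # uniformly random placement
            img[y:y + 3, x:x + 3] |= SYMBOLS[s]
    return img
```

Setting with_symbols=False or with_pattern=False yields the single-cue images used for the Pattern and Symbol datasets described next.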

2.1.2. Datasets

Both Cues:

This dataset contains images with both the high- and the low-level cue together and their corresponding class labels. We create the same number of images for every class and split the whole dataset into a train set and a test set, which we refer to as the Both Cues train and test sets. Examples for this dataset are shown in Figure 1.

Symbol:

The second dataset contains images with only the high-level cue present and the corresponding labels. Again, we create the same number of images for every class and split the whole dataset into the Symbol train and test sets. Examples for this dataset are shown in Figure 1.

Pattern:

This dataset contains images with only the low-level cue and their corresponding class labels. As before, we create the same number of images for every class and split the whole dataset into the Pattern train and test sets. Examples for this dataset are shown in Figure 1.

Dist. Both Cues:

This dataset contains images with a combination of both the high- and the low-level cue and labels as in the Both Cues dataset. However, we dilute the dataset by intentionally providing a false pattern, i.e., a pattern drawn from the distribution of a wrong class, for a fixed fraction of the samples. This leads to a smaller correlation of the pattern with class membership compared to the correlation of the symbols. In doing so, we want to trigger a learning behavior of the network different from that obtained with the Both Cues dataset. Apart from that, we proceed as with the other datasets and split the set into the Dist. Both Cues train and test sets. Examples for this dataset are shown in Figure 1, and Table 2 provides an overview of the classes we assigned to this manipulated dataset.

distorted class  pattern distribution  symbol combination
I                centered or cornered  +xx
II               cornered or uniform   ++x
III              uniform or centered   +++
Table 2. Composition of the distorted classes used in the diluted fraction of the Dist. Both Cues dataset. The rest of the dataset is created as described in Table 1.
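A sketch of how such a diluted dataset could be assembled, reusing the hypothetical make_image helper from above. The paper's exact dilution fraction is not restated here, so the default of 0.25 is a placeholder; note that the class label always follows the symbol code, while a distorted sample receives a pattern from a wrong class (Table 2).

```python
def make_dist_both_cues(n_per_class, dilution=0.25, rng=None):
    """Both-cue images in which a fraction of the patterns belongs to a wrong class."""
    rng = rng or np.random.default_rng()
    images, labels = [], []
    for cls in (0, 1, 2):
        for _ in range(n_per_class):
            pattern_cls = cls
            if rng.random() < dilution:  # distorted sample: pattern from a wrong class
                pattern_cls = rng.choice([c for c in (0, 1, 2) if c != cls])
            img = (make_image(pattern_cls, with_symbols=False, rng=rng)
                   | make_image(cls, with_pattern=False, rng=rng))
            images.append(img)
            labels.append(cls)           # label is defined by the symbol code
    return np.stack(images), np.array(labels)
```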

2.2. Neural Network

2.2.1. Architecture

For the experiments we use simple, feed-forward, fully-connected neural networks. The architecture of a neural network with one hidden layer is shown in Figure 3.
The images used for our investigations are flattened for the input layer. We use one, two, three, or ten hidden layers, with either 10, 100, or 500 neurons per layer, to check the influence of the number of hidden layers and neurons on the resulting test accuracies. All hidden neurons use the ReLU activation function. Because we are investigating a classification problem with three classes, we have three output neurons. We use a softmax activation function to output the corresponding class probabilities. The results for every network setup are reported in Appendix A.

Figure 3. Neural network architecture with one hidden layer. The hidden neurons use the ReLU activation function and the output neurons use the softmax (sMax) activation function for classification. We also use setups with two, three, and ten hidden layers in this work.
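In Keras, this architecture can be written in a few lines. This is a sketch under the description above; the function name build_model and the flattened input dimension input_pixels are our own notation, not the authors' code.

```python
import tensorflow as tf

def build_model(input_pixels, n_hidden_layers=1, n_neurons=100):
    """Fully-connected feed-forward net: ReLU hidden layers, 3-way softmax output."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(n_neurons, activation="relu",
                                    input_shape=(input_pixels,)))  # flattened image
    for _ in range(n_hidden_layers - 1):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(3, activation="softmax"))  # three classes
    return model
```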

2.2.2. Training

For every training scenario we use the training images of the respective dataset. We use mini-batch gradient descent with a fixed number of randomly chosen images per iteration and a constant learning rate. The error is calculated as the cross entropy of the network output and the provided labels. To avoid over-fitting on the train set, we limit the maximum number of epochs and implement early stopping. One epoch is a complete run over the whole dataset. To test the influence of the number of hidden neurons and layers, we execute runs with 10, 100, or 500 hidden neurons and one, two, three, or ten hidden layers, respectively.
We implement and run our experiments with the Keras API of the GPU-accelerated version of TensorFlow (Abadi et al., 2015).
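A training sketch matching this description follows. Since the batch size, learning rate, epoch limit, and early-stopping criterion are not given above, the concrete values here are placeholder assumptions.

```python
import tensorflow as tf

def train(model, x_train, y_train, batch_size=64, learning_rate=0.01,
          max_epochs=1000):
    """Mini-batch gradient descent with cross-entropy loss and early stopping."""
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy",  # integer class labels
                  metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                                  restore_best_weights=True)
    # Images are flattened to vectors before entering the input layer.
    return model.fit(x_train.reshape(len(x_train), -1), y_train,
                     batch_size=batch_size, epochs=max_epochs,
                     validation_split=0.1, callbacks=[early_stop], verbose=0)
```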

3. Results and Discussion

We investigate many different training and testing scenarios in this work (see Figures 1 and 2). To keep this section clear, we organize the results in two parts, experiment A and experiment B, and go through them step by step.
In experiment A we consider only the Both Cues, Symbol, and Pattern datasets to investigate the behavior of the neural networks when both cues are equally significant. In doing so, we measure the performance of the network and identify the cue that is used by the network to predict the classes.
In the second experiment B, we use the Dist. Both Cues dataset, which is constructed to contain false low-level cues for a fraction of the samples (see Section 2.1.2). We train the neural network on its training subset and check the test accuracy on the Both Cues, Symbol, Pattern, and Dist. Both Cues test sets.
Tables 7 and 8 of Appendix A give a complete overview of the mean test accuracies for all training and test scenarios in every network setup. Table 3 presents a comprehensive overview of the classification capability. All reported results are mean values and mean errors of the mean over five runs.

training subset    network
Both Cues          Network A
Symbol             Network B
Pattern            Network C
Dist. Both Cues    Network D
Table 3. Overview of the trained networks A, B, C, and D and their training subsets. Each network is evaluated on all test subsets; the corresponding mean test accuracies are listed in Tables 7 and 8.
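Producing such an overview amounts to a double loop over the trained networks and the test subsets. A sketch, assuming dictionaries networks (training-set name to trained model) and test_sets (test-set name to (images, labels)) that would be built with the helpers above:

```python
for train_name, model in networks.items():
    for test_name, (x_test, y_test) in test_sets.items():
        # model.evaluate returns [loss, accuracy] for a model compiled as above
        _, acc = model.evaluate(x_test.reshape(len(x_test), -1), y_test, verbose=0)
        print(f"trained on {train_name:15s} tested on {test_name:15s}: {100 * acc:5.1f} %")
```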

3.1. Experiment A

training subset   test: Both Cues  test: Symbol   test: Pattern
Both Cues         97.46 ± 0.04     33.26 ± 0.17   96.17 ± 0.08
Symbol            67.93 ± 0.71     100 ± 0.00     38.69 ± 0.30
Pattern           79.18 ± 0.25     33.48 ± 0.25   99.25 ± 0.02
Table 4. Representative mean test accuracies (in %) of the trained networks for one hidden layer with 100 hidden neurons.

Table 4 representatively shows the accuracies on the test sets after training on the corresponding training sets for neural networks with one hidden layer and 100 hidden neurons. Using other setups with a different number of hidden layers or hidden neurons did not change the key findings significantly. We discuss the effect on the absolute accuracies in Section 3.3.
The best accuracies are obtained when the training and test datasets contain the same kind of cue (main diagonal). This is an expected result, because the training and test subsets then stem from the same datasets. The worst accuracies are obtained when we train on the Both Cues or Pattern subset and test on the Symbol subset, or train on the Symbol subset and test on the Pattern subset. With three possible classes, an accuracy of around 33 % corresponds to guessing. This means that the network is not able to infer the classes correctly in these cases.
When training the neural network on only the Symbol or only the Pattern subset and evaluating on the Both Cues test set, we obtain 67.93 % (or 79.18 %) test accuracy. This drop is bigger for training on the symbols than for training on the pattern.
Next, the network is trained on the Both Cues subset and tested on the Pattern and Symbol subsets, respectively. While the ANN is able to classify the statistical pattern with 96.17 % test accuracy, only 33.26 % test accuracy is obtained for the symbols.
These results suggest that the ANN learns only the low-level cue when both cues are provided together.

3.2. Experiment B

Now we evaluate the performance for the additional Dist. Both Cues dataset, which contains wrong patterns for a fraction of the samples (see Section 2.1.2). Our intention is to trigger a different learning behavior, because the network should not be able to classify the data completely by using only the low-level cue. The results are presented in Table 5.
The test accuracies for training on the Both Cues subset and testing on the Dist. Both Cues subset confirm the results of experiment A: they lie around 75 %, which agrees with the percentage of correct patterns. If the network relied exclusively on the pattern, its accuracy on this set would equal the fraction of correctly drawn patterns, since a wrong pattern points to a wrong class. We conclude that the network uses only the low-level cue in this case.
The second observation is that using the Dist. Both Cues subset for training changes the learning behavior of the neural network compared to the previous experiment. The test accuracies on the Pattern subset decrease compared to training with the Both Cues subset. More importantly, the test accuracies for the Symbol and Both Cues subsets also change. This indicates that the network now learns both cues. Another indication is given by the test accuracies on the Dist. Both Cues test set after training on it: they lie above the percentage of correct patterns, showing that the network does not rely solely on the patterns.

# hidden neurons  training subset  test: Both Cues  test: Symbol   test: Pattern  test: Dist. Both Cues
10    Both Cues        97.42 ± 0.03  33.01 ± 0.15   95.81 ± 0.14  76.06 ± 0.31
      Symbol           61.17 ± 1.50  99.75 ± 0.09   37.86 ± 0.63  60.90 ± 1.61
      Pattern          78.73 ± 0.19  33.54 ± 0.26   99.25 ± 0.02  63.93 ± 0.24
      Dist. Both Cues  92.63 ± 0.40  73.31 ± 1.76   81.22 ± 0.92  80.41 ± 0.99
100   Both Cues        97.46 ± 0.04  33.26 ± 0.17   96.17 ± 0.08  75.61 ± 0.12
      Symbol           67.93 ± 0.71  100 ± 0.00     38.69 ± 0.30  67.95 ± 0.45
      Pattern          79.18 ± 0.25  33.48 ± 0.25   99.25 ± 0.02  64.33 ± 0.12
      Dist. Both Cues  96.31 ± 0.16  85.01 ± 0.53   77.38 ± 0.14  87.09 ± 0.12
500   Both Cues        97.59 ± 0.03  33.28 ± 0.18   96.54 ± 0.13  75.92 ± 0.29
      Symbol           66.33 ± 0.35  100 ± 0.00     39.10 ± 0.30  65.91 ± 0.16
      Pattern          79.10 ± 0.23  33.35 ± 0.25   99.19 ± 0.02  63.89 ± 0.13
      Dist. Both Cues  97.31 ± 0.10  86.45 ± 0.58   78.53 ± 0.38  86.53 ± 0.19
Table 5. Mean test accuracies (in %) for different training and testing scenarios of neural networks with one hidden layer and 10, 100, and 500 hidden neurons.

3.3. Influence of the Number of Hidden Neurons and Layers on the Performance

We repeated the experiments for ANNs with different numbers of hidden neurons and layers. The results are presented in Appendix A. In general, similar behavior is observed, though some differences can be found. Using batch normalization did not have a significant impact on the general trend. Figures 4 and 5 show the slight decrease of the test accuracies and the increase of the mean errors of the mean when more hidden layers are used.

Figure 4. Examples of decaying test accuracies when adding more hidden layers. The number of hidden neurons per layer is indicated in brackets.
Figure 5. Examples of the increase of the mean error of the mean when adding more hidden layers, for 10, 100, and 500 hidden neurons per layer.
training dataset  # epochs
Both Cues         45.8 ± 0.8
Symbol            849.0 ± 115.45
Pattern           95.2 ± 2.2
Dist. Both Cues   69.6 ± 4.15
Table 6. Mean number of epochs over five runs needed for convergence during training.

4. Conclusions

In this paper we describe a simple experimental setup for investigating cue selection by neural networks.
Our results show that the network favors the low-level cue over the high-level cue when both cues are equally present. However, when we introduce false patterns into the low-level part of the dataset, the network compensates by also using the high-level cue.
A possible explanation why the low-level cue is preferred when both cues are equally present (experiment A, see Section 3.1) may be the complexity of the cues. Table 6 shows the mean number of epochs the neural networks need to converge to the final test accuracy. Learning the symbols requires by far the largest number of epochs. Apparently, minimizing the cost function is more difficult in this case than for the other cues (all below 100 epochs). Thus, if both cues are present in the dataset, two equally deep local minima exist, and the network will converge to the configuration corresponding to the minimum that can be reached with fewer iterations. This interpretation is supported by the case of training on the Dist. Both Cues subset: if the local minimum corresponding to the pattern has a value larger than that corresponding to the symbols, the network will move towards the local minimum of the symbols, which then is the absolute minimum.
In the future, we are interested in developing strategies that allow shifting learning to specific, user-defined cues. This could potentially be achieved by including information about the desired cues in the training data, e.g. by labeling relevant cues in images.

Appendix A: All results

# hidden neurons  training subset  test: Both Cues  test: Symbol   test: Pattern  test: Dist. Both Cues
1 hidden layer
10    Both Cues        97.42 ± 0.03  33.01 ± 0.15   95.81 ± 0.14  76.06 ± 0.31
      Symbol           61.17 ± 1.50  99.75 ± 0.09   37.86 ± 0.63  60.90 ± 1.61
      Pattern          78.73 ± 0.19  33.54 ± 0.26   99.25 ± 0.02  63.93 ± 0.24
      Dist. Both Cues  92.63 ± 0.40  73.31 ± 1.76   81.22 ± 0.92  80.41 ± 0.99
100   Both Cues        97.46 ± 0.04  33.26 ± 0.17   96.17 ± 0.08  75.61 ± 0.12
      Symbol           67.93 ± 0.71  100 ± 0.00     38.69 ± 0.30  67.95 ± 0.45
      Pattern          79.18 ± 0.25  33.48 ± 0.25   99.25 ± 0.02  64.33 ± 0.12
      Dist. Both Cues  96.31 ± 0.16  85.01 ± 0.53   77.38 ± 0.14  87.09 ± 0.12
500   Both Cues        97.59 ± 0.03  33.28 ± 0.18   96.54 ± 0.13  75.92 ± 0.29
      Symbol           66.33 ± 0.35  100 ± 0.00     39.10 ± 0.30  65.91 ± 0.16
      Pattern          79.10 ± 0.23  33.35 ± 0.25   99.19 ± 0.02  63.89 ± 0.13
      Dist. Both Cues  97.31 ± 0.10  86.45 ± 0.58   78.53 ± 0.38  86.53 ± 0.19
2 hidden layers
10    Both Cues        97.14 ± 0.04  32.95 ± 0.15   94.62 ± 0.80  75.76 ± 0.10
      Symbol           60.67 ± 0.63  99.67 ± 0.08   38.13 ± 0.60  60.49 ± 0.71
      Pattern          77.39 ± 0.31  33.81 ± 0.13   99.24 ± 0.01  62.95 ± 0.31
      Dist. Both Cues  92.18 ± 0.48  58.99 ± 3.38   75.57 ± 1.29  76.91 ± 0.46
100   Both Cues        97.21 ± 0.04  32.83 ± 0.25   95.10 ± 0.09  75.92 ± 0.20
      Symbol           62.67 ± 0.63  99.95 ± 0.02   38.63 ± 0.34  62.44 ± 0.57
      Pattern          78.08 ± 0.14  33.58 ± 0.17   99.19 ± 0.02  63.06 ± 0.06
      Dist. Both Cues  94.43 ± 0.21  84.16 ± 1.13   67.69 ± 0.52  86.12 ± 0.25
500   Both Cues        97.34 ± 0.03  33.37 ± 0.23   95.78 ± 0.16  75.77 ± 0.17
      Symbol           61.34 ± 0.35  100 ± 0.00     38.32 ± 0.17  60.99 ± 0.31
      Pattern          78.27 ± 0.19  33.55 ± 0.18   99.25 ± 0.02  63.72 ± 0.18
      Dist. Both Cues  95.42 ± 0.16  93.03 ± 0.14   66.41 ± 0.49  86.32 ± 0.15
Table 7. Mean test accuracies (in %) for neural networks with one and two hidden layers and different training datasets.

# hidden neurons  training subset  test: Both Cues  test: Symbol   test: Pattern  test: Dist. Both Cues
3 hidden layers
10    Both Cues        96.97 ± 0.06  33.05 ± 0.07   94.98 ± 0.49  75.42 ± 0.20
      Symbol           55.98 ± 1.55  99.62 ± 0.13   36.47 ± 0.80  56.17 ± 1.50
      Pattern          78.92 ± 1.56  33.68 ± 0.30   99.22 ± 0.04  63.32 ± 1.08
      Dist. Both Cues  91.60 ± 0.25  59.85 ± 3.96   75.94 ± 2.75  76.66 ± 0.41
100   Both Cues        96.90 ± 0.05  32.83 ± 0.22   95.04 ± 0.15  75.33 ± 0.15
      Symbol           60.99 ± 0.58  99.86 ± 0.03   37.66 ± 0.16  60.40 ± 0.58
      Pattern          78.03 ± 0.28  33.51 ± 0.25   99.19 ± 0.00  62.79 ± 0.25
      Dist. Both Cues  92.75 ± 0.17  81.75 ± 0.58   66.54 ± 1.09  83.13 ± 0.47
500   Both Cues        97.00 ± 0.04  32.89 ± 0.13   95.41 ± 0.23  75.49 ± 0.19
      Symbol           57.94 ± 0.36  99.90 ± 0.02   37.82 ± 0.41  57.56 ± 0.37
      Pattern          77.76 ± 0.34  33.22 ± 0.22   99.17 ± 0.01  63.52 ± 0.32
      Dist. Both Cues  92.30 ± 0.28  87.58 ± 0.81   62.89 ± 0.88  82.09 ± 0.13
10 hidden layers
10    Both Cues        96.42 ± 0.06  32.81 ± 0.45   91.86 ± 3.26  75.07 ± 0.13
      Symbol           44.51 ± 3.55  85.13 ± 12.80  35.60 ± 0.55  44.31 ± 3.48
      Pattern          72.23 ± 3.11  33.18 ± 0.16   99.00 ± 0.03  58.86 ± 1.95
      Dist. Both Cues  93.27 ± 1.08  39.67 ± 7.57   81.64 ± 3.88  74.86 ± 0.84
100   Both Cues        96.20 ± 0.10  33.21 ± 0.24   95.90 ± 0.67  75.00 ± 0.26
      Symbol           52.76 ± 1.70  98.97 ± 0.09   37.46 ± 0.43  52.74 ± 1.88
      Pattern          78.71 ± 1.62  33.72 ± 0.22   98.66 ± 0.06  63.77 ± 1.01
      Dist. Both Cues  93.92 ± 0.05  32.99 ± 0.21   93.36 ± 1.54  73.88 ± 0.08
500   Both Cues        96.30 ± 0.13  33.14 ± 0.12   94.79 ± 0.82  75.12 ± 0.29
      Symbol           50.31 ± 0.79  99.24 ± 0.05   36.84 ± 0.42  49.89 ± 0.78
      Pattern          78.73 ± 0.90  33.66 ± 0.14   98.75 ± 0.04  63.67 ± 0.57
      Dist. Both Cues  92.98 ± 0.52  42.45 ± 9.39   86.15 ± 7.88  75.02 ± 1.56
Table 8. Mean test accuracies (in %) for neural networks with three and ten hidden layers and different training datasets.


References

  1. M. Abadi et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  2. G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
  3. R. A. Amjad and B. C. Geiger (2018) Learning representations for neural network-based classification using the information bottleneck principle. arXiv preprint arXiv:1802.09766.
  4. G. Hinton, L. Deng, D. Yu, et al. (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97.
  5. D. H. Hubel and T. N. Wiesel (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology 160 (1), pp. 106–154.
  6. A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  7. Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
  8. G. Montufar, R. Pascanu, K. Cho, and Y. Bengio (2014) On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pp. 2924–2932.
  9. R. Pascanu, G. Montufar, and Y. Bengio (2013) On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098.
  10. A. M. Saxe et al. (2018) On the information bottleneck theory of deep learning. In Proc. International Conference on Learning Representations. https://openreview.net/forum?id=ry_WPG-A-
  11. R. Shwartz-Ziv and N. Tishby (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
  12. N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5.
  13. V. Vapnik (2000) The Nature of Statistical Learning Theory. Springer Science & Business Media.
  14. M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.