ExGate: Externally Controlled Gating for Feature-based Attention in Artificial Neural Networks
Perceptual capabilities of artificial systems have come a long way since the advent of deep learning. These methods have proven to be effective; however, they are not as efficient as their biological counterparts. Visual attention is a set of mechanisms that are employed in biological visual systems to ease computational load by only processing pertinent parts of the stimuli. This paper addresses the implementation of top-down, feature-based attention in an artificial neural network by use of externally controlled neuron gating. Our results showed a 5% increase in classification accuracy on the CIFAR-10 dataset versus a non-gated version, while adding very few parameters. Our gated model also produces more reasonable errors in predictions by drastically reducing prediction of classes that belong to a different category from the true class.
Keywords: feature-based attention, neural networks, top-down attention
Artificially intelligent agents are often designed with bottom-up processing in mind. Agents perceive the world through sensors, make decisions based on the sensory information and then perform an action based on the decisions. It is often the case that agents are faced with interpreting large amounts of sensory data. Processing all of the incoming data can be too computationally expensive for constrained systems such as mobile robots.
Advances in machine learning, notably with the rise of deep learning (LeCun et al., 2015), have made a significant contribution to computer vision by taking ideas from biological vision to develop better artificial neural network architectures (Szegedy et al., 2015). The majority of these architectures follow the basic structure of the convolutional neural networks (CNNs) introduced in (LeCun et al., 1989, 1995). For many years CNN model progression has generally consisted of taking that basic structure but adding more layers to make the networks deeper (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014). The complexity of these models has restricted their usage to high-performance computers that are not suitable for constrained systems. The need to create less demanding models has not gone unnoticed and research has focussed on developing alternative structures to improve efficiency and performance (Szegedy et al., 2015; He et al., 2016; Szegedy et al., 2017). In even more extreme cases performance has been somewhat sacrificed to create far less demanding models (Iandola et al., 2016; Howard et al., 2017; Zhang et al., 2017).
These traditional CNN-based architectures have been very effective at various tasks involving visual data; however, the trained networks are static. This means that the same transformations will always be applied to all of the input data. There is no option to modify the behaviour of the network in real time, unlike what humans are capable of.
Humans have computationally strained resources, yet we are capable of interpreting enormous amounts of sensory data with ease. This is possible because our brains utilize various mechanisms to select portions of the incoming data to operate on and we can also constrain our interpretations of that data as well (Summerfield and Egner, 2009). These mechanisms have been studied extensively in fields such as neuroscience and cognitive science, especially with regards to how we, and other animals, process visual stimuli so efficiently (Desimone and Duncan, 1995; Yantis and Serences, 2003; Summerfield and Egner, 2009; Carrasco, 2011; Moore and Zirnsak, 2017). Visual attention is one of the most studied mechanisms for modulation of visual perception. This paper focusses on top-down feature-based attention (see Section 2.1) to modulate the behaviour of a feedforward neural network in an image classification task.
Further background on visual attention is provided in Section 2 to clarify the connection between this research on artificial systems and the research on biological systems. The background supplements the development of our novel mechanism for feature-based attention presented in Section 3. The experimental procedure is discussed in Section 4 and the results are presented in Section 5 along with the discussion of the results. Lastly we draw conclusions in Section 7.
2 Visual Attention
Visual attention is one of the mechanisms that our brains use to select pertinent aspects of visual stimuli. It can manifest itself as a bottom-up process, where portions of the stimuli are attended to automatically based on salient features (Desimone and Duncan, 1995). Attention can also be controlled in a top-down manner, where a current behavioural state may favour certain aspects of the stimuli (Gilbert and Li, 2013; Noudoost et al., 2010). Our research here focusses on top-down attention, as we wish to find ways to control aspects of low-level perceptual systems from higher-level cognitive agents.
There are multiple ways in which attention can be controlled in a top-down manner. Attention can be directed to a specific location (spatial attention), to certain features, and to distinct objects (Noudoost et al., 2010; Gilbert and Li, 2013; Yantis and Serences, 2003). Spatial attention has been used to great effect in recent deep learning works such as (Xu et al., 2015; Wu et al., 2016; Lu et al., 2017). These networks introduce dynamic capabilities by using recurrent neural networks (RNNs) to supply contextual information based on the history of inputs. This allows the models to learn how the sequential structure of sentences in the training data relates to specific locations of an image that should be attended to when generating new captions. The feedback inherent in the RNNs makes this a form of “top-down” attention, where the RNN is acting as a primitive memory system that guides spatial attention.
2.1 Feature-based Visual Attention
Our work focusses on feature-based attention and we aim to introduce a greater level of top-down control than the works mentioned above. Feature-based attention modulates neuron activity based on the relationship between features of interest and the stimulus (Moore and Zirnsak, 2017). Unlike spatial attention, which attends to specific locations of a stimulus, feature-based attention tries to measure how the feature components of a stimulus compare to the features it believes are relevant to the task at hand. Maunsell and Treue (2006) show that certain neurons in the visual cortex have enhanced responses to stimuli relevant to the agent’s behavioural goals. Gilbert and Li (2013) also suggest that top-down control can engage relevant components of stimuli while also discarding irrelevant components.
Such a mechanism could also be used to improve the performance of artificial neural networks. Suppressing irrelevant features and enhancing relevant ones should increase the signal-to-noise ratio, which should in turn make it easier for a classifier to discriminate between classes. Instead of applying the exact same transformations to all input data, we can find a way to modulate the behaviour of the neural network depending on the goals or desires of the agent.
3 Externally Controlled Feature Gating
This section presents a gating mechanism that can be controlled symbolically, rather than through neural connectivity. Such an approach was chosen to facilitate the greater vision of a neuro-symbolic system where symbolic cognitive architectures can influence neural processing in a top-down manner. This implementation differs from the approach taken in common deep learning models that use attention mechanisms such as (Bahdanau et al., 2014; Xu et al., 2015; Wu et al., 2016). In these cases the attention model is parameterised by an additional feedforward neural network. These models use hidden states from recurrent neural networks as an input to the attention model, whereas in our case it is assumed that a symbolic oracle generates the top-down signals.
The type of visual attention used in (Xu et al., 2015; Lu et al., 2017) only accounts for spatial attention. Their models focus visual attention on specific locations within the visual field, but do not attend to specific features. Neurons in the visual cortex can be modulated to enhance certain visual features based on the current behavioural preference (Maunsell and Treue, 2006). We propose a mechanism that multiplexes different gating layers depending on a specific task. This method allows us to train one set of feature detecting neurons, but by gating their outputs we can attend to specific neuron populations based on the current task.
3.1 Gating Units
A gating unit chooses whether to pass or suppress its input based on trainable parameters. The gating units used in our architecture act independently of the input, and therefore only require training of a bias value.
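The output, $z$, of a single gating unit can be expressed as
$$z = a \, \sigma(b) \qquad (1)$$
where $a$ is the activation of the neuron that is being gated, $\sigma$ is the sigmoid function, and $b$ is the trainable bias. This mechanism allows us to modulate neuron activations by choosing different sets of biases, rather than modifying the weights that determine how neurons respond to input stimuli.
Gating units are grouped into gating layers, with one set of biases per task, so that the output of each gating unit becomes
$$z_j^{(l,k)} = a_j^{(l)} \, \sigma\left(b_j^{(l,k)}\right) \qquad (2)$$
where $l$ refers to the layer number, $j$ refers to the index of the neuron and $k$ refers to the index of the task the gating layer belongs to.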
An example of how these gating layers can be incorporated into a feedforward neural network is shown in Figure 1. The gating layers are trained independently so that they become specialised on specific types of input data.
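The multiplexing of gating layers can be illustrated with a minimal sketch. It is written in plain Python rather than the PyTorch used for the experiments, and the class name `GatingLayer`, the task names used as dictionary keys, and the hand-set bias values are assumptions made for illustration only; they are not the authors' implementation or trained values.

```python
import math


def sigmoid(b: float) -> float:
    """Logistic sigmoid, mapping any real bias to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-b))


class GatingLayer:
    """One set of gate biases per task, applied to a shared layer's activations."""

    def __init__(self, num_neurons: int, tasks: list[str]) -> None:
        # Biases start at 0, so every gate initially passes 50% of its input.
        self.biases = {task: [0.0] * num_neurons for task in tasks}

    def forward(self, activations: list[float], task: str) -> list[float]:
        # The gate value depends only on the bias, never on the input itself.
        gates = [sigmoid(b) for b in self.biases[task]]
        return [a * g for a, g in zip(activations, gates)]


layer = GatingLayer(num_neurons=3, tasks=["vehicles", "animals"])
# Hypothetical hand-set biases; in the paper these are learned per category.
layer.biases["vehicles"] = [4.0, 0.0, -4.0]
layer.biases["animals"] = [-4.0, 0.0, 4.0]

hidden = [1.0, 2.0, 3.0]  # activations of the shared (ungated) neurons
vehicle_out = layer.forward(hidden, "vehicles")
animal_out = layer.forward(hidden, "animals")
```

The same hidden activations produce different outputs depending only on which bias set is selected, which is the sense in which the gates are externally controlled.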
This approach is less computationally expensive than adding additional fully connected layers as is often the case with other multi-task learning approaches (Ruder, 2017). For the entire network we are only adding $K \times H$ parameters, where $K$ is the total number of tasks and $H$ is the total number of hidden neurons. Our approach is also closer to biological findings where individual neurons have shown the ability to switch their behaviour based on a visual categorization task (Cromer et al., 2010).
Keeping the neuron weights the same despite changing the task is common practice in deep learning (Yosinski et al., 2014; Ruder, 2017), notably for visual data where the basic representations of features are common regardless of the task. Our work is novel in that we are able to modify which of these common visual features are most relevant to the current goal, rather than fine-tuning a classification stage after the input features are extracted.
3.3 Problem Details
For our experiments here we will treat classifying different categories of images as different tasks. Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the output space of our supervised learning problem. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a given dataset where $x_i \in \mathcal{X}$ are sample inputs, and $y_i \in \mathcal{Y}$ are the corresponding labels. Let $C = \{c_1, \dots, c_K\}$ be a set of defined categories that labels in $\mathcal{Y}$ can belong to, where $K$ is the total number of categories. For now we are assuming that the labels can be grouped into categories by an oracle (a human in this case). We also only consider the case that each label belongs to a single category.
We wish to create a feedforward, fully-connected artificial neural network that produces the output given by $o_i = f(x_i; \theta_c)$, where $x_i$ is the $i$-th input pattern, and $\theta_c$ are the trainable parameters for the category, $c$. The predicted labels from the network are then given by $\hat{y}_i = \arg\max_{m} o_{i,m}$. Let $\mathcal{D}_c$ be a subset of $\mathcal{D}$ containing elements belonging to category $c$. The parameters for each category, $\theta_c$, are trained using gradient descent methods on training data $\mathcal{D}_c = \{(x_i, y_i)\}_{i=1}^{N_c}$, where $N_c$ is the number of labelled samples in category $c$. The objective of training is to find the parameters that minimize some loss function $\mathcal{L}(f(x_i; \theta_c), y_i)$.
3.4 Evaluation Metrics
The standard classification metrics of accuracy and loss can be used to compare the performance of the model with and without feature gating. These metrics do not provide any insight into what effect the gating may have had on the base model neurons to impact performance. We hypothesise that the gates should minimize predictions across different categories because they select a subset of features that are only relevant to the “cued” task. Given some test data $\{(x_i, y_i)\}_{i=1}^{M}$, where $M$ is the total number of test samples, we can define $\{\hat{y}_i\}_{i=1}^{M}$ as the set of predicted labels from the test data with category, $c_i$, inferred from the true label, $y_i$. Based on our hypothesis we declare a new metric in Eq. 3 to quantify the effect of feature gating on class prediction based on categorical membership.
$$\mathrm{CI} = \frac{100}{M} \sum_{i=1}^{M} [\hat{y}_i \in c_i] \qquad (3)$$
where the $[\cdot]$ are Iverson brackets,
$$[P] = \begin{cases} 1 & \text{if } P \text{ is true} \\ 0 & \text{otherwise.} \end{cases}$$
We term this metric “categorical isolation”; it is defined as the percentage of predicted labels belonging to the same category as the true label. The higher the categorical isolation, the better the model is at discriminating classes between different categories.
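As a concrete reading of the categorical isolation metric, the following is a minimal sketch in plain Python. The function name `categorical_isolation` and the `CATEGORY_OF` lookup are assumptions introduced here; the lookup encodes the vehicle/animal grouping used in the experiments, and the example labels are illustrative, not results from the paper.

```python
# Category assignment used in the experiments: four vehicle classes, six animal classes.
CATEGORY_OF = {
    "airplane": "vehicles", "automobile": "vehicles",
    "ship": "vehicles", "truck": "vehicles",
    "bird": "animals", "cat": "animals", "deer": "animals",
    "dog": "animals", "frog": "animals", "horse": "animals",
}


def categorical_isolation(true_labels, predicted_labels):
    """Percentage of predictions that fall in the same category as the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    # Each comparison plays the role of an Iverson bracket: True counts as 1.
    same = sum(
        CATEGORY_OF[pred] == CATEGORY_OF[true]
        for true, pred in zip(true_labels, predicted_labels)
    )
    return 100.0 * same / len(true_labels)


# Illustrative labels only, not results from the paper.
truth = ["truck", "cat", "ship", "dog"]
preds = ["automobile", "dog", "bird", "dog"]
score = categorical_isolation(truth, preds)
```

Note that a prediction can be wrong at the class level (a truck labelled as an automobile) and still count towards categorical isolation, since only category membership is compared.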
The capabilities of our architecture were tested on an image classification task using the CIFAR10 dataset (Krizhevsky et al., 2014). PyTorch was used as the framework, as it allows for dynamic computational graphs which are required for switching between gating layers during inference. The hypothesis is that overall classification accuracy can be improved by switching between specialized gating layers that are trained on categorized subsets of the original data.
CIFAR10 consists of thousands of images containing unique instances of 10 different classes. The class labels are airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The default training set consists of 50000 images and the default test set consists of 10000 images. Each class has exactly the same number of samples in both the training and test data.
The dataset was divided into two broad categories: “vehicles” and “animals”. These categories acted as two separate “tasks” for the network to perform. Samples of the images and categories are provided in Figure 2.
Each set of gating layers was trained and tested on a subset of the data based on these categories. One set of gating layers was trained on only vehicles and the other set was trained only on animals. This ensures that the gates are specialised to a specific category. During training, a portion of the original training data was used as a validation set.
The images were converted to grayscale and normalized. No other pre-processing was performed, as we are only looking for a comparative improvement between gated and non-gated versions of the neural network.
4.2 Neural Network Parameters
The base model was a fully-connected feedforward neural network with two hidden layers. The first hidden layer had 256 neurons and the second hidden layer had 128 neurons. Neurons in the hidden layers had a Rectified Linear Unit (ReLU) activation function. Dropout was applied to the base model during training with a dropout rate of 50%. There were 10 linear output neurons (one per class) with a log-softmax function applied to the output to generate a log-probability output.
The gated version of this network had gating layers, as specified in Section 3, after each hidden layer. The biases in the gated layers were initialized to 0.
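To put the size of the gating layers in perspective, the parameter counts can be worked out directly from the layer sizes given above. This is a rough sketch that assumes each 32x32 CIFAR-10 image is flattened into 1024 grayscale inputs with no resizing, which the text does not state explicitly; the helper `dense_params` is introduced here for illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Assumption: each 32x32 CIFAR-10 image is flattened to 1024 grayscale inputs.
n_inputs, hidden1, hidden2, n_classes = 32 * 32, 256, 128, 10
n_tasks = 2  # "vehicles" and "animals"

base_params = (
    dense_params(n_inputs, hidden1)
    + dense_params(hidden1, hidden2)
    + dense_params(hidden2, n_classes)
)
# One gate bias per hidden neuron per task.
gate_params = n_tasks * (hidden1 + hidden2)

# Cost of one extra neuron in the first hidden layer: incoming weights,
# its own bias, and one outgoing weight to every second-layer neuron.
extra_neuron_params = n_inputs + 1 + hidden2

print(base_params, gate_params, extra_neuron_params)
```

Under these assumptions the gates for both tasks together add 768 parameters to a base model of 296586, fewer than the 1153 parameters that a single additional first-layer neuron would introduce.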
4.3 Training and Testing Procedure
Trainable parameters were updated after calculating the average loss for mini-batches consisting of 32 randomly sampled images from the training set. Validation was performed using the validation set after a set number of updates to monitor for overfitting on the data used for calculating the gradients. The negative log-likelihood loss was used as the loss function to optimize, using the RMSprop optimizer with a learning rate of . The set of parameters that produced the lowest validation loss were saved and used to test the model on the holdout test data.
Before training the gates, the base model was trained on all of the training data to learn the feature representations generic to all classes. When training the gated versions, the dataset was sampled such that each set of gates was presented with classes exclusive to the category it was being trained for. For example, when training the “vehicle” gates, only images of planes, cars, ships and trucks were presented to the network. The parameters for non-gating layers were initialized using the best parameters from the base model, and were not modified further during training. This ensures that the same neurons are being gated for both categories rather than having different sets of neuron parameters for each. We are only interested in learning the category-dependent gate biases. As with the base model, the set of gates producing the lowest validation loss for each category was saved and used in the final testing procedure.
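Because the non-gating parameters are frozen, the only gradient needed during gate training is the one with respect to each gate bias. As a short derivation, writing a gate's output as $z = a\,\sigma(b)$ (notation assumed here for the gated activation $a$ and bias $b$) and the loss as $\mathcal{L}$, the chain rule gives:

```latex
\frac{\partial \mathcal{L}}{\partial b}
  = \frac{\partial \mathcal{L}}{\partial z}\,\frac{\partial z}{\partial b}
  = \frac{\partial \mathcal{L}}{\partial z}\; a\,\sigma(b)\bigl(1-\sigma(b)\bigr)
```

Two consequences follow from this form. The factor $\sigma(b)(1-\sigma(b))$ reaches its maximum of $0.25$ at $b = 0$, so gates initialized at zero start where the bias gradient is largest. The gradient also vanishes for any sample on which the gated neuron is inactive ($a = 0$), so a gate is only updated by inputs that its neuron actually responds to.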
During testing the model is cued by using categories inferred directly from the image labels, e.g. if the label was “truck” the category would be “vehicles”. For every image presented to the model the category was used to select which set of gates to use for inference, as illustrated in Figure 1. In this paper we are only evaluating the effectiveness of the gating mechanism and not how an agent automatically selects a category.
5 Results and Discussions
The results of the experiments are presented in this section along with relevant discussions. The loss curves during the training phase are shown in Figure 3.
The gated versions show much lower validation losses than the base model, which indicates they are improving upon the non-gated model. The fact that the gated models rely on pre-trained neuron weights and biases makes the gating biases quick to train, especially because there are so few parameters in comparison to the number of parameters needed for even one neuron. This is beneficial because one can reuse the same base model and very quickly train different gating layers. This is likely even quicker than the fine-tuning techniques used in transfer learning for deep learning models, where the final classification layer is often a fully connected layer that is tuned to different data (Yosinski et al., 2014).
| Model | Test Loss | Test Accuracy | Categorical Isolation |
The results show that the feature gating reduces the test loss by 0.25, increases the total classification accuracy by 5.1%, and increases categorical isolation by 15.2%. This represents a strong improvement in performance in terms of accuracy, while adding only a few trainable parameters. Even a single additional hidden neuron would result in more parameters than a full gating layer. The improvement in terms of categorical isolation is significant, and it shows that features can have a drastic difference in importance depending on the active behavioural goals. This is an important result because it means the network is producing outputs that are more reasonable considering the context of the situation.
The categorical isolation metric is useful for comparing between different models, but does not provide all the details of the network’s behaviour. It could be possible that the network favours specific categories, which would result in a high categorical isolation, but it might not be producing reasonable outputs for other categories. To further assist in interpreting the behaviour of the network we direct the reader’s attention to Figure 4.
Of particular interest here are the results of classification between the two different categories of classes. In the non-gated model there are a number of instances where the model predicted classes from one category as classes from the other category. In contrast, the dark purple regions of the gated model coincide with regions where the predicted classes do not belong to the same category as the true class. This shows that the gated model nullifies almost all cases of misclassification between categories. The feature gating worked as expected to discriminate classes based on the features relevant to the current goal.
We also wish to explore the inner workings of the gating and not just examine the overall effect. The gate bias vectors are visualized as 2-D images in Figure 5 to illustrate the behavioural changes of individual neurons based on the categories. This may provide useful insights into how gating influences the network at a neuronal level.
The absolute difference between the gates in the first layer shows that a large number of individual biases are different between categories. These bias changes indicate that neurons in the first layer are highly sensitive to changes in task. In contrast, the second layer of gates shows that a larger percentage of the biases remain the same; however, there are a few neurons that exhibit large changes between categories. This is an indication that the second layer represents more general features that are present in both categories, but with a few specialized neurons that are task-dependent.
We also noticed that the range of the biases is quite distinct between the two layers. The histogram plots in Figure 6 show how the biases are distributed for each gating layer. This shows the behaviour of the gating layers as a whole rather than indicating the differences in individual gating units.
The histograms show that the second layer of gates passes through almost all of the neuron activations at the 50% suppression point (a gate bias of zero, where the sigmoid equals 0.5) or above, whereas the first gating layers are more symmetrical about the 50% suppression point. The fact that most of the second layer neuron activations are never fully suppressed further supports the idea that the second layer neurons are more general across tasks. What is also interesting to note is that some of the biases in the second layer have much higher values than in the first layer. This suggests that some of the neurons in the second layer are highly sensitive to specific features of the input.
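The relationship between a gate bias and the amount of activation it passes can be made concrete with a small sketch. The bias vectors below are hypothetical values chosen to mimic the qualitative pattern described above (a roughly symmetric first layer, a mostly open second layer with one large bias); they are not the trained biases, and the helper names are assumptions introduced here.

```python
import math


def gate_value(bias: float) -> float:
    """Fraction of the activation a gate lets through for a given bias."""
    return 1.0 / (1.0 + math.exp(-bias))


def fraction_mostly_open(biases: list[float]) -> float:
    """Share of gates that pass at least half of their input (bias >= 0)."""
    return sum(gate_value(b) >= 0.5 for b in biases) / len(biases)


# Hypothetical bias vectors chosen for illustration, not the trained values.
layer1_biases = [-2.0, -0.5, 0.0, 0.5, 2.0, -1.0]
layer2_biases = [0.1, 0.0, 0.3, 3.5, 0.2, -0.1]

layer1_share = fraction_mostly_open(layer1_biases)
layer2_share = fraction_mostly_open(layer2_biases)
```

A bias of zero corresponds exactly to the 50% point, while a large positive bias such as 3.5 leaves the activation almost untouched, since the sigmoid saturates towards 1.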
6 Future Work
The experiment presented in this paper was merely to determine whether or not feature gating in this fashion works and how well it works. Further research would include applying our gating mechanism to different datasets with more categories to evaluate how well our solution scales to an increase in the number of categories. Applying this method to convolutional neural networks (CNNs) is a high priority, as these are more akin to biological visual systems, and are the current state-of-the-art in the majority of visual tasks. It is expected that our gating mechanism is general enough that it can be applied in the same way to convolutional kernels used in CNNs.
A primary goal for this work is to develop a neuro-symbolic cognitive architecture. The symbolic system would be better suited to high-level cognitive tasks such as reasoning and problem solving than perceptual tasks where artificial neural networks have been excelling. Our gating mechanism allows for top-down control of neural-like perception by the symbolic system depending on its goals. Many other cognitive architectures only consider or implement a bottom-up flow of information that starts by sensing the world, making decisions based on the interpretation of the sensory data, and then acting on these decisions. They rarely implement the case where the perceptual stage can be influenced in a top-down manner.
This type of neuro-symbolic architecture would be ideally suited for mobile robotics. Mobile robotics applications such as search and rescue require robots to deal with a variety of situations. Modifying how a robot perceives the world depending on the situation could result in better perceptual accuracy, as shown by our results.
We have implemented a novel approach to feature-based attention in feedforward neural networks. The approach of using externally controlled, task-dependent gating units shows major improvement in image classification despite only adding a few parameters. The mechanism is also general enough that it should be possible to replicate in most other forms of artificial neural networks. Another benefit is that it is designed to allow for easy control from an external source, such as a cognitive architecture, to modulate the behaviour of the neural network to enhance perception. This could lead towards a more tractable form of integrating low-level, neural-like perception with higher-level symbolic cognitive systems.
The authors would like to thank the University of Cape Town for providing the funding for this research.
- Bahdanau et al. (2014) Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
- Carrasco (2011) Carrasco, M., 2011. Visual attention: The past 25 years. Vision research 51, 1484–1525.
- Cromer et al. (2010) Cromer, J.A., Roy, J.E., Miller, E.K., 2010. Representation of multiple, independent categories in the primate prefrontal cortex. Neuron 66, 796–807.
- Desimone and Duncan (1995) Desimone, R., Duncan, J., 1995. Neural mechanisms of selective visual attention. Annual review of neuroscience 18, 193–222.
- Gilbert and Li (2013) Gilbert, C.D., Li, W., 2013. Top-down influences on visual processing. Nature Reviews Neuroscience 14, 350.
- He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- Howard et al. (2017) Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 .
- Iandola et al. (2016) Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K., 2016. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360.
- Krizhevsky et al. (2014) Krizhevsky, A., Nair, V., Hinton, G., 2014. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. nature 521, 436.
- LeCun et al. (1995) LeCun, Y., Bengio, Y., et al., 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361, 1995.
- LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1, 541–551.
- Lu et al. (2017) Lu, J., Xiong, C., Parikh, D., Socher, R., 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p. 2.
- Maunsell and Treue (2006) Maunsell, J.H., Treue, S., 2006. Feature-based attention in visual cortex. Trends in neurosciences 29, 317–322.
- Moore and Zirnsak (2017) Moore, T., Zirnsak, M., 2017. Neural mechanisms of selective visual attention. Annual review of psychology 68, 47–72.
- Noudoost et al. (2010) Noudoost, B., Chang, M.H., Steinmetz, N.A., Moore, T., 2010. Top-down control of visual attention. Current opinion in neurobiology 20, 183–190.
- Ruder (2017) Ruder, S., 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 .
- Simonyan and Zisserman (2014) Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 .
- Summerfield and Egner (2009) Summerfield, C., Egner, T., 2009. Expectation (and attention) in visual cognition. Trends in cognitive sciences 13, 403–409.
- Szegedy et al. (2017) Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A., 2017. Inception-v4, inception-resnet and the impact of residual connections on learning., in: AAAI, p. 12.
- Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
- Wu et al. (2016) Wu, Z., Ye, Y., Yuexin, Y., Cohen, R.S.W.W., 2016. Encode, review, and decode: Reviewer module for caption generation. arXiv preprint arXiv:1605.07912 .
- Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y., 2015. Show, attend and tell: Neural image caption generation with visual attention, in: International conference on machine learning, pp. 2048–2057.
- Yantis and Serences (2003) Yantis, S., Serences, J.T., 2003. Cortical mechanisms of space-based and object-based attentional control. Current opinion in neurobiology 13, 187–193.
- Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. How transferable are features in deep neural networks?, in: Advances in neural information processing systems, pp. 3320–3328.
- Zhang et al. (2017) Zhang, X., Zhou, X., Lin, M., Sun, J., 2017. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083.