Learning an Attention Model in an Artificial Visual System
The Human visual perception of the world is of a large fixed image that is highly detailed and sharp. However, receptor density in the retina is not uniform: a small central region called the fovea is very dense and exhibits high resolution, whereas a peripheral region around it has much lower spatial resolution. Thus, contrary to our perception, we are only able to observe a very small region around the line of sight with high resolution. The perception of a complete and stable view is aided by an attention mechanism that directs the eyes to the numerous points of interest within the scene. The eyes move between these targets in quick, unconscious movements, known as “saccades”. Once a target is centered at the fovea, the eyes fixate for a fraction of a second while the visual system extracts the necessary information. An artificial visual system was built based on a fully recurrent neural network set within a reinforcement learning protocol, and learned to attend to regions of interest while solving a classification task. The model is consistent with several experimentally observed phenomena, and suggests novel predictions.
Neuroscientists and cognitive scientists have many tools at their disposal to study the brain and neural networks in general, including Electroencephalography (EEG), Single-Photon Emission Computed Tomography (SPECT), functional Magnetic Resonance Imaging (fMRI) and Microelectrode Arrays (MEA), to name a few. However, the amount of information and level of control afforded by these tools do not remotely resemble what is available to an engineer working on an artificial neural network. The engineer can manipulate any neuron at any time, force certain excitations, intervene in ongoing processes, and collect as much data about the network as needed, at any level of detail. This wealth of information has enabled reverse engineering research on artificial neural networks, leading to insights into the inner workings of trained artificial neural networks. This suggests an indirect approach to studying the brain: training a biologically plausible neural network model to exhibit complex behavior observed in real brains, and reverse engineering the result. In line with this approach, we designed an artificial visual system based on a fully recurrent unlayered neural network that learns to perform saccadic eye movements. Saccadic eye movements are quick, unconscious, task-dependent  motions following the demand of attention , that direct the eye to new targets that require the high resolution of the fovea. These targets are usually detected within the peripheral visual system . Once a target is centered at the fovea, the eye fixates for a fraction of a second while the visual system extracts the necessary information. Most eye movements are proactive rather than reactive, predict actions in advance and do not merely respond to visual stimuli .
There is good evidence that much of the active vision in humans results from Reinforcement Learning (RL) , as part of and organism’s attempt to maximize its performance while interacting with the environment . Accordingly, we train the artificial visual system within the RL paradigm. The network was not explicitly engineered to perform a certain task, and does not contain an explicit memory component — rather it has memory only by virtue of its recurrent topology. Learning takes place in a model-free setting using policy gradient techniques.
We find that the network displays attributes of human learning such as: (a) decision making and gradual confidence increase along with accumulated evidence, (b) skill transfer, namely the ability to use a pre-learned skill in a certain task in order to improve learning on a related but more difficult task, (c) selectively attending information relevant for the task at hand, while ignoring irrelevant objects in the field of view.
Ii The artificial visual system
We designed an Artificial Visual System (AVS) with the task of learning an attention model to control saccadic eye movements, and subsequent classification of digits. We refer to this task as the attention-classification task. The AVS is similar in many ways to that presented in . It is a simplified model of the human visual system, consisting of a small region in the center with high resolution, analogous to the human fovea, and two larger concentric regions which are sub-sampled to lower resolution and are analogous to the peripheral visual system in humans. The AVS was trained and tested on the classification of handwritten digits from the MNIST data set  . Only a small part of the image is visible to the AVS at any one time. Specifically, full resolution is only available at the fovea, which is 9-by-9 pixels, as in , or 5-by-5 pixels (about 69% smaller). The first peripheral region is double the size of the fovea, but sub-sampled with period 2 to match the size of the fovea in pixels. Similarly, the second peripheral region is quadruple the size of the fovea but sub-sampled with period 4. For comparison, a typical digit in the MNIST database occupies about 20-by-20 pixels of the image. The location of the observation within the image is not available to the AVS (unlike ), and movements of the observation location are not constrained to image boundaries. Instead, locations outside the image boundaries are observed as black pixels.
The AVS consists of inputs (observations) projected upon the network via input weights , a neural network consisting of neurons connected by the recurrent weights , and two outputs: a classifier , responsible for identifying the digit after the AVS has explored the image, and the attention model output , responsible for directing the eye to new locations based on the information represented in the network state (see Figure 1). The output consists of one neuron for each possible digit. At the end of the trial, the identity of the highest valued neuron is interpreted as the network’s classification. The progression of a single trial follows these principal stages:
A random digit is selected from the MNIST training database.
A location across the image is randomly selected.
The observation (called ‘glimpse’ ) from the current location is projected upon the network through , along with any pre-existing information within the network state through the recurrent weights .
The attention model output is fed back as a saccade of the eye, i.e. as the size of movement from the current location in the horizontal and vertical axes.
If a predefined number of glimpses has passed (or by network decision), compare the classifier output to the true label, otherwise return to stage 3.
Reward the AVS if the classification was correct, and continue to the next trial.
The AVS is implemented by a fully recurrent neural network. Its network topology is similar the Echo State Network (ESN)  in that the recurrent neural connections are drawn randomly and are not constrained to a particular topology such as in layered feedforward networks, or long short-term memory networks.
The network state evolves according to
is the state of network at time step , each element representing the state of a single neuron,
is the leak rate,
is the internal connections weight matrix,
is the input weight matrix,
is the observation (network input),
and are independent discrete-time Gaussian white noise processes with independent components, each having variance respectively,
is the state of the output neurons,
is the output weight matrix (consisting of blocks for the corresponding output components).
The gradient of the expected reward is estimated as in ,
where is a random trajectory of glimpses, is the probability of trajectory , the observed (usually binary) reward, a fixed baseline computed as in , is the random location of the first glimpse, and indicates averaging over trajectories. Viewed as a partially observable Markov decision process (POMDP), we can write the distributions describing the agent:
where are the probability density functions of respectively. The POMDP dynamics are deterministic: the glimpse position evolves as .
For the AVS (1) the probability density of a trajectory is
Here only the output probabilities depend on , and, using (2) and the Gaussian distribution of the noise, we find that the log likelihood gradient with respect to takes the form
where is the number of glimpses. Stochastic gradient ascent is performed only for the output weights . Recurrent weights are randomly selected, with spectral radius 1, and remain fixed throughout training. The log likelihood with respect to the internal weight matrix takes a similar form. However, the recurrent connections were not learned in our simulations.
Iii-a Use of memory
Since information accumulated by the neural network over time is mixed into the state of the network, it is not obvious that the potential to extract useful historic information can be exploited within the attention model solution. Training uses gradient ascent to the local maxima of the estimated expected reward and therefore may converge to sub-optimal maxima that do not make use of the full potential of the system. In order to test the use of memory by the trained network, two similar AVS were trained on the attention-classification task. In the first AVS, recurrent weights were random, whereas the second AVS was set to ‘forget’ historic information, by setting the recurrent weights matrix to zero.
Use of memory was found to depend on the size of the fovea. Fig. 2 shows the performance of the system across training epochs, for the case of a large (99 pixels) fovea. Initially, the AVS with memory has the advantage as the attention model is still poor at this stage, leading to relatively uninformative glimpses, so the use of information from several glimpses results in better classification. However, as the attention mechanism improves the last glimpse becomes highly informative, so the memoryless network, where information from the last glimpse is not corrupted by memory of previous glimpses, has the advantage. In fact, we found that information from a well-placed glimpse suffices to classify the digit with over 90% success rate in this case, driving the network to a solution of finding a single good glimpse location across the digit, and classification based on that glimpse, without regard to the rest of the trajectory.
The situation is different with a smaller fovea (55 pixels), where classification from a single glimpse becomes harder. As seen in Figure 3, the AVS with memory outperforms the one without memory in the small fovea case.
Iii-B Gathering Information
The human visual system acts to maximize the information relevant to the task . In order to assess whether our AVS behaves similarly, we have to characterize the relevant information in the context of our task. Since the network classification depends linearly on the network state in the last time step, we quantify the task-relevant information as the best linear separation of the network state, between each class and the other classes. Accordingly, we use Linear Discriminant Analysis (LDA) , which acts to find the projection that minimizes the distance between samples of the same cluster while at the same time maximizes the distance between clusters . The distance within each class is measured by the variance of samples belonging to that class, and is taken to be the mean of these distances across all classes. The Distance between classes is defined as the variance of the set of class centers.
We trained an AVS on the attention-classification task with 5 glimpses per digit. After the AVS was trained, it was tested in two cases. In the first case, the system was run as usual and the network state vector was recorded after the last glimpse of each digit in the test set. In the second case, the location of the last glimpse was chosen randomly rather than following the learned attention model. The results are illustrated in Figure 4, where the state of the network is projected on the first two eigenvectors of . Separation is significantly better with the full attention model compared to the one with the random last glimpse. We conclude that, at the very least, the attention model acts to maximize task-relevant information in the last glimpse better than a random walk.
Iii-C Transfer learning
Biological learning often displays the ability to use a skill learned on a simple task in order to improve learning of a harder yet related task, e.g., proficiency at tennis is beneficial when learning racquetball and even seemingly unrelated tasks such as skiing for example . To test whether transfer learning is possible in the AVS, we trained it to learn the attention model and classification of 3 digits (out of 10 in the MNIST database). The resulting solution served as an initial condition for learning the full task of classifying all 10 digits. As seen in Fig. 5, not only did the AVS with pre-learned attention learn much faster, but it also achieved a better result at the end of training.
Iii-D Ignoring distractions
The eyes are not directed to the most visually salient points in the field of view, but rather to the ones that are most relevant for the task at hand . Accordingly, we introduced an highly salient object into the training images. The object is a square, approximately the size of a digit but with maximum brightness, whereas the digits are handwritten and displayed in grayscale. The object is inserted at a fixed position relative to the digit, always on the right hand side of the digit. The trained network successfully avoids unnecessary fixations on the salient object. In cases where the first glimpse falls upon an area where both the digit and the object are within the peripheral visual region, the object seems to be completely ignored. Perhaps more interesting is the case where only the object is visible in the first glimpse, within the peripheral view. In such a case, the AVS learned to exploit the fact that the digit is always located on the left of the object and consistently performs saccades to the left. Thus, not only was the presence of a distracting object not harmful to performance, but it was actually beneficial.
In Fig. 6, the colored squares represent the foveal view of 5-by-5 pixels at each time step going from blue to red. The left and middle images show cases where the first glimpse only observes the distracting object within the peripheral view. The left image shows a case where the first glimpse observes both the distracting object and the digit within the peripheral view.
Next, we test the network with a distracting object in a random position around the digit. The observed behavior was similar when the first glimpse happened to fall on a location where both the object and digit are within the peripheral view: the AVS ignored the distraction and directed itself towards the digit. However, in the case where the first glimpse falls on a location where only the distracting object is visible in the peripheral view, the AVS failed to locate the digit.
An example is seen in Fig. 7. The lines are trajectories of the AVS each starting from a different point on a test grid and followed until the last glimpse (blue/magenta dot). Green lines are trajectories that led to a correct classification while red lines are trajectories that led to a false classification. When the AVS happen to fall at a location where the digit is not seen, it directs its gaze towards the square, which it then chooses to classify as “1” thus earning 10% expected reward which is better than nothing.
Iii-E Learning aided by demonstration (guidance)
Learning by demonstration (or learning with guidance) was implemented in the AVS. Demonstration differs from supervision in two key ways. First, demonstration is not continuous, and is applied sparsely in time in order to suggest new trajectories to the system. Second, demonstration is not required to provide the best solution to the system, because the system maintains its freedom to explore and even improve upon it. Demonstration was achieved by providing the network with a sparse and naive suggestion for the attention model. For example, on of trajectories, the system was directed to the center of the digit on the last glimpse. Such partial direction resulted in a significant improvement of both speed of learning and the final success rate, as can be observed in figure 8.
Demonstration in the AVS system was made possible by manipulating the exploration noise. The exploration noise is a Gaussian white noise and as such has probability greater than zero to accept any value. Since the output of the system at any given time is a function of that noise, we can force the output to a specific value by setting the exploration noise in that particular time step to be where is now the demonstrated output of the attention model and is the determined exploration noise that will bring the system to that desired output. As long as the demonstration is kept sparse enough, it would in practice not break the assumption that the noise is a Gaussian white noise. The noise in the system is an essential part of the log likelihood gradient and therefore the system would not only arrive to the desired output at that particular time step, but also learn from that experience.
We have shown that a simple artificial visual system, implemented through a recurrent neural network using policy gradient reinforcement learning, can be trained to perform classification of objects that are much larger than its central region of high visual acuity. While receiving only classification based reward, the system develops an active vision solution, which directs attention towards relevant parts of the image in a task-dependent way. Importantly, the internal network memory plays an essential role in maintaining information across saccades, so that the final classification is achieved by combing information from the current visual input and from previous inputs represented in the network state. Within a generic active vision system, without any specifically crafted features, we have been able to explain several features characteristic of biological vision: (i) Good classification performance using reinforcement learning based on highly limited central vision and low resolution peripheral vision, (ii) Gathering task-relevant information through active search, (iii) Transfer learning, (iv) Ignoring task-irrelevant distractors, (v) Learning through guidance. Beyond providing a model for biological vision, our results suggest possible avenues for cost-effective image recognition in artificial vision systems.
The Matlab code will be made available at: https://alonhazan.wordpress.com/
-  Alfred L. Yarbus. Eye movements and vision. Neuropsychologia, 6(4):222, 1967.
-  G T Buswell. How people look at pictures: a study of the psychology and perception in art. Chicago University of Chicago Press, page 198, 1935.
-  Jeff B. Pelz, Roxanne L. Canosa, Diane Kucharczyk, Jason Babcock, Amy Silver, and Daisei Konno. Portable eyetracking: a study of natural eye movements. Proceedings of SPIE, pages 566-582, 2000.
-  M F Land and S Furneaux. The knowledge base of the oculomotor system. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 352(1358):1231-9, 1997.
-  W Schultz. Multiple reward signals in the brain. Nature reviews. Neuroscience, 1(3):199-207, 2000.
-  Mary Hayhoe and Dana Ballard. Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4):188-194, 2005.
-  Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent Models of Visual Attention. Advances in Neural Information Processing Systems, pages 2204-2212, 2014.
-  Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2323, 1998.
-  L Ungerleider. ’What’ and ’where’ in the human brain. Current Opinion in Neurobiology, 4(2):157-165, 1994.
-  H Jaeger. The “echo state” approach to analysing and training recur- rent neural networks. GMD-Forschungszentrum Informationstechnik Report 148, 148:43, 2001.
-  Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3):229-256, 1992.
-  John M. Henderson. Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7(11):498-504, 2003.
-  Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Pattern Recognition, 22:833-834, 1990.
-  Rachael D Seidler. Neural correlates of motor learning, transfer of learning, and learning to learn. Exercise and sport sciences reviews, 38(1):3-9, 2010.