Global-and-local attention networks for visual recognition

Drew Linsley, Dan Scheibler, Sven Eberhardt and Thomas Serre
Department of Cognitive Linguistic & Psychological Sciences
Carney Institute for Brain Science
Brown University
Providence, RI 02912

State-of-the-art deep convolutional networks (DCNs) such as squeeze-and-excitation (SE) residual networks implement a form of attention, also known as contextual guidance, which is derived from global image features. Here, we explore a complementary form of attention, known as visual saliency, which is derived from local image features. We extend the SE module with a novel global-and-local attention (GALA) module which combines both forms of attention, resulting in state-of-the-art accuracy on ILSVRC. We further describe ClickMe, a large-scale online experiment designed for human participants to identify diagnostic image regions to co-train a GALA network. Adding humans-in-the-loop is shown to significantly improve network accuracy, while also yielding visual features that are more interpretable and more similar to those used by human observers.



Preprint. Work in progress.

1 Introduction

Most recent gains in visual recognition have originated from the development of network architectures which incorporate some form of attention. While biology is sometimes mentioned as a source of inspiration (Stollenga et al., 2014; Mnih et al., 2014; Cao et al., 2015; You et al., 2016; Chen et al., 2017; Wang et al., 2017; Biparva and Tsotsos, 2017), the attentional mechanisms that have been considered by the computer vision community remain limited in comparison to the richness and diversity of the processes used by our visual system (see Itti et al., 2005, for a review). Decades of work on human attention have shown that our visual system uses at least two pathways to quickly guide attention to regions of interest (Torralba et al., 2006): a “global pathway” rapidly extracts a statistical summary of the whole scene in as little as a glance (Oliva and Torralba, 2007), while a complementary “local pathway” leverages local features to extract salience cues (Itti and Koch, 2001). Most state-of-the-art networks (Bell et al., 2016; Wang et al., 2017; Hu et al., 2017a), including last year’s squeeze-and-excitation ILSVRC winner (Hu et al., 2017a), implement a global pathway. Here, we explore the role of the complementary local pathway and its interplay with a global pathway within a modern residual network architecture (He et al., 2015).

Our contributions are three-fold: (i) We extend the leading squeeze-and-excitation (SE) module with a novel global-and-local attention (GALA) module which combines global contextual guidance with local saliency to achieve state-of-the-art recognition performance on ILSVRC. (ii) We further describe a large-scale online experiment to supplement ImageNet with a half-million image feature importance maps derived from human participants. These maps are psychophysically validated and used to co-train a 50-layer GALA residual network. (iii) Adding humans-in-the-loop is shown to significantly improve recognition accuracy while also creating visual representations that are more interpretable and more similar to those derived from human observers.

By supplementing ImageNet with the public release of a half-million image feature importance maps, we hope to spur interest in the development of network architectures that are not only more robust and accurate but also more interpretable and anthropomorphic.

2 Related work

Attention mechanisms

In addition to extensive work on learning visual saliency for object detection (see Nguyen et al., 2018, for a recent review), much recent work has focused on the integration of attention modules within end-to-end trainable deep network architectures. Spatial attention mechanisms involve learning a spatial mask used to enhance/suppress the activity of units inside/outside a “spotlight” positioned over a scene. Spatial attention modulates network units according to their spatial location independently of their feature tuning. Such mechanisms were the focus of early systems for image categorization (Xu et al., 2015), and were later shown to yield significant improvements for visual question answering (VQA) and captioning (Zhu et al., 2016; Yang et al., 2016; Xu and Saenko, 2016; Seo et al., 2016; Nam et al., 2017; Patro and Namboodiri, 2018). Feature-based attention (also called “channel-wise” attention in computer vision) is a complementary form of attention which involves learning a task-specific modulation that is selectively applied to individual feature maps across an entire scene. In this work, spatial and feature-based attention are combined in a single learned mask used to modulate feature representations, as is typically done in state-of-the-art systems (You et al., 2016; Chen et al., 2017; Wang et al., 2017; Biparva and Tsotsos, 2017).

Computational-neuroscience models have suggested that there are two main pathways guiding visual attention (Torralba et al., 2006). Global features are typically used to compute so-called “summary statistics” by averaging activities from individual feature channels across the entire scene. These representations are designed to capture the scene layout or “gist”, and are hypothesized to serve as a representation of contextual information to drive attention (Oliva and Torralba, 2007). This is the type of attention used in most state-of-the-art networks (Bell et al., 2016; Wang et al., 2017; Hu et al., 2017a). In this work, we explore a complementary form of attention known as visual saliency (Itti and Koch, 2001) derived from local feature representations. While visual saliency has been extensively studied (Nguyen et al., 2018), this is the first time, to our knowledge, that this local form of attention has been combined with a global form of attention in a global-and-local (GALA) network that can learn to integrate them in a complex nonlinear combination to solve visual recognition tasks.

Human-in-the-loop computer vision

One of the goals of the present study is to leverage human supervision to co-train the proposed GALA network. Previous work has shown that it is possible to augment vision systems with human perceptual judgments on difficult problems including face recognition and localization (Scheirer et al., 2014; Branson et al., 2010; Kovashka et al., 2016), bird categorization (Deng et al., 2016), action recognition (Vig et al., ), and object detection and segmentation (Vondrick et al., 2015; Papadopoulos et al., ; Shanmuga Vadivel et al., 2015; Vijayanarasimhan and Grauman, 2014). Online games constitute an efficient way to collect human ground-truth data (Deng et al., 2016; von Ahn and Dabbish, 2004; von Ahn et al., 2006; Das et al., 2016; Linsley et al., 2017). In a two-player game, an image may be gradually revealed to a “student” tasked to recognize it, based on bubbles drawn by a remote “teacher” (Linsley et al., 2017). The need to assemble teams of players severely limits the scale of these types of games. This can be alleviated in one-player games (Deng et al., 2016; Das et al., 2016) where a single player may be asked to query a blurred image to answer questions about it by sharpening selected regions. In this work, we introduce ClickMe, a novel game which departs significantly from earlier work (Linsley et al., 2017) by having a human player collaborate with a DCN towards discovering the minimal configurations of image features sufficient for recognition, at a scale suitable for co-training a modern DCN architecture.

3 Proposed network architecture

We designed the global-and-local attention (GALA) block as a circuit for learning complex combinations of local saliency and global contextual modulations in feedforward neural networks. GALA modulates an input layer $\mathbf{U}$ with a modulation mask $\mathbf{A}$ of the same dimension as the input. Here, the spatial height, width, and number of feature channels are denoted with $H$, $W$, and $C$ s.t. $\mathbf{U}, \mathbf{A} \in \mathbb{R}^{H \times W \times C}$.

Local and global pathways

Our starting point is the SE module, which is denoted $F_{global}$ in our model and yields the global feature attention vector $\mathbf{g} \in \mathbb{R}^{1 \times 1 \times C}$ (Fig. 1). This procedure involves two steps: first, calculating per-channel “summary statistics”; and second, applying transformations to shrink and then expand the dimensionality of these statistics. Summary statistics are computed with a global average applied to individual feature channels $\mathbf{u}_k \in \mathbb{R}^{H \times W}$, yielding the vector $\mathbf{z} \in \mathbb{R}^{1 \times 1 \times C}$. This is followed by a shrinking operation of the vector by the operator $\mathbf{W}_{shrink}$ (so-called “squeeze”) into a lower dimensional space followed by an expansion operation $\mathbf{W}_{expand}$ (bias terms are omitted for simplicity) back to the original, higher dimensional space s.t. $\mathbf{g} = \mathbf{W}_{expand}\,\delta(\mathbf{W}_{shrink}\,\mathbf{z})$. We set $\delta$ to a rectified linear function (ReLU) and the dimensionality “reduction ratio” $r$ of the shrinking operation to 4. In parallel, local saliency $\mathbf{S}$ is computed with $F_{local}$ (Fig. 1), which implements the following transformation: $\mathbf{S} = \mathbf{W}_{collapse} * \delta(\mathbf{W}_{shrink} * \mathbf{U})$. Here, convolution is denoted with $*$, and $\mathbf{W}_{shrink}$ and $\mathbf{W}_{collapse}$ are the shrinking and collapsing kernels of the local pathway. This is reminiscent of the local computations performed in computational-neuroscience models of visual saliency to yield per-channel conspicuity maps that are then combined into a single saliency map (Itti and Koch, 2001).
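As a rough illustration of the two pathways described above, the following NumPy sketch computes a global attention vector and a local saliency map for a random feature volume. It assumes 1×1 convolutions (written as per-pixel matrix products), randomly initialized weights, and omitted biases; the function and variable names are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def global_pathway(U, W_shrink, W_expand):
    """SE-style global attention: average each channel, shrink, ReLU, expand."""
    z = U.mean(axis=(0, 1))                  # (C,) per-channel summary statistics
    return relu(z @ W_shrink) @ W_expand     # (C,) global attention vector

def local_pathway(U, W_shrink, W_collapse):
    """Local saliency: shrink channels at every pixel, ReLU, collapse to one channel."""
    hidden = relu(U @ W_shrink)              # (H, W, C/r); a 1x1 conv is a per-pixel matmul
    return hidden @ W_collapse               # (H, W, 1) saliency map

rng = np.random.default_rng(0)
H, W, C, r = 14, 14, 16, 4                   # r = 4 is the reduction ratio from the text
U = rng.standard_normal((H, W, C))
g = global_pathway(U, rng.standard_normal((C, C // r)), rng.standard_normal((C // r, C)))
S = local_pathway(U, rng.standard_normal((C, C // r)), rng.standard_normal((C // r, 1)))
print(g.shape, S.shape)
```

The global pathway discards spatial layout and keeps one value per channel, whereas the local pathway keeps the spatial layout and discards the channel dimension.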

Figure 1: The global-and-local (GALA) block learns to combine local saliency and global contextual signals to guide attention towards image regions that are diagnostic for object recognition. Optional supervision by ClickMe maps (yellow box) can drive attention to visual features favored by humans.

Table 1: ILSVRC’12 validation set accuracy for published reference models and our re-implementations (evaluated on 224×224 image crops).

                                   Reference                  Ours
Model                              top-1 err.   top-5 err.    top-1 err.   top-5 err.
ResNet-50 (He et al., 2015)        24.70        7.80          23.88        6.86
SE-ResNet-50 (Hu et al., 2017b)    23.29        6.62          23.26        6.55
GALA-ResNet-50 no ClickMe          -            -             22.73        6.35

Pathways integration

Outputs from the local and global pathways are integrated with $F_{integrate}$ to produce the attention volume $\mathbf{A} \in \mathbb{R}^{H \times W \times C}$. Because it is often unclear how tasks benefit from one form of attention vs. another, or whether task performance would benefit from additive vs. multiplicative combinations, $F_{integrate}$ learns parameters that govern these interactions. The vector $\mathbf{a} \in \mathbb{R}^{C}$ controls the additive combination of $\mathbf{S}$ and $\mathbf{g}$ per-channel, while $\mathbf{m} \in \mathbb{R}^{C}$ does the same for their multiplicative combination. In order to combine attention activities $\mathbf{g}$ and $\mathbf{S}$, they are first tiled to produce $\mathbf{G}^{*}, \mathbf{S}^{*} \in \mathbb{R}^{H \times W \times C}$. Finally, we calculate the attention activities of a GALA module as $\mathbf{A} = \zeta(\mathbf{a} \odot (\mathbf{G}^{*} + \mathbf{S}^{*}) + \mathbf{m} \odot (\mathbf{G}^{*} \odot \mathbf{S}^{*}))$, where the activation function $\zeta$ is the $\tanh$ function, which squashes activities in the range $[-1, 1]$. In contrast to other bottom-up attention modules, which use a sigmoidal function to enable excitation and inhibition, our selection of $\tanh$ gives a GALA module the ability to “dis-inhibit” bottom-up feature activations from $\mathbf{U}$ and flip the signs of its individual unit responses. Finally, $F_{integrate}$ applies attention as $\mathbf{U}' = \mathbf{U} \odot \mathbf{A}$.
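The integration step can be sketched in a few lines of NumPy. This is a minimal sketch, assuming that the additive and multiplicative weights are per-channel vectors broadcast over space and that tiling is plain broadcasting; the name `gala_integrate` and the toy inputs are hypothetical.

```python
import numpy as np

def gala_integrate(U, g, S, a, m):
    """Mix a global vector g (C,) and a local map S (H, W, 1) into attention A, then modulate U."""
    G_star = np.broadcast_to(g, U.shape)     # tile g across spatial positions
    S_star = np.broadcast_to(S, U.shape)     # tile S across channels
    A = np.tanh(a * (G_star + S_star) + m * (G_star * S_star))
    return U * A, A

H, W, C = 4, 4, 3
U = np.ones((H, W, C))
g = np.ones(C)
S = -3.0 * np.ones((H, W, 1))
# Purely additive mixing (a = 1, m = 0): A = tanh(1 - 3) is negative, so U changes sign.
U_mod, A = gala_integrate(U, g, S, a=np.ones(C), m=np.zeros(C))
print(float(A[0, 0, 0]), float(U_mod[0, 0, 0]))
```

Because tanh is bounded in [-1, 1] rather than [0, 1], the mask can invert a unit's response instead of only scaling it down, which is the sign-flipping behavior the paragraph describes.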

ResNet-50 implementation

To validate the approach, we embedded the GALA module within a ResNet-50. We identified six mid- to high-level visual layers in ResNet-50 to use with GALA (layers 24, 27, 30, 33, 36, 39; each belonging to the same ResNet-50 processing block). Each GALA module was applied to the final activity in a dense path of a residual layer in ResNet-50. The residual layer’s “shortcut” activity maps were added to this GALA-modulated activity to allow the model to flexibly weigh the amount of attention used. Each attention activity map had a height and width of 14×14. Table 1 shows that the accuracy of our re-implementations (Ours) of ResNet-50 (He et al., 2015) and SE-ResNet-50 (Hu et al., 2017b) trained “from scratch” on ILSVRC12 is on par with published results (Reference). We observed an identical pattern of results and approximately equal performance for both the classic pre- and more recent “post-activation” flavors of ResNet-50 (He et al., 2016). Incorporating our proposed GALA module into the ResNet-50 (GALA-ResNet-50 no ClickMe) offers a small benefit over the SE-ResNet-50. As we will see in Section 5, the benefits of GALA become much more significant with smaller datasets and when we add humans in the loop. We next describe a large-scale online experiment designed to collect the necessary supervision from human participants.

4 ClickMe

ClickMe consists of rounds of game play where human participants play with DCN partners to recognize images from the ILSVRC12 challenge. Players viewed object images and were instructed to use the mouse cursor to “paint” the image parts that are most informative for recognizing each image’s category (written above the image). Once the participant clicked on the image, pixel bubbles were placed wherever the cursor went until the round ended. Having players bubble images in this way forced them to carefully monitor their bubbling while also preventing fastidious strategies that would produce overly sparse salt-and-pepper types of maps. As players bubbled object parts deemed important for recognition, a DCN tried to recognize an image where only bubbled parts were visible (a Gaussian noise mask hid un-bubbled image regions from the DCN). We tried to make the game as entertaining and fast-paced as possible to maximize the number of clicks derived from human players. Hence, we expanded the bubbled regions shown to the DCN to 21×21 pixels to increase its accuracy. The DCN partner was an ILSVRC12-trained VGG16 (Simonyan and Zisserman, 2015) and was hosted on a Flask server that provided real-time feedback to multiple players at once. ClickMe processed approximately 10 bubbled images/second. A timer controlled the number of points participants received in a round. Points were calculated as the proportion of time left on the timer after the DCN reached top-5 correct recognition for that image. If the player could not help the DCN recognize the object within 7 seconds, the round ended and no points were awarded. Points were calculated on the server to protect against cheating. The game also included an option to skip poor quality images.
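The scoring rule and the masking of un-bubbled regions can be sketched as follows. This is only an illustration under stated assumptions: the noise standard deviation, the image size, and the function names `round_points` and `mask_for_dcn` are hypothetical, and any rescaling of the proportion into displayed points is not covered.

```python
import numpy as np

ROUND_SECONDS = 7.0  # round length stated in the text

def round_points(seconds_to_top5, round_seconds=ROUND_SECONDS):
    """Proportion of the timer left when the DCN first reaches top-5 correct; 0 if it never does."""
    if seconds_to_top5 is None or seconds_to_top5 >= round_seconds:
        return 0.0
    return (round_seconds - seconds_to_top5) / round_seconds

def mask_for_dcn(image, bubbled, rng, noise_std=1.0):
    """Keep bubbled pixels of an (H, W, 3) image and replace the rest with Gaussian noise."""
    noise = rng.normal(0.0, noise_std, size=image.shape)
    return np.where(bubbled[..., None], image, noise)

rng = np.random.default_rng(0)
image = np.zeros((32, 32, 3))
bubbled = np.zeros((32, 32), dtype=bool)
bubbled[5:26, 5:26] = True               # one 21x21 region visible to the DCN
masked = mask_for_dcn(image, bubbled, rng)
print(round_points(3.5))                 # half of the 7-second timer is left
```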

4.1 Game statistics

The game was launched on February 1st, 2017 and closed on September 24th, 2017. These efforts drew 1,235 participants (unique user IDs) to the game who played an average of 380 images each. In total, we recorded over 35M bubbles, producing 472,946 ClickMe maps on 196,499 unique images. Figure S2A shows sample ClickMe maps where pixels are scored according to how many times a bubble was overlaid on them over all rounds of game play where these images were presented. The maps typically highlight local image features, emphasizing certain object parts over others. For instance, ClickMe maps for animal categories (Fig. S2A, top row) are nearly always oriented towards facial components even when these are not prominent (e.g., snakes). In general, we also found that ClickMe maps for inanimate objects (Fig. S2A, bottom row) tended to exhibit a front-oriented bias, with distinguishing parts such as engines, cockpits, and wheels receiving special attention. Additional game statistics and ClickMe maps are available as Supplementary Material.
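The per-pixel scoring of ClickMe maps described above amounts to counting bubble overlays. A minimal sketch, assuming square bubbles with a 7-pixel radius (inferred from the 49×49 kernel mentioned later in this section) and cursor positions given as (row, column) pairs; the function name is hypothetical:

```python
import numpy as np

def accumulate_clickme_map(shape, rounds, radius=7):
    """Count, for every pixel, how many bubbles covered it over all rounds for one image."""
    counts = np.zeros(shape, dtype=np.int64)
    for clicks in rounds:                         # one list of cursor positions per round
        for row, col in clicks:
            r0, r1 = max(row - radius, 0), min(row + radius, shape[0])
            c0, c1 = max(col - radius, 0), min(col + radius, shape[1])
            counts[r0:r1, c0:c1] += 1             # square bubble, clipped at the image border
    return counts

rounds = [[(10, 10)], [(10, 10), (12, 12)]]       # two toy rounds on the same image
clickme_map = accumulate_clickme_map((32, 32), rounds)
print(clickme_map[10, 10], clickme_map.max())
```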

Despite the large scale of ClickMe, the collected feature importance maps display strong regularity and consistency between participants. We measured this by calculating the rank-ordered correlation between ClickMe maps from two randomly selected players for an image. These maps were blurred with a 49×49 kernel (the square of the bubble radius in the ClickMe game) to facilitate the comparison and reduce the influence of noise associated with the game interface. Repeating this procedure for 10,000 different images and taking the average of these per-image correlations revealed a strong average inter-participant reliability (p < 0.001), meaning that the kinds of features participants bubbled during game play tend to be stereotyped. Below, we report the similarity between a model’s feature importance maps and humans as a ratio of this value, and refer to this as the “Fraction of human ClickMe map variability”. We also derived a null inter-participant reliability through a similar procedure, by calculating the correlation of ClickMe maps between two randomly selected players on two randomly selected images. Across 10,000 randomly paired images, the average null correlation was lower, reinforcing the strength of the observed reliability.
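The reliability procedure above can be sketched with NumPy alone. The sketch assumes a box (mean) filter for the blur, since the kernel type is not stated, and Spearman correlation with averaged ranks for ties; all function names and the toy data are illustrative.

```python
import numpy as np

def box_blur(img, k):
    """Mean filter with an odd k x k window and zero padding, via an integral image."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad)
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    window_sum = ii[k:k + h, k:k + w] - ii[:h, k:k + w] - ii[k:k + h, :w] + ii[:h, :w]
    return window_sum / (k * k)

def average_ranks(x):
    """1-based ranks of the flattened input, with ties sharing their mean rank."""
    _, inverse, counts = np.unique(np.ravel(x), return_inverse=True, return_counts=True)
    last = np.cumsum(counts)
    first = last - counts + 1
    return ((first + last) / 2.0)[np.ravel(inverse)]

def spearman(a, b):
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def mean_reliability(maps_by_image, rng, k=49, null=False):
    """Average rank correlation between two players' blurred maps.

    null=False pairs two players on the same image; null=True pairs players on different images.
    """
    n = len(maps_by_image)
    scores = []
    for i in range(n):
        if null:
            j = (i + rng.integers(1, n)) % n      # index of some other image
            a = maps_by_image[i][rng.integers(len(maps_by_image[i]))]
            b = maps_by_image[j][rng.integers(len(maps_by_image[j]))]
        else:
            p, q = rng.choice(len(maps_by_image[i]), size=2, replace=False)
            a, b = maps_by_image[i][p], maps_by_image[i][q]
        scores.append(spearman(box_blur(a, k), box_blur(b, k)))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
base = [rng.random((16, 16)) for _ in range(20)]
maps_by_image = [[m, m.copy()] for m in base]     # toy data: both players agree perfectly
observed = mean_reliability(maps_by_image, rng, k=5)
null = mean_reliability(maps_by_image, rng, k=5, null=True)
print(observed, null)
```

Dividing a model-to-human correlation by the observed human-to-human value gives the ratio that the text calls the fraction of human ClickMe map variability.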

Importantly, participants playing this game did not adopt strategies to find visual features that were more important to their DCN partners than to other humans. As we will describe below, the similarity between humans is significantly greater than the similarity between humans and typical DCNs. In addition, learning such a strategy would be near impossible given the statistics of gameplay: the number of images participants played was on average less than the number of ClickMe object categories (380 vs. 1,000), and the top-200 most frequent players were just as accurate on the first half of their game rounds as on the second half (53.64% vs. 53.61%; n.s.).

4.2 ClickMe and object recognition

Although we have shown that ClickMe maps are consistent across participants, this does not necessarily mean that the selected features are important for object recognition. It is possible that the introspective nature of the task (players have to judge which image region to reveal to maximize the chances that the object is correctly recognized) results in feature maps that reflect top-down mechanisms which minimally contribute to the initial stages of visual processing (Eberhardt et al., 2016). We thus directly tested the role that ClickMe map features play in human object recognition with a rapid visual recognition experiment (Fig. 2B). This experiment compared the contribution of ClickMe map features for object recognition with features derived from local image saliency.

We tested human responses on 40 target (animal) and 40 distractor (non-animal) images gathered from the SALICON (Jiang et al., 2015) subset of the Microsoft COCO 2014 dataset. SALICON includes measurements of local image saliency derived from mouse-clicks by human participants, which strongly correlate with passive eye fixations on the same images. Images were presented to human participants either intact or with a phase-scrambled perceptual mask which selectively exposed their most important visual features according to feature importance maps derived from either ClickMe or SALICON (Jiang et al., 2015). These maps were processed with a novel “stochastic flood-fill” algorithm that relabeled pixels with a score that combined their distance from the most important pixel with their labeled importance. This ensured a spatially continuous expansion from most-to-least important pixel, which let us create versions of each image that revealed between 1% and 100% (at log-scale spaced intervals) of its most important pixels, and record how introducing additional features from a resource influenced behavior (see thumbnails in Fig. 2 for examples of images where 100% of ClickMe or SALICON features were revealed). A phase-scrambled version of each image filled regions that were not included in the feature importance map. We used these images in a rapid visual categorization experiment to compare the diagnosticity of the features selected by ClickMe vs. salience for object recognition. Methods are described in detail in Supplementary Material.
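The stimulus construction can be approximated as below. Note the substitution: this sketch ranks pixels by raw importance rather than implementing the stochastic flood-fill algorithm, whose details are in the Supplementary Material, so revealed regions are not guaranteed to be spatially contiguous. It also assumes grayscale images and a phase scramble that borrows the phase of white noise; all names are hypothetical.

```python
import numpy as np

def reveal_fractions(n_levels, lo=0.01, hi=1.0):
    """Log-spaced fractions of important pixels to reveal, from 1% to 100%."""
    return np.logspace(np.log10(lo), np.log10(hi), n_levels)

def phase_scramble(gray, rng):
    """Keep the Fourier amplitude of a 2-D image but replace its phase with that of white noise."""
    amplitude = np.abs(np.fft.fft2(gray))
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(gray.shape)))
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * noise_phase)))

def reveal_top_fraction(gray, importance, fraction, filler):
    """Show the top `fraction` of pixels with nonzero importance; fill the rest from `filler`."""
    n_keep = int(round(fraction * np.count_nonzero(importance)))
    order = np.argsort(importance, axis=None)[::-1]      # most important pixel first
    keep = np.zeros(importance.size, dtype=bool)
    keep[order[:n_keep]] = True
    return np.where(keep.reshape(importance.shape), gray, filler)

rng = np.random.default_rng(0)
gray = rng.random((32, 32))
importance = np.zeros((32, 32))
importance[8:16, 8:16] = rng.random((8, 8)) + 0.1        # 64 labeled pixels
filler = phase_scramble(gray, rng)
fractions = reveal_fractions(8)
stimuli = [reveal_top_fraction(gray, importance, f, filler) for f in fractions]
```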

Figure 2: (A) A representative selection of ILSVRC’12 images and their ClickMe maps. The transparency channel of select images reflects the fraction of clicks for that location. Image features consistently deemed important for recognition are opaque and unimportant ones are transparent. Animals are outlined in blue and non-animals in red. (B) Features identified in ClickMe maps are more diagnostic for object recognition than those identified in saliency maps. A rapid visual categorization experiment compared human performance in discriminating animals vs. vehicles when features were revealed according to ClickMe maps (blue curve) or saliency maps (red curve). ClickMe- and Saliency-masked image exemplars are depicted for the condition in which 100% of important features are visible, demonstrating how Saliency is not necessarily relevant to the task. For clarity, we omitted data between 1-10% of features visible from this plot where accuracy was chance for participants of both groups. Error bars are S.E.M. ***: .

We followed the experimental paradigm for rapid categorization used in Eberhardt et al. (2016), where stimuli are flashed and responses are forced to be rapid (under 550 ms; see Supplementary Material for details). Experiments were implemented with the psiTurk framework (Gureckis et al., 2016) and custom JavaScript functions. We recruited 120 participants from Amazon Mechanical Turk. Participants were organized into a ClickMe and a Saliency group (60 participants in each) who viewed images that were masked to reveal a randomly selected amount of the most important visual features according to ClickMe vs. Saliency feature importance maps. Results are shown in Fig. 2B: human observers reached ceiling performance when 40% of the ClickMe features were visible (6% of all image pixels). In contrast, human observers viewing images masked according to saliency were only just above chance when 63% of these features were visible (9% of all image pixels), and did not reach ceiling performance until the full image was visible (accuracy measured from different participant groups). These findings validate that visual features measured by ClickMe are distinct from saliency and sufficient for human object recognition.

5 Co-training with humans in the loop

Next, we describe how to use ClickMe maps to optionally co-train a GALA module by introducing an additional loss. Let $\mathcal{L}(M(\mathbf{X}), y)$ denote the cross-entropy between activity from model $M$ with input $\mathbf{X}$ and class label $y$, and $\mathbf{R}_l$ its ClickMe importance map resized with bicubic interpolation to be the same height and width as a GALA module activity $\mathbf{A}_l$ at layer $l \in L$, the set of layers where the GALA module is employed. Units in ClickMe maps and $\mathbf{A}_l$ are transformed to the same range by normalizing each by their channel norms. We reduced the depth of each column in $\mathbf{A}_l$ to 1 by setting them to their column-wise norm. ClickMe map supervision for a GALA module is combined with cross-entropy into a global loss:
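One way to write a global loss matching this description is sketched below. It assumes an L2 distance between the unit-normalized ClickMe map and the unit-normalized, depth-reduced attention map, a sum over the GALA layers, and a scalar weight $\lambda$ on the map term. The symbols $\mathcal{L}_T$, $\mathbf{R}_l$, $\mathbf{A}_l$, and the choice of norm are notational assumptions rather than details confirmed by the text.

```latex
\mathcal{L}_T(\mathbf{X}, y) =
  \mathcal{L}\big(M(\mathbf{X}), y\big)
  + \lambda \sum_{l \in L}
    \left\lVert
      \frac{\mathbf{R}_l}{\lVert \mathbf{R}_l \rVert_2}
      - \frac{\mathbf{A}_l}{\lVert \mathbf{A}_l \rVert_2}
    \right\rVert_2
```

Setting $\lambda = 0$ recovers plain cross-entropy training, while larger values weight agreement with the human maps more heavily.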


This formulation jointly optimizes a model for both object classification and predicting ClickMe maps from input images. In our experiments, ClickMe maps were blurred, which helped training converge. Object image and ClickMe importance map pairs were passed through the network during training and augmented with identical random crops and left-right flips. Models were trained for 100 epochs and weights were selected that yielded the best validation accuracy across epochs. All models were implemented in TensorFlow and were trained “from scratch” with weights drawn from a scaled normal distribution. We used SGD with Nesterov momentum (Sutskever et al., 2013) and a piece-wise constant learning rate schedule that decayed after 30, 60, 80, and 90 epochs of training. Models were benchmarked on the validation split of the ILSVRC12 challenge dataset (Table 1). A separate set of models was trained on images and, if applicable, human-derived importance maps from the ClickMe dataset. Approximately 5% of ClickMe was set aside for validation (17,841 images and importance maps), another 5% for testing (17,581 images and importance maps), and the rest for training (329,036 images and importance maps). Each split of ClickMe contained exemplars from all 1,000 ILSVRC categories.
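Two of the training details above lend themselves to a short sketch: applying identical crops and flips to an image and its importance map, and a piece-wise constant learning rate schedule. The base learning rate, the decay factor, the input sizes, and both function names are placeholders chosen for illustration; the text does not state them.

```python
import numpy as np

def paired_augment(image, click_map, crop, rng):
    """Apply the same random crop and left-right flip to an image and its importance map."""
    h, w = click_map.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    image = image[top:top + crop, left:left + crop]
    click_map = click_map[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        image, click_map = image[:, ::-1], click_map[:, ::-1]
    return image, click_map

def learning_rate(epoch, base_lr, decay, boundaries=(30, 60, 80, 90)):
    """Piece-wise constant schedule: multiply by `decay` once per boundary already passed."""
    return base_lr * decay ** sum(epoch >= b for b in boundaries)

rng = np.random.default_rng(0)
click_map = rng.random((256, 256))
image = np.stack([click_map] * 3, axis=-1)     # toy image whose channels equal the map
img, cmap = paired_augment(image, click_map, crop=224, rng=rng)
print(img.shape, cmap.shape, learning_rate(65, base_lr=0.1, decay=0.1))
```

Drawing the crop offsets and the flip decision once and reusing them for both arrays is what keeps the importance map aligned with the pixels it describes.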

Table 2: Networks’ classification error and fraction of explained human ClickMe map variability on the ClickMe subset of ILSVRC. denotes p <0.01.

                                ClickMe Validation                ClickMe Test
Model                           top-1 err   top-5 err   Maps      top-1 err   top-5 err   Maps
ResNet-50 (He et al., 2015)     63.24       39.75       42.23     63.68       40.65       43.61
SE-ResNet-50                    66.20       42.75       62.73     66.17       42.48       64.36
GALA-ResNet-50 no ClickMe       55.90       32.13       62.96     53.90       31.04       64.21
GALA-ResNet-50 w/ ClickMe       46.08       24.32       87.63     49.29       27.73       88.56

We investigated the trade-off between maximizing object categorization accuracy and predicting ClickMe maps (i.e., learning a visual representation which is consistent with that of human observers). We performed a systematic analysis over different values of the hyperparameter $\lambda$, which scaled the magnitude of the ClickMe map loss, while recording object classification accuracy and the similarity between ground-truth ClickMe maps and image-feature importance maps derived from the DCN. Feature importance maps were derived from networks as the norms of activity from the final layer of GALA or SE attention (layer 39, thus generalizing between models with vs. without attention; Zagoruyko and Komodakis, 2016). Their similarity with ClickMe maps was measured with rank-order correlation between the two (see SI for details). This analysis identified a value of $\lambda$ at which both object categorization and ClickMe map prediction improve (see Fig. S3 in Supplementary Material). We use this setting for the GALA-ResNet-50 with ClickMe maps in subsequent experiments.
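The two quantities used in this analysis can be sketched as follows, assuming the norm over channels is an L2 norm and that the reported fraction is a model-to-human correlation divided by the human inter-participant reliability. The function names and the numbers in the example are hypothetical and are not results from the paper.

```python
import numpy as np

def importance_map(activity):
    """Collapse an (H, W, C) activity volume to an (H, W) map via the L2 norm over channels."""
    return np.linalg.norm(activity, axis=-1)

def fraction_of_human_variability(model_human_corr, human_reliability):
    """Model-to-human rank correlation expressed as a fraction of human-to-human reliability."""
    return model_human_corr / human_reliability

activity = np.zeros((14, 14, 2))
activity[..., 0], activity[..., 1] = 3.0, 4.0
print(importance_map(activity)[0, 0])                # 5.0
print(fraction_of_human_variability(0.25, 0.5))      # 0.5, using made-up correlations
```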

We report classification accuracy and the fraction of human ClickMe map variability accounted for by each model on the Validation and Test splits of the ClickMe dataset (Table 2). We found a significant improvement by the GALA-ResNet-50 over both the ResNet-50 and SE-ResNet-50 in object classification for both dataset splits. While models with attention explained a greater fraction of human ClickMe map variability than the ResNet-50, the SE and GALA models were equal on this metric. Incorporating ClickMe maps in GALA-ResNet-50 training significantly improved object classification accuracy and explained a significantly higher fraction of ClickMe map variability on both validation and test splits.

Because the GALA-ResNet-50 networks were trained “from scratch” on the ClickMe dataset, we were able to visualize the types of features selected by each for object recognition. We did so on a subset of 200 images in the ClickMe test dataset from ILSVRC12 validation, for which we had multiple participants supply ClickMe maps. We visualized these features by calculating “smoothed” gradient images (Smilkov et al., 2017), a technique that suppresses visual noise in gradient images. Including ClickMe maps in GALA-ResNet-50 training yielded gradient images which highlighted features that were qualitatively more local and consistent with human observers (Fig. 3), emphasizing object parts such as facial features in animals, and the tires and headlights of cars. By contrast, the GALA-ResNet-50 trained without ClickMe maps placed more emphasis on the bodies of animals and cars as well as their context.
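The idea behind smoothed gradient images is to average gradients over several noisy copies of the input. The sketch below illustrates this with a toy model whose gradient is known in closed form; the sample count, the noise level, and the function name are illustrative defaults rather than the settings used for Fig. 3.

```python
import numpy as np

def smooth_grad(grad_fn, x, n_samples=50, noise_level=0.15, rng=None):
    """Average grad_fn over noisy copies of x; noise std is a fraction of the input range."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = noise_level * (x.max() - x.min())
    total = np.zeros(x.shape, dtype=float)
    for _ in range(n_samples):
        total += grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
    return total / n_samples

# Toy differentiable "model": f(x) = sum(w * x**2), so df/dx = 2 * w * x.
rng = np.random.default_rng(0)
w = rng.random((8, 8))
x = rng.random((8, 8))
saliency = smooth_grad(lambda z: 2.0 * w * z, x, rng=rng)
print(saliency.shape)
```

For a real network, `grad_fn` would return the gradient of the class score with respect to the input image, computed by the framework's automatic differentiation.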

Figure 3: A GALA-ResNet-50 cued with ClickMe maps during training for object recognition uses visual features that are more similar to those used by human observers than a GALA-ResNet-50 trained without such cueing. ClickMe maps were gathered for ILSVRC12 validation set images that were held out of training, and highlight object parts that were deemed important for recognition. The difference between normalized smoothed gradient images (Smilkov et al., 2017) from each network highlights regions that were more important to each network (Gradient ). Image pixels that were more important to a ClickMe GALA-ResNet-50 are colored in red, and those more important to a vanilla GALA-ResNet-50 are colored in blue. The column-wise norm of each network’s GALA module reveals highly interpretable object and part-based attention for the ClickMe GALA-ResNet-50 (in red) vs. less interpretable and more diffuse attention for the vanilla GALA-ResNet-50 (in blue).

Our ClickMe map loss formulation requires reducing the dimensionality of the GALA attention volume to a single channel. Thus, we can directly visualize these “reduced attention maps” to understand the kinds of visual features GALA modules learn to select (Fig. 3; exemplars from the ClickMe test image set are shown). Strikingly, attention in the GALA-ResNet-50 trained with ClickMe maps, virtually without exception, focuses either on a single important visual feature of the target object class, or segments the figural object from the background. This effect persists in the presence of occlusion (Fig. 3, second row of GALA w/ ClickMe maps) and clutter (Fig. 3, fifth row of GALA w/ ClickMe maps). In comparison, some object features can be made out in the attention maps of a GALA-ResNet-50 trained without ClickMe maps, but there is no such localization, and the maps themselves are significantly more difficult to interpret.

6 Discussion

We have described a novel global-and-local attention (GALA) module, extending the squeeze-and-excitation (SE) module which constituted the building block of the winning architecture in the ILSVRC17 challenge. We have tested an SE-ResNet-50 with a reduced amount of training data and found that the architecture overfits compared to a standard ResNet-50. We have found, however, that the proposed GALA-ResNet-50 offers a significant increase in accuracy in this regime, cutting down the top-5 error relative to both ResNet-50 and SE-ResNet-50.

We have further developed ClickMe, an online game to collect the human ground-truth needed to teach GALA to attend to diagnostic image regions for object recognition. We described the collection of the ClickMe dataset aimed at supplementing ImageNet with nearly half a million human-derived image-feature importance maps. The approach was validated using human psychophysics by demonstrating the sufficiency of ClickMe features for rapid visual categorization: on average, human observers were able to recognize objects pre-attentively with as little as 40% of the most important pixels derived from ClickMe maps (6% of all image pixels). In comparison, when pixels were revealed according to their saliency-derived importance, observers only reached this performance when the whole image was visible. These results indicate that ClickMe may provide novel insights into human vision with a measure of feature diagnosticity that goes beyond classic saliency measures. A detailed analysis of the visual features identified by ClickMe maps falls outside the scope of the present study. We release all the ClickMe data, including not only nearly a half-million importance maps, but also the associated timing of human behavioral decisions, with the hope that it will be useful to other researchers. We expect that a more systematic analysis, one that identifies not only what features are selected but also when they are selected (Cichy et al., 2016; Ha and Eck, 2017), could help further our understanding of the different attention mechanisms responsible for the selection of diagnostic image features.

We have also described an approach to co-train GALA using human supervision to cue the network to attend to diagnostic image regions. The routine casts ClickMe map prediction as an auxiliary task that can be combined with a primary visual categorization task. We found a trade-off between learning visual representations that are more similar to those used by human observers vs. learning visual representations that are more optimal for ILSVRC. With the proper trade-off, the approach was shown to learn more interpretable visual representations while yielding a robust improvement in classification accuracy.

While recent advancements in DCNs have led to models that perform on par with human observers in basic visual recognition tasks, there is also growing evidence of qualitative differences in the visual strategies that they employ (Saleh et al., 2016; Ullman et al., 2016; Eberhardt et al., 2016; Linsley et al., 2017). It remains an open question whether these discrepancies arise because of mechanistic differences during visual inference or because of more mundane differences in the way they are trained. That it is possible to drive a modern DCN towards learning more human-like representations with proper cuing to diagnostic image regions during training suggests that the observed differences may reflect different training regimens rather than different inference strategies. In particular, DCNs lack explicit mechanisms for perceptual grouping and figure-ground segmentation which are known to play a key role in the development of our visual system (Johnson, 2001; Ostrovsky et al., 2009). Such processes alleviate the need to learn to discard background clutter through statistical regularities learned via the presentation of millions of training samples as is the case for DCNs. In the absence of figure-ground mechanisms, DCNs are compelled to associate foreground objects and their context as single perceptual units. This leads to DCN representations that are significantly more distributed compared to those used by humans (Linsley et al., 2017). We hope that this work will spur interest in the development of novel training paradigms that leverage the plethora of visual cues (depth, motion, etc) available for figure-ground segregation in order to substitute for the human supervision used here for co-training GALA.


  • Stollenga et al. (2014) M. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. arXiv preprint arXiv: …, page 13, 2014.
  • Mnih et al. (2014) V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. Advances in Neural Information Processing Systems 27, 27:1–9, 2014.
  • Cao et al. (2015) C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, and T. S. Huang. Look and think twice: Capturing Top-Down visual attention with feedback convolutional neural networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2956–2964, December 2015.
  • You et al. (2016) Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4651–4659, June 2016.
  • Chen et al. (2017) L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. S. Chua. SCA-CNN: Spatial and Channel-Wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306, July 2017.
  • Wang et al. (2017) F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6450–6458, July 2017.
  • Biparva and Tsotsos (2017) M. Biparva and J. Tsotsos. STNet: Selective tuning of convolutional networks for object localization. In The IEEE International Conference on Computer Vision (ICCV), volume 2, 2017.
  • Itti et al. (2005) L. Itti, G. Rees, and J. K. Tsotsos. Neurobiology of attention. Academic Press, 2005.
  • Torralba et al. (2006) A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychol. Rev., 113(4):766–786, October 2006.
  • Oliva and Torralba (2007) A. Oliva and A. Torralba. The role of context in object recognition. Trends Cogn. Sci., 11(12):520–527, 2007.
  • Itti and Koch (2001) L. Itti and C. Koch. Computational modelling of visual attention. Nat. Rev. Neurosci., 2(3):194–203, 2001.
  • Bell et al. (2016) S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
  • Hu et al. (2017a) J. Hu, L. Shen, and G. Sun. Squeeze-and-Excitation networks. September 2017a.
  • He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. December 2015.
  • Nguyen et al. (2018) T. V. Nguyen, Q. Zhao, and S. Yan. Attentive systems: A survey. Int. J. Comput. Vis., 126(1):86–110, January 2018.
  • Xu et al. (2015) K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, June 2015.
  • Zhu et al. (2016) Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4995–5004, June 2016.
  • Yang et al. (2016) Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21–29, June 2016.
  • Xu and Saenko (2016) H. Xu and K. Saenko. Ask, attend and answer: Exploring Question-Guided spatial attention for visual question answering. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, volume 9911 of Lecture Notes in Computer Science, pages 451–466. Springer International Publishing, Cham, 2016.
  • Seo et al. (2016) P. H. Seo, Z. Lin, S. Cohen, X. Shen, and B. Han. Progressive attention networks for visual attribute prediction. June 2016.
  • Nam et al. (2017) H. Nam, J. W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2156–2164, July 2017.
  • Patro and Namboodiri (2018) B. Patro and V. P. Namboodiri. Differential attention for visual question answering. April 2018.
  • Scheirer et al. (2014) W. J. Scheirer, S. E. Anthony, K. Nakayama, and D. D. Cox. Perceptual annotation: Measuring human vision to improve computer vision. IEEE Trans. Pattern Anal. Mach. Intell., 36(8):1679–1686, August 2014.
  • Branson et al. (2010) S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In Computer Vision – ECCV 2010, pages 438–451. Springer, Berlin, Heidelberg, 5 September 2010.
  • Kovashka et al. (2016) A. Kovashka, O. Russakovsky, L. Fei-Fei, and K. Grauman. Crowdsourcing in computer vision. November 2016.
  • Deng et al. (2016) J. Deng, J. Krause, M. Stark, and L. Fei-Fei. Leveraging the wisdom of the crowd for Fine-Grained recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38(4):666–676, April 2016.
  • (27) E. Vig, M. Dorr, and D. Cox. Space-Variant descriptor sampling for action recognition based on saliency and eye movements. pages 84–97.
  • Vondrick et al. (2015) C. Vondrick, H. Pirsiavash, A. Oliva, and A. Torralba. Learning visual biases from human imagination. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 289–297. Curran Associates, Inc., 2015.
  • (29) D. P. Papadopoulos, A. D. F. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. pages 361–376.
  • Shanmuga Vadivel et al. (2015) K. Shanmuga Vadivel, T. Ngo, M. Eckstein, and B. S. Manjunath. Eye tracking assisted extraction of attentionally important objects from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3241–3250, 2015.
  • Vijayanarasimhan and Grauman (2014) S. Vijayanarasimhan and K. Grauman. Large-Scale live active learning: Training object detectors with crawled data and crowds. Int. J. Comput. Vis., 108(1-2):97–114, 1 May 2014.
  • von Ahn and Dabbish (2004) L. von Ahn and L. Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’04, pages 319–326, New York, NY, USA, 2004. ACM.
  • von Ahn et al. (2006) L. von Ahn, R. Liu, and M. Blum. Peekaboom: A game for locating objects in images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, pages 55–64, New York, NY, USA, 2006. ACM.
  • Das et al. (2016) A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 932–937, Stroudsburg, PA, USA, 2016. Association for Computational Linguistics.
  • Linsley et al. (2017) D. Linsley, S. Eberhardt, T. Sharma, P. Gupta, and T. Serre. What are the visual features underlying human versus machine vision? January 2017.
  • Hu et al. (2017b) J. Hu, L. Shen, and G. Sun. Squeeze-and-Excitation networks. September 2017b.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
  • Simonyan and Zisserman (2015) K. Simonyan and A. Zisserman. Very deep convolutional networks for Large-Scale image recognition. Intl. Conf. on Learning Representations (ICLR), pages 1–14, 2015.
  • Eberhardt et al. (2016) S. Eberhardt, J. Cader, and T. Serre. How deep is the feature analysis underlying rapid visual categorization? In Neural Information Processing Systems, 2016.
  • Jiang et al. (2015) M. Jiang, S. Huang, J. Duan, and Q. Zhao. SALICON: Saliency in context. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1072–1080, June 2015.
  • Gureckis et al. (2016) T. M. Gureckis, J. Martin, J. McDonnell, A. S. Rich, D. Markant, A. Coenen, D. Halpern, J. B. Hamrick, and P. Chan. psiturk: An open-source framework for conducting replicable behavioral experiments online. Behav. Res. Methods, 48(3):829–842, September 2016.
  • Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, February 2013.
  • Zagoruyko and Komodakis (2016) S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. December 2016.
  • Smilkov et al. (2017) D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. SmoothGrad: removing noise by adding noise. June 2017.
  • Cichy et al. (2016) R. M. Cichy, A. Khosla, D. Pantazis, A. Torralba, and A. Oliva. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Sci. Rep., 6:27755, June 2016.
  • Ha and Eck (2017) D. Ha and D. Eck. A neural representation of sketch drawings. April 2017.
  • Saleh et al. (2016) B. Saleh, A. Elgammal, and J. Feldman. The role of typicality in object classification: Improving the generalization capacity of convolutional neural networks. February 2016.
  • Ullman et al. (2016) S. Ullman, L. Assif, E. Fetaya, and D. Harari. Atoms of recognition in human and computer vision. Proc. Natl. Acad. Sci. U. S. A., 113(10):2744–2749, March 2016.
  • Johnson (2001) S. P. Johnson. Visual development in human infants: Binding features, surfaces, and objects. Vis. cogn., 8(3-5):565–578, June 2001.
  • Ostrovsky et al. (2009) Y. Ostrovsky, E. Meyers, S. Ganesh, U. Mathur, and P. Sinha. Visual parsing after recovery from blindness. Psychol. Sci., 20(12):1484–1491, December 2009.

Supplementary Material

Appendix A Additional game statistics

The game was launched on February 1st, 2017 and closed on September 24th, 2017. Over this period, 25 contests were used to drive traffic to the site by rewarding top-scoring players with gift cards. Participants were given usernames to track their performance and were allowed to play as many game rounds as they wanted. These efforts drew 1,235 participants (unique user IDs) to the game who played an average of 380 images each. More than 90% of these participants played more than one image. The distribution of number of participants per image is shown in Fig. S1.

Around 5% of the images were skipped by participants because of poor image quality or an incorrect class label. The CNN correctly recognized object images in about half of the trials that were played (47% of all images) for which participants received points. In total, we recorded over 35M bubbles, producing 472,946 ClickMe maps on 196,499 unique images.

Figure S1: Distribution of participants per ClickMe map.

ClickMe maps typically highlight local image features, emphasizing certain object parts over others (Fig. S2). For instance, ClickMe maps for animal categories (top row) are nearly always oriented towards facial components even when these are not prominent (e.g., snakes). In general, we also found that ClickMe maps for inanimate objects tended to exhibit a front-oriented bias, with distinguishing parts such as engines, cockpits, and wheels receiving special attention.

Appendix B Psychophysics methods

Stimulus generation

We implemented a “stochastic flood-fill” algorithm, which we applied to a phase-scrambled version of the image to reveal increasingly larger image regions. First, the image pixel given the highest importance by a feature map was identified. Second, the algorithm expanded this region anisotropically, with a bias towards pixels with higher feature importance scores. The revealed region was set to the center of the image to ensure that participants did not have to foveate to see important image parts and to prevent the spatial layout from affecting the results. Separate image sets were generated by this procedure for ClickMe and saliency maps. Participants viewed images masked by one type of map or the other, but never both, to prevent memory effects. Participants saw each unique exemplar only once, in a randomly selected masking configuration. The total number of pixels in the feature importance maps for a given image was equalized between ClickMe and saliency maps. Original images were sampled from 4 target and 4 distractor categories: bird, zebra, elephant, and cat; table, couch, refrigerator, and umbrella.
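The two steps of the flood-fill can be illustrated with a minimal sketch. It is not the implementation used for the experiments: the function name `stochastic_flood_fill`, the 4-connected neighborhood, and the rule of sampling frontier pixels with probability proportional to their importance score are all assumptions made for illustration. Phase scrambling and re-centering of the revealed region are omitted.

```python
import random


def stochastic_flood_fill(importance, n_reveal, seed=0):
    """Grow a connected region outward from the most important pixel.

    importance: 2D list (rows x cols) of non-negative importance scores.
    n_reveal:   number of pixels to reveal.
    Returns the set of revealed (row, col) coordinates.
    """
    if n_reveal <= 0:
        return set()
    rng = random.Random(seed)
    rows, cols = len(importance), len(importance[0])
    n_reveal = min(n_reveal, rows * cols)
    eps = 1e-6  # keeps zero-importance pixels reachable (assumption)

    def neighbors(pixel):
        # 4-connected neighbors that fall inside the map (assumption).
        r, c = pixel
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                yield (nr, nc)

    # Step 1: start from the pixel with the highest importance score.
    start = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: importance[rc[0]][rc[1]],
    )
    revealed = {start}
    frontier = set(neighbors(start))

    # Step 2: add one frontier pixel at a time, favoring high importance,
    # so the region grows unevenly towards important image parts.
    while len(revealed) < n_reveal:
        candidates = sorted(frontier)
        weights = [importance[r][c] + eps for r, c in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        revealed.add(pick)
        frontier.discard(pick)
        frontier.update(n for n in neighbors(pick) if n not in revealed)
    return revealed
```

Calling the function with increasing values of `n_reveal` on the same map would yield progressively larger masks. Passing the same `n_reveal` for a ClickMe map and a saliency map mirrors the equalized pixel counts described above.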

Psychophysics experiment

In each experiment trial, participants viewed a sequence of events overlaid onto a white background: (i) a fixation cross displayed for a variable time (1,100–1,600ms); (ii) the test stimulus for 400ms; and (iii) an additional 150ms of response time. In total, participants were given 550ms to view the image and press a button to judge its category (feedback was provided when response times fell outside this time limit). Participants were instructed to categorize the object in the image as fast and accurately as possible by pressing the “s” or “l” key, which were randomly assigned across participants to either the target or distractor category. Similar paradigms and timing parameters have yielded reliable behavioral measurements of pre-attentive visual system processes (e.g., Eberhardt et al., 2016). The experiment began with a brief training phase to familiarize participants with the paradigm. Afterwards, participants were given feedback on their categorization accuracy at the end of each of the five experimental blocks (16 images per block).

Experiments were implemented with the psiTurk framework (Gureckis et al., 2016) and custom JavaScript functions. Each trial sequence was converted to an HTML5-compatible video format to provide the fastest reliable presentation time possible in a web browser. Videos were preloaded before each trial to optimize the reliability of experiment timing within the web browser. A photo-diode was used to verify that stimulus timing was consistently accurate within 10ms across different operating system, web browser, and display type configurations. Images were sized at pixels, which is equivalent to a stimulus size between approximately across a likely range of possible display and seating setups participants used for the experiment.

Figure S2: ClickMe map exemplars.

Appendix C Trade-off

Comparing visual feature importance between DCNs and humans

We evaluated visual feature importance of DCNs by extracting the per-column activity norm of the final layer where either GALA or SE attention was applied (39). This procedure enabled us to compare the visual features highlighted in ClickMe maps to those highlighted by either attention model or by the standard ResNet-50. Comparing ClickMe maps with these model feature importance maps was a two-step procedure. First, ClickMe maps were resized with bicubic interpolation to the same size as the models’ feature importance maps (14×14). Second, rank-order correlation was used to measure the similarity of the two maps.
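A minimal sketch of this comparison is given below. It rests on several assumptions: the activity norm is taken to be the L2 norm over channels at each spatial location, the rank-order correlation is taken to be Spearman's coefficient with average ranks for ties, and the ClickMe map is assumed to have already been resized to the model's map size (the bicubic resizing step is not shown). The function names are hypothetical.

```python
import math


def activity_norm_map(features):
    """Collapse a feature tensor into a 2D importance map.

    features[h][w] is the vector of channel activities at location (h, w).
    The L2 norm over channels is an assumption about the norm used.
    """
    return [
        [math.sqrt(sum(v * v for v in column)) for column in row]
        for row in features
    ]


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_order_correlation(map_a, map_b):
    """Spearman correlation between two equally sized 2D maps.

    Undefined (division by zero) if either map is constant.
    """
    ranks_a = average_ranks([v for row in map_a for v in row])
    ranks_b = average_ranks([v for row in map_b for v in row])
    mean_a = sum(ranks_a) / len(ranks_a)
    mean_b = sum(ranks_b) / len(ranks_b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean_a) ** 2 for x in ranks_a)
    var_b = sum((y - mean_b) ** 2 for y in ranks_b)
    return cov / math.sqrt(var_a * var_b)
```

Because the correlation is computed on ranks, it is insensitive to any monotonic rescaling of either map, which is convenient when comparing human click densities with unnormalized network activities.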

Evaluating the effect of ClickMe maps on GALA

We investigated the trade-off between maximizing object categorization accuracy and predicting ClickMe maps (i.e., learning a visual representation that is consistent with that of human observers). We performed a systematic analysis over different values of the hyperparameter that scaled the magnitude of the ClickMe map loss, while recording both object classification accuracy and the similarity between ClickMe maps and model feature importance maps. At each tested value of this hyperparameter, five models were trained for 100 epochs, and the weights that optimized accuracy on the validation ClickMe dataset were selected (Fig. S3).
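The role of this hyperparameter can be sketched as a weighted sum of two loss terms. The sketch below is illustrative only: the name `weight`, the use of softmax cross-entropy for classification, and the use of mean squared error for the ClickMe map term are assumptions, not a description of the losses actually used.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stabilized)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def map_loss(predicted_map, human_map):
    """Mean squared error between two equally sized 2D maps (assumption)."""
    pred = [v for row in predicted_map for v in row]
    target = [v for row in human_map for v in row]
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def co_training_loss(logits, label, predicted_map, human_map, weight):
    """Classification loss plus a weighted ClickMe map prediction loss.

    weight = 0 recovers a model optimized only for object recognition;
    larger values push attention towards the human-derived maps.
    """
    return cross_entropy(logits, label) + weight * map_loss(predicted_map, human_map)
```

Sweeping `weight` over a range of values and retraining at each setting would trace out the trade-off between classification accuracy and similarity to human maps described above.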

Figure S3: Training the GALA-ResNet-50 with ClickMe maps improves object classification performance and drives attention towards features selected by humans. We screened the influence of ClickMe maps on training by measuring model accuracy after training on a range of values of the hyperparameter that scales the contribution of ClickMe maps to the total model loss. A model optimized only for object recognition uses features that explain 62.96% of variability in human ClickMe maps, which is consistent with a ResNet-50 trained without attention (dashed red line). Incorporating ClickMe maps in the loss yields a large improvement in predicting ClickMe maps (87.62%) as well as classification accuracy. Explained ClickMe map variability is plotted as the ratio to the ceiling interrater reliability, and the dotted red line depicts the floor interrater reliability (shuffled null).