Diversified Visual Attention Networks for Fine-Grained Object Classification
Fine-grained object classification attracts increasing attention in multimedia applications. However, it is a quite challenging problem due to the subtle inter-class difference and large intra-class variation. Recently, visual attention models have been applied to automatically localize the discriminative regions of an image for better capturing critical difference, which have demonstrated promising performance. Unfortunately, without consideration of the diversity in attention process, most of existing attention models perform poorly in classifying fine-grained objects. In this paper, we propose a diversified visual attention network (DVAN) to address the problem of fine-grained object classification, which substantially relieves the dependency on strongly-supervised information for learning to localize discriminative regions compared with attention-less models. More importantly, DVAN explicitly pursues the diversity of attention and is able to gather discriminative information to the maximal extent. Multiple attention canvases are generated to extract convolutional features for attention. An LSTM recurrent unit is employed to learn the attentiveness and discrimination of attention canvases. The proposed DVAN has the ability to attend the object from coarse to fine granularity, and a dynamic internal representation for classification is built up by incrementally combining the information from different locations and scales of the image. Extensive experiments conducted on CUB-2011, Stanford Dogs and Stanford Cars datasets have demonstrated that the proposed diversified visual attention network achieves competitive performance compared to the state-of-the-art approaches, without using any prior knowledge, user interaction or external resource in training and testing.
Fine-grained object classification has attracted considerable attention from the multimedia and computer vision communities. It aims to distinguish categories that are both visually and semantically very similar within a general category, e.g., various species of birds, breeds of dogs and different classes of cars. Fine-grained object classification is especially beneficial for multimedia information retrieval and content analysis. Examples include fine-grained image search, clothing retrieval and recommendation, food recognition, animal recognition, landmark classification, and so on. Unfortunately, it is an extremely challenging task, because objects from similar subordinate categories may have marginal visual differences that are difficult even for humans to recognize. In addition, objects within the same subordinate category may present large appearance variations due to changes of scale and viewpoint, complex backgrounds and occlusions.
An important observation on fine-grained classification is that some local parts of objects usually play an important role in differentiating sub-categories. For instance, the heads of dogs are crucial for distinguishing many breeds of dogs. Motivated by this observation, most existing fine-grained classification approaches first localize the foreground objects or object parts, and then extract discriminative features for classification. Some region localization approaches mainly employ unsupervised methods to identify the possible object regions, while others use the available bounding box and/or part annotations. Unfortunately, these approaches suffer from some limitations. First, manually defined parts may not be optimal for the final classification task. Second, annotating parts is significantly more difficult than collecting image labels; manually cropping the objects and marking their parts is time consuming and labor intensive, which is not feasible for practical use. Third, the unsupervised object region proposal approaches generate a large number of proposals (up to several thousand), so processing and classifying the candidate regions is computationally expensive.
Meanwhile, psychological research has shown that humans tend to focus their attention selectively on parts of the visual space instead of processing the whole scene at once when recognizing objects, and that different fixations over time are combined to build up an internal representation of the scene. Such a visual attention mechanism is naturally applicable to fine-grained object classification. Therefore, many visual attention based networks have been proposed in recent years and have achieved promising results. However, it is difficult for existing visual attention models to find multiple visually discriminative regions at once. Besides, we also notice that the inter-class differences usually exist in small regions of an object, such as the beak or the legs in bird images. It is difficult for existing attention models to localize them exactly because of their small sizes. Therefore, zooming in on these regions will be helpful for the attention model.
In this paper, we convert the problem of finding different attentive regions simultaneously into finding them over multiple time steps. Owing to its capability of learning long-term dependencies in a recurrent manner, Long Short-Term Memory (LSTM) has been widely used in deep neural networks that learn from experience to classify, process and predict time series. Therefore, LSTM is adopted to simulate the process of finding and learning multiple attentive regions of a fine-grained object. Moreover, to solve the problem of small attention regions, a diversified attention canvas generation module is designed in our proposed DVAN. Given an image, multiple attention canvases are first generated with different locations and scales. Some of the canvases depict the whole object while others only contain certain local parts. An incremental representation for objects is dynamically built up by combining a coarse-grained global view and fine-grained diversified local parts. With this representation, the general picture and the local details of the objects can be captured to facilitate fine-grained object classification. Figure 1 illustrates two species of birds with similar appearance. The main differences exist in the regions of the eyes, breast and wings. Our diversified visual attention model can automatically discover and locate the subtle differences between these two species on multiple attention canvases with location and scale variations. Another prominent merit of the proposed approach is that it does not need any bounding box or part information for either training or testing, which reduces the difficulty of large-scale fine-grained object classification.
The main contributions of this work are summarized as follows:
A diversified visual attention network (DVAN) is proposed for fine-grained object classification, which effectively and efficiently discovers and localizes the attentive objects as well as the discriminative parts of fine-grained images. To the best of our knowledge, this is the first work to exploit the diversity of computational attention for fine-grained object classification. More importantly, the approach does not need any prior knowledge or user interaction in training and testing.
By combining a coarse-grained view and fine-grained diversified local parts, an incremental object representation is dynamically built up on the generated multiple attention canvases with various locations and scales, from which subtle differences can be accurately captured.
The attention model is integrated with LSTM, in which recurrent neurons are used to attend the sequence of attention canvases from coarse to fine granularity.
Experiments conducted on three benchmark datasets demonstrate that the proposed approach achieves competitive performance with state-of-the-art methods.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. The proposed diversified visual attention network is elaborated in Section 3. Experimental evaluation and analysis are presented in Section 4. Finally, we conclude this work in Section 5.
2.1 Fine-Grained Image Classification
Fine-grained image classification has been extensively studied in recent years. The approaches can be classified into the following four groups according to the use of additional information or human interaction.
Discriminative Feature Representation
Discriminative representation is very important for classification. Due to the success of deep learning in recent years, many methods rely on deep convolutional features, which achieve significantly better performance than conventional hand-crafted features. A bilinear architecture is proposed to compute the local pairwise feature interactions by using two independent sub-convolutional neural networks. Motivated by the observation that a class is usually confused with a few other classes in multi-class classification, a CNN Tree is built to progressively learn fine-grained features that distinguish a subset of classes, by learning features only among these classes. Such features are more discriminative than features learned for all classes. Similar to CNN Tree, Selective Regularized Subspace Learning conducts subspace learning only on confused classes with very similar visual appearance rather than on the global subspace. Subset feature learning first clusters visually similar classes and then learns deep convolutional features specific to each subset.
The alignment approaches find the discriminative parts automatically and align the parts. Unsupervised template learning  discovers the common geometric patterns of object parts and extracts the features within the aligned co-occurred patterns for fine-grained recognition. In alignment categorization, images are first segmented and then objects are roughly aligned, from which features at different aligned parts are extracted for classification. The symbiotic model  jointly performs segmentation and part localization, which help to update the results mutually. Finally, the features around each part are extracted for classification. By aligning detected keypoints to corresponding keypoints in a prototype image, lower-level features with normalized poses and higher-level features with unaligned images are integrated for bird species categorization in .
Part Localization-based Approaches
These approaches usually localize important regions using a set of pre-defined parts, which can be either manually defined or automatically detected using data mining techniques. In , the local appearance features of manually defined parts, such as face and eyes, are extracted and combined with global features for dog breed classification. The experimental results show that accurate part localization considerably increases classification performance. As an intermediate feature, Part-based One-vs-One Feature (POOF)  is proposed to discriminate two classes based on the appearance feature of a particular part. Part-based R-CNNs  learn the whole-object detector and part detectors respectively and predict a fine-grained category from a pose-normalized representation.
Another branch is called human-in-the-loop methods , which need humans to interactively identify the most discriminative regions for fine-grained categorization. The main drawback is that it is not scalable for large scale image classification due to the need of human interaction.
Existing visual attention models can be classified into soft and hard attention models. Soft attention models predict the attention regions in a deterministic way. Therefore, they are differentiable and can be trained using back-propagation. Hard attention models predict the attention points of an image, which are stochastic. They are usually trained by reinforcement learning or by maximizing an approximate variational lower bound. In general, soft attention models are more efficient than hard attention models, since hard attention models require sampling for training while soft attention models can be trained end-to-end. The Recurrent Attention Model is proposed to learn gaze strategies on cluttered digit classification tasks. It is further extended to multiple digit recognition and to less constrained fine-grained categorization. The model can direct high resolution attention to the most discriminative regions without bounding box or part location. A two-level attention model is proposed for fine-grained image classification. A bottom-up attention is used to detect candidate patches, and a top-down attention selects relevant patches and focuses on discriminative parts. The drawback is that the two types of attention are independent and cannot be trained end-to-end, causing information loss and inefficiency.
3 Diversified Visual Attention Networks
In this section, the overall architecture of the proposed DVAN model is first introduced in Section 3.1, followed by the visual attention model employed in DVAN (Section 3.2). The core contributions, diversity promoting visual attention model and corresponding attention canvas generation approach, are then elaborated in Section 3.3 and Section 3.4, respectively, which effectively guarantee diversity and thus the information gain of the visual attention procedure.
The architecture of the proposed DVAN model is described in Figure 2, which includes four components: attention canvas generation, CNN feature learning, diversified visual attention and classification. DVAN first localizes several regions of the input image at different scales and takes them as the “canvas” for following visual attention. A convolutional neural network (i.e. VGG-16) is then adopted to learn convolutional features from each canvas of attention. To localize important parts or components of the object within each canvas, a diversified visual attention component is introduced to predict the attention maps, so that important locations within each canvas are highlighted and information gain across multiple attention canvases is maximized. Different from traditional attention models focusing on a single discriminative location, DVAN jointly identifies diverse locations with the help of a well designed diversity promoting loss function. According to the generated attention maps, the convolutional features will be dynamically pooled and accumulated into the diversified attention model. Meanwhile, the attention model will predict the object class at each time step. All the predictions will be averaged to obtain the final classification results.
3.2 Visual Attention Model
We will first introduce the visual attention model, which exploits the feature maps generated by the last convolution layer of the CNN. The adopted visual attention component in DVAN consists of two modules: attentive feature integration and attention map prediction, as demonstrated in the top and bottom panels of Figure 3, respectively. The final prediction using the attentive features will be presented in the last part.
Attention Feature Integration
Since we want to convert the problem of simultaneously finding multiple attentive regions in an image into finding different attentive regions over multiple time steps, a recurrent neural network is a natural choice. More specifically, LSTM is chosen to perform the attentive feature integration in our attention model, owing to its ability to model sequential data with long- and short-term dependencies. To be self-contained, preliminary knowledge about the LSTM model is first introduced. As shown in the top panel of Figure 3, a typical LSTM unit consists of an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$ as well as a candidate cell state $g_t$. The interaction between states and gates along the time dimension is defined as follows:

$$
\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix}
= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
M \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix},
\qquad
c_t = f_t \odot c_{t-1} + i_t \odot g_t,
\qquad
h_t = o_t \odot \tanh(c_t).
$$

Here $c_t$ encodes the cell state, $h_t$ encodes the hidden state, and $x_t$ is the attentive feature generated by the attention map prediction module of our visual attention model. The operator $\odot$ represents element-wise multiplication. The pooled attentive feature vector $x_t$ is fed into the LSTM unit and integrated with previous ones. The LSTM involves a transformation $M: \mathbb{R}^{a} \to \mathbb{R}^{b}$ that consists of trainable parameters with $a = d + D$ and $b = 4d$, where $d$ is the dimension of $i_t$, $f_t$, $o_t$, $g_t$, $c_t$ and $h_t$, and $D$ is the dimension of $x_t$. Two activation functions, i.e., the sigmoid function $\sigma$ and the hyperbolic tangent function $\tanh$, are used in the attention model. The information flow and involved computation of the three gates, as well as the cell and hidden states of the LSTM unit, are illustrated in the top panel of Figure 3.
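To make the gate interactions concrete, a single LSTM update of this standard form can be sketched in plain Python. This is only an illustrative sketch: the function name `lstm_step`, the explicit `bias` vector and the row ordering of the weight matrix (input, forget, output, candidate) are assumptions made for the example, not details taken from the DVAN implementation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(M, bias, h_prev, c_prev, x):
    """One LSTM update (illustrative sketch).

    M      : weight matrix with 4*d rows and d + D columns (list of rows);
             rows are assumed to be ordered input, forget, output, candidate.
    bias   : list of length 4*d (an assumption; it may be folded into M).
    h_prev : previous hidden state, length d.
    c_prev : previous cell state, length d.
    x      : attentive feature fed to the unit, length D.
    """
    d = len(h_prev)
    v = list(h_prev) + list(x)  # concatenation [h_{t-1}; x_t]
    pre = [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(M, bias)]
    i = [sigmoid(p) for p in pre[0:d]]            # input gate
    f = [sigmoid(p) for p in pre[d:2 * d]]        # forget gate
    o = [sigmoid(p) for p in pre[2 * d:3 * d]]    # output gate
    g = [math.tanh(p) for p in pre[3 * d:4 * d]]  # candidate cell state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate state to 0, so the cell state is simply halved; this gives a quick sanity check of the update.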
Attention Map Prediction
The process of attention map prediction is illustrated in the bottom panel of Figure 3. Let $X_t$ be the feature maps generated by the last convolution layer of the CNN at time step $t$, which is represented as

$$
X_t = [X_{t,1}, X_{t,2}, \ldots, X_{t,K^2}], \qquad X_{t,i} \in \mathbb{R}^{D},
$$

where $X_{t,i}$ is the $i$-th slice of the feature maps at time step $t$. Therefore, there are $K^2$ locations on the feature maps that DVAN can focus on. Each location is a feature vector whose dimensionality $D$ is the same as the number of feature maps.
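The feature maps and the hidden state of the previous LSTM unit jointly determine the new attention map $l_t$ as follows:

$$
l_{t,i} = \frac{\exp\left(W_{h,i}^{\top} h_{t-1} + W_{x,i}^{\top} X_t\right)}{\sum_{j=1}^{K^2} \exp\left(W_{h,j}^{\top} h_{t-1} + W_{x,j}^{\top} X_t\right)}, \qquad i = 1, \ldots, K^2,
$$

where $W_{h,i}$ refers to the weights of the connections from the previous hidden state $h_{t-1}$ to the $i$-th location of the spatial attention map. Similarly, $W_{x,i}$ denotes the weights from the feature maps $X_t$ to the $i$-th location of the map.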
Then, the attentive feature $x_t$, acting as the input of the LSTM unit, is computed by the weighted summation over the feature map slices $X_{t,i}$ based on the predicted attention map $l_t$:

$$
x_t = \sum_{i=1}^{K^2} l_{t,i} \, X_{t,i}.
$$
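The soft attention pooling described here, a softmax over locations followed by a weighted sum of location features, can be sketched as below. The sketch assumes that one unnormalised score per location has already been computed from the previous hidden state and the feature maps; the names `softmax` and `attend` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Soft attention pooling (illustrative sketch).

    features : list of K*K location vectors, each of length D.
    scores   : one unnormalised attention score per location (assumed given).
    Returns the attention map and the attentive feature vector.
    """
    attn = softmax(scores)
    dim = len(features[0])
    pooled = [sum(a * f[k] for a, f in zip(attn, features)) for k in range(dim)]
    return attn, pooled
```

When all scores are equal the attention map is uniform and the attentive feature reduces to the plain average of the location vectors, which is the average-pooling special case.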
Attention Model Initialization
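The cell and hidden states of the LSTM are initialized using Multi-Layer Perceptrons (MLPs), and the average over all feature maps is used as the input of the MLPs:

$$
c_0 = f_{\mathrm{init},c}\left(\frac{1}{T}\sum_{t=1}^{T}\frac{1}{K^2}\sum_{i=1}^{K^2} X_{t,i}\right),
\qquad
h_0 = f_{\mathrm{init},h}\left(\frac{1}{T}\sum_{t=1}^{T}\frac{1}{K^2}\sum_{i=1}^{K^2} X_{t,i}\right),
$$

where $f_{\mathrm{init},c}$ and $f_{\mathrm{init},h}$ are the functions implemented by two MLPs and $T$ is the total number of time steps of DVAN. These initial values are used to calculate the first attention map $l_1$, which determines the initial attentive feature $x_1$.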
The hidden state of the LSTM, followed by a $\tanh$ activation function in the attentive feature integration module, is used as the feature for classification. The classification component is a fully convolutional layer followed by a softmax layer, which generates the probability of each category. Meanwhile, the hidden state also guides the prediction of the next attention map, together with the feature maps of the next time step. In this manner, new attentive features are pooled according to the newly predicted attention maps, and the states of the LSTM are updated using the newly generated attentive features. The whole process is performed recurrently as the time steps progress. The final classification result is the average of the classification results across all time steps.
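The final averaging step can be illustrated with a few lines of Python. This is a minimal sketch that assumes the per-step softmax outputs are already available as lists of class probabilities; `average_predictions` is a hypothetical helper name.

```python
def average_predictions(step_probs):
    """Average class probabilities over time steps and pick the top class.

    step_probs : list of T probability vectors, each of length C.
    Returns (mean probabilities, index of the predicted class).
    """
    num_steps = len(step_probs)
    num_classes = len(step_probs[0])
    mean = [sum(p[k] for p in step_probs) / num_steps for k in range(num_classes)]
    label = max(range(num_classes), key=lambda k: mean[k])
    return mean, label
```

For example, two steps predicting `[0.6, 0.4]` and `[0.2, 0.8]` average to roughly `[0.4, 0.6]`, so the second class is chosen even though the first step preferred the first class.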
3.3 Diversity Promoting Attention Models
The visual attention model introduced above has been demonstrated to automatically localize discriminative regions for minimizing classification error in an end-to-end way. However, we observe that the attention maps generated at different time steps may be quite similar when the input image at each time step is the same. As a result, attention across different time steps does not gain additional information for better classification performance. To illustrate this issue more intuitively, the attention maps generated for a bird image at different time steps by a vanilla visual attention model are visualized in Figure 4(a). From this figure, we can see that given the same input image, the attention model always focuses on the head and neck of the bird across all time steps. Although the head and neck are discriminative for distinguishing this species from other genera or classes, they are insufficient to differentiate it from other visually similar species when the subtle differences lie in parts such as the wings or tail. In addition, areas such as the beak or the eye are too small for useful discriminative features to be learned from them.
Moreover, existing attention models only minimize the classification error during the attention process, without considering information gain. The classification loss is defined as:

$$
L_c = -\sum_{i=1}^{C} y_i \log \hat{y}_i,
$$

where $y_i$ indicates whether the image belongs to class $i$, $C$ is the number of classes and $\hat{y}_i$ is the predicted probability of class $i$. Such a strategy works well for classifying objects with significant differences. However, when the differences become quite subtle, as in fine-grained object classification, it is necessary to collect sufficient information from multiple small regions to make a correct classification decision, which requires the attention process to be diverse. Therefore, a diversified attention model is proposed in this work to capture diverse and discriminative regions.
To diversify the attention regions, the following diversity metric is proposed to compute the correlation between temporally adjacent attention maps:

$$
L_{div} = \frac{1}{T-1}\sum_{t=2}^{T}\sum_{i=1}^{K^2} l_{t-1,i} \, l_{t,i},
$$

where $l_{t,i}$ is the $i$-th attention value of the attention map after conducting softmax on the $K^2$ locations at time step $t$. In general, $L_{div}$ takes a large value if two neighboring attention maps are similar. Unfortunately, based on our empirical observation, solely minimizing this correlation measure does not always provide sufficient diversity for the attention maps.
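A correlation measure of this kind can be sketched as the mean inner product between consecutive attention maps. The sketch below is an assumption-laden illustration: it takes each attention map as a flat list of softmax probabilities, averages over the adjacent pairs, and uses the hypothetical function name `diversity`.

```python
def diversity(attn_maps):
    """Mean inner product between temporally adjacent attention maps.

    attn_maps : list of T attention maps, each a list of K*K probabilities.
    Returns a value that is large when neighbouring maps are similar and
    zero when they have no location in common.
    """
    num_steps = len(attn_maps)
    if num_steps < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(attn_maps, attn_maps[1:]):
        total += sum(p * q for p, q in zip(prev, cur))
    return total / (num_steps - 1)
```

Two identical one-hot maps give the maximum value of 1, while two one-hot maps on different locations give 0; minimising the measure therefore pushes consecutive steps towards different locations.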
To further enhance the diversity of attention, we force the visual attention model to check different locations of the image in the next time step. In order to achieve this goal, a “hard” constraint is imposed on the spatial support locations of the attention maps, which requires the overlapped proportion of temporal neighbouring attention regions to be smaller than a threshold, so that the attention regions can be shifted to different locations in neighbouring time steps.
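The constraint is defined as follows:

$$
\frac{\left|\mathrm{Supp}(l_{t-1}) \cap \mathrm{Supp}(l_t)\right|}{N} < \beta, \qquad t = 2, \ldots, T, \tag{5}
$$

where $\mathrm{Supp}(l_t)$ is the support region on the original image (i.e., the attention canvas) used to localize the attentive region at time step $t$, $N$ is the number of pixels of the original image, and $\beta$ is a given threshold.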
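The overlap constraint can be checked with simple set arithmetic when each support region is represented as a set of pixel coordinates. The following is a hedged sketch under that assumption; the helper names `box_pixels`, `overlap_ratio` and `satisfies_constraint`, and the use of rectangular supports, are illustrative choices rather than details from the paper.

```python
def box_pixels(x0, y0, width, height):
    """Pixel coordinates covered by an axis-aligned box (assumed support shape)."""
    return {(x, y) for x in range(x0, x0 + width) for y in range(y0, y0 + height)}


def overlap_ratio(support_prev, support_cur, num_pixels):
    """Shared pixels of two support regions as a fraction of the image size."""
    return len(set(support_prev) & set(support_cur)) / num_pixels


def satisfies_constraint(supports, num_pixels, threshold):
    """True if every pair of temporally adjacent supports overlaps by less
    than `threshold` (a fraction of the image area)."""
    return all(
        overlap_ratio(a, b, num_pixels) < threshold
        for a, b in zip(supports, supports[1:])
    )
```

On a 4 x 4 image, two 2 x 2 boxes anchored at (0, 0) and (1, 1) share a single pixel, i.e. 1/16 of the image, so the pair passes with a threshold of 0.1 and fails with a threshold of 0.05.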
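The final loss function combines the classification loss and the diversity measure, along with the hard constraint on the attention canvas, and is defined as:

$$
L = -\sum_{t=1}^{T}\sum_{i=1}^{C} y_{t,i} \log \hat{y}_{t,i} + \lambda L_{div}, \qquad \text{subject to the constraint in Eqn. (5)}, \tag{6}
$$

where $y_t$ is the one-hot label vector and $\hat{y}_t$ is the vector of class probabilities at time step $t$, $T$ is the total number of time steps, $L_{div}$ denotes the diversity measure above, and $\lambda$ is a coefficient that controls the extent of the penalty incurred when two neighboring attention locations do not change much.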
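As a rough illustration of how the two terms combine, the sketch below adds a per-step cross-entropy to a weighted diversity penalty. It is a simplified sketch, not the authors' training code: the hard overlap constraint is left out (in DVAN it is met through canvas generation rather than through the loss), and the names `total_loss` and `lam` are assumptions.

```python
import math


def total_loss(labels, probs, attn_maps, lam):
    """Cross-entropy summed over time steps plus a weighted diversity penalty.

    labels    : list of T one-hot label vectors.
    probs     : list of T predicted class-probability vectors.
    attn_maps : list of T attention maps (flat lists of probabilities).
    lam       : penalty coefficient (hypothetical name for the trade-off weight).
    """
    cross_entropy = 0.0
    for y, p in zip(labels, probs):
        cross_entropy -= sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

    num_steps = len(attn_maps)
    penalty = 0.0
    if num_steps > 1:
        # Mean inner product between consecutive attention maps.
        penalty = sum(
            sum(a * b for a, b in zip(prev, cur))
            for prev, cur in zip(attn_maps, attn_maps[1:])
        ) / (num_steps - 1)
    return cross_entropy + lam * penalty
```

Setting `lam` to zero recovers the plain classification loss, while larger values penalise consecutive attention maps that stay on the same locations.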
3.4 Multi-scale Attention Canvas Generation
To meet the constraint required by Eqn. (5) and make the attention maps diversified, an attention canvas generation method is designed to crop multiple attention canvases with different locations and scales from the original image. It provides diversified canvases for DVAN to attend to. Some of the attention canvases contain the main regions of the object, while others only include enlarged local regions of the object. These diversified attention canvases provide abundant candidates for DVAN to select discriminative regions from both the whole-image level and the enlarged local regions.
Since all attention canvases are resized to the same size, the window size determines the scale of enlargement. A large window size yields a small scale for the resized attention canvas, whereas a small window size yields a large scale, so local regions are zoomed in and enlarged when a small window size is used. The stride determines the number of attention canvases to be generated: canvases are cropped according to the defined stride along the horizontal and vertical axes. More specifically, for a large window size, a large stride generates a small number of canvases. Conversely, for a smaller window size, more local regions are cropped with a small stride. Finally, these attention canvases of various sizes are normalized to a uniform size (e.g., $224 \times 224$ for VGG Net). Figure 5 illustrates the generation of attention canvases with different window sizes and strides.
Using this strategy, the generated attention canvases will cover most regions of the input image with different locations and scales. All the resized attention canvases will be organized into a sequence, so that the canvases with small scales will be placed before the canvases with large scales. In such a way, the visual attention model will first attend the main body of an object, and then local parts of the object will be further detected.
Figure 4(b) demonstrates the diversified visual attention images after imposing the diversity penalty and multiple attention canvas generation. It can be observed that the input attention canvases are cropped at different locations and have various scales. These canvases form a sequence to be fed into DVAN. The visualized attention map of each image is diversified, from head to body, legs and tail, which is very reasonable for fine-grained object classification.
In this section, we evaluate the performance of DVAN model for fine-grained object classification. The benchmark datasets and implementation details of DVAN are first introduced. The model ablation studies are performed to investigate the contribution of each component. We compare DVAN with state-of-the-art methods, and the produced diversified attention maps are visualized in an intuitive way to demonstrate its superiority.
We evaluate the performance of the proposed DVAN on three popular datasets for fine-grained object classification: Caltech-UCSD Birds (CUB-200-2011), Stanford Dogs and Stanford Cars. Details about these three datasets are summarized in Table ?. Some representative images from these datasets are also shown in Figure 6. From these examples, we can see that the image contents are indeed complex, making fine-grained classification rather challenging.
The CUB-200-2011 dataset consists of 11,788 images from 200 bird categories, of which 5,994 images are used for training and 5,794 images for testing. It provides rich annotations, including image-level labels, object bounding boxes, attribute annotations and part landmarks. The Stanford Dogs dataset contains 20,580 images of 120 breeds of dogs. In each class, 100 images are used for training and around 71 images for testing. This dataset also provides tight bounding boxes surrounding the dogs of interest. The Stanford Cars dataset includes 16,185 images of 196 classes of cars. Class annotations are typically at the level of year, maker and model, for example, 2012 Tesla Model S and 2012 BMW M3 coupe. In this dataset, most of the images have cluttered backgrounds. Therefore, visual attention is expected to be more effective for classifying them.
Note that although bounding boxes or part-level annotations are available with these datasets, our proposed model does not utilize any bounding boxes or part-level annotations throughout the experiments. This is one of the advantages of the proposed method.
|Dataset|# Class|# Training|# Testing|BBox|Part|
|CUB-200-2011|200|5,994|5,794|Yes|Yes|
|Stanford Dogs|120|12,000|8,580|Yes|No|
|Stanford Cars|196|8,144|8,041|Yes|No|
In this subsection, we describe the training details of the proposed DVAN model. All images are first normalized by resizing their short edges to 256 while keeping their aspect ratios. For attention canvas generation, three different window sizes are used, i.e., $224 \times 224$, $168 \times 168$ and $112 \times 112$, to generate canvases from the resized images. Accordingly, the stride is set to 32, 44 and 48, respectively, to generate the image canvases along the x and y axes. We thus obtain $2 \times 2$, $3 \times 3$ and $4 \times 4$ attention canvases for these three window sizes. In addition, one central region of the image is kept for each scale. In total, we obtain 32 different attention canvases at three scales. All these attention canvases are resized to the same size, i.e., $224 \times 224$ for the VGG-16 network. The sequence of canvases is arranged as follows: the first 5 canvases are from the $224 \times 224$ scale, followed by the 10 canvases from the $168 \times 168$ scale, and the last 17 canvases are from the $112 \times 112$ scale.
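The canvas counts can be reproduced with a short sliding-window sketch. The code below assumes a square $256 \times 256$ extent and window sizes of 224, 168 and 112 pixels; these window sizes are an assumption chosen because they are consistent with the stated strides (32, 44, 48) and the per-scale counts (5, 10, 17). The function names are hypothetical.

```python
def canvas_windows(extent, window, stride):
    """Sliding-window crops over a square extent, plus one centre crop.

    Returns a list of (x, y, width, height) boxes.
    """
    offsets = list(range(0, extent - window + 1, stride))
    boxes = [(x, y, window, window) for y in offsets for x in offsets]
    centre = (extent - window) // 2
    # On a square extent the centre crop may coincide with a grid crop
    # (it does for the 168-pixel window); it is kept to match the stated counts.
    boxes.append((centre, centre, window, window))
    return boxes


def canvas_sequence(extent=256, settings=((224, 32), (168, 44), (112, 48))):
    """Concatenate the crops of all scales, largest window (coarsest view) first."""
    sequence = []
    for window, stride in settings:
        sequence.extend(canvas_windows(extent, window, stride))
    return sequence
```

Under these assumptions the three scales yield 5, 10 and 17 canvases, 32 in total, ordered from the coarsest view to the most zoomed-in one.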
The popular VGG-16 network, which includes 13 convolutional layers and 3 fully connected layers, is deployed to extract the CNN features. The output of the last convolution layer is used as the input of the visual attention model. The dimensionality of the hidden and cell states of the LSTM, and of the hidden layer of the MLP used to initialize the LSTM, is set to 512. A three-stage training approach is adopted for the sake of speed. First, we fine-tune the CNN model pre-trained on ImageNet to extract the basic convolutional feature maps for attention localization. Second, the generated basic convolutional feature maps are fixed, and the diversified attention model is trained separately. Finally, the whole network is updated to further improve the performance. The model is trained using stochastic gradient descent (SGD) for 50 epochs at each stage with a learning rate of 0.001. Our implementation is based on Lasagne.
4.3 Model Ablation Studies
In this subsection, we evaluate the effect of attention mechanism and different parameter settings for the performance of DVAN.
Effect of Attention Mechanism
To evaluate the impact of the attention mechanism of DVAN, we train a VGG-16 model using multiple attention canvases as a baseline. It can be treated as replacing the diversified visual attention component in DVAN with two fully connected layers of size 4096. As in DVAN, the final prediction is the average of the softmax values over all attention canvases. The VGG-16 model trained on single images is chosen as another baseline. The results of VGG-16 and VGG-16 with multiple attention canvases are listed in the upper part of Table 1. The accuracy of VGG-16 trained on single images is 72.1%. By augmenting the training data with multiple attention canvases, VGG-16-multi-canvas improves the accuracy to 74.5%. After adopting visual attention, DVAN further improves the accuracy to 79.0%.
Effect of Different Pooling Methods
The visual attention component in DVAN provides a dynamic pooling strategy to highlight the important and discriminative regions of the image. To prove its effectiveness, we compare it with DVAN using other pooling strategies, such as max pooling and average pooling. The average pooling DVAN (DVAN-Avg) and max pooling DVAN (DVAN-Max) have the identical architecture as the proposed DVAN model, except that they do not predict the attention map on the feature maps. Instead of computing the attentive feature vector, the feature maps are average pooled or max pooled as the input feature for the LSTM in visual attention model.
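The two non-attentive alternatives can be sketched as simple reductions over the location vectors of the feature maps. The helper names below are hypothetical, and the sketch only illustrates how the LSTM input would be formed in the average-pooling and max-pooling variants.

```python
def average_pool(features):
    """Average the K*K location vectors (each of length D) into one vector."""
    count = len(features)
    return [sum(f[k] for f in features) / count for k in range(len(features[0]))]


def max_pool(features):
    """Take the per-channel maximum over the K*K location vectors."""
    return [max(f[k] for f in features) for k in range(len(features[0]))]
```

Average pooling weights every location equally, including background, whereas max pooling keeps only the strongest response per channel; attention pooling sits between the two by learning a weight for each location.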
The performance comparison is listed in Table 1. Both DVAN-Avg and DVAN-Max perform better than VGG-16-multi-canvas, demonstrating the benefit of using an LSTM to model sequential data. The classification accuracy of DVAN-Avg is 76.8%, which is slightly better than that of DVAN-Max. Without dynamically weighted pooling, DVAN-Avg is disturbed by the features of trivial regions. On the other hand, by retaining only the features with the maximum response, DVAN-Max loses more important information than average pooling does. The proposed DVAN model learns the discriminative regions from the feature maps and dynamically pools them according to the attention probabilities. Its accuracy reaches 79.0%, outperforming both DVAN-Avg and DVAN-Max.
Effect of Parameter $\lambda$
$\lambda$ is the parameter used in Eqn. (6) to control the diversity of neighboring attention maps. The effect of $\lambda$ on the performance of the DVAN model is shown in Figure 8. When $\lambda = 0$, the successive attention maps are not required to be diversified. In contrast, a larger value strongly forces DVAN to attend different regions of the attention canvas. The performance first improves and then drops as $\lambda$ becomes larger. Our attention model observes discriminative regions better as $\lambda$ increases at first. However, continuing to increase $\lambda$ imposes an overly strong diversity constraint on attending the object: more attention fixations are scattered across the image, leading to the loss of important information. The best performance (79.0%) is achieved when $\lambda$ is equal to 1, which is used as the default setting in our later experiments.
The attention maps visualized under three different settings are shown in Figure 7. In Figure 7(a), when , the major areas of the image are attended, but a large portion of the background is falsely included. These mistakenly attended background regions reduce classification performance. As shown in Figure 7(b), when , the attention regions focus more accurately on the torso or local parts of the bird. However, when the parameter becomes larger, Figure 7(c) shows that the attention regions are unevenly scattered and forcibly shifted to different locations to satisfy the strong constraint. Discriminative regions are then easily ignored while more irrelevant regions are included, causing performance degradation.
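To make the role of the diversity weight concrete, the following sketch shows one way such a term could enter the training objective. The exact form of Eq. (6) is not reproduced in this section, so the overlap measure (a dot product between neighboring attention maps), the averaging over time steps, and all function names are assumptions made purely for illustration.

```python
def overlap(map_a, map_b):
    """Dot product of two attention maps; large when they attend the same locations."""
    return sum(a * b for a, b in zip(map_a, map_b))


def diversity_penalty(attention_maps):
    """Average overlap between each attention map and its successor (assumed form)."""
    pairs = list(zip(attention_maps, attention_maps[1:]))
    if not pairs:
        return 0.0
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


def total_loss(classification_loss, attention_maps, weight):
    """Classification loss plus the weighted diversity penalty.

    A weight of 0 places no constraint on successive maps, and a larger
    weight pushes neighboring maps toward different regions.
    """
    return classification_loss + weight * diversity_penalty(attention_maps)


# Hypothetical attention maps over two locations.
same_region = [[1.0, 0.0], [1.0, 0.0]]        # both steps attend location 0
different_regions = [[1.0, 0.0], [0.0, 1.0]]  # the second step moves to location 1
```

Under this assumed form, repeating the same fixation is penalized while moving to a new region is not. This matches the behavior described above: a very large weight rewards any shift of attention, including shifts onto irrelevant background.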
Performance of Different Scales
Multi-scale attention canvases at different locations contribute to the diversity of attention maps, so that our model can observe the object from coarse to fine granularity. To reveal the performance with different scales, the experiments with different combinations of scales are conducted. One-scale attention only uses the attention canvases generated by window size of . Two-scale attention model adopts window sizes of and , while three-scale attention model attends the canvases with three different window sizes, , and . For a good trade-off among training efficiency, GPU memory cost and performance, experiments with more scales are not conducted. Note that all generated attention canvases will be resized to the same dimensionality (i.e., ) for training and testing.
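The generation of multi-scale canvases can be sketched as a sliding-window enumeration of crop boxes. The image size, window sizes, and strides below are hypothetical placeholders rather than the settings used in the experiments, and the cropping and resizing steps are left to an image library.

```python
def canvas_boxes(image_size, window_size, stride):
    """List square crop boxes (left, top, right, bottom) for a single window size."""
    positions = range(0, image_size - window_size + 1, stride)
    return [
        (x, y, x + window_size, y + window_size)
        for y in positions
        for x in positions
    ]


def multi_scale_canvases(image_size, windows):
    """Concatenate the crop boxes of several (window_size, stride) settings.

    Windows are listed from coarse to fine so that the resulting sequence
    moves from the whole object to its local parts.
    """
    boxes = []
    for window_size, stride in windows:
        boxes.extend(canvas_boxes(image_size, window_size, stride))
    return boxes


# Hypothetical three-scale setting on a 256-pixel square image.
windows = [(256, 256), (128, 64), (64, 64)]
boxes = multi_scale_canvases(256, windows)
```

In this toy setting the three scales yield 1, 9, and 16 canvases respectively. Each box would then be cropped and resized to a common input size before being passed to the CNN.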
The performance comparison on different datasets is summarized in Table 2. The fine-tuned VGG-16 model is chosen as the baseline. Using the original image as input, the VGG-16 model performs the worst. The general trend is that performance improves significantly as more scales of canvases are involved in the diversified visual attention model. With the three-scale attention model, the accuracy reaches 79.0% on CUB-200-2011 and 81.5% on Stanford Dogs, the best results on these two datasets. This indicates that the diversified visual attention model can correctly detect most discriminative regions at different scales. In general, attention canvases generated with large window sizes tend to capture the whole object, while detailed regions are attended with small window sizes. Interestingly, two-scale attention is more accurate than three-scale attention on the Stanford Cars dataset. This indicates that attention canvases generated with a small window size do not always provide additional information for fine-grained object classification. For example, if the attention canvases contain only the tires or windshields of a car, they cannot distinguish different models, since tires and windshields are visually similar across different types of cars.
|Model||CUB-200-2011||Stanford Dogs||Stanford Cars|
Effect of CNN Feature Learning
Since the gradients of the diversified visual attention module can be back-propagated to the CNN feature learning module, DVAN is end-to-end trainable and learns better features. To verify this, we extract the pool5 features of the trained DVAN, VGG-16 and VGG-16-multi-canvas models. We train SVM classifiers on these features and then use them to classify the test images. The results are reported in Table ?. The accuracy of DVAN (pool5) is 68.2%, which is better than that of the classifiers trained on the pool5 features of VGG-16 or VGG-16-multi-canvas. These results demonstrate that the convolutional features learned by DVAN are superior to those of VGG-16 and VGG-16-multi-canvas.
4.4 Comparison with State-of-the-Art Methods
Performance on CUB-200-2011
|Pose Normalization ||75.7|
|Part-based RCNN ||76.4|
|Two-level Attention ||77.9|
|Bilinear CNN ||84.1|
We compare the proposed framework with state-of-the-art methods on the CUB-200-2011 dataset; the results are listed in Table ?. POOF learns a large set of discriminative intermediate-level features, each specific to a particular domain and a set of parts. Bounding box information and part annotations are used for part feature extraction and comparison, and the accuracy is 73.3%. Pose Normalization computes local image features with pose estimation and normalization. Low-level pose-normalized features and high-level features are combined for final classification, which improves the performance to 75.7%. The idea of part-based RCNN is similar to pose normalization: whole-object and part detectors are learned with geometric constraints between the parts and the object, and a pose-normalized representation is formed for classification. Benefiting from more accurate part detectors, part-based RCNN achieves an accuracy of 76.4%. Unlike the aforementioned methods, which use bounding box information or part annotations, Two-level Attention combines object-level and part-level attention to find the parts of interest of an object: it selects image patches relevant to the target object and uses different CNN filters as part detectors. Its overall accuracy is 77.9%. Whereas Two-level Attention uses two independent components and relies on proposals generated by other algorithms, DVAN adopts a unified mechanism to perform both object-level and part-level attention detection using an LSTM. The accuracy of the proposed DVAN is 79.0%, which is better than that of Two-level Attention and of the aforementioned methods that use bounding box information and part annotations. By using a deeper Inception architecture with batch normalization and actively spatially transforming feature maps, ST-CNN achieves a higher accuracy (82.3%). Bilinear CNN utilizes two CNNs (M-Net and D-Net) to model local pairwise feature interactions in a translationally invariant manner.
At the cost of additional feature extraction, Bilinear CNN obtains an accuracy of 84.1%. Unlike our attention model, which implicitly finds the discriminative parts of the birds, NAC and PD explicitly use specific channels of a CNN as part detectors. Selective Search is also used in these two methods to better locate the object or its parts. Based on the detected parts, NAC learns part models and extracts features at object parts for classification. PD encodes the part filter responses using a Spatial Weighted Fisher Vector and pools them into the final image representation. The more complicated models adopted in NAC and PD make their performance better than ours.
Performance on Stanford Dogs
|Alignment localization ||50.1|
|Unsupervised grid alignments ||56.7|
|Glimpse attention ||76.8|
|Noisy data CNN ||82.6|
The performance comparison with state-of-the-art approaches on the Stanford Dogs dataset is listed in Table ?. Unlike the works relying on detectors for specific object parts, alignment localization identifies distinctive regions by roughly aligning the objects using their implicit super-class shapes. Although the given bounding boxes are used during training and testing, its accuracy is only 50.1%, which suggests that this unsupervised alignment approach does not adapt well to this dataset. Without any spatial supervision such as bounding boxes, glimpse attention directs high-resolution attention to the most discriminative regions using reinforcement learning-based recurrent attention models. Its recognition accuracy is significantly higher, reaching 76.8%. By exploiting additional web data from Google Image Search to expand the training set, the noisy data CNN achieves the best performance, with 82.6% accuracy. Without using bounding box information or additional web data augmentation, our approach reaches 81.5%, which is much better than the approaches using bounding box information as well as the glimpse attention algorithm. The noisy data CNN obtains slightly better accuracy at the expense of additional web data.
Performance on Stanford Cars
|Model||BBox Info||Acc (%)|
|Symbiotic segmentation ||78.0|
|Fisher vector ||82.7|
The performance comparison on Stanford Cars is summarized in Table ?. ELLF uses stacked CNN features pooled over the detected part regions as the final representation, obtaining 73.9% classification accuracy. Symbiotic segmentation improves the accuracy to 78.0% by deploying a symbiotic part localization and segmentation model. Fisher vector further improves the performance to 82.7%, showing the effectiveness of orderless pooling methods. With the assistance of bounding boxes, R-CNN applies high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects, yielding the best accuracy of 88.4%. All these methods resort to bounding box information for better object localization. In contrast, DVAN locates the discriminative parts automatically without requiring any auxiliary information, achieving performance competitive with R-CNN (87.1% vs. 88.4%).
4.5 Diversified Attention Visualization
Figure 9 visualizes the attention maps generated by DVAN. To save space, 15 of the 32 attention maps are selected for demonstration. The first attention map is generated using attention canvas, followed by six attention maps generated using attention canvases. The last eight attention maps are generated by the attention canvases with size of .
Generally, the attention maps generated by small scale attention canvases incorporate more areas of the target object, while the attention maps produced by large scale canvases enclose only some parts of the object at a higher resolution. We can see that diversified attention maps are correctly detected across different time steps. In Figure 9(a), the whole body of the bird is observed first, and then the focus moves to the beak and neck areas. In the last few time steps, local regions of the tail and body receive more attention. Similarly, in Figure 9(b), the main body of the dog is attended first. The head, body and legs are then attended sequentially using larger scale attention canvases. In Figure 9(c), the frontal part of the car is first attended using the small scale attention canvas. After the local parts are enlarged, the hood, bumpers, and side doors are observed. Finally, the front bumpers and wheels are attended as well. However, our experiments also indicate that tiny local parts of cars reduce the classification accuracy, since these parts lack the ability to distinguish between car models.
In this paper, a diversified visual attention network is proposed for fine-grained object classification, which explicitly diversifies the attention maps to localize multiple discriminative parts. By using a recurrent soft attention mechanism, the proposed framework dynamically attends to important regions at each time step. Experiments on three publicly available datasets demonstrate that DVAN achieves performance competitive with state-of-the-art approaches, without using any bounding box or part location information. In the future, we will explore the impact of attributes in fine-grained object classification, where attribute information will be exploited to guide the attention model toward better discovery of the discriminative regions.
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” 2011.
- A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog. Workshops, Jun. 2011, pp. 3466–3473.
- J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur, “Dog breed classification using part localization,” in Proc. IEEE Euro. Conf. Comput. Vis., Oct. 2012, pp. 172–185.
- J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Jun. 2013, pp. 554–561.
- L. Yang, P. Luo, C. Change Loy, and X. Tang, “A large-scale car dataset for fine-grained categorization and verification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 3973–3981.
- L. Xie, J. Wang, B. Zhang, and Q. Tian, “Fine-grained image search,” IEEE Trans. Multimedia, vol. 17, no. 5, pp. 636–647, May 2015.
- K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg, “Retrieving similar styles to parse clothing,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 37, no. 5, pp. 1028–1040, May 2015.
- X. Wang, T. Zhang, D. R. Tretter, and Q. Lin, “Personal clothing retrieval on photo collections by color and attributes,” IEEE Trans. Multimedia, vol. 15, no. 8, pp. 2035–2045, Dec. 2013.
- X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, and S. Yan, “Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval,” IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1175–1186, Jun. 2016.
- M. Bolaños and P. Radeva, “Simultaneous food localization and recognition,” in Proc. Int. Conf. Pattern Recog., 2016.
- P. Sermanet, A. Frome, and E. Real, “Attention for fine-grained categorization,” in Proc. Int. Conf. Learning Representations, May 2015.
- L. Zhu, J. Shen, H. Jin, L. Xie, and R. Zheng, “Landmark classification with hierarchical multi-modal exemplar feature,” IEEE Trans. Multimedia, vol. 17, no. 7, pp. 981–993, Jul. 2015.
- Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “HCP: A flexible CNN framework for multi-label image classification,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 38, no. 9, pp. 1901–1907, 2016.
- T. Berg and P. N. Belhumeur, “POOF: Part-Based One-vs-One Features for fine-grained categorization, face verification, and attribute estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 955–962.
- L. Bo, X. Ren, and D. Fox, “Kernel descriptors for visual recognition,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2010, pp. 244–252.
- S. Branson, G. V. Horn, S. Belongie, and P. Perona, “Bird species categorization using pose normalized deep convolutional nets,” in Proc. British Mach. Vis. Conf., 2014, pp. 1–14.
- Y. Chai, V. Lempitsky, and A. Zisserman, “Symbiotic segmentation and part localization for fine-grained categorization,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 321–328.
- B. Yao, “A codebook-free and annotation-free approach for fine-grained image categorization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 3466–3473.
- N. Zhang, J. Donahue, R. B. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in Proc. IEEE Euro. Conf. Comput. Vis., 2014, pp. 834–849.
- T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 842–850.
- X. Liu, T. Xia, J. Wang, and Y. Lin, “Fully convolutional attention localization networks: Efficient attention localization for fine-grained recognition,” ArXiv:abs/1603.06765, 2016.
- R. A. Rensink, “The dynamic representation of scenes,” Visual Cognition, vol. 7, no. 1-3, pp. 17–42, 2000.
- T. V. Nguyen, B. Ni, H. Liu, W. Xia, J. Luo, M. Kankanhalli, and S. Yan, “Image re-attentionizing,” IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1910–1919, Dec. 2013.
- S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” in NIPS Time Series Workshop, Dec. 2015.
- J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” in Proc. Int. Conf. Learning Representations, May 2015, pp. 1–10.
- K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Int. Conf. Mach. Learning, 2015, pp. 2048–2057.
- M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
- J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in Proc. Int. Conf. Mach. Learning, 2014, pp. 647–655.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learning Representations, May 2015.
- T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1449–1457.
- Z. Wang, X. Wang, and G. Wang, “Learning fine-grained features via a CNN tree for large-scale classification,” ArXiv:abs/1511.04534, 2015.
- C. Luo, B. Ni, S. Yan, and M. Wang, “Image classification by selective regularized subspace learning,” IEEE Trans. Multimedia, vol. 18, no. 1, pp. 40–50, Jan. 2016.
- Z. Ge, C. McCool, C. Sanderson, and P. I. Corke, “Subset feature learning for fine-grained category classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 46–52.
- S. Yang, L. Bo, J. Wang, and L. G. Shapiro, “Unsupervised template learning for fine-grained object recognition,” in Adv. Neural Inf. Process. Syst., 2012, pp. 3122–3130.
- E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, and T. Tuytelaars, “Fine-grained categorization by alignments,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1713–1720.
- Y. Chai, V. Lempitsky, and A. Zisserman, “Bicos: A bi-level co-segmentation method for image classification,” in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2579–2586.
- S. Branson, G. Van Horn, P. Perona, and S. J. Belongie, “Improved bird species recognition using pose normalized deep convolutional nets,” in Proc. British Mach. Vis. Conf., Sep. 2014, pp. 1–14.
- S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie, “Visual recognition with humans in the loop,” in Proc. IEEE Euro. Conf. Comput. Vis., 2010, pp. 438–451.
- J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowdsourcing for fine-grained recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2013, pp. 580–587.
- C. Wah, G. V. Horn, S. Branson, S. Maji, P. Perona, and S. Belongie, “Similarity comparisons for interactive fine-grained categorization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 859–866.
- V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2204–2212.
- R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3-4, pp. 229–256, May 1992.
- W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” ArXiv:abs/1409.2329, 2014.
- M. Simon and E. Rodner, “Neural activation constellations: Unsupervised part model discovery with convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2015.
- X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter responses for fine-grained image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., June 2016, pp. 1134–1142.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learning, 2015, pp. 448–456.
- K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Mach. Vis. Conf., 2014.
- J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, 2013.
- E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, and T. Tuytelaars, “Local alignments for fine-grained categorization,” Int. J. of Comput. Vis., vol. 111, no. 2, pp. 191–212, 2015.
- J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and F. Li, “The unreasonable effectiveness of noisy data for fine-grained recognition,” in Proc. IEEE Euro. Conf. Comput. Vis., Oct. 2016, pp. 301–320.
- J. Krause, T. Gebru, J. Deng, L. J. Li, and L. Fei-Fei, “Learning features and parts for fine-grained recognition,” in Proc. Int. Conf. Pattern Recog., Aug. 2014, pp. 26–33.
- P.-H. Gosselin, N. Murray, H. Jégou, and F. Perronnin, “Revisiting the fisher vector for fine-grained classification,” Pattern Recog. Letters, vol. 49, pp. 92–98, 2014.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Sep. 2014, pp. 580–587.