Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-grained Image Recognition
Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify significant parts to learn fine-grained details, but they often suffer from a limited number of parts and heavy computational cost. In this paper, we propose to learn such fine-grained features from hundreds of part proposals with a Trilinear Attention Sampling Network (TASN) in an efficient teacher-student manner. Specifically, TASN consists of 1) a trilinear attention module, which generates attention maps by modeling the inter-channel relationships, 2) an attention-based sampler, which highlights attended parts with high resolution, and 3) a feature distiller, which distills part features into a global one by weight sharing and feature preserving strategies. Extensive experiments verify that TASN yields the best performance under the same settings as the most competitive approaches on the iNaturalist-2017, CUB-Bird, and Stanford-Cars datasets.
1 Introduction
Fine-grained visual categorization (FGVC) (e.g., classifying bird species [1, 35] and car models [14, 21, 37]) focuses on distinguishing subtle visual differences within a basic-level category. Although convolutional neural network (CNN) techniques [9, 16, 26] for general image recognition [15, 24] have become increasingly practical, FGVC is still a challenging task, as the discriminative details (e.g., the beak and eyes of birds as shown in Figure 1) are too subtle to be well represented by traditional CNNs. Thus the majority of efforts in the fine-grained community focus on learning better representations for such subtle yet discriminative details.
Existing attention/part-based methods [2, 7, 34, 40] try to solve this problem by learning part detectors, cropping and amplifying the attended parts, and concatenating part features for recognition. Although promising performance has been achieved, there are several critical issues in such a pipeline. Specifically, 1) the number of attention maps is limited and pre-defined, which restricts the effectiveness and flexibility of the model. 2) Without part annotations, it is difficult to learn multiple consistent attention maps (i.e., attending to the same part for each sample). Although a well-designed initialization [7, 17, 40] can benefit model training, it is not robust and cannot handle cases with uncommon poses. Moreover, 3) training a separate CNN for each part is inefficient. These bottlenecks seriously obstruct the study of attention-based fine-grained recognition.
To address the above challenges, we propose a trilinear attention sampling network (TASN) which learns fine-grained details from hundreds of part proposals and efficiently distills the learned features into a single convolutional neural network. The proposed TASN consists of a trilinear attention module, an attention-based sampler, and a feature distiller. First, the trilinear attention module takes feature maps as input and generates attention maps by a self-trilinear product, which integrates feature channels with a bilinear relationship matrix. Since each channel of the feature maps is transformed into an attention map, hundreds of part proposals can be extracted. Second, in each iteration, the attention-based sampler generates a detail-preserved image by randomly selecting an attention map, and a structure-preserved image by averaging the attention maps. The former learns fine-grained features for a specific part, and the latter captures the global structure and contains all the important details. Compared to the original image, the structure-preserved one removes the non-discriminative regions, so fine-grained details can be better represented with high resolution. Finally, a part-net and a master-net are formulated as “teacher” and “student,” respectively. Part-net learns fine-grained features over the detail-preserved image and distills the learned features into master-net, which takes the structure-preserved image as input. Such distilling is implemented by weight sharing and feature preserving strategies. Note that instead of concatenating part features, we adopt the knowledge distilling introduced in [11], because the number of parts is large and not pre-defined.
Since the feature distiller transfers the knowledge from part-net into master-net via optimizing the parameters, 1) stochastic details optimization (i.e., randomly optimize one part in each iteration) can be achieved, which makes it practicable to learn from hundreds of part proposals, and 2) efficient inference can be obtained as we can use master-net to perform recognition in the testing stage. To the best of our knowledge, this work makes the first attempt to learn fine-grained features from hundreds of part proposals and represent such part features with a single convolutional neural network. Our contributions are summarized as follows:
We propose a novel trilinear attention sampling network (TASN) to learn subtle feature representations from hundreds of part proposals for fine-grained image recognition.
We propose to optimize TASN in a teacher-student manner, in which fine-grained features can be distilled into a single master-net with high efficiency.
We conduct extensive experiments on three challenging datasets (iNaturalist, CUB Birds and Stanford Cars), and demonstrate that TASN outperforms part-ensemble models even with a single stream.
2 Related Works
In this section, we briefly review previous work from three aspects: attention mechanisms, adaptive image sampling, and knowledge distilling.
Attention Mechanism: As subtle yet discriminative details play an important role in fine-grained image recognition, learning to attend to discriminative parts is the most popular and promising direction. Thus a variety of attention mechanisms have been proposed in recent years [7, 20, 27, 36, 40]. DT-RAM [20] proposed a dynamic computational time model for recurrent visual attention, which can attend to the most discriminative part in dynamic steps. RA-CNN [7] proposed a recurrent attention convolutional neural network to recurrently learn attention maps at multiple (i.e., 3) scales. MA-CNN [40] goes one step further and generates multiple (i.e., 4) consistent attention maps at a single scale by designing a channel grouping module. However, the attention numbers (i.e., 1, 3, 4, respectively) are pre-defined, which limits the effectiveness and flexibility of the model.
Meanwhile, high-order attention methods have been proposed in visual question answering (VQA) and video classification. Specifically, BAN [13] proposed a bilinear attention module to handle the relationship between image regions and the words in a question, and NL [31] calculates the dot product of features to represent the spatial and temporal relationships in video frames. Different from these works, our trilinear attention module conducts bilinear pooling to obtain the relationship among feature channels, which is further utilized to integrate such features to obtain third-order attention maps.
Adaptive Image Sampling: To preserve fine-grained details for recognition, high input resolution (448 × 448 vs. 224 × 224) is widely adopted [5, 34, 40] and it can significantly improve the performance. However, higher resolution means more computational cost, and more importantly, different regions require different resolutions. STN [12] proposed a non-uniform sampling mechanism which performs well on the MNIST dataset [18]. But without explicit guidance, it is hard to learn non-uniform sampling parameters for sophisticated tasks such as fine-grained recognition, thus they finally learned two parts without non-uniform sampling. SSN [23] first proposed to use saliency maps as the guidance of non-uniform sampling and obtained significant improvements. Different from them, our sampling decomposes attention maps into two dimensions before conducting non-uniform sampling to avoid spatial distortion.
Knowledge Distilling: Knowledge distilling was first proposed by Hinton et al. [11] to transfer knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model. The main idea is to use soft targets (i.e., the predicted distribution of the ensemble/large model) to optimize the small model, since they contain more information than the one-hot label. Such a simple yet effective idea has inspired many researchers and has been further studied by [8, 10, 38]. In this paper, we adopt this technique to distill the learned details into a single CNN.
3 Method
In this section, we introduce the proposed Trilinear Attention Sampling Network (TASN), which is able to represent rich fine-grained features by a single convolutional neural network. TASN contains three modules, i.e., a trilinear attention module for details localization, an attention-based sampler for details extraction, and a feature distiller for details optimization.
An overview of the proposed TASN is shown in Figure 2. Given an input image in (a), we first pass it through several convolutional layers to extract feature maps in (b), which are further transformed into attention maps in (c) by the trilinear attention module. To learn the fine-grained feature for a specific part, we randomly select an attention map and conduct attention sampling over the input image using the selected attention map. The sampled image is called the detail-preserved image, since it preserves a specific detail with high resolution. Moreover, to capture the global structure and contain all the important details, we average all the attention maps and again conduct attention sampling; the resulting image is called the structure-preserved image. We further formulate a master-net to learn the features of the structure-preserved image, and a part-net to learn fine-grained representations of detail-preserved images. Finally, the part-net generates soft targets to distill the fine-grained features into the master-net via soft target cross entropy [11].
3.1 Details Localization by Trilinear Attention
In this subsection, we introduce our trilinear attention module, which transfers convolutional feature maps into attention maps, indicating locations of fine-grained details. As shown in previous work [25, 39], each channel of the convolutional features corresponds to a certain type of visual pattern, however, such feature maps can not act as attention maps due to the lack of consistency and robustness [33, 40]. Inspired by , we transfer the feature maps into attention maps by integrating feature channels according to their spatial relationship. Note that such a process can be implemented in a self-trilinear formulation, which is denoted as trilinear attention for convenient reference.
Given an input image $I$, we extract convolutional features by feeding it into multiple convolutional, batch normalization, ReLU, and pooling layers. Specifically, we use resnet-18 [9] as the backbone. To obtain high-resolution feature maps for precise localization, we remove two down-sampling processes from the original resnet-18 by changing the convolutional stride. Moreover, to improve the robustness of the convolutional response, we increase the field of view [3] by appending two sets of dilated convolutional layers with multiple dilation rates. In the training stage, to facilitate optimization, we conduct global average pooling over (b) in Figure 2, which is followed by a softmax classifier.
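As a rough illustration of why removing two down-sampling steps helps localization, the arithmetic below compares feature-map sizes. It is only a sketch: it assumes the standard ResNet layout with five stride-2 stages and a 224-pixel input, neither of which is stated for the attention backbone in the text, and the names `standard_strides` and `feature_map_size` are hypothetical.

```python
from math import prod

# Assumption: a standard ResNet halves the resolution five times
# (first convolution, max-pooling, and three later residual stages).
standard_strides = [2, 2, 2, 2, 2]


def feature_map_size(input_size, strides):
    """Spatial size of the final feature map for a square input."""
    return input_size // prod(strides)


# Unmodified backbone: total stride 32.
full = feature_map_size(224, standard_strides)
# Two down-sampling steps removed (which two does not change the product).
reduced = feature_map_size(224, standard_strides[2:])

print(full, reduced)
```

Under these assumptions the attention maps would have sixteen times as many spatial positions as the unmodified backbone provides.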
Assume the output of the dilated convolutional layers is a tube with a dimension of $c \times h \times w$, where $c$, $h$, and $w$ indicate the number of channels, height, and width, respectively. We reshape this feature into a matrix with a shape of $c \times hw$, which is denoted as $X \in \mathbb{R}^{c \times hw}$. Then our trilinear function can be basically formulated as:
$$\mathcal{M}_b(X) = (X X^{\top}) X, \tag{1}$$
where $X X^{\top}$ is the bilinear feature, which indicates the spatial relationship among channels. Specifically, $X_i$ is the $i$-th channel of the feature maps, which contains spatial information. So $(X X^{\top})_{i,j}$ indicates the spatial relationship between channel $i$ and channel $j$. To make the feature maps more consistent and robust, we further integrate the spatial relationship into the feature maps by taking the dot product of $X X^{\top}$ and $X$, thus trilinear attention maps can be obtained (as shown in Figure 3).
We further studied different normalization methods to improve the effectiveness of trilinear attention, and a detailed discussion can be found in Section 4.2. In the end, we adopt the following normalized trilinear attention:
$$\mathcal{M}(X) = \mathcal{N}\big(\mathcal{N}(X) X^{\top}\big) X, \tag{2}$$
where $\mathcal{N}(\cdot)$ indicates normalization over the second dimension of a matrix. Note that these two normalizations have different meanings: the first one is spatial normalization, which keeps each channel of the feature maps within the same scale, and the second one is relationship normalization, which is conducted over each relationship vector $(\mathcal{N}(X) X^{\top})_i$. We denote the output of the trilinear function in Equation 2 as $M \in \mathbb{R}^{c \times hw}$, i.e., $M = \mathcal{M}(X)$. We reshape $M$ into the shape of $c \times h \times w$, thus each channel of $M$ indicates an attention map $M_i \in \mathbb{R}^{h \times w}$.
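The normalized trilinear attention can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: it assumes that both normalizations are a softmax over the second matrix dimension (the text only says "normalization"), and the function names are hypothetical.

```python
import numpy as np


def softmax(a, axis):
    """Numerically stable softmax along `axis`."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def trilinear_attention(features):
    """Map (c, h, w) feature maps to (c, h, w) attention maps.

    Assumption: both normalizations are softmax over the second
    dimension of a (c, *) matrix.
    """
    c, h, w = features.shape
    X = features.reshape(c, h * w)              # c x hw
    spatial = softmax(X, axis=1)                # spatial normalization per channel
    relation = softmax(spatial @ X.T, axis=1)   # c x c, one relationship vector per row
    M = relation @ X                            # integrate channels: c x hw
    return M.reshape(c, h, w)


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))
attn = trilinear_attention(feats)
print(attn.shape)
```

With this choice of normalization, every attention map is a convex combination of the input channels, weighted by how strongly each channel is spatially related to the map's own channel.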
3.2 Details Extraction by Attention Sampling
In this subsection, we introduce our attention-based sampler, which takes as input an image together with trilinear attention maps, and generates a structure-preserved image and a detail-preserved image. The structure-preserved image captures the global structure and contains all the important details. Compared to the original image, the structure-preserved one removes the regions without fine-grained details, thus the discriminative parts can be better represented with high resolution. The detail-preserved image focuses on a single part, from which more fine-grained details can be extracted.
Given an image $I$, we obtain the structure-preserved image $I_s$ and the detail-preserved image $I_d$ by conducting non-uniform sampling over different attention maps:
$$I_s = S(I, A(M)), \qquad I_d = S(I, R(M)), \tag{3}$$
where $M$ is the attention maps, $S(\cdot)$ indicates the non-uniform sampling function, $A(\cdot)$ indicates average pooling over channels, and $R(\cdot)$ indicates randomly selecting a channel from the input. We calculate the average of all attention maps to guide structure-preserved sampling because such an attention map takes all the discriminative parts into consideration. We randomly select one attention map for detail-preserved sampling, so that it preserves the fine-grained details of the attended area with high resolution. As training proceeds, each of the attention maps has the opportunity to be selected, thus different fine-grained details can be asynchronously refined.
Our basic idea for attention-based sampling is to treat the attention map as a probability mass function, i.e., an area with a large attention value is more likely to be sampled. Inspired by the inverse-transform technique [6], we implement the sampling by calculating the inverse function of the distribution function. Moreover, we decompose attention maps into two dimensions to avoid spatial distortion.
Taking structure-preserved sampling for example, we first calculate the integral of the structure-preserved attention map $A(M)$ over the $x$ and $y$ axes:
$$F_x(n) = \sum_{x=1}^{n} \max_{1 \le y \le h} A(M)(x, y), \qquad F_y(n) = \sum_{y=1}^{n} \max_{1 \le x \le w} A(M)(x, y), \tag{4}$$
where $w$ and $h$ are the width and height of the attention map, respectively. Note that we use the $\max(\cdot)$ function to decompose the attention map into two dimensions, because it is more robust than the alternative $\mathrm{sum}(\cdot)$. We can further obtain the sampling function by:
$$S(I, A(M))(i, j) = I\big(F_x^{-1}(i), F_y^{-1}(j)\big). \tag{5}$$
In short, the attention map is used to calculate the mapping function between the coordinates of the original image and those of the sampled image.
Such a sampling mechanism is illustrated in Figure 4. Given an attention map in (a), we first decompose the map into two dimensions by calculating the max values over the x axis (b1) and the y axis (b2). Then the integrals of (b1) and (b2) are obtained and shown in (c1) and (c2), respectively. We further calculate the inverse functions of (c1) and (c2) in a digital manner, i.e., we uniformly sample points over the y axis, and follow the red arrow (shown in (c1) and (c2)) and the blue arrow to obtain the values over the x axis. (d) shows the sampled pixels as blue dots, and we can observe that the regions with large attention values are allocated more sampling points. Finally, (e) shows the resulting sampled image. Note that the example in Figure 4 is a structure-preserved sampling case.
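The sampling procedure described above can be sketched with NumPy as follows. This is an illustrative approximation under stated assumptions rather than the authors' sampler: it works on a single-channel image, assumes non-negative attention values, inverts the integral by linear interpolation, and picks the nearest source pixel instead of interpolating pixel values. The function names are hypothetical.

```python
import numpy as np


def inverse_cdf_coords(profile, n_out):
    """Return n_out source coordinates, denser where `profile` is large."""
    # A small offset keeps the integral strictly increasing.
    profile = np.asarray(profile, dtype=np.float64) + 1e-8
    cdf = np.cumsum(profile)
    cdf /= cdf[-1]
    # Uniformly spaced points on the value axis of the integral.
    targets = (np.arange(n_out) + 0.5) / n_out
    # Invert the monotone integral by linear interpolation.
    return np.interp(targets, cdf, np.arange(len(profile), dtype=np.float64))


def attention_sample(image, attention, out_h, out_w):
    """Non-uniformly resample an (H, W) image guided by an (h, w) attention map."""
    H, W = image.shape
    h, w = attention.shape
    profile_x = attention.max(axis=0)   # max over rows: one value per column
    profile_y = attention.max(axis=1)   # max over columns: one value per row
    xs = inverse_cdf_coords(profile_x, out_w) * (W - 1) / max(w - 1, 1)
    ys = inverse_cdf_coords(profile_y, out_h) * (H - 1) / max(h - 1, 1)
    xi = np.clip(np.rint(xs).astype(int), 0, W - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, H - 1)
    return image[np.ix_(yi, xi)], xi, yi


# Toy attention map with one strongly attended block.
attention = np.full((8, 8), 0.1)
attention[2:4, 5:7] = 1.0
image = np.arange(16 * 16, dtype=np.float64).reshape(16, 16)
sampled, xi, yi = attention_sample(image, attention, 16, 16)
print(xi)
print(yi)
```

Because the x and y coordinates are computed independently, each output row and column maps to a single source row and column, so the sampling grid stays axis-aligned. Passing the channel-wise mean of the attention maps would give the structure-preserved variant, and passing one randomly chosen channel would give the detail-preserved variant.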
3.3 Details Optimization by Knowledge Distilling
In this subsection, we introduce our details distiller, which takes as input a detail-preserved image and a structure-preserved image, and transfers the learned details from part-net to master-net in a teacher-student manner.
Specifically, for each iteration, the attention-based sampler introduced in Section 3.2 provides a structure-preserved image (denoted as $I_s$) and a detail-preserved one (denoted as $I_d$). We first obtain the fully connected (fc) outputs by feeding these two images into the same backbone CNN (e.g., Resnet-50 [9]). The fc outputs are denoted as $z_s$ and $z_d$, respectively. Then the softmax classifier converts $z_s$ and $z_d$ into probability vectors $q_s$ and $q_d$, which indicate the predicted probability over each class. Taking $z_s$ for example:
$$q_s^{(i)} = \frac{\exp\big(z_s^{(i)} / T\big)}{\sum_{j} \exp\big(z_s^{(j)} / T\big)}, \tag{6}$$
where $T$ is a parameter called the temperature, which is normally set to 1 for classification tasks. In knowledge distilling, however, a large value of $T$ is important, as it produces a soft probability distribution over classes. We obtain the soft target cross entropy [11] for the master-net as:
$$L_{soft}(q_s, q_d) = -\sum_{i=1}^{N} q_d^{(i)} \log q_s^{(i)}, \tag{7}$$
where $N$ denotes the class number. Finally, the objective function of the master-net can be derived as:
$$L(I_s) = L_{cls}(q_s, y) + \lambda L_{soft}(q_s, q_d), \tag{8}$$
where $L_{cls}$ represents the classification loss function, $y$ is a one-hot vector which indicates the class label, and $\lambda$ denotes the loss weight of the two terms. The soft target cross entropy aims to distill the learned features of fine-grained details and transfer such information to the master-net. As the attention-based sampler randomly selects one part in each iteration, all the fine-grained details can be distilled into the master-net during training. Note that the convolutional parameters are shared between part-net and master-net, which is important for distilling, while the sharing of fully connected layers is optional.
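The master-net objective can be illustrated numerically. The sketch below makes two assumptions that the text does not settle: the classification loss is ordinary cross entropy computed at temperature 1, and the part-net distribution is treated as a fixed target. The logits are made-up values and the function names are hypothetical.

```python
import numpy as np


def softmax_t(logits, T=1.0):
    """Softmax with temperature T; larger T gives a softer distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def master_loss(z_s, z_d, label, T=10.0, lam=2.0):
    """Return (total, classification, soft-target) losses for the master-net.

    Assumptions: cross entropy at T = 1 for the hard label, and the
    part-net output q_d used as a constant soft target.
    """
    q_s = softmax_t(z_s, T)
    q_d = softmax_t(z_d, T)
    l_soft = -np.sum(q_d * np.log(q_s))
    l_cls = -np.log(softmax_t(z_s, 1.0)[label])
    return l_cls + lam * l_soft, l_cls, l_soft


z_s = [2.0, 0.5, -1.0]   # master-net logits (illustrative)
z_d = [3.0, 0.0, -2.0]   # part-net logits (illustrative)
total, l_cls, l_soft = master_loss(z_s, z_d, label=0)
print(total, l_cls, l_soft)
```

The defaults `T=10.0` and `lam=2.0` mirror the temperature and loss weight reported in the implementation details.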
4 Experiments
4.1 Experiment setup
|Dataset||# Class||# Train||# Test|
Datasets: To evaluate the effectiveness of our proposed TASN, we conducted experiments on three extensive and competitive datasets, namely Caltech-UCSD Birds (CUB-200-2011) [35], Stanford Cars [14], and iNaturalist-2017 [28]. The detailed statistics with category numbers and the standard training/testing splits can be found in Table 1. iNaturalist-2017 is the largest dataset for the fine-grained task. Unlike other datasets for this task, it contains 13 superclasses. Such a data distribution can provide a more convincing evaluation of the generalization ability of a model.
Baselines: We compared our method to the following baselines due to their state-of-the-art performance and high relevance. Note that we did not include methods using 1) additional data (from the web or other datasets), 2) human-annotated part locations and 3) hierarchical labels (i.e., species, genus, and family) for a fair comparison. And all of the compared methods in each table share the same backbone unless specified otherwise.
FCAN [22]: Fully convolutional attention network, which adaptively selects multiple attentions by reinforcement learning.
MDTP [32]: Mining discriminative triplets of patches, which utilizes geometric constraints to improve the accuracy of patch localization.
DT-RAM [20]: Dynamic computational time model for recurrent visual attention, which attends to the most discriminative parts by dynamic steps.
SSN [23]: Saliency-based sampling networks, which conduct non-uniform sampling based on a saliency map in an end-to-end way.
MG-CNN [30]: Multiple granularity descriptors, which leverage the hierarchical labels to generate comprehensive descriptors.
STN [12]: Spatial transformer network, which conducts parameterized spatial transformation to obtain zoomed-in or pose-normalized objects.
RA-CNN [7]: Recurrent attention CNN, which recurrently attends to discriminative parts at multiple scales.
MA-CNN [40]: Multiple attention CNN, which attends to multiple parts by its proposed channel grouping module in a weakly-supervised way.
MAMC [27]: Multi-attention multi-class constraint network, which learns multiple attentions by conducting a multi-class constraint over attended features.
iSQRT-COV [19]: Towards faster training of global covariance pooling networks by iterative matrix square root normalization.
|spatial + relation||85.3|
Implementation: We used the open-source MXNet [4] as our code base, and trained all the models on 8 Tesla P-100 GPUs. For a fair comparison, we conducted experiments with our method TASN on VGG-19 [26] as well as Resnet-50 [9], both of which are pre-trained on Imagenet [24]. We used the standard data augmentation methods provided by MXNet, and all reported results are single-crop testing results for a single model unless stated otherwise. We used the SGD optimizer without momentum and weight decay, and the batch size was set to 96. The initial learning rate was set to 0.05, with a decay factor of 0.1 after every 30 epochs. The temperature in Equation 6 is 10, and the loss weight in Equation 8 is 2. More implementation details can be found in our code, which will be released soon.
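The learning-rate schedule described above is a plain step decay, which can be written as a one-line function. The sketch assumes zero-indexed epochs, and the name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.05, decay=0.1, step=30):
    """Learning rate at a zero-indexed epoch under step decay."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 29, 30, 60):
    print(epoch, step_lr(epoch))
```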
4.2 Evaluation and analysis on CUB-200-2011
Trilinear attention. Table 2 shows the impact of different normalization functions in terms of recognition accuracy. We obtain the results by 1) randomly selecting a channel of attention maps in each iteration for sampling during training stage, and 2) conducting average pooling over attention maps for testing. All the models use Resnet-50 as the backbone with an input resolution of 224. It can be observed that trilinear attention maps can significantly outperform the original feature maps. Both the attention functions of and can improve the gain of trilinear attention. While and bring a drop of performance for the reason that they will cause loss of spatial information. As a result, we adopt the last setting (of Table 2) in our TASN. Note that in the term , indicates the region that a channel is focusing on and denotes the feature of that region.
We further compared our trilinear attention module with “self-attention” . Specifically, we followed  to obtain attention maps by , and the proposed trilinear attention can outperform self-attention module with 0.7% points increases.
|sampler in SSN [23]||84.8||85.3|
Attention-based sampler. To demonstrate the effectiveness of our attention-based sampling mechanism, we compared it with 1) uniform sampling (by binarizing the attention maps) and 2) the sampling operation introduced in SSN [23]. We set the input attention maps to be the same when comparing sampling mechanisms, and experiments were conducted on two cases, i.e., with and without part-net. All the models use Resnet-50 as the backbone and the input resolution is set to 224. The results in Table 3 show that our sampling mechanism remarkably outperforms the baselines. The SSN sampler obtains a better result than the uniform sampler without part-net, while the further improvement is limited when part-net is added. These observations show that the spatial distortion caused by the SSN sampler is not helpful for preserving subtle details.
Knowledge distilling. Table 4 reveals the impact of the details distilling module with different input resolutions. We can observe consistent improvements from details distilling. The performance of Resnet-50 [9] saturates at 85.6%, and 448 input cannot further improve the accuracy. Without the distiller (i.e., master-net only), the performance drops slightly with 392 input (compared to 336 input), since it is difficult to optimize each detail with large feature resolutions (a similar drop can also be observed on Resnet-50 with 672 inputs).
Moreover, to study the attention selection strategy (i.e., ranked selection vs. random selection), we ranked attention maps by their response and sampled the high-response ones with higher probability, but the recognition performance dropped from 87.0% to 86.8%. The reason is that ranking makes some parts rarely picked, while such parts can also benefit details learning. We also conducted experiments on distilling two parts in each iteration, and the result is the same as distilling one part each time.
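The difference between the two selection strategies can be made concrete with a toy example. The text does not specify how the response is measured or how it is turned into a probability, so the sketch below assumes the mean activation of each map and a probability proportional to it; both are assumptions, and `selection_probs` is a hypothetical name.

```python
import numpy as np


def selection_probs(attention_maps, ranked):
    """Probability of picking each map from a (c, h, w) stack."""
    c = attention_maps.shape[0]
    if not ranked:
        return np.full(c, 1.0 / c)            # random selection: uniform
    # Assumed response measure: mean activation of each map.
    response = attention_maps.reshape(c, -1).mean(axis=1)
    return response / response.sum()


# Three toy maps whose responses are 1, 1, and 8.
maps = np.stack([np.full((2, 2), v) for v in (1.0, 1.0, 8.0)])
p_random = selection_probs(maps, ranked=False)
p_ranked = selection_probs(maps, ranked=True)

# Chance that the first map is never picked in 20 iterations.
never_random = (1.0 - p_random[0]) ** 20
never_ranked = (1.0 - p_ranked[0]) ** 20
print(p_random, p_ranked, never_random, never_ranked)
```

Under these assumptions a low-response map is far more likely to go unvisited with ranked selection, which matches the explanation given for the accuracy drop.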
Compared to sampling-based methods. We compare our TASN with three sampling-based methods: 1) uniform sampling with high resolution (i.e., zoom in), 2) uniform sampling with attention (i.e., crop), and 3) non-uniform sampling as proposed in SSN [23]. As shown in Table 5, higher resolution can significantly improve fine-grained recognition performance by (relative) 4.9%. However, 448 input increases the computational cost (i.e., flops) by four times compared to 224 input. SSN [23] obtains a better result than DT-RAM [20], and our TASN can further obtain a 2.9% relative improvement. Our improvements mainly come from two aspects: 1) a better sampling mechanism considering spatial distortion (1.2%), and 2) a better fine-grained details optimizing strategy (1.7%).
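The fourfold cost follows from the fact that the cost of a convolution scales with the number of output positions. A small check, with illustrative channel counts and kernel size that are not taken from the paper:

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of one stride-1, same-padded k x k convolution."""
    return h * w * c_in * c_out * k * k


# Same layer evaluated on a 448 x 448 and a 224 x 224 feature map.
ratio = conv_macs(448, 448, 64, 64, 3) / conv_macs(224, 224, 64, 64, 3)
print(ratio)
```

The ratio depends only on the spatial sizes, so it holds for every convolutional layer regardless of its width.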
Compared to attention-based part methods. In Table 6, we compare our TASN to attention-based part methods. For a fair comparison, 1) high-resolution input is adopted by all methods and 2) the same number of backbones is used. It can be observed that for VGG-based methods, our TASN outperforms all the baselines even with only one backbone. Moreover, after ensembling three backbones (trained with different parameter settings), TASN improves the performance by 1.9% over the best 3-part model, MA-CNN [40]. Our 3-stream result also outperforms the 6-stream MA-CNN by a margin of 0.7%. We do not ensemble more streams, as model ensembling is beyond the scope of this work. For Resnet-50 based methods, compared with the state-of-the-art single-stream MAMC [27], our TASN also achieves a remarkable improvement of 1.6%.
Combining with second-order feature learning methods. In Table 7, we show that our TASN learns a strong first-order representation, which can further improve the performance of second-order feature methods. Specifically, compared to the best second-order method, iSQRT-COV [19], our TASN 2k first-order feature outperforms their 8k feature by 0.7%, which shows the effectiveness of our TASN. Moreover, we transfer their released code to our framework and obtain an accuracy of 89.1%, which shows the compatibility of these two methods. Note that for a fair comparison, we follow their settings and predict the label of a test image by averaging the prediction scores of the image and its horizontal flip.
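The flip-averaging protocol mentioned at the end of this paragraph can be sketched as follows. The scoring function here is a hypothetical stand-in for a trained network, used only to make the example runnable.

```python
import numpy as np


def predict_with_flip(predict_fn, image):
    """Average class scores of an (H, W, C) image and its horizontal flip."""
    flipped = image[:, ::-1, :]
    return 0.5 * (predict_fn(image) + predict_fn(flipped))


def toy_scores(image):
    # Hypothetical stand-in for a network: two "class scores" taken from
    # the mean intensity of the left and right halves of the image.
    half = image.shape[1] // 2
    return np.array([image[:, :half].mean(), image[:, half:].mean()])


rng = np.random.default_rng(0)
img = rng.random((4, 6, 3))
scores = predict_with_flip(toy_scores, img)
print(scores)
```

By construction the averaged scores are identical for an image and its mirror, so the prediction no longer depends on left-right orientation.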
4.3 Evaluation and analysis on Stanford-Car
Ablation study. Table 8 shows the results of the VGG-19 baseline, our TASN without the distiller, our TASN with a single model, and the ensemble. We can observe a 1.9% relative improvement from structure-preserved sampling and a further improvement of 2.3% from the full model. Note that the improvement from structure-preserved sampling is not as significant as on the CUB-200-2011 dataset, because foregrounds are larger in most of the images in Stanford-Car. This result also shows that our full TASN model works well at boosting performance on foreground images.
Comparison with state-of-the-art. Similar to CUB-200-2011, we compare our TASN with attention-based part methods under the same setting. As shown in Table 9, TASN with a single VGG-19 achieves results comparable with 3-stream part methods, and our ensembled 3-stream TASN outperforms the best 3-stream part learning method, MA-CNN [40]. Compared to their 5-stream result (28.0%), our result is still better. For Resnet-50 based methods, we compare our TASN to the state-of-the-art method MAMC [27], and achieve a 1.1% improvement.
4.4 Evaluation and analysis on iNaturalist 2017
|Super Class||# Class||Resnet [9]||SSN [23]||TASN|
We also evaluate our TASN on the largest fine-grained dataset, i.e., iNaturalist 2017. We compare to the Resnet-101 baseline and the best sampling method, SSN [23]. All the models use Resnet-101 as the backbone with an input resolution of 224. As there are 13 superclasses in this dataset, we re-implement SSN [23] with their released code to obtain the performance on each superclass. The results are shown in Table 10, and we can see that our proposed TASN outperforms the baselines on every superclass. It is notable that compared to Resnet-101, TASN significantly improves the performance, especially on Reptilia (improved by 24.0%) and Aves (improved by 21.8%), which indicates that such superclasses contain more fine-grained details.
5 Conclusion
In this paper, we proposed a trilinear attention sampling network for fine-grained image recognition, which can learn rich feature representations from hundreds of part proposals. Instead of ensembling multiple part CNNs, we adopted a knowledge distilling method to integrate fine-grained features into a single stream, which is not only efficient but also effective. Extensive experiments on CUB-Bird, iNaturalist 2017, and Stanford-Car demonstrate that TASN is able to outperform part-ensemble models even with a single stream. In the future, we will further study the proposed TASN in the following directions: 1) attention selection strategy, i.e., learning to select which details should be learned and distilled instead of selecting them randomly, 2) conducting attention-based sampling over convolutional features instead of only over images, and 3) extending our work to other vision tasks, e.g., object detection and segmentation.
References
- [1] T. Berg, J. Liu, S. Woo Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2011–2018, 2014.
- [2] S. Branson, G. V. Horn, S. J. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. In BMVC, 2014.
- [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4):834–848, 2018.
- [4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
- [5] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, pages 4109–4118, 2018.
- [6] L. Devroye. Sample-based non-uniform random variate generation. In WSC, pages 260–265. ACM, 1986.
- [7] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.
- [8] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, pages 3697–3701, 2017.
- [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [10] B. Heo, M. Lee, S. Yun, and J. Y. Choi. Knowledge distillation with adversarial samples supporting decision boundary. CoRR, abs/1805.05532, 2018.
- [11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
- [13] J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear attention networks. In NIPS, pages 1571–1581, 2018.
- [14] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop, 2013.
- [15] A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
- [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
- [17] M. Lam, B. Mahasseni, and S. Todorovic. Fine-grained recognition as HSnet search for informative image parts. In CVPR, pages 6497–6506, 2017.
- [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [19] P. Li, J. Xie, Q. Wang, and Z. Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, pages 947–955, 2018.
- [20] Z. Li, Y. Yang, X. Liu, F. Zhou, S. Wen, and W. Xu. Dynamic computational time for visual attention. In ICCV, pages 1199–1209, 2017.
- [21] X. Liu, W. Liu, H. Ma, and H. Fu. Large-scale vehicle re-identification in urban surveillance videos. In ICME, pages 1–6. IEEE, 2016.
- [22] X. Liu, T. Xia, J. Wang, Y. Yang, F. Zhou, and Y. Lin. Fully convolutional attention networks for fine-grained recognition. arXiv preprint arXiv:1603.06765, 2016.
- [23] A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Torralba. Learning to zoom: a saliency-based sampling layer for neural networks. In ECCV, pages 51–66, 2018.
- [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
- [25] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, pages 1143–1151, 2015.
- [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [27] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-attention multi-class constraint for fine-grained image recognition. In ECCV, pages 805–821, 2018.
- [28] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The inaturalist species classification and detection dataset. 2018.
- [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
- [30] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang. Multiple granularity descriptors for fine-grained categorization. In ICCV, pages 2399–2406, 2015.
- [31] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
- [32] Y. Wang, J. Choi, V. Morariu, and L. S. Davis. Mining discriminative triplets of patches for fine-grained classification. In CVPR, pages 1163–1172, 2016.
- [33] X.-S. Wei, J.-H. Luo, J. Wu, and Z.-H. Zhou. Selective convolutional descriptor aggregation for fine-grained image retrieval. TIP, 26(6):2868–2881, 2017.
- [34] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
- [35] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
- [36] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, pages 842–850, 2015.
- [37] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In CVPR, pages 3973–3981, 2015.
- [38] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, pages 4133–4141, 2017.
- [39] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian. Picking deep filter responses for fine-grained image recognition. In CVPR, pages 1134–1142, 2016.
- [40] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, pages 5209–5217, 2017.