Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-grained Image Recognition

Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, Jiebo Luo
University of Science and Technology of China, Hefei, China
Microsoft Research, Beijing, China
University of Rochester, Rochester, NY
zhenghl@mail.ustc.edu.cn, zhazj@ustc.edu.cn, jianf@microsoft.com, jluo@cs.rochester.edu
This work was performed when Heliang Zheng was visiting Microsoft Research as a research intern.

Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify significant parts to learn fine-grained details, but they often suffer from a limited number of parts and heavy computational cost. In this paper, we propose to learn such fine-grained features from hundreds of part proposals by a Trilinear Attention Sampling Network (TASN) in an efficient teacher-student manner. Specifically, TASN consists of 1) a trilinear attention module, which generates attention maps by modeling the inter-channel relationships, 2) an attention-based sampler, which highlights attended parts with high resolution, and 3) a feature distiller, which distills part features into a global one by weight sharing and feature preserving strategies. Extensive experiments verify that TASN yields the best performance under the same settings as the most competitive approaches on the iNaturalist-2017, CUB-Bird, and Stanford-Cars datasets.


1 Introduction

Fine-grained visual categorization (FGVC) (e.g., classifying bird species [1, 35] and car models [14, 21, 37]) focuses on distinguishing subtle visual differences within a basic-level category. Although convolutional neural network (CNN) techniques [9, 16, 26] for general image recognition [15, 24] have become increasingly practical, FGVC is still a challenging task because the discriminative details (e.g., the beak and eyes of birds as shown in Figure 1) are too subtle to be well represented by traditional CNNs. Thus the majority of efforts in the fine-grained community focus on learning better representations for such subtle yet discriminative details.

Figure 1: An illustration of learning discriminative details by TASN for a “blue jay.” As shown in (b), TASN learns such subtle details by up-sampling each detail into high resolution, and the white concentric circles in (c) indicate fine-grained details.

Existing attention/part-based methods [2, 7, 34, 40] try to solve this problem by learning part detectors, cropping and amplifying the attended parts, and concatenating part features for recognition. Although promising performance has been achieved, there are several critical issues in such a pipeline. Specifically, 1) the number of attention regions is limited and pre-defined, which restricts the effectiveness and flexibility of the model. 2) Without part annotations, it is difficult to learn multiple consistent attention maps (i.e., maps that attend to the same part for each sample). Although a well-designed initialization [7, 17, 40] can benefit model training, it is not robust and cannot handle cases with uncommon poses. Moreover, 3) training a separate CNN for each part is inefficient. These bottlenecks seriously obstruct the study of attention-based fine-grained recognition.

To address the above challenges, we propose a trilinear attention sampling network (TASN) which learns fine-grained details from hundreds of part proposals and efficiently distills the learned features into a single convolutional neural network. The proposed TASN consists of a trilinear attention module, an attention-based sampler, and a feature distiller. First, the trilinear attention module takes feature maps as input and generates attention maps by a self-trilinear product, which integrates feature channels with a bilinear relationship matrix. Since each channel of the feature maps is transformed into an attention map, hundreds of part proposals can be extracted. Second, for each iteration, the attention-based sampler generates a detail-preserved image by randomly selecting an attention map, and a structure-preserved image by averaging the attention maps. The former learns fine-grained features for a specific part, and the latter captures the global structure and contains all the important details. Compared to the original image, the structure-preserved one removes the non-discriminative regions, and thus fine-grained details can be better represented with high resolution. Finally, a part-net and a master-net are formulated as “teacher” and “student,” respectively. The part-net learns fine-grained features over the detail-preserved image and distills the learned features into the master-net, which takes the structure-preserved image as input. Such distilling is implemented by weight sharing and feature preserving strategies. Note that instead of concatenating part features, we adopt knowledge distilling introduced in [11] because the number of parts is large and not pre-defined.

Since the feature distiller transfers the knowledge from part-net into master-net via optimizing the parameters, 1) stochastic details optimization (i.e., randomly optimize one part in each iteration) can be achieved, which makes it practicable to learn from hundreds of part proposals, and 2) efficient inference can be obtained as we can use master-net to perform recognition in the testing stage. To the best of our knowledge, this work makes the first attempt to learn fine-grained features from hundreds of part proposals and represent such part features with a single convolutional neural network. Our contributions are summarized as follows:

  • We propose a novel trilinear attention sampling network (TASN) to learn subtle feature representations from hundreds of part proposals for fine-grained image recognition.

  • We propose to optimize TASN in a teacher-student manner, in which fine-grained features can be distilled into a single master-net with high efficiency.

  • We conduct extensive experiments on three challenging datasets (iNaturalist, CUB Birds and Stanford Cars), and demonstrate that TASN outperforms part-ensemble models even with a single stream.

The remainder of the paper is organized as follows. We describe related work in Section 2, and introduce our proposed TASN model in Section 3. An evaluation on three widely-used datasets is presented in Section 4, followed by conclusions in Section 5.

Figure 2: Overview of the proposed Trilinear Attention Sampling Network (TASN). An input image in (a) is passed through convolutional layers to extract convolutional feature maps in (b), and the self-trilinear product is conducted to obtain attention maps in (c). The attention-based sampler takes as input an attention map and the original image to obtain sampled images in (d). Specifically, random selection and average pooling over the channels of the attention maps are conducted to obtain the inputs of the part-net and the master-net, respectively. The part-net generates a soft target to distill the fine-grained features into the master-net via soft target cross entropy [11]. [Best viewed in color]

2 Related Works

In this Section, we briefly review previous work from three aspects, including attention mechanism, adaptive image sampling, and knowledge distilling.

Attention Mechanism: As subtle yet discriminative details play an important role in fine-grained image recognition, learning to attend to discriminative parts is the most popular and promising direction. Thus various attention mechanisms have been proposed in recent years [7, 20, 27, 36, 40]. DT-RAM [20] proposed a dynamic computational time model for recurrent visual attention, which can attend to the most discriminative part in dynamic steps. RA-CNN [7] proposed a recurrent attention convolutional neural network to recurrently learn attention maps at multiple (i.e., 3) scales. MA-CNN [40] takes one step further and generates multiple (i.e., 4) consistent attention maps at a single scale by designing a channel grouping module. However, the numbers of attention regions (i.e., 1, 3, 4, respectively) are pre-defined, which counts against the effectiveness and flexibility of the model.

Meanwhile, high-order attention methods have been proposed in visual question answering (VQA) and video classification. Specifically, BAN [13] proposed a bilinear attention module to handle the relationship between image regions and the words in a question, and NL [31] calculates the dot product of features to represent the spatial and temporal relationships in video frames. Different from these works, our trilinear attention module conducts bilinear pooling to obtain the relationship among feature channels, which is further utilized to integrate such features to obtain third-order attention maps.

Adaptive Image Sampling: To preserve fine-grained details for recognition, high input resolution (448×448 vs. 224×224) is widely adopted [5, 34, 40] and can significantly improve performance [5]. However, higher resolution means more computational cost, and more importantly, different regions require different resolutions. STN [12] proposed a non-uniformed sampling mechanism which performs well on the MNIST dataset [18]. But without explicit guidance, it is hard to learn non-uniformed sampling parameters for sophisticated tasks such as fine-grained recognition, and thus they finally learned two parts without non-uniformed sampling. SSN [23] first proposed to use saliency maps as the guidance for non-uniformed sampling and obtained significant improvements. Different from them, our sampling decomposes attention maps into two dimensions before conducting non-uniformed sampling to avoid spatial distortion.

Knowledge Distilling: Knowledge distilling was first proposed by Hinton et al. [11] to transfer knowledge from an ensemble or from a large, highly regularized model into a smaller, distilled model. The main idea is to use soft targets (i.e., the predicted distribution of the ensemble/large model) to optimize the small model, since a soft target contains more information than the one-hot label. This simple yet effective idea has inspired many researchers and has been further studied by [8, 10, 38]. In this paper, we adopt this technique to distill the learned details into a single CNN.

3 Method

In this section, we introduce the proposed Trilinear Attention Sampling Network (TASN), which is able to represent rich fine-grained features by a single convolutional neural network. TASN contains three modules, i.e., a trilinear attention module for details localization, an attention-based sampler for details extraction, and a feature distiller for details optimization.

An overview of the proposed TASN is shown in Figure 2. Given an input image in (a), we first pass it through several convolutional layers to extract feature maps in (b), which are further transformed into attention maps in (c) by the trilinear attention module. To learn the fine-grained feature for a specific part, we randomly select an attention map and conduct attention sampling over the input image using the selected attention map. The sampled image is named the detail-preserved image since it preserves a specific detail with high resolution. Moreover, to capture the global structure and contain all the important details, we average all the attention maps and again conduct attention sampling; this sampled image is called the structure-preserved image. We further formulate a master-net to learn the features for the structure-preserved image, and a part-net to learn fine-grained representations for detail-preserved images. Finally, the part-net generates soft targets to distill the fine-grained features into the master-net via soft target cross entropy [11].

3.1 Details Localization by Trilinear Attention

In this subsection, we introduce our trilinear attention module, which transfers convolutional feature maps into attention maps indicating the locations of fine-grained details. As shown in previous work [25, 39], each channel of the convolutional features corresponds to a certain type of visual pattern. However, such feature maps cannot act as attention maps due to their lack of consistency and robustness [33, 40]. Inspired by [40], we transfer the feature maps into attention maps by integrating feature channels according to their spatial relationship. Note that such a process can be implemented in a self-trilinear formulation, which we denote as trilinear attention for convenient reference.

Given an input image $\mathbf{I}$, we extract convolutional features by feeding it into multiple convolutional, batch normalization, ReLU, and pooling layers. Specifically, we use resnet-18 [9] as the backbone. To obtain high-resolution feature maps for precise localization, we remove two down-sampling processes from the original resnet-18 by changing the convolutional stride. Moreover, to improve the robustness of the convolutional response, we increase the field of view [3] by appending two sets of dilated convolutional layers with multiple dilation rates. In the training stage, to facilitate optimization, we conduct global average pooling over (b) in Figure 2, which is followed by a softmax classifier.

Assume the output of the dilated convolutional layers is a tube with a dimension of $c \times h \times w$, where $c$, $h$, and $w$ indicate the number of channels, height, and width, respectively. We reshape this feature into a matrix with a shape of $c \times hw$, which is denoted as $\mathbf{X} \in \mathbb{R}^{c \times hw}$. Then our trilinear function can be basically formulated as:

$$\mathcal{M}_b(\mathbf{X}) = (\mathbf{X}\mathbf{X}^{\mathrm{T}})\mathbf{X}, \tag{1}$$

where $\mathbf{X}\mathbf{X}^{\mathrm{T}}$ is the bilinear feature, which indicates the spatial relationship among channels. Specifically, $\mathbf{X}_i$ is the $i$-th channel of the feature maps, which contains spatial information, so $(\mathbf{X}\mathbf{X}^{\mathrm{T}})_{i,j}$ indicates the spatial relationship between channel $i$ and channel $j$. To make the feature maps more consistent and robust, we further integrate the spatial relationship into the feature maps by taking the dot product of $\mathbf{X}\mathbf{X}^{\mathrm{T}}$ and $\mathbf{X}$, and thus trilinear attention maps can be obtained (as shown in Figure 3).

We further studied different normalization methods to improve the effectiveness of trilinear attention, and a detailed discussion can be found in Section 4.2. In the end, we adopt the following normalized trilinear attention:

$$\mathcal{M}(\mathbf{X}) = \mathcal{N}(\mathcal{N}(\mathbf{X})\mathbf{X}^{\mathrm{T}})\mathbf{X}, \tag{2}$$

where $\mathcal{N}(\cdot)$ indicates normalization over the second dimension of a matrix. Note that these two normalizations have different meanings: the first one is spatial normalization, which keeps each channel of the feature maps within the same scale, and the second one is relationship normalization, which is conducted over each relationship vector $(\mathcal{N}(\mathbf{X})\mathbf{X}^{\mathrm{T}})_i$. We denote the output of the trilinear function in Equation 2 as $\mathbf{M} \in \mathbb{R}^{c \times hw}$, i.e., $\mathbf{M} = \mathcal{M}(\mathbf{X})$. We reshape $\mathbf{M}$ into the shape of $c \times h \times w$, and thus each channel of $\mathbf{M}$ indicates an attention map $\mathbf{M}_i \in \mathbb{R}^{h \times w}$.
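
The normalized trilinear attention described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: it takes the normalized form to be N(N(X) Xᵀ) X, it assumes the normalization N is a row-wise softmax (the text only says "normalization over the second dimension"), and the helper names `softmax_rows`, `matmul`, `transpose`, and `trilinear_attention` are hypothetical, not taken from the authors' code.

```python
import math

def softmax_rows(m):
    """Softmax over the second dimension (each row) of a 2-D list.
    Assumption: the paper's normalization N is taken to be a softmax."""
    out = []
    for row in m:
        top = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two 2-D lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def trilinear_attention(x):
    """x is a c x hw feature matrix; returns c x hw attention maps."""
    spatial = softmax_rows(x)                 # N(X): each channel sums to 1 over space
    relation = matmul(spatial, transpose(x))  # N(X) X^T: c x c inter-channel relationship
    relation = softmax_rows(relation)         # normalize each relationship vector
    return matmul(relation, x)                # integrate related channels: c x hw

# Toy input: 3 channels over a flattened 2 x 2 spatial grid (hw = 4).
features = [[1.0, 2.0, 0.5, 0.0],
            [0.0, 1.0, 3.0, 1.0],
            [2.0, 0.0, 1.0, 1.0]]
maps = trilinear_attention(features)
print(len(maps), len(maps[0]))
```

Because each normalized relationship vector is non-negative and sums to one under this assumption, every attention map is a convex combination of the original channels, which is one way to read "integrating each feature map with its related ones".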

Figure 3: An illustration of the self-trilinear product. $\mathbf{X}$ indicates the convolutional feature maps, and we can obtain inter-channel relationships by $\mathbf{X}\mathbf{X}^{\mathrm{T}}$. After that, we integrate each feature map with its related ones to get trilinear attention maps by taking the dot product of $\mathbf{X}\mathbf{X}^{\mathrm{T}}$ and $\mathbf{X}$.

3.2 Details Extraction by Attention Sampling

In this subsection, we introduce our attention-based sampler, which takes as input an image together with trilinear attention maps, and generates a structure-preserved image and a detail-preserved image. The structure-preserved image captures the global structure and contains all the important details. Compared to the original image, the structure-preserved one removes the regions without fine-grained details, and thus the discriminative parts can be better represented with high resolution. The detail-preserved image focuses on a single part, from which more fine-grained details can be extracted.

Given an image $\mathbf{I}$, we obtain the structure-preserved image $\mathbf{I}_s$ and the detail-preserved image $\mathbf{I}_d$ by conducting non-uniform sampling over different attention maps:

$$\mathbf{I}_s = \mathcal{S}(\mathbf{I}, \mathcal{A}(\mathbf{M})), \quad \mathbf{I}_d = \mathcal{S}(\mathbf{I}, \mathcal{R}(\mathbf{M})), \tag{3}$$

where $\mathbf{M}$ is the attention maps, $\mathcal{S}(\cdot)$ indicates the non-uniform sampling function, $\mathcal{A}(\cdot)$ indicates average pooling over channels, and $\mathcal{R}(\cdot)$ indicates randomly selecting a channel from the input. We calculate the average of all attention maps to guide structure-preserved sampling because such an attention map takes all the discriminative parts into consideration. We randomly select one attention map for detail-preserved sampling, so that it preserves the fine-grained details of the attended area with high resolution. As training proceeds, each of the attention maps has the opportunity to be selected, and thus different fine-grained details can be asynchronously refined.
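
The two ways of turning the stack of attention maps into a single sampling guide (average pooling over channels for the structure-preserved image, random channel selection for the detail-preserved image) can be illustrated as follows. This is a minimal sketch: the function names `average_map` and `random_map` are hypothetical, and attention maps are represented as nested Python lists purely for readability.

```python
import random

def average_map(maps):
    """Average pooling over channels: element-wise mean of c maps of size h x w."""
    c = len(maps)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / c for j in range(w)] for i in range(h)]

def random_map(maps, rng=random):
    """Random selection: pick one channel uniformly at random."""
    return rng.choice(maps)

# Two toy 1 x 2 attention maps.
attention = [[[1.0, 3.0]], [[3.0, 5.0]]]
structure_guide = average_map(attention)                # guides structure-preserved sampling
detail_guide = random_map(attention, random.Random(0))  # guides detail-preserved sampling
print(structure_guide, detail_guide)
```

Drawing a fresh random channel in every training iteration is what lets each part proposal be visited over time without training a separate network per part.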

Our basic idea for attention-based sampling is to treat the attention map as a probability mass function, i.e., an area with a large attention value is more likely to be sampled. Inspired by the inverse-transform technique [6], we implement the sampling by calculating the inverse function of the distribution function. Moreover, we decompose attention maps into two dimensions to avoid spatial distortion.

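Taking structure-preserved sampling for example, we first calculate the integral of the structure-preserved attention map over the $x$ and $y$ axes. Writing $\mathcal{A}(\mathbf{M})_{i,j}$ for the attention value at horizontal position $i$ and vertical position $j$:

$$\mathcal{F}_x(n) = \sum_{i=1}^{n} \max_{1 \le j \le h} \mathcal{A}(\mathbf{M})_{i,j}, \qquad \mathcal{F}_y(n) = \sum_{j=1}^{n} \max_{1 \le i \le w} \mathcal{A}(\mathbf{M})_{i,j}, \tag{4}$$

where $w$ and $h$ are the width and height of the attention map, respectively. Note that we use the $\max(\cdot)$ function to decompose the attention map into two dimensions, because it is more robust than the alternative $\mathrm{sum}(\cdot)$. We can further obtain the sampling function by:

$$\mathcal{S}(\mathbf{I}, \mathcal{A}(\mathbf{M}))_{i,j} = \mathbf{I}_{\mathcal{F}_x^{-1}(i),\, \mathcal{F}_y^{-1}(j)}. \tag{5}$$

In short, the attention map here is used to calculate the mapping function between the coordinates of the original image and the sampled image.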

Such a sampling mechanism is illustrated in Figure 4. Given an attention map in (a), we first decompose the map into two dimensions by calculating the max values over the $x$ axis (b1) and the $y$ axis (b2). Then the integrals of (b1) and (b2) are obtained and shown in (c1) and (c2), respectively. We further calculate the inverse functions of (c1) and (c2) in a digital manner, i.e., we uniformly sample points over the $y$ axis, and follow the red arrow (shown in (c1) and (c2)) and the blue arrow to obtain the values over the $x$ axis. (d) shows the sampling pixels as blue dots, and we can observe that the regions with large attention values are allocated more sampling points. Finally, (e) shows the resulting sampled image. Note that the example in Figure 4 is a structure-preserved sampling case.
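
The procedure described above (max-decomposition into two 1-D profiles, cumulative integration, and a digital inverse of the integral) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the cumulative profiles are normalized to [0, 1], the inverse is taken by linear interpolation within each attention cell, and pixels are fetched by nearest-neighbor lookup. All three choices, and all function names, are assumptions made for the example.

```python
def marginal_max(att):
    """Decompose an h x w attention map into two 1-D profiles using max."""
    col_max = [max(col) for col in zip(*att)]  # one value per horizontal position
    row_max = [max(row) for row in att]        # one value per vertical position
    return col_max, row_max

def cumulative(p):
    """Normalized running sum, i.e. a discrete distribution function."""
    total = sum(p)
    acc, out = 0.0, []
    for v in p:
        acc += v / total
        out.append(acc)
    return out

def inverse_cdf(cdf, n_out):
    """Map n_out evenly spaced levels in (0, 1) back to source coordinates."""
    coords, k = [], 0
    for t in range(n_out):
        u = (t + 0.5) / n_out                  # uniform sample on the value axis
        while k < len(cdf) - 1 and cdf[k] < u:
            k += 1                             # find the cell whose range contains u
        lo = cdf[k - 1] if k > 0 else 0.0
        span = cdf[k] - lo
        frac = (u - lo) / span if span > 0 else 0.5
        coords.append(k + min(max(frac, 0.0), 1.0))
    return coords

def attention_sample(image, att, out_h, out_w):
    """Resample image (H x W nested list) guided by att (h x w nested list)."""
    src_h, src_w = len(image), len(image[0])
    col_max, row_max = marginal_max(att)
    xs = inverse_cdf(cumulative(col_max), out_w)
    ys = inverse_cdf(cumulative(row_max), out_h)

    def to_pixel(coord, n_att, n_src):
        # Scale attention-grid coordinates to image pixels (nearest neighbor).
        return min(int(coord / n_att * n_src), n_src - 1)

    return [[image[to_pixel(y, len(row_max), src_h)][to_pixel(x, len(col_max), src_w)]
             for x in xs] for y in ys]

# A profile with a strong peak in its middle cell receives most of the samples.
print(inverse_cdf(cumulative([1, 6, 1]), 8))
```

With a uniform attention map the sketch reduces to ordinary uniform resampling, and because the horizontal and vertical coordinates are computed independently, straight image rows and columns stay straight in the sampled image.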

Figure 4: An example of attention-based non-uniform sampling. (a) is an attention map with a Gaussian distribution. (b1) and (b2) are the marginal distributions over the $x$ and $y$ axes, respectively. (c1) and (c2) are the integrals of the marginal distributions. (d) shows the sampling pixels as blue dots, and (e) illustrates the sampled image. [Best viewed in color with zoom-in.]

3.3 Details Optimization by Knowledge Distilling

In this subsection, we introduce our details distiller, which takes as input a detail-preserved image and a structure-preserved image, and transfers the learned details from part-net to master-net in a teacher-student manner.

Specifically, for each iteration, the attention-based sampler introduced in Section 3.2 provides a structure-preserved image (denoted as $\mathbf{I}_s$) and a detail-preserved one (denoted as $\mathbf{I}_d$). We first obtain the fully connected (fc) outputs by feeding these two images into the same backbone CNN (e.g., Resnet-50 [9]). The fc outputs are denoted as $\mathbf{z}_s$ and $\mathbf{z}_d$, respectively. Then the “softmax” classifier converts $\mathbf{z}_s$ and $\mathbf{z}_d$ into probability vectors $\mathbf{q}_s$ and $\mathbf{q}_d$, which indicate the predicted probability over each class. Taking $\mathbf{z}_s$ for example:

$$q_s^{(i)} = \frac{\exp(z_s^{(i)}/T)}{\sum_{j}\exp(z_s^{(j)}/T)}, \tag{6}$$

where $T$ is a parameter called the temperature, which is normally set to 1 for classification tasks. In knowledge distilling, however, a large value of $T$ is important as it produces a soft probability distribution over classes. We obtain the soft target cross entropy [11] for the master-net as:

$$\mathcal{L}_{soft}(\mathbf{q}_s, \mathbf{q}_d) = -\sum_{i=1}^{N} q_d^{(i)} \log q_s^{(i)}, \tag{7}$$

where $N$ denotes the number of classes. Finally, the objective function of the master-net can be derived as:

$$\mathcal{L}(\mathbf{I}_s) = \mathcal{L}_{cls}(\mathbf{q}_s, \mathbf{y}) + \lambda \mathcal{L}_{soft}(\mathbf{q}_s, \mathbf{q}_d), \tag{8}$$

where $\mathcal{L}_{cls}$ represents the classification loss function, $\mathbf{y}$ is a one-hot vector which indicates the class label, and $\lambda$ denotes the loss weight balancing the two terms. The soft target cross entropy aims to distill the learned features for fine-grained details and transfer such information to the master-net. As the attention-based sampler randomly selects one part in each iteration, all the fine-grained details can be distilled into the master-net during training. Note that the convolutional parameters are shared between the part-net and the master-net, which is important for distilling, while the sharing of fully connected layers is optional.
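
The master-net objective (temperature-scaled softmax, soft target cross entropy against the part-net output, and a weighted sum with the classification loss) can be illustrated with a small standard-library sketch. The function names are hypothetical, the classification loss is assumed to be ordinary cross entropy at temperature 1, and the defaults `temperature=10.0` and `weight=2.0` follow the values reported in the implementation details.

```python
import math

def softmax_t(z, temperature=1.0):
    """Softmax with temperature; a larger temperature gives a softer distribution."""
    top = max(z)  # subtract the max for numerical stability
    exps = [math.exp((v - top) / temperature) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_ce(z_s, z_d, temperature):
    """Soft target cross entropy: part-net output q_d is the target for master-net q_s."""
    q_s = softmax_t(z_s, temperature)
    q_d = softmax_t(z_d, temperature)  # teacher distribution, treated as a constant
    return -sum(qd * math.log(qs) for qd, qs in zip(q_d, q_s))

def master_loss(z_s, z_d, label, temperature=10.0, weight=2.0):
    """Classification loss plus weighted distillation loss for the master-net."""
    # Assumption: the classification loss is plain cross entropy at temperature 1.
    cls = -math.log(softmax_t(z_s)[label])
    return cls + weight * soft_target_ce(z_s, z_d, temperature)

# Toy fc outputs for a 3-class problem; class 0 is the ground truth.
z_master, z_part = [2.0, 1.0, 0.1], [3.0, 0.5, 0.2]
print(softmax_t(z_master, 1.0))
print(softmax_t(z_master, 10.0))  # flatter than the temperature-1 distribution
print(master_loss(z_master, z_part, label=0))
```

The soft term is smallest when the master-net reproduces the part-net distribution exactly, which is the sense in which part-level knowledge is pushed into the shared parameters.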

4 Experiments

4.1 Experiment setup

Dataset # Class # Train # Test
CUB-200-2011 [35] 200 5,994 5,794
Stanford-Car [14] 196 8,144 8,041
iNaturalist-2017 [28] 5,089 579,184 95,986
Table 1: Detailed statistics of the three datasets used in this paper.

Datasets: To evaluate the effectiveness of our proposed TASN, we conducted experiments on three extensive and competitive datasets, namely Caltech-UCSD Birds (CUB-200-2011) [35], Stanford Cars [14], and iNaturalist-2017 [28]. The detailed statistics, with category numbers and the standard training/testing splits, can be found in Table 1. iNaturalist-2017 is the largest dataset for the fine-grained task. Unlike other datasets for this task, it spans 13 superclasses. Such a data distribution can provide a more convincing evaluation of the generalization ability of a model.

Baselines: We compared our method to the following baselines due to their state-of-the-art performance and high relevance. Note that we did not include methods using 1) additional data (from the web or other datasets), 2) human-annotated part locations and 3) hierarchical labels (i.e., species, genus, and family) for a fair comparison. And all of the compared methods in each table share the same backbone unless specified otherwise.

  • FCAN [22]: Fully convolutional attention network, which adaptively selects multiple attentions by reinforcement learning.

  • MDTP [32]: Mining discriminative triplets of patches, which utilize geometric constraints to improve the accuracy of patch localization.

  • DT-RAM [20]: Dynamic computational time model for recurrent visual attention, which attends on the most discriminative parts by dynamic steps.

  • SSN [23]: Saliency-based sampling networks, which conduct non-uniformed sampling based on saliency map in an end-to-end way.

  • MG-CNN [30]: Multiple granularity descriptors, which leverage the hierarchical labels to generate comprehensive descriptors.

  • STN [12]: Spatial transformer network, which conducts parameterized spatial transformation to obtain zoomed in or pose normalized objects.

  • RA-CNN [7]: Recurrent attention CNN, which recurrently attends on discriminative parts in multi-scale.

  • MA-CNN [40]: Multiple attention CNN, which attends on multiple parts by their proposed channel grouping module in a weakly-supervised way.

  • MAMC [27]: Multi-attention multi-class constraint network, which learns multiple attentions by conducting multi-class constraint over attended features.

  • iSQRT-COV [19]: Towards faster training of global covariance pooling networks by iterative matrix square root normalization.

Attention Description Accuracy
feature maps 83.5
trilinear attention 84.9
spatial norm 85.2
spatial norm 84.3
spatial norm 84.5
relation norm 85.0
spatial + relation 85.3
Table 2: Ablation experiments on attention module in terms of recognition accuracy on the CUB-200-2011 dataset.

Implementation: We used open-sourced MXNet [4] as our code base, and trained all the models on 8 Tesla P-100 GPUs. For a fair comparison, we conducted experiments with our method TASN on VGG-19 [26] as well as Resnet-50 [9], both of which are pre-trained on Imagenet [24]. We used the standard data augmentation methods provided by MXNet, and all reported results are single-crop testing results for a single model unless stated otherwise. We used the SGD optimizer without momentum and weight decay, and the batch size was set to 96. The initial learning rate was set to 0.05, with a decay factor of 0.1 after every 30 epochs. The temperature $T$ in Equation 6 is 10, and the loss weight $\lambda$ in Equation 8 is 2. More implementation details can be found in our code, which will be released soon.
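
The step-wise learning rate schedule stated above (initial rate 0.05, multiplied by 0.1 after every 30 epochs) can be written out as a one-line function. This is a sketch for illustration only: the function name `learning_rate` is hypothetical, epochs are assumed to be counted from zero, and it is not a call into the MXNet API.

```python
def learning_rate(epoch, base_lr=0.05, factor=0.1, step=30):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

for epoch in (0, 29, 30, 60):
    print(epoch, learning_rate(epoch))
```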

4.2 Evaluation and analysis on CUB-200-2011

Trilinear attention. Table 2 shows the impact of different normalization functions in terms of recognition accuracy. We obtain the results by 1) randomly selecting a channel of the attention maps in each iteration for sampling during the training stage, and 2) conducting average pooling over the attention maps for testing. All the models use Resnet-50 as the backbone with an input resolution of 224. It can be observed that trilinear attention maps significantly outperform the original feature maps (84.9% vs. 83.5%). The first spatial normalization variant (85.2%) and the relation normalization (85.0%) both improve on plain trilinear attention, while the other two spatial normalization variants (84.3% and 84.5%) reduce performance because they cause a loss of spatial information. As a result, we adopt the last setting (of Table 2) in our TASN. Note that in the term $\mathcal{N}(\mathbf{X})\mathbf{X}^{\mathrm{T}}$, $\mathcal{N}(\mathbf{X})$ indicates the region that a channel is focusing on and $\mathbf{X}^{\mathrm{T}}$ denotes the feature of that region.

We further compared our trilinear attention module with “self-attention” [29]. Specifically, we followed [29] to obtain attention maps, and the proposed trilinear attention outperforms the self-attention module by 0.7 percentage points.

Figure 5: A comparison of feature maps in (a) and trilinear attention maps in (b). Each column shows the same channel of feature maps and trilinear attention maps, and we randomly select nine channels for comparison. We can observe that compared to first-order feature maps, each channel of the trilinear attention maps focuses on a specific part, without attending to background noise.

Approach master-net TASN
Resnet-50 [9] 81.6 81.6
uniformed sampler 84.1 85.8
sampler in SSN [23] 84.8 85.3
our sampler 85.5 87.0
Table 3: Ablation experiments on sampling module in terms of classification accuracy on the CUB-200-2011 dataset.

Resolution 224 280 336 392
Resnet-50 [9] 81.6 83.3 85.0 85.6
master-net 85.5 86.6 87.0 86.8
TASN 87.0 87.3 87.9 87.9
Table 4: Ablation experiments on distilling module with different input resolutions.

Attention-based sampler. To demonstrate the effectiveness of our attention-based sampling mechanism, we compared it with 1) uniformed sampling (by binarizing the attention maps) and 2) the sampling operation introduced in SSN [23]. We set the input attention maps to be the same when comparing sampling mechanisms, and experiments were conducted in two cases, i.e., with and without part-net. All the models use Resnet-50 as the backbone and the input resolution is set to 224. The results in Table 3 show that our sampling mechanism remarkably outperforms the baselines. The SSN sampler obtains a better result than the uniformed sampler without part-net, while its further improvement is limited when part-net is added. These observations show that the spatial distortion caused by the SSN sampler is not helpful for preserving subtle details.

Knowledge distilling. Table 4 reveals the impact of the details distilling module with different input resolutions. We can observe consistent improvements from details distilling. The performance of Resnet-50 [9] saturates at 85.6%, and 448 input cannot further improve the accuracy. Without the distiller (i.e., master-net only), the performance drops slightly with 392 input (compared to 336 input), since it is difficult to optimize each detail with large feature resolutions (a similar drop can also be observed on Resnet-50 with 672 inputs).

Moreover, to study the attention selection strategy (i.e., ranked selection vs. random selection), we ranked the attention maps by their response and sampled the high-response ones with higher probability; the recognition performance dropped from 87.0% to 86.8%. The reason is that ranking makes some parts rarely picked, while such parts can also benefit details learning. We also conducted experiments on distilling two parts in each iteration, and the result is the same as distilling one part each time.

Compared to sampling-based methods. We compare our TASN with three sampling-based methods: 1) uniformed sampling with high resolution (i.e., zoom in), 2) uniformed sampling with attention (i.e., crop), and 3) non-uniformed sampling proposed in SSN [23]. As shown in Table 5, higher resolution can significantly improve fine-grained recognition performance by (relative) 4.9%. However, 448 input increases the computational cost (i.e., flops) by four times compared to 224 input. SSN [23] obtains better results than DT-RAM [20], and our TASN can further obtain a 3.0% relative improvement. Our improvements mainly come from two aspects: 1) a better sampling mechanism considering spatial distortion (1.2%), and 2) a better fine-grained details optimizing strategy (1.8%).
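
The relative improvements quoted in this comparison can be recomputed directly from the accuracies in Tables 3 and 5. The sketch below is only a worked check: the helper `relative_gain` is hypothetical, and attributing the sampling gain to the step from SSN (84.5) to the master-net with our sampler (85.5), and the optimization gain to the step from that master-net to the full TASN (87.0), is an assumption about how the two components were separated.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

comparisons = {
    "Resnet-50, 224 -> 448 input": (85.6, 81.6),
    "TASN over SSN": (87.0, 84.5),
    "sampling (SSN -> master-net with our sampler)": (85.5, 84.5),
    "distilling (master-net -> full TASN)": (87.0, 85.5),
}
for name, (new, old) in comparisons.items():
    print(name, round(relative_gain(new, old), 1))
```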

Approach Resolution Accuracy
Resnet-50 [9] 224 81.6
Resnet-50 [9] 448 85.6
DT-RAM [20] 224 82.8
SSN [23] 227 84.5
TASN (ours) 224 87.0
Table 5: Comparison with sampling-based methods in terms of classification accuracy on the CUB-200-2011 dataset.

Approach Backbone Accuracy
MG-CNN [30] 3×VGG-16 81.7
ST-CNN [12] 3×Inception-v2 84.1
RA-CNN [7] 3×VGG-19 85.3
MA-CNN [40] 3×VGG-19 85.4
TASN (ours) 1×VGG-19 86.1
TASN (ours) 3×VGG-19 87.1
MAMC [27] 1×Resnet-50 86.5
TASN (ours) 1×Resnet-50 87.9
Table 6: Comparison with part-based methods (all the results are reported in high-resolution setting) in terms of classification accuracy on the CUB-200-2011 dataset.

Compared to attention-based part methods. In Table 6, we compare our TASN to attention-based part methods. For a fair comparison, 1) high-resolution input is adopted by all methods and 2) the same number of backbones is used. It can be observed that for VGG-based methods, our TASN outperforms all the baselines even with only one backbone. Moreover, after ensembling three backbones (trained with different parameter settings), TASN improves the performance by 2.0% over the best 3-part model, MA-CNN [40]. Our 3-stream result also outperforms the 6-stream MA-CNN by a margin of 0.7%. We do not ensemble more streams, as model ensembling is beyond the scope of this work. For Resnet-50 based methods, compared with the state-of-the-art single-stream MAMC [27], our TASN also achieves a remarkable improvement of 1.6%.

Combining with second-order feature learning methods. In Table 7, we show that our TASN learns a strong first-order representation, which can further improve the performance of second-order feature methods. Specifically, compared to the best second-order method, iSQRT-COV [19], our 2k-dimensional first-order TASN feature outperforms their 8k feature by 0.7%, which shows the effectiveness of our TASN. Moreover, we transfer their released code to our framework and obtain an accuracy of 89.1%, which shows the compatibility of these two methods. Note that for a fair comparison, we follow their settings and predict the label of a test image by averaging the prediction scores of the image and its horizontal flip.

Approach Dimension Accuracy
iSQRT-COV [19] 8k 87.3
iSQRT-COV [19] 32k 88.1
TASN (ours) 2k 87.9
TASN + iSQRT-COV 32k 89.1
Table 7: Extensive experiments on combining second-order feature learning methods.

4.3 Evaluation and analysis on Stanford-Car

Approach Backbone Accuracy
Baseline 1×VGG-19 88.6
master-net 1×VGG-19 90.3
TASN 1×VGG-19 92.4
TASN (ensemble) 2×VGG-19 93.1
TASN (ensemble) 3×VGG-19 93.2
Table 8: Component analysis in terms of classification accuracy on the Stanford-Car dataset.

Approach Backbone Accuracy
FCAN [22] 3×VGG-16 91.3
MDTP [32] 3×VGG-16 92.5
RA-CNN [7] 3×VGG-19 92.5
MA-CNN [40] 3×VGG-19 92.6
TASN (ours) 1×VGG-19 92.4
TASN (ours) 3×VGG-19 93.2
MAMC [27] 1×Resnet-50 92.8
TASN (ours) 1×Resnet-50 93.8
Table 9: Comparison in terms of classification accuracy on the Stanford-Car dataset.

Ablation study. Table 8 shows the results of the VGG-19 baseline, our TASN without the distiller, our TASN with a single model, and the ensembles. We can observe a 1.9% relative improvement from structure-preserved sampling and a further improvement of 2.3% from the full model. Note that the improvement from structure-preserved sampling is not as significant as on the CUB-200-2011 dataset, because foregrounds are larger in most of the Stanford-Car images. This result also shows that our full TASN model works well at boosting performance on foreground-dominated images.

Comparison with state-of-the-art. As with CUB-200-2011, we compare TASN with attention-based part methods under the same setting. As shown in Table 9, TASN with a single VGG-19 achieves results comparable to 3-stream part methods, and our ensembled 3-stream TASN outperforms the best 3-stream part learning method, MA-CNN [40]. Compared to their 5-stream result (92.8%), our result is still better. For Resnet-50 based methods, we compare TASN to the state-of-the-art method MAMC [27] and achieve a 1.0% improvement (93.8% vs. 92.8%).

4.4 Evaluation and analysis on iNaturalist 2017

Super Class # Class Resnet [9] SSN [23] TASN
Plantae 2101 60.3 63.9 66.6
Insecta 1021 69.1 74.7 77.6
Aves 964 59.1 68.2 72.0
Reptilia 289 37.4 43.9 46.4
Mammalia 186 50.2 55.3 57.7
Fungi 121 62.5 64.2 70.3
Amphibia 115 41.8 50.2 51.6
Mollusca 93 56.9 61.5 64.7
Animalia 77 64.8 67.8 71.0
Arachnida 56 64.8 73.8 75.1
Actinopterygii 53 57.0 60.3 65.5
Chromista 9 57.6 57.6 62.5
Protozoa 4 78.1 79.5 79.5
Total 5089 59.6 65.2 68.2
Table 10: Comparison in terms of classification accuracy on the iNaturalist 2017 dataset.

We also evaluate TASN on the largest fine-grained dataset, i.e., iNaturalist 2017. We compare to a Resnet-101 baseline and the best sampling method, SSN [23]. All the models use Resnet-101 as the backbone with an input resolution of 224. As there are 13 superclasses in this dataset, we re-implement SSN [23] with their released code to obtain the performance on each superclass. The results are shown in Table 10: TASN outperforms the Resnet-101 baseline on every superclass and matches or exceeds SSN on each. Notably, compared to Resnet-101, TASN significantly improves the performance, especially on Reptilia (a relative improvement of 24.0%) and Aves (a relative improvement of 21.8%), which indicates that such superclasses contain more fine-grained details.
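The per-superclass gains quoted above can be recomputed from Table 10. The sketch below assumes they are relative gains over the Resnet-101 baseline, which is consistent with the Reptilia and Aves figures up to rounding. The accuracies are copied from the table, and `relative_gain` is a hypothetical helper name.

```python
# (superclass, Resnet-101 accuracy, TASN accuracy), copied from Table 10.
ROWS = [
    ("Plantae", 60.3, 66.6),
    ("Insecta", 69.1, 77.6),
    ("Aves", 59.1, 72.0),
    ("Reptilia", 37.4, 46.4),
    ("Mammalia", 50.2, 57.7),
    ("Fungi", 62.5, 70.3),
    ("Amphibia", 41.8, 51.6),
    ("Mollusca", 56.9, 64.7),
    ("Animalia", 64.8, 71.0),
    ("Arachnida", 64.8, 75.1),
    ("Actinopterygii", 57.0, 65.5),
    ("Chromista", 57.6, 62.5),
    ("Protozoa", 78.1, 79.5),
]


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Relative gain per superclass, largest first.
gains = sorted(
    ((name, relative_gain(base, tasn)) for name, base, tasn in ROWS),
    key=lambda item: item[1],
    reverse=True,
)

for name, gain in gains:
    print(f"{name:<15s} {gain:5.1f}%")
```

Relative gain emphasizes superclasses where the baseline is weak. The corresponding absolute gains for Reptilia and Aves are 9.0 and 12.9 points.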

5 Conclusion

In this paper, we proposed a trilinear attention sampling network for fine-grained image recognition, which can learn rich feature representations from hundreds of part proposals. Instead of ensembling multiple part CNNs, we adopted a knowledge distillation method to integrate fine-grained features into a single stream, which is both efficient and effective. Extensive experiments on CUB-Bird, iNaturalist 2017, and Stanford-Car demonstrate that TASN is able to outperform part-ensemble models even with a single stream. In the future, we will further study the proposed TASN in the following directions: 1) attention selection strategy, i.e., learning to select which details should be learned and distilled instead of selecting them randomly, 2) conducting attention-based sampling over convolutional features instead of only over images, and 3) extending our work to other vision tasks, e.g., object detection and segmentation.


  • [1] T. Berg, J. Liu, S. Woo Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2011–2018, 2014.
  • [2] S. Branson, G. V. Horn, S. J. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. In BMVC, 2014.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4):834–848, 2018.
  • [4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
  • [5] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, pages 4109–4118, 2018.
  • [6] L. Devroye. Sample-based non-uniform random variate generation. In WSC, pages 260–265. ACM, 1986.
  • [7] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.
  • [8] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. Proc. Interspeech, pages 3697–3701, 2017.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [10] B. Heo, M. Lee, S. Yun, and J. Y. Choi. Knowledge distillation with adversarial samples supporting decision boundary. CoRR, abs/1805.05532, 2018.
  • [11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.
  • [12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
  • [13] J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear attention networks. In NIPS, pages 1571–1581, 2018.
  • [14] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop, 2013.
  • [15] A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • [17] M. Lam, B. Mahasseni, and S. Todorovic. Fine-grained recognition as hsnet search for informative image parts. In CVPR, pages 6497–6506. IEEE, 2017.
  • [18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [19] P. Li, J. Xie, Q. Wang, and Z. Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, pages 947–955, 2018.
  • [20] Z. Li, Y. Yang, X. Liu, F. Zhou, S. Wen, and W. Xu. Dynamic computational time for visual attention. In ICCV, pages 1199–1209, 2017.
  • [21] X. Liu, W. Liu, H. Ma, and H. Fu. Large-scale vehicle re-identification in urban surveillance videos. In ICME, pages 1–6. IEEE, 2016.
  • [22] X. Liu, T. Xia, J. Wang, Y. Yang, F. Zhou, and Y. Lin. Fully convolutional attention networks for fine-grained recognition. arXiv preprint arXiv:1603.06765, 2016.
  • [23] A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Torralba. Learning to zoom: a saliency-based sampling layer for neural networks. In ECCV, pages 51–66, 2018.
  • [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
  • [25] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, pages 1143–1151, 2015.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, pages 1409–1556, 2015.
  • [27] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-attention multi-class constraint for fine-grained image recognition. In ECCV, pages 805–821, 2018.
  • [28] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The inaturalist species classification and detection dataset. 2018.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • [30] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang. Multiple granularity descriptors for fine-grained categorization. In ICCV, pages 2399–2406, 2015.
  • [31] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
  • [32] Y. Wang, J. Choi, V. Morariu, and L. S. Davis. Mining discriminative triplets of patches for fine-grained classification. In CVPR, pages 1163–1172, 2016.
  • [33] X.-S. Wei, J.-H. Luo, J. Wu, and Z.-H. Zhou. Selective convolutional descriptor aggregation for fine-grained image retrieval. TIP, 26(6):2868–2881, 2017.
  • [34] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
  • [35] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • [36] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, pages 842–850, 2015.
  • [37] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In CVPR, pages 3973–3981, 2015.
  • [38] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, pages 4133–4141, 2017.
  • [39] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian. Picking deep filter responses for fine-grained image recognition. In CVPR, pages 1134–1142, 2016.
  • [40] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, pages 5209–5217, 2017.