Fine-grained visual recognition with salient feature detection
Fine-grained recognition based on computer vision has received great attention in recent years. Existing works focus on discriminative part localization and feature learning. In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many salient parts of the object as possible. Then, we determine the classification probability that can be obtained by using each part separately for object classification. Finally, by extracting efficient features from each part, combining them, and feeding them to a classifier for recognition, we obtain an improved accuracy over state-of-the-art algorithms on the CUB200-2011 bird dataset.
Fine-grained recognition is an active topic in computer vision and pattern recognition, and is now widely applied in industry and academia, for instance, to classify different species of birds or plants in order to evaluate changes in natural ecosystems, or to recognize car models for visual census estimation. Compared with the coarse-grained recognition of traditional object recognition tasks, its purpose is to identify finer subordinate categories, such as bird species, car models, and aircraft types. Fine-grained recognition is very challenging due to the significant differences between samples of the same category and the obvious similarities between different categories [38, 28].
Exciting progress has been made in this area in recent years, thanks to the involvement of many researchers in the community. Generally, part localization and feature description are the two key factors that affect classification accuracy. To obtain more precise part localization, pose-normalized descriptors or pose alignment are applied to all images before they are used for feature extraction. Then, convolutional neural networks are employed as descriptors to learn discriminative features. Although convolutional neural networks are very powerful in learning features, they have poor interpretability [31, 1]. Therefore, it is still unknown which parts carry more discriminative features than others, and how the parts with less discriminative features affect the classification accuracy.
When we, as humans, face the problem of fine-grained recognition, what do we do? Figure 1 shows a guide for ornithologists to identify common birds. From Figure 1, we can see that, to recognize five species of birds from two categories, several parts (e.g., bill, plumage, leg) and features (e.g., length, color, shape) are used as indicators. Intuitively, human beings rely on plenty of information when they recognize a species, for example, the length and shape of the bill, the color of the plumage and legs, and so on. There is an idiom in China called The Blind Men and The Elephant: four blind men wished to know what an elephant looked like. The man who touched the elephant's ear claimed that it was like a great fan, while the man who felt the elephant's leg regarded it as a big pillar. Of course, none of them could be right before feeling all parts of the elephant. The principle behind this idiom also applies to fine-grained recognition: the more information we get, the better our judgment will be.
In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many parts of the object as possible. Then, we determine the classification probability that can be obtained by using each part separately for object classification. Finally, by extracting efficient features from each part, combining them, and feeding them to a classifier for recognition, we outperform the state-of-the-art accuracy on the CUB200-2011 dataset. We call the whole method fine-granularity part-CNN (FP-CNN).
The key contributions of this work can be summarized as follows:
We trained a deep neural network to detect and locate salient parts of the object with high probability by generating labeled part images according to the part annotations.
We compared and analyzed the effects of different parts on the recognition accuracy and found that, in the bird dataset, the classification accuracy of every part except the head is relatively low.
The experimental results on the CUB200-2011 bird dataset illustrate the state-of-the-art performance of the proposed approach.
This paper is organized as follows. A review of related work is presented in section 2. Section 3 describes the proposed fine-grained recognition method, followed by the experimental evaluation in section 4. Finally, we conclude this paper in section 5.
2 Related Work
In this section, we introduce the state-of-the-art work on fine-grained recognition from the perspective of whether human-labeled information (e.g., bounding boxes and part annotations) is leveraged, i.e., strongly-supervised and weakly-supervised fine-grained recognition. Note that both categories of methods require class labels, which is why they cannot be called unsupervised recognition.
|Part regions||Part key points||Region Style|
|Head||beak, crown, forehead, left eye, nape, right eye, throat||minimal rectangle|
|Breast||belly, breast||minimal rectangle|
|Wing||left wing, right wing||square envelope|
|Leg||left leg, right leg||square envelope|
2.1 Strongly-supervised fine-grained recognition
A large corpus of strongly-supervised fine-grained recognition methods has been proposed in recent works [33, 19, 13, 32, 4, 7, 5, 9, 17, 28], where bounding boxes, part annotations, or both are used during the training stage for part localization and representative feature learning, and in some cases bounding boxes are also used in the test stage. Part R-CNN was proposed to leverage deep convolutional features computed on bottom-up region proposals for detection and part description based on pose normalization [7, 4]. Segmentation-based methods are also very effective for fine-grained recognition, where region-level cues are used to infer foreground segmentation masks to eliminate background interference [5, 9, 30, 17, 28]. The recently proposed Mask-CNN achieves state-of-the-art classification accuracy on CUB200-2011. In order to locate the parts of birds during the test phase, two masks are generated with the help of part key points, and a fully convolutional network is trained based on the masks. Then, a three-stream CNN model is constructed for fine-grained recognition. The authors reported impressive results, with a state-of-the-art accuracy of 87.3%. However, the limitation of this work is that, apart from the original object image, only two parts (i.e., head and torso) are used to learn identifiable features, while the other parts are ignored, resulting in insufficient recognition of some important details. In our work, we do not make any prior assumption about the importance of the various parts for fine-grained recognition, and all components are taken into account. As in , only part annotation is used in the training stage, and we obtain an average accuracy of 88.2% on CUB200-2011.
2.2 Weakly-supervised fine-grained recognition
Weakly-supervised recognition requires only image-level class labels and does not use part annotations, bounding boxes, or segmentation masks [16, 25, 20, 36, 38, 37]. Some works generate parts using segmentation and alignment [9, 16], while others leverage visual attention mechanisms [29, 38, 37]. Krause et al. proposed to discover the parts without any part annotations by aligning images with similar poses, and then used a convolutional neural network to train a feature descriptor. A bilinear convolutional neural network was proposed to capture part-feature interactions, motivated by the idea that the modular separation of two CNNs can affect the overall appearance. A multi-attention convolutional neural network (MA-CNN) was presented to generate more distinguishable parts and to learn better fine-grained features from the parts in a mutually reinforcing manner. The parts were located by detecting the convolutional feature channels whose peak responses occur at adjacent locations. Zhao et al. proposed a diversified visual attention network (DVAN), where multiple attention canvases with various locations and scales were generated for incremental object representation. Instead of finding multiple attention areas in an image at the same time, they suggested finding different attention regions over multiple steps and using a recurrent neural network to predict the object class.
In this section, we present the proposed method. We first introduce the method for precisely localizing the parts of the object given the part annotations. Then, we compare and analyze the classification accuracy obtained when using different parts of the object.
3.1 Local Feature Location and Detection
The localization of possibly discriminative parts is one of the core issues of fine-grained recognition. Existing methods that leverage attention mechanisms for part localization are based on the intuition that some parts have higher visual saliency than others. This intuition, to some extent, does reflect the way human beings inspect the world, because processing such a huge amount of information is a large burden for our visual system and brain. However, when we intend to perform fine-grained classification, this may mislead us, especially when the objects we want to recognize have such marginal visual differences that only field experts can distinguish them.
In this paper, we suggest that, in the context of fine-grained recognition, the more information we get, the better our judgment will be. Based on this idea, we first propose a local feature location strategy that aims to accurately locate as many parts as possible with the help of part annotations in the training stage. Then, we convert the part localization problem into an object detection problem. This differs from traditional object detection, whose goal is to detect objects in raw images, because we focus on detecting the parts in images that contain the object.
3.1.1 Ground truth part region generation
Note that part annotations are available in some fine-grained datasets, for example, CUB200-2011, Birdsnap, and FGVC Aircraft. In this paper, we take CUB200-2011 as an example, but the idea can easily be extended to other datasets. CUB200-2011 defines fifteen part key points, and we leverage these points to construct ground truth part regions (also called bounding boxes). In our proposed local feature location strategy, five discriminative part regions (i.e., head, breast, tail, wing and leg) are generated, as shown in Table 1. Because the accuracy of the part regions has a significant impact on part detection, three strategies are used to generate them:
(1) Two region generation styles: For the head and breast regions, we adopt the minimal rectangle that includes all the key points annotated on the corresponding part, while a square envelope (i.e., a key-point-centered square) is used for the remaining regions, as shown in Table 1.
(2) Self-tuning region size: The key points in the part annotation represent the centers of specific bird parts. If we simply draw a minimal rectangle that includes all of these points, as in , to generate the ground truth part region, some detailed features may be lost, as shown in Figure 2. For the head region, the size is therefore self-tuned: starting from the width and height of the minimal rectangle that includes the key points, the generated head region is enlarged by two tuning factors that pad the head region. Additionally, for the part regions generated by the square envelope, the region size must also be determined carefully. If the region is too large, other parts of the object will be included; if it is too small, distinguishable features will be lost. Besides, due to the different sizes of the images and the different proportions of the objects within them, the size of the object varies significantly. In this paper, the region sizes are self-adjusted according to the size of the head because, from our observation of a large number of images, the head size is not seriously affected by changes of scale, viewpoint, and occlusion, so it serves as a better reference.
(3) Redundant region elimination: Both sides (i.e., left and right) of the same part may appear in the image, for example, the left and right wings or the left and right legs, as shown in Figure 2 (the two images on the left side). The same problem may occur during the part detection phase for test images, which will be illustrated later. The region that has the minimum intersection over union (IoU) is chosen for the current part, where the IoU is defined as

$$\mathrm{IoU}(R_c, R_o) = \frac{|R_c \cap R_o|}{|R_c \cup R_o|},$$

where $R_c$ and $R_o$ are the regions of the current part and of the other parts, respectively. If the IoUs for both sides are the same, we randomly choose one of them.
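The three region-generation strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the padding parameters `pad_w` and `pad_h`, the multiplicative form of the padding, the function names, and the choice to sum the IoU over all other part regions are hypothetical, since the text does not fix them.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def minimal_rectangle(points: List[Point], pad_w: float = 0.0,
                      pad_h: float = 0.0) -> Box:
    """Tightest box around the key points, enlarged by padding factors.

    pad_w and pad_h are hypothetical tuning factors: the box grows by
    pad_w * width and pad_h * height, split evenly between both sides.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(xs) - min(xs)
    h = max(ys) - min(ys)
    dx, dy = pad_w * w / 2.0, pad_h * h / 2.0
    return (min(xs) - dx, min(ys) - dy, max(xs) + dx, max(ys) + dy)


def square_envelope(center: Point, side: float) -> Box:
    """Square of the given side length centered on a single key point."""
    half = side / 2.0
    cx, cy = center
    return (cx - half, cy - half, cx + half, cy + half)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pick_side(candidates: List[Box], others: List[Box]) -> Box:
    """Keep the left/right candidate that overlaps the other parts least.

    Summing the IoU over all other regions is an assumption of this sketch.
    Ties are broken at random, as described in the text.
    """
    overlaps = [sum(iou(c, o) for o in others) for c in candidates]
    lowest = min(overlaps)
    ties = [c for c, v in zip(candidates, overlaps) if v == lowest]
    return random.choice(ties)


# Example: a padded head box, then a wing square whose side is tied to the
# head width (the factor 1.5 is purely illustrative).
head = minimal_rectangle([(10, 10), (30, 20)], pad_w=0.5, pad_h=0.5)
wing = square_envelope((50, 50), side=1.5 * (head[2] - head[0]))
```

Tying the square side to the head size follows the observation in the text that the head is a comparatively stable reference across scales and viewpoints.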
3.1.2 Local part detection and localization
In the second step, with the part regions in hand, we convert the part localization problem into part detection in images that contain the object. Object detection has been an active research topic in recent years, and promising performance has been reported in the literature using deep neural networks [11, 23, 22]. Earlier work employed R-CNN to detect objects and localize their parts for recognition. However, that recognition is conducted in a strongly supervised way (i.e., both bounding box and part annotations are used at training time), and just two parts (i.e., head and torso) were detected in the CUB-200-2011 dataset. In contrast, our method requires only part annotations for training and no supervision at test time. Our work leverages YOLO v3 to detect and locate all five parts defined in Table 1. Compared with R-CNN (and other classifier-based object detection approaches, e.g., Fast and Faster R-CNN), YOLO is much faster while obtaining comparable detection accuracy because, for a single image, it makes predictions with a single network evaluation, whereas R-CNN requires thousands. Note that two thresholds should be carefully selected for part detection and localization when using YOLO. The first threshold is compared with the IoU of the predicted and ground truth part regions to determine what percentage of bounding boxes is preserved during the training phase. The second is used in the test phase, where a detected part is considered valid only if its confidence is higher than this threshold. The trained model is available on GitHub (https://github.com/wuyun8210/part-detection).
3.2 The proposed method
In this section, besides recognizing the subcategories of the object, we are also very interested in the impact of detected parts on the accuracy of recognition.
|Strong DPM ||43.49%||—||—||—||—|
|Part-based R-CNN ||68.19%||—||—||—||—|
|Deep LAC ||74.00%||—||—||—||—|
3.2.1 The importance of the parts
Our approach is to train different models on different image sets in order to clarify the recognition performance obtained from the whole object or from its individual parts.
First, we generate several groups of part image sets based on the ground truth regions of the training set, as shown in Figure 4. Then, for each group of image sets, we train a separate deep convolutional neural network, assigning the object label to the corresponding parts. We use ResNet as the backbone neural network, and fine-tune the parameters of a model pre-trained on ImageNet. In Figure 4, we take one image of a bohemian waxwing (upper left corner) in the training set as an example. Seven images (i.e., the original image, the center-cropped image of the object, and five local images of the parts) are generated and resized to the same size (in this paper, both the width and the height are set to 224) to form seven groups of image sets. The center-cropped image and the five part images are assigned the same label as the original image. After training, we obtain seven learned models (i.e., the weights of the CNNs).
In the test phase, the same procedure is used to generate the test sets, except that the ground truth part regions are replaced by the detected and localized part regions as proposed in Section 3.1. The number of groups in the test sets is the same as in the training sets. For the images in each group of the test set, the corresponding learned model is used to predict which category the images belong to. Note that the parts that are not visible in the training set or that are not detected in the test set are ignored. The experimental results are illustrated in Section 4.2.
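The construction of the per-part image sets can be sketched as follows. This is only an illustration of the labeling rule (each part inherits the label of its source image, and missing parts are skipped); the sample layout with `image_id`, `label`, and `parts` keys and the function name `build_part_sets` are hypothetical, and the actual cropping and resizing of pixels is left out.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
PART_NAMES = ("head", "breast", "tail", "wing", "leg")


def build_part_sets(samples: List[dict]) -> Dict[str, List[Tuple[str, Box, int]]]:
    """Group part regions by part type, each carrying the object's class label.

    Every sample is assumed to be a dict with the keys 'image_id', 'label',
    and 'parts', where 'parts' maps a part name to a Box, or to None when the
    part is invisible (training) or undetected (test).
    """
    groups: Dict[str, List[Tuple[str, Box, int]]] = {n: [] for n in PART_NAMES}
    for sample in samples:
        for name in PART_NAMES:
            box: Optional[Box] = sample["parts"].get(name)
            if box is None:
                continue  # invisible or undetected parts are ignored
            groups[name].append((sample["image_id"], box, sample["label"]))
    return groups


# Two toy samples; the second one has no visible leg.
samples = [
    {"image_id": "a.jpg", "label": 7,
     "parts": {"head": (5, 5, 40, 40), "leg": (30, 80, 50, 100)}},
    {"image_id": "b.jpg", "label": 12,
     "parts": {"head": (10, 8, 60, 50), "leg": None}},
]
part_sets = build_part_sets(samples)
```

Each resulting group would then be cropped, resized, and used to fine-tune its own model, alongside the original and center-cropped image groups.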
3.2.2 Fine-grained recognition
In recent works, after obtaining the part regions, a straightforward method for fine-grained recognition is to design a multi-stream CNN framework for end-to-end recognition, as in [38, 28]. However, if some of the parts are not visible or not properly detected, these methods can easily face a label conflict problem in model training and prediction: the same empty features will correspond to different labels. Some machine learning algorithms (e.g., SVM and decision trees) are robust when learning from datasets with missing information. In this paper, to avoid the label conflict problem, we leverage libSVM to combine all of the features, owing to its convenience in parameter tuning.
In fine-grained recognition, the learned CNN models are used for extracting discriminative features. In the training stage, for each sample, the two object images (original and center-cropped) and the detected part images (possibly fewer than five parts) are fed to their respective learned models. Then, the activation tensors output by the ResNet pool5 layer, with a dimension of 4096 at the given input image size, are taken as the feature of each image. The missing features (corresponding to invisible parts) are set to zero vectors before all of the features are concatenated and used to train the SVM. In the prediction stage, the same features are extracted and concatenated, and the SVM classifier outputs the subcategory of each test image. Note that the missing features related to undetected parts are also replaced by zero vectors. We illustrate the detailed results in Section 4.2.
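The zero-filling and concatenation step can be illustrated with a short sketch. The stream names, their order, and the function name `concat_features` are assumptions made for this example; only the rule itself (a missing part contributes a zero vector of the same length) comes from the text.

```python
from typing import Dict, List, Optional, Sequence

# Fixed stream order so that every sample yields a vector with the same layout.
STREAMS = ("original", "cropped", "head", "breast", "tail", "wing", "leg")


def concat_features(features: Dict[str, Optional[Sequence[float]]],
                    dim: int) -> List[float]:
    """Concatenate per-stream features, using zeros for missing parts.

    `features` maps a stream name to its feature vector, or to None (or is
    missing the key) when the part is invisible or was not detected.
    """
    combined: List[float] = []
    for name in STREAMS:
        vec = features.get(name)
        if vec is None:
            combined.extend([0.0] * dim)  # placeholder for a missing part
            continue
        if len(vec) != dim:
            raise ValueError(f"stream {name!r} has length {len(vec)}, expected {dim}")
        combined.extend(float(v) for v in vec)
    return combined


# Toy example with 2-dimensional features and only the head detected.
sample = {"original": [1.0, 2.0], "cropped": [3.0, 4.0], "head": [5.0, 6.0]}
vector = concat_features(sample, dim=2)
```

Because the layout is fixed, the concatenated vectors of all samples have equal length and can be passed directly to a standard SVM trainer.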
4 Experimental results
In this section, we present the experimental results of the proposed FP-CNN on part detection and localization and on fine-grained recognition, using the widely used and challenging CUB200-2011 dataset. This dataset contains 200 categories and a total of 11,788 bird images. We split the dataset into three parts: 50% for training, 20% for validation, and the rest for testing.
4.1 Part detection and localization performance
From Section 3.1, we know that two thresholds play an important role in the performance of part detection and localization. We choose a relatively small threshold (i.e., ) for the training set, to ensure that efficient parts can be detected with higher probability. During the test stage, the criterion used to determine which parts are properly detected is twofold: 1) for each part type, only the detection that obtains the highest score is kept, and 2) the score of the detected part must be larger than the threshold set for the test phase. In this paper, we set . Some examples of bird part detection and localization are shown in Figure 5. We randomly select four birds that have been shown in Figure 3, so that readers can compare the ground truth and predicted part bounding boxes. From Figure 5, we can see that, although the pictures are taken at different scales, viewpoints, and backgrounds, the main parts are precisely detected and located in the majority of test images. In the last column, we also show some examples of parts that are not well detected due to the low scores they obtained. In Table 2, we give the localization accuracy of all types of parts using the Percentage of Correctly Localized Parts (PCP) metric as in [34, 28], and we also compare the PCP of the bird's head with recent works (the tail, breast, leg and wing were not detected in these works).
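The twofold test-time criterion can be sketched as a small filter over raw detections. The tuple layout `(part_name, score, box)`, the function name `select_parts`, and the threshold value used in the example are hypothetical; the value chosen in the paper is not reproduced here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]
Detection = Tuple[str, float, Box]  # (part_name, confidence score, box)


def select_parts(detections: List[Detection],
                 score_threshold: float) -> Dict[str, Tuple[float, Box]]:
    """Keep, per part type, the highest-scoring detection above the threshold."""
    best: Dict[str, Tuple[float, Box]] = {}
    for name, score, box in detections:
        if score <= score_threshold:
            continue  # rule 2: the score must exceed the test threshold
        if name not in best or score > best[name][0]:
            best[name] = (score, box)  # rule 1: one detection per part type
    return best


# Toy detections; 0.5 is an illustrative threshold, not the paper's value.
detections = [
    ("head", 0.90, (10, 10, 60, 60)),
    ("head", 0.60, (12, 14, 58, 66)),
    ("leg", 0.20, (40, 90, 55, 110)),
]
kept = select_parts(detections, score_threshold=0.5)
```

Parts that are absent from the returned dictionary are treated as undetected in the later stages.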
From Table 2, we can see that our method obtains the highest PCP (88.20%), improving on Mask-CNN by 1.44% and outperforming the other works by a significant margin. In addition, the tail, breast and wing are also located with high probability (their PCPs are all larger than 76%). The leg is the exception, obtaining a score of only 58.66%. A possible reason is that the feet of birds are similar in shape, texture and color to the places they inhabit (e.g., branches and grass).
|Images or parts||Test Accuracy||Average Loss|
4.2 Fine-grained Recognition
We first report the recognition results on the seven groups of datasets defined in Section 3.2. All the models are fine-tuned from the pretrained ResNet model in Caffe. Figure 6 shows the recognition accuracy on the validation set with respect to the iteration (a total of 50,000 iterations are conducted). The detailed recognition results on the test sets are shown in Table 3. We can see that the experiment on cropped images obtains the highest accuracy (82.70%) and the smallest loss (0.6779) among all groups of image sets. The accuracy on the head of birds (77.02%) outperforms the other four parts by a large margin, and it is comparable to the performance on the original images (78.92%) and the cropped images. Additionally, although the wing and breast are not sufficient to recognize the whole bird with high probability (both are approximately 50%), they do provide some useful information. The leg and tail obtain the lowest scores among all the parts, 31.72% and 29.48% respectively. From the experimental results, we can conclude that the bird's head contains more discriminative features than the other parts, whereas it is difficult to recognize birds by using the leg or tail alone.
The above analysis shows that different parts perform differently when they are used for recognition independently. Next, we compare the classification accuracies obtained with features extracted from different parts through incremental combination. That is, we set the combination of the original and cropped images as the baseline, then add one part image at a time in order of its performance (as shown in Table 3). The combined features are classified by libSVM as discussed in Section 3.2. The experimental results are shown in Table 4. As can be seen from Table 4, the classification accuracy generally increases as more part features are combined. The best performance (88.23%) appears at the combination of the baseline and three parts (i.e., the head, wing and breast) and is slightly superior (by 0.17%) to the feature combination that contains all the parts.
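The incremental combination scheme can be sketched as follows. The function name `incremental_combinations` is hypothetical. The head, leg and tail accuracies are the values reported in the text, while the wing and breast values are placeholders near 50%, since only approximate figures are given.

```python
from typing import Dict, List


def incremental_combinations(baseline: List[str],
                             part_accuracy: Dict[str, float]) -> List[List[str]]:
    """Start from the baseline and add one part at a time, best part first."""
    ordered = sorted(part_accuracy, key=part_accuracy.get, reverse=True)
    combos = [list(baseline)]
    for part in ordered:
        combos.append(combos[-1] + [part])
    return combos


accuracy = {
    "head": 77.02,    # reported in the text
    "wing": 50.0,     # placeholder: text says "approximately 50%"
    "breast": 50.0,   # placeholder: text says "approximately 50%"
    "leg": 31.72,     # reported in the text
    "tail": 29.48,    # reported in the text
}
combos = incremental_combinations(["original", "cropped"], accuracy)
for combo in combos:
    print(" + ".join(combo))
```

Each listed combination corresponds to one concatenated feature set that is classified separately, so the contribution of every added part can be read off from the change in accuracy.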
Finally, we compare the proposed FP-CNN method with state-of-the-art works on the CUB200-2011 dataset. The detailed results are presented in Table 5. In our method, we select the fourth combination in Table 4 as the final feature for fine-grained recognition. All the input images are resized as discussed in Section 3.2. Three types of state-of-the-art works are selected for comparison: 1) strongly supervised methods using both bounding box and part annotations [34, 13, 4], 2) strongly supervised methods using just one of the annotations ([38, 17, 28], and this paper), and 3) weakly supervised methods using only class labels [25, 20]. Our proposed method outperforms all of these state-of-the-art works in fine-grained recognition accuracy. Note that higher-resolution images can improve the classification accuracy of our method, as they provide more precise details. Although the two weakly supervised methods [25, 20] obtained attractive results, our method outperforms them by a clear margin (7.2% and 4.1% higher, respectively).
|Approaches||Train stage||Test stage||Model||Feature Len.||Image Size||Accuracy|
|Pose Normalized ||✓||✓||AlexNet||13512||75.70%|
In this paper, based on the part annotations provided with the dataset, ground truth part regions are generated for training an FP-CNN model, so that fine-granularity parts can be precisely detected and localized in test images. We then proposed a fine-grained recognition method using these fine-granularity parts. Experimental results show that the proposed method improves the state-of-the-art recognition performance on the widely used CUB200-2011 bird dataset. In the future, we will explore an accurate fine-granularity part localization method that does not rely on part annotations.
- [1] G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. arXiv:1610.01644, 2016.
- [2] H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In European Conference on Computer Vision, pages 836–849, 2012.
- [3] T. Berg, J. Liu, S. Woo Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2011–2018, 2014.
- [4] S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. In British Machine Vision Conference, 2014.
- [5] Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic segmentation and part localization for fine-grained categorization. In IEEE International Conference on Computer Vision, pages 321–328, 2013.
- [6] C.-C. Chang and C.-J. Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
- [7] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In IEEE International Conference on Computer Vision, pages 161–168, 2011.
- [8] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1 of Springer Series in Statistics. Springer, 2001.
- [9] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars. Fine-grained categorization by alignments. In IEEE International Conference on Computer Vision, pages 1713–1720, 2013.
- [10] T. Gebru, J. Krause, Y. Wang, D. Chen, J. Deng, and L. Fei-Fei. Fine-grained car detection for visual census estimation. In AAAI, volume 2, page 6, 2017.
- [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
- [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [13] S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked cnn for fine-grained visual categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1173–1182, 2016.
- [14] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489–1506, 2000.
- [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, pages 675–678, 2014.
- [16] J. Krause, T. Gebru, J. Deng, L.-J. Li, and L. Fei-Fei. Learning features and parts for fine-grained recognition. In International Conference on Pattern Recognition, pages 26–33, 2014.
- [17] J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5546–5555, 2015.
- [18] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
- [19] D. Lin, X. Shen, C. Lu, and J. Jia. Deep lac: Deep localization, alignment and classification for fine-grained recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1666–1674, 2015.
- [20] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear cnn models for fine-grained visual recognition. In IEEE International Conference on Computer Vision, pages 1449–1457, 2015.
- [21] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv:1306.5151, 2013.
- [22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
- [23] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [24] E. Rodner, M. Simon, G. Brehm, S. Pietsch, J. W. Wägele, and J. Denzler. Fine-grained recognition datasets for biodiversity analysis. In CVPR Workshop on Fine-grained Visual Classification, pages 1–3, 2015.
- [25] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In IEEE International Conference on Computer Vision, pages 1143–1151, 2015.
- [26] J. A. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
- [27] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Computation and Neural Systems Technical Report, 2011.
- [28] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
- [29] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, 2015.
- [30] L. Xie, Q. Tian, R. Hong, S. Yan, and B. Zhang. Hierarchical part matching for fine-grained visual categorization. In IEEE International Conference on Computer Vision, pages 1641–1648, 2013.
- [31] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv:1506.06579, 2015.
- [32] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and D. Metaxas. Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1143–1152, 2016.
- [33] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In European Conference on Computer Vision, pages 834–849, 2014.
- [34] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In European Conference on Computer Vision, pages 834–849, 2014.
- [35] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable part descriptors for fine-grained recognition and attribute prediction. In IEEE International Conference on Computer Vision, pages 729–736, 2013.
- [36] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian. Picking deep filter responses for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1134–1142, 2016.
- [37] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan. Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia, 19(6):1245–1256, 2017.
- [38] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.