Weakly-supervised Learning of Mid-level Features for Pedestrian Attribute Recognition and Localization
State-of-the-art methods treat pedestrian attribute recognition as a multi-label image classification problem. In previous work, the location information of person attributes is usually discarded or only coarsely encoded by rigidly splitting the whole body into parts. In this paper, we formulate the task in a weakly-supervised attribute localization framework. First, based on GoogLeNet, a set of mid-level attribute features is discovered by newly designed detection layers, which are trained with a max-pooling based weakly-supervised object detection technique using only image-level labels, without bounding box annotations of pedestrian attributes. Second, attribute labels are predicted by regression on the detection response magnitudes. Finally, the locations and rough shapes of pedestrian attributes are inferred by clustering a fusion of the activation maps of the detection layers, where the fusion weights are estimated as the correlation strengths between each attribute and its relevant mid-level features. Extensive experiments are performed on the two currently largest pedestrian attribute datasets, i.e. the PETA dataset and the RAP dataset. Results show that the proposed method achieves competitive performance on attribute recognition compared to other state-of-the-art methods. Moreover, the results of attribute localization are visualized to illustrate the characteristics of the proposed method.
1 Introduction
The recognition of pedestrian attributes, such as gender, glasses and clothing styles, has become a hot research topic in recent years because of its great application potential in video surveillance systems. Examples include pedestrian re-identification (ReID), where attributes serve as mid-level representations of a pedestrian and effectively improve ReID accuracy [8], and pedestrian retrieval, where the queried attributes are used to efficiently filter the targets of interest out of a large amount of video [18, 5].
Pedestrian attribute recognition in surveillance scenes is also a challenging problem, due to the low resolution of pedestrian samples cropped from far-range surveillance footage, the large pose variations arising from different viewing angles, occlusions by environmental objects, etc. Recently, convolutional neural networks (CNNs) have been applied to pedestrian attribute recognition [20, 16, 10], and high classification accuracies have been reported. In these works, pedestrian samples cropped from scenes are fed into an end-to-end CNN classifier that outputs multiple pedestrian attribute labels. Nevertheless, a number of problems deserve further study in order to improve attribute recognition. Firstly, some fine-scale attributes such as wearing glasses are hard to recognize owing to the small size of the positive samples. Secondly, the appearance features of these fine-scale attributes may easily be lost during the repeated alternation of convolution and max-pooling operations, so the final prediction layers of deep models cannot encode all the detailed features needed to predict fine-scale attributes correctly. Thirdly, the locations of some attributes can vary significantly within the cropped pedestrian sample. For example, when a pedestrian is carrying a bag, the vertical location of the bag may range from the arms to the knees, which makes traditional CNNs harder to train. Finally, the pedestrian may appear in an unusual region of the cropped image sample, whereas some previous methods assume that the pedestrian appears in the middle and occupies most of the image area, so that predefined spatial distributions of attributes can be exploited for better performance. This problem can be serious when the pedestrian samples are cropped automatically by a pedestrian detection algorithm.
Besides, previous work is largely unable to locate attributes, so the location information of attributes cannot be utilized by follow-up algorithms.
Considering the above difficulties, we formulate pedestrian attribute recognition in an attribute localization framework. We propose a Weakly-supervised Pedestrian Attribute Localization Network (WPAL-network) that infers the attribute labels from the detection results of a set of mid-level attribute-relevant features, instead of classifying whole pedestrian samples directly. The motivation is that both concrete and abstract attributes are related to particular kinds of mid-level semantic features, which may be obtained by deep learning at some high-level layers [9]. For example, whether a pedestrian is carrying a bag can be determined directly by detecting the appearance features of a carried bag, and female gender can easily be inferred if long hair or a miniskirt is detected. Recognizing attributes by flexibly detecting these features in an image, without resizing and warping, avoids the problems that pedestrians have different statures and may appear in unusual regions of the sample image.
Since labeling the exact locations of multiple attributes across a large dataset is costly, and some attributes such as wearing glasses have no unambiguous bounding box definition, the powerful fully-supervised object detection methods are not applicable. We therefore use only image-level attribute labels and conduct weakly-supervised learning to discover mid-level semantic features with a set of weakly-supervised detection layers. These layers are similar to the network structure proposed in [12], which focuses on general object detection with only image-level absence/presence labels. In this paper, we modify that structure to suit the pedestrian attribute localization problem. One difference is that we use Flexible Spatial Pyramid Pooling (FSPP) instead of the original global max-pooling, in order to add a spatial constraint for attributes such as hats. Another is that the structure lies in the middle stage of the network rather than at the top, so that the correlation between a detector and a target class is not fixed in advance but is free to be learnt during training.
With the trained WPAL-network, we can locate an attribute according to the responses of the detectors of the discovered mid-level attribute features. First, the correlation strength between each attribute and the mid-level features is estimated statistically over the training set using the trained network. Then, a rough shape of the attribute is estimated by superposing the activation maps of the mid-level detectors, weighted by their correlation strengths. Finally, the location of the attribute is predicted as the centroid of an activation cluster.
To demonstrate the effectiveness of the proposed network, extensive experiments are performed on the two large-scale pedestrian attribute datasets, i.e., PETA and RAP. Compared to the state-of-the-art methods, the WPAL-network achieves competitive performance. In addition, the attribute localization results are visualized to further explain the characteristics of the proposed method.
The contributions of this work are summarized as follows:
- We introduce a weakly-supervised object detection technique into the pedestrian attribute recognition task, achieving state-of-the-art accuracy.
- The proposed method can not only predict the existence labels of attributes but also locate the attributes, thereby providing location information for further applications.
The remainder of this paper is structured as follows. In Section 2, we review previous work related to different aspects of the proposed method. In Section 3, the WPAL-network is described in detail. In Section 4, we describe the method of attribute localization using the WPAL-network. In Section 5, we present experimental results on attribute recognition and attribute localization.
2 Related Work
In this section, we first review the developments in pedestrian attribute recognition. Then, we introduce some related work on weakly-supervised object detection, which inspired us to develop the new solution for attribute localization.
2.1 Pedestrian Attribute Recognition
Early work [14, 15, 1] on human attribute recognition usually treats attributes as independent labels and trains a classifier for each attribute independently. Deep learning models used in some later work enabled researchers to mine the relationships between attributes. Sudowe et al. proposed the ACN model in [16] to jointly learn all the attributes in a single model, and showed that parameter sharing can improve recognition accuracy over independently trained models. This approach is also adopted in the DeepMAR model proposed in [10] and in the WPAL-network in this work.
Another popular idea is to make use of part information to improve attribute recognition accuracy. In [2], part models like DPM and poselets are used to align input patches for CNNs. Sharma et al. propose an expanded parts model in [15] to learn a collection of part templates, which score an image partially using its most discriminative regions for classification. The MLCNN in [20] divides a human body into 15 parts and trains a CNN model for each of them, then chooses a subset of the models to contribute to the recognition of an attribute according to its spatial constraint prior. The DeepMAR* model described in [10] takes three block images as input in addition to the whole-body image, corresponding to the head-shoulder part, upper body and lower body of a pedestrian respectively. The idea of dividing the image into parts is adopted in the design of the WPAL-network: it motivates the use of flexible spatial pyramid pooling layers to help locate the mid-level features of some attributes in local patches only, rather than in the whole image.
2.2 Weakly-supervised Object Detection
To avoid the high cost of labeling the bounding boxes of objects, researchers have proposed various weakly-supervised learning approaches for object detection and localization. In [13], Pandey et al. demonstrate the capability of SVMs and deformable part models for weakly-supervised object detection. In [19], Wang et al. propose unsupervised latent category learning, which can discover latent information in backgrounds to help object localization in cluttered scenes. In [3], Cinbis et al. propose a multi-fold multiple-instance learning procedure that prevents weakly-supervised training from prematurely locking onto erroneous object locations.
In [12], the proposed network has convolution layers followed by a global max-pooling layer. Each channel of the global max-pooling layer is viewed as a detector for a certain class of object. It is assumed that the position of the maximum value in the feature map corresponds to the location where an object of the target class exists. However, this method cannot be directly applied to our attribute localization task. Firstly, unlike objects, some attributes are abstract concepts, such as gender, orientation and age, which do not correspond to particular regions. Secondly, some attributes such as wearing a hat or shoe style are expected to appear within a certain partition of a pedestrian sample, a prior that can be used to improve the localization of those attributes. Thus, to better fit the task of attribute localization, we embed this structure in the middle stage of the network to discover mid-level features relevant to attributes rather than the attributes themselves, and we propose to use FSPP layers instead of a single global max-pooling layer to help constrain the locations of certain attributes.
3 Weakly-supervised Pedestrian Attribute Localization Network
In this section, we describe the proposed WPAL-Network. The overall architecture is firstly illustrated and then detailed implementation is discussed.
3.1 Network Architecture
The framework of the WPAL-network is illustrated in Figure 1. The trunk convolution layers are derived from the GoogLeNet model [17] pretrained on ImageNet, as provided by Caffe [7]. In the original GoogLeNet, the inception4a/output, inception4d/output and inception5b/output layers are each connected to some branch layers. In the WPAL-network, each branch is replaced by a convolution layer followed by a flexible spatial pyramid pooling (FSPP) layer.
The FSPP layers play the role of the global max-pooling layers in [12] for the discovery of attribute-relevant features. Their mechanism is shown in Figure 2, where the SPP layers in [6] are extended by allowing bins to overlap and by making the number of bins at each pyramid level changeable. At the first pyramid level, there is only one bin for each FSPP layer, which outputs the maximal response of each convolution channel over the full image. At the second level, each of the three FSPP layers is divided into its own grid of bins, and max-pooling is performed within each bin. To avoid a high-dimensional output vector, we limit the height of the pyramid to 2. For one FSPP layer, each convolution channel produces a small vector whose dimension is the total number of bins over all pyramid levels. Finally, the small vectors of all FSPP layers and all channels are concatenated into a large vector, which is further regressed into a 51-dim vector (35-dim for the PETA dataset) corresponding to the attribute labels to be predicted.
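The pooling described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the network's actual implementation: the function names, the second-level grid of 3 x 1 bins and the overlap ratio are assumptions chosen for the example, and a real model would operate on tensors inside the deep learning framework.

```python
import math
from typing import List, Sequence, Tuple

Channel = Sequence[Sequence[float]]  # one convolution channel: rows x cols


def bin_edges(length: int, n_bins: int, overlap: float) -> List[Tuple[int, int]]:
    """Split [0, length) into n_bins windows, each widened by `overlap`
    (a fraction of the bin size) on both sides, so neighbours may overlap."""
    size = length / n_bins
    edges = []
    for b in range(n_bins):
        start = max(0, math.floor((b - overlap) * size))
        stop = min(length, math.ceil((b + 1 + overlap) * size))
        edges.append((start, stop))
    return edges


def fspp_channel(fmap: Channel, grid: Tuple[int, int], overlap: float) -> List[float]:
    """Two-level pyramid for one channel: one global bin, then grid[0] x grid[1] bins."""
    rows, cols = len(fmap), len(fmap[0])
    out = [max(max(row) for row in fmap)]  # level 1: global max-pooling
    for r0, r1 in bin_edges(rows, grid[0], overlap):  # level 2: local max-pooling
        for c0, c1 in bin_edges(cols, grid[1], overlap):
            out.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return out


def fspp(channels: Sequence[Channel], grid: Tuple[int, int], overlap: float) -> List[float]:
    """Concatenate the per-channel bin responses into one fixed-length vector."""
    vector: List[float] = []
    for fmap in channels:
        vector.extend(fspp_channel(fmap, grid, overlap))
    return vector
```

Because every bin is reduced to a single maximum, the output length depends only on the number of channels and bins, so feature maps of different spatial sizes yield vectors of the same length.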
To understand this architecture, we first discuss the function of the max-pooling operations in the FSPP layers. For each region of a convolution channel that corresponds to a bin in an FSPP layer, the max-pooling output indicates how likely it is that a certain mid-level feature exists, and the position of the maximal response is expected to be the location of that mid-level feature. The bins can thus be viewed as local detectors of the mid-level feature, except for the bin on the first level, which is a global detector. The subsequent fully-connected layers regress the vector of existence likelihoods of the mid-level features into the attribute vector, and the correlations between mid-level features and attributes are encoded in the learnt weights. The training procedure therefore has two tasks. The first is to learn the correlation between attributes and the randomly initialized mid-level detectors. The second is to adapt the target mid-level features of the detectors to fit the correlated attributes. These two tasks are solved simultaneously throughout the whole training process. With the learnt correlation, the detection results of the mid-level features can also be used for attribute localization (see Section 4).
If the network performs only global max-pooling over the full image rather than using the FSPP layers, it cannot achieve satisfactory attribute localization, although it works well for attribute classification (shown in Section 5.2 as WPAL-GoogleNet-GMP). The reason is that a single global max-pooling usually leads to multiple activation clusters at different locations, corresponding to multiple attribute-relevant mid-level features, which makes it difficult to determine which activation point should be used to infer the location of an attribute. The FSPP layers are proposed to address this problem, based on the observation that the mid-level features contributing to a certain attribute have a local spatial distribution. For example, the features relevant to the attribute of wearing a hat mostly appear in the upper part of a pedestrian. Thus, we apply multiple local max-pooling operations within the bins defined by the second pyramid level of the FSPP layers, independently of the global max-pooling at the first pyramid level. Note that we do not manually set which bin to favor for an attribute; this is also learnt during training. For example, for a detector that is expected to detect mid-level features contributing to upper-part attributes, the weights of the connections from the upper bins increase, since these bins are activated more consistently with the positive attribute labels, while the weights of the connections from the lower bins are suppressed by weight decay. For mid-level features that may appear equally in any part of the image, the bin on the first pyramid level of the FSPP layer, which is equivalent to global max-pooling, is favored.
In this work, the shape and size of the input image are not fixed, because the max-pooling operations in the FSPP layers turn feature maps of varying sizes into vectors of fixed dimensionality, determined only by the number of convolution channels and pooling bins. This means that the WPAL-network can process images of arbitrary resolution without warping or transforming them in preprocessing. Thus, the original shape information of the pedestrian body and of accessories is preserved.
3.2 Multi-level Learning
The trunk layers are pretrained on the ImageNet dataset in order to learn general features for a wide range of objects. As is well known, the abstraction level of the features increases with the depth of the convolution layers. However, pedestrian attributes lie at different scales and abstraction levels. For example, the orientation of the whole body is at a higher level than the attribute of wearing glasses. Therefore, we need to utilize information at different scales and abstraction levels to recognize multiple attributes. Here, the relatively general features learnt by three trunk convolution layers, i.e. inception4a/output, inception4d/output and inception5b/output, are transformed by the CONVx_E layers to fit the attribute features. Note that the learning level of each attribute is not specified explicitly; instead, the decision is learnt by training the fully-connected layers, similar to the learning of attribute-detector correlations.
3.3 Loss Function For Unbalanced Training Data
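For multi-label recognition tasks, the cross-entropy loss function is usually adopted:
$$Loss = -\sum_{i=1}^{L} \left[ p_i \log(\hat{p}_i) + (1 - p_i) \log(1 - \hat{p}_i) \right] \tag{1}$$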
However, in the pedestrian attribute datasets (e.g., the PETA dataset [4] and the RAP dataset [11]), the distributions of positive and negative labels in most attribute classes are imbalanced. Many attributes, such as wearing a V-neck, are seldom labeled positive in the training data. Using the objective function in Equation 1 may cause these attributes to be constantly predicted as negative. To address this problem, we introduce a weighted cross-entropy loss function as follows:
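$$Loss_w = -\sum_{i=1}^{L} \left[ \frac{1}{2 w_i} \, p_i \log(\hat{p}_i) + \frac{1}{2 (1 - w_i)} \, (1 - p_i) \log(1 - \hat{p}_i) \right] \tag{2}$$
where $L$ is the number of attributes; $p$ is the ground-truth attribute vector and $\hat{p}$ is the predicted attribute vector; $w$ is a weight vector whose entry $w_i$ gives the proportion of positive labels of the $i$-th attribute in the training set.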
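As a rough illustration of how such a weighting counteracts label imbalance, the sketch below scales the positive term of attribute i by 1/(2 w_i) and the negative term by 1/(2 (1 - w_i)), where w_i is the positive-label proportion. This particular scaling, the function name and the clipping constant `eps` are assumptions made for the example; with w_i = 0.5 the expression reduces to the ordinary cross-entropy.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    p: Sequence[int],        # ground-truth labels, 0 or 1 per attribute
    p_hat: Sequence[float],  # predicted probabilities per attribute
    w: Sequence[float],      # positive-label proportion per attribute, in (0, 1)
    eps: float = 1e-12,      # assumed clipping constant to keep log() finite
) -> float:
    loss = 0.0
    for p_i, q_i, w_i in zip(p, p_hat, w):
        q_i = min(max(q_i, eps), 1.0 - eps)
        # Rare positives (small w_i) get a large weight on the positive term.
        positive_term = p_i * math.log(q_i) / (2.0 * w_i)
        negative_term = (1 - p_i) * math.log(1.0 - q_i) / (2.0 * (1.0 - w_i))
        loss -= positive_term + negative_term
    return loss
```

For an attribute that is positive in only 10% of the training samples, a missed positive is penalized five times more heavily than under the unweighted loss, which discourages the trivial all-negative prediction.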
4 Attribute Localization & Shape Estimation
To locate an attribute, we first determine the strength of the correlation between the attribute and the bins of the mid-level feature detectors. For each detector bin, the correlation strength is calculated as the ratio of the average score on positive samples to the average score on negative samples. This process is formulated as Algorithm 1.
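The ratio described above can be sketched as follows. The data layout (one row of bin scores per training sample), the function name and the small constant `eps` that guards against division by zero are assumptions for illustration; scores are assumed to be non-negative detector responses.

```python
from typing import List, Sequence


def correlation_strengths(
    scores: Sequence[Sequence[float]],  # scores[n][b]: response of bin b on sample n
    labels: Sequence[int],              # labels[n]: 1 if the attribute is present
    eps: float = 1e-6,                  # assumed guard against a zero denominator
) -> List[float]:
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    strengths = []
    for b in range(len(scores[0])):
        pos_mean = sum(s[b] for s in positives) / max(len(positives), 1)
        neg_mean = sum(s[b] for s in negatives) / max(len(negatives), 1)
        # A ratio well above 1 means the bin fires mostly when the attribute is present.
        strengths.append(pos_mean / (neg_mean + eps))
    return strengths
```

A bin that responds equally on positive and negative samples gets a strength close to 1 and contributes little discriminative information about the attribute.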
Then, an existence possibility map of the attribute is estimated by superposing the weighted activation maps, each masked by a Gaussian filter, where the weights are the normalized correlation strengths, as shown in Algorithm 2. The extent of the active region in this map indicates the rough shape of the attribute.
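A minimal sketch of this superposition is given below. Several details are assumptions for illustration rather than the exact procedure: all activation maps are taken to be already resized to a common resolution, the Gaussian mask is centred on each map's peak response, and `sigma` is a free parameter.

```python
import math
from typing import List, Sequence, Tuple

Map2D = Sequence[Sequence[float]]


def argmax2d(m: Map2D) -> Tuple[int, int]:
    """Return (row, col) of the first maximal entry."""
    best = (0, 0)
    for y, row in enumerate(m):
        for x, v in enumerate(row):
            if v > m[best[0]][best[1]]:
                best = (y, x)
    return best


def gaussian_mask(h: int, w: int, center: Tuple[int, int], sigma: float) -> List[List[float]]:
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2)) for x in range(w)]
        for y in range(h)
    ]


def existence_map(maps: Sequence[Map2D], strengths: Sequence[float], sigma: float = 2.0) -> List[List[float]]:
    """Weighted sum of Gaussian-masked activation maps (weights sum to 1)."""
    total = sum(strengths)
    h, w = len(maps[0]), len(maps[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for amap, strength in zip(maps, strengths):
        weight = strength / total  # normalized correlation strength
        mask = gaussian_mask(h, w, argmax2d(amap), sigma)
        for y in range(h):
            for x in range(w):
                fused[y][x] += weight * amap[y][x] * mask[y][x]
    return fused
```

The mask suppresses activations far from each detector's peak, so the fused map is dominated by the peaks of the most strongly correlated detectors.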
To locate the coordinates of the attribute, we perform clustering on the existence possibility map: the coordinates of the pixels whose values are greater than the average value are collected for a weighted K-means clustering procedure, in which the pixel values also serve as the weights of the coordinate samples. Finally, the several largest clusters are chosen as the candidates for the attribute location. The number of candidate clusters depends on the type of the attribute (e.g. 2 candidate clusters for shoes and 1 candidate cluster for a hat).
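The clustering step can be sketched as below. As simplifying assumptions for the example, the number of clusters is set equal to the number of candidates, the centres are initialized deterministically at the heaviest pixels, and a fixed number of iterations is run; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (row, col)


def weighted_kmeans(points: Sequence[Point], weights: Sequence[float], k: int, iters: int = 20) -> List[Point]:
    """Weighted K-means; returns cluster centres sorted by total weight, largest first."""
    k = min(k, len(points))
    order = sorted(range(len(points)), key=lambda i: -weights[i])
    centers = [points[i] for i in order[:k]]  # assumed init: the k heaviest pixels
    masses = [0.0] * k
    for _ in range(iters):
        sums = [[0.0, 0.0, 0.0] for _ in range(k)]  # mass, mass * row, mass * col
        for (y, x), w in zip(points, weights):
            c = min(range(k), key=lambda j: (y - centers[j][0]) ** 2 + (x - centers[j][1]) ** 2)
            sums[c][0] += w
            sums[c][1] += w * y
            sums[c][2] += w * x
        for j, (mass, sy, sx) in enumerate(sums):
            if mass > 0:
                centers[j] = (sy / mass, sx / mass)  # weighted centroid
            masses[j] = mass
    ranked = sorted(range(k), key=lambda j: -masses[j])
    return [centers[j] for j in ranked]


def locate_attribute(existence: Sequence[Sequence[float]], n_candidates: int) -> List[Point]:
    """Cluster the above-average pixels of an existence map into candidate locations."""
    values = [v for row in existence for v in row]
    mean = sum(values) / len(values)
    points, weights = [], []
    for y, row in enumerate(existence):
        for x, v in enumerate(row):
            if v > mean:
                points.append((float(y), float(x)))
                weights.append(v)
    return weighted_kmeans(points, weights, n_candidates)
```

For an attribute such as shoes, calling `locate_attribute(existence, 2)` on a map with two separate blobs near the bottom would return one centre per blob, with the heavier blob listed first.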
5 Experiments
5.1 Datasets and Evaluation Protocols
Extensive experiments have been conducted on the two large-scale pedestrian attribute datasets, i.e., the PETA dataset [4] and the RAP dataset [11]. The PETA dataset includes 19,000 pedestrian samples, each annotated with 65 attributes. Following the protocol in [4], we also select 35 binary attributes for evaluation. The RAP dataset is the largest pedestrian attribute dataset so far, including 41,585 samples with 72 attributes. As implemented in [11], 51 binary attributes are used for evaluation. In the test phase, images are rescaled so that their longest side has a fixed length while the aspect ratio is preserved, i.e. without warping.
We adopt the mean accuracy (mA) as well as the example-based criteria proposed in [11] as evaluation metrics. The mA is formulated as:
$$mA = \frac{1}{2L} \sum_{i=1}^{L} \left( \frac{TP_i}{P_i} + \frac{TN_i}{N_i} \right)$$
where $L$ is the number of attributes, $TP_i$ and $TN_i$ are respectively the numbers of correctly predicted positive and negative samples of the $i$-th attribute, and $P_i$ and $N_i$ are respectively the numbers of ground-truth positive and negative samples of the $i$-th attribute.
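A minimal sketch of the mA computation on binary label matrices follows. The function name and the convention of counting a ratio as 0 when an attribute has no positive (or no negative) ground-truth samples are assumptions for the example.

```python
from typing import Sequence


def mean_accuracy(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """y_true[n][i], y_pred[n][i]: 0/1 label of attribute i for sample n."""
    n_attr = len(y_true[0])
    total = 0.0
    for i in range(n_attr):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 1 and p[i] == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 0 and p[i] == 0)
        n_pos = sum(1 for t in y_true if t[i] == 1)
        n_neg = len(y_true) - n_pos
        # Assumed convention: an empty class contributes 0 instead of dividing by zero.
        recall_pos = tp / n_pos if n_pos else 0.0
        recall_neg = tn / n_neg if n_neg else 0.0
        total += recall_pos + recall_neg
    return total / (2 * n_attr)
```

Because the positive and negative recalls are averaged with equal weight, each attribute contributes equally to mA regardless of how imbalanced its labels are.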
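The example-based evaluation criteria are defined as:
$$Acc = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap f(x_i)|}{|Y_i \cup f(x_i)|}$$
$$Prec = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap f(x_i)|}{|f(x_i)|}$$
$$Rec = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap f(x_i)|}{|Y_i|}$$
$$F1 = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec}$$
where $N$ is the number of samples, $Y_i$ is the set of ground-truth positive attribute labels of the $i$-th sample, $f(x_i)$ is the set of predicted positive attribute labels of the $i$-th sample, and $|\cdot|$ denotes the set cardinality.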
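The example-based criteria can be sketched with Python sets, one set of positive attribute names per sample. The function name and the convention of scoring a ratio as 0 when its denominator is empty are assumptions for illustration.

```python
from typing import Dict, Sequence, Set


def example_based_metrics(truth: Sequence[Set[str]], predicted: Sequence[Set[str]]) -> Dict[str, float]:
    """truth[i] / predicted[i]: positive attribute labels of the i-th sample."""
    n = len(truth)
    acc = prec = rec = 0.0
    for y, f in zip(truth, predicted):
        hits = len(y & f)
        # Assumed convention: an empty denominator contributes 0 to the average.
        acc += hits / len(y | f) if (y | f) else 0.0
        prec += hits / len(f) if f else 0.0
        rec += hits / len(y) if y else 0.0
    acc, prec, rec = acc / n, prec / n, rec / n
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```

Unlike mA, these scores are computed per sample, so a spurious positive prediction directly lowers the precision of that sample.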
5.2 Recognition Performance
For comparison, the approaches presented in [11] are used as benchmarks, including ACN [16], DeepMAR [10], DeepMAR* [10] and an SVM with CNN features. The performance on the PETA and RAP datasets is listed in Table 1. The WPAL-network performs quite well in terms of mA, while it shows some weakness under the example-based criteria on the RAP dataset. This is because the mA criterion is less affected by false alarms on classes that have far fewer samples than their opposite class. Consider the $i$-th attribute with very few positive samples. False positive predictions (FP) do not affect the positive term $TP_i / P_i$ in the mA formula, but they reduce the number of true negatives $TN_i$, so the negative term $TN_i / N_i$ decreases. However, the number of ground-truth negatives $N_i$ is so large that the number of false positives is almost negligible compared to it, and thus the influence on the total value of mA is limited. On the other hand, the example-based evaluation criteria have explicit precision terms, so the influence of false alarms is normalized. Therefore, the mA and the example-based criteria reflect different characteristics of an algorithm, and the choice between them should depend on the application scenario. The higher mA also demonstrates that the learnt mid-level features are effective in describing the visual characteristics of most pedestrian attributes.
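The effect can be checked with a small worked example. The counts below (10 positives, 990 negatives, 8 true positives, 40 false positives) are made-up numbers for illustration, and the precision shown is a simple per-attribute precision used as a stand-in for the precision terms of the example-based criteria.

```python
def ma_term(tp: int, fp: int, n_pos: int, n_neg: int) -> float:
    """Contribution of one attribute to mA: mean of positive and negative recall."""
    tn = n_neg - fp
    return 0.5 * (tp / n_pos + tn / n_neg)


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


# Hypothetical rare attribute: 10 positives, 990 negatives, 8 positives detected.
without_false_alarms = ma_term(tp=8, fp=0, n_pos=10, n_neg=990)
with_false_alarms = ma_term(tp=8, fp=40, n_pos=10, n_neg=990)
print(round(without_false_alarms, 4), round(with_false_alarms, 4))
print(round(precision(8, 0), 4), round(precision(8, 40), 4))
```

With these numbers, 40 false alarms lower the mA term only from 0.9 to about 0.88, while the precision falls from 1.0 to about 0.17, which is the asymmetry the paragraph describes.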
We also compare the recognition accuracies of individual attributes between our model and the benchmarks. The accuracy distribution of these models is shown in Figure 3, and Table 2 lists selected attributes for which the recognition accuracy of our model differs markedly from that of the best benchmark. The recognition performance on all attributes can be found in the supplemental materials.
5.3 Attribute Localization and Shape Estimation
Since there is no ground-truth data with which to evaluate attribute localization quantitatively, we visualize some examples of pedestrian attribute localization. In Figure 4, the first and second rows show some successful examples, and the last row shows some failure cases. As shown in the figure, some fine-scale attributes, such as glasses, hats and shoes, can be located correctly, which suggests the effectiveness of the proposed method.
5.4 Body Shape Estimation
In practice, besides the pedestrian body region, the cropped image sample may contain unrelated background content. The problem becomes more serious under large pose variations and occlusions by environmental objects. We therefore expect a recognition algorithm to be able to eliminate such unrelated content by estimating a rough region of the pedestrian body. Here, based on the existence possibility maps of the learnt mid-level features, we test this capability by estimating the informative region that contributes most to the recognition of pedestrian attributes. Figure 5 shows two examples. The informative regions in both samples overlap closely with the pedestrians' bodies, which illustrates that the proposed approach attends to the right regions when recognizing pedestrian attributes.
5.5 Correlation Strength Observation
|Attribute Name|Strongly-correlated Low-level Detector Bin (Rank - Level)|||||
|Action: Calling|- Lev.2|- Lev.2|- Lev.2|- Lev.2|- Lev.2|
|Has Black Hair|- Lev.1|- Lev.1|- Lev.1|- Lev.1|- Lev.1|
|Orientation: Front|- Lev.1|- Lev.1|- Lev.1|- Lev.1|- Lev.1|
|Has Glasses|- Lev.2|- Lev.1|- Lev.1|- Lev.1|- Lev.2|
|Upper Part: Black|- Lev.1|- Lev.1|- Lev.1|- Lev.1|- Lev.1|
|Upper Part: Red|- Lev.2|- Lev.2|- Lev.1|- Lev.2|- Lev.1|
|Lower Part: Red|- Lev.2|- Lev.2|- Lev.2|- Lev.1|- Lev.2|
Although we expected low-level detectors to help the recognition of low-level attributes, the matrix of correlation strengths between attributes and mid-level feature detectors shows that high-level detectors still play the most significant role in prediction, for high-level and low-level attributes alike. However, for some attributes, a number of low-level detector bins rank relatively high in the list of bins sorted by correlation strength, meaning that low-level features do contribute strongly to the recognition of these attributes. Table 3 shows some selected attributes together with their top-5 highest-ranking low-level detector bins.
6 Conclusion and Future Work
In this work, we formulate pedestrian attribute recognition in a weakly-supervised object detection framework for the first time. A novel WPAL-network is proposed for the tasks of pedestrian attribute recognition and localization. Instead of directly predicting multiple attributes, we first discover a set of mid-level attribute-relevant features, and then predict attributes based on the responses of these features. Furthermore, the activation maps of these features can be used to infer the location and rough shape of an attribute. The competitive recognition performance on the two large-scale attribute datasets demonstrates the effectiveness of the proposed WPAL-network.
In the future, we will seek more powerful detectors that utilize additional information, such as background context and the spatial relationships between discovered mid-level features, to improve accuracy and to address recognition failures on attributes like long hair.
References
- [1] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In 2011 International Conference on Computer Vision, pages 1543–1550. IEEE, 2011.
- [2] L. D. Bourdev. Pose-aligned networks for deep attribute modeling, Feb. 7 2014. US Patent App. 14/175,314.
- [3] R. G. Cinbis, J. Verbeek, and C. Schmid. Multi-fold MIL training for weakly supervised object localization. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2409–2416. IEEE, 2014.
- [4] Y. Deng, P. Luo, C. C. Loy, and X. Tang. Pedestrian attribute recognition at far distance. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 789–792. ACM, 2014.
- [5] R. Feris, R. Bobbitt, L. Brown, and S. Pankanti. Attribute-based people search: Lessons learnt from a practical surveillance system. In Proceedings of International Conference on Multimedia Retrieval, page 153. ACM, 2014.
- [6] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
- [7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
- [8] R. Layne, T. M. Hospedales, and S. Gong. Person re-identification by attributes. In BMVC, volume 2, page 8, 2012.
- [9] Q. V. Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8595–8598. IEEE, 2013.
- [10] D. Li, X. Chen, and K. Huang. Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. Proc. ACPR, 2015.
- [11] D. Li, Z. Zhang, X. Chen, H. Ling, and K. Huang. A richly annotated dataset for pedestrian attribute recognition. arXiv preprint arXiv:1603.07054, 2016.
- [12] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
- [13] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In 2011 International Conference on Computer Vision, pages 1307–1314. IEEE, 2011.
- [14] G. Sharma and F. Jurie. Learning discriminative spatial representation for image classification. In BMVC 2011 - British Machine Vision Conference, pages 1–11. BMVA Press, 2011.
- [15] G. Sharma, F. Jurie, and C. Schmid. Expanded parts model for human attribute and action recognition in still images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–659, 2013.
- [16] P. Sudowe, H. Spitzer, and B. Leibe. Person attribute recognition with a jointly-trained holistic CNN model. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 87–95, 2015.
- [17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- [18] D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur, and M. Turk. Attribute-based people search in surveillance environments. In Applications of Computer Vision (WACV), 2009 Workshop on, pages 1–8. IEEE, 2009.
- [19] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In European Conference on Computer Vision, pages 431–445. Springer, 2014.
- [20] J. Zhu, S. Liao, D. Yi, Z. Lei, and S. Z. Li. Multi-label CNN based pedestrian attribute learning for soft biometrics. In 2015 International Conference on Biometrics (ICB), pages 535–540. IEEE, 2015.