Occluded Person Re-identification
Person re-identification (re-id) suffers from a serious occlusion problem when applied to crowded public places. In this paper, we propose to retrieve full-body person images by using a person image with occlusions as the query. This differs significantly from the conventional person re-id problem, where it is assumed that person images are detected without any occlusion. We thus call this new problem occluded person re-identification. To address it, we propose a novel Attention Framework of Person Body (AFPB) based on deep learning, consisting of 1) an Occlusion Simulator (OS), which automatically generates artificial occlusions for full-body person images, and 2) multi-task losses that force the neural network not only to discriminate a person's identity but also to determine whether a sample is from the occluded data distribution or the full-body data distribution. Experiments on a new occluded person re-id dataset and three existing benchmarks, modified to include full-body person images and occluded person images, show the superiority of the proposed method.
|Jiaxuan Zhuo, Zeyu Chen, Jianhuang Lai* (*corresponding author), Guangcong Wang|
|Sun Yat-Sen University, Guangzhou, P.R. China|
|XinHua College, Sun Yat-sen University, Guangzhou, P.R. China|
|Guangdong Key Laboratory of Information Security Technology|
|Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education|
Index Terms— Occluded Person Re-identification, Attention Framework of Person Body, Occlusion Simulator, Multi-task Losses
Person re-identification (re-id) aims to re-identify a target person across multiple non-overlapped cameras, which has been applied to enhance the security in many important public spaces, especially crowded ones, e.g., airports, railway stations, malls and hospitals. However, when conducting person re-id in these crowded places, occlusion is an unavoidable problem. For example, a person/criminal may be occluded by other persons in the scene, or static obstacles such as cars, pillars, walls, etc. Considering the significance of the occlusion problem, it is essential to seek an effective method to search full-body person images given a person image with occlusions as a probe (Fig. 1). We call this the occluded person re-identification problem.
Three practical challenges make occluded person re-id difficult. First, occlusions cause not only the loss of target information but also interference from the occluding content. Occlusions vary in color, size, shape and position, and they degrade global representations for person re-id, so it is hard to learn a robust feature representation for occluded persons. Second, one may resort to local/part-based representations for the occluded person images. An intuitive method is to detect non-occluded body parts using body part detectors and then match the corresponding body parts in the gallery. However, training the body part detectors requires extra annotations. Even worse, the occluded body parts are sometimes the key discriminative parts, while the non-occluded body parts share a similar appearance. Third, most existing methods implicitly assume that the full-body appearance of a person is readily available and treat a person image with occlusions as an invalid sample, so there are few public datasets for occluded person re-id on which to learn a suitable model, especially a deep one.
To solve this challenging problem, we propose a novel deep learning framework for occluded person re-id, called Attention Framework of Person Body (AFPB) (Fig. 2). Specifically, the AFPB consists of two components. First, an Occlusion Simulator (OS) automatically generates plenty of artificial occluded person images by randomly adding background patches to full-body person images. These images form an artificial occlusion set. The artificial occlusion set and the source (full-body) set are then jointly used to learn a robust feature representation for occluded person re-id. Second, multi-task losses, i.e., an identification loss and an occluded/non-occluded binary classification (OBC) loss, are integrated into the AFPB framework. Surprisingly, the simple OBC loss brings an impressive improvement by determining whether a sample is from the artificial occlusion set or from the source images.
The AFPB method can be formulated as an "attention" model by resorting to the OS module. Different from the conventional attention scheme, the AFPB framework gradually pays attention to the person body by watching different kinds of occluded persons generated by the OS module. The AFPB framework can also be implicitly explained as an encoder of prior knowledge that allows one to integrate extra expert knowledge into the deep framework. Specifically, the OS module of the AFPB integrates occlusion information by generating many artificial, knowledge-related samples that mimic the real world, while the OBC loss determines whether a sample is from the occluded person set or the full-body person set, so as to encode prior information into the framework. In addition, the identification loss discriminates the person identity of an image regardless of whether the person is occluded. This constraint also forces the framework to focus on person body parts.
The AFPB method can be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe). In summary, this paper makes three main contributions.
It is the first attempt to define the occluded person re-id problem which commonly occurs in realistic scenes and applications.
For the occluded person re-id task, we propose an Attention Framework of Person Body (AFPB) which gradually pays attention to the person body by watching different kinds of occluded persons. An Occlusion Simulator (OS) is used to generate artificial samples that encode prior knowledge, while multi-task losses are used to learn a feature representation that is robust to occlusion.
2 Related Work
Typical person re-id works mainly consist of two steps: feature extraction and metric learning. The first step aims to extract a robust and distinctive feature representation that is invariant to challenges such as illumination, viewpoint and occlusion. The second step learns metrics or subspaces for better matching, such that distances between samples of the same identity are smaller than those between samples of different identities.
Recently, with the development of deep learning [3], three kinds of network frameworks have been applied to person re-id: classification networks, siamese networks and triplet networks. Classification networks regard person re-id as a classification problem and directly extract discriminative features, exploiting the strong performance of Convolutional Neural Networks (CNNs) on large-scale datasets. For example, Xiao et al. [4] jointly trained a classification model on multiple domains using a Domain Guided Dropout (DGD) method to improve performance. Siamese networks take image pairs as input and compute the similarity using a contrastive loss. For example, Ahmed et al. [5] computed the neighborhood differences of an image pair to learn a metric indicating whether the two input images depict the same person. Triplet networks, which exploit the similarity relations among three input images, are an extension of siamese networks. Based on [6], Wang et al. [7] developed a point-to-set triplet for image-to-video person re-id.
Although person re-id has been studied extensively [7, 4, 5, 6, 8, 9, 10, 11], few works attempt to solve the occluded person re-id problem. Among the works that share a similar idea to ours, partial person re-id [12] aims to match a partial probe image with full-body gallery images, which is a step toward addressing the occlusion problem in person re-id. However, it only focuses on matching non-occluded body parts against full-body images. Some critical problems remain: 1) the output of existing pedestrian detectors for an occluded person image often includes part of the person body together with the occluding objects, rather than a partial body alone. That is, occlusions have to be removed by a manual cropping operation in [12], which is unrealistic in practice; 2) the patch-based matching method proposed in [12] requires a large amount of computation and does not consider an attention scheme. Our CNN-based work differs from [12] in that we directly match occluded person images with full-body person images and propose the AFPB framework, which automatically focuses on the person body by watching various occluded person data generated by the OS.
We propose a deep learning framework, Attention Framework of Person Body (AFPB), as shown in Fig. 2. The AFPB includes two main components: the Occlusion Simulator (OS) and multi-task losses. The OS generates artificial occluded person data from the source (full-body person) data in order to simulate a variety of occluded cases. Next, the full-body person data and the occluded person data are jointly used to train the CNN with multi-task losses, i.e., an identification loss and an occluded/non-occluded binary classification (OBC) loss. In general, the AFPB forces the feature representation to pay more attention to person body parts by encoding prior information about occlusion into the framework. In this section, we detail these stages and visualize the attention results.
3.1 Occlusion Simulator (OS)
As mentioned in Section 1, occluded person re-id data are scarce, so it is hard to train a suitable deep model for occluded person re-id. A natural question is whether plenty of occluded person re-id data can be created from existing data. Our answer is to automatically generate artificial occluded person data from the full-body person data. This not only simulates a variety of occluded cases but also brings diversified information to the whole system. Based on this idea, we design an Occlusion Simulator (OS) to generate artificial occluded person images. Next, we jointly train a CNN on the source (full-body) person data and the occluded person data. By watching various occluded person images, the network pays more attention to person body parts. The specific implementation is as follows.
Suppose we have an original full-body data set X, which consists of N images of M identities. Let $\{(x_i, y_i)\}_{i=1}^{N}$ denote all the samples in X, where $x_i$ is the $i$-th person image and $y_i$ is its identity. The Occlusion Simulator can be formulated as an image-to-image mapping function $F: X \rightarrow Z$, where Z is an artificial occluded person data set generated from the real image set X. The mapping F is achieved in a simple but effective way: a random patch from the background of the source images is used as an occlusion to cover a part of the person body. Let $\{(z_i, y_i)\}_{i=1}^{N}$ denote all the samples in Z, where $z_i$ is the occluded person image generated from $x_i$. We finally merge X and Z into a combined set. The procedure is shown in Algorithm 1.
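The mapping F described above can be illustrated with a minimal Python sketch. Algorithm 1 is not reproduced in this text, so the details below are assumptions rather than the authors' procedure: the "background" is approximated by the left or right border strip of the same image, the patch is a horizontal band whose height is a random fraction of the image height, and the names `simulate_occlusion` and `build_combined_set` are hypothetical.

```python
import numpy as np

def simulate_occlusion(image, rng, min_frac=0.2, max_frac=0.5, border=0.15):
    """Return a copy of `image` (H x W x C) with one artificial occlusion.

    Assumptions (not from the paper): the background is approximated by a
    border strip of the image, and the occlusion is a horizontal band whose
    height lies between min_frac and max_frac of the image height.
    """
    h, w = image.shape[:2]
    ph = max(1, int(rng.integers(int(min_frac * h), int(max_frac * h) + 1)))
    strip_w = max(1, int(border * w))
    # Pick the source strip (left or right border) and a vertical offset.
    x_src = 0 if rng.random() < 0.5 else w - strip_w
    y_src = int(rng.integers(0, h - ph + 1))
    strip = image[y_src:y_src + ph, x_src:x_src + strip_w]
    # Tile the strip horizontally so the patch spans the full image width.
    reps = -(-w // strip_w)  # ceiling division
    patch = np.tile(strip, (1, reps, 1))[:, :w]
    # Paste the patch at a random vertical position.
    y_dst = int(rng.integers(0, h - ph + 1))
    occluded = image.copy()
    occluded[y_dst:y_dst + ph] = patch
    return occluded

def build_combined_set(images, labels, seed=0):
    """Merge the source set X with its simulated occlusion set Z."""
    rng = np.random.default_rng(seed)
    z = [simulate_occlusion(x, rng) for x in images]
    # Occlusion flag: 0 for source images, 1 for simulated occlusions.
    flags = [0] * len(images) + [1] * len(z)
    return list(images) + z, list(labels) + list(labels), flags
```

Each generated image keeps the identity label of its source image, and the returned flags are the kind of occluded/non-occluded labels a binary classifier could be trained on.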
We aim to learn a generic feature extractor that brings descriptors of the same person closer together while making those of different persons more distinct. In our framework, we train a CNN with an identification loss to recognize the identity of each person. When only the full-body set X, with samples $(x_i, y_i)$, $i = 1, \dots, N$, is used as training data, the objective function is
$$\min_{f} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big),$$
where $f$ is the identification classifier in person re-id and $\ell$ is the identification loss function. After generating the artificial occlusion set Z from X, with $z_i$ the occluded image generated from $x_i$, we combine the two sets, so the objective function is given by
$$\min_{f} \; \frac{1}{2N} \sum_{i=1}^{N} \Big[ \ell\big(f(x_i), y_i\big) + \ell\big(f(z_i), y_i\big) \Big].$$
It can be seen that the objective function pulls both $f(x_i)$ and $f(z_i)$ toward $y_i$, so it forces $f(z_i)$ to be more similar to $f(x_i)$. Intuitively, by watching many occluded person images together with the source images, the network learns to pay more attention to the key person body parts rather than to occlusions or backgrounds.
3.2 Multi-task losses
Along with the identification loss, an occluded/non-occluded binary classification (OBC) loss is used to determine whether a sample is from the occluded person distribution or the full-body person distribution, so that our framework identifies the person while also discriminating whether the person body is occluded. With the two losses integrated into a unified framework, the AFPB learns to extract a robust and discriminative feature representation for occluded person re-id. In this way, the OBC loss encodes prior information about occlusion into the framework.
We treat person re-id as a classification problem and use the softmax loss as the identification loss. Suppose the original full-body set has K identities; the K-class softmax loss of the person re-id classifier is given by
$$L_{id} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp(p_{i, y_i})}{\sum_{k=1}^{K} \exp(p_{i,k})},$$
where $n$ is the number of training samples and $p_{i,k}$ is the prediction score in the person re-id classifier of the $i$-th training sample for the $k$-th class. Similarly, the OBC loss is given by
$$L_{obc} = -\frac{1}{n} \sum_{i=1}^{n} \Big[ o_i \log q_i + (1 - o_i) \log (1 - q_i) \Big],$$
where $q_i$ is the prediction score in the occlusion classifier of the $i$-th training sample, and $o_i \in \{0, 1\}$, where $o_i = 1$ indicates occluded persons and $o_i = 0$ otherwise. Combining the identification loss and the OBC loss, the multi-task losses are formulated as
$$L = \alpha L_{id} + (1 - \alpha) L_{obc},$$
where $\alpha$ is a hyperparameter that balances the two losses. Generally, it is reasonable to give $L_{id}$ the larger weight, because $L_{id}$ is the main objective and $L_{obc}$ assists it. With the multi-task losses, the CNN is able to identify the person regardless of whether the person is occluded. That is, if a full-body person image is available, the network can exploit the entire person structure information; if a person is occluded, the network can focus on the key body parts.
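The combination of the two losses can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: the losses are mixed as `alpha * L_id + (1 - alpha) * L_obc`, the occlusion classifier outputs a single logit per sample that is passed through a sigmoid, and the default `alpha=0.8` is simply a value inside the 0.7 to 0.9 range reported in the parameter analysis. All function names are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(scores, targets):
    """Mean softmax cross-entropy for scores (N x K) and integer targets (N,)."""
    scores = np.asarray(scores, dtype=np.float64)
    targets = np.asarray(targets)
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def binary_cross_entropy(logits, flags):
    """Mean sigmoid cross-entropy for logits (N,) and 0/1 flags (N,)."""
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(flags, dtype=np.float64)
    # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    loss = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(loss.mean())

def multi_task_loss(id_scores, id_labels, obc_logits, occ_flags, alpha=0.8):
    """Return (total, L_id, L_obc) with total = alpha*L_id + (1-alpha)*L_obc.

    The convex mixing form and the default alpha are assumptions.
    """
    l_id = softmax_cross_entropy(id_scores, id_labels)
    l_obc = binary_cross_entropy(obc_logits, occ_flags)
    return alpha * l_id + (1.0 - alpha) * l_obc, l_id, l_obc
```

With all-zero scores the identification loss equals log K and the OBC loss equals log 2, which gives a quick sanity check of the two terms before training.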
Through this process, our framework can focus on person body parts to learn a feature representation that is robust to occlusions in the real world. Fig. 3 shows saliency maps generated by average pooling all feature maps of the last convolution layer. The maps indicate that our framework pays more attention to person body parts than to occlusions or backgrounds.
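The saliency maps mentioned above amount to a channel-wise average of the last convolutional feature maps. A minimal sketch, assuming the feature maps are available as a C x H x W array; the min-max rescaling to [0, 1] is an assumption added for display and is not stated in the text, and `saliency_map` is a hypothetical name.

```python
import numpy as np

def saliency_map(feature_maps):
    """Average C x H x W feature maps over channels and rescale to [0, 1]."""
    fmap = np.asarray(feature_maps, dtype=np.float64)
    sal = fmap.mean(axis=0)            # average pooling across channels
    span = sal.max() - sal.min()
    if span == 0:
        return np.zeros_like(sal)      # flat response: nothing is salient
    return (sal - sal.min()) / span    # assumed min-max normalisation
```

The resulting H x W map can be upsampled to the input resolution and overlaid on the image for inspection.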
Datasets. We evaluate the proposed method on four datasets: Occluded-REID, Partial-REID, P-DukeMTMC-reID and P-ETHZ, each of which is organized into two parts: occluded person images and full-body person images (see Fig. 4).
The Occluded-REID dataset is a new dataset captured by mobile cameras, which consists of 2000 images of 200 occluded persons. Each identity has 5 full-body person images and 5 occluded person images with different types of severe occlusions. All images, which cover different viewpoints and backgrounds, are resized to 128 × 64. This dataset will be released later.
The P-DukeMTMC-reID and P-ETHZ datasets are modified from the DukeMTMC-reID and ETHZ datasets. They contain images in which target persons are occluded by different types of objects in public places, e.g., people, luggage, cars and guideboards. We select identities that have both full-body person images and occluded person images. After this selection, there are 24143 images of 1299 IDs in P-DukeMTMC-reID and 3897 images of 85 IDs in P-ETHZ. Both datasets will also be released later.
The Partial-REID dataset is the first dataset for partial person re-id. It includes 900 images of 60 persons, with 5 full-body person images, 5 partial person images and 5 occluded person images per identity. The images were collected on a university campus with various viewpoints and occlusions.
Experimental setting. We take occluded person images as the probes and full-body person images as the galleries, and randomly select half of the identities for training and the rest for testing. We report results obtained with ResNet-50 as the baseline network. Both single-shot (N=1) and multi-shot (N=2, 3, 4, 5) experiments were conducted with an initial learning rate of 1e-3 for 50K iterations.
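The random half/half identity split can be sketched as below. The function name `split_identities` and the use of a seed per trial are assumptions made for illustration; the text only states that half of the identities are chosen at random for training.

```python
import random

def split_identities(identity_ids, seed=0):
    """Randomly split the distinct identities into train and test halves."""
    ids = sorted(set(identity_ids))
    rng = random.Random(seed)          # a different seed gives a different trial
    rng.shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])
```

Splitting by identity rather than by image keeps every test person unseen during training.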
Evaluation metric. In the matching procedure, we compute the similarity between each probe and all gallery images by their distance. The widely used Cumulative Matching Characteristic (CMC) curve and the rank-1 rate are used for quantitative evaluation of the person re-id task. The experiments are repeated 10 times and the average results are reported.
Data augmentation. In our experiments, we resize all images to 240 × 240 and crop a center region of 224 × 224 with a small random perturbation to augment the training data.
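For reference, a CMC curve can be computed from feature vectors as in the sketch below. The Euclidean distance is an assumption, since the text does not name the distance, and `cmc_curve` is a hypothetical helper; every probe identity is expected to appear in the gallery.

```python
import numpy as np

def cmc_curve(probe_feats, probe_ids, gallery_feats, gallery_ids, max_rank=5):
    """Cumulative matching curve; entry r-1 is the rank-r matching rate.

    Assumption: Euclidean distance between feature vectors.
    """
    probe_feats = np.asarray(probe_feats, dtype=np.float64)
    gallery_feats = np.asarray(gallery_feats, dtype=np.float64)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    # Pairwise distances, shape (num_probe, num_gallery).
    diff = probe_feats[:, None, :] - gallery_feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    hits = np.zeros(max_rank)
    for i in range(len(probe_ids)):
        order = np.argsort(dist[i], kind="stable")   # nearest gallery image first
        matches = gallery_ids[order] == probe_ids[i]
        first = int(np.argmax(matches))              # index of first correct match
        if matches[first] and first < max_rank:
            hits[first:] += 1                        # counts for this rank and above
    return hits / len(probe_ids)
```

The rank-1 rate is the first entry of the returned array; averaging the arrays over repeated random splits gives the averaged curve.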
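The augmentation step can be sketched as follows. The size of the "small random perturbation" is not given, so the uniform shift of at most `max_shift` pixels is an assumption, and the nearest-neighbour `resize_nearest` is only a dependency-free stand-in for whatever image resizing routine was actually used.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of an H x W x C array."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows[:, None], cols[None, :]]

def augment(image, rng, resize=240, crop=224, max_shift=4):
    """Resize to resize x resize, then crop crop x crop near the centre.

    Assumption: the random perturbation is a uniform shift of at most
    max_shift pixels along each axis.
    """
    resized = resize_nearest(image, resize, resize)
    margin = (resize - crop) // 2            # 8 pixels for 240 -> 224
    shift = min(max_shift, margin)           # keep the window inside the image
    dy, dx = rng.integers(-shift, shift + 1, size=2)
    top, left = margin + int(dy), margin + int(dx)
    return resized[top:top + crop, left:left + crop]
```

Calling `augment(image, np.random.default_rng(seed))` on the same image with different seeds yields slightly shifted 224 × 224 crops.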
4.2 Performance of AFPB framework
To evaluate the performance of our Attention Framework of Person Body (AFPB), we compare it with three networks on the four datasets (Occluded-REID, Partial-REID, P-DukeMTMC-reID and P-ETHZ): 1) the baseline network, ResNet-50 pretrained on the large-scale person re-id dataset MARS [14]; 2) the baseline network with the first component of the AFPB, the Occlusion Simulator (OS); and 3) the baseline network with the second component of the AFPB, the multi-task losses. As shown in Table 1 and Fig. 5, our AFPB framework outperforms the baseline by a large margin in rank-1 accuracy on all four datasets, which shows the effectiveness of our framework. In addition, ResNet-50 with only the OS and ResNet-50 with only the multi-task losses both perform better than the baseline but worse than the full AFPB framework. This illustrates that both components contribute to the proposed framework and that combining them yields the best performance, because the two components complement each other in dealing with various occluded situations.
We also evaluate the occlusion awareness and attention performance of the AFPB in two experiments. First, we test the accuracy of the OBC on the trained model. The classification accuracies are 88.50%, 85.33%, 91.75% and 73.88% on Occluded-REID, Partial-REID, P-DukeMTMC-reID and P-ETHZ, respectively, which demonstrates that the OBC loss gives the AFPB occlusion awareness. Second, we compute the detection precision, which is the ratio of salient regions in the saliency maps to our manual annotations of body parts, on Occluded-REID and Partial-REID (Fig. 3). Table 2 shows that our method exceeds the baseline network in detection precision by 6.21% and 7.13%, respectively, which confirms that the AFPB focuses better on the person body.
4.3 Comparison with the state-of-the-art
We compare our method with state-of-the-art methods on Occluded-REID, Partial-REID, P-DukeMTMC-reID and P-ETHZ in Table 3 and Fig. 6. We collect seven methods, including (A) four methods based on hand-crafted features and distance metrics and (B) three methods based on deep learning. Our method generally achieves the best performance in all categories and surpasses the best competing method in rank-1 accuracy by 2.35%, 2.33%, 0.97% and 3.72% on Occluded-REID, Partial-REID, P-DukeMTMC-reID and P-ETHZ, respectively. In general, the methods in (B) perform better than those in (A), owing to the ability of deep neural networks to learn and update the model automatically. Some methods in (B) do not perform well because their models are designed for full-body images and ignore the occluded data domain. In contrast, our method learns to determine whether a person is occluded and extracts a robust feature representation based on prior knowledge of occlusion. In this way, our method shows its superiority in occluded person re-id.
4.4 Analysis of parameter
The only parameter in our model, $\alpha$, controls the tradeoff between the identification loss and the OBC loss. To explore the effect of different proportions of the two losses, we further test the performance of the representation with different values of $\alpha$ on Occluded-REID and Partial-REID. As shown in Fig. 7, the performance of the representation improves as $\alpha$ increases. Our model achieves its best performance when $\alpha$ is between 0.7 and 0.9, which confirms the auxiliary effect of the OBC loss.
In this paper, we make the first attempt to solve the occluded person re-id problem. To address it, the AFPB is proposed to learn a robust feature representation by watching various kinds of generated occluded person images. In addition, multi-task losses are integrated into the framework to direct attention to the person body. Experimental results show the effectiveness and superiority of our method.
This project is supported by the NSFC (U1611461, 61573387).
- [1] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in ECCV workshop on BMTT, 2016.
- [2] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, “A mobile vision system for robust multi-person tracking,” in CVPR, 2008.
- [3] G.C. Wang, X.H. Xie, J.H. Lai, and J.X. Zhuo, “Deep growing learning,” in ICCV, 2017.
- [4] T. Xiao, H.S. Li, W.L. Ouyang, and X.G. Wang, “Learning deep feature representations with domain guided dropout for person re-identification,” in CVPR, 2016.
- [5] E. Ahmed, M. Jones, and T. Marks, “An improved deep learning architecture for person re-identification,” in CVPR, 2015.
- [6] S.Y. Ding, L. Lin, G.R. Wang, and H.Y. Chao, “Deep feature learning with relative distance comparison for person re-identification,” PR, 2015.
- [7] G.C. Wang, J.H. Lai, and X.H. Xie, “P2SNet: Can an image match a video for person re-identification in an end-to-end way?,” TCSVT, 2017.
- [8] Y.C. Chen, X.T. Zhu, W.S. Zheng, and J.H. Lai, “Person re-identification by camera correlation aware feature augmentation,” TPAMI, 2018.
- [9] S.Z. Chen, C.C. Guo, and J.H. Lai, “Deep ranking for person re-identification via joint representation learning,” TIP, 2016.
- [10] S.C. Shi, C.C. Guo, J.H. Lai, S.Z. Chen, and X.J. Hu, “Person re-identification with multi-level adaptive correspondence models,” Neurocomputing, 2015.
- [11] C.C. Guo, S.Z. Chen, J.H. Lai, X.J. Hu, and S.C. Shi, “Multi-shot person re-identification with automatic ambiguity inference and removal,” in ICPR, 2014.
- [12] W.S. Zheng, X. Li, T. Xiang, S.C. Liao, J.H. Lai, and S.G. Gong, “Partial person re-identification,” in ICCV, 2015.
- [13] K.M. He, X.Y. Zhang, S.Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [14] L. Zheng, Z. Bie, Y.F. Sun, J.D. Wang, C. Su, S.J. Wang, and Q. Tian, “MARS: A video benchmark for large-scale person re-identification,” in ECCV, 2016.
- [15] A. Borji, M.M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” TIP, 2015.
- [16] S.C. Liao, Y. Hu, X.Y. Zhu, and S.Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in CVPR, 2015.
- [17] Y.C. Chen, W.S. Zheng, and J.H. Lai, “Mirror representation for modeling view-specific transform in person re-identification,” in IJCAI, 2015.
- [18] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato, “Hierarchical Gaussian descriptor for person re-identification,” in CVPR, 2016.
- [19] L. Zhang, T. Xiang, and S.G. Gong, “Learning a discriminative null space for person re-identification,” in CVPR, 2016.
- [20] Y.F. Sun, L. Zheng, W.J. Deng, and S.J. Wang, “SVDNet for pedestrian retrieval,” arXiv preprint arXiv:1703.05693, 2017.
- [21] Z. Zhong, L. Zheng, G.L. Kang, S.Z. Li, and Y. Yang, “Random erasing data augmentation,” arXiv preprint arXiv:1708.04896, 2017.