Zoom-RNN: A Novel Method for Person Recognition Using Recurrent Neural Networks
The overwhelming popularity of social media has resulted in huge numbers of personal photos being uploaded to the internet every day. Since these photos are taken in unconstrained settings, recognizing the identities of the people in them remains a challenge. Studies have indicated that using evidence other than face appearance improves the performance of person recognition systems. In this work, we aim to take advantage of additional cues obtained from different body regions in a zooming-in fashion for person recognition. Hence, we present Zoom-RNN, a novel method based on recurrent neural networks for combining evidence extracted from the whole body, upper body, and head regions. Our model is evaluated on a challenging dataset, namely People In Photo Albums (PIPA), and we demonstrate that employing our system improves the performance of conventional fusion methods by a noticeable margin.
1 Introduction
During the past decades, taking personal photos in daily life has become easier and more common with the advent of smartphones and digital cameras. Massive amounts of these personal images are uploaded to the internet, mostly through social media. Given that most of the time these images contain people, smart platforms are interested in organizing the identities in these photos. To perform the person recognition task, the question "What is the identity of this person?" must be answered.
The first person recognition models were based on hand-crafted features and were tested on small, constrained datasets [8, 6, 30]. However, these models cannot be easily applied to person recognition in photo album settings due to challenges such as occlusion, viewpoint changes, pose variance, and low resolution, all of which are represented in the People In Photo Albums (PIPA) dataset [29]. Sample images from PIPA are shown in Figure 1.
There have been numerous studies like [2, 24, 20, 19, 14] on person recognition in photo albums. The main ideas are to extract more sophisticated features from or about the input image and to employ more advanced classification methods for learning the relations between the features and identities.
Regarding the information sources used in the person recognition literature, some studies focus on the relational nature of the photos in an album belonging to an identity. Perhaps the most obvious way to capture this relation is to extract additional information from the photo. Contextual cues such as clothes, glasses, and surrounding objects, or even metadata like photo location and the social relationships of identities, can drastically help inference about an identity present in a photo album. The extraction, exploitation, and fusion of such information have been extensively studied in previous works [12, 2, 14]. Moreover, current person recognition methods, like many other image processing techniques, benefit from the informative representations of input images provided by convolutional neural networks (CNNs).
Human body parts other than the face are another source of information beneficial for identifying a person. As discussed in studies like [29, 17], relying on facial features alone in person recognition has shortcomings, specifically in dealing with non-frontal views or cluttered faces, which frequently occur in personal photos.
Head, upper body, and whole body are the main body regions used in many person recognition models. However, the models differ in the way they aggregate the information extracted from these regions. In the early fusion approach, the feature vectors extracted from different parts are combined to form the final descriptor used for classification, while in late fusion, each feature vector is classified separately to form a probability vector over the identities, and these initial decision vectors are then aggregated.
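As a minimal sketch of the two fusion styles described above, the following Python snippet contrasts them on toy vectors. The function names, the two-identity setup, and the use of a plain average for late aggregation are illustrative assumptions, not the implementation of any cited model.

```python
from typing import List, Sequence


def early_fusion(features: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-region feature vectors into one descriptor."""
    fused: List[float] = []
    for vector in features:
        fused.extend(vector)
    return fused


def late_fusion(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Average per-region probability vectors element-wise."""
    count = len(probabilities)
    return [sum(column) / count for column in zip(*probabilities)]


# Toy per-region features (head, upper body, whole body); values are made up.
head_feat, upper_feat, body_feat = [0.2, 0.7], [0.1, 0.4], [0.9, 0.3]
descriptor = early_fusion([head_feat, upper_feat, body_feat])

# Toy per-region probabilities over two identities; values are made up.
head_prob, upper_prob, body_prob = [0.8, 0.2], [0.6, 0.4], [0.4, 0.6]
decision = late_fusion([head_prob, upper_prob, body_prob])
```

In early fusion a single classifier would then be trained on `descriptor`, whereas in late fusion `decision` already is the aggregated prediction.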
In this paper, we propose a novel fusion method, called Zoom-RNN, for combining the evidence extracted from the main human body parts: head, upper body, and whole body. The proposed model incorporates both early-stage decision making based on the evidence obtained from each part and late identification based on the final aggregated feature vector. To do so, a two-part recurrent neural network is applied to the feature and probability vectors extracted by convolutional neural networks from different regions of the human body. Experimental results on the PIPA dataset show the superiority of the proposed model over other fusion mechanisms. The proposed model can be easily generalized to include more contextual information in recognition.
The rest of this paper proceeds as follows. After an overview of related works in Section 2, we describe and formulate our approach in Section 3. In Section 4, the evaluation benchmark, implementation details, experimental procedures and results are presented and compared to other methods. We provide visualizations of our predictions in Section 5, and conclude our work in Section 6.
2 Related Works
Since this paper proposes a person recognition model evaluated on the PIPA dataset, previous works related to the proposed model are reviewed in the following three subsections: (1) person recognition in photo albums, (2) person recognition models on PIPA, and (3) dependency modeling with RNNs.
2.1 Person Recognition in Photo Album
Person recognition in photo albums is the task of identifying people in daily life photos such as social media or private photo collections. Recognition in the photo album setting includes challenges like cluttered background, pose variance, age gap, and diverse clothing [17, 15]. The success of traditional face recognition algorithms was limited when applied to personal photos, which are usually taken under uncontrolled conditions with significant variations in pose, expression, and illumination.
Anguelov et al. [2] used additional cues present in photo collections, such as clothing and album metadata, to provide context, employing a Markov Random Field (MRF) with similarity potentials, and tested the system on a relatively small dataset. O'Hare et al. [19] conducted a comprehensive empirical study using the real private photo collections of a number of users and proposed language modeling and nearest neighbor approaches to context-based person identification. Lin et al. [14] presented a probabilistic framework in which the relations between different domains (people, events, and locations) are estimated based on the co-occurrence information of the instances of two domains. The tagged objects of two other domains are used as the context for identifying an unknown object in the third domain.
Recent advances in processing power, alongside the availability of large labeled datasets such as Labeled Faces in the Wild (LFW) [8], which poses various challenges due to pose variation, motion blur, and deformation, resulted in a need for scaling up learning techniques. Recently, deep neural networks have shown great performance in many computer vision tasks, including person recognition. Taigman et al. [24] trained their model on a large dataset and achieved accuracies around 97.45% on LFW. Schroff et al. [20] employed a data-driven method based on learning a Euclidean embedding per image using a deep convolutional network and achieved 99.63% on the LFW dataset. Sun et al. [28] achieved new state-of-the-art results on the LFW [8] and YouTube Faces [27] benchmarks by designing DeepID2+, increasing the dimension of hidden representations and adding supervision to early convolutional layers.
2.2 Person Recognition Models on PIPA
Studies like [24], [20], and [28] resulted in significant error reduction and approached human-level performance on commonly used standard datasets such as LFW. Recently, Zhang et al. [29] introduced People In Photo Albums (PIPA) as a novel dataset addressing the limitations of conventional person recognition systems, most of which relied heavily on facial cues. PIPA has been a popular benchmark for person recognition ever since, and various studies [29, 17, 12, 13, 11, 15] have been conducted to reduce error on this dataset, each focusing on certain challenges. Along with the original dataset, baseline accuracies were provided using a novel method called PIPER, which significantly outperformed DeepFace [24] and AlexNet [10] on PIPA. To better challenge generalization across long-term appearance changes of a person, Oh et al. [17] extended the PIPA dataset and proposed three new splits. They also achieved better results on PIPA by evaluating the effectiveness of different body regions, the scene context, and attributes like age and gender.
The method introduced in [17] was extended in [18] with a focus on the privacy issues of social media, with results indicating that only a handful of images are enough to threaten users' privacy, even in the presence of obfuscation. Li et al. [12] went beyond the single photo and presented a framework that exploits contextual cues at the personal, group, and photo levels, aiming to improve the recognition rate. Kumar et al. [11] proposed a network that jointly optimizes a single loss over multiple body regions to tackle the challenge of pose variations. Liu et al. [15] proposed a congenerous cosine loss, which optimizes the cosine distance among data features to simultaneously enlarge inter-class variation and intra-class similarity. They carried out experiments on various large-scale benchmarks, including PIPA [29], and demonstrated the effectiveness of their algorithm.
2.3 Dependency Modeling with RNNs
Sequence modeling approaches in many contexts benefit from recurrent architectures, particularly LSTMs [7] and GRUs [4], due to the ability of these networks to model dependencies within sequences. Recurrent Neural Networks (RNNs) have been extensively used in tasks like image captioning [16, 9, 25] and language modeling. For our application, we are interested in extracting the relation between different body region features and person identities using RNNs. The most accurate study of relational cues on the PIPA dataset was conducted by Li et al. [13]. They focus on relational information between people in the same photo, use the scene context, and employ an RNN to achieve state-of-the-art results. Similarly, other studies have exploited the label dependencies in an image based on decoding an image into a set of people detections.
3 Our Approach
In this paper, we aim to recognize the persons in a given photo. The input to our model is an image and bounding boxes for the heads of the persons in the image. As output, a label is assigned to each person in the given input. Our general model is depicted in Figure 2. Given an image and the bounding box of the head region, bounding boxes for the upper body and the whole body are extracted. Three CNNs, previously trained to identify a person based on the head, upper body, and whole body regions, respectively, are then applied to the corresponding extracted regions. Each CNN outputs a probability vector assigning probabilities to all possible identities and a feature vector giving a representation of the given region. The feature and probability vectors generated by the CNNs are given to two distinct RNN branches. At the next step, the outputs of the RNNs are aggregated through an averaging gate. The averaged vector is sent to a final layer after applying an element-wise function. The final outcome is a vector giving the probability of each identity. A more detailed description of our approach is given in the following.
To train the CNN components of the model on an input image with the bounding box of the head, bounding boxes for the upper body and the whole body of the person are extracted, following an approach similar to previous work.
Formally, if the size and location of are and , respectively, the size and the location of are and , where . For , the location is the same as , but the size is .
After extracting bounding boxes for all three body parts, each CNN is trained with the corresponding image region as the input and the human identity as the output. The CNNs are trained using the multi-class cross entropy loss defined as:
$$\mathcal{L} = -\sum_{i=1}^{C} y_i \log \hat{y}_i \tag{1}$$
where $y$ is the one-hot-encoded ground truth label for the input image, $\hat{y}$ represents the softmax output vector produced by the CNN, and $C$ is the number of possible classes (identities). Next, we have three CNNs trained on the head, upper body, and whole body, respectively. For each sample, one feature vector per region is extracted from the last layer before the classification layer of the corresponding trained CNN. We also extract the $C$-dimensional probability vector whose $i$-th element indicates the probability that the instance belongs to the $i$-th identity. One such probability vector is extracted for each region.
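The loss described above can be illustrated with a small, framework-free Python sketch. The helper names, the three-class toy logits, and the `eps` guard against `log(0)` are assumptions made for illustration; the paper's own training code is not shown in the text.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability vector (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs, eps=1e-12):
    """Multi-class cross entropy between a one-hot label and softmax output."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


# Toy example with three identities; the logits are made up.
probs = softmax([2.0, 0.5, -1.0])
loss_correct = cross_entropy([1, 0, 0], probs)  # label agrees with the top score
loss_wrong = cross_entropy([0, 0, 1], probs)    # label is the least likely class
```

Because the label is one-hot, the sum reduces to the negative log-probability of the true identity, so a confident correct prediction gives a small loss.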
To combine information obtained from different body parts, we propose using RNNs in a zooming-in fashion, from the whole body to the upper body and then to the head, to generate more confident predictions. Two distinct RNNs with equal output dimensions are used, one for the feature vectors and one for the probability vectors. The feature RNN takes the whole-body, upper-body, and head feature vectors as its input sequence, in that order, and the probability RNN likewise receives the three corresponding probability vectors.
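We choose the Gated Recurrent Unit (GRU) [4] as our recurrent network architecture for its high capability of learning sequential data. Assuming $x_t$ as the input for a GRU cell at time $t$, the cell activation can be formulated as below:
$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{2}$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{3}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$
$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})) \tag{5}$$
where $\sigma$ stands for the sigmoid function, $W$ and $U$ (with their gate-specific variants) are weight matrices, and $\odot$ is element-wise multiplication. $z_t$ and $r_t$ are the update and reset gates at time $t$. $h_t$ and $\tilde{h}_t$, calculated in Eqs. 4 and 5, are the hidden and candidate hidden vectors at time $t$. The values of the reset and update gates are computed according to Eqs. 2 and 3. The role of the reset gate is to control the combination of the new input and the former memory. Similarly, the update gate controls the amount of previous memory to keep. The value of $h_t$ is updated using the former and candidate hidden values.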
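To make the gate mechanics concrete, here is a minimal pure-Python sketch of one GRU step in the standard formulation of Chung et al. [4], run over a three-step sequence. Everything specific in it is an assumption for illustration: biases are omitted, the weights are random, the dimensions are tiny, and the helper names (`gru_step`, `init_params`, `run_gru`) are hypothetical rather than taken from the authors' Keras implementation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, params):
    """One GRU update (standard form, biases omitted)."""
    # Reset gate: controls how much former memory enters the candidate.
    r = [sigmoid(a + b) for a, b in
         zip(matvec(params["W_r"], x), matvec(params["U_r"], h_prev))]
    # Update gate: controls how much of the previous memory is kept.
    z = [sigmoid(a + b) for a, b in
         zip(matvec(params["W_z"], x), matvec(params["U_z"], h_prev))]
    # Candidate hidden vector built from the input and the reset-gated memory.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in
            zip(matvec(params["W_h"], x), matvec(params["U_h"], gated))]
    # New hidden vector: interpolation between old memory and candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def init_params(input_dim, hidden_dim, seed=0):
    """Random weights for illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in ("r", "z", "h"):
        params["W_" + gate] = mat(hidden_dim, input_dim)
        params["U_" + gate] = mat(hidden_dim, hidden_dim)
    return params


def run_gru(sequence, params, hidden_dim):
    """Feed a sequence through the cell and return the last hidden vector."""
    h = [0.0] * hidden_dim
    for x in sequence:
        h = gru_step(x, h, params)
    return h


# Toy region vectors (made up), fed whole body -> upper body -> head.
whole_body, upper_body, head = [0.1, 0.9, 0.3], [0.4, 0.2, 0.7], [0.8, 0.5, 0.6]
params = init_params(input_dim=3, hidden_dim=4)
zoom_in = run_gru([whole_body, upper_body, head], params, hidden_dim=4)
reversed_order = run_gru([head, upper_body, whole_body], params, hidden_dim=4)
```

The two calls differ only in input order, which is enough to change the final hidden vector; this is the property the zooming-in ordering relies on.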
With the features and probabilities as input to each RNN, the final outputs of the RNNs are combined as follows:
$$o = \phi\left(\frac{o_p + o_f}{2}\right)$$
where $o_p$ and $o_f$ are the outputs of the probability and feature RNNs, respectively, and $\phi$ is the element-wise function applied to the averaged vector. A final layer is added for classification. The output of the classification layer is a vector whose size equals the number of classes. We apply the cross entropy loss (Eqn. 1) to train our model.
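A small sketch of this combination step, under stated assumptions: the text only says that an averaging gate is followed by an element-wise function and a final classification layer, so the choice of `tanh` as that function, the toy weights, and the name `zoom_head` are all hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zoom_head(out_prob, out_feat, weights, biases, activation=math.tanh):
    """Average the two RNN outputs, apply an element-wise function, classify."""
    # Averaging gate over the equally sized RNN outputs.
    averaged = [(p + f) / 2.0 for p, f in zip(out_prob, out_feat)]
    # Element-wise function (tanh is an assumption, not stated in the text).
    activated = [activation(v) for v in averaged]
    # Final fully connected layer followed by softmax over identities.
    logits = [sum(w * a for w, a in zip(row, activated)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Toy RNN outputs of dimension 3 and a 2-identity classifier (values made up).
out_prob = [0.2, -0.4, 0.6]
out_feat = [0.6, 0.0, -0.2]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
biases = [0.0, 0.0]
identity_probs = zoom_head(out_prob, out_feat, weights, biases)
```

Averaging requires the two branches to have the same output size, which matches the equal output dimensions mentioned for the two RNNs.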
4 Experiments
In this section, we first present information about the dataset used for evaluation along with the specific implementation details of our approach. Then, we provide the results of our experiments and compare the performance of our model with those of the baseline and state-of-the-art methods.
4.1 Dataset Description
We conduct our experiments on the People In Photo Albums (PIPA) [29] dataset. PIPA contains public photo albums from Flickr users, with the head region of each person annotated. Head bounding boxes may be partially or fully outside of the image. The PIPA protocol also tags no more than 10 people in a single image, meaning that not everybody in images of crowds is tagged.
The original split of the dataset consists of three parts: train, validation, and test. For each identity, samples are roughly partitioned 50-25-25 percent across the three parts, respectively, with the test set consisting of 7868 images. We use the train set only to learn representations for the regions of interest, as described in Section 3. As proposed in [29] and followed in previous studies [17, 12, 13, 11, 15] on PIPA, the test set is randomly split into two halves, and we follow this protocol. As previously reported, there are some mislabeled instances in the test set, but to keep our results comparable with existing methods, we do not refine the original split.
Due to the limitations of the original split proposed by [29], three more challenging splits were introduced in [17], namely album, time, and day. In the album split, samples are collected from different albums of a person, meaning that the two halves are sampled from different events and occasions. The time split aims to emphasize the temporal dimension of the two halves: the metadata of the photos is used to partition them into the newest and oldest images of an identity. Finally, the day split challenges appearance change. This split is made manually, and date changes like seasons or visible changes like hairstyle are taken into consideration. Unlike the first three splits, the number of unique identities in the day split is reduced from 581 to 199, with about 20 samples per identity.
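As a rough illustration of the idea behind the time split, the sketch below partitions each identity's photos into an older and a newer half by timestamp. This is only an assumed reading of "newest and oldest images of an identity": the tuple layout, the midpoint rule, and the name `time_split` are hypothetical and do not reproduce the actual split files.

```python
from collections import defaultdict


def time_split(samples):
    """Split (identity, timestamp, photo_id) samples into older/newer halves."""
    by_identity = defaultdict(list)
    for identity, timestamp, photo_id in samples:
        by_identity[identity].append((timestamp, photo_id))

    older, newer = [], []
    for identity, items in by_identity.items():
        items.sort()                      # oldest first
        middle = len(items) // 2          # midpoint rule is an assumption
        older.extend((identity, pid) for _, pid in items[:middle])
        newer.extend((identity, pid) for _, pid in items[middle:])
    return older, newer


# Made-up samples: identity, year taken, photo id.
samples = [
    ("alice", 2010, "a1"), ("alice", 2014, "a2"),
    ("alice", 2012, "a3"), ("alice", 2016, "a4"),
    ("bob", 2015, "b1"), ("bob", 2011, "b2"),
]
older_half, newer_half = time_split(samples)
```

Every identity appears in both halves, so a model trained on one half is always tested on photos of known people taken at a different time.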
4.2 Implementation Details
Inception-V3 [23] is the architecture of choice for the CNNs in our model. We initialize the CNNs with weights pre-trained on ImageNet, and for each body part, a CNN is trained on the train split. This pre-training step injects additional data with a distribution similar to the test split of PIPA into the CNNs and helps them perform better when trained on either half of the test split. In the pre-training step, we train each CNN for 50 epochs using the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.01 and momentum of 0.9. To train the CNNs on each half of the test split of PIPA, we initialize the networks with the weights obtained by training on the train split. Here, the CNNs are trained for 50 epochs with a learning rate of 0.01 and then 20 epochs with a learning rate of 0.001. Again, we use SGD with a momentum of 0.9. All input images are resized to a fixed size of 299×299. During training, we use several methods to augment the dataset: images are randomly flipped, randomly rotated within a range of 30 degrees, randomly shifted horizontally and vertically by -60 to 60 pixels, and randomly zoomed in or out by a factor between 0.8 and 1.2 of the image size.
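The two-stage learning rate schedule and the augmentation ranges above can be written down as a small, framework-independent sketch. The numbers come from the text; the function names, the reading of the rotation range as ±30 degrees, and the choice of a horizontal flip are assumptions, and the snippet only samples parameters rather than transforming images.

```python
import random

# (epochs, learning rate) stages used when training on a test-split half.
FINE_TUNE_SCHEDULE = [(50, 0.01), (20, 0.001)]


def learning_rate(epoch, schedule=FINE_TUNE_SCHEDULE):
    """Return the learning rate for a 0-based epoch index."""
    start = 0
    for n_epochs, rate in schedule:
        if epoch < start + n_epochs:
            return rate
        start += n_epochs
    raise ValueError("epoch beyond the end of the schedule")


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters."""
    return {
        "flip": rng.random() < 0.5,                # flip direction is an assumption
        "rotation_deg": rng.uniform(-30.0, 30.0),  # assumed to mean +/- 30 degrees
        "shift_x_px": rng.uniform(-60.0, 60.0),
        "shift_y_px": rng.uniform(-60.0, 60.0),
        "zoom": rng.uniform(0.8, 1.2),
    }


rng = random.Random(0)
example = sample_augmentation(rng)
rates = [learning_rate(e) for e in (0, 49, 50, 69)]
```

In a Keras pipeline these values would typically be passed to an image data generator and a learning rate callback; that wiring is left out here.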
We use a GRU with three timesteps and an output dimensionality of 2048. Dropout with a probability of 0.5 is applied to the combined representation of the GRUs. SGD with a learning rate of 0.005 and momentum of 0.9 is used to optimize the loss. The number of training epochs is fixed at 2000. Training this part for each fold of the test split takes about 1.5 hours on a single NVIDIA GeForce GTX 1080 Ti GPU. We use Keras [3] with the TensorFlow [1] backend for our implementation.
4.3 Experimental Results
Now we explore the importance of modeling the relational cues of different body regions and good practices in the use of recurrent architectures for this purpose. All reported results throughout the paper are classification accuracies averaged over the two halves of the test split, meaning that each model has been trained on one half and evaluated on the other, and vice versa.
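The evaluation protocol above amounts to a two-fold swap. The sketch below shows it with a toy one-dimensional nearest-neighbor classifier standing in for the real model; the classifier, the data, and the function names are placeholders chosen only to keep the example self-contained.

```python
def nearest_neighbor_predict(train, queries):
    """Toy 1-D classifier: copy the label of the nearest training sample."""
    return [min(train, key=lambda s: abs(s[0] - q))[1] for q in queries]


def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def two_fold_accuracy(fold_a, fold_b, predict=nearest_neighbor_predict):
    """Train on one half, test on the other, swap, and average the accuracies."""
    scores = []
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        predictions = predict(train, [x for x, _ in test])
        scores.append(accuracy(predictions, [y for _, y in test]))
    return sum(scores) / len(scores)


# Made-up (feature, identity) samples for the two halves of a test split.
fold_a = [(0.1, "a"), (0.9, "b"), (0.2, "a")]
fold_b = [(0.0, "a"), (1.0, "b"), (0.6, "a")]
mean_accuracy = two_fold_accuracy(fold_a, fold_b)
```

Here the first direction classifies two of three samples correctly and the second classifies all three, so the reported number would be the mean of 2/3 and 1.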
Initial CNN predictions. In the first stage of our work, we train a CNN on each body part. Every part-specific CNN can classify a person on its own. Table 1 summarizes the accuracies of the body-part-specific CNNs along with their average and maximum fusion variants. It is evident that an increase in body part size makes it harder for the model to perform well and, as expected, the most informative single region is the head of the person. When we fuse the predictions of different body parts with either an element-wise average or maximum, the accuracy of the model increases noticeably in all splits except the day split. This validates the idea that different cues are present in each body part and can be extracted by fusion methods. In the day split, the upper body and whole body CNNs perform much worse than the head CNN, which makes simple fusion methods yield poor predictions.
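The failure mode described for the day split can be reproduced on toy numbers. In the hedged sketch below, the head prediction is correct but only mildly confident, while the two weaker regions are confidently wrong; the probability values and helper names are made up for illustration and are not taken from Table 1.

```python
def fuse(prob_vectors, mode="average"):
    """Element-wise average or maximum over per-region probability vectors."""
    columns = list(zip(*prob_vectors))
    if mode == "average":
        return [sum(column) / len(column) for column in columns]
    if mode == "maximum":
        return [max(column) for column in columns]
    raise ValueError("mode must be 'average' or 'maximum'")


def predict(probs):
    """Index of the most probable identity."""
    return max(range(len(probs)), key=probs.__getitem__)


# Identity 0 is the true one in this made-up example.
head = [0.55, 0.45]        # correct, but not very confident
upper_body = [0.20, 0.80]  # confidently wrong
whole_body = [0.30, 0.70]  # confidently wrong

head_only = predict(head)
average_fused = predict(fuse([head, upper_body, whole_body], "average"))
maximum_fused = predict(fuse([head, upper_body, whole_body], "maximum"))
```

Both simple fusion rules are pulled toward the wrong identity here, which is the kind of case a learned, order-aware fusion is meant to handle.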
Fusion Baselines. Here, we analyze more complex baselines to combine information from different body parts. We experiment with different versions of our RNN-based model as shown in Fig. 3.
Concat: Concatenated features from all CNNs are fed into a fully connected layer with 2048 neurons and a classification layer on top of it.
Confidence-Aware: Following prior work, a weighted average of the probabilities from the different body parts, weighted by the confidence of each prediction, is computed as the final output.
Probabilities RNN: A variant of our model where only one RNN is used on input probabilities produced by CNNs to produce final predictions.
Features RNN: Similar to probabilities RNN but with CNN features as the input.
Embeddings RNN: The features and probabilities of each CNN are combined in an embedding layer. The outputs of the embedding layer are given to a single RNN to identify persons. The embedding layer has a fully connected layer on top of each of the probabilities and the features to embed them into a new fixed-size layer. The outputs of the fully connected layers are combined using an element-wise maximum with an activation function on top of it to form the output of the embedding layer. We observed that this way of combining embeddings works better than other combinations, such as averaging with another activation function.
Reversed Zoom-RNN: Similar to our final approach, except that the head region is the input to the first timestep of the RNN and the whole body is the last.
Zoom-RNN: The complete version of our model.
Results of the baselines are reported in Table 2. Concat is able to combine the information of body parts to some extent, but it does not perform better than the previous simple fusion methods. Confidence-Aware is reported to give the best aggregation result in prior work, but all of our RNN variations outperform it in all of the splits. Probabilities RNN reasons over prediction probabilities from least confident to most confident. Like the previous baseline, although it is able to fuse some information from the three predictions, it performs worse than the simple average or pairwise maximum. Letting the model learn from visual features in Features RNN increases the performance over simple fusion methods. Furthermore, to evaluate whether using both probability and feature vectors is beneficial, the accuracy of the Embeddings RNN is reported. Its performance is slightly worse than that of Features RNN in most of the splits. We believe that combining probabilities and feature vectors in lower-level representations does not produce a strong combination. In Zoom-RNN, the features and probabilities of the CNNs are separately encoded into higher-level representations, and these representations are then combined. The significant improvement in accuracy in all four splits over the other baselines and previous fusion methods supports our statement about combining probabilities and features at higher levels of representation.
Another important factor in RNN-based combination is the order of the input CNNs. The poor performance of Reversed Zoom-RNN indicates that going from the best-performing part-specific CNN to the worst can make it difficult for the model to make correct inferences. The improvement over the worst part shows that the model remembers some information about the other parts, but it is also evident that most of the valuable cues are forgotten. Therefore, a good practice is to go from the weakest part-specific model to the strongest, which makes it easier for the recurrent model to retain the best-performing model's representations while still remembering some valuable information from the other body parts.
4.4 Comparison to the State of the Art
As discussed in Section 2, there have been various approaches to person recognition on PIPA. To show that our model can be improved using other contextual cues, we have implemented our own version of the inter-person sequence, similar to [13]. The results are summarized in Table 3. Unlike [13], we do not use any contextual information other than the body parts of the person, so we expect our model to perform much better when taking advantage of other contextual cues. We are aware that a recent study [15] performs better in three splits by using a novel loss function (COCO). Therefore, the results of [15] with softmax are reported too, and our final results are significantly better in this fairer comparison. However, in this work, we are interested in showing that our relational modeling of body regions improves baseline performance. The positive effect of the inter-person sequence shows that our results can be improved using other contextual cues.
| Method | Original | Album | Time | Day |
|---|---|---|---|---|
| w/o context | 83.86 | 78.23 | 70.29 | 56.40 |
| with softmax | 88.73 | 80.26 | 71.56 | 50.36 |
| with context | 88.75 | 83.33 | 77.00 | 59.35 |
| Ours + inter-person sequence | 91.36 | 85.00 | 77.11 | 58.53 |
5 Visualization
To illustrate our model's zooming nature and the effect of modeling the relational cues of different body regions, in this section we provide examples of our predictions on the PIPA test set.
In Figure 4, we show examples in which the average method mislabels the identity while our model predicts the right one. It can be inferred that similar outfits and faces can easily mislead the averaging methods, whereas exploiting the relation between body regions with our model yields accurate predictions.
Similarly, in Figure 5, we show some instances in which head features alone may mislead the model, but taking advantage of the relation between body parts helps our model predict accurately. As mentioned in Section 1, the person recognition task in photo albums includes challenges like non-frontal faces, occlusion, and motion blur. The examples in Figure 5 suggest that these challenges can be overcome by extracting useful information from different body regions.
6 Conclusion
In this paper, we proposed a novel method for combining the cues of different body regions for the task of person recognition in photo albums. Our approach uses two distinct recurrent neural networks to extract the information present in different parts of a human photo in order to improve recognition performance. We conducted experiments on the PIPA dataset and showed that our model significantly boosts baseline performance. We also achieved state-of-the-art results in one split and second-best results in the others by a narrow margin, while not using contextual cues, which have been shown to significantly increase overall performance.
References
- [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- [2] D. Anguelov, K.-c. Lee, S. B. Gokturk, and B. Sumengen. Contextual identity recognition in personal photo albums. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–7. IEEE, 2007.
- [3] F. Chollet et al. Keras. https://github.com/keras-team/keras, 2015.
- [4] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
- [5] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
- [6] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? metric learning approaches for face identification. In Computer Vision, 2009 IEEE 12th international conference on, pages 498–505. IEEE, 2009.
- [7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [8] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
- [9] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
- [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [11] V. Kumar, A. Namboodiri, M. Paluri, and C. Jawahar. Pose-aware person recognition. arXiv preprint arXiv:1705.10120, 2017.
- [12] H. Li, J. Brandt, Z. Lin, X. Shen, and G. Hua. A multi-level contextual model for person recognition in photo albums. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1297–1305, 2016.
- [13] Y. Li, G. Lin, B. Zhuang, L. Liu, C. Shen, and A. van den Hengel. Sequential person recognition in photo albums with a recurrent network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5660–5668. IEEE, 2017.
- [14] D. Lin, A. Kapoor, G. Hua, and S. Baker. Joint people, event, and location recognition in personal photo collections using cross-domain context. In European Conference on Computer Vision, pages 243–256. Springer, 2010.
- [15] Y. Liu, H. Li, and X. Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017.
- [16] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632, 2014.
- [17] S. J. Oh, R. Benenson, M. Fritz, and B. Schiele. Person recognition in personal photo collections. In ICCV, pages 3862–3870, 2015.
- [18] S. J. Oh, R. Benenson, M. Fritz, and B. Schiele. Faceless person recognition: Privacy implications in social media. In European Conference on Computer Vision, pages 19–35. Springer, 2016.
- [19] N. O'Hare and A. F. Smeaton. Context-aware person identification in personal photo collections. IEEE Transactions on Multimedia, 11(2):220–228, 2009.
- [20] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
- [21] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2325–2333, 2016.
- [22] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- [23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- [24] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
- [25] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE, 2015.
- [26] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. Cnn-rnn: A unified framework for multi-label image classification. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2285–2294. IEEE, 2016.
- [27] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529–534. IEEE, 2011.
- [28] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- [29] N. Zhang, M. Paluri, Y. Taigman, R. Fergus, and L. Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4804–4813, 2015.
- [30] R. Zhao, W. Ouyang, and X. Wang. Person re-identification by salience matching. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2528–2535. IEEE, 2013.