Unsupervised Data Uncertainty Learning in Visual Retrieval Systems

Ahmed Taha¹, Yi-Ting Chen², Teruhisa Misu², Abhinav Shrivastava¹, Larry Davis¹
¹University of Maryland, College Park   ²Honda Research Institute, USA
Correspondence: Ahmed Taha <ahmdtaha@cs.umd.edu>

Abstract

We introduce an unsupervised formulation to estimate heteroscedastic uncertainty in retrieval systems. We propose an extension to triplet loss that models data uncertainty for each input. Besides improving performance, our formulation models local noise in the embedding space. It quantifies input uncertainty and thus enhances the interpretability of the system. This helps identify noisy observations in query and search databases. Evaluation on both image and video retrieval applications highlights the utility of our approach. We demonstrate our ability to model local noise using two real-world datasets: Clothing1M and the Honda Driving Dataset. Qualitative results illustrate our ability to identify confusing scenarios in various domains. Uncertainty learning also enables data cleaning by detecting noisy training labels.

1 Introduction

Noisy observations hinder learning from supervised datasets. Adding more labeled data does not eliminate this inherent source of uncertainty. For example, object boundaries and objects farther from the camera remain challenging in semantic segmentation, even for humans. Noisy observations take various forms in visual retrieval. The noise can be introduced by a variety of factors, e.g., low-resolution inputs or incorrect training labels. Modeling uncertainty in training data can improve both the robustness and interpretability of a system. In this paper, we propose a formulation to capture data uncertainty in retrieval applications. Figure 1 shows the lowest and highest uncertainty query images, detected by our system, from the DukeMTMC-ReID person re-identification dataset. Similarly, in autonomous navigation scenarios, our formulation can identify confusing situations, thus improving retrieval performance and interpretability in this safety-critical domain.

Labeled datasets contain observational noise that corrupts the target values Bishop et al. (1995). This noise, also known as aleatoric uncertainty Kendall & Gal (2017), is inherent in the data observations and cannot be reduced even if more data is collected. Aleatoric uncertainty is categorized into homoscedastic and heteroscedastic uncertainty. Homoscedastic uncertainty is task dependent, i.e., a constant observation noise for all input points. On the contrary, heteroscedastic uncertainty posits the observation noise as dependent on the input x. Aleatoric uncertainty has been modeled in regression and classification applications like per-pixel depth regression and semantic segmentation, respectively. In this paper, we extend the triplet loss formulation to model heteroscedastic uncertainty in retrieval applications.

Figure 1: The first and second rows show the five lowest and highest uncertainty queries (respectively) identified from the DukeMTMC-ReID dataset.

Triplet loss Schroff et al. (2015) is a prominent ranking loss for space embedding. It has been successfully applied in face recognition Schroff et al. (2015); Sankaranarayanan et al. (2016) and person re-identification Cheng et al. (2016); Su et al. (2016); Ristani & Tomasi (2018). In this paper, we extend it to capture heteroscedastic uncertainty in an unsupervised manner. Vanilla triplet loss assumes a constant uncertainty for all input values. By integrating the anchor, positive, and negative uncertainties in the loss function, our model learns data uncertainty nonparametrically. Thus, the data uncertainty becomes a function of the input, i.e., every object x has its own noise level \sigma(x).

We evaluate our unsupervised formulation on two image retrieval applications: person re-identification and fashion item retrieval. Person re-identification datasets provide an established quantitative evaluation benchmark, yet they have little emphasis on confusing samples. Thus, we leverage the Clothing1M Xiao et al. (2015) fashion classification dataset for its noisy labels and inter-class similarities. Its training split has a small clean subset and a large noisily labeled subset. Inter-class similarity, e.g., between Down Coat and Windbreaker, and images with wrong labels are two distinct confusion sources, both of which are captured by our learned uncertainty model.

One of the main objectives behind modeling uncertainty is improving safety, since uncertainty quantification can prevent error propagation McAllister et al. (2017). To this end, we employ the Honda driving dataset (HDD) Ramanishka et al. (2018) for evaluation in the safety-critical autonomous navigation domain. Explicit heteroscedastic uncertainty representation improves retrieval performance by attenuating the effect of noisy data. Qualitative evaluation demonstrates the ability of our approach to identify confusing driving situations.

In summary, the key contributions of this paper are:

  1. Formulating an unsupervised triplet loss extension to capture heteroscedastic (data) uncertainty in visual retrieval systems.

  2. Improving the retrieval model’s interpretability by identifying confusing visual objects in train and test data. This reduces error propagation and enables data cleaning.

  3. Harnessing heteroscedastic uncertainty to improve retrieval performance by 1-2% and to improve model stability by modeling local noise in the embedding space.

2 Related Work

2.1 Bayesian Uncertainty Modeling

Bayesian models define two types of uncertainty: epistemic and aleatoric. Epistemic uncertainty, also known as model uncertainty, captures uncertainty in the model parameters. It reflects generalization error and can be reduced given enough training data. Aleatoric uncertainty is the uncertainty in our data, e.g., uncertainty due to observation noise. Kendall and Gal (2017) divide it into two sub-categories: heteroscedastic and homoscedastic. Homoscedastic uncertainty is task-dependent and not dependent on the input space, i.e., it is constant for all input data and varies between different tasks. Heteroscedastic uncertainty varies across the input space due to observational noise, i.e., \sigma = \sigma(x).

Quantifying uncertainties can potentially improve the performance, robustness, and interpretability of a system. Therefore, epistemic uncertainty modeling has been leveraged for semantic segmentation Nair et al. (2018), depth estimation Kendall & Gal (2017), active learning Gal et al. (2017), conditional retrieval Taha et al. (2019), and model selection Gal & Ghahramani (2016) through hyper-parameter tuning. Supervised approaches to learning heteroscedastic uncertainty that capture observational noise have been proposed Nix & Weigend (1994); Le et al. (2005). However, labeling heteroscedastic uncertainty in real-world problems is challenging and not scalable.

A recent approach Kendall & Gal (2017) regresses this uncertainty without supervision. It has been applied to semantic segmentation and depth estimation. By making the observation noise parameter \sigma data-dependent, it can be learned as a function of the input as follows

\mathcal{L}_{NN}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2\sigma(x_i)^2} \lVert y_i - f(x_i) \rVert^2 + \frac{1}{2} \log \sigma(x_i)^2    (1)

for a labeled dataset with N points (x_i, y_i), where f is a univariate regression function. This formulation allows the network to reduce the effect of erroneous labels: noisy data with high predicted uncertainty have a smaller influence on the loss function, which increases model robustness. The two terms in Equation 1 have contradicting objectives. While the first term favors high uncertainty for all points, the second term penalizes it.
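The following PyTorch-style sketch illustrates a loss of this form, assuming the network regresses s_i = \log \sigma(x_i)^2 alongside its prediction for numerical stability; the function and tensor names are ours, not from a released implementation.

    import torch

    def heteroscedastic_regression_loss(y_true, y_pred, log_var):
        # y_true, y_pred: (N, D) targets and predictions; log_var: (N, 1) predicted log sigma^2.
        residual = (y_true - y_pred).pow(2).sum(dim=1, keepdim=True)  # ||y_i - f(x_i)||^2
        # First term attenuates noisy points; second term penalizes large uncertainty.
        return (0.5 * torch.exp(-log_var) * residual + 0.5 * log_var).mean()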

We extend triplet loss to learn data uncertainty in a similar unsupervised manner. The network learns to discount parts of the input space when the increased uncertainty is justified by the data. This form of learned attenuation is a consequence of the probabilistic interpretation of the Kendall & Gal (2017) model.

2.2 Triplet Loss

To learn a space embedding, we leverage triplet loss for its simplicity and efficiency. It is more efficient than contrastive loss Hadsell et al. (2006); Li et al. (2017), and less computationally expensive than quadruplet Huang et al. (2016b); Chen et al. (2017) and quintuplet Huang et al. (2016a) losses. Equation 2 shows the triplet loss formulation

\mathcal{L}_{tri} = \sum_{(a, p, n)} m\big( D(e(a), e(p)) - D(e(a), e(n)) + \alpha \big)    (2)

where m is a soft margin function, \alpha is the margin between the embeddings of different classes, and e and D are the embedding and Euclidean distance functions respectively. This formulation attracts an anchor image of a specific class closer to all other positive images from the same class than it is to any negative image of other classes.
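As a minimal PyTorch sketch of Equation 2, using the hinge [x]_+ as the margin function m (the names and the default margin value are illustrative):

    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, alpha=0.2):
        # anchor, positive, negative: (N, D) embeddings; alpha: margin.
        d_ap = F.pairwise_distance(anchor, positive)  # D(e(a), e(p))
        d_an = F.pairwise_distance(anchor, negative)  # D(e(a), e(n))
        return F.relu(d_ap - d_an + alpha).mean()     # m(x) = [x]_+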

The performance of triplet loss relies heavily on the sampling strategy used during training. We experiment with both hard Hermans et al. (2017) and semi-hard sampling Schroff et al. (2015) strategies. In semi-hard negative sampling, instead of picking the hardest positive-negative samples, all anchor-positive pairs and their corresponding semi-hard negatives are considered. Semi-hard negatives are farther away from the anchor than the positive exemplar, yet still within the margin \alpha. Figure 2 shows a triplet loss tuple and highlights the different types of negative exemplars. Hard and semi-hard negatives satisfy Equations 4 and 3 respectively.

Figure 2: Triplet loss tuple (anchor, positive, negative) and margin \alpha. (H)ard, (s)emi-hard and (e)asy negatives highlighted in black, gray and white respectively.

D(e(a), e(p)) < D(e(a), e(n)) < D(e(a), e(p)) + \alpha    (3)
D(e(a), e(n)) < D(e(a), e(p))    (4)
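These two conditions translate directly into boolean masks over candidate negatives; a short sketch with our own variable names:

    import torch

    def negative_masks(d_ap, d_an, alpha):
        # d_ap: (N,) anchor-positive distances; d_an: (N,) anchor-negative distances.
        semi_hard = (d_an > d_ap) & (d_an < d_ap + alpha)  # Eq. 3: within the margin band
        hard = d_an < d_ap                                 # Eq. 4: closer than the positive
        return semi_hard, hard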

Triplet loss has been extended to explore epistemic (model) uncertainty Taha et al. (2019). In this paper, we propose a similar formulation to learn heteroscedastic (data) uncertainty.

2.3 Bayesian Retrieval

Dropout as a Bayesian approximation framework has been theoretically studied for both classification and regression problems Gal & Ghahramani (2016); Kendall & Gal (2017). To extend this framework to retrieval and space embedding problems, triplet loss is cast as a regression function Taha et al. (2019). Given a training dataset containing triplets (a_i, p_i, n_i) and their corresponding outputs y_i, the triplet loss can be formulated as a trivariate regression function as follows

f_r(a, p, n) = D(e(a), e(n)) - D(e(a), e(p))    (5)
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \lVert y_i - f_r(a_i, p_i, n_i) \rVert^2    (6)

Assuming a unit-circle normalized embedding, the target y_i = 2 if c(a_i) = c(p_i) and c(a_i) \neq c(n_i); and y_i = -2 if c(a_i) = c(n_i) and c(a_i) \neq c(p_i), s.t. c(\cdot) denotes the class label. This casting enables epistemic uncertainty learning for multi-modal conditional retrieval systems Taha et al. (2019). Inspired by this, Section 3 presents our proposed extension to capture heteroscedastic uncertainty.
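Under this casting, a sketch of the regression output and targets may look as follows; the ±2 targets correspond to the extreme distance gaps of unit-normalized embeddings, and the helper names are ours:

    import torch
    import torch.nn.functional as F

    def triplet_regression_output(a, p, n):
        # Unit-normalize the embeddings, then compute f_r = D(a, n) - D(a, p)  (Eq. 5).
        a, p, n = (F.normalize(x, dim=1) for x in (a, p, n))
        return F.pairwise_distance(a, n) - F.pairwise_distance(a, p)

    def regression_target(correctly_ordered):
        # y = +2 when (a, p) share a class and n differs; y = -2 in the reversed case.
        return torch.where(correctly_ordered, torch.tensor(2.0), torch.tensor(-2.0))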

3 Heteroscedastic Embedding

Heteroscedastic models investigate the observation space and identify parts suffering from high noise levels. Taha et al. (2019) cast normalized ranking losses as a regression function to study epistemic uncertainty. Similarly, we extend triplet loss to learn the data-dependent heteroscedastic uncertainty. This helps identify noisy and confusing objects in a retrieval system, either in queries or in the search gallery.

Normalized triplet loss is cast as a trivariate regression function Taha et al. (2019). It is straightforward to extend it to an unnormalized embedding with a soft margin as follows

f_u(a, p, n) = D(e(a), e(n)) - D(e(a), e(p))    (7)
\mathcal{L}_{tri}(a, p, n) = sp\big( -f_u(a, p, n) \big), \quad sp(x) = \ln(1 + e^{x})    (8)

f_u outputs a large positive value if c(a) = c(p) and c(a) \neq c(n); and a large negative value if c(a) = c(n) and c(a) \neq c(p), s.t. c(\cdot) denotes the class label. Unlike the univariate regression formulation Kendall & Gal (2017), triplet loss depends on three objects: anchor, positive, and negative. We extend the vanilla triplet loss to learn a noise parameter for each object independently, i.e., \sigma_a, \sigma_p, and \sigma_n. For a single triplet (a, p, n), the vanilla triplet loss is evaluated three times as follows

\mathcal{L}_{i}(a, p, n) = \frac{1}{2\sigma_i^2} \mathcal{L}_{tri}(a, p, n) + \frac{1}{2} \log \sigma_i^2, \quad i \in \{a, p, n\}    (9)
\mathcal{L}_{het}(a, p, n) = \frac{1}{3} \sum_{i \in \{a, p, n\}} \mathcal{L}_{i}(a, p, n)    (10)

where \sigma_i is the noise level learned for object i \in \{a, p, n\}. This formulation can be regarded as a weighted average triplet loss using data uncertainty. Similar to Kendall & Gal (2017), we compute a maximum a posteriori probability (MAP) estimate by adding a weight decay term parameterized by \lambda. This imposes a prior on the model parameters and reduces overfitting Le et al. (2005). Our neural network learns s_i := \log \sigma_i^2 because it is more numerically stable than regressing the variance \sigma_i^2. Thus, in practice the final loss function is

\mathcal{L} = \frac{1}{N} \sum_{t=1}^{N} \frac{1}{3} \sum_{i \in \{a, p, n\}} \Big[ \frac{1}{2} e^{-s_i} \mathcal{L}_{tri}(a_t, p_t, n_t) + \frac{1}{2} s_i \Big] + \lambda \lVert W \rVert^2    (11)

where N is the number of triplets (a_t, p_t, n_t). Our formulation can be generalized to support more complex ranking losses like the quintuplet loss Huang et al. (2016a). Equation 13 provides a generalization for k-tuples (x_1, \dots, x_k), where \mathcal{L}_{rank} denotes the underlying ranking loss

\mathcal{L}_{i} = \frac{1}{2\sigma_i^2} \mathcal{L}_{rank}(x_1, \dots, x_k) + \frac{1}{2} \log \sigma_i^2    (12)
\mathcal{L}_{het} = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}_{i}    (13)
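Putting Equations 9-11 together, the following PyTorch sketch shows one way the proposed loss could be implemented, assuming the encoder emits a log-variance s per input alongside its embedding and the softplus soft margin of Section 4 (names and the optional weight-decay handling are ours):

    import torch
    import torch.nn.functional as F

    def heteroscedastic_triplet_loss(emb_a, emb_p, emb_n, s_a, s_p, s_n,
                                     weight_decay=0.0, params=None):
        # emb_*: (N, D) embeddings; s_*: (N,) predicted log sigma^2 per object.
        d_ap = F.pairwise_distance(emb_a, emb_p)
        d_an = F.pairwise_distance(emb_a, emb_n)
        base = F.softplus(d_ap - d_an)                # soft-margin triplet loss (Eq. 8)
        loss = 0.0
        for s in (s_a, s_p, s_n):                     # Eqs. 9-10: one attenuated term per object
            loss = loss + 0.5 * torch.exp(-s) * base + 0.5 * s
        loss = (loss / 3.0).mean()                    # average over the tuple and the batch
        if params is not None:                        # Eq. 11: MAP weight-decay term
            loss = loss + weight_decay * sum(p.pow(2).sum() for p in params)
        return loss

The same pattern extends to k-tuple ranking losses by summing one attenuated term per tuple member, as in Equation 13.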

4 Architecture

The generic architecture employed in our experiments is illustrated in Figure 3. The encoder architecture depends on the input type. For an embedding space with dimensionality d, our formulation requires the encoder's final layer to output d + 1 values. The extra dimension learns the input's heteroscedastic uncertainty, i.e., s = \log \sigma^2. The following subsections present two encoder variants employed to handle image and video inputs.
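A sketch of such an encoder head; the module is hypothetical and any backbone producing a feature vector can be plugged in front of it:

    import torch
    import torch.nn as nn

    class UncertaintyHead(nn.Module):
        # Maps backbone features to a d-dimensional embedding plus one log-variance unit.
        def __init__(self, feature_dim, embed_dim):
            super().__init__()
            self.fc = nn.Linear(feature_dim, embed_dim + 1)

        def forward(self, features):
            out = self.fc(features)
            embedding, log_var = out[:, :-1], out[:, -1]  # split off the uncertainty dimension
            return embedding, log_var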

4.1 Image Retrieval

For the image-based tasks of person re-identification and fashion item retrieval, we employ the architecture from Hermans et al. (2017). Given an input RGB image, the encoder is a fine-tuned ResNet architecture He et al. (2016), pretrained on ImageNet Deng et al. (2009), followed by a fully-connected network (FCN). In our experiments, the final output is not normalized and the soft margin between classes is imposed by the softplus function sp(x) = \ln(1 + e^{x}). It is similar to the hinge function but decays exponentially instead of imposing a hard cut-off. We employ hard and semi-hard negative sampling strategies for person re-identification and fashion item retrieval respectively.

4.2 Video Retrieval

For autonomous navigation, a simplified version of Taha et al. (2019) architecture is employed. The Honda driving dataset provides multiple input modalities, e.g., camera and CAN sensors, and similarity notions between actions (events). We employ the camera modality and two similarity notions: goal-oriented and stimulus-driven. Input video events from the camera modality are represented using pre-extracted features per frame, from the Conv2d_7b_1x1 layer of InceptionResnet-V2 Szegedy et al. (2017) pretrained on ImageNet, to reduce GPU memory requirements.

Modeling temporal context provides an additional and important cue for action understanding Simonyan & Zisserman (2014). Thus, the encoder employs an LSTM Funahashi & Nakamura (1993); Hochreiter & Schmidhuber (1997) after a shallow CNN. During training, three random consecutive frames are drawn from an event. They are independently encoded and then temporally fused using the LSTM. Sampling more frames per event would likely lead to better performance; unfortunately, GPU memory constrains the number of sampled frames to three. The network output is the hidden state of the LSTM's last time step. Further architectural details are provided in the supplementary material.
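A simplified sketch of this temporal encoder under our assumptions; the shallow CNN is reduced to a linear projection over pre-extracted frame features, and all names are illustrative:

    import torch
    import torch.nn as nn

    class VideoEventEncoder(nn.Module):
        # Encodes per-frame features, fuses them with an LSTM, returns embedding + log-variance.
        def __init__(self, frame_feat_dim, hidden_dim, embed_dim):
            super().__init__()
            self.frame_encoder = nn.Linear(frame_feat_dim, hidden_dim)  # stand-in for the shallow CNN
            self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, embed_dim + 1)

        def forward(self, frames):
            # frames: (N, T, frame_feat_dim), e.g., T = 3 sampled frames per event.
            x = torch.relu(self.frame_encoder(frames))
            _, (h_n, _) = self.lstm(x)                  # h_n: (1, N, hidden_dim), last time step
            out = self.head(h_n[-1])
            return out[:, :-1], out[:, -1]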

Figure 3: Our generic retrieval network supports various encoder architectures trained through a ranking loss.

5 Experiments

We evaluate our formulation on three retrieval domains through three datasets. First, it is validated using the standard person re-identification benchmark Zheng et al. (2017). To model inter-class similarity and local noise, we then leverage two real-world datasets: Clothing1M and the Honda Driving Dataset (HDD). Both datasets emulate real scenarios with noisy labels due to inter-class similarity or inherent uncertainty.

5.1 Person Re-Identification

Person re-identification is employed in Multi-Target Multi-Camera Tracking systems. A re-identification system retrieves images of people and ranks them by decreasing similarity to a given query image of a person. DukeMTMC-reID Zheng et al. (2017) is used for evaluation. It includes 1,404 identities appearing in more than two cameras and 408 identities appearing in a single camera for distraction purposes. 702 identities are reserved for training and 702 for testing. We evaluate our formulation on this clean dataset for two reasons: (1) it provides an established quantitative benchmark to emphasize the competence and robustness of our approach; (2) comprehending the qualitative results requires no domain-specific knowledge.

For each training mini-batch, we uniformly sample person identities without replacement. For each person, sample images are drawn without replacement and resized to a fixed resolution. The learning rate is held constant for the first 15,000 iterations and then decays until iteration 25,000. Weight regularization is employed with coefficient \lambda. Our formulation is evaluated twice, with and without data augmentation. Similar to Ristani & Tomasi (2018), we augment images by cropping and horizontal flipping. For illumination invariance, contrast normalization, grayscale, and color multiplication effects are applied. For resolution invariance, we apply Gaussian blur of varying standard deviation. For additional viewpoint/pose invariance, we apply perspective transformations and small distortions. We additionally hide small rectangular image patches to simulate occlusion.

Figure 1 shows the five lowest and highest uncertainty query identities from the DukeMTMC-ReID dataset. Heteroscedastic uncertainty is high when the query image contains multiple identities or a single identity with an outfit that blends into the background. On the contrary, identities with discriminative outfit colors (e.g., red) exhibit low uncertainty.

Table 1 presents our quantitative evaluation where the performance of our method is comparable to the state-of-the-art. All experiments are executed five times, and mean average precision (mAP) and standard deviation are reported. Our formulation lags marginally due to limited confusing samples in the training split. However, it has a smaller standard deviation. It is noteworthy that the performance gap between vanilla Tri-ResNet and our formulation closes when applying augmentation. This aligns with our hypothesis that the lack of confusing samples limits our formulation. In the next subsections, we evaluate on real-world datasets containing noisy samples.

Method | mAP | Top-5
BoW+KISSME Zheng et al. (2015) | 12.17 | -
LOMO+XQDA Liao et al. (2015) | 17.04 | -
Baseline Zheng et al. (2016) | 44.99 | -
PAN Zheng et al. (2018) | 51.51 | -
SVDNet Sun et al. (2017) | 56.80 | -
Tri-ResNet Hermans et al. (2017) | 56.08 ± 0.005 | 86.76 ± 0.007
Tri-ResNet + Hetero (ours) | 55.16 ± 0.002 | 86.03 ± 0.005
Tri-ResNet + Aug | 56.44 ± 0.006 | 86.11 ± 0.003
Tri-ResNet + Aug + Hetero (ours) | 56.74 ± 0.004 | 86.20 ± 0.002
Table 1: Quantitative evaluation on DukeMTMC-ReID.

5.2 Fashion Image Retrieval

A major drawback of the person re-identification dataset is the absence of noisy images; images with multiple identities are confusing but incidental in the training split. To underscore the importance of our formulation, a large dataset with noisy data is required. The Clothing1M fashion dataset Xiao et al. (2015) emulates this scenario by providing a large-scale collection of clothing items crawled from several online shopping websites. It contains over one million images and their descriptions. Fashion items are labeled using a noisy process: a label is assigned if the description contains the keywords of that label; otherwise, the image is discarded. A small clean portion of the data is made available after manual refinement. The final training split contains 22,933 (2.19%) clean and 1,024,637 (97.81%) noisy labeled images. The validation and test sets contain only clean images.

Figure 4: Clothing1M class distribution over the 14 classes: T-Shirt, Shirt, Knitwear, Chiffon, Sweater, Hoodie, Windbreaker, Jacket, Down Coat, Suit, Shawl, Dress, Vest, and Underwear.

Figure 4 shows the 14 classes, and their distribution, in the Clothing1M dataset. Training with both clean and noisy data is significantly superior to training with clean data only. Since manual inspection of noisy data is expensive, our unsupervised formulation qualitatively identifies confusing samples and provides an efficient way to deal with noisy data. For the Clothing1M dataset, the training parameters are similar to those of the person re-identification dataset except for the following: each minibatch contains an equal number of samples from all 14 classes and a different input image resolution is used. The semi-hard sampling strategy is employed to mitigate the effect of noisy labels. The model is trained for 25K iterations, which is equivalent to six epochs.

Method | mAP
Tri-ResNet (Clean Only) | 52.62
Tri-ResNet (Baseline) | 61.70 ± 0.001
Tri-ResNet + Hetero (ours) | 62.27 ± 0.001
Tri-ResNet + Random Cleaning | 62.33 ± 0.003
Tri-ResNet + Hetero Cleaning | 64.57 ± 0.002
Table 2: Quantitative evaluation on Clothing1M. The first row shows performance with clean data only, while the remaining rows leverage both clean and noisy data. The last two rows show performance after cleaning 20% of the search gallery samples.

For quantitative evaluation, the validation and test splits act as the query and gallery databases respectively. Table 2 presents retrieval performance using only clean data vs. both clean and noisy data. Mean and standard deviation across five trials are reported. By modeling data uncertainty, our formulation improves both the performance and the interpretability of the model. We utilize the learned uncertainty to clean confusing samples from the search gallery database. The last two rows in Table 2 present retrieval performance after cleaning 20% of the search database. While random cleaning achieves no improvement, removing the items with the highest uncertainty boosts performance by 2%.
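The cleaning step itself is straightforward: rank gallery items by their predicted uncertainty and drop the most uncertain fraction. A sketch with our own array names:

    import numpy as np

    def clean_gallery(gallery_log_var, fraction=0.2):
        # Drop the `fraction` of gallery items with the highest predicted uncertainty.
        order = np.argsort(gallery_log_var)                    # ascending uncertainty
        keep = order[: int(len(order) * (1.0 - fraction))]
        return keep                                            # indices of retained gallery items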

Figure 5: Qualitative evaluation using the highest five uncertainty training images from Sweater and Suit classes in Clothing1M.

Leveraging uncertainty to refine the training split is a plausible extension but requires extensive manual labor. Figure 5 presents the five highest uncertainty training images from two classes. The supplementary material provides a qualitative evaluation showing the images with the highest uncertainty scores from each class. Most of these images are either incorrectly labeled or contain multiple distinct objects, which highlights the utility of our approach.

Figure 6 depicts a negative Pearson correlation between the retrieval average precision of query items and their heteroscedastic uncertainty. Query images are aggregated by average-precision percentiles on the x-axis, and the aggregated items' average uncertainty is reported on the y-axis. Figure 7 shows query images chosen from the highest uncertainty percentile and their corresponding top four results. For visualization purposes, we discretize the data uncertainty using percentiles into five bins: very low (green), low (yellow), moderate (orange), high (violet), and very high (red). Confusion between certain classes, like Sweater and Knitwear, Knitwear and Windbreaker, and Jacket and Down Coat, is evident.
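The statistics behind Figure 6 can be reproduced from per-query measurements along these lines; a sketch, with variable names and bin choices that are ours:

    import numpy as np
    from scipy.stats import pearsonr

    def ap_uncertainty_correlation(per_query_ap, per_query_uncertainty, n_bins=10):
        per_query_ap = np.asarray(per_query_ap)
        per_query_uncertainty = np.asarray(per_query_uncertainty)
        r, _ = pearsonr(per_query_ap, per_query_uncertainty)   # a negative r is expected
        # Aggregate queries by average-precision percentile, as on the x-axis of Figure 6.
        edges = np.percentile(per_query_ap, np.linspace(0, 100, n_bins + 1))
        bins = np.digitize(per_query_ap, edges[1:-1])
        mean_unc = [per_query_uncertainty[bins == b].mean() for b in range(n_bins)]
        return r, mean_unc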

Figure 6: Quantitative analysis reveals the correlation between query images retrieval average precision and uncertainty. Queries with high average precision suffer lower uncertainty and vice versa. Query images are aggregated using average precision percentiles on the x-axis. Y-axis is the aggregated images’ mean uncertainty and standard deviation.

Figure 8 shows a principal component analysis (PCA) projection of the embeddings of 4K randomly chosen query items. Points in the left and right projections are colored by class label and uncertainty degree respectively. Images at the center of their classes (in green) have lower uncertainty compared to points spread throughout the space. The inherent inter-class similarity, e.g., between Sweater and Knitwear, explains why certain regions have very high uncertainty. A qualitative evaluation with very low uncertainty query items is provided in the supplementary material.

Query | Top 4 results
Sweater | Knitwear, Sweater, Sweater, Sweater
Knitwear | Windbreaker, Windbreaker, Windbreaker, Shawl
Jacket | Down Coat, Down Coat, T-Shirt, Jacket
Figure 7: Qualitative evaluation using three very high uncertainty queries from Clothing1M dataset. Outline colors emphasize the uncertainty degree, e.g., red is very high. Inter-class similarity is a primary confusion source.
Figure 8: Qualitative analysis for the Clothing1M dataset embedding using 4K random points. The left and right plots show a PCA projection colored with class-label and uncertainty degree respectively. Points closer to class centers suffer lower uncertainty compared to farther points. Confusing inter-class similarity is highlighted with visual samples. The left zoom-in figures show four rows with samples from Sweater, Knitwear, Chiffon, and Shirt respectively. These high-resolution figures are best viewed in color/screen.

5.3 Autonomous Navigation

Modeling network and data uncertainty is gaining momentum in safety-critical domains like autonomous driving. We evaluate our approach on ego-motion action retrieval. The Honda driving dataset (HDD) Ramanishka et al. (2018) is designed to support modeling driver behavior and understanding causal reasoning. It defines four annotation layers: (1) goal-oriented actions represent the egocentric activities taken to reach a destination, like left and right turns; (2) stimulus-driven actions are due to external causation factors, like stopping to avoid a pedestrian or stopping for a traffic light; (3) cause indicates the reason for an action; and (4) the attention layer localizes the traffic participants that drivers attend to. Every layer is categorized into a set of classes (actions). Figures 9 and 10 show the class distribution for the goal-oriented and stimulus-driven layers respectively.

Figure 9: HDD long-tail goal-oriented action distribution.

Figure 10: HDD imbalanced stimulus-driven action distribution.

Experiments’ technical details are presented in the supplementary material. An event retrieval evaluation using query-by-example is performed. Given a query event, similarity scores to all events are computed, i.e., a leave-one-out cross evaluation on the test split. Performances of all queries are averaged to obtain the final evaluation.

To tackle data imbalance and highlight performance on minority classes, both micro and macro average accuracies are reported. Macro-average computes the metric for each class independently before taking the average. Micro-average is the traditional mean for all samples. Macro-average treats all classes equally while micro-averaging favors majority classes. Tables 3 and 4 show quantitative evaluation for networks trained on goal-oriented and stimulus-driven events respectively.
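Concretely, the two aggregations differ only in whether per-class means are taken before averaging; a short sketch with our own argument names:

    import numpy as np

    def micro_macro_map(per_query_ap, per_query_class):
        per_query_ap = np.asarray(per_query_ap)
        per_query_class = np.asarray(per_query_class)
        micro = per_query_ap.mean()                            # plain mean over all queries
        macro = np.mean([per_query_ap[per_query_class == c].mean()
                         for c in np.unique(per_query_class)]) # mean of per-class means
        return micro, macro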

Method | Baseline | Hetero (ours)
Micro mAP | 77.88 ± 0.003 | 78.45 ± 0.004
Macro mAP | 32.62 ± 0.004 | 30.8 ± 0.004
Intersection Passing | 89.1 ± 0.006 | 91.44 ± 0.002
Left Turn | 81.29 ± 0.007 | 80.15 ± 0.008
Right Turn | 89.99 ± 0.008 | 89.19 ± 0.016
Left Lane Change | 24.65 ± 0.005 | 20.28 ± 0.008
Right Lane Change | 16.04 ± 0.018 | 9.19 ± 0.002
Crosswalk Passing | 1.13 ± 0.001 | 1.31 ± 0.002
U-turn | 3.59 ± 0.004 | 2.62 ± 0.003
Left Lane Branch | 14.03 ± 0.022 | 8.65 ± 0.024
Right Lane Branch | 2.15 ± 0.001 | 1.48 ± 0.011
Merge | 4.26 ± 0.005 | 3.62 ± 0.006
Table 3: Quantitative evaluation on goal-oriented actions.
Method | Baseline | Hetero (ours)
Micro mAP | 66.50 ± 0.008 | 68.20 ± 0.008
Macro mAP | 35.33 ± 0.005 | 36.23 ± 0.008
Stop 4 Sign | 87.85 ± 0.005 | 89.18 ± 0.006
Stop 4 Light | 52.63 ± 0.013 | 49.86 ± 0.005
Stop 4 Congestion | 63.88 ± 0.014 | 67.88 ± 0.016
Stop 4 Others | 1.62 ± 0.010 | 1.02 ± 0.004
Stop 4 Pedestrian | 2.72 ± 0.007 | 2.56 ± 0.002
Avoid Parked Car | 3.30 ± 0.002 | 6.90 ± 0.024
Table 4: Quantitative evaluation on stimulus-driven actions.


(Query) Right Lane Change with very high uncertainty


(Retrieval result) Right Turn with very high uncertainty


(Query) Right Turn with very high uncertainty


(Retrieval result) Right Turn with very low uncertainty

Figure 11: Qualitative evaluation on HDD using goal-oriented events. Every query is followed by its nearest retrieval result. Outline colors indicate the event uncertainty degree. The first query shows a high uncertainty right lane change. The second query shows a right-turn maneuver blocked by a crossing pedestrian. These images are best viewed in color on screen.

Figure 11 presents a qualitative evaluation on HDD. Every two consecutive rows show a very high uncertainty query event and its nearest retrieval result. All query events are chosen within the highest uncertainty percentile. A description containing the event class and its uncertainty degree is provided below each event. The first query (first row) shows the driver moving from the wrong direction lane to the correct one behind a pickup truck, with a huge cat drawn on a building wall. This example illustrates how uncertainty grounding is challenging in video events. The nearest event to this query (second row) is a very high uncertainty right turn, a similar but not identical event class.

The second query (third row) shows a right-turn maneuver where the driver is waiting for a crossing pedestrian. The retrieved result (fourth row) belongs to the same class but has very low uncertainty. We posit that the high and low uncertainty are due to the pedestrian's presence and absence respectively. More visualizations using GIFs are available in the supplementary material.

5.4 Discussion

We study uncertainty in visual retrieval systems by introducing an extension to the triplet loss. Our unsupervised formulation models embedding space uncertainty. This improves the performance of the system and identifies confusing visual examples without raising the computational cost. We evaluate our formulation on multiple domains through three datasets. Real-world noisy datasets highlight the utility of our formulation. Qualitative results emphasize our ability to identify confusing scenarios. This enables data cleaning and reduces error propagation in safety-critical systems.

One limitation of the proposed formulation is bias against minority classes. It treats minority class training samples as noisy input and attenuates their contribution. Tables 3 and 4 emphasize this phenomenon where performance on the majority and minority classes increases and decreases respectively. Accordingly, micro mAP increases while macro mAP decreases. Thus, this formulation is inadequate for boosting minority classes’ performance in imbalanced datasets.

For image applications, high uncertainty is relatively easy to understand, which is a favorable quality. Occlusion, inter-class similarity, and multiple distinct instances contribute to visual uncertainty. Unfortunately, this is not the case in video applications, where it is challenging to explain uncertainty in events with multiple independent agents. Attention models Xu et al. (2015); Zhou et al. (2016) are one potential extension that could ground uncertainty in video datasets.

6 Conclusion

We propose an unsupervised ranking loss extension to model local noise in the embedding space. Our formulation supports various embedding architectures and ranking losses. It quantifies data uncertainty in visual retrieval systems without raising their computational complexity. This improves stability on clean datasets and performance on inherently noisy real-world datasets. Qualitative evaluations highlight the ability of our approach to identify confusing visual examples, a valuable capability for safety-critical domains like autonomous navigation.

References

  • Bishop et al. (1995) Bishop, C. M. et al. Neural networks for pattern recognition. Oxford university press, 1995.
  • Chen et al. (2017) Chen, W., Chen, X., Zhang, J., and Huang, K. Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR, 2017.
  • Cheng et al. (2016) Cheng, D., Gong, Y., Zhou, S., Wang, J., and Zheng, N. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In CVPR, 2016.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Funahashi & Nakamura (1993) Funahashi, K.-i. and Nakamura, Y. Approximation of dynamical systems by continuous time recurrent neural networks. Neural networks, 1993.
  • Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
  • Gal et al. (2017) Gal, Y., Islam, R., and Ghahramani, Z. Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
  • Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. IEEE, 2006.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
  • Hermans et al. (2017) Hermans, A., Beyer, L., and Leibe, B. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 1997.
  • Huang et al. (2016a) Huang, C., Li, Y., Change Loy, C., and Tang, X. Learning deep representation for imbalanced classification. In CVPR, 2016a.
  • Huang et al. (2016b) Huang, C., Loy, C. C., and Tang, X. Local similarity-aware deep feature embedding. In NIPS, 2016b.
  • Kendall & Gal (2017) Kendall, A. and Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, 2017.
  • Le et al. (2005) Le, Q. V., Smola, A. J., and Canu, S. Heteroscedastic gaussian process regression. In ICML, 2005.
  • Li et al. (2017) Li, Y., Song, Y., and Luo, J. Improving pairwise ranking for multi-label image classification. In CVPR, 2017.
  • Liao et al. (2015) Liao, S., Hu, Y., Zhu, X., and Li, S. Z. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
  • McAllister et al. (2017) McAllister, R., Gal, Y., Kendall, A., Van Der Wilk, M., Shah, A., Cipolla, R., and Weller, A. V. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. In International Joint Conferences on Artificial Intelligence, Inc., 2017.
  • Nair et al. (2018) Nair, T., Precup, D., Arnold, D. L., and Arbel, T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
  • Nix & Weigend (1994) Nix, D. A. and Weigend, A. S. Estimating the mean and variance of the target probability distribution. In Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference On, 1994.
  • Ramanishka et al. (2018) Ramanishka, V., Chen, Y.-T., Misu, T., and Saenko, K. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In CVPR, 2018.
  • Ristani & Tomasi (2018) Ristani, E. and Tomasi, C. Features for multi-target multi-camera tracking and re-identification. arXiv preprint arXiv:1803.10859, 2018.
  • Sankaranarayanan et al. (2016) Sankaranarayanan, S., Alavi, A., Castillo, C., and Chellappa, R. Triplet probabilistic embedding for face verification and clustering. arXiv preprint arXiv:1604.05417, 2016.
  • Schroff et al. (2015) Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • Su et al. (2016) Su, C., Zhang, S., Xing, J., Gao, W., and Tian, Q. Deep attributes driven multi-camera person re-identification. In ECCV, 2016.
  • Sun et al. (2017) Sun, Y., Zheng, L., Deng, W., and Wang, S. Svdnet for pedestrian retrieval. arXiv preprint, 2017.
  • Szegedy et al. (2017) Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
  • Taha et al. (2019) Taha, A., Chen, Y.-T., Yang, X., Misu, T., and Davis, L. Exploring uncertainty in conditional multi-modal retrieval systems. arXiv preprint, 2019.
  • Xiao et al. (2015) Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. Learning from massive noisy labeled data for image classification. In CVPR, 2015.
  • Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • Zheng et al. (2015) Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. Scalable person re-identification: A benchmark. In ICCV, 2015.
  • Zheng et al. (2016) Zheng, L., Yang, Y., and Hauptmann, A. G. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
  • Zheng et al. (2017) Zheng, Z., Zheng, L., and Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.
  • Zheng et al. (2018) Zheng, Z., Zheng, L., and Yang, Y. Pedestrian alignment network for large-scale person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
  • Zhou et al. (2016) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In CVPR, 2016.