
# Exploiting Test Time Evidence to Improve Predictions of Deep Neural Networks

Dinesh Khandelwal  Suyash Agrawal  Parag Singla  Chetan Arora
Indian Institute of Technology Delhi
###### Abstract

Many prediction tasks, especially in computer vision, are inherently ambiguous. For example, the output of semantic segmentation may depend on the scale one is looking at, and image saliency or video summarization is often user or context dependent. Arguably, in such scenarios, exploiting instance specific evidence, such as scale or user context, can help resolve the underlying ambiguity, leading to improved predictions. While existing literature has considered incorporating such evidence in classical models such as probabilistic graphical models (PGMs), there is limited (or no) prior work looking at this problem in the context of deep neural network (DNN) models. In this paper, we present a generic multi-task learning (MTL) based framework which handles the evidence as the output of one or more secondary tasks, while modeling the original problem as the primary task of interest. Our training phase is identical to the one used by standard MTL architectures. During prediction, we back-propagate the loss on the secondary task(s) such that network weights are re-adjusted to match the evidence. An early stopping or two-norm based regularizer ensures that the weights do not deviate significantly from the ones learned originally. Implementation in two specific scenarios, (a) predicting semantic segmentation given image level tags, and (b) predicting instance level segmentation given a text description of the image, clearly demonstrates the effectiveness of our proposed approach.

## 1 Introduction

One significant limitation of the MTL frameworks suggested so far is that they make use of auxiliary information only during the training process. This is despite the fact that such information is often also available at test time, e.g., tags on a Facebook image. Arguably, exploiting this information at test time can significantly boost prediction accuracy by resolving the underlying ambiguity and/or correcting modelling errors.

The motivation for incorporating auxiliary information at test time can also be drawn from human perception. Figure 1 shows two-tone Mooney images [mooney1957age] used by Craig Mooney to study perceptual closure in children. Here, perceptual closure is the ability to form a complete percept of an object or pattern from an incomplete one. It was shown that, though it may be difficult to make much sense of any structure in the given images at first, once the additional information is provided that these represent the faces of a woman, one easily starts perceiving them.

A natural question to ask is whether there is a way to incorporate similar instance specific additional clues in modern deep neural network models. While similar works in the classical machine learning literature, such as PGMs [koller2009probabilistic], have considered conditional inference, to the best of our knowledge there is no prior work incorporating such auxiliary information in the context of DNN models, especially using an MTL based framework. We will henceforth refer to the instance specific auxiliary information as evidence.

We model our task of interest, e.g., semantic segmentation, as the primary task in the MTL framework. The evidence is modelled as the output of one or more secondary tasks, e.g., image tags. While our training process is identical to the one used by standard MTL architectures, our testing phase is quite different. Instead of simply doing a forward propagation during prediction, we back-propagate the loss on the output of the secondary tasks (evidence) and re-adjust the weights learned to match the evidence. In order to avoid over-fitting the observed evidence, we employ a regularizer in the form of two norm penalty or early stopping, so that the weights do not deviate significantly from their originally learned values.

The specific contributions of this work are as follows: (1) We propose a novel adaptation of the MTL framework which can incorporate evidence at test time to improve the predictions of a DNN. The framework is generic in the sense that it can use existing DNN architectures, pre-trained for a variety of tasks, along with diverse, inexpensively gathered auxiliary information as evidence. (2) We suggest two approaches to re-adjust the weights of the deep network so as to match the network output to the evidence. (3) We demonstrate the efficacy of the proposed framework by improving state-of-the-art results on two specific problems, viz., semantic segmentation and instance segmentation. Figure 2 shows some sample results.

We stress that our focus in this paper is to show the improvement obtained by using easily available auxiliary information along with state-of-the-art models. In that regard, these models should not be seen as our competitors, but rather as being enhanced using our approach. We also present other naïve baselines which incorporate auxiliary information for direct comparison (e.g., using label pruning).

## 2 Related Work

As argued earlier, though our architecture may seem similar in style to existing work trying to boost the performance of the primary task based on auxiliary tasks [liang2015semantic, bingel2017identifying, romera2012exploiting], the key difference is that while earlier works exploit the use of correlated tasks only during the training process, we, in addition, focus on back-propagating the available (instance specific) evidence at prediction time as well. This is an important conceptual difference and can result in significant improvements by exploiting additional information, as shown by our experiments.

We would also like to differentiate our work from that of posterior inference with priors. While priors can be learned for sample distributions, our work suggests conditional inference in the presence of sample specific evidence. Similarly, posterior regularization technique [ganchev2010posterior] changes the output distribution directly, albeit, only based on characteristics of the underlying data statistics. No sample specific evidence is used.

Another closely related research area is multi-modal inference [rosenfeld2018priming, kilickaya2016leveraging, cheng2014imagespirit], which also incorporates additional features from auxiliary information. While this does effectively incorporate evidence at prediction time in the form of additional features, practically speaking, designing a network to take additional information from a highly sparse information source is non-trivial. For example, in one of our experiments we show semantic segmentation conditioned upon image level tags as given by an image classification task. It is easy to see that designing an MTL based DNN architecture for semantic segmentation and image classification is not difficult; on the other hand, designing a network which takes a single image label and generates features for merging with RGB image features seems non-trivial. However, the strongest argument in support of our framework is its ability to work even when only a single set of annotations is available. It is possible to train our architecture even when the dataset contains either primary or auxiliary annotations. On the other hand, a multi-modal input based architecture would require a dataset containing both annotations at the same time, which greatly restricts its applicability. Note that the argument extends to test time as well: if auxiliary information is unavailable at test time, our framework can fall back to regular predictions, while an architecture with multi-modal input will fail to take off.

Some recent works [pathak2015constrained, xu2017semantic, marquez2017imposing] have proposed constraining the output of a DNN to help regularize the output and reduce the amount of training data required. While all these works suggest constraints during training, our approach imposes the constraints at both training and inference time.

We note that our framework is similar in spirit to the contemporary work by Lee et al. [lee2017enforcing], who have also proposed enforcing test time constraints on a DNN output. However, while their idea is to enforce ‘prior deterministic constraints’ arising out of natural rule based processing, our framework is designed to exploit any easily available, arbitrary type of auxiliary information. Our framework can be used together with theirs, and is also more generalizable, since it does not require the constraints to be imposed on the output of the same stream.

## 3 Framework for Back-propagating Evidence

In this section, we present our approach for boosting the accuracy of a given task of interest by incorporating evidence. Our solution employs a generic MTL [ruder2017overview] based architecture which consists of a main (primary) task of interest, and another auxiliary task whose desired output (label) represents the evidence in the network. The key contribution of our framework is its ability to back-propagate the loss on the auxiliary task during prediction time, such that weights are re-adjusted to match the output of the auxiliary task with the given evidence. In this process, as we will see, the shared weights (in MTL) also get re-adjusted, producing a better output on the primary task. This is what we refer to as back-propagating evidence through the network (at prediction time). We note that though we describe our framework using a single auxiliary task to keep the notation simple, it is straightforward to extend this to a setting with more than one auxiliary task (and associated evidence at prediction time).

### 3.1 Background on MTL

#### Notation

We will use $P$ to denote the primary task of interest. Similarly, let $A$ denote the auxiliary task in the network. Let $(x^{(i)}, y^{(i)}, a^{(i)})$ denote the $i^{th}$ training example, where $x^{(i)}$ is the input feature vector, $y^{(i)}$ is the desired output (label) of the primary task, and $a^{(i)}$ denotes the desired output (label) of the auxiliary task. Correspondingly, let $\hat{y}^{(i)}$ and $\hat{a}^{(i)}$ denote the outputs produced by the network for the primary and auxiliary tasks, respectively.

#### Model

Figure 13 shows the MTL based architecture [ruder2017overview] for this set-up. There is a common set of layers shared between the two tasks, followed by the task specific layers. $Z$ represents the common hidden feature representation fed to the two task specific parts of the architecture. For ease of notation, we will refer to the shared set of layers as the trunk.

The network has three sets of weights. First, there are the weights associated with the trunk, denoted by $w_s$. $w_p$ and $w_a$ are the sets of weights associated with the primary and auxiliary task specific branches, respectively. The total loss is a function of these weight parameters and can be defined as:

$$L_T(\cdot) = \sum_{i=1}^{m} L_P\big(y^{(i)}, \hat{y}^{(i)}\big) + \lambda \sum_{i=1}^{m} L_A\big(a^{(i)}, \hat{a}^{(i)}\big)$$

Here, $L_P$ and $L_A$ denote the loss for the primary and auxiliary tasks, respectively. $\lambda$ is the importance weight for the auxiliary task. The sum is taken over the $m$ examples in the training set. $\hat{y}^{(i)}$ is a function of the shared set of weights $w_s$ and the task specific weights $w_p$. Similarly, $\hat{a}^{(i)}$ is a function of the shared weights $w_s$ and the task specific weights $w_a$.
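To make the objective concrete, the following minimal numpy sketch (our own illustration, not the authors' code; the function and variable names are hypothetical) computes the total MTL loss for a batch, using categorical cross-entropy for the primary task and binary cross-entropy for a multi-label auxiliary task:

```python
import numpy as np

def mtl_total_loss(y, y_hat, a, a_hat, lam):
    """Total MTL loss over m examples: primary categorical cross-entropy
    plus a lambda-weighted auxiliary binary cross-entropy."""
    eps = 1e-12  # numerical guard against log(0)
    # Primary task: -sum_k y_k log(y_hat_k), summed over all examples.
    primary = -np.sum(y * np.log(y_hat + eps))
    # Auxiliary task: multi-label binary cross-entropy, summed over examples.
    auxiliary = -np.sum(a * np.log(a_hat + eps)
                        + (1 - a) * np.log(1 - a_hat + eps))
    return primary + lam * auxiliary
```

Setting `lam` to zero recovers single-task training of the primary branch, which is how the segmentation-only initialization described later can be viewed.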

#### Training

The goal of training is to find the weights which minimize the total loss over the training data. Using the standard approach of gradient descent, the gradients can be computed as follows:

1. $\frac{\partial L_T}{\partial w_p} = \sum_{i=1}^{m} \frac{\partial L_P(y^{(i)}, \hat{y}^{(i)})}{\partial w_p}$

2. $\frac{\partial L_T}{\partial w_a} = \lambda \sum_{i=1}^{m} \frac{\partial L_A(a^{(i)}, \hat{a}^{(i)})}{\partial w_a}$

3. $\frac{\partial L_T}{\partial w_s} = \sum_{i=1}^{m} \frac{\partial L_P(y^{(i)}, \hat{y}^{(i)})}{\partial w_s} + \lambda \sum_{i=1}^{m} \frac{\partial L_A(a^{(i)}, \hat{a}^{(i)})}{\partial w_s}$

Note that the weights in the task specific branches, i.e., $w_p$ and $w_a$, can only affect the losses defined over their respective tasks (items 1 and 2 above). On the other hand, the weights $w_s$ in the trunk affect the losses defined over both the primary as well as the auxiliary task. Next, we describe our approach of back-propagating the loss over the evidence.

### 3.2 Our Approach - Prediction

During test time, we are given additional information about the output of the auxiliary task. Let us denote this by $\tilde{a}$ (evidence) to distinguish it from the auxiliary outputs at training time. Then, for inference, instead of directly proceeding with the forward propagation, we first adjust the weights of the network such that the network is forced to match the evidence on the auxiliary task. Since the two tasks are correlated, we expect that this process will adjust the weights of the network in a manner such that resolving the ambiguity over the auxiliary output will also result in an improved prediction over the primary task of interest.

This feat can be achieved by defining a loss $L_A(\tilde{a}, \hat{a})$ in terms of the evidence, and then back-propagating its gradient through the network. Note that this loss only depends on the set of weights $w_a$ in the auxiliary branch, and the weights $w_s$ in the trunk. In particular, the weights $w_p$ remain untouched during this process. Finally, we would also like to make sure that the weights do not deviate too much from the originally learned weights, to avoid over-fitting the observed evidence. This can be achieved by adding a two-norm based regularizer which discourages weights that are far from the originally learned values. The corresponding weight update equations can be derived using the following gradients:

$$\frac{\partial L}{\partial w_a} = \lambda \frac{\partial L_A(\tilde{a}, \hat{a})}{\partial w_a} + \gamma \left(w_a - w_a^0\right)$$

$$\frac{\partial L}{\partial w_s} = \lambda \frac{\partial L_A(\tilde{a}, \hat{a})}{\partial w_s} + \gamma \left(w_s - w_s^0\right)$$

Here, $w_s^0$ and $w_a^0$ denote the weights learned during training, and $\gamma$ is the regularization parameter. Note that these equations are similar to those used during training (items 2 and 3), with the differences that: (1) the loss is now computed with respect to the single test example; (2) the effect of the term dependent on the primary loss has been zeroed out; (3) a regularizer term has been added. In our experiments, we also experimented with early stopping instead of adding the norm based regularizer.

Algorithm 1 describes our algorithm for weight update during test time, and Figure 4 explains it pictorially. Once the new weights are obtained, they can be used in the forward propagation to obtain the desired value on the primary task.
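As a self-contained illustration of this prediction-time procedure, the toy numpy sketch below (our own illustration with a hand-rolled two-branch linear network; it is not the actual implementation, and all names are hypothetical) freezes the primary branch, takes a few gradient steps on the auxiliary loss with an L2 pull-back toward the trained weights, and leaves the re-adjusted weights ready for a final forward pass:

```python
import numpy as np

def forward(x, Ws, Wp, Wa):
    """Trunk plus two branches: z is the shared representation,
    y_hat the primary scores, a_hat the sigmoid auxiliary outputs."""
    z = np.tanh(Ws @ x)
    y_hat = Wp @ z
    a_hat = 1.0 / (1.0 + np.exp(-(Wa @ z)))
    return z, y_hat, a_hat

def backprop_evidence(x, a_tilde, Ws0, Wp, Wa0, steps=2, lr=0.1, gamma=0.01):
    """Re-adjust trunk (Ws) and auxiliary (Wa) weights so the auxiliary
    output matches the evidence a_tilde. Early stopping after `steps`
    iterations; an L2 term pulls weights back toward the trained values.
    The primary branch Wp is never touched."""
    Ws, Wa = Ws0.copy(), Wa0.copy()
    for _ in range(steps):
        z, _, a_hat = forward(x, Ws, Wp, Wa)
        d_logits = a_hat - a_tilde                      # dBCE/d(Wa z)
        grad_Wa = np.outer(d_logits, z) + gamma * (Wa - Wa0)
        dz = Wa.T @ d_logits
        grad_Ws = np.outer(dz * (1 - z ** 2), x) + gamma * (Ws - Ws0)
        Wa -= lr * grad_Wa
        Ws -= lr * grad_Ws
    return Ws, Wa
```

After the update, `forward(x, Ws_new, Wp, Wa_new)` gives the refined primary prediction; since the shared trunk moved to explain the evidence, the primary output changes even though `Wp` is frozen.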

## 4 Semantic Segmentation

The task of semantic segmentation involves assigning a label to each pixel in the image from a fixed set of object categories. Semantic segmentation is an important part of scene understanding and a critical first step in many computer vision tasks. In many semantic segmentation applications, image level tags are often easily available and encapsulate important information about the context, scale and saliency. We explore the use of such tags as auxiliary information at test time for improving the prediction accuracy. As clarified in earlier sections, though using auxiliary information in the form of natural language sentences has been suggested [hu2016segmentation, liu2017recurrent], these earlier works have used this information only during training. This is unlike us, where we are interested in exploiting this information both during training as well as testing.

#### State-of-the-art

Most current state-of-the-art methods for semantic segmentation, such as SegNet [badrinarayanan2017segnet], DeepLabv2 [chen2016deeplab], DeepLabv3 [chen2017rethinking], DeepLabv3+ [chen2018encoder], PSPNet [zhao2017pyramid], and U-net [ronneberger2015u], are all based on DNN architectures. Most of these works use a fully convolutional network (FCN) architecture, replacing earlier models which used fully connected layers at the end.

DeepLabv3 uses Atrous Spatial Pyramid Pooling (ASPP) to detect objects at multiple scales. ASPP enlarges the field of view without increasing the number of parameters in the network. DeepLabv3+ uses DeepLabv3 as the encoder module in an encoder-decoder framework, with a simple decoder to obtain sharp object boundaries. In this way, DeepLabv3+ combines the benefits of both pyramid pooling and encoder-decoder based methods.

#### Our Implementation

Our implementation builds on DeepLabv3+ [chen2018encoder], which is one of the state-of-the-art segmentation models. DeepLabv3+ has been one of the leaders on the Pascal VOC data challenge [everingham2010pascal], and builds over the Xception [chollet2017xception] architecture, which was originally designed for classification tasks. We have used the publicly available official implementation of DeepLabv3+. For ease of notation, we will refer to the DeepLabv3+ model as ‘DeepLab’.

To use our framework, we have extended the DeepLab architecture to simultaneously solve the classification task in an MTL setting. Figure 5 describes our proposed MTL architecture in detail. Starting with the original DeepLab architecture (top part in the figure), we branch off after the encoder module to solve the classification task. The resultant feature map is passed through an average pooling layer, a fully connected layer, and finally a sigmoid activation function to get a probability for each of the 20 classes (the background class is excluded).
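A minimal numpy sketch of this classification branch (our own illustration of the pooling-FC-sigmoid sequence, not the actual implementation; `W` and `b` stand in for hypothetical trained parameters):

```python
import numpy as np

def tag_head(feature_map, W, b):
    """Auxiliary branch: global average pooling over the spatial
    dimensions, a fully connected layer, then a sigmoid giving
    independent per-class tag probabilities."""
    pooled = feature_map.mean(axis=(0, 1))   # (H, W, C) -> (C,)
    logits = W @ pooled + b                  # (num_classes,)
    return 1.0 / (1.0 + np.exp(-logits))
```

The sigmoid (rather than a softmax) is what allows multiple tags to be active simultaneously, matching the multi-label binary cross-entropy loss used for this branch.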

For training, we use a cross-entropy based loss for the primary task and a binary cross entropy loss for the secondary task. We first train the segmentation-only network to get the initial set of weights. These are then used to initialize the weights in the MTL based architecture (for the segmentation branch). The weights in the classification branch are randomly initialized. This is followed by joint training of the MTL architecture. During prediction time, for each image, we back-propagate the binary cross entropy loss based on the observed evidence over the auxiliary task (for the test image), resulting in weights re-adjusted to fit the evidence. These weights are used to make the final prediction (per image). The parameters in our experiments were set as follows: during training, the parameter $\lambda$ controlling the relative weight of the two losses was kept fixed in all our experiments; during prediction, the number of early stopping iterations was set to 2, and the parameter $\gamma$ weighing the two-norm regularizer was kept fixed. We have used an SGD optimizer at test time.

#### Methodology and Dataset

We compare the performance of the following seven models in our experiments: (a) DL, (b) DL-MT, (c) DL-MT-Pr, (d) DL-BP-ES, (e) DL-BP-ES-Pr, (f) DL-BP-L2, and (g) DL-BP-L2-Pr. The first model (DL) uses the vanilla DeepLab based architecture. The second (DL-MT) is trained using the MTL framework as described above. These are our baseline models without the use of any auxiliary information. The suffix ‘Pr’ in a model name refers to a post-processing step of pruning the output classes at each pixel using the given image tags: the probability of each label which is not present in the tag set is forced to zero at each pixel. This is a very simple way of using the auxiliary information. Models with the prefix DL-BP are based on our proposed approach and back-propagate the auxiliary information at prediction time. We experiment with two variations based on the choice of regularizer during prediction: DL-BP-ES uses early stopping and DL-BP-L2 uses an L2-norm based penalty.
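The ‘Pr’ post-processing step can be sketched as follows (a minimal numpy illustration under our own naming; the text only specifies zeroing out absent labels, so keeping the background class and renormalizing are our assumptions):

```python
import numpy as np

def prune_with_tags(pixel_probs, tags, background=0):
    """Zero out per-pixel probabilities of classes absent from the
    image-level tag set (background kept by assumption), then
    renormalize so each pixel's distribution sums to one."""
    pruned = pixel_probs.copy()
    keep = np.zeros(pruned.shape[-1], dtype=bool)
    keep[background] = True
    keep[list(tags)] = True
    pruned[..., ~keep] = 0.0
    pruned /= pruned.sum(axis=-1, keepdims=True)
    return pruned
```

Unlike back-propagation, this step can only remove labels; it cannot sharpen or relocate the mass assigned to labels that are present, which is one intuition for why DL-BP outperforms the pruning-only baselines.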

The original DeepLab paper [chen2018encoder] also experiments with inputting images at multiple scales, as well as using left-right flipped input images, and then taking the mean of the predictions. We refer to this as the multi-scaling (MS) based approach. We compare each of our models with multi-scaling turned on and off in our experiments.

For our evaluation, we make use of PASCAL VOC 2012 segmentation benchmark [everingham2010pascal]. It consists of 20 foreground object classes and one background class. We further augmented the training data with additional segmentation annotations provided by Hariharan et al. [hariharan2011semantic]. The dataset contains 10582 training and 1449 validation images. We use mean intersection over union (mIOU) as our evaluation metric which is a standard for segmentation tasks.
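For reference, the mIoU metric can be computed as below (our own minimal sketch; real benchmark code typically accumulates per-class intersections and unions over the whole dataset before averaging):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union, averaged over classes that
    appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```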

#### Results

Table 1 presents the results comparing the performance of the various models. We see some improvement in prediction accuracy from pruning the output, over the baseline models (which do not use any auxiliary information). However, our backpropagation based approach results in a further significant improvement over the baselines. The gain is as much as 3.45 mIoU points compared to vanilla DeepLab, and more than 2.19 points compared to the MTL based architecture with pruning. Both our variations have comparable performance, with the early stopping based model doing slightly better. Table 2 presents the results for each of the object categories. For all object categories except three, we perform better than the baselines. The gain is as high as 10 points (or more) for some of the classes.

Figure 6 shows a visual comparison of results for a set of hand picked examples. Our model is not only able to enhance the segmentation quality of already discovered objects, it can also discover new objects which are completely missed by the baseline. Figure 7 presents a sensitivity analysis with respect to the number of early stopping iterations and the parameter controlling the weight of the L2 regularizer (during prediction). There is a large range of values in both cases where we get significant improvements. We also measured the per-image inference time (on a GTX 1080 Ti GPU) for DeepLab, DeepLab with multi-scaling, our model with early stopping of 2 back-propagation iterations, and our model with multi-scaling.

Impact of Noise: Table 3 examines the performance of our approach in the presence of noisy auxiliary information. For each image, we randomly introduce $k$ additional (noisy) tags which were not part of the original set, for varying values of $k$ (this experiment was done without multi-scaling). Our approach is fairly robust to this noise, and its performance degrades slowly with an increasing amount of noise. Even with 3 additional noisy tags added, we still do better than the baseline DL-MT model performance of 82.5 (Left Column, Table 1).
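The noise injection protocol can be sketched as (our own minimal illustration; names are hypothetical):

```python
import random

def add_noisy_tags(true_tags, num_classes, k, seed=0):
    """Augment an image's ground-truth tag set with k uniformly drawn
    class labels that were not part of the original set."""
    rng = random.Random(seed)
    candidates = [c for c in range(num_classes) if c not in true_tags]
    return set(true_tags) | set(rng.sample(candidates, k))
```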

## 5 Instance Segmentation

Next, we present our experimental evaluation on the multi-modal task of object instance segmentation given a textual description of the image. In the instance segmentation problem, the goal is to detect and localize individual objects in the image, along with a segmentation mask around each object. In our framework, we model instance segmentation as the primary task and image captioning as the auxiliary task.

Arguably, instance segmentation is more challenging than semantic segmentation, and incorporating caption information also seems significantly harder due to its frequently noisy and incomplete nature. We speculate this to be the reason behind the lack of any noticeable prior work using textual descriptions for improving segmentation. In this sense, our state-of-the-art results for the problem may also have a standalone contribution of their own.

State-of-the-art: The recently proposed Mask R-CNN [he2017mask] is one of the most successful instance segmentation approaches. It is based on the Faster R-CNN [ren2015faster] technique for object detection. In the first step, Faster R-CNN generates box level proposals using a Region Proposal Network (RPN). In the second step, each box level proposal is given an object label to detect the objects present in the overall image. Mask R-CNN uses the detector feature map to produce a segmentation mask for each detected bounding box, re-aligning the misaligned feature maps using a specially designed RoIAlign operation. Mask R-CNN predicts masks and class labels in parallel. Other notable works [li2017fully, dai2016instance] predict the instance segmentation using a fully convolutional network, to get benefits similar to those of FCNs for semantic segmentation. There have also been proposals to use CRFs for post-processing FCN outputs to group pixels of individual object instances [bai2017deep, arnab2017pixelwise]. We have used Mask R-CNN in our experiments.

Our Implementation: Our MTL based architecture is shown in Figure 8. Here we take Mask R-CNN, a state-of-the-art instance segmentation approach, and combine it with the LSTM decoder from the state-of-the-art caption generator “Show, Attend and Tell” (SAT) [xu2015show] within our framework. We use the publicly available implementations of both Mask R-CNN and SAT. We use ResNeXt-152 [xie2017aggregated] as the convolutional backbone network to extract image features. The backbone architecture is shared between Mask R-CNN and the captioning decoder. We initialize our primary network with the pre-trained weights of Mask R-CNN provided with their implementation. We then fine-tune these weights along with learning the weights of the caption decoder (secondary task) using our MTL architecture. The early stopping iterations parameter was set to 2, and the regularization parameter was set to 100. We have used an Adam optimizer at test time.

Methodology and Dataset: We experiment with four different models: Mask-RCNN, Mask-MT, Mask-BP-ES and Mask-BP-L2. Mask-RCNN is the original instance segmentation model and Mask-MT is the model trained using MTL. These two are the baseline models. We refer to our approaches as Mask-BP-ES and Mask-BP-L2, respectively, for the two types of regularizers used during prediction. Our attempts with pruning based approaches did not yield any gain, since learning the label set to be pruned (using captions) turned out to be quite difficult.

We have used the MS-COCO dataset [lin2014microsoft] to evaluate our approach. The training set consists of nearly 115k images, with 5k validation images. We report our results on the validation images. In the dataset, each image has at least five captions assigned by different annotators. We use AP (average precision) as our evaluation metric. We also evaluate AP0.5 (precision at an IoU threshold of 0.5).
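The IoU threshold underlying AP0.5 can be illustrated with the standard box-IoU primitive (a minimal sketch of the matching criterion only; the full COCO AP computation additionally sweeps thresholds, ranks detections by score, and uses mask rather than box IoU for segmentation):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```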

Results: Table 4 shows the gain obtained by using the auxiliary information over the baseline models for different sizes of objects. We obtain a gain of 0.6 in AP. The gain is higher at 1.0 for AP0.5. For large objects, this number is as high as 2.2 (last column). The ES based variation does marginally better than L2. Table 5 presents the gain obtained by the ES model over the MT baseline for detecting various objects in terms of precision, recall and F-measure. At the cost of a slight loss in precision, we are able to improve the overall F1 by 1.5 points. The gain is largest for small objects, which makes sense since a large number of them remain undetected in the original model. Further, a careful analysis revealed that the ground truth itself has inconsistencies, missing several smaller objects, which understates the actual gain obtained by our approach (see supplement for details).

Figure 9 presents a visual comparison of results. Our algorithm can detect new objects (Figure 9 (A)), as well as detect additional objects of the same category (Figure 9 (C)), sometimes even those not mentioned in the caption (Figure 9 (E)). Figure 9 (F) is a Mooney face [mooney1957age], as referred to in the introduction. Mask-MT incorrectly detects a bird, whereas Mask-BP correctly detects a person with a reasonable segmentation.

Failure analysis: Figure 10 shows a failure example for our approach. Overuse of back-propagation at test time may lead to over-fitting on the given auxiliary information. As the number of back-propagation iterations increases from 2 to 5, we observe over-fitting, leading to the prediction of multiple surf-boards.

## 7 Image Captioning with Image Tags

We have conducted experiments on the MS COCO dataset with image caption prediction as the primary task and image tagging as the auxiliary task. Figure 12 describes our proposed MTL architecture in detail. The base architecture was “Show, Attend and Tell” (SAT) [xu2015show], one of the prominent image captioning models. To use image level tags, we branch off after the encoder module and design a multi-class classifier. The resultant feature map is passed through an average pooling layer, a fully connected layer, and finally a sigmoid activation function to get a probability for each of the 80 classes. Using BLEU-4 as the metric, the performances are: (i) SAT: 25.8; (ii) SAT (multi-task): 27.4; (iii) SAT (multi-task + backpropagation over tags): 28.1. We also gain on other metrics such as ROUGE and CIDEr.

## 8 Auxiliary Information as Input

We have also run experiments with baseline models that take the auxiliary information as input, and our approach performs better than them. For semantic segmentation, using tags as an additional input to the baseline DeepLab yields only a small improvement in mIoU, whereas our framework achieves a substantially larger gain using the same information. We also tried inputting caption embeddings generated using “InferSent” [conneau2017supervised] from Facebook AI Research to Mask-RCNN; however, in this case the training did not converge. This highlights an advantage of our framework, which allows the primary task to be trained separately first, with the joint model then fine-tuned efficiently. Further, our framework does not require a commonly annotated dataset, unlike approaches which take auxiliary information as input.

## 9 Conclusion

We have presented a novel approach to incorporate evidence into deep networks at prediction time. Our key idea is to model the evidence as auxiliary information in an MTL architecture, and then modify the weights at prediction time such that the output of the auxiliary task(s) matches the evidence. The approach is generic and can be used to improve the accuracy of diverse existing neural network architectures for a variety of tasks, using easily available auxiliary information at test time. Experiments on two different computer vision applications demonstrate the efficacy of our proposed model over the state-of-the-art. In the future, we would like to experiment with additional applications, such as on video data.

## References

Supplementary Material

## Appendix A Interpreting Our Approach as a Graphical Model

In this section, we present a probabilistic graphical model's perspective of our approach (as described in Section 3 of the main paper). Referring back to Figure 13, we can define a probabilistic graphical model over the random variables $X$ (input), $Z$ (hidden representation), $Y$ (primary output) and $A$ (auxiliary output). Interpreting this as a Bayesian network (with arrows going from $X$ to $Z$, and from $Z$ to each of $Y$ and $A$), we are interested in computing the probabilities $P(Y|X=x)$ and $P(A|X=x)$ at inference time. Further, we have:

$$P(Y|X=x) = \sum_{Z} P(Y|Z)\,P(Z|X=x) \quad (1)$$

In the first conditional probability term on the RHS, the dependence on $X$ is taken away since $Y$ is independent of $X$ given $Z$. Since, in our network, $Z$ is fully determined by $X$ (due to the nature of forward propagation), we can write this dependence as $Z = f_z(X)$. In other words, there is a value $z = f_z(x)$ such that $P(Z=z|X=x) = 1$. Therefore, the above equation can be equivalently written as:

$$P(Y|X=x) = P(Y|Z=f_z(x)) \cdot 1 = P(Y|Z=f_z(x))$$

Note that the sum over $Z$ disappears since $P(Z|X=x)$ is non-zero only when $Z = f_z(x)$, as defined above. Similarly:

$$P(A|X=x) = P(A|Z=f_z(x)) \quad (2)$$

The goal of inference is to find the values of $y$ and $a$ maximizing $P(Y=y|X=x)$ and $P(A=a|X=x)$, respectively. The parameters of the graphical model are learned by minimizing the cross entropy or some other kind of surrogate loss over the training data.

Let us analyze what happens at test time. We are given the evidence $\tilde{a}$ at test time. In the light of this observation, we would like to change our distribution over $A$ such that the probability of observing $\tilde{a}$ is maximized, i.e., $P(A=\tilde{a}|X=x)$ is equal to $1$. Recalling that $P(A|X=x) = P(A|Z=f_z(x))$, in order to effect this, we may:

1. Change the distribution $P(A|Z)$ to $P'(A|Z)$, or

2. Change the function $f_z$ to $f'_z$,

such that $P'(A=\tilde{a}|Z=f'_z(x))$ is as close to $1$ as possible. How do we do this in a principled manner? We define an appropriate loss capturing the discrepancy between the distribution predicted over $A$ and the evidence $\tilde{a}$. The loss term also incorporates a regularizer so that the new parameters do not deviate significantly from the original set of parameters, avoiding overfitting the evidence $\tilde{a}$.

In order to minimize the loss, we back-propagate its gradient in the DNN and learn the new set of parameters. This results in a change of the dependence of $Z$ on $X$, i.e., $f_z \rightarrow f'_z$, as well as that of $A$ on $Z$, i.e., $P(A|Z) \rightarrow P'(A|Z)$. The resulting parameters are $w'_s$ and $w'_a$, which effectively generate a new distribution over $Y$. Hence, adjusting the DNN weights in order to match the evidence also results in an updated prediction over the primary task, aligned with the observed evidence.

## Appendix B Instance Segmentation

In the main paper, we have presented results on the semantic segmentation and object instance segmentation problems. We notice that in the case of instance segmentation, though our results are much better qualitatively, the same is not fully reflected in the quantitative comparison. A careful analysis of the results reveals that there are often inconsistencies in the ground truth annotations themselves. For example, in many cases the ground truth annotation misses objects which are either very small, occluded, or only partly present in the image. In some cases, the ground truth segmentation has inconsistencies. In many of these scenarios, our framework is able to predict the correct output, but since the ground truth annotation is erroneous, we are incorrectly penalized for detecting true objects or for detecting the correct segmentation, resulting in an apparent loss in our accuracy numbers. In the figures below, we highlight some examples to support our claim. Systematically fixing the ground truth is a direction for future work.
