Exploiting Test Time Evidence to Improve Predictions of Deep Neural Networks

Dinesh Khandelwal  Suyash Agrawal  Parag Singla  Chetan Arora
Indian Institute of Technology Delhi

Many prediction tasks, especially in computer vision, are often inherently ambiguous. For example, the output of semantic segmentation may depend on the scale one is looking at, and image saliency or video summarization is often user or context dependent. Arguably, in such scenarios, exploiting instance specific evidence, such as scale or user context, can help resolve the underlying ambiguity, leading to improved predictions. While existing literature has considered incorporating such evidence in classical models such as probabilistic graphical models (PGMs), there is limited (or no) prior work looking at this problem in the context of deep neural network (DNN) models. In this paper, we present a generic multi-task learning (MTL) based framework which handles the evidence as the output of one or more secondary tasks, while modeling the original problem as the primary task of interest. Our training phase is identical to the one used by standard MTL architectures. During prediction, we back-propagate the loss on the secondary task(s) such that the network weights are re-adjusted to match the evidence. An early stopping or two-norm based regularizer ensures that the weights do not deviate significantly from the ones learned originally. Implementation in two specific scenarios, (a) predicting semantic segmentation given image level tags and (b) predicting instance level segmentation given a text description of the image, clearly demonstrates the effectiveness of our proposed approach.

1 Introduction

Over the last decade, Deep Neural Networks (DNNs) have become a leading technique for solving a variety of problems in Artificial Intelligence. Often, such networks are designed and trained for a specific task at hand. However, when multiple correlated tasks are given, the Multi-Task Learning (MTL) [caruana1995learning] framework allows a DNN to jointly learn shared features from multiple tasks simultaneously. Usually, in an MTL framework, one is interested in the output of all the tasks at hand. However, researchers have also looked at scenarios where only a subset of the tasks (called ‘principal’ task(s)) is of interest, and the other tasks (called ‘auxiliary’ task(s)) merely help learn a shared generic representation [liang2015semantic, bingel2017identifying, romera2012exploiting]. In such cases, auxiliary tasks are generally derived from easily available side information about the data. For example, one can use an MTL framework for semantic segmentation of an image as the principal task, with an auxiliary task of predicting the types of objects present in the image.

Figure 1: Three examples of Mooney images [mooney1957age], which are difficult to interpret even for a human observer. However, given the description: “Beautiful lady smiling in front of a screen”, the face of a woman in each image becomes obvious. Our proposed framework is inspired by this behavior and improves the prediction of a DNN by exploiting additional cues at test time. Standard Mask R-CNN [he2017mask] fails to detect a face in the above images, but the same network, when plugged into our framework with the caption as auxiliary information, easily detects the faces, as shown in Figure 9.

One significant limitation of the MTL frameworks suggested so far is that they make use of auxiliary information only during the training process. This is despite the fact that such information is often also available at test time, e.g., tags on a Facebook image. Arguably, exploiting this information at test time can significantly boost prediction accuracy by resolving the underlying ambiguity and/or correcting modelling errors.

The motivation for incorporating auxiliary information at test time can also be drawn from human perception. Figure 1 shows two-tone Mooney images [mooney1957age] used by Craig Mooney to study perceptual closure in children. Here, perceptual closure is the ability to form a complete percept of an object or pattern from an incomplete one. It was shown that, though it may be difficult to make much sense of any structure in the given images at first, once the additional information is provided that these represent the faces of a woman, one easily starts perceiving them.

Figure 2: The figure shows the improvement achieved by existing diverse state-of-the-art models using our framework for exploiting the evidence at prediction time. We show results for instance segmentation (1st row), and semantic segmentation (2nd row), using textual description and image level tags respectively as the auxiliary information. Both the images as well as auxiliary information are deliberately sourced from unconstrained data on the web (and not benchmark datasets) to show the easy availability of such information at the test time.

A natural question to ask is whether there is a way to incorporate similar instance specific additional clues in modern deep neural network models. While classical machine learning models such as PGMs [koller2009probabilistic] have considered conditional inference, to the best of our knowledge there is no prior work incorporating such auxiliary information in the context of DNN models, especially using an MTL based framework. We will henceforth refer to the instance specific auxiliary information as evidence.

We model our task of interest, e.g., semantic segmentation, as the primary task in the MTL framework. The evidence, e.g., image tags, is modelled as the output of one or more secondary tasks. While our training process is identical to the one used by standard MTL architectures, our testing phase is quite different. Instead of simply doing a forward propagation during prediction, we back-propagate the loss on the output of the secondary tasks (evidence) and re-adjust the learned weights to match the evidence. In order to avoid over-fitting the observed evidence, we employ a regularizer in the form of a two-norm penalty or early stopping, so that the weights do not deviate significantly from their originally learned values.

The specific contributions of this work are as follows: (1) We propose a novel adaptation of the MTL framework which can incorporate evidence at test time to improve the predictions of a DNN. The framework is generic in the sense that it can use existing DNN architectures, pre-trained for a variety of tasks, along with diverse, inexpensively gathered auxiliary information as evidence. (2) We suggest two approaches to re-adjust the weights of the deep network so as to match the network output to the evidence. (3) We demonstrate the efficacy of the proposed framework by improving state-of-the-art results on two specific problems, viz. semantic segmentation and instance segmentation. Figure 2 shows some sample results.

We stress that our focus in this paper is to show the improvement obtained by using easily available auxiliary information along with state-of-the-art models. In that regard, these models should not be seen as our competitors, but rather as being enhanced using our approach. We also present other naïve baselines which incorporate auxiliary information for direct comparison (e.g., using label pruning).

2 Related Work

As argued earlier, though our architecture may seem similar in style to existing work trying to boost the performance of the primary task based on auxiliary tasks [liang2015semantic, bingel2017identifying, romera2012exploiting], the key difference is that, while earlier works exploit the correlated tasks only during the training process, we in addition back-propagate the available (instance specific) evidence at prediction time. This is an important conceptual difference and can result in significant improvements by exploiting additional information, as shown by our experiments.

We would also like to differentiate our work from posterior inference with priors. While priors can be learned for sample distributions, our work performs conditional inference in the presence of sample specific evidence. Similarly, the posterior regularization technique [ganchev2010posterior] changes the output distribution directly, albeit only based on characteristics of the underlying data statistics; no sample specific evidence is used.

Another closely related research area is multi-modal inference [rosenfeld2018priming, kilickaya2016leveraging, cheng2014imagespirit], which also incorporates additional features from auxiliary information. While this does effectively incorporate evidence at prediction time in the form of additional features, practically speaking, designing a network to take additional input from a highly sparse information source is non-trivial. (For example, in one of our experiments we show semantic segmentation conditioned upon image level tags as given by an image classification task. It is easy to see that designing an MTL based DNN architecture for semantic segmentation and image classification is not difficult. On the other hand, designing a network which takes a single image label and generates features for merging with RGB image features seems non-trivial.) However, the strongest argument in support of our framework is its ability to work even when only a single set of annotations is available: our architecture can be trained even when the dataset contains either primary or auxiliary annotations. A multi-modal input based architecture, on the other hand, would require a dataset containing both annotations at the same time, which greatly restricts its applicability. Note that the argument extends to test time as well. At test time, if the auxiliary information is unavailable, our framework can fall back to regular predictions, while an architecture with multi-modal input will fail to take off.

Some recent works [pathak2015constrained, xu2017semantic, marquez2017imposing] have proposed constraining the output of a DNN, to help regularize the output and reduce the amount of training data required. While all these works suggest constraints during training, our approach imposes the constraints at both training and inference time.

We note that our framework is similar in spirit to contemporary work by Lee et al. [lee2017enforcing], who have also proposed to enforce test time constraints on a DNN output. However, while their idea is to enforce ‘prior deterministic constraints’ arising out of natural rule based processing, our framework is designed to use any easily available and arbitrary type of auxiliary information. Our framework can be used together with theirs, and is more general, as it does not require the constraints to be imposed on the output of the same stream.

3 Framework for Back-propagating Evidence

In this section, we present our approach for boosting the accuracy of a given task of interest by incorporating evidence. Our solution employs a generic MTL [ruder2017overview] based architecture which consists of a main (primary) task of interest, and another auxiliary task whose desired output (label) represents the evidence in the network. The key contribution of our framework is its ability to back-propagate the loss on the auxiliary task during prediction time, such that the weights are re-adjusted to match the output of the auxiliary task with the given evidence. In this process, as we will see, the shared weights (in MTL) also get re-adjusted, producing a better output on the primary task. This is what we refer to as back-propagating evidence through the network (at prediction time). We note that though we describe our framework using a single auxiliary task to keep the notation simple, it is straightforward to extend it to a setting with more than one auxiliary task (and associated evidence at prediction time).

Figure 3: Model Architecture for Multi-task learning (MTL)
Figure 4: Loss propagation at train and prediction time

3.1 Background on MTL


We will use $T_p$ to denote the primary task of interest. Similarly, let $T_a$ denote the auxiliary task in the network. Let $(x^{(i)}, y^{(i)}, z^{(i)})$ denote the $i^{th}$ training example, where $x^{(i)}$ is the input feature vector, $y^{(i)}$ is the desired output (label) of the primary task, and $z^{(i)}$ denotes the desired output (label) of the auxiliary task. Correspondingly, let $\hat{y}^{(i)}$ and $\hat{z}^{(i)}$ denote the outputs produced by the network for the primary and auxiliary tasks, respectively.


Figure 3 shows the MTL based architecture [ruder2017overview] for this set-up. There is a common set of layers shared between the two tasks, followed by the task specific layers. $h$ represents the common hidden feature representation fed to the two task specific parts of the architecture. For ease of notation, we will refer to the shared set of layers as the trunk.

The network has three sets of weights. First, there are the weights associated with the trunk, denoted by $W_s$. $W_p$ and $W_a$ are the sets of weights associated with the two task specific branches, respectively. The total loss is a function of these weight parameters and can be defined as:

$$L(W_s, W_p, W_a) = \sum_{i=1}^{m} \Big[ L_p\big(y^{(i)}, \hat{y}^{(i)}\big) + \lambda\, L_a\big(z^{(i)}, \hat{z}^{(i)}\big) \Big]$$

Here, $L_p$ and $L_a$ denote the loss for the primary and auxiliary tasks, respectively, and $\lambda$ is the importance weight for the auxiliary task. The sum is taken over the $m$ examples in the training set. $\hat{y}^{(i)}$ is a function of the shared set of weights $W_s$ and the task specific weights $W_p$. Similarly, $\hat{z}^{(i)}$ is a function of the shared weights $W_s$ and the task specific weights $W_a$.
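As an illustrative sketch (not the paper's actual implementation), the combined objective can be computed as follows, assuming a single-label primary task and a multi-label tag auxiliary task; the function name `mtl_loss` and the tensor shapes are our own.

```python
import numpy as np

def mtl_loss(y_true, y_pred, z_true, z_pred, lam=1.0):
    """Total MTL loss: primary cross-entropy plus lambda-weighted
    auxiliary binary cross-entropy, summed over examples.

    y_true: (n,) int class indices; y_pred: (n, k) class probabilities.
    z_true, z_pred: (n, t) binary tag targets / predicted tag probabilities.
    """
    eps = 1e-12
    # Primary loss: -log probability of the true class for each example.
    primary = -np.log(y_pred[np.arange(len(y_true)), y_true] + eps)
    # Auxiliary loss: element-wise binary cross-entropy over the tags.
    aux = -(z_true * np.log(z_pred + eps)
            + (1 - z_true) * np.log(1 - z_pred + eps)).sum(axis=1)
    return float((primary + lam * aux).sum())
```

The weight `lam` plays the role of the importance weight on the auxiliary task described above.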


The goal of training is to find the weights which minimize the total loss over the training data. Using the standard approach of gradient descent, the gradients can be computed as follows:

$$\frac{\partial L}{\partial W_p} = \sum_{i=1}^{m} \frac{\partial L_p}{\partial W_p} \quad (1) \qquad \frac{\partial L}{\partial W_a} = \lambda \sum_{i=1}^{m} \frac{\partial L_a}{\partial W_a} \quad (2) \qquad \frac{\partial L}{\partial W_s} = \sum_{i=1}^{m} \Big[ \frac{\partial L_p}{\partial W_s} + \lambda\, \frac{\partial L_a}{\partial W_s} \Big] \quad (3)$$

Note that the weights in the task specific branches, i.e., $W_p$ and $W_a$, can only affect the losses defined over their respective tasks (Equations 1 and 2 above). On the other hand, the weights $W_s$ in the trunk affect the losses defined over both the primary as well as the auxiliary task (Equation 3). Next, we describe our approach of back-propagating the loss over the evidence.

3.2 Our Approach - Prediction

During test time, we are given additional information about the output of the auxiliary task. Let us denote this by $\tilde{z}$ (evidence) to distinguish it from the auxiliary outputs during training time. Then, for inference, instead of directly proceeding with forward propagation, we first adjust the weights of the network such that the network is forced to match the evidence on the auxiliary task. Since the two tasks are correlated, we expect that this process will adjust the weights of the network in a manner such that resolving the ambiguity over the auxiliary output will also result in an improved prediction on the primary task of interest.

This feat can be achieved by defining a loss in terms of $\tilde{z}$ and then back-propagating its gradient through the network. Note that this loss only depends on the set of weights $W_a$ in the auxiliary branch, and the weights $W_s$ in the trunk. In particular, the weights $W_p$ remain untouched during this process. Finally, we would also like to make sure that the weights do not deviate too much from the originally learned weights, to avoid over-fitting the evidence. This can be achieved by adding a two-norm based regularizer which discourages weights that are far from the originally learned values, giving the test-time objective

$$L^{test}(W_s, W_a) = L_a(\tilde{z}, \hat{z}) + \gamma \left( \lVert W_s - W_s^0 \rVert^2 + \lVert W_a - W_a^0 \rVert^2 \right)$$

The corresponding weight update equations can be derived using the following gradients:

$$\frac{\partial L^{test}}{\partial W_a} = \frac{\partial L_a(\tilde{z}, \hat{z})}{\partial W_a} + 2\gamma (W_a - W_a^0) \qquad \frac{\partial L^{test}}{\partial W_s} = \frac{\partial L_a(\tilde{z}, \hat{z})}{\partial W_s} + 2\gamma (W_s - W_s^0)$$

Here, $W_s^0$ and $W_a^0$ denote the weights learned during training and $\gamma$ is the regularization parameter. Note that these gradients are similar to the training-time gradients with respect to $W_a$ and $W_s$, with the differences that (1) the loss is now computed with respect to the single test example, (2) the term dependent on the primary loss has been zeroed out, and (3) a regularizer term has been added. In our experiments, we also experimented with early stopping instead of the norm based regularizer.

Algorithm 1 describes our algorithm for weight update during test time, and Figure 4 explains it pictorially. Once the new weights are obtained, they can be used in the forward propagation to obtain the desired value on the primary task.

1: Input: $x$ (input), $\tilde{z}$ (evidence)
2: Parameters: $\eta$ (learning rate), $T$ (iterations), $\gamma$
3: $W_s^0$, $W_a^0$: originally trained weights
4: Initialize $W_s \leftarrow W_s^0$, $W_a \leftarrow W_a^0$
5: for $t = 1, \ldots, T$ do
6:     Calculate the loss $L^{test}$ over the evidence $\tilde{z}$
7:     Compute $\partial L^{test}/\partial W_s$ and $\partial L^{test}/\partial W_a$ using back-propagation
8:     Update $W_s$ and $W_a$ using the gradient descent rule
9: end for
10: Return the newly optimized weights $W_s$, $W_a$
Algorithm 1: Weight update at prediction time
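To make the test-time update loop concrete, here is a minimal, self-contained sketch on a toy linear model: a scalar trunk weight `Ws` and auxiliary head weight `Wa`, a squared auxiliary loss standing in for the task loss, and an L2 pull toward the originally trained weights. All names and values are illustrative, not the paper's network.

```python
def backprop_evidence(x, z_ev, Ws0, Wa0, lr=0.02, iters=500, gamma=0.1):
    """Test-time weight adjustment on a toy linear MTL model.

    Trunk: h = Ws * x; auxiliary head: z_hat = Wa * h. The primary head
    Wp is left untouched. We descend the squared auxiliary loss on the
    evidence z_ev, plus a two-norm pull toward the trained weights.
    """
    Ws, Wa = Ws0, Wa0
    for _ in range(iters):
        z_hat = Wa * Ws * x
        err = z_hat - z_ev
        # Gradients of (z_hat - z_ev)^2 + gamma*((Ws-Ws0)^2 + (Wa-Wa0)^2)
        gWs = 2 * err * Wa * x + 2 * gamma * (Ws - Ws0)
        gWa = 2 * err * Ws * x + 2 * gamma * (Wa - Wa0)
        Ws -= lr * gWs
        Wa -= lr * gWa
    return Ws, Wa
```

With `gamma = 0` the auxiliary output converges to the evidence; a large `gamma` keeps the weights close to their trained values, mimicking the regularized behavior described above.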

4 Semantic Segmentation

The task of semantic segmentation involves assigning a label to each pixel in the image from a fixed set of object categories. Semantic segmentation is an important part of scene understanding and a critical first step in many computer vision tasks. In many semantic segmentation applications, image level tags are often easily available and encapsulate important information about the context, scale and saliency. We explore the use of such tags as auxiliary information at test time for improving prediction accuracy. As clarified in earlier sections, though using auxiliary information in the form of natural language sentences [hu2016segmentation, liu2017recurrent] has been suggested, these earlier works use this information only during training, unlike ours, which exploits it both during training and at test time.


Most current state-of-the-art methods for semantic segmentation, such as SegNet [badrinarayanan2017segnet], DeepLabv2 [chen2016deeplab], DeepLabv3 [chen2017rethinking], DeepLabv3+ [chen2018encoder], PSPNet [zhao2017pyramid], and U-Net [ronneberger2015u], are based on DNN architectures. Most of these works use a fully convolutional network (FCN) architecture, replacing earlier models which used fully connected layers at the end.

DeepLabv3 uses Atrous Spatial Pyramid Pooling (ASPP) to detect objects at multiple scales. ASPP enlarges the field of view without increasing the number of parameters in the network. DeepLabv3+ uses DeepLabv3 as the encoder module in an encoder-decoder framework, with a simple decoder to obtain sharp object boundaries. DeepLabv3+ thus combines the benefits of both pyramid pooling and encoder-decoder based methods.

Figure 5: Our proposed architecture for semantic segmentation using image level tags as test time auxiliary information. The architecture uses the pre-trained DeepLab architecture, fine-tuned with the MTL strategy using image classification as the auxiliary task (lower branch).
Figure 6: Comparison of visual results on semantic segmentation problem using image level tags as the test time auxiliary information. 1st and 2nd column are input and ground truth respectively. 3rd column shows baseline model trained with MTL. 4th column shows results of a naive strategy to improve the results of 3rd column by pruning predicted labels which were not present in tags. The last 4 columns show results of proposed approach in various configurations. Please see the paper text for details on the configurations. The first row shows improvement in segmentation of ‘dining table’, second row shows correction of ‘dog’ label and third row detecting a new object ‘sofa’ by our technique using test time tags.
Figure 7: Sensitivity of results to the number of back-propagation iterations (early stopping) and the $\gamma$ parameter (L2-norm)

Our Implementation

Our implementation builds on DeepLabv3+ [chen2018encoder], which is one of the state-of-the-art segmentation models. DeepLabv3+ has been one of the leaders on the Pascal VOC data challenge [everingham2010pascal], and builds on the Xception [chollet2017xception] architecture, which was originally designed for classification tasks. We have used the publicly available official implementation of DeepLabv3+ (https://github.com/tensorflow/models/tree/master/research/deeplab). For ease of notation, we will refer to the DeepLabv3+ model as ‘DeepLab’.

To use our framework, we have extended the DeepLab architecture to simultaneously solve the classification task in an MTL setting. Figure 5 describes our proposed MTL architecture in detail. Starting with the original DeepLab architecture (top part in the figure), we branch off after the encoder module to solve the classification task. The resultant feature map is passed through an average pooling layer, a fully connected layer, and finally a sigmoid activation function to get a probability for each of the 20 classes (the background class is excluded).

For training, we use a cross-entropy loss for the primary task and a binary cross-entropy loss for the secondary task. We first train the segmentation-only network to get the initial set of weights. These are then used to initialize the weights of the segmentation branch in the MTL based architecture. The weights in the classification branch are randomly initialized. This is followed by joint training of the MTL architecture. During prediction time, for each image, we back-propagate the binary cross-entropy loss on the observed evidence over the auxiliary task (for the test image), so that the weights are re-adjusted to fit the evidence; these weights are used to make the final (per-image) prediction. The parameters in our experiments were set as follows: during training, the parameter $\lambda$ controlling the relative weight of the two losses was held fixed across all our experiments; during prediction, the number of early stopping iterations was set to 2, and the $\gamma$ parameter weighing the two-norm regularizer was likewise held fixed. We used an SGD optimizer at test time.

Methodology and Dataset

We compare the performance of the following seven models in our experiments: (a) DL (b) DL-MT (c) DL-MT-Pr (d) DL-BP-ES (e) DL-BP-ES-Pr (f) DL-BP-L2 (g) DL-BP-L2-Pr. The first model (DL) uses the vanilla DeepLab based architecture. The second (DL-MT) is trained using the MTL framework as described above. These are our baseline models without the use of any auxiliary information. The suffix ‘Pr’ in a model name refers to a post-processing step which prunes the output classes at each pixel using the given image tags: the probability of each label not present in the tag set is forced to zero at each pixel. This is a very simple way to use the auxiliary information. Models with the prefix DL-BP are based on our proposed approach, where we back-propagate the auxiliary information at prediction time. We experiment with two variations based on the choice of regularizer during prediction: DL-BP-ES uses early stopping and DL-BP-L2 uses an L2-norm based penalty.
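The ‘Pr’ pruning baseline can be sketched in a few lines; this is our own minimal illustration of the step (the function name and shapes are assumptions, not the paper's code).

```python
import numpy as np

def prune_with_tags(probs, tags, background=0):
    """Zero out per-pixel class probabilities for labels absent from the
    image-level tag set, then take the per-pixel argmax.

    probs: (h, w, k) class probabilities; tags: iterable of allowed class
    indices. The background class is always kept.
    """
    allowed = set(tags) | {background}
    pruned = probs.copy()
    for c in range(probs.shape[-1]):
        if c not in allowed:
            pruned[..., c] = 0.0
    return pruned.argmax(axis=-1)
```

Unlike the back-propagation approach, this post-processing can only remove labels; it cannot sharpen boundaries or recover objects the network scored low everywhere.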

The original DeepLab paper [chen2018encoder] also experiments with inputting images at multiple scales, as well as using left-right flipped input images, and then taking the mean of the predictions. We refer to this as the multi-scaling (MS) strategy. We compare each of our models with multi-scaling turned on and off in our experiments.

For our evaluation, we make use of PASCAL VOC 2012 segmentation benchmark [everingham2010pascal]. It consists of 20 foreground object classes and one background class. We further augmented the training data with additional segmentation annotations provided by Hariharan et al. [hariharan2011semantic]. The dataset contains 10582 training and 1449 validation images. We use mean intersection over union (mIOU) as our evaluation metric which is a standard for segmentation tasks.
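For reference, the mIoU metric used throughout this section can be computed as below; this is a standard, simplified sketch (it averages over classes present in either map, without the dataset's ignore regions).

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes, skipping classes
    absent from both the prediction and the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```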


Model          Without MS   With MS
DL             82.45        83.58
DL-MT          82.59        83.54
DL-MT-Pr       84.23        84.84
DL-BP-L2       85.97        86.88
DL-BP-ES       86.01        86.84
DL-BP-L2-Pr    86.10        87.00
DL-BP-ES-Pr    86.22        87.03
Table 1: Comparison of results for semantic segmentation (mIoU). The two columns show results of the various approaches with and without the multi-scale (MS) strategy. The 1st row shows the baseline DeepLab; the 2nd row shows DeepLab trained in the MTL framework along with image level tags. The 3rd row shows the result of the 2nd row after naive pruning of labels absent from the auxiliary tags. The last 4 rows show results of the proposed approach in various configurations. Please see the paper text for details on the configurations.

Table 1 presents the results comparing the performance of the various models. We see some improvement in prediction accuracy from pruning the output over the baseline models (which use no auxiliary information). However, using our back-propagation based approach results in a further significant improvement over the baselines. The gain is as much as 3.45 mIoU points compared to vanilla DeepLab and more than 2.19 points compared to the MTL based architecture with pruning. Both our variations have comparable performance, with the early stopping based model doing slightly better. Table 2 presents the results for each of the object categories. For all the object categories except three, we perform better than the baselines. The gain is as high as 10 points (or more) for some of the classes.

Category:    bike table chair sofa plant boat tv bottle bird person mbike car aero dog horse sheep train cat bg cow bus
DL-MT-Pr:    45.3 55.2 57.9 62.3 69.4 78.1 81.5 83.9 90.3 91.0 91.7 91.8 93.6 93.9 94.0 94.2 95.4 95.4 95.8 96.6 97.1
DL-BP-ES-Pr: 45.9 69.9 68.1 79.7 74.7 84.1 82.8 86.3 90.1 91.6 92.7 93.7 94.6 96.2 95.1 96.1 95.2 96.4 96.5 97.4 97.1
DL-BP-L2-Pr: 45.8 69.0 67.9 78.2 74.1 84.0 82.8 86.0 90.2 91.6 92.7 93.7 94.5 96.2 95.1 96.0 95.3 96.4 96.4 97.3 97.1
Table 2: Object category-wise comparison of results for semantic segmentation on Pascal VOC. The first row shows results with naive pruning of labels not present in image tags. The bottom two rows are variations of proposed methodology. Numbers denote mIoU.

Figure 6 shows the visual comparison of results for a set of hand picked examples. Our model is not only able to enhance the segmentation quality of already discovered objects, it can also discover new objects which are completely missed by the baseline. Figure 7 presents the sensitivity analysis with respect to the number of early stopping iterations and the parameter controlling the weight of the L2 regularizer (during prediction). There is a large range of values in both cases where we get significant improvements. We also measured the per-image inference time, on a GTX 1080 Ti GPU, of DeepLab, DeepLab with multi-scaling, and our model (with early stopping after 2 back-propagation iterations), with and without multi-scaling.

Impact of Noise: Table 3 examines the performance of our approach in the presence of noisy auxiliary information. For each image we randomly introduce $k$ additional (noisy) tags, for varying values of $k$, which were not part of the original tag set. (This experiment was done without using multi-scaling.) Our approach is fairly robust to this noise, and its performance degrades slowly with an increasing amount of noise. Even when 3 additional noisy tags are added, we still do better than the baseline DL-MT performance of 82.59 (left column, Table 1).

Model          k = 0   k = 1   k = 2   k = 3
DL-BP-ES       86.0    85.3    84.7    82.8
DL-BP-ES-Pr    86.3    85.6    85.0    83.3
Table 3: Performance of the DL-BP-ES and DL-BP-ES-Pr models as $k$ noisy tags are added to the auxiliary information. Multi-scaling was off in this experiment.
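The noise-injection protocol above can be sketched as follows; the function name and seeding are our own illustrative choices, not the paper's code.

```python
import random

def add_noisy_tags(tags, num_classes, k, seed=0):
    """Augment an image's tag set with k random labels not already
    present, simulating noisy auxiliary information."""
    rng = random.Random(seed)
    absent = [c for c in range(num_classes) if c not in tags]
    return set(tags) | set(rng.sample(absent, k))
```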

5 Instance Segmentation

Next, we present our experimental evaluation on the multi-modal task of object instance segmentation given a textual description of the image. In the instance segmentation problem, the goal is to detect and localize individual objects in the image, along with a segmentation mask around each object. In our framework, we model instance segmentation as the primary task and image captioning as the auxiliary task.

Arguably, instance segmentation is more challenging than semantic segmentation, and incorporating caption information also seems significantly harder due to its frequently noisy and incomplete nature. We speculate this to be the reason behind the lack of any noticeable prior work using textual descriptions for improving instance segmentation. In this sense, our state-of-the-art results for the problem may also be a standalone contribution of their own.

Figure 8: Our framework for instance segmentation using Mask-RCNN and LSTM based caption generator to use textual descriptions of an image at test time.
Figure 9: Comparison of visual results for baseline (Mask-MT) vs proposed (Mask-BP-ES) approaches. Using our strategy to exploit test time evidence, our approach is able to discover new objects (and their segmentation) which were missed earlier.
Model         AP     AP0.5   APS    APM    APL
Mask-RCNN     39.8   63.5    39.4   69.0   80.0
Mask-MT       39.9   63.6    39.2   69.2   79.8
Mask-BP-L2    40.4   64.5    40.0   69.6   81.6
Mask-BP-ES    40.5   64.6    40.1   69.7   82.2
Table 4: Comparison of results for instance segmentation. Mask-RCNN and Mask-MT represent the baseline pretrained network and the baseline trained with the MTL strategy, respectively. The bottom two rows are variations of the proposed framework. AP0.5 refers to AP at an IoU of 0.5; APS, APM and APL represent AP0.5 values for small, medium and large objects, respectively.
Type Mask-MT Mask-BP-ES
P R F1 P R F1
All 86.1 51.7 64.6 83.6 54.7 66.1
Small 80.6 29.8 43.5 78.0 31.9 45.3
Medium 86.6 61.3 71.8 83.9 64.5 73.0
Large 89.8 76.5 82.6 87.6 80.6 84.0
Table 5: Comparison of the baseline (Mask-MT) and proposed (Mask-BP-ES) approaches at 0.5 IoU and 0.9 confidence threshold. P: Precision, R: Recall, F1: F-measure

State-of-the-art: The recently proposed Mask R-CNN [he2017mask] is one of the most successful instance segmentation approaches. It is based on the Faster R-CNN [ren2015faster] technique for object detection. In the first step, Faster R-CNN generates box level proposals using the Region Proposal Network (RPN). In the second step, each box level proposal is given an object label to detect the objects present in the overall image. Mask R-CNN uses the detector feature map and produces a segmentation mask for each detected bounding box, re-aligning the misaligned feature maps using a specially designed RoIAlign operation. Mask R-CNN predicts masks and class labels in parallel. Other notable works [li2017fully, dai2016instance] predict instance segmentation using a fully convolutional network, to obtain benefits similar to those of FCNs for semantic segmentation. There have also been proposals to use CRFs for post-processing FCN outputs to group pixels of individual object instances [bai2017deep, arnab2017pixelwise]. We use Mask R-CNN in our experiments.

Our Implementation: Our MTL based architecture is shown in Figure 8. We take Mask R-CNN, a state-of-the-art instance segmentation approach, and combine it with the LSTM decoder from the state-of-the-art caption generator “Show, Attend and Tell” (SAT) [xu2015show] within our framework. We use the publicly available implementations of both Mask R-CNN (github.com/roytseng-tw/Detectron.pytorch) and SAT (github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning). We use ResNeXt-152 [xie2017aggregated] as the convolutional backbone network to extract image features. The backbone architecture is shared between Mask R-CNN and the captioning decoder. We initialize our primary network with the pre-trained weights of Mask R-CNN provided in their implementation (github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md). We then fine-tune these weights along with learning the weights of the caption decoder (secondary task) using our MTL architecture. The early stopping iterations parameter was set to 2, and the $\gamma$ parameter was set to 100. We used the Adam optimizer at test time.

Methodology and Dataset: We experiment with four different models: Mask-RCNN, Mask-MT, Mask-BP-ES and Mask-BP-L2. Mask-RCNN is the original instance segmentation model, and Mask-MT is the model trained using MTL; these two are the baseline models. We refer to our approaches as Mask-BP-ES and Mask-BP-L2, respectively, for the two types of regularizers used during prediction. Our attempts with pruning based approaches did not yield any gain, since learning the label set to be pruned (from the captions) turned out to be quite difficult.

We have used the MS-COCO dataset [lin2014microsoft] to evaluate our approach. The training set consists of nearly 118k images, and we report our results on the 5k validation images. In the dataset, each image has at least five captions assigned by different annotators. We use AP (average precision) as our evaluation metric. We also evaluate at AP0.5 (precision at an IoU threshold of 0.5).

Results: Table 4 shows the gain obtained by using the auxiliary information over the baseline models for different sizes of objects. We obtain a gain of 0.6 in AP. The gain is higher at 1.0 for AP0.5. For large objects, this number is as high as 2.2 (last column). The ES based variation does marginally better than L2. Table 5 presents the gain obtained by the ES model over the MT baseline for detecting various objects in terms of precision, recall and F-measure. Suffering a slight loss in precision, we are able to improve the overall F1 by 1.5 points. The gain is maximum for small objects, which makes sense since a large number of them remain undetected in the original model. Further, a careful analysis revealed that the ground truth itself has inconsistencies, missing several smaller objects, understating the actual gain obtained by our approach (see supplement for details).

Figure 9 presents a visual comparison of results. Our algorithm can detect new objects (Figure 9 (A)), as well as detect additional objects of the same category (Figure 9 (C)), sometimes even those not mentioned in the caption (Figure 9 (E)). Figure 9 (F) is a Mooney face [mooney1957age], as referred to in the introduction. Mask-MT incorrectly detects a bird, whereas Mask-BP correctly detects a person with reasonable segmentation.

Figure 10: Our framework may lead to over-fitting on the auxiliary information if back-propagation is overused. The figure shows incorrect prediction of multiple surf-board instances when the number of back-propagation iterations is increased from 2 to 5.

Failure analysis: Figure 10 shows a failure example for our approach. Overuse of back-propagation at test time may lead to over-fitting on the given auxiliary information in the proposed framework. As the number of back-propagation iterations increases from 2 to 5, we observe over-fitting, leading to the prediction of multiple surf-boards.

6 Multiple Types of Auxiliary Information Simultaneously

We have conducted experiments on using multiple types of auxiliary information simultaneously in our instance segmentation setup. In addition to using captions as auxiliary information, we added image tags as a second source of auxiliary information. Figure 11 describes our proposed MTL architecture in detail. Here we take Mask R-CNN, a state-of-the-art instance segmentation network, to generate image level features. Using these intermediate features, an LSTM based decoder from the state-of-the-art captioning generator “Show, Attend and Tell” (SAT) [xu2015show] is used to generate the captions, and a multi-class classifier is designed to generate image level tags. The model was trained in a multi-task setting (one primary and two auxiliary tasks). The evidence over the two auxiliary tasks was propagated simultaneously during test time. The performance of the three models, using AP (Average Precision) as the metric, is: (i) Mask-RCNN (multi-task): (ii) Mask-RCNN (multi-task + backpropagate over captions): (iii) Mask-RCNN (multi-task + backpropagate over captions, tags): .
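Propagating two pieces of evidence simultaneously amounts to summing the two auxiliary losses before the test-time update. The toy sketch below (our own illustration, not the paper's implementation) uses two hypothetical sigmoid heads sharing one weight, as stand-ins for the caption and tag branches:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adapt_two_evidences(w, heads, x, lr=0.5, steps=2):
    """Jointly back-propagate two auxiliary losses through the shared weight.

    Each head is a (head_weight, observed_evidence) pair; the gradients of
    the per-head binary cross-entropy losses are summed before each update,
    so both pieces of evidence pull on the shared representation at once.
    """
    for _ in range(steps):
        h = w * x
        grad = sum((sigmoid(u * h) - y) * u * x for u, y in heads)
        w -= lr * grad
    return w

heads = [(1.0, 1.0), (0.8, 1.0)]   # stand-ins for caption and tag evidence
w0, x = 0.5, 1.0
w1 = adapt_two_evidences(w0, heads, x)
probs_before = [sigmoid(u * w0 * x) for u, _ in heads]
probs_after = [sigmoid(u * w1 * x) for u, _ in heads]
```

Both heads' agreement with their evidence improves after the joint update, mirroring the simultaneous propagation over captions and tags.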

Figure 11: Our framework for instance segmentation with multiple types of auxiliary information used simultaneously. Mask-RCNN performs instance segmentation; an LSTM based caption generator uses the textual description of an image, and a multiclass image classifier uses image level tags, both at test time.

7 Image Captioning with Image Tags

We have conducted experiments on the MS COCO dataset with image caption prediction as the primary task and image tags as the auxiliary task. Figure 12 describes our proposed MTL architecture in detail. The base architecture was “Show, Attend and Tell” (SAT) [xu2015show], one of the prominent image captioning models. To use image level tags, we branch off after the encoder module and design a multi-class classifier. The resultant feature map is passed through an average pooling layer, a fully connected layer, and finally a sigmoid activation function to get a probability for each of the 80 classes. Using BLEU-4 as the metric, the performances are: (i) SAT: 25.8 (ii) SAT (multi-task): 27.4 (iii) SAT (multi-task + backpropagate over tags): 28.1. We also gain on other metrics such as ROUGE and CIDEr.
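The tag-classifier branch described above (average pooling, then a fully connected layer, then a per-class sigmoid) can be sketched in a few lines. This is a minimal pure-Python illustration with made-up shapes and weights, not the trained head; real feature maps have many channels and 80 tag classes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tag_head(feature_map, fc_weights, fc_bias):
    """Tag-classifier head sketch: global average pooling over the spatial
    grid, a fully connected layer, then a per-class sigmoid.

    feature_map: C x H x W nested lists (the encoder output).
    fc_weights: num_tags x C; fc_bias: num_tags.
    Returns one independent probability per tag (multi-label, not softmax).
    """
    pooled = [sum(sum(row) for row in chan) / (len(chan) * len(chan[0]))
              for chan in feature_map]
    return [sigmoid(sum(wi * pi for wi, pi in zip(w_row, pooled)) + b)
            for w_row, b in zip(fc_weights, fc_bias)]

# 2-channel, 2x2 feature map; 3 hypothetical tag classes (COCO has 80).
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 1.0], [1.0, 2.0]]]
weights = [[0.1, 0.2], [-0.3, 0.4], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
probs = tag_head(fmap, weights, bias)
```

The sigmoid (rather than softmax) output is what makes the head multi-label: each of the 80 tags is predicted independently, since an image typically contains several object types at once.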

Figure 12: Our framework for image captioning, using “Show, Attend and Tell” (SAT) [xu2015show] and a multiclass image classifier to use image level tags at test time.

8 Auxiliary Information as Input

We have also done experiments with baseline models using auxiliary information as input, but our approach performs better than them. Compared to baseline DeepLab for semantic segmentation with an MIoU of , using tags as input only improves it to , compared to our framework which gives an MIoU of using the same information. We also tried inputting the caption embedding generated using “InferSent” [conneau2017supervised] from Facebook AI Research to the Mask-RCNN. However, in this case, our training did not converge. This highlights an advantage of our framework, which allows the primary task to be trained separately first, with the joint model then fine-tuned in an efficient manner. Further, our framework does not require any commonly annotated dataset, unlike the approaches which take auxiliary information as input.

9 Conclusion

We have presented a novel approach to incorporate evidence into deep networks at prediction time. Our key idea is to model the evidence as auxiliary information in an MTL architecture, and then modify the weights at prediction time such that the output of the auxiliary task(s) matches the evidence. The approach is generic and can be used to improve the accuracy of diverse existing neural network architectures for a variety of tasks, using easily available auxiliary information at test time. Experiments on two different computer vision applications demonstrate the efficacy of our proposed model over the state-of-the-art. In the future, we would like to experiment with additional applications, such as video data.


Supplementary Material

Appendix A Interpreting as a Graphical Model

Figure 13: Model Architecture for Multi-task learning (MTL)

In this section, we present a probabilistic graphical model's perspective of our approach (as described in Section 3 of the main paper). Referring back to Figure 13, we can define a probabilistic graphical model over the random variables $x$ (input), $h$ (hidden representation), $y_p$ (primary output) and $y_a$ (auxiliary output). Interpreting this as a Bayesian network (with arrows going from $x$ to $h$, and from $h$ to $y_p$ and $y_a$), we are interested in computing the probabilities $P(y_p \mid x)$ and $P(y_a \mid x)$ at inference time. Further, we have:

$$P(y_p \mid x) = \sum_{h} P(y_p \mid h, x)\, P(h \mid x) = \sum_{h} P(y_p \mid h)\, P(h \mid x)$$

In the first conditional probability term in the RHS, the dependence on $x$ is taken away since $y_p$ is independent of $x$ given $h$. Since, in our network, $h$ is fully determined by $x$ (due to the nature of forward propagation), we can write this dependence as a deterministic function $h = f(x)$. In other words, there is a value $h^* = f(x)$, such that $P(h \mid x) = 1$ if $h = h^*$ and $0$ otherwise. Therefore, the above equation can be equivalently written as:

$$P(y_p \mid x) = P(y_p \mid h^*), \qquad h^* = f(x)$$

Note that the sum over $h$ disappears since $P(h \mid x)$ is non-zero only when $h = h^*$ as defined above. Similarly:

$$P(y_a \mid x) = P(y_a \mid h^*), \qquad h^* = f(x)$$

The goal of inference is to find the values of $y_p$ and $y_a$ maximizing $P(y_p \mid x)$ and $P(y_a \mid x)$, respectively. The parameters of the graphical model are learned by maximizing the cross entropy or some other kind of surrogate loss over the training data.

Let us analyze what happens at test time. We are given the evidence $\hat{y}_a$ at test time. In the light of this observation, we would like to change our distribution over $y_a$ such that the probability of observing $\hat{y}_a$ is maximized, i.e., $P(y_a = \hat{y}_a \mid x)$ is equal to $1$. Recalling that $P(y_a \mid x) = P(y_a \mid h^*)$ with $h^* = f(x)$, in order to effect this, we may:

  1. Change the distribution $P(y_a \mid h)$ to $P'(y_a \mid h)$, or

  2. Change the function $f$ to $f'$,

such that the predicted $y_a$ is as close to $\hat{y}_a$ as possible. How do we do this in a principled manner? We define an appropriate loss capturing the discrepancy between the value predicted using the distribution and the evidence $\hat{y}_a$, i.e., $L(y_a, \hat{y}_a)$. The loss term also incorporates a regularizer so that the new parameters do not deviate significantly from the original set of parameters, avoiding overfitting the evidence $\hat{y}_a$.

In order to minimize the loss, we can back-propagate its gradient in the DNN and learn the new set of parameters. This results in a change of the dependence of $h$ on $x$, i.e., $f \rightarrow f'$, as well as that of $y_a$ on $h$, i.e., $P(y_a \mid h) \rightarrow P'(y_a \mid h)$. The resulting parameters are $\theta'_f$ and $\theta'_a$, which effectively generate a new distribution over $y_p$ as well. Hence, adjusting the DNN weights in order to match the evidence also results in an updated prediction over the primary task aligned with the observed evidence.
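The test-time procedure can be summarized as a single regularized objective. The notation below ($\theta$ for the adapted parameters, $f$ for the shared encoder, $g_a$ for the auxiliary head, $\lambda$ for the regularization weight) is our own shorthand for the quantities described in this appendix:

```latex
% Test-time objective: fit the auxiliary evidence \hat{y}_a while staying
% close to the originally trained parameters \theta^0.
\theta' = \arg\min_{\theta}\;
    L\big(g_a(f(x;\theta_f);\,\theta_a),\, \hat{y}_a\big)
    + \lambda\,\lVert \theta - \theta^0 \rVert_2^2

% Gradient used by the test-time back-propagation; the early-stopping
% variant drops the second term and limits the number of steps instead.
\nabla_{\theta}\,J = \nabla_{\theta} L + 2\lambda\,(\theta - \theta^0)
```

The two variants in the experiments correspond to the two ways of controlling deviation from $\theta^0$: the explicit L2 penalty above, or early stopping with a small fixed iteration count.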

Appendix B Instance Segmentation

In the main paper, we have presented results on the semantic segmentation and object instance segmentation problems. We notice that in the case of instance segmentation, though our results are much better qualitatively, the same is not fully reflected in the quantitative comparison. A careful analysis of the results reveals that there are often inconsistencies in the ground truth annotations themselves. For example, in many cases the ground truth annotation misses objects which are either very small, occluded or only partly present in the image. In some cases, the ground truth segmentation itself has inconsistencies. In many of these scenarios, our framework is able to predict the correct output, but since the ground truth annotation is erroneous, we are incorrectly penalized for detecting true objects or for producing the correct segmentation, resulting in an apparent loss in our accuracy numbers. In the figures below, we highlight some examples to support our claim. Systematically fixing the ground truth is a direction for future work.

Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 14: Input caption: “a person standing over a toilet using the restroom”.
Mask-BP-ES is able to detect the person standing in the toilet. This is in contrast to Mask-MT which can only detect the toilet. Ground truth incorrectly misses the person.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 15: Input caption: “a living room with a couch, coffee table, tv on a stand and a chair with ottoman”.
Mask-BP-ES is able to detect clock as compared to Mask-MT. This is not annotated in the ground truth.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 16: Input caption: “the large glass plant vase is installed into the wall”.
Mask-BP-ES is able to detect potted plant as compared to Mask-MT. This is not annotated in the ground truth.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 17: Input caption: “the little girl is busy eating her pizza at the table”.
Mask-BP-ES is able to detect dining table as compared to Mask-MT. The dining table is labeled in the ground truth with incomplete segmentation.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 18: Input caption: “three teddy bears sit in a sled in fake snow”.
Mask-BP-ES is able to detect 4 teddy bears, as compared to Mask-MT which detects 3 teddy bears. All 3 teddy bears in the front portion of the image are marked as a single teddy bear in the ground truth.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 19: Input caption: “a shopping cart full of food that includes bananas and milk”.
Mask-BP-ES is able to detect bottle of milk as compared to Mask-MT. This is not annotated in the ground truth.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 20: Input caption: “a yellow and red apple and some bananas”.
Mask-BP-ES is able to separate the bananas as compared to Mask-MT. These are marked as a single banana in the ground truth.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 21: Input caption: “two apples, an orange, some grapes and peanuts”.
Mask-BP-ES is able to detect an extra apple in the image as compared to Mask-MT, but both apples are marked as a single apple in the ground truth.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 22: Input caption: “a bird resting outside of a boat window”.
Mask-BP-ES is able to detect both the bird and the boat in the image, as mentioned in the caption, as compared to Mask-MT. Ground truth has no annotation for the boat.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 23: Input caption: “a lot of strawberries and oranges sitting in a bowl”.
Mask-BP-ES is able to correctly detect an extra orange as compared to Mask-MT. All the oranges are marked as a single orange in the ground truth.
Input Image
Ground Truth
Mask-MT Output
Mask-BP-ES Output
(Our Approach)
Figure 24: Input caption: “a man holds a controller and keyboard with his hands”.
Mask-BP-ES is able to detect the remote which is missed by Mask-MT. The ground truth annotation misses the remote.