Straight to Shapes++: Real-time Instance Segmentation Made More Accurate


Laurynas Miksys, Saumya Jetley, Michael Sapienza, Stuart Golodetz, Philip H. S. Torr. Department of Engineering Science, University of Oxford; Samsung Research America; FiveAI Ltd. {sjetley,smg,phst}@robots.ox.ac.uk; m.sapienza@samsung.com. This work was done when L. Miksys was an intern at the Torr Vision Group, Department of Engineering Science, University of Oxford.
Abstract

Instance segmentation is an important problem in computer vision, with applications in autonomous driving, drone navigation and robotic manipulation. However, most existing methods are not real-time, complicating their deployment in time-sensitive contexts. In this work, we extend an existing approach to real-time instance segmentation, called ‘Straight to Shapes’ (STS), which makes use of low-dimensional shape embedding spaces to directly regress to object shape masks. The STS model can run in real time on a high-end desktop, but its accuracy is significantly worse than that of offline state-of-the-art methods. We leverage recent advances in the design and training of deep instance segmentation models to improve the accuracy of the STS model whilst keeping its real-time capabilities intact. In particular, we find that parameter sharing, more aggressive data augmentation and the use of a structured loss for shape mask prediction all provide a useful boost to network performance. Our proposed approach, ‘Straight to Shapes++’, achieves a remarkable improvement in mAP over the original method as evaluated on the PASCAL VOC dataset, thus redefining the accuracy frontier at real-time speeds. Since the accuracy of instance segmentation is closely tied to that of object bounding box prediction, we also study the error profile of the latter and examine the failure modes of our method for future improvements.

1 Introduction

Figure 1: Accuracy vs. runtime trade-off as evaluated on the PASCAL VOC dataset (Everingham et al., 2015; Hariharan et al., 2011) across a range of existing instance segmentation models: FCIS (Li et al., 2016), STS (Jetley et al., 2017), SDS (Hariharan et al., 2014), naive MNC (Li et al., 2016), MNC (Dai et al., 2016a), Arnab & Torr (2017), and BAIS (Hayder et al., 2016). STS++ redefines the performance frontier at real-time speeds.

Scene understanding deals with the what and where of objects in a given visual environment. In cases where the scene is described using a set of RGB images, an elementary but challenging scene understanding task is that of identifying and delineating instances of different categories of interest in those images. Popularly known as instance segmentation, this task has wide applicability in real-time computer vision applications such as autonomous driving, drone navigation and robotic manipulation, and solution methods need to be fast as well as accurate. To this end, we build upon a recently proposed instance segmentation model that has demonstrated real-time capabilities but is limited in terms of its accuracy (Jetley et al., 2017).

The Straight to Shapes (STS) model extends the real-time object detector YOLO (Redmon et al., 2016) to additionally predict encoded shape representations for multiple object hypotheses. The model makes use of continuous shape-based representations and learns to map input image regions to points in this continuous shape space in order to predict instance-level masks for object categories of interest. This use of an intermediate shape embedding space has also been shown to generalise at test time to unseen categories that are similar to the training classes, a very useful property when operating in the wild. However, while this simple extension maintains the inference speed, the inaccuracies in regressing to the real-valued vectors of shape representations result in sub-par mAP accuracy in comparison to existing instance segmentation methods (Li et al., 2016; Hariharan et al., 2014; Li et al., 2016; Dai et al., 2016a; Arnab & Torr, 2017; Hayder et al., 2016). More recently, Liu et al. (2015) (with their SSD model) and Redmon & Farhadi (2016) (with YOLO9000) have demonstrated that state-of-the-art quality can be achieved in real-time detection models by making systematic modifications to the underlying network architecture and training procedure. This suggests that similar improvements may be made in instance segmentation. Thus, we extend the existing STS model through the following contributions:

We review the recent advances in neural network design and training in the context of object detection and instance segmentation tasks.

We present a revised model called STS++ with an improved mAP accuracy on PASCAL VOC (Everingham et al., 2015) at real-time speeds, as shown in Figure 1.

We analyse the errors that the proposed model makes in predicting the object bounding boxes, in terms of the taxonomy proposed by Hoiem et al. (2012), and identify avenues for future improvements.

Figure 2: Schematic illustration (top) of the STS (Jetley et al., 2017) instance segmentation system, and (bottom) its predictive model. The latter is a deep neural regressor that accepts fixed-size RGB images as input and encodes them into a fixed-dimensional representation. This representation is then used to predict the category, location and shape properties of the underlying object instances. These output properties are matched against ground truth data to train the network via backpropagation. The post-processing step merely takes the predicted shape representations, which are then decoded into 2D shape masks.

2 Related Work

Object detection localises individual objects in an image using tight-fitting bounding boxes and classifies them into one of the pre-defined target categories. However, it overlooks the pose and shape details of the objects. Semantic segmentation yields a more fine-grained category map at the pixel-level, but does not contain knowledge about the number of individual objects or the boundaries between them when the objects of the same category are adjoining or overlapping. Instance segmentation attempts to combine the best of both of the above tasks by delineating individual objects at the pixel level. All the above scene understanding tasks have benefited hugely from the availability of large-scale annotated datasets and the advent of deep learning. In particular, there are many deep convolutional neural network based solutions for both object detection e.g. R-CNN (Girshick et al., 2014), Faster R-CNN (Ren et al., 2015), YOLO (Redmon et al., 2016) and SSD (Liu et al., 2015) and semantic segmentation e.g. FCN (Long et al., 2015) and CRF as RNN (Zheng et al., 2015). Following their success, various instance segmentation approaches (Hariharan et al., 2014; Dai et al., 2015, 2016a; Arnab & Torr, 2017; Hayder et al., 2016; Li et al., 2016; He et al., 2017) proposed to combine both top-down detection and bottom-up segmentation results in order to isolate object instances.

For example, the Simultaneous Detection and Segmentation (SDS) approach of Hariharan et al. (2014) adapts the R-CNN (Girshick et al., 2014) detector to the task of instance segmentation. This is achieved by refining the regions within the object boxes of R-CNN via a combination of bottom-up cues from the super-pixels of the Multiscale Combinatorial Grouping (MCG) approach of Arbeláez et al. (2014) and top-down cues from the foreground mask predicted using CNN features. In their approach, the box proposal and mask refinement stages are separately optimised. In contrast, the Multi Network Cascade (MNC) of Dai et al. (2015) incorporates all these steps into a single neural network pipeline using a cascaded structure and joint training. A stack of convolutional layers is shared between the three cascaded tasks of bounding box proposal generation, pixel-wise prediction of an instance mask per proposal using region-of-interest (RoI) CNN features, and classification of the output mask into one of the target categories. Similarly, the Boundary-Aware Instance Segmentation (BAIS) model (Hayder et al., 2016) uses shared layers for region proposal and RoI feature extraction. This is followed by a pixel-wise prediction of masks, each corresponding to a specific quantised level of distance transform (DT) values. The DT values directly and redundantly encode the boundary information, which allows the reconstructed instance masks to extend beyond the bounding box and also be robust to prediction noise – two salient features of this approach.

At the other end, Instance-sensitive FCN (Dai et al., 2016a) adapts the segmentation pipeline of FCN (Long et al., 2015) to predict relative-position score maps. Here, each pixel value in a given score map captures the pixel's probability of being at the specified relative position w.r.t. the underlying object. An instance assembly module then reconstructs object masks by using dense sliding windows and copying pixel scores for a given position in an object window (say top-left) from the associated relative-position (top-left) map. The instance mask is thus obtained by combining the masks for each relative location within a specific window. This approach is effective at delineating instances, but does not associate those instances with semantic categories: it is hence often used in conjunction with R-FCN (Dai et al., 2016b) to classify the instance proposals. An end-to-end extension of this instance segmentation and classification pipeline is described as the Fully Convolutional Instance-aware Semantic Segmentation (FCIS) model (Li et al., 2016). Unlike Instance-sensitive FCN (Dai et al., 2016a), the position-sensitive score maps here are not predicted at the image level but atop the bounding box predictions of a region proposal network (RPN) (Dai et al., 2016b). The current state-of-the-art model, Mask R-CNN (He et al., 2017), once again extends the RPN pipeline for instance segmentation. The prediction of object bounding boxes and corresponding category scores is followed by the pixel-wise prediction of binary masks, one for each target category. At test time, the binary mask for the highest-scoring semantic category is selected. The authors remark that the prediction of class-wise binary masks serves to reduce the competition between pixels to belong to a single category and improves the quality of the output masks. Concurrently, the approach of Arnab & Torr (2017) starts from the semantic segmentation maps of CRF as RNN (Zheng et al., 2015) and proceeds by allocating each pixel to one of the detections produced by R-FCN (Dai et al., 2016b). They use the CRF framework to optimise this allocation, which corresponds to the minimisation of an energy function defined atop the per-pixel semantic score, the bounding box score and the correlation of the semantic mask with a predefined shape prior.

Two important properties of the above methods are worth noting. Firstly, whilst the accuracy of instance segmentation methods has improved rapidly, runtime has largely remained a secondary concern. Even the existing best model, Mask R-CNN (He et al., 2017), runs at only a few frames per second and cannot trivially be deployed for real-time applications (the cost can be amortised over several frames on a background thread, e.g. Rünz & Agapito (2018), but only if the resulting latency is tolerable). Secondly, the bottom-up scheme of pixel-wise instance mask prediction often yields coarse and noisy masks for unseen categories, as noted by Jetley et al. (2017).

In contrast to these methods, STS (Jetley et al., 2017) predicts shape masks via the bottleneck of an encoded, low-dimensional and continuous shape space, which allows test-time generalisation to unseen categories. Unlike the other existing approaches, which cannot run in real time, it is able to run at real-time frame rates on a high-end desktop, making it highly desirable, at least in principle, for use in real-time applications (Hicks et al., 2013). However, its practical usefulness is compromised by the accuracy of its predictions, which is significantly worse than that achievable by slower approaches. The goal of this work is to address this problem, inspired by works such as those of Liu et al. (2015) and Redmon & Farhadi (2016), which make systematic modifications to detection networks and achieve improvements in detection accuracy at real-time speeds.

3 Methodology

We start by reviewing the original STS model and then discuss the motivation behind the proposed changes and the specifics of those changes.

3.1 Original Model

The prediction model of STS (Jetley et al., 2017) is identical to that of YOLO (Redmon et al., 2016). A deep convolutional neural network (comprising 30 layers) encodes an input image of fixed size into a fixed-dimensional representation, as shown in Figure 2 (bottom). This encoded vector representation is then used for predicting the object instances at all image locations. The STS model predicts object instances relative to a predefined spatial grid laid over the image. At every grid location, the model outputs conditional class probabilities and makes a fixed number of class-agnostic object instance proposals. Each proposal consists of a confidence score, the center coordinates and dimensions of a bounding box, and the additional instance shape representation. More formally, for every grid cell $i$, the model outputs a vector

$$\mathbf{o}_i = \big[\,\mathbf{p}_i,\; \mathbf{b}_{i,1},\; \ldots,\; \mathbf{b}_{i,B}\,\big], \tag{1}$$
$$\mathbf{p}_i = \big[\,p_i(c_1),\; \ldots,\; p_i(c_C)\,\big], \tag{2}$$
$$\mathbf{b}_{i,j} = \big[\,s_{i,j},\; \mathbf{t}_{i,j},\; \mathbf{e}_{i,j}\,\big], \quad j = 1, \ldots, B, \tag{3}$$
$$\mathbf{e}_{i,j} \in \mathbb{R}^{m}, \tag{4}$$

are the parameters of the object proposal, where $\mathbf{p}_i$ collects the conditional class probabilities over the $C$ target categories and $s_{i,j}$ is the confidence score, while

$$\mathbf{t}_{i,j} = \big[\,x_{i,j},\; y_{i,j},\; \sqrt{w_{i,j}},\; \sqrt{h_{i,j}}\,\big] \tag{5}$$

are the bounding box parameters, and $\mathbf{e}_{i,j}$ is the $m$-dimensional shape representation.

Note that $\sqrt{h_{i,j}}$ and $\sqrt{w_{i,j}}$ correspond to the square roots of the height and width of the object bounding box, respectively. This parameterisation favours correct predictions of small boxes over large ones, since a small discrepancy in a large-box prediction has a smaller effect on the output accuracy. Full details of the architecture and loss function design can be found in Table 3 of Appendix A.1 and in Appendix A.3 respectively.
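To make the effect of this parameterisation concrete, consider a hypothetical absolute error of 5 pixels in the predicted width of a 200-pixel-wide box versus a 20-pixel-wide box (illustrative numbers, not taken from our experiments):

$$\big(\sqrt{205} - \sqrt{200}\big)^2 \approx (14.32 - 14.14)^2 \approx 0.031, \qquad \big(\sqrt{25} - \sqrt{20}\big)^2 \approx (5.00 - 4.47)^2 \approx 0.279,$$

so the same 5-pixel error is penalised roughly nine times more heavily on the small box, whereas a direct regression to the raw width would penalise both equally.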

The system explicitly constructs a low-dimensional shape embedding space using a denoising auto-encoder. At train time, the deep neural pipeline is adjusted to correctly predict the encoded shape representations corresponding to the underlying scene objects. At test time, the pipeline estimates these encoded object shape representations, which are then mapped to the space of 2D binary shape masks using the standalone decoder block of the denoising auto-encoder. Initial investigation demonstrates that the learned shape space encodes shape information in a semantically meaningful way: instance masks of objects with similar shape appearance cluster together in the space. The space is continuous and allows the reconstruction of new and realistic shape masks.
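The paper does not prescribe code for this auto-encoder; purely as an illustration, a minimal PyTorch-style denoising auto-encoder over binary shape masks could look as follows. The mask size, hidden width and noise level are placeholder assumptions (only the 20-dimensional embedding echoes the configuration used later in §4), and the layer structure is deliberately simpler than the decoder detailed in Table 6:

```python
import torch
import torch.nn as nn

class ShapeAutoEncoder(nn.Module):
    """Minimal denoising auto-encoder over fixed-size binary shape masks (illustrative)."""
    def __init__(self, mask_size=64, embed_dim=20):
        super().__init__()
        d = mask_size * mask_size
        self.mask_size = mask_size
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(d, 256), nn.ReLU(),
                                     nn.Linear(256, embed_dim))
        self.decoder = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                     nn.Linear(256, d), nn.Sigmoid())

    def forward(self, masks, noise_prob=0.1):
        # Corrupt the clean masks by flipping a random subset of pixels, then
        # encode the corrupted input and reconstruct the clean target.
        flips = (torch.rand_like(masks) < noise_prob).float()
        corrupted = (masks - flips).abs()
        codes = self.encoder(corrupted)
        recon = self.decoder(codes).view(-1, 1, self.mask_size, self.mask_size)
        return codes, recon

# Training minimises per-pixel binary cross-entropy between `recon` and the clean
# masks; afterwards the standalone decoder maps predicted codes to 2D shape masks.
```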

3.2 Proposed Revisions

We present modifications to the existing system to speed up convergence, reduce memory requirements, minimise computational costs and improve the instance segmentation performance. Following the block diagram of Figure 2 (top), we organise the proposed changes into three groups: changes to the prediction model, to the data-preparation step, and to the post-processing operations.

3.2.1 Changes to Prediction Model

Batch Normalisation:   Batch normalisation acts as a regulariser and improves convergence speed during network training, often resulting in improved performance at convergence (Ioffe & Szegedy, 2015). Its positive effect on deep neural network training is widely reported in the research literature (He et al., 2016, 2015; Vinyals et al., 2014; Szegedy et al., 2015), and it was demonstrated to improve the original YOLO model in the subsequent work by Redmon & Farhadi (2016). We propose adding batch normalisation after every convolutional layer in the network. Although it increases the total number of learnable parameters in the model, the batch-norm parameters can be folded into the preceding convolutional layer's affine transformation once training is finished. Thus, it does not affect the model's computational cost or speed during inference.
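As an aside, the reason batch norm adds no inference cost can be made concrete with the standard fusion below; this is a generic PyTorch-style sketch (not code from the STS++/Darknet implementation), assuming a plain convolution followed by BatchNorm2d:

```python
import torch
import torch.nn as nn

def fold_batchnorm(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fuse a trained BatchNorm2d into the preceding Conv2d for inference.

    Assumes a plain convolution (no groups/dilation); the fused layer computes
    exactly bn(conv(x)) using BN's running statistics."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    # Inference-time BN: y = gamma * (z - mean) / sqrt(var + eps) + beta
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused
```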

Figure 3: Fully connected layers are replaced with convolutional layers (filled blue blocks) to build a translation invariant model.

Sharing Prediction Weights:   The original model uses a series of convolutional layers followed by a fully-connected layer to construct a representation of the entire image and predict object instances at different grid locations (see Figure 2). Firstly, such an architecture is not translation invariant by design, and thus has to be presented with a sufficient variety of shifted examples to develop this property. Secondly, the fully-connected layers contain the majority of the learnable network parameters. Thus, we propose encoding the raw image into a feature map that preserves the spatial information, and using shared convolutional layers to predict object instances at different spatial locations. In order not to reduce the model's capacity drastically, we introduce a sequence of convolutional layers, replacing the single fully-connected layer, before using the shared prediction module (see Figure 3 for an illustration). Full details of the altered architecture can be found in Table 4 in Appendix A.1. Consequently, the total number of parameters is reduced from roughly 280M to 119M (cf. Tables 3 and 4; see Figure 4).
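To illustrate the difference between the two prediction-head designs, the sketch below contrasts a fully-connected head with a shared convolutional one for a hypothetical 7x7x1024 feature map and 70 outputs per grid cell; the exact layer sizes are illustrative rather than the ones listed in Tables 3 and 4:

```python
import torch.nn as nn

# Fully-connected head: every output unit is wired to the whole 7x7x1024 feature
# map, so weights are not shared across grid cells (illustrative sizes).
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 1024, 4096), nn.LeakyReLU(0.1),
    nn.Linear(4096, 7 * 7 * 70),             # 70 prediction values per grid cell
)

# Shared convolutional head: the same filters slide over every grid cell, making
# the predictor translation invariant and far smaller.
conv_head = nn.Sequential(
    nn.Conv2d(1024, 2048, kernel_size=3, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(2048, 1024, kernel_size=1), nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 70, kernel_size=1),      # 70 prediction channels per location
)

def param_count(module):
    return sum(p.numel() for p in module.parameters())

# fc_head has roughly 220M parameters, conv_head roughly 21M, for the same
# per-cell output size.
print(param_count(fc_head), param_count(conv_head))
```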

Distributing Computations Evenly:   Finally, we note that the overall computational cost is rather unevenly distributed across the individual layers of the network, see Figure 5. This can slow down the whole network, particularly when computational resources are limited. In their subsequent work on YOLO, Redmon & Farhadi (2016) identify this issue and propose the Darknet19 architecture. They borrow network design ideas from the VGG (Simonyan & Zisserman, 2015) and Network in Network (NiN) (Lin et al., 2013) architectures. For example, the number of filters is doubled whenever the spatial dimension of the feature map is halved, and 1×1 convolutions are used at every other layer to compress the feature representations and reduce the overall computational cost. We upgrade our model to the Darknet19 architecture. Full details of this extended architecture are described in Table 5 in Appendix A.1.

3.2.2 Changes to Data Preparation Step

Pre-processing Raw Images:   Neural networks used in computer vision tasks usually have a parameter count that far exceeds the number of data points available for training (Krizhevsky et al., 2012). Random data distortions are known to help prevent over-fitting and improve overall performance (Ciresan et al., 2010; Wong et al., 2016). Given also our focus on detecting objects in natural environments, we want to simulate various pose, angle and lighting conditions via suitable data distortions. This can be achieved, to a limited extent, using affine transformations of 2D images and per-pixel colour distortions. We propose to change the data-augmentation paradigm, generally favouring more aggressive transformations, as follows (a minimal sketch of such a sampler is given after the list):

rotation angle: sampled uniformly at random

translation in x and y: sampled uniformly at random

scaling factor: changed from uniform to log-uniform sampling

random horizontal flip: with probability 0.5 (unchanged)

intensity scaling: changed from uniform to log-uniform sampling

additive intensity shift: sampled uniformly (unchanged)
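As an illustration of the sampling scheme above, the sketch below draws one set of distortion parameters per image; all numeric ranges are placeholder assumptions rather than the values used in our experiments, and the same geometric transform must of course also be applied to the instance masks and boxes:

```python
import math
import random

def sample_augmentation_params():
    """Draw one set of geometric and photometric distortion parameters per image.
    Every numeric range below is an illustrative placeholder."""
    return {
        "rotation_deg": random.uniform(-10.0, 10.0),
        "translate_xy": (random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)),
        # Log-uniform sampling: uniform in log-space, then exponentiate.
        "scale": math.exp(random.uniform(math.log(0.7), math.log(1.4))),
        "hflip": random.random() < 0.5,
        "intensity_mul": math.exp(random.uniform(math.log(0.8), math.log(1.25))),
        "intensity_add": random.uniform(-20.0, 20.0),  # on a 0-255 intensity scale
    }
```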

Figure 4: Using a shared detection module at different locations significantly reduces the number of learnable parameters.
Figure 5: The Darknet19 architecture distributes the computations more evenly throughout the network.

Representing Targets:   STS (Jetley et al., 2017) uses a target representation similar to those of YOLO (Redmon & Farhadi, 2016) and SSD (Liu et al., 2015). A frame is divided into a regular grid of cells. The cell containing the center of a ground truth bounding box gets assigned all relevant information about the ground truth instance, provided as a vector as described in §3.1. A nuance of this assignment, however, is that any grid cell is capable of holding information about at most one ground truth object. Whenever another object in the frame has its bounding box centered in the same cell, it is discarded. This leads to the model being penalised for predicting such objects, since they are not represented anywhere in the target.

We propose changing this target encoding and employing anchor boxes, a method used in other detection and segmentation approaches such as Faster R-CNN (Ren et al., 2015), Mask R-CNN (He et al., 2017) and MNC (Dai et al., 2015). We use 3 such boxes per grid cell, corresponding to 3 different aspect ratios. This allows object instances of non-standard aspect ratios to find their best match, given that we assign an instance to the anchor box with which it shares the highest IoU. In addition, the system can now support the prediction of up to 3 closely placed ground truth objects per location.
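A minimal sketch of this matching rule follows: each ground truth box is assigned to the grid cell containing its centre and, within that cell, to the anchor with which it has the highest IoU. The anchor shapes and the width/height-only IoU helper are illustrative assumptions, not values from our system:

```python
def wh_iou(box_wh, anchor_wh):
    """IoU between two boxes of given (w, h), both centred at the origin."""
    bw, bh = box_wh
    aw, ah = anchor_wh
    inter = min(bw, aw) * min(bh, ah)
    return inter / (bw * bh + aw * ah - inter)

# Three anchors per grid cell with different aspect ratios (shapes are illustrative).
ANCHORS = [(1.0, 1.0), (0.5, 1.5), (1.5, 0.5)]  # (w, h) in grid-cell units

def assign_to_anchor(cx, cy, gt_w, gt_h, grid_size):
    """Return (cell_x, cell_y, anchor_index) responsible for one ground truth box.

    cx, cy: box centre in normalised image coordinates [0, 1];
    gt_w, gt_h: box dimensions expressed in grid-cell units."""
    cell_x = min(int(cx * grid_size), grid_size - 1)
    cell_y = min(int(cy * grid_size), grid_size - 1)
    best = max(range(len(ANCHORS)), key=lambda k: wh_iou((gt_w, gt_h), ANCHORS[k]))
    return cell_x, cell_y, best
```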

Representing Shapes:   Several different methods to represent shapes are investigated in the STS paper, namely, down-sampled binary mask representation, radial contour description, and an embedding space learned using a parametric model (in particular, a denoising auto-encoder (Vincent et al., 2008)). We use the learned shape embedding in all of our experiments, due to its low dimensionality, better reconstruction accuracy and robustness to noise.

As an alternative, we also propose and investigate the use of a distance transform (DT) based representation of binary images (Borgefors, 1986). The DT encodes information in a multi-valued format in which each pixel value represents its closest distance to the background (w.r.t. some metric; here we use the Euclidean distance). This lends DT-based representations a richer structure. Moreover, a corrupted pixel value in a DT representation can be recovered from the surrounding pixel values. This property provides a more robust shape reconstruction in the presence of prediction noise.
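For concreteness, a DT target can be computed from a binary mask with standard tools, e.g. SciPy; the quantisation thresholds below are placeholder assumptions:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dt_representation(mask, levels=(1, 2, 4, 8, 16)):
    """Euclidean distance transform of a binary mask, plus a quantised encoding.

    Returns the raw DT (each foreground pixel's distance to the background) and
    a stack of binary maps, one per level: map k is 1 where the distance is at
    least levels[k]. The levels used here are illustrative placeholders."""
    dt = distance_transform_edt(mask.astype(bool))
    quantised = np.stack([(dt >= d).astype(np.float32) for d in levels])
    return dt, quantised
```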

3.2.3 Changes to Post-processing Step

Mask Decoding:   In order to evaluate the quality of the proposed instances, fixed-size binary masks are reconstructed from their predicted shape representations and re-scaled to fit their corresponding bounding box predictions. In the case of learned shape embeddings, a trained neural network decoder is used to reconstruct the mask from the predicted embedding. The detailed architecture of the decoder can be found in Table 6 in Appendix A.1. It is trained separately, as part of an auto-encoder pipeline, using the per-pixel binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{BCE}} = -\sum_{p} \Big[\, y_p \log \hat{y}_p + (1 - y_p) \log (1 - \hat{y}_p) \,\Big], \tag{6}$$

where $y_p$ and $\hat{y}_p$ are the ground truth and predicted pixel intensities, respectively.

As one of the experiments, we propose incorporating the shape decoder into the full end-to-end trainable pipeline. This simplifies the overall approach and allows the rich semantic structure of the larger PASCAL VOC dataset to tune the reconstruction process. The original decoder has a relatively small number of learnable parameters (roughly 88 thousand, cf. Table 6) and uses hard-coded bi-linear spatial up-sampling, which limits the shape decoder's learning ability. Hence, we also propose an alternative decoder architecture with increased learning capacity and a learnable up-scaling function (via transposed convolutions). The number of layers and the output mask dimensions remain the same, while we increase the number of filters per layer. More details can be found in Table 7 of Appendix A.1. The squared error term in the original loss function design (Eq. 11, Appendix A.3), used to regress to the encoded shape representations, is replaced with the binary cross-entropy (Eq. 6) during the holistic training of the full pipeline.

Figure 6: (Left) Object instance binary mask; (Middle) Distance transform representation, (Right) Illustration of binary mask reconstruction from distance transform by superimposing discs, at every pixel position, of radii equal to the underlying DT values. (Images taken from wolfram.com)

Decoding the Distance Transform (DT):   In order to accommodate DT-based shape representations, we implement another differentiable decoder for the purposes of reconstruction. Given a low-dimensional representation for every instance, a neural decoder, designed as a sequence of transposed and ordinary convolutional layers, generates a stack of fixed-size binary masks. Each such mask encodes a specific quantised distance to the background: the mask associated with the quantised distance $d_k$ has value 1 where the distance is at least $d_k$, and 0 elsewhere. Every pixel in every mask is modelled as a binary variable using logistic units. Once such masks are generated, a transposed convolutional (deconvolutional) layer is applied with pre-defined non-learnable filters, where the filter for level $k$ encodes a disc of radius $d_k$. Essentially, a disc of radius $d_k$ is drawn at every location where the encoded distance value is $d_k$ (see Figure 6 for an illustration). Noticeably, each pixel value contains information about the distance to the object boundary; this redundancy equips the DT-based shape representation to better handle any noise at inference. The final binary mask is obtained by taking a linear combination of the reconstructed masks and thresholding the output, where the linear parameters are learned to optimise the objective. The whole decoder is fully differentiable and can be learned together with the full network. For more details refer to Table 8 in Appendix A.1. We use eight quantisation levels in our experiments (cf. Table 8). This is similar to the way that binary masks are reconstructed in the BAIS model by Hayder et al. (2016). Finally, given the reconstructed binary masks of valid object proposals, non-maximal suppression is performed to filter the overlapping predictions (Neubeck & Gool, 2006).
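The reconstruction logic can be sketched (in a non-differentiable form) using morphological dilation with disc-shaped structuring elements in place of the fixed transposed-convolution filters; the radii are placeholders tied to the assumed quantisation levels, and the simple union below stands in for the learned linear combination and threshold:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def disc(radius):
    """Binary disc-shaped structuring element of the given radius."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x * x + y * y) <= radius * radius

def mask_from_quantised_dt(quantised, levels=(1, 2, 4, 8, 16)):
    """Reconstruct a binary mask from quantised distance-transform maps.

    quantised[k] is 1 wherever the predicted distance to the background is at
    least levels[k]; drawing a disc of radius levels[k] at each such pixel
    (implemented here as a dilation) recovers part of the object support, and
    the union over levels approximates the full mask."""
    recon = np.zeros(quantised.shape[1:], dtype=bool)
    for level_map, radius in zip(quantised, levels):
        recon |= binary_dilation(level_map.astype(bool), structure=disc(radius))
    return recon
```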

4 Experiments and Results

Figure 7: (Left) Learning curves during training with and without batch normalisation. (Right) Performance estimates on the SBD train and val splits during training with conservative vs. aggressive data augmentation.

The details of the experimental setup are given in Appendix A.2. Here, we present a quantitative analysis of the effects of the proposed changes on the instance segmentation performance. For purposes of comparison, we train our models on the SBD train set and evaluate on the SBD validation set (Hariharan et al., 2011). We further evaluate the models on the task of object detection alone and analyse the results in terms of the error taxonomy proposed by Hoiem et al. (2012).

4.1 Evaluating Instance Segmentation Performance

We start with the original STS approach that yields top performance using the 20-dimensional learned shape representations and study the effects of the proposed changes that are introduced incrementally.

Training with Batch Normalisation:   As expected and as noted in the literature (Ioffe & Szegedy, 2015), batch normalisation improves the convergence speed as well as the performance at convergence, as can be seen in Figure 7. On the validation set, it gives a clear boost in performance (see Table 1). As mentioned before, the processing speed at the time of inference is not affected.

Augmenting Training Dataset:   Presenting the network with more aggressively augmented examples improves the model's generalisation. As seen in Figure 7, the model has sufficient capacity to fit the increased variability in the training examples, while at the same time generalising better to unseen validation samples. Building on the previous change of batch normalisation, data augmentation improves our model further (see Table 1).

Sharing Prediction Weights:   This change in architecture, from fully-connected to convolutional, introduces translation invariance into the model and reduces the total number of parameters. It has a small negative effect on accuracy, see Figure 8 (top).

Anchors as Bounding-box Priors:   This impacts how adjacent or possibly overlapping object instances are presented to the model. In the previous setup, the model was discouraged from predicting objects appearing close to each other in the image (by selecting only the ground truth bounding box with the highest IoU when multiple such boxes were centered in the same grid cell). In comparison, our approach provides three anchor boxes per grid cell location with which to match the ground truth instance hypotheses. This leads to a significant improvement in performance, as can be seen in the plots of Figure 8.

Distributing Computations Evenly:   These architectural changes not only reduce the parameter count and the total computational effort, but also result in a more robust model, offering an overall performance gain (refer to Figure 8, top).

IoU threshold 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 avg.
STS* 52.8 50.6 47.0 41.3 33.5 24.9 14.9 6.0 0.4 30.1
Darknet19 67.6 65.8 63.1 58.1 52.2 43.9 31.1 16.1 3.0 44.5
End-to-End 68.2 66.2 63.4 59.0 51.7 41.7 27.9 12.8 1.9 43.6
Large Decoder 67.8 65.9 63.1 58.7 52.3 43.8 31.9 17.3 3.9 45.0
Distance Transform (STS++) 68.2 66.3 63.6 59.4 53.2 45.2 32.3 16.9 3.4 45.4
Figure 8: Average precision estimates on the SBD validation set for the proposed models incorporating changes to the network architecture and target representations. (Top) Scores for models with weight sharing, anchor boxes and Darknet19 are compared with those for SDS (Hariharan et al., 2014), MNC (Dai et al., 2016a), Arnab & Torr (2017) and BAIS (Hayder et al., 2016). (Bottom) Distance-transform-based shape reconstruction offers a marginal advantage over both pre-trained and end-to-end trained decoders for reconstructing learned shape representations. Note: The model marked with a * has been retrained for this work.
STS (Jetley et al., 2017) | STS++ | SDS (Hariharan et al., 2014) | naive MNC (Li et al., 2016) | MNC (Dai et al., 2016a) | Arnab & Torr (2017) | BAIS (Hayder et al., 2016) | FCIS (Li et al., 2016)
(darknet) (alexnet) (resnet-101) (vgg-16) (resnet-101)
Learned Embedding (20) + + + + + +
Batch Normalisation + + + + + + + +
Data Augmentation + + + + + + +
Shared Predictor + + + + + +
Anchor Boxes + + + + +
Darknet19 + + + +
End-to-End + + +
Large Decoder + +
Distance Transform (STS++) +
mAP at 0.5 IoU: 34.6 38.6 42.3 41.4 48.5 52.2 51.7 52.3 53.2 49.7 59.1 63.5 62.0 65.7 65.7
mAP at 0.7 IoU: 15.0 17.4 20.8 20.5 26.7 31.1 27.9 31.9 32.3 25.3 36.0 41.5 44.8 48.3 52.1
mAP averaged over IoU thresholds 0.1 to 0.9: 31.5 34.3 37.0 36.1 41.4 44.5 43.6 45.0 45.4 41.4 - - 55.4 - -
runtime/frame (s) - - - - - - - -
frames/sec - - - - - - -
Table 1: Effects of the proposed changes on the instance segmentation performance, measured in terms of mean average precision on SBD val (Hariharan et al., 2011). The revised STS++ is compared against existing instance segmentation methods in terms of accuracy and processing speed (measured as both runtime/frame in sec. and frames/sec).

Training End-to-End:   Training the shape decoder in a unified end-to-end way simplifies the overall optimisation procedure. Moreover, learning the shape decoder in an independent optimisation step can lead to a solution that is sub-optimal overall. Thus, we first train the decoder from a random initialisation using the original architecture discussed in STS (Jetley et al., 2017). This leads to a degradation in performance, especially in the high IoU range (see Figure 8), which suggests that the model has difficulty in reproducing fine details of the instance shape masks. We then train a decoder with an increased learning capacity (i.e. an increased number of parameters), and call it the Large Decoder model. This model is able to recover from the drop in prediction quality noted at high IoU values, and surpasses the Darknet19 model with a pre-trained decoder by a small margin.

Distance Transform based Shape Encoding:   The alternative shape representation making use of quantised distance transform values offers a slight improvement in accuracy at low IoU values, as can be seen in Figure 8. However, the model still fails to predict the fine details of the object shape masks near the object boundaries, a flaw that manifests itself as low scores at high IoU values.

Table 1 traces the incremental growth in performance, over the full range of IoU values, for the above discussed series of changes to the network architecture. For a qualitative demonstration of the above results, refer to Figure 9.

Gnd. Truth STS STS++
Figure 9: Qualitative instance segmentation results: STS (column 2) predicts objects around the rear-view mirror (row 2) and misses the person in the car (row 4). Our proposed model (column 3) provides a more detailed prediction of shape masks for the motorcycle and the rider (row 1 and 2) and correctly delineates the person sitting inside the car (row 3 and 4).

4.2 Evaluating Object Detection Performance

During training, the target masks of object instances are always generated with respect to the ground truth bounding boxes rather than the predicted ones. This prevents the shape predictions from adjusting themselves to errors in the prediction of the bounding boxes at training time. Thus, the accuracy of the overall instance segmentation system rests on the quality of the object detections. We therefore proceed to analyse how the quality of the detector unit alone evolves as a function of the proposed changes. We also make a quantitative comparison with the STS model proposed by Jetley et al. (2017). We perform the evaluation on the Pascal VOC 2007 test set (Everingham et al., 2015) (which is disjoint from our training set) and use the error analysis methodology and toolkit (MATLAB code available at http://dhoiem.web.engr.illinois.edu/projects/detectionAnalysis/) developed by Hoiem et al. (2012).

Approach mean aeroplane bicycle bird boat bottle bus car cat chair cow
STS original 33.5 58.6 33.7 31.2 17.6 11.0 65.5 37.0 62.6 6.2 26.1
Darknet19 52.2 72.0 52.7 54.6 30.6 28.0 74.8 56.8 81.8 21.1 56.9
Large Decoder 52.3 73.0 52.5 55.1 34.8 24.9 75.3 55.3 82.0 20.7 54.6
Distance Transform (STS++) 53.2 72.9 53.6 58.5 32.4 25.8 74.9 56.5 81.8 22.8 55.0
SDS (Hariharan et al., 2014) 49.7 68.4 49.4 52.1 32.8 33.0 67.8 53.6 73.9 19.9 43.7
(Arnab & Torr, 2017) 62.0 80.3 52.8 68.5 47.4 39.5 79.1 61.5 87.0 28.1 68.3
Approach mean dining table dog horse motorbike person potted plant sheep sofa train tv monitor
STS original 33.5 13.5 49.7 31.0 37.9 37.7 7.0 31.2 20.0 62.7 29.2
Darknet19 52.2 26.0 71.4 61.6 62.4 54.4 19.4 53.2 36.2 76.1 54.5
Large Decoder 52.3 25.2 71.9 64.8 58.6 54.0 19.6 54.9 36.5 76.9 55.7
Distance Transform (STS++) 53.2 26.2 74.1 62.5 59.8 57.7 22.7 56.1 36.4 78.5 56.4
SDS (Hariharan et al., 2014) 49.7 25.7 60.6 55.9 58.9 56.7 28.5 55.6 32.1 64.7 60.0
Arnab & Torr (2017) 62.0 35.5 86.1 73.9 66.1 63.8 32.9 65.3 50.4 81.4 71.4
Table 2: Average precision scores for individual Pascal VOC categories on the SBD val set.
Figure 10: (Left) Five types of object detections on Pascal VOC 2007 test as per Hoiem et al. (2012); (Middle) Type-wise break-up of the total error for the different architectural modifications; (Right) Percentage contribution of each detection type to the total number of false positives.

Figure 10 defines the taxonomy of prediction errors for a detector unit and compares these errors over the different modifications studied in this work. As noted before, the most notable improvement comes from using anchor boxes instead of simply the grid cells for target representation during network training. This has the biggest impact on the quality of object localisation. Despite the improvement, localisation error still remains the main contributing factor to the total error. For details of the detection performance on distinct object categories see Tables 10 through 14 in Appendix A.5. Notice also that the number of background errors is strongly inversely correlated with the localisation error (Figure 10). This implies that as the model becomes more precise in localising objects, it also becomes more prone to detecting background regions as likely object instances. Anecdotally, it appears that the model is detecting true objects in the input images that are not actually annotated in the dataset, see Figure 12. This is, however, a drawback of the evaluation dataset rather than of the prediction model.

5 Discussion and Conclusion

In this work we tackle the problem of real-time instance segmentation. As shown in Figure 1, our revised STS++ model sets a new performance benchmark at real-time processing rates for the task of multiple object instance segmentation. The changes we implement substantially improve the overall model performance (see Table 1) and reduce the total number of parameters by more than half (cf. Tables 3 and 5) via the reuse of network parameters over different spatial locations. Tables 1 & 2 summarise the effects of the various atomic changes made to the pipeline, measured in terms of the mean and per-category average precision scores over the distinct object categories of the Pascal VOC dataset. The instance mask visualisations in Appendix A.4 demonstrate that the new model is better at localising instances and inferring the general body shapes. For more details, refer to Tables 13 and 14 in Appendix A.5.

Significant improvements have been made to the real-time instance segmentation solution proposed by Jetley et al. (2017), making it more attractive for practical use. Yet there remains an accuracy gap in comparison to the state of the art. We believe that the two main challenges in building more accurate models are as follows. Firstly, the models demonstrate limited accuracy even at low IoU thresholds, which indicates issues with correctly localising instance-level object bounding boxes. The analysis of detection errors on the Pascal VOC dataset confirms erroneous localisation as the biggest contributor to the total number of false positives. Moreover, under the current model, instance mask prediction is decoupled from bounding box estimation and has no mechanism to adjust or recover if there is an error in the prediction of the bounding box. Increasing the capacity to represent ground truth instances with different sizes and aspect ratios was demonstrated to yield a great improvement in the results; a further enhancement in the encoding of ground truth information should be sought. Secondly, the poor performance in the high IoU range indicates the model's inability to capture intricate details of object boundaries. This can be addressed by training pixel-level bottom-up segmentation models such as those making use of conditional random fields (CRFs). The instance segmentation masks obtained from the proposed model can be used as priors for an additional CRF-based boundary refinement stage. A similar approach was taken by Arnab & Torr (2016); however, bounding box priors were used instead of segmentation masks. Taking more precise priors in the form of instance masks could further benefit the CRF-based post-processing operation.

Appendix A Appendix

a.1 Architectures of Models

Tables 3 through 8 describe the different neural network architectures used in this project. Every feed-forward network is defined as a sequence of layers. Each layer is described by providing its type, number of output filters (feature maps), filter (kernel) size, spatial stride, spatial dimensions of the output, number of arithmetic operations performed and total number of learnable parameters. The layer types include the convolutional layer (CONV), transposed convolutional layer (TCONV), max-pooling layer (MAXPOOL) and spatial up-sampling layer (UPSAMPLE). All the convolutional layers are padded with zeros in order to maintain the spatial dimensions of the output feature maps. A single step of addition, multiplication or max comparison is considered to be a single arithmetic operation.

Type Filters Size Stride Output Ops, Params,
1: CONV 64 2 944 0.01
2: MAXPOOL 1 2 0 0.00
3: CONV 192 1 2,775 0.11
4: MAXPOOL 1 2 0 0.00
5: CONV 128 1 154 0.02
6: CONV 256 1 1,850 0.30
7: CONV 256 1 411 0.07
8: CONV 512 1 7,399 1.18
9: MAXPOOL 1 2 0 0.00
10: CONV 256 1 206 0.13
11: CONV 512 1 1,850 1.18
12: CONV 256 1 206 0.13
13: CONV 512 1 1,850 1.18
14: CONV 256 1 206 0.13
15: CONV 512 1 1,850 1.18
16: CONV 256 1 206 0.13
17: CONV 512 1 1,850 1.18
18: CONV 512 1 411 0.26
19: CONV 1024 1 7,399 4.72
20: MAXPOOL 1 2 0 0.00
21: CONV 512 1 206 0.52
22: CONV 1024 1 1,850 4.72
23: CONV 512 1 206 0.52
24: CONV 1024 1 1,850 4.72
25: CONV 1024 1 3,699 9.44
26: CONV 1024 2 925 9.44
27: CONV 1024 1 925 9.44
28: CONV 1024 1 925 9.44
29: CONV 4096 7 411 205.52
30: CONV 3430 1 28 14.05
Total: 40,586 279.7
Table 3: Details of the architecture of the original STS model. NOTE: Numbers of parameters and operations are given in millions.
Type Filters Size Stride Output Ops, Params,
1: CONV 64 2 944 0.01
2: MAXPOOL 1 2 0 0.00
3: CONV 192 1 2,775 0.11
4: MAXPOOL 1 2 0 0.00
5: CONV 128 1 154 0.02
6: CONV 256 1 1,850 0.30
7: CONV 256 1 411 0.07
8: CONV 512 1 7,399 1.18
9: MAXPOOL 1 2 0 0.00
10: CONV 256 1 206 0.13
11: CONV 512 1 1,850 1.18
12: CONV 256 1 206 0.13
13: CONV 512 1 1,850 1.18
14: CONV 256 1 206 0.13
15: CONV 512 1 1,850 1.18
16: CONV 256 1 206 0.13
17: CONV 512 1 1,850 1.18
18: CONV 512 1 411 0.26
19: CONV 1024 1 7,399 4.72
20: MAXPOOL 1 2 0 0.00
21: CONV 512 1 206 0.52
22: CONV 1024 1 1,850 4.72
23: CONV 512 1 206 0.52
24: CONV 1024 1 1,850 4.72
25: CONV 1024 1 3,699 9.44
26: CONV 1024 2 925 9.44
27: CONV 1024 1 925 9.44
28: CONV 1024 1 925 9.44
29: CONV 2048 1 1,850 18.88
30: CONV 2048 1 3,699 37.75
31: CONV 1024 1 206 2.10
32: CONV 135 1 14 0.14
Total: 45,915 119.0
Table 4: Details of the architecture of the STS model with a shared detection layer. NOTE: Numbers of parameters and operations are given in millions.
Type Filters Size Stride Output Ops, Params,
1: CONV 32 1 347 0.00
2: MAXPOOL 1 2 0 0.00
3: CONV 64 1 1,850 0.02
4: MAXPOOL 1 2 0 0.00
5: CONV 128 1 1,850 0.07
6: CONV 64 1 206 0.01
7: CONV 128 1 1,850 0.07
8: MAXPOOL 1 2 0 0.00
9: CONV 256 1 1,850 0.30
10: CONV 128 1 206 0.03
11: CONV 256 1 1,850 0.30
12: MAXPOOL 1 2 0 0.00
13: CONV 512 1 1,850 1.18
14: CONV 256 1 206 0.13
15: CONV 512 1 1,850 1.18
16: CONV 256 1 206 0.13
17: CONV 512 1 1,850 1.18
18: MAXPOOL 1 2 0 0.00
19: CONV 1024 1 1,850 4.72
20: CONV 512 1 206 0.52
21: CONV 1024 1 1,850 4.72
22: CONV 512 1 206 0.52
23: CONV 1024 1 1,850 4.72
24: CONV 2048 2 1,850 18.88
25: CONV 1024 1 206 2.10
26: CONV 2048 1 1,850 18.88
27: CONV 1024 1 206 2.10
28: CONV 2048 1 1,850 18.88
29: CONV 1024 1 206 2.10
30: CONV 2048 1 1,850 18.88
31: CONV 1024 1 206 2.10
32: CONV 70 1 7 0.07
Total: 30,155 103.8
Table 5: Details of the architecture of the modified Darknet19 model. NOTE: Numbers of parameters and operations are given in millions.
Type Filters Size Stride Output Ops, Params,
1: TCONV 100 1 64 32.10
2: UPSAMPLE 100 1 6 0.00
3: CONV 50 1 5,760 45.05
4: UPSAMPLE 50 1 13 0.00
5: CONV 20 1 4,608 9.02
6: UPSAMPLE 20 1 20 0.00
7: CONV 10 1 3,686 1.81
8: UPSAMPLE 10 1 41 0.00
9: CONV 1 1 737 0.09
Total: 14,936 88.1
Table 6: Details of the architecture of the learned shape decoder in the STS model. NOTE: Numbers of parameters and operations are given in thousands.
Type Filters Size Stride Output Ops, Params,
1: TCONV 1024 1 7 3.28
2: TCONV 256 2 118 2.36
3: CONV 192 1 107 0.44
4: TCONV 128 2 54 0.22
5: CONV 96 1 117 0.11
6: TCONV 64 2 59 0.06
7: CONV 1 1 3 0.00
Total: 463 6.5
Table 7: Details of the architecture of the proposed large shape decoder. NOTE: Numbers of parameters and operations are given in millions.
Type Filters Size Stride Output Ops, Params,
1: TCONV 1024 1 7 3.28
2: TCONV 256 2 118 2.36
3: CONV 192 1 107 0.44
4: TCONV 128 2 54 0.22
5: CONV 96 1 117 0.11
6: TCONV 64 2 59 0.06
7: CONV 8 1 20 0.00
8: DT 8 1 64 0.00
9: CONV 1 1 0 0.00
Total: 545 6.5
Table 8: Details of the architecture of the distance transform based neural shape decoder. NOTE: Numbers of parameters and operations are given in millions.

a.2 Experimental Setup

The original STS model makes use of the Darknet framework (Redmon, 2013–2016) as its training and inference workhorse (implemented in plain C). The vanilla version of Darknet is updated with shape prediction capabilities and additional software layers (in C++) for dataset loading and manipulation, shape mask reconstruction, and for evaluating the model on the task of instance segmentation. (The project source code can be found at https://github.com/torrvision/straighttoshapes.) We ran all our experiments on a single desktop machine with an Intel Core i7-4960X CPU (3.6 GHz, 6 cores) and an NVidia GeForce GTX Titan X GPU (12 GB RAM). The SBD dataset (Hariharan et al., 2011) is chosen as the experimental dataset to benchmark our models; it is divided into train and validation splits used for training and evaluation respectively. The performance accuracy is measured in terms of average precision scores computed at multiple IoU thresholds.

Due to memory constraints, the data is processed in the form of mini-batches of 8 training examples each. However, the parameters are only updated after accumulating gradients from 8 such mini-batches. In particular, batch-normalisation statistics are computed over 8 training examples, while gradient descent is performed over 64 such examples. The model is trained using Stochastic Gradient Descent (SGD) with momentum and weight decay. Leaky ReLU is the non-linearity used throughout our networks. The parameters in the initial layers of all the neural networks are borrowed from the pre-training of the Darknet model on the ImageNet dataset (Russakovsky et al., 2015) for the task of image classification. (The pre-trained Darknet model can be downloaded from https://pjreddie.com/darknet/yolo/.) The learning rate schedule for the training of the network on the task of instance segmentation is as follows:

Batch number Learning rate

The learning rate is kept low in the beginning in order to preserve the pre-trained weights. It is subsequently increased in order to speed up convergence and then reduced once more as we fine-tune the solution in the later stages of network training.
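The mini-batch accumulation described above can be sketched as follows (a generic PyTorch-style loop, not code from the Darknet implementation; the model is assumed to return a scalar loss given a batch of images and targets):

```python
def train_one_epoch(model, loader, optimizer, accum_steps=8):
    """SGD with gradient accumulation: mini-batches of 8 images are processed one
    at a time, but parameters are only updated once every `accum_steps` batches,
    giving an effective batch size of 64 while batch-norm statistics are still
    computed over 8 examples."""
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        loss = model(images, targets) / accum_steps  # scale so the sum is an average
        loss.backward()                              # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```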

a.3 Loss Function Design

The model is trained solely as a regression problem and the loss is evaluated depending on where the ground truth objects appear in the input image.

Given a cell $i$ containing an object, let $\hat{\mathbf{t}}_{i,j}$ denote the parameters of the $j$-th predicted bounding box at that cell location. We compute $\mathrm{IoU}_{i,j}$ as the IoU between the ground truth box and the predicted box, and assign $j^{*} = \arg\max_{j} \mathrm{IoU}_{i,j}$ as the index of the best prediction in cell $i$. Further, we define indicators $\mathbb{1}_{i,j}^{\mathrm{obj}}$ and $\mathbb{1}_{i}^{\mathrm{obj}}$ to capture the truth of the predicted box $j$ being the best fit in a cell that contains an object, and of the cell containing a ground truth object, respectively.
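A minimal sketch of the IoU computation and the selection of the responsible prediction in a cell is given below; boxes are taken as (x_center, y_center, w, h) tuples in a common coordinate frame, and the helper names are assumptions rather than project code:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (xc, yc, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def responsible_box(gt_box, predicted_boxes):
    """Return (j*, IoUs): index of the cell's best-overlapping prediction."""
    ious = [iou(gt_box, p) for p in predicted_boxes]
    best = max(range(len(ious)), key=lambda j: ious[j])
    return best, ious
```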

Figure 11: (Left) The grid cell (red) containing the center of the object (marked as a blue cross) is responsible for predicting the object's bounding box (green). The center coordinates are normalised relative to the grid cell, while the dimensions of the bounding box are normalised relative to the dimensions of the complete image. (Right) There are 3 anchor boxes of different aspect ratios per grid cell location. The illustration displays several of these boxes at different locations in the image. Each anchor box gets assigned the ground truth object that shares the highest IoU with the anchor. Clearly, different anchors specialise in predicting objects of different aspect ratios.

Then the different terms of the objective function can be defined as follows:

$$\mathcal{L}_{\mathrm{coord}} = \sum_{i} \sum_{j} \mathbb{1}_{i,j}^{\mathrm{obj}} \Big[ (x_i - \hat{x}_{i,j})^2 + (y_i - \hat{y}_{i,j})^2 + \big(\sqrt{w_i} - \sqrt{\hat{w}_{i,j}}\big)^2 + \big(\sqrt{h_i} - \sqrt{\hat{h}_{i,j}}\big)^2 \Big], \tag{7}$$

where $(x_i, y_i)$ describes the center coordinates of the ground truth bounding box (see Figure 11 and §3.2.2 for how the target values are constructed). Note that the model predicts the square roots of the bounding box dimensions, i.e. $\sqrt{\hat{w}_{i,j}}$ and $\sqrt{\hat{h}_{i,j}}$, for the reasons described in §3.1. The model is penalised for having high confidence predictions when (i) there is no ground truth object at the cell location, or (ii) its bounding box prediction is not the current best for that cell location:

$$\mathcal{L}_{\mathrm{noobj}} = \sum_{i} \sum_{j} \big(1 - \mathbb{1}_{i,j}^{\mathrm{obj}}\big)\, \hat{s}_{i,j}^{\,2}. \tag{8}$$

In contrast, when cell $i$ does contain a ground truth object, the confidence score of the best overlapping bounding box is penalised for deviating from the associated IoU as follows:

$$\mathcal{L}_{\mathrm{obj}} = \sum_{i} \sum_{j} \mathbb{1}_{i,j}^{\mathrm{obj}} \big(\mathrm{IoU}_{i,j} - \hat{s}_{i,j}\big)^2. \tag{9}$$

The conditional class probabilities are modelled independently from the bounding box predictions and also independently for each class (as binary random variables),

$$\mathcal{L}_{\mathrm{cls}} = \sum_{i} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c=1}^{C} \big(p_i(c) - \hat{p}_i(c)\big)^2. \tag{10}$$

All the loss terms defined up until this point are used for the task of detection and are exactly those used for training the YOLO detector. In order to address the task of segmentation, the STS model additionally regresses to the shape representation values as follows:

$$\mathcal{L}_{\mathrm{shape}} = \sum_{i} \sum_{j} \mathbb{1}_{i,j}^{\mathrm{obj}} \big\| \mathbf{e}_i - \hat{\mathbf{e}}_{i,j} \big\|_2^2, \tag{11}$$

where $\mathbf{e}_i$ denotes the target shape representation (see §3.2.2) and $\hat{\mathbf{e}}_{i,j}$ denotes the predicted shape representation.

The overall objective of the optimisation problem is then expressed as a weighted sum of the above terms (Eqs. 7–11):

$$\mathcal{L} = \lambda_{\mathrm{coord}} \mathcal{L}_{\mathrm{coord}} + \lambda_{\mathrm{noobj}} \mathcal{L}_{\mathrm{noobj}} + \lambda_{\mathrm{obj}} \mathcal{L}_{\mathrm{obj}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{shape}} \mathcal{L}_{\mathrm{shape}}, \tag{12}$$

where the weights $\lambda_{\mathrm{coord}}$, $\lambda_{\mathrm{noobj}}$, $\lambda_{\mathrm{obj}}$, $\lambda_{\mathrm{cls}}$ and $\lambda_{\mathrm{shape}}$ are chosen via cross-validation.

Remark: Note that the prediction of the object bounding box and shape masks is decoupled in the STS model. This is to say that whenever the former introduces a discrepancy in the bounding box location, the latter is not corrected or adjusted in any way. This is in contrast to other state-of-the-art instance segmentation models (Dai et al., 2015; He et al., 2017), where the target mask is adjusted to the predicted bounding box and thus the error therein.

a.4 More Qualitative Results

This section contains some example images from the SBD validation set and their corresponding instance segmentation results. The results are visualised in the following ways: the top row contains images overlaid with predicted instance masks, while the bottom row simply contains these masks on a black background. The segmentation masks (left to right) include those from the ground truth, the original STS model, the proposed Darknet19 model with a pre-trained shape decoder, and the proposed Darknet19 model trained end-to-end using distance transform based shape representations.

These examples, contained in Figures 12 to 17 demonstrate the instance segmentation capabilities and frequent drawbacks, for example, failure to model object confidence scores, detect particular instances, or predict precise object boundaries.



Ground Truth STS original STS (Darknet19) STS++
Figure 12: The original model confuses tall and narrow bottles with humans in a standing pose. Our proposed methods do a better job at delineating overlapping object instances. Note also that the STS++ model is able to segment out the pair of shoes in the background, labelling it as the person class. These objects are, however, not annotated in the dataset and the model is penalised for segmenting them.
Ground Truth STS original STS (Darknet19) STS++
Figure 13: The original STS model predicts two high confidence bounding boxes for the same ground truth object. Although our proposed models overcome this flaw, they still fail to detect the puppy in the hands of the person.
Ground Truth STS original STS (Darknet19) STS++
Figure 14: The proposed models only get the coarse location of the bicycle but do a poor job at capturing the pixel level details.
Ground Truth STS original STS (Darknet19) STS++
Figure 15: The models do better at delineating object boundaries, however, fail to correctly identify the objects as birds and confuse them with aeroplanes.
Ground Truth STS original STS (Darknet19) STS++
Figure 16: In some locations the models output more bounding boxes than there are ground truth objects, while in other places they miss the objects entirely.
Ground Truth STS original STS (Darknet19) STS++
Figure 17: More instance segmentation results.

a.5 Detailed Error Analysis

Approach | Prediction label (%): Corr Loc Sim Dissim Backgr | % of all FP error: Loc Sim Dissim Backgr
STS original 58.3 16.6 12.5 5.9 6.7 39.9 29.9 14.1 16.1
Batch Normalisation 63.3 17.8 8.4 4.2 6.2 48.4 23.0 11.6 17.0
Data Augmentation 66.2 16.3 8.0 4.0 5.5 48.2 23.7 11.9 16.2
Shared Predictor 64.3 16.3 8.6 4.9 5.9 45.8 24.0 13.8 16.5
Anchor Boxes 69.9 10.1 8.8 3.9 7.3 33.6 29.1 13.1 24.2
Darknet19 73.0 9.4 7.8 2.6 7.1 35.0 29.1 9.6 26.2
End-to-End 73.6 9.2 7.3 3.1 6.8 35.0 27.7 11.6 25.6
Large Decoder 73.1 9.4 7.2 3.3 7.1 34.7 26.6 12.3 26.3
Distance Transform (STS++) 73.7 9.0 7.1 3.0 7.1 34.3 27.0 11.5 27.2
Table 9: Detection errors of our proposed models on Pascal VOC 2007 test dataset. The errors are grouped as per the methodology presented in Hoiem et al. (2012).
STS original Overall aeroplane bicycle bird boat bottle bus car cat chair cow
TP: correct 58.3 59.2 61.7 54.5 57.0 39.6 61.0 65.9 66.2 58.8 63.2
FP: localisation 16.6 23.8 15.9 20.5 23.7 19.0 7.9 19.1 7.6 14.0 15.2
FP: similar 12.5 11.6 12.3 10.4 7.9 0.0 24.0 7.1 23.8 7.5 19.5
FP: dissimilar 5.9 1.0 8.7 4.0 2.5 20.9 0.8 2.2 0.8 9.0 0.0
FP: background 6.7 4.5 1.3 10.6 8.9 20.5 6.3 5.6 1.6 10.6 2.1
Overall dining table dog horse motorbike person potted plant sheep sofa train tv monitor
TP: correct 58.3 53.5 62.3 60.8 59.3 56.6 37.0 61.4 68.7 65.9 53.7
FP: localisation 16.6 15.1 6.6 13.7 14.1 36.1 28.5 15.4 7.3 12.6 16.3
FP: similar 12.5 10.4 29.6 22.0 17.1 1.9 0.0 22.2 7.8 14.2 0.0
FP: dissimilar 5.9 12.4 0.2 1.8 8.7 2.5 14.5 0.0 12.6 0.7 14.4
FP: background 6.7 8.7 1.3 1.8 0.8 2.9 19.9 1.0 3.5 6.6 15.5
Table 10: Category-level detection errors of the original STS model on Pascal VOC 2007 test dataset.
Darknet19 Overall aeroplane bicycle bird boat bottle bus car cat chair cow
TP: correct 73.0 70.1 71.0 73.1 68.2 49.8 76.4 79.4 78.9 72.9 79.3
FP: localisation 9.4 11.9 13.6 14.6 17.3 17.2 3.9 9.3 6.2 9.9 5.8
FP: similar 7.8 7.7 9.0 7.5 5.6 0.0 12.2 3.4 14.6 4.4 14.0
FP: dissimilar 2.6 1.9 4.1 0.3 0.8 8.1 0.4 1.0 0.0 3.5 0.0
FP: background 7.1 8.4 2.3 4.5 8.1 25.0 7.1 6.7 0.3 9.3 0.9
Overall dining table dog horse motorbike person potted plant sheep sofa train tv monitor
TP: correct 73.0 77.6 75.7 79.2 75.6 71.4 49.0 75.9 84.6 80.8 71.7
FP: localisation 9.4 3.0 5.1 5.6 7.6 20.6 20.1 4.5 1.3 5.6 5.8
FP: similar 7.8 3.7 17.9 14.4 8.9 1.4 0.0 16.7 6.6 8.9 0.0
FP: dissimilar 2.6 8.4 0.0 0.3 3.8 1.6 6.2 0.3 4.8 0.3 6.1
FP: background 7.1 7.4 1.3 0.5 4.1 5.0 24.7 2.6 2.8 4.3 16.3
Table 11: Category-level detection errors of our proposed Darknet19 model on Pascal VOC 2007 test dataset.
Large Decoder Overall aeroplane bicycle bird boat bottle bus car cat chair cow
TP: correct 73.1 70.7 74.0 70.7 67.7 50.7 78.3 79.4 80.0 73.2 77.5
FP: localisation 9.4 14.1 10.0 16.0 15.8 16.9 3.1 10.3 7.0 9.6 4.9
FP: similar 7.2 6.4 6.7 8.2 4.1 0.0 9.1 2.7 12.2 4.3 16.1
FP: dissimilar 3.3 2.6 4.4 0.5 1.8 9.7 1.6 0.9 0.5 4.5 0.0
FP: background 7.1 6.1 4.9 4.7 10.7 22.7 7.9 6.6 0.3 8.4 1.5
Overall dining table dog horse motorbike person potted plant sheep sofa train tv monitor
TP: correct 73.1 78.9 77.5 78.2 74.0 71.5 49.3 75.9 83.8 77.5 72.0
FP: localisation 9.4 3.3 4.5 6.3 8.7 21.1 15.7 6.8 2.0 6.3 4.7
FP: similar 7.2 3.0 16.2 14.7 8.9 1.5 0.0 14.8 5.3 9.3 0.0
FP: dissimilar 3.3 7.0 0.2 0.0 6.5 1.4 10.8 0.3 6.8 0.7 6.1
FP: background 7.1 7.7 1.5 0.8 1.9 4.5 24.2 2.3 2.0 6.3 17.2
Table 12: Category-level detection errors of our proposed Large Decoder model on Pascal VOC 2007 test dataset.
Approach mean aeroplane bicycle bird boat bottle bus car cat chair cow
STS original 33.5 58.6 33.7 31.2 17.6 11.0 65.5 37.0 62.6 6.2 26.1
Batch Normalisation 38.6 58.9 39.3 38.2 18.6 15.1 67.9 41.4 70.6 7.6 34.6
Data Augmentation 42.3 64.7 38.4 41.0 23.6 14.5 71.0 42.4 74.3 8.7 44.4
Shared Predictor 41.4 63.0 42.2 41.9 22.4 14.8 69.3 39.8 73.9 8.1 40.3
Anchors 48.5 69.6 47.0 51.9 29.9 22.9 71.9 50.5 77.5 14.9 49.3
Darknet19 52.2 72.0 52.7 54.6 30.6 28.0 74.8 56.8 81.8 21.1 56.9
End-to-End 51.7 65.3 52.6 54.0 31.3 26.5 77.7 55.3 80.7 19.0 55.4
Large Decoder 52.3 73.0 52.5 55.1 34.8 24.9 75.3 55.3 82.0 20.7 54.6
Distance Transform (STS++) 53.2 72.9 53.6 58.5 32.4 25.8 74.9 56.5 81.8 22.8 55.0
SDS (Hariharan et al., 2014) 49.7 68.4 49.4 52.1 32.8 33.0 67.8 53.6 73.9 19.9 43.7
Arnab & Torr (2017) 62.0 80.3 52.8 68.5 47.4 39.5 79.1 61.5 87.0 28.1 68.3
Approach mean dining table dog horse motorbike person potted plant sheep sofa train tv monitor
STS original 33.5 13.5 49.7 31.0 37.9 37.7 7.0 31.2 20.0 62.7 29.2
Batch Normalisation 38.6 11.9 58.6 45.5 45.6 36.4 14.1 34.4 23.9 69.8 39.8
Data Augmentation 42.3 15.9 64.1 47.1 53.1 40.9 15.7 44.1 27.0 72.0 43.1
Shared Predictor 41.4 19.3 60.4 50.3 51.2 37.5 13.4 41.0 28.1 69.0 41.9
Anchors 48.5 24.2 68.4 55.0 59.7 49.7 18.6 53.7 31.1 74.5 50.5
Darknet19 52.2 26.0 71.4 61.6 62.4 54.4 19.4 53.2 36.2 76.1 54.5
End-to-End 51.7 25.1 72.9 57.9 58.6 51.9 22.5 56.6 37.5 77.9 56.2
Large Decoder 52.3 25.2 71.9 64.8 58.6 54.0 19.6 54.9 36.5 76.9 55.7
Distance Transform (STS++) 53.2 26.2 74.1 62.5 59.8 57.7 22.7 56.1 36.4 78.5 56.4
SDS (Hariharan et al., 2014) 49.7 25.7 60.6 55.9 58.9 56.7 28.5 55.6 32.1 64.7 60.0
Arnab & Torr (2017) 62.0 35.5 86.1 73.9 66.1 63.8 32.9 65.3 50.4 81.4 71.4
Table 13: Average precision estimates (at an IoU threshold of 0.5) for the task of detection on the SBD validation set for all the proposed methods compared with state-of-the-art models.
Approach mean aeroplane bicycle bird boat bottle bus car cat chair cow
STS original 14.9 20.2 10.9 7.8 5.2 5.5 52.3 20.8 41.3 0.6 7.5
Batch Normalisation 17.4 22.4 13.5 12.3 5.3 6.0 55.5 21.6 46.8 0.4 11.4
Data Augmentation 20.8 27.9 11.9 14.2 8.8 6.9 57.3 25.5 53.4 0.8 15.9
Shared Predictor 20.5 28.5 15.2 15.0 9.6 7.0 55.5 24.1 51.1 0.8 17.9
Anchors 26.7 30.2 19.1 18.8 10.9 11.4 65.0 35.9 61.0 2.2 24.8
Darknet19 31.1 40.3 24.9 22.7 11.6 14.5 66.7 39.5 67.2 5.9 26.3
End-to-End 27.9 23.4 19.3 18.6 13.8 13.4 70.3 39.8 60.4 3.2 23.3
Large Decoder 31.9 42.3 22.9 24.5 18.1 15.2 67.5 40.3 68.8 5.8 31.1
Distance Transform (STS++) 32.3 39.1 23.8 24.9 15.8 12.8 65.4 41.3 66.8 5.6 31.3
Arnab & Torr (2017) 44.8 69.0 27.4 52.7 26.4 22.4 70.3 46.0 74.7 9.6 46.8
Approach Overall dining table dog horse motorbike person potted plant sheep sofa train tv monitor
STS original 14.9 2.0 23.7 3.9 13.7 8.9 1.4 5.5 9.7 45.8 11.0
Batch Normalisation 17.4 2.1 24.9 7.4 15.9 8.4 2.2 9.0 12.4 55.6 14.6
Data Augmentation 20.8 3.3 33.6 10.4 21.8 11.2 3.7 11.9 15.2 57.8 23.7
Shared Predictor 20.5 5.9 33.2 12.0 22.6 11.5 3.0 10.0 13.9 54.3 19.4
Anchors 26.7 7.9 42.5 17.9 26.7 17.9 4.3 19.9 19.2 63.8 33.8
Darknet19 31.1 10.5 48.6 24.3 35.1 23.7 4.6 26.8 24.5 64.8 40.5
End-to-End 27.9 11.7 39.6 14.5 25.0 18.8 6.6 24.0 29.1 61.6 40.6
Large Decoder 31.9 8.5 48.8 22.1 35.8 24.7 5.7 25.8 25.4 63.1 41.9
Distance Transform (STS++) 32.3 13.2 50.8 22.6 35.7 27.4 6.4 29.0 26.8 65.4 41.0
Arnab & Torr (2017) 44.8 16.9 71.6 48.4 46.3 40.3 14.8 47.6 36.5 69.7 58.2
Table 14: Average precision (AP) estimates for the task of detection on the SBD validation set for all proposed methods, compared with state-of-the-art models.
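The entries in Tables 13 and 14 are per-category average precision values, with the overall mAP in the first numeric column. For reference, the sketch below computes PASCAL VOC-style, all-point-interpolated average precision for one category from a ranked list of detections; it is a simplified stand-in for the actual evaluation code (for instance, it assumes true-positive flags have already been computed and ignores the handling of 'difficult' ground-truth instances).

```python
# A minimal sketch of PASCAL VOC-style (all-point-interpolated) average precision for
# one category; a simplified stand-in for the evaluation behind Tables 13-14.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: confidence per detection; is_tp: 1 if the detection matches an unmatched
    ground truth above the IoU threshold, else 0; num_gt: ground-truth instances."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)

    # Make precision monotonically non-increasing, then integrate it over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0] if precision.size else 0.0], precision))
    return float(np.sum(np.diff(recall) * precision[1:]))

# Usage: three detections, two of which are true positives, against two ground truths.
print(average_precision([0.9, 0.8, 0.3], [1, 0, 1], num_gt=2))  # 0.833...
```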

References

  • Arbeláez et al. (2014) Pablo Andrés Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marqués, and Jitendra Malik. Multiscale combinatorial grouping. In Proceedings of IEEE International conference on Computer Vision and Pattern Recognition (CVPR), pp. 328–335, 2014.
  • Arnab & Torr (2016) Anurag Arnab and Philip H. S. Torr. Bottom-up instance segmentation using deep higher-order CRFs. In Proceedings of British Machine Vision Conference (BMVC). BMVA Press, 2016.
  • Arnab & Torr (2017) Anurag Arnab and Philip H. S. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In Proceedings of IEEE International conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Borgefors (1986) Gunilla Borgefors. Distance transformations in digital images. Computer Vision, Graphics, and Image Processing, 34(3):344–371, 1986.
  • Ciresan et al. (2010) Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010. URL http://arxiv.org/abs/1003.0358.
  • Dai et al. (2015) Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. CoRR, abs/1512.04412, 2015.
  • Dai et al. (2016a) Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, and Jian Sun. Instance-sensitive fully convolutional networks. In Proceedings of European Conference on Computer Vision (ECCV), pp. 534–549. Springer, 2016a.
  • Dai et al. (2016b) Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of International conference on Neural Information Processing Systems (NIPS), pp. 379–387. Curran Associates Inc., 2016b.
  • Everingham et al. (2015) Mark Everingham, S. M. Ali Eslami, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1):98–136, 2015.
  • Girshick et al. (2014) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE International conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587, 2014.
  • Hariharan et al. (2011) Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2011.
  • Hariharan et al. (2014) Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In Proceedings of European Conference on Computer Vision (ECCV), 2014.
  • Hayder et al. (2016) Zeeshan Hayder, Xuming He, and Mathieu Salzmann. Boundary-aware instance segmentation. CoRR, abs/1612.03129, 2016. URL http://arxiv.org/abs/1612.03129.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE International conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017. URL http://arxiv.org/abs/1703.06870.
  • Hicks et al. (2013) Stephen L. Hicks, Iain Wilson, Louwai Muhammed, John Worsfold, Susan M. Downes, and Christopher Kennard. A depth-based head-mounted visual display to aid navigation in partially sighted individuals. PLOS ONE, 8(7):1–8, 07 2013.
  • Hoiem et al. (2012) Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In Proceedings of European Conference on Computer Vision (ECCV), pp. 340–353, 2012.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
  • Jetley et al. (2017) Saumya Jetley, Michael Sapienza, Stuart Golodetz, and Philip H. S. Torr. Straight to shapes: Real-time detection of encoded shapes. In Proceedings of IEEE International conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of International conference on Neural Information Processing Systems (NIPS), pp. 1097–1105. Curran Associates, Inc., 2012.
  • Li et al. (2016) Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. CoRR, abs/1611.07709, 2016. URL http://arxiv.org/abs/1611.07709.
  • Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In Proceedings of International Conference on Learning Representations (ICLR), 2013.
  • Liu et al. (2015) Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. URL http://arxiv.org/abs/1512.02325.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE International conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440, 2015.
  • Neubeck & Gool (2006) Alexander Neubeck and Luc J. Van Gool. Efficient non-maximum suppression. In Proceedings of International Conference on Pattern Recognition (ICPR), pp. 850–855, 2006.
  • Redmon (2013–2016) Joseph Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.
  • Redmon & Farhadi (2016) Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016. URL http://arxiv.org/abs/1612.08242.
  • Redmon et al. (2016) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of IEEE International conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of International conference on Neural Information Processing Systems (NIPS), pp. 91–99, 2015.
  • Rünz & Agapito (2018) Martin Rünz and Lourdes Agapito. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. CoRR, abs/1804.09194, 2018.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of International Conference on Learning Representations (ICLR), 2015.
  • Szegedy et al. (2015) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567.
  • Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of International Conference on Machine learning (ICML), pp. 1096–1103. ACM, 2008.
  • Vinyals et al. (2014) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014. URL http://arxiv.org/abs/1411.4555.
  • Wong et al. (2016) Sebastien C. Wong, Adam Gatt, Victor Stamatescu, and Mark D. McDonnell. Understanding data augmentation for classification: when to warp? CoRR, abs/1609.08764, 2016. URL http://arxiv.org/abs/1609.08764.
  • Zheng et al. (2015) Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1529–1537, 2015.