Global Deconvolutional Networks for Semantic Segmentation
Semantic image segmentation is a principal problem in computer vision, where the aim is to correctly classify each individual pixel of an image into a semantic label. Its widespread use in many areas, including medical imaging and autonomous driving, has fostered extensive research in recent years. Empirical improvements in tackling this task have primarily been motivated by successful exploitation of Convolutional Neural Networks (CNNs) pre-trained for image classification and object recognition. However, pixel-wise labelling with CNNs has its own unique challenges: (1) an accurate deconvolution, or upsampling, of low-resolution output into a higher-resolution segmentation mask and (2) an inclusion of global information, or context, within locally extracted features. To address these issues, we propose a novel architecture to conduct the equivalent of the deconvolution operation globally and acquire dense predictions. We demonstrate that it leads to improved performance of state-of-the-art semantic segmentation models on the PASCAL VOC 2012 benchmark, reaching 74.0% mean IU accuracy on the test set.
Ulsan National Institute of Science and Technology
50 UNIST, Ulju, Ulsan, 44919 Korea
1 Introduction
Convolutional Neural Networks [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] have become an essential part of deep learning models [LeCun et al.(2015)LeCun, Bengio, and Hinton] designed to tackle a wide range of computer vision tasks including image classification and recognition [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Sermanet et al.(2013)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun, Simonyan and Zisserman(2014), Szegedy et al.(2014)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich, Zeiler and Fergus(2014)], image captioning [Karpathy and Li(2015), Xu et al.(2015)Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, and Bengio, Vinyals et al.(2015)Vinyals, Toshev, Bengio, and Erhan], and object detection [Girshick et al.()Girshick, Donahue, Darrell, and Malik, Girshick(2015), Ren et al.(2015)Ren, He, Girshick, and Sun]. Recent advances in computing technologies with efficient utilisation of Graphics Processing Units (GPUs), as well as the availability of large-scale datasets [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Li, Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick], have been among the primary factors behind the rapid rise in CNN popularity.
An adaptation of convolutional network models [Long et al.(2015)Long, Shelhamer, and Darrell], pre-trained for the image classification task, has fostered extensive research on the exploitation of CNNs in semantic image segmentation, the problem of marking (or classifying) each pixel of the image with one of the given semantic labels. Among important applications of this problem are road scene understanding [Álvarez et al.(2012)Álvarez, Gevers, LeCun, and López, Badrinarayanan et al.(2015)Badrinarayanan, Handa, and Cipolla, Sturgess et al.(2009)Sturgess, Alahari, Ladicky, and Torr], biomedical imaging [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox, Ciresan et al.(2012)Ciresan, Giusti, Gambardella, and Schmidhuber], and aerial imaging [Kluckner et al.(2009)Kluckner, Mauthner, Roth, and Bischof, Mnih and Hinton(2010)].
Recent breakthrough methods in the area have efficiently and effectively combined neural networks with probabilistic graphical models, such as Conditional Random Fields (CRFs) [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr] and Markov Random Fields (MRFs) [Liu et al.(2015)Liu, Li, Luo, Loy, and Tang]. These approaches usually refine per-pixel features extracted by CNNs (so-called ‘unary potentials’) with the help of pairwise similarities between the pixels based on location and colour features, followed by an approximate inference of the obtained fully connected graphical model [Krähenbühl and Koltun(2011)].
In this work, we address two main challenges of current CNN-based semantic segmentation methods [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Long et al.(2015)Long, Shelhamer, and Darrell]: an effective deconvolution, or upsampling, of low-resolution output from a neural network; and an inclusion of global information, or context, into existing models without relying on graphical models. Our contribution is twofold: i) we propose a novel approach for performing the deconvolution operation of the encoded signal and ii) demonstrate that this new architecture, called ‘Global Deconvolutional Network’, achieves close to the state-of-the-art performance on semantic segmentation with a simpler architecture and significantly lower number of parameters in comparison to the existing models [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Noh et al.(2015)Noh, Hong, and Han, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang].
The rest of the paper is structured as follows. We briefly explore recent common practices of semantic segmentation models in Section 2. Section 3 presents our approach designed to overcome the issues outlined above. Section 4 describes the experimental part, including the evaluation results of the proposed method on the popular PASCAL VOC dataset. Finally, Section 5 contains conclusions.
2 Related Work
Exploitation of fully convolutional neural networks has become ubiquitous in semantic image segmentation ever since the publication of the paper by Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell]. Further research has been concerned with the combination of CNNs and probabilistic graphical models [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr], training in the presence of weakly-labelled data [Hong et al.(2015)Hong, Noh, and Han, Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille, Russakovsky et al.(2015a)Russakovsky, Bearman, Ferrari, and Li], and learning an additional (deconvolutional) network [Noh et al.(2015)Noh, Hong, and Han].
The problem of incorporating contextual information has been an active research topic in computer vision [Rabinovich et al.(2007)Rabinovich, Vedaldi, Galleguillos, Wiewiora, and Belongie, Heitz and Koller(2008), Divvala et al.(2009)Divvala, Hoiem, Hays, Efros, and Hebert, Doersch et al.(2014)Doersch, Gupta, and Efros, Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille]. To some extent, probabilistic graphical models address this issue in semantic segmentation and can be either a) used as a separate post-processing step [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] or b) trained end-to-end with CNNs [Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr]. In setting a), graphical models are unable to refine the parameters of the CNN, and thus the errors from the CNN will essentially still be present during post-processing. On the other hand, in b) one needs to carefully design the inference part in terms of usual neural network operations, and it still relies on computing high-dimensional Gaussian kernels [Adams et al.(2010)Adams, Baek, and Davis]. Besides that, Yu and Koltun [Yu and Koltun(2015)] have recently shown that dilated convolution filters are generally applicable and also make it possible to increase the contextual capacity of the network.
In terms of improving the deconvolutional part of the network for dense predictions, there has been a prevalence of using information from lower layers: the so-called ‘Skip Architecture’ [Long et al.(2015)Long, Shelhamer, and Darrell, Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] and Multi-scale [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] are two notable examples. Noh et al. [Noh et al.(2015)Noh, Hong, and Han] proposed to train a separate deconvolutional network to effectively decode information from the original fully convolutional model. While these methods have given better results, all of them contain significantly more parameters than the corresponding baseline models.
In turn, we propose another approach, called ‘Global Deconvolutional Network’, which includes a global interpolation block with an additional recognition loss, and gives better results than multi-scale and ‘skip’ variants. The depiction of our architecture is presented in Figure 1.
3 Global Deconvolutional Network
In this section, we describe our approach intended to boost the performance of deep learning models on the semantic segmentation task.
3.1 Baseline Models
As baseline models, we choose two publicly available deep CNN models: FCN-32s (https://github.com/BVLC/caffe/wiki/Model-Zoo) [Long et al.(2015)Long, Shelhamer, and Darrell] and DeepLab (https://bitbucket.org/deeplab/deeplab-public/) [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. Both of them are based on the VGG 16-layer net [Simonyan and Zisserman(2014)] from the ILSVRC-2014 competition [Russakovsky et al.(2015b)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Li]. This network contains 16 weight layers, including two fully-connected ones, and can be represented as hierarchical stacks of convolutional layers with rectified linear unit non-linearity [Glorot et al.(2011)Glorot, Bordes, and Bengio] followed by pooling operations after each stack. The output of the fully-connected layers is fed into a softmax classifier.
For semantic segmentation, where one needs to acquire dense predictions, the fully-connected layers have been replaced by convolution filters followed by a learnable deconvolution or fixed bilinear interpolation to match the original spatial dimensions of the image. The pixel-wise softmax loss represents the objective function.
3.2 Global Interpolation
The output of multiple blocks of convolutional and pooling layers is an encoded image with severely reduced dimensions. To predict the segmentation mask of the same resolution as the original image, one needs to simultaneously decode and upsample this coarse output. A natural approach is to perform an interpolation. In this work, instead of applying conventional local methods, we devise a learnable global interpolation.
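We denote the decoded information of the RGB image $I \in \mathbb{R}^{3 \times H \times W}$ as $x \in \mathbb{R}^{C \times h \times w}$, where $C$ represents the number of channels, and $h$ and $w$ define the reduced height and width, respectively. To acquire $y \in \mathbb{R}^{C \times H \times W}$, an upsampled signal, we apply the following formula:
$$y_c = K_h \, x_c \, K_w^{T}, \quad \forall c \in \{1, \dots, C\}, \tag{1}$$
where the matrices $K_h \in \mathbb{R}^{H \times h}$ and $K_w \in \mathbb{R}^{W \times w}$ interpolate each feature map of $x$ along the corresponding spatial dimension. In contrast to simple bilinear interpolation, which operates only on the closest four points, the equation above allows every point on the rectangular grid to contribute to each output value. An illustrative example can be seen in Figure 2.
Note that this operation is differentiable, and during the backpropagation algorithm [Rumelhart et al.(1986)Rumelhart, Hinton, and Williams] the derivatives of the pixelwise cross-entropy loss function $L$ with respect to the input and parameters can be found as follows:
$$\frac{\partial L}{\partial x_c} = K_h^{T} \frac{\partial L}{\partial y_c} K_w, \qquad \frac{\partial L}{\partial K_h} = \sum_{c} \frac{\partial L}{\partial y_c} K_w \, x_c^{T}, \qquad \frac{\partial L}{\partial K_w} = \sum_{c} \left( \frac{\partial L}{\partial y_c} \right)^{T} K_h \, x_c. \tag{2}$$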
We call the operation performed by Equation (1) ‘global deconvolution’. We only use this term to underline the fact that we mimic the behaviour of standard deconvolution using a global function; note that our method is not the inverse of the convolution operation and therefore does not represent deconvolution in the strictest sense as, for example, in [Zeiler et al.(2010)Zeiler, Krishnan, Taylor, and Fergus].
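As a rough illustration of the global interpolation above, the following sketch applies one matrix along the height and one along the width of every channel. Plain Python lists are used only to keep the example dependency-free; the helper names (`matmul`, `transpose`, `global_interpolation`, `linear_interp_matrix`) are hypothetical and not taken from the authors' Caffe implementation. The matrices are filled here with fixed linear-interpolation weights purely as an example; in the proposed block all their entries would be learned.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def global_interpolation(x, k_h, k_w):
    """Upsample every channel x_c (h x w) to K_h x_c K_w^T (H x W)."""
    k_w_t = transpose(k_w)
    return [matmul(matmul(k_h, x_c), k_w_t) for x_c in x]


def linear_interp_matrix(out_size, in_size):
    """Fixed 1-D linear interpolation weights (assumes both sizes are >= 2).

    Used only as an example value for K; each row has at most two non-zeros.
    """
    k = [[0.0] * in_size for _ in range(out_size)]
    for i in range(out_size):
        pos = i * (in_size - 1) / (out_size - 1)
        lo = min(int(pos), in_size - 2)
        frac = pos - lo
        k[i][lo] = 1.0 - frac
        k[i][lo + 1] = frac
    return k


x = [[[0.0, 2.0], [4.0, 6.0]]]        # one channel with h = w = 2
k_h = linear_interp_matrix(3, 2)      # H = 3
k_w = linear_interp_matrix(3, 2)      # W = 3
y = global_interpolation(x, k_h, k_w)
```

With these fixed weights the product reduces to ordinary bilinear upsampling, which is one way to see that the learnable version is a strict generalisation: nothing restricts a learned row to two non-zero entries.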
3.3 Multi-task loss
It is not uncommon to force intermediate layers of deep learning networks to preserve meaningful and discriminative representations. For example, Szegedy et al. [Szegedy et al.(2014)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] appended several auxiliary classifiers to the middle blocks of their model.
As semantic image segmentation essentially comprises image classification as one of its sub-tasks, we append an additional objective function on the top of the coarse output to further improve the model performance on the particular task of classification (Figure 1). This supplementary block consists of fully-connected layers, with the length of the last one being equal to the pre-defined number of possible labels (excluding the background). As there are usually multiple instances of the same label presented in the image, we do not explicitly encode the quantitative components and only denote the presence of a particular class or its absence. The scores from the last layer are transformed with the sigmoid function followed by the multinomial cross entropy loss.
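The loss functions are defined as follows:
$$\mathcal{L}_s = -\frac{1}{|P|} \sum_{i \in P} \log p_{i, G_i}, \qquad \mathcal{L}_l = -\frac{1}{C} \sum_{j=1}^{C} \left[ l_j \log \hat{p}_j + (1 - l_j) \log (1 - \hat{p}_j) \right], \qquad \mathcal{L} = \mathcal{L}_s + \mathcal{L}_l,$$
where $\mathcal{L}_l$ is the multi-label classification loss; $\mathcal{L}_s$ is the pixelwise cross-entropy loss; $P$ is the set of pixels; $G$ is a ground truth map; $C$ is the number of possible labels; $l$ is a ground truth binary vector of length $C$; $p_{i,j}$ is the softmax probability of pixel $i$ being assigned to class $j$; $l_j \in \{0, 1\}$ indicates the presence of class $j$ or its absence; $\hat{p}_j$ corresponds to the predicted probability of class $j$ being present in the image. Note that it is possible to use a weighted sum of the two losses depending on which task’s performance we want to optimise.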
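The two objectives can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, label 0 is assumed to be the background, and averaging over pixels and over labels is an assumed normalisation. The `weight` argument corresponds to the weighted sum mentioned above.

```python
import math


def pixelwise_cross_entropy(probs, gt_map):
    """Mean negative log-probability of the true class over all pixels.

    probs[i][j] is the softmax probability of pixel i for class j and
    gt_map[i] is the ground-truth class index of pixel i.
    """
    return -sum(math.log(probs[i][g]) for i, g in enumerate(gt_map)) / len(gt_map)


def presence_vector(gt_map, num_labels):
    """Binary vector marking which of the labels 1..num_labels occur at least once."""
    present = set(gt_map)
    return [1 if j in present else 0 for j in range(1, num_labels + 1)]


def multilabel_loss(scores, labels):
    """Sigmoid cross-entropy between raw class scores and presence labels."""
    total = 0.0
    for s, l in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total -= l * math.log(p) + (1 - l) * math.log(1.0 - p)
    return total / len(labels)


def total_loss(probs, gt_map, scores, num_labels, weight=1.0):
    """Segmentation loss plus a weighted multi-label classification loss."""
    labels = presence_vector(gt_map, num_labels)
    return pixelwise_cross_entropy(probs, gt_map) + weight * multilabel_loss(scores, labels)


# Toy example: three pixels, background plus two labels.
gt_map = [0, 2, 2]
probs = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5], [0.25, 0.25, 0.5]]
scores = [0.0, 0.0]   # raw outputs of the classification branch
loss = total_loss(probs, gt_map, scores, num_labels=2)
```

Only the presence of a class is encoded, so an image with several instances of label 2 yields the same target vector as an image with a single one.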
Overall, each component of the proposed approach aims to capture global information and incorporate it into the network, hence the name global deconvolutional network. Besides that, the proposed interpolation also effectively upsamples the coarse output and a nonlinear upsampling can be achieved with the addition of an activation function on the top of the block. The complete architecture of our approach is presented in Figure 1.
4 Experiments
4.1 Implementation details
We have implemented the proposed methods using Caffe [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell], the popular deep learning framework. Our training procedure follows the practice of the corresponding baseline models: DeepLab [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] and FCN [Long et al.(2015)Long, Shelhamer, and Darrell]. Both of them employ the VGG-16 net pre-trained on ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Li].
We use Stochastic Gradient Descent (SGD) with momentum and train with a minibatch size of 20 images. We start the training process with the learning rate equal to and divide it by when the validation accuracy stops improving. We use momentum of and weight decay of . We initialise all additional layers randomly as in [Glorot and Bengio(2010)] and fine-tune them by backpropagation with a lower learning rate before finally training the whole network.
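The rule of dividing the learning rate once the validation accuracy stops improving could be sketched as below. The function name, the `patience` parameter, and every numeric value in the example are hypothetical placeholders; they are not the settings used in the experiments, and the text does not state whether the rate was lowered automatically or by hand.

```python
def plateau_schedule(val_scores, base_lr, factor, patience=1):
    """Return the learning rate used in each epoch.

    The rate is divided by `factor` after `patience` consecutive epochs
    without an improvement in the validation score.
    """
    lr, best, stale, lrs = base_lr, float("-inf"), 0, []
    for score in val_scores:
        lrs.append(lr)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                lr, stale = lr / factor, 0
    return lrs


# Placeholder values chosen only to make the behaviour visible.
history = plateau_schedule([0.50, 0.55, 0.55, 0.56, 0.56], base_lr=1.0, factor=10)
```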
We evaluate performance of the proposed approach on the PASCAL VOC 2012 segmentation benchmark [Everingham et al.(2010)Everingham, Gool, Williams, Winn, and Zisserman], which consists of 20 semantic categories and one background category. Following [Hariharan et al.(2011)Hariharan, Arbelaez, Bourdev, Maji, and Malik], we augment the training data to 8498 images and to 10582 images for the FCN and DeepLab models, respectively.
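The segmentation performance is evaluated by the mean pixel-wise intersection-over-union (mIoU) score [Everingham et al.(2010)Everingham, Gool, Williams, Winn, and Zisserman], defined as follows:
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{n_{ii}}{\sum_{j} n_{ij} + \sum_{j} n_{ji} - n_{ii}},$$
where $n_{ij}$ represents the number of pixels of class $i$ predicted to belong to class $j$, and $N$ is the number of classes (including the background).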
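For concreteness, the score can be computed from a confusion matrix as in the sketch below. The function name is hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is an assumption of this example, not something the text specifies.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] is the number of pixels of class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for i in range(n):
        true_pos = confusion[i][i]
        gt_total = sum(confusion[i])                          # pixels that belong to class i
        pred_total = sum(confusion[j][i] for j in range(n))   # pixels predicted as class i
        union = gt_total + pred_total - true_pos
        if union > 0:   # assumption: ignore classes absent from both maps
            ious.append(true_pos / union)
    return sum(ious) / len(ious)


confusion = [[3, 1],
             [1, 5]]
score = mean_iou(confusion)   # class IoUs are 3/5 and 5/7
```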
First, we conduct all our experiments on the PASCAL VOC val set, and then compare the best performing models with their corresponding baseline models on the PASCAL VOC test set. As the annotations for the test data are not available, we send our predictions to the PASCAL VOC Evaluation Server (http://host.robots.ox.ac.uk/).
4.3 Experiments with FCN-32s
We conduct several experiments with FCN-32s as a baseline model. During the training stage the images are resized to $500 \times 500$ (this is the maximum value for both the height and width in the PASCAL VOC dataset). We evaluate all the models on the holdout dataset of 736 images as in [Long et al.(2015)Long, Shelhamer, and Darrell], and send the test results of the best performing ones to the evaluation server.
The original FCN-32s model employs the standard deconvolution operation (also known as backwards convolution) to upsample the coarse output. We replace it with the proposed global deconvolution and randomly initialise the new parameters as in [Glorot and Bengio(2010)]. We fix the rest of the network to pre-train the added block, and after that train the whole network. Global interpolation already improves its baseline model on the validation dataset, as can be seen in Table 1.
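The baseline model deals with inputs of different sizes by cropping the predicted mask to the same resolution as the corresponding input. Other popular options include either 1) padding or 2) resizing to the fixed input dimensions. In the case of global deconvolution, we propose a more elegant solution. Recall that the parameters of this block can be represented as matrices $K_h \in \mathbb{R}^{H \times h}$ and $K_w \in \mathbb{R}^{W \times w}$, where $H = W$ and $h = w$ during the training stage. Then, given a test image $\bar{I} \in \mathbb{R}^{3 \times \bar{H} \times \bar{W}}$, we subset the learned matrices to acquire $\bar{K}_h \in \mathbb{R}^{\bar{H} \times \bar{h}}$ and $\bar{K}_w \in \mathbb{R}^{\bar{W} \times \bar{w}}$ ($\bar{H} \le H$, $\bar{W} \le W$) and proceed with the same operation. To subset, we keep only the first $\bar{H}$ rows and $\bar{h}$ columns of $K_h$ (and likewise the first $\bar{W}$ rows and $\bar{w}$ columns of $K_w$), and discard the rest. We have found that this does not affect the final performance of our model.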
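The subsetting step amounts to slicing the top-left block of each learned matrix. A minimal sketch follows, with a hypothetical helper name and a small integer matrix standing in for learned weights; how the reduced size of the coarse output is obtained depends on the network and is not modelled here.

```python
def subset_kernel(k, rows, cols):
    """Keep the first `rows` rows and `cols` columns of a learned matrix."""
    return [row[:cols] for row in k[:rows]]


# Stand-in for a learned 5 x 3 matrix (training size H = 5, h = 3).
k_h = [[10 * r + c for c in range(3)] for r in range(5)]

# A smaller test image with 4 output rows and a coarse height of 2.
k_h_test = subset_kernel(k_h, rows=4, cols=2)
```

The sliced matrix is then used in the same product as during training, so no padding, cropping, or resizing of the input is required.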
Next, to increase the recognition accuracy we also append the multi-label classification loss. This slightly improves the validation score in comparison to the baseline model, while the combination with global interpolation gives a further boost in performance (FCN-32s+GDN).
Besides that, we have also conducted additional experiments with FCN-32s, where we insert a fully-connected layer directly after the coarse output (FCN-32s+FC). The idea behind this trick is to allow the network to refine the local predictions based on the information from all the pixels. One drawback in such an approach is the requirement of the fully-connected layers to have the fixed-size input, although the solutions discussed above are also applicable here. Nevertheless, neither of these methods gives satisfactory results during the empirical evaluations. Therefore, we proceed with a slightly different architecture: before appending the fully-connected layer, we first add a spatial pyramid pooling layer [He et al.(2015)He, Zhang, Ren, and Sun], which produces the output of the same length given an arbitrarily sized input. In particular, we are using a -level pyramid with max-pooling. Though during evaluation this approach alone does not give any improvement in the validation score over the baseline model, its ensemble with the global deconvolution model (FCN-32s+GDN+FC) improves previous results, which indicates that these models may be complementary to each other.
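The property exploited here, a fixed-length output for an arbitrarily sized input, can be illustrated with a small max-pooling pyramid over a single channel. The function name, the pyramid levels, and the floor/ceiling rule for bin boundaries are assumptions of this sketch and may differ from the configuration used in the experiments.

```python
def spp_max(feature_map, levels=(1, 2)):
    """Max-pool one channel over an n x n grid for each n in `levels`.

    The pooled values of all levels are concatenated, so the output length
    is sum(n * n for n in levels) regardless of the input size.
    """
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:
        for r in range(n):
            r0 = (r * h) // n                 # floor of the bin start
            r1 = -(-((r + 1) * h) // n)       # ceiling of the bin end
            for c in range(n):
                c0 = (c * w) // n
                c1 = -(-((c + 1) * w) // n)
                out.append(max(feature_map[i][j]
                               for i in range(r0, r1)
                               for j in range(c0, c1)))
    return out


wide = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # 3 x 4 input
small = [[1, 2], [3, 4]]                                # 2 x 2 input
pooled_wide = spp_max(wide)
pooled_small = spp_max(small)
```

Both inputs produce vectors of length five, which is what allows a fully-connected layer to follow the pooling layer.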
Table 1: Mean IoU (%) of the baseline models and their modifications on the PASCAL VOC 2012 validation data.
|Method|mIoU|
|---|---|
|FCN-32s [Long et al.(2015)Long, Shelhamer, and Darrell]|59.4|
|FCN-32s + Label Loss|59.8|
|FCN-32s + Global Interp.|60.9|
|FCN-32s + GDN|61.2|
|FCN-32s + GDN + FC|62.5|
|DL-LargeFOV [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]|73.3|
|DL-LargeFOV + Label Loss|73.9|
|DL-LargeFOV + Global Interp.|74.2|
|DL-LargeFOV + GDN|75.1|
We continue with the evaluation of the best performing models on the test set (Table 5). Both of them improve their baseline model, FCN-32s, and even outperform FCN-8s, another model by Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell] with the skip-architecture, which combines information from lower layers with the final prediction layer.
Some examples of our approach can be seen in Figure 3.
4.4 Experiments with DeepLab
As the next baseline model we consider DeepLab-LargeFOV [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. With the help of the algorithme à trous [Holschneider et al.(1989)Holschneider, Kronland-Martinet, Morlet, and Tchamitchian, Shensa(1992)], the model has a larger Field-of-View (FOV), which results in finer predictions from the network. Besides that, this model is significantly faster and contains fewer parameters than the plain modification of the VGG-16 net due to the reduced number of filters of the last two convolutional layers. The model employs simple bilinear interpolation to acquire the output of the same resolution as the input.
We proceed with the same experiments as for the FCN-32s model, except for the ones involving the fully-connected layer. As DeepLab-LargeFOV has a higher resolution coarse output, the inclusion of the fully-connected layer would result in a weight matrix of several billion parameters. Therefore, we omit these experiments.
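The effect of the algorithme à trous on the field of view can be seen in a one-dimensional sketch: the same three weights cover three input samples at rate 1 and five samples at rate 2. The function name and the toy signal are hypothetical, and the example is not meant to reproduce the DeepLab layer configuration.

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D correlation with kernel taps spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1   # field of view of one output sample
    return [
        sum(k * signal[start + t * rate] for t, k in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = atrous_conv1d(signal, kernel, rate=1)     # each output sees 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)   # each output sees 5 samples
```

The number of weights is unchanged between the two calls; only the spacing between the taps grows.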
We separately replace the bilinear interpolation with global deconvolution, append the label loss and estimate the joint GDN model. We carry out the same strategy outlined above during the testing stage to deal with variable-size inputs. All the experiments lead to improvements over the baseline model, with GDN showing a significantly higher score on the PASCAL VOC 2012 val set (Table 1).
Table 5: Per-class IoU and mean IoU (%) on the PASCAL VOC 2012 test set.
|Method|aero|bike|bird|boat|bottle|bus|car|cat|chair|cow|table|dog|horse|mbike|person|plant|sheep|sofa|train|tv|mIoU|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|FCN-8s [Long et al.(2015)Long, Shelhamer, and Darrell]|76.8|34.2|68.9|49.4|60.3|75.3|74.7|77.6|21.4|62.5|46.8|71.8|63.9|76.5|73.9|45.2|72.4|37.4|70.9|55.1|62.20|
|FCN-32s + GDN|74.5|31.8|66.6|49.7|60.5|76.9|75.8|76.0|22.8|57.5|54.5|72.9|59.4|74.9|73.6|50.9|67.5|43.2|70.0|56.4|62.22|
|FCN-32s + GDN + FC|75.6|31.5|69.2|51.6|62.9|78.8|76.7|78.6|24.6|61.6|60.3|74.5|62.6|76.0|74.3|51.4|70.6|47.3|73.9|58.3|64.37|
|DL-LargeFOV-CRF [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]|83.4|36.5|82.5|62.2|66.4|85.3|78.4|83.7|30.4|72.9|60.4|78.4|75.4|82.1|79.6|58.2|81.9|48.8|73.6|63.2|70.34|
|DeconvNet+CRF_VOC [Noh et al.(2015)Noh, Hong, and Han]|87.8|41.9|80.6|63.9|67.3|88.1|78.4|81.3|25.9|73.7|61.2|72.0|77.0|79.9|78.7|59.5|78.3|55.0|75.2|61.5|70.50|
|DL-MSC-LargeFOV-CRF [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]|84.4|54.5|81.5|63.6|65.9|85.1|79.1|83.4|30.7|74.1|59.8|79.0|76.1|83.2|80.8|59.7|82.2|50.4|73.1|63.7|71.60|
|EDeconvNet+CRF_VOC [Noh et al.(2015)Noh, Hong, and Han]|89.9|39.3|79.7|63.9|68.2|87.4|81.2|86.1|28.5|77.0|62.0|79.0|80.3|83.6|80.2|58.8|83.4|54.3|80.7|65.0|72.50|
|DL-LargeFOV-CRF + GDN|87.9|37.8|88.8|64.5|70.7|87.7|81.3|87.1|32.5|76.7|66.7|80.4|76.6|82.2|82.3|57.9|84.5|55.9|78.5|64.2|73.21|
|DL-L_FOV-CRF + GDN_ENS|88.6|48.6|88.8|64.7|70.4|87.2|81.8|86.4|32.0|77.1|64.1|80.5|78.0|84.0|83.3|59.2|85.9|56.8|77.9|65.0|74.02|
|Adelaide_Cont_CRF_VOC [Lin et al.(2015)Lin, Shen, Reid, and van den Hengel]|90.6|37.6|80.0|67.8|74.4|92.0|85.2|86.2|39.1|81.2|58.9|83.8|83.9|84.3|84.8|62.1|83.2|58.2|80.8|72.3|75.30|
The DeepLab-LargeFOV model also incorporates a fully-connected CRF [Lafferty et al.(2001)Lafferty, McCallum, and Pereira, Kohli et al.(2009)Kohli, Ladicky, and Torr, Krähenbühl and Koltun(2011)] as a post-processing step.
To set the parameters of the fully connected CRF, we employ the same method of cross-validation as in [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] on a subset of the validation data. Then we send our best performing model enriched by CRF to the evaluation server.
On the PASCAL VOC 2012 test set our single model (DL-LargeFOV-CRF+GDN) achieves 73.2% mIoU, a significant improvement over the baseline model (around 2.9%), and even exceeds the multiscale DeepLab-MSc-LargeFOV model by 1.6% (Table 5); the predictions averaged across several of our models (DL-L_FOV-CRF+GDN_ENS) give a further improvement of 0.8%, a score competitive with the models that do not exploit the Microsoft COCO dataset [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick].
As is the case with the FCN-32s model, we obtain performance on par with the multi-resolution variant using a much simpler architecture. Moreover, our single CRF-equipped global deconvolutional network (DL-LargeFOV-CRF+GDN) even surpasses the results of the competing approach (DeconvNet+CRF [Noh et al.(2015)Noh, Hong, and Han]) by 2.7%, where the deconvolutional part of the network contains significantly more parameters: almost 126M compared to less than 70K of global deconvolution; in the case of ensembles, the improvement is over 1.5%.
5 Conclusion
In this paper we addressed two important problems of semantic image segmentation: an upsampling of the low-resolution output from the network and refinement of this coarse output, incorporating global information and the additional classification loss. We proposed a novel approach, global deconvolution, to acquire the output of the same size as the input for images of variable resolutions. We showed that global deconvolution effectively replaces standard approaches, and can easily be trained in a straightforward manner.
On the benchmark competition, PASCAL VOC 2012, we showed that the proposed approach outperforms the results of the baseline models. Furthermore, our method even surpasses the performance of more powerful multi-resolution models, which combine information from several blocks of the deep neural network.
Acknowledgements The authors would like to thank the anonymous reviewers for their helpful and constructive comments, and Gaee Kim for making Fig. 1. This work is supported by the Ministry of Science, ICT & Future Planning (MSIP), Korea, under Basic Science Research Program through the National Research Foundation of Korea (NRF) grant (NRF-2014R1A1A1002662), under the ITRC (Information Technology Research Center) support program (IITP-2016-R2720-16-0007) supervised by the IITP (Institute for Information & communications Technology Promotion) and under NIPA (National IT Industry Promotion Agency) program (NIPA-S0415-15-1004).
References
- [Adams et al.(2010)Adams, Baek, and Davis] Andrew Adams, Jongmin Baek, and Myers Abraham Davis. Fast high-dimensional filtering using the permutohedral lattice. Comput. Graph. Forum, 29(2):753–762, 2010.
- [Álvarez et al.(2012)Álvarez, Gevers, LeCun, and López] José Manuel Álvarez, Theo Gevers, Yann LeCun, and Antonio M. López. Road scene segmentation from a single image. In ECCV, 2012.
- [Badrinarayanan et al.(2015)Badrinarayanan, Handa, and Cipolla] Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. CoRR, abs/1505.07293, 2015. URL http://arxiv.org/abs/1505.07293.
- [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR, abs/1412.7062, 2014.
- [Ciresan et al.(2012)Ciresan, Giusti, Gambardella, and Schmidhuber] Dan C. Ciresan, Alessandro Giusti, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, 2012.
- [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Li] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- [Divvala et al.(2009)Divvala, Hoiem, Hays, Efros, and Hebert] Santosh Kumar Divvala, Derek Hoiem, James Hays, Alexei A. Efros, and Martial Hebert. An empirical study of context in object detection. In CVPR, 2009.
- [Doersch et al.(2014)Doersch, Gupta, and Efros] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Context as supervisory signal: Discovering objects with predictable context. In ECCV, 2014.
- [Everingham et al.(2010)Everingham, Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [Girshick(2015)] Ross B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
- [Girshick et al.()Girshick, Donahue, Darrell, and Malik] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- [Glorot and Bengio(2010)] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- [Glorot et al.(2011)Glorot, Bordes, and Bengio] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
- [Hariharan et al.(2011)Hariharan, Arbelaez, Bourdev, Maji, and Malik] Bharath Hariharan, Pablo Arbelaez, Lubomir D. Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.
- [He et al.(2015)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1904–1916, 2015.
- [Heitz and Koller(2008)] Geremy Heitz and Daphne Koller. Learning spatial context: Using stuff to find things. In ECCV, 2008.
- [Holschneider et al.(1989)Holschneider, Kronland-Martinet, Morlet, and Tchamitchian] Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. 1989.
- [Hong et al.(2015)Hong, Noh, and Han] Seunghoon Hong, Hyeonwoo Noh, and Bohyung Han. Decoupled deep neural network for semi-supervised semantic segmentation. CoRR, abs/1506.04924, 2015.
- [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
- [Karpathy and Li(2015)] Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
- [Kluckner et al.(2009)Kluckner, Mauthner, Roth, and Bischof] Stefan Kluckner, Thomas Mauthner, Peter M. Roth, and Horst Bischof. Semantic classification in aerial imagery by integrating appearance and height information. In ACCV, 2009.
- [Kohli et al.(2009)Kohli, Ladicky, and Torr] Pushmeet Kohli, Lubor Ladicky, and Philip H. S. Torr. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.
- [Krähenbühl and Koltun(2011)] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
- [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- [Lafferty et al.(2001)Lafferty, McCallum, and Pereira] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
- [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, R. E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
- [LeCun et al.(2015)LeCun, Bengio, and Hinton] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [Lin et al.(2015)Lin, Shen, Reid, and van den Hengel] Guosheng Lin, Chunhua Shen, Ian D. Reid, and Anton van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. CoRR, abs/1504.01013, 2015.
- [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
- [Liu et al.(2015)Liu, Li, Luo, Loy, and Tang] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Semantic image segmentation via deep parsing network. CoRR, abs/1509.02634, 2015.
- [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [Mnih and Hinton(2010)] Volodymyr Mnih and Geoffrey E. Hinton. Learning to detect roads in high-resolution aerial images. In ECCV, 2010.
- [Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan L. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
- [Noh et al.(2015)Noh, Hong, and Han] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. CoRR, abs/1505.04366, 2015.
- [Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. CoRR, abs/1502.02734, 2015.
- [Rabinovich et al.(2007)Rabinovich, Vedaldi, Galleguillos, Wiewiora, and Belongie] Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge J. Belongie. Objects in context. In ICCV, 2007.
- [Ren et al.(2015)Ren, He, Girshick, and Sun] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
- [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- [Rumelhart et al.(1986)Rumelhart, Hinton, and Williams] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- [Russakovsky et al.(2015a)Russakovsky, Bearman, Ferrari, and Li] Olga Russakovsky, Amy L. Bearman, Vittorio Ferrari, and Fei-Fei Li. What’s the point: Semantic segmentation with point supervision. CoRR, abs/1506.02106, 2015a.
- [Russakovsky et al.(2015b)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Li] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015b.
- [Sermanet et al.(2013)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
- [Shensa(1992)] Mark J. Shensa. The discrete wavelet transform: wedding the à trous and Mallat algorithms. IEEE Transactions on Signal Processing, 40(10):2464–2482, 1992.
- [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- [Sturgess et al.(2009)Sturgess, Alahari, Ladicky, and Torr] Paul Sturgess, Karteek Alahari, Lubor Ladicky, and Philip H. S. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, 2009.
- [Szegedy et al.(2014)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- [Vinyals et al.(2015)Vinyals, Toshev, Bengio, and Erhan] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
- [Xu et al.(2015)Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, and Bengio] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- [Yu and Koltun(2015)] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.
- [Zeiler and Fergus(2014)] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
- [Zeiler et al.(2010)Zeiler, Krishnan, Taylor, and Fergus] Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Robert Fergus. Deconvolutional networks. In CVPR, 2010.
- [Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. CoRR, abs/1502.03240, 2015.