Image Segmentation by Iterative Inference from Conditional Score Estimation
Abstract
Inspired by the combination of feedforward and iterative computations in the visual cortex, and taking advantage of the ability of denoising autoencoders to estimate the score of a joint distribution, we propose a novel approach to iterative inference for capturing and exploiting the complex joint distribution of output variables conditioned on some input variables. This approach is applied to pixel-wise image segmentation, with the estimated conditional score used to perform gradient ascent towards a mode of the estimated conditional distribution. This extends previous work on score estimation by denoising autoencoders to the case of a conditional distribution, with a novel use of a corrupted feedforward predictor replacing Gaussian corruption. An advantage of this approach over more classical ways to perform iterative inference for structured outputs, like conditional random fields (CRFs), is that it is no longer necessary to define an explicit energy function linking the output variables. To keep computations tractable, such energy function parametrizations are typically fairly constrained, involving only a few neighbors of each of the output variables in each clique. We experimentally find that the proposed iterative inference from conditional score estimation by conditional denoising autoencoders performs better than comparable models based on CRFs or those not using any explicit modeling of the conditional joint distribution of outputs.
Adriana Romero, Michal Drozdzal, Akram Erraqabi, Simon Jégou, Yoshua Bengio Montreal Institute for Learning Algorithms, Montreal, QC, Canada Imagia Cybernetics, Montreal, QC, Canada {adriana.romero.soriano, michal.drozdzal, akram.erraqabi}@umontreal.ca, {simon.jegou, yoshua.umontreal}@gmail.com
1 Introduction
Based on response timing and propagation delays in the brain, a plausible hypothesis is that the visual cortex can perform fast feedforward inference (ThorpeFizeMarlot96) when an answer is needed quickly and the image interpretation is easy enough (requiring as little as 200ms of cortical propagation for some object recognition tasks, i.e., just enough time for a single feedforward pass), but needs more time and iterative inference in the case of more complex inputs (Vanmarckeetal2016). Recent deep learning research and the success of residual networks (He2016DeepRL; GreffSS16) point towards a similar scenario (Liao2016BridgingTG): early computation is feedforward, a series of non-linear transformations which map low-level features to high-level ones, while later computation is iterative (using lateral and feedback connections in the brain) in order to capture complex dependencies between different elements of the interpretation. Indeed, whereas a purely feedforward network could model a unimodal posterior distribution (e.g., the expected target with some uncertainty around it), the joint conditional distribution of output variables given inputs could be more complex and multimodal. Iterative inference could then be used either to sample from this joint distribution or to converge towards a dominant mode of that distribution, whereas a unimodal-output feedforward network would converge to some statistic like the conditional expectation, which might not correspond to a coherent configuration of the output variables when the actual conditional distribution is multimodal.
This paper proposes such an approach, combining a first feedforward phase with a second iterative phase corresponding to searching for a dominant mode of the conditional distribution, while tackling the problem of semantic image segmentation. We take advantage of theoretical results (Alain+BengioICLR2013) on denoising autoencoders (DAEs), which show that they can estimate the score, or negative gradient of the energy function, of the joint distribution of observed variables: the difference between the reconstruction and the input points in the direction of that estimated gradient. We propose to condition the autoencoder with an additional input so as to obtain the conditional score, i.e., the gradient of the energy of the conditional density of the output variables, given the input variables. The autoencoder takes a candidate output $y$ as well as an input $x$ and outputs a value $\hat{y}$ so that $\hat{y} - y$ estimates the direction $\frac{\partial \log p(y \mid x)}{\partial y}$. We can then take a gradient step in that direction and update $y$ towards a lower-energy value and iterate in order to approach a mode of the implicit $p(y \mid x)$ captured by the autoencoder. We find that instead of corrupting the segmentation target as input of the DAE, we obtain better results by training the DAE with the corrupted feedforward prediction, which is closer to what will be seen as the initial state of the iterative inference process. The use of a denoising autoencoder framework to estimate the gradient of the energy is an alternative to more traditional graphical modeling approaches, e.g., with conditional random fields (CRFs) (Lafferty01CRF; He04MCR), which have been used to model the joint distribution of pixel labels given an image (Koltun11).
The potential advantage of the DAE approach is that it is not necessary to decide on an explicitly parametrized energy function: such energy functions tend to only capture local interactions between neighboring pixels, whereas a convolutional DAE can potentially capture dependencies of any order and across the whole image, taking advantage of the state-of-the-art in deep convolutional architectures in order to model these dependencies via the direct estimation of the energy function gradient. Note that this is different from the use of convolutional networks for the feedforward part of the network, and concerns the modeling of the conditional joint distribution of output pixel labels, given image features.
The main contributions of this paper are the following:

A novel training framework for modeling structured output conditional distributions which is an alternative to CRFs, based on denoising autoencoder estimation of energy gradients.

Showing how this framework can be used in an architecture for pixel-wise image segmentation in which the above energy gradient estimator is used to propose a highly probable segmentation through gradient descent in the output space.

Demonstrating that this approach to image segmentation outperforms or matches classical alternatives such as combining convolutional nets with CRFs and more recent state-of-the-art alternatives on the CamVid dataset.
2 Method
In this section, we describe the proposed iterative inference method to refine the segmentation of a feedforward network.
2.1 Background
As pointed out in Section 1, a DAE can estimate a density via an estimator of the score, or negative gradient of the energy function (Vincent2010SDA; VincentNC2011; Alain+BengioICLR2013). These theoretical analyses of DAEs are presented for the particular case where the corruption noise added to the input is Gaussian. Results show that a DAE can estimate the gradient of the energy function of a joint distribution of observed variables. The main result is the following:
$$\frac{r(x) - x}{\sigma^2} \approx \frac{\partial \log p(x)}{\partial x} \qquad (1)$$
where $\sigma^2$ is the amount of Gaussian noise injected during training, $x$ is the input of the autoencoder and $r(x)$ is its output (the reconstruction). The approximation becomes exact as $\sigma \to 0$ and the autoencoder is given enough capacity, training examples and training time. The direction of $(r(x) - x)$ points towards more likely configurations of $x$. Therefore, the DAE learns a vector field pointing towards the manifold where the input data lies.
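The relation between the reconstruction residual and the score, (r(x) - x) / sigma^2 ≈ d log p(x) / dx, can be checked numerically in a case where the optimal denoiser is known in closed form. The sketch below is only an illustration and is not taken from the original experiments: it assumes one-dimensional Gaussian data with a hypothetical mean `mu` and variance `s2`, for which the optimal denoiser under Gaussian corruption is the posterior mean. All function names are made up for this example.

```python
def optimal_denoiser(x_noisy, mu, s2, sigma2):
    """Posterior mean E[x | x_noisy] for x ~ N(mu, s2), x_noisy = x + N(0, sigma2)."""
    return mu + (s2 / (s2 + sigma2)) * (x_noisy - mu)


def true_score(x, mu, s2):
    """d/dx log p(x) for the Gaussian density N(mu, s2)."""
    return -(x - mu) / s2


def estimated_score(x, mu, s2, sigma2):
    """Score estimate (r(x) - x) / sigma^2 obtained from the denoiser r."""
    return (optimal_denoiser(x, mu, s2, sigma2) - x) / sigma2


mu, s2 = 1.0, 4.0  # hypothetical data mean and variance
x = 3.0            # point at which the score is evaluated

errors = []
for sigma2 in (1.0, 0.1, 0.001):
    err = abs(estimated_score(x, mu, s2, sigma2) - true_score(x, mu, s2))
    errors.append(err)
    print(f"sigma^2 = {sigma2}: |estimate - true score| = {err:.6f}")
```

In this toy setting the gap between the estimate and the true score shrinks as the noise level decreases, which is the behavior the approximation predicts.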
2.2 Our framework
In our case, we seek to rapidly learn a vector field pointing towards more probable configurations of $y \mid x$. We propose to extend the results summarized in Subsection 2.1 and condition the autoencoder with an additional input. If we condition the autoencoder with features $h$, which are a function of $x$, the DAE framework with Gaussian corruption learns to estimate $-\frac{\partial E(y)}{\partial y} = \frac{\partial \log p(y \mid x)}{\partial y}$, where $y$ is a segmentation candidate, $x$ an input image and $E$ is an energy function (the energy of the conditional density $p(y \mid x)$). Gradient descent in energy can thus be performed in order to iteratively reach a mode of the estimated conditional distribution:
$$y \leftarrow y - \epsilon \frac{\partial E(y)}{\partial y} \qquad (2)$$
with step size $\epsilon$. In addition, whereas Gaussian noise around the target $t$ (the segmentation ground truth) would be the DAE prescription for the corrupted input to be mapped to $t$, this may be inefficient at visiting the configurations we really care about, i.e., those produced by our feedforward predictor, which we use to obtain a first guess for $y$, as initialization of the iterative inference towards an energy minimum. Therefore, we propose that during training, instead of using a corrupted $t$ as input, the DAE takes as input a corrupted segmentation candidate $\tilde{y}$ and either the input $x$ or some features $h$ extracted from a feedforward segmentation network applied to $x$:
$$h = f_l(x) \qquad (3)$$
where $f_l$ is a non-linear function and $l$ is the index of a layer in the feedforward segmentation network. The output of the DAE is computed as
$$\hat{y} = r(\tilde{y}, h) \qquad (4)$$
where $r$ is a non-linear function which is trained to denoise conditionally and $\tilde{y}$ is a corrupted form of $y$. During training, $\tilde{y}$ is $y$ plus noise, while at test time (for inference) it is simply $y$ itself.
In order to train the DAE, (1) we extract both $y$ and $h$ from a feedforward segmentation network; (2) we corrupt $y$ into $\tilde{y}$; and (3) we train the DAE by minimizing the following loss
$$\mathcal{L}(r(\tilde{y}, h), t) \qquad (5)$$
where $\mathcal{L}$ is the categorical cross-entropy and $t$ is the segmentation ground truth.
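The three training steps (extract the candidate and the features, corrupt the candidate, minimize the categorical cross-entropy against the ground truth) can be sketched on a toy example. The code below is a minimal illustration under strong assumptions, not the authors' implementation: `toy_dae` is a hypothetical stand-in for the trained conditional denoiser, the candidate `y`, features `h` and labels `t` are made-up values for two pixels and three classes, and the noise level `sigma=0.1` is an arbitrary choice.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def corrupt(y, sigma, rng):
    """Add zero-mean Gaussian noise to every class score of every pixel."""
    return [[v + rng.gauss(0.0, sigma) for v in pixel] for pixel in y]


def cross_entropy(pred, target_labels):
    """Mean categorical cross-entropy between per-pixel probabilities and labels."""
    eps = 1e-12  # guards against log(0)
    total = sum(math.log(p[t] + eps) for p, t in zip(pred, target_labels))
    return -total / len(target_labels)


def toy_dae(y_tilde, h):
    """Hypothetical stand-in for the denoiser: adds features to scores, then softmax."""
    return [softmax([a + b for a, b in zip(py, ph)]) for py, ph in zip(y_tilde, h)]


rng = random.Random(0)
y = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]  # feedforward candidate (2 pixels, 3 classes)
h = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]  # made-up conditioning features
t = [0, 2]                               # ground-truth class per pixel

y_tilde = corrupt(y, sigma=0.1, rng=rng)      # corruption is applied at training time only
loss = cross_entropy(toy_dae(y_tilde, h), t)  # quantity minimized during training
print(f"training loss on the toy example: {loss:.4f}")
```

In an actual training loop, the loss would be differentiated with respect to the denoiser parameters; the sketch only shows how the corrupted candidate, the features and the ground truth enter the objective.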
Figure 1(a) depicts the pipeline during training. First, a fully convolutional feedforward network for segmentation is trained. In practice, we use one of the state-of-the-art pre-trained networks. Second, given an input image $x$, we extract a segmentation proposal $y$ and intermediate features $h$ from the segmentation network. Both $y$ and $h$ are fed to a DAE network (adding Gaussian noise to $y$). The DAE is trained to properly reconstruct the clean segmentation (ground truth $t$). Figure 1(b) presents the original DAE prescription, where the DAE is trained by taking as input $\tilde{t}$ (the corrupted ground truth) and $h$.
Once trained, we can exploit the model to iteratively take gradient steps in the direction of the segmentation manifold. To do so, we first obtain a segmentation proposal $y$ from the feedforward network and then we iteratively refine this proposal by applying the following rule
$$y \leftarrow y + \epsilon \, (r(y, h) - y) \qquad (6)$$
For practical reasons, we collapsed the corruption noise $\sigma^2$ into the step size $\epsilon$.
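The refinement rule y ← y + ε (r(y, h) − y) amounts to a short loop. The following sketch is illustrative only: `toy_denoise` is a hypothetical denoiser that moves its input halfway towards a made-up clean configuration stored in `h`, and the step size and iteration count are arbitrary values, not those used in the experiments.

```python
def refine(y0, h, denoise, step_size, n_iters):
    """Apply y <- y + step_size * (denoise(y, h) - y) for n_iters iterations."""
    y = list(y0)
    for _ in range(n_iters):
        r = denoise(y, h)
        y = [yi + step_size * (ri - yi) for yi, ri in zip(y, r)]
    return y


def toy_denoise(y, h):
    """Hypothetical denoiser: moves each value halfway towards the vector h."""
    return [yi + 0.5 * (hi - yi) for yi, hi in zip(y, h)]


h = [1.0, 0.0, 0.0]     # made-up clean configuration encoded by the features
y0 = [0.4, 0.35, 0.25]  # feedforward proposal used as the initial state

y_refined = refine(y0, h, toy_denoise, step_size=0.5, n_iters=20)
print([round(v, 4) for v in y_refined])
```

No noise is added inside the loop, mirroring the fact that corruption is only used while training the denoiser.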
Figure 2 depicts the test pipeline. We start with an input image $x$ that we feed to a pre-trained segmentation network. The segmentation network outputs some intermediate feature maps $h$ and a segmentation proposal $y$. Then, both $y$ and $h$ are fed to the DAE to compute the output $\hat{y}$. The DAE is used to take iterative gradient steps towards the manifold of segmentation masks, with no noise added at inference time.
3 Related Work
On one hand, recent advances in semantic segmentation mainly focus on improving the architecture design (ronneberger2015u; SegNet2015; DrozdzalVCKP16; Jegou17), increasing the context understanding capabilities (Gatta14deepvision; VisinKCBMC15; chen14semantic; YuKoltun2016) and building processing modules to enforce structure consistency on segmentation outputs (Koltun11; chen14semantic; CRFasRNN). Here, we are interested in this last research direction. CRFs are among the most popular choices to impose structured information at the output of a segmentation network, with fully connected CRFs (Koltun11) and CRFs as RNNs (CRFasRNN) among the best performing variants. More recently, an alternative that promotes structure consistency by decomposing the prediction process into multiple steps and iteratively adding structure information was introduced in (li2016iterative). Another iterative approach was introduced in (GidarisK16a) to tackle semantic image segmentation by repeatedly detecting, replacing and refining segmentation masks. Finally, the reinterpretation of residual networks (Liao2016BridgingTG; GreffSS16) was exploited in (DrozdzalCVDTRBP17), in the context of biomedical image segmentation, by iteratively refining learned pre-normalized images to generate segmentation predictions.
On the other hand, there has recently been some research devoted to exploiting results on DAEs in different tasks, such as image generation (NguyenYBDC16), high resolution image estimation (Sonderby2017) and semantic segmentation (Xie2016). In (NguyenYBDC16), the authors propose plug & play generative networks, which, in the best reported results, train a fully-connected DAE to reconstruct a denoised version of some feature maps extracted from an image classification network. The iterative update rule at inference time is performed in the feature space. In (Sonderby2017), the authors use a DAE in the context of image super-resolution to learn the gradient of the density of high resolution images and apply it to refine the output of an upsampled low resolution image. In (Xie2016), the authors exploit convolutional pseudo-priors trained on the ground-truth labels in a semantic segmentation task. During the training phase, the pseudo-prior is combined with the segmentation proposal from a segmentation model to produce a joint distribution over data and labels. At test time, the ground truth is not accessible, thus the FCN predictions are fed iteratively to the convolutional pseudo-prior network. In this work, we exploit DAEs in the context of image segmentation and extend them in two ways: first, by using them to learn a conditional score, and second, by using a corrupted feedforward prediction as input during training to obtain better segmentation results.
4 Experiments
The main objective of these experiments is to answer the following questions:

Can a conditional DAE be used successfully as the building block of iterative inference for image segmentation?

Does our proposed corruption model (based on the feedforward prediction) work better than the prescribed target output corruption?

Does the resulting segmentation system outperform more classical iterative approaches to segmentation such as CRFs?
4.1 CamVid Dataset
CamVid (camvid), available at http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/, is a fully annotated urban scene understanding dataset. It contains videos that are fully segmented. We used the same split and image resolution as (SegNet2015). The split contains 367 images (video frames) for training, 101 for validation and 233 for test. Each frame has a resolution of 360×480 and pixels are labeled with 11 different semantic classes.
4.2 Feedforward segmentation architecture
We experimented with two feedforward architectures for segmentation: the classical fully convolutional network FCN8 of (Long2015fully) and the more recent state-of-the-art fully convolutional DenseNet (FCDenseNet103) of (Jegou17), which do not make use of any additional synthetic data to boost their performance.
FCN8 (Long2015fully): FCN8 is a feedforward segmentation network, which consists of a convolutional downsampling path followed by a convolutional upsampling path. The downsampling path successively applies convolutional and pooling layers, and the upsampling path successively applies transposed convolutional layers. The upsampling path recovers spatial information by merging features skipped from the various resolution levels on the contracting path.
FCDenseNet103 (Jegou17): FCDenseNet is a state-of-the-art feedforward segmentation network that exploits the feature reuse idea of (DenseNet2016) and extends it to perform semantic segmentation. FCDenseNet103 consists of a convolutional downsampling path, followed by a convolutional upsampling path. The downsampling path iteratively concatenates all feature outputs in a feedforward fashion. The upsampling path applies a transposed convolution to feature maps from the previous stage and recovers information from higher resolution features from the downsampling path of the network by using skip connections.
4.3 DAE architecture
Our DAE is composed of a downsampling path and an upsampling path. The downsampling path contains convolutions and pooling operations, while the upsampling path is built from unpooling with switches (also known as unpooling with index tracking) (ZhaoMGL15; ZhangLL16; SegNet2015) and convolution operations. As discussed in (ZhangLL16), reverting the max pooling operations more faithfully significantly improves the quality of the reconstructed images. Moreover, while exploring potential network architectures, we found that using fully convolutional-like architectures with upsampling and skip connections (between downsampling and upsampling paths) degrades segmentation results when compared to unpooling with switches. This is not surprising, since we inject noise into the model's input when training the DAE. Skip connections directly propagate this added noise to the end layers, making them responsible for the data denoising process. Note that the last layers of the model might not have enough capacity to accomplish the denoising task.
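Unpooling with switches can be illustrated in one dimension: max pooling records where each maximum came from, and unpooling writes the pooled values back to those positions. The sketch below is a simplified illustration with hypothetical function names and a window size of 2 chosen for the example; it is not the layer implementation used in the paper.

```python
def max_pool_with_switches(x, size=2):
    """1-D max pooling that also returns the index (switch) of each maximum."""
    pooled, switches = [], []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        offset = max(range(size), key=lambda i: window[i])
        pooled.append(window[offset])
        switches.append(start + offset)
    return pooled, switches


def unpool_with_switches(pooled, switches, length):
    """Write each pooled value back at its recorded position; zeros elsewhere."""
    out = [0.0] * length
    for value, idx in zip(pooled, switches):
        out[idx] = value
    return out


x = [0.1, 0.9, 0.4, 0.3, 0.0, 0.7]
pooled, switches = max_pool_with_switches(x)
restored = unpool_with_switches(pooled, switches, len(x))
print(pooled)    # [0.9, 0.4, 0.7]
print(switches)  # [1, 2, 5]
print(restored)  # [0.0, 0.9, 0.4, 0.0, 0.0, 0.7]
```

Only the positions of the maxima are carried from the downsampling path to the upsampling path, so, unlike a skip connection, the noisy activations themselves are not copied to the last layers.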
In our experiments, we use a DAE built from 6 interleaved pooling and convolution operations, followed by 6 interleaved unpooling and convolution operations. We start with 64 feature maps in the first convolution and double the number of feature maps in consecutive convolutions in the downsampling path. Thus, the number of feature maps in the network's downsampling path is: 64, 128, 256, 512, 1024 and 2048. In the upsampling path, we progressively reduce the number of feature maps down to the number of classes. Thus, the number of feature maps in consecutive layers of the upsampling path is the following: 1024, 512, 256, 128, 64 and 11 (number of classes). We concatenate the output of the 4th pooling operation in the downsampling path of the DAE with the feature maps corresponding to the 4th pooling operation in the downsampling path of the segmentation network.
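The channel configuration described above follows a simple doubling rule, which the short sketch below reproduces. The helper name `dae_feature_maps` and its parameters are hypothetical and only serve to make the arithmetic explicit.

```python
def dae_feature_maps(first=64, depth=6, n_classes=11):
    """Channel counts of the downsampling and upsampling paths."""
    # Downsampling: start at `first` and double at every stage.
    down = [first * 2 ** i for i in range(depth)]
    # Upsampling: mirror the downsampling path (skipping its last stage)
    # and finish with one feature map per class.
    up = down[-2::-1] + [n_classes]
    return down, up


down, up = dae_feature_maps()
print(down)  # [64, 128, 256, 512, 1024, 2048]
print(up)    # [1024, 512, 256, 128, 64, 11]
```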
4.4 Training and inference details
We train our DAE by means of stochastic gradient descent with RMSprop (rmsprop), applying an exponential decay to the learning rate after each epoch. All models are trained with data augmentation, randomly applying crops and horizontal flips. We regularize our model with weight decay. We use a minibatch size of 10. While training, we add zero-mean Gaussian noise to the DAE input. We train the models for a maximum of 500 epochs and monitor the validation reconstruction error to early stop the training using a patience of 100 epochs.
At test time, we need to determine the step size $\epsilon$ and the number of iterations to get the final segmentation output. We select $\epsilon$ and the number of iterations by evaluating the pipeline on the validation set: we try different values of $\epsilon$ for a maximum of 50 iterations. For each iteration, we compute the mean intersection over union (mean IoU) on the validation set and keep the combination ($\epsilon$, number of iterations) that maximizes this metric to evaluate the test set. The code to reproduce all experiments can be found at https://github.com/adriromsor/iterative_inference_segm. The code requires the framework in (Visin_dataset_loaders) to load and preprocess the data.
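The selection of the step size and the number of iterations is a grid search driven by the validation mean IoU. The sketch below illustrates that procedure under toy assumptions and is not the released code: `refine_scores` is a hypothetical stand-in for the iterative inference, the four-pixel binary example and the candidate step sizes are made up, and classes absent from both prediction and target are skipped when averaging the IoU.

```python
def mean_iou(pred, target, n_classes):
    """Mean intersection over union over classes present in pred or target."""
    ious = []
    for c in range(n_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


def select_hyperparameters(step_sizes, max_iters, evaluate):
    """Return (best_score, step_size, n_iters) maximizing evaluate(step_size, n_iters)."""
    best = None
    for eps in step_sizes:
        for k in range(1, max_iters + 1):
            score = evaluate(eps, k)
            if best is None or score > best[0]:
                best = (score, eps, k)
    return best


def refine_scores(scores, clean, step_size, n_iters):
    """Toy stand-in for iterative inference: move scores towards `clean`."""
    out = list(scores)
    for _ in range(n_iters):
        out = [s + step_size * (c - s) for s, c in zip(out, clean)]
    return out


target = [0, 1, 1, 0]          # made-up validation labels (binary task)
scores = [0.6, 0.4, 0.7, 0.2]  # made-up feedforward scores for class 1
clean = [0.0, 1.0, 1.0, 0.0]   # configuration the toy refinement moves towards


def evaluate(step_size, n_iters):
    """Validation mean IoU after n_iters refinement steps with the given step size."""
    refined = refine_scores(scores, clean, step_size, n_iters)
    pred = [1 if s > 0.5 else 0 for s in refined]
    return mean_iou(pred, target, n_classes=2)


best = select_hyperparameters([0.05, 0.1, 0.5], max_iters=50, evaluate=evaluate)
print(best)
```

Ties are resolved in favor of the first combination visited, which is a design choice of this sketch rather than something stated in the text.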
4.5 Results
Model  Sky  Building  Pole  Road  Sidewalk  Tree  Sign  Fence  Car  Pedestrian  Cyclist  Mean IoU  Gl. accuracy
FCN8 (Long2015fully)
FCN8 + CRF  90.1  36.1
FCN8 + con. mod. (YuKoltun2016)  90.1
FCN8 + CRFRNN (CRFasRNN)  22.3  30.1
FCN8 + DAE($t$)
FCN8 + DAE($y$)  80.0  92.1  75.3  72.6  80.3  46.2  42.5  60.0  89.3
FCDenseNet (Jegou17)  94.4
FCDenseNet + CRF  93.2  83.8  77.9  46.3  38.3  77.4  51.7  91.7
FCDenseNet + con. mod. (YuKoltun2016)  94.4  77.4
FCDenseNet + DAE($t$)  94.4
FCDenseNet + DAE($y$)  38.8  94.4  82.5  60.3  67.4  91.7
Table 1 reports our results for FCN8 and FCDenseNet103 without any post-processing step, applying a fully connected CRF (Koltun11), a context network (YuKoltun2016) as a trained post-processing step, CRFRNN (CRFasRNN) trained end-to-end with the segmentation network, and the DAE's iterative inference. For CRF, we use the publicly available implementation of (Koltun11).
As shown in the table, using the DAE's iterative inference on the segmentation candidates of a feedforward segmentation network (DAE($y$)) outperforms state-of-the-art post-processing variants, improving upon FCN8 in terms of IoU. Applying CRF as a post-processor also improves the FCN8 segmentation results. Note that similar improvements for CRF were reported on other architectures for the same dataset (e.g., (SegNet2015)). Similar improvements are achieved when using the context module (YuKoltun2016) as a post-processor and when applying CRFRNN. It is worth noting that our method does not decrease the performance of any class with respect to FCN8. However, CRF loses performance when segmenting column poles, whereas CRFRNN loses performance when segmenting signs. When it comes to more recent state-of-the-art architectures such as FCDenseNet103, the post-processing increment on the segmentation metrics is lower, as expected. Nevertheless, the improvement in IoU is still perceivable. When comparing our method to other state-of-the-art post-processors, we observe a slight improvement. End-to-end training of CRFRNN with FCDenseNet103 did not yield any improvement over FCDenseNet103.
It is worth comparing the performance of the proposed approach DAE($y$) with DAE($t$) trained from the ground truth. As shown in the table, DAE($y$) consistently outperforms DAE($t$), both for FCN8 and for FCDenseNet103; for the latter, differences are smaller but still noticeable. In both cases, DAE($y$) not only outperforms DAE($t$) globally, but also in all classes that exhibit an improvement. Note that the model trained on the ground truth requires a bigger Gaussian noise in order to slightly increase the performance of the pre-trained feedforward segmentation networks. It is worth mentioning that training our model end-to-end with the segmentation network did not improve the results, while being more memory demanding.
Figure 3 shows some qualitative segmentation results that compare the output of the feedforward network to both the CRF and iterative inference outputs. Figures 3(b)-3(f) show an example from the FCN8 case, whereas Figures 3(g)-3(k) show an example from FCDenseNet103. As shown in Figure 3(d), the FCN8 segmentation network fails to properly find the fence in the image, mistakenly classifying it as part of a building (highlighted with a white box on the image). CRF is able to clean the segmentation candidate, for example, by filling in missing parts of the sidewalk, but is not able to add non-existing structure (see Figure 3(e)). Our method not only improves the segmentation candidate by smoothing large regions such as the sidewalk, but also corrects the prediction by incorporating missing objects such as the fence in Figure 3(f). As depicted in Figures 3(g)-3(k), in the case of FCDenseNet the improvement in segmentation quality is minor and difficult to perceive by visual inspection. The qualitative results follow the findings from the quantitative analysis: CRF slightly decreases the quality of column pole segmentations (e.g., see the area inside the white boxes when comparing Figures 3(j) and 3(k)).
4.6 Analysis of iterative inference steps
In this subsection, we analyze the influence of the two inference parameters of our method, namely the step size $\epsilon$ and the number of iterations. This analysis is performed on the validation set of the CamVid dataset, for the above-mentioned feedforward segmentation networks. For the sake of comparison, we perform a similar analysis on the densely connected CRF, by fixing the best configuration and only changing the number of CRF iterations.
Figure 4 shows how the performance varies with the number of iterations. Figure 4(a) and Figure 4(b) plot the results in the case of FCN8 and FCDenseNet103, respectively. As expected, there is a trade-off between the selected step size $\epsilon$ and the number of iterations. The smaller the $\epsilon$, the more iterations are required to achieve the best performance. Interestingly, all values of $\epsilon$ within a reasonable range lead to similar maximum performances.
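The trade-off between step size and number of iterations can be made concrete in an idealized setting. The sketch below assumes a hypothetical perfect denoiser that always returns a fixed scalar target, so that each update y ← y + ε (target − y) shrinks the remaining error by a factor (1 − ε); the tolerance and the step sizes are arbitrary illustration values, not those of the experiments.

```python
def iterations_to_converge(step_size, y0=0.0, target=1.0, tol=1e-2, max_iters=1000):
    """Number of updates y <- y + step_size * (target - y) until |target - y| < tol."""
    y = y0
    for k in range(1, max_iters + 1):
        y = y + step_size * (target - y)
        if abs(target - y) < tol:
            return k
    return None  # did not converge within max_iters


counts = {eps: iterations_to_converge(eps) for eps in (0.02, 0.1, 0.5)}
for eps, k in counts.items():
    print(f"step size {eps}: {k} iterations")
```

All three step sizes reach the same tolerance in this toy case; the smaller ones simply need more iterations, which is consistent with the observation that different step sizes lead to similar maximum performance.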
5 Conclusions
We have proposed to use a novel form of denoising autoencoders for iterative inference in structured output tasks such as image segmentation. The autoencoder is trained to map corrupted predictions to target outputs and iterative inference interprets the difference between the output and the input as a direction of improved output configuration, given the input image.
The experiments provide positive evidence for the three questions raised at the beginning of Section 4: (1) a conditional DAE can be used successfully as the building block of iterative inference for image segmentation, (2) the proposed corruption model (based on the feedforward prediction) works better than the prescribed target output corruption, and (3) the resulting segmentation system outperforms state-of-the-art methods for obtaining coherent outputs.
Acknowledgments
The authors would like to thank the developers of Theano (Theano2016short), Lasagne (lasagne) and the dataset loader framework (Visin_dataset_loaders). We acknowledge the support of the following agencies for research funding and computing support: Imagia, CIFAR, Canada Research Chairs, Compute Canada and Calcul Québec, as well as NVIDIA for the generous GPU support. Special thanks to Laurent Dinh for useful discussions and support.
References
 [1] Guillaume Alain and Yoshua Bengio. What regularized autoencoders learn from the data generating distribution. In International Conference on Learning Representations (ICLR’2013), 2013.
 [2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015.
 [3] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision (ECCV), 2008.
 [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. 2015.
 [5] Michal Drozdzal, Gabriel Chartrand, Eugene Vorontsov, Lisa DiJorio, An Tang, Adriana Romero, Yoshua Bengio, Chris Pal, and Samuel Kadoury. Learning normalized inputs for iterative estimation in medical image segmentation. CoRR, abs/1702.05174, 2017.
 [6] Michal Drozdzal, Eugene Vorontsov, Gabriel Chartrand, Samuel Kadoury, and Chris Pal. The importance of skip connections in biomedical image segmentation. CoRR, abs/1608.04117, 2016.
 [7] F. Visin and A. Romero. Dataset loaders: a python library to load and preprocess datasets. https://github.com/fvisin/dataset_loaders, 2017.
 [8] Carlo Gatta, Adriana Romero, and Joost van de Weijer. Unrolling loopy top-down semantic feedback in convolutional deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshop, 2014.
 [9] Spyros Gidaris and Nikos Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. CoRR, abs/1612.04770, 2016.
 [10] Klaus Greff, Rupesh Kumar Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. CoRR, abs/1612.07771, 2016.
 [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [12] Xuming He, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’04, pages 695–703, Washington, DC, USA, 2004. IEEE Computer Society.
 [13] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
 [14] Simon Jégou, Michal Drozdzal, David Vázquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Workshop on Computer Vision in Vehicle Technology CVPRW, 2017.
 [15] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. 2011.
 [16] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
 [17] Lasagne. Lasagne. https://github.com/Lasagne/Lasagne, 2016.
 [18] Ke Li, Bharath Hariharan, and Jitendra Malik. Iterative instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3659–3667, 2016.
 [19] Qianli Liao and Tomaso A. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. CoRR, abs/1604.03640, 2016.
 [20] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. 2015.
 [21] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. CoRR, abs/1612.00005, 2016.
 [22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
 [23] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. International Conference on Learning Representations, 2017.
 [24] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
 [25] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381:520, 1996.
 [26] T. Tieleman and G. Hinton. rmsprop adaptive learning. In COURSERA: Neural Networks for Machine Learning, 2012.
 [27] Steven Vanmarcke, Filip Calders, and Johan Wagemans. The time-course of ultrarapid categorization: The influence of scene congruency and top-down processing. i-Perception, 2016.
 [28] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, July 2011.
 [29] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December 2010.
 [30] Francesco Visin, Marco Ciccone, Adriana Romero, Kyle Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, and Aaron Courville. Reseg: A recurrent neural network-based model for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshop, 2016.
 [31] Saining Xie, Xun Huang, and Zhuowen Tu. Top-Down Learning for Structured Labeling with Convolutional Pseudoprior, pages 302–317. Springer International Publishing, Cham, 2016.
 [32] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. 2016.
 [33] Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 612–621, 2016.
 [34] Junbo Jake Zhao, Michaël Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. CoRR, abs/1506.02351, 2015.
 [35] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.