Anatomical Priors for Image Segmentation via Post-Processing with Denoising Autoencoders
Abstract
Deep convolutional neural networks (CNNs) have proved highly accurate for anatomical segmentation of medical images. However, some of the most popular CNN architectures for image segmentation still rely on postprocessing strategies (e.g. Conditional Random Fields) to incorporate connectivity constraints into the resulting masks. These postprocessing steps are based on the assumption that objects are usually continuous and therefore nearby pixels should be assigned the same object label. Although this assumption is valid in general, these methods do not offer a straightforward way to incorporate more complex priors like convexity or arbitrary shape restrictions.
In this work we propose PostDAE, a postprocessing method based on denoising autoencoders (DAE) trained using only segmentation masks. We learn a low-dimensional space of anatomically plausible segmentations, and use it as a postprocessing step to impose shape constraints on the masks obtained with arbitrary segmentation methods. Our approach is independent of image modality and intensity information since it employs only segmentation masks for training. This enables the use of anatomical segmentations that do not need to be paired with intensity images, making the approach very flexible. Our experimental results on anatomical segmentation of X-ray images show that PostDAE can improve the quality of noisy and incorrect segmentation masks obtained with a variety of standard methods, by bringing them back to a feasible space, with almost no extra computational time.
Keywords:
anatomical segmentation, autoencoders, convolutional neural networks, learning representations, postprocessing
1 Introduction
Segmentation of anatomical structures is a fundamental task for biomedical image analysis. It constitutes the first step in several medical procedures such as shape analysis for population studies, computer-assisted diagnosis and automatic radiotherapy planning, among many others. The accuracy and anatomical plausibility of these segmentations are therefore of paramount importance, since they necessarily influence the overall quality of such procedures.
In recent years, convolutional neural networks (CNNs) have proved highly accurate for segmentation of biomedical images [1, 2, 3]. One of the tricks that enables the use of CNNs on large images (by reducing the number of learned parameters) is known as the parameter sharing scheme. The idea is that, at every layer, shared parameters are used to learn new representations of the input data along the whole image. These parameters (also referred to as weights or kernels) are successively convolved with the input data, resulting in more abstract representations. This trick is especially useful for tasks like image classification, where invariance to translation is a desired property since objects may appear in any location. However, for anatomical structures in medical images, whose location tends to be highly regular, this property leads to incorrect predictions in areas with similar intensities when not enough contextual information is considered. Shape and topology also tend to be preserved in anatomical images of the same type. However, as discussed in [4], the pixel-level predictions of most CNN architectures are not designed to account for higher-order topological properties.
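To make the effect of parameter sharing concrete, the following minimal sketch counts the learned parameters of one convolutional layer and of one fully connected layer applied to the same image. The layer sizes (a 3x3 convolution with 16 kernels, a dense layer with 16 outputs, a 1024x1024 single-channel input) are illustrative assumptions, not values taken from this section.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel; independent of image size."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# A 3x3 convolution with 16 kernels on a single-channel image (assumed sizes).
conv_count = conv2d_params(3, 3, 1, 16)

# A fully connected layer producing 16 values from a 1024x1024 image.
dense_count = dense_params(1024 * 1024, 16)

print(conv_count, dense_count)
```

The convolutional count does not change with the image size, whereas the dense count grows linearly with the number of pixels.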
Before the advent of CNNs, other classical learning-based segmentation methods were popular for this task (e.g. Random Forest (RF) [5]), some of which are still being used, especially when the amount of annotated data is not enough to train deep CNNs. The pixel-level predictions of these approaches are also influenced by image patches of fixed size. In these cases, hand-crafted features are extracted from image patches and used to train a classifier, which predicts the class corresponding to the central pixel in that patch. These methods suffer from the same limitations related to the lack of shape and topological information discussed before.
In this work, we introduce PostDAE (postprocessing with denoising autoencoders), a postprocessing method which produces anatomically plausible segmentations by improving pixel-level predictions coming from arbitrary classifiers (e.g. CNNs or RF), incorporating shape and topological priors. We employ Denoising Autoencoders (DAE) to learn compact and non-linear representations of anatomical structures, using only segmentation masks. This model is then applied as a postprocessing method for image segmentation, bringing arbitrary and potentially erroneous segmentation masks into an anatomically plausible space (see Figure 1).
Contributions. Our contributions are threefold: (i) we show, for the first time, that DAEs can be used as an independent postprocessing step to correct problematic and non-anatomically plausible masks produced by arbitrary segmentation methods; (ii) we design a method that can be trained using segmentation-only datasets or anatomical masks coming from arbitrary image modalities, since the DAE is trained using only segmentation masks and no intensity information is required during learning; (iii) we validate PostDAE in the context of lung segmentation in X-ray images, benchmarking it against another classical postprocessing method and showing its robustness by improving segmentation masks coming from both CNN and RF-based classifiers.
Related works. One popular strategy to incorporate prior knowledge about shape and topology into medical image segmentation is to modify the loss used to train the model. The work of [4] incorporates high-order regularization through a topology-aware loss function. The main disadvantage is that such a loss function is constructed ad hoc for every dataset, requiring the user to manually specify the topological relations between the semantic classes through a topological validity table. More similar to our work are those by [6, 7], where an autoencoder (AE) is used to learn lower-dimensional representations of image anatomy. The AE is used to define a loss term that imposes anatomical constraints during training. The main disadvantage of these approaches is that they can only be used during training of CNN architectures. Other methods like RF-based segmentation cannot be improved through this technique. In contrast, our method postprocesses arbitrary segmentation masks. Therefore, it can be used to improve results obtained with any segmentation method, even those methods which do not rely on an explicit training phase (e.g. level-set methods).
Postprocessing methods have also been considered in the literature. In [3], the output CNN scores are considered as unary potentials of a Markov random field (MRF) energy minimization problem, where spatial homogeneity is propagated through pairwise relations. Similarly, [2] uses a fully connected conditional random field (CRF) as a postprocessing step. However, as stated by [2], finding a global set of parameters for the graphical models which can consistently improve the segmentation of all classes remains a challenging problem. Moreover, these methods do not incorporate shape priors. Instead, they are based on the assumption that objects are usually continuous and therefore nearby pixels (or pixels with similar appearance) should be assigned the same object label. Conversely, our postprocessing method makes use of a DAE to impose shape priors, transforming any segmentation mask into an anatomically plausible one.
2 Anatomical Priors for Image Segmentation via Post-Processing with DAE
Problem statement. Given a dataset of unpaired anatomical segmentation masks (unpaired in the sense that no intensity image associated with each segmentation mask is required), we aim at learning a model that can bring segmentations predicted by arbitrary classifiers into an anatomically feasible space. We stress the fact that our method works as a postprocessing step in the space of segmentations, making it independent of the predictor, image intensities and modality. We employ denoising autoencoders (DAE) to learn such a model.
Denoising autoencoders. DAEs are neural networks designed to reconstruct a clean input from a corrupted version of it [8]. In our case, they are used to reconstruct anatomically plausible segmentation masks from corrupted or erroneous ones. The standard architecture for an autoencoder follows an encoder-decoder scheme (see the Sup. Mat. for a detailed description of the architecture used in this work). The encoder $f_{enc}$ is a mapping that transforms the input into a hidden representation $h$. In our case, it consists of successive non-linearities, pooling and convolutional layers, with a final fully connected layer that concentrates all information into a low-dimensional code $h$. This code is then fed into the decoder $f_{dec}$, which maps it back to the original input dimensions through a series of up-convolutions and non-linearities. The output of $f_{dec}$ has the same size as the input $S_i$.
The model is called a denoising autoencoder because a degradation function $\phi$ is used to degrade the ground-truth segmentation masks $S_i$, producing noisy segmentations $\hat{S}_i = \phi(S_i)$ used for training. The model is trained to minimize the reconstruction error measured by a loss function based on the Dice coefficient (DSC), a metric used to compare the quality of predicted segmentations with respect to the ground truth (we refer the reader to [9] for a complete description of the Dice loss):
$$\mathcal{L}_{DAE}(S_i) = \mathcal{L}_{Dice}\big(f_{dec}(f_{enc}(\phi(S_i))),\, S_i\big) \qquad (1)$$
The dimensionality of the learned representation is much lower than the input, producing a bottleneck effect which forces the code to retain as much information as possible about the input. In that way, minimizing the reconstruction error amounts to maximizing a lower bound on the mutual information between input and the learnt representation [8].
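The Dice-based reconstruction loss described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the soft Dice formula with a plain (non-squared) denominator and the smoothing constant `eps` are assumptions, and `degrade`, `encoder` and `decoder` are hypothetical callables standing in for the degradation function and the two halves of the trained network.

```python
import numpy as np


def soft_dice_loss(pred, target, eps=1e-6):
    """1 - soft Dice overlap between two arrays with values in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred * target)
    denominator = np.sum(pred) + np.sum(target)
    # eps (an assumed constant) keeps the ratio defined when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def dae_loss(clean_mask, degrade, encoder, decoder):
    """Dice loss between a clean mask and the reconstruction of its degraded copy."""
    reconstruction = decoder(encoder(degrade(clean_mask)))
    return soft_dice_loss(reconstruction, clean_mask)
```

The loss is computed against the clean mask rather than the degraded one, which is what makes the autoencoder a denoising one.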
Mask degradation strategy. The masks used to train the DAE were artificially degraded during training to simulate erroneous segmentations. To this end, we randomly apply the following degradation functions to the ground-truth masks: (i) addition and removal of random geometric shapes (circles, ellipses, lines and rectangles) to simulate over- and under-segmentations; (ii) morphological operations (e.g. erosion, dilation, etc.) with variable kernels to perform more subtle mask modifications; and (iii) random swapping of foreground-background labels in the pixels close to the mask borders.
Postprocessing with DAEs. The proposed method is rooted in the so-called manifold assumption [10], which states that natural high-dimensional data (like anatomical segmentation masks) concentrate close to a non-linear low-dimensional manifold. We learn such a low-dimensional, anatomically plausible manifold using the aforementioned DAE. Then, given a segmentation mask obtained with an arbitrary predictor (e.g. CNN or RF), we project it onto that manifold using the encoder and reconstruct the corresponding anatomically feasible mask with the decoder. Unlike other methods like [6, 7], which incorporate the anatomical priors while training the segmentation network, we choose to make it a postprocessing step. In that way, we achieve independence with respect to the initial predictor, and enable improvement for arbitrary segmentation methods.
Our hypothesis (empirically validated by the following experiments) is that masks which are far from the anatomical space will be mapped to a similar, but anatomically plausible, segmentation. Meanwhile, masks which are anatomically correct will be mapped to themselves, incurring almost no modification.
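The three families of degradations listed above can be sketched with NumPy alone. This is a simplified illustration under stated assumptions, not the authors' code: only rectangles are drawn (circles, ellipses and lines are omitted), the structuring element is fixed to 3x3, the flip probability `prob` is arbitrary, and all function names are hypothetical. Masks are assumed to be binary `uint8` arrays with at least 4 pixels per side.

```python
import numpy as np


def add_or_remove_rectangle(mask, rng):
    """Paint a random rectangle as foreground or background."""
    out = mask.copy()
    h, w = out.shape
    y0, x0 = rng.integers(0, h), rng.integers(0, w)
    dy, dx = rng.integers(1, h // 4 + 1), rng.integers(1, w // 4 + 1)
    out[y0:y0 + dy, x0:x0 + dx] = rng.integers(0, 2)
    return out


def dilate(mask):
    """Binary dilation with a 3x3 square structuring element."""
    padded = np.pad(mask, 1)
    h, w = mask.shape
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def erode(mask):
    """Binary erosion, written as the dual of dilation."""
    return 1 - dilate(1 - mask)


def flip_border_pixels(mask, rng, prob=0.3):
    """Swap labels of pixels near the mask border with probability prob."""
    border = dilate(mask) - erode(mask)
    flips = (rng.random(mask.shape) < prob) & (border == 1)
    return np.where(flips, 1 - mask, mask)


def degrade(mask, rng):
    """Apply one randomly chosen degradation to a binary uint8 mask."""
    choice = rng.integers(0, 4)
    if choice == 0:
        return add_or_remove_rectangle(mask, rng)
    if choice == 1:
        return dilate(mask)
    if choice == 2:
        return erode(mask)
    return flip_border_pixels(mask, rng)
```

A generator created with `np.random.default_rng(seed)` can be passed as `rng` so that the degradations are reproducible across training runs.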
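At inference time, the postprocessing step amounts to a single pass through the trained autoencoder. The sketch below shows that flow under stated assumptions: `encoder` and `decoder` are hypothetical callables wrapping the two halves of a trained DAE, and the final binarization at 0.5 is an assumed choice that the text does not specify.

```python
import numpy as np


def post_process(mask, encoder, decoder, threshold=0.5):
    """Project a predicted mask through a trained DAE and binarize the output."""
    mask = np.asarray(mask, dtype=np.float32)
    code = encoder(mask)              # low-dimensional representation
    reconstruction = decoder(code)    # soft mask with the input's shape
    if reconstruction.shape != mask.shape:
        raise ValueError("decoder output must match the input mask shape")
    # The 0.5 default is an assumption, not a value given in the text.
    return (reconstruction >= threshold).astype(np.uint8)
```

Because the function only sees a mask, the same call can be applied to the output of a CNN, an RF classifier or any other predictor.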
3 Experiments and Discussion
Database description. We benchmark the proposed method in the context of lung segmentation in X-ray images, using the Japanese Society of Radiological Technology (JSRT) database [11]. JSRT is a public database containing 247 PA chest X-ray images with expert segmentation masks, of 2048x2048 pixels and isotropic spacing of 0.175 mm/pixel, which are downsampled to 1024x1024 in our experiments. Lungs present high variability among subjects, making the representation learning task especially challenging. We split the database into three folds: 70% for training, 10% for validation and 20% for testing.
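A 70/10/20 partition of the 247 images can be sketched as follows. The shuffling seed and the rounding of fold sizes (truncation, with the remainder assigned to the test fold) are assumptions, since the text does not state how the folds were drawn.

```python
import random


def split_indices(n_items, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle item indices and cut them into train/validation/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n_items)
    n_val = int(val_frac * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test fold
    return train, val, test


train, val, test = split_indices(247)
print(len(train), len(val), len(test))
```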
Postprocessing with CRF. We compare PostDAE with the state-of-the-art (SOA) postprocessing method based on a fully connected CRF [12]. The CRF is used to impose connectivity constraints on a given segmentation, based on the assumption that objects are usually continuous and nearby pixels with similar appearance should be assigned the same object label. We use an efficient implementation of a dense CRF that can handle large pixel neighbourhoods in reasonable inference times (we used the public implementation available at https://github.com/lucasbeyer/pydensecrf with a Potts compatibility function and hand-tuned parameters chosen using the validation fold; see the implementation website for more details about these parameters). Unlike our method, which uses only binary segmentations for postprocessing, the CRF incorporates intensity information from the original images. Therefore, it has to be readjusted depending on the image properties of every dataset. Instead, our method is trained once and can be used independently of the image properties. Note that we do not compare PostDAE with other methods like [6, 7] which incorporate anatomical priors while training the segmentation method itself, since these are not postprocessing strategies.
Baseline segmentation methods. We train two different models which produce segmentation masks of various qualities to benchmark our postprocessing method. The first model is a CNN based on the UNet architecture [1] (see the Sup. Mat. for a detailed description of the architecture and the training parameters such as optimizer, learning rate, etc.). The UNet was implemented in Keras and trained on a GPU using a Dice loss function. To evaluate the effect of PostDAE on different masks, we save the UNet model every 5 epochs during training, and predict segmentation masks for the test fold using all these models. The second method is an RF classifier trained using intensity and texture features. We used Haralick features [13], which are based on gray-level co-occurrence in image patches. We adopted a public implementation with default parameters, which produces acceptable segmentation masks (the source code and a complete description of the method are publicly available online at https://github.com/dgriffiths3/ml_segmentation).
Results and discussion. Figure 2 shows some visual examples while Figure 3 summarizes the quantitative results (see the video in the Sup. Mat. for more visual results). Both figures show the consistent improvement that can be obtained using PostDAE as a postprocessing step, especially in low-quality segmentation masks like those obtained by the RF model and the UNet trained for only 5 epochs. In these cases, substantial improvements are obtained in terms of Dice coefficient and Hausdorff distance, by bringing the erroneous segmentation masks into an anatomically feasible space. For segmentations that are already of good quality (like the UNet trained until convergence), the postprocessing significantly improves the Hausdorff distance, by erasing spurious segmentations (holes in the lung and small isolated blobs) that remain even in well-trained models. When compared with CRF postprocessing, PostDAE significantly outperforms the baseline in the context of anatomical segmentation. In terms of running time, the CRF model takes 1.3 seconds on an Intel i7-7700 CPU, while PostDAE takes 0.7 seconds on a Titan Xp GPU.
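The two evaluation metrics mentioned above can be sketched in NumPy as follows. This is an illustrative brute-force version and not necessarily how the reported numbers were computed: the Hausdorff distance here uses all foreground pixels rather than contours, and the `spacing` argument for converting pixels to millimetres is an assumption.

```python
import numpy as np


def dice_coefficient(a, b):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total


def hausdorff_distance(a, b, spacing=1.0):
    """Symmetric Hausdorff distance between the foreground pixels of two masks."""
    pts_a = np.argwhere(np.asarray(a, dtype=bool)).astype(np.float64)
    pts_b = np.argwhere(np.asarray(b, dtype=bool)).astype(np.float64)
    if len(pts_a) == 0 or len(pts_b) == 0:
        raise ValueError("both masks need at least one foreground pixel")
    # Pairwise Euclidean distances; memory grows as len(pts_a) * len(pts_b).
    diff = pts_a[:, None, :] - pts_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    directed_ab = dist.min(axis=1).max()
    directed_ba = dist.min(axis=0).max()
    return spacing * max(directed_ab, directed_ba)
```

A single isolated foreground pixel far from the organ barely changes the Dice coefficient but sets the Hausdorff distance to its distance from the organ, which is consistent with the observation that removing small blobs mainly improves the Hausdorff distance.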
One of the limitations of PostDAE is related to data regularity. For anatomical structures like lung, heart or liver, even though we find high inter-subject variability, the segmentation masks are fairly uniform in terms of shape and topology. Even pathological organs tend to have similar structure, which can be well encoded by the DAE (especially if pathological cases are seen during training). However, in other cases like brain lesions or tumors, where shape is not that regular, it is not clear how PostDAE would perform. This case lies outside the scope of this paper, but will be explored as future work.
Conclusions and future works. In this work we have shown, for the first time in the MIC community, that autoencoders can be used as an independent postprocessing step to incorporate anatomical priors into arbitrary segmentation methods. PostDAE can be easily implemented, is fast at inference, can cope with arbitrary shape priors and is independent of the image modality and segmentation method. In the future, we plan to extend this method to multi-class and volumetric segmentation cases (like anatomical segmentation in brain images).
4 Acknowledgments
EF is a beneficiary of an AXA Research Fund grant. The authors gratefully acknowledge NVIDIA Corporation for the donation of the Titan Xp GPU used for this research, and the support of UNL (CAIDPIC50420150100098LI) and ANPCyT (PICT 20160651).
5 Appendix A: Model details
UNet details: The UNet model (see Table 1) receives a 1024x1024 gray image as input and was trained using the soft Dice loss [9], a batch size of 4, and the Adam optimizer with learning rate 1e-5 and the other parameters set to the Keras defaults. We also used dropout for regularization, including a dropout layer after layer with keep probability p=0.5.
PostDAE: PostDAE (see Table 2) receives a 1024x1024 binary segmentation as input. The network was also trained to minimize the Dice loss function using the Adam optimizer. The best performance was achieved with a learning rate of 0.0001, a batch size of 15 and 150 epochs.
Kernel  Stride  #Kernels  NonLin  
L1 
Conv  (f:3,3)  (s:1,1)  (N:16)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:16)  ReLu  
Max Pooling  (f:2,2)  (s:2,2)  
L2  Conv  (f:3,3)  (s:1,1)  (N:32)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:32)  ReLu  
Max Pooling  (f:2,2)  (s:2,2)  
L3  Conv  (f:3,3)  (s:1,1)  (N:64)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:64)  ReLu  
Max Pooling  (f:2,2)  (s:2,2)  
L4  Conv  (f:3,3)  (s:1,1)  (N:128)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:128)  ReLu  
Max Pooling  (f:2,2)  (s:2,2)  
L5  Conv  (f:3,3)  (s:1,1)  (N:256)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:256)  ReLu  
L6  UpConv  (f:3,3)  (s:1,1)  (N:128)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:128)  ReLu  
Conv  (f:3,3)  (s:1,1)  (N:128)  ReLu  
L7  UpConv  (f:3,3)  (s:1,1)  (N:64)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:64)  ReLu  
Conv  (f:3,3)  (s:1,1)  (N:64)  ReLu  
L8  UpConv  (f:3,3)  (s:1,1)  (N:32)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:32)  ReLu  
Conv  (f:3,3)  (s:1,1)  (N:32)  ReLu  
L9  UpConv  (f:3,3)  (s:1,1)  (N:16)  ReLu 
Conv  (f:3,3)  (s:1,1)  (N:16)  ReLu  
Conv  (f:3,3)  (s:1,1)  (N:16)  ReLu  
L10  Conv  (f:3,3)  (s:1,1)  (N:2)  ReLu 
Conv  (f:1,1)  (s:1,1)  (N:1)  Sigmoid  

Table 2. PostDAE architecture.
Layer  Operation  Kernel   Stride   #Kernels  NonLin
L1     Conv       (f:3,3)  (s:2,2)  (N:16)    ReLU
L1     Conv       (f:3,3)  (s:1,1)  (N:16)    ReLU
L2     Conv       (f:3,3)  (s:2,2)  (N:32)    ReLU
L2     Conv       (f:3,3)  (s:1,1)  (N:32)    ReLU
L3     Conv       (f:3,3)  (s:2,2)  (N:32)    ReLU
L3     Conv       (f:3,3)  (s:1,1)  (N:32)    ReLU
L4     Conv       (f:3,3)  (s:2,2)  (N:32)    ReLU
L4     Conv       (f:3,3)  (s:1,1)  (N:32)    ReLU
L5     Conv       (f:3,3)  (s:2,2)  (N:32)    ReLU
L6     FC         -        -        (N:512)   None
L7     FC         -        -        (N:1024)  ReLU
L8     UpConv     (f:3,3)  (s:1,1)  (N:16)    ReLU
L8     Conv       (f:3,3)  (s:1,1)  (N:16)    ReLU
L9     UpConv     (f:3,3)  (s:1,1)  (N:16)    ReLU
L9     Conv       (f:3,3)  (s:1,1)  (N:16)    ReLU
L10    UpConv     (f:3,3)  (s:1,1)  (N:16)    ReLU
L10    Conv       (f:3,3)  (s:1,1)  (N:16)    ReLU
L11    UpConv     (f:3,3)  (s:1,1)  (N:16)    ReLU
L11    Conv       (f:3,3)  (s:1,1)  (N:16)    ReLU
L12    UpConv     (f:3,3)  (s:1,1)  (N:16)    ReLU
L12    Conv       (f:3,3)  (s:1,1)  (N:1)     Sigmoid
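The size of the bottleneck implied by Table 2 can be checked with a few lines of arithmetic. The sketch assumes 'same' padding in every convolution, so that only the stride changes the spatial size; the padding mode is not stated in the table.

```python
def encoder_output_size(input_size, strides):
    """Spatial size after a chain of 'same'-padded convolutions."""
    size = input_size
    for stride in strides:
        size = -(-size // stride)  # ceiling division
    return size


# Strides of the encoder convolutions L1 to L5 as listed in Table 2.
strides = [2, 1, 2, 1, 2, 1, 2, 1, 2]
side = encoder_output_size(1024, strides)
flattened = side * side * 32          # 32 kernels in the last encoder layer
compression = (1024 * 1024) / 512     # input pixels per code dimension

print(side, flattened, compression)
```

Under that assumption, the five stride-2 convolutions reduce the 1024x1024 input to a 32x32 feature map before the first fully connected layer, and the 512-dimensional code holds about 2048 times fewer values than the input mask.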
References
 [1] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proc. of MICCAI. (2015)
 [2] Kamnitsas, K., et al.: Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis 36 (2017) 61–78
 [3] Shakeri, M., et al.: Subcortical brain structure segmentation using FCNN’s. In: Proc. of ISBI. (2016)
 [4] BenTaieb, A., Hamarneh, G.: Topology aware fully convolutional networks for histology gland segmentation. In: Proc. of MICCAI. (2016)
 [5] Breiman, L.: Random forests. Machine learning 45(1) (2001) 5–32
 [6] Oktay, O., et al.: Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE TMI 37(2) (2018) 384–395
 [7] Ravishankar, H., et al.: Learning and incorporating shape models for semantic segmentation. In: Proc. of MICCAI. (2017)
 [8] Vincent, P., et al.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR 11 (2010) 3371–3408
 [9] Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: Proc. of Fourth International Conference on 3D Vision (3DV). (2016)
 [10] Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised learning. MIT Press (2009)
 [11] Shiraishi, J., et al.: Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. Am Jour of Roent 174(1) (2000) 71–74
 [12] Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Proc. of NIPS. (2011)
 [13] Haralick, R.M., Shanmugam, K., et al.: Textural features for image classification. IEEE Transactions on systems, man, and cybernetics (6) (1973) 610–621