HyperDense-Net: A densely connected CNN for multi-modal image segmentation

Jose Dolz, Ismail Ben Ayed, Jing Yuan, Christian Desrosiers
Laboratory for Imagery, Vision and Artificial Intelligence, École de Technologie Supérieure, Montreal, Canada
Xidian University, School of Mathematics and Statistics, Xi’an, China

Neonatal brain segmentation in magnetic resonance (MR) images is a challenging problem due to poor image quality and low contrast between white and gray matter regions. Most existing approaches for this problem are based on multi-atlas label fusion strategies, which are time-consuming and sensitive to registration errors. As an alternative to these methods, we propose a hyper-densely connected 3D convolutional neural network that employs MR-T1 and T2 images as input, which are processed independently in two separate paths. An important difference with previous densely connected networks is the use of direct connections between layers from the same and different paths. Adopting such dense connectivity helps the learning process by including deep supervision and improving gradient flow. We evaluated our approach on data from the MICCAI Grand Challenge on 6-month infant Brain MRI Segmentation (iSEG), obtaining very competitive results. Among 21 teams, our approach ranked first or second in most metrics, which translates into state-of-the-art performance.


Segmentation, deep learning, brain image, dense networks

1 Introduction

The precise segmentation of infant brain images into white matter (WM), gray matter (GM) and cerebrospinal fluid (CSF) during the first year of life is of great importance in the study of early brain development. Recognizing particular brain abnormalities shortly after birth might allow the prediction of neuro-developmental disorders. To that end, magnetic resonance imaging (MRI) is the preferred modality for imaging the neonatal brain because it is safe, non-invasive and provides cross-sectional views of the brain in multiple contrasts. Nevertheless, neonatal brain segmentation in MRI is a challenging problem due to several factors, such as reduced tissue contrast, increased noise, motion artifacts and ongoing white matter myelination in infants.

To address this problem, a wide variety of methods have been proposed [1]. A popular approach uses multiple atlases to model the anatomical variability of brain tissues [2, 3]. However, the performance of techniques based solely on atlas fusion is somewhat limited. Label propagation or adaptive methods like parametric or deformable models [4] can be applied to refine prior estimates of tissue probability [4]. Nevertheless, an important drawback of using such approaches for infant brain segmentation is the high risk of error due to high spatial variability in the neonatal population. Moreover, to obtain accurate segmentations, these methods typically require a large number of annotated images, which is time-consuming and requires extensive expertise.

In recent years, deep learning methods have been proposed as an efficient alternative to the aforementioned approaches. In particular, convolutional neural networks (CNNs) have been employed successfully to address various medical image segmentation problems, achieving state-of-the-art performance in a broad range of applications [5, 6, 7], including infant brain tissue segmentation [8, 9, 10]. For instance, a multi-scale 2D CNN architecture is proposed in [8] to obtain accurate and spatially-consistent segmentations from a single image modality.

To overcome the problem of extremely low tissue contrast between WM and GM, various works have considered multiple modalities as input to a CNN. In [9], MR-T1, T2 and fractional anisotropy (FA) images are merged in the input of the network. Similarly, Nie et al. [10] proposed a fully convolutional neural network (FCNN), where these image modalities are processed in three independent paths, and their corresponding features are later fused for the final segmentation. Yet, these approaches present some significant limitations. First, some architectures [8, 9] adopt a sliding-window strategy where regions defined by the window are processed one by one. This leads to low efficiency and to a non-structured prediction, which reduces the segmentation accuracy. Second, these methods often employ 2D patches as input to the network, completely discarding anatomic context in directions orthogonal to the 2D plane. As shown in [6], considering 3D convolutions instead of 2D ones results in a better segmentation.

In light of the above-mentioned challenges and limitations, we propose a hyper-densely connected 3D fully convolutional network, called HyperDenseNet, for the voxel-level segmentation of the infant brain in MR-T1 and T2 images. Unlike the methods presented in [8, 9, 10], our network can incorporate 3D context and volumetric cues for effective volume prediction. The proposed HyperDenseNet network also extends our recent work in [6] by exploiting dense connections in a multi-modal image scenario. This dense connectivity facilitates the learning process by including deep supervision and improving gradient flow. To the best of our knowledge, this is the first attempt to densely connect layers across multiple independent paths, each of them specifically designed for a different image modality. We validate the proposed network on data from the iSEG-2017 MICCAI Grand Challenge on 6-month infant brain MRI Segmentation, showing the state-of-the-art performance of our network.

2 Methodology

2.1 Single-path baseline

The architectures presented in this work are inspired by our recent work in [6], where we proposed a 3D fully convolutional neural network to segment subcortical brain structures. An important feature of that network was its ability to model both local and global context by embedding intermediate-layer outputs in the final prediction. This helped enforce consistency between features extracted at different scales, and embed fine-grained information directly in the segmentation process. Hence, outputs from intermediate convolutional layers (i.e., layers 3 and 6) were directly connected to the first fully connected layer (fully_conv_1; fully connected layers are replaced by a set of 1×1×1 convolutional filters).

As a baseline, we extend this semi-dense architecture to a fully-dense one, by connecting the output of all convolutional layers to fully_conv_1. In this network, MR-T1 and T2 are concatenated before the input of the CNN, and processed together via a single path. Table 1 shows the architecture of this baseline network, where each convolutional block is composed of batch normalization, a non-linearity (PReLU), and a convolution. Due to space limitations, we refer the reader to [6] for additional details.

| Layer | Conv. kernel | # kernels | Dropout |
|---|---|---|---|
| conv_1 | 3×3×3 | 25 | No |
| conv_2 | 3×3×3 | 25 | No |
| conv_3 | 3×3×3 | 25 | No |
| conv_4 | 3×3×3 | 50 | No |
| conv_5 | 3×3×3 | 50 | No |
| conv_6 | 3×3×3 | 50 | No |
| conv_7 | 3×3×3 | 75 | No |
| conv_8 | 3×3×3 | 75 | No |
| conv_9 | 3×3×3 | 75 | No |
| fully_conv_1 | 1×1×1 | 400 | Yes |
| fully_conv_2 | 1×1×1 | 200 | Yes |
| fully_conv_3 | 1×1×1 | 150 | Yes |
| Classification | 1×1×1 | 4 | No |

Table 1: Layers used in the proposed architecture and their corresponding values. In the case of multi-modal images, convolutional layers (conv_x) are present in both paths of the network. All convolutional layers have a stride of one pixel.
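As a rough check on the sizes in Table 1, the sketch below adds up the feature maps that reach fully_conv_1 in the single-path baseline and computes the receptive field of the nine stacked 3×3×3 convolutions. It assumes stride-1 convolutions without dilation, so that each layer widens the receptive field by kernel − 1 voxels. The function and constant names are illustrative and are not taken from the released code.

```python
# Kernel counts of conv_1 ... conv_9, copied from Table 1.
CONV_KERNELS = [25, 25, 25, 50, 50, 50, 75, 75, 75]


def concatenated_channels(kernels):
    """Feature maps reaching fully_conv_1 when every conv output is connected to it."""
    return sum(kernels)


def receptive_field(n_layers, kernel=3):
    """Receptive field (per axis) of n stacked stride-1, undilated convolutions."""
    return 1 + n_layers * (kernel - 1)


print(concatenated_channels(CONV_KERNELS))  # 450
print(receptive_field(len(CONV_KERNELS)))   # 19
```

Under the further assumption that, in the two-path setting, the nine outputs of both paths all reach fully_conv_1, the channel count would double to 900.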

2.2 The proposed hyper-dense network

The concept of “the deeper the better” is considered a key principle in deep learning architectures [11]. Nevertheless, one obstacle when dealing with deep architectures is the problem of vanishing/exploding gradients, which hampers convergence during training. To address this limitation in very deep architectures, densely connected networks were proposed in [12]. DenseNets are built on the idea that adding direct connections from any layer to all subsequent layers in a feed-forward manner makes training easier and more accurate. This is motivated by three observations. First, there is an implicit deep supervision thanks to short paths to all feature maps in the architecture. Second, direct connections between all layers help to improve the flow of information and gradients throughout the entire network. Third, dense connections have a regularizing effect, which results in a reduced risk of over-fitting on tasks with smaller training sets.

Inspired by the recent success of such densely-connected networks in medical image segmentation [13, 14], we propose a hyper-dense architecture, called HyperDenseNet, for the segmentation of multi-modal images. Unlike the baseline model, where dense connections are employed through all the layers in a single stream, we exploit the concept of dense connectivity in a multi-modal image setting. In this scenario, each modality is processed in an independent path, and dense connections occur not only between layers within the same path, but also between layers in different paths.

Figure 1: A section of the proposed HyperDenseNet. Each gray region represents a convolutional block. Red arrows correspond to convolutions and black arrows indicate dense connections between feature maps.

The blocks composing our HyperDenseNet are similar to those in the baseline architecture. Let $x_l$ be the output of the $l^{th}$ layer. In CNNs, this vector is typically obtained from the output of the previous layer $x_{l-1}$ by a mapping $H_l$ composed of a convolution followed by a non-linear activation function:

$$x_l = H_l(x_{l-1})$$

In a densely-connected network, connectivity follows a pattern that iteratively concatenates all feature outputs in a feed-forward manner, i.e.

$$x_l = H_l([x_{l-1}, x_{l-2}, \ldots, x_0])$$

where $[\ldots]$ represents a concatenation operation.

Pushing this idea further, HyperDenseNet considers a more sophisticated connectivity pattern that also links the output from layers in different streams, each one associated with a different image modality. Denote as $x_l^1$ and $x_l^2$ the outputs of the $l^{th}$ layer in streams 1 and 2, respectively. The output of the $l^{th}$ layer in a stream $s$ can then be defined as

$$x_l^s = H_l^s([x_{l-1}^1, x_{l-1}^2, x_{l-2}^1, x_{l-2}^2, \ldots, x_0^1, x_0^2])$$

A section of the proposed architecture is depicted in Figure 1, where each gray region represents a convolutional block. For simplicity, we assume that red arrows indicate convolution operations only, and that black arrows represent direct connections between feature maps from different layers. Thus, the input of each convolutional block (maps before the red arrow) consists of the concatenation of the outputs (maps after the red arrow) of all preceding layers from both paths.
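To make the difference between single-stream dense connectivity and the hyper-dense pattern concrete, the following sketch counts how many feature maps enter each convolutional block in both cases. It is only bookkeeping under stated assumptions: the kernel counts come from Table 1, the raw input is taken to be part of every concatenation, and any cropping needed to align feature-map sizes is ignored. The helper name `block_input_channels` is hypothetical.

```python
def block_input_channels(kernels, input_channels, n_streams):
    """Number of feature maps entering each convolutional block.

    Every block receives the network input plus the outputs of all
    preceding blocks, gathered from every stream.
    """
    counts = []
    available = input_channels * n_streams  # raw inputs of all streams
    for k in kernels:
        counts.append(available)
        available += k * n_streams  # each stream adds k new maps
    return counts


KERNELS = [25, 25, 25, 50, 50, 50, 75, 75, 75]

# Single stream fed with T1 and T2 stacked as two input channels.
dense = block_input_channels(KERNELS, input_channels=2, n_streams=1)
# Two streams, one single-channel modality each, sharing all outputs.
hyper = block_input_channels(KERNELS, input_channels=1, n_streams=2)

print(dense)  # [2, 27, 52, 77, 127, 177, 227, 302, 377]
print(hyper)  # [2, 52, 102, 152, 252, 352, 452, 602, 752]
```

Under these assumptions, the input of each block in a hyper-dense stream grows roughly twice as fast with depth as in the single-stream case, which is the cost of the additional cross-modality connections.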

2.2.1 Training parameters and implementation details

To have a large receptive field, FCNNs typically expect full images as input. The number of parameters is then limited via pooling/unpooling layers. A problem with this approach is the loss of resolution from repeated down-sampling operations. In the proposed method, we follow the technique described in [5, 6], where sub-volumes are used as input and pooling layers are avoided. As in [5, 6], the sub-volumes used during inference differ in size from those considered during training.

To initialize the weights of the network, we adopted the strategy proposed in [15] that allows very deep architectures to converge rapidly. In this strategy, a zero-mean Gaussian distribution of standard deviation $\sqrt{2/n_l}$ is used to initialize the weights in layer $l$, where $n_l$ denotes the number of connections to units in that layer. Momentum was set to 0.6 and the initial learning rate to 0.001, the latter being reduced by a factor of 2 after every 5 epochs (starting from epoch 10). Network parameters are optimized via the RMSprop optimizer, with cross-entropy as the cost function. The network was trained for 30 epochs, each one composed of 20 subepochs. At each subepoch, a total of 1000 samples were randomly selected from the training images and processed in batches of size 5.
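The initialization and learning-rate schedule described above can be sketched as follows. The standard deviation follows the rule of [15], usually written as the square root of 2 divided by the number of incoming connections. The schedule is one possible reading of the text, assuming the first halving happens at epoch 10 and the next ones at epochs 15, 20 and 25; the function names and the exact epoch indexing are assumptions rather than details from the released code.

```python
import math


def he_std(fan_in):
    """Standard deviation of the zero-mean Gaussian used to draw the weights."""
    return math.sqrt(2.0 / fan_in)


def learning_rate(epoch, initial=0.001, first_drop=10, every=5, factor=2.0):
    """Step schedule: constant until `first_drop`, then divided by `factor`
    at `first_drop` and again after every `every` further epochs."""
    if epoch < first_drop:
        return initial
    n_drops = 1 + (epoch - first_drop) // every
    return initial / factor ** n_drops


# A 3x3x3 kernel applied to 25 input feature maps has 27 * 25 incoming connections.
print(he_std(3 * 3 * 3 * 25))
for epoch in (0, 9, 10, 15, 29):
    print(epoch, learning_rate(epoch))
```

With the reported settings, each subepoch amounts to 1000 / 5 = 200 batches, so the full run corresponds to 30 × 20 × 200 = 120,000 batches, assuming one parameter update per batch.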

We extended our 3D FCNN architecture proposed in [6], whose source code can be found at https://www.github.com/josedolz/LiviaNET and which is based on Theano. Training and testing were performed on a server equipped with an NVIDIA Tesla P100 GPU with 16 GB of memory. Training HyperDenseNet took around 70 min per epoch, and around 35 hours in total. Segmenting a whole 3D MR scan requires 70-80 seconds on average.

3 Experiments and results

3.1 Dataset

The dataset employed in this study is publicly available from the iSEG Grand MICCAI Challenge (http://iseg2017.web.unc.edu/). Selected scans for training and testing were acquired at UNC-Chapel Hill and were randomly chosen from the pilot study of the Baby Connectome Project (BCP; http://babyconnectomeproject.org). All scans were acquired on a Siemens head-only 3T scanner with a circular polarized head coil. During the scan, infants were asleep, unsedated, fitted with ear protection, and their heads were secured in a vacuum-fixation device.

T2 images were linearly aligned to their corresponding T1 images. All images were resampled to an isotropic 1×1×1 mm³ resolution. Using in-house tools, standard image pre-processing steps were then applied before manual segmentation, including skull stripping, intensity inhomogeneity correction, and removal of the cerebellum and brain stem. We used 9 subjects for training the network, one for validation and 13 subjects for testing.

3.2 Results

To demonstrate the benefits of the proposed HyperDenseNet, Table 2 compares the segmentation accuracy of our architecture for CSF, GM and WM brain tissues, with that of the baseline. Three metrics are employed for evaluation: Dice Coefficient (DC), modified Hausdorff distance (MHD) and average symmetric distance (ASD). Higher DC values indicate greater overlap between automatic and manual contours, while lower MHD and ASD values indicate higher boundary similarity.
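As an illustration of the overlap metric, the sketch below computes the Dice coefficient for one tissue label from two flattened label maps, using the usual definition of twice the intersection divided by the sum of both region sizes. The label encoding (0 for background, 1 to 3 for the tissues) and the toy arrays are made up for the example and do not come from the challenge data; the distance-based metrics are not covered here.

```python
def dice_coefficient(prediction, reference, label):
    """Dice overlap for one label between two equally long label sequences."""
    if len(prediction) != len(reference):
        raise ValueError("prediction and reference must have the same length")
    pred = {i for i, v in enumerate(prediction) if v == label}
    ref = {i for i, v in enumerate(reference) if v == label}
    if not pred and not ref:
        return 1.0  # label absent from both maps: treated as perfect agreement
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


# Toy flattened label maps (hypothetical encoding: 1 = CSF, 2 = GM, 3 = WM).
automatic = [0, 1, 1, 2, 2, 3, 3, 3]
manual = [0, 1, 2, 2, 2, 3, 3, 1]

for tissue in (1, 2, 3):
    print(tissue, dice_coefficient(automatic, manual, tissue))
# 1 0.5
# 2 0.8
# 3 0.8
```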

| Tissue | Method | DC | MHD | ASD |
|---|---|---|---|---|
| CSF | Baseline | 0.953 (0.007) | **9.296 (0.942)** | 0.128 (0.016) |
| CSF | HyperDenseNet | **0.957 (0.007)** | 9.421 (1.392) | **0.119 (0.017)** |
| Gray Matter | Baseline | 0.916 (0.009) | 7.131 (1.729) | 0.346 (0.041) |
| Gray Matter | HyperDenseNet | **0.920 (0.008)** | **5.752 (1.078)** | **0.329 (0.041)** |
| White Matter | Baseline | 0.895 (0.015) | 6.903 (1.140) | 0.406 (0.051) |
| White Matter | HyperDenseNet | **0.901 (0.014)** | **6.659 (0.932)** | **0.382 (0.047)** |

Table 2: Mean segmentation values (standard deviation in parentheses) provided by the iSEG Challenge organizers for the two analyzed methods. The best performance for each metric is highlighted in bold.

Results in Table 2 show that HyperDenseNet outperforms the baseline. Specifically, our network yields better DC and ASD values than the baseline in all cases. Likewise, it achieves a lower MHD for GM and WM tissues. Considering standard deviations, the accuracy of HyperDenseNet shows less variance than the baseline, again in GM and WM regions. A paired-sample t-test between both configurations revealed that differences were statistically significant (p < 0.05) across all the results, except for the MHD in CSF tissues (p = 0.658).
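For readers unfamiliar with the test, a paired-sample t-test works on the per-subject differences between the two configurations. The sketch below computes the t statistic and its degrees of freedom with the standard formula; turning the statistic into a p-value additionally requires the Student t distribution, which is left out here. The input values are made up for illustration and are not the challenge results.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Made-up per-subject scores for two methods (not real data).
method_a = [1.0, 2.0, 3.0, 4.0]
method_b = [0.0, 1.0, 1.0, 2.0]

t_value, dof = paired_t_statistic(method_a, method_b)
print(t_value, dof)
```

With the 13 test subjects used in this study, such a test would have 12 degrees of freedom.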

A comparison of the training and validation accuracy between the baseline and HyperDenseNet is shown in Figure 2. In these plots, the mean DC over the three brain tissues is evaluated on training samples after each sub-epoch, and on the whole validation volume after each epoch. It can be observed that in both cases HyperDenseNet outperforms the baseline, achieving better results faster. This can be attributed to the higher number of direct connections between different layers, which facilitates back-propagation of the gradient to shallow layers without diminishing its magnitude, thus easing the optimization.

Figure 2: Training (top) and validation (bottom) accuracy plots.

Figure 3 depicts visual results for the subject used in validation. It can be observed that HyperDenseNet (middle) recovers thin regions better than the baseline (left), which can explain improvements in distance-based metrics. As confirmed in Table 2, this effect is most prominent in boundaries between the gray and white matter. Further, HyperDenseNet produces fewer false positives for WM than the baseline, which tends to over-estimate the segmentation in this region.

Figure 3: Comparison of the segmentation results achieved by the baseline (left) and HyperDenseNet (middle) to the manual reference contour (right) on the subject employed for validation.

Comparing these results with the performance of methods submitted in the first round of the iSEG Challenge, HyperDenseNet ranked among the top-3 in 6 out of 9 metrics, being the best method in 4 of them. We can therefore say that it achieves state-of-the-art performance for the task at hand. A noteworthy point is the lower performance observed with all tested methods for the segmentation of GM and WM. This suggests that segmenting these tissues is relatively more challenging due to the unclear boundaries between them.

An extension of this study would be to investigate deeper networks with fewer filters per layer, as in recently-proposed dense networks. This may reduce the number of trainable parameters, while maintaining or even improving the performance. Further, as in [12], individual weights from dense connections could also be investigated to determine their relative importance. This would allow us to remove useless connections, making the model lighter without degrading its performance.

4 Conclusion

In this paper, we proposed a hyper-densely connected 3D fully convolutional neural network to segment infant brain tissue in MRI. This network, called HyperDenseNet, pushes the concept of connectivity beyond recent works, exploiting dense connections in a multi-modal image scenario. Instead of considering dense connections in a single stream, HyperDenseNet processes each modality in an independent path, and these paths are inter-connected in a dense manner.

We validated the proposed network in the iSEG-2017 MICCAI Grand Challenge on 6-month infant brain MRI Segmentation, reporting state-of-the-art results. In the future, we plan to investigate the effectiveness of HyperDenseNet in other segmentation problems that can benefit from multi-modal data.


  • [1] Antonios Makropoulos, Serena J Counsell, and Daniel Rueckert. A review on automatic fetal and neonatal brain MRI segmentation. NeuroImage, 2017.
  • [2] M Jorge Cardoso, Andrew Melbourne, Giles S Kendall, Marc Modat, Nicola J Robertson, Neil Marlow, and Sebastien Ourselin. AdaPT: an adaptive preterm segmentation algorithm for neonatal brain MRI. NeuroImage, 65:97–108, 2013.
  • [3] Li Wang, Feng Shi, Gang Li, Yaozong Gao, Weili Lin, John H Gilmore, and Dinggang Shen. Segmentation of neonatal brain MR images using patch-driven level sets. NeuroImage, 84:141–158, 2014.
  • [4] Li Wang, Feng Shi, Weili Lin, John H Gilmore, and Dinggang Shen. Automatic segmentation of neonatal images using convex optimization and coupled level sets. NeuroImage, 58(3):805–817, 2011.
  • [5] Konstantinos Kamnitsas, Christian Ledig, Virginia FJ Newcombe, Joanna P Simpson, Andrew D Kane, David K Menon, Daniel Rueckert, and Ben Glocker. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical image analysis, 36:61–78, 2017.
  • [6] Jose Dolz, Christian Desrosiers, and Ismail Ben Ayed. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. NeuroImage, 2017.
  • [7] Tobias Fechter, Sonja Adebahr, Dimos Baltas, Ismail Ben Ayed, Christian Desrosiers, and Jose Dolz. Esophagus segmentation in CT via 3D fully convolutional neural network and random walk. Medical Physics, 2017.
  • [8] Pim Moeskops, Max A Viergever, Adriënne M Mendrik, Linda S de Vries, Manon JNL Benders, and Ivana Išgum. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1252–1261, 2016.
  • [9] Wenlu Zhang, Rongjian Li, Houtao Deng, Li Wang, Weili Lin, Shuiwang Ji, and Dinggang Shen. Deep convolutional neural networks for multi-modality isointense infant brain image segmentation. NeuroImage, 108:214–224, 2015.
  • [10] Dong Nie, Li Wang, Yaozong Gao, and Dinggang Shen. Fully convolutional networks for multi-modality isointense infant brain image segmentation. In 13th International Symposium on Biomedical Imaging (ISBI), 2016, pages 1342–1345. IEEE, 2016.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [12] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE CVPR, 2017.
  • [13] Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng Ann Heng. H-DenseUNet: Hybrid densely connected UNet for liver and liver tumor segmentation from CT volumes. arXiv:1709.07330, 2017.
  • [14] Lequan Yu, Jie-Zhi Cheng, Qi Dou, Xin Yang, Hao Chen, Jing Qin, and Pheng-Ann Heng. Automatic 3D cardiovascular MR segmentation with densely-connected volumetric convnets. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 287–295. Springer, 2017.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.