Hydranet: Data Augmentation
for Regression Neural Networks
Deep learning techniques are often criticized for depending heavily on a large quantity of labeled data. This problem is even more challenging in medical image analysis, where annotator expertise is often scarce. We propose a novel data-augmentation method to regularize neural network regressors that learn from a single global label per image. The principle of the method is to create new samples by recombining existing ones. We demonstrate the performance of our algorithm on two tasks: estimation of the number of enlarged perivascular spaces in the basal ganglia, and estimation of white matter hyperintensities volume. We show that the proposed method improves the performance over more basic data augmentation. The proposed method reached an intraclass correlation coefficient between ground truth and network predictions of 0.73 on the first task and 0.84 on the second task, using only between 25 and 30 scans with a single global label per scan for training. With the same number of training scans, more conventional data augmentation methods could only reach intraclass correlation coefficients of 0.68 on the first task, and 0.79 on the second task.
Deep learning techniques are increasingly popular for image analysis but often depend on a large quantity of labeled data. For medical images, this problem is more acute: data acquisition is administratively and technically more complex, data sharing is more restricted, and annotator expertise is scarce.
To quantify biomarkers (e.g. the number or volume of lesions), many methods first solve a segmentation problem and then derive the target quantity with simpler methods. These approaches require expensive voxel-wise annotations. In this work, we circumvent the segmentation problem by optimizing our method to directly regress the target quantity [1, 2, 3, 4, 5]. We therefore need only a single label per image instead of voxel-wise annotations. Our main contribution is that we push this limit even further by proposing a data augmentation method to reduce the number of training images required to optimize the regressors. The proposed method is designed for global image-level labels that represent a countable quantity. Its principle is to combine real training samples to construct many more virtual training samples. During training, our model takes as input random sets of images and is optimized to predict a single label for each of these sets, which denotes the sum of the labels of all images of the set. This is motivated by the idea that adding a large quantity of virtual samples with weaker labels may reduce over-fitting to training samples and improve generalization to unseen data.
1.1 Related Work
Data augmentation can act as a regularizer and improve the generalization performance of neural networks. In addition to simple data augmentations such as rotation, translation and flipping, the authors of U-Net [6] stress, for instance, that random elastic deformations significantly improved the performance of their model. Generative adversarial networks have also been used to generate training samples, and hence reduce over-fitting [7].
Recently, data augmentation methods using combinations of training samples have been published. Zhang et al. [8] proposed to construct virtual training samples by computing a linear combination of pairs of real training samples. The corresponding one-hot labels are summed with the same coefficients. The authors evaluated their method on classification datasets from computer vision and on a speech dataset, and demonstrated that their method improves the generalization of state-of-the-art neural networks. Simultaneously, Inoue [9] and Tokozume et al. [10] reached similar conclusions. For grayscale volumetric inputs, summing image intensity values could overlay the target structures, confuse discriminative shapes, and thus harm the performance of the network. With our method, training samples can be combined without overlaying the intensity values. The other difference with the above-mentioned approaches is that our method is not designed for classification, but for regression of global labels, such as volume or count in an image. With the proposed combination of samples, our method produces plausible augmented samples.
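For reference, the pairwise combination of Zhang et al. can be written as below. The notation is ours and is not taken from the present paper: $(x_i, y_i)$ and $(x_j, y_j)$ are two real samples with one-hot labels, and $\lambda \in [0, 1]$ is the mixing coefficient.

```latex
\tilde{x} = \lambda\, x_i + (1 - \lambda)\, x_j,
\qquad
\tilde{y} = \lambda\, y_i + (1 - \lambda)\, y_j
```

The intensities of both images are blended voxel by voxel in $\tilde{x}$, which is the overlay effect discussed above for grayscale volumetric inputs.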
The principle of the proposed data augmentation method is to create many new (and weaker) training samples by combining existing ones (see Figure 1). In the remainder, the original samples are called real samples, and the newly created samples are called virtual samples.
2.1 Proposed Data Augmentation.
During training, the model is not optimized on single real samples $x$ with label $y$, but on sets $S = \{x_1, \dots, x_n\}$ of $n$ random samples with label $y_S = \sum_{i=1}^{n} y_i$, with $y_i$ the label of sample $x_i$. These sets $S$ with labels $y_S$ are the virtual samples. Consequently, the loss function $L$ is computed directly on these virtual samples $S$ and no longer on the individual real samples $x$. This approach is designed for labels describing a quantitative element in the samples, such as volume or count in an image.
To create the sets $S$, the samples are drawn without replacement from the training set at each epoch. To create more combinations of samples, and to allow the model to use the real samples for its optimization, the size of the sets $S$ can randomly vary in $\{1, \dots, n\}$ during training. If the training set contains $m$ samples, with our method, we can create $\sum_{k=1}^{n} \binom{m}{k}$ possible different combinations (the order of the samples in $S$ has no effect on the optimization).
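The counting argument and the per-epoch drawing can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, `m` denotes the number of training samples and `n` the maximum set size, and sets are assumed to be unordered with sizes from 1 to `n`.

```python
import math
import random


def count_virtual_samples(m, n):
    """Number of unordered sets of size 1..n that can be built from m samples."""
    return sum(math.comb(m, k) for k in range(1, n + 1))


def draw_sets(num_samples, n, rng):
    """Split a shuffled list of sample indices into disjoint sets of size n.

    Each index is used once per epoch (drawing without replacement).
    The last set is smaller when num_samples is not a multiple of n.
    """
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[i:i + n] for i in range(0, len(indices), n)]


# With 25 training scans and sets of at most 4 scans:
print(count_virtual_samples(25, 4))  # 25 + 300 + 2300 + 12650 = 15275
print(draw_sets(10, 4, random.Random(0)))
```

Under these assumptions, 25 scans already yield 15275 distinct sets, which illustrates how quickly the pool of virtual samples grows with the set size.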
Difference with mini-batch stochastic gradient descent (SGD).
In mini-batch SGD, the model is also optimized on sets of random samples, but the loss function is computed individually for each sample of the batch, and then summed (or averaged). For the proposed method, the predictions are first summed, and the loss function is then computed a single time. For non-linear loss functions, this is not equivalent: $L\left(\sum_{i=1}^{n} \hat{y}_i, \sum_{i=1}^{n} y_i\right) \neq \sum_{i=1}^{n} L(\hat{y}_i, y_i)$, with $\hat{y}_i$ the model’s prediction for sample $x_i$.
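A small numeric check makes the difference concrete. The values below are made up for illustration, and the squared error stands in for the non-linear loss.

```python
def squared_error(prediction, target):
    """Squared error for a single prediction/target pair."""
    return (prediction - target) ** 2


# Two hypothetical samples: one over-estimated, one under-estimated.
predictions = [5.0, 3.0]
labels = [4.0, 4.0]

# Mini-batch SGD: loss computed per sample, then summed.
per_sample_loss = sum(squared_error(p, y) for p, y in zip(predictions, labels))

# Set-level loss: predictions and labels are summed first, loss computed once.
set_loss = squared_error(sum(predictions), sum(labels))

print(per_sample_loss)  # 1.0 + 1.0 = 2.0
print(set_loss)         # (8.0 - 8.0) ** 2 = 0.0
```

Errors of opposite sign cancel inside a set, so the set-level target constrains the model less than the per-sample targets do. This is one way to read the "weaker labels" of the virtual samples.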
The regularization strength can usually be modulated by at least one parameter, for instance the degree of rotation applied to the input image, or the percentage of neurons dropped in Dropout [14]. In the proposed method, the regularization effect can be controlled by varying the average number of samples used to create combinations.
We optimize a regression neural network with a 3D image as input and a global label representing a volume or count as output. There are at least two possible implementations of the proposed method. The first implementation could consist of modifying the computation of the loss function across samples in a mini-batch, and providing mini-batches of random size. Alternatively, the model's architecture could be adapted to receive the set of images. We opted for the second approach.
Figure 2 (left) shows the architecture of the base regression neural network. It is both simple (196 418 parameters) and flexible to allow fast prototyping. There is no activation function after the last layer. The output can therefore span $\mathbb{R}$, and the network is optimized with the mean squared error (MSE). We call this regression network $f$, such that $f(x) = \hat{y}$, with $x$ the input image.
Combination of Samples.
To process several images simultaneously, we replicate the regressor $f$ $n$ times during training (Figure 2 right), resulting in $n$ different branches $f_1, \dots, f_n$ that receive the images $x_1, \dots, x_n$. The weights of each head are shared such that $f_1 = \dots = f_n = f$. A new network $g_n$ is constructed as:
$$g_n(x_1, \dots, x_n) = \sum_{i=1}^{n} f(x_i)$$
To allow the size of the sets $S$ to randomly vary in $\{1, \dots, n\}$ during training, each element of $S$ has a chance $p$ to be a black image of zero intensities only (Figure 1 right column). With $n = 4$, the following situation becomes possible:
$$g_4(x_1, 0, x_3, 0) = f(x_1) + f(0) + f(x_3) + f(0)$$
For this implementation, the batch size $b$ has to be a multiple of the number of branches $n$. We chose $n = 4$ due to constraints in GPU memory. The regularization strength is controlled by the average number of samples used to create combinations, and hence depends on $n$ and $p$. During inference, to predict the label for a single input image, the input of all other branches is set to zero.
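The branch construction can be illustrated with a toy stand-in for the regressor. This is a sketch under stated assumptions, not the Keras implementation used in the paper: `f` is a hypothetical linear model on a flattened image, `g` sums the outputs of branches that share the same weights, and `make_virtual_sample` blanks each branch input independently with probability `p`.

```python
import random


def f(image, weights, bias):
    """Toy shared regressor: a linear map on a flattened image (assumption)."""
    return sum(w * v for w, v in zip(weights, image)) + bias


def g(images, weights, bias):
    """Multi-branch network: every branch applies the same f, outputs are summed."""
    return sum(f(image, weights, bias) for image in images)


def make_virtual_sample(images, labels, p, rng):
    """Replace each branch input by a black image with probability p.

    The target is the sum of the labels of the images that were kept.
    This sketch does not enforce that at least one real image remains.
    """
    inputs, target = [], 0.0
    for image, label in zip(images, labels):
        if rng.random() < p:
            inputs.append([0.0] * len(image))  # black image, adds nothing to the target
        else:
            inputs.append(image)
            target += label
    return inputs, target


weights, bias = [1.0, 2.0], 0.5
x = [1.0, 2.0]
blank = [0.0, 0.0]

# Inference on a single image: the three other branches receive zeros.
print(g([x, blank, blank, blank], weights, bias))        # 7.0
print(f(x, weights, bias) + 3 * f(blank, weights, bias))  # 7.0
```

The usage lines show that a single-image prediction equals $f(x) + (n - 1) f(0)$, so the output on a black image has to be close to zero for the single-image prediction to match $f(x)$. The blank inputs seen during training are what push the model in that direction in this sketch.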
Enlarged perivascular spaces (PVS) and white matter hyperintensities (WMH) are two types of brain lesions associated with small vessel disease. The method is evaluated on the estimation of the number of PVS in the basal ganglia, and on the estimation of WMH volume. We compare the performance of our method to that of the base regressor $f$ with and without Dropout [14], and for different sizes of training set.
The PVS dataset contains T2-weighted scans, from 2017 subjects, acquired on a 1.5T GE scanner. The scans were visually scored by an expert rater who counted the PVS in the basal ganglia in a single slice. The WMH dataset is the training set of the MICCAI 2017 WMH challenge [11]. We use the available 2D multi-slice FLAIR-weighted MRI scans as input to the networks. Scans were acquired from 60 participants from 3 centers: 20 scans from Amsterdam (GE scanner), 20 from Utrecht (Philips) and 20 from Singapore (Siemens). Although the ground truths of the challenge are pixel-wise, we only used the number of WMH voxels as ground truth during training.
For the regression of PVS in the basal ganglia, a mask of the basal ganglia is created with the subcortical segmentation algorithm from FreeSurfer [12], and smoothed with a Gaussian filter (standard deviation of 2 voxels) before being applied to the image. The result is subsequently cropped around the basal ganglia. For the WMH dataset, we only crop each image around its center of mass, weighted by the voxel intensities. For both tasks, the intensities are then rescaled between 0 and 1.
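The intensity-weighted crop and the rescaling can be sketched on a small 2D image stored as nested lists. The paper works on 3D volumes, so this is only a simplified illustration: the function names and the clamping of the crop window to the image borders are assumptions, and non-negative intensities are assumed.

```python
def center_of_mass(image):
    """Intensity-weighted center (row, column) of a 2D image with non-negative values."""
    total = sum(v for row in image for v in row)
    r = sum(i * v for i, row in enumerate(image) for v in row) / total
    c = sum(j * v for row in image for j, v in enumerate(row)) / total
    return r, c


def crop_around(image, center, size):
    """Crop a size x size window around `center`, shifted to stay inside the image.

    Assumes size is not larger than the image in either dimension.
    """
    height, width = len(image), len(image[0])
    top = min(max(int(round(center[0])) - size // 2, 0), height - size)
    left = min(max(int(round(center[1])) - size // 2, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def rescale01(image):
    """Linearly rescale intensities to the range [0, 1]."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: avoid dividing by zero
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 0, 0],
]
patch = crop_around(image, center_of_mass(image), 2)
print(rescale01(patch))  # [[0.0, 0.0], [0.0, 1.0]]
```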
During training, for all methods, the images are randomly augmented on-the-fly with standard methods. The possible augmentations are flipping in $x$ or $y$, 3D rotation from -0.2 to 0.2 radians and random translations in $x$ or $y$ from -2 to 2 voxels. Adadelta [13] is used as optimizer. The networks are trained with batch size $b$. For the proposed method, the network’s architecture then has four branches ($n = 4$). During an epoch, the proposed method gets as input different combinations of training samples drawn from the $m$ training images, where $m$ is the total number of training images. During the same epoch, the base regressor simply gets the $m$ images separately (in batches of size $b$). For the proposed method, $p$ was set to 0.1. In some experiments with Dropout [14], we included a dropout layer after each convolution and after the global pooling layer. The code is written in Keras with TensorFlow as backend, and the experiments were run on an NVIDIA GeForce GTX 1070 GPU.
For the PVS dataset, we experiment with varying sizes of training set, between 12 and 25 scans. The validation set always contains the same 5 scans. All methods are evaluated on the same separate test set of 1977 scans. For the WMH dataset, the set is split into 30 training scans and 30 testing scans. Six scans from the training set are used as validation scans. In both cases, the dataset is randomly (uniform distribution) split into training and testing sets. For the PVS dataset, once the dataset has been split into 30 training scans and 1977 testing scans, we manually sample scans to keep a pseudo-uniform distribution of the lesion count when decreasing the number of training scans.
To compare the automated predictions to visual scoring (for PVS) or volumes (for WMH), we use two evaluation metrics: the mean squared error (MSE), and the intraclass correlation coefficient (ICC).
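Both metrics can be computed with a few lines of code. The text does not state which ICC variant is used, so the sketch below assumes the two-way random-effects, absolute-agreement, single-measure form, often written ICC(2,1), with the ground truth and the network prediction treated as two raters. The function names are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between ground truth and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def icc_2_1(y_true, y_pred):
    """ICC(2,1) with two 'raters': ground truth and prediction (assumed variant).

    Undefined (division by zero) when all ratings are identical.
    """
    ratings = list(zip(y_true, y_pred))  # one row per scan, one column per rater
    n, k = len(ratings), 2
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)              # between-scan mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_err = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


truth = [1.0, 2.0, 3.0, 4.0]
shifted = [t + 1.0 for t in truth]  # predictions with a constant offset
print(mse(truth, shifted))      # 1.0
print(icc_2_1(truth, shifted))  # 10/13, about 0.769
```

With this variant a constant offset lowers the ICC even though the ranking of the scans is perfect, which a consistency-type ICC would not penalize.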
|Method|Training scans|Testing scans|Loss|Performance (ICC)|
|Base Network|30|30|MSE|0.79 ± 0.12|
|Proposed Method|30|30|MSE|0.84 ± 0.02|
|Base Network|30|30|MAE|0.78|
|Base Network|40|20|MSE|0.89|
Enlarged Perivascular Spaces (PVS).
Figure 3 compares the proposed method to the base regressor on the PVS dataset, for an increasing number of training samples. Their performance is also compared to the average interrater agreement computed for the same problem and reported in [2]. The proposed method always reaches a better MSE than the conventional methods for all training set sizes. The proposed method also significantly outperforms the base regressor in ICC (Williams’ test p-value 0.001) when averaging the predictions of the methods across the four points of their learning curve.
White Matter Hyperintensities (WMH).
We conducted three series of experiments, and trained in total five neural networks (Table 1). When using small training sets, the proposed method outperforms the base network $f$, whether optimized for MSE or for mean absolute error (MAE). With larger training sets, the difference in performance decreases, and the base regressor performs slightly better on the ICC.
4 Discussion and Conclusion
With the proposed data augmentation method, we could reach the inter-rater agreement performance on PVS quantification reported by Dubost et al. [2] with only 25 training scans, and without pretraining.
Dubost et al. [2] also regressed the number of PVS in the basal ganglia with a neural network. We achieve a similar result (0.73 ICC) while training on 25 scans instead of 1000. Zhang et al. [8] also proposed to combine training samples as a data augmentation method. In their experiments, combining more than two images does not bring any improvement. With the proposed method, training with combinations of four images brought improvement over only using pairs of images. We did not experiment with values of $n$ larger than 4 due to GPU memory constraints. Contrary to the expected gain in generalization, on both PVS (Figure 3) and WMH datasets, using Dropout [14] worsened the results when training on very little data, even with low dropout rates such as 0.3. As dropout did not improve the performance of the baseline, we do not expect an improvement from including dropout in the proposed method.
To create combinations of images for the proposed method, images were drawn without replacement for the sake of implementation simplicity. The regularization strength could be increased by drawing samples with replacement, which could be beneficial for small training sets. We also mentioned two possible implementations of the proposed method: (1) changing the computation of the loss over mini-batches, (2) replicating the architecture of the network. In this work we used the second approach, as it was simpler to implement with our library (Keras). However, with this approach, all samples used in a given combination have to be processed simultaneously by the network, which can cause GPU memory overload in case of large 3D images or large values of $n$. The first approach does not suffer from this overload, as the samples can be loaded successively, while only saving the individual scalar predictions in the GPU memory. In case of large 3D images, we would consequently recommend implementing the first approach.
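One way to see why the first approach only needs scalar predictions is to write out the gradient for the MSE loss on a single set. The derivation below is ours and is only a sketch: $\theta$ denotes the network weights, $f_\theta$ the regressor, and $y_S$ the sum of the labels of the set.

```latex
L_S(\theta) = \Big( \sum_{i=1}^{n} f_\theta(x_i) - y_S \Big)^{2},
\qquad
\nabla_\theta L_S
  = 2 \, \underbrace{\Big( \sum_{i=1}^{n} f_\theta(x_i) - y_S \Big)}_{\text{scalar residual } r_S}
    \; \sum_{i=1}^{n} \nabla_\theta f_\theta(x_i)
```

Under this reading, a first pass over the samples accumulates the scalar residual $r_S$, and a second pass accumulates the per-sample gradients $\nabla_\theta f_\theta(x_i)$ scaled by $2 r_S$. Only one image has to reside in GPU memory at a time, at the cost of a second forward pass per sample.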
This research was funded by the Netherlands Organisation for Health Research and Development (ZonMw) Project 104003005, with additional support of the Netherlands Organisation for Scientific Research (NWO), project NWO-EW VIDI 639.022.010 and project NWO-TTW Perspectief Programme P15-26. This work was partly carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.
- [1] Cole, J.H., Poudel, R.P., Tsagkrasoulis, D., Caan, M.W., Steves, C., Spector, T.D. and Montana, G., 2017. Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker. NeuroImage, 163, pp.115-124.
- [2] Dubost, F., Adams, H., Bortsova, G., Ikram, M.A., Niessen, W., Vernooij, M. and de Bruijne, M., 2019. 3D Regression Neural Network for the Quantification of Enlarged Perivascular Spaces in Brain MRI. Medical Image Analysis.
- [3] González, G., Washko, G.R. and San José Estépar, R., 2018. Deep learning for biomarker regression: application to osteoporosis and emphysema on chest CT scans. In Medical Imaging 2018: Image Processing, vol. 10574, p. 105741H. International Society for Optics and Photonics.
- [4] Wang, J., Knol, M., Tiulpin, A., Dubost, F., De Bruijne, M., Vernooij, M., Adams, H., Ikram, M.A., Niessen, W. and Roshchupkin, G., 2019. Grey Matter Age Prediction as a Biomarker for Risk of Dementia: A Population-based Study. BioRxiv, p.518506.
- [5] Lee, J.H. and Kim, K.G., 2018. Applying deep learning in medical images: The case of bone age estimation. Healthcare informatics research, 24(1), pp.86-92.
- [6] Ronneberger, O., Fischer, P. and Brox, T., 2015, October. U-net: Convolutional networks for biomedical image segmentation. MICCAI.
- [7] Sixt, L., Wild, B. and Landgraf, T., 2018. Rendergan: Generating realistic labeled data. Frontiers in Robotics and AI, 5, p.66.
- [8] Zhang, H., Cisse, M., Dauphin, Y.N. and Lopez-Paz, D., 2017. mixup: Beyond empirical risk minimization. ICLR 2018.
- [9] Inoue, H., 2018. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929.
- [10] Tokozume, Y., Ushiku, Y. and Harada, T., 2018. Learning from between-class examples for deep sound recognition. ICLR.
- [11] Kuijf, H.J., Biesbroek, J.M., de Bresser, J., Heinen, R., Andermatt, S., Bento, M., Berseth, M., Belyaev, M., Cardoso, M.J., Casamitjana, A. and Collins, D.L., 2019. Standardized assessment of automatic segmentation of white matter hyperintensities; results of the WMH segmentation challenge. IEEE transactions on medical imaging.
- [12] Desikan, R.S., Ségonne, F., Fischl, B., Quinn, B.T., Dickerson, B.C., Blacker, D., Buckner, R.L., Dale, A.M., Maguire, R.P., Hyman, B.T. and Albert, M.S., 2006. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage, 31(3), pp.968-980.
- [13] Zeiler, M.D., 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- [14] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp.1929-1958.