U-Net with spatial pyramid pooling for drusen segmentation in optical coherence tomography
The presence of drusen is the main hallmark of early/intermediate age-related macular degeneration (AMD). Therefore, automated drusen segmentation is an important step in image-guided management of AMD. There are two common approaches to drusen segmentation. In the first, the drusen are segmented directly as a binary classification task. In the second approach, the surrounding retinal layers (outer boundary retinal pigment epithelium (OBRPE) and Bruch’s membrane (BM)) are segmented and the remaining space between these two layers is extracted as drusen. In this work, we extend the standard U-Net architecture with spatial pyramid pooling components to introduce global feature context. We apply the model to the task of segmenting drusen together with BM and OBRPE. The proposed network was trained and evaluated on a longitudinal OCT dataset of 425 scans from 38 patients with early/intermediate AMD. This preliminary study showed that the proposed network consistently outperformed the standard U-Net model.
Age-related macular degeneration (AMD) is a devastating retinal disease and a leading cause of blindness in the elderly population in the developed world. The clinical hallmark and usually the first finding of AMD is the presence of waste deposits, called drusen. In the early stages, these drusen begin to accumulate between two anatomical layers of the retina, the outer boundary retinal pigment epithelium (OBRPE) and Bruch’s membrane (BM). The drusen buildup and the consequent progression of AMD to late stages are remarkably variable among affected individuals, which makes its management one of the biggest dilemmas in ophthalmology. Currently, the patient scheduling frequency is primarily guided by the amount of drusen, which is subjectively assessed by drusen segmentation in optical coherence tomography (OCT). OCT is the state-of-the-art imaging modality for assessing the retina in AMD. This fast and non-invasive acquisition technique allows the retina to be inspected at micrometer resolution, making it possible to study not only the retinal layers but also several disease-related abnormalities, including drusen. Manual drusen segmentation is very time-consuming, which creates a need for advanced medical image computing methods that can measure distinct and pathognomonic changes in drusen morphology in an accurate, objective and reproducible manner.
In recent years, both deep-learning-based and non-deep-learning-based methods have been applied to this task [3, 4, 5, 6, 7]. In general, deep-learning-based methods, namely convolutional neural networks (CNNs), have been shown to outperform the previous cost-function-based models [3, 6, 7]. These approaches include a basic U-Net applied to drusen and layer segmentation, a combination of a CNN, graph-search-based methods and a standard classifier, and a B-scan-level CNN for retinal layer segmentation.
The drusen segmentation task can be tackled by segmenting the neighbouring layers in the retina: BM and OBRPE. An alternative approach is to segment drusen as an additional class. Our assumption is that this additional class will not only provide more information about the layers adjacent to the drusen class, but will also help the network to characterize the appearance of both drusen and non-pathological regions where OBRPE and BM overlap. The size of drusen varies: a given drusen can be small at an early stage or large at a later stage. A standard CNN applied to drusen segmentation does not take this into account. As a result, the network can miss drusen that are particularly small or, conversely, drusen that exceed the network’s receptive field (figure 3). In addition, retinal layers strictly follow the same topological ordering, and drusen have to appear strictly between OBRPE and BM. In CNN models, contextual information and the spatial relation between different anatomical parts of the retina might be overlooked because of the small receptive field. The limitations of receptive fields in a CNN are discussed in more detail in [8, 9].
A solution is to increase the receptive field of the CNN architecture. This can be approached in different ways, e.g. by dilated convolutions. In the Pyramid Scene Parsing Network (PSPNet), it is addressed by a pyramid pooling module. Pyramid pooling applies pooling with several different window sizes. The idea is that, instead of a single pooling operation with the common kernel size, which halves the size of the feature maps, a pyramid pooling layer pools with different kernel sizes, resulting in sets of bins arranged in pyramid order. The coarsest pyramid level resembles global pooling that covers the entire image (see Fig. 1(e)). Spatial pyramid pooling is also used in [9, 11].
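The pooling step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names and the bin sizes `(1, 2, 4)` are assumptions chosen only for illustration. Each pyramid level average-pools the same feature map into a coarser grid, and the 1-bin level reduces to global average pooling.

```python
def _ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D feature map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        # Rows covered by bin i; floor/ceil bounds ensure every pixel is used.
        r0, r1 = (i * h) // bins, _ceil_div((i + 1) * h, bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, _ceil_div((j + 1) * w, bins)
            window = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def pyramid_pool(fmap, bin_sizes=(1, 2, 4)):
    """Return one pooled map per pyramid level (bin sizes are hypothetical)."""
    return {b: adaptive_avg_pool(fmap, b) for b in bin_sizes}


if __name__ == "__main__":
    fmap = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    for bins, pooled in pyramid_pool(fmap).items():
        print(bins, pooled)
```

In a deep learning framework the same operation would normally be applied per channel with a built-in adaptive pooling layer; the loop form is used here only to make the bin boundaries explicit.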
In [12, 9, 11] a spatial pyramid pooling layer is used once, after the last convolutional layer of the network. In this paper we go one step further and use a spatial pyramid pooling layer after each convolutional block of the encoder of a standard U-Net. We also evaluate the result of segmenting three classes instead of two, i.e. considering drusen as an additional class. Finally, we use a weighted loss function to train our proposed model. We evaluate the performance of our approaches on the task of drusen and layer segmentation in retinas imaged with OCT. The results show that the introduced model outperforms the baselines in terms of the Dice index of drusen segmentation, while also producing accurate delineations of the BM and OBRPE surfaces.
U-Net has proven to be a suitable architecture for medical images, as it uses skip connections to pass the feature maps from the encoder to the same level of the decoder during the reconstruction stage, which makes the model well suited to segmentation tasks where precise localization is needed. Thus, we chose U-Net as the backbone for our proposed pyramid U-Net, with an input image size of 256×256.
A retina OCT scan is comprised of sequential 2D B-scans. Usually, segmentation algorithms detect the drusen boundaries in B-scans by segmenting the outer RPE and BM surfaces, as opposed to segmenting the drusen directly. In order to provide more information to the network, in this work we define a four-class segmentation task: Drusen, RPE region, BM region and Background (figure 1 (b)). This is our first implemented approach and we evaluate whether adding the extra class helps the network to learn how the drusen class interacts with the neighboring classes.
In the case of unbalanced classes, it is crucial to have a weighted loss function when evaluating multi-class segmentation output. In our work, drusen class pixels represent a very small fraction of the total pixels in an image. Thus, in order to handle the class imbalance, we train the network with the following loss function, which is based on the Generalized Dice Coefficient:

$$L = 1 - 2\,\frac{\sum_{c=1}^{C} w_c \sum_{i} P_{c,i}\, G_{c,i}}{\sum_{c=1}^{C} w_c \sum_{i} \left(P_{c,i} + G_{c,i}\right)} \qquad (1)$$

where $P$ is the prediction by the network and $G$ is the ground truth image, both indexed by class $c$ and pixel $i$. $C$ is the number of classes, which in our proposed case is 4 (drusen, BM, OBRPE and the background). $w_c$ is the weight attributed to class $c$, which is usually the inverse of the contribution of class $c$ in data space. For the examined dataset, $w_c$ is set to 70, 20 and 10 for the drusen class, the OBRPE class and the BM class, respectively.
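As an illustration of the class-weighted loss described above, the following sketch implements a common form of the generalized Dice loss, one minus twice the weighted overlap divided by the weighted sum of prediction and ground truth. Several details are assumptions rather than the authors' code: the dictionary-based data layout, the `eps` stabilizer, and the background weight of 1.0, which the text does not specify.

```python
def generalized_dice_loss(pred, truth, weights, eps=1e-7):
    """Weighted generalized Dice loss.

    pred, truth: dict mapping class name -> flat list of per-pixel values
                 (probabilities for pred, 0/1 labels for truth).
    weights:     dict mapping class name -> class weight.
    """
    numerator = 0.0
    denominator = 0.0
    for cls, w in weights.items():
        p, g = pred[cls], truth[cls]
        numerator += w * sum(pi * gi for pi, gi in zip(p, g))
        denominator += w * sum(pi + gi for pi, gi in zip(p, g))
    return 1.0 - 2.0 * numerator / (denominator + eps)


# Weights for drusen, OBRPE and BM follow the text; the background
# weight is a hypothetical placeholder.
WEIGHTS = {"drusen": 70.0, "obrpe": 20.0, "bm": 10.0, "background": 1.0}

if __name__ == "__main__":
    truth = {
        "drusen": [1, 0, 0, 0],
        "obrpe": [0, 1, 0, 0],
        "bm": [0, 0, 1, 0],
        "background": [0, 0, 0, 1],
    }
    print(generalized_dice_loss(truth, truth, WEIGHTS))  # close to 0 for a perfect match
```

Because the drusen term carries the largest weight, errors on the few drusen pixels change the loss far more than the same number of errors on background pixels.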
2.1 Pyramid Module
Fig. 1 shows the architecture of our proposed model. Each convolutional block is composed of two convolutional layers. Each convolutional block in the encoder is followed by one Pyramid Module (PM). A PM is composed of five pooling levels, each with a different bin size. The five-level pyramid module therefore forms five separate sets of feature maps, each with a different size. Thus, in the first level of the network there are five sets of feature maps (Fig. 1(e)), i.e., one series of feature maps for each pooling size. In each pyramid level, we apply average pooling to these feature maps with the kernel size needed to obtain results with the bin size of that level.
In each pyramid level, the series of feature maps is followed by a separate convolutional layer that reduces the dimensionality (the number) of the feature maps. In this paper, this reduced number is set to one value for a single bin and to another value for the four remaining pyramid levels. In each pyramid level, the pooling kernel size is calculated so that the resulting feature maps have the target bin size. Since we use U-Net as a baseline, whose encoder applies max pooling at each level of the network, we keep the feature maps at each level of the network the same size as those in the basic U-Net. Therefore, the feature maps of all the different bin sizes are combined with the feature maps obtained by the regular pooling of the basic U-Net. The idea is that the feature maps from the different bin sizes add global context information to the main pyramid. The same rule applies to the following levels: the desired number of feature maps from each pooling bin is set relative to the number of feature maps in the corresponding level, with one setting for a single pooling size and another for the rest of the pooling bins in the pyramid.
After the convolution has been applied in each pooling level, there are five sets of feature maps of different sizes. In order to be able to concatenate these feature maps, each series of feature maps is up-sampled to a common reference size. For the pyramid module in the decoder, the feature maps at each level of the decoder are concatenated with the feature maps resulting from max pooling (Fig. 2(a)). After concatenation, these pyramid feature maps are fed together into the next layer. Conversely, for the pyramid modules on the skip connections, the reference size is chosen such that the feature maps keep their original dimension (Fig. 2(b)). The output of a PM (the original feature maps and the feature maps from the five bin levels) is passed as a whole through the skip connection to the matching layer in the decoder.
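The up-sampling and concatenation step can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the text does not name the interpolation method, so nearest-neighbour up-sampling is assumed here, and feature maps are represented as plain nested lists with hypothetical function names.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour up-sampling of a 2D map to out_h x out_w."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def concat_pyramid(main_maps, pyramid_maps):
    """Concatenate main feature maps with up-sampled pyramid maps.

    main_maps:    list of H x W maps (the reference size).
    pyramid_maps: list of smaller pooled maps, one or more per pyramid level.
    Returns a list of channels, all of size H x W.
    """
    ref_h, ref_w = len(main_maps[0]), len(main_maps[0][0])
    upsampled = [upsample_nearest(m, ref_h, ref_w) for m in pyramid_maps]
    return list(main_maps) + upsampled


if __name__ == "__main__":
    main = [[[0] * 4 for _ in range(4)]]           # one 4 x 4 main channel
    pyramid = [[[5]], [[1, 2], [3, 4]]]            # 1-bin and 2-bin levels
    channels = concat_pyramid(main, pyramid)
    print(len(channels), "channels of size", len(channels[0]), "x", len(channels[0][0]))
```

Once every level has the same spatial size, concatenation along the channel axis is straightforward, which is why the reference size determines whether the module can sit on a down-sampling path or on a skip connection.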
The network was trained with the generalized Dice loss function. The predicted labels were regions for each target class (figure 1 (b)). To obtain the final surfaces of the BM and OBRPE layers, a postprocessing strategy was applied. In each vertical column of the B-scan (called an A-scan), the first row of activated pixels in the predicted BM region was extracted as the BM surface boundary. Similarly, in each A-scan, the last row of activated pixels in the predicted OBRPE region was extracted as the OBRPE surface boundary.
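The postprocessing described above can be sketched in a few lines. The sketch assumes binary region masks stored as lists of rows with row 0 at the top of the B-scan, and it returns `None` for A-scans in which the region is absent; both conventions and the function names are assumptions made for illustration.

```python
def first_active_row(mask):
    """Per column (A-scan), index of the first activated row, or None."""
    height, width = len(mask), len(mask[0])
    return [
        next((r for r in range(height) if mask[r][c]), None)
        for c in range(width)
    ]


def last_active_row(mask):
    """Per column (A-scan), index of the last activated row, or None."""
    height, width = len(mask), len(mask[0])
    return [
        next((r for r in range(height - 1, -1, -1) if mask[r][c]), None)
        for c in range(width)
    ]


if __name__ == "__main__":
    bm_region = [[0, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 0]]
    obrpe_region = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
    print("BM surface:", first_active_row(bm_region))
    print("OBRPE surface:", last_active_row(obrpe_region))
```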
3 Experimental setup
To train and evaluate the networks we use a private OCT dataset containing 425 OCT scans from 38 patients. We split the data into 34000 B-scans for training and validation (31 patients) and 7000 B-scans for testing (7 patients). Scans from the same subject were always placed in the same set. Scans were acquired with Spectralis (Heidelberg Engineering, Heidelberg, Germany), which acquires images with anisotropic voxels.
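A patient-wise split of the kind described above can be sketched as follows. The mapping from scan identifiers to patient identifiers, the function name and the random seed are hypothetical; the only property taken from the text is that all scans of one patient end up in the same set.

```python
import random


def split_by_patient(scan_to_patient, n_test_patients, seed=0):
    """Split scans so that no patient appears in both sets.

    scan_to_patient: dict mapping scan id -> patient id.
    Returns (train_scans, test_scans).
    """
    patients = sorted(set(scan_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    test_patients = set(patients[:n_test_patients])
    train = [s for s, p in scan_to_patient.items() if p not in test_patients]
    test = [s for s, p in scan_to_patient.items() if p in test_patients]
    return train, test


if __name__ == "__main__":
    scans = {"s1": "A", "s2": "A", "s3": "B", "s4": "C", "s5": "C"}
    train, test = split_by_patient(scans, n_test_patients=1)
    print("train:", train, "test:", test)
```

Splitting on patients rather than on individual scans or B-scans avoids leaking near-identical images of the same eye from the training set into the test set, which matters for a longitudinal dataset.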
Each B-scan of every volume was manually annotated in the following way. The Iowa Reference Algorithm was first applied to generate a layer segmentation. The output was then manually corrected by an expert optometrist. Finally, the BM, OBRPE and drusen regions were extracted from these annotations and used for training the network (Fig. 1).
Our method and the baselines [3, 8] were trained with a batch size of 16, iterated 50 times, using the Adam optimizer. Input B-scans were normalized to zero mean and unit variance and resized to 256×256 pixels. The class weights in equation 1 were set to 70, 20 and 10 for drusen, the RPE region and the BM region, respectively, in both the baseline models and the introduced architecture.
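The intensity normalization mentioned above can be sketched as below. The text does not say whether the statistics are computed per B-scan or over the whole dataset; this sketch assumes per-B-scan statistics, adds a small `eps` to avoid division by zero, and leaves out the resizing step.

```python
from statistics import fmean, pstdev


def normalize_bscan(bscan, eps=1e-8):
    """Scale a B-scan (list of rows) to zero mean and unit variance."""
    pixels = [value for row in bscan for value in row]
    mean, std = fmean(pixels), pstdev(pixels)
    return [[(value - mean) / (std + eps) for value in row] for row in bscan]


if __name__ == "__main__":
    normalized = normalize_bscan([[0, 2], [4, 6]])
    print(normalized)
```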
In order to evaluate our model, we compare it to several baselines. The first baseline is the standard U-Net architecture with two classes, BM and OBRPE, which has also been applied to the task of drusen segmentation in previous work. We denote this baseline as UNet-2C in Table 1. The second baseline is the U-Net with drusen introduced as an extra class. In this baseline, instead of extracting the area between BM and OBRPE as drusen, drusen are segmented explicitly as an extra class. We denote this baseline as UNet-3C. Finally, our proposed model additionally has a spatial pyramid pooling layer at each level of the basic U-Net, and is denoted as UNet-PPM (Table 1).
An example of segmentation output is shown in Fig. 3. It shows how the pyramid pooling method addresses some fundamental issues in drusen segmentation by adding global contextual information to the feature maps that are passed through the network. We quantitatively evaluated the segmentation performance for drusen, OBRPE and BM. Table 1 shows the results of this evaluation: the per-patient Dice coefficient for drusen segmentation and the mean absolute error for OBRPE and BM. In addition, figure 4 shows a box plot of the per-patient Dice coefficient for drusen and the mean absolute error for BM and RPE segmentation. One can observe that, by using the pyramid module, our proposed method was able to outperform the baseline networks.
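For readers who want to reproduce this kind of evaluation, the two metrics can be sketched as follows. The sketch makes several assumptions that are not stated in the text: masks are binary lists of rows, two empty masks count as a perfect match, surface positions are row indices per A-scan in pixels, and A-scans where either surface is missing (`None`) are skipped. Averaging per patient is left out.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice coefficient between two binary masks given as lists of rows."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    return 1.0 if total == 0 else 2.0 * intersection / total


def surface_mean_absolute_error(pred_surface, true_surface):
    """Mean absolute row difference over A-scans where both surfaces exist."""
    diffs = [
        abs(p - t)
        for p, t in zip(pred_surface, true_surface)
        if p is not None and t is not None
    ]
    return sum(diffs) / len(diffs) if diffs else float("nan")


if __name__ == "__main__":
    print(dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
    print(surface_mean_absolute_error([10, 12, None, 15], [11, 12, 14, 13]))
```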
Utilizing global spatial context is crucial for avoiding anatomically impossible segmentation such as finding drusen above RPE instead of below it. It is still a challenge to learn the plausible spatial relationships between object classes from a training dataset using statistical machine learning approaches. We proposed incorporating the pyramid pooling module into U-Net. The results showed that the proposed extension utilized the larger context for segmentation and clearly outperformed the baseline U-Net model. The proposed method is an important step towards the accurate quantification of drusen, crucial for the successful clinical management of patients with early AMD. Finally, given the widespread use of U-Net for medical image segmentation in general, the proposed extension would have an impact beyond its application in drusen segmentation.
This work was funded by the Christian Doppler Research Association, the Austrian Federal Ministry for Digital and Economic Affairs and the National Foundation for Research, Technology and Development. We thank the NVIDIA corporation for a GPU donation.
[1] Wong, W.L., Su, X., Li, X., Cheung, C.M.G., Klein, R., Cheng, C.Y., Wong, T.Y.: Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. The Lancet Global Health 2(2) (Feb 2014) e106–16
[2] Schlanitz, F.G., Baumann, B., Kundi, M., Sacu, S., Baratsits, M., Scheschy, U., Shahlaee, A., Mittermüller, T.J., Montuoro, A., Roberts, P., Pircher, M., Hitzenberger, C.K., Schmidt-Erfurth, U.: Drusen volume development over time and its relevance to the course of age-related macular degeneration. British Journal of Ophthalmology 101(2) (Feb 2017) 198–203
[3] Gorgi Zadeh, S., Wintergerst, M.W., Wiens, V., Thiele, S., Holz, F.G., Finger, R.P., Schultz, T.: CNNs enable accurate and fast segmentation of drusen in optical coherence tomography. In: MICCAI Workshop on Deep Learning in Medical Image Analysis. Volume 10553 of Lect. Notes Comput. Sci. (2017) 65–73
[4] Khalid, S., Akram, M.U., Hassan, T., Jameel, A., Khalil, T.: Automated segmentation and quantification of drusen in fundus and optical coherence tomography images for detection of ARMD. Journal of Digital Imaging (Dec 2017) 1–13
[5] Novosel, J., Vermeer, K.A., de Jong, J.H., Wang, Z., van Vliet, L.J.: Joint segmentation of retinal layers and focal lesions in 3-D OCT data of topologically disrupted retinas. IEEE Transactions on Medical Imaging 36(6) (2017) 1276–1286
[6] Fang, L., et al.: Automatic segmentation of nine retinal layer boundaries in OCT images of non-exudative AMD patients using deep learning and graph search. Biomedical Optics Express 8(5) (2017) 2732–2744
[7] Shah, A., et al.: Multiple surface segmentation using convolution neural nets: application to retinal layer segmentation in OCT images. Biomedical Optics Express 9(9) (2018) 4509–4526
[8] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR. (2017)
[9] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR abs/1406.4729 (2014)
[10] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
[11] Gu, Z., Liu, P., Zhou, K., Jiang, Y., Mao, H., Cheng, J., Liu, J.: DeepDisc: Optic disc segmentation based on atrous convolution and spatial pyramid pooling. In: Computational Pathology and Ophthalmic Medical Image Analysis. Springer (2018) 253–260
[12] Zhao, R., Camino, A., Wang, J., Hagag, A.M., Lu, Y., Bailey, S.T., Flaxel, C.J., Hwang, T.S., Huang, D., Li, D., Jia, Y.: Automated drusen detection in dry age-related macular degeneration by multiple-depth, en face optical coherence tomography. Biomedical Optics Express 8(11) (Nov 2017) 5049
[13] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proc. Int. Conf. Med. Imag. Comput. & Comput. Assist. Interven. (MICCAI). Volume 9351. (2015) 234–241
[14] Crum, W.R., Camara, O., Hill, D.L.G.: Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Transactions on Medical Imaging 25(11) (Nov 2006) 1451–1461
[15] Chen, X., Niemeijer, M., Zhang, L., Lee, K., Abràmoff, M.D., Sonka, M.: Three-dimensional segmentation of fluid-associated abnormalities in retinal OCT: probability constrained graph-search-graph-cut. IEEE Transactions on Medical Imaging 31(8) (2012) 1521–1531