Dynamic Deep Networks for Retinal Vessel Segmentation

Aashis Khanal^{1,2} and Rolando Estrada^{1,3}

^{1}Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
^{2}akhanal1@student.gsu.edu
^{3}restrada1@gsu.edu

Abstract

Segmenting the retinal vasculature entails a tradeoff between how much of the overall vascular structure we identify vs. how precisely we segment individual vessels. In particular, state-of-the-art methods tend to under-segment faint vessels, as well as pixels that lie on the edges of thicker vessels. Thus, they underestimate the width of individual vessels, as well as the ratio of large to small vessels. More generally, many crucial biomarkers, including the artery-vein (AV) ratio, branching angles, number of bifurcations, fractal dimension, tortuosity, vascular length-to-diameter ratio, and wall-to-lumen ratio, require precise measurements of individual vessels. To address this limitation, we propose a novel, stochastic training scheme for deep neural networks that better classifies the faint, ambiguous regions of the image. Our approach relies on two key innovations. First, we train our deep networks with dynamic weights that fluctuate during each training iteration. This stochastic approach forces the network to learn a mapping that robustly balances precision and recall. Second, we decouple the segmentation process into two steps. In the first half of our pipeline, we estimate the likelihood of every pixel and then use these likelihoods to segment pixels that are clearly vessel or background. In the latter part of our pipeline, we use a second network to classify the ambiguous regions in the image. Our proposed method obtained state-of-the-art results on five retinal datasets (DRIVE, STARE, CHASE-DB, AV-WIDE, and VEVIO) by learning a robust balance between false positive and false negative rates. In addition, we are the first to report segmentation results on the AV-WIDE dataset, and we have made the ground-truth annotations for this dataset publicly available.

© Optical Society of America under the terms of the OSA Open Access Publishing Agreement
1 Introduction
Retinal vessels provide the only non-invasive view of the cardiovascular system. Thus, they are a key diagnostic feature for a number of diseases, including diabetic retinopathy [1], coronary heart disease [2], and atherosclerosis [3]. However, the current standard of care requires manual inspection by an ophthalmologist, which makes it more challenging for people in developing nations and low-income communities to receive care. For example, only 30% of African Americans in southern Los Angeles reported being screened for diabetic retinopathy, despite being the most at-risk ethnic group [4]. Therefore, it is vital to develop automatic retinal analysis methods to improve screening rates and public health outcomes.
The first step in a retinal analysis pipeline is to segment the regions of the image that correspond to the vasculature. Formally, vessel segmentation is a binary classification problem, but it presents unique challenges. First, existing datasets are small (typically on the order of 20-40 images) because an ophthalmologist has to manually trace every vessel in each fundus image. In addition, we need to classify hundreds of thousands or even millions of pixels per image, and the labels of nearby pixels are correlated.
Deep neural networks are the state of the art for a wide range of classification problems, but, due to the aforementioned challenges, training a deep network to segment retinal images is not straightforward. The two main strategies for applying deep learning to this domain are: (1) dividing each image into small patches to maximize the number of training samples [5] or (2) combining traditional convolutional layers with upsampling to learn both local and global features [6]. In particular, the U-net architecture [6] and its extensions (e.g., Recurrent U-net [7]) have achieved state-of-the-art results by applying the latter strategy.
In this paper, we propose a novel, two-stage segmentation method that combines both upsampling and patch-based processing. Our approach also leverages loss functions with stochastic weights to better detect faint and ambiguous vessel pixels. As our experiments show, our proposed pipeline achieves state-of-the-art results on five vessel segmentation datasets, including conventional fundus images [8, 9, 10], wide field-of-view (FOV) images [11], and low-resolution video stills and mosaics [12].
As Fig. 1 shows, we separate the overall vessel segmentation problem into two stages: (1) likelihood estimation and (2) targeted prediction. In the first half of our pipeline, we train a U-net architecture to estimate the vessel likelihood of each pixel using a stochastic cross-entropy loss. That is, we randomly assign a weight to each class on each training iteration, which forces the network to alternate between higher false positive and higher false negative rates during training, preventing it from getting stuck in a single minimum. In the second half of our pipeline, we first separate potential vessel pixels into two categories, easy and ambiguous, and then train a patch-based network to classify the latter. (Easy pixels can be classified with a simple threshold.) This second network, which we refer to as mini-U-net, is a scaled-down version of the network used in the first stage. To the best of our knowledge, we are the first to apply stochastic weights to vessel segmentation, as well as the first to combine both upsampling and patch-based learning in a single pipeline.
The rest of this paper is organized as follows. First, we discuss related work on vessel segmentation, particularly deep learning-based approaches. Then, we detail our complete methodology, including our stochastic loss function and targeted prediction steps. We then present and discuss experimental results on five datasets and explore avenues for future work.
2 Related work
Vessel segmentation has a long history, although the advent of deep neural networks has yielded significant improvements in recent years. Earlier techniques relied primarily on handcrafted features, including matched filters [13], quadrature filters [14], and Gabor filters [15, 16]. The latter approach, in particular, uses a predefined kernel bank to cover all vessel widths and orientations. However, handcrafted features are limited by our ability to analytically model the segmentation process. Other classic techniques, as surveyed further in [17], include piecewise thresholding [9], region growing [18], and concavity measurements [19]. In addition to handcrafted features, some prior approaches used graph-theoretic techniques to trace the vascular structure, including shortest-path tracking [12] and the fast marching algorithm [20].
Since the breakthrough by convolutional neural networks (CNNs) on the ImageNet dataset in 2012 [21], deep learning has achieved tremendous success across a wide array of machine learning tasks [21, 22, 23, 24]. Currently, deep networks are the state of the art in vessel segmentation. In particular, the most successful deep architecture in this domain, U-net [6], combines traditional convolutional layers with feature upsampling to estimate both local and global image features. As noted above, earlier deep learning approaches [25, 26] divided an image into a series of patches and applied a classifier to each patch independently. For instance, Dasgupta et al. divided the full image into small patches to train a CNN [25], while Ciresan et al. defined a patch around each pixel and used those patches to train their network [26]. One drawback of patch-based approaches is that the CNN must be trained on a large number of patches, most of which contain redundant information. The upsampling in U-net, on the other hand, allows a network to train with fewer, significantly larger patches.
There are numerous extensions to the original U-net architecture. Alom et al. increased the performance of U-net by introducing residual connections and a recurrent training strategy [22], albeit at the expense of increased training time due to the extra connections and recurrence. Zhuang et al. arranged two U-net architectures in serial fashion, enabling the architecture to learn more abstract features [27]. This serialization significantly increases the training time and requires substantial computational resources to train well. Jin et al. used deformable CNNs to construct better vascular features [28], but also with added training complexity. Finally, Yun et al. combined conditional generative adversarial networks (GANs) with U-net [23], achieving results comparable to the other U-net extensions. In contrast to these extensions, our approach not only yields better results but is also simpler and faster to train, as detailed in the following section.
3 Methodology
As noted above, we separate vessel segmentation into two stages. Fig. 1 details each step of our pipeline. In the first stage, we use a stochastic loss function to obtain an overall likelihood map for every pixel in the image. Then, we identify those pixels that are most likely to be misclassified—generally faint vessels and pixels at the periphery of thick vessels—and apply a second, smaller network to these ambiguous regions. Below, we first present and motivate the use of stochastic loss functions and then discuss each step in our pipeline in turn.
3.1 Dynamic loss functions
A weighted loss function penalizes some types of errors more than others. Such functions are useful for problems with unbalanced datasets or in which we want to prioritize certain outcomes (e.g., minimizing false positives for a particular class). Traditionally, these weights have been fixed a priori by estimating them from a training set.
In this work, however, we propose using dynamic, or stochastic, weights. In this case, each weight is defined not as a single value but as a distribution. During training, every time we feed a new minibatch to the network, we first sample a weight for each class and then use those sampled weights to compute the loss. Since the weights vary for each training iteration, the same error will be penalized more or less heavily at different points during training. Empirically, this forces the network to optimize across all ranges of false positive vs. false negative rates, leading to more robust classification. For concreteness, we discuss below the use of dynamic weights for the cross-entropy loss, but one can randomize any weighted loss function.
The cross-entropy loss is a natural choice for binary classification since it encourages values to converge towards zero or one. This loss is defined as follows:

$$H(p, q) = -\sum_{x} p(x) \log q(x) \qquad (1)$$

where $p$ is the true distribution and $q$ is the predicted distribution, i.e., the network's outputs. The training process aims to make these two distributions as close as possible. However, when the data is highly skewed, we can trivially obtain a high accuracy by always assigning the most common class to every pixel. In the case of vessel segmentation, very few pixels are part of a vessel, so always assigning a label of background will yield an accuracy of around 90% (but a recall of 0%). A straightforward way to ameliorate this imbalance is to apply different weights to each class:
$$H_w(p, q) = -\sum_{x} w_{c(x)}\, p(x) \log q(x) \qquad (2)$$

where $w_c$ are the class weights and $c$ are the class labels ($c \in \{0, 1\}$ for binary segmentation, with $c(x)$ denoting the true class of pixel $x$). By using weights, we force the network to care about both positive and negative samples equally during training. An easy way to define these weights is to estimate them from the training set, as was done for the original U-net architecture [6].
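As an illustration, the weighted cross-entropy in Eq. (2) can be sketched in a few lines of NumPy (a minimal sketch for a single minibatch; function and variable names are ours, not from the paper):

```python
import numpy as np

def weighted_cross_entropy(y_true, p_vessel, class_weights, eps=1e-12):
    """Per-pixel weighted binary cross-entropy, as in Eq. (2).

    y_true        : 0/1 ground-truth labels (0 = background, 1 = vessel).
    p_vessel      : predicted probability that each pixel is a vessel.
    class_weights : {0: background weight, 1: vessel weight}.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_vessel, dtype=float), eps, 1 - eps)
    # Weight each pixel's loss term by the weight of its true class.
    w = np.where(y == 1, class_weights[1], class_weights[0])
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.mean(w * ce))
```

With a vessel weight of 9 and a background weight of 1, for example, a missed vessel pixel is penalized nine times more heavily than a false positive.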
A major drawback of the above strategy is that it imposes a fixed vessel-to-background ratio. As our experiments show, this ratio is a major determinant of whether ambiguous pixels are mapped to vessel or background. In other words, when faced with an ambiguous pixel, the network will use this ratio to "guess" which label to assign to it. We can see in Figs. 3(b) and 3(c) that a network trained with fixed weights will preferentially over- or under-segment ambiguous pixels.
Instead, we propose using stochastic weights, which are randomly generated at each training iteration:

$$w_c = \operatorname{rand}(a, b, s) \qquad (3)$$

where the function $\operatorname{rand}(a, b, s)$ assigns a random class weight within the range $a$ to $b$, and $s$ is the step by which the random class weights are sampled. For example, $s = 1$ would generate random integers between $a$ and $b$. Intuitively, dynamic weights prevent the network from getting stuck in a low-value local minimum.
When we use fixed class weights, the neural network penalizes all pixels by the same factors throughout training. As noted earlier, and as Figs. 3(b) and 3(c) show, fixed weights fail to account for fainter vessels and are more sensitive to noise. Stochastic weights reduce this bias because the network must vary how aggressively it segments ambiguous pixels throughout the training process. Even if completely different weights are drawn in a subsequent iteration, some of the information learned in the previous step is preserved, leading to a more balanced segmentation of ambiguous regions. As Fig. 3(d) shows, dynamic weights capture fainter vessels more accurately than fixed weights.
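The weight sampling of Eq. (3) can be sketched as follows (we assume a uniform distribution over the grid {a, a+s, ..., b}, which the text does not specify beyond the range and step; all names are ours):

```python
import random

def rand_weight(a, b, s):
    """Sample a class weight uniformly from {a, a+s, a+2s, ..., b}, as in Eq. (3)."""
    n_steps = int(round((b - a) / s))
    return a + s * random.randint(0, n_steps)

def sample_class_weights(a, b, s):
    """Draw fresh background/vessel weights for one training iteration."""
    return {0: rand_weight(a, b, s), 1: rand_weight(a, b, s)}
```

In training, `sample_class_weights` would be called once per minibatch, so the penalty ratio between classes fluctuates from iteration to iteration.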
We now discuss the various steps of our proposed pipeline below.
3.2 Likelihood map
The first step in our pipeline consists of training a U-net architecture using stochastic weights to derive a likelihood map over the entire image. The goal of this map is not to segment the image directly but to indicate how likely a particular pixel is to be part of a vessel. We do not segment the image at this stage because the upsampling used by U-net blurs the original vessels, making it difficult to correctly label high-frequency elements (e.g., thin vessels or pixels near the edge of a thick vessel).
Figure 3 contrasts the outputs of our stochastic U-net vs. two fixed-weight U-nets on the datasets used in our experiments (see figure for details). Overall, networks trained with fixed weights struggle to capture faint vessels, as well as the pixels at the edges of thicker vessels. Interestingly, our stochastic weights are also better able to ignore vessel-like artifacts, e.g., the rim of a lens as seen in Fig. 3.
3.3 Support-resistance segmentation (SRS) for easy pixels
The likelihood map generated in the previous step takes values in [0, 1]. Pixels with a higher probability of being vessels have values closer to one (and closer to zero otherwise). Intuitively, the closer a value is to either extreme, the more confident the network is in its classification. Thus, we use this likelihood map to separate pixels into either easy or ambiguous classes. The top and bottom of Fig. 1(c) show the easy and ambiguous pixels, respectively, of the likelihood map in Fig. 1(b).
We use two thresholds to classify pixels into these two classes. We refer to the lower threshold as support and to the upper threshold as resistance. All pixels between support and resistance are considered ambiguous. Pixels below the support (those with values near zero), which account for the majority of pixels, are labeled background, whereas pixels above the resistance (those with values near one) are labeled vessel.
The choice of support and resistance thresholds affects the tradeoff between precision and recall. Thresholds close to zero and one, respectively, lead to higher precision for easy pixels but cause more pixels to fall in the intermediate, ambiguous band. Thresholds closer to the middle lead to higher recall but lower precision. For our experiments, we empirically set a support of 20 and a resistance of 235, with the likelihood map rescaled to 8-bit values in [0, 255].
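The support-resistance split can be sketched as follows (a minimal NumPy version; we assume the likelihood map has been rescaled to 8-bit values, and all names are ours):

```python
import numpy as np

def support_resistance_split(likelihood, support=20, resistance=235):
    """Split an 8-bit likelihood map into vessel / background / ambiguous.

    likelihood : uint8 array (0 = background-like, 255 = vessel-like).
    Returns (vessel, background, ambiguous) boolean masks.
    """
    lk = np.asarray(likelihood)
    vessel = lk > resistance           # confident vessel pixels
    background = lk < support          # confident background pixels
    ambiguous = ~vessel & ~background  # left for the second network
    return vessel, background, ambiguous
```

Every pixel falls in exactly one of the three masks; only the ambiguous mask is forwarded to the targeted prediction stage.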
3.4 Targeted prediction for ambiguous pixels
The last stage of our pipeline segments the ambiguous pixels, i.e., those that fall within the support-resistance band described above. Figs. 1(d)-(g) illustrate the various steps involved in this targeted prediction. Generally, most vessel pixels will lie above the resistance threshold, so they have already been predicted with high precision. To classify the pixels with intermediate values, we train a second, patch-based network only on those pixels. As Fig. 2 shows, this mini-U-net is a scaled-down version that uses only the middle layers of the full U-net architecture. We use two channels as inputs to this network: the full likelihood map output by the first U-net (Fig. 1(b)) and a version of this likelihood map that only contains ambiguous pixels (the bottom image in Fig. 1(c)).
Unlike prior patch-based approaches (e.g., Ciresan et al. [26]), we do not extract a patch for every pixel. Instead, we use a sparse set of patches centered at the intersections between ambiguous pixels and a regular lattice, as described below:

Raw estimate: We first produce a highrecall binary mask by considering all pixels above the support threshold as vessels. Intuitively, this mask should contain almost no false negatives, but will have numerous false positives. We then skeletonize the largest connected component in this mask (Fig. 1(d)).

Patch selection: We then define a regular lattice of fixed width (Fig. 1(e)) and determine the intersection points between this lattice and the skeleton from the previous step (Fig. 1(d)). We empirically verified that the union of the patches defined by these intersections always covered every ambiguous pixel.

Mini-U-net classification: Finally, we classify the ambiguous pixels in each patch using our second network (Fig. 1(g)). For each patch, we extract the corresponding regions in the full likelihood map and in the map with only ambiguous pixels, respectively. Intuitively, the second channel indicates which pixels are most crucial to classify correctly. We then use the outputs from the second network as the final labels for the ambiguous pixels (Fig. 1(h)).
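The patch-selection step above can be sketched as follows (a simplified NumPy version; the exact intersection rule and the lattice width are our assumptions, not values from the paper):

```python
import numpy as np

def lattice_patch_centers(skeleton, lattice_width):
    """Patch centers = points where the vessel skeleton crosses a regular lattice.

    skeleton      : boolean array, True on the skeletonized candidate vessels.
    lattice_width : spacing (in pixels) between lattice lines.
    Returns a list of (row, col) patch centers.
    """
    rows, cols = np.nonzero(skeleton)
    # A skeleton pixel is a lattice intersection if it lies on any lattice line.
    on_lattice = (rows % lattice_width == 0) | (cols % lattice_width == 0)
    return list(zip(rows[on_lattice].tolist(), cols[on_lattice].tolist()))
```

A patch is then extracted around each returned center; because the centers are restricted to lattice crossings, the set of patches stays sparse while still covering the skeletonized candidates.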
4 Experiments and Results
We carried out experiments on five different retinal vessel datasets to verify the effectiveness of our proposed approach. Below, we first detail our experimental protocol and then discuss our results.
4.1 Materials and methods
Hardware: All our experiments were conducted on a Dell Precision 7920R server with two Intel Xeon Silver 4110 CPUs, two GeForce GTX 1080 Ti graphics cards, and 128 GB of RAM.
Datasets: We used five datasets for our experiments (see the first column of Fig. 3 for a sample image from each dataset):

DRIVE [8]: 40 color fundus images (565×584 pixels) centered on the macula. Each image includes a circular field-of-view (FOV) mask. This dataset is pre-divided into 20 training and 20 test images, but we further divided the training set randomly into 15 training and 5 validation images.

STARE [9]: 20 color images (700×605 pixels) with no FOV masks, also centered on the macula. Many images exhibit some degree of pathology (e.g., exudates).

VEVIO [12]: This dataset has two types of images: 16 frames (640×480 pixels) from 16 different video indirect ophthalmoscopy (VIO) videos and 16 corresponding mosaics (around 600×500 pixels) obtained by fusing different frames into a single image [30]. Due to the difficult image acquisition process, these images are blurry and exhibit lens artifacts.

CHASE-DB [10]: 14 left- and right-eye pairs of color images centered on the optic nerve (28 images total). Images are 999×960 pixels.

AV-WIDE [11]: wide field-of-view (FOV) images of the retina. This dataset originally included only graph-based vessel annotations (see Ground truth preparation below).
Data preparation: We used only the green channel of each image. We then applied contrast-limited adaptive histogram equalization [31] as a preprocessing step to enhance contrast. Given the limited number of images in each dataset, we used image augmentation to increase our data size. Specifically, we randomly flipped the input image horizontally and vertically in each iteration.
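The augmentation step can be sketched as follows (a minimal NumPy version; the 0.5 flip probability is our assumption):

```python
import random
import numpy as np

def random_flip(image, label):
    """Randomly flip an image/label pair horizontally and/or vertically.

    The same flips are applied to both arrays so the ground truth stays aligned.
    """
    if random.random() < 0.5:  # assumed flip probability
        image, label = np.fliplr(image), np.fliplr(label)
    if random.random() < 0.5:
        image, label = np.flipud(image), np.flipud(label)
    return image, label
```

Because the flips are resampled every iteration, the network effectively sees up to four orientations of each training image over the course of training.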
Ground truth preparation: All datasets except AV-WIDE had preexisting ground-truth pixel labels. The original AV-WIDE dataset had only graph-based labels, so we manually traced the vessels in Adobe Photoshop CS (Adobe Systems Inc., San Jose, CA) using a Wacom Intuos3 graphics tablet (Wacom Co., Ltd., Kazo, Saitama, Japan) to obtain pixel-level segmentations. The original, graph-based annotations had been carried out by an expert ophthalmologist, so we followed them exactly.
Network architectures: As noted above, we used a full-sized U-net architecture for our initial probability map and a smaller, mini-U-net network for targeted prediction. The U-net architecture consists of a series of downsampling steps followed by the same number of upsampling steps. Each upsampling step receives the feature map from the downsampling step at the same level, crops it to match in size, and concatenates it with its own features, as shown in Fig. 2. We used 3×3-pixel kernels except in the last layer; there, we used 1×1 kernels to produce two outputs, which are then passed through a softmax layer to obtain two probabilities: one for that particular pixel being a vessel and another for it being part of the background. A pixel is labeled vessel or background depending on which probability is higher. As in the original U-net architecture [6], this network receives inputs of size 572×572 pixels but only predicts values for the internal 388×388 pixels. The outer 92-pixel band gives the U-net a wider view of the vascular structure. As in [6], we shift the U-net across the image to obtain a label for every pixel and mirror the image content whenever the outer band extends beyond the image boundary.
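The shifted tiling and mirroring described above can be sketched as follows (a minimal NumPy version under our assumptions; the function name and the use of reflection padding for both borders and partial tiles are ours):

```python
import numpy as np

def tile_with_mirroring(image, out_size=388, context=92):
    """Yield (input_patch, (row, col)) pairs covering the whole image.

    Each (out_size + 2*context)-sized input patch (572x572 by default)
    predicts only its central out_size x out_size region; the borders are
    reflection-padded, as in the original U-net.
    """
    h, w = image.shape
    H = -(-h // out_size) * out_size  # round up to a multiple of out_size
    W = -(-w // out_size) * out_size
    padded = np.pad(image,
                    ((context, context + H - h), (context, context + W - w)),
                    mode='reflect')
    for r in range(0, H, out_size):
        for c in range(0, W, out_size):
            patch = padded[r:r + out_size + 2 * context,
                           c:c + out_size + 2 * context]
            yield patch, (r, c)
```

On a 584×565 DRIVE image, this produces a 2×2 grid of 572×572 input patches whose 388×388 centers tile the (padded) image.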
Our second network, the mini-U-net architecture, is identical to the middle three layers of the full-size U-net (see Fig. 2). Thus, its input size is 140×140 pixels, and it produces labels for the middle 100×100 pixels. As with the first network, we expand and mirror patches as needed. However, we train this network using a dice loss, rather than cross-entropy, because our goal is to maximize the F1 score of our pipeline. The weighted dice loss is given by the following formula:
$$L_{dice}(\beta) = 1 - \frac{(1 + \beta^2)\, TP}{(1 + \beta^2)\, TP + \beta^2\, FP + FN} \qquad (4)$$

where $TP$, $FP$, and $FN$ are the (soft) true positive, false positive, and false negative counts, and $\beta$ controls how much we want to favor precision vs. recall. A higher $\beta$ yields higher precision and vice versa. As with the full-size U-net, we used a stochastic loss in which we sampled $\beta$ for each training iteration:

$$\beta = \operatorname{rand}(a, b, s) \qquad (5)$$

where $\beta$ is picked randomly within the range $a$ to $b$ with a step size of $s$.
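Eq. (4) admits the following NumPy sketch (our reconstruction in terms of soft TP/FP/FN counts, under the stated convention that a higher β favors precision; function and variable names are ours):

```python
import numpy as np

def weighted_dice_loss(y_true, p_pred, beta, eps=1e-12):
    """Soft weighted dice loss, as in Eq. (4).

    With this form, a larger beta penalizes false positives more heavily,
    pushing the network toward higher precision (and vice versa).
    """
    y = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(p_pred, dtype=float).ravel()
    tp = np.sum(y * p)          # soft true positives
    fp = np.sum((1 - y) * p)    # soft false positives
    fn = np.sum(y * (1 - p))    # soft false negatives
    b2 = beta ** 2
    f_beta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fp + fn + eps)
    return 1.0 - f_beta
```

With β = 1 this reduces to the ordinary soft dice loss; during training, β would be resampled per iteration via Eq. (5).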
Training parameters: We trained our networks using Adam optimization [32] with a learning rate of 0.001 and a minibatch size of 4. We trained the full U-net for 250 epochs and the mini-U-net for 60 epochs. For the full U-net, we used stochastic class weights sampled as in Eq. (3). For the mini-U-net, we used a dice loss with stochastic weights of (1, 2, 0.1), i.e., $\beta$ randomly oscillated between 1 and 2 with a step size of 0.1.
Model validation: We used 5-fold cross validation with training, validation, and test sets on all datasets except DRIVE, since this dataset already has a predefined test set. For each fold, we randomly split the training set into 75% training and 25% validation images. In our tables, we report the performance on the test set across all folds. Note that, under this protocol, each image is included exactly once in a test set.
4.2 Pipelines tested
To better understand the impact of each stage of our pipeline, we tested two versions of our approach, as well as two baseline methods:

Dynamic weights with targeted prediction: Our proposed approach.

Dynamic weights only: We trained the full Unet using dynamic class weights, but omitted the targeted prediction step.

Fixed weights only: We trained the full Unet using fixed class weights based on the ratio of vessel to background pixels in the training set. No targeted prediction step.

Fixed weights with targeted prediction: We applied both fixed weights and targeted prediction.
The parameters for the mini-U-net at the end of the full pipelines (variations 1 and 4) were the same for both configurations. In total, we carried out twenty experiments across the five datasets.
4.3 Results
We quantified our method’s performance in terms of precision, recall, and F1 score (also called F-measure). For comparison to the state of the art, we also calculated the accuracy of our model, even though this value can be misleading for unbalanced datasets [12]. The precision, recall, and accuracy values in our tables are the mean values across all test images. The reported F1 score is the harmonic mean of the corresponding mean precision and recall values. As Forman et al. showed [33], if the positive cases constitute around 7% or more of the samples, then this is an unbiased way to calculate the mean F1 score under k-fold cross validation. Across the different datasets, vessel pixels constituted around 10% of all pixels.
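For concreteness, the reported F1 score is computed as the harmonic mean of the mean precision and mean recall (a trivial sketch; the function name is ours):

```python
def f1_from_means(mean_precision, mean_recall):
    """F1 reported as the harmonic mean of mean precision and mean recall."""
    return 2 * mean_precision * mean_recall / (mean_precision + mean_recall)
```

For example, applying this to our DRIVE precision and recall (0.8284 and 0.8235) recovers the reported F1 score of 0.8259.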
Table 1. Results on the DRIVE dataset.

| Method | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| U-net [22] | **0.8852** | 0.7537 | 0.8142 | 0.9531 |
| Residual U-net [22] | 0.8614 | 0.7726 | 0.8149 | 0.9553 |
| Recurrent U-net [22] | 0.8603 | 0.7751 | 0.8155 | 0.9556 |
| R2 U-net [22] | 0.8589 | 0.7792 | 0.8171 | 0.9556 |
| Conditional GAN [23] | 0.8143 | 0.8274 | 0.8208 | 0.9608 |
| LadderNet [27] | 0.8593 | 0.7856 | 0.8208 | 0.9561 |
| DUNet [28] | 0.8537 | 0.7894 | 0.8203 | **0.9697** |
| Fixed weights only | 0.7657 | **0.8410** | 0.8015 | 0.9633 |
| Fixed weights with targeted prediction | 0.7823 | 0.8246 | 0.8028 | 0.9643 |
| Dynamic weights only | 0.8323 | 0.8163 | 0.8242 | 0.9692 |
| Dynamic weights with targeted prediction | 0.8284 | 0.8235 | **0.8259** | 0.9693 |
Tables 1 to 6 summarize our results on the various datasets. A value in bold represents the best score for that particular column. For convenience, we also include the results of current state-of-the-art approaches. Note, though, that we only list methods that reported F1 scores (or, equivalently, both precision and recall values). Due to the strong imbalance between the two classes, accuracy alone is not an informative measure of performance. As the various tables show, our proposed pipeline consistently outperformed existing methods. To the best of our knowledge, our F1 scores and most of our accuracy values are state-of-the-art results across the five datasets.
Overall, dynamic weights outperformed fixed weights across the various datasets, including the highly studied DRIVE and STARE datasets. Specifically, stochastic weights led to more balanced precision and recall scores. For example, on the DRIVE dataset, the full, stochastic pipeline obtained a precision of 0.8284 and a recall of 0.8235, while the fixed-weight pipeline gave more unbalanced results (precision of 0.7823 and recall of 0.8246). This pattern was replicated across datasets. Compared to prior methods, we consistently achieved better recall with little or no sacrifice in precision. Existing methods with high precision and low recall seldom misclassify a background pixel as a vessel, but they struggle to detect faint vessels.
Notably, our method outperformed other extensions of the original U-net architecture, including LadderNet [27], R2 U-net [22], and DUNet [28]. The difference is particularly large for the CHASE-DB dataset (Table 3), where our method's F1 score was clearly higher than that of the previous state of the art (LadderNet).
Table 2. Results on the STARE dataset.

| Method | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| U-net [22] | 0.8475 | 0.8270 | 0.8373 | 0.9690 |
| Residual U-net [22] | 0.8581 | 0.8203 | 0.8388 | 0.9700 |
| Recurrent U-net [22] | 0.8705 | 0.8108 | 0.8396 | 0.9706 |
| R2 U-net [22] | 0.8659 | 0.8298 | 0.8475 | 0.9712 |
| Conditional GAN [23] | 0.8466 | 0.8538 | 0.8502 | 0.9771 |
| DUNet [28] | **0.8856** | 0.7428 | 0.8079 | 0.9729 |
| Fixed weights only | 0.8138 | 0.8538 | 0.8480 | 0.9739 |
| Fixed weights with targeted prediction | 0.7979 | **0.8692** | 0.8320 | 0.9732 |
| Dynamic weights only | 0.8413 | 0.8424 | 0.8418 | 0.9758 |
| Dynamic weights with targeted prediction | 0.8559 | 0.8541 | **0.8549** | **0.9780** |
Table 3. Results on the CHASE-DB dataset.

| Method | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| U-net [22] | 0.7336 | 0.8288 | 0.7783 | 0.9578 |
| Residual U-net [22] | 0.7857 | 0.7726 | 0.7800 | 0.9553 |
| Recurrent U-net [22] | 0.8195 | 0.7459 | 0.7810 | 0.9622 |
| R2 U-net [22] | 0.8107 | 0.7756 | 0.7928 | 0.9634 |
| LadderNet [27] | 0.8084 | 0.7978 | 0.8031 | 0.9656 |
| DUNet [28] | 0.7510 | 0.8229 | 0.7853 | 0.9724 |
| Fixed weights only | 0.8089 | 0.8271 | 0.8179 | 0.9744 |
| Fixed weights with targeted prediction | 0.8266 | 0.8085 | 0.8174 | 0.9749 |
| Dynamic weights only | 0.8175 | **0.8296** | 0.8235 | 0.9753 |
| Dynamic weights with targeted prediction | **0.8550** | 0.8143 | **0.8245** | **0.9759** |
Table 4. Results on the AV-WIDE dataset.

| Method | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| Fixed weights only | 0.7611 | 0.7898 | 0.7751 | 0.9706 |
| Fixed weights with targeted prediction | 0.7439 | **0.8083** | 0.7747 | 0.9698 |
| Dynamic weights only | 0.8154 | 0.7751 | 0.7947 | 0.9742 |
| Dynamic weights with targeted prediction | **0.8283** | 0.7815 | **0.8042** | **0.9755** |
Table 4 shows our results for the AV-WIDE dataset. Our paper is the first assessment of this dataset, as we manually created its pixel-level ground truth. As with the other datasets, we observe a similar pattern of unbalanced precision and recall for fixed class weights. In contrast, our stochastic technique has more balanced precision and recall. Moreover, the F1 score and accuracy are significantly higher for our method, particularly for the full pipeline with targeted prediction.
Table 5. Results on the VEVIO mosaics.

| Method | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| Forest-based Dijkstra [12] | - | - | 0.5053 | 0.9573 |
| Fixed weights only | 0.5537 | 0.6832 | 0.6093 | 0.9637 |
| Fixed weights with targeted prediction | 0.5472 | **0.6984** | 0.6136 | 0.9632 |
| Dynamic weights only | 0.6147 | 0.6355 | 0.6249 | 0.9683 |
| Dynamic weights with targeted prediction | **0.6573** | 0.6739 | **0.6654** | **0.9719** |
Table 6. Results on the VEVIO frames.

| Method | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| Forest-based Dijkstra [12] | - | - | 0.5403 | 0.9101 |
| Fixed weights only | 0.5212 | **0.6580** | 0.5817 | 0.9569 |
| Fixed weights with targeted prediction | 0.5328 | 0.6482 | 0.5849 | 0.9581 |
| Dynamic weights only | 0.5924 | 0.5896 | **0.5910** | **0.9629** |
| Dynamic weights with targeted prediction | **0.6005** | 0.5435 | 0.5706 | 0.9628 |
Finally, our method also achieved state-of-the-art results on the highly challenging VEVIO dataset, which has two sets of images: composite mosaics and individual frames taken from low-resolution videos [30]. As with the other datasets, fixed class weights gave highly biased precision and recall scores, while dynamic weights gave more balanced results (Tables 5 and 6).
Table 7. Uniform weights {1, 1} vs. class weights estimated from the training set (fixed weights, likelihood network only).

| Dataset and class weights | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| DRIVE {1, 1} | 0.8418 | 0.8023 | 0.8216 | 0.9694 |
| DRIVE class-weighted | 0.7657 | 0.8410 | 0.8015 | 0.9633 |
| STARE {1, 1} | 0.8559 | 0.8208 | 0.8379 | 0.9757 |
| STARE class-weighted | 0.8138 | 0.8538 | 0.8480 | 0.9739 |
| CHASE-DB {1, 1} | 0.8332 | 0.8135 | 0.8232 | 0.9757 |
| CHASE-DB class-weighted | 0.8089 | 0.8271 | 0.8179 | 0.9744 |
| AV-WIDE {1, 1} | 0.8231 | 0.7552 | 0.7876 | 0.9737 |
| AV-WIDE class-weighted | 0.7611 | 0.7898 | 0.7751 | 0.9706 |
| VEVIO Mosaic {1, 1} | 0.6507 | 0.5564 | 0.5998 | 0.9690 |
| VEVIO Mosaic class-weighted | 0.5537 | 0.6832 | 0.6093 | 0.9637 |
| VEVIO Frame {1, 1} | 0.6288 | 0.5257 | 0.5726 | 0.9643 |
| VEVIO Frame class-weighted | 0.5212 | 0.6580 | 0.5817 | 0.9569 |
Table 8. Effect of different class weights on the DRIVE dataset (likelihood network only).

| Dataset and class weights | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| DRIVE class-weighted (approximately {1, 10}) | 0.7657 | 0.8410 | 0.8015 | 0.9633 |
| DRIVE {1, 5} | 0.7806 | 0.8264 | 0.8028 | 0.9642 |
| DRIVE {1, 1} | 0.8418 | 0.8023 | 0.8216 | 0.9694 |
| DRIVE {5, 1} | 0.8722 | 0.7270 | 0.7930 | 0.9665 |
| DRIVE {10, 1} | 0.8712 | 0.7137 | 0.7867 | 0.9654 |
| DRIVE | 0.8508 | 0.7965 | 0.8227 | 0.9696 |
| DRIVE (our approach) | 0.8323 | 0.8163 | 0.8242 | 0.9692 |
5 Discussion
The key novelties of our technique are twofold. First, we use dynamic, as opposed to fixed, weights to allow our networks to learn a more robust balance between false positives and false negatives. Second, we separate different types of tasks involved in retinal vessel segmentation. The likelihood estimation step only focuses on capturing the vascular structure without worrying about the final segmentation. Conversely, the targeted prediction step relies on having a precomputed likelihood map with which to make the final predictions. We discuss specific properties of our approach below.
5.1 Dynamic weights yield smoother likelihood maps
In Fig. 3 (columns b and c), we can see that using class weights computed from the training set almost binarizes the image. All the likelihoods are very close to either 0 (background) or 1 (vessel); thus, we have no useful information with which to refine our estimate in the targeted prediction step, and any information about fainter pixels is lost. However, if we use dynamic weights, a higher percentage of pixels is assigned nontrivial likelihood values (Fig. 3(d)), which allows our second network to better classify ambiguous pixels.
5.2 Dynamic weights yield more balanced precision and recall
As we can observe from the results (Tables 1 to 6), our method consistently achieves a good balance between precision and recall. This implies that our approach handles fainter or more ambiguous pixels, rather than ignoring them. In contrast, the state-of-the-art results for the DRIVE dataset [27] report a noticeably wider gap between precision and recall than our corresponding, more balanced values. We can see even more of a difference on the VEVIO dataset (Tables 5 and 6). For the mosaics, the precision and recall for the U-net with class weights were 0.5537 and 0.6832, respectively, while our full, dynamic pipeline yielded considerably more balanced values.
Our dynamic weights allow a network to settle on a local optimum that better balances precision and recall. As noted earlier, retinal images have a much higher percentage of background pixels (over 90%). Thus, a network trained on this type of data will be more reluctant to label an ambiguous pixel a vessel (since the prior probability is so much lower). This imbalance makes the network settle on a local optimum that is skewed towards high precision and low recall. Previous approaches have ameliorated this effect by assigning fixed class weights in order to rebalance the classification problem. For example, if only 10% of pixels are vessels, then misclassifying those pixels will be penalized by a factor of 9. However, not all faint vessels are similar or distributed equally across different images, so a one-size-fits-all treatment is suboptimal. Our technique, on the other hand, applies stress tests during the training process to ensure that the network is robust across different ratios of precision and recall. As our experiments show, this approach allows the network to settle on a better optimum.
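A minimal sketch of this idea, assuming a weighted binary cross-entropy loss whose vessel-class weight is redrawn on every training iteration (the uniform draw and the specific loss form are illustrative choices, not the exact scheme from our pipeline; only the 100-to-1 range comes from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class_weights(max_ratio=100.0):
    # Redraw the vessel-class weight each training iteration, while the
    # background weight stays fixed at 1. A uniform draw over
    # [1, max_ratio] is an illustrative stand-in for the sampling scheme.
    return 1.0, rng.uniform(1.0, max_ratio)

def weighted_bce(probs, labels, w_bg, w_vessel):
    """Per-pixel weighted binary cross-entropy (labels: 1 = vessel)."""
    eps = 1e-7
    probs = np.clip(probs, eps, 1 - eps)
    losses = -(w_vessel * labels * np.log(probs)
               + w_bg * (1 - labels) * np.log(1 - probs))
    return losses.mean()

# Each call simulates one training iteration with freshly drawn weights,
# so the penalty for a missed vessel pixel fluctuates between batches.
probs = np.array([0.9, 0.2, 0.7])
labels = np.array([1.0, 0.0, 1.0])
w_bg, w_v = sample_class_weights()
loss = weighted_bce(probs, labels, w_bg, w_v)
```

Because the false-negative penalty changes from iteration to iteration, the network cannot minimize the loss by committing to one fixed precision-recall tradeoff.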
In more detail, Fig. 4 illustrates the effects of dynamic vs. fixed weights during training for the DRIVE dataset. Here, each subplot on the left-hand side shows precision vs. recall during training, while the subplots on the right show the same values for the validation set. As Fig. 4(c) shows, a network that uses weights calculated from the training set will begin by strongly favoring recall, since a false negative is penalized more heavily than a false positive. In contrast, an unweighted network (Fig. 4(e)) will favor precision at first, since the probability that an ambiguous pixel is a vessel is very low, a priori. Over time, both networks learned to classify the training set, but their initial bias led to poor results on the validation set, suggesting overfitting (Fig. 4(d, f)). This behavior implies that the network was very confident about easy pixels and therefore settled on an optimum that favored this class of pixels; however, this choice made it ignore fainter pixels, in turn. Our method, on the other hand (Fig. 4(a, b)), had more balanced scores throughout the training process. The optimum on which it settled was more robust and thus achieved better scores on the validation and test sets. Our results suggest that the network was able to learn features that separate faint vessels from noise in the background.
5.3 Effect of different weighting schemes
Finally, we carried out additional analyses to investigate the impact of different weighting schemes on the final segmentation result. First, we trained a U-net architecture (without targeted prediction) with equal weights for the two classes. In other words, each misclassification, whether false positive or false negative, was weighted equally. Table 7 shows the results of this equal weighting on our different datasets (we also show the class-weighted results for each dataset for comparison). Equal weights yielded better precision but worse recall. Due to the roughly 9-to-1 class imbalance between background and vessel pixels, a network trained with equal weights will tend to label ambiguous pixels as background. We verified this tradeoff further by systematically varying the class weights for the DRIVE dataset and recording the resulting precision and recall values. Table 8 lists the results when training with background-to-vessel weights of {1, 5}, {1, 1} (equal weights), {5, 1}, and {10, 1}. For comparison, the class-based weights were roughly 1 to 10. We also tested a U-net with dynamic weights, but with a range of weights spanning only 10 to 1 (our main experiments used a range of 100 to 1). In this case, the dynamic weights had less variance across training epochs.
Interestingly, the best fixed-weight result for the DRIVE dataset used equal weights (this result was not consistent across datasets, though, as Table 7 shows). Overall, we can see that the tradeoff between precision and recall is driven by the ratio of class weights. Also, the dynamic weights with a 10-to-1 range did not balance precision and recall as much as the 100-to-1 weights, so their results more closely resembled those of equal weights. We observed a similar pattern for the other datasets (we omit these results for conciseness). Skewed weights tend to favor either precision or recall and reduce the other score accordingly. By effectively balancing these two scores, our dynamic weights achieve the best result.
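Since the F-score is the harmonic mean of precision and recall, weights that raise one score at the expense of the other also lower the F-score. For instance, the equal-weight DRIVE row in Table 8 can be checked as:

```python
def f_score(precision, recall):
    # F1 score: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Equal-weight ({1, 1}) DRIVE row: precision 0.8418, recall 0.8023.
print(round(f_score(0.8418, 0.8023), 4))  # 0.8216
```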
6 Conclusion
An accurate segmentation of the vascular structure is crucial for retinal disease diagnosis. However, this task is not easy because the quality and features of a retinal image depend on many factors, including the imaging device, lighting conditions, and an individual’s anatomy. Thus, trying to enforce a single, static approach for all retinal images is suboptimal. Instead, by separating retinal vessel segmentation into subtasks, we have more degrees of freedom with which to adapt our processing to the current image. Furthermore, dynamic weights allow a network to learn to classify all types of vessels, regardless of their color or intensity, which in turn generates a likelihood map with clear distinctions between vessel, background, and noise. Our targeted prediction step then uses this likelihood map to better classify ambiguous pixels. Our technique gives better results than state-of-the-art techniques, many of which are much more complex and less intuitive. Instead of optimizing a single black box, our approach breaks down the problem into more manageable steps and makes these steps more robust by using dynamic weights. In future work, we plan to apply our pipeline to other medical imaging domains, including CT and MRI scans. We also intend to investigate the theoretical properties of our stochastic weights further.
Disclosures
The authors declare that there are no conflicts of interest related to this article.
References
 [1] C. Kondermann, D. Kondermann, and M. Yan, “Blood vessel classification into arteries and veins in retinal images,” Proc. SPIE 6512 (2007).
 [2] E. Rochtchina, G. Burlutsky, G. Liew, P. Mitchell, J. J. Wang, B. E. Klein, M. D. Knudtson, R. Klein, and T. Y. Wong, “Retinal vessel diameter and cardiovascular mortality: pooled data analysis from two older populations,” European Heart Journal 28, 1984–1992 (2007).
 [3] M. K. Ikram, F. J. de Jong, J. R. Vingerling, J. C. M. Witteman, A. Hofman, M. M. B. Breteler, and P. T. V. M. de Jong, “Are retinal arteriolar or venular diameters associated with markers for cardiovascular disorders? The Rotterdam Study,” Investigative Ophthalmology & Visual Science 45, 2129–2134 (2004).
 [4] Y. Lu, L. Serpas, P. Genter, C. Mehranbod, D. Campa, and E. Ipp, “Disparities in diabetic retinopathy screening rates within minority populations: Differences in reported screening rates among African American and Hispanic patients,” Diabetes Care 39, e31–e32 (2016).
 [5] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Deep neural networks segment neuronal membranes in electron microscopy images,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds. (Curran Associates, Inc., 2012), pp. 2843–2851.
 [6] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI (2015).
 [7] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent residual convolutional neural network based on U-net (R2U-net) for medical image segmentation,” CoRR abs/1802.06955 (2018).
 [8] J. Staal, M. Abramoff, M. Niemeijer, M. Viergever, and B. van Ginneken, “Ridge-based vessel segmentation in color images of the retina,” IEEE Transactions on Medical Imaging 23, 501–509 (2004).
 [9] A. D. Hoover, V. Kouznetsova, and M. Goldbaum, “Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response,” IEEE Transactions on Medical Imaging 19, 203–210 (2000).
 [10] M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, and S. A. Barman, “An ensemble classification-based approach applied to retinal blood vessel segmentation,” IEEE Transactions on Biomedical Engineering 59, 2538–2548 (2012).
 [11] R. Estrada, C. Tomasi, S. C. Schmidler, and S. Farsiu, “Tree topology estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 1688–1701 (2015).
 [12] R. Estrada, C. Tomasi, M. T. Cabrera, D. K. Wallace, S. F. Freedman, and S. Farsiu, “Exploratory Dijkstra forest based automatic vessel segmentation: applications in video indirect ophthalmoscopy (VIO),” Biomed. Opt. Express 3, 327–339 (2012).
 [13] S. Chaudhuri, S. Chatterjee, N. Katz, M. Nelson, and M. Goldbaum, “Detection of blood vessels in retinal images using two-dimensional matched filters,” IEEE Transactions on Medical Imaging 8, 263–269 (1989).
 [14] G. Läthén, J. Jonasson, and M. Borga, “Blood vessel segmentation using multi-scale quadrature filtering,” Pattern Recognition Letters 31, 762–767 (2010).
 [15] Z. Yavuz and C. Köse, “Retinal blood vessel segmentation using Gabor filter and top-hat transform,” in 2011 IEEE 19th Signal Processing and Communications Applications Conference (SIU) (2011), pp. 546–549.
 [16] F. Farokhian and H. Demirel, “Blood vessels detection and segmentation in retina using Gabor filters,” in 2013 High Capacity Optical Networks and Emerging/Enabling Technologies (2013), pp. 104–108.
 [17] D. Kaur and Y. Kaur, “Various image segmentation techniques: A review,” (2014).
 [18] M. E. Martínez-Pérez, A. D. Hughes, A. V. Stanton, S. A. Thom, A. A. Bharath, and K. H. Parker, “Retinal blood vessel segmentation by means of scale-space analysis and region growing,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI’99, C. Taylor and A. Colchester, eds. (Springer Berlin Heidelberg, Berlin, Heidelberg, 1999), pp. 90–97.
 [19] B. S. Y. Lam, Y. Gao, and A. W. Liew, “General retinal vessel segmentation using regularization-based multiconcavity modeling,” IEEE Transactions on Medical Imaging 29, 1369–1381 (2010).
 [20] F. Benmansour and L. D. Cohen, “Tubular structure segmentation based on minimal path method and anisotropic enhancement,” International Journal of Computer Vision 92, 192–210 (2011).
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds. (Curran Associates, Inc., 2012), pp. 1097–1105.
 [22] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent residual convolutional neural network based on U-net (R2U-net) for medical image segmentation,” CoRR abs/1802.06955 (2018).
 [23] Y. Jiang and N. Tan, “Retinal vessel segmentation based on conditional deep convolutional generative adversarial networks,” CoRR abs/1805.04224 (2018).
 [24] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” CoRR abs/1703.06870 (2017).
 [25] A. Dasgupta and S. Singh, “A fully convolutional neural network based structured prediction approach towards the retinal vessel segmentation,” in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017) (2017), pp. 248–251.
 [26] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Deep neural networks segment neuronal membranes in electron microscopy images,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds. (Curran Associates, Inc., 2012), pp. 2843–2851.
 [27] J. Zhuang, “LadderNet: Multi-path networks based on U-net for medical image segmentation,” CoRR abs/1810.07810 (2018).
 [28] Q. Jin, Z. Meng, T. D. Pham, Q. Chen, L. Wei, and R. Su, “DUNet: A deformable network for retinal vessel segmentation,” arXiv preprint arXiv:1811.01206 (2018).
 [29] R. Estrada, M. J. Allingham, P. S. Mettu, S. W. Cousins, C. Tomasi, and S. Farsiu, “Retinal artery-vein classification via topology estimation,” IEEE Transactions on Medical Imaging 34, 2518–2534 (2015).
 [30] R. Estrada, C. Tomasi, M. T. Cabrera, D. K. Wallace, S. F. Freedman, and S. Farsiu, “Enhanced video indirect ophthalmoscopy (VIO) via robust mosaicing,” Biomed. Opt. Express 2, 2871–2887 (2011).
 [31] K. Zuiderveld, “Contrast limited adaptive histogram equalization,” in Graphics Gems IV (Academic Press Professional, Inc., San Diego, CA, USA, 1994), pp. 474–485.
 [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).
 [33] G. Forman and M. Scholz, “Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement,” SIGKDD Explorations Newsletter 12, 49–57 (2010).