Intraoperative margin assessment of human breast tissue in optical coherence tomography images using deep neural networks

Objective: In this work, we perform margin assessment of human breast tissue from optical coherence tomography (OCT) images using deep neural networks (DNNs). This work simulates an intraoperative setting for breast cancer lumpectomy. Methods: To train the DNNs, we use both state-of-the-art methods (weight decay and DropOut) and a newly introduced regularization method based on function norms. Commonly used methods can fail when only a small database is available. The use of a function norm introduces direct control over the complexity of the function, with the aim of diminishing the risk of overfitting. Results: As neither the code nor the data of previous studies are publicly available, the obtained results are compared with the results reported in the literature, for a conservative comparison. Moreover, our method is applied to locally collected data in several data configurations, and the reported results are the average over the different trials. Conclusion: The experimental results show that the use of DNNs yields significantly better results than other techniques when evaluated in terms of sensitivity, specificity, F1 score, G-mean and Matthews correlation coefficient. Function norm regularization yielded higher and more robust results than competing methods. Significance: We have demonstrated a system that shows high promise for (partially) automated margin assessment of human breast tissue. The equal error rate (EER) is reduced from approximately 12% (the lowest reported in the literature) to 5%, a 58% reduction. The method is computationally feasible for intraoperative application (less than 2 seconds per image).

This work has been submitted to the IEEE for possible publication.

Index Terms: Breast cancer, Margin assessment, Optical Coherence Tomography (OCT), Deep Neural Network (DNN), Weight Decay, DropOut, Function norm regularization

1 Introduction

Breast cancer represents one of the leading causes of cancer deaths among women. It has been estimated that 252,710 new cases and 40,610 deaths would occur in the United States in 2017 [1]. Once diagnosed with breast cancer, 60% to 70% of patients undergo breast-conserving surgery, or lumpectomy, for surgical resection of the tumor [2]. Alternative noninvasive strategies may also be employed, such as MRI-guided focused ultrasound surgery [3], but lumpectomy remains the preferred method of treatment.

At present, resected breast tissues are examined postoperatively by pathologists to demarcate tumor margins. The procedure involves analyzing the gross resection specimen macroscopically and examining representative histopathology sections of the tissue through microscopy. Yet, it is not practical to evaluate the entire tumor margin with this method, as a large number of histologic tissue sections would need to be examined to fully evaluate a breast tissue specimen. Pathologists therefore analyze regions of interest selected from the macroscopic evaluation, which may lead to sampling error. In 20% to 60% of cases, additional tumor is found at the margins [4] and additional surgery is required, which results in higher treatment cost, greater morbidity, infection risk, and delayed adjuvant therapy [5].

Thus, real-time intraoperative delineation of cancer margins is an important step toward reducing the re-excision rate. Alternative pathology approaches such as frozen section (FS) pathology [6] and Touch Prep (TP) cytology [7] have been proposed. However, each method exhibits limitations that prevent its wide adoption in the operating room [8]. Therefore, surgeons have an urgent need for a method to detect the entire cancer margin intraoperatively in real time. Such a method can be divided into two parts: real-time measurement and real-time evaluation. We adopt optical coherence tomography for the first part and deep neural networks for the second.

Optical coherence tomography (OCT) is one of the promising techniques for in situ tumor margin detection of breast tissue. OCT is a high-resolution optical imaging modality capable of generating real-time cross-sectional images of tissue microstructures [9]. In operation, it is analogous to ultrasound imaging except that it utilizes near-infrared light waves instead of sound waves. Since its introduction, remarkable technological advances in OCT have been made, resulting in significant improvements in acquisition speeds [10], spatial resolution, and added functional modalities [11, 12]. OCT has also found clinical applications and commercial success in ophthalmology, cardiology, gastroenterology, and oncology [9].

OCT-based evaluation of breast cancer has been reported in many previous works. For instance, intraoperative assessment of human breast tumor margins has been reported, demonstrating high correlation of OCT images and histopathology examinations (100% sensitivity and 82% specificity) [13]. Development of image processing algorithms has also been pursued to enable automated differentiation of tumor cells. Zysk et al. presented a Fourier-domain approach that detects periodicities arising from unique structural features of normal breast tissues [14]. Mujat et al. and Savastru et al. devised methods that can differentiate tumor lesions based on multiple parameters extracted from OCT A-line information. On OCT images acquired from animal tissue, the method could detect cancer-positive margins with at least 81% sensitivity and 89% specificity [15]. With human tissue, it yielded 88% sensitivity and specificity [16].

As machine learning methods tend to outperform classical statistics in many image analysis applications, we investigate in this paper if such methods can improve the accuracy of real-time OCT image analysis. We report on a novel tumor margin detection method based on machine-learning analysis of OCT images of breast tissue. Our strategy is to use Deep Neural Networks (DNNs) trained using weight decay alone, weight decay and DropOut, and newly introduced function norm regularizers for analysis. The best performance is obtained by one of the new regularizers, which yields a 92% sensitivity and a 96% specificity as reported in Table 4.

In the following, we first introduce the imaging protocol, the proposed data analysis methods, and the experimentation strategy in Sec. 2. Then, we present the settings and results of our experiments in Sec. 3. Finally, we discuss the obtained results in Sec. 4 and conclude in Sec. 5.

2 Materials and method

The first two subsections, 2.1 and 2.2, explain the data acquisition for our experiments. Subsections 2.3 to 2.5 describe our novel margin assessment method using deep neural networks. Subsection 2.6 describes our data preparation, and Subsection 2.7 explains the evaluation methodology used in this paper.

2.1 OCT instrument

A custom-built OCT system, illustrated in Figure 1, was employed to image human breast tissue specimens. The specification is as follows. The OCT system was built on optical frequency domain imaging [17, 18], and employs a 1.31-μm wavelength-swept laser (SS-1310, Axsun Technologies, Inc., Billerica, MA, USA) as a light source. Light from the laser was first directed to a 90/10 fiber coupler. Then, 10% of the light was guided and detected with a fast InGaAs photodetector (InGaAs-PD) through a narrowband fixed-wavelength filter. The detector generated a pulse when the laser output swept through the passband of the filter. This pulse was converted to a transistor-transistor logic pulse and fed to a high-speed digitizer (DAQ; ATS9350, AlazarTech, Pointe-Claire, QC, Canada) to trigger signal acquisition. The k-clock from the laser was directly connected to the digitizer to serve as a sampling clock. The remaining 90% of the light was directed to a Michelson-type interferometer that consisted of a fiber coupler, circulators, and a balanced detector. The light in the sample arm passed through a two-dimensional galvanometric beam scanner. A 50-mm focal length achromatic lens in the sample arm focused 1.5 mW of light onto the specimen with a beam diameter of 35 μm. The axial resolution was measured to be 5.2 μm in tissue. Reflected light from the sample and reference mirror was coupled into the interferometer, and subsequently measured by the balanced detector. The OCT imaging system acquires images at a rate of 50,000 axial scans/s, and the OCT beam scans laterally over the sample using the two galvanometric beam scanners. The pixel resolution is 3.66 times higher in the horizontal direction than in the vertical direction. Acquisition, computation and display of OCT images were performed via custom software implemented in Visual C++.

Finally, the acquisition process takes about 0.018 sec per OCT B-scan image, and the processing of images to make them exploitable for analysis takes 2 sec per image, which makes an intraoperative application possible. A speedup is also possible using parallel GPU computing.

Figure 1: Imaging breast tissue with the OCT system. Details of the instrument design are given in Subsection 2.1.

2.2 Imaging protocol

The breast tissue specimens were obtained from the Pathology Department, Severance Hospital (Seoul, South Korea). No information about tissue donors was provided. Tissue procurement, handling, and data collection were performed according to an approved Institutional Review Board protocol. Our tissue measurement protocol consisted of OCT B-scans over small areas of each sample (3-mm × 3-mm lateral scans). Each tissue sample was kept hydrated in saline solution at 37 °C during the measurements. After completion of OCT measurements, each tissue sample was marked with India ink on the OCT imaging locations (usually 3 to 5 locations on each sample) and fixed with formalin (10% formalin in PBS). Histologic preparation of each tissue specimen was then performed at the histology department.

Typical OCT images of normal and tumor tissue types are shown in Figure 2. The OCT image was cropped to 1.5 mm (horizontal) × 0.75 mm (axial) to show the tissue part of the image. A clear structural difference can be observed between the adipose tissue and the tumorous tissue: as the adipose tissue is composed of relatively large cells, a periodic texture is observed in the corresponding images. The images of tumorous tissue have a less clear spatial structure.

(a) OCT image of adipose tissue
(b) OCT image of tumorous tissue
Figure 2: Breast OCT images.

2.3 Data analysis method


To analyze the data, we apply DNNs as a machine learning tool. Neural networks in general can be seen as directed graphs where each node is characterized by an activation function and each of the edges is characterized by a weight (see Figure 3). The activation functions are usually non-linear functions that are close to 0 (the neuron is non-activated) for some values of the input, and have non-zero values (the neuron is activated) for other inputs. Through its activations and weights, the network defines a function that relates the inputs (images) to the outputs (unnormalized class probabilities) [19].

Figure 3: Schematic of a feed-forward neural network.

Choice of DNN architecture

In image processing, the state-of-the-art architectures are deep Convolutional Neural Networks (ConvNets) [20, 21, 22, 23]. These architectures have yielded the best performance on top benchmark evaluations in a large number of tasks related to image processing, including image classification [24], object recognition [25], and segmentation [26]. ConvNets are neural networks that give more importance to spatial information and proximity than to the global connectivity we can obtain with a fully connected network (see Figure 3). A ConvNet architecture is usually composed of the following blocks:

  • Convolution layer: composed of a bank of learned filters that are convolved with the input images; they project the inputs onto an over-complete basis.

  • Non-linearity: used to sparsify the over-complete features; frequently the rectified linear unit (ReLU: max(0, x)) is used in ConvNets.

  • Pooling: downsampling; usually either max pooling or average pooling are used in ConvNets.

  • Fully-connected layers: applied after some blocks of convolution-ReLU-pooling layers. While convolution blocks operate like feature extractors, the fully connected layers operate as a classifier.

As the designed system is to be used in real time during surgery, the computation time is very important. In order to have a test time fast enough for clinical practice, we avoid very deep networks. The LeNet-5 architecture [27] has a very good accuracy on small images such as MNIST [28] (handwritten digits) and CIFAR [29] (small natural images) databases. For a better localization of the tumor, we propose to divide the images in small patches, and the LeNet-5 architecture is then a reasonable choice.

The network is composed of three convolution-ReLU-pooling layers and two fully connected layers, and uses the cross-entropy loss function. To use this loss, we first estimate class probabilities by p̂ = softmax(ŷ), where ŷ is the estimated output of the network, and then compute the logistic loss ℓ = −⟨y, log(p̂)⟩, where y is the indicator vector for the true class, with 1 only in the true label position and 0 elsewhere, and log is applied entry-wise.

Figure 4 shows a diagram of the described architecture. In this figure, the blocks noted Pooling* will be replaced by either average or max pooling blocks; the final choice for each method is made by model selection. The two outputs of the network are the probabilities of the patch being tumorous (first output) or healthy tissue (second output).
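As an illustration, the probability and loss computation just described can be sketched in a few lines (a minimal NumPy sketch for a single patch; the actual implementation in this work uses MatConvNet):

```python
import numpy as np

def softmax(scores):
    """Normalize raw network outputs into class probabilities."""
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

def cross_entropy(scores, true_class):
    """Logistic loss -<y, log p>, with y the one-hot indicator vector;
    for a one-hot y this reduces to -log of the true-class probability."""
    p = softmax(scores)
    return -np.log(p[true_class])

# Two-output network: (tumorous, healthy) scores for one patch
scores = np.array([2.0, -1.0])
loss = cross_entropy(scores, true_class=0)  # ground truth: tumorous
```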

Figure 4: The CNN architecture used for OCT margin assessment.

2.4 Training process

In general, training a DNN amounts to minimizing the empirical risk

    R̂(f) = (1/N) Σᵢ ℓ(f(xᵢ), yᵢ),

where ℓ is the loss function. However, DNNs depend on a large number of parameters (the weights), and require a large database for accurate minimization of R̂. When only a small database is available, regularization is needed to avoid overfitting. The regularized problem can be written as follows:

    min_f  R̂(f) + λ J(f),

where J(f) is a measure of the function complexity.

In our case, as the possibility to collect a large database is fairly limited, regularization is required. To avoid overfitting, we consider the following four regularizers:

  • Weight decay (WD) [19]: J(f) = ‖W‖², the squared ℓ2 norm of the network weights;

  • Weight decay + DropOut (WD+DO) [30]: J(f) = ‖W‖², plus an implicit term introduced by model averaging [31, Section 4]: at each step, a fixed ratio of randomly chosen activations is set to 0.

  • Function norm regularization using the data distribution (FN-DD) [32]: J(f) = (1/M) Σⱼ ‖f(zⱼ)‖², where the zⱼ are (unlabeled) images that are not used in the empirical risk.

  • Function norm regularization using slice sampling (FN-SS) [32]: J(f) = (1/M) Σⱼ ‖f(zⱼ)‖², where the zⱼ are sampled from a distribution proportional to the function magnitude using slice sampling [33].

Optimization for the last two methods is performed using [32, Algorithm 1].
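To make the contrast between the two families of penalties concrete, the following sketch evaluates a weight-decay penalty and an empirical function-norm penalty on a toy linear model (illustrative NumPy only; the actual optimization follows [32, Algorithm 1], and the model, weights and samples here are hypothetical):

```python
import numpy as np

def weight_decay_penalty(weights):
    """WD: squared L2 norm of all network weights."""
    return sum(np.sum(w ** 2) for w in weights)

def function_norm_penalty(f, samples):
    """FN: empirical squared L2 norm of the *function*, estimated on
    unlabeled images (FN-DD) or slice-sampled points (FN-SS)."""
    outputs = np.array([f(z) for z in samples])
    return np.mean(np.sum(outputs ** 2, axis=-1))

# Toy model f(x) = W x, with unlabeled regularization samples z_j
W = np.array([[1.0, -2.0], [0.5, 0.0]])
f = lambda x: W @ x
samples = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

wd = weight_decay_penalty([W])          # penalizes the weights directly
fn = function_norm_penalty(f, samples)  # penalizes the function's outputs
```

Note the different objects being controlled: weight decay shrinks the parameters regardless of the data, while the function norm shrinks the outputs of the network on points drawn from (an estimate of) the input distribution.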

2.5 Model selection

Each of the tested methods listed above depends on at least one parameter. All of them use a positive scalar to control the amount of regularization. We call this parameter λ for WD and WD+DO, and μ for the function norm regularization methods. In addition, WD+DO relies on a rate that designates the ratio of neurons to be dropped at each training step; this rate is denoted by p. Each choice for these parameters yields a different model. For each method, we test 16 models:

  • For WD, FN-DD and FN-SS, we test eight logarithmically spaced values for λ and μ, including 1 and 10.

  • For WD+DO, we use two values for λ. For each of the two values, we test four values for p: 0.1, 0.25, 0.5 and 0.75.

  • Finally, for all methods and all models, we test both Max pooling and Average pooling for the designated blocks in Figure 4.

Model selection is then operated by 5-fold cross-validation, selecting the model with the best mean area under the Receiver Operating Characteristic (ROC) curve (AUC).
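This selection loop can be sketched as follows (a NumPy illustration; `train_and_score` is a hypothetical callback standing in for training the network on four folds and scoring the held-out fold, and the AUC is computed through the equivalent rank statistic):

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count 1/2
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def select_model(candidates, folds, train_and_score):
    """Pick the candidate parameters with the best mean AUC over k folds.
    `train_and_score(params, fold)` must return (val_scores, val_labels)."""
    best_params, best_auc = None, -1.0
    for params in candidates:
        mean_auc = np.mean([auc(*train_and_score(params, f)) for f in folds])
        if mean_auc > best_auc:
            best_params, best_auc = params, mean_auc
    return best_params, best_auc
```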

2.6 Data preparation

Ultimately, the aim of the proposed algorithm is to separate the regions that are healthy and those that are tumorous in an image of mixed tissues. In order to do that, our strategy is to divide the images into small patches of tissue, and to train our network in order to recognize each of the types.

To extract the patches that compose our dataset, we begin with images containing only one type of tissue. These images are acquired from tissues that are separated by pathologists into tumorous and normal tissues. From these images, we need to exclude the regions corresponding to air and those that are too distant from the tissue surface, as these regions are not useful for margin assessment. This requires an accurate detection of the surface of the tissue, which is achieved using a method built on the Sobel edge detector [34]:

  1. A Gaussian filter (standard deviation 3) is applied.

  2. The Sobel edge detector is applied.

  3. The first pixel detected as part of the edge in each vertical line is selected. For the lines where no pixel is detected as part of the edge, the same axial position as in the previous line is taken. If no pixel in the first line is detected as part of the edge, we select the position corresponding to half of the image depth.

  4. The columns are divided into ten sets by assigning the column of index i to the set (i mod 10). For each set, we apply a cubic spline to interpolate the tissue surface estimate, and we then average the ten estimated curves. This operation is useful to limit the effect of outliers.

  5. The obtained curve is smoothed by a “rolling ball” under the surface [35].
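Steps 3 and 4 above can be sketched as follows (a simplified NumPy illustration: the filtering of steps 1–2 is assumed to have produced a binary edge map, linear interpolation stands in for the cubic spline, and the rolling-ball smoothing of step 5 is omitted):

```python
import numpy as np

def first_edge_per_column(edge_map):
    """Step 3: take the first edge pixel in each vertical line; fall back
    to the previous column (or half the image depth) when none is found."""
    depth, width = edge_map.shape
    surface = np.empty(width)
    previous = depth // 2  # default when the first column has no edge
    for col in range(width):
        rows = np.flatnonzero(edge_map[:, col])
        previous = rows[0] if rows.size else previous
        surface[col] = previous
    return surface

def averaged_interpolation(surface, n_sets=10):
    """Step 4: split columns into interleaved sets (column i goes to set
    i mod n_sets), interpolate each set over all columns, and average the
    estimates to limit the effect of outliers. Assumes width >= n_sets."""
    width = len(surface)
    cols = np.arange(width)
    estimates = [
        np.interp(cols, cols[s::n_sets], surface[s::n_sets])
        for s in range(n_sets)
    ]
    return np.mean(estimates, axis=0)
```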

Figure 5 shows an example of the described detection. To ensure that no remaining pixels corresponding to air are included, the detected boundary is shifted down by 30 pixels corresponding to approximately 0.1 mm (cf. Section 2.1).

(a) Tumorous tissue
(b) Adipose tissue
Figure 5: Surface detection using the method described in Section 2.6.

2.7 Evaluation metrics

Our database is extracted from three samples taken from two different patients separated beforehand into tumorous and healthy tissue by pathologists, and one sample taken from a third patient that is unlabeled (to be used for regularization in FN-DD, as well as for qualitative tests). We denote the two samples taken from the first patient by A and B, and the sample taken from the second patient by C. For each sample, a number of adjacent frames (as shown in Figure 1) is acquired (Table 1).

Tumor Normal
Sample A 849 591
Sample B 888 591
Sample C 549 729
Regularization 664
Qualitative test 60
Table 1: Number of acquired frames

These images are afterwards divided into small volumes, which are downscaled before applying the network. This allows the localization of the tumor in three dimensions. The size of the patches is chosen to be large enough to capture the structure of the tissue, but small enough to allow an accurate localization of the tumor. To have a reliable estimation of the performance of each of the methods, the following protocol is employed:

  1. We consider four database configurations: (i) train on A and test on C, (ii) train on B and test on C, (iii) train on C and test on A, and (iv) train on C and test on B. Using these data splits, cross-patient generalization is evaluated. Table 2 gives the number of training, test and regularization patches for each configuration.

  2. For each of the configurations and each of the methods, the best model is chosen by 5-fold cross-validation.

  3. The selected model is used to estimate the class probabilities of each test patch.

  4. The highest output of the network yields the estimated class for each of the patches in the test set. We then count the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) by comparing our classification to the ground truth provided by the pathologists.

  5. The results obtained for all the configurations are then averaged to provide an estimate of the expected performance of the studied methods.

  6. These results are finally compared to the performance reported in state-of-the-art works.

We also show qualitative results in Figure 7.

Patches               Train: A    Train: B    Train: C    Train: C
                      Test: C     Test: C     Test: A     Test: B
Train    Cancer       2,830       3,552       1,158       1,158
         Normal       2,364       1,970       2,411       2,411
Test     Cancer       10,615      10,615      26,885      35,520
         Normal       24,117      24,117      21,670      18,715
Regularization        6,413 (shared across all configurations)
Table 2: Number of patches per database configuration. The two configurations that train on C share the same training patches, and the two that test on C share the same test patches.

Using TP, TN, FP and FN computed on the test data in each of the four data configurations, six measurements are computed: Sensitivity (Se), Specificity (Sp), Precision (P), the F1-score (F1), the G-mean (G), and the Matthews Correlation Coefficient (MCC):

    Se = TP / (TP + FN),    Sp = TN / (TN + FP),    P = TP / (TP + FP),
    F1 = 2 · P · Se / (P + Se),    G = √(Se · Sp),
    MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

Sensitivity (Se) measures the capability of the method to correctly detect tumor patches, while specificity (Sp) indicates its capability to correctly detect the healthy patches. Though a higher Se is desired, it is important to analyze it jointly with Sp, as it is possible to reach Se = 1 by trivially classifying all the patches as tumorous. Moreover, a high Sp means a reasonable preservation of the normal tissue, which is important for the patient outcome. Precision (P) quantifies the ratio of patches classified as cancer that are correctly identified. As our data may be imbalanced (the normal tissue tends to have a higher volume than the cancer mass), we include the F1-score [36] and the G-mean, as well as the Matthews Correlation Coefficient (MCC) [37]. The F1-score is the harmonic mean of precision and sensitivity. It achieves its maximum value of 1 when the detection of the positive class is perfect, and its lowest value of 0 when the classification is completely wrong. Similarly, the G-mean balances Se and Sp by taking their geometric mean, returning a value between 0 and 1. The MCC is a correlation coefficient between the ground truth and the predicted classes. It returns a value between −1 and +1, with +1 indicating a perfect prediction, 0 no better than random, and −1 a total disagreement between prediction and ground truth.
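These measures can be computed directly from the four confusion counts; a plain-Python sketch using the standard definitions:

```python
import math

def margin_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, F1, G-mean and MCC from
    confusion counts over the test patches."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    p = tp / (tp + fp)
    f1 = 2 * p * se / (p + se)
    g = math.sqrt(se * sp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Se": se, "Sp": sp, "P": p, "F1": f1, "G": g, "MCC": mcc}
```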

Finally, receiver operating characteristic (ROC) curves are also generated from the normalized probability estimates of the softmax layer of the network at test time, and the area under each curve (AUC) is computed. In addition to these measures, we report the overall accuracy at the selected threshold (Acc) and the equal error rate (EER), which corresponds to 1 − Se at the point of the ROC curve where Se = Sp.

3 Experiments and results

All the experiments were conducted using MatConvNet [38], on a machine equipped with a 4-core CPU (4.00 GHz) and a GeForce GTX 750 Ti GPU. For all settings, we train the networks for 45 epochs. The learning rates follow an exponentially decreasing schedule, following standard practice [22]: 0.05 for the first 30 epochs, 0.005 from epoch 30 to 40, and 0.0005 for the last five epochs. The momentum rate is set to 0.95 for WD and WD+DO. It is set to 0 for the FN methods, because it was observed to cause divergence of the training process otherwise. For FN-SS, the number of generated samples for regularization is exactly the same as the number of regularization samples (see Table 2).

As an additional baseline, we trained support vector machines (SVMs) [39] with several types of kernels [40]: linear, polynomial with degrees 2 and 3, and Gaussian with a variance equal to the median of the squared distance between the data points. On these data, the linear kernel gave the best validation set accuracy, and is reported in the sequel.

In the following subsections we report:

  1. The selected models for the DNNs and the evolution of the corresponding training and test top-1 error across training epochs; in our case, the top-1 error corresponds to one minus the accuracy defined in Section 2.7.

  2. Quantitative results: mean ROC curves and the averages of the measures described in Section 2.7 on each of the four train-test settings. For computation of the mean ROC curves, we used vertical averaging [41].

  3. Qualitative results: we visualize classification results on a sample for which both normal and tumorous tissue is present. This visualization gives a qualitative view that can be provided to surgeons in an intraoperative setting. We also provide an analysis of the speed of computation at test time.

3.1 Selected models

For all trials and all methods, max pooling gives better results than average pooling. The parameters determined by the model selection procedure are reported in Table 3. In this table, we observe that for both of the function-norm methods, there is a change in the selected model when we retrain on the data from sample C.

Method & parameter    Train: A    Train: B    Train: C (1)    Train: C (2)
WD+DO: p              0.1         0.1         0.25            0.25
FN-SS: μ              10          10          1
Table 3: Selected models

3.2 Quantitative assessment

For each of the trials, the selected model is applied to classify the designated test sample. The obtained performance measurements are then averaged over the four trials (cf. Sec. 2.7). The obtained results with their corresponding error bars are displayed in Table 4. We also show the performance measures for state-of-the-art methods. From these numbers, we can see that on average, FN-SS is slightly better than all the other methods. However, the error bars suggest that this improvement is not statistically significant.

Figure 6 shows mean ROC curves. In Figure 6a, we show the mean ROC curves over the four trials for all the tested methods. For [13, 16, 15, 14], we display the performance points reported in the corresponding papers, since the ROC curves are not accessible. For SVM, we show only the best curve in order to avoid cluttering the figure. In Figure 6b, we zoom in to better visualize the curves of the best-performing methods. We can see clearly in this figure that the curves for all DNN-based methods lie above the state-of-the-art points related to automated detection, and are thus closer to the human detection point (red point in the figure).

Method                     Se       Sp       P        F1       G        MCC      Acc (%)  AUC      EER (%)
WD                         0.8843   0.9667   0.9479   0.9135   0.9241   0.8462   92.75    0.9803   5.83
                           ±0.0335  ±0.0066  ±0.0195  ±0.0187  ±0.0180  ±0.0375  ±2.01    ±0.0040  ±0.83
WD+DO                      0.8960   0.9648   0.9480   0.9204   0.9292   0.8613   93.72    0.9792   5.67
                           ±0.0358  ±0.0052  ±0.0160  ±0.0232  ±0.0192  ±0.0349  ±1.62    ±0.0058  ±1.15
FN-DD                      0.9031   0.9651   0.9462   0.9233   0.9320   0.8701   94.24    0.9798   5.66
                           ±0.0387  ±0.0060  ±0.0187  ±0.0272  ±0.0212  ±0.0349  ±1.47    ±0.0078  ±1.54
FN-SS                      0.9171   0.9631   0.9479   0.9314   0.9392   0.8849   94.96    0.9837   5.22
                           ±0.0363  ±0.0083  ±0.0167  ±0.0248  ±0.0187  ±0.0293  ±1.18    ±0.0049  ±1.21
SVM                        0.6011   0.6667   0.7287   0.6275   0.4789   NaN      72.2260  0.7186   33.58
                           ±0.1418  ±0.2249  ±0.0266  ±0.0676  ±0.1610           ±3.0492  ±0.0573  ±2.13
Human classification [13]  1        0.8182   0.8182   0.9      0.9045   0.8182   90       -        -
Mujat et al. [16]          0.88     0.8750   0.7333   0.8      0.8775   0.7178   87.64    -        12
Savastru et al. [15]       0.81     0.89     -        -        0.8491   -        -        -        -
Zysk et al. [14]           0.97     0.68     -        -        0.8122   -        -        -        -
Table 4: Evaluation metrics for all the tested methods vs. state-of-the-art methods in the literature. It can be seen that competing methods make different trade-offs between and , but that function norm regularization with slice sampling (FN-SS [32, Sec. 2.2]) dominates across standard metrics that account for a balance between true-positives and true-negatives. This dominance is reinforced by the ROC curves in Figure 6.
(a) All methods
(b) Zoom-in for DNN methods
Figure 6: Mean ROC curves for support vector machines (labeled SVM) and deep learning methods evaluated in this work. Comparison to the existing literature is based on performances reported in the respective papers evaluated on different test data. References for previous publications can be found in Table 4.

3.3 Qualitative results

In this section, we report qualitative results. For this purpose, we used 60 slices of a sample that contains both healthy and tumorous tissue (Figure 7). For these slices, we extract overlapping patches (overlap of 56 out of 64 pixels in both horizontal and vertical directions, and 2 out of 3 frames in the third dimension) after surface detection. Then, each of the patches is classified using the trained models. Here, we assign each patch to the class that yields the highest probability under a trained model. For each pixel, we average the predicted classes over all patches in which the pixel is included. In this manner, we obtain an average prediction p̄ for each of the pixels, between 0 (indicating healthy tissue) and 1 (indicating tumorous tissue). For our visualization, we consider pixels that are at a depth smaller than 384 pixels below the detected surface. For margin assessment, this is the relevant depth (approximately 1.3 mm, cf. Section 2.1) [13]. Finally, the average prediction is used to define the color of the pixel: we set R = p̄ and G = 1 − p̄.

Each pixel is then colored using the RGB triplet (R, G, 0). The pixels outside of the useful region are colored in black. In Figure 7, we show some of the obtained classifications trained on sample A. In these examples, we see that we can obtain very precise localization of the cancer tissue using deep learning methods.
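The per-pixel averaging and coloring described above can be sketched as follows (a 2-D NumPy illustration with hypothetical patch and stride sizes; the actual processing operates on three-dimensional volumes after surface detection):

```python
import numpy as np

def average_patch_predictions(shape, patch, stride, classify):
    """Average the predicted class (0 = healthy, 1 = tumorous) over every
    overlapping patch that contains each pixel."""
    votes = np.zeros(shape)
    counts = np.zeros(shape)
    for r in range(0, shape[0] - patch + 1, stride):
        for c in range(0, shape[1] - patch + 1, stride):
            votes[r:r + patch, c:c + patch] += classify(r, c)
            counts[r:r + patch, c:c + patch] += 1
    return np.where(counts > 0, votes / np.maximum(counts, 1), 0.0)

def colorize(avg):
    """Red for tumorous, green for healthy: RGB = (avg, 1 - avg, 0)."""
    return np.stack([avg, 1.0 - avg, np.zeros_like(avg)], axis=-1)
```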

We note that in an intraoperative setting, it may not be necessary to exactly delineate the boundaries of a tumorous region as long as we detect some true positives per region. This alert can be sufficient in practice, as it will indicate a positive margin and attract the attention of the surgeon. We therefore develop a quantitative evaluation for this setting. We applied the same test procedure described above to each of the 60 images, and we assigned to each of the obtained results a score for the quality of detection of the cancer parts and the normal parts. For each of the cancer parts, we first divide the pixels into four subsets according to decreasing values of the average prediction, from most confidently tumorous to least. Then, we assign to each of the cancer regions:

  • 1 if the first set has at least 300 pixels

  • 0.75 if the union of the first and second sets has at least 300 pixels

  • 0.5 if the union of the first, second and third sets has at least 300 pixels

  • 0 otherwise

Note that this assessment considers the presence of a perceivable alert in the region, and not the quality of segmentation. The first two sets indicate a region perceivable with the bare eye, while the third and fourth sets appear as greenish regions in the colored images but can be flagged with a simple computation. For each of the images, the mean score over the cancer regions is computed. The ground truth of the segmented cancer region is produced approximately from the results shown in Figure 7. In this manner, each image has a score between 0 and 1 for the cancer detection. The same measure computed on the healthy regions gives an indication of the quality of conservation of the normal tissue: the lowest measure is the best in this case. Note that a value higher than 0.5 indicates that the considered region is more likely to be tumorous, while a measure lower than 0.5 indicates that the considered region is more likely to be healthy.
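The region-scoring rule above can be sketched as follows (plain Python; the four subset sizes are assumed precomputed per region, ordered from the strongest to the weakest predicted-tumor response):

```python
def region_score(set_sizes, min_pixels=300):
    """Score one cancer region from the sizes of its four pixel subsets,
    ordered by decreasing predicted-tumor response."""
    s1, s2, s3, _ = set_sizes
    if s1 >= min_pixels:
        return 1.0
    if s1 + s2 >= min_pixels:
        return 0.75
    if s1 + s2 + s3 >= min_pixels:
        return 0.5
    return 0.0

def subjective_sensitivity(regions):
    """Mean region score over the cancer regions of one image."""
    return sum(region_score(s) for s in regions) / len(regions)
```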

As this score indicates a kind of sensitivity but relies on the subjective analysis of the surgeon, we call it subjective sensitivity. Table 5 shows the mean subjective sensitivity of the cancer regions for models trained on each of the samples (A, B and C), over all the estimated classifications, along with the same measures for the healthy regions.

Finally, we also report here the required time to obtain results for an image in a real application situation, i.e. using already trained models. Figure 8 shows that all of the methods require approximately two seconds to estimate the class of each pixel of the image using the procedure described above. As this test procedure is embarrassingly parallel, and can be performed at the same time as image acquisition, this is feasible for intraoperative application.

Figure 7: Example OCT slice of a region with both tumorous and healthy tissue (left), as well as example detections by two deep learning variants (center and right). Tumorous tissue is characterized by lighter regions in the OCT image. In the center and right figures, red indicates that the deep learning system has classified the tissue as tumorous while green indicates the system has indicated the tissue is healthy. Surface detection is sufficiently accurate, and deep learning models have correctly identified the tumorous regions and their spatial extent.
Method   Cancer regions    Normal regions
WD       0.7792 ± 0.0188   0.2500 ± 0.0187
WD+DO    0.6618 ± 0.0287   0.2653 ± 0.0230
FN-DD    0.7882 ± 0.0170   0.2708 ± 0.0249
FN-SS    0.7944 ± 0.0162   0.1833 ± 0.0180
Table 5: Mean subjective sensitivity (reported as mean ± deviation over trials) measured for cancer and normal regions in a mixed sample, across three models trained on samples A, B and C and 60 slices. While the differences in performance for the cancer regions are not significant, the FN-SS method performs significantly better at preserving the normal tissue.
Figure 8: Test time per image for the DNN models trained with various regularization strategies. Each method takes approximately 2 seconds per image and is therefore feasible to apply in an intraoperative setting.

4 Discussion

In this paper, we applied DNNs for breast cancer margin assessment with state-of-the-art regularization methods, including weight decay, weight decay + DropOut, function norm regularization based on a data distribution estimated from unlabeled samples, and function norm regularization based on a slice sampling procedure.

The quantitative results on separated data show that the new approach for margin assessment based on DNNs reports quantitative metrics substantially higher than those previously reported in the literature (Table 4). However, this comparison has been made only on the basis of the numbers reported in the related papers, and not by applying those methods to our data, as the code used for classification in these papers has not been released. Our code is publicly available for download. Another important remark is that the mean ROC curves from the DNN methods are significantly closer to the human assessment than the previous automated detection methods (see Figure b). In addition, when we compare the DNN methods, we see that on average the function norm regularization methods have better performance (Table 4, Figure a and Figure b).
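The equal error rate (EER) quoted above can be read off a ROC curve as the operating point where the false positive and false negative rates coincide. A simple sketch of this computation via a threshold sweep over raw classifier scores (an illustration, not the exact evaluation code used in this study):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER from per-sample scores and binary labels.

    Sweeps all decision thresholds and returns the mean of FPR and FNR
    at the threshold where they are closest. Assumes both classes are
    present in `labels`.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        pred = scores >= t
        fpr = np.mean(pred[labels == 0])    # negatives wrongly flagged
        fnr = np.mean(~pred[labels == 1])   # positives missed
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer
```

For perfectly separated scores the EER is 0; an EER of 0.05 corresponds to a threshold at which 5% of healthy patches are flagged and 5% of tumorous patches are missed.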

The qualitative results on mixed data also showed the effectiveness of all of the DNN methods. The methods alert to the presence of cancer in the relevant regions with comparably high confidence scores, indicating that they tend to have a high probability of correctly detecting cancerous tissue. When it comes to preserving the normal tissue, Table 5 shows that function norm regularization with slice sampling performs significantly better, supporting the view that even a small improvement, as observed in Figure b and Table 4, can be beneficial in practice. For clinical application, the use of this regularization method may result in better patient outcomes, but this requires further study.

The qualitative results use images for which the contrast was enhanced manually, a process that would, strictly speaking, need to be repeated in an operative setting. It would be desirable to perform this operation automatically, but this was not evaluated in this study. The results also depend on the surface detection. The relatively straightforward approach employed in this article works well in practice, but can likely be improved. Finally, a critique of the FN-DD results is that they employ a data sample from the same patient for regularization, which may not be available in practice. However, the FN-SS method uses no such additional data sample and achieves almost identical results. We conclude from both the quantitative and qualitative results that the FN-SS method is the most suitable for further study and eventual real-world deployment.
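One candidate for automating the manual contrast enhancement is percentile-based contrast stretching. The sketch below is illustrative only; the percentile choices are assumptions and have not been validated against the manual procedure used in this study.

```python
import numpy as np

def autocontrast(img, low_pct=2.0, high_pct=98.0):
    """Stretch image intensities so that the given percentiles map to
    [0, 1], clipping outliers. The percentile values are illustrative.
    """
    lo, hi = np.percentile(img, [low_pct, high_pct])
    if hi <= lo:
        # Degenerate (near-constant) image: return a flat result.
        return np.zeros_like(img, dtype=float)
    out = (img.astype(float) - lo) / (hi - lo)
    return np.clip(out, 0.0, 1.0)
```

Being parameter-light and fast, such a step could run per slice during acquisition, but its effect on downstream classification would need to be evaluated.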

The model selection procedure shows that the models selected for the function norm regularization methods can change when we retrain on the same data. However, the quality of the results on the test set suggests that this does not significantly affect the behavior of the model. A possible explanation for this observation is that our methods are robust to the choice of the regularization parameter and behave well as long as it lies within a certain range. Determining whether this trend is repeated on other data samples is an interesting area for further research.

However, there is a price to pay for the high performance of the DNN methods: a longer training time. Most state-of-the-art methods in the literature exploit classical statistics and are likely to be faster to train; we cannot affirm this with certainty because we do not have access to the computation times for these methods. Moreover, function norm regularization methods require a longer time to train than other DNN regularization strategies. In Figure 9, we see that FN-DD requires roughly twice the time of WD and WD+DO. This is to be expected, as this method feeds roughly twice the number of patches to the network during training (see the numbers of patches for training and regularization in Table 2). FN-SS requires an even longer time because it also includes the time needed to generate the samples. Speeding up the sampling procedure is likely possible, but this is left for future work. We note that the additional computational cost of these methods is low in monetary terms compared to the potential improvement in patient outcomes [42]. The test time computation remains unaffected. For real-time application, the required time per image is displayed in Figure 8, where we see that this time is highest for WD+DO (between 2 and 2.5 seconds) and comparable for the three other methods (between 1.5 and 2 seconds). None of these differences is likely to have any impact on the patient in an intraoperative setting.

Figure 9: Training time per patch. Although the total training time can be long (training time per patch × number of patches; see Table 2), the training is done off-line. It does not affect the time needed during real-time application, and a method with a longer training time can result in improved patient outcomes, while the monetary cost of the additional computation is typically low [42].

5 Conclusions

This work successfully implements a simulation of intraoperative margin assessment for breast tissue using OCT images. We show the benefit of using DNNs for image patch classification during surgery. We implement various regularization methods for DNN training based on function norms, and we show the efficiency and benefits of function norm regularization compared to other state-of-the-art training methods for DNNs.

The main advantage of our training method is that it relies on fewer hyperparameters that require tuning than WD and WD+DO. Indeed, for our methods we only need to tune the regularization parameter and the batch size (fixed in this work for all methods), while for WD we must also tune the momentum rate, and for WD+DO the momentum and DropOut rates. Moreover, our method is more robust to the data size and class imbalance. The impact on patient outcome is an interesting direction for future research.


Acknowledgments

This work is partially funded by Internal Funds KU Leuven and FP7-MC-CIG 334380, and by the research program of the National Research Foundation of Korea (NRF) (NRF-2015R1A1A1A05001548 to C. J.). Y. M. Jung has been supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) (2014R1A1A2054763, 2016R1D1A1B03931337).


Footnotes

  1. The patches extracted for training do not overlap, while those extracted for testing do.
  2. Precision and recall are typically reported together. We note that recall is identical to the notion of sensitivity.
  3. Performance measures for methods reported in other papers [13, 16, 15, 14] are taken directly from those publications. Neither data nor source code for these methods were publicly available at the time of writing this article.


References

  1. American Cancer Society, “Breast cancer key statistics 2017,” 2016.
  2. ——, “Breast cancer facts & figures 2015-2016,” American Cancer Society, Inc., Atlanta, 2015.
  3. P. E. Huber, J. W. Jenne, R. Rastert, I. Simiantonakis, H.-P. Sinn, H.-J. Strittmatter, D. von Fournier, M. F. Wannenmacher, and J. Debus, “A new noninvasive approach in breast cancer therapy using magnetic resonance imaging-guided focused ultrasound surgery,” Cancer Research, vol. 61, no. 23, pp. 8441–8447, 2001.
  4. J. L. Gwin, B. L. Eisenberg, J. P. Hoffman, F. D. Ottery, M. Boraas, and L. J. Solin, “Incidence of gross and microscopic carcinoma in specimens from patients with breast cancer after re-excision lumpectomy.” Annals of Surgery, vol. 218, no. 6, p. 729, 1993.
  5. S. C. Davis, S. L. Gibbs, J. R. Gunn, and B. W. Pogue, “Topical dual-stain difference imaging for rapid intra-operative tumor identification in fresh specimens,” Optics Letters, vol. 38, no. 23, pp. 5184–5187, 2013.
  6. S. Weber, F. Storm, J. Stitt, and D. M. Mahvi, “The role of frozen section analysis of margins during breast conservation surgery.” The Cancer Journal from Scientific American, vol. 3, no. 5, pp. 273–277, 1996.
  7. C. E. Cox, N. N. Ku, D. S. Reintgen, H. M. Greenberg, S. V. Nicosia, and S. Wangensteen, “Touch preparation cytology of breast lumpectomy margins with histologic correlation,” Archives of Surgery, vol. 126, no. 4, pp. 490–493, 1991.
  8. K. Esbona, Z. Li, and L. G. Wilke, “Intraoperative imprint cytology and frozen section pathology for margin assessment in breast conservation surgery: a systematic review,” Annals of Surgical Oncology, vol. 19, no. 10, pp. 3236–3245, 2012.
  9. A. F. Fercher, W. Drexler, C. K. Hitzenberger, and T. Lasser, “Optical coherence tomography-principles and applications,” Reports on Progress in Physics, vol. 66, no. 2, p. 239, 2003.
  10. T. C. Chen, B. Cense, M. C. Pierce, N. Nassif, B. H. Park, S. H. Yun, B. R. White, B. E. Bouma, G. J. Tearney, and J. F. de Boer, “Spectral domain optical coherence tomography: ultra-high speed, ultra-high resolution ophthalmic imaging,” Archives of Ophthalmology, vol. 123, no. 12, pp. 1715–1720, 2005.
  11. J. F. De Boer and T. E. Milner, “Review of polarization sensitive optical coherence tomography and stokes vector determination,” Journal of Biomedical Optics, vol. 7, no. 3, pp. 359–371, 2002.
  12. S. Makita, Y. Hong, M. Yamanari, T. Yatagai, and Y. Yasuno, “Optical coherence angiography,” Optics Express, vol. 14, no. 17, pp. 7821–7840, 2006.
  13. F. T. Nguyen, A. M. Zysk, E. J. Chaney, J. G. Kotynek, U. J. Oliphant, F. J. Bellafiore, K. M. Rowland, P. A. Johnson, and S. A. Boppart, “Intraoperative evaluation of breast tumor margins with optical coherence tomography,” Cancer Research, vol. 22, no. 69, pp. 8790–8796, November 2009.
  14. A. M. Zysk and S. A. Boppart, “Computational methods for analysis of human breast tumor tissue in optical coherence tomography images,” Journal of Biomedical Optics, vol. 11, no. 5, September/October 2006.
  15. D. Savastru, E. W. Chang, S. Miclos, M. B. Pitman, A. Patel, and N. Iftimia, “Detection of breast surgical margins with optical coherence tomography imaging: a concept evaluation study,” Journal of Biomedical Optics, vol. 5, no. 11, May 2014.
  16. M. Mujat, R. D. Ferguson, D. X. Hammer, C. Gittins, and N. Iftimia, “Automated algorithm for breast tissue differentiation in optical coherence tomography,” Journal of Biomedical Optics, vol. 14, no. 3, pp. 034 040–034 040, 2009.
  17. S. Yun, G. Tearney, J. de Boer, N. Iftimia, and B. Bouma, “High-speed optical frequency-domain imaging,” Optics Express, vol. 11, no. 22, pp. 2953–2963, 2003.
  18. S. Yun, G. Tearney, B. Bouma, B. Park, and J. de Boer, “High-speed spectral-domain optical coherence tomography at 1.3 μm wavelength,” Optics Express, vol. 11, no. 26, pp. 3598–3604, 2003.
  19. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016.
  20. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
  21. Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.
  22. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2012, pp. 1097–1105.
  23. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, 2014.
  24. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, 2015.
  25. C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Neural Information Processing Systems, 2013.
  26. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, “Conditional random fields as recurrent neural networks,” in International Conference on Computer Vision (ICCV), 2015.
  27. Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard et al., “Learning algorithms for classification: A comparison on handwritten digit recognition,” Neural Networks: the Statistical Mechanics Perspective, vol. 261, p. 276, 1995.
  28. Y. LeCun, C. Cortes, and C. J. Burges, “The mnist database of handwritten digits,” 1998.
  29. A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
  30. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv, 2012.
  31. S. Wager, S. Wang, and P. S. Liang, “Dropout training as adaptive regularization,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2013, pp. 351–359.
  32. A. R. Triki and M. B. Blaschko, “Stochastic function norm regularization of deep networks,” arXiv, vol. abs/1605.09085, 2016.
  33. R. M. Neal, “Slice sampling,” The Annals of Statistics, vol. 31, no. 3, pp. 705–767, 2003.
  34. N. Kanopoulos, N. Vasanthavada, and R. L. Baker, “Design of an image edge detection filter using the Sobel operator,” IEEE Journal of Solid-state Circuits, vol. 23, no. 2, pp. 358–367, 1988.
  35. S. R. Sternberg, “Biomedical image processing,” Computer, vol. 16, no. 1, pp. 22–34, 1983.
  36. H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
  37. B. W. Matthews, “Comparison of the predicted and observed secondary structure of t4 phage lysozyme,” Biochimica et Biophysica Acta (BBA)-Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
  38. A. Vedaldi and K. Lenc, “MatConvNet – convolutional neural networks for MATLAB,” in Proceedings of the ACM International Conference on Multimedia, 2015.
  39. C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
  40. B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory.   ACM, 1992, pp. 144–152.
  41. T. Fawcett, “An introduction to roc analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
  42. E. Walker, “The real cost of a CPU hour,” Computer, vol. 42, no. 4, pp. 35–41, 2009.