Convolutional Neural Networks: Ensemble Modeling, Fine-Tuning and Unsupervised Semantic Localization for Intraoperative CLE Images

Mohammadhassan Izadyyazdanabadi Evgenii Belykh Michael Mooney Nikolay Martirosyan Jennifer Eschbacher Peter Nakaji Mark C. Preul Yezhou Yang Arizona State University, Tempe AZ 85281, USA Department of Neurosurgery, Barrow Neurological Institute, St Joseph’s Hospital and Medical Center, Phoenix, AZ 85013 Irkutsk State Medical University, Krassnogo vosstaniya 1, Irkutsk, Russia 664003

Confocal laser endomicroscopy (CLE) is an advanced optical fluorescence technology undergoing assessment for applications in brain tumor surgery. Many CLE images are distorted and interpreted as nondiagnostic, yet a single clear CLE image might suffice for intraoperative diagnosis of the tumor. Because manual examination of thousands of nondiagnostic images during surgery is impractical, there is an opportunity for a model to select diagnostic images for the pathologist's or surgeon's review. In this study, we sought to develop a deep learning model to automatically detect the diagnostic images. We explored the effect of training regimes and ensemble modeling, and localized histological features from diagnostic CLE images. The developed model achieved higher agreement with the ground truth than the other human observers. Given its speed and precision, the proposed method has the potential to be integrated into the operative workflow in brain tumor surgery.

Neural network, Unsupervised localization, Ensemble Modeling, Brain, Confocal Laser Endomicroscopy, Surgical vision.
journal: Journal of Visual Communication and Image Representation

1 Introduction

Handheld, portable Confocal Laser Endomicroscopy (CLE) is being explored in neurosurgery because of its ability to image histopathological features of tissue in real time belykh2016intraoperative; charalampaki2015confocal; foersch2012confocal; sanai2011intraoperative. CLE provides cellular resolution imaging during brain tumor surgery and thus may provide the surgeon with precise histopathological information during tumor resection in order to interrogate regions that may harbor malignant or spreading tumor, especially at the tumor border.

Figure 5: Diagnostic and nondiagnostic CLE images (field of view = ). (a,b) Diagnostic images from glioma cases. (b) Unsupervised localization of histopathological features of gliomas, such as pleomorphism and hypercellularity, detected by our model. For more results see Fig. 21. (c,d) Nondiagnostic images from meningioma cases occluded with motion (c) and blood artifact (d).

Current CLE systems can acquire more than one image per second, so hundreds to thousands of images may be collected over the course of examining the surgical resection or inspection area. The number of images can rapidly become overwhelming for the neurosurgeon and neuropathologist when trying to quickly select a diagnostic or meaningful image or group of images as the surgical inspection progresses. CLE is designed to be used on the fly, in real time, while the surgeon operates on the brain. Overcoming the barriers involved in image selection is therefore a key step in making CLE a practical and advantageous technology for the neurosurgical operating room.

A wide range of fluorophores can be used for CLE in gastroenterology, but fluorophore options are limited for in vivo human brain use due to potential toxicity belykh2016intraoperative; foersch2012confocal; zehri2014neurosurgical. In addition, the motion and blood artifacts present in many of the images acquired with CLE using fluorescein sodium (FNa) are a barrier to revealing the underlying meaningful histology. The display of suboptimal or nondiagnostic frames interferes with the neurosurgeon's and pathologist's selection of, and focus upon, diagnostic images throughout the operation, which they need in order to make a correct intraoperative diagnosis. A previous assessment martirosyan2016prospective of CLE in human brain tumor surgery found that about half of the acquired images were interpreted as nondiagnostic due to an abundance of motion and blood artifacts or a lack of discernible or characteristic histopathological features.

Filtering out the nondiagnostic images before making an intraoperative diagnosis is challenging due to the high number of images acquired, the novel and frequently unfamiliar appearance of tissue features compared to conventional histology, great variability among images from the same tumor type, and potential similarity between images from other tumor types for the untrained interpreter (Fig. 5).

Applications of machine learning in medical imaging have greatly increased in the last ten years, resulting in numerous computer-aided detection (CADe) and diagnosis (CADx) systems in ultrasound, magnetic resonance imaging (MRI), and computed tomography (CT) reviewDLMI. Applications of machine learning for CLE imaging in neurosurgery are yet to be performed. In this study, we developed an ensemble of deep convolutional neural networks that can automatically evaluate the diagnostic value of CLE frames within milliseconds.

Due to the limited number of images in our dataset, we sought the benefits of transfer learning by using pretrained models, fine-tuning them in a shallow and a deep manner, and comparing the results with models trained from scratch. Our results demonstrated that a shallow fine-tuned model, although it performs better than a model trained from scratch, is not sufficient for optimal performance, and that a deep fine-tuned model detects the diagnostic CLE frames better. We also investigated the effect of ensemble modeling by creating an ensemble of the models that were crafted at the development stage and produced the minimum loss on the validation dataset. Finally, we compared the performance of the ensemble of models with that of each single model.

2 Related Works

2.1 Convolutional Neural Networks

Convolutional neural networks (CNNs), a subcategory of deep learning methods, have proven useful in visual imagery analysis from numerous fields, including medical images. This is mainly due to the deep multilayer architecture of CNNs which enables extracting abstract discriminant features, both local and global, present in the images.

In recent years, deep learning has been widely applied to medical image or exam classification. According to a survey by Litjens et al. deepmedsurvey, exam and object classification together make up the number one task of interest in medical image analysis, followed by object detection and organ segmentation (exam classification alone is the third task of interest). Most of these studies in the medical imaging field use one of the following three imaging modalities: MRI, microscopy or CT.

Histopathological microscopic images and brain MRI scans were the first two areas where deep learning has been explored in medical imaging deepmedsurvey. In histopathology, deep learning has been used for mitosis detection ciresan, classification of leukocytes Zhao2016 and nuclei detection and classification sirinukunwattana. In brain MRI, several studies have concluded that CNN benefits the diagnosis of Alzheimer’s disease suk2016deep; shi2017multimodal as well as brain extraction salehi2017auto and lesion detection, classification, and tumor grading ghafoorian2017deep; zhao2016multiscale.

No-reference image quality assessment has been formulated as a classification problem, as employed for retinal mahapatra2016retinal and echocardiographic purang images. CNNs may also be used to detect key frames in a temporal sequence of video frames. Two studies demonstrated the use of a classification scheme on ultrasound (US) video streams to label the frames gao2016d; kumar2016plane.

2.2 Transfer Learning vs. Deep Training

One of the major limitations in medical imaging is the small size of datasets. The number of images available for deep learning applications in medical imaging is usually much smaller than in computer vision. Therefore, two forms of transfer learning have gained great interest: (1) applying a network pretrained on large-scale natural images (e.g., ImageNet) as a feature extractor, and (2) initializing model parameters (weights and biases) with those of a pretrained model yosinski2014transferable instead of random initialization. A previous study by Tajbakhsh et al. Nima showed that a sufficiently fine-tuned AlexNet model could produce equal or better results than a deeply trained one for colonoscopy image quality assessment and a few other medical applications. Here, we study the effect of fine-tuning by extending it to Inception network architectures in single and ensemble mode.

2.3 Ensemble Modeling

Ensemble modeling is a well-established method in machine learning for increasing model performance and reducing its variance dietterich2000ensemble; zhou2002ensembling; ciregan2012multi. Kumar et al. kumar2017ensemble created an ensemble of 5 different models to classify the image modality in the ImageCLEF 2016 medical image dataset. Specifically, 2 classifiers were created by fine-tuning AlexNet and GoogLeNet with softmax, and 3 other classifiers by training an SVM on top of the features extracted by AlexNet, GoogLeNet, and their combination. Their results showed that the ensemble could improve the top-1 accuracy of the classifier compared to single models; however, it is not clear how much of the improvement was due to the AlexNet and GoogLeNet combination and how much to the ensemble of 5 classifiers.

Christodoulidis et al. christodoulidis2017multisource created an ensemble for multi-source transfer learning using the automatic model selection approach described in [ensemble-selection]. After creating a pool of CNNs pretrained on several public texture datasets and fine-tuning them on the lung CT dataset, the models whose iterative grouping produced the highest F-scores on the validation dataset were aggregated into an ensemble model. Five ensemble models were developed and their outputs were averaged to produce the final output. Despite its computational complexity, this approach improved lung disease pattern classification accuracy by only 2%.

To generate diversity in our models while using the whole training dataset, we trained different neural networks on different data using cross-validation. Although previous studies have tried to create variant deep learning models using different network architectures, none of them have employed training data diversification through cross-validation as described in krogh1995neural. Our proposed ensemble employed model diversification both in the network architectures and in the training and validation datasets following krogh1995neural.

2.4 Confocal Laser Endomicroscopy in Neurosurgery

Handheld, portable CLE has demonstrated its value for brain tumor surgery due to its ability to provide rapid intraoperative information regarding histopathological features of the tumor tissue martirosyan2016prospective. Convenience, portability, and speed of CLE are significant advantages in surgery. A decision support system aiding and accelerating analysis of CLE images by the neuropathologist or neurosurgeon would improve the workflow in the neurosurgical operating room izady2017improv.

Potentially used at any time during the surgery, CLE interrogation of the tissue generates images with a speed of 0.8 - 1.2 frames per second. The frames are considered nondiagnostic when the histological features are obscured by the red blood cells or motion artifacts, are out of focus, or lack any useful histopathological information. Acquired images are then exported from the instrument as JPEG or TIFF files for review. Currently, the pathologist reviews all images, including nondiagnostic ones, trying to explore the diagnostic frames for the diagnosis. Manual selection and review of thousands of images acquired at some point in surgery by the CLE operator is tedious and impractical for widespread use. Previously, we have presented izady2017improv the first deeply trained CNN model for automatic detection of diagnostic CLE frames. In this work, we extend our previous work with the following contributions and advancements:

  1. Dataset. Our dataset contains images from CLE, a novel technology in contrast to the commonly used MRI or CT scans. The dataset includes 20,734 CLE images from intracranial neoplasms.

  2. Deep training, shallow fine-tuning or deep fine-tuning? The CNN architectures were trained in three regimes: I. deeply trained (train the network from scratch with model weights randomly initialized) II. shallow fine-tuned (fine-tune only the fully connected layer(s) of the model which are responsible for the classification) III. deeply fine-tuned (fine-tune the whole network using our dataset). In this study we report model accuracy on the test dataset for the best 5 models from each network architecture and training regime. Our work is different from Nima since it considers the fine tuning effect on two different network architectures and its effect on the ensemble models.

  3. Ensemble modeling. Prior to the test phase, we created an ensemble of the best 5 models from each network and training regime. We explored the effect of ensemble modeling in all circumstances by comparing the ensemble performance with the average of single models. Our work is different from kumar2017ensemble since our ensemble generates diversity in single models by using different training and validation data achieved from nested-cross validation. Further, we studied the effect of ensemble modeling on different training scenarios rather than one.

  4. Unsupervised localization of histological features. We visualized the shallow neurons' activation to depict broad histological patterns; visualization of the deep neurons' activation could localize specific histopathological lesions in diagnostic images. The neural response of the convolutional layers to the diagnostic images was visualized and analyzed by a neurosurgeon. We also extracted the CNN's deepest neural activation in response to patches of the diagnostic images using a sliding window. Qualitative assessment of the localized regions was performed by a neurosurgeon, with further analysis of the histopathological features.

  5. Interobserver study. We compared the interobserver agreement between physician-physician and ensemble of models-physician to compare our ensemble model performance with human performance. We also reported the kappa statistic for this observer study.

3 Methods

3.1 Image Acquisition

In the following sections we briefly explain the confocal imaging instrument specifications and the intraoperative data collection process.

3.1.1 Instrument specifications

The CLE image acquisition system (Optiscan 5.1, Optiscan Pty, Ltd.) included a rigid pen-sized optical laser scanner with a 6.3 mm outer diameter and a working length of 150 mm. A 488 nm diode laser provided excitation light and the fluorescent emission signal was detected with a 505-585 nm band-pass filter. A single optical fiber acted as both the excitation pinhole and the detection pinhole for confocal isolation of the image plane. The detector signal was digitized synchronously with the scanning to construct images parallel to the tissue surface (en face optical sections).

Laser power was typically set to 550-900 µW and maximum power was limited to 1000 µW when applied to the brain tissue. A field of view of was scanned at pixels (1.2/second frame rate), with a lateral resolution of 0.7 µm and an axial resolution (i.e., effective optical slice thickness) of approximately 4.5 µm.

3.1.2 Intraoperative CLE imaging

Seventy-four adult patients (31 male and 43 female) were enrolled in the study (mean age 47.5 years). Intraoperative CLE images were acquired both in vivo and ex vivo by 4 neurosurgeons. For in vivo imaging, multiple locations of the tissue around the lesion were imaged and excised from the patient. For ex vivo imaging, tissue samples suspicious for tumor were excised, placed on gauze and imaged on a separate work station in the operating room. Multiple images were obtained from each biopsy location.

Co-registration of the CLE probe with the image guided surgical system allowed precise intraoperative mapping of CLE images with regard to the site of the biopsy. The only fluorophore administered was FNa (5, 10%) that was injected intravenously during the surgery.

The precise location of the areas imaged with the CLE was marked with tissue ink. Imaged tissue was sent to the pathology laboratory for formalin fixation, paraffin embedding and preparation of histological sections. Final histopathological assessment was performed by standard light microscopic evaluation of 10-µm-thick hematoxylin and eosin (H & E)-stained sections.

3.2 Image annotation

The image annotation process was done in two distinct stages: initial review and validation review.

3.2.1 Initial review

Initially, all images were reviewed. A neuropathologist and 2 neurosurgeons who were not involved in the surgeries reviewed the CLE images. For each patient, the histopathological features of the corresponding CLE images and H & E-stained frozen and permanent sections were reviewed and the diagnostic value of each image was examined. When a CLE image revealed clearly identifiable histopathological features, it was labeled as diagnostic; otherwise it was labeled as nondiagnostic.

3.2.2 Validation review

The database of images was divided into development and test datasets (explained in dataset preparation, 4.1). The test dataset was composed of 4,171 CLE images randomly chosen from various patients. The validation review (val-review) dataset consisted of 540 images randomly chosen from the test dataset. Following this separation, two neurosurgeons reviewed the val-review dataset without access to the corresponding H & E-stained slides and labeled each image as diagnostic or nondiagnostic.

3.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are multilayer learning frameworks and may consist of an input layer, a few convolutional layers, pooling layers, fully connected layers and the output. The goal of a CNN is to learn the hierarchy of underlying feature representations. We explain the fundamental elements of a CNN below.

3.3.1 Convolutional layer

Convolutional layers, first introduced in LeCun:1998:CNI:303568.303704, replace the earlier hand-crafted feature extractors. At each convolutional layer, three-dimensional matrices (kernels) are slid over the input, and the dot product of the kernel weights with the receptive field of the input is set as the corresponding local output. This helps to retain the relative position of features to each other. The multi-kernel characteristic of convolutional layers enables them to extract several distinct feature maps from the same input image.
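
The sliding dot product described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Caffe implementation used in the study: it handles a single two-dimensional channel (the layers in the paper use three-dimensional kernels over several channels), and the function name, toy image and kernel are hypothetical.

```python
def convolve2d_valid(image, kernel, stride=1):
    """Slide a kernel over a 2-D image and return the map of dot products.

    Only positions where the kernel fits entirely inside the image are used.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for r in range(0, len(image) - k_rows + 1, stride):
        out_row = []
        for c in range(0, len(image[0]) - k_cols + 1, stride):
            # Dot product of the kernel weights with the receptive field.
            out_row.append(sum(image[r + i][c + j] * kernel[i][j]
                               for i in range(k_rows)
                               for j in range(k_cols)))
        output.append(out_row)
    return output


# Toy input with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical kernel that responds to left-to-right intensity changes.
edge_kernel = [
    [1, -1],
    [1, -1],
]
feature_map = convolve2d_valid(image, edge_kernel)
```

The response is nonzero only where the receptive field straddles the edge, which is how the output map preserves the position of the feature. Applying several kernels to the same input yields several such maps.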

3.3.2 Activation layer

The convolutional layer output then passes through an activation function that suppresses the negative values. We employed the rectified linear unit (ReLU), which is usually preferred over other nonlinear functions such as the sigmoid because of its simplicity, higher speed, reduced likelihood of vanishing gradients (especially in deep networks) and tendency to add sparsity. The output of the ReLU layer ($a_{out}$), given its input ($a_{in}$), was calculated in-place (to consume less memory) as follows:

$$a_{out} = \max(0, a_{in}) \tag{1}$$

3.3.3 Normalization layer

Following the ReLU layer, local response normalization (LRN) is applied after the initial convolutional layers. This layer inhibits the local ReLU neurons' activations, since nothing bounds them from above (Eq. 1). In the LRN implemented in Caffe jia2014caffe, the local regions extend across neighboring feature maps at each spatial location. The output of the LRN layer ($b_{out}$), given its input ($b_{in}$), is calculated as:

$$b_{out} = b_{in}\left(k + \alpha \sum_{i=1}^{L} b_{in}(i)^{2}\right)^{-\beta} \tag{2}$$

where $b_{in}(i)$ is the $i$th element of $b_{in}$ and $L$ is the length of the $b_{in}$ vector (number of neighbor maps employed in the normalization). $k$, $\alpha$ and $\beta$ are the layer's hyperparameters and are set to their default values obtained from krizhevsky2012imagenet.
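
Rectification followed by normalization across neighboring feature maps can be illustrated as below. This is a sketch under stated assumptions: it follows the general form of local response normalization in Krizhevsky et al., and the window size and the values of `k`, `alpha` and `beta` are placeholders chosen for illustration rather than the settings used in this study. Framework implementations may scale the terms differently.

```python
def relu(values):
    """Rectified linear unit: negative activations become zero."""
    return [max(0.0, v) for v in values]


def lrn_across_maps(map_values, size=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize activations at one spatial location across feature maps.

    map_values[c] is the rectified activation of feature map c at a single
    spatial location. All default hyperparameters are assumptions.
    """
    last = len(map_values) - 1
    half = size // 2
    normalized = []
    for c, value in enumerate(map_values):
        # Neighboring maps that fall inside the normalization window.
        lo, hi = max(0, c - half), min(last, c + half)
        energy = sum(map_values[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(value / (k + alpha * energy) ** beta)
    return normalized


activations = relu([-1.0, 2.0, 0.5, -3.0, 4.0])
normalized = lrn_across_maps(activations)
```

Because the denominator grows with the squared activity of the neighboring maps, a strongly activated neighborhood damps each of its members, which is the inhibiting effect described in the text.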

3.3.4 Pooling layer

After rectification and normalization of convolutional layer output, it’s further down-sampled by pooling operations. Pooling operations accumulate values in a smaller region by subsampling operations such as max, min, and average sampling. Here, max pooling was applied.
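
As an illustration of the max pooling step, the sketch below down-samples a small feature map. The window size and stride are hypothetical values chosen for the example, not those of the networks in this study.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Down-sample a 2-D feature map by keeping the maximum of each window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        pooled_row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size)
                      for dc in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2d(feature_map)
```

Each 2 by 2 block of the toy map is reduced to its largest value, halving both spatial dimensions.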

3.3.5 Fully connected Layer

Following several convolutional and pooling layers, the final layers of the network are fully connected. Each output neuron of such a layer is densely connected to all of the layer's input neurons. It can be thought of as a convolutional layer whose kernel size equals the size of the layer input. The layer output is also passed through a ReLU layer. The fully connected layers are generally thought of as the classifier of a CNN model because they take the most abstract features extracted by the convolutional layers and produce the output, which is the model prediction.

3.3.6 Dropout Layer

Fully connected layers are usually followed by a dropout layer, except for the last fully connected layer, which produces the class-specific probabilities. In dropout layers, a subset of input neurons, as well as all their connections, is temporarily removed from the network. Srivastava et al. dropout have demonstrated this method's efficiency at improving CNN performance in numerous computer vision tasks by reducing overfitting.
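
A dropout layer can be sketched as follows. This is an illustrative version only: it uses the common "inverted" variant that rescales the surviving activations during training, and the drop ratio shown is an assumed placeholder, since the ratio used in the study is not given in this text.

```python
import random


def dropout(activations, drop_ratio=0.5, training=True, rng=None):
    """Temporarily zero a random subset of activations during training.

    Kept activations are divided by the keep probability so that the
    expected sum is unchanged. At test time the input passes through.
    drop_ratio must be smaller than 1.
    """
    if not training or drop_ratio == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - drop_ratio
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


inputs = [1.0, 2.0, 3.0, 4.0]
# A seeded generator makes the example repeatable.
train_output = dropout(inputs, drop_ratio=0.5, rng=random.Random(0))
test_output = dropout(inputs, training=False)
```

Because a different subset is removed at every training step, no neuron can rely on the presence of any specific other neuron, which is the mechanism credited with reducing overfitting.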

3.3.7 Learning

A CNN learns through stochastic gradient descent (SGD), which rests on two main steps: forward and back propagation. In forward propagation, the model makes predictions for the images in the training batch using the current model parameters. Once the predictions for all training images are made, the loss is calculated using the ground truth labels provided by the experts in the initial review (explained in 3.2.1). In this work we adopt the softmax loss function given by:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} t_{ij}\,\log y_{ij} \tag{3}$$

where $t_{ij}$ is the $j$th element of the $i$th training image's ground truth output, and $y_{ij}$ is the value of the $j$th (softmax) output layer unit in response to the $i$th input training image. $N$ is the number of training images in the minibatch, and since we consider two diagnostic value categories, $C = 2$.
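
The softmax loss over a minibatch can be sketched as below. It is a minimal illustration, assuming that the network emits one raw score per class, that labels are integer class indices (0 for nondiagnostic and 1 for diagnostic, a hypothetical encoding), and that the loss is averaged over the minibatch.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_scores, batch_labels):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for scores, label in zip(batch_scores, batch_labels):
        total -= math.log(softmax(scores)[label])
    return total / len(batch_scores)


# Three hypothetical images, two classes each.
batch_scores = [[2.0, 0.5], [0.1, 1.2], [0.0, 0.0]]
batch_labels = [0, 1, 1]
loss = softmax_loss(batch_scores, batch_labels)
```

A model that assigns equal scores to both classes incurs a loss of log 2 per image, which is a useful reference point when reading validation loss curves for a two-class problem.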

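Through back propagation, the gradient of the loss $\mathcal{L}$ with respect to all model weights is used to update the weights as follows:

$$W_j^{i+1} = W_j^{i} + \mu\,\Delta W_j^{i} - \eta\,\frac{\partial \mathcal{L}}{\partial W_j^{i}} \tag{4}$$

where $W_j^{i}$, $W_j^{i+1}$ and $\Delta W_j^{i}$ are the weights of the $j$th convolutional layer at iterations $i$ and $i+1$ and the weight update of iteration $i$, $\mu$ is the momentum, and $\eta$ is the learning rate, which is dynamically lowered as the training progresses.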
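
A momentum update with a decaying learning rate can be sketched as follows. The step-wise decay policy and every numeric value here are assumptions made for illustration; the text only states that the learning rate is lowered as training progresses.

```python
def sgd_momentum_step(weights, gradients, previous_update, learning_rate, momentum):
    """One SGD step: blend the previous update with the new gradient."""
    update = [momentum * u - learning_rate * g
              for u, g in zip(previous_update, gradients)]
    new_weights = [w + d for w, d in zip(weights, update)]
    return new_weights, update


def step_learning_rate(base_rate, iteration, step_size, gamma):
    """Assumed policy: multiply the rate by gamma every step_size iterations."""
    return base_rate * gamma ** (iteration // step_size)


# Minimize the toy objective f(w) = w^2, whose gradient is 2w.
weights, update = [1.0], [0.0]
for iteration in range(3):
    gradients = [2.0 * w for w in weights]
    rate = step_learning_rate(0.1, iteration, step_size=2, gamma=0.5)
    weights, update = sgd_momentum_step(weights, gradients, update,
                                        rate, momentum=0.9)
```

The momentum term carries part of the previous update forward, so consecutive gradients that point in the same direction accelerate the descent.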

3.4 Evaluation Metrics

In model performance estimation (explained in 4.3) we calculated the loss, accuracy, sensitivity, specificity and area under receiver operating characteristics (ROC) curve (AUC). Here, we assumed the state of being a diagnostic image as positive and being nondiagnostic as negative. This way, sensitivity determines the model ability to detect diagnostic images and specificity determines its ability to detect nondiagnostic images. Accuracy determines general capability of a model to detect diagnostic and nondiagnostic images correctly metz1978basic.
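
The metrics above can be computed from predictions as in the following sketch, which treats diagnostic as the positive class (label 1) and nondiagnostic as the negative class (label 0). The 0.5 decision threshold and the toy scores are hypothetical, and the AUC is computed with the pairwise ranking definition rather than by integrating a plotted curve.

```python
def confusion_counts(truth, predicted):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summary_metrics(truth, predicted):
    tp, fp, tn, fn = confusion_counts(truth, predicted)
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),  # diagnostic images detected
        "specificity": tn / (tn + fp),  # nondiagnostic images detected
    }


def auc(truth, scores):
    """Probability that a positive image outscores a negative one."""
    positives = [s for t, s in zip(truth, scores) if t == 1]
    negatives = [s for t, s in zip(truth, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


truth = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]
metrics = summary_metrics(truth, predicted)
area = auc(truth, scores)
```

Unlike accuracy, sensitivity and specificity, the AUC does not depend on the chosen threshold, which makes it convenient for comparing models whose output scores are calibrated differently.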

4 Experimental Setup

4.1 Dataset Preparation

Our dataset included 20,734 CLE images from 74 brain tumor cases. For each CLE image, the diagnostic quality was determined by the experts in the initial review. The dataset was divided into two main subsets on patient level: development (dev) and test. Our pilot study revealed that when the division is on image level (mixing the images from all the patients and dividing them randomly) the model would produce poor results on images from new patients.
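
A patient-level split can be sketched as below. The record layout (pairs of patient identifier and image name), the identifiers and the seed are hypothetical; the point is only that patients, not images, are assigned to the subsets, so no patient contributes images to both.

```python
import random


def split_by_patient(image_records, test_patient_count, seed=0):
    """Split (patient_id, image_name) records so that patients never overlap."""
    patients = sorted({patient for patient, _ in image_records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    test_patients = set(patients[:test_patient_count])
    dev = [rec for rec in image_records if rec[0] not in test_patients]
    test = [rec for rec in image_records if rec[0] in test_patients]
    return dev, test


# Ten hypothetical patients with three images each.
records = [("P%02d" % p, "img_%02d_%d.tif" % (p, i))
           for p in range(1, 11)
           for i in range(3)]
dev_records, test_records = split_by_patient(records, test_patient_count=2)
```

With an image-level split, near-duplicate frames from the same biopsy site could land on both sides of the split, which would overstate performance on new patients.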

The total number of patients and images used at each stage are presented in Table 1. Each subset contains images from various tumor types (mainly from gliomas and meningiomas). The dev set will be available online. The test set was isolated all through the model development and was accessed only in the test phase.

                            Development   Test
No. of Patients (total)     59            15
  Gliomas                   16            5
  Meningiomas               24            6
  Other neoplasms           19            4
No. of Images (total)       16,366        4,171
  Diagnostic                8,023         2,071
  Nondiagnostic             8,343         2,100

Table 1: Dataset preparation: Patient-based allocation of diagnostic and nondiagnostic images from various neoplasms to model development and testing. Number of patients for each tumor type is also provided.
Figure 6: Ensemble effect on network 1 (top) and network 2 (bottom) while using diverse training regimes. For both networks, the improvement was more noticeable with DT and DFT regimes. The arithmetic and geometric ensemble performed similarly. Neither of the two ensembles could improve network 2 trained with SFT.

4.2 Model Development

After the initial data split, we employed patient-based k-fold cross-validation for model development. The fifty-nine cases allocated for model development were divided into 5 groups. Since CNNs require a large set of hyperparameters to be defined optimally (e.g., the initial value of the learning rate and its lowering policy, momentum, batch size), we searched over different values with a grid search throughout the model development process. For every set of feasible parameters, we trained the model on 4 folds and validated on the fifth, left-out group of patients. The set of hyperparameters that produced the minimum average loss was employed for each set of experiments.
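
The patient-based cross-validation with grid search can be outlined as follows. This is a sketch under stated assumptions: `train_and_validate` is a hypothetical callback standing in for an actual training run that returns the validation loss, the round-robin fold assignment is one of many possible choices, and the grid values are illustrative.

```python
import itertools


def patient_folds(patient_ids, k=5):
    """Assign patients to k folds in round-robin order."""
    unique = sorted(set(patient_ids))
    return [unique[i::k] for i in range(k)]


def grid_search(patient_ids, grid, train_and_validate, k=5):
    """Return the hyperparameters with the lowest mean validation loss."""
    folds = patient_folds(patient_ids, k)
    names = sorted(grid)
    best_params, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        losses = []
        for held_out in range(k):
            val_patients = folds[held_out]
            train_patients = [p for i, fold in enumerate(folds)
                              if i != held_out for p in fold]
            losses.append(train_and_validate(train_patients, val_patients, params))
        mean_loss = sum(losses) / k
        if mean_loss < best_loss:
            best_params, best_loss = params, mean_loss
    return best_params, best_loss


def fake_train_and_validate(train_patients, val_patients, params):
    """Stand-in for real training: pretends one setting is optimal."""
    assert not set(train_patients) & set(val_patients)
    return abs(params["learning_rate"] - 0.001) + abs(params["momentum"] - 0.9)


grid = {"learning_rate": [0.01, 0.001], "momentum": [0.9, 0.95]}
best_params, best_loss = grid_search(range(59), grid, fake_train_and_validate)
```

With 59 patients and 5 folds, four folds hold 12 patients and one holds 11, and each patient appears in exactly one validation fold.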

The small dataset size was a main limitation of our study for using CNNs, especially with the patient-level data preparation. Therefore, we counterbalanced this limitation by fine-tuning the pretrained publicly available CNN architectures trained on large computer vision datasets (i.e. ImageNet).

Although the question of how deeply pretrained models should be fine-tuned for optimal results remains unanswered, one study tried to answer it using endoscopy and ultrasound images Nima. Because of the substantial intrinsic dissimilarities between the images in the two studies, we performed a similar investigation: our confocal images are fluorescent images from the brain with a much higher spatial resolution.

In total, we developed 42 models (30 single models and 12 ensemble models) using two network architectures and three training regimes (deep training, shallow fine-tuning and deep fine-tuning). The experiments are designed in order to practically find the optimal model development pathway that produces the highest performance in the considered clinical application.

4.2.1 Network architectures

Two CNN architectures were applied in this study. Network 1 had 5 convolutional layers. The first two convolutional layers had 96 and 256 filters of size 11×11 and 5×5, each followed by max pooling. The third, fourth and fifth convolutional layers were connected back to back without any pooling in between. The third convolutional layer had 384 filters of size 3×3, the fourth layer had 384 filters of size 3×3 and the fifth layer had 256 filters of size 3×3, followed by max pooling. For more details please refer to krizhevsky2012imagenet.

Network 2 had 22 layers with parameters and 9 inception modules. Each inception module was a combination of filters of size 1×1, 3×3 and 5×5 and a 3×3 max pooling, put together in parallel, with the output filter banks concatenated into a single input vector for the next stage. For more details please refer to szegedy2015going.

The pretrained model for network 1, exploited in fine-tuning experiments, was the iteration 360,000 snapshot of training the model on ImageNet classification with 1000 classes. The pretrained model for network 2 was iteration 2,400,000 of training on ImageNet classification dataset. Both models are publicly available in Caffe libraries jia2014caffe.

4.2.2 Training regimes

We exercised various training regimes to see how deep fine-tuning should be done in CLE image classification for optimal results. Depending on which layers of the network are being learned through training, we had three regimes.

In regime 1, deep training (DT), the whole model weights were initialized randomly (training from scratch) and got modified all through the training with nonzero learning rates.

In regime 2, shallow fine-tuning (SFT), the whole model weights, except the last fully connected layer, were initialized with the corresponding values from the pretrained model and their values were fixed for the period of training. The last fully connected layer was initialized randomly and got tuned during training.

In regime 3, deep fine-tuning (DFT), all model weights, except for those of the last fully connected layer, were initialized with the corresponding values from the pretrained model, and the last fully connected layer was initialized randomly. Throughout the training, all the CNN layers, including the last fully connected layer, were tuned with nonzero learning rates. Our hyperparameter optimization showed that the SFT and DFT experiments required initial learning rates that were 10 times smaller (0.001) than that of the DT regime (0.01). To avoid overfitting, the training process was stopped after 3 epochs of consistent loss increase on the validation dataset. We also used a dropout layer and regularization.
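
The three regimes and the stopping rule can be summarized in a short sketch. The layer names are hypothetical, and the stopping rule is read here as "three consecutive epochs in which the validation loss rose", which is one possible interpretation of the criterion in the text.

```python
# Hypothetical layer names; the last entry is the final fully connected layer.
LAYERS = ["conv1", "conv2", "conv3", "fc6", "fc7", "fc8"]

# Initial learning rates reported in the text for each regime.
BASE_LEARNING_RATE = {"DT": 0.01, "SFT": 0.001, "DFT": 0.001}


def regime_plan(regime, layers=LAYERS):
    """Describe, per layer, how weights start and whether they are tuned."""
    if regime not in BASE_LEARNING_RATE:
        raise ValueError("unknown regime: " + regime)
    plan = {}
    for name in layers:
        is_last = name == layers[-1]
        pretrained = regime != "DT" and not is_last
        trainable = regime != "SFT" or is_last
        plan[name] = {
            "init": "pretrained" if pretrained else "random",
            "learning_rate": BASE_LEARNING_RATE[regime] if trainable else 0.0,
        }
    return plan


def stopping_epoch(validation_losses, patience=3):
    """Return the 1-based epoch at which training stops, or None."""
    rising = 0
    for epoch in range(1, len(validation_losses)):
        if validation_losses[epoch] > validation_losses[epoch - 1]:
            rising += 1
        else:
            rising = 0
        if rising >= patience:
            return epoch + 1
    return None


sft_plan = regime_plan("SFT")
# Hypothetical validation losses, one per epoch.
losses = [0.70, 0.62, 0.58, 0.59, 0.61, 0.64, 0.66]
stop_at = stopping_epoch(losses)
```

In this toy run the loss is lowest at epoch 3 and training halts at epoch 6, so the snapshot with the minimum validation loss, not the last one, would be the model to keep.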

4.2.3 Ensemble Modeling

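Let us assume $y_{ij}^{k}$ is the value of the $j$th output layer unit of the $k$th CNN model in response to the $i$th input test image. The linear and log-linear ensemble classifier outputs for the same input would be:

$$Ens^{AM}_{ij} = \frac{1}{l}\sum_{k=1}^{l} y_{ij}^{k} \tag{5}$$

$$Ens^{GM}_{ij} = \left(\prod_{k=1}^{l} y_{ij}^{k}\right)^{1/l} \tag{6}$$

where $l$ is the number of CNN models combined to generate the ensemble models.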

Model selection was done in two forms: single models and ensembles of models. We selected the top model (with the minimum loss on the validation dataset) from each fold of the 5-fold cross-validation (Model 1-5 in Table 2). For each network architecture and training regime, we combined the top 5 single models to produce two ensembles of models using the arithmetic mean (Eq. 5) and the geometric mean (Eq. 6) of their outputs. We created 12 ensemble models (2 networks × 3 training regimes × 2 ensemble types) in total and compared their performance with that of the single models.
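
Combining the outputs of five models by arithmetic and geometric mean can be sketched as below. The probabilities are invented for illustration, and the sketch assumes every model output is strictly positive (as softmax outputs are) so that the logarithm is defined.

```python
import math


def arithmetic_ensemble(model_outputs):
    """Linear ensemble: average each class output across models."""
    count = len(model_outputs)
    return [sum(column) / count for column in zip(*model_outputs)]


def geometric_ensemble(model_outputs):
    """Log-linear ensemble: geometric mean of each class output."""
    count = len(model_outputs)
    return [math.exp(sum(math.log(p) for p in column) / count)
            for column in zip(*model_outputs)]


# Hypothetical [diagnostic, nondiagnostic] outputs of five models for one image.
outputs = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.7, 0.3],
    [0.4, 0.6],
    [0.8, 0.2],
]
linear = arithmetic_ensemble(outputs)
log_linear = geometric_ensemble(outputs)
predicted_class = max(range(2), key=lambda j: linear[j])
```

Here one of the five models votes for the nondiagnostic class, but both ensembles still favor the diagnostic class. The geometric mean is never larger than the arithmetic mean and penalizes a class more strongly when any single model assigns it a low value.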

4.3 Interobserver Study

Each solo and ensemble model developed was tested on the test dataset. The ensemble of network 2 models trained with DFT was also tested on the val review images (3.2.2) to compare human-human and model-human interobserver agreements. The resulting agreement rate (val-rater 1 and 2) was further compared with the initial image review results. The agreement of the model prediction with the initial review was also calculated. The general agreements are compared and discussed in section 5.4. Kappa analysis was also done for further validation.

The gold standard ground truth for the val review images was defined by majority voting (see Fig. 7). The agreement of the third rater with the gold standard and that of the proposed ensemble model with the gold standard are calculated and compared in Table 3.

Figure 7: Interobserver study using gold standard ground truth. Gold standard was defined using the initial review and one of the val-raters (here val-rater 1). Then, the agreement of the ensemble model and the other val-rater (here val-rater 2) with the gold standard is calculated to compare the human-human with model-human agreement. *If the initial review and the val-rater 1 agreed on an image, it is added to the gold standard, otherwise it is disregarded.
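
The gold standard construction and the agreement statistics can be sketched as follows. The labels are invented for illustration (1 for diagnostic, 0 for nondiagnostic), the dictionary layout is hypothetical, and Cohen's kappa is used here as the kappa statistic, which is an assumption about the variant reported in the study.

```python
from collections import Counter


def gold_standard(initial_review, val_rater):
    """Keep only images on which the initial review and the rater agree."""
    return {image: label for image, label in initial_review.items()
            if val_rater.get(image) == label}


def agreement_rate(reference, other):
    """Fraction of reference images that the other rater labels identically."""
    matches = sum(1 for image, label in reference.items()
                  if other[image] == label)
    return matches / len(reference)


def cohen_kappa(reference, other):
    """Agreement corrected for the agreement expected by chance."""
    n = len(reference)
    observed = agreement_rate(reference, other)
    ref_counts = Counter(reference.values())
    other_counts = Counter(other[image] for image in reference)
    expected = sum(ref_counts[c] * other_counts[c] for c in ref_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


initial = {1: 1, 2: 1, 3: 0, 4: 0, 5: 1, 6: 0}
rater_1 = {1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1}
rater_2 = {1: 1, 2: 1, 3: 0, 4: 1, 5: 1, 6: 0}
model = {1: 1, 2: 1, 3: 0, 4: 0, 5: 1, 6: 0}

gold = gold_standard(initial, rater_1)
human_agreement = agreement_rate(gold, rater_2)
model_agreement = agreement_rate(gold, model)
human_kappa = cohen_kappa(gold, rater_2)
```

Images 2 and 6 are dropped from the gold standard because the initial review and the first rater disagree on them; the second rater and the model are then scored only on the remaining images.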

4.4 Unsupervised Histological Feature Localization

For localization of the histological features, we examined the neural activation at two sites. First, the activation of neurons in the first convolutional layer of network 1 was visualized and the 96 feature planes were saved for review by a neurosurgeon. Neurons that showed high activation at the location of cellular structures in the input image were selected, and their response appeared consistent across diverse diagnostic images. Second, we applied a sliding window of size pixels (size of network 1 input after input cropping) with a stride of 79 pixels over the diagnostic CLE images ( pixels). The result was a matrix that provided the diagnostic value of different locations of the input image (diagnostic map). The locations of the input images corresponding to the highest activations of the diagnostic map were detected and marked with a bounding box. The features detected by each of these two approaches were further reviewed by a neurosurgeon.
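
The sliding-window construction of a diagnostic map can be sketched as below. The image, window size and stride are toy values, and `mean_intensity` is a hypothetical stand-in for the network's diagnostic output on a patch, so the sketch shows only the windowing and box selection logic.

```python
def window_origins(image_size, window_size, stride):
    """Top-left offsets of every window that fits fully inside the image."""
    return list(range(0, image_size - window_size + 1, stride))


def diagnostic_map(image, window_size, stride, score_patch):
    """Score every window position; returns the map and the window offsets."""
    rows = window_origins(len(image), window_size, stride)
    cols = window_origins(len(image[0]), window_size, stride)
    dmap = []
    for r in rows:
        scores = []
        for c in cols:
            patch = [line[c:c + window_size]
                     for line in image[r:r + window_size]]
            scores.append(score_patch(patch))
        dmap.append(scores)
    return dmap, rows, cols


def strongest_boxes(dmap, rows, cols, window_size, top_k=1):
    """Bounding boxes (row, col, height, width) of the highest scores."""
    scored = [(dmap[i][j], rows[i], cols[j])
              for i in range(len(rows))
              for j in range(len(cols))]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c, window_size, window_size) for _, r, c in scored[:top_k]]


def mean_intensity(patch):
    """Hypothetical scorer used in place of the CNN output."""
    values = [v for line in patch for v in line]
    return sum(values) / len(values)


# Toy 8x8 image with a bright region in the lower-right corner.
image = [[1.0 if r >= 4 and c >= 4 else 0.0 for c in range(8)]
         for r in range(8)]
dmap, rows, cols = diagnostic_map(image, window_size=4, stride=2,
                                  score_patch=mean_intensity)
boxes = strongest_boxes(dmap, rows, cols, window_size=4)
```

A smaller stride gives a finer diagnostic map at the cost of more forward passes, since the number of scored patches grows with the square of the number of window positions per side.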

5 Results and Discussion

We developed 42 models and tested them on 4,171 test images; accuracy rates (agreement with the initial review) are presented in Table 2. Network 2 predicted the diagnostic quality of images more accurately than network 1 under the DT and DFT training regimes, while under the SFT regime network 1 was slightly more accurate than network 2. This suggests that the network 2 architecture is the better feature extractor for CLE images, plausibly because it concatenates multi-scale features inside its inception modules.

Network               Network 1                 Network 2
Training regime       DT      SFT     DFT       DT      SFT     DFT
Model 1               0.685   0.760   0.760     0.731   0.746   0.746
Model 2               0.658   0.749   0.755     0.750   0.746   0.805
Model 3               0.677   0.751   0.765     0.715   0.747   0.797
Model 4               0.681   0.754   0.771     0.739   0.743   0.811
Model 5               0.699   0.753   0.775     0.721   0.747   0.777
Mean                  0.680   0.753   0.765     0.731   0.746   0.787
Arithmetic ensemble   0.704   0.755   0.788     0.754   0.750   0.816
Geometric ensemble    0.703   0.758   0.786     0.755   0.751   0.818
Table 2: The accuracy of different models on the test dataset. The top-5 models from each training regime, as well as their arithmetic and geometric ensembles, were evaluated. For each network, the ensemble of DFT models makes the most accurate predictions. The difference between arithmetic and geometric ensemble AUC was negligible.
Figure 8: Training regime effect on network 1 (left) and network 2 (right) when using single models (top) or an ensemble of models (bottom). In all cases the AUC for the DFT regime was greater than for SFT, and the AUC for SFT was greater than for DT, although the effect size varied.
Figure 21: Unsupervised localization of the histopathological features from shallow and deep neurons inside the network. First column (a, e, i) shows the input CLE images from human glioblastoma obtained intraoperatively. Second column (b, f, j) displays activation of neurons from the first layer (conv1, neuron 24) (shallow features); it highlights some of the cellular areas present in the image. Third column (c, g, k) illustrates diagnostic regions of interest identified with sliding window approach. The boxed regions represent high activation of the deepest network neuron. Fourth column (d, h, l) contains images extracted from conv1 activation (neuron 22), representative of the high fluorescence signal, a diagnostic sign of blood-brain barrier disruption and leakage of fluorescent agent from the vessels into the extracellular space.

5.1 Ensemble or solo model?

We performed an ROC analysis for each of the two networks and three training regimes to see how the ensemble of models performed compared to the single models. Fig. 6 presents the ROC curves and corresponding AUC values for each ensemble model and the mean of single models. The AUC value increased by 2% for both networks with DT and DFT when the ensemble was applied instead of the single model. This effect is smaller for network 1 with SFT and becomes negligible for network 2 with SFT. The arithmetic and geometric ensemble models produced roughly similar results (paired t-test: P value > 0.05).

SFT models display less sensitivity to the ensemble effect compared to DT and DFT. This is not surprising since they represent identical models except in the softmax classifier layer which has been adjusted through training.
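As an illustration of the two ensemble types compared here, the sketch below combines per-model class probabilities with an arithmetic and a geometric mean. It assumes each model outputs a softmax probability vector per image. The renormalization of the geometric mean and the `eps` floor are assumptions made for this example, and the probability values are hypothetical.

```python
import math


def arithmetic_ensemble(prob_lists):
    """Class-wise arithmetic mean of the models' probability vectors."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


def geometric_ensemble(prob_lists, eps=1e-12):
    """Class-wise geometric mean, renormalized so the result sums to 1.

    `eps` guards against log(0); the renormalization is an assumption.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    raw = [math.exp(sum(math.log(max(p[c], eps)) for p in prob_lists) / n_models)
           for c in range(n_classes)]
    total = sum(raw)
    return [v / total for v in raw]


# Hypothetical outputs of three models for one image:
# [P(diagnostic), P(nondiagnostic)].
outputs = [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3]]
print("arithmetic:", arithmetic_ensemble(outputs))
print("geometric: ", geometric_ensemble(outputs))
```

When the member models produce nearly identical outputs, as expected for SFT models that differ only in the classifier layer, both means stay close to any single model's prediction, which is consistent with the weaker ensemble effect noted above.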

5.2 Which training regime: DT, SFT and DFT?

Fig. 8 displays the results of the ROC analysis comparing the three training regimes for each network architecture in the single and ensemble settings. In all paired comparisons, DFT outperformed SFT and SFT outperformed the DT regime (paired t-test: P value < 0.05).

We traced the AUC elevation from DT to DFT regime to see how much of it corresponded to the transformation of DT to SFT and SFT to DFT. For network 1, 70-80 % of the improvement occurred in the DT to SFT transformation, depending on whether it’s a single or ensemble model. For network 2 ensemble model (right bottom of Fig. 8), however, the AUC improvement caused by transforming the training regime from DT to SFT (2%) is only 25% of the total improvement from DT to DFT. For network 2 single model the AUC improvement was evenly divided between the two transformations.
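The decomposition described above is simple arithmetic, shown here as a short sketch. The function name and the absolute AUC values are hypothetical; only the 2% step from DT to SFT and its 25% share of the total are taken from the text for the network 2 ensemble.

```python
def improvement_share(auc_dt, auc_sft, auc_dft):
    """Split the DT-to-DFT AUC gain into its DT-to-SFT and SFT-to-DFT shares."""
    total = auc_dft - auc_dt
    if total <= 0:
        raise ValueError("DFT must improve on DT for the split to be meaningful")
    dt_to_sft = (auc_sft - auc_dt) / total
    return dt_to_sft, 1.0 - dt_to_sft


# Hypothetical AUC values chosen so that DT to SFT adds 0.02 of a 0.08 total gain.
first, second = improvement_share(0.80, 0.82, 0.88)
print(f"DT to SFT: {first:.0%}, SFT to DFT: {second:.0%}")
```

A share near 1 for the first step means that adapting the classifier layer accounts for most of the gain, while a small share means that most of it comes from fine-tuning the feature extractors.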

Our results from this experiment indicate that, for our dataset, network 1 mainly required fine-tuning of the classification layer, and fine-tuning the other layers (feature extractors) made a smaller contribution. For network 2, however, fine-tuning the feature extractors was more important than modifying the classifier layer. Further experiments on more datasets are necessary to generalize this observation.

5.3 Histological features localization

Eight of the 384 reviewed color-coded neuron activation maps from the first layer were selected for 4 diagnostic CLE images representative of glioma. The selected activation maps highlighted diagnostic tissue architecture patterns in warm colors. In particular, several maps emphasized regions of optimal image contrast, where hypercellular and abnormal nuclear features could be identified and would serve as diagnostic features for image classification (Fig. 21, columns 2 and 4). Additionally, the sliding window method was able to identify diagnostic aggregates of abnormally large malignant glioma cells and atypically hypercellular areas (Fig. 21, third column).

Activation of the neurons in the first convolutional layer (conv1) was found to highlight areas with increased fluorescein signal, a sign specific to brain tumor regions. Increased fluorescent signal on CLE images represents areas with blood-brain barrier disruption, which correspond to the tumor areas visible on contrast-enhanced MR imaging. Interestingly, the sliding window method and the selected color-coded activation maps were not distracted or deceived by red blood cell contamination, as they mostly highlighted tumor and brain cells rather than hypercellular areas due to bleeding. The proposed feature localization approach may be useful in the future not only for identifying the diagnostic frames, but also for directing the surgeon’s attention to the image parts containing major histopathological features.

Whole Val
Val-Rater 1 66 %
Val-Rater 2 73 %
75 %
Model 76 %
85 %
Table 3: Interobserver study results. The model-human agreement was higher than the human-human agreement both on whole val review dataset and the gold standard subset.

5.4 Inter-rater agreement

Table 3 shows the agreement between each of the val-raters and the initial review on the whole val review dataset and on the gold standard subset (explained in Fig. 7). The model's agreement with the initial review is higher than each val-rater’s agreement with the initial review. This suggests that the model has successfully learned the histological features of the CLE images that neurosurgeons are more likely to notice when the corresponding H & E-stained histological slides are also provided for reference.

To restrict the comparison to images from the val review set on which the majority of raters agreed, that is, images on which one of the val-raters agreed with the initial review, we used the gold standard subset. The gap between the model-human and human-human agreements became even more evident (19% for val-rater 1 and 9% for val-rater 2) with the gold standard subset (Table 3, column 4).

6 Conclusion and future work

This paper presents a deep CNN-based approach that can automatically detect the diagnostic CLE images from brain tumor surgery. We used a manually annotated in-house dataset to train and test this approach. Our results showed that both deep fine-tuning and creating an ensemble of models could enhance the performance, but only their combination reached the maximum accuracy. The ensemble effect was stronger in models developed with DT and DFT than in those developed with SFT. The proposed method was also able to localize some histological features of diagnostic images. Ultimately, Table 3 indicates that the proposed ensemble of deeply fine-tuned models could detect the diagnostic images with higher agreement than the trained human observers. Other confocal imaging techniques may be aided by such deep learning models. Confocal reflectance microscopy (CRM) has been studied mooney2017immediate for rapid, fluorophore-free evaluation of pituitary adenoma biopsy specimens ex vivo. CRM allows the biopsy tissue to be preserved for future permanent analysis, immunohistochemical studies, and molecular studies. Continued use of unsupervised image segmentation methods to detect meaningful histological features from confocal brain tumor images will likely allow for more rapid and detailed diagnosis.


This work was supported by the Newsome Family Endowed Chair of Neurosurgery Research at the Barrow Neurological Institute held by Dr. Preul and by funds from the Barrow Neurological Foundation.

