A Deep-Learning Algorithm for Thyroid Malignancy Prediction From Whole Slide Cytopathology Images

David Dov, Shahar Z. Kovalsky, Jonathan Cohen, Danielle Elliott Range, Ricardo Henao, and Lawrence Carin

D. Dov, R. Henao and L. Carin are with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA (e-mail: david.dov@duke.edu; ricardo.henao@duke.edu; lcarin@duke.edu). S. Z. Kovalsky is with the Department of Mathematics, Duke University, Durham, NC 27708, USA (e-mail: shaharko@math.duke.edu). J. Cohen is with the Department of Surgery, Duke University Medical Center, Durham, NC 27710, USA (e-mail: jonathan.m.cohen@duke.edu). D. Elliott Range is with the Department of Pathology, Duke University Medical Center, Durham, NC 27710, USA (e-mail: danielle.range@duke.edu).

We consider thyroid-malignancy prediction from ultra-high-resolution whole-slide cytopathology images. We propose a deep-learning-based algorithm that is inspired by the way a cytopathologist diagnoses the slides. The algorithm identifies diagnostically relevant image regions and assigns them local malignancy scores, which in turn are incorporated into a global malignancy prediction. We discuss the relation of our deep-learning-based approach to multiple-instance learning (MIL) and describe how it deviates from classical MIL methods by the use of a supervised procedure to extract relevant regions from the whole slide. The analysis of our algorithm further reveals a close relation to hypothesis testing, which, along with unique characteristics of thyroid cytopathology, allows us to devise an improved training strategy. We further propose an ordinal regression framework for the simultaneous prediction of thyroid malignancy and an ordered diagnostic score acting as a regularizer, which further improves the predictions of the network. Experimental results demonstrate that the proposed algorithm outperforms several competing methods, achieving performance comparable to human experts.

Thyroid, AI, deep learning, medical imaging, healthcare, pathology, human level

I Introduction

The prevalence of thyroid cancer is increasing worldwide [1]. The most important test in the preoperative diagnosis of thyroid malignancy is the analysis of a fine needle aspiration biopsy (FNAB). The FNAB sample is stained and smeared onto a glass slide, and manually examined under an optical microscope by a cytopathologist, who estimates the risk of malignancy. This diagnosis, however, involves substantial clinical uncertainty and often results in unnecessary surgery. In this work, we propose a deep-learning-based algorithm for the preoperative prediction of thyroid malignancy from whole-slide cytopathology scans.

A cytopathologist determines the risk of thyroid malignancy according to various features of follicular (thyroid) cells, such as their size, color and the architecture of cell groups. Based on these features, they assign a score to the slide according to the Bethesda System (TBS), which is the universally accepted reporting system for thyroid FNAB (there are six TBS categories). TBS 2 indicates a benign slide, TBS 3, 4, and 5 reflect inconclusive findings with an increasing risk of malignancy, and TBS 6 indicates malignancy. TBS 1 is assigned to inadequately prepared slides and is out of the scope of this work. In many of the indeterminate cases that undergo surgery, however, the post-surgical histopathology analysis, which is considered the gold standard for determining thyroid malignancy, shows no evidence of malignancy, so that the surgery turns out to have been unnecessary.

Cytopathology slides are typically examined via an optical microscope and are not digitized in most healthcare systems. Therefore, for the automated analysis of FNAB, we have established in [2] a dataset of samples. Each sample comprises an ultra-high-resolution full scan of the glass slide, referred to as a whole-slide cytopathology image. An example of a whole-slide image is presented in Fig. 2 (top). Each sample in the dataset includes the TBS category assigned by a cytopathologist, extracted from the medical record, as well as the postoperative histopathology diagnosis, which is considered the ground truth in this study. The goal in this paper is to predict the ground truth malignancy label from the whole-slide scans.

Machine learning, and in particular deep neural networks, have become prevalent in medical imaging [3]. Such methods have been used for the detection of diabetic retinopathy [4], the classification of skin cancer [5], and in histopathology [6, 7, 8] and cytopathology [9]. Thyroid malignancy prediction via machine learning has been studied in ultrasound images [10, 11, 12, 13, 14, 15] and in post-surgical histological tissues [16]. The use of automated methods for preoperative thyroid cytopathology has also been studied: [17, 18] study morphometric features of individual cells extracted using image analysis software; [19] discuss the cytomorphometric analysis of nuclear features of individual thyroid cells in extreme magnification; [20] classify thyroid cytopathology images for educational and training purposes. In [21, 22, 23] the authors use machine learning techniques trained and tested on a small number of “zoomed-in” regions manually selected and specified as diagnostically relevant by an expert cytopathologist. However, these studies do not address the problem of intervention-free malignancy prediction from whole-slide cytopathology images.

The exceptionally high resolution of whole-slide cytopathology images, each tens of gigabytes in size, presents a significant challenge since they cannot be straightforwardly processed due to the limited memory capacity of existing graphics processing unit (GPU) computing platforms. In this context, using a multiple instance learning (MIL) scheme appears to be suitable: each image is split into small regions (instances) that are processed individually into local estimates, which are then aggregated into a global image-level prediction. Specifically, [24] and [25] propose to aggregate local predictions via noisy-or [24] or noisy-and [25] pooling functions. In [26], a weighted combination of local decisions is proposed, incorporating an attention mechanism to form a global decision. The use of these approaches has been demonstrated for the analysis of breast cancer fluorescence microscopy [25], for histopathology of breast and colon cancer [26], and for the analysis of breast cancer in mammograms [27].

Classic MIL approaches assume that the pooling or the weighting mechanisms can implicitly focus on the relevant instances, while attenuating the effect of irrelevant ones. However, we found in our experiments that only a tiny fraction of the scan is informative, containing groups of follicular cells, whereas the majority of the scan is irrelevant for the prediction of thyroid malignancy, e.g., containing red blood cells, and is considered background. Due to this imbalance, standard MIL approaches fail to distinguish between the two groups and, as a result, achieve poor performance. We therefore take a different approach and propose a supervised method based on explicitly extracting relevant regions from the entire image. This resembles the workflow of a cytopathologist, who is trained to first scan the slide in search of diagnostically indicative areas containing follicular cells. In this context we note the closely related task of region-of-interest detection, extensively studied for object detection [28, 29, 30, 31]. However, unlike typical object detection methods, we are not strictly concerned with the accurate estimation of individual instances, a difficult challenge in the case of cytopathology, as our goal is to predict the global per-slide label.

We use regions detected as containing groups of follicular cells to further predict thyroid malignancy. We propose to jointly predict malignancy and the TBS score from a single output of a neural network. Since TBS categories correspond to increasing probability of malignancy, predicting TBS falls into the scope of ordinal regression [32]. Cumulative link models are particularly relevant to our work, wherein classification is obtained by comparing a scalar output to an ordered set of thresholds [33, 34, 35]. Such joint prediction is motivated by the observation that the TBS score is a monotone and consistent proxy for the probability of malignancy: the higher the TBS score, the higher the probability of malignancy; nearly all of the cases scored as TBS 2 and TBS 6 are indeed benign and malignant, respectively; and the TBS scores assigned by different experts are highly unlikely to differ by more than 1 or 2 TBS categories [36, 37]. Joint prediction therefore induces cross-regularization and promotes a well-behaved malignancy predictor.

In this paper, we propose a two stage deep-learning-based algorithm for the prediction of thyroid malignancy from whole-slide cytopathology scans. The algorithm identifies instances containing groups of follicular cells and, in turn, incorporates local decisions based on the informative regions into the global slide-level prediction. We further propose an ordinal regression framework for the simultaneous prediction of thyroid malignancy, as well as the preoperative TBS category, from a single output of a neural network. A block diagram illustrating the proposed algorithm is presented in Fig. 1. Experimental results demonstrate the improved performance of the proposed algorithm compared to multiple other MIL approaches. Moreover, the proposed algorithm provides predictions of thyroid malignancy comparable to those of cytopathology experts (we compare to three such experts in our experiments).

Fig. 1: Block diagram of the proposed algorithm. The left and the right blocks denoted by NN refer to the first and the second neural networks, respectively.

This paper extends a previous conference publication [2], focusing on the technical aspects of the proposed algorithm. While [2] included a more thorough description of the clinical problem we address and complete details on the dataset and its acquisition, this paper provides full details and analysis of the proposed algorithm. Novel contributions presented in this paper, which go beyond the scope of [2], include the following. We introduce an analysis based on ideas from set theory [38, 26], showing how the proposed algorithm is related to multiple instance learning and how it deviates from these methods by the use of a supervised strategy. We further show how the proposed algorithm is related to the likelihood ratio test, widely used in statistics for hypothesis testing. The analysis, combined with unique characteristics of FNAB, according to which follicular groups are expected to be consistent in their contribution to the global decision, further allows us to devise an improved strategy to train the malignancy-prediction network. We further provide a full description of the proposed ordinal regression framework. We show how deviating from the classic cumulative link model presented in [33, 35] allows us to obtain natural and interpretable predictions of both thyroid malignancy and TBS category from a single output of the network. In addition, the two predictions regularize each other, further improving the malignancy prediction. Extensive cross-validation experiments comparing the proposed approach to MIL methods, as well as ablation experiments, demonstrate the improved performance of the proposed algorithm.

The remainder of the paper is organized as follows. In Section II we formulate the problem of thyroid malignancy prediction. We present the proposed algorithm and analyze it in Section III. Experimental results demonstrating the improved performance of the proposed algorithm for thyroid malignancy prediction are presented in Section IV.

II Problem formulation

Let $X = \{x_m\}_{m=1}^{M}$ be a set of $M$ instances of a whole-slide image, where $x_m \in \mathbb{R}^{w \times h \times 3}$ is the $m$th instance, i.e., a patch from an RGB digital scan, whose width and height are $w$ and $h$, respectively. Let $Y \in \{0, 1\}$ be the true label of the scan, where $0$ and $1$ correspond to benign and malignant cases, respectively. The goal in this paper is to predict the thyroid malignancy label $Y$.

We consider an additional set of local labels $U = \{u_m\}_{m=1}^{M}$, corresponding to the instances $x_m$, such that $u_m = 1$ if instance $x_m$ contains a group of follicular cells, and $u_m = 0$ otherwise. Our dataset includes instances containing follicular groups manually selected from digital scans. These local labels are used to train a convolutional neural network to distinguish instances containing follicular groups from those only containing background. In turn, the instances containing follicular groups are used to predict thyroid malignancy.

Similar to $Y$, let $S \in \{2, 3, 4, 5, 6\}$ be the TBS category assigned to a whole slide by a pathologist. We propose in Section III-D to simultaneously predict thyroid malignancy and the TBS category using a single output of a neural network. We show in Section IV that this not only improves prediction accuracy, but also leads to more reliable predictions.

Fig. 2: (Top) Whole-slide cytopathology scan. (Bottom left) Detail of the area marked by the red rectangle. (Middle) Heat map of prediction values of the first neural network. Instances predicted to contain follicular groups correspond to bright regions. (Bottom right) Detail of the area marked by the red rectangle.

III Thyroid malignancy prediction

III-A Detecting groups of follicular cells

The proposed algorithm follows the work of cytopathologists, who identify groups of follicular cells and evaluate thyroid malignancy based on them. Malignant and benign groups differ from each other in the tone of the cells, their size and texture, and the shape and the architecture of the group. The slide, however, only contains a small number of follicular groups and is mainly covered with blood cells, which are irrelevant for the prediction and are considered background. In Fig. 3, we present examples of instances containing follicular groups, as well as instances containing background.

We use the local labels $u_m$ to train a convolutional neural network to distinguish instances containing follicular groups from those containing background. The network is based on the small and fast-converging VGG11 architecture [39], details of which are summarized in Table II in the Appendix. Training the network requires a sufficient amount of labeled data, which was collected manually by an expert pathologist through an exhaustive examination of the slides. To make the labeling effort efficient, the cytopathologist only marked positive examples, i.e., instances containing follicular groups ($u_m = 1$). We further observed in our experiments that instances randomly sampled from the whole slide mostly contain background. Therefore, to train the network, we propose the following design of training batches. We use batches comprising an equal number of positive and negative examples to overcome the class imbalance. As positive examples we take follicular groups sampled uniformly at random from the set of labeled instances $x_m$ for which $u_m = 1$. Negative examples are obtained by sampling instances uniformly at random from the whole slide, to which we assign $u_m = 0$, assuming they contain background. The network is trained using the binary cross entropy (BCE) loss via stochastic gradient descent with momentum and weight decay.
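The batch design described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_balanced_batch`, the integer instance identifiers and the batch size are assumptions introduced here for the example.

```python
import random

def make_balanced_batch(positive_ids, all_ids, batch_size, rng):
    """Return (instance_id, label) pairs: half marked positives, half random draws."""
    if batch_size % 2:
        raise ValueError("batch_size must be even")
    half = batch_size // 2
    # Positives: sampled uniformly from the manually marked follicular groups (label 1).
    positives = [(i, 1) for i in rng.choices(positive_ids, k=half)]
    # Negatives: sampled uniformly from the whole slide and assumed to be background
    # (label 0). A few of them may in fact contain follicular groups (label noise).
    negatives = [(i, 0) for i in rng.choices(all_ids, k=half)]
    batch = positives + negatives
    rng.shuffle(batch)
    return batch

# Hypothetical slide with 1000 instances, three of which were marked by the expert.
rng = random.Random(0)
batch = make_balanced_batch(positive_ids=[3, 17, 42], all_ids=list(range(1000)),
                            batch_size=10, rng=rng)
```

Sampling negatives from the whole slide keeps the labeling effort limited to positives, at the cost of occasionally mislabeling a follicular group as background.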

Fig. 3: (Top) Instances containing groups of follicular cells. (Bottom) Instances containing background.

We observe in our experiments that the training procedure converges after a few epochs, so we set a stopping criterion to avoid over-fitting. Specifically, we use the average of predictions of positive examples, a criterion we find more reasonable than, e.g., the area under the curve (AUC). The latter takes into account negative examples, whose labels are uncertain because follicular groups can be randomly sampled from the whole slide and wrongly considered negative. We stop the training process if this measure does not increase between epochs, which typically occurs within a few epochs. Indeed, we observed in our experiments that the network successfully distinguishes between follicular groups and the background, and, in turn, thyroid malignancy is successfully predicted from the selected regions, as we show in the experiments.
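The stopping rule can be illustrated with a short sketch. The callables `run_epoch` and `mean_positive_prediction` are hypothetical placeholders for one training epoch and for the average prediction over held-out positive examples; the scores in the usage lines are made up for illustration.

```python
def train_with_early_stopping(run_epoch, mean_positive_prediction, max_epochs=50):
    """Train until the mean prediction on positive examples stops increasing."""
    best = float("-inf")
    for epoch in range(1, max_epochs + 1):
        run_epoch()
        score = mean_positive_prediction()
        if score <= best:
            # No improvement between consecutive epochs: stop here.
            return epoch, best
        best = score
    return max_epochs, best

# Toy usage: the criterion rises for three epochs and then drops.
scores = iter([0.60, 0.80, 0.90, 0.88, 0.95])
stopped_at, best_score = train_with_early_stopping(lambda: None, lambda: next(scores))
```

The criterion deliberately ignores negative examples, whose labels are noisy by construction.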

The network for the identification of follicular groups is applied to the whole-slide images, and a subset of instances providing the highest predictions is selected: $\tilde{X} = \{\tilde{x}_m\}_{m=1}^{\tilde{M}} \subset X$, where $\tilde{M} < M$ and recall that $M$ is the size of $X$. These instances are used to train and evaluate the second neural network, which predicts thyroid malignancy. Specifically, for training, we use a value of $\tilde{M}$ that balances the tradeoff between having a sufficient amount of training data and using instances that with high probability contain follicular groups. Using a validation set, we further found that selecting a different number of instances for the test phase slightly improves the performance.
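The selection step amounts to keeping the highest-scoring instances of the first network. A minimal sketch, assuming the scores are available as a plain list and using the hypothetical helper name `select_top_instances`:

```python
import heapq

def select_top_instances(scores, k):
    """Return the indices of the k highest-scoring instances, best first."""
    k = min(k, len(scores))
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Toy first-network outputs for six instances of one slide.
scores = [0.02, 0.91, 0.10, 0.97, 0.05, 0.88]
selected = select_top_instances(scores, k=3)
```

Using a heap avoids sorting all instances, which matters when a slide yields a very large number of candidate patches.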

III-B Predicting thyroid malignancy from multiple instances

We propose a simple, yet effective, procedure for the prediction of thyroid malignancy using a second neural network, whose output we denote by $g_\theta(x_m) \in \mathbb{R}$, where $\theta$ are the network parameters. We use the same VGG11-based architecture as the first neural network and the same hyper-parameters for training (see Table II in the Appendix for details). We consider $g_\theta(x_m)$ as local, instance-level predictions of thyroid malignancy, which are then averaged into a global slide-level prediction:

$$f(X) = \frac{1}{\tilde{M}} \sum_{x_m \in \tilde{X}} g_\theta(x_m), \tag{1}$$

where $\tilde{X}$ is the set of $\tilde{M}$ selected instances and high values of $f(X)$ correspond to high probability of malignancy. Accordingly, the predicted thyroid malignancy is given by:

$$\hat{Y} = \begin{cases} 1, & f(X) > \beta \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $\beta$ is a threshold value. We demonstrate in the experiments that this simple approach for aggregating decisions from multiple instances outperforms several competing methods. In the next subsection, we analyze the proposed algorithm from the perspective of multiple instance learning. Our analysis motivates the aggregation of the multiple predictions made according to (2) by showing its relation to the likelihood ratio test (LRT). Moreover, we show how unique characteristics associated with thyroid cytopathology slides can be further leveraged to devise an improved strategy for training the neural network $g_\theta$.
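The averaging and thresholding in (1) and (2) reduce to a few lines. The sketch below assumes the instance-level outputs of the second network are already computed; the function name `slide_prediction`, the threshold value and the numbers are illustrative only.

```python
def slide_prediction(instance_outputs, beta=0.0):
    """Average instance-level outputs, as in (1), and threshold the mean, as in (2)."""
    if not instance_outputs:
        raise ValueError("at least one selected instance is required")
    f = sum(instance_outputs) / len(instance_outputs)
    return f, int(f > beta)

# Toy outputs of the second network on four selected instances.
f_value, y_hat = slide_prediction([1.2, -0.3, 0.8, 2.1], beta=0.0)
```

Sweeping the threshold over a range of values traces the ROC and precision-recall curves used for evaluation.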

III-C Multiple instance learning perspective

Multiple instance learning

MIL refers to a binary classification problem where a bag (set) of instances is assigned a single global label. In our context, the scans are divided into small instances, so that thyroid malignancy detection is treated as a MIL problem rather than, e.g., standard classification. This is done, first, due to the huge size of the scans, as full scans cannot be fed into a standard GPU due to memory limitations. Second, using a small dataset, consisting of fewer than 1000 slides, poses a significant limitation in training a classifier, due to the risk of over-fitting. Indeed, recent studies such as those presented in [25, 26] addressed the classification of large microscopy images using MIL approaches, by considering small regions of the images as multiple instances. In fact, our initial attempts to tackle this problem used the methods in [25, 26], which motivated the design of the proposed algorithm.

MIL-based classifiers can be considered a part of a more general family of functions termed symmetric functions, which are defined on sets and are invariant to the order of the instances in the set [26]. In our context, prediction of slide-level malignancy from instances is indeed invariant to the order of instances. Zaheer et al. showed in [38] that a function $f$ defined over a set $X$ is symmetric if and only if it can be decomposed as:

$$f(X) = \rho\Big(\sum_{x_m \in X} \phi(x_m)\Big). \tag{3}$$

In the case of MIL-based classifiers, neural networks can be used to learn the functions $\phi$ and $\rho$ [26]. In particular, according to (3), the design of a particular MIL approach reduces to a proper selection of $\phi$ and $\rho$. For example, in the case of noisy-and [25], $\phi$ is a scalar prediction ($\phi(x_m) \in \mathbb{R}$) based on a single instance, and $\rho$ is a network that maps the local predictions into a single bag-level decision. Ilse et al. proposed in [26] to map the instances into a $D$-dimensional representation $\phi(x_m) \in \mathbb{R}^{D}$ and, in turn, obtain a global representation using weighted averaging:

$$f(X) = \rho\Big(\sum_{m=1}^{M} a_m \phi(x_m)\Big). \tag{4}$$

This takes the form (3) by setting the instance-level term to $a_m \phi(x_m)$, with weights $a_m$ that sum to $1$. In [26], these weights are inferred from the data based on an attention mechanism.

The proposed algorithm indeed fits into the general form of (4). The first neural network identifies instances containing follicular cells. Namely, zero weights are assigned to all instances except the $\tilde{M}$ instances with the highest probability to contain follicular groups; that is, $a_m = 0$ if $x_m \notin \tilde{X}$. The non-zero weights are assigned according to (1) with a constant value: $a_m = 1/\tilde{M}$ for $x_m \in \tilde{X}$. In classical MIL approaches the weights, which control the contribution of each instance to the global decision, are implicitly determined by the model trained using the global labels alone. In contrast, the proposed algorithm identifies the important instances via a supervised procedure using the labels $U$.

Finally, to obtain (1) we simply use $\rho(z) = z$, i.e., the identity function in (4), together with $\phi = g_\theta$. The selection of the pooling function is directly related to the interplay between the local and the global labels. The typical assumption in MIL approaches is that a bag of instances is labeled positive if at least one of the instances is positive. Noisy-and, for example, is designed based on the assumption that a certain number of positive instances triggers a positive global label. Yet, our experiments show no advantage to this approach, casting doubt on this assumption in the case of thyroid malignancy detection. Next we motivate (1) by showing its relation to the likelihood ratio test, which further allows us to devise a strategy to train the neural network $g_\theta$ in (1).
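To make the connection between (1) and the weighted pooling in (4) concrete, the following sketch builds the hard selection weights from detector scores and checks that the pooled value does not depend on the order of the instances. The function names and the toy numbers are assumptions made for this illustration.

```python
def hard_selection_weights(detector_scores, k):
    """Weights of (4) for the proposed scheme: 1/k on the top-k instances, 0 elsewhere."""
    order = sorted(range(len(detector_scores)),
                   key=lambda i: detector_scores[i], reverse=True)
    top = set(order[:k])
    return [1.0 / k if i in top else 0.0 for i in range(len(detector_scores))]

def weighted_pool(phi_values, weights, rho=lambda z: z):
    """rho(sum_m a_m * phi(x_m)); rho defaults to the identity, which gives (1)."""
    return rho(sum(a * p for a, p in zip(weights, phi_values)))

detector_scores = [0.9, 0.1, 0.8, 0.2]   # first network: follicular-group scores
g_outputs = [2.0, -5.0, 1.0, 7.0]        # second network: instance-level outputs
pooled = weighted_pool(g_outputs, hard_selection_weights(detector_scores, k=2))

# Reordering the instances leaves the pooled value unchanged (symmetric function).
perm = [2, 0, 3, 1]
pooled_perm = weighted_pool([g_outputs[i] for i in perm],
                            hard_selection_weights([detector_scores[i] for i in perm], k=2))
```

Note that the large output on the fourth instance is ignored because the detector marks it as background, which is the effect the supervised selection is meant to achieve.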

The likelihood ratio test

The following proposition motivates (1) and (2) by showing that $f(X)$ is directly related to the likelihood ratio test (LRT) widely used in statistics for hypothesis testing [40]. Let $\Lambda$ be the likelihood ratio given by:

$$\Lambda = \frac{P(\tilde{X} \mid Y = 1)}{P(\tilde{X} \mid Y = 0)}. \tag{5}$$

Proposition 1

The estimated log likelihood ratio $\log \hat{\Lambda}$ is a linear function of $f(X)$ defined in (1), given by:

$$\log \hat{\Lambda} = \tilde{M} f(X) + \gamma, \tag{6}$$

where $\gamma$ is a constant value and $\tilde{M}$ is the number of instances in $\tilde{X}$.

Proposition 1 implies that making a prediction according to (2) by comparing $f(X)$ to a threshold value $\beta$ is equivalent to comparing the estimated log likelihood ratio to the threshold $\tilde{M}\beta + \gamma$, which means applying the likelihood ratio test. The proof is provided in the Appendix.
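The linear relation in Proposition 1 can be checked numerically under two assumptions that are stated here for the sketch and are not taken from the Appendix proof: the instances are conditionally independent given the slide label, and the sigmoid of each instance-level output equals the posterior probability of malignancy given that instance. Under these assumptions the constant is determined by the prior log-odds; the prior value used below is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood_ratio(outputs, prior_malignant):
    """Sum over instances of log P(x|Y=1)/P(x|Y=0), obtained via Bayes' rule."""
    prior_log_odds = math.log(prior_malignant / (1.0 - prior_malignant))
    total = 0.0
    for g in outputs:
        p = sigmoid(g)  # assumed posterior P(Y=1 | x_m)
        total += math.log(p / (1.0 - p)) - prior_log_odds
    return total

outputs = [1.2, -0.3, 0.8, 2.1]   # toy instance-level outputs
prior = 0.3                       # arbitrary prior P(Y=1) for the illustration
m = len(outputs)
f = sum(outputs) / m
gamma = -m * math.log(prior / (1.0 - prior))
llr = log_likelihood_ratio(outputs, prior)
# llr agrees with m * f + gamma up to floating-point error.
```

Because the number of instances is positive, thresholding the mean output and thresholding the log likelihood ratio select the same slides.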

As a corollary to Proposition 1, we further devise a simplified and improved training strategy for the second neural network. An implicit assumption made in the proof is that $\sigma(g_\theta(x_m))$, where $\sigma$ is the sigmoid function, estimates the probability $P(Y = 1 \mid x_m)$ of the slide being malignant given a single instance $x_m$; a similar assumption is made in the derivation of the noisy-and MIL in [25]. Proposition 1 therefore implies that $f(X)$ is a better estimate of the log likelihood ratio provided that $\sigma(g_\theta(x_m))$ is a good estimate of $P(Y = 1 \mid x_m)$. This comes in contrast to classical MIL approaches, wherein the network is optimized to predict the global label from the entire set, and there is no guarantee on the quality of the predictions of individual instances.

Expert consensus dictates that in a malignant slide, all follicular groups are malignant, while in a benign slide, all groups are benign. Namely, all local labels of instances containing the follicular groups match the global label. Indeed, due to the use of a fine needle in FNAB, the follicular cells are extracted from a single homogeneous lesion. So motivated, we propose to directly train $g_\theta(x_m)$ to predict the global label $Y$ from a single instance using the binary cross entropy (BCE) loss:

$$\mathcal{L}_{Y} = -Y \log \sigma\big(g_\theta(x_m)\big) - (1 - Y) \log\big(1 - \sigma(g_\theta(x_m))\big), \tag{7}$$

where $\sigma$ is the sigmoid function. We note that this training strategy differs from classical MIL approaches, in which the model is trained at the bag level by replacing $g_\theta(x_m)$ in (7) with $f(X)$ since only a global label is available.
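The difference between instance-level and bag-level training can be seen on a toy bag in which two instances disagree. The sketch below averages the per-instance losses over the bag, which is an implementation choice made here; the function names are hypothetical.

```python
import math

def bce_with_logit(logit, label):
    """Binary cross entropy of sigmoid(logit) against label, computed stably."""
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit

def instance_level_loss(outputs, slide_label):
    """Each instance is trained to predict the slide label on its own."""
    return sum(bce_with_logit(g, slide_label) for g in outputs) / len(outputs)

def bag_level_loss(outputs, slide_label):
    """Classical MIL training: the loss only sees the averaged output."""
    f = sum(outputs) / len(outputs)
    return bce_with_logit(f, slide_label)

# Malignant slide with one confidently "benign" instance output.
outputs = [3.0, -3.0]
loss_instance = instance_level_loss(outputs, slide_label=1)
loss_bag = bag_level_loss(outputs, slide_label=1)
```

In this example the bag-level loss only sees a mean output of zero, while the instance-level loss additionally penalizes the instance that contradicts the slide label.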

Experimental results presented in Section IV show that this training strategy indeed leads to improved performance, thus supporting the assumption that the local labels are consistent with the corresponding global label, and, in turn, that the likelihood ratio test is successfully estimated from $f(X)$ according to Proposition 1.

III-D Simultaneous prediction of malignancy and Bethesda category

We propose a framework for predicting thyroid malignancy and the Bethesda category simultaneously from a single output of a neural network. Similarly to (2), we propose to predict the TBS category, denoted $\hat{S}$, by comparing the output of the network $f(X)$ to threshold values $\tau_2 < \tau_3 < \tau_4 < \tau_5$. Recall that the TBS category takes an integer value from 2 to 6, yielding:

$$\hat{S} = \begin{cases} 2, & f(X) \le \tau_2 \\ l, & \tau_{l-1} < f(X) \le \tau_l, \quad l \in \{3, 4, 5\} \\ 6, & f(X) > \tau_5. \end{cases}$$

The proposed framework for ordinal regression is inspired by the proportional odds model, also termed the cumulative link model [33, 35]. The original model links $f(X)$, the threshold $\tau_l$ and the cumulative probability $P(S \le l)$:

$$\mathrm{logit}\big(P(S \le l)\big) = \tau_l - f(X), \tag{8}$$

where

$$\mathrm{logit}(p) = \log \frac{p}{1 - p}. \tag{9}$$

The proportional odds model imposes order between different TBS categories by linking them to $f(X)$ so that higher values of $f(X)$ correspond to higher TBS categories. Recalling that the logit function is a monotone mapping of a probability into the real line, values of $f(X)$ which are significantly smaller than $\tau_l$ correspond to a high probability that the TBS category does not exceed $l$.

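We deviate from [33, 35] by estimating $P(S > l)$ rather than $P(S \le l)$. We note that this deviation is not necessary for the prediction of TBS, yet it allows combining the predictions of thyroid malignancy and the TBS category in an elegant and interpretable way. By plugging $P(S \le l) = 1 - P(S > l)$ into (9), we have:

$$\mathrm{logit}\big(P(S \le l)\big) = \log \frac{1 - P(S > l)}{P(S > l)} = -\mathrm{logit}\big(P(S > l)\big). \tag{10}$$

Further substituting the last equation into (8), with $\tau_l$ denoting the threshold associated with category $l$, gives:

$$\mathrm{logit}\big(P(S > l)\big) = f(X) - \tau_l. \tag{11}$$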
We note that (11) satisfies the property that high values of $f(X)$ correspond to high TBS categories. We rewrite (11) as

$$P(S > l) = \frac{1}{1 + \exp\big(-(f(X) - \tau_l)\big)}, \tag{12}$$

and observe that the right term is the sigmoid function $\sigma(f(X) - \tau_l)$. Accordingly, we can train the network to predict $P(S > l)$ using a BCE loss, and propose the following loss function:

$$\mathcal{L}_{\mathrm{TBS}} = -\sum_{l=2}^{5} \Big[ s_l \log \sigma\big(g_\theta(x_m) - \tau_l\big) + (1 - s_l) \log\big(1 - \sigma(g_\theta(x_m) - \tau_l)\big) \Big], \tag{13}$$

where $s_l = \mathbb{1}\{S > l\}$. Namely, $s_l$ are labels used to train four binary classifiers corresponding to the thresholds $\tau_l$, $l \in \{2, 3, 4, 5\}$, whose explicit relation to TBS is presented in Table III in the Appendix. The use of $g_\theta(x_m)$ in (13), instead of the more natural choice of $f(X)$, is enabled by the analysis provided in Subsection III-C.

For the simultaneous prediction of thyroid malignancy and TBS category, the total loss function is given by the sum of (7) and (13). Note the similarity between (7) and (13), which is a result of our choice to estimate $P(S > l)$ rather than $P(S \le l)$ and has the following interpretation: (7) can be considered a special case of ordinal regression with a single fixed threshold value of $0$. Namely, the total loss function simultaneously optimizes the parameters of the network according to five classification tasks corresponding to the threshold values $0, \tau_2, \tau_3, \tau_4, \tau_5$.
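The total loss can be sketched for a single instance as follows. The threshold values are arbitrary placeholders (in the paper they are learned), and the function names are assumptions introduced for this example.

```python
import math

def bce_with_logit(logit, label):
    """Binary cross entropy of sigmoid(logit) against label, computed stably."""
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit

def ordinal_targets(tbs, cut_levels=(2, 3, 4, 5)):
    """Binary labels indicating whether the TBS category exceeds each cut level."""
    return [int(tbs > level) for level in cut_levels]

def total_loss(g, malignant, tbs, thresholds):
    """Malignancy term (fixed threshold 0) plus one BCE term per TBS threshold."""
    loss = bce_with_logit(g, malignant)
    for target, tau in zip(ordinal_targets(tbs), thresholds):
        loss += bce_with_logit(g - tau, target)
    return loss

# Hypothetical thresholds and a malignant slide that was scored TBS 3.
thresholds = (-2.0, -0.5, 0.5, 2.0)
loss = total_loss(g=1.5, malignant=1, tbs=3, thresholds=thresholds)
```

All five terms share the same network output and differ only in the subtracted threshold, which is why a single scalar can serve both prediction tasks.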

In this context, we note that the threshold values $\tau_l$ are learned along with the parameters of the network, via stochastic gradient descent. While the training procedure does not guarantee the correct order of $\tau_l$ [35], we have found in our experiments that this order is indeed preserved.
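Given ordered thresholds, decoding a TBS category from the slide-level output is a simple lookup. The sketch below assumes four thresholds separating categories 2 to 6 and rejects unordered thresholds explicitly, since their order is not guaranteed by training; the values are placeholders.

```python
import bisect

def predict_tbs(f, thresholds):
    """Map a slide-level output to a TBS category in {2, ..., 6}."""
    thresholds = list(thresholds)
    if thresholds != sorted(thresholds):
        raise ValueError("thresholds must be in increasing order")
    # The category is 2 plus the number of thresholds strictly below f.
    return 2 + bisect.bisect_left(thresholds, f)

thresholds = (-2.0, -0.5, 0.5, 2.0)  # hypothetical learned values
predictions = [predict_tbs(f, thresholds) for f in (-3.0, -0.5, 0.0, 2.5)]
```

Counting thresholds strictly below the output matches the convention that the probability of exceeding a category is above one half exactly when the output exceeds the corresponding threshold.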

We note that, in some cases, the term of the loss function corresponding to the prediction of malignancy may be in conflict with that of the TBS category. For example, consider a malignant case with TBS category 3 assigned by a pathologist. In this case, the term of the loss corresponding to TBS penalizes high values of $g_\theta(x_m)$, whereas the term corresponding to malignancy encourages them. We therefore interpret the joint estimation of TBS category and malignancy as a cross-regularization scheme: given two scans with the same TBS but different final pathology, the network is trained to provide higher prediction values for the malignant case. Likewise, in the case of two scans with the same pathology but different TBS, the prediction value of the scan with the higher TBS is expected to be higher. Thus, the network adopts properties of the Bethesda system, such that the higher the prediction value, the higher the probability of malignancy. Yet the network is not strictly restricted to the Bethesda system, so it can learn to provide better predictions.

IV Experiments

Experimental Setting

To evaluate the proposed algorithm, we performed a cross-validation procedure, splitting the scans into training, validation and test sets such that a test scan is never seen during training. The algorithm is trained using a Tesla P100-PCIE GPU with 16 GB of memory. The instance size is chosen to be large enough to capture large groups of follicular cells, while allowing the use of a sufficient number of instances in each minibatch. Specifically, to train the first network, we use 10 instances per minibatch, a value set arbitrarily that has a small effect on the performance. For the second network we found that increasing the number of instances per minibatch improves the performance, so we used the largest number of instances permitted by the memory capacity of the GPU.
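A slide-level split of this kind can be sketched as below. The number of folds, the validation fraction and the function name `split_slides` are illustrative assumptions and are not the values used in the experiments; the point of the sketch is that splitting is done per scan, so that instances from a test scan never reach the training set.

```python
import random

def split_slides(slide_ids, n_folds, fold, val_fraction, seed=0):
    """Return (train, val, test) lists of slide ids for one cross-validation fold."""
    ids = sorted(slide_ids)
    random.Random(seed).shuffle(ids)      # same shuffle for every fold
    test = ids[fold::n_folds]             # every n_folds-th slide is held out
    held_out = set(test)
    rest = [s for s in ids if s not in held_out]
    n_val = round(val_fraction * len(rest))
    return rest[n_val:], rest[:n_val], test

slides = [f"slide_{i:03d}" for i in range(20)]
train, val, test = split_slides(slides, n_folds=5, fold=0, val_fraction=0.25)
```

Iterating `fold` over all folds uses every scan as a test scan exactly once.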

Fig. 4: Histogram of predictions of instances taken from a single slide. High predictions correspond to high probabilities that an instance contains follicular groups.

Identification of instances with follicular groups

A heat map illustrating the prediction values of the first network and the corresponding histogram of prediction values of a representative scan are presented in Figs. 2 and 4, respectively. As seen in both figures, the majority of the instances contain background and receive low prediction values. Specifically, the follicular groups (Fig. 2 top) are highlighted with bright colors in the heat map (Fig. 2 middle). The histogram in Fig. 4 is nevertheless bimodal, with a second peak at high prediction values. These high predictions indeed correspond to instances containing follicular groups, which we select for thyroid malignancy prediction. In Fig. 7, we present examples of instances detected as containing follicular groups.

Competing methods

To evaluate the performance of the proposed algorithm in predicting thyroid malignancy, we compare it to a baseline CNN, a noisy-and MIL [25] and the attention-based MIL algorithm presented in [26]; we term these methods “CNN”, “NoisyAND” and “AttentionMIL”, respectively. These methods are originally designed to process whole images, which is not possible in our case due to memory limitations. Therefore, we use fixed-size crops and a number of crops per minibatch that fits in memory. These values were selected to optimize performance over the validation set.

As an ablation study, we consider two additional approaches, where the proposed network for the selection of instances containing follicular groups is followed by the MIL approaches [25, 26] for obtaining a slide-level malignancy prediction; we refer to these methods as “Proposed+NoisyAND” and “Proposed+AttentionMIL”. These approaches are used to specifically evaluate the second network in the proposed algorithm. Moreover, we compare the performance of the proposed algorithm to a variant trained as a classical MIL approach, by replacing $g_\theta(x_m)$ in (7) with $f(X)$. We term this method “Proposed-classical-training,” since the second neural network is trained as a standard MIL approach using the global label. Finally, we consider a variant of the proposed method termed “Proposed-pathology-loss,” in which the second neural network is trained to predict thyroid malignancy alone, but not the TBS category, so that the loss function is given by (7).

Method AUC AP
TABLE I: Comparison of the performance of the competing algorithms in the form of AUC and AP scores.

Prediction of thyroid malignancy and the TBS category

Table I summarizes the performance of the algorithms in the form of area under the curve (AUC) and average precision (AP), where higher values are better. As can be seen in the table, “CNN”, “NoisyAND” and “AttentionMIL” achieve significantly inferior performance compared to the other methods. This is because their decisions are largely made according to irrelevant background data. Specifically, the MIL approaches “NoisyAND” and “AttentionMIL” do not properly distinguish between the background and the regions containing follicular groups. The methods “Proposed+NoisyAND,” “Proposed+AttentionMIL” and “Proposed-classical-training” perform significantly better, reflecting the importance of proper selection of instances containing follicular groups. The comparable performance of these three methods indicates that there is no advantage to the sophisticated aggregation of decisions based on multiple instances according to [25, 26] compared to the simple averaging in (1). In addition, “Proposed-pathology-loss” provides inferior performance compared to the proposed approach, highlighting the contribution of the ordinal regression framework presented in Subsection III-D. The proposed method, based on the improved training strategy devised from the analysis in Subsection III-C, outperforms all other methods.

Fig. 5: ROC (Left) and PR (Right) curves comparing the performance of the proposed algorithm and human experts in predicting thyroid malignancy. Blue curve - the proposed algorithm. Red curve - pathologist from the medical record. Purple, orange and green curves - expert cytopathologists , , and , respectively (these three individuals analyzed the same digital image considered by the algorithm, and these experts were not the same as the clinicians from the medical record).

To compare the algorithm with human-level performance, we use a subset of slides annotated by three expert cytopathologists, in addition to the TBS scores available in the original medical record (MR TBS). The MR TBS results are also human-generated, but in general a different clinician analyzed each case, so the MR TBS results reflect the cumulative decisions of multiple clinicians. The performance of the proposed algorithm is compared to that of the human experts in Fig. 5 using receiver operating characteristic (ROC) and precision-recall (PR) curves. The curves representing the human experts are obtained by treating the TBS categories as “human predictions of malignancy,” such that higher TBS categories correspond to a higher probability of thyroid malignancy. The AUC score obtained by the proposed algorithm is comparable to those of the human experts, and the algorithm provides an improved AP score compared to the human experts.
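As an illustration of how an ordered category can be turned into a curve, the sketch below thresholds an ordinal score at each of its observed values and records the resulting operating points. The function name and the toy data are hypothetical, and the category values are placeholders rather than the annotations used in Fig. 5.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], categories: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false-positive rate, true-positive rate) pairs obtained by
    calling a case malignant when its ordinal category is >= a threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]  # threshold above every category: nothing is flagged
    for t in sorted(set(categories), reverse=True):
        tp = sum(1 for y, c in zip(labels, categories) if c >= t and y == 1)
        fp = sum(1 for y, c in zip(labels, categories) if c >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical toy data: final pathology (1 = malignant) and an ordinal category.
toy_pathology = [0, 0, 0, 1, 0, 1, 1]
toy_categories = [2, 2, 3, 3, 4, 5, 6]
toy_curve = roc_points(toy_pathology, toy_categories)
```

Because a categorical score takes only a handful of values, the resulting curve has one operating point per category, connected by straight segments.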

Fig. 6 further presents a comparison of the TBS scores assigned by the algorithm and by the human experts. The plot shows high values at the top-left and bottom-right of the matrix, while the off-diagonal values decay. This block-diagonal structure is exactly what is expected from the algorithm, rather than, e.g., a diagonal structure. For the indeterminate cases, assigned TBS to by the experts, the term of the loss function corresponding to the final pathology in (7) encourages the algorithm to deviate from the original TBS and to provide either lower values in the benign cases or higher values in the malignant ones. On the other hand, cases assigned TBS and by cytopathologists are benign and malignant, respectively, in more than of the cases. This high confidence in TBS and cases is similarly encoded in the algorithm: all the cases for which the algorithm predicts TBS or are indeed benign or malignant, respectively. Notably, these include cases that were previously considered indeterminate, i.e., assigned TBS to by a pathologist, and are correctly classified by the algorithm. This result demonstrates the potential of using the proposed algorithm as an assistive tool for cytopathologists by combining human and algorithm decisions.

Fig. 6: Confusion matrix of TBS categories assigned by the proposed algorithm vs. human experts. The colors in the plot correspond to a column normalized version of the confusion matrix.
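The column normalization mentioned in the caption can be sketched as follows. This is an illustrative sketch only: the function names, the toy data, and the choice of rows for the algorithm and columns for the human expert are assumptions, not details taken from the figure.

```python
from typing import List, Sequence


def confusion_matrix(row_labels: Sequence[int], col_labels: Sequence[int],
                     categories: Sequence[int]) -> List[List[int]]:
    """Count co-occurrences; entry [i][j] is the number of cases given
    categories[i] by the row rater and categories[j] by the column rater."""
    index = {c: i for i, c in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for r, c in zip(row_labels, col_labels):
        matrix[index[r]][index[c]] += 1
    return matrix


def normalize_columns(matrix: List[List[int]]) -> List[List[float]]:
    """Divide each column by its sum so that every non-empty column sums to 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[matrix[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical toy data; the category values are placeholders.
toy_cats = [2, 3, 4, 5, 6]
toy_algorithm = [2, 2, 3, 6, 6, 5]   # assumed to index the rows
toy_expert = [2, 3, 3, 5, 6, 5]      # assumed to index the columns
toy_normalized = normalize_columns(confusion_matrix(toy_algorithm, toy_expert, toy_cats))
```

With this normalization, each column can be read as the distribution of algorithm-assigned categories among the cases that the expert placed in a given category.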

V Conclusion

We have proposed a deep-learning-based algorithm for the prediction of thyroid malignancy from whole-slide cytopathology scans. Inspired by the work of a cytopathologist, the algorithm identifies informative image regions containing follicular cells. It then uses these regions to assign a reliable malignancy score, similar to the Bethesda system, where higher values correspond to higher probabilities of malignancy. Our analysis shows how the proposed algorithm is related to MIL approaches and further allows us to devise an improved training strategy. We have demonstrated the improved performance of the proposed algorithm compared to competing methods and shown that it achieves performance comparable to that of three cytopathologists. In a future study, we plan to significantly increase the size of the test set annotated by expert cytopathologists. We further plan to incorporate additional cytopathological characteristics and insights into our prediction; for example, malignant TBS slides often contain larger amounts of follicular cells, while the presence of a fluid known as colloid is strongly indicative of a benign case.


  • [1] Briseis Aschebrook-Kilfoy, Rebecca B Schechter, Ya-Chen Tina Shih, Edwin L Kaplan, Brian C-H Chiu, Peter Angelos, and Raymon H Grogan, “The clinical and economic burden of a sustained increase in thyroid cancer incidence,” Cancer Epidemiology and Prevention Biomarkers, 2013.
  • [2] D. Dov, S. Kovalsky, J. Cohen, D. Range, R. Henao, and L Carin, “Thyroid cancer malignancy prediction from whole slide cytopathology images,” arXiv, 2019.
  • [3] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  • [4] Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al., “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” Jama, vol. 316, no. 22, pp. 2402–2410, 2016.
  • [5] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115, 2017.
  • [6] Geert Litjens, Clara I Sánchez, Nadya Timofeeva, Meyke Hermsen, Iris Nagtegaal, Iringo Kovacs, Christina Hulsbergen-Van De Kaa, Peter Bult, Bram Van Ginneken, and Jeroen Van Der Laak, “Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis,” Scientific reports, vol. 6, pp. 26286, 2016.
  • [7] Ugljesa Djuric, Gelareh Zadeh, Kenneth Aldape, and Phedias Diamandis, “Precision histology: how deep learning is poised to revitalize histomorphology for personalized cancer care,” NPJ precision oncology, vol. 1, no. 1, pp. 22, 2017.
  • [8] K. Sirinukunwattana, S. E. A. Raza, Y. Tsang, D. R. Snead, I. A. Cree, and N. M. Rajpoot, “Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1196–1206, 2016.
  • [9] Abraham Pouliakis, Efrossyni Karakitsou, Niki Margari, Panagiotis Bountris, Maria Haritou, John Panayiotides, Dimitrios Koutsouris, and Petros Karakitsos, “Artificial neural networks as decision support tools in cytopathology: past, present, and future,” Biomedical engineering and computational biology, vol. 7, pp. BECB–S31601, 2016.
  • [10] Tianjiao Liu, Shuaining Xie, Jing Yu, Lijuan Niu, and Weidong Sun, “Classification of thyroid nodules in ultrasound images using deep model based transfer learning and hybrid features,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 919–923.
  • [11] Jianning Chi, Ekta Walia, Paul Babyn, Jimmy Wang, Gary Groot, and Mark Eramian, “Thyroid nodule classification in ultrasound images by fine-tuning deep convolutional neural network,” Journal of digital imaging, vol. 30, no. 4, pp. 477–486, 2017.
  • [12] Jinlian Ma, Fa Wu, Qiyu Zhao, Dexing Kong, et al., “Ultrasound image-based thyroid nodule automatic segmentation using convolutional neural networks,” International journal of computer assisted radiology and surgery, vol. 12, no. 11, pp. 1895–1910, 2017.
  • [13] Jinlian Ma, Fa Wu, Jiang Zhu, Dong Xu, and Dexing Kong, “A pre-trained convolutional neural network based method for thyroid nodule diagnosis,” Ultrasonics, vol. 73, pp. 221–230, 2017.
  • [14] Hailiang Li, Jian Weng, Yujian Shi, Wanrong Gu, Yijun Mao, Yonghua Wang, Weiwei Liu, and Jiajie Zhang, “An improved deep learning approach for detection of thyroid papillary cancer in ultrasound images,” Scientific reports, vol. 8, 2018.
  • [15] Wenfeng Song, Shuai Li, Ji Liu, Hong Qin, Bo Zhang, Zhang Shuyang, and Aimin Hao, “Multi-task cascade convolution neural networks for automatic thyroid nodule detection and recognition,” IEEE journal of biomedical and health informatics, 2018.
  • [16] John A Ozolek, Akif Burak Tosun, Wei Wang, Cheng Chen, Soheil Kolouri, Saurav Basu, Hu Huang, and Gustavo K Rohde, “Accurate diagnosis of thyroid follicular lesions from nuclear morphology using supervised learning,” Medical image analysis, vol. 18, no. 5, pp. 772–780, 2014.
  • [17] Alexandra Varlatzidou, Abraham Pouliakis, Magdalini Stamataki, Christos Meristoudis, Niki Margari, George Peros, John G Panayiotides, and Petros Karakitsos, “Cascaded learning vector quantizer neural networks for the discrimination of thyroid lesions,” Anal Quant Cytol Histol, vol. 33, no. 6, pp. 323–334, 2011.
  • [18] Rajiv Savala, Pranab Dey, and Nalini Gupta, “Artificial neural network model to distinguish follicular adenoma from follicular carcinoma on fine needle aspiration of thyroid,” Diagnostic cytopathology, vol. 46, no. 3, pp. 244–249, 2018.
  • [19] Hayim Gilshtein, Michal Mekel, Leonid Malkin, Ofer Ben-Izhak, and Edmond Sabo, “Computerized cytometry and wavelet analysis of follicular lesions for detecting malignancy: A pilot study in thyroid cytology,” Surgery, vol. 161, no. 1, pp. 212–219, 2017.
  • [20] Edward Kim, Miguel Corte-Real, and Zubair Baloch, “A deep semantic mobile application for thyroid cytopathology,” in Medical Imaging 2016: PACS and Imaging Informatics: Next Generation and Innovations. International Society for Optics and Photonics, 2016, vol. 9789, p. 97890A.
  • [21] Antonis Daskalakis, Spiros Kostopoulos, Panagiota Spyridonos, Dimitris Glotsos, Panagiota Ravazoula, Maria Kardari, Ioannis Kalatzis, Dionisis Cavouras, and George Nikiforidis, “Design of a multi-classifier system for discriminating benign from malignant thyroid nodules using routinely h&e-stained cytological images,” Computers in biology and medicine, vol. 38, no. 2, pp. 196–203, 2008.
  • [22] Balasubramanian Gopinath and Natesan Shanthi, “Computer-aided diagnosis system for classifying benign and malignant thyroid nodules in multi-stained fnab cytological images,” Australasian physical & engineering sciences in medicine, vol. 36, no. 2, pp. 219–230, 2013.
  • [23] Parikshit Sanyal, Tanushri Mukherjee, Sanghita Barui, Avinash Das, and Prabaha Gangopadhyay, “Artificial intelligence in cytopathology: A neural network to identify papillary carcinoma on thyroid fine-needle aspiration cytology smears,” Journal of pathology informatics, vol. 9, 2018.
  • [24] C. Zhang, J. C. Platt, and P. Viola, “Multiple instance boosting for object detection,” in Advances in neural information processing systems (NIPS), 2006, pp. 1417–1424.
  • [25] O. Z. Kraus, J. L. Ba, and B. J. Frey, “Classifying and segmenting microscopy images with deep multiple instance learning,” Bioinformatics, vol. 32, no. 12, pp. i52–i59, 2016.
  • [26] Maximilian Ilse, Jakub M Tomczak, and Max Welling, “Attention-based deep multiple instance learning,” arXiv preprint arXiv:1802.04712 (ICML18), 2018.
  • [27] G. Quellec, G. Cazuguel, B. Cochener, and M. Lamard, “Multiple-instance learning for medical image and video analysis,” IEEE reviews in biomedical engineering, vol. 10, pp. 213–234, 2017.
  • [28] J. RR Uijlings, K. EA Van De Sande, T. Gevers, and A. WM Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
  • [29] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. of the IEEE conference on computer vision and pattern recognition (CVPR), 2014, pp. 580–587.
  • [30] R. Girshick, “Fast r-cnn,” in Proc. of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [31] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 1137–1149, 2017.
  • [32] P. A. Gutierrez, M. Perez-Ortiz, J. Sanchez-Monedero, F. Fernandez-Navarro, and C. Hervas-Martinez, “Ordinal regression methods: survey and experimental study,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 1, pp. 127–146, 2016.
  • [33] P. McCullagh, “Regression models for ordinal data,” Journal of the royal statistical society. Series B (Methodological), pp. 109–142, 1980.
  • [34] A. Agresti, Categorical data analysis, vol. 482, John Wiley & Sons, 2003.
  • [35] M. Dorado-Moreno, P. A. Gutiérrez, and C. Hervás-Martínez, “Ordinal classification using hybrid artificial neural networks with projection and kernel basis functions,” in International Conference on Hybrid Artificial Intelligence Systems. Springer, 2012, pp. 319–330.
  • [36] X. Jing, S. M. Knoepp, M. H. Roh, K. Hookim, J. Placido, R. Davenport, R. Rasche, and C. W. Michael, “Group consensus review minimizes the diagnosis of “follicular lesion of undetermined significance” and improves cytohistologic concordance,” Diagnostic cytopathology, vol. 40, no. 12, pp. 1037–1042, 2012.
  • [37] P. Pathak, R. Srivastava, N. Singh, V. K. Arora, and A. Bhatia, “Implementation of the bethesda system for reporting thyroid cytopathology: interobserver concordance and reclassification of previously inconclusive aspirates,” Diagnostic cytopathology, vol. 42, no. 11, pp. 944–949, 2014.
  • [38] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in Advances in Neural Information Processing Systems, 2017, pp. 3391–3401.
  • [39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [40] G. Casella and R. L. Berger, Statistical inference, vol. 2, Duxbury Pacific Grove, CA, 2002.


Proposition 1 The estimated log of the likelihood ratio is a linear function of given by:


where and are constant values.

The proof is based on the assumption that the instances are independent random variables. We note that this assumption is used to facilitate the derivation and it might not hold in practice for instances taken from the same scan. Yet, we motivate this assumption by the large variability between the follicular groups in their size, architecture and the number of cells as demonstrated in Fig. 3.


By the independence assumption and by taking the log, we have:


By applying the Bayes rule, we have:


which we rewrite by:


where: and . The first term in the last equation is estimated from the neural network, so that the estimated log likelihood ratio is given by:


Finally, (14) is obtained by substituting (1) into (19). ∎
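For readability, the chain of steps in the proof can be sketched as follows. The notation is an assumption, since the symbols are not recoverable from this copy: $X=\{x_m\}_{m=1}^{M}$ denotes the instances of a slide, $Y\in\{0,1\}$ the malignancy label, $\Lambda$ the likelihood ratio, and $g(x_m)$ a network output that is assumed to estimate the posterior log-odds of instance $x_m$. The equation numbers of the original derivation are not reproduced.

```latex
% Sketch under assumed notation; not the authors' exact equations.
\begin{align*}
\log \Lambda(X)
  &= \sum_{m=1}^{M} \log \frac{P(x_m \mid Y=1)}{P(x_m \mid Y=0)}
  && \text{(independence, then taking the log)} \\
  &= \sum_{m=1}^{M} \log \frac{P(Y=1 \mid x_m)\, P(Y=0)}{P(Y=0 \mid x_m)\, P(Y=1)}
  && \text{(Bayes' rule; } P(x_m) \text{ cancels)} \\
  &= \sum_{m=1}^{M} \log \frac{P(Y=1 \mid x_m)}{P(Y=0 \mid x_m)}
     + M \log \frac{P(Y=0)}{P(Y=1)} .
\end{align*}
% Replacing the posterior log-odds by the network estimate g(x_m) and
% writing the average as \bar{g}(X) = \frac{1}{M} \sum_{m=1}^{M} g(x_m):
\begin{equation*}
\log \hat{\Lambda}(X) = M\, \bar{g}(X) + M \log \frac{P(Y=0)}{P(Y=1)} .
\end{equation*}
```

Under these assumptions, the estimate is linear in the averaged score, with a slope given by the number of instances and an offset determined by the class priors. This matches the form stated in the proposition, although the constants in the original may be defined differently.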

Feature extraction layers
Layer Number of filters
Classification layers
Layer Output size
TABLE II: VGG11 based architecture used for both the first and the second neural networks in the proposed algorithm. Each conv2d layer comprises 2D convolutions with the parameters and . Parameters of the Max-pooling layer: , . The conv2d and the linear layers (except the last one) are followed by batch normalization and ReLU.
TABLE III: Binary labels used in the proposed ordinal regression framework to predict the Bethesda score.
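Since the entries of Table III are not available in this copy, the following sketch shows one common way of encoding an ordered category as a set of binary labels, the cumulative (threshold) encoding. It is an assumption that the table follows this scheme; the function names and the category list are hypothetical.

```python
from typing import List, Sequence


def ordinal_to_binary(category: int, categories: Sequence[int]) -> List[int]:
    """Cumulative encoding: label k is 1 when `category` lies above the
    k-th lowest category, so K ordered categories give K - 1 binary labels."""
    rank = list(categories).index(category)
    return [1 if rank > k else 0 for k in range(len(categories) - 1)]


def binary_to_ordinal(probabilities: Sequence[float], categories: Sequence[int],
                      threshold: float = 0.5) -> int:
    """Decode by counting how many binary outputs exceed the threshold."""
    rank = sum(1 for p in probabilities if p > threshold)
    return categories[rank]


# Hypothetical ordered category list (placeholder values).
toy_scale = [2, 3, 4, 5, 6]
toy_encoded = ordinal_to_binary(4, toy_scale)
toy_decoded = binary_to_ordinal([0.9, 0.8, 0.3, 0.1], toy_scale)
```

An encoding of this kind preserves the ordering of the categories: misclassifying a case into a neighboring category flips a single binary label, whereas a distant category flips several.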
Fig. 7: Instances containing follicular groups. The rows, from top to bottom, correspond to TBS categories.