Cancer Detection with Multiple Radiologists via Soft Multiple Instance Logistic Regression and Regularization
Abstract
This paper deals with the multiple annotation problem in the medical application of cancer detection in digital images. The main assumption is that though each image is labeled by many experts, the number of images read by the same expert is not large. Thus, differing from the existing work that models each expert and the ground truth simultaneously, the multi-annotation information is used in a soft manner: the multiple labels from different experts are used to estimate the probability of a finding being marked as malignant. The learning algorithm minimizes the Kullback-Leibler (KL) divergence between the modeled probabilities and the desired ones while constraining the model to be compact. The probabilities are modeled by logistic regression, and we use the multiple instance learning (MIL) concept.
Experiments on a real-life computer aided diagnosis (CAD) problem of lung cancer detection in chest X-ray (CXR) images demonstrate that the proposed algorithm leads to results similar to learning with a binary RVM-MIL classifier or a mixture of binary RVM-MIL models per annotator. However, the proposed model achieves a smaller complexity and is preferable in practice.
1 Introduction
This paper deals with the multiple annotation problem in the medical application of cancer detection in digital images. In contrast to the existing work on learning from multiple annotators [1, 2, 3, 4], in our application the same radiologist reads only a small subset of images from the entire database. Our main assumption is that though the same image is read by many experts, the number of images read by the same expert is not large. Thus, instead of modeling each expert and the ground truth separately, the multi-annotation information is used directly to deal with the noisy radiologist perception.
The multiple radiologist marks are used to create a merged ground truth (GT) mark as an ellipse area (Section 2.1), and the GT marks are then labeled in a soft manner by assigning them a probability of being malignant (Section 3). We introduce a new probabilistic model whose learning is based on minimizing the Kullback-Leibler (KL) divergence between the modeled probabilities and the desired ones while constraining the model's complexity. The probabilities are modeled by logistic regression, and we use the multiple instance learning concept (Section 4).
2 Multiple annotation
It is well known that different radiologists perceive suspicious lesions differently. When presented with an image and asked to mark the suspicious regions with ellipses, they draw ellipses at different positions, orientations, and scales. Some radiologists have a tendency to draw one large ellipse covering the suspicious region, while others prefer to represent the same region by drawing a few small ellipses (see Figure 1b, top row). Even in easy cases of a single perceptual scale their markings may still differ, so that although the marking ellipses have a large area of intersection, their parameters are still different (see Figure 1b, bottom row).
In order to deal with this geometrical variability in representing the same region, a ground truth (GT) ellipse is created by merging the ellipses drawn by all the annotators. In order to merge the marks of different annotators, they are split into groups representing the same GT objects.
[Figure 1: panels a, b, c]
2.1 Geometrical Merging
The merged GT is created by the following heuristic procedure.

1. First, all possible intersecting pairs of marks annotated by different radiologists are considered. Two marks are considered to be hitting (potentially referring to the same GT mark) if their normalized intersection area is sufficiently large; the threshold is selected adaptively to the marks' configuration. When the marks are very similar in size (their size fraction exceeds a fixed bound) and position (the distance between the mark centers, normalized by their size, is small), a lower threshold is used; in all other cases a stricter threshold, depending on the mark sizes in mm, is used.

2. Iterate over the marks and assign to each mark a score equal to the number of marks it hits. Organize the marks into a list sorted by score.

3. While the mark list is not empty, find the "seeding mark" with the largest score and merge it with all the marks it hits to create a primary GT mark. The primary GT mark is an object referring to the list of marks consisting of the "seeding mark" and the marks hitting it. It is represented by the mark of median size in the list and has a score equal to the number of marks in the list. An exception is a primary GT referring to only one or two marks: if it refers to two marks, it is represented by one of them, selected at random; a single unmatched mark simply stands for a primary GT with a score of one.

4. Marks already merged into primary GTs are removed from the mark list and do not participate in the subsequent merging process.

5. Continue this procedure iteratively from Step 3.
When the procedure above is finished, coinciding primary GTs are merged into final ones, with the score corrected to refer to all the underlying marks and to record their number as the final GT score. Non-coinciding primary GTs simply become the final GTs used in classification.
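As a rough illustration, the merging procedure above can be sketched as follows, approximating each mark by a circle (center, radius) and using the intersection area normalized by the smaller mark's area as the hit test. The fixed threshold of 0.3 and the function names are hypothetical stand-ins for the adaptively selected thresholds described in the text.

```python
import math

def circle_intersection_area(c1, c2):
    """Area of intersection of two circles given as (x, y, r)."""
    x1, y1, r1 = c1
    x2, y2, r2 = c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:
        return 0.0
    if d <= abs(r1 - r2):  # one circle lies inside the other
        r = min(r1, r2)
        return math.pi * r * r
    # standard circular-lens area formula
    a1 = r1 * r1 * math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
    a2 = r2 * r2 * math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
    tri = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                          * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - tri

def merge_marks(marks, hit_thresh=0.3):
    """Greedy merging: repeatedly pick the mark hitting most others (the
    "seeding mark"), group it with all marks it hits, remove the group."""
    def hits(i, j):
        inter = circle_intersection_area(marks[i], marks[j])
        smaller = math.pi * min(marks[i][2], marks[j][2]) ** 2
        return inter / smaller >= hit_thresh

    remaining = set(range(len(marks)))
    groups = []
    while remaining:
        score = {i: sum(hits(i, j) for j in remaining if j != i)
                 for i in remaining}
        seed = max(remaining, key=score.get)
        group = [seed] + [j for j in remaining if j != seed and hits(seed, j)]
        groups.append(group)
        remaining -= set(group)
    return groups
```

The greedy loop mirrors Steps 2-5: scores are recomputed on the remaining marks, the highest-scoring seed is merged with its hitting marks, and the group is removed.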
3 Soft label creation
In our application we are not provided with ground-truth labels obtained from biopsy or from other modalities; instead, we have many marks from different radiologists. The geometrical aspects of generating GT marks from the multiple annotator responses were explained in Section 2.1. In addition, each GT mark carries a score value showing how many radiologists marked this GT object (i.e., consider the region suspicious).
The number of radiologists marking a region is an important characteristic that enables estimating the probability that the GT is perceived as malignant. If all radiologists are of the same competence, the naive estimate given by the proportion of positive (malignant) radiologist responses to the overall number of radiologists reading the image is a good estimate of the probability that the mark is perceived as malignant by the radiologists (if radiologists have a shared tendency to perceive some artifact as a malignancy, the learned model should be able to assign a large probability to such a region as well).
These probabilities are normalized for the case of a GT created by a single annotator. If a GT mark is created by one annotator and the image is read by the smallest allowed number of annotators, the desired probability of malignancy is set to the smallest meaningful probability (the probability obtained when the smallest number of readers out of the maximal number of them mark the region as malignant), multiplied by a depression coefficient. In the case of a larger number of image readers and a GT created by a single annotator, its probability falls proportionally to the number of readers.
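The soft-label rule can be sketched as below. Since the exact constants are elided in the text, the minimal reader count, p_min, and the depression coefficient here are hypothetical stand-ins, and the function name is ours.

```python
def soft_label(n_readers, n_positive, n_min=2, p_min=0.1, depression=0.5):
    """Desired probability of malignancy for a GT mark created by n_positive
    out of n_readers annotators (all constants are illustrative)."""
    if n_positive >= 2:
        # naive proportion estimate when several readers agree
        return n_positive / n_readers
    # single-annotator GT: depressed probability, falling with reader count
    return depression * p_min * n_min / n_readers
```

For example, a mark made by two of four readers gets probability 0.5, while a single-annotator mark is depressed further as the number of readers grows.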
4 MIL with soft labeling and regularization
In the MIL framework the data are considered to be aggregated into so-called bags $B_i$, $i = 1, \ldots, N$ ($N$ is the number of bags). All the instances of a bag share the same extra bag-state label, being positive or negative. A bag is considered negative if all its instances $x_{ij}$, $j = 1, \ldots, K_i$ ($K_i$ is the number of samples in bag $B_i$) are negative, and positive if at least one of its instances is positive. With instance probabilities $p_{ij} = \sigma(\mathbf{w}^T x_{ij})$ given by logistic regression, the probability of the bag to be positive is:

$$\hat{p}_i = 1 - \prod_{j=1}^{K_i} (1 - p_{ij}).$$
We generalize the MIL concept to allow soft labeling, so that the instances are still aggregated into bags but each bag is malignant with an a priori probability $p_i$. The standard maximum likelihood (ML) learning objective is replaced by minimization of the KL divergence between the prior distribution $p$ and the estimated distribution $\hat{p}$, assuming both distributions are i.i.d. over bags:

$$\mathrm{KL}(p \| \hat{p}) = \sum_{i=1}^{N} \left[ p_i \log\frac{p_i}{\hat{p}_i} + (1 - p_i) \log\frac{1 - p_i}{1 - \hat{p}_i} \right].$$
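A minimal NumPy sketch of these two ingredients, the noisy-OR bag probability and the Bernoulli KL objective, under our own naming; the logistic instance model follows the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(w, X_bag):
    """Noisy-OR bag probability: 1 - prod_j (1 - sigma(w^T x_ij))."""
    return 1.0 - np.prod(1.0 - sigmoid(X_bag @ w))

def kl_objective(w, bags, priors, eps=1e-12):
    """Sum over bags of the Bernoulli KL divergence KL(p_i || p_hat_i)."""
    total = 0.0
    for X_bag, p in zip(bags, priors):
        q = np.clip(bag_probability(w, X_bag), eps, 1.0 - eps)
        p = np.clip(p, eps, 1.0 - eps)
        total += p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
    return total
```

Each bag is a (K_i, d) array of instance feature vectors; the objective is zero exactly when the modeled bag probabilities match the priors.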
We minimize the KL divergence with an $\ell_1$ constraint on the projection vector $\mathbf{w}$ to prefer parsimonious models:

$$\min_{\mathbf{w}} \; \mathrm{KL}(p \| \hat{p}(\mathbf{w})) + \lambda \|\mathbf{w}\|_1. \qquad (1)$$
We use conjugate gradient descent to find the model parameters, with software developed by Schmidt et al. [5]. In order to use the code, the gradient and Hessian of the objective function have to be calculated analytically; these expressions are derived and presented in the Appendix (Section 7).
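The paper optimizes with the $\ell_1$ solver of Schmidt et al. [5]; as a self-contained stand-in, the sketch below minimizes the same KL + L1 objective by proximal gradient (ISTA), using the analytic per-bag gradient derived in the Appendix. All function names and the step/iteration settings are ours, not the authors'.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_gradient(w, bags, priors, eps=1e-12):
    """Analytic KL gradient: per bag ((p_hat - p)/p_hat) * sum_j sigma_ij x_ij."""
    g = np.zeros_like(w)
    for X, p in zip(bags, priors):
        s = sigmoid(X @ w)                            # instance probabilities
        p_hat = np.clip(1.0 - np.prod(1.0 - s), eps, 1.0 - eps)
        g += (p_hat - p) / p_hat * (s @ X)            # s @ X = sum_j sigma_ij x_ij
    return g

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_ista(bags, priors, dim, lam=0.05, step=0.1, n_iter=500):
    """Proximal-gradient (ISTA) minimization of KL(p||p_hat) + lam*||w||_1."""
    w = np.zeros(dim)
    for _ in range(n_iter):
        w = soft_threshold(w - step * kl_gradient(w, bags, priors), step * lam)
    return w
```

The soft-thresholding step is what drives irrelevant weights exactly to zero, giving the compact models the regularization is meant to produce.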
4.1 Normalization issues
When working with the regularization parameter it is worth getting some intuition about how to set it reasonably. In practice the data dimensionality (the number of extracted features) changes between different versions of the system. This motivates us to normalize the $\ell_1$ norm by the input space dimensionality $d$:

$$\min_{\mathbf{w}} \; \mathrm{KL}(p \| \hat{p}(\mathbf{w})) + \frac{\lambda}{d} \|\mathbf{w}\|_1. \qquad (2)$$
Another issue is that the number of data samples changes during training, which naturally leads to resetting the regularization parameter. A simple idea is to normalize the KL-divergence term by the overall number of samples $N$:

$$\min_{\mathbf{w}} \; \frac{1}{N}\,\mathrm{KL}(p \| \hat{p}(\mathbf{w})) + \frac{\lambda}{d} \|\mathbf{w}\|_1. \qquad (3)$$
The second approach is more delicate, as it also tries to resolve the imbalance problem by normalizing the soft-positive and hard-negative KL-divergence terms separately:

$$\min_{\mathbf{w}} \; \frac{1}{N_+}\sum_{i \in \mathcal{P}} \mathrm{KL}_i + \frac{1}{N_-}\sum_{i \in \mathcal{N}} \mathrm{KL}_i + \frac{\lambda}{d} \|\mathbf{w}\|_1, \qquad (4)$$

where $\mathcal{P}$ and $\mathcal{N}$ are the sets of soft-positive and hard-negative bags of sizes $N_+$ and $N_-$, respectively.
From a computational viewpoint, any of these normalizations is equivalent to a per-bag normalization.
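The three normalization variants can be summarized in one small helper (the function name and variant keys are ours); kl_pos and kl_neg denote the KL sums over soft-positive and hard-negative bags.

```python
def normalized_objective(kl_pos, kl_neg, l1_norm, d, n_pos, n_neg,
                         lam, variant="balanced"):
    """Objective value under the normalizations of Section 4.1.

    kl_pos / kl_neg: summed KL terms over soft-positive / hard-negative bags;
    d: feature dimensionality; n_pos / n_neg: bag counts per class.
    """
    reg = lam * l1_norm / d                    # dimension-normalized L1 term
    if variant == "samples":                   # KL scaled by total sample count
        return (kl_pos + kl_neg) / (n_pos + n_neg) + reg
    if variant == "balanced":                  # per-class KL scaling
        return kl_pos / n_pos + kl_neg / n_neg + reg
    return kl_pos + kl_neg + reg               # unnormalized objective
```

Since each variant only rescales terms by constants, switching between them amounts to re-weighting the per-bag contributions, consistent with the per-bag normalization remark above.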
5 Lesion Detection in CXR CAD
5.1 CAD objective and experimental design.
In computer aided diagnosis (CAD) the goal is usually defined as detection of malignant regions in digital images. In our setting, however, there is no additional information about the malignancy state of the radiologist marks, such as biopsy results or observations from other modalities (CT, for example). In the absence of a golden ground truth we resort to a standard practice used by radiologists: if two readers consistently consider a region as malignant, our system should be able to detect it.
In other words, the goal of our system is to assist the radiologists in making a diagnosis so that they are satisfied with the system, rather than to detect an unknown ground truth that may even be invisible to the eye in the X-ray images. In summary, while the soft classifier learns the probability of a region being marked as malignant by radiologists, the final ROC results [6], reporting the sensitivity and the number of false positives per image, are presented as if the pseudo golden GT corresponds to GTs marked by more than one reader.
The proposed soft classifier is compared with two other binary classifiers:

B1: a binary RVM-MIL classifier discriminating the pseudo golden GTs.

B2: a mixture of binary RVM-MIL classifiers trained per expert. The mixture classifier is an extra binary MIL logistic regression network whose inputs are the continuous score outputs of the per-expert classifiers. The goal of the final logistic network is to discriminate the binary pseudo golden GTs.
The goal of the binary classifiers B1 and B2 is to label the suspicious candidates as malignant from the radiologists' point of view (in our case, marked more than once by different radiologists). In other words, they have to reduce the number of false positives (FPs) without decreasing sensitivity. In binary classification, the candidates that are close to a pseudo golden GT mark are arranged into a positive bag; all other candidates are considered negative.
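Bag construction for the binary classifiers can be sketched as follows. The distance rule and threshold are hypothetical stand-ins for the paper's proximity criterion, and the function name is ours.

```python
import math

def build_bags(candidates, gt_marks, max_dist=20.0):
    """Group candidates into bags.

    candidates: list of (x, y, features); gt_marks: list of (x, y) centers.
    Candidates near a pseudo golden GT mark join that mark's positive bag;
    all remaining candidates become negative singleton bags.
    """
    positive_bags = [[] for _ in gt_marks]
    negatives = []
    for cx, cy, feat in candidates:
        dists = [math.hypot(cx - gx, cy - gy) for gx, gy in gt_marks]
        k = min(range(len(gt_marks)), key=dists.__getitem__) if gt_marks else -1
        if k >= 0 and dists[k] <= max_dist:
            positive_bags[k].append(feat)      # candidate hits a GT mark
        else:
            negatives.append([feat])           # each negative is its own bag
    return positive_bags, negatives
```

This matches the MIL semantics of Section 4: a positive bag needs only one truly malignant candidate, while every negative candidate must be negative on its own.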
The CXR CAD system consists of three steps: (1) candidate generation, (2) feature extraction, and (3) classification. In the first step the suspicious regions (lesion candidates) are found. While this step detects most of the anomalies, the number of extracted false positives is extremely high. In order to reduce it, features describing the suspicious regions are extracted and classification is performed in this feature space.
5.2 Data Description
The suspicious region features fall into two groups: a relatively small set of specific features and a large number of standard intensity-based and texture features (see [7, 8]). The specific features are elaborately extracted to reflect the heuristic nature of the lesions, for example spiculation features [9] or a region label indicating whether the candidate falls in the rib region. The number of specific features is about , and the number of texture-like features is about .
The overall number of images at our disposal is , among them the number of potentially malignant images, having at least one GT mark, is . The overall number of marks by all the experts is , and the number of generated pseudo golden GTs is . The annotator histogram, i.e., the number of images read by a certain number of annotators, is presented in Figure 2a. As can be seen, most of the images were marked by a few annotators and only a small number by more. The sensitivity and false positive rates of the readers with respect to the pseudo golden GTs are presented in Figure 2b.
[Figure 2: panels a, b]

6 Results
The data are divided into three subsets of equal size: training, validation, and test. The "training + validation" sets are used for training the models and setting the model parameters, and the results are reported on the unseen test data. We consider three models and compare their performance and complexity.
6.1 MIL with soft labeling and regularization.
The first vital system requirement is to produce fewer than a specified number of false positives (FP) per image; higher FP rates are not acceptable to radiologists, because they complicate their work, leading to fatigue and errors. The second requirement is a high sensitivity both on GTs and on images (when measuring image sensitivity, it is sufficient to detect at least one GT in the image; in other words, finding at least one real nodule by the CAD is considered a success). The sensitivities of the classifiers for different regularization parameters are presented in Figure 3. As can be seen (Figure 3a), there exists an interval of regularization values where the regularized soft models have large sensitivity values and a relatively small difference in sensitivity between the "training + validation" sets. In order to shrink this interval and select the optimal value, the model sensitivities for larger FP rates are considered as well (Figure 3b). The final optimal value of $\lambda$ was selected in the intersection of the optimal intervals for the two FP levels and was set to 0.06.
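The selection rule described above, keeping the lambdas with high validation sensitivity and a small train/validation gap and picking a value inside the admissible interval, can be sketched as follows (the thresholds and function name are illustrative, not the authors').

```python
def select_lambda(results, min_sensitivity=0.6, max_gap=0.05):
    """Pick a regularization value from a sweep.

    results: list of (lam, train_sensitivity, val_sensitivity) tuples.
    Keeps lambdas with high validation sensitivity and a small
    train/validation gap, then returns the middle of that interval.
    """
    good = [lam for lam, tr, va in results
            if va >= min_sensitivity and abs(tr - va) <= max_gap]
    if not good:
        raise ValueError("no lambda satisfies the criteria")
    return sorted(good)[len(good) // 2]        # middle of the admissible interval
```

A small gap between training and validation sensitivity is used here as a proxy for not overfitting, mirroring the interval-intersection argument in the text.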
[Figure 3: panels a, b]
The final soft classifier is obtained by retraining the soft classifier with the selected $\lambda$, and the results are reported on the unseen test data, which was not used in any step of the soft classifier training (the reported results do not take into account possible failures in candidate generation; we report and compare classification models only). The ROCs for the final soft classifier are presented in Figure 4 (red curves). We also note that the FP rate moves slightly up to 0.52 on the test data, and there is a slight decrease in sensitivity.
6.2 Comparison with the binary classifiers
In this section the optimally regularized soft model is compared with the two binary classifiers B1 and B2 (see Section 5.1). The RVM-MIL classifier (model B1) is trained on the merged "training + validation" data sets and tested on the unseen test data.
Though the complexity of the RVM-MIL model (the number of selected features) is defined automatically during learning via the hyper-parameters, our multiple experiments with RVM-MIL show that this is not as trivial as it appears in [10]. The final RVM-MIL model complexity and performance depend on the other control parameters. Ideally, the setting of these control parameters should be performed on the validation data set. Instead, we favor the RVM-MIL by training it directly on the merged "training + validation" data sets and selecting the parameters leading to the best results. The mixture of RVM-MIL expert models (model B2) is also trained on the merged "training + validation" data sets; the final scores are mixed by an extra binary network.
The results of all three models on ”training+validation” and test data are presented in Figure 4 and Table 1.
[Figure 4: ROC curves for GT sensitivity and image sensitivity]
Table 1. GT and image sensitivity (%) at fixed FP-per-image operating points.

Soft model (lambda = 0.06; 67 selected features)

"training+validation" data:
FP per image      | 0.50  | 1.00  | 1.50  | 2.00  | 2.50  | 3.00
GT sensitivity    | 49.75 | 58.62 | 62.56 | 66.01 | 70.44 | 71.92
Image sensitivity | 57.14 | 64.29 | 67.14 | 70.00 | 75.71 | 75.71

test data:
induced FP per image | 0.58  | 1.19  | 1.69  | 2.24  | 2.84  | 3.37
GT sensitivity       | 44.23 | 55.77 | 64.42 | 67.31 | 71.15 | 75.96
Image sensitivity    | 54.55 | 62.34 | 72.73 | 75.32 | 79.22 | 80.52

Model B1 (88 selected features)

"training+validation" data:
FP per image      | 0.50  | 1.00  | 1.50  | 2.00  | 2.50  | 3.00
GT sensitivity    | 50.25 | 60.10 | 65.52 | 68.47 | 73.40 | 74.38
Image sensitivity | 58.57 | 67.14 | 70.71 | 72.86 | 75.71 | 77.14

test data:
induced FP per image | 0.54  | 1.07  | 1.62  | 2.13  | 2.70  | 3.17
GT sensitivity       | 47.12 | 57.69 | 63.46 | 68.27 | 69.23 | 70.19
Image sensitivity    | 55.84 | 63.64 | 70.13 | 76.62 | 77.92 | 77.92

Model B2 (203 selected features)

"training+validation" data:
induced FP per image | 0.50  | 1.00  | 1.49  | 1.99  | 2.48  | 3.00
GT sensitivity       | 53.20 | 64.53 | 72.91 | 74.88 | 76.35 | 79.80
Image sensitivity    | 60.71 | 67.14 | 77.14 | 79.29 | 80.71 | 82.86

test data:
induced FP per image | 0.61  | 1.14  | 1.75  | 2.23  | 2.70  | 3.28
GT sensitivity       | 47.12 | 60.58 | 66.35 | 70.19 | 77.88 | 78.85
Image sensitivity    | 54.55 | 66.23 | 72.73 | 76.62 | 84.42 | 84.42
As can be seen, all three models achieve almost the same performance; however, the soft model is the most compact, as it uses only 67 features, compared with the B1 and B2 models that use 88 and 203 features, respectively. A nice property of the regularized soft model is an easily perceivable control of the model complexity: as the regularization parameter increases, the number of selected features decreases, which should ultimately lead to more robust results. As can also be seen, the FP values move slightly up on the test data for all three models.
Though learning the B2 model is expensive, it also provides some estimation of the readers' importance. The final weights of the mixture network are presented in Figure 5.
Discussion
In this paper we investigated how multiple annotations can be used for training a CAD system. We introduced the notion of the pseudo golden GT and considered three learning models to address the problem. One of the introduced models is a soft classifier that models the desired probabilities of the regions being marked by the radiologists. The soft classifier is a good alternative to the RVM-MIL classifier or the mixture of RVM-MIL experts, while additionally enabling modeling of the desired probabilities. It is the most compact of the three learned models, which is desirable in practice from both the computational and the explorative viewpoints.
It would be interesting to compare the introduced models in different scenarios with a growing number of images and annotators, when the number of images per annotator is very small. We are planning to conduct such research in the future as such data becomes available (marking the CXR CAD images is quite an expensive process, requiring financial investment).
7 Appendix
In the case of hard a priori labels $p_i = y_i \in \{0, 1\}$, we get exactly the same objective function as in [10] (Eq. 8), that is, the maximum likelihood of the data under the i.i.d. sample assumption, equivalently the minimization of the negative log-likelihood:

$$-\log L(\mathbf{w}) = -\sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]. \qquad (5)$$
Moreover, comparing the two objective functions allows recalculating the gradient and Hessian of the KL objective very easily, by observing that

$$\mathrm{KL}(p \| \hat{p}) = -\sum_{i=1}^{N} \left[ p_i \log \hat{p}_i + (1 - p_i) \log(1 - \hat{p}_i) \right] + \mathrm{const}(p), \qquad (6)$$

i.e., the KL objective coincides, up to a constant independent of $\mathbf{w}$, with the negative log-likelihood (5) with the hard labels $y_i$ replaced by the soft priors $p_i$.
Below we present the gradient and Hessian expressions as they appear in paper [10], in a slightly more compact way, writing $\sigma_{ij} = \sigma(\mathbf{w}^T x_{ij})$ and $s_i = \sum_{j=1}^{K_i} \sigma_{ij}\, x_{ij}$:

$$g = \sum_{i=1}^{N} \frac{\hat{p}_i - y_i}{\hat{p}_i}\, s_i, \qquad (7)$$

$$H = \sum_{i=1}^{N} \left[ \frac{y_i (1 - \hat{p}_i)}{\hat{p}_i^2}\, s_i s_i^T + \frac{\hat{p}_i - y_i}{\hat{p}_i} \sum_{j=1}^{K_i} \sigma_{ij} (1 - \sigma_{ij})\, x_{ij} x_{ij}^T \right]. \qquad (8)$$
Since the gradient and Hessian are linear operators and the objective is a sum of per-bag terms, it is easily seen that the gradient and Hessian of the soft objective are obtained from (7) and (8) by substituting $y_i \to p_i$.
Noticing that $\partial \hat{p}_i / \partial \mathbf{w} = (1 - \hat{p}_i)\, s_i$ and $\partial \sigma_{ij} / \partial \mathbf{w} = \sigma_{ij}(1 - \sigma_{ij})\, x_{ij}$, with $s_i = \sum_{j=1}^{K_i} \sigma_{ij}\, x_{ij}$, we can simplify the gradient per bag to

$$g_i = \frac{\hat{p}_i - p_i}{\hat{p}_i}\, s_i \qquad (9)$$

and the Hessian per bag to

$$H_i = \frac{p_i (1 - \hat{p}_i)}{\hat{p}_i^2}\, s_i s_i^T + \frac{\hat{p}_i - p_i}{\hat{p}_i} \sum_{j=1}^{K_i} \sigma_{ij} (1 - \sigma_{ij})\, x_{ij} x_{ij}^T. \qquad (10)$$
In the code implementation the following expressions are useful. We note also that $\sigma_{ij} = \sigma(\mathbf{w}^T x_{ij})$ is the probability of the instance $x_{ij}$ (with a linear index independent of the bag) to be positive. This enables us to simplify some other entities, and finally:
(11) 
In reality we deal with an imbalanced problem with mixed hard and soft labels, where the number of hard negative labels is large. While the same expressions can be used for hard labels as for soft ones, in implementation it is faster to treat the hard negative case $p_i = 0$ separately, where the gradient and Hessian per bag reduce to

$$g_i = \sum_{j=1}^{K_i} \sigma_{ij}\, x_{ij}, \qquad H_i = \sum_{j=1}^{K_i} \sigma_{ij} (1 - \sigma_{ij})\, x_{ij} x_{ij}^T, \qquad (12)$$

where $x_{ij}$ is a single instance in the $i$-th bag.
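The simplified per-bag gradient (9) can be verified numerically; the sketch below (our notation, not the authors' code) compares the analytic expression against a central finite difference of the per-bag KL term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_kl(w, X, p, eps=1e-12):
    """Per-bag KL term KL(p || p_hat) with p_hat = 1 - prod_j (1 - sigma_ij)."""
    p_hat = np.clip(1.0 - np.prod(1.0 - sigmoid(X @ w)), eps, 1.0 - eps)
    p = np.clip(p, eps, 1.0 - eps)
    return p * np.log(p / p_hat) + (1.0 - p) * np.log((1.0 - p) / (1.0 - p_hat))

def bag_kl_grad(w, X, p):
    """Analytic per-bag gradient: ((p_hat - p)/p_hat) * sum_j sigma_ij x_ij."""
    s = sigmoid(X @ w)
    p_hat = 1.0 - np.prod(1.0 - s)
    return (p_hat - p) / p_hat * (s @ X)

def numeric_grad(f, w, h=1e-6):
    """Central finite-difference gradient, for checking the analytic one."""
    g = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w)
        e[k] = h
        g[k] = (f(w + e) - f(w - e)) / (2.0 * h)
    return g
```

The hard-negative case is covered by setting p = 0, where the analytic gradient reduces to the sum of sigma_ij x_ij, as in (12).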
7.1 Different number of annotators
One way to estimate the a priori probability of a bag is the proportion of readers (annotators) who marked the bag to their overall number. In practice, the number of readers may differ between images; this is not reflected in our formulation. Though we could artificially set the prior probabilities to depend on the number of readers (by treating the probabilities as a measure of our belief that the bag is positive, rather than the probability of the bag being positive), a more natural treatment is described below.
Let us assume that the same image is read by $n$ readers and $k$ of them marked the same object (bag) as positive. The estimated a priori probability of the bag being positive is $k/n$. Let us divide the readers evenly into two independent annotator groups of size $n/2$, in each of which $k/2$ marked the object as positive. The estimated probability of the bag being positive in each group remains $k/n$. Assume the groups are independent and consider their work independently (like having two independent images, each marked by one group; here we depart from the i.i.d. data assumption, but this assumption rarely holds in practice anyway, though i.i.d. models lead to good results). In this case, we can think of the $i$-th bag term in the optimization function as being repeated twice in Eq. 1. In comparison, if only $n/2$ annotators participate in the marking process, the same term appears only once in Eq. 1.
We generalize this observation to modify the optimization function to take into account a different number of annotators per image:

$$\min_{\mathbf{w}} \; \sum_{i=1}^{N} \frac{n_i}{n_{\max}}\, \mathrm{KL}(p_i \| \hat{p}_i) + \lambda \|\mathbf{w}\|_1, \qquad (13)$$

where $n_i$ is the number of annotators for bag $i$ and $n_{\max}$ is the maximal number of annotators. The gradient and Hessian expressions are modified accordingly, by scaling each per-bag term:

$$g_i \to \frac{n_i}{n_{\max}}\, g_i, \qquad H_i \to \frac{n_i}{n_{\max}}\, H_i. \qquad (14)$$
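The annotator-count weighting of Eq. (13) amounts to one scaling per bag, as in this small helper (names ours):

```python
def weighted_objective(kl_terms, n_annotators, n_max, lam, l1_norm):
    """Sum of per-bag KL terms, each weighted by n_i / n_max, plus the
    L1 penalty lam * ||w||_1 (passed in as its precomputed value)."""
    kl = sum(t * n / n_max for t, n in zip(kl_terms, n_annotators))
    return kl + lam * l1_norm
```

Bags read by the full panel of annotators keep their weight, while bags from sparsely read images contribute proportionally less, matching the group-splitting argument above.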
Acknowledgments
The authors would like to thank David Leib for his advice and help in the preparation of the paper.
References
 [1] B. Carpenter, “A multilevel Bayesian model of categorical data annotation,” Technical report, 2008. [Online]. Available: http://lingpipeblog.com/2008/11/17/whitepapermultilevelbayesianmodelsofcategoricaldataannotation
 [2] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy, “Supervised learning from multiple experts: Whom to trust when everyone lies a bit,” in Proceedings of the 26th International Conference on Machine Learning, L. Bottou and M. Littman, Eds. Montreal: Omnipress, June 2009, pp. 889–896.
 [3] Y. Yan, R. Rosales, G. Fung, and J. Dy, “Modeling multiple annotator expertise in the semi-supervised learning scenario,” in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI-10). Corvallis, Oregon: AUAI Press, 2010, pp. 674–682.
 [4] P. Welinder, S. Branson, S. Belongie, and P. Perona, “The multidimensional wisdom of crowds,” in Conference on Neural Information Processing Systems (NIPS) 2010, 2010.
 [5] M. Schmidt, G. Fung, and R. Rosales, “Fast optimization methods for L1 regularization: A comparative study and two new approaches,” in Proceedings of the 18th European Conference on Machine Learning, ser. ECML ’07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 286–297. [Online]. Available: http://pages.cs.wisc.edu/~gfung/GeneralL1
 [6] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, pp. 861–874, 2006.
 [7] D. Wei, H. Chan, N. Petrick, B. Sahiner, M. Helvie, D. Adler, and M. Goodsitt, “False-positive reduction technique for detection of masses on digital mammograms: global and local multiresolution texture analysis,” Medical Physics, vol. 24(6), pp. 903–914, 1997.
 [8] Haralick and Shanmugam, “Textural features for image classification,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-3, pp. 610–621, 1973.
 [9] M. P. Sampat, A. C. Bovik, M. K. Markey, G. J. Whitman, and T. W. Stephens, “Toroidal gaussian filters for detection and extraction of properties of spiculated masses.” IEEE International Conference on Acoustics, Speech, and Signal Processing., pp. 610–621, 2006.
 [10] V. Raykar, B. Krishnapuram, M. Dundar, J. Bi, and R. B. Rao, “Bayesian multiple instance learning: automatic feature selection and inductive transfer,” in International Conference on Machine Learning  ICML, Helsinki, Finland, July 2008, pp. 1–8. [Online]. Available: http://icml2008.cs.helsinki.fi/papers/587.pdf