Breast Cancer Diagnosis via Classification Algorithms
In this paper, we analyze the Wisconsin Diagnostic Breast Cancer Data using Machine Learning classification techniques, such as the SVM, Bayesian Logistic Regression (Variational Approximation), and K-Nearest-Neighbors. We describe each model, and compare their performance through different measures. We conclude that SVM has the best performance among all other classifiers, while it competes closely with the Bayesian Logistic Regression that is ranked second best method for this dataset.
Breast Cancer Diagnosis via Classification Algorithms
Reihaneh Entezari Department of Statistical Sciences University of Toronto firstname.lastname@example.org 2013
Breast Cancer is the most common cancer among women, all around the world. There are two types of breast mass, ”Benign” and ”Malignant” , where benign is the non-cancerous type and malignant is the cancerous one. According to the World Health Organization, about 502,000 deaths (per year) worldwide are caused by breast cancer. The most effective way to reduce breast cancer deaths is to detect it earlier. Mammography and a Fine Needle Aspiration biopsy (FNA) are the most common diagnostic tools, which may be time consuming and cost-effective. Thus the automatic diagnosis of breast cancer becomes important, while attracting many researchers from all over the world to find an accurate and reliable diagnosis (classification) method. In Machine Learning content, there are several different methods for classification, such as SVM (Support Vector Machines) which is one of the most popular ones, Bayesian Logistic Regression (which we will use Variational Approximation here), and K-Nearest-Neighbors, etc.
SVM attempts to find the best hyperplane (by maximizing the margin) that separates the data into the two classes benign and malignant, and thus is a powerful method since it is flexible with the selection of kernel functions that map the data into higher dimension so that they are separable. Bayesian logistic regression is also another popular classification method where a prior is defined for the parameters of the model to find the posterior. After finding the posterior, we can obtain the predictive distribution and use it to predict for new test cases by defining a decision boundary. K-Nearest-Neighbors is one of the simplest classification methods, also called as the lazy learning where an object is assigned to the class most common amongst its k nearest neighbours (k being a positive integer).
Previous work on breast cancer diagnosis using classifier techniques, have shown that usually SVM has a higher accuracy performance compared to other classifiers, although in a few, Logistic Regression was also at a high level of accuracy (depending on the data set). On the other hand, although K-NN may not be the best classifier in general, but as we will see, it shows a desirable performance for this breast cancer diagnosis problem. In this paper, we have collected the Wisconsin Diagnostic Breast Cancer Data (WDBC) from the UCI Machine Learning Repository to apply the methods described above. We will see that our results are in-line with previous conclusions, although a Bayesian approach to Logistic Regression will significantly increase the performance compared to the Logistic Regression.
2 Wisconsin Diagnostic Breast Cancer Data (WDBC)
The Wisconsin Diagnostic Breast Cancer Data consists of 569 instances with 32 attributes such as ID, diagnosis statement (B=benign, M=malignant), and 30 real-valued features that were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Ten real-valued features were computed for each cell nucleus:
1) radius, 2) texture, 3) perimeter, 4) area, 5) smoothness, 6) compactness, 7) concavity,
8) concave points, 9) symmetry, 10) fractal dimension.
The mean, standard error, and ”worst” or largest ( mean of the three largest values ) of these features were computed for each image, resulting 30 features. There are 357 benign and 212 malignant, which we will randomly split the 569 data into 80% training set and 20% test set to check performance.
Here we will briefly describe each classification method:
3.1 SVM (Support Vector Machines)
Given some data where is indicating the class of feature vector , the goal is to find a hyperplane that separates the two classes by maximizing the margin, which is the minimum distance from the data points to the hyperplane. In general, a hyperplane is defined as where is the normal vector to the hyperplane and is the inner product of and . The parameter specifies the distance of the hyperplane from the origin along the normal vector . So, we have to find the best and such that the margin is maximized. Hence if the training data are linearly separable, the optimizing problem will be:
Where the solution to will be .
There is a modified maximum margin optimization that allows misclassification of training points, when there exists no hyperplane that can completely separate the training set. It is called the Soft Margin method, which will choose a hyperplane that separates the data as cleanly as possible, while maximizing the distance to the nearest cleanly separated points. In this method, new parameters, , are added to measure the degree of misclassification for each data , by :
The new optimization will become a trade off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem changes to:
Where the constant C controls the trade off between complexity of decision rule and frequency of error. By using Lagrange Multipliers , the dual form of the previous optimization problem will be:
Where is called the kernel function (here it’s linear). In general, kernel functions are defined as , where is a function that maps the data to a higher dimensional space called the feature space, to make a better look of the data. Function doesn’t have to be known since just their inner product is important in the optimization problem. Below are some possible kernel functions:
Gaussian radial basis function: where ,
we can also assume
Hyperbolic tangent: for some (not every) and
After finding the optimum and , we can classify a new test case by which is equivalent to .
3.2 Bayesian Logistic Regression
The model known as Logistic Regression, which is a model for classification, is as below:
with , where is the logistic sigmoid function and is the feature space. Then, the likelihood function for the data , where and , will be:
Where . To predict for new test cases, we need to estimate the parameter , which is a dimensional vector, according to features for each training data. One way is to estimate it directly, using any iterative optimization scheme, such as Newton-Raphson, Gradient Descent, etc. Another way is to look at the bayesian approach where a prior for is chosen to obtain the posterior. Usually Gaussian priors are considered, and since , computing the posterior would be intractable (logistic sigmoid and a Gaussian). On the other hand, the posterior can be approximated using various methods, such as Laplace approximation, MCMC, etc. Here we will approximate the posterior using the variational approximation. We will assume the prior is Gaussian, .
In the variational framework, we wish to maximize a lower bound on the marginal likelihood:
Before we proceed to the next step, note that we also have the following property for each :
Where . Now to find a lower bound on , from (6) and (7), we need to first use the variational lower bound for the logistic sigmoid function , hence:
where and are the variational parameters for each data point. Now by (6) and (8), we have :
Now, by taking the logarithm of both sides of (9), we will see that the right hand side will be a quadratic function of . Hence the variational approximation to the posterior distribution of will have a Gaussian form as below:
On the other hand, we need to find the variational parameters by maximizing the lower bound on the marginal likelihood given in (6). One way to approach this, is to use the EM algorithm, followed by:
Initialize all s
E-Step: use these parameters to find the posterior distribution over w, as given in (10), (11)
M-Step: maximize the expected complete-data log likelihood given by . Which is equivalent to solving:
Up to now, we have assumed that hyperparameters (in prior) are known. Hyperparameters can also be unknown, so that they can be learned from the data. Hence, we will need additional priors for the hyperparameters. Assume the prior on depends on another unknown parameter . Thus, we can consider the prior for to be Gaussian again, , and the prior for to be gamma, . Now the marginal likelihood will be:
Since this is analytically intractable, we will consider the variational distribution and the decomposition :
Since is still intractable, we will use the local variational bound on the logistic sigmoid function (just as before). Hence we will have:
Next, we assume that the variational distribution factorizes between parameters and hyperparameters: . Now we will find each factor as below :
Omitting the algebra, will follow and will follow , with updated equations :
where is the dimension of feature space and . We should note that the updated equations for is the same as before.
On the other hand, the variational lower bound would be:
3.2.1 Predictive distribution
Given a new test case (in its feature space) , we can compute the predictive distribution for class by marginalizing with respect to the posterior distribution for which is approximated with the Gaussian :
with . We have computed the above integral using the Monte Carlo technique, by sampling from the Gaussian distribution .
As mentioned before, in this method a test point is classified by assigning the label which is most frequent among the (a user-defined constant) training points nearest to that test point. Euclidean distance is one of the options to use as the measure of closeness (distance metric), which we will apply here.
4.1 Error Rates
After randomly dividing the data into 80% training set and 20% test set (455 Training data, 114 Test data), we have calculated the training error rate and the test error rate for each method. We should note that for SVM, the error rates for each kernel type was resulted from the best parameters chosen from a finite set of different values, by using 10-fold cross-validation. Table 1 below are showing these error rates and the number of misclassified points.
|Methods||Testing Error Rates||Training Error Rates|
|SVM (Gaussian Radial Basis)||0.01754386 (2 miss)||0.01318681 (6 miss)|
|SVM (Polynomial)||0.02631579 (3 miss)||0.01758242 (8 miss)|
|SVM (Hyperbolic Tangent)||0.07894737 (9 miss)||0.08791209 (40 miss)|
|K-NN (k=1)||0.1052632 (12 miss)||0.07252747 (33 miss)|
|K-NN (k=3)||0.0877193 (10 miss)||0.07252747 (33 miss)|
|K-NN (k=10)||0.0877193 (10 miss)||0.06153846 (28 miss)|
|Logistic Regression||0.09649123 (11 miss)||0.07692308 (35 miss)|
|Bayesian Logistic Regression (Variational Appr.)||0.06140351 (7 miss)||0.03296703 (15 miss)|
As it is seen from the test error rates and training error rates, SVM with a Gaussian Radial Basis kernel has the smallest rate (both test and training) among the other SVMs, and K-NN with k=10 has also the smallest error rate among the other K-NNs. So to compare the performance of these methods (after selecting the best from each category, explained earlier), we see that their error rates rank from smallest to largest by: 1) SVM(Gaussian) 2) Bayesian Logistic Regression 3) K-NN(k=10) 4) Logistic Regression. To have a more accurate result, in the next section, we will look at the ROC curves and AUC to compare these 4 methods ( after selection ).
4.2 ROC curves and AUC
In classification, specifically diagnosis of cancer, it’s important that what class the test cases are assigned to. There are four possible cases: 1) True Positive[TP] ( If new data is assigned to class positive, and it actually belongs to that class) 2) False Positive[FP] ( If it is assigned to positive class, but is actually in the negative class, in our case a person is diagnosed as cancerous, while being healthy) 3) True Negative[TN] ( when it is assigned to class negative and it actually belongs to that class) 4) False Negative[FN] ( It is assigned to class negative while it belongs to the positive class, in our case the person is predicted to be healthy, but actually has cancer, which is completely dangerous). In the previous section, we just considered all the false positive and false negative together, and computed the error. To conclude more accurate results, we have to consider all these cases, where in specific, ROC curves are used for showing the ”True positive Positive rate(TPR)” vs ”False Positive rate(FPR)” [ and ]. ROC curves represent relative tradeoffs between benefits (TP) and costs (FP). As it is seen in Figure 1, the distance of the ROC curves of SVM, Bayesian Logistic Regression, K-NN and Logistic Regression from the diagonal line are ordered from largest to smallest, respectively, showing performance (from highest to least). So again the same order of performance is resulted here with the ROC curves. Another way to measure the performance is to look at the AUC values (which are the area under the ROC curves) and also the Accuracy measurement. They are given in Table 2.
|Bayesian Logistic Regression||0.9934901||94%|
Again as it is seen from Table 2, the AUCs and accuracies are ordered from largest to smallest with the same orders of methods. The accuracy vs cut-off graphs for each method are also shown in Figure 2. Since our decision boundary was probability 0.5 ( or the sign function in SVM), we see that all methods (except Logistic Regression) have their highest accuracy (Figure 2).
Finally, to summarize the results, we started by looking at test error rates, ROC curves and at last AUC and accuracy values for each method. From the comparison, we see that SVM has the least error rate (highest Accuracy), as well as the highest AUC value, so it has the highest performance among the others (for this data set). On the other hand, Bayesian Logistic Regression is ranked second best, which has improved the performance of usual Logistic Regression. And at last, K-NN and Logistic Regression were ranked third and fourth respectively. But in general, they were all good classifiers since they all had high accuracy rates. The study of feature selection can be considered as a future direction for this dataset.
 Christopher M. Bishop, Pattern Recognition and Machine Learning, pp. 497-505 , pp. 466. Cambridge.
 Katie Planey, Machine Learning Approaches to Breast Cancer Diagnosis and Treatment Response Prediction. Stanford Biomedical Informatics.
 Gouda I. Salama, M.B.Abdelhalim, and Magdy Abd-elghany Zeid (2012), Breast Cancer Diagnosis on Three Different Data Sets Using Multi-Classfiers. Computer Science, Arab Academy for Science Technology & Maritime Transport, Cairo, Egypt.
 Y.Ireaneus Anna Rejani, Dr.S.Thamarai Selvi , Early Detection Of Breast Cancer Using SVM Classifier Technique. Noorul Islam College of Engineering, Tamilnadu, India.
 Lukasz Jelen , Thomas Fevens, Adam Krzyz Ak , Classification Of Breast Cancer Malignancy Using Cytological Images Of Fine Needle Aspiration Biopsies . Department of Computer Science and Software Engineering Concordia University, Montreal, Quebec, Canada.
 Manish Sarkar and Tze-Yun Leong, Application of K-Nearest-Neighbors Algorithm on Breast Cancer Diagnosis problem. Department of Computer Science, School of Computing, The National university of Singapore.
 Tommi S. Jaakkola and Michael I. Jordan , Bayesian Parameter Estimation Via Variational Methods. Dept. of Elec. Eng. & Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
 Tom Fawcett , An Introduction To ROC Analysis. Institute for the Study of Learning and Expertise, 2164 Staunton Court, Palo Alto, CA 94306, USA.