PF: Oversampling via Optimum-Path Forest for Breast Cancer Detection
Breast cancer is among the deadliest diseases, affecting mostly women worldwide. Although traditional detection methods have proven valid for the task, they still commonly present low accuracy and demand considerable time and effort from professionals. Therefore, a computer-aided diagnosis (CAD) system capable of providing early detection is highly desirable. In the last decade, machine learning-based techniques have been of paramount importance in this context, since they are capable of extracting essential information from data and reasoning about it. However, such approaches still suffer from imbalanced data, especially in medical applications, where the number of samples from healthy people is, in general, considerably higher than the number of samples from patients. Therefore, this paper proposes the PF, a data oversampling method based on the unsupervised Optimum-Path Forest algorithm. Experiments conducted over the full oversampling scenario show the robustness of the model, which is compared against three well-established oversampling methods on three breast cancer datasets and three general-purpose medical datasets.
Computer-Aided Diagnosis (CAD) systems aim at helping physicians to quickly report better diagnoses to patients, thus representing an essential step towards the accurate diagnosis of dangerous diseases. Besides, intelligent CAD systems have successfully employed machine learning techniques, a promising subfield of artificial intelligence, to tackle complicated problems that demand knowledge and reasoning about the subject. Regarding the latter, several works were developed in the last few years to aid in detecting atherosclerosis, Parkinson’s disease, and breast cancer, to name a few.
Breast cancer is a dangerous illness, affecting millions of women, as well as few men, around the world. According to the World Health Organization, such a disease was responsible for the death of approximately women in worldwide . Therefore, early diagnosis is crucial to effectively prevent its progress, consequently making it possible to elaborate more efficient treatment plans. In general, standard breast cancer diagnosis consists of a visual analysis performed by professionals aiming to identify possible abnormalities (e.g., nodules), which may indicate cancer risk signs. Once identified, it is possible to extract relevant measures from such nodules, assisting physicians to judge the presence or absence of cancerous tissue.
Many image processing-based algorithms have also been proposed to help in such tasks [29, 27, 15], as well as machine learning techniques, which have been commonly employed for both nodule segmentation and classification. Agarap, for instance, studied the performance of several machine learning techniques to label a nodule as benign or malignant. Further, Passos et al. developed a neural network to tell benign from malignant nodules, and to label the latter according to the most likely cancer type.
Moreover, since nodules can be seen as abnormalities, their identification can also be tackled in an unsupervised fashion. In this direction, Ribeiro et al. used the unsupervised Optimum-Path Forest (OPF) algorithm to distinguish malignant nodules from benign ones, obtaining significant results. The unsupervised OPF algorithm possesses properties that have been employed to improve other machine learning techniques, such as Brain Storm Optimization, as well as to create new techniques, such as the OPF-based approach for anomaly detection and the Fuzzy OPF classifier.
Although the above-mentioned techniques obtained promising results in the context of breast cancer detection, most classification algorithms suffer from imbalanced datasets, a problem that arises when the number of samples differs significantly among classes. In this scenario, the trained classifier is more likely to label a new sample as belonging to the most common (majority) class, degrading its performance on the smallest (minority) class. Moreover, the situation may worsen when the dataset presents outliers. Numerous studies have been proposed to cope with such a problem in medical datasets [31, 21, 22]. In particular, the generation of synthetic samples for the minority class, usually referred to as oversampling, is recognized as a prominent way of rebalancing the dataset for classifier training, consequently improving the classifier's ability to label minority class samples correctly. In this context, several powerful oversampling strategies have been proposed in the literature [4, 12, 13]. Most of them, however, still have difficulty enforcing diversity among the new synthetic samples, a problem worth addressing since diversity can improve the classifier's generalization.
To this end, this work proposes an approach to perform oversampling via the Optimum-Path Forest, hereinafter named PF. The method employs the unsupervised OPF algorithm to group the minority class samples into clusters that capture features intrinsic to that class. Further, new training examples are generated by sampling from a Gaussian distribution parametrized by the cluster characteristics. Therefore, the main contributions of this paper are twofold: (i) extending the OPF algorithm capabilities by introducing a novel oversampling mechanism that enforces synthetic intra-class variability; and (ii) studying how PF can benefit the development of CAD systems by extensively evaluating it on five tumor-classification tasks and one retinopathy identification task. Finally, the fully balanced datasets are employed to train the OPF classifier, and the results, i.e., the accuracy, recall, and F1-measure, are compared against the ones obtained with the standard training dataset without oversampling. Additionally, results are compared against three baselines: SMOTE, Borderline SMOTE, and ADASYN.
In the remainder of this paper, Section II presents the theoretical background regarding the unsupervised and supervised OPF algorithm variants, whereas Section III introduces the proposed approach. Further, Section IV outlines the experimental setup and Section V discusses experimental results. Finally, Section VI presents conclusions and future works.
This section presents the theoretical background regarding the unsupervised and supervised variants of the OPF algorithm.
II-A Unsupervised Optimum-Path Forest
Let $\mathcal{D} = \{x_1, x_2, \ldots, x_m\}$ be a dataset such that $x_i \in \mathbb{R}^n$ represents the features extracted from the $i$-th sample. Further, let $\mathcal{G} = (\mathcal{D}, \mathcal{A}_k)$ be a graph where each node corresponds to a different feature vector connected to its $k$-nearest neighbors, as defined in the adjacency relationship set $\mathcal{A}_k$.
The unsupervised OPF algorithm consists in partitioning the graph through a competitive process, in which a few samples are marked as “prototypes” and compete among themselves to conquer the remaining nodes. Such a procedure partitions the graph $\mathcal{G}$ into optimum-path trees (OPTs), each rooted at a prototype and corresponding to a cluster, where a sample is more similar to the elements of its own tree than to those of any other tree. Overall, the process can be divided into three steps: (i) computing a proper neighborhood size $k^*$ and the adjacency relationship $\mathcal{A}_{k^*}$, (ii) electing the prototype nodes, and (iii) performing the competition process to partition the dataset into OPTs.
Regarding the first step, different approaches may be considered. Among others, Rocha et al. proposed to find $k^*$ by minimizing the normalized graph cut function, as it takes into account the dissimilarities between clusters as well as the similarity degree among samples of each cluster.
Concerning the second step, the algorithm must select the prototypes that form the root of each OPT (which will ultimately form the clusters), rule the competition process, and conquer the remaining samples in the graph. The supervised OPF proposed by Papa et al. [18, 19] selects as prototypes the nearest samples from different classes, found by computing the graph Minimum Spanning Tree (MST). However, in the unsupervised variant, since labels are usually unavailable, Rocha et al. proposed to select as prototypes the samples located in the center of each cluster. To this end, every sample $s$ is assigned a density score $\rho(s)$, computed through a Gaussian probability density function (pdf), defined as follows:

$$\rho(s) = \frac{1}{\sqrt{2\pi\sigma^2}\,|\mathcal{A}_{k^*}(s)|} \sum_{\forall t \in \mathcal{A}_{k^*}(s)} \exp\left(\frac{-d^2(s,t)}{2\sigma^2}\right), \tag{1}$$

where $\mathcal{A}_{k^*}(s)$ is the set of $k^*$-nearest neighbors of $s$, $d(s,t)$ is the distance between samples $s$ and $t$, $\sigma = d_f/3$, and $d_f$ stands for the maximum arc-weight in $\mathcal{G}$. This formulation considers all adjacent nodes for density computations, as the Gaussian distribution covers $99.7\%$ of the samples with distance $d(s,t) \in [0, 3\sigma]$.
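The density computation can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' or Opfython's implementation: the function name `knn_density` is hypothetical, distances are Euclidean, duplicate points are assumed absent, and $\sigma$ is taken as one third of the largest arc weight in the $k$-NN graph, as in the usual OPF clustering formulation.

```python
import numpy as np

def knn_density(X, k):
    """Gaussian k-NN density of every row of X (illustrative sketch of Equation 1)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (m, m).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # k nearest neighbours of each sample; column 0 is the sample itself
    # (assumes no duplicate points), so it is skipped.
    neighbours = np.argsort(dist, axis=1)[:, 1:k + 1]
    knn_dist = np.take_along_axis(dist, neighbours, axis=1)
    # sigma = d_f / 3, with d_f the largest arc weight in the k-NN graph.
    sigma = knn_dist.max() / 3.0
    weights = np.exp(-knn_dist ** 2 / (2.0 * sigma ** 2))
    rho = weights.sum(axis=1) / (np.sqrt(2.0 * np.pi * sigma ** 2) * k)
    return rho, neighbours

# Toy data: three points close together and one isolated point.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
rho, neighbours = knn_density(X_toy, k=2)
```

In this toy example the three nearby points receive a much higher density than the isolated one, which is what makes them candidates for cluster roots.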
After evaluating Equation 1 for each node in the graph, the density values are used to populate a priority queue in a way that the unsupervised OPF maximizes the cost of each sample, thus partitioning the graph into OPTs. Such a cost is defined in terms of paths on $\mathcal{G}$, where a path is an acyclic sequence of adjacent samples in $\mathcal{G}$.
Let $\pi_t$ be a path with terminus at sample $t$ and starting from some root $R(t) \in \mathcal{R}$, the latter being the set of all prototype samples. Further, let $\pi_t = \langle t \rangle$ be a trivial path (i.e., a path containing only one sample), whereas $\pi_s \cdot \langle s,t \rangle$ denotes the concatenation of a path $\pi_s$ and the arc $(s,t)$ such that $s \neq t$.
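In the third step, the algorithm assigns to each path $\pi_t$ a value $f(\pi_t)$ given by a smooth connectivity function $f$, which must satisfy some constraints to ensure the theoretical correctness of the algorithm [9, 5]. A path $\pi_t$ is considered optimal if $f(\pi_t) \geq f(\tau_t)$ for any other path $\tau_t$. Among the path-cost functions proposed in the literature, the unsupervised OPF relies on the following formulation:

$$f(\langle t \rangle) = \begin{cases} \rho(t) & \text{if } t \in \mathcal{R}, \\ \rho(t) - \delta & \text{otherwise,} \end{cases} \qquad f(\pi_s \cdot \langle s,t \rangle) = \min\{f(\pi_s), \rho(t)\}, \tag{2}$$

where $\delta = \min_{\forall (s,t) \in \mathcal{A}_{k^*} \mid \rho(t) \neq \rho(s)} |\rho(t) - \rho(s)|$ is the smallest quantity that avoids plateaus and over-segmentation in regions near prototypes (i.e., areas with the highest density).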
Among all possible paths that originate in some local maximum, i.e., some prototype, the OPF algorithm assigns to each sample $t$ a final path whose minimum density value along it is maximum. Such a final path value is represented by a cost map $V$, as follows:

$$V(t) = \max_{\forall \pi_t \in (\mathcal{D}, \mathcal{A}_{k^*})} \{f(\pi_t)\}. \tag{3}$$

The OPF algorithm maximizes $V(t)$ by computing an optimum-path forest, processing the samples in descending order of cost. The forest is encoded as an acyclic predecessor map $P$, which assigns to each sample $t \notin \mathcal{R}$ its predecessor $P(t)$ in the optimum path from $\mathcal{R}$, or a marker $nil$ when $t \in \mathcal{R}$.
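The competition process can be illustrated with a small priority-queue sketch in plain Python. It is a simplification under stated assumptions rather than the reference implementation: the function name `opf_cluster` is hypothetical, the adjacency relation is assumed symmetric and given as neighbor lists, and $\delta$ is replaced by a small fixed constant instead of being computed from the density values.

```python
import heapq

def opf_cluster(rho, adjacency, delta=1e-9):
    """Simplified unsupervised OPF competition over precomputed densities."""
    n = len(rho)
    cost = [r - delta for r in rho]   # initial path value of every sample
    pred = [None] * n                 # predecessor map; None plays the role of nil
    label = [-1] * n
    done = [False] * n
    # heapq is a min-heap, so costs are negated to pop the largest cost first.
    heap = [(-cost[s], s) for s in range(n)]
    heapq.heapify(heap)
    n_clusters = 0
    while heap:
        _, s = heapq.heappop(heap)
        if done[s]:
            continue                  # stale entry left behind by a cost update
        done[s] = True
        if pred[s] is None:
            # Nobody conquered s: it is a density maximum, hence a prototype.
            cost[s] = rho[s]
            label[s] = n_clusters
            n_clusters += 1
        for t in adjacency[s]:
            if done[t]:
                continue
            offered = min(cost[s], rho[t])
            if offered > cost[t]:
                cost[t] = offered     # s conquers t
                pred[t] = s
                label[t] = label[s]
                heapq.heappush(heap, (-offered, t))
    return label, pred, cost

# Two groups of three mutually adjacent samples with made-up densities.
toy_rho = [3.0, 2.0, 1.0, 2.5, 1.5, 0.5]
toy_adj = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
labels, preds, costs = opf_cluster(toy_rho, toy_adj)
```

Note that the number of clusters is not an input: each sample that reaches the front of the queue without a predecessor becomes the root of a new tree.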
It is important to remark that the unsupervised OPF algorithm determines the number of clusters (OPTs) automatically; hence, unlike other clustering algorithms, it does not require such information beforehand. Furthermore, the only hyperparameter that must be set is the upper bound $k_{max}$ of the search interval for the proper neighborhood size $k^*$.
II-B Supervised Optimum-Path Forest
Differently from its unsupervised version, the supervised variant uses a fully-connected graph. Moreover, the closest samples from different classes are marked as prototypes, as aforementioned. Regarding the competition process, instead of using Equation 2, the following smooth function is employed:

$$f_{max}(\langle s \rangle) = \begin{cases} 0 & \text{if } s \in \mathcal{R}, \\ +\infty & \text{otherwise,} \end{cases} \qquad f_{max}(\pi_s \cdot \langle s,t \rangle) = \max\{f_{max}(\pi_s), d(s,t)\}. \tag{4}$$

Further, the following cost map is used to partition the graph:

$$C(t) = \min_{\forall s \in \mathcal{D}} \{\max\{C(s), d(s,t)\}\}. \tag{5}$$

The trivial-path costs of Equation 4 are used to initialize the algorithm before evaluating Equation 5, which is performed for every node in ascending order of cost. After partitioning the graph, each prototype propagates its ground truth label to all samples in its OPT. Afterwards, prediction is performed by solving the same equation for each new sample individually.
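The prediction step lends itself to a short sketch. The snippet below assumes that the training costs and labels have already been produced by the competition process, that distances are Euclidean, and that a new sample is conquered by the training sample offering the smallest value of the maximum between its own cost and the distance to the new sample. The name `opf_predict` and the toy values are illustrative and do not correspond to the Opfython API.

```python
import numpy as np

def opf_predict(X_train, train_cost, train_label, x_new):
    """Label x_new with the class of the training sample that conquers it."""
    # Distance from the new sample to every training sample.
    d = np.linalg.norm(X_train - x_new, axis=1)
    # Cost each training sample would offer: max of its own cost and the arc weight.
    offered = np.maximum(train_cost, d)
    winner = int(np.argmin(offered))
    return int(train_label[winner]), float(offered[winner])

# Toy training set: samples 1 and 2 are the prototypes (cost 0), one per class.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
train_cost = np.array([1.0, 0.0, 0.0, 1.0])
train_label = np.array([0, 0, 1, 1])

pred_label, pred_cost = opf_predict(X_train, train_cost, train_label,
                                    np.array([0.5, 0.2]))
```

Here the new sample lies next to the first class, so the prototype of that class offers the lowest cost and propagates its label.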
III Proposed Approach
Although the unsupervised OPF was conceived for clustering purposes, the resulting clusters possess a set of features that can also be employed to synthesize new samples, thus being suitable for oversampling. This section describes the procedure for a binary classification problem, in which the goal is to oversample the class with the smallest number of samples. Notwithstanding, the same approach can be easily applied to multiclass problems by repeating the procedure individually for each class to be oversampled.
The generation of synthetic samples must address two main issues: (i) creating plausible samples, i.e., synthetic elements with characteristics that are coherent with the selected class; and (ii) introducing sample variability, so that the classifier does not become biased towards a subset of the characteristics present in that class. To tackle such issues, the PF first clusters the minority class samples, which makes it possible to extract common patterns intrinsic to the class, i.e., the samples’ average position and variance. Further, the algorithm assumes that all features from a class follow a normal distribution. Thus, a new sample can be generated by sampling such a distribution from some of the clusters found. Notice that the number of synthetic samples generated by each cluster is proportional to the number of original samples composing it, i.e., we are always doubling the number of samples from the minority class, although the user can set that percentage. A new sample $\hat{x}$ is drawn as follows:

$$\hat{x} \sim \mathcal{N}(\mu_j, \Sigma_j), \tag{6}$$

where $\mu_j$ stands for the distribution mean, defined as the average feature vector from all samples within the $j$-th cluster. Moreover, $\Sigma_j$ is the covariance matrix, computed as follows:

$$\Sigma_j = \frac{1}{m_j - 1} (X_j - \mu_j)^\top (X_j - \mu_j), \tag{7}$$

given that $X_j$ is a matrix formed by concatenating, as rows, all $m_j$ feature vectors of the cluster. Figure 1 illustrates the proposed approach behavior.
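The sampling step can be sketched with NumPy as follows. This is a minimal illustration assuming that cluster labels for the minority class are already available (for instance, from the unsupervised OPF) and that the sample covariance with the $m_j - 1$ normalization is used; the function `oversample_clusters`, its parameters, and the toy data are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def oversample_clusters(X_min, cluster_labels, n_new, seed=0):
    """Draw about n_new synthetic samples, one Gaussian per minority-class cluster."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    n_features = X_min.shape[1]
    synthetic = []
    for c in np.unique(cluster_labels):
        members = X_min[cluster_labels == c]
        # Each cluster contributes proportionally to its size (rounded, so the
        # total may differ slightly from n_new).
        share = int(round(n_new * len(members) / len(X_min)))
        if share == 0:
            continue
        mu = members.mean(axis=0)
        if len(members) > 1:
            cov = np.atleast_2d(np.cov(members, rowvar=False))
        else:
            # A single-sample cluster has no spread; its copies coincide with it.
            cov = np.zeros((n_features, n_features))
        synthetic.append(rng.multivariate_normal(mu, cov, size=share))
    if not synthetic:
        return np.empty((0, n_features))
    return np.vstack(synthetic)

# Toy minority class with two clusters of 6 and 4 samples.
toy_rng = np.random.default_rng(42)
cluster_a = toy_rng.normal(loc=0.0, scale=0.5, size=(6, 2))
cluster_b = toy_rng.normal(loc=10.0, scale=0.5, size=(4, 2))
X_min = np.vstack([cluster_a, cluster_b])
toy_labels = np.array([0] * 6 + [1] * 4)

X_new = oversample_clusters(X_min, toy_labels, n_new=10)
```

Sampling per cluster rather than from a single Gaussian fitted to the whole class keeps the synthetic points near the regions where minority samples actually occur, while the covariance term supplies the intra-class variability.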
This section describes the datasets used in the experiments. Further, the experimental setup is outlined.
Experiments were conducted over two sets of three datasets each. The first set comprises three datasets for breast cancer detection from the Wisconsin Breast Cancer Database, i.e., Prognostic, Diagnostic I, and Diagnostic II. The second set is composed of general-purpose medical datasets. All of them are imbalanced, binary, and were obtained from the UCI repository. A brief description of each one follows below:
Wisconsin Breast Cancer Database (WBCD) Prognostic: Regards predicting a sample as recurrent or non-recurrent type of cancer based on features. There are samples, being () non-recurrent and () recurrent;
WBCD Diagnostic I: Consists of classifying a tumor as benign or malignant based on features as well. There are instances, from which () are benign and () are malignant;
WBCD Diagnostic II: Corresponds to labelling each of the samples as benign or malignant tumor. Each sample comprises features and each class contains () and () examples, respectively;
Diabetic Retinopathy Debrecen (DRD): Regards predicting whether an image contains signs of diabetic retinopathy or not based on variables. The dataset contains samples, from which () are positive and () are negative;
Cervical Cancer (CC): For this task we predict the binary biopsy variable based on features for samples. Differently from other datasets, all variables in this scenario are either integer or binary. Further, the dataset is significantly skewed, with positive and () negative samples;
Mammographic Mass (MM): Concerns predicting if a mammographic mass is benign or malignant based on six features. The dataset contains () benign and () malignant samples, comprising examples.
IV-B Experimental setup
The experiments conducted in this paper considered pre-processing the data such that missing features were replaced by their corresponding mean in the training partition. All the features were normalized to have zero mean and unitary standard deviation. Further, the datasets were randomly divided into training, validation and testing sets, each containing , , and of the data, respectively
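The pre-processing described above can be sketched as follows. The snippet assumes that missing values are encoded as NaN and that both the imputation means and the normalization statistics are estimated on the training partition only; the latter is an assumption, since the text states this explicitly only for the imputation. The function names are hypothetical and the partitions are taken as given.

```python
import numpy as np

def fit_preprocessor(X_train):
    """Estimate per-feature mean and standard deviation on the training partition."""
    X_train = np.asarray(X_train, dtype=float)
    mean = np.nanmean(X_train, axis=0)
    # Replacing missing values by the mean leaves the mean itself unchanged.
    filled = np.where(np.isnan(X_train), mean, X_train)
    std = filled.std(axis=0)
    std[std == 0.0] = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_preprocessor(X, mean, std):
    """Impute missing values with the training mean, then standardize."""
    X = np.asarray(X, dtype=float)
    filled = np.where(np.isnan(X), mean, X)
    return (filled - mean) / std

# Toy partitions with one missing value each.
X_tr = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 8.0]])
X_va = np.array([[np.nan, 6.0]])

mean, std = fit_preprocessor(X_tr)
Z_tr = apply_preprocessor(X_tr, mean, std)
Z_va = apply_preprocessor(X_va, mean, std)
```

Estimating the statistics on the training partition and reusing them for validation and testing avoids leaking information from held-out data into the model.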
The validation set was employed to fine-tune the hyperparameters of the oversampling methods, i.e., finding the $k_{max}$ for the unsupervised OPF and the $k$ for the other methods that maximize the minority class recall. Afterward, the augmented dataset is used to train the OPF classifier, and the results are computed over the testing set. Notice that such a procedure is repeated times for statistical analysis, and the results are compared through the Wilcoxon signed-rank test with significance concerning recall values. Figure 2 depicts such a pipeline. Implementation-wise, we rely on the supervised and unsupervised implementations provided by Opfython. Additionally, the source code was implemented in Python and is available on GitHub.
V Experimental Results
This section is divided into four parts: (i) datasets augmented using the PF are compared against their standard versions; (ii) the proposed approach is compared against three oversampling baselines on three distinct versions of the Wisconsin Breast Cancer Database, i.e., Prognostic, Diagnostic I, and Diagnostic II, for the task of breast cancer detection; (iii) a similar experiment is conducted over three general-purpose medical datasets; and (iv) a brief discussion is provided concerning the optimization of the proposed method's hyperparameter, i.e., $k_{max}$.
V-A PF Data Augmentation Versus Standard Datasets
This section presents the results obtained by the Optimum-Path Forest classifiers considering the datasets balanced through PF oversampling, i.e., minority classes are augmented such that both classes present a similar number of samples, against the standard version of the datasets. Table I presents the recall considering each dataset. Values in bold denote the best results according to the Wilcoxon signed-rank test with of significance.
[Table I: recall per dataset version; columns: Prognostic, Diagnostic I, Diagnostic II, DRD, CC, MM]
In this context, one can observe PF obtained the best results in five out of six datasets according to the Wilcoxon signed-rank test, obtaining the best results alone in three of them.
V-B Results concerning the Breast Cancer datasets
Table II presents the average recall, accuracy, and F1-measure, together with their standard deviations, as well as the best $k_{max}$ for PF and the best $k$ for the other techniques. The results comprise a minority oversampling, thus providing balanced datasets. The proposed approach is compared against three baseline techniques, i.e., SMOTE, Borderline SMOTE, and ADASYN.
Results over the WBCD Prognostic dataset show that all techniques obtained statistically similar recall. Regarding the accuracy, the proposed approach obtained the highest average value, together with Borderline SMOTE. Such a result suggests that samples generated by PF fit the class distribution better, being less prone to false negatives. A similar behavior is observed over the Diagnostic I dataset. Concerning the Diagnostic II dataset, the proposed approach obtained the best results, together with ADASYN and SMOTE, outperforming Borderline SMOTE. Considering the accuracy, PF obtained the highest averages.
Despite the good performance achieved by the oversampling approach using PF, the baseline algorithms were statistically similar to it in most cases. Sparse clusters with low density may be responsible for introducing new samples that are distant from the original distribution of the minority class, a situation likely to produce outliers in the resampled dataset. Such behavior may also contribute to the moderate recall observed in Table II.
V-C General Purpose Medical Datasets Results
Table III presents the results obtained over the three general purpose medical datasets, i.e., Diabetic Retinopathy Debrecen, Cervical Cancer and Mammographic Mass.
Considering Diabetic Retinopathy Debrecen, the proposed approach achieved the highest average recall among all techniques, although SMOTE and Borderline SMOTE achieved statistically similar results. Similar behavior is observed for the F1 metric. On the other hand, PF, SMOTE, and ADASYN obtained the best results on the Cervical Cancer dataset. Since both Borderline SMOTE and ADASYN are variants of SMOTE, they are expected to perform differently over different scenarios. However, as observed in most of the experiments, they are generally outperformed by SMOTE itself, considering the average values. Regarding the Mammographic Mass dataset, all techniques performed very much alike, obtaining similar results.
One can notice that all three datasets present a challenging task since no technique reached a recall. Unusual behavior is observed over the Cervical Cancer dataset, for which all techniques obtained an approximate accuracy of , despite the recall below . Such behavior may suggest that samples from the minority class are spread within a subcluster of the majority class, therefore yielding high accuracy despite the low recall.
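The gap between accuracy and recall on a skewed dataset is easy to verify with a small worked example. The counts below are hypothetical and chosen only for illustration (they are not results from the paper), and the helper `binary_metrics` is an assumed name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, recall, F1) for the class marked as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1

# Hypothetical skewed test set: 5 positive and 95 negative samples.
# The classifier finds only one positive and raises no false alarms.
y_true = [1] * 5 + [0] * 95
y_pred = [1] + [0] * 99

accuracy, recall, f1 = binary_metrics(y_true, y_pred)
```

With these counts, 96 of the 100 samples are labeled correctly even though four of the five positive samples are missed, which is why recall on the minority class is the quantity optimized during hyperparameter selection.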
V-D PF Hyperparameter Selection
PF requires the selection of a single hyperparameter, $k_{max}$, which is employed in the clustering process. Such a hyperparameter, however, is far less sensitive than the $k$ that must be selected for the other techniques. Figure 3 depicts a grid search over those hyperparameters for each technique. Notice that the central line describes the average value over the validation dataset, while the broader area describes the standard deviation. Notice that, even though PF considers a much wider interval, i.e., ranging from , its results outperform the other techniques most of the time, which assume a shorter interval between .
This paper presented an oversampling approach based on the unsupervised Optimum-Path Forest algorithm. The proposed PF proved capable of handling the class imbalance problem through a simple and effective procedure that generates new synthetic samples based on the normal distribution of the feature vectors inside each cluster. The experiments performed on three breast cancer datasets, as well as on three complementary medical datasets, showed that PF achieved similar or superior results when compared to the baseline methods already proposed in the literature.
Notwithstanding, the effectiveness of the proposed approach may still suffer when synthesizing new samples based on low-density clusters, a situation that may introduce noisy samples in the training set and, consequently, affect the resulting prediction model. Future studies will be conducted to overcome the influence of sparse clusters with low density on the synthesis of new samples. Moreover, experiments with multiclass problems will also be performed in subsequent investigations.
- thanks: Authors contributed equally.
- thanks: The authors would like to thank FAPESP grants #2013/07375-0, #2014/12236-1, #2019/18287-0, and #2019/07665-4, as well as CNPq grants #427968/2018-6 and #307066/2017-7.
- All considered versions are available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Prognostic).
- Available at https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set.
- Available at https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.
- Available at https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass.
- Such percentages were obtained empirically.
- The source code is available online at https://github.com/Leandropassosjr/o2pf.
- [1] (2018) Enhancing brain storm optimization through optimum-path forest. In 2018 IEEE 12th International Symposium on Applied Computational Intelligence and Informatics (SACI), pp. 000183–000188.
- [2] (2018) On breast cancer detection: an application of machine learning algorithms on the Wisconsin diagnostic dataset. In Proceedings of the 2nd International Conference on Machine Learning and Soft Computing, pp. 5–9.
- [3] (2014) An ensemble-based system for automatic screening of diabetic retinopathy. Knowledge-Based Systems 60, pp. 20–27.
- [4] (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
- [5] (2018) Path-value functions for which Dijkstra’s algorithm returns optimal mapping. Journal of Mathematical Imaging and Vision 60 (7), pp. 1025–1036.
- [6] (2020) OPFython: a Python-inspired optimum-path forest classifier.
- [7] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
- [8] (2007) The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Medical Physics 34 (11), pp. 4164–4172.
- [9] (2004) The image foresting transform: theory, algorithms, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (1), pp. 19–29.
- [10] (2017) Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis, pp. 243–250.
- [11] (2018) Intelligent network security monitoring based on optimum-path forest clustering. IEEE Network 33 (2), pp. 126–131.
- [12] (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp. 878–887.
- [13] (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328.
- [14] (2016) A review of computational methods applied for identification and quantification of atherosclerotic plaques in images. Expert Systems with Applications 46, pp. 1–14.
- [15] (2017) Automated breast tumor detection and segmentation with a novel computational framework of whole ultrasound images. Medical & Biological Engineering & Computing 56, pp. 183–199.
- [16] (1990) Cancer diagnosis via linear programming. Technical report, University of Wisconsin-Madison, Department of Computer Sciences.
- [17] (Website).
- [18] (2009) Supervised pattern classification based on optimum-path forest. International Journal of Imaging Systems and Technology 19 (2), pp. 120–131.
- [19] (2012) Efficient supervised optimum-path forest classification for large datasets. Pattern Recognition 45 (1), pp. 512–520.
- [20] (2019) A hybrid approach for breast mass categorization. In ECCOMAS Thematic Conference on Computational Vision and Medical Image Processing, pp. 159–168.
- [21] (2019) Predictive models for bariatric surgery risks with imbalanced medical datasets. Annals of Operations Research 280 (1-2), pp. 1–18.
- [22] (2019) Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation. Multimedia Tools and Applications, pp. 1–20.
- [23] (2019) Bag of samplings for computer-assisted Parkinson’s disease diagnosis based on recurrent neural networks. Computers in Biology and Medicine 115, pp. 103477.
- [24] (2015) Unsupervised breast masses classification through optimum-path forest. In 2015 IEEE 28th International Symposium on Computer-Based Medical Systems, pp. 238–243.
- [25] (2009) Data clustering as an optimum-path forest problem with applications in image analysis. International Journal of Imaging Systems and Technology 19 (2), pp. 50–68.
- [26] (2019) A novel approach for optimum-path forest classification using fuzzy logic. IEEE Transactions on Fuzzy Systems.
- [27] (2019) Preliminary development of an automatic breast tumour segmentation algorithm from ultrasound volumetric images. In Information Technology in Biomedicine, E. Pietka, P. Badura, J. Kawa and W. Wieclawek (Eds.), Cham, pp. 77–88.
- [28] (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), pp. 80–83.
- [29] (2018) Automatic breast ultrasound image segmentation: a survey. Pattern Recognition 79, pp. 340–355.
- [30] (2018) Machine learning techniques for breast cancer computer aided diagnosis using different image modalities: a systematic review. Computer Methods and Programs in Biomedicine 156, pp. 25–45.
- [31] (2018) Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access 6, pp. 4641–4652.