LoRAS: An oversampling approach for imbalanced datasets
Abstract
The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently overgeneralizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Randomized Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our LoRAS algorithm with 28 publicly available datasets and show that drawing samples from an approximated data manifold of the minority class is the key to successful oversampling. We compared the performance of LoRAS, SMOTE, and several SMOTE extensions and observed that, for imbalanced datasets, LoRAS on average generates better Machine Learning (ML) models in terms of F1-Score and Balanced Accuracy. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space.
Index terms— Imbalanced datasets, Oversampling, Synthetic sample generation, Data augmentation, Manifold learning
1 Introduction
Imbalanced datasets occur frequently in a large spectrum of fields in which Machine Learning (ML) has found applications, including business, finance, and banking as well as medical science. Oversampling approaches are a popular choice to deal with imbalanced datasets (SMOTE, Han2, He, Bunkhumpornpat2009, Barua2014). We here present Localized Randomized Affine Shadowsampling (LoRAS), which produces better ML models for imbalanced datasets compared to state-of-the-art oversampling techniques such as SMOTE and several of its extensions. We use computational analyses and a mathematical proof to demonstrate that drawing samples from an approximated data manifold of the minority class is key to successful oversampling. We validated the approach with 28 imbalanced datasets, comparing the performances of several state-of-the-art oversampling techniques with LoRAS. The average performance of LoRAS on all these datasets is better than that of the other oversampling techniques that we investigated. In addition, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique since it provides a better estimate of the local mean of the underlying data distribution, in some neighbourhood of the minority class data space.
For imbalanced datasets, the number of instances in one (or more) class(es) is very high (or very low) compared to the other class(es). A class having a large number of instances is called a majority class and one having far fewer instances is called a minority class. This makes it difficult to learn from such datasets using standard ML approaches. Oversampling approaches are often used to counter this problem by generating synthetic samples for the minority class to balance the number of data points for each class. SMOTE is a widely used oversampling technique, which has received various extensions since its original publication (SMOTE). The key idea behind SMOTE is to randomly sample artificial minority class data points along line segments joining an arbitrary minority class data point to its nearest neighbors within the minority class.
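The interpolation step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the reference SMOTE implementation: the function name `smote_sample`, the brute-force neighbor search, and the default `k=5` are assumptions made for this example.

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    """Generate one SMOTE-style synthetic point (illustrative sketch).

    X_min: array of shape (n, d) holding minority class points, n >= 2.
    k:     number of nearest neighbors to consider (assumed default).
    """
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    i = rng.integers(len(X_min))                   # random minority point
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0),
    # so the k nearest neighbors are taken from positions 1..k.
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)                      # one random neighbor
    lam = rng.random()                             # weight in [0, 1)
    # The new point lies on the segment joining X_min[i] and X_min[j].
    return X_min[i] + lam * (X_min[j] - X_min[i])
```

Because the result is a convex combination of two existing minority points, it always lies on the line segment between them.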
The SMOTE algorithm, however, has several limitations. For example, it does not consider the distribution of minority classes and latent noise in a dataset (Hu2009). It is known that SMOTE frequently overgeneralizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model (punt). Several other limitations of SMOTE are mentioned in Blagus2013. To overcome such limitations, several algorithms have been proposed as extensions of SMOTE. Some focus on improving the generation of synthetic data by combining SMOTE with other oversampling techniques, including the combination of SMOTE with Tomek links (ElhassanT2016), particle swarm optimization (Gao, Wang), rough set theory (Ram), kernel-based approaches (Mathew), Boosting (Chawla2), and Bagging (Hanifah). Other approaches choose subsets of the minority class data to generate SMOTE samples or limit the number of synthetic data points generated (Narayan). Some examples are Borderline1/2 SMOTE (Han2), ADAptive SYNthetic (ADASYN) (He), Safe Level SMOTE (Bunkhumpornpat2009), Majority Weighted Minority Oversampling TEchnique (MWMOTE) (Barua2014), Modified SMOTE (MSMOTE), and Support Vector Machine-SMOTE (SVMSMOTE) (Suh) (see Table 1) (Hu2009). Recent comparative studies have focused on SMOTE, Borderline1/2 SMOTE models, ADASYN, and SVMSMOTE (Suh, AhPine2016), which is why we will focus on these five models for a comparison with our newly developed oversampling technique LoRAS. LoRAS allows us to resample the data uniformly from an approximated data manifold of the minority class data points and thus creates a more balanced and robust model.
A LoRAS oversample is an unbiased estimator of the mean of the underlying local probability distribution followed by a minority class sample (assuming that it is some random variable) such that the variance of this estimator is significantly less than that of a SMOTE generated oversample, which is also an unbiased estimator of the mean of the underlying local probability distribution followed by a minority class sample.
Extension  Description 
Borderline1/2 SMOTE (Han2)  Identifies borderline samples and applies SMOTE on them 
ADASYN (He)  Adaptively changes the weights of different minority samples 
SVMSMOTE (Suh)  Generates new minority samples near borderlines with SVM 
SafeLevelSMOTE (Bunkhumpornpat2009)  Generates data in areas that are completely safe 
MWMOTE (Barua2014)  Identifies and weighs ambiguous minority class samples 
2 LoRAS: Localized Randomized Affine Shadowsampling
In this section we discuss our strategy to approximate the data manifold, given a small dataset. A typical dataset for a supervised ML problem consists of a set of features , that are used to characterize patterns in the data and a set of labels or ground truth. Ideally, the number of instances or samples should be significantly greater than the number of features. In order to maintain the mathematical rigor of our strategy we propose the following definition for a small dataset.
Definition 1.
Consider a class or the whole dataset with samples and features. If , then we call the dataset, a small dataset.
The LoRAS algorithm is designed to learn from a small dataset by approximating the underlying data manifold. Assuming that is the best possible set of features to represent the data and all features are equally important, we can think of a data oversampling model to be a function , that is, uses parent data points (each with features) to produce an oversampled data point in .
Definition 2.
We define a random affine combination of some arbitrary vectors as the affine linear combination of those vectors, such that the coefficients of the linear combination are chosen randomly. Formally, a vector , , is a random affine combination of vectors , () if , and are chosen randomly from a Dirichlet distribution.
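As a small illustration of Definition 2, the following sketch draws the coefficients from a Dirichlet distribution, which guarantees that they are nonnegative and sum to one. The function name `random_affine_combination` and the choice of concentration parameters all equal to 1 are assumptions made for this example.

```python
import numpy as np

def random_affine_combination(vectors, rng=None):
    """Return a random affine combination of the given vectors.

    vectors: array-like of shape (m, d), one vector per row.
    Returns the combined vector of shape (d,) and the weights of shape (m,).
    """
    rng = np.random.default_rng() if rng is None else rng
    vectors = np.asarray(vectors, dtype=float)
    m = vectors.shape[0]
    # Dirichlet weights are nonnegative and sum to 1 by construction.
    # All concentration parameters are set to 1 (an assumed choice).
    weights = rng.dirichlet(np.ones(m))
    return weights @ vectors, weights
```

With nonnegative weights that sum to one, the resulting point lies in the convex hull of the input vectors.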
The simplest way of augmenting a data point would be to take the average (or random affine combination as defined in Definition 2) of two data points as an augmented data point. But, when we have features, we can assume that the hypothetical manifold on which our data lies is dimensional. An dimensional manifold can be approximated by a collection of dimensional planes.
Given sample points we could exactly derive the equation of an unique dimensional plane containing these sample points. By Definition 1, for a small dataset, however, , and thus, there is even a possibility that . To resolve this problem, we create shadow data points or shadowsamples from our parent data points in the minority class. Each shadow data point is generated by adding noise from a normal distribution, for all features , where is some function of the sample variance for the feature . For each of the data points we can generate shadow data points such that, . Now it is possible for us to choose shadow data points from the shadow data points even if .
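The shadowsample construction can be sketched as follows. This is an illustrative example under stated assumptions: the function name `make_shadowsamples` is hypothetical, and the per-feature standard deviations `sigma` are passed in by the caller rather than derived from the sample variance of each feature.

```python
import numpy as np

def make_shadowsamples(parents, sigma, num_shadow_points, rng=None):
    """Create noisy copies (shadowsamples) of each parent data point.

    parents:           array of shape (n, d).
    sigma:             scalar or length-d sequence of noise standard
                       deviations, one per feature (assumed to be given).
    num_shadow_points: number of shadowsamples per parent.
    Returns an array of shape (n, num_shadow_points, d).
    """
    rng = np.random.default_rng() if rng is None else rng
    parents = np.asarray(parents, dtype=float)
    n, d = parents.shape
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (d,))
    # Zero-mean normal noise, drawn independently for every feature.
    noise = rng.normal(0.0, sigma, size=(n, num_shadow_points, d))
    return parents[:, None, :] + noise
```

Even when there are fewer parent points than features, the pool of `n * num_shadow_points` shadowsamples can be made large enough to pick as many points as the affine combination requires.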
Since real life data are mostly nonlinear, to approximate the data manifold effectively, we have to localize our strategy. For each parent data point in a small dataset , let us denote by the set of knearest neighbors (including ) of in . We can always choose in such a way that . Every time we choose shadow data points as follows: we first choose a random parent data point and then restrict the domain of choice to the shadowsamples generated by the parent data points in .
We then take a random affine combination of the chosen shadowsamples to create one augmented Localized Random Affine Shadowsample or a LoRAS sample as defined in Definition 2. Thus, a LoRAS sample is an artificially generated sample drawn from an dimensional plane, which locally approximates the underlying hypothetical dimensional data manifold.
C_maj:  Majority class parent data points 
C_min:  Minority class parent data points 
k:  Number of nearest neighbors to be considered per parent data point (default value : if , otherwise) 
S_{p}:  Number of generated shadowsamples per parent data point (default value : ) 
L_{\textsigma}:  List of standard deviations for normal distributions for adding noise to each feature (default value : ) 
N_{aff}:  Number of shadow points to be chosen for a random affine combination (default value : ) 
N_{gen}:  Number of generated LoRAS points for each nearest neighbors group (default value : ) 
Theoretically, we can generate LoRAS samples such that and use them for training an ML model. In this article, all imbalanced classification problems that we deal with are binary classification problems. For such a problem, there is a minority class containing relatively few samples compared to a majority class . We can thus consider the minority class as a small dataset and use the LoRAS algorithm to oversample. For every data point we can denote a set of shadowsamples generated from as . In practice, one can also choose shadowsamples for an affine combination and choose a desired number of oversampled points to be generated using the algorithm. We can look at LoRAS as an oversampling algorithm as described in Algorithm 1.
The LoRAS algorithm thus described can be used for oversampling minority classes in highly imbalanced datasets. Note that the input variables for our algorithm are: the number of nearest neighbors per sample k, the number of generated shadow points per parent data point S_{p}, the list of standard deviations L_{\textsigma} of the normal distributions used to add noise to every feature and thus generate the shadowsamples, the number of shadowsamples to be chosen for affine combinations N_{aff}, and the number of generated points for each nearest neighbors group N_{gen}.
The default values of the LoRAS parameters are given in Algorithm 1, which shows the pseudocode for the LoRAS algorithm. One could use a random grid search technique to come up with appropriate parameter combinations within given ranges of parameters. As an output, our algorithm generates a LoRAS dataset for the oversampled minority class, which can be subsequently used to train an ML model.
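To make the steps concrete, the following is a minimal sketch of the procedure described in this section: neighborhoods, shadowsamples, and random affine combinations. It is not the published implementation. The function name `loras_sketch`, the brute-force neighbor search, the scalar `sigma`, and all default values are placeholders chosen for illustration; the argument names only mirror k, S_{p}, N_{aff}, and N_{gen}.

```python
import numpy as np

def loras_sketch(X_min, k=5, num_shadow_points=10, sigma=0.005,
                 num_aff_comb=None, num_generated_points=1, rng=None):
    """Illustrative LoRAS-style oversampling of minority points X_min (n, d)."""
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    k = min(k, n)
    # Assumed default: combine as many shadowsamples as there are features.
    num_aff_comb = d if num_aff_comb is None else num_aff_comb
    if num_aff_comb > k * num_shadow_points:
        raise ValueError("need k * num_shadow_points >= num_aff_comb")

    synthetic = []
    for i in range(n):
        # k nearest neighbors of parent i, including the parent itself.
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighborhood = X_min[np.argsort(dists)[:k]]
        # Shadowsamples: noisy copies of every point in the neighborhood.
        noise = rng.normal(0.0, sigma, size=(k, num_shadow_points, d))
        shadows = (neighborhood[:, None, :] + noise).reshape(-1, d)
        for _ in range(num_generated_points):
            # Pick shadowsamples and combine them with Dirichlet weights.
            idx = rng.choice(len(shadows), size=num_aff_comb, replace=False)
            weights = rng.dirichlet(np.ones(num_aff_comb))
            synthetic.append(weights @ shadows[idx])
    return np.array(synthetic)
```

The sketch returns `n * num_generated_points` synthetic points, so the number of generated points per neighborhood controls how closely the oversampled minority class matches the size of the majority class.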
The implementation of the algorithm in Python (V 3.7.4) and an example Jupyter Notebook for the credit card fraud detection dataset is provided on the GitHub repository https://github.com/narekdavtyan/LoRAS. In our computational code in GitHub, S_{p} corresponds to num_shadow_points, L_{\textsigma} corresponds to list_sigma_f, N_{aff} corresponds to num_aff_comb, N_{gen} corresponds to num_generated_points.
3 Case studies
For testing the potential of LoRAS as an oversampling approach, we designed benchmarking experiments with a total of 28 imbalanced datasets. With this number of diverse case studies we should have a comprehensive idea of the advantages of LoRAS over other existing oversampling methods.
3.1 Datasets used for validation
Here we provide a brief description of the datasets and the sources that we have used for our studies.
Scikit-learn imbalanced benchmark datasets: The imblearn.datasets package complements the sklearn.datasets package. It provides 27 preprocessed, imbalanced datasets. The datasets span a large range of real-world problems from several fields such as business, computer science, biology, medicine, and technology. This collection of datasets was proposed in the imblearn.datasets Python library by Lema and benchmarked by Ding. Many of these datasets have been used in various research articles on oversampling approaches (Ding, saez).
Credit card fraud detection dataset: We obtain the description of this dataset from the website https://www.kaggle.com/mlgulb/creditcardfraud: "The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where there are 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172 percent of all transactions. The dataset contains only numerical input variables, which are the result of a PCA transformation. Feature variables are the principal components obtained with PCA, the only features that have not been transformed with PCA are the ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ consists of the transaction amount. The labels are encoded in the ‘Class’ variable, which is the response variable and takes value 1 in case of fraud and 0 otherwise" (cfraud).
3.2 Methodology
For each case study, we split the dataset into 50 % training and 50 % testing data. We did a pilot study with ML classifiers such as k-nearest neighbors (knn), Support Vector Machine (svm) (linear kernel), Logistic regression (lr), Random forest (rf), and Adaboost. As inferred in (Blagus2013), knn, svm, and lr are effective models with SMOTE oversampling. For each dataset, except for the credit card fraud detection dataset, we chose the ML model among knn, svm, and lr that has the best classification accuracy for the minority class. For the credit card dataset we used both lr and rf models. For computational coding, we used the scikit-learn (V 0.21.2), numpy (V 1.16.4), pandas (V 0.24.2), and matplotlib (V 3.1.0) libraries in Python (V 3.7.4).
First, we trained the models with the unmodified dataset to observe how they perform without any oversampling. Then, we oversampled the minority class using SMOTE, Borderline1 SMOTE, Borderline2 SMOTE, SVM SMOTE, ADASYN, and LoRAS and retrained the ML algorithms on the oversampled datasets. We then measured the performances of our models using performance metrics such as Balanced Accuracy and F1-Score. In our study, we benchmark LoRAS against several other oversampling algorithms for the 27 benchmark datasets. To ensure fairness of comparison, we oversampled such that the total number of augmented samples generated from the minority class was as close as possible to the number of samples in the majority class as allowed by each oversampling algorithm. For the credit card fraud detection dataset we compared performances of several oversampling techniques including LoRAS and several ML models as well, ensuring that we build the best possible ML model using customized parameter settings for the respective oversampling techniques. For this case we also chose the ML models lr and rf since their performances were the best.
LoRAS has several parameters (k, S_{p}, L_{\textsigma}, N_{aff}, N_{gen}). To ensure a fair comparison with other models, we chose the same value for the parameter k, denoting the number of nearest neighbors of a minority class sample, wherever applicable. In the case of LoRAS, for the parameter N_{aff} we chose the number of features as input for all datasets and for L_{\textsigma}, we chose a list consisting of a constant value of for each dataset. For the parameter N_{gen} we used as the default value for the 27 benchmark datasets in the imblearn.datasets Python library.
4 Results
For imbalanced datasets there are more meaningful performance measures than Accuracy, including Sensitivity or Recall, Precision, F1-Score (F-Measure), and Balanced Accuracy, all of which can be derived from the Confusion Matrix generated while testing the model. For a given class, the different combinations of recall and precision have the following meanings:

High Precision & High Recall: The model handled the classification task properly

High Precision & Low Recall: The model cannot classify the data points of the particular class properly, but is highly reliable when it does so

Low Precision & High Recall: The model classifies the data points of the particular class well, but misclassifies a high number of data points from other classes as the class in consideration

Low Precision & Low Recall: The model handled the classification task poorly
The F1-Score is calculated as the harmonic mean of precision and recall and therefore balances a model in terms of precision and recall. Balanced Accuracy is the mean of the individual class accuracies and, in this context, it is more informative than the usual accuracy score. A high Balanced Accuracy ensures that the ML algorithm learns adequately for each individual class. These measures have been defined and discussed thoroughly by AbdElrahman2013. We will use the above-mentioned performance measures wherever applicable in this article.
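For a binary problem, these measures follow directly from the four entries of the confusion matrix. The sketch below is a plain Python illustration; the function name `binary_metrics` and the convention of returning 0 when a denominator vanishes are assumptions made for this example.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute recall, precision, F1-Score, and Balanced Accuracy.

    tp, fp, fn, tn: confusion matrix counts, with the minority class
    treated as the positive class. Undefined ratios are reported as 0.0
    (an assumed convention).
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # minority accuracy
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # majority accuracy
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    balanced_accuracy = (recall + specificity) / 2       # mean class accuracy
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }
```

For instance, a classifier that assigns every test point to the majority class has a recall of 0 and a majority class accuracy of 1, which gives a Balanced Accuracy of 0.5 and an F1-Score of 0 even though its plain accuracy may be very high.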
Scikit-learn imbalanced datasets: In Table 2 we show the Balanced Accuracies and F1-Scores (each cell lists Balanced Accuracy/F1-Score) for the 27 inbuilt datasets in Scikit-learn.
Dataset  ML  Normal  SMOTE  SVM  Bl1  Bl2  ADASYN  LoRAS 
abalone  svm  .500/.000  .759/.339  .725/.336  .745/.335  .750/.328  .762/.331  .765/.345 
abalone19  svm  .500/.000  .654/.032  .549/.023  .513/.016  .516/.017  .683/.035  .741/.048 
arrythmia  knn  .500/.000  .627/.184  .505/.111  .520/.123  .538/.133  .549/.138  .678/.197 
car eval34  knn  .597/.259  .736/.248  .764/.295  .742/.274  .738/.260  .734/.248  .716/.240 
car eval4  knn  .498/.193  .753/.138  .795/.162  .759/.141  .745/.134  .751/.137  .861/.222 
coil 2000  knn  .500/.000  .618/.151  .626/.175  .624/.162  .618/.161  .568/.165  .628/.160 
ecoli  knn  .795/.405  .756/.339  .743/.326  .795/.405  .795/.405  .758/.339  .760/.343 
isolet  knn  .861/.909  .900/.455  .911/.485  .902/.460  .875/.401  .899/.453  .861/.377 
letterimg  knn  .860/.824  .980/.675  .978/.652  .975/.686  .965/.559  .980/.657  .975/.725 
libras move  knn  .500/.000  .708/.588  .705/.555  .708/.588  .750/.666  .708/.588  .708/.588 
mammography  lr  .707/.545  .880/.265  .888/.316  .874/.232  .869/.210  .873/.208  .893/.274 
oil  lr  .721/.230  .737/.209  .734/.225  .732/.223  .661/.142  .715/.184  .739/.235 
optical digits  knn  .939/.924  .979/.887  .978/.868  .975/.854  .972/.836  .974/.839  .977/.884 
ozone level  lr  .513/.052  .768/.195  .764/.264  .793/.224  .773/.187  .813/.223  .804/.227 
pendigits  knn  .982/.969  .984/.899  .983/.886  .974/.902  .970/.868  .982/.866  .990/.949 
proteinhomo  knn  .587/.296  .817/.106  .811/.152  .814/.127  .814/.127  .819/.093  .793/.137 
satimage  knn  .770/.623  .849/.430  .858/.456  .859/.449  .849/.430  .852/.443  .872/.492
scene  knn  .500/.000  .712/.239  .695/.252  .683/.224  .683/.224  .648/.195  .630/.178 
sickeuthyorid  knn  .500/.000  .715/.303  .730/.366  .732/.329  .725/.317  .708/.297  .689/.312 
solarflare  knn  .500/.000  .629/.222  .581/.202  .577/.189  .598/.204  .643/.236  .677/.251
spectrometer  svm  .950/.893  .950/.893  .950/.893  .950/.893  .941/.750  .950/.893  .950/.893 
thyroidsick  svm  .778/.666  .901/.523  .893/.534  .912/.539  .890/.434  .909/.541  .899/.640
uscrime  knn  .545/.166  .829/.359  .848/.437  .843/.385  .842/.366  .843/.367  .825/.393 
webpage  knn  .478/.041  .747/.117  .749/.121  .743/.116  .750/.120  .733/.117  .750/.123 
winequality  knn  .500/.000  .720/.166  .675/.200  .701/.180  .701/.180  .696/.154  .628/.133 
yeastme2  knn  .500/.000  .869/.300  .805/.349  .856/.307  .874/.315  .867/.294  .823/.300 
yeastml8  svm  .500/.000  .512/.126  .533/.141  .542/.149  .512/.127  .515/.131  .552/.154 
Average    .604/.296  .781/.342  .769/.362  .771/.360  .767/.329  .774/.330  .784/.363 
Calculating average performances over all datasets, LoRAS has the best Balanced Accuracy and F1-Score. As expected, SMOTE improved both Balanced Accuracy and F1-Score compared to model training without oversampling (Normal). Interestingly, the oversampling approaches SVMSMOTE and Borderline1 SMOTE also improved the average F1-Score compared to SMOTE, but at the cost of a lower Balanced Accuracy. Between SVMSMOTE and Borderline1 SMOTE, SVMSMOTE improved the F1-Score the most but has the lower Balanced Accuracy. In contrast, our LoRAS approach produces a better Balanced Accuracy than SMOTE on average while maintaining the highest average F1-Score among all oversampling techniques.
Oversampling technique  Highest Balanced Accuracy  Highest F1-Score
No oversampling  2  10 
SMOTE  6  1 
SVM SMOTE  4  9 
Borderline1 SMOTE  4  2 
Borderline2 SMOTE  4  2 
ADASYN  3  1 
LoRAS  13  9 
From Table 3, we see that LoRAS performs best in terms of Balanced Accuracy and F1-Score for 13 and 9 datasets respectively (ties included).
Thus, among the oversampling algorithms, LoRAS achieves the highest Balanced Accuracy on the largest number of datasets and shares the largest number of highest F1-Scores with SVM SMOTE.
We also observe a trend that the oversampling approaches that perform well in terms of F1-Score have a relatively weaker performance for Balanced Accuracy.
Interestingly, LoRAS not only proves to be the best choice for the highest number of datasets but also retains its performance for both performance measures.
Credit card fraud detection dataset: The credit card fraud detection dataset has 492 fraud instances out of 284,807 transactions. The task is to predict fraudulent transactions. In Table 4, we show the number of samples generated by several oversampling approaches. For testing, we have 246 fraud samples and 142,158 non-fraudulent samples in each case.
Oversampling technique  Minority Training samples  Majority Training samples 
No oversampling  246  142,157 
SMOTE  142,157  142,157 
ADASYN  142,173  142,157 
Borderline1/2 SMOTE  142,157  142,157 
SVM SMOTE  78,070  142,157 
LoRAS  142,680  142,157 
We summarize our results in Table 5, which shows the scores of our models for the performance measures Precision, Recall, F1-Score, and Balanced Accuracy for the lr and rf ML models.
Oversampling  ML model  Recall  Precision  F1-Score  Balanced Acc.
Normal  lr  .252  .911  .394  .634 
Normal  rf  .495  .995  .661  .747 
SMOTE  lr  .829  .579  .715  .914 
SMOTE  rf  .845  .195  .318  .919 
Borderline1 SMOTE  lr  .764  .800  .781  .881 
Borderline1 SMOTE  rf  .813  .540  .649  .905 
Borderline2 SMOTE  lr  .642  .802  .713  .821 
Borderline2 SMOTE  rf  .804  .569  .666  .901 
SVM SMOTE  lr  .735  .841  .785  .823 
SVM SMOTE  rf  .792  .672  .727  .895 
ADASYN  lr  .821  .770  .795  .910 
ADASYN  rf  .845  .195  .318  .917 
LoRAS  lr  .776  .845  .809  .880 
LoRAS  rf  .727  .913  .810  .863 
From Table 5 we infer that the rf model with LoRAS oversampling has the best F1-Score. Interestingly, LoRAS with both lr and rf produces a Balanced Accuracy higher than 0.85 and an F1-Score higher than 0.8. Other models such as SVM SMOTE (with both lr and rf) and ADASYN with lr also produce very good results. Thus, LoRAS produces a better F1-Score with a reasonable compromise on the Balanced Accuracy.
5 Discussion
We have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space. Let be an arbitrary minority class sample. Let be the set of the k-nearest neighbors of , which we will consider as the neighborhood of . Both SMOTE and LoRAS focus on generating augmented samples within one neighborhood at a time. We assume that a random variable follows a shifted t-distribution with degrees of freedom, location parameter , and scaling parameter . Note that here is not referring to the standard deviation but sets the overall scaling of the distribution (Jackman), which we choose to be the sample variance in the neighborhood of . A shifted t-distribution is used to estimate population parameters if the number of samples is small (usually, 30) and/or the population variance is unknown. Since in SMOTE or LoRAS we generate samples from a small neighborhood, we can argue in favour of our assumption that locally, a minority class sample, as a random variable, follows a t-distribution. Following Blagus2013, we assume that if then and are independent. For , we also assume:
(1) 
where, and denote the expectation and variance of the random variable respectively. However, the mean has to be estimated by an estimator statistic (i.e. a function of the samples). Both SMOTE and LoRAS can be considered as estimator statistics for the mean of the t-distribution that follows locally.
Theorem 1.
Both SMOTE and LoRAS are unbiased estimators of the mean of the t-distribution that follows locally. However, the variance of the LoRAS estimator is less than the variance of SMOTE given that .
Proof.
A shadowsample is a random variable where , the neighborhood of some arbitrary and follows .
(2) 
assuming and are independent. Now, a LoRAS sample , where are shadowsamples generated from the elements of the neighborhood of , , such that . The affine combination coefficients follow a Dirichlet distribution with all concentration parameters assuming equal values of 1 (assuming all features to be equally important). For arbitrary ,
where denotes the covariance of two random variables and . Assuming and to be independent,
(3) 
Thus is an unbiased estimator of . For ,
(4) 
since is independent of . For an arbitrary , th component of a LoRAS sample
(5) 
For LoRAS, we take an affine combination of shadowsamples and SMOTE considers an affine combination of two minority class samples. Note, that since a SMOTE generated oversample can be interpreted as a random affine combination of two minority class samples, we can consider, for SMOTE, independent of the number of features. Also, from Equation 3, this implies that SMOTE is an unbiased estimator of the mean of the local data distribution. Thus, the variance of a SMOTE generated sample as an estimator of would be (since for SMOTE). But for LoRAS as an estimator of , when , the variance would be less than that of SMOTE. ∎
This implies that, locally, LoRAS can estimate the mean of the underlying t-distribution better than SMOTE.
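The variance argument can be checked numerically with a simplified simulation. The sketch below makes several assumptions that go beyond the proof: the samples are drawn from a normal distribution instead of a t-distribution, the additional shadowsample noise is ignored, and the function name `combination_variance` is hypothetical. For m independent samples with common variance s^2 and weights drawn from a Dirichlet distribution with all concentration parameters equal to 1, the variance of the random affine combination is 2 s^2 / (m + 1), so m = 2 (the SMOTE case, an affine combination of two samples) gives 2 s^2 / 3, and larger m gives smaller values.

```python
import numpy as np

def combination_variance(m, n_trials=100_000, scale=1.0, seed=0):
    """Empirical variance of a random affine combination of m samples.

    Each trial draws m independent zero-mean samples with standard
    deviation `scale` (normal distribution, a simplifying assumption)
    and combines them with Dirichlet(1, ..., 1) weights.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(0.0, scale, size=(n_trials, m))
    weights = rng.dirichlet(np.ones(m), size=n_trials)
    combos = (weights * samples).sum(axis=1)
    return combos.var()

# m = 2 mimics SMOTE; a larger m mimics a LoRAS-style combination.
var_two = combination_variance(2)
var_ten = combination_variance(10)
print(var_two, 2 / 3)    # empirical value vs. 2 / (m + 1) for m = 2
print(var_ten, 2 / 11)   # empirical value vs. 2 / (m + 1) for m = 10
```

In both cases the combination remains an unbiased estimator of the common mean, because the weights sum to one; only the spread around that mean shrinks as m grows.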
6 Conclusions
Oversampling with LoRAS produces comparatively balanced ML model performances on average, in terms of Balanced Accuracy and F1-Score. This is due to the fact that, in most cases, LoRAS produces fewer misclassifications on the majority class with a reasonably small compromise for misclassifications on the minority class. Moreover, we infer that our LoRAS oversampling strategy can better estimate the mean of the underlying local distribution for a minority class sample (considering it a random variable).
The distribution of the minority class data points is considered in the oversampling techniques such as Borderline1 SMOTE, Borderline2 SMOTE, SVMSMOTE, and ADASYN (Gosain2017). SMOTE and LoRAS are the only two techniques, among the oversampling techniques we explored, that deal with the problem of imbalance just by generating new data points, independent of the distribution of the minority and majority class data points. Thus, comparing LoRAS and SMOTE gives a fair impression about the performance of our novel LoRAS algorithm as an oversampling technique, without considering any aspect of the distributions of the minority and majority class data points and relying just on resampling. Other extensions of SMOTE such as Borderline1 SMOTE, Borderline2 SMOTE, SVMSMOTE, and ADASYN can also be built on the principle of LoRAS oversampling strategy. According to our analyses LoRAS already reveals great potential on a broad variety of applications and evolves as a true alternative to SMOTE, while processing highly unbalanced datasets.
Availability of code: The implementation of the algorithm in Python (V 3.7.4) and an example Jupyter Notebook for the credit card fraud detection dataset is provided on the GitHub repository https://github.com/narekdavtyan/LoRAS. In our computational code, S_{p} corresponds to num_shadow_points, L_{\textsigma} corresponds to list_sigma_f, N_{aff} corresponds to num_aff_comb, N_{gen} corresponds to num_generated_points.
Acknowledgements: We thank Prof. Ria Baumgrass from Deutsches Rheuma-Forschungszentrum (DRFZ), Berlin for enlightening discussions on small datasets occurring in her research related to cancer therapy, which led us to the current work. We thank the German Network for Bioinformatics Infrastructure (de.NBI) and Establishment of Systems Medicine Consortium in Germany e:Med for their support, as well as the German Federal Ministry for Education and Research (BMBF) programs (FKZ 01ZX1709C) for funding us.