QSAR Classification Modeling for Bioactivity of Molecular Structure via SPL-Logsum
Abstract
Quantitative structure-activity relationship (QSAR) modelling is an effective 'bridge' for finding a reliable relationship between bioactivity and molecular structure. A QSAR classification model typically contains a large number of redundant, noisy and irrelevant descriptors. To address this problem, various methods have been proposed for descriptor selection. Generally, they can be grouped into three categories: filters, wrappers, and embedded methods. Regularization is an important embedded technique that can be used for continuous shrinkage and automatic descriptor selection. In recent years, interest in applying regularization techniques to descriptor selection, for example penalized logistic regression (LR), has been increasing. In this paper, we propose a novel descriptor selection method based on self-paced learning (SPL) with Logsum penalized LR for predicting the bioactivity of molecular structures. SPL is inspired by the learning process of humans and animals, which proceeds gradually from easy samples (smaller losses) to hard samples (bigger losses), while Logsum regularization is able to select a few meaningful and significant molecular descriptors. Experimental results on simulated data and three public QSAR datasets show that the proposed SPL-Logsum method outperforms other commonly used sparse methods in terms of classification performance and model interpretation.
I Introduction
A quantitative structure-activity relationship (QSAR) model is an effective 'bridge' for finding a reliable relationship between chemical structure and biological activity in the field of drug design and discovery [1]. Chemical structures are represented by a large number of different descriptors. In general, only a few descriptors associated with bioactivity are useful for the QSAR model. Therefore, descriptor selection, which eliminates redundant, noisy, and irrelevant descriptors, plays an important role in QSAR studies [2].
In recent years, various methods have been proposed for descriptor selection. Generally, they can be grouped into three categories: filters, wrappers, and embedded methods [3].
Filter methods select descriptors according to some statistical criterion such as the t-test. The selected descriptors are then used to build a classifier for the compounds [4][5]. The drawback of filter methods is that they ignore the relationships between descriptors.
Wrapper methods select a subset of descriptors and train a model using them [6]. For example, forward selection adds the most important descriptors one at a time until adding another no longer improves the model significantly [7]. Backward elimination starts with all candidate descriptors and removes those without statistical significance [8]. Particle swarm optimization starts from a set of random initial particles and then selects descriptors by updating their velocities and positions [9]. Genetic algorithms initialize a random population and then use encoding, selection, crossover and mutation operations to select the subset of descriptors [10][11]. However, these methods are usually computationally very expensive.
Another approach to descriptor selection is the embedded method, which combines filter and wrapper methods [12]. Regularization [13] is an important embedded technique that can be used for continuous shrinkage and automatic descriptor selection. Various types of regularization have been proposed, such as Logsum [14], $L_1$ [15], $L_{EN}$ [16], $L_{1/2}$ [17], SCAD [18], MCP [19], HLR ($L_{1/2}+L_2$) [20] and so on. Recently, regularization has been used for QSRR [21], QSPR [22] and QSTR [23] in the field of chemometrics. Some researchers have also focused their attention on QSAR research. In addition, interest in applying regularization to LR is increasing in QSAR classification studies. The LR model is considered an effective discriminant method because it provides the predicted probability of class membership. In order to select a small subset of descriptors relevant to the QSAR model of interest, various regularized LR models have been developed. For instance, Shevade and Keerthi [24] proposed a sparse LR model with the $L_1$ penalty to extract the key variables for classification problems. However, the limitation of $L_1$ is that it yields biased estimates for large coefficients. Therefore, its model selection is inconsistent and lacks oracle properties. To alleviate these problems, Xu et al. [17] proposed the $L_{1/2}$ penalty, which has demonstrated many attractive properties, such as unbiasedness, sparsity and oracle properties. Liang et al. [25] then used LR with the $L_{1/2}$ penalty to select a small subset of variables.
In this paper, we propose a novel descriptor selection method using SPL via sparse LR with the Logsum penalty (SPL-Logsum) for QSAR classification. SPL is inspired by the learning process of humans, which gradually incorporates training samples from easy ones to hard ones [26][27][28]. Unlike curriculum learning [29], which learns the data in a predefined order based on prior knowledge, SPL learns the training data in an easy-to-hard order that is determined dynamically by the feedback of the learner itself. Meanwhile, the Logsum regularization proposed by Candes et al. [14] produces better solutions with more sparsity. Fig. 1 shows the flow diagram of the proposed SPL-Logsum method for QSAR modelling.
Our work has three main contributions:

We integrate self-paced learning into Logsum penalized logistic regression (SPL-Logsum). The proposed SPL-Logsum method adaptively identifies easy and hard samples according to what the model has already learned, gradually adds harder samples into training, and prevents overfitting at the same time.

On unbalanced data, the proposed method still achieves good performance and is superior to other commonly used sparse methods.

Experimental results on both simulated and real datasets corroborate our ideas and demonstrate the correctness and effectiveness of SPL-Logsum.
The structure of the paper is as follows: Section II briefly introduces Logsum penalized LR, SPL and the classification evaluation criteria. The details of the three QSAR datasets are given in Section III. Section IV provides the experimental results on the artificial data and the three QSAR datasets. Section V gives concluding remarks.
II Methods
II-A Sparse LR with the Logsum penalty
II-A1 The Logsum penalized LR
In this paper, we consider a general binary classification problem with a predictor matrix $X$ and a response vector $y$, which represent the chemical structures and the corresponding biological activities, respectively. Suppose we have $n$ samples, $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the $i$-th input pattern with dimensionality $p$, which means that $X_i$ has $p$ descriptors and $x_{ij}$ denotes the value of descriptor $j$ for the $i$-th sample. $y_i$ is the corresponding response variable, which takes a value of 0 or 1. Define a classifier $f(x) = e^x/(1+e^x)$; the LR model is given as follows:
(1) $P(y_i = 1 \mid X_i) = f(X_i\beta) = \dfrac{\exp(X_i\beta)}{1+\exp(X_i\beta)}$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ are the coefficients to be estimated and $\beta_0$ is the intercept (with $X_i\beta$ understood as $\beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j$). Additionally, the negative log-likelihood can be expressed as follows:
(2) $l(\beta) = -\sum_{i=1}^{n} \left\{ y_i \log[f(X_i\beta)] + (1-y_i)\log[1-f(X_i\beta)] \right\}$
$\beta$ can be obtained by minimizing equation (2). However, in high-dimensional QSAR classification problems with $p \gg n$, solving equation (2) directly can result in overfitting. To address this problem, a regularization term is added to equation (2):
(3) $\beta = \arg\min_{\beta} \left\{ l(\beta) + \lambda P(\beta) \right\}$
where $l(\beta)$ is the loss function, $P(\beta)$ is the penalty function and $\lambda > 0$ is a tuning parameter. Note that $P(\beta) = \sum_{j=1}^{p} |\beta_j|^q$. When $q$ is equal to 1, this is the $L_1$ penalty. Moreover, there are various versions of $L_1$, such as $L_{EN}$, SCAD, MCP, group lasso, and so on. Adding the $L_1$ regularization to equation (2) gives:
(4) $\beta = \arg\min_{\beta} \left\{ l(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$
$L_1$ regularization is able to select descriptors. However, it can yield biased estimates for large coefficients, and its solutions are not sparse enough. To address these problems, Xu et al. [17] proposed $L_{1/2}$ regularization, which can be taken as a representative of the $L_q$ $(0 < q < 1)$ penalties. We can rewrite equation (4) as follows:
(5) $\beta = \arg\min_{\beta} \left\{ l(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|^{1/2} \right\}$
Theoretically, $L_0$ regularization produces better solutions with more sparsity, but it is an NP problem. Therefore, Candes et al. [14] proposed the Logsum penalty, which approximates $L_0$ regularization much better. The formula based on Logsum regularization is:
(6) $\beta = \arg\min_{\beta} \left\{ l(\beta) + \lambda \sum_{j=1}^{p} \log(|\beta_j| + \varepsilon) \right\}$
where $\varepsilon > 0$ should be set arbitrarily small, so that the Logsum penalty closely resembles the $L_0$ norm.
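As a rough illustration of the objectives in equations (2) to (6), the following Python sketch evaluates the negative log-likelihood and the $L_1$, $L_{1/2}$ and Logsum penalty terms on a tiny hand-made example. The function names, the toy data, the value of `eps` and the choice not to penalize the intercept are assumptions made for illustration; they are not taken from the original study.

```python
import math

def sigmoid(t):
    """Logistic function f(t) = exp(t) / (1 + exp(t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def neg_log_likelihood(X, y, beta0, beta):
    """Negative log-likelihood of logistic regression (the quantity minimized in eq. 2)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(beta0 + sum(xij * bj for xij, bj in zip(xi, beta)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def l1_penalty(beta):
    return sum(abs(b) for b in beta)

def l_half_penalty(beta):
    return sum(math.sqrt(abs(b)) for b in beta)

def logsum_penalty(beta, eps=1e-3):
    # eps is a small positive constant; 1e-3 is an arbitrary illustrative value.
    return sum(math.log(abs(b) + eps) for b in beta)

def penalized_loss(X, y, beta0, beta, lam, penalty):
    """Loss plus lam times the penalty; the intercept beta0 is left unpenalized here."""
    return neg_log_likelihood(X, y, beta0, beta) + lam * penalty(beta)

# Hypothetical toy data: 4 samples, 3 descriptors.
X = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [-1.0, 0.5, 1.0], [-0.5, -1.0, -2.0]]
y = [1, 1, 0, 0]
beta = [2.0, 0.0, 0.5]

for name, pen in [("L1", l1_penalty), ("L1/2", l_half_penalty), ("Logsum", logsum_penalty)]:
    print(name, round(penalized_loss(X, y, 0.0, beta, lam=0.1, penalty=pen), 4))
```

Note that the Logsum term is strongly negative for coefficients that are exactly zero (it equals `log(eps)` there), which is one way to see why a small `eps` rewards sparse solutions.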
II-A2 A coordinate descent algorithm for the Logsum penalized LR
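In this paper, we use the coordinate descent algorithm to solve equation (6). The algorithm is a 'one-at-a-time' approach: it solves for $\beta_j$ while the other $\beta_{k \neq j}$ (the parameters remaining after the $j$-th element is removed) are held fixed. As before, suppose we have $n$ samples, $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the $i$-th input pattern with dimensionality $p$, $x_{ij}$ denotes the value of descriptor $j$ for the $i$-th sample, and $y_i$ is the corresponding response variable, which takes a value of 0 or 1. According to Friedman et al. [31], Liang et al. [25] and Xia et al. [30], the univariate Logsum thresholding operator can be written as:
(7) $\beta_j = D(\omega_j, \lambda, \varepsilon) = \begin{cases} \mathrm{sign}(\omega_j)\,\dfrac{c_1 + \sqrt{c_2}}{2}, & c_2 > 0 \\ 0, & c_2 \le 0 \end{cases}$
where $\lambda > 0$, $c_1 = |\omega_j| - \varepsilon$ and $c_2 = c_1^2 - 4(\lambda - |\omega_j|\varepsilon)$.
Inspired by Liang et al. [25], equation (6) is linearized by a one-term Taylor series expansion:
(8) $L(\beta, \lambda) \approx \dfrac{1}{2}\sum_{i=1}^{n} W_i (Z_i - X_i\beta)^2 + \lambda \sum_{j=1}^{p} \log(|\beta_j| + \varepsilon)$
where $Z_i = X_i\tilde{\beta} + \dfrac{y_i - f(X_i\tilde{\beta})}{f(X_i\tilde{\beta})(1 - f(X_i\tilde{\beta}))}$, $W_i = f(X_i\tilde{\beta})(1 - f(X_i\tilde{\beta}))$, and $f(X_i\tilde{\beta}) = \dfrac{\exp(X_i\tilde{\beta})}{1 + \exp(X_i\tilde{\beta})}$. Redefine the partial residual for fitting the current $\tilde{\beta}_j$ as $\tilde{Z}_i^{(j)} = \sum_{k \neq j} x_{ik}\tilde{\beta}_k$ and $\omega_j = \sum_{i=1}^{n} W_i x_{ij}(Z_i - \tilde{Z}_i^{(j)})$. The pseudocode of the coordinate descent algorithm for the Logsum penalized LR model is shown in Algorithm 1.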
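The thresholding step can be sketched in Python as follows. The sketch assumes the scalar subproblem $\min_b \tfrac{1}{2}(b - \omega)^2 + \lambda \log(|b| + \varepsilon)$: setting its derivative to zero for $b > 0$ gives $b^2 - (\omega - \varepsilon)b + (\lambda - \omega\varepsilon) = 0$, whose larger root is $(c_1 + \sqrt{c_2})/2$ with $c_1 = \omega - \varepsilon$ and $c_2 = c_1^2 - 4(\lambda - \omega\varepsilon)$. The function names and the extra guard against a non-positive root are assumptions added for illustration, not part of the published algorithm.

```python
import math

def logsum_threshold(omega, lam, eps):
    """Univariate Logsum thresholding.

    Uses c1 = |omega| - eps and c2 = c1^2 - 4 * (lam - |omega| * eps) and
    returns sign(omega) * (c1 + sqrt(c2)) / 2 when c2 > 0, otherwise 0.
    """
    a = abs(omega)
    c1 = a - eps
    c2 = c1 * c1 - 4.0 * (lam - a * eps)
    if c2 <= 0.0:
        return 0.0
    root = (c1 + math.sqrt(c2)) / 2.0
    if root <= 0.0:
        # Extra safeguard (an assumption): never return a value whose sign
        # is opposite to that of omega.
        return 0.0
    return math.copysign(root, omega)

def scalar_objective(b, omega, lam, eps):
    """Scalar subproblem that the operator is derived from."""
    return 0.5 * (b - omega) ** 2 + lam * math.log(abs(b) + eps)

# A large omega survives thresholding (slightly shrunk); a small one is set to zero.
b_large = logsum_threshold(2.0, lam=0.1, eps=0.01)
b_small = logsum_threshold(0.1, lam=0.1, eps=0.01)
print(b_large, b_small)
```

In a full coordinate descent loop this function would be applied to each $\omega_j$ in turn, with the weights and working responses refreshed between sweeps.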
II-B SPL-Logsum
Inspired by the cognitive mechanism of humans and animals, Kumar et al. [26] proposed a learning regime called self-paced learning (SPL) that learns from easy to hard samples. During optimization, more samples enter the training set, from easy to hard, as the penalty of the SPL regularizer is gradually increased. Suppose we are given a dataset $D = \{(X_i, y_i)\}_{i=1}^{n}$ with $n$ samples, where $X_i$ and $y_i$ are the $i$-th sample and its label, respectively. The value of $y_i$ is 0 or 1 in the classification model. Let $f(X_i, \beta)$ denote the learned model, where $\beta$ is the model parameter to be estimated, and let $L(y_i, f(X_i, \beta))$ be the loss of the $i$-th sample. The SPL model combines a weighted loss term and a general self-paced regularizer imposed on the sample weights, given as:
(9) $\min_{\beta,\, v \in [0,1]^n} E(\beta, v; \lambda, \gamma) = \sum_{i=1}^{n} \left( v_i L(y_i, f(X_i, \beta)) + g(v_i, \gamma) \right) + \lambda P(\beta)$
where $P(\beta) = \sum_{j=1}^{p} \log(|\beta_j| + \varepsilon)$ and $g(v_i, \gamma) = -\gamma v_i$. Therefore, equation (9) can be rewritten as:
(10) $\min_{\beta,\, v \in [0,1]^n} \sum_{i=1}^{n} v_i L(y_i, f(X_i, \beta)) + \lambda \sum_{j=1}^{p} \log(|\beta_j| + \varepsilon) - \gamma \sum_{i=1}^{n} v_i$
where $\lambda$ is the regularization parameter, $\gamma$ is the age parameter controlling the learning pace, and $v = (v_1, \ldots, v_n) \in [0,1]^n$. According to Kumar et al. [26], with $\beta$ fixed, the optimal $v$ is easily calculated by:
(11) $v_i = \begin{cases} 1, & L(y_i, f(X_i, \beta)) < \gamma \\ 0, & \text{otherwise} \end{cases}$
Here we give an explanation of equation (11). In an SPL iteration, when estimating $v$ with $\beta$ fixed, a sample regarded as easy (a high-confidence sample with a smaller loss) is selected ($v_i = 1$) for training if its loss is smaller than $\gamma$; otherwise it is left unselected ($v_i = 0$). When estimating $\beta$ with $v$ fixed, the classifier is trained only on the easy samples. As the model “age” $\gamma$ increases, more, probably harder, samples with larger losses are considered in order to train a more “mature” model. The process of the proposed SPL-Logsum method is shown in Fig. 4, and its pseudocode is given in Algorithm 2.
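The alternating scheme described above can be sketched as a short Python loop. This is a minimal illustration, not the authors' Algorithm 2: the learner is plain gradient descent on the weighted logistic loss (a stand-in for the Logsum penalized coordinate descent), and the toy data, the warm-up phase, the starting age `gamma` and the growth factor `mu` are all assumptions.

```python
import math

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sample_losses(X, y, beta):
    """Per-sample logistic loss; beta[0] is the intercept."""
    losses = []
    for xi, yi in zip(X, y):
        p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        losses.append(-(yi * math.log(p) + (1 - yi) * math.log(1.0 - p)))
    return losses

def spl_weights(losses, gamma):
    """Hard SPL weights: v_i = 1 if the loss is below the age gamma, else 0."""
    return [1 if loss < gamma else 0 for loss in losses]

def fit_weighted_lr(X, y, v, beta, steps=200, lr=0.1):
    """Stand-in learner (assumption): gradient descent on the v-weighted loss."""
    beta = list(beta)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi, vi in zip(X, y, v):
            if vi == 0:
                continue  # unselected samples do not contribute
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

def self_paced_fit(X, y, gamma=0.3, mu=1.5, rounds=5, warmup_steps=20):
    """Alternate between updating v (beta fixed) and beta (v fixed) while gamma grows."""
    beta = fit_weighted_lr(X, y, [1] * len(y), [0.0] * (len(X[0]) + 1), steps=warmup_steps)
    history = []
    for _ in range(rounds):
        v = spl_weights(sample_losses(X, y, beta), gamma)
        if sum(v) > 0:
            beta = fit_weighted_lr(X, y, v, beta)
        history.append((gamma, sum(v)))
        gamma *= mu  # an older model admits samples with larger losses
    return beta, history

# Hypothetical one-descriptor data; the last sample is deliberately mislabeled.
X = [[2.0], [1.5], [1.0], [-1.0], [-1.5], [-2.0], [1.8]]
y = [1, 1, 1, 0, 0, 0, 0]
beta, history = self_paced_fit(X, y)
for gamma, n_selected in history:
    print(round(gamma, 3), n_selected)
```

The `history` list records how many samples are admitted at each age, which is the quantity the text describes as growing while the model matures.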
II-C Classification evaluation criteria
To evaluate the QSAR classification performance of the proposed method, five evaluation criteria are used: (1) accuracy, (2) area under the curve (AUC), (3) sensitivity, (4) specificity and (5) P-value, which indicates whether the selected descriptors are significant. To evaluate the performance of descriptor selection on the simulated data, sensitivity and specificity are defined as follows [20][32]:
True Negative (TN) $:= |\bar{\beta} .\!* \bar{\hat{\beta}}|_0$, False Negative (FN) $:= |\beta .\!* \bar{\hat{\beta}}|_0$
True Positive (TP) $:= |\beta .\!* \hat{\beta}|_0$, False Positive (FP) $:= |\bar{\beta} .\!* \hat{\beta}|_0$
Sensitivity $:= \dfrac{TP}{TP + FN}$, Specificity $:= \dfrac{TN}{TN + FP}$
where $.\!*$ is the element-wise product and $|\cdot|_0$ counts the number of nonzero elements in a vector. $\beta$ and $\hat{\beta}$ are the true and the estimated coefficient vectors, and $\bar{\beta}$ and $\bar{\hat{\beta}}$ are the logical 'not' of $\beta$ and $\hat{\beta}$, respectively.
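The descriptor-selection criteria above amount to comparing the supports of the true and the estimated coefficient vectors. A minimal Python sketch follows; the function name and the example vectors are made up for illustration.

```python
def selection_metrics(beta_true, beta_hat):
    """Sensitivity and specificity of descriptor selection.

    A descriptor counts as truly active when its true coefficient is nonzero,
    and as selected when its estimated coefficient is nonzero.
    """
    tp = fp = tn = fn = 0
    for b, b_hat in zip(beta_true, beta_hat):
        active = b != 0
        selected = b_hat != 0
        if active and selected:
            tp += 1
        elif active and not selected:
            fn += 1
        elif not active and selected:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "sensitivity": sensitivity, "specificity": specificity}

# Hypothetical example: 3 truly active descriptors out of 8.
beta_true = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
beta_hat = [0.9, 0.0, 2.5, 0.0, 0.1, 0.0, 0.0, 0.0]
result = selection_metrics(beta_true, beta_hat)
print(result)
```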
III Datasets
III-A PubChem AID:651580
The dataset is provided by the Anderson Cancer Center at the University of Texas. There are 1,759 samples, of which 982 are active and 777 are inactive. 1,875 descriptors were calculated using the QSARINS software. After pretreatment, we used 1,614 samples, of which 914 are active and 700 are inactive. Each sample contains 1,642 descriptors [33].
III-B PubChem AID:743297
This dataset is available from the PubChem website [34]. After computing descriptors with the QSARINS software and preprocessing, we used 200 samples, consisting of 58 active and 142 inactive compounds, with 1,588 descriptors as model input.
III-C PubChem AID:743263
This QSAR dataset is from Georgetown University. The numbers of active and inactive compounds are 252 and 868, respectively. After preprocessing, 983 samples could be used for QSAR modelling. Each sample contains 1,612 descriptors. More details are available from the website [35].
IV Results
IV-A Analyses of simulated data
In this section, we evaluate the performance of the proposed SPL-Logsum method in a simulation study. The proposed method is compared with three others: LR with the $L_1$, $L_{1/2}$ and Logsum penalties. In addition, several factors are considered in constructing the simulation, including the loss value, the model “age” $\gamma$, the weight $v$, the confidence of the samples, the sample size $n$, the correlation coefficient $\rho$ and the noise control parameter $\sigma$.
IV-A1 Loss, “age” and weight
A sample is regarded as easy if its loss is small: it is selected ($v_i = 1$) for training if its loss is smaller than $\gamma$, and left unselected ($v_i = 0$) otherwise. As the model “age” $\gamma$ increases, more, probably harder, samples with larger losses are considered in order to train a more “mature” model. We therefore constructed a simple simulation of the loss, $\gamma$ and $v$. First, a group of loss values is given. Then the model “age” $\gamma$ is preset. Finally, the samples are selected according to equation (11). Table I shows that the number of selected samples increases with the “age” $\gamma$ [36].
Samples  A  B  C  D  E  F  G  H  I  J  K  L  M  N 

Loss  0.05  0.12  0.12  0.12  0.15  0.4  0.2  0.18  0.35  0.15  0.16  0.2  0.5  0.3 
When the “age”  
V  1  1  1  1  0  0  0  0  0  0  0  0  0  0 
SPL selects:  A  B  C  D  
When the “age”  
SPL selects:  A  B  C  D  E  G  H  J  K  
When the “age”  
V  1  1  1  1  1  0  1  1  0  1  1  1  0  0 
SPL selects:  A  B  C  D  E  G  H  J  K  L  
When the “age”  
V  1  1  1  1  1  0  1  1  0  1  1  1  0  1 
SPL selects:  A  B  C  D  E  G  H  J  K  L  N 
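The selection rule can be replayed on the loss values listed in the table. In the Python sketch below, the losses are copied from the table, while the three “age” thresholds (0.15, 0.25 and 0.33) are assumptions: they are not given in the text and were chosen only so that the strict rule "loss < age" reproduces the first, third and fourth selections listed.

```python
# Loss values as listed in the table above.
losses = {
    "A": 0.05, "B": 0.12, "C": 0.12, "D": 0.12, "E": 0.15, "F": 0.4, "G": 0.2,
    "H": 0.18, "I": 0.35, "J": 0.15, "K": 0.16, "L": 0.2, "M": 0.5, "N": 0.3,
}

def select(losses, gamma):
    """Return the samples whose loss is strictly below the age gamma."""
    return [name for name, loss in losses.items() if loss < gamma]

# Hypothetical ages; the original values are not available in the text.
for gamma in (0.15, 0.25, 0.33):
    print(gamma, select(losses, gamma))
```

Because samples G and L share the same loss (0.2), no single threshold can select G without L under this rule, so the second selection in the table is not reproduced here.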
IV-A2 High-confidence samples, medium-confidence samples, low-confidence samples
In this paper, we divide the samples into three groups: high-confidence, medium-confidence and low-confidence samples. The high-confidence samples are favored by the model, followed by the medium-confidence and then the low-confidence samples. The low-confidence samples are probably noise or outliers that can reduce the performance of the QSAR model. To illustrate how SPL selects samples, we constructed a simple simulation. First, the simulated data were generated from the LR model, with $X$ drawn from a normal distribution. Then, a set of coefficients $\beta$ is given. Finally, the value of $y$ can be calculated. Fig. 5 shows that a set of samples can be divided into three groups. In each SPL iteration, we can therefore count how many of the selected samples come from the high-confidence, medium-confidence and low-confidence groups. Fig. 6 indicates that at the beginning of SPL, the model tends to select high-confidence samples. Afterwards, as the model age gets larger, it tends to incorporate medium-confidence and low-confidence samples to train a mature model.
IV-A3 Sample size $n$, correlation coefficient $\rho$ and noise control parameter $\sigma$
In this section, we construct a simulation with sample size $n$, correlation coefficient $\rho$ and noise control parameter $\sigma$. The construction proceeds as follows:
Step I: use the normal distribution to produce $X$. Here, the number of rows of $X$ is the sample size $n$ and the number of columns of $X$ is the number of variables $p$.
(12) 
where $y$ is the vector of response variables, $X = \{x_1, x_2, \ldots, x_p\}$ is the generated matrix with normally distributed columns, $\epsilon$ is the random error, and $\sigma$ controls the signal-to-noise ratio.
Step II: add different correlation parameters $\rho$ to the simulated data.
(13) 
Step III: to validate variable selection, the ten nonzero coefficients ($\beta_1$ to $\beta_{10}$) are set in advance.
(14) 
where $\beta$ is the coefficient vector.
Step IV: $y$ is then obtained from the equation above.
In this simulation study, 100 groups of data are first constructed with $n \in \{200, 300\}$, $\rho \in \{0.2, 0.6\}$ and $\sigma \in \{0.3, 0.9\}$. Each dataset is then divided into a training set and a testing set (training set : testing set = 7:3), and the coefficients are preset in advance. Finally, LR with the different penalties, including our proposed method, is used to select variables and build models. Note that the reported results are averaged.
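The four construction steps and the 7:3 split can be sketched in Python as below. Since the exact generating equations are not recoverable from the text, several choices here are assumptions rather than the authors' setup: the equal-correlation construction via a shared factor, the unit values of the ten nonzero coefficients, thresholding the noisy logit at zero to obtain labels, and all function names and the dimension `p=50` used in the example.

```python
import math
import random

def simulate(n, p, rho, sigma, n_active=10, seed=0):
    """Generate (X, y, beta) for one simulated dataset (assumed model)."""
    rng = random.Random(seed)
    # Step I: independent standard normal descriptors.
    Z = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Step II (assumed form): mix in a shared factor so that any two
    # descriptors have correlation rho and unit variance.
    X = []
    for row in Z:
        shared = rng.gauss(0.0, 1.0)
        X.append([math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * z for z in row])
    # Step III: the first n_active coefficients are nonzero, the rest are zero.
    beta = [1.0] * n_active + [0.0] * (p - n_active)
    # Step IV (assumed form): noisy logit = X beta + sigma * noise; label is 1 when it is positive.
    y = []
    for row in X:
        logit = sum(b * x for b, x in zip(beta, row)) + sigma * rng.gauss(0.0, 1.0)
        y.append(1 if logit > 0 else 0)
    return X, y, beta

def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Random split into training and testing sets (7:3 by default)."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    cut = round(train_fraction * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

X, y, beta = simulate(n=200, p=50, rho=0.2, sigma=0.3)
X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), len(X_test), sum(y))
```

Repeating `simulate` with different seeds and averaging the evaluation criteria over the repetitions would mirror the averaging over groups described above.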
$n$  $\rho$  $\sigma$  Method  AUC (testing)  Sensitivity (testing)  Specificity (testing)  Accuracy (testing)  Sensitivity ($\beta$)  Specificity ($\beta$)

200  0.2  0.3  $L_1$  0.7056  0.7000  0.7667  0.7333  0.8000  0.9495
200  0.2  0.3  $L_{1/2}$  0.7133  0.7667  0.7333  0.7500  0.7000  0.9889
200  0.2  0.3  Logsum  0.7578  0.8333  0.7667  0.8000  0.8000  0.9939
200  0.2  0.3  SPL-Logsum  0.8711  0.8333  0.9000  0.8667  0.8000  0.9980
200  0.2  0.9  $L_1$  0.6296  0.5862  0.6774  0.6333  0.7000  0.9283
200  0.2  0.9  $L_{1/2}$  0.7397  0.6552  0.7742  0.7167  0.5000  0.9869
200  0.2  0.9  Logsum  0.7553  0.6897  0.7742  0.7333  0.7000  0.9919
200  0.2  0.9  SPL-Logsum  0.7998  0.7241  0.8710  0.8000  0.7000  0.9970
200  0.6  0.3  $L_1$  0.8058  0.7576  0.8519  0.8000  0.6000  0.9556
200  0.6  0.3  $L_{1/2}$  0.8586  0.8182  0.8889  0.8500  0.6000  0.9939
200  0.6  0.3  Logsum  0.8260  0.8182  0.8148  0.8167  0.6000  0.9960
200  0.6  0.3  SPL-Logsum  0.8709  0.8485  0.8889  0.8667  0.6000  0.9970
200  0.6  0.9  $L_1$  0.8880  0.7429  0.9600  0.8333  0.6000  0.9495
200  0.6  0.9  $L_{1/2}$  0.8480  0.7714  0.8800  0.8167  0.6000  0.9939
200  0.6  0.9  Logsum  0.8903  0.8000  0.9600  0.8667  0.6000  0.9929
200  0.6  0.9  SPL-Logsum  0.8960  0.7714  1.0000  0.8667  0.6000  0.9970
300  0.2  0.3  $L_1$  0.9180  0.9111  0.9111  0.9111  1.0000  0.9323
300  0.2  0.3  $L_{1/2}$  0.9536  0.9556  0.9333  0.9444  1.0000  0.9949
300  0.2  0.3  Logsum  0.9832  0.9556  0.9778  0.9667  1.0000  0.9980
300  0.2  0.3  SPL-Logsum  0.9832  0.9556  0.9778  0.9667  1.0000  0.9980
300  0.2  0.9  $L_1$  0.8392  0.7907  0.8085  0.8000  1.0000  0.9465
300  0.2  0.9  $L_{1/2}$  0.8075  0.7674  0.7872  0.7778  0.9000  0.9869
300  0.2  0.9  Logsum  0.8545  0.7907  0.8298  0.8111  0.9000  0.9899
300  0.2  0.9  SPL-Logsum  0.9030  0.9070  0.8511  0.8778  1.0000  0.9980
300  0.6  0.3  $L_1$  0.9071  0.8409  0.9348  0.8889  0.7000  0.9919
300  0.6  0.3  $L_{1/2}$  0.9175  0.9091  0.9130  0.9111  0.6000  0.9960
300  0.6  0.3  Logsum  0.9086  0.8409  0.9348  0.8889  0.6000  0.9929
300  0.6  0.3  SPL-Logsum  0.9506  0.9091  0.9348  0.9222  0.7000  0.9980
300  0.6  0.9  $L_1$  0.7319  0.8235  0.7115  0.7750  0.7000  0.8970
300  0.6  0.9  $L_{1/2}$  0.7432  0.7941  0.7500  0.7750  0.6000  0.9758
300  0.6  0.9  Logsum  0.8200  0.8250  0.7600  0.7889  0.6000  0.9879
300  0.6  0.9  SPL-Logsum  0.9310  0.8750  0.9000  0.8889  0.7000  0.9970
According to the existing literature [37], the predictive ability of a QSAR model can only be estimated using a testing set of compounds. Therefore, we pay most attention to the testing dataset, which reflects the generalization ability of the model. Table II shows the results obtained by $L_1$, $L_{1/2}$, Logsum and the proposed SPL-Logsum method. The estimates of $\beta$ obtained by SPL-Logsum are better than those of $L_1$, $L_{1/2}$ and Logsum. For example, when $n = 200$, $\rho = 0.2$ and $\sigma = 0.3$, the sensitivity of $L_1$, Logsum and SPL-Logsum is 0.8 in each case, higher than the 0.7 of $L_{1/2}$. The specificity obtained by SPL-Logsum is the highest of the four methods at 0.9980; Logsum, $L_{1/2}$ and $L_1$ rank second, third and fourth with 0.9939, 0.9889 and 0.9495. We also analyzed the performance on the testing set. For example, when $n = 300$ and $\rho = 0.6$, the AUC, sensitivity, specificity and accuracy of SPL-Logsum decrease from 0.9506, 0.9091, 0.9348 and 0.9222 to 0.9310, 0.8750, 0.9000 and 0.8889 as $\sigma$ increases from 0.3 to 0.9. When the sample size $n$ increases, performance generally improves. For example, when $\rho = 0.2$ and $\sigma = 0.3$, the testing-set AUC and accuracy of SPL-Logsum increase by about 11 and 10 percentage points as $n$ goes from 200 to 300. When $\rho$ decreases with $\sigma = 0.9$, the performance of SPL-Logsum tends to decrease. For example, when $n = 300$, its AUC, specificity and accuracy fall from 0.9310, 0.9000 and 0.8889 to 0.9030, 0.8511 and 0.8778 as $\rho$ goes from 0.6 to 0.2, although its sensitivity rises from 0.8750 to 0.9070. In short, the proposed SPL-Logsum approach is superior to $L_1$, $L_{1/2}$ and Logsum on the simulated data.
IV-B Analyses of real data
Three public QSAR datasets were obtained from the PubChem website: AID 651580, AID 743297 and AID 743263. We used random sampling to divide each dataset into a training set and a testing set (70% for training and 30% for testing). A brief description of these datasets is given in Tables III and IV.
Dataset  No. of samples (class 1/class 2)  No. of descriptors

AID:651580  1614 (914 active/700 inactive)  1642
AID:743297  200 (58 active/142 inactive)  1588
AID:743263  983 (229 active/754 inactive)  1612
Dataset  No. of training samples (class 1/class 2)  No. of testing samples (class 1/class 2)

AID:651580  1130 (634 active/496 inactive)  484 (280 active/204 inactive)
AID:743297  140 (37 active/103 inactive)  60 (21 active/39 inactive)
AID:743263  689 (152 active/537 inactive)  294 (77 active/217 inactive)
Dataset  Method  AUC (testing)  Sensitivity (testing)  Specificity (testing)  Accuracy (testing)

AID:651580  $L_1$  0.7063  0.6000  0.6800  0.6333
AID:651580  $L_{1/2}$  0.6697  0.7353  0.6923  0.7167
AID:651580  Logsum  0.8251  0.7429  0.8000  0.7667
AID:651580  SPL-Logsum  0.8583  0.8000  0.8000  0.8000
AID:743297  $L_1$  0.7368  0.6667  0.7143  0.7000
AID:743297  $L_{1/2}$  0.7045  0.7619  0.7179  0.7100
AID:743297  Logsum  0.7765  0.7778  0.6905  0.7167
AID:743297  SPL-Logsum  0.7816  0.7667  0.7692  0.7333
AID:743263  $L_1$  0.7132  0.6986  0.7344  0.7153
AID:743263  $L_{1/2}$  0.7009  0.7286  0.6716  0.7007
AID:743263  Logsum  0.6939  0.6883  0.7000  0.7034
AID:743263  SPL-Logsum  0.7552  0.7143  0.7667  0.7372
Table V shows the results obtained by $L_1$, $L_{1/2}$, Logsum and SPL-Logsum on the testing data. The proposed SPL-Logsum performs best overall in terms of AUC, sensitivity, specificity and accuracy. For example, for dataset AID:743297, the AUC obtained by SPL-Logsum is 0.7816, higher than the 0.7368, 0.7045 and 0.7765 of $L_1$, $L_{1/2}$ and Logsum. Moreover, for dataset AID:651580 the accuracy of SPL-Logsum (0.8000) exceeds that of $L_1$, $L_{1/2}$ and Logsum by 0.1667, 0.0833 and 0.0333, respectively. Furthermore, SPL-Logsum gives sparser models than the other methods. For example, for dataset AID:743297 in Fig. 7, SPL-Logsum selects the fewest descriptors (10), followed by Logsum with 23, $L_{1/2}$ with 53 and finally $L_1$ with 166.
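For reference, the four classification criteria reported above can be computed from true labels and predicted probabilities as in the following Python sketch. The AUC is computed as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function names, the 0.5 decision threshold and the example data are assumptions for illustration.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Sensitivity, specificity and accuracy at a fixed decision threshold."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 1 and pred == 0:
            fn += 1
        elif t == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
metrics = classification_metrics(y_true, scores)
area = auc(y_true, scores)
print(metrics, area)
```

Both functions assume that each class occurs at least once in `y_true`; otherwise the ratios are undefined.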
AID:651580
Rank  $L_1$  $L_{1/2}$  Logsum  SPL-Logsum
1  L3i  MLFER_BH  MATS2e  gmin 
2  L3u  MATS8m  ATSC6v  SHBint8 
3  L3s  maxHCsats  nHBint4  GATS2s 
4  L3e  SpMin8_Bhe  gmin  TDB6e 
5  maxtN  maxsssN  ATSC3e  C1SP1 
6  SssNH  mintsC  AATSC8v  MLFER_A 
7  SpMAD_Dzs  ATSC3e  GATS8e  MATS8m 
8  mintN  TDB5i  SHBint9  minHBint4 
9  MATS8m  naaO  mindssC  ATSC2i 
10  nHssNH  ATSC4i  ASP3  GATS3m 
11  RDF40s  SssssC  SsssCH  E2p 
12  RDF85s  AATSC4s  MATS1e  nHdCH2 
AID:743297
Rank  $L_1$  $L_{1/2}$  Logsum  SPL-Logsum
1  SIC1  AATS3s  VCH3  nAcid 
2  BIC1  ndsN  ATSC5e  ATSC3s 
3  MATS7v  RDF30m  AATSC7v  AATSC5e 
4  AATSC7v  ETA_Shape_P  nHAvin  SsssCH 
5  MLFER_S  GATS4c  AATS3s  minHCsatu 
6  RDF30m  nsOm  ATSC7i  minssNH 
7  RDF70p  ATSC4c  GATS2p  RDF100v 
AID:743263
Rank  $L_1$  $L_{1/2}$  Logsum  SPL-Logsum
1  maxsF  RDF130u  LipinskiFailures  AATS5v 
2  minsF  nBondsD2  nF7Ring  AATS6s 
3  SRW9  RDF25m  SCH3  ATSC2i 
4  SRW7  nHssNH  SssNH  MATS3m 
5  MDEC13  minaaN  minHBint2  nBondsD2 
6  ETA_dAlpha_A  SpMin7_Bhi  ATSC3p  nHBint3 
7  minHCsats  nHdsCH  RDF45s  minHAvin 
8  RDF95u  RDF45s  ATSC1v  maxHBint10 
9  TDB10m  C3SP2  MDEO11  nFRing 
10  C3SP2  RDF75v  AATSC8p  RDF20s 
Descriptors  Name  P-value  Class

gmin  Minimum E-State  0.00050  2D
TDB6e  3D topological distance based autocorrelation - lag 6 / weighted by Sanderson electronegativities  0.00368  3D
GATS2s  Geary autocorrelation - lag 2 / weighted by I-state  0.00576  2D
C1SP1  Triply bound carbon bound to one other carbon  0.00849  2D
MLFER_A  Overall or summation solute hydrogen bond acidity  0.01383  2D
SHBint8  Sum of E-State descriptors of strength for potential hydrogen bonds of path length 8  0.01620  2D
MATS8m  Moran autocorrelation - lag 8 / weighted by mass  0.02395  2D
E2p  2nd component accessibility directional WHIM index / weighted by relative polarizabilities  0.03345  3D
minHBint4  Minimum E-State descriptors of strength for potential Hydrogen Bonds of path length 4  0.04056  2D
nHdCH2  Minimum atom-type H E-State: =CH2  0.04883  2D
GATS3m  Geary autocorrelation - lag 3 / weighted by mass  0.05929  2D
ATSC2i  Centered Broto-Moreau autocorrelation - lag 2 / weighted by first ionization potential  0.06968  2D
nAcid  Number of acidic groups. The list of acidic groups is defined by these SMARTS "([O;H1][C,S,P]=O)"  0.00569  2D
ATSC3s  Centered Broto-Moreau autocorrelation - lag 3 / weighted by I-state  0.00866  2D
AATSC5e  Average centered Broto-Moreau autocorrelation - lag 5 / weighted by Sanderson electronegativities  0.00210  2D
SsssCH  Sum of atom-type E-State: CH  0.01260  2D
minHCsatu  Minimum atom-type H E-State: H on C sp3 bonded to unsaturated C  0.04492  2D
minssNH  Minimum atom-type E-State: NH2+  0.00025  2D
RDF100v  Radial distribution function - 100 / weighted by relative van der Waals volumes  0.01730  3D
AATS5v  Average Broto-Moreau autocorrelation - lag 5 / weighted by van der Waals volumes  0.08453  2D
AATS6s  Average Broto-Moreau autocorrelation - lag 6 / weighted by I-state  0.03047  2D
ATSC2i  Centered Broto-Moreau autocorrelation - lag 2 / weighted by first ionization potential  0.00536  2D
MATS3m  Moran autocorrelation - lag 3 / weighted by mass  0.00037  2D
nBondsD2  Total number of double bonds (excluding bonds to aromatic bonds)  0.00059  2D
nHBint3  Count of E-State descriptors of strength for potential Hydrogen Bonds of path length 3  0.00041  2D
minHAvin  Minimum atom-type H E-State: H on C vinyl bonded to C aromatic  0.02756  2D
maxHBint10  Maximum E-State descriptors of strength for potential Hydrogen Bonds of path length 10  0.04263  2D
nFRing  Number of fused rings  0.03005  2D
RDF20s  Radial distribution function - 020 / weighted by relative I-state  0.05537  3D
Tables VI, VII and VIII list the top-ranked informative molecular descriptors extracted by $L_1$, $L_{1/2}$, Logsum and SPL-Logsum for the three datasets (12, 7 and 10 descriptors, respectively), ranked by the value of their coefficients. Moreover, the common descriptors are emphasized in bold. Furthermore, as shown in Table IX, most of the molecular descriptors selected by the proposed SPL-Logsum method are meaningful and significant in terms of P-value, and almost all of them belong to the 2D class. To sum up, the proposed SPL-Logsum is an effective technique for descriptor selection in real classification problems.
V Conclusion
In the field of drug design and discovery, only a few descriptors related to bioactivity are useful for the QSAR model of interest. Therefore, descriptor selection is an attractive way to capture bioactivity in QSAR modelling. In this paper, we proposed a novel descriptor selection method using SPL via sparse LR with the Logsum penalty for QSAR classification. SPL adaptively identifies easy and hard samples according to what the model has already learned and gradually adds harder samples into training, while Logsum regularization selects a few meaningful and significant molecular descriptors.
Experimental results on both artificial data and three QSAR datasets demonstrate that the proposed SPL-Logsum method is superior to $L_1$, $L_{1/2}$ and Logsum. Therefore, our proposed method is an effective technique for both descriptor selection and prediction of biological activity.
In this paper, SPL is able to learn from easy samples to hard samples. However, we ignore an important aspect of learning: diversity. We plan to incorporate this information into the proposed method in future work.
Appendix A Abbreviations:
QSAR  Quantitative structure-activity relationship
QSRR  Quantitative structure-(chromatographic) retention relationships
QSPR  Quantitative structure-property relationship
QSTR  Quantitative structure-toxicity relationship
MCP  Minimax concave penalty
SCAD  Smoothly clipped absolute deviation
$L_1$  LASSO
$L_{EN}$  Elastic net
LR  Logistic regression
SPL  Self-paced learning
AUC  Area under the curve
References
 [1] A. R. Katritzky, M. Kuanar, S. Slavov, C. D. Hall, M. Karelson, I. Kahn, and D. A. Dobchev, “Quantitative correlation of physical and chemical properties with chemical structure: utility for prediction,” Chemical reviews, vol. 110, no. 10, pp. 5714–5789, 2010.
 [2] M. Shahlaei, “Descriptor selection methods in quantitative structure–activity relationship studies: a review study,” Chemical reviews, vol. 113, no. 10, pp. 8093–8103, 2013.
 [3] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, “Result analysis of the NIPS 2003 feature selection challenge,” in Advances in neural information processing systems, 2005, pp. 545–552.
 [4] W. Duch, “Filter methods,” in Feature Extraction. Springer, 2006, pp. 89–117.
 [5] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 856–863.
 [6] A. G. Karegowda, M. Jayaram, and A. Manjunath, “Feature subset selection problem using wrapper approach in supervised learning,” International journal of Computer applications, vol. 1, no. 7, pp. 13–17, 2010.
 [7] D. C. Whitley, M. G. Ford, and D. J. Livingstone, “Unsupervised forward selection: a method for eliminating redundant variables,” Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1160–1168, 2000.
 [8] K. Z. Mao, “Orthogonal forward selection and backward elimination algorithms for feature subset selection,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 1, pp. 629–634, 2004.
 [9] J. Kennedy, “Particle swarm optimization,” in Encyclopedia of machine learning. Springer, 2011, pp. 760–766.
 [10] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.
 [11] B. Hemmateenejad, M. Akhond, R. Miri, and M. Shamsipur, “Genetic algorithm applied to the selection of factors in principal component-artificial neural networks: application to QSAR study of calcium channel antagonist activity of 1,4-dihydropyridines (nifedipine analogous),” Journal of chemical information and computer sciences, vol. 43, no. 4, pp. 1328–1334, 2003.
 [12] T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff, “Embedded methods,” in Feature extraction. Springer, 2006, pp. 137–165.
 [13] B. Schölkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
 [14] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted l1 minimization,” Journal of Fourier analysis and applications, vol. 14, no. 5-6, pp. 877–905, 2008.
 [15] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
 [16] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
 [17] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, “L1/2 regularization,” Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
 [18] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
 [19] C.H. Zhang et al., “Nearly unbiased variable selection under minimax concave penalty,” The Annals of statistics, vol. 38, no. 2, pp. 894–942, 2010.
 [20] H.-H. Huang, X.-Y. Liu, and Y. Liang, “Feature selection and cancer classification via sparse logistic regression with the hybrid L1/2+2 regularization,” PloS one, vol. 11, no. 5, p. e0149675, 2016.
 [21] E. Daghir-Wojtkowiak, P. Wiczling, S. Bocian, Ł. Kubik, P. Kośliński, B. Buszewski, R. Kaliszan, and M. J. Markuszewski, “Least absolute shrinkage and selection operator and dimensionality reduction techniques in quantitative structure retention relationship modeling of retention in hydrophilic interaction liquid chromatography,” Journal of Chromatography A, vol. 1403, pp. 54–62, 2015.
 [22] M. Goodarzi, T. Chen, and M. P. Freitas, “Qspr predictions of heat of fusion of organic compounds using bayesian regularized artificial neural networks,” Chemometrics and Intelligent Laboratory Systems, vol. 104, no. 2, pp. 260–264, 2010.
 [23] R. Aalizadeh, C. Peter, and N. S. Thomaidis, “Prediction of acute toxicity of emerging contaminants on the water flea daphnia magna by ant colony optimization–support vector machine qstr models,” Environmental Science: Processes & Impacts, vol. 19, no. 3, pp. 438–448, 2017.
 [24] S. K. Shevade and S. S. Keerthi, “A simple and efficient algorithm for gene selection using sparse logistic regression,” Bioinformatics, vol. 19, no. 17, pp. 2246–2253, 2003.
 [25] Y. Liang, C. Liu, X.-Z. Luan, K.-S. Leung, T.-M. Chan, Z.-B. Xu, and H. Zhang, “Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification,” BMC bioinformatics, vol. 14, no. 1, p. 198, 2013.
 [26] M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Advances in Neural Information Processing Systems, 2010, pp. 1189–1197.
 [27] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann, “Self-paced learning with diversity,” in Advances in Neural Information Processing Systems, 2014, pp. 2078–2086.
 [28] D. Meng, Q. Zhao, and L. Jiang, “What objective does self-paced learning indeed optimize?” arXiv preprint arXiv:1511.06049, 2015.
 [29] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning. ACM, 2009, pp. 41–48.
 [30] L.-Y. Xia, Y.-W. Wang, D.-Y. Meng, X.-J. Yao, H. Chai, and Y. Liang, “Descriptor selection via log-sum regularization for the biological activities of chemical structure,” International journal of molecular sciences, vol. 19, no. 1, p. 30, 2017.
 [31] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of statistical software, vol. 33, no. 1, p. 1, 2010.
 [32] W. Zhang, Y.-w. Wan, G. I. Allen, K. Pang, M. L. Anderson, and Z. Liu, “Molecular pathway identification using biological network-regularized logistic models,” BMC genomics, vol. 14, no. 8, p. S7, 2013.
 [33] AID:651580, “Pubchem bioassay summary.” https://pubchem.ncbi.nlm.nih.gov/bioassay/651580, accessed April 19, 2018.
 [34] AID:743297, “Pubchem bioassay summary.” https://pubchem.ncbi.nlm.nih.gov/bioassay/743297, accessed April 19, 2018.
 [35] AID:743263, “Pubchem bioassay summary.” https://pubchem.ncbi.nlm.nih.gov/bioassay/743263, accessed April 19, 2018.
 [36] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, “Self-paced curriculum learning.” in AAAI, vol. 2, no. 5.4, 2015, p. 6.
 [37] A. Golbraikh and A. Tropsha, “Beware of q2!” Journal of molecular graphics and modelling, vol. 20, no. 4, pp. 269–276, 2002.