QSAR Classification Modeling for Bioactivity of Molecular Structure via SPL-Logsum

Liang-Yong Xia and Qing-Yong Wang are with Macau University of Science and Technology, Macau, China, 999078 (e-mail: wang@cnu.ac.cn).
Abstract

Quantitative structure-activity relationship (QSAR) modelling is an effective 'bridge' for finding reliable relationships between bioactivity and molecular structure. A QSAR classification model typically contains a large number of redundant, noisy, and irrelevant descriptors. To address this problem, various methods have been proposed for descriptor selection. Generally, they can be grouped into three categories: filters, wrappers, and embedded methods. Regularization is an important embedded technique that performs continuous shrinkage and automatic descriptor selection. In recent years, researchers have shown growing interest in applying regularization techniques to descriptor selection, for example, logistic regression (LR) with the $L_1$ penalty. In this paper, we propose a novel descriptor selection method based on self-paced learning (SPL) with Logsum-penalized LR for predicting the bioactivity of molecular structures. SPL, inspired by the learning process of humans and animals, gradually incorporates samples into training from easy ones (smaller losses) to hard ones (bigger losses), while the Logsum regularization selects a few meaningful and significant molecular descriptors. Experimental results on simulated data and three public QSAR datasets show that our proposed SPL-Logsum method outperforms other commonly used sparse methods in terms of classification performance and model interpretation.

QSAR; bioactivity; descriptor selection; SPL; Logsum

I Introduction

The quantitative structure-activity relationship (QSAR) model is an effective 'bridge' for finding reliable relationships between chemical structure and biological activity in the field of drug design and discovery [1]. Chemical structures are represented by a large number of different descriptors. In general, only a few descriptors associated with bioactivity are useful for the QSAR model. Therefore, descriptor selection, which eliminates redundant, noisy, and irrelevant descriptors, plays an important role in the study of QSAR [2].

In recent years, various methods have been proposed for descriptor selection. Generally, they can be grouped into three categories: filters, wrappers, and embedded methods [3].

Filter methods select descriptors according to some statistical criterion such as the T-test. The selected descriptors then become part of the classifier used to classify the compounds [4][5]. The drawback of filter methods is that they ignore the relationships between descriptors.

Wrapper methods utilize a subset of descriptors and train a model using them [6]. For example, forward selection adds the most important descriptors until the model is no longer statistically significant [7]. Backward elimination starts with all candidate descriptors and removes those without statistical support [8]. Particle swarm optimization starts from a series of initial random particles and then selects descriptors by updating velocities and positions [9]. A genetic algorithm initializes random individuals and then uses coding, selection, crossover, and mutation operations to select the subset of descriptors [10][11]. However, these methods are usually computationally very expensive.

Another approach to descriptor selection is the embedded method, which combines aspects of filter and wrapper methods [12]. Regularization [13] is an important embedded technique that performs continuous shrinkage and automatic descriptor selection. Various regularization penalties have been proposed, such as the Logsum [14], $L_1$ (lasso) [15], elastic net [16], $L_{1/2}$ [17], SCAD [18], MCP [19], and hybrid $L_{1/2+2}$ [20]. Recently, regularization has been used in QSRR [21], QSPR [22], and QSTR [23] in the field of chemometrics, but fewer studies have focused on QSAR. Moreover, interest in applying regularization to LR is increasing in QSAR classification studies. The LR model is considered an effective discriminant method because it provides the prediction probability of class membership. In order to select the small subset of descriptors relevant to the QSAR model of interest, various regularized LR models have been developed. For instance, Shevade et al. [24] proposed a sparse LR model with the $L_1$ penalty to extract the key variables for classification problems. However, the $L_1$ penalty yields biased estimates for large coefficients, so its model selection is inconsistent and lacks the oracle properties. To alleviate this, Xu et al. [17] proposed the $L_{1/2}$ penalty, which enjoys many attractive properties, such as unbiasedness, sparsity, and the oracle properties. Liang et al. [25] then used LR with the $L_{1/2}$ penalty to select a small subset of variables.

In this paper, we propose a novel descriptor selection method using SPL via sparse LR with the Logsum penalty (SPL-Logsum) for QSAR classification. SPL is inspired by the learning process of humans, which gradually incorporates training samples into learning from easy ones to hard ones [26][27][28]. Different from curriculum learning [29], which learns the data in a predefined order based on prior knowledge, SPL learns the training data in an order from easy to hard that is dynamically determined by the feedback of the learner itself. Meanwhile, the Logsum regularization proposed by Candes et al. [14] produces better solutions with more sparsity. The flow diagram in Fig. 1 shows the process of our proposed SPL-Logsum for QSAR modeling.

Fig. 1: The diagram illustrates the process of our proposed SPL-Logsum for QSAR modeling. The whole process, predicting molecular structures with unknown bioactivity, can be divided into five main stages: (1) collecting molecular structures and biological activities; (2) calculating molecular descriptors using the QSARINS software; (3) learning samples from easy to hard and selecting significant descriptors via SPL and Logsum regularization, respectively; (4) building the model with the optimal descriptor subset; (5) predicting the bioactivity of a new molecular structure using the established model. Note that different color blocks represent different values.

Our work has three main contributions:

  • We integrate self-paced learning into the Logsum penalized logistic regression (SPL-Logsum). Our proposed SPL-Logsum method can identify easy and hard samples adaptively according to what the model has already learned, gradually adding harder samples into training while preventing over-fitting.

  • On unbalanced data, our proposed method still achieves good performance and is superior to other commonly used sparse methods.

  • Experimental results on both simulated and real datasets corroborate our ideas and demonstrate the correctness and effectiveness of SPL-Logsum.

The structure of the paper is as follows: Section 2 briefly introduces the Logsum penalized LR, SPL, and the classification evaluation criteria. The details of the three QSAR datasets are given in Section 3. Section 4 provides the experimental results on the artificial and three QSAR datasets. A concluding remark is made at the end.

II Methods

II-A Sparse LR with the Logsum penalty

II-A1 The Logsum penalized LR

In this paper, we only consider a general binary classification problem with a predictor matrix X and a response vector y, which represent the chemical structures and the corresponding biological activities, respectively. Suppose we have $n$ samples $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is the $i$-th input pattern with dimensionality $p$; that is, each sample $x_i$ has $p$ descriptors and $x_{ij}$ denotes the value of descriptor $j$ for the $i$-th sample. Each $y_i$ is a corresponding response variable that takes a value of 0 or 1. Define a classifier $f(z) = e^{z}/(1 + e^{z})$; the LR model is given as follows:

$$P(y_i = 1 \mid x_i) = f(\beta_0 + x_i^{T}\beta) = \frac{\exp(\beta_0 + x_i^{T}\beta)}{1 + \exp(\beta_0 + x_i^{T}\beta)} \qquad (1)$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ are the coefficients to be estimated; note that $\beta_0$ is the intercept. Additionally, the negative log-likelihood can be expressed as follows:

$$\ell(\beta_0, \beta) = -\frac{1}{n}\sum_{i=1}^{n}\Big\{ y_i \log P(y_i = 1 \mid x_i) + (1 - y_i)\log\big[1 - P(y_i = 1 \mid x_i)\big] \Big\} \qquad (2)$$
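For concreteness, a minimal Python sketch of equations (1) and (2); the function names are ours, not from the paper:

```python
import numpy as np

def lr_prob(X, beta0, beta):
    """Eq. (1): P(y=1|x) = exp(b0 + x'b) / (1 + exp(b0 + x'b))."""
    eta = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-eta))

def neg_log_likelihood(X, y, beta0, beta):
    """Eq. (2): averaged negative log-likelihood over the n samples."""
    p = np.clip(lr_prob(X, beta0, beta), 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```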

The estimates $(\hat\beta_0, \hat\beta)$ can be obtained by minimizing equation (2). However, in high-dimensional QSAR classification problems with $p \gg n$, directly solving equation (2) results in over-fitting. Therefore, to solve the problem, a regularization term is added to equation (2):

$$\min_{\beta_0, \beta}\; \ell(\beta_0, \beta) + \lambda P(\beta) \qquad (3)$$

where $\ell(\cdot)$ is the loss function, $P(\beta)$ is the penalty function, and $\lambda > 0$ is a tuning parameter. Note that $P(\beta) = \sum_{j=1}^{p}|\beta_j|^{q}$ for the $L_q$ family; when $q$ is equal to 1, the $L_1$ (lasso) penalty is obtained. Moreover, there are various other versions of $P(\beta)$, such as $L_{1/2}$, SCAD, MCP, group lasso, and so on. Adding the $L_1$ regularization to equation (2) gives:

$$\min_{\beta_0, \beta}\; \ell(\beta_0, \beta) + \lambda\sum_{j=1}^{p}|\beta_j| \qquad (4)$$

The $L_1$ regularization can select descriptors. However, it yields biased estimates for large coefficients and is not maximally sparse. To address these problems, Xu et al. [17] proposed the $L_{1/2}$ regularization, which can be taken as a representative of the $L_q$ ($0 < q < 1$) penalties. We can rewrite equation (4) as follows:

$$\min_{\beta_0, \beta}\; \ell(\beta_0, \beta) + \lambda\sum_{j=1}^{p}|\beta_j|^{1/2} \qquad (5)$$

Theoretically, the $L_0$ regularization produces the sparsest solutions, but minimizing it is an NP-hard problem. Therefore, Candes et al. [14] proposed the Logsum penalty, which approximates the $L_0$ regularization much more closely. The formula based on the Logsum regularization is as follows:

$$\min_{\beta_0, \beta}\; \ell(\beta_0, \beta) + \lambda\sum_{j=1}^{p}\log\left(|\beta_j| + \varepsilon\right) \qquad (6)$$

where $\varepsilon > 0$ should be set arbitrarily small, to make the Logsum penalty closely resemble the $L_0$-norm.
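The sense in which Logsum approximates the $L_0$-norm can be checked numerically. A small sketch; the shift by $-\log\varepsilon$ and the normalization by $\log(1/\varepsilon)$ are our illustration, not the paper's:

```python
import numpy as np

def logsum_penalty(beta, eps):
    # shifted so that a zero coefficient contributes exactly 0
    return np.sum(np.log(np.abs(beta) + eps) - np.log(eps))

beta = np.array([0.0, 0.01, 1.0, 10.0])   # three non-zero coefficients, one zero
for eps in (1e-1, 1e-3, 1e-6):
    # normalized by log(1/eps): each non-zero term tends to 1, so the sum
    # tends to the L0-"norm" (here 3) as eps shrinks
    print(eps, logsum_penalty(beta, eps) / np.log(1.0 / eps))
```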

Fig. 2: $L_1$ is convex, while $L_{1/2}$ and Logsum are non-convex. The Logsum penalty approximates $L_0$.

The Logsum regularization is non-convex, as shown in Fig. 2, and equation (6) has local minima [30]. A regularization penalty should satisfy three properties for the coefficient estimators: unbiasedness, sparsity, and continuity, illustrated in Fig. 3.

Fig. 3: Three properties for the coefficient estimators: unbiasedness, sparsity, and continuity, shown for (a) $L_1$, (b) $L_{1/2}$, and (c) Logsum.

II-A2 A coordinate descent algorithm for the Logsum penalized LR

In this paper, we used the coordinate descent algorithm to solve equation (6). The algorithm works "one-at-a-time": it optimizes $\beta_j$ while the remaining parameters $\beta_{-j}$ (the parameters left after the $j$-th element is removed) are held fixed. Following Friedman et al. [31], Liang et al. [25], and Xia et al. [30], the univariate Logsum thresholding operator can be written as:

$$\hat{\beta}_j = \begin{cases} \operatorname{sign}(z_j)\,\dfrac{(|z_j| - \varepsilon) + \sqrt{(|z_j| + \varepsilon)^2 - 4\lambda}}{2}, & \text{if } (|z_j| + \varepsilon)^2 > 4\lambda \text{ and this value lowers the objective below } \beta_j = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where $z_j$ is the univariate least-squares solution for coordinate $j$, and $\lambda$ and $\varepsilon$ are as in equation (6).
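A direct transcription of the thresholding operator (7) as reconstructed above, with an explicit objective comparison against $\beta_j = 0$ to handle the non-convexity; the guard conditions are our choices:

```python
import numpy as np

def logsum_threshold(z, lam, eps):
    """Univariate Logsum thresholding (Eq. 7): argmin over b of
    0.5*(z - b)**2 + lam*log(|b| + eps)."""
    disc = (abs(z) + eps) ** 2 - 4.0 * lam
    if disc <= 0.0:
        return 0.0
    b = np.sign(z) * ((abs(z) - eps) + np.sqrt(disc)) / 2.0
    if b * z <= 0.0:            # stationary point fell on the wrong side of 0
        return 0.0
    # non-convex penalty: accept the root only if it beats b = 0
    f_b = 0.5 * (z - b) ** 2 + lam * np.log(abs(b) + eps)
    f_0 = 0.5 * z ** 2 + lam * np.log(eps)
    return float(b) if f_b < f_0 else 0.0
```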
Inspired by Liang et al. [25], the log-likelihood in equation (6) is linearized by a one-term Taylor series expansion:

$$\ell(\beta_0, \beta) \approx \frac{1}{2n}\sum_{i=1}^{n} w_i\left(z_i - \beta_0 - x_i^{T}\beta\right)^2 \qquad (8)$$

where $\tilde{p}(x_i)$ is the probability in equation (1) evaluated at the current estimates $(\tilde{\beta}_0, \tilde{\beta})$, $w_i = \tilde{p}(x_i)\,(1 - \tilde{p}(x_i))$, and $z_i = \tilde{\beta}_0 + x_i^{T}\tilde{\beta} + (y_i - \tilde{p}(x_i))/w_i$. Redefine the partial residual for fitting $\beta_j$ as $\tilde{z}_i^{(j)} = z_i - \tilde{\beta}_0 - \sum_{k \neq j} x_{ik}\tilde{\beta}_k$. A pseudocode of the coordinate descent algorithm for the Logsum penalized LR model is shown in Algorithm 1.

Input: X, y, and λ chosen by 10-fold cross-validation
Output: β and the value of the loss
Initialize all β_j = 0 and set ε
while not converged do
       Calculate w_i and z_i and the loss function Eq. (8) based on the current (β_0, β)
       Update each β_j cyclically: compute the partial residual z̃^(j) and set β_j by the Logsum thresholding operator Eq. (7)
       Update the intercept β_0
       Let β_old ← β
end while
Algorithm 1 A coordinate descent algorithm for the Logsum penalized LR model
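Putting equations (7) and (8) together, a compact sketch of Algorithm 1. It reuses logsum_threshold from the previous listing; the IRLS weight floor of 1e-5 and the convergence tolerance are our choices:

```python
def logsum_lr_cd(X, y, lam, eps=1e-3, max_iter=100, tol=1e-5):
    """Coordinate descent for Logsum-penalized LR via the IRLS
    quadratic approximation of Eq. (8)."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        eta = beta0 + X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(prob * (1.0 - prob), 1e-5, None)    # IRLS weights
        z = eta + (y - prob) / w                        # working response
        for j in range(p):
            r = z - beta0 - X @ beta + X[:, j] * beta[j]   # partial residual
            a = np.sum(w * X[:, j] ** 2) / n               # curvature of coord j
            zj = np.sum(w * X[:, j] * r) / (n * a)         # univariate LS solution
            beta[j] = logsum_threshold(zj, lam / a, eps)   # Eq. (7)
        beta0 = np.sum(w * (z - X @ beta)) / np.sum(w)     # refit intercept
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta0, beta
```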

II-B SPL-Logsum

Inspired by the cognitive mechanism of humans and animals, Koller et al. [26] proposed a learning regime called self-paced learning (SPL) that learns from easy to hard samples. During optimization, more samples enter the training set from easy to hard as the penalty of the self-paced regularizer is gradually increased. Suppose we are given a dataset with $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th sample and $y_i \in \{0, 1\}$ is its label. Let $f(x_i, w)$ denote the learned model with model parameter $w$ to be estimated, and let $L_i = L(y_i, f(x_i, w))$ be the loss of the $i$-th sample. The SPL model combines a weighted loss term and a general self-paced regularizer imposed on the sample weights, given as:

$$\min_{w,\, v \in [0,1]^n}\; \sum_{i=1}^{n} v_i L_i + g(v; \gamma) \qquad (9)$$

where $v = (v_1, \ldots, v_n)$ and $v_i \in [0, 1]$. With the hard weighting regularizer $g(v; \gamma) = -\gamma\sum_{i=1}^{n} v_i$ and the Logsum penalty on the model parameters, equation (9) can be rewritten as:

$$\min_{\beta_0,\, \beta,\, v \in [0,1]^n}\; \sum_{i=1}^{n} v_i L_i + \lambda\sum_{j=1}^{p}\log(|\beta_j| + \varepsilon) - \gamma\sum_{i=1}^{n} v_i \qquad (10)$$

where $\lambda$ is the regularization parameter and $\gamma$ is the age parameter for controlling the learning pace. According to Kumar et al. [26], with $(\beta_0, \beta)$ fixed, the optimal $v^* = (v_1^*, \ldots, v_n^*)$ is easily calculated by:

$$v_i^* = \begin{cases} 1, & L_i < \gamma \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

Here, we give an explanation of equation (11). In each SPL iteration, when estimating $v$ with $(\beta_0, \beta)$ fixed, a sample taken as easy (a high-confidence sample with a smaller loss value) is selected ($v_i^* = 1$) for training if its loss is smaller than $\gamma$; otherwise it is unselected ($v_i^* = 0$). When estimating $(\beta_0, \beta)$ with $v$ fixed, the classifier is trained only on the selected easy samples. As the model "age" $\gamma$ increases, more, probably harder samples with larger losses are considered, training a more "mature" model. The process of our proposed SPL-Logsum method is shown in Fig. 4, and its pseudocode is given in Algorithm 2.

Fig. 4: The process by which our proposed SPL-Logsum selects samples: (a) fix $v$ and optimize the model parameters $(\beta_0, \beta)$; (b) fix $(\beta_0, \beta)$ and optimize the weights $v$; (c) increase the model age $\gamma$ to train on harder samples.
Input: X, y, a step size μ, and the initial "age" γ
Output: model parameters (β_0, β)
Initialize v and the value of the loss
while not converged do
       update (β_0, β) on the samples with v_i = 1 using Algorithm 1
       update v according to Eq. (11)
       increase γ by the step size μ
end while
return (β_0, β)
Algorithm 2 Pseudocode for SPL-Logsum
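A sketch of Algorithm 2 built on the Algorithm 1 routine above (logsum_lr_cd). Starting from all samples selected and the single-sample fallback are our choices; the paper leaves the initialization of v unspecified:

```python
def spl_logsum(X, y, lam, gamma, mu=0.1, eps=1e-3, max_age_steps=10):
    """Alternate between fitting the Logsum-penalized LR on the selected
    samples and re-selecting samples by Eq. (11), growing the age gamma."""
    n = X.shape[0]
    v = np.ones(n, dtype=bool)            # initial sample weights
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(max_age_steps):
        # (a) fix v, optimize (beta0, beta) on the selected samples (Algorithm 1)
        beta0, beta = logsum_lr_cd(X[v], y[v], lam, eps=eps)
        # (b) fix (beta0, beta), recompute per-sample logistic losses
        prob = np.clip(1.0 / (1.0 + np.exp(-(beta0 + X @ beta))), 1e-12, 1 - 1e-12)
        loss = -(y * np.log(prob) + (1 - y) * np.log(1 - prob))
        v = loss < gamma                   # Eq. (11): keep easy samples only
        if not v.any():                    # gamma still too small: keep the easiest
            v = loss == loss.min()
        gamma += mu                        # (c) increase the age by the step size
    return beta0, beta
```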

II-C Classification evaluation criteria

In order to evaluate the QSAR classification performance of the proposed method, five classification evaluation criteria are used: (1) accuracy, (2) area under the curve (AUC), (3) sensitivity, (4) specificity, and (5) the p-value, which indicates whether the selected descriptors are significant. In order to evaluate the performance of descriptor selection on the simulated data, the selection sensitivity and specificity are defined as follows [20][32]:

$$TN = \|\neg I(\hat{\beta}) \odot \neg I(\beta^{*})\|_0, \qquad FN = \|\neg I(\hat{\beta}) \odot I(\beta^{*})\|_0$$

$$TP = \|I(\hat{\beta}) \odot I(\beta^{*})\|_0, \qquad FP = \|I(\hat{\beta}) \odot \neg I(\beta^{*})\|_0$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

where $\odot$ is the element-wise product, $\|\cdot\|_0$ calculates the number of non-zero elements in a vector, $I(\hat{\beta})$ and $I(\beta^{*})$ are the indicator vectors of the non-zero entries of the estimated and true coefficient vectors, and $\neg I(\hat{\beta})$ and $\neg I(\beta^{*})$ are their logical "not" operators, respectively.
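Under these definitions, the descriptor-selection sensitivity and specificity compare the supports of the estimated and true coefficient vectors; a small sketch:

```python
import numpy as np

def support_metrics(beta_hat, beta_true, tol=1e-8):
    """Selection sensitivity/specificity from the supports of the
    estimated and true coefficient vectors."""
    sel = np.abs(beta_hat) > tol           # I(beta_hat)
    true = np.abs(beta_true) > tol         # I(beta_star)
    tp = np.sum(sel & true)
    fp = np.sum(sel & ~true)
    fn = np.sum(~sel & true)
    tn = np.sum(~sel & ~true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```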

III Datasets

III-A PubChem AID: 651580

The dataset is provided by the Anderson Cancer Center at the University of Texas. There are 1,759 samples, of which 982 are active and 777 are inactive. A total of 1,875 descriptors were calculated using the QSARINS software. After pretreatment, we used 1,614 samples, of which 914 are active and 700 are inactive; each sample contains 1,642 descriptors [33].

III-B PubChem AID: 743297

This dataset is available from the PubChem website [34]. After descriptor calculation with the QSARINS software and preprocessing, we used 200 samples, consisting of 58 active and 142 inactive compounds, with 1,588 descriptors as model input.

III-C PubChem AID: 743263

This QSAR dataset is from Georgetown University. The numbers of active and inactive compounds are 252 and 868, respectively. After preprocessing, 983 samples could be used for QSAR modelling; each sample contains 1,612 descriptors. More details are available on the PubChem website [35].

IV Results

IV-A Analyses of simulated data

In this section, we evaluate the performance of our proposed SPL-Logsum method in a simulation study. Three methods are compared with our proposed method: LR with the $L_1$, $L_{1/2}$, and Logsum penalties, respectively. In addition, several factors are considered in constructing the simulations, including the value of the loss, the model "age" γ, the weight v, the confidence of the samples, the sample size n, the correlation coefficient ρ, and the noise control parameter σ.

IV-A1 Loss, "age" and weight

A sample with a small loss value is taken as an easy sample and is selected (v_i = 1) for training if its loss is smaller than the age γ; otherwise it is unselected (v_i = 0). As the model "age" γ increases, more, probably harder samples with larger losses are considered to train a more "mature" model. Therefore, we constructed a simple simulation about the loss, γ, and v. First, a group of loss values is given. Then the model "age" γ is pre-set. Finally, samples are selected based on equation (11). Table I shows that the number of selected samples increases with the "age" γ [36].

Samples A B C D E F G H I J K L M N
Loss 0.05 0.12 0.12 0.12 0.15 0.4 0.2 0.18 0.35 0.15 0.16 0.2 0.5 0.3
When the "age" is γ₁:
v 1 1 1 1 0 0 0 0 0 0 0 0 0 0
SPL selects: A B C D
When the "age" is γ₂ (> γ₁):
v 1 1 1 1 1 0 1 1 0 1 1 0 0 0
SPL selects: A B C D E G H J K
When the "age" is γ₃ (> γ₂):
v 1 1 1 1 1 0 1 1 0 1 1 1 0 0
SPL selects: A B C D E G H J K L
When the "age" is γ₄ (> γ₃):
v 1 1 1 1 1 0 1 1 0 1 1 1 0 1
SPL selects: A B C D E G H J K L N
TABLE I: In each SPL iteration, as the model "age" increases, more samples are included in training; probably, hard samples with larger losses will be considered to train a more "mature" model.
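Equation (11) reduces the selection in Table I to a simple threshold test; a sketch using the listed losses (the concrete age value 0.13 is illustrative, since the table's age values are not given):

```python
losses = {'A': 0.05, 'B': 0.12, 'C': 0.12, 'D': 0.12, 'E': 0.15, 'F': 0.4,
          'G': 0.2, 'H': 0.18, 'I': 0.35, 'J': 0.15, 'K': 0.16, 'L': 0.2,
          'M': 0.5, 'N': 0.3}
gamma = 0.13                                          # a representative small "age"
v = {s: int(l < gamma) for s, l in losses.items()}    # Eq. (11)
print([s for s, sel in v.items() if sel])             # -> ['A', 'B', 'C', 'D']
```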

IV-A2 High-confidence, medium-confidence, and low-confidence samples

In this paper, we divided the samples into three parts: high-confidence, medium-confidence, and low-confidence samples. The high-confidence samples are favored by the model, followed by the medium-confidence and low-confidence samples. The low-confidence samples are probably noise or outliers that can reduce the performance of the QSAR model. To illustrate the process of sample selection in SPL, we constructed a simple simulation. First, the simulated data were generated from the LR model, using a normal distribution to produce X. Then a set of coefficients β was given. Finally, we calculated the value of P(y = 1 | x). Fig. 5 shows that a set of samples can be divided into three parts. Therefore, in each SPL iteration, we count the number of selected samples drawn from the high-confidence, medium-confidence, and low-confidence groups. Fig. 6 indicates that at the beginning of SPL the model inclines to select high-confidence samples; afterwards, as the model age grows, it tends to incorporate medium-confidence and low-confidence samples to train a mature model.

Fig. 5: A set of samples divided into three parts. The blue, gray, and yellow blocks represent high-confidence, medium-confidence, and low-confidence samples, respectively.
Fig. 6: The results obtained by SPL: the number of selected samples from each confidence group as the model age increases.
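A sketch of this confidence-based partition; the generating model follows the description above, while the dimensions and the 0.9/0.6 cutoffs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))                  # normal distribution produces X
beta_true = np.arange(1.0, p + 1.0)          # a pre-set coefficient vector
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, prob)
# confidence of each sample: predicted probability of its own label
conf = np.where(y == 1, prob, 1.0 - prob)
high = conf > 0.9                            # cutoffs are illustrative only
low = conf < 0.6
medium = ~(high | low)
print(high.sum(), medium.sum(), low.sum())
```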

IV-A3 Sample size n, correlation coefficient ρ and noise control parameter σ

In this section, we constructed a simulation with sample size n, correlation coefficient ρ, and noise control parameter σ. The construction proceeds as follows.
Step I: use a normal distribution to produce X. Here, the number of rows of X is the number of samples n, and the number of columns of X is the number of variables p:

$$\tilde{y} = X\beta + \sigma\varepsilon, \qquad \varepsilon \sim N(0, 1) \qquad (12)$$

where $\tilde{y}$ is the vector of latent response variables, X = {x₁, x₂, …, x_p} is the generated matrix with entries from N(0, 1), ε is the random error, and σ controls the signal-to-noise ratio.
Step II: add the correlation parameter ρ to the simulated data:

$$x_j \leftarrow \rho\, x_{j-1} + \sqrt{1 - \rho^2}\, x_j, \qquad j = 2, \ldots, p \qquad (13)$$

Step III: to validate variable selection, the 10 non-zero coefficients are set in advance from 1 to 10:

$$\beta = (1, 2, \ldots, 10, 0, \ldots, 0)^{T} \qquad (14)$$

where β is the coefficient vector.
Step IV: we obtain the binary response y from $\tilde{y}$ through the logistic transform, labelling $y_i = 1$ when $\exp(\tilde{y}_i)/(1 + \exp(\tilde{y}_i)) \ge 0.5$ and $y_i = 0$ otherwise.

In this simulation study, first, 100 groups of data are constructed for each combination of n ∈ {200, 300}, ρ ∈ {0.2, 0.6}, and σ ∈ {0.3, 0.9}. Each dataset is then divided into a training set and a testing set (training:testing = 7:3), and the coefficients are pre-set in advance. Finally, LR with the different penalties, including our proposed method, is used to select variables and build models. Note that the reported results are averaged.
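A sketch of Steps I-IV and the 7:3 split; p = 50 and the AR(1)-style correlation in Step II are assumptions where the source elides details:

```python
import numpy as np

def simulate(n=200, p=50, rho=0.2, sigma=0.3, k=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                         # Step I
    for j in range(1, p):                               # Step II (assumed AR(1))
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho ** 2) * X[:, j]
    beta = np.zeros(p)
    beta[:k] = np.arange(1.0, k + 1.0)                  # Step III: coefficients 1..10
    z = X @ beta + sigma * rng.normal(size=n)           # Eq. (12)
    y = (1.0 / (1.0 + np.exp(-z)) >= 0.5).astype(int)   # Step IV
    return X, y, beta

X, y, beta_true = simulate()
split = int(0.7 * len(y))                               # 7:3 train/test split
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]
```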

Factors (n ρ σ) Methods | Testing dataset: AUC Sensitivity Specificity Accuracy | Descriptor selection: Sensitivity Specificity
200 0.2 0.3 L1 | 0.7056 0.7000 0.7667 0.7333 | 0.8000 0.9495
L1/2 | 0.7133 0.7667 0.7333 0.7500 | 0.7000 0.9889
Logsum | 0.7578 0.8333 0.7667 0.8000 | 0.8000 0.9939
SPL-Logsum | 0.8711 0.8333 0.9000 0.8667 | 0.8000 0.9980
200 0.2 0.9 L1 | 0.6296 0.5862 0.6774 0.6333 | 0.7000 0.9283
L1/2 | 0.7397 0.6552 0.7742 0.7167 | 0.5000 0.9869
Logsum | 0.7553 0.6897 0.7742 0.7333 | 0.7000 0.9919
SPL-Logsum | 0.7998 0.7241 0.8710 0.8000 | 0.7000 0.9970
200 0.6 0.3 L1 | 0.8058 0.7576 0.8519 0.8000 | 0.6000 0.9556
L1/2 | 0.8586 0.8182 0.8889 0.8500 | 0.6000 0.9939
Logsum | 0.8260 0.8182 0.8148 0.8167 | 0.6000 0.9960
SPL-Logsum | 0.8709 0.8485 0.8889 0.8667 | 0.6000 0.9970
200 0.6 0.9 L1 | 0.8880 0.7429 0.9600 0.8333 | 0.6000 0.9495
L1/2 | 0.8480 0.7714 0.8800 0.8167 | 0.6000 0.9939
Logsum | 0.8903 0.8000 0.9600 0.8667 | 0.6000 0.9929
SPL-Logsum | 0.8960 0.7714 1.0000 0.8667 | 0.6000 0.9970
300 0.2 0.3 L1 | 0.9180 0.9111 0.9111 0.9111 | 1.0000 0.9323
L1/2 | 0.9536 0.9556 0.9333 0.9444 | 1.0000 0.9949
Logsum | 0.9832 0.9556 0.9778 0.9667 | 1.0000 0.9980
SPL-Logsum | 0.9832 0.9556 0.9778 0.9667 | 1.0000 0.9980
300 0.2 0.9 L1 | 0.8392 0.7907 0.8085 0.8000 | 1.0000 0.9465
L1/2 | 0.8075 0.7674 0.7872 0.7778 | 0.9000 0.9869
Logsum | 0.8545 0.7907 0.8298 0.8111 | 0.9000 0.9899
SPL-Logsum | 0.9030 0.9070 0.8511 0.8778 | 1.0000 0.9980
300 0.6 0.3 L1 | 0.9071 0.8409 0.9348 0.8889 | 0.7000 0.9919
L1/2 | 0.9175 0.9091 0.9130 0.9111 | 0.6000 0.9960
Logsum | 0.9086 0.8409 0.9348 0.8889 | 0.6000 0.9929
SPL-Logsum | 0.9506 0.9091 0.9348 0.9222 | 0.7000 0.9980
300 0.6 0.9 L1 | 0.7319 0.8235 0.7115 0.7750 | 0.7000 0.8970
L1/2 | 0.7432 0.7941 0.7500 0.7750 | 0.6000 0.9758
Logsum | 0.8200 0.8250 0.7600 0.7889 | 0.6000 0.9879
SPL-Logsum | 0.9310 0.8750 0.9000 0.8889 | 0.7000 0.9970
TABLE II: The results obtained by the different methods with different n, ρ, and σ. The first four metric columns give classification performance on the testing set; the last two give descriptor-selection sensitivity and specificity.

According to the existing literature [37], the predictive ability of a QSAR model can only be estimated on a testing set of compounds. Therefore, we focus on the testing dataset, which probes the generalization ability of the model. Table II shows the experimental results obtained by $L_1$, $L_{1/2}$, Logsum, and our proposed SPL-Logsum method. The performance obtained by SPL-Logsum is better than that of $L_1$, $L_{1/2}$, and Logsum. For example, when n = 200, ρ = 0.2, and σ = 0.3, the descriptor-selection sensitivity of $L_1$, Logsum, and SPL-Logsum is 0.8000, higher than the 0.7000 of $L_{1/2}$, and the selection specificity obtained by SPL-Logsum is the highest among the four methods at 0.9980; Logsum, $L_{1/2}$, and $L_1$ rank second, third, and fourth with 0.9939, 0.9889, and 0.9495, respectively. We also analyzed the classification performance on the testing set. For example, when n = 300 and ρ = 0.6, the AUC, sensitivity, specificity, and accuracy of SPL-Logsum change from 0.9506, 0.9091, 0.9348, and 0.9222 to 0.9310, 0.8750, 0.9000, and 0.8889 as σ grows from 0.3 to 0.9. When the sample size increases, the performance of all four methods improves; for example, with ρ = 0.2 and σ = 0.3, the testing AUC and accuracy of SPL-Logsum increase by about 11% and 10% as n grows from 200 to 300. When ρ decreases, the performance of the four methods drops for σ = 0.9; for example, the results of SPL-Logsum change from 0.9310, 0.8750, 0.9000, and 0.8889 to 0.9030, 0.9070, 0.8511, and 0.8778 as ρ goes from 0.6 to 0.2 (n = 300). In a word, our proposed SPL-Logsum approach is superior to $L_1$, $L_{1/2}$, and Logsum on the simulated data.

IV-B Analyses of real data

Three public QSAR datasets were obtained from the PubChem website: AID 651580, AID 743297, and AID 743263. We used random sampling to divide each dataset into a training set and a testing set (70% for training and 30% for testing). A brief description of these datasets is given in Tables III-IV.

Dataset No. of samples (class1/class2) No. of descriptors
AID:651580 1614 (914 ACTIVE / 700 INACTIVE) 1642
AID:743297 200 (58 ACTIVE / 142 INACTIVE) 1588
AID:743263 983 (229 ACTIVE / 754 INACTIVE) 1612
TABLE III: Three publicly available QSAR datasets used in the experiments
Dataset No. of training (class1/class2) No. of testing (class1/class2)
AID:651580 1130 (634 ACTIVE / 496 INACTIVE) 484 (280 ACTIVE / 204 INACTIVE)
AID:743297 140 (37 ACTIVE / 103 INACTIVE) 60 (21 ACTIVE / 39 INACTIVE)
AID:743263 689 (152 ACTIVE / 537 INACTIVE) 294 (77 ACTIVE / 217 INACTIVE)
TABLE IV: Details of the three QSAR datasets used in the experiments
Dataset Methods | Testing dataset: AUC Sensitivity Specificity Accuracy
AID:651580 L1 | 0.7063 0.6000 0.6800 0.6333
L1/2 | 0.6697 0.7353 0.6923 0.7167
Logsum | 0.8251 0.7429 0.8000 0.7667
SPL-Logsum | 0.8583 0.8000 0.8000 0.8000
AID:743297 L1 | 0.7368 0.6667 0.7143 0.7000
L1/2 | 0.7045 0.7619 0.7179 0.7100
Logsum | 0.7765 0.7778 0.6905 0.7167
SPL-Logsum | 0.7816 0.7667 0.7692 0.7333
AID:743263 L1 | 0.7132 0.6986 0.7344 0.7153
L1/2 | 0.7009 0.7286 0.6716 0.7007
Logsum | 0.6939 0.6883 0.7000 0.7034
SPL-Logsum | 0.7552 0.7143 0.7667 0.7372
TABLE V: The results obtained by the four methods on the testing datasets.
Fig. 7: The number of descriptors selected by L1, L1/2, Logsum, and SPL-Logsum on the different datasets

Table V shows the experimental results obtained by $L_1$, $L_{1/2}$, Logsum, and SPL-Logsum on the testing data. Our proposed SPL-Logsum is better than the other methods in terms of AUC, sensitivity, specificity, and accuracy. For example, on dataset AID:743297, the AUC obtained by our proposed SPL-Logsum is 0.7816, higher than the 0.7368, 0.7045, and 0.7765 of $L_1$, $L_{1/2}$, and Logsum. Moreover, on dataset AID:651580, the accuracy of SPL-Logsum is increased by about 17%, 9%, and 4% relative to $L_1$, $L_{1/2}$, and Logsum, respectively. Furthermore, our proposed SPL-Logsum method is sparser than the other methods. For example, for dataset AID:743297 in Fig. 7, the number of descriptors selected by SPL-Logsum is 10, ranking first; next is Logsum with 23, followed by $L_{1/2}$ with 53, and finally $L_1$ with 166.

Rank AID:651580
L1 L1/2 Logsum SPL-Logsum
1 L3i MLFER_BH MATS2e gmin
2 L3u MATS8m ATSC6v SHBint8
3 L3s maxHCsats nHBint4 GATS2s
4 L3e SpMin8_Bhe gmin TDB6e
5 maxtN maxsssN ATSC3e C1SP1
6 SssNH mintsC AATSC8v MLFER_A
7 SpMAD_Dzs ATSC3e GATS8e MATS8m
8 mintN TDB5i SHBint9 minHBint4
9 MATS8m naaO mindssC ATSC2i
10 nHssNH ATSC4i ASP-3 GATS3m
11 RDF40s SssssC SsssCH E2p
12 RDF85s AATSC4s MATS1e nHdCH2
TABLE VI: The 12 top-ranked meaningful and significant descriptors identified by L1, L1/2, Logsum, and SPL-Logsum from the AID: 651580 dataset (the common descriptors are emphasized in bold).
Rank AID:743297
L1 L1/2 Logsum SPL-Logsum
1 SIC1 AATS3s VCH-3 nAcid
2 BIC1 ndsN ATSC5e ATSC3s
3 MATS7v RDF30m AATSC7v AATSC5e
4 AATSC7v ETA_Shape_P nHAvin SsssCH
5 MLFER_S GATS4c AATS3s minHCsatu
6 RDF30m nsOm ATSC7i minssNH
7 RDF70p ATSC4c GATS2p RDF100v
TABLE VII: The 7 top-ranked meaningful and significant descriptors identified by L1, L1/2, Logsum, and SPL-Logsum from the AID: 743297 dataset (the common descriptors are emphasized in bold).
Rank AID:743263
L1 L1/2 Logsum SPL-Logsum
1 maxsF RDF130u LipinskiFailures AATS5v
2 minsF nBondsD2 nF7Ring AATS6s
3 SRW9 RDF25m SCH-3 ATSC2i
4 SRW7 nHssNH SssNH MATS3m
5 MDEC-13 minaaN minHBint2 nBondsD2
6 ETA_dAlpha_A SpMin7_Bhi ATSC3p nHBint3
7 minHCsats nHdsCH RDF45s minHAvin
8 RDF95u RDF45s ATSC1v maxHBint10
9 TDB10m C3SP2 MDEO-11 nFRing
10 C3SP2 RDF75v AATSC8p RDF20s
TABLE VIII: The 10 top-ranked meaningful and significant descriptors identified by L1, L1/2, Logsum, and SPL-Logsum from the AID: 743263 dataset (the common descriptors are emphasized in bold).
Descriptors Name P-value Class
gmin Minimum E-State 0.00050 2D
TDB6e 3D topological distance based autocorrelation - lag 6 / weighted by Sanderson electronegativities 0.00368 3D
GATS2s Geary autocorrelation - lag 2 / weighted by I-state 0.00576 2D
C1SP1 Triply bound carbon bound to one other carbon 0.00849 2D
MLFER_A Overall or summation solute hydrogen bond acidity 0.01383 2D
SHBint8 Sum of E-State descriptors of strength for potential hydrogen bonds of path length 8 0.01620 2D
MATS8m Moran autocorrelation - lag 8 / weighted by mass 0.02395 2D
E2p 2nd component accessibility directional WHIM index / weighted by relative polarizabilities 0.03345 3D
minHBint4 Minimum E-State descriptors of strength for potential Hydrogen Bonds of path length 4 0.04056 2D
nHdCH2 Minimum atom-type H E-State: =CH2 0.04883 2D
GATS3m Geary autocorrelation - lag 3 / weighted by mass 0.05929 2D
ATSC2i Centered Broto-Moreau autocorrelation - lag 2 / weighted by first ionization potential 0.06968 2D
nAcid Number of acidic groups. The list of acidic groups is defined by these SMARTS ”([O;H1]-[C,S,P]=O)” 0.00569 2D
ATSC3s Centered Broto-Moreau autocorrelation - lag 3 / weighted by I-state 0.00866 2D
AATSC5e Average centered Broto-Moreau autocorrelation - lag 5 / weighted by Sanderson electronegativities 0.00210 2D
SsssCH Sum of atom-type E-State: CH- 0.01260 2D
minHCsatu Minimum atom-type H E-State: H?on C sp3 bonded to unsaturated C 0.04492 2D
minssNH Minimum atom-type E-State: -NH2-+ 0.00025 2D
RDF100v Radial distribution function - 100 / weighted by relative van der Waals volumes 0.01730 3D
AATS5v Average Broto-Moreau autocorrelation - lag 5 / weighted by van der Waals volumes 0.08453 2D
AATS6s Average Broto-Moreau autocorrelation - lag 6 / weighted by I-state 0.03047 2D
ATSC2i Centered Broto-Moreau autocorrelation - lag 2 / weighted by first ionization potential 0.00536 2D
MATS3m Moran autocorrelation - lag 3 / weighted by mass 0.00037 2D
nBondsD2 Total number of double bonds (excluding bonds to aromatic bonds) 0.00059 2D
nHBint3 Count of E-State descriptors of strength for potential Hydrogen Bonds of path length 3 0.00041 2D
minHAvin Minimum atom-type H E-State: H on C vinyl bonded to C aromatic 0.02756 2D
maxHBint10 Maximum E-State descriptors of strength for potential Hydrogen Bonds of path length 10 0.04263 2D
nFRing Number of fused rings 0.03005 2D
RDF20s Radial distribution function - 020 / weighted by relative I-state 0.05537 3D
TABLE IX: The detailed information of the descriptors obtained by the SPL-Logsum method.

Tables VI, VII, and VIII show the 12, 7, and 10 top-ranked informative molecular descriptors extracted by $L_1$, $L_{1/2}$, Logsum, and SPL-Logsum from the three datasets, ranked by the magnitude of the coefficients. Moreover, the common descriptors are emphasized in bold. Furthermore, as shown in Table IX, the molecular descriptors selected by our proposed SPL-Logsum method are meaningful and significant in terms of p-value and almost all belong to the 2D class. To sum up, our proposed SPL-Logsum is an effective technique for descriptor selection in real classification problems.

V Conclusion

In the field of drug design and discovery, only a few descriptors related to bioactivity are selected for the QSAR model of interest. Therefore, descriptor selection is an attractive way to reflect bioactivity in QSAR modeling. In this paper, we proposed a novel descriptor selection method using SPL via sparse LR with the Logsum penalty for QSAR classification. SPL can identify easy and hard samples adaptively according to what the model has already learned and gradually adds harder samples into training, while the Logsum regularization simultaneously selects a few meaningful and significant molecular descriptors.

Experimental results on both the artificial and the three QSAR datasets demonstrate that our proposed SPL-Logsum method is superior to $L_1$, $L_{1/2}$, and Logsum. Therefore, our proposed method is an effective technique for both descriptor selection and prediction of biological activity.

In this paper, SPL learns from easy samples to hard samples. However, we ignore an important aspect of learning: diversity. We plan to incorporate this information into our proposed method in future work.

Appendix A Abbreviations:

QSAR Quantitative structure-activity relationship
QSRR Quantitative structure-(chromatographic) retention relationship
QSPR Quantitative structure-property relationship
QSTR Quantitative structure-toxicity relationship
MCP Minimax concave penalty
SCAD Smoothly clipped absolute deviation
LASSO Least absolute shrinkage and selection operator
EN Elastic net
LR Logistic regression
SPL Self-paced learning
AUC Area under the curve

References

  • [1] A. R. Katritzky, M. Kuanar, S. Slavov, C. D. Hall, M. Karelson, I. Kahn, and D. A. Dobchev, “Quantitative correlation of physical and chemical properties with chemical structure: utility for prediction,” Chemical reviews, vol. 110, no. 10, pp. 5714–5789, 2010.
  • [2] M. Shahlaei, “Descriptor selection methods in quantitative structure–activity relationship studies: a review study,” Chemical reviews, vol. 113, no. 10, pp. 8093–8103, 2013.
  • [3] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, “Result analysis of the nips 2003 feature selection challenge,” in Advances in neural information processing systems, 2005, pp. 545–552.
  • [4] W. Duch, “Filter methods,” in Feature Extraction.   Springer, 2006, pp. 89–117.
  • [5] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 856–863.
  • [6] A. G. Karegowda, M. Jayaram, and A. Manjunath, “Feature subset selection problem using wrapper approach in supervised learning,” International journal of Computer applications, vol. 1, no. 7, pp. 13–17, 2010.
  • [7] D. C. Whitley, M. G. Ford, and D. J. Livingstone, “Unsupervised forward selection: a method for eliminating redundant variables,” Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1160–1168, 2000.
  • [8] K. Z. Mao, “Orthogonal forward selection and backward elimination algorithms for feature subset selection,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 1, pp. 629–634, 2004.
  • [9] J. Kennedy, “Particle swarm optimization,” in Encyclopedia of machine learning.   Springer, 2011, pp. 760–766.
  • [10] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.
  • [11] B. Hemmateenejad, M. Akhond, R. Miri, and M. Shamsipur, “Genetic algorithm applied to the selection of factors in principal component-artificial neural networks: application to qsar study of calcium channel antagonist activity of 1, 4-dihydropyridines (nifedipine analogous),” Journal of chemical information and computer sciences, vol. 43, no. 4, pp. 1328–1334, 2003.
  • [12] T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff, “Embedded methods,” in Feature extraction.   Springer, 2006, pp. 137–165.
  • [13] B. Schölkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond.   MIT press, 2002.
  • [14] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 877–905, 2008.
  • [15] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
  • [16] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
  • [17] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, “L1/2 regularization,” Science China Information Sciences, vol. 53, no. 6, pp. 1159–1169, 2010.
  • [18] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
  • [19] C.-H. Zhang et al., “Nearly unbiased variable selection under minimax concave penalty,” The Annals of statistics, vol. 38, no. 2, pp. 894–942, 2010.
  • [20] H.-H. Huang, X.-Y. Liu, and Y. Liang, “Feature selection and cancer classification via sparse logistic regression with the hybrid L1/2+2 regularization,” PLoS ONE, vol. 11, no. 5, p. e0149675, 2016.
  • [21] E. Daghir-Wojtkowiak, P. Wiczling, S. Bocian, Ł. Kubik, P. Kośliński, B. Buszewski, R. Kaliszan, and M. J. Markuszewski, “Least absolute shrinkage and selection operator and dimensionality reduction techniques in quantitative structure retention relationship modeling of retention in hydrophilic interaction liquid chromatography,” Journal of Chromatography A, vol. 1403, pp. 54–62, 2015.
  • [22] M. Goodarzi, T. Chen, and M. P. Freitas, “Qspr predictions of heat of fusion of organic compounds using bayesian regularized artificial neural networks,” Chemometrics and Intelligent Laboratory Systems, vol. 104, no. 2, pp. 260–264, 2010.
  • [23] R. Aalizadeh, C. Peter, and N. S. Thomaidis, “Prediction of acute toxicity of emerging contaminants on the water flea daphnia magna by ant colony optimization–support vector machine qstr models,” Environmental Science: Processes & Impacts, vol. 19, no. 3, pp. 438–448, 2017.
  • [24] S. K. Shevade and S. S. Keerthi, “A simple and efficient algorithm for gene selection using sparse logistic regression,” Bioinformatics, vol. 19, no. 17, pp. 2246–2253, 2003.
  • [25] Y. Liang, C. Liu, X.-Z. Luan, K.-S. Leung, T.-M. Chan, Z.-B. Xu, and H. Zhang, “Sparse logistic regression with an L1/2 penalty for gene selection in cancer classification,” BMC Bioinformatics, vol. 14, no. 1, p. 198, 2013.
  • [26] M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Advances in Neural Information Processing Systems, 2010, pp. 1189–1197.
  • [27] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann, “Self-paced learning with diversity,” in Advances in Neural Information Processing Systems, 2014, pp. 2078–2086.
  • [28] D. Meng, Q. Zhao, and L. Jiang, “What objective does self-paced learning indeed optimize?” arXiv preprint arXiv:1511.06049, 2015.
  • [29] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning.   ACM, 2009, pp. 41–48.
  • [30] L.-Y. Xia, Y.-W. Wang, D.-Y. Meng, X.-J. Yao, H. Chai, and Y. Liang, “Descriptor selection via log-sum regularization for the biological activities of chemical structure,” International journal of molecular sciences, vol. 19, no. 1, p. 30, 2017.
  • [31] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of statistical software, vol. 33, no. 1, p. 1, 2010.
  • [32] W. Zhang, Y.-w. Wan, G. I. Allen, K. Pang, M. L. Anderson, and Z. Liu, “Molecular pathway identification using biological network-regularized logistic models,” BMC genomics, vol. 14, no. 8, p. S7, 2013.
  • [33] AID:651580, “Pubchem bioassay summary.” https://pubchem.ncbi.nlm.nih.gov/bioassay/651580, accessed April 19, 2018.
  • [34] AID:743297, “Pubchem bioassay summary.” https://pubchem.ncbi.nlm.nih.gov/bioassay/743297, accessed April 19, 2018.
  • [35] AID:743263, “Pubchem bioassay summary.” https://pubchem.ncbi.nlm.nih.gov/bioassay/743263, accessed April 19, 2018.
  • [36] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, “Self-paced curriculum learning.” in AAAI, vol. 2, no. 5.4, 2015, p. 6.
  • [37] A. Golbraikh and A. Tropsha, “Beware of q2!” Journal of molecular graphics and modelling, vol. 20, no. 4, pp. 269–276, 2002.