TopP-S: Persistent homology based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility

TopP-S: Persistent homology based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility

Kedi Wu, Zhixiong Zhao, Renxiao Wang, Guo-Wei Wei iii Address correspondences to Guo-Wei Wei. E-mail: wei@math.msu.edu
Department of Mathematics
Michigan State University, MI 48824, USA
School of Medicine
Foshan University, Foshan, Guangdong, 528000, China
State Key Laboratory of Bioorganic Chemistry, Shanghai Institute of Organic Chemistry,
Chinese Academy of Sciences, Shanghai, 200032, P.R. China
Department of Electrical and Computer Engineering
Michigan State University, MI 48824, USA
Department of Biochemistry and Molecular Biology
Michigan State University, MI 48824, USA
July 4, 2019
Abstract

Aqueous solubility and partition coefficient are important physical properties of small molecules. Accurate theoretical prediction of aqueous solubility and partition coefficient plays an important role in drug design and discovery. The prediction accuracy depends crucially on molecular descriptors which are typically derived from theoretical understanding of the chemistry and physics of small molecules. The present work introduces an algebraic topology based method, called element specific persistent homology (ESPH), as a new representation of small molecules that is entirely different from conventional chemical and/or physical representations. ESPH describes molecular properties in terms of multiscale and multicomponent topological invariants. Such topological representation is systematical, comprehensive, and scalable with respect to molecular size and composition variations. However, it cannot be literally translated into a physical interpretation. Fortunately, it is readily suitable for machine learning methods, rendering topological learning algorithms. Due to the inherent correlation between solubility and partition coefficient, a uniform ESPH representation is developed for both properties, which facilitates multi-task deep neural networks for their simultaneous predictions. This strategy leads to more accurate prediction of relatively small data sets. A total of six data sets is considered in the present work to validate the proposed topological and multi-task deep learning approaches. It is demonstrate that the proposed approaches achieve some of the most accurate predictions of aqueous solubility and partition coefficient. Our software is available online at http://weilab.math.msu.edu/TopP-S/.

Key words: persistent homology, partition coefficient, aqueous solubility, multitask learning, deep neural networks, topological learning

I Introduction

The partition coefficient, denoted and defined to be the ratio of concentrations of a solute in a mixture of two immiscible solvents at equilibrium, is of great importance in pharmacology. It measures the drug-likeness of a compound as well as its hydrophobic effect on human body. The logarithm of this coefficient, i.e., , has proved to be one of key parameters in drug design and discovery. Optimal along with low molecular weight and low polar surface area plays an important role in governing kinetic and dynamic aspects of drug action. In particular, Hansch et al. [3] gave a detailed description of how lipophilicity impacted pharmacodynamics. This being said, surveys show that approximately half of the drug candidates fail to reach market due to unsatisfactory pharmacokinetic properties or toxicity [4], which indeed makes predictions even more important.

The extent of existing reliable experimental data is negligible compared to tremendous compounds whose data are practically needed. Therefore, computational prediction of partition coefficient is an indispensable approach in modern drug design and discovery. Since the pioneering work of Hansch et al. [5, 6, 7], a large variety of octanol-water partition coefficient predictors has been developed over the past few decades. Many methods are generally called as quantitative structure-activity relationship (QSAR) models. In general, these models can be categorized into atom-based additive methods, fragment/compound-based methods, and property based methods. One of atom-based additive methods, which was first proposed by Crippen and co-workers [8], is essentially purely additive and effectively a table look-up per atom. Later on, XLOGP3, a refined version of atom-based additive methods, was developed [9]. This approach considers various atom types, contributions from neighbors, as well as correction factors which help overcome known difficulties in purely atomistic additive methods. However additivity may fail in some cases, where unexpected contributions to occur, especially for complicated structures. Fragment/compound based predictors, instead of employing information from single atom, are built at compounds or fragments level. Compounds or fragments are then added up with correction factors. Popular fragment methods include KOWWIN [10, 11], CLOGP [12, 13], ACD/LOGP [14, 15], KLOGP[16, 17]. A major challenge for fragment/compound based methods is the optimal classification of “building blocks". The number of fragments and corrections involved in current methods range from hundreds to thousands, which could be even larger if remote atoms are also taken into account. This fact may lead to technical problems in practice and may also cause overfitting in modeling. The third category is property-based. Basically property-based methods determine partition coefficient using properties, empirical approaches, three dimensional (3D) structures (e.g., implicit solvent models, molecule dynamics (MD) methods), and topological or electrostatic indices. Most of these methods are modeled using statistical tools such as associative neural network (ALOGPS) [18, 19]. It is worthy to mention that property-based methods are relatively computationally expensive, and depend largely on the choice of descriptors and accuracy of computations. This to some extent results in a preference of methods in the first two categories over those in the third.

Another closely related chemical property is aqueous solubility, denoted by , or its logarithm value . In drug discovery and other related pharmaceutical fields, it is of great significance to identify molecules with undesirable water solubility on early stages as solubility affects absorption, distribution, metabolism, and elimination processes (ADME) [20, 21]. QSPR models, along with atom/group additive models [22, 2, 23, 24], have been developed to predict solubility. For example, QSPR models assume that aqueous solubility correlates with experimental properties such as aforementioned partition coefficient and melting point [25], or molecular descriptors such as solvent accessible area. However, due to the difficulty of experimentally measuring solubility for certain compounds, the experimental data can contain errors up to 1.5 log units[26, 27] and no less than 0.6 log units [28]. Such a high variability brings challenge to solubility prediction.

Both partition coefficient and aqueous solubility reveal how a solute dissolves in solvent(s). Therefore it is reasonable to assume that there exists a shared feature representation across these two related tasks. In machine learning theory, multitask (MT) learning is designed to take the advantage of shared feature representations of correlated properties. It learns the so-called “inductive bias” from related tasks to improve accuracy using the same representation [29]. In other words, MT learning aims at learning a shared and generalized feature representation from multiple tasks and has brought new insights into the study of bioinformatics. Successful applications include splice-site and MHC-I binding prediction [30] in sequence biology, gene expression analysis, and system biology [31]. MT learning becomes more efficient when it is incorporated with deep learning (DL) strategies. DL has successfully achieved state-of-the-art results in signal and information processing fields, such as speech recognition [32, 33] and natural language processing [34, 35], as well as toxicity prediction [36, 37, 38, 39, 40] and aqueous solubility prediction [41].

Geometric descriptors are commonly used in machine learning to represent small molecules. In fact, geometric representation of molecules, particularly macromolecules, often involves too much structural detail and thus may become intractable for large and complex biomolecular data sets. In contrast, topology offers the highest level of abstraction and truly metric free representations of molecules. However, traditional topology incurs too much geometric reduction to be practically useful for molecules. Persistent homology bridges classical geometry and topology, offering a multiscale representation of molecular systems [42, 43]. In doing so, it creates a family of topologies via a filtration parameter, which leads to a one-dimensional topological invariants, i.e., barcodes of Betti numbers. The physical interpretations of Betti-0, Betti-1 and Betti-2 barcodes are isolated components, circles, and cavities, respectively. Persistent homology has been applied to the modeling and prediction of nano particles, proteins and other biomolecules [44, 45, 46, 47, 48]. Nonetheless, it was found that primitive persistent homology has very limited predictive power in machine learning based classification of biomolecules [49], which motivate us to introduce element specific persistent homology (ESPH) to retain crucial biological information during the topological simplification of geometric complexity [50, 51, 52]. ESPH has found it success in the predictions of protein-ligand binding affinities [51, 52] and mutation induced protein stability changes [50, 52]. Unfortunately, the representational and predictive power of persistent homology and ESPH for small molecules is essentially unknown. Unlike proteins, small molecules involve a wide range of chemical elements and their properties are very sensitive to their chemical constitutions, symmetry and stereochemistry. Therefore, it is not clear whether persistent homology and ESPH are suitable descriptors for small molecules.

The objective for this work is to explore the representationability and predictive power of ESPH for small molecules. To this end, we focus on the analysis and prediction of small molecular solubility and partition coefficient. Due to their relevance to drug design and discovery, relatively large data sets have been collected in the literature for these problems, which give rise to good data sets for validation of topological descriptors. To overcome the difficulty of small available data sets for certain problems, we construct topological learning by integrating ESPH and multitask deep learning for partition coefficient and aqueous solubility predictions. We show that ESPH provides a competitive description of relatively small drug-like molecules. Additionally, the inherent correlation between partition coefficient and aqueous solubility makes the multi-task strategy a viable approach in joint and predictions.

The rest of this paper is structured as follow. In Section II, we give an introduction to element specific persistent homology and the construction of element specific topological descriptor (ESTD. The underlying motivation for ESTD is discussed and a concrete example is also presented in Section II.B. In Section II.D, we provide an overview of classic ensemble methods, multitask learning (MT) and deep neural network (DNN). The MT-DNN architecture is carefully formulated and illustrated in Section II.D.2. In Section III, we first give an overview of data sets. The predictions of MT-DNN models as well as other methods for partition coefficients and aqueous solubility are presented. Finally we wrap up the paper with some discussions in Section IV.

Ii Datasets and Methods

This section is devoted to datasets, topological methods, machine learning algorithms and evaluation metrics.

ii.a An overview of data sets

The primary work of this paper is to explore the proposed topology based multi-task methods for learning partition coefficient and aqueous solubility simultaneously. Data sets can naturally be divided into two parts – one for partition coefficient prediction and the other for aqueous solubility prediction. Note that for partition coefficient prediction there are multiple test sets while the training set remains the same.

Partition coefficient data sets

The training set used for partition coefficient prediction was originally compiled by Cheng et al. [9] and consists of 8199 compounds, which is based on Hansch et al.’s compilation [53]. These compounds are considered to have reliable experimental values by Hansch (marked with * or ). In addition, three sets were chosen as test sets. The first test set, which is completely independent from the training set, contains 406 small-molecule organic drugs approved by the Food and Drug Administration (FDA) of the United States and represents a variety of organic compounds of pharmaceutical interests. This set was also compiled by Cheng et al. [9]. The remaining two test sets, Star set and Non-star set, were publicly available and originated from a monograph of Avdeef [54]. Star set comprises 223 compounds that are part of BioByte Star set and have been widely used to develop prediction method. The Non-star set contains 43 compounds that represent relatively new chemical structures and properties. The compound list and corresponding partition coefficient is available for download at http://ochem.eu/article/17434. We also made an attempt to expand our training set by searching the NIH database as other software packages use a large number of molecules for supervised learning. In this way, more than 3000 additional molecules were added to the training set.

Aqueous solubility data sets

In order to develop and validate prediction models for aqueous solubility, several well-defined aqueous solubility datasets were used. Firstly, a diverse data set of 1708 molecules proposed by Wang et al. [1] was used to verify the predictive power of descriptors. Both leave-one-out and 10-fold cross-validation were carried out on this set. Furthermore, we also tested our models on a relatively small set with independent test sets [2]. As Hou [2] suggested, we also removed some molecules from the training set to ensure that training set and test set have no overlapping molecules.

Statistics of datasets

A summary of data sets used for the proposed models is given in Table 1.

logP data Number of molecules logS data Number of molecules
logP train set 8199 logS train set 1 [1] 1708
FDA test set 406 logS train set 2 [2] 1290 (1207 for test set 2)
Star test set 223 logS test set 1 [2] 21
Nonstar test set 43 logS test set 1 [2] 120
Table 1: Summary of logP and logS data sets used

ii.b Element specific topological descriptors (ESTD)

A brief introduction is given to persistent homology and element specific persistent homology (ESPH), followed by a detailed example to illustrate the persistent homology characterization of small molecules. A refined version of ESPH and corresponding ESTD construction are also discussed.

ii.b.1 Persistent homology

Persistent homology is a branch of algebraic topology that defines topological spaces in terms of algebraic structures. It is a main workhorse for topological data analysis, which offers topological simplification of data complexity. Unlike conventional physical or chemical approaches, persistent homology captures the underlying topological connectivity of small molecules directly from atomic coordinates, i.e., point cloud data in . Mathematically, isolated atoms of a molecule are 0-simplices. The connectivity among atoms defines high dimensional simplexes. For example, linked two atoms (a line segment) gives rise to a 1-simplex and mutually linked 3 atoms in a triangular shape is called a 2-simplex. A mutually linked four-atom tetrahedron is 3-simplex and so on and so forth.

Appropriate collection of simplices forms a simplicial complex, which is a topological space consisting of vertices (points), edges (line segments), triangles, and their high dimensional counterparts. Simplicial homology can be defined on the basis of simplicial complex and can be used to analyze topological invariants, i.e., Betti numbers. Physically, Betti-0, Betti-1 and Betti-2 describe numbers of independent components, rings and cavities, respectively. However, it is important to note that topological connectivity among a set of atoms in a molecule does not follow their physical relations, i.e., covalent bonds, hydrogen bonds and van der Waals bonds. Instead, it is defined by a filtration parameter, or artificial ball radius (not the atomic radius) for each atom. Therefore, at a given filtration radius, one obtains a set of simplices, and thus Betti numbers for a molecule. In persistent homology, the filtration radius varies continuously from zero to a large number until no meaningful topological structure can be created further [55, 43]. Therefore, persistent homology computes topological invariants of a given molecule at different spatial scales which correspond to different geometric shapes (simplices) and thus different topological connectivities. The persistence of topological invariants over the filtration for a given molecule can be recorded in barcodes or persistent diagrams [56, 57]. The barcode representation of persistent homology is utilized in the present work to construct ESTDs. Readers are referred to Ref. [44] for a detailed while still simple introduction of persistent homology.

The necessity of ESPH and an example

Primitive persistent homology treats all atoms in an equal footing, which neglects chemical and physical properties of molecules. To obtain an accurate representation of a given molecule, it is necessary to at least distinguish different element types and construct element specific topological descriptors (ESTDs) [50, 51, 51]. Figure 1 is a detailed example of how our ESTDs are calculated and how they can reveal the structure information of cyclohexane. An all-element representation of cyclohexan is given in Fig 0(a), where carbon atoms are in green and hydrogen atoms are in white. As we can see from its barcode plot Fig. 0(c), there are 18 Betti-0 bars that correspond to 18 atoms at the very beginning, 12 of which disappear when the filtration value gets 1.08Å. It indicates that each carbon atom has merged with its closest 2 hydrogen atoms as the filtration value becomes larger than the length of C-H bond and these three atoms are regarded as a single connected component. When the filtration value increases to 1.44Å, a Betti-1 bar emerges which means that a hexagonal carbon ring is captured and there is only one connected component left. As the filtration value eventually exceeds the radius of the hexagon, the ring structure disappears and that is why there is no Betti-1 bar. The longest Betti-0 bar corresponds to the existence of the connected component. When only carbon atoms are selected (C element), it is relatively straightforward to interpret the barcode plot. The cutoff where 5 Betti-0 bars disappear corresponds to the C-C bond length and the Betti-1 bar represents the existence of the hexagonal carbon ring.

(a) All-element representation of cyclohexane
(b) C element representation of cyclohexane
(c) Barcode plot of the all-element representation
(d) Barcodes of the C element representation
Figure 1: Cyclohexane and its persistent homology barcode plots. In subfigure(a) and (b), cyclohexane is shown with all elements and carbon element selected, respectively. In subfigure (c) and (d), from top to bottom, the results are for Betti-0 and Betti-1, respectively
The challenge for primitive persistent homology

Aforementioned persistent homology, although is able to capture the information such as covalent bonds between different atom types easily as shown in Fig. 1, does not necessarily reflect intramolecular interactions such as hydrogen bonds and van der Waals interaction, which is not ideal for the purpose of small molecule modeling. In other words, the Betti-0 bar between two atoms with certain hydrogen bonding or van der Waals cannot be captured since there already exist shorter Betti-0 bars between them (essentially covalent bonds). Thus it is important to redefine the distance between atom at and atom at as following:

(1)

where and are the atomic radius of atoms and , respectively, and is the bond length deviation in the data set. Here is a large number which is set to be greater than the maximal filtration value, and is the Euclidean distance between atom and atom , or equivalently,

(2)

By setting the distance between two close atoms to an sufficiently large number, we should be able to capture intramolecular interactions since a connection longer than filtration value is automatically neglected during persistent homology computation.

ii.b.2 ESTD construction

Inspired by classic atom-additive models for partition coefficient prediction, we utilize a total of 61 basic element types calculated by antechamber [58, 59] using general amber force field (GAFF) [58]. Atoms of given atom type and their appropriate combinations are selected to construct Vietoris-Rips complex and ESTDs are subsequently calculated.

It is also important to construct ESTDs via a small bin size. As the example above shows, barcodes at different cutoffs give rise to a variety of information such as covalent and non-covalent bonds. Specifically we can divide the barcodes into several small bins and then extract features from each bin. A complete list of ESTDs used in this study is shown in Ref. 2. Group 1 ESTDs focuses on different atom types, Group 2 is to capture the occurrences of non-covalent bonding and Group 3 mainly highlights the strength of non-covalent bonding and van der Waals interactions. Note that statistics of birth or death values in Group 3 refer to maximum, minimum, mean, and summation.

Feature group Element used Descriptors
Group 1 One element: Counts of Betti-0 bars for each element type with a total
where of 61 different element types calculated with GAFF [58]
Group 2 Two element types: {}, where Counts of Betti-0 bars with birth or death values
{C, O, N}, and falling within each bin
Group 3 Two element types: {}, where Statistics of birth or death values
{C, O, N}, and for all Betti-0 bars (consider all birth and death values)
Table 2: Detailed ESTD descriptions

The essence of ESTDs is to offer new insight to small molecule modeling by topological modeling and topological learning. By constructing a topological feature vector for the -th molecule of task , we are readily to combine topological learning with advanced machine learning algorithms, with further details to be discussed in Section II.D.

ii.c Additional molecular descriptors

In addition to ESTDs introduced in the previous subsection, we generate 633 2D molecule descriptors by CheomPy [60] for each molecule. The feature pool contains feature groups such as E-state descriptors. There are a total of 13 categories for these 2D molecule descriptors - 30 molecular constitutional descriptors, 35 topological descriptors, 44 molecular connectivity indices, 7 Kappa shape descriptors, 64 Burden descriptors, 245 E-state indices, 21 Basak information indices, 96 autocorrelation descriptors, 6 molecular property descriptors, 25 charge descriptors, and 60 MOE-type descriptors. A more detailed description of features and the CheomPy software are available on line at https://code.google.com/archive/p/pychem/downloads.

In order to improve the overall performance, we also combine these features with ESTDs to create ESTD. For consistency reasons, only molecules whose features can be calculated by both our ESTD software and ChemoPy software are used for training purpose. It is worth to mention that our ESTD approaches are applicable to all molecules whereas ChemoPy has difficulty in dealing with some molecules. Separate results and discussions for different sets of descriptors will also be conducted in later sections.

ii.d Topological learning algorithms

In this section, we present methods and algorithms of topology based multi-task learning for simultaneous predictions of partition coefficient and aqueous solubility. Ensemble methods, including random forest (RF) and gradient boosting decision tree (GBDT), and neural network architectures are discussed. A detailed description of multi-task neural network is also provided.

ii.d.1 Ensemble methods

Ensemble methods have been widely used to solve QSAR problems and have achieved some state-of-the-art results. They naturally handle correlations between descriptors and are generally insensitive to parametrization and feature selection due to bagging operations. In this work we choose gradient boosting tree as our baseline method and implement it using the scikit-learn package [61] (version 0.13.1). The number of estimators for random forest is set to 8000 because a further increase in the number does not essentially improve accuracy. For each set, 50 runs were done and the average of 50 predictions is taken as the final prediction.

Figure 2: An illustration of the MT-DNN architecture.

ii.d.2 Multi-task learning and deep neural networks

Multi-task learning

The idea of multi-task learning (MTL) is to learn ‘inductive bias’ from related tasks to improve accuracy, using the same representation. In other words, MTL aims at learning a shared, generalized feature representation for multiple tasks and potentially gives better predictions. In our work, we assume that the underlying molecular mechanism of partition coefficient and aqueous solubility shares commonalities and differences, which can be learned jointly.

Multi-task deep neural network (MT-DNN)

A deep neural network has a wider and deeper architecture as compared to the traditional artificial neural network – it consists of more layers and more neurons in each layer and reveals the facets of input features at different levels. As for this multitask framework, different tasks share the first few dense layers, on top of which individual predictor is attached to for each specific task (one for partition coefficient and one for aqueous solubility). An illustration of the multitask deep neural network is shown in Fig. 2.

In this study, we have a total of 2 tasks - one for logP prediction and the other for logS prediction. Suppose that there are molecules in -th task and the -th molecule for -th task can be represented by a topological feature vector . Given the training data , where , , with being experimental value (logP or logS) for the -th molecule of task , the objective topological learning is to minimize the function for different tasks:

(3)

where is a functional of topological feature vectors to be learned, parametrized by weight matrix and bias term , and can be cross entropy loss for classification and mean squared error for regression. Since logP and logS are measured quantitatively, the loss function ( or ) to be minimized is defined as:

(4)

A wide range of parameters for MT-DNN has been tested and following parameters in Table 3 are used for training and testing in this study although our parameter search is by no means exhaustive:

# of hidden layers 7
# of neurons on each dense layer 1000 for first 4 layers and 100 for the next 3 layers
# of neurons on output layer 2 with linear activation
Learning rate 0.0001
Table 3: Parameters for MT-DNN.

Note that Adam (adaptive momentum estimation) [62] is used as gradient update method. For stability purpose during training, all data were normalized with zero-mean and unit variance.

Remark

Although the units for these two tasks are different (partition coefficient uses log unit as units, and solubility prediction use as units), we can still train a MT-DNN model for them simultaneously - shared layers learn their inherent commonalities while individual layers deal with their differences.

ii.e Evaluation metrics

Several different evaluation metrics, including root mean squared error (RMSE), Pearson correlation coefficient (R) and mean unsigned error (MUE), were used to evaluate the performances of different models. These methods are defined as below:

(5)
(6)

and

(7)

where represents or , is the total number of molecules in the test set, and stand for the experimental and predicted value for the th molecule, respectively, and is the average of predicted and experimental value for the entire test set, respectively.

Additionally, Tetko et al. [63] proposed and an additional metric based on the difference between experimental and predicted () was proposed. The percentage within each error range was considered.

  • If , prediction is considered to be “acceptable";

  • If , prediction is considered to be “disputable";

  • If , prediction is considered to be “unacceptable".

As a result, we use this metric along with to evaluate model performances on Star set and Non-star set described below.

Iii Results

In this section, we present the results of the proposed ESPH methods in conjugation with random forest and multi-task deep neural networks for variety of data sets, including partition coefficient and solubility test sets. Otherwise stated, different tasks are trained together in the same network. Besides, we would like to introduce some notations for easier reference. ESTD-1 contains 61 Betti-0 bar-based ESTDs (Group 1 in Table 2), ESTD-2 contains all ESTDs listed in Table 2 (Group 1-3).

iii.a Partition coefficient prediction

Training set cross-validation

In order to have an idea of how our topological representation would work for partition coefficient, a 10-fold cross-validation is performed using baseline method GBDT. Note that 50 runs were done to achieve the final results as randomness is involved and the results are summarized in Table 4.

Method RMSE MUE
GBDT-ESTD 0.924 0.45 0.32
XLOGP3-AA [9] 0.904 0.50 0.39
Table 4: Results of 10-fold cross validation on the partition coefficient training set,

It can be seen that our descriptors better than XLOGP3 software [9] given the same training data, and thus demonstrates great predictive power. It would be very interesting to see the performances of our MT-DNN compared to XLOGP3 and GBDT.

FDA set

The first test set that we would like to apply our model to is the FDA test set. A molecule that contains Hg was dropped due to the difficulty of computation. A major challenge of this set is that its structures are more complex than that of the training set, and the partition coefficient range spans over nearly 12 units. A series of prediction methods [9] including our multi-task neural networks, is applied to this set and their results are summarized in Table 5 for a comparison with ours.

Method RMSE MUE
MT-ESTD-1 0.920 0.57 0.28
MT-ESTD-1 0.909 0.60 0.27
MT-ESTD-2 0.909 0.60 0.34
ALOGPS 0.908 0.60 0.42
MT-ESTD-2 0.891 0.66 0.44
XLOGP3 0.872 0.72 0.51
XLOGP3-AA 0.847 0.80 0.57
CLOGP 0.838 0.88 0.51
TOPKAT 0.815 0.88 0.56
ALOGP98 0.80 0.90 0.64
KowWIN 0.771 1.10 0.63
HINT 0.491 1.93 1.30
Table 5: Results of different logP prediction methods on 406 FDA-approved drugs [9], ranked by . Two molecules were dropped for our model evaluation due to feature generation failure of ChemoPy

As we can see from Table 5, our multi-task model gives the best prediction in terms of R, RMSE and MUE. Specifically the small MUE of our model (0.29 log units) indicates that our predictions are less biased than other methods tested, except for some outliers. Also note that the training set is completely independent of the test set which shows the applicability of our multi-task architecture. We also build models with the same architecture when additional molecules gathered from NIH-database are included as there is no guarantee that ALOGPS are completely independent of the testset. It turns out that the accuracy can be greatly improved.

Star set and Non-star set

Star set and Non-star set were proposed by Tetko [63] as two benchmark sets for evaluating partition coefficient models. Over 20 different models were tested on these two sets. It should be emphasized that for these sets, different models are trained on different training sets and their overlap with the test sets is unknown. Thus it makes more sense to merge our 8199 training set with additional molecules in NIH database and see how additional training data can benefit the overall performances. Results of different models on these two sets can be found in Table 6. Notice that models trained with additional data are labeled with superscript *.

For star set, we achieve RMSE of 0.53 log units with other popular commercial software packages such as ACD/logP and CLOGP, in addition to a high acceptable prediction percentage (75%, rank 3) as well as a low unacceptable rate (5 %, rank 3). For non-star set, most methods do not give accurate predictions as the structures in this set are relatively new and complex. Our 49% acceptable rate ranks number 2 among all predictors, though RMSE is relatively high due to a few large outliers (rank 5). The results are satisfactory as commercial software packages generally use a much larger training set than that ours. In general, when there exist more overlapped molecules in the training set, the test results will be significantly improved. As Table 6 indicates, our MT-DNN-ESTD models achieve a substantial improvement over XLOP3 for star set, while maintaining the same level of accuracy for non-star set. The performances of our MT-DNN-ESTD models suggest that our models are able to predict and accurately. In fact, the predictive power can potentially be further improved once more molecules are incorporated into the training set. Thus we also extend our original 8199 training set by adding molecules in both Star set and Non-star set. When we use the extended training set, much higher accuracy is achieved as listed in Table 6. Again we observe that MT-DNN outperforms RF with the same set of features.

Star Set () Non-Star Set ()
% of Molecules % of Molecules
Within Error Range Within Error Range
Method RMSE < 0.5 <1 > 1 RMSE < 0.5 <1 > 1
AB/LogP 0.41 84 12 4 1.00 42 23 35
S+logP 0.45 76 22 3 0.87 40 35 26
MT-ESTD-1-AD 0.49 77 16 7 0.98 49 19 33
MT-ESTD-2 0.49 74 21 5 0.97 49 23 28
ACD/logP 0.50 75 17 7 1.00 44 32 23
CLOGP 0.52 74 20 6 0.91 47 28 26
VLOGP OPS 0.52 64 21 7 1.07 33 28 26
ALOGPS 0.53 71 23 6 0.82 42 30 28
MT-ESTD-1 0.53 75 17 8 0.97 47 28 26
MT-ESTD-1-AD 0.53 73 18 9 1.00 37 30 33
MT-ESTD-2-AD 0.53 71 19 9 1.01 47 19 35
MT-ESTD-1 0.55 72 18 10 1.01 33 28 40
MT-ESTD-2 0.56 66 23 11 1.06 35 33 33
MiLogP 0.57 69 22 9 0.86 49 30 21
XLOGP3 0.62 60 30 10 0.89 47 23 30
KowWIN 0.64 68 21 11 1.05 40 30 30
CSLogP 0.65 66 22 12 0.93 58 19 23
ALOGP 0.69 60 25 16 0.92 28 40 33
MolLogP 0.69 61 25 14 0.93 40 25 26
ALOGP98 0.70 61 26 13 1.00 30 37 33
OsirisP 0.71 59 26 16 0.94 42 26 33
VLOGP 0.72 65 22 14 1.13 40 28 33
TLOGP 0.74 67 16 13 1.12 30 37 30
ABSOLV 0.75 53 30 17 1.02 49 28 23
QikProp 0.77 53 30 17 1.24 40 26 35
QuantlogP 0.80 47 30 22 1.17 35 26 40
SLIPPER-2002 0.80 62 22 15 1.16 35 23 42
COSMOFrag 0.84 48 26 19 1.23 26 40 23
XLOGP2 0.87 57 22 20 1.16 35 23 42
QLOGP 0.96 48 26 25 1.42 21 26 53
VEGA 1.04 47 27 26 1.24 28 30 42
CLIP 1.05 41 25 30 1.54 33 9 49
LSER 1.07 44 26 30 1.26 35 16 49
MLOGP(Sim+) 1.26 38 30 33 1.56 26 28 47
NC+NHET 1.35 29 26 45 1.71 19 16 65
SPARC 1.36 45 22 32 1.70 28 21 49
HINTLOGP 1.80 34 22 44 2.72 30 5 65
Table 6: Benchmark test results [63] of different logP prediction methods on both Star set molecules and Non-Star set molecules. The superscript means that Star molecules and Non-star molecules were included in the training.

iii.b Aqueous solubility prediction

To evaluate the performances of solubility our models, several datasets are used, derived from Wang et al. [1] and Hou et al. [2]. For leave-one-out validation, only baseline method is used. For 10-fold cross-validation, the 9 remaining folds are trained together with the partition coefficient training set when evaluating the remaining fold with MT-DNN architecture.

iii.b.1 1708 set in Ref. [1]

For this dataset, both leave-one out and 10-fold cross validations are carried out in order to evaluate the performance of our models.

Leave-one-out

As MT-DNN requires a lot of computational resources, only baseline method GBDT is used for leave-one-out prediction. We use 4000 trees and 0.10 learning rate as training parameters to develop models and following results in Table 7 are achieved.

Method RMSE MUE
GBDT-1-AD 0.931 0.543 0.389
GBDT-2-AD 0.929 0.551 0.389
GBDT-2 0.910 0.621 0.457
ASMS-LOGP[1] 0.897 0.664 0.505
GBDT-1 0.893 0.683 0.494
ASMS[1] 0.884 0.707 0.547
Table 7: Leave-one-out test on the 1708 solubility dataset
10-fold cross-validation

As MT-DNN and baseline method GBDT involves randomness, we run MT-DNN and GBDT 50 times and report mean performances for all metrics. The results are summarized in Table 8. It is observed that our models yield more accurate and robust predictions than ASMS and ASMS-LOGP models do, improving the from 0.884 to 0.925. Additionally we also notice that there generally exists an improvement of MT-ESTD models over GBDT models, though, not as significant as what we see in previous partition coefficient prediction.

Method Mean (RMSD) Mean RMSE (RMSD) Mean MUE (RMSD)
MT-ESTD-1 0.925 (0.001) 0.568 (0.005) 0.393 (0.003)
MT-ESTD-2 0.924 (0.003) 0.571 (0.010) 0.395 (0.004)
GBDT-1 0.924 (0.002) 0.572 (0.006) 0.408 (0.005)
GBDT-2 0.923 (0.002) 0.571 (0.006) 0.408 (0.005)
MT-ESTD-1 0.908 (0.002) 0.630 (0.005) 0.466 (0.003)
GBDT-2 0.904 (0.002) 0.642 (0.008) 0.469 (0.005)
MT-ESTD-2 0.902 (0.002) 0.649 (0.007) 0.466 (0.005)
GBDT-1 0.889 (0.003) 0.697 (0.009) 0.502 (0.005)
ASMS[1] 0.884 (0.021) 0.699 (0.054) 0.527 (0.034)
ASMS-LOGP [1] 0.869 (0.022) 0.742 (0.053) 0.570 (0.034)
Table 8: 10-fold cross-validation on the 1708 solubility dataset

iii.b.2 Data set in Ref. [2]

We test our models on data set proposed by Hou et al. [2], where training and test sets were predefined to cover a variety of molecules. Test set 1 contains 21 commonly used compounds of pharmaceutical and environmental interest [64] and is to be trained on the original 1290 molecules. Test set 2 contains 120 molecules that were used to develop Klopman and Zhu’s group contribution model [65]. As Hou et al. [2] suggested, we remove 83 molecules that overlap with test set 2 from the training set to make prediction independent and unbiased. This reduces the size of training set for test set 2 to 1207.

Test set 1

Table 9 shows the performances of different models on test set 1. Our MT-ESTD models perform similarly to Drug-LOGS method, while achieving improvement over Klopman and Zhu’s MLR method [65] with ESTDs.

Method R RMSE
MT-ESTD-1 0.94 0.69
Drug-LOGS[2] 0.94 0.64
GBDT-2 0.94 0.71
MT-ESTD-2 0.93 0.75
GBDT-1 0.93 0.76
MT-ESTD-2 0.92 0.79
Klopman MLR [65] 0.92 0.86
GBDT-2 0.92 0.85
MT-ESTD-1 0.91 0.82
GBDT-1 0.84 1.07
Table 9: Results of test set 1 [2], where MUE was not reported
Test set 2

The results of test set 2 are summarized in Table 10. For this dataset, our MT-ESTD models gives satisfactory results with a high Pearson correlation over 0.97 across all ESTD combinations. Such results indicate that our methods are applicable to a wide variety of molecules.

Method R RMSE MUE
MT-ESTD-1 0.97 0.65 0.47
MT-ESTD-2 0.97 0.67 0.48
MT-ESTD-1 0.97 0.70 0.50
MT-ESTD-2 0.97 0.71 0.53
GBDT-2 0.97 0.73 0.50
GBDT-1 0.96 0.76 0.52
Drug-LOGS [2] 0.96 0.79 0.57
GBDT-2 0.96 0.79 0.60
GBDT-1 0.96 0.82 0.58
Group contribution [65] 0.96 0.84 0.70
Table 10: Results of Klopman and Zhu’s test set 2

Iv Discussion

In this section, we discuss the outcome of trained models from several aspects. The emphasis is put on the predictive power of our ESTDs in terms of MT-DNN and how topology based multitask learning can improve partition coefficient and aqueous solubility predictions.

iv.a ESTDs for small molecules

As the previous results indicated, there exists a common feature representation for both partition coefficient and solubility prediction. Our features come from two different categories – one that is computed solely by ESPH, and the other one that has been widely used in the development of QSAR models. Although the number of ESTDs (121) is small, it turns out that our topological feature representation of molecules has a very strong predictive power as compared to baseline method. Our ESTDs highlight atom type information and keep track of the formation of chemical bond as well as ring structures of a given molecule. Indeed, by constructing topological space of a molecule, ESTDs are able to extract useful features from ESPH computation. In fact, more detailed topological representations which more precisely reflect the covalent bond length, hydrogen bond strength and van der Waals interaction can constructed via a small bin size [52].

iv.b Multitask learning

The goal of multitask learning is to learn commonalities between different tasks, and to simultaneously improve model performances. In this study, partition coefficient and aqueous solubility were trained jointly and substantial improvements over singletask models are observed. Our results suggests that there exists shared information across these two tasks that can benefit prediction accuracy. Indeed, the original motivation for predicting logP and logS is that both coefficients closely relate to the extent to which a compound dissolves in solvent. By comparing our MT-DNN with gradient boosting tree, we find that it is beneficial to learn partition coefficient and aqueous solubility models together. Our MT-ESTD models achieve satisfactory results on various partition coefficient and aqueous solubility data sets, some of which are the state-of-the-art to our knowledge. Our ESTDs alone can give very accurate predictions, bringing us new insights by element specific persistent homology computations. In addition to ESTDs, commonly-used 2D descriptors also help to improve the overall accuracy. The attempt to learn these two related measurements together gives a boost over learning them separately and the improvement is validated on most data sets.

V Conclusion

Partition coefficient and aqueous solubility are among most important physical properties of small molecules and have significant applications to drug design and discovery in terms of lipophilic efficiency. Based on chemical and physical models, a wide variety of computational methods has been developed in the literature for the theoretical predictions of partition coefficient and aqueous solubility.

Present work introduces an algebraic topology based method, element specific persistent homology (ESPH), for simultaneous partition coefficient and aqueous solubility predictions. ESPH offers an unconventional representation of small molecules in terms of multiscale and multicomponent topological invariants. Here the multiscale representation is inherited from persistent homology, while the multicomponent formulation is developed to retain essential chemical information during the topological simplification of molecular geometric complexity. Therefore, the present ESPH gives a unique representation of small molecules that cannot be obtained by any other method. Although ESPH representation of molecules cannot by literally translated into a physical interpretation, it systematically and comprehensively encipher chemical and physical information of molecules into scalable topological invariants, and thus is ideally suited for machine learning/deep learning algorithms to decipher such information.

To predict partition coefficient and aqueous solubility, we integrate ESPH with advanced machine learning methods, including gradient boosting tree, random forest, and deep neural networks to construct topological learning strategies. Since partition coefficient and aqueous solubility are highly correlated to each other, we develop a common set of ESPH descriptors, called element specific topological descriptors (ESTDs), to represent both properties. This approach enables us to perform simultaneous predictions of partition coefficient and aqueous solubility using a topology based multi-task deep learning strategy.

To test the representational of ESPH and the predictive power of the proposed topological multi-task deep learning strategy, we consider some commonly used data sets, including two benchmark test sets, for partition coefficient, as well as additional solubility data sets. Extensive cross validations and benchmark tests indicate that the proposed topological multi-task strategy offers some of the most accurate predictions of partition coefficient and aqueous solubility.

Availability

Our software is available as an online server at http://weilab.math.msu.edu/TopP-S/.

Acknowledgment

This work was supported in part by NSF Grants DMS-1721024 and IIS-1302285, and MSU Center for Mathematical Molecular Biosciences Initiative.

References

  • [1] J. Wang, G. Krudy, T. Hou, W. Zhang, G. Holland, and X. Xu, “Development of reliable aqueous solubility models and their application in druglike analysis,” Journal of chemical information and modeling, vol. 47, no. 4, pp. 1395–1404, 2007.
  • [2] T. Hou, K. Xia, W. Zhang, and X. Xu, “Adme evaluation in drug discovery. 4. prediction of aqueous solubility based on atom contribution approach,” Journal of chemical information and computer sciences, vol. 44, no. 1, pp. 266–275, 2004.
  • [3] A. Leo, D. Hoekman, and C. Hansch, “Hydrophobic, electronic, and steric constants,” 1995.
  • [4] H. Van De Waterbeemd and E. Gifford, “ADMET in silico modelling: towards prediction paradise?,” Nature reviews Drug discovery, vol. 2, no. 3, pp. 192–204, 2003.
  • [5] C. Hansch and T. Fujita, “p-- analysis. a method for the correlation of biological activity and chemical structure,” Journal of the American Chemical Society, vol. 86, no. 8, pp. 1616–1626, 1964.
  • [6] T. Fujita, J. Iwasa, and C. Hansch, “A new substituent constant, , derived from partition coefficients,” Journal of the American Chemical Society, vol. 86, no. 23, pp. 5175–5180, 1964.
  • [7] A. Leo, C. Hansch, and D. Elkins, “Partition coefficients and their uses,” Chemical reviews, vol. 71, no. 6, pp. 525–616, 1971.
  • [8] A. K. Ghose and G. M. Crippen, “Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. modeling dispersive and hydrophobic interactions,” Journal of chemical information and computer sciences, vol. 27, no. 1, pp. 21–35, 1987.
  • [9] T. Cheng, Y. Zhao, X. Li, F. Lin, Y. Xu, X. Zhang, Y. Li, R. Wang, and L. Lai, “Computation of octanol-water partition coefficients by guiding an additive model with knowledge,” Journal of chemical information and modeling, vol. 47, no. 6, pp. 2140–2148, 2007.
  • [10] W. M. Meylan and P. H. Howard, “Atom/fragment contribution method for estimating octanol–water partition coefficients,” Journal of pharmaceutical sciences, vol. 84, no. 1, pp. 83–92, 1995.
  • [11] W. M. Meylan and P. H. Howard, “Estimating log P with atom/fragments and water solubility with log P,” Perspectives in drug discovery and design, vol. 19, no. 1, pp. 67–84, 2000.
  • [12] A. J. Leo, “Calculating log poct from structures,” Chemical Reviews, vol. 93, no. 4, pp. 1281–1306, 1993.
  • [13] J. Chou and P. C. Jurs, “Computer-assisted computation of partition coefficients from molecular structures using fragment constants,” Journal of Chemical Information and Computer Sciences, vol. 19, no. 3, pp. 172–178, 1979.
  • [14] A. A. Petrauskas and E. A. Kolovanov, “ACD/Log P method description,” Perspectives in drug discovery and design, vol. 19, no. 1, pp. 99–116, 2000.
  • [15] M. J. Walker, “Training ACD/LogP with experimental data,” QSAR & Combinatorial Science, vol. 23, no. 7, pp. 515–520, 2004.
  • [16] H. Zhu, A. Sedykh, S. K. Chakravarti, and G. Klopman, “A new group contribution approach to the calculation of LogP,” Current Computer-Aided Drug Design, vol. 1, no. 1, pp. 3–9, 2005.
  • [17] A. Y. Sedykh and G. Klopman, “A structural analogue approach to the prediction of the octanol-water partition coefficient,” Journal of chemical information and modeling, vol. 46, no. 4, pp. 1598–1603, 2006.
  • [18] I. V. Tetko and V. Y. Tanchuk, “Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program,” Journal of chemical information and computer sciences, vol. 42, no. 5, pp. 1136–1145, 2002.
  • [19] I. V. Tetko and P. Bruneau, “Application of ALOGPS to predict 1-octanol/water distribution coefficients, logP, and logD, of AstraZeneca in-house database,” Journal of pharmaceutical sciences, vol. 93, no. 12, pp. 3103–3110, 2004.
  • [20] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney, “Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings,” Advanced drug delivery reviews, vol. 23, no. 1-3, pp. 3–25, 1997.
  • [21] L. Di and E. H. Kerns, “Biological assay challenges from compound solubility: strategies for bioassay optimization,” Drug discovery today, vol. 11, no. 9, pp. 446–451, 2006.
  • [22] B.-G. DUAN, Y. LI, J. LI, T.-J. CHENG, and R.-X. WANG, “An empirical additive model for aqueous solubility computation: success and limitations,” Acta Physico-Chimica Sinica, vol. 28, no. 10, pp. 2249–2257, 2012.
  • [23] J. Wang, T. Hou, and X. Xu, “Aqueous solubility prediction based on weighted atom type counts and solvent accessible surface areas,” Journal of Chemical Information and Modeling, vol. 49, no. 3, pp. 571–581, 2009.
  • [24] J. Wang, G. Krudy, T. Hou, W. Zhang, G. Holland, and X. Xu, “Development of reliable aqueous solubility models and their application in druglike analysis,” Journal of chemical information and modeling, vol. 47, no. 4, pp. 1395–1404, 2007.
  • [25] S. H. Yalkowsky and S. C. Valvani, “Solubility and partitioning i: solubility of nonelectrolytes in water,” Journal of pharmaceutical sciences, vol. 69, no. 8, pp. 912–922, 1980.
  • [26] J. C. Dearden, “In silico prediction of aqueous solubility,” Expert opinion on drug discovery, vol. 1, no. 1, pp. 31–52, 2006.
  • [27] R. Dannenfelser, M. Paric, M. White, and S. H. Yalkowsky, “A compliation of some physico-chemical properties for chlorobenzenes,” Chemosphere, vol. 23, no. 2, pp. 141–165, 1991.
  • [28] W. L. Jorgensen and E. M. Duffy, “Prediction of drug solubility from structure,” Advanced drug delivery reviews, vol. 54, no. 3, pp. 355–366, 2002.
  • [29] R. Caruana, “Multitask learning,” in Learning to learn, pp. 95–133, Springer, 1998.
  • [30] C. Widmer and G. Rätsch, “Multitask learning in computational biology.,” in ICML Unsupervised and Transfer Learning, pp. 207–216, 2012.
  • [31] Q. Xu and Q. Yang, “A survey of transfer and multitask learning in bioinformatics,” Journal of Computing Science and Engineering, vol. 5, no. 3, pp. 257–268, 2011.
  • [32] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
  • [33] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8599–8603, IEEE, 2013.
  • [34] R. Socher, Y. Bengio, and C. D. Manning, “Deep learning for nlp (without magic),” in Tutorial Abstracts of ACL 2012, pp. 5–5, Association for Computational Linguistics, 2012.
  • [35] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014.
  • [36] G. E. Dahl, N. Jaitly, and R. Salakhutdinov, “Multi-task neural networks for qsar predictions,” arXiv preprint arXiv:1406.1231, 2014.
  • [37] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik, “Deep neural nets as a method for quantitative structure–activity relationships,” Journal of chemical information and modeling, vol. 55, no. 2, pp. 263–274, 2015.
  • [38] A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter, “DeepTox: toxicity prediction using deep learning,” Frontiers in Environmental Science, vol. 3, p. 80, 2016.
  • [39] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, “Massively multitask networks for drug discovery,” arXiv preprint arXiv:1502.02072, 2015.
  • [40] Z. X. Wu and G. W. Wei, “Comparison of multi-task convolutional neural network (MT-CNN) and a few other methods for toxicity prediction ,” Toxicological Sciences, Submitted, 2017.
  • [41] A. Lusci, G. Pollastri, and P. Baldi, “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules,” Journal of chemical information and modeling, vol. 53, no. 7, pp. 1563–1575, 2013.
  • [42] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” Discrete Comput. Geom., vol. 28, pp. 511–533, 2002.
  • [43] A. Zomorodian and G. Carlsson, “Computing persistent homology,” Discrete Comput. Geom., vol. 33, pp. 249–274, 2005.
  • [44] K. L. Xia and G. W. Wei, “Persistent homology analysis of protein structure, flexibility and folding,” International Journal for Numerical Methods in Biomedical Engineerings, vol. 30, pp. 814–844, 2014.
  • [45] K. L. Xia, X. Feng, Y. Y. Tong, and G. W. Wei, “Persistent homology for the quantitative prediction of fullerene stability,” Journal of Computational Chemsitry, vol. 36, pp. 408–422, 2015.
  • [46] K. L. Xia, Z. X. Zhao, and G. W. Wei, “Multiresolution topological simplification,” Journal Computational Biology, vol. 22, pp. 1–5, 2015.
  • [47] K. L. Xia, Z. X. Zhao, and G. W. Wei, “Multiresolution persistent homology for excessively large biomolecular datasets,” Journal of Chemical Physics, vol. 143, p. 134103, 2015.
  • [48] B. Wang and G. W. Wei, “Object-oriented persistent homology,” Journal of Computational Physics, vol. 305, pp. 276–299, 2016.
  • [49] Z. Cang, L. Mu, K. Wu, K. Opron, K. Xia, and G.-W. Wei, “A topological approach to protein classification,” Molecular based Mathematical Biologys, vol. 3, pp. 140–162, 2015.
  • [50] Z. X. Cang and G. W. Wei, “Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology,” Bioinformatics (arXiv preprint arXiv:1703.10966), Revised, 2017.
  • [51] Z. X. Cang and G. W. Wei, “Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction ,” International Journal for Numerical Methods in Biomedical Engineering, Accepted, 2017.
  • [52] Z. X. Cang and G. W. Wei, “TopologyNet: Topology based deep convolutional neural networks for biomolecular property predictions ,” Plos Computational Biology (arXiv preprint arXiv:1704.00063), Submitted, 2017.
  • [53] C. Hansch, A. Leo, D. Hoekman, et al., Exploring QSAR: fundamentals and applications in chemistry and biology, vol. 557. American Chemical Society Washington, DC, 1995.
  • [54] A. Avdeef, Absorption and drug development: solubility, permeability, and charge state. John Wiley & Sons, 2012.
  • [55] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” in Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 454–463, IEEE, 2000.
  • [56] G. Carlsson, A. Zomorodian, A. Collins, and L. J. Guibas, “Persistence barcodes for shapes,” International Journal of Shape Modeling, vol. 11, no. 2, pp. 149–187, 2005.
  • [57] R. Ghrist, “Barcodes: The persistent topology of data,” Bull. Amer. Math. Soc., vol. 45, pp. 61–75, 2008.
  • [58] J. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman, and D. A. Case, “Development and testing of a general amber force field,” Journal of computational chemistry, vol. 25, no. 9, pp. 1157–1174, 2004.
  • [59] J. Wang, W. Wang, P. A. Kollman, and D. A. Case, “Automatic atom type and bond type perception in molecular mechanical calculations,” Journal of molecular graphics and modelling, vol. 25, no. 2, pp. 247–260, 2006.
  • [60] D.-S. Cao, Q.-S. Xu, Q.-N. Hu, and Y.-Z. Liang, “Chemopy: freely available python package for computational biology and chemoinformatics,” Bioinformatics, p. btt105, 2013.
  • [61] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [62] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [63] R. Mannhold, G. I. Poda, C. Ostermann, and I. V. Tetko, “Calculation of molecular lipophilicity: state-of-the-art and comparison of log P methods on more than 96,000 compounds,” Journal of pharmaceutical sciences, vol. 98, no. 3, pp. 861–893, 2009.
  • [64] G. Klopman, S. Wang, and D. M. Balthasar, “Estimation of aqueous solubility of organic molecules by the group contribution approach. application to the study of biodegradation,” Journal of chemical information and computer sciences, vol. 32, no. 5, pp. 474–482, 1992.
  • [65] G. Klopman and H. Zhu, “Estimation of the aqueous solubility of organic molecules by the group contribution approach,” Journal of chemical information and computer sciences, vol. 41, no. 2, pp. 439–445, 2001.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
24280
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description