TopP-S: Persistent homology based multitask deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility
Abstract
Aqueous solubility and partition coefficient are important physical properties of small molecules. Accurate theoretical prediction of aqueous solubility and partition coefficient plays an important role in drug design and discovery. The prediction accuracy depends crucially on molecular descriptors, which are typically derived from a theoretical understanding of the chemistry and physics of small molecules. The present work introduces an algebraic topology based method, called element specific persistent homology (ESPH), as a new representation of small molecules that is entirely different from conventional chemical and/or physical representations. ESPH describes molecular properties in terms of multiscale and multicomponent topological invariants. Such a topological representation is systematic, comprehensive, and scalable with respect to molecular size and composition variations. However, it cannot be literally translated into a physical interpretation. Fortunately, it is readily suitable for machine learning methods, rendering topological learning algorithms. Due to the inherent correlation between solubility and partition coefficient, a uniform ESPH representation is developed for both properties, which facilitates multitask deep neural networks for their simultaneous predictions. This strategy leads to more accurate predictions on relatively small data sets. A total of six data sets are considered in the present work to validate the proposed topological and multitask deep learning approaches. It is demonstrated that the proposed approaches achieve some of the most accurate predictions of aqueous solubility and partition coefficient. Our software is available online at http://weilab.math.msu.edu/TopPS/.
Key words: persistent homology, partition coefficient, aqueous solubility, multitask learning, deep neural networks, topological learning
I Introduction
The partition coefficient, denoted P and defined as the ratio of concentrations of a solute in a mixture of two immiscible solvents at equilibrium, is of great importance in pharmacology. It measures the drug-likeness of a compound as well as its hydrophobic effect on the human body. The logarithm of this coefficient, i.e., logP, has proved to be one of the key parameters in drug design and discovery. An optimal logP, along with low molecular weight and low polar surface area, plays an important role in governing the kinetic and dynamic aspects of drug action. In particular, Hansch et al. [3] gave a detailed description of how lipophilicity impacts pharmacodynamics. That said, surveys show that approximately half of drug candidates fail to reach the market due to unsatisfactory pharmacokinetic properties or toxicity [4], which makes accurate logP predictions all the more important.
The extent of existing reliable experimental data is negligible compared to the tremendous number of compounds whose data are practically needed. Therefore, computational prediction of the partition coefficient is an indispensable approach in modern drug design and discovery. Since the pioneering work of Hansch et al. [5, 6, 7], a large variety of octanol-water partition coefficient predictors has been developed over the past few decades. Many of these methods are generally called quantitative structure-activity relationship (QSAR) models. In general, these models can be categorized into atom-based additive methods, fragment/compound-based methods, and property-based methods. One of the atom-based additive methods, first proposed by Crippen and coworkers [8], is essentially purely additive and effectively a table lookup per atom. Later on, XLOGP3, a refined version of the atom-based additive methods, was developed [9]. This approach considers various atom types, contributions from neighbors, as well as correction factors that help overcome known difficulties in purely atomistic additive methods. However, additivity may fail in some cases where unexpected contributions to logP occur, especially for complicated structures. Fragment/compound-based predictors, instead of employing information from single atoms, are built at the compound or fragment level. Compound or fragment contributions are then added up together with correction factors. Popular fragment methods include KOWWIN [10, 11], CLOGP [12, 13], ACD/LOGP [14, 15], and KLOGP [16, 17]. A major challenge for fragment/compound-based methods is the optimal classification of "building blocks". The number of fragments and corrections involved in current methods ranges from hundreds to thousands, and could be even larger if remote atoms are also taken into account. This fact may lead to technical problems in practice and may also cause overfitting in modeling. The third category is property-based.
Basically, property-based methods determine the partition coefficient using properties, empirical approaches, three-dimensional (3D) structures (e.g., implicit solvent models, molecular dynamics (MD) methods), and topological or electrostatic indices. Most of these methods are modeled using statistical tools such as the associative neural network (ALOGPS) [18, 19]. It is worth mentioning that property-based methods are relatively computationally expensive and depend largely on the choice of descriptors and the accuracy of computations. This to some extent results in a preference for methods in the first two categories over those in the third.
Another closely related chemical property is aqueous solubility, denoted by S, or its logarithm logS. In drug discovery and other related pharmaceutical fields, it is of great significance to identify molecules with undesirable water solubility at early stages, as solubility affects absorption, distribution, metabolism, and elimination (ADME) processes [20, 21]. Quantitative structure-property relationship (QSPR) models, along with atom/group additive models [22, 2, 23, 24], have been developed to predict solubility. For example, QSPR models assume that aqueous solubility correlates with experimental properties such as the aforementioned partition coefficient and melting point [25], or with molecular descriptors such as solvent accessible area. However, due to the difficulty of experimentally measuring solubility for certain compounds, the experimental data can contain errors up to 1.5 log units [26, 27] and no less than 0.6 log units [28]. Such high variability poses a challenge to solubility prediction.
Both partition coefficient and aqueous solubility reveal how a solute dissolves in solvent(s). Therefore it is reasonable to assume that there exists a shared feature representation across these two related tasks. In machine learning theory, multitask (MT) learning is designed to take advantage of shared feature representations of correlated properties. It learns the so-called "inductive bias" from related tasks to improve accuracy using the same representation [29]. In other words, MT learning aims at learning a shared and generalized feature representation from multiple tasks and has brought new insights into the study of bioinformatics. Successful applications include splice-site and MHC-I binding prediction [30] in sequence biology, gene expression analysis, and systems biology [31]. MT learning becomes more efficient when it is incorporated with deep learning (DL) strategies. DL has achieved state-of-the-art results in signal and information processing fields such as speech recognition [32, 33] and natural language processing [34, 35], as well as in toxicity prediction [36, 37, 38, 39, 40] and aqueous solubility prediction [41].
Geometric descriptors are commonly used in machine learning to represent small molecules. In fact, geometric representations of molecules, particularly macromolecules, often involve too much structural detail and thus may become intractable for large and complex biomolecular data sets. In contrast, topology offers the highest level of abstraction and truly metric-free representations of molecules. However, traditional topology incurs too much geometric reduction to be practically useful for molecules. Persistent homology bridges classical geometry and topology, offering a multiscale representation of molecular systems [42, 43]. In doing so, it creates a family of topologies via a filtration parameter, which leads to one-dimensional topological invariants, i.e., barcodes of Betti numbers. The physical interpretations of Betti-0, Betti-1 and Betti-2 barcodes are isolated components, circles, and cavities, respectively. Persistent homology has been applied to the modeling and prediction of nanoparticles, proteins and other biomolecules [44, 45, 46, 47, 48]. Nonetheless, it was found that primitive persistent homology has very limited predictive power in machine learning based classification of biomolecules [49], which motivated us to introduce element specific persistent homology (ESPH) to retain crucial biological information during the topological simplification of geometric complexity [50, 51, 52]. ESPH has found success in the prediction of protein-ligand binding affinities [51, 52] and mutation induced protein stability changes [50, 52]. Unfortunately, the representational and predictive power of persistent homology and ESPH for small molecules is essentially unknown. Unlike proteins, small molecules involve a wide range of chemical elements, and their properties are very sensitive to their chemical constitution, symmetry and stereochemistry. Therefore, it is not clear whether persistent homology and ESPH are suitable descriptors for small molecules.
The objective of this work is to explore the representational ability and predictive power of ESPH for small molecules. To this end, we focus on the analysis and prediction of small-molecule solubility and partition coefficient. Due to their relevance to drug design and discovery, relatively large data sets have been collected in the literature for these problems, which provide good data sets for the validation of topological descriptors. To overcome the difficulty of small available data sets for certain problems, we construct topological learning by integrating ESPH and multitask deep learning for partition coefficient and aqueous solubility predictions. We show that ESPH provides a competitive description of relatively small drug-like molecules. Additionally, the inherent correlation between partition coefficient and aqueous solubility makes the multitask strategy a viable approach to joint logP and logS predictions.
The rest of this paper is structured as follows. In Section II, we give an introduction to element specific persistent homology and the construction of element specific topological descriptors (ESTDs). The underlying motivation for ESTDs is discussed and a concrete example is presented in Section II.B. In Section II.D, we provide an overview of classic ensemble methods, multitask learning (MT) and deep neural networks (DNN). The MTDNN architecture is carefully formulated and illustrated in Section II.D.2. In Section III, we first give an overview of the data sets. The predictions of MTDNN models as well as other methods for partition coefficient and aqueous solubility are then presented. Finally we wrap up the paper with some discussion in Section IV.
II Datasets and Methods
This section is devoted to datasets, topological methods, machine learning algorithms and evaluation metrics.
II.A An overview of data sets
The primary aim of this paper is to explore the proposed topology-based multitask methods for learning partition coefficient and aqueous solubility simultaneously. The data sets naturally divide into two parts: one for partition coefficient prediction and the other for aqueous solubility prediction. Note that for partition coefficient prediction there are multiple test sets while the training set remains the same.
Partition coefficient data sets
The training set used for partition coefficient prediction was originally compiled by Cheng et al. [9] and consists of 8199 compounds, based on Hansch et al.'s compilation [53]. These compounds are considered by Hansch to have reliable experimental logP values. In addition, three sets were chosen as test sets. The first test set, which is completely independent of the training set, contains 406 small-molecule organic drugs approved by the Food and Drug Administration (FDA) of the United States and represents a variety of organic compounds of pharmaceutical interest. This set was also compiled by Cheng et al. [9]. The remaining two test sets, the Star set and the Nonstar set, are publicly available and originate from a monograph of Avdeef [54]. The Star set comprises 223 compounds that are part of the BioByte Star set and have been widely used to develop prediction methods. The Nonstar set contains 43 compounds that represent relatively new chemical structures and properties. The compound list and corresponding partition coefficients are available for download at http://ochem.eu/article/17434. We also made an attempt to expand our training set by searching the NIH database, as other software packages use a large number of molecules for supervised learning. In this way, more than 3000 additional molecules were added to the training set.
Aqueous solubility data sets
In order to develop and validate prediction models for aqueous solubility, several well-defined aqueous solubility data sets were used. First, a diverse data set of 1708 molecules proposed by Wang et al. [1] was used to verify the predictive power of the descriptors. Both leave-one-out and 10-fold cross-validation were carried out on this set. Furthermore, we also tested our models on a relatively small set with independent test sets [2]. As Hou [2] suggested, we removed some molecules from the training set to ensure that the training set and test sets have no overlapping molecules.
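The 10-fold protocol mentioned above can be sketched in a few lines of pure Python. This is an illustrative split only; the shuffling scheme and `seed` are our assumptions, not the authors' exact partition:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 into k roughly equal folds after shuffling.
    Returns a list of (train_indices, test_indices) pairs, one per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```

Each molecule appears in exactly one test fold, so averaging the per-fold errors gives the cross-validated estimate.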
Statistics of datasets
A summary of data sets used for the proposed models is given in Table 1.
logP data           Number of molecules    logS data             Number of molecules
logP train set      8199                   logS train set 1 [1]  1708
FDA test set        406                    logS train set 2 [2]  1290 (1207 for test set 2)
Star test set       223                    logS test set 1 [2]   21
Nonstar test set    43                     logS test set 2 [2]   120
II.B Element specific topological descriptors (ESTD)
A brief introduction is given to persistent homology and element specific persistent homology (ESPH), followed by a detailed example to illustrate the persistent homology characterization of small molecules. A refined version of ESPH and corresponding ESTD construction are also discussed.
II.B.1 Persistent homology
Persistent homology is a branch of algebraic topology that studies topological spaces in terms of algebraic structures. It is a main workhorse of topological data analysis, which offers topological simplification of data complexity. Unlike conventional physical or chemical approaches, persistent homology captures the underlying topological connectivity of small molecules directly from atomic coordinates, i.e., point cloud data in R^3. Mathematically, the isolated atoms of a molecule are 0-simplices. The connectivity among atoms defines higher-dimensional simplices. For example, two linked atoms (a line segment) give rise to a 1-simplex, and three mutually linked atoms in a triangular shape form a 2-simplex. A mutually linked four-atom tetrahedron is a 3-simplex, and so forth.
An appropriate collection of simplices forms a simplicial complex, which is a topological space consisting of vertices (points), edges (line segments), triangles, and their higher-dimensional counterparts. Simplicial homology can be defined on the basis of a simplicial complex and can be used to compute topological invariants, i.e., Betti numbers. Physically, Betti-0, Betti-1 and Betti-2 describe the numbers of independent components, rings and cavities, respectively. However, it is important to note that topological connectivity among a set of atoms in a molecule does not follow their physical relations, i.e., covalent bonds, hydrogen bonds and van der Waals interactions. Instead, it is defined by a filtration parameter, an artificial ball radius (not the atomic radius) for each atom. Therefore, at a given filtration radius, one obtains a set of simplices, and thus Betti numbers, for a molecule. In persistent homology, the filtration radius varies continuously from zero to a large number until no meaningful topological structure can be created [55, 43]. Therefore, persistent homology computes the topological invariants of a given molecule at different spatial scales, which correspond to different geometric shapes (simplices) and thus different topological connectivities. The persistence of topological invariants over the filtration for a given molecule can be recorded in barcodes or persistence diagrams [56, 57]. The barcode representation of persistent homology is utilized in the present work to construct ESTDs. Readers are referred to Ref. [44] for a detailed yet accessible introduction to persistent homology.
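The Betti-0 part of a barcode can be computed without any specialized library, because the merging of connected components in a Vietoris-Rips filtration is exactly single-linkage clustering. The following minimal pure-Python sketch (not the software used in this work) runs Kruskal's algorithm on the complete graph of pairwise distances and records one death value per merge:

```python
import math
from itertools import combinations

def betti0_barcode(points):
    """Betti-0 death values of a Vietoris-Rips filtration on a point cloud.

    Every point is a component born at filtration value 0; when two
    components merge at edge length d, one Betti-0 bar dies at d.  The
    single component that survives is reported with death math.inf."""
    n = len(points)
    parent = list(range(n))

    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)           # a Betti-0 bar dies at this merge
    deaths.append(math.inf)            # the final component persists
    return sorted(deaths)
```

For the four corners of a unit square, three bars die at distance 1 (the side length) and one bar persists, mirroring how the 18 bars of cyclohexane collapse at the C-H bond length in the example below.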
The necessity of ESPH and an example
Primitive persistent homology treats all atoms on an equal footing, which neglects the chemical and physical properties of molecules. To obtain an accurate representation of a given molecule, it is necessary to at least distinguish different element types and construct element specific topological descriptors (ESTDs) [50, 51, 52]. Figure 1 gives a detailed example of how our ESTDs are calculated and how they reveal the structural information of cyclohexane. An all-element representation of cyclohexane is given in Fig. 1(a), where carbon atoms are in green and hydrogen atoms are in white. As we can see from its barcode plot in Fig. 1(c), there are 18 Betti-0 bars that correspond to the 18 atoms at the very beginning, 12 of which disappear when the filtration value reaches 1.08 Å. This indicates that each carbon atom has merged with its two closest hydrogen atoms as the filtration value exceeds the length of the C-H bond, so these three atoms are regarded as a single connected component. When the filtration value increases to 1.44 Å, a Betti-1 bar emerges, which means that a hexagonal carbon ring is captured and only one connected component is left. As the filtration value eventually exceeds the radius of the hexagon, the ring structure disappears, which is why the Betti-1 bar terminates. The longest Betti-0 bar corresponds to the persistence of the final connected component. When only carbon atoms are selected (C element), it is relatively straightforward to interpret the barcode plot: the cutoff where 5 Betti-0 bars disappear corresponds to the C-C bond length, and the Betti-1 bar represents the existence of the hexagonal carbon ring.
The challenge for primitive persistent homology
The persistent homology described above, although able to easily capture information such as covalent bonds between different atom types, as shown in Fig. 1, does not necessarily reflect intramolecular interactions such as hydrogen bonds and van der Waals interactions, which is not ideal for the purpose of small-molecule modeling. In other words, a Betti-0 bar between two atoms that share a hydrogen bond or van der Waals contact cannot be captured, since shorter Betti-0 bars (essentially covalent bonds) already exist between them. Thus it is important to redefine the distance between atom i located at x_i and atom j located at x_j as follows:

    D(x_i, x_j) = { d_inf,          if d_E(x_i, x_j) <= r_i + r_j + sigma,
                  { d_E(x_i, x_j),  otherwise,                                (1)

where r_i and r_j are the atomic radii of atoms i and j, respectively, and sigma is the bond length deviation in the data set. Here d_inf is a large number which is set to be greater than the maximal filtration value, and d_E(x_i, x_j) is the Euclidean distance between atom i and atom j, or equivalently,

    d_E(x_i, x_j) = ||x_i - x_j||_2.                                          (2)

By setting the distance between two covalently bonded atoms to a sufficiently large number, we are able to capture intramolecular interactions, since any connection longer than the maximal filtration value is automatically neglected during the persistent homology computation.
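Eq. (1) translates directly into a modified distance matrix for the Rips filtration. The sketch below is a minimal illustration; the radii, `sigma`, and `d_inf` values are placeholders, not the parameters used in this work:

```python
import math

def esph_distance_matrix(coords, radii, sigma=0.2, d_inf=100.0):
    """Modified distance matrix of Eq. (1).

    Pairs closer than the sum of their atomic radii plus the bond-length
    deviation sigma (i.e. likely covalently bonded) are pushed out to
    d_inf, which exceeds the maximal filtration value, so that only
    non-covalent contacts produce finite Betti-0 bars."""
    n = len(coords)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            if d <= radii[i] + radii[j] + sigma:
                d = d_inf            # exclude the covalent-bond scale
            M[i][j] = M[j][i] = d
    return M
```

Feeding `M` to a Rips-complex builder then yields bars for hydrogen bonding and van der Waals contacts rather than for covalent bonds.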
II.B.2 ESTD construction
Inspired by classic atom-additive models for partition coefficient prediction, we utilize a total of 61 basic element types assigned by antechamber [58, 59] using the General Amber Force Field (GAFF) [58]. Atoms of a given atom type and their appropriate combinations are selected to construct Vietoris-Rips complexes, from which ESTDs are subsequently calculated.
It is also important to construct ESTDs using a small bin size. As the example above shows, barcodes at different cutoffs give rise to a variety of information, such as covalent and non-covalent bonds. Specifically, we can divide the barcodes into several small bins and then extract features from each bin. A complete list of the ESTDs used in this study is shown in Table 2. Group 1 ESTDs focus on different atom types, Group 2 captures the occurrence of non-covalent bonding, and Group 3 mainly highlights the strength of non-covalent bonding and van der Waals interactions. Note that the statistics of birth or death values in Group 3 refer to the maximum, minimum, mean, and summation.
Feature group  Element types used                              Descriptors
Group 1        One element type {E}, where E is one of the     Counts of Betti-0 bars for each element type,
               61 element types assigned with GAFF [58]        with a total of 61 types
Group 2        Two element types {E1, E2}, where               Counts of Betti-0 bars with birth or death
               E1 in {C, O, N}                                 values falling within each bin
Group 3        Two element types {E1, E2}, where               Statistics of birth or death values for all
               E1 in {C, O, N}                                 Betti-0 bars (all birth and death values considered)
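Given the Betti-0 death values produced by a filtration, the Group 2 and Group 3 descriptors of Table 2 reduce to simple binning and statistics. The sketch below uses an assumed bin width and range for illustration; the exact binning parameters of this work are not reproduced:

```python
import math

def binned_bar_counts(deaths, bin_width=0.25, max_dist=10.0):
    """Group-2-style ESTD: count the Betti-0 bars whose death value falls
    in each interval [k*bin_width, (k+1)*bin_width).  Infinite bars
    (the ever-persisting component) are skipped."""
    counts = [0] * int(max_dist / bin_width)
    for d in deaths:
        if not math.isfinite(d):
            continue
        k = int(d // bin_width)
        if k < len(counts):
            counts[k] += 1
    return counts

def bar_statistics(deaths):
    """Group-3-style ESTD: (max, min, mean, sum) of finite death values."""
    finite = [d for d in deaths if math.isfinite(d)]
    return (max(finite), min(finite),
            sum(finite) / len(finite), sum(finite))
```

Concatenating such counts and statistics over the selected element pairs yields the fixed-length feature vector fed to the learning algorithms.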
The essence of ESTDs is to offer new insights into small-molecule modeling via topological modeling and topological learning. By constructing a topological feature vector x_i^t for the i-th molecule of task t, we are ready to combine topological learning with advanced machine learning algorithms; further details are discussed in Section II.D.
II.C Additional molecular descriptors
In addition to the ESTDs introduced in the previous subsection, we generate 633 2D molecular descriptors with ChemoPy [60] for each molecule. The feature pool contains feature groups such as E-state descriptors. These 2D molecular descriptors fall into the following categories: 30 molecular constitutional descriptors, 35 topological descriptors, 44 molecular connectivity indices, 7 Kappa shape descriptors, 64 Burden descriptors, 245 E-state indices, 21 Basak information indices, 96 autocorrelation descriptors, 6 molecular property descriptors, 25 charge descriptors, and 60 MOE-type descriptors. A more detailed description of the features and the ChemoPy software is available online at https://code.google.com/archive/p/pychem/downloads.
In order to improve the overall performance, we also combine these features with ESTDs to create an augmented descriptor set (labeled with the suffix "AD" in the results tables). For consistency, only molecules whose features can be calculated by both our ESTD software and the ChemoPy software are used for training. It is worth mentioning that our ESTD approach is applicable to all molecules, whereas ChemoPy has difficulty dealing with some molecules. Separate results and discussions for the different sets of descriptors are provided in later sections.
II.D Topological learning algorithms
In this section, we present the methods and algorithms of topology-based multitask learning for simultaneous predictions of partition coefficient and aqueous solubility. Ensemble methods, including random forest (RF) and gradient boosting decision tree (GBDT), and neural network architectures are discussed. A detailed description of the multitask neural network is also provided.
II.D.1 Ensemble methods
Ensemble methods have been widely used to solve QSAR problems and have achieved some state-of-the-art results. They naturally handle correlations between descriptors and are generally insensitive to parametrization and feature selection due to bagging operations. In this work we choose the gradient boosting decision tree as our baseline method and implement it using the scikit-learn package [61] (version 0.13.1). The number of estimators is set to 8000, because a further increase does not essentially improve accuracy. For each set, 50 runs were performed and the average of the 50 predictions is taken as the final prediction.
II.D.2 Multitask learning and deep neural networks
Multitask learning
The idea of multitask learning (MTL) is to learn an 'inductive bias' from related tasks to improve accuracy, using the same representation. In other words, MTL aims at learning a shared, generalized feature representation for multiple tasks and potentially gives better predictions. In our work, we assume that the underlying molecular mechanisms of partition coefficient and aqueous solubility share commonalities as well as differences, which can be learned jointly.
Multitask deep neural network (MTDNN)
A deep neural network has a wider and deeper architecture than a traditional artificial neural network: it consists of more layers and more neurons in each layer and reveals facets of the input features at different levels. In our multitask framework, the different tasks share the first few dense layers, on top of which an individual predictor is attached for each specific task (one for partition coefficient and one for aqueous solubility). An illustration of the multitask deep neural network is shown in Fig. 2.
In this study, we have a total of 2 tasks: one for logP prediction and the other for logS prediction. Suppose that there are N_t molecules in the t-th task and the i-th molecule of the t-th task is represented by a topological feature vector x_i^t. Given the training data {(x_i^t, y_i^t) : t = 1, 2; i = 1, ..., N_t}, with y_i^t being the experimental value (logP or logS) for the i-th molecule of task t, the objective of topological learning is to minimize the total loss over the different tasks:

    min_{W, b} sum_{t=1}^{2} sum_{i=1}^{N_t} L(y_i^t, f_t(x_i^t; W, b)),      (3)

where f_t is a function of the topological feature vectors to be learned, parametrized by the weight matrix W and bias term b, and L can be the cross-entropy loss for classification or the mean squared error for regression. Since logP and logS are measured quantitatively, the loss function for task t to be minimized is defined as:

    L_t = (1 / N_t) sum_{i=1}^{N_t} ( y_i^t - f_t(x_i^t; W, b) )^2.           (4)
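A toy version of the shared-layer architecture behind Eqs. (3)-(4) can be written in a few lines of NumPy. This two-layer stand-in (one shared ReLU layer and one linear head per task, with dimensions chosen purely for illustration) only demonstrates how per-task mean squared errors are summed into a single multitask objective; it is not the 7-layer network trained in this work:

```python
import numpy as np

def mtdnn_forward(x, shared_W, shared_b, head_W, head_b):
    """One shared ReLU layer followed by a task-specific linear head."""
    h = np.maximum(0.0, x @ shared_W + shared_b)   # shared representation
    return h @ head_W + head_b                     # task-specific prediction

def multitask_loss(tasks, shared_W, shared_b, heads):
    """Eq. (3) with the MSE loss of Eq. (4): the sum over tasks of the
    per-task mean squared error between targets and predictions."""
    total = 0.0
    for (x, y), (head_W, head_b) in zip(tasks, heads):
        pred = mtdnn_forward(x, shared_W, shared_b, head_W, head_b).ravel()
        total += np.mean((y - pred) ** 2)
    return total
```

In training, the gradient of this summed loss flows through both heads into the shared weights, which is how the shared layers learn the commonalities of logP and logS.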
A wide range of parameters for the MTDNN has been tested, and the parameters in Table 3 are used for training and testing in this study, although our parameter search is by no means exhaustive:

# of hidden layers                7
# of neurons on each dense layer  1000 for the first 4 layers and 100 for the next 3 layers
# of neurons on output layer      2, with linear activation
Learning rate                     0.0001
Note that Adam (adaptive moment estimation) [62] is used as the gradient update method. For stability during training, all data were normalized to zero mean and unit variance.
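The normalization step can be sketched as below. Fitting the statistics on the training set only and reusing them on the test set is our assumption about the protocol (it is the standard way to avoid test-set leakage):

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling fitted on the training set,
    then applied unchanged to the test set."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0.0] = 1.0            # constant features: leave unscaled
    return (X_train - mu) / sd, (X_test - mu) / sd
```

The guard for zero-variance columns matters for ESTDs, since many bar counts are identically zero across a data set.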
Remark
Although the units for these two tasks differ (partition coefficient is reported as logP, a dimensionless log unit, while solubility is reported as logS in log mol/L), we can still train an MTDNN model for them simultaneously: the shared layers learn their inherent commonalities while the individual layers deal with their differences.
II.E Evaluation metrics
Several different evaluation metrics, including the root mean squared error (RMSE), Pearson correlation coefficient (R) and mean unsigned error (MUE), were used to evaluate the performances of different models. These metrics are defined as follows:

    RMSE = sqrt( (1/N) sum_{i=1}^{N} (y_i^pred - y_i^exp)^2 ),                (5)

    R = sum_{i=1}^{N} (y_i^pred - ybar^pred)(y_i^exp - ybar^exp)
        / sqrt( sum_{i=1}^{N} (y_i^pred - ybar^pred)^2
                sum_{i=1}^{N} (y_i^exp - ybar^exp)^2 ),                       (6)

and

    MUE = (1/N) sum_{i=1}^{N} | y_i^pred - y_i^exp |,                         (7)

where y represents logP or logS, N is the total number of molecules in the test set, y_i^exp and y_i^pred stand for the experimental and predicted values for the i-th molecule, respectively, and ybar^exp and ybar^pred are the averages of the experimental and predicted values over the entire test set, respectively.
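The three metrics in Eqs. (5)-(7) implemented directly in plain Python:

```python
import math

def rmse(y_exp, y_pred):
    """Root mean squared error, Eq. (5)."""
    n = len(y_exp)
    return math.sqrt(sum((p - e) ** 2 for e, p in zip(y_exp, y_pred)) / n)

def mue(y_exp, y_pred):
    """Mean unsigned error, Eq. (7)."""
    return sum(abs(p - e) for e, p in zip(y_exp, y_pred)) / len(y_exp)

def pearson_r(y_exp, y_pred):
    """Pearson correlation coefficient, Eq. (6)."""
    n = len(y_exp)
    me, mp = sum(y_exp) / n, sum(y_pred) / n
    cov = sum((e - me) * (p - mp) for e, p in zip(y_exp, y_pred))
    ve = sum((e - me) ** 2 for e in y_exp)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(ve * vp)
```

Note that a constant offset between prediction and experiment inflates RMSE and MUE but leaves R untouched, which is why all three are reported together.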
Additionally, Tetko et al. [63] proposed a further metric based on the absolute difference between the experimental and predicted logP (|ΔlogP|), in which the percentage of molecules within each error range is considered:
- If |ΔlogP| <= 0.5, the prediction is considered "acceptable";
- If 0.5 < |ΔlogP| <= 1.0, the prediction is considered "disputable";
- If |ΔlogP| > 1.0, the prediction is considered "unacceptable".
As a result, we use this metric along with RMSE to evaluate model performances on the Star set and Nonstar set described below.
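The error-range classification can be expressed as a small helper; the 0.5 and 1.0 thresholds follow the error-range column headings used in the Star/Nonstar benchmark tables:

```python
def classify_prediction(logp_exp, logp_pred):
    """Assign Tetko et al.'s error-range label to one prediction:
    |dlogP| <= 0.5 acceptable, <= 1.0 disputable, else unacceptable."""
    delta = abs(logp_exp - logp_pred)
    if delta <= 0.5:
        return "acceptable"
    if delta <= 1.0:
        return "disputable"
    return "unacceptable"
```

Tabulating these labels over a test set yields the percentage-within-error-range columns reported in Table 6.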
III Results
In this section, we present the results of the proposed ESPH methods in conjunction with random forest, gradient boosting, and multitask deep neural networks on a variety of data sets, including partition coefficient and solubility test sets. Unless otherwise stated, the different tasks are trained together in the same network. For easier reference, we introduce some notation: ESTD1 contains the 61 Betti-0 bar-based ESTDs (Group 1 in Table 2), while ESTD2 contains all ESTDs listed in Table 2 (Groups 1-3).
III.A Partition coefficient prediction
Training set cross-validation
In order to get a sense of how our topological representation performs for partition coefficient prediction, 10-fold cross-validation was performed using the baseline GBDT method. Note that 50 runs were carried out to obtain the final results, as randomness is involved; the results are summarized in Table 4.
Method         R      RMSE  MUE
GBDT-ESTD      0.924  0.45  0.32
XLOGP3-AA [9]  0.904  0.50  0.39
It can be seen that our descriptors perform better than the XLOGP3 software [9] given the same training data, demonstrating great predictive power. It is then interesting to compare the performance of our MTDNN against XLOGP3 and GBDT.
FDA set
The first test set to which we apply our model is the FDA test set. One molecule containing Hg was dropped due to computational difficulty. A major challenge of this set is that its structures are more complex than those of the training set, and the partition coefficient range spans nearly 12 log units. A series of prediction methods [9], including our multitask neural networks, was applied to this set; the results are summarized in Table 5 for comparison.
Method     R      RMSE  MUE
MT-ESTD1   0.920  0.57  0.28
MT-ESTD1   0.909  0.60  0.27
MT-ESTD2   0.909  0.60  0.34
ALOGPS     0.908  0.60  0.42
MT-ESTD2   0.891  0.66  0.44
XLOGP3     0.872  0.72  0.51
XLOGP3-AA  0.847  0.80  0.57
CLOGP      0.838  0.88  0.51
TOPKAT     0.815  0.88  0.56
ALOGP98    0.800  0.90  0.64
KowWIN     0.771  1.10  0.63
HINT       0.491  1.93  1.30
As we can see from Table 5, our multitask model gives the best prediction in terms of R, RMSE and MUE. In particular, the small MUE of our model (0.29 log units) indicates that our predictions are less biased than those of the other methods tested, apart from a few outliers. Also note that the training set is completely independent of the test set, which demonstrates the applicability of our multitask architecture. We also built models with the same architecture that include the additional molecules gathered from the NIH database, since there is no guarantee that the ALOGPS training set is completely independent of this test set. It turns out that the accuracy can be greatly improved.
Star set and Nonstar set
The Star set and Nonstar set were proposed by Tetko [63] as two benchmark sets for evaluating partition coefficient models. Over 20 different models were tested on these two sets. It should be emphasized that for these sets, different models are trained on different training sets and their overlap with the test sets is unknown. Thus it makes more sense to merge our 8199-compound training set with additional molecules from the NIH database and examine how additional training data benefit the overall performance. Results of different models on these two sets can be found in Table 6. Notice that models trained with additional data are labeled with a superscript *.
For the Star set, we achieve an RMSE of 0.53 log units, comparable to popular commercial software packages such as ACD/logP and CLOGP, in addition to a high acceptable-prediction percentage (75%, rank 3) and a low unacceptable rate (5%, rank 3). For the Nonstar set, most methods do not give accurate predictions, as the structures in this set are relatively new and complex. Our 49% acceptable rate ranks number 2 among all predictors, though our RMSE is relatively high due to a few large outliers (rank 5). The results are satisfactory given that commercial software packages generally use a much larger training set than ours. In general, when more overlapping molecules exist in the training set, the test results are significantly improved. As Table 6 indicates, our MTDNN-ESTD models achieve a substantial improvement over XLOGP3 for the Star set, while maintaining the same level of accuracy for the Nonstar set. The performance of our MTDNN-ESTD models suggests that they are able to predict logP accurately. In fact, the predictive power can potentially be improved further once more molecules are incorporated into the training set. Thus we also extend our original 8199-compound training set by adding molecules from both the Star set and the Nonstar set. When we use the extended training set, much higher accuracy is achieved, as listed in Table 6. Again we observe that MTDNN outperforms RF with the same set of features.
Star Set  Nonstar Set
% of Molecules Within Error Range
Method  RMSE  <0.5  0.5–1  >1  RMSE  <0.5  0.5–1  >1
AB/LogP  0.41  84  12  4  1.00  42  23  35 
S+logP  0.45  76  22  3  0.87  40  35  26 
MTESTD1AD  0.49  77  16  7  0.98  49  19  33 
MTESTD2  0.49  74  21  5  0.97  49  23  28 
ACD/logP  0.50  75  17  7  1.00  44  32  23 
CLOGP  0.52  74  20  6  0.91  47  28  26 
VLOGP OPS  0.52  64  21  7  1.07  33  28  26 
ALOGPS  0.53  71  23  6  0.82  42  30  28 
MTESTD1  0.53  75  17  8  0.97  47  28  26 
MTESTD1AD  0.53  73  18  9  1.00  37  30  33 
MTESTD2AD  0.53  71  19  9  1.01  47  19  35 
MTESTD1  0.55  72  18  10  1.01  33  28  40 
MTESTD2  0.56  66  23  11  1.06  35  33  33 
MiLogP  0.57  69  22  9  0.86  49  30  21 
XLOGP3  0.62  60  30  10  0.89  47  23  30 
KowWIN  0.64  68  21  11  1.05  40  30  30 
CSLogP  0.65  66  22  12  0.93  58  19  23 
ALOGP  0.69  60  25  16  0.92  28  40  33 
MolLogP  0.69  61  25  14  0.93  40  25  26 
ALOGP98  0.70  61  26  13  1.00  30  37  33 
OsirisP  0.71  59  26  16  0.94  42  26  33 
VLOGP  0.72  65  22  14  1.13  40  28  33 
TLOGP  0.74  67  16  13  1.12  30  37  30 
ABSOLV  0.75  53  30  17  1.02  49  28  23 
QikProp  0.77  53  30  17  1.24  40  26  35 
QuantlogP  0.80  47  30  22  1.17  35  26  40 
SLIPPER2002  0.80  62  22  15  1.16  35  23  42 
COSMOFrag  0.84  48  26  19  1.23  26  40  23 
XLOGP2  0.87  57  22  20  1.16  35  23  42 
QLOGP  0.96  48  26  25  1.42  21  26  53 
VEGA  1.04  47  27  26  1.24  28  30  42 
CLIP  1.05  41  25  30  1.54  33  9  49 
LSER  1.07  44  26  30  1.26  35  16  49 
MLOGP(Sim+)  1.26  38  30  33  1.56  26  28  47 
NC+NHET  1.35  29  26  45  1.71  19  16  65 
SPARC  1.36  45  22  32  1.70  28  21  49 
HINTLOGP  1.80  34  22  44  2.72  30  5  65 
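The error-band columns in Table 6 follow a simple convention: the fraction of test molecules whose absolute logP error falls below 0.5 log units (acceptable), between 0.5 and 1, or above 1 log units. A minimal sketch of this scoring, with a hypothetical helper name and illustrative values:

```python
import numpy as np

def logp_benchmark_stats(y_true, y_pred):
    """Summarize predictions the way the Star/Nonstar benchmark does:
    RMSE plus the percentage of molecules whose absolute error is below
    0.5 log units, between 0.5 and 1, or above 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_pred - y_true)
    n = err.size
    return {
        "RMSE": float(np.sqrt(np.mean((y_pred - y_true) ** 2))),
        "<0.5": 100.0 * np.count_nonzero(err < 0.5) / n,
        "0.5-1": 100.0 * np.count_nonzero((err >= 0.5) & (err < 1.0)) / n,
        ">1": 100.0 * np.count_nonzero(err >= 1.0) / n,
    }

# Illustrative (made-up) experimental vs predicted logP values.
stats = logp_benchmark_stats([1.0, 2.0, 3.0, 4.0], [1.2, 2.8, 3.1, 5.5])
```

With these toy values the absolute errors are 0.2, 0.8, 0.1, and 1.5, so half the molecules fall in the acceptable band.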
III.B Aqueous solubility prediction
To evaluate the performance of our solubility models, several data sets derived from Wang et al. [1] and Hou et al. [2] are used. For leave-one-out validation, only the baseline method is used. For 10-fold cross-validation, the 9 remaining folds are trained jointly with the partition coefficient training set when the held-out fold is evaluated with the MTDNN architecture.
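The cross-validation protocol just described can be sketched as follows. The array names and sizes are hypothetical stand-ins for the actual ESTD feature matrices, and the fold construction is a plain NumPy version of the standard 10-fold split:

```python
import numpy as np

def multitask_folds(n_sol, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) index pairs over the solubility set;
    each training split is later pooled with the full logP training set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_sol), n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_splits) if j != k])
        yield train_idx, test_idx

# Hypothetical stand-ins for the descriptor matrices and labels.
rng = np.random.default_rng(1)
X_sol, y_sol = rng.normal(size=(100, 8)), rng.normal(size=100)
X_logp, y_logp = rng.normal(size=(300, 8)), rng.normal(size=300)

for train_idx, test_idx in multitask_folds(len(X_sol)):
    # The 9 training folds are pooled with the entire logP training set;
    # a task label routes each row to the correct output layer of the
    # multitask network (0 = solubility, 1 = logP).
    X_train = np.vstack([X_sol[train_idx], X_logp])
    task = np.concatenate([np.zeros(len(train_idx)), np.ones(len(X_logp))])
    # ... train the MTDNN on (X_train, task), score on X_sol[test_idx]
```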
III.B.1 The 1708 molecule set in Ref. [1]
For this data set, both leave-one-out and 10-fold cross validations are carried out in order to evaluate the performance of our models.
Leave-one-out
As MTDNN requires substantial computational resources, only the baseline GBDT method is used for leave-one-out prediction. We use 4000 trees and a learning rate of 0.10 as training parameters; the resulting performance is reported in Table 7.
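A leave-one-out run with these GBDT settings can be sketched as below. Since the paper's machine learning pipeline cites scikit-learn [61], `GradientBoostingRegressor` is a plausible stand-in for the baseline; the feature matrix here is synthetic, and the demo call uses fewer trees than the 4000 quoted above purely to keep it fast:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def leave_one_out_gbdt(X, y, n_estimators=4000, learning_rate=0.10):
    """Leave-one-out prediction with the GBDT baseline; the default tree
    count and learning rate match the settings quoted in the text."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out molecule i
        model = GradientBoostingRegressor(
            n_estimators=n_estimators, learning_rate=learning_rate)
        model.fit(X[mask], y[mask])
        preds[i] = model.predict(X[i:i + 1])[0]
    return preds

# Tiny synthetic demo (a real ESTD feature matrix replaces X).
rng = np.random.default_rng(1)
X = rng.normal(size=(25, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=25)
preds = leave_one_out_gbdt(X, y, n_estimators=100)  # fewer trees for speed
rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
```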
10-fold cross-validation
As both MTDNN and the baseline GBDT method involve randomness, we run each of them 50 times and report the mean performance for all metrics. The results are summarized in Table 8. Our models yield more accurate and robust predictions than the ASMS and ASMS-LOGP models, improving the correlation from 0.884 to 0.925. We also notice that the MTESTD models generally improve over the GBDT models, though not as significantly as in the earlier partition coefficient prediction.
Method  Mean (RMSD)  Mean RMSE (RMSD)  Mean MUE (RMSD) 

MTESTD1  0.925 (0.001)  0.568 (0.005)  0.393 (0.003) 
MTESTD2  0.924 (0.003)  0.571 (0.010)  0.395 (0.004) 
GBDT1  0.924 (0.002)  0.572 (0.006)  0.408 (0.005) 
GBDT2  0.923 (0.002)  0.571 (0.006)  0.408 (0.005) 
MTESTD1  0.908 (0.002)  0.630 (0.005)  0.466 (0.003) 
GBDT2  0.904 (0.002)  0.642 (0.008)  0.469 (0.005) 
MTESTD2  0.902 (0.002)  0.649 (0.007)  0.466 (0.005) 
GBDT1  0.889 (0.003)  0.697 (0.009)  0.502 (0.005) 
ASMS [1]  0.884 (0.021)  0.699 (0.054)  0.527 (0.034) 
ASMS-LOGP [1]  0.869 (0.022)  0.742 (0.053)  0.570 (0.034) 
III.B.2 Data sets in Ref. [2]
We test our models on the data sets proposed by Hou et al. [2], where training and test sets were predefined to cover a variety of molecules. Test set 1 contains 21 commonly used compounds of pharmaceutical and environmental interest [64], and predictions for it are made by models trained on the original 1290 molecules. Test set 2 contains 120 molecules that were used to develop Klopman and Zhu's group contribution model [65]. As Hou et al. [2] suggested, we remove the 83 molecules that overlap with test set 2 from the training set to make the predictions independent and unbiased, which reduces the training set for test set 2 to 1207 molecules.
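The leakage-avoidance step above, removing training molecules that also appear in the test set, amounts to a set difference on a canonical molecular identifier. The helper below is a hypothetical sketch keyed on canonical SMILES strings, with made-up logS labels:

```python
def remove_overlap(training, test_set):
    """Drop (identifier, label) training entries whose identifier also
    appears in the test set, e.g. keyed on canonical SMILES."""
    test_ids = {mol_id for mol_id, _ in test_set}
    return [(m, y) for m, y in training if m not in test_ids]

# Illustrative values only; labels are not real measurements.
train = [("CCO", -0.3), ("c1ccccc1", -1.6), ("CC(=O)O", 0.1), ("CCN", -0.1)]
test = [("c1ccccc1", -1.6), ("CCCCCC", -3.8)]
clean = remove_overlap(train, test)  # benzene is dropped from training
```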
Test set 1
Test set 2
The results for test set 2 are summarized in Table 10. For this data set, our MTESTD models give satisfactory results, with a high Pearson correlation above 0.97 across all ESTD combinations. These results indicate that our methods are applicable to a wide variety of molecules.
Method  R  RMSE  MUE 

MTESTD1  0.97  0.65  0.47 
MTESTD2  0.97  0.67  0.48 
MTESTD1  0.97  0.70  0.50 
MTESTD2  0.97  0.71  0.53 
GBDT2  0.97  0.73  0.50 
GBDT1  0.96  0.76  0.52 
DrugLOGS [2]  0.96  0.79  0.57 
GBDT2  0.96  0.79  0.60 
GBDT1  0.96  0.82  0.58 
Group contribution [65]  0.96  0.84  0.70 
IV Discussion
In this section, we discuss the outcome of trained models from several aspects. The emphasis is put on the predictive power of our ESTDs in terms of MTDNN and how topology based multitask learning can improve partition coefficient and aqueous solubility predictions.
IV.A ESTDs for small molecules
As the previous results indicate, there exists a common feature representation for both partition coefficient and solubility prediction. Our features come from two categories: those computed solely by ESPH, and those that have been widely used in the development of QSAR models. Although the number of ESTDs (121) is small, our topological feature representation of molecules turns out to have very strong predictive power compared to the baseline method. Our ESTDs highlight atom-type information and keep track of the formation of chemical bonds as well as ring structures of a given molecule. Indeed, by constructing the topological space of a molecule, ESTDs extract useful features from the ESPH computation. In fact, more detailed topological representations that more precisely reflect covalent bond lengths, hydrogen bond strengths, and van der Waals interactions can be constructed via a small bin size [52].
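The binned-descriptor idea mentioned above can be illustrated schematically: an element-specific persistence barcode is a list of (birth, death) intervals, and one simple fixed-length descriptor counts how many bars are alive in each filtration bin. This is only a sketch of that construction with assumed parameter names; the exact statistics and bin settings used for the 121 ESTDs may differ:

```python
import numpy as np

def binned_estd(barcode, filtration_max=8.0, bin_size=0.25):
    """Turn one element-specific persistence barcode (a list of
    (birth, death) intervals, in the filtration units) into a
    fixed-length descriptor by counting, in each filtration bin,
    how many bars overlap that bin."""
    edges = np.arange(0.0, filtration_max + bin_size, bin_size)
    counts = np.zeros(len(edges) - 1, dtype=int)
    for birth, death in barcode:
        for k in range(len(counts)):
            lo, hi = edges[k], edges[k + 1]
            if birth < hi and death > lo:  # bar is alive in this bin
                counts[k] += 1
    return counts

# Two hypothetical Betti-0 bars, e.g. from a C-C pair and a C-N pair.
desc = binned_estd([(0.0, 1.5), (0.0, 1.4)],
                   filtration_max=2.0, bin_size=0.5)
```

A smaller `bin_size` produces a longer, finer descriptor, which is how bond-length-scale detail can be resolved, at the cost of more features.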
IV.B Multitask learning
The goal of multitask learning is to learn commonalities between different tasks and thereby simultaneously improve model performance. In this study, partition coefficient and aqueous solubility were trained jointly, and substantial improvements over single-task models are observed. Our results suggest that shared information across these two tasks benefits prediction accuracy. Indeed, the original motivation for predicting logP and logS together is that both quantities closely relate to the extent to which a compound dissolves in a solvent. By comparing our MTDNN with gradient boosting trees, we find it beneficial to learn partition coefficient and aqueous solubility models together. Our MTESTD models achieve satisfactory results on various partition coefficient and aqueous solubility data sets, some of which are, to our knowledge, the state of the art. Our ESTDs alone give very accurate predictions, bringing new insights from element specific persistent homology computations, and the commonly used 2D descriptors further improve the overall accuracy. Learning these two related measurements together gives a boost over learning them separately, and the improvement is validated on most data sets.
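The hard-parameter-sharing mechanism behind this multitask setup can be sketched in a few lines of NumPy: one hidden layer is shared by both tasks while each task keeps its own output head, so gradients from both logP and logS shape the common representation. This toy model (synthetic data, a single tanh layer, plain gradient descent) only illustrates the weight-sharing idea; the paper's actual MTDNN is deeper and trained with Adam [62]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic regression tasks built around a shared signal, a toy
# stand-in for logP and logS, which both reflect solvation behavior.
X1, X2 = rng.normal(size=(200, 6)), rng.normal(size=(200, 6))
w_true = rng.normal(size=6)
y1 = X1 @ w_true + 0.1 * rng.normal(size=200)
y2 = 0.5 * (X2 @ w_true) + 0.1 * rng.normal(size=200)

# Hard parameter sharing: one hidden layer W used by both tasks, plus a
# separate linear output head per task.
W = 0.1 * rng.normal(size=(6, 8))
heads = [0.1 * rng.normal(size=8), 0.1 * rng.normal(size=8)]

def task_mse(X, y, t):
    return float(np.mean((np.tanh(X @ W) @ heads[t] - y) ** 2))

mse_before = task_mse(X1, y1, 0)
lr = 0.01
for step in range(500):
    for t, (X, y) in enumerate([(X1, y1), (X2, y2)]):
        H = np.tanh(X @ W)                     # shared representation
        g = 2.0 * (H @ heads[t] - y) / len(y)  # d(MSE)/d(prediction)
        grad_head = H.T @ g
        grad_W = X.T @ (np.outer(g, heads[t]) * (1.0 - H ** 2))
        heads[t] -= lr * grad_head
        W -= lr * grad_W                       # both tasks update W

mse_after = task_mse(X1, y1, 0)
```

Because every update to `W` comes from one task or the other, each task regularizes and informs the representation the other task uses, which is the intuition behind the joint logP/logS gains reported above.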
V Conclusion
Partition coefficient and aqueous solubility are among the most important physical properties of small molecules and have significant applications in drug design and discovery in terms of lipophilic efficiency. Based on chemical and physical models, a wide variety of computational methods has been developed in the literature for the theoretical prediction of partition coefficient and aqueous solubility.
The present work introduces an algebraic topology based method, element specific persistent homology (ESPH), for simultaneous partition coefficient and aqueous solubility predictions. ESPH offers an unconventional representation of small molecules in terms of multiscale and multicomponent topological invariants. Here the multiscale representation is inherited from persistent homology, while the multicomponent formulation is developed to retain essential chemical information during the topological simplification of molecular geometric complexity. Therefore, the present ESPH gives a unique representation of small molecules that cannot be obtained by any other method. Although the ESPH representation of molecules cannot be literally translated into a physical interpretation, it systematically and comprehensively enciphers chemical and physical information of molecules into scalable topological invariants, and is thus ideally suited for machine learning and deep learning algorithms that decipher such information.
To predict partition coefficient and aqueous solubility, we integrate ESPH with advanced machine learning methods, including gradient boosting tree, random forest, and deep neural networks to construct topological learning strategies. Since partition coefficient and aqueous solubility are highly correlated to each other, we develop a common set of ESPH descriptors, called element specific topological descriptors (ESTDs), to represent both properties. This approach enables us to perform simultaneous predictions of partition coefficient and aqueous solubility using a topology based multitask deep learning strategy.
To test the representational power of ESPH and the predictive power of the proposed topological multitask deep learning strategy, we consider several commonly used data sets, including two benchmark test sets for partition coefficient as well as additional solubility data sets. Extensive cross validations and benchmark tests indicate that the proposed topological multitask strategy offers some of the most accurate predictions of partition coefficient and aqueous solubility.
Availability
Our software is available as an online server at http://weilab.math.msu.edu/TopPS/.
Acknowledgment
This work was supported in part by NSF Grants DMS-1721024 and IIS-1302285, and the MSU Center for Mathematical Molecular Biosciences Initiative.
References
 [1] J. Wang, G. Krudy, T. Hou, W. Zhang, G. Holland, and X. Xu, “Development of reliable aqueous solubility models and their application in drug-like analysis,” Journal of chemical information and modeling, vol. 47, no. 4, pp. 1395–1404, 2007.
 [2] T. Hou, K. Xia, W. Zhang, and X. Xu, “ADME evaluation in drug discovery. 4. Prediction of aqueous solubility based on atom contribution approach,” Journal of chemical information and computer sciences, vol. 44, no. 1, pp. 266–275, 2004.
 [3] A. Leo, D. Hoekman, and C. Hansch, “Hydrophobic, electronic, and steric constants,” 1995.
 [4] H. Van De Waterbeemd and E. Gifford, “ADMET in silico modelling: towards prediction paradise?,” Nature reviews Drug discovery, vol. 2, no. 3, pp. 192–204, 2003.
 [5] C. Hansch and T. Fujita, “ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure,” Journal of the American Chemical Society, vol. 86, no. 8, pp. 1616–1626, 1964.
 [6] T. Fujita, J. Iwasa, and C. Hansch, “A new substituent constant, π, derived from partition coefficients,” Journal of the American Chemical Society, vol. 86, no. 23, pp. 5175–5180, 1964.
 [7] A. Leo, C. Hansch, and D. Elkins, “Partition coefficients and their uses,” Chemical reviews, vol. 71, no. 6, pp. 525–616, 1971.
 [8] A. K. Ghose and G. M. Crippen, “Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions,” Journal of chemical information and computer sciences, vol. 27, no. 1, pp. 21–35, 1987.
 [9] T. Cheng, Y. Zhao, X. Li, F. Lin, Y. Xu, X. Zhang, Y. Li, R. Wang, and L. Lai, “Computation of octanol-water partition coefficients by guiding an additive model with knowledge,” Journal of chemical information and modeling, vol. 47, no. 6, pp. 2140–2148, 2007.
 [10] W. M. Meylan and P. H. Howard, “Atom/fragment contribution method for estimating octanol–water partition coefficients,” Journal of pharmaceutical sciences, vol. 84, no. 1, pp. 83–92, 1995.
 [11] W. M. Meylan and P. H. Howard, “Estimating log P with atom/fragments and water solubility with log P,” Perspectives in drug discovery and design, vol. 19, no. 1, pp. 67–84, 2000.
 [12] A. J. Leo, “Calculating log P(oct) from structures,” Chemical Reviews, vol. 93, no. 4, pp. 1281–1306, 1993.
 [13] J. Chou and P. C. Jurs, “Computerassisted computation of partition coefficients from molecular structures using fragment constants,” Journal of Chemical Information and Computer Sciences, vol. 19, no. 3, pp. 172–178, 1979.
 [14] A. A. Petrauskas and E. A. Kolovanov, “ACD/Log P method description,” Perspectives in drug discovery and design, vol. 19, no. 1, pp. 99–116, 2000.
 [15] M. J. Walker, “Training ACD/LogP with experimental data,” QSAR & Combinatorial Science, vol. 23, no. 7, pp. 515–520, 2004.
 [16] H. Zhu, A. Sedykh, S. K. Chakravarti, and G. Klopman, “A new group contribution approach to the calculation of LogP,” Current Computer-Aided Drug Design, vol. 1, no. 1, pp. 3–9, 2005.
 [17] A. Y. Sedykh and G. Klopman, “A structural analogue approach to the prediction of the octanol-water partition coefficient,” Journal of chemical information and modeling, vol. 46, no. 4, pp. 1598–1603, 2006.
 [18] I. V. Tetko and V. Y. Tanchuk, “Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program,” Journal of chemical information and computer sciences, vol. 42, no. 5, pp. 1136–1145, 2002.
 [19] I. V. Tetko and P. Bruneau, “Application of ALOGPS to predict 1-octanol/water distribution coefficients, logP, and logD, of AstraZeneca in-house database,” Journal of pharmaceutical sciences, vol. 93, no. 12, pp. 3103–3110, 2004.
 [20] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney, “Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings,” Advanced drug delivery reviews, vol. 23, no. 1–3, pp. 3–25, 1997.
 [21] L. Di and E. H. Kerns, “Biological assay challenges from compound solubility: strategies for bioassay optimization,” Drug discovery today, vol. 11, no. 9, pp. 446–451, 2006.
 [22] B.-G. Duan, Y. Li, J. Li, T.-J. Cheng, and R.-X. Wang, “An empirical additive model for aqueous solubility computation: success and limitations,” Acta Physico-Chimica Sinica, vol. 28, no. 10, pp. 2249–2257, 2012.
 [23] J. Wang, T. Hou, and X. Xu, “Aqueous solubility prediction based on weighted atom type counts and solvent accessible surface areas,” Journal of Chemical Information and Modeling, vol. 49, no. 3, pp. 571–581, 2009.
 [24] J. Wang, G. Krudy, T. Hou, W. Zhang, G. Holland, and X. Xu, “Development of reliable aqueous solubility models and their application in drug-like analysis,” Journal of chemical information and modeling, vol. 47, no. 4, pp. 1395–1404, 2007.
 [25] S. H. Yalkowsky and S. C. Valvani, “Solubility and partitioning i: solubility of nonelectrolytes in water,” Journal of pharmaceutical sciences, vol. 69, no. 8, pp. 912–922, 1980.
 [26] J. C. Dearden, “In silico prediction of aqueous solubility,” Expert opinion on drug discovery, vol. 1, no. 1, pp. 31–52, 2006.
 [27] R. Dannenfelser, M. Paric, M. White, and S. H. Yalkowsky, “A compilation of some physicochemical properties for chlorobenzenes,” Chemosphere, vol. 23, no. 2, pp. 141–165, 1991.
 [28] W. L. Jorgensen and E. M. Duffy, “Prediction of drug solubility from structure,” Advanced drug delivery reviews, vol. 54, no. 3, pp. 355–366, 2002.
 [29] R. Caruana, “Multitask learning,” in Learning to learn, pp. 95–133, Springer, 1998.
 [30] C. Widmer and G. Rätsch, “Multitask learning in computational biology,” in ICML Unsupervised and Transfer Learning, pp. 207–216, 2012.
 [31] Q. Xu and Q. Yang, “A survey of transfer and multitask learning in bioinformatics,” Journal of Computing Science and Engineering, vol. 5, no. 3, pp. 257–268, 2011.
 [32] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
 [33] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8599–8603, IEEE, 2013.
 [34] R. Socher, Y. Bengio, and C. D. Manning, “Deep learning for NLP (without magic),” in Tutorial Abstracts of ACL 2012, pp. 5–5, Association for Computational Linguistics, 2012.
 [35] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014.
 [36] G. E. Dahl, N. Jaitly, and R. Salakhutdinov, “Multi-task neural networks for QSAR predictions,” arXiv preprint arXiv:1406.1231, 2014.
 [37] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik, “Deep neural nets as a method for quantitative structure–activity relationships,” Journal of chemical information and modeling, vol. 55, no. 2, pp. 263–274, 2015.
 [38] A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter, “DeepTox: toxicity prediction using deep learning,” Frontiers in Environmental Science, vol. 3, p. 80, 2016.
 [39] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, “Massively multitask networks for drug discovery,” arXiv preprint arXiv:1502.02072, 2015.
 [40] Z. X. Wu and G. W. Wei, “Comparison of multitask convolutional neural network (MTCNN) and a few other methods for toxicity prediction,” Toxicological Sciences, Submitted, 2017.
 [41] A. Lusci, G. Pollastri, and P. Baldi, “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules,” Journal of chemical information and modeling, vol. 53, no. 7, pp. 1563–1575, 2013.
 [42] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” Discrete Comput. Geom., vol. 28, pp. 511–533, 2002.
 [43] A. Zomorodian and G. Carlsson, “Computing persistent homology,” Discrete Comput. Geom., vol. 33, pp. 249–274, 2005.
 [44] K. L. Xia and G. W. Wei, “Persistent homology analysis of protein structure, flexibility and folding,” International Journal for Numerical Methods in Biomedical Engineering, vol. 30, pp. 814–844, 2014.
 [45] K. L. Xia, X. Feng, Y. Y. Tong, and G. W. Wei, “Persistent homology for the quantitative prediction of fullerene stability,” Journal of Computational Chemistry, vol. 36, pp. 408–422, 2015.
 [46] K. L. Xia, Z. X. Zhao, and G. W. Wei, “Multiresolution topological simplification,” Journal of Computational Biology, vol. 22, pp. 1–5, 2015.
 [47] K. L. Xia, Z. X. Zhao, and G. W. Wei, “Multiresolution persistent homology for excessively large biomolecular datasets,” Journal of Chemical Physics, vol. 143, p. 134103, 2015.
 [48] B. Wang and G. W. Wei, “Objectoriented persistent homology,” Journal of Computational Physics, vol. 305, pp. 276–299, 2016.
 [49] Z. Cang, L. Mu, K. Wu, K. Opron, K. Xia, and G.-W. Wei, “A topological approach to protein classification,” Molecular Based Mathematical Biology, vol. 3, pp. 140–162, 2015.
 [50] Z. X. Cang and G. W. Wei, “Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology,” Bioinformatics (arXiv preprint arXiv:1703.10966), Revised, 2017.
 [51] Z. X. Cang and G. W. Wei, “Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction,” International Journal for Numerical Methods in Biomedical Engineering, Accepted, 2017.
 [52] Z. X. Cang and G. W. Wei, “TopologyNet: Topology based deep convolutional neural networks for biomolecular property predictions,” PLoS Computational Biology (arXiv preprint arXiv:1704.00063), Submitted, 2017.
 [53] C. Hansch, A. Leo, D. Hoekman, et al., Exploring QSAR: fundamentals and applications in chemistry and biology, vol. 557. American Chemical Society Washington, DC, 1995.
 [54] A. Avdeef, Absorption and drug development: solubility, permeability, and charge state. John Wiley & Sons, 2012.
 [55] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” in Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 454–463, IEEE, 2000.
 [56] G. Carlsson, A. Zomorodian, A. Collins, and L. J. Guibas, “Persistence barcodes for shapes,” International Journal of Shape Modeling, vol. 11, no. 2, pp. 149–187, 2005.
 [57] R. Ghrist, “Barcodes: The persistent topology of data,” Bull. Amer. Math. Soc., vol. 45, pp. 61–75, 2008.
 [58] J. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman, and D. A. Case, “Development and testing of a general amber force field,” Journal of computational chemistry, vol. 25, no. 9, pp. 1157–1174, 2004.
 [59] J. Wang, W. Wang, P. A. Kollman, and D. A. Case, “Automatic atom type and bond type perception in molecular mechanical calculations,” Journal of molecular graphics and modelling, vol. 25, no. 2, pp. 247–260, 2006.
 [60] D.S. Cao, Q.S. Xu, Q.N. Hu, and Y.Z. Liang, “Chemopy: freely available python package for computational biology and chemoinformatics,” Bioinformatics, p. btt105, 2013.
 [61] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [62] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [63] R. Mannhold, G. I. Poda, C. Ostermann, and I. V. Tetko, “Calculation of molecular lipophilicity: stateoftheart and comparison of log P methods on more than 96,000 compounds,” Journal of pharmaceutical sciences, vol. 98, no. 3, pp. 861–893, 2009.
 [64] G. Klopman, S. Wang, and D. M. Balthasar, “Estimation of aqueous solubility of organic molecules by the group contribution approach. application to the study of biodegradation,” Journal of chemical information and computer sciences, vol. 32, no. 5, pp. 474–482, 1992.
 [65] G. Klopman and H. Zhu, “Estimation of the aqueous solubility of organic molecules by the group contribution approach,” Journal of chemical information and computer sciences, vol. 41, no. 2, pp. 439–445, 2001.