# BaTFLED: Bayesian Tensor Factorization Linked to External Data

# BaTFLED: Bayesian Tensor Factorization Linked to External Data (supplementary material)

###### Abstract

The vast majority of current machine learning algorithms are designed to predict single responses or a vector of responses, yet many types of response are more naturally organized as matrices or higher-order tensor objects where characteristics are shared across modes. We present a new machine learning algorithm BaTFLED (Bayesian Tensor Factorization Linked to External Data) that predicts values in a three-dimensional response tensor using input features for each of the dimensions. BaTFLED uses a probabilistic Bayesian framework to learn projection matrices mapping input features for each mode into latent representations that multiply to form the response tensor. By utilizing a Tucker decomposition, the model can capture weights for interactions between latent factors for each mode in a small core tensor. Priors that encourage sparsity in the projection matrices and core tensor allow for feature selection and model regularization. This method is shown to far outperform elastic net and neural net models on ’cold start’ tasks from data simulated in a three-mode structure. Additionally, we apply the model to predict dose-response curves in a panel of breast cancer cell lines treated with drug compounds that was used as a Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge.


Nathan H. Lazar, Oregon Health & Science University, Portland, OR, USA, lazar@ohsu.edu

Mehmet Gönen, Koç University, Istanbul, Turkey, mehmetgonen@ku.edu.tr

Kemal Sönmez, Oregon Health & Science University, Portland, OR, USA, sonmezk@ohsu.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

## 1 The BaTFLED model

### 1.1 Model structure

BaTFLED is a generative probabilistic model with the structure shown in figure 1. Matrices of input feature values for the training set examples are multiplied by learned projection matrices to form latent matrices. The number of columns in each latent matrix is a set parameter that determines the size of the latent space for that mode. The response tensor is factored using a Tucker decomposition (Tucker, 1964) into the latent matrices and a core tensor. This three-way matrix multiplication can be framed as a sum of outer products of the columns of the latent matrices weighted by values in the core tensor. In order to allow for lower-order interactions between latent factors and to obviate the need for normalization of responses, a column of ones is added to each latent matrix. Thus the core element corresponding to the outer product of these columns acts as a global constant added to all responses, and products of other columns with the ones columns can encode interactions involving only one or two of the modes. For comparison, we also implement a CANDECOMP/PARAFAC (CP) decomposition (Hitchcock, 1927), a special case of the Tucker decomposition where the latent space for each mode must be the same size and the core tensor only has values along the super-diagonal.
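The factorization described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the BaTFLED implementation: the function names, the toy matrices and the choice to place the ones column at index 0 are all assumptions made for the example.

```python
def matmul(X, A):
    """Multiply an n-by-p matrix X by a p-by-r matrix A (lists of lists)."""
    return [[sum(x * a for x, a in zip(row, col)) for col in zip(*A)] for row in X]


def latent_with_ones(X, A):
    """Project features into the latent space and prepend a column of ones."""
    return [[1.0] + row for row in matmul(X, A)]


def tucker_response(H1, H2, H3, C):
    """Y[i][j][k] = sum over r, s, t of C[r][s][t] * H1[i][r] * H2[j][s] * H3[k][t]."""
    return [[[sum(C[r][s][t] * h1[r] * h2[s] * h3[t]
                  for r in range(len(h1))
                  for s in range(len(h2))
                  for t in range(len(h3)))
              for h3 in H3]
             for h2 in H2]
            for h1 in H1]


# Toy inputs (hypothetical): 2, 1 and 2 examples for the three modes,
# with one latent factor per mode.
X1, A1 = [[1.0, 0.0], [0.0, 2.0]], [[1.0], [1.0]]
X2, A2 = [[3.0]], [[1.0]]
X3, A3 = [[1.0], [0.0]], [[2.0]]
H1 = latent_with_ones(X1, A1)
H2 = latent_with_ones(X2, A2)
H3 = latent_with_ones(X3, A3)

# 2x2x2 core; index 0 in each mode corresponds to the ones column.
C = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
C[0][0][0] = 10.0  # global constant added to every response
C[1][0][0] = 1.0   # effect of the mode-1 latent factor alone
C[1][1][0] = 0.5   # interaction between mode-1 and mode-2 factors
C[1][1][1] = 0.1   # interaction across all three modes

Y = tucker_response(H1, H2, H3, C)
```

Because the ones columns sit at index 0, each core entry with one or more zero indices contributes a lower-order term, which is how a single core tensor holds the intercept, main effects and interactions together.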

In order to learn the values in the projection matrices, latent matrices and core tensor, we impose a probabilistic framework placing distributions on each of the unknown values. The distributions are chosen to maintain conjugacy and to encourage sparsity in the projection matrices and the core tensor. All of the major distributions are Gaussian and where sparsity is desired, means are set to zero and Gamma priors are placed on the precision (inverse variance). Setting the shape and scale for these Gamma distributions to extreme values (shape , scale ) encourages most of the variances to be near zero and the corresponding Gaussian to have a tight peak centered at zero. Variances are shared across rows of the projection matrices so that each feature is either selected or not for all the latent factors.

### 1.2 Training

Due to the size of the BaTFLED model, training would be computationally intractable with sampling methods, so instead we use a variational Bayesian approach. This seeks to maximize a lower bound of the posterior probability of the unknown values given the observed data and set parameters by approximating the full joint ‘p’ distribution with a factorable ‘q’ distribution. The parameters of each of these ‘q’ distributions rely only on the expected values of other parameters in the model, and equations for the optimal parameter values can be found analytically. Thus, once these update equations are derived, training proceeds by initializing all parameters randomly and updating each in sequence given the expectation of the other parameters. Full model details are in the supplemental material.
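The pattern of cycling through closed-form updates can be illustrated on a much smaller problem. The sketch below is not the BaTFLED update scheme: it applies mean-field variational Bayes to a univariate Gaussian with unknown mean and precision, a standard textbook case. The function name, the default prior values and the deterministic initialization are assumptions, and the Gamma distribution here is parameterized by shape and rate rather than shape and scale.

```python
def vb_gaussian(data, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field VB for x ~ N(mu, 1/tau) with a Normal-Gamma prior.

    q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, rate=b_n).
    Each factor is updated using only expectations under the other factor.
    """
    n = len(data)
    xbar = sum(data) / n
    # These two quantities do not depend on the other factor.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = a0 / b0  # starting guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu) given E[tau].
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given the first two moments of mu under q(mu).
        e_sq = sum((x - mu_n) ** 2 + 1.0 / lam_n for x in data)
        e_sq += lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * e_sq
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n


mu_n, lam_n, a_n, b_n = vb_gaussian([1.0, 2.0, 3.0, 4.0, 5.0])
```

The same structure carries over to larger models: every update needs only moments of the other factors, so the loop never has to evaluate the full joint distribution.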

### 1.3 Prediction

Once trained, there are four types of prediction that this model can perform. First, if there are missing values in the response tensor, these can be filled in by multiplying the latent matrices through the core tensor. In this ’warm start’ prediction we estimate a response for a combination when we have seen responses for other combinations involving the same input feature vectors. Second, given a vector of input features for a new example for one mode, ’cold start’ predictions can be made for that example when combined with any example for the other two modes. Multiplying through the projection matrix and core tensor, a new matrix ’slice’ of the response tensor is formed representing responses for this new test example across all training data. Similarly, ’cold start’ prediction is possible for new examples from two modes, yielding a vector of responses across the third mode. Lastly, ’cold start’ prediction for all three modes produces a single predicted response value.
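A cold-start prediction for one new mode-1 example can be sketched as follows. The code assumes point estimates (for instance posterior means) of the projection matrix, the other two latent matrices and the core are already available. All names and toy values are hypothetical, and the ones column is again assumed to sit at index 0.

```python
def project(x, A):
    """Latent vector for one new example: a leading 1 followed by x times A."""
    return [1.0] + [sum(xi * a for xi, a in zip(x, col)) for col in zip(*A)]


def contract(h1, h2, h3, C):
    """Multiply three latent vectors through the core tensor."""
    return sum(C[r][s][t] * h1[r] * h2[s] * h3[t]
               for r in range(len(h1))
               for s in range(len(h2))
               for t in range(len(h3)))


def cold_start_slice(x_new, A1, H2, H3, C):
    """Predicted responses for a new mode-1 example against all training
    examples of modes 2 and 3 (one matrix slice of the response tensor)."""
    h1 = project(x_new, A1)
    return [[contract(h1, h2, h3, C) for h3 in H3] for h2 in H2]


def cold_start_value(x1, x2, x3, A1, A2, A3, C):
    """Single predicted response when all three examples are new."""
    return contract(project(x1, A1), project(x2, A2), project(x3, A3), C)


# Hypothetical trained quantities.
A1, A2, A3 = [[1.0], [2.0]], [[1.0]], [[2.0]]
H2 = [[1.0, 3.0]]               # one training example for mode 2
H3 = [[1.0, 2.0], [1.0, 0.0]]   # two training examples for mode 3
C = [[[10.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.5, 0.1]]]

new_slice = cold_start_slice([2.0, 0.5], A1, H2, H3, C)
single = cold_start_value([2.0, 0.5], [3.0], [1.0], A1, A2, A3, C)
```

Warm-start prediction uses the same contraction with latent rows that were learned during training, so only the source of the latent vectors changes between the four prediction types.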

## 2 Experiments

### 2.1 Simulated data

We first test the model on data simulated from the three-mode structure shown in figure 1 and compare it to baseline approaches. For these tests, response data is generated with 30 training examples and 100 features for each mode, 10 of which are used to produce the responses. Each mode has four latent factors, so with the added columns of ones, the core tensor is 5x5x5. Zero-centered Gaussian noise with a standard deviation of 1/10 of the response standard deviation is added to responses and each method is trained on 28 examples for each mode with 1% of interactions removed for ’warm start’ testing.

In order to ascertain what types of interactions each of the tested models is able to discover, we generate data with three different sparse core tensors. The first core tensor only has non-zero values along the three primary edges, so the response tensor is formed using only 1D interactions between one latent factor from one mode and two columns of ones from the other modes. The second core tensor has non-zero values on three primary faces and so it allows 1D and 2D interactions between modes. The last core tensor can have non-zero values in any position, but is sparse so that only 1/2 of the possible interactions have non-zero weight.
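The three core structures can be generated with a short helper. This is a sketch under stated assumptions: index 0 in each mode is taken to be the ones column, non-zero entries are drawn from a standard normal, and the corner entry (the global constant) is treated as part of the edges. None of these details are specified in the text.

```python
import random


def sparse_core(size, interactions, keep_fraction=0.5, seed=0):
    """Build a size x size x size core restricted to '1D', '2D' or '3D' interactions."""
    rng = random.Random(seed)
    core = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for r in range(size):
        for s in range(size):
            for t in range(size):
                # Number of modes contributing a latent factor (not the ones column).
                n_latent = sum(idx != 0 for idx in (r, s, t))
                if interactions == "1D":
                    allowed = n_latent <= 1      # the three primary edges
                elif interactions == "2D":
                    allowed = n_latent <= 2      # the three primary faces
                else:
                    allowed = rng.random() < keep_fraction  # sparse everywhere
                if allowed:
                    core[r][s][t] = rng.gauss(0.0, 1.0)
    return core


core_1d = sparse_core(5, "1D")
core_2d = sparse_core(5, "2D")
core_3d = sparse_core(5, "3D")
```

For a 5x5x5 core the edge structure leaves at most 13 non-zero entries and the face structure at most 61, compared with 125 positions in the full tensor.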

### 2.2 Cancer cell line screening data

Cancer cell line drug screen panels consist of a number of patient-derived cancer cell lines which are treated with a broad range of drugs at varying doses, grown for several days and imaged to assess cellular death. The resulting response data is inherently three-dimensional with one measurement for each combination of cell line, drug and dose. Current studies summarize these responses across doses by extracting parameters of fitted curves, and attempt to predict these values using genomic features. Instead, we predict response at each dose in the hope that, by incorporating more information, more useful relationships between the cell line genomics and drug structures can be revealed.

One such study was presented as a prediction challenge by the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project (Costello et al., 2014). This challenge released data publicly so that teams from around the world could compete in the task of predicting responses in a panel of 52 breast cancer cell lines treated with 26 drugs. Although the original challenge was to predict a summary curve measure, data on response at each dose was also released and is utilized here. A set of 17 cell lines was held out as the final test set, but since the task of predicting responses for drugs was not part of the challenge, no drugs were separated into the test data set. In order to show performance on predicting multiple mode combinations, we present results both from 10-fold cross-validation runs on the 35 cell lines and 26 drugs in the training set (leaving out 4 cell lines and 3 drugs in each fold), as well as results for models trained with all of the original DREAM training data on the final 17 cell lines. The 733 input features for cell lines consist of binary indicators of cancer subtype, binary indicators of gene mutations and continuous measures of gene expression for a set of genes known to be associated with cancer (Bindal et al., 2011). The 433 input features for drugs consist of binary indicators of known gene targets, binary indicators of chemical substructures and continuous measures of 1D and 2D structural features extracted using the PaDEL software (Yap, 2011).

## 3 Results

We examine the performance of the BaTFLED model both for predicting the responses and in selecting the true predictors and compare to four types of baseline models. In the first baseline, we predict the mean response across the training data, averaging as many responses as possible for the prediction task. All other models are run in ten replicates with input features consisting of vectors of the features for each mode concatenated together and a single continuous response for each combination of input examples. We train elastic net models using the R package ‘glmnet’ (Friedman et al., 2010) with different sparsity settings ranging from ridge regression ($\alpha = 0$) to LASSO ($\alpha = 1$). Random forest and neural net models are run in R using the ‘h2o’ package (Aiello et al., 2016). Random forest models with 1,000 and 5,000 trees of depth 5, and 1,000 trees of depth 10 are tested, as well as neural net models with 1, 2 and 3 hidden layers with 1,500, 750 and 500 nodes in each layer respectively. Neural net models use rectified linear activation functions with a dropout fraction of on the input layer and on hidden layers. Full results are given in the supplemental material.
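The input layout used by the non-tensor baselines, and one reading of the mean baseline, can be sketched as below. Both functions are illustrative: the names are hypothetical, missing responses are represented by `None`, and interpreting "averaging as many responses as possible" as averaging over every observed training response that shares the remaining modes is an assumption.

```python
def concat_design(X1, X2, X3, Y):
    """One row per observed (i, j, k) combination: the three feature vectors
    concatenated, paired with the single response for that combination."""
    rows, targets = [], []
    for i, x1 in enumerate(X1):
        for j, x2 in enumerate(X2):
            for k, x3 in enumerate(X3):
                if Y[i][j][k] is not None:
                    rows.append(list(x1) + list(x2) + list(x3))
                    targets.append(Y[i][j][k])
    return rows, targets


def mean_baseline_new_mode1(Y):
    """Predict the slice for an unseen mode-1 example by averaging the
    observed training responses over mode 1 for each (j, k) pair."""
    n_j, n_k = len(Y[0]), len(Y[0][0])
    out = []
    for j in range(n_j):
        row = []
        for k in range(n_k):
            vals = [Y[i][j][k] for i in range(len(Y)) if Y[i][j][k] is not None]
            row.append(sum(vals) / len(vals) if vals else None)
        out.append(row)
    return out


# Hypothetical toy data: 2 x 1 x 2 response tensor with one missing value.
X1, X2, X3 = [[1, 2], [3, 4]], [[5]], [[6], [7]]
Y = [[[1.0, None]], [[3.0, 4.0]]]
rows, targets = concat_design(X1, X2, X3, Y)
baseline = mean_baseline_new_mode1(Y)
```

Flattening the tensor this way gives each baseline one regression problem over concatenated features, which is why those models must learn any cross-mode interaction from the concatenated vector itself.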

On simulated data, BaTFLED models are run for 100 iterations with prior parameters and . The Tucker models are given xx cores and the CP models a latent dimension of . All experiments were run in 10 replicates and we report the mean RMSE across replicates for representative models in table 1. Also shown are AUROC (area under the receiver operating characteristic curve) measures for these models when selecting predictors in data generated with 1D, 2D and 3D interactions. While BaTFLED performs comparably to baseline methods when predicting linear interactions, it is able to learn higher-order relationships in data while other methods cannot. Additionally, BaTFLED selects the correct features in models with higher-order relationships at a higher frequency than other models.
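The two performance measures can be computed as follows. This is a generic sketch: how each model assigns an importance score to a feature (for example, the magnitude of its row in a projection matrix) is not specified here and is left to the caller.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def auroc(scores, is_true_predictor):
    """Area under the ROC curve for ranking features by score.

    Computed as the fraction of (true, false) predictor pairs in which the
    true predictor receives the higher score, counting ties as one half.
    """
    pos = [s for s, flag in zip(scores, is_true_predictor) if flag]
    neg = [s for s, flag in zip(scores, is_true_predictor) if not flag]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical importance scores for four features, two of which are true predictors.
example_auc = auroc([0.9, 0.8, 0.1, 0.4], [True, False, False, True])
example_rmse = rmse([1.0, 2.0], [1.0, 4.0])
```

An AUROC of 1.0 means every true predictor is ranked above every irrelevant feature, while 0.5 corresponds to a random ranking.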

For the DREAM data, we compare to the same baseline methods and report Pearson correlation measures in table 2. Since the DREAM data has more input features and fewer than half as many total responses as the simulated data, we believe that it is somewhat underpowered for this prediction task. Simulations with artificially underpowered datasets indicate that this can be partially overcome by performing multiple rounds of testing to reduce the number of features. For the cross-validation runs we found the best performance by first training BaTFLED models for 200 iterations with a xx core and strong sparsity priors ( and ), keeping the union of the top 15% of predictors across folds, and retraining for 40 iterations without encouraging sparsity. For the final DREAM testing runs, we tried a similar scheme, but found that a single round of training for 120 iterations with a xx core and strong sparsity priors performed as well as the second round. Our current results suggest that BaTFLED is able to predict responses as well as or better than other baseline methods in both the cross-validation setting and on held-out test data on this very difficult challenge. In addition, BaTFLED may be able to discover interactions across modes that other methods cannot.
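The feature-reduction step between the two training rounds can be sketched as follows. The per-fold importance scores are assumed to come from the first-round models. How they are derived, and the rounding rule for 15% of a feature count, are assumptions made for this example.

```python
import math


def top_fraction(scores, fraction=0.15):
    """Indices of the highest-scoring features in one fold."""
    n_keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_keep])


def union_of_top(fold_scores, fraction=0.15):
    """Union across folds of each fold's top fraction of features."""
    keep = set()
    for scores in fold_scores:
        keep |= top_fraction(scores, fraction)
    return sorted(keep)


# Hypothetical scores for 10 features in two folds.
fold_a = [0.9, 0.1, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
fold_b = [0.1, 0.7, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
selected = union_of_top([fold_a, fold_b])
```

Because the union is taken across folds, the second round can retain more than 15% of the original features when folds disagree on which predictors matter.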

Table 1: Mean RMSE across replicates for warm-start and cold-start prediction on simulated data, and AUROC for predictor selection.

| Model | Warm | Mode 1 | Mode 2 | Mode 3 | Mode 1&2 | Mode 1&3 | Mode 2&3 | Mode 1,2&3 | Mode 1 AUROC |
|---|---|---|---|---|---|---|---|---|---|
| **Data generated with only 1D interactions** | | | | | | | | | |
| Mean | 0.65 | 0.57 | 0.36 | 0.52 | 1.19 | 1.29 | 1.16 | 0.81 | |
| LASSO | 0.10 | 0.14 | 0.18 | 0.27 | 0.18 | 0.31 | 0.30 | 0.30 | 0.93 |
| Random forest | 0.18 | 0.59 | 0.40 | 0.58 | 0.63 | 0.76 | 0.62 | 0.77 | 0.64 |
| Neural net | 0.65 | 0.77 | 0.68 | 0.72 | 0.73 | 0.85 | 0.82 | 0.84 | 0.72 |
| BaTFLED CP. | 0.38 | 0.46 | 0.42 | 0.46 | 0.45 | 0.52 | 0.45 | 0.46 | 0.87 |
| BaTFLED Tucker | 0.11 | 0.14 | 0.23 | 0.22 | 0.24 | 0.25 | 0.29 | 0.28 | 0.90 |
| **Data generated with 1D & 2D interactions** | | | | | | | | | |
| Mean | 0.88 | 0.74 | 0.78 | 0.84 | 0.94 | 1.09 | 1.37 | 0.89 | |
| LASSO | 0.95 | 0.86 | 0.93 | 1.05 | 0.81 | 0.86 | 1.13 | 0.88 | 0.74 |
| Random forest | 0.42 | 0.69 | 0.73 | 0.93 | 0.75 | 0.82 | 1.05 | 0.90 | 0.73 |
| Neural net | 0.37 | 0.73 | 0.72 | 0.84 | 0.81 | 0.86 | 1.05 | 0.96 | 0.88 |
| BaTFLED CP. | 0.68 | 0.70 | 0.67 | 0.77 | 0.66 | 0.71 | 0.81 | 0.64 | 0.90 |
| BaTFLED Tucker | 0.11 | 0.12 | 0.11 | 0.11 | 0.12 | 0.12 | 0.12 | 0.12 | 0.98 |
| **Data generated with 1D, 2D & 3D interactions** | | | | | | | | | |
| Mean | 0.98 | 1.02 | 0.80 | 1.05 | 0.85 | 1.13 | 0.89 | 0.81 | |
| LASSO | 0.96 | 0.95 | 0.82 | 1.08 | 0.79 | 1.06 | 0.83 | 0.82 | 0.58 |
| Random forest | 0.61 | 0.92 | 0.78 | 1.05 | 0.81 | 1.06 | 0.84 | 0.85 | 0.75 |
| Neural net | 0.33 | 0.92 | 0.82 | 1.10 | 0.89 | 1.03 | 0.88 | 0.92 | 0.85 |
| BaTFLED CP. | 0.96 | 0.95 | 0.83 | 1.08 | 0.78 | 1.05 | 0.83 | 0.81 | 0.48 |
| BaTFLED Tucker | 0.12 | 0.12 | 0.12 | 0.12 | 0.12 | 0.13 | 0.12 | 0.13 | 1.00 |

Table 2: Pearson correlation on the DREAM data. The first five columns are from cross-validation on the training dataset (CV); the last three are from the final test set (Test).

| Model | CV: Training | CV: Warm | CV: Cell lines | CV: Drugs | CV: Cell lines & drugs | Test: Training | Test: Warm | Test: Cell lines |
|---|---|---|---|---|---|---|---|---|
| Mean | 0.65 | 0.40 | 0.63 | 0.66 | 0.43 | 0.65 | 0.46 | 0.57 |
| LASSO | 0.79 | 0.77 | 0.62 | 0.66 | 0.48 | 0.79 | 0.75 | 0.59 |
| Random forest | 0.87 | 0.82 | 0.58 | 0.66 | 0.42 | 0.87 | 0.78 | 0.58 |
| Neural net | 0.94 | 0.78 | 0.53 | 0.58 | 0.39 | 0.93 | 0.77 | 0.60 |
| BaTFLED Tucker\* | 0.96 | 0.75 | 0.70 | 0.71 | 0.57 | 0.96 | 0.76 | 0.61 |

\* Cross-validation results are from a two-round training strategy.

## 4 Conclusions

As technological advancements in biology allow for the characterization of systems across an increasing number of molecular, structural, and behavioral dimensions, methods that can combine such datasets in a principled and interpretable manner are greatly needed. Tensor factorizations provide a natural framework for organizing such data and this work demonstrates the usefulness of these methods in a predictive setting. BaTFLED is designed for general use, is freely available and can be easily applied to many different applications. A fully functional R package is available at https://www.github.com/nathanlazar/BaTFLED3D.

#### Acknowledgments

Thanks to Dr. Shannon McWeeney, Dr. Adam Margolin and Dr. Lucia Carbone.

## References

Aiello, S., Kraljevic, T., Maj, P. and with contributions from the H2O.ai team (2016). h2o: R Interface for H2O. R package version 3.10.0.8. https://CRAN.R-project.org/package=h2o

Bindal, N., Forbes, S.A., Beare, D., Gunasekaran, P., Leung, K., Kok, C.Y., Jia, M., Bamford, S., Cole, C., Ward, S., et al. (2011). COSMIC: the catalogue of somatic mutations in cancer. Genome Biology 12: P3.

Costello, J.C., Heiser, L.M., Georgii, E., Gönen, M., Menden, M.P., Wang, N.J., Bansal, M., Ammad-ud-din, M., Hintsanen, P., Khan, S.A., et al. (2014). A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology 32: 1202–1212.

Friedman, J., Hastie, T. & Tibshirani R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1): 1-22. URL http://www.jstatsoft.org/v33/i01/

Hitchcock, F.L. (1927). The expression of a tensor or a polyadic as a sum of products (Cambridge, Mass.: Inst. of Technology).

Tucker, L.R. (1964). The extension of factor analysis to three-dimensional matrices. In H. Gulliksen, and N. Frederiksen (eds.), Contributions to Mathematical Psychology., pp. 110–127 New York: Holt, Rinehart and Winston.

Yap, C.W. (2011). PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry 32: 1466–1474.

## 5 Details of the BaTFLED model

### 5.1 Notation

The notation throughout this paper mostly follows the conventions of the excellent review (Kolda and Bader, 2007). I will use boldface capital letters to refer to tensors, capital letters to refer to matrices (and index upper limits), boldface lower-case letters to refer to vectors and lower-case letters to refer to scalars. Indices will typically range from 1 to their capital version so, for example, the three modes of the third-order response tensor Y have indices , and where , and and a typical element will be written . I use for the input feature data, for the projection matrices, for the matrix of priors on the precision (inverse variance) of the projection matrices, for the latent matrices, and C for the core tensor. The superscripts indicate the three modes (, and to suggest cell lines, drugs and doses) and core elements (). For simplicity we denote all the observed values and set parameters (except the responses) by .

(1) |

and all random variables by

(2) |

### 5.2 Probabilistic model

The full joint distribution of the prior parameters , projection matrices , latent factor matrices and the core tensor C given the observed data and the set parameters is summarized as . This distribution factors as follows (dependencies on set parameters are omitted for clarity).

(3) |

The distributional assumptions for each of the factors are

(4) | ||||

(5) | ||||

(6) | ||||

(7) | ||||

(8) | ||||

(9) | ||||

(10) | ||||

(11) | ||||

(12) | ||||

(13) | ||||

(14) | ||||

(15) | ||||

Where is a normal distribution with mean and variance and is a gamma distribution with shape and scale . We use a shared variance for the response tensor and shared variances , and for the elements of the latent matrices and respectively. The gamma distributions on the parameters for the projection matrices and and the core C allow the user to encourage sparsity in these parts of the model. By setting the prior shape and scale parameters to extreme values (e.g. and ) the majority of values move toward zero as the model is trained. The precision values in the projection () matrices can optionally be shared across rows, which encourages the use of the same predictor variables for all latent factors.

Also, columns of ones can be introduced into the input () matrices and latent () matrices. In the input matrices this allows for bias terms, and in the latent matrices, these columns allow the model to learn marginal coefficients that apply to lower-dimensional subsets of the response tensor. For example, if all three matrices have a constant, we extend the ranges of , and to include a zero index and the corresponding element of the core tensor can learn the ’intercept’ for the response tensor: a value that is added to all responses. Similarly the element is a coefficient for the first latent factor for the first mode and encodes the one-dimensional effects that this latent factor will have on response regardless of the other two modes. The coefficient encodes interactions between the first latent factors of modes one and two that occur regardless of mode three.
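Written out, the role of the zero index becomes explicit when the Tucker sum is grouped by how many modes contribute a latent factor. The symbols below are assumptions introduced only for this illustration and may not match the notation of section 5.1: $h^{(1)}_{ir}$, $h^{(2)}_{js}$ and $h^{(3)}_{kt}$ denote latent matrix entries with $h^{(\cdot)}_{\cdot 0} = 1$, $c_{rst}$ denotes a core element, and $R_1$, $R_2$, $R_3$ are the latent dimensions.

$$
\begin{aligned}
\hat{y}_{ijk} = {} & c_{000}
 + \sum_{r=1}^{R_1} c_{r00}\, h^{(1)}_{ir}
 + \sum_{s=1}^{R_2} c_{0s0}\, h^{(2)}_{js}
 + \sum_{t=1}^{R_3} c_{00t}\, h^{(3)}_{kt} \\
 & + \sum_{r=1}^{R_1}\sum_{s=1}^{R_2} c_{rs0}\, h^{(1)}_{ir} h^{(2)}_{js}
 + \sum_{r=1}^{R_1}\sum_{t=1}^{R_3} c_{r0t}\, h^{(1)}_{ir} h^{(3)}_{kt}
 + \sum_{s=1}^{R_2}\sum_{t=1}^{R_3} c_{0st}\, h^{(2)}_{js} h^{(3)}_{kt} \\
 & + \sum_{r=1}^{R_1}\sum_{s=1}^{R_2}\sum_{t=1}^{R_3} c_{rst}\, h^{(1)}_{ir} h^{(2)}_{js} h^{(3)}_{kt}
\end{aligned}
$$

The first term is the intercept, the next three are one-dimensional effects, the following three are two-mode interactions, and the last is the full three-mode interaction.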

## 6 Inference

With the model established above, our goal is to find the posterior distribution of variables in the model given the observations as well as the marginal likelihood or model evidence . Exact inference of the posterior is not analytically tractable, and sampling-based approaches like Markov chain Monte Carlo (MCMC) would be computationally prohibitive for a model of this size. Instead we use a variational approximation (or variational Bayes) approach that maximizes a lower bound on the log of the marginal likelihood. This lower bound can be written as:

(17) |

The new $q$ distribution is a joint distribution of the same variables as the $p$ distribution, and it approximates the $p$ distribution under the assumption that $q$ is factorable in some convenient way. Although the $q$ distribution has the same unknown variables as $p$, the parameters of these distributions (mean, variance, etc.) are different from those given above. Closed forms for these parameters can be derived analytically and they depend on expectations over the distributions of other variables in the model. Thus, in order to find the $q$ distribution that best approximates the $p$ distribution, an iterative expectation maximization process is employed.
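For orientation, the standard form of a variational lower bound is given here in generic notation. The symbols are assumptions for this illustration and may not match those used in equation (17): $\mathbf{Y}$ stands for the responses, $\Theta$ for the observed inputs and set parameters, and $\Xi$ for all random variables in the model.

$$
\log p(\mathbf{Y} \mid \Theta) \;\ge\; \mathcal{L}(q) \;=\; \mathbb{E}_{q(\Xi)}\big[\log p(\mathbf{Y}, \Xi \mid \Theta)\big] \;-\; \mathbb{E}_{q(\Xi)}\big[\log q(\Xi)\big]
$$

The gap between the two sides is the Kullback-Leibler divergence $\mathrm{KL}\big(q(\Xi)\,\|\,p(\Xi \mid \mathbf{Y}, \Theta)\big)$, so raising the bound moves $q$ closer to the true posterior.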

The joint likelihood is given in (3) and we assume that the $q$ distribution is fully factorable:

(18) |

Thus the lower bound can be written:

(19) |

The parameters for the distributions of a given variable that maximize this lower bound can be obtained by finding the expectation of with respect to the q distributions over all other variables (denoted ). This gives expressions (update equations) for each of the distributions that depend only on moments of the other variables (and fixed parameters).

(20) |
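In generic mean-field notation, the optimal factor for a single variable takes the following well-known form. The symbols are again assumptions for illustration: $\xi_j$ is one variable (or block of variables) in $\Xi$, $q_{-j}$ is the product of all other factors, and $\mathbf{Y}$ and $\Theta$ are the responses and the observed or set quantities.

$$
\log q_j^{*}(\xi_j) \;=\; \mathbb{E}_{q_{-j}}\big[\log p(\mathbf{Y}, \Xi \mid \Theta)\big] \;+\; \text{const}
$$

With conjugate priors, the right-hand side matches the log density of a known family (Gaussian or gamma in this model), which is what yields closed-form updates in terms of moments of the other variables.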

### 6.1 Update equations

The optimal distributions for individual elements of the first mode and core elements are given below; the updates for the other modes are identical in form. All expectations are over the $q$ distributions.

Prior matrices:

(21) |

(22) |

Projection matrices:

(23) |

with

Latent matrices:

(24) |

with

and

Core tensor C:

(25) |

with

## 7 Experimental results

Results for additional experiments and performance measures are given below.

### 7.1 Simulated data

Responses are simulated from the structure assumed by the BaTFLED model. Each of the three modes has 30 samples, 4 latent factors and 100 features, 10 of which influence response. Two samples for each mode are held out as validation data. For the ‘1D’ interaction responses, the core only has values along the edges corresponding to the constant columns in the latent matrices. This means that no interactions between modes influence responses. For the ‘1D & 2D’ interaction responses, the core only has values in the matrix slices corresponding to the constant columns, so responses are built from interactions between at most two of the modes. For the ‘1D, 2D & 3D’ interaction responses, the core has values throughout, but is sparse so half of potential interactions have weight zero and do not contribute to the responses.

BaTFLED Tucker models are trained for 100 iterations with a xx core and sparsity priors and on both the projection matrices and the core. BaTFLED CP models are trained for 100 iterations with latent features and sparsity priors and on the projection matrices. The features for elastic net, neural net and random forest models are vectors consisting of the features for each mode concatenated. Elastic net models are trained using the R package ‘glmnet’ (Friedman et al., 2010) and we choose lambda values that give the most regularized model within one standard deviation of the model with the lowest cross-validated error. Neural net models are trained using the ‘h2o’ R package (Aiello et al., 2016) with 1, 2 and 3 hidden layers with 1,500, 750 and 500 nodes in each layer respectively. Activations are rectified linear functions and a dropout fraction of on the input layer and on the hidden layers is used. For each replication, convergence is determined by monitoring the mean squared error (MSE) on ten cross-validation folds and stopping when the MSE changes by less than across five epochs. Random forest models are also trained in R using the ‘h2o’ package. Splitting is determined using the automatic option ‘mtries=-1’, and models with 1,000 trees of depth 5, 5,000 trees of depth 5, and 1,000 trees of depth 10 are tested.
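The lambda selection rule for the elastic net models can be sketched as below. It is written as the usual one-standard-error rule, taking the spread in the text to mean the standard error of the cross-validated error at each lambda. That reading, the function name and the example numbers are all assumptions.

```python
def one_se_lambda(lambdas, cv_mean, cv_se):
    """Largest (most regularizing) lambda whose mean cross-validated error
    is within one standard error of the minimum mean error."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    return max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)


# Hypothetical cross-validation summary for four candidate lambda values.
lambdas = [1.0, 0.5, 0.1, 0.01]
cv_mean = [1.5, 1.05, 1.0, 1.2]
cv_se = [0.1, 0.1, 0.1, 0.1]
chosen = one_se_lambda(lambdas, cv_mean, cv_se)
```

Choosing the larger lambda trades a small increase in estimated error for a sparser model, which suits a setting where feature selection is part of the comparison.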

### 7.2 DREAM data

When running on DREAM data, separate predictors are trained for each dose for elastic net, random forest and neural net models. The inputs for each method are vectors containing the features for cell lines and features for drugs concatenated together. Elastic net models were run using the R package ‘glmnet’, while the ‘h2o’ package was used for random forest and neural net training. The same model parameters that were used on the simulated data were tested here.

For cross-validation runs, BaTFLED models were run for 200 iterations with strong sparsity priors ( and ) on the first round. The second round was run using the union of the top 15% of predictors across folds (278 cell line features and 163 drug features for CP models, 152 cell line features and 105 drug features for Tucker models), for 40 iterations, without encouraging sparsity ( and ). Tucker models use a xx core and CP models have a latent dimension of .

For the final testing on the 17 held-out cell lines, BaTFLED models were run for 120 iterations for both rounds with sparsity encouraged only on the first round. Again the union of the top 15% of predictors across replicates was used for the second round (191 cell line features and 102 drug features for CP models and 330 cell line features and 172 drug features for Tucker models). Elastic net, random forest and neural net models were run as above.

All models were run on Intel Xeon processors in parallel when possible. The elastic net models run on a single core and take a few minutes to complete, while random forest and neural net models take 10-20 minutes on the simulated data using 24 cores. On the DREAM data, run time was similar for the random forest models, but significantly longer for neural net models, with some taking up to 14 hours. The BaTFLED models tested here run in 10-20 minutes for simulated data and up to 8 hours on the DREAM data using 16 cores. However, the code has not yet been thoroughly optimized, so significant improvements are likely possible.

Columns Mode 1 through Mode 1,2&3 are cold start predictions.

| Model | Train | Warm | Mode 1 | Mode 2 | Mode 3 | Mode 1&2 | Mode 1&3 | Mode 2&3 | Mode 1,2&3 |
|---|---|---|---|---|---|---|---|---|---|
| **Data generated with only 1D interactions** | | | | | | | | | |
| Mean | 0.56(0.14) | 0.59(0.15) | 0.57(0.44) | 0.36(0.15) | 0.52(0.28) | 1.19(0.50) | 1.29(0.56) | 1.16(0.30) | 0.81(0.20) |
| LASSO | 0.10(0.00) | 0.10(0.01) | 0.14(0.04) | 0.18(0.10) | 0.27(0.19) | 0.18(0.07) | 0.31(0.18) | 0.30(0.16) | 0.30(0.16) |
| E. Net | 0.10(0.00) | 0.10(0.01) | 0.15(0.04) | 0.18(0.10) | 0.28(0.21) | 0.19(0.06) | 0.32(0.19) | 0.31(0.17) | 0.31(0.17) |
| E. Net | 0.10(0.00) | 0.10(0.01) | 0.19(0.04) | 0.20(0.11) | 0.32(0.28) | 0.20(0.06) | 0.39(0.25) | 0.37(0.25) | 0.37(0.25) |
| Ridge | 0.10(0.00) | 0.10(0.01) | 0.45(0.31) | 0.29(0.12) | 0.55(0.28) | 0.46(0.18) | 0.75(0.29) | 0.65(0.30) | 0.76(0.23) |
| R. Forest 1Kx5 | 0.42(0.04) | 0.42(0.04) | 0.70(0.34) | 0.47(0.13) | 0.73(0.29) | 0.63(0.22) | 0.83(0.20) | 0.65(0.20) | 0.78(0.14) |
| R. Forest 5Kx5 | 0.42(0.05) | 0.42(0.05) | 0.67(0.30) | 0.48(0.14) | 0.70(0.27) | 0.62(0.17) | 0.79(0.20) | 0.66(0.25) | 0.77(0.20) |
| R. Forest 1Kx10 | 0.16(0.01) | 0.18(0.01) | 0.59(0.35) | 0.40(0.16) | 0.58(0.30) | 0.63(0.22) | 0.76(0.22) | 0.62(0.21) | 0.77(0.18) |
| Neural net 1L. | 0.43(0.05) | 0.42(0.06) | 0.54(0.31) | 0.50(0.08) | 0.58(0.25) | 0.52(0.20) | 0.74(0.24) | 0.71(0.27) | 0.74(0.19) |
| Neural net 2L. | 0.52(0.02) | 0.53(0.03) | 0.67(0.31) | 0.62(0.10) | 0.59(0.24) | 0.68(0.18) | 0.78(0.25) | 0.77(0.28) | 0.81(0.22) |
| Neural net 3L. | 0.63(0.03) | 0.65(0.04) | 0.77(0.28) | 0.68(0.07) | 0.72(0.21) | 0.73(0.16) | 0.85(0.22) | 0.82(0.23) | 0.84(0.18) |
| BaTFLED CP. | 0.37(0.04) | 0.38(0.04) | 0.46(0.15) | 0.42(0.11) | 0.46(0.13) | 0.45(0.20) | 0.52(0.15) | 0.45(0.11) | 0.46(0.18) |
| BaTFLED Tucker | 0.11(0.00) | 0.11(0.01) | 0.16(0.05) | 0.20(0.16) | 0.15(0.03) | 0.22(0.13) | 0.18(0.05) | 0.22(0.14) | 0.23(0.12) |
| **Data generated with 1D & 2D interactions** | | | | | | | | | |
| Mean | 0.83(0.05) | 0.88(0.07) | 0.74(0.26) | 0.78(0.22) | 0.84(0.39) | 0.94(0.30) | 1.09(0.31) | 1.37(0.34) | 0.89(0.39) |
| LASSO | 0.92(0.04) | 0.95(0.08) | 0.86(0.19) | 0.93(0.15) | 1.05(0.27) | 0.81(0.23) | 0.86(0.24) | 1.13(0.36) | 0.88(0.37) |
| E. Net | 0.92(0.04) | 0.96(0.08) | 0.86(0.19) | 0.93(0.15) | 1.05(0.27) | 0.80(0.23) | 0.86(0.23) | 1.13(0.36) | 0.88(0.36) |
| E. Net | 0.92(0.04) | 0.95(0.08) | 0.87(0.19) | 0.93(0.15) | 1.05(0.28) | 0.81(0.23) | 0.86(0.23) | 1.14(0.36) | 0.89(0.36) |
| Ridge | 0.92(0.04) | 0.95(0.08) | 0.87(0.19) | 0.95(0.15) | 1.06(0.28) | 0.82(0.26) | 0.85(0.20) | 1.18(0.38) | 0.92(0.40) |
| R. Forest 1Kx5 | 0.71(0.08) | 0.73(0.09) | 0.79(0.18) | 0.84(0.17) | 1.01(0.35) | 0.77(0.22) | 0.86(0.21) | 1.11(0.33) | 0.88(0.34) |
| R. Forest 5Kx5 | 0.71(0.08) | 0.72(0.09) | 0.78(0.18) | 0.83(0.17) | 1.01(0.36) | 0.77(0.23) | 0.83(0.22) | 1.08(0.30) | 0.88(0.33) |
| R. Forest 1Kx10 | 0.38(0.04) | 0.42(0.05) | 0.69(0.20) | 0.73(0.18) | 0.93(0.41) | 0.75(0.21) | 0.82(0.22) | 1.05(0.31) | 0.90(0.35) |
| Neural net 1L. | 0.44(0.03) | 0.44(0.04) | 0.81(0.25) | 0.77(0.23) | 0.97(0.35) | 0.84(0.28) | 0.90(0.21) | 1.09(0.35) | 0.91(0.42) |
| Neural net 2L. | 0.29(0.02) | 0.30(0.02) | 0.74(0.27) | 0.71(0.22) | 0.87(0.38) | 0.80(0.30) | 0.86(0.22) | 1.09(0.33) | 0.91(0.45) |
| Neural net 3L. | 0.37(0.04) | 0.37(0.04) | 0.73(0.27) | 0.72(0.22) | 0.84(0.35) | 0.81(0.32) | 0.86(0.23) | 1.05(0.31) | 0.96(0.50) |
| BaTFLED CP. | 0.66(0.18) | 0.68(0.18) | 0.70(0.17) | 0.67(0.19) | 0.77(0.29) | 0.66(0.24) | 0.71(0.15) | 0.81(0.36) | 0.64(0.27) |
| BaTFLED Tucker | 0.11(0.01) | 0.11(0.01) | 0.11(0.01) | 0.11(0.01) | 0.11(0.01) | 0.11(0.01) | 0.12(0.02) | 0.12(0.02) | 0.11(0.04) |
| **Data generated with 1D, 2D & 3D interactions** | | | | | | | | | |
| Mean | 0.94(0.04) | 0.98(0.05) | 1.02(0.30) | 0.80(0.37) | 1.05(0.32) | 0.85(0.43) | 1.13(0.47) | 0.89(0.49) | 0.81(0.58) |
| LASSO | 0.99(0.01) | 0.96(0.04) | 0.95(0.29) | 0.82(0.36) | 1.08(0.35) | 0.79(0.44) | 1.06(0.49) | 0.83(0.51) | 0.82(0.58) |
| E. Net | 1.00(0.01) | 0.96(0.04) | 0.95(0.29) | 0.82(0.36) | 1.08(0.35) | 0.79(0.44) | 1.06(0.49) | 0.83(0.51) | 0.82(0.58) |
| E. Net | 0.99(0.01) | 0.96(0.04) | 0.95(0.29) | 0.82(0.36) | 1.08(0.35) | 0.79(0.43) | 1.06(0.49) | 0.83(0.51) | 0.82(0.58) |
| Ridge | 1.00(0.01) | 0.96(0.04) | 0.95(0.29) | 0.82(0.36) | 1.08(0.35) | 0.79(0.44) | 1.06(0.49) | 0.83(0.51) | 0.81(0.58) |
| R. Forest 1Kx5 | 0.86(0.03) | 0.85(0.04) | 0.96(0.30) | 0.78(0.34) | 1.07(0.34) | 0.81(0.40) | 1.08(0.48) | 0.82(0.49) | 0.83(0.55) |
| R. Forest 5Kx5 | 0.86(0.02) | 0.86(0.04) | 0.95(0.27) | 0.80(0.34) | 1.07(0.33) | 0.81(0.39) | 1.06(0.47) | 0.83(0.49) | 0.81(0.56) |
| R. Forest 1Kx10 | 0.55(0.05) | 0.61(0.07) | 0.92(0.25) | 0.78(0.32) | 1.05(0.31) | 0.81(0.39) | 1.06(0.44) | 0.84(0.46) | 0.85(0.55) |
| Neural net 1L. | 0.46(0.04) | 0.43(0.03) | 0.92(0.27) | 0.92(0.25) | 1.07(0.28) | 0.83(0.42) | 1.03(0.45) | 0.87(0.48) | 0.81(0.56) |
| Neural net 2L. | 0.24(0.01) | 0.26(0.02) | 0.89(0.25) | 0.81(0.32) | 1.10(0.26) | 0.81(0.43) | 1.04(0.45) | 0.86(0.47) | 0.83(0.56) |
| Neural net 3L. | 0.30(0.04) | 0.33(0.03) | 0.92(0.28) | 0.82(0.32) | 1.10(0.28) | 0.89(0.44) | 1.03(0.43) | 0.88(0.45) | 0.92(0.56) |
| BaTFLED CP. | 1.00(0.00) | 0.96(0.04) | 0.95(0.29) | 0.83(0.35) | 1.08(0.35) | 0.78(0.45) | 1.05(0.50) | 0.83(0.51) | 0.81(0.59) |
| BaTFLED Tucker | 0.12(0.03) | 0.12(0.02) | 0.12(0.03) | 0.13(0.04) | 0.12(0.02) | 0.13(0.04) | 0.12(0.02) | 0.13(0.04) | 0.12(0.04) |

Cold start prediction results are in the columns Mode 1 through Mode 1,2&3.

| | Train | Warm | Mode 1 | Mode 2 | Mode 3 | Mode 1&2 | Mode 1&3 | Mode 2&3 | Mode 1,2&3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data generated with only 1D interactions | | | | | | | | | |
| Mean | 0.77(0.08) | 0.76(0.08) | 0.90(0.11) | 0.97(0.03) | 0.85(0.15) | 0.00(0.42) | 0.01(0.51) | -0.12(0.39) | NA |
| LASSO | 1.00(0.00) | 1.00(0.00) | 0.99(0.00) | 0.99(0.01) | 0.94(0.13) | 0.97(0.03) | 0.90(0.20) | 0.92(0.13) | 0.88(0.20) |
| E. Net | 1.00(0.00) | 1.00(0.00) | 0.99(0.01) | 0.99(0.01) | 0.93(0.13) | 0.97(0.03) | 0.90(0.20) | 0.92(0.13) | 0.87(0.21) |
| E. Net | 1.00(0.00) | 1.00(0.00) | 0.99(0.01) | 0.99(0.01) | 0.91(0.16) | 0.97(0.04) | 0.86(0.22) | 0.89(0.16) | 0.82(0.23) |
| Ridge | 1.00(0.00) | 1.00(0.00) | 0.94(0.08) | 0.97(0.02) | 0.83(0.19) | 0.89(0.08) | 0.64(0.26) | 0.75(0.26) | 0.29(0.55) |
| R. Forest 1Kx5 | 0.92(0.02) | 0.92(0.02) | 0.79(0.16) | 0.93(0.04) | 0.75(0.16) | 0.79(0.15) | 0.39(0.56) | 0.74(0.32) | 0.26(0.61) |
| R. Forest 5Kx5 | 0.92(0.01) | 0.92(0.02) | 0.81(0.11) | 0.93(0.04) | 0.75(0.16) | 0.81(0.10) | 0.42(0.59) | 0.75(0.33) | 0.28(0.67) |
| R. Forest 1Kx10 | 0.99(0.00) | 0.99(0.00) | 0.86(0.15) | 0.96(0.03) | 0.84(0.15) | 0.80(0.13) | 0.48(0.46) | 0.76(0.26) | 0.10(0.63) |
| Neural net 1L. | 0.99(0.00) | 0.99(0.00) | 0.92(0.08) | 0.96(0.03) | 0.85(0.15) | 0.87(0.08) | 0.64(0.26) | 0.78(0.19) | 0.21(0.56) |
| Neural net 2L. | 0.98(0.00) | 0.98(0.00) | 0.91(0.15) | 0.94(0.05) | 0.84(0.15) | 0.80(0.21) | 0.56(0.44) | 0.69(0.33) | 0.08(0.63) |
| Neural net 3L. | 0.99(0.00) | 0.99(0.00) | 0.91(0.13) | 0.95(0.04) | 0.84(0.15) | 0.78(0.23) | 0.38(0.50) | 0.64(0.35) | -0.02(0.69) |
| BaTFLED CP. | 0.95(0.01) | 0.95(0.01) | 0.90(0.09) | 0.94(0.03) | 0.88(0.09) | 0.87(0.13) | 0.76(0.26) | 0.86(0.11) | 0.59(0.55) |
| BaTFLED Tucker | 0.99(0.00) | 0.99(0.00) | 0.99(0.01) | 0.99(0.00) | 0.99(0.01) | 0.98(0.02) | 0.97(0.02) | 0.98(0.01) | 0.95(0.06) |
| Data generated with 1D & 2D interactions | | | | | | | | | |
| Mean | 0.55(0.08) | 0.51(0.09) | 0.53(0.29) | 0.62(0.14) | 0.63(0.23) | 0.08(0.34) | 0.00(0.23) | -0.08(0.15) | NA |
| LASSO | 0.38(0.09) | 0.36(0.13) | 0.23(0.29) | 0.26(0.35) | 0.26(0.25) | 0.04(0.31) | 0.17(0.44) | 0.25(0.35) | 0.17(0.56) |
| E. Net | 0.38(0.09) | 0.36(0.13) | 0.23(0.29) | 0.26(0.35) | 0.26(0.24) | 0.05(0.30) | 0.17(0.44) | 0.26(0.35) | 0.17(0.57) |
| E. Net | 0.38(0.09) | 0.36(0.13) | 0.22(0.28) | 0.26(0.35) | 0.26(0.25) | 0.04(0.30) | 0.17(0.44) | 0.25(0.35) | 0.18(0.56) |
| Ridge | 0.39(0.09) | 0.37(0.13) | 0.20(0.30) | 0.26(0.34) | 0.25(0.26) | 0.01(0.30) | 0.23(0.39) | 0.21(0.39) | 0.11(0.32) |
| R. Forest 1Kx5 | 0.74(0.06) | 0.75(0.08) | 0.46(0.20) | 0.46(0.34) | 0.40(0.28) | 0.19(0.36) | 0.24(0.36) | 0.33(0.37) | 0.22(0.46) |
| R. Forest 5Kx5 | 0.75(0.06) | 0.75(0.07) | 0.51(0.20) | 0.44(0.41) | 0.38(0.35) | 0.16(0.39) | 0.35(0.30) | 0.34(0.43) | 0.12(0.51) |
| R. Forest 1Kx10 | 0.94(0.01) | 0.93(0.02) | 0.65(0.22) | 0.64(0.29) | 0.53(0.29) | 0.24(0.53) | 0.34(0.41) | 0.47(0.45) | -0.06(0.66) |
| Neural net 1L. | 0.98(0.00) | 0.98(0.00) | 0.60(0.19) | 0.69(0.16) | 0.62(0.22) | 0.13(0.38) | 0.28(0.21) | 0.40(0.40) | 0.11(0.45) |
| Neural net 2L. | 0.98(0.00) | 0.98(0.00) | 0.57(0.18) | 0.71(0.16) | 0.63(0.20) | 0.17(0.34) | 0.22(0.27) | 0.51(0.23) | 0.07(0.36) |
| Neural net 3L. | 0.97(0.00) | 0.97(0.01) | 0.57(0.19) | 0.70(0.17) | 0.64(0.20) | 0.19(0.40) | 0.24(0.25) | 0.54(0.27) | 0.16(0.43) |
| BaTFLED CP. | 0.69(0.26) | 0.67(0.31) | 0.50(0.34) | 0.67(0.28) | 0.66(0.27) | 0.47(0.35) | 0.48(0.28) | 0.70(0.26) | 0.51(0.50) |
| BaTFLED Tucker | 0.99(0.00) | 0.99(0.00) | 0.99(0.00) | 0.99(0.00) | 0.99(0.00) | 0.99(0.01) | 0.99(0.01) | 0.99(0.01) | 0.96(0.08) |
| Data generated with 1D, 2D & 3D interactions | | | | | | | | | |
| Mean | 0.31(0.11) | 0.08(0.12) | -0.11(0.50) | 0.22(0.40) | 0.23(0.25) | -0.02(0.16) | -0.05(0.14) | 0.03(0.08) | NA |
| LASSO \* | 0.15(0.05) | 0.08(0.08) | 0.02(0.17) | 0.20(0.12) | 0.01(0.17) | 0.12(0.17) | -0.19(0.23) | 0.16(0.25) | -0.07(0.35) |
| E. Net \* | 0.16(0.06) | 0.09(0.09) | 0.01(0.19) | 0.17(0.12) | 0.00(0.19) | 0.12(0.19) | -0.23(0.25) | 0.12(0.26) | -0.11(0.37) |
| E. Net \* | 0.17(0.05) | 0.10(0.09) | 0.00(0.20) | 0.18(0.12) | 0.01(0.20) | 0.13(0.20) | -0.29(0.25) | 0.18(0.24) | -0.21(0.34) |
| Ridge \* | 0.20(0.05) | 0.11(0.10) | -0.01(0.22) | 0.11(0.18) | 0.01(0.25) | -0.02(0.30) | -0.20(0.35) | 0.03(0.36) | -0.11(0.31) |
| R. Forest 1Kx5 | 0.59(0.06) | 0.53(0.11) | 0.01(0.30) | 0.27(0.31) | 0.13(0.20) | 0.05(0.25) | -0.04(0.26) | 0.18(0.34) | -0.07(0.43) |
| R. Forest 5Kx5 | 0.59(0.06) | 0.51(0.11) | 0.11(0.35) | 0.23(0.33) | 0.14(0.17) | 0.03(0.32) | 0.04(0.26) | 0.16(0.30) | 0.05(0.43) |
| R. Forest 1Kx10 | 0.88(0.03) | 0.81(0.06) | 0.22(0.49) | 0.33(0.34) | 0.24(0.24) | 0.09(0.28) | 0.04(0.37) | 0.17(0.32) | -0.02(0.46) |
| Neural net 1L. | 0.97(0.00) | 0.96(0.01) | 0.25(0.50) | 0.39(0.39) | 0.21(0.27) | 0.07(0.35) | 0.12(0.37) | 0.05(0.26) | -0.06(0.26) |
| Neural net 2L. | 0.97(0.00) | 0.97(0.00) | 0.30(0.45) | 0.38(0.38) | 0.22(0.19) | 0.19(0.36) | 0.14(0.31) | 0.09(0.33) | 0.06(0.30) |
| Neural net 3L. | 0.97(0.00) | 0.96(0.00) | 0.27(0.47) | 0.35(0.44) | 0.21(0.24) | 0.09(0.36) | 0.19(0.37) | 0.12(0.34) | -0.09(0.37) |
| BaTFLED CP. | 0.01(0.03) | 0.00(0.07) | 0.01(0.06) | -0.02(0.05) | 0.02(0.07) | 0.07(0.20) | 0.02(0.19) | 0.00(0.16) | 0.18(0.36) |
| BaTFLED Tucker | 0.99(0.00) | 0.99(0.00) | 0.99(0.01) | 0.98(0.01) | 0.99(0.00) | 0.97(0.04) | 0.99(0.02) | 0.98(0.01) | 0.98(0.03) |

\* Elastic net set all weights to zero in 4-5 of 10 runs.

| | LASSO | E. Net | E. Net | Ridge | RF 1Kx5 | RF 5Kx5 | RF 1Kx10 | Neural net 1L. | Neural net 2L. | Neural net 3L. | BaTFLED CP. | BaTFLED Tucker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data generated with only 1D interactions | | | | | | | | | | | | |
| Mode 1 | 0.93(0.07) | 0.92(0.08) | 0.88(0.08) | 0.75(0.09) | 0.69(0.08) | 0.66(0.09) | 0.64(0.08) | 0.66(0.07) | 0.74(0.10) | 0.72(0.10) | 0.87(0.06) | 0.87(0.09) |
| Mode 2 | 0.84(0.06) | 0.84(0.06) | 0.84(0.05) | 0.77(0.05) | 0.65(0.05) | 0.64(0.03) | 0.62(0.08) | 0.64(0.06) | 0.70(0.07) | 0.71(0.08) | 0.78(0.08) | 0.80(0.11) |
| Mode 3 | 0.88(0.09) | 0.88(0.09) | 0.85(0.09) | 0.76(0.07) | 0.68(0.06) | 0.63(0.09) | 0.62(0.08) | 0.65(0.05) | 0.67(0.08) | 0.66(0.09) | 0.83(0.07) | 0.90(0.05) |
| Data generated with 1D & 2D interactions | | | | | | | | | | | | |
| Mode 1 | 0.74(0.10) | 0.73(0.09) | 0.73(0.09) | 0.74(0.06) | 0.68(0.10) | 0.75(0.08) | 0.73(0.08) | 0.79(0.06) | 0.85(0.06) | 0.88(0.05) | 0.90(0.08) | 0.99(0.02) |
| Mode 2 | 0.80(0.08) | 0.80(0.07) | 0.80(0.07) | 0.75(0.05) | 0.70(0.06) | 0.72(0.07) | 0.72(0.06) | 0.81(0.07) | 0.85(0.07) | 0.88(0.07) | 0.92(0.09) | 1.00(0.01) |
| Mode 3 | 0.79(0.05) | 0.78(0.06) | 0.79(0.07) | 0.79(0.09) | 0.72(0.05) | 0.72(0.05) | 0.76(0.04) | 0.86(0.06) | 0.90(0.04) | 0.91(0.04) | 0.84(0.12) | 0.99(0.02) |
| Data generated with 1D, 2D & 3D interactions | | | | | | | | | | | | |
| Mode 1 | 0.58(0.04) | 0.57(0.05) | 0.58(0.07) | 0.64(0.12) | 0.71(0.09) | 0.71(0.04) | 0.75(0.08) | 0.89(0.04) | 0.86(0.04) | 0.85(0.04) | 0.48(0.13) | 1.00(0.00) |
| Mode 2 | 0.57(0.05) | 0.57(0.05) | 0.58(0.05) | 0.65(0.12) | 0.70(0.09) | 0.70(0.08) | 0.73(0.05) | 0.87(0.05) | 0.85(0.06) | 0.83(0.08) | 0.51(0.10) | 1.00(0.00) |
| Mode 3 | 0.59(0.08) | 0.58(0.08) | 0.61(0.11) | 0.67(0.14) | 0.72(0.06) | 0.72(0.09) | 0.74(0.06) | 0.88(0.04) | 0.84(0.05) | 0.83(0.06) | 0.50(0.10) | 1.00(0.00) |

Training | Warm | Cell lines | Drugs | Cell lines & drugs | |

First training round cross-validation | |||||

Mean | 0.76(0.01) | 0.96(0.18) | 0.81(0.23) |