From Gene Expression to Drug Response: A Collaborative Filtering Approach
Predicting the response of cancer cells to drugs is an important problem in pharmacogenomics. Recent efforts in generation of large scale datasets profiling gene expression and drug sensitivity in cell lines have provided a unique opportunity to study this problem. However, one major challenge is the small number of samples (cell lines) compared to the number of features (genes) even in these large datasets. We propose a collaborative filtering (CF) like algorithm for modeling gene-drug relationship to identify patients most likely to benefit from a treatment. Due to the correlation of gene expressions in different cell lines, the gene expression matrix is approximately low-rank, which suggests that drug responses could be estimated from a reduced dimension latent space of the gene expression. Towards this end, we propose a joint low-rank matrix factorization and latent linear regression approach. Experiments with data from the Genomics of Drug Sensitivity in Cancer database are included to show that the proposed method can predict drug-gene associations better than the state-of-the-art methods.
From Gene Expression to Drug Response: A Collaborative Filtering Approach
|Cheng Qian Nicholas D. Sidiropoulos Magda Amiridi Amin Emad ††thanks: Email: firstname.lastname@example.org (C. Qian), email@example.com (N. D. Sidiropoulos), firstname.lastname@example.org (M. Amiridi) and email@example.com (A. Emad)|
|Department of Electrical and Computer Engineering, University of Virginia, VA, USA|
|Department of Electrical and Computer Engineering, McGill University, Montreal, QC, Canada|
Index Terms— Gene expression, drug response, linear regression, collaborative filtering, Genomics of Drug Sensitivity in Cancer (GDSC).
Selecting the right drugs is critical for cancer survival , but existing methods that predict a patient’s response to a particular drug are not reliable enough. Resistance to chemotherapy is a major issue, as time is of essence in many cases. Therefore, it is of great interest to construct predictive models of chemotherapy response that physicians can use to prescreen the most promising treatment options. In recent years, the field of pharmacogenomics has emerged as a very promising area with challenging problems that can benefit from more attention from the signal processing community.
Several large-scale studies have been recently conducted to measure the gene expression (i.e. transcriptomic) profile of hundreds of cell lines and their sensitivity to tens to hundreds of different drugs [2, 3, 4]. The results of these studies, which are available in databases such as the Cancer Cell Line Encyclopedia (CCLE) , the Genomics of Drug Sensitivity in Cancer (GDSC) , and the Cancer Therapeutics Response Portal (CTRP) [3.5], bring predictive models linking gene expression to drug response closer within reach.
Numerous drug sensitivity prediction algorithms have been proposed to characterize the relationship between transcriptomic information and drug response [5, 6, 7, 9, 10, 8]. Emad et al. recently proposed a gene prioritization method called Prioritization of Genes Enhanced with Network Information (ProGENI) to rank genes that are closely related to a phenotype . With the ranked genes, the authors employed a kernel support vector machine (SVM) for drug sensitivity prediction, and showed that ProGENI–identified genes can better predict drug response compared to genes identified by other widely used prioritization methods such as Pearson correlation and Elastic Net (EN). In , through a collaborative effort between the National Cancer Institute (NCI) and the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, a comparison of 44 different drug response prediction methods was undertaken, among which the Bayesian multitask multiple Kernel learning exhibited the best prediction performance. However, the training was based on just 35 samples, which seems very limited. To handle the cases where the number of genes is greater than that of cell lines, the prevailing methods rely on some sort of sparse regression for gene selection, to help resolve the underdeterminacy that arises in even the simplest linear prediction models [11, 12].
In this paper, we take a different approach. Motivated by the observation that the gene expression matrix is approximately (very) low-rank, instead of relying on gene selection to obtain a well-posed problem, we propose a collaborative filtering (CF) approach based on joint low-rank biased matrix factorization and linear prediction from the latent space. It is worth highlighting that unlike existing methods that ignore the bias in the expression of different genes, CF takes this bias into account, which results in a more accurate model. We provide preliminary results that corroborate the effectiveness of the proposed method using real data from GDSC.
2 Proposed Method
One major challenge, even in large databases such as GDSC , is the large number of features (tens of thousands of genes) compared to the number of samples (hundreds of cell lines). Therefore, the prediction of drug response from gene expression is inherently under-determined. In the literature, the common way to deal with this problem is to judiciously select a small number of transcriptomic features through sophisticated feature selection methods such as sparse regression or other gene ranking strategies that utilize prior knowledge in the form of protein-protein interactions (PPI’s) and genetic interactions . The existing data sets contain both gene expression data of different cell lines and their response to different drugs, where the response of each drug is only measured for a subset of the cell lines. The experimentally measured gene expression data is naturally noisy and is not necessarily ‘centered’. We propose to model the intrinsically low dimensionality of the gene expression data while taking noise and bias into consideration, using a new method based on collaborative filtering.
One way to tackle biases in the gene expression measurements is to model the gene expression value as
where denotes the actual gene expression, is the additive noise which here we assume to be Gaussian distributed with zero mean, and is the bias of the th gene. The matrix form of (1) is given by
where is the transpose, is a vector of length with all elements equal to , is the number of training samples (cell lines), is the number of transcriptomic features in a cell line, , and .
To continue, we bring forth our motivation by an example shown in Fig. 1, where singular values of a gene expression matrix from the GDSC data set are plotted. This matrix contains the expression of genes in cell lines. Fig. 1 shows that the gene expression matrix is dominated by a few principal components, indicating that gene expressions of different cell lines are strongly correlated and the gene expression matrix is approximately low-rank. Thus, we have
where and are low-rank factors with . Therefore, (2) can be approximated by
The above observation implies that it is not necessary to exploit all the transcriptomic features for drug response prediction. On the contrary, the dimension of can be significantly reduced by a dimensionality reduction matrix . As a follow-up, we propose a novel joint dimensionality reduction and drug response prediction strategy, where the drug response is estimated from the latent space of the gene expression matrix—. Mathematically, we try to solve
where is the -norm, is the Frobenius norm, is the drug response in the training set, and are parameters for fitting the response from the latent space such that . The first term in (5) models the dimensionality reduction and bias cancellation, the second regularization fits the drug response from the latent space of , and controls the strength of regularization. In (5), fixing any four variables, the problem for the remaining variable is linear least squares (LS). We therefore employ an alternating least squares (ALS) strategy to solve (5). Specifically, at each iteration, the subproblem w.r.t. is
where , and the solution is
with being the matrix inverse.
Since , and can be updated simultaneously. The associated subproblem is
where and the minimum is reached at
Fixing , the update for is straightforward, i.e.,
Finally, we have
where is the th row of . We iteratively update until the algorithm converges. Convergence is monotonic in terms of the cost function, by virtue of the conditionally optimal updates of ALS.
So far, we have shown how to estimate the unknown variables in our model. However, it is still unclear how to use to predict drug response for new patients. It is worth noting that and are not the parameters of interest, instead, and are the “meat” and “bread”. To explain this point, let us first showcase how to use for dimensionality reduction of a new cell line (note that we now switch to a using a column vector for the cell line). We then solve
resulting in the reduced dimension gene expression vector
Comparing (13) and (7), we see that in (7) is updated differently since it is a regularized LS that involves and while does not. Nevertheless, for real-data applications, the new gene expression may not rigorously follow the proposed model, which means that estimated from ALS is not properly paired with the new cell line . Thus, estimating the response from is not the best option.
To handle this issue, we need to recalculate by using a refined obtained in the same manner as (13), i.e.,
Hence, by minimizing , we have
The drug response is then estimated through
The overall procedure is summarized in Algorithm 1.
Remark: As we can see from (9), is not simply a gene expression bias vector; it actually plays an important role in finding an accurate dimensionality reduction matrix and assisting drug response estimation from the latent space.
We compare the performance of of our CF-inspired approach with state-of-the-art algorithms using real data from GDSC, which is fully accessible at https://www.cancerrxgene.org/downloads. We use the RMA-normalised basal gene expression profiles of the cell lines released on March 2, 2017. Tthe drug response data that we use was released on March 27, 2017, containing the biochemical half maximal inhibitory concentration () values of different drugs for each cell line.
For most of the cell lines in GDSC, the expression values of approximately genes are available, but the drug that has been measured using the largest number of samples includes approximately samples – which poses a challenge for training an accurate prediction model. Therefore, it is necessary to prudently select a smaller subset of informative features for training while excluding the irrelevant ones. Toward this end, we first employ the ProGENI algorithm  to rank the genes for each drug and select the top genes to construct a smaller-size gene expression matrix. Then we choose of the data samples of a tested drug for training, for validation and for testing.
In the first example, we compare CF with orthogonal matching pursuit (OMP), iterative hard-thresholding (IHT) and Elastic Net (EN) in terms of relative root mean square error (RMSE). We choose 10 drugs (i.e., SN-38, TAK-715, Ruxolitinib, Ispinesib Mesylate, BX-912, Avagacestat, XMD14-99, PHA-793887, XMD15-27, Quizartinib) to examine the performance of different competitors and report their RMSEs. Here, RMSE is averaged through 10 random permutations and is calculated as
where contains the drug response estimates of the testing cell lines from the th permutation. Since the rank and hyper-parameter for CF are unknown, we vary from 5 to 30 and from 100 to 400, and choose that minimizes the Euclidean distance between the estimated and true drug response vector of the validation set. Simulation results are shown in Table 1, where CF has the smallest RMSE for all 10 drugs. Overall, OMP performs slightly better than IHT while EN has the worst performance.
In the second example, we examine the performance of CF on predicting drug sensitivity. We compare our method with OMP and three classifiers from MATLAB Statistics and Machine Learning Toolbox as baselines, i.e., logistic regression, kernel SVM (K-SVM) with Gaussian kernel function and linear SVM (L-SVM). Here, we do not include IHT and EN because their performance is worse than OMP. The drug selected for comparison is SN-38. This drug has the largest number of samples in the GDSC data set. There are 989 cell lines tested with this drug, but only 956 of them have associated transcriptomic data, which means that the total number of available samples is 965. Note that we define a threshold such that a cell line with value smaller than the threshold is identified as sensitive to the drug; otherwise, resistant. Thus, given a threshold, the data set can be divided into two parts corresponding to drug sensitive and resistant, respectively. The values for this drug range from to , where the smaller the value, the more sensitive the cell line to this drug. Therefore, we vary the threshold from to to compare the prediction performance of different algorithms. Then given a threshold, we choose of the data samples from each of the two parts for training, for validation and for testing, such that the percentage of either resistant or sensitive samples is fixed in training, validation and testing sets.
Similar to the previous example, we run 10 Monte Carlo simulations with randomly partitioned training/validation/testing sets and report the average prediction accuracy of the testing set defined as
where is the drug response label, is the estimated label, is the size of the testing data and if and otherwise. Also notably, for logistic regression and SVM, in consideration of the limited training samples, we employ OMP to further reduce the number of features by solving , where the indices of the nonzero elements in denote the selected features.
Fig. 2 shows the results, from which we see that CF has the highest prediction accuracy under different thresholds. Its performance is followed by OMP, logistic regression and L-SVM. However, the K-SVM does not work well.
A novel CF algorithm has been proposed for drug response prediction from gene expression. Simulations validated that CF works better than many sparse regression methods (e.g., OMP, IHT and EN) and classical linear and nonlinear classification algorithms (e.g., logistic regression and SVM).
The CF method estimates the values rather than the labels of drug sensitive/resistant. This can be valuable for imputing missing ’s of cell lines for which the drug response is not measured, and for understanding the relationship between gene expression and drug response. Another important advantage of our CF-based method is that it can be used in the case of incomplete gene expression measurements – i.e., even when the matrix has many missing entries. The only change in this case is that one needs to use weighted LS or stochastic gradient updates within the main ALS algorithm, as is well-known in the collaborative filtering literature.
-  J.C. Chang, et al., “Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer,” The Lancet, vol. 362, no. 9381, pp. 362-369, 2003.
-  J. Barretina, et al., “The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity,” Nature, vol. 483, no. 7391, pp. 603-607, 2012.
-  W. Yang, et al., “Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells,” Nucleic acids research, vol. 41, no. D1, pp. D955-D961, 2012.
-  M.G. Rees, et al., “Correlating chemical sensitivity and basal gene expression reveals mechanism of action,” Nature chemical biology, vol. 12, no. 2, pp. 109, 2016.
-  J. Lamb, et al., “The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease,” Science, vol. 313, pp. 1929â1935, 2006.
-  J. Barretina, et al., “The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity,” Nature, vol. 483, pp. 603â607, 2012.
-  M.J. Garnett, et al., “Systematic identification of genomic markers of drug sensitivity in cancer cells,” Nature, vol. 483, pp. 570â575, 2012.
-  M.P. Menden, F. Iorio, M. Garnett, U. McDermott, C.H. Benes, P.J. Ballester and Saez-Rodriguez, “Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties,” PLoS one, vol. 8, no. 4, pp. e61318, 2013.
-  A. Emad, J. Cairns, K. R. Kalari, L. Wang and S. Sinha, “Knowledge-guided gene prioritization reveals new insights into the mechanisms of chemoresistance,” Genome biology, vol. 18, no. 1, pp.153, 2017.
-  J.C. Costello, et al., “A community effort to assess and improve drug sensitivity prediction algorithms,” Nature biotechnology, vol. 32, no. 12, pp. 1202-1211, 2014.
-  X. Yu, I. Weber and R. Harrison, “Sparse representation for HIV-1 protease drug resistance prediction,” Proc. SIAM Int. Conf. Data Mining. Society for Industrial and Applied Mathematics, pp. 342-349, 2013.
-  A. Basu, R. Mitra, L. Han, S.L. Schreiber and P. A. Clemons, “RWEN: response-weighted elastic net For prediction of chemosensitivity of cancer cell lines,” Bioinformatics, vol. 34, no. 19, pp. 3332-3339, 2018.
-  Y.C. Pati, R. Rezaiifar, and P.S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” Proc. The Twenty-Seventh Asilomar Conf. Signals, Systems and Computers, pp. 40â44, Pacific Grove, CA, 1993.
-  T. Blumensath, M. E. Davies, “Iterative hard thresholding for compressed sensing,” Appl. Computat. Harmon. Anal., vol. 27, no. 3, pp. 265-274, 2009.
-  H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J.R. Statist. Soc. B., vol. 67, pp. 301â320, 2005.