Longitudinal data analysis using matrix completion
In clinical practice and biomedical research, measurements are often collected sparsely and irregularly in time while the data acquisition is expensive and inconvenient. Examples include measurements of spine bone mineral density, cancer growth through mammography or biopsy, a progression of defect of vision, or assessment of gait in patients with neurological disorders. Since the data collection is often costly and inconvenient, estimation of progression from sparse observations is of great interest for practitioners.
From the statistical standpoint, such data is often analyzed in the context of a mixed-effect model where time is treated as both random and fixed effect. Alternatively, researchers analyze Gaussian processes or functional data where observations are assumed to be drawn from a certain distribution of processes. These models are flexible but rely on probabilistic assumptions and require very careful implementation.
In this study, we propose an alternative elementary framework for analyzing longitudinal data, relying on matrix completion. Our method yields point estimates of progression curves by iterative application of the SVD. Our framework covers multivariate longitudinal data, regression and can be easily extended to other settings.
We apply our methods to understand trends of progression of motor impairment in children with Cerebral Palsy. Our model approximates individual progression curves and explains 30% of the variability. Low-rank representation of progression trends enables discovering that subtypes of Cerebral Palsy exhibit different progression trends.
Keywords: Multivariate longitudinal data, Matrix completion, Longitudinal data analysis, Functional data analysis, Matrix factorization, Regression, Interpolation
One of the fundamental questions in medical practice is how diseases progress in individual patients. Accurate continuous monitoring of a patient’s condition could considerably improve prevention and treatment. Many medical tests, such as x-ray, MRI, motion capture gait analysis, biopsy, or blood tests are costly, harmful or inconvenient to perform frequently. Since in many situations increase in sampling is not feasible due to inconvenience and costs, practitioners need to reach out for statistical tools to analyze the dynamics of these measurements.
While in many situations multiple data points from patients’ histories are available, these data are often underutilized. For the sake of simplicity and convenience, many prognostic models applied in practice only use the last few observations from a series or summary statistics such as the mean over time. However, this simplification ignores important information about the progression, including its dynamics or individual patient’s variability. Moreover, the noise inherent to measurements further hinders the inference of changes in patient’s health. For making use of these data, practitioners need statistical models and software. To enable appropriate usage and broad adoption these tools should be simple to use and understand.
To illustrate the potential benefits of temporal models in these scenarios, in Figure 1 we present measurements of gait pathology progression. The thick solid line represents the estimated mean and indicates a clear trend of growth during puberty. However, by looking at individual processes and by modeling between-subject similarities, we may model the progression more accurately. This approach could result in personalized predictions, new clustering techniques patients based on progression patterns, or extraction of latent representation of progression patterns, which then could be used as predictors in other models.
This kind of data has been commonly analyzed using linear mixed models, where we treat time as a random effect and nonlinearity of this effect is imposed by choice of the functional basis (Zeger et al., 1988; Verbeke, 1997; McCulloch and Neuhaus, 2001). When data is very sparse, additional constraints on the covariance structure of trajectories are derived using cross-validation or information criteria (Rice and Wu, 2001; Bigelow and Dunson, 2009). To further reduce the dimension, practitioners model the covariance as a low-rank matrix (James et al., 2000; Berkey and Kent, 1983; Yan et al., 2017; Hall et al., 2006; Besse and Ramsay, 1986; Yao and Lee, 2006; Greven et al., 2011). Multiple models were developed for incorporating additional covariates (Song et al., 2002; Liu and Huang, 2009; Rizopoulos et al., 2014). While statisticians use these methods in practice, they need to fine-tune them each time since the nature of processes differs in each clinical setting. Moreover, the probabilistic formulation of the model and dependence on the underlying distributions might hinder applicability or adoption to other practical problems.
In this work, we propose a flexible and straightforward statistical framework, exploiting matrix factorization techniques. We focus on the simplicity of the formulation, and we implement software easy to use and extend.
2 Background and related work
In many clinical settings, researchers and practitioners model the patient’s history as a multivariate process of clinical measurements. Examples include variables such as the size of a tumor, blood markers, height and weight of a patient. The multivariate measurements are noisy, and they are observed on different time-points. This situation arises when, for example, clinicians perform a blood test at every visit but a biopsy sporadically.
Even though multiple variables are measured, the ones directly related to the disease, such as the size of a tumor, are usually of the special interest. Therefore, to focus our attention and aid further discussion, we start by introducing notation and methodology for the univariate case. Let denote the number of subjects. For each individual , we measure the process at irregularly sampled time-points . We assume that for each we have , for some .
Let denote observations corresponding to for each individual . To model individual trajectories given pairs practitioners map observations into a low-dimensional space which represents progression patterns. Ideally, a small distance between individuals in the latent space reflects similar progression patterns.
In this section, we discuss state-of-the-art approaches to estimating this low-dimensional latent embedding. We classify them into three categories: the direct approach, mixed-effect models, and low-rank approximations.
2.1 Direct approach
If the set of observed values for each individual is dense, elementary interpolation using a continuous basis can be sufficient for approximating the entire trajectory. Let be a basis of . In practice, we truncate the basis to a set of the first basis elements. Let be a vector of basis elements evaluated at a timepoint . Throughout this article we use the word basis to refer to some truncated basis of elements.
To find an individual trajectory, for each subject , we may use least squares method and estimate a set of coefficients , minimizing squared euclidean distance to the observed point
Finally, to reduce overfitting, it is often convenient to compute principal functional components across all individuals and represent curves in the space spanned by the first few of them. Such representation, referred to as a Karhunen-Loève expansion (Watanabe, 1965; Kosambi, 2016), has became a foundation of many functional data analysis workflows (Ramsay and Dalzell, 1991; Yao et al., 2005; Cnaan et al., 1997; Laird, 1988; Horváth and Kokoszka, 2012; Besse et al., 1997).
This approach to modeling covariance has two main drawbacks. First, if the number of observations for an individual is smaller or equal to the size of the basis , we can fit a curve with no error leading to overfitting and unreliable estimator of the variance. Second, this approach ignores similarities between the curves, which could potentially improve the fit.
A basic idea to remedy these issues is to estimate both the basis coefficients and the variance structure simultaneously. Linear mixed-effect models provide a convenient solution to this problem.
2.2 Linear mixed-effect models
A common approach to modeling longitudinal observation is to assume that data come from a linear mixed-effect model (LMM) (Verbeke, 1997; Zeger et al., 1988). We operate in a functional space with a basis of elements and we assume there exists a fixed effect , where for for . We model the individual random effect as a vector of basis coefficients. In the simplest form, we assume
where is a covariance matrix. We model individual observations as
where , is the standard deviation of observation error and is the basis evaluated at timepoints defined in . Estimation of the model parameters is typically accomplished by the expectation-maximization (EM) algorithm (Laird and Ware, 1982). For estimating coefficients one can use the best unbiased linear predictor (BLUP) (Henderson, 1950; Robinson, 1991).
Since the LMM estimates the covariance structure and the individual fit simultaneously, it reduces the problem of uncertain estimates of , present in the direct approach. However, this model is only applicable if we observe a relatively large number of observations per subject since we attempt to estimate coefficients for every subject.
To model trajectories from a small number of observations, practitioners further constrain the covariance structure. If we knew the functions which contribute the most to the random effect, we could fit a LMM in a smaller space spanned by these functions. We explore possibilities to learn the basis from the data using low-rank approximations of the covariance matrix.
2.3 Low-rank approximations
There are multiple ways to constrain the covariance structure. We can use cross-validation or information criteria to choose the best basis, the number of elements or positions of spline knots (Rice and Wu, 2001; Bigelow and Dunson, 2009). Alternatively, we can place a prior on the covariance matrix (MacLehose and Dunson, 2009).
Another solution is to restrict the latent space to dimensions and learn from the data the mapping between the latent space and the basis. In the simplest scenario with Gaussian observation noise, observations can be modeled as
following the notation from (2.3).
James et al. (2000) propose an EM algorithm for finding model parameters and latent variables in (2.4). In the expectation stage, they marginalize , while in the maximization stage, with assumed observed, they maximize the likelihood with respect to . The maximum likelihood, given , takes form
Another approach to estimating parameters of (2.4) is to optimize over and marginalize (Lawrence, 2004). This approach allows modifying the distance measure in the latent space, using the kernel trick (Schulam and Arora, 2016).
Estimation of the covariance structure of processes is central to the estimation of individual trajectories. Descary and Panaretos (2016) propose a method where the estimate of the covariance matrix is obtained through matrix completion.
Methods based on low-rank approximations are widely adopted and applied in practice (Berkey and Kent, 1983; Yan et al., 2017; Hall et al., 2006; Besse and Ramsay, 1986; Yao and Lee, 2006; Greven et al., 2011). However, due to their probabilistic formulation and reliance on the distribution assumptions, these models usually need to be carefully fine-tuned for specific situations. This shortcoming motivates us to develop an elementary optimization framework.
3 Modeling sparse longitudinal processes using matrix completion
The mixed-effect model, like any other probabilistic model, can be heavily biased when data comes from a distribution considerably different than assumed. In the medical context, since biomarkers can differ in every clinical setting, fine-tuning the models may require an extensive amount of time and expertise. In this work, we develop a more flexible approach based solely on the approximation rather than the underlying probabilistic distributions.
We pose the problem of trajectory prediction as a matrix completion problem, and we solve it using sparse matrix factorization techniques (Rennie and Srebro, 2005; Candès and Recht, 2009). In the classical matrix completion problem, the objective is to predict elements of a sparsely observed matrix using its known elements while minimizing a specific criterion, often chosen to be the Mean Squared Error (MSE). The motivating example is the “Netflix Prize” competition (Bennett and Lanning, 2007), where participants were tasked to predict unknown movie ratings using other observed ratings. We can represent these data as a matrix of users and movies, with a subset of known elements, measured on a fixed scale, e.g., .
To solve the matrix completion problem, we usually assume that the true matrix can be approximated by a low-rank matrix (Srebro et al., 2005). In the low-rank representation columns of spanning the space of movies can be interpreted as “genre” components, and each user is represented as a weighted sum of their preferred genres, i.e., a row in the matrix of latent variables (see Figure 2).
We can use the same idea to predict sparsely sampled curves, as long as we introduce an additional smoothing step. The low-dimensional latent structure now corresponds to progression patterns, and a trajectory of each individual can be represented as a weighted sum of these “principal” patterns. In Figure 2, the patterns are given by , while the individual weights are encoded in .
We first introduce a methodology for univariate sparsely-sampled processes. The direct method, mixed-effect models and low-rank approximations described in Section 2 have their analogy in the matrix completion setting. We discuss these analogies in sections 3.2 and 3.3. Next, we show that the simple representation of the problem allows for extension to multivariate sparsely-sampled processes and a regression setting.
For each individual we observe measurements at time-points . Unlike in the prior work introduced in Section 2, here we discretize the time grid to time-points where , and is arbitrarily large. Each individual is expressed as a partially observed vector . For each time-point for we find a corresponding grid-point . We assign . Let be a set of grid indices corresponding to observed grid-points for an individual . All elements outside , i.e. are considered missing.
For sufficiently large, our observations can be uniquely represented as a matrix with missing values. We denote the set of all observed elements by pairs of indices as . Let be the projection onto observed indices, i.e. , such that for and otherwise. We define as the projection on the complement of , i.e. . We use to denote the Frobenius norm, i.e. the square root of the sum of matrix elements, and to denote the nuclear norm, i.e. the sum of singular values.
As in Section 2 we impose smoothness by using a continuous basis . However, now, the basis is evaluated on the grid and we define a matrix .
In our algorithms we use a diagonal-thresholding operators defined as follows. Let be a diagonal matrix. We define soft-thresholding as
where , and hard-thresholding as
3.2 Direct approach
The optimization problem (2.1) of the direct approach described in Section 2.1 can be approximated in the matrix completion setting. First, note that the bias introduced by the grid is negligible if the grid is sufficiently large. We have
and by the continuity of the basis on a closed interval the second element in (3.1) can be arbitrarily small if . For the simplicity of notation in the reminder of this work we assume that all the points are observed on the grid of equidistributed points.
Now, we rewrite the optimization problem (2.1) as a matrix completion problem
where by optimization over we mean optimization over all and is an matrix composed of vectors stacked vertically.
The matrix formulation in equation (3.2) and the classic approach in Section 2.1 share multiple characteristics. In both cases, if data is dense, we may find an accurate representation of the underlying process simply by fitting least-squares estimates of or . Conversely, if the data is too sparse, the problem becomes ill-posed, and the least-squares estimates can overfit.
However, representations (2.1) and (3.2) differ algebraically and this difference constitutes the foundation for the method introduced in the sequel of this paper. The matrix representation enables us to use the matrix completion framework and, in particular, in Section 3.4 we introduce convex optimization algorithms for solving (3.2).
Note that some low-rank constraints on the random effect from the mixed-effect model introduced in Section 2.2 can be expressed in terms of constraints on and they can be potentially solved without imposing probability distributions. In particular in Section 3.3 we show that the linear mixed-effect model can be expressed by constraining .
3.3 Low-rank matrix approximation
In the low-rank approach described in Section 2.3 we assume that individual trajectories can be represented in a low-dimensional space, by constraining the rank of .
We might attempt taking the same route for solving (3.2). One difficulty comes from the fact that optimization with a rank constraint on turns the original least squares problem into a non-convex problem. Motivated by matrix completion literature, we relax the rank constraint in (3.2) to a nuclear norm constraint and we attempt to solve
for some parameter .
Cai et al. (2010) propose a first-order singular value thresholding algorithm (SVT), for solving a general class of problems involving a nuclear norm penalty and a linear form of , which includes (3.3). Their algorithm can handle large datasets and has strong guarantees on convergence, but it requires tuning the step size parameter, which can greatly influence performance. This limitation was addressed by Ma et al. (2011); Mazumder et al. (2010); Hastie et al. (2015) who introduced iterative procedures which do not depend on such tuning. Moreover, Hardt and Wootters (2014); Chen and Wainwright (2015) propose methods for the non-convex problem and Ge et al. (2016) argue that the non-convex problem has no spurious local minima.
In the last decade, the machine learning and statistical learning communities have introduced multiple algorithms for matrix completion. Many of them are suitable for solving (3.3). However, in this article we focus on analyzing the benefits of framing the trajectory prediction problem (2.1) in the matrix completion framework, rather than on benchmarking existing solutions.
We argue that the matrix formulation (3.3) has two key advantages over the probabilistic approach (Section 2.3). First, the problem (3.3) does not impose any assumption on the distribution of or the distribution of the noise. This property is particularly important whenever the data does not follow the normal distribution. We further describe links between the matrix factorization and low-rank mixed-effect models in Section 3.6. Second, the simple formulation of (3.3) allows us to generalize this model to the multivariate or regression setting as we discuss in Section 4.
3.4 Low-rank approximation with singular value thresholding
The low-rank probabilistic approach, introduced in Section 2.3, is based on the observation that the underlying processes for each subject can be approximated in a low-dimensional space. Here, we exploit the same characteristic using a matrix-factorization techniques for solving the optimization problem (3.3).
For the sake of clarity and simplicity, we choose to solve the problem (3.3) with an extended version of the Soft-Impute algorithm designed by Hastie et al. (2015); Mazumder et al. (2010). As discussed in Section 3.3, many other convex optimization algorithms can be applied.
The key component to the solution is the following property linking problem (3.3) with the singular value decomposition (SVD). Consider the fully observed case of (3.3). Using Theorem 2.1 in Cai et al. (2010), one can show that the optimization problem
has a unique solution , where and is the SVD of . We refer to as the singular value thresholding (SVT) of .
In order to solve (3.4) with a sparsely observed , we modify the Soft-Impute algorithm to account for the basis . Our Algorithm 1 iteratively constructs better approximations of the global solution for each in some predefined set for . For a given approximation of the solution , we use to impute unknown elements of obtaining . Then, we construct the next approximation by computing SVT of .
As a stopping criterion, we compute the change between subsequent solution, relative to the magnitude of the solution, following the methodology in Cai et al. (2010). We set a fixed threshold for this criterion, depending on the desired accuracy.
A common bottleneck of algorithms introduced by Cai et al. (2010); Mazumder et al. (2010); Ma et al. (2011) as well as other SVT-based approaches is the computation of the SVD of large matrices. This problem is particularly severe in standard matrix completion settings, such as the Netflix problem, where the matrix size is over . However, in our problem,
with in our motivating data example. While algorithms for large matrices are applicable here, the property (3.5) makes the computation of SVD feasible in practice with generic packages such as PROPACK (Larsen, 2004).
For making predictions on new data, in practice, we are interested in two scenarios: (1) we collect new measurements of an existing patient ; (2) we include a new patient with at least as many observations as the rank of . In both cases, we use the current fit for each parameter and update all models with newly observed data. This approach not only estimates new predictions but also marginally improves the existing model.
3.5 -norm regularization
While the nuclear norm relaxation (3.3) is motivated by making the problem convex, Mazumder et al. (2010) argue that in many cases it can also give better results than the rank constraint. They draw an analogy to the relation between best-subset regression ( regularization) and LASSO ( regularization as in Tibshirani (1996); Friedman et al. (2001)). In LASSO by shrinking model parameters, we allow more parameters to be included, what can potentially improve the prediction if the true subset is larger than the one derived through regularization. In the case of (3.3), shrinking the nuclear norm allows us to include more dimensions of again potentially improving the prediction if the true dimension is high.
Conversely, the same phenomenon can also lead to problems if the underlying dimension is small. In such case, shrinking may allow including unnecessary dimensions emerging from noise. To remedy this issue, following the analogy with penalized linear regression, we may consider another classes of penalties. In particular, we may consider coming back to the rank constraint by modifying the nuclear norm penalty (3.3) to . We define . The problem
has a solution , where and is the SVD of . We refer to as the hard singular value thresholding (hard SVT) of . We use Algorithm 2 to find the hard SVT of .
3.6 The link between reduced-rank model and Soft-Longitudinal-Impute
Intuitively we might expect similarity between the principal directions derived using the probabilistic approach (2.3) and their counterparts derived from the SVT-based approach. We investigate this relation by analyzing behavior of SVT for matrices sampled from the probabilistic model given by (2.3).
For simplicity, let us assume that and the data is fully observed on a grid of time-points. Assume that observations come from the mixed-effect model
where is an unknown covariance matrix of rank and variables are independent. By the spectral decomposition theorem we decompose , where is a orthogonal and is a diagonal matrix with positive coefficients in decreasing order. Since and are independent, the distribution (3.7) can be rewritten as
Note that, the model (3.8) is a factor model with factors—the first columns of .
Let be the observed matrix and let . Then, is the maximum likelihood estimator of .
Factor analysis methods give not only estimates of and but also estimates of the individual latent variables . In the multivariate analysis literature, there are multiple ways to estimate factor scores, i.e., a matrix such that , most notably Spearman’s scores, Bartlett’s scores, and Thompson’s scores (Kim and Mueller, 1978). Simply taking as the estimate of the scores corresponds to the solution of (3.4) as long as .
Note that in Theorem 1 we assume that is known, which is infeasible in practice. However, the likelihood of can be parametrized by , and we can find the optimal solution analytically. This corresponds to minimizing (3.4) for different .
4 Multivariate longitudinal data
In practice, we are often interested in prediction of a univariate process in the context of other longitudinal measurements and covariates constant over time. Examples include prediction of disease progression given patient’s demographics, data from clinical visits at which multiple blood tests or physical exams are taken, or measurements which are intrinsically multivariate such as gene expression or x-rays collected over time. The growing interest in this setting stimulated research in latent models (Sammel and Ryan, 1996) and multivariate longitudinal regression (Gray and Brookmeyer, 1998, 2000). Diggle et al. (2002) present an example case study in which longitudinal multivariate modeling enables estimation of joint confidence region of multiple parameters changing over time, shrinking the individual confidence intervals.
In this section, we present an extension of our univariate framework to multivariate measurements sparse in time. We explore two cases: (1) dimensionality reduction, where we project sparsely sampled multivariate processes to a small latent space, and (2) linear regression, where we use a multivariate process and covariates constant over time for prediction of a univariate process of interest. To motivate our approach, we start with a regression involving two variables observed over time.
4.1 Motivation: Univariate regression
Suppose, as previously, that the true processes are in a low-dimensional space of some continuous functions (e.g., splines) and that we observe them with noise. More precisely, let
for , where are vectors, are vectors and is a matrix of splines evaluated on a grid of points. We assume zero-mean independent errors with fixed covariance matrices respectively, and that the true processes underlying the observed and follow a linear relation in the spline space, i.e.
where is an unknown matrix.
where are matrices of observation noise. Without loss of generality we assume that is orthonormal. We have freedom to choose the basis and any basis can be orthogonalized using, for example, singular value decomposition.
Note that due to orthogonality of and after multiplying both expressions in (4.3) by we can map the problem to the classical multivariate errors-in-variables models. Let
where and . In errors-in-variables models it is assumed that the parameters and are unknown, and are to be estimated. Both regressors and responses are measured with noise (here and ). The parameter can be interpreted as a latent representation of both and .
The problem of estimating parameters in (4.4) has been extensively studied in literature dating back to Adcock (1878). Two main methods for estimating parameters in (4.4) are maximum likelihood estimates (MLE) and generalized least squares estimators (GLSE). The estimators in MLE are derived under the assumption of certain distributions of the errors. In GLSE, the only assumption about errors is that they are independent, zero-mean, and they have a common covariance matrix. Then, and are zero-mean and estimates for and can be found by optimizing some norm of these expressions. Gleser and Watson (1973) show that in the no-intercept model for and of the same size (as in our case) and under the assumption of Gaussian errors, MLE and GLSE give the same estimates of , if GLSE are derived for the Frobenius norm.
In this work, we focus on the GLSE approach as it can be directly solved in our matrix factorization framework and we find it easier to deploy and extend in practice. To account for different magnitudes of the noise in and , we consider the optimization problem with weights
where . Let . Then (4.5) can be transformed to
where operator stacks vertically matrices with the same number of rows. To solve (4.6), we show that the SVD of truncated to the first dimensions, can be decomposed to . Let be the SVD of , with
where each and is a matrix for and each is a matrix for . By Lemma 2.1 and Lemma 2.2 in (Gleser, 1981) matrix is almost surely nonsingular. Therefore, almost surely exists and we can transform the decomposition such that and , solving (4.6).
For partially observed data, if they are very sparse, it might be essential to constrain the rank of the solution. The partially-observed and rank-constrained version of the problem (4.6) takes the form
where is the desired rank of the solution and is a projection on
As previously, for an unknown we can introduce a rank penalty using the nuclear norm
The algorithm in the general case of multiple processes is derived in Section 4.2.
Although we motivate the problem as a regression of on , note that and are symmetric in (4.6). The low-rank matrix is, therefore, a joint low-rank representation of matrices and and thus our method can be seen as a dimensionality reduction technique or as a latent space model. In Section 4.2 we extended this idea to a larger number of variables. In Section 4.3 we discuss how this approach can be used for regression.
The linear structure of (4.3) allows us to draw analogy not only to the errors-in-variables models but also to vast literature on canonical correlation analysis (CCA), partial least squares (PLS), factor analysis (FA), and linear functional equation (LFE) models. Borga et al. (1997) show that solutions of CCA and PLS can also be derived from the SVD of stacked matrices, as we did with in (4.6). Gleser (1981) thoroughly discusses the relation between errors-in-variables, FA and LFE.
Finally, note that the method of using SVD for stacked matrices has also been directly applied in the recommender systems context. Condli et al. (1999) showed that for improving prediction of unknown entries in some partially observed matrix one might consider a low-rank decomposition of , where are additional covariates for each row, e.g., demographic features of individuals.
4.2 Dimensionality reduction
Suppose that for every individual we observe multiple variables varying in time (e.g. results of multiple medical tests at different times in a clinic) and we want to find a projection on maximizing the variance explained for some . This projection would correspond to characterizing patients by their progression trends of multiple metrics simultaneously.
We extend the equation (4.6) to account for a larger number of covariates and we do not impose decomposition of the solution yet. We formulate the following optimization problem
where are some matrices corresponding to the processes measured on a grid of points, is a basis evaluated on the same grid and orthogonalized (a matrix), and is a matrix.
Let be the Kronecker product of identity matrix and , i.e. a matrix with stacked times on the diagonal, and let . Note that is orthogonal and therefore results developed in Section 2.3 apply here. In particular, singular value thresholding solves
and we can use Algorithm 1 for solving
where is the projection on observed indices .
Note that (4.8) can be further extended. First, we can allow for different basis for each process. Let be a basis selected for approximation of processes for , where each basis is dimensional and it is evaluated on some time-grid . In particular, we allow , which corresponds to a covariate constant in time for a given individual (e.g. gender). Second, if measurements are on different scales, we may normalize them using scaling factors . Then, the problem (4.8) takes form
which we solve with the same techniques as (4.8). Third, the observed indices may vary in each process allowing for, for example, dense measurements of one process and sparse measurements of another one.
In practice, we are interested in estimating progression of an individual parameter (e.g., cancer growth) given some individual features constant over time (e.g., demographics) or given progressions of other covariates (e.g., blood tests, vitals, biopsy results).
We start with a regression problem with fixed covariates and sparsely observed response trajectories. Let be a matrix of observed covariates, be a sparsely observed matrix of trajectories, and be a matrix representing a basis of splines evaluated on a grid of points. We consider the optimization problem
Suppose we want to incorporate other variables varying in time for prediction of the response process. We can directly apply the method proposed in Section 4.2 and model the response and regressors together. However, it might be suboptimal for prediction, as it optimizes for least-squares distance in all variables rather than only the response. This difference is analogical to the difference between regression line of some univariate on independent variables and the first principal component analysis of , used for prediction of . While the first one minimizes the distance to , the latter minimizes the distance to , which is usually less efficient for predicting .
Alternatively, we can use methodology from Section 4.2 only for covariates. The solution of (4.8) can be decomposed into , and we regress on as in (4.9). Let be an orthogonal matrix derived from , where is the numbers of latent components. We search for a matrix solving
where is a projection on a set of observed coefficients. We propose a two-step iterative procedure (Algorithm 4).
We illustrate properties of the multivariate longitudinal fitting in a simulation study. First, we generate curves with quickly decaying eigenvalues of covariance matrices. Then, we compare the performance of the methods in terms of the variance explained by the predictions.
5.1 Data generation
Let be a grid of equidistributed points and let be a basis of spline functions evaluated on the grid . We simulate three matrices using the same procedure , where :
Define the procedure generating symmetric matrices for a given vector :
Use SVD to decompose to
Return , where is a diagonal matrix with on the diagonal
Let , and , where denotes the multivariate Gaussian distribution
Draw vectors according to the distribution
Return a matrix with rows .
and let be generated following the procedure and let . We consider and as coefficients in a spline space . We derive corresponding functions by multiplying these matrices by , i.e. , and . We set .
We uniformly sample indices , i.e. around points per curve on average. Each observed element of each matrix and is drawn with Gaussian noise with mean and standard deviation . The task is to recover from sparse observed elements .
We compare Soft-Longitudinal-Impute (SLI) defined in Algorithm 1 with the fPCA procedure (James et al., 2000), implemented in Peng and Paul (2009). Both SLI and fPCA require a set of basis functions. In both cases, we use the same basis as in the data generation process. In SLI we also need to specify the tuning parameter , while in fPCA we need to choose the rank . We use cross-validation to choose and by optimizing for the prediction error on held-out (validation) observations.
We divide the observed coefficients into training (), validation () and test () sets. We choose the best parameters of the three models on the validation set and then retrain on the entire training and validation sets combined. We compute the error of entire curves by taking mean squared Frobenius distance between and estimated , i.e.
on the test set .
For illustration, we also present results of Sparse-Longitudinal-Regression (SLR), regressing on and . Note, however, that the SLR, unlike two other methods, uses additional information about contained in and . This comparison is only meant to validate our approach.
We train the algorithms with all combinations of parameters: regularization parameter for SLI and SLR procedures and the rank for fPCA procedure . We define the grid of points. We compare the three methods fPCA, SLI and SLR, to the baseline null model which we define as the population mean across all visits.
The SLI achieves performance similar to (James et al., 2000), as presented in Table 1. The SLR, having access to additional information about , clearly outperforms other methods validating its correctness.
In Figure 3 we present the first components derived from both sparse functional PCA and SLI. In Figure 4 we present example predictions from all four methods. In Figure 5, we present the estimated rank and cross-validation error of one of the simulation runs.
6 Data study
The target class of problems motivating this study is a prediction of trajectories of disease progression of individual patients from sparse observations. In this section, we present how our methods are applied for understanding the progression of neurological disorders leading to motor impairments and gait pathologies. First, we discuss how practitioners collect the data and use them to guide the decision process. Next, we describe our dataset and present how our methodology can improve current workflows. Then, we present and discuss results.
In clinical gait analysis, at each visit movement of a child is recorded using optical motion capture. Optical motion capture allows estimating 3D positions of body parts using a set of cameras tracking markers positions on the subject’s body. A set of at least three markers is placed at each analyzed body segment usually associated with bones (e.g., tibia, humerus) so that its 3D position and orientation can be identified uniquely. These data are then used to determine relative positions of body segments by computing the angle between the corresponding planes. Typically it is done using a biomechanical model for enforcing biomechanical constraints and improving accuracy.
In gait analysis practitioners are usually concerned about movement pattern of nine joints in lower limbs: ankle, knee, hip in each leg and pelvis (Figure 7). Each joint angle is measured in time. For making the curves comparable between the patients, usually, the time dimension is normalized to the percentage of the gait cycle, defined as the time between two foot strikes (Figure 7).
While trajectories of joint angles are a piece of data commonly used by practitioners for taking decisions regarding treatment, their high-dimensional nature hinders their use as a quantitative metric of gait pathology or treatment outcome. This motivates development of univariate summary metrics of gait impairment, such as questionnaire-based metrics Gillette Functional Assessment Walking Scale (FAQ) (Gorton III et al., 2011), Gross Motor Function Classification System (GMFCS) (Palisano et al., 2008) and Functional Mobility Scale (FMS) (Graham et al., 2004), or observational video analysis scores such as Edinburgh Gait Score (Read et al., 2003).
One of the most widely adopted quantitative measurements of gait impairments in pediatrics is Gait Deviation Index (GDI) (Schwartz and Rozumalski, 2008). GDI is derived from joint angle trajectories and measures deviation of the first ten singular values from the population average of the typically developing population. GDI is normalized in such a way that corresponds to the mean value of typically developing children, with the standard deviation equal . It is proven to be highly correlated with questionnaire-based methods. Thanks to its deterministic derivation from the motion capture measurements this method is considered more objective than questionnaires.
In medical practice, GDI has been adapted as a metric for diagnosing the severity of impairment, and it constitutes an integral part of the clinical decision making process and evaluation of treatment outcomes. However, in order to correctly identify the surgery outcome, it is crucial to understand the natural progression of GDI. In particular, a positive outcome of a surgery might be negligible when compared to natural improvement during puberty. Similarly, a surgery maintaining the same level of GDI might be incorrectly classified as a failure, if the decline in patient’s function over time is not accounted for.
Methods introduced in this article can be used to approximate individual progressions of GDI. First, we present how a prediction can be made solely based on the patient’s GDI history and histories of other patients. Next, using our regression procedure, we predict GDI trajectories using other sparsely observed covariates, namely O2 expenditure and walking speed.
6.1 Materials and methods
We analyze a dataset of Gillette Children’s Hospital patients visiting the clinic between 1994 and 2014, age ranging between 4 and 19 years, mostly diagnosed with Cerebral Palsy. The dataset contains visits of patients without gait disorders and visits of patients with gait pathologies.
Motion capture data was collected at 120Hz and joint angles in time were extracted. These joint angles were then normalized in time to the gait cycle, resulting in curves as in Figure 7. Points from these curves were then subsampled (51 equidistributed points). Given the data in this form, we computed GDI metric from each visit and each leg.
In the dataset which we received from the hospital, for each patient we observe their birthday and disease subtype. From each visit, we observe the following variables: patient ID, time of the visit, GDI of the left leg, GDI of the right leg, walking speed, and O2 expenditure. Other clinical variables that we received were not included in this study. Note that walking speed is related to information we lose during normalization of the gait cycle in time. O2 expenditure is a measure of a subject’s energy expenditure during walking. Pathological gait is often energy inefficient and reduction of O2 expenditure is one of the objectives of treatments. Finally, GDI is computed for two legs while in many cases the neurological disorder affects only one limb. To simplify the analysis, we focus on the more impaired limb by analyzing the minimum of the left and the right GDI.
Our objective is to model individual progression curves. We test three methods: functional principal components (fPCA), Soft-Longitudinal-Impute (SLI) and Sparse-Longitudinal-Regression (SLR). We compare the results to the null model – the population mean across all visits (
mean). In SLR, we approximate GDI using latent variables of sparsely observed covariates O2 expenditure and walking speed, following the methodology from Section 4.3.
In our evaluation procedure, for the test set, we randomly select of observations of patients who visited the clinic at least times. Then, we split the remaining of observations into a training and validation sets in proportion. We train the algorithms with the following combinations of parameters: the regularization parameter for SLI and SLR procedures and the rank for fPCA procedure . We define the grid of points. We repeat the entire evaluation procedure times.
Let us denote the test set as . We validate each model on held-out indices by computing the mean squared error as defined in (5.1). We select the parameters of each of the three methods using cross-validation, using the same validation set.
For reproducibility and to simplify adoption we created an
fcomplete available at
https://github.com/kidzik/fcomplete. The package contains implementations of algorithms 1, 2, 3, and 4, and helper functions for transforming the data, sampling training and test datasets, and plotting functions. For convenience, we also provided an interface for using the
fpca package implementing Sparse Functional Principal Components algorithms (James et al., 2000; Peng and Paul, 2009). The analysis was perform on a desktop PC with 32 GB RAM memmory and an Intel® Core™ i7-6700K CPU @ 4.00GHz, operating on a Ubuntu 18.04 system with
R version 3.4.4.
Compared to the null model, all three methods explain around of the variance. We present detailed results in Table 2. Sparse-Longitudinal-Regression had smaller mean and variance than two other methods indicating that two other variables. We conclude that O2 expenditure and walking speed provide additional information for prediction of GDI progression.
Both fPCA and Sparse-Longitudinal-Impute provide latent representations of patients’ progression curves which can potentially be interpreted. To this end we first analyze the singular value vectors from our SVD solution which we refer to as principal components.
In the left plot in Figure 9 we show the first two estimated principal components. We found that the first component estimates the change between GDI before and after age of 20. The second component models changes around age of 10 and around age of 18. In the right plof in Figure 9, by adding a principle component to the population mean curve, we illustrate how differences in the first component are reflected in the patients trajectory. By visual investigation of curves returned by our Sparse-Longitudinal-Impute and by fPCA we found similar trends in the first two components.
Since our SVD decomposition defines a low-rank representation of the progression trends, we can also use it to gain insights on progression in different groups of patients. In Cerebral Palsy we divide paralysis into subtypes depending on which limbs are affected: monolegia (one leg), diplegia (two legs), hemiplegia (one side of the body), triplegia (three limbs), quadriplegia (four limbs). Hemiplegia is the most prevalent in our population and it might be divided depending on severity, from type I (weak muscles, drop foot) to type IV (severe spasticity). We find differences between trends of progression for different subtypes of paralysis of patients (). We illustrate these distributions in Figure 9.
Results presented in Section 6.2 imply that our Sparse-Longitudinal-Impute and Sparse-Longitudinal-Regression methods can be successfully applied to understands trends of variability of disease progression. We show how to incorporate progressions of O2 expenditure and walking speed in the prediction of the progression of GDI. We present how low-rank representation can be leveraged to gain insights about subtypes of impairment.
While a large portion of variance remains unexplained, it is important to note that in practice the individual progression is not accounted for explicitly in the current decision-making process. Instead, practitioners only use the population-level characteristics of the dependence between age and impairment severity. Our model can greatly improve this practice.
Despite successful application, we identify limitations that could be potentially addressed in the extensions of our model. First, the method is meant to capture natural continuous progression of GDI, while in practice there are many discrete events, such as surgeries that break continuity assumption and render the mean trajectories less interpretable. Second, our methodology does not address the “cold start problem”, i.e. we do not provide tools for predictions with only one or zero observations. Third, we do not provide explicit equations for confidence bounds of predicted parameters.
While these and other limitations can constrain applicability of the method in the current form, they can be addressed using existing techniques of matrix completion. The focus of this paper is to introduce a computational framework rather than build a full solution for all cases. Elementary formulation of the optimization problem as well as the fully-functional
R implementation can foster development of new tools using matrix completion for longitudinal analysis and for mixed-effect models.
- Adcock (1878) Robert James Adcock. A problem in least squares. The Analyst, 5(2):53–54, 1878.
- Bennett and Lanning (2007) James Bennett and Stan Lanning. Netflix: the netflix prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
- Berkey and Kent (1983) CS Berkey and RL Kent. Longitudinal principal components and non-linear regression models of early childhood growth. Annals of human biology, 10(6):523–536, 1983.
- Besse and Ramsay (1986) Philippe Besse and James O Ramsay. Principal components analysis of sampled functions. Psychometrika, 51(2):285–311, 1986.
- Besse et al. (1997) Philippe C Besse, Hervé Cardot, and Frédéric Ferraty. Simultaneous non-parametric regressions of unbalanced longitudinal data. Computational Statistics & Data Analysis, 24(3):255–270, 1997.
- Bigelow and Dunson (2009) Jamie L Bigelow and David B Dunson. Bayesian semiparametric joint models for functional predictors. Journal of the American Statistical Association, 104(485):26–36, 2009.
- Borga et al. (1997) Magnus Borga, Tomas Landelius, and Hans Knutsson. A unified approach to pca, pls, mlr and cca. Linköping University, Department of Electrical Engineering, 1997.
- Cai et al. (2010) Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
- Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717, 2009.
- Chen and Wainwright (2015) Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
- Cnaan et al. (1997) Avital Cnaan, NM Laird, and Peter Slasor. Tutorial in biostatistics: using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat Med, 16(2349):80, 1997.
- Condli et al. (1999) Michelle Keim Condli, David D Lewis, David Madigan, and Christian Posse. Bayesian mixed-effects models for recommender systems. In ACM SIGIR, volume 99, 1999.
- Descary and Panaretos (2016) Marie-Hélène Descary and Victor M Panaretos. Functional data analysis by matrix completion. arXiv preprint arXiv:1609.00834, 2016.
- Diggle et al. (2002) Peter Diggle, Patrick Heagerty, Kung-Yee Liang, and Scott Zeger. Analysis of longitudinal data. Oxford University Press, 2002.
- Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
- Ge et al. (2016) Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
- Gleser (1981) Leon Jay Gleser. Estimation in a multivariate" errors in variables" regression model: large sample results. The Annals of Statistics, pages 24–44, 1981.
- Gleser and Watson (1973) Leon Jay Gleser and Geoffrey S Watson. Estimation of a linear transformation. Biometrika, 60(3):525–534, 1973.
- Gorton III et al. (2011) George E Gorton III, Jean L Stout, Anita M Bagley, Katherine Bevans, Tom F Novacheck, and Carole A Tucker. Gillette functional assessment questionnaire 22-item skill set: factor and rasch analyses. Developmental Medicine & Child Neurology, 53(3):250–255, 2011.
- Graham et al. (2004) H Kerr Graham, Adrienne Harvey, Jillian Rodda, Gary R Nattrass, and Marinis Pirpiris. The functional mobility scale (fms). Journal of Pediatric Orthopaedics, 24(5):514–520, 2004.
- Gray and Brookmeyer (1998) Sarah M Gray and Ron Brookmeyer. Estimating a treatment effect from multidimensional longitudinal data. Biometrics, pages 976–988, 1998.
- Gray and Brookmeyer (2000) Sarah M Gray and Ron Brookmeyer. Multidimensional longitudinal data: estimating a treatment effect from continuous, discrete, or time-to-event response variables. Journal of the American Statistical Association, 95(450):396–406, 2000.
- Greven et al. (2011) Sonja Greven, Ciprian Crainiceanu, Brian Caffo, and Daniel Reich. Longitudinal functional principal component analysis. Recent Advances in Functional Data Analysis and Related Topics, pages 149–154, 2011.
- Hall et al. (2006) Peter Hall, Hans-Georg Müller, and Jane-Ling Wang. Properties of principal component methods for functional and longitudinal data analysis. The annals of statistics, pages 1493–1517, 2006.
- Hardt and Wootters (2014) Moritz Hardt and Mary Wootters. Fast matrix completion without the condition number. In Conference on Learning Theory, pages 638–678, 2014.
- Hastie et al. (2015) Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and low-rank svd via fast alternating least squares. Journal of Machine Learning Research, 16:3367–3402, 2015.
- Henderson (1950) Charles R Henderson. Estimation of genetic parameters. In Biometrics, volume 6, pages 186–187. International biometric soc, 1950.
- Horváth and Kokoszka (2012) Lajos Horváth and Piotr Kokoszka. Inference for functional data with applications, volume 200. Springer Science & Business Media, 2012.
- James et al. (2000) Gareth M James, Trevor J Hastie, and Catherine A Sugar. Principal component models for sparse functional data. Biometrika, pages 587–602, 2000.
- Jöreskog (1967) Karl G Jöreskog. Some contributions to maximum likelihood factor analysis. Psychometrika, 32(4):443–482, 1967.
- Kim and Mueller (1978) Jae-On Kim and Charles W Mueller. Factor analysis: Statistical methods and practical issues, volume 14. Sage, 1978.
- Kosambi (2016) DD Kosambi. Statistics in function space. In DD Kosambi, pages 115–123. Springer, 2016.
- Laird (1988) Nan M Laird. Missing data in longitudinal studies. Statistics in medicine, 7(1-2):305–315, 1988.
- Laird and Ware (1982) Nan M Laird and James H Ware. Random-effects models for longitudinal data. Biometrics, pages 963–974, 1982.
- Larsen (2004) Rasmus Munk Larsen. Propack-software for large and sparse svd calculations. Available online. URL http://sun. stanford. edu/rmunk/PROPACK, pages 2008–2009, 2004.
- Lawrence (2004) Neil D Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in neural information processing systems, pages 329–336, 2004.
- Liu and Huang (2009) Lei Liu and Xuelin Huang. Joint analysis of correlated repeated measures and recurrent events processes in the presence of death, with application to a study on acquired immune deficiency syndrome. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58(1):65–81, 2009.
- Ma et al. (2011) Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1):321–353, 2011.
- MacLehose and Dunson (2009) Richard F MacLehose and David B Dunson. Nonparametric bayes kernel-based priors for functional data analysis. Statistica Sinica, pages 611–629, 2009.
- Mardia et al. (1980) Kantilal V Mardia, John T Kent, and John M Bibby. Multivariate analysis (probability and mathematical statistics), 1980.
- Mazumder et al. (2010) Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of machine learning research, 11(Aug):2287–2322, 2010.
- McCulloch and Neuhaus (2001) Charles E McCulloch and John M Neuhaus. Generalized linear mixed models. Wiley Online Library, 2001.
- Palisano et al. (2008) Robert J Palisano, Peter Rosenbaum, Doreen Bartlett, and Michael H Livingston. Content validity of the expanded and revised gross motor function classification system. Developmental Medicine & Child Neurology, 50(10):744–750, 2008.
- Peng and Paul (2009) Jie Peng and Debashis Paul. A geometric approach to maximum likelihood estimation of the functional principal components from sparse longitudinal data. Journal of Computational and Graphical Statistics, 18(4):995–1015, 2009.
- Ramsay and Dalzell (1991) James O Ramsay and CJ Dalzell. Some tools for functional data analysis. Journal of the Royal Statistical Society. Series B (Methodological), pages 539–572, 1991.
- Read et al. (2003) Heather S Read, M Elizabeth Hazlewood, Susan J Hillman, Robin J Prescott, and James E Robb. Edinburgh visual gait score for use in cerebral palsy. Journal of pediatric orthopaedics, 23(3):296–301, 2003.
- Rennie and Srebro (2005) Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713–719. ACM, 2005.
- Rice and Wu (2001) John A Rice and Colin O Wu. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57(1):253–259, 2001.
- Rizopoulos et al. (2014) Dimitris Rizopoulos, Laura A Hatfield, Bradley P Carlin, and Johanna JM Takkenberg. Combining dynamic predictions from joint models for longitudinal and time-to-event data using bayesian model averaging. Journal of the American Statistical Association, 109(508):1385–1397, 2014.
- Robinson (1991) George K Robinson. That blup is a good thing: the estimation of random effects. Statistical science, pages 15–32, 1991.
- Sammel and Ryan (1996) Mary Dupuis Sammel and Louise M Ryan. Latent variable models with fixed effects. Biometrics, pages 650–663, 1996.
- Schulam and Arora (2016) Peter Schulam and Raman Arora. Disease trajectory maps. In Advances in Neural Information Processing Systems, pages 4709–4717, 2016.
- Schwartz and Rozumalski (2008) Michael H Schwartz and Adam Rozumalski. The gait deviation index: a new comprehensive index of gait pathology. Gait & posture, 28(3):351–357, 2008.
- Song et al. (2002) Xiao Song, Marie Davidian, and Anastasios A Tsiatis. A semiparametric likelihood approach to joint modeling of longitudinal and time-to-event data. Biometrics, 58(4):742–753, 2002.
- Srebro et al. (2005) Nathan Srebro, Noga Alon, and Tommi S Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances In Neural Information Processing Systems, pages 1321–1328, 2005.
- Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- Tipping and Bishop (1999) Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
- Verbeke (1997) Geert Verbeke. Linear mixed models for longitudinal data. In Linear mixed models in practice, pages 63–153. Springer, 1997.
- Watanabe (1965) Satosi Watanabe. Karhunen-loeve expansion and factor analysis. In Transactions of the 4th Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes, Prague, 1965, 1965.
- Yan et al. (2017) Fangrong Yan, Xiao Lin, Xuelin Huang, et al. Dynamic prediction of disease progression for leukemia patients by functional principal component analysis of longitudinal expression levels of an oncogene. The Annals of Applied Statistics, 11(3):1649–1670, 2017.
- Yao and Lee (2006) Fang Yao and Thomas Lee. Penalized spline models for functional principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):3–25, 2006.
- Yao et al. (2005) Fang Yao, Hans-Georg Müller, Jane-Ling Wang, et al. Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33(6):2873–2903, 2005.
- Zeger et al. (1988) Scott L Zeger, Kung-Yee Liang, and Paul S Albert. Models for longitudinal data: a generalized estimating equation approach. Biometrics, pages 1049–1060, 1988.
Appendix A Proofs
We prove convergence by mapping our problem into the framework of Mazumder et al. (2010). Following their notation we define
Our objective is to find . We define
Let be an matrix of rank and is an orthogonal matrix. The solution to the optimization problem
is given by where
is the SVD of , , and .
By Lemma 1 from Mazumder et al. (2010) we know that