Longitudinal LASSO: Jointly Learning Features and Temporal Contingency for Outcome Prediction

Longitudinal LASSO: Jointly Learning Features and Temporal Contingency for Outcome Prediction

Tingyang Xu
Department of Computer
Science and Engineering
University of Connecticut
Storrs, CT, USA
Jiangwen Sun
Department of Computer
Science and Engineering
University of Connecticut
Storrs, CT, USA
Jinbo Bi \titlenoteCorrespondence should be adressed to Jinbo Bi.
Department of Computer
Science and Engineering
University of Connecticut
Storrs, CT, USA
tix11001@engr.uconn.edu javon@engr.uconn.edu jinbo@engr.uconn.edu
22 February 2015
Abstract

Longitudinal analysis is important in many disciplines, such as the study of behavioral transitions in social science. Only very recently, feature selection has drawn adequate attention in the context of longitudinal modeling. Standard techniques, such as generalized estimating equations, have been modified to select features by imposing sparsity-inducing regularizers. However, they do not explicitly model how a dependent variable relies on features measured at proximal time points. Recent graphical Granger modeling can select features in lagged time points but ignores the temporal correlations within an individual’s repeated measurements. We propose an approach to automatically and simultaneously determine both the relevant features and the relevant temporal points that impact the current outcome of the dependent variable. Meanwhile, the proposed model takes into account the non-i.i.d nature of the data by estimating the within-individual correlations. This approach decomposes model parameters into a summation of two components and imposes separate block-wise LASSO penalties to each component when building a linear model in terms of the past measurements of features. One component is used to select features whereas the other is used to select temporal contingent points. An accelerated gradient descent algorithm is developed to efficiently solve the related optimization problem with detailed convergence analysis and asymptotic analysis. Computational results on both synthetic and real world problems demonstrate the superior performance of the proposed approach over existing techniques.

Longitudinal modeling; regularization methods; sparse predictive modeling; regression
\permission

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. \conferenceinfoKDD ’15,August 11 - 14, 2015, Sydney, NSW, Australia \copyrightetc© 2015 ACM. \crdataISBN 978-1-4503-3664-2/15/08 …$15.00.
DOI: http://dx.doi.org/10.1145/2783258.2783403 \numberofauthors3

\category

G.1.6Numerical AnalysisOptimization[Gradient methods] \categoryH.2.8Database managementDatabase Application[Data mining]

\terms

Algorithms, Performance, Experimentation

1 Introduction

A longitudinal study collects and analyzes repeated measurements of a set of features for a group of subjects through time. Longitudinal analyses are important in many areas, such as in social and behavioral science [20, 7, 4], in economics [18, 2], in climate[13, 2], and in genetics [21]. For example, to predict binge drinking of college students, a longitudinal study may be designed to monitor them weekly or even daily in terms of multiple covariates, such as, the level of stress, status of negative affects and social behaviors [4, 1]. The fluctuation of these covariates is used to analyze and predict binge drinking (the dependent or outcome variable) of a student at the current observation time point. Changes of the covariates in the proximal time points are anticipated to alter the likelihood that a student binge drinks at the current observation point. To precisely understand how covariates affect the outcome, the analysis has to model not only the current values of the covariates but also their proximal values as well as take into account the correlation structure in the repeated measurements.

Typically, longitudinal data are analyzed by extending generalized linear models (GLM) with different assumptions, such as marginal models, random effects models, and transition models [6]. For example, a marginal model regresses the outcome on the current observation of features but factors in a within-subject correlation matrix that is estimated for a few proximal time points. In contrast, a random effects model reflects the variability among individuals rather than the population average comparing with marginal models. For marginal modeling, generalized estimating equations (GEE) are the most widely used methods which estimate a predictive model to predict the current outcome together with correlations among different outcomes observed temporally. The resultant predictive models are generally more accurate than those of classic regression analysis that assumes independently and identically distributed (i.i.d.) observations [12]. Research on feature selection in longitudinal data leads to a new family of methods based on the penalized GEE (PGEE)[8]. For random effects models, generalized linear mixture model(GLMM)[11, 15] is the major method. It explores natural heterogeneity across individuals in the regression coefficients and represents this heterogeneity by a probability distribution.

None of those extensions of GLM aim to detect causal relationships from temporal changes of covariates to the outcomes of the current effect. In many studies, it is however necessary and insightful to model simultaneously the correlation among outcome records and the lagged causal effects of covariates [1]. For example, psychologists have identified that there is lagged effect in the alcohol use behavior. An individual’s drinking today may be a response to an elevated level of stress two days back rather than the current day. It is actually an important question for psychologists to find out both which temporal points and which covariates influence the current outcome the most. This lagged effect is not used by temporal marginal modeling to make predictions.

On the other hand, researchers have developed machine learning approaches for longitudinal analysis that predict an outcome using feature values at multiple time points [2, 13]. For example, graphical Granger modeling [2], and grouped graphical Granger modeling[13] are insightful to explore the influences from past temporal information present in time series data in the modeling and understanding of the causal relationships. These methods assume that past values of certain time series features causally affect an outcome variable, and hence construct a model based on these values to predict future outcomes. Often, they estimate causality relationship (causal graph) among all features. However, these methods assume i.i.d. samples which are clearly violated in longitudinal data, and moreover they are incapable of selecting the most influential time points.

All existing methods either assume i.i.d. samples in Granger causality modeling or assume correlated samples but do not model temporal causal effects. Therefore, we propose a new learning formulation that constructs predictive models as functions of covariants not only from the current observation but also from multiple previous consecutive observations, and simultaneously determine the temporal contingency and the most influential features. The proposed method has the following advantages:

  1. The proposed method makes predictions based on lagged data from current and previous time points. It decomposes the model coefficients into a summation of two components and impose different block-wise least absolute shrinkage and selection operators (LASSO) to the two components. One regularizer is used to detect the contingency of specific time points whereas the other is used to select covariates.

  2. The proposed method also learns simultaneously a structured correlation matrix from the data. The correlations among the outcomes themselves imply the changing trend of the outcomes in the proximal time points within each subject.

  3. We develop a family of methods where the outcome variable is assumed to follow a distribution from the exponential family, including Bernoulli, Gaussian and Poisson distributions. The formulations for these distributions are discussed in Section 3.3.

  4. We provide the convergence analysis in Section 3.1 and asymptotic analysis in Section 3.2 to show that the proposed algorithm can find the optimal solution for the predictive models.

We have empirically compared the proposed method against the state of the art on both synthetic and real world datasets. The computational results demonstrate the effectiveness and the capability of our approach.

Figure 1: The outcome at time can be relevant to multiple covariates observed at current and several previous time points , which forms a data matrix (left). If we associate with each entry of this matrix a weight in our additive prediction model, then our model coefficients form a matrix (right). If the coefficient matrix is sparse, then the resultant model will be selective in terms of covariates and time points.

2 Method

In our approach, the predictive model takes the form of the trace of the product of the lagged data and the model coefficient matrix as shown in Figure 1. The model coefficients are organized into a matrix rather than a vector used in traditional analysis because this way reflects the structure in the lagged data. Note that the lagged observations of can also be included in the data matrix to be used in the predictive model. For notational convenience, we just use to represent the data that are used to form the model.

We first briefly review two most relevant sets of longitudinal analytics in Section 2.1 which will help elucidate the advantages of our proposed formulation.

2.1 Preliminaries

We introduce the notation that is used through out the paper. A bold lower case letter denotes a vector, such as . The refers to the norm of a vector , which is formed as , where is the -th component of and is the length of . A bold upper case letter denotes a matrix such as . Similarly, , and represent the -th row, -th column and -th component of , respectively. The Frobenius norm and norm of a matrix refer, respectively, to , which is equal to , and , defined by , where is the number of rows in , and indicates the trace of . We assume that is the column-major vectorization of , which is defined as assuming columns are in . Then, is the inner product of two matrices and that is computed as the inner product of and . The operator re-shapes into a matrix of a proper size determined by the specific context.

Assume that we are given data of number of individuals on number of features (independent variables) that are repeatedly measured at time points for each individual . The data of each individual is represented by a matrix of size , and refers to the -entry data vector of individual at time point . Without loss of generality, we assume that all individuals have data at the same consecutive time points () to simplify the notation and the subsequent analysis. Data on the dependent variable (outcome) is also given in of length that contains the observations at the time points for individual . Typically, a longitudinal study aims to estimate the effect of covariates on the dependent variable.

2.1.1 Granger Causality

The notion of Granger Causality was introduced by the Nobel prize winning economist, Clive Granger, and has proven useful in time series analysis [10]. It is based on the intuition that if a time series variable causally affects another, the past observations of the former should be useful in predicting the future outcome of the latter.

Specifically, a time series observation is said to Granger cause another time series outcome, , if the regressing for in terms of past and is significantly better than the regressing just with past values of . The so-called Granger test first performs two regressions:

(1)

and , where is the maximum “lag" in the past observations, and then uses a hypothesis test such as an F-test to determine if the outcome can be predicted significantly better from the past covariate . Recent graphical Granger models [2, 13] extend it from a single time series covariate to multiple covariates . They learn the coefficients and ’s with LASSO type of regularizers and evaluate if coefficients are non-zero for Granger causality.

2.1.2 Generalized Estimating Equations (GEE)

GEE estimates the parameters of a GLM while taking into account the correlations in the training examples. Similar to GLM, it assumes that the dependent variable comes from a class of distributions known as the exponential family. For each member in this family, there exists a link function that can be used to translate the nonlinear model into a linear model. The expectation of the outcome for subject at time is computed as:

(2)

where represents the mean model, is the inverse of a link function in a GLM [14], and . The variance of is computed as where is a scaling parameter that may be known or estimated.

GEE presumes a so-called working correlation structure, typically denoted by , where is a parameter to be determined from data. The common choices of include exchangeable, tri-diagonal and the first-order autoregressive (AR(1)) formula [12]. The exchangeable correlation structure, also called equi-correlation, assumes that for all . The tri-diagonal structure uses a tridiagonal matrix as where if or otherwise. The AR(1) formula assumes a correlation structure along continuous time, and uses .

To estimate the regression coefficients , GEE uses the the estimating equations that are formulated, in general, by setting the derivative of an appropriate loss function to 0. Although a loss function may not be explicitly written out, the estimating equations always can be computed by

(3)

where the matrix where combines all into a vector, . The matrix is the estimated covariance structure as:

(4)

where is an diagonal matrix with as the -th diagonal element. Algorithms are given in [12] to compute and for the different choices of .

2.2 The Proposed Formulation

In our approach, each training example consists of the current and previous records of the repeated measurements. Let

be a data matrix for subject . Given total measurements for each subject, the index of starts from in order to have enough previous observations in the first training example. Hence, there are totally training examples for each subject. If includes previous values of as a feature, then the model where essentially gives the same model like Eq.(1) in the graphical Granger models.

The Granger models would assume that the training examples are i.i.d.. However, the consecutive examples are not mutually independent because they contain overlapping records (e.g., and share records , , ). GEE provides a mechanism to estimate the sample correlation simultaneously while constructing predictive models, and to extend the linear models to generalized linear models. To apply GEE to our model, we replace used in GEE by the following formula

(5)

Substituting Eq.(5) for in Eq.(2) yields a formulation similar to GEE. The regression coefficients can be estimated through the well-developed GEE estimators. In particular, the quasi-likelihood methods of GEE estimate by minimizing a loss function that is defined via the model deviance. The model deviance measures the difference between the log-likelihood of the estimated mean model and that of the observed values . For instance, the model deviance for a linearly regressive response is written by where contains the observed responses for subject , and is the estimated expectations of for subject . If the response follows an arbitrary distribution, the model deviance may not correspond to an explicit function. For the exponential family, it takes a special form as discussed in Theorem 1 below, which is still complicated. We denote by the deviance occurred on subject . GEE minimizes a loss function of for the optimal by solving the estimating equations, i.e., taking the derivatives of the loss function and setting them to .

Now, to select among features and discover the most influential time points in predicting over time, (and also to control the model capacity,) we apply regularizers to the model parameters. We first decompose into a summation of two components as and apply different regularizers to and . The block-wise LASSO, such as the matrix norm, is widely-used in multi-task learning or feature selection with group structures, but has not been explored within the GEE setting. To the best of our knowledge, it has not been studied in longitudinal analytics how to produce shrinkage effects simultaneously on both features and contingent temporal records through proper regularization. The general matrix norm [23] calculates the sum of the norms of the rows in a matrix. Regularizers based on the norms encourage row sparsity by shrinking the entire rows to have zero entries.

In our parameter matrix , rows correspond to features and columns correspond to the observation time points. If we apply the norm to (row-wisely), the optimal solution of will contain rows with all zero entries. Thus, a selected subset of features in the observations will be used in the predictive model to predict the current outcome. The norm of (column-wisely) encourages to select among columns of . If the -th column of contains the largest values in the selected columns, the current outcome is most contingent on the previous -th record, thus having the “lagged" effect. Overall, we solve the following optimization problem for the best model parameters which is computed as :

(6)

where in the deviance is simply replaced by .

The optimization of Eq.(6) is challenging. In general, even solving the GEE formulation is not easy as it estimates not only the model expectation but also the variance term . The algorithm that solves the GEE (i.e., the estimating equations) applies the Newton-Raphson method in the iterative reweighted least squares (IRLS) procedure [8] to estimate and . However, this method does not solve any formula that uses regularizers. By modifying the Newton-Raphson method or shooting algorithm [8], it can be extended only to the regularizers that are decomposable into individual parameters . For instance, the vector norm of can be decomposed into the summation of individual , . The matrix norm, unfortunately, can not be decomposed in such a way. Therefore, we have developed an accelerated gradient descent method based on the fast iterative shrinkage-thresholding algorithm (FISTA) [3]. Further, the following theorem shows that Eq.(6) is a convex optimization problem in terms of . Our algorithm can be proved to find the global optimal solution of Eq.(6) when is fixed (to a consistent estimate given by GEE).

Theorem 1

The first term of Eq.(6) is convex and continuously differentiable with respect to and if the distribution of is in a natural exponential family and the link function is continuous.

{proof}

First, let us recall that the probability density function of a distribution in the exponential family takes the following form:

where , , and are known functions and specified for each member of the exponential family, and is a parameter in the mean as defined in Eq.(2). Typically, . Then, the deviance of the exponential family can be computed as

where denotes the true value under a saturated model, denotes the fitted values of the model. Thus, and are constant in model fitting. The derivative of always satisfies . Moreover, it has been proved that is a convex function on the natural parameter space [19]. Thus, the deviance contains either linear terms or a convex term with respect to . In our model (5), is linear with respect to . Hence, the deviance term in Eq.(6) is convex with respect to and .

Moreover, it is true that which is the inverse of a continuous link function [19]. The first term of Eq.(6) is continuously differentiable with respect to and . Thus, theorem 1 holds.

2.3 Optimization Algorithm

To solve Eq.(6), we design an alternating optimization algorithm that alternates between optimizing two working sets of variables: one set consisting of and and the other consisting of .

(a) Find and when is fixed

When is fixed, the objective function of Eq.(6), denoted by , is convex with a continuously differentiable part that is the deviance and a nonsmooth part that constitutes the two regularizers. We hence have

We develop a FISTA algorithm in the following iterative procedure to find optimal and .

Denote the iterates at the -th iteration by and . Let , be the partial derivative of with respect to and , respectively, For any given point , the following is a well-defined proximal map for the non-smooth

If has Lipschitz continuous gradient with Lipschitz modulus . Then, according to the Lemma 2.1 in [3], the inequality

holds indicating that is the upper bound of .

Starting from an initial point , we iteratively search for the optimal solution. At each iteration , we first use the iterates and to compute (at the first iteration, )

(7)

where is a scalar and updated at each iteration as:

(8)

Then, we solve the following problem

(9)

for a solution , where and are respectively the partial derivatives of computed at , and acts as a learning step size.

Since there is no interacting term between and in Eq.(9), the problem can be decomposed into two separate subproblems as follows:

(10)
(11)

The two subproblems share the same structure and thus can be solved following the same procedure. Hence, we only show how to solve (10) for the best .

Eq.(10) is equivalent to the following problem

after omitting constants, and this problem has a closed-form solution where each row of , is:

and . The gradient vector (i.e., the gradient of the deviance) can be computed by Eq.(3) with the fixed , i.e.

(12)

where , and .

In the above discussion, the Lipschitz modulus is computed and given. However, the calculation of can be computational expensive. We therefore follow the similar argument in [9] to find a proper approximation at each iteration starting from . Recall that the Lipschitz constant is defined:

where indicates the maximum singular value of the Hessian of . Decompose the Hessian matrix into where and is the rank of the Hessian matrix. We have an upper bound of as follows:

(13)

We use the upper bound in Eq.(13) as in our iterations. Using this upper bound may increase the number of iterative steps for convergence. Algorithm 1 summarizes the steps for finding optimal and with fixed .

Input: , , , ,
Output: ,
1. = 1, compute and initialize , and ;
2. Solve Eq.(9) to obtain and .
3. Compute by Eq.(8).
4. Compute and by Eq.(7).
5. .
Repeat until convergence.
Algorithm 1 Search for optimal and with fixed

(b) Find when and are fixed

When and are fixed, the regularizers no longer appear in the objective of Eq.(6). Eq.(6) is degenerated into just the GEE formula with as the variables. Hence, can be estimated via the standard GEE procedure, i.e., from the current Pearson residuals defined by:

where is the -th diagonal entry in the matrix [12]. The specific estimator of depends on the choices of . This GEE-based procedure has been shown to find a consistent estimate of [12].

Let be the total number of training examples, and be the practical number of parameters in . A general approach to estimating is given by:

(14)

for , and . In addition, the scaler parameter in Eq.(4) can be estimated as follows:

(15)

Algorithm 2 depicts the overall procedure for solving Eq.(6).

Input: , , ,
Output: ,
1. Set = ;
2. Solve for and using Algorithm 1.
3. Estimate using a proper estimator in [12] and compute by Eq.(14) and by Eq.(15).
Repeat until convergence.
Algorithm 2 Main algorithm - Jointly select features and temporal points

3 Theoretical Analysis

We provide a convergence analysis for Algorithm 1 and an asymptotic analysis for the proposed formulation.

3.1 Convergence Analysis

We show that Algorithm 1 converges to the optimal solution with a convergence rate of . The proof follows largely the arguments in [3]. We only provide a sketch here.

Theorem 2

Let and be the pair of the matrix generated by Algorithm 1. Then for any

where is a globally optimal solution of Eq.(6).

{proof}

We start with defining the following quantities

where , , and subsequent and are defined by Eq.(7). Following the proof of Theorem 4.4 in [3], in the first iteration, given , we have , and . We show that by applying Lemma 2.3 in [3], which yields

Reorganizing the above inequality yields

Thus, holds.

Then, according to Lemma 4.1 in [3], we have for every , , together with , which derives into the following inequality,

Therefore, we obtain that

(16)

Given is updated according to Eq.(8), it is easy to show that . Substituting this inequality into Eq.(16) yields

By the Remark 3.2 in [3] and the inequality (13), we also know that an upper bound of is . Hence,

In our algorithm, we set .

Remark 1

The loss function, , of an exponential distribution has Lipschitz continuous gradient within the range where are constant values in terms of , respectively to guarantee the non-trivial step size . Otherwise, it may lead to a sub-optimal solution.

3.2 Asymptotic Analysis

To facilitate the asymptotic analysis, we re-write the notation as follows: let

and

where one block corresponds to and the other to . Then, correspondingly, we have , and can be re-written as .

Solve Eq.(6) yields a solution to the penalized estimating equations:

(17)

assuming for notational convenience which will not change the property. Given our model definition (5), . The first term in (17) is the estimating functions in GEE [12] whereas the second term corresponds to the regularizers. The asymptotic property of Eq.(6) can be naturally derived from the results in [12] which have proved that the estimating equations of GEE gives a consistent estimator of . We extend the same argument to our formulation Eq.(6) in Theorem 3 under the following regularity conditions: is bounded, and , and are not singular, and the following limit is also not singular

Moreover, is twice continuously differentiable with respect to , and is positive definite.

Theorem 3

Assume that: (1) is a consistent estimator given ; (2) is a consistent estimator given ; and (3) the tuning parameter . Under the regularity conditions listed above, optimizing Eq.(6) yields an asymptotically consistent and normally distributed estimator , that is:

where is the true model coefficients in a model of and is a positive definite variance-covariance matrix (see [12] for details of ).

{proof}

Multiplying to both sides of Eq.(17) yields

(18)

It is known that solving yields an estimate of that is asymptotically consistent with :

Since our regularizer (based on the matrix norm) is Lipschitz continuous, its partial derivative is bounded. The second term of Eq.(18) vanishes when , and thus the conclusion holds.

Recall how and are estimated in the proposed method. Those estimates from the Pearson residuals are consistent. Thus, the estimate in the proposed method is asymptotically consistent and normally distributed according to Theorem 3.

3.3 Exemplar Exponential Families with Lipschitz Condition

The purposed algorithm is suitable to optimize any loss function that has Lipschitz continuous gradient. In this section, we discuss that three exemplar exponential families: Gaussian, Bernoulli, and Poisson, satisfy the Lipschitz condition. We specify how to compute the gradient of the loss function for these distributions. The gradients will instantiate (and replace) Eq.(12) used in our algorithm.

3.3.1 Gaussian Distribution

If the outcome follows a Gaussian distribution, then the outcome is linearly regressive in terms of the covariates in the observations. The mean and the conditional covariance of with a working correlation structure are calculated as:

so the gradient in Eq.(12) at the -th iteration can be computed as

where , and . The gradient can be similarly computed. Hence, the gradient is linear in terms of , and thus Lipschitz continuous.

3.3.2 Bernoulli Distribution

If the generalized variables follow a Bernoulli distribution and the outcomes are binary variables. The relationship between the outcome and covariates can be learned by a logistic regression which is a special case of the GLM with the Bernoulli assumption. Hence, the mean and the conditional covariance of with the working correlation structure are formulated as

(19)

where
and .

The gradient in Eq.(12) can be written as:

where
, and . The gradient can be similarly computed.

3.3.3 Poisson Distribution

If the generalized variables follow a Poisson distribution and the outcomes contain count values. The relationship of the outcome and covariates is learned by a Poisson regression. The mean and the conditional covariance of with the working correlation structure are formulated as

where