Least Angle Regression in Tangent Space and
LASSO for Generalized Linear Models
We propose sparse estimation methods for the generalized linear models,
which run one of Least Angle Regression (LARS) and
Least Absolute Shrinkage and Selection Operator (LASSO)
in the tangent space of the manifold of the statistical model.
Our approach is to roughly approximate the statistical model
and to subsequently use exact calculations.
LARS was proposed as an efficient algorithm for parameter estimation and variable selection for the normal linear model.
The LARS algorithm is described in terms of Euclidean geometry, with the correlation regarded as the metric of the parameter space.
Since the LARS algorithm only works in Euclidean space,
we transform a manifold of the statistical model into the tangent space at the origin.
In the generalized linear regression,
this transformation allows us to run the original LARS algorithm
for the generalized linear models.
The proposed methods are efficient and perform well.
Real-data analysis shows that the proposed methods output results similar
to those of the $L_1$-regularized maximum likelihood estimation for the generalized linear models.
Numerical experiments show that
our methods work well
and that they can be better than the $L_1$-regularization
in generalization, parameter estimation, and model selection.
Keywords: Exponential family, Generalized linear regression, Information geometry, Sparse modelling
1 Introduction
We propose sparse estimation methods for the generalized linear models (GLM). One of the proposed methods is based on Least Angle Regression (LARS) [6] and is described in terms of information geometry. The main features of our approach are that i) we use an approximation of a statistical model rather than the statistical model itself, and ii) the proposed methods are calculated exactly, which allows us to compute the estimators efficiently. In the literature, a few extensions of LARS based on information geometry, Riemannian geometry, or differential geometry have been proposed: for example, [9] and [4]. The existing methods take advantage of a dual structure of a model manifold, which incurs computational costs. Our method utilizes a part of the dual structure and uses the original LARS algorithm in the tangent space. The proposed method enables us to compute the estimator easily. Furthermore, we show that the Least Absolute Shrinkage and Selection Operator (LASSO) [14] for the normal linear model is also available in the tangent space.
In the last two decades, sparse modeling has been extensively investigated. LASSO is a representative method and has motivated many researchers in statistics, machine learning, and other fields. LASSO was proposed as an estimation and variable-selection method for the normal linear model. LASSO minimizes the regularized least squares with a tuning parameter. Various generalizations have been proposed for other problems. For example, [13] and [15] treat the generalized linear regression and Gaussian graphical models, respectively. See also [8].
LARS was proposed for the same problem as LASSO. The LARS algorithm is very efficient, and it can also compute the LASSO estimator if a minor change is added. The LARS algorithm uses only correlation coefficients between the response and explanatory variables. Therefore, the algorithm is described in terms of Euclidean geometry.
Information geometry is a Riemannian-geometrical framework for statistics and other fields [2, 3, 1, 5]. In this framework, we treat a statistical model as a Riemannian manifold and take advantage of its geometrical properties for estimation, testing, and other tasks. Each probability distribution is treated as a point in the manifold. For example, the estimation problem for the generalized linear regression can be described in terms of the geometry. The GLM is treated as a manifold, and an estimator assigns a point in the manifold to the observed data. The maximum likelihood estimator (MLE) uses a kind of projection.
Some extensions of LARS have been proposed based on the information geometry of the exponential family of distributions. The authors of [9] and [4] proposed different extensions of LARS, which take advantage of the dual structure of the model manifold. Their works are theoretically natural and can be extended to models other than the GLM [10, 11]. However, the existing methods need many iterations of approximation computation, which is inevitable for treating objects more complicated than Euclidean space. For example, [4] treated many tangent spaces, each of which corresponds to an estimate, while our methods use only one tangent space. The authors of [4] wrote that their “DGLARS method may be computationally more expensive than other customized techniques” for the $L_1$-regularization method. One of our aims is to provide a method as efficient as the $L_1$-regularization for the GLM. Note again that our approach is different from that of the existing methods. We roughly approximate the model manifold by the tangent space and use the exact computation of LARS in the tangent space. This approximation is natural from the viewpoint of information geometry. The usefulness of our idea is validated by numerical experiments. One advantage of our methods is that they do not require additional implementation because we can use existing packages.
In Section 2, we introduce our problem and the related works. In Section 3, we propose a sparse estimation method based on LARS. Furthermore, LASSO-type estimators are also proposed. In Section 4, we compare our methods with the $L_1$-regularization for the GLM by performing numerical experiments. Section 5 is our conclusion. Lemmas are given and proved in Appendix A.
2 Problem and related method
2.1 Problem and notation
In this paper, we consider the generalized linear regression, which is an estimation problem of the GLM [12]. In the generalized linear regression, the expectation of a response is represented by a linear combination of explanatory variables as
$$ g(\mu_a) = \sum_{i=1}^{d} x_{ai}\,\theta^i \quad (a = 1, \dots, n), $$
where $g$ is called a link function, $n$ is the sample size, $d$ is the number of the explanatory variables, and $\theta = (\theta^1, \dots, \theta^d)^\top$ is the parameter to be estimated. Let $X = (x_{ai})$ be the design matrix, which is an $(n \times d)$-matrix. Let $y = (y_a)$ and $\mu = (\mu_a)$ be the response vector and its expectation, respectively, which are column vectors of length $n$. In general, the link function is a function of $\mu$ and is not determined uniquely. However, in this paper, we only use the canonical link function, which results in useful properties of the GLM.
In terms of probability distributions, the problem above corresponds to estimation for an exponential family of distributions, that is, the GLM,
$$ p(y \mid \theta) = \exp\left( y^\top X\theta - \psi(\theta) \right), $$
where $\psi$ is called a potential function.
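As a special case, the normal linear regression uses the identity link function $g(\mu) = \mu$ and a quadratic function as the potential function. Another example is the logistic regression, where the link function is $g(\mu) = \log\{\mu / (1 - \mu)\}$ and the potential function is $\psi(\theta) = \sum_{a=1}^{n} \log\{1 + \exp(\sum_{i=1}^{d} x_{ai}\theta^i)\}$.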
Throughout the paper, we assume that the design matrix $X$ is normalized, that is, each column vector has mean zero and $\ell_2$-norm one: $\sum_{a=1}^{n} x_{ai} = 0$ and $\sum_{a=1}^{n} x_{ai}^2 = 1$ for $i = 1, \dots, d$. Furthermore, we assume that the column vectors of $X$ are linearly independent.
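The normalization assumed above can be illustrated with a minimal Python sketch. This is not code from the paper (which used R); the function names `normalize_columns` and `gram` and the toy matrix are hypothetical. After centering and scaling, the Gram matrix of the design matrix has ones on its diagonal and the sample correlations off the diagonal.

```python
import math

def normalize_columns(X):
    """Center each column to mean zero and scale it to unit L2 norm."""
    n, d = len(X), len(X[0])
    Z = [[0.0] * d for _ in range(n)]
    for i in range(d):
        col = [X[a][i] for a in range(n)]
        mean = sum(col) / n
        centered = [v - mean for v in col]
        norm = math.sqrt(sum(v * v for v in centered))
        if norm == 0.0:
            raise ValueError("a constant column cannot be normalized")
        for a in range(n):
            Z[a][i] = centered[a] / norm
    return Z

def gram(Z):
    """Return Z^T Z; for a normalized Z this is the sample correlation matrix."""
    n, d = len(Z), len(Z[0])
    return [[sum(Z[a][i] * Z[a][j] for a in range(n)) for j in range(d)]
            for i in range(d)]

# Toy design matrix (hypothetical values, not data from the paper).
X_raw = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
Z = normalize_columns(X_raw)
R = gram(Z)
```

Linear independence of the columns is a separate assumption and is not checked by this sketch.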
2.2 LARS
We briefly describe the LARS algorithm. In subsection 3.2, we use the LARS algorithm to propose an estimation method. Details and further discussion of LARS can be found in, for example, [6] and [8].
LARS was proposed as an algorithm for parameter estimation and variable selection in the normal linear regression. In the LARS algorithm, the estimator moves from the origin to the maximum likelihood estimate (MLE) of the full model. The full model means the linear model including all the explanatory variables. The MLE $\hat\theta_{\mathrm{MLE}}$ is determined by the design matrix $X$ and the response $y$. The detailed algorithm of LARS is shown in Algorithm 1, where $\hat\theta_{(k)}$ is the $k$-th estimate the algorithm outputs. After $d$ iterations, LARS outputs a sequence of the estimates $\hat\theta_{(1)}, \dots, \hat\theta_{(d)}$.
The idea of the LARS algorithm is shown in Figure 1. Figures 1(a) and 1(b) indicate the estimator’s move and the residual’s move, respectively, in the parameter space when . The estimator i) selects an element of the parameter which makes a least angle between and -axis, and ii) uses it as a trajectory in the form of the bisector of an angle. The LARS algorithm is described in terms of Euclidean geometry and can be computed efficiently. Furthermore, plays an important role in the LARS algorithm, which is one of our motivations for considering the tangent space of a statistical model.
2.3 LASSO
LASSO is an optimization problem for parameter estimation and variable selection in the normal linear regression. LASSO solves the minimization problem
$$ \min_{\theta} \; \| y - X\theta \|_2^2 + \lambda \|\theta\|_1, $$
where $\lambda \ge 0$ is a tuning parameter. The path of the LASSO estimator as $\lambda$ varies can be computed by the LARS algorithm with a minor modification.
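LASSO can be applied to the GLM as the $L_1$-regularized MLE, which is the minimization problem
$$ \min_{\theta} \; -\log p(y \mid \theta) + \lambda \|\theta\|_1, \tag{1} $$
where $p(y \mid \theta)$ is the likelihood of the GLM. For example, see [13].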
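As an illustration of the $L_1$-regularized objective for the GLM, the following Python sketch evaluates it for logistic regression with the canonical link and no intercept. It is a sketch under those assumptions, not the implementation used in the paper (which relied on the glmnet package); the function names and the toy data are hypothetical.

```python
import math

def potential(X, theta):
    """Logistic potential psi(theta) = sum_a log(1 + exp(x_a . theta))."""
    total = 0.0
    for row in X:
        s = sum(x * t for x, t in zip(row, theta))
        # Numerically stable evaluation of log(1 + exp(s)).
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total

def log_likelihood(X, y, theta):
    """l(theta) = y^T X theta - psi(theta) for the canonical link."""
    linear = sum(ya * sum(x * t for x, t in zip(row, theta))
                 for row, ya in zip(X, y))
    return linear - potential(X, theta)

def l1_objective(X, y, theta, lam):
    """Negative log-likelihood plus lam times the L1 norm of theta."""
    return -log_likelihood(X, y, theta) + lam * sum(abs(t) for t in theta)

# Toy data (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, 0, 1]
value_at_origin = l1_objective(X, y, [0.0, 0.0], 0.5)
```

At the origin every response has probability 1/2, so the objective equals $n \log 2$ regardless of the tuning parameter.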
3 The proposed methods
In subsection 3.1, we introduce the information geometry used in this paper. In subsection 3.2, we propose LARS in tangent space, which is an extension of the original LARS to the GLM. The proposed method is identical to the original LARS when applied to the normal linear model. Subsection 3.3 is a remark on the matrix $X^\top X$. In subsection 3.4, we propose other methods which are related to LASSO. Subsection 3.5 explains the difference between the proposed methods and the existing methods.
3.1 Information geometry
In the generalized linear regression, we need to select one distribution from the GLM. A model manifold is a manifold consisting of probability distributions of interest. That is, the model manifold is $M = \{ p(\cdot \mid \theta) : \theta \in \mathbb{R}^d \}$, where $p(\cdot \mid \theta)$ indicates the probability distribution with the regression coefficient $\theta$. The parameter $\theta$ works as a coordinate system in $M$.
The tangent space $T_pM$ at a point $p \in M$ is a linear space consisting of directional derivatives, that is, $T_pM = \{ \sum_{i=1}^{d} v^i \partial_i : v^i \in \mathbb{R} \}$, where $\partial_i = \partial / \partial\theta^i$. We consider the tangent space at $\theta = 0$. For simplicity, we call $p(\cdot \mid 0)$ and $T_0M$ the origin and the tangent space at the origin, respectively.
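Any pair of vectors in $T_0M$ has an inner product. The inner product is determined by the Fisher information matrix $G = (g_{ij})$:
$$ g_{ij}(\theta) = \mathrm{E}_\theta\left[ \partial_i l(\theta)\, \partial_j l(\theta) \right], $$
where $l(\theta) = \log p(y \mid \theta)$ is the log-likelihood. Using the Fisher metric $G$, the inner product of $v = \sum_i v^i \partial_i$ and $w = \sum_j w^j \partial_j$ is given by
$$ \langle v, w \rangle = \sum_{i,j} g_{ij}(0)\, v^i w^j. $$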
A point in the tangent space $T_0M$ can be identified with a point in $M$ via an exponential map. We introduce the e-exponential map $\mathrm{Exp}_0 : T_0M \to M$ defined as follows. For $v = \sum_i v^i \partial_i \in T_0M$, let $\mathrm{Exp}_0(v) = p(\cdot \mid \theta)$ with $\theta = (v^1, \dots, v^d)^\top$. Our problem in this paper is estimation for the GLM, and the parameter is a regression coefficient vector $\theta$. Therefore, we can avoid technical difficulties of an exponential map. The map $\mathrm{Exp}_0$ is a bijection from $T_0M$ to $M$. For details, see subsection A.3.
For readers familiar with information geometry, we make an additional remark. The model manifold $M$ of the GLM is e-flat, and the regression coefficient $\theta$ is an e-affine coordinate system of $M$. $\{\partial_i\}$ is the natural basis of $T_0M$ with respect to the coordinate system $\theta$. Each coordinate axis of $\theta^i$ in $M$ corresponds to the $\partial_i$-axis in $T_0M$ via the e-exponential map.
In the following, we also use another representation of . This representation is useful for our purpose: . In our notation, also indicates in the tangent space , not only a point . However, we believe that it is not confusing because a vector in the tangent space and a point in are identified through the exponential map.
3.2 LARS in tangent space
The main idea of the proposed method is to run LARS in the tangent space $T_0M$ at the origin. First, we map the model manifold $M$ to the tangent space $T_0M$ through the e-exponential map. After this mapping, our computation is done by the original LARS algorithm. However, we do not use the response $y$ directly. We introduce a virtual response $\hat y$. The LARS algorithm outputs a sequence of parameter estimates, whose length is the same as the dimension $d$ of the parameter. Finally, the estimates are mapped to the model manifold.
Before running the original LARS algorithm, we introduce the virtual response $\hat y$. The virtual response is defined using the design matrix $X$ and the MLE $\hat\theta_{\mathrm{MLE}}$ of the full model: $\hat y = X\hat\theta_{\mathrm{MLE}}$. Note that LARS uses only correlation coefficients between the response and the explanatory variables in the form of $X^\top y$, which is identical with $X^\top X \hat\theta_{\mathrm{MLE}}$ in the normal linear regression. Therefore, introducing the appropriate representation $\hat y$ of the response, we need only $X^\top \hat y$ as $X^\top X \hat\theta_{\mathrm{MLE}}$.
In the estimation step of the proposed method, we run the original LARS algorithm in the tangent space as if the response were $\hat y$. LARS outputs a sequence of the model parameter. As is shown in Figure 1(a), the LARS estimator can be regarded as moving from the origin to the MLE of the full model. At the same time, however, the residual of the estimator is moving from the MLE to the origin (Figure 1(b)). The latter is useful for our method because it allows us to fix the estimator’s tangent space to the origin. What moves is the residual, not the estimator. Note that Algorithm 1 in subsection 2.2 is actually described from the latter perspective.
LARS in Tangent space (TLARS)
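LARS in Tangent space (TLARS) is given as follows:
1. Calculate the MLE $\hat\theta_{\mathrm{MLE}}$ of the full model.
2. Run the LARS algorithm for the design matrix $X$ and the response $\hat y = X\hat\theta_{\mathrm{MLE}}$.
3. Using the sequence $\hat\theta_{(1)}, \dots, \hat\theta_{(d)}$ made by LARS, the result is the sequence $p(\cdot \mid \hat\theta_{(1)}), \dots, p(\cdot \mid \hat\theta_{(d)})$ in the model manifold.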
As a special case, the proposed method coincides with the original LARS when we consider the normal linear regression. Note that TLARS is as computationally efficient as LARS although TLARS solves the estimation problem of the GLM. Furthermore, we can use existing packages of LARS for the computation of TLARS.
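The first step of TLARS and the construction of the input for the second step can be sketched in Python for logistic regression. This is a minimal sketch, not the paper's implementation (the paper used glm() and the lars package in R). It assumes no intercept, computes the MLE with a plain Newton iteration, and reads the virtual response as the design matrix times the full-model MLE. The names `solve`, `logistic_mle`, `virtual_response` and the toy data are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(d):
        piv = max(range(k, d), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, d):
            f = M[r][k] / M[k][k]
            for c in range(k, d + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * d
    for k in reversed(range(d)):
        x[k] = (M[k][d] - sum(M[k][c] * x[c] for c in range(k + 1, d))) / M[k][k]
    return x

def logistic_mle(X, y, iters=50, tol=1e-12):
    """Newton's method for the full-model MLE (canonical link, no intercept)."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-sum(x * t for x, t in zip(row, theta))))
              for row in X]
        grad = [sum(X[a][i] * (y[a] - mu[a]) for a in range(n)) for i in range(d)]
        hess = [[sum(X[a][i] * X[a][j] * mu[a] * (1.0 - mu[a]) for a in range(n))
                 for j in range(d)] for i in range(d)]
        step = solve(hess, grad)
        theta = [t + s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return theta

def virtual_response(X, theta_hat):
    """Virtual response: the design matrix applied to the full-model MLE."""
    return [sum(x * t for x, t in zip(row, theta_hat)) for row in X]

# Toy data (hypothetical); the two classes overlap, so the MLE is finite.
X = [[-1.0, 0.5], [-0.5, -1.0], [0.0, 1.0], [0.5, -0.5], [1.0, 1.0], [1.5, -1.0]]
y = [0, 1, 0, 1, 1, 0]

theta_hat = logistic_mle(X, y)           # step 1: MLE of the full model
y_hat = virtual_response(X, theta_hat)   # response to hand to LARS in step 2
```

Step 2 would then pass the pair `(X, y_hat)` to any existing LARS routine, and step 3 reads the resulting coefficient vectors as points of the model. The Newton iteration here has no step-size control, so it is only suitable for small, well-conditioned examples.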
3.3 KL divergence and correlation
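The Kullback-Leibler divergence (KL divergence) is a key quantity in information geometry, which is also important in statistics, machine learning, and information theory. For the GLM, the KL divergence is given by
$$ D(\theta \,\|\, \theta') = \psi(\theta') - \psi(\theta) - \eta(\theta)^\top (\theta' - \theta), $$
where $\eta(\theta) = \mathrm{E}_\theta[X^\top y]$ is the expectation parameter of the exponential family. The KL divergence is approximated up to second order as
$$ D(\theta \,\|\, \theta') \approx \frac{1}{2} (\theta' - \theta)^\top G(\theta)\, (\theta' - \theta). $$
In generalized linear regression, the Fisher metric at the origin is proportional to the correlation matrix $X^\top X$, that is, $G(0) = c\, X^\top X$ for some $c > 0$. (See Appendix A.1.) Around the origin, the KL divergence is approximately related to the correlation matrix as
$$ D(\theta \,\|\, \theta') \approx \frac{c}{2} (\theta' - \theta)^\top X^\top X\, (\theta' - \theta). $$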
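The relation between the KL divergence and the correlation matrix can be checked numerically in a small Python sketch. It assumes logistic regression without an intercept, for which the proportionality constant at the origin is 1/4, and compares the exact KL divergence from the origin with the quadratic form in the Gram matrix. The function names and the toy values are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def bernoulli_kl(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_from_origin(X, theta):
    """Exact KL divergence from the model at theta = 0 to the model at theta."""
    return sum(bernoulli_kl(0.5, sigmoid(sum(x * t for x, t in zip(row, theta))))
               for row in X)

def quadratic_approx(X, theta, c=0.25):
    """(c / 2) * theta^T X^T X theta, with c = 1/4 for logistic regression."""
    return 0.5 * c * sum(sum(x * t for x, t in zip(row, theta)) ** 2 for row in X)

# Toy design matrix and a parameter close to the origin (hypothetical values).
X = [[0.5, -0.5], [-0.5, -0.5], [0.0, 1.0]]
theta = [0.1, -0.2]
exact = kl_from_origin(X, theta)
approx = quadratic_approx(X, theta)
```

For parameters this close to the origin the two quantities agree closely; the gap grows with the distance from the origin, which is the sense in which the tangent space is a rough approximation of the model.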
3.4 LASSO in tangent space
We propose two estimation methods. One is a LASSO modification of TLARS. The other is an approximation of the $L_1$-regularization for the GLM (1).
LASSO in Tangent space 1 (TLASSO1)
By modifying the LARS algorithm so that the algorithm outputs the LASSO estimator [6], we can use LASSO in the tangent space $T_0M$. LASSO in Tangent space 1 (TLASSO1) is formally defined as the minimization problem
$$ \min_{\theta} \; \| \hat y - X\theta \|_2^2 + \lambda \|\theta\|_1, $$
which implies that we use the design matrix $X$ and the virtual response $\hat y$ in the ordinary LASSO. This corresponds to the LASSO modification of TLARS.
As was shown in subsection 3.3, the quadratic form of the correlation matrix $X^\top X$ is regarded as an approximation of the KL divergence, on which the MLE is based. TLASSO1 is also an approximation of the $L_1$-regularization for the GLM.
LASSO in Tangent space 2 (TLASSO2)
Another LASSO-type method is a direct approximation of (1). TLASSO2 is defined as
where and satisfies . Since the column vectors of the design matrix are assumed to be linearly independent, uniquely exists. Problem (5) is LASSO for the normal linear regression with the design matrix and the response . TLASSO2 (5) is an approximation of (1). In fact, using and , the log-likelihood is approximated as follows (see subsection A.2):
Note that is an approximation of the MLE .
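The quadratic approximation of the log-likelihood behind TLASSO2 can be sketched as the following worked derivation. It is a reconstruction under stated assumptions, not the paper's own formula: the potential is assumed to be a sum $\psi(\theta) = \sum_a \phi(\sum_i x_{ai}\theta^i)$ over observations, the design matrix $X$ is centered as assumed in subsection 2.1, and the symbols $\phi$, $c$, and $\tilde\theta$ are names introduced here for illustration.

```latex
\begin{align*}
\ell(\theta) &= y^\top X\theta - \psi(\theta),
  \qquad \psi(\theta) = \sum_{a=1}^{n} \phi\Big(\sum_{i=1}^{d} x_{ai}\theta^i\Big), \\
\psi(\theta) &\approx \psi(0) + \phi'(0)\,\mathbf{1}^\top X\theta
  + \frac{c}{2}\,\theta^\top X^\top X\theta,
  \qquad c = \phi''(0), \\
\ell(\theta) &\approx y^\top X\theta - \frac{c}{2}\,\theta^\top X^\top X\theta - \psi(0)
  \qquad (\mathbf{1}^\top X = 0 \text{ by centering}) \\
&= -\frac{c}{2}\,\bigl\| X\tilde\theta - X\theta \bigr\|_2^2 + \mathrm{const},
  \qquad \text{where } c\,X^\top X\tilde\theta = X^\top y .
\end{align*}
```

Under these assumptions, maximizing the approximate log-likelihood with an $L_1$ penalty is a LASSO problem for the normal linear regression with response $X\tilde\theta$, and $\tilde\theta$ is the maximizer of the approximate log-likelihood, which is unique because the columns of $X$ are linearly independent.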
3.5 Remarks on other information-geometrical methods
We briefly compare TLARS with two existing methods which are extensions of LARS based on information geometry. One is Bisector Regression (BR) by [9] and the other is Differential-Geometric LARS (DGLARS) by [4]. Our concern here is the algorithms themselves.
First, the BR algorithm is very different from TLARS. BR takes advantage of the dually flat structure of the GLM and tries to make an equiangular curve using the KL divergence. Furthermore, the BR estimator moves from the MLE of the full model to the origin while, in our method, the residual moves from the MLE $\hat\theta_{\mathrm{MLE}}$ to the origin.
DGLARS is also different from TLARS. It uses tangent spaces, where the equiangular vector is considered. However, the DGLARS estimator actually moves from the origin to the MLE in $M$. Accordingly, the tangent space at the current estimator moves, which makes us treat the tangent spaces at many points in $M$. DGLARS treats the model manifold directly. Therefore, it requires many iterations of approximation computation for the algorithm. Note that, on the other hand, the update of the TLARS estimator is described fully in terms of only the tangent space $T_0M$.
4 Numerical examples
We show results of numerical examples and compare our methods with a related method. In detail, we compare four methods in the logistic regression setting: LARS in Tangent Space (TLARS), LASSO in Tangent Space (TLASSO1 and TLASSO2), and the $L_1$-regularized maximum likelihood estimation for the GLM (L1).
Our methods do not require an extra implementation since the LARS algorithm has already been implemented in the lars package of R. Using R, we only needed glm() for calculating the MLE and the lars package for the proposed methods. For the computation of the $L_1$-regularization, we used the glmnet package [7].
4.1 Real data
We applied the proposed methods and the L1 method to real data. The data is the South Africa heart disease (SAheart) data included in the ElemStatLearn package of R. This data contains nine explanatory variables of 462 samples. The response is a binary variable.
We show the results of the four methods. Figures 3(a) and 3(b) are the paths by TLARS and TLASSO1, respectively. In this example, they are the same. Figure 3(c) is the TLASSO2 path, and Figure 3(d) is the L1 path. The paths by TLARS, TLASSO1, and TLASSO2 are made by the lars() function of R, and that of L1 by glmnet().
As Figure 3 shows, the four paths are very similar. The proposed methods are based only on the tangent space, not on the model manifold itself, while L1 directly takes advantage of the likelihood. These results imply that the approximation of the model does not necessarily deteriorate the results of our methods, especially those of TLARS and TLASSO1.
4.2 Numerical experiments
We performed numerical experiments of logistic regression. The topic is three-fold: generalization, parameter estimation, and model selection. The result is shown in Table 1. Bold values are the best and better values.
The procedure of the experiments is as follows. We fixed the number of the parameter , the true value of the parameter , and the sample size . For each of trials, we made the design matrix by rnorm() function in R. Furthermore, we made the response based on and , that is, elements of have different Bernoulli distributions. The four methods were applied to .
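One trial of the data-generating procedure described above could be sketched in Python as follows. The paper generated the design matrix with rnorm() in R; this sketch uses Python's standard-library generator instead. The sample size, the true parameter, and the seed below are hypothetical placeholders, not the settings of the paper, and normalization of the design matrix is omitted.

```python
import math
import random

def simulate(n, theta_true, seed=0):
    """Draw a Gaussian design matrix and Bernoulli responses for one trial."""
    rng = random.Random(seed)
    d = len(theta_true)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = []
    for row in X:
        s = sum(x * t for x, t in zip(row, theta_true))
        p = 1.0 / (1.0 + math.exp(-s))   # success probability of this response
        y.append(1 if rng.random() < p else 0)
    return X, y

# Hypothetical settings chosen only for illustration.
X, y = simulate(n=50, theta_true=[1.0, -1.0, 0.0], seed=1)
```

Each response has its own Bernoulli distribution because the success probability depends on the corresponding row of the design matrix.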
For selecting one model and one estimate from a sequence of parameter estimates, we used AIC and BIC:
$$ \mathrm{AIC} = -2 \log p(y \mid \hat\theta) + 2k, \tag{6} $$
$$ \mathrm{BIC} = -2 \log p(y \mid \hat\theta) + k \log n, \tag{7} $$
where $k$ is the dimension of the parameter of the model under consideration. For a sequence made by each of the four methods, let and the MLE of the model . We call (6) with AIC1, and (6) with AIC2. Similarly, (7) with is BIC1, and (7) with is BIC2.
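The selection step can be illustrated with a short Python sketch that uses the standard definitions of AIC and BIC. The candidate log-likelihood values and the sample size below are hypothetical numbers chosen only to show that the two criteria may select different models; they are not results from the paper.

```python
import math

def aic(loglik, k):
    """Akaike information criterion for a model with k parameters."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return -2.0 * loglik + k * math.log(n)

def select(candidates, n, criterion):
    """Return the index of the candidate (loglik, k) minimizing the criterion."""
    scores = [aic(l, k) if criterion == "aic" else bic(l, k, n)
              for l, k in candidates]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical (log-likelihood, number of parameters) pairs along a path.
candidates = [(-66.0, 1), (-64.0, 2), (-63.5, 3)]
n = 100
best_aic = select(candidates, n, "aic")
best_bic = select(candidates, n, "bic")
```

With these numbers AIC prefers the two-parameter candidate while BIC, whose penalty grows with the sample size, prefers the one-parameter candidate.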
For evaluating the generalization error of the four methods, we newly made observations in -th trial (). We computed the difference between and predictions by each of the methods. The “Generalization” columns of Table 1 show the average prediction error over trials. Smaller value is better.
The “Model selection” columns show the fraction of the trials (among trials) where the methods selected the true model. The “Seq” column indicates the fraction of the trials where each sequence of estimates included the true model. Larger value is better.
In the “Parameter estimation” columns, each value means the average of of the selected estimate . Smaller value is better.
In Table 1, we report the results of three cases. We used for all cases but case C2 where . In case A, we set and . We used for case A1 and for A2. In generalization, three methods (TLARS, TLASSO1, and L1) with AIC2 were much better than the other combinations of method and information criterion. In model selection, the four methods with BIC1 were much better regardless of the sample size. In parameter estimation, TLARS and TLASSO1 with AIC1 and BIC2 were better in the small sample setting. However, in the larger sample setting, the four methods with AIC2 were better. These tendencies were observed in other cases not reported here; For example, .
Case B is the case of and with the relation , where and are the second and third columns of the design matrix , respectively, and is distributed according to a multivariate normal distribution. We set and for cases B1 and B2, respectively. In generalization, TLARS and TLASSO1 with AIC1, BIC1, and BIC2 were better than the others in Case B1. Three methods (TLARS, TLASSO1, and L1) with AIC1 and BIC2 were better for the larger sample setting. In Case B, our interest is mainly in generalization because estimation of the true model and the parameter value are not very meaningful. However, the four methods with BIC1 were better in model selection.
In case C, we used and, as , the vector of the length 50 with ten s, ten s, and thirty s. In generalization and parameter estimation, three methods (TLARS, TLASSO1, and L1) with AIC2 were better than the others regardless of the sample size. In model selection, the four methods with BIC1 were much better than the others.
In summary, the proposed methods worked very well. Of course, the L1 method sometimes performs better than our methods. However, the proposed methods, especially TLARS and TLASSO1, are better than L1 in many situations. Furthermore, TLARS and TLASSO1 output the same results in very many trials.
Table 1: Results of the numerical experiments of logistic regression (columns: Method, Generalization, Model selection, Parameter estimation).
5 Conclusion
We proposed sparse estimation methods as extensions of LARS for the GLM. The methods take advantage of the tangent space at the origin, which is a rough approximation of the model manifold. The proposed methods are computationally efficient because the problem is approximated by the normal linear regression. The numerical experiments showed that our idea worked well in comparison with the $L_1$-regularization for the GLM. One direction of future work is to evaluate TLARS theoretically. Furthermore, we will apply tools developed for LARS and LASSO, for example, screening and post-selection inference, to TLARS and TLASSO.
Appendix A Lemmas and remarks
A.1 Metric at tangent space and correlation between explanatory variables
We show that the Fisher metric at the tangent space is proportional to the correlation matrix of the explanatory variables (Lemma 2). To avoid confusion, in this subsection, we use for the metric in and for the metric in the tangent space at .
It holds that
Since it is known that ,
for some .
It is known that the metric is derived from the potential function : . Therefore, it holds
where is the derivative of . Letting and , we have . Since both and are known to be positive definite, is a positive constant. ∎
Note that is common to all and in the proof. This is why the tangent space at the origin is selected as the space where LARS runs.
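The proportionality between the Fisher metric at the origin and the correlation matrix can be checked numerically for logistic regression, where the Fisher information has the well-known form $X^\top W X$ with diagonal weights $\mu_a(1 - \mu_a)$. The Python sketch below is an illustration under that assumption (canonical link, no intercept); the function name and the toy design matrix are hypothetical.

```python
import math

def fisher_information(X, theta):
    """Fisher information X^T W X of logistic regression, W = diag(mu_a (1 - mu_a))."""
    n, d = len(X), len(X[0])
    w = []
    for row in X:
        mu = 1.0 / (1.0 + math.exp(-sum(x * t for x, t in zip(row, theta))))
        w.append(mu * (1.0 - mu))
    return [[sum(w[a] * X[a][i] * X[a][j] for a in range(n)) for j in range(d)]
            for i in range(d)]

# Toy design matrix (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
XtX = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]

G_origin = fisher_information(X, [0.0, 0.0])   # equals 0.25 * XtX
G_away = fisher_information(X, [1.0, 0.0])     # weights differ across observations
```

At the origin every weight equals 1/4, so a single constant multiplies all entries of $X^\top X$. Away from the origin the weights depend on the observation, and the entrywise ratio to $X^\top X$ is no longer constant, which is consistent with choosing the tangent space at the origin.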
A.2 Approximations of the likelihood and MLE
We approximate the log-likelihood and the MLE of the GLM. Lemma 3 implies that is an approximation of the MLE
The log-likelihood is expanded as
A.3 e-exponential map
In Riemannian geometry, a point in a tangent space is mapped to a manifold via an exponential map. An exponential map is defined using a geodesic. A geodesic in a manifold corresponds to a straight line in Euclidean space. When we consider an exponential map, we need to introduce not only metric but also a connection. A connection determines flatness and straightness in a manifold. In Section 3, we implicitly introduced the e-connection. From the viewpoint of the e-connection, each curve of -axis is an e-geodesic in .
For a manifold and a point , an exponential map at is formally defined as follows. First, we consider the geodesic for which satisfies and . Here the parameter moves in an interval including . Note that, given a connection, the geodesic locally exists and is uniquely determined. The exponential map is and for , where .
In general, an exponential map is not necessarily easy to treat. For example, the domain of an exponential map is called a star-shaped domain and does not coincide with a whole tangent space. However, our exponential map has a useful property. The domain of is a whole and the range is a whole .
The map defined in subsection 3.1 is the e-exponential map for a manifold of the GLM. Furthermore, is a bijection from the tangent space to the manifold .
For , the value of the map is , where . It is known that the e-geodesic satisfying and is represented as . Therefore, , which means that is the e-exponential map.
Since , the e-exponential map is defined on a whole . For , is in and , which imply that the e-exponential map is a surjection. Furthermore, if are different, because the column vectors of are assumed to be linearly independent. ∎
Acknowledgments: This work was partly supported by JSPS KAKENHI Grant Number JP18K18008 and JST CREST Grant Number JPMJCR1763.
References
[1] Amari, S. and Nagaoka, H. (2000) Methods of information geometry. Translations of Mathematical Monographs, Vol. 191, Oxford University Press.
[2] Amari, S. (1985) Differential-geometrical methods in statistics. Lecture Notes in Statistics, Vol. 28, Springer.
[3] Amari, S. (2016) Information geometry and its applications. Springer.
[4] Augugliaro, L., Mineo, A. M., and Wit, E. C. (2013) DgLARS: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society, Series B 75, pp. 471–498.
[5] Ay, N., Jost, J., Lê, H. V., and Schwachhöfer, L. (2017) Information geometry. Springer.
[6] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004) Least angle regression. Annals of Statistics 32, pp. 407–499.
[7] Friedman, J., Hastie, T., and Tibshirani, R. (2008) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, pp. 1–22.
[8] Hastie, T., Tibshirani, R., and Friedman, J. (2009) The elements of statistical learning (2nd edition). Springer.
[9] Hirose, Y. and Komaki, F. (2010) An extension of least angle regression based on the information geometry of dually flat spaces. Journal of Computational and Graphical Statistics 19, pp. 1007–1023.
[10] Hirose, Y. and Komaki, F. (2013) Edge selection based on the geometry of dually flat spaces for Gaussian graphical models. Statistics and Computing 23, pp. 793–800.
[11] Hirose, Y. and Komaki, F. (2015) An estimation procedure for contingency table models based on the nested geometry. Journal of the Japan Statistical Society 45, pp. 57–75.
[12] McCullagh, P. and Nelder, J. A. (1989) Generalized linear models. Monographs on Statistics and Applied Probability, Vol. 37, Chapman & Hall/CRC.
[13] Park, M. Y. and Hastie, T. (2007) $L_1$-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69, pp. 659–677.
[14] Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, pp. 267–288.
[15] Yuan, M. and Lin, Y. (2007) Model selection and estimation in the Gaussian graphical model. Biometrika 94, pp. 19–35.