
# Cholesky Factor Interpolation for Efficient Approximate Cross-Validation

Da Kuang
University of California, Los Angeles
Los Angeles, CA 90095
da.kuang@cc.gatech.edu

Alex Gittens
International Computer Science Institute
Berkeley, CA 94704
gittens@icsi.berkeley.edu

Raffay Hamid
DigitalGlobe Inc.
Seattle, WA 98103
raffay@cc.gatech.edu
###### Abstract

The dominant cost in solving least-squares problems using Newton’s method is often that of factorizing the Hessian matrix over multiple values of the regularization parameter (λ). We propose an efficient way to interpolate the Cholesky factors of the Hessian matrix computed over a small set of λ values. This approximation enables us to optimally minimize the hold-out error while incurring only a fraction of the cost compared to exact cross-validation. We provide a formal error bound for our approximation scheme and present solutions to a set of key implementation challenges that allow our approach to maximally exploit the compute power of modern architectures. We present a thorough empirical analysis over multiple datasets to show the effectiveness of our approach.


## 1 Introduction

Least-squares regression remains competitive with more advanced learning algorithms. This is mainly because:

- Its closed-form solution can be found efficiently by maximally exploiting modern hardware using high-performance BLAS-3 software [9].

- Advances in kernel methods [28] [21] [17] can efficiently construct non-linear feature spaces in which the closed-form solution of linear regression can be readily used.

- The availability of error-correcting codes [5] allows robust simultaneous learning of multiple classifiers in a single pass over the data.

However, an important bottleneck in solving large least-squares problems is the cost of k-fold cross-validation [1]. To put this cost in perspective, we show in Figure 1 the costs of performing the main steps in solving least-squares using Newton’s method.

Specifically, for D-dimensional data, each fold requires finding the optimal value of the regularization parameter λ, searched over Λ candidate values. This requires solving a linear system with D variables, represented by the Hessian matrix, Λ times for each of the k folds. As solving this system using the Cholesky factorization of the Hessian costs O(D³) operations, the total cost of the k folds adds up to O(kΛD³) operations. Other considerable costs include computing the Hessian, which requires O(ND²) operations for N training examples. Comparing these costs, we see that when kΛD > N, cross-validation is the dominant cost. Figure 2 presents an empirical sense of the cross-validation and Hessian costs as a function of N and D.
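As a rough back-of-the-envelope check, the two costs above can be compared in a few lines of Python (the constants and variable names here are illustrative assumptions, not from the paper):

```python
def cv_flops(N, D, k, num_lambdas):
    """Rough flop counts for k-fold cross-validated ridge regression.

    Assumes ~N*D^2 flops to form the Hessian X^T X and ~D^3/3 flops per
    Cholesky factorization; the constants are approximate.
    """
    hessian = N * D ** 2                      # one pass over the data
    cholesky = k * num_lambdas * D ** 3 / 3   # one factorization per fold per lambda
    return hessian, cholesky

h, c = cv_flops(N=100_000, D=2_000, k=10, num_lambdas=50)
# Cross-validation dominates whenever k * num_lambdas * D exceeds 3 * N.
print(c > h)
```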

Our goal is to reduce the computational cost of cross-validation without increasing the hold-out error. To this end, we propose to densely interpolate the Cholesky factors of the Hessian matrix using a sparse set of λ values. Our key insight is that the Cholesky factors for different λ values lie on smooth curves, and can hence be approximated using polynomial functions (see Figure 3 for an illustration). We provide empirical evidence supporting this intuition and provide an error bound for this approximation.

We formalize a least-squares framework to simultaneously learn the multiple polynomial functions required to densely interpolate the Hessian factors. An important challenge in solving this problem efficiently is a matrix-vector conversion strategy with minimal unaligned memory access (see Section 5 for details). To address this challenge, we propose a general-purpose strategy to efficiently convert blocks of Cholesky factors to their corresponding vectors. This enables our learning framework to use BLAS-3 [9] level computations, therefore maximally exploiting the compute power of modern hardware. Our results demonstrate that the proposed approximate regularization approach offers a significant computational speed-up while providing hold-out error that is comparable to exact regularization.

## 2 Related Work

Solving linear systems has been well-explored in terms of both type [9] [27] and scale [18] [32]. Our focus is on the least-squares problem as the computational basis of linear regression. Popular methods to solve least-squares include QR [9], Cholesky [15], and Singular Value Decomposition (SVD) [8]. For large dense problems, Cholesky factorization has emerged as the method of choice: it is used to solve the normal equations, i.e., the linear system represented by the Hessian matrix of the least-squares problem, and it enjoys storage and efficiency advantages over QR and SVD, respectively [9].

For real-world problems, it is common for system parameters to undergo change over time. Previous works in this context have mostly focused on low-rank updates of system parameters, including direct matrix inverse [12], LU decomposition [16], Cholesky factorization [19], and Singular Value Decomposition [3] [11]. Our problem, however, differs in two important ways: (i) we focus on linear systems undergoing full-rank updates, and (ii) our updates are limited to the diagonal of the Hessian. Both of these attributes are applicable to the regularization of least-squares problems, as we shall see in Section 3.

A standard way to solve regularized least-squares is to find the SVD of the input matrix once for each training fold, and then reuse the singular vectors for different λ values. For large problems, however, finding the SVD of a design matrix even once can be prohibitively expensive. In such situations, truncated [14] or randomized approximate [13] SVD can be used; however, their effectiveness for optimizing hold-out error in least-squares problems is still unexplored. There has also been work on reducing the number of regularization folds [4] by following the regularization path [6]. Our work can be used in conjunction with these approaches to further improve the overall performance.

## 3 piCholesky Framework

### 3.1 Preliminaries

Let X denote the N×D design matrix, with each row one of the N training examples in a D-dimensional space. Let y be the N-dimensional vector of training labels. Then the Tikhonov-regularized [31] least-squares cost function is given as:

$$J(\theta)=\frac{1}{2}(y-X\theta)^\top(y-X\theta)+\frac{\lambda}{2}\theta^\top\theta \qquad (1)$$

Setting the derivative of J(θ) with respect to θ equal to zero results in the following solution for θ:

$$\theta=(H+\lambda I)^{-1}g \qquad (2)$$

where I is the D×D identity matrix, the Hessian matrix is H = XᵀX, and the gradient vector is g = Xᵀy. Equation 2 is solved for different values of λ in a k-fold cross-validation setting, and the λ with the minimum expected hold-out error is picked.

### 3.2 Least Squares using Cholesky Factors

Rewriting Equation 2 as Aθ = g, where A = H + λI, we can find the Cholesky factorization A = LLᵀ, where L is a lower-triangular matrix. Equation 2 can now be written as LLᵀθ = g. Writing z for Lᵀθ, this can be solved by a forward-substitution on the triangular system Lz = g for z, followed by a back-substitution on the triangular system Lᵀθ = z for θ.
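In Python, this two-triangular-solve scheme looks as follows (a minimal sketch using NumPy/SciPy; the variable names mirror the notation above):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def ridge_via_cholesky(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y via a Cholesky factorization."""
    H = X.T @ X                       # Hessian
    g = X.T @ y                       # gradient vector
    A = H + lam * np.eye(X.shape[1])
    L = cholesky(A, lower=True)       # A = L L^T
    z = solve_triangular(L, g, lower=True)        # forward substitution: L z = g
    return solve_triangular(L.T, z, lower=False)  # back substitution: L^T theta = z

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
theta = ridge_via_cholesky(X, y, lam=0.1)
```

For a single λ this matches a direct solve of the normal equations; the payoff of the factorized form is that the O(D³) factorization is separated from the cheap O(D²) triangular solves.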

### 3.3 Cholesky Factors Interpolation

To minimize the computational cost of cross-validation, we propose to compute the Cholesky factors of A over a small set of λ values, and then interpolate the corresponding entries of L for a more densely sampled set of λ values. Note that corresponding entries of L for different values of λ lie on smooth curves that can be accurately approximated using multiple polynomial functions. Figure 4 illustrates this point empirically, where the red curves show several corresponding entries of L computed over different λ values using exact Cholesky factorization on the MNIST data.

Recall that for D-dimensional data, there are D(D+1)/2 entries in the lower-triangular part of L. Therefore, to interpolate a Cholesky factor based on those computed for a few different values of λ, we need to learn one polynomial function per entry in the lower-triangular part of L.

This challenge can be posed as a least-squares problem. Suppose each polynomial function to be learned is of order n. We then need at least n+1 exact Cholesky factors, each computed at one of the sparsely sampled values of λ. We evaluate a polynomial basis of the space of n-th order polynomials at the sparsely sampled values of λ (e.g., 1, λ, and λ² for second-order polynomials); this way, we form our observation matrix V. Our targets are rows of vectorized lower-triangular Cholesky entries, where each row corresponds to the exact Cholesky factor computed at one of the sampled values of λ. This forms our target matrix T.

In Algorithm 1, we use monomials as the polynomial basis to fit the n-th order polynomials. While we could employ other polynomial bases that are numerically more stable (such as Chebyshev polynomials), we found in our experiments that the observation matrix V is well-conditioned, and therefore using monomials does not harm numerical stability.

We can now define our cost function as:

$$J_\lambda(\Theta)=\frac{1}{2}(T-V\Theta)^\top(T-V\Theta) \qquad (3)$$

where Θ is the polynomial-coefficient matrix; each of its columns holds the coefficients of one of the polynomial functions. Following a procedure similar to Section 3.1, the expression for Θ can be written as

$$\Theta=H_\lambda^{-1}G_\lambda \qquad (4)$$

Here H_λ = VᵀV and G_λ = VᵀT.

Given a new regularization parameter value λ, the corresponding factor L(λ) can be computed by evaluating the fitted polynomial functions at λ. This procedure is listed in Algorithm 1. The interpolation results using Algorithm 1 for the Cholesky factors computed on the MNIST data [22] are shown with blue curves in Figure 4. As can be seen from the figure, the blue (interpolated) plots trace the red (exact) plots closely.
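The fit-then-evaluate procedure can be sketched compactly as follows. This is our own illustration of Algorithm 1's structure (the helper names are hypothetical, and a quadratic monomial basis with three sampled λ values is assumed):

```python
import numpy as np
from scipy.linalg import cholesky

def fit_picholesky(H, sample_lams, order=2):
    """Fit one polynomial per lower-triangular Cholesky entry (Algorithm 1 sketch)."""
    D = H.shape[0]
    idx = np.tril_indices(D)
    # Observation matrix V: one row [1, lam, lam^2, ...] per sampled lambda.
    V = np.vander(np.asarray(sample_lams, dtype=float), N=order + 1, increasing=True)
    # Target matrix T: one row of exact Cholesky entries per sampled lambda.
    T = np.stack([cholesky(H + lam * np.eye(D), lower=True)[idx]
                  for lam in sample_lams])
    Theta, *_ = np.linalg.lstsq(V, T, rcond=None)   # Theta = V^+ T
    return Theta, idx

def eval_picholesky(Theta, idx, D, lam):
    """Evaluate the interpolated Cholesky factor at a new lambda in O(D^2)."""
    row = np.vander([float(lam)], N=Theta.shape[0], increasing=True) @ Theta
    L = np.zeros((D, D))
    L[idx] = row[0]
    return L

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
H = X.T @ X
Theta, idx = fit_picholesky(H, sample_lams=[0.5, 1.0, 1.5])
L_approx = eval_picholesky(Theta, idx, 8, lam=1.2)
L_exact = cholesky(H + 1.2 * np.eye(8), lower=True)
```

With exactly order+1 sampled values, the least-squares fit reduces to exact interpolation; with more samples it becomes the regression fit of Equation 4.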

Computational Complexity: The dominant step of Algorithm 1 is computing the exact Cholesky factors at the g sampled values of λ, which requires O(gD³) operations. The only other noteworthy steps of Algorithm 1 are forming H_λ and G_λ, which together take O(gD²) operations. Since g ≪ D, the overall asymptotic cost of Algorithm 1 is O(gD³). Furthermore, it takes only O(D²) operations to evaluate the interpolated Cholesky factor for each new λ value.

## 4 Theoretical Analysis

There are two challenges in developing a reasonable bound on the error of the piCholesky algorithm. The first is determining the extent to which the Cholesky factorization can be approximated entrywise as a polynomial: if one explicitly forms the symbolic Cholesky factorization of even a small matrix, it is not at all clear that the entries of the resulting Cholesky factor can be approximated well by any polynomial. The second is determining the extent to which the particular polynomial recovered by the piCholesky procedure of solving a least-squares system (Algorithm 1) is a good polynomial approximation to the Cholesky factorization.

The classical tool for addressing the first challenge is the Bramble-Hilbert lemma [2, Lemma 4.3.8], which guarantees the existence of a polynomial approximation to a smooth function on a compact domain, with Sobolev norm approximation error bounded by the Sobolev norm of the derivatives of the function itself. Our proof of the existence of a polynomial approximation to the Cholesky factorization is very much in the spirit of the Bramble-Hilbert lemma, but provides sharper results than the direct application of the lemma itself (essentially because we do not care about the error in approximating the derivatives of the Cholesky factorization). We surmount the second obstacle by noting that, if instead of sampling the Cholesky factorization itself when following the piCholesky procedure, we sample from a polynomial approximation to the Cholesky factorization, then the error in the resulting polynomials can be bounded using results on the stability of the solution to perturbed linear systems.

In our arguments, we find it convenient to use the Fréchet derivative: given a mapping f between two normed linear spaces X and Y, we define Df, the derivative of f, to be the function that maps u to D_u f, the unique linear map (assuming it exists) that is tangent to f at u in the sense that

$$\lim_{\|\delta\|_X\to 0}\frac{\|f(u+\delta)-f(u)-D_uf(\delta)\|_Y}{\|\delta\|_X}=0.$$

The Fréchet derivative generalizes the conventional derivative, so it follows a chain rule, can be used to form Taylor Series expansions, and shares the other properties of the conventional derivative [24].

We inductively define D^r f, the r-th derivative of f, as the function that maps u to D^r_u f, the unique multilinear map tangent to D^{r-1} f at u in the sense that

$$\lim_{\|\delta_1\|_X\to 0}\frac{\|D^{r-1}_{u+\delta_1}f(\delta_2)-D^{r-1}_{u}f(\delta_2)-D^{r}_{u}f(\delta_1,\delta_2)\|_Y}{\|\delta_1\|_X}=0.$$

For a comprehensive introduction to Fréchet derivatives and their properties, we refer the reader to [24].

### 4.1 Performance Guarantee for piCholesky

Let p_TS(λ; λ_c) denote the second-order polynomial obtained from the Taylor series expansion of C(A+λI) around λ_c, and let p_π(λ) denote the approximation to C(A+λI) obtained using the piCholesky procedure. Our argument consists in bounding the errors in approximating C(A+λI) with p_TS (Theorem 4.4) and in approximating p_TS with p_π (Theorem 4.6). The root-mean-squared error in approximating C(A+λI) with the piCholesky procedure is then controlled using the triangle inequality (Theorem 4.7): if |λ − λ_c| ≤ γ and w is the maximum distance of any of the sample points used in Algorithm 1 from λ_c, then

$$\frac{1}{\sqrt{D}}\|C(A+\lambda I)-p_\pi(\lambda)\|_F\le\Big[\gamma^3+\sqrt{g}\,w^3(1+\gamma^2)(\lambda_c+1)\|V^\dagger\|_2\Big]\frac{R[\lambda_c-\gamma,\lambda_c+\gamma]}{\sqrt{D}} \qquad (5)$$

Here, g is the number of sample points (values of λ) used in the piCholesky procedure, D is the number of elements in C(A+λI), and R[·,·] is defined in Theorem 4.4. The quantity R[λ_c−γ, λ_c+γ] measures the magnitude of the third derivative of C(A+λI) over the interval; when it is small, the implicit assumption made by the piCholesky procedure that C(A+λI) is well-approximated by some quadratic polynomial is reasonable. Unfortunately, the task of relating R to more standard quantities such as the eigenvalues of A is beyond the reach of our analysis.

The matrix V (defined in Algorithm 1) is a submatrix of the Vandermonde matrix formed by the sample points, so the quantity ∥V†∥₂ measures the conditioning of the sample points used to fit the piCholesky polynomial approximant: it is small when the rows of V are far from linearly dependent. This is exactly the setting in which we expect the least-squares fit in Algorithm 1 to be most stable.

The cubic dependence on γ in our bound reflects our intuition that, since the Cholesky factorization is nonlinear, we do not expect the quadratic approximation formed using the piCholesky procedure to perform well far away from the interpolation points used to form the approximation. The cubic dependence on w also captures our intuition that we can only expect piCholesky to give a good approximation when the interpolation points used to fit the interpolant cover a small interval containing the optimal regularization parameter.

To algorithmically address the fact that this optimal λ is unknown a priori, we introduce a Multi-level Cholesky procedure in Section 6.2 that applies a binary-search-like procedure to narrow the search range before applying piCholesky.

### 4.2 Proof of the performance guarantee for piCholesky

Our first step in developing the Taylor series of the Cholesky factorization map C is establishing that C is indeed Fréchet differentiable, and finding an expression for the derivative.

###### Theorem 4.1.

Assume A is a positive-definite matrix, let L = C(A) denote its Cholesky factor, and let S denote the map S(L) = LLᵀ on lower-triangular matrices. Then D_L S exists for any lower-triangular matrix L and is invertible whenever L is full-rank. Furthermore, if Δ is a symmetric matrix, then

$$D_AC(\Delta)=(D_LS)^{-1}(\Delta).$$

The key to establishing this result is the observation that C is the inverse of the mapping S : L ↦ LLᵀ, which maps the set of lower-triangular matrices into the set of symmetric matrices. Differentiability of C is then a consequence of the following corollary of the Inverse Function Theorem (Theorem 2.5.2 of [24]).

###### Theorem 4.2.

Let f be continuously differentiable. If Df_u is invertible, then

$$D(f^{-1})_{f(u)}=(Df_u)^{-1}.$$
###### Proof of Theorem 4.1.

By Theorem 4.2 and the fact that C = S⁻¹, it suffices to establish that S is continuously differentiable with D_L S(Γ) = ΓLᵀ + LΓᵀ, and that D_L S is invertible.

Recall that D_L S (assuming it exists) is the unique linear map tangent to S at L, and observe that

$$S(L+\Gamma)=LL^\top+\Gamma L^\top+L\Gamma^\top+\Gamma\Gamma^\top=S(L)+\Gamma L^\top+L\Gamma^\top+O(\|\Gamma\|^2).$$

It follows that D_L S exists and is as stated. Clearly D_L S is also continuous as a function of L, so S is continuously differentiable.

To show that D_L S is invertible, assume that Γ is such that ΓLᵀ + LΓᵀ = 0. Because A is positive-definite, L is invertible, and we can conclude that L⁻¹Γ = −ΓᵀL⁻ᵀ. The left-hand side is a lower-triangular matrix since L⁻¹ and Γ are lower-triangular; for similar reasons, the right-hand side is upper-triangular. It follows that L⁻¹Γ is a diagonal matrix, and hence Γ = LD for some diagonal matrix D. Together with the assumption that ΓLᵀ + LΓᵀ = 0, this implies that 2LDLᵀ = 0, and consequently D = 0. Thus Γ = 0, so we have established that the nullspace of D_L S is {0}. It follows that D_L S is invertible.

The claims of Theorem 4.1 now follow. ∎
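The derivative characterized by Theorem 4.1 can be verified numerically. The sketch below uses the classical closed form Γ = L·Φ(L⁻¹ΔL⁻ᵀ), where Φ keeps the lower triangle and halves the diagonal; this is the standard identity for differentiating the Cholesky decomposition (our own illustration, not the paper's vec-based formulation), and it produces exactly the Γ solving ΓLᵀ + LΓᵀ = Δ:

```python
import numpy as np

def cholesky_directional_derivative(A, Delta):
    """Return the lower-triangular Gamma with Gamma L^T + L Gamma^T = Delta."""
    L = np.linalg.cholesky(A)
    # M = L^{-1} Delta L^{-T}; Delta is assumed symmetric.
    M = np.linalg.solve(L, np.linalg.solve(L, Delta).T).T
    Phi = np.tril(M) - 0.5 * np.diag(np.diag(M))  # lower triangle, halved diagonal
    return L @ Phi

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)        # positive-definite
Delta = rng.standard_normal((6, 6))
Delta = Delta + Delta.T            # symmetric perturbation

Gamma = cholesky_directional_derivative(A, Delta)
t = 1e-6                           # finite-difference check of D_A C(Delta)
fd = (np.linalg.cholesky(A + t * Delta) - np.linalg.cholesky(A)) / t
```

The finite-difference quotient `fd` agrees with `Gamma` to within the first-order truncation error, which is consistent with Γ being the unique lower-triangular solution guaranteed by the theorem.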

The higher-order derivatives of C are cumbersome, so instead of dealing directly with C, which maps matrices to matrices, we compute the higher-order derivatives of the equivalent function 𝒞 that maps vectors to vectors. The following theorem gives the first three derivatives of 𝒞.

###### Theorem 4.3.

Let 𝒫 and 𝒯 be the images under vec of, respectively, the set of positive-definite matrices and the space of lower-triangular matrices of the same order. Define 𝒞 : 𝒫 → 𝒯 by

$$\mathcal{C}=\mathrm{vec}\circ C\circ\mathrm{vec}^{-1}.$$

When vec⁻¹(v_A) is positive-definite, the first three derivatives of 𝒞 at v_A are given by

$$\begin{aligned}
D_{v_A}\mathcal{C}(\delta_1)&=M^{-1}\delta_1,\\
D^2_{v_A}\mathcal{C}(\delta_1,\delta_2)&=-M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket\cdot M^{-1}\delta_2,\\
D^3_{v_A}\mathcal{C}(\delta_1,\delta_2,\delta_3)&=M^{-1}\cdot\Big(\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket\cdot M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_2)\rrbracket\\
&\qquad+\llbracket\mathrm{vec}^{-1}\big(M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket\cdot M^{-1}\delta_2\big)\rrbracket\\
&\qquad+\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_2)\rrbracket\cdot M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket\Big)\cdot M^{-1}\delta_3,
\end{aligned}$$

where M = ⟦C(A)⟧, v_A = vec(A), and ⟦X⟧ denotes the operator I⊗X + X⊗I.

###### Proof.

First we convert the expression for D_A C given in Theorem 4.1 into an expression for D_{v_A}𝒞. To do so, we note that vec and vec⁻¹ are linear functions, so they are their own derivatives. It follows from the Chain Rule (Theorem 2.4.3 of [24]) that

$$D\mathcal{C}=\mathrm{vec}\circ DC\circ\mathrm{vec}^{-1},$$

so

$$D_{v_A}\mathcal{C}(v_\Delta)=\mathrm{vec}(D_AC(\Delta)).$$

By Theorem 4.1, D_A C(Δ) = Γ, where Γ is the solution to the equation

$$\Delta=\Gamma\,C(A)^\top+C(A)\,\Gamma^\top.$$

We can convert this to an equation for v_Γ using the fact (Section 10.2.2 of [26]) that

$$\mathrm{vec}(XYZ)=(Z^\top\otimes X)\,\mathrm{vec}(Y)$$

for arbitrary conformal matrices X, Y, and Z. Specifically, we find that v_Γ satisfies

$$v_\Delta=(C(A)\otimes I)\,v_\Gamma+(I\otimes C(A))\,v_{\Gamma^\top}.$$

Recall that Δ is symmetric, so we may take v_{Γᵀ} = v_Γ in this equation, and we have that

$$v_\Gamma=\llbracket C(A)\rrbracket^{-1}v_\Delta.$$

It follows that

$$D_{v_A}\mathcal{C}(v_\Delta)=\mathrm{vec}(D_AC(\Delta))=\mathrm{vec}(\Gamma)=v_\Gamma=\llbracket C(A)\rrbracket^{-1}v_\Delta=M^{-1}v_\Delta$$

as claimed.

To compute the second derivative of 𝒞, we use the identity (Section 2.2 of [26])

$$D_xA(x)^{-1}=-A(x)^{-1}\cdot D_xA(x)\cdot A(x)^{-1} \qquad (6)$$

that holds for any differentiable matrix-valued function A(x) of x. Using this identity with A(x) = M and x = v_A, we see that

$$\begin{aligned}
D^2_{v_A}\mathcal{C}(\delta_1,\delta_2)&=D_{v_A}\big(D_{v_A}\mathcal{C}(\delta_2)\big)(\delta_1)=D_{v_A}\big(M^{-1}\delta_2\big)(\delta_1)=D_{v_A}\big(M^{-1}\big)(\delta_1)\cdot\delta_2\\
&=-M^{-1}\cdot D_{v_A}M(\delta_1)\cdot M^{-1}\cdot\delta_2.
\end{aligned}$$

Using the Chain Rule and the linearity of ⟦·⟧ and vec⁻¹, we see that

$$D_{v_A}M(\delta_1)=D_{v_A}\llbracket C(A)\rrbracket(\delta_1)=D_{v_A}\llbracket\mathrm{vec}^{-1}(\mathcal{C}(v_A))\rrbracket(\delta_1)=\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket. \qquad (7)$$

Thus, as claimed,

$$D^2_{v_A}\mathcal{C}(\delta_1,\delta_2)=-M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket\cdot M^{-1}\delta_2.$$

To compute the third derivative of 𝒞, we note that the Product Rule (Theorem 2.4.4 of [24]) implies

$$D_x\big[A_1(x)\cdot A_2(x)\cdot A_1(x)\big]=D_xA_1(x)\cdot A_2(x)\cdot A_1(x)+A_1(x)\cdot D_xA_2(x)\cdot A_1(x)+A_1(x)\cdot A_2(x)\cdot D_xA_1(x)$$

for any matrix-valued differentiable functions A₁ and A₂. We apply this result with A₁(v_A) = M⁻¹ and A₂(v_A) = ⟦vec⁻¹(M⁻¹δ₂)⟧ to see that

$$\begin{aligned}
D^3_{v_A}\mathcal{C}(\delta_1,\delta_2,\delta_3)&=D_{v_A}\big(D^2_{v_A}\mathcal{C}(\delta_2,\delta_3)\big)(\delta_1)\\
&=-D_{v_A}M^{-1}(\delta_1)\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_2)\rrbracket\cdot M^{-1}\delta_3\\
&\quad-M^{-1}\cdot D_{v_A}\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_2)\rrbracket(\delta_1)\cdot M^{-1}\delta_3\\
&\quad-M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_2)\rrbracket\cdot D_{v_A}M^{-1}(\delta_1)\cdot\delta_3. \qquad (8)
\end{aligned}$$

From (6) and (7), we calculate

$$D_{v_A}M^{-1}(\delta_1)=-M^{-1}\cdot D_{v_A}M(\delta_1)\cdot M^{-1}=-M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket\cdot M^{-1},$$

and to calculate D_{v_A}⟦vec⁻¹(M⁻¹δ₂)⟧(δ₁) we use the fact that X ↦ ⟦vec⁻¹(X)⟧ is linear:

$$\begin{aligned}
D_{v_A}\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_2)\rrbracket(\delta_1)&=\llbracket\mathrm{vec}^{-1}\big((D_{v_A}M^{-1})(\delta_1)\cdot\delta_2\big)\rrbracket\\
&=\llbracket\mathrm{vec}^{-1}\big(-M^{-1}\cdot\llbracket\mathrm{vec}^{-1}(M^{-1}\delta_1)\rrbracket\cdot M^{-1}\cdot\delta_2\big)\rrbracket.
\end{aligned}$$

The claimed expression for D³_{v_A}𝒞 follows from using these latter two computations to expand (8). ∎

Now that we have the first three derivatives of 𝒞, we can develop the second-order Taylor series expansion of the Cholesky factorization and bound the error of the approximation.

###### Theorem 4.4.

Assume A is a positive-definite matrix, let v_I = vec(I), and let M_c = ⟦C(A+λ_cI)⟧ and E_c = ⟦vec⁻¹(M_c⁻¹v_I)⟧. The second-order Taylor series approximation to C(A+λI) at λ_c is

$$p_{TS}(\lambda;\lambda_c)=C(A+\lambda_cI)+\mathrm{vec}^{-1}\!\left((\lambda-\lambda_c)M_c^{-1}v_I-\frac{(\lambda-\lambda_c)^2}{2}M_c^{-1}E_cM_c^{-1}v_I\right).$$

Let D denote the number of entries of C(A+λI); then for any λ,

$$\frac{1}{\sqrt{D}}\|C(A+\lambda I)-p_{TS}(\lambda;\lambda_c)\|_F\le\frac{2|\lambda-\lambda_c|^3}{3\sqrt{D}}R[\lambda_c,\lambda]$$

where

$$R[a,b]:=\max_{s\in[\min(a,b),\max(a,b)]}\Big(\|M_s^{-1}E_s\|_2^2\|M_s^{-1}v_I\|_2+\|M_s^{-1}\|_2\|M_s^{-1}E_s\|_2\|M_s^{-1}v_I\|_2^2\Big).$$

To establish this result, we use the following version of Taylor’s Theorem (Theorem 2.4.15 of [24]), stated for the case where the first three derivatives of f exist and are continuous.

###### Theorem 4.5.

Let f : X → Y be a three-times continuously differentiable mapping. For all u and h,

$$f(u+h)=f(u)+D^1_uf(h)+\frac{1}{2}D^2_uf(h,h)+R(u,h)(h,h,h),$$

where

$$R(u,h)=\frac{1}{2}\int_0^1(1-t)^2\big(D^3_{u+th}f-D^3_uf\big)\,dt.$$
###### Proof of Theorem 4.4.

Let v_A = vec(A) and v_I = vec(I). For convenience, define M = M_{λ_c} = ⟦C(A+λ_cI)⟧ and E = E_{λ_c} = ⟦vec⁻¹(M⁻¹v_I)⟧. By Theorem 4.3, if we take h = (λ−λ_c)v_I, then

$$\begin{aligned}
D_{v_A+\lambda_cv_I}\mathcal{C}(h)&=(\lambda-\lambda_c)M^{-1}v_I,\\
D^2_{v_A+\lambda_cv_I}\mathcal{C}(h,h)&=-(\lambda-\lambda_c)^2M^{-1}EM^{-1}v_I,\\
D^3_{v_A+\lambda_cv_I}\mathcal{C}(h,h,h)&=(\lambda-\lambda_c)^3\big(2M^{-1}EM^{-1}EM^{-1}v_I+M^{-1}\llbracket\mathrm{vec}^{-1}(M^{-1}EM^{-1}v_I)\rrbracket M^{-1}v_I\big),
\end{aligned}$$

so applying Taylor’s Theorem to 𝒞 at v_A + λ_c v_I gives the expansion

$$\mathcal{C}(v_A+\lambda v_I)=\mathcal{C}(v_A+\lambda_cv_I)+(\lambda-\lambda_c)M^{-1}v_I-\frac{(\lambda-\lambda_c)^2}{2}M^{-1}EM^{-1}v_I+(\lambda-\lambda_c)^3R\big(v_A+\lambda_cv_I,(\lambda-\lambda_c)v_I\big)(v_I,v_I,v_I).$$

Since vec is an isometry, i.e., ∥vec(X)∥₂ = ∥X∥_F for any matrix X, we conclude that the Taylor expansion of the Cholesky factorization map around λ_c is given by

$$C(A+\lambda I)=C(A+\lambda_cI)+\mathrm{vec}^{-1}\!\left((\lambda-\lambda_c)M^{-1}v_I-\frac{(\lambda-\lambda_c)^2}{2}M^{-1}EM^{-1}v_I\right)+(\lambda-\lambda_c)^3\,\mathrm{vec}^{-1}\!\big(R(v_A+\lambda_cv_I,(\lambda-\lambda_c)v_I)(v_I,v_I,v_I)\big). \qquad (9)$$

The remainder term can be bounded as follows:

$$\begin{aligned}
\big\|R(v_A+\lambda_cv_I,(\lambda-\lambda_c)v_I)(v_I,v_I,v_I)\big\|_2
&=\left\|\int_0^1\frac{(1-t)^2}{2}\big(D^3_{v_A+[(1-t)\lambda_c+t\lambda]v_I}\mathcal{C}-D^3_{v_A+\lambda_cv_I}\mathcal{C}\big)(v_I,v_I,v_I)\,dt\right\|_2\\
&\le\frac{1}{6}\max_{t\in[0,1]}\big\|\big(D^3_{v_A+[(1-t)\lambda_c+t\lambda]v_I}\mathcal{C}-D^3_{v_A+\lambda_cv_I}\mathcal{C}\big)(v_I,v_I,v_I)\big\|_2\\
&\le\frac{1}{3}\max_{s\in[\lambda_c,\lambda]}\big\|D^3_{v_A+sv_I}\mathcal{C}(v_I,v_I,v_I)\big\|_2\\
&\le\frac{1}{3}\max_{s\in[\lambda_c,\lambda]}\Big(2\|M_s^{-1}E_sM_s^{-1}E_sM_s^{-1}v_I\|_2+\|M_s^{-1}\llbracket\mathrm{vec}^{-1}(M_s^{-1}E_sM_s^{-1}v_I)\rrbracket M_s^{-1}v_I\|_2\Big)\\
&\le\frac{1}{3}\max_{s\in[\lambda_c,\lambda]}\Big(2\|M_s^{-1}E_s\|_2^2\|M_s^{-1}v_I\|_2+\|M_s^{-1}\|_2\,\|\llbracket\mathrm{vec}^{-1}(M_s^{-1}E_sM_s^{-1}v_I)\rrbracket\|_2\,\|M_s^{-1}v_I\|_2\Big).
\end{aligned}$$

To further simplify this estimate, note that for any square matrix X, writing v_X = vec(X),

$$\begin{aligned}
\|\llbracket\mathrm{vec}^{-1}(v_X)\rrbracket\|_2&=\|\llbracket X\rrbracket\|_2=\|I\otimes X+X\otimes I\|_2\\
&\le\|I\otimes X\|_2+\|X\otimes I\|_2\le2\|I\|_2\|X\|_2\le2\|X\|_F=2\|v_X\|_2.
\end{aligned}$$

In particular,

$$\|\llbracket\mathrm{vec}^{-1}(M_s^{-1}E_sM_s^{-1}v_I)\rrbracket\|_2\le2\|M_s^{-1}E_sM_s^{-1}v_I\|_2\le2\|M_s^{-1}E_s\|_2\|M_s^{-1}v_I\|_2.$$

It follows that

$$\big\|R(v_A+\lambda_cv_I,(\lambda-\lambda_c)v_I)(v_I,v_I,v_I)\big\|_2\le\frac{2}{3}\max_{s\in[\lambda_c,\lambda]}\Big(\|M_s^{-1}E_s\|_2^2\|M_s^{-1}v_I\|_2+\|M_s^{-1}\|_2\|M_s^{-1}E_s\|_2\|M_s^{-1}v_I\|_2^2\Big),$$

where for convenience we write M_s and E_s for the quantities M and E computed with λ_c replaced by s. As a consequence of this bound and (9), we conclude that

$$\frac{1}{\sqrt{D}}\|C(A+\lambda I)-p_{TS}(\lambda;\lambda_c)\|_F\le\frac{2|\lambda-\lambda_c|^3}{3\sqrt{D}}\max_{s\in[\lambda_c,\lambda]}\Big(\|M_s^{-1}E_s\|_2^2\|M_s^{-1}v_I\|_2+\|M_s^{-1}\|_2\|M_s^{-1}E_s\|_2\|M_s^{-1}v_I\|_2^2\Big).$$
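As a quick sanity check on the operator-norm estimate above, the bound ∥I⊗X + X⊗I∥₂ ≤ 2∥X∥_F is easy to confirm numerically (our own illustration, not part of the proof):

```python
import numpy as np

def bracket(X):
    """The operator [[X]] = I (x) X + X (x) I acting on vectorized matrices."""
    I = np.eye(X.shape[0])
    return np.kron(I, X) + np.kron(X, I)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 5))
op_norm = np.linalg.norm(bracket(X), 2)   # spectral norm of [[X]]
spec_norm = np.linalg.norm(X, 2)
fro_norm = np.linalg.norm(X, 'fro')       # ||X||_F = ||vec(X)||_2
```

Both intermediate inequalities of the chain (∥⟦X⟧∥₂ ≤ 2∥X∥₂ ≤ 2∥X∥_F) can be checked the same way.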

Our next result quantifies the distance between p_TS(λ;λ_c) and p_π(λ), the polynomial approximation fit using the piCholesky interpolation procedure. The result is based on the observation that p_TS can be recovered using the same algorithm that returns the piCholesky polynomial if samples from p_TS are used for the interpolation instead of samples from C(A+λI). Thus the error p_TS − p_π can be interpreted as being caused by sampling error, and bounded using results on the stability of least-squares systems.

###### Theorem 4.6.

Let p_TS and λ_c be as in Theorem 4.4, and let p_π(λ) be the matrix whose entries are the second-order polynomial approximations to the entries of C(A+λI), with coefficients defined using Algorithm 1. Assume that the sampled regularization points used in Algorithm 1 all lie within distance w of λ_c. With V and g as in Algorithm 1 and R defined as in Theorem 4.4,

$$\frac{1}{\sqrt{D}}\|p_{TS}(\lambda;\lambda_c)-p_\pi(\lambda)\|_F\le\sqrt{\frac{g}{D}}\,w^3\big[1+(\lambda-\lambda_c)^2\big](\lambda_c+1)\|V^\dagger\|_2\,R[\lambda_c-w,\lambda_c+w]$$

for any λ.

###### Proof.

Let V and T denote the corresponding matrices in Algorithm 1, and let Θ = V†T denote the matrix of coefficients fit by the piCholesky algorithm. Let v_λ = (1, λ, λ²)ᵀ. By construction, the piCholesky approximation to the Cholesky factor of A + λI is given by p_π(λ) = vec⁻¹(Θᵀv_λ).

Analogously, let T̃ denote the matrix constructed by sampling p_TS at the sampled values of λ, and let Θ̃ denote the matrix of coefficients fit to T̃. Observe that, since the entries of p_TS are quadratic polynomials, the least-squares fit recovers their coefficients exactly: Θ̃ = V†T̃.

Simple calculations verify that Θ and Θ̃ can be expressed in the monomial basis centered at λ_c via the change-of-basis matrix M, where

$$M=\begin{bmatrix}1&-\lambda_c&\lambda_c^2\\0&1&-2\lambda_c\\0&0&1\end{bmatrix}.$$

Consequently,

 1D∥pTS(λ;λ