
# High-Dimensional Optimization in Adaptive Random Subspaces

## Abstract

We propose a new randomized optimization method for high-dimensional problems which can be seen as a generalization of coordinate descent to random subspaces. We show that an adaptive sampling strategy for the random subspace significantly outperforms the oblivious sampling method, which is the common choice in the recent literature. The adaptive subspace can be efficiently generated by a correlated random matrix ensemble whose statistics mimic the input data. We prove that the improvement in the relative error of the solution can be tightly characterized in terms of the spectrum of the data matrix, and provide probabilistic upper bounds. We then illustrate the consequences of our theory with data matrices of different spectral decay. Extensive experimental results show that the proposed approach offers significant speed-ups in machine learning problems including logistic regression, kernel classification with random convolution layers and shallow neural networks with rectified linear units. Our analysis is based on convex analysis and Fenchel duality, and establishes connections to sketching and randomized matrix decomposition.


## 1 Introduction

Random Fourier features, the Nyström method and sketching techniques have been successful in large-scale machine learning problems. The common practice is to employ oblivious sampling or sketching matrices, which are typically randomized and fixed ahead of time. However, it is not clear whether one can do better by adapting the sketching matrices to data. In this paper, we show that adaptive sketching matrices can significantly improve the approximation quality. We characterize the approximation error on the optimal solution in terms of the smoothness of the function, and spectral properties of the data matrix.

Many machine learning problems end up as high-dimensional optimization problems, which typically follow from forming the kernel matrix of a large dataset or mapping the data through a high-dimensional feature map, such as random Fourier features [1] or convolutional neural networks [2]. Such high-dimensional representations induce higher computational and memory complexities, and result in slower training of the models. Random projections are a classical way of performing dimensionality reduction, and are widely used in many algorithmic contexts [3]. Nevertheless, only recently have these methods captured great attention as an effective way of performing dimensionality reduction in convex optimization. In the context of solving a linear system and least-squares optimization, the authors of [34] propose a randomized iterative method with linear convergence rate which, at each iteration, performs a proximal update in which the next iterate is restricted to lie within an affine subspace determined by a dimension-reduction matrix. In the context of kernel ridge regression, the authors of [4] propose to approximate the $n$-dimensional kernel matrix by sketching its columns to a lower $m$-dimensional subspace, chosen uniformly at random. From the low-dimensional kernel ridge solution, they show how to reconstruct an approximation of the high-dimensional solution. Provided that the sketching dimension $m$ is large enough – as measured by the spectral properties of the kernel matrix – the estimate retains some statistical properties of the original solution, e.g., minimaxity. Similarly, in the broader context of classification through convex loss functions, the authors of [5, 6] propose to project the $d$-dimensional features of a given data matrix to a lower $m$-dimensional subspace, chosen independently of the data. After computing the optimal low-dimensional classifier, their algorithm returns an estimate of the optimal high-dimensional classifier.
Even though they provide formal guarantees on the estimation error, their results rely on several restrictive assumptions: the data matrix must be low-rank, or the classifier must lie in the span of the top few left singular vectors of the data matrix. Further, random subspace optimization has also been explored for large-scale trust region problems [7], also using a subspace chosen uniformly at random. Our proposed approach draws connections with the Gaussian Kaczmarz method proposed in [34] and the kernel sketching method in [4]. Differently, we are interested in smooth, convex optimization problems with ridge regularization. In contrast to [5, 6], we do not make any assumption on the optimal solution.

Our work relates to the considerable amount of literature on randomized approximations of high-dimensional kernel matrices. The typical approach consists of building a low-rank factorization of the kernel matrix using a random subset of its columns [8, 9, 10, 11]. The so-called Nyström method has proven effective empirically [12], and many research efforts have been devoted to improving and analyzing the performance of its many variants (e.g., uniform column sub-sampling, leverage-score based sampling), especially in the context of kernel ridge regression [13, 14]. In a related vein, sketching methods have been proposed to reduce the space complexity of storing high-dimensional data matrices [15, 4], by projecting their rows to a randomly chosen lower-dimensional subspace. Our theoretical findings build on known results for low-rank factorization of positive semi-definite (p.s.d.) matrices [16, 17, 18, 19], and show intimate connections with kernel matrix sketching [4]. Lastly, our problem setting also draws connections with compressed sensing [20], where the goal is to recover a high-dimensional structured signal from a small number of randomized, and usually oblivious, measurements.

### 1.1 Contributions

In this work, we propose a novel randomized subspace optimization method with strong solution approximation guarantees which outperform oblivious sampling methods. We derive probabilistic bounds on the error of approximation for general convex functions. We show that our method provides a significant improvement over the oblivious version, and theoretically quantify this observation as a function of the spectral properties of the data matrix. We also introduce an iterative version of our method, which converges to the optimal solution by iterative refinement.

### 1.2 An overview of our results

Let $f$ be a convex and $\mu$-strongly smooth function, i.e., $\|\nabla f(u) - \nabla f(v)\|_2 \leqslant \mu \|u - v\|_2$ for all $u, v$, and let $A \in \mathbb{R}^{n \times d}$ be a high-dimensional data matrix. We are interested in solving the primal problem

$$x^\star = \operatorname*{arg\,min}_{x \in \mathbb{R}^d} \, f(Ax) + \frac{\lambda}{2}\|x\|_2^2. \tag{1}$$

Given a random matrix $S \in \mathbb{R}^{d \times m}$ with $m \ll d$, we consider instead the sketched primal problem

$$\alpha^\star \in \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^m} \, f(AS\alpha) + \frac{\lambda}{2}\alpha^\top S^\top S \alpha, \tag{2}$$

where we effectively restrict the optimization domain to a lower $m$-dimensional subspace. In this work, we explore the following questions: How can we estimate the original solution $x^\star$ given the sketched solution $\alpha^\star$? Is a uniformly random subspace the optimal choice, e.g., Gaussian i.i.d.? Or, can we come up with an adaptive sampling distribution that is related to the matrix $A$ and yields stronger guarantees?

By Fenchel duality analysis, we exhibit a natural candidate for an approximate solution to $x^\star$, given by $\tilde{x} = S\alpha^\star$. Our main result (Section 2) establishes that, for an adaptive sketching matrix of the form $S = A^\top W$ where $W \in \mathbb{R}^{n \times m}$ is typically Gaussian i.i.d., the relative error satisfies a high-probability guarantee of the form $\|\tilde{x} - x^\star\|_2 \leqslant \varepsilon \|x^\star\|_2$, with $\varepsilon < 1$. Our error bound depends on the smoothness parameter $\mu$, the regularization parameter $\lambda$, the shape of the domain of the Fenchel conjugate $f^*$ and the spectral decay of the matrix $A$. Further, we show that this error can be explicitly controlled in terms of the singular values of $A$, and we derive concrete bounds for several standard spectral profiles, which arise in data analysis and machine learning. In particular, we show that using the adaptive matrix $S = A^\top W$ provides much stronger guarantees than oblivious sketching, where $S$ is independent of $A$. Then, we take advantage of the error contraction (i.e., $\varepsilon < 1$), and extend our adaptive sketching scheme to an iterative version (Section 3), which, after $T$ iterations, returns a higher-precision estimate $\tilde{x}^{(T)}$ that satisfies $\|\tilde{x}^{(T)} - x^\star\|_2 \leqslant \varepsilon^T \|x^\star\|_2$. Throughout this work, we specialize our formal guarantees and empirical evaluations (Section 5) to Gaussian matrices $W$, which is a standard choice and yields the tightest error bounds. However, our approach extends to a broader class of matrices $W$, such as Rademacher matrices, sub-sampled randomized Fourier (SRFT) or Hadamard (SRHT) transforms, and column sub-sampling matrices. Thus, it provides a general framework for random subspace optimization with strong solution guarantees.

## 2 Convex optimization in adaptive random subspaces

We introduce the Fenchel conjugate of $f$, defined as $f^*(z) = \sup_u \, \langle z, u \rangle - f(u)$, which is convex and whose domain is a closed, convex set. Our control of the relative error is closely tied to controlling a distance between the respective solutions of the dual problems of (1) and (2). The proofs of the next two propositions follow from standard convex analysis arguments [21], and are deferred to Appendix C.

###### Proposition 1 (Fenchel Duality).

Under the previous assumptions on $f$, it holds that

$$\min_x f(Ax) + \frac{\lambda}{2}\|x\|_2^2 = \max_z \, -f^*(z) - \frac{1}{2\lambda}\|A^\top z\|_2^2.$$

There exist a unique primal solution $x^\star$ and a unique dual solution $z^\star$. Further, we have $z^\star = \nabla f(Ax^\star)$ and $x^\star = -\frac{1}{\lambda}A^\top z^\star$.

###### Proposition 2 (Fenchel Duality on Sketched Program).

Strong duality holds for the sketched program

$$\min_\alpha f(AS\alpha) + \frac{\lambda}{2}\|S\alpha\|_2^2 = \max_y \, -f^*(y) - \frac{1}{2\lambda}\|P_S A^\top y\|_2^2,$$

where $P_S$ is the orthogonal projector onto the range of $S$. There exist a sketched primal solution $\alpha^\star$ and a unique sketched dual solution $y^\star$. Further, for any solution $\alpha^\star$, it holds that $y^\star = \nabla f(AS\alpha^\star)$ and $S\alpha^\star = -\frac{1}{\lambda}P_S A^\top y^\star$.

We define the following deterministic functional, which depends on $f$, the data matrix $A$ and the sketching matrix $S$, and plays an important role in controlling the approximation error,

$$Z_f \equiv Z_f(A,S) = \sup_{\Delta \in (\operatorname{dom} f^* - z^\star)} \left(\frac{\Delta^\top A P_S^\perp A^\top \Delta}{\|\Delta\|_2^2}\right)^{\frac{1}{2}}, \tag{3}$$

where $P_S^\perp = I - P_S$ is the orthogonal projector onto $\operatorname{range}(S)^\perp$. The relationship $S\alpha^\star = -\frac{1}{\lambda}P_S A^\top y^\star$ suggests the point $\tilde{x} = S\alpha^\star$ as a candidate for approximating $x^\star = -\frac{1}{\lambda}A^\top z^\star$. The Fenchel dual programs of (1) and (2) only differ in their quadratic regularization terms, $\frac{1}{2\lambda}\|A^\top z\|_2^2$ and $\frac{1}{2\lambda}\|P_S A^\top y\|_2^2$, whose difference is tied to the quantity $\|P_S^\perp A^\top y\|_2$. We show that the error $\|\tilde{x} - x^\star\|_2$ can be controlled in terms of the spectral norm $\|P_S^\perp A^\top\|_2$, or, more sharply, in terms of the quantity $Z_f$, which satisfies $Z_f \leqslant \|P_S^\perp A^\top\|_2$. We formalize this statement in our next result, whose proof is deferred to Appendix B.1.
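To see why the projector residual governs the comparison of the two dual objectives, note the orthogonal decomposition (a one-line consequence of $P_S$ being an orthogonal projector):

$$\|A^\top y\|_2^2 = \|P_S A^\top y\|_2^2 + \|P_S^\perp A^\top y\|_2^2.$$

Hence, for every dual variable $y$, the sketched dual objective differs from the original one exactly by the term $\frac{1}{2\lambda}\|P_S^\perp A^\top y\|_2^2$, and the functional $Z_f$ in (3) measures the size of this perturbation over the relevant set of dual directions.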

###### Theorem 1 (Deterministic bound).

Let $\alpha^\star$ be any minimizer of the sketched program (2). Then, under the condition $\frac{\mu}{2\lambda} Z_f^2 < 1$, we have

$$\|\tilde{x} - x^\star\|_2 \leqslant \sqrt{\frac{\mu}{2\lambda}}\, Z_f\, \|x^\star\|_2, \tag{4}$$

which further implies

$$\|\tilde{x} - x^\star\|_2 \leqslant \sqrt{\frac{\mu}{2\lambda}}\, \|P_S^\perp A^\top\|_2\, \|x^\star\|_2. \tag{5}$$

For an adaptive sketching matrix $S = A^\top W$, we can write $\|P_S^\perp A^\top\|_2^2 = \|P_S^\perp Q P_S^\perp\|_2$, where $Q = A^\top A$ is p.s.d. Combining our deterministic bound (5) with known results [16, 18, 17] for randomized low-rank factorization of p.s.d. matrices, we can give guarantees with high probability (w.h.p.) on the relative error for various types of matrices $W$. For conciseness, we specialize our next result to adaptive Gaussian sketching, i.e., $W$ Gaussian i.i.d. Given a target rank $k$, we introduce a measure $R_k(A)$ of the spectral tail of $A$, defined in terms of its rank $r$ and its singular values $\sigma_1 \geqslant \cdots \geqslant \sigma_r$. The proof of the next result follows from a combination of Theorem 1 and a corollary of [16], and is deferred to Appendix B.2.

###### Corollary 1 (High-probability bound).

Given a target rank $k$ and a sketching dimension $m$, let $S = A^\top W$, with $W \in \mathbb{R}^{n \times m}$ Gaussian i.i.d. Then, for some universal constant $c_0 > 0$, provided that $m$ is sufficiently large compared to $k$, it holds with high probability that

$$\|\tilde{x} - x^\star\|_2 \leqslant c_0 \sqrt{\frac{\mu}{2\lambda}}\, R_k(A)\, \|x^\star\|_2. \tag{6}$$
###### Remark 1.

The quantity $Z_f^2$ is the largest eigenvalue of the matrix $A P_S^\perp A^\top$, restricted to the spherical cap $(\operatorname{dom} f^* - z^\star) \cap \mathcal{S}^{n-1}$, where $\mathcal{S}^{n-1}$ is the unit sphere in dimension $n$. Thus, depending on the geometry of $\operatorname{dom} f^*$, the deterministic bound (4) might be much tighter than (5), and yield a probabilistic bound better than (6). The investigation of such a result is left for future work.

### 2.1 Theoretical predictions as a function of spectral decay

We study the theoretical predictions given by (5) on the relative error, for different spectral decays of $A$ and different sketching methods; in particular, adaptive Gaussian sketching versus oblivious Gaussian sketching and leverage score column sub-sampling [18]. We denote by $\lambda_1 \geqslant \lambda_2 \geqslant \cdots$ the eigenvalues of $A^\top A$. For conciseness, we absorb the smoothness parameter $\mu$ into the eigenvalues; this re-scaling leaves the right-hand side of the bound (5) unchanged, and does not affect the analysis below. We then place standard normalization and decay assumptions on the spectrum, as is common in empirical risk minimization and kernel regression methods [22], which we focus on in Sections 4 and 5. We consider three decaying schemes of practical interest: the matrix $A$ has either a finite rank $r$, a $\gamma$-exponential decay, or a $\beta$-polynomial decay. Among other examples, these decays are characteristic of various standard kernel functions, such as the polynomial, Gaussian and first-order Sobolev kernels [23]. Given a precision $\varepsilon$ and a confidence level $\delta \in (0,1)$, we denote by $m_a$ (resp. $m_o$, $m_\ell$) a sufficient dimension for which adaptive (resp. oblivious, leverage score) sketching yields the following $(\varepsilon,\delta)$-guarantee on the relative error: with probability at least $1-\delta$, it holds that $\|\tilde{x} - x^\star\|_2 \leqslant \varepsilon \|x^\star\|_2$.

We determine $m_a$ from our probabilistic regret bound (6). For $m_\ell$, using our deterministic regret bound (5), it then suffices to bound the spectral norm $\|P_S^\perp A^\top\|_2$ in terms of the eigenvalues $\lambda_j$ when $S$ is a leverage score column sub-sampling matrix. To the best of our knowledge, the tightest such bound has been given by [18]. For $m_o$, we leverage results from [5], whose authors provide an upper bound on the relative error when $S$ is Gaussian i.i.d. It should be noted that their sketched solution is slightly different from ours: they regularize with $\frac{\lambda}{2}\|\alpha\|_2^2$, whereas we do include the matrix $S$ in the regularization term. One might wonder which regularizer works best when $S$ is Gaussian i.i.d. Through extensive numerical simulations, we observed strongly similar performance. Further, standard Gaussian concentration results yield that $S^\top S$ concentrates around a multiple of the identity in this case, so that the two regularizers are close.

Our theoretical findings are summarized in Table 1, and we give the mathematical details of our derivations in Appendix D. For the sake of clarity, we provide in Table 1 lower bounds on the predicted values $m_o$ and $m_\ell$, and, thus, lower bounds on the ratios $m_o / m_a$ and $m_\ell / m_a$. Overall, adaptive Gaussian sketching provides stronger guarantees on the relative error $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2$.

We illustrate numerically our predictions for adaptive Gaussian sketching versus oblivious Gaussian sketching. We generate two matrices $A_p$ and $A_e$, with spectral decays that are respectively polynomial and exponential. First, we perform binary logistic regression, with $f(Ax) = \sum_{i=1}^n \log(1 + e^{-y_i a_i^\top x})$, where $y_i \in \{-1, +1\}$ and $a_i$ is the $i$-th row of $A$. For the polynomial (resp. exponential) decay, we expect the relative error to follow w.h.p. a decay in $m$ governed by the corresponding spectral profile. Figure 1 confirms those predictions. We repeat the same experiments with a second loss function, which is a convex relaxation of the squared loss for fitting a shallow neural network with rectified linear units (see Appendix A.1). Again, Figure 1 confirms our predictions, and we observe that the adaptive method performs much better than the oblivious sketch.
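The gap between the two sketches can be checked directly on the quantity $\|P_S^\perp A^\top\|_2$ appearing in (5). The following small experiment (helper name ours, loosely mirroring the setup above) compares adaptive and oblivious Gaussian sketching on a matrix with polynomially decaying spectrum:

```python
import numpy as np

def residual_norm(A, S):
    """Spectral norm ||P_S^perp A^T||_2, with P_S the orthogonal
    projector onto range(S)."""
    Q, _ = np.linalg.qr(S)
    return np.linalg.norm(A.T - Q @ (Q.T @ A.T), 2)

rng = np.random.default_rng(1)
n, d, m = 80, 60, 10
# Matrix with polynomially decaying singular values (illustrative choice)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = (1.0 + np.arange(d)) ** (-2.0)
A = U[:, :d] @ np.diag(sigma) @ V.T

S_adapt = A.T @ rng.standard_normal((n, m))   # adaptive sketch S = A^T W
S_obliv = rng.standard_normal((d, m))         # oblivious Gaussian sketch
```

In this setting the adaptive residual is close to the tail singular value $\sigma_{m+1}$, while the oblivious residual stays close to $\sigma_1$, consistent with the predictions of Table 1.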

## 3 Algorithms for adaptive subspace sketching

### 3.1 Numerical conditioning and generic algorithm

A standard quantity characterizing the capacity of a convex program to be solved efficiently is its condition number [24]. For the primal (1) and the (adaptive) sketched program (2), the condition numbers are given by

$$\kappa = \frac{\lambda + \sup_x \sigma_1\!\left(A^\top \nabla^2 f(Ax)\, A\right)}{\lambda + \inf_x \sigma_d\!\left(A^\top \nabla^2 f(Ax)\, A\right)}, \qquad \kappa_S = \frac{\sup_\alpha \sigma_1\!\left(\widetilde{S}^\top A\left(\lambda I + A^\top \nabla^2 f(AA^\top \widetilde{S}\alpha)\, A\right) A^\top \widetilde{S}\right)}{\inf_\alpha \sigma_m\!\left(\widetilde{S}^\top A\left(\lambda I + A^\top \nabla^2 f(AA^\top \widetilde{S}\alpha)\, A\right) A^\top \widetilde{S}\right)}.$$

The latter can be significantly larger than the former. A simple change of variable overcomes this issue: defining $S_\dagger = S(S^\top S)^{-\frac{1}{2}}$, we solve instead the optimization problem

$$\alpha^\star_\dagger = \operatorname*{arg\,min}_{\alpha_\dagger \in \mathbb{R}^m} \, f(AS_\dagger\alpha_\dagger) + \frac{\lambda}{2}\|\alpha_\dagger\|_2^2. \tag{7}$$

It holds that $S_\dagger \alpha^\star_\dagger = S\alpha^\star$, so the two formulations yield the same candidate solution. The additional complexity induced by this change of variables comes from computing the (square-root) pseudo-inverse of $S^\top S$, which requires $\mathcal{O}(m^2 d)$ flops via a singular value decomposition. When $m$ is small, this additional computation is negligible and numerically stable, and the re-scaled sketched program (7) is actually better conditioned than the original primal program (1), as stated in the next result, which we prove in Appendix C.3.
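The re-scaling step can be implemented with one thin SVD; a minimal sketch (helper name ours) under the assumption that $S$ has full column rank:

```python
import numpy as np

def rescale_sketch(S):
    """Return S_dagger = S (S^T S)^{-1/2} via a thin SVD.
    If S = U diag(s) V^T with s > 0, then (S^T S)^{-1/2} = V diag(1/s) V^T
    and S (S^T S)^{-1/2} = U V^T, which has orthonormal columns."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U @ Vt
```

Since $S_\dagger^\top S_\dagger = I_m$, the regularizer of the re-scaled program (7) reduces to $\frac{\lambda}{2}\|\alpha_\dagger\|_2^2$, while $\operatorname{range}(S_\dagger) = \operatorname{range}(S)$, so the optimization is carried out over the same subspace.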

###### Proposition 3.

Under adaptive sketching, the condition number $\kappa_{S_\dagger}$ of the re-scaled sketched program (7) satisfies $\kappa_{S_\dagger} \leqslant \kappa$ with probability $1$.

In practice, we observed a drastic performance improvement between solving the sketched program as formulated in (2) and solving its well-conditioned version (7).

If the chosen sketch dimension $m$ is itself prohibitively large for computing the matrix $(S^\top S)^{-\frac{1}{2}}$, one might consider a pre-conditioning matrix $P$, which is faster to compute and such that the matrix $SP$ is well-conditioned. Typically, one might compute $P$ based on an approximate singular value decomposition of the matrix $S$. Then, one solves the optimization problem (7) with $SP$ in place of $S_\dagger$. Provided that $P$ is invertible, the resulting candidate solution is unchanged, i.e., it still equals $S\alpha^\star$.

### 3.2 Error contraction and almost exact recovery of the optimal solution

The estimate $\tilde{x}$ satisfies a guarantee of the form $\|\tilde{x} - x^\star\|_2 \leqslant \varepsilon \|x^\star\|_2$ w.h.p., with $\varepsilon < 1$ provided that $m$ is large enough. Here, we extend Algorithm 1 to an iterative version which takes advantage of this error contraction, and which is relevant when a high-precision estimate is needed.

A key advantage is that the same sketching matrix $S$ is used at each iteration. Thus, the sketched matrix $AS$ has to be computed only once, at the beginning of the procedure. The output $\tilde{x}^{(T)}$ satisfies the following recovery property, whose empirical benefits are illustrated in Figure 2.

###### Theorem 2.

After $T$ iterations of Algorithm 2, provided that $\frac{\mu Z_f^2}{2\lambda} < 1$, it holds that

$$\|\tilde{x}^{(T)} - x^\star\|_2 \leqslant \left(\frac{\mu Z_f^2}{2\lambda}\right)^{\frac{T}{2}} \|x^\star\|_2. \tag{9}$$

Further, if $S = A^\top W$ where $W \in \mathbb{R}^{n \times m}$ has i.i.d. Gaussian entries and $m$ is sufficiently large compared to some target rank $k$, then, for some universal constant $c_0 > 0$, after $T$ iterations of Algorithm 2, provided that $\frac{c_0^2 \mu R_k^2(A)}{2\lambda} < 1$, the approximate solution $\tilde{x}^{(T)}$ satisfies, with high probability,

$$\|\tilde{x}^{(T)} - x^\star\|_2 \leqslant \left(\frac{c_0^2\, \mu R_k^2(A)}{2\lambda}\right)^{\frac{T}{2}} \|x^\star\|_2. \tag{10}$$
###### Remark 2.

An immediate extension of Algorithms 1 and 2 consists in using the power method [16]. Given $q \geqslant 1$, one uses the sketching matrix $S = (A^\top A)^q A^\top W$. The larger $q$, the smaller the approximation error (see Corollary 10.10 in [16]). Of practical interest are data matrices with a spectral profile starting with a fast decay, and then becoming flat. This happens typically for matrices of the form $A = L + E$, where $L$ has a fast spectral decay and $E$ is a noise matrix with, for instance, independent sub-Gaussian rows [33]. Our results easily extend to this setting, and we illustrate its empirical benefits in Figure 2.
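The power-method sketch is a one-line change over the plain adaptive sketch; a minimal version (function names ours) is:

```python
import numpy as np

def power_sketch(A, m, q, rng):
    """Adaptive sketch with q power iterations: S = (A^T A)^q A^T W.
    For large q one would orthonormalize between the multiplications for
    numerical stability (as in [16]); omitted here for brevity."""
    S = A.T @ rng.standard_normal((A.shape[0], m))
    for _ in range(q):
        S = A.T @ (A @ S)
    return S

def subspace_residual(A, S):
    """||P_S^perp A^T||_2: how much of A^T is missed by range(S)."""
    Q, _ = np.linalg.qr(S)
    return np.linalg.norm(A.T - Q @ (Q.T @ A.T), 2)
```

On a noisy low-rank matrix $A = L + E$ as above, a couple of power iterations typically aligns $\operatorname{range}(S)$ more closely with the top singular directions of $A$, reducing the residual for the same test matrix $W$.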

## 4 Application to empirical risk minimization and kernel methods

By the representer theorem, any solution of the primal program (1) lies in the row space of $A$; writing $x = A^\top w$, the primal program (1) can be re-formulated as

$$w^\star \in \operatorname*{arg\,min}_{w \in \mathbb{R}^n} \, f(Kw) + \frac{\lambda}{2} w^\top K w, \tag{11}$$

where $K = AA^\top$. Clearly, it holds that $x^\star = A^\top w^\star$. Given a matrix $\widetilde{S} \in \mathbb{R}^{n \times m}$ with i.i.d. Gaussian entries, we consider the sketched version of the kernelized primal program (11),

$$\alpha^\star \in \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^m} \, f(K\widetilde{S}\alpha) + \frac{\lambda}{2}\alpha^\top \widetilde{S}^\top K \widetilde{S} \alpha. \tag{12}$$

With $S = A^\top \widetilde{S}$, the sketched program (12) is exactly our adaptive Gaussian sketched program (2). Thus, setting $\tilde{w} = \widetilde{S}\alpha^\star$, it holds that $\tilde{x} = A^\top \tilde{w}$. Since the relative error $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2$ is controlled by the decay of the eigenvalues of $K$, so is the relative error $\|K^{\frac{1}{2}}(\tilde{w} - w^\star)\|_2 / \|K^{\frac{1}{2}} w^\star\|_2$. More generally, the latter statements are still true if $K$ is any positive semi-definite matrix, and if we replace $A$ by any square-root matrix $K^{\frac{1}{2}}$ of $K$. Here, we denote $Z_f \equiv Z_f(K^{\frac{1}{2}}, K^{\frac{1}{2}}\widetilde{S})$ (see Eq. (3)).

###### Theorem 3.

Let $K$ be any positive semi-definite matrix, $w^\star$ be any minimizer of the kernel program (11) and $\alpha^\star$ be any minimizer of its sketched version (12). Define the approximate solution $\tilde{w} = \widetilde{S}\alpha^\star$. If $\frac{\mu Z_f^2}{2\lambda} < 1$, then it holds that

$$\|K^{\frac{1}{2}}(\tilde{w} - w^\star)\|_2 \leqslant \sqrt{\frac{\mu}{2\lambda}}\, Z_f\, \|K^{\frac{1}{2}} w^\star\|_2. \tag{13}$$

For a positive definite kernel and a data matrix $X$, let $K$ be the empirical kernel matrix, with $K_{ij} = k(x_i, x_j)$. Let $\phi$ be a random feature map [1, 25], such as random Fourier features or a random convolutional neural net. We are interested in the computational complexities of forming the sketched versions of the primal (1), the kernel primal (11), and the primal (1) with the feature map $\phi$ applied to the data. We compare the complexities of adaptive and oblivious sketching and uniform column sub-sampling. Table 2 shows that all three methods have similar complexities for forming the sketched data matrices. Adaptive sketching exhibits an additional factor that comes from computing the correlated sketching matrices $A^\top W$ and $K\widetilde{S}$. In practice, the latter is negligible compared to the cost of forming the feature matrix which, for instance, corresponds to a forward pass over the whole dataset in the case of a convolutional neural network. On the other hand, uniform column sub-sampling is significantly faster in forming the sketched kernel matrix $K\widetilde{S}$, which relates to the well-known computational advantages of kernel Nyström methods [12].
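As an illustration of (12), the following sketch (function name ours, not the paper's reference code) solves the sketched kernel program for the ridge case $f(u) = \frac{1}{2}\|u - y\|_2^2$ and recovers $\tilde{w} = \widetilde{S}\alpha^\star$:

```python
import numpy as np

def sketch_kernel_ridge(K, y, m, lam, rng):
    """Sketched kernel program (12) with f(u) = 0.5 ||u - y||^2:
    min_a 0.5 ||K S a - y||^2 + lam/2 a^T S^T K S a, then w_tilde = S a."""
    n = K.shape[0]
    S = rng.standard_normal((n, m))    # Gaussian i.i.d. sketch S_tilde
    KS = K @ S
    # Normal equations: (S^T K^2 S + lam S^T K S) a = S^T K y
    G = KS.T @ KS + lam * (S.T @ KS)
    alpha = np.linalg.lstsq(G, KS.T @ y, rcond=None)[0]
    return S @ alpha
```

For a positive definite $K$ and $m = n$, the sketch is invertible almost surely and the recovered $\tilde{w}$ coincides with the exact solution $w^\star = (K + \lambda I)^{-1} y$; for $m \ll n$, the error is governed by the eigenvalue decay of $K$, as in Theorem 3.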

## 5 Numerical evaluation of adaptive Gaussian sketching

We evaluate Algorithm 1 on MNIST and CIFAR10. First, we aim to show that the sketching dimension $m$ can be considerably smaller than the original dimension $d$ while retaining (almost) the same test classification accuracy. Second, we aim to get significant speed-ups in achieving a high-accuracy classifier. To solve the primal program (1), we use two standard algorithms, stochastic gradient descent (SGD) with (best) fixed step size and stochastic variance reduced gradient (SVRG) [26] with (best) fixed step size and frequency of update of the gradient correction. To solve the adaptive sketched program (2), we use SGD, SVRG and the sub-sampled Newton method [27, 28] – which we refer to as Sketch-SGD, Sketch-SVRG and Sketch-Newton. The latter is well-suited to the sketched program, as the low-dimensional Hessian matrix can be quickly inverted at each iteration. For both datasets, we use the full sets of training and testing images. We transform each image using a random Fourier feature map [1, 29], choosing the kernel bandwidth and the number of features separately for MNIST and CIFAR10, so that the primal program is high-dimensional in both cases. Then, we train a classifier via a sequence of binary logistic regressions – which allow for efficient computation of the Hessian and implementation of the Sketch-Newton algorithm – using a one-vs-all procedure.
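The random Fourier feature map used here is the standard construction of [1]; a minimal implementation (ours, not the paper's code) approximating the Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ is:

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    """Random cosine map of Rahimi & Recht: phi(x) = sqrt(2/D) cos(W^T x + b),
    with W ~ N(0, 2*gamma*I) and b uniform on [0, 2*pi), so that
    phi(x)^T phi(x') approximates exp(-gamma ||x - x'||^2)."""
    d = X.shape[1]
    W = rng.standard_normal((d, D)) * np.sqrt(2.0 * gamma)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

The kernel approximation error decays as $\mathcal{O}(1/\sqrt{D})$, so the number of features $D$ directly trades off accuracy against the dimension of the resulting primal program.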

First, we evaluate the test classification error of the sketched classifier $\tilde{x}$. We solve to optimality the primal and sketched programs for several values of $\lambda$ and sketching dimensions $m$. The results, averaged over multiple trials, are reported in Table 3, and empirical variances are reported in Appendix A. Overall, the adaptive sketched program yields a high-accuracy classifier for most couples $(\lambda, m)$. Further, we match the best primal classifier with small values of $m$ on both MNIST and CIFAR10, which corresponds to a dimension reduction by a large factor. These results additionally suggest that adaptive Gaussian sketching introduces an implicit regularization effect, which might be related to the benefits of spectral cutoff estimators. For instance, on CIFAR10, for a well-chosen couple $(\lambda, m)$, we obtain an improvement in test accuracy compared to the primal classifier $x^\star$. Further, above some sketching dimension threshold – below which the performance is poor – the test classification error of $\tilde{x}$ increases with $m$ towards that of $x^\star$, until matching it.

Further, we evaluate the test classification error of two sketching baselines: oblivious Gaussian sketching, for which the matrix $S$ has i.i.d. Gaussian entries, and adaptive column sub-sampling (Nyström method), for which $S = A^\top \widetilde{S}$ with $\widetilde{S}$ a column sub-sampling matrix. As reported in Table 4, adaptive Gaussian sketching performs better for a wide range of values of the sketching size $m$ and regularization parameter $\lambda$.

Then, we compare the test classification error versus wall-clock time of the optimization algorithms mentioned above. Figure 3 shows results for representative values of $m$ and $\lambda$. We observe some speed-ups on the MNIST problem, in particular for Sketch-SGD and for Sketch-SVRG, for which computing the gradient correction is relatively fast. Such speed-ups are even more significant on the higher-dimensional CIFAR10 problem, especially for Sketch-Newton: a few iterations suffice to almost reach the minimum, with a per-iteration time which is relatively small thanks to dimensionality reduction. Hence, it is considerably faster to reach the best test accuracy using the sketched program. In addition to the random Fourier features mapping, we carry out another set of experiments with the CIFAR10 dataset, in which we pre-process the images. That is, similarly to [31, 32], we map each image through a random convolutional layer. Then, we kernelize these processed images using a Gaussian kernel. Sketch-SGD, Sketch-SVRG and Sketch-Newton – applied to the sketched kernel program (12) – match the best test accuracy obtained by our implementation of the kernel primal program (11), with significant speed-ups, as reported in Figure 3.

## Acknowledgements

This work was partially supported by the National Science Foundation under grant IIS-1838179 and Office of Naval Research, ONR YIP Program, under contract N00014-17-1-2433.

## Appendix A Additional experimental results and implementation details

### a.1 Synthetic examples (Figure 1)

We sample two matrices $U$ and $V$ with orthonormal columns, uniformly at random from their respective spaces. We construct two diagonal matrices $D_p$ and $D_e$, whose diagonal elements decay respectively polynomially and exponentially. We set $A_p = U D_p V^\top$ and $A_e = U D_e V^\top$, and we sample a planted vector $x_{\mathrm{pl}}$ with i.i.d. entries.

In the case of binary logistic regression, the binary label $y_i$ of each row $a_i$ is generated from the planted model, for $i = 1, \ldots, n$.

For the ReLU model, we set $y_i = (a_i^\top x_{\mathrm{pl}})_+$, where $(u)_+ = \max(u, 0)$. Hence, each observation is the result of a linear operation followed by a non-linear operation. Additionally, it can be shown that the global minimum of the optimization problem

$$\min_x \frac{1}{2}\sum_{i=1}^{n}\left[(a_i^\top x)_+^2 - 2(a_i^\top x)\, y_i\right],$$

is attained at the planted vector $x_{\mathrm{pl}}$, which motivates using such a convex relaxation.

### a.2 Numerical illustration of the iterative and power methods (Figure 2)

We use the MNIST dataset with its standard training and testing images. We rescale the pixel values to $[0,1]$. Each image is mapped through random cosines which approximate the Gaussian kernel, with a fixed bandwidth and number of cosines.

We perform binary logistic regression for even-vs-odd classification of the digits.

For the iterative method, we use the sketching matrix $S = (A^\top A)^q A^\top W$, where $W$ is Gaussian i.i.d. That is, we run the iterative method on top of the power method, with a fixed power $q$.

### a.3 Adaptive Gaussian sketching on MNIST and CIFAR10 datasets (Table 3 and Figure 3)

Experiments were run in Python on a multi-core workstation. The MNIST and CIFAR10 datasets were downloaded through the PyTorch.torchvision module and converted to NumPy arrays. We use the Sklearn.kernel_approximation.RBFSampler module to generate random cosines. We use our own implementation of each algorithm for a fair comparison.

For SGD and Sketch-SGD, we use a fixed batch size. For SVRG and Sketch-SVRG, we use a fixed batch size and update the gradient correction at a fixed frequency. Each iteration of the sub-sampled Newton method (Sketch-Newton) computes a full-batch gradient, and the Hessian with respect to a sub-sampled batch.

For SGD and SVRG, we considered a range of step sizes and report results for the best-performing choice. For the sub-sampled Newton method, we use a fixed step size, except for the first few iterations, for which we use a more conservative one.

In Figure 3, we did not report results for SVRG for solving the primal (1) on CIFAR10, as the computation time for reaching a satisfactory performance was significantly larger than for the other algorithms.

In Table 3, we did not investigate results for CIFAR10 with the largest value of $\lambda$, as the corresponding primal classifier had a test error significantly larger than with smaller values of $\lambda$.

## Appendix B Proof of main results

Here, we establish our main technical results: the deterministic regret bounds (4) and (5) stated in Theorem 1, the high-probability version stated in Corollary 1, its extension to the iterative Algorithm 2 as given in Theorem 2, and its variant for kernel methods, given in Theorem 3. Our analysis is based on convex analysis and Fenchel duality arguments.

### b.1 Proof of Theorem 1

We introduce the Fenchel dual program of (1),

$$\min_z f^*(z) + \frac{1}{2\lambda}\|A^\top z\|_2^2. \tag{14}$$

For a sketching matrix $S$, the Fenchel dual program of (2) is

$$\min_y f^*(y) + \frac{1}{2\lambda}\|P_S A^\top y\|_2^2. \tag{15}$$

Let $\alpha^\star$ be any minimizer of the sketched program (2). Then, according to Proposition 2, the unique solution of the sketched dual program (15) is

$$y^\star = \nabla f(AS\alpha^\star),$$

and the subdifferential $\partial f^*(y^\star)$ is non-empty. We fix an arbitrary element of it.

According to Proposition 1, the dual program (14) admits a unique solution $z^\star$, which satisfies

$$z^\star = \nabla f(Ax^\star),$$

and whose subdifferential $\partial f^*(z^\star)$ is non-empty. We fix an arbitrary element of it.

We denote the error between the two dual solutions by $\Delta = y^\star - z^\star$. By optimality of $y^\star$ with respect to the sketched dual (15) and by feasibility of $z^\star$, first-order optimality conditions imply that

 ⟨1λAPSA⊤y∗