Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling

Abstract

Ridge leverage scores provide a balance between low-rank approximation and regularization, and are ubiquitous in randomized linear algebra and machine learning. Deterministic algorithms are also of interest in the moderately big data regime, because deterministic algorithms provide interpretability to the practitioner by having no failure probability and always returning the same results.

We provide provable guarantees for deterministic column sampling using ridge leverage scores. The matrix sketch returned by our algorithm is a column subset of the original matrix, yielding additional interpretability. Like the randomized counterparts, the deterministic algorithm provides $(1+\epsilon)$-error column subset selection, $(1+\epsilon)$-error projection-cost preservation, and an additive-multiplicative spectral bound. We also show that under the assumption of power-law decay of ridge leverage scores, this deterministic algorithm is provably as accurate as randomized algorithms.

Lastly, ridge regression is frequently used to regularize ill-posed linear least-squares problems. While ridge regression provides shrinkage for the regression coefficients, many of the coefficients remain small but non-zero. Performing ridge regression with the matrix sketch returned by our algorithm and a particular regularization parameter forces coefficients to zero and has a provable bound on the statistical risk. As such, it is an interesting alternative to elastic net regularization.

Keywords: Column subset selection; Deterministic sampling algorithms; Projection-cost preservation; Ridge leverage scores; Spectral bounds

1 Introduction

Classical leverage scores quantify the importance of each column for the range space of the sample-by-feature data matrix $A$. Classical leverage scores have been used in regression diagnostics, outlier detection, and randomized matrix algorithms \citepvelleman_efficient_1981, chatterjee_influential_1986, drineas_relative-error_2008.

There are many different flavors of leverage scores, and we will focus on ridge leverage scores. To understand the advantages of ridge leverage scores, we briefly review classical and rank-$k$ subspace leverage scores. A comprehensive review of the extensive leverage score literature is beyond the scope of this note.

Ridge leverage scores were introduced by \citetalaoui_fast_2015 to give statistical bounds for the Nyström approximation for kernel ridge regression. The ridge leverage score for the $i$-th column of $A$ is,

$$\bar{\tau}_i(A) = a_i^{T}\left(A A^{T} + \lambda I\right)^{+} a_i \qquad (1)$$

where the $i$-th column of $A$ is an $n$-vector denoted by $a_i$, $(\cdot)^{+}$ denotes the Moore-Penrose pseudoinverse, $A_k$ is the rank-$k$ SVD approximation to $A$, defined in Sec. 5.1, and $\lambda$ is the regularization parameter. We will always choose $\lambda = \|A - A_k\|_F^{2}/k$, because this choice of regularization parameter gives the stated guarantees. In contrast to ridge leverage scores, the rank-$k$ subspace leverage score is,

$$\tau_i^{(k)}(A) = a_i^{T}\left(A_k A_k^{T}\right)^{+} a_i \qquad (2)$$

The classical leverage score is the ridge leverage score (and also the rank-$k$ subspace leverage score) evaluated at $k = \operatorname{rank}(A)$.
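To make the definitions concrete, the following minimal numpy sketch (ours, not the paper's code; the matrix, $k$, and function names are illustrative) computes the ridge leverage scores of Eqn. 1 with the regularization choice $\lambda = \|A - A_k\|_F^{2}/k$, and the rank-$k$ subspace leverage scores of Eqn. 2.

import numpy as np

def ridge_leverage_scores(A, k):
    # Ridge leverage scores of the columns of A (Eqn. 1), with
    # lambda = ||A - A_k||_F^2 / k as chosen in the text.
    s = np.linalg.svd(A, compute_uv=False)
    lam = np.sum(s[k:] ** 2) / k
    G = np.linalg.pinv(A @ A.T + lam * np.eye(A.shape[0]))
    return np.einsum('ij,jk,ki->i', A.T, G, A)   # a_i^T (A A^T + lam I)^+ a_i

def subspace_leverage_scores(A, k):
    # Rank-k subspace leverage scores of the columns of A (Eqn. 2).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return np.einsum('ij,jk,ki->i', A.T, np.linalg.pinv(Ak @ Ak.T), A)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))   # 50 samples, 200 features (columns)
print(ridge_leverage_scores(A, k=5)[:5])
print(subspace_leverage_scores(A, k=5)[:5])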

Ridge leverage scores and rank-$k$ subspace leverage scores take two different approaches to mitigating the small principal components of $A$ in classical leverage scores. Ridge leverage scores diminish the importance of small principal components through regularization, as opposed to rank-$k$ subspace leverage scores, which omit the small principal components entirely. \citetcohen_input_2017 argue that regularization is a more natural and stable alternative to omission, and they prove sampling bounds for column subset selection, projection-cost preservation, and the spectrum for randomized algorithms with ridge leverage score sampling. The latter two bounds hold for a weighted column subset of the full data matrix. These bounds require $O\!\left(k \log(k/\delta)/\epsilon^{2}\right)$ columns, where $\delta$ is the failure probability and $\epsilon$ is the error.

In the “big data” era, much attention has been paid to randomized algorithms due to improved algorithm performance and ease of generalization to the streaming setting. However, for moderately big data (i.e. the feature set is too large for inspection by humans, but the algorithm performance is not a limitation), deterministic algorithms provide more interpretability to the practitioner than randomized algorithms, since they always provide the same results and have no failure probability.

The usefulness of deterministic algorithms has already been recognized. \citetpapailiopoulos_provable_2014 introduce a deterministic algorithm for sampling columns from rank-$k$ subspace leverage scores and provide a column subset selection bound. \citetmccurdy_column_2017 prove a spectral bound for \citetpapailiopoulos_provable_2014’s deterministic algorithm and for random sampling with rank-$k$ subspace leverage scores. However, the relative spectral bound is limited to the rank-$k$ subspace projections of the column subset matrix and the full data matrix $A$, so obtaining a comparable relative spectral bound for the full matrices requires taking $k$ equal to the rank of $A$. On the other hand, under the condition of power-law decay in the sorted rank-$k$ subspace leverage scores, the deterministic algorithm chooses fewer columns than random sampling with the same error for the column subset selection bound when the decay power exceeds a threshold determined by an absolute constant \citeppapailiopoulos_provable_2014. In addition, \citetpapailiopoulos_provable_2014 show that many real data sets display power-law decay in the sorted rank-$k$ subspace leverage scores, illustrating the algorithm’s real-world utility.

Ridge regression \citephoerl_ridge_1970 is a commonly used method to regularize ill-posed linear least-squares problems. The ridge regression minimization problem is, for outcome $y$, features $A$, and coefficients $x$,

$$\hat{x}_{\lambda}(A) = \operatorname*{argmin}_{x} \; \|y - A x\|_2^{2} + \lambda \|x\|_2^{2} \qquad (3)$$
$$= \left(A^{T} A + \lambda I\right)^{-1} A^{T} y \qquad (4)$$

where the regularization parameter $\lambda$ penalizes the size of the coefficients in the minimization problem. We will always choose $\lambda = \|A - A_k\|_F^{2}/k$ for ridge regression with the matrix $A$.
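As a hedged illustration (names and dimensions are ours, not the paper's), the closed form of Eqns. 3-4 with the regularization choice above can be computed directly:

import numpy as np

def ridge_coefficients(A, y, k):
    # Solve argmin_x ||y - A x||_2^2 + lam ||x||_2^2 with lam = ||A - A_k||_F^2 / k.
    n, d = A.shape
    s = np.linalg.svd(A, compute_uv=False)
    lam = np.sum(s[k:] ** 2) / k
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 300))
y = A @ rng.standard_normal(300) + rng.standard_normal(100)
x_hat = ridge_coefficients(A, y, k=10)
print(x_hat.shape, np.linalg.norm(y - A @ x_hat))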

In ridge regression, the underlying statistical model for data generation is,

$$y = A x + \epsilon \qquad (5)$$

where $A x$ is a deterministic linear function of the features and $\epsilon$ is the random error. The mean squared error is a measure of statistical risk for the squared error loss function and an estimator $\hat{y}$, and is,

$$\mathrm{MSE}(\hat{y}) = \mathbb{E}\left[\|\hat{y} - A x\|_2^{2}\right] \qquad (6)$$

Ridge regression is often chosen over regression subset selection procedures for regularization because, as a continuous shrinkage method, it exhibits lower variability \citepbreiman_heuristics_1996. However, many ridge regression coefficients can be small but non-zero, leading to a lack of interpretability for moderately big data ($d \gg n$). The lasso method \citeptibshirani_regression_1994 provides continuous shrinkage and automatic feature selection using an $\ell_1$ penalty function instead of the $\ell_2$ penalty function in ridge regression, but in the $d > n$ case, lasso saturates at $n$ selected features. The elastic net algorithm combines lasso ($\ell_1$ penalty function) and ridge regression ($\ell_2$ penalty function) for continuous shrinkage and automatic feature selection \citepzou_regularization_2005. We combine deterministic ridge leverage score column subset selection with ridge regression for a particular value of the regularization parameter, providing automatic feature selection and continuous shrinkage, and we show that the statistical risk is bounded.

1.1 Contributions

We introduce a deterministic algorithm (Algorithm 1) for ridge leverage score column sampling, inspired by the deterministic algorithm of [\citeauthoryearPapailiopoulos, Kyrillidis, and BoutsidisPapailiopoulos et al.2014]. By using ridge leverage scores instead of rank-$k$ subspace leverage scores in the deterministic algorithm, we can prove additional desirable bounds for the column subset matrix $C$. We prove that the same column subset selection, projection-cost preservation, and additive-multiplicative spectral bounds hold for deterministic column sampling as for random sampling in \citetcohen_input_2017. We show that under the condition of power-law decay in the ridge leverage scores, the deterministic algorithm chooses fewer columns than random sampling with the same error when the decay power exceeds a threshold determined by an absolute constant. We show that ridge regression performed with the column subset matrix $C$ and a particular choice of regularization parameter has a provable bound on the statistical risk. The proof techniques are such that a bound on the statistical risk also holds for randomized ridge leverage score sampling as in \citetcohen_input_2017. We also provide a proof-of-concept illustration on biological data.

Our work is triply beneficial from the interpretability standpoint: it is deterministic, it chooses an unweighted subset of representative columns, and it comes with four desirable error guarantees, three of which stem from the naturalness of the low-rank ridge regularization.

1.2 Notation

The singular value decomposition (SVD) of any complex $n \times d$ matrix $A$ is $A = U \Sigma V^{\dagger}$, where $U$ and $V$ are square unitary matrices ($U U^{\dagger} = I_n$, $V V^{\dagger} = I_d$), and $\Sigma$ is a rectangular diagonal matrix with real non-negative non-increasingly ordered entries. $V^{\dagger}$ is the complex conjugate and transpose of $V$, and $I_n$ is the $n \times n$ identity matrix. The diagonal elements of $\Sigma$ are called the singular values, and they are the positive square roots of the eigenvalues of both $A A^{\dagger}$ and $A^{\dagger} A$, which have eigenvectors $U$ and $V$, respectively. $U$ and $V$ are the left and right singular vectors of $A$.

Defining $U_k$ as the first $k$ columns of $U$ (and analogously $V_k$ for $V$), and $\Sigma_k$ as the square diagonal matrix with the first $k$ entries of $\Sigma$, then $A_k = U_k \Sigma_k V_k^{\dagger}$ is the rank-$k$ SVD approximation to $A$. Furthermore, letting $r$ denote the rank of $A$, we refer to the matrix with only the last $r - k$ columns of $U$ and the last $r - k$ entries of $\Sigma$ as $U_{\setminus k}$ and $\Sigma_{\setminus k}$, and $A_{\setminus k} = U_{\setminus k} \Sigma_{\setminus k} V_{\setminus k}^{\dagger} = A - A_k$.

The Moore-Penrose pseudoinverse of a rank-$r$ matrix $A$ is given by $A^{+} = V_r \Sigma_r^{-1} U_r^{\dagger}$.

The Frobenius norm of a matrix $A$ is given by $\|A\|_F = \sqrt{\operatorname{Tr}\!\left(A^{\dagger} A\right)} = \sqrt{\sum_{j} \sigma_j^{2}}$. The spectral norm of a matrix $A$, $\|A\|_2$, is given by the largest singular value of $A$.
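The notation above is easy to check numerically; the short sketch below (ours, with arbitrary dimensions) verifies the rank-$k$ SVD approximation, the pseudoinverse, and the two norms with numpy.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]                       # rank-k SVD approximation
print(np.allclose(np.linalg.pinv(A), (Vt.T * (1.0 / s)) @ U.T))   # pseudoinverse via the SVD
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A.T @ A))))   # Frobenius norm
print(np.isclose(np.linalg.norm(A, 2), s[0]))             # spectral norm = largest singular value
print(np.isclose(np.linalg.norm(A - Ak, 'fro') ** 2, np.sum(s[k:] ** 2)))  # ||A - A_k||_F^2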

2 Deterministic Ridge Column Sampling

2.1 The DRLS Algorithm

Algorithm 1.

The DRLS algorithm selects for the submatrix $C$ all columns of $A$ with ridge leverage score (Eqn. 1) above a threshold $\theta$, determined by the error tolerance $\epsilon$. The algorithm is as follows.

  1. Choose the error tolerance, $\epsilon$.

  2. For every column $i$ of $A$, calculate the ridge leverage score $\bar{\tau}_i(A)$ (Eqn. 1).

  3. Sort the columns by $\bar{\tau}_i(A)$, from largest to smallest. Denote the sorted column indices by $\pi(1), \dots, \pi(d)$.

  4. Define an empty set $\Theta$. Starting with the largest sorted column index $\pi(1)$, add the corresponding column index to the set $\Theta$, in decreasing order of ridge leverage score, until,

    $$\sum_{i \in \Theta} \bar{\tau}_i(A) \;\ge\; \sum_{i=1}^{d} \bar{\tau}_i(A) - \epsilon \qquad (7)$$

    and then stop. Note that $\sum_{i=1}^{d} \bar{\tau}_i(A) \le 2k$ (see Sec. 5.2 for proof). It will be useful to define $\bar{\Theta}$, the complement of $\Theta$ in $\{1, \dots, d\}$. Eqn. 7 can equivalently be written as $\sum_{i \in \bar{\Theta}} \bar{\tau}_i(A) \le \epsilon$.

  5. If the set size $|\Theta| < k$, continue adding columns in decreasing order until $|\Theta| = k$.

  6. The ridge leverage score of the last column index included in $\Theta$ defines the leverage score threshold $\theta$.

  7. Introduce a rectangular selection matrix $S$ of size $d \times |\Theta|$. If the column of $A$ indexed by the $j$-th element of $\Theta$ is column $i$, then $S_{ij} = 1$, and $S_{ij} = 0$ otherwise. The DRLS submatrix is $C = A S$.

Note that when the ridge leverage scores on either side of the threshold are not equal, the algorithm returns a unique solution. Otherwise, there are as many solutions as there are columns with equal ridge leverage scores at the threshold.

This algorithm is deeply indebted to the deterministic algorithm of [\citeauthoryearPapailiopoulos, Kyrillidis, and BoutsidisPapailiopoulos et al.2014]. It substitutes ridge leverage scores for rank-$k$ subspace leverage scores, and it has a different stopping parameter.

Algorithm 1 requires arithmetic operations dominated by the cost of computing the ridge leverage scores.
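A minimal numpy sketch of Algorithm 1 follows. It is our own reconstruction, not the paper's code: in particular, the stopping rule implemented here keeps adding columns until the ridge leverage mass of the excluded columns is at most $\epsilon$, which should be adjusted to match the exact constant in Eqn. 7.

import numpy as np

def drls(A, k, eps):
    # Deterministic ridge leverage score (DRLS) column selection, following
    # the steps of Algorithm 1; returns selected column indices and threshold.
    n, d = A.shape
    s = np.linalg.svd(A, compute_uv=False)
    lam = np.sum(s[k:] ** 2) / k
    G = np.linalg.pinv(A @ A.T + lam * np.eye(n))
    tau = np.einsum('ij,jk,ki->i', A.T, G, A)       # ridge leverage scores (Eqn. 1)
    order = np.argsort(tau)[::-1]                   # sort columns, largest score first
    excluded_mass = np.sum(tau) - np.cumsum(tau[order])
    m = int(np.argmax(excluded_mass <= eps)) + 1    # first point satisfying the stopping rule
    m = max(m, k)                                   # step 5: keep at least k columns
    theta = tau[order[m - 1]]                       # step 6: leverage score threshold
    return np.sort(order[:m]), theta

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 150))
cols, theta = drls(A, k=5, eps=0.1)
C = A[:, cols]                                      # the DRLS submatrix C = A S
print(len(cols), theta)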

3 Approximation Guarantees

3.1 Bounds for DRLS

We derive a new additive-multiplicative spectral approximation bound (Eqn. 8) for the square $C C^{T}$ of the submatrix $C$ selected with DRLS.

Theorem 1.

Additive-Multiplicative Spectral Bound: Let $A$ be a matrix of rank at least $k$ and let $\bar{\tau}_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Then $C$ satisfies,

$$\left(1 - \epsilon\right) A A^{T} - \frac{\epsilon\,\|A - A_k\|_F^{2}}{k}\, I \;\preceq\; C C^{T} \;\preceq\; A A^{T} \qquad (8)$$

The symbol $\preceq$ denotes the Loewner partial ordering, which is reviewed in Sec. 5.1 (see [\citeauthoryearHorn and JohnsonHorn and Johnson2013] for a thorough discussion).

Conceptually, the Loewner ordering in Eqn. 8 is the generalization of the ordering of real numbers (e.g. $b \le c$) to Hermitian matrices. Statements of Loewner ordering are quite powerful; important consequences include inequalities for the eigenvalues. We will use Eqn. 8 to prove Theorems 2, 3, and 4. Note that our additive-multiplicative bound holds for an un-weighted column subset of $A$.
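As a hedged numerical illustration (our own example, not from the paper), the upper bound $C C^{T} \preceq A A^{T}$ of Eqn. 8, which holds for any column subset, can be checked by confirming that the difference has no negative eigenvalues.

import numpy as np

def loewner_leq(B, C, tol=1e-10):
    # B <= C in the Loewner order iff C - B has no negative eigenvalues.
    return np.min(np.linalg.eigvalsh(C - B)) >= -tol

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 80))
C_sub = A[:, :50]                                  # any column subset of A
print(loewner_leq(C_sub @ C_sub.T, A @ A.T))       # upper bound of Eqn. 8: True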

Theorem 2.

Column Subset Selection: Let $A$ be a matrix of rank at least $k$ and let $\bar{\tau}_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Then $C$ satisfies,

(9)

with an error factor depending only on $\epsilon$, and where $\Pi^{F}_{C,k}(A)$ is the best rank-$k$ approximation to $A$ in the column space of $C$ with respect to the Frobenius norm.

Column subset selection algorithms are widely used for feature selection for high-dimensional data, since the aim of the column subset selection problem is to find a small number of columns of $A$ that approximate the column space nearly as well as the top $k$ singular vectors.

Theorem 3.

Rank-$k$ Projection-Cost Preservation: Let $A$ be a matrix of rank at least $k$ and let $\bar{\tau}_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Then $C$ satisfies, for any rank-$k$ orthogonal projection $P$,

(10)

To simplify the bookkeeping, we prove the lower bound of Theorem 3 with a modified error coefficient, and assume $\epsilon$ is bounded by a small constant.

Projection-cost preservation bounds were formalized recently in \citetfeldman_turning_2013, cohen_dimensionality_2015. Bounds of this type are important because they mean that low-rank projection problems can be solved with the smaller matrix $C$ instead of $A$ while approximately maintaining the projection cost. Furthermore, the projection-cost preservation bound has implications for $k$-means clustering, because the $k$-means objective function can be written in terms of the orthogonal rank-$k$ cluster indicator matrix \citepboutsidis_unsupervised_2009. Note that our rank-$k$ projection-cost preservation bound holds for an un-weighted column subset of $A$.
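The following short sketch (ours; the subset below is an arbitrary stand-in for the DRLS subset $C$) compares the projection cost $\|A - P A\|_F^2$ with $\|C - P C\|_F^2$ for a random rank-$k$ orthogonal projection, in the spirit of Theorem 3.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 100))
C = A[:, rng.choice(100, size=60, replace=False)]   # stand-in for a DRLS column subset
k = 5
Q, _ = np.linalg.qr(rng.standard_normal((30, k)))
P = Q @ Q.T                                         # random rank-k orthogonal projection
cost_A = np.linalg.norm(A - P @ A, 'fro') ** 2
cost_C = np.linalg.norm(C - P @ C, 'fro') ** 2
print(cost_C / cost_A)                              # Theorem 3 bounds this ratio for the DRLS subset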

A useful lemma on an approximate ridge leverage score kernel comes from combining Theorems 1 and 3.

Lemma 1.

Approximate ridge leverage score kernel: Let $A$ be a matrix of rank at least $k$ and let $\bar{\tau}_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Let be the coefficient in the lower bound of Theorem 3 and assume . Let for matrix . Then and satisfy,

(11)
Theorem 4.

Approximate Ridge Regression with DRLS: Let $A$ be a matrix of rank at least $k$ and let $\bar{\tau}_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1, let be the coefficient in the lower bound of Theorem 3, and assume . Choose the regularization parameter for ridge regression with the matrix $C$ (Eqn. 4). Under these conditions, the statistical risk of the ridge regression estimator computed from $C$ is bounded by the statistical risk of the ridge regression estimator computed from $A$:

(12)

where .

Theorem 4 means that there are bounds on the statistical risk for substituting the DRLS-selected column subset matrix $C$ for the complete matrix $A$ when performing ridge regression with the appropriate regularization parameter. Performing ridge regression with the column subset effectively forces the coefficients of the unselected columns to zero and adds the benefits of automatic feature selection to the regularization problem. We also note that the proof of Theorem 4 relies only on Theorem 2 and Theorem 3 and facts from linear algebra, so a randomized selection of weighted column subsets that obeys similar bounds to Theorem 2 and Theorem 3 (e.g. \citetcohen_input_2017) will also have bounded statistical risk, albeit with a different coefficient. As a point of comparison, consider the elastic net minimization with our ridge regression regularization parameter:

$$\hat{x}_{\mathrm{en}} = \operatorname*{argmin}_{x} \; \|y - A x\|_2^{2} + \frac{\|A - A_k\|_F^{2}}{k}\,\|x\|_2^{2} + \lambda_1 \|x\|_1 \qquad (13)$$

The risk of the elastic net estimator has the following bound in terms of the risk of ridge regression:

(14)

This comes from a slight re-working of Theorem 3.1 of \citetzou_adaptive_2009. The bounds for the elastic net risk and are comparable when .
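As a hedged sketch of the comparison behind Theorem 4 (synthetic data and a stand-in column subset; Theorem 4's exact regularization parameter for $C$ should be substituted where indicated), ridge regression with the full matrix and with a column subset can be compared directly:

import numpy as np

def ridge_fit(M, y, lam):
    return np.linalg.solve(M.T @ M + lam * np.eye(M.shape[1]), M.T @ y)

rng = np.random.default_rng(6)
n, d, k = 80, 200, 10
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)
s = np.linalg.svd(A, compute_uv=False)
lam = np.sum(s[k:] ** 2) / k                      # the paper's choice for A
C = A[:, rng.choice(d, size=80, replace=False)]   # stand-in for the DRLS subset
y_hat_A = A @ ridge_fit(A, y, lam)
y_hat_C = C @ ridge_fit(C, y, lam)                # substitute Theorem 4's parameter here
print(np.corrcoef(y_hat_A, y_hat_C)[0, 1])        # fitted values from C track those from A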

Theorem 5.

Ridge Leverage Power-law Decay: Let $A$ be a matrix of rank at least $k$ and let $\bar{\tau}_i(A)$ be defined as in Eqn. 1. Furthermore, let the ridge leverage scores exhibit power-law decay in the sorted column index $i$,

$$\bar{\tau}_{\pi(i)}(A) = \bar{\tau}_{\pi(1)}(A)\, i^{-a} \qquad (15)$$

Construct $C$ following the DRLS algorithm described in Sec. 2.1. The number of columns $|\Theta|$ selected by DRLS is,

(16)

The concept of power-law decay behavior for leverage scores was introduced by \citetpapailiopoulos_provable_2014 for rank-$k$ subspace leverage scores. Our Theorem 5 is an adaptation of \citetpapailiopoulos_provable_2014’s theorem for ridge leverage scores.

An obvious extension of Eqn. 8 is the following bound,

$$\left(1 - \epsilon\right) A A^{T} - \frac{\epsilon\,\|A - A_k\|_F^{2}}{k}\, I \;\preceq\; C C^{T} \;\preceq\; \left(1 + \epsilon\right) A A^{T} + \frac{\epsilon\,\|A - A_k\|_F^{2}}{k}\, I \qquad (17)$$

which also holds for $C$ selected by ridge leverage score random sampling methods with weighted columns and failure probability $\delta$ [\citeauthoryearCohen, Musco, and MuscoCohen et al.2017]. Thus, DRLS selects fewer columns with the same accuracy in Eqn. 17 for power-law decay in the ridge leverage scores when,

(18)

where is an absolute constant. In particular, when , the number of columns deterministically sampled is .
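A quick way to gauge whether sorted ridge leverage scores follow the power-law decay of Eqn. 15 is a least-squares fit on the log-log scale; the sketch below (ours, with synthetic scores) recovers the decay power.

import numpy as np

def decay_exponent(tau):
    # Fit log(tau_sorted) ~ log(tau_1) - a * log(i) and return the decay power a.
    tau_sorted = np.sort(tau)[::-1]
    idx = np.arange(1, len(tau_sorted) + 1)
    slope, _ = np.polyfit(np.log(idx), np.log(tau_sorted), 1)
    return -slope

tau = 0.9 * np.arange(1, 501, dtype=float) ** (-1.5)   # synthetic scores with decay power 1.5
print(decay_exponent(tau))                              # ~1.5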

4 Biological Data Illustration

We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). Diffuse lower-grade gliomas are infiltrative brain tumors that occur most frequently in the cerebral hemisphere of adults.

The data is publicly available and hosted by the Broad Institute’s GDAC Firehose \citepbroad_institute_of_mit_and_harvard_broad_2016. We download the data using the TCGA2STAT tool \citepwan_tcga2stat:_2016. TCGA2STAT imports the latest available version-stamped standardized Level 3 dataset on Firehose. The data collection and data platforms are discussed in detail in the original paper \citepthe_cancer_genome_atlas_research_network_comprehensive_2015.

We use the following multi-omic data types: mutations, DNA copy number (alteration and variation), messenger RNA (mRNA) expression, and microRNA expression. Methylation data is also available, but we omit it due to memory constraints. The mRNA and microRNA data is normalized. DNA copy number (variation and alteration) has an additional pre-processing step; the segmentation data reported by TCGA is turned into copy number using the CNTools tool \citepzhang_cntools:_2015 that is embedded in TCGA2STAT. The mutation data is filtered based on status and variant classification and then aggregated at the gene level \citepwan_tcga2stat:_2016.

There are tumor samples and multi-omic features in the downloaded dataset. We are interested in performing ridge regression with the biologically meaningful outcome variables relating to mutations of the “IDH1” and “IDH2” genes and deletions of the “1p/19q” chromosome arms (“codel”). These variables were shown to be predictive of favorable clinical outcomes and can be found in the supplemental tables \citepthe_cancer_genome_atlas_research_network_comprehensive_2015. We restrict to samples with these outcome variables ( tumor samples), and we drop an additional sample (“TCGA-CS-4944”) because it is an outlier with respect to the SVD projection of the samples. This leaves a total of tumor samples with outcome variables “IDH” (a mutation in either “IDH1” or “IDH2”) and “codel” for the analysis.

Lastly, we drop all multi-omic features that have zero columns and greater than missing data on the tumor samples. We then replace missing values with the mean of the column. This leaves a final multi-omic feature set of for the tumor samples. Our final matrix $A$ is column mean-centered.

4.1 Ridge leverage score sampling

Applying the DRLS algorithm with leads to , selecting approximately of the total multi-omic features. Figure 1 shows the spectrum of eigenvalues of for LGG. Figure 2 shows the relationship between the number of columns kept, , and for the ridge leverage scores. Figure 3 shows the power-law decay of the LGG ridge leverage scores with sorted column index. The LGG multi-omic data shows that ridge leverage score power-law decay occurs in the wild. Figure 4 shows a histogram of the ratio of for random rank-$k$ orthogonal projections. The projections are chosen as the first $k$ directions from an orthogonal basis randomly selected with respect to the Haar measure for the orthogonal group \citepmezzadri_how_2006. Lastly, Figure 5 illustrates the ridge leverage score regularization of the classical leverage score for the LGG multi-omic features.

4.2 Ridge regression with ridge leverage score sampling

We perform ridge regression with the appropriate regularization parameter for two biologically meaningful outcome variables; the first is whether either the “IDH1” or the “IDH2” gene is mutated, and the second is whether the “1p/19q” chromosome arms have deletions (“codel”). We encode the status of each event as . Figures 6, 7, and 8 show the top three SVD projections for the tumor samples, colored by the combined status for “IDH” and “codel”. No tumor samples have the “1p/19q” codeletion and no “IDH” mutation. Visual inspection of the SVD plot confirms that this is a reasonable regression problem for “IDH” and a difficult regression problem for “codel”; also, logistic regression would be more natural for binary outcomes. We proceed anyway, since our objective is to compare ridge regression with all of the features to ridge regression with the DRLS subset on realistic biological data. Figures 9 and 10 confirm that the ridge regression fits are close () for all the tumor samples. Figures 11 and 12 confirm that the ridge regression coefficients are close () for all the tumor samples. Figures 13 and 14 illustrate the overall performance of ridge regression for these two outcome variables.

Lastly, we simulate samples according to the linear model (Eqn. 5), where the random error and the coefficients are simulated and $A$ is the LGG multi-omic feature matrix. We perform ridge regression with $A$ and then again with the DRLS submatrix $C$ in accordance with Theorem 4. We calculate the risks and find that Theorem 4 is not violated. Table 1 shows the risk ratio along with other relevant ratios for the ridge leverage scores.

ave() ave()
0.85 0.97 1.03
Table 1: Ridge leverage score ratios for LGG tumor multi-omic data. The ratios are near one, as expected. Ridge regression risk ratio for data simulated from the LGG multi-omic feature matrix and Eqn. 5.
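A hedged sketch of the risk comparison in this simulation (with a synthetic feature matrix and a stand-in column subset in place of the LGG data and the DRLS subset) estimates the two risks of Eqn. 6 by Monte Carlo:

import numpy as np

rng = np.random.default_rng(8)
n, d, k, trials = 60, 150, 8, 200
A = rng.standard_normal((n, d))                  # stand-in for the feature matrix
x_true = rng.standard_normal(d)
mean_y = A @ x_true                              # E[y | A] under the linear model (Eqn. 5)
s = np.linalg.svd(A, compute_uv=False)
lam = np.sum(s[k:] ** 2) / k
C = A[:, rng.choice(d, size=60, replace=False)]  # stand-in for the DRLS subset

def mc_risk(M, lam_M):
    # Monte Carlo estimate of E ||y_hat - E[y|A]||_2^2 for ridge regression with M.
    hat = M @ np.linalg.solve(M.T @ M + lam_M * np.eye(M.shape[1]), M.T)
    errs = [np.sum((hat @ (mean_y + rng.standard_normal(n)) - mean_y) ** 2)
            for _ in range(trials)]
    return np.mean(errs)

print(mc_risk(C, lam) / mc_risk(A, lam))         # Theorem 4 prescribes a particular lambda for C; lam is a stand-in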

5 Proofs

5.1 Brief Review

See \citethorn_matrix_2013 for a linear algebra review. A square complex matrix $B$ is Hermitian if $B = B^{\dagger}$. Symmetric positive semi-definite (S.P.S.D.) matrices are Hermitian matrices. The set of Hermitian matrices is a real linear space. As such, it is possible to define a partial ordering (also called a Loewner partial ordering, denoted by $\preceq$) on the real linear space. One matrix is “greater” than another if their difference lies in the closed convex cone of S.P.S.D. matrices. Let $B$ and $C$ be Hermitian and the same size, and $v$ a complex vector of compatible dimension. Then,

$$B \preceq C \iff v^{\dagger} B\, v \le v^{\dagger} C\, v \quad \text{for all } v \qquad (19)$$

If $B$ is Hermitian with smallest and largest eigenvalues $\lambda_{\min}$ and $\lambda_{\max}$, respectively, then,

$$\lambda_{\min}\, I \preceq B \preceq \lambda_{\max}\, I \qquad (20)$$

Let $B$ and $C$ be Hermitian and the same size, and let $X$ be any complex rectangular matrix of compatible dimension. The conjugation rule is,

$$B \preceq C \implies X^{\dagger} B X \preceq X^{\dagger} C X \qquad (21)$$

In addition, let $\lambda_i(B)$ and $\lambda_i(C)$ be the non-decreasingly ordered eigenvalues of $B$ and $C$. Then,

$$B \preceq C \implies \lambda_i(B) \le \lambda_i(C) \quad \text{for all } i \qquad (22)$$

Since the trace of a matrix is the sum of its eigenvalues, $\operatorname{Tr}(B) = \sum_i \lambda_i(B)$, and the Loewner ordering implies the ordering of eigenvalues (Eqn. 22), the Loewner ordering also implies the ordering of their sum,

$$B \preceq C \implies \operatorname{Tr}(B) \le \operatorname{Tr}(C) \qquad (23)$$
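These facts are easy to sanity-check numerically; the sketch below (ours) builds a pair of Hermitian matrices with $B \preceq C$ and verifies the eigenvalue ordering (Eqn. 22) and the trace ordering (Eqn. 23).

import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((8, 8))
B = X @ X.T                                    # S.P.S.D.
D = rng.standard_normal((8, 3))
C = B + D @ D.T                                # C - B is S.P.S.D., so B <= C
eig_B, eig_C = np.linalg.eigvalsh(B), np.linalg.eigvalsh(C)   # non-decreasing order
print(np.all(eig_B <= eig_C + 1e-10))          # Eqn. 22: eigenvalue ordering
print(np.trace(B) <= np.trace(C) + 1e-10)      # Eqn. 23: trace ordering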

5.2 Proof of ridge leverage score sum

The sum of ridge leverage scores is,

$$\sum_{i=1}^{d} \bar{\tau}_i(A) = \operatorname{Tr}\!\left[A^{T}\!\left(A A^{T} + \frac{\|A - A_k\|_F^{2}}{k}\, I\right)^{+} A\right] \qquad (24)$$
$$= \sum_{j=1}^{r} \frac{\sigma_j^{2}}{\sigma_j^{2} + \frac{\|A - A_k\|_F^{2}}{k}} \qquad (25)$$

where the matrix inside the trace is diagonalized by the SVD of $A$, $\sigma_j$ are the singular values of $A$, and $r$ is its rank. We split the sum into two parts,

$$\sum_{j=1}^{k} \frac{\sigma_j^{2}}{\sigma_j^{2} + \frac{\|A - A_k\|_F^{2}}{k}} \le k \qquad (26)$$
$$\sum_{j=k+1}^{r} \frac{\sigma_j^{2}}{\sigma_j^{2} + \frac{\|A - A_k\|_F^{2}}{k}} \le \frac{k \sum_{j=k+1}^{r} \sigma_j^{2}}{\|A - A_k\|_F^{2}} = k \qquad (27)$$
so that $\sum_{i=1}^{d} \bar{\tau}_i(A) \le 2k$.

This proof is due to [\citeauthoryearCohen, Musco, and MuscoCohen et al.2017].
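The bound just proved is easy to check numerically; the following sketch (ours) computes the ridge leverage scores of a random matrix and confirms that their sum does not exceed $2k$.

import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((40, 120))
k = 6
s = np.linalg.svd(A, compute_uv=False)
lam = np.sum(s[k:] ** 2) / k
G = np.linalg.pinv(A @ A.T + lam * np.eye(40))
tau = np.einsum('ij,jk,ki->i', A.T, G, A)       # ridge leverage scores (Eqn. 1)
print(np.sum(tau), 2 * k)                       # the sum is at most 2k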

5.3 Proof of Theorem 1

The upper bound in Eqn. 8 in Theorem 1 follows from the fact that $S S^{T} \preceq I_d$ and the conjugation rule (Eqn. 21),

$$C C^{T} = A S S^{T} A^{T} \preceq A A^{T} \qquad (28)$$

This upper bound is true for any column selection of $A$.

For the lower bound in Eqn. 8, consider the quantity,

$$\left(A A^{T} + \lambda I\right)^{+/2} \left(A A^{T} - C C^{T}\right) \left(A A^{T} + \lambda I\right)^{+/2}, \qquad \lambda = \frac{\|A - A_k\|_F^{2}}{k}.$$

By the conjugation rule (Eqn. 21) on Eqn. 28, $\left(A A^{T} + \lambda I\right)^{+/2} C C^{T} \left(A A^{T} + \lambda I\right)^{+/2} \preceq \left(A A^{T} + \lambda I\right)^{+/2} A A^{T} \left(A A^{T} + \lambda I\right)^{+/2}$, so the quantity above is S.P.S.D. By the construction of DRLS (Eqn. 7), its trace is $\sum_{i \in \bar{\Theta}} \bar{\tau}_i(A) \le \epsilon$, and because it is S.P.S.D., its largest eigenvalue is at most its trace and hence at most $\epsilon$. By Eqn. 20 and the previous facts, the quantity above is $\preceq \epsilon I$. As a result of the conjugation rule applied to this upper bound,

$$A A^{T} - C C^{T} \preceq \epsilon \left(A A^{T} + \frac{\|A - A_k\|_F^{2}}{k} I\right),$$

and rearrangement leads to the lower bound of Eqn. 8.

5.4 Proof of Theorem 2

To prove Eqn. 9, we will use the following lemma.

Lemma 2.

([\citeauthoryearBoutsidis, Drineas, and Magdon-IsmailBoutsidis et al.2011] Lemma , Eqn. , specialized to the Frobenius norm): Consider , where and . Let be any matrix such that , and let . Then,

(29)

where is the best rank-$k$ approximation to $A$ in the column space of $C$ with respect to the Frobenius norm.

We’ll use a slightly relaxed version of Lemma 2,

(30)

Choosing , and noting that , we have

(31)

It remains to calculate , where . As a consequence of the conjugation rule (Eqn. 21) applied to Eqn. 8 with pre- (post-)multiplication by ( ), we have,

(32)

From Eqn. 32, the eigenvalue obeys,

(33)

after using the fact that (by definition). Eqn. 33 shows that, for , . Combining Eqn. 31 and Eqn. 33 gives Eqn. 9. This proof illustrates the power of the spectral bound (Eqn. 8), since the column subset selection bound (Eqn. 9) is a direct consequence of Eqn. 8.

5.5 Proof of Theorem 3

It will be convenient to define the projection matrix . Then the upper bound in Eqn. 10 follows directly from the upper spectral bound (Eqn. 28), the conjugation rule (Eqn. 21), and the trace rule (Eqn. 23).

The lower projection-cost preservation bound is considerably more involved to prove, and also relies primarily on the spectral bound (Eqn. 8) and facts from linear algebra. Our proof is nearly identical to the proof of [\citeauthoryearCohen, Musco, and MuscoCohen et al.2017]’s Theorem , with only one large deviation and several small differences in constants. We include the full proof for completeness, and point out the major difference.

We split and into their projections on the top “head” singular vectors and the remaining “tail” singular vectors. Choose such that is the smallest singular value that obeys . Let and be projection matrices. Note that,

(34)
(35)
(36)

Head terms

First we bound the terms involving only . Consider Eqn. 8 and the vector for any vector . Eqn. 8 gives,

(37)

because by the artful definition of , . Therefore we have,

(38)

and finally,

(39)

Tail terms

To bound the lower singular directions of , we decompose further as,

(40)

and analogously for .

The upper spectral bound (Eqn. 28) and the conjugation rule (Eqn. 21) give,

(41)

The conjugation rule (Eqn. 21) and the trace rule (Eqn. 23) give,

(42)

Next we consider . In \citetcohen_input_2017’s proof, a scalar Chernoff bound is used for this term ([\citeauthoryearCohen, Musco, and MuscoCohen et al.2017] Section 4.3, Eqn. 17). Since our matrix $C$ is constructed deterministically, we will prove and substitute the following bound (Eqn. 43),

(43)

for the scalar Chernoff bound.

To prove Eqn. 43, we first note that the lower bound follows directly from the upper spectral bound (Eqn. 28), the conjugation rule (Eqn. 21), and the trace rule (Eqn. 23). To prove the upper bound, we rewrite the difference in Frobenius norms as a difference in sums over column norms:

(44)
(45)
(46)
(47)

The first inequality follows from the fact that for . As a result, .

Combining the upper bound of Eqn. 43 with Eqn. 42 gives,

(48)

Cross terms

Finally, we show is small. We rewrite it as,

(49)

using the cyclic property of the trace and the fact that is in ’s column span. Because is semi-positive definite and defines a semi-inner product, we can use the Cauchy-Schwarz inequality,

(50)
(51)
(52)

The square of the second term decomposes as,

(54)

which is small for every