Subset Selection for Matrices with Fixed Blocks
Subset selection for matrices is the task of extracting a column sub-matrix from a given matrix such that the pseudoinverse of the sampled matrix has as small a Frobenius or spectral norm as possible. In this paper, we consider the more general problem of subset selection for matrices in which a block of columns is fixed at the beginning. Under this setting, we provide a deterministic method for selecting a column sub-matrix. We also present a bound for both the Frobenius and the spectral norms of the pseudoinverse of the sampled matrix, and we show that the bound is asymptotically optimal. The main tool for proving this result is the method of interlacing families of polynomials developed by Marcus, Spielman and Srivastava. This idea also yields a deterministic greedy selection algorithm that produces the sub-matrix promised by our result.
1.1. Subset selection for matrices
Subset selection for matrices aims to select a column sub-matrix $A_S$ from a given matrix $A \in \mathbb{R}^{m \times n}$ with $m \le n$ such that the sampled matrix is well-conditioned. For convenience, we will assume that $A$ is full-rank, i.e., $\operatorname{rank}(A) = m$. Given a subset $S \subseteq [n] := \{1, \ldots, n\}$, the cardinality of $S$ is denoted by $|S|$. We use $A_S$ to denote the sub-matrix of $A$ obtained by extracting the columns of $A$ indexed by $S$, and $A_S^\dagger$ to denote the Moore-Penrose pseudoinverse of $A_S$. Let $k$ be a sampling parameter. We can state subset selection for matrices as follows:
Problem 1.1. Find a subset $S \subseteq [n]$ with cardinality at most $k$ such that $\operatorname{rank}(A_S) = m$ and $\|A_S^\dagger\|_\xi$ is minimized, i.e.,
$$\min_{S \subseteq [n],\; |S| \le k,\; \operatorname{rank}(A_S) = m} \|A_S^\dagger\|_\xi,$$
where $\xi \in \{2, F\}$; $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the spectral and the Frobenius matrix norm, respectively.
Problem 1.1 arises in many applied areas, such as preconditioning for solving linear systems, sensor selection , graph signal processing [12, 34], and feature selection in -means clustering [7, 8]. In , Avron and Boutsidis show an interesting connection between Problem 1.1 and the combinatorial problem of finding a low-stretch spanning tree in an undirected graph. The subset selection problem has also been studied in the statistics literature. For instance, for , the solution to Problem 1.1 is the statistically optimal design for linear regression [15, 28].
One simple method for solving Problem 1.1 is to evaluate the performance of all possible subsets of size , but this is evidently computationally expensive unless or is very small. In , Çivril and Magdon-Ismail study the complexity of the spectral norm version of Problem 1.1 and show that this problem is NP-hard. Several heuristics have been proposed to approximately solve the subset selection problem. Section 1.3 provides a summary of known results from the prior literature.
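As a concrete illustration of the exhaustive method, a minimal sketch (the helper `brute_force_subset` and the random test matrix are ours, for illustration only):

```python
import itertools
import numpy as np

def brute_force_subset(A, k, ord="fro"):
    """Search all column subsets S of size k and return one minimizing the
    norm of the pseudoinverse of A[:, S].  The search space has C(n, k)
    candidates, so this is only feasible when n and k are very small."""
    m, n = A.shape
    best_S, best_val = None, np.inf
    for S in itertools.combinations(range(n), k):
        A_S = A[:, list(S)]
        if np.linalg.matrix_rank(A_S) < m:  # require the sampled matrix to be full-rank
            continue
        # ord="fro" gives the Frobenius norm, ord=2 the spectral norm
        val = np.linalg.norm(np.linalg.pinv(A_S), ord=ord)
        if val < best_val:
            best_S, best_val = S, val
    return best_S, best_val

# Pick 3 of 6 columns of a random 3 x 6 matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
S, val = brute_force_subset(A, 3)
```

Even for moderate sizes (say $n = 100$, $k = 20$) the number of candidates is astronomically large, which motivates the deterministic selection methods studied below.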
1.2. Our contribution
In this paper we consider a generalized version of subset selection for matrices in which a matrix $B$ is fixed at the beginning, and we then complement $B$ by adding columns of $A$ so that the pseudoinverse of the augmented matrix $[B\ A_S]$ has as small a Frobenius or spectral norm as possible. Usually, $B$ is chosen as a column sub-matrix of $A$. Keeping a fixed block in this way is useful if we already know that such a block has some distinguished properties. We state the problem as follows:
Problem 1.2. Suppose that $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{m \times \ell}$ with $\operatorname{rank}(A) = m$ and $\ell < m$. Find a subset $S \subseteq [n]$ with cardinality at most $k$ such that $[B\ A_S]$ is full-rank and $\|[B\ A_S]^\dagger\|_\xi$ is minimized, i.e.,
$$\min_{S \subseteq [n],\; |S| \le k,\; \operatorname{rank}([B\ A_S]) = m} \|[B\ A_S]^\dagger\|_\xi,$$
where $\xi \in \{2, F\}$; $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the spectral and the Frobenius matrix norm, respectively.
We would like to mention that the Frobenius norm version of Problem 1.2 was considered in . If we take , then Problem 1.2 reduces to Problem 1.1. Hence, the results presented in this paper also provide a solution to Problem 1.1. We next state the main result of this paper. For convenience, for , throughout this paper we set
We have the following result for Problem 1.2.
Suppose that and with and . Then for any fixed , there exists a subset with cardinality such that is full-rank and for both ,
The proof of Theorem 1.3 provides a deterministic algorithm for computing the subset in time , where is the exponent of matrix multiplication. We will introduce it in Section 4.
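For contrast with the interlacing-family algorithm of Section 4, a naive greedy heuristic for the same task can be sketched as follows. This is our own illustrative baseline with no approximation guarantee, not the algorithm of this paper; `greedy_augment` is a hypothetical helper name:

```python
import numpy as np

def greedy_augment(A, B, k):
    """Naive greedy heuristic: starting from the fixed block B, repeatedly
    append the column of A that minimizes the Frobenius norm of the
    pseudoinverse of the augmented matrix [B, A_S]."""
    m, n = A.shape
    S = []
    for _ in range(k):
        best_j, best_val = None, np.inf
        for j in range(n):
            if j in S:
                continue
            M = np.hstack([B, A[:, S + [j]]])   # candidate augmented matrix
            val = np.linalg.norm(np.linalg.pinv(M), "fro")
            if val < best_val:
                best_j, best_val = j, val
        S.append(best_j)
    return S

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
B = A[:, :1]                 # fix the first column of A as the block B
S = greedy_augment(A, B, 2)  # add two more columns greedily
```

Each greedy step here costs a full pseudoinverse per candidate column; the algorithm of Section 4 instead tracks characteristic polynomials, which is what makes its complexity linear in the number of columns.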
If we take in Theorem 1.3, then we can obtain the following corollary:
Suppose that with . Then for any fixed , there exists a subset with cardinality such that and for both ,
1.3. Related work
1.3.1. Lower bounds
The lower bound is defined as the non-negative number such that for every subset of cardinality , there exists a matrix satisfying
The lower bounds for Problem 1.1 have been developed in . For , Theorem in  shows the bound is ; for and , Theorem in  shows the bound is . The approximation bound in Corollary 1.4 asymptotically matches those bounds.
1.3.2. Restricted invertibility principle
The restricted invertibility problem asks whether one can select a large number of linearly independent columns of and estimate the norm of the restricted inverse. More precisely, one wants to find a subset , with cardinality as large as possible, such that for all , and to estimate the constant . In , Bourgain and Tzafriri study the restricted invertibility problem and show its applications in geometry and analysis. Later, their results were improved in [30, 32, 29]. In , Marcus, Spielman and Srivastava employ the method of interlacing families of polynomials to sharpen this result, giving a simple proof of the restricted invertibility principle. One can see  for a survey of recent developments in restricted invertibility.
Problem 1.1 differs from the restricted invertibility problem. In Problem 1.1, we require , while the restricted invertibility problem only considers the case where . Our proof of Theorem 1.3 is inspired by the method used by Marcus, Spielman and Srivastava  to prove the restricted invertibility principle. We introduce the main idea of the proof in Section 1.4.
1.3.3. Approximation bounds for
We first focus on and , and present known bounds for the approximation ratio
In [2, 16, 17], the authors develop a greedy removal algorithm in which one "bad" column of is removed at each step. They show that this algorithm can find a subset such that in time. If is fixed, the approximation bound in [2, 16, 17] is , which is the same as that of Corollary 1.4.
In , the Frobenius norm version of Problem 1.2 was considered by Youssef. Let be the fixed matrix chosen at the beginning. Theorem in  shows that for any sampling parameter , one can produce a subset in time, together with an upper bound on . Note that
and hence . Theorem 1.3 is therefore applicable to a wider range of the sampling parameter .
1.3.4. Approximation bounds for
For , Corollary in  gives an algorithm for computing which runs in time and yields the bound
If is fixed, the asymptotic bound in (3) is , which is larger than that in Corollary 1.4. For the spectral norm, to our knowledge, Problem 1.2 has not been considered in previous work, and Theorem 1.3 provides the first approximation bound, as well as the first deterministic algorithm, for Problem 1.2.
1.3.5. Approximation bounds for both
In , a deterministic algorithm is also presented for both . The algorithm, which runs in time , outputs a set with satisfying
we obtain that
Hence our result in Corollary 1.4 improves the bound in (4). In particular, when tends to , the approximation bound in (4) goes to infinity while our bound remains finite. Hence, our bound is far better than the one in (4) when is close to .
Many randomized algorithms have been developed for solving Problem 1.1 (see ). In this paper, we focus on deterministic algorithms. Motivated by the proof of Theorem 1.3, we introduce a deterministic algorithm in Section 4 that outputs a subset such that
for any fixed . As shown in Theorem 4.1, the complexity of the algorithm is , where is the exponent of matrix multiplication. We emphasize that our algorithm is faster than all of the algorithms mentioned in Sections 1.3.3 and 1.3.4 when is large enough, since the computational cost of each of those algorithms contains a factor of , while the time complexity of our algorithm is linear in .
Note that the time complexity of the algorithm mentioned in Section 1.3.5 is much better than that of our algorithm. However, as noted above, the approximation bound obtained by our algorithm is far better than the one provided by that algorithm. Moreover, our algorithm solves both Problem 1.1 and Problem 1.2, while all of the other algorithms work only for Problem 1.1.
1.4. Our techniques
Our proof of Theorem 1.3 builds on the method of interlacing families, a powerful technique developed in [22, 23] (see also [24, 25]) by Marcus, Spielman and Srivastava in their work on the solution to the Kadison-Singer problem. Recall that an interlacing family of polynomials always contains a polynomial whose -th largest root is at least the -th largest root of the sum of the polynomials in the family (the expected polynomial).
Our selection is based on the observation that the space spanned by the columns of the matrix is actually the space spanned by the columns of a matrix whose rows consist of the left singular vectors of . Note that the left singular vectors form a set of orthonormal vectors, so , which is sometimes called the "isotropic" case. We then consider subset selection in the isotropic case while fixing at the beginning, where is a sub-matrix of corresponding to . We prove that if is selected by randomly sampling columns from without replacement, the associated characteristic polynomials form an interlacing family. This implies that there is a subset such that the smallest root of the characteristic polynomial of is at least the smallest root of the expected characteristic polynomial, i.e., a certain sum of those characteristic polynomials. It then remains to give a lower bound on the smallest root of this expected characteristic polynomial. We do this by using the lower barrier function argument [4, 29, 23], together with an analysis of the behavior of the roots of a real-rooted polynomial under the operator .
2.1. Notations and Lemmas
We use to denote the operator that performs partial differentiation in . We say that a univariate polynomial is real-rooted if all of its coefficients and roots are real. For a real-rooted polynomial , we let and denote the smallest and the largest root of , respectively. We use to denote the th largest root of . Let and be two sets and we use to denote the set of elements in but not in . We use to denote the expectation of a random variable.
Singular Value Decomposition. For a matrix , we denote the operator norm and the Frobenius norm of by and , respectively. The (thin) singular value decomposition (SVD) of of rank is
with singular values . Here, is some rank parameter . The matrices and contain the left singular vectors of ; and similarly, the matrices and contain the right singular vectors of . We see that and .
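The orthonormality of the singular vector matrices and the expression of both norms through the singular values are easy to check numerically. A minimal sketch with a random test matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))          # full row rank with probability 1

# Thin SVD: A = U @ diag(s) @ Vt with U (4 x 4), s >= 0 decreasing, Vt (4 x 6).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-12))               # numerical rank

# Left/right singular vectors are orthonormal, and the factorization is exact.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(4))
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Spectral norm = largest singular value; Frobenius norm = l2 norm of s.
assert np.isclose(np.linalg.norm(A, 2), s[0])
assert np.isclose(np.linalg.norm(A, "fro"), np.linalg.norm(s))
```

These are exactly the two norm identities used throughout the approximation bounds in this paper.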
Moore-Penrose pseudo-inverse. Suppose that and its thin SVD is . We write for the Moore-Penrose pseudo-inverse of , where is the inverse of . It has the following properties.
Lemma 2.1 (, Fact ).
Let and . If or , then .
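One classical identity of this kind, which we take here as an assumption about the intended reading of Lemma 2.1, is that $(AB)^\dagger = B^\dagger A^\dagger$ whenever $A$ has full column rank and $B$ has full row rank. It can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))   # full column rank (almost surely)
B = rng.standard_normal((3, 4))   # full row rank (almost surely)

# Sufficient condition: full column rank times full row rank,
# then (A B)^+ = B^+ A^+.
lhs = np.linalg.pinv(A @ B)
rhs = np.linalg.pinv(B) @ np.linalg.pinv(A)
assert np.allclose(lhs, rhs)
```

For generic square factors without these rank conditions the reverse-order law can fail, which is why the next lemma singles out the nonsingular case.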
In general, if is not full rank. However, if is a nonsingular square matrix, the following lemma shows that . Lemma 2.2 is useful in our argument, and we believe it is of independent interest.
Let be an invertible matrix. Then for any , .
Set . Then . It suffices to prove
Let be the singular value decomposition of , where and are two unitary matrices,
with and . Note that
Recall that . Then
implies (5). Denote the standard basis by . Since is invertible, the linear systems and have the same solutions. Hence for . This implies
Jacobi’s formula and Jensen’s Inequality.
Lemma 2.3 (Jacobi’s formula).
Let and be two square matrices. Then,
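For invertible $A$, Jacobi's formula specializes to $\frac{d}{dt}\det(A + tB)\big|_{t=0} = \det(A)\,\operatorname{tr}(A^{-1}B)$. A finite-difference sanity check with random matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # comfortably invertible
B = rng.standard_normal((4, 4))

# Analytic derivative from Jacobi's formula (invertible case).
analytic = np.linalg.det(A) * np.trace(np.linalg.solve(A, B))

# Central finite difference of t -> det(A + t B) at t = 0.
h = 1e-5
numeric = (np.linalg.det(A + h * B) - np.linalg.det(A - h * B)) / (2 * h)
assert np.isclose(analytic, numeric, rtol=1e-4, atol=1e-6)
```

In this paper the formula is applied to characteristic polynomials, i.e., to derivatives of determinants in the variable of the polynomial.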
We will use Jensen's inequality to obtain a lower bound on the sum of a certain concave function.
Lemma 2.4 (Jensen’s Inequality).
Let be a function from to . Then is concave if and only if
We also need the following lemma.
Lemma 2.5 (, Fact ).
If is an invertible matrix, then for any vector ,
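A standard identity of this form is the matrix determinant lemma, $\det(A + vv^\top) = \det(A)\,(1 + v^\top A^{-1} v)$; we take this reading as an assumption here. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # invertible
v = rng.standard_normal(5)

# Matrix determinant lemma: det(A + v v^T) = det(A) * (1 + v^T A^{-1} v).
lhs = np.linalg.det(A + np.outer(v, v))
rhs = np.linalg.det(A) * (1 + v @ np.linalg.solve(A, v))
assert np.isclose(lhs, rhs)
```

This is precisely the tool that turns the rank-one updates $M + v_i v_i^\top$ appearing in the characteristic polynomials below into tractable scalar expressions.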
2.2. Interlacing Families
Our proof of Theorem 1.3 builds on the method of interlacing families, a powerful technique developed in [22, 23] by Marcus, Spielman and Srivastava in their work on the solution to the Kadison-Singer problem [19, 9, 10, 11, 23, 26].
Let and be two real-rooted polynomials. We say interlaces if
We say that polynomials have a common interlacing if there is a polynomial so that interlaces for each .
Following , we define the notion of an interlacing family of polynomials as follows.
Definition 2.6 (, Definition 2.5).
An interlacing family consists of a finite rooted tree and a labeling of the nodes by monic real-rooted polynomials , with two properties:
Every polynomial corresponding to a non-leaf node is a convex combination of the polynomials corresponding to the children of .
For all nodes with a common parent, the polynomials have a common interlacing. (This condition is equivalent to saying that all convex combinations of the children of a node are real-rooted; the equivalence is implied by Helly's theorem and Lemma 2.9.)
We say that a set of polynomials form an interlacing family if they are the labels of the leaves of such a tree.
The following lemma, which was proved in [24, Theorem ], shows the utility of forming an interlacing family.
Lemma 2.7 (, Theorem ).
Let be an interlacing family of degree polynomials with root labeled by and leaves by . Then for all indices , there exist leaves and such that
In Section 3, we will prove that the polynomials obtained by average subset selection form an interlacing family. According to the above definition, this requires establishing the existence of certain common interlacings. The following lemma will be used to establish them.
Lemma 2.8 (, Claim ).
If is a symmetric matrix and are vectors in , then the polynomials
have a common interlacing.
The following lemma shows that having a common interlacing is equivalent to the real-rootedness of all convex combinations.
Lemma 2.9 (, Theorem ).
Let be real-rooted (univariate) polynomials of the same degree with positive leading coefficients. Then have a common interlacing if and only if is real-rooted for all convex combinations .
2.3. Lower barrier function and properties
In this section, we introduce the lower barrier potential function from [4, 23]. For a real-rooted polynomial , one can use the evolution of this barrier function to track the approximate locations of the roots .
For a real-rooted polynomial with roots , define the lower barrier function of as
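Under the normalization $\Phi_p(b) = \sum_i (\lambda_i - b)^{-1}$ from [4, 23] (the one we assume here), the barrier can be evaluated directly from the roots:

```python
import numpy as np

def lower_barrier(roots, b):
    """Lower barrier potential Phi_p(b) = sum_i 1/(lambda_i - b) of a
    real-rooted polynomial with the given roots, at a point b strictly
    below all of them."""
    roots = np.asarray(roots, dtype=float)
    assert b < roots.min(), "b must lie strictly below the smallest root"
    return float(np.sum(1.0 / (roots - b)))

# The barrier decays as b -> -infinity and blows up as b approaches the
# smallest root from below, so small barrier values certify that b is
# safely separated from the root region.
roots = [1.0, 2.0, 5.0]
far, near = lower_barrier(roots, 0.0), lower_barrier(roots, 0.9)
```

This monotone blow-up is exactly what the barrier argument below exploits to push a lower bound on the smallest root through applications of differential operators.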
We have the following technical lemma for the lower barrier function. This result can be obtained from Lemma in . We include a proof here for completeness.
Suppose that is a real-rooted polynomial and . Suppose that and
Suppose that the degree of is and its zeros are . We need to prove . According to
we have . Noting that , we obtain that . Next we will express in terms of and :
wherever all quantities are finite, which happens everywhere except at the zeros of and . Since is strictly below the zeros of both, it follows that:
So (8) is equivalent to
By expanding and in terms of the zeros of , we can see that (8) is equivalent to
Noting , we have
as desired. Here the first and the second inequalities are due to , i.e., and the Cauchy-Schwarz inequality. ∎
3. Proof of Theorem 1.3
In this section, we present the proof of Theorem 1.3. Our proof yields a deterministic greedy algorithm, which will be proposed in Section 4. To state our proof clearly, we introduce the following result, postponing its proof to the end of this section.
Let which satisfies . Assume that with . Let be a sub-matrix of whose columns are indexed by . Set . Then for any fixed there exists a subset of size such that
where is defined by (1).
Using this theorem, we next present the proof of Theorem 1.3.
Proof of Theorem 1.3.
Let be the SVD of . Suppose that and are two index sets such that and .
Recall that , which implies that . Applying Theorem 3.1 with and , we obtain that there exists a subset with size such that
Consider the left side of (2), we have
where follows from standard properties of matrix norms together with the definition of the pseudoinverses of and , and follows from (9). It remains to give an upper bound on . Note that
The rest of this section is devoted to proving Theorem 3.1 using the method of interlacing families. The proof consists of two main parts. First, we prove that the characteristic polynomials of the matrices arising in Theorem 3.1 form an interlacing family, and we derive an expression for the expected characteristic polynomial (the sum of the polynomials in the family). Second, we use the barrier function argument to establish a lower bound on the smallest zero of the expected characteristic polynomial.
3.1. Interlacing family for subset selection
In this subsection, we show that the characteristic polynomials obtained by average subset selection over , while keeping the given matrix fixed, form an interlacing family.
Let the columns of be the vectors and let be a given matrix with . Since , we obtain that . Denote the nonzero singular values of as . For each , set
For any fixed set of size less than , we define the polynomial
where the expectation is taken uniformly over sets of size containing . Building on the ideas of Marcus-Spielman-Srivastava , we can derive expressions for the polynomials .
We begin with the following result.
Suppose that and . Then
holds for every subset of size .
According to Lemma 2.5, we have
Since and , we obtain that