Structured Matrix Estimation and Completion
We study the problem of matrix estimation and matrix completion under a general framework. This framework includes several important models as special cases, such as the Gaussian mixture model, the mixed membership model, the bi-clustering model, and dictionary learning. We consider the optimal convergence rates in a minimax sense for estimation of the signal matrix under the Frobenius norm and under the spectral norm. As a consequence of our general result, we obtain minimax optimal rates of convergence for various special models.
Keywords: matrix completion, matrix estimation, minimax optimality
AMS 2000 subject classification: 62J99, 62H12, 60B20, 15A83
Over the past decade, there has been considerable interest in statistical inference for high-dimensional matrices. A fundamental model in this context is the matrix de-noising model, under which one observes a matrix where is an unknown non-random matrix of interest, and is a random noise matrix. The aim is to estimate from such observations. Often in applications, some of the entries of are missing. The problem of reconstructing the signal matrix given partial observations of its entries is known as the matrix completion problem. Substantial research in recent years has been devoted to accurate matrix completion methods.
In general, the signal cannot be recovered consistently from noisy and possibly missing observations. If we only know that is an arbitrary matrix, the guaranteed error of estimating from noisy observations can be prohibitively high. However, if has additional structure, one can expect to estimate it with high accuracy from a moderate number of noisy observations. The algorithmic and analytical tractability of the problem depends on the type of structural model adopted. A popular assumption in the matrix completion literature is that the unknown matrix is of low rank or can be well approximated by a low rank matrix. Significant progress has been made on low rank matrix estimation and completion problems; see, e.g., [9, 8, 19, 26, 31, 32, 16, 28, 7]. However, in several applications, the signal matrix can have structure beyond low rank. Some examples are as follows.
Biology. Biological data are sometimes expected to have clustering structure. For example, in gene microarray data, a large number of gene expression levels are measured under different experimental conditions. It has been observed in experiments that there is a bi-clustering structure on the genes . This means that, besides being of low rank, the gene microarray data can be rearranged to have an approximate block structure.
Computer Vision. To capture higher-level features in natural images, it is common to represent data as a sparse linear combination of basis elements , leading to sparse coding models. Unlike principal component analysis, which looks for low rank decompositions, sparse coding learns useful representations with a number of basis vectors that is often greater than the dimension of the data.
While there have been successful algorithmic advances exploiting such structures in these specific applications, not much is known about the fundamental limits of statistical inference for the corresponding models. A few exceptions are the stochastic block model [18, 29] and the bi-clustering model . However, many other structures of the signal matrix have not been analyzed.
The aim of this paper is to study a general framework for estimating structured matrices. We consider a unified model that includes the Gaussian mixture model, the mixed membership model , the bi-clustering model , and dictionary learning as special cases. We first study the optimal convergence rates in a minimax sense for estimation of the signal matrix under the Frobenius norm and under the spectral norm from complete observations on the sparsity classes of matrices. Then, we investigate this problem in the partial observations regime (the structured matrix completion problem) and study the minimax optimal rates under the same norms. We also establish sharp oracle inequalities for the suggested methods.
This section provides a brief summary of the notation used throughout this paper. Let be matrices in .
For a matrix , is its th entry, is its th column and is its th row.
The scalar product of two matrices of the same dimensions is denoted by
We denote by the Frobenius norm of and by the largest absolute value of its entries: . The spectral norm of is denoted by .
For , we denote by its -norm (the number of non-zero components of ), and by its -norm, .
We denote by the largest -norm of the rows of :
For any , we write for brevity .
Given a matrix , and a set of indices , we define the restriction of on as a matrix with elements if and otherwise.
The notation and (abbreviated to and when there is no ambiguity) stands for the identity matrix and the matrix with all entries 0, respectively.
We denote by the cardinality of a finite set , by the integer part of , and by the smallest integer greater than .
We denote by the covering number, under the Frobenius norm, of a set of matrices.
3 General model and examples
Assume that we observe a matrix with entries
where are the entries of the unknown matrix of interest , the values are independent random variables representing the noise, and are i.i.d. Bernoulli variables with parameter such that is independent of .
Model (1) is called the matrix completion model. Under this model, an entry of matrix is observed with noise (independently of the other entries) with probability , and it is not observed with probability . We can equivalently write (1) in the form
where is a matrix with entries
We denote by the probability distribution of satisfying (1) and by the corresponding expectation. When there is no ambiguity, we abbreviate and to and , respectively.
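Model (1) is straightforward to simulate. The following sketch generates a noisy, partially observed matrix with an i.i.d. Bernoulli mask; all dimensions, the noise level, and the sampling probability are our illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

n1, n2, p = 50, 40, 0.3                  # dimensions and sampling probability (our choices)
X = rng.standard_normal((n1, 2)) @ rng.standard_normal((2, n2))  # a low-rank signal
E = 0.1 * rng.standard_normal((n1, n2))  # independent zero-mean (here Gaussian) noise
B = rng.binomial(1, p, size=(n1, n2))    # i.i.d. Bernoulli(p) sampling mask

Y = B * (X + E)  # observed matrix: each noisy entry is revealed with probability p
```

Each unobserved entry of the resulting matrix is exactly zero, matching the multiplicative form of the model.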
We assume that are independent zero mean sub-Gaussian random variables. The sub-Gaussian property means that the following assumption is satisfied.
There exists such that, for all ,
We assume that the signal matrix is structured, that is, it can be factorized using sparse factors. Specifically, let be integers such that and . We assume that
Here, for we assume that and the set contains only one element, the identity matrix; for ,
where the set is a subset of called an alphabet. The set is defined analogously by replacing by . We will also consider the class defined analogously to , with the only difference that the inequality in (3) is replaced by the equality.
Choosing different values of , and different alphabets we obtain several well-known examples of matrix structures.
Sparse Dictionary Learning:
Stochastic Block Model:
Mixed Membership Model:
Here, the classes and are not exactly equal to but rather subclasses of and , respectively.
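One plausible reading of the factorization above, with row-sparse outer factors over a finite alphabet and a middle factor, can be simulated as follows; all names, dimensions, and the alphabet {0, 1} below are our own illustrative choices, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(2)

def row_sparse_binary(n, k, s, rng):
    """n x k matrix over the alphabet {0, 1} with exactly s non-zeros per row."""
    M = np.zeros((n, k))
    for i in range(n):
        M[i, rng.choice(k, size=s, replace=False)] = 1.0
    return M

n1, n2, k1, k2, s = 30, 20, 4, 3, 2
U = row_sparse_binary(n1, k1, s, rng)   # row-sparse left factor
V = row_sparse_binary(n2, k2, s, rng)   # row-sparse right factor
A = rng.standard_normal((k1, k2))       # middle factor (the identity in some examples)
X = U @ A @ V.T                         # structured signal, rank at most min(k1, k2)
```

Different choices of the inner dimensions and of the alphabet then yield the special cases listed above.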
Statistical properties of inference methods under the general model (1) are far from being understood. Some results were obtained in particular settings such as the Mixture Model and Stochastic Block Model.
Gaussian mixture models provide a useful framework for several machine learning problems such as clustering, density estimation and classification. There is quite a long history of research on mixtures of Gaussians. We mention only some of this work, including methods for estimating mixtures based on pairwise distances [14, 15], spectral methods [37, 23], or the method of moments [12, 5]. Most of these papers are concerned with the construction of computationally efficient methods but do not address the issue of statistical optimality. In , the authors provide precise information-theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation.
The Stochastic Block Model is a useful benchmark for the task of recovering community structure in graph data. More generally, any sufficiently large graph behaves approximately like a stochastic block model for some , which can be large. The problem of estimating the probability matrix of the stochastic block model under the Frobenius norm was considered by several authors [11, 39, 40, 10, 6], but the convergence rates obtained there are suboptimal. More recently, minimax optimal rates of estimation were obtained by Gao et al.  in the dense case and by Klopp et al.  in the sparse case.
Recently, a problem related to ours was studied by Soni et al. . These authors consider the case when the matrix to be estimated is the product of two matrices, one of which, called a sparse factor, has a small number of non-zero entries (in contrast, we assume row sparsity). The estimator studied in  is a sieve maximum likelihood estimator penalized by the norm of the sparse factor, where the sieve is chosen as a specific countable set.
4 Results for the case of finite alphabets
We start by considering the case of finite alphabets and and complete observations, that is . In this section, we establish the minimax optimal rates of estimation of under the Frobenius norm and we show that they are attained by the least squares estimator
where is a suitable class of structured matrices. We first derive an upper bound on the risk of this estimator uniformly over the classes . The following theorem provides an oracle inequality for the Frobenius risk of . Here and in what follows, we adopt the convention that for any . We also set for brevity
This theorem is proved in Section A.
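Over a finite class, the least squares estimator can in principle be computed by exhaustive search. A minimal sketch with a hypothetical alphabet {0, 1} and rank-one binary candidates, a toy stand-in for the structured classes above:

```python
import itertools
import numpy as np

def least_squares_finite(Y, candidates):
    """Return the candidate minimizing the squared Frobenius distance to Y."""
    return min(candidates, key=lambda X: float(np.sum((Y - X) ** 2)))

# Toy candidate class: outer products u v^T of binary vectors in {0,1}^2
vecs = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=2)]
candidates = [np.outer(u, v) for u in vecs for v in vecs]

Y = np.array([[1.1, 0.9], [0.05, -0.1]])  # noisy observation of [[1, 1], [0, 0]]
X_hat = least_squares_finite(Y, candidates)
print(X_hat)  # recovers [[1., 1.], [0., 0.]]
```

Exhaustive search is of course only feasible for very small classes; the theory above controls the statistical error of the minimizer, not the cost of computing it.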
Note that if the set and/or in the definition of contains only the identity matrix, the corresponding term and/or disappears from the upper bound of Theorem 1.
Under the assumptions of Theorem 1,
for a constant depending only on the cardinalities of and .
The next theorem provides a lower bound showing that the convergence rate of Corollary 2 is minimax optimal. This lower bound is valid for the general matrix completion model (1). In what follows, the notation stands for the infimum over all estimators taking values in .
Let the entries of matrix in model (2) be independent random variables with Gaussian distribution , and let the alphabets and contain the set . There exists an absolute constant such that
Furthermore, the same inequalities hold with in place of if and .
The three ingredients , and of the optimal rate arise from not knowing , and , respectively. The proof is based on constructing subsets of by fixing two of these parameters to get each of the three terms. The choice of when fixing the pairs and is based on a probabilistic method, namely Lemma 17. Similar techniques have been used in  to prove the lower bounds for sparse graphon estimation, and in .
Theorem 3 can be extended to more general sub-Gaussian distributions under an additional Kullback-Leibler divergence assumption. Assume that there is a constant such that the distribution of in model (1) satisfies
Let the alphabets and contain the set . Then there exists an absolute constant such that
5 Optimal rates in the spectral norm
In this section we derive the optimal rates of convergence of estimators of when the error is measured in the spectral norm. Interestingly, our results imply that these optimal rates coincide with those obtained for estimation of matrices with no structure. That is, the additional structure that we consider in the present paper does not have any impact on the rate of convergence of the minimax risk when the error is measured in the spectral norm.
The lower bound under the spectral norm can be obtained as a corollary of the lower bound under the Frobenius norm given by Theorem 3.
Under the assumptions of Theorem 3, there exists an absolute constant such that
The proof of this corollary is given in Section B.2.
To obtain matching upper bounds, we can use the soft thresholding estimator introduced in  or the hard thresholding estimator proposed in . These papers deal with the completion problem for low rank matrices in the context of the trace regression model, which is a slightly different setting.
Here, we consider the hard thresholding estimator. Set
The singular value decomposition of matrix has the form
where is the rank of , are the singular values of indexed in decreasing order, and (respectively, ) are the left (respectively, right) singular vectors of . The hard thresholding estimator is defined by the formula
where is the regularization parameter. In this section, we assume that the noise variables are bounded as stated in the next assumption.
For all we have , and there exists a positive constant such that
A more general case of sub-Gaussian noise can be treated as well; in this case, we can work on the event where is bounded by a suitable constant and show that the probability of the complement of is small.
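In matrix form, the hard thresholding step amounts to discarding every singular component whose singular value falls below the threshold. A minimal sketch, where the dimensions, the noise level, and the value of the threshold are illustrative choices of ours:

```python
import numpy as np

def hard_threshold(Y, lam):
    """Keep only the singular components of Y whose singular value exceeds lam."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    keep = s > lam
    return (U[:, keep] * s[keep]) @ Vt[keep]

rng = np.random.default_rng(1)
X = np.outer(rng.standard_normal(30), rng.standard_normal(20))  # rank-one signal
Y = X + 0.05 * rng.standard_normal((30, 20))                    # small noise
X_hat = hard_threshold(Y, lam=1.0)  # lam chosen above the noise spectral norm
```

With the threshold above the spectral norm of the noise, only the signal component survives, so here the output has rank one.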
The following theorem gives the upper bound on the estimation error of the hard thresholding estimator (8).
Assume that and let Assumption 2 hold. Let where is a sufficiently large absolute constant. Assume that . Then, with probability at least , the hard thresholding estimator satisfies
where is an absolute constant.
6 A general oracle inequality under incomplete observations
The aim of this section is to present a general theorem about the behavior of least squares estimators in the setting with incomplete observations. This theorem will be applied in the next section to obtain an analog of the upper bound of Theorem 1 for general alphabets. To state the theorem, it does not matter whether we consider a vector or a matrix setting; therefore, in this section, we deal with the vector model. Assume that we observe a vector with entries
for some unknown . Our goal is to estimate . Here, are independent random noise variables, and are i.i.d. Bernoulli variables with parameter such that is independent of .
When is known we can equivalently write (9) in the form
where now is a vector with entries
and . In this section, we denote by the probability distribution of satisfying (10).
Consider the least squares estimator of :
where is a subset of . For some element of we set .
Since is a decreasing left-continuous function of , we have
Let be independent random variables satisfying for some and all . Assume that there exists a constant such that for all . Then, for any , with -probability at least , the least squares estimator (11) satisfies the oracle inequality
where is an absolute constant.
The proof of this theorem is given in Section D.
Theorem 6 shows that the rate of convergence of the least squares estimator is determined by the value of satisfying the global entropy condition (12). This quantity is the critical covering radius that has appeared in the literature in different contexts; see, e.g., . In particular, this critical radius has been shown to determine the minimax optimal rates in nonparametric estimation problems. However, it may lead to slightly suboptimal rates (with deterioration by a logarithmic factor) in parametric estimation problems.
7 Structured matrix completion with general alphabets
For the structured matrix completion over infinite alphabets we consider the following parameter spaces:
Here, and are positive constants, and for ,
If , we assume that and we define as the set containing only one element, which is the identity matrix.
The difference from the class lies only in the fact that the elements of the matrix and those of the corresponding factor matrices are assumed to be uniformly bounded. This assumption is natural in many situations, for example, in the Stochastic Block Model or in recommendation systems, where the entries of the matrix are ratings. We introduce the bounds on the entries of the factor matrices in order to fix ambiguities associated with the factorization structure.
A key ingredient in applying Theorem 6 to this particular case is to find the covering number when . For any , any , and any , set
The following result is proved in Section E.
For any , , and we have
Note that Proposition 7 and (12) imply that for some numerical constant . Then, for the general scheme of matrix completion and general alphabets, the upper bound given by Corollary 8 departs from the lower bound of Theorem 3 by a logarithmic factor:
8 Adaptation to unknown sparsity
The estimators considered above require knowledge of the degrees of sparsity and of . In this section, we suggest a method that does not require such knowledge and is thus adaptive to the unknown degree of sparsity. Our approach is to estimate using a sparsity penalized least squares estimator. Let
For any let
In the following, denotes the random set of observed indices in model (1). In this section we denote by the following estimator
where is a regularization parameter. Note that this estimator does not require the knowledge of . The following theorem, proved in Appendix F, gives an upper bound on the estimation error of .
Assume that and . Let . Then, for any , with -probability at least the estimator (16) satisfies
where is an absolute constant.
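The shape of a sparsity-penalized criterion can be illustrated as follows. The finite candidate list and the linear penalty in the sparsity level are simplified stand-ins for the estimator in (16), not its exact form:

```python
import numpy as np

def penalized_ls(Y, mask, candidates, lam):
    """Pick, among (matrix, sparsity) pairs, the minimizer of the squared error
    on observed entries plus lam times the sparsity level."""
    def objective(item):
        X_cand, sparsity = item
        return float(np.sum(mask * (Y - X_cand) ** 2)) + lam * sparsity
    return min(candidates, key=objective)[0]

# Toy example: choose between a sparsity-0 and a sparsity-1 candidate
candidates = [(np.zeros((2, 2)), 0), (np.ones((2, 2)), 1)]
Y = np.full((2, 2), 0.9)   # noisy observation of the all-ones matrix
mask = np.ones((2, 2))     # complete observations
X_hat = penalized_ls(Y, mask, candidates, lam=0.1)
print(X_hat)  # the all-ones candidate wins despite its penalty
```

The penalty biases the selection toward sparser candidates; with the data fit dominating, the denser candidate is still chosen here.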
We finish this section with two remarks.
1. Structured matrix estimation. In the case of complete observations, that is , the estimator (16) coincides with the following estimator
Then, one can show that, with high probability, the following upper bound on the estimation error holds
Here we do not need an upper bound on . At the same time, the estimator (17) is adaptive to the sparsity parameter .
2. Sparse Factor Model. The Sparse Factor Model is studied in . With our notation, it corresponds to a particular case of and being the identity matrix, with the difference that we consider a row-sparse matrix , while is assumed component-wise sparse in . The convergence rates obtained in  are of the order (up to a logarithmic factor). This is greater than the upper bound given by Theorem 10, which, in this setting, is of the order .
Appendix A Proof of Theorem 1
Since is the least squares estimator on , and , we have that for any ,
Now we use the following lemma proved in Section G.1.
Let be a random matrix with independent sub-Gaussian entries. Introduce the notation
For any , the following inequalities hold, where is an absolute constant: