Stable recovery of the factors from a deep matrix product and application to convolutional network
We study a deep matrix factorization problem. It takes as input a matrix obtained by multiplying matrices (called factors). Each factor is obtained by applying a fixed linear operator to a short vector of parameters satisfying a model (for instance sparsity, grouped sparsity, non-negativity, constraints defining a convolution network…). We call the problem deep or multi-layer because the number of factors is not limited. In the practical situations we have in mind, we can typically have or . This work aims at identifying conditions on the structure of the model that guarantee the stable recovery of the factors from the knowledge of and the model for the factors.
We provide necessary and sufficient conditions for the identifiability of the factors (up to a scale rearrangement). We also provide a necessary and sufficient condition called deep-Null Space Property (because of the analogy with the usual Null Space Property in the compressed sensing framework) which guarantees that even an inaccurate optimization algorithm for the factorization stably recovers the factors.
We illustrate the theory with a practical example where the deep factorization is a convolutional network.
1.1 Content of the paper
We consider the following matrix factorization problem: let , , write , . We are given a matrix which is (approximately) the product of factors :
This paper investigates models/constraints imposed on the factors for which we can (up to obvious scale rearrangement) identify or stably recover the factors from .
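The "obvious scale rearrangement" can be illustrated with a minimal Python sketch. The three matrices, the scaling coefficients and the helper names `matmul`, `scale` and `product` are illustrative assumptions, not objects taken from the paper. The sketch rescales each factor by a coefficient, with the coefficients multiplying to 1, and checks that the product matrix is unchanged. This is why the factors can at best be recovered up to such a rescaling.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def scale(m, c):
    """Multiply every entry of the matrix m by the scalar c."""
    return [[c * x for x in row] for row in m]


def product(factors):
    """Product of a list of matrices, taken from left to right."""
    out = factors[0]
    for f in factors[1:]:
        out = matmul(out, f)
    return out


# Three hypothetical factors (values chosen so that float arithmetic is exact).
factors = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[3.0, 0.0], [1.0, 1.0]],
    [[1.0, 1.0], [2.0, 0.0]],
]

# Scaling coefficients whose product equals 1: 2 * 0.25 * 2 = 1.
lambdas = [2.0, 0.25, 2.0]
rescaled = [scale(f, c) for f, c in zip(factors, lambdas)]

X = product(factors)
X_rescaled = product(rescaled)

print(X)           # product of the original factors
print(X_rescaled)  # same matrix, although every factor has changed
```

Both products coincide although no rescaled factor equals the original one, so any recovery statement can only hold modulo this ambiguity.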
This question is of paramount importance in many fields including statistics and machine learning, vision, image processing, information theory and numerical linear algebra. In practical applications contains some data or represents a linear operator. It is often only specified indirectly and/or approximately. Notice that might be a simple vector.
We now describe the structures imposed on the factors that we investigate in this paper. The factors are required to be structured matrices defined by a number of unknown parameters. More precisely, for , let
be a linear map. This linear map might for instance simply map the values in to prescribed locations in the matrix . In that case, when is small, all the factors are -sparse. Another insightful example is when uses to construct a Toeplitz or Circulant matrix. More complex examples include those where the matrix is obtained by combining several smaller Toeplitz or circulant matrices. In the latter case, the product can be the matrix corresponding to a convolutional network. This example is presented in Section 7. Another interesting example is obtained when (or ) is defined by the product (or ), where the columns of contain learning samples and (or ) has the form (1). In this case, contains an analysis of the samples.
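As a hedged illustration of the circulant example, the following Python sketch builds a circulant matrix from a short kernel and checks that the product of two such factors is again a circulant matrix, namely the one built from the circular convolution of the two kernels. The indexing convention `C[i][j] = h[(i - j) mod n]`, the size `n`, the kernel values and the helper names are assumptions made for the illustration. The paper's operators may use a different convention or block structure.

```python
def circulant(h, n):
    """Map a short kernel h (len(h) <= n) to the n x n circulant matrix
    with entries C[i][j] = padded_h[(i - j) % n]. The map is linear in h."""
    padded = list(h) + [0] * (n - len(h))
    return [[padded[(i - j) % n] for j in range(n)] for i in range(n)]


def circular_convolution(a, b, n):
    """Circular convolution of two kernels, both zero-padded to length n."""
    pa = list(a) + [0] * (n - len(a))
    pb = list(b) + [0] * (n - len(b))
    return [sum(pa[j] * pb[(k - j) % n] for j in range(n)) for k in range(n)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


n = 5
h1 = [1, 2]        # hypothetical kernel of the first layer
h2 = [3, 0, -1]    # hypothetical kernel of the second layer

M1 = circulant(h1, n)
M2 = circulant(h2, n)

# Product of the two structured factors ...
product = matmul(M1, M2)
# ... and the circulant matrix of the composed (convolved) kernel.
composed = circulant(circular_convolution(h1, h2, n), n)

print(product == composed)
```

Each factor is determined by a kernel much shorter than the number of matrix entries, which is the sense in which a few parameters define a structured factor.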
In addition to the structure induced by the operators , we also consider structure imposed on the vectors . We assume that we know a collection of models with the property that for every , is a given subset. We will assume that the parameters defining the factors are such that there exists such that . The typical examples we have in mind include models for which is the level set of a function. The function defines a prior on the parameters . For instance, when , might contain sparse vectors, impose grouped sparsity or co-sparsity. Still when , other examples impose non-negativity constraints, orthogonality (for low rank approximation), equality (in phase retrieval), an upper-bound on the norm of the columns of (in dictionary learning). The examples are numerous.
We now precisely state the problem considered in this paper. We assume a collection of models is known and that there exists a model defined by an unknown and parameters , with for all , and we are only given the product of matrices
for an unknown error term . Our goal is to establish sharp and, as far as possible, simple conditions guaranteeing that we can recover the parameters with an accuracy comparable to . The typical solver we have in mind minimizes
Doing so, the stability results apply to a larger class of solvers. In particular, it can apply to heuristics or (for instance) to solvers that minimize an approximate objective function or an objective function that is not based on the Euclidean distance. Our statements say that the smaller the smaller the error on the estimate of the parameters .
The stability question boils down to an identifiability issue when the error term vanishes and the minimization is exact. We therefore begin to study necessary and sufficient conditions for the uniqueness of the factors satisfying both and
The minimization problem (2) is non-convex because the product is not (jointly) linear. The constraint might also be non-convex. As a consequence, solving (2), or even finding an efficient heuristic for it, might be difficult or impossible for some instances of the problem. We do not address the numerical issues related to the minimization of (2). In this regard, although the identifiability is desired when interpreting the solution, it implies that the minimizer of (2) is unique. Intuitively, this is expected to reduce the size of the convergence basin and complicate the numerical resolution of (2). In that sense, a sharp condition of identifiability separates identifiable problems and problems which better lend themselves to global optimization. Outside of this crude intuition and the fact that, we do not investigate whether (2) can actually be minimized or not (see  for an example of such a result).
When , (3) and (2) are in general highly non-linear/non-convex (even when is known and is a Euclidean space), so the uniqueness and stability of the solution is not easy to characterize. This is due to the product of the factors. However, our results also apply when . When , the questions and results stated in this work are identical or close to well established existing results in compressed/compressive sensing [8, 15, 7, 13]. Although it is not our primary interest, for pedagogical reasons, we also express our definitions and statements in this setting.
The main contributions of this paper are:
In the absence of noise (see Section 5):
We establish a simple geometric condition on the intersection of two sets which is necessary and sufficient to guarantee the identifiability of the parameters defining the factors (Proposition 7).
We also provide a simple algorithm to compute this rank that works in many reasonable cases (Proposition 4).
In the presence of noise when considering an inaccurate minimizer (see Section 6):
We establish that when the deep-Null Space Property holds we can recover the factors with an accuracy bounded above by the sum of the noise level and a quantity reflecting the minimization inaccuracy (Theorem 5).
We establish the converse statement: if we are able to recover the factors with an accuracy upper bounded by the noise level then the deep-Null Space Property holds (Theorem 6).
We specialize the above results to convolutional networks and establish a simple condition, that can be computed in many contexts, such that
In order to establish these results, we investigate and recall several results on tensors, tensor rank and the Segre embedding (see Section 3). In particular, we investigate the Lipschitz continuity (Theorem 2) and stable recovery (Theorem 1) for the Segre embedding.
1.2 Bibliographical landmarks
Matrix factorization problems are ubiquitous in statistics, information theory and data representation. The simplest version consists of a model with one layer (i.e., ) and . This is the usual linear approximation problem. In this case, can be vectorized to form a column vector and the operator simply multiplies the column vector by a fixed (rectangular) matrix. Typically, in this setting, the latter matrix has more rows than columns and, when a solution to (3) exists, its uniqueness depends on the column rank of the matrix.
The above linear approximation is often improved using a “non-linear approximation” . In this framework, the fixed matrix has more columns than rows and contains a sparse vector whose support is also optimized to better approximate the data. The identifiability and stable recovery for this problem have been intensively studied and have given rise to a new application named compressed/compressive sensing (see [8, 15]). Some compressed sensing statements (especially the ones guaranteeing that any minimizer of the problem stably recovers the unknown) are special cases () of the statements provided in this paper. We will not perform a complete review of compressed sensing but would like to highlight the Null Space Property described in . The fundamental limits of compressed sensing (for a solution of the problem) have been analyzed in detail in .
The questions we are studying are mostly relevant (and new) when . In the case of such models, the non-linearity comes from the multiplicative nature of (3) and the identifiability and stable recovery are not easily guaranteed. Recently, sparse coding and dictionary learning have been introduced (see  for an overview on the subject). In that framework, contains the data and (most of the time) people consider two layers: . The layer is an optimized dictionary of atoms and each column of contains the code (or coordinates) of the corresponding column in . In this case, the mapping maps a vector from a small vector space into a sparse matrix. The identifiability and stable recovery of the factors have been studied in many dictionary learning contexts, yielding guarantees on the approximate recovery of both an incoherent dictionary and sparse coefficients when the number of samples is sufficiently large (i.e., is large, in our setting). In , the authors developed local optimality conditions in the noiseless case, as well as sample complexity bounds for local recovery when is square and are iid Bernoulli-Gaussian. This was extended to overcomplete dictionaries in  (see also  for tight frames) and to the noisy case in . The authors of  provide exact recovery results for dictionary learning, when the coefficient matrix has Bernoulli-Gaussian entries and the dictionary matrix has full column rank. This was extended to overcomplete dictionaries in  and in  but only for approximate recovery. Finally,  provides such guarantees under general conditions which cover many practical settings.
Factorizations with have also been considered for the purpose of phase retrieval , blind deconvolution [2, 12, 30], blind deconvolution in random mask imaging , blind-deconvolution and blind-de-mixing , self-calibration  and Non-negative matrix factorization [29, 16, 28, 4]. Most of these papers use the same lifting property we are using. They further propose to convexify the problem and provide sufficient conditions for obtaining identifiability and stability. A more general bilinear framework is considered in , where the analysis shares similarities with the results presented here but is restricted to identifiability when .
When compared to these results, the scope of the present study is to consider the identifiability and the stability of the recovery for any , in a general context. The authors have announced preliminary versions of the results described here in . They are significantly extended here. To the best of our knowledge, little is known concerning the identifiability and the stability of matrix factorization when . The uniqueness of the factorization corresponding to the Fast Fourier Transform was proved in . Other results consider the identifiability of the factors which are sparse and random  and might even consider the presence of non-linearities between the layers to include the deep classification architectures .
The use of deep matrix factorization is classical. In particular, handcrafted deep matrix factorizations of a few particular matrices are used in many fields of mathematics and engineering. Most fast transforms, such as the Cooley-Tukey Fast Fourier Transform, the Discrete Cosine Transform and the Wavelet transform, are deep matrix products.
The construction of optimized deep matrix factorizations only started recently (see [10, 11] and references therein). In [10, 11, 36], the authors consider compositions of sparse convolutions organized according to a convolutional tree. In the simplified case studied in , is a vector, the vectors are the convolution kernels and each operator maps to a circulant (or block-circulant) matrix. The first layer corresponds to the coordinates/code of in the frame obtained by computing the compositions of convolutions along the unique branch of the tree. In , the authors consider a factorization involving several sparse layers. In that work, the authors simultaneously estimate the support and the coefficients of each sparse factor. They use this factorization to define an analogue of the Fast Fourier Transform for signals living on graphs  and later reworked this principle using old ideas . In , the authors consider a (deep) multi-resolution matrix factorization, inspired by the wavelet decomposition, where the factors are orthogonal and sparse. In [41, 42], the authors consider factors based on Householder reflectors and Givens rotations. In , the authors study a multi-layer Non-negative matrix factorization.
2 Notation and summary of the hypotheses
We continue to use the notation introduced in the introduction. For an integer , set .
We consider and and real valued tensors of order whose axes are of size , denoted by . The space of tensors is abbreviated . The entries of are denoted by , where . The index set is simply denoted . For , the entries of are (for we let etc.). We either write or .
A collection of vectors is denoted (i.e., using bold fonts). Our collections are composed of vectors of size and the vector is denoted . The entry of the vector is denoted . A vector not related to a collection of vectors is denoted by (i.e., using a light font). Throughout the paper we assume
We also assume that, for all , . They can however be equal or constant after a given .
All the vector spaces , , etc. are equipped with the usual Euclidean norm. This norm is denoted and the scalar product . In the particular case of matrices, corresponds to the Frobenius norm. We also use the usual norm, for , and denote it by .
Define an equivalence relation in : for any , , if and only if there exists such that
Denote the equivalence class of by .
We say that the zero tensor is of rank . We say that a non-zero tensor is of rank (or decomposable) if and only if there exists a collection of vectors such that is the outer product of the vectors , for , that is, for any ,
Let denote the set of tensors of rank or .
The rank of any tensor is defined to be
For , let
The superscript refers to optimal solutions. A set with a subscript means that is ruled out of the set. In particular, denotes the non-zero tensors of rank . Attention should be paid to since its definition is not straightforward (see (4)).
3 Facts on the Segre embedding and tensors of rank and
Parametrize by the map
The map is called the Segre embedding and is often denoted by in the algebraic geometry literature.
Identifiability of from : For and , if and only if .
Geometrical description of : is a smooth (i.e., ) manifold of dimension (see, e.g., , chapter 4, p. 103).
Geometrical description of : When , the singular locus has dimension strictly less than that of , and is a smooth manifold. This smooth manifold is of dimension when , and of dimension when (see, e.g., , chapter 5).
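The first standard fact above can be explored numerically with a minimal Python sketch. It rests on two assumptions, since the corresponding formulas are not reproduced in this text: the Segre embedding is taken to map a collection of vectors to their outer product (each tensor entry is the product of one entry from each vector), and two collections are taken to be equivalent when they differ by scalings whose product is 1. The function name `segre`, the dictionary representation of a tensor and the numerical values are illustrative choices. The sketch only checks the easy direction: equivalent collections have the same image.

```python
from itertools import product as index_product
from math import prod


def segre(h):
    """Outer product of the vectors in the collection h, stored as a dict
    mapping each index tuple (i_1, ..., i_K) to h[0][i_1] * ... * h[K-1][i_K]."""
    index_ranges = [range(len(v)) for v in h]
    return {idx: prod(v[i] for v, i in zip(h, idx))
            for idx in index_product(*index_ranges)}


# A hypothetical collection of K = 3 vectors of size S = 2
# (values chosen so that float arithmetic is exact).
h = [[1.0, 2.0], [4.0, -2.0], [0.5, 8.0]]

# An equivalent collection: scalings 4 * 0.5 * 0.5 = 1.
lambdas = [4.0, 0.5, 0.5]
g = [[c * x for x in v] for v, c in zip(h, lambdas)]

tensor_h = segre(h)
tensor_g = segre(g)

print(len(tensor_h))         # number of entries: S ** K
print(tensor_h == tensor_g)  # same rank-one tensor for both collections
```

The converse direction (equal tensors force equivalent collections, for non-zero tensors) is the substantive part of the identifiability statement and is not verified by this sketch.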
We can improve Standard Fact 1 and obtain a stability result guaranteeing that, if we know a rank tensor sufficiently close to , we approximately know . In order to state this, we need to define a metric on . This has to be considered with care since, whatever , the subset is not compact. In particular, considering
when goes to infinity, we easily construct examples that make the standard metric on equivalence classes useless (see footnote 1).
Footnote 1: For instance, if and are such that (for instance) , we have even though we might have (and therefore ). This does not define a metric. Also, when and are such that , whatever , we have Therefore, the Hausdorff distance between and is infinite for almost every pair . This metric is therefore not very useful in the present context.
This leads us to consider
The interest in this set comes from the fact that, whatever , the set is finite. Indeed, if the such that, for all , must all satisfy , i.e. .
For any , we define the mapping by
For any , is a metric on .
Notice that the sets and are finite and therefore the infimum in the definition of is reached. We also have whatever ,
Moreover, whatever and and there exist such that and
Using the above two properties, we can check that
As a consequence, the outer infimum in (6) is irrelevant and we have
Using this last property, we easily check that is a metric on . ∎
Using this metric, we can state that not only is uniquely determined by , but this operation is also stable.
Stability of from
Let and be such that . For all ,
Notice first that when the inequality is a straightforward consequence of the usual inequalities between norms. We therefore assume from now on that .
Throughout the proof, we consider and and assume that . We also assume that . We first prove the inequality when .
In order to do so, we consider
and assume, without loss of generality (otherwise, we can multiply one vector of and by to get this property and multiply back once the inequality has been established), that . We therefore have . Notice also that we have, under the above hypotheses,
Moreover, we consider the operator that extracts the signals of size that are obtained when freezing, at the index in a tensor , all coordinates but one. Formally, we denote
where for all and all
We have for all
We therefore have . This can be written . Similarly, we have .
Also, because of the definition of and , we are guaranteed that, whatever ,
The latter being independent of , we have . Unfortunately, unless for instance , it might occur that . However, if we consider
we have since and
In the sequel we will successively calculate upper bounds of and in order to find an upper bound of .
Upper bound of :
But we also have using the mean value theorem and (8)
We therefore finally obtain that
Upper bound of :
First, since , we know that there exists such that
Furthermore, we have for all
Also, if there is such that , since (11) holds, there necessarily exists another such that . If we replace by and replace by we remain in and can only make decrease. Repeating this process until all the ’s are non-negative, we can assume without loss of generality that
Second, the value appearing in (12), can be bounded by using bounds on and the identity
Qualitatively, the latter identity indeed guarantees that, as goes to , goes to . Let us now establish this quantitatively.
Finally, we get
In order to establish the property when and , we simply use the fact that
The following proposition shows that the upper bound in (7) cannot be improved by a significant factor.
There exist and such that , and
In the example, we consider and such that for all and all