
# Minimax Lower Bounds for Kronecker-Structured Dictionary Learning

Zahra Shakeri, Waheed U. Bajwa, Anand D. Sarwate Dept. of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey 08854
{zahra.shakeri, waheed.bajwa, anand.sarwate}@rutgers.edu
The work of the authors was supported in part by the National Science Foundation under awards CCF-1525276 and CCF-1453073, and by the Army Research Office under award W911NF-14-1-0295.
###### Abstract

Dictionary learning is the problem of estimating the collection of atomic elements that provide a sparse representation of measured/collected signals or data. This paper finds fundamental limits on the sample complexity of estimating dictionaries for tensor data by proving a lower bound on the minimax risk. This lower bound depends on the dimensions of the tensor and parameters of the generative model. The focus of this paper is on second-order tensor data, with the underlying dictionaries constructed by taking the Kronecker product of two smaller dictionaries and the observed data generated by sparse linear combinations of dictionary atoms observed through white Gaussian noise. In this regard, the paper provides a general lower bound on the minimax risk and also adapts the proof techniques for equivalent results using sparse and Gaussian coefficient models. The reported results suggest that the sample complexity of dictionary learning for tensor data can be significantly lower than that for unstructured data.

## I Introduction

Dictionary learning has recently received significant attention due to the increased importance of finding sparse representations of signals/data. In dictionary learning, the goal is to construct an overcomplete basis from input signals such that each signal can be described by a small number of atoms (columns) [1]. Although the existing literature has focused on one-dimensional data, many signals in practice are multi-dimensional and have a tensor structure: examples include 2-dimensional images and 3-dimensional signals produced via magnetic resonance imaging or computed tomography systems. Traditional dictionary learning techniques process multi-dimensional data after vectorizing the signals. This can result in poor sparse representations because the structure of the data is neglected [2].

In this paper we provide fundamental limits on learning dictionaries for multi-dimensional data with tensor structure: we call such dictionaries Kronecker-structured (KS). Several algorithms have been proposed to learn KS dictionaries [3, 4, 5, 6, 2, 7], but there has been little work on the theoretical guarantees of such algorithms. The lower bounds we provide on the minimax risk of learning a KS dictionary serve as a benchmark for evaluating the performance of existing algorithms.

In terms of relation to prior work, theoretical insights into classical dictionary learning techniques [8, 9, 10, 11, 12, 13, 14, 15, 16] have either focused on achievability results for existing algorithms [8, 9, 10, 11, 12, 13, 14] or on lower bounds on the minimax risk for one-dimensional data [15, 16]. The former works provide sample complexity results for reliable dictionary estimation based on appropriate minimization criteria [8, 9, 10, 11, 12, 13, 14]. Specifically, given a probabilistic model for sparse signals and a finite number of samples, a dictionary is recoverable, within some distance of the true dictionary, as a local minimum of some minimization criterion [12, 13, 14]. In contrast, works like Jung et al. [15, 16] provide minimax lower bounds for dictionary learning under several coefficient vector distributions and discuss a regime where the bounds are tight for some signal-to-noise ratio (SNR) values. In particular, they characterize the number of samples that suffices for reliable recovery of a dictionary within a local neighborhood of a given radius.

While our work is related to that of Jung et al. [15, 16], our main contribution is lower bounds on the minimax risk of dictionaries consisting of two coordinate dictionaries that sparsely represent 2-dimensional tensor data. The full version of this work generalizes the results to higher-order tensors [17]. The main approach taken in this regard is the well-understood technique of lower bounding the minimax risk in nonparametric estimation by the maximum probability of error in a carefully constructed multiple hypothesis testing problem [18, 19]. As such, our general approach is similar to the vector case [16]. Nonetheless, the major challenge in such minimax risk analyses is the construction of appropriate multiple hypotheses, which are fundamentally different in our problem setup due to the Kronecker structure of the true dictionary. In particular, for a dictionary $\mathbf{D} = \mathbf{A} \otimes \mathbf{B}$ consisting of the Kronecker product of two coordinate dictionaries $\mathbf{A} \in \mathbb{R}^{m_1 \times p_1}$ and $\mathbf{B} \in \mathbb{R}^{m_2 \times p_2}$, our analysis reduces the scaling of the sample complexity from one proportional to the number of degrees of freedom of an unstructured dictionary for vectorized data [16] to one proportional to the number of degrees of freedom of the two coordinate dictionaries, $p_1(m_1-1) + p_2(m_2-1)$. Our results hold even when one of the coordinate dictionaries is not overcomplete (note that $\mathbf{A}$ and $\mathbf{B}$ cannot both be undercomplete, since otherwise $\mathbf{D}$ would not be overcomplete). Like previous work [16], our analysis is local and our lower bounds depend on the distribution of the multidimensional data. Finally, some of our analysis relies on the availability of side information about the signal samples, which suggests that the lower bounds can be improved by deriving them in the absence of such side information.

Notational Convention: Underlined bold upper-case, bold upper-case, bold lower-case, and lower-case letters are used to denote tensors, matrices, vectors, and scalars, respectively. We write $[K]$ for $\{1, 2, \dots, K\}$. The $j$-th column of a matrix $\mathbf{X}$ is denoted by $\mathbf{x}_j$, while $\mathbf{X}_{\mathcal{I}}$ denotes the matrix consisting of the columns of $\mathbf{X}$ with indices in $\mathcal{I}$, $\sum(\mathbf{X})$ denotes the sum of all elements of $\mathbf{X}$, and $\mathbf{I}_p$ denotes the $p \times p$ identity matrix. Also, $\|\mathbf{x}\|_0$ and $\|\mathbf{x}\|_2$ denote the $\ell_0$ and $\ell_2$ norms of the vector $\mathbf{x}$, respectively, while $\|\mathbf{X}\|_2$ and $\|\mathbf{X}\|_F$ denote the spectral and Frobenius norms of $\mathbf{X}$, respectively.

We write $\mathbf{A} \otimes \mathbf{B}$ for the Kronecker product of two matrices $\mathbf{A} \in \mathbb{R}^{m_1 \times p_1}$ and $\mathbf{B} \in \mathbb{R}^{m_2 \times p_2}$: the result is an $m_1 m_2 \times p_1 p_2$ matrix. Given $\mathbf{A}$ and $\mathbf{B}$ with the same number of columns, we write $\mathbf{A} \ast \mathbf{B}$ for their Khatri-Rao product [20]: this is essentially the column-wise Kronecker product of the matrices. Given two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same dimensions, their Hadamard (element-wise) product is denoted by $\mathbf{A} \odot \mathbf{B}$. For matrices $\mathbf{D}_1$ and $\mathbf{D}_2$, we define their distance to be $\|\mathbf{D}_1 - \mathbf{D}_2\|_F$. We write $f \lesssim g$ if $f \le C g$ for some constant $C$.
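These matrix products are all one-liners in standard numerical libraries. The following sketch (in Python/NumPy, with arbitrary illustrative sizes) demonstrates the Kronecker, Khatri-Rao, and Hadamard products, the Frobenius distance, and the vectorization identity $\operatorname{vec}(\mathbf{B}\mathbf{X}\mathbf{A}^T) = (\mathbf{A} \otimes \mathbf{B})\operatorname{vec}(\mathbf{X})$ used in Section II:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))   # illustrative sizes only
B = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))

# Kronecker product: (2*4) x (5*3)
K = np.kron(A, B)
assert K.shape == (8, 15)

# Khatri-Rao product: column-wise Kronecker product (equal column counts required)
C = rng.standard_normal((4, 5))
KR = np.stack([np.kron(A[:, j], C[:, j]) for j in range(A.shape[1])], axis=1)
assert KR.shape == (8, 5)

# Hadamard product: element-wise (equal shapes required)
H = C * rng.standard_normal(C.shape)

# Frobenius distance between same-sized matrices
dist = np.linalg.norm(C - H, 'fro')

# vec(B X A^T) = (A kron B) vec(X), with column-major (Fortran-order) vectorization
lhs = (B @ X @ A.T).flatten(order='F')
rhs = K @ X.flatten(order='F')
assert np.allclose(lhs, rhs)
```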

## II Background and Problem Formulation

In the conventional dictionary learning setup, it is assumed that an observation $\mathbf{y} \in \mathbb{R}^m$ is generated via a fixed dictionary,

 $\mathbf{y} = \mathbf{D}\mathbf{x} + \mathbf{n}$, (1)

in which the dictionary $\mathbf{D} \in \mathbb{R}^{m \times p}$ is an overcomplete basis ($p > m$) with unit-norm columns, $\mathbf{x} \in \mathbb{R}^p$ is the coefficient vector, and $\mathbf{n} \in \mathbb{R}^m$ is the underlying noise vector. In contrast to this conventional setup, our focus in this paper is on second-order tensor data. Consider the 2-dimensional observation $\underline{\mathbf{Y}} \in \mathbb{R}^{m_2 \times m_1}$. Using any separable transform, $\underline{\mathbf{Y}}$ can be written as

 $\underline{\mathbf{Y}} = (\mathbf{T}_1^{-1})^T \underline{\mathbf{X}}\, \mathbf{T}_2^{-1}$, (2)

where $\underline{\mathbf{X}}$ is the matrix of coefficients and $\mathbf{T}_1$ and $\mathbf{T}_2$ are non-singular matrices transforming the columns and rows of $\underline{\mathbf{X}}$, respectively. Defining $\mathbf{A} \triangleq (\mathbf{T}_2^{-1})^T$ and $\mathbf{B} \triangleq (\mathbf{T}_1^{-1})^T$, we can use a property of Kronecker products [21], $\operatorname{vec}(\mathbf{B}\underline{\mathbf{X}}\mathbf{A}^T) = (\mathbf{A} \otimes \mathbf{B})\operatorname{vec}(\underline{\mathbf{X}})$, to get the following expression for $\mathbf{y} \triangleq \operatorname{vec}(\underline{\mathbf{Y}})$:

 $\mathbf{y} = (\mathbf{A} \otimes \mathbf{B})\mathbf{x} + \mathbf{n}$ (3)

for coefficient vector $\mathbf{x} \triangleq \operatorname{vec}(\underline{\mathbf{X}}) \in \mathbb{R}^p$ and noise vector $\mathbf{n} \in \mathbb{R}^m$, where $\mathbf{A} \in \mathbb{R}^{m_1 \times p_1}$, $\mathbf{B} \in \mathbb{R}^{m_2 \times p_2}$, $m \triangleq m_1 m_2$, and $p \triangleq p_1 p_2$. In this work, we assume $N$ independent and identically distributed (i.i.d.) noisy observations that are generated according to the model in (3). Concatenating these observations into $\mathbf{Y} \in \mathbb{R}^{m \times N}$, we have

 $\mathbf{Y} = \mathbf{D}\mathbf{X} + \mathbf{N}$, (4)

where $\mathbf{D} \triangleq \mathbf{A} \otimes \mathbf{B} \in \mathbb{R}^{m \times p}$ is the unknown KS dictionary, $\mathbf{X} \in \mathbb{R}^{p \times N}$ is the coefficient matrix, which we initially assume to consist of zero-mean random coefficient vectors with known distribution and covariance matrix $\Sigma_x$, and $\mathbf{N} \in \mathbb{R}^{m \times N}$ is additive white Gaussian noise (AWGN) with zero mean and variance $\sigma^2$.
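As a concrete instance of the model in (3) and (4), the following Python snippet (sizes and noise level are arbitrary illustrative choices, not values from the paper) synthesizes a KS dictionary with unit-norm columns and generates noisy observations:

```python
import numpy as np

rng = np.random.default_rng(1)
m1, p1, m2, p2, N = 4, 6, 3, 5, 200   # illustrative dimensions
sigma = 0.1                            # noise standard deviation

# coordinate dictionaries with unit-norm columns
A = rng.standard_normal((m1, p1)); A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((m2, p2)); B /= np.linalg.norm(B, axis=0)

# KS dictionary; Kronecker products of unit vectors are unit vectors,
# so D automatically has unit-norm columns
D = np.kron(A, B)
assert np.allclose(np.linalg.norm(D, axis=0), 1.0)

# zero-mean coefficients and AWGN, as in (4)
X = rng.standard_normal((p1 * p2, N))
Y = D @ X + sigma * rng.standard_normal((m1 * m2, N))
assert Y.shape == (m1 * m2, N)
```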

Our main goal in this paper is to derive conditions under which the dictionary $\mathbf{D}$ can possibly be learned from the noisy observations given in (4). In this regard, we assume the true KS dictionary consists of unit-norm columns and we carry out a local analysis. That is, the true KS dictionary is assumed to belong to a neighborhood around a fixed (normalized) reference KS dictionary $\mathbf{D}_0$, i.e., $\mathbf{D}_0 \in \mathcal{D}$ and $\mathbf{D} \in \mathcal{X}(\mathbf{D}_0, r)$, where

 $\mathcal{D} \triangleq \left\{\mathbf{D}' \in \mathbb{R}^{m \times p} : \mathbf{D}' = \mathbf{A}' \otimes \mathbf{B}',\ \mathbf{A}' \in \mathbb{R}^{m_1 \times p_1},\ \mathbf{B}' \in \mathbb{R}^{m_2 \times p_2},\ \|\mathbf{d}'_j\|_2 = 1\ \forall j \in [p]\right\}$, and (5)
 $\mathcal{X}(\mathbf{D}_0, r) \triangleq \left\{\mathbf{D}' \in \mathcal{D} : \|\mathbf{D}' - \mathbf{D}_0\|_F^2 < r\right\}$, (6)

where the radius $r$ is known. It is worth noting here that, similar to the analysis for vector data [16], our analysis is applicable to the global KS dictionary learning problem. Finally, some of our analysis in the following also relies on the notion of the restricted isometry property (RIP). Specifically, $\mathbf{D}$ satisfies the RIP of order $s$ with constant $\delta_s$ if

 $\forall\, s\text{-sparse } \mathbf{x}, \quad (1 - \delta_s)\|\mathbf{x}\|_2^2 \le \|\mathbf{D}\mathbf{x}\|_2^2 \le (1 + \delta_s)\|\mathbf{x}\|_2^2$. (7)
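For small problem sizes the RIP constant of order $s$ can be computed exactly by scanning all $\binom{p}{s}$ column subsets, since $\delta_s$ is the largest deviation from one of the eigenvalues of the Gram matrices $\mathbf{D}_{\mathcal{S}}^T\mathbf{D}_{\mathcal{S}}$. A brute-force sketch with illustrative sizes (infeasible for large $p$):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
m, p, s = 16, 24, 2
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)          # unit-norm columns

# delta_s = max over s-column subsets S of max |eig(D_S^T D_S) - 1|
delta_s = 0.0
for S in combinations(range(p), s):
    eig = np.linalg.eigvalsh(D[:, S].T @ D[:, S])
    delta_s = max(delta_s, abs(eig[0] - 1.0), abs(eig[-1] - 1.0))

print(f"RIP constant of order {s}: {delta_s:.3f}")
assert 0.0 <= delta_s < 1.0   # holds here: for s = 2, delta_2 is the worst-case coherence
```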

### II-A Minimax risk analysis

We are interested in lower bounding the minimax risk for estimating $\mathbf{D}$ based on the observations $\mathbf{Y}$, which is defined as the worst-case mean squared error (MSE) that can be obtained by the best KS dictionary estimator $\widehat{\mathbf{D}}(\mathbf{Y})$. That is,

 $\varepsilon^* = \inf_{\widehat{\mathbf{D}}} \sup_{\mathbf{D} \in \mathcal{X}(\mathbf{D}_0, r)} \mathbb{E}_{\mathbf{Y}}\left\{\left\|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\right\|_F^2\right\}$. (8)

In order to lower bound this minimax risk $\varepsilon^*$, we resort to the multiple hypothesis testing approach taken in the literature on nonparametric estimation [19, 18]. This approach is equivalent to generating a KS dictionary $\mathbf{D}_l$ uniformly at random from a carefully constructed class $\mathcal{D}_L = \{\mathbf{D}_1, \dots, \mathbf{D}_L\} \subseteq \mathcal{X}(\mathbf{D}_0, r)$ for a given $\mathbf{D}_0$ and $r$. The observations $\mathbf{Y}$ in this setting can be interpreted as channel outputs that are fed into an estimator that must decode the index $l$. A lower bound on the minimax risk in this setting depends not only on problem parameters such as the number of observations $N$, the noise variance $\sigma^2$, the dimensions of the true KS dictionary, the neighborhood radius $r$, and the coefficient distribution, but also on various aspects of the constructed class $\mathcal{D}_L$ [18].

To ensure a tight lower bound, we must construct $\mathcal{D}_L$ such that the distance between any two dictionaries in $\mathcal{D}_L$ is sufficiently large while the hypothesis testing problem remains sufficiently hard, i.e., distinct dictionaries result in similar observations. Specifically, for $l, l' \in [L]$, we desire a construction such that

 $\forall l \ne l', \quad \|\mathbf{D}_l - \mathbf{D}_{l'}\|_F \ge 2\sqrt{2\varepsilon} \quad \text{and} \quad D_{KL}\left(f_{\mathbf{D}_l}(\mathbf{Y})\,\middle\|\,f_{\mathbf{D}_{l'}}(\mathbf{Y})\right) \le \alpha_L$, (9)

where $D_{KL}(f_{\mathbf{D}_l}(\mathbf{Y})\,\|\,f_{\mathbf{D}_{l'}}(\mathbf{Y}))$ denotes the Kullback-Leibler (KL) divergence between the distributions of the observations based on $\mathbf{D}_l$ and $\mathbf{D}_{l'}$, while $\varepsilon$ and $\alpha_L$ are non-negative parameters. Roughly, the minimax risk analysis proceeds as follows. Considering $\widehat{\mathbf{D}}$ to be an estimator that achieves $\varepsilon^*$, and assuming $\varepsilon^* \le \varepsilon$ and $\mathbf{D}_l$ generated uniformly at random from $\mathcal{D}_L$, the minimum-distance detector $\widehat{l}(\mathbf{Y}) \triangleq \arg\min_{l'} \|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_{l'}\|_F$ recovers $l$ as long as $\|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F < \sqrt{2\varepsilon}$. The goal then is to relate the detection error probability $P(\widehat{l}(\mathbf{Y}) \ne l)$ to the mutual information between $\mathbf{Y}$ and $l$ using Fano's inequality [19]:

 $\left(1 - P(\widehat{l}(\mathbf{Y}) \ne l)\right)\log_2 L - 1 \le I(\mathbf{Y}; l)$, (10)

where $I(\mathbf{Y}; l)$ denotes the mutual information (MI) between the observations $\mathbf{Y}$ and the dictionary index $l$. Notice that the smaller $\alpha_L$ is in (9), the smaller $I(\mathbf{Y}; l)$ will be in (10). Unfortunately, explicitly evaluating $I(\mathbf{Y}; l)$ is a challenging task in our setup because of the underlying distributions. Similar to [16], we will instead resort to upper bounding $I(\mathbf{Y}; l)$ by assuming access to some side information $T(\mathbf{X})$ that will make the observations conditionally multivariate Gaussian (recall that the additive noise is Gaussian). Our final results will then follow from the fact that any lower bound for $\varepsilon^*$ given the side information is also a lower bound for the general case [16].
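Rearranging (10) gives the familiar Fano-type error bound $P(\widehat{l}(\mathbf{Y}) \ne l) \ge 1 - (I(\mathbf{Y}; l) + 1)/\log_2 L$, which is how an upper bound on the MI translates into a lower bound on the detection error. A small numerical sketch:

```python
import numpy as np

def fano_error_lower_bound(mi_bits: float, L: int) -> float:
    """Lower bound on P(error) implied by (10) for L hypotheses and MI in bits."""
    return max(0.0, 1.0 - (mi_bits + 1.0) / np.log2(L))

# e.g. L = 2**10 equiprobable dictionaries and an MI upper bound of 3 bits
print(fano_error_lower_bound(3.0, 2**10))   # -> 0.6
```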

### II-B Coefficient distribution

The minimax lower bounds in this paper are derived for various coefficient distributions. First, similar to [16], we consider arbitrary coefficient distributions for which the covariance matrix $\Sigma_x$ exists. We then specialize our results to sparse coefficient vectors and, under additional assumptions on the reference dictionary $\mathbf{D}_0$, obtain a tighter lower bound for some signal-to-noise ratio (SNR) regimes.

#### II-B1 General coefficients

The coefficient vector $\mathbf{x}$ in this case is assumed to be a zero-mean random vector with covariance matrix $\Sigma_x$. We also assume access to the side information $T(\mathbf{X}) = \mathbf{X}$ to obtain a lower bound on $\varepsilon^*$ in this setup.

#### II-B2 Sparse coefficients

In this case, we assume $\mathbf{x}$ to be an $s$-sparse vector whose support $\mathcal{S} \triangleq \operatorname{supp}(\mathbf{x})$ is uniformly distributed over $\mathcal{E} \triangleq \{\mathcal{S} \subseteq [p] : |\mathcal{S}| = s\}$:

 $P(\operatorname{supp}(\mathbf{x}) = \mathcal{S}) = \binom{p}{s}^{-1} \quad \text{for any } \mathcal{S} \in \mathcal{E}$. (11)

Further, we model the nonzero entries of $\mathbf{x}$, i.e., $\mathbf{x}_{\mathcal{S}}$, as drawn in an i.i.d. fashion from a distribution with variance $\sigma_a^2$:

 $\mathbb{E}_{\mathbf{x}}\left\{\mathbf{x}_{\mathcal{S}}\mathbf{x}_{\mathcal{S}}^T \,\middle|\, \mathcal{S}\right\} = \sigma_a^2 \mathbf{I}_s$. (12)

Notice that an $\mathbf{x}$ satisfying the assumptions in (11) and (12) has

 $\Sigma_x = \frac{s}{p}\sigma_a^2 \mathbf{I}_p$. (13)

Further, it is easy to see in this case that $\|\Sigma_x\|_2 = \frac{s}{p}\sigma_a^2$. Finally, the side information assumed in this sparse-coefficients setup will either be $T(\mathbf{X}) = \mathbf{X}$ or $T(\mathbf{X}) = \operatorname{supp}(\mathbf{X})$.
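A quick Monte Carlo check of (13): drawing supports uniformly at random and filling them with variance-$\sigma_a^2$ entries yields an empirical covariance close to $(s/p)\sigma_a^2\mathbf{I}_p$ (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
p, s, sigma_a, trials = 12, 3, 1.0, 50000

# uniform random s-subsets as supports, i.i.d. Gaussian values on the support
supports = rng.random((trials, p)).argsort(axis=1)[:, :s]
vals = sigma_a * rng.standard_normal((trials, s))
samples = np.zeros((trials, p))
np.put_along_axis(samples, supports, vals, axis=1)

emp_cov = samples.T @ samples / trials
target = (s / p) * sigma_a**2 * np.eye(p)    # covariance predicted by (13)
assert np.abs(emp_cov - target).max() < 0.05
```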

## III Lower Bound for General Coefficients

We now provide our main result on the lower bound on the minimax risk of the KS dictionary learning problem for the case of general (i.i.d.) coefficient vectors.

###### Theorem 1.

Consider a KS dictionary learning problem with $N$ i.i.d. observations generated according to the model in (3) and the true dictionary satisfying (6) for some reference $\mathbf{D}_0$ and radius $r$. Suppose the covariance matrix $\Sigma_x$ of the zero-mean random coefficient vectors exists. If there exists an estimator with worst-case MSE $\varepsilon^*$, then the minimax risk is lower bounded by

 $\varepsilon^* \ge \dfrac{C_1 r^2 \sigma^2}{N p \|\Sigma_x\|_2}\left(c_1\left(p_1(m_1-1) + p_2(m_2-1)\right) - 3\right)$ (14)

where $c_1$ and $C_1$ are positive constants whose values are specified in the full version of this work [17].
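To get a feel for how the bound in (14) scales, one can evaluate its right-hand side numerically. The constants $c_1$ and $C_1$ are not specified here, so the sketch below sets them to 1 purely for illustration; only the scaling with $N$, the dimensions, $r$, $\sigma^2$, and $\|\Sigma_x\|_2$ is meaningful:

```python
def ks_minimax_lower_bound(N, m1, p1, m2, p2, r, sigma2, sigma_x_norm,
                           C1=1.0, c1=1.0):
    """Right-hand side of (14); C1 and c1 are placeholder constants."""
    p = p1 * p2
    dof = p1 * (m1 - 1) + p2 * (m2 - 1)            # degrees of freedom of (A, B)
    return C1 * r**2 * sigma2 / (N * p * sigma_x_norm) * (c1 * dof - 3)

b1 = ks_minimax_lower_bound(N=1000, m1=8, p1=16, m2=8, p2=16,
                            r=1.0, sigma2=0.01, sigma_x_norm=0.1)
b2 = ks_minimax_lower_bound(N=4000, m1=8, p1=16, m2=8, p2=16,
                            r=1.0, sigma2=0.01, sigma_x_norm=0.1)
assert b1 > 0 and abs(b1 / b2 - 4.0) < 1e-9        # bound decays as 1/N
```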

Outline of Proof: The idea of the proof, as discussed in Section II-A, is to construct a set $\mathcal{D}_L \subseteq \mathcal{X}(\mathbf{D}_0, r)$ of $L$ distinct KS dictionaries satisfying:

• Any two distinct dictionaries in $\mathcal{D}_L$ are separated by a minimum distance in the neighborhood, i.e., for any pair $l, l' \in [L]$ and some positive $\varepsilon$:

 $\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F \ge 2\sqrt{2\varepsilon}, \quad \text{for } l \ne l'$. (15)

Notice that if the true dictionary $\mathbf{D}_l$ is selected uniformly at random from $\mathcal{D}_L$ then, given the side information $T(\mathbf{X}) = \mathbf{X}$, the observations follow a multivariate Gaussian distribution, and an upper bound $\eta_1$ on the conditional MI can be obtained by using an upper bound on the KL divergence between multivariate Gaussian distributions. This bound depends on the parameters $N$, $r$, $\sigma^2$, and $\|\Sigma_x\|_2$.

Next, assuming (15) holds for $\mathcal{D}_L$, if there exists an estimator achieving the minimax risk $\varepsilon^* \le \varepsilon$ and the recovered dictionary satisfies $\|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F < \sqrt{2\varepsilon}$, the minimum-distance detector can recover $\mathbf{D}_l$. Consequently, the probability of error can be used to lower bound the conditional MI via Fano's inequality. The obtained lower bound $\eta_2$ in our case is only a function of $L$.

Finally, using the obtained upper and lower bounds for the conditional MI:

 $\eta_2 \le I(\mathbf{Y}; l \mid T(\mathbf{X})) \le \eta_1$, (16)

a lower bound for the minimax risk is attained.
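The separation requirement (15) is exactly what makes the minimum-distance detector succeed: any estimate closer than half the minimum pairwise distance to the true dictionary decodes to the correct index. A toy simulation (random perturbations standing in for the actual packing construction of the proof):

```python
import numpy as np

rng = np.random.default_rng(5)
m, p, L = 20, 30, 16

# toy hypothesis class: L random perturbations of a reference dictionary
D0 = rng.standard_normal((m, p))
dicts = [D0 + 0.5 * rng.standard_normal((m, p)) for _ in range(L)]

d_min = min(np.linalg.norm(dicts[i] - dicts[j], 'fro')
            for i in range(L) for j in range(i + 1, L))

# an "estimate" within d_min/2 of the true dictionary is decoded correctly
l_true = int(rng.integers(L))
E = rng.standard_normal((m, p))
E *= 0.4 * d_min / np.linalg.norm(E, 'fro')        # estimation error < d_min / 2
D_hat = dicts[l_true] + E

l_hat = int(np.argmin([np.linalg.norm(D_hat - Dl, 'fro') for Dl in dicts]))
assert l_hat == l_true
```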

A formal proof of Theorem 1 relies on the following lemmas, whose proofs appear in the full version of this work [17]. Note that since our construction of $\mathcal{D}_L$ is more complex than that for the vector case [16, Theorem 1], it requires a different sequence of lemmas, with the exception of Lemma 3, which follows from the vector case.

###### Lemma 1.

There exists a set of $L$ matrices $\{\mathbf{A}_l\}_{l \in [L]}$, where the elements of each $\mathbf{A}_l$ take values $\pm a$ for some $a > 0$, such that for any $t \in (0, 1)$ and any $l, l' \in [L]$ with $l \ne l'$, the following relation is satisfied:

 $\left|\sum(\mathbf{A}_l \odot \mathbf{A}_{l'})\right| \le t$. (17)
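The flavor of Lemma 1 can be checked empirically: independent random sign matrices are nearly "orthogonal" under the element-wise inner product $\sum(\mathbf{A}_l \odot \mathbf{A}_{l'})$, which concentrates around zero at scale $\sqrt{mp}$. The quick sketch below is an illustration of this concentration phenomenon, not the paper's actual construction:

```python
import numpy as np

rng = np.random.default_rng(6)
m, p, L = 32, 32, 64
mats = rng.choice([-1.0, 1.0], size=(L, m, p))

# pairwise element-wise inner products, normalized by the number of entries
corr = max(abs((mats[i] * mats[j]).sum())
           for i in range(L) for j in range(i + 1, L)) / (m * p)
print(f"max normalized correlation over {L * (L - 1) // 2} pairs: {corr:.3f}")
assert corr < 0.5   # each pair concentrates at scale 1/sqrt(m*p) << 1
```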
###### Lemma 2.

Considering the generative model in (3), given some $t \in (0,1)$ and a reference dictionary $\mathbf{D}_0 \in \mathcal{D}$, there exists a set $\mathcal{D}_L \subseteq \mathcal{X}(\mathbf{D}_0, r)$ of cardinality $L = 2^{c_1(p_1(m_1-1)+p_2(m_2-1))-1}$ such that for any $l, l' \in [L]$ with $l \ne l'$ and any $\varepsilon'$ satisfying

 $0 < \varepsilon' \le \frac{r^4}{4p}$, (18)

we have

 $2pr^2(1-t)\varepsilon' \le \|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 \le 8pr^2\varepsilon'$. (19)

Furthermore, considering the general coefficient model for $\mathbf{x}$ and assuming side information $T(\mathbf{X}) = \mathbf{X}$, we have

 $\forall l, \quad I(\mathbf{Y}; l \mid T(\mathbf{X})) \le \dfrac{4Np\|\Sigma_x\|_2 r^2}{\sigma^2}\varepsilon'$. (20)
###### Lemma 3.

Consider the generative model in (3) with minimax risk $\varepsilon^* \le \varepsilon$ for some $\varepsilon > 0$. Assume there exists a finite set $\mathcal{D}_L$ of $L$ dictionaries satisfying

 $\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 \ge 8\varepsilon$ (21)

for all $l \ne l'$. Then for any side information $T(\mathbf{X})$, we have

 $I(\mathbf{Y}; l \mid T(\mathbf{X})) \ge \frac{1}{2}\log_2(L) - 1$. (22)
###### Proof of Theorem 1.

According to Lemma 2, for any $\varepsilon'$ satisfying (18), there exists a set $\mathcal{D}_L$ of cardinality $L = 2^{c_1(p_1(m_1-1)+p_2(m_2-1))-1}$ that satisfies (20) for any $l \in [L]$ and any $t \in (0,1)$. If we set $\varepsilon' = 8c_2\varepsilon$, with $c_2$ chosen such that the lower bound in (19) is at least $8\varepsilon$, then (21) is satisfied for $\mathcal{D}_L$, and provided there exists an estimator with worst-case MSE satisfying $\varepsilon^* \le \varepsilon$, (22) holds. Combining (20) and (22) we get

 $\frac{1}{2}\log_2(L) - 1 \le I(\mathbf{Y}; l \mid T(\mathbf{X})) \le \dfrac{32Np\|\Sigma_x\|_2 c_2 r^2}{\sigma^2}\varepsilon$, (23)

where $c_2 \triangleq \frac{1}{2pr^2(1-t)}$. Defining $C_1$ accordingly, (23) translates into

 $\varepsilon \ge \dfrac{C_1 r^2 \sigma^2}{Np\|\Sigma_x\|_2}\left(c_1\left(p_1(m_1-1)+p_2(m_2-1)\right) - 3\right)$. (24)

## IV Lower Bound for Sparse Coefficients

We now turn our attention to the case of sparse coefficients and obtain lower bounds on the corresponding minimax risk. We first state a corollary of Theorem 1 for the sparse coefficient model of Section II-B2.

###### Corollary 1.

Consider a KS dictionary learning problem with $N$ i.i.d. observations generated according to the model in (3). Assume the true dictionary satisfies (6) for some $r$ and the reference dictionary $\mathbf{D}_0$ satisfies (5). If the random coefficient vectors are selected according to (11) and (12) and there exists an estimator with worst-case MSE $\varepsilon^*$, then the minimax risk is lower bounded by

 $\varepsilon^* \ge \dfrac{C_1 r^2 \sigma^2}{N s \sigma_a^2}\left(c_1\left(p_1(m_1-1)+p_2(m_2-1)\right) - 3\right)$ (25)

for the same constants $c_1$ and $C_1$ as in Theorem 1.

This result is a direct consequence of Theorem 1, obtained by substituting the covariance matrix of $\mathbf{x}$ given in (13) into (14).
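This substitution is easy to sanity-check numerically: with $\|\Sigma_x\|_2 = (s/p)\sigma_a^2$ from (13), the factors of $p$ cancel and (14) collapses to (25). A sketch with the constants dropped (illustrative parameter values):

```python
import numpy as np

N, m1, p1, m2, p2 = 500, 6, 10, 6, 10
r, sigma2, s, sigma_a2 = 1.0, 0.05, 4, 1.0
p = p1 * p2
dof = p1 * (m1 - 1) + p2 * (m2 - 1)

general = r**2 * sigma2 / (N * p * (s * sigma_a2 / p)) * (dof - 3)  # (14) with (13)
sparse  = r**2 * sigma2 / (N * s * sigma_a2) * (dof - 3)            # (25)
assert np.isclose(general, sparse)
```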

### IV-A Sparse Gaussian coefficients

In this section, we make an additional assumption on the coefficient vector generated according to (11) and assume that the non-zero elements of $\mathbf{x}$ follow a Gaussian distribution. Since the non-zero entries of $\mathbf{x}$ are i.i.d., we can write

 $\mathbf{x}_{\mathcal{S}} \sim \mathcal{N}(0, \sigma_a^2\mathbf{I}_s)$. (26)

Therefore, given the side information $T(\mathbf{X}) = \operatorname{supp}(\mathbf{X})$, the observations follow a multivariate Gaussian distribution. We now provide a theorem for the lower bound attained under this coefficient distribution.

###### Theorem 2.

Consider a KS dictionary learning problem with $N$ i.i.d. observations generated according to the model in (3). Assume the true dictionary satisfies (6) for some $r$ and the reference coordinate dictionaries $\mathbf{A}_0$ and $\mathbf{B}_0$ satisfy the RIP (7) of order $s$. If the random coefficient vector is selected according to (11) and (26) and there exists an estimator with worst-case MSE $\varepsilon^*$, then the minimax risk is lower bounded by

 $\varepsilon^* \ge \dfrac{C_2 r^2 \sigma^4}{N s^2 \sigma_a^4}\left(c_1\left(p_1(m_1-1)+p_2(m_2-1)\right) - 3\right)$ (27)

where $c_1$ and $C_2$ are positive constants whose values are specified in the full version of this work [17].

Outline of Proof: The constructed dictionary class in Theorem 2 is similar to that in Theorem 1, but the upper bound on the conditional MI, $\eta_1$, differs from that in Theorem 1 because the side information is different.

Given the true dictionary $\mathbf{D}_l$ and the support $\mathcal{S}_k$ of the $k$-th coefficient vector $\mathbf{x}_k$, let $\mathbf{D}_{l,\mathcal{S}_k}$ denote the columns of $\mathbf{D}_l$ corresponding to the non-zero elements of $\mathbf{x}_k$. In this case, we have

 $\mathbf{y}_k = \mathbf{D}_{l,\mathcal{S}_k}\mathbf{x}_{k,\mathcal{S}_k} + \mathbf{n}_k$. (28)

We can write the subdictionary $\mathbf{D}_{l,\mathcal{S}_k}$ in terms of the Khatri-Rao product of two smaller matrices:

 $\mathbf{D}_{l,\mathcal{S}_k} = \mathbf{A}_{l_a,\mathcal{S}_k^a} \ast \mathbf{B}_{l_b,\mathcal{S}_k^b}$, (29)

where $\mathcal{S}_k^a$ and $\mathcal{S}_k^b$ are multisets of column indices of $\mathbf{A}_{l_a}$ and $\mathbf{B}_{l_b}$, respectively, with the following relationship to $\mathcal{S}_k$: each $j \in \mathcal{S}_k$ corresponds to the pair of coordinate columns indexed by $\lceil j/p_2 \rceil$ in $\mathbf{A}_{l_a}$ and $j - (\lceil j/p_2 \rceil - 1)p_2$ in $\mathbf{B}_{l_b}$. Note that $\mathbf{A}_{l_a,\mathcal{S}_k^a}$ and $\mathbf{B}_{l_b,\mathcal{S}_k^b}$ are not necessarily submatrices of $\mathbf{A}_{l_a}$ and $\mathbf{B}_{l_b}$, as $\mathcal{S}_k^a$ and $\mathcal{S}_k^b$ are multisets. Figure 1 provides a visual illustration of (29). Therefore, the observations follow a multivariate Gaussian distribution with zero mean and covariance matrix

 $\Sigma_{k,l} = \sigma_a^2\left(\mathbf{A}_{l_a,\mathcal{S}_k^a} \ast \mathbf{B}_{l_b,\mathcal{S}_k^b}\right)\left(\mathbf{A}_{l_a,\mathcal{S}_k^a} \ast \mathbf{B}_{l_b,\mathcal{S}_k^b}\right)^T + \sigma^2\mathbf{I}_m$, (30)

and we need to obtain an upper bound on the conditional MI using (30). We now state a variation of Lemma 2 that is necessary for the proof of Theorem 2; the proof of the lemma is again provided in [17].
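The multiset index mapping behind (29) is mechanical: with 0-indexed columns, column $j$ of $\mathbf{A} \otimes \mathbf{B}$ equals $\mathbf{a}_{\lfloor j/p_2 \rfloor} \otimes \mathbf{b}_{j \bmod p_2}$, so a support $\mathcal{S}_k$ on the KS dictionary pulls out (possibly repeated) columns of $\mathbf{A}$ and $\mathbf{B}$. The sketch below verifies this and assembles the covariance in (30) (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(7)
m1, p1, m2, p2, s = 4, 6, 3, 5, 3
sigma_a, sigma = 1.0, 0.1

A = rng.standard_normal((m1, p1)); A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((m2, p2)); B /= np.linalg.norm(B, axis=0)
D = np.kron(A, B)
m, p = m1 * m2, p1 * p2

S = np.sort(rng.choice(p, size=s, replace=False))   # support of x_k (0-indexed)
Sa, Sb = S // p2, S % p2                            # multisets of coordinate columns

# Khatri-Rao of the (repeated) coordinate columns reproduces the subdictionary (29)
D_S = np.stack([np.kron(A[:, ja], B[:, jb]) for ja, jb in zip(Sa, Sb)], axis=1)
assert np.allclose(D[:, S], D_S)

# conditional covariance of y_k given the support, as in (30)
Sigma_kl = sigma_a**2 * D_S @ D_S.T + sigma**2 * np.eye(m)
assert np.all(np.linalg.eigvalsh(Sigma_kl) > 0)     # positive definite
```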

###### Lemma 4.

Considering the generative model in (3), given some $t \in (0,1)$ and a reference dictionary $\mathbf{D}_0 \in \mathcal{D}$, there exists a set $\mathcal{D}_L \subseteq \mathcal{X}(\mathbf{D}_0, r)$ of cardinality $L = 2^{c_1(p_1(m_1-1)+p_2(m_2-1))-1}$ such that for any $l, l' \in [L]$ with $l \ne l'$ and any $\varepsilon'$ satisfying

 $0 < \varepsilon' \le \min\left\{\frac{r^2}{s}, \frac{r^4}{4p}\right\}$, (31)

we have

 $2pr^2(1-t)\varepsilon' \le \|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 \le 8pr^2\varepsilon'$. (32)

Furthermore, assuming the reference coordinate dictionaries $\mathbf{A}_0$ and $\mathbf{B}_0$ satisfy the RIP (7) of order $s$ and the coefficient matrix is selected according to (11) and (26), considering the side information $T(\mathbf{X}) = \operatorname{supp}(\mathbf{X})$, we have:

 $I(\mathbf{Y}; l \mid T(\mathbf{X})) \le 7921\left(\frac{\sigma_a}{\sigma}\right)^4 N s^2 r^2 \varepsilon'$. (33)
###### Proof of Theorem 2.

According to Lemma 4, for any $\varepsilon'$ satisfying (31), there exists a set $\mathcal{D}_L$ of cardinality $L = 2^{c_1(p_1(m_1-1)+p_2(m_2-1))-1}$ that satisfies (33) for any $l \in [L]$ and any $t \in (0,1)$. Setting $\varepsilon' = 8c_2\varepsilon$ as in the proof of Theorem 1, (21) is satisfied for $\mathcal{D}_L$ and, provided there exists an estimator with worst-case MSE satisfying $\varepsilon^* \le \varepsilon$, (22) holds. Consequently,

 $\frac{1}{2}\log_2(L) - 1 \le I(\mathbf{Y}; l \mid T(\mathbf{X})) \le 8 \times 7921\, c_2\left(\frac{\sigma_a}{\sigma}\right)^4 N s^2 r^2 \varepsilon$, (34)

where $c_2$ is as in the proof of Theorem 1. Defining $C_2$ accordingly, (34) can be written as

 $\varepsilon \ge \dfrac{C_2\left(\frac{\sigma}{\sigma_a}\right)^4 r^2\left(c_1\left(p_1(m_1-1)+p_2(m_2-1)\right) - 3\right)}{N s^2}$. (35)

## V Discussion and Conclusion

In this paper we follow an information-theoretic approach to provide lower bounds on the worst-case MSE of estimating KS dictionaries that generate 2-dimensional tensor data. Table I lists the dependence of the known lower bounds on the minimax rates on various parameters of the dictionary learning problem and the SNR. Compared to the results in [16] for the unstructured dictionary learning problem, which are not stated in this form but can be reduced to it, we decrease the lower bound in all cases by reducing the scaling to $p_1(m_1-1)+p_2(m_2-1)$ for KS dictionaries. This is intuitively pleasing since the minimax lower bound has a linear relationship with the number of degrees of freedom of the KS dictionary, which is $p_1(m_1-1)+p_2(m_2-1)$, and with the square of the neighborhood radius $r$. The results also show that the minimax risk decreases with a larger number of samples and increased SNR. Notice also that in high SNR regimes the lower bound in (25) is tighter, while (27) yields a tighter lower bound in low SNR regimes. Our bounds depend on the signal distribution and imply that the necessary sample complexity scales with the number of degrees of freedom, $p_1(m_1-1)+p_2(m_2-1)$, of the coordinate dictionaries. Future work includes extending the lower bounds to higher-order tensors and specifying a learning scheme that achieves these lower bounds.
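The SNR crossover between (25) and (27) can be seen directly from their scalings: with the constants dropped, their ratio is $\sigma^2/(s\sigma_a^2)$, so (27) dominates when the noise is large relative to the signal. A numerical sketch (illustrative values):

```python
N, p1, m1, p2, m2, s, r = 1000, 12, 8, 12, 8, 4, 1.0
sigma2 = 1.0
dof = p1 * (m1 - 1) + p2 * (m2 - 1)

def bound_25(sigma_a2):
    # scaling of (25), constants dropped
    return r**2 * sigma2 * (dof - 3) / (N * s * sigma_a2)

def bound_27(sigma_a2):
    # scaling of (27), constants dropped
    return r**2 * sigma2**2 * (dof - 3) / (N * s**2 * sigma_a2**2)

assert bound_27(0.1) > bound_25(0.1)     # low SNR: (27) is the tighter (larger) bound
assert bound_25(10.0) > bound_27(10.0)   # high SNR: (25) is tighter
```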

## References

• [1] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T. Lee, and T. J. Sejnowski, “Dictionary learning algorithms for sparse representation,” Neural computation, vol. 15, no. 2, pp. 349–396, 2003.
• [2] Z. Zhang and S. Aeron, “Denoising and completion of 3D data via multidimensional dictionary learning,” arXiv preprint arXiv:1512.09227, 2015.
• [3] G. Duan, H. Wang, Z. Liu, J. Deng, and Y.-W. Chen, “K-CPD: Learning of overcomplete dictionaries for tensor sparse coding,” in Proc. IEEE 21st Int. Conf. Pattern Recognition (ICPR), 2012, pp. 493–496.
• [4] S. Hawe, M. Seibert, and M. Kleinsteuber, “Separable dictionary learning,” in Proc. IEEE Conf. Comput. Vision and Pattern Recognition (CVPR), 2013, pp. 438–445.
• [5] S. Zubair and W. Wang, “Tensor dictionary learning with sparse Tucker decomposition,” in Proc. IEEE 18th Int. Conf. Digital Signal Process. (DSP), 2013, pp. 1–6.
• [6] Y. Peng, D. Meng, Z. Xu, C. Gao, Y. Yang, and B. Zhang, “Decomposable nonlocal tensor dictionary learning for multispectral image denoising,” in Proc. IEEE Conf. Comput. Vision and Pattern Recognition (CVPR), 2014, pp. 2949–2956.
• [7] S. Soltani, M. E. Kilmer, and P. C. Hansen, “A tensor-based dictionary learning approach to tomographic image reconstruction,” arXiv preprint arXiv:1506.04954, 2015.
• [8] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them,” Linear algebra and its applications, vol. 416, no. 1, pp. 48–67, 2006.
• [9] A. Agarwal, A. Anandkumar, P. Jain, and P. Netrapalli, “Learning sparsely used overcomplete dictionaries via alternating minimization,” arXiv preprint arXiv:1310.7991, 2013.
• [10] A. Agarwal, A. Anandkumar, and P. Netrapalli, “Exact recovery of sparsely used overcomplete dictionaries,” arXiv preprint arXiv:1309.1952., 2013.
• [11] S. Arora, R. Ge, and A. Moitra, “New algorithms for learning incoherent and overcomplete dictionaries,” in Proc. 27th Conf. Learning Theory, 2014, pp. 779–806.
• [12] K. Schnass, “On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD,” Applied and Computational Harmonic Analysis, vol. 37, no. 3, pp. 464–491, 2014.
• [13] ——, “Local identification of overcomplete dictionaries,” Journal of Machine Learning Research, vol. 16, pp. 1211–1242, 2015.
• [14] R. Gribonval, R. Jenatton, and F. Bach, “Sparse and spurious: dictionary learning with noise and outliers,” arXiv preprint arXiv:1407.5155, 2014.
• [15] A. Jung, Y. C. Eldar, and N. Gortz, “Performance limits of dictionary learning for sparse coding,” in Proc. IEEE 22nd European Signal Process. Conf. (EUSIPCO), 2014, pp. 765–769.
• [16] A. Jung, Y. C. Eldar, and N. Görtz, “On the minimax risk of dictionary learning,” arXiv preprint arXiv:1507.05498, 2015.
• [17] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds on dictionary learning for tensor data,” 2016, preprint.
• [18] A. B. Tsybakov, Introduction to nonparametric estimation.   Springer Series in Statistics. Springer, New York, 2009.
• [19] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam.   Springer, 1997, pp. 423–435.
• [20] A. Smilde, R. Bro, and P. Geladi, Multi-way analysis: Applications in the chemical sciences.   John Wiley & Sons, 2005.
• [21] C. F. Van Loan, “The ubiquitous Kronecker product,” Journal of computational and applied mathematics, vol. 123, no. 1, pp. 85–100, 2000.