Learning Mixtures of Separable Dictionaries for Tensor Data: Analysis and Algorithms
This work addresses the problem of learning sparse representations of tensor data using structured dictionary learning. It proposes learning a mixture of separable dictionaries to better capture the structure of tensor data by generalizing the separable dictionary learning model. Two different approaches for learning a mixture of separable dictionaries are explored, and sufficient conditions for local identifiability of the underlying dictionary are derived in each case. Moreover, computational algorithms are developed to solve the problem of learning a mixture of separable dictionaries in both batch and online settings. Numerical experiments are used to show the usefulness of the proposed model and the efficacy of the developed algorithms.
Many data processing tasks such as feature extraction, data compression, classification, signal denoising, image inpainting, and audio source separation make use of data-driven sparse representations of data [2, 3, 4]. In many applications, these tasks are performed on data samples that are naturally structured as multiway arrays, also known as multidimensional arrays or tensors. Instances of multidimensional or tensor data include videos, hyperspectral images, tomographic images, and multiple-antenna wireless channels. Despite the ubiquity of tensor data in many applications, traditional data-driven sparse representation approaches disregard their multidimensional structure. This can result in sparsifying models with a large number of parameters. On the other hand, with the increasing availability of large data sets, it is crucial to keep sparsifying models reasonably small to ensure their scalable learning and efficient storage within devices such as smartphones and drones.
Our focus in this paper is on learning of “compact” models that yield sparse representations of tensor data. To this end, we study dictionary learning (DL) for tensor data. The goal in DL, which is an effective and popular data-driven technique for obtaining sparse representations of data [2, 4, 3], is to learn a dictionary $\mathbf{D}$ such that every data sample can be approximated by a linear combination of a few atoms (columns) of $\mathbf{D}$. While DL has been widely studied, traditional DL approaches flatten tensor data and then employ methods designed for vector data [4, 5]. Such simplistic approaches disregard the multidimensional structure in tensor data and result in dictionaries with a large number of parameters. One intuitively expects, however, that dictionaries with a smaller number of free parameters that exploit the correlation and structure along different tensor modes are likely to be more efficient with regard to storage requirements, computational complexity, and generalization performance, especially when training data are noisy or scarce.
To reduce the number of parameters in dictionaries for tensor data, and to better exploit the correlation among different tensor modes, some recent DL works have turned to tensor decompositions such as the Tucker decomposition and CANDECOMP/PARAFAC decomposition (CPD) for learning of “structured” dictionaries. The idea in structured DL for tensor data is to restrict the class of dictionaries during training to the one imposed by the tensor decomposition under consideration. For example, structured DL based on the Tucker decomposition of $N$-way tensor data corresponds to the dictionary class in which any dictionary consists of the Kronecker product of $N$ smaller subdictionaries [10, 11, 12, 13, 14, 15]. The resulting DL techniques in this instance are interchangeably referred to in the literature as separable DL or Kronecker-structured DL (KS-DL).
In terms of parameter counting, the advantages of KS-DL for tensor data are straightforward: for subdictionaries of size $m_n \times p_n$, $n = 1, \dots, N$, the number of parameters needed to be estimated and stored for unstructured dictionary learning is the product $\prod_{n=1}^{N} m_n p_n$, whereas the KS-DL model requires only the sum of the subdictionary sizes, $\sum_{n=1}^{N} m_n p_n$. Nonetheless, while existing KS-DL methods enjoy lower sample/computational complexity and better storage efficiency over unstructured DL, the KS-DL model makes a strong separability assumption among different modes of tensor data. Such an assumption can be overly restrictive for many classes of data, resulting in an unfavorable tradeoff between model compactness and representation power.
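The parameter-counting argument can be made concrete with a few lines of arithmetic. The following is a minimal sketch; the function names and the example mode sizes are illustrative assumptions rather than values from the paper, and the count for a sum of `r` separable terms assumes that each term stores its own full set of subdictionaries.

```python
from math import prod

def unstructured_params(m, p):
    """Entries in an unstructured dictionary of size prod(m) x prod(p)."""
    return prod(m) * prod(p)

def ks_params(m, p):
    """Entries in N subdictionaries of sizes m[n] x p[n]."""
    return sum(mn * pn for mn, pn in zip(m, p))

def lsr_params(m, p, r):
    """Entries when a dictionary is stored as a sum of r separable terms
    (assumption: each term keeps its own N subdictionaries)."""
    return r * ks_params(m, p)

# Hypothetical 3-way example: 8 x 8 x 8 samples, 16 atoms per mode.
m, p = (8, 8, 8), (16, 16, 16)
print(unstructured_params(m, p))  # 2097152
print(ks_params(m, p))            # 384
print(lsr_params(m, p, 3))        # 1152
```

Even for this small example the separable model is several orders of magnitude smaller than the unstructured one, and a mixture of a few separable terms remains far closer to the separable count.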
In this paper, we overcome this limitation by proposing and studying a generalization of KS-DL that we interchangeably refer to as learning a mixture of separable dictionaries or low separation rank DL (LSR-DL). The separation rank of a matrix is defined as the minimum number of KS matrices whose sum equals that matrix [17, 18]. The LSR-DL model interpolates between the under-parameterized separable model (a special case of the LSR-DL model with separation rank 1) and the over-parameterized unstructured model. Figure 1 provides an illustrative example of the usefulness of LSR-DL, in which one learns a dictionary with a small separation rank: while KS-DL learns dictionary atoms that cannot reconstruct diagonal structures perfectly because of the abundance of horizontal/vertical (DCT-like) structures within them, LSR-DL also returns dictionary atoms with pronounced diagonal structures as the separation rank increases.
I-A Main Contributions
We first propose and analyze a generalization of the separable DL model, which we call a mixture of separable dictionaries model or LSR-DL model, that allows for better representation power than the separable model while having a smaller number of parameters than standard DL. Our analysis assumes a generative model involving a true LSR dictionary for tensor data and investigates conditions under which the true dictionary is recoverable, up to a prescribed error, from training tensor data. Our first major set of LSR dictionary identifiability results is for the conventional optimization-based formulation of the DL problem, except that the search space is constrained to the class of dictionaries with maximum separation rank $r$ (and individual mixture terms having bounded norms when $N \geq 3$ and $r \geq 2$). [Footnote 1: While we also provide identifiability results for LSR dictionaries without requiring the boundedness assumption, those results are only asymptotic in nature; see Section III for details.] Similar to conventional DL problems, this LSR-DL problem is nonconvex with multiple global minima. We therefore focus on local identifiability guarantees, meaning that a search algorithm initialized close enough to the true dictionary can recover that dictionary. [Footnote 2: This is due to our choice of distance metric, which is the Frobenius norm.] To this end, under certain assumptions on the generative model, we show that samples ensure existence of a local minimum of the constrained LSR-DL problem for $N$th-order tensor data within a neighborhood of radius around the true LSR dictionary.
Our initial local identifiability results are based on an analysis of a separation rank-constrained optimization problem that exploits a connection between LSR (resp., KS) matrices and low-rank (resp., rank-1) tensors. However, a result in tensor recovery literature  implies that finding the separation rank of a matrix is NP-hard. Our second main contribution is development and analysis of two different relaxations of the LSR-DL problem that are computationally tractable in the sense that they do not require explicit computation of the separation rank. The first formulation once again exploits the connection between LSR matrices and low-rank tensors and uses a convex regularizer to implicitly constrain the separation rank of the learned dictionary. The second formulation enforces the LSR structure on the dictionary by explicitly writing it as a summation of KS matrices. Our analyses of the two relaxations once again involve conditions under which the true LSR dictionary is locally recoverable from training tensor data. We also provide extensive discussion in the sequel to compare and contrast the three sets of identifiability results for LSR dictionaries.
Our third main contribution is development of practical computational algorithms, which are based on the two relaxations of LSR-DL, for learning of an LSR dictionary in both batch and online settings. We then use these algorithms for learning of LSR dictionaries for both synthetic and real tensor data, which are afterward used in denoising and representation learning tasks. Numerical results obtained as part of these efforts help validate the usefulness of LSR-DL and highlight the different strengths and weaknesses of the two LSR-DL relaxations and the corresponding algorithms.
I-B Relation to Prior Work
Tensor decompositions [20, 21] have emerged as one of the main sets of tools that help avoid overparameterization of tensor data models in a variety of areas. These include deep learning, collaborative filtering, multilinear subspace learning, source separation, topic modeling, and many other works (see [22, 23] and references therein). But the use of tensor decompositions for reducing the (model and sample) complexity of dictionaries for tensor data has been addressed only recently.
There have been many works that provide theoretical analysis for the sample complexity of the conventional DL problem [27, 28, 29, 30]. Among these, Gribonval et al. focus on the local identifiability of the true dictionary underlying vectorized data using Frobenius norm as the distance metric. Shakeri et al. extended this analysis for the sample complexity of the KS-DL problem for $N$th-order tensor data. This analysis relies on expanding the objective function in terms of subdictionaries and exploiting the coordinate-wise Lipschitz continuity property of the objective function with respect to each subdictionary. While this approach ensures the identifiability of the subdictionaries, it requires the dictionary coefficient vectors to follow the so-called separable sparsity model and does not extend to the LSR-DL problem. In contrast, we provide local identifiability sample complexity results for the LSR-DL problem and two relaxations of it. Further, our identifiability results hold for coefficient vectors following the random sparsity model and the separable sparsity model.
In terms of computational algorithms, several works have proposed methods for learning KS dictionaries that rely on alternating minimization techniques to update the subdictionaries [31, 13, 11]. Among other works, Hawe et al. employ a Riemannian conjugate gradient method combined with a nonmonotone line search for KS-DL. While they present the algorithm only for matrix data, its extension to higher-order tensor data is trivial. Schwab et al. have also recently addressed the separable DL problem for matrix data; their contributions include a computational algorithm and global recovery guarantees. In terms of algorithms for LSR-DL, Dantas et al. proposed one of the first methods for matrix data that uses a convex regularizer to impose LSR on the dictionary. One of our batch algorithms, named STARK, also uses a convex regularizer for imposing LSR structure. In contrast to Dantas et al., however, STARK can be used to learn a dictionary from tensor data of any order. The other batch algorithm we propose, named TeFDiL, learns subdictionaries of the LSR dictionary by exploiting the connection to tensor recovery and using tensor CPD. Recently, Dantas et al. proposed an algorithm for learning an LSR dictionary for tensor data in which the dictionary update stage is a projected gradient descent algorithm that involves a CPD after every gradient step. In contrast, TeFDiL only requires a single CPD at the end of each dictionary update stage. Finally, while there exist a number of online algorithms for DL [5, 34, 35], the online algorithms developed here are the first ones that enable learning of structured (either KS or LSR) dictionaries.
II Preliminaries and Problem Statement
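Notation and Definitions: We use underlined bold upper-case ($\underline{\mathbf{A}}$), bold upper-case ($\mathbf{A}$), bold lower-case ($\mathbf{a}$), and lower-case ($a$) letters to denote tensors, matrices, vectors, and scalars, respectively. For any integer $p$, we define $[p] \triangleq \{1, 2, \dots, p\}$. We denote the $j$-th column of a matrix $\mathbf{A}$ by $\mathbf{a}_j$. For an $m \times p$ matrix $\mathbf{A}$ and an index set $\mathcal{J} \subseteq [p]$, we denote the matrix constructed from the columns of $\mathbf{A}$ indexed by $\mathcal{J}$ as $\mathbf{A}_{\mathcal{J}}$. We denote by $(\mathbf{A}_n)_{n=1}^{N}$ an $N$-tuple $(\mathbf{A}_1, \dots, \mathbf{A}_N)$, while $\{\mathbf{A}_n\}_{n=1}^{N}$ represents the set $\{\mathbf{A}_1, \dots, \mathbf{A}_N\}$. We drop the range indicators if they are clear from the context.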
Norms and inner products: We denote by the norm of vector , while we use , , and to denote the spectral, Frobenius, and trace (nuclear) norms of matrix , respectively. Moreover, is the max column norm and . We define the inner product of two tensors (or matrices) and as where is the vectorization operator. The Euclidean distance between two tuples of the same size is defined as .
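Kronecker product: We denote by $\mathbf{A} \otimes \mathbf{B}$ the Kronecker product of matrices $\mathbf{A}$ and $\mathbf{B}$. We use $\bigotimes_{n=1}^{N} \mathbf{A}_n \triangleq \mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \cdots \otimes \mathbf{A}_N$ for the Kronecker product of $N$ matrices. We drop the range indicators when there is no ambiguity. We call a matrix a ($N$-th order) Kronecker-structured (KS) matrix if it is a Kronecker product of $N$ matrices.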
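As a small illustration of the Kronecker-structured matrices defined above, the sketch below builds a third-order KS matrix from three random factors using NumPy. The helper names and the factor sizes are assumptions made for this example only. It also checks a standard property: if every factor has unit-norm columns, so does their Kronecker product, because each column of the product is a Kronecker product of one column from each factor.

```python
import numpy as np

def unit_columns(A):
    """Scale each column of A to unit Euclidean norm."""
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def kron_all(mats):
    """Kronecker product A_1 (x) A_2 (x) ... (x) A_N of a list of matrices."""
    out = np.ones((1, 1))
    for M in mats:
        out = np.kron(out, M)
    return out

rng = np.random.default_rng(0)
sizes = [(3, 4), (2, 5), (4, 3)]  # hypothetical (m_n, p_n) pairs
factors = [unit_columns(rng.standard_normal(s)) for s in sizes]

D = kron_all(factors)                # KS matrix of size 24 x 60
col_norms = np.linalg.norm(D, axis=0)
print(D.shape, np.allclose(col_norms, 1.0))
```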
Definitions for matrices: For a matrix with unit -norm columns, we define the cumulative coherence as . We say a matrix satisfies the -restricted isometry property (-RIP) with constant if for any and any with , we have .
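Definitions for tensors: We briefly present required tensor definitions here: see Kolda and Bader for more details. The mode-$n$ unfolding matrix of $\underline{\mathbf{A}}$ is denoted by $\mathbf{A}_{(n)}$, where each column of $\mathbf{A}_{(n)}$ consists of the vector formed by fixing all indices of $\underline{\mathbf{A}}$ except the one in the $n$th mode. We denote the outer product (tensor product) of vectors by $\circ$, while $\times_n$ denotes the mode-$n$ product between a tensor and a matrix. An $N$-way tensor is rank-1 if it can be written as an outer product of $N$ vectors: $\mathbf{v}_1 \circ \cdots \circ \mathbf{v}_N$. Throughout this paper, by the rank of a tensor, $\mathrm{rank}(\underline{\mathbf{A}})$, we mean the CP-rank of $\underline{\mathbf{A}}$, the minimum number of rank-1 tensors that construct $\underline{\mathbf{A}}$ as their sum. The CP decomposition (CPD) decomposes a tensor into a sum of its rank-1 tensor components. The Tucker decomposition factorizes an $N$-way tensor $\underline{\mathbf{A}}$ as $\underline{\mathbf{A}} = \underline{\mathbf{X}} \times_1 \mathbf{D}_1 \times_2 \mathbf{D}_2 \times_3 \cdots \times_N \mathbf{D}_N$, where $\underline{\mathbf{X}}$ denotes the core tensor and $\mathbf{D}_n$ denote factor matrices along the $n$-th mode of $\underline{\mathbf{A}}$ for $n \in [N]$.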
Notations for functions and spaces: We denote the element-wise sign function by . For any function , we define the difference . We denote by the Euclidean unit sphere: . We also denote the Euclidean sphere with radius by . The oblique manifold in is the manifold of matrices with unit-norm columns: . We drop the dimension subscripts and use only when there is no ambiguity. The covering number of a set with respect to a norm , denoted by , is the minimum number of balls of -norm radius needed to cover .
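Dictionary Learning Setup: In dictionary learning (DL) for vector data, we assume observations $\mathbf{y} \in \mathbb{R}^{m}$ are generated according to the following model:
$$\mathbf{y} = \mathbf{D}^0 \mathbf{x} + \boldsymbol{\epsilon}, \tag{1}$$
where $\mathbf{D}^0 \in \mathbb{R}^{m \times p}$ is the true underlying dictionary, $\mathbf{x} \in \mathbb{R}^{p}$ is a randomly generated sparse coefficient vector, and $\boldsymbol{\epsilon} \in \mathbb{R}^{m}$ is the underlying noise vector. The goal in DL is to recover the true dictionary given the noisy observations $\mathbf{Y} \triangleq [\mathbf{y}_1, \dots, \mathbf{y}_L]$ that are independent realizations of (1). The ideal objective is to solve the statistical risk minimization problem
$$\min_{\mathbf{D} \in \mathcal{C}} f_{\mathcal{P}}(\mathbf{D}) \triangleq \mathbb{E}_{\mathbf{y} \sim \mathcal{P}} \, f_{\mathbf{y}}(\mathbf{D}), \tag{2}$$
where $\mathcal{P}$ is the underlying distribution of the observations, $\mathcal{C}$ is the dictionary class, typically selected for vector data to be the same as the oblique manifold, and
$$f_{\mathbf{y}}(\mathbf{D}) \triangleq \inf_{\mathbf{x} \in \mathbb{R}^{p}} \tfrac{1}{2} \|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2 + \lambda \|\mathbf{x}\|_1. \tag{3}$$
However, since we have access to the distribution $\mathcal{P}$ only through noisy observations drawn from this distribution, we resort to solving the following empirical risk minimization problem as a proxy for Problem (2):
$$\min_{\mathbf{D} \in \mathcal{C}} F_{\mathbf{Y}}(\mathbf{D}) \triangleq \frac{1}{L} \sum_{l=1}^{L} f_{\mathbf{y}_l}(\mathbf{D}). \tag{4}$$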
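Dictionary Learning for Tensor Data: To represent tensor data, conventional DL approaches vectorize tensor data samples and treat them as one-dimensional arrays. One way to explicitly account for the tensor structure in data is to use the Kronecker-structured DL (KS-DL) model, which is based on the Tucker decomposition of tensor data. In the KS-DL model, we assume that observations $\underline{\mathbf{Y}} \in \mathbb{R}^{m_1 \times \cdots \times m_N}$ are generated according to
$$\underline{\mathbf{Y}} = \underline{\mathbf{X}} \times_1 \mathbf{D}_1^0 \times_2 \mathbf{D}_2^0 \times_3 \cdots \times_N \mathbf{D}_N^0 + \underline{\mathbf{E}}, \tag{5}$$
where $\{\mathbf{D}_n^0 \in \mathbb{R}^{m_n \times p_n}\}_{n=1}^{N}$ are generating subdictionaries, and $\underline{\mathbf{X}}$ and $\underline{\mathbf{E}}$ are the coefficient and noise tensors, respectively. Equivalently, the generating model (5) can be stated for $\mathbf{y} \triangleq \mathrm{vec}(\underline{\mathbf{Y}})$ as:
$$\mathbf{y} = \left( \mathbf{D}_N^0 \otimes \mathbf{D}_{N-1}^0 \otimes \cdots \otimes \mathbf{D}_1^0 \right) \mathbf{x} + \boldsymbol{\epsilon}, \tag{6}$$
where $\mathbf{x} \triangleq \mathrm{vec}(\underline{\mathbf{X}})$ and $\boldsymbol{\epsilon} \triangleq \mathrm{vec}(\underline{\mathbf{E}})$. This is the same as the unstructured model with the additional condition that the generating dictionary is a Kronecker product of $N$ subdictionaries. As a result, in the KS-DL problem, the constraint set in (4) becomes $\mathcal{C} = \mathcal{K}^{N}_{\mathbf{m},\mathbf{p}}$, where $\mathcal{K}^{N}_{\mathbf{m},\mathbf{p}}$ is the set of KS matrices with unit-norm columns and $\mathbf{m}$ and $\mathbf{p}$ are vectors containing $m_n$’s and $p_n$’s, respectively. [Footnote 3: We have changed the indexing of subdictionaries for ease of notation.]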
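The equivalence between the tensor-form KS-DL model and its vectorized Kronecker form can be checked numerically. The sketch below does this for the noiseless second-order case ($N = 2$), where the mode products reduce to ordinary matrix products; the matrix sizes are arbitrary assumptions, and `vec` is taken to be column-major vectorization, which is the convention under which the factors appear in reversed order inside the Kronecker product.

```python
import numpy as np

rng = np.random.default_rng(1)
D1 = rng.standard_normal((3, 4))   # mode-1 subdictionary (hypothetical size)
D2 = rng.standard_normal((2, 5))   # mode-2 subdictionary (hypothetical size)
X = rng.standard_normal((4, 5))    # coefficient "tensor" for N = 2

def vec(A):
    """Column-major (Fortran-order) vectorization."""
    return A.reshape(-1, order="F")

# Tensor form: Y = X x_1 D1 x_2 D2, which for N = 2 equals D1 X D2^T.
Y = D1 @ X @ D2.T

# Vectorized form: vec(Y) = (D2 kron D1) vec(X).
lhs = vec(Y)
rhs = np.kron(D2, D1) @ vec(X)
print(np.allclose(lhs, rhs))
```

With row-major vectorization the same identity holds with the factors in the opposite order, which is one reason the indexing of subdictionaries is a matter of convention.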
In summary, the structure in tensor data is exploited in the KS-DL model by assuming the dictionary is “separable” into subdictionaries for each mode. However, as discussed earlier, this separable model is rather restrictive. Instead, we generalize the KS-DL model using the notion of separation rank. [Footnote 4: The term was introduced in  for (see also ).]
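The separation rank of a matrix $\mathbf{A} \in \mathbb{R}^{\prod_n m_n \times \prod_n p_n}$ is the minimum number $r$ of $N$th-order KS matrices $\mathbf{A}^k = \bigotimes_{n=1}^{N} \mathbf{A}_n^k$ such that $\mathbf{A} = \sum_{k=1}^{r} \bigotimes_{n=1}^{N} \mathbf{A}_n^k$, where $\mathbf{A}_n^k \in \mathbb{R}^{m_n \times p_n}$.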
The KS-DL model corresponds to dictionaries with separation rank 1. We instead propose the low separation rank (LSR) DL model in which the separation rank of the underlying dictionary is relatively small so that . This generalizes the KS-DL model to a generating dictionary of the form , where is the separation rank of . Consequently, defining , the empirical rank-constrained LSR-DL problem is
However, the analytical tools at our disposal require the constraint set in (7) to be closed, which we show does not hold for when $N \geq 3$ and $r \geq 2$. In that case, we instead analyze (7) with replaced by (i) closure of and (ii) a certain closed subset of . We refer the reader to Section III for further discussion.
In our study of the LSR-DL model (which includes the KS-DL model as a special case), we use a correspondence between KS matrices and rank-1 tensors, stated in Lemma 1 below, which allows us to leverage techniques and results in the tensor recovery literature to analyze the LSR-DL problem and develop tractable algorithms. (This correspondence was first exploited in our earlier work .)
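Lemma 1. Any $N$th-order Kronecker-structured matrix $\mathbf{A} = \mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \cdots \otimes \mathbf{A}_N$ can be rearranged as a rank-1, $N$th-order tensor $\underline{\mathbf{A}}^{\pi} = \mathbf{a}_N \circ \cdots \circ \mathbf{a}_2 \circ \mathbf{a}_1$ with $\mathbf{a}_n \triangleq \mathrm{vec}(\mathbf{A}_n)$.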
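For second-order KS matrices, the rearrangement into a rank-1 object can be written in a few lines, and the result is an ordinary rank-1 matrix. The sketch below is a hedged illustration for $N = 2$ only: the function `rearrange` and the sizes are assumptions for this example, and it uses row-major vectorization, so the index ordering may differ from the one used in the paper. In this matrix case the separation rank coincides with the usual matrix rank of the rearrangement, which is why it can be computed here even though the general tensor problem is hard.

```python
import numpy as np

def rearrange(K, m, p):
    """Rearrange a (m1*m2) x (p1*p2) matrix so that A kron B maps to
    the outer product of the row-major vectorizations of A and B."""
    (m1, m2), (p1, p2) = m, p
    T = K.reshape(m1, m2, p1, p2).transpose(0, 2, 1, 3)
    return T.reshape(m1 * p1, m2 * p2)

rng = np.random.default_rng(2)
m, p = (3, 2), (4, 5)                  # hypothetical subdictionary sizes
A1, B1 = rng.standard_normal((3, 4)), rng.standard_normal((2, 5))
A2, B2 = rng.standard_normal((3, 4)), rng.standard_normal((2, 5))

R1 = rearrange(np.kron(A1, B1), m, p)  # one KS term
R2 = rearrange(np.kron(A1, B1) + np.kron(A2, B2), m, p)  # two KS terms

print(np.allclose(R1, np.outer(A1.ravel(), B1.ravel())))  # rank-1 structure
print(np.linalg.matrix_rank(R1), np.linalg.matrix_rank(R2))
```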
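It follows immediately from Lemma 1 that if $\mathbf{D} = \sum_{k=1}^{r} \bigotimes_{n=1}^{N} \mathbf{D}_n^k$, then we can rearrange matrix $\mathbf{D}$ into the tensor $\underline{\mathbf{D}}^{\pi} = \sum_{k=1}^{r} \mathbf{d}_N^k \circ \cdots \circ \mathbf{d}_1^k$, where $\mathbf{d}_n^k \triangleq \mathrm{vec}(\mathbf{D}_n^k)$. Therefore, we have the following equivalence: the separation rank of $\mathbf{D}$ is at most $r$ if and only if $\mathrm{rank}(\underline{\mathbf{D}}^{\pi}) \leq r$.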
This correspondence between separation rank and tensor rank highlights a challenge with the LSR-DL problem: finding the rank of a tensor is NP-hard and thus so is finding the separation rank of a matrix. This makes Problem (7) in its current form (and its variants) intractable. To overcome this, we introduce two tractable relaxations to the rank-constrained Problem (7) that do not require explicit computation of the tensor rank. The first relaxation uses a convex regularization term to implicitly impose low tensor rank structure on , which results in a low separation rank . The resulting empirical regularization-based LSR-DL problem is
with , where is described in (3) and is a convex regularizer to enforce low-rank structure on . The second relaxation is a factorization-based LSR-DL formulation in which the LSR dictionary is explicitly written in terms of its subdictionaries. The resulting empirical risk minimization problem is
and the terms are constrained as for some positive constant when and .
In the rest of this paper, we study the problem of identifying the true underlying LSR-DL dictionary by analyzing the LSR-DL Problems (7)–(9) introduced in this section and developing algorithms to solve Problems (8) and (9) in both batch and online settings. Note that while Problem (7) (and its variants when and ) cannot be explicitly solved because of its NP-hardness, identifiability analysis of this problem—provided in Section III—provides the basis for the analysis of tractable Problems (8) and (9), provided in Section IV.
III Identifiability in the Rank-constrained LSR-DL Problem
In this section, we derive conditions under which a dictionary is identifiable as a solution to either the separation rank-constrained problem in (7) or a slight variant of (7) when and . Specifically, we show that under certain assumptions on the generative model, there is at least one local minimum of either Problem (7) or one of its variants that is “close” to the underlying dictionary . Notwithstanding the fact that no efficient algorithm exists to solve the intractable Problem (7), this identifiability result is important in that it lays the foundation for the local identifiability results in tractable Problems (8) and (9).
Generative Model: Let be the underlying dictionary. Each tensor data sample in its vectorized form is independently generated using a linear combination of atoms of dictionary with added noise: , where . Specifically, atoms of are selected uniformly at random, defining the support . Then, we draw a random sparse coefficient vector supported on . We state further assumptions on our model similar to the prior works [29, 15].
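The sampling procedure described in the generative model can be sketched as follows. This is only an illustration under stated assumptions: the support is drawn uniformly at random as in the text, but the specific coefficient law (random signs with magnitudes in [0.5, 1]) and the bounded uniform noise are hypothetical choices made here so that the coefficients are bounded away from zero and the noise is bounded almost surely. The function name `sample_data` and all sizes are likewise assumptions.

```python
import numpy as np

def unit_columns(A):
    """Scale each column of A to unit Euclidean norm."""
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def sample_data(D, s, num_samples, noise_bound, rng):
    """Draw vectorized samples y = D x + eps with s-sparse x."""
    m, p = D.shape
    X = np.zeros((p, num_samples))
    Y = np.empty((m, num_samples))
    for l in range(num_samples):
        # Support: s atoms selected uniformly at random without replacement.
        support = rng.choice(p, size=s, replace=False)
        # Hypothetical coefficient law: random sign times magnitude in [0.5, 1].
        X[support, l] = rng.choice([-1.0, 1.0], size=s) * rng.uniform(0.5, 1.0, size=s)
        # Hypothetical bounded noise, entrywise uniform in [-noise_bound, noise_bound].
        eps = rng.uniform(-noise_bound, noise_bound, size=m)
        Y[:, l] = D @ X[:, l] + eps
    return Y, X

rng = np.random.default_rng(3)
# A separation-rank-1 (KS) dictionary with unit-norm columns, for illustration.
D = np.kron(unit_columns(rng.standard_normal((3, 4))),
            unit_columns(rng.standard_normal((2, 5))))
Y, X = sample_data(D, s=3, num_samples=50, noise_bound=0.01, rng=rng)
print(Y.shape, X.shape)
```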
Assumption 1 (Coefficient Distribution).
Consider a random variable and positive constants and . Define . We assume: i) , ii) , iii) , and iv) and almost surely.
Assumption 2 (Noise Distribution).
Consider a random variable and positive constant . We assume: i) , ii) , and iii) almost surely.
Assume , , , , and the noise is relatively small in the sense that .
Our Approach: In our analysis of the separation rank-constrained LSR-DL problem, we will alternate between four different constraint sets that are related to our dictionary class , namely, , , the closure of under the Frobenius norm, and a closed subset of , defined as . We often use the generic notation for the constraint set when our discussion is applicable to more than one of these sets.
We want to find conditions that imply the existence of a local minimum of within a ball of radius around the true dictionary :
for some small . To this end, we first show that the expected risk function in (2) has a local minimum in for the LSR-DL constraint set .
To show that a local minimum of exists in , we need to show that attains its minimum over in the interior of . [Footnote 5: Having a minimum on the boundary is not sufficient since the function might achieve lower values in the neighborhood of outside .] We show this in two stages. First, we use the Weierstrass Extreme Value Theorem, which dictates that the continuous function attains a minimum in (or on the boundary of) as long as is a compact set. Therefore, we first investigate compactness of in Section III-A. Second, in order to be certain that the minimum of over is a local minimum of , we show that cannot obtain its minimum over on the boundary of , denoted by . To this end, in Section III-B we derive conditions that if is nonempty then [Footnote 6: If the boundary is empty, it is trivial that the infimum is attained in the interior of the set.]
which implies cannot achieve its minimum on .
Finally, in Section III-C we use concentration of measure inequalities to relate in (4) to and find the number of samples needed to guarantee (with high probability) that also has a local minimum in the interior of .
III-A Compactness of the Constraint Sets
When the constraint set is a compact subset of the Euclidean space , the subset is also compact. Thus, we first investigate the compactness of the constraint set . Since is a bounded set, according to the Heine-Borel Theorem , it is a compact subset of if and only if it is closed. Also, can be written as the intersection of and the oblique manifold . In order for to be closed, it suffices to show that and are closed. It is trivial to show is closed; hence, we focus on whether is closed.
In the following, we use the facts that the constraint is equivalent to and that the rearrangement mapping that sends to preserves topological properties of sets such as the distances between the set elements under the Frobenius norm. These facts allow us to translate the topological properties of tensor sets into properties of the structured matrices that we study here.
Let and . Then, the set is not closed. However, the set of KS matrices and the set are closed.
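To illustrate the non-closedness of for $N \geq 3$ and $r \geq 2$ and motivate the use of the sets and in lieu of , we provide an example with $N = 3$ and $r = 2$. Consider the sequence $\mathbf{A}_t = t \left(\mathbf{A}_1 + \tfrac{1}{t}\mathbf{B}_1\right) \otimes \left(\mathbf{A}_2 + \tfrac{1}{t}\mathbf{B}_2\right) \otimes \left(\mathbf{A}_3 + \tfrac{1}{t}\mathbf{B}_3\right) - t\, \mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \mathbf{A}_3$, where $\mathbf{A}_i, \mathbf{B}_i$ are linearly independent pairs. It is clear that $\mathbf{A}_t$ is a sum of two KS matrices, and hence has separation rank at most 2, for any $t$. The limit point of this sequence, however, is $\mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \mathbf{B}_3 + \mathbf{A}_1 \otimes \mathbf{B}_2 \otimes \mathbf{A}_3 + \mathbf{B}_1 \otimes \mathbf{A}_2 \otimes \mathbf{A}_3$, which is a separation-rank-3 matrix. Therefore, the set is not closed.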
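The convergence in this kind of example can be observed numerically. The sketch below uses the classical construction from the tensor-rank literature with random $2 \times 2$ factors, which are assumptions of this illustration: each element of the sequence is a difference of two third-order KS matrices, and the Frobenius distance to a fixed sum of three KS matrices shrinks roughly like $1/t$. The code only demonstrates convergence; it does not certify the separation rank of the limit, which is not checked here.

```python
import numpy as np

def kron3(A, B, C):
    """Third-order Kronecker product A (x) B (x) C."""
    return np.kron(np.kron(A, B), C)

rng = np.random.default_rng(0)
A = [rng.standard_normal((2, 2)) for _ in range(3)]
B = [rng.standard_normal((2, 2)) for _ in range(3)]

# Candidate limit point: a sum of three KS matrices.
limit = (kron3(A[0], A[1], B[2])
         + kron3(A[0], B[1], A[2])
         + kron3(B[0], A[1], A[2]))

def seq(t):
    """Sequence element: a sum of two KS terms (one with a negative scale)."""
    first = t * kron3(A[0] + B[0] / t, A[1] + B[1] / t, A[2] + B[2] / t)
    second = t * kron3(A[0], A[1], A[2])
    return first - second

errors = [np.linalg.norm(seq(t) - limit) for t in (1e1, 1e2, 1e3)]
print(errors)  # decreasing as t grows
```

Note that the two KS terms in `seq(t)` each have norm growing with `t` while their difference stays bounded, which matches the observation that such problematic sequences involve unbounded individual terms.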
The non-closedness of means there exist sequences in whose limit points are not in the set. Two possible ways to circumvent this issue are: (i) use the closure of as the constraint set, and (ii) eliminate such sequences from . We discuss each solution in detail below.
Adding the limit points
We denote the closure of by . By slightly relaxing the constraint set in (7) to , we can instead solve the following:
where . Note that (i) a solution to (7) is a solution to (12) and (ii) a solution to (12) is either a solution to (7) or is arbitrarily close to a member of . [Footnote 7: The first argument holds since if for all , by continuity it also holds for all . The second argument is trivial.]
Eliminating the problematic sequences
In order to exclude the sequences such that for all and , we first need to characterize them.
Assume where and . We can write where . Then, as . In fact, at least two of the coefficient sequences are unbounded.
The following corollary of Lemma 3 suggests that one can exclude the problematic sequences from by bounding the norm of individual KS (separation-rank-) terms.
Consider the set whose members can be written as such that . Then, for any the set is closed.
We have now shown that the sets , , , and are compact subsets of . Next, we provide asymptotic identifiability results for these compact constraint sets.
III-B Asymptotic Analysis for Dictionary Identifiability
Now that we have discussed the compactness of the relevant constraint sets, we are ready to show that the minimum of over , defined in (10), is not attained on . This will complete our proof of existence of a local minimum of in . In our proof, we make use of a result in Gribonval et al. , presented here in Lemma 4.
Lemma 4 (Theorem 1 in Gribonval et al. ).
Since is a continuous function and the ball is compact, by the extreme value theorem, attains its infimum at a point in the ball. If this minimum is attained in the interior of then it is a local minimum of . Therefore, a key ingredient of the proof is showing that for all if is nonempty. Lemma 4 states the conditions under which on , where .
Next, we discuss finite sample identifiability of the true dictionary for three of the constraint sets.
III-C Sample Complexity for Dictionary Identifiability
We now derive the number of samples required to guarantee, with high probability, that has a local minimum at a point “close” to when the constraint set is either , , or for and . First, we use concentration of measure inequalities based on the covering number of the dictionary class to show that the empirical loss uniformly converges to its expectation with high probability. This is formalized below.
Lemma 5 (Theorem 1 and Lemma 11, Gribonval et al. ).
Define . It follows from (14) that with high probability (w.h.p.),
for all . Therefore, when for all , we have for all . In this case, we can use similar arguments as in the asymptotic analysis to show that has a local minimum at a point in the interior of . Hence, our focus in this section is on finding the sample complexity required to guarantee that w.h.p. We begin with characterization of covering numbers of the three constraint sets, which may also be of independent interest to some readers.
Covering Numbers: The covering number of the set with respect to the norm is known in the literature to be upper bounded as follows :
We now turn to finding the covering numbers of LSR sets and . The following lemma establishes a bound on covering number of , which depends on the separation rank exponentially.
The covering number of the set with respect to the norm is upper bounded as follows:
Next, we obtain an upper bound on the covering number of for a given constant .
The covering number of the set with respect to the max-column norm is bounded as follows: