Lower Memory Oblivious (Tensor) Subspace Embeddings with Fewer Random Bits: Modewise Methods for Least Squares

Abstract.

In this paper new general modewise Johnson-Lindenstrauss (JL) subspace embeddings are proposed that are both considerably faster to generate and easier to store than traditional JL embeddings when working with extremely large vectors and/or tensors.

Corresponding embedding results are then proven for two different types of low-dimensional (tensor) subspaces. The first of these new subspace embedding results produces improved space complexity bounds for embeddings of rank- tensors whose CP decompositions are contained in the span of a fixed (but unknown) set of rank-one basis tensors. In the traditional vector setting this first result yields new and very general near-optimal oblivious subspace embedding constructions that require fewer random bits to generate than standard JL embeddings when embedding subspaces of spanned by basis vectors with special Kronecker structure. The second result proven herein provides new fast JL embeddings of arbitrary -dimensional subspaces which also require fewer random bits (and so are easier to store – i.e., require less space) than standard fast JL embedding methods in order to achieve small -distortions. These new oblivious subspace embedding results work by effectively folding any given vector in into a (not necessarily low-rank) tensor, and then embedding the resulting tensor into for .

Applications related to compression and fast compressed least squares solution methods are also considered, including those used for fitting low-rank CP decompositions, and the proposed JL embedding results are shown to work well numerically in both settings.

1. Motivation and Applications

Due to the recent explosion of massively large-scale data, the need for geometry-preserving dimension reduction has become important in a wide array of applications in signal processing (see e.g. [21, 20, 3, 55, 25, 12]) and data science (see e.g. [6, 13]). This reduction is possible even on large dimensional objects when the class of such objects possesses some sort of lower dimensional intrinsic structure. For example, in classical compressed sensing [21, 20] and its related streaming applications [16, 17, 24, 30], the signals of interest are sparse vectors – vectors whose entries are mostly zero. In matrix recovery [13, 44], one often analogously assumes that the underlying matrix is low-rank. Under such models, tools like the Johnson-Lindenstrauss lemma [32, 2, 18, 36, 37] and the related restricted isometry property [14, 5] ask that the geometry of the signals be preserved after projection into a lower dimensional space. Typically, such projections are obtained via random linear maps that map into a dimension much smaller than the ambient dimension of the domain; -sparse -dimensional vectors can be projected into a dimension that scales like and rank- matrices can be recovered from linear measurements [21, 20, 13]. Then, inference tasks or reconstruction can be performed from those lower dimensional representations.

Here, our focus is on dimension reduction of tensors, multi-way arrays that appear in an abundance of large-scale applications ranging from video and longitudinal imaging [39, 9] to machine learning [45, 51] and differential equations [8, 40]. Although a natural extension beyond matrices, their complicated structure leads to challenges both in defining low dimensional structure and in designing dimension reduction projections. In particular, there are many notions of tensor rank, and various techniques exist to compute the corresponding decompositions [35, 54]. In this paper, we focus on tensors with low CP-rank, tensors that can be written as a sum of a few rank-1 tensors, each an outer product of basis vectors. The CP-rank and CP-decompositions are natural extensions of matrix rank and the SVD, and are well motivated by applications such as topic modeling, psychometrics, signal processing, linguistics and many others [15, 26, 4]. Although there are now some nice results for low-rank tensor dimension reduction (see e.g. [43, 38, 48]), these give theoretical guarantees for dimension-reducing projections that act on tensors via their matricizations or vectorizations. Here, our goal is to provide similar guarantees but for projections that act directly on the tensors themselves without the need for unfolding. In particular, this means the projections can be defined modewise using the CP-decomposition, and that the low dimensional representations are also tensors, not vectors. This extends the applicability of such embeddings to settings in which one cannot afford to perform unfoldings or for which it is not natural to do so. In particular, for tensors in for large and , this avoids having to store an often impossibly large linear map. We elaborate on our main contributions next.

1.1. Contributions and Related Work

In this paper we analyze modewise tensor embedding strategies for general -mode tensors similar to those introduced and analyzed for -mode tensors in [47]. In particular, herein we focus on obliviously embedding an a priori unknown -dimensional subspace of a given tensor product space into a similarly low-dimensional vector space with high probability. In contrast to the standard approach of effectively vectorizing the tensor product space and then embedding the resulting transformed subspace using standard JL methods involving a single massive matrix (see, e.g., [38]), the approaches considered herein instead require generating and storing significantly smaller matrices, which are then combined to form a linear embedding operator via

(1)

where each is a -mode product (reviewed below in §2.1), and is a trivial vectorization operator.
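For intuition, the following minimal numpy sketch assembles an operator of the form (1) for a hypothetical 3-mode tensor: one small random matrix per mode is applied via mode products, and a final ordinary JL matrix is applied to the vectorized result. All sizes and the Gaussian choice of matrices are illustrative assumptions, not the specific construction analyzed later in the paper.

```python
import numpy as np

def mode_k_product(X, A, k):
    """Multiply tensor X along mode k by matrix A (shape m x X.shape[k])."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, k)), 0, k)

rng = np.random.default_rng(0)
n, m1, m2 = 50, 10, 5            # hypothetical mode size and sketch sizes
X = rng.standard_normal((n, n, n))

# One small random matrix per mode (much smaller than one n**3-column map).
A = [rng.standard_normal((m1, n)) / np.sqrt(m1) for _ in range(3)]

Y = X
for k, Ak in enumerate(A):       # modewise products as in (1)
    Y = mode_k_product(Y, Ak, k)

# Final (ordinary) JL map applied to the vectorized intermediate tensor.
B = rng.standard_normal((m2, Y.size)) / np.sqrt(m2)
embedding = B @ Y.reshape(-1)    # lives in a space of dimension m2
print(embedding.shape)           # (5,)
```

Note that, in this toy example, the three 10 x 50 mode matrices and the final 5 x 1000 matrix together contain only 6,500 entries, whereas a vectorize-first approach would require a single 5 x 125,000 matrix.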

Let be the number of rows one must use for both and above (as we shall see, the number of rows required for both matrices will indeed be essentially equivalent). The collective sizes of the matrices needed to define above will be much smaller (and therefore easier to store, transmit, and generate) than whenever holds. As a result, much of our discussion below will revolve around bounding the dominant term on the left-hand side above, which will also occasionally be referred to below as the intermediate embedding dimension. We are now prepared to discuss our two main results.

General Oblivious Subspace Embedding Results for Low Rank Tensor Subspaces Satisfying an Incoherence Condition

The first of our results provides new oblivious subspace embeddings for tensor subspaces spanned by bases of rank-one tensors, as well as establishing related least squares embedding results that are of value in, e.g., fitting a general tensor with an accurate low-rank CPD approximation. One of its main contributions is the generality with which it allows one to select the matrices used to construct the JL embedding in (1). In particular, it allows each of these matrices to be drawn independently from any desired nearly-optimal family of JL embeddings (as defined immediately below) that the user likes.

Definition 1 (-JL embedding).

Let . We will call a matrix an -JL embedding of a set into if

holds for some for all .

Definition 2.

Fix and let be a family of probability distributions where each is a distribution over matrices. We will refer to any such family of distributions as being an -optimal family of JL embedding distributions if there exists an absolute constant such that, for any given , with , and nonempty set of cardinality

a matrix will be an -JL embedding of into with probability at least .

In fact, many -optimal families of JL embedding distributions exist for any given including, e.g., those associated with random matrices having i.i.d. subgaussian entries (see Lemma 9.35 in [22]) as well as those associated with sparse JLT constructions [33]. The next theorem proves that any desired combination of such matrices can be used to construct a JL embedding as per (1) for any tensor subspace spanned by a basis of rank-one tensors satisfying an easily testable (and relatively mild) coherence condition. We utilize the notation set forth below in Section 2.
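As a concrete illustration of one such family, the sketch below draws a dense Gaussian matrix whose number of rows scales like ε^{-2} log|S|, which is the qualitative behavior Definition 2 asks for; the constant C and all sizes are illustrative placeholders rather than the absolute constant from the definition.

```python
import numpy as np

def gaussian_jl_matrix(d, set_size, eps, C=8.0, rng=None):
    """Draw an m x d Gaussian JL matrix with m ~ C * eps**-2 * log|S|.

    A sketch of one member of a subgaussian family; the constant C is a
    placeholder, not the absolute constant appearing in Definition 2.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = int(np.ceil(C * np.log(set_size) / eps**2))
    return rng.standard_normal((m, d)) / np.sqrt(m)

A = gaussian_jl_matrix(d=10_000, set_size=500, eps=0.25)
x = np.random.default_rng(1).standard_normal(10_000)
print(A.shape, abs(np.linalg.norm(A @ x)**2 / np.linalg.norm(x)**2 - 1.0))
```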

Theorem 1.

Fix and . Let , , and be an -dimensional subspace of spanned by a basis of rank one tensors with modewise coherence satisfying

Then, one can construct a linear operator as per (1) with for an absolute constant so that with probability at least

(2)

will hold for all .

If the intermediate embedding dimension can be bounded above by

(3)

for an absolute constant . If, however, then (2) holds for all and

(4)

can be achieved, where is another absolute constant.

Proof.

This is largely a restatement of Theorem 6. When defining as per (1) following Theorem 6 one should draw with from an -optimal family of JL embedding distributions for each , where each is an absolute constant. Furthermore, should be drawn from an -optimal family of JL embedding distributions with as above. The probability bound and (3) then both follow. The achievable intermediate embedding dimension when in (4) can be obtained from Corollary 2 since the bound can then be utilized in that case. ∎

One can vectorize the tensors and tensor spaces considered in Theorem 1 using variants of (14) to achieve subspace embedding results for subspaces spanned by basis vectors with special Kronecker structure as considered in, e.g., two other recent papers that appeared during the preparation of this manuscript [31, 41]. The most recent of these papers also produces bounds on what amounts to the intermediate embedding dimension of a JL subspace embedding along the lines of (1) when (see Theorem 4.1 in [41]). Comparing (4) to that result we can see that Theorem 1 has reduced the dependence of the effective intermediate embedding dimension achieved therein from to (now independent of ) for a much more general set of modewise embeddings. However, Theorem 1 incurs a worse dependence on epsilon and needs the stated coherence assumption concerning to hold. As a result, Theorem 1 provides a large new class of modewise subspace embeddings that will also have fewer rows than those in [41] for a large range of ranks provided that is sufficiently small and is sufficiently large.

Note further that the form of (2) also makes Theorem 1 useful for solving least squares problems of the type encountered while computing approximate CP decompositions for an arbitrary tensor using alternating least squares methods (see, e.g., §4 for a related discussion as well as [7] where modewise strategies were shown to work well for solving such problems in practice). Comparing Theorem 1 to the recent least squares result of the same kind proven in [31] (see Corollary 2.4), we can see that Theorem 1 has reduced the dependence of the effective intermediate embedding dimension achievable in [31] from therein to in (3) for a much more general set of modewise embeddings. In exchange, Theorem 1 again incurs a worse dependence on epsilon and needs the stated coherence assumption concerning to hold. As a result, Theorem 1 guarantees that a larger class of modewise JL embeddings can be used in least squares applications, and that they will also have smaller intermediate embedding dimensions as long as is sufficiently small and is sufficiently large.
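To illustrate how embeddings satisfying guarantees like (2) are typically used for least squares, the following sketch-and-solve snippet compares the exact solution of an overdetermined system with the solution of its sketched counterpart. The dense Gaussian sketch and the problem sizes are illustrative stand-ins for the modewise embeddings discussed above, not the construction from Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 10_000, 30, 500        # hypothetical sizes: tall system, small sketch
A = rng.standard_normal((n, r))
b = A @ rng.standard_normal(r) + 0.01 * rng.standard_normal(n)

# Exact least squares solution.
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch-and-solve: embed the column space of [A, b] and solve the small problem.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

print(np.linalg.norm(x_exact - x_sketch) / np.linalg.norm(x_exact))
```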

Fast Oblivious Subspace Embedding Results for Arbitrary Tensor Subspaces

Our second main result builds on Theorem 2.1 of Jin, Kolda, and Ward in [31] to provide improved fast subspace embedding results for arbitrary tensor subspaces (i.e., for low dimensional tensor subspaces whose basis tensors have arbitrary rank and coherence). Let . By combining elements of the proof of Theorem 1 with the optimal -dependence of Theorem 2.1 in [31], we are able to provide a fast modewise oblivious subspace embedding as per (1) that will simultaneously satisfy (2) for all in an entirely arbitrary -dimensional tensor subspace with probability at least while also achieving an intermediate embedding dimension bounded above by

(5)

Above, is an absolute constant. Note that neither nor in (5) is raised to a power of , which marks a tremendous improvement over all of the previously discussed results when is large. See Theorem 8 for details.
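For intuition, fast maps of this general flavor are often built from random signs, a fast orthonormal transform, and random row sampling; the snippet below sketches one such FJLT-style map. It is an illustrative, assumption-laden example and is not claimed to be the exact construction used in [31] or in Theorem 8.

```python
import numpy as np

def fast_jl_apply(x, m, rng):
    """Apply an FJLT-style map to x: random signs, unitary FFT, row sampling.

    Illustrative sketch only; the scaling and sampling scheme are simplified.
    """
    n = x.size
    signs = rng.choice([-1.0, 1.0], size=n)
    rows = rng.choice(n, size=m, replace=False)
    return np.sqrt(n / m) * np.fft.fft(signs * x, norm="ortho")[rows]

rng = np.random.default_rng(0)
x = rng.standard_normal(2**14)
y = fast_jl_apply(x, m=256, rng=rng)
print(abs(np.linalg.norm(y)**2 / np.linalg.norm(x)**2 - 1.0))
```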

As alluded to above, the results herein can also be used to create new JL subspace embeddings in the traditional vector space setting. Our next and final main result does this explicitly for arbitrary vector subspaces by restating a variant of Theorem 8 in that context. We expect that this result may be of independent interest outside of the tensor setting.

Theorem 2.

Fix and . Let such that and for an absolute constant , and let be an -dimensional subspace of for . Then, one can construct a random matrix with

(6)

for an absolute constant such that with probability at least it will be the case that

holds for all . Furthermore, requires only

(7)

random bits and memory for storage for an absolute constant , and can be multiplied against any vector in just -time.

Note that choosing produces an oblivious subspace embedding result for , and that choosing to be the column space of a rank matrix produces a result useful for least squares sketching.

Proof.

This follows from Theorem 8 after identifying with (i.e., after effectively reshaping any given vectors under consideration into -mode tensors ). Note further that if then one can implicitly pad the vectors of interest with zeros until it is (i.e., effectively trivially embedding into ) before proceeding. ∎
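A minimal sketch of the reshaping step described in the proof, under hypothetical sizes: a long vector is zero-padded, folded into a 3-mode tensor, compressed modewise by Gaussian matrices, and flattened (the final JL map from Theorem 8 is omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n = 100_000, 3, 47                 # hypothetical: fold into a d-mode tensor
while n**d < N:                          # smallest side length n with n**d >= N
    n += 1

x = rng.standard_normal(N)
x_padded = np.concatenate([x, np.zeros(n**d - N)])   # implicit zero padding
X = x_padded.reshape((n,) * d)                       # fold the vector into a tensor

m = 12                                               # illustrative sketch size per mode
Y = X
for k in range(d):                                   # compress each mode in turn
    Ak = rng.standard_normal((m, n)) / np.sqrt(m)
    Y = np.moveaxis(np.tensordot(Ak, Y, axes=(1, k)), 0, k)

print(Y.reshape(-1).shape)                           # (m**d,) = (1728,)
```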

1.2. Organization

The remainder of the paper is organized as follows. Section 2 provides background and notation for tensors (Subsections 2 and 2.1), as well as for Johnson-Lindenstrauss embeddings (Subsection 2.2).

We start Section 3 with the definitions of tensor rank (and of low-rank tensor subspaces) and of the maximal modewise coherence of tensor subspace bases. Then we work our way to Corollary 2, which gives our first main result on oblivious tensor subspace embeddings via modewise tensor products (for any fixed subspace having low enough modewise coherence). This result is very general in terms of the JL-embedding maps one can use as building blocks in each mode. Finally, in Subsection 3.2 we discuss the assumption of modewise incoherence and provide several natural examples of incoherent tensor subspaces.

In Section 4 we describe the fitting problem for approximately low-rank tensors and explain how modewise dimension reduction (as presented in Section 3) reduces the complexity of the problem. Then we build the machinery to show that the solution of the reduced problem will be a good solution for the original problem (in Theorem 6). We conclude Section 4 by introducing a two-step embedding procedure that allows one to further reduce the final embedding dimension (this is our second main embedding result, Theorem 8). This improved procedure relies on a specific form of JL embedding in each mode. Both embedding results can be applied to the fitting problem.

In Section 5 we present some simple experiments confirming our theoretical guarantees, and then we conclude in Section 6.

2. Notation, Tensor Basics, & Linear Johnson-Lindenstrauss Embeddings

Tensors, matrices, vectors, and scalars are denoted in different typefaces for clarity below. Calligraphic boldface capital letters are always used for tensors, boldface capital letters for matrices, boldface lower-case letters for vectors, and regular (lower-case or capital) letters for scalars. The matrix will always represent the identity matrix. The set of the first natural numbers will be denoted by for all .

Throughout the paper, denotes the Kronecker product of vectors or matrices, and denotes the tensor outer product of vectors or tensors. The symbol , on the other hand, represents the composition of functions (see e.g. Section 4). Numbers in parentheses used as a subscript or superscript on a tensor either denote unfoldings (introduced in Section 2.1) when appearing in a subscript, or else an element in a sequence when appearing in a superscript. The notation for a given set of vectors will always denote the vector . Additional tensor definitions and operations are reviewed below (see, e.g., [35, 19, 50, 54] for additional details and discussion).

2.1. Tensor Basics

The set of all -mode tensors forms a vector space over the complex numbers when equipped with component-wise addition and scalar multiplication. The inner product of will be given by

(8)

This inner product then gives rise to the standard Euclidean norm

(9)

If we say that and are orthogonal. If and are orthogonal and also have unit norm (i.e., have ) we say that they are orthonormal.
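As a quick illustration of (8) and (9), the inner product and norm can be computed by flattening the tensors; which factor is conjugated below is an assumption and should be matched to the convention adopted in (8).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6)) + 1j * rng.standard_normal((4, 5, 6))
Y = rng.standard_normal((4, 5, 6)) + 1j * rng.standard_normal((4, 5, 6))

# Componentwise inner product; np.vdot conjugates its first argument.
inner = np.vdot(Y.reshape(-1), X.reshape(-1))

# The induced Euclidean norm (9) agrees with the norm of the flattened tensor.
norm_X = np.sqrt(np.vdot(X.reshape(-1), X.reshape(-1)).real)
print(inner, np.isclose(norm_X, np.linalg.norm(X)))
```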


Tensor outer products: The tensor outer product of two tensors and , , is a -mode tensor whose entries are given by

(10)

Note that when and are both vectors, the tensor outer product will reduce to the standard outer product.
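A short numerical check of (10) using numpy: the entries of the outer product are pairwise products of the entries of the factors, and for two vectors it reduces to the usual outer product matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))         # a 2-mode tensor
B = rng.standard_normal((5,))           # a vector (1-mode tensor)

# Tensor outer product: a 3-mode tensor with entries A[i, j] * B[k].
T = np.tensordot(A, B, axes=0)
print(T.shape, np.isclose(T[1, 2, 3], A[1, 2] * B[3]))

# For two vectors the tensor outer product reduces to the standard outer product.
u, v = rng.standard_normal(3), rng.standard_normal(4)
print(np.allclose(np.tensordot(u, v, axes=0), np.outer(u, v)))
```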

Lemma 1.

Let , and . Then,

  1. .

Proof.

The first property follows from the fact that

To establish the second property we note that

Fibers: Let tensor . The vectors in obtained by fixing all of the indices of except for the one that corresponds to its mode are called its mode- fibers. Note that any such will have mode- fibers denoted by .

Tensor matricization (unfolding): The process of reordering the elements of the tensor into a matrix is known as matricization or unfolding. The mode- matricization of a tensor is denoted as and is obtained by arranging ’s mode- fibers to be the columns of the resulting matrix.
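One common way to realize a mode-k unfolding in code is to bring mode k to the front and reshape (0-based mode indices below). The resulting column ordering follows numpy's row-major convention for the remaining modes, which may differ from the ordering used in [35]; any fixed ordering works for identities such as (12) provided it is applied consistently, although it does affect the order of the Kronecker factors in (14).

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: mode-k fibers become the columns of a matrix."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))

print(unfold(X, 0).shape, unfold(X, 1).shape, unfold(X, 2).shape)
# (3, 20) (4, 15) (5, 12)

# Each column of unfold(X, 1) is a mode-1 fiber (here: X[i, :, j] for some i, j).
print(np.allclose(unfold(X, 1)[:, 0], X[0, :, 0]))
```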

-mode products: The -mode product of a -mode tensor with a matrix is another -mode tensor . Its entries are given by

(11)

for all . Looking at the mode- unfoldings of and one can easily see that their -mode matricization can be computed as a regular matrix product

(12)

for all . A short numerical check of this identity is sketched below. The following simple lemma then formally lists several important properties of modewise products.
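A sketch of the k-mode product via a single tensor contraction, together with a numerical check of the standard identity behind (12): the mode-k unfolding of the k-mode product equals the matrix times the mode-k unfolding of the original tensor (0-based mode indices in code).

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding (mode-k fibers as columns, numpy's default ordering)."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_k_product(X, A, k):
    """k-mode product of X with A, implemented as a single tensor contraction."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, k)), 0, k)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
A = rng.standard_normal((7, 4))          # acts on mode 1 (size 4)

Y = mode_k_product(X, A, 1)
print(Y.shape)                           # (3, 7, 5)

# The identity behind (12): unfold(Y, k) == A @ unfold(X, k).
print(np.allclose(unfold(Y, 1), A @ unfold(X, 1)))
```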

Lemma 2.

Let , , and for all . The following four properties hold:

  1. .

  2. .

  3. If then .

  4. If then .

Proof.

The first, second, and fourth facts above are easily established using mode- unfoldings. To establish above, we note that

Reshaping both sides of the derived equality back into their original tensor forms now completes the proof. The proof of using unfoldings is nearly identical.

To prove we may again use mode- unfoldings to see that

Reshaping these expressions back into their original tensor forms again completes the proof.

To prove it is perhaps easiest to appeal directly to the component-wise definition of the mode- product given in equation (11). Suppose that (the case is nearly identical). Set and to simplify subscript notation. We have for all , , and with that

A generalization of the observation (12) is available: unfolding the tensor

(13)

along the mode is equivalent to

(14)

where is the matrix Kronecker product (see [35]). In particular, (14) implies that the matricization . On a related note, one can also express the relation between the vectorized forms of and in (13) as

(15)

where vect is the vectorization operator.
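A numerical check of the vectorization relation (15) under numpy's row-major flattening: with this convention the Kronecker factors appear in mode order, whereas the column-major convention of [35] reverses their order, so the exact form of (15) depends on the vectorization convention adopted.

```python
import numpy as np

def mode_k_product(X, A, k):
    """k-mode product of X with A via a single tensor contraction."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, k)), 0, k)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
A = [rng.standard_normal((2, 3)),
     rng.standard_normal((6, 4)),
     rng.standard_normal((7, 5))]

Y = X
for k, Ak in enumerate(A):
    Y = mode_k_product(Y, Ak, k)

# Row-major (C-order) vectorization pairs with A[0] kron A[1] kron A[2];
# the column-major convention of [35] would use the reversed Kronecker order.
K = np.kron(np.kron(A[0], A[1]), A[2])
print(np.allclose(Y.reshape(-1), K @ X.reshape(-1)))
```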

It is worth noting that trivial inner product preserving isomorphisms exist between a tensor space and any of its matricized versions (i.e., mode- matricization can be viewed as an isomorphism between the original tensor vector space and its mode- matricized target vector space ). In particular, the process of matricizing tensors is linear. If, for example, then one can see that the mode- matricization of is for all modes .

2.2. Linear Johnson-Lindenstrauss Embeddings

Many linear -JL embedding matrices exist [32, 2, 18, 36, 37], with the best achievable for arbitrary (see [37] for results concerning the optimality of this embedding dimension). Of course, one can define JL embeddings of tensors in a similar way, namely, as linear maps that approximately preserve the tensor norm:

Definition 3 (Tensor -JL embedding).

A linear operator is an -JL embedding of a set into if

holds for some for all .

It is easy to check that JL embeddings also approximately preserve pairwise inner products.

Lemma 3.

Let and suppose that is an -JL embedding of the vectors

into . Then,

Proof.

This well-known result is an easy consequence of the polarization identity for inner products. We have that

where the second-to-last inequality follows from Young's inequality for products. ∎
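For concreteness, one common form of this computation (written for the real case, and assuming that x + y and x - y belong to the embedded set) is sketched below; the exact set of embedded vectors, the constants, and the precise use of Young's inequality in Lemma 3 may differ from this sketch.

```latex
% Real case, assuming x + y and x - y are among the embedded vectors:
\begin{align*}
\langle Ax, Ay\rangle - \langle x, y\rangle
  &= \tfrac{1}{4}\Big(\|A(x+y)\|_2^2 - \|A(x-y)\|_2^2\Big)
   - \tfrac{1}{4}\Big(\|x+y\|_2^2 - \|x-y\|_2^2\Big), \\
\big|\langle Ax, Ay\rangle - \langle x, y\rangle\big|
  &\le \tfrac{\varepsilon}{4}\Big(\|x+y\|_2^2 + \|x-y\|_2^2\Big)
   = \tfrac{\varepsilon}{2}\Big(\|x\|_2^2 + \|y\|_2^2\Big).
\end{align*}
```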

The fact that is an inner product space means that the following trivial generalization of Lemma 3 to the tensor JL embeddings also holds.

Lemma 4.

Let and suppose that is an -JL embedding of the tensors

into . Then,

Proof.

The proof is similar to that of Lemma 3, with replacing , and making use of the linearity of . ∎


When a more general set, for example a low-rank subspace of tensors, is embedded using JL embeddings, a discretization technique can be used in order to pass to a smaller finite set. Due to linearity, it actually suffices to discretize the unit ball of the space in question. In the next lemma we present a simple subspace embedding result based on a standard covering argument (see, e.g., [5, 22]). We include its proof for the sake of completeness.

Lemma 5.

Fix . Let be an -dimensional subspace of , and let be an -net of the -dimensional Euclidean unit sphere . Then, if is an -JL embedding of it will also satisfy

(16)

for all . Furthermore, we note that there exists an -net such that .

Proof.

The cardinality bound on can be obtained from the covering results in Appendix C of [22]. It is enough to establish for an arbitrary due to the linearity of and . Let , and choose an element with . We have that

holds for all . This, in turn, means that the upper bound above will hold for a vector realizing so that must also hold. As a consequence, . The upper bound now follows.

To establish the lower bound we define and note that this quantity will also be realized by some element of the compact set . As above we consider this minimizing vector and choose an element with in order to see that

As a consequence, . The lower bound now follows. ∎

Remark 1.

We will see later in the text that the cardinality (exponential in ) can be too big to produce tensor JL embeddings with optimal embedding dimensions. In this case one can use a much coarser “discretization” to improve the dependence on based on, e.g., the next lemma.


With Lemma 4 in hand, we are now able to prove a secondary subspace embedding result which, though it leads to suboptimal results in the vector setting, will be valuable for higher-mode tensors.

Lemma 6.

Fix and let be an -dimensional subspace of spanned by a set of orthonormal basis tensors . If is an -JL embedding of the tensors

into , then

holds for all .

Proof.

Appealing to Lemma 4 we can see that for all . As a consequence, we have for any that

To finish we now note that due to the orthonormality of the basis tensors . ∎

3. Modewise Linear Johnson-Lindenstrauss Embeddings of Low-Rank Tensors

In this section, we consider low-rank tensor subspace embeddings for tensors with low-rank expansions in terms of rank-one tensors (i.e., for tensors with low-rank CP decompositions). Our general approach will be to utilize subspace embeddings along the lines of Lemmas 5 and 6 in this setting. However, the fact that our basis tensors are rank-one will cause us some difficulties. Principal among those difficulties will be our inability to guarantee that we can find an orthonormal, or even fairly incoherent, basis of rank-one tensors that spans any particular -dimensional tensor subspace we may be interested in below.

Going forward we will consider the standard form of a given rank- -mode tensor defined by

(17)
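For intuition, a tensor of this form can be assembled as a weighted sum of rank-one outer products of its factor vectors; the numpy sketch below uses hypothetical factor matrices and is not intended as library-quality code.

```python
import numpy as np
from functools import reduce

def cp_tensor(factors, weights=None):
    """Build a CP tensor: a weighted sum of rank-one outer products.

    factors[k] has shape (I_k, r); column j holds the mode-k factor of the
    j-th rank-one term. A sketch of the standard CP form, not library code.
    """
    r = factors[0].shape[1]
    weights = np.ones(r) if weights is None else weights
    X = np.zeros(tuple(F.shape[0] for F in factors))
    for j in range(r):
        rank_one = reduce(lambda T, F: np.tensordot(T, F[:, j], axes=0),
                          factors[1:], factors[0][:, j])
        X += weights[j] * rank_one
    return X

rng = np.random.default_rng(0)
factors = [rng.standard_normal((4, 2)), rng.standard_normal((5, 2)),
           rng.standard_normal((6, 2))]
X = cp_tensor(factors)
print(X.shape)     # (4, 5, 6) -- a 3-mode tensor with CP-rank at most 2
```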