Abstract

Sketching refers to a class of randomized dimensionality reduction methods that aim to preserve relevant information in large-scale datasets. They have efficient memory requirements and typically require just a single pass over the dataset. Efficient sketching methods have been derived for vector and matrix-valued datasets. When the datasets are higher-order tensors, a naive approach is to flatten the tensors into vectors or matrices and then sketch them. However, this is inefficient since it ignores the multi-dimensional nature of tensors. In this paper, we propose a novel multi-dimensional tensor sketch (MTS) that preserves higher-order data structure while reducing dimensionality. We build this as an extension to the popular count sketch (CS) and show that it yields an unbiased estimator of the original tensor. We demonstrate significant advantages in compression ratios when the original data has decomposable tensor representations such as the Tucker, CP, tensor train or Kronecker product forms. We apply MTS to tensorized neural networks where we replace fully connected layers with tensor operations. We achieve nearly state-of-the-art accuracy with significant compression on image classification benchmarks.


 

Multi-dimensional Tensor Sketch:
Dimensionality Reduction That Retains Efficient Tensor Operations

 

Yang Shi1  Animashree Anandkumar2 


footnotetext: 1University of California, Irvine, USA 2California Institute of Technology, USA. Correspondence to: Yang Shi <shiy4@uci.edu>.

1. Introduction

Many modern machine learning and data mining applications involve manipulating large-scale multi-dimensional data structures. For instance, the input data can be multi-modal or multi-relational (e.g. combination of image and text), and the intermediate computations can involve higher-order tensor operations (e.g. layers in a tensorized neural network). Memory, bandwidth and computational requirements are usually bottlenecks when these operations are done at scale. Efficient dimensionality reduction schemes can greatly alleviate this issue if they can find a compact representation while preserving accuracy.

A popular class of dimensionality reduction techniques involves spectral decomposition, such as principal component analysis (PCA), singular value decomposition (SVD) and tensor decompositions. These methods aim to fit the dataset with low-rank approximations, which can be interpreted as a low-dimensional latent structure in the data. They have shown good performance in many applications such as topic modelling (Azar et al., 2001) and recommendation systems (Koren et al., 2009). However, for large-scale datasets, exact computation of spectral decompositions is expensive and randomized methods are used instead.

Sketching forms another class of randomized dimensionality reduction methods. These methods have efficient memory requirements, typically make just a single pass over the data, and come with approximation guarantees for recovering the original data. In addition, they allow certain operations to be carried out accurately in the low-dimensional sketched space, e.g. inner products and outer products of vectors. Count sketch (CS) is the simplest sketching technique (Charikar et al., 2002); it uses random signs and random hash functions to carry out dimensionality reduction, and it has been applied in many settings. Demaine et al. (2002) design a method for frequency estimation of internet packet streams, whose purpose is to determine essential features of the traffic stream using limited space. Another application is multi-modal pooling of features: in visual question answering (VQA), this involves bilinear pooling of image and text features.

Sketching techniques, however, mostly focus on vector- and matrix-valued data. A few works attempt to extend them to higher-order settings, e.g. to non-linear kernels (Pham & Pagh, 2013) and to higher-order tensors (Wang et al., 2015), which we refer to as count-based tensor sketch (CTS). However, these methods view the tensor as a set of vectors and sketch along each fibre (the higher-order analogue of matrix rows and columns) of the tensor. Hence, they still use inherently vector-based sketching techniques, do not exploit the full multi-dimensional nature of the data, and can miss correlations across different modes. For instance, image features extracted from a traditional vision network contain spatial information; sketching the data separately at each location ignores the connection between adjacent pixels.

Main contributions: We propose the multi-dimensional tensor sketch (MTS), which is the first method to fully exploit the multi-dimensional nature of higher-order tensors. It projects the original tensor to another tensor, which can be of the same order but with smaller dimensions (which affect the recovery guarantees). This allows for efficient computation of various operations, such as tensor products and tensor contractions, by directly applying operations on the sketched components. MTS has advantages over vector-based sketching methods such as CTS when the underlying tensor has a special form such as the Kronecker product, Tucker, CP or tensor-train form. The computation and memory improvements over CTS are summarized in Table 1.

We compare MTS and CTS for Kronecker product compression using synthetic data: MTS outperforms CTS in relative error while using less computation time at the same compression ratio. We also apply MTS to approximate tensor operations in tensorized neural networks. These networks replace fully connected layers with multi-linear tensor algebraic operations, which results in compression since multi-linear layers can take better advantage of the spatial information available in activations from convolutional layers. Applying MTS to the tensor operations yields further compression while preserving accuracy. We demonstrate the efficiency on the CIFAR10 dataset.

Operator Computation Memory
Kronecker Product
Tucker-form Tensor if , if
CP-form Tensor if
Tensor-train if
Table 1: Ratio of improvement using MTS over CTS on different tensor operations. (Assume input for Kronecker product is a -dimension square matrix, input for tensor contraction is a third-order tensor with dimension and rank )
Figure 1: Multi-dimensional tensor sketch of a third-order tensor.

Important Tensor Applications: We focus on tensor sketching because data is inherently multi-dimensional. In probabilistic model analysis, tensor decomposition is the crux of model estimation via the method of moments. A variety of models, such as topic models, hidden Markov models and Gaussian mixtures, can be efficiently solved via tensor decomposition techniques under certain mild assumptions (Anandkumar et al., 2014). Tensor methods are also relevant in deep learning. Yu et al. learn the nonlinear dynamics in recurrent neural networks directly using higher-order state transition functions through tensor-train decomposition. Kossaifi et al. (2017) propose tensor contraction and regression layers in deep convolutional neural networks: instead of mapping the high-order activation tensor to a vector and passing it through a fully connected layer, they learn tensor weights that filter the multi-dimensional activation tensor. The tensor product is used to combine different features in multi-modal tasks. The visual question answering task (Antol et al., 2015) requires integrating feature maps from images and text that have drastically different structures, and many studies explore various features and combine them based on their structures (Fukui et al., 2016; Teney et al., 2017; Shi et al., 2018).

Notation Meaning
FFT(a) 1D Fourier Transform
IFFT(a) 1D Inverse Fourier Transform
Element-wise Product
Convolution
Kronecker Product
FFT2(A) 2D Fourier Transform
IFFT2(A) 2D Inverse Fourier Transform
Matrix Inverse
Decompression
[n]
Table 2: Summary of notations.
2. Multi-dimensional Tensor Sketch

In this section, we first present important definitions and theorems from previous literature. We then define the multi-dimensional tensor sketch and present its approximation guarantees. Finally, we show the advantage of this sketch over CS in approximating the Kronecker product.

2.1. Notation and Tensor Operations

We denote vectors by lowercase letters, matrices by uppercase letters, and higher-order tensors (multi-dimensional data structures) by calligraphic uppercase letters. The order of a tensor is the number of modes it admits; an $N$th-order tensor has $N$ modes. A scalar is a zeroth-order tensor, a vector is a first-order tensor, a matrix is a second-order tensor with the rows being the first mode and the columns being the second mode, and a three-way array (say $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$) is a third-order tensor with the first, second and third modes indexed by $i$, $j$ and $k$, respectively.

The tensor product is known as the outer product in the vector case. An $N$th-order tensor is an element of the tensor product of $N$ vector spaces, $\mathbb{R}^{d_1} \otimes \cdots \otimes \mathbb{R}^{d_N}$. Given $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ and $U \in \mathbb{R}^{d_1 \times k}$, the mode-1 tensor contraction $\mathcal{T} \times_1 U^\top \in \mathbb{R}^{k \times d_2 \times d_3}$ is defined element-wise as $(\mathcal{T} \times_1 U^\top)_{a,j,l} = \sum_{i} \mathcal{T}_{i,j,l}\, U_{i,a}$.

We show a tensor contraction example in Figure 2. Tensor decomposition is an extension of matrix decomposition to higher orders. The Tucker decomposition (Tucker, 1966) is analogous to principal component analysis: it decomposes a tensor as a core tensor contracted with a matrix along each mode. For instance, a third-order tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ has the Tucker decomposition:

$\mathcal{T} = \mathcal{G} \times_1 A \times_2 B \times_3 C$ (1)

where $\mathcal{G} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$, $A \in \mathbb{R}^{d_1 \times r_1}$, $B \in \mathbb{R}^{d_2 \times r_2}$, $C \in \mathbb{R}^{d_3 \times r_3}$. CANDECOMP/PARAFAC (CP) (Harshman, 1970) is a special case of the Tucker form: the core tensor is sparse, with non-zero values only on its diagonal, so the tensor can be represented as a sum of rank-1 tensors. Figure 3 shows the Tucker form and CP form of a third-order tensor. By splitting variables, the tensor train (TT) decomposition (Oseledets, 2011) can represent a high-order tensor with several third-order tensors.
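As a concrete illustration of these forms, the following minimal numpy sketch builds a Tucker-form and a CP-form third-order tensor; the sizes (d = 6, r = 3) and variable names are hypothetical, chosen only for this example.

```python
import numpy as np

# Hypothetical sizes for illustration: dimension d = 6, rank r = 3.
d, r = 6, 3
A, B, C = (np.random.randn(d, r) for _ in range(3))

# Tucker form (Equation 1): a core tensor contracted with a factor matrix along each mode.
G = np.random.randn(r, r, r)
T_tucker = np.einsum('pql,ip,jq,kl->ijk', G, A, B, C)

# CP form: the core is diagonal, so the tensor is a sum of r rank-1 terms.
w = np.random.randn(r)
T_cp = np.einsum('p,ip,jp,kp->ijk', w, A, B, C)

print(T_tucker.shape, T_cp.shape)  # (6, 6, 6) (6, 6, 6)
```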

Figure 2: Tensor contraction example: given , , ( is an identity matrix).
Figure 3: Tucker (upper) and CP (lower) decomposition of a third-order tensor.
2.2. Count Sketch and Count-based Tensor Sketch

We describe count sketch (CS) in Algorithm 1. The estimation can be made more robust by taking independent sketches of the input and computing the median of the estimators. Charikar et al. (2002) prove that CS is an unbiased estimator with variance bounded by the 2-norm of the input. Pagh (2012) uses CS and proposes a fast algorithm to compute the count sketch of an outer product of two vectors:

$\mathrm{CS}(x \otimes y) = \mathrm{CS}(x) * \mathrm{CS}(y)$ (2)

where $*$ denotes circular convolution. The convolution can be converted to an element-wise product using FFT properties. Thus, the computation complexity reduces from $O(n^2)$ to $O(n + c \log c)$ if the vectors are of size $n$ and the sketching size is $c$.
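A minimal numpy sketch of this FFT trick, under assumed sizes n = 128 and c = 32 and our own hash construction: two vectors are count-sketched, the sketch of their outer product is formed in the Fourier domain, and a single entry is then estimated from it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 128, 32                     # hypothetical input size and sketch size

def count_sketch(x, h, s):
    """Count sketch of a vector: y[h(i)] += s(i) * x[i]."""
    y = np.zeros(c)
    np.add.at(y, h, s * x)
    return y

x, y = rng.standard_normal(n), rng.standard_normal(n)
hx, sx = rng.integers(0, c, n), rng.choice([-1.0, 1.0], n)
hy, sy = rng.integers(0, c, n), rng.choice([-1.0, 1.0], n)

# Sketch of the outer product x ⊗ y via circular convolution of the two
# sketches, carried out as an element-wise product in the Fourier domain.
sketch = np.real(np.fft.ifft(np.fft.fft(count_sketch(x, hx, sx)) *
                             np.fft.fft(count_sketch(y, hy, sy))))

# Recover entry (i, j): read bucket (hx[i] + hy[j]) mod c and undo the signs.
i, j = 3, 7
estimate = sx[i] * sy[j] * sketch[(hx[i] + hy[j]) % c]
print(estimate, x[i] * y[j])       # a single sketch is noisy; take medians in practice
```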

Since CS can be applied to each fibre of the input tensor, we can always sketch a tensor with CS. We call this count-based tensor sketch (CTS); Algorithm 2 describes it. The disadvantage is that it ignores the connections between fibres, as mentioned in Section 1.

2.3. Multi-dimensional Tensor Sketch Definition

Given a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$, random hash functions $h_p: [d_p] \to [m_p]$ and random sign functions $s_p: [d_p] \to \{+1, -1\}$ for $p \in [N]$, the multi-dimensional tensor sketch of $\mathcal{T}$ is defined entry-wise as

$\mathrm{MTS}(\mathcal{T})_{t_1, \dots, t_N} = \sum_{h_1(i_1)=t_1, \dots, h_N(i_N)=t_N} s_1(i_1) \cdots s_N(i_N)\, \mathcal{T}_{i_1, \dots, i_N}.$

We can write the MTS using tensor operations:

$\mathrm{MTS}(\mathcal{T}) = (\mathcal{S} \circ \mathcal{T}) \times_1 H_1 \times_2 H_2 \cdots \times_N H_N$ (3)

Here $H_p \in \{0,1\}^{m_p \times d_p}$, with $(H_p)_{t,i} = 1$ if $h_p(i) = t$ and $(H_p)_{t,i} = 0$ otherwise, for $p \in [N]$, and $\mathcal{S}_{i_1,\dots,i_N} = s_1(i_1) \cdots s_N(i_N)$. That is, $\mathrm{MTS}(\mathcal{T})$ equals the signed tensor ($\mathcal{S} \circ \mathcal{T}$) contracted with the hash matrices ($H_p$, for $p \in [N]$) along each mode. To recover the original tensor, we have

$\hat{\mathcal{T}}_{i_1,\dots,i_N} = s_1(i_1) \cdots s_N(i_N)\, \mathrm{MTS}(\mathcal{T})_{h_1(i_1),\dots,h_N(i_N)}$, or in compact format

$\hat{\mathcal{T}} = \mathcal{S} \circ \big(\mathrm{MTS}(\mathcal{T}) \times_1 H_1^\top \cdots \times_N H_N^\top\big)$ (4)

We show MTS in Algorithm 3. We prove that MTS is an unbiased estimator with bounded variance. All detailed proofs are in the Appendix.

1: procedure CS(x, c)
2:   generate a random hash function h: [n] → [c] and a random sign function s: [n] → {+1, -1}
3:   y = zeros(c)
4:   for i := 1 to n do
5:     y[h(i)] += s(i) · x[i]
6:   return y
7: procedure CS-Decompress(y)
8:   for i := 1 to n do
9:     x̂[i] = s(i) · y[h(i)]
10:  return x̂
Algorithm 1 Count Sketch (Charikar et al., 2002)
1: procedure CTS(T, c)
2:   for each fibre x in T do
3:     set the corresponding fibre of T_s to CS(x, c)
4:   return T_s
5: procedure CTS-Decompress(T_s)
6:   for each fibre x in T_s do
7:     set the corresponding fibre of the output to CS-Decompress(x)
8:   return the recovered tensor
Algorithm 2 Count-based Tensor Sketch
1: procedure MTS(T, m)
2:   m = (m_1, ..., m_N) contains the sketching parameters
3:   generate hash functions h_p and sign functions s_p given m_p, for p in [N]
4:   compute the hash matrices H_p and the sign tensor S
5:   return (S ∘ T) ×_1 H_1 ··· ×_N H_N (Equation 3)
6: procedure MTS-Decompress(T_s)
7:   return S ∘ (T_s ×_1 H_1^T ··· ×_N H_N^T) (Equation 4)
Algorithm 3 Multi-dimensional Tensor Sketch
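A minimal numpy rendering of Algorithm 3 for a third-order tensor, under assumed dimensions (20, 20, 20) sketched down to (8, 8, 8); it materializes the hash matrices and evaluates Equations 3 and 4 with einsum. It is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
dims, ms = (20, 20, 20), (8, 8, 8)       # hypothetical tensor and sketch dimensions

# One hash function and one sign function per mode, as in Algorithm 3.
h = [rng.integers(0, m, d) for d, m in zip(dims, ms)]
s = [rng.choice([-1.0, 1.0], d) for d in dims]
# Row i of Ht[p] is the indicator of bucket h_p(i), i.e. the transpose H_p^T
# of the hash matrix in Equation 3.
Ht = [np.eye(m)[hp] for m, hp in zip(ms, h)]

def sign_tensor(X):
    return X * s[0][:, None, None] * s[1][None, :, None] * s[2][None, None, :]

def mts(T):
    # Equation 3: contract the signed tensor with a hash matrix along each mode.
    return np.einsum('ijk,ia,jb,kc->abc', sign_tensor(T), Ht[0], Ht[1], Ht[2])

def mts_decompress(S):
    # Equation 4: read back the hashed buckets and undo the signs.
    return sign_tensor(np.einsum('abc,ia,jb,kc->ijk', S, Ht[0], Ht[1], Ht[2]))

T = rng.standard_normal(dims)
est = mts_decompress(mts(T))
print(np.linalg.norm(est - T) / np.linalg.norm(T))  # single-sketch estimate; use medians in practice
```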
Theorem 2.1 (MTS recovery analysis).

Given a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ and MTS hash functions with sketching dimensions $m_p$ for $p \in [N]$, the decompression in Equation 4 computes an unbiased estimator for every entry of $\mathcal{T}$, with variance bounded by $O\!\big(\|\mathcal{T}\|_F^2 / \prod_{p=1}^{N} m_p\big)$.

For the computation analysis, we assume that contracting a third-order tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ with a matrix along one mode takes $O(d_1 d_2 d_3 k)$ operations, where $k$ is the output dimension of that mode. However, Equation 3 must be computed sequentially in practice, and without computation optimization this requires extra permutations and copies. For example, given $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ and $H_2 \in \mathbb{R}^{m_2 \times d_2}$, the mode-2 contraction $\mathcal{T} \times_2 H_2$ can be computed in three steps:

  • permute and reshape $\mathcal{T}$ into a $d_2 \times (d_1 d_3)$ matrix

  • compute the matrix-matrix multiplication with $H_2$, giving an $m_2 \times (d_1 d_3)$ matrix

  • reshape the result back into a $d_1 \times m_2 \times d_3$ tensor

Shi et al. (2016) propose a parallel computing scheme for tensor contraction using Basic Linear Algebra Subroutines (BLAS). Based on that, $\mathcal{T} \times_2 H_2$ can be computed in one step without permutations:

  • compute, in parallel, $d_3$ matrix-matrix multiplications $\mathcal{T}_{:,:,k} H_2^\top$, for $k \in [d_3]$

The recently released High-Performance Tensor Transposition (HPTT) library is an open-source library that performs efficient tensor transpositions (Springer et al., 2017). By applying these tensor contraction primitives, we can accelerate the sketching process in practice.
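The following numpy sketch, with made-up dimensions, contrasts the three-step permute/multiply/reshape contraction with the single batched (BLAS-style) view; both produce the same mode-2 contraction.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, d3, m2 = 4, 5, 6, 3                  # hypothetical dimensions
T = rng.standard_normal((d1, d2, d3))
H2 = rng.standard_normal((m2, d2))           # stands in for a hash matrix

# Three-step mode-2 contraction: permute, multiply, reshape back.
M = np.transpose(T, (1, 0, 2)).reshape(d2, d1 * d3)            # d2 x (d1*d3)
C1 = np.transpose((H2 @ M).reshape(m2, d1, d3), (1, 0, 2))     # d1 x m2 x d3

# One-step batched view: one matrix multiplication per frontal slice
# T[:, :, k], with no permutation of the full tensor.
C2 = np.einsum('ijk,aj->iak', T, H2)
print(np.allclose(C1, C2))  # True
```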

2.4. Sketching the Kronecker Product

Pagh (2012) shows that the count sketch of an outer product equals the convolution between the count sketches of the input vectors. Furthermore, the convolution in the time domain can be converted to an element-wise product in the frequency domain. The Kronecker product is a generalization of the outer product from vectors to matrices; it is one kind of tensor product. Computing and storing it explicitly is usually expensive, as shown in Figure 4.

Sketching a Kronecker product using count-based tensor sketch (CTS). We compute the sketch of $A \otimes B$ by sketching each of the outer products (row-wise) using CTS. Assume $A, B \in \mathbb{R}^{n \times n}$; each outer product sketch takes $O(n + c \log c)$ operations and there are $n^2$ such outer products, so sketching the Kronecker product this way requires $O(n^2 (n + c \log c))$ operations in total.

Sketching a Kronecker product using multi-dimensional tensor sketch (MTS). We show that

$\mathrm{MTS}(A \otimes B) = \mathrm{MTS}(A) * \mathrm{MTS}(B)$ (5)

where $*$ denotes 2D circular convolution. The convolution in Equation 5 can be further simplified by converting the sketched components to the 2D frequency domain. Assume $A, B \in \mathbb{R}^{n \times n}$ and MTS sketching dimensions $m \times m$:

$\mathrm{MTS}(A \otimes B) = \mathrm{IFFT2}\big(\mathrm{FFT2}(\mathrm{MTS}(A)) \circ \mathrm{FFT2}(\mathrm{MTS}(B))\big)$ (6)

This approximation requires $O(n^2)$ operations to compute $\mathrm{MTS}(A)$ and $\mathrm{MTS}(B)$ and $O(m^2 \log m)$ operations for the 2D Fourier transforms. The proof of Equation 5 is in Appendix B. This example shows the advantage of MTS in estimating tensor operations by directly applying operations on the sketched components. See Figures 5 and 6 and Table 3 for comparison.
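A minimal numpy check of Equations 5 and 6 under assumed sizes (n = 8, m = 4) and our own hash construction: the sketch of A ⊗ B is assembled purely from MTS(A) and MTS(B) in the 2D Fourier domain, and is compared against directly sketching kron(A, B) with the combined hash and sign functions of Lemma B.1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 4                                     # hypothetical matrix size and sketch size

def mts2d(M, h_row, s_row, h_col, s_col):
    """MTS of a matrix: hash rows and columns into an m x m grid of buckets."""
    S = np.zeros((m, m))
    np.add.at(S, (h_row[:, None], h_col[None, :]), (s_row[:, None] * s_col[None, :]) * M)
    return S

A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
hA = [rng.integers(0, m, n) for _ in range(2)]  # row/column hash functions for A
sA = [rng.choice([-1.0, 1.0], n) for _ in range(2)]
hB = [rng.integers(0, m, n) for _ in range(2)]
sB = [rng.choice([-1.0, 1.0], n) for _ in range(2)]

# Equation 6: sketch of the Kronecker product as an element-wise product of the
# two individual sketches in the 2D Fourier domain.
SA = mts2d(A, hA[0], sA[0], hA[1], sA[1])
SB = mts2d(B, hB[0], sB[0], hB[1], sB[1])
SK = np.real(np.fft.ifft2(np.fft.fft2(SA) * np.fft.fft2(SB)))

# Sanity check: sketch kron(A, B) directly with the combined hash/sign
# functions; the two sketches coincide up to floating-point error.
h_row = ((hA[0][:, None] + hB[0][None, :]) % m).ravel()
s_row = (sA[0][:, None] * sB[0][None, :]).ravel()
h_col = ((hA[1][:, None] + hB[1][None, :]) % m).ravel()
s_col = (sA[1][:, None] * sB[1][None, :]).ravel()
print(np.allclose(SK, mts2d(np.kron(A, B), h_row, s_row, h_col, s_col)))  # True
```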

Figure 4: Kronecker product of two matrices with size . It requires computation.
Figure 5: CTS of Kronecker product. It requires computations.
Figure 6: MTS of Kronecker product. It requires computations.
Operator Computation
Table 3: Computation analysis of the sketched Kronecker product using MTS and CTS (the sketching parameters are chosen so that both methods have the same recovery error).
3. Sketching Decomposed Tensors

In this section, we discuss how to approximate tensors given different tensor representations: Tucker-form, CP-form and tensor-train. We show that MTS is more efficient than CTS for compressing tensors in these forms, especially when the tensors have high rank.

3.1. Tucker-form Tensor Sketching

The Tucker decomposition is a form of higher-order PCA. In the following analysis, we use a third-order tensor as an example. Define a Tucker-form tensor $\mathcal{T} = \mathcal{G} \times_1 A \times_2 B \times_3 C$, where $\mathcal{G} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$, $A \in \mathbb{R}^{d_1 \times r_1}$, $B \in \mathbb{R}^{d_2 \times r_2}$, $C \in \mathbb{R}^{d_3 \times r_3}$. To simplify the analysis, we assume $d_1 = d_2 = d_3 = d$ and $r_1 = r_2 = r_3 = r$. Figure 7 shows the difference between sketching with CTS and with MTS.

Figure 7: Sketch of a third-order Tucker-form tensor.
CTS of a Tucker-form Tensor

To apply count sketch to the Tucker-form tensor, we rewrite the decomposition as a sum of rank-1 tensors. The CTS representation is:

$\mathrm{CTS}(\mathcal{T}) = \sum_{p,q,l} \mathcal{G}_{p,q,l}\, \mathrm{CS}(A_{:,p} \otimes B_{:,q} \otimes C_{:,l}) = \sum_{p,q,l} \mathcal{G}_{p,q,l}\, \mathrm{CS}_1(A_{:,p}) * \mathrm{CS}_2(B_{:,q}) * \mathrm{CS}_3(C_{:,l})$ (7)

where $A_{:,p}$, $B_{:,q}$, $C_{:,l}$ are columns of $A$, $B$, $C$ respectively, and $\otimes$ is the tensor product (outer product in the vector space). In this way, we can apply outer product sketching (Pagh, 2012) to each term in the summation. Notice that the convolution operations in Equation 7 can be computed through FFTs. The computation and memory complexity of this estimation are listed in Table 4.
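As a sanity check on a single term of Equation 7, the following numpy sketch (assumed sizes d = 32, c = 16, hypothetical hash functions) count-sketches a rank-1 third-order tensor via the circular convolution of three column sketches and recovers one entry.

```python
import numpy as np

rng = np.random.default_rng(4)
d, c = 32, 16                                   # hypothetical dimension and sketch length

def cs(x, h, s):
    y = np.zeros(c)
    np.add.at(y, h, s * x)
    return y

# Independent hash/sign pairs for the three modes.
h = [rng.integers(0, c, d) for _ in range(3)]
s = [rng.choice([-1.0, 1.0], d) for _ in range(3)]
a, b, cv = (rng.standard_normal(d) for _ in range(3))

# One term of Equation 7: the sketch of the rank-1 tensor a ⊗ b ⊗ c is the
# circular convolution of the three column sketches, computed with FFTs.
sk = np.real(np.fft.ifft(np.fft.fft(cs(a, h[0], s[0])) *
                         np.fft.fft(cs(b, h[1], s[1])) *
                         np.fft.fft(cs(cv, h[2], s[2]))))

# Entry (i, j, k) is estimated from bucket (h0(i) + h1(j) + h2(k)) mod c.
i, j, k = 1, 2, 3
est = s[0][i] * s[1][j] * s[2][k] * sk[(h[0][i] + h[1][j] + h[2][k]) % c]
print(est, a[i] * b[j] * cv[k])                 # noisy for a single sketch; use medians in practice
```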

Suppose $\hat{\mathcal{T}}$ is the recovered tensor for $\mathcal{T}$ after applying count sketch on $A$, $B$ and $C$ with sketching dimension $c$, and suppose the estimation takes $t$ independent sketches of $A$, $B$ and $C$ and then reports the median of the $t$ estimates:

Theorem 3.1 (CTS recovery analysis for Tucker tensor).

If the sketching dimension $c$ and the number of repetitions $t$ are chosen sufficiently large, then with high probability the recovery error $\|\hat{\mathcal{T}} - \mathcal{T}\|_F$ is bounded.

Theorem 3.1 shows how large the sketch length $c$ must be to provably approximate a third-order Tucker-form tensor with dimension $d$ and rank $r$.

MTS of a Tucker-form Tensor

The MTS of a Tucker-form tensor can be computed by applying MTS to each of the factors and sketching the core tensor. We rewrite $\mathcal{T} = \mathcal{G} \times_1 A \times_2 B \times_3 C$ and sketch each component, giving the MTS analogue of Equation 7:

(8)

In (Pagh, 2012), the compressed matrix multiplication is a sum of count sketches of outer products. Instead of using the CS of each row/column of the input matrices to compute the approximated outer products, we use each row/column of the multi-dimensional tensor sketched matrices. Notice that since $\mathcal{G}$ does not have any underlying structure, we sketch it with an ordinary count sketch. The computation and memory complexity of this estimation are listed in Table 4.

Suppose $\hat{\mathcal{T}}$ is the recovered tensor for $\mathcal{T}$ after applying MTS on $A$, $B$ and $C$ with sketching dimensions $m_1$ and $m_2$ along their two modes, and count sketch on $\mathcal{G}$ with sketching dimension $c$. Suppose the estimation takes $t$ independent sketches of $A$, $B$, $C$ and $\mathcal{G}$ and then reports the median of the estimates:

Theorem 3.2 (MTS recovery analysis for Tucker tensor).

If the sketching dimensions and the number of repetitions $t$ are chosen sufficiently large, then with high probability the recovery error $\|\hat{\mathcal{T}} - \mathcal{T}\|_F$ is bounded.

Theorem 3.2 shows that the product of the sketch lengths $m_1 m_2$ plays the role of the sketch length $c$ in Theorem 3.1 when approximating a third-order Tucker-form tensor with dimension $d$ and rank $r$. Given Theorems 3.1 and 3.2, if we set the sketching parameters so that $m_1 m_2$ matches $c$ (with the same number of repetitions), the estimation errors for CTS and MTS are at the same level. Detailed proofs of Theorems 3.1 and 3.2 are in the Appendix.

REMARKS: CP-form Tensor Sketching. A CP-form tensor can be treated as a special case of a Tucker-form tensor: for a third-order tensor, its core is a sparse tensor with non-zero values only on the diagonal. We can write the decomposition as $\mathcal{T} = \sum_{p=1}^{r} \lambda_p\, A_{:,p} \otimes B_{:,p} \otimes C_{:,p}$ and sketch it in the same way as described above. The only difference is that instead of summing over all core entries, we sum over only $r$ of them. We summarize the computation and memory analysis in Tables 4 and 5. By performing MTS on the CP-form tensor components, we obtain an improvement ratio when the tensor is overcomplete ($r > d$).

Computation and Memory Analysis

The computation and memory costs are analyzed in Table 5. Compared to the previous CTS method for Tucker-form tensors, MTS improves the computation cost, with the improvement ratio depending on how the rank $r$ compares to the dimension $d$.

Operator Computation Memory
Tucker
CTS()
MTS()
CP
CTS()
MTS()
Table 4: Computation and memory analysis for Tucker/CP tensor sketching before substituting with known parameters.
Operator Computation Memory
Tucker
CTS()
MTS()
CP
CTS()
MTS()
Table 5: Computation and memory analysis for Tucker/CP tensor sketching after substituting with known parameters (keep both methods with same recovery error).
3.2. Tensor-train Sketching

The tensor-train (TT) decomposition is another important tensor structure. For simplicity, we use a third-order tensor as an example. Assume $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ with TT cores $G_1 \in \mathbb{R}^{d_1 \times r_1}$, $\mathcal{G}_2 \in \mathbb{R}^{r_1 \times d_2 \times r_2}$, $G_3 \in \mathbb{R}^{r_2 \times d_3}$; the TT format is $\mathcal{T}_{i,j,k} = G_1[i,:]\, \mathcal{G}_2[:,j,:]\, G_3[:,k]$. By sketching $G_1$, $\mathcal{G}_2$ and $G_3$, we obtain an estimation of the TT-format tensor. We assume $d_1 = d_2 = d_3 = d$ and $r_1 = r_2 = r$.

  • CTS: count sketch is applied along the fibres with sketching dimension $c$. We can then use the method proposed by Pagh (2012) to compute the resulting sequence of matrix multiplications.

  • MTS: we can rewrite the TT format as a sequence of contractions of the sketched cores. Algorithm 5 in Appendix A shows the procedure.

Operator Computation Memory
CTS()
MTS()
Table 6: Computation and memory analysis for TT.

We give the error analysis for TT-form sketching in the Appendix. We show that if the sketching parameters are matched appropriately, the recovery error is the same for the MTS and CTS methods. Under the conditions detailed there, this gives an approximate improvement for MTS compared to the CS-based method.
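For reference, a minimal numpy construction of the third-order TT format used above, with a hypothetical dimension d = 10 and TT-rank r = 3.

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 10, 3                                  # hypothetical dimension and TT-rank

# TT cores of a third-order tensor: T[i, j, k] = G1[i, :] @ G2[:, j, :] @ G3[:, k].
G1 = rng.standard_normal((d, r))
G2 = rng.standard_normal((r, d, r))
G3 = rng.standard_normal((r, d))
T = np.einsum('ia,ajb,bk->ijk', G1, G2, G3)
print(T.shape)  # (10, 10, 10)
```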

4. Experiments

In this section, we present an empirical study of the multi-dimensional tensor sketch on matrix and tensor operations. The goals of this study are to: a) establish that our method significantly compresses the input without losing the principal information in the data, b) demonstrate the practical advantages of MTS over CTS in compressing the Kronecker product and tensor contraction operations, and c) present a potential application of MTS in deep learning tasks.

4.1. Kronecker Product Compression

We compress Kronecker products using CTS and MTS, following Figures 5 and 6, and compute the recovery relative error and the compression time versus the compression ratio. Given a Kronecker product matrix, we compute its CTS and MTS at matched compression ratios and measure the relative error between the recovered and the true Kronecker product. Each result is obtained by independently running the sketching several times and choosing the median. All inputs are randomly generated from the normal distribution.

Figure 8 shows that the recovery error of CTS/MTS is proportional to the compression ratio, while the running time is inversely proportional to it. MTS outperforms CTS in both relative error and running time at the same compression ratio.
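The evaluation protocol can be summarized by two small helpers; this is our own illustrative code, and the Frobenius-norm metric and the default of 300 repetitions are assumptions made for this example.

```python
import numpy as np

def relative_error(T, T_hat):
    # Relative error in the Frobenius norm (assumed metric).
    return np.linalg.norm(T_hat - T) / np.linalg.norm(T)

def median_estimate(T, sketch_and_recover, repeats=300):
    # Robustify a single noisy sketch with an entry-wise median over independent
    # repetitions, assuming sketch_and_recover draws fresh hash functions each call.
    return np.median([sketch_and_recover(T) for _ in range(repeats)], axis=0)
```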

Figure 8: Kronecker product estimation for two matrices.
4.2. Covariance Matrix Estimation

Given a matrix $A$ whose entries are sampled independently and uniformly from $[-1, 1]$, except for rows two and nine, which are positively correlated: in Figure 9, the upper middle panel is the true covariance matrix, and the upper right panel is its approximation using the algorithm in (Pagh, 2012) with a compression ratio of 2.5. The lower left panel is the true value of the corresponding Kronecker product. By applying MTS as in Figure 6 (Algorithm 4 in Appendix A), we get the approximated Kronecker product in the lower middle panel with a compression ratio of 6.25, from which we further estimate the covariance matrix. In both methods, we repeat the sketching 300 times and use the median as the final estimation. The covariance matrix estimate using MTS is better than the one using CS, even though it uses a higher compression ratio.

Figure 9: Covariance matrix estimation.
Figure 10: Training loss and test accuracy on CIFAR10 for different network structures.
4.3. Sketching Tensor Regression Networks
Figure 11: Tensor regression layer with sketching.

To demonstrate the versatility of our method, we combine it with deep learning by integrating it into a tensor regression network for object classification.

Kossaifi et al. (2017) propose the tensor regression layer (TRL) to express outputs through a low-rank multi-linear mapping; it learns a Tucker-form tensor weight for the high-order activation tensor. Instead of reconstructing the full regression weight tensor using tensor contractions, we propose to sketch it using Equations 7 and 8. In our experiments, we use a ResNet18 (He et al., 2016) from which the flattening and fully connected layers are removed and replaced by our proposed sketched tensor regression layer. The network structure is illustrated in Figure 11.
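To make the layer concrete, here is a minimal numpy forward pass of a (non-sketched) tensor regression layer with made-up sizes; it is our own illustration of the Tucker-form weight described above, not the authors' implementation. The sketched variant replaces the explicit reconstruction of W with the sketched computations of Equations 7 and 8.

```python
import numpy as np

rng = np.random.default_rng(7)
batch, d1, d2, d3, n_classes, r = 8, 4, 4, 32, 10, 5   # hypothetical sizes

# Activation tensor from the last convolutional block (kept multi-dimensional,
# no flattening) and a low-rank Tucker-form regression weight.
X = rng.standard_normal((batch, d1, d2, d3))
G = rng.standard_normal((r, r, r, n_classes))
A, B, C = (rng.standard_normal((d, r)) for d in (d1, d2, d3))

W = np.einsum('pqsc,ip,jq,ks->ijkc', G, A, B, C)       # full regression weight
logits = np.einsum('bijk,ijkc->bc', X, W)              # inner product per class
print(logits.shape)  # (8, 10)
```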

We compare CTS and MTS on the CIFAR10 dataset (Krizhevsky, 2009) and report the training loss and test accuracy in Figure 10. Compared to CTS, MTS converges faster. In Figure 12, the MTS-tensorized network still outperforms the non-tensorized network when the compression ratio is not too large. Compared to the original tensor regression network, we use 8 times fewer parameters with only a 4% accuracy loss (the original TRL has 25k parameters, the sketched TRL has 3k; the final test accuracy using the original/sketched TRL is 95%/91%).

Figure 12: Test accuracy on CIFAR10 for the MTS-tensorized network with different compression ratios (the compression ratio is with respect to the tensorized network; compression ratio = 0 means no compression added).
5. Related Work

Given the importance of data approximation for computational efficiency and privacy, many methods have been developed to estimate the original data with fewer parameters. As one might expect, data with a particular structure may require a specific approximation method. Truncated singular value decomposition (SVD) (Eckart & Young, 1936) approximates the data through rotation and rescaling; however, this decomposition may not fit data with sparsity or non-negativity constraints, and CX and CUR matrix decompositions were proposed to address this (Mahoney & Drineas, 2009; Caiafa & Cichocki, 2010).
Since structure in the data may not be respected by mathematical operations on it, sketching is often more interpretable and better guided by intuition (Bringmann & Panagiotou, 2017; Alon et al., 1999; Weinberger & Saul, 2009). Min-hash (Broder, 1997) is a technique for estimating how similar two sets are; an extension of it is one-bit CP-hash (Christiani et al., 2018). To make use of parallel computing resources, 2-of-3 cuckoo hashing (Amossen & Pagh, 2011) was proposed based on (Pagh & Rodler, 2001).

Count sketch (CS) was first proposed by Charikar et al. (2002) to estimate the frequency of each element in a stream. Pagh (2012) proposes a fast algorithm to compute the count sketch of an outer product of two vectors using FFT properties, proving that the CS of the outer product equals the convolution of the CS of each input. Tensor sketch is proposed by Pham & Pagh (2013) to approximate non-linear kernels; it can approximate any polynomial kernel much faster than previous methods based on random feature maps for training samples in high-dimensional space. Wang et al. (2015) develop a novel approach for randomized computation of tensor contractions using tensor sketch and apply it to fast tensor CP decomposition. However, all of these sketching techniques project tensors into a vector space.

Besides the above dimensionality reduction algorithms, efficient tensor contraction primitives are another focus of the computing community. Jhurani & Mullowney (2015) propose a low-overhead interface for multiple small matrix multiplications on NVIDIA GPUs. Shi et al. (2016) further present optimized computation primitives for single-index contractions involving all possible configurations of second-order and third-order tensors.

6. Conclusion

In this paper, we propose a new sketching technique called the multi-dimensional tensor sketch (MTS). MTS is an unbiased estimator of the input tensor that applies sketching along each mode. We present approximation algorithms for the Kronecker product and tensor contractions, and improve both the computational and memory requirements for these operations compared to previous sketching techniques, especially when the tensors have high rank. We apply the algorithms to synthetic and real data; in both cases, MTS performs efficient compression while preserving the principal information in the input.

References

  • Alon et al. (1999) Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, February 1999.
  • Amossen & Pagh (2011) Amossen, R. R. and Pagh, R. A new data layout for set intersection on gpus. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11, pp. 698–708, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4385-7. doi: 10.1109/IPDPS.2011.71. URL https://doi.org/10.1109/IPDPS.2011.71.
  • Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
  • Antol et al. (2015) Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
  • Azar et al. (2001) Azar, Y., Fiat, A., Karlin, A. R., McSherry, F., and Saia, J. Spectral analysis of data. STOC, 2001.
  • Bringmann & Panagiotou (2017) Bringmann, K. and Panagiotou, K. Efficient sampling methods for discrete distributions. Algorithmica, 79(2):484–508, Oct 2017. ISSN 1432-0541. doi: 10.1007/s00453-016-0205-0. URL https://doi.org/10.1007/s00453-016-0205-0.
  • Broder (1997) Broder, A. Z. On the resemblance and containment of documents. IEEE:Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy,, 10:21–29, 1997.
  • Caiafa & Cichocki (2010) Caiafa, C. F. and Cichocki, A. Generalizing the column–row matrix decomposition to multi-way arrays. Linear Algebra and its Applications, 433:557–573, Sept 2010.
  • Charikar et al. (2002) Charikar, M., Chen, K., and Farach-Colton, M. Finding frequent items in data streams. In Proceedings of ICALP’02, pp. 693–703, 2002.
  • Christiani et al. (2018) Christiani, T., Pagh, R., and Sivertsen, J. Scalable and robust set similarity join. The annual IEEE International Conference on Data Engineering, 2018.
  • Demaine et al. (2002) Demaine, E. D., López-Ortiz, A., and Munro, J. I. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms, ESA ’02, pp. 348–360, London, UK, UK, 2002. Springer-Verlag. ISBN 3-540-44180-8. URL http://dl.acm.org/citation.cfm?id=647912.740658.
  • Eckart & Young (1936) Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. In Psychometrika. Springer-Verlag, 1936.
  • Fukui et al. (2016) Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP 2016, 2016.
  • Harshman (1970) Harshman, R. Foundations of the parafac procedure: Models and conditions for an explanatory multi-model factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.5652&rep=rep1&type=pdf.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)., 2016.
  • Jhurani & Mullowney (2015) Jhurani, C. and Mullowney, P. A gemm interface and implementation on nvidia gpus for multiple small matrices. J. of Parallel and Distributed Computing, pp. 133–140, 2015.
  • Koren et al. (2009) Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. IEEE Computer Society, 2009.
  • Kossaifi et al. (2017) Kossaifi, J., Lipton, Z. C., Khanna, A., Furlanello, T., and Anandkumar, A. Tensor regression networks. 2017. URL http://arxiv.org/abs/1707.08308.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Mahoney & Drineas (2009) Mahoney, M. W. and Drineas, P. Cur matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009. ISSN 0027-8424. doi: 10.1073/pnas.0803205106.
  • Oseledets (2011) Oseledets, I. V. Tensor-train decomposition. SIAM J. Sci. Comput, 33(5):2295–2317, 2011.
  • Pagh (2012) Pagh, R. Compressed matrix multiplication. ITCS, 2012.
  • Pagh & Rodler (2001) Pagh, R. and Rodler, F. F. Cuckoo hashing. Lecture Notes in Computer Science, 2001. doi: 10.1007/3-540-44676-1_10.
  • Pham & Pagh (2013) Pham, N. and Pagh, R. Fast and scalable polynomial kernels via explicit feature maps. KDD, 2013.
  • Shi et al. (2016) Shi, Y., Niranjan, U., Anandkumar, A., and Cecka, C. Tensor contractions with extended blas kernels on cpu and gpu. HiPC, 2016.
  • Shi et al. (2018) Shi, Y., Furlanello, T., Zha, S., and Anandkumar, A. Question type guided attention in visual question answering. ECCV, 2018.
  • Springer et al. (2017) Springer, P., Su, T., and Bientinesi, P. Hptt: A high-performance tensor transposition c++ library. Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, 2017.
  • Teney et al. (2017) Teney, D., Anderson, P., He, X., and van den Hengel, A. Tips and tricks for visual question answering: Learnings from the 2017 challenge,. CVPR 2017 VQA Workshop, 2017. URL http://arxiv.org/abs/1708.02711.
  • Tucker (1966) Tucker, L. R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. doi: 10.1007/BF02289464.
  • Wang et al. (2015) Wang, Y., Tung, H.-Y., Smola, A., and Anandkumar, A. Fast and guaranteed tensor decomposition via sketching. Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
  • Weinberger & Saul (2009) Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
  • Yu et al. Yu, R., Zheng, S., Anandkumar, A., and Yue, Y. Long-term forecasting using tensor-train rnns.
Appendix A. Algorithms
Algorithm 4 Compress/Decompress Kronecker Product
1: procedure Compress-KP(A, B)
2:   for X in [A, B] do
3:     compute MTS(X)
4:   compute FFT2(MTS(A)) and FFT2(MTS(B))
5:   take their element-wise product and apply IFFT2 (Equation 6)
6:   return the resulting sketch
7:
8: procedure Decompress-KP(sketch)
9:   initialize the output to zeros
10:  for each output index of A ⊗ B do
11:    map the row index to its bucket using the row hash functions of A and B
12:    map the column index to its bucket using the column hash functions of A and B
13:    read the corresponding sketch entry and undo the signs
14:    write the value to the output entry
15:  return the recovered Kronecker product
Algorithm 5 TT Sketching
1: procedure Compress(G1, G2, G3)
2:   for X in [G1, G2, G3] do
3:     compute MTS(X), reshaping the third-order core into a matrix where needed
4:   compute FFT2 of each sketched component
5:   accumulate Q as the sum over the rank index of the products of the transformed sketches
6:   return IFFT2(Q)
7: procedure Decompress(sketch)
8:   initialize C to zeros
9:   for each output index do
10:    map the index to its sketch bucket via the hash functions, undo the signs, and write the entry
11:  return C
Appendix B. Proofs
Lemma B.1.

Given two matrices $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times n}$,

$\mathrm{MTS}(A \otimes B) = \mathrm{MTS}(A) * \mathrm{MTS}(B)$ (9)

Proof of Lemma B.1.

The Kronecker product is defined by $(A \otimes B)_{(i-1)n+k,\,(j-1)n+l} = A_{i,j} B_{k,l}$. Thus:

$\big(\mathrm{MTS}(A) * \mathrm{MTS}(B)\big)_{t_1,t_2} = \sum A_{i,j} B_{k,l}\, s_1^A(i)\, s_2^A(j)\, s_1^B(k)\, s_2^B(l)$ (10)

where the sum is over all $i, j, k, l$ such that $h_1^A(i) + h_1^B(k) \equiv t_1$ and $h_2^A(j) + h_2^B(l) \equiv t_2$ (modulo the sketching dimensions), and $h^A, s^A, h^B, s^B$ are the hash and sign functions used for $A$ and $B$.
Assigning $h_1((i-1)n+k) = h_1^A(i) + h_1^B(k)$, $h_2((j-1)n+l) = h_2^A(j) + h_2^B(l)$ (modulo the sketching dimensions), and $s_1((i-1)n+k) = s_1^A(i)\, s_1^B(k)$, $s_2((j-1)n+l) = s_2^A(j)\, s_2^B(l)$, we have

$\mathrm{MTS}(A) * \mathrm{MTS}(B) = \mathrm{MTS}(A \otimes B)$ (11)

Consequently, we can sketch the Kronecker product without forming it explicitly. The recovery map reads the bucket $\big(h_1((i-1)n+k),\, h_2((j-1)n+l)\big)$ and multiplies by $s_1((i-1)n+k)\, s_2((j-1)n+l)$, for $i, j, k, l \in [n]$. ∎

Theorem B.2 ((Charikar et al., 2002)).

Given a vector $x \in \mathbb{R}^n$ and CS hash functions $s$ and $h$ with sketching dimension $c$, for any $i \in [n]$ the recovery function $\hat{x}_i = s(i)\, \mathrm{CS}(x)_{h(i)}$ computes an unbiased estimator for $x_i$ with variance bounded by $\|x\|_2^2 / c$.

Proof of Theorem B.2.

For $j \in [n]$, let $K_{ij}$ be the indicator variable for the event $h(i) = h(j)$. We can write $\hat{x}_i$ as

$\hat{x}_i = s(i) \sum_{j} s(j)\, x_j\, K_{ij} = x_i + \sum_{j \neq i} s(i)\, s(j)\, x_j\, K_{ij}$ (12)

Observe that $\mathbb{E}[s(i)s(j)] = 0$ if $i \neq j$, $\mathbb{E}[K_{ij}] = 1/c$ for all $j \neq i$, and $K_{ii} = 1$, so we have

$\mathbb{E}[\hat{x}_i] = x_i$ (13)

To bound the variance, we rewrite the recovery function as

$\hat{x}_i = X + Y$ (14)

To simplify notation, we assign $X$ as the first term and $Y$ as the second term of Equation 12. $\mathrm{Var}(X) = 0$, and $\mathbb{E}[XY] = 0$ since the $s(j)$ for $j \in [n]$ are 2-wise independent. Thus,

$\mathrm{Var}(\hat{x}_i) = \mathbb{E}[Y^2]$ (15)

The cross terms of $\mathbb{E}[Y^2]$ vanish by the pairwise independence of the signs. Consequently,

$\mathbb{E}[Y^2] = \sum_{j \neq i} x_j^2\, \mathbb{E}[K_{ij}^2] = \frac{1}{c} \sum_{j \neq i} x_j^2$ (16)

The last equality uses that $K_{ij}^2 = K_{ij}$ and $\mathbb{E}[K_{ij}] = 1/c$ for all $j \neq i$. Summing over all terms, we have $\mathrm{Var}(\hat{x}_i) \le \|x\|_2^2 / c$. ∎

Proof of Theorem 2.1.

For $j_p \in [d_p]$, $p \in [N]$, let $K_{i_p j_p}$ be the indicator variable for the event $h_p(i_p) = h_p(j_p)$. We can write $\hat{\mathcal{T}}_{i_1,\dots,i_N}$ as

$\hat{\mathcal{T}}_{i_1,\dots,i_N} = \prod_{p} s_p(i_p) \sum_{j_1,\dots,j_N} \prod_{p} s_p(j_p)\, K_{i_p j_p}\, \mathcal{T}_{j_1,\dots,j_N}$ (17)

Observe that $\mathbb{E}[s_p(i_p)\, s_p(j_p)] = 0$ if $i_p \neq j_p$, and $\mathbb{E}[K_{i_p j_p}] = 1/m_p$ for $j_p \neq i_p$.