Abstract
Sketching refers to a class of randomized dimensionality reduction methods that aim to preserve relevant information in large-scale datasets. They have modest memory requirements and typically require just a single pass over the dataset. Efficient sketching methods have been derived for vector- and matrix-valued datasets. When the datasets are higher-order tensors, a naive approach is to flatten the tensors into vectors or matrices and then sketch them. However, this is inefficient since it ignores the multidimensional nature of tensors. In this paper, we propose a novel multidimensional tensor sketch (MTS) that preserves higher-order data structures while reducing dimensionality. We build it as an extension of the popular count sketch (CS) and show that it yields an unbiased estimator of the original tensor. We demonstrate significant advantages in compression ratios when the original data has a decomposable tensor representation such as the Tucker, CP, tensor-train or Kronecker product forms. We apply MTS to tensorized neural networks, where we replace fully connected layers with tensor operations. We achieve nearly state-of-the-art accuracy with significant compression on image classification benchmarks.
Multidimensional Tensor Sketch:
Dimensionality Reduction That Retains Efficient Tensor Operations
Yang Shi, Animashree Anandkumar
Many modern machine learning and data mining applications involve manipulating large-scale multidimensional data structures. For instance, the input data can be multimodal or multirelational (e.g. a combination of image and text), and the intermediate computations can involve higher-order tensor operations (e.g. layers in a tensorized neural network). Memory, bandwidth and computational requirements are usually the bottlenecks when these operations are carried out at scale. Efficient dimensionality reduction schemes can greatly alleviate this issue if they can find a compact representation while preserving accuracy.
A popular class of dimensionality reduction techniques involves spectral decomposition, such as principal component analysis (PCA), singular value decomposition (SVD) and tensor decompositions. These methods fit the datasets using low-rank approximations, which can be interpreted as a low-dimensional latent structure in the data. They have shown good performance in many applications such as topic modelling (Azar et al., 2001) and recommendation systems (Koren et al., 2009). However, for large-scale datasets, exact computation of spectral decompositions is expensive, and randomized methods are used instead.
Sketching forms another class of randomized dimensionality reduction methods. These methods have modest memory requirements and typically make just a single pass over the data. They come with approximation guarantees for recovering the original data. In addition, they allow certain operations to be carried out accurately in the low-dimensional sketched space, e.g. inner products and outer products of vectors. Count sketch (CS) (Charikar et al., 2002) is the simplest sketching technique; it uses random signs and random hash functions to carry out dimensionality reduction. It has been applied in many settings. Demaine et al. (2002) design a frequency estimator for internet packet streams whose purpose is to determine essential features of the traffic stream using limited space. Another application is multimodal pooling of features: e.g. in visual question answering (VQA), this involves bilinear pooling of image and text features.
Sketching techniques, however, mostly focus on vector- and matrix-valued data. A few works attempt to extend them to higher orders: e.g. to nonlinear kernels (Pham & Pagh, 2013) and to higher-order tensors (Wang et al., 2015), which we refer to as count-based tensor sketch (CTS). However, these methods view the tensor as a set of vectors and sketch along each fibre of the tensor (a fibre is the higher-order analogue of matrix rows and columns). Hence, they still use inherently vector-based sketching techniques, do not exploit the full multidimensional nature of the data, and can miss correlations across different modes. For instance, the image features extracted by a standard vision network contain spatial information; sketching the data at each location separately ignores the connection between adjacent pixels.
Main contributions: We propose the multidimensional tensor sketch (MTS), which is, to our knowledge, the first method to fully exploit the multidimensional nature of higher-order tensors. It projects the original tensor to another tensor, which can be of the same order, but with smaller dimensions (which affect the recovery guarantees). This allows efficient computation of various operations such as tensor products and tensor contractions by applying the operations directly to the sketched components. MTS has advantages over vector-based sketching methods such as CTS when the underlying tensor has a special form such as the Kronecker product, Tucker, CP or tensor-train forms. The computation and memory improvements over CTS are summarized in Table 1.
We compare MTS and CTS for Kronecker product compression using synthetic data. At the same compression ratio, MTS outperforms CTS in relative error while using less computation time. We apply MTS to approximate tensor operations in tensorized neural networks. These networks replace fully connected layers with multilinear tensor algebraic operations. This yields compression, since multilinear layers can take better advantage of the spatial information available in activations from convolutional layers. Applying MTS to the tensor operations results in further compression while preserving accuracy. We demonstrate this efficiency on the CIFAR-10 dataset.
Table 1. Computation and memory improvement of MTS over CTS.
Operator | Computation | Memory
Kronecker product | |
Tucker-form tensor | |
CP-form tensor | |
Tensor-train | |
Important tensor applications: We focus on tensor sketching because data is often inherently multidimensional. In probabilistic model analysis, tensor decomposition is the crux of model estimation via the method of moments. A variety of models such as topic models, hidden Markov models and Gaussian mixtures can be solved efficiently via tensor decomposition techniques under certain mild assumptions (Anandkumar et al., 2014). Tensor methods are also relevant in deep learning. Yu et al. learn the nonlinear dynamics in recurrent neural networks directly using higher-order state transition functions through tensor-train decomposition. Kossaifi et al. (2017) propose tensor contraction and regression layers in deep convolutional neural networks: instead of mapping the high-order activation tensor to a vector and passing it through a fully connected layer, they learn tensor weights that filter the multidimensional activation tensor. Tensor products are used to combine different features in multimodal tasks. Visual question answering (Antol et al., 2015) requires integrating feature maps from image and text that have drastically different structures, and many studies explore how to combine such features based on their structures (Fukui et al., 2016; Teney et al., 2017; Shi et al., 2018).
Notation | Meaning
FFT(a) | 1D Fourier transform
IFFT(a) | 1D inverse Fourier transform
∘ | elementwise product
∗ | convolution
⊗ | Kronecker product
FFT2(A) | 2D Fourier transform
IFFT2(A) | 2D inverse Fourier transform
A⁻¹ | matrix inverse
— | decompression
[n] | the index set {1, …, n}
In this section, we first present important definitions and theorems from the previous literature. We then define the multidimensional tensor sketch and present its approximation guarantees. We show the advantage of this sketch over CS in approximating the Kronecker product.
We denote vectors by lowercase letters, matrices by uppercase letters, and higher-order tensors (multidimensional data structures) by calligraphic uppercase letters. The order of a tensor is the number of modes it admits; for example, a tensor with N modes is an Nth-order tensor. A scalar is a zeroth-order tensor, a vector is a first-order tensor, a matrix is a second-order tensor with the rows being the first mode and the columns being the second mode, and a three-way array is a third-order tensor whose first, second and third modes are indexed by its three indices, respectively.
The tensor product, denoted ⊗, is known as the outer product in the vector case: an Nth-order tensor is an element of the tensor product of N vector spaces. A tensor contraction multiplies two tensors along a shared mode and sums over that mode's index.
We show a tensor contraction example in Figure 2. Tensor decomposition is an extension of matrix decomposition to higher orders. The Tucker decomposition (Tucker, 1966) is analogous to principal component analysis: it decomposes a tensor as a core tensor contracted with a matrix along each mode. For instance, a third-order tensor T has the Tucker decomposition:
T = G ×₁ U₁ ×₂ U₂ ×₃ U₃, i.e., T_{ijk} = Σ_{a,b,c} G_{abc} (U₁)_{ia} (U₂)_{jb} (U₃)_{kc},   (1)
where G is the core tensor and U₁, U₂, U₃ are the factor matrices. CANDECOMP/PARAFAC (CP) (Harshman, 1970) is a special case of the Tucker form in which the core tensor is superdiagonal: only its diagonal entries are nonzero. A CP-form tensor can therefore be represented as a sum of rank-1 tensors. Figure 3 shows the Tucker form and CP form of a third-order tensor. By splitting variables, the tensor-train (TT) format (Oseledets, 2011) can represent a high-order tensor with several third-order tensors.
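The Tucker and CP reconstructions above can be made concrete with a few lines of NumPy (an illustrative sketch of the definitions, with our own function names, not code from the paper):

```python
import numpy as np

def tucker_to_full(G, U1, U2, U3):
    """Third-order Tucker form: core G (r1, r2, r3) contracted with a
    factor matrix U_k (n_k, r_k) along each mode."""
    return np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)

def cp_to_full(w, U1, U2, U3):
    """CP form: a sum of R rank-1 tensors with weights w (R,)."""
    return np.einsum('r,ir,jr,kr->ijk', w, U1, U2, U3)
```

Placing the CP weights on the superdiagonal of an otherwise-zero core recovers the CP form from the Tucker form, matching the special-case relationship described above.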
We describe count sketch (CS) in Algorithm 1. The estimation can be made more robust by taking several independent sketches of the input and computing the median of the estimators. Charikar et al. (2002) prove that CS is an unbiased estimator with variance bounded by the 2-norm of the input. Pagh (2012) uses CS to derive a fast algorithm for computing the count sketch of an outer product of two vectors:
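A minimal NumPy sketch of CS and its recovery map may help make Algorithm 1 concrete (function and variable names are ours; the paper's Algorithm 1 remains the authoritative description):

```python
import numpy as np

def count_sketch(x, h, s, c):
    """CS(x)[j] = sum of s[i] * x[i] over all i with h[i] == j."""
    y = np.zeros(c)
    np.add.at(y, h, s * x)  # unbuffered scatter-add into hash buckets
    return y

def cs_recover(y, h, s):
    """Entrywise unbiased estimate: x_hat[i] = s[i] * y[h[i]]."""
    return s * y[h]

rng = np.random.default_rng(0)
n, c = 1000, 200
x = rng.normal(size=n)
h = rng.integers(0, c, size=n)        # random hash function [n] -> [c]
s = rng.choice([-1.0, 1.0], size=n)   # random sign function
x_hat = cs_recover(count_sketch(x, h, s, c), h, s)
```

Averaging or taking the median over independent (h, s) pairs concentrates the estimate around x, which is the robustness trick mentioned above.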
CS(x ⊗ y) = CS₁(x) ∗ CS₂(y) = IFFT(FFT(CS₁(x)) ∘ FFT(CS₂(y))).   (2)
The convolution can thus be converted into an elementwise product using FFT properties, reducing the cost of sketching the outer product from quadratic in the input size to O(n + c log c) for vectors of size n and sketching size c.
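Pagh's identity can be checked numerically. The sketch below (our own code) computes the CS of an outer product as a circular convolution of the two component sketches, assuming the combined hash is (h1(i) + h2(j)) mod c with sign s1(i)s2(j), as in Pagh (2012):

```python
import numpy as np

def cs(x, h, s, c):
    """Plain count sketch of a vector."""
    y = np.zeros(c)
    np.add.at(y, h, s * x)
    return y

def cs_outer(a, b, h1, s1, h2, s2, c):
    """CS of outer(a, b) under hash (h1[i] + h2[j]) mod c and sign
    s1[i] * s2[j]: the circular convolution CS(a) * CS(b), via FFT."""
    fa = np.fft.fft(cs(a, h1, s1, c))
    fb = np.fft.fft(cs(b, h2, s2, c))
    return np.real(np.fft.ifft(fa * fb))
```

The equality with the directly computed sketch of the n1 × n2 outer product is exact, not approximate; only the recovery of the original entries is randomized.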
Since CS can be applied to each fibre of the input tensor, we can always sketch a tensor using CS; we call this count-based tensor sketch (CTS) and describe it in Algorithm 2. Its disadvantage is that it ignores the connections between fibres, as discussed above.
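Under one reading of Algorithm 2 (a single hash and sign pair shared across all mode-1 fibres; this sharing is our assumption, not stated explicitly here), CTS can be sketched as:

```python
import numpy as np

def cts(T, h, s, c):
    """Count-sketch every mode-1 fibre T[:, j, k, ...] with a shared
    hash h and sign s; output has shape (c,) + T.shape[1:]."""
    n1 = T.shape[0]
    signed = s.reshape((n1,) + (1,) * (T.ndim - 1)) * T
    Y = np.zeros((c,) + T.shape[1:])
    np.add.at(Y, h, signed)  # scatter-add whole fibres into hash buckets
    return Y
```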
Given a tensor T ∈ ℝ^{n₁×⋯×n_N}, random hash functions h_k : [n_k] → [c_k] and random sign functions s_k : [n_k] → {+1, −1} for k ∈ [N]:
We can write the MTS using tensor operations.
MTS(T)_{j₁,…,j_N} = Σ_{i₁,…,i_N} T_{i₁,…,i_N} ∏_{k=1}^{N} s_k(i_k) 𝟙[h_k(i_k) = j_k].   (3)

Here H_k ∈ {0, 1}^{n_k×c_k} with (H_k)_{i,j} = 1 if h_k(i) = j and 0 otherwise, for k ∈ [N]. MTS(T) equals the signed tensor (T with its mode-k indices weighted by the signs s_k) contracted with the hash matrices H_k along each mode. To recover the original tensor, we have, in compact form,

T̂_{i₁,…,i_N} = ∏_{k=1}^{N} s_k(i_k) · MTS(T)_{h₁(i₁),…,h_N(i_N)}.   (4)
We present MTS in Algorithm 3. We prove that MTS is unbiased, with variance controlled by the Frobenius norm of the input and the sketching dimensions. All detailed proofs are in the appendix.
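For intuition, the order-2 special case of Equations 3 and 4 can be sketched in NumPy (illustrative code with our own names; `hash_matrix` builds the signed hash matrix used for the mode contraction):

```python
import numpy as np

def hash_matrix(h, s, c):
    """H[i, h[i]] = s[i]; sketching a mode is contraction with H."""
    H = np.zeros((len(h), c))
    H[np.arange(len(h)), h] = s
    return H

def mts(A, h1, s1, h2, s2, c1, c2):
    """Order-2 MTS: contract the matrix with a signed hash matrix
    along each mode (Equation 3)."""
    return hash_matrix(h1, s1, c1).T @ A @ hash_matrix(h2, s2, c2)

def mts_recover(Y, h1, s1, h2, s2):
    """Entrywise unbiased recovery (Equation 4)."""
    return (s1[:, None] * s2[None, :]) * Y[np.ix_(h1, h2)]
```

Averaging the recovery over independent hash/sign draws illustrates the unbiasedness claim.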
Theorem 2.1 (MTS recovery analysis).
Given a tensor T and MTS hash functions with sketching dimensions c_k for k ∈ [N], the decompression in Equation 4 computes an unbiased estimator for every entry of T, with variance bounded in terms of ‖T‖_F² and the sketching dimensions.
For the computational analysis, we assume each mode contraction costs as much as the corresponding matrix multiplication. In practice, however, Equation 3 must be computed sequentially, which requires extra permutations and copies without further optimization. For example, a single-mode contraction of a third-order tensor with a hash matrix can be computed in three steps:
1. reshape the tensor into a matrix by unfolding it along the contracted mode;
2. compute a matrix-matrix multiplication with the hash matrix;
3. reshape the result back into a tensor.
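The three steps above can be sketched for a mode-1 contraction of a third-order tensor (a generic illustration, not the paper's implementation):

```python
import numpy as np

def mode1_contract(T, H):
    """T (n1, n2, n3) contracted with H (n1, c) along mode 1:
    unfold, one matrix-matrix multiplication, fold back."""
    n1, n2, n3 = T.shape
    M = T.reshape(n1, n2 * n3)            # step 1: unfold along mode 1
    Y = H.T @ M                           # step 2: a single GEMM
    return Y.reshape(H.shape[1], n2, n3)  # step 3: fold back to a tensor
```

Contractions along other modes would require a permutation before the reshape, which is exactly the copy overhead the batched-GEMM scheme of Shi et al. (2016) avoids.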
Shi et al. (2016) propose a parallel computing scheme for tensor contractions using Basic Linear Algebra Subroutines (BLAS). Based on that, the contraction can be computed in one step without permutations:
compute a batch of matrix-matrix multiplications in parallel, one per slice of the tensor.
The recently released High-Performance Tensor Transposition (HPTT) library is an open-source library for efficient tensor transpositions (Springer et al., 2017). By applying such primitives, we can accelerate the sketching process in practice.
Pagh (2012) shows that the count sketch of an outer product equals the convolution between the count sketches of the input vectors, and that the convolution in the time domain can be converted to an elementwise product in the frequency domain. The Kronecker product generalizes the outer product from vectors to matrices and is one kind of tensor product; computing it explicitly is usually expensive in both computation and memory, as shown in Figure 6.
Sketching a Kronecker product using count-based tensor sketch (CTS): we compute the sketch of the Kronecker product by sketching each of its row-wise outer products using CTS, so the total cost is the per-outer-product sketching cost multiplied by the number of such outer products.
Sketching a Kronecker product using multidimensional tensor sketch (MTS): we show that

MTS(A ⊗ B) = MTS(A) ∗ MTS(B),   (5)

where ∗ denotes 2D circular convolution.
The convolution in Equation 5 can be further simplified by converting the sketched components to the 2D frequency domain:

MTS(A ⊗ B) = IFFT2(FFT2(MTS(A)) ∘ FFT2(MTS(B))).   (6)
This approximation requires computing MTS(A) and MTS(B) plus the 2D Fourier transforms of the sketches. The proof of Equation 5 is in the appendix. This example shows the advantage of MTS in estimating tensor operations by applying the operations directly to the sketched components; see Figure 6 and Table 3 for a comparison.
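Equations 5 and 6 can be verified numerically, assuming the row and column hashes of the product are formed additively modulo the sketch sizes, as in the 1D case (our own code, not the paper's):

```python
import numpy as np

def mts2(A, hr, sr, hc, sc, c1, c2):
    """Order-2 MTS: count-sketch rows and columns jointly."""
    Y = np.zeros((c1, c2))
    np.add.at(Y, (hr[:, None], hc[None, :]),
              (sr[:, None] * sc[None, :]) * A)
    return Y

def mts_kron(SA, SB):
    """MTS of kron(A, B) from MTS(A) and MTS(B): a 2D circular
    convolution, computed in the frequency domain (Equation 6)."""
    return np.real(np.fft.ifft2(np.fft.fft2(SA) * np.fft.fft2(SB)))
```

As in the 1D case, the identity is exact by linearity; randomness enters only through the hash collisions at recovery time.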
Table 3. Computation comparison.
Operator | Computation
In this section, we discuss how to approximate tensors given different tensor representations: Tucker form, CP form and tensor train. We show that MTS is more efficient than CTS for compressing tensors in these forms, especially when the tensors have high rank.
The Tucker decomposition is a form of higher-order PCA. In the following analysis, we use a third-order tensor as an example.
Define a third-order Tucker-form tensor with core tensor G and factor matrices U₁, U₂, U₃. To simplify the analysis, we assume all dimensions equal n and all ranks equal r. Figure 7 shows the difference between sketching with CTS and with MTS.
To apply count sketch to the Tucker-form tensor, we rewrite the decomposition as a sum of rank-1 tensors. The CTS representation is:

CS(T) = Σ_{a,b,c} G_{abc} CS₁(u_a⁽¹⁾) ∗ CS₂(u_b⁽²⁾) ∗ CS₃(u_c⁽³⁾),   (7)

where u_a⁽¹⁾, u_b⁽²⁾, u_c⁽³⁾ are columns of U₁, U₂, U₃ respectively, and ⊗ denotes the tensor product (the outer product in the vector case). In this way, we can apply outer-product sketching (Pagh, 2012) to each term in the summation, and the convolutions in Equation 7 can be computed through the FFT. The resulting computation and memory complexities are summarized in Tables 4 and 5.
Suppose T̂ is the recovered tensor after applying count sketch to U₁, U₂, U₃ and G with a common sketching dimension, and suppose the estimation takes several independent sketches and reports the median of the estimates:
Theorem 3.1 (CTS recovery analysis for Tucker tensor).
For suitable choices of the sketching dimension and the number of repetitions, the median estimate T̂ is close to T in Frobenius norm with high probability.
Theorem 3.1 shows that the sketch length can be chosen as a polynomial in the rank and the inverse accuracy to provably approximate a third-order Tucker-form tensor with dimension n and rank r.
The MTS of a Tucker-form tensor is obtained by applying MTS to each of the factors and the core tensor. Thus,
(8)  
In Pagh (2012), compressed matrix multiplication is a sum of CS of outer products. Instead of using the CS of each row/column of the input matrices to compute the approximate outer products, we use each row/column of the multidimensional-tensor-sketched matrices. Notice that, since the core tensor has no underlying structure, we sketch it with an ordinary count sketch. The computation and memory complexities of this estimation are summarized in Tables 4 and 5.
Suppose T̂ is the recovered tensor after applying MTS to U₁, U₂, U₃ with a sketching dimension along each of the two modes and count sketch to G with a matching sketching dimension, and suppose the estimation takes several independent sketches and reports the median of the estimates:
Theorem 3.2 (MTS recovery analysis for Tucker tensor).
For suitable choices of the sketching dimensions and the number of repetitions, the median estimate T̂ is close to T in Frobenius norm with high probability.
Theorem 3.2 shows that the product of the sketch lengths can be chosen analogously to provably approximate a third-order Tucker-form tensor with dimension n and rank r. Given Theorems 3.1 and 3.2, if the sketching dimensions are matched appropriately, the estimation errors for CTS and MTS are at the same level. Detailed proofs of Theorems 3.1 and 3.2 are in the appendix.
Remarks (CP-form tensor sketching): A CP-form tensor can be treated as a special case of a Tucker-form tensor whose core is superdiagonal: only the diagonal entries are nonzero. For a third-order tensor, we can write the decomposition as a sum of weighted rank-1 terms and sketch it exactly as described above; the only difference is that instead of summing over all core entries, we sum over only the diagonal ones. We summarize the computation and memory analysis in Tables 4 and 5. Performing MTS on the CP-form tensor components yields an improvement ratio when the tensor is overcomplete (rank exceeding dimension).
The computation and memory costs are analyzed in Table 5. Compared to the previous CTS method for Tucker-form tensors, we obtain a computation-cost improvement whose ratio depends on the relation between the rank and the dimension.
Table 4. Computation and memory comparison for sketching Tucker-form and CP-form tensors.
Operator | Computation | Memory
Tucker | |
CTS | |
MTS | |
CP | |
CTS | |
MTS | |
Table 5. Computation and memory comparison for recovering Tucker-form and CP-form tensors.
Operator | Computation | Memory
Tucker | |
CTS | |
MTS | |
CP | |
CTS | |
MTS | |
Tensor-train (TT) decomposition is another important tensor structure. For simplicity, we use a third-order tensor as an example, with TT format T_{ijk} = Σ_{a,b} (G₁)_{ia} (G₂)_{ajb} (G₃)_{bk} for cores G₁, G₂, G₃. By sketching G₁, G₂ and G₃, we obtain an estimate of the TT-format tensor. To simplify the analysis, we assume all dimensions equal n and all TT ranks equal r.
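For reference, reconstructing a third-order tensor from its TT cores is a single einsum (illustrative code; the core shapes follow the standard TT convention and are our assumption):

```python
import numpy as np

def tt_to_full(G1, G2, G3):
    """T[i,j,k] = sum_{a,b} G1[i,a] * G2[a,j,b] * G3[b,k] for cores
    G1 (n1, r1), G2 (r1, n2, r2), G3 (r2, n3)."""
    return np.einsum('ia,ajb,bk->ijk', G1, G2, G3)
```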
Table 6. Computation and memory comparison for TT-form tensors.
Operator | Computation | Memory
CTS | |
MTS | |
We give the error analysis for TT-form sketching in the appendix. We show that when the sketching dimensions are matched appropriately, the recovery errors of the MTS and CTS methods are the same; in that regime, MTS yields an approximate multiplicative improvement over the CS-based method when the rank is large relative to the dimension.
In this section, we present an empirical study of the multidimensional tensor sketch in matrix and tensor operations. The goals of this study are to: a) establish that our method significantly compresses the input without losing the principal information in the data; b) demonstrate the advantages of MTS over CTS in compressing the Kronecker product and tensor contraction operations in practice; c) present a potential application of MTS in deep learning tasks.
We compress Kronecker products using CTS and MTS following Figure 6. We report the recovery relative error and the compression time versus the compression ratio, where the compression ratio is the size of the original Kronecker product divided by the size of its sketch, and the relative error is the Frobenius-norm error of the recovered product normalized by the norm of the true product. Each result is obtained by independently running the sketching multiple times and taking the median. All inputs are randomly generated from the normal distribution.
Figure 8 shows that the recovery error of CTS/MTS grows with the compression ratio, while the running time is inversely proportional to it. At the same compression ratio, MTS outperforms CTS in both relative error and running time.
Given a matrix whose entries are sampled independently and uniformly from [−1, 1], except for rows two and nine, which are positively correlated: in Figure 9, the upper-middle panel shows the true covariance matrix and the upper-right panel its approximation using the algorithm in Pagh (2012) with a compression ratio of 2.5. The lower-left panel shows the true value again; applying MTS based on Figure 6 (Algorithm 4 in the appendix), we obtain the approximation in the lower-middle panel with a compression ratio of 6.25, estimating the covariance directly from the sketched components. In both methods, we repeat the sketching 300 times and use the median as the final estimate. The covariance-matrix estimate using MTS is better than the one using CS, and it uses a higher compression rate.
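The covariance experiment builds on compressed matrix multiplication; a hedged 1D-CS sketch of the Pagh (2012) construction for A Aᵀ (our own code, not the paper's MTS variant) is:

```python
import numpy as np

def cs(x, h, s, c):
    """Plain count sketch of a vector."""
    y = np.zeros(c)
    np.add.at(y, h, s * x)
    return y

def compressed_aat(A, h1, s1, h2, s2, c):
    """CS of A @ A.T as a sum over columns of outer-product sketches,
    each a circular convolution accumulated in the frequency domain."""
    F = np.zeros(c, dtype=complex)
    for k in range(A.shape[1]):
        col = A[:, k]
        F += np.fft.fft(cs(col, h1, s1, c)) * np.fft.fft(cs(col, h2, s2, c))
    return np.real(np.fft.ifft(F))
```

The sum over columns never materializes the m × m product, which is the point of the compressed-multiplication approach.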
To demonstrate the versatility of our method, we combine it with deep learning by integrating it into a tensor regression network for object classification.
Kossaifi et al. (2017) propose the tensor regression layer (TRL) to express outputs through a low-rank multilinear mapping. It learns a Tucker-form tensor weight for the high-order activation tensor. Instead of reconstructing the full regression weight tensor using tensor contractions, we propose to sketch it using Equations 7 and 8. In our experiments, we use a ResNet-18 (He et al., 2016) from which the flattening and fully connected layers are removed and replaced by our proposed sketched tensor regression layer. The network structure is illustrated in Figure 11.
We compare CTS and MTS on the CIFAR-10 dataset (Krizhevsky, 2009), reporting the learning loss and the test accuracy in Figure 10. Compared to CTS, MTS converges faster. As Figure 12 shows, the MTS-tensorized network still outperforms the non-tensorized network when the compression ratio is not too large. Compared to the original tensor regression network, we use 8 times fewer parameters with only a small accuracy loss (the original TRL has 25k parameters and the sketched TRL 3k; the final test accuracy with the original/sketched TRL is 95%/91%).
Given the importance of data approximation for computational efficiency and privacy, many methods have been developed to estimate the original data with fewer parameters. As one might expect, data with a particular structure may require a specific approximation method. Truncated singular value decomposition (SVD) (Eckart & Young, 1936) approximates the data through a rotation and rescaling. However, this decomposition may not fit data with sparsity or nonnegativity constraints; the CX and CUR matrix decompositions were proposed to solve this problem (Mahoney & Drineas, 2009; Caiafa & Cichocki, 2010). Since structure in the data may not be respected by mathematical operations on it, sketching is more interpretable and better at informing intuition (Bringmann & Panagiotou, 2017; Alon et al., 1999; Weinberger & Saul, 2009).
MinHash (Broder, 1997) is a technique for estimating how similar two sets are; an extension of it is one-bit CP-hash (Christiani et al., 2018). To make use of parallel computing resources, 2-of-3 cuckoo hashing (Amossen & Pagh, 2011) was proposed based on cuckoo hashing (Pagh & Rodler, 2001).
Count sketch (CS) was first proposed by Charikar et al. (2002) to estimate the frequency of each element in a stream. Pagh (2012) proposes a fast algorithm to compute the count sketch of an outer product of two vectors using FFT properties, proving that the CS of the outer product equals the convolution between the CS of each input. Tensor sketch was proposed by Pham & Pagh (2013) to approximate nonlinear kernels; it can approximate any polynomial kernel in near-linear time, whereas previous methods are substantially more expensive for the same numbers of training samples, input dimensions and random feature maps. Wang et al. (2015) develop a novel approach for randomized computation of tensor contractions using tensor sketch and apply it to fast tensor CP decomposition. However, all these sketching techniques project tensors into vector spaces.
Besides the dimensionality reduction algorithms above, efficient tensor contraction primitives are another focus of the computing community. Jhurani & Mullowney (2015) propose a low-overhead interface for multiple small matrix multiplications on NVIDIA GPUs. Shi et al. (2016) further present optimized computation primitives for single-index contractions involving all possible configurations of second-order and third-order tensors.
In this paper, we propose a new sketching technique, called the multidimensional tensor sketch (MTS): an unbiased estimator for tensors built from sketches along each mode. We present approximation algorithms for the Kronecker product and tensor contraction, improving both the computational and memory requirements over previous sketching techniques, especially when the tensors have high rank. We apply the algorithms to synthetic and real data; in both cases, MTS achieves efficient compression while preserving the principal information in the input.
References
 Alon et al. (1999) Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, February 1999.
 Amossen & Pagh (2011) Amossen, R. R. and Pagh, R. A new data layout for set intersection on gpus. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11, pp. 698–708, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 9780769543857. doi: 10.1109/IPDPS.2011.71. URL https://doi.org/10.1109/IPDPS.2011.71.
 Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
 Antol et al. (2015) Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
 Azar et al. (2001) Azar, Y., Fiat, A., Karlin, A. R., McSherry, F., and Saia, J. Spectral analysis of data. STOC, 2001.
 Bringmann & Panagiotou (2017) Bringmann, K. and Panagiotou, K. Efficient sampling methods for discrete distributions. Algorithmica, 79(2):484–508, Oct 2017. ISSN 14320541. doi: 10.1007/s0045301602050. URL https://doi.org/10.1007/s0045301602050.
 Broder (1997) Broder, A. Z. On the resemblance and containment of documents. IEEE:Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy,, 10:21–29, 1997.
 Caiafa & Cichocki (2010) Caiafa, C. F. and Cichocki, A. Generalizing the column–row matrix decomposition to multiway arrays. Linear Algebra and its Applications, 433:557–573, Sept 2010.
 Charikar et al. (2002) Charikar, M., Chen, K., and Farach-Colton, M. Finding frequent items in data streams. In Proceedings of ICALP’02, pp. 693–703, 2002.
 Christiani et al. (2018) Christiani, T., Pagh, R., and Sivertsen, J. Scalable and robust set similarity join. The annual IEEE International Conference on Data Engineering, 2018.
 Demaine et al. (2002) Demaine, E. D., LópezOrtiz, A., and Munro, J. I. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms, ESA ’02, pp. 348–360, London, UK, UK, 2002. SpringerVerlag. ISBN 3540441808. URL http://dl.acm.org/citation.cfm?id=647912.740658.
 Eckart & Young (1936) Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. In Psychometrika. SpringerVerlag, 1936.
 Fukui et al. (2016) Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP 2016, 2016.
 Harshman (1970) Harshman, R. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.5652&rep=rep1&type=pdf.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)., 2016.
 Jhurani & Mullowney (2015) Jhurani, C. and Mullowney, P. A gemm interface and implementation on nvidia gpus for multiple small matrices. J. of Parallel and Distributed Computing, pp. 133–140, 2015.
 Koren et al. (2009) Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. IEEE Computer Society, 2009.
 Kossaifi et al. (2017) Kossaifi, J., Lipton, Z. C., Khanna, A., Furlanello, T., and Anandkumar, A. Tensor regression networks. 2017. URL http://arxiv.org/abs/1707.08308.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
 Mahoney & Drineas (2009) Mahoney, M. W. and Drineas, P. Cur matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009. ISSN 00278424. doi: 10.1073/pnas.0803205106.
 Oseledets (2011) Oseledets, I. V. Tensortrain decomposition. SIAM J. Sci. Comput, 33(5):2295–2317, 2011.
 Pagh (2012) Pagh, R. Compressed matrix multiplication. ITCS, 2012.
 Pagh & Rodler (2001) Pagh, R. and Rodler, F. F. Cuckoo hashing. Lecture Notes in Computer Science, 2001. doi: 10.1007/3540446761˙10.
 Pham & Pagh (2013) Pham, N. and Pagh, R. Fast and scalable polynomial kernels via explicit feature maps. KDD, 2013.
 Shi et al. (2016) Shi, Y., Niranjan, U., Anandkumar, A., and Cecka, C. Tensor contractions with extended blas kernels on cpu and gpu. HiPC, 2016.
 Shi et al. (2018) Shi, Y., Furlanello, T., Zha, S., and Anandkumar, A. Question type guided attention in visual question answering. ECCV, 2018.
 Springer et al. (2017) Springer, P., Su, T., and Bientinesi, P. Hptt: A highperformance tensor transposition c++ library. Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, 2017.
 Teney et al. (2017) Teney, D., Anderson, P., He, X., and van den Hengel, A. Tips and tricks for visual question answering: Learnings from the 2017 challenge,. CVPR 2017 VQA Workshop, 2017. URL http://arxiv.org/abs/1708.02711.
 Tucker (1966) Tucker, L. R. Some mathematical notes on threemode factor analysis. Psychometrika, 31(3):279–311, 1966. doi: 10.1007/BF02289464.
 Wang et al. (2015) Wang, Y., Tung, H.Y., Smola, A., and Anandkumar, A. Fast and guaranteed tensor decomposition via sketching. Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
 Weinberger & Saul (2009) Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
 (32) Yu, R., Zheng, S., Anandkumar, A., and Yue, Y. Long-term forecasting using tensor-train RNNs.
Lemma B.1.
Given two matrices , ,
(9) 
Proof of Lemma B.1.
The Kronecker product defines . Thus:
(10) 
where , , , .
Assign , , , , and , we have
(11) 
Consequently, we have . The recovery map is for , , , . ∎
Theorem B.2 ((Charikar et al., 2002)).
Given a vector x ∈ ℝⁿ and CS hash functions s and h with sketching dimension c, for any i ∈ [n] the recovery function computes an unbiased estimator of x_i with variance bounded by ‖x‖₂²/c.
Proof of Theorem B.2.
For let be the indicator variable for the event . We can write as
(12) 
Observe that , if , , for all , and , we have
(13) 
To bound the variance, we rewrite the recovery function as
(14) 
To simplify the notation, let X denote the first term and Y the second term; they are uncorrelated, since the signs are 2-wise independent. Thus,
(15) 
for . Consequently,
(16) 
The last equality uses the pairwise independence of the hash collisions across indices. Summing over all terms yields the claimed variance bound. ∎
Proof of Theorem 2.1.
For , , let be the indicator variable for the event and . We can write as
(17) 
Observe that , if , . ,