Tensorized Embedding Layers
Abstract
The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous, which precludes their deployment in a limited resource setting. We introduce a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance. We evaluate our method on a wide range of benchmarks in natural language processing and analyze the tradeoff between performance and compression ratios for a wide range of architectures, from MLPs to LSTMs and Transformers.
1 Introduction
Deep neural networks (DNNs) typically used in natural language processing (NLP) employ large embedding layers, which map the input words into continuous representations and usually have the form of lookup tables. Despite this simplicity and, arguably, because of it, the resulting models are cumbersome, which may cause problems when training and deploying them in a limited resource setting. Thus, the compression of large neural networks and the development of novel lightweight architectures have become essential problems in NLP research.
One way to reduce the number of parameters in the trained model is to impose a specific structure on its weight matrices (e.g., assume that they are low-rank or can be well approximated by low-rank tensor networks). Such approaches are successful at compressing pretrained models, but they do not facilitate the training itself. Furthermore, they usually require an additional fine-tuning stage to recover the performance of the original model.
In this paper, we introduce a new, parameter-efficient embedding layer, termed TT–embedding, which can be plugged into any model and trained end-to-end. The benefits of our compressed TT–layer are twofold. Firstly, instead of storing a huge embedding matrix, we store a sequence of much smaller 2-dimensional and 3-dimensional tensors, necessary for reconstructing the required embeddings, which allows compressing the model significantly at the cost of a negligible performance drop. Secondly, the overall number of parameters can be relatively small (and constant) during the whole training stage, which allows using larger batches or training efficiently when resources are limited.
To validate the efficiency of the proposed approach, we have tested it on several popular NLP tasks. In our experiments, we have observed that the standard embeddings can be replaced by TT–embeddings with the compression ratio of orders without any significant drop (and sometimes even with a slight gain) of the metric of interest. Specifically, we report the following compression ratios of the embedding layers: on the IMDB sentiment classification with absolute increase in classification accuracy; on the WMT En–De machine translation with drop in the BLEU score; on the WikiText103 language modeling with drop in test perplexity.
Additionally, we have evaluated our algorithm on a task of binary classification with a large number of categorical features. More concretely, we applied TT–embedding to the click-through rate (CTR) prediction problem, a crucial task in the field of digital advertising. Neural networks typically used for solving this problem, while being rather elementary, include a large number of embedding layers of significant size. As a result, the majority of model parameters, which represent these layers, may occupy hundreds of gigabytes of space. We show that TT–embedding not only considerably reduces the number of parameters in such models, but also sometimes improves their accuracy.
2 Related work
In recent years, a large body of research has been devoted to compressing and speeding up various components of neural networks used in NLP tasks. Joulin et al. (2016) adapted the framework of product quantization to reduce the number of parameters in linear models used for text classification. See et al. (2016) proposed to compress LSTM-based neural machine translation models with pruning algorithms. Lobacheva et al. (2017) showed that recurrent models could be significantly sparsified with the help of variational dropout (Kingma et al., 2015). Cheong and Daniel (2019) successfully compressed the Transformer architecture with a combination of pruning and quantization.
There is a plethora of prior work on compressing the embedding layers used in NLP models. Chen et al. (2018b) proposed a more compact K-way D-dimensional discrete encoding scheme to replace the “one-hot” encoding of categorical features, such as words in NLP tasks. Variani et al. (2018) introduced WEST, a compression method based on structured sparse and structured dense decomposition of the embedding matrix. Chen et al. (2018a) proposed to compress the pretrained embedding matrix by capitalizing on the power-law distribution of words and using smaller dimensionality (lower rank) for the embeddings of less frequent words. Baevski and Auli (2018) used a similar idea in an end-to-end fashion by training such structured low-rank embeddings from scratch. However, both of these methods rely on the assumption of a power-law distribution of tokens and are not efficient when dealing with other popular tokenizations, such as wordpieces (Schuster and Nakajima, 2012; Wu et al., 2016) or BPEs (Sennrich et al., 2015). The effectiveness of simple low-rank factorized embeddings has been recently rediscovered by Lan et al. (2019), and we refer to this method as an important baseline. Also, Lam (2018) proposed a quantization algorithm for compressing word vectors, but its benefits are orthogonal to those of low-rank matrix and tensor factorizations, and the two can be used together, complementing each other.
Tensor methods have also been successfully applied to neural network compression. Novikov et al. (2015) coined the idea of reshaping weights of fully-connected layers into high-dimensional tensors and representing them in the Tensor Train (TT) (Oseledets, 2011) format. This approach was later extended to convolutional (Garipov et al., 2016) and recurrent (Yang et al., 2017a; Tjandra et al., 2017; Yu et al., 2017) neural networks. Furthermore, Lebedev et al. (2015) showed that convolutional layers could also be compressed with the canonical (CP) tensor decomposition (Carroll and Chang, 1970; Harshman, 1970). Finally, Wang et al. (2018) compressed both fully-connected and convolutional layers with the Tensor Ring decomposition (Zhao et al., 2016). Recently, Ma et al. (2019) successfully applied Block-Term Tensor Decomposition to the compression of self-attention modules in the Transformer (Vaswani et al., 2017) architecture. In this work, we show the benefits of applying tensor machinery to the compression of embedding layers, which are an essential component of all models used in NLP.
3 Motivation
Since most of the parameters in NLP models occupy the embedding layers, we can greatly reduce the size of the entire model by compressing these layers. Our goal is to replace the standard embedding matrix with a more compact, yet powerful and trainable, representation which would allow us to efficiently map words into vectors.
In this section, we briefly discuss our motivation for using tensorized embedding layers instead of both standard embedding layers and their low-rank factorized counterparts.
3.1 Compression ratio perspective
The simplest approach to compactly represent a large matrix is low–rank matrix factorization, which treats the matrix $\mathbf{E} \in \mathbb{R}^{I \times J}$ as a product of two matrices, $\mathbf{E} = \mathbf{U}\mathbf{V}^\top$. Here $\mathbf{U} \in \mathbb{R}^{I \times R}$ and $\mathbf{V} \in \mathbb{R}^{J \times R}$ are much “thinner” matrices, and $R$ is the rank hyperparameter. Note that rather than training the model with the standard embedding layer and then trying to compress the obtained embedding, we can initially seek the embedding matrix in the described low–rank format. Then, for evaluation and training, the individual word embedding $\mathbf{E}[i, :]$ can be computed as the product $\mathbf{U}[i, :]\mathbf{V}^\top$, which does not require materializing the full matrix $\mathbf{E}$. This approach reduces the number of degrees of freedom in the embedding layer from $IJ$ to $(I + J)R$.
However, in NLP tasks the embedding dimension $J$ is typically much smaller than the vocabulary size $I$, so obtaining a significant compression ratio with low-rank matrix factorization is problematic. In order to preserve the model performance, the rank $R$ cannot be taken very small, and the compression ratio is bounded by $\frac{IJ}{(I + J)R} \approx \frac{J}{R}$, which is close to $1$ for a usually full-rank embedding matrix (see Figure 1 in Chen et al. (2018b)). To overcome this bound and achieve a significant compression ratio even for matrices of disproportional dimensionalities, we reshape them into multidimensional tensors and apply the Tensor Train decomposition, which allows for a more compact representation with the number of parameters falling down to logarithmic with respect to $I$.
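To make the comparison concrete, the following sketch counts parameters for a dense embedding, a low-rank factorization, and a TT–matrix. The vocabulary size, embedding dimension, factorizations, and ranks are illustrative assumptions, not the settings used in the experiments, and the TT count uses the standard core sizes $R_{k-1} \times I_k \times J_k \times R_k$ with boundary ranks equal to one.

```python
from math import prod


def full_params(vocab_size, emb_dim):
    """Number of parameters in a dense I x J embedding matrix."""
    return vocab_size * emb_dim


def low_rank_params(vocab_size, emb_dim, rank):
    """Number of parameters in the factorization E = U V^T with rank R."""
    return (vocab_size + emb_dim) * rank


def tt_matrix_params(row_factors, col_factors, ranks):
    """Number of parameters in a TT-matrix.

    ranks = [1, R_1, ..., R_{N-1}, 1]; core k holds
    R_{k-1} * I_k * J_k * R_k entries.
    """
    return sum(
        ranks[k] * row_factors[k] * col_factors[k] * ranks[k + 1]
        for k in range(len(row_factors))
    )


# Illustrative sizes (assumed for this example only).
row_factors, col_factors = (25, 30, 40), (4, 8, 8)
vocab_size, emb_dim = prod(row_factors), prod(col_factors)
rank = 16

n_full = full_params(vocab_size, emb_dim)
n_lr = low_rank_params(vocab_size, emb_dim, rank)
n_tt = tt_matrix_params(row_factors, col_factors, [1, rank, rank, 1])

print(f"dense: {n_full}, low-rank: {n_lr}, TT: {n_tt}")
print(f"low-rank compression: {n_full / n_lr:.1f}x, TT compression: {n_full / n_tt:.1f}x")
```

With the same rank value, the low-rank count is dominated by the $I \cdot R$ term, whereas every TT–core only sees one factor $I_k$ of the vocabulary size.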
3.2 Softmax bottleneck perspective
We hypothesize that such tensorized embeddings are not only superior in terms of more efficient compression, but are more theoretically justified for the usage in NLP tasks than embedding layers based on matrix factorization. Our analysis is based on softmax bottleneck theory (Yang et al., 2017b) and the fact that modern NLP architectures typically use the same weights for both embedding and softmax layers (Press and Wolf, 2016; Inan et al., 2016).
This theory models a natural language as a collection of pairs of a context and its conditional next token distribution, $\mathcal{L} = \{(c_i, P^*(X \mid c_i))\}_{i=1}^{N}$, and considers parametric language models with a Softmax function operating on a context vector $\mathbf{h}_c$ and a word embedding $\mathbf{w}_x$ to define the conditional distribution $P_\theta(x \mid c)$. Given the number of context vectors $N$, the number of tokens $M$, and the dimensionality of word embeddings $d$, the following three matrices are defined: $\mathbf{H}_\theta \in \mathbb{R}^{N \times d}$, $\mathbf{W}_\theta \in \mathbb{R}^{M \times d}$, $\mathbf{A} \in \mathbb{R}^{N \times M}$. The rows of these matrices correspond to context vectors, word embeddings, and log probabilities of the true data distribution, respectively. Such a language model attempts to approximate $\mathbf{A}$ (up to an addition of constant matrices corresponding to a degree of freedom in Softmax) in the form
$$\mathbf{A} = \mathbf{H}_\theta \mathbf{W}_\theta^\top. \qquad (1)$$
Note that the rank of $\mathbf{H}_\theta \mathbf{W}_\theta^\top$ is bounded by $d$, while the matrix $\mathbf{A}$ is presumed to be a high rank matrix (Yang et al., 2017b), which provides an upper bound on the expressivity of such models. Now, suppose that the matrix $\mathbf{W}_\theta$ is additionally factorized as $\mathbf{W}_\theta = \mathbf{U}_\theta \mathbf{V}_\theta$ with some rank $r < d$. Then the rank of the right-hand side of Equation 1 is bounded by $r$, which further reduces the expressivity of such models. Contrary to this, we show that tensorized embeddings do not reduce expressivity in the softmax bottleneck sense: while the embedding matrix is compressed, it still has full matrix rank. We provide a rigorous statement in Section 4.4 and verify the benefits of tensorized embeddings over low-rank factorized ones empirically in Section 5.
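The rank argument above can be checked numerically. The sketch below uses random Gaussian matrices and made-up sizes (all names and dimensions are assumptions for illustration): the logit matrix built from a full embedding has rank at most the embedding dimension, and factorizing the embedding with a smaller inner rank lowers that bound further.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_tokens, emb_dim, inner_rank = 50, 40, 16, 4

# Context vectors H (N x d) and word embeddings W (M x d).
H = rng.standard_normal((n_contexts, emb_dim))
W = rng.standard_normal((n_tokens, emb_dim))
rank_full = np.linalg.matrix_rank(H @ W.T)

# Low-rank factorized embeddings W = U V with inner rank r < d.
U = rng.standard_normal((n_tokens, inner_rank))
V = rng.standard_normal((inner_rank, emb_dim))
W_low_rank = U @ V
rank_factorized = np.linalg.matrix_rank(H @ W_low_rank.T)

print("rank with full embeddings:      ", rank_full)
print("rank with factorized embeddings:", rank_factorized)
```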
4 Tensor Train embedding
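In this section, we briefly introduce the necessary notation and present the algorithm for training the TT–embedding layer. Hereinafter, by an $N$-way tensor $\mathcal{X}$ we mean a multidimensional array:
$$\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N},$$
with entries $\mathcal{X}(i_1, \dots, i_N)$ such that $\{0 \leq i_k < I_k\}_{k=1}^{N}$.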
4.1 Tensor Train decomposition
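A tensor $\mathcal{X}$ is said to be represented in the Tensor Train (TT) format (Oseledets, 2011) if each element of $\mathcal{X}$ can be computed as:
$$\mathcal{X}(i_1, i_2, \dots, i_N) = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_{N-1}=1}^{R_{N-1}} \mathcal{G}^{(1)}(i_1, r_1)\, \mathcal{G}^{(2)}(r_1, i_2, r_2) \cdots \mathcal{G}^{(N)}(r_{N-1}, i_N),$$
where the tensors $\mathcal{G}^{(k)} \in \mathbb{R}^{R_{k-1} \times I_k \times R_k}$ are the so-called TT–cores and $R_0 = R_N = 1$ by definition. The minimal values of $\{R_k\}_{k=1}^{N-1}$ for which the TT–decomposition exists are called TT–ranks. Note that the element $\mathcal{X}(i_1, i_2, \dots, i_N)$ is effectively the product of $2$ vectors and $N - 2$ matrices:
$$\mathcal{X}(i_1, \dots, i_N) = \mathcal{G}^{(1)}[i_1, :]\, \mathcal{G}^{(2)}[:, i_2, :] \cdots \mathcal{G}^{(N-1)}[:, i_{N-1}, :]\, \mathcal{G}^{(N)}[:, i_N],$$
where $\mathcal{G}^{(k)}[:, i_k, :]$ stands for the slice (a subset of a tensor with some indices fixed) of the corresponding TT–core $\mathcal{G}^{(k)}$.
The number of degrees of freedom in such a decomposition can be evaluated as $\sum_{k=1}^{N} R_{k-1} I_k R_k$. Thus, in the case of small ranks, the total number of parameters required to store a tensor in TT–representation is significantly smaller than the $\prod_{k=1}^{N} I_k$ parameters required to store the full tensor of the corresponding size. This observation makes the application of the TT–decomposition appealing in many problems dealing with extremely large tensors.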
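As a small illustration of the definition, the sketch below builds random TT–cores, evaluates a single element as a chain of slice products, and compares it with the same element of the fully contracted tensor. The tensor shape and ranks are arbitrary values chosen for this example, and the helper names are hypothetical.

```python
import numpy as np


def tt_element(cores, index):
    """Evaluate one tensor element as a product of TT-core slices.

    cores[k] has shape (R_{k-1}, I_k, R_k) with R_0 = R_N = 1.
    """
    result = np.ones((1, 1))
    for core, i_k in zip(cores, index):
        result = result @ core[:, i_k, :]  # (1, R_{k-1}) @ (R_{k-1}, R_k)
    return result[0, 0]


def tt_full(cores):
    """Contract all cores into the dense tensor of shape (I_1, ..., I_N)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]  # drop the boundary ranks R_0 = R_N = 1


rng = np.random.default_rng(0)
shape, ranks = (3, 4, 5), [1, 2, 3, 1]
cores = [
    rng.standard_normal((ranks[k], shape[k], ranks[k + 1]))
    for k in range(len(shape))
]

dense = tt_full(cores)
n_tt_params = sum(core.size for core in cores)

print("element via slices:", tt_element(cores, (2, 1, 4)))
print("element via dense: ", dense[2, 1, 4])
print("TT parameters:", n_tt_params, "vs dense entries:", dense.size)
```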
4.2 TT–matrix
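Let $\mathbf{X} \in \mathbb{R}^{I \times J}$ be a matrix of size $I \times J$. Given two arbitrary factorizations of its dimensions into natural numbers, $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$, we can reshape and transpose this matrix into an $N$-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \dots \times I_N J_N}$ and then apply the TT–decomposition to it.
More concretely, define the bijections $\mathcal{I}(i) = (i_1, \dots, i_N)$ and $\mathcal{J}(j) = (j_1, \dots, j_N)$ that map row and column indices $i$ and $j$ of the matrix $\mathbf{X}$ to the $N$-dimensional vector-indices such that $0 \leq i_k < I_k$, $0 \leq j_k < J_k$ for all $k = 1, \dots, N$. From the matrix $\mathbf{X}$ we can form an $N$-way tensor $\mathcal{X}$ whose $k$-th dimension is of length $I_k J_k$ and is indexed by the tuple $(i_k, j_k)$. This tensor is then represented in the TT–format:
$$\mathcal{X}\big((i_1, j_1), \dots, (i_N, j_N)\big) = \mathcal{G}^{(1)}[(i_1, j_1), :]\, \mathcal{G}^{(2)}[:, (i_2, j_2), :] \cdots \mathcal{G}^{(N)}[:, (i_N, j_N)]. \qquad (2)$$
Such a representation of the matrix in the TT–format is called a TT–matrix (Oseledets, 2010; Novikov et al., 2015) and is also known as a Matrix Product Operator (Pirvu et al., 2010) in the physics literature. The factorizations $(I_1, I_2, \dots, I_N) \times (J_1, J_2, \dots, J_N)$ will be referred to as the shape of the TT–matrix, or TT–shapes. The process of constructing the TT–matrix from the standard matrix is visualized in Figure 1. Note that in this case the TT–cores are in fact 4-th order tensors, as the indices are given by tuples $(i_k, j_k)$, but all the operations defined for tensors in the TT–format are naturally extended to TT–matrices.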
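The reshape-and-transpose step and the index bijections can be sketched as follows. This example assumes a row-major (C-order) convention for splitting indices, which is NumPy's default and may differ from the convention used in the paper's implementation; the matrix size and factorizations are arbitrary.

```python
import numpy as np


def matrix_to_tt_tensor(X, row_factors, col_factors):
    """Rearrange an I x J matrix into a tensor of shape (I_1*J_1, ..., I_N*J_N)."""
    n = len(row_factors)
    T = X.reshape(*row_factors, *col_factors)        # (I_1..I_N, J_1..J_N)
    order = [axis for k in range(n) for axis in (k, n + k)]
    T = T.transpose(order)                            # (I_1, J_1, ..., I_N, J_N)
    return T.reshape([i * j for i, j in zip(row_factors, col_factors)])


row_factors, col_factors = (2, 3, 4), (2, 2, 3)
X = np.arange(24 * 12).reshape(24, 12)
T = matrix_to_tt_tensor(X, row_factors, col_factors)

# Map matrix indices (i, j) to multi-indices and then to merged tuples (i_k, j_k).
i, j = 17, 7
i_multi = np.unravel_index(i, row_factors)
j_multi = np.unravel_index(j, col_factors)
merged = tuple(
    i_k * J_k + j_k for i_k, j_k, J_k in zip(i_multi, j_multi, col_factors)
)

print("tensor shape:", T.shape)
print("X[i, j] =", X[i, j], " T[merged] =", T[merged])
```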
4.3 TT–embedding
By TT–embedding, we call a layer with trainable parameters (TT–cores) represented as a TT–matrix of the underlying tensor shape $(I_1, I_2, \dots, I_N) \times (J_1, J_2, \dots, J_N)$, which can be transformed into a valid embedding layer $\mathbf{E} \in \mathbb{R}^{I \times J}$, with $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$. To specify the shapes of TT–cores one also has to provide the TT–ranks, which are treated as hyperparameters of the layer and explicitly define the total compression ratio.
In order to compute the embedding for a particular word indexed $i$ in the vocabulary, we first map the row index $i$ into the $N$-dimensional vector index $(i_1, \dots, i_N)$, and then calculate the components of the embedding with formula (2). Note that the computation of all its components is equivalent to selecting the particular slices in TT–cores (slices of shapes $J_1 \times R_1$ in $\mathcal{G}^{(1)}$, $R_1 \times J_2 \times R_2$ in $\mathcal{G}^{(2)}$ and so on) and performing a sequence of matrix multiplications, which is executed efficiently in modern linear algebra packages, such as BLAS. Pseudocode for the procedure of computing the mapping $i \to (i_1, \dots, i_N)$ is given in Appendix A.
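In order to construct a TT–embedding layer for a vocabulary of size $I$ and embedding dimension $J$, and to train a model with such a layer, one has to perform the following steps.

Provide factorizations of $I$ and $J$ into factors $I = I_1 \times I_2 \times \dots \times I_N$ and $J = J_1 \times J_2 \times \dots \times J_N$, and specify the set of TT–ranks $\{R_1, R_2, \dots, R_{N-1}\}$.

Initialize the set of parameters of the embedding $\Theta = \{\mathcal{G}^{(k)} \in \mathbb{R}^{R_{k-1} \times I_k \times J_k \times R_k}\}_{k=1}^{N}$. Concrete initialization scenarios are discussed further in the text.

During training, given a batch of indices $\{i_1, i_2, \dots, i_b\}$, compute the corresponding embeddings $\{\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_b\}$ using Equation 2.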
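The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the PyTorch implementation referenced in Section 5: the class name `TTEmbeddingSketch`, the standard normal initialization, and the row-major index convention are all choices made for this example, and no gradient computation is included.

```python
import numpy as np


class TTEmbeddingSketch:
    """Hypothetical lookup-only TT-embedding with cores (R_{k-1}, I_k, J_k, R_k)."""

    def __init__(self, row_factors, col_factors, ranks, seed=0):
        # ranks = [1, R_1, ..., R_{N-1}, 1]
        rng = np.random.default_rng(seed)
        self.row_factors = tuple(row_factors)
        self.col_factors = tuple(col_factors)
        self.cores = [
            rng.standard_normal(
                (ranks[k], row_factors[k], col_factors[k], ranks[k + 1])
            )
            for k in range(len(row_factors))
        ]

    def lookup(self, indices):
        """Return embeddings of shape (batch, J) without forming the I x J matrix."""
        indices = np.asarray(indices)
        batch = indices.shape[0]
        # Step 1: row index i -> multi-index (i_1, ..., i_N), row-major order.
        multi = np.unravel_index(indices, self.row_factors)
        # Step 2: chain the selected core slices.
        out = np.ones((batch, 1, 1))  # (batch, columns so far, rank)
        for core, i_k in zip(self.cores, multi):
            sl = core[:, i_k, :, :].transpose(1, 0, 2, 3)  # (batch, R_prev, J_k, R_k)
            out = np.einsum("bar,brjs->bajs", out, sl)
            out = out.reshape(batch, -1, sl.shape[-1])
        return out[:, :, 0]


emb = TTEmbeddingSketch((2, 3, 4), (2, 2, 3), [1, 3, 3, 1])
rows = emb.lookup([0, 17, 23])
print(rows.shape)  # (batch, J)
```

Only the slices selected by the batch indices are touched, so the memory footprint of a lookup scales with the batch size and the TT–ranks rather than with the vocabulary size.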
TT–embedding imposes a specific structure on the order of tokens in the vocabulary (the order of rows in the embedding matrix), and determining the optimal order is an appealing problem to solve. However, we leave this problem for future work and use the order produced by the standard tokenizer (sorted by frequency) in our current experiments.
We also experimented with a more general form of the TT–decomposition, namely the Tensor Ring (TR) decomposition (Zhao et al., 2016; Wang et al., 2018). By construction, this decomposition has the appealing property of being invariant to circular permutations (and, thus, more robust with respect to the order of the tokens), which could have potentially provided an improvement over the TT–based models with simple frequency-based ordering. However, despite having stronger generalization abilities, TR might require a more intricate optimization procedure (Section 2.5 in Grasedyck et al. (2013)), and we did not observe the benefits of using TR instead of TT in our experiments (Appendix C).
Initialization
The standard way to initialize an embedding matrix is via, e.g., Glorot initializer (Glorot and Bengio, 2010), which initializes each element as . For the TT–embedding, we can only initialize the TT–cores, and the distribution of the elements of the resulting matrix is rather non–trivial. However, it is easy to verify that if we initialize each TT–core element as , the resulting distribution of the matrix elements has the property that and . Capitalizing on this observation, in order to obtain the desired variance while keeping , we can simply initialize each TT–core as
(3) 
The resulting distribution is not Gaussian, however, it approaches the Gaussian distribution
In our experiments, we have used the modified Glorot initializer implemented by formula (3), which greatly improved performance compared to initializing TT–cores simply via a standard normal distribution. It is also possible to initialize the TT–embedding layer by converting a learned embedding matrix into TT–format using the TT–SVD algorithm (Oseledets, 2011); however, this approach requires a pretrained embedding matrix and does not exhibit better performance in practice (Garipov et al., 2016).
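The variance argument behind such an initializer can be sketched as follows. Under the assumption that all core entries are independent zero-mean Gaussians with a common variance $s^2$, each matrix element is a sum of $\prod_{k=1}^{N-1} R_k$ uncorrelated products of $N$ core entries, so its variance is $s^{2N} \prod_{k=1}^{N-1} R_k$. Solving for a target variance $\sigma^2$ gives $s^2 = \big(\sigma^2 / \prod_{k} R_k\big)^{1/N}$. This is a derivation consistent with the text rather than a verbatim copy of formula (3); the function name, sizes, and the Glorot-style target $2/(I + J)$ are assumptions for illustration.

```python
import numpy as np


def tt_core_variance(target_var, ranks):
    """Per-entry core variance that yields `target_var` for matrix elements.

    ranks = [1, R_1, ..., R_{N-1}, 1]; assumes i.i.d. zero-mean core entries.
    """
    n_cores = len(ranks) - 1
    n_paths = np.prod(ranks[1:-1])  # number of summands in one element
    return (target_var / n_paths) ** (1.0 / n_cores)


# Illustrative sizes (assumed): Glorot-style target variance 2 / (I + J).
vocab_size, emb_dim = 32768, 256
target_var = 2.0 / (vocab_size + emb_dim)
ranks = [1, 4, 4, 1]
core_std = np.sqrt(tt_core_variance(target_var, ranks))

# Monte Carlo check for N = 3: one element is g1^T G2 g3 for the selected slices.
rng = np.random.default_rng(0)
n_trials = 100_000
g1 = rng.normal(0.0, core_std, (n_trials, 4))
G2 = rng.normal(0.0, core_std, (n_trials, 4, 4))
g3 = rng.normal(0.0, core_std, (n_trials, 4))
elements = np.einsum("ta,tab,tb->t", g1, G2, g3)
empirical_var = elements.var()

print("target variance:   ", target_var)
print("empirical variance:", empirical_var)
```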
Hyperparameter selection
Our embedding layer introduces two additional structure-specific hyperparameters, namely TT–shapes and TT–ranks.
TT–embedding does not require the vocabulary size $I$ to be represented exactly as the product of factors $I_1, \dots, I_N$; in fact, any factorization $\prod_{k=1}^{N} I_k = \tilde{I} \geq I$ will suffice. However, in order to achieve the highest possible compression ratio for a fixed value of $\tilde{I}$, the factors $\{I_k\}_{k=1}^{N}$ should be as close to each other as possible (Novikov et al., 2015; Yang et al., 2017a). Our implementation includes a simple automated procedure for selecting a good set of values $\{I_k\}_{k=1}^{N}$ during TT–embedding initialization. The factors $J_1, \dots, J_N$ are defined by the embedding dimensionality $J$, which can be easily chosen to support a good factorization.
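One way such a selection procedure might look is sketched below. It is a hypothetical heuristic written for this example, not the procedure from the authors' implementation: it starts from equal factors no smaller than the $N$-th root of the vocabulary size and then decreases them in turn while the product still covers the vocabulary.

```python
from math import ceil, prod


def suggest_factors(vocab_size, n_factors=3):
    """Return n_factors near-equal integers whose product is >= vocab_size."""
    base = max(1, ceil(vocab_size ** (1.0 / n_factors)))
    while base ** n_factors < vocab_size:  # guard against float rounding
        base += 1
    factors = [base] * n_factors

    # Round-robin shrinking keeps the factors balanced.
    changed = True
    while changed:
        changed = False
        for k in range(n_factors):
            if factors[k] > 1:
                trial = factors.copy()
                trial[k] -= 1
                if prod(trial) >= vocab_size:
                    factors = trial
                    changed = True
    return sorted(factors)


for size in (27, 30000):
    factors = suggest_factors(size)
    print(size, "->", factors, "padded vocabulary:", prod(factors))
```

Rows with indices between the true vocabulary size and the padded size are simply never looked up.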
The values of TT–ranks directly define the compression ratio, so choosing them to be too small or too large will result into either significant performance drop or little reduction of the number of parameters. In our experiments, we set all TT–ranks to for problems with small vocabularies and for problems with larger vocabularies, which resulted in a good tradeoff between embedding layer compression ratio and the metric of interest.
4.4 Expressivity of TT–embedding
Recall that in Section 3 we argued that one advantage of TT–embeddings is the property of being full rank matrices despite providing a significant data compression. Let us now formalize this statement.
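For a fixed $I = \prod_{k=1}^{N} I_k$, $J = \prod_{k=1}^{N} J_k$, and a set of ranks $\mathbf{R} = (R_1, R_2, \dots, R_{N-1})$, we consider $\mathcal{M}_{\mathbf{R}}$, the set of all tensors represented in the TT–matrix format such that for any $\mathcal{X} \in \mathcal{M}_{\mathbf{R}}$ we have
$$\text{TT-rank}(\mathcal{X}) \leq \mathbf{R}$$
entrywise. Let $\mathbf{X}$ denote an ordinary matrix of size $I \times J$ obtained from the TT–matrix $\mathcal{X}$ by inverting the procedure described in Section 4.2 (application of formulas from Section 4.1, followed by transposing and reshaping). We show that the following result holds.
Theorem 1.
For all $\mathcal{X} \in \mathcal{M}_{\mathbf{R}}$ besides a set of measure zero,
$$\text{rank}\, \mathbf{X} = \min(I, J),$$
where the ordinary matrix rank is assumed.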
See Appendix B for a proof.
This theorem states that for almost all TT–embeddings (besides a negligible set), the corresponding standard embedding matrix is full-rank. Thus, using the same matrix in the softmax layer, we can achieve significant compression without hitting the softmax bottleneck, as opposed to the low-rank matrix factorization.
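The statement can be illustrated numerically on a small example. The sketch below draws random TT–matrix cores, reconstructs the dense matrix, and compares its rank with that of a rank-1 matrix factorization of a similar parameter budget. Sizes, ranks, the row-major index convention, and the helper name are assumptions for this example; a single random draw is an illustration, not a proof.

```python
import numpy as np


def tt_matrix_to_dense(cores):
    """Contract TT-matrix cores of shape (R_{k-1}, I_k, J_k, R_k) into an I x J matrix."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full[0, ..., 0]  # shape (I_1, J_1, ..., I_N, J_N)
    n = len(cores)
    full = full.transpose([2 * k for k in range(n)] + [2 * k + 1 for k in range(n)])
    n_rows = int(np.prod([core.shape[1] for core in cores]))
    n_cols = int(np.prod([core.shape[2] for core in cores]))
    return full.reshape(n_rows, n_cols)


rng = np.random.default_rng(0)
row_factors, col_factors, ranks = (4, 4, 4), (2, 2, 2), [1, 2, 2, 1]
cores = [
    rng.standard_normal((ranks[k], row_factors[k], col_factors[k], ranks[k + 1]))
    for k in range(3)
]
E_tt = tt_matrix_to_dense(cores)  # 64 x 8 matrix from 64 parameters
tt_rank = np.linalg.matrix_rank(E_tt)

# Rank-1 matrix factorization: 64 + 8 = 72 parameters.
U = rng.standard_normal((64, 1))
V = rng.standard_normal((8, 1))
lr_rank = np.linalg.matrix_rank(U @ V.T)

print("TT-matrix parameters:", sum(c.size for c in cores), "matrix rank:", tt_rank)
print("low-rank parameters: ", U.size + V.size, "matrix rank:", lr_rank)
```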
5 Experiments
Code
We have implemented TT–embeddings described in Section 4 in Python using PyTorch (Paszke et al., 2019). The code is available at the anonymous repository https://github.com/ttembedding/ttembeddings.
Experimental setup
We tested our approach on several popular NLP tasks:

Sentiment analysis — as a starting point in our experiments, we test TT–embeddings on a rather simple task of predicting polarity of a sentence.

Neural Machine Translation (NMT) — to verify the applicability of TT–embeddings in more practical problems, we test it on a more challenging task of machine translation.

Language Modeling (LM) — then, we evaluate TT–embeddings on language modeling tasks in the case of extremely large vocabularies.

Click Through Rate (CTR) prediction — finally, we show that TT–embeddings can be applied for the binary classification with categorical features of significant cardinality.
Dataset  Model  Embedding shape  Test acc.  Emb compr.  Total params
IMDB  Full  M  
TT1  M  
TT2  M  
TT3  M  
SST  Full  M  
TT1  M  
TT2  M  
TT3  M 
To prove the generality and wide applicability of the proposed approach, we tested it on various architectures, such as MLPs (CTR), LSTMs (sentiment analysis), and Transformers (NMT, LM). The baselines we compare with are

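Standard embedding layer parametrized by a matrix $\mathbf{E} \in \mathbb{R}^{I \times J}$ with the baseline compression ratio of $1$.

Low-rank factorized embedding layer parametrized by two matrices $\mathbf{U} \in \mathbb{R}^{I \times R}$ and $\mathbf{V} \in \mathbb{R}^{J \times R}$ such that the corresponding embedding matrix is $\mathbf{E} = \mathbf{U}\mathbf{V}^\top$. The compression ratio in this case is $\frac{IJ}{(I + J)R} \approx \frac{J}{R}$.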
Note that Transformers in LM and NMT use the same weight matrix for their embedding and softmax layers (Press and Wolf, 2016; Inan et al., 2016) which already significantly reduces model size. Untying weights and tensorizing the embedding layer only will lead to the increase in the number of parameters instead of compression. In our experiments, we use two separate TTdecompositions of the same shape for embedding and softmax layers and report the compression ratios as .
5.1 Sentiment analysis
Model  Embedding shape  Rank  Token BLEU  Sacre BLEU  Emb compr.  Total params
Big  —  M  
Big+LR1  M  
Big+LR2  M  
Big+LR3  M  
Big+TT1  M  
Big+TT2  M  
Big+TT3  M 
For this experiment, we have used the IMDB dataset (Maas et al., 2011) with two categories, and the Stanford Sentiment Treebank (SST) (Socher et al., 2013) with five categories. We have taken the most frequent words for the IMDB dataset and for SST, embedded them into a –dimensional space using either standard embedding or TT–embedding layer, and performed classification using a standard bidirectional two–layer LSTM with hidden size , and dropout rate .
Our findings are summarized in Table 1. We observe that the models with largely compressed embedding layers can perform as well as, or even better than, the full uncompressed models. This suggests that learning individual independent embeddings for each particular word is superfluous, as the expressive power of the LSTM is sufficient to make use of these intertwined, yet more compact embeddings. Moreover, the slightly better test accuracy of the compressed models in certain cases (e.g., for the rather small SST dataset) indicates that imposing a specific tensorial low–rank structure on the embedding matrix can be viewed as a special form of regularization, thus potentially improving model generalization. A detailed and comprehensive test of this hypothesis goes beyond the scope of this paper, and we leave it for future work.
5.2 Neural Machine Translation
For this experiment, we have trained the Transformerbig model (, , ) from Vaswani et al. (2017) on WMT English–German dataset consisting of roughly million sentence pairs. We evaluated on newstest2014 dataset using beam search with a beam size of and no length penalty. We did not employ checkpoint averaging and used the last checkpoint to compute the BLEU score. Sentences were tokenized with YouTokenToMe
Model  Embedding shape  Rank  Valid PPL  Test PPL  Emb compr.  Total params
TXL  —  M  
TXL+LR1  M  
TXL+LR1  M  
TXL+LR1  M  
TXL+TT1  M  
TXL+TT2  M  
TXL+TT3  M 
Our results are summarized in Table 2. We observe that even in this rather challenging task, both embedding and softmax layers can be compressed significantly, at the cost of a small drop in the BLEU score. However, as the compression factor increases, the performance deteriorates rapidly. Compared to sentiment analysis, NMT is a much more complex task which benefits more from additional capacity (in the form of a more powerful RNN or more Transformer blocks) than from regularization (Bahdanau et al., 2014; Vaswani et al., 2017; Wu et al., 2019), which may explain why we did not manage to improve the model by regularizing its embedding layers with TT–embedding.
Compared to the low-rank factorization of the embedding layer, the BLEU score of the Transformer with TT–embedding is higher and degrades much more slowly as the TT–rank decreases. We hypothesize that this is because the corresponding embedding matrix is full rank and does not suffer from the softmax bottleneck (Yang et al., 2017b).
TT–embeddings induce a training iteration time overhead compared to the baseline Transformer-big, because our current implementation relies heavily on the slow torch.einsum function, while standard embedding and softmax layers make use of fast and highly optimized Tensor Cores for mixed-precision training. We expect a dedicated CUDA kernel to be much more efficient.
5.3 Language modeling
We took the TransformerXL (Dai et al., 2019), an open source
Compared to sentiment analysis and NMT, we were not able to achieve that high compression ratios for embedding and softmax layers in LM. However, in our case of extremely large vocabulary, even moderate times compression allowed us to save M of weights at the cost of perplexity drop. Note that TTembeddings also outperform lowrank factorization baseline achieving better tradeoff between compression and the performance.
Hash  Model  Factorization  TT rank  Hidden size  Test loss  Emb. compr.  Total params
Full  —  —  M  
TT1  factors  M  
TT2  factors  M  
TT3  factors  M  
TT4  factors  M  
—  TT1  factors  M  
TT2  factors  M 
5.4 Click Through Rate prediction
Among other applications of the TT–embedding layer, we chose to focus on CTR prediction, a popular task in digital advertising (He et al., 2014). We consider open dataset provided by Criteo for Kaggle Display Advertising Challenge (Criteo Labs, 2014) which consists of categorical features, M samples and is binary labeled according to whether the user clicked on the given advertisement. Unique values of categorical features are bijectively mapped into integers. To reduce the memory footprint, if the size of a corresponding vocabulary is immense (e.g., a cardinality of some features in this dataset is of order ), these integers are further hashed by taking modulus with respect to some fixed number such as . However, due to strong compression properties of TT–embeddings, this is not necessary for our approach, and we consider both full and hashed datasets in our experiments.
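The preprocessing described above (a bijective mapping of category values to integers, optionally followed by hashing with a modulus) can be sketched as follows. The category strings and the bucket count are made up for illustration and are unrelated to the actual Criteo features.

```python
def index_categories(values):
    """Map each unique category value to a distinct integer id (0, 1, 2, ...)."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def hash_ids(ids, n_buckets):
    """Fold integer ids into n_buckets embedding rows; distinct ids may collide."""
    return [i % n_buckets for i in ids]


raw = ["a9", "f3", "a9", "c7", "zz", "k1"]  # hypothetical feature values
vocab = index_categories(raw)
ids = [vocab[value] for value in raw]
hashed = hash_ids(ids, n_buckets=3)

print("ids:   ", ids)
print("hashed:", hashed)
```

Hashing bounds the number of embedding rows at the price of collisions between unrelated categories, which is the trade-off that a compact TT–embedding over the full vocabulary avoids.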
CTR with the baseline algorithm
The task at hand can be treated as a binary classification problem. As a baseline algorithm, we consider the neural network with the following architecture. First, each of the categorical features is passed through a separate embedding layer with embedding size . After that, the embedded features are concatenated and passed through fullyconnected layers of neurons and ReLU activation functions. In all experiments, we used Adam optimizer with the learning rate equal to . Since many input features have a large number of unique values (e.g., ) and storing the corresponding embedding matrices would be costly, we employ the hashing procedure mentioned earlier.
CTR with TT–embeddings
We substitute the embedding layers with the TT–embedding layers. Besides that, we leave the overall structure of the neural network unchanged with the same parameters as in the baseline approach. Table 4 presents the experimental results on the Criteo CTR dataset. To the best of our knowledge, our loss value is very close to the stateoftheart result (Juan et al., 2016). These experiments indicate that the substitution of large embedding layers with TT–embeddings leads to significant compression ratios (up to times) with a slight improvement in the test loss, and up to with a small drop in the test loss. The total size of the compressed model does not exceed Mb, while the baseline model weighs about Mb. The obtained compression ratio suggests that the usage of TT–embedding layers may be beneficial in CTR prediction.
6 Discussion and future work
We propose a novel embedding layer, the TT–embedding, for compressing huge lookup tables used for encoding categorical features of significant cardinality, such as the index of a token in natural language processing tasks. The proposed approach, based on the TT–decomposition, experimentally proved to be effective, as it heavily decreases the number of training parameters at the cost of a small deterioration in performance. In addition, our method can be easily integrated into any deep learning framework and trained via backpropagation, while capitalizing on reduced memory requirements and increased training batch size.
Our experimental results suggest several appealing directions for future work. First of all, TT–embeddings impose a concrete tensorial low-rank structure on the embedding matrix, which was shown to improve the generalization ability of the networks by acting as a regularizer. The properties and conditions of applicability of this regularizer remain to be analyzed more rigorously. Secondly, unlike the standard embedding, we can introduce non-linearity into TT–cores to improve their expressive power (Khrulkov et al., 2019). Additionally, it is important to understand how the order of tokens in the vocabulary affects the properties of the networks with TT–embedding. We hypothesize that there exists an optimal order of tokens which better exploits the particular structure of TT–embedding and leads to a boost in performance and/or compression ratio. Finally, the idea of applying higher–order tensor decompositions to reduce the number of parameters in neural nets is complementary to more traditional methods such as pruning (Han et al., 2015) and quantization (Hubara et al., 2017; Xu et al., 2018). Thus, it would be interesting to make a thorough comparison of all these methods and investigate whether their combination may lead to even stronger compression.
Appendix A Multi-index construction
Appendix B Proof of Theorem 1
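Recall that for fixed $I = \prod_{k=1}^{N} I_k$, $J = \prod_{k=1}^{N} J_k$, and a set of ranks $\mathbf{R} = (R_1, R_2, \dots, R_{N-1})$ we defined $\mathcal{M}_{\mathbf{R}}$, the set of all tensors represented in the TT–matrix format such that for any $\mathcal{X} \in \mathcal{M}_{\mathbf{R}}$ we have
$$\text{TT-rank}(\mathcal{X}) \leq \mathbf{R}$$
entrywise. Let $\mathbf{X}$ denote an ordinary matrix of size $I \times J$ obtained from the TT–matrix $\mathcal{X}$ by inverting the procedure described in Section 4.2 (application of formulas from Section 4.1, followed by transposing and reshaping).
Our analysis is based on the fact that $\mathcal{M}_{\mathbf{R}}$ forms an irreducible algebraic set (Buczyńska et al., 2015; Hartshorne, 2013). Concretely, we will use the fact that for an irreducible algebraic set $\mathcal{A}$ any algebraic subset either has measure zero, or coincides with $\mathcal{A}$. We start with a simple lemma.
Lemma 1.
Let
$$\mathcal{Q} = \{\mathcal{X} \in \mathcal{M}_{\mathbf{R}} \mid \text{rank}\, \mathbf{X} < \min(I, J)\},$$
then $\mathcal{Q}$ is an algebraic subset of $\mathcal{M}_{\mathbf{R}}$.
Proof.
We need to show that $\mathcal{Q}$ is cut out by polynomial equations on $\mathcal{M}_{\mathbf{R}}$. This readily follows from the facts that the map $\mathcal{X} \mapsto \mathbf{X}$ is a linear mapping, and that the upper bound on matrix rank can be specified by requiring all minors of a specific size to vanish (which is a polynomial constraint). ∎
We now show that $\mathcal{Q}$ is in fact a proper subset of $\mathcal{M}_{\mathbf{R}}$, i.e., $\mathcal{Q} \subsetneq \mathcal{M}_{\mathbf{R}}$.
Lemma 2.
For any $\mathcal{M}_{\mathbf{R}}$ there exists $\mathcal{X} \in \mathcal{M}_{\mathbf{R}}$ with $\text{rank}\, \mathbf{X} = \min(I, J)$.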
Proof.
We provide a concrete example of such a tensor. Define the collection of TT–cores using the equations
(4) 
with denoting the Kronecker delta symbol. It easy to verify that of a tensor specified by this collection of cores takes a very simple form: , which clearly is of maximal rank. ∎
Using Lemmas 1 and 2 and the preceding discussion of the properties of algebraic sets, we conclude that the following theorem holds.
Theorem 1.
For all $\mathcal{X} \in \mathcal{M}_{\mathbf{R}}$ besides a set of measure zero,
$$\text{rank}\, \mathbf{X} = \min(I, J),$$
where the ordinary matrix rank is assumed.
Appendix C Tensor Ring Embedding
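Tensor Ring (TR) decomposition is a generalization of the TT–decomposition in which the first and the last cores are 3-dimensional tensors, which corresponds to $R_0 = R_N > 1$. Formally, a tensor $\mathcal{X}$ is said to be represented in the TR format (Zhao et al., 2016) if each element of $\mathcal{X}$ can be computed as:
$$\mathcal{X}(i_1, i_2, \dots, i_N) = \sum_{r_0=1}^{R_0} \sum_{r_1=1}^{R_1} \cdots \sum_{r_{N-1}=1}^{R_{N-1}} \mathcal{G}^{(1)}(r_0, i_1, r_1)\, \mathcal{G}^{(2)}(r_1, i_2, r_2) \cdots \mathcal{G}^{(N)}(r_{N-1}, i_N, r_0).$$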
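In matrix form, a TR element is the trace of the product of the selected core slices, which also makes the circular permutation invariance mentioned in Section 4.3 easy to see. The sketch below uses arbitrary shapes and ranks chosen for illustration; the helper name is hypothetical.

```python
import numpy as np


def tr_element(cores, index):
    """Evaluate one element of a Tensor Ring.

    cores[k] has shape (R_{k-1}, I_k, R_k) with R_N = R_0; the element is the
    trace of the product of the selected slices.
    """
    product = np.eye(cores[0].shape[0])
    for core, i_k in zip(cores, index):
        product = product @ core[:, i_k, :]
    return np.trace(product)


rng = np.random.default_rng(0)
shape, ranks = (3, 4, 5), [2, 3, 4, 2]
cores = [
    rng.standard_normal((ranks[k], shape[k], ranks[k + 1])) for k in range(3)
]

# Dense reference: sum over all rank indices, including the shared boundary rank.
dense = np.einsum("aib,bjc,cka->ijk", *cores)

value = tr_element(cores, (2, 1, 4))
# Cyclically shifting the cores together with the index leaves the value unchanged.
shifted = tr_element(cores[1:] + cores[:1], (1, 4, 2))

print(value, dense[2, 1, 4], shifted)
```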
Similar to TT, we can define the TR–matrix (see Figure 3) and the corresponding TR–embedding layer.
While our results (Table 5 and Table 6) suggest that TT–embedding shows a better compression-performance trade-off than its TR counterpart, much more experimentation is needed to properly compare these two approaches (for example, we see that TR is a promising direction for future work as it outperforms TT on the SST-2 benchmark). However, such analysis is computationally heavy and goes beyond the scope of this paper.
Dataset  Model  Embedding shape  Rank  Test acc.  Emb compr.  Total params
IMDB  Full  —  M  
TT1  16  M  
TT2  16  M  
TT3  16  M  
TR1  16  M  
TR2  16  M  
TR3  16  M  
TR4  8  M  
TR5  8  M  
TR6  8  M  
SST  Full  —  M  
TT1  16  M  
TT2  16  M  
TT3  16  M  
TR1  8  M  
TR2  8  M  
TR3  8  M 
Model  Embedding shape  Rank  Token BLEU  Sacre BLEU  Emb compr.  Total params  
Big  —  M  
Big+TT1  M  
Big+TT2  M  
Big+TT3  M  
Big+TR1  M  
Big+TR2  M 
Parameter  Value 
Data cleaning  
max training sequence length in tokens  
max source / target ratio  
Model  
vocabulary size,  
hidden size,  
intermediate FF layer size,  
number of attention heads,  
number of layers in encoder / decoder  
Optimization  
optimizer  NovoGrad 
learning rate  
betas,  
learning rate decay policy  cosine 
weight decay  
batch size in tokens  
number of training steps  
number of warmup steps  
Regularization  
global dropout,  
label smoothing  
Inference  
beam search beam size  
length penalty 
Parameter  Value 
Model  
vocabulary size,  
hidden size,  
intermediate FF layer size,  
number of attention heads,  
number of layers  
Optimization  
optimizer  NovoGrad 
learning rate  
betas,  
learning rate decay policy  cosine 
weight decay  
batch size in sequences  
target sequence length  
memory sequence length  
number of training steps  
number of warmup steps  
Regularization  
global dropout,  
Inference  
batch size  
target sequence length  
memory sequence length  
max positional encodings length 
Footnotes
 By reshape we mean a column-major reshape command; note that numpy.reshape in Python is row-major by default and is column-major only when called with order='F'.
 Asymptotic normality is a consequence of application of the Central Limit Theorem.
 https://github.com/VKCOM/YouTokenToMe
 https://github.com/kimiyoung/transformer-xl
References
 Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853. Cited by: §2.
 Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §5.2.
 The Hackbusch conjecture on tensor formats. Journal de Mathématiques Pures et Appliquées 104 (4), pp. 749–761. Cited by: Appendix B.
 Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika 35 (3). Cited by: §2.
 GroupReduce: block-wise low-rank approximation for neural language model shrinking. NIPS. Cited by: §2.
 Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations. arXiv preprint arXiv:1806.09464. Cited by: §2, §3.1.
 Transformers.zip: compressing transformers with pruning and quantization. Technical report, Stanford University, Stanford, California, 2019. URL https…. Cited by: §2.
 Kaggle Display Advertising Challenge. External Links: Link Cited by: §5.4.
 Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §5.3, Table 3.
 Ultimate tensorization: compressing convolutional and FC layers alike. arXiv preprint arXiv:1611.03214. Cited by: §2, §4.3.
 Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §4.3.
 A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen 36 (1), pp. 53–78. Cited by: §4.3.
 Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143. Cited by: §6.
 Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics. Cited by: §2.
 Algebraic geometry. Vol. 52, Springer Science & Business Media. Cited by: Appendix B.
 Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9. Cited by: §5.4.
 Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: 4th item.
 Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18 (187), pp. 1–30. Cited by: §6.
 Tying word vectors and word classifiers: a loss framework for language modeling. arXiv preprint arXiv:1611.01462. Cited by: §3.2, §5.
 FastText.zip: compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §2.
 Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 43–50. Cited by: §5.4.
 Generalized tensor models for recurrent neural networks. arXiv preprint arXiv:1901.10801. Cited by: §6.
 Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §2.
 Word2Bits - quantized word vectors. arXiv preprint arXiv:1803.05651. Cited by: §2.
 ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.
 Speeding-up convolutional neural networks using fine-tuned CP-decomposition. ICLR. Cited by: §2.
 Bayesian sparsification of recurrent neural networks. arXiv preprint arXiv:1708.00077. Cited by: §2.
 A tensorized transformer for language modeling. arXiv preprint arXiv:1906.09777. Cited by: §2.
 Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. Cited by: §5.1.
 Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: §5.3.
 Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450. Cited by: §2, §4.2, §4.3.
 Approximation of matrices using tensor decomposition. SIAM Journal on Matrix Analysis and Applications 31 (4), pp. 2130–2145. Cited by: §4.2.
 Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §2, §4.1, §4.3.
 PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §5.
 Matrix product operator representations. New Journal of Physics 12 (2), pp. 025012. Cited by: §4.2.
 A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771. Cited by: Table 6, Table 2.
 Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859. Cited by: §3.2, §5.
 Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. Cited by: §2.
 Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274. Cited by: §2.
 Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §2.
 Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §5.1.
 Compressing recurrent neural network with tensor train. arXiv preprint arXiv:1705.08052. Cited by: §2.
 WEST: Word Encoded Sequence Transducers. arXiv preprint arXiv:1811.08417. Cited by: §2.
 Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2, 4th item, §5.2, §5.2.
 Wide compression: tensor ring nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9329–9338. Cited by: §2, §4.3.
 Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430. Cited by: §5.2.
 Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §2.
 Deep neural network compression with single and multiple level quantization. arXiv preprint arXiv:1803.03289. Cited by: §6.
 Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786. Cited by: §2, §3.2, §4.3.
 Breaking the softmax bottleneck: a high-rank RNN language model. arXiv preprint arXiv:1711.03953. Cited by: §3.2, §5.2.
 Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073. Cited by: §2.
 Tensor ring decomposition. arXiv preprint arXiv:1606.05535. Cited by: Appendix C, §2, §4.3.