Tensorized Embedding Layers


The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous, which precludes their deployment in a limited resource setting. We introduce a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance. We evaluate our method on a wide range of benchmarks in natural language processing and analyze the trade-off between performance and compression ratios for a wide range of architectures, from MLPs to LSTMs and Transformers.


1 Introduction

Deep neural networks (DNNs) typically used in natural language processing (NLP) employ large embedding layers, which map the input words into continuous representations and usually have the form of lookup tables. Despite such simplicity and, arguably, because of it, the resulting models are cumbersome, which may cause problems in training and deploying them in a limited resource setting. Thus, the compression of large neural networks and the development of novel lightweight architectures have become essential problems in NLP research.

One way to reduce the number of parameters in the trained model is to impose a specific structure on its weight matrices (e.g., assume that they are low-rank or can be well approximated by low-rank tensor networks). Such approaches are successful at compressing pre-trained models, but they do not facilitate the training itself. Furthermore, they usually require an additional fine-tuning stage to recover the performance of the original model.

In this paper, we introduce a new, parameter-efficient embedding layer, termed TT–embedding, which can be plugged into any model and trained end-to-end. The benefits of our compressed TT–layer are twofold. Firstly, instead of storing a huge embedding matrix, we store a sequence of much smaller 2-dimensional and 3-dimensional tensors, necessary for reconstructing the required embeddings, which allows compressing the model significantly at the cost of a negligible performance drop. Secondly, the overall number of parameters can be relatively small (and constant) during the whole training stage, which allows using larger batches or training efficiently when resources are limited.

To validate the efficiency of the proposed approach, we have tested it on several popular NLP tasks. In our experiments, we have observed that the standard embeddings can be replaced by TT–embeddings with the compression ratio of orders without any significant drop (and sometimes even with a slight gain) of the metric of interest. Specifically, we report the following compression ratios of the embedding layers: on the IMDB sentiment classification with absolute increase in classification accuracy; on the WMT En–De machine translation with drop in the BLEU score; on the WikiText-103 language modeling with drop in test perplexity.

We have also evaluated our algorithm on a task of binary classification with a large number of categorical features. More concretely, we applied TT–embedding to the click-through rate (CTR) prediction problem, a crucial task in the field of digital advertising. Neural networks typically used for solving this problem, while being rather elementary, include a large number of embedding layers of significant size. As a result, the majority of model parameters, which represent these layers, may occupy hundreds of gigabytes of space. We show that TT–embedding not only considerably reduces the number of parameters in such models, but also sometimes improves their accuracy.

2 Related work

In recent years, a large body of research was devoted to compressing and speeding up various components of neural networks used in NLP tasks. Joulin et al. (2016) adapted the framework of product quantization to reduce the number of parameters in linear models used for text classification. See et al. (2016) proposed to compress LSTM-based neural machine translation models with pruning algorithms. Lobacheva et al. (2017) showed that the recurrent models could be significantly sparsified with the help of variational dropout (Kingma et al., 2015). Cheong and Daniel (2019) successfully compressed the Transformer architecture with the combination of pruning and quantization.

There is a plethora of prior work on compressing the embedding layers used in NLP models. Chen et al. (2018b) proposed a more compact K-way D-dimensional discrete encoding scheme to replace the “one-hot” encoding of categorical features, such as words in NLP tasks. Variani et al. (2018) introduced WEST, a compression method based on structured sparse and structured dense decomposition of the embedding matrix. Chen et al. (2018a) proposed to compress the pre-trained embedding matrix by capitalizing on the power-law distribution of words and using smaller dimensionality (lower rank) for the embeddings of less frequent words. Baevski and Auli (2018) used a similar idea in an end-to-end fashion by training such structured low-rank embeddings from scratch. However, both of these methods rely on the assumption of a power-law distribution of tokens and are not efficient when dealing with other popular tokenizations, such as wordpieces (Schuster and Nakajima, 2012; Wu et al., 2016) or BPEs (Sennrich et al., 2015). The effectiveness of simple low-rank factorized embeddings has been recently re-discovered by Lan et al. (2019), and we treat this method as an important baseline. Also, Lam (2018) proposed a quantization algorithm for compressing word vectors, but its benefits are orthogonal to those of low-rank matrix and tensor factorizations, so the two can be used together, complementing each other.

Tensor methods have also been successfully applied to neural network compression. Novikov et al. (2015) coined the idea of reshaping weights of fully-connected layers into high-dimensional tensors and representing them in Tensor Train (TT) (Oseledets, 2011) format. This approach was later extended to convolutional (Garipov et al., 2016) and recurrent (Yang et al., 2017a; Tjandra et al., 2017; Yu et al., 2017) neural networks. Furthermore, Lebedev et al. (2015) showed that convolutional layers could also be compressed with canonical (CP) tensor decomposition (Carroll and Chang, 1970; Harshman, 1970). Finally, Wang et al. (2018) compressed both fully-connected and convolutional layers with Tensor Ring decomposition (Zhao et al., 2016). Recently, Ma et al. (2019) successfully applied Block-Term Tensor Decomposition to the compression of self-attention modules in the Transformer (Vaswani et al., 2017) architecture. In this work, we show the benefits of applying tensor machinery to the compression of embedding layers, which are an essential component of all models used in NLP.

3 Motivation

Since most of the parameters in NLP models reside in the embedding layers, we can greatly reduce the size of the entire model by compressing these layers. Our goal is to replace the standard embedding matrix with a more compact, yet powerful and trainable, representation which would allow us to efficiently map words into vectors.

In this section, we briefly discuss our motivation of using tensorized embedding layers instead of both standard embedding layers and their low-rank factorized counterpart.

3.1 Compression ratio perspective

The simplest approach to compactly represent a matrix of a large size is to use the low–rank matrix factorization, which treats the matrix $\mathbf{E} \in \mathbb{R}^{I \times J}$ as a product of two matrices $\mathbf{E} = \mathbf{U}\mathbf{V}^{\top}$. Here $\mathbf{U} \in \mathbb{R}^{I \times R}$ and $\mathbf{V} \in \mathbb{R}^{J \times R}$ are much “thinner” matrices, and $R$ is the rank hyperparameter. Note that rather than training the model with the standard embedding layer and then trying to compress the obtained embedding, we can seek the embedding matrix in the described low–rank format from the start. Then, for evaluation and training, the individual word embedding $\mathbf{E}[i, :]$ can be computed as a product $\mathbf{U}[i, :]\mathbf{V}^{\top}$, which does not require materializing the full matrix $\mathbf{E}$. This approach reduces the number of degrees of freedom in the embedding layer from $IJ$ to $(I + J)R$.

However, in NLP tasks the embedding dimension $J$ is typically much smaller than the vocabulary size $I$, so obtaining a significant compression ratio with low-rank matrix factorization is problematic. In order to preserve the model performance, the rank $R$ cannot be taken very small, and the compression ratio is bounded by $\frac{IJ}{(I + J)R} \leq \frac{J}{R}$, which is close to $1$ for a usually full-rank embedding matrix (see Figure 1 in Chen et al. (2018b)). To overcome this bound and achieve a significant compression ratio even for matrices of disproportionate dimensions, we reshape them into multidimensional tensors and apply the Tensor Train decomposition, which allows for a more compact representation with the number of parameters falling to logarithmic with respect to $I$.
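The bound on the low-rank compression ratio can be checked with a few lines of arithmetic. The sketch below counts parameters for a dense embedding and for its rank-$R$ factorization; the vocabulary size, embedding dimension, and rank are hypothetical values chosen for illustration and are not settings from the experiments.

```python
def full_params(vocab_size, emb_dim):
    """Number of entries in a dense I x J embedding matrix."""
    return vocab_size * emb_dim


def low_rank_params(vocab_size, emb_dim, rank):
    """Number of entries in the two factors U (I x R) and V (J x R)."""
    return (vocab_size + emb_dim) * rank


def low_rank_compression(vocab_size, emb_dim, rank):
    """Ratio of dense to factorized parameter counts."""
    return full_params(vocab_size, emb_dim) / low_rank_params(vocab_size, emb_dim, rank)


# Hypothetical sizes, for illustration only.
I, J, R = 50_000, 512, 128
ratio = low_rank_compression(I, J, R)
print(f"compression ratio: {ratio:.3f}, upper bound J/R: {J / R:.3f}")
```

Because $I \gg J$, the ratio sits just below $J/R$, so a large compression ratio is only reachable by making $R$ much smaller than $J$.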

3.2 Softmax bottleneck perspective

We hypothesize that such tensorized embeddings are not only superior in terms of more efficient compression, but are more theoretically justified for the usage in NLP tasks than embedding layers based on matrix factorization. Our analysis is based on softmax bottleneck theory (Yang et al., 2017b) and the fact that modern NLP architectures typically use the same weights for both embedding and softmax layers (Press and Wolf, 2016; Inan et al., 2016).

This theory models a natural language as a collection of pairs of a context and its conditional next token distribution: $\mathcal{L} = \{(c_i, P^{*}(X \mid c_i))\}_{i=1}^{N}$, and considers parametric language models with a Softmax function operating on a context vector $\mathbf{h}_c$ and a word embedding $\mathbf{w}_x$ to define the conditional distribution $P_{\theta}(x \mid c)$. Given the number of context vectors $N$, the number of tokens $M$, and the dimensionality of word embeddings $d$, the following three matrices are defined: $\mathbf{H}_{\theta} \in \mathbb{R}^{N \times d}$, $\mathbf{W}_{\theta} \in \mathbb{R}^{M \times d}$, $\mathbf{A} \in \mathbb{R}^{N \times M}$. The rows of these matrices correspond to context vectors, word embeddings, and log probabilities of the true data distribution, respectively. Such a language model attempts to approximate $\mathbf{A}$ (up to an addition of constant matrices corresponding to a degree of freedom in Softmax) in the form

$$\mathbf{A} = \mathbf{H}_{\theta}\mathbf{W}_{\theta}^{\top}. \tag{1}$$

Note that the rank of $\mathbf{H}_{\theta}\mathbf{W}_{\theta}^{\top}$ is bounded by $d$, while the matrix $\mathbf{A}$ is presumed to be a high rank matrix (Yang et al., 2017b), which provides an upper bound on the expressivity of such models. Now, suppose that the matrix $\mathbf{W}_{\theta}$ is additionally factorized as $\mathbf{W}_{\theta} = \mathbf{U}_{\theta}\mathbf{V}_{\theta}^{\top}$ with some rank $R < d$. Then the rank of the right-hand side of Equation 1 is bounded by $R$, which further reduces the expressivity of such models. Contrary to this, we show that tensorized embeddings do not reduce expressivity in the softmax bottleneck sense: while the embedding matrix is compressed, it still has full matrix rank. We provide a rigorous statement in Section 4.4 and verify the benefits of tensorized embeddings over low-rank factorized ones empirically in Section 5.

4 Tensor Train embedding

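In this section, we briefly introduce the necessary notation and present the algorithm for training the TT–embedding layer. Hereinafter, by an $N$-way tensor $\mathcal{X}$ we mean a multidimensional array:

$$\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N},$$

with entries $\mathcal{X}(i_1, \dots, i_N)$ such that $\{0 \leq i_k < I_k\}_{k=1}^{N}$.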

4.1 Tensor Train decomposition

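A tensor $\mathcal{X}$ is said to be represented in the Tensor Train (TT) format (Oseledets, 2011) if each element of $\mathcal{X}$ can be computed as:

$$\mathcal{X}(i_1, i_2, \dots, i_N) = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \dots \sum_{r_{N-1}=1}^{R_{N-1}} \mathcal{G}^{(1)}(i_1, r_1)\, \mathcal{G}^{(2)}(r_1, i_2, r_2) \dots \mathcal{G}^{(N)}(r_{N-1}, i_N),$$

where the tensors $\mathcal{G}^{(k)} \in \mathbb{R}^{R_{k-1} \times I_k \times R_k}$ are the so-called TT–cores and $R_0 = R_N = 1$ by definition. The minimal values of $\{R_k\}_{k=1}^{N-1}$ for which the TT–decomposition exists are called TT–ranks. Note that the element $\mathcal{X}(i_1, i_2, \dots, i_N)$ is effectively the product of $2$ vectors and $N - 2$ matrices:

$$\mathcal{X}(i_1, i_2, \dots, i_N) = \underbrace{\mathcal{G}^{(1)}[i_1, :]}_{1 \times R_1}\, \underbrace{\mathcal{G}^{(2)}[:, i_2, :]}_{R_1 \times R_2} \dots \underbrace{\mathcal{G}^{(N-1)}[:, i_{N-1}, :]}_{R_{N-2} \times R_{N-1}}\, \underbrace{\mathcal{G}^{(N)}[:, i_N]}_{R_{N-1} \times 1},$$

where $\mathcal{G}^{(k)}[:, i_k, :]$ stands for the slice (a subset of a tensor with some indices fixed) of the corresponding TT–core $\mathcal{G}^{(k)}$.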

The number of degrees of freedom in such a decomposition can be evaluated as $\sum_{k=1}^{N} R_{k-1} I_k R_k$. Thus, in the case of small ranks, the total number of parameters required to store a tensor in TT–representation is significantly smaller than the $\prod_{k=1}^{N} I_k$ parameters required to store the full tensor of the corresponding size. This observation makes the application of the TT–decomposition appealing in many problems dealing with extremely large tensors.
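As a quick illustration of this count, the sketch below sums the core sizes $R_{k-1} I_k R_k$ with $R_0 = R_N = 1$ and compares the result with the size of the dense tensor. The function name, the tensor shape, and the ranks are hypothetical and serve only as an example.

```python
from math import prod


def tt_params(mode_sizes, ranks):
    """Sum of R_{k-1} * I_k * R_k over all cores.

    `mode_sizes` lists I_1..I_N and `ranks` lists the internal
    TT-ranks R_1..R_{N-1}; the boundary ranks are fixed to 1.
    """
    if len(ranks) != len(mode_sizes) - 1:
        raise ValueError("expected exactly N - 1 internal ranks")
    full_ranks = [1] + list(ranks) + [1]
    return sum(
        full_ranks[k] * size * full_ranks[k + 1]
        for k, size in enumerate(mode_sizes)
    )


# Hypothetical 4-way tensor with all TT-ranks equal to 8.
mode_sizes = [32, 32, 32, 32]
ranks = [8, 8, 8]
print(tt_params(mode_sizes, ranks), "TT parameters vs", prod(mode_sizes), "dense entries")
```

For a fixed rank the TT count grows linearly in the number of modes, while the dense count grows exponentially.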

4.2 TT–matrix

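Let $\mathbf{X} \in \mathbb{R}^{I \times J}$ be a matrix of size $I \times J$. Given two arbitrary factorizations of its dimensions into natural numbers, $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$, we can reshape and transpose this matrix into an $N$-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \dots \times I_N J_N}$ and then apply the TT–decomposition to it, resulting in a more compact representation.

More concretely, define the bijections $\mathcal{I}(i) = (i_1, \dots, i_N)$ and $\mathcal{J}(j) = (j_1, \dots, j_N)$ that map row and column indices $i$ and $j$ of the matrix $\mathbf{X}$ to the $N$-dimensional vector-indices such that $0 \leq i_k < I_k$, $0 \leq j_k < J_k$, $\forall k = 1, \dots, N$. From the matrix $\mathbf{X}$ we can form an $N$-way tensor $\mathcal{X}$ whose $k$-th dimension is of length $I_k J_k$ and is indexed by the tuple $(i_k, j_k)$. This tensor is then represented in the TT–format:

$$\mathcal{X}\big((i_1, j_1), (i_2, j_2), \dots, (i_N, j_N)\big) = \mathcal{G}^{(1)}[(i_1, j_1), :]\, \mathcal{G}^{(2)}[:, (i_2, j_2), :] \dots \mathcal{G}^{(N)}[:, (i_N, j_N)]. \tag{2}$$

Such a representation of the matrix in the TT–format is called a TT–matrix (Oseledets, 2010; Novikov et al., 2015) and is also known as a Matrix Product Operator (Pirvu et al., 2010) in the physics literature. The factorizations $(I_1, I_2, \dots, I_N) \times (J_1, J_2, \dots, J_N)$ will be referred to as the shape of the TT–matrix, or TT–shapes. The process of constructing the TT–matrix from the standard matrix is visualized in Figure 1 for the tensor of order . Note that in this case the TT–cores are in fact $4$-th order tensors, as the indices are given by tuples $(i_k, j_k)$, but all the operations defined for tensors in the TT–format are naturally extended to TT–matrices.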

Figure 1: Construction of the TT–matrix from the standard embedding matrix. Blue color depicts how the single element in the initial matrix is transformed into the product of the highlighted vectors and matrices in the TT–cores.

4.3 TT–embedding

By TT–embedding, we call a layer with trainable parameters (TT–cores) represented as a TT–matrix $\mathcal{E}$ of the underlying tensor shape $(I_1, I_2, \dots, I_N) \times (J_1, J_2, \dots, J_N)$, which can be transformed into a valid embedding layer $\mathbf{E} \in \mathbb{R}^{I \times J}$, with $I = \prod_{k=1}^{N} I_k$ and $J = \prod_{k=1}^{N} J_k$. To specify the shapes of TT–cores one also has to provide the TT–ranks, which are treated as hyperparameters of the layer and explicitly define the total compression ratio.

In order to compute the embedding for a particular word indexed $i$ in the vocabulary, we first map the row index $i$ into the $N$-dimensional vector index $(i_1, \dots, i_N)$, and then calculate the components of the embedding with formula (2). Note that the computation of all its components is equivalent to selecting the particular slices in TT–cores (slices of shapes $J_1 \times R_1$ in $\mathcal{G}^{(1)}$, $R_1 \times J_2 \times R_2$ in $\mathcal{G}^{(2)}$ and so on) and performing a sequence of matrix multiplications, which is executed efficiently in modern linear algebra packages, such as BLAS. Pseudocode for the procedure of computing the mapping $i \to (i_1, \dots, i_N)$ is given in Appendix A.

In order to construct TT–embedding layer for a vocabulary of size and embedding dimension , and to train a model with such a layer, one has to perform the following steps.

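  • Provide factorizations of $I$ and $J$ into factors $I = I_1 \times I_2 \times \dots \times I_N$ and $J = J_1 \times J_2 \times \dots \times J_N$, and specify the set of TT–ranks $\{R_1, R_2, \dots, R_{N-1}\}$.

  • Initialize the set of parameters of the embedding $\Theta = \{\mathcal{G}^{(k)} \in \mathbb{R}^{R_{k-1} \times I_k \times J_k \times R_k}\}_{k=1}^{N}$. Concrete initialization scenarios are discussed further in the text.

  • During training, given a batch of indices $\{i_1, i_2, \dots, i_b\}$, compute the corresponding embeddings $\{\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_b\}$ using Equation 2.

  • Computed embeddings can be followed by any standard layer such as LSTM (Hochreiter and Schmidhuber, 1997) or self-attention (Vaswani et al., 2017), and trained with backpropagation since they depend differentiably on the parameters $\Theta$.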
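The lookup part of the steps above can be sketched in plain Python. This is a minimal illustration rather than the released implementation: it assumes a row-major bijection between a word index and its multi-index, stores each core as a nested list indexed as `cores[k][r_prev][i_k][j_k][r_next]`, and uses hypothetical function names, shapes, and ranks. Nested lists are used only to keep the example dependency-free; a practical layer would rely on batched tensor operations.

```python
import random


def make_cores(row_factors, col_factors, ranks, std=0.1, seed=0):
    """Random TT-cores; boundary ranks R_0 and R_N are fixed to 1."""
    rng = random.Random(seed)
    full_ranks = [1] + list(ranks) + [1]
    return [
        [[[[rng.gauss(0.0, std) for _ in range(full_ranks[k + 1])]
           for _ in range(col_factors[k])]
          for _ in range(row_factors[k])]
         for _ in range(full_ranks[k])]
        for k in range(len(row_factors))
    ]


def multi_index(i, factors):
    """Row-major mixed-radix digits of i (one possible choice of bijection)."""
    digits = []
    for size in reversed(factors):
        i, digit = divmod(i, size)
        digits.append(digit)
    return digits[::-1]


def tt_embedding_row(cores, row_factors, i):
    """Embedding of word i as a flat list of length J = J_1 * ... * J_N."""
    idx = multi_index(i, row_factors)
    partial = [[1.0]]  # a single 1 x R_0 row vector
    for core, i_k in zip(cores, idx):
        num_cols = len(core[0][i_k])       # J_k
        rank_next = len(core[0][i_k][0])   # R_k
        # Multiply every partial row vector by the slice for each j_k.
        partial = [
            [sum(vec[p] * core[p][i_k][j_k][r] for p in range(len(vec)))
             for r in range(rank_next)]
            for vec in partial
            for j_k in range(num_cols)
        ]
    return [vec[0] for vec in partial]  # R_N = 1, so each vector has one entry


# Hypothetical layer: I = 4 * 5 * 5 = 100 words, J = 2 * 2 * 4 = 16 dimensions.
row_factors, col_factors, ranks = [4, 5, 5], [2, 2, 4], [3, 3]
cores = make_cores(row_factors, col_factors, ranks)
batch = [0, 17, 99]
embeddings = [tt_embedding_row(cores, row_factors, i) for i in batch]
print(len(embeddings), len(embeddings[0]))
```

Only the slices selected by $(i_1, \dots, i_N)$ are touched, so the full $I \times J$ matrix is never materialized.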

TT–embedding implies a specific structure on the order of tokens in the vocabulary (the order of rows in the embedding matrix), and determining the optimal order is an appealing problem to solve. However, we leave this problem for future work and use the order produced by the standard tokenizer (sorted by frequency) in our current experiments.

We also experimented with a more general form of TT-decomposition, namely Tensor Ring (TR) decomposition (Zhao et al., 2016; Wang et al., 2018). This decomposition by construction has the appealing property of being circular permutation invariant (and, thus, more robust with respect to the order of the tokens), which could have potentially provided an improvement over the TT-based models with simple frequency based ordering. However, despite having stronger generalization abilities, TR might require more intricate optimization procedure (Section 2.5 in Grasedyck et al. (2013)), and we did not observe the benefits of using TR instead of TT in our experiments (Appendix C).


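Initialization

The standard way to initialize an embedding matrix $\mathbf{E} \in \mathbb{R}^{I \times J}$ is via, e.g., the Glorot initializer (Glorot and Bengio, 2010), which initializes each element as $\mathbf{E}(i, j) \sim \mathcal{N}\left(0, \frac{2}{I + J}\right)$. For the TT–embedding, we can only initialize the TT–cores, and the distribution of the elements of the resulting matrix $\mathcal{E}$ is rather non–trivial. However, it is easy to verify that if we initialize each TT–core element as $\mathcal{G}^{(k)}(r_{k-1}, n_k, r_k) \sim \mathcal{N}(0, 1)$, the resulting distribution of the matrix elements $\mathcal{E}(i, j)$ has the property that $\mathbb{E}[\mathcal{E}(i, j)] = 0$ and $\mathrm{Var}[\mathcal{E}(i, j)] = \prod_{k=1}^{N} R_k = R^2$. Capitalizing on this observation, in order to obtain the desired variance $\mathrm{Var}[\mathcal{E}(i, j)] = \sigma^2$ while keeping $\mathbb{E}[\mathcal{E}(i, j)] = 0$, we can simply initialize each TT–core as

$$\mathcal{G}^{(k)}(r_{k-1}, n_k, r_k) \sim \mathcal{N}\left(0, \left(\frac{\sigma}{R}\right)^{2/N}\right). \tag{3}$$

The resulting distribution is not Gaussian; however, it approaches the Gaussian distribution as the TT–rank increases (Figure 2).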
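The scaling of the core variance can be written down directly. The sketch below assumes, as argued in the text, that with standard normal cores each matrix entry is a sum of $\prod_k R_k$ uncorrelated unit-variance terms, and that the Glorot normal initializer has variance $2/(I+J)$. Function names, shapes, and ranks are hypothetical.

```python
from math import prod, sqrt


def glorot_std(num_rows, num_cols):
    """Standard deviation of the Glorot normal initializer for an I x J matrix."""
    return sqrt(2.0 / (num_rows + num_cols))


def tt_core_std(sigma, ranks):
    """Per-core std such that each entry of the TT-matrix has variance sigma**2.

    `ranks` lists the internal TT-ranks R_1..R_{N-1}; their product is the
    number of summands in every matrix entry, and there are
    N = len(ranks) + 1 cores.
    """
    num_cores = len(ranks) + 1
    core_var = (sigma ** 2 / prod(ranks)) ** (1.0 / num_cores)
    return sqrt(core_var)


# Hypothetical layer: I = 25 * 30 * 40, J = 4 * 8 * 8, all TT-ranks equal to 16.
row_factors, col_factors, ranks = [25, 30, 40], [4, 8, 8], [16, 16]
sigma = glorot_std(prod(row_factors), prod(col_factors))
core_std = tt_core_std(sigma, ranks)

# Variance of a matrix entry implied by this core std: product of the
# N core variances times the number of summands.
implied_var = core_std ** (2 * len(row_factors)) * prod(ranks)
print(core_std, implied_var, sigma ** 2)
```

The last line prints the implied entry variance next to the Glorot target so the two can be compared.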

Figure 2: Distribution of matrix elements of the TT–matrix of shape initialized by formula (3) with . As the TT–rank increases, the resulting distribution approaches Gaussian .

In our experiments, we have used the modified Glorot initializer implemented by formula (3), which greatly improved performance, as opposed to initializing TT–cores simply via a standard normal distribution. It is also possible to initialize the TT–embedding layer by converting a learned embedding matrix into the TT–format using the TT–SVD algorithm (Oseledets, 2011). However, this approach requires a pretrained embedding matrix and does not exhibit better performance in practice (Garipov et al., 2016).

Hyperparameter selection

Our embedding layer introduces two additional structure-specific hyperparameters, namely TT–shapes and TT–ranks.

TT–embedding does not require the vocabulary size $I$ to be represented exactly as the product of factors $I_1, \dots, I_N$; in fact, any factorization $\prod_{k=1}^{N} I_k = \tilde{I} \geq I$ will suffice. However, in order to achieve the highest possible compression ratio for a fixed value of $\tilde{I}$, the factors $\{I_k\}_{k=1}^{N}$ should be as close to each other as possible (Novikov et al., 2015; Yang et al., 2017a). Our implementation includes a simple automated procedure for selecting a good set of values $\{I_k\}_{k=1}^{N}$ during TT–embedding initialization. The factors $J_1, \dots, J_N$ are defined by the embedding dimensionality $J$, which can be easily chosen to support good factorization, e.g., or .
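One simple way to pick near-equal factors whose product covers the vocabulary is sketched below. This is a hypothetical heuristic written for illustration and is not the automated procedure from the authors' implementation: it starts from the smallest integer base whose $N$-th power reaches the vocabulary size and then lowers individual factors by one while the product stays large enough.

```python
from math import prod


def suggest_factors(vocab_size, num_factors):
    """Near-equal factors whose product is at least `vocab_size`."""
    # Smallest integer base with base ** num_factors >= vocab_size.
    base = 1
    while base ** num_factors < vocab_size:
        base += 1
    factors = [base] * num_factors
    # Greedily lower one factor at a time while the product still covers the vocabulary.
    for k in range(num_factors):
        if factors[k] > 1:
            trial = factors[:k] + [factors[k] - 1] + factors[k + 1:]
            if prod(trial) >= vocab_size:
                factors = trial
    return sorted(factors)


# Hypothetical vocabulary sizes.
for size in (1000, 25_000):
    factors = suggest_factors(size, 3)
    print(size, factors, prod(factors))
```

Rows beyond the true vocabulary size are simply never indexed, so a product slightly above $I$ costs only a few unused parameters.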

The values of TT–ranks directly define the compression ratio, so choosing them too small or too large will result in either a significant performance drop or little reduction in the number of parameters. In our experiments, we set all TT–ranks to for problems with small vocabularies and for problems with larger vocabularies, which resulted in a good trade-off between embedding layer compression ratio and the metric of interest.

4.4 Expressivity of TT–embedding

Recall that in Section 3 we argued that one advantage of TT–embeddings is the property of being full rank matrices despite providing a significant data compression. Let us now formalize this statement.

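For a fixed $I = \prod_{k=1}^{N} I_k$, $J = \prod_{k=1}^{N} J_k$, and a set of ranks $\mathbf{R} = (R_1, R_2, \dots, R_{N-1})$, we consider $\mathcal{M}_{\mathbf{R}}$, the set of all tensors represented in the TT–matrix format such that for any $\mathcal{X} \in \mathcal{M}_{\mathbf{R}}$ we have

$$\text{TT-rank}(\mathcal{X}) \leq \mathbf{R},$$

entry-wise. Let $\mathbf{X}$ denote an ordinary matrix of size $I \times J$ obtained from the TT–matrix $\mathcal{X}$ with the inverse of the procedure described in Section 4.2 (application of formulas from Section 4.1, followed by transposing and reshaping). We show that the following result holds.

Theorem 1.

For all $\mathcal{X} \in \mathcal{M}_{\mathbf{R}}$ besides a set of measure zero

$$\operatorname{rank} \mathbf{X} = \min(I, J),$$

where the ordinary matrix rank is assumed.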

See Appendix B for a proof.

This theorem states that for almost all TT-embeddings (besides a negligible set), the corresponding standard embedding matrix is full-rank. Thus, using the same matrix in the softmax layer, we can achieve significant compression without hitting the softmax bottleneck, as opposed to the low-rank matrix factorization.
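A tiny worked case makes the contrast with low-rank factorization tangible. When all TT–ranks equal 1 and a row-major index bijection is used, a TT–matrix with two cores reduces to the Kronecker product of two small matrices. The sketch below builds a $4 \times 4$ matrix this way from 8 parameters and compares its rank with that of a rank-1 factorization $\mathbf{u}\mathbf{v}^{\top}$, which also uses 8 parameters. The matrices are arbitrary illustrative values; this is an example, not a proof of Theorem 1.

```python
from fractions import Fraction


def kron(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matrix_rank(matrix):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


# TT-matrix of shape (2, 2) x (2, 2) with TT-rank 1: two 2 x 2 cores, 8 parameters.
core_1 = [[1, 2], [3, 4]]
core_2 = [[2, 1], [1, 1]]
tt_matrix = kron(core_1, core_2)

# Rank-1 factorization u v^T of a 4 x 4 matrix: also 8 parameters.
u, v = [1, 2, 3, 4], [1, 1, 2, 3]
low_rank = [[u_i * v_j for v_j in v] for u_i in u]

print("TT-matrix rank:", matrix_rank(tt_matrix))
print("low-rank matrix rank:", matrix_rank(low_rank))
```

With the same parameter budget, the Kronecker-structured matrix keeps full matrix rank because both small factors are invertible, while the factorized one is limited to the rank of its factors.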

5 Experiments


We have implemented TT–embeddings described in Section 4 in Python using PyTorch (Paszke et al., 2019). The code is available at the anonymous repository https://github.com/tt-embedding/tt-embeddings.

Experimental setup

We tested our approach on several popular NLP tasks:

  • Sentiment analysis — as a starting point in our experiments, we test TT–embeddings on a rather simple task of predicting polarity of a sentence.

  • Neural Machine Translation (NMT) — to verify the applicability of TT–embeddings in more practical problems, we test it on a more challenging task of machine translation.

  • Language Modeling (LM) — then, we evaluate TT–embeddings on language modeling tasks in the case of extremely large vocabularies.

  • Click Through Rate (CTR) prediction — finally, we show that TT–embeddings can be applied for the binary classification with categorical features of significant cardinality.

Dataset | Model | Embedding shape | Test acc. | Emb compr. | Total params
SST Full M
Table 1: Sentiment analysis, LSTM on IMDB and SST datasets. Embedding compression is calculated as the ratio between the number of parameters in the full embedding layer and TT–embedding layer. The LSTM parts are identical in both models, and the TT–ranks were set to in these experiments.

To prove the generality and wide applicability of the proposed approach, we tested it on various architectures, such as MLPs (CTR), LSTMs (sentiment analysis), and Transformers (NMT, LM). The baselines we compare with are

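  1. Standard embedding layer parametrized by a matrix $\mathbf{E} \in \mathbb{R}^{I \times J}$ with the baseline compression ratio of $1$.

  2. Low-rank factorized embedding layer parametrized by two matrices $\mathbf{U} \in \mathbb{R}^{I \times R}$ and $\mathbf{V} \in \mathbb{R}^{J \times R}$ such that the corresponding embedding matrix is $\mathbf{E} = \mathbf{U}\mathbf{V}^{\top}$. The compression ratio in this case is $\frac{I \times J}{(I + J) \times R} \approx \frac{J}{R}$.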

Note that Transformers in LM and NMT use the same weight matrix for their embedding and softmax layers (Press and Wolf, 2016; Inan et al., 2016), which already significantly reduces model size. Untying the weights and tensorizing only the embedding layer would increase the number of parameters instead of compressing the model. In our experiments, we use two separate TT–decompositions of the same shape for the embedding and softmax layers and report the compression ratios as $\frac{I \times J}{2 \times \text{TT-params}}$.

5.1 Sentiment analysis

Model | Embedding shape | Rank | Token BLEU | Sacre BLEU | Emb compr. | Total params
Big M
Big+LR1 M
Big+LR2 M
Big+LR3 M
Big+TT1 M
Big+TT2 M
Big+TT3 M
Table 2: NMT, Transformer-big on WMT’14 English-to-German dataset. Both case-sensitive tokenized BLEU (higher is better) and de-tokenized SacreBLEU (Post, 2018) on newstest2014 are reported. In case of low-rank (LR) factorization, rank is the factorization rank; in case of TT-embedding (TT), rank is the TT-rank.

For this experiment, we have used the IMDB dataset (Maas et al., 2011) with two categories, and the Stanford Sentiment Treebank (SST) (Socher et al., 2013) with five categories. We have taken the most frequent words for the IMDB dataset and for SST, embedded them into a –dimensional space using either standard embedding or TT–embedding layer, and performed classification using a standard bidirectional two–layer LSTM with hidden size , and dropout rate .

Our findings are summarized in Table 1. We observe that the models with heavily compressed embedding layers can perform as well as or even better than the full uncompressed models. This suggests that learning individual independent embeddings for each particular word is superfluous, as the expressive power of the LSTM is sufficient to make use of these intertwined, yet more compact embeddings. Moreover, the slightly better test accuracy of the compressed models in certain cases (e.g., for the rather small SST dataset) suggests that imposing a specific tensorial low–rank structure on the embedding matrix can be viewed as a special form of regularization, thus potentially improving model generalization. A detailed and comprehensive test of this hypothesis goes beyond the scope of this paper, and we leave it for future work.

5.2 Neural Machine Translation

For this experiment, we have trained the Transformer-big model (, , ) from Vaswani et al. (2017) on the WMT English–German dataset consisting of roughly million sentence pairs. We evaluated on the newstest2014 dataset using beam search with a beam size of and no length penalty. We did not employ checkpoint averaging and used the last checkpoint to compute the BLEU score. Sentences were tokenized with YouTokenToMe byte-pair-encodings, resulting in a joint vocabulary of tokens. For the full list of hyperparameters, see the Appendix.

Model | Embedding shape | Rank | Valid PPL | Test PPL | Emb compr. | Total params
Table 3: LM, Transformer-XL (Dai et al., 2019) on the WikiText-103 dataset. Lower values of perplexity (PPL) are better.

Our results are summarized in Table 2. We observe that even in this rather challenging task, both embedding and softmax layers can be compressed significantly, at the cost of a small drop in the BLEU score. However, with the increase of compression factor, the performance deteriorates rapidly. Compared to the sentiment analysis, NMT is a much more complex task which benefits more from additional capacity (in the form of more powerful RNN or more transformer blocks) rather than regularization (Bahdanau et al., 2014; Vaswani et al., 2017; Wu et al., 2019), which may explain why we did not manage to improve the model by regularizing its embedding layers with TT-embedding.

Compared to the low-rank factorization of the embedding layer, the BLEU score of the Transformer with TT-embedding is higher and degrades much more slowly as the TT-rank decreases. We hypothesize that this is because the corresponding embedding matrix is full rank and does not suffer from the softmax bottleneck (Yang et al., 2017b).

TT-embeddings induce a training iteration time overhead compared to the baseline Transformer-big, because our current implementation relies heavily on the slow torch.einsum function, while standard embedding and softmax layers make use of fast and highly optimized Tensor Cores for mixed-precision training. We expect a dedicated CUDA kernel to be much more efficient.

5.3 Language modeling

We took the Transformer-XL (Dai et al., 2019), an open source state-of-the-art language modeling architecture at the time of this writing, and replaced its embedding and softmax layers with TT–factorizations. Then, we tested different model configurations on the WikiText–103 (Merity et al., 2016) dataset and reported the results in Table 3. For the full list of hyperparameters, see the Appendix.

Compared to sentiment analysis and NMT, we were not able to achieve such high compression ratios for embedding and softmax layers in LM. However, in our case of extremely large vocabulary, even moderate times compression allowed us to save M of weights at the cost of perplexity drop. Note that TT-embeddings also outperform the low-rank factorization baseline, achieving a better trade-off between compression and performance.

Hash | Model | Factorization | TT rank | Hidden size | Test loss | Emb. compr. | Total params
Full M
TT1 factors M
TT2 factors M
TT3 factors M
TT4 factors M
TT1 factors M
TT2 factors M
Table 4: CTR prediction. The hashed dataset is constructed as specified in Section 5.4 with hashing value . Embedding layers with more than unique tokens were replaced by TT–embeddings with shape factorizations consisting of or factors.

5.4 Click Through Rate prediction

Among other applications of the TT–embedding layer, we chose to focus on CTR prediction, a popular task in digital advertising (He et al., 2014). We consider the open dataset provided by Criteo for the Kaggle Display Advertising Challenge (Criteo Labs, 2014), which consists of categorical features, M samples and is binary labeled according to whether the user clicked on the given advertisement. Unique values of categorical features are bijectively mapped into integers. To reduce the memory footprint, if the size of a corresponding vocabulary is immense (e.g., the cardinality of some features in this dataset is of order ), these integers are further hashed by taking the modulus with respect to some fixed number such as . However, due to the strong compression properties of TT–embeddings, this is not necessary for our approach, and we consider both full and hashed datasets in our experiments.
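The preprocessing described above (a bijective mapping of raw values to integers, optionally followed by a modulus) can be sketched as follows. The feature values and the number of buckets are made up for illustration; they are not taken from the Criteo dataset or from the experimental setup.

```python
def build_vocab(values):
    """Map each distinct raw value to a consecutive integer id."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def encode(values, vocab, num_buckets=None):
    """Integer ids of the values, optionally hashed by taking a modulus."""
    ids = [vocab[value] for value in values]
    if num_buckets is not None:
        ids = [i % num_buckets for i in ids]
    return ids


# Made-up values of a single categorical feature.
raw = ["a1", "b7", "a1", "c3", "d9", "e2"]
vocab = build_vocab(raw)

full_ids = encode(raw, vocab)                   # one embedding row per distinct value
hashed_ids = encode(raw, vocab, num_buckets=4)  # toy hashing value; distinct values may collide
print(full_ids, hashed_ids)
```

In the hashed variant distinct values may share an embedding row, which is the loss of information that a compact embedding over the full vocabulary avoids.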

CTR with the baseline algorithm

The task at hand can be treated as a binary classification problem. As a baseline algorithm, we consider the neural network with the following architecture. First, each of the categorical features is passed through a separate embedding layer with embedding size . After that, the embedded features are concatenated and passed through fully-connected layers of neurons and ReLU activation functions. In all experiments, we used Adam optimizer with the learning rate equal to . Since many input features have a large number of unique values (e.g., ) and storing the corresponding embedding matrices would be costly, we employ the hashing procedure mentioned earlier.

CTR with TT–embeddings

We substitute the embedding layers with the TT–embedding layers. Besides that, we leave the overall structure of the neural network unchanged with the same parameters as in the baseline approach. Table 4 presents the experimental results on the Criteo CTR dataset. To the best of our knowledge, our loss value is very close to the state-of-the-art result (Juan et al., 2016). These experiments indicate that the substitution of large embedding layers with TT–embeddings leads to significant compression ratios (up to times) with a slight improvement in the test loss, and up to with a small drop in the test loss. The total size of the compressed model does not exceed Mb, while the baseline model weighs about Mb. The obtained compression ratio suggests that the usage of TT–embedding layers may be beneficial in CTR prediction.

6 Discussion and future work

We propose a novel embedding layer, the TT–embedding, for compressing huge lookup tables used for encoding categorical features of significant cardinality, such as the index of a token in natural language processing tasks. The proposed approach, based on the TT–decomposition, experimentally proved to be effective, as it heavily decreases the number of training parameters at the cost of a small deterioration in performance. In addition, our method can be easily integrated into any deep learning framework and trained via backpropagation, while capitalizing on reduced memory requirements and increased training batch size.

Our experimental results suggest several appealing directions for future work. First of all, TT–embeddings impose a concrete tensorial low-rank structure on the embedding matrix, which was shown to improve the generalization ability of the networks acting as a regularizer. The properties and conditions of applicability of this regularizer are subject to more rigorous analysis. Secondly, unlike standard embedding, we can introduce non-linearity into TT-cores to improve their expressive power (Khrulkov et al., 2019). Additionally, it is important to understand how the order of tokens in the vocabulary affects the properties of the networks with TT–embedding. We hypothesize that there exists the optimal order of tokens which better exploits the particular structure of TT–embedding and leads to a boost in performance and/or compression ratio. Finally, the idea of applying higher–order tensor decompositions to reduce the number of parameters in neural nets is complementary to more traditional methods such as pruning (Han et al., 2015) and quantization (Hubara et al., 2017; Xu et al., 2018). Thus, it would be interesting to make a thorough comparison of all these methods and investigate whether their combination may lead to even stronger compression.

Appendix A Multiindex construction

  Require: $I$ – vocabulary size, $\{I_k\}_{k=1}^{N}$ – an arbitrary factorization of $I$,
       $i$ – index of the target word in vocabulary.
  Returns: $\mathcal{I}(i) = (i_1, \dots, i_N)$ – $N$-dimensional index.
  for  to  do
  end for
Algorithm 1 The algorithm implementing the bijection $\mathcal{I}(i)$ as described in Section 4.2.
  Require: $I$ – vocabulary size, $\{I_k\}_{k=1}^{N}$ – an arbitrary factorization of $I$,
      $(i_1, \dots, i_N)$ – $N$-dimensional index.
  Returns: $i$ – index of the target word in vocabulary
  for  to  do
  end for
Algorithm 2 The algorithm implementing the bijection , inverse to .
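
Both maps amount to a mixed-radix conversion between a flat vocabulary index and a tuple of digits, one digit per factor. The following is a minimal Python sketch under stated assumptions: indices are 0-based, the first component varies fastest (a column-major convention), and the function names are illustrative only. The ordering actually used in Section 4.2 may differ; any fixed ordering yields a valid bijection.

```python
from math import prod


def index_to_multiindex(i, factors):
    """Map a flat index i in [0, prod(factors)) to a tuple (i_1, ..., i_N).

    Assumption: 0-based indices, first component varies fastest.
    """
    if not 0 <= i < prod(factors):
        raise ValueError("flat index out of range")
    multi = []
    for base in factors:
        # Peel off one mixed-radix digit per factor.
        i, digit = divmod(i, base)
        multi.append(digit)
    return tuple(multi)


def multiindex_to_index(multi, factors):
    """Inverse map: fold the digits (i_1, ..., i_N) back into a flat index."""
    if len(multi) != len(factors):
        raise ValueError("multiindex and factorization lengths differ")
    i, stride = 0, 1
    for digit, base in zip(multi, factors):
        if not 0 <= digit < base:
            raise ValueError("multiindex component out of range")
        i += digit * stride
        stride *= base  # stride of the next component
    return i
```

For instance, with the factorization (2, 3, 4) of a vocabulary of size 24, the flat index 17 maps to (1, 2, 2), since 1 + 2·2 + 2·6 = 17.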

Appendix B Proof of Theorem 1

Recall that for fixed , , and a set of ranks we defined , the set of all tensors represented in the TT-matrix format such that for any we have

entry-wise. Let denote an ordinary matrix of size obtained from the TT-matrix by inverting the procedure described in Section 4.2 (application of the formulas from Section 4.1, followed by transposing and reshaping).

Our analysis is based on the fact that forms an irreducible algebraic set (Buczyńska et al., 2015; Hartshorne, 2013). Concretely, we will use the fact that for an irreducible algebraic set any algebraic subset either has measure zero, or coincides with . We start with a simple lemma.

Lemma 1.


then is an algebraic subset of .


We need to show that is cut out by polynomial equations on . This readily follows from the facts that is a linear mapping, and that the upper bound on matrix rank can be specified by requiring all minors of specific size to vanish (which is a polynomial constraint). ∎
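
The minor-based characterization of rank used in this proof can be checked on small examples. The sketch below is an illustration only (the helper names are hypothetical, and the Laplace expansion is practical only for tiny matrices): a matrix has rank strictly less than `size` exactly when all of its `size` × `size` minors, each a polynomial in the entries, are zero.

```python
from itertools import combinations


def det(m):
    """Exact determinant via Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )


def all_minors_vanish(matrix, size):
    """Return True iff every size x size minor is zero, i.e. rank(matrix) < size."""
    rows, cols = len(matrix), len(matrix[0])
    for r_idx in combinations(range(rows), size):
        for c_idx in combinations(range(cols), size):
            sub = [[matrix[r][c] for c in c_idx] for r in r_idx]
            if det(sub) != 0:
                return False
    return True
```

For the rank-one matrix with rows (1, 2, 3), (2, 4, 6), (3, 6, 9), every 2 × 2 minor vanishes, whereas the 3 × 3 identity matrix has a non-zero 3 × 3 minor.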

We now show that is in fact a proper subset of , i.e., .

Lemma 2.

For any there exists with


We provide a concrete example of such a tensor. Define the collection of TT–cores using the equations


with denoting the Kronecker delta symbol. It is easy to verify that of a tensor specified by this collection of cores takes a very simple form: , which clearly is of maximal rank. ∎

Using Lemmas 1 and 2 and the properties of algebraic sets discussed above, we conclude that the following theorem holds.

Theorem 1.

For all besides a set of measure zero

where the ordinary matrix rank is assumed.

Appendix C Tensor Ring Embedding

Tensor Ring (TR) decomposition is a generalization of the TT-decomposition in which the first and the last cores are 3-dimensional tensors, which corresponds to . Formally, a tensor is said to be represented in the TR format (Zhao et al., 2016) if each element of can be computed as:

Similar to TT, we can define TR-matrix (see Figure 3) and corresponding TR-embedding layer.
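
In the TR format, each element is the trace of the product of the core slices selected by the multiindex. The following is a minimal pure-Python sketch of that element-wise rule. The data layout (`cores[k][i_k]` as a nested list holding a matrix) and the function names are assumptions made for illustration and do not come from the paper's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def tr_element(cores, index):
    """Compute one element of a tensor stored in the TR format.

    cores[k][i_k] is the matrix slice of the k-th core for mode index i_k.
    The number of rows of the first slice must equal the number of
    columns of the last one, so that the product is square.
    """
    product = cores[0][index[0]]
    for core, i_k in zip(cores[1:], index[1:]):
        product = matmul(product, core[i_k])
    # The trace closes the "ring" and yields a scalar.
    return sum(product[r][r] for r in range(len(product)))
```

When the boundary ranks equal 1, the product is a 1 × 1 matrix and the trace simply returns its single entry, which recovers the usual TT element formula.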

While our results (Table 5 and Table 6) suggest that the TT-embedding offers a better compression-performance trade-off than its TR counterpart, much more experimentation is needed to properly compare the two approaches (for example, TR outperforms TT on the SST-2 benchmark, which makes it a promising direction for future work). However, such an analysis is computationally heavy and goes beyond the scope of this paper.

Figure 3: Construction of the TR–matrix from the standard embedding matrix. Blue color depicts how a single element of the initial matrix is transformed into the product of the highlighted matrices. In contrast to the TT-embedding, the matrix trace operator is applied to the final matrix, resulting in a scalar (highlighted element).
Dataset  Model  Embedding shape  Rank  Test acc.  Emb compr.  Total params
TT1 16 M
TT2 16 M
TT3 16 M
TR1 16 M
TR2 16 M
TR3 16 M
TR4 8 M
TR5 8 M
TR6 8 M
SST Full M
TT1 16 M
TT2 16 M
TT3 16 M
TR1 8 M
TR2 8 M
TR3 8 M
Table 5: Sentiment analysis, LSTM with either TT-embedding or TR-embedding on IMDB and SST datasets.
Model  Embedding shape  Rank  Token BLEU  Sacre BLEU  Emb compr.  Total params
Big M
Big+TT1 M
Big+TT2 M
Big+TT3 M
Big+TR1 M
Big+TR2 M
Table 6: NMT, Transformer-big with either TT-embedding or TR-embedding on WMT’14 English-to-German dataset. Both case-sensitive tokenized BLEU and de-tokenized SacreBLEU (Post, 2018) on newstest2014 are reported.
Table 7: Hyperparameters of Transformer-big used for neural machine translation on WMT’14.
Parameter Value
Data cleaning
 max training sequence length in tokens
 max source / target ratio
 vocabulary size,
 hidden size,
 intermediate FF layer size,
 number of attention heads,
 number of layers in encoder / decoder
 optimizer NovoGrad
 learning rate
 learning rate decay policy cosine
 weight decay
 batch size in tokens
 number of training steps
 number of warmup steps
 global dropout,
 label smoothing
 beam search beam size
 length penalty
Table 8: Hyperparameters of Transformer-XL used for language modeling on WikiText-103.
Parameter Value
 vocabulary size,
 hidden size,
 intermediate FF layer size,
 number of attention heads,
 number of layers
 optimizer NovoGrad
 learning rate
 learning rate decay policy cosine
 weight decay
 batch size in sequences
 target sequence length
 memory sequence length
 number of training steps
 number of warmup steps
 global dropout,
 batch size
 target sequence length
 memory sequence length
 max positional encodings length


  1. By reshape we mean a column-major reshape command, such as numpy.reshape with order='F' in Python (the NumPy default, order='C', is row-major).
  2. Asymptotic normality is a consequence of application of the Central Limit Theorem.
  3. https://github.com/VKCOM/YouTokenToMe
  4. https://github.com/kimiyoung/transformer-xl


  1. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853. Cited by: §2.
  2. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §5.2.
  3. The Hackbusch conjecture on tensor formats. Journal de Mathématiques Pures et Appliquées 104 (4), pp. 749–761. Cited by: Appendix B.
  4. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition. Psychometrika 35 (3). Cited by: §2.
  5. GroupReduce: block-wise low-rank approximation for neural language model shrinking. NIPS. Cited by: §2.
  6. Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations. arXiv preprint arXiv:1806.09464. Cited by: §2, §3.1.
  7. Transformers.zip: compressing transformers with pruning and quantization. Technical report, Stanford University, Stanford, California, 2019. URL https …. Cited by: §2.
  8. Kaggle Display Advertising Challenge. External Links: Link Cited by: §5.4.
  9. Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §5.3, Table 3.
  10. Ultimate tensorization: compressing convolutional and FC layers alike. arXiv preprint arXiv:1611.03214. Cited by: §2, §4.3.
  11. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §4.3.
  12. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen 36 (1), pp. 53–78. Cited by: §4.3.
  13. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143. Cited by: §6.
  14. Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics. Cited by: §2.
  15. Algebraic geometry. Vol. 52, Springer Science & Business Media. Cited by: Appendix B.
  16. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9. Cited by: §5.4.
  17. Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: 4th item.
  18. Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18 (187), pp. 1–30. Cited by: §6.
  19. Tying word vectors and word classifiers: a loss framework for language modeling. arXiv preprint arXiv:1611.01462. Cited by: §3.2, §5.
  20. FastText.zip: compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §2.
  21. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 43–50. Cited by: §5.4.
  22. Generalized tensor models for recurrent neural networks. arXiv preprint arXiv:1901.10801. Cited by: §6.
  23. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §2.
  24. Word2Bits-quantized word vectors. arXiv preprint arXiv:1803.05651. Cited by: §2.
  25. ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.
  26. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. ICLR. Cited by: §2.
  27. Bayesian sparsification of recurrent neural networks. arXiv preprint arXiv:1708.00077. Cited by: §2.
  28. A tensorized transformer for language modeling. arXiv preprint arXiv:1906.09777. Cited by: §2.
  29. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. Cited by: §5.1.
  30. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: §5.3.
  31. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450. Cited by: §2, §4.2, §4.3.
  32. Approximation of matrices using tensor decomposition. SIAM Journal on Matrix Analysis and Applications 31 (4), pp. 2130–2145. Cited by: §4.2.
  33. Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §2, §4.1, §4.3.
  34. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §5.
  35. Matrix product operator representations. New Journal of Physics 12 (2), pp. 025012. Cited by: §4.2.
  36. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771. Cited by: Table 6, Table 2.
  37. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859. Cited by: §3.2, §5.
  38. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. Cited by: §2.
  39. Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274. Cited by: §2.
  40. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §2.
  41. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §5.1.
  42. Compressing recurrent neural network with tensor train. arXiv preprint arXiv:1705.08052. Cited by: §2.
  43. WEST: Word Encoded Sequence Transducers. arXiv preprint arXiv:1811.08417. Cited by: §2.
  44. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2, 4th item, §5.2, §5.2.
  45. Wide compression: tensor ring nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9329–9338. Cited by: §2, §4.3.
  46. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430. Cited by: §5.2.
  47. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §2.
  48. Deep neural network compression with single and multiple level quantization. arXiv preprint arXiv:1803.03289. Cited by: §6.
  49. Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786. Cited by: §2, §3.2, §4.3.
  50. Breaking the softmax bottleneck: a high-rank RNN language model. arXiv preprint arXiv:1711.03953. Cited by: §3.2, §5.2.
  51. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073. Cited by: §2.
  52. Tensor ring decomposition. arXiv preprint arXiv:1606.05535. Cited by: Appendix C, §2, §4.3.