Exponential Machines
Abstract
Modeling interactions between features improves the performance of machine learning solutions in many domains (e.g. recommender systems or sentiment analysis). In this paper, we introduce Exponential Machines (ExM), a predictor that models all interactions of every order. The key idea is to represent an exponentially large tensor of parameters in a factorized format called Tensor Train (TT). The Tensor Train format regularizes the model and allows controlling the number of underlying parameters. To train the model, we develop a stochastic Riemannian optimization procedure, which allows us to fit tensors with an exponential number of entries. We show that the model achieves state-of-the-art performance on synthetic data with high-order interactions and that it works on par with high-order factorization machines on a recommender system dataset MovieLens 100K.
Alexander Novikov 
novikov@bayesgroup.ru 
Mikhail Trofimov 
mikhail.trofimov@phystech.edu 
Ivan Oseledets 

i.oseledets@skoltech.ru 
Skolkovo Institute of Science and Technology, Russian Federation 
National Research University Higher School of Economics, Russian Federation 
Institute of Numerical Mathematics, Russian Federation 
Moscow Institute of Physics and Technology, Russian Federation 
1 Introduction
Machine learning problems with categorical data often require modeling interactions between the features. As an example, consider a sentiment analysis problem – detecting whether a review is positive or negative – and the following dataset: ‘I liked it’, ‘I did not like it’, ‘I’m not sure’. Judging by the presence of the word ‘like’ or the word ‘not’ alone, it is hard to understand the tone of the review. But the presence of the pair of words ‘not’ and ‘like’ strongly indicates a negative opinion.
If the dictionary has $d$ words, modeling pairwise interactions requires $O(d^2)$ parameters and will probably overfit to the data. Taking into account all interactions (all pairs, triplets, etc. of words) requires an impractical $2^d$ parameters.
In this paper, we show a scalable way to account for all interactions. Our contributions are:

We propose a predictor that models all $2^d$ interactions of $d$-dimensional data by representing the exponentially large tensor of parameters in a compact multilinear format – Tensor Train (TT-format) (Sec. 3). Factorizing the parameters into the TT-format leads to better generalization, a number of underlying parameters that is linear in $d$, and linear inference time (Sec. 5). The TT-format allows controlling the number of underlying parameters through the TT-rank – a generalization of the matrix rank to tensors.

We show that the linear model (e.g. logistic regression) is a special case of our model with the TT-rank equal to $2$ (Sec. 8.3).

We extend the model to handle interactions between functions of the features, not just between the features themselves (Sec. 7).
2 Linear model
In this section, we describe a generalization of a class of machine learning algorithms – the linear model. Let us fix a training dataset of pairs $\{(x^{(f)}, y^{(f)})\}_{f=1}^{N}$, where $x^{(f)}$ is a $d$-dimensional feature vector of the $f$-th object, and $y^{(f)}$ is the corresponding target variable. Also fix a loss function $\ell(\widehat{y}, y)$, which takes as input the predicted value $\widehat{y}$ and the ground truth value $y$. We call a model linear, if the prediction of the model depends on the features $x$ only via the dot product between the features and the $d$-dimensional vector of parameters $w$:
$$\widehat{y}_{\text{linear}}(x) = \langle x, w \rangle + b, \qquad (1)$$
where $b \in \mathbb{R}$ is the bias parameter.
One of the approaches to learn the parameters $w$ and $b$ of the model is to minimize the following loss
$$\sum_{f=1}^{N} \ell\bigl(\langle x^{(f)}, w \rangle + b,\; y^{(f)}\bigr) + \frac{\lambda}{2} \|w\|_2^2, \qquad (2)$$
where $\lambda$ is the regularization parameter. For the linear model we can choose any regularization term instead of the $L_2$ term above, but later the choice of the regularization term will become important (see Sec. 6.1).
Several machine learning algorithms can be viewed as a special case of the linear model with an appropriate choice of the loss function $\ell$: least squares regression (squared loss), Support Vector Machine (hinge loss), and logistic regression (logistic loss).
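As an illustration, the linear model (1) with the logistic loss and the regularized objective (2) can be sketched in a few lines of numpy; the synthetic dataset, step size, and regularization strength below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Sketch of the linear model y_hat = <x, w> + b trained with logistic loss
# and L2 regularization (Eq. 1-2). Dataset and hyperparameters are
# illustrative assumptions.
rng = np.random.default_rng(0)
N, d, lam = 200, 5, 0.1
X = rng.standard_normal((N, d))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ true_w > 0).astype(float)          # binary targets in {0, 1}

def loss(w, b):
    z = X @ w + b                           # model response, Eq. (1)
    # logistic loss plus (lam / 2) * ||w||^2, as in Eq. (2)
    return np.mean(np.log1p(np.exp(-(2 * y - 1) * z))) + 0.5 * lam * w @ w

w, b = np.zeros(d), 0.0
for _ in range(200):                        # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the response
    w -= 0.5 * (X.T @ (p - y) / N + lam * w)
    b -= 0.5 * np.mean(p - y)
```

The loop converges to a reasonable separator on this toy data; any gradient-based solver for (2) would do.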
3 Our model
Before introducing our model equation in the general case, consider a $3$-dimensional example. The equation includes one term per each subset of features (each interaction)
$$\widehat{y}(x) = \mathcal{W}_{000} + \mathcal{W}_{100}\, x_1 + \mathcal{W}_{010}\, x_2 + \mathcal{W}_{001}\, x_3 + \mathcal{W}_{110}\, x_1 x_2 + \mathcal{W}_{101}\, x_1 x_3 + \mathcal{W}_{011}\, x_2 x_3 + \mathcal{W}_{111}\, x_1 x_2 x_3. \qquad (3)$$
Note that all permutations of features in a term (e.g. $x_1 x_2$ and $x_2 x_1$) correspond to a single term and have exactly one associated weight (e.g. $\mathcal{W}_{110}$).
In the general case, we enumerate the subsets of features with a binary vector $(i_1, \ldots, i_d)$, where $i_k = 1$ if the $k$-th feature belongs to the subset. The model equation looks as follows
$$\widehat{y}(x) = \sum_{i_1=0}^{1} \cdots \sum_{i_d=0}^{1} \mathcal{W}_{i_1 \ldots i_d} \prod_{k=1}^{d} x_k^{i_k}. \qquad (4)$$
Here we assume that $0^0 = 1$. The model is parametrized by a $d$-dimensional tensor $\mathcal{W}$, which consists of $2^d$ elements.
The model equation (4) is linear with respect to the weight tensor $\mathcal{W}$. To emphasize this fact and simplify the notation we rewrite the model equation (4) as a tensor dot product $\widehat{y}(x) = \langle \mathcal{X}, \mathcal{W} \rangle$, where the tensor $\mathcal{X}$ is defined as follows
$$\mathcal{X}_{i_1 \ldots i_d} = \prod_{k=1}^{d} x_k^{i_k}. \qquad (5)$$
Note that there is no need in a separate bias term, since it is already included in the model as the weight tensor element $\mathcal{W}_{0 \ldots 0}$ (see the model equation example (3)).
The key idea of our method is to compactly represent the exponentially large tensor of parameters $\mathcal{W}$ in the Tensor Train format (Oseledets, 2011).
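For a tiny number of features, the model equations (3)-(5) can be checked by brute force. A minimal numpy sketch (with a random, purely illustrative weight tensor) of the object-tensor and the dot-product form:

```python
import itertools
import numpy as np

# Sketch: build the full 2^d object-tensor X (Eq. 5) for a small d and
# check that the dot product <X, W> reproduces the subset expansion of
# Eq. (3)-(4). The weight tensor here is random and purely illustrative.
d = 3
x = np.array([2.0, 3.0, 5.0])
rng = np.random.default_rng(0)
W = rng.standard_normal((2,) * d)                # 2^d parameters

Xt = np.empty((2,) * d)                          # X_{i1..id} = prod x_k^{i_k}
for idx in itertools.product(range(2), repeat=d):
    Xt[idx] = np.prod(x ** np.array(idx))        # uses the 0^0 = 1 convention

y_dot = np.sum(W * Xt)                           # tensor dot product <X, W>

# One term per subset of features: W_idx times the product of the
# selected features (the empty product gives the bias term W_00..0).
y_subsets = sum(W[idx] * np.prod(x[np.array(idx) == 1])
                for idx in itertools.product(range(2), repeat=d))
```

Both expressions coincide, which is exactly the linearity of (4) in $\mathcal{W}$; for realistic $d$ the full tensors are of course never materialized.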
4 Tensor Train
A $d$-dimensional tensor $\mathcal{A}$ is said to be represented in the Tensor Train (TT) format (Oseledets, 2011), if each of its elements can be computed as the following product of matrices and vectors
$$\mathcal{A}_{i_1 \ldots i_d} = G_1[i_1]\, G_2[i_2] \cdots G_d[i_d], \qquad (6)$$
where for any $k \in \{2, \ldots, d-1\}$ and for any value of $i_k$, $G_k[i_k]$ is an $r \times r$ matrix, $G_1[i_1]$ is a $1 \times r$ vector and $G_d[i_d]$ is an $r \times 1$ vector (see Fig. 1). We refer to the collection of matrices $G_k$ corresponding to the same dimension $k$ (technically, a 3-dimensional array) as the $k$-th TT-core, where $k = 1, \ldots, d$. The size $r$ of the slices controls the trade-off between the representational power of the TT-format and computational efficiency of working with the tensor. We call $r$ the TT-rank of the tensor $\mathcal{A}$.
An attractive property of the TT-format is the ability to perform algebraic operations on tensors without materializing them, i.e. by working with the TT-cores instead of the tensors themselves. The TT-format supports computing the norm of a tensor and the dot product between tensors; the element-wise sum and element-wise product of two tensors (the result is a tensor in the TT-format with increased TT-rank); and some other operations (Oseledets, 2011).
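A minimal numpy sketch of Eq. (6), storing each TT-core as a 3-dimensional array; binary modes and a uniform TT-rank $r$ are simplifying assumptions made here for brevity:

```python
import numpy as np

# Sketch of the TT-format (Eq. 6): an element of the tensor is a chain of
# matrix products over the selected core slices. Shapes are illustrative.
rng = np.random.default_rng(0)
d, r = 4, 3
cores = ([rng.standard_normal((1, 2, r))] +
         [rng.standard_normal((r, 2, r)) for _ in range(d - 2)] +
         [rng.standard_normal((r, 2, 1))])

def tt_element(cores, index):
    """A_{i1...id} = G_1[i_1] G_2[i_2] ... G_d[i_d]."""
    result = cores[0][:, index[0], :]            # 1 x r row vector
    for core, i in zip(cores[1:], index[1:]):
        result = result @ core[:, i, :]          # times r x r, then r x 1
    return result.item()

def tt_full(cores):
    """Materialize the full tensor (feasible only for tiny d)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze(axis=(0, -1))
```

`tt_element` touches only $d$ small slices, while `tt_full` is exponential in $d$ and serves only as a correctness check.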
5 Inference
In this section, we return to the model proposed in Sec. 3 and show how to compute the model equation (4) in linear time. To avoid the exponential complexity, we represent the weight tensor $\mathcal{W}$ and the data tensor $\mathcal{X}$ (5) in the TT-format. The TT-ranks of these tensors determine the efficiency of the scheme. During the learning, we initialize and optimize the tensor $\mathcal{W}$ in the TT-format and explicitly control its TT-rank. The TT-rank of the tensor $\mathcal{X}$ always equals 1. Indeed, the following TT-cores give the exact representation of the tensor $\mathcal{X}$:
$$G_k[i_k] = x_k^{i_k}, \quad i_k \in \{0, 1\}, \quad k = 1, \ldots, d.$$
The $k$-th core $G_k[i_k]$ is a $1 \times 1$ matrix for any value of $i_k$, hence the TT-rank of the tensor $\mathcal{X}$ equals 1.
Now that we have TT-representations of the tensors $\mathcal{X}$ and $\mathcal{W}$, we can compute the model response $\widehat{y}(x) = \langle \mathcal{X}, \mathcal{W} \rangle$ in linear time with respect to the number of features $d$.
Theorem 1.
The model response $\widehat{y}(x) = \langle \mathcal{X}, \mathcal{W} \rangle$ can be computed in $O(r^2 d)$ FLOPS, where $r$ is the TT-rank of the weight tensor $\mathcal{W}$.
We refer the reader to Appendix A, where we propose an inference algorithm with $O(r^2 d)$ complexity and thus prove Theorem 1.
The TT-rank of the weight tensor $\mathcal{W}$ is a hyperparameter of our method and it controls the efficiency vs. flexibility trade-off. A small TT-rank regularizes the model and yields fast learning and inference but restricts the set of possible tensors $\mathcal{W}$. A sufficiently large TT-rank allows any value of the tensor $\mathcal{W}$ and effectively leaves us with the full polynomial model without any advantages of the TT-format.
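The linear-time inference of Theorem 1 can be sketched in numpy: since the TT-cores of $\mathcal{X}$ are the $1 \times 1$ matrices $(1, x_k)$, the response collapses into a chain of $d$ small matrix products of the form $G_k[0] + x_k G_k[1]$. Binary modes, a uniform rank $r$, and random cores are illustrative assumptions here.

```python
import numpy as np

# Sketch of the O(r^2 d) inference of Theorem 1:
# y_hat = (G_1[0] + x_1 G_1[1]) (G_2[0] + x_2 G_2[1]) ... (G_d[0] + x_d G_d[1]).
rng = np.random.default_rng(1)
d, r = 6, 4
cores = ([rng.standard_normal((1, 2, r))] +
         [rng.standard_normal((r, 2, r)) for _ in range(d - 2)] +
         [rng.standard_normal((r, 2, 1))])
x = rng.standard_normal(d)

def infer(cores, x):
    v = cores[0][:, 0, :] + x[0] * cores[0][:, 1, :]   # 1 x r row
    for core, xk in zip(cores[1:], x[1:]):
        v = v @ (core[:, 0, :] + xk * core[:, 1, :])   # O(r^2) per feature
    return v.item()

y_hat = infer(cores, x)
```

Each step is a vector-by-matrix product of size $r$, so the whole chain costs $O(r^2 d)$ instead of the $2^d$ terms of the naive sum (4).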
6 Learning
Learning the parameters of the proposed model corresponds to minimizing the loss under the TT-rank constraint:
$$\min_{\mathcal{W}} \; L(\mathcal{W}), \qquad (7)$$
$$\text{subject to} \quad \text{TT-rank}(\mathcal{W}) = r,$$
where the loss $L(\mathcal{W})$ is defined as follows
$$L(\mathcal{W}) = \sum_{f=1}^{N} \ell\bigl(\langle \mathcal{X}^{(f)}, \mathcal{W} \rangle,\; y^{(f)}\bigr) + \frac{\lambda}{2} \|\mathcal{W}\|_F^2. \qquad (8)$$
We consider two approaches to solve problem (7). In a baseline approach, we optimize the objective with stochastic gradient descent applied to the underlying parameters of the TT-format of the tensor $\mathcal{W}$.
An alternative to the baseline is to perform gradient descent with respect to the tensor $\mathcal{W}$, that is, to subtract the gradient $\frac{\partial L}{\partial \mathcal{W}}$ from the current estimate of $\mathcal{W}$ on each iteration. The TT-format indeed allows subtracting tensors, but this operation increases the TT-rank on each iteration, making this approach impractical.
To improve upon the baseline and avoid the TTrank growth, we exploit the geometry of the set of tensors that satisfy the TTrank constraint (7) to build a Riemannian optimization procedure (Sec. 6.1). We experimentally show the advantage of this approach over the baseline in Sec. 8.2.
6.1 Riemannian optimization
The set of all $d$-dimensional tensors with fixed TT-rank $r$
$$\mathcal{M}_r = \bigl\{\mathcal{W} \in \mathbb{R}^{2 \times \cdots \times 2} : \text{TT-rank}(\mathcal{W}) = r\bigr\}$$
forms a Riemannian manifold (Holtz et al., 2012). This observation allows us to use Riemannian optimization to solve problem (7). Riemannian gradient descent consists of the following steps, which are repeated until convergence (see Fig. 2 for an illustration):

Project the gradient $\frac{\partial L}{\partial \mathcal{W}}$ on the tangent space of $\mathcal{M}_r$ taken at the point $\mathcal{W}$. We denote the tangent space as $T_{\mathcal{W}} \mathcal{M}_r$ and the projection as $P_{T_{\mathcal{W}} \mathcal{M}_r}\bigl(\frac{\partial L}{\partial \mathcal{W}}\bigr)$.

Follow along the projected gradient with some step $\alpha$ (this operation increases the TT-rank).

Retract the new point back to the manifold $\mathcal{M}_r$, that is, decrease its TT-rank to $r$.
We now describe how to implement each of the steps outlined above.
Lubich et al. (2015) proposed an algorithm to project a TT-tensor $\mathcal{Z}$ on the tangent space of $\mathcal{M}_r$ at a point $\mathcal{W}$, which consists of two steps: preprocess the point $\mathcal{W}$ and then project the tensor $\mathcal{Z}$. Lubich et al. (2015) also showed that the TT-rank of the projection is bounded by a constant that is independent of the TT-rank of the tensor $\mathcal{Z}$:
$$\text{TT-rank}\bigl(P_{T_{\mathcal{W}} \mathcal{M}_r}(\mathcal{Z})\bigr) \le 2\, \text{TT-rank}(\mathcal{W}) = 2r.$$
Let us consider the gradient of the loss function (8)
$$\frac{\partial L}{\partial \mathcal{W}} = \sum_{f=1}^{N} \frac{\partial \ell}{\partial \widehat{y}}\, \mathcal{X}^{(f)} + \lambda \mathcal{W}. \qquad (9)$$
Using the fact that $P_{T_{\mathcal{W}} \mathcal{M}_r}(\mathcal{W}) = \mathcal{W}$ and that the projection is a linear operator, we get
$$P_{T_{\mathcal{W}} \mathcal{M}_r}\Bigl(\frac{\partial L}{\partial \mathcal{W}}\Bigr) = \sum_{f=1}^{N} \frac{\partial \ell}{\partial \widehat{y}}\, P_{T_{\mathcal{W}} \mathcal{M}_r}\bigl(\mathcal{X}^{(f)}\bigr) + \lambda \mathcal{W}. \qquad (10)$$
Since the resulting expression is a weighted sum of projections of individual data tensors $\mathcal{X}^{(f)}$, we can project them in parallel. Since the TT-rank of each of them equals 1 (see Sec. 5), all $N$ projections can be computed in time linear in both $d$ and $N$. The TT-rank of the projected gradient is less than or equal to $2r$ regardless of the dataset size $N$.
Note that here we used the particular choice of the regularization term. For regularization terms other than the Frobenius norm $\|\mathcal{W}\|_F^2$ (e.g. the $L_1$ norm), the gradient may have arbitrarily large TT-rank.
As a retraction – a way to return back to the manifold – we use the TT-rounding procedure (Oseledets, 2011). For a given tensor $\mathcal{Z}$ and rank $r$ the TT-rounding procedure returns a tensor $\widehat{\mathcal{Z}} = \text{TT-round}(\mathcal{Z}, r)$ such that its TT-rank equals $r$ and the Frobenius norm of the residual $\|\mathcal{Z} - \widehat{\mathcal{Z}}\|_F$ is as small as possible. The computational complexity of the TT-rounding procedure is cubic in the TT-rank and linear in $d$.
Since we aim for big datasets, we use a stochastic version of the Riemannian gradient descent: on each iteration we sample a random minibatch of objects from the dataset, compute the stochastic gradient for this minibatch, make a step along the projection of the stochastic gradient, and retract back to the manifold (Alg. 1).
An iteration of the stochastic Riemannian gradient descent consists of the inference, the projection, and the retraction, each of which has a complexity linear in the number of features $d$, which yields a total per-iteration cost that is linear in $d$.
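To illustrate the project / step / retract loop without a TT library, the same scheme can be sketched on the simpler Riemannian manifold of fixed-rank matrices, with a truncated SVD playing the role of the rounding-based retraction. The target matrix and the quadratic loss below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Riemannian gradient descent on the manifold of rank-r matrices,
# minimizing L(W) = 0.5 * ||W - A||_F^2; an illustrative analogue of
# the TT-based procedure, not the paper's implementation.
rng = np.random.default_rng(0)
n, r = 8, 2
A = rng.standard_normal((n, n))                   # target matrix

def project_tangent(W, Z, r):
    """Project Z onto the tangent space of rank-r matrices at W."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    PU = U[:, :r] @ U[:, :r].T                    # projector on the column space
    PV = Vt[:r, :].T @ Vt[:r, :]                  # projector on the row space
    return PU @ Z + Z @ PV - PU @ Z @ PV

def retract(W, r):
    """Truncated SVD: the analogue of TT-rounding."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

W0 = retract(rng.standard_normal((n, n)), r)      # random rank-r start
W = W0
for _ in range(100):                              # project -> step -> retract
    grad = W - A                                  # Euclidean gradient of L
    W = retract(W - 0.3 * project_tangent(W, grad, r), r)
```

The iterate stays on the rank-$r$ manifold after every step, exactly as the TT-rank of $\mathcal{W}$ stays fixed in the procedure above.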
6.2 Initialization
We found that a random initialization for the TT-tensor $\mathcal{W}$ sometimes freezes the convergence of the optimization method (see Sec. 8.3). We propose to initialize the optimization from the solution of the corresponding linear model (1).
The following theorem shows how to initialize the weight tensor from a linear model.
Theorem 2.
For any $d$-dimensional vector $w$ and a bias term $b$ there exists a tensor $\mathcal{W}$ of TT-rank $2$, such that for any $d$-dimensional vector $x$ and the corresponding object-tensor $\mathcal{X}$ the values $\langle x, w \rangle + b$ and $\langle \mathcal{X}, \mathcal{W} \rangle$ coincide.
The proof is provided in Appendix B.
7 Extending the model
In this section, we extend the proposed model to handle polynomials of arbitrary functions of the features. As an example, consider the logarithms of the features in the $2$-dimensional case:
$$\widehat{y}(x) = \mathcal{W}_{00} + \mathcal{W}_{10} \log x_1 + \mathcal{W}_{01} \log x_2 + \mathcal{W}_{11} \log x_1 \log x_2.$$
In the general case, to model interactions between $n$ functions $g_1, \ldots, g_n$ of the features, we redefine the object-tensor so that the $k$-th index selects which of the functions is applied to the $k$-th feature. The weight tensor $\mathcal{W}$ and the object-tensor $\mathcal{X}$ then consist of $n^d$ elements each. After this change to the object-tensor $\mathcal{X}$, the learning and inference algorithms stay unchanged compared to the original model (4).
Categorical features.
Our basic model handles categorical features by converting them into one-hot vectors. The downside of this approach is that it wastes the model capacity on modeling nonexistent interactions between the one-hot vector elements which correspond to the same categorical feature. Instead, we propose to use one TT-core per categorical feature and use the model extension technique with an indicator function that selects the observed category of the feature.
This allows us to cut the number of parameters per categorical feature without losing any representational power.
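For interior TT-cores of a uniform rank $r$, the saving can be sketched by a parameter count; the counting convention below (core entries only, ignoring the smaller first and last cores) is an illustrative assumption. A $K$-valued categorical feature encoded as $K$ one-hot binary features needs $K$ cores with 2 slices each, while a single categorical core needs $K$ slices.

```python
# Illustrative parameter count per categorical feature with K values,
# for interior TT-cores of uniform rank r (a simplifying assumption).
def onehot_params(K, r):
    # K binary features -> K cores, each with 2 slices of size r x r
    return K * 2 * r * r

def categorical_params(K, r):
    # a single core with K slices of size r x r
    return K * r * r
```

Under this convention the categorical representation halves the parameter count, and it also removes the spurious cross-terms between the one-hot elements.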
8 Experiments
We release a Python implementation of the proposed algorithm and the code to reproduce the experiments: https://github.com/Bihaqo/exp-machines. For the operations related to the TT-format, we used the TT-Toolbox: https://github.com/oseledets/ttpy.
8.1 Datasets
The datasets used in the experiments (see details in Appendix C)

UCI (Lichman, 2013) Car dataset is a classification problem with binary features (after one-hot encoding). We randomly split the data into training and test sets and binarized the labels for simplicity.

UCI HIV dataset is a binary classification problem, which we randomly split into training and test sets.

Synthetic data. We generated train and test sets of objects and set the ground truth target variable to a high-degree polynomial of the features.
8.2 Riemannian optimization
In this experiment, we compared two approaches to training the model: Riemannian optimization (Sec. 6.1) vs. the baseline (Sec. 6). In this and later experiments we tuned the learning rate of both the Riemannian and SGD optimizers with respect to the training loss after 100 iterations by grid search over a logarithmic grid.
On the Car and HIV datasets we turned off the regularization ($\lambda = 0$) and used a fixed TT-rank. We report that on the Car dataset Riemannian optimization converges faster and reaches a better final point than the baseline, both in terms of the training and test losses (Fig. 3a, 6a). On the HIV dataset Riemannian optimization converges faster than the baseline (see Fig. 3b), but the model overfits to the data (Fig. 6b).
The results on the synthetic dataset with high-order interactions confirm the superiority of the Riemannian approach over SGD – we failed to train the model at all with SGD (Fig. 7).


On the MovieLens 100K dataset, we have only used SGD-type algorithms, because using the one-hot feature encoding is much slower than using the categorical version (see Sec. 7), and we have yet to implement support for categorical features in the Riemannian optimizer. On the bright side, prototyping the categorical version of ExM in TensorFlow allowed us to use a GPU accelerator.
8.3 Initialization
In this experiment, we compared random initialization with the initialization from the solution of the corresponding linear problem (Sec. 6.2). We explored two ways to randomly initialize a TT-tensor: 1) filling its TT-cores with independent Gaussian noise; 2) initializing it to represent a linear model with random coefficients (sampled from a standard Gaussian). We report that on the Car dataset type-1 random initialization slowed the convergence compared to initialization from the linear model solution (Fig. 3a), while on the HIV dataset the convergence was completely frozen (Fig. 3b).
Two possible reasons for this effect are: a) the vanishing and exploding gradients problem (Bengio et al., 1994) that arises when dealing with a product of a large number of factors, as in the case of the HIV dataset; b) by initializing the model in such a way that high-order terms dominate, we may force the gradient-based optimization to focus on high-order terms, while it may be more stable to start with low-order terms instead. Type-2 initialization (a random linear model) indeed worked on par with the best linear initialization on the Car, HIV, and synthetic datasets (Fig. 3b, 7).
8.4 Comparison to other approaches
On the synthetic dataset with high-order interactions we compared Exponential Machines (the proposed method) with the scikit-learn implementation (Pedregosa et al., 2011) of logistic regression, random forest, and kernel SVM; the FastFM implementation (Bayer, 2015) of 2nd-order Factorization Machines; our implementation of high-order Factorization Machines (https://github.com/geffy/tffm); and a feed-forward neural network implemented in TensorFlow (Abadi et al., 2015). We used 6th-order FM with the Adam optimizer (Kingma & Ba, 2014), for which we chose the best rank and learning rate based on the training loss after the first iterations. We tried several feed-forward neural networks with ReLU activations and several fully-connected layers. We compared the models based on the Area Under the Curve (AUC) metric since it is applicable to all methods and is robust to unbalanced labels (Tbl. 5).
Table: test AUC, training time (s), and inference time (s) on the synthetic dataset for logistic regression, random forest, neural network, SVM (RBF, poly. 2, poly. 6), 2nd-order FM, three 6th-order FM configurations, and ExM with ranks 3, 8, and 16; the best test AUC of 0.96 is reached both by a 6th-order FM configuration and by ExM rank 16.
Test AUC for the MovieLens 100K dataset.
On the MovieLens 100K dataset we used the categorical feature representation described in Sec. 7 and trained our model on a Tesla K40 GPU. We compared the test AUC of our model against our implementation of 3rd-order FM, against logistic regression, and against the 3rd-order FM results reported by Blondel et al. (2016a) on the same data.
8.5 TTrank
The TT-rank is one of the main hyperparameters of the proposed model. Two possible strategies can be used to choose it: grid search or DMRG-like algorithms (see Sec. 9). In our experiments we opted for the former and observed that the model is fairly robust to the choice of the TT-rank (see Fig. 5), but too small a TT-rank can hurt the accuracy (see Tbl. 5).
9 Related work
Kernel SVM is a flexible nonlinear predictor and, in particular, it can model interactions when used with the polynomial kernel (Boser et al., 1992). As a downside, it scales at least quadratically with the dataset size (Bordes et al., 2005) and overfits on highly sparse data.
With this in mind, Rendle (2010) developed Factorization Machine (FM), a general predictor that models pairwise interactions. To overcome the problems of polynomial SVM, FM restricts the rank of the weight matrix, which leads to a linear number of parameters and generalizes better on sparse data. FM running time is linear with respect to the number of nonzero elements in the data, which allows scaling to billions of training entries on sparse problems.
For high-order interactions FM uses the CP-format (Caroll & Chang, 1970; Harshman, 1970) to represent the tensor of parameters. The choice of the tensor factorization is the main difference between high-order FM and Exponential Machines. The TT-format comes with two advantages over the CP-format: first, the TT-format allows for Riemannian optimization; second, the problem of finding the best TT-rank approximation to a given tensor always has a solution and can be solved in polynomial time. We found Riemannian optimization superior to the SGD baseline (Sec. 6) that was used in several other models parametrized by a tensor factorization (Rendle, 2010; Lebedev et al., 2014; Novikov et al., 2015). Note that the CP-format also allows for Riemannian optimization, but only for 2-order tensors (and therefore 2nd-order FM).
A number of works used fullbatch or stochastic Riemannian optimization for data processing tasks (Meyer et al., 2011; Tan et al., 2014; Xu & Ke, 2016; Zhang et al., 2016). The last work (Zhang et al., 2016) is especially interesting in the context of our method, since it improves the convergence rate of stochastic Riemannian gradient descent and is directly applicable to our learning procedure.
In a concurrent work, Stoudenmire & Schwab (2016) proposed a model that is similar to ours but relies on the trigonometric basis in contrast to the polynomials used in Exponential Machines (see Sec. 7 for an explanation of how to change the basis). They also proposed a different learning procedure inspired by the DMRG algorithm (Schollwöck, 2011), which allows automatically choosing the ranks of the model, but is hard to adapt to the stochastic regime. One possible way to combine the strengths of the DMRG and Riemannian approaches is to do a full DMRG sweep once in a few epochs of the stochastic Riemannian gradient descent to adjust the ranks.
Other relevant works include the model that approximates the decision function with a multidimensional Fourier series whose coefficients lie in the TT-format (Wahls et al., 2014); and models that are similar to FM but include squares and other powers of the features: Tensor Machines (Yang & Gittens, 2015) and Polynomial Networks (Livni et al., 2014). Tensor Machines also enjoy a theoretical generalization bound. In another relevant work, Blondel et al. (2016b) boosted the efficiency of FM and Polynomial Networks by casting their training as a low-rank tensor estimation problem, thus making it multi-convex and allowing for efficient use of Alternating Least Squares types of algorithms. Note that Exponential Machines are inherently multi-convex.
10 Discussion
We presented a predictor that models all interactions of every order. To regularize the model and to make the learning and inference feasible, we represented the exponentially large tensor of parameters in the Tensor Train format. To train the model, we used Riemannian optimization in the stochastic regime and report that it outperforms a popular baseline based on stochastic gradient descent. However, the Riemannian learning algorithm does not support sparse data, so for datasets with hundreds of thousands of features we are forced to fall back on the baseline learning method. We found that the training process is sensitive to initialization and proposed an initialization strategy based on the solution of the corresponding linear problem. The solutions developed in this paper for stochastic Riemannian optimization may suit other machine learning models parametrized by tensors in the TT-format.
Acknowledgments
The study has been funded by the Russian Academic Excellence Project ‘5100’.
References
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Bayer (2015) I. Bayer. fastFM: a library for factorization machines. arXiv preprint arXiv:1505.00641, 2015.
 Bengio et al. (1994) Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 Blondel et al. (2016a) M. Blondel, A. Fujino, N. Ueda, and M. Ishihata. Higher-order factorization machines. In Advances in Neural Information Processing Systems 29 (NIPS), 2016a.
 Blondel et al. (2016b) M. Blondel, M. Ishihata, A. Fujino, and N. Ueda. Polynomial networks and factorization machines: New insights and efficient training algorithms. In Advances in Neural Information Processing Systems 29 (NIPS). 2016b.
 Bordes et al. (2005) A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. The Journal of Machine Learning Research, 6:1579–1619, 2005.
 Boser et al. (1992) B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 144–152, 1992.
 Caroll & Chang (1970) J. D. Caroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via n-way generalization of Eckart-Young decomposition. Psychometrika, 35:283–319, 1970.
 Harper & Konstan (2015) F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 2015.
 Harshman (1970) R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
 Holtz et al. (2012) S. Holtz, T. Rohwedder, and R. Schneider. On manifolds of tensors of fixed TTrank. Numerische Mathematik, pp. 701–731, 2012.
 Kingma & Ba (2014) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Lebedev et al. (2014) V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations (ICLR), 2014.
 Lichman (2013) M. Lichman. UCI machine learning repository, 2013.
 Livni et al. (2014) R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems 27 (NIPS), 2014.
 Lubich et al. (2015) C. Lubich, I. V. Oseledets, and B. Vandereycken. Time integration of tensor trains. SIAM Journal on Numerical Analysis, pp. 917–941, 2015.
 Meyer et al. (2011) G. Meyer, S. Bonnabel, and R. Sepulchre. Regression on fixedrank positive semidefinite matrices: a Riemannian approach. The Journal of Machine Learning Research, pp. 593–625, 2011.
 Novikov et al. (2015) A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS). 2015.
 Oseledets (2011) I. V. Oseledets. TensorTrain decomposition. SIAM J. Scientific Computing, 33(5):2295–2317, 2011.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Rendle (2010) S. Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000, 2010.
 Schollwöck (2011) U. Schollwöck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96–192, 2011.
 Stoudenmire & Schwab (2016) E. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems 29 (NIPS). 2016.
 Tan et al. (2014) M. Tan, I. W. Tsang, L. Wang, B. Vandereycken, and S. J. Pan. Riemannian pursuit for big matrix recovery. In International Conference on Machine Learning (ICML), 2014.
 Wahls et al. (2014) S. Wahls, V. Koivunen, H. V. Poor, and M. Verhaegen. Learning multidimensional fourier series with tensor trains. In Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, pp. 394–398. IEEE, 2014.
 Xu & Ke (2016) Z. Xu and Y. Ke. Stochastic variance reduced Riemannian eigensolver. arXiv preprint arXiv:1605.08233, 2016.
 Yang & Gittens (2015) J. Yang and A. Gittens. Tensor machines for learning target-specific polynomial features. arXiv preprint arXiv:1504.01697, 2015.
 Zhang et al. (2016) H. Zhang, S. J. Reddi, and S. Sra. Fast stochastic optimization on Riemannian manifolds. arXiv preprint arXiv:1605.07147, 2016.
Appendix A Proof of Theorem 1
Theorem 1 states that the inference complexity of the proposed algorithm is $O(r^2 d)$, where $r$ is the TT-rank of the weight tensor $\mathcal{W}$. In this section, we propose an algorithm that achieves the stated complexity and thus prove the theorem.
Proof.
Let us rewrite the definition of the model response (4) assuming that the weight tensor $\mathcal{W}$ is represented in the TT-format (6)
$$\widehat{y}(x) = \sum_{i_1=0}^{1} \cdots \sum_{i_d=0}^{1} \Bigl(\prod_{k=1}^{d} x_k^{i_k}\Bigr)\, G_1[i_1] \cdots G_d[i_d].$$
Let us group the factors that depend on the variable $i_k$, $k = 1, \ldots, d$:
$$\widehat{y}(x) = \Bigl(\sum_{i_1=0}^{1} x_1^{i_1} G_1[i_1]\Bigr) \cdots \Bigl(\sum_{i_d=0}^{1} x_d^{i_d} G_d[i_d]\Bigr) = A_1 A_2 \cdots A_d,$$
where the matrices $A_k$ for $k = 1, \ldots, d$ are defined as follows
$$A_k = G_k[0] + x_k\, G_k[1].$$
The final value $\widehat{y}(x)$ can be computed from the matrices $A_k$ via $d - 1$ matrix-by-vector multiplications and one vector-by-vector multiplication, which yields the $O(r^2 d)$ complexity.
Note that the proof is constructive and corresponds to the implementation of the inference algorithm. ∎


Appendix B Proof of Theorem 2
Theorem 2 states that it is possible to initialize the weight tensor $\mathcal{W}$ of the proposed model from the weights $w$, $b$ of the linear model.
Theorem.
For any $d$-dimensional vector $w$ and a bias term $b$ there exists a tensor $\mathcal{W}$ of TT-rank $2$, such that for any $d$-dimensional vector $x$ and the corresponding object-tensor $\mathcal{X}$ the values $\langle x, w \rangle + b$ and $\langle \mathcal{X}, \mathcal{W} \rangle$ coincide.
To prove the theorem, in the rest of this section we show that the tensor $\mathcal{W}$ from Theorem 2 is representable in the TT-format with the following TT-cores
$$G_1[0] = \begin{pmatrix} 1 & 0 \end{pmatrix}, \quad G_1[1] = \begin{pmatrix} 0 & w_1 \end{pmatrix},$$
$$G_k[0] = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad G_k[1] = \begin{pmatrix} 0 & w_k \\ 0 & 0 \end{pmatrix}, \quad k = 2, \ldots, d-1, \qquad (11)$$
$$G_d[0] = \begin{pmatrix} b \\ 1 \end{pmatrix}, \quad G_d[1] = \begin{pmatrix} w_d \\ 0 \end{pmatrix},$$
and thus the TT-rank of the tensor $\mathcal{W}$ equals $2$.
We start the proof with the following lemma:
Lemma 1.
For the TT-cores (11) and any $p \in \{1, \ldots, d-1\}$ the following invariance holds:
$$G_1[i_1] \cdots G_p[i_p] = \begin{cases} \begin{pmatrix} 1 & 0 \end{pmatrix}, & \text{if } i_1 = \ldots = i_p = 0, \\ \begin{pmatrix} 0 & w_k \end{pmatrix}, & \text{if } i_k = 1 \text{ for exactly one } k \le p, \\ \begin{pmatrix} 0 & 0 \end{pmatrix}, & \text{if } i_k = 1 \text{ for two or more } k \le p. \end{cases}$$
Proof.
We prove the lemma by induction. Indeed, for $p = 1$ the statement of the lemma becomes
$$G_1[0] = \begin{pmatrix} 1 & 0 \end{pmatrix}, \quad G_1[1] = \begin{pmatrix} 0 & w_1 \end{pmatrix},$$
which holds by the definition (11) of the first TT-core $G_1$.
Now suppose that the statement of Lemma 1 is true for some $p - 1$. If $i_p = 0$, then $G_p[0]$ is an identity matrix and $G_1[i_1] \cdots G_p[0] = G_1[i_1] \cdots G_{p-1}[i_{p-1}]$. Also, the number of indices that equal 1 stays the same, so the statement of the lemma stays the same.
If $i_p = 1$, then there are 3 options:

If $i_1 = \ldots = i_{p-1} = 0$, then by the induction hypothesis
$$G_1[i_1] \cdots G_{p-1}[i_{p-1}]\, G_p[1] = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & w_p \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & w_p \end{pmatrix}.$$

If $i_k = 1$ for exactly one $k \le p - 1$, then
$$G_1[i_1] \cdots G_{p-1}[i_{p-1}]\, G_p[1] = \begin{pmatrix} 0 & w_k \end{pmatrix} \begin{pmatrix} 0 & w_p \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix}.$$

If $i_k = 1$ for two or more $k \le p - 1$, then
$$G_1[i_1] \cdots G_{p-1}[i_{p-1}]\, G_p[1] = \begin{pmatrix} 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & w_p \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix}.$$
This is exactly the statement of Lemma 1. ∎


Proof of Theorem 2.
The product of all TT-cores can be represented as the product of the first $d - 1$ cores times the last core, and by using Lemma 1 we get
$$\mathcal{W}_{i_1 \ldots i_d} = \bigl(G_1[i_1] \cdots G_{d-1}[i_{d-1}]\bigr)\, G_d[i_d] = \begin{cases} b, & \text{if } i_1 = \ldots = i_d = 0, \\ w_k, & \text{if } i_k = 1 \text{ for exactly one } k, \\ 0, & \text{otherwise.} \end{cases}$$
The elements of the obtained tensor $\mathcal{W}$ that correspond to interactions of order $2$ and higher equal zero; the weight that corresponds to $x_k$ equals $w_k$; and the bias term equals $\mathcal{W}_{0 \ldots 0} = b$.
The TT-rank of the obtained tensor $\mathcal{W}$ equals $2$ since its TT-cores are of size $2 \times 2$. ∎
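The construction (11) can also be verified numerically. The following sketch builds the TT-rank-2 cores for random $w$ and $b$ and checks that the resulting tensor dot product $\langle \mathcal{X}, \mathcal{W} \rangle$ reproduces the linear model $\langle x, w \rangle + b$; the shapes and the data are illustrative.

```python
import itertools
import numpy as np

# Numerical check of Theorem 2: TT-rank-2 cores matching the construction
# (11) reproduce the linear model <x, w> + b via the dot product <X, W>.
rng = np.random.default_rng(0)
d = 4
w = rng.standard_normal(d)
b = 0.7

def linear_model_cores(w, b):
    d = len(w)
    first = np.zeros((1, 2, 2))
    first[0, 0] = [1.0, 0.0]                  # G_1[0] = (1, 0)
    first[0, 1] = [0.0, w[0]]                 # G_1[1] = (0, w_1)
    middle = []
    for k in range(1, d - 1):
        G = np.zeros((2, 2, 2))
        G[:, 0, :] = np.eye(2)                # G_k[0] = identity
        G[:, 1, :] = [[0.0, w[k]], [0.0, 0.0]]
        middle.append(G)
    last = np.zeros((2, 2, 1))
    last[:, 0, 0] = [b, 1.0]                  # G_d[0] = (b, 1)^T
    last[:, 1, 0] = [w[d - 1], 0.0]           # G_d[1] = (w_d, 0)^T
    return [first] + middle + [last]

cores = linear_model_cores(w, b)
x = rng.standard_normal(d)

# <X, W> = sum over all binary indices of W_idx * prod_k x_k^{i_k}
y_tt = 0.0
for idx in itertools.product(range(2), repeat=d):
    slice_prod = cores[0][:, idx[0], :]
    for G, i in zip(cores[1:], idx[1:]):
        slice_prod = slice_prod @ G[:, i, :]
    y_tt += slice_prod.item() * np.prod(x ** np.array(idx))
```

The brute-force sum over all $2^d$ indices agrees with $\langle x, w \rangle + b$, confirming the proof above for this instance.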
Appendix C Detailed description of the datasets
We used the following datasets for the experimental evaluation

UCI (Lichman, 2013) Car dataset is a classification problem with binary features (after one-hot encoding). We randomly split the data into training and test sets. For simplicity, we binarized the labels: we picked the first class (‘unacc’) and made a one-versus-rest binary classification problem from the original Car dataset.

UCI (Lichman, 2013) HIV dataset is a binary classification problem, which we randomly split into training and test sets.

Synthetic data. We generated train and test sets of objects. Each entry of the data matrix was independently sampled with equal probabilities from a fixed finite set. We also uniformly sampled subsets of features (interactions). We set the ground truth target variable to a deterministic function of the input and sampled the weights of the interactions from the uniform distribution.

MovieLens 100K. MovieLens 100K is a recommender system dataset with user ratings of movies (Harper & Konstan, 2015). We followed Blondel et al. (2016a) in preparing the features and in turning the problem into binary classification. For users, we converted age (rounded to decades), living area (the first digit of the zip code), gender, and occupation into a binary indicator vector using one-hot encoding. For movies, we used the release year (rounded to decades) and genres, also one-hot encoded. This process yielded additional one-hot features for each user-movie pair. Original ratings were binarized using a threshold. This results in a set of positive samples, half of which were used for training (with an equal amount of sampled negative examples) and the rest were used for testing.