# Multi-output Polynomial Networks and Factorization Machines

## Abstract

Factorization machines and polynomial networks are supervised polynomial models based on an efficient low-rank decomposition. We extend these models to the multi-output setting, i.e., for learning vector-valued functions, with application to multi-class or multi-task problems. We cast this as the problem of learning a 3-way tensor whose slices share a common basis and propose a convex formulation of that problem. We then develop an efficient conditional gradient algorithm and prove its global convergence, despite the fact that it involves a non-convex basis selection step. On classification tasks, we show that our algorithm achieves excellent accuracy with much sparser models than existing methods. On recommendation system tasks, we show how to combine our algorithm with a reduction from ordinal regression to multi-output classification and show that the resulting algorithm outperforms simple baselines in terms of ranking accuracy.

## 1Introduction

Interactions between features play an important role in many classification and regression tasks. Classically, such interactions have been leveraged either explicitly, by mapping features to their products (as in polynomial regression), or implicitly, through the use of the kernel trick. While fast linear model solvers have been engineered for the explicit approach [9], they are typically limited to small numbers of features or low-order feature interactions, due to the fact that the number of parameters that they need to learn scales as , where is the number of features and is the order of interactions considered. Models kernelized with the polynomial kernel do not suffer from this problem; however, the cost of storing and evaluating these models grows linearly with the number of training instances, a problem sometimes referred to as the curse of kernelization [30].

Factorization machines (FMs) [25] are a more recent approach that can use pairwise feature interactions efficiently even in very high-dimensional data. The key idea of FMs is to model the weights of feature interactions using a **low-rank** matrix. Not only this idea offers clear benefits in terms of model compression compared to the aforementioned approaches, it has also proved instrumental in modeling interactions between **categorical** variables, converted to binary features via a one-hot encoding. Such binary features are usually so sparse that many interactions are never observed in the training set, preventing classical approaches from capturing their relative importance. By imposing a low rank on the feature interaction weight matrix, FMs encourage shared parameters between interactions, allowing to estimate their weights even if they never occurred in the training set. This property has been used in recommender systems to model interactions between user variables and item variables, and is the basis of several industrial successes of FMs [32].

Originally motivated as neural networks with a polynomial activation (instead of the classical sigmoidal or rectifier activations), polynomial networks (PNs) [20] have been shown to be intimately related to FMs and to only subtly differ in the non-linearity they use [5]. PNs achieve better performance than rectifier networks on pedestrian detection [20] and on dependency parsing [10], and outperform kernel approximations such as the Nyström method [5]. However, existing PN and FM works have been limited to single-output models, i.e., they are designed to learn scalar-valued functions, which restricts them to regression or binary classification problems.

Our contributions.

In this paper, we generalize FMs and PNs to multi-output models, i.e., for learning vector-valued functions, with application to multi-class or multi-task problems.

1) We cast learning multi-output FMs and PNs as learning a 3-way tensor, whose slices **share a common basis** (each slice corresponds to one output). To obtain a **convex formulation** of that problem, we propose to cast it as learning an infinite-dimensional but **row-wise sparse** matrix. This can be achieved by using group-sparsity inducing penalties. (§Section 3)

2) To solve the obtained optimization problem, we develop a variant of the **conditional gradient** (a.k.a. Frank-Wolfe) algorithm [11], which repeats the following two steps: i) select a new basis vector to add to the model and ii) refit the model over the current basis vectors. (§Section 4) We prove the **global convergence** of this algorithm (Theorem ?), despite the fact that the basis selection step is non-convex and more challenging in the shared basis setting. (§Section 5)

3) On multi-class classification tasks, we show that our algorithm achieves comparable accuracy to kernel SVMs but with much more compressed models than the Nyström method. On recommender system tasks, where kernelized models cannot be used (since they do not generalize to unseen user-item pairs), we demonstrate how our algorithm can be combined with a reduction from ordinal regression to multi-output classification and show that the resulting algorithm **outperforms single-output PNs and FMs** both in terms of root mean squared error (RMSE) and **ranking accuracy**, as measured by nDCG (normalized discounted cumulative gain) scores. (§Section 6)

## 2Background and related work

Notation.

We denote the set by . Given a vector , we denote its elements by . Given a matrix , we denote its rows by and its columns by . We denote the norm of by and its norm by . The number of non-zero rows of is denoted by .

**Factorization machines (FMs).** Given an input vector , FMs predict a scalar output by

where contains feature weights and is a low-rank matrix that contains pairwise feature interaction weights. To obtain a low-rank , [25] originally proposed to use a change of variable , where (with a rank parameter) and to learn instead. Noting that this quadratic model results in a non-convex problem in , [4] proposed to convexify the problem by learning directly but to encourage low rank using a nuclear norm on . For learning, [4] proposed a conditional gradient like approach with global convergence guarantees.

**Polynomial networks (PNs).** PNs are a recently-proposed form of neural network where the usual activation function is replaced with a squared activation. Formally, PNs predict a scalar output by

where (evaluated element-wise) is the squared activation, is the output layer vector, is the hidden layer matrix and is the number of hidden units. Because the r.h.s term can be rewritten as if we set , we see that PNs are clearly a slight variation of FMs and that learning can be recast as learning a low-rank matrix . Based on this observation, [20] proposed to use GECO [26], a greedy algorithm for convex optimization with a low-rank constraint, similar to the conditional gradient algorithm. [13] proposed a learning algorithm for PNs with global optimality guarantees but their theory imposes non-negativity on the network parameters and they need one distinct hyper-parameter per hidden unit to avoid trivial models. Other low-rank polynomial models were recently introduced in [29] but using a tensor network (a.k.a. tensor train) instead of the canonical polyadic (CP) decomposition.

## 3A convex formulation of multi-output PNs and FMs

In this section, we generalize PNs and FMs to multi-output problems. For the sake of concreteness, we focus on PNs for multi-class classification. The extension to FMs is straightforward and simply requires to replace by , as noted in [5].

The predictions of multi-class PNs can be naturally defined as , where is the number of classes, and is low-rank. Following [5], we can model the linear term directly in the quadratic term if we augment all data points with an extra feature of value 1, i.e., . We will therefore simply assume henceforth. Our main proposal in this paper is to decompose using a **shared** basis:

where, in neural network terminology, can be interpreted as a hidden layer matrix and as an output layer matrix. Compared to the naive approach of decomposing each as , this reduces the number of parameters from to .

While a nuclear norm could be used to promote a low rank on each , similarly as in [4], this is clearly not sufficient to impose a shared basis. A naive approach would be to use non-orthogonal joint diagonalization as a post-processing. However, because this is a non-convex problem for which no globally convergent algorithm is known [24], this would result in a loss of accuracy. Our key idea is to cast the problem of learning a multi-output PN as that of learning an **infinite but row-wise sparse matrix**. Without loss of generality, we assume that basis vectors (hidden units) lie in the unit ball. We therefore denote the set of basis vectors by Let us denote this infinite matrix by (we use a discrete notation for simplicity). We can then write

denotes the weights of basis across all classes (outputs). In this formulation, we have and sharing a common basis (hidden units) amounts to encouraging the rows of , , to be either dense or entirely sparse. This can be naturally achieved using group-sparsity inducing penalties. Intuitively, in can be thought as restricted to its row support. Define the training set by and . We then propose to solve the convex problem

where is a smooth and convex multi-class loss function (cf. Appendix Appendix A for three common examples), is a sparsity-inducing penalty and is a hyper-parameter. In this paper, we focus on the (lasso), (group lasso) and penalties for , cf. Table ?. However, as we shall see, solving is more challenging with the and penalties than with the penalty. Although our formulation is based on an infinite view, we next show that has finite row support.

Proof is in Appendix B.1. It is open whether we can tighten this result when or .

## 4A conditional gradient algorithm with approximate basis vector selection

At first glance, learning with an infinite number of basis vectors seems impossible. In this section, we show how the well-known conditional gradient algorithm [11] combined with group-sparsity inducing penalties naturally leads to a greedy algorithm that **selects and adds basis vectors that are useful across all outputs**. On every iteration, the conditional gradient algorithm performs updates of the form , where is a step size and is obtained by solving a linear approximation of the objective around the current iterate :

Let us denote the negative gradient by for short. Its elements are defined by

where is the gradient of w.r.t. (cf. Appendix A). For ReLu activations, solving is known to be NP-hard [1]. Here, we focus on quadratic activations, for which we will be able to provide approximation guarantees. Plugging the expression of , we get

and is a diagonal matrix such that . Let us recall the definition of the dual norm of : . By comparing this equation to , we see that is the argument that achieves the maximum in the dual norm , up to a constant factor . It is easy to verify that any element in the subdifferential of , which we denote by , achieves that maximum, i.e., .

Basis selection.

As shown in Table ?, elements of (subgradients) are matrices with a **single non-zero row** indexed by , where is an optimal basis (hidden unit) selected by

and where when , when and when . We call a basis vector selection criterion. Although this selection criterion was derived from the linearization of the objective, it is fairly natural: it chooses the basis vector with largest “violation”, as measured by the norm of the negative gradient row .

**Multiplicative approximations.** The key challenge in solving or equivalently arises from the fact that has infinitely many rows . We therefore cast basis vector selection as a continuous optimization problem w.r.t. . Surprisingly, although the entire objective is convex, is not. Instead of the exact maximum, we will therefore only require to find a that satisfies

where is a multiplicative approximation (higher is better). It is easy to verify that this is equivalent to replacing the optimal by an approximate that satisfies .

**Sparse case.** When , we need to solve

It is well known that the optimal solution of is the dominant eigenvector of . Therefore, we simply need to find the dominant eigenvector of each and select as the with largest singular value . Using the power method, we can find an that satisfies

for some tolerance parameter . The procedure takes time, where is the number of non-zero elements in [26]. Taking the maximum w.r.t. on both sides of leads to , where . However, using does not encourage selecting an that is useful for all outputs. In fact, when , our approach is equivalent to imposing independent nuclear norms on .

**Group-sparse cases.** When or , we need to solve

respectively. Unlike the -constrained case, we are clearly selecting a basis vector with largest violation **across all outputs**. However, we are now faced with a more difficult non-convex optimization problem. Our strategy is to first choose an initialization which guarantees a certain multiplicative approximation , then refine the solution using a monotonically non-increasing iterative procedure.

Initialization.

We simply choose as the approximate solution of the case, i.e., we have

Now, using and , this immediately implies

with if and if .

Refining the solution.

We now apply another instance of the conditional gradient algorithm to solve the subproblem itself, leading to the following iterates:

where . Following [3], if we use the Armijo rule to select , every limit point of the sequence is a stationary point of . In practice, we observe that is almost always selected. Note that when and (i.e., single-output case), our refining algorithm **recovers the power method**. Generalized power methods were also studied for structured matrix factorization [16], but with different objectives and constraints. Since the conditional gradient algorithm assumes a differentiable function, in the case , we replace the absolute function with the Huber function if , otherwise.

Corrective refitting step.

After iterations, contains at most non-zero rows. We can therefore always store as (the output layer matrix) and (the basis vectors / hidden units added so far). In order to improve accuracy, on iteration , we can then refit the objective . We consider two kinds of corrective steps, a convex one that minimizes w.r.t. and an optional non-convex one that minimizes w.r.t. both and . Refitting allows to remove previously-added bad basis vectors, thanks to the use of sparsity-inducing penalties. Similar refitting procedures are commonly used in matching pursuit [22]. The entire procedure is summarized in Algorithm 1 and implementation details are given in Appendix Appendix D.

## 5Analysis of Algorithm 1

The main difficulty in analyzing the convergence of Algorithm 1 stems from the fact that we cannot solve the basis vector selection subproblem globally when or . Therefore, we need to develop an analysis that can cope with the multiplicative approximation . Multiplicative approximations were also considered in [18] but the condition they require is too stringent (cf. Appendix Appendix B.2 for a detailed discussion). The next theorem guarantees the number of iterations needed to output a multi-output network that achieves as small objective value as an optimal solution of .

In [20], single-output PNs were trained using GECO [26], a greedy algorithm with similar guarantees. However, GECO is limited to learning infinite vectors (not matrices) and it does not constrain its iterates like we do. Hence GECO cannot remove bad basis vectors. The proof of Theorem ? and a detailed comparison with GECO are given in Appendix Appendix B.2. Finally, we note that the infinite dimensional view is also key to convex neural networks [2]. However, to our knowledge, we are the first to give an explicit multiplicative approximation guarantee for a non-linear multi-output network.

## 6Experimental results

### 6.1Experimental setup

**Datasets.** For our multi-class experiments, we use four publicly-available datasets: segment (7 classes), vowel (11 classes), satimage (6 classes) and letter (26 classes) [12]. Quadratic models substantially improve over linear models on these datasets. For our recommendation system experiments, we use the MovieLens 100k and 1M datasets [14]. See Appendix E for complete details.

**Model validation.** The greedy nature of Algorithm 1 allows us to easily interleave training with model validation. Concretely, we use an outer loop (embarrassingly parallel) for iterating over the range of possible regularization parameters, and an inner loop (Algorithm 1, sequential) for increasing the number of basis vectors. Throughout our experiments, we use 50% of the data for training, 25% for validation, and 25% for evaluation. Unless otherwise specified, we use a multi-class logistic loss.

### 6.2Method comparison for the basis vector (hidden unit) selection subproblem

As we mentioned previously, the linearized subproblem (basis vector selection) for the and constrained cases involves a significantly more challenging non-convex optimization problem. In this section, we compare different methods for obtaining an approximate solution to . We focus on the case, since we have a method for computing the true global solution , albeit with exponential complexity in (cf. Appendix C). This allows us to report the empirically observed multiplicative approximation factor .

**Compared methods.** We compare * init + refine* (proposed), *random init + refine*, * init* (without refine), *random init* and *best data*: where .

**Results.** We report in Figure 1. * init + refine* achieves nearly the global maximum on both datasets and outperforms *random init + refine*, showing the effectiveness of the proposed initialization and that the iterative update can get stuck in a bad local minimum if initialized badly. On the other hand, * init + refine* outperforms * init* alone, showing the importance of iteratively refining the solution. *Best data*, a heuristic similar to that of approximate kernel SVMs [7], is not competitive.

### 6.3Sparsity-inducing penalty comparison

In this section, we compare the , and penalties for the choice of , when varying the maximum number of basis vectors (hidden units). Figure ? indicates test set accuracy when using output layer refitting. We also include linear logistic regression, kernel SVMs and the Nyström method as baselines. For the latter two, we use the quadratic kernel . Hyper-parameters are chosen so as to maximize validation set accuracy.

**Results.** On the vowel (11 classes) and letter (26 classes) datasets, and penalties outperform norm starting from 20 and 75 hidden units, respectively. On satimage (6 classes) and segment (7 classes), we observed that the three penalties are mostly similar (not shown). We hypothesize that and penalties make a bigger difference when the number of classes is large. Multi-output PNs substantially outperform the Nyström method with comparable number of basis vectors (hidden units). Multi-output PNs reach the same test accuracy as kernel SVMs with very few basis vectors on vowel and satimage but appear to require at least 100 basis vectors to reach good performance on letter. This is not surprising, since kernel SVMs require 3,208 support vectors on letter, as indicated in Table ? below.

### 6.4Multi-class benchmark comparison

**Compared methods.** We compare the proposed conditional gradient algorithm with output layer refitting only and with both output and hidden layer refitting; projected gradient descent (FISTA) with random initialization; linear and kernelized models; one-vs-rest PNs (i.e., fit one PN per class). We focus on PNs rather than FMs since they are known to work better on classification tasks [5].

**Results** are included in Table ?. From these results, we can make the following observations and conclusions. When using output-layer refitting on vowel and letter (two datasets with more than 10 classes), group-sparsity inducing penalties lead to better test accuracy. This is to be expected, since these penalties select basis vectors that are useful across all classes. When using full hidden layer and output layer refitting, catches up with and on the vowel and letter datasets. Intuitively, the basis vector selection becomes less important if we make more effort at every iteration by refitting the basis vectors themselves. However, on vowel, is still substantially better than (89.57 vs. 87.83).

Compared to projected gradient descent with random initialization, our algorithm (for both output and full refitting) is better on (), () and () of the datasets. In addition, with our algorithm, the best model (chosen against the validation set) is substantially sparser. Multi-output PNs substantially outperform OvR PNs. This is to be expected, since multi-output PNs learn to share basis vectors across different classes.

### 6.5Recommender system experiments using ordinal regression

A straightforward way to implement recommender systems consists in training a single-output model to regress ratings from one-hot encoded user and item indices [25]. Instead of a single-output PN or FM, we propose to use ordinal McRank, a **reduction from ordinal regression to multi-output binary classification**, which is known to achieve good nDCG (normalized discounted cumulative gain) scores [19]. This reduction involves training a probabilistic binary classifier for each of the relevance levels (for instance, in the MovieLens datasets). The expected relevance of (e.g. the concatenation of the one-hot encoded user and item indices) is then computed by

where we use the convention . Thus, all we need to do to use ordinal McRank is to train a probabilistic binary classifier for all .

Our key proposal is to use a multi-output model to learn all classifiers simultaneously, i.e., in a multi-task fashion. Let be a vector representing a user-item pair with corresponding rating , for . We form a matrix such that if and otherwise, and solve

where is set to the binary logistic loss, in order to be able to produce probabilities. After running Algorithm 1 on that objective for iterations, we obtain and . Because is shared across all outputs, the only small overhead of using the ordinal McRank reduction, compared to a single-output regression model, therefore comes from learning instead of .

In this experiment, we focus on multi-output factorization machines (FMs), since FMs usually work better than PNs for one-hot encoded data [5]. We show in Figure 2 the RMSE and nDCG (truncated at 1 and 5) achieved when varying (the maximum number of basis vectors / hidden units).

**Results.** When combined with the ordinal McRank reduction, we found that and –constrained multi-output FMs substantially outperform single-output FMs and PNs on both RMSE and nDCG measures. For instance, on MovieLens 100k and 1M, –constrained multi-output FMs achieve an nDCG@1 of 0.75 and 0.76, respectively, while single-output FMs only achieve 0.71 and 0.75. Similar trends are observed with nDCG@5. We believe that this reduction is more robust to ranking performance measures such as nDCG thanks to its modelling of the expected relevance.

## 7Conclusion and future directions

We defined the problem of learning multi-output PNs and FMs as that of learning a 3-way tensor whose slices share a common basis. To obtain a convex optimization objective, we reformulated that problem as that of learning an infinite but row-wise sparse matrix. To learn that matrix, we developed a conditional gradient algorithm with corrective refitting, and were able to provide convergence guarantees, despite the non-convexity of the basis vector (hidden unit) selection step.

Although not considered in this paper, our algorithm and its analysis can be modified to make use of stochastic gradients. An open question remains whether a conditional gradient algorithm with provable guarantees can be developed for training deep polynomial networks or factorization machines. Such deep models could potentially represent high-degree polynomials with few basis vectors. However, this would require the introduction of a new functional analysis framework.

**Supplementary material**

## AConvex multi-class loss functions

The gradient w.r.t. , denoted , can be computed by

where is a vector whose element is 1 and other elements are 0. For the smoothed multi-class hinge loss and the multi-class squared hinge loss, see [27] and [6], respectively.

## BProofs

### b.1Finite support of an optimal solution (Proposition )

**General case.** We first state a result that holds for arbitrary activation function (sigmoid, ReLu, etc...). The main idea is to use the fact that the penalties considered in Table ? are atomic [8]. Then, we can equivalently optimize over the convex hull of a set of atoms and invoke Carathéodory’s theorem for convex hulls.

Let be an -dimensional vector whose element is . Let us define the sets

where we define the set as follows:

case:

case:

case: .

Then is equivalent to

where is the convex hull of the set . The matrices and are related to each other by

for some such that and . By Carathéodory’s theorem for convex hulls, there exists with at most non-zero elements. Because elements of are matrices with a single non-zero row, contains at most non-zero rows (hidden units).

**Case of constraint and squared activation.** When , given s.t. , the output can be written as

Following [5], the nuclear norm of a symmetric matrix can be defined by

and the minimum is attained by the eigendecomposition and .

Therefore, we can always compute the eigendecomposition of each and use the eigenvectors as hidden units and the eigenvalues as output layer weights. Moreover, this solution is feasible, since eigenvectors belong to and since the norm of all eigenvalues is minimized. Since a matrix can have at most eigenvalues, we can conclude that has at most elements. Combined with the previous result, has at most non-zero rows (hidden units).

For the and penalties, we cannot make this argument, since applying the eigendecomposition might increase the penalty value and therefore make the solution infeasible.

### b.2Convergence analysis (Theorem )

In this section, we include a convergence analysis of the conditional gradient algorithm with multiplicative approximation in the linear minimization oracle. The proof follows mostly from [15] with a trick inspired from [1] to handle multiplicative approximations. Finally, we also include a detailed comparison with the analysis of GECO [26] and Block-FW [18].

We focus on constrained optimization problems of the form

where is convex and -smooth w.r.t. and .

**Curvature and smoothness constants.** The convergence analysis depends on the following standard curvature constant

Intuitively, this is a measure of non-linearity of : the maximum deviation between and its linear approximations over . The assumption of bounded is closely related to a smoothness assumption on . Following [15], for any choice of norm , can be upper-bounded by the smoothness constant as

Using , we obtain

and therefore

**Linear duality gap.** Following [15], we define the linear duality gap

Since is convex and differentiable, we have that

Let us define the primal error

Minimizing w.r.t. on both sides we obtain

Hence can be used as a certificate of optimality about .

**Bounding progress.** Let be the current iterate and be our update. The definition of implies

We now use that is obtained by an exact linear minimization oracle (LMO)

and therefore . Combined with , we obtain

Subtracting on both sides, we finally get

**Primal convergence.** Since we use a fully-corrective variant of the conditional gradient method, our algorithm enjoys a convergence rate at least as good as the variant with fixed step size. Following [15] and using , for every , the iterates satisfy

Thus, we can obtain an -accurate solution if we run the algorithm for iterations.

**Linear minimization with multiplicative approximation.** We now extend the analysis to the case of approximate linear minimization. Given , we assume that an approximate LMO outputs a certain such that

for some multiplicative factor (higher is more accurate). Since and are in , we have like before

Following the same trick as [1], we now absorb the multiplicative factor in the constraint

where we defined (i.e., the ball is shrunk by a factor ). We therefore obtain . Similarly as before, this implies that

Subtracting on both sides, we get

We thus get that iterate satisfies and

We can therefore obtain an such that if we run our algorithm for iterations with constraint parameter and multiplicative factor . Put differently, we can obtain an such that if we run our algorithm for iterations with constraint parameter and multiplicative factor .

**Comparison with the analysis of GECO.** GECO [26] is a greedy algorithm with fully-corrective refitting steps for learning a sparse vector from possibly infinitely-many features, similarly to our algorithm. However, unlike our algorithm, GECO does not constrain the norm of its iterates (i.e., there is no parameter ), which can lead to severe overfitting in practice. Following [26], GECO obtains a certain (unbounded) such that

In comparison, for the -constrained case, our algorithm learns an such that and

We see that our algorithm and GECO have similar guarantees, with the difference that GECO does not constrain its iterates.

GECO was used to learn single-output polynomial networks in [20]. Combining together with , it was shown that GECO can learn the parameters (unbounded) of a single-output polynomial network with unit ball constraint and squared activation such that

However, if we run our algorithm with an constraint, it can learn an such that and

Clearly, our algorithm with an constraint uses fewer iterations than GECO for learning polynomial networks with unit ball constraint and more than hidden units.

**Comparison with the analysis of Block-FW.** [18] analyze a block Frank-Wolfe method with “multiplicative” approximations in the linear minimization oracle. However, they require a different condition, namely:

for some . Under this condition, they show that the algorithm converges to an -approximate solution in iterations. A disadvantage of the above condition is that it contains an additive term that depends on the current iterate and so it is difficult to give guarantees on in general.

## CComputing an optimal solution of the linearized subproblem ( case)

We describe how to compute an optimal hidden unit in the case, albeit with exponential complexity in . Because of its exponential complexity in (the number of outputs), clearly, this method should only be used to evaluate other (polynomial-time) algorithms.

Recall that we want to solve

Now, if we knew the sign , we could rewrite the problem as

whose optimal solution is the dominant eigenvector of the symmetric matrix . The idea is then simply to find the dominant eigenvector for all possible sign vectors and choose the eigenvector that achieves largest objective value.

## DImplementation details

In practice, penalized formulations are more convenient to handle than constrained ones. Here, we discuss why we can safely replace constrained formulations by penalized formulations in the refitting step. We use the output layer refitting objective as an example. It is well known that there exists such that this objective is equivalent to

Unfortunately, the relation between and is a priori unknown. However, it is easy to see that the constant factor in is absorbed by the output layer in our refitting step. This means that we need to know the actual value of for the refitting step but not for the hidden unit selection step. As long as we compute a full regularization path, we may therefore use a penalized formulation in a practical implementation. We do so for both refitting objectives we discussed.

For both refitting objectives, we use FISTA, an accelerated projected gradient method with convergence rate, where is the iteration number. We set the maximum number of iterations to and the stopping criterion’s tolerance to .

## EDatasets

For our multi-class experiments, we used the following four publicly available datasets [12].

Name | |||
---|---|---|---|

segment | 2,310 | 19 | 7 |

vowel | 528 | 10 | 11 |

satimage | 4,435 | 36 | 6 |

letter | 15,000 | 16 | 26 |

For recommender system experiments, we used the following two publicly available datasets [14].

Name | |||
---|---|---|---|

Movielens 100k | 100,000 (ratings) | 2,625 = 943 (users) + 1,682 (movies) | 5 |

Movielens 1M | 1,000,209 (ratings) | 9,940 = 6,040 (users) + 3,900 (movies) | 5 |

The task is to predict ratings between 1 and 5 given by users to movies, i.e., . The design matrix was constructed following [25]. Namely, for each rating , the corresponding is set to the concatenation of the one-hot encodings of the user and item indices. Hence the number of samples is the number of ratings and the number of features is equal to the sum of the number of users and items. Each sample contains exactly two non-zero features. It is known that factorization machines are equivalent to matrix factorization when using this representation [25].

## FAdditional experimental results

### f.1Multi-class squared hinge loss results

We also compared the multi-class logistic (ML) loss to the multi-class squared hinge (MSH) loss. The MSH loss achieves comparable test accuracy to the ML loss. However, it can often be much faster to train, since it does not require expensive exponential and logarithm calculations.

[-2pt]Constraint | [-2pt]Loss | ||||||||

(rl)3-6 (rl)7-10 | segment | vowel | satimage | letter | segment | vowel | satimage | letter | |

(rl)1-2 (rl)3-6 (rl)7-10 | MSH | 96.01 | 87.83 | 89.98 | 92.03 | 95.67 | 79.13 | 88.99 | 91.25 |

(21) | (8) | (22) | (130) | (21) | (25) | (21) | (149) | ||

ML | 96.71 | 87.83 | 89.80 | 92.29 | 97.05 | 80.00 | 89.71 | 91.01 | |

(41) | (12) | (25) | (150) | (20) | (21) | (40) | (139) | ||

(rl)3-6 (rl)7-10 | MSH | 96.01 | 86.96 | 90.25 | 91.57 | 95.67 | 85.22 | 89.98 | 92.03 |

(15) | (8) | (12) | (94) | (25) | (19) | (50) | (149) | ||

ML | 96.71 | 89.57 | 89.08 | 91.81 | 96.36 | 85.22 | 89.71 | 92.24 | |

(40) | (15) | (18) | (106) | (21) | (15) | (50) | (150) | ||

(rl)3-6 (rl)7-10 | MSH | 95.84 | 85.22 | 89.80 | 92.27 | 97.05 | 86.09 | 88.90 | 91.20 |

(16) | (18) | (29) | (149) | (28) | (33) | (24) | (119) | ||

ML | 96.71 | 86.96 | 88.99 | 92.35 | 96.19 | 86.96 | 89.35 | 91.68 | |

(24) | (15) | (20) | (149) | (16) | (41) | (41) | (128) | ||

### f.2Full vs. output layer refitting comparison

In this experiment, we compare output layer refitting with full refitting of both the hidden and output layers. Empirically, we observe that full refitting does not always outperform output layer refitting in terms of objective value but it does so in terms of test accuracy.

### References

**Breaking the curse of dimensionality with convex neural networks.**

F. Bach.*JMLR*, 2017.**Convex neural networks.**

Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. In*NIPS*, 2005.*Nonlinear programming*.

D. P. Bertsekas. Athena Scientific Belmont, 1999.**Convex factorization machines.**

M. Blondel, A. Fujino, and N. Ueda. In*ECML/PKDD*, 2015.**Polynomial networks and factorization machines: New insights and efficient training algorithms.**

M. Blondel, M. Ishihata, A. Fujino, and N. Ueda. In*ICML*, 2016.**Block coordinate descent algorithms for large-scale sparse multiclass classification.**

M. Blondel, K. Seki, and K. Uehara.*Machine Learning*, 93(1):31–52, 2013.**Fast kernel classifiers with online and active learning.**

A. Bordes, S. Ertekin, J. Weston, and L. Bottou.*JMLR*, 6(Sep):1579–1619, 2005.**The convex geometry of linear inverse problems.**

V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky.*Foundations of Computational Mathematics*, 12(6):805–849, 2012.**Training and testing low-degree polynomial data mappings via linear svm.**

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin.*Journal of Machine Learning Research*, 11:1471–1490, 2010.**A fast and accurate dependency parser using neural networks.**

D. Chen and C. D. Manning. In*EMNLP*, 2014.**Conditional gradient algorithms with open loop step size rules.**

J. C. Dunn and S. A. Harshbarger.*Journal of Mathematical Analysis and Applications*, 62(2):432–444, 1978.**http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, 2011.**

R.-E. Fan and C.-J. Lin.**Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods.**

A. Gautier, Q. N. Nguyen, and M. Hein. In*NIPS*, 2016.**http://grouplens.org/datasets/movielens/, 1998.**

GroupLens.**Revisiting Frank-Wolfe: Projection-free sparse convex optimization.**

M. Jaggi. In*ICML*, 2013.**Generalized power method for sparse principal component analysis.**

M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre.*Journal of Machine Learning Research*, 11:517–553, 2010.**Field-aware factorization machines for CTR prediction.**

Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin. In*ACM Recsys*, 2016.**Block-coordinate Frank-Wolfe optimization for structural SVMs.**

S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. In*ICML*, 2012.**McRank: Learning to rank using multiple classification and gradient boosting.**

P. Li, C. J. Burges, and Q. Wu. In*NIPS*, 2007.**On the computational efficiency of training neural networks.**

R. Livni, S. Shalev-Shwartz, and O. Shamir. In*NIPS*, 2014.**Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint.**

R. Luss and M. Teboulle.*SIAM Review*, 55(1):65–98, 2013.**Matching pursuits with time-frequency dictionaries.**

S. G. Mallat and Z. Zhang.*IEEE Transactions on Signal Processing*, 41(12):3397–3415, 1993.**Exponential machines.**

A. Novikov, M. Trofimov, and I. Oseledets.*arXiv preprint arXiv:1605.03795*, 2016.**Beyond CCA: Moment matching for multi-view models.**

A. Podosinnikova, F. Bach, and S. Lacoste-Julien. In*ICML*, 2016.**Factorization machines.**

S. Rendle. In*ICDM*, 2010.**Large-scale convex minimization with a low-rank constraint.**

S. Shalev-Shwartz, A. Gonen, and O. Shamir. In*ICML*, 2011.**ShareBoost: Efficient multiclass learning with feature sharing.**

S. Shalev-Shwartz, Y. Wexler, and A. Shashua. In*NIPS*, 2011.**Coffin: A computational framework for linear SVMs.**

S. Sonnenburg and V. Franc. In*ICML*, 2010.**Supervised learning with tensor networks.**

E. Stoudenmire and D. J. Schwab. In*NIPS*, 2016.**Multi-class Pegasos on a budget.**

Z. Wang, K. Crammer, and S. Vucetic. In*ICML*, 2010.**Convex factorization machine for toxicogenomics prediction.**

M. Yamada, W. Lian, A. Goyal, J. Chen, K. Wimalawarne, S. A. Khan, S. Kaski, H. M. Mamitsuka, and Y. Chang. In*KDD*, 2017.**Scaling factorization machines with parameter server.**

E. Zhong, Y. Shi, N. Liu, and S. Rajan. In*CIKM*, 2016.