Fenchel Lifted Networks:
A Lagrange Relaxation of Neural Network Training

Fangda Gu*, Armin Askari*, Laurent El Ghaoui
UC Berkeley

Abstract

Despite the recent successes of deep neural networks, the corresponding training problem remains highly nonconvex and difficult to optimize. Classes of models have been proposed that introduce greater structure to the objective function at the cost of lifting the dimension of the problem. However, these lifted methods sometimes perform poorly compared to traditional neural networks. In this paper, we introduce a new class of lifted models, Fenchel lifted networks, that enjoy the same benefits as previous lifted models without suffering a degradation in performance over classical networks. Our model represents activation functions as equivalent biconvex constraints and uses Lagrange multipliers to arrive at a rigorous lower bound of the traditional neural network training problem. The model is trained efficiently using block-coordinate descent and is parallelizable across data points and/or layers. We compare our model against standard fully connected and convolutional networks and show that we are able to match or beat their performance.
1 Introduction
Deep neural networks (DNNs) have become the preferred model for supervised learning tasks after their success in various fields of research. However, due to their highly nonconvex nature, DNNs pose a difficult optimization problem at training time: the landscape contains many saddle points and local minima that can cause the trained model to generalize poorly (entropy2016; dauphin2014). This has motivated regularization schemes such as weight decay (krogh1992), batch normalization (ioffe2015), and dropout (dropout) so that solutions generalize better to test data.
In spite of this, backpropagation used in conjunction with stochastic gradient descent (SGD) or variants such as Adam (kingma2015adam) suffers from a variety of problems. One of the most notable is the vanishing gradient problem, which slows down gradient-based methods during training. Several approaches have been proposed to mitigate it, such as the introduction of rectified linear units (ReLUs), but the problem persists. For a discussion of the limitations of backpropagation and SGD, we direct the reader to Section 2.1 of taylor2016training.
One approach to deal with this problem is to introduce auxiliary variables that increase the dimension of the problem. In doing so, the training problem decomposes into multiple local subproblems that can be solved efficiently without using SGD or Adam; in particular, the methods of choice have been block coordinate descent (BCD) (askari2018; lau2018proximal; zhang2017convergent; pmlrv33carreiraperpinan14) and the alternating direction method of multipliers (ADMM) (taylor2016training; zhang2016efficient). By lifting the dimension of the problem, these models avoid many of the difficulties DNNs face during training. In addition, lifting offers the possibility of directly penalizing the added variables, which opens up interesting avenues into the interpretability and robustness of the network.
While these methods, which we refer to as "lifted" models for the remainder of the paper, offer an alternative to the original problem with some added benefits, they have their limitations. Most notably, traditional DNNs are still able to outperform them in spite of the difficult optimization landscape. Moreover, most of these methods are unable to operate in an online manner or adapt to continually changing data sets, a setting prevalent in reinforcement learning (Sutton:1998:IRL:551283). Finally, by introducing auxiliary variables, the dimensionality of the problem greatly increases, making these methods very difficult to train with limited computational resources.
1.1 Paper contribution
To address the problems listed above, we propose Fenchel lifted networks, a biconvex formulation for deep learning based on Fenchel’s duality theorem that can be optimized using BCD. We show that our method is a rigorous lower bound for the learning problem and admits a natural batching scheme to adapt to changing data sets and settings with limited computational power. We compare our method against other lifted models and against traditional fully connected and convolutional neural networks. We show that we are able to outperform the former and that we can compete with or even outperform the latter.
Paper outline.
In Section 2, we give a brief overview of related work on lifted models. In Section 3, we introduce the notation used in the remainder of the paper. Section 4 introduces Fenchel lifted networks and their variants, and discusses how to train these models using BCD. Section 5 compares the proposed method against fully connected and convolutional networks on MNIST and CIFAR-10.
2 Related Work
Lifted methods
Related works that lift the dimension of the training problem primarily optimize the resulting model using BCD or ADMM. These methods have experienced recent success due to their ability to exploit the structure of the problem, first converting the constrained optimization problem into an unconstrained one and then solving the resulting subproblems in parallel. They do this by relaxing the network constraints and introducing penalties into the objective function. The two main ways of introducing penalties into the objective function are to use quadratic penalties (Sutskever:2013:IIM:3042817.3043064; taylor2016training; lau2018proximal) or to use equivalent representations of the activation functions (askari2018; zhang2017convergent).
As a result, these formulations have many advantages over the traditional training problem and give superior performance for some specific network structures (pmlrv33carreiraperpinan14; zhang2017convergent). These methods also have great potential for parallelization, as shown by taylor2016training. However, there has been little evidence showing that they can compete with traditional DNNs, which overshadows the nice structure these formulations bring about.
An early example of auxiliary variables being introduced into the training problem is the method of auxiliary coordinates (MAC) of pmlrv33carreiraperpinan14, which uses quadratic penalties to enforce the network constraints. They test their method on autoencoders and show that it is able to outperform SGD. Follow-up work by carreira2016parmac and taylor2016training demonstrates the huge potential for parallelizing these methods, and lau2018proximal gives convergence guarantees for a modified problem.
Another class of models lifts the dimension of the problem by representing activation functions in equivalent formulations. Negiar2017, askari2018, zhang2017convergent and li2019lifted explore the structure of activation functions and use argmin maps to represent them. In particular, askari2018 show how a strictly monotone activation function can be seen as the argmin of a specific optimization problem. Just as with quadratic penalties, this formulation of the problem still performs poorly compared to traditional neural networks.
3 Background and Notation
Feedforward neural networks.
We are given an input data matrix $X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$ of $m$ data points and a response matrix $Y \in \mathbb{R}^{p \times m}$. We consider the supervised learning problem involving a neural network with $L$ hidden layers. The neural network produces a prediction $\hat{Y} = W_L X_L$ with the feed-forward recursion given below:
$$X_{l+1} = \phi_l(W_l X_l + b_l \mathbf{1}_m^\top), \quad l = 0, \ldots, L-1, \qquad (1)$$
where the $\phi_l$ are the activation functions that act columnwise on a matrix input, $\mathbf{1}_m$ is a vector of ones, and $W_l \in \mathbb{R}^{n_{l+1} \times n_l}$ and $b_l \in \mathbb{R}^{n_{l+1}}$ are the weight matrices and bias vectors, respectively. Here $n_l$ is the number of output values for a single data point (i.e., hidden nodes) at layer $l$, with $n_0 = n$ and $n_{L+1} = p$. Without loss of generality, we can remove $b_l$ by adding an extra column to $W_l$ and a row of ones to $X_l$. Then (1) simplifies to
$$X_{l+1} = \phi_l(W_l X_l), \quad l = 0, \ldots, L-1. \qquad (2)$$
In the case of fully connected networks, the $\phi_l$ are typically sigmoidal activation functions or ReLUs. In the case of convolutional neural networks (CNNs), the recursion can accommodate convolutions and pooling operations in conjunction with an activation. For classification tasks, we typically apply a softmax function after applying the final affine transformation $W_L X_L$.
The initial value for the recursion is $X_0 = X$, and $W_L$ is the weight matrix of the final affine transformation. We refer to the collections $(W_l)_{l=0}^{L}$ and $(X_l)_{l=1}^{L}$ as the $W$-variables and the $X$-variables, respectively.
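To make the recursion concrete, the snippet below is a minimal sketch of (2) in NumPy, assuming ReLU activations and the bias absorbed into each $W_l$ as described above; all names and sizes are illustrative rather than taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                       # number of data points
dims = [4, 3, 2]            # n_0, n_1, n_2: input and hidden widths

X = rng.standard_normal((dims[0], m))             # X_0: input data matrix
W = [0.1 * rng.standard_normal((dims[l + 1], dims[l] + 1))
     for l in range(len(dims) - 1)]               # extra column absorbs the bias
W_out = 0.1 * rng.standard_normal((3, dims[-1]))  # final affine map W_L

for W_l in W:
    X_aug = np.vstack([X, np.ones((1, m))])       # append the row of ones
    X = np.maximum(W_l @ X_aug, 0.0)              # X_{l+1} = phi_l(W_l X_l)

Y_hat = W_out @ X                                 # prediction W_L X_L
print(Y_hat.shape)                                # (3, 5)
```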
The weights are obtained by solving the following constrained optimization problem:
$$\begin{aligned} \min_{(W_l)_l,\,(X_l)_l} \quad & \mathcal{L}(Y, W_L X_L) + \sum_{l=0}^{L} \rho_l(W_l) \\ \text{s.t.} \quad & X_{l+1} = \phi_l(W_l X_l), \quad l = 0, \ldots, L-1, \qquad X_0 = X. \end{aligned} \qquad (3)$$
Here, $\mathcal{L}$ is a loss function and the $\rho_l$'s are hyperparameter-weighted penalty functions used for regularizing weights, controlling network structures, etc. In (3), optimizing over the $X$-variables is trivial; we simply apply the recursion (2) and solve the resulting unconstrained problem over the $W$-variables using SGD or Adam. After optimizing over the weights and biases, we obtain a prediction for the test data by passing it through the recursion (2) one layer at a time.
Our model.
We develop a family of models in which we approximate the recursion constraints (2) via penalties. We use the argmin maps of askari2018 to create a biconvex formulation that can be trained efficiently using BCD, and we show that our model is a lower bound of (3). Furthermore, we show how our method can naturally be batched to ease computational requirements and improve performance.
4 Fenchel lifted networks
In this section, we introduce Fenchel lifted networks. We begin by showing that for a certain class of activation functions, we can equivalently represent them as biconvex constraints. We then dualize these constraints and construct a lower bound for the original training problem. We show how our lower bound can naturally be batched and how it can be trained efficiently using BCD.
4.1 Activations as biconvex constraints
In this section, we show how to convert the equality constraints of (3) into inequalities, which we then dualize to arrive at a relaxation (lower bound) of the problem. In particular, this lower bound is biconvex in the $W$-variables and the $X$-variables. We make the following assumption on the activation functions $\phi_l$.
BC Condition. The activation function $\phi$ satisfies the BC condition if there exists a biconvex function $B_\phi(v, u)$, convex in $v$ for fixed $u$ and convex in $u$ for fixed $v$, such that
$$B_\phi(v, u) \ge 0 \ \text{ for all } (v, u), \qquad B_\phi(v, u) = 0 \iff u = \phi(v).$$
We now state and prove a result that is at the crux of Fenchel lifted networks.
Theorem 1.
Assume $\phi$ is continuous, strictly monotone, and that $0 \in \mathrm{dom}(\phi)$ or $0 \in \mathrm{range}(\phi)$. Then $\phi$ satisfies the BC condition.
Proof.
Without loss of generality, $\phi$ is strictly increasing. Thus it is invertible: there exists $\phi^{-1}$ such that $\phi^{-1}(\phi(v)) = v$ for all $v \in \mathrm{dom}(\phi)$, which implies $\phi(\phi^{-1}(u)) = u$ for all $u \in \mathrm{range}(\phi)$. Now, define $B_\phi$ as
$$B_\phi(v, u) := F(u) + F^*(v) - uv,$$
where $F(u) := \int_{u_0}^{u} \phi^{-1}(s)\,ds$, $F^*(v) := \int_{\phi^{-1}(u_0)}^{v} \phi(t)\,dt$, and $u_0$ is either $0$ or satisfies $\phi^{-1}(u_0) = 0$. Then we have
$$F(u) + F^*(v) \ge uv, \qquad (4)$$
where $F^*$ is the Fenchel conjugate of $F$. By the Fenchel–Young inequality, (4) holds with equality if and only if $u = \phi(v)$.
By construction, $B_\phi(v, u) \ge 0$, with equality exactly when $u = \phi(v)$. Note furthermore that since $\phi$ is continuous and strictly increasing, so is $\phi^{-1}$ on its domain, and thus $F$ and $F^*$ are convex. It follows that $B_\phi$ is a biconvex function of $(v, u)$.
It remains to prove that $F^*$ above is indeed the Fenchel conjugate of $F$. By definition of the Fenchel conjugate we have that
$$F^*(v) = \sup_u \; uv - F(u).$$
It is easy to see that the supremum is attained at $u = \phi(v)$. Thus
$$F^*(v) = v\phi(v) - F(\phi(v)) = v\phi(v) - \int_{u_0}^{\phi(v)} \phi^{-1}(s)\,ds = \int_{u_0}^{\phi(v)} s \, d\phi^{-1}(s) = \int_{\phi^{-1}(u_0)}^{v} \phi(t)\,dt,$$
where the third equality is a consequence of integration by parts (together with the fact that $u_0 \phi^{-1}(u_0) = 0$), and in the fourth equality we make the substitution $s = \phi(t)$. ∎
Note that Theorem 1 implies that activation functions such as the sigmoid and tanh can be equivalently written as biconvex constraints. Although the ReLU is not strictly monotone, we can simply restrict the inverse to the domain $u \ge 0$; specifically, for $u \ge 0$ we define
$$\phi^{-1}(u) = u.$$
Then, we can rewrite the ReLU function $u = \max(v, 0)$ as the equivalent set of biconvex constraints
$$B_\phi(v, u) \le 0, \qquad u \ge 0,$$
where $B_\phi$ is constructed as in the proof of Theorem 1. This implies
$$B_\phi(v, u) = \frac{1}{2}u^2 - uv + \frac{1}{2}\max(v, 0)^2, \qquad u \ge 0. \qquad (5)$$
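As a sanity check on (5), the following sketch evaluates the ReLU penalty $B_\phi(v, u) = \tfrac{1}{2}u^2 - uv + \tfrac{1}{2}\max(v,0)^2$ and verifies numerically that it is nonnegative and vanishes exactly at $u = \max(v, 0)$; the helper name is an illustrative choice of ours.

```python
import numpy as np

def B_relu(v, u):
    # elementwise ReLU penalty from (5); only valid on the region u >= 0
    assert np.all(u >= 0), "the BC representation of ReLU requires u >= 0"
    return 0.5 * u**2 - u * v + 0.5 * np.maximum(v, 0.0)**2

v = np.linspace(-2.0, 2.0, 9)
u = np.maximum(v, 0.0)                     # u = ReLU(v): on the graph
print(np.allclose(B_relu(v, u), 0.0))      # True: the penalty vanishes
print(np.all(B_relu(v, u + 0.3) > 0.0))    # True: off the graph it is positive
```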
4.2 Lifted Fenchel model
Assuming the activation functions $\phi_l$ of (3) satisfy the hypotheses of Theorem 1, we can reformulate the learning problem equivalently as
$$\begin{aligned} \min_{(W_l)_l,\,(X_l)_l} \quad & \mathcal{L}(Y, W_L X_L) + \sum_{l=0}^{L} \rho_l(W_l) \\ \text{s.t.} \quad & B_l(W_l X_l, X_{l+1}) \le 0, \quad l = 0, \ldots, L-1, \qquad X_0 = X, \end{aligned} \qquad (6)$$
where $B_l$ is the shorthand notation of $B_{\phi_l}$, applied elementwise and summed over the entries of its matrix arguments. We now dualize the inequality constraints and obtain a lower bound of the standard problem (3) via the Lagrange relaxation
$$\begin{aligned} G(\lambda) := \min_{(W_l)_l,\,(X_l)_l} \quad & \mathcal{L}(Y, W_L X_L) + \sum_{l=0}^{L} \rho_l(W_l) + \sum_{l=0}^{L-1} \lambda_{l+1} B_l(W_l X_l, X_{l+1}) \\ \text{s.t.} \quad & X_0 = X, \end{aligned} \qquad (7)$$
where $\lambda_{l+1} \ge 0$, $l = 0, \ldots, L-1$, are the Lagrange multipliers. The maximum lower bound can be achieved by solving the dual problem
$$\max_{\lambda \ge 0} \; G(\lambda) \le p^*, \qquad (8)$$
where $p^*$ is the optimal value of (3). Note that if all our activation functions are ReLUs, we must also include the constraints $X_l \ge 0$, $l = 1, \ldots, L$, in the training problem as a consequence of (5). Although the new model introduces $L$ new parameters (the Lagrange multipliers), we can show that using variable scaling we can reduce these to only one hyperparameter $\lambda > 0$ (for details, see Appendix A). The learning problem then becomes
$$\begin{aligned} \min_{(W_l)_l,\,(X_l)_l} \quad & \mathcal{L}(Y, W_L X_L) + \sum_{l=0}^{L} \rho_l(W_l) + \lambda \sum_{l=0}^{L-1} B_l(W_l X_l, X_{l+1}) \\ \text{s.t.} \quad & X_0 = X. \end{aligned} \qquad (9)$$
In a regression setting where the data is generated by a one-layer network, we are able to provide global convergence guarantees for the above model (for details, see Appendix B).
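To fix ideas, here is a sketch of how the objective of (9) could be evaluated for ReLU activations, an MSE loss, and squared-Frobenius weight penalties; the function and variable names (lifted_objective, lam, rho) are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def B_relu_sum(V, U):
    # ReLU penalty (5) applied elementwise and summed over all entries
    return 0.5 * np.sum(U**2) - np.sum(U * V) + 0.5 * np.sum(np.maximum(V, 0.0)**2)

def lifted_objective(W, X, Y, lam=1.0, rho=1e-3):
    # W = [W_0, ..., W_L] (W_L is the final affine map); X = [X_0, ..., X_L]
    loss = 0.5 * np.sum((W[-1] @ X[-1] - Y)**2)           # L(Y, W_L X_L), here MSE
    reg = rho * sum(np.sum(W_l**2) for W_l in W)          # sum_l rho_l(W_l)
    penalty = lam * sum(B_relu_sum(W[l] @ X[l], X[l + 1])
                        for l in range(len(W) - 1))       # lam * sum_l B_l
    return loss + reg + penalty
```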
Comparison with other methods.
For ReLU activations, $B_\phi$ as in (5) differs from the penalty terms introduced in previous works. In askari2018 and zhang2017convergent, the penalty is set to $B(v, u) = \|u - v\|_2^2$ (with $u \ge 0$), while in taylor2016training and pmlrv33carreiraperpinan14 it is set to $B(v, u) = \|u - \max(v, 0)\|_2^2$. Note that the latter $B$ is not biconvex. While the former $B$ is biconvex, it does not perform well at test time. li2019lifted set $B$ based on a proximal operator in a manner similar to the BC condition.
Convolutional model.
Our model can naturally accommodate average pooling and convolution operations found in CNNs, since they are linear operations. We write $W * X$, where $*$ denotes the convolution operator, and write $\mathrm{Pool}(X)$ to denote the average pooling operator on $X$. Then, for example, the sequence Conv–Activation can be represented via the constraint
$$B_l(W_l * X_l, X_{l+1}) \le 0, \qquad (10)$$
while the sequence Pool $\circ$ Conv followed by an activation can be represented as
$$B_l(\mathrm{Pool}(W_l * X_l), X_{l+1}) \le 0. \qquad (11)$$
Note that the pooling operation changes the dimensions of the matrices involved.
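Because convolution is linear in its inputs, the constraint (10) is just the penalty from (5) applied to the pre-activation $W_l * X_l$. The following 1-D sketch illustrates this; the filter, signal, and helper names are illustrative assumptions.

```python
import numpy as np

def B_relu_sum(V, U):
    return 0.5 * np.sum(U**2) - np.sum(U * V) + 0.5 * np.sum(np.maximum(V, 0.0)**2)

w = np.array([1.0, -0.5])                  # a 1-D convolution filter
x = np.array([0.2, -1.0, 0.7, 0.3])        # a 1-D input signal
v = np.convolve(x, w, mode="valid")        # pre-activation W_l * X_l
x_next = np.maximum(v, 0.0)                # feedforward value of X_{l+1}
print(B_relu_sum(v, x_next))               # 0.0: the constraint (10) is tight
```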
4.3 Prediction rule.
In previous works that reinterpret activation functions as argmin maps (askari2018; zhang2017convergent), the prediction at test time is defined as the solution to the optimization problem below:
$$\begin{aligned} \hat{y} = \arg\min_{y} \min_{(x_l)_l} \quad & \mathcal{L}(y, W_L x_L) + \lambda \sum_{l=0}^{L-1} B_l(W_l x_l, x_{l+1}) \\ \text{s.t.} \quad & x_0 = x, \end{aligned} \qquad (12)$$
where $x$ is a test data point, $\hat{y}$ is the predicted value, and $x_1, \ldots, x_L$ are the intermediate representations we optimize over. Note that if $\mathcal{L}$ is a mean squared error, applying the traditional feedforward rule gives an optimal solution to (12). We find empirically that applying the standard feedforward rule works well even with a cross-entropy loss.
4.4 Batched model
The models discussed in the introduction usually require the entire data set to be loaded into memory, which may be infeasible for very large data sets or for data sets that are continually changing. We can circumvent this issue by batching the model. By sequentially loading a part of the data set into memory and optimizing the network parameters, we are able to train the network with limited computational resources. Formally, the batched model is
$$\begin{aligned} \min_{(W_l)_l,\,(X_l)_l} \quad & \mathcal{L}(Y_b, W_L X_L) + \sum_{l=0}^{L} \rho_l(W_l) + \lambda \sum_{l=0}^{L-1} B_l(W_l X_l, X_{l+1}) + \sum_{l=0}^{L} \gamma_l \|W_l - W_l^-\|_F^2 \\ \text{s.t.} \quad & X_0 = X_b, \end{aligned} \qquad (13)$$
where $X_b$ (with responses $Y_b$) contains only a batch of data points instead of the complete data set. The additional term in the objective is introduced to moderate the change in the $W$-variables between subsequent batches; here $W_l^-$ represents the optimal $W$-variables from the previous batch and $\gamma = (\gamma_0, \ldots, \gamma_L)$ is a hyperparameter vector. The $X$-variables are reinitialized for each batch by feeding the new batch forward through the equivalent standard neural network.
4.5 Blockcoordinate descent algorithm
The model (9) satisfies the following properties:

- For fixed $W$-variables and fixed $X$-variables $(X_j)_{j \neq l}$, the problem is convex in $X_l$, and is decomposable across data points.

- For fixed $X$-variables, the problem is convex in the $W$-variables, and is decomposable across layers and data points.
The non-batched and batched Fenchel lifted networks are trained using the block-coordinate descent algorithms highlighted in Algorithms 1 and 2. By exploiting the biconvexity of the problem, we can alternate between updating the $X$-variables and the $W$-variables to train the network.
Note that Algorithm 2 differs from Algorithm 1 in three ways. First, the $X$-variables must be reinitialized each time a new batch of data points is loaded. Second, the subproblems for updating the $W$-variables are different, as shown in Section 4.5.2. Lastly, an additional parameter is introduced to specify the number of training alternations performed on each batch; we typically set it to a small value.
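For concreteness, the skeleton below sketches the alternating scheme in the spirit of Algorithms 1 and 2; the callables feedforward, update_X and update_W stand for the convex subproblem solvers of Sections 4.5.1 and 4.5.2, and all names are our own illustrative choices.

```python
def train_batched(batches, W, n_alternations, feedforward, update_X, update_W):
    """Sketch of batched BCD training; the subproblem solvers are abstract."""
    for X0, Y in batches:                     # sequentially load batches
        X = feedforward(W, X0)                # re-initialize the X-variables
        W_prev = [W_l.copy() for W_l in W]    # anchor for the proximal term
        for _ in range(n_alternations):
            X = update_X(W, X, Y)             # backward sweep over l = L, ..., 1
            W = update_W(W, X, Y, W_prev)     # parallel over layers, see (19)-(20)
    return W
```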
4.5.1 Updating the $X$-variables
For fixed $W$-variables, the problem of updating the $X$-variables can be solved by cyclically optimizing each $X_l$ with the other $X$-variables fixed. We initialize the $X$-variables by feeding forward through the equivalent neural network and update the $X_l$'s backward from $l = L$ to $l = 1$, in the spirit of backpropagation.
We can derive the subproblem for $X_l$, $1 \le l < L$, with the other variables fixed, from (9). The subproblem reads
$$X_l^+ = \arg\min_{Z} \; \lambda \big( B_{l-1}(W_{l-1} X_{l-1}, Z) + B_l(W_l Z, X_{l+1}) \big), \qquad (14)$$
where for ReLU activations we additionally impose $Z \ge 0$. By construction, the subproblem (14) is convex and parallelizable across data points. Note in particular that when our activation is a ReLU, the objective function in (14) is in fact strongly convex and has a continuous first derivative.
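As an illustration, the sketch below solves the subproblem (14) for ReLU penalties by projected gradient descent, using the feedforward value as a warm start. The gradient follows directly from expanding (5); the step size and iteration count are illustrative choices, not the authors'.

```python
import numpy as np

def update_X_l(V_prev, W_next, X_next, n_iter=200, step=1e-2):
    # min_{Z >= 0} B(V_prev, Z) + B(W_next Z, X_next), V_prev = W_{l-1} X_{l-1}
    Z = np.maximum(V_prev, 0.0)                    # warm start: feedforward value
    for _ in range(n_iter):
        grad = (Z - V_prev                         # from B(V_prev, Z)
                + W_next.T @ (np.maximum(W_next @ Z, 0.0) - X_next))
        Z = np.maximum(Z - step * grad, 0.0)       # project onto Z >= 0
    return Z
```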
For the last layer (i.e., $l = L$), the subproblem derived from (9) reads differently:
$$X_L^+ = \arg\min_{Z} \; \mathcal{L}(Y, W_L Z) + \lambda B_{L-1}(W_{L-1} X_{L-1}, Z), \qquad (15)$$
where again $Z \ge 0$ in the ReLU case. For common losses such as the mean squared error (MSE) and cross-entropy, the subproblem is convex and parallelizable across data points. Specifically, when the loss is MSE and we use a ReLU activation at the layer before the output layer, (15) becomes
$$X_L^+ = \arg\min_{Z \ge 0} \; \|Y - W_L Z\|_F^2 + \lambda \|Z - W_{L-1} X_{L-1}\|_F^2,$$
where we use the fact that $\tfrac{1}{2}\|\max(W_{L-1}X_{L-1}, 0)\|_F^2 - \tfrac{1}{2}\|W_{L-1}X_{L-1}\|_F^2$ is a constant in $Z$ to equivalently replace $B_{L-1}$ as in (5) by a squared Frobenius term. The subproblem is a nonnegative least squares, for which specialized methods exist (kim2014algorithms).
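A sketch of this NNLS update is below: the two quadratic terms are stacked into a single least-squares design, and each data point (column) is solved independently with SciPy's NNLS routine. Variable names are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def update_X_last(W_L, Y, V, lam):
    # min_{Z >= 0} ||Y - W_L Z||_F^2 + lam * ||Z - V||_F^2, V = W_{L-1} X_{L-1}
    p, k = W_L.shape
    A = np.vstack([W_L, np.sqrt(lam) * np.eye(k)])   # stacked design matrix
    Z = np.empty_like(V)
    for j in range(V.shape[1]):                      # one NNLS per data point
        b = np.concatenate([Y[:, j], np.sqrt(lam) * V[:, j]])
        Z[:, j], _ = nnls(A, b)
    return Z
```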
For a cross-entropy loss, when the second-to-last layer is a ReLU activation, the subproblem for the last layer takes the convex form
$$X_L^+ = \arg\min_{Z \ge 0} \; -\langle Y, \log s(W_L Z) \rangle + \lambda \|Z - W_{L-1} X_{L-1}\|_F^2, \qquad (16)$$
where $s(\cdot)$ is the softmax function applied columnwise and $\log$ is the elementwise logarithm. askari2018 show how to solve the above problem using bisection.
4.5.2 Updating the $W$-variables
With the $X$-variables fixed, the problem of updating the $W$-variables can be solved in parallel across layers and data points.
Subproblems for the non-batched model.
The problem of updating $W_l$ at the intermediate layers becomes
$$W_l^+ = \arg\min_{W} \; \rho_l(W) + \lambda B_l(W X_l, X_{l+1}). \qquad (17)$$
Again, by construction, the subproblem (17) is convex and parallelizable across data points. Also, since there is no coupling of the $W$-variables between layers, the subproblem (17) is parallelizable across layers.
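The sketch below solves (17) for a ReLU penalty and a squared-Frobenius regularizer by plain gradient descent; the subproblem is smooth and convex in $W$ for fixed $X$-variables, and the step size and iteration count are illustrative assumptions.

```python
import numpy as np

def update_W_l(W, X_l, X_next, lam=1.0, rho=1e-3, n_iter=200, step=1e-3):
    # min_W rho * ||W||_F^2 + lam * B(W X_l, X_next)
    for _ in range(n_iter):
        V = W @ X_l                                   # pre-activation
        grad = lam * (np.maximum(V, 0.0) - X_next) @ X_l.T + 2.0 * rho * W
        W = W - step * grad
    return W
```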
For the last layer, the subproblem becomes
$$W_L^+ = \arg\min_{W} \; \rho_L(W) + \mathcal{L}(Y, W X_L). \qquad (18)$$
Subproblems for the batched model.
As shown in Section 4.4, the introduction of regularization terms between the $W$-variables and their values from the previous batch requires that the subproblems (17) and (18) be modified. (17) now becomes
$$W_l^+ = \arg\min_{W} \; \rho_l(W) + \lambda B_l(W X_l, X_{l+1}) + \gamma_l \|W - W_l^-\|_F^2, \qquad (19)$$
while (18) becomes
$$W_L^+ = \arg\min_{W} \; \rho_L(W) + \mathcal{L}(Y_b, W X_L) + \gamma_L \|W - W_L^-\|_F^2. \qquad (20)$$
Note that these subproblems in the case of a ReLU activation are strongly convex and parallelizable across layers and data points.
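For the batched subproblem (19), the proximal term simply adds $2\gamma_l(W - W_l^-)$ to the gradient. The sketch below takes a single gradient step, as the CIFAR-10 experiments in Section 5.3 do; the step size is again an illustrative choice.

```python
import numpy as np

def update_W_l_batched(W, W_prev, X_l, X_next, lam=1.0, gamma=1.0, step=1e-3):
    # one gradient step on (19), without the rho term (dropped when batching)
    V = W @ X_l
    grad = (lam * (np.maximum(V, 0.0) - X_next) @ X_l.T
            + 2.0 * gamma * (W - W_prev))             # proximal pull to W_l^-
    return W - step * grad
```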
5 Numerical Experiments
In this section, we compare Fenchel lifted networks against the other lifted models discussed in the introduction and against traditional neural networks. In particular, we compare our model against the models proposed by taylor2016training, lau2018proximal and askari2018 on MNIST. Then we compare Fenchel lifted networks against a fully connected neural network and LeNet-5 (lecun1998gradient) on MNIST. Finally, we compare Fenchel lifted networks against LeNet-5 on CIFAR-10. For a discussion of hyperparameters and how model parameters were selected, see Appendix C.
5.1 Fenchel lifted networks vs. lifted models
Here, we compare the non-batched Fenchel lifted network against the models proposed by taylor2016training (code available at https://github.com/PotatoThanh/ADMMNeuralNetworks), lau2018proximal (code available at https://github.com/deeplearningmath/bcd_dnn), and askari2018. The first model is trained using ADMM and the latter two using the BCD algorithms proposed in the respective papers. In Figure 1, we compare these models on MNIST with a 784-300-10 architecture (inspired by lecun1998gradient) using a mean squared error (MSE) loss.
After multiple iterations of hyperparameter search with little improvement over the base model, we chose to keep the hyperparameters for taylor2016training and lau2018proximal as given in their code. The hyperparameters for askari2018 were tuned using cross-validation on a holdout set during training. Our model used these same parameters and cross-validated the remaining hyperparameters. The neural network model was trained using SGD; its curve is smoothed in Figure 1 for visual clarity. From Figure 1 it is clear that Fenchel lifted networks vastly outperform the other lifted models and achieve a test set accuracy on par with traditional networks.
5.2 Fenchel lifted networks vs. neural networks on MNIST
For the same 784-300-10 architecture as in the previous section, we compare batched Fenchel lifted networks against traditional neural networks trained using first-order methods. We use a cross-entropy loss in the final layer of both models. The hyperparameters for our model are tuned using cross-validation. Figure 2 shows the results.
As shown in Figure 2, Fenchel lifted networks learn faster than traditional networks, as indicated by the red curve lying consistently above the blue and green curves. Although not shown, between batches 600 and 1000 the accuracy on a training batch would consistently hit 100%. The advantage of Fenchel lifted networks is clear in the early stages of training, while towards the end the test set accuracy and that of an Adam-trained network converge to the same values.
We also compare Fenchel lifted networks against a LeNet-5 convolutional neural network on MNIST. The architecture consists of two convolutional layers followed by three fully connected layers, with a cross-entropy loss on the last layer. We use ReLU activations and average pooling in our implementation. Figure 3 plots the test set accuracy of the different models.
In Figure 3, our method nearly converges to its final test set accuracy after only 2 epochs, while Adam and SGD need the full 20 epochs to converge. Furthermore, after the first few batches our model attains over 90% accuracy on the test set while the other methods are only at 80%, indicating that our model behaves differently (in a positive way) from traditional networks and giving it a clear advantage in test set accuracy early in training.
5.3 Fenchel lifted networks vs. CNNs on CIFAR-10
In this section, we compare the LeNet-5 architecture trained with first-order methods against Fenchel lifted networks on CIFAR-10. Figure 4 compares the accuracies of the different models.
In this case, the Fenchel lifted network still outperforms the SGD-trained network and only slightly underperforms the Adam-trained network. The larger variability in the accuracy per batch for our model can be attributed to the fact that, in this experiment, when updating the $W$-variables we would take only one gradient step instead of solving (19) and (20) to completion. We did this because we found empirically that solving those subproblems to completion led to poor performance at test time.
6 Conclusion and Future Work
In this paper we propose Fenchel lifted networks, a family of models that provide a rigorous lower bound of the traditional neural network training problem. Fenchel lifted networks are similar to other methods that lift the dimension of the training problem, and thus exhibit many desirable properties in terms of scalability and the parallel structure of its subproblems. As a result, we show that our family of models can be trained efficiently using block coordinate descent where the subproblems can be parallelized across data points and/or layers. Unlike other similar lifted methods, Fenchel lifted networks are able to compete with traditional fully connected and convolutional neural networks on standard classification data sets, and in some cases are able to outperform them.
Future work will look at extending the ideas presented here to Recurrent Neural Networks, as well as exploring how to use the class of models described in the paper to train deeper networks.
References
Supplementary material
Appendix A Variable Scaling
Note that the new model (7) has introduced $L$ additional hyperparameters (the Lagrange multipliers $\lambda_1, \ldots, \lambda_L$). We can use variable scaling and the dual formulation to show how to effectively reduce these to only one hyperparameter. Consider the model with ReLU activations, that is, with the biconvex function $B$ as in (5), and regularization functions $\rho_l(W) = \rho_l \|W\|_F^2$ for $l = 0, \ldots, L$. Note that $B$ is homogeneous of degree 2; that is, for any $\alpha > 0$ and any $(v, u)$ we have
$$B(\alpha v, \alpha u) = \alpha^2 B(v, u).$$
Define $\lambda > 0$ and the scalings
$$\tilde{X}_l = \alpha_l X_l, \qquad \tilde{W}_l = \frac{\alpha_{l+1}}{\alpha_l} W_l, \qquad \alpha_l = \sqrt{\lambda_l / \lambda} \ \ (l = 1, \ldots, L), \qquad \alpha_0 = \alpha_{L+1} = 1.$$
Then (7) becomes
$$\begin{aligned} \min_{(\tilde W_l)_l,\,(\tilde X_l)_l} \quad & \mathcal{L}(Y, \tilde W_L \tilde X_L) + \sum_{l=0}^{L} \rho_l \frac{\alpha_l^2}{\alpha_{l+1}^2} \|\tilde W_l\|_F^2 + \sum_{l=0}^{L-1} \frac{\lambda_{l+1}}{\alpha_{l+1}^2} B(\tilde W_l \tilde X_l, \tilde X_{l+1}) \\ \text{s.t.} \quad & \tilde X_0 = X. \end{aligned} \qquad (21)$$
Using the fact that $\lambda_{l+1}/\alpha_{l+1}^2 = \lambda$ and defining $\tilde{\rho}_l = \rho_l \alpha_l^2 / \alpha_{l+1}^2$, we have
$$\begin{aligned} \min_{(\tilde W_l)_l,\,(\tilde X_l)_l} \quad & \mathcal{L}(Y, \tilde W_L \tilde X_L) + \sum_{l=0}^{L} \tilde{\rho}_l \|\tilde W_l\|_F^2 + \lambda \sum_{l=0}^{L-1} B(\tilde W_l \tilde X_l, \tilde X_{l+1}) \\ \text{s.t.} \quad & \tilde X_0 = X, \end{aligned} \qquad (22)$$
where the penalty on the activations is now only a function of the one variable $\lambda$ as opposed to $L$ variables. Note that this argument for variable scaling still works when we use average pooling or convolution operations in conjunction with a ReLU activation, since they are linear operations. Note furthermore that the same scaling argument works in place of any norm due to the homogeneity of norms; the only thing that would change is how $\rho_l$ is scaled by $\alpha$ and $\lambda$.
Another way to show that we only require one hyperparameter is to note the equivalence
$$B_l(W_l X_l, X_{l+1}) \le 0, \ \ l = 0, \ldots, L-1 \quad \Longleftrightarrow \quad \sum_{l=0}^{L-1} B_l(W_l X_l, X_{l+1}) \le 0,$$
which holds because each $B_l$ is nonnegative. Then we may replace the biconvex constraints in (6) by the single equivalent constraint $\sum_{l} B_l(W_l X_l, X_{l+1}) \le 0$. Since this is only one constraint, when we dualize we introduce only one Lagrange multiplier $\lambda$.
Appendix B One-layer Regression Setting
In this section, we show that for a one-layer network we are able to convert the nonconvex training problem into a convex one by using the BC condition described in the main text.
Consider a regression setting where $Y = \phi(W^\star X)$ for some fixed $W^\star$ and a given data matrix $X$. Given the training set $(X, Y)$, we can solve for $W^\star$ by solving the following nonconvex problem:
$$\min_{W} \; \|Y - \phi(W X)\|_F^2. \qquad (23)$$
We could also solve the following relaxation of (23), based on the BC condition:
$$\min_{W} \; B_\phi(W X, Y). \qquad (24)$$
Note that (24) is trivially convex in $W$ by definition of $B_\phi$. Furthermore, by construction $B_\phi(WX, Y) \ge 0$, and $B_\phi(WX, Y) = 0$ if and only if $Y = \phi(WX)$. Since $Y = \phi(W^\star X)$, it follows that $W^\star$ (which is a minimizer of (23)) is a global minimizer of the convex program (24). Therefore, we can solve the original nonconvex problem (23) to global optimality by instead solving the convex problem presented in (24).
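The following numerical sketch illustrates this for a ReLU activation: gradient descent on the convex relaxation (24) drives the penalty to zero and recovers a global minimizer of the nonconvex problem (23). Data sizes, the step size, and the iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200
W_star = rng.standard_normal((2, n))
X = rng.standard_normal((n, m))
Y = np.maximum(W_star @ X, 0.0)        # data generated by a one-layer ReLU net

W = np.zeros_like(W_star)
for _ in range(2000):
    grad = (np.maximum(W @ X, 0.0) - Y) @ X.T / m   # gradient of B(WX, Y) / m
    W -= 0.1 * grad                                 # gradient descent on (24)

print(np.abs(np.maximum(W @ X, 0.0) - Y).max())     # ~0: (23) solved globally
```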
Appendix C Hyperparameters for Experiments
For all experiments that used batching, the batch size was fixed at 500. We observed empirically that larger batch sizes improved the performance of the lifted models. To speed up computations, we limit the number of iterations used to solve each subproblem, and we empirically find this does not affect final test set performance. For the batched models, we do not use the weight penalties $\rho_l$, since we explicitly regularize through batching (see (13)), while for the non-batched models we use the same penalty $\rho_l$ for all $l$. For models trained using Adam and for models trained using SGD, the learning rates were hyperparameters picked by grid search to give the best final test performance.
For the network architectures described in the experimental results, we used the following hyperparameters:

- Fenchel lifted network for the LeNet-5 architecture;

- Fenchel lifted network for the 784-300-10 architecture (batched);

- Fenchel lifted network for the 784-300-10 architecture (non-batched).
For all weights, the initialization is done through Xavier initialization as implemented in TensorFlow. The $\gamma_l$ variables are chosen to balance the change of the variables across layers between iterations. Although the theory in Appendix A states that we can collapse all hyperparameters into a single hyperparameter, due to time constraints we were unable to implement this change upon submission. We also stress that the hyperparameter search over these values was very coarse and a variety of values worked well in practice; for simplicity, we only present the ones we used to produce the plots in the experimental results.