# Deep Neural Networks with Multi-Branch Architectures Are Less Non-Convex

Hongyang Zhang
Carnegie Mellon University
hongyanz@cs.cmu.edu
Junru Shao
Carnegie Mellon University
junrus@cs.cmu.edu
Ruslan Salakhutdinov
Carnegie Mellon University
rsalakhu@cs.cmu.edu
###### Abstract

Several recently proposed neural network architectures, such as ResNeXt, Inception, Xception, SqueezeNet and Wide ResNet, are built on the design idea of having multiple branches and have demonstrated improved performance in many applications. We show that one cause of this success is that the multi-branch architecture is less non-convex in terms of duality gap. The duality gap measures the degree of intrinsic non-convexity of an optimization problem: a smaller gap in relative value implies a lower degree of intrinsic non-convexity. The challenge is to quantitatively measure the duality gap of highly non-convex problems such as deep neural networks. In this work, we provide strong guarantees on this quantity for two classes of network architectures. For neural networks with arbitrary activation functions, multi-branch architecture and a variant of hinge loss, we show that the duality gap of both population and empirical risks shrinks to zero as the number of branches increases. This result sheds light on the power of over-parametrization, where increasing the network width tends to make the loss surface less non-convex. For neural networks with linear activation function and $\ell_2$ loss, we show that the duality gap of empirical risk is zero. Our two results work for arbitrary depths and adversarial data, while the analytical techniques might be of independent interest to non-convex optimization more broadly. Experiments on both synthetic and real-world datasets validate our results.

## 1 Introduction

Deep neural networks are a central object of study in machine learning, computer vision, and many other domains. They have substantially improved over conventional learning algorithms in many areas, including speech recognition, object detection, and natural language processing [28]. The focus of this work is to investigate the duality gap of deep neural networks. The duality gap is the discrepancy between the optimal values of primal and dual problems. While it has been well understood for convex optimization, little is known for non-convex problems. A smaller duality gap in relative value typically implies that the problem itself is less non-convex, and thus is easier to optimize. (Footnote 1: Although zero duality gap can be attained for some non-convex optimization problems [6, 48, 11], they are in essence convex problems, as can be seen by considering the dual and bi-dual problems, which are always convex. So these problems are relatively easy to optimize compared with other non-convex ones.) Our results establish that: Deep neural networks with multi-branch architecture have small duality gap in relative value.

Our study is motivated by the computational difficulties of deep neural networks due to their non-convex nature. While many works have witnessed the power of local search algorithms for deep neural networks [16], these algorithms typically converge to a suboptimal solution in the worst case according to various empirical observations [52, 28]. It is reported that for a single-hidden-layer neural network, when the number of hidden units is small, stochastic gradient descent may easily get stuck at poor local minima [27, 49]. Furthermore, there is significant evidence indicating that when the networks are deep enough, bad saddle points do exist [1] and might be hard to escape [15, 21, 10, 1].

Given the computational obstacles, several efforts have been devoted to designing new architectures to alleviate the above issues, including over-parametrization [17, 54, 23, 41, 2, 46] and multi-branch architectures [57, 18, 63, 33, 60]. Empirically, increasing the number of hidden units of a single-hidden-layer network encourages first-order methods to converge to a global solution, which probably supports the folklore that the loss surface of a wider network looks more “convex” (see Figure 1). Furthermore, several recently proposed architectures, including ResNeXt [63], Inception [57], Xception [18], SqueezeNet [33] and Wide ResNet [64], are based on having multiple branches and have demonstrated substantial improvement over many of the existing models in many applications. In this work, we show that one cause of this success is that the loss of a multi-branch network is less non-convex in terms of duality gap.

Our Contributions. This paper provides both theoretical and experimental results for the population and empirical risks of deep neural networks by estimating the duality gap.

First, we study the duality gap of deep neural networks with arbitrary activation functions, adversarial data distribution, and multi-branch architecture (see Theorem 1). The multi-branch architecture is general and includes the classic one-hidden-layer architecture as a special case (see Figure 2). Using the Shapley-Folkman lemma, we show that the duality gap of both population and empirical risks shrinks to zero as the number of branches increases. Our result provides a better understanding of various state-of-the-art architectures such as ResNeXt, Inception, Xception, SqueezeNet, and Wide ResNet.

Second, we prove that strong duality (a.k.a. zero duality gap) holds for the empirical risk of deep linear neural networks (see Theorem 2). To this end, we develop multiple new proof techniques, including a reduction to low-rank approximation and the construction of a dual certificate (see Section 4).

Finally, we empirically study the loss surface of multi-branch neural networks. Our experiments verify our theoretical findings.

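Notation. We will use bold capital letters to represent matrices and lower-case letters to represent scalars. Specifically, let $\mathbf{I}$ be the identity matrix and denote by $\mathbf{0}$ the all-zero matrix. Let $\{\mathbf{W}_i \in \mathbb{R}^{d_i \times d_{i-1}} : i = 1, 2, \dots, H\}$ be a set of network parameters, each of which represents the connection weights between the $i$-th and $(i+1)$-th layers of the neural network. We use $\mathbf{W}_{:,t}$ to indicate the $t$-th column of $\mathbf{W}$. We will use $\sigma_i(\mathbf{W})$ to represent the $i$-th largest singular value of matrix $\mathbf{W}$. Given the skinny SVD $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ of matrix $\mathbf{W}$, we denote by $\mathsf{svd}_r(\mathbf{W}) = \mathbf{U}_{:,1:r}\boldsymbol{\Sigma}_{1:r,1:r}\mathbf{V}_{:,1:r}^\top$ the truncated SVD of $\mathbf{W}$ to the first $r$ singular values. For matrix norms, denote by $\|\mathbf{W}\|_{\mathcal{S}_H} = \left(\sum_i \sigma_i^H(\mathbf{W})\right)^{1/H}$ the matrix Schatten-$H$ norm. Nuclear norm and Frobenius norm are special cases of the Schatten-$H$ norm: $\|\mathbf{W}\|_* = \|\mathbf{W}\|_{\mathcal{S}_1}$ and $\|\mathbf{W}\|_F = \|\mathbf{W}\|_{\mathcal{S}_2}$. We use $\|\mathbf{W}\|$ to represent the matrix operator norm, i.e., $\|\mathbf{W}\| = \sigma_1(\mathbf{W})$, and denote by $\mathsf{rank}(\mathbf{W})$ the rank of matrix $\mathbf{W}$. Denote by $\mathsf{Row}(\mathbf{W})$ the span of rows of $\mathbf{W}$. Let $\mathbf{W}^\dagger$ be the Moore-Penrose pseudo-inverse of $\mathbf{W}$.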

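For a convex matrix function $K(\cdot)$, we denote by $K^*(\boldsymbol{\Lambda}) = \max_{\mathbf{M}} \langle \boldsymbol{\Lambda}, \mathbf{M} \rangle - K(\mathbf{M})$ the conjugate function of $K(\cdot)$ and by $\partial K(\cdot)$ the sub-differential. We use $\mathsf{diag}(\sigma_1, \dots, \sigma_r)$ to represent an $r \times r$ diagonal matrix with diagonal entries $\sigma_1, \dots, \sigma_r$. Let $d_{\min} = \min\{d_i : i = 1, 2, \dots, H-1\}$, and $[I] = \{1, 2, \dots, I\}$. For any two matrices $\mathbf{A}$ and $\mathbf{B}$ of matching dimensions, we denote by $[\mathbf{A}, \mathbf{B}]$ the concatenation of $\mathbf{A}$ and $\mathbf{B}$ along the row and by $[\mathbf{A}; \mathbf{B}]$ the concatenation of two matrices along the column.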

## 2 Duality Gap of Multi-Branch Neural Networks

We first study the duality gap of neural networks in a classification setting. We show that the wider the network is, the smaller the duality gap becomes.

Network Setup. The output of our network follows from a multi-branch architecture (see Figure 2):

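$$f(\mathbf{w}; \mathbf{x}) = \frac{1}{I}\sum_{i=1}^{I} f_i(\mathbf{w}_{(i)}; \mathbf{x}), \qquad \mathbf{w}_{(i)} \in \mathcal{W}_i \quad (\mathcal{W}_i \text{ is a convex set}),$$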

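where $\mathbf{w}$ is the concatenation of all network parameters $\{\mathbf{w}_{(i)}\}_{i=1}^{I}$, $\mathbf{x}$ is the input instance, $\{\mathcal{W}_i\}_{i=1}^{I}$ is the parameter space, and $f_i(\mathbf{w}_{(i)}; \cdot)$ represents a continuous real-valued mapping by a sub-network, which is allowed to have arbitrary architecture such as convolutional and recurrent neural networks. As an example, $f_i(\mathbf{w}_{(i)}; \cdot)$ can be in the form of an $H_i$-layer feed-forward sub-network: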

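$$f_i(\mathbf{w}_{(i)}; \mathbf{x}) = \mathbf{w}_i^\top \psi_{H_i}\!\left(\mathbf{W}^{(i)}_{H_i} \cdots \psi_1\!\left(\mathbf{W}^{(i)}_1 \mathbf{x}\right)\right) \in \mathbb{R}, \qquad \mathbf{w}_{(i)} = \left[\mathbf{w}_i; \mathsf{vec}(\mathbf{W}^{(i)}_1); \dots; \mathsf{vec}(\mathbf{W}^{(i)}_{H_i})\right] \in \mathbb{R}^{p_i}.$$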

Hereby, the functions $\psi_k(\cdot)$, $k = 1, 2, \dots, H_i$, are allowed to encode arbitrary forms of continuous element-wise non-linearity (and linearity) after each matrix multiplication, such as sigmoid, rectification, and convolution, while the number of layers $H_i$ in each sub-network can be arbitrary as well. When $H_i = 1$ and the hidden layer of each sub-network has a single unit, i.e., each sub-network in Figure 2 represents one hidden unit, the architecture reduces to a one-hidden-layer network. We apply the so-called $\tau$-hinge loss [4, 7] on top of the network output for label $y \in \{-1, +1\}$:

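$$\ell_\tau(\mathbf{w}; \mathbf{x}, y) := \max\left(0,\; 1 - \frac{y \cdot f(\mathbf{w}; \mathbf{x})}{\tau}\right), \qquad \tau > 0. \tag{1}$$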

The $\tau$-hinge loss has been widely applied in active learning of classifiers and margin-based learning [4, 7]. When $\tau = 1$, it reduces to the classic hinge loss [43, 17, 38].

We make the following assumption on the margin parameter $\tau$, which states that the parameter $\tau$ is sufficiently large.

###### Assumption 1 (Parameter τ).

For a sample $(\mathbf{x}, y)$ drawn from distribution $P$, we have $\tau > y \cdot f(\mathbf{w}; \mathbf{x})$ for all $\mathbf{w} \in \mathcal{W}_1 \times \mathcal{W}_2 \times \dots \times \mathcal{W}_I$ with probability measure $1$.

We further empirically observe that using smaller values of the parameter $\tau$ and other loss functions supports our theoretical result as well (see experiments in Section 5). It is an interesting open question to extend our theory to more general losses in the future.
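The network setup above can be illustrated with a minimal sketch in plain Python. It assumes, purely for illustration, two hypothetical branches that are one-layer feed-forward sub-networks with a `tanh` non-linearity; the function names, weights and input are not from the paper.

```python
import math

def branch_output(top_weights, hidden_weights, x):
    """One branch: w_i^T tanh(W_1^(i) x), a one-layer feed-forward sub-network."""
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in hidden_weights]
    return sum(t * h for t, h in zip(top_weights, hidden))

def network_output(branches, x):
    """f(w; x): the average of the I branch outputs."""
    return sum(branch_output(top, hid, x) for top, hid in branches) / len(branches)

def tau_hinge_loss(output, y, tau):
    """max(0, 1 - y * f / tau) for a label y in {-1, +1} and tau > 0."""
    return max(0.0, 1.0 - y * output / tau)

# Two hypothetical branches acting on a 2-dimensional input (illustrative values).
branches = [
    ([1.0, -1.0], [[0.5, 0.0], [0.0, 0.5]]),  # branch with two hidden units
    ([2.0], [[1.0, 1.0]]),                    # branch with one hidden unit
]
x, y = [1.0, -1.0], 1
out = network_output(branches, x)
loss = tau_hinge_loss(out, y, tau=4.0)
```

In this toy instance `tanh` is bounded by 1 and the top-layer weights of each branch have absolute sum 2, so the network output never exceeds 2 in absolute value. Choosing `tau=4.0` therefore keeps `1 - y * f / tau` strictly positive for every input, which is the situation described by Assumption 1.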

To study how closely these generic neural network architectures approach the family of convex functions, we analyze the duality gap of minimizing the risk w.r.t. the loss (1) with an extra regularization constraint. The normalized duality gap is a measure of intrinsic non-convexity of a given function [13]: the gap is zero when the given function itself is convex, and is large when the loss surface is intrinsically far from convex. Typically, the closer the network is to the family of convex functions, the easier it is to optimize.

Multi-Branch Architecture. Our analysis of multi-branch neural networks is built upon a tool from non-convex geometric analysis: the Shapley–Folkman lemma. Basically, the Shapley–Folkman lemma states that the sum of constrained non-convex functions is close to being convex. A neural network is an ideal target for this lemma: the width of the network is associated with the number of summand functions. So intuitively, the wider the neural network is, the smaller the duality gap will be. In particular, we study the following non-convex problem concerning the population risk:

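$$\min_{\mathbf{w} \in \mathcal{W}_1 \times \dots \times \mathcal{W}_I} \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[\ell_\tau(\mathbf{w}; \mathbf{x}, y)\right], \qquad \text{s.t.} \quad \frac{1}{I}\sum_{i=1}^{I} h_i(\mathbf{w}_{(i)}) \le K, \tag{2}$$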

where $h_i(\cdot)$, $i \in [I]$, are convex regularization functions, e.g., the weight decay, and $K$ can be arbitrary such that the problem is feasible. Correspondingly, the dual problem of problem (2) is a one-dimensional convex optimization problem (Footnote 2: Although problem (3) is convex, this does not necessarily mean that the problem can be solved easily, because evaluating its objective is itself a hard problem. So rather than trying to solve the convex dual problem, our goal is to study the duality gap in order to understand the degree of non-convexity of the problem.):

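$$\max_{\lambda \ge 0}\; \mathcal{Q}(\lambda) - \lambda K, \qquad \text{where} \quad \mathcal{Q}(\lambda) := \inf_{\mathbf{w} \in \mathcal{W}_1 \times \dots \times \mathcal{W}_I} \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[\ell_\tau(\mathbf{w}; \mathbf{x}, y)\right] + \frac{\lambda}{I}\sum_{i=1}^{I} h_i(\mathbf{w}_{(i)}). \tag{3}$$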

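For $\tilde{\mathbf{w}} \in \mathcal{W}_i$, denote by $\tilde f_i(\tilde{\mathbf{w}})$ the convex relaxation of the function $\mathbb{E}_{(\mathbf{x}, y) \sim P}\left[1 - y \cdot f_i(\,\cdot\,; \mathbf{x})/\tau\right]$ on $\mathcal{W}_i$. For $\tilde{\mathbf{w}} \in \mathcal{W}_i$, we also define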

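$$\hat f_i(\tilde{\mathbf{w}}) := \inf_{\mathbf{w}_{(i)} \in \mathcal{W}_i} \left\{ \mathbb{E}_{(\mathbf{x}, y) \sim P}\left(1 - \frac{y \cdot f_i(\mathbf{w}_{(i)}; \mathbf{x})}{\tau}\right) : h_i(\mathbf{w}_{(i)}) \le h_i(\tilde{\mathbf{w}}) \right\}.$$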

Our main results for multi-branch neural networks are as follows:

###### Theorem 1.

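Denote by $\inf(\mathbf{P})$ the minimum of primal problem (2) and by $\sup(\mathbf{D})$ the maximum of dual problem (3). Let $\Delta_i := \sup_{\mathbf{w} \in \mathcal{W}_i} \{\hat f_i(\mathbf{w}) - \tilde f_i(\mathbf{w})\} \ge 0$ and $\Delta_{worst} := \max_{i \in [I]} \Delta_i$. Suppose the $\mathcal{W}_i$'s are compact and both $f_i(\mathbf{w}_{(i)}; \mathbf{x})$ and $h_i(\mathbf{w}_{(i)})$ are continuous w.r.t. $\mathbf{w}_{(i)}$. If there exists at least one feasible solution of problem (P), then under Assumption 1 the duality gap w.r.t. problems (2) and (3) can be bounded by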

$$0 \le \frac{\inf(\mathbf{P}) - \sup(\mathbf{D})}{\Delta_{worst}} \le \frac{2}{I}.$$

Note that $\Delta_i$ measures the divergence between the function value of $\hat f_i$ and its convex relaxation $\tilde f_i$. The constant $\Delta_{worst}$ is the maximal divergence among all sub-networks, which grows slowly with the increase of $I$. This is because $\Delta_{worst}$ only measures the divergence of one branch. The normalized duality gap $(\inf(\mathbf{P}) - \sup(\mathbf{D}))/\Delta_{worst}$ has been widely used before to measure the degree of non-convexity of optimization problems [13, 58, 14, 24, 22]. Such a normalization avoids trivialities in characterizing the degree of non-convexity: scaling the objective function by any constant does not change the value of the normalized duality gap. Even though Theorem 1 is stated for the population risk, the conclusion holds for the empirical loss as well. This can be achieved by setting the marginal distribution $P_{\mathbf{x}}$ as the uniform distribution on a finite set and $P_{y|\mathbf{x}}$ as the corresponding labels uniformly distributed on the same finite set.
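The shrinking gap can be observed on a small separable problem. The sketch below is a hypothetical toy instance, not a neural network and not taken from the paper: each of the $I$ "branches" picks a value in the non-convex set $\{0, 1\}$, the objective is the average of the picks, and the constraint requires that average to be at least $1/2$. The names `primal_value`, `dual_value` and `duality_gap` are illustrative, and exact rational arithmetic is used so that the gaps come out exactly.

```python
from fractions import Fraction
from itertools import product

K = Fraction(1, 2)  # toy constraint level: the average pick must be at least 1/2

def primal_value(num_branches):
    """Brute-force min of (1/I) * sum(x_i) s.t. (1/I) * sum(x_i) >= K, x_i in {0, 1}."""
    feasible = []
    for xs in product((0, 1), repeat=num_branches):
        average = Fraction(sum(xs), num_branches)
        if average >= K:
            feasible.append(average)
    return min(feasible)

def dual_value(grid_points=200):
    """Max over lambda in {0, 0.01, ..., 2} of the Lagrangian dual function."""
    best = None
    for k in range(grid_points + 1):
        lam = Fraction(k, 100)
        # The Lagrangian separates over branches; all branches are identical here,
        # so the average of the per-branch minima equals a single branch minimum.
        branch_min = min((1 - lam) * x for x in (0, 1))
        value = branch_min + lam * K
        if best is None or value > best:
            best = value
    return best

def duality_gap(num_branches):
    """Primal optimum minus dual optimum for the toy problem with I branches."""
    return primal_value(num_branches) - dual_value()

gaps = {i: duality_gap(i) for i in (1, 3, 5, 7)}
```

For odd $I$ the primal optimum is $(I+1)/(2I)$ while the dual optimum stays at $1/2$ (attained at $\lambda = 1$, which lies on the grid), so the gap is $1/(2I)$; for even $I$ the gap vanishes. The gap therefore decays at a $1/I$ rate, in the same spirit as the bound in Theorem 1.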

Inspiration for Architecture Designs. Theorem 1 shows that the loss surface of a deep network is less non-convex when the width $I$ is large; when $I \to +\infty$, surprisingly, training a deep network is as easy as a convex optimization. An intuitive explanation is that the large number of randomly initialized hidden units represent all possible features. Thus the optimization problem involves just training the top layer of the network, which is convex. Our result encourages a class of network architectures with multiple branches and supports some of the most successful architectures in practice, such as Inception [57], Xception [18], ResNeXt [63], SqueezeNet [33], Wide ResNet [64], and Shake-Shake regularization [25], all of which benefit from the split-transform-merge behaviour shown in Figure 2. The theory sheds light on the strong performance of these architectures.

Related Works. While many efforts have been devoted to studying the local minima or saddle points of deep neural networks [42, 68, 55, 36, 62, 61], little is known about the duality gap of deep networks. In particular, Choromanska et al. [20, 19] showed that the number of poor local minima cannot be too large. Kawaguchi [35] improved over the results of [20, 19] by assuming that the activation functions are independent Bernoulli variables and the input data are drawn from a Gaussian distribution. Xie et al. [62] and Haeffele et al. [30] studied the local minima of regularized networks, but they require that either the network be shallow or the network weights be rank-deficient. Ge et al. [27] showed that every local minimum is globally optimal by modifying the activation function. Zhang et al. [67] and Aslan et al. [3] reduced the non-linear activation to the linear case by kernelization and relaxed the non-convex problem to a convex one. However, no formal guarantee was provided for the tightness of the relaxation. Theorem 1, on the other hand, bounds the duality gap of deep neural networks under mild assumptions.

Another line of research studies the convexity behaviour of neural networks when the number of hidden neurons goes to infinity. In particular, Bach [5] proved that training a single-hidden-layer network is as easy as a convex optimization by using classical non-Euclidean regularization tools. Bengio et al. [12] showed a similar phenomenon for multi-layer networks with an incremental algorithm. In comparison, Theorem 1 not only captures the convexification phenomenon when $I \to +\infty$, but also goes beyond this result, as it characterizes the convergence rate of convexity of neural networks in terms of duality gap. Furthermore, the conclusion in Theorem 1 holds for the population risk, which was unknown before.

## 3 Strong Duality of Linear Neural Networks

In this section, we show that the duality gap is zero if the activation function is linear. Deep linear neural networks have received significant attention in recent years [51, 35, 67, 44, 8, 28, 31, 9] because of their simple formulation and their connection to non-linear neural networks. (Footnote 3: Although the expressive power of deep linear neural networks and three-layer linear neural networks is the same, the analysis of the landscapes of the two models is significantly different, as pointed out by [28, 35, 44].)

Network Setup. We discuss the strong duality of regularized deep linear neural networks of the form

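$$(\mathbf{W}_1^*, \dots, \mathbf{W}_H^*) = \operatorname*{argmin}_{\mathbf{W}_1, \dots, \mathbf{W}_H} \frac{1}{2}\left\|\mathbf{Y} - \mathbf{W}_H \cdots \mathbf{W}_1 \mathbf{X}\right\|_F^2 + \frac{\gamma}{H}\left[\left\|\mathbf{W}_1 \mathbf{X}\right\|_{\mathcal{S}_H}^H + \sum_{i=2}^{H}\left\|\mathbf{W}_i\right\|_{\mathcal{S}_H}^H\right], \tag{4}$$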

where $\mathbf{X}$ is the given instance matrix, $\mathbf{Y}$ is the given label matrix, and $\mathbf{W}_i$, $i \in [H]$, represents the weight matrix in each linear layer. We mention that (a) while the linear operation is simple matrix multiplication in problem (4), it can be easily extended to other linear operators, e.g., the convolutional operator or the linear operator with the bias term, by properly involving a group of kernels in the variable $\mathbf{W}_i$ [30]. (b) The regularization terms in problem (4) are of common interest, e.g., see [30]. When $H = 2$, our regularization terms reduce to $\frac{\gamma}{2}\left[\|\mathbf{W}_1\mathbf{X}\|_F^2 + \|\mathbf{W}_2\|_F^2\right]$, which is well known as the weight-decay or Tikhonov regularization. (c) The regularization parameter $\gamma$ is the same for each layer since we have no further information on the preference of layers.
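For the two-layer case $H = 2$, the Schatten-$2$ norm is the Frobenius norm, so the objective of problem (4) can be evaluated with elementary matrix arithmetic. The following is a minimal sketch under that assumption, with matrices stored as lists of rows; the helper names and the small matrices are illustrative, not from the paper.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def frob_sq(m):
    """Squared Frobenius norm of a matrix."""
    return sum(v * v for row in m for v in row)

def objective_h2(w1, w2, x, y, gamma):
    """Problem (4) for H = 2: fit term plus (gamma/2) * (||W1 X||_F^2 + ||W2||_F^2)."""
    w1x = matmul(w1, x)
    pred = matmul(w2, w1x)
    residual = [[yv - pv for yv, pv in zip(yr, pr)] for yr, pr in zip(y, pred)]
    return 0.5 * frob_sq(residual) + 0.5 * gamma * (frob_sq(w1x) + frob_sq(w2))

# Illustrative data: two 2-dimensional instances and a rank-one linear network.
x = [[1.0, 0.0], [0.0, 1.0]]
y = [[2.0, 0.0], [0.0, 1.0]]
w1 = [[1.0, 0.0]]      # 1 x 2 first-layer weights
w2 = [[2.0], [0.0]]    # 2 x 1 second-layer weights
value = objective_h2(w1, w2, x, y, gamma=0.5)
```

Here the residual has squared Frobenius norm 1, and the two regularization terms contribute 1 and 4, so the objective is $0.5 + 0.25 \cdot 5 = 1.75$. Setting `gamma=0.0` recovers the un-regularized fit term alone.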

Our analysis leads to the following guarantees for the deep linear neural networks.

###### Theorem 2.

Denote by and . Let and , where stands for the minimal non-zero singular value of . Then the strong duality holds for deep linear neural network (4). In other words, the optimum of problem (4) is the same as its convex dual problem

 (5)

where is a convex function. Moreover, the optimal solutions of primal problem (4) can be obtained from the dual problem (5) in the following way: let be the skinny SVD of matrix , then for , and is a globally optimal solution to problem (4).

The regularization parameter $\gamma$ cannot be too large in order to avoid underfitting. Our result provides a suggested upper bound for the regularization parameter, and this bound oftentimes characterizes the level of random noise. When $\gamma = 0$, our analysis reduces to the un-regularized deep linear neural network, a model which has been widely studied in [35, 44, 8, 28].

Theorem 2 implies the following result on the landscape of deep linear neural networks: the regularized deep learning problem can be converted into an equivalent convex problem via its dual. We note that strong duality rarely holds in non-convex optimization: matrix completion [6], Fantope [48], and quadratic optimization with two quadratic constraints [11] are among the few paradigms that enjoy strong duality. For deep networks, the effectiveness of convex relaxation has been observed empirically in [3, 67], but much remains unknown about the theoretical guarantees of the relaxation. Our work shows strong duality of regularized deep linear neural networks and provides an alternative approach to overcome the computational obstacles due to the non-convexity: one can apply convex solvers, e.g., the Douglas–Rachford algorithm, to problem (5) and then conduct singular value decomposition to compute the weights from the dual solution. (Footnote 4: Grussler et al. [29] provided a fast algorithm to compute the proximal operators needed here. Hence, the Douglas–Rachford algorithm can find the global solution up to a prescribed error in function value; see [32] for the running time.) In addition, our result inherits the benefits of convex analysis. The vast majority of results on deep learning study the generalization error or expressive power by analyzing its complicated non-convex form [47, 66, 65]. In contrast, with strong duality one can investigate various properties of deep linear networks through a much simpler convex form.

Related Works. The goal of convexified linear neural networks is to relax the non-convex form of deep learning to computable convex formulations [67, 3]. While several efforts have been devoted to investigating the effectiveness of such convex surrogates, e.g., by analyzing the generalization error after the relaxation [67], little is known about whether the relaxation is tight with respect to the original problem. Our result, on the other hand, provides theoretical guarantees for the tightness of the convex relaxation of deep linear networks, a phenomenon observed empirically in [3, 67].

We mention another related line of research: no bad local minima. On one hand, although recent works have shown the absence of spurious local minima for deep linear neural networks [50, 35, 44], many of them lack a theoretical analysis of the regularization term. Specifically, Kawaguchi [35] showed that un-regularized deep linear neural networks have no spurious local minimum. Lu and Kawaguchi [44] proved that depth creates no bad local minimum for un-regularized deep linear neural networks. In contrast, our optimization problem is more general by taking the regularization term into account. On the other hand, even if the “local=global” argument holds for deep linear neural networks, it is still hard to escape bad saddle points [1]. In particular, Kawaguchi [35] proved that for linear networks deeper than three layers, there exist bad saddle points at which the Hessian does not have any negative eigenvalue. Therefore, the state-of-the-art algorithms designed to escape saddle points might not be applicable [34, 26]. Our result provides an alternative approach to solving deep linear networks by convex programming, which bypasses the computational issues incurred by the bad saddle points.

## 4 Our Techniques and Proof Sketches

In this section, we present our techniques and proof sketches of Theorems 1 and 2.

(a) Shapley-Folkman Lemma. The proof of Theorem 1 is built upon the Shapley-Folkman lemma [22, 56, 24, 13], which characterizes a convexification phenomenon concerning the average of multiple sets and is analogous to the central limit theorem in probability theory. Consider the averaged Minkowski sum of $I$ sets $\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_I$ given by $\left\{\frac{1}{I}\sum_{j \in [I]} a_j : a_j \in \mathcal{A}_j\right\}$. Intuitively, the lemma states that $\rho\left(\frac{1}{I}\sum_{j \in [I]} \mathcal{A}_j\right) \to 0$ as $I \to +\infty$, where $\rho(\cdot)$ is a metric of the non-convexity of a set (see Figure 3 for visualization). We apply this lemma to the optimization formulation of deep neural networks. Denote by augmented epigraph the set $\{(h(\mathbf{w}), \ell(\mathbf{w})) : \text{all possible choices of } \mathbf{w}\}$, where $h$ is the constraint and $\ell$ is the objective function in the optimization problem. The key observation is that the augmented epigraph of the neural network loss with multi-branch architecture can be expressed as the Minkowski average of the augmented epigraphs of all branches. Thus we obtain a natural connection between an optimization problem and its corresponding augmented epigraph. Applying the Shapley-Folkman lemma to the augmented epigraph leads to a characterization of the non-convexity of the deep neural network.
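The convexification effect of Minkowski averaging can be seen on the simplest non-convex set. The sketch below averages $I$ copies of the two-point set $\{0, 1\}$ on the real line and measures non-convexity as the largest distance from a point of the convex hull to the averaged set. This distance is an assumed stand-in for the metric of non-convexity used in the lemma, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def minkowski_average(sets):
    """All points (1/I) * (a_1 + ... + a_I) with a_j drawn from the j-th set."""
    count = len(sets)
    return sorted({Fraction(sum(choice), count) for choice in product(*sets)})

def nonconvexity(points):
    """Largest distance from a point of the convex hull (an interval) to the set."""
    gaps = [b - a for a, b in zip(points, points[1:])]
    return max(gaps) / 2 if gaps else Fraction(0)

two_point_set = (0, 1)
levels = {i: nonconvexity(minkowski_average([two_point_set] * i)) for i in (1, 2, 4, 8)}
```

Averaging $I$ copies of $\{0, 1\}$ yields the grid $\{0, 1/I, \dots, 1\}$, whose convex hull is always $[0, 1]$, so the measured non-convexity is $1/(2I)$ and tends to zero as $I$ grows.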

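(b) Variational Form. The proof of Theorem 2 is built upon techniques (b), (c), and (d). In particular, problem (4) is highly non-convex due to its multi-linear form over the optimized variables $\{\mathbf{W}_i\}_{i=1}^{H}$. Fortunately, we are able to analyze the problem by grouping $\mathbf{W}_H \mathbf{W}_{H-1} \cdots \mathbf{W}_1 \mathbf{X}$ together and converting the original non-convex problem in terms of the separate variables $\{\mathbf{W}_i\}_{i=1}^{H}$ to a convex optimization with respect to the new grouping variable $\mathbf{W}_H \mathbf{W}_{H-1} \cdots \mathbf{W}_1 \mathbf{X}$. This typically requires us to represent the objective function of (4) as a convex function of $\mathbf{W}_H \mathbf{W}_{H-1} \cdots \mathbf{W}_1 \mathbf{X}$. To this end, we prove that $\|\mathbf{W}_H \mathbf{W}_{H-1} \cdots \mathbf{W}_1 \mathbf{X}\|_* = \min \frac{1}{H}\left[\|\tilde{\mathbf{W}}_1 \mathbf{X}\|_{\mathcal{S}_H}^H + \sum_{i=2}^{H} \|\tilde{\mathbf{W}}_i\|_{\mathcal{S}_H}^H\right]$, where the minimum is taken over all $\tilde{\mathbf{W}}_1, \dots, \tilde{\mathbf{W}}_H$ with $\tilde{\mathbf{W}}_H \cdots \tilde{\mathbf{W}}_1 \mathbf{X} = \mathbf{W}_H \cdots \mathbf{W}_1 \mathbf{X}$ (see Lemma 4 in Appendix C). So the objective function in problem (4) has an equivalent form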

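$$\min_{\mathbf{W}_1, \dots, \mathbf{W}_H} \frac{1}{2}\left\|\mathbf{Y} - \mathbf{W}_H \mathbf{W}_{H-1} \cdots \mathbf{W}_1 \mathbf{X}\right\|_F^2 + \gamma\left\|\mathbf{W}_H \mathbf{W}_{H-1} \cdots \mathbf{W}_1 \mathbf{X}\right\|_*. \tag{6}$$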

This observation enables us to represent the optimization problem as a convex function of the output of a neural network. Therefore, we can analyze the non-convex problem by applying powerful tools from convex analysis.
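As a sanity check on the passage from the regularizer of problem (4) to the nuclear norm in problem (6), consider the scalar case, where every weight $w_i$ and the data $x \neq 0$ are real numbers. This is an illustrative simplification rather than the argument of Lemma 4. Write $a_1 = |w_1 x|$, $a_i = |w_i|$ for $i \ge 2$, and $z = |w_H \cdots w_1 x| = a_1 a_2 \cdots a_H$. The arithmetic-geometric mean inequality applied to $a_1^H, \dots, a_H^H$ gives:

```latex
\frac{1}{H}\sum_{i=1}^{H} a_i^{H}
\;\ge\; \Big(\prod_{i=1}^{H} a_i^{H}\Big)^{1/H}
\;=\; \prod_{i=1}^{H} a_i \;=\; z,
\qquad \text{with equality if and only if } a_1 = \dots = a_H = z^{1/H}.
```

Hence, among all scalar factorizations with the same product, the smallest value of $\frac{\gamma}{H}\sum_{i} a_i^H$ is $\gamma z$, which is the scalar analogue of replacing the layer-wise Schatten-$H$ regularizer by $\gamma$ times the nuclear norm of the network output.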

(c) Reduction to Low-Rank Approximation. Our results of strong duality concerning problem (6) are inspired by the problem of low-rank matrix approximation:

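$$\min_{\mathbf{W}_1, \dots, \mathbf{W}_H} \frac{1}{2}\left\|\mathbf{Y} - \boldsymbol{\Lambda}^* - \mathbf{W}_H \mathbf{W}_{H-1} \cdots \mathbf{W}_1 \mathbf{X}\right\|_F^2. \tag{7}$$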

We know that all local solutions of (7) are globally optimal [35, 44, 6]. To analyze the more general regularized problem (4), our main idea is to reduce problem (6) to the form of (7) by the Lagrangian function. In other words, the Lagrangian function of problem (6) should be of the form (7) for a fixed Lagrangian variable $\boldsymbol{\Lambda}^*$, which we will construct later in subsection (d). While some prior works attempted to apply a similar reduction, their conclusions either depended on unrealistic conditions on local solutions, e.g., that all local solutions are rank-deficient [30, 29], or relied on strong assumptions on the objective functions, e.g., that the objective functions are twice-differentiable [30], which do not apply to the non-smooth problem (6). Instead, our results bypass these obstacles by formulating the strong duality of problem (6) as the existence of a dual certificate $\boldsymbol{\Lambda}^*$ satisfying certain dual conditions (see Lemma 6 in Appendix C). Roughly, the dual conditions state that the optimal solution $(\mathbf{W}_1^*, \dots, \mathbf{W}_H^*)$ of problem (6) is locally optimal to problem (7). On one hand, by the above-mentioned properties of problem (7), $(\mathbf{W}_1^*, \dots, \mathbf{W}_H^*)$ globally minimizes the Lagrangian function when $\boldsymbol{\Lambda}$ is fixed to $\boldsymbol{\Lambda}^*$. On the other hand, by the convexity of the nuclear norm, for the fixed $(\mathbf{W}_1^*, \dots, \mathbf{W}_H^*)$ the Lagrangian variable $\boldsymbol{\Lambda}^*$ globally optimizes the Lagrangian function. Thus $(\mathbf{W}_1^*, \dots, \mathbf{W}_H^*, \boldsymbol{\Lambda}^*)$ is a primal-dual saddle point of the Lagrangian function of problem (6). The desired strong duality is a straightforward result of this argument.

(d) Dual Certificate. The remaining proof is to construct a dual certificate such that the dual conditions hold true. The challenge is that the dual conditions impose several constraints simultaneously on the dual certificate (see condition (19) in Appendix C), making it hard to find a desired certificate. This is why progress on the dual certificate has focused on convex programming. To resolve the issue, we carefully choose the certificate as an appropriate scaling of subgradient of nuclear norm around a low-rank solution, where the nuclear norm follows from our regularization term in technique (b). Although the nuclear norm has infinitely many subgradients, we prove that our construction of dual certificate obeys all desired dual conditions. Putting techniques (b), (c), and (d) together, our proof of strong duality is completed.

## 5 Experiments

In this section, we verify our theoretical contributions experimentally. We release our PyTorch code at https://github.com/hongyanz/multibranch.

### 5.1 Visualization of Loss Landscape

Experiments on Synthetic Datasets. We first show that over-parametrization results in a less non-convex loss surface for a synthetic dataset. The dataset consists of examples in whose labels are generated by an underlying one-hidden-layer ReLU network with 11 hidden neurons [49]. We make use of the visualization technique employed by [40] to plot the landscape, where we project the high-dimensional hinge loss () landscape onto a 2-d plane spanned by three points. These points are found by running the SGD algorithm with three different initializations until the algorithm converges. As shown in Figure 1, the landscape exhibits strong non-convexity with lots of local minima in the under-parameterized case . But as increases, the landscape becomes more convex. In the extreme case, when there are hidden neurons in the network, no non-convexity can be observed on the landscape.

Experiments on MNIST and CIFAR-10. We next verify the phenomenon of over-parametrization on MNIST [39] and CIFAR-10 [37] datasets. For both datasets, we follow the standard preprocessing step that each pixel is normalized by subtracting its mean and dividing by its standard deviation. We do not apply data augmentation. For MNIST, we consider a single-hidden-layer network defined as: , where , , is the input dimension, is the number of hidden neurons, and is the number of branches, with and . For CIFAR-10, in addition to considering the exact same one-hidden-layer architecture, we also test a deeper network containing hidden layers of size --, with ReLU activations and . We apply 10-class hinge loss on the top of the output of considered networks.
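The preprocessing step described above can be sketched as follows. The sketch assumes that the mean and standard deviation are computed per pixel position across the dataset, which is one reading of the description. The function name is illustrative, and the guard for constant pixels is an addition made here to avoid division by zero.

```python
import statistics

def standardize_pixels(images):
    """Standardize each pixel position across a dataset of flattened images."""
    columns = list(zip(*images))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; constant pixels get std 1.0 to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(img, means, stds)] for img in images]

# Three hypothetical flattened "images" with three pixels each.
images = [[0.0, 10.0, 5.0], [2.0, 10.0, 7.0], [4.0, 10.0, 9.0]]
normalized = standardize_pixels(images)
```

After this step every non-constant pixel position has zero mean and unit variance over the dataset, while a constant position (the middle pixel here) is mapped to zero.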

Figure 4 shows the changes of landscapes when increases from to for MNIST, and from to for CIFAR-10, respectively. When there is only one branch, the landscapes have strong non-convexity with many local minima. As the number of branches increases, the landscape becomes more convex. When for -hidden-layer networks on MNIST and CIFAR-10, and for -hidden-layer network on CIFAR-10, the landscape is almost convex.

### 5.2 Frequency of Hitting Global Minimum

To further analyze the non-convexity of loss surfaces, we consider various one-hidden-layer networks, where each network was trained 100 times using different initialization seeds under the setting discussed in our synthetic experiments of Section 5.1. Since we have the ground-truth global minimum, we record the frequency that SGD hits the global minimum up to a small error after iterations. Table 1 shows that increasing the number of hidden neurons results in higher hitting rate of global optimality. This further verifies that the loss surface of one-hidden-layer neural network becomes less non-convex as the width increases.
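The hitting frequency reported in this kind of experiment can be computed as in the following sketch. The function name, the tolerance, and the list of final losses are hypothetical placeholders, not values or results from the paper.

```python
def hitting_frequency(final_losses, global_minimum, tolerance):
    """Fraction of runs whose final loss is within `tolerance` of the global minimum."""
    hits = sum(1 for loss in final_losses if loss - global_minimum <= tolerance)
    return hits / len(final_losses)

# Hypothetical final losses from five independent runs (not results from the paper).
runs = [0.0004, 0.31, 0.0001, 0.0009, 0.12]
rate = hitting_frequency(runs, global_minimum=0.0, tolerance=1e-3)
```

In an actual experiment, `final_losses` would hold one entry per random initialization seed, and the same tolerance would be used for every network width so that the rates are comparable.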

## 6 Conclusions

In this work, we study the duality gap for two classes of network architectures. For the neural network with arbitrary activation functions, multi-branch architecture and $\tau$-hinge loss, we show that the duality gap of both population and empirical risks shrinks to zero as the number of branches increases. Our result sheds light on the power of over-parametrization and the state-of-the-art architectures, where increasing the number of branches tends to make the loss surface less non-convex. For the neural network with linear activation function and $\ell_2$ loss, we show that the duality gap is zero. Our two results work for arbitrary depths and adversarial data, while the analytical techniques might be of independent interest to non-convex optimization more broadly.

Acknowledgements. We would like to thank Jason D. Lee for introducing the Shapley-Folkman lemma, and Maria-Florina Balcan, David P. Woodruff and Xiaofei Shi for their thoughtful comments on the paper.

## References

• [1] A. Anandkumar and R. Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Annual Conference on Learning Theory, pages 81–102, 2016.
• [2] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, 2018.
• [3] Ö. Aslan, X. Zhang, and D. Schuurmans. Convex deep learning via normalized kernels. In Advances in Neural Information Processing Systems, pages 3275–3283, 2014.
• [4] P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152–192, 2016.
• [5] F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
• [6] M.-F. Balcan, Y. Liang, D. P. Woodruff, and H. Zhang. Matrix completion and related problems via strong duality. In Innovations in Theoretical Computer Science, 2018.
• [7] M.-F. Balcan and H. Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4799–4808, 2017.
• [8] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
• [9] P. Baldi and Z. Lu. Complex-valued autoencoders. Neural Networks, 33:136–147, 2012.
• [10] P. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. In European Conference on Computational Learning Theory, pages 50–62, 1999.
• [11] A. Beck and Y. C. Eldar. Strong duality in nonconvex quadratic optimization with two quadratic constraints. SIAM Journal on Optimization, 17(3):844–860, 2006.
• [12] Y. Bengio, N. L. Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems, pages 123–130, 2006.
• [13] D. P. Bertsekas and N. R. Sandell. Estimates of the duality gap for large-scale separable nonconvex optimization problems. In IEEE Conference on Decision and Control, volume 21, pages 782–785, 1982.
• [14] Y. Bi and A. Tang. Refined Shapely-Folkman lemma and its application in duality gap estimation. arXiv preprint arXiv:1610.05416, 2016.
• [15] A. Blum and R. L. Rivest. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pages 494–501, 1989.
• [16] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In International Conference on Machine Learning, 2017.
• [17] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations, 2018.
• [18] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
• [19] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–200, 2015.
• [20] A. Choromanska, Y. LeCun, and G. B. Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Annual Conference on Learning Theory, pages 1756–1760, 2015.
• [21] B. DasGupta, H. T. Siegelmann, and E. Sontag. On the complexity of training neural networks with continuous activation functions. IEEE Transactions on Neural Networks, 6(6):1490–1504, 1995.
• [22] A. d’Aspremont and I. Colin. An approximate Shapley-Folkman theorem. arXiv preprint arXiv:1712.08559, 2017.
• [23] S. S. Du and J. D. Lee. On the power of over-parametrization in neural networks with quadratic activation. In International Conference on Machine Learning, 2018.
• [24] E. X. Fang, H. Liu, and M. Wang. Blessing of massive scale: Spatial graphical model estimation with a total cardinality constraint. 2015.
• [25] X. Gastaldi. Shake-Shake regularization. arXiv preprint arXiv:1705.07485, 2017.
• [26] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points-online stochastic gradient for tensor decomposition. In Annual Conference on Learning Theory, pages 797–842, 2015.
• [27] R. Ge, J. D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. In International Conference on Learning Representations, 2017.
• [28] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
• [29] C. Grussler, A. Rantzer, and P. Giselsson. Low-rank optimization with convex constraints. arXiv:1606.01793, 2016.
• [30] B. D. Haeffele and R. Vidal. Global optimality in neural network training. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.
• [31] M. Hardt and T. Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017.
• [32] B. He and X. Yuan. On the convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
• [33] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
• [34] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. arXiv:1703.00887, 2017.
• [35] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
• [36] K. Kawaguchi, B. Xie, and L. Song. Deep semi-random features for nonlinear function approximation. In AAAI Conference on Artificial Intelligence, 2018.
• [37] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
• [38] T. Laurent and J. von Brecht. The multilinear structure of ReLU networks. arXiv preprint arXiv:1712.10132, 2017.
• [39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
• [40] H. Li, Z. Xu, G. Taylor, and T. Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.
• [41] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Annual Conference on Learning Theory, 2017.
• [42] S. Liang, R. Sun, J. D. Lee, and R. Srikant. Adding one neuron can eliminate all bad local minima. arXiv preprint arXiv:1805.08671, 2018.
• [43] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, 2018.
• [44] H. Lu and K. Kawaguchi. Depth creates no bad local minima. arXiv:1702.08580, 2017.
• [45] T. L. Magnanti, J. F. Shapiro, and M. H. Wagner. Generalized linear programming solves the dual. Management Science, 22(11):1195–1203, 1976.
• [46] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
• [47] B. Neyshabur, R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.
• [48] M. L. Overton and R. S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM Journal on Matrix Analysis and Applications, 13(1):41–45, 1992.
• [49] I. Safran and O. Shamir. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, 2017.
• [50] A. M. Saxe. Deep linear neural networks: A theory of learning in the brain and mind. PhD thesis, Stanford University, 2015.
• [51] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
• [52] S. Shalev-Shwartz, O. Shamir, and S. Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, 2017.
• [53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
• [54] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.
• [55] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv:1605.08361, 2016.
• [56] R. M. Starr. Quasi-equilibria in markets with non-convex preferences. Econometrica: Journal of the Econometric Society, pages 25–38, 1969.
• [57] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, pages 4278–4284, 2017.
• [58] M. Udell and S. Boyd. Bounding duality gap for separable problems with linear constraints. Computational Optimization and Applications, 64(2):355–378, 2016.
• [59] M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118, 2016.
• [60] A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.
• [61] R. Vidal, J. Bruna, R. Giryes, and S. Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017.
• [62] B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. In Artificial Intelligence and Statistics, pages 1216–1224, 2017.
• [63] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
• [64] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, pages 87.1–87.12, 2016.
• [65] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2016.
• [66] Y. Zhang, J. Lee, M. Wainwright, and M. Jordan. On the learnability of fully-connected neural networks. In Artificial Intelligence and Statistics, pages 83–91, 2017.
• [67] Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. In International Conference on Machine Learning, 2017.
• [68] P. Zhou and J. Feng. Empirical risk landscape analysis for understanding deep neural networks. In International Conference on Learning Representations, 2018.

## Appendix A Supplementary Experiments

### A.1 Performance of Multi-Branch Architecture

In this section, we test the classification accuracy of the multi-branch architecture on the CIFAR-10 dataset. We use a 9-layer VGG network [53] as the sub-network in each branch; it is small enough that many branches fit into GPU memory simultaneously. The detailed network setup of VGG-9 is given in Table 2, where the width of VGG-9 is either 16 or 32. We vary the number of branches in the overall architecture from 4 to 32 and train with cross-entropy loss. Figure 5 presents the test accuracy on CIFAR-10 as the number of branches increases. It shows that the test accuracy improves monotonically with the number of parallel branches/paths.
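As a rough illustration of the multi-branch structure analyzed in this paper, the sketch below averages the outputs of several parallel branches, mirroring the form $\frac{1}{I}\sum_{i=1}^{I} f_i(w_{(i)};x)$ used in Appendix B. The one-unit `tanh` branch and the names `branch_output` and `multi_branch_output` are hypothetical stand-ins chosen for brevity; they are not the VGG-9 branches used in the experiment.

```python
import math


def branch_output(weights, x):
    """One hypothetical branch: a single tanh unit applied to <weights, x>."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))


def multi_branch_output(all_weights, x):
    """Average the outputs of the I parallel branches: (1/I) * sum_i f_i(w_(i); x)."""
    outputs = [branch_output(w, x) for w in all_weights]
    return sum(outputs) / len(outputs)


# Four branches sharing the same input; each branch has its own weights.
x = [1.0, -2.0]
all_weights = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.0], [1.0, 0.5]]
y_hat = multi_branch_output(all_weights, x)
```

Adding a branch only appends one more weight vector to `all_weights`; the averaging keeps the output scale independent of the number of branches.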

### A.2 Strong Duality of Deep Linear Neural Networks

We compare the optima of primal problem (4) and dual problem (5) by numerical experiments for three-layer linear neural networks (). The data are generated as follows. We construct the output matrix by drawing its entries i.i.d. from the standard Gaussian distribution, and we set the input matrix to the identity matrix. The varies from to . Both primal and dual problems are solved by numerical algorithms. Given the non-convex nature of the primal problem, we rerun the algorithm from multiple initializations and keep the best solution obtained. The results are shown in Figure 6. The optima of the primal and dual problems almost match; the small gap is due to numerical inaccuracy.

We also compare the distance between the solution of primal problem and the solution of dual problem in Table 3. We see that the solutions are close to each other.

## Appendix B Proofs of Theorem 1: Duality Gap of Multi-Branch Neural Networks

The lower bound follows directly from weak duality, so we only need to prove the upper bound $\inf(\mathbf{P}) \le \sup(\mathbf{D}) + \frac{2}{I}\Delta_{\mathrm{worst}}$.

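Consider the subset of $\mathbb{R}^2$:

$$\mathcal{Y}_i := \left\{ y_i \in \mathbb{R}^2 : y_i = \frac{1}{I}\left[ h_i(w_{(i)}),\ \mathbb{E}_{(x,y)\sim\mathcal{P}}\left(1 - \frac{y \cdot f_i(w_{(i)};x)}{\tau}\right) \right],\ w_{(i)} \in \mathcal{W}_i \right\}, \quad i \in [I].$$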

Define the vector summation

$$\mathcal{Y} := \mathcal{Y}_1 + \mathcal{Y}_2 + \cdots + \mathcal{Y}_I.$$

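Since $h_i$ and $f_i$ are continuous w.r.t. $w_{(i)}$ and the $\mathcal{W}_i$'s are compact, the set

$$\left\{ \left(w_{(i)},\ h_i(w_{(i)}),\ f_i(w_{(i)};x)\right) : w_{(i)} \in \mathcal{W}_i \right\}$$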

is compact as well. So , , , and , are all compact sets. According to the definition of and the standard duality argument [45], we have

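$$\inf(\mathbf{P}) = \min\left\{ w : \text{there exists } (r, w) \in \mathcal{Y} \text{ such that } r \le K \right\},$$

and

$$\sup(\mathbf{D}) = \min\left\{ w : \text{there exists } (r, w) \in \operatorname{conv}(\mathcal{Y}) \text{ such that } r \le K \right\}.$$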

Technique (a): Shapley-Folkman Lemma. We are going to apply the following Shapley-Folkman lemma.

###### Lemma 3 (Shapley-Folkman, [56]).

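Let $\mathcal{Y}_i$, $i \in [I]$, be a collection of subsets of $\mathbb{R}^m$. Then for every $y \in \operatorname{conv}\left(\sum_{i=1}^{I} \mathcal{Y}_i\right)$, there is a subset $\mathcal{I}(y) \subseteq [I]$ of size at most $m$ such that

$$y \in \left[ \sum_{i \notin \mathcal{I}(y)} \mathcal{Y}_i + \sum_{i \in \mathcal{I}(y)} \operatorname{conv}(\mathcal{Y}_i) \right].$$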
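To make the lemma concrete, here is a minimal sketch for the simplest case, chosen purely for illustration: dimension 1 and every set equal to $\{0, 1\}$. The Minkowski sum of $I$ such sets is $\{0, 1, \ldots, I\}$, its convex hull is $[0, I]$, and any point of the hull can be written with at most one summand taken from a convex hull $[0, 1]$ rather than from $\{0, 1\}$ itself. The function name `shapley_folkman_1d` is hypothetical.

```python
import math


def shapley_folkman_1d(y, num_sets):
    """Decompose y in conv(Y_1 + ... + Y_I) when every Y_i = {0, 1}.

    Returns one number per set, summing to y. At most one entry lies
    strictly inside conv(Y_i) = [0, 1]; all others are points of Y_i.
    """
    if not 0 <= y <= num_sets:
        raise ValueError("y must lie in conv(Y) = [0, I]")
    # Number of sets that contribute the point 1 (capped so one slot remains).
    whole = min(int(math.floor(y)), num_sets - 1)
    # The single remaining summand, taken from conv(Y_i) = [0, 1].
    frac = y - whole
    return [1.0] * whole + [frac] + [0.0] * (num_sets - whole - 1)


parts = shapley_folkman_1d(2.5, 4)
```

Here `parts` has four entries that sum to 2.5, and only one of them is not an endpoint of $[0, 1]$, matching the bound of at most $m = 1$ convexified summands.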

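We apply Lemma 3 to prove Theorem 1 with $m = 2$. Let $(\bar{r}, \bar{w}) \in \operatorname{conv}(\mathcal{Y})$ be such that

$$\bar{r} \le K, \quad \text{and} \quad \bar{w} = \sup(\mathbf{D}).$$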

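Applying the above Shapley-Folkman lemma to the set $\mathcal{Y} = \sum_{i=1}^{I} \mathcal{Y}_i$, we have that there are a subset $\bar{I} \subseteq [I]$ of size at most 2 and vectors

$$(\bar{r}_i, \bar{w}_i) \in \operatorname{conv}(\mathcal{Y}_i),\ \ i \in \bar{I} \quad \text{and} \quad \bar{w}_{(i)} \in \mathcal{W}_i,\ \ i \notin \bar{I},$$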

such that

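$$\frac{1}{I}\sum_{i \notin \bar{I}} h_i(\bar{w}_{(i)}) + \sum_{i \in \bar{I}} \bar{r}_i = \bar{r} \le K, \tag{8}$$

$$\frac{1}{I}\sum_{i \notin \bar{I}} \mathbb{E}_{(x,y)\sim\mathcal{P}}\left(1 - \frac{y \cdot f_i(\bar{w}_{(i)};x)}{\tau}\right) + \sum_{i \in \bar{I}} \bar{w}_i = \sup(\mathbf{D}). \tag{9}$$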

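Representing elements of the convex hull of $\mathcal{Y}_i \subseteq \mathbb{R}^2$ by the Carathéodory theorem, we have that for each $i \in \bar{I}$, there are vectors $\{w_{(i)}^j\}_{j=1}^{3}$ in $\mathcal{W}_i$ and scalars $\{a_i^j\}_{j=1}^{3}$ such that

$$\sum_{j=1}^{3} a_i^j = 1, \quad a_i^j \ge 0,\ j = 1, 2, 3,$$

$$\bar{r}_i = \frac{1}{I}\sum_{j=1}^{3} a_i^j\, h_i(w_{(i)}^j), \quad \bar{w}_i = \frac{1}{I}\sum_{j=1}^{3} a_i^j\, \mathbb{E}_{(x,y)\sim\mathcal{P}}\left(1 - \frac{y \cdot f_i(w_{(i)}^j;x)}{\tau}\right).$$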

Recall that we define

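$$\hat{f}_i(\tilde{w}) := \inf_{w_{(i)} \in \mathcal{W}_i} \left\{ \mathbb{E}_{(x,y)\sim\mathcal{P}}\left(1 - \frac{y \cdot f_i(w_{(i)};x)}{\tau}\right) : h_i(w_{(i)}) \le h_i(\tilde{w}) \right\}, \tag{10}$$

$$\tilde{f}_i(\tilde{w}) := \inf_{a^j,\ w_{(i)}^j \in \mathcal{W}_i} \left\{ \sum_{j=1}^{p_i+1} a^j\, \mathbb{E}_{(x,y)\sim\mathcal{P}}\left(1 - \frac{y \cdot f_i(w_{(i)}^j;x)}{\tau}\right) : \tilde{w} = \sum_{j=1}^{p_i+1} a^j w_{(i)}^j,\ \sum_{j=1}^{p_i+1} a^j = 1,\ a^j \ge 0 \right\},$$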

and . We have for $i \in \bar{I}$,

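$$\bar{r}_i \ge \frac{1}{I}\, h_i\left(\sum_{j=1}^{3} a_i^j w_{(i)}^j\right), \quad (\text{because } h_i(\cdot) \text{ is convex}) \tag{11}$$

and

$$\begin{aligned}
\bar{w}_i &\ge \frac{1}{I}\, \tilde{f}_i\left(\sum_{j=1}^{3} a_i^j w_{(i)}^j\right) && (\text{by the definition of } \tilde{f}_i(\cdot)) \\
&\ge \frac{1}{I}\, \hat{f}_i\left(\sum_{j=1}^{3} a_i^j w_{(i)}^j\right) - \frac{1}{I}\Delta_i. && (\text{by the definition of } \Delta_i)
\end{aligned} \tag{12}$$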

Thus, by Eqns. (8) and (11), we have

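$$\frac{1}{I}\sum_{i \notin \bar{I}} h_i(\bar{w}_{(i)}) + \frac{1}{I}\sum_{i \in \bar{I}} h_i\left(\sum_{j=1}^{3} a_i^j w_{(i)}^j\right) \le K, \tag{13}$$

and by Eqns. (9) and (12), we have

$$\mathbb{E}_{(x,y)\sim\mathcal{P}}\left[\frac{1}{I}\sum_{i \notin \bar{I}} \left(1 - \frac{y \cdot f_i(\bar{w}_{(i)};x)}{\tau}\right)\right] + \frac{1}{I}\sum_{i \in \bar{I}} \hat{f}_i\left(\sum_{j=1}^{3} a_i^j w_{(i)}^j\right) \le \sup(\mathbf{D}) + \frac{1}{I}\sum_{i \in \bar{I}} \Delta_i. \tag{14}$$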

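Given any $\epsilon > 0$ and $i \in \bar{I}$, we can find a vector $\bar{w}_{(i)} \in \mathcal{W}_i$ such that

$$h_i(\bar{w}_{(i)}) \le h_i\left(\sum_{j=1}^{3} a_i^j w_{(i)}^j\right) \quad \text{and} \quad \mathbb{E}_{(x,y)\sim\mathcal{P}}\left(1 - \frac{y \cdot f_i(\bar{w}_{(i)};x)}{\tau}\right) \le \hat{f}_i\left(\sum_{j=1}^{3} a_i^j w_{(i)}^j\right) + \epsilon, \tag{15}$$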

where the first inequality holds because is convex and the second inequality holds by the definition (10) of $\hat{f}_i(\cdot)$. Therefore, Eqns. (13) and (15) imply that

$$\frac{1}{I}\sum_{i=1}^{I} h_i(\bar{w}_{(i)}) \le K.$$

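Namely, $(\bar{w}_{(1)}, \ldots, \bar{w}_{(I)})$ is a feasible solution of problem (2). Also, Eqns. (14) and (15) yield

$$\inf(\mathbf{P}) \le \mathbb{E}_{(x,y)\sim\mathcal{P}}\left[\frac{1}{I}\sum_{i=1}^{I} \left(1 - \frac{y \cdot f_i(\bar{w}_{(i)};x)}{\tau}\right)\right] \le \sup(\mathbf{D}) + \frac{1}{I}\sum_{i \in \bar{I}} (\Delta_i + \epsilon) \le \sup(\mathbf{D}) + \frac{2}{I}\Delta_{\mathrm{worst}} + 2\epsilon,$$

where the last inequality holds because $|\bar{I}| \le 2$. Finally, letting $\epsilon \to 0$ leads to the desired result.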
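The $1/I$ rate can be observed on a toy separable problem, sketched below under strong simplifying assumptions that are not part of the paper's setting: every branch picks $w \in \{0, 1\}$, pays cost $1 - w$, has constraint function $h(w) = w$, and the budget is $K = 0.5$. The sketch does not instantiate $\Delta_{\mathrm{worst}}$; it only compares a brute-force primal value with a grid-searched Lagrangian dual value. The names `primal_value` and `dual_value` are hypothetical.

```python
def primal_value(num_branches, budget):
    """Brute-force the toy primal in which each branch picks w in {0, 1}.

    A branch costs 1 - w and has constraint function h(w) = w; both the
    objective and the constraint are averaged over the branches.
    """
    best = float("inf")
    for ones in range(num_branches + 1):      # number of branches with w = 1
        if ones <= budget * num_branches:     # (1/I) * sum_i h(w_i) <= K
            best = min(best, (num_branches - ones) / num_branches)
    return best


def dual_value(budget, lambdas):
    """Maximize the Lagrangian dual function over a grid of multipliers.

    With identical branches, the averaged dual function is
    min_w [(1 - w) + lam * w] - lam * K = min(1, lam) - lam * K.
    """
    return max(min(1.0, lam) - lam * budget for lam in lambdas)


K = 0.5
lambdas = [i / 100 for i in range(301)]       # multiplier grid on [0, 3]
dual = dual_value(K, lambdas)
gaps = {I: primal_value(I, K) - dual for I in (1, 3, 5, 101)}
```

In this toy problem the dual value is 0.5 for every $I$, while the primal value is $\frac{1}{2} + \frac{1}{2I}$ for odd $I$, so the gap is $\frac{1}{2I}$ and vanishes as branches are added. For even $I$ the constraint can be met exactly and the gap is zero.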

## Appendix C Proofs of Theorem 2: Strong Duality of Deep Linear Neural Networks

Let . We note that by Pythagorean theorem, for every ,

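$$\frac{1}{2}\|Y - W_H \cdots W_1 X\|_F^2 = \frac{1}{2}\|\tilde{Y} - W_H \cdots W_1 X\|_F^2 + \underbrace{\frac{1}{2}\|Y - \tilde{Y}\|_F^2}_{\text{independent of } W_1, \ldots, W_H}.$$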

So we can focus on the following optimization problem instead of problem (4):

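$$\min_{W_1, \ldots, W_H} \frac{1}{2}\|\tilde{Y} - W_H \cdots W_1 X\|_F^2 + \frac{\gamma}{H}\left[\|W_1 X\|_{S_H}^H + \sum_{i=2}^{H} \|W_i\|_{S_H}^H\right]. \tag{16}$$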

Technique (b): Variational Form. Our work is inspired by a variational form of problem (16) given by the following lemma.

###### Lemma 4.

If is optimal to problem

$$\min_{W_1, \ldots, W_H} F(W_1, \ldots, W_H) := \frac{1}{2}\|\tilde{Y} - W_H \cdots W_1 X\|_F^2 + \gamma \|W_H \cdots W_1 X\|_*, \tag{17}$$

then is optimal to problem (16), where is the skinny SVD of , for , and . Furthermore, problems (16) and (17) have the same optimal objective function value.

###### Proof of Lemma 4.

Let be the skinny SVD of matrix . We notice that

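$$\begin{aligned}
\|Z\|_* = \|W_H W_{H-1} \cdots W_1 X\|_* &\le \|W_1 X\|_{S_H} \prod_{i=2}^{H} \|W_i\|_{S_H} && (\text{by the generalized H\"older's inequality}) \\
&\le \frac{1}{H}\left[\|W_1 X\|_{S_H}^H + \sum_{i=2}^{H} \|W_i\|_{S_H}^H\right]. && (\text{by the inequality of means})
\end{aligned}$$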
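The two inequalities above can be sanity-checked numerically in a special case. The sketch below assumes, purely for illustration, that $X$ is the identity and that every factor is diagonal, so that the singular values of each matrix are the absolute values of its diagonal entries and no SVD routine is needed. The helper names `schatten_norm_diag` and `check_chain` are hypothetical.

```python
import random


def schatten_norm_diag(diag, p):
    """Schatten-p norm of a diagonal matrix: l_p norm of its |diagonal entries|."""
    return sum(abs(d) ** p for d in diag) ** (1.0 / p)


def check_chain(diags):
    """Return the three quantities in the chain for diagonal factors.

    diags[0] plays the role of W_1 X (with X = identity) and diags[i]
    the role of W_{i+1}; H is the number of factors.
    """
    H = len(diags)
    prod_diag = [1.0] * len(diags[0])
    for d in diags:                            # diagonal of the matrix product
        prod_diag = [a * b for a, b in zip(prod_diag, d)]
    nuclear = sum(abs(v) for v in prod_diag)   # nuclear norm of the product
    norms = [schatten_norm_diag(d, H) for d in diags]
    prod_norms = 1.0
    for v in norms:                            # product of Schatten-H norms
        prod_norms *= v
    mean_powers = sum(v ** H for v in norms) / H
    return nuclear, prod_norms, mean_powers


random.seed(0)
diags = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
nuclear, prod_norms, mean_powers = check_chain(diags)
```

For any choice of diagonal factors, `nuclear <= prod_norms <= mean_powers` holds up to floating-point error, which is the chain in the display with $H = 3$.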

Hence, on one hand, for every ,

which yields

 minW1,…,WHF(W1,…,WH)≤minW1,…,WH