# Steepest Neural Architecture Descent: Escaping Local Optimum with Signed Neuron Splittings

## Abstract

Developing efficient and principled neural architecture optimization methods is a critical challenge of modern deep learning. Recently, Liu et al. (2019b) proposed a splitting steepest descent (S2D) method that jointly optimizes the neural parameters and architectures based on progressively growing network structures by splitting neurons into multiple copies in a steepest descent fashion. However, S2D suffers from a local optimality issue when all the neurons become “splitting stable”, a concept akin to local stability in parametric optimization. In this work, we develop a significant and surprising extension of the splitting descent framework that addresses the local optimality issue. The idea is to observe that the original S2D is unnecessarily restricted to splitting neurons into positive weighted copies. By simply allowing both positive and negative weights during splitting, we can eliminate the appearance of splitting stability in S2D and hence escape the local optima to obtain better performance. By incorporating signed splittings, we significantly extend the optimization power of splitting steepest descent both theoretically and empirically. We verify our method on various challenging benchmarks such as CIFAR-100, ImageNet and ModelNet40, on which we outperform S2D and other advanced methods on learning accurate and energy-efficient neural networks.

## 1 Introduction

Although the parameter learning of deep neural networks (DNNs) has been well addressed by gradient-based optimization, efficient optimization of neural network architectures (or structures) is still largely open. Traditional approaches frame neural architecture optimization as a discrete combinatorial optimization problem, which, however, often leads to highly expensive computational cost and gives no rigorous theoretical guarantees. New techniques for efficient and principled neural architecture optimization can significantly advance the state of the art of deep learning.

Recently, Liu et al. (2019b) proposed a splitting steepest descent (S2D) method for efficient neural architecture optimization, which frames the joint optimization of the parameters and neural architectures into a continuous optimization problem in an infinite dimensional model space, and derives a computationally efficient (functional) steepest descent procedure for solving it. Algorithmically, S2D works by alternating between typical parametric updates with the architecture fixed, and an architecture descent which grows the neural network structures by optimally splitting critical neurons into a convex combination of multiple copies.

In S2D, the optimal rule for picking which neurons to split and how to split them is theoretically derived to yield the fastest descent of the loss in an asymptotically small neighborhood. Specifically, the optimal way to split a neuron is to divide it into two equally weighted copies along the minimum eigen-direction of a key quantity called the splitting matrix of that neuron. Splitting a neuron into more than two copies cannot introduce any additional gain theoretically and does not need to be considered, for computational efficiency. Moreover, the change of loss resulting from splitting a neuron equals the minimum eigen-value of its splitting matrix (called the splitting index). Therefore, neurons whose splitting matrix is positive definite are considered to be “splitting stable” (or not splittable), in that splitting them in any fashion can only increase the loss, and the copies would be “pushed back” by subsequent gradient descent. In this way, the role of splitting matrices is analogous to how Hessian matrices characterize local stability for typical parametric optimization, and the local stability due to positive definite splitting matrices can be viewed as a notion of local optimality in the parameter-structure joint space. Unfortunately, the presence of the splitting stable status is a key limitation on the practical performance of splitting steepest descent, since the loss can be stuck at a relatively large value when the splittings become stable and cannot continue.

This work fills a surprising missing piece of the picture outlined above, which allows us to address the local optimality problem in splitting descent with a simple algorithmic improvement. We show that the notion of splitting stability caused by positive definite splitting matrices is in fact an artifact of splitting neurons into positively weighted copies. By simply considering signed splittings, which allow us to split neurons into copies with both positive and negative weights, the loss can continue to decrease unless the splitting matrices equal zero for all the neurons. Intriguingly, the optimal splitting rules with signed weights can have up to three or four copies (a.k.a. triplet and quartet splittings; see Figure 4(c-e)), even though signed binary splittings (Figure 4(a-b)), which introduce no additional neurons over the original positively weighted splitting, can work sufficiently well in practice.

Our main algorithm, signed splitting steepest descent (S3D), outperforms the original S2D in both theoretical guarantees and empirical performance. Theoretically, it yields a stronger notion of optimality and allows us to establish a convergence analysis that was impossible for S2D. Empirically, S3D can learn smaller and more accurate networks on a variety of challenging benchmarks, including CIFAR-100, ImageNet, and ModelNet40, on which S3D substantially outperforms S2D and a variety of baselines for learning small and energy-efficient networks (e.g. Liu et al., 2017; Li et al., 2017; Gordon et al., 2018; He et al., 2018).

## 2 Background: Splitting Steepest Descent

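Following Liu et al. (2019b), we start with the case of splitting a single-neuron network $\sigma(\theta, x)$, where $\theta$ is the parameter and $x$ the input. On a data distribution $\mathcal{D}$, the loss function of $\theta$ is

$$L(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi(\sigma(\theta, x))\big],$$

where $\Phi(\cdot)$ denotes a nonlinear loss function.

Assume we split a neuron with parameter $\theta$ into $m$ copies whose parameters are $\{\theta_i\}_{i=1}^m$, each of which is associated with a weight $w_i$, yielding a larger neural network of the form $\sum_{i=1}^m w_i\sigma(\theta_i, x)$. Its loss function is

$$L_m(\boldsymbol{\theta}, \boldsymbol{w}) = \mathbb{E}_{x\sim\mathcal{D}}\Big[\Phi\Big(\sum_{i=1}^m w_i\sigma(\theta_i, x)\Big)\Big],$$

where we write $\boldsymbol{\theta} := \{\theta_i\}_{i=1}^m$ and $\boldsymbol{w} := \{w_i\}_{i=1}^m$. We shall assume $\sum_{i=1}^m w_i = 1$, so that we obtain an equivalent network, or a network morphism (Wei et al., 2016), when the split copies are not updated, i.e., $\theta_i = \theta$ for all $i$. We want to find the optimal splitting scheme ($\boldsymbol{\theta}$, $\boldsymbol{w}$, $m$) that yields the minimum loss $L_m(\boldsymbol{\theta}, \boldsymbol{w})$.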

Assume the copies can be decomposed into $\theta_i = \theta + \epsilon(\delta_0 + \delta_i)$, where $\epsilon$ denotes a step-size parameter, $\delta_0$ the average displacement of all copies (which implies $\sum_{i=1}^m w_i\delta_i = 0$), and $\delta_i$ the individual “splitting” direction of $\theta_i$. Liu et al. (2019b) showed the following key decomposition:

$$L_m(\boldsymbol{\theta}, \boldsymbol{w}) = L(\theta + \epsilon\delta_0) + \frac{\epsilon^2}{2}\,\mathrm{II}(\boldsymbol{\delta}, \boldsymbol{w}; \theta) + O(\epsilon^3), \tag{1}$$

where $L(\theta + \epsilon\delta_0)$ denotes the effect of the average displacement, corresponding to typical parametric updates without splitting, and $\mathrm{II}(\boldsymbol{\delta}, \boldsymbol{w}; \theta)$ denotes the effect of splitting the neurons; it is a quadratic form depending on a splitting matrix defined in Liu et al. (2019b):

$$\mathrm{II}(\boldsymbol{\delta}, \boldsymbol{w}; \theta) = \sum_{i=1}^m w_i\,\delta_i^\top S(\theta)\,\delta_i, \qquad \text{where} \qquad S(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(\sigma(\theta, x))\,\nabla^2_{\theta\theta}\sigma(\theta, x)\big]. \tag{2}$$

Here $S(\theta)$ is called the splitting matrix of $L(\theta)$. Because splitting increases the number of neurons and only contributes an $O(\epsilon^2)$ decrease of loss following (1), it is preferred to decrease the loss with typical parametric updates that require no splitting (e.g., gradient descent), whenever the parametric local optimum of $L(\theta)$ is not achieved. However, when we reach a local optimum of $L(\theta)$, splitting allows us to escape the local optimum at the cost of increasing the number of neurons. In Liu et al. (2019b), the optimal splitting scheme is framed as an optimization problem:

$$G_m^{+} := \min_{\boldsymbol{\delta}, \boldsymbol{w}}\Big\{\mathrm{II}(\boldsymbol{\delta}, \boldsymbol{w}; \theta):\ \boldsymbol{w}\in P_m^{+},\ \boldsymbol{\delta}\in\Delta_{\boldsymbol{w}}\Big\}, \tag{3}$$

where we optimize the weights $\boldsymbol{w}$ in a probability simplex $P_m^{+}$ and the splitting vectors $\boldsymbol{\delta}$ in the set $\Delta_{\boldsymbol{w}}$:

$$\begin{aligned}
P_m^{+} &= \Big\{\boldsymbol{w}\in\mathbb{R}^m:\ \sum_{i=1}^m w_i = 1,\ w_j\ge 0,\ \forall j\Big\}, \\
\Delta_{\boldsymbol{w}} &= \Big\{\boldsymbol{\delta}\in\mathbb{R}^{m\times d}:\ \sum_{i=1}^m w_i\delta_i = 0,\ \|\delta_j\|\le 1,\ \forall j\Big\},
\end{aligned} \tag{4}$$

in which $\delta_j$ is constrained to the unit ball and the constraint $\sum_{i=1}^m w_i\delta_i = 0$ ensures a zero average displacement. Liu et al. (2019b) showed that the optimal gain $G_m^{+}$ in (3) depends on the minimum eigen-value $\lambda_{\min}$ of $S(\theta)$ in that

$$G_m^{+} = \min(\lambda_{\min},\ 0).$$

If $\lambda_{\min} < 0$, we obtain a strict decrease of the loss, and the maximum decrease can be achieved by a simple binary splitting scheme ($m=2$), in which the neuron is split into two equally weighted copies along the minimum eigen-vector direction $v_{\min}$ of $S(\theta)$, that is,

$$m = 2, \qquad w_1 = w_2 = 1/2, \qquad \delta_1 = -\delta_2 = v_{\min}. \tag{5}$$

See Figure 1(a) for an illustration. This binary splitting defines the best possible splitting in the sense of (3), which means that it cannot be further improved even when it is allowed to split the neuron into an arbitrary number of copies.

On the other hand, if $\lambda_{\min} > 0$, we have $G_m^{+} = 0$ and the loss cannot be decreased by any splitting scheme considered in (3). This case was called being splitting stable in Liu et al. (2019b), which means that even if the neuron is split into an arbitrary number of copies in an arbitrary way (with a small step size $\epsilon$), all its copies would be pushed back to the original neuron when gradient descent is applied subsequently.
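To make the quantities in (1), (2) and (5) concrete, here is a minimal NumPy sketch. It assumes a toy neuron $\sigma(\theta, x) = \tanh(\theta^\top x)$ and a squared loss on synthetic data; these choices, and the helper names `splitting_matrix`, `positive_binary_split` and `check_decomposition`, are illustrative assumptions and are not taken from Liu et al. (2019b).

```python
import numpy as np


def splitting_matrix(theta, X, phi_prime):
    """Sample estimate of S(theta) in (2) for the assumed neuron tanh(theta^T x)."""
    s = np.tanh(X @ theta)
    d2_tanh = -2.0 * s * (1.0 - s**2)      # second derivative of tanh at theta^T x
    coef = phi_prime(s) * d2_tanh          # Phi'(sigma) times the scalar part of the Hessian
    return (X * coef[:, None]).T @ X / X.shape[0]


def positive_binary_split(S):
    """Positive binary splitting (5): gain min(lambda_min, 0) along v_min."""
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    lam_min, v_min = eigvals[0], eigvecs[:, 0]
    if lam_min >= 0.0:
        return 0.0, None                   # splitting stable under positive weights
    weights = np.array([0.5, 0.5])
    deltas = np.stack([v_min, -v_min])
    return float(lam_min), (weights, deltas)


def check_decomposition(eps=1e-3, seed=0):
    """Compare the measured loss change of a symmetric split with delta^T S delta."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(200, 3))          # synthetic inputs (assumption)
    y = rng.normal(size=200)               # synthetic targets (assumption)
    theta = rng.normal(size=3)
    delta = rng.normal(size=3)
    delta /= np.linalg.norm(delta)

    def loss(out):
        return np.mean(0.5 * (out - y) ** 2)

    S = splitting_matrix(theta, X, lambda out: out - y)
    base = loss(np.tanh(X @ theta))
    # Two equally weighted copies moved by +eps*delta and -eps*delta (delta_0 = 0).
    split = loss(0.5 * np.tanh(X @ (theta + eps * delta))
                 + 0.5 * np.tanh(X @ (theta - eps * delta)))
    return (split - base) / (0.5 * eps**2), float(delta @ S @ delta)
```

For small `eps`, the two numbers returned by `check_decomposition` should agree closely, which is the second-order term of (1) with zero average displacement.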

### 2.1 Main Method: Signed Splitting Steepest Descent

Following the derivation above, the splitting process would get stuck and stop when the splitting matrix is positive definite ($\lambda_{\min} > 0$), and it yields only a small gain when $\lambda_{\min}$ is close to zero. Our key observation is that this phenomenon is in fact an artifact of constraining the weights to be non-negative in optimization (3)-(4). By allowing negative weights, we open the door to a much richer class of splitting schemes, which allows us to descend the loss more efficiently. Interestingly, although the optimal positively weighted splitting is always achievable by the binary splitting scheme ($m=2$) shown in (5), the optimal splitting schemes with signed weights can be either binary splitting ($m=2$), triplet splitting ($m=3$), or at most quartet splitting ($m=4$).

Specifically, our idea is to replace (3) with

$$G_m^{-c} := \min_{\boldsymbol{\delta}, \boldsymbol{w}}\Big\{\mathrm{II}(\boldsymbol{\delta}, \boldsymbol{w}; \theta):\ \boldsymbol{w}\in P_m^{-c},\ \boldsymbol{\delta}\in\Delta_{\boldsymbol{w}}\Big\}, \tag{6}$$

where the weight $\boldsymbol{w}$ is constrained to a larger set $P_m^{-c}$ whose size depends on a scalar $c \ge 1$:

$$P_m^{-c} = \Big\{\boldsymbol{w}\in\mathbb{R}^m:\ \sum_{i=1}^m w_i = 1,\ \sum_{i=1}^m |w_i| \le c\Big\}. \tag{7}$$

We can see that $P_m^{-c}$ reduces to $P_m^{+}$ when $c = 1$, and contains negative weights when $c > 1$. By using $c > 1$, we enable a richer class of splitting schemes with signed weights, hence yielding faster descent of the loss function.

The optimization in (6) is more involved than the positive case (3), but still yields elementary solutions. We now discuss the solution when we split the neuron into $m = 2, 3, 4$ copies, respectively. Importantly, we show that no additional gain can be made by splittings with more than $m = 4$ copies.

For notation, we denote by $\lambda_{\min}$, $\lambda_{\max}$ the smallest and largest eigenvalues of $S(\theta)$, respectively, and by $v_{\min}$, $v_{\max}$ their corresponding eigen-vectors with unit norm.

Theorem (Binary Splittings). For the optimization in (6) with $m = 2$ and $c \ge 1$, we have

$$G_2^{-c} = \min\Big(\lambda_{\min},\ -\frac{c-1}{c+1}\lambda_{\max},\ 0\Big),$$

and the optimum is achieved by one of the following cases:

i) no splitting ($\delta_1 = \delta_2 = 0$), which yields $G_2^{-c} = 0$;

ii) the positive binary splitting in (5), yielding $G_2^{-c} = \lambda_{\min}$;

iii) the following “negative” binary splitting scheme:

$$w_1 = -\frac{c-1}{2}, \quad \delta_1 = v_{\max}, \qquad w_2 = \frac{c+1}{2}, \quad \delta_2 = \frac{c-1}{c+1}\,v_{\max}, \tag{8}$$

which yields $G_2^{-c} = -\frac{c-1}{c+1}\lambda_{\max}$. This amounts to splitting the neuron into two copies with a positive and a negative weight, respectively, both of which move along the eigen-vector $v_{\max}$, but with different magnitudes (to ensure a zero average displacement). See Figure 1(b) for an illustration. Recall that the positive splitting (5) follows the minimum eigen-vector $v_{\min}$, and achieves a decrease of loss only if $\lambda_{\min} < 0$. In comparison, the negative splitting (8) exploits the maximum eigen-direction $v_{\max}$ and achieves a decrease of loss when $\lambda_{\max} > 0$ and $c > 1$. Hence, for $c > 1$, unless $\lambda_{\min} = \lambda_{\max} = 0$, that is, $S(\theta) = 0$, a loss decrease can be achieved by either the positive or the negative binary splitting.
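The case analysis of the binary splitting theorem can be sketched in a few lines of NumPy. The function names `quad_gain` and `signed_binary_split` are hypothetical, and the splitting matrix is assumed to be given as a small dense symmetric array.

```python
import numpy as np


def quad_gain(S, weights, deltas):
    """II(delta, w; theta) = sum_i w_i * delta_i^T S delta_i, as in (2)."""
    return float(sum(w * (d @ S @ d) for w, d in zip(weights, deltas)))


def signed_binary_split(S, c):
    """Pick the best of: no splitting, positive split (5), negative split (8)."""
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    lam_min, lam_max = float(eigvals[0]), float(eigvals[-1])
    v_min, v_max = eigvecs[:, 0], eigvecs[:, -1]
    ratio = (c - 1.0) / (c + 1.0)
    candidates = {"none": 0.0, "positive": lam_min, "negative": -ratio * lam_max}
    kind = min(candidates, key=candidates.get)  # ties resolve to the earliest key
    if kind == "positive":
        weights = np.array([0.5, 0.5])
        deltas = np.stack([v_min, -v_min])
    elif kind == "negative":
        weights = np.array([-(c - 1.0) / 2.0, (c + 1.0) / 2.0])
        deltas = np.stack([v_max, ratio * v_max])
    else:
        weights = np.array([0.5, 0.5])
        deltas = np.zeros((2, S.shape[0]))      # copies stay at the original neuron
    return kind, candidates[kind], weights, deltas
```

For a positive definite matrix such as `np.diag([1.0, 3.0])` with `c = 2`, the positive split gives no gain, while the negative split yields a gain of $-\frac{c-1}{c+1}\lambda_{\max} = -1$; passing the returned weights and directions to `quad_gain` reproduces the same value.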

Theorem (Triplet Splittings). For the optimization in (6) with $m = 3$ and $c \ge 1$, we have

$$G_3^{-c} = \min\Big(\frac{c+1}{2}\lambda_{\min},\ -\frac{c-1}{2}\lambda_{\max},\ 0\Big),$$

and the optimum is achieved by one of the following cases:

i) no splitting ($\delta_1 = \delta_2 = \delta_3 = 0$), with $G_3^{-c} = 0$;

ii) the following “positive” triplet splitting scheme with two positive weights and one negative weight, which yields $G_3^{-c} = \frac{c+1}{2}\lambda_{\min}$:

$$w_1 = \frac{c+1}{4}, \quad w_2 = \frac{c+1}{4}, \quad w_3 = -\frac{c-1}{2}, \qquad \delta_1 = v_{\min}, \quad \delta_2 = -v_{\min}, \quad \delta_3 = 0; \tag{9}$$

iii) the following “negative” triplet splitting scheme with two negative weights and one positive weight, which yields $G_3^{-c} = -\frac{c-1}{2}\lambda_{\max}$:

$$w_1 = -\frac{c-1}{4}, \quad w_2 = -\frac{c-1}{4}, \quad w_3 = \frac{c+1}{2}, \qquad \delta_1 = v_{\max}, \quad \delta_2 = -v_{\max}, \quad \delta_3 = 0. \tag{10}$$

Similar to the binary splittings, the positive and negative triplet splittings exploit the minimum and maximum eigenvalues, respectively. In both cases, the triplet splittings achieve a larger descent than their binary counterparts, which is made possible by placing a copy with no movement ($\delta_3 = 0$) to allow the other two copies to achieve a larger descent with a higher degree of freedom.

See Figure 1(c)-(d) for an illustration of the triplet splittings. Intuitively, the triplet splittings can be viewed as giving birth to two offspring while keeping the original neuron alive, while the binary splittings “kill” the original neuron and only keep the two offspring.

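We now consider the optimal quartet splitting ($m = 4$), and show that no additional gain is possible with $m > 4$ copies.

Theorem (Quartet Splitting and Optimality). For any $m \ge 4$ and $c \ge 1$, we have

$$G_m^{-c} = G_4^{-c} = \frac{c+1}{2}\lambda_{\min}^{\mathrm{th}} - \frac{c-1}{2}\lambda_{\max}^{\mathrm{th}},$$

where $\lambda_{\min}^{\mathrm{th}} := \min(\lambda_{\min}, 0)$ and $\lambda_{\max}^{\mathrm{th}} := \max(\lambda_{\max}, 0)$. In addition, the optimum is achieved by the following splitting scheme with $m = 4$:

$$w_1 = w_2 = \frac{c+1}{4}, \qquad w_3 = w_4 = -\frac{c-1}{4}, \qquad \delta_1 = -\delta_2 = v_{\min}^{\mathrm{th}}, \qquad \delta_3 = -\delta_4 = v_{\max}^{\mathrm{th}}, \tag{11}$$

where $v_{\min}^{\mathrm{th}} := \mathbb{I}(\lambda_{\min} < 0)\, v_{\min}$, $v_{\max}^{\mathrm{th}} := \mathbb{I}(\lambda_{\max} > 0)\, v_{\max}$, and $\mathbb{I}(\cdot)$ denotes the indicator function.

Therefore, if $\lambda_{\min} = \lambda_{\max} = 0$, we have $G_m^{-c} = 0$, and (11) yields effectively no splitting ($\delta_i = 0$, $\forall i$). In this case, no decrease of the loss can be made by any splitting scheme, regardless of how large $m$ is.

If $\lambda_{\max} \le 0$ (resp. $\lambda_{\min} \ge 0$), we have $v_{\max}^{\mathrm{th}} = 0$ (resp. $v_{\min}^{\mathrm{th}} = 0$), and (11) reduces to the positive (resp. negative) triplet splitting in Theorem 2.1. There is no additional gain from using $m = 4$ over $m = 3$.

If $\lambda_{\min} < 0 < \lambda_{\max}$, this yields a quartet splitting (Figure 1(e)) which has two positively weighted copies split along the $v_{\min}$ direction, and two negatively weighted copies along the $v_{\max}$ direction. The advantage of this quartet splitting is that it exploits both the maximum and minimum eigen-directions simultaneously, while any binary or triplet splitting can only benefit from one of the two directions.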
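As a rough numerical companion to the three theorems, the sketch below evaluates the closed-form gains for $m = 2, 3, 4$ and builds the quartet scheme (11). The helper names `quad_gain`, `optimal_gains` and `quartet_split` are hypothetical, and a dense symmetric splitting matrix is assumed.

```python
import numpy as np


def quad_gain(S, weights, deltas):
    """II(delta, w; theta) = sum_i w_i * delta_i^T S delta_i, as in (2)."""
    return float(sum(w * (d @ S @ d) for w, d in zip(weights, deltas)))


def optimal_gains(S, c):
    """Closed-form optimal gains for binary, triplet and quartet splittings."""
    lam = np.linalg.eigvalsh(S)                 # ascending eigenvalues
    lam_min, lam_max = float(lam[0]), float(lam[-1])
    g2 = min(lam_min, -(c - 1) / (c + 1) * lam_max, 0.0)
    g3 = min((c + 1) / 2 * lam_min, -(c - 1) / 2 * lam_max, 0.0)
    g4 = (c + 1) / 2 * min(lam_min, 0.0) - (c - 1) / 2 * max(lam_max, 0.0)
    return {2: g2, 3: g3, 4: g4}


def quartet_split(S, c):
    """Weights and directions of scheme (11); unused directions are zeroed out."""
    eigvals, eigvecs = np.linalg.eigh(S)
    v_min_th = eigvecs[:, 0] * float(eigvals[0] < 0)    # indicator(lambda_min < 0)
    v_max_th = eigvecs[:, -1] * float(eigvals[-1] > 0)  # indicator(lambda_max > 0)
    weights = np.array([(c + 1) / 4, (c + 1) / 4, -(c - 1) / 4, -(c - 1) / 4])
    deltas = np.stack([v_min_th, -v_min_th, v_max_th, -v_max_th])
    return weights, deltas
```

With `S = np.diag([-1.0, 2.0])` and `c = 3`, the gains are $-1$, $-2$ and $-4$ for $m = 2, 3, 4$, and `quad_gain` applied to the output of `quartet_split` recovers the $m = 4$ value. When $\lambda_{\max} \le 0$, the last two directions vanish and the $m = 4$ gain coincides with the $m = 3$ gain.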

Remark A common feature of all the splitting schemes above ($m = 2, 3$ or $4$) is that the decrease of loss is at least proportional to the spectral radius of the splitting matrix, that is,

$$G_m^{-c} \le -\kappa_m\,\rho(S(\theta)), \qquad \text{where} \qquad \rho(S(\theta)) := \max\big(|\lambda_{\max}(S(\theta))|,\ |\lambda_{\min}(S(\theta))|\big), \tag{12}$$

where $\kappa_2 = \frac{c-1}{c+1}$ and $\kappa_m = \frac{c-1}{2}$ for $m \ge 3$. Hence, unless $\rho(S(\theta)) = 0$, which implies $S(\theta) = 0$, we can always decrease the loss by the optimal splitting schemes with any $c > 1$. This is in contrast with the optimal positive splitting in (5), which gets stuck when $S(\theta)$ is positive semi-definite ($\lambda_{\min} \ge 0$).

We can see from Eq (12) that the effects of splittings with different $m$ are qualitatively similar. The improvement of the triplet and quartet splittings over the binary splittings is only up to a constant factor depending on $c$, and may not yield a significant difference in the final optimization result. As we show in the experiments, it is preferred to use binary splittings ($m = 2$), as they introduce fewer neurons in each splitting and yield much smaller neural networks.

#### Algorithm

Similar to Liu et al. (2019b), the splitting descent can be easily extended to general neural networks with multiple neurons, possibly in different layers, because the effects of splitting different neurons are additive, as shown in Theorem 2.4 of Liu et al. (2019b).

This yields the practical algorithm in Algorithm 1, in which we alternate between i) the standard parametric update phase, in which we use traditional gradient-based optimizers until no further improvement can be made by pure parametric updates, and ii) the splitting phase, in which we evaluate the minimum and maximum eigenvalues of the splitting matrices of the different neurons, select a subset of neurons with the most negative values of $G_m^{-c}$ with $m = 2$, 3, or 4, and split these neurons using the optimal schemes specified in the theorems above.

The rule for deciding how many neurons to split at each iteration can be a heuristic of the user's choice. For example, we can fix a maximum number of neurons to split and a positive threshold, and select the neurons with the most negative values of $G_m^{-c}$, up to this maximum number, whose gain magnitude $|G_m^{-c}|$ is at least the threshold.
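The selection heuristic described above could be sketched as follows. The function name `select_neurons_to_split` and its parameters `max_splits` and `threshold` are hypothetical names introduced for illustration; the per-neuron gains are assumed to have been computed already.

```python
import numpy as np


def select_neurons_to_split(gains, max_splits, threshold):
    """Return indices of at most `max_splits` neurons with the most negative
    gains, keeping only those with gain <= -threshold."""
    gains = np.asarray(gains, dtype=float)
    order = np.argsort(gains)                  # most negative gain first
    return [int(i) for i in order[:max_splits] if gains[i] <= -threshold]
```

For example, with gains `[-0.5, -0.01, -2.0, 0.0, -0.3]`, `max_splits=3` and `threshold=0.1`, the neurons with indices 2, 0 and 4 are selected, in that order.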

#### Computational Cost

Similar to Liu et al. (2019b), the cost of the exact eigen-computation for signed splittings in time and space depends on $n$, the number of neurons, and $d$, the parameter size of each neuron. However, this can be significantly improved by using the Rayleigh-quotient gradient descent for eigen-computation introduced in Wang et al. (2019a), which has roughly the same time and space complexity as typical parametric back-propagation on the same network. See Appendix E.3 for more details on how we apply Rayleigh-quotient gradient descent in signed splittings.
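The general idea of Rayleigh-quotient gradient descent can be illustrated with a short NumPy sketch. This is a generic normalized gradient descent on the Rayleigh quotient, not necessarily the exact procedure of Wang et al. (2019a); the function name `rayleigh_min_eig`, the step size, and the iteration count are assumptions. The routine only needs matrix-vector products, supplied here through a hypothetical `matvec` callable, so the splitting matrix never has to be formed explicitly.

```python
import numpy as np


def rayleigh_min_eig(matvec, dim, steps=2000, lr=0.05, seed=0):
    """Approximate the smallest eigen-pair of a symmetric operator by gradient
    descent on the Rayleigh quotient v^T S v / v^T v, renormalizing each step."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(steps):
        Sv = matvec(v)
        lam = v @ Sv                       # Rayleigh quotient at unit norm
        grad = 2.0 * (Sv - lam * v)        # gradient of the quotient at unit norm
        v = v - lr * grad
        v /= np.linalg.norm(v)
    return float(v @ matvec(v)), v


def rayleigh_max_eig(matvec, dim, **kwargs):
    """Largest eigen-pair, obtained by minimizing the quotient of -S."""
    lam, v = rayleigh_min_eig(lambda u: -matvec(u), dim, **kwargs)
    return -lam, v
```

The step size must be small relative to the eigenvalue spread for the iteration to be stable; the default used here is only suitable for matrices with a spread of a few units. Both $\lambda_{\min}$ and $\lambda_{\max}$ are needed for signed splittings, hence the second helper.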

#### Convergence Guarantee

The original S2D does not have a rigorous convergence guarantee due to the local minimum issue associated with positive definite splitting matrices. With signed splittings, the loss can be minimized much more thoroughly, which allows us to establish provable convergence guarantees. In particular, we show that, under proper conditions, by splitting two-layer neural networks using S3D with only binary splittings starting from a single-neuron network, we can reach a target training MSE loss within a bounded number of splitting steps that depends on the data size and the input dimension. The final size of the network we obtain is smaller than the number of neurons required for over-parameterization-based analysis of standard gradient descent training of neural networks, and hence provides a theoretical justification that splitting can yield accurate networks that are smaller than those obtained by standard gradient descent. For example, the analyses in Du et al. (2019) and in Oymak and Soltanolkotabi (2019) require more neurons than what we need when the data size is large. The detailed results are shown in the Appendix due to space constraints.

## 3 Experiments

We test our algorithm on various benchmarks, including CIFAR-100, ImageNet and ModelNet40. We apply our signed splitting steepest descent (S3D) following Algorithm 1 and compare it with splitting steepest descent (S2D) (Liu et al., 2019b), which is the same as Algorithm 1 except that only positive splittings are used. We also consider an energy-aware variant following Wang et al. (2019a), in which the increase of energy cost for splitting each neuron is estimated at each splitting step, and the set of neurons to split is selected by solving a knapsack problem to maximize the total splitting gain subject to a constraint on the increase of energy cost. See Wang et al. (2019a) for details.

We tested S3D with different splitting sizes ($m = 2, 3, 4$) and found that the binary splitting ($m = 2$) tends to give the best performance in practical deep learning tasks of image and point cloud classification. This is because $m = 3, 4$ tend to give much larger networks while not yielding a significant enough improvement over $m = 2$ to compensate for the faster growth of network size. In fact, if we consider the average gain of each new copy, $m = 2$ provides a better trade-off between accuracy and network size. Therefore, we only consider $m = 2$ in all the deep learning experiments. Due to limited space, we put more experiment details in the Appendix.

#### Toy RBF neural networks

We revisit the toy RBF neural network experiment described in Liu et al. (2019b) to demonstrate the benefit of introducing signed splittings. Despite the results reported for S2D in Liu et al. (2019b), it still tends to get stuck at local optima when the splitting matrices are positive definite. By using more general signed splittings, our S3D algorithm allows us to escape the local optima that S2D cannot escape, hence yielding better results. For both S2D and S3D, we start with an initial network with a single neuron and gradually grow it by splitting neurons. We test both S2D, which includes only positive binary splittings, and S3D with signed binary splittings ($m = 2$), triplet splittings ($m = 3$), and quartet splittings ($m = 4$), respectively. More experiment settings can be found in Appendix D.1.

As shown in Figure 2 (a), S2D gets stuck in a local minimum, while our signed splitting can escape the local minimum and fit the true curve well in the end. Figure 2 (b) shows the loss curves obtained by S3D under the different settings compared. The triangle markers in Figure 2 (b) indicate the first time when positive and signed splittings pick different neurons. Figure 2 (d) further provides evidence showing that S3D can pick a different but better neuron (with a larger splitting gain) to split compared with S2D, which helps the network get out of local optima.

#### Results on CIFAR-100

We apply S3D to grow DNNs for the image classification task. We test our method on MobileNetV1 (Howard et al., 2017) on CIFAR-100 and compare our S3D with S2D (Liu et al., 2019b) as well as other pruning baselines, including L1 Pruning (Liu et al., 2017), Bn Pruning (Liu et al., 2017) and MorphNet (Gordon et al., 2018). We also apply our algorithm in the energy-aware setting discussed in Wang et al. (2019a), which decides the best neurons to split by formulating a knapsack problem to best trade off the splitting gain and energy cost; see Wang et al. (2019a) for the details. To speed up the eigen-computation in S3D and S2D, we use the fast gradient-based eigen-approximation algorithm in Wang et al. (2019a) (see Appendix E.3). Our results show that our algorithm outperforms prior art with higher accuracy and lower cost in terms of both the number of model parameters and FLOPs. See more experiment details in Appendix D.2.

Figure 3 (a) and (b) show that our S3D algorithm outperforms all the baselines in both the standard setting and the energy-aware setting of Wang et al. (2019a). Table 2 reports the testing accuracy, parameter size and FLOPs of the learned models. We can see that our method achieves significantly higher accuracy as well as lower parameter sizes and FLOPs. We study the relation between testing accuracy and the hyper-parameter $c$ in Figure 3 (c), at the 5th splitting step in Figure 3 (a) (note that $c = 1$ reduces to S2D). Figure 3 (c) indicates which value of $c$ is optimal in this case.

#### Results on ImageNet

We apply our method to the ImageNet classification task. We follow the setting of Wang et al. (2019a), using their energy-aware neuron selection criterion and fast gradient-based eigen-approximation. We also compare our method with AMC (He et al., 2018), full MobileNetV1, and MobileNetV1 with reduced width multipliers on each layer. We find that our S3D achieves higher Top-1 and Top-5 accuracy than other methods with comparable multiply-and-accumulate operations (MACs). See details of the setting in Appendix D.3. Table 2 shows that our S3D obtains better Top-1 and Top-5 accuracy compared with the S2D in Wang et al. (2019a) and other baselines with the same or smaller MACs. We also visualize the filters after splitting on ImageNet; see Appendix D.4.

#### Results on Point Cloud Classification

Point cloud is a simple and popular representation of 3D objects, which can be easily captured and processed by mobile devices. Point cloud classification amounts to classifying 3D objects based on their point cloud representations, and is found in many cutting-edge AI applications, such as face recognition in Face ID and LIDAR-based recognition in autonomous driving. Since many of these applications are deployed on mobile devices, a key challenge is to build small and energy-efficient networks with high accuracy. We can attack this challenge with splitting steepest descent.

We consider point cloud classification with the dynamic graph convolutional neural network (DGCNN) (Wang et al., 2019b). DGCNN is one of the best networks for point clouds, but tends to be expensive in both speed and space, because it involves K-nearest-neighbour (KNN) operators for aggregating neighboring features on the graph. We apply S3D to search for better DGCNN structures with smaller sizes, hence significantly improving the space and time efficiency. Following the experiment in Wang et al. (2019b), we choose ModelNet40 as our dataset. See details in Appendix D.5. Table 3 shows the result compared with PointNet (Qi et al., 2017), PointNet++ (Qi et al., 2017) and DGCNN with different multipliers on its EdgeConv layers. We compare the accuracy as well as the model size and time cost for forward processing. For the forward processing time, we test on a single NVIDIA RTX 2080Ti with a batch size of 16. We can see that our S3D algorithm obtains networks with the highest accuracy among all the methods, with a faster forward processing speed and a smaller model size than the corresponding DGCNN baselines in Table 3.

## 4 Related Works

Neural Architecture Search (NAS) has been traditionally framed as a discrete combinatorial optimization and solved based on black-box optimization methods such as reinforcement learning (e.g. Zoph and Le, 2017; Zoph et al., 2018), evolutionary/genetic algorithms (e.g., Stanley and Miikkulainen, 2002; Real et al., 2018), or continuous relaxation followed by gradient descent (e.g., Liu et al., 2019a; Xie et al., 2018). These methods need to search in a large model space with expensive evaluation cost, and can be computationally expensive or easily get stuck at local optima. Techniques such as weight-sharing (e.g. Pham et al., 2018; Cai et al., 2019; Bender et al., 2018) and low fidelity estimates (e.g., Zoph et al., 2018; Falkner et al., 2018; Runge et al., 2019) have been developed to alleviate the cost problem in NAS; see e.g., Elsken et al. (2019b); Wistuba et al. (2019) for recent surveys of NAS. In comparison, splitting steepest descent is based on a significantly different functional steepest descent view that leverages the fundamental topological information of deep neural architectures to enable more efficient search, ensuring both rigorous theoretical guarantees and superior practical performance.

The idea of progressively growing neural networks has been considered by researchers in various communities from different angles. However, most existing methods are based on heuristic ideas. For example, Wynne-Jones (1992) proposed a heuristic method to split neurons based on the eigen-directions of covariance matrix of the gradient. See e.g., Ghosh and Tumer (1994); Utgoff and Precup (1998) for surveys of similar ideas in the classical literature.

Recently, Chen et al. (2016) proposed a method called Net2Net for knowledge transferring which grows a well-trained network by splitting randomly picked neurons along random directions. Our optimal splitting strategies can be directly adapted to improve Net2Net. Going beyond node splitting, more general operators that grow networks while preserving the function represented by the networks, referred to as network morphism, have been studied and exploited in a series of recent works (e.g., Chen et al., 2016; Wei et al., 2016; Cai et al., 2018; Elsken et al., 2019a).

A more principled progressive training approach for neural networks can be derived using Frank-Wolfe (e.g., Schwenk and Bengio, 2000; Bengio et al., 2006; Bach, 2017), which yields greedy algorithms that iteratively add optimal new neurons while keeping the previous neurons fixed. Although rigorous convergence rates can be established for these methods (e.g., Bach, 2017), they are not practically applicable because adding each new neuron requires solving an intractable non-convex global optimization problem. In contrast, the splitting steepest descent approach is fully computationally tractable, because the search for the optimal node splitting schemes amounts to a tractable eigen-decomposition problem (albeit a non-convex one). The original S2D in Liu et al. (2019b) did not provide a convergence guarantee, because the algorithm gets stuck when the splitting matrices become positive definite. By using signed splittings, our S3D can escape more local optima, ensuring both strong theoretical guarantees and better empirical performance.

An alternative approach for learning small and energy-efficient networks is to prune large pre-trained neural networks to obtain compact sub-network structures (e.g., Han et al., 2016; Li et al., 2017; Liu et al., 2017, 2019c; Frankle and Carbin, 2018). As shown in our experiments and in Liu et al. (2019b); Wang et al. (2019a), the splitting approach can outperform existing pruning methods, without requiring the overhead of pre-training large models. A promising future direction is to design algorithms that adaptively combine splitting with pruning to achieve better results.

## 5 Conclusion

In this work, we develop an extension of the splitting steepest descent framework that avoids local optima by introducing signed splittings. Our S3D can learn small and accurate networks in challenging cases. For future work, we will further speed up S3D and explore more flexible ways of optimizing network architectures that go beyond neuron splitting.

## Appendix A Derivation of Optimal Splitting Schemes with Negative Weights

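Lemma. Let $(\boldsymbol{\delta}^*, \boldsymbol{w}^*)$ be an optimal solution of (6). Then $\delta_i^*$ must be an eigen-vector of $S(\theta)$ unless $w_i^* = 0$ or $\delta_i^* = 0$.

###### Proof.

Write $S = S(\theta)$ for simplicity. With fixed weights $\boldsymbol{w}$, the optimization w.r.t. $\boldsymbol{\delta}$ is

$$\min_{\boldsymbol{\delta}}\ \sum_{i=1}^m w_i\,\delta_i^\top S\,\delta_i \qquad \text{s.t.} \qquad \Big\|\sum_{i=1}^m w_i\delta_i\Big\| = 0, \qquad \|\delta_i\| = 1.$$

By the KKT condition, the optimal solution must satisfy

$$w_i^* S\delta_i^* - \lambda_1 w_i^*\bar{\delta}^* - \lambda_2\delta_i^* = 0, \qquad \bar{\delta}^* := \sum_{i=1}^m w_i^*\delta_i^* = 0,$$

where $\lambda_1$ and $\lambda_2$ are two Lagrangian multipliers. Canceling out $\bar{\delta}^*$ gives

$$w_i^* S\delta_i^* - \lambda_2\delta_i^* = 0.$$

Therefore, if $w_i^* \neq 0$ and $\delta_i^* \neq 0$, then $\delta_i^*$ must be an eigen-vector of $S$ with eigen-value $\lambda_2 / w_i^*$. ∎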

### A.1 Derivation of Optimal Binary Splittings ($m=2$)

Theorem.

1) Consider the optimization in (6) with $m = 2$ and $c \ge 1$. Then the optimal solution must satisfy

$$\delta_1 = r_1 v, \qquad \delta_2 = r_2 v,$$

where $v$ is an eigen-vector of $S(\theta)$ and $r_1, r_2$ are two scalars.

2) In this case, the optimization reduces to

$$\begin{aligned}
G_2^{-c} := \min_{\boldsymbol{w}, \boldsymbol{r}, v}\ & (w_1 r_1^2 + w_2 r_2^2)\times\lambda \\
\text{s.t.}\ \ & w_1 + w_2 = 1 \\
& w_1 r_1 + w_2 r_2 = 0 \\
& |w_1| + |w_2| \le c \\
& |r_1|,\ |r_2| \le 1 \\
& \lambda \text{ is an eigen-value of } S(\theta).
\end{aligned} \tag{13}$$

3) The optimal value above is

$$G_2^{-c} = \min\Big(\lambda_{\min},\ -\frac{c-1}{c+1}\lambda_{\max},\ 0\Big). \tag{14}$$

If $-\frac{c-1}{c+1}\lambda_{\max} \le \min(\lambda_{\min}, 0)$, the optimal solution is achieved by

$$w_1 = -\frac{c-1}{2}, \quad \delta_1 = v_{\max}, \qquad w_2 = \frac{c+1}{2}, \quad \delta_2 = \frac{c-1}{c+1}\,v_{\max}.$$

If $\lambda_{\min} \le \min\big(-\frac{c-1}{c+1}\lambda_{\max}, 0\big)$, the optimal solution is achieved by

$$w_1 = \frac{1}{2}, \quad \delta_1 = v_{\min}, \qquad w_2 = \frac{1}{2}, \quad \delta_2 = -v_{\min}.$$

If $0 \le \min\big(\lambda_{\min}, -\frac{c-1}{c+1}\lambda_{\max}\big)$, and hence (for $c > 1$) $\lambda_{\min} = \lambda_{\max} = 0$, the optimal solution is achieved by no splitting: $\delta_1 = \delta_2 = 0$.

###### Proof.

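1) The form of $\delta_1$ and $\delta_2$ is immediately implied by the constraint $w_1 \delta_1 + w_2 \delta_2 = 0$. By the lemma at the start of this appendix, $v$ must be an eigen-vector of $S(\theta)$.

2) Plugging $\delta_1 = r_1 v$ and $\delta_2 = r_2 v$ into (6) directly implies (13).

3) Following (13), we seek to minimize the product of $t := w_1 r_1^2 + w_2 r_2^2$ and $\lambda$. If $\lambda \geq 0$, we need to minimize $t$, while if $\lambda \leq 0$, we need to maximize $t$. The two lemmas below show that the minimum and maximum values of $t$ equal $-\frac{c-1}{c+1}$ and $1$, respectively. Because the range of $\lambda$ is $[\lambda_{\min}, \lambda_{\max}]$, we can write

$$
\begin{aligned}
G_2^{-c} &= \min_{t, \lambda} \left\{ t \times \lambda : \; -\frac{c-1}{c+1} \leq t \leq 1, \;\; \lambda_{\min} \leq \lambda \leq \lambda_{\max} \right\} \\
&= \min\left(\lambda_{\min}, \; -\frac{c-1}{c+1}\lambda_{\max}\right).
\end{aligned}
$$

From $\lambda_{\min} \leq \lambda_{\max}$, we can easily see that $\min\left(\lambda_{\min}, -\frac{c-1}{c+1}\lambda_{\max}\right) \leq 0$, and hence the form above is equivalent to the result in Theorem 2.1. The corresponding optimal solutions follow from the two lemmas below, which describe the values of $(\boldsymbol{w}, \boldsymbol{r})$ that minimize and maximize $t$, respectively. ∎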
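The case analysis behind (14) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the return layout, and the weights $(1/2, 1/2)$ used for the no-splitting case are assumptions made here, and the extreme eigenvalues of $S(\theta)$ are assumed to be computed elsewhere.

```python
def binary_signed_split(lam_min, lam_max, c):
    """Pick the m=2 scheme with the smallest value among the three terms of (14).

    Returns (value, direction, weights, coeffs). Each delta_i equals
    coeffs[i] * v, where v is the eigenvector named by `direction`.
    """
    ratio = (c - 1.0) / (c + 1.0)
    candidates = [
        # Signed splitting along v_max: one negative and one positive weight.
        (-ratio * lam_max, "v_max",
         (-(c - 1.0) / 2.0, (c + 1.0) / 2.0), (1.0, ratio)),
        # Positive splitting along v_min.
        (lam_min, "v_min", (0.5, 0.5), (1.0, -1.0)),
        # No splitting: delta_i = 0 (the weights here are an arbitrary feasible choice).
        (0.0, "none", (0.5, 0.5), (0.0, 0.0)),
    ]
    return min(candidates, key=lambda item: item[0])


def binary_objective(weights, coeffs, lam):
    """Objective of (13): (w1 * r1^2 + w2 * r2^2) * lambda."""
    return lam * sum(w * r * r for w, r in zip(weights, coeffs))


value, direction, weights, coeffs = binary_signed_split(1.0, 3.0, 3.0)
print(value, direction, weights, coeffs)
```

With $\lambda_{\min} = 1$, $\lambda_{\max} = 3$ and $c = 3$, both eigenvalues are positive, so only the signed scheme along $v_{\max}$ yields a negative value.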

###### Lemma.

Consider the following optimization with $c \geq 1$:

$$
\begin{aligned}
R_2^{\min} := \min_{(\boldsymbol{w}, \boldsymbol{r}) \in \mathbb{R}^4} \quad & w_1 r_1^2 + w_2 r_2^2 \\
\text{s.t.} \quad & w_1 + w_2 = 1 \\
& w_1 r_1 + w_2 r_2 = 0 \\
& |w_1| + |w_2| \leq c \\
& |r_1|, |r_2| \leq 1.
\end{aligned}
\tag{15}
$$

Then we have $R_2^{\min} = -\frac{c-1}{c+1}$, and the optimal solution is achieved by the following scheme:

$$w_1 = -\frac{c-1}{2}, \quad r_1 = 1, \qquad w_2 = \frac{c+1}{2}, \quad r_2 = \frac{c-1}{c+1}. \tag{16}$$

###### Proof.

Case 1 ($w_1 \geq 0$, $w_2 \leq 0$)   Assume $w_2 = -a$ with $a \geq 0$. We have $w_1 = 1 + a$, and the problem becomes

$$\min_{a, r_1, r_2} \; (1+a) r_1^2 - a r_2^2 \quad \text{s.t.} \quad (1+a) r_1 = a r_2, \quad a \leq \frac{c-1}{2}, \quad |r_1|, |r_2| \leq 1.$$

Eliminating $r_1$, we have

$$\min_{a, r_2} \; \left(-\frac{a}{1+a}\right) r_2^2 \quad \text{s.t.} \quad a \leq \frac{c-1}{2}, \quad |r_2| \leq 1.$$

The optimal solution is $r_2 = 1$ or $r_2 = -1$, and $a = \frac{c-1}{2}$, for which we achieve a minimum value of $-\frac{c-1}{c+1}$. Up to swapping the two indices, this is the scheme in (16); the case $w_1 \leq 0$, $w_2 \geq 0$ is symmetric.

Case 2 ($w_1 \geq 0$, $w_2 \geq 0$)   This case is obviously sub-optimal since we have $w_1 r_1^2 + w_2 r_2^2 \geq 0$ in this case.

Overall, the minimum value is $-\frac{c-1}{c+1}$. This completes the proof. ∎
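As a sanity check, the scheme in (16) can be substituted directly into the constraints and the objective of (15). The following worked computation uses only the quantities defined in the lemma:

```latex
\begin{aligned}
w_1 + w_2 &= -\frac{c-1}{2} + \frac{c+1}{2} = 1, \\
w_1 r_1 + w_2 r_2 &= -\frac{c-1}{2} + \frac{c+1}{2}\cdot\frac{c-1}{c+1} = 0, \\
|w_1| + |w_2| &= \frac{c-1}{2} + \frac{c+1}{2} = c, \\
w_1 r_1^2 + w_2 r_2^2 &= -\frac{c-1}{2} + \frac{c+1}{2}\cdot\frac{(c-1)^2}{(c+1)^2}
  = \frac{c-1}{2}\left(\frac{c-1}{c+1} - 1\right) = -\frac{c-1}{c+1}.
\end{aligned}
```

The weight budget $|w_1| + |w_2| \leq c$ is therefore tight at the optimum, which is why a larger $c$ gives a more negative value.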

###### Lemma.

Consider the following optimization with $c \geq 1$:

$$
\begin{aligned}
R_2^{\max} := \max_{(\boldsymbol{w}, \boldsymbol{r}) \in \mathbb{R}^4} \quad & w_1 r_1^2 + w_2 r_2^2 \\
\text{s.t.} \quad & w_1 + w_2 = 1 \\
& w_1 r_1 + w_2 r_2 = 0 \\
& |w_1| + |w_2| \leq c \\
& |r_1|, |r_2| \leq 1.
\end{aligned}
\tag{17}
$$

Then we have $R_2^{\max} = 1$, which is achieved by the following scheme:

$$w_1 = \frac{1}{2}, \quad r_1 = 1, \qquad w_2 = \frac{1}{2}, \quad r_2 = -1. \tag{18}$$

###### Proof.

It is easy to see that $R_2^{\max} \leq 1$. On the other hand, this bound is achieved by the scheme in (18). ∎
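The two bounds can also be probed numerically. The sketch below draws random feasible points of (15) and (17) by rejection sampling and records the smallest and largest objective values seen. It is an illustrative check only, not a proof; the function names, sample count, and seed are assumptions made here.

```python
import random


def sample_feasible(c, rng):
    """Draw one feasible (w1, w2, r1, r2) for (15)/(17), or None if rejected."""
    # Any w1 in this interval satisfies |w1| + |1 - w1| <= c.
    w1 = rng.uniform(-(c - 1.0) / 2.0, (c + 1.0) / 2.0)
    w2 = 1.0 - w1
    r1 = rng.uniform(-1.0, 1.0)
    if abs(w2) < 1e-9:
        return None
    # Solve w1 * r1 + w2 * r2 = 0 for r2 and reject points outside the box.
    r2 = -w1 * r1 / w2
    if abs(r2) > 1.0:
        return None
    return w1, w2, r1, r2


def split_objective(w1, w2, r1, r2):
    """Objective shared by (15) and (17)."""
    return w1 * r1 * r1 + w2 * r2 * r2


def empirical_range(c, n_samples=100000, seed=0):
    """Smallest and largest objective values over random feasible points."""
    rng = random.Random(seed)
    lo, hi = float("inf"), float("-inf")
    for _ in range(n_samples):
        point = sample_feasible(c, rng)
        if point is None:
            continue
        value = split_objective(*point)
        lo, hi = min(lo, value), max(hi, value)
    return lo, hi


c = 3.0
lo, hi = empirical_range(c)
print("sampled range:", lo, hi)
print("claimed range:", -(c - 1.0) / (c + 1.0), 1.0)
```

The sampled range should stay inside the claimed interval, and the schemes in (16) and (18) attain its two endpoints exactly.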

### A.2 Derivation of Triplet Splittings (m=3)

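###### Theorem.

Consider the optimization in (6) with $m = 3$ and $c \geq 1$.

1) The optimal solution of (6) must satisfy

$$\delta_i = \sum_{\ell=1}^{d_\lambda} r_{i,\ell} v_\ell,$$

where $\{v_\ell\}_{\ell=1}^{d_\lambda}$ is a set of orthonormal eigen-vectors of $S(\theta)$ that share the same eigenvalue $\lambda$, and $\{r_{i,\ell}\}$ is a set of coefficients.

2) Write $r_i = [r_{i,1}, \ldots, r_{i,d_\lambda}]$ for $i = 1, 2, 3$. The optimization in (6) is equivalent to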

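$$
\begin{aligned}
G_3^{-c} := \min_{\boldsymbol{w}, \boldsymbol{r}, \lambda} \quad & \left(w_1 \|r_1\|^2 + w_2 \|r_2\|^2 + w_3 \|r_3\|^2\right) \times \lambda \\
\text{s.t.} \quad & w_1 + w_2 + w_3 = 1 \\
& w_1 r_1 + w_2 r_2 + w_3 r_3 = 0 \\
& |w_1| + |w_2| + |w_3| \leq c \\
& \|r_1\|, \|r_2\|, \|r_3\| \leq 1 \\
& \lambda \text{ is an eigen-value of } S(\theta) \text{ with } d_\lambda \text{ orthogonal eigen-vectors}.
\end{aligned}
\tag{19}
$$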

3) The optimal value above is

$$G_3^{-c} = \min\left(\frac{c+1}{2}\lambda_{\min}, \; -\frac{c-1}{2}\lambda_{\max}, \; 0\right). \tag{20}$$

If the minimum in (20) is attained by $-\frac{c-1}{2}\lambda_{\max}$, the optimal solution is achieved by

$$\left(w_1 = -\frac{c-1}{4}, \; \delta_1 = v_{\max}\right), \quad \left(w_2 = -\frac{c-1}{4}, \; \delta_2 = -v_{\max}\right), \quad \left(w_3 = \frac{c+1}{2}, \; \delta_3 = 0\right).$$

If the minimum in (20) is attained by $\frac{c+1}{2}\lambda_{\min}$, the optimal solution is achieved by

$$\left(w_1 = \frac{c+1}{4}, \; \delta_1 = v_{\min}\right), \quad \left(w_2 = \frac{c+1}{4}, \; \delta_2 = -v_{\min}\right), \quad \left(w_3 = -\frac{c-1}{2}, \; \delta_3 = 0\right).$$

If $\lambda_{\min} = \lambda_{\max} = 0$, and hence $G_3^{-c} = 0$, the optimal solution can be achieved by no splitting: $\delta_1 = \delta_2 = \delta_3 = 0$.
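The triplet schemes can be checked in the same way as the binary ones. The sketch below selects the scheme with the smallest value among the three terms of (20) and evaluates the objective of (19) for the one-dimensional case where every $\delta_i$ is a scalar multiple of a single eigenvector. The function names, the return layout, and the uniform weights used for the no-splitting case are assumptions made for illustration.

```python
def triplet_signed_split(lam_min, lam_max, c):
    """Pick the m=3 scheme with the smallest value among the three terms of (20).

    Returns (value, direction, weights, coeffs). Each delta_i equals
    coeffs[i] * v, where v is the eigenvector named by `direction`.
    """
    neg = (c - 1.0) / 2.0
    pos = (c + 1.0) / 2.0
    candidates = [
        # Two negatively weighted copies along +/- v_max, one copy left in place.
        (-neg * lam_max, "v_max", (-neg / 2.0, -neg / 2.0, pos), (1.0, -1.0, 0.0)),
        # Two positively weighted copies along +/- v_min, one negative copy in place.
        (pos * lam_min, "v_min", (pos / 2.0, pos / 2.0, -neg), (1.0, -1.0, 0.0)),
        # No splitting: delta_i = 0 (the weights here are an arbitrary feasible choice).
        (0.0, "none", (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0), (0.0, 0.0, 0.0)),
    ]
    return min(candidates, key=lambda item: item[0])


def triplet_objective(weights, coeffs, lam):
    """Objective of (19) when each r_i is a scalar coefficient."""
    return lam * sum(w * r * r for w, r in zip(weights, coeffs))


value, direction, weights, coeffs = triplet_signed_split(1.0, 3.0, 3.0)
print(value, direction, weights, coeffs)
```

For $\lambda_{\min} = 1$, $\lambda_{\max} = 3$ and $c = 3$, the triplet value is $-3$, compared with $-1.5$ from the binary formula (14) for the same inputs.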

###### Proof.

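1-2) Following the lemma at the start of this appendix, the optimal $\delta_i$ are eigen-vectors of $S(\theta)$. Because eigen-vectors associated with different eigen-values are linearly independent, $\{\delta_i\}$ must share the same eigen-value (denoted by $\lambda$) due to the constraint $\sum_{i=1}^{3} w_i \delta_i = 0$. Assume $\lambda$ is associated with $d_\lambda$ orthonormal eigen-vectors $\{v_\ell\}_{\ell=1}^{d_\lambda}$. Then we can write $\delta_i = \sum_{\ell=1}^{d_\lambda} r_{i,\ell} v_\ell$ for $i = 1, 2, 3$, for which $\delta_i^\top S(\theta) \delta_i = \lambda \|r_i\|^2$ and $\|\delta_i\| = \|r_i\|$. It is then easy to reduce (6) to (19).

3) Following Lemma A.2 and A.2, the value of $w_1 \|r_1\|^2 + w_2 \|r_2\|^2 + w_3 \|r_3\|^2$ in (19) can range from $-\frac{c-1}{2}$ to $\frac{c+1}{2}$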