# Splitting Steepest Descent for Growing Neural Architectures

###### Abstract

We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons into multiple off-springs.
By leveraging a functional steepest descent idea,
we derive a simple criterion for deciding the best subset of neurons to split
and a *splitting gradient* for optimally updating the off-springs.
Theoretically, our splitting strategy is a second-order functional steepest descent for escaping saddle points in an $\infty$-Wasserstein metric space,
on which the standard parametric gradient descent is a first-order steepest descent.
Our method provides a new computationally efficient approach for
optimizing
neural network structures,
especially for learning lightweight neural architectures in resource-constrained settings.

## 1 Introduction

Deep neural networks (DNNs) have achieved remarkable empirical successes recently. However, efficient and automatic optimization of model architectures remains a key challenge. Compared with parameter optimization, which has been well addressed by gradient-based methods (or back-propagation), optimizing model structures involves significantly more challenging discrete optimization with large search spaces and high evaluation cost. Although there has been rapid progress recently, designing the best architectures still requires a lot of expert knowledge and trial and error for most practical tasks.

This work is motivated by the idea of extending the power of gradient descent to the domain of model structure optimization. In particular, we consider the problem of progressively
growing a neural network by “splitting” existing neurons into several “off-springs”,
and develop a simple and practical approach for deciding
the best subset of neurons to split and how to split them, adaptively based on the existing structure.
Instead of treating this as a discrete optimization problem,
our approach is to frame the structure optimization problem into a continuous functional optimization,
and derive an optimal splitting strategy that yields the *steepest descent* of the loss when the off-springs are infinitely close to the original neurons.

This yields a practical algorithm shown in Algorithm 1, which alternates between a standard parametric gradient descent phase to reach a parametric local optimum, and a splitting phase in which the neurons are sorted according to a splitting index, and the top-ranked neurons are split into two off-springs following a splitting gradient direction. The splitting index and splitting gradient are the minimum eigenvalue and the associated eigenvector of a splitting matrix that can be calculated efficiently. Theoretically, these two phases can be viewed as performing functional steepest descent on an $\infty$-Wasserstein metric space, in which the splitting phase is a second-order descent for escaping saddle points in the functional space, while the parametric gradient descent corresponds to a first-order descent. Empirically, our algorithm is simple and practical, and can be useful in many challenging problems, including progressive training of interpretable neural networks, learning lightweight neural architectures for resource-constrained settings, and transfer learning.

#### Related Works

The idea of progressively growing neural networks by node splitting is not new, but previous works are based on heuristic or purely random splitting strategies (e.g., wynne1992node; chen2015net2net). A different approach for progressive training is the Frank-Wolfe or gradient boosting based strategies (e.g., schwenk2000boosting; bengio2006convex; bach2017breaking), which iteratively add new neurons derived from the functional conditional gradient, while keeping the previous neurons fixed. However, these methods are not suitable for large-scale settings, because adding each neuron requires solving a difficult non-convex optimization problem, and keeping the previous neurons fixed prevents us from correcting the mistakes made in earlier iterations. A practical alternative to Frank-Wolfe is to simply add new randomly initialized neurons and co-optimize the new and old neurons together. However, random initialization does not leverage the information of the existing model and takes a longer time to converge, while the split neurons inherit the knowledge from the existing model (see chen2015net2net), and are already close to the optimal solution.

An opposite direction of progressive training is to prune large pre-trained networks (e.g., han2015deep; li2016pruning; liu2017learning). In comparison, our splitting method requires no large pre-trained models and can learn lightweight architectures better than existing pruning methods, which is of critical importance for resource-constrained settings like mobile devices and Internet of things. More broadly, there has been a series of recent works on neural architecture search, based on various strategies for combinatorial optimization, including reinforcement learning (RL) (e.g., pham2018efficient; cai2018proxylessnas; zoph2016neural), evolutionary algorithms (EA) (e.g., stanley2002evolving; real2018regularized), and continuous relaxation (e.g., liu2018darts; xie2018snas).

#### Background: Steepest Descent and Saddle Points

Gradient descent (GD) is the workhorse for solving large-scale optimization problems in machine learning and deep learning.
GD can be viewed as a steepest descent procedure that iteratively improves the solution by following the direction that maximally decreases the loss function within a small neighborhood.
Specifically, for minimizing a loss function $L(\theta)$,
each iteration of steepest descent updates the parameter via $\theta \leftarrow \theta + \epsilon\delta$, where $\epsilon$ is a small step size and $\delta$ is an update direction chosen to maximally decrease the loss $L(\theta + \epsilon\delta)$ of the updated parameter under a norm constraint $\|\delta\| \le 1$, where $\|\cdot\|$ denotes the Euclidean norm.
When $\nabla L(\theta) \neq 0$ and $\epsilon$ is infinitesimal,
the optimal descent direction $\delta$ equals the negative gradient direction, that is,
$\delta = -\nabla L(\theta)/\|\nabla L(\theta)\|$,
yielding a descent of $L(\theta + \epsilon\delta) - L(\theta) \approx -\epsilon\|\nabla L(\theta)\|$.
At a critical point with a zero gradient ($\nabla L(\theta) = 0$),
the steepest descent depends on the spectrum of the Hessian matrix $\nabla^2 L(\theta)$. Denote by $\lambda_{\min}$ the minimum eigenvalue of $\nabla^2 L(\theta)$ and by $v_{\min}$ its associated eigenvector.
When $\lambda_{\min} > 0$, the point $\theta$ is a stable local minimum and no further improvement can be made in the infinitesimal neighborhood.
When $\lambda_{\min} < 0$,
the point $\theta$ is a saddle point or local maximum,
the steepest descent direction equals the eigenvector $\pm v_{\min}$,
and the step changes the loss by $\epsilon^2\lambda_{\min}/2 < 0$. (The case $\lambda_{\min} = 0$ depends on higher-order information and happens rarely.)
In practice, it has been shown that there is no need to explicitly calculate the negative eigenvalue direction, because saddle points and local maxima are highly unstable and can be escaped by using gradient descent with random initialization or stochastic noise (e.g., lee2016gradient; jin2017escape).
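The two regimes described above can be made concrete on a toy problem. The following is a minimal sketch, assuming the hypothetical loss $L(\theta) = \theta_1^2 - \theta_2^2$, which has a saddle point at the origin; the function names and the hard-coded Hessian spectrum are illustrative choices, not part of the method.

```python
import math

def loss(theta):
    """Toy loss with a saddle point at the origin (illustrative assumption)."""
    x, y = theta
    return x * x - y * y

def grad(theta):
    x, y = theta
    return (2.0 * x, -2.0 * y)

# The Hessian of this toy loss is diag(2, -2) everywhere, so its minimum
# eigenvalue and the associated eigenvector are known in closed form.
LAMBDA_MIN, V_MIN = -2.0, (0.0, 1.0)

def steepest_descent_step(theta, eps):
    """One steepest-descent step of Euclidean length eps."""
    g = grad(theta)
    g_norm = math.hypot(g[0], g[1])
    if g_norm > 1e-12:
        # Non-critical point: follow the normalized negative gradient.
        return tuple(t - eps * gi / g_norm for t, gi in zip(theta, g))
    if LAMBDA_MIN < 0.0:
        # Critical point with negative curvature: follow the eigenvector.
        return tuple(t + eps * vi for t, vi in zip(theta, V_MIN))
    return theta  # stable local minimum: no infinitesimal improvement

eps = 0.1
for start in [(1.0, 0.0), (0.0, 0.0)]:
    new = steepest_descent_step(start, eps)
    print(start, "->", new, "loss change:", loss(new) - loss(start))
```

Away from the saddle the loss change is first order in the step size, while at the saddle it is second order and governed by the minimum eigenvalue.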

## 2 Splitting Neurons Using Steepest Descent

We introduce our main method in this section. We first illustrate the idea with the simple case of splitting a single neuron in Section 2.1, and then consider the more general case of simultaneously splitting multiple neurons in deep networks in Section 2.2, which yields our main progressive training algorithm (Algorithm 1). Section 2.3 provides a theoretical discussion and interprets our procedure as a functional steepest descent of the distribution of the neuron weights under the $\infty$-Wasserstein metric.

### 2.1 Splitting a Single Neuron

Let $\sigma(\theta, x)$ be a neuron inside a neural network that we want to learn from data, where $\theta$ is the parameter of the neuron and $x$ its input variable. Assume the loss of $\theta$ has the form

$$L(\theta) := \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi(\sigma(\theta, x))\big], \tag{1}$$

where $\mathcal{D}$ is a data distribution, and $\Phi$ is a map determined by the overall loss function. The parameters of the other parts of the network are assumed to be fixed or optimized using standard procedures and are omitted for notational convenience.

Standard gradient descent can only yield parametric updates of $\theta$. We introduce a generalized steepest descent procedure that allows us to incrementally grow the neural network by gradually introducing new neurons, by “splitting” the existing neurons into multiple copies in a (locally) optimal fashion derived using ideas from steepest descent.

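In particular, we split $\theta$ into $m$ off-springs $\boldsymbol{\theta} := \{\theta_i\}_{i=1}^m$, and replace the neuron $\sigma(\theta, x)$ with a weighted sum of the off-spring neurons $\sum_{i=1}^m w_i\,\sigma(\theta_i, x)$, where $w := \{w_i\}_{i=1}^m$ is a set of positive weights assigned to the off-springs, satisfying $\sum_{i=1}^m w_i = 1$, $w_i > 0$. This yields an augmented loss function on $\boldsymbol{\theta}$ and $w$:

$$\mathcal{L}(\boldsymbol{\theta}, w) := \mathbb{E}_{x\sim\mathcal{D}}\Big[\Phi\Big(\sum_{i=1}^m w_i\,\sigma(\theta_i, x)\Big)\Big]. \tag{2}$$

A key property of this construction is that it introduces a smooth change on the loss function when the off-springs are close to the original parameter $\theta$: when $\theta_i = \theta$, $\forall i = 1,\ldots,m$, the augmented network and loss are equivalent to the original ones, that is, $\mathcal{L}(\theta\mathbf{1}_m, w) = L(\theta)$, where $\mathbf{1}_m$ denotes the $m\times 1$ vector consisting of all ones; when all the $\{\theta_i\}$ are within an infinitesimal neighborhood of $\theta$, it yields an infinitesimal change on the loss, from which a steepest descent can be derived.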

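Formally, consider the set of splitting schemes $(m, \boldsymbol{\theta}, w)$ whose off-springs are $\epsilon$-close to the original neuron:

$$\Big\{(m, \boldsymbol{\theta}, w)\colon\ m\in\mathbb{N}_+,\ \ \|\theta_i - \theta\| \le \epsilon,\ \ \sum_{i=1}^m w_i = 1,\ \ w_i > 0,\ \ \forall i = 1,\ldots,m\Big\}.$$

We want to decide the optimal $(m, \boldsymbol{\theta}, w)$ to maximize the decrease of loss $\mathcal{L}(\boldsymbol{\theta}, w) - L(\theta)$ with an infinitesimal $\epsilon$.
Although this appears to be an infinite dimensional optimization because $m$ is allowed to be arbitrarily large,
we show that the optimal choice is achieved with either $m = 1$ (no splitting) or $m = 2$ (splitting into two off-springs), with uniform weights $w_i = 1/m$.
Whether a neuron should be split ($m = 1$ or $2$)
and the optimal values of the off-springs $\{\theta_i\}$
are decided by
the minimum eigenvalue and eigenvector of a *splitting matrix*,
which plays a role similar to that of the Hessian matrix for deciding saddle points.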
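**Definition 2.1 (Splitting Matrix).**
For $L(\theta)$ in (1),
its splitting matrix $S(\theta)$ is defined as

$$S(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(\sigma(\theta, x))\,\nabla^2_{\theta\theta}\sigma(\theta, x)\big]. \tag{3}$$

It is a symmetric “semi-Hessian” matrix that involves the first derivative $\Phi'(\cdot)$ and the second derivative of $\sigma(\theta, x)$. It is the “easy part” of the full Hessian matrix $\nabla^2 L(\theta) = S(\theta) + T(\theta)$, where $T(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi''(\sigma(\theta, x))\,\nabla_\theta\sigma(\theta, x)\nabla_\theta\sigma(\theta, x)^\top\big]$ is an extra term that involves the more complex second-order derivative $\Phi''(\cdot)$. We call the minimum eigenvalue $\lambda_{\min}(S(\theta))$ of $S(\theta)$ the splitting index of $\theta$.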
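As a concrete illustration of the splitting matrix in (3), the sketch below evaluates it for a hypothetical two-parameter neuron $\sigma(\theta, x) = \tanh(ax + b)$ with $\theta = (a, b)$ and a squared-error map $\Phi(s) = (s - y)^2$, averaging over a small made-up dataset in place of the expectation. The neuron, the loss, and the data are assumptions chosen only so that the derivatives have closed forms; they are not taken from the experiments.

```python
import math

def neuron(theta, x):
    """Hypothetical neuron sigma(theta, x) = tanh(a * x + b)."""
    a, b = theta
    return math.tanh(a * x + b)

def splitting_matrix(theta, data):
    """Empirical S(theta) = mean of Phi'(sigma) * Hessian_theta(sigma)."""
    s00 = s01 = s11 = 0.0
    for x, y in data:
        out = neuron(theta, x)
        d_phi = 2.0 * (out - y)                 # Phi'(s) for Phi(s) = (s - y)^2
        d2_act = -2.0 * out * (1.0 - out * out) # second derivative of tanh
        coeff = d_phi * d2_act
        # Hessian of tanh(a*x + b) w.r.t. (a, b) is d2_act * [x, 1][x, 1]^T.
        s00 += coeff * x * x
        s01 += coeff * x
        s11 += coeff
    n = len(data)
    return [[s00 / n, s01 / n], [s01 / n, s11 / n]]

# Made-up (x, y) pairs standing in for the data distribution.
data = [(-1.0, 0.5), (0.5, -0.2), (1.5, 0.8)]
S = splitting_matrix((0.7, -0.3), data)
print(S)
```

Only the first derivative of $\Phi$ enters, which is why the matrix is cheaper to form than the full Hessian.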

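It is useful to decompose each $\theta_i$ into $\theta_i = \theta + \epsilon(\mu + \delta_i)$, where $\mu$ is an average displacement vector shared by all copies, and $\delta_i$ is the splitting vector associated with $\theta_i$, which satisfies $\sum_i w_i\delta_i = 0$ (which implies $\sum_i w_i\theta_i = \theta + \epsilon\mu$). It turns out that the change of loss naturally decomposes into two terms that reflect the effects of the average displacement and splitting, respectively.

**Theorem 2.1.** For $L(\theta)$ and $\mathcal{L}(\boldsymbol{\theta}, w)$ in (1) and (2), assume $\mathcal{L}(\boldsymbol{\theta}, w)$ has bounded third-order derivatives w.r.t. $\boldsymbol{\theta}$. We have

$$\mathcal{L}(\boldsymbol{\theta}, w) - L(\theta) = \underbrace{L(\theta + \epsilon\mu) - L(\theta)}_{I(\mu;\,\theta)\,=\,O(\epsilon)} + \underbrace{\frac{\epsilon^2}{2}\sum_{i=1}^m w_i\,\delta_i^\top S(\theta)\,\delta_i}_{II(\boldsymbol{\delta}, w;\,\theta)\,=\,O(\epsilon^2)} + O(\epsilon^3), \tag{4}$$

where the change of loss is decomposed into two terms: the first term $I(\mu;\theta)$ is the effect of the average displacement $\mu$, and it is equivalent to applying the standard parametric update $\theta \leftarrow \theta + \epsilon\mu$ on $L(\theta)$. The second term $II(\boldsymbol{\delta}, w;\theta)$ is the change of the loss caused by the splitting vectors $\boldsymbol{\delta} := \{\delta_i\}$. It depends on $L(\theta)$ only through the splitting matrix $S(\theta)$.

Therefore, the average displacement $\mu$ can be decided by standard parametric steepest (gradient) descent, which yields an $O(\epsilon)$ decrease of loss at non-stationary points. In comparison, the splitting term is always $O(\epsilon^2)$, which is much smaller. Given that introducing new neurons increases model size, splitting should not be preferred whenever an $O(\epsilon)$ gain can be obtained with parametric updates that do not increase model size. It is therefore natural to introduce splitting only at stable local minima, when the optimal $\mu$ equals zero and no further improvement is possible with (infinitesimal) regular parametric descent on $L(\theta)$. In this case, we only need to minimize the splitting term $II(\boldsymbol{\delta}, w;\theta)$ to decide the optimal $\boldsymbol{\delta}$, which is shown in the following theorem.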
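The decomposition of the loss change into a displacement term and a splitting term can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-dimensional Gaussian RBF neuron $\sigma(\theta, x) = \exp(-(x - \theta)^2)$, a squared-error map $\Phi(s) = (s - y)^2$, and a small made-up dataset; all names and numbers are illustrative. With a scalar parameter the splitting matrix is a single number.

```python
import math

# Made-up (x, y) pairs standing in for the data distribution.
DATA = [(-1.0, 0.2), (0.0, 0.9), (0.5, 0.4), (1.5, -0.1)]

def sigma(theta, x):
    return math.exp(-(x - theta) ** 2)

def d2_sigma(theta, x):
    u = x - theta
    return (4.0 * u * u - 2.0) * math.exp(-u * u)  # second derivative in theta

def augmented_loss(thetas, weights):
    """Mean of Phi(sum_i w_i sigma(theta_i, x)) with Phi(s) = (s - y)^2."""
    total = 0.0
    for x, y in DATA:
        out = sum(w * sigma(t, x) for t, w in zip(thetas, weights))
        total += (out - y) ** 2
    return total / len(DATA)

def splitting_matrix(theta):
    """Scalar S(theta) = mean of Phi'(sigma) * d^2 sigma / d theta^2."""
    return sum(2.0 * (sigma(theta, x) - y) * d2_sigma(theta, x)
               for x, y in DATA) / len(DATA)

theta, eps, mu = 0.3, 1e-3, 0.5
weights = [0.25, 0.75]
deltas = [0.9, -0.3]          # weighted mean is zero: 0.25*0.9 - 0.75*0.3 = 0
thetas = [theta + eps * (mu + d) for d in deltas]

base = augmented_loss([theta], [1.0])
total_change = augmented_loss(thetas, weights) - base
term_I = augmented_loss([theta + eps * mu], [1.0]) - base
term_II = 0.5 * eps ** 2 * sum(w * d * d for w, d in zip(weights, deltas)) \
    * splitting_matrix(theta)
residual = total_change - term_I - term_II   # expected to be O(eps^3)
print(total_change, term_I, term_II, residual)
```

Shrinking `eps` by a factor of ten should shrink the residual by roughly a factor of a thousand, in line with the $O(\epsilon^3)$ remainder.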

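**Theorem (optimal splitting).** Consider the splitting term $II(\boldsymbol{\delta}, w;\theta)$ in (4).

a) If $\lambda_{\min}(S(\theta)) \ge 0$, we have $II(\boldsymbol{\delta}, w;\theta) \ge 0$ for any $w$ and $\boldsymbol{\delta}$, and hence no infinitesimal splitting can decrease the loss. We say that $\theta$ is splitting stable in this case.

b) If $\lambda_{\min}(S(\theta)) < 0$, an optimal splitting strategy that minimizes $II(\boldsymbol{\delta}, w;\theta)$ subject to $\|\delta_i\| \le 1$ is

$$m = 2,\qquad w_1 = w_2 = 1/2,\qquad\text{and}\qquad \delta_1 = v_{\min}(S(\theta)),\quad \delta_2 = -v_{\min}(S(\theta)),$$

where $v_{\min}(S(\theta))$ denotes the eigenvector associated with $\lambda_{\min}(S(\theta))$ and is called the splitting gradient. Here we split the neuron into two copies of equal weights, and update each copy with the splitting gradient. The change of loss obtained is $II(\{\delta_1, -\delta_1\}, \{1/2, 1/2\};\theta) = \epsilon^2\lambda_{\min}(S(\theta))/2 < 0$.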
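The splitting rule in part (b) is simple to implement once a splitting matrix is available. The following is a minimal sketch for a neuron with two parameters, where the minimum eigenpair of a symmetric $2\times 2$ matrix has a closed form; the helper names `min_eigpair_2x2` and `split_neuron` are hypothetical, and a general implementation would call a symmetric eigen-solver instead.

```python
import math

def min_eigpair_2x2(S):
    """Smallest eigenvalue and a unit eigenvector of a symmetric 2x2 matrix."""
    (a, b), (_, c) = S
    lam = 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)
    if b != 0.0:
        v = (lam - c, b)            # solves (S - lam * I) v = 0
    else:
        v = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)

def split_neuron(theta, S, eps):
    """Return [(offspring, weight), ...] and the predicted second-order loss change."""
    lam, v = min_eigpair_2x2(S)
    if lam >= 0.0:
        # Splitting stable: keep the neuron as it is.
        return [(tuple(theta), 1.0)], 0.0
    plus = tuple(t + eps * vi for t, vi in zip(theta, v))
    minus = tuple(t - eps * vi for t, vi in zip(theta, v))
    return [(plus, 0.5), (minus, 0.5)], 0.5 * eps ** 2 * lam

offsprings, predicted = split_neuron((0.0, 0.0), [[1.0, 2.0], [2.0, 1.0]], eps=0.1)
print(offsprings, predicted)
```

The two off-springs sit symmetrically around the original parameter, so their weighted average leaves the neuron's parameter unchanged and only the splitting term contributes.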

#### Remark

The splitting stability ($S(\theta) \succeq 0$) does not necessarily ensure the standard parametric stability of $L(\theta)$ (i.e., $\nabla^2 L(\theta) = S(\theta) + T(\theta) \succeq 0$), except when $\Phi(\cdot)$ is convex, which ensures $T(\theta) \succeq 0$ (see Definition 2.1). If both $S(\theta) \succeq 0$ and $\nabla^2 L(\theta) \succeq 0$ hold, the loss cannot be improved by any local update or splitting, no matter how many off-springs are allowed. Since stochastic gradient descent is guaranteed to escape unstable stationary points (lee2016gradient; jin2017escape), we only need to calculate $S(\theta)$ to decide the splitting stability in practice.
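The relation between the splitting matrix and the full Hessian used in this remark follows from differentiating (1) twice. A short derivation, writing $\Phi'$ and $\Phi''$ for the derivatives of the univariate map $\Phi$, abbreviating $\sigma = \sigma(\theta, x)$, and assuming differentiation and expectation can be exchanged:

```latex
\begin{align*}
\nabla L(\theta)
  &= \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(\sigma)\,\nabla_\theta\sigma\big], \\
\nabla^2 L(\theta)
  &= \underbrace{\mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(\sigma)\,\nabla^2_{\theta\theta}\sigma\big]}_{S(\theta)}
   + \underbrace{\mathbb{E}_{x\sim\mathcal{D}}\big[\Phi''(\sigma)\,\nabla_\theta\sigma\,\nabla_\theta\sigma^\top\big]}_{T(\theta)}, \\
u^\top T(\theta)\,u
  &= \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi''(\sigma)\,(u^\top\nabla_\theta\sigma)^2\big] \;\ge\; 0
  \quad\text{for all } u \text{ whenever } \Phi'' \ge 0 .
\end{align*}
```

So for convex $\Phi$ the extra term is positive semidefinite, and splitting stability then implies parametric stability.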

### 2.2 Splitting Deep Neural Networks

In practice, we need to split multiple neurons simultaneously, which may be of different types, or located in different layers of a deep neural network. The key questions are whether the optimal splitting strategies of different neurons influence each other in some way, and how to compare the gain of splitting different neurons and select the best subset of neurons to split under a budget constraint.

It turns out the answer is simple. We show that the change of loss caused by splitting a set of neurons is simply the sum of the splitting terms of the individual neurons. Therefore, we can calculate the splitting matrix of each neuron independently without considering the other neurons, and compare the “splitting desirability” of the different neurons by their minimum eigenvalues (splitting indexes). This motivates our main algorithm (Algorithm 1), in which we progressively split the neurons with the most negative splitting indexes. Since the neurons can be in different layers and of different types, this provides an adaptive way to grow neural network structures to fit best with data.

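**Algorithm 1** (splitting steepest descent, in outline). Starting from a small network, repeat: (1) update the parameters with standard parametric gradient descent until a parametric local optimum is reached; (2) compute the splitting matrix of each neuron, together with its minimum eigenvalue (splitting index) and the associated eigenvector (splitting gradient); (3) select the neurons whose splitting indexes are ranked in the top $m_*$ and are smaller than a threshold $\lambda_*$; (4) split each selected neuron into two off-springs of equal weight, moved by a step $\pm\epsilon$ along its splitting gradient.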

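To set up the notation, let $\theta^{[1:n]} = \{\theta^{[1]},\ldots,\theta^{[n]}\}$ be the parameters of a set of neurons (or any duplicable sub-structures) in a large network, where $\theta^{[\ell]}$ is the parameter of the $\ell$-th neuron. Assume we split $\theta^{[\ell]}$ into $m_\ell$ copies $\boldsymbol{\theta}^{[\ell]} := \{\theta_i^{[\ell]}\}_{i=1}^{m_\ell}$, with weights $w^{[\ell]} = \{w_i^{[\ell]}\}_{i=1}^{m_\ell}$ with $\sum_{i=1}^{m_\ell} w_i^{[\ell]} = 1$. Denote by $L(\theta^{[1:n]})$ and $\mathcal{L}(\boldsymbol{\theta}^{[1:n]}, w^{[1:n]})$ the loss function of the original and augmented networks, respectively. It is hard to specify the actual expression of the loss functions in general cases, but it is sufficient to know that $L(\theta^{[1:n]})$ depends on each $\theta^{[\ell]}$ only through the output of its related neuron,

$$L(\theta^{[1:n]}) = \mathbb{E}_{x\sim\mathcal{D}}\Big[\Phi_\ell\big(\sigma_\ell(\theta^{[\ell]}, h^{[\ell]});\ \theta^{[\neg\ell]}\big)\Big],\qquad h^{[\ell]} = g_\ell(x;\ \theta^{[\neg\ell]}), \tag{5}$$

where $\sigma_\ell$ denotes the activation function of neuron $\ell$, and $g_\ell$ and $\Phi_\ell$ denote the parts of the loss that connect to the input and output of neuron $\ell$, both of which depend on the other parameters $\theta^{[\neg\ell]}$ in some complex way. Similarly, the augmented loss satisfies

$$\mathcal{L}(\boldsymbol{\theta}^{[1:n]}, w^{[1:n]}) = \mathbb{E}_{x\sim\mathcal{D}}\Big[\tilde\Phi_\ell\Big(\sum_{i=1}^{m_\ell} w_i^{[\ell]}\,\sigma_\ell(\theta_i^{[\ell]}, \tilde h^{[\ell]});\ \boldsymbol{\theta}^{[\neg\ell]}, w^{[\neg\ell]}\Big)\Big], \tag{6}$$

where $\tilde h^{[\ell]} = \tilde g_\ell(x;\ \boldsymbol{\theta}^{[\neg\ell]}, w^{[\neg\ell]})$, and $\tilde g_\ell$ and $\tilde\Phi_\ell$ are the augmented variants of $g_\ell$ and $\Phi_\ell$, respectively. Eqs. (5) and (6) only provide a partial specification of the loss function of deep neural nets, but are sufficient to establish the following key extension of Theorem 2.1 to the multiple neurons case.

**Theorem 2.2.** Under the setting above, assume $\theta_i^{[\ell]} = \theta^{[\ell]} + \epsilon(\mu^{[\ell]} + \delta_i^{[\ell]})$ for $\ell = 1,\ldots,n$, where $\mu^{[\ell]}$ denotes the average displacement vector on $\theta^{[\ell]}$, and $\delta_i^{[\ell]}$ is the $i$-th splitting vector of $\theta^{[\ell]}$, with $\sum_{i=1}^{m_\ell} w_i^{[\ell]}\delta_i^{[\ell]} = 0$. Assume $\mathcal{L}(\boldsymbol{\theta}^{[1:n]}, w^{[1:n]})$ has bounded third-order derivatives w.r.t. $\boldsymbol{\theta}^{[1:n]}$. We have

$$\mathcal{L}(\boldsymbol{\theta}^{[1:n]}, w^{[1:n]}) = L(\theta^{[1:n]} + \epsilon\mu^{[1:n]}) + \sum_{\ell=1}^n \underbrace{\frac{\epsilon^2}{2}\sum_{i=1}^{m_\ell} w_i^{[\ell]}\,\delta_i^{[\ell]\top} S^{[\ell]}(\theta^{[1:n]})\,\delta_i^{[\ell]}}_{II_\ell(\boldsymbol{\delta}^{[\ell]}, w^{[\ell]};\ \theta^{[1:n]})} + O(\epsilon^3),$$

where the effect of average displacement is again equivalent to that of the parametric update $\theta^{[1:n]} \leftarrow \theta^{[1:n]} + \epsilon\mu^{[1:n]}$; the splitting effect equals the sum of the individual splitting terms $II_\ell(\boldsymbol{\delta}^{[\ell]}, w^{[\ell]};\theta^{[1:n]})$, with $S^{[\ell]}(\theta^{[1:n]})$ the splitting matrix of neuron $\ell$,

$$S^{[\ell]}(\theta^{[1:n]}) = \mathbb{E}_{x\sim\mathcal{D}}\Big[\nabla_{\sigma}\Phi_\ell\big(\sigma_\ell(\theta^{[\ell]}, h^{[\ell]});\ \theta^{[\neg\ell]}\big)\,\nabla^2_{\theta\theta}\sigma_\ell(\theta^{[\ell]}, h^{[\ell]})\Big]. \tag{7}$$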

The important implication of
Theorem 2.2 is that there is *no crossing term* in the splitting matrix,
unlike the standard Hessian matrix.
Therefore, the splitting effect of an individual neuron
only depends on its own splitting matrix and can be evaluated individually;
the splitting effects of different neurons can be compared using their splitting indexes,
allowing us to decide the best subset of neurons to split when a maximum number constraint is imposed.
As shown in Algorithm 1,
we decide a maximum number $m_*$ of neurons to split at each iteration
and a threshold $\lambda_* \le 0$ of the splitting index,
and split the neurons whose splitting indexes are ranked in the top $m_*$ and are smaller than $\lambda_*$.
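The selection rule described above amounts to a filter followed by a sort. A minimal sketch, assuming the splitting indexes have already been computed and are passed in as a plain list; the function name and argument names are illustrative.

```python
def select_neurons_to_split(splitting_indexes, max_splits, threshold=0.0):
    """Pick up to max_splits neurons with the most negative splitting
    indexes, keeping only those strictly below the threshold."""
    candidates = [(lam, idx) for idx, lam in enumerate(splitting_indexes)
                  if lam < threshold]
    candidates.sort()  # most negative splitting index first
    return [idx for _, idx in candidates[:max_splits]]

# Hypothetical splitting indexes of five neurons, possibly in different layers.
indexes = [0.3, -0.5, -0.1, -0.9, 0.0]
print(select_neurons_to_split(indexes, max_splits=2, threshold=-0.05))
```

Because the splitting terms of different neurons add up without cross terms, ranking by the individual indexes is enough and no joint search over subsets is needed.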

#### Computational Efficiency

The computational cost of evaluating all the splitting indexes and gradients is $O(nd^3)$, where $n$ is the number of neurons and $d$ is the number of parameters of each neuron, which is often not large in practice. Further computational speedup can be obtained by using efficient numerical methods, which we leave for future work. For very large-scale settings, the eigenvalues and eigenvectors can be approximated efficiently by gradient descent on the Rayleigh quotient, which only requires evaluating matrix-vector products with the splitting matrix without expanding the whole matrix, yielding a computational cost similar to that of standard parametric gradient descent.
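As an illustration of a matrix-free eigen-solver of this kind, the sketch below runs projected gradient descent on the Rayleigh quotient $v^\top S v / v^\top v$ using only a matrix-vector product callback. This is a generic numerical routine written under stated assumptions (a fixed step size small enough for the spectrum at hand, a random start not orthogonal to the target eigenvector), not the authors' implementation; the small explicit matrix stands in for a splitting matrix whose products would in practice come from automatic differentiation.

```python
import math
import random

def min_eig_rayleigh(matvec, dim, step=0.1, iters=500, seed=0):
    """Approximate the minimum eigenpair of a symmetric operator using only
    matrix-vector products. Assumes step * (lambda_max - lambda_min) < 1."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    v = [vi / norm for vi in v]
    for _ in range(iters):
        sv = matvec(v)
        rq = sum(vi * si for vi, si in zip(v, sv))      # Rayleigh quotient
        # Gradient of the quotient on the unit sphere is 2 * (S v - rq * v).
        v = [vi - step * (si - rq * vi) for vi, si in zip(v, sv)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm for vi in v]
    sv = matvec(v)
    return sum(vi * si for vi, si in zip(v, sv)), v

# Small symmetric matrix used only to exercise the routine.
S = [[2.0, 1.0], [1.0, -1.0]]

def matvec(v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in S]

lam, v = min_eig_rayleigh(matvec, dim=2)
print(lam, v)
```

Each iteration costs one matrix-vector product, so the $d\times d$ matrix never has to be formed explicitly.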

### 2.3 Splitting as $\infty$-Wasserstein Steepest Descent

We present a functional aspect of our approach, in which we frame the neuron parameter optimization and splitting as a functional optimization in the space of distributions of the neuron weights, and show that our splitting strategy can be viewed as a second-order descent for escaping saddle points in the $\infty$-Wasserstein space of distributions.

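Consider the loss $\mathcal{L}(\boldsymbol{\theta}, w)$ in (2). Because the off-springs of each neuron are exchangeable, we can equivalently represent $\mathcal{L}(\boldsymbol{\theta}, w)$ using the empirical measure of the off-springs,

$$\mathcal{L}[\rho] = \mathbb{E}_{x\sim\mathcal{D}}\Big[\Phi\big(\mathbb{E}_{\theta\sim\rho}[\sigma(\theta, x)]\big)\Big],\qquad \rho = \sum_{i=1}^m w_i\,\delta_{\theta_i}, \tag{8}$$

where $\delta_{\theta_i}$ denotes the delta measure on $\theta_i$ and $\mathcal{L}[\rho]$ is the functional representation of $\mathcal{L}(\boldsymbol{\theta}, w)$. The idea is to optimize $\mathcal{L}[\rho]$ in the space of distributions using a functional steepest descent. To do so, a notion of distance on distributions needs to be chosen. We consider the $p$-Wasserstein metric,

$$W_p(\rho, \rho') = \inf_{\gamma\in\Pi(\rho,\rho')}\Big(\mathbb{E}_{(\theta,\theta')\sim\gamma}\big[\|\theta - \theta'\|^p\big]\Big)^{1/p},\qquad p > 0, \tag{9}$$

where $\Pi(\rho, \rho')$ denotes the set of measures whose first and second marginals are $\rho$ and $\rho'$, respectively, and $\gamma$ can be viewed as describing a transport plan from $\rho$ to $\rho'$. We obtain the $\infty$-Wasserstein metric $W_\infty(\rho, \rho')$ in the limit when $p\to+\infty$, in which case the $p$-norm reduces to an esssup norm, that is, $W_\infty(\rho, \rho') = \inf_{\gamma\in\Pi(\rho,\rho')}\operatorname{ess\,sup}_{(\theta,\theta')\sim\gamma}\|\theta - \theta'\|$; see villani2008optimal and Appendix A.2 for more.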

The $\infty$-Wasserstein metric yields a natural connection to node splitting. For each $\theta$, the conditional transport plan $\gamma(\theta'\mid\theta)$ represents the distribution of the points $\theta'$ transported from $\theta$, which can be viewed as the off-springs of $\theta$ in the context of node splitting. If $W_\infty(\rho, \rho') \le \epsilon$, it means that $\rho'$ can be obtained from splitting $\theta\sim\rho$ such that all the off-springs are $\epsilon$-close, i.e., $\|\theta' - \theta\| \le \epsilon$ (this, however, does not hold for $W_p$ with a finite $p$). This is consistent with the augmented neighborhood introduced in Section 2.1, except that here $\rho'$ can be an absolutely continuous distribution, representing a continuously infinite number of off-springs; but this yields no practical difference because any distribution can be approximated arbitrarily closely using a countable number of particles.
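The difference between a finite $p$ and $p = \infty$ is easy to see when a single neuron is split: the source measure is one point mass, the transport plan is then unique, and the distances have closed forms. A minimal sketch under that single-neuron assumption (the function name is hypothetical, and the general case would require solving an optimal transport problem):

```python
import math

def wasserstein_from_point(theta, offsprings, weights, p):
    """W_p between the point mass at theta and sum_i w_i * delta(offspring_i).
    With a single source point, all mass at theta is moved to the off-springs."""
    dists = [math.dist(theta, t) for t in offsprings]
    if p == math.inf:
        return max(dists)          # the farthest off-spring determines W_inf
    return sum(w * d ** p for w, d in zip(weights, dists)) ** (1.0 / p)

theta = (0.0, 0.0)
offsprings = [(0.1, 0.0), (0.0, -0.3)]
weights = [0.9, 0.1]
for p in [1, 2, 100, math.inf]:
    print(p, wasserstein_from_point(theta, offsprings, weights, p))
```

A light off-spring that sits far away barely moves $W_2$ but fully determines $W_\infty$, which is why only the $\infty$-Wasserstein ball matches the requirement that every off-spring stays $\epsilon$-close.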

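Similar to the parametric case, an $\infty$-Wasserstein steepest descent on $\mathcal{L}[\rho]$ should iteratively find new points that maximize the decrease of loss in an $\epsilon$-ball of the current points. Define

$$\Delta^*(\rho, \epsilon) = \min_{\rho'}\big\{\mathcal{L}[\rho'] - \mathcal{L}[\rho]\colon\ W_\infty(\rho, \rho') \le \epsilon\big\}.$$

We are ready to show the connection of Algorithm 1 to the $\infty$-Wasserstein steepest descent.

**Theorem.** Consider the $\mathcal{L}(\boldsymbol{\theta}, w)$ and $\mathcal{L}[\rho]$ in (2) and (8), connected with $\rho = \sum_i w_i\delta_{\theta_i}$. Define $G_\rho(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(f_\rho(x))\,\nabla_\theta\sigma(\theta, x)\big]$ and $S_\rho(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(f_\rho(x))\,\nabla^2_{\theta\theta}\sigma(\theta, x)\big]$ with $f_\rho(x) = \mathbb{E}_{\theta\sim\rho}[\sigma(\theta, x)]$, which are related to the gradient and splitting matrices of $\mathcal{L}(\boldsymbol{\theta}, w)$, respectively.

a) If $\mathcal{L}(\boldsymbol{\theta}, w)$ is at a non-stationary point w.r.t. $\boldsymbol{\theta}$, then the steepest descent of $\mathcal{L}[\rho]$ is achieved by moving all the particles of $\rho$ with gradient descent on $\mathcal{L}(\boldsymbol{\theta}, w)$, that is,

$$\mathcal{L}\big[(I - \epsilon G_\rho)\sharp\rho\big] - \mathcal{L}[\rho] = \Delta^*(\rho, \epsilon) + O(\epsilon^2) = -\epsilon\,\mathbb{E}_{\theta\sim\rho}\big[\|G_\rho(\theta)\|\big] + O(\epsilon^2),$$

where $(I - \epsilon G_\rho)\sharp\rho$ denotes the distribution of $\theta' = \theta - \epsilon\,G_\rho(\theta)/\|G_\rho(\theta)\|$ when $\theta\sim\rho$.

b) If $\mathcal{L}(\boldsymbol{\theta}, w)$ reaches a stable local optimum w.r.t. $\boldsymbol{\theta}$, the steepest descent on $\mathcal{L}[\rho]$ is splitting each neuron with $\lambda_{\min}(S_\rho(\theta)) < 0$ into two copies of equal weights following their eigenvectors, while keeping the remaining neurons unchanged. Denoting by $\rho^*$ the distribution obtained in this way, we have

$$\mathcal{L}[\rho^*] - \mathcal{L}[\rho] = \Delta^*(\rho, \epsilon) + O(\epsilon^3),$$

where we have $\Delta^*(\rho, \epsilon) = \frac{\epsilon^2}{2}\,\mathbb{E}_{\theta\sim\rho}\big[\min\big(\lambda_{\min}(S_\rho(\theta)),\,0\big)\big] + O(\epsilon^3)$.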

#### Remark

There has been a line of theoretical work analyzing gradient-based learning of neural networks via $2$-Wasserstein gradient flow (e.g., mei2018mean; chizat2018global) by considering the “mean field limit” $m\to\infty$. Our framework is significantly different, since we consider the $\infty$-Wasserstein flow, our results hold when $m$ is finite, and we consider the splitting effect, which requires looking at the second-order terms.

## 3 Experiments

We test our method on both toy and realistic examples, including learning interpretable neural networks and architecture search for image classification and keyword spotting benchmarks. Due to limited space, many of the detailed settings are given in the Appendix, which also includes additional results on distribution approximation (Appendix C.1) and transfer learning (Appendix C.2).

Figure 1: Toy RBF network. (a) True and learned functions; (b) loss decrease vs. eigenvalues; (c) loss decrease vs. angle; (d) training loss vs. #iteration.

#### Toy RBF neural networks

We apply our method to learn a one-dimensional RBF neural network shown in Figure 1a. See Appendix B.1 for more details of the setting.

We start with a small neural network with neuron and gradually increase the model size by splitting neurons. Figure 1a shows that we almost recover the true function as we split up to neurons. The following two figures study the eigenvalues and eigenvector directions when we split neurons from to . Figure 1b shows the top five eigenvalues and the decrease of loss after the splitting when ; we can see that the eigenvalue and loss decrease correlate linearly, confirming our results in Theorem 2.2. Figure 1c shows the decrease of the loss when we split the top 1 neuron following the direction with different angles from the eigenvector at . We can see that the decrease of the loss is maximized when the splitting direction aligns with the eigenvector, consistent with our theory. In Figure 1d, we compare with different baselines of progressive training, including Random Split, splitting a randomly chosen neuron with a random direction; New Initialization, adding a new neuron with randomly initialized weights and co-optimizing it with previous neurons; Gradient Boosting, adding new neurons with the Frank-Wolfe algorithm while fixing the previous neurons; Baseline (scratch), training a network of size from scratch. Figure 1d shows our method yields the best result after 7 splits.

#### Learning Interpretable Neural Networks

To visualize the dynamics of the splitting process, we apply our method to incrementally train an interpretable neural network designed by li2018deep, which contains a “prototype layer” whose weights are enforced to be similar to realistic images to encourage interpretability. See Appendix B.2 and li2018deep for more detailed settings. We apply our method to split the prototype layer starting from a single neuron on MNIST, and show in Figure 2 the evolutionary tree of the neurons in our splitting process. We can see that the blurry (and hence less interpretable) prototypes tend to be selected and split into two off-springs that are similar yet more interpretable. Figure 2 (b) shows the decrease of loss when we split each of the five neurons at the 5th step (with the decrease of loss measured at the local optima reached after splitting); we find that the eigenvalue correlates well with the decrease of loss and the interpretability of the neurons. The complete evolutionary tree and quantitative comparison with baselines are shown in Appendix B.2.

Figure 2: (a) Evolutionary tree of the prototype neurons during splitting; (b) loss decrease vs. eigenvalues.

#### Progressive Training for Image Classification

We investigate the effectiveness of our methods in learning small and efficient network structures for image classification. We experiment with two popular deep neural architectures, MobileNet (howard2017mobilenets) and VGG19 (simonyan2014very). In both cases, we start with a relatively small network and gradually grow the network by splitting the convolution filters following Algorithm 1. See Appendix B.3 for more details of the setting. Because there is no other off-the-shelf progressive training algorithm that can adaptively decide the neural architectures like our method, we compare with pruning methods, which follow the opposite direction of gradually removing neurons starting from a large pre-trained network. We test two state-of-the-art pruning methods, including batch-normalization-based pruning (Bn-prune) (liu2017learning) and L1-based pruning (L1-prune) (li2016pruning). As shown in Figure 3a-b, our splitting method performs the best with similar model sizes. This is surprising and significant, because the pruning methods leverage the knowledge from a large pre-trained model, while our method does not.

To further test the effect of architecture learning in both splitting and pruning methods, we test another setting in which we discard the weights of the neurons and retrain the whole network starting from a random initialization under the structure obtained from splitting or pruning at each iteration. As shown in Figure 3c-d, the results of retraining are comparable with (or better than) the results of successive finetuning in Figure 3a-b, which is consistent with the findings in liu2018rethinking. Meanwhile, our splitting method still outperforms both Bn-prune and L1-prune.

Figure 3: Test accuracy vs. ratio. (a) MobileNet (finetune); (b) VGG19 (finetune); (c) MobileNet (retrain); (d) VGG19 (retrain).

#### Resource-Efficient Keyword Spotting on Edge Devices

Keyword spotting systems aim to detect a particular keyword from a continuous stream of audio. They are typically deployed on energy-constrained edge devices and require real-time response and high accuracy for a good user experience. This poses a key challenge of constructing efficient and lightweight neural architectures. We apply our method to this problem by splitting a small model (a compact version of DS-CNN) obtained from zhang2017hello. See Appendix B.4 for detailed settings.

Table 1 shows the results on the Google speech commands benchmark dataset (warden2018speech), in which our method achieves significantly higher accuracy than the best model (DS-CNN) found by zhang2017hello, while having 31% fewer parameters and FLOPs. Figure 4 shows a further comparison with Bn-prune (liu2017learning), which is again inferior to our method.

## 4 Conclusion

We present a simple approach for progressively training neural networks via neuron splitting. Our approach highlights a novel view of neural structure optimization as continuous functional optimization, and yields a practical procedure with broad applications. For future work, we will further investigate fast gradient descent based approximation of large scale eigen-computation and more theoretical analysis and extensions of our approach.

## Acknowledgment

This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421. We would like to acknowledge Google Cloud and Amazon Web Services (AWS) for their support.

## References

## Appendix A Proofs

### A.1 Proofs of Splitting Taylor Expansion

###### Proof of Theorem 2.1.

Taking the gradient of in (1) gives

where is the derivative of (which is a univariate function), and

When is split into , the augmented loss function is

where and The weights should satisfy and . In this way, we have when .

Taking the gradient of w.r.t. when , we have

Taking the second derivative, we get

where

Note that we have following this definition.

For , we have

For , assume , and define to be the average displacement. Therefore, . Using the Taylor expansion of w.r.t. at , we have

This completes the proof. ∎

###### Proof of Theorem 2.1.

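Recall that

$$II(\boldsymbol{\delta}, w;\theta) = \frac{\epsilon^2}{2}\sum_{i=1}^m w_i\,\delta_i^\top S(\theta)\,\delta_i,$$

with $\sum_{i=1}^m w_i = 1$, $w_i > 0$, and $\|\delta_i\| \le 1$. Since $\delta_i^\top S(\theta)\,\delta_i \ge \lambda_{\min}(S(\theta))\,\|\delta_i\|^2$, it is obvious that

$$II(\boldsymbol{\delta}, w;\theta) \ge \frac{\epsilon^2}{2}\min\big(\lambda_{\min}(S(\theta)),\ 0\big).$$

On the other hand, when $\lambda_{\min}(S(\theta)) < 0$, this lower bound is achieved by setting $m = 2$, $w_1 = w_2 = 1/2$, and $\delta_1 = -\delta_2 = v_{\min}(S(\theta))$. This completes the proof. ∎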

###### Proof of Theorem 2.2.

Step 1: We first consider the case when . In this case, Lemma A.1 gives

(10) |

where denotes the augmented parameters obtained when we only split the -th neuron, while keeping all the neurons unchanged. Applying Theorem 2.1, we have for each ,

Combining this with (10) yields the result.

Step 2: We now consider the more general case when . Let . Applying the result above on , we have

where Therefore,

This completes the proof. ∎

Let be the parameters of neurons. Recall that we assume is split into off-springs with parameters and weights , which satisfies . Let , where is the perturbation on the -th off-spring of the -th neuron. Assume , that is, the average displacement of all the neurons is zero.

Denote by the augmented parameters we obtained by only splitting the -th neuron, while keeping all the other neurons unchanged, that is, we have for , and for all and . Assume the third order derivatives of are bounded. We have

###### Proof.

Define