Efficient Network Construction through Structural Plasticity
Abstract
Deploying Deep Neural Networks (DNNs) on hardware faces excessive computation cost due to the massive number of parameters. A typical training pipeline to mitigate over-parameterization is to predefine a DNN structure with redundant learning units (filters and neurons) with the goal of high accuracy, then to prune redundant learning units after training for efficient inference. We argue that it is suboptimal to introduce redundancy into training in order to reduce redundancy later in inference. Moreover, the fixed network structure further results in poor adaptation to dynamic tasks, such as lifelong learning. In contrast, structural plasticity plays an indispensable role in mammalian brains to achieve compact and accurate learning. Throughout the lifetime, active connections are continuously created while those that are no longer important degenerate. Inspired by this observation, we propose a training scheme, namely Continuous Growth and Pruning (CGaP), where we start the training from a small network seed, continuously grow the network by adding important learning units, and finally prune secondary ones for efficient inference. The inference model generated by CGaP is sparse in structure, largely decreasing the inference power and latency when deployed on hardware platforms. With popular DNN structures on representative datasets, the efficacy of CGaP is benchmarked by both algorithmic simulation and architectural modeling on Field-Programmable Gate Arrays (FPGAs). For example, CGaP decreases the FLOPs, model size, DRAM access energy and inference latency by 63.3%, 64.0%, 11.8% and 40.2%, respectively, for ResNet-110 on CIFAR-10.
I Introduction
Deep Neural Networks have various applications including image classification [1], object detection [2], speech recognition [3] and natural language processing [4]. However, the accuracy of DNNs heavily relies on massive numbers of parameters and deep structures, making it hard to deploy DNNs on resource-limited embedded systems. When training or inferring DNN models on hardware, the model must be stored in external memory such as dynamic random-access memory (DRAM) and fetched multiple times. These operations are expensive in computation, memory access, and energy consumption. For example, Fig. 1 shows the energy consumption of one inference pass in several modern DNN structures, simulated by the FPGA performance model [5] under the setting of 300 MHz operating frequency and 19.2 GB/s DRAM bandwidth. A typical DNN model is too large to fit in on-chip memory. For instance, VGG-19 [6] has 20.4M parameters. Running such a model requires frequent external memory access, exacerbating the power consumption of a typical embedded system.
Previous studies have designed customized hardware for DNN acceleration [7, 8]. Most of them are limited to relatively small neural networks, such as LeNet-5 [9]. For larger networks such as AlexNet [1] and VGG-16 [6], additional efforts are usually required to improve hardware efficiency [10, 11]. For example, [10] saves energy through data gating and zero skipping. Some other works focus on data reuse in convolutional layers and demonstrate the results on specific hardware [7, 12, 13, 14]. However, their improvements are limited for networks where fully-connected layers are widely used, such as RNNs and LSTMs.
To support more general models, network pruning is a popular approach that removes secondary weights and neurons. Network pruning executes a three-step procedure, which 1) trains a predesigned network from scratch, 2) removes less important connections or filters/neurons according to a saliency score (a metric that measures the importance of weights and learning units) [15, 16, 17, 18, 19], or by adding a regularization term to the loss function [20, 21], and 3) fine-tunes to recover the accuracy.
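The selection criterion in step 2 of this pipeline can be sketched in a few lines. The function below is an illustrative, framework-free rendition of magnitude-based weight selection in the spirit of [15]; the function name and list-based representation are ours, not the original implementation.

```python
def magnitude_prune(weights, prune_rate):
    """Step 2 of the classic pipeline: zero out the lowest-|w| fraction.

    Ranks weights by their L1-norm (absolute value) saliency and sets the
    bottom `prune_rate` fraction to zero, as in magnitude pruning [15].
    """
    k = int(len(weights) * prune_rate)
    # the k-th smallest absolute value becomes the pruning threshold
    threshold = sorted(abs(w) for w in weights)[k - 1] if k > 0 else float("-inf")
    return [0.0 if abs(w) <= threshold else w for w in weights]

# the two smallest-magnitude weights (-0.05 and 0.02) are zeroed
pruned = magnitude_prune([0.4, -0.05, 0.9, 0.02, -0.6], prune_rate=0.4)
```

Step 3 then fine-tunes the surviving weights to recover accuracy; the zeroed positions are typically kept at zero via a mask.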
However, the above pruning techniques suffer from two limitations: (1) training a large, fixed network from scratch can be suboptimal, as it introduces redundancy; (2) pruning only discards less important weights at the end of training but never strengthens important weights and nodes during training. These limitations confine the learning performance as well as the pruning efficiency (i.e., how many parameters can be removed and how structured the resulting sparsity is).
In contrast to the static DNN model, the biological nervous system exhibits active growth and pruning throughout the lifetime. [22, 23, 24] have observed that rapid growth of neurons and synapses takes place in an infant's brain and is vital to the maturity of an adult's brain. In brains, some neurons and synapses are used more frequently and are consequently strengthened. Those neurons and synapses that are not used consistently are weakened and removed. The structural plasticity of the brain is central to the study of developmental biology.
Inspired by this observation from biology, we propose a training scheme named Continuous Growth and Pruning (CGaP), which leverages structural plasticity to tackle the aforementioned limitations of pruning techniques. Instead of training an over-parameterized network from scratch, CGaP starts the training from a small network seed (Fig. 2(a)), whose size is as low as 0.1% to 3% of the full-size reference model. In each iteration of the growth, CGaP locally sorts neurons and filters (also known as output channels in some literature) according to our saliency score (Section III-B). Based on the saliency score, important learning units are selected and corresponding new units are added (see Fig. 2(b)). The selection and addition of important units help reinforce the learning and increase model capacity. Then a filter-wise and neuron-wise pruning is executed on the post-growth model (Fig. 2(c)) based on pruning metrics. Finally, CGaP generates a significantly sparse and structured inference model (Fig. 2(d)) with improved accuracy. In the generated inference model, large numbers of filters and neurons have been removed, achieving structured pruning. Compared to non-structured pruning [15], CGaP benefits hardware implementation as it reduces the computation volume and memory access without any additional hardware architecture change.
Algorithmic experiments and hardware simulations validate that CGaP significantly decreases the number of external and on-chip memory accesses, accelerating the inference by bypassing the removed filters and neurons. On the algorithm side, we demonstrate the performance in accuracy and model pruning on several networks and datasets. For instance, CGaP reduces 78.9% of the parameters of VGG-19 with 0.37% accuracy improvement on CIFAR-100 [25], and 85.8% of the parameters with 0.23% accuracy improvement on SVHN [26]. For ResNet-110 [27], CGaP reduces 64.0% of the parameters with 0.09% accuracy improvement on CIFAR-10 [25]. These results exceed the state-of-the-art pruning methods [15, 16, 17, 18, 28, 29]. Furthermore, we validate the efficiency of the inference model generated by CGaP using an FPGA simulator [5]. For one inference pass of VGG-19 on CIFAR-100, the previous non-structured pruning approach [15] requires 5.6 ms inference latency, while CGaP requires only 4.4 ms with lower DRAM access energy.
The contribution of this paper is as follows:

A brain-inspired training flow (CGaP) with a dynamic structure is proposed. CGaP grows the network from a small seed and effectively reduces over-parameterization without sacrificing accuracy.

The advantage of the structured sparsity of the inference model generated by CGaP is validated using a high-level FPGA performance model, including on-chip buffer access energy, external memory access energy and inference latency.

A discussion of why the growth improves learning efficiency is provided.
The rest of the paper is organized as follows. Section II introduces the background of model pruning. Section III presents the saliency score used to select learning units. Section IV describes the proposed Continuous Growth and Pruning scheme. Section V presents the experimental results from algorithmic simulations. Section VI presents the simulation results from FPGA performance modeling. Section VII discusses the understanding of network plasticity as well as an ablation study. Section VIII concludes this work and offers insight into future work.
II Previous Work
There has been broad interest in reducing the redundancy of DNNs in order to deploy them on resource-limited hardware platforms. Structural surgery is a widely used approach and can be categorized into a destructive direction and a constructive direction. We discuss these two directions, as well as approaches orthogonal to our method, in this section.
II-A Destructive Methods
The destructive methods zero out specific connections or remove filters or neurons in convolutional or fully-connected layers, generating a sparse model. Weight magnitude pruning [15] prunes weights by setting the selected weights to zero. The selection is based on the L1-norm, i.e., the absolute value of the weight. Weight magnitude pruning generates a sparse weight matrix, but not in a structured way. In this case, a specific hardware design [30] is needed to take advantage of the optimized inference model; otherwise the non-structured sparsity does not benefit hardware acceleration due to the overhead of model management. Kernel-wise pruning [16] prunes kernels layer by layer based on the saliency metric of each filter and achieves structured sparsity in the inference model. Compared to [16], CGaP prunes whole filters, leading to a more structured inference model. Besides saliency-based pruning, the penalty-based approach has been explored by [21, 31] and structured sparsity was achieved. Our method differs from all the above pruning schemes in two respects: (1) we start training from a small seed rather than an over-parameterized network; (2) besides removing secondary filters/neurons, we also reinforce important ones to further improve learning accuracy and model compactness.
II-B Constructive Methods
The constructive approaches include techniques that add new connections or filters to enlarge the model capacity. [32, 33] increased network size by adding random neurons with fresh initialization (i.e., weights are randomly initialized, without pre-trained information). They evaluated their approach on basic XOR problems. Different from their approach, CGaP selectively adds neurons and filters that are initialized with the information learned from previous training. Meanwhile, CGaP is validated on modern DNNs and datasets under more realistic scenarios. [34] grew the smallest Neural Tree Networks (NTN) to minimize the number of classification errors on Boolean function learning tasks, and used pruning to enhance the generalization of NTN. [35] improved the accuracy of radial basis function (RBF) networks on function approximation tasks by adding and removing hidden neurons. To enhance the accuracy of spike-based classifiers, [36] progressively added dendrites to the network, and then optimized the topology of the dendritic tree. Different from them, CGaP aims at improving the efficiency of the inference model of modern Deep Neural Networks on image classification tasks. [37] constructed the DNN by activating connections and choosing a set of convolutional filters from a pool of randomly generated filters according to their influence on the training performance. However, this approach highly depends on trial and error to find the optimal set of filters that reduces the loss the most. This approach is sensitive to power and timing budgets, limiting its extension to large datasets. Unlike their work, CGaP directly grows the network from a seed, minimizing the effort spent on trial and error.
II-C Orthogonal Methods
The orthogonal methods, such as low-precision quantization and low-rank decomposition, compress DNN models by quantizing the parameters to fewer bits [38, 39], or by finding a low-rank approximation [40, 41]. Note that our CGaP approach can be combined with these orthogonal methods to further improve inference efficiency.
III Saliency Score
In this section, we describe the detailed methodology of CGaP, starting from the saliency score, which is used to sample the importance of a learning unit. Section III-A defines the terminology we use in this paper. Section III-B provides the mathematical derivation of the saliency score we adopt.
III-A Terminology
A DNN can be treated as a feed-forward multi-layer architecture that maps the input images to certain output vectors. Each layer is a function, such as convolution, ReLU, pooling or inner product, whose input is x^(l-1), output is x^(l) and, in the case of convolutional and fully-connected layers, parameter is W^(l), where the superscript l denotes the index of the layer. Hereby the convolutional layer (conv-layer) is formulated as x^(l) = f(W^(l) * x^(l-1)), wherein W^(l) has N_l output channels, N_(l-1) input channels and K_l x K_l kernels, * denotes convolution and f is the activation. The fully-connected layer is represented by x^(l) = W^(l) x^(l-1), where the input x^(l-1) has dimension M_(l-1), the output x^(l) has dimension M_l, and the parameter matrix W^(l) is M_l x M_(l-1).
Convolutional layer (conv-layer)
the 4 dimensions of its weight matrix W^(l) are: the number of output channels N_l, the number of input channels N_(l-1), and the kernel width and height K_l, respectively. We denote the i-th 3D filter, which generates the i-th output channel in the feature map, as F_i^(l). The j-th 2D kernel in the i-th filter is denoted as K_(i,j)^(l). On the other hand, a 4D weight tensor T_j^(l), which operates on the j-th input feature map, is a package of kernels across all output channels. For example, in Fig. 3, F_i^(l) is a 3D filter consisting of N_(l-1) kernels, and T_j^(l) as well as T_(j')^(l) are both 4D tensors with dimension N_l x 1 x K_l x K_l, which include all the output channels but have only one input channel, located at j. The weight w_(i,j,k1,k2)^(l) refers to one weight at the k1-th row and the k2-th column in the i-th filter of the j-th input channel.
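These dimensions can be made concrete with a small arithmetic sketch. The layer sizes below are hypothetical, chosen only to illustrate how the filter, kernel and input-wise-tensor views partition the same weight matrix.

```python
# Hypothetical conv-layer sizes (illustration only, not from the paper):
N_out, N_in, K = 4, 3, 5   # output channels, input channels, kernel size

# A 3D filter F_i spans one output channel: N_in kernels of K x K weights.
weights_per_filter = N_in * K * K

# A 4D input-wise tensor T_j spans one input channel: one K x K kernel
# per output channel, i.e. dimension N_out x 1 x K x K.
kernels_per_input_tensor = N_out

# Both views tile the same full weight matrix of the layer.
total = N_out * weights_per_filter
assert total == N_in * kernels_per_input_tensor * K * K
```

So growing one filter adds `weights_per_filter` weights to layer l, and (as Section IV-A describes) forces a matching input-wise tensor to appear in layer l+1.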
Fully-connected layer (fc-layer)
inputs propagate from one hidden activation to the next layer. We refer to the whole set of connections of the j-th hidden activation in layer l as a neuron n_j^(l). This neuron receives information from the previous layer through its fan-in weights W_in (as shown in Fig. 4) and propagates to the next layer through its fan-out weights W_out. Also note that the output dimension of layer l equals the input dimension of layer l+1. The weight pixel in layer l at the cross-point of row r and column c is denoted as w_(r,c)^(l). Moreover, the 'depth' of a DNN model indicates the number of layers, and the 'width' of a DNN model refers to the number of filters or neurons in each layer.
Learning units
Growing or pruning a filter indicates adding or removing F_i^(l) and its corresponding output feature map. Growing or pruning a neuron means adding or removing both its fan-in weights W_in and its fan-out weights W_out.
III-B Saliency Score
We adopt a saliency score to measure the effect of a single filter/neuron on the loss function, i.e., the importance of each learning unit. The saliency score is developed from the Taylor expansion of the loss function. Previously, [42] applied it to pruning. In this paper, we adopt this saliency score and apply it to both the growth and pruning phases. In this section, we provide a mathematical formulation of the saliency score.
The saliency score represents the difference between the loss with and without each unit. In other words, if the removal of a filter/neuron leads to a relatively small accuracy degradation, the unit is recognized as unimportant, and vice versa. Thus, the objective of finding the filter with the highest saliency score is formulated as:

argmax_i S(F_i^(l)), where S(F_i^(l)) = |L(D; W, F_i^(l) = 0) − L(D; W)|.    (1)

Using the first order of the Taylor expansion:

L(D; W, F_i^(l) = 0) ≈ L(D; W) − Σ_{w ∈ F_i^(l)} (∂L/∂w) · w,    (2)

we get:

S(F_i^(l)) = |Σ_{w ∈ F_i^(l)} (∂L/∂w) · w|.    (3)

Similarly, the saliency score of a neuron is derived as:

S(n_j^(l)) = |Σ_{w ∈ n_j^(l)} (∂L/∂w) · w|.    (4)
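A minimal sketch of this first-order Taylor score: it accumulates gradient-times-value over a unit's elements, so a unit whose contributions cancel receives a low score even if its individual weights are large. Plain Python for clarity; in practice the gradients come from a backward pass.

```python
def taylor_saliency(values, gradients):
    """First-order Taylor saliency of one unit: |sum_w (dL/dw) * w|.

    Setting the unit to zero changes the loss by approximately
    -sum grad * w (first-order term), so the magnitude of that sum
    scores the unit's importance.
    """
    return abs(sum(g * w for g, w in zip(gradients, values)))

# unit A: large values aligned with gradients -> important
s_a = taylor_saliency([1.0, 2.0], [0.5, 0.5])    # |0.5 + 1.0| = 1.5
# unit B: contributions cancel -> removing it barely moves the loss
s_b = taylor_saliency([1.0, -1.0], [0.5, 0.5])   # |0.5 - 0.5| = 0.0
```

Note that this differs from pure magnitude pruning: unit B above has the same L1-norm as unit A but scores zero, because the score measures the effect on the loss, not the size of the weights.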
IV CGaP Methodology
With the saliency score as the foundation, we develop the entire CGaP flow atop. This section explains the overall flow and the detailed implementation of each step in CGaP.
The CGaP scheme is described in Algorithm 1. Starting from a small network seed, the growth takes place periodically at a frequency of f_g (see Algorithm 1 line 4, where '%' denotes the operation to obtain the remainder of a division). During each growth, important learning units are chosen and grown at growth ratio r, layer by layer from the bottom (input) to the top (output), based on the local ranking of the saliency score. The growth phase stops when reaching a capacity threshold, followed by several epochs of training on the peak model. When the training accuracy reaches a threshold, the pruning phase starts. Pruning is performed layer by layer, from the bottom layer to the top layer, at a frequency of f_p. The details of the growth phase and the pruning phase are described as follows.
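The scheduling logic of Algorithm 1 can be sketched as follows. The function name, its arguments, and the use of an epoch index as the peak-capacity trigger are illustrative assumptions (the paper triggers the switch on a capacity threshold, not a fixed epoch); the frequencies follow the hyper-parameters of Section V-A.

```python
def cgap_schedule(total_epochs, f_g, peak_epoch, f_p):
    """Sketch of CGaP's training schedule: grow every 1/f_g epochs until
    the peak capacity is reached (modeled here as a fixed epoch for
    simplicity), then prune every 1/f_p epochs. Returns (epoch, action)
    events; regular SGD training happens in every epoch regardless."""
    events = []
    grow_every = max(1, round(1 / f_g))
    prune_every = max(1, round(1 / f_p))
    for epoch in range(total_epochs):
        if epoch < peak_epoch and epoch % grow_every == 0:
            events.append((epoch, "grow"))    # saliency-ranked growth, Sec. IV-A
        elif epoch >= peak_epoch and epoch % prune_every == 0:
            events.append((epoch, "prune"))   # layer-wise pruning, Sec. IV-B
    return events
```

With f_g = 1/3 and f_p = 1 (Section V-A), this grows every third epoch and then prunes every epoch once the peak model is reached.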
IV-A Growth Phase
Algorithm 2 presents the methodology of the growth phase. Each iteration of growth consists of two steps: growth in layer l and mapping in the adjacent layer. Two cases need to be discussed separately: convolutional layers (Fig. 3) and fully-connected layers (Fig. 4). Due to the difference between these two kinds of operation discussed previously, after the growth of layer l, the mapping in conv-layers takes place at the adjacent layer l+1; in fc-layers, the mapping takes place in the fan-in weights.
Growth in conv-layer
According to the local ranking of the saliency score (Eq. 3), we sort all the 3D filters in this layer. With a growth ratio r, the top-ranked filters are selected in the l-th layer at each growth. Beside each selected filter F_i^(l), as shown in Fig. 3, we create a new filter of the same size, named F_(i')^(l).
In the ideal case, the new filter and the existing filter are expected to collaborate with each other and optimize the learning. The existing filter has already learned on the current task. To keep the same learning pace between the existing filter and the new filter, we initialize F_(i')^(l) as follows:

F_(i')^(l) = γ · F_i^(l) + δ,    (5)
F_i^(l) ← γ · F_i^(l),    (6)

where γ is a scaling factor and δ is noise following a uniform distribution in [−σ, σ]. Instead of random initialization, the above initialization helps reconcile the learning status of the newborn filters with the old filters. Meanwhile, the scaling factor γ prevents the output from an exponential explosion caused by the feed-forward propagation. The noise δ prevents the learning from sticking at a local minimum that leads to suboptimal solutions. No matter which distribution the noise follows, δ in a reasonable range provides similar performance. However, other distributions usually introduce more hyper-parameters and thus require more effort in parameter tuning. For example, Gaussian noise introduces one more hyper-parameter, the standard deviation, than uniform noise. For simplicity, we use uniformly distributed noise.
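A minimal sketch of this initialization, under our reading of Eq. 5 and 6: the newborn filter copies the selected filter's learned weights scaled by γ plus small uniform noise, and the existing filter is rescaled by the same γ so the layer's summed output does not explode. The flat-list representation and function name are illustrative.

```python
import random

def grow_filter(old_filter, gamma=0.5, sigma=0.1):
    """Spawn a new filter beside a selected important one (sketch of
    Eq. 5-6). gamma=0.5 and sigma=0.1 follow Section V-A. The noise
    breaks symmetry so the pair does not stay identical during training."""
    new_filter = [gamma * w + random.uniform(-sigma, sigma) for w in old_filter]
    scaled_old = [gamma * w for w in old_filter]   # existing filter rescaled
    return new_filter, scaled_old
```

With gamma = 0.5, the pair's combined response initially equals the original filter's response plus noise, which is what keeps the newborn "at the same learning pace" as its parent.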
Mapping in conv-layer
After the number of filters in layer l grows, the number of output feature maps increases correspondingly. Therefore, the input-wise dimension of layer l+1 must increase as well in order to be consistent in data propagation. To match the dimensions, we first locate the 4D tensor T_j^(l+1) in layer l+1, which processes the feature maps generated by F_i^(l). Then we add a new 4D tensor T_(j')^(l+1) adjacent to T_j^(l+1). The T_(j')^(l+1) and T_j^(l+1) are initialized as follows:

T_(j')^(l+1) = γ · T_j^(l+1) + δ,    (7)
T_j^(l+1) ← γ · T_j^(l+1).    (8)
To summarize, as illustrated in Fig. 3, the filter F_i^(l) (green) is selected according to the saliency score and a new filter (orange) is added. Then the input-wise tensor T_j^(l+1) (in the blue dashed rectangle) in layer l+1 is projected, and T_(j')^(l+1) (in the black dashed rectangle) is generated.
After layer l grows and layer l+1 is mapped, layer l+1 grows and layer l+2 is mapped, and so forth until the last convolutional layer. It is worth mentioning that for the 'projection shortcuts' with 1x1 convolutions in ResNet [27], the dimension mapping is between the two layers that the shortcut connects, which are not necessarily adjacent layers.
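The mapping step can be sketched as below, with each input-wise tensor collapsed to one scalar per output channel for brevity. The rescale-and-duplicate initialization mirrors our reading of Eq. 7 and 8, with the noise term omitted for determinism; the function name and matrix layout are illustrative.

```python
def map_next_layer(next_weights, j, gamma=0.5):
    """After layer l gains an output channel beside channel j, layer l+1
    gains a matching input slice. `next_weights` is a matrix with one row
    per output channel of layer l+1 and one column per input channel.
    The existing slice T_j is rescaled by gamma and a copy is inserted,
    so the now-doubled input feature map keeps the same net contribution."""
    out = []
    for row in next_weights:
        new_row = row[:]
        t = new_row[j]
        new_row[j] = gamma * t              # rescale existing slice T_j
        new_row.insert(j + 1, gamma * t)    # insert new slice T_j'
        out.append(new_row)
    return out
```

For instance, a 2-output-channel layer with input slices [1.0, 2.0] and [3.0, 4.0] gains a third input slice after channel 0 grows, with the duplicated column halved so the forward activations are preserved.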
Growth and mapping in fc-layers
As illustrated in Fig. 4, the neuron growth in fc-layers occurs at the fan-out weights, and its initialization follows Eq. 5 and 6.
The mapping in fc-layers takes place in the fan-in weights as follows:

W_in(n_(j')^(l)) = γ · W_in(n_j^(l)) + δ,    (9)
W_in(n_j^(l)) ← γ · W_in(n_j^(l)).    (10)
After growing the last conv-layer, we flatten the output feature maps of this conv-layer, treat them as the input to the first fc-layer and map them in the same manner.
IV-B Pruning Phase
Pruning in each layer consists of two steps: weight pruning and unit pruning. First, we sort weight pixels locally in each conv-layer according to Eq. 11:

S(w_(i,j,k1,k2)^(l)) = |(∂L/∂w_(i,j,k1,k2)^(l)) · w_(i,j,k1,k2)^(l)|,    (11)

and in each fc-layer according to Eq. 12:

S(w_(r,c)^(l)) = |(∂L/∂w_(r,c)^(l)) · w_(r,c)^(l)|.    (12)
In each layer, the p·100% of weight pixels with the lowest saliency are set to zero, where p is the weight pruning rate. Then any filter/neuron whose sparsity exceeds the filter pruning rate p_f or the neuron pruning rate p_n is set to zero in its entirety. In this way, a large number of whole filters/neurons are pruned, leading to a compact inference model.
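The two-step pruning can be rendered as a toy sketch. Since gradients are not available in a static snippet, |w| stands in for the gradient-based saliency of Eq. 11 and 12; the function name, list layout and thresholds are illustrative.

```python
def prune_layer(filters, weight_rate, unit_rate):
    """Two-step pruning sketch: (1) zero the `weight_rate` fraction of
    lowest-saliency weights across the layer (|w| used as a stand-in
    saliency here); (2) drop any filter whose resulting sparsity
    exceeds `unit_rate`, removing the whole unit for structured sparsity."""
    flat = sorted(abs(w) for f in filters for w in f)
    k = int(len(flat) * weight_rate)
    thr = flat[k - 1] if k else float("-inf")
    pruned = [[0.0 if abs(w) <= thr else w for w in f] for f in filters]
    kept = []
    for f in pruned:
        sparsity = f.count(0.0) / len(f)
        if sparsity <= unit_rate:       # otherwise the whole unit is removed
            kept.append(f)
    return kept
```

Step 2 is what turns element-wise zeros into structured sparsity: a filter that ends up mostly zero is deleted outright, so the inference pass can skip it entirely rather than multiply by zeros.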
V Algorithmic Experiments
To evaluate the proposed approach, we present experimental results in this section. We perform experiments on several modern DNN structures (LeNet [9], VGGNet [6], ResNet [27]) and representative datasets (MNIST [9], CIFAR-10, CIFAR-100 [25], SVHN [26]).
TABLE I: Results on MNIST.
Method  Accuracy (%)  FLOPs  Pruned  Param.  Pruned
LeNet-5 Baseline  99.29  4.59M  –  431K  –
Pruning [17]  99.26  0.85M  81.5%  112K  74.0%
Pruning [15]  99.23  0.73M  84.0%  36K  92.0%
CGaP  99.36  0.44M  90.4%  8K  98.1%

TABLE II: Results on CIFAR-100.
Method  Accuracy (%)  FLOPs  Pruned  Param.  Pruned
VGG-19 Baseline  72.63  797M  –  20.4M  –
Pruning [28]  71.85  NA  –  10.1M  50.5%
Pruning [18]  72.85  501M  37.1%  5.0M  75.5%
CGaP  73.00  373M  53.2%  4.3M  78.9%

TABLE III: Results on SVHN.
Method  Accuracy (%)  FLOPs  Pruned  Param.  Pruned
VGG-19 Baseline  96.02  797M  –  20.4M  –
Pruning [18]  96.13  398M  50.1%  3.1M  84.8%
CGaP  96.25  206M  74.2%  2.9M  85.8%

TABLE IV: Results on CIFAR-10.
Method  Accuracy (%)  FLOPs  Pruned  Param.  Pruned
VGG-16 Baseline  93.25  630M  –  15.3M  –
Pruning [16]  93.40  410M  34.9%  5.4M  64.7%
CGaP  93.59  280M  56.2%  4.5M  70.6%
ResNet-56 Baseline  93.03  268M  –  0.85M  –
Pruning [28]  92.56  182M  32.1%  0.73M  14.1%
Pruning [43]  90.20  134M  50.0%  NA  –
CGaP  93.20  181M  32.5%  0.53M  37.6%
ResNet-110 Baseline  93.34  523M  –  1.72M  –
Pruning [16]  93.11  310M  40.7%  1.16M  32.6%
Pruning [29]  93.52  300M  40.8%  NA  –
CGaP  93.43  192M  63.3%  0.62M  64.0%
V-A Training Setup
Network structures
The LeNet-5 architecture consists of two sets of convolutional, ReLU [44] and max-pooling layers, followed by two fully-connected layers and finally a softmax classifier. The VGG-16 and VGG-19 structures we use have the same convolutional structure as [6] but are redesigned with only two fully-connected layers for fair comparison with the pruning-only method [16]. Therefore, VGG-16 (VGG-19) has 13 (16) convolutional layers, each followed by a batch normalization layer [45] and a ReLU activation. The structures of ResNet-56 and ResNet-110 follow [16]. Each convolutional layer is followed by a batch normalization layer and a ReLU activation. During training, the depth of the networks remains constant, since CGaP does not touch the depth of the network, but the width of each layer changes.
Note that in the following text, we denote the full-size models trained from scratch without sparsity regularization as 'baseline' models. The three-step pruning schemes that remove weights or filters but do not execute network growth are denoted as 'pruning-only' models.
Datasets
MNIST is a handwritten-digit dataset in greyscale (i.e., one color channel) with 10 classes from digit 0 to digit 9. It consists of 60,000 training images and 10,000 testing images. The CIFAR-10 dataset consists of 60,000 color images in 10 classes, with 5,000 training images and 1,000 testing images per class. The CIFAR-100 dataset has 100 classes, including 500 training images and 100 testing images per class. The Street View House Numbers (SVHN) dataset is a real-world color image dataset resized to a fixed resolution of 32x32 pixels. It contains 73,257 training images and 26,032 testing images.
Hyperparameters
We set the learning rate to 0.1 and divide it by 10 after every 30% of the training epochs. We train our models using Stochastic Gradient Descent (SGD) with a batch size of 128 examples, a momentum of 0.9, and a weight decay of 0.0005. The loss function is the cross-entropy loss with the softmax function. We train 60, 200, 220 and 100 epochs on the MNIST, CIFAR-10, CIFAR-100 and SVHN datasets, respectively. In the growth phase, the hyper-parameters are set as follows: the growth stops once the number of filters after a growth step would exceed that of the baseline model. The growth ratio r is set to 0.6. The growth frequency f_g is set to 1/3. The scaling factor γ in Eq. 5 to Eq. 10 is set to 0.5 and σ is 0.1. The pruning frequency f_p is set to 1. The setting of the weight pruning rate p follows [15], [16] and [18] for LeNet-5, VGGNet and ResNet, respectively. The filter pruning rate p_f and neuron pruning rate p_n are set to be the same as p.
Framework and platform
The experiments are performed with the PyTorch [46] framework on an NVIDIA GeForce GTX 1080 Ti. It is worth mentioning that experiments performed with different frameworks may vary in accuracy and performance. Thus, to ensure a fair comparison among CGaP, the baseline and the pruning-only methods, all results in Tables I, II, III and IV are obtained with the PyTorch framework.
V-B Performance Evaluation
With the training setup described above, we perform experiments on several datasets with modern DNN architectures. In Tables I, II, III and IV, we summarize the performance attained by CGaP on the MNIST, CIFAR-100, SVHN and CIFAR-10 datasets, respectively. To be specific, the second column, 'Accuracy', denotes the inference accuracy in percentage achieved by the baseline model, the state-of-the-art pruning-only approaches and the CGaP approach, respectively.
The column 'FLOPs' reports the calculated number of FLOPs of a single inference pass. The calculation of FLOPs follows the method described in [42]. Fewer FLOPs mean lower computation cost per inference pass. The neighboring column, 'Pruned', represents the reduction of FLOPs in the compressed model compared to the baseline model. The column 'Param.' gives the number of parameters of the inference model; fewer parameters promise a smaller model size. The last 'Pruned' column denotes the percentage of parameters pruned compared to the baseline. A larger pruned percentage implies fewer computation operations and a more compact model. The best result in each column is highlighted in bold.
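For reference, the dominant term of a conv-layer's FLOPs can be computed as below. The factor of 2 (counting a multiply-accumulate as two operations) is a common convention and may differ in constants from the exact counting method of [42]; the example layer sizes are illustrative.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of one conv-layer for a single inference pass: every output
    pixel of every output channel accumulates c_in * k * k products,
    with each multiply-accumulate counted as two operations."""
    return 2 * h_out * w_out * c_out * c_in * k * k

# e.g. a 32x32 output, 3 -> 64 channels, 3x3 kernels
flops = conv_flops(32, 32, 3, 64, 3)
```

This also shows why filter-wise pruning translates directly into FLOPs savings: removing output channels shrinks c_out here and c_in of the next layer, so the product falls on both sides.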
The results in Tables I to IV show that CGaP outperforms previous pruning-only approaches in accuracy and model size. For instance, as displayed in Table IV, on ResNet-56 our CGaP approach achieves 93.20% accuracy with a 32.5% reduction in FLOPs and a 37.6% reduction in parameters, while the state-of-the-art pruning-only method [28], which deals with a static structure, only reaches 92.56% accuracy with a 32.1% reduction in FLOPs and a 14.1% reduction in parameters. On ResNet-110, though [29] achieves 0.09% higher accuracy than CGaP, CGaP surpasses it by trimming 22.5% more FLOPs.
V-C Visualization of the Dynamic Structures
Fig. 5 presents the dynamic model size during CGaP training. During the growth phase, the model size continuously increases and reaches a peak capacity. When the pruning phase starts, the model size drops.
Furthermore, the sparsity achieved by CGaP is structured. In other words, large numbers of filters and neurons are entirely pruned. For instance, the baseline LeNet-5 without sparsity regularization has 20 and 50 filters in conv-layers 1 and 2, and 500 and 10 neurons in fc-layers 1 and 2, denoted as [20-50-500-10] (number of filters/neurons in [conv1-conv2-fc1-fc2]). The model achieved by CGaP contains only 8 and 17 filters and 23 and 10 neurons, denoted as [8-17-23-10]. Compared to the baseline, CGaP decreases the number of units by 60%, 66% and 95.4% in the respective layers (the output layer must always match the number of classes). In this case, the pruned filters and neurons are skipped in the inference pass, accelerating the computation pipeline on hardware.
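The parameter savings implied by these widths can be checked directly. The sketch below counts the weights of a LeNet-5-shaped network (5x5 kernels and a 4x4 feature map entering fc1 are assumed geometry; biases are ignored) and reproduces the roughly 431K baseline figure of Table I.

```python
def lenet5_params(widths, k=5, spatial=4, in_ch=1):
    """Weight count for widths [conv1, conv2, fc1, fc2] of a
    LeNet-5-shaped network. Geometry (5x5 kernels, 4x4 map into fc1,
    1 input channel) is an assumption for illustration; biases ignored."""
    c1, c2, f1, f2 = widths
    conv1 = c1 * in_ch * k * k
    conv2 = c2 * c1 * k * k
    fc1 = f1 * c2 * spatial * spatial
    fc2 = f2 * f1
    return conv1 + conv2 + fc1 + fc2

baseline = lenet5_params([20, 50, 500, 10])   # 430,500 weights, ~431K as in Table I
compact = lenet5_params([8, 17, 23, 10])      # width pattern found by CGaP
```

The compact widths leave roughly 2% of the baseline weight count, consistent with the 98.1% pruned figure in Table I; small differences come from the ignored biases.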
Another example is provided in Fig. 6, which visualizes the VGG-19 structures from CGaP as well as the baseline structure on two different tasks. In the baseline model, the width (number of filters/neurons) of each layer is abundant, from 64 filters (the bottom conv-layers) to 512 filters (the top conv-layers). The baseline VGG-19 structure is designed to be large enough to guarantee the learning capacity. However, it turns out to be redundant, as shown by the structure that CGaP generates: a large fraction of the filters in each layer is pruned out. Meanwhile, in the baseline model, the top conv-layers are designed to have more filters than the bottom layers, but CGaP shows that it is not always necessary for the top layers to have a relatively large size.
V-D Validating the Saliency-Based Growth
Fig. 7 validates the efficacy of our saliency-based growth policy. Selective growth, which emphasizes the important units according to the saliency score, reaches a lower cross-entropy loss than randomly growing units. The spike in Fig. 7 is caused by the first iteration of pruning, and this loss is recovered by the subsequent iterative fine-tuning. In selective growth, this loss is lower than in random growth. This phenomenon supports our argument that selective growth assists the pruning phase. A detailed understanding of the growth is further discussed in Section VII.
To summarize the results from the algorithm simulations, the proposed CGaP approach:

Largely compresses the model size, by 37.6% (ResNet-56) to 98.1% (LeNet-5), for representative DNN structures.

Decreases the inference cost, specifically the number of FLOPs, by 32.5% (ResNet-56) to 90.4% (LeNet-5) on various datasets.

Does not sacrifice accuracy, and even improves it.

Outperforms the state-of-the-art pruning-only methods that deal with fixed structures.
VI Experiments on FPGA Simulator
The results above demonstrate that CGaP generates an accurate and small inference model. In this section, we further evaluate the on-chip inference cost of the generated models and compare CGaP with previous non-structured pruning [15]. As CGaP achieves structured sparsity, it outperforms the previous work on non-structured pruning in hardware acceleration and power efficiency. We validate this by estimating buffer access energy, DRAM access energy and latency using the performance model for FPGAs [5].
VI-A Overview of the FPGA Simulator
[5] is a high-level performance model designed to estimate the number of external and on-chip memory accesses, as well as the latency. The resource costs are formulated by the acceleration strategy as well as the design variables that control loop tiling and unrolling. The performance model has been validated across several modern DNN algorithms against on-board testing on two FPGAs, with small differences [5].
In the following experiments, the setup is as follows: the pixels and weights are both 16-bit fixed point, the accelerator operating frequency is 300 MHz, and the DRAM bandwidth is 19.2 GB/s. The data width of the DRAM controller and the parameters related to loop tiling and unrolling follow the settings in [5].
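As a sanity check on the scale of these numbers, a lower bound on DRAM traffic time can be estimated from model size and bandwidth alone. This back-of-the-envelope sketch ignores on-chip reuse, tiling and compute, so the simulator's actual latency figures are higher; the function name and the single-stream assumption are ours.

```python
def dram_transfer_ms(num_params, bits_per_word=16, bandwidth_gbps=19.2):
    """Lower-bound time (ms) to stream a model's weights from DRAM once,
    under the experiments' 16-bit fixed-point setting and 19.2 GB/s
    bandwidth. Real latency is higher: activations, repeated fetches
    and compute are ignored."""
    bytes_total = num_params * bits_per_word / 8
    return bytes_total / (bandwidth_gbps * 1e9) * 1e3

t_vgg19 = dram_transfer_ms(20.4e6)   # full VGG-19, 20.4M parameters
t_cgap = dram_transfer_ms(4.3e6)     # CGaP-compressed VGG-19 (Table II)
```

Even this crude bound shows why a structured, smaller model helps: weight traffic shrinks in direct proportion to the parameter count, whereas non-structured sparsity still pays for irregular access patterns.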
VI-B Results from the FPGA Performance Model
The on-chip and external memory access energy across VGG-16, VGG-19, ResNet-56 and ResNet-110 is displayed in Fig. 8(a) and Fig. 8(b), respectively. The inference latency is shown in Fig. 8(c). Though the models generated by weight magnitude pruning and CGaP have the same sparsity, CGaP outperforms non-structured magnitude weight pruning in hardware efficiency and acceleration. For example, with the same pruning ratio during training, magnitude weight pruning yields only modest reductions in on-chip access energy, DRAM access energy and latency for VGG-19 on CIFAR-100, while CGaP achieves markedly larger reductions in all three. The non-structured weight pruning [15] is able to improve power and latency efficiency compared to the baseline, but the improvement is limited. In contrast, CGaP achieves significant acceleration and energy reduction. The reason is that non-structured sparsity, i.e., a scattered weight distribution, leads to irregular memory access that weakens the acceleration on hardware in real scenarios.
VII Discussion
In Sections V and VI, the performance of CGaP was comprehensively evaluated on both algorithmic and hardware platforms. In this section, we provide a deeper understanding of the growth phase to explain why selective growth improves performance over traditional pipelines. Furthermore, we provide a thorough ablation study to validate the robustness of the proposed CGaP method.
TABLE V: Structures of the initial seeds and the resulting accuracy.

Initial seed           ‘2’     ‘4’     ‘6’     ‘8’     ‘10’    ‘12’
#filters   conv1_x      2       4       6       8       10      12
           conv2_x      4       8      12      16       20      24
           conv3_x      8      16      24      32       40      48
           conv4_x     16      32      48      64       80      96
           conv5_x     16      32      48      64       80      96
#param, initial (M)    0.01    0.06    0.13    0.23    0.36    0.53
Testing accuracy*     -0.69%  -0.2%   -0.16%  +0.37%  +0.04%  -0.29%

*Relative accuracy of the final VGG-19 model on CIFAR-100 as compared to the baseline.
VII-A Understanding the growth
Fig. 9 illustrates a visualization of the weights in the bottom convolutional layer (conv1_1) of VGG-19, at the moment of initialization, after the first growth, after the last growth, and when training ends. Inside each figure, the upper bar is the CGaP model, whose size varies at different training moments; the lower bar is the baseline model, whose size is static during training. At initialization (Fig. 9(a)), the CGaP model has only 8 filters in this layer while the baseline model has 64 filters. The number of filters then grows to 13 after one growth iteration (Fig. 9(b)), meaning the 5 most important filters were selected and added. It is clear that the pattern in Fig. 9(b) is more active than that in (a), indicating that the filters have already captured effective features from the input images. More importantly, as growth proceeds, the pattern in the CGaP model becomes more structured than that in the baseline model, as shown in Fig. 9(c). Benefiting from this well-structured pattern, our CGaP model achieves higher learning accuracy than the baseline model. From Fig. 9(c) to Fig. 9(d), relatively unimportant filters are removed and important ones are kept. We observe that most of the filters favored by the growth, such as the filters at indices 36, 48, 72 and 96 in Fig. 9(c), are still labeled as important filters in Fig. 9(d), even after the long training process between the growth phase and the pruning phase. Leveraging the growth policy, the model is able to recover quickly from the loss caused by pruning (the spikes in Fig. 7).
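One growth iteration as described above can be sketched as follows. The L1-norm importance score, the scaling factor, and the noise level used here are illustrative assumptions, not necessarily CGaP's exact settings:

```python
import numpy as np

# Hedged sketch of one growth iteration: rank the filters of a layer by an
# importance score (here, L1 norm of the weights), then widen the layer by
# appending new filters derived from the most important ones.

def grow_layer(W, n_new, scale=0.5, noise_std=0.1, rng=None):
    """W: (out_ch, in_ch, k, k). Returns widened weights (out_ch + n_new, ...)."""
    rng = rng or np.random.default_rng(0)
    importance = np.abs(W).sum(axis=(1, 2, 3))      # per-filter L1 saliency
    parents = np.argsort(importance)[-n_new:]       # the n_new most important filters
    # Newborn filters mimic their parents (scaled) plus a small noise, so the
    # network can resume learning immediately after the growth.
    newborn = scale * W[parents] + noise_std * rng.standard_normal(W[parents].shape)
    return np.concatenate([W, newborn], axis=0)

W = np.random.default_rng(1).standard_normal((8, 3, 3, 3))   # an 8-filter seed layer
W_grown = grow_layer(W, n_new=5)                             # 8 -> 13 filters, as in Fig. 9(b)
```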
VII-B Robustness of the seed
The performance of CGaP is stable under variation of the initial seed. To show this, we scan several seeds of different sizes and report the resulting variation in accuracy and inference model size. The structures of the 6 scanned seeds are listed in Table V. Each seed has a different number of filters in each layer; e.g., seed ‘2’ has 2 filters in block conv1. The size of the seeds varies from 0.01M to 0.53M parameters. Fig. 10 presents the final model size and the number of growth iterations for each seed. A larger seed leads to a larger final model but requires fewer growth iterations to reach the intended model size. Generally speaking, there is a trade-off between inference accuracy and model size. Although the seeds differ considerably from each other, the final accuracy is quite robust, as listed in the ‘Testing accuracy’ row of Table V. It is worth mentioning that, even though seed ‘2’ degrades accuracy by 0.69% from the baseline, its inference model size is only 2.4M, significantly smaller than the baseline size (20.4M).
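The seed structures in Table V follow a simple pattern: seed ‘s’ assigns s filters to block conv1 and doubles the width per block, with conv5 matching conv4. A small helper (hypothetical, for illustration only) reproduces the table:

```python
# Hedged helper reproducing the seed structures of Table V: the width doubles
# from conv1 to conv4 and conv5 keeps the conv4 width, as the table shows.

def seed_structure(s):
    """Return the per-block filter counts [conv1..conv5] for seed 's'."""
    return [s, 2 * s, 4 * s, 8 * s, 8 * s]

print(seed_structure(8))   # the seed used in Fig. 9: 8 filters in conv1
```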
VII-C Robustness of the hyper-parameters
CGaP relies on a set of hyper-parameters to achieve optimal performance, but it is stable under variation of these hyper-parameters. Empirically, we use the following guidelines for parameter tuning: a smaller growth rate for a larger seed, and vice versa; the pruning threshold is set based on the user's intended model size; a smaller growth rate for a complicated dataset, and vice versa; a relatively greedy growth (i.e., growing faster and more at a time) prefers a larger noise but a smaller scaling factor, to keep the model from getting stuck at a local minimum. The pruning ratio of each layer is tuned in a similar manner to other pruning works [15], [16].
In particular, we scan all 121 combinations of the scaling factor and the noise in the range [0.0, 1.0] with step 0.1 and provide the following discussion. For VGG-16 on CIFAR-10, the corner cases include: noise = 1 with scaling = 0 (a case of random initialization); noise = 1 with scaling = 1; noise = 0 with scaling = 1 (mimicking the neighbor without scaling); noise = 0 with scaling = 0 (the training is invalid in this case, as the newborn weights are all zero); noise = 0 with a fractional scaling factor (mimicking the neighbor with scaling); and noise = 0.5 with scaling = 0 (another case of random initialization). The best accuracy is obtained at noise = 0.1 and scaling = 0.5, and combinations in a broad middle zone of the two factors consistently provide high accuracy. To summarize: the scaling factor impacts accuracy more than the noise, since the noise is relatively small; the scaling factor should not be too large, and 0.5 is a safe choice for future tasks and networks; adding a noise improves accuracy, as it prevents local minima; and inheriting from the neighbor is more efficient than random initialization, since the network is able to resume learning right after the growth.
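Under the assumption (suggested by the discussion above, not stated verbatim) that a newborn filter is initialized as a scaled copy of its neighbor plus random noise, the scanned grid and the corner cases look like:

```python
import numpy as np

# Hedged sketch of the initialization being scanned: w_new = scaling * w_neighbor
# + noise * randn. The 121 combinations are the 11 x 11 grid over [0.0, 1.0]
# with step 0.1.

def init_newborn(w_neighbor, scaling, noise, rng):
    return scaling * w_neighbor + noise * rng.standard_normal(w_neighbor.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))

grid = [(round(0.1 * i, 1), round(0.1 * j, 1))       # (noise, scaling) pairs
        for i in range(11) for j in range(11)]

# Corner cases from the text:
w_random = init_newborn(w, scaling=0.0, noise=1.0, rng=rng)  # pure random initialization
w_copy   = init_newborn(w, scaling=1.0, noise=0.0, rng=rng)  # mimic neighbor without scaling
w_zero   = init_newborn(w, scaling=0.0, noise=0.0, rng=rng)  # all zeros: training is invalid
```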
VIII Conclusion and Future Work
Modern DNNs typically start training from a fixed, over-parameterized network, which introduces redundancy and lacks structural plasticity. We propose a novel dynamic training algorithm, Continuous Growth and Pruning (CGaP), that initializes training from a small network seed, continuously expands the network width by adding important learning units and structures, and finally prunes secondary ones. The effectiveness of CGaP depends on where to start and stop the growth, which learning units (filters and neurons) should be added, and how to initialize the newborn units to ensure model convergence. Our experiments on benchmark datasets and architectures demonstrate the advantage of CGaP in learning efficiency (accurate and compact models). We further validate the energy and latency efficiency of the inference model generated by CGaP using an FPGA performance simulator. Our approach and analysis shed light on the development of adaptive neural networks for dynamic tasks such as continual and lifelong learning.
Acknowledgment
This work was supported in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. It was also partially supported by the National Science Foundation (NSF) under CCF #1715443.
Xiaocong Du (S’19) received her B.S. degree in control engineering from Shandong University, Jinan, China, in 2014, and the M.S. degree in electrical and computer engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 2016. She is currently pursuing her Ph.D. degree in electrical engineering at Arizona State University, Tempe, AZ, USA. Her research interests include efficient algorithm and hardware co-design for deep learning, neural architecture search, continual learning, and neuromorphic computing.
Zheng Li (S’19) obtained his B.S. degree in electronics and information engineering from Beihang University, Beijing, China, in 2014, and the M.S. degree in electrical and computer engineering from University of Pittsburgh, Pittsburgh, PA, USA, in 2017. He is currently working towards the Ph.D. degree in computer engineering at Arizona State University, Tempe, AZ, USA. He worked as a summer intern in Machine Learning at MobaiTech, Inc, Tempe, AZ, USA in 2018. His current research interests include algorithm design and optimization for computer vision tasks, such as object detection and autonomous driving. 
Yufei Ma (S’16-M’19) received the B.S. degree in information engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011, the M.S.E. degree in electrical engineering from the University of Pennsylvania, Philadelphia, PA, USA, in 2013, and the Ph.D. degree from Arizona State University, Tempe, AZ, USA, in 2018. His current research interests include high-performance hardware acceleration of deep learning algorithms on digital application-specific integrated circuits and field-programmable gate arrays.
Yu Cao (S’99-M’02-SM’09-F’17) received the B.S. degree in physics from Peking University in 1996. He received the M.A. degree in biophysics and the Ph.D. degree in electrical engineering from the University of California, Berkeley, in 1999 and 2002, respectively. He worked as a summer intern at Hewlett-Packard Labs, Palo Alto, CA, in 2000, and at the IBM Microelectronics Division, East Fishkill, NY, in 2001. After working as a postdoctoral researcher at the Berkeley Wireless Research Center (BWRC), he is now a Professor of Electrical Engineering at Arizona State University, Tempe, AZ. He has published numerous articles and two books on nano-CMOS modeling and physical design. His research interests include physical modeling of nanoscale technologies, design solutions for variability and reliability, reliable integration of post-silicon technologies, and hardware design for on-chip learning. Dr. Cao was a recipient of the 2012 Best Paper Award at the IEEE Computer Society Annual Symposium on VLSI; the 2010, 2012, 2013, 2015 and 2016 Top 5% Teaching Award, Schools of Engineering, Arizona State University; the 2009 ACM SIGDA Outstanding New Faculty Award; the 2009 Promotion and Tenure Faculty Exemplar, Arizona State University; the 2009 Distinguished Lecturer of the IEEE Circuits and Systems Society; the 2008 Chunhui Award for outstanding overseas Chinese scholars; the 2007 Best Paper Award at the International Symposium on Low Power Electronics and Design; the 2006 NSF CAREER Award; the 2006 and 2007 IBM Faculty Award; the 2004 Best Paper Award at the International Symposium on Quality Electronic Design; and the 2000 Beatrice Winner Award at the International Solid-State Circuits Conference. He has served as Associate Editor of the IEEE Transactions on CAD and on the technical program committees of many conferences.
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[3] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, IEEE, 2013.
[4] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh, “Yin and yang: Balancing and answering binary visual questions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5014–5022, 2016.
[5] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Performance modeling for CNN inference accelerators on FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.
[6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., “DaDianNao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622, IEEE Computer Society, 2014.
[8] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting vision processing closer to the sensor,” in ACM SIGARCH Computer Architecture News, vol. 43, pp. 92–104, ACM, 2015.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[10] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[11] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, “Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2017.
[12] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016.
[13] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35, ACM, 2016.
[14] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “CNP: An FPGA-based processor for convolutional networks,” in 2009 International Conference on Field Programmable Logic and Applications, pp. 32–37, IEEE, 2009.
[15] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
[16] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient ConvNets,” arXiv preprint arXiv:1608.08710, 2016.
[17] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A data-driven neuron pruning approach towards efficient deep architectures,” arXiv preprint arXiv:1607.03250, 2016.
[18] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744, 2017.
[19] J.-H. Luo, J. Wu, and W. Lin, “ThiNet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066, 2017.
[20] V. Lebedev and V. Lempitsky, “Fast ConvNets using group-wise brain damage,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2554–2564, IEEE, 2016.
[21] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
[22] J. H. Gilmore, W. Lin, M. W. Prastawa, C. B. Looney, Y. S. K. Vetsa, R. C. Knickmeyer, D. D. Evans, J. K. Smith, R. M. Hamer, J. A. Lieberman, et al., “Regional gray matter growth, sexual dimorphism, and cerebral asymmetry in the neonatal brain,” Journal of Neuroscience, vol. 27, no. 6, pp. 1255–1260, 2007.
[23] S. J. Lipina and J. A. Colombo, Poverty and Brain Development During Childhood: An Approach from Cognitive Psychology and Neuroscience. American Psychological Association, 2009.
[24] M. Butz and A. van Ooyen, “A simple rule for dendritic spine and axonal bouton formation can account for cortical reorganization after focal retinal lesions,” PLoS Computational Biology, vol. 9, no. 10, p. e1003259, 2013.
[25] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., vol. 1, no. 4, p. 7, Citeseer, 2009.
[26] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[28] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” arXiv preprint arXiv:1810.05270, 2018.
[29] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” arXiv preprint arXiv:1808.06866, 2018.
[30] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254, IEEE, 2016.
[31] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.
[32] T. Ash, “Dynamic node creation in backpropagation networks,” Connection Science, vol. 1, no. 4, pp. 365–375, 1989.
[33] B. J. Briedis and T. D. Gedeon, “Using the grow-and-prune network to solve problems of large dimensionality,” in Proceedings of the 1998 Australian Conference on Neural Networks, Brisbane, 1998.
[34] A. Sakar and R. J. Mammone, “Growing and pruning neural tree networks,” IEEE Transactions on Computers, vol. 42, no. 3, pp. 291–299, 1993.
[35] G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 57–67, 2005.
[36] S. Hussain and A. Basu, “Multiclass classification by adaptive network of dendritic neurons with binary synapses using structural plasticity,” Frontiers in Neuroscience, vol. 10, p. 113, 2016.
[37] X. Dai, H. Yin, and N. K. Jha, “NeST: A neural network synthesis tool based on a grow-and-prune paradigm,” arXiv preprint arXiv:1711.02017, 2017.
[38] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
[39] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[40] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, pp. 1269–1277, 2014.
[41] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with ADMM,” in AAAI Conference on Artificial Intelligence, 2018.
[42] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” arXiv preprint arXiv:1611.06440, 2016.
[43] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model compression and acceleration on mobile devices,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018.
[44] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
[45] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[46] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Workshop on Autodiff, 2017.