Efficient Network Construction through Structural Plasticity
Deploying Deep Neural Networks (DNNs) on hardware incurs excessive computation cost due to the massive number of parameters. A typical pipeline to mitigate over-parameterization is to pre-define a DNN structure with redundant learning units (filters and neurons) to attain high accuracy, then to prune redundant units after training for efficient inference. We argue that it is sub-optimal to introduce redundancy into training merely to reduce it later at inference. Moreover, a fixed network structure adapts poorly to dynamic tasks, such as lifelong learning. In contrast, structural plasticity plays an indispensable role in mammalian brains for compact and accurate learning: throughout a lifetime, active connections are continuously created while those no longer important degenerate. Inspired by this observation, we propose a training scheme, Continuous Growth and Pruning (CGaP), which starts training from a small network seed, continuously grows the network by adding important learning units, and finally prunes secondary ones for efficient inference. The inference model generated by CGaP has a sparse structure, largely decreasing inference power and latency when deployed on hardware platforms. With popular DNN structures on representative datasets, the efficacy of CGaP is benchmarked by both algorithmic simulation and architectural modeling on Field-Programmable Gate Arrays (FPGA). For example, CGaP decreases the FLOPs, model size, DRAM access energy and inference latency by 63.3%, 64.0%, 11.8% and 40.2%, respectively, for ResNet-110 on CIFAR-10.
Deep Neural Networks have various applications, including image classification, object detection, speech recognition and natural language processing. However, the accuracy of DNNs relies heavily on massive numbers of parameters and deep structures, making it hard to deploy DNNs on resource-limited embedded systems. When training or running inference of DNN models on hardware, the model must be stored in external memory, such as dynamic random-access memory (DRAM), and fetched multiple times. These operations are expensive in computation, memory access, and energy consumption. For example, Fig. 1 shows the energy consumption of one inference pass in several modern DNN structures, simulated by the FPGA performance model at a 300 MHz operating frequency and a 19.2 GB/s DRAM bandwidth. A typical DNN model is too large to fit in on-chip memory. For instance, VGG-19 has 20.4M parameters. Running such a model requires frequent external memory access, exacerbating the power consumption of a typical embedded system.
Previous research has designed customized hardware for DNN acceleration [7, 8]. Most such designs are limited to relatively small neural networks, such as LeNet-5. For larger networks such as AlexNet and VGG-16, additional efforts are usually required to improve hardware efficiency [10, 11], for example by saving energy through data gating and zero skipping. Some other works focus on data reuse in convolutional layers and demonstrate the results on specific hardware [7, 12, 13, 14]. However, their improvements are limited for networks where fully-connected layers are widely used, such as RNNs and LSTMs.
To support more general models, network pruning is a popular approach that removes secondary weights and neurons. Network pruning executes a three-step procedure: 1) train a pre-designed network from scratch; 2) remove less important connections or filters/neurons, either according to a saliency score (a metric measuring the importance of weights and learning units) [15, 16, 17, 18, 19] or by adding a regularization term to the loss function [20, 21]; and 3) fine-tune to recover accuracy.
However, the above pruning techniques suffer from two limitations: (1) training a large and fixed network from scratch can be sub-optimal, as it introduces redundancy; (2) pruning only discards less important weights at the end of training but never strengthens important weights and nodes during training. These limitations confine the learning performance as well as the model pruning efficiency (i.e., how many parameters can be removed and how structured the resulting sparsity is).
In contrast to static DNN models, the biological nervous system exhibits active growth and pruning throughout its lifetime. Studies [22, 23, 24] have observed that rapid growth of neurons and synapses takes place in an infant's brain and is vital to the maturity of the adult brain. In the brain, neurons and synapses that are used more frequently are strengthened, while those not used consistently are weakened and removed. This structural plasticity of the brain is central to the study of developmental biology.
Inspired by this observation from biology, we propose a training scheme named Continuous Growth and Pruning (CGaP), which leverages structural plasticity to tackle the aforementioned limitations of pruning techniques. Instead of training an over-parameterized network from scratch, CGaP starts training from a small network seed (Fig. 2(a)), whose size is as low as 0.1%-3% of the full-size reference model. In each growth iteration, CGaP locally sorts neurons and filters (also known as output channels in some literature) according to our saliency score (Section III-B). Based on the saliency score, important learning units are selected and corresponding new units are added (see Fig. 2(b)). The selection and addition of important units reinforce the learning and increase model capacity. Then filter-wise and neuron-wise pruning is executed on the post-growth model (Fig. 2(c)) based on the pruning metrics. Finally, CGaP generates a significantly sparse and structured inference model (Fig. 2(d)) with improved accuracy. In the generated inference model, large numbers of filters and neurons have been removed, achieving structured pruning. Compared to non-structured pruning, CGaP benefits hardware implementation, as it reduces the computation volume and memory accesses without any additional hardware architecture change.
Algorithmic experiments and hardware simulations validate that CGaP significantly decreases the number of external and on-chip memory accesses, accelerating the inference by bypassing the removed filters and neurons. On the algorithm side, we demonstrate the performance in accuracy and model pruning on several networks and datasets. For instance, CGaP reduces the parameters of VGG-19 while improving accuracy on CIFAR-100, and likewise on SVHN. For ResNet-110, CGaP reduces parameters while improving accuracy on CIFAR-10. These results exceed the state-of-the-art pruning methods [15, 16, 17, 18, 28, 29]. Furthermore, we validate the efficiency of the inference model generated by CGaP using an FPGA simulator. For one inference pass of VGG-19 on CIFAR-100, a previous non-structured pruning approach incurs higher DRAM access energy and 5.6 ms inference latency, while CGaP requires lower DRAM access energy and only 4.4 ms latency.
The contributions of this paper are as follows:
A brain-inspired training flow (CGaP) with a dynamic structure is proposed. CGaP grows the network from a small seed and effectively reduces over-parameterization without sacrificing accuracy.
The advantage of structured sparsity of the inference model generated from CGaP is validated using a high-level FPGA performance model including on-chip buffer access energy, external memory access energy and inference latency.
A discussion of why the growth improves learning efficiency is provided.
The rest of the paper is organized as follows. Section II introduces the background of model pruning. Section III demonstrates the saliency score used to select the learning units. Section IV describes the proposed Continuous Growth and Pruning scheme. Section V presents the experimental results from algorithmic simulations. Section VI demonstrates the simulation results from FPGA performance modeling. Section VII discusses the understanding of network plasticity as well as ablation study. Section VIII concludes this work and discusses the insight into future work.
II Previous Work
There has been broad interest in reducing the redundancy of DNNs in order to deploy them on resource-limited hardware platforms. Structural surgery is a widely used approach and can be categorized into a destructive direction and a constructive direction. We discuss these two directions, as well as approaches orthogonal to our method, in this section.
II-A Destructive Methods
The destructive methods zero out specific connections or remove filters or neurons in convolutional or fully-connected layers, generating a sparse model. Weight magnitude pruning prunes weights by setting the selected weights to zero; the selection is based on the L1-norm, i.e., the absolute value of the weight. Weight magnitude pruning generates a sparse weight matrix, but not in a structured way. In this case, specific hardware design is needed to take advantage of the optimized inference model; otherwise, the non-structured sparsity does not benefit hardware acceleration due to the overhead of model management. Kernel-wise pruning prunes kernels layer by layer based on a saliency metric for each filter and achieves structured sparsity in the inference model. Compared to kernel-wise pruning, CGaP prunes entire filters, leading to a more structured inference model. Besides saliency-based pruning, penalty-based approaches have been explored and structured sparsity was achieved [21, 31]. Our method differs from all the above pruning schemes in two respects: (1) we start training from a small seed rather than an over-parameterized network; (2) besides removing secondary filters/neurons, we also reinforce important ones to further improve learning accuracy and model compactness.
II-B Constructive Methods
The constructive approaches include techniques that add new connections or filters to enlarge the model capacity. Early works [32, 33] increased network size by adding random neurons with fresh initialization (i.e., weights randomly initialized, without pre-trained information) and evaluated the approach on basic XOR problems. Different from these, CGaP selectively adds neurons and filters that are initialized with information learned from previous training, and is validated on modern DNNs and datasets under more realistic scenarios. Other work grew the smallest Neural Tree Networks (NTN) to minimize the number of classification errors on Boolean function learning tasks and used pruning to enhance the generalization of NTN; improved the accuracy of radial basis function (RBF) networks on function approximation tasks by adding and removing hidden neurons; or, to enhance the accuracy of spike-based classifiers, progressively added dendrites to the network and then optimized the topology of the dendritic tree. Different from these, CGaP aims at improving the efficiency of the inference model of modern Deep Neural Networks on image classification tasks. Another approach constructed the DNN by activating connections and choosing a set of convolutional filters among a batch of randomly generated filters according to their influence on the training performance. However, this approach highly depends on trial and error to find the optimal set of filters that reduces the loss the most, and it is sensitive to power and timing budgets, limiting its extension to large datasets. Unlike that work, CGaP directly grows the network from a seed, minimizing the effort spent on trial and error.
II-C Orthogonal Methods
The orthogonal methods, such as low-precision quantization and low-rank decomposition, compress the DNN models by quantizing the parameters to fewer bits [38, 39], or by finding a low-rank approximation [40, 41]. Note that our CGaP approach can be combined with these orthogonal methods to further improve inference efficiency.
III Saliency Score
In this section, we describe the detailed methodology of CGaP, starting from the saliency score, which is used to measure the importance of a learning unit. Section III-A defines the terminology used in this paper. Section III-B provides the mathematical formulation of the saliency score we adopt.
A DNN can be treated as a feedforward multi-layer architecture that maps the input images to certain output vectors. Each layer is a function, such as convolution, ReLU, pooling or inner product; in the case of convolutional and fully-connected layers, its input is $\mathbf{z}_{l-1}$, its output is $\mathbf{z}_l$ and its parameter is $\mathbf{W}_l$. Hereby the convolutional layer (conv-layer) is formulated as: $\mathbf{z}_l = \mathbf{W}_l * \mathbf{z}_{l-1}$, wherein $\mathbf{W}_l \in \mathbb{R}^{M_l \times N_l \times K_l \times K_l}$, $\mathbf{z}_{l-1} \in \mathbb{R}^{N_l \times H_{l-1} \times W_{l-1}}$ and $\mathbf{z}_l \in \mathbb{R}^{M_l \times H_l \times W_l}$, where the subscript $l$ denotes the index of the layer and $*$ denotes convolution. The fully-connected layer is represented by: $\mathbf{z}_l = \mathbf{W}_l \mathbf{z}_{l-1}$, where the input $\mathbf{z}_{l-1} \in \mathbb{R}^{n_{l-1}}$, the output $\mathbf{z}_l \in \mathbb{R}^{n_l}$, and the parameter matrix is $\mathbf{W}_l \in \mathbb{R}^{n_l \times n_{l-1}}$.
Convolutional layer (conv-layer)
the four dimensions of its weight tensor $\mathbf{W}_l \in \mathbb{R}^{M_l \times N_l \times K_l \times K_l}$ are: the number of output channels $M_l$, the number of input channels $N_l$, and the kernel width and height $K_l$, respectively. We denote the $i$-th 3D filter, which generates the $i$-th output channel in the feature map, as $\mathbf{F}_{l,i} \in \mathbb{R}^{N_l \times K_l \times K_l}$. The $j$-th 2D kernel in the $i$-th filter is denoted as $\mathbf{K}_{l,i,j} \in \mathbb{R}^{K_l \times K_l}$. On the other hand, a 4D weight tensor $\mathbf{T}_{l,j} \in \mathbb{R}^{M_l \times 1 \times K_l \times K_l}$, which operates on the $j$-th input feature map, is a package of kernels across all output channels. For example, in Fig. 3, $\mathbf{F}_{l,i}$ is a 3D filter consisting of $N_l$ kernels, and $\mathbf{T}_{l,j}$ as well as $\mathbf{T}_{l,j'}$ are both 4D tensors with dimension $M_l \times 1 \times K_l \times K_l$, which include all the output channels but have only one input channel, located at $j$. The weight pixel $w_{l,i,j,p,q}$ refers to one weight at the $p$-th row and the $q$-th column in the $i$-th filter of the $j$-th input channel.
Fully-connected layer (fc-layer)
inputs propagate from one hidden activation to the next layer. We refer to the $i$-th hidden activation of layer $l$, together with its connections, as a neuron $n_{l,i}$. This neuron receives information from the previous layer through its fan-in weights $\mathbf{w}^{\mathrm{in}}_{l,i}$ (as shown in Fig. 4) and propagates to the next layer through its fan-out weights $\mathbf{w}^{\mathrm{out}}_{l,i}$. Also note that the output dimension of layer $l$ equals the input dimension of layer $l+1$. The weight in layer $l$ at the cross-point of row $i$ and column $j$ is denoted as $w_{l,i,j}$. Moreover, the 'depth' of a DNN model indicates the number of layers, and the 'width' of a DNN model refers to the number of filters or neurons in each layer.
Growing or pruning a filter means adding or removing $\mathbf{F}_{l,i}$ and its corresponding output feature map. Growing or pruning a neuron means adding or removing both $\mathbf{w}^{\mathrm{in}}_{l,i}$ and $\mathbf{w}^{\mathrm{out}}_{l,i}$.
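For concreteness, the tensor views defined above can be sketched in a few lines of NumPy (the shapes and variable names here are illustrative, not taken from the authors' code):

```python
import numpy as np

# Conv-layer weights: (output channels M_l, input channels N_l, K_l, K_l)
M, N, K = 4, 3, 5
W = np.random.randn(M, N, K, K)

filt = W[1]          # a 3D filter F_{l,i}: shape (N, K, K)
kernel = W[1, 2]     # a 2D kernel K_{l,i,j}: shape (K, K)
slice_j = W[:, 2:3]  # 4D tensor T_{l,j} on input channel j: shape (M, 1, K, K)

# Fc-layer weights: (n_l, n_{l-1}); row i holds the fan-in weights of
# neuron i, while its fan-out weights live in the next layer's matrix.
Wfc = np.random.randn(6, 8)
fan_in_1 = Wfc[1]    # fan-in weights of neuron 1: shape (8,)
```

This matches the layout used by common deep-learning frameworks, where conv weights are stored as (out_channels, in_channels, kernel_h, kernel_w).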
III-B Saliency Score
We adopt a saliency score to measure the effect of a single filter/neuron on the loss function, i.e., the importance of each learning unit. The saliency score is developed from a Taylor expansion of the loss function, which has previously been applied to pruning. In this paper, we adopt this saliency score and apply it to both the growth and the pruning scheme. In this section, we provide a mathematical formulation of the saliency score.
The saliency score represents the difference between the loss with and without each unit. In other words, if the removal of a filter/neuron leads to relatively small accuracy degradation, this unit is recognized as unimportant, and vice versa. Thus, the objective function for finding the filter with the highest saliency score is formulated as:

$$\underset{i}{\arg\max}\ S_{l,i}, \qquad S_{l,i} = \left|\mathcal{L}(\mathcal{D};\mathbf{W}) - \mathcal{L}(\mathcal{D};\mathbf{W},\ \mathbf{F}_{l,i}=\mathbf{0})\right|,$$

where $\mathcal{L}$ is the loss function, $\mathcal{D}$ is the training set, and $\mathbf{F}_{l,i}=\mathbf{0}$ denotes zeroing out the $i$-th filter in layer $l$.
Using the first-order term of the Taylor expansion of $\mathcal{L}$ around $\mathbf{F}_{l,i}=\mathbf{0}$, the saliency score of a filter is approximated as:

$$S_{l,i} \approx \left|\sum_{j,p,q} \frac{\partial \mathcal{L}}{\partial w_{l,i,j,p,q}}\, w_{l,i,j,p,q}\right|.$$
Similarly, the saliency score of a neuron $n_{l,i}$ is derived from its fan-in and fan-out weights:

$$S_{l,i} \approx \left|\sum_{j} \frac{\partial \mathcal{L}}{\partial w^{\mathrm{in}}_{l,i,j}}\, w^{\mathrm{in}}_{l,i,j} + \sum_{k} \frac{\partial \mathcal{L}}{\partial w^{\mathrm{out}}_{l,i,k}}\, w^{\mathrm{out}}_{l,i,k}\right|.$$
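As a sanity check, the first-order Taylor score amounts to summing gradient × weight over each filter and taking the absolute value. A minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def filter_saliency(W, dW):
    """First-order Taylor saliency per filter.

    W, dW: weight tensor and its loss gradient, shape (M, N, K, K).
    Returns one score per output channel: |sum over the filter of dL/dw * w|.
    """
    return np.abs((W * dW).reshape(W.shape[0], -1).sum(axis=1))

# Toy check: two 1x1x1 filters whose grad*weight sums are 3 and -2.
W = np.array([[[[1.0]]], [[[1.0]]]])
dW = np.array([[[[3.0]]], [[[-2.0]]]])
scores = filter_saliency(W, dW)  # -> [3., 2.]
```

Note the absolute value is taken after summing, so positive and negative contributions within a filter can cancel, mirroring the loss-difference interpretation above.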
IV CGaP Methodology
With the saliency score as the foundation, we develop the entire CGaP flow on top of it. This section explains the overall flow and the detailed implementation of each step of CGaP.
The CGaP scheme is described in Algorithm 1. Starting from a small network seed, growth takes place periodically at a frequency $f_g$ (see Algorithm 1, line 4, where '%' denotes the modulo operation). During each growth, important learning units are chosen and grown with growth ratio $\beta$, layer by layer from the bottom (input) to the top (output), based on the local ranking of the saliency score. The growth phase stops when the model reaches a capacity threshold $\tau$, followed by several epochs of training of the peak model. When the training accuracy reaches a threshold, the pruning phase starts. Pruning is performed layer by layer, from the bottom layer to the top layer, at a frequency $f_p$. The details of the growth phase and the pruning phase are as follows.
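The phase logic above can be paraphrased as a schematic loop tracking the model size. This is a simplified sketch under our own naming; the real Algorithm 1 grows and prunes per layer using saliency ranking, which is omitted here:

```python
def cgap_size_trajectory(epochs, seed_units, beta, f_grow, tau,
                         prune_frac, f_prune, acc_reached_at):
    """Track the total unit count through grow -> train -> prune phases."""
    size, phase, history = seed_units, "grow", []
    for epoch in range(epochs):
        if phase == "grow" and epoch % f_grow == 0:
            size = int(size * (1 + beta))      # add a beta fraction of units
            if size >= tau:
                phase = "train"                # capacity threshold reached
        elif phase == "train" and epoch >= acc_reached_at:
            phase = "prune"                    # accuracy threshold hit
        elif phase == "prune" and epoch % f_prune == 0:
            size = int(size * (1 - prune_frac))
        history.append(size)
    return history

# Toy run: grow from 10 units toward a capacity of 100, then prune.
hist = cgap_size_trajectory(epochs=30, seed_units=10, beta=0.6, f_grow=1,
                            tau=100, prune_frac=0.2, f_prune=1,
                            acc_reached_at=20)
```

The trajectory rises to a peak and then shrinks, which is the same shape as the model-size curve the paper later shows in Fig. 5.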
IV-A Growth phase
Algorithm 2 presents the methodology of the growth phase. Each growth iteration in a layer consists of two steps: growth in layer $l$ and mapping in the adjacent layer. Two cases need to be discussed separately: convolutional layers (Fig. 3) and fully-connected layers (Fig. 4). Owing to the difference between these two operations discussed previously, after the growth of layer $l$, the mapping for a conv-layer takes place in the adjacent layer $l+1$, whereas in fc-layers the mapping is within layer $l$ itself.
Growth in conv-layer
According to the local ranking of the saliency score (Section III-B), we sort all the 3D filters in this layer. With a growth ratio $\beta$, the top $\beta M_l$ filters are selected in the $l$-th layer at the $t$-th growth. Beside each selected filter $\mathbf{F}_{l,i}$, as shown in Fig. 3, we create a new filter of the same size, named $\mathbf{F}_{l,i'}$.
In the ideal case, the new filter and the existing filter are expected to collaborate with each other to optimize the learning. The existing filter has already learned the current task. To keep the same learning pace between the existing filter and the new filter, we initialize $\mathbf{F}_{l,i'}$ as follows:

$$\mathbf{F}_{l,i'} = \alpha\,\mathbf{F}_{l,i} + \epsilon,$$

where $\alpha$ is a scaling factor and $\epsilon$ is a noise term following a uniform distribution in $[0, \sigma]$, with $\sigma \ll 1$. Instead of random initialization, the above initialization helps reconcile the learning status of the newborn filters with the old filters. Meanwhile, the scaling factor $\alpha$ prevents the output from exploding exponentially during feedforward propagation. The noise $\epsilon$ prevents the learning from getting stuck in a local minimum that leads to sub-optimal solutions. Regardless of which distribution the noise follows, an $\epsilon$ in a reasonable range provides similar performance. However, other distributions usually introduce more hyper-parameters and thus require more effort in parameter tuning. For example, Gaussian noise introduces one more hyper-parameter, the standard deviation, than uniform noise. For simplicity, we use uniformly distributed noise.
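One growth step in a conv-layer can be sketched as follows, assuming the initialization form above (an α-scaled copy of the parent filter plus small uniform noise); the function name and noise range are our reading of the text, not the authors' code:

```python
import numpy as np

def grow_conv_layer(W, saliency, beta, alpha=0.5, sigma=0.1, rng=None):
    """Append new filters beside the top-(beta*M) most salient ones.

    W: (M, N, K, K) conv weights; saliency: (M,) per-filter scores.
    Each new filter = alpha * parent + Uniform(0, sigma) noise.
    """
    rng = rng or np.random.default_rng(0)
    n_new = max(1, int(beta * W.shape[0]))
    top = np.argsort(saliency)[::-1][:n_new]       # most important filters
    new = alpha * W[top] + rng.uniform(0.0, sigma, W[top].shape)
    return np.concatenate([W, new], axis=0), top

W = np.ones((4, 3, 3, 3))
grown, top = grow_conv_layer(W, saliency=np.array([0.1, 0.9, 0.5, 0.2]),
                             beta=0.5)
```

With β = 0.5, the two most salient filters (indices 1 and 2) are duplicated, so the layer widens from 4 to 6 filters.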
Mapping in conv-layer
After the number of filters in layer $l$ grows from $M_l$ to $(1+\beta)M_l$, the number of output feature maps also increases from $M_l$ to $(1+\beta)M_l$. Therefore, the input-wise dimension of layer $l+1$ should increase correspondingly in order to keep the data propagation consistent. To match the dimension, we first locate the 4D tensor $\mathbf{T}_{l+1,i}$ in layer $l+1$ that processes the feature map generated by $\mathbf{F}_{l,i}$. Then we add a new 4D tensor $\mathbf{T}_{l+1,i'}$ adjacent to $\mathbf{T}_{l+1,i}$. The $\mathbf{T}_{l+1,i}$ and $\mathbf{T}_{l+1,i'}$ are initialized as follows:

$$\mathbf{T}_{l+1,i'} = \alpha\,\mathbf{T}_{l+1,i} + \epsilon, \qquad \mathbf{T}_{l+1,i} \leftarrow \alpha\,\mathbf{T}_{l+1,i} + \epsilon.$$
To summarize, as illustrated in Fig. 3, the filter $\mathbf{F}_{l,i}$ (green) is selected according to the saliency score and a new filter $\mathbf{F}_{l,i'}$ (orange) is added. Then the input-wise tensor $\mathbf{T}_{l+1,i}$ (in the blue dashed rectangle) in layer $l+1$ is projected, and $\mathbf{T}_{l+1,i'}$ (in the black dashed rectangle) is generated.
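The dimension mapping in layer l+1 can be sketched as appending, for each grown filter, a matching input-channel slice. The initialization of the appended slice is our hedged reading of the text (a scaled copy plus noise); names are illustrative:

```python
import numpy as np

def map_next_conv_layer(W_next, grown_idx, alpha=0.5, sigma=0.1, rng=None):
    """Widen layer l+1's input dimension to absorb new feature maps.

    W_next: (M2, N2, K, K) weights of layer l+1.
    grown_idx: indices of the filters grown in layer l; for each such
    input channel j, append a new slice alpha * W_next[:, j] + noise.
    """
    rng = rng or np.random.default_rng(0)
    slices = [alpha * W_next[:, j:j + 1]
              + rng.uniform(0.0, sigma, W_next[:, j:j + 1].shape)
              for j in grown_idx]
    return np.concatenate([W_next] + slices, axis=1)

W_next = np.ones((5, 4, 3, 3))
widened = map_next_conv_layer(W_next, grown_idx=[1, 2])
```

After mapping, layer l+1 has 6 input channels instead of 4, consistent with the two filters added in layer l.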
After layer $l$ grows and layer $l+1$ is mapped, layer $l+1$ grows and layer $l+2$ is mapped, and so forth until the last convolutional layer. It is worth mentioning that for the 'projection shortcuts' with convolutions in ResNet, the dimension mapping is between the two layers that the shortcut connects, which are not necessarily adjacent layers.
Growth and mapping in fc-layers
The mapping in fc-layers takes place in the fan-in weights of the new neuron as follows:

$$\mathbf{w}^{\mathrm{in}}_{l,i'} = \alpha\,\mathbf{w}^{\mathrm{in}}_{l,i} + \epsilon.$$
After growing the last conv-layer, we flatten the output feature map of this conv-layer, treat it as the input to the first fc-layer, and map it in the same manner.
IV-B Pruning phase
Pruning in each layer consists of two steps: weight pruning and unit pruning. First, we sort weight pixels locally in each conv-layer according to:

$$s_{l,i,j,p,q} = \left|\frac{\partial \mathcal{L}}{\partial w_{l,i,j,p,q}}\, w_{l,i,j,p,q}\right|,$$
and in each fc-layer according to:

$$s_{l,i,j} = \left|\frac{\partial \mathcal{L}}{\partial w_{l,i,j}}\, w_{l,i,j}\right|.$$
In each layer, the $\gamma \times 100\%$ of weight pixels with the lowest scores are set to zero, where $\gamma$ is the weight pruning rate. Then every filter/neuron whose sparsity is larger than the filter pruning rate $\delta_F$ or the neuron pruning rate $\delta_n$ is set to zero entirely. In this way, a large number of entire filters/neurons are pruned, leading to a compact inference model.
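The two-step pruning can be sketched as follows: zero the lowest-scoring fraction of weights by the grad × weight score, then drop any filter whose resulting sparsity exceeds the unit pruning rate. Thresholds and names are illustrative:

```python
import numpy as np

def prune_conv_layer(W, dW, gamma, delta):
    """Zero the gamma fraction of weights with the lowest |dL/dw * w|,
    then remove filters whose resulting sparsity exceeds delta."""
    score = np.abs(W * dW)
    thresh = np.quantile(score, gamma)
    W_sparse = np.where(score <= thresh, 0.0, W)
    sparsity = (W_sparse.reshape(W.shape[0], -1) == 0.0).mean(axis=1)
    keep = sparsity <= delta
    return W_sparse[keep], keep

# Filter 0 holds the 8 smallest scores, filter 1 the 8 largest.
W = np.arange(1.0, 17.0).reshape(2, 2, 2, 2)
dW = np.ones_like(W)
pruned, keep = prune_conv_layer(W, dW, gamma=0.5, delta=0.9)
```

Here the weight-pruning step zeros all of filter 0, whose sparsity (100%) then exceeds δ = 0.9, so the whole filter is removed and only filter 1 survives, yielding the structured sparsity the text describes.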
V Algorithmic Experiments
To evaluate the proposed approach, we present experimental results in this section. We perform experiments on several modern DNN structures (LeNet , VGG-Net , ResNet ) and representative datasets (MNIST , CIFAR-10, CIFAR-100 , SVHN ).
V-A Training Setup
The LeNet-5 architecture consists of two sets of convolutional, ReLU and max-pooling layers, followed by two fully-connected layers and finally a softmax classifier. The VGG-16 and VGG-19 structures we use have the same convolutional structure as the original design but are redesigned with only two fully-connected layers for a fair comparison with the pruning-only method. Therefore, VGG-16 (VGG-19) has 13 (16) convolutional layers, each followed by a batch normalization layer and a ReLU activation. The structures of ResNet-56 and ResNet-110 follow the original design; each convolutional layer is followed by a batch normalization layer and ReLU activation. During training, the depth of the networks remains constant, since CGaP does not touch the depth of the network, while the width of each layer changes.
Note that in the following text, we denote the full-size models trained from scratch without sparsity regularization as ‘baseline’ models. The three-step pruning schemes that remove weights or filters but do not execute network growth are denoted as ‘pruning-only’ models.
MNIST is a handwritten digit dataset in grey-scale (i.e., one color channel) with 10 classes from digit 0 to digit 9. It consists of 60,000 training images and 10,000 testing images. The CIFAR-10 dataset consists of 60,000 color images in 10 classes, with 5000 training images and 1000 testing images per class. The CIFAR-100 dataset has 100 classes, including 500 training images and 100 testing images per class. The Street View House Number (SVHN) is a real-world color image dataset that is resized to a fixed resolution of pixels. It contains 73,257 training images and 26,032 testing images.
We set the learning rate to 0.1 and divide it by 10 after every 30% of the training epochs. We train our models using Stochastic Gradient Descent (SGD) with a batch size of 128 examples, a momentum of 0.9, and a weight decay of 0.0005. The loss function is the cross-entropy loss with softmax. We train 60, 200, 220 and 100 epochs on the MNIST, CIFAR-10, CIFAR-100 and SVHN datasets, respectively. In the growth phase, the hyper-parameters are set as follows. The growth stops at the $t$-th growth if the number of filters after that growth exceeds that of the baseline model. The growth ratio $\beta$ is set to 0.6. The growth frequency $f_g$ is set to 1/3. The scaling factor $\alpha$ is set to 0.5 and the noise bound $\sigma$ is 0.1. The pruning frequency $f_p$ is set to 1. The weight pruning rate $\gamma$ follows prior pruning settings for LeNet-5, VGG-Net and ResNet, respectively, and the filter/neuron pruning rates $\delta_F$ and $\delta_n$ are set in the same manner.
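The step learning-rate schedule described above (0.1, divided by 10 after every 30% of the training epochs) can be written as, for example:

```python
def step_lr(epoch, total_epochs, base_lr=0.1, drop=10.0, every_frac=0.3):
    """Divide base_lr by `drop` after each 30% chunk of training."""
    chunk = max(1, int(every_frac * total_epochs))
    return base_lr / (drop ** (epoch // chunk))

# For a 200-epoch run: lr = 0.1 for epochs 0-59, 0.01 for 60-119, etc.
```

This is a sketch of the stated schedule, not the authors' script; in PyTorch the same effect is usually obtained with a step or multi-step LR scheduler.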
Framework and platform
The experiments are performed with the PyTorch framework on an NVIDIA GeForce GTX 1080 Ti platform. It is worth mentioning that experiments performed with different frameworks may vary in accuracy and performance. Thus, for a fair comparison among CGaP, the baseline and pruning-only methods, all results in Tables I, II, III and IV are obtained with the PyTorch framework.
V-B Performance Evaluation
With the training setup described above, we perform experiments on several datasets with modern DNN architectures. Tables I, II, III and IV summarize the performance attained by CGaP on the MNIST, CIFAR-100, SVHN, and CIFAR-10 datasets, respectively. Specifically, the column 'Accuracy' reports the inference accuracy in percentage achieved by the baseline model, the up-to-date pruning-only approaches and the CGaP approach, respectively.
The column 'FLOPs' reports the calculated number of FLOPs of a single inference pass, following the calculation method described in the literature. Fewer FLOPs mean lower computation cost in one inference pass. The neighboring column, 'Pruned', reports the reduction in FLOPs of the compressed model relative to the baseline. The column 'Param.' gives the number of parameters of the inference model; fewer parameters promise a smaller model size. The last column, 'Pruned', denotes the percentage of parameters pruned compared to the baseline. A larger pruned percentage implies fewer computation operations and a more compact model. The best result in each column is highlighted in bold.
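For reference, a common way to count per-layer FLOPs is shown below, counting one multiply-accumulate as two operations. The paper follows its cited counting method, which may differ from this convention by a constant factor:

```python
def conv_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs of a conv-layer: 2 * (k*k*c_in) MACs per output pixel/channel."""
    return 2 * c_in * k * k * c_out * h_out * w_out

def fc_flops(n_in, n_out):
    """FLOPs of a fully-connected layer."""
    return 2 * n_in * n_out

# First conv of a CIFAR-style VGG: 3->64 channels, 3x3 kernels, 32x32 output.
flops = conv_flops(3, 64, 3, 32, 32)
```

Pruning filters shrinks both c_out of the pruned layer and c_in of the next layer, which is why structured pruning reduces FLOPs multiplicatively across adjacent layers.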
The results in Tables I to IV show that CGaP outperforms the previous pruning-only approaches in accuracy and model size. For instance, as displayed in Table IV, on ResNet-56 our CGaP approach achieves 93.20% accuracy with a 32.5% reduction in FLOPs and a 37.6% reduction in parameters, while the up-to-date pruning-only method, which deals with a static structure, only reaches 92.56% accuracy with a 32.1% reduction in FLOPs and a 14.1% reduction in parameters. On ResNet-110, though the pruning-only method achieves 0.09% higher accuracy than CGaP, CGaP surpasses it by trimming 22.5% more FLOPs.
V-C Visualization of the dynamic structures
Fig. 5 presents the dynamic model size during CGaP training. During the growth phase, the model size continuously increases and reaches a peak capacity. When the pruning phase starts, the model size drops.
Furthermore, the sparsity achieved by CGaP is structured: large numbers of filters and neurons are pruned in their entirety. For instance, the baseline LeNet-5 without sparsity regularization has 20 and 50 filters in conv-layer 1 and conv-layer 2, and 500 and 10 neurons in fc-layer 1 and fc-layer 2, denoted as [20-50-500-10] (number of filters/neurons in [conv1-conv2-fc1-fc2]). The model produced by CGaP contains only 8 and 17 filters and 23 and 10 neurons, denoted as [8-17-23-10]. Compared to the baseline, CGaP decreases the number of units by 60%, 66% and 95.4% in the first three layers (the output layer always keeps as many neurons as there are classes). The pruned filters and neurons are skipped in the inference pass, thus accelerating the computation pipeline on hardware.
Another example is provided in Fig. 6, which visualizes the VGG-19 structures produced by CGaP as well as the baseline structure on two different tasks. In the baseline model, the width (number of filters/neurons) of each layer is abundant, from 64 filters in the bottom conv-layers to 512 filters in the top conv-layers. The baseline VGG-19 structure is designed with a large enough size to guarantee learning capacity. However, it turns out to be redundant, as shown by the structures CGaP generates: a large fraction of the filters in each layer is pruned out. Meanwhile, in the baseline model, the top conv-layers are designed to have more filters than the bottom layers, but CGaP shows that it is not always necessary for the top layers to have a relatively large size.
V-D Validating the saliency-based growth
Fig. 7 validates the efficacy of our saliency-based growth policy. Selective growth, which emphasizes important units according to the saliency score, yields a lower cross-entropy loss than randomly growing units. The spike in Fig. 7 is caused by the first iteration of pruning, and the loss is recovered by the subsequent iterative fine-tuning. Under selective growth, this loss is lower than under random growth, supporting our argument that selective growth assists the pruning phase. A detailed analysis of the growth is provided in Section VII.
To summarize the results from the algorithm simulations, the proposed CGaP approach:
Largely compresses the model size of representative DNN structures, from ResNet-56 to LeNet-5.
Decreases the inference cost, specifically the number of FLOPs, on various datasets.
Does not sacrifice accuracy, and in many cases even improves it.
Outperforms the state-of-the-art pruning-only methods that deal with fixed structures.
VI Experiments on FPGA simulator
The results above demonstrate that CGaP generates an accurate and small inference model. In this section, we further evaluate the on-chip inference cost of the generated models and compare CGaP with previous non-structured pruning. Because CGaP achieves structured sparsity, it outperforms previous work on non-structured pruning in hardware acceleration and power efficiency. We validate this by estimating buffer access energy, DRAM access energy and latency using the performance model for FPGA.
VI-A Overview of the FPGA simulator
The simulator is a high-level performance model designed to estimate the number of external and on-chip memory accesses, as well as the latency. The resource costs are formulated from the acceleration strategy and the design variables that control loop tiling and unrolling. The performance model has been validated on several modern DNN algorithms against on-board tests on two FPGAs, with only small differences.
In the following experiments, the setup is as follows: pixels and weights are both 16-bit fixed point, the accelerator operating frequency is 300 MHz, and the DRAM bandwidth is 19.2 GB/s, as in Fig. 1. The parameters related to loop tiling and unrolling follow the setting of the performance model.
VI-B Results from FPGA performance model
The on-chip and external memory access energy across VGG-16, VGG-19, ResNet-56 and ResNet-110 is displayed in Fig. 8(a) and Fig. 8(b), respectively, and the inference latency is shown in Fig. 8(c). Though the models generated by weight magnitude pruning and by CGaP have the same sparsity, CGaP outperforms non-structured magnitude pruning in hardware efficiency and acceleration. For example, with the same pruning ratio during training, magnitude weight pruning yields only modest reductions in on-chip access energy, DRAM access energy and latency for VGG-19 on CIFAR-100, while CGaP achieves substantially larger reductions in all three. Non-structured weight pruning does improve power and latency relative to the baseline, but the improvement is limited. In contrast, CGaP achieves significant acceleration and energy reduction. The reason is that non-structured sparsity, i.e., a scattered weight distribution, leads to irregular memory access that weakens hardware acceleration in real scenarios.
VII Discussion
In Sections V and VI, the performance of CGaP has been comprehensively evaluated on algorithm and hardware platforms. In this section, we provide a more in-depth understanding of the growth, explaining why selective growth improves performance over traditional pipelines. Furthermore, we provide a thorough ablation study to validate the robustness of the proposed CGaP method.
Relative accuracy of the final VGG-19 model on CIFAR-100 as compared to the baseline.
Understanding the growth
Fig. 9 visualizes the weights of the bottom conv-layer (conv1_1) of VGG-19 at initialization, after the first growth, after the last growth, and at the end of training. In each sub-figure, the upper bar is the CGaP model, whose size varies across training, and the lower bar is the baseline model, whose size is static. At initialization (Fig. 9(a)), the CGaP model has only 8 filters in this layer while the baseline model has 64. The number of filters grows to 13 after one growth iteration (Fig. 9(b)), meaning the 5 most important filters were selected and added. The pattern in Fig. 9(b) is clearly more active than in Fig. 9(a), indicating that the filters have already captured effective features from the input images. More importantly, as the growth proceeds, the pattern in the CGaP model becomes more structured than in the baseline model, as shown in Fig. 9(c). Benefiting from this well-structured pattern, the CGaP model achieves higher learning accuracy than the baseline model. From Fig. 9(c) to Fig. 9(d), relatively unimportant filters are removed and important ones are kept. We observe that most filters favored by the growth, such as those at indices 36, 48, 72 and 96 in Fig. 9(c), are still labeled as important in Fig. 9(d), even after the long training process between the growth phase and the pruning phase. Leveraging the growth policy, the model recovers quickly from the loss caused by pruning (the spike in Fig. 7).
Robustness of the seed
The performance of CGaP is stable under variation of the initial seed. To demonstrate this, we scan several seeds of different sizes and report the variation in accuracy and inference model size. The structures of the 6 scanned seeds are listed in Table V. Each seed has a different number of filters in each layer; e.g., seed ‘2’ has 2 filters in block conv1. The size of the seeds varies from 0.01M to 0.53M parameters. Fig. 10 presents the final model size and the number of growth iterations for each seed. A larger seed leads to a larger final model but requires fewer growth iterations to reach the intended model size. Generally speaking, there is a trade-off between inference accuracy and model size. Though the seeds differ substantially from each other, the final accuracy is quite robust, as listed in the ‘Accuracy’ row of Table V. It is worth mentioning that even though seed ‘2’ degrades the accuracy from the baseline, its inference model is only 2.4M, significantly smaller than the baseline size (20.4M).
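The observation that a larger seed needs fewer growth iterations can be sketched with simple arithmetic. The multiplicative growth model below (each iteration adds a fixed fraction of the current filters) is a hypothetical illustration, not the paper's exact schedule:

```python
import math

def growth_iterations(seed_filters, target_filters, growth_rate):
    """Count iterations to expand a seed layer to the target width,
    assuming each iteration adds growth_rate * current filters
    (a hypothetical multiplicative growth model, for illustration)."""
    n, iters = seed_filters, 0
    while n < target_filters:
        n += max(1, math.floor(growth_rate * n))
        iters += 1
    return iters

# A larger seed reaches the same target width in fewer iterations.
small_seed_iters = growth_iterations(2, 64, 0.5)
large_seed_iters = growth_iterations(16, 64, 0.5)
```

Under this model, the 2-filter seed needs more than twice as many growth iterations as the 16-filter seed, consistent with the trend in Fig. 10.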
Robustness of the hyper-parameters
CGaP relies on a set of hyper-parameters to achieve optimal performance, yet it remains stable under variation of these hyper-parameters. Empirically, we follow these guidelines for parameter tuning: use a smaller growth rate for a larger seed, and vice versa; set the pruning threshold according to the user's intended model size; use a smaller growth rate for a complicated dataset, and vice versa; a relatively greedy growth schedule prefers a larger noise but a smaller scaling factor, to push the model away from getting stuck in a local minimum. The pruning ratio of each layer is tuned in a manner similar to other pruning works.
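The inverse relation between seed size and growth rate in the guidelines above can be written as a simple heuristic. The functional form and the base-rate constant below are hypothetical illustrations, not the paper's actual tuning rule:

```python
def pick_growth_rate(seed_size_m, base_rate=0.6):
    """Heuristic: the larger the seed (in millions of parameters),
    the smaller the growth rate. The inverse-scaling form and the
    base_rate constant are hypothetical, for illustration only."""
    return base_rate / (1.0 + seed_size_m)

# The smallest seed (0.01M) gets a more aggressive growth rate
# than the largest seed (0.53M) from Table V.
aggressive = pick_growth_rate(0.01)
conservative = pick_growth_rate(0.53)
```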
In particular, we scan the 121 combinations of the scaling factor and the noise amplitude in the range [0.0, 1.0] with a step of 0.1 and provide the following discussion. For VGG-16 on CIFAR-10, the corner cases include: noise = 1 with scaling = 0, which reduces to random initialization; noise = 1 with scaling = 1; noise = 0 with scaling = 1, which mimics the neighbor without scaling; noise = 0 with scaling = 0, where training is invalid since the newborn units are all zeros; noise = 0 with a nonzero scaling factor, which mimics the neighbor with scaling; and noise = 0.5 with scaling = 0, another case of random initialization. The best accuracy is achieved at noise = 0.1 and scaling = 0.5, and combinations within a moderate zone of the two parameters consistently provide high accuracy. To summarize: the scaling factor impacts accuracy more than the noise, since the noise is relatively small; the scaling factor should not be too large, and 0.5 is a safe choice for future tasks and networks; adding a small noise improves accuracy, as it prevents the model from getting stuck in a local minimum; and inheriting weights from the neighbor is more efficient than random initialization, since the network can resume learning right after the growth.
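The initialization scheme scanned above can be sketched as follows. The formula w_new = scale * w_peer + noise and the uniform noise distribution are our reading of the description (mimicking an important neighbor with a scaling factor plus noise); the exact scheme in CGaP may differ:

```python
import numpy as np

def init_newborn(peer_filter, scale, noise_amp, rng):
    """Initialize a newborn filter from its most important peer:
    w_new = scale * w_peer + noise, with noise ~ U(-noise_amp, +noise_amp).
    scale=1, noise_amp=0 copies the peer exactly (mimicking without
    scaling); scale=0 reduces to pure random initialization; scale=0
    with noise_amp=0 yields an all-zero filter, so training is invalid.
    The exact distribution and formula are assumptions for this sketch."""
    noise = rng.uniform(-noise_amp, noise_amp, size=peer_filter.shape)
    return scale * peer_filter + noise

rng = np.random.default_rng(0)
peer = rng.standard_normal((3, 3, 3))                 # an important existing filter
w_new = init_newborn(peer, scale=0.5, noise_amp=0.1, rng=rng)
```

With scale = 0.5 and noise = 0.1 (the best-performing combination reported above), the newborn filter starts close to its peer's feature, which is why learning resumes immediately after a growth step.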
VIII. Conclusion and Future Work
Modern DNNs typically start training from a fixed, over-parameterized network, which introduces redundancy and lacks structural plasticity. We propose a novel dynamic training algorithm, Continuous Growth and Pruning, that initializes training from a small network, continuously expands the network width by adding important learning units and structures, and finally prunes secondary ones. The effectiveness of CGaP depends on where to start and stop the growth, which learning units (filters and neurons) should be added, and how to initialize the newborn units to ensure model convergence. Our experiments on benchmark datasets and architectures demonstrate the advantage of CGaP in learning efficiency (accurate and compact models). We further validate the energy and latency efficiency of the inference model generated by CGaP on an FPGA performance simulator. Our approach and analysis will help shed light on the development of adaptive neural networks for dynamic tasks such as continual and lifelong learning.
This work was supported in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. It was also partially supported by National Science Foundation (NSF) under CCF #1715443.
Xiaocong Du (S’19) received her B.S. degree in control engineering from Shandong University, Jinan, China, in 2014, and the M.S. degree in electrical and computer engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 2016. She is currently pursuing the Ph.D. degree in electrical engineering at Arizona State University, Tempe, AZ, USA. Her research interests include efficient algorithm and hardware co-design for deep learning, neural architecture search, continual learning, and neuromorphic computing.
Zheng Li (S’19) obtained his B.S. degree in electronics and information engineering from Beihang University, Beijing, China, in 2014, and the M.S. degree in electrical and computer engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 2017. He is currently working towards the Ph.D. degree in computer engineering at Arizona State University, Tempe, AZ, USA. He worked as a summer intern in Machine Learning at MobaiTech, Inc., Tempe, AZ, USA in 2018. His current research interests include algorithm design and optimization for computer vision tasks, such as object detection and autonomous driving.
Yufei Ma (S’16-M’19) received the B.S. degree in information engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011, the M.S.E. degree in electrical engineering from the University of Pennsylvania, Philadelphia, PA, USA, in 2013, and the Ph.D. degree from Arizona State University, Tempe, AZ, USA, in 2018. His current research interests include high-performance hardware acceleration of deep learning algorithms on application-specific integrated circuits and field-programmable gate arrays.
Yu Cao (S’99-M’02-SM’09-F’17) received the B.S. degree in physics from Peking University in 1996. He received the M.A. degree in biophysics and the Ph.D. degree in electrical engineering from the University of California, Berkeley, in 1999 and 2002, respectively. He worked as a summer intern at Hewlett-Packard Labs, Palo Alto, CA in 2000, and at IBM Microelectronics Division, East Fishkill, NY, in 2001. After working as a post-doctoral researcher at the Berkeley Wireless Research Center (BWRC), he is now a Professor of Electrical Engineering at Arizona State University, Tempe, Arizona. He has published numerous articles and two books on nano-CMOS modeling and physical design. His research interests include physical modeling of nanoscale technologies, design solutions for variability and reliability, reliable integration of post-silicon technologies, and hardware design for on-chip learning. Dr. Cao was a recipient of the 2012 Best Paper Award at the IEEE Computer Society Annual Symposium on VLSI, the 2010, 2012, 2013, 2015 and 2016 Top 5% Teaching Award, Schools of Engineering, Arizona State University, the 2009 ACM SIGDA Outstanding New Faculty Award, 2009 Promotion and Tenure Faculty Exemplar, Arizona State University, 2009 Distinguished Lecturer of the IEEE Circuits and Systems Society, the 2008 Chunhui Award for outstanding overseas Chinese scholars, the 2007 Best Paper Award at the International Symposium on Low Power Electronics and Design, the 2006 NSF CAREER Award, the 2006 and 2007 IBM Faculty Award, the 2004 Best Paper Award at the International Symposium on Quality Electronic Design, and the 2000 Beatrice Winner Award at the International Solid-State Circuits Conference. He has served as Associate Editor of the IEEE Transactions on CAD, and on the technical program committees of many conferences.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
- A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, IEEE, 2013.
- P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh, “Yin and yang: Balancing and answering binary visual questions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5014–5022, 2016.
- Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Performance modeling for cnn inference accelerators on fpga,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., “Dadiannao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622, IEEE Computer Society, 2014.
- Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: Shifting vision processing closer to the sensor,” in ACM SIGARCH Computer Architecture News, vol. 43, pp. 92–104, ACM, 2015.
- Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
- K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, “Angel-eye: A complete design flow for mapping cnn onto embedded fpga,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2017.
- A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016.
- J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., “Going deeper with embedded fpga platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35, ACM, 2016.
- C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An fpga-based processor for convolutional networks,” in 2009 International Conference on Field Programmable Logic and Applications, pp. 32–37, IEEE, 2009.
- S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
- H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
- H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A data-driven neuron pruning approach towards efficient deep architectures,” arXiv preprint arXiv:1607.03250, 2016.
- Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744, 2017.
- J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE international conference on computer vision, pp. 5058–5066, 2017.
- V. Lebedev and V. Lempitsky, “Fast convnets using group-wise brain damage,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2554–2564, IEEE, 2016.
- W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
- J. H. Gilmore, W. Lin, M. W. Prastawa, C. B. Looney, Y. S. K. Vetsa, R. C. Knickmeyer, D. D. Evans, J. K. Smith, R. M. Hamer, J. A. Lieberman, et al., “Regional gray matter growth, sexual dimorphism, and cerebral asymmetry in the neonatal brain,” Journal of Neuroscience, vol. 27, no. 6, pp. 1255–1260, 2007.
- S. J. Lipina and J. A. Colombo, Poverty and brain development during childhood: An approach from cognitive psychology and neuroscience. American Psychological Association, 2009.
- M. Butz and A. van Ooyen, “A simple rule for dendritic spine and axonal bouton formation can account for cortical reorganization after focal retinal lesions,” PLoS computational biology, vol. 9, no. 10, p. e1003259, 2013.
- A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., vol. 1, no. 4, p. 7, Citeseer, 2009.
- Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” arXiv preprint arXiv:1810.05270, 2018.
- Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” arXiv preprint arXiv:1808.06866, 2018.
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254, IEEE, 2016.
- B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.
- T. Ash, “Dynamic node creation in backpropagation networks,” Connection science, vol. 1, no. 4, pp. 365–375, 1989.
- B. J. Briedis and T. D. Gedeon, “Using the grow-and-prune network to solve problems of large dimensionality,” in Proceedings of the 1998 Australian Conference on Neural Networks, Brisbane, 1998.
- A. Sakar and R. J. Mammone, “Growing and pruning neural tree networks,” IEEE Transactions on Computers, vol. 42, no. 3, pp. 291–299, 1993.
- G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A generalized growing and pruning rbf (ggap-rbf) neural network for function approximation,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 57–67, 2005.
- S. Hussain and A. Basu, “Multiclass classification by adaptive network of dendritic neurons with binary synapses using structural plasticity,” Frontiers in neuroscience, vol. 10, p. 113, 2016.
- X. Dai, H. Yin, and N. K. Jha, “Nest: A neural network synthesis tool based on a grow-and-prune paradigm,” arXiv preprint arXiv:1711.02017, 2017.
- Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
- I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
- E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in neural information processing systems, pp. 1269–1277, 2014.
- C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” in AAAI Conference on Artificial Intelligence, 2018.
- P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” arXiv preprint arXiv:1611.06440, 2016.
- Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018.
- V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
- A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” NIPS Workshop Autodiff, 2017.