# Towards Evolutional Compression

## Abstract

Compressing convolutional neural networks (CNNs) is essential for transferring the success of CNNs to a wide variety of applications on mobile devices. In contrast to directly recognizing subtle weights or filters as redundant in a given CNN, this paper presents an evolutionary method to automatically eliminate redundant convolution filters. We represent each compressed network as a binary individual of specific fitness. Then, the population is upgraded at each evolutionary iteration using genetic operations. As a result, an extremely compact CNN is generated using the fittest individual. In this approach, either large or small convolution filters can be redundant, and filters in the compressed network are more distinct. In addition, since the number of filters in each convolutional layer is reduced, the number of filter channels and the size of feature maps are also decreased, naturally improving both the compression and speed-up ratios. Experiments on benchmark deep CNN models suggest the superiority of the proposed algorithm over state-of-the-art compression methods.

## 1 Introduction

Large-scale deep convolutional neural networks (CNNs) have been successfully applied to a wide variety of applications such as image classification [?], object detection [?], and visual enhancement [?]. To strengthen representation capability and improve CNN performance, several convolutional layers have traditionally been used in network construction. Given the complex network architecture and numerous variables, most CNNs place excessive demands on storage and computational resources, thus limiting them to high-performance servers.

We are now in an era of intelligent mobile devices. Deep learning, one of the most promising artificial intelligence techniques, is expected to reduce reliance on servers and to bring advanced algorithms to smartphones, tablets, and wearable computers. Nevertheless, it remains challenging for mobile devices, which lack GPUs and large memory, to run the CNNs that are usually deployed on servers. For instance, more than 232MB of memory and a massive number of floating-point multiplications would be consumed by AlexNet [?] or VGGNet [?] to process a single, normal-sized input image. Hence, special developments are required to bring CNNs to smartphones and other portable devices.

To overcome this conflict between reduced hardware configurations and the higher resource demands of CNNs, several attempts have been made to compress and speed up well-trained CNN models. Liu et al. [?] developed an approach to learn CNNs with sparse architectures, thereby reducing model complexity compared to ordinary CNNs, while Han et al. [?] directly discarded subtle weights in pre-trained CNNs to obtain sparse CNNs. Figurnov et al. [?] reduced the computational cost of CNNs by masking the input data of convolutional layers. Wen et al. [?] explored subtle connections from the perspective of channels and filters. In the DCT frequency domain, Wang et al. [?] excavated redundancy across all weights and their underlying connections to deliver higher compression and speed-up ratios. In addition, a number of techniques exist to compress convolution filters in CNNs, including weight pruning [?], quantization and binarization [?], and matrix decomposition [?].

While these methods have reduced the storage and computational burdens of CNNs, research on deep model compression is still in its infancy. Existing solutions are typically grounded in different, albeit intuitive, assumptions about network redundancy, e.g. that weights or filters with small absolute values are redundant, that low-rank filters are redundant, or that redundancy exists within individual weights. Although these redundancy assumptions are valid, we hypothesize that all possible types of redundancy have yet to be identified and validated. We postulate that an approach can be developed to autonomously investigate diverse and volatile network redundancies and to constantly upgrade the solution to accommodate environmental changes.

In this paper, we develop an evolutionary strategy to excavate and eliminate redundancy in deep neural networks. The network compression task can be formulated as a binary programming problem, where a binary variable is attached to each convolution filter to indicate whether or not the filter takes effect. Inspired by studies in evolutionary computation [?], we treat each binary encoding w.r.t. a compressed network as an individual and stack these individuals to constitute the population. A series of evolutionary operators (e.g. crossover and mutation) allows the population to constantly evolve toward the most competitive, compact, and accurate network architecture. When evaluating an individual, we fine-tune the compressed network on a relatively small subset of the original training dataset, which quickly reveals its potential performance; the overall running time of compressing a CNN is therefore acceptable. Experiments conducted on several benchmark CNNs demonstrate that the compressed networks are more lightweight yet achieve accuracies comparable to their original counterparts. Beyond conventional network redundancies, our results suggest that convolution filters with either large or small weights can be redundant, that greater discrepancy between the retained filters is beneficial, and that high-frequency coefficients of convolution filters are unnecessary (i.e. smooth filters are adequate).

## 2 An Evolutionary Method for Compressing CNNs

Most existing CNN compression methods are based on the consensus that weights or filters with subtle values have limited influence on the performance of the original network. In this section, we introduce an evolutionary algorithm to significantly excavate redundancy in CNNs and devise a novel compression method.

### 2.1 Modeling Redundancy in Convolution Filters

Considering a convolutional neural network $\mathcal{N}$ with $L$ convolutional layers, we define $L$ sets of convolution filters $\{\mathcal{F}_1, \dots, \mathcal{F}_L\}$ for these layers. For the $l$-th convolutional layer, its filter is denoted as $\mathcal{F}_l \in \mathbb{R}^{H_l \times W_l \times C_l \times N_l}$, where $H_l$ and $W_l$ are the height and width of filters, respectively, $C_l$ is the channel size, and $N_l$ is the number of filters in this layer. Given a training sample $x$ and the corresponding ground truth $y$, the error of the network can be defined as $\mathcal{L}\big(y, \mathcal{N}(\mathcal{F}, x)\big)$, which could be, for example, softmax or Euclidean losses. The conventional weight pruning algorithm can be formulated as

$$\min_{\mathcal{M}} \; \mathcal{L}\big(y, \mathcal{N}(\mathcal{M} \odot \mathcal{F}, x)\big) + \lambda \, \|\mathcal{M}\|_1, \quad \text{s.t.} \; \mathcal{M} \in \{0, 1\}^{|\mathcal{F}|}, \tag{1}$$

where $\mathcal{M}$ is a binary tensor for removing redundant weights in $\mathcal{F}$, $\|\cdot\|_1$ is the $\ell_1$-norm accumulating absolute values, i.e. the number of 1s in $\mathcal{M}$, $\odot$ is the element-wise product, and $\lambda$ is the tradeoff parameter. A larger $\lambda$ will make $\mathcal{M}$ more sparse, and so a network parameterized with $\mathcal{M} \odot \mathcal{F}$ will have fewer weights.

In general, Fcn. 1 is easy to solve if $\mathcal{N}$ is a linear mapping of $\mathcal{M} \odot \mathcal{F}$. However, neural networks are composed of a series of complex operations, such as pooling and ReLU, which increase the complexity of Fcn. 1. Therefore, a greedy strategy [?] has been introduced to obtain a feasible solution that removes weights with small absolute values:

$$\mathcal{M}_i = \begin{cases} 0, & \text{if } |\mathcal{F}_i| < \epsilon, \\ 1, & \text{otherwise}, \end{cases} \tag{2}$$

where $\epsilon$ is a threshold. This strategy is based on the intuitive assumption that small weights make subtle contributions to the calculation of the convolution response.
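A minimal NumPy illustration of this magnitude-thresholding rule; the filter values and the threshold are arbitrary, and the binary mask plays the role described above:

```python
import numpy as np

def prune_weights(filters, epsilon):
    """Magnitude pruning: zero out weights whose absolute value
    falls below the threshold epsilon, via a binary mask."""
    mask = (np.abs(filters) >= epsilon).astype(filters.dtype)
    return mask, filters * mask

# Toy 3x3 filter mixing small and large weights.
f = np.array([[0.01, -0.5, 0.02],
              [0.30, -0.04, 0.70],
              [0.05, 0.90, -0.03]])
mask, pruned = prune_weights(f, epsilon=0.1)
# Only the four weights with |w| >= 0.1 survive.
```

Note that the pruned filter still has its original shape; sparsity only changes which entries are zero, which is exactly the limitation discussed next.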

Although sparse filters learned by Fcn. 1 demand less storage and computational resources, the size of the feature maps produced by these filters does not change. For example, a convolutional layer with 10 filters will generate 10 feature maps for one input both before and after compression, which accounts for a large proportion of online memory usage. Moreover, these sparse filters usually need additional techniques to realize their compression and speed-up, such as the cuSPARSE kernel, the compressed sparse row (CSR) format, or the fixed-point multiplier [?]. Therefore, more flexible approaches [?] have been developed to directly discard redundant filters:

$$\mathcal{M}_l^n = \begin{cases} 0, & \text{if } \|\mathcal{F}_l^n\|_F < \epsilon, \\ 1, & \text{otherwise}, \end{cases} \tag{3}$$

where $\mathcal{F}_l^n$ denotes the $n$-th filter in the $l$-th convolutional layer and $\|\cdot\|_F$ is the Frobenius norm. By directly removing convolution filters, network complexity can be significantly decreased. However, Fcn. 3 is also biased, since the Frobenius norm of a filter is not a reasonable redundancy indicator. For example, most of the weights in a filter for extracting edge information are very small. Thus, a more accurate approach for identifying redundancy in CNNs is urgently required.
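A corresponding sketch of filter-level pruning by Frobenius norm; the tensor layout `(N, C, H, W)`, the values, and the threshold are illustrative assumptions:

```python
import numpy as np

def prune_filters(filters, epsilon):
    """Discard whole filters whose Frobenius norm falls below epsilon.
    filters has shape (N, C, H, W); returns the retained subset and mask."""
    norms = np.sqrt((filters ** 2).sum(axis=(1, 2, 3)))
    keep = norms >= epsilon
    return filters[keep], keep

# A layer of 10 filters (3 channels, 3x3 kernels); scale half of them
# down so their norms fall under the threshold.
rng = np.random.default_rng(0)
filters = rng.normal(size=(10, 3, 3, 3))
filters[5:] *= 1e-3
kept, keep_mask = prune_filters(filters, epsilon=0.5)
# The layer now produces kept.shape[0] feature maps instead of 10.
```

Unlike weight-level sparsity, this reduces the number of output feature maps, and hence the channel count of the next layer's filters as well.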

### 2.2 Modeling Redundancy by Exploiting Evolutionary Algorithms

Instead of the greedy strategies shown in Fcns. 2 and 3, evolutionary algorithms such as the genetic algorithm (GA [?]) and simulated annealing (SA [?]) have been widely applied to the NP-hard binary programming problem. A series of bit (0 or 1) strings (individuals) are used to represent possible solutions of the binary programming problem, and these individuals evolve using pre-designed operations to maximize their fitness.
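The bit-string evolution described above can be sketched as a generic GA; the population size, rates, selection scheme, and the toy "one-max" objective are illustrative assumptions, not the paper's compression setup:

```python
import random

def evolve(fitness, dim, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Minimal genetic algorithm over fixed-length bit strings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        pop = scored[:2]  # elitism: keep the two fittest individuals
        while len(pop) < pop_size:
            a, b = rng.sample(scored[:pop_size // 2], 2)  # truncation selection
            if rng.random() < crossover_rate:             # one-point crossover
                cut = rng.randrange(1, dim)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # bit-flip mutation
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            pop.append(child)
    return max(pop, key=fitness)

# Toy objective: maximize the number of 1-bits ("one-max").
best = evolve(fitness=sum, dim=16)
```

In the compression setting, each bit would instead govern one convolution filter and the fitness would reward both accuracy and sparsity.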

A binary variable can be attached to each weight in the CNN to indicate whether the weight takes effect or not, but such a large number of binary variables would significantly slow down the CNN compression process, especially for sophisticated CNNs learned over large-scale datasets (e.g. ImageNet [?]). For instance, AlexNet [?] has eight convolutional layers with tens of millions of 32-bit floating-point weights in total, so it is infeasible to generate a population with hundreds of such high-dimensional individuals. In addition, as mentioned above, excavating redundancy at the level of convolution filters itself produces a regular CNN model with less computational complexity and memory usage, which is more suitable for practical applications. Therefore, we propose to assign a binary bit to each convolution filter in a CNN, and these binary bits form an individual $p$ representing the compressed network. By doing so, the dimensionality of $p$ is tolerable, i.e. the total number of convolution filters (excluding the last 1000 convolution filters corresponding to the 1000 classes in the ILSVRC 2012 dataset) for AlexNet.
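As a sketch of this per-filter encoding, the layer-wise filter counts below follow the widely known AlexNet configuration (with its fully-connected layers treated as convolutions and the final 1000-way layer excluded) and are used here only for illustration:

```python
# Filters per layer in an AlexNet-like network; the last 1000-filter
# classification layer is excluded, as described above.
filters_per_layer = [96, 256, 384, 384, 256, 4096, 4096]

# One binary bit per convolution filter: 1 = keep, 0 = discard.
dim = sum(filters_per_layer)
individual = [1] * dim  # the identity individual retains every filter

def layer_slices(counts):
    """Map each layer to its slice of the flat bit string."""
    slices, start = [], 0
    for n in counts:
        slices.append(slice(start, start + n))
        start += n
    return slices
```

Each individual is thus a few thousand bits rather than tens of millions, which keeps the population manageable.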

During evolution, we use GA to constantly update individuals toward greater fitness; other evolutionary algorithms can be applied in a similar fashion. The compression task has two objectives: preserving the performance and removing the redundancy of the original network. The fitness of a specific individual $p$ can therefore be defined as

$$f(p) = C - \mathcal{Z}(p) - \lambda \cdot \frac{\|p\|_1}{K}, \tag{4}$$

where $p_i$ denotes the binary bit for the $i$-th convolution filter in the given network, and $K$ is the number of all convolution filters in the network. $\mathcal{Z}(p)$ calculates the classification loss of the network using the compressed filters, which is assumed to be a general loss taking values from 0 to 1. In addition, we include a constant $C$ in Fcn. 4, which ensures $f(p) > 0$ for the convenience of calculating the selection probability of each individual in the evolutionary algorithm. $\lambda$ is the tradeoff parameter, and

$$\|p\|_1 = \sum_{l=1}^{L} \sum_{i=1}^{N_l} p_{l,i}, \tag{5}$$

where $p_{l,i} = 0$ implies that the $i$-th filter in the $l$-th layer has been discarded; otherwise it is retained.
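One consistent reading of this fitness, in which a low classification loss and few retained filters both increase the score, can be sketched as follows; the exact functional form, the values of $C$ and $\lambda$, and the stub loss are illustrative assumptions:

```python
def fitness(p, loss_fn, lam=0.5, C=1.5):
    """Fitness of a binary individual p: reward a low classification
    loss and a small number of retained filters. C keeps the fitness
    positive when the loss and the kept-filter ratio lie in [0, 1]."""
    K = len(p)
    z = loss_fn(p)  # classification loss of the compressed network, in [0, 1]
    return C - z - lam * sum(p) / K

# Stub loss (assumption): accuracy degrades mildly as filters are removed.
def toy_loss(p):
    return 0.1 + 0.2 * (1 - sum(p) / len(p))

dense = [1] * 10            # keep every filter
sparse = [1] * 5 + [0] * 5  # discard half of the filters
# Under this stub, the sparser individual scores higher.
```

Since the fitness is kept positive, it can be used directly as an (unnormalized) selection probability in the GA.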
