Low-memory convolutional neural networks through incremental depth-first processing
We introduce an incremental processing scheme for convolutional neural network (CNN) inference, targeted at embedded applications with limited memory budgets. Instead of processing layers one by one, individual input pixels are propagated through all parts of the network they can influence under the given structural constraints. This depth-first updating scheme comes with hard bounds on the memory footprint: the memory required is constant in the case of 1D input and proportional to the square root of the input dimension in the case of 2D input.
Convolutional neural networks (CNNs) deliver state-of-the-art results in most computer vision tasks, such as image segmentation, classification, and object recognition [11, 13]. Because they apply to such a wide range of problems, CNN models are being integrated into embedded systems, such as smartphones, internet of things (IoT) endpoints, robots, and hearing aids. However, the extensive memory requirements of such models pose serious challenges to the system designer, and have led researchers to explore various approaches for increasing the efficiency of CNN model implementations. Promising ways of reducing memory requirements and/or lowering the number of operations required to run the network include model compression techniques [17, 16, 1, 5, 4, 7], pruning [15, 12, 19, 2], parameter and variable quantization [10, 18, 9, 14, 21, 6], and architectural optimizations [8, 20].
In this work, we achieve a substantial reduction in memory occupation not by altering the model itself, but by changing the way updates are computed. Our method is therefore applicable to any CNN model, and the computed output is identical to that of the original model. In addition to its low-memory properties, our technique naturally leads to a sparse update scheme, in which no computation is carried out for values that are zero. This sparsity, which arises from the event-based processing style employed here, can be exploited, for instance, in conjunction with low-precision parameters and activations, which typically encourage sparse activation vectors.
2 Asynchronous incremental updates
Rather than computing complete feature maps layer by layer, we propose to process inputs in a depth-first fashion. The advantages of our approach arise from processing small segments, or even single components of the input individually, while keeping in memory only a fraction of the whole network state: only those components of the network state which are non-zero and can be influenced by the current or future input segments (or components) need to be stored.
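As an illustration of our method, consider a multilayer neural network (we will add convolutions later), where the activation $a_i^k$ of unit $i$ of layer $k$ is given by
$$a_i^k = f\Big(\sum_j w_{ij}^k \, a_j^{k-1} + b_i^k\Big), \tag{1}$$
where $f$ is a non-linearity, $w_{ij}^k$ are the weight parameters, $b_i^k$ is a bias term, and $a^0$ is the input layer. Rather than computing whole layers one by one for a given input sample, we can process individual components of the input separately. To do so, we introduce a stateful equivalent of the neuron eq. 1, given by
$$\tilde{a}_i^k(t) = f\big(s_i^k(t) + b_i^k\big), \tag{2}$$
where $s_i^k$ is a state variable, which is retained throughout the presentation of an input sample. Furthermore, the neuron emits changes of its activation in terms of “events”,
$$\Delta\tilde{a}_i^k(t) = \tilde{a}_i^k(t) - \tilde{a}_i^k(t-1), \tag{3}$$
at times $t$ where such events would be non-zero. Without loss of generality, we assume the initial state of the network to be zero, i.e., all state variables and previously emitted activations vanish before the first input component is presented. An event $\Delta\tilde{a}_j^{k-1}$ is transmitted to a connected neuron $i$ of layer $k$ via a weight $w_{ij}^k$ and added to its state variable,
$$s_i^k \leftarrow s_i^k + w_{ij}^k \, \Delta\tilde{a}_j^{k-1}. \tag{4}$$
Note that a potential change of the target neuron, $\Delta\tilde{a}_i^k$, is now transmitted in the same way to connected neurons of higher layers. It is now straightforward to show that the states $a_i^k$ and $\tilde{a}_i^k$ are equal once both networks have received the same input at their input layers (see [3] for details): each state variable accumulates the weighted sum of all changes emitted by its inputs, which equals the weighted sum of their final activations.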
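The equivalence between the layer-by-layer network and its event-based counterpart can be checked numerically. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: it uses a small fully connected network with ReLU non-linearities, and the names `EventNetwork`, `present`, and `forward_layerwise` are hypothetical. Each unit keeps a state variable and the activation it last emitted, and forwards only non-zero changes to the next layer. Units whose bias alone yields a non-zero activation emit one initial event, which is one possible way of starting from a zero initial state.

```python
import random


def relu(z):
    return max(0.0, z)


def forward_layerwise(x, weights, biases, f=relu):
    """Reference: evaluate the network layer by layer, as in eq. 1."""
    a = list(x)
    for W, b in zip(weights, biases):
        a = [f(sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i)
             for row, b_i in zip(W, b)]
    return a


class EventNetwork:
    """Stateful, event-driven version of the same network (illustrative)."""

    def __init__(self, weights, biases, f=relu):
        self.W, self.b, self.f = weights, biases, f
        # State variables, all zero initially.
        self.s = [[0.0] * len(b) for b in biases]
        # Activation each unit has emitted so far (zero = nothing emitted).
        self.sent = [[0.0] * len(b) for b in biases]
        # Assumption: units with f(bias) != 0 emit one initial event.
        for k in range(len(biases)):
            for i in range(len(biases[k])):
                self._refresh(k, i)

    def _deliver(self, k, j, delta):
        """Add the event `delta` from input j to every unit of layer k."""
        for i, row in enumerate(self.W[k]):
            self.s[k][i] += row[j] * delta
            self._refresh(k, i)

    def _refresh(self, k, i):
        """Emit the change of unit i in layer k, if it is non-zero."""
        new = self.f(self.s[k][i] + self.b[k][i])
        delta = new - self.sent[k][i]
        if delta == 0.0:
            return  # no event, hence no downstream computation
        self.sent[k][i] = new
        if k + 1 < len(self.W):
            self._deliver(k + 1, i, delta)  # depth-first propagation

    def present(self, j, value):
        """Present a single input component; zero components are skipped."""
        if value != 0.0:
            self._deliver(0, j, value)

    def output(self):
        return list(self.sent[-1])


if __name__ == "__main__":
    random.seed(0)
    sizes = [4, 3, 2]  # input, hidden, output
    Ws = [[[random.uniform(-1, 1) for _ in range(sizes[k])]
           for _ in range(sizes[k + 1])] for k in range(len(sizes) - 1)]
    bs = [[random.uniform(-1, 1) for _ in range(sizes[k + 1])]
          for k in range(len(sizes) - 1)]
    x = [0.5, 0.0, -1.2, 0.3]
    net = EventNetwork(Ws, bs)
    for j, value in enumerate(x):
        net.present(j, value)
    print(forward_layerwise(x, Ws, bs))
    print(net.output())
```

In this sketch the input components may be presented in any order, and the two outputs should agree up to floating-point rounding, because each state variable only accumulates weighted changes.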
The same event-based update scheme can be applied to CNNs, where each neuron only connects to a small number of neurons of the next higher layer, as specified by the convolution kernel. Thus, each neuron can only influence a few neurons of the next higher layer, typically leading to a cone-shaped dependency structure. Consequently, a single component of the input only influences a cone-shaped subspace of the whole state space, and it is sufficient to keep only the respective cone in memory while the input component is processed. In the one-dimensional case, where the input is a vector, consecutive components can be presented one by one, and their contributions to the state cone and to the output can be computed individually. The memory requirements are therefore constant, regardless of the size of the input vector. This is illustrated in fig. 1. If the input is two-dimensional, in contrast, a whole row of cones needs to be stored, as their states are still relevant during the presentation of the next row. In this case, the memory requirements scale with the square root of the input dimension: for a square input of $n \times n$ pixels, they grow with $n$ rather than with $n^2$.
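The constant-memory claim for one-dimensional input can be illustrated with a short Python sketch. It is an illustration under simplifying assumptions rather than the authors' implementation: a single channel, stride 1, no padding, ReLU non-linearities, and the hypothetical names `StreamingConv1D` and `run_depth_first`. Each layer stores only the partial sums of outputs that can still be influenced by future inputs (at most one per kernel tap), and every completed output is pushed to the next layer immediately.

```python
from collections import deque
import random


def relu(z):
    return max(0.0, z)


def conv1d_layerwise(x, w, b, f=relu):
    """Reference: full 'valid' 1D convolution (cross-correlation) of one layer."""
    K = len(w)
    return [f(b + sum(w[k] * x[n + k] for k in range(K)))
            for n in range(len(x) - K + 1)]


class StreamingConv1D:
    """One layer that consumes a single input sample at a time (illustrative)."""

    def __init__(self, w, b, f=relu):
        self.w, self.b, self.f = list(w), b, f
        # Partial sums of outputs that future inputs can still change.
        self.pending = deque()

    def push(self, x):
        """Consume one sample; return a finished output or None."""
        self.pending.append(0.0)  # open the output whose window starts here
        if x != 0.0:              # zero inputs trigger no arithmetic
            last = len(self.pending) - 1
            for p in range(len(self.pending)):
                self.pending[p] += self.w[last - p] * x
        if len(self.pending) == len(self.w):
            # The oldest output has now received all of its kernel taps.
            return self.f(self.pending.popleft() + self.b)
        return None


def run_depth_first(x, layers):
    """Propagate each sample as far up the stack as it can go."""
    outputs, peak = [], 0
    for sample in x:
        value = sample
        for layer in layers:
            value = layer.push(value)
            if value is None:
                break  # higher layers have nothing new to process yet
        else:
            outputs.append(value)
        # Number of stored partial sums after processing this sample.
        peak = max(peak, sum(len(layer.pending) for layer in layers))
    return outputs, peak


if __name__ == "__main__":
    random.seed(0)
    kernels = [[random.uniform(-1, 1) for _ in range(K)] for K in (3, 5, 2)]
    biases = [random.uniform(-1, 1) for _ in kernels]
    for n in (32, 4096):
        x = [random.uniform(-1, 1) for _ in range(n)]
        layers = [StreamingConv1D(w, b) for w, b in zip(kernels, biases)]
        out, peak = run_depth_first(x, layers)
        print(n, len(out), peak)
```

The number of stored partial sums is bounded by the sum of the kernel sizes and does not depend on the length of the input. For two-dimensional input, the same bookkeeping would have to be kept for every column of a row, which is where the dependence on the row width comes from.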
References
[1] Jose M Alvarez and Mathieu Salzmann. Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pages 856–867, 2017.
[2] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
[3] Jonathan Binas, Giacomo Indiveri, and Michael Pfeiffer. Deep counter networks for asynchronous event-based processing. arXiv preprint arXiv:1611.00710, 2016.
[4] Minsik Cho and Daniel Brand. MEC: Memory-efficient convolution for deep neural network. arXiv preprint arXiv:1706.06873, 2017.
[5] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Universal deep neural network compression. arXiv preprint arXiv:1802.02271, 2018.
[6] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[8] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[9] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.
[10] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2554–2564. IEEE, 2016.
[13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[14] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
[15] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[16] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3290–3300, 2017.
[17] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
[18] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
[19] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint, 2017.
[20] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
[21] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.