Low-memory convolutional neural networks
through incremental depth-first processing
Abstract
We introduce an incremental processing scheme for convolutional neural network (CNN) inference, targeted at embedded applications with limited memory budgets. Instead of processing layers one by one, individual input pixels are propagated through all parts of the network they can influence under the given structural constraints. This depth-first updating scheme comes with hard bounds on the memory footprint: the memory required is constant in the case of 1D input and proportional to the square root of the input dimension in the case of 2D input.
1 Introduction
Convolutional neural networks (CNNs) deliver state-of-the-art results in most computer vision tasks, such as image segmentation, classification, and object recognition (Krizhevsky et al., 2012; LeCun et al., 2015). Because of their broad applicability, CNN models are being integrated into embedded systems, such as smartphones, internet of things (IoT) endpoints, robots, and hearing aids. However, the extensive memory requirements of such models pose serious challenges to the system designer, and have led researchers to explore various approaches for increasing the efficiency of CNN model implementations. Promising ways of reducing memory requirements and/or lowering the number of operations required to run the network include model compression techniques (Ullrich et al., 2017; Louizos et al., 2017; Alvarez and Salzmann, 2017; Choi et al., 2018; Cho and Brand, 2017; Han et al., 2015), pruning (Li et al., 2016b; Lebedev and Lempitsky, 2016; Yang et al., 2017; Anwar et al., 2017), parameter and variable quantization (Hubara et al., 2016b; Wu et al., 2016; Hubara et al., 2016a; Li et al., 2016a; Zhu et al., 2016; Courbariaux et al., 2015), and architectural optimizations (Howard et al., 2017; Zhang et al., 2017).
In this work, we achieve a substantial reduction in memory occupation without altering the model itself, changing only the way updates are computed. Thus, our method is applicable to any CNN model, and the computed output is identical to the original output. In addition to its low memory properties, our technique naturally leads to a sparse update scheme, in which no computation is carried out for values that are zero. This sparsity, which arises from the event-based processing style employed here, can be exploited, for instance, in conjunction with low-precision parameters and activations, which typically encourage a sparse activation vector.
Unlike other approaches in this domain, we do not provide a method for better parallelization of CNN computation, but rather a streaming core that can process small segments of the input as they arrive. This is particularly relevant to embedded systems, where memory is scarce and taking input sparsity into account can have a significant impact on the power budget.
2 Related work
Depth-first processing schemes for CNNs have been proposed previously. Typically, such methods are based on the identification of computational subgraphs that can be evaluated independently, leading to a certain level of parallelization. Alwani et al. (2016) propose an updating scheme in which a local variable in the network is computed based on all inputs affecting this variable. This leads to multiple evaluations of a given input pattern. Rouhani et al. (2017) map a network to a set of local subnetworks, which can be computed individually. Weber et al. (2018) focus on accelerating the evaluation of element-wise and pooling layers by identifying independent paths and creating local subgraphs.
Our approach does not rely on advanced methods for finding independent structures, and does not explicitly require independent structures at all. Instead of asking which inputs affect a particular output, we ask which outputs are affected by a particular input. As a consequence, computations in our model are based on arbitrarily small input segments, allowing for streaming implementations and natural utilization of input sparsity.
3 Asynchronous incremental updates
Rather than computing complete feature maps layer by layer, we propose to process inputs in a depth-first fashion. The advantages of our approach arise from processing small segments, or even single components of the input individually, while keeping in memory only a fraction of the whole network state: only those components of the network state which are nonzero and can be influenced by the current or future input segments (or components) need to be stored.
As an illustration of our method, consider a multilayer neural network (we will add convolutions later), where the activation $a_i^k$ of a unit $i$ of layer $k$ is given by
(1) $a_i^k = f\left(\sum_j w_{ij}^k \, a_j^{k-1} + b_i^k\right),$
where $f$ is a nonlinearity, $w_{ij}^k$ are the weight parameters, $b_i^k$ is a bias term, and $a^0$ is the input layer. Rather than computing whole layers one by one for a given input sample, we can process individual components of the input separately. To do so, we introduce a stateful equivalent of the neuron eq. 1, given by
(2) $\tilde{a}_i^k(t) = f\left(s_i^k(t) + b_i^k\right),$
where $s_i^k(t)$ is a state variable, which is retained throughout the presentation of an input sample. Furthermore, the neuron emits changes of its activation in terms of “events”,
(3) $e_i^k(t) = \tilde{a}_i^k(t) - \tilde{a}_i^k(t-1),$
at times $t$ where such events would be nonzero. Without loss of generality, we assume the initial state of the network to be zero, $s_i^k(0) = 0$, $\tilde{a}_i^k(0) = 0$, and $b_i^k = 0$. An event $e_i^{k-1}(t)$ is transmitted to a connected neuron $j$ of layer $k$ via a weight $w_{ji}^k$ and added to its state variable,
(4) $s_j^k(t) = s_j^k(t-1) + w_{ji}^k \, e_i^{k-1}(t)$
(5) $\phantom{s_j^k(t)} = s_j^k(t-1) + w_{ji}^k \left(\tilde{a}_i^{k-1}(t) - \tilde{a}_i^{k-1}(t-1)\right).$
Note that a potential change of the target neuron, $e_j^k(t)$, is now transmitted in the same way to connected neurons of higher layers. It is now straightforward to show that the states $\tilde{a}_i^k$ and $a_i^k$ are equal given that both networks have received the same input at their input layers (see Binas et al. (2016) for details).
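The event-based update described above can be made concrete with a small NumPy sketch. It is not the implementation used in this work: the class name `EventNetwork`, the helper `layerwise_forward`, the choice of ReLU as the nonlinearity, the layer sizes, and the random weights are all assumptions made for illustration, and biases are taken to be zero. Each input component is injected as a single event and propagated depth-first; the final activations are then compared with a conventional layer-by-layer pass.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def layerwise_forward(weights, x):
    """Conventional evaluation: compute each layer completely, one after another."""
    a = x
    for W in weights:
        a = relu(W @ a)
    return a


class EventNetwork:
    """Stateful network (illustrative): every layer keeps a state vector s and
    the activation it has communicated so far; biases are assumed to be zero."""

    def __init__(self, weights):
        self.weights = weights
        self.s = [np.zeros(W.shape[0]) for W in weights]
        self.a = [np.zeros(W.shape[0]) for W in weights]

    def send(self, layer, source, event):
        """Deliver one event from unit `source` to `layer` and recurse depth-first."""
        if layer == len(self.weights) or event == 0.0:
            return  # past the output layer, or nothing to transmit
        # The event is weighted and added to the state of every connected unit.
        self.s[layer] += self.weights[layer][:, source] * event
        new_a = relu(self.s[layer])
        delta = new_a - self.a[layer]  # activation changes, i.e. outgoing events
        self.a[layer] = new_a
        for j in np.flatnonzero(delta):  # only nonzero events are propagated
            self.send(layer + 1, j, delta[j])


rng = np.random.default_rng(0)
weights = [rng.standard_normal((6, 4)), rng.standard_normal((5, 6)),
           rng.standard_normal((3, 5))]
x = rng.standard_normal(4)

net = EventNetwork(weights)
for i, x_i in enumerate(x):  # present the input one component at a time
    net.send(0, i, x_i)

print(np.allclose(net.a[-1], layerwise_forward(weights, x)))
```

Because zero-valued events return immediately, input components or hidden units that do not change cause no further computation, which is the source of the sparsity mentioned in the introduction.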
The same event-based update scheme can be applied to CNNs, where each neuron only connects to a small number of neurons of the next higher layer, as specified by the convolution kernel. Thus, each neuron can only influence a few neurons of the next higher layer, typically leading to a cone-shaped dependency structure. Consequently, a single component of the input only influences a cone-shaped subspace of the whole state space, and it is sufficient to keep only the respective cone in memory when the input component is processed. In the one-dimensional case, where the input is a vector, consecutive components can be presented one by one, and their contributions to the state cone and the output can be computed individually. The memory requirements are therefore constant, regardless of the size of the input vector. This is illustrated in fig. 1. Conversely, if the input is two-dimensional, a whole row of cones needs to be stored, as their states are still relevant during the presentation of the next row. Thus, the memory requirements scale with the square root of the input dimension in this case (see fig. 2 for an illustration).
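For the one-dimensional case, the constant memory footprint can be illustrated with the following sketch. It makes several assumptions that are not taken from this work: single-channel layers, "valid" cross-correlation with stride 1, ReLU nonlinearities, and the hypothetical names `StreamingConv1D` and `stream_through`. Each layer stores only as many partial sums as its kernel has taps; an incoming sample is scattered into the outputs it affects, and an output is passed on to the next layer as soon as it is complete.

```python
import numpy as np


class StreamingConv1D:
    """One 1D convolution layer that holds only K partial output sums (illustrative)."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        self.K = len(self.kernel)
        self.partial = np.zeros(self.K)  # the K outputs that are still open
        self.seen = 0                    # number of samples received so far

    def push(self, x):
        """Scatter one input sample into the open outputs.

        Returns the finished (rectified) output, or None while the first
        K - 1 samples are still filling the window."""
        # The oldest open output receives the last kernel tap, the newest the first.
        self.partial += self.kernel[::-1] * x
        self.seen += 1
        done = self.partial[0]
        self.partial = np.roll(self.partial, -1)  # slide the window by one output
        self.partial[-1] = 0.0
        return max(done, 0.0) if self.seen >= self.K else None


def stream_through(layers, samples):
    """Push samples one at a time, depth-first, through a stack of layers."""
    outputs = []
    for x in samples:
        value = x
        for layer in layers:
            value = layer.push(value)
            if value is None:  # this layer has no finished output yet
                break
        else:
            outputs.append(value)
    return outputs


def reference(kernels, x):
    """Conventional layer-by-layer evaluation, used for comparison."""
    a = np.asarray(x, dtype=float)
    for w in kernels:
        a = np.maximum(np.correlate(a, w, mode="valid"), 0.0)
    return a


rng = np.random.default_rng(1)
kernels = [rng.standard_normal(3), rng.standard_normal(3)]
x = rng.standard_normal(20)

layers = [StreamingConv1D(w) for w in kernels]
out = stream_through(layers, x)
print(np.allclose(out, reference(kernels, x)))
```

The state held by the stack is the sum of the kernel lengths, independent of the length of `x`. Multi-channel layers, strides, and pooling would change the constants but not this property.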
4 Experimental verification
To verify our approach, we implemented the incremental depth-first update mechanism in contemporary deep learning software frameworks (code will be made available upon acceptance of the paper). As a proof of concept, a multilayer CNN was trained to classify the MNIST dataset. Training was performed using a conventional CNN implementation; the learned parameters were then copied to our incremental model, which was used to classify the dataset. The two-dimensional input images were fed to the system one row at a time. Fig. 3 depicts the output of the network for one example image over the course of the 28 input presentations. The first few inputs (mostly zero, as the digits are roughly centered in the image) have little impact on the output, and the output abruptly changes as more relevant pixels come into view. The output gradually converges to the exact same prediction value as would be obtained from the underlying conventional CNN. The convergence speed is shown in fig. 4. For MNIST, the network seems to require most of the image in order to make an accurate prediction. It is conceivable that in other scenarios, for instance where the input contains redundant information, better predictions might be obtained earlier during presentation. The incremental implementation requires only a fraction of the memory of the conventional CNN implementation, as only subsets of the activations need to be retained. The memory requirements thereby grow only with the square root of the input dimension.
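The square-root scaling can be made tangible with a rough count of stored values. The sketch below is a back-of-the-envelope model, not a measurement from our experiments: it assumes square inputs, feature maps that keep the spatial size of the input, and a row-wise scheme in which every layer retains as many rows of state as its kernel is high. The function names, channel count, kernel size, and layer count are hypothetical.

```python
def layerwise_values(side, channels):
    """Values held when one complete feature map (side x side x channels) is stored."""
    return side * side * channels


def rowwise_values(side, channels, kernel, layers):
    """Values held when each layer keeps only `kernel` rows of partial state."""
    return layers * kernel * side * channels


for side in (28, 56, 112):
    pixels = side * side
    print(pixels, layerwise_values(side, 8), rowwise_values(side, 8, 3, 2))
```

Under these assumptions, quadrupling the number of input pixels quadruples the layer-wise count but only doubles the row-wise count.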
5 Discussion
We propose an incremental updating scheme for convolutional neural networks, where small segments of the input are processed depth-first, and the output layer is gradually updated as more inputs are presented. With more input segments being presented, the output converges to the same value as a conventional CNN implementation.
Our method puts a hard bound on the memory requirements of the model, and is well suited for embedded applications. Once implemented, our model operates as a streaming core, taking inputs as they arrive, and incrementally updating its output.
This opens an interesting perspective on CNN processing, which in our framework can be seen as a somewhat recurrent architecture that has a hidden state (activation subset buffers) and computes an output based on the sequence of inputs (e.g. rows or columns of an image) it receives. Updating schemes similar to the one proposed here have also been used to implement biologically plausible spike-based learning (O’Connor et al., 2017).
Future work will focus on providing a low-level library for embedded systems, as well as on the design of network architectures that are particularly suitable for this kind of processing (note that our model does not pose any constraints on the architecture, but it might be more or less efficient depending on the particular network).
References

Alvarez, J. M. and M. Salzmann (2017). Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pp. 856–867.
Alwani, M., H. Chen, M. Ferdman, and P. Milder (2016). Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 22. IEEE Press.
Anwar, S., K. Hwang, and W. Sung (2017). Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32.
Binas, J., G. Indiveri, and M. Pfeiffer (2016). Deep counter networks for asynchronous event-based processing. arXiv preprint arXiv:1611.00710.
Cho, M. and D. Brand (2017). MEC: Memory-efficient convolution for deep neural network. arXiv preprint arXiv:1706.06873.
Choi, Y., M. El-Khamy, and J. Lee (2018). Universal deep neural network compression. arXiv preprint arXiv:1802.02271.
Courbariaux, M., Y. Bengio, and J.-P. David (2015). BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
Han, S., H. Mao, and W. J. Dally (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
Howard, A. G., M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Hubara, I., M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016a). Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107–4115.
Hubara, I., M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016b). Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061.
Krizhevsky, A., I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
Lebedev, V. and V. Lempitsky (2016). Fast convnets using group-wise brain damage. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 2554–2564. IEEE.
LeCun, Y., Y. Bengio, and G. Hinton (2015). Deep learning. Nature, 521(7553):436.
Li, F., B. Zhang, and B. Liu (2016a). Ternary weight networks. arXiv preprint arXiv:1605.04711.
Li, H., A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016b). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
Louizos, C., K. Ullrich, and M. Welling (2017). Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3290–3300.
O’Connor, P., E. Gavves, and M. Welling (2017). Temporally efficient deep learning with spikes. arXiv preprint arXiv:1706.04159.
Rouhani, B. D., A. Mirhoseini, and F. Koushanfar (2017). Deep3: Leveraging three levels of parallelism for efficient deep learning. In Proceedings of the 54th Annual Design Automation Conference 2017, p. 61. ACM.
Ullrich, K., E. Meeds, and M. Welling (2017). Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008.
Weber, N., F. Schmidt, M. Niepert, and F. Huici (2018). BrainSlug: Transparent acceleration of deep learning through depth-first parallelism. arXiv preprint arXiv:1804.08378.
Wu, J., C. Leng, Y. Wang, Q. Hu, and J. Cheng (2016). Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828.
Yang, T.-J., Y.-H. Chen, and V. Sze (2017). Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint.
Zhang, X., X. Zhou, M. Lin, and J. Sun (2017). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083.
Zhu, C., S. Han, H. Mao, and W. J. Dally (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064.