Searching for Winograd-aware Quantized Networks
1 Introduction
The rise in popularity of deep CNNs has spawned a research effort to find lower complexity networks to increase inference efficiency. This is desirable for inference in the cloud and becomes crucial on mobile and IoT devices with much more constrained hardware Lane and Warden (2018). Over the last few years, multiple approaches have been proposed to alleviate the compute-bound nature of convolutions Sze et al. (2017). Arguably, the use of depthwise convolutions, as popularized by the family of MobileNet architectures Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019), has become the most widely embraced design choice to make lightweight networks. These layers are used in state-of-the-art image classification networks Stamoulis et al. (2019); Tan and Le (2019). However, beyond image classification, normal convolutions are still chosen over depthwise convolutions for applications like image super-resolution Zhang et al. (2018); Lee et al. (2019), image segmentation Takikawa et al. (2019) and GANs Brock et al. (2019). Therefore, alternative ways of speeding up standard convolutions are required to run these applications on mobile CPUs, which often come with constrained compute and energy budgets Whatmough et al. (2019). Model quantization and the use of alternative convolution algorithms instead of direct convolution are two ways of accomplishing this task.
Lower-precision networks result in smaller model sizes, faster inference, lower energy consumption and smaller chip area Sze et al. (2017); Li et al. (2019). Concretely, 8-bit quantized models achieve comparable performance to full-precision models Jacob et al. (2018); Krishnamoorthi (2018) while being ready for deployment on off-the-shelf hardware as 8-bit arithmetic is widely supported. In addition to resulting in a direct model size reduction, 8-bit integer-only arithmetic benefits from up to and chip area reduction compared to full precision additions and multiplies respectively, requiring and less energy Horowitz (2014); Whatmough et al. (2018). Because of these desirable benefits, 8-bit quantization has been widely adopted in both compute-constrained devices Liberis and Lane (2019); Chowdhery et al. (2019); Wang et al. (2019) and accelerators Whatmough et al. (2019).
Orthogonal to lightweight architectural designs and quantization, fast convolution algorithms used in place of direct convolutions can provide further speedups. These come with their own trade-offs Anderson and Gregg (2018), but in this work we focus on the Winograd algorithm since it is the fastest known algorithm for convolutions of the dimensions often found in CNNs. The Winograd convolution performs the bulk of the computation as a Hadamard product between weights and input in the Winograd space requiring operations. Unlike normal convolutions, which generate a single output per convolution, a Winograd convolution computes several outputs simultaneously. This property makes Winograd convolutions minimal in the number of general multiplications.
In this paper, we focus on alleviating the problem of numerical error that arises when using Winograd convolutions in quantized neural networks. Achieving this ultimately enables us to combine the speedups of Winograd with those that reduced precision arithmetic is known to offer, among other benefits in terms of energy and area. To this end, we present an end-to-end training pipeline that exposes the numerical inaccuracies introduced by Winograd to the learning of the model parameters. We also address the source of the numerical error and propose a relaxation on the form of the transformation matrices used in the Winograd convolution algorithm. We achieve this by adding these matrices to the set of learnable parameters in a layer, after initializing them via Cook-Toom L. Toom (1963). Finally, we describe wiNAS, a Winograd-aware Neural Architecture Search framework which leverages Winograd-aware layers and latency measurements on Arm Cortex-A73 and A53 cores, to jointly optimize for high accuracy and low latency. Our framework transforms a given macro-architecture by replacing each convolution with either im2row or Winograd convolutions of different tile sizes.
The contributions of this work are summarized below:
We show that Winograd-aware networks enable Winograd convolutions in quantized networks, including 8-bit networks with little accuracy drop. To the best of our knowledge, this is the first time this has been empirically demonstrated.
We demonstrate that learning the Winograd transforms, as opposed to keeping these fixed, results in better network accuracy, with up to 10% improvement when using and tiles in 8-bit CNNs with filters. This improvement is more pronounced with larger filters.
We present wiNAS as a tool that can find Winograd-aware networks jointly optimised for both high accuracy and low latency given a real hardware model.
2 Related Work
Convolutions have become the de facto spatial feature extractor in neural networks. As a result, a number of approaches have emerged to reduce the computational costs of using this operator.
Compact CNN Architectures.
These include alternative formulations to the dense convolutional layer, such as bottleneck layers He et al. (2016) that perform the convolutions in a lower-dimensional space, or the depth-wise convolutional layers Howard et al. (2017) which replace the standard convolutions with a channel-wise convolution followed by a point-wise convolution. More recently, Chen et al. (2019) proposed a compact multi-resolution convolutional block that reduces spatial redundancy of low frequencies resulting in faster inference, memory savings and slightly higher accuracy. This reinforces the proposition that current networks rely more on texture than shape for image/object discrimination Brendel and Bethge (2019); Geirhos et al. (2019). In this work, instead of presenting a new architecture, we propose an optimization for an existing known-good architecture to speed up inference. Our optimization can be applied to existing pre-trained models without the need for end-to-end training.
Quantization.
The most extreme form of quantization is binary networks Courbariaux et al. (2015); Lin et al. (2017); Xiang et al. (2017), which replace convolutions with bit-shifts resulting in inference speed-ups Rastegari et al. (2016). Ternary and 2-bit models Li et al. (2016); Wan et al. (2018); Gong et al. (2019) achieve higher accuracies while alleviating some of the challenges of training binary networks Alizadeh et al. (2019). However, it is 8-bit quantization Jacob et al. (2018); Krishnamoorthi (2018); Wang et al. (2019) that has achieved high popularity due to its balance between accuracy, model size reduction and inference speedup. Newer data formats, such as Posit Carmichael et al. (2019), aim to close the accuracy gap between INT8 and FP32 networks; however, hardware supporting them is unavailable. For training, BFLOAT16 Kalamkar et al. (2019) has been validated as an alternative to FP32, enabling faster training. In this work, we adopt INT8 and INT16 uniform quantization during training and study how lowering precision affects the lossy nature of Winograd convolutions.
Fast Convolution Algorithms.
Alternative formulations of the convolution operation include: the use of FFTs, which replace convolution with its multiplication-only counterpart in the frequency domain, resulting in faster inference Mathieu et al. (2013); Abtahi et al. (2018) and training Highlander and Rodriguez (2015); the Strassen algorithm Strassen (1969), which when applied to convolutions Cong and Xiao (2014); Tschannen et al. (2018) significantly reduces the number of multiplications at the cost of more additions; and the Winograd algorithm Winograd (1980), which replaces convolutions with a set of matrix transformations and point-wise multiplications and results in significantly faster inference stages Lavin and Gray (2016).
The Winograd algorithm for fast convolution was first applied to CNNs by Lavin and Gray (2016), showing speedup compared to cuDNN Chetlur et al. (2014) on a VGG Simonyan and Zisserman (2015) network, with no loss in accuracy on small tiles, batch 1. However, exploiting Winograd convolutions on larger input tiles is challenging due to numerical instability. In response to this limitation, Barabasz et al. (2018) showed that the error introduced by the Winograd algorithm grows at least exponentially with tile size, which can be partially alleviated by choosing better polynomial points for constructing the transformation matrices via Cook-Toom L. Toom (1963). An alternative formulation using trimmed Vandermonde matrices was described by Vincent et al. (2017). More recently, several works studying the suitability of Winograd convolutions in memory and compute constrained setups have been proposed. These include: the use of integer arithmetic for complex Winograd convolutions Meng and Brothers (2019); a general formulation for the Winograd algorithm Barabasz and Gregg (2019) that shows promising results in FP16 and BFLOAT16 when using higher degree polynomials; an efficient region-wise multi-channel implementation of Winograd convolutions using General Matrix Multiplications (GEMMs) Maji et al. (2019) that achieves speedups on Arm Cortex-A CPUs; and, a technique Liu et al. (2018) that enables up to 90% sparsity in the Hadamard product stage of the Winograd algorithm, effectively reducing by the number of multiplications with no accuracy loss in FP32 models. Our work fills the gap of using Winograd convolutions in quantized neural networks, enabling even faster convolutions in current off-the-shelf hardware, such as mobile CPUs.
Neural Architecture Search.
Automating the process of designing neural network architectures has drawn considerable attention. Early attempts relied on reinforcement learning Zoph and Le (2017); Brock et al. (2018); Real et al. (2019); Tan et al. (2019) or Bayesian optimization Hernández-Lobato et al. (2016); Fedorov et al. (2019) and required thousands of GPU hours to converge due to their computationally expensive and exhaustive search stages. Other works opted instead for a gradient-based search by framing the problem as a single over-parameterized network where all candidate operations at a particular node (e.g. a layer) are taken into consideration. The main aspect differentiating gradient-based NAS approaches is the way the output of a layer combines the contribution of each candidate operation. While Bender et al. (2018) defines it as the sum and DARTS Liu et al. (2019) as a weighted sum, ProxylessNAS Cai et al. (2019) relies on path-level binarization, making it possible to perform the search on the entire architecture directly using a single GPU. In addition to architecture discovery, NAS has also been successfully used for automated network pruning He et al. (2018) and quantization Wang et al. (2019). Our work leverages NAS to find the optimal convolution algorithm (i.e. im2row or different Winograd implementations) for each layer in the model while preserving the overall network macro-architecture and model size.
3 Winograd-Aware Networks
This section introduces Winograd convolutions and their trade-offs in terms of compute, memory and accuracy. Then, we present the Winograd-aware layers used in our networks.
3.1 Winograd implementation trade-offs
The Winograd algorithm for convolutions using linear polynomials guarantees to use the minimum number of element-wise multiplications to compute $m \times m$ outputs using an $r \times r$ filter. Lavin and Gray (2016) refer to this minimal algorithm as $F(m \times m, r \times r)$ and present its matrix form:
$$Y = A^\top \left[ (G g G^\top) \odot (B^\top d B) \right] A \qquad (1)$$
where $G$, $B^\top$ and $A^\top$ are transformation matrices applied to the filter $g$, input $d$ and output respectively and $\odot$ is the Hadamard or element-wise multiplication.
These transformation matrices are commonly constructed via the Cook-Toom algorithm L. Toom (1963), which requires choosing a set of polynomial points.
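To make Eq. 1 concrete, the following is a minimal NumPy sketch of a single tile computed with the $2 \times 2$ output, $3 \times 3$ filter variant. It assumes the standard transformation matrices reported by Lavin and Gray (2016); the function names `winograd_tile` and `direct_tile` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Transforms for a 2x2 output tile and a 3x3 filter (Lavin and Gray, 2016).
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float64)


def winograd_tile(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T        # filter in the Winograd domain (4x4)
    V = B_T @ d @ B_T.T    # input tile in the Winograd domain (4x4)
    M = U * V              # Hadamard product: 16 multiplications
    return A_T @ M @ A_T.T  # back to the spatial domain (2x2)


def direct_tile(d, g):
    """Reference: sliding-window correlation, 36 multiplications."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out


rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
print(np.allclose(winograd_tile(d, g), direct_tile(d, g)))
```

In double precision both functions agree up to rounding error, while the Hadamard stage uses 16 multiplications instead of the 36 needed by the direct computation.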
The challenges associated with the use of Winograd convolutions span three dimensions:
Compute. Winograd convolutions require the transformation of both tile and filter to the Winograd domain. The cost of these transformations grows with the tile size, and can represent a significant portion of the total computation of up to 75% (Sec. 6.2). This suggests that Winograd offers little to no speedup in layers with few filters. The cost of transforming the filter is often ignored as it is amortized across inferences.
Memory. In Eq.1, $G$ transforms the filter $g$ to the Winograd domain, matching the dimensions of the input tile $d$. This results in an increase of run-time memory associated with the weights: and for and respectively. This is especially undesirable on memory-constrained devices such as microcontrollers.
Numerical Error. Small and perform well in single and double precision (FP32/64) networks and are available in production-ready libraries such as NVIDIA cuDNN Chetlur et al. (2014) and Arm Compute Library Arm Software (). Because these introduce only marginal numerical error, a network can first be trained using conventional convolutions before replacing appropriate layers with Winograd, without impacting accuracy. However, attempting this with larger Winograd tiles, or in combination with quantization, results in significant accuracy loss. The root of the problem
In this work we focus on minimizing the numerical errors that arise when using the Winograd algorithm in quantized networks. Our approach does not aggravate the compute and memory challenges previously mentioned. Instead, it indirectly alleviates these by making use of quantization.
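The compute and memory trade-offs described above follow from simple counting. The sketch below is a rough illustration under stated assumptions: it considers a single-channel tile, counts only the multiplications of the Hadamard stage, and ignores the cost of the transforms. The helper `winograd_costs` is hypothetical.

```python
def winograd_costs(m, r):
    """Per-tile counts for an m x m output tile and an r x r filter."""
    tile = m + r - 1                      # side of the input tile
    wino_mults = tile * tile              # one multiplication per tile element
    direct_mults = (m * m) * (r * r)      # r*r multiplications per output
    return {
        "tile": tile,
        "mult_reduction": direct_mults / wino_mults,
        # the transformed filter has tile*tile entries instead of r*r
        "weight_growth": wino_mults / (r * r),
    }


for m in (2, 4, 6):
    c = winograd_costs(m, 3)
    print(m, c["tile"], round(c["mult_reduction"], 2), round(c["weight_growth"], 2))
```

For $3 \times 3$ filters this gives multiplication reductions of 2.25, 4.0 and about 5.06 for output tiles of size 2, 4 and 6, while the transformed filter grows by factors of about 1.78, 4.0 and 7.11.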
3.2 A Winograd-aware training pipeline
Neural networks have proven to be resilient to all kinds of approximations, e.g. pruning and quantization. When applying these techniques, consistently better models are generated if these approximations are present during training, that is, when training is aware of quantization or of pruning.
Following this intuition, we propose an end-to-end Winograd-aware pipeline as shown in Figure 2. In the forward pass we apply Eq.1 to each patch of the activations from the previous layer. We can apply standard back-propagation, since Eq.1 is only a collection of matrix-matrix multiplications. This implementation allows us to:
Learn the transforms. Traditionally, matrices $G$, $B^\top$ and $A^\top$ are fixed. Instead, we can treat them as another set of learnable parameters in the layer. This relaxation leads to much improved performance in quantized networks while still maintaining the overall structure of the Winograd convolution algorithm and its speedups.
Quantization diversity. Unlike standard convolution, which does not require intermediate computation, Winograd convolution requires at least four of them: $G g G^\top$, $B^\top d B$, the Hadamard product and the output transformation. Each of these can be quantized to a different number of bits depending on the bit-width of the input, that of the weights, and the overall complexity of the problem the network is designed to solve.
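The idea of quantizing each intermediate result separately can be sketched as follows. This is a forward-pass-only illustration, not the training pipeline itself: it assumes the $4 \times 4$ output, $3 \times 3$ filter transforms reported by Lavin and Gray (2016), a per-tensor symmetric uniform quantizer, and hypothetical names (`fake_quant`, `winograd_aware_tile`). During training, the rounding step would additionally need a gradient approximation, which is not shown here.

```python
import numpy as np

# Transforms for a 4x4 output tile and a 3x3 filter (Lavin and Gray, 2016).
B_T = np.array([[4, 0, -5, 0, 1, 0],
                [0, -4, -4, 1, 1, 0],
                [0, 4, -4, -1, 1, 0],
                [0, -2, -1, 2, 1, 0],
                [0, 2, -1, -2, 1, 0],
                [0, 4, 0, -5, 0, 1]], dtype=np.float64)
G = np.array([[1 / 4, 0, 0],
              [-1 / 6, -1 / 6, -1 / 6],
              [-1 / 6, 1 / 6, -1 / 6],
              [1 / 24, 1 / 12, 1 / 6],
              [1 / 24, -1 / 12, 1 / 6],
              [0, 0, 1]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 1, 1, 0],
                [0, 1, -1, 2, -2, 0],
                [0, 1, 1, 4, 4, 0],
                [0, 1, -1, 8, -8, 1]], dtype=np.float64)


def fake_quant(x, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    if bits is None:                      # None means: keep full precision
        return x
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0.0:
        return x
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def winograd_aware_tile(d, g, bits=(None, None, None, None)):
    """Winograd tile with one bit-width per intermediate result."""
    b_u, b_v, b_m, b_y = bits
    U = fake_quant(G @ g @ G.T, b_u)          # transformed filter
    V = fake_quant(B_T @ d @ B_T.T, b_v)      # transformed input tile
    M = fake_quant(U * V, b_m)                # Hadamard product
    return fake_quant(A_T @ M @ A_T.T, b_y)   # output transformation


def direct_tile(d, g):
    """Full-precision reference using a sliding window."""
    return np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(4)]
                     for i in range(4)])


rng = np.random.default_rng(0)
d = rng.standard_normal((6, 6))
g = rng.standard_normal((3, 3))
ref = direct_tile(d, g)
err_fp = np.max(np.abs(winograd_aware_tile(d, g) - ref))
err_int8 = np.max(np.abs(winograd_aware_tile(d, g, bits=(8, 8, 8, 8)) - ref))
print(err_fp, err_int8)
```

Passing a different tuple, for example `bits=(16, 8, 16, 8)`, assigns a separate bit-width to each of the four stages.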
4 Searching for Winograd-Aware Networks
Simultaneously maximizing accuracy and minimizing latency with Winograd convolution isn’t trivial. The reason for this is that large tiles result in low latency, but come at the cost of higher numerical error. This presents a good opportunity to jointly optimize network accuracy and latency.
To this end, we implement a NAS-based approach that automatically transforms an existing architecture into a Winograd-aware version. We perform NAS at the micro-architecture level by selecting from different convolution algorithms for each layer, but without modifying the network’s macro-architecture (e.g. number or order of layers, hyper-parameters, etc). Keeping the macro-architecture fixed allows us to fairly compare the standard model to its Winograd-aware counterpart in terms of latency and accuracy. We call our framework wiNAS.
4.1 Winograd-aware NAS pipeline
Introducing latency measurements into the optimization objective requires knowing the shape of the input tensor, i.e. the activations from the previous layer, at each layer of the network. We design wiNAS as a variation of ProxylessNAS Cai et al. (2019), leveraging path sampling while performing the search. This technique enables the allocation of the entire network on a single GPU by evaluating no more than two candidate operations at each layer per batch.
Similarly to ProxylessNAS, wiNAS formulates the search as a two-stage process, alternating the update of model parameters (the weights), where the loss is defined as
and the update of architecture parameters (the weight assigned to each operation on a given layer), where the loss, which introduces the latency metrics, is defined as
where are the architecture parameters and controls the impact of latency in the loss. The expected latency, , for a given layer is the weighted combination of the latency estimate of each candidate operation with their respective probability of being sampled. Intuitively, searching for Winograd convolutions with high would result in faster models, potentially at the detriment of accuracy.
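The expected-latency term described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: sampling probabilities are taken to be a softmax over the architecture parameters of each layer (as in ProxylessNAS), the architecture loss is simplified to a task loss plus a weighted sum of expected latencies with any regularization terms omitted, and all latency values are made up. The names `expected_latency`, `architecture_loss` and `latency_weight` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_params, latencies_ms):
    """Latency of each candidate weighted by its sampling probability."""
    probs = softmax(arch_params)
    return sum(p * lat for p, lat in zip(probs, latencies_ms))


def architecture_loss(task_loss, layers, latency_weight):
    """Task loss plus weighted expected latency summed over layers."""
    total = sum(expected_latency(a, lat) for a, lat in layers)
    return task_loss + latency_weight * total


# Two layers, four candidates each; latencies (ms) are illustrative only.
layers = [
    ([0.0, 0.0, 0.0, 0.0], [4.0, 3.0, 2.0, 3.0]),
    ([0.0, 0.0, 2.0, 0.0], [8.0, 5.0, 3.0, 4.0]),
]
print(architecture_loss(task_loss=1.0, layers=layers, latency_weight=0.1))
```

A larger `latency_weight` makes slow candidates more expensive in the loss, which pushes the search towards faster operations.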
Unlike ProxylessNAS, wiNAS focuses on simply selecting the optimal convolution algorithm for each of the convolutional layers. Therefore, the set of candidate operations for a given conv2d layer contains im2row and Winograd-aware layers in their F2, F4 and F6 configurations. This search space is illustrated in Figure 3. Each candidate operation comes with its respective latency, which is a function of the output dimensions and quantization level.
5 Experimental Setup
We conduct various experiments grouped in three categories. In this section we describe each experiment we conducted. We used PyTorch Paszke et al. (2017) for training and Arm Compute Library for deployment.
5.1 Vanilla Winograd-aware networks
We begin our study of Winograd-aware networks by performing an extensive evaluation on the ResNet-18 He et al. (2016) architecture using the CIFAR-10 Krizhevsky (2009) dataset. In this experiment we train the network end-to-end using standard convolutions, and F2, F4 and F6 Winograd convolutions. For each experiment with Winograd, all layers in the network use the same tile size, except the last two residual blocks which are kept fixed to . The input convolutional layer uses normal convolutions. We run the experiments for FP32, INT16, INT10 and INT8 quantized networks, where both weights and activations are uniformly quantized (including all the intermediate outputs shown in Figure 2). We follow the per-layer symmetric quantization as described in Krishnamoorthi (2018). We repeated each experiment while enabling the Winograd transforms $G$, $B^\top$ and $A^\top$ to be learnt, which we denote using the additional suffix -flex.
Winograd-aware layers do not require an over-parameterized model to perform well. We also varied the model size by using a width-multiplier, as used by the MobileNets family, ranging from 0.125 to 1.0, meaning that when the multiplier is 1.0 the network is the full ResNet-18. This leads to models ranging between 215K and 11M parameters. Winograd-aware layers with learnable transformations marginally increase () the model size, since the transforms themselves need to be saved for model deployment. We repeated the experiment for CIFAR-100 Krizhevsky (2009), but without varying the width-multiplier. CIFAR-100 is considerably more challenging than CIFAR-10, as it comprises 100 classes with only 600 images per class.
Additionally, we use an INT8 LeNet Lecun et al. (1998), trained on the MNIST dataset, to evaluate the suitability of Winograd-aware layers with learnable transforms for $5 \times 5$ filters. This is a more challenging case than $3 \times 3$ filters, because a larger tile is required (defined by $m + r - 1$), with larger transformation matrices which require the choice of more good polynomial points.
For experiments on ResNet-18, we replace strided convolution layers with a max-pooling layer followed by a dense convolution layer. Altering the network in this way is necessary since there is no known equivalent for strided Winograd convolutions, which remains an open research question. This is a common strategy when evaluating Winograd Liu et al. (2018); Choi et al. (2018). We also reduced the number of output channels of the input layer from 64 to 32 to lower the memory peak during training. We use the Adam Kingma and Ba (2015) optimizer and train for 120 epochs. Both CIFAR-10/100 use the same ResNet-18 architecture, differing only in the number of outputs of the fully connected layer. Results for other architectures are shown in A.1.
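The replacement of a strided convolution preserves the spatial dimensions of the layer output, which can be checked with the usual output-size formulas. The sketch below assumes a $3 \times 3$ convolution with stride 2 and padding 1, replaced by a $2 \times 2$ max-pooling with stride 2 followed by a $3 \times 3$ stride-1 convolution with padding 1; these kernel, stride and padding values are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output side length of an unpadded pooling layer."""
    return (size - kernel) // stride + 1


size = 32
strided = conv_out(size, kernel=3, stride=2, padding=1)
pooled = pool_out(size, kernel=2, stride=2)
replaced = conv_out(pooled, kernel=3, stride=1, padding=1)
print(strided, replaced)
```

For even input sizes both variants produce the same output size, and the stride-1 convolution in the replacement can be computed with Winograd.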
5.2 wiNAS: Winograd-aware NAS
To evaluate wiNAS, we define two different sets of candidate operations. These spaces are wiNAS$_{\text{WA}}$ and wiNAS$_{\text{WA-Q}}$, both allowing each convolutional layer to be implemented with either im2row or each of the Winograd configurations, F2, F4 or F6. The former uses a fixed bit-width for all elements in the architecture, while the latter introduces in the search space candidates of each operation quantized to FP32, INT16 and INT8.
The hyperparameters used for wiNAS are as follows: for the learning of model parameters we use mini-batch SGD with Nesterov momentum Sutskever et al. (2013). In the stage where we update the architecture parameters we instead use Adam with the first momentum scaling, $\beta_1$, set to zero, so the optimizer only updates paths that have been sampled. For both stages we use Cosine Annealing Loshchilov and Hutter (2017) scheduling and a batch size of 64. We perform the search for 100 epochs in each search space at different values of the latency coefficient ranging from 0.1 to 1e-3. Once the search is completed, we train the architecture end-to-end with the same hyperparameters as the rest of the Winograd-aware networks.
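Why a zero first-moment coefficient leaves unsampled paths untouched can be seen from the Adam update rule itself. The following is a scalar sketch of the standard Adam step (Kingma and Ba, 2015), not the actual training code; a gradient of zero stands in for a path that was not sampled in a given batch, and the function names are hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update; returns the new (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def run(beta1, grads=(1.0, 0.0)):
    """Apply Adam for each gradient; a zero gradient models an unsampled path."""
    param, m, v = 0.0, 0.0, 0.0
    history = []
    for t, grad in enumerate(grads, start=1):
        param, m, v = adam_step(param, grad, m, v, t, beta1=beta1)
        history.append(param)
    return history


with_momentum = run(beta1=0.9)
without_momentum = run(beta1=0.0)
print(with_momentum, without_momentum)
```

With `beta1=0.9` the parameter keeps moving in the second step because the first moment still carries the earlier gradient, whereas with `beta1=0.0` the second step leaves it unchanged.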
5.3 Winograd convolutions on mobile CPUs
For our study, we chose Arm A73 and A53 cores on a Huawei HiKey 960 development board with the big.LITTLE CPU architecture. These cores are good candidates for validating the speedups that are achievable with Winograd convolutions in today’s off-the-shelf mobile hardware.
|CPU|Clock|L1|L2|
|A73|2.4 GHz|64 KB|2048 KB|
|A53|1.8 GHz|32 KB|512 KB|
While both A73 and A53 are implemented as 16nm quad-core CPUs, the former is a high-performance processor and the latter is a high-efficiency processor. In Table 2 we summarise the main differences between these CPUs. The memory bandwidth would be the primary factor that ultimately sets the upper limit to the speedup achievable by Winograd, since it requires operating on larger tiles than direct convolution algorithms such as im2row or im2col.
In our study, we measured the time taken for convolutions using im2row, im2col and each of the Winograd configurations (, , ) when varying output width/height (from down to ) and (from to ). We performed the benchmark in controlled conditions and in single thread mode. Each combination was run five times with five seconds delay in between to prevent thermal throttling. We implemented Winograd convolutions using GEMMs (Maji et al. (2019)), and performed the same experiment separately on A73 and A53 for both FP32 and INT8. INT16 measurements are not currently supported in Arm Compute Library.
6 Experimental Results
The results of this work are arranged as three subsections. First, we show that Winograd-aware networks can achieve high accuracy. Second, we present the results from our dense benchmark for Winograd convolutions on mobile CPUs. Third, we show that wiNAS can jointly optimize a given macro-architecture for accuracy and latency.
6.1 Vanilla Winograd-aware networks
Figure 4 (left) shows that Winograd-aware networks in FP32 perform as well as direct convolutions, with both fixed and learned (-flex) transformation matrices. With quantization (all other plots), Winograd-aware layers are essential to enable the use of fast Winograd convolutions. This is not possible if switching to Winograd convolutions after training, as is commonly done in practice (see Table 1).
Furthermore, we show that learning the Winograd transforms (-flex) results in and better accuracies for and in INT8 scenarios. We argue that enabling this relaxation helps in easing the numerical instability inherent to Winograd convolutions, which is further exacerbated by quantization. The accuracy of Winograd-aware models scales linearly with network width, suggesting that these can be exploited in conjunction with architecture compression techniques such as channel pruning.
Results from LeNet ($5 \times 5$ filters) provide further evidence that larger tiles result in higher numerical error. In Figure 5, we show that even in relatively small datasets like MNIST, keeping the transformations $G$, $B^\top$ and $A^\top$ fixed leads to poor results as the output tile size is increased. This difference is almost 47% in the case of layers, which uses tiles.
Winograd-aware layers do not structurally modify the network architecture, since Winograd is just an algorithm to perform convolution. We demonstrate it is possible to transform a pre-trained model with standard convolution into its Winograd-aware counterpart within a few epochs. Concretely, in Figure 6 we show that an INT8 ResNet-18 can be adapted from a model of the same network that was trained end-to-end with standard convolutions in 20 epochs of retraining. This represents a training time reduction for Winograd-aware models. This is only possible when allowing the transformation matrices to evolve during training. Adapting FP32 models can be done in a single epoch.
We believe both and performance could be raised with alternative quantization implementations, closing the accuracy gap with and direct convolutions.
6.2 Impact of Winograd on Latency
The speedups associated with the use of Winograd convolutions often only account for the point-wise stage while assuming negligible costs for the input, weights and output transformations. Furthermore, these also assume that the larger the input patch, d, the larger the speedup compared to normal convolutions. However, although these assumptions are true for large tensors, they are not necessarily true when working with tensors of the sizes often found in CNNs for image classification or object detection.
Figure 7 shows a small portion of the obtained latency measurements for our benchmark in FP32. An almost identical pattern appears when using 8-bit arithmetic. In Figure 8 we show the speedups that Winograd convolutions offer at different layers of a ResNet-18 network. Our observations can be summarized in three points:
Input layers do not benefit from Winograd. This is primarily because the matrices in the element-wise GEMM stage are not large enough to compensate for the costs of transforming the input and output patches (see Figure 5.3 and 8). They represent a significant portion (up to 65% and 75% respectively on the A73 and A53) of the total costs for convolving an RGB input expanded to 32 channels. Similar ratios can be observed for other input sizes. In spite of this, this first layer accounts for a marginal portion of the total latency of the model, often below 1ms.
Optimal is a function of input width and height. For an input with a sufficient number of channels, e.g. 32 channels and up, we observe a consistent pattern alternating between and as the channel dimension of the output increases. This alternation comes as a result of the impossibility of subdividing the input into an integer number of patches, and therefore having to waste calculations when operating around the matrix edges. This pattern is invariant to different configurations and fades away as input dimensions exceed , where F6 consistently becomes the fastest.
Winograd transforms are costly. Excluding inputs with very few channels, the cost of performing the transformations to/from the Winograd domain can exceed 25% of the overall costs. These costs become negligible as the input width and height decrease, but the rate at which this happens also depends on the hardware. Our Winograd-aware pipeline formulation doesn’t impose any constraints on how the transforms are learnt. This results in dense transforms (as opposed to the default transforms, which contain zeros) and therefore applying them requires additional compute. Table 3 includes this latency overhead in models making use of learned transforms. In Appendix A.2 we provide more details on how dense transforms impact overall latency.
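The wasted calculations around the edges mentioned in the second point can be quantified with a small counting exercise. The sketch below assumes a square output that is covered by non-overlapping $m \times m$ output tiles, with partial tiles at the border still computed in full; the helper `tiling_stats` and the chosen output sizes are hypothetical.

```python
import math


def tiling_stats(out_size, m):
    """Number of m x m tiles covering an out_size x out_size output,
    and the fraction of computed outputs that fall outside it."""
    tiles_per_dim = math.ceil(out_size / m)
    computed = (tiles_per_dim * m) ** 2
    wasted = 1.0 - (out_size ** 2) / computed
    return tiles_per_dim ** 2, wasted


for out_size in (8, 14, 16):
    for m in (2, 4, 6):
        tiles, wasted = tiling_stats(out_size, m)
        print(f"output {out_size:2d}, m={m}: {tiles:3d} tiles, {wasted:.1%} wasted")
```

A larger tile needs fewer tiles overall, but when the output size is not a multiple of the tile size part of that advantage is lost to padding, which is one way to see why the fastest configuration can change with the input dimensions.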
On A53, the speedups from FP32 Winograd convolutions are smaller than on A73. We argue this comes as a result of the differences in the memory subsystem, which limit the ability of the lower-end CPU to operate efficiently on larger tensors. These speedups grow significantly when leveraging INT8 arithmetic, made possible by Winograd-aware training. Concretely, INT8 Winograd increases the speedup on the A53 by a factor of almost compared to Winograd in FP32, as shown in WA$_{\text{F4}}$ configurations in Table 3, at the cost of 1.1% accuracy in CIFAR-10. In the case of the more challenging CIFAR-100 dataset, the drop in accuracy is more severe. However, our WA$_{\text{F2}}$ layers offer attractive speedups for INT8 with no drop in accuracy. We rely on wiNAS to minimize this degradation with small impact on latency.
6.3 wiNAS Networks
|Type|act. / param.|CIFAR-10|CIFAR-100|Latency (ms)|Speedup|Latency (ms)|Speedup|
|im2row|32 / 32|93.16|74.62|118|-|85|-|
|im2row|8 / 8|93.20|74.11|117| |54| |
Choosing the convolution algorithm that minimizes overall network latency can be easily done by looking at the benchmark results. However, since the accuracy of Winograd convolutions degrades rapidly in reduced precision networks, selecting the fastest algorithm for each layer without sacrificing accuracy is not straightforward.
When using wiNAS$_{\text{WA}}$, values of larger than 0.05 consistently result in models with the same layer configuration as those in WA$_{\text{F4}}$ (described in section 5.1). When lowering the impact of latency in the loss function of Eq. 3, we observed several Winograd-aware layers were replaced with either im2row or , at the cost of less than 9 ms latency increase in the worst case, an INT8 model on the A53 for CIFAR-100. These models resulted in similar accuracies in FP32 and reached and higher accuracies in INT8 for CIFAR-10 and CIFAR-100 respectively. Despite CIFAR-100 models sacrificing more latency in order to recover accuracy, they remained faster than WA$_{\text{F2}}$ at INT8.
When introducing quantization in the search performed by wiNAS$_{\text{WA-Q}}$, the accuracy gap is almost closed for CIFAR-10 and further reduced for CIFAR-100. This comes primarily as a result of relying on higher bit-widths for the first layers in the network. In both cases, we maintain attractive speedups compared to im2row and Winograd convolutions in FP32, especially on the A73. All the ResNet-18 architectures optimized with wiNAS are described in A.3.
7 Discussion
In this section we present some of the challenges of training Winograd-aware networks and propose lines of future work.
A direct implementation of Eq. 1 requires saving the intermediate outputs of each matrix-matrix multiplication, since these are needed for back-propagation. This results in high memory usage. In this work we had to rely on gradient checkpointing Chen et al. (2016) to lower the memory peak during training, at the cost of additional computation. We believe a native CUDA implementation of the Winograd-aware layers with better memory reuse would ease this problem.
Learning larger models (with width multipliers 0.75 and 1.0) proved challenging for and when introducing quantization. Using other types of quantization, in particular per-channel affine quantization as in Jacob et al. (2018), would likely help. Also, enabling different bit-widths throughout Eq. 1 could help mitigate the accuracy drop.
It is known that bad polynomial points for constructing $G$, $B^\top$ and $A^\top$ introduce significant deviations in the result of computing Winograd convolutions compared to that of normal convolutions. We observed that good starting points are also important even when learning the Winograd transformations. Polynomial points specifically tailored for quantized Winograd could alleviate some of the degradation that we observed with increased tile size.
In this work we focused on mobile CPUs, but we expect these benefits to be also applicable to GPUs. However, to further maximize the speedups that Winograd-aware layers for quantized CNNs offer, a custom hardware implementation in the form of an accelerator would be preferable.
Running CNN-based applications that require standard convolutional layers is challenging on compute-constrained devices such as mobile CPUs. This paper presents Winograd-aware layers as the building block to combine the benefits of quantized networks and fast Winograd convolutions. We studied Winograd-aware layers with different tile sizes, three quantization levels and on three popular datasets. We found that allowing the transformation matrices to evolve during training resulted in significantly better models. With wiNAS we leveraged Winograd-aware layers and latency metrics from off-the-shelf mobile CPUs and found architectures that helped minimize the numerical instability of Winograd. A Winograd-aware ResNet-18 quantized to INT8 offers up to faster inference for only a marginal accuracy drop compared to existing Winograd implementations, which are limited to FP32. This network is also faster than an optimized im2row implementation using INT8 arithmetic.
Appendix A
A.1 Winograd-aware layers for other architectures
The results of our study of Winograd-aware networks presented in Section 6 showed multiple configurations of the ResNet-18 architecture at different width-multipliers, bit-widths, quantization levels and convolution algorithms. Here, we present a similar analysis for two other popular architectures for image classification. We limit our study to the full models (i.e. mult=1.0). We show results for SqueezeNet Iandola et al. (2016) in Table 4 and for ResNeXt Xie et al. (2017) in Table 5. These results align with what was observed for ResNet-18: in the presence of quantization, learning the Winograd transformations (flex configurations) resulted in better performance than using the default (static) transformations. All experiments used the same hyper-parameters as described in Section 5.
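Table 4: SqueezeNet, accuracy (%).
| Type | act. / param. (bits) | trans. | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|
| im2row | 32 / 32 | - | 91.13 | 69.06 |
| im2row | 8 / 8 | - | 91.15 | 69.34 |

Table 5: ResNeXt, accuracy (%).
| Type | act. / param. (bits) | trans. | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|
| im2row | 32 / 32 | - | 93.17 | 74.54 |
| im2row | 8 / 8 | - | 93.40 | 74.89 |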
For both architectures, INT8 Winograd-aware models with learnt Winograd transformations did not result in accuracy gaps as pronounced as the ones reported for ResNet-18 in Section 6. These models even surpass the im2row baselines for CIFAR-100. We argue this is because SqueezeNet and ResNeXt have fewer convolutional layers (8 and 6, respectively) compared to ResNet-18, which has 16. Therefore, the succession of fewer convolutional layers implemented as Winograd convolutions reduces the overall impact of numerical error.
A.2 Overhead of Learnt Winograd Transforms
The default Winograd transformation matrices contain varying numbers of zeros. For F2 the sparsity ratios are 50%, 33% and 25% respectively for B^T, G and A^T. From the construction process of these matrices, and especially the choice of polynomial points, we would expect lower sparsity ratios as these transforms are adjusted for larger input patches. For example, for the default transforms these ratios are 22%, 22% and 25%. For implementations of matrix-matrix multiplication that can exploit data sparsity, as is the case of Arm’s Compute Library, having zeros means less compute, which often translates into lower latencies.
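The 50%, 33% and 25% figures quoted above can be reproduced by counting zeros in the standard F(2×2, 3×3) transforms of Lavin and Gray (2016), taken in the order B^T, G, A^T. That these are the matrices and the ordering meant here is an assumption. The short sketch below, with hypothetical names, performs the count.

```python
# Standard F(2x2, 3x3) transforms (Lavin and Gray, 2016); assumed to be the
# default transforms whose sparsity is quoted in the text.
BT_F2 = [[1, 0, -1, 0],
         [0, 1, 1, 0],
         [0, -1, 1, 0],
         [0, 1, 0, -1]]
G_F2 = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.5],
        [0.5, -0.5, 0.5],
        [0.0, 0.0, 1.0]]
AT_F2 = [[1, 1, 1, 0],
         [0, 1, -1, -1]]


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    entries = [x for row in matrix for x in row]
    return sum(1 for x in entries if x == 0) / len(entries)


# Sparsity as a rounded percentage for each transform.
ratios = {name: round(100 * sparsity(m))
          for name, m in (("B^T", BT_F2), ("G", G_F2), ("A^T", AT_F2))}
print(ratios)  # {'B^T': 50, 'G': 33, 'A^T': 25}
```

A learnt transform of the same shape will generally have no exact zeros, so the same function would return 0 for it. This is the loss of sparsity that the latency overhead discussed in this section stems from.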
The Winograd-aware formulation presented here does not impose restrictions on what the learnt transforms should look like. As a consequence, the resulting transforms rarely contain zeros. This translates into additional compute for the input and output transforms. The impact of using dense, learnt transforms for WA\textsubscript{F4} models running on a Cortex-A73 is a latency increase of 17% (+8 ms) and 20% (+6 ms) for FP32 and INT8 respectively for a ResNet-18 network. This increase in latency is higher on the Cortex-A53 since the Winograd transforms are proportionally more expensive on this core. These penalties represent the worst-case slowdown, assuming the transforms are compute bound. However, we believe that due to the access patterns of the Winograd transform kernels (gather and scatter across a wide area of memory), at least some of the execution time of the transforms is spent on misses in the cache hierarchy, and so some additional computation can be tolerated without necessarily increasing execution time.
We note that the impact for models is considerably higher, especially since the original transforms G and A are not only sparse but binary, whereas the learnt ones are not. However, these penalties are never met in practice since Winograd-aware models with default transforms can perform as well as those with learnt transforms (as shown in Figure 4 and Tables 4 and 5), even in INT8.
Even with the performance loss due to the learnt transforms, we’re still demonstrating some (non-negligible) and speedup compared to INT8 im2row for A73 and A53 respectively. To the best of our knowledge this is the first time INT8 Winograd convolutions are empirically proven to work.
A.3 Architectures optimized with wiNAS
Our framework, wiNAS, takes a given macro-architecture and optimizes each convolutional layer by choosing from direct convolution or different Winograd configurations. For the search, all convolutions were fixed to use im2row.
For wiNAS\textsubscript{WA} in FP32, the resulting architecture only substituted the last convolution layer with im2row instead of . The rest of the layers remained unchanged from the WA\textsubscript{F4} configuration (which was described in Section 5.1). The same micro-architecture was used in CIFAR-10 and CIFAR-100.
For wiNAS\textsubscript{WA} with 8-bit quantization and CIFAR-10, wiNAS replaced the 5th and second-to-last convolutional layers with im2row, instead of and respectively. For CIFAR-100, more optimization was compared to WA\textsubscript{F4}. The resulting micro-architecture optimization is shown in Figure 9 (left).
When quantization is introduced into the search space (wiNAS\textsubscript{WA-Q}), the resulting architectures are those shown in Figure 9 for both CIFAR-10 (middle) and CIFAR-100 (right).
- This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732204 (Bonseyes). This work is supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 16.0159. The opinions expressed and arguments employed herein do not necessarily reflect the official views of these funding bodies.
- General multiplications is a term commonly used in Winograd jargon to refer to the element-wise (Hadamard) product stage.
- See Section 5.2 in Blahut (2010) for a step-by-step example.
- We refer the interested reader to Barabasz et al. (2018) for an analysis on the nature of the errors in Winograd convolutions.
- Accelerating convolutional neural network with fft on embedded hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26 (9), pp. 1737–1749.
- A systematic study of binary neural networks’ optimisation. In International Conference on Learning Representations.
- Optimal dnn primitive selection with partitioned boolean quadratic programming. Proceedings of the 2018 International Symposium on Code Generation and Optimization - CGO 2018.
- Arm compute library. https://developer.arm.com/ip-products/processors/machine-learning/compute-library. Accessed: 2019-08-01.
- Improving accuracy of winograd convolution for dnns. CoRR abs/1803.10986.
- Winograd convolution for dnns: beyond linear polynomials.
- Understanding and simplifying one-shot architecture search. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 550–559.
- Fast algorithms for signal processing. Cambridge University Press.
- Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations.
- Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
- SMASH: one-shot model architecture search through hypernetworks. In International Conference on Learning Representations.
- ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations.
- Deep positron: a deep neural network using the posit number system. 2019 Design, Automation and Test in Europe Conference and Exhibition (DATE).
- Training deep nets with sublinear memory cost.
- Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution.
- CuDNN: efficient primitives for deep learning. CoRR abs/1410.0759.
- Compression of deep convolutional neural networks under joint sparsity constraints.
- Visual wake words dataset.
- Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning – ICANN 2014, S. Wermter, C. Weber, W. Duch, T. Honkela, P. Koprinkova-Hristova, S. Magg, G. Palm and A. E. P. Villa (Eds.), Cham, pp. 281–290.
- BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 3123–3131.
- SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers. In Proceedings of the Neural Information Processing Systems (NeurIPS) Conference 2019.
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
- Differentiable soft quantization: bridging full-precision and low-bit neural networks.
- Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- AMC: automl for model compression and acceleration on mobile devices. Lecture Notes in Computer Science, pp. 815–832.
- Designing neural network hardware accelerators with decoupled objective evaluations. In NIPS workshop on Bayesian Optimization.
- Very efficient training of convolutional neural networks using fast fourier transform and overlap-and-add. Proceedings of the British Machine Vision Conference 2015.
- 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
- MobileNets: efficient convolutional neural networks for mobile vision applications.
- Searching for mobilenetv3.
- SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5MB model size.
- Quantization and training of neural networks for efficient integer-arithmetic-only inference. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- A study of bfloat16 for deep learning training.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR abs/1806.08342.
- Learning multiple layers of features from tiny images. Technical report, Citeseer.
- The complexity of a scheme of functional elements realizing the multiplication of integers. Doklady Akademii Nauk SSSR 3.
- The deep (learning) transformation of mobile and embedded computing. Computer 51 (5), pp. 12–16.
- Fast algorithms for convolutional neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324.
- MobiSR: efficient on-device super-resolution through heterogeneous mobile processors. In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom ’19, New York, NY, USA.
- Ternary weight networks.
- On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–6.
- Neural networks on microcontrollers: saving memory at inference via operator reordering.
- Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353.
- DARTS: differentiable architecture search. In International Conference on Learning Representations.
- Efficient sparse-winograd convolutional neural networks. In International Conference on Learning Representations.
- SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR) 2017 Conference Track.
- Efficient winograd or cook-toom convolution kernel implementation on widely used mobile cpus.
- Fast training of convolutional networks through ffts.
- Efficient winograd convolution via integer arithmetic. CoRR abs/1901.01965.
- Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- XNOR-net: imagenet classification using binary convolutional neural networks. CoRR abs/1603.05279.
- Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 4780–4789.
- MobileNetV2: inverted residuals and linear bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
- Single-path mobile automl: efficient convnet design and nas hyperparameter optimization.
- Gaussian elimination is not optimal. Numer. Math. 13 (4), pp. 354–356.
- On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 1139–1147.
- Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329.
- Gated-scnn: gated shape cnns for semantic segmentation. arXiv preprint arXiv:1907.05740.
- MnasNet: platform-aware neural architecture search for mobile. In CVPR.
- EfficientNet: rethinking model scaling for convolutional neural networks.
- StrassenNets: deep learning with a multiplication budget. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 4985–4994.
- On improving the numerical stability of winograd convolutions. In International Conference on Learning Representations (Workshop track).
- TBN: convolutional neural network with ternary inputs and binary weights. In ECCV.
- HAQ: hardware-aware automated quantization with mixed precision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- DNN Engine: A 28-nm Timing-Error Tolerant Sparse Deep Neural Network Processor for IoT Applications. IEEE Journal of Solid-State Circuits 53 (9), pp. 2722–2731.
- A 16nm 25mm2 SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators. In 2019 Symposium on VLSI Circuits, pp. C34–C35.
- FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning.
- Signal processing and complexity of computation. In ICASSP ’80. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. 94–101.
- Arithmetic complexity of computations. Society for Industrial and Applied Mathematics.
- Binary deep neural networks for speech recognition. In Proc. Interspeech 2017, pp. 533–537.
- Aggregated residual transformations for deep neural networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995.
- Image super-resolution using very deep residual channel attention networks. CoRR abs/1807.02758.
- Neural architecture search with reinforcement learning. In International Conference on Learning Representations.