Searching for Winograd-aware Quantized Networks
1 Introduction
The rise in popularity of deep CNNs has spawned a research effort to find lower-complexity networks that increase inference efficiency. This is desirable for inference in the cloud and becomes crucial on mobile and IoT devices, whose hardware is far more constrained Lane and Warden (2018). Over the last few years, multiple approaches have been proposed to alleviate the compute-bound nature of convolutions Sze et al. (2017). Arguably, the use of depthwise convolutions, as popularized by the family of MobileNet architectures Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019), has become the most widely embraced design choice for lightweight networks. These layers are used in state-of-the-art image classification networks Stamoulis et al. (2019); Tan and Le (2019). However, beyond image classification, standard convolutions are still preferred over depthwise convolutions for applications like image super-resolution Zhang et al. (2018); Lee et al. (2019), image segmentation Takikawa et al. (2019) and GANs Brock et al. (2019). Therefore, alternative ways of speeding up standard convolutions are required to run these applications on mobile CPUs, which often come with constrained compute and energy budgets Whatmough et al. (2019). Model quantization and the use of alternative convolution algorithms instead of direct convolution are two ways of accomplishing this.
Lower-precision networks result in smaller model sizes, faster inference, lower energy consumption and smaller chip area Sze et al. (2017); Li et al. (2019). Concretely, 8-bit quantized models achieve comparable performance to full-precision models Jacob et al. (2018); Krishnamoorthi (2018) while being ready for deployment on off-the-shelf hardware, as 8-bit arithmetic is widely supported. In addition to directly reducing model size, 8-bit integer-only arithmetic requires considerably less chip area and energy than full-precision additions and multiplications Horowitz (2014); Whatmough et al. (2018). Because of these desirable benefits, 8-bit quantization has been widely adopted in both compute-constrained devices Liberis and Lane (2019); Chowdhery et al. (2019); Wang et al. (2019) and accelerators Whatmough et al. (2019).
Orthogonal to lightweight architectural designs and quantization, fast convolution algorithms used in place of direct convolutions can provide further speedups. These come with their own trade-offs Anderson and Gregg (2018), but in this work we focus on the Winograd algorithm, since it is the fastest known algorithm for convolutions of the dimensions often found in CNNs. The Winograd convolution performs the bulk of the computation as a Hadamard product between weights and input in the Winograd domain. Unlike normal convolutions, which generate a single output per convolution, a Winograd convolution computes several outputs simultaneously. This property makes Winograd convolutions minimal in the number of general multiplications.
In this paper, we focus on alleviating the numerical error that arises when using Winograd convolutions in quantized neural networks. Achieving this ultimately enables us to combine the speedups of Winograd with those that reduced-precision arithmetic is known to offer, among other benefits in terms of energy and area. To this end, we present an end-to-end training pipeline that exposes the numerical inaccuracies introduced by Winograd to the learning of the model parameters. We also address the source of the numerical error and propose a relaxation on the form of the transformation matrices used in the Winograd convolution algorithm. We achieve this by adding these matrices to the set of learnable parameters of a layer, after initializing them via Cook-Toom Toom (1963). Finally, we describe wiNAS, a Winograd-aware Neural Architecture Search framework which leverages Winograd-aware layers and latency measurements on Arm Cortex-A73 and A53 cores to jointly optimize for high accuracy and low latency. Our framework transforms a given macroarchitecture by replacing each convolution with either im2row or Winograd convolutions of different tile sizes.
The contributions of this work are summarized below:

We show that Winograd-aware networks enable Winograd convolutions in quantized networks, including 8-bit networks, with little accuracy drop. To the best of our knowledge, this is the first time this has been empirically demonstrated.

We demonstrate that learning the Winograd transforms, as opposed to keeping them fixed, results in better network accuracy: up to 10% improvement when using F4 and F6 tiles in 8-bit CNNs with 3×3 filters. This improvement is more pronounced with larger filters.

We present wiNAS as a tool that can find Winogradaware networks jointly optimised for both high accuracy and low latency given a real hardware model.
2 Related Work
Convolutions have become the de facto spatial feature extractor in neural networks. As a result, a number of approaches have emerged to reduce the computational costs of using this operator.
Compact CNN Architectures.
These include alternative formulations of the dense convolutional layer, such as bottleneck layers He et al. (2016), which perform the convolutions in a lower-dimensional space, or depthwise convolutional layers Howard et al. (2017), which replace the standard convolution with a channelwise convolution followed by a pointwise convolution. More recently, Chen et al. (2019) proposed a compact multi-resolution convolutional block that reduces the spatial redundancy of low frequencies, resulting in faster inference, memory savings and slightly higher accuracy. This reinforces the proposition that current networks rely more on texture than shape for image/object discrimination Brendel and Bethge (2019); Geirhos et al. (2019). In this work, instead of presenting a new architecture, we propose an optimization for an existing known-good architecture to speed up inference. Our optimization can be applied to existing pre-trained models without the need for end-to-end training.
Quantization.
The most extreme form of quantization is binary networks Courbariaux et al. (2015); Lin et al. (2017); Xiang et al. (2017), which replace convolutions with bit-shifts, resulting in inference speedups Rastegari et al. (2016). Ternary and 2-bit models Li et al. (2016); Wan et al. (2018); Gong et al. (2019) achieve higher accuracies while alleviating some of the challenges of training binary networks Alizadeh et al. (2019). However, it is 8-bit quantization Jacob et al. (2018); Krishnamoorthi (2018); Wang et al. (2019) that has achieved the highest popularity, due to its balance between accuracy, model size reduction and inference speedup. Newer data formats, such as Posit Carmichael et al. (2019), aim to close the accuracy gap between INT8 and FP32 networks; however, hardware supporting them is unavailable. For training, BFLOAT16 Kalamkar et al. (2019) has been validated as an alternative to FP32, enabling faster training. In this work, we adopt INT8 and INT16 uniform quantization during training and study how lowering precision interacts with the lossy nature of Winograd convolutions.
Fast Convolution Algorithms.
Alternative formulations of the convolution operation include: FFTs, which replace convolution with its multiplication-only counterpart in the frequency domain, resulting in faster inference Mathieu et al. (2013); Abtahi et al. (2018) and training Highlander and Rodriguez (2015); the Strassen algorithm Strassen (1969), which when applied to convolutions Cong and Xiao (2014); Tschannen et al. (2018) significantly reduces the number of multiplications at the cost of more additions; and the Winograd algorithm Winograd (1980), which replaces convolutions with a set of matrix transformations and pointwise multiplications, resulting in significantly faster inference Lavin and Gray (2016).
Winograd Convolution.
The Winograd algorithm for fast convolution was first applied to CNNs by Lavin and Gray (2016), showing speedups over cuDNN Chetlur et al. (2014) on a VGG Simonyan and Zisserman (2015) network with no loss in accuracy, using small tiles and batch size 1. However, exploiting Winograd convolutions on larger input tiles is challenging due to numerical instability. Studying this limitation, Barabasz et al. (2018) showed that the error introduced by the Winograd algorithm grows at least exponentially with tile size, which can be partially alleviated by choosing better polynomial points for constructing the transformation matrices via Cook-Toom Toom (1963). An alternative formulation using trimmed Vandermonde matrices was described by Vincent et al. (2017). More recently, several works have studied the suitability of Winograd convolutions in memory- and compute-constrained setups. These include: the use of integer arithmetic for complex Winograd convolutions Meng and Brothers (2019); a general formulation of the Winograd algorithm Barabasz and Gregg (2019) that shows promising results in FP16 and BFLOAT16 when using higher-degree polynomials; an efficient region-wise multi-channel implementation of Winograd convolutions using General Matrix Multiplications (GEMMs) Maji et al. (2019) that achieves speedups on Arm Cortex-A CPUs; and a technique Liu et al. (2018) that enables up to 90% sparsity in the Hadamard product stage of the Winograd algorithm, effectively reducing the number of multiplications with no accuracy loss in FP32 models. Our work fills the gap of using Winograd convolutions in quantized neural networks, enabling even faster convolutions on current off-the-shelf hardware, such as mobile CPUs.
Neural Architecture Search.
Automating the process of designing neural network architectures has drawn considerable attention. Early attempts relied on reinforcement learning Zoph and Le (2017); Brock et al. (2018); Real et al. (2019); Tan et al. (2019) or Bayesian optimization Hernández-Lobato et al. (2016); Fedorov et al. (2019) and required thousands of GPU-hours to converge due to their computationally expensive and exhaustive search stages. Other works opted instead for a gradient-based search, framing the problem as a single over-parameterized network where all candidate operations at a particular node (e.g. a layer) are taken into consideration. The main aspect differentiating gradient-based NAS approaches is the way the output of a layer combines the contributions of the candidate operations. While Bender et al. (2018) defines it as the sum and DARTS Liu et al. (2019) as a weighted sum, ProxylessNAS Cai et al. (2019) relies on path-level binarization, making it possible to perform the search on the entire architecture directly using a single GPU. Beyond architecture discovery, NAS has also been successfully used for automated network pruning He et al. (2018) and quantization Wang et al. (2019). Our work leverages NAS to find the optimal convolution algorithm (i.e. im2row or different Winograd implementations) for each layer in the model while preserving the overall network macroarchitecture and model size.
3 Winograd-Aware Networks
This section introduces Winograd convolutions and their trade-offs in terms of compute, memory and accuracy. Then, we present the Winograd-aware layers used in our networks.
3.1 Winograd implementation trade-offs
The Winograd algorithm for convolutions using linear polynomials guarantees the minimum number of element-wise multiplications needed to compute m×m outputs using an r×r filter. Lavin and Gray (2016) refer to this minimal algorithm as F(m×m, r×r) and present its matrix form:
Y = Aᵀ[(G g Gᵀ) ⊙ (Bᵀ d B)] A    (1)
where G, Bᵀ and Aᵀ are transformation matrices applied to the filter g, the input d and the output respectively, and ⊙ is the Hadamard or element-wise multiplication.
These transformation matrices are commonly constructed via the Cook-Toom algorithm, which requires choosing a set of so-called polynomial points.
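As a concrete check of Eq. 1, the sketch below (ours, using the standard Cook-Toom matrices for F(2×2, 3×3)) computes one Winograd output tile and verifies it against direct convolution:

```python
import numpy as np

# Standard Cook-Toom transforms for F(2x2, 3x3) (Lavin & Gray, 2016)
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0,  0.0],
              [0.5,  0.5,  0.5],
              [0.5, -0.5,  0.5],
              [0.0,  0.0,  1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2(d, g):
    """One F(2x2, 3x3) tile: 4x4 input patch d, 3x3 filter g -> 2x2 output."""
    U = G @ g @ G.T             # filter in the Winograd domain (4x4)
    V = BT @ d @ BT.T           # input in the Winograd domain (4x4)
    return AT @ (U * V) @ AT.T  # Hadamard product, then output transform

def direct_2x2(d, g):
    """Reference: direct (cross-correlation) convolution of the same patch."""
    return np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
assert np.allclose(winograd_f2(d, g), direct_2x2(d, g))
```

Only 16 general multiplications (the 4×4 Hadamard product) are needed for the four outputs, versus 36 for direct convolution.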
The challenges associated with the use of Winograd convolutions span three dimensions:
Compute. Winograd convolutions require the transformation of both the input tile and the filter to the Winograd domain. The cost of these transformations grows with the tile size and can represent up to 75% of the total computation (Sec. 6.2). This suggests that Winograd offers little to no speedup in layers with few filters. The cost of transforming the filters is often ignored, as it is amortized across inferences.
Memory. In Eq. 1, the filter g is transformed to the Winograd domain, matching the dimensions of the input tile d. This results in an increase of the runtime memory associated with the weights: 1.8× and 4× for F2 and F4 respectively, given 3×3 filters. This is especially undesirable on memory-constrained devices such as microcontrollers.
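Both the multiplication savings and the weight-memory increase follow directly from the tile side m + r − 1. A small illustrative calculation (ours):

```python
def winograd_stats(m, r=3):
    """Per-tile arithmetic for F(m x m, r x r): direct convolution needs
    m*m*r*r multiplications, Winograd needs (m+r-1)^2 (one per Hadamard
    element); weights grow from r^2 to (m+r-1)^2 values at runtime."""
    t = m + r - 1                           # tile side in the Winograd domain
    mult_saving = (m * m * r * r) / t ** 2  # multiplication reduction factor
    weight_blowup = t ** 2 / r ** 2         # runtime weight memory increase
    return mult_saving, weight_blowup

for m in (2, 4, 6):
    s, b = winograd_stats(m)
    print(f"F{m}: {s:.2f}x fewer multiplications, {b:.2f}x weight memory")
```

For 3×3 filters this gives 2.25×, 4.00× and 5.06× fewer multiplications for F2, F4 and F6, paid for with 1.78×, 4.00× and 7.11× more weight memory.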
Numerical Error. Small F2 and F4 perform well in single- and double-precision (FP32/64) networks and are available in production-ready libraries such as NVIDIA cuDNN Chetlur et al. (2014) and the Arm Compute Library (Arm Software). Because these introduce only marginal numerical error, a network can first be trained using conventional convolutions and appropriate layers later replaced with Winograd, without impacting accuracy. However, attempting this with larger Winograd tiles, or in combination with quantization, results in significant accuracy loss (see Table 1). The root of the problem is the numerical error introduced by the transformations, which grows with tile size and is amplified by quantization.
Table 1: ResNet18 accuracy with direct and Winograd (F2, F4, F6) convolutions at 32-bit, 16-bit and 8-bit precision.
In this work we focus on minimizing the numerical errors that arise when using the Winograd algorithm in quantized networks. Our approach does not aggravate the compute and memory challenges previously mentioned. Instead, it indirectly alleviates these by making use of quantization.
3.2 A Winograd-aware training pipeline
Neural networks have proven resilient to many kinds of approximation, e.g. pruning and quantization. When applying these techniques, consistently better models are obtained if the approximations are present during training, that is, when training is aware of quantization or aware of pruning.
Following this intuition, we propose an end-to-end Winograd-aware pipeline, shown in Figure 2. In the forward pass, we apply Eq. 1 to each patch of the activations from the previous layer. We can apply standard backpropagation, since Eq. 1 is only a collection of matrix-matrix multiplications. This implementation allows us to:

Learn the transforms. Traditionally, the matrices G, Bᵀ and Aᵀ are fixed. Instead, we can treat them as another set of learnable parameters of the layer. This relaxation leads to much improved performance in quantized networks, while still maintaining the overall structure of the Winograd convolution algorithm and its speedups.

Quantization diversity. Unlike standard convolution, which requires no intermediate results, Winograd convolution requires at least four: the transformed input, the transformed weights, the Hadamard product and the output transformation. Each of these can be quantized to a different number of bits depending on the bitwidth of the input, that of the weights, and the overall complexity of the problem the network is designed to solve.
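As an illustration, the sketch below implements an F(2×2, 3×3) Winograd-aware convolution in PyTorch with the transforms as learnable parameters. It is our minimal sketch, not the paper's implementation: it assumes stride 1, padding 1 and even spatial dimensions, and omits the quantization steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Cook-Toom F(2x2, 3x3) transforms, used only as initialization; the
# "flex" variant below learns them jointly with the weights.
BT0 = torch.tensor([[1., 0., -1., 0.], [0., 1., 1., 0.],
                    [0., -1., 1., 0.], [0., 1., 0., -1.]])
G0 = torch.tensor([[1., 0., 0.], [.5, .5, .5], [.5, -.5, .5], [0., 0., 1.]])
AT0 = torch.tensor([[1., 1., 1., 0.], [0., 1., -1., -1.]])

class WinogradAwareConv2d(nn.Module):
    """3x3 convolution (stride 1, pad 1) computed as F(2x2, 3x3) Winograd,
    with the transformation matrices as learnable parameters."""
    def __init__(self, cin, cout):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(cout, cin, 3, 3) * 0.1)
        self.BT = nn.Parameter(BT0.clone())
        self.G = nn.Parameter(G0.clone())
        self.AT = nn.Parameter(AT0.clone())

    def forward(self, x):
        n, _, h, w = x.shape                     # h, w assumed even
        x = F.pad(x, (1, 1, 1, 1))
        # overlapping 4x4 input tiles, stride 2: each yields 2x2 outputs
        tiles = x.unfold(2, 4, 2).unfold(3, 4, 2)          # n,cin,th,tw,4,4
        V = torch.einsum('ij,ncxyjk,lk->ncxyil', self.BT, tiles, self.BT)
        U = torch.einsum('ij,ocjk,lk->ocil', self.G, self.weight, self.G)
        M = torch.einsum('ocil,ncxyil->noxyil', U, V)      # Hadamard + sum
        Y = torch.einsum('ij,noxyjk,lk->noxyil', self.AT, M, self.AT)
        th, tw = Y.shape[2], Y.shape[3]
        return Y.permute(0, 1, 2, 4, 3, 5).reshape(n, -1, th * 2, tw * 2)
```

At initialization the layer matches `F.conv2d(x, layer.weight, padding=1)` up to floating-point error; training can then move the transforms away from their Cook-Toom values.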
4 Searching for Winograd-Aware Networks
Simultaneously maximizing accuracy and minimizing latency with Winograd convolutions is not trivial: large tiles result in low latency but come at the cost of higher numerical error. This presents a good opportunity to jointly optimize network accuracy and latency.
To this end, we implement a NASbased approach that automatically transforms an existing architecture into a Winogradaware version. We perform NAS at the microarchitecture level by selecting from different convolution algorithms for each layer, but without modifying the network’s macroarchitecture (e.g. number or order of layers, hyperparameters, etc). Keeping the macroarchitecture fixed allows us to fairly compare the standard model to its Winogradaware counterpart in terms of latency and accuracy. We call our framework wiNAS.
4.1 Winograd-aware NAS pipeline
Introducing latency measurements into the optimization objective requires knowing the shape of the input tensor, i.e. the activations from the previous layer, at each layer of the network. We design wiNAS as a variation of ProxylessNAS Cai et al. (2019), leveraging path sampling while performing the search. This technique enables the allocation of the entire network on a single GPU by evaluating no more than two candidate operations at each layer per batch.
Similarly to ProxylessNAS, wiNAS formulates the search as a twostage process, alternating the update of model parameters (the weights), where the loss is defined as
L = L_CE + λ1‖w‖²₂    (2)
and the update of architecture parameters (the weight assigned to each operation in a given layer), where the loss that introduces the latency metric is defined as
L = L_CE + λ1‖w‖²₂ + λ2 E[latency]    (3)
where α are the architecture parameters and λ2 controls the impact of latency in the loss. The expected latency, E[latency], for a given layer is the weighted combination of the latency estimates of the candidate operations with their respective probabilities of being sampled. Intuitively, searching for Winograd convolutions with a high λ2 would result in faster models, potentially to the detriment of accuracy.
Unlike ProxylessNAS, wiNAS focuses solely on selecting the optimal convolution algorithm for each convolutional layer. Therefore, the set of candidate operations for a given conv2d layer contains im2row and Winograd-aware layers in their F2, F4 and F6 configurations. This search space is illustrated in Figure 3. Each candidate operation comes with its respective latency, which is a function of the output dimensions and quantization level.
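The expected-latency term used during the architecture update can be sketched as follows. The candidate set and latency values here are illustrative placeholders; in the real pipeline the latencies come from the hardware benchmark of Section 5.3:

```python
import numpy as np

def expected_latency(alpha, lat_ms):
    """E[latency] for one layer: softmax over the architecture parameters
    alpha gives each candidate's sampling probability; the expectation is
    the probability-weighted sum of per-candidate latency estimates."""
    p = np.exp(alpha - np.max(alpha))   # numerically stable softmax
    p /= p.sum()
    return float(p @ lat_ms)

# candidates: im2row, WA_F2, WA_F4, WA_F6 (latencies are made-up examples)
alpha = np.array([0.0, 0.5, 1.0, 0.2])
lat = np.array([9.0, 6.0, 4.5, 5.0])
print(expected_latency(alpha, lat))     # between min and max candidate latency
```

Because every candidate keeps a non-zero probability, the expectation is differentiable with respect to α, which is what lets the latency term be optimized by gradient descent.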
5 Experimental Setup
We conduct experiments grouped into three categories, each described in this section. We used PyTorch Paszke et al. (2017) for training and the Arm Compute Library for deployment.
5.1 Vanilla Winograd-aware networks
We begin our study of Winograd-aware networks by performing an extensive evaluation of the ResNet18 He et al. (2016) architecture on the CIFAR10 Krizhevsky (2009) dataset. In this experiment we train the network end-to-end using standard convolutions, and F2, F4 and F6 Winograd convolutions. For each Winograd experiment, all layers in the network use the same tile size, except the last two residual blocks, which are kept fixed to F2. The input convolutional layer uses standard convolutions. We run the experiments for FP32, INT16, INT10 and INT8 quantized networks, where both weights and activations are uniformly quantized (including all the intermediate outputs shown in Figure 2). We follow the per-layer symmetric quantization described in Krishnamoorthi (2018). We repeated each experiment while enabling the Winograd transforms G, Bᵀ and Aᵀ to be learnt, which we denote with the additional suffix flex.
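The per-layer symmetric scheme can be sketched as a fake-quantization helper. This is a simplified illustration in the spirit of Krishnamoorthi (2018), not the exact training code; the function name and the max-based scale choice are ours:

```python
import numpy as np

def fake_quantize(x, bits=8):
    """Per-layer symmetric uniform fake-quantization: values are rounded
    to a signed integer grid of 2^(bits-1)-1 levels per side, then rescaled
    back to float so training sees the quantization error."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for INT8
    scale = np.abs(x).max() / qmax      # one scale for the whole tensor
    if scale == 0:
        return x
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

x = np.linspace(-1.0, 1.0, 11)
xq = fake_quantize(x, bits=8)
# rounding error is bounded by half a quantization step
assert np.max(np.abs(x - xq)) <= 0.5 * np.abs(x).max() / 127 + 1e-12
```

In the Winograd-aware pipeline this operation would be applied not only to weights and activations but also to each intermediate output of Eq. 1.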
Winograd-aware layers do not require an over-parameterized model to perform well. We also varied the model size by using a width-multiplier, as used by the MobileNets family, ranging from 0.125 to 1.0, where a multiplier of 1.0 corresponds to the full ResNet18. This leads to models ranging between 215K and 11M parameters. Winograd-aware layers with learnable transformations marginally increase the model size, since the transforms themselves need to be saved for model deployment. We repeated the experiment for CIFAR100 Krizhevsky (2009), but without varying the width-multiplier. CIFAR100 is considerably more challenging than CIFAR10, as it is comprised of 100 classes with only 600 images per class.
Additionally, we use an INT8 LeNet Lecun et al. (1998), trained on the MNIST dataset, to evaluate the suitability of Winograd-aware layers with learnable transforms for 5×5 filters. This is a more challenging case than 3×3 filters, because a larger input tile is required (of side m + r − 1), with larger transformation matrices that require choosing more good polynomial points.
For experiments on ResNet18, we replace strided convolution layers with a max-pooling layer followed by a dense convolution layer. Altering the network in this way is necessary since there is no known equivalent of strided Winograd convolutions, which remain an open research question. This is a common strategy when evaluating Winograd Liu et al. (2018); Choi et al. (2018). We also reduced the number of output channels of the input layer from 64 to 32 to lower the memory peak during training. We use the Adam Kingma and Ba (2015) optimizer and train for 120 epochs. Both CIFAR10/100 use the same ResNet18 architecture, differing only in the number of outputs of the fully connected layer. Results for other architectures are shown in Appendix A.1.
5.2 wiNAS: Winograd-aware NAS
To evaluate wiNAS, we define two search spaces: wiNAS_WA and wiNAS_WAQ, both allowing each convolutional layer to be implemented with either im2row or one of the Winograd configurations F2, F4 or F6. The former uses a fixed bitwidth for all elements in the architecture, while the latter additionally introduces into the search space FP32, INT16 and INT8 candidates of each operation.
The hyperparameters used for wiNAS are as follows: for learning the model parameters we use mini-batch SGD with Nesterov momentum Sutskever et al. (2013). In the stage where we update the architecture parameters we instead use Adam with the first momentum coefficient, β1, set to zero, so the optimizer only updates paths that have been sampled. For both stages we use cosine annealing Loshchilov and Hutter (2017) scheduling and a batch size of 64. We perform the search for 100 epochs in each search space at different λ2 values ranging from 0.1 to 1e-3. Once the search is completed, we train the resulting architecture end-to-end with the same hyperparameters as the rest of the Winograd-aware networks.
5.3 Winograd convolutions on mobile CPUs
For our study, we chose Arm A73 and A53 cores on a Huawei HiKey 960 development board with the big.LITTLE CPU architecture. These cores are good candidates for validating the speedups achievable with Winograd convolutions on today's off-the-shelf mobile hardware.
CPU  Clock  L1  L2 

A73  2.4 GHz  64 KB  2048 KB 
A53  1.8 GHz  32 KB  512 KB 
While both the A73 and A53 are implemented as 16nm quad-core CPUs, the former is a high-performance processor while the latter is a high-efficiency processor. Table 2 summarises the main differences between these CPUs. Memory bandwidth is likely the primary factor limiting the speedup achievable by Winograd, since it requires operating on larger tiles than direct convolution algorithms such as im2row or im2col.
In our study, we measured the time taken for convolutions using im2row, im2col and each of the Winograd configurations (F2, F4, F6) when varying the output width/height and the number of channels. We performed the benchmark in controlled conditions and in single-thread mode. Each combination was run five times with a five-second delay in between to prevent thermal throttling. We implemented Winograd convolutions using GEMMs Maji et al. (2019), and performed the same experiment separately on the A73 and A53 for both FP32 and INT8. INT16 measurements are not currently supported in the Arm Compute Library.
6 Experimental Results
The results of this work are arranged in three subsections. First, we show that Winograd-aware networks can achieve high accuracy. Second, we present the results of our dense benchmark of Winograd convolutions on mobile CPUs. Third, we show that wiNAS can jointly optimize a given macroarchitecture for accuracy and latency.
6.1 Vanilla Winograd-aware networks
Figure 4 (left) shows that Winograd-aware networks in FP32 perform as well as networks using direct convolutions, with both fixed and learned (flex) transformation matrices. With quantization (all other plots), Winograd-aware layers are essential to enable the use of fast Winograd convolutions. This is not possible when switching to Winograd convolutions after training, as is commonly done in practice (see Table 1).
Furthermore, we show that learning the Winograd transforms (flex) results in considerably better accuracies for F4 and F6 in INT8 scenarios. We argue that this relaxation helps ease the numerical instability inherent to Winograd convolutions, which is further exacerbated by quantization. The accuracy of Winograd-aware models scales linearly with network width, suggesting that they can be exploited in conjunction with architecture compression techniques such as channel pruning.
Results from LeNet (5×5 filters) provide further evidence that larger tiles result in higher numerical error. In Figure 5, we show that even on relatively small datasets like MNIST, keeping the transformations G, Bᵀ and Aᵀ fixed leads to poor results as the output tile size is increased. This difference is almost 47% for the largest tile configuration.
Winograd-aware layers do not structurally modify the network architecture, since Winograd is just an algorithm to perform convolution. We demonstrate it is possible to transform a pre-trained model with standard convolutions into its Winograd-aware counterpart within a few epochs. Concretely, in Figure 6 we show that an INT8 ResNet18 can be adapted in 20 epochs of retraining from a model of the same network that was trained end-to-end with standard convolutions. This represents a substantial reduction in training time for Winograd-aware models. It is only possible when allowing the transformation matrices to evolve during training. Adapting FP32 models can be done in a single epoch.
We believe the performance of both F4 and F6 could be raised with alternative quantization implementations, closing the accuracy gap with F2 and direct convolutions.
6.2 Impact of Winograd on Latency
The speedups associated with the use of Winograd convolutions often only account for the pointwise stage, while assuming negligible costs for the input, weight and output transformations. Furthermore, they also assume that the larger the input patch, d, the larger the speedup compared to normal convolutions. Although these assumptions hold for large tensors, they are not necessarily true when working with tensors of the sizes often found in CNNs for image classification or object detection.
Figure 7 shows a small portion of the latency measurements obtained in our benchmark in FP32. An almost identical pattern appears when using 8-bit arithmetic. In Figure 8 we show the speedups that Winograd convolutions offer at different layers of a ResNet18 network. Our observations can be summarized in three points:
Input layers do not benefit from Winograd. This is primarily because the matrices in the element-wise GEMM stage are not large enough to compensate for the costs of transforming the input and output patches (see Figures 7 and 8). These transformations represent a significant portion (up to 65% and 75% on the A73 and A53, respectively) of the total cost of convolving an RGB input expanded to 32 channels. Similar ratios can be observed for other input sizes. In spite of this, the first layer accounts for a marginal portion of the total latency of the model, often below 1 ms.
The optimal tile size is a function of input width and height. For inputs with a sufficient number of channels, e.g. 32 channels and up, we observe a consistent pattern alternating between F4 and F6 as the output channel dimension increases. This alternation results from the input not dividing into an integer number of tiles, wasting computation around the matrix edges. The pattern is invariant to different configurations and fades away as the input dimensions grow, at which point F6 consistently becomes the fastest.
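The edge-waste effect behind this alternation can be illustrated with a short calculation (ours): when the output does not divide evenly into m×m tiles, the last row and column of tiles are padded and part of their work is thrown away.

```python
import math

def tile_waste(out_hw, m):
    """Fraction of wasted output positions when an out_hw x out_hw output
    is covered by m x m Winograd output tiles (edge tiles must be padded)."""
    tiles = math.ceil(out_hw / m)       # tiles needed per spatial dimension
    covered = (tiles * m) ** 2          # output positions actually computed
    return (covered - out_hw ** 2) / covered

# a 14x14 output divides evenly into 2x2 tiles but not into 4x4 or 6x6 ones
for m in (2, 4, 6):
    print(f"F{m}: {tile_waste(14, m):.1%} wasted")
```

Depending on how the layer's output size lines up with the tile grid, a nominally faster configuration can lose its advantage to this padding overhead, which is why the fastest tile size alternates with the spatial dimensions.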
Winograd transforms are costly. Excluding inputs with very few channels, the cost of performing the transformations to/from the Winograd domain can exceed 25% of the overall cost. These costs become negligible as the input width and height decrease, but the rate at which this happens also depends on the hardware. Our Winograd-aware pipeline formulation does not impose any constraints on how the transforms are learnt. This results in dense transforms (as opposed to the default transforms, which contain many zeros), and therefore applying them requires additional compute. Table 3 includes this latency overhead in models making use of learned transforms. In Appendix A.2 we provide more details on how dense transforms impact overall latency.
On the A53, the speedups from FP32 Winograd convolutions are smaller than on the A73. We argue this is a result of differences in the memory subsystem, limiting the lower-end CPU's ability to operate efficiently on larger tensors. These speedups grow significantly when leveraging INT8 arithmetic, made possible by Winograd-aware training. Concretely, INT8 Winograd substantially increases the speedup on the A53 compared to Winograd in FP32, as shown by the WA_F4 configurations in Table 3, at the cost of 1.1% accuracy on CIFAR10. For the more challenging CIFAR100 dataset, the drop in accuracy is more severe. However, our WA_F2 layers offer attractive speedups for INT8 with no drop in accuracy. We rely on wiNAS to minimize this degradation with small impact on latency.
6.3 wiNAS Networks
Table 3: Accuracy and measured latency of ResNet18 variants on CIFAR10/100, on Cortex-A53 and Cortex-A73 cores.

Conv. type   Bits (act./param.)   CIFAR10 (%)   CIFAR100 (%)   A53 latency (ms)   A73 latency (ms)
im2row       32 / 32              93.16         74.62          118                85
im2col       32 / 32              93.16         74.62          156                102
W_F2         32 / 32              93.16         74.60          126                56
W_F4         32 / 32              93.14         74.53          97                 46
WA_F2*       32 / 32              93.46         74.69          126                56
WA_F4        32 / 32              93.54         74.98
wiNAS_WA     32 / 32              93.35         74.71
im2row       8 / 8                93.20         74.11          117                54
im2col       8 / 8                93.20         74.11          124                59
WA_F2*       8 / 8                93.72         73.71          91                 38
WA_F4        8 / 8                92.46         72.38
wiNAS_WA     8 / 8                92.71         73.42
wiNAS_WAQ    auto                 92.89         73.88
Choosing the convolution algorithm that minimizes overall network latency can be done easily by inspecting the benchmark results. However, since the accuracy of Winograd convolutions degrades rapidly in reduced-precision networks, selecting the fastest algorithm for each layer without sacrificing accuracy is not straightforward.
When using wiNAS_WA, values of λ2 larger than 0.05 consistently result in models with the same layer configuration as WA_F4 (described in Section 5.1). When lowering the impact of latency in the Eq. 3 loss function, we observed that several Winograd-aware layers were replaced with either im2row or F2, at the cost of less than 9 ms of additional latency in the worst case, an INT8 model on the A53 for CIFAR100. These models reached similar accuracies in FP32 and higher accuracies in INT8 for both CIFAR10 and CIFAR100. Despite CIFAR100 models sacrificing more latency in order to recover accuracy, they remained faster than WA_F2 at INT8.
When introducing quantization in the search performed by wiNAS_WAQ, the accuracy gap is almost closed for CIFAR10 and further reduced for CIFAR100. This comes primarily as a result of relying on higher bitwidths for the first layers in the network. In both cases, we maintain attractive speedups compared to im2row and Winograd convolutions in FP32, especially on the A73. All the ResNet18 architectures optimized with wiNAS are described in Appendix A.3.
7 Discussion
In this section we present some of the challenges of training Winograd-aware networks and propose lines of future work.
A direct implementation of Eq. 1 requires saving the intermediate output of each matrix-matrix multiplication, since these are needed for backpropagation. This results in high memory usage. In this work we had to rely on gradient checkpointing Chen et al. (2016) to lower the memory peak during training, at the cost of additional computation. We believe a native CUDA implementation of the Winograd-aware layers with better memory reuse would ease this problem.
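A minimal example of the gradient-checkpointing workaround, using a plain convolutional block as a stand-in for a Winograd-aware layer:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Inside a checkpointed block only the inputs and outputs are stored;
# the block's intermediate activations are recomputed during backward,
# trading extra compute for a lower memory peak.
block = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 8, 3, padding=1))

x = torch.randn(2, 8, 16, 16, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # gradients flow through the checkpointed block
```

For a Winograd-aware layer, the tensors recomputed this way would be the transformed input, transformed weights and Hadamard product of Eq. 1.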
Learning larger models (with width-multipliers 0.75 and 1.0) proved challenging for F4 and F6 when introducing quantization. Using other types of quantization would likely help, in particular per-channel affine quantization, as in Jacob et al. (2018). Also, enabling different bitwidths throughout Eq. 1 could help mitigate the accuracy drop.
It is known that bad polynomial points for constructing the transformation matrices G, B and A introduce significant deviations in the result of computing Winograd convolutions compared to that of normal convolutions. We observed that good starting points are also important even when learning the Winograd transformations. Polynomial points specifically tailored for quantized Winograd could alleviate some of the degradation that we observed with increased tile size.
In this work we focused on mobile CPUs, but we expect these benefits to also apply to GPUs. However, to further maximize the speedups that Winograd-aware layers offer for quantized CNNs, a custom hardware implementation in the form of an accelerator would be preferable.
8 Conclusion
Running CNN-based applications that require standard convolutional layers is challenging on compute-constrained devices such as mobile CPUs. This paper presents Winograd-aware layers as the building block to combine the benefits of quantized networks and fast Winograd convolutions. We studied Winograd-aware layers with different tile sizes, three quantization levels and three popular datasets. We found that allowing the transformation matrices to evolve during training resulted in significantly better models. With wiNAS we leveraged Winograd-aware layers and latency metrics from off-the-shelf mobile CPUs to find architectures that minimize the numerical instability of Winograd. A Winograd-aware ResNet-18 quantized to INT8 offers considerably faster inference, at only a marginal accuracy drop, compared to existing Winograd implementations, which are limited to FP32. This network is also faster than an optimized im2row implementation using INT8 arithmetic.
Appendix A Appendix
A.1 Winograd-aware layers for other architectures
The results of our study of Winograd-aware networks presented in Section 6 covered multiple configurations of the ResNet-18 architecture at different width multipliers, bitwidths, quantization levels and convolution algorithms. Here, we present a similar analysis for two other popular architectures for image classification. We limit our study to the full models (i.e. mult=1.0). We show results for SqueezeNet Iandola et al. (2016) in Table 4 and for ResNeXt Xie et al. (2017) in Table 5. These results align with what was observed for ResNet-18: in the presence of quantization, learning the Winograd transformations (flex configurations) resulted in better performance than using the default (static) transformations. All experiments used the same hyperparameters as described in Section 5.
Table 4: SqueezeNet (mult=1.0) accuracy for different convolution types.

Conv. type   Bits (act. / param.)   WA trans.   CIFAR-10   CIFAR-100
im2row       32 / 32                —           91.13      69.06
WA_F2        32 / 32                static      91.31      69.42
WA_F2        32 / 32                flex        91.25      69.36
WA_F4        32 / 32                static      91.23      69.14
WA_F4        32 / 32                flex        91.41      69.32
im2row       8 / 8                  —           91.15      69.34
WA_F2        8 / 8                  static      90.88      70.06
WA_F2        8 / 8                  flex        91.03      70.18
WA_F4        8 / 8                  static      79.28      55.84
WA_F4        8 / 8                  flex        90.72      69.73
Table 5: ResNeXt (mult=1.0) accuracy for different convolution types.

Conv. type   Bits (act. / param.)   WA trans.   CIFAR-10   CIFAR-100
im2row       32 / 32                —           93.17      74.54
WA_F2        32 / 32                static      93.19      74.66
WA_F2        32 / 32                flex        93.08      74.58
WA_F4        32 / 32                static      93.24      74.47
WA_F4        32 / 32                flex        93.15      74.62
im2row       8 / 8                  —           93.40      74.89
WA_F2        8 / 8                  static      92.93      75.32
WA_F2        8 / 8                  flex        93.11      75.80
WA_F4        8 / 8                  static      76.73      51.20
WA_F4        8 / 8                  flex        93.29      75.35
For both architectures, INT8 Winograd-aware models with learnt Winograd transformations did not result in accuracy gaps as pronounced as the ones reported for ResNet-18 in Section 6. These models even surpass the im2row baselines for CIFAR-100. We argue this is because SqueezeNet and ResNeXt have fewer convolutional layers (8 and 6, respectively) than ResNet-18, which has 16. Therefore, the succession of fewer convolutional layers implemented as Winograd convolutions reduces the overall impact of numerical error.
A.2 Overhead of Learnt Winograd Transforms
The default Winograd transformation matrices contain varying amounts of zeros. For F2, the sparsity ratios are 50%, 33% and 25% for B^T, G and A^T, respectively. From the construction process of these matrices, and especially the choice of polynomial points, we would expect lower sparsity ratios as these transforms are adjusted for larger input patches. For example, for the default F4 transforms these ratios are 22%, 22% and 25%. For implementations of matrix-matrix multiplications that can exploit data sparsity, as is the case of Arm's Compute Library, having zeros means less compute, which often translates into lower latencies.
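The quoted F2 sparsity ratios can be checked directly from the standard F(2×2, 3×3) transforms of Lavin and Gray; a small sketch (the matrix values below are the commonly published defaults):

```python
import numpy as np

# Default F(2x2, 3x3) Winograd transforms (Lavin & Gray).
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def sparsity(m):
    """Fraction of exactly-zero entries in a matrix."""
    return (m == 0).mean()

for name, m in [("B^T", Bt), ("G", G), ("A^T", At)]:
    print(f"{name}: {sparsity(m):.0%}")
# prints: B^T: 50%, G: 33%, A^T: 25%
```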
The Winograd-aware formulation presented here does not impose restrictions on what the learnt transforms look like. As a consequence, the resulting transforms rarely contain zeros. This translates into additional compute for the input and output transforms. For WA_F4 ResNet-18 models running on a Cortex-A73, using dense, learnt transforms increases latency by 17% (+8 ms) in FP32 and 20% (+6 ms) in INT8. This increase in latency is higher on the Cortex-A53, since the Winograd transforms are proportionally more expensive on this core. These penalties represent the worst-case performance loss, assuming the transforms are compute-bound. However, we believe that, due to the access patterns of the Winograd transform kernels (gather and scatter across a wide area of memory), at least some of the cost of the transforms results from misses in the cache hierarchy, and so some additional computation can be tolerated without necessarily increasing execution time.
We note that the impact for models with larger tiles is considerably higher, especially since the original transforms G and A are not only sparse but binary, while the learnt ones are not. However, these penalties need never be paid in practice, since Winograd-aware models with the default transforms can perform as well as those with learnt transforms (as shown in Figure 4 and Tables 4 and 5), even in INT8.
Even with the performance loss due to the learnt transforms, we still demonstrate non-negligible speedups compared to INT8 im2row on both the A73 and the A53. To the best of our knowledge, this is the first time INT8 Winograd convolutions are empirically proven to work.
A.3 Architectures optimized with wiNAS
Our framework, wiNAS, takes a given macro-architecture and optimizes each convolutional layer by choosing between direct convolution and different Winograd configurations. For the search, all convolutions were fixed to use im2row.
For wiNAS_WA in FP32, the resulting architecture only substituted the last convolutional layer, using im2row instead of F4. The rest of the layers remained unchanged from the WA_F4 configuration (described in Section 5.1). The same micro-architecture was used for both CIFAR-10 and CIFAR-100.
For wiNAS_WA with 8-bit quantization and CIFAR-10, wiNAS replaced the 5th and second-to-last convolutional layers with im2row. For CIFAR-100, more layers were modified compared to WA_F4. The resulting micro-architecture is shown in Figure 9 (left).
When introducing quantization into the search space (wiNAS_WAQ), the resulting architectures are shown in Figure 9 for both CIFAR-10 (middle) and CIFAR-100 (right).
Footnotes
 This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732204 (Bonseyes). This work is supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 16.0159. The opinions expressed and arguments employed herein do not necessarily reflect the official views of these funding bodies.
 General multiplications is a term commonly used in Winograd jargon, referring to the elementwise or Hadamard product stage.
 See Section 5.2 in Blahut (2010) for a stepbystep example.
 We refer the interested reader to Barabasz et al. (2018) for an analysis on the nature of the errors in Winograd convolutions.
References
 Accelerating convolutional neural network with fft on embedded hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26 (9), pp. 1737–1749. External Links: Document, ISSN 10638210 Cited by: §2.
 A systematic study of binary neural networks’ optimisation. In International Conference on Learning Representations, Cited by: §2.
 Optimal dnn primitive selection with partitioned boolean quadratic programming. Proceedings of the 2018 International Symposium on Code Generation and Optimization  CGO 2018. External Links: ISBN 9781450356176, Document Cited by: §1.
 Arm compute library. Note: https://developer.arm.com/ip-products/processors/machine-learning/compute-library Accessed: 2019-08-01 Cited by: §3.1.
 Improving accuracy of winograd convolution for dnns. CoRR abs/1803.10986. Cited by: §2, footnote 3.
 Winograd convolution for dnns: beyond linear polynomials. External Links: 1905.05233 Cited by: §2.
 Understanding and simplifying oneshot architecture search. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 550–559. Cited by: §2.
 Fast algorithms for signal processing. Cambridge University Press. External Links: Document Cited by: footnote 2.
 Approximating CNNs with bagoflocalfeatures models works surprisingly well on imagenet. In International Conference on Learning Representations, Cited by: §2.
 Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §1.
 SMASH: oneshot model architecture search through hypernetworks. In International Conference on Learning Representations, Cited by: §2.
 ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations, Cited by: §2, §4.1.
 Deep positron: a deep neural network using the posit number system. 2019 Design, Automation and Test in Europe Conference and Exhibition (DATE). External Links: ISBN 9783981926323, Document Cited by: §2.
 Training deep nets with sublinear memory cost. External Links: 1604.06174 Cited by: §7.
 Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. External Links: 1904.05049 Cited by: §2.
 CuDNN: efficient primitives for deep learning. CoRR abs/1410.0759. Cited by: §2, §3.1.
 Compression of deep convolutional neural networks under joint sparsity constraints. External Links: 1805.08303 Cited by: §5.1.
 Visual wake words dataset. External Links: 1906.05721 Cited by: §1.
 Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning – ICANN 2014, S. Wermter, C. Weber, W. Duch, T. Honkela, P. KoprinkovaHristova, S. Magg, G. Palm and A. E. P. Villa (Eds.), Cham, pp. 281–290. External Links: ISBN 9783319111797 Cited by: §2.
 BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 3123–3131. Cited by: §2.
 SpArSe: Sparse Architecture Search for CNNs on ResourceConstrained Microcontrollers. In Proceedings of the Neural Information Processing Systems (NeurIPS) Conference 2019, Cited by: §2.
 ImageNettrained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.. In International Conference on Learning Representations, Cited by: §2.
 Differentiable soft quantization: bridging fullprecision and lowbit neural networks. External Links: 1908.05033 Cited by: §2.
 Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781467388511, Document Cited by: §2, §5.1.
 AMC: automl for model compression and acceleration on mobile devices. Lecture Notes in Computer Science, pp. 815–832. Cited by: §2.
 Designing neural network hardware accelerators with decoupled objective evaluations. In NIPS workshop on Bayesian Optimization, Cited by: §2.
 Very efficient training of convolutional neural networks using fast fourier transform and overlapandadd. Procedings of the British Machine Vision Conference 2015. External Links: ISBN 1901725537, Document Cited by: §2.
 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International SolidState Circuits Conference Digest of Technical Papers (ISSCC), Vol. , pp. 10–14. External Links: Document, ISSN 01936530 Cited by: §1.
 MobileNets: efficient convolutional neural networks for mobile vision applications. External Links: 1704.04861 Cited by: §1, §2.
 Searching for mobilenetv3. External Links: 1905.02244 Cited by: §1.
 SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. External Links: 1602.07360 Cited by: §A.1.
 Quantization and training of neural networks for efficient integerarithmeticonly inference. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: ISBN 9781538664209, Document Cited by: §1, §2, §7.
 A study of bfloat16 for deep learning training. External Links: 1905.12322 Cited by: §2.
 Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §5.1.
 Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR abs/1806.08342. External Links: 1806.08342 Cited by: §1, §2, §5.1.
 Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.1, §5.1.
 The complexity of a scheme of functional elements realizing the multiplication of integers. Doklady Akademii Nauk SSSR 3. Cited by: §1, §2.
 The deep (learning) transformation of mobile and embedded computing. Computer 51 (5), pp. 12–16. External Links: Document, ISSN 00189162 Cited by: §1.
 Fast algorithms for convolutional neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781467388511, Document Cited by: §2, §2, §3.1.
 Gradientbased learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324. Cited by: §5.1.
 MobiSR: efficient ondevice superresolution through heterogeneous mobile processors. In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom '19, New York, NY, USA. External Links: ISBN 9781450361699, Document Cited by: §1.
 Ternary weight networks. External Links: 1605.04711 Cited by: §2.
 OnChip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators. In 2019 56th ACM/IEEE Design Automation Conference (DAC), Vol. , pp. 1–6. External Links: Document, ISSN 0738100X Cited by: §1.
 Neural networks on microcontrollers: saving memory at inference via operator reordering. External Links: 1910.05110 Cited by: §1.
 Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353. Cited by: §2.
 DARTS: differentiable architecture search. In International Conference on Learning Representations, Cited by: §2.
 Efficient sparsewinograd convolutional neural networks. In International Conference on Learning Representations, Cited by: §2, §5.1.
 SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR) 2017 Conference Track, Cited by: §5.2.
 Efficient winograd or cooktoom convolution kernel implementation on widely used mobile cpus. External Links: 1903.01521 Cited by: §2, §5.3.
 Fast training of convolutional networks through ffts. External Links: 1312.5851 Cited by: §2.
 Efficient winograd convolution via integer arithmetic. CoRR abs/1901.01965. Cited by: §2.
 Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §5.
 XNORnet: imagenet classification using binary convolutional neural networks. CoRR abs/1603.05279. External Links: 1603.05279 Cited by: §2.
 Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 4780–4789. External Links: Document Cited by: §2.
 MobileNetV2: inverted residuals and linear bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: ISBN 9781538664209, Document Cited by: §1.
 Very deep convolutional networks for largescale image recognition. In International Conference on Learning Representations, Cited by: §2.
 Singlepath mobile automl: efficient convnet design and nas hyperparameter optimization. External Links: 1907.00959 Cited by: §1.
 Gaussian elimination is not optimal. Numer. Math. 13 (4), pp. 354–356. External Links: ISSN 0029599X, Document Cited by: §2.
 On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 1139–1147. Cited by: §5.2.
 Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. External Links: ISSN 15582256, Document Cited by: §1, §1.
 Gatedscnn: gated shape cnns for semantic segmentation. arXiv preprint arXiv:1907.05740. Cited by: §1.
 MnasNet: platformaware neural architecture search for mobile. In CVPR, Cited by: §2.
 EfficientNet: rethinking model scaling for convolutional neural networks. External Links: 1905.11946 Cited by: §1.
 StrassenNets: deep learning with a multiplication budget. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 4985–4994. Cited by: §2.
 On improving the numerical stability of winograd convolutions. In International Conference on Learning Representations (Workshop track), Cited by: §2.
 TBN: convolutional neural network with ternary inputs and binary weights. In ECCV, Cited by: §2.
 HAQ: hardwareaware automated quantization with mixed precision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2.
 DNN Engine: A 28nm TimingError Tolerant Sparse Deep Neural Network Processor for IoT Applications. IEEE Journal of SolidState Circuits 53 (9), pp. 2722–2731. External Links: Document, ISSN 1558173X Cited by: §1.
 A 16nm 25mm2 SoC with a 54.5x FlexibilityEfficiency Range from DualCore Arm CortexA53 to eFPGA and CacheCoherent Accelerators. In 2019 Symposium on VLSI Circuits, Vol. , pp. C34–C35. External Links: Document, ISSN 21585601 Cited by: §1.
 FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning. External Links: 1902.11128 Cited by: §1.
 Signal processing and complexity of computation. In ICASSP ’80. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. 94–101. External Links: Document, ISSN Cited by: §1.
 Arithmetic complexity of computations. Society for Industrial and Applied Mathematics. External Links: Document Cited by: §2.
 Binary deep neural networks for speech recognition. In Proc. Interspeech 2017, pp. 533–537. External Links: Document Cited by: §2.
 Aggregated residual transformations for deep neural networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995. Cited by: §A.1.
 Image superresolution using very deep residual channel attention networks. CoRR abs/1807.02758. External Links: 1807.02758 Cited by: §1.
 Neural architecture search with reinforcement learning. In International Conference on Learning Representations, Cited by: §2.