Network Deconvolution
Abstract
Convolution is a central operation in Convolutional Neural Networks (CNNs), which applies a kernel to overlapping regions shifted across the image. However, because of the strong correlations in real-world image data, convolutional kernels are in effect re-learning redundant data. In this work, we show that this redundancy has made neural network training challenging, and propose network deconvolution, a procedure which optimally removes pixelwise and channelwise correlations before the data is fed into each layer. Network deconvolution can be efficiently calculated at a fraction of the computational cost of a convolution layer. We also show that the deconvolution filters in the first layer of the network resemble the center-surround structure found in biological neurons in the visual regions of the brain. Filtering with such kernels results in a sparse representation, a desired property that has been missing in the training of neural networks. Learning from the sparse representation promotes faster convergence and superior results without the use of batch normalization. We apply our network deconvolution operation to modern neural network models by replacing batch normalization within each. Extensive experiments show that the network deconvolution operation is able to deliver performance improvement in all cases on the CIFAR10, CIFAR100, MNIST, FashionMNIST, Cityscapes, and ImageNet datasets.
1 Introduction
Images of natural scenes that the human eye or camera captures contain adjacent pixels that are statistically highly correlated (Olshausen and Field, 1996; Hyvärinen et al., 2009), which can be compared to the correlations introduced by blurring an image with a Gaussian kernel. We can think of images as being convolved by an unknown filter (Figure 1). The correlation effect complicates object recognition tasks and makes neural network training challenging, as adjacent pixels contain redundant information.
It has been discovered that visual correlation removal processes exist in animal brains. Visual neurons called retinal ganglion cells and lateral geniculate nucleus cells have developed “Mexican hat”-like circular center-surround receptive field structures to reduce visual information redundancy, as found in Hubel and Wiesel’s famous cat experiment (Hubel and Wiesel, 1961, 1962). Furthermore, it has been argued that data compression is an essential and fundamental processing step in natural brains, which inherently involves removing redundant information and only keeping the most salient features (Richert et al., 2016).
In this work, we introduce network deconvolution, a method to reduce redundant correlation in images. Mathematically speaking, a correlated signal is generated by a convolution: $b = k * x = Kx$ (as illustrated in Fig. 1 right), where $k$ is the kernel and $K$ is the corresponding convolution matrix. The purpose of network deconvolution is to remove the correlation effects via: $x = K^{-1} b$, assuming $K$ is an invertible matrix.
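To make the idea of inverting a convolution matrix concrete, here is a minimal NumPy sketch. It is an illustration only, not the procedure used in the paper: the 1D signal, the circular boundary handling, the helper name `blur_matrix`, and the specific kernel values are all assumptions chosen so that the matrix is invertible.

```python
import numpy as np

def blur_matrix(kernel, n):
    """Return the n x n matrix K with K @ x equal to circular filtering of x by `kernel`."""
    K = np.zeros((n, n))
    half = len(kernel) // 2
    for i in range(n):
        for j, weight in enumerate(kernel):
            K[i, (i + j - half) % n] += weight
    return K

rng = np.random.default_rng(0)
x = rng.standard_normal(16)            # uncorrelated "sharp" signal (hypothetical data)
kernel = np.array([0.2, 0.6, 0.2])     # symmetric blur, chosen so that K is invertible
K = blur_matrix(kernel, len(x))

b = K @ x                              # correlated (blurred) signal
x_recovered = np.linalg.solve(K, b)    # deconvolution: apply K^{-1} without forming it

print(np.allclose(x_recovered, x))
```

In practice the blur kernel is unknown, which is why the method estimates the correlation structure from the data instead of inverting a known matrix.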
Image data being fed into a convolutional network (CNN) exhibit two types of correlations. Neighboring pixels in a single image or feature map have high pixelwise correlation. Similarly, in the case of different channels of a hidden layer of the network, there is a strong correlation or “crosstalk” between these channels; we refer to this as channelwise correlation. The goal of this paper is to show that both kinds of correlation or redundancy hamper effective learning. Our network deconvolution attempts to remove both correlations in the data at every layer of a network.
Our contributions are the following:

We introduce network deconvolution, a decorrelation method to remove both the pixelwise and channelwise correlation at each layer of the network.

Our experiments show that deconvolution can replace batch normalization as a generic procedure in a variety of modern neural network architectures with better model training.

We prove that this method is the optimal transform if considering $L_2$ optimization.

Deconvolution has been a misnomer in convolution architectures. We demonstrate that network deconvolution is indeed a deconvolution operation.

We show that network deconvolution reduces redundancy in the data, leading to sparse representations at each layer of the network.

We propose a novel implicit deconvolution and subsampling based acceleration technique allowing the deconvolution operation to be done at a cost fractional to the corresponding convolution layer.

We demonstrate the network deconvolution operation improves performance in comparison to batch normalization on the CIFAR10, CIFAR100, MNIST, FashionMNIST, Cityscapes, and ImageNet datasets, using 10 modern neural network models.
2 Related Work
2.1 Normalization and Whitening
Since its introduction, batch normalization has been the main normalization technique (Ioffe and Szegedy, 2015) to facilitate the training of deep networks using stochastic gradient descent (SGD). Many techniques have been introduced to address cases for which batch normalization does not perform well. These include training with a small batch size (Wu and He, 2018), and on recurrent networks (Salimans and Kingma, 2016). However, to our best knowledge, none of these methods has demonstrated improved performance on the ImageNet dataset.
In the signal processing community, our network deconvolution could be referred to as a whitening deconvolution. There have been multiple complicated attempts to whiten the feature channels and to utilize second-order information. For example, authors of (Martens and Grosse, 2015; Desjardins et al., 2015) approximate second-order information using the Fisher information matrix. There, the whitening is carried out interwoven into the back propagation training process.
The correlation between feature channels has recently been found to hamper the learning. Simple approximate whitening can be achieved by removing the correlation in the channels. Channelwise decorrelation was first proposed in the backward fashion (Ye et al., 2017). Equivalently, this procedure can also be done in the forward fashion by a change of coordinates (Ye et al., 2018; Huang et al., 2018, 2019). However, none of these methods has captured the nature of the convolution operation, which specifically deals with the pixels. Instead, these techniques are most appropriate for the standard linear transform layers in fullyconnected networks.
2.2 Deconvolution of DNA Sequencing Signals
Similar correlation issues also exist in the context of DNA sequencing (Ye et al., 2014) where many DNA sequencers start with analyzing correlated signals. There is a crosstalk effect between different sensor channels, and signal responses of one nucleotide base spread into its previous and next nucleotide bases. As a result the sequencing signals display both channel correlation and pixel correlation. A blind deconvolution technique was developed to estimate and remove kernels to recover the unblurred signal.
3 Motivations
3.1 Suboptimality of Existing Training Methods
Deep neural network training has been a challenging research topic for decades. Over the past decade, with the regained popularity of deep neural networks, many techniques have been introduced to improve the training process. However, most of these methods are suboptimal even for the most basic linear regression problems.
Assume we are given a linear regression problem with $L_2$ loss (Eq. 1). In a typical setting, the output $y = Xw$ is given by multiplying the inputs $X$ with an unknown weight matrix $w$, which we are solving for. In our paper, with a slight abuse of notation, $X$ can be the $N \times F$ data matrix $X$ or the augmented data matrix $(X \mid 1)$.
$$\mathrm{Loss}_{L_2} = \frac{1}{2N}\,\|y - \hat{y}\|^2 = \frac{1}{2N}\,\|Xw - \hat{y}\|^2 \qquad (1)$$
Here $\hat{y}$ is the response data to be regressed. With neural networks, gradient descent iterations (and their variants) are used to solve the above problem. To conduct one iteration of gradient descent on Eq. 1, we have:
$$w_{\mathrm{new}} = w_{\mathrm{old}} - \alpha\,\frac{1}{N}\left(X^T X\, w_{\mathrm{old}} - X^T \hat{y}\right) \qquad (2)$$
Here $\alpha$ is the step length or learning rate. Basic numerical experiments tell us these iterations can take long to converge. Normalization techniques, which are popular in training neural networks, are beneficial, but are generally not optimal for this simple problem. If methods are suboptimal for the simplest linear problems, they are less likely to be optimal for more general problems. Our motivation is to find and apply what is linearly optimal to the more challenging problem of network training.
For the regression problem, an optimal solution can be found by setting the gradient to zero: $\frac{\partial\, \mathrm{Loss}_{L_2}}{\partial w} = \frac{1}{N} X^T (Xw - \hat{y}) = 0$, which gives
$$w = (X^T X)^{-1} X^T \hat{y} \qquad (3)$$
Here, we ask a fundamental question: When can gradient descent converge in one single iteration?
Proposition 1.
Gradient descent converges to the optimal solution in one iteration if $\frac{1}{N} X^T X = I$.
Proof.
Substituting $\frac{1}{N} X^T X = I$ into the optimal solution (Eq. 3) we have $w = \frac{1}{N} X^T \hat{y}$.
On the other hand, substituting the same condition with $\alpha = 1$ in Eq. 2 we have $w_{\mathrm{new}} = \frac{1}{N} X^T \hat{y}$. ∎
Since gradient descent converges in one single step, the above proof gives us the optimality condition.
$\frac{1}{N} X^T X$ calculates the covariance matrix of the features. The optimal condition suggests that the features should be standardized and uncorrelated with each other. When this condition does not hold, the gradient direction does not point to the optimal solution. In fact, the more correlated the data, the slower the convergence (Richardson, 1911). This problem could be handled equivalently (Section A.8) either by correcting the gradient by multiplying the matrix $(\frac{1}{N} X^T X)^{-1}$, or by a change of coordinates $X \to Z$ so that in the new space we have $\frac{1}{N} Z^T Z = I$. This paper applies the latter method for training convolutional networks.
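The one-step convergence claim can be checked numerically. The sketch below is an illustration under stated assumptions: the synthetic data, the mixing matrix `mix`, and the helper `gd_step` are hypothetical, and the change of coordinates uses an exact eigendecomposition to make the feature covariance the identity. It takes a single gradient step with learning rate 1 from zero weights and compares the result with the least-squares solution, both in the original and in the whitened coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 1000, 5
mix = np.eye(F) + 0.5                        # mixing matrix that correlates the features
X = rng.standard_normal((N, F)) @ mix
y_hat = X @ rng.standard_normal(F) + 0.1 * rng.standard_normal(N)

def gd_step(X, y_hat, w, alpha=1.0):
    """One gradient descent step on the loss ||X w - y_hat||^2 / (2N)."""
    return w - alpha * X.T @ (X @ w - y_hat) / len(X)

# Change of coordinates Z = X C^{-1/2}, so that Z^T Z / N = I.
cov = X.T @ X / N
evals, evecs = np.linalg.eigh(cov)
Z = X @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

w_star_Z = np.linalg.lstsq(Z, y_hat, rcond=None)[0]   # optimum in the new coordinates
w_star_X = np.linalg.lstsq(X, y_hat, rcond=None)[0]   # optimum in the original coordinates
w_one_Z = gd_step(Z, y_hat, np.zeros(F))              # one step on whitened features
w_one_X = gd_step(X, y_hat, np.zeros(F))              # one step on correlated features

print(np.allclose(w_one_Z, w_star_Z, atol=1e-6))
print(np.allclose(w_one_X, w_star_X, atol=1e-6))
```

The first comparison succeeds because the whitened features satisfy the optimality condition; the second fails because the correlated features do not.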
3.2 Need of Support for Convolutions
Even though normalization methods were developed for training convolutional networks and have been found successful, these methods are more suitable for nonconvolutional operations. Existing methods normalize features by channel or by layer, irrespective of whether the underlying operation is convolutional or not. We will show in section 4.1 that if the underlying operation is a convolution, this usually implies a strong violation of the optimality condition.
3.3 A Neurological Basis for Deconvolution
Many receptive fields in the primate visual cortex exhibit center-surround type behavior. Some receptive fields, called on-center cells, respond maximally when a stimulus is given at the center of the receptive field and a lack of stimulus is given in a circle surrounding it. Some others, called off-center cells, respond maximally in the reversed way, when a lack of stimulus is at the center and the stimulus is given in a circle surrounding it (Fig. 9) (Hubel and Wiesel, 1961, 1962). It is well-understood that these center-surround fields form the basis of the simple cells in the primate V1 cortex.
If the center-surround structures are beneficial for learning, one might expect such structures to be learned in the network training process. As shown in the proof above, the minimum is the same with or without such structures, so gradient descent has no incentive to develop a faster solution, and such structures need to be developed externally. As shown later in Fig. 2, our deconvolution kernels strongly resemble center-surround filters like those in nature.
4 The Deconvolution Operation
4.1 The Matrix Representation of a Convolution Layer
The standard convolution filtering $x * \mathrm{kernel}$ can be formulated into one large matrix multiplication $Xw$ (Fig. 3). In the 2-dimensional case, $w$ is the flattened 2D kernel. The first column of $X$ corresponds to the flattened image patch $x[1{:}H-k+1,\ 1{:}W-k+1]$, where $H \times W$ is the size of $x$ and $k$ is the side length of the kernel. Neighboring columns correspond to shifted patches of $x$: $X[:,2] = \mathrm{vec}(x[1{:}H-k+1,\ 2{:}W-k+2])$, and so on up to $X[:,k^2] = \mathrm{vec}(x[k{:}H,\ k{:}W])$. A commonly used function, $\mathrm{im2col}$, has been designed for this operation. Since the columns of $X$ are constructed by shifting large patches of $x$ by one pixel, the columns of $X$ are heavily correlated with each other, which strongly violates the optimality condition. This violation slows down the training algorithm (Richardson, 1911), and cannot be addressed by normalization methods (Ioffe and Szegedy, 2015).
For a regular convolution layer in a network, we generally have multiple input feature channels and multiple kernels in a layer. We call $\mathrm{im2col}$ in each channel, and horizontally concatenate the resulting data matrices from each individual channel to construct the full data matrix $X$, then vectorize and concatenate all the kernels to get $w$. Matrix-vector multiplication is used to calculate the output $y$, which is then reshaped into the output shape of the layer. Similar constructions can also be developed for specific convolution layers such as the grouped convolution, where we carry out such constructions for each group. Other scenarios such as when the group number equals the channel number (channelwise conv) or when $k = 1$ ($1 \times 1$ conv) can be considered as special cases.
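The data-matrix construction can be sketched in a few lines of NumPy for a single channel. This is an illustrative stand-in, not a library routine: the helper `im2col` below, the synthetic smooth image, and the 3 x 3 kernel are assumptions. The sketch checks that the matrix product reproduces sliding-window filtering and measures how strongly the columns of the data matrix correlate.

```python
import numpy as np

def im2col(x, k):
    """Stack the k*k shifted views of a 2D array x as columns of a data matrix."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = [x[i:i + out_h, j:j + out_w].reshape(-1)
            for i in range(k) for j in range(k)]
    return np.stack(cols, axis=1)             # shape: (out_h * out_w, k * k)

rng = np.random.default_rng(0)
p, q = np.mgrid[0:32, 0:32]
# Hypothetical smooth "image": slowly varying pattern plus a little noise.
x = np.sin(p / 5.0) + np.cos(q / 7.0) + 0.05 * rng.standard_normal((32, 32))
w = rng.standard_normal((3, 3))

X = im2col(x, 3)
y = (X @ w.reshape(-1)).reshape(30, 30)       # filtering as one matrix-vector product

# Reference: slide the kernel over the image explicitly.
y_ref = np.array([[np.sum(x[r:r + 3, c:c + 3] * w) for c in range(30)]
                  for r in range(30)])

corr = np.corrcoef(X, rowvar=False)           # 9 x 9 correlation between columns of X
print(np.allclose(y, y_ref), corr.min())
```

For a multi-channel layer, the same helper would be applied per channel and the resulting matrices concatenated horizontally, as described above.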
In Fig. 3 (top right) we show as an illustrative example the resulting calculated covariance matrix of a sample data matrix $X$ in the first layer of a VGG network (Simonyan and Zisserman, 2014) taken from one of our experiments. The first layer is a $3 \times 3$ convolution that mixes RGB channels. The total dimension of the weights is $27$, the corresponding covariance matrix is $27 \times 27$. The diagonal blocks correspond to the pixelwise correlation within $3 \times 3$ neighborhoods. The off-diagonal blocks correspond to correlation of pixels across different channels. We have empirically seen that natural images demonstrate stronger pixelwise correlation than cross-channel correlation, as the diagonal blocks are brighter than the off-diagonal blocks.
4.2 The Deconvolution Operation
Once the covariance matrix has been calculated, an inverse correction can be applied. It is beneficial to conduct the correction in the forward way for numerical accuracy and because the gradient of the correction can also be included in the gradient descent training.
Given a data matrix $X_{N \times F}$ as described above in section 4.1, where $N$ is the number of samples, and $F$ is the number of features, we calculate the covariance matrix $\mathrm{Cov} = \frac{1}{N}(X - \mu)^T (X - \mu)$, where $\mu$ is the mean of the rows of $X$.
We then calculate an approximated inverse square root of the covariance matrix $D = \mathrm{Cov}^{-1/2}$ (see section 4.5.3) and multiply this with the centered vectors $(X - \mu) \cdot D$. In a sense, we remove the correlation effects both pixelwise and channelwise. If computed perfectly, the transformed data has the identity matrix as covariance: $\frac{1}{N} D^T (X - \mu)^T (X - \mu) D = \mathrm{Cov}^{-1/2} \cdot \mathrm{Cov} \cdot \mathrm{Cov}^{-1/2} = I$.
Algorithm 1 describes the process to construct $X$ and $D \approx (\mathrm{Cov} + \epsilon \cdot I)^{-1/2}$. Here $\epsilon \cdot I$ is introduced to improve stability. We then apply the deconvolution operation via matrix multiplication to remove the correlation between neighboring pixels and across different channels. The deconvolved data is then multiplied with $w$. The full equation becomes $y = (X - \mu) \cdot D \cdot w + b$, or simply $y = X \cdot D \cdot w$ if $X$ is the augmented data matrix (Fig. 3).
We denote the deconvolution operation in the $i$-th layer as $D_i$. Hence, the input to the next layer $x_{i+1}$ is:
$$x_{i+1} = f_i \circ W_i \circ D_i \circ x_i \qquad (4)$$
where $\circ$ is the (right) matrix multiplication operation, $x_i$ is the input coming from the $i$-th layer, $D_i$ is the deconvolution operation on that input, $W_i$ is the weights in the layer, and $f_i$ is the activation function.
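The deconvolution of a data matrix described in this section can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function name `deconvolve`, the value of `eps`, and the synthetic data are assumptions, and the inverse square root is computed exactly by eigendecomposition as a stand-in for the iterative solver discussed later in the paper.

```python
import numpy as np

def deconvolve(X, eps=1e-5):
    """Center X (N x F) and multiply by D = (Cov + eps * I)^(-1/2)."""
    mu = X.mean(axis=0, keepdims=True)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0]
    # Exact inverse square root via eigendecomposition (stand-in for an iterative solver).
    evals, evecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    D = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return Xc @ D, D, mu

rng = np.random.default_rng(0)
# Hypothetical data matrix with correlated, non-centered features.
X = rng.standard_normal((2000, 6)) @ (np.eye(6) + 0.5) + 3.0
X_dec, D, mu = deconvolve(X)

cov_after = X_dec.T @ X_dec / X.shape[0]
print(np.abs(cov_after - np.eye(6)).max())    # deviation from the identity covariance
```

The small regularizer keeps the inverse square root well defined when the covariance is close to singular, at the price of a covariance that is only approximately the identity.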
4.3 Justifications
On Naming the Method Deconvolution
The name network deconvolution has also been used in inverting the convolution effects in biological networks (Feizi et al., 2013). We prove that our operation is indeed a generalized deconvolution operation.
Proposition 2.
Removal of pixelwise correlation (or patchbased whitening) is a deconvolution operation.
Proof.
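Let $\delta$ be the delta kernel, $x$ be an arbitrary signal, and $X = \mathrm{im2col}(x)$.
$$x = x * \delta = X \cdot \delta = X \cdot \mathrm{Cov}^{-1/2} \cdot \mathrm{Cov}^{1/2} \cdot \delta = X \cdot \mathrm{Cov}^{-1/2} \cdot k_{\mathrm{cov}} \qquad (5)$$
The above equations show that the deconvolution operation negates the effects of the convolution using kernel $k_{\mathrm{cov}} = \mathrm{Cov}^{1/2} \cdot \delta$. ∎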
The Deconvolution Kernel
The deconvolution kernel can be found as $\mathrm{Cov}^{-1/2} \cdot \mathrm{vec}(\delta)$, where $\mathrm{vec}(\cdot)$ is the Vectorize function and $\delta$ is the delta kernel, or equivalently by slicing the middle row/column of $\mathrm{Cov}^{-1/2}$ and reshaping it into the kernel size. We visualize the deconvolution kernel from random images from the ImageNet dataset. The kernels indeed show center-surround structure, which coincides with the biological observation (Fig. 2). The filter in the green channel is an on-center cell while the other two are off-center cells (Hubel and Wiesel, 1961, 1962).
4.4 Optimality
Motivated by the ubiquity of center-surround and lateral-inhibition mechanisms in biological neural systems, we now ask if removal of redundant information, in a manner like our network deconvolution, is an optimal procedure for learning in neural networks.
Optimizations
There is a classic kernel estimation problem (Cho and Lee, 2009; Ye et al., 2014) that requires solving for the kernel $w$ given the input data $X$ and the blurred output data $y$. Because $X$ violates the optimality condition, it takes tens or hundreds of gradient descent iterations to converge to a close enough solution. We have demonstrated that our deconvolution processing is optimal for the kernel estimation, in contrast to all other normalization methods.
On Training Neural Networks
Training convolutional neural networks is analogous to a series of kernel estimation problems, where we have to solve for the kernels in each layer. Network deconvolution has a favorable near-optimal property for training neural networks.
For simplicity, we assume the activation function is a sample-variant matrix multiplication throughout the network. The popular ReLU (Nair and Hinton, 2010) activation falls into this category. Let $W$ be the linear transform/convolution in a certain layer, $A$ the inputs to the layer, $B$ the operation from the output of the current layer to the output of the last deconvolution operation in the network. The computation of such a network can be formulated as: $y = AWB$.
Proposition 3.
Network deconvolution is near-optimal if $W$ is an orthogonal transform and if we connect the output $y$ to the $L_2$ loss.
Rewriting the matrix equation from the previous subsection using the Kronecker product, we get: $\mathrm{vec}(y) = \mathrm{vec}(AWB) = (B^T \otimes A)\,\mathrm{vec}(W) = X\,\mathrm{vec}(W)$, where $\otimes$ is the Kronecker product. According to the discussion in the previous subsection, gradient descent is optimal if $X = (B^T \otimes A)$ satisfies the orthogonality condition. In a network trained with deconvolution, the input $A$ is orthogonal for each batch of data. If $B$ is also orthogonal, then according to basic properties of the Kronecker product (Golub and van Loan, 2013)(Ch 12.3.1), $(B^T \otimes A)$ is also orthogonal. $y$ is orthogonal since it is the output of the last deconvolution operation in our network. If we assume $W$ is an orthogonal transform, then $B$ transforms orthogonal inputs to orthogonal outputs, and is approximately orthogonal.
A slight loss of optimality is incurred since we do not enforce $W$ to be orthogonal. But the gain here is that the network is unrestricted and is promised to be as powerful as any standard network. On the other hand, it is worth mentioning that many practical loss functions such as the cross entropy loss have similar shapes to the $L_2$ loss.
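The Kronecker-product argument rests on two standard facts that are easy to check numerically: the vectorization identity vec(AWB) = (B^T ⊗ A) vec(W), and the fact that the Kronecker product of matrices with orthonormal columns again has orthonormal columns. The sketch below is only an illustration; the names `A`, `W`, `B` and the matrix shapes are assumptions, with `A` standing for a layer input, `W` for the layer weights, and `B` for the remaining operation.

```python
import numpy as np

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # 8 x 3, orthonormal columns
B, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # 4 x 4, orthogonal
W = rng.standard_normal((3, 4))

def vec(M):
    """Column-stacking vectorization."""
    return M.reshape(-1, order="F")

X_kron = np.kron(B.T, A)                           # 32 x 12 combined data matrix
identity_holds = np.allclose(vec(A @ W @ B), X_kron @ vec(W))
gram = X_kron.T @ X_kron                           # identity when A and B are orthogonal

print(identity_holds, np.allclose(gram, np.eye(12)))
```

With the combined matrix satisfying the orthogonality condition, a single gradient step on the vectorized weights behaves as in the linear regression case.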
4.5 Accelerations
We note that in a direct implementation, training with deconvolution is slower than training with plain convolution when measured in wall-clock time. This is due to the suboptimal support in the implicit calculation of the matrices in existing libraries. We propose acceleration techniques to reduce the deconvolution cost to only a fraction of the convolution layer (Section A.6). Without further optimization, our training speed is similar to training a network using batch normalization on the ImageNet dataset while achieving better accuracy. This is a desired property when faced with difficult models (Goodfellow et al., 2014) and with problems where the network part is not the major bottleneck (Ye et al., 2018).
Implicit Deconvolution
Following from the associative rule of matrix multiplication, $y = X \cdot D \cdot w = X \cdot (D \cdot w)$, where $D$ is the deconvolution matrix, which suggests that the deconvolution can be carried out implicitly by changing the model parameters, without explicitly deconvolving the data. Once we finish the training, we freeze $D$ to be the running average. This change of parameters makes a network with deconvolution perform faster at testing time, which is a favorable property for mobile applications. We provide the practical recipe to include the bias term: $y = (X - \mu) \cdot D \cdot w + b = X \cdot (D \cdot w) + b - \mu \cdot D \cdot w$, where $\mu$ is the feature mean.
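The parameter change behind implicit deconvolution can be illustrated with a short NumPy sketch. The data, shapes, and the names `w_eff` and `b_eff` are hypothetical, and the deconvolution matrix is computed exactly by eigendecomposition for simplicity. The sketch folds the deconvolution matrix into the weights and the feature mean into the bias, then checks that the output matches explicit deconvolution of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, C_out = 200, 6, 4
X = rng.standard_normal((N, F)) @ (np.eye(F) + 0.5) + 1.0   # hypothetical data matrix
w = rng.standard_normal((F, C_out))                          # layer weights
b = rng.standard_normal(C_out)                               # layer bias

mu = X.mean(axis=0, keepdims=True)
cov = (X - mu).T @ (X - mu) / N
evals, evecs = np.linalg.eigh(cov)
D = evecs @ np.diag(evals ** -0.5) @ evecs.T                 # deconvolution matrix

y_explicit = (X - mu) @ D @ w + b         # deconvolve the data, then apply the weights

w_eff = D @ w                             # fold D into the weights once
b_eff = b - (mu @ w_eff).ravel()          # fold the mean into the bias
y_implicit = X @ w_eff + b_eff            # the data itself is never deconvolved

print(np.allclose(y_explicit, y_implicit))
```

The folded form multiplies the small matrices first, so the per-sample cost at test time is that of an ordinary linear layer.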
Fast Computation of the Covariance Matrix
We propose a simple subsampling technique that speeds up the computation of the covariance matrix. Since the number of involved pixels is usually large compared with the degrees of freedom in a covariance matrix (Section A.6), this simple strategy provides significant speedups while maintaining the training quality. Thanks to the regularization and the iterative method we discuss below, we found the subsampling method to be robust even when the covariance matrix is large.
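A rough sketch of why subsampling is tolerable: when the data matrix has many more rows than features, a covariance estimated from a fraction of the rows stays close to the full estimate. The code below is only an illustration under stated assumptions; it subsamples rows of a synthetic data matrix, whereas the paper subsamples pixels, and the value of `stride` is hypothetical.

```python
import numpy as np

def covariance(X):
    """Sample covariance of the columns of X (rows are samples)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc.T @ Xc / X.shape[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 6)) @ (np.eye(6) + 0.5)   # many rows, few features
stride = 4                                                 # hypothetical subsampling stride

cov_full = covariance(X)
cov_sub = covariance(X[::stride])                          # uses 1/stride of the rows

rel_err = np.linalg.norm(cov_sub - cov_full) / np.linalg.norm(cov_full)
print(rel_err)
```

The cost of the covariance product scales with the number of rows used, so keeping one row in every `stride` reduces that cost proportionally in this sketch.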
Fast Inverse Square Root of the Covariance Matrix
Computing the inverse square root has a long and fruitful history in computer graphics and numerical mathematics. Fast computation of the inverse square root of a scalar with Newton-Schulz iterations has received wide attention in game engines (Lomont, 2003). One would expect the same method to seamlessly generalize to the matrix case. However, according to numerous experiments, the standard Newton-Schulz iterations suffer from severe numerical instability and explosion after iterations for simple matrices (Section A.3) (Higham, 1986)(Eq. 7.12), (Higham, 2008). Coupled Newton-Schulz iterations have been designed (Denman and Beavers, Jr., 1976) (Eq.6.35), (Higham, 2008) and have been proved to be numerically stable.
We compute the approximate inverse square root of the covariance matrix at low cost using coupled Newton-Schulz iteration, inspired by the Denman-Beavers iteration method (Denman and Beavers, Jr., 1976). Given a symmetric positive definite covariance matrix $\mathrm{Cov}$, the coupled Newton-Schulz iterations start with initial values $Y_0 = \mathrm{Cov}$, $Z_0 = I$. The iteration is defined as: $Y_{k+1} = \frac{1}{2} Y_k (3I - Z_k Y_k)$, $Z_{k+1} = \frac{1}{2} (3I - Z_k Y_k) Z_k$, and $Y_k \to \mathrm{Cov}^{1/2}$, $Z_k \to \mathrm{Cov}^{-1/2}$ (Higham, 2008) (Eq.6.35). Note that this coupled iteration has been used in recent works to calculate the square root of a matrix (Lin and Maji, 2017). Instead, we take the inverse square root from the outputs, as first shown in (Ye et al., 2018). In contrast with the vanilla Newton-Schulz method (Higham, 2008; Huang et al., 2019)(Eq. 7.12), we found the coupled Newton-Schulz iterations are stable even if iterated thousands of times.
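The coupled iteration can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name, the iteration count, and the synthetic covariance are assumptions. Dividing the input by its Frobenius norm before iterating is also an assumption made here so that the iteration starts inside its region of convergence; the scaling is undone at the end.

```python
import numpy as np

def inv_sqrt_coupled_newton_schulz(cov, n_iter=20):
    """Approximate cov^(-1/2) for a symmetric positive definite matrix."""
    I = np.eye(cov.shape[0])
    norm = np.linalg.norm(cov)           # Frobenius norm, used to scale the input
    Y, Z = cov / norm, I.copy()          # Y_0 = scaled covariance, Z_0 = I
    for _ in range(n_iter):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z              # Y -> sqrt, Z -> inverse sqrt of the scaled matrix
    return Z / np.sqrt(norm)             # undo the scaling

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 6)) @ (np.eye(6) + 0.5)   # hypothetical correlated data
cov = X.T @ X / X.shape[0]

D = inv_sqrt_coupled_newton_schulz(cov)
residual = np.linalg.norm(D @ cov @ D - np.eye(6))
print(residual)
```

Only matrix multiplications are involved, which is what makes the iteration attractive on hardware optimized for them.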
It is important to point out a practical implementation detail: when we have $C_{in}$ input feature channels, and the kernel size is $k \times k$, the size of the covariance matrix is $(C_{in} k^2) \times (C_{in} k^2)$. The covariance matrix becomes large in deeper layers of the network, and inverting such a matrix is cumbersome. We take a grouping approach by evenly dividing the feature channels into smaller blocks (Ye et al., 2017; Wu and He, 2018; Ye et al., 2018); we use $B$ to denote the block size. The mini-batch covariance of each block has a manageable size of $(B k^2) \times (B k^2)$. Newton-Schulz iterations are therefore conducted on smaller matrices. We notice that only a few iterations are necessary to achieve good performance. Solving for the inverse square root takes . The computation of the covariance matrix has complexity . Implicit deconvolution is a simple matrix multiplication with complexity . The overall complexity is , which is usually a small fraction of the cost of the convolution operation (Section A.6). In comparison, the computational complexity of a regular convolution layer has a complexity of .
4.6 Sparse Representations
Our deconvolution applied at each layer removes the pixelwise and channelwise correlation and transforms the original dense representations into sparse representations (in terms of heavy-tailed distributions) without losing information. This is a desired property and there is a whole field with wide applications developed around sparse representations (Hyvärinen et al., 2009)(Fig. 7.7), (Olshausen and Field, 1996; Ye et al., 2013). In Fig. 4, we visualize the deconvolution operation on an input and show how the resulting representations (Fig. 4(d)) are much sparser than the normalized image (Fig. 4(b)). We randomly sample images from ImageNet and plot the histograms and log density functions before and after deconvolution (Fig. 10). After deconvolution, the log density distribution becomes heavy-tailed. This holds true also for hidden layer representations (Section A.5). We show in the supplementary material (Section A.4) that the sparse representation makes classic regularizations more effective.
5 A Unified View
Network deconvolution is a forward correction method that has relations to several successful techniques in training neural networks. When we set the kernel size $k = 1$, the method becomes channelwise decorrelation, as in (Ye et al., 2018; Huang et al., 2018). When $k = 1$ and the block size $B = 1$, network deconvolution is Batch Normalization (Ioffe and Szegedy, 2015). If we apply the decorrelation in a backward way in the gradient direction, network deconvolution is similar to (Ye et al., 2017), natural gradient descent (Desjardins et al., 2015) and (Martens and Grosse, 2015), while being more efficient and having better numerical properties (Section A.8).
6 Experiments
We now describe experimental results validating that network deconvolution is a powerful and successful tool for sharpening the data. Our experiments show that it outperforms identical networks using batch normalization (Ioffe and Szegedy, 2015), a major method for training neural networks. As we will see across all experiments, deconvolution not only improves the final accuracy but also reduces the number of iterations needed to learn a reasonably good set of weights within a small number of epochs.
Linear Regression with loss and Logistic Regression: As a first experiment, we ran network deconvolution on a simple linear regression task to show its efficacy. We select the FashionMNIST dataset. It is noteworthy that with binary targets and the loss, the problem has an explicit solution if we feed the whole dataset as input. This problem is the classic kernel estimation problem, where we need to solve for optimal kernels to convolve with the inputs and minimize the loss for binary targets. During our experiment, we notice that it is important to use a small learning rate of for vanilla training to prevent divergence. However, we notice that with deconvolution we can use the optimal learning rate and get high accuracy as well. It takes iterations to get to a low cost under the minibatch setting (Fig. 5(a)). This even holds if we change the loss to logistic regression loss (Fig. 5(b)).
                       CIFAR10                                           CIFAR100
Net           Size     BN 1    ND 1    BN 20   ND 20   BN 100  ND 100    BN 1    ND 1    BN 20   ND 20   BN 100  ND 100
VGG16         14.71M   14.12%  74.18%  90.07%  93.25%  93.58%  94.56%    2.01%   37.94%  63.22%  71.97%  72.75%  75.32%
ResNet18      11.17M   56.25%  72.89%  92.64%  94.07%  94.87%  95.40%    16.10%  35.73%  72.67%  76.55%  77.70%  78.63%
Preact18      11.17M   55.15%  72.70%  91.93%  94.10%  94.37%  95.44%    15.17%  36.52%  70.79%  76.04%  76.14%  79.14%
DenseNet121   6.88M    59.56%  76.63%  93.25%  94.89%  94.71%  95.88%    17.90%  42.91%  74.79%  77.63%  77.99%  80.69%
ResNext29     4.76M    52.14%  69.22%  93.12%  94.05%  95.15%  95.80%    17.98%  30.93%  74.26%  77.35%  78.60%  80.34%
MobileNet v2  2.28M    54.29%  65.40%  89.86%  92.52%  90.51%  94.35%    15.88%  29.01%  66.31%  72.33%  67.52%  74.90%
DPN92         34.18M   34.00%  53.02%  92.87%  93.74%  95.14%  95.82%    8.84%   21.89%  74.87%  76.12%  78.87%  80.38%
PNASNetA      0.13M    21.81%  64.19%  75.85%  81.97%  81.22%  84.45%    10.49%  36.52%  44.60%  55.65%  54.52%  59.44%
SENet18       11.26M   57.63%  67.21%  92.37%  94.11%  94.57%  95.38%    16.60%  32.22%  71.10%  75.79%  76.41%  78.63%
EfficientNet  2.91M    35.40%  55.67%  84.21%  86.78%  86.07%  88.42%    19.03%  22.40%  57.23%  57.59%  59.09%  62.37%
Convolutional Networks on CIFAR10/100: We ran deconvolution on the CIFAR10 and CIFAR100 datasets (Table 1), where we compared again the use of network deconvolution versus the use of batch normalization. Across modern network architectures for both datasets, deconvolution consistently improves convergence on these well-known datasets. There is a wide performance gap after the first epochs of training. Deconvolution leads to faster convergence: 20-epoch training using deconvolution leads to results that are comparable to 100-epoch training using batch normalization.
In our setting, we remove all batch normalizations in the networks and replace them with deconvolution before each convolution/fullyconnected layer. For the convolutional layers, we set before calculating the covariance matrix. For the fullyconnected layers, we set equal to the input channel number, which is usually . We set the batch size to and the weight decay to . All models are trained with and a learning rate of .
Convolutional Networks on ImageNet: We tested three widely acknowledged model architectures (VGG11, ResNet18, DenseNet121) from the PyTorch model zoo and find significant improvements on all three networks over the reference models. Notably, for the VGG11 network, our method leads to significantly improved accuracy. The top-1 accuracy is even higher than that reported by the reference VGG13 model trained with batch normalization. The improvement introduced by network deconvolution is twice as large as that from batch normalization (+2.93% versus +1.36% top-1 over the original VGG11). This fact also suggests that improving the training method may be more effective than improving the architecture.
We keep most of the default settings when training the models. We set for all deconvolution operations. The networks are trained for epochs with a batch size of , and weight decay of . The initial learning rates are , and , respectively for VGG11,ResNet18 and DenseNet121 as described in the paper. We used cosine annealing to smoothly decrease the learning rate to compare the curves.
Generalization to Other Tasks: It is worth pointing out that network deconvolution can be applied to other tasks that have convolution layers. Further results on semantic segmentation on the Cityscapes dataset can be found in (Sec. A.2). Also, the same deconvolution procedure for convolutions can be used for non-convolutional layers, which makes it useful for the broader machine learning community. We constructed a 3-layer fully-connected network that has 128 hidden nodes in each layer. We compare the result with/without batch normalization, and deconvolution, where we remove the correlation between hidden nodes. Indeed, applying deconvolution to MLP networks outperforms batch normalization, as shown in Fig. 5(c,d).
                 VGG11                        ResNet18           DenseNet121
                 Original  BN      Deconv     BN      Deconv     BN      Deconv
ImageNet top-1   69.02%    70.38%  71.95%     69.76%  71.24%     74.65%  75.73%
ImageNet top-5   88.63%    89.81%  90.49%     89.08%  90.14%     92.17%  92.75%
7 Conclusion
In this paper we presented network deconvolution, a novel decorrelation method tailored for convolutions, which is inspired by the biological visual system. Our method was evaluated extensively and shown to improve the optimization efficiency over standard batch normalization. We provided a thorough analysis of its performance and demonstrated consistent performance improvement of the deconvolution operation on multiple major benchmarks given modern neural network models. Our proposed deconvolution operation is straightforward in terms of implementation and can serve as a good alternative to batch normalization.
Appendix A
A.1 Source Code
Source code can be found at:
The models for CIFAR10/CIFAR100 are adapted from the following repository:
A.2 Generalization to Semantic Segmentation
To demonstrate the applicability of network deconvolution to different tasks, we modify a baseline architecture of DeepLabV3 with a ResNet50 backbone for semantic segmentation. We remove the batch normalization layers in both the backbone network and the head network and preapply deconvolutions in all the convolution layers. The full networks are trained from scratch on the Cityscape dataset (with training images) using a learning rate of for epochs with batch size 8. All settings are the same with the official PyTorch recipe except we have raised the learning rate from to for training from scratch. Here we report the mean intersection over union (mIoU), pixel accuracy and training loss curves of standard training and deconvolution using a crop size of . Network deconvolution significantly improves the training quality and accelerates the convergence (Fig. 6).
A.3 Coupled/Uncoupled Newton-Schulz Iterations
We take the Lenna image and construct the covariance matrix using pixels from windows in color channels. We apply the vanilla NewtonSchulz iteration and compare it with the coupled NewtonSchulz iteration. The Frobenius Norm of is plotted in Fig. 7. The rounding errors quickly accumulate with the vanilla Newton Schulz iterations, while the coupled iteration is stable. From the curve we set the iteration number to for the first layer of the network to thoroughly remove the correlation in the input data. We freeze the deconvolution matrix after iterations. For the middle layers of the network we set the iteration number to .
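The coupled iteration discussed here can be sketched in a few lines. This is a minimal pure-Python illustration and not the reference implementation: it follows the standard coupled Newton-Schulz scheme for the matrix square root, scales the input by its Frobenius norm so that the iteration converges, and uses illustrative defaults for `iters` and `eps`. The function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def coupled_newton_schulz_inv_sqrt(cov, iters=5, eps=1e-5):
    """Approximate (cov + eps*I)^(-1/2) for a symmetric positive semi-definite matrix."""
    n = len(cov)
    a = [[cov[i][j] + (eps if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    # Scale by the Frobenius norm so that all eigenvalues lie in (0, 1].
    norm = math.sqrt(sum(v * v for row in a for v in row))
    y = [[v / norm for v in row] for row in a]                          # Y_0 = A / ||A||_F
    z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # Z_0 = I
    for _ in range(iters):
        zy = matmul(z, y)
        # T = (3I - Z Y) / 2 is shared by both updates; this is the coupling.
        t = [[0.5 * ((3.0 if i == j else 0.0) - zy[i][j]) for j in range(n)]
             for i in range(n)]
        y = matmul(y, t)   # Y converges to (A / norm)^(1/2)
        z = matmul(t, z)   # Z converges to (A / norm)^(-1/2)
    # Undo the scaling: A^(-1/2) = Z / sqrt(norm).
    scale = math.sqrt(norm)
    return [[v / scale for v in row] for row in z]

# Illustrative use on a small covariance matrix.
cov = [[2.0, 1.0], [1.0, 2.0]]
d = coupled_newton_schulz_inv_sqrt(cov, iters=20, eps=0.0)
check = matmul(matmul(d, cov), d)   # close to the identity matrix
```

Only matrix products are involved, which is why the iteration maps well onto hardware optimized for matrix multiplication. The iteration count trades accuracy of the decorrelation against cost.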
A.4 Regularizations
If two features correlate, weight decay regularization is less effective. If are strongly correlated features, but differ in scale, and if we look at: , the weights are likely to co-adapt during training, and weight decay is likely to be more effective on the larger coefficient. The other, smaller coefficient is left less penalized. Network deconvolution reduces the co-adaptation of weights, so weight decay becomes less ambiguous and more effective. Here we report the accuracies of the VGG13 network on the CIFAR100 dataset using weight decays of and . We notice that a stronger weight decay is detrimental to performance with standard training using batch normalization. In contrast, the network achieves better performance with deconvolution using a stronger weight decay. Each setting is repeated 5 times, and the mean accuracy curves with confidence intervals of (± 1.0 std) are shown in Fig. 8(a).
A.5 Sparse Representations for Convolution Layers
Network deconvolution reduces redundancy in a way similar to the animal vision system (Fig. 9). The center-surround antagonism results in efficient and sparse representations. In Fig. 10 we plot the distribution and log density of the signals at the first layer before and after deconvolution. The distribution after the deconvolution has a well-known heavy-tailed shape (Hyvärinen et al., 2009; Ye et al., 2013).
Fig. 11 shows the inputs to the th convolution layer in . This input is the output of a activation function. The deconvolution operation removes the correlation between channels and nearby pixels, resulting in a sharper and sparser representation.
A.6 Performance Breakdown
Network deconvolution is a customized new design and relies on less optimized functions such as . Even so, the slowdown is tunable to be around on modern networks. We plot wall time versus accuracy for the VGG network on the ImageNet dataset. For this plot we use (Fig. 12). Here we also break down the computational cost on CPU to show that deconvolution is a low-cost and promising approach if properly optimized on GPUs. We take random images at various scales and set the input/output channels to common values in modern networks. The CPU timing on a batch size of can be found in Table 3. Here we fix the number of Newton-Schulz iterations to .
256  256  3  64  1  3  3  0.069  0.0079  0.00025  0.50 
128  128  64  128  1  3  3  0.315  0.3214  0.02069  0.67 
64  64  128  256  2  3  3  0.045  0.0445  0.00076  0.55 
32  32  256  512  4  3  3  0.022  0.0391  0.00222  0.48 
16  16  512  512  8  3  3  0.011  0.0444  0.01155  0.23 
128  128  64  128  64  3  3  0.376  0.0871  0.00024  0.66 
64  64  128  256  128  3  3  0.042  0.0412  0.00082  0.53 
32  32  256  512  256  3  3  0.023  0.0377  0.00208  0.47 
16  16  512  512  512  3  3  0.011  0.0437  0.01194  0.22 
128  128  64  128  32  3  3  0.360  0.0939  0.00021  0.67 
64  64  128  256  32  3  3  0.044  0.0425  0.00076  0.55 
32  32  256  512  32  3  3  0.023  0.0374  0.00204  0.46 
16  16  512  512  32  3  3  0.011  0.0421  0.01180  0.22 
256  256  3  64  1  3  5  0.030  0.0046  0.00029  0.49 
128  128  64  128  1  3  5  0.153  0.1284  0.01901  0.66 
256  256  3  64  1  7  3  0.338  0.1111  0.00107  0.69 
256  256  3  64  1  7  5  0.180  0.0408  0.00107  0.69 
256  256  3  64  1  7  7  0.063  0.0210  0.00104  0.70 
256  256  3  64  1  11  3  0.681  0.4810  0.00479  0.99 
256  256  3  64  1  11  5  0.315  0.1909  0.00488  1.00 
256  256  3  64  1  11  7  0.199  0.1022  0.00494  0.99 
256  256  3  64  1  11  11  0.069  0.0420  0.00496  1.00 
A.7 Accelerated Convergence
We show the loss curves under different settings when training the VGG11 network on the ImageNet dataset (Fig. 8(b)). Network deconvolution leads to significantly faster decay of the training loss.
A.8 Forward Backward Equivalence
We discuss the relation between applying the correction in the forward pass and applying it in the backward pass. We thank Prof. Brian Hunt for providing us with this simple proof.
Assuming that in one layer of the network we have , , here .
(6) 
Assuming is fixed,
(7) 
One iteration of gradient descent with respect to is:
(8) 
.
One iteration of gradient descent with respect to is:
(9) 
(10) 
We then reach the familiar form (Ye et al., 2017):
(11) 
We have proved the equivalence between forward correction and the Newton's-method-like backward correction. Carrying out the forward correction as in our paper is beneficial because as the neural network gets deep, gets more ill-posed. Another reason is that because depends on , the layer gradients are more accurate if we include the inverse square root in backpropagation. This is easily achievable with the help of automatic differentiation implementations:
(12) 
Here is an input to the current layer and is the input to the next layer.
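The equivalence argued in this section can be sketched compactly under notation that is assumed here for illustration and may differ from the authors' symbols: $X$ is the layer input, $D = \mathrm{Cov}^{-1/2}$ is the symmetric deconvolution matrix (treated as fixed), $w_2$ are the weights acting on the corrected input $XD$, $w_1 = D w_2$ are the equivalent weights acting on $X$, $L$ is the loss, and $\alpha$ is the learning rate.

```latex
\begin{align*}
y &= (XD)\,w_2 = X\,(D w_2) = X w_1, \qquad w_1 = D w_2, \\
\frac{\partial L}{\partial w_2}
  &= D^{\top}\frac{\partial L}{\partial w_1}
   = D\,\frac{\partial L}{\partial w_1}, \\
w_2^{\mathrm{new}}
  &= w_2^{\mathrm{old}} - \alpha\,\frac{\partial L}{\partial w_2}
   = w_2^{\mathrm{old}} - \alpha\,D\,\frac{\partial L}{\partial w_1}, \\
w_1^{\mathrm{new}} = D\,w_2^{\mathrm{new}}
  &= w_1^{\mathrm{old}} - \alpha\,D^{2}\,\frac{\partial L}{\partial w_1}
   = w_1^{\mathrm{old}} - \alpha\,\mathrm{Cov}^{-1}\,\frac{\partial L}{\partial w_1}.
\end{align*}
```

Under these assumptions, a plain gradient step on $w_2$ in the corrected coordinates corresponds to a step on $w_1$ preconditioned by $\mathrm{Cov}^{-1}$, which is the Newton-like form.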
A.9 Influence of Batch Size
We notice that network deconvolution works well under various batch sizes. However, the settings need to be adjusted to achieve optimal performance. High learning rates can be used for large batch sizes, whereas small learning rates should be used for small batch sizes. When the batch size is small, the number of Newton-Schulz iterations should be reduced and the regularization factor should be raised to avoid overfitting the noise. More results and settings can be found in Table 4.
Batch size  Accuracy  Learning rate  Regularization factor  Newton-Schulz iterations
2  89.12%  0.001  0.01  2
8  91.26%  0.01  0.01  2
32  91.18%  0.01  1e-5  5
128  91.56%  0.1  1e-5  5
512  91.66%  0.5  1e-5  5
2048  90.64%  1  1e-5  5
A.10 Implementation of FastDeconv in PyTorch
We present the reference implementation in PyTorch. “FastDeconv” can be used to replace instances of “nn.Conv2d” in the network architectures. Batch normalization layers should also be removed.
The listing is provided in fig/FastDeconv.pdf.
References
 Fast motion deblurring. In ACM SIGGRAPH Asia 2009 Papers, SIGGRAPH Asia ’09, New York, NY, USA, pp. 145:1–145:8. External Links: ISBN 9781605588582, Link, Document Cited by: §4.4.1.
 The matrix sign function and computations in systems. Appl. Math. Comput. 2 (1), pp. 63–94. External Links: ISSN 00963003, Link, Document Cited by: §4.5.3, §4.5.3.
 Natural neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 2071–2079. External Links: Link Cited by: §2.1, §5.
 Network deconvolution as a general method to distinguish direct dependencies in networks. In Nature biotechnology, Cited by: §4.3.1.
 Matrix computations. Fourth edition, JHU Press. External Links: ISBN 1421407949 9781421407944, Link Cited by: §4.4.2.
 Generative adversarial networks. External Links: 1406.2661 Cited by: §4.5.
 Newton’s method for the matrix square root. Cited by: §4.5.3.
 Functions of matrices: Theory and computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. External Links: ISBN 9780898716467 Cited by: §4.5.3, §4.5.3.
 Decorrelated batch normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 791–800. Cited by: §2.1, §5.
 Iterative normalization: beyond standardization towards efficient whitening. CoRR abs/1904.03441. External Links: Link, 1904.03441 Cited by: §2.1, §4.5.3.
 Integrative action in the cat’s lateral geniculate body. The Journal of Physiology 155 (2), pp. 385–398. External Links: Document, Link, https://physoc.onlinelibrary.wiley.com/doi/pdf/10.1113/jphysiol.1961.sp006635 Cited by: §1, §3.3, §4.3.2.
 Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology 160 (1), pp. 106–154. Cited by: §1, §3.3, §4.3.2.
 Natural image statistics: a probabilistic approach to early computational vision. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 1848824904, 9781848824904 Cited by: §A.5, §1, §4.6.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.1, §4.1, §5, §6.
 Improved bilinear pooling with cnns. CoRR abs/1707.06772. External Links: Link, 1707.06772 Cited by: §4.5.3.
 Fast inverse square root. Technical Report 32. Cited by: §4.5.3.
 Optimizing neural networks with Kronecker-factored approximate curvature. CoRR abs/1503.05671. External Links: Link, 1503.05671 Cited by: §2.1, §5.
 Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Cited by: §4.4.2.
 Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), pp. 607–609. External Links: ISSN 14764687, Document, Link Cited by: §1, §4.6.
 The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 210 (), pp. 307–357. External Links: ISSN 02643952, Link Cited by: §3.1, §4.1.
 Fundamental principles of cortical computation: unsupervised learning with prediction, compression and feedback. arXiv preprint arXiv:1608.06277. Cited by: §1.
 Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR abs/1602.07868. External Links: Link, 1602.07868 Cited by: §2.1.
 Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
 Group normalization. CoRR abs/1803.08494. External Links: Link, 1803.08494 Cited by: §2.1, §4.5.3.
 BlindCall: Ultra-fast base-calling of high-throughput sequencing data by blind deconvolution. Bioinformatics 30 (9), pp. 1214–1219. External Links: ISSN 13674803, Document, Link, http://oup.prod.sis.lan/bioinformatics/articlepdf/30/9/1214/644754/btu010.pdf Cited by: §2.2, §4.4.1.
 Unsupervised learning of dense optical flow and depth from sparse event data. CoRR abs/1809.08625. External Links: Link, 1809.08625 Cited by: §2.1, §4.5.3, §4.5.3, §4.5, §5.
 Sparse norm filtering. CoRR abs/1305.3971. External Links: Link, 1305.3971 Cited by: §A.5, §4.6.
 On the importance of consistency in training deep neural networks. CoRR abs/1708.00631. External Links: Link, 1708.00631 Cited by: §A.8, §2.1, §4.5.3, §5.