Abstract
Infinite width limits of deep neural networks often have tractable forms. They have been used to analyse the behaviour of finite networks, as well as being useful methods in their own right. When investigating infinitely wide CNNs it was observed that the correlations arising from spatial weight sharing disappear in the infinite limit. This is undesirable, as spatial correlation is the main motivation behind CNNs. We show that the loss of this property is not a consequence of the infinite limit, but rather of choosing an independent weight prior. Correlating the weights maintains the correlations in the activations. Varying the amount of correlation interpolates between independent-weight limits and mean-pooling. Empirical evaluation of the infinitely wide network shows that optimal performance is achieved between the extremes, indicating that correlations can be useful.
3rd Symposium on Advances in Approximate Bayesian Inference (AABI), 2020
Correlated Weights in Infinite Limits of Deep Convolutional Neural Networks
1 Introduction
Analysing infinitely wide limits of neural networks has long been used to provide insight into the properties of neural networks. Neal (1996) first noted such a relationship, through the correspondence between infinitely wide Bayesian neural networks and Gaussian processes (GPs). The success of GPs raised the question of whether such a comparatively simple model could replace a complex neural network. MacKay (1998) noted that taking the infinite limit resulted in a fixed feature representation, a key desirable property of neural networks. Since this property is lost due to the infinite limit, MacKay inquired: “have we thrown the baby out with the bath water?”
In this work, we follow the recent interest in infinitely wide convolutional neural networks (Garriga-Alonso et al., 2019; Novak et al., 2019) to investigate another property that is lost when taking the infinite limit: correlation between patches in different parts of the image. Given that convolutions were developed to introduce these correlations, and that they improve performance (Arora et al., 2019), it seems undesirable that they are lost when more filters are added. Currently, the only way of reintroducing these correlations is by changing the model architecture by introducing mean-pooling (Novak et al., 2019). This raises two questions:

Is the loss of patch-wise correlations a necessary consequence of the infinite limit?

Is an architectural change the only way of reintroducing patch-wise correlations?
We show that the answer to both these questions is “no”. Correlations between patches can also be maintained in the limit without pooling by introducing correlations between the weights in the prior. The amount of correlation can be controlled, which allows us to interpolate between the existing approaches of full independence and mean-pooling. Our approach allows the discrete architectural choice of mean-pooling to be replaced with a more flexible continuous amount of correlation. Empirical evaluation on CIFAR-10 shows that this additional flexibility improves performance.
Our work illustrates how choices that are made in the prior affect properties of the limit, and that good choices can improve performance. The success of this approach in the infinite limit also raises questions about whether correlated weights should be used in finite networks.
2 Spatial Correlations in Single-Layer Networks
To begin, we will analyse the infinite limit of a single hidden layer convolutional neural network (CNN). We extend Garriga-Alonso et al. (2019) and Novak et al. (2019) by considering weight priors with correlations. By adjusting the correlation we can interpolate between existing independent-weight limits and mean-pooling, which previously had to be introduced as a discrete architectural choice. We also discuss how existing convolutional Gaussian processes (van der Wilk et al., 2017; Dutordoir et al., 2020) can be obtained from limits of correlated weight priors.
A single-layer CNN (see fig. 1 for a graphical representation of the notation) takes in an image input with width , height , and channels (e.g. one per colour). The image is divided up into patches , with representing a location of one of the zero-padded patches. Weights are applied by taking an inner product with all patches, which we do times to give multiple channels in the next layer. By collecting all weights in the tensor we can denote the computation of the pre- and post-nonlinearity activations as
(1) 
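To make the patch-based computation concrete, the following minimal NumPy sketch (our own illustration, not the paper's code; helper names such as `extract_patches` are assumptions) computes pre- and post-nonlinearity activations by taking inner products between zero-padded patches and filters:

```python
import numpy as np

def extract_patches(x, k):
    """Extract all k x k zero-padded patches of an (H, W, C) image,
    one patch per spatial location (stride 1, 'same' padding)."""
    H, W, C = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    patches = np.empty((H * W, k * k * C))
    for i in range(H):
        for j in range(W):
            patches[i * W + j] = xp[i:i + k, j:j + k, :].ravel()
    return patches  # shape (P, k*k*C), P = number of patch locations

def conv_layer(x, W_filters, nonlin=np.tanh):
    """Pre/post-nonlinearity activations: inner product of every
    patch with every filter. W_filters: (n_channels, k*k*C)."""
    k = int(np.sqrt(W_filters.shape[1] // x.shape[2]))
    patches = extract_patches(x, k)          # (P, k*k*C)
    Z = patches @ W_filters.T                # (P, n_channels)
    return Z, nonlin(Z)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
W_filters = rng.standard_normal((16, 3 * 3 * 3))
Z, A = conv_layer(x, W_filters)
assert Z.shape == (64, 16)
```

Each row of `patches` is one flattened patch, so the convolution is just a matrix product between patches and filters; the infinite limit is taken over the number of filter channels (here 16).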
In a single-layer CNN, these activations are followed by a fully-connected layer with weights . Our final output is again given by a summation over the activations
(2) 
where denotes the result before the summation over the patch locations .
We analyse the distribution on function outputs for some Gaussian prior on the weights , where is the collection of the weights at all layers. In all the cases we consider, we take the prior to be independent over layers and channels. Here we extend on earlier work by allowing spatial correlation in the final layer’s weights (we will consider all layers later) through the covariance matrix . This gives the prior
(3) 
where we use square brackets to index into a matrix or tensor, using the NumPy colon notation for the collection of all variables along an axis.
Independence between channels makes the collection of all activations
(4) 
The limit of the sum of the final expectation over can be found (see appendix D for details) in closed form for many activations and is denoted as . We find the final kernel for the GP by taking the covariance between function values and and performing the final sum in eq. 2:
(5) 
We can now see how different choices for give different forms of spatial correlation.
Independence
Garriga-Alonso et al. (2019) and Novak et al. (2019) consider , i.e. the case where all weights are independent. The resulting kernel simply sums components over patches, which implies an additive model (Stone, 1985), where a different function is applied to each patch, after which they are all summed together: . This structure has commonly been applied to improve GP performance in high-dimensional settings (e.g. Duvenaud et al., 2011; Durrande et al., 2012). Novak et al. (2019) point out that the same kernel can be obtained by taking an infinite limit of a locally connected network (LCN) (LeCun, 1989), where connectivity is the same as in a CNN but without weight sharing, indicating that a key desirable feature of CNNs is lost.
Mean-pooling
By taking we make the weights fully correlated over all locations, leading to identical weights for all , i.e. . This is equivalent to taking the mean response over all spatial locations (see eq. 2), or mean-pooling. As Novak et al. (2019) discuss, this reintroduces the spatial correlation that is the intended result of weight sharing. The “translation invariant” convolutional GP of van der Wilk et al. (2017) can be obtained by this single-layer limit using Gaussian activation functions (van der Wilk, 2019). Since this mean-pooling was shown to be too restrictive in this single-layer case, van der Wilk et al. (2017) considered pooling with constant weights (i.e. without a prior on them). In this framework, this is equivalent to placing a rank-1 prior on the final-layer weights by taking . This maintains the spatial correlations, but requires the pooling weights to be learned by maximising the marginal likelihood (ML-II).
Spatially correlated weights
In the pooling examples above, the spatial covariance of weights is taken to be a rank-1 matrix. We can add more flexibility to the model by varying the strength of correlation between weights based on their distance in the image. We consider an exponential decay depending on the distance between two patches: . We recover full independence by taking , and mean-pooling with . Intermediate values of allow the rigid assumption of complete weight sharing to be relaxed, while still retaining spatial correlations between similar patches. This construction gives exactly the same kernel as investigated by Dutordoir et al. (2020), who named this property “translation insensitivity”, as opposed to the stricter invariance that mean-pooling gives. The additional flexibility improved performance without needing to add many parameters that are learned in a non-Bayesian fashion.
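As a hedged numerical illustration of this interpolation (our own sketch; `weight_cov` and the toy patch-kernel matrix are assumptions, not the paper's code): the final-layer kernel of eq. (5) is a sum of patch-wise covariances weighted by the spatial weight covariance, and the two extreme settings recover the additive (independent-weight) and mean-pooling kernels:

```python
import numpy as np

def weight_cov(locs, lengthscale):
    """Exponential-decay covariance between weights at 1-d patch
    locations: Sigma[p, q] = exp(-|p - q| / lengthscale)."""
    d = np.abs(locs[:, None] - locs[None, :])
    if lengthscale == 0.0:
        return np.eye(len(locs))          # independent weights
    return np.exp(-d / lengthscale)       # lengthscale = inf -> all ones

def output_kernel(k_patch, Sigma_w):
    """Final-layer kernel: the patch-wise covariance matrix
    k_patch[p, q] is summed, weighted by the spatial weight
    covariance Sigma_w[p, q]."""
    return np.sum(Sigma_w * k_patch)

rng = np.random.default_rng(1)
P = 5
locs = np.arange(P, dtype=float)
k_patch = rng.standard_normal((P, P))
k_patch = k_patch @ k_patch.T             # any PSD patch-kernel matrix

k_indep = output_kernel(k_patch, weight_cov(locs, 0.0))
k_pool = output_kernel(k_patch, weight_cov(locs, np.inf))

assert np.isclose(k_indep, np.trace(k_patch))   # additive: diagonal only
assert np.isclose(k_pool, k_patch.sum())        # mean-pooling: all pairs
```

Intermediate lengthscales weight cross-patch terms by their spatial proximity, interpolating continuously between the two extremes.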
Our construction shows that spatial correlation can be retained in infinite limits without needing to resort to architectural changes. A simple change to the prior on the weights is all that is needed. This property is retained in wide limits of deep networks in a similar way, which we investigate next.
3 Spatial Correlations in Deep Networks
In appendix B, we provide a detailed but informal extension of the previous section’s results to deep networks. We also formulate the correlated weights prior in the framework provided by Yang (2019), which gives a formal justification for our results.
The procedure for computing the kernel has a recursive form similar to existing analyses (Garriga-Alonso et al., 2019; Novak et al., 2019). Negligible additional computation is needed to consider arbitrary correlations, compared to only considering mean-pooling (Novak et al., 2019; Arora et al., 2019). The main bottleneck is the need to compute covariances for all pairs of patches in the image, as in eq. 5. For a dimensional convolutional layer, the corresponding kernel computation is a convolution with a dimensional covariance tensor.
4 Experiments
We seek to test two hypotheses. 1) Can we eliminate architectural choices, and recover their effect using continuous hyperparameters instead? 2) In the additional search space we have uncovered, can we find a kernel that performs better than the existing ones?
We evaluate various models on class-balanced subsets of CIFAR-10 of size , following Arora et al. (2020). As is standard practice in the wide network literature, we reframe classification as regression to one-hot targets . We subtract from to make its mean zero, but we observed that this affects the results very little. The test predictions are the argmax over of the posterior Gaussian process means
(6) 
where is a hyperparameter, the variance of the observation noise of the GP regression. We perform cross-validation to find a setting for . We use the eigendecomposition of to avoid the need to recompute the inverse for each value of .
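This reuse can be sketched as follows (a minimal NumPy illustration under our own naming assumptions): with one eigendecomposition of the training kernel, the posterior means of eq. (6) can be evaluated for every candidate noise variance without recomputing a matrix inverse:

```python
import numpy as np

def gp_means_for_noises(K_train, K_test_train, Y_onehot, noise_grid):
    """Posterior GP means for many observation-noise values, reusing
    one eigendecomposition of the train kernel."""
    lam, Q = np.linalg.eigh(K_train)          # K_train = Q diag(lam) Q^T
    QtY = Q.T @ Y_onehot
    means = []
    for s2 in noise_grid:
        # (K + s2 I)^{-1} Y = Q diag(1 / (lam + s2)) Q^T Y
        alpha = Q @ (QtY / (lam + s2)[:, None])
        means.append(K_test_train @ alpha)
    return means  # one (n_test, n_classes) array per noise value

rng = np.random.default_rng(2)
n, m, c = 20, 5, 3
A = rng.standard_normal((n + m, n + m))
K = A @ A.T                                   # PSD kernel over train+test
K_train, K_test_train = K[:n, :n], K[n:, :n]
Y = np.eye(c)[rng.integers(0, c, size=n)] - 1.0 / c   # zero-mean one-hot
preds = gp_means_for_noises(K_train, K_test_train, Y, [1e-2, 1e-1, 1.0])

# agrees with a direct solve for one noise value
direct = K_test_train @ np.linalg.solve(K_train + 1e-1 * np.eye(n), Y)
assert np.allclose(preds[1], direct)
```

Each extra noise value then costs only matrix-vector products rather than a fresh cubic-time solve.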
4.1 Correlated weights in the last layer
We start with two architectures used in the neural network kernel literature, the CNN-GP (Novak et al., 2019; Arora et al., 2019) with 14 layers, and the Myrtle network (Shankar et al., 2020) with 10 layers. The CNN-GP-14 architecture has a sized layer at the end, which is usually transformed into the output using mean-pooling. The Myrtle-10 architecture has a pooling layer at the end.
We replace the final pooling layers with a layer with correlated weights. Following Dutordoir et al. (2020), the covariance of the weights is given by the Matérn-3/2 kernel with lengthscale :
(7) 
Note that are 2-d vectors representing patch locations. The “extremes” of independent weights and mean-pooling are represented by setting and , respectively.
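A sketch of this weight covariance (our own illustration of the Matérn-3/2 form over a grid of 2-d patch locations; the grid size and tolerances are arbitrary):

```python
import numpy as np

def matern32_weight_cov(height, width, lengthscale):
    """Matern-3/2 covariance between final-layer weights at 2-d patch
    locations p, q: k(r) = (1 + sqrt(3) r / l) exp(-sqrt(3) r / l),
    with r = |p - q|. l -> 0 gives independent weights, l -> inf
    gives mean-pooling."""
    ii, jj = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    locs = np.stack([ii.ravel(), jj.ravel()], axis=1).astype(float)
    r = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    if lengthscale == 0.0:
        return np.eye(len(locs))
    s = np.sqrt(3.0) * r / lengthscale
    return (1.0 + s) * np.exp(-s)

S_small = matern32_weight_cov(4, 4, 1e-9)   # approximately the identity
S_large = matern32_weight_cov(4, 4, 1e9)    # approximately all-ones

assert np.allclose(S_small, np.eye(16), atol=1e-6)
assert np.allclose(S_large, np.ones((16, 16)), atol=1e-6)
```

The lengthscale is thus a single continuous hyperparameter that sweeps between the locally-connected and mean-pooled architectures.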
In figure 2, we investigate how the 4-fold cross-validation accuracy on data sets of size varies with the lengthscale of the Matérn-3/2 kernel, which controls the “amount” of spatial correlation in the weights of the last layer. For each data point in each line, we split the data set into 4 folds, and we calculate the test accuracy on 1 fold using the other 3 as the training set, for each value of that we try. We take the maximum accuracy over .
We investigate how the effect above varies with data set size. As the data set grows larger, we observe that the advantage of having a structured covariance matrix in the output becomes more apparent. We can also see the optimal lengthscale converging to a similar value, of about , which is evidence for the hypothesis holding with more data. The optimal lengthscale is the same for both networks, so it may be a property of the CIFAR-10 data set.
The largest data set size in each part of the plot was run only once because of computational constraints. We transform one data set of size into two data sets of size by taking block diagonals of the stored kernel matrix, so we have more runs for the smallest sizes. This is a valid Monte Carlo estimate of the true accuracy under the data distribution, with less variance than independent data sets, because the data sets taken are anti-correlated, since they have no points in common.
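The block-diagonal split can be sketched as follows (our own minimal illustration): one stored kernel matrix over a data set of size 2n yields the kernel matrices of its two disjoint halves without recomputing any kernel entries.

```python
import numpy as np

def split_kernel(K, y):
    """Split one stored kernel matrix over a data set of size 2n into
    the kernel matrices of its two disjoint halves (block diagonals),
    giving two size-n evaluations from one expensive kernel run."""
    n = K.shape[0] // 2
    return (K[:n, :n], y[:n]), (K[n:, n:], y[n:])

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
K = A @ A.T                         # stored PSD kernel matrix
y = rng.integers(0, 10, size=8)
(K1, y1), (K2, y2) = split_kernel(K, y)
assert K1.shape == (4, 4) and K2.shape == (4, 4)
```

Because the halves share no points, the two resulting accuracy estimates are anti-correlated, which reduces the variance of their average.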
4.2 Correlated weights in intermediate layers
We take the same approach in the experiment in figure 3. This time, we replace the intermediate mean-pooling layer, together with the next convolution layer, in the Myrtle-10 architecture with a convolution with a correlated weight prior. We observe that when the lengthscale is 0, the performance of the network is poor, suggesting that the mean-pooling layers in Myrtle-10 are necessary. Additionally, we are able to recover the performance of the hand-selected architecture by varying the lengthscale parameter.
5 Conclusion
The disappearance of spatial correlations in infinitely wide limits of deep convolutional neural networks could be seen as another example of how Gaussian processes lose favourable properties of neural networks. While other work sought to remedy this problem by changing the architecture (mean-pooling), we showed that changing the weight prior can achieve the same effect. Our work has three main consequences:

Weight correlation shows that locally connected models (without spatial correlation) and mean-pooling architectures (with spatial correlation) actually lie at opposite ends of a spectrum. This unifies the two views in the neural network domain. We also unify two known convolutional architectures that were introduced by the Gaussian process community.

We show empirically that modest performance improvements can be gained by using weight correlations between the extremes of locally connected networks and mean-pooling. We also show that the performance of mean-pooling in intermediate layers can be matched by weight correlation.

Using weight correlation may provide advantages during hyperparameter tuning. Discrete architectural choices need to be searched over by evaluating each option, while continuous parameters can be tuned with gradient-based optimisation. While we have not taken advantage of this in our current work, this may be a fruitful direction for future research.
Appendix A Related work
Infinitely wide limits of neural networks are currently an important tool for creating approximations and analyses. Here we provide a background on the different infinite limits that have been developed, together with a brief overview of where they have been applied.
Interest in infinite limits first started with research into properties of Bayesian priors on the weights of neural networks. Neal (1996) noted that prior function draws from a single hidden layer neural network with appropriate Gaussian priors on the weights tended to a Gaussian process as the width grew to infinity. The simplicity of performing Bayesian inference in Gaussian process models led to their widespread adoption soon after (Williams and Rasmussen, 1996; Rasmussen and Williams, 2006). Over the years, the wide limits of networks with different weight priors and activation functions have been analysed, leading to various kernels which specify the properties of the limiting Gaussian processes (Williams, 1997; Cho and Saul, 2009).
With the increasing prominence of deep learning, recursive kernels were introduced in an attempt to obtain similar properties. Cho and Saul (2009) and Mairal et al. (2014) investigated such methods for fully-connected and convolutional architectures respectively. Despite similarities between recursive kernels and neural networks, the derivation did not provide clear relationships, or any equivalence in a limit. Hazan and Jaakkola (2015) took initial steps towards showing the wide limit equivalence of a neural network beyond the single layer case. Recently, Matthews et al. (2018) and Lee et al. (2018) simultaneously provided general results for the convergence of the prior of deep fully-connected networks to a GP.
With the general tools in place, Garriga-Alonso et al. (2019) and Novak et al. (2019) derived limits of the prior of convolutional neural networks with infinitely many filters. These two papers directly motivated this work by noting that spatial correlations disappeared in the infinite limit. Spatial mean-pooling at the last layer was suggested as one way to recover correlations, with Novak et al. (2019) providing initial evidence of its importance. Due to computational constraints, they were limited to using a Monte Carlo approximation to the limiting kernel, while Arora et al. (2019) performed the computation with the exact NTK. Very recent preprints provide follow-on work that pushes the performance of limit kernels (Shankar et al., 2020) and demonstrates the utility of limit kernels for small data tasks (Arora et al., 2020). Extending the results for convolutional architectures, Yang (2019) showed how infinite limits could be derived for a much wider range of network architectures.
In the kernel and Gaussian process community, kernels with convolutional structure have also been proposed. Notably, these retained spatial correlation in either a fixed (van der Wilk et al., 2017) or adjustable (Mairal et al., 2014; Dutordoir et al., 2020) way. While these methods were not derived using an infinite limit, van der Wilk (2019) provided an initial construction from an infinitely wide neural network limit. Inspired by these results, we propose limits of deep convolutional neural networks which retain spatial correlation in a similar way.
Appendix B Spatial Correlations in Infinitely Wide Deep Convolutional Networks
The setup for the case of a deep neural network follows that of section 2, but with weights applied to each patch of the activations in the previous layer as
(8) 
We use two ways to index into activations. We either index into the th location as , or into the th location in the th patch as . For regular convolutions, the number of patches is equal to the number of spatial positions in the layer before, due to zero-padding, regardless of filter size. The only operation that changes the spatial size of the activations is a strided convolution. The one exception is the final layer, where we reduce all activations with their own weight. To unify notation, we see this as just another convolutional layer, but with a patch size equal to the activation size, and without zero-padding. The final layer can have multiple output channels to allow e.g. classification with multiple classes.
As pointed out by Matthews et al. (2018), a straightforward application of the central limit theorem is not strictly possible for deep networks. Fortunately, Yang (2019) developed a general framework for expressing neural network architectures and finding their corresponding Gaussian process infinite limits. The resulting kernel is given by the recursion that can be derived from a more informal argument, which takes the infinite width limit in a sequential layer-by-layer fashion, as was used in Garriga-Alonso et al. (2019). We follow this informal derivation, as it more naturally illustrates the procedure for computing the kernel. A formal justification can be found in appendix C.
B.1 Recursive Computation of the Kernel
To derive the limiting kernel for the output of the neural network, we will derive the distribution of the activations for each layer. In our weight prior, we correlate weights within a convolutional filter:
(9) 
Our derivation is general for any covariance matrix, so layers with correlated weights can be interspersed with the usual layers.
A Gaussian process is determined by the covariance of function values for pairs of inputs . Since the activations at the top layer are the function values, we will compute covariances between activations from the bottom of the network up. Starting from the recursion in eq. 8, we can find the covariance between any two prenonlinearity activations from a pair of inputs :
(10) 
For , the activations in the previous layer are the image inputs, i.e. , making the expectation a simple product between image patches. The pre-nonlinearity activations are Gaussian, because of their linear relationship with the weights. This allows us to find the covariance of the post-nonlinearity activations. Since are jointly Gaussian, the expectation will only depend on pairwise covariances. Here we represent this dependence through the function (see appendix D for details on the computation):
(11) 
The pre- and post-nonlinearity activations are independent between different channels, and identically distributed over all channels, so we omit the channel indices.
To compute the covariance of the pre-nonlinearity activations for , we can again apply eq. 10. Section B.1 shows that the post-nonlinearity covariances are constant over channels, so we can simplify eq. 10 further:
(12) 
We next want to compute the post-nonlinearity activations for layer . For finite , will not be Gaussian, which is required by section B.1. However, if we take , will converge to a Gaussian by the central limit theorem, while keeping the covariance constant. After taking the limit, we can then apply section B.1. This provides us with a recursive procedure to compute the covariances all the way up to the final layer, by sequentially taking limits of .
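The recursion can be sketched numerically for a toy 1-d network (a hedged illustration of our own: 1-pixel input patches, a filter of size 3 with exponentially decaying weight correlations, and the known closed form of the ReLU expectation; none of this is the paper's code):

```python
import numpy as np

def relu_expectation(K):
    """Closed-form E[relu(u) relu(v)] for jointly Gaussian (u, v) with
    covariance matrix K (the arc-cosine kernel of order 1)."""
    var = np.diag(K)
    denom = np.sqrt(np.outer(var, var)) + 1e-12
    cos_t = np.clip(K / denom, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return denom * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def next_layer_cov(K_post, Sigma_w):
    """Pre-activation covariance of the next layer for a 1-d
    convolution: a sum over pairs of filter taps (i, j) weighted by
    the spatial weight covariance Sigma_w, with zero padding."""
    P, k = K_post.shape[0], Sigma_w.shape[0]
    c = k // 2
    Kp = np.zeros((P + 2 * c, P + 2 * c))
    Kp[c:c + P, c:c + P] = K_post
    K_new = np.zeros((P, P))
    for i in range(k):
        for j in range(k):
            K_new += Sigma_w[i, j] * Kp[i:i + P, j:j + P]
    return K_new

rng = np.random.default_rng(4)
x = rng.standard_normal(16)
Sigma_w = np.exp(-np.abs(np.subtract.outer(np.arange(3), np.arange(3))))

K = next_layer_cov(np.outer(x, x), Sigma_w)   # layer-1 pre-activations
for _ in range(2):                            # two more hidden layers
    K = next_layer_cov(relu_expectation(K), Sigma_w)

assert K.shape == (16, 16)
assert np.all(np.diag(K) >= -1e-9)
```

Each layer alternates the closed-form nonlinearity expectation with the weighted sum over filter taps, exactly mirroring the alternation of limits described above.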
B.2 Computational Properties: Convolutions Double in Dimensions
The core of the kernel computation for convolutional networks, whether or not they have spatial correlations, is the sum over pairs of elements of input patches , for each pair of output locations in eq. 10. For a network that is built with convolutions of 2-dimensional inputs with 2-dimensional weights, the sum in eq. 10 is exactly a 4-dimensional convolution of the full second moment of the input distribution (for inputs , the outer product) with the 4-dimensional covariance tensor of the weights. In general, a dimensional convolution in weight space corresponds to a dimensional convolution in covariance space, with the same strides, dilation and padding.
With this framework, the expression for the covariance of the next layer when using independent weights becomes a 4-dimensional convolution of the activations’ second moment with a diagonal 4-dimensional covariance tensor. This is conceptually simpler, but computationally more complex, than the convolution-like sums over diagonals of Arora et al. (2019).
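As a hedged illustration of this doubling for the 1-d case (our own sketch, checked against a Monte Carlo simulation of finite correlated filters): a 1-d convolution in weight space becomes a 2-d convolution of the input's second-moment matrix with the weight covariance matrix:

```python
import numpy as np

def cov_space_conv(K_in, Sigma_w):
    """A 1-d convolution of the input with weights of covariance
    Sigma_w (k x k) becomes, in covariance space, a 2-d convolution
    of the input second-moment matrix K_in with Sigma_w (zero padded)."""
    P, k = K_in.shape[0], Sigma_w.shape[0]
    c = k // 2
    Kp = np.zeros((P + 2 * c, P + 2 * c))
    Kp[c:c + P, c:c + P] = K_in
    out = np.zeros((P, P))
    for i in range(k):
        for j in range(k):
            out += Sigma_w[i, j] * Kp[i:i + P, j:j + P]
    return out

rng = np.random.default_rng(5)
x = rng.standard_normal(8)
Sigma_w = np.exp(-np.abs(np.subtract.outer(np.arange(3), np.arange(3))))
K_exact = cov_space_conv(np.outer(x, x), Sigma_w)

# Monte Carlo check: sample correlated filters, convolve, average
L = np.linalg.cholesky(Sigma_w + 1e-10 * np.eye(3))
w = rng.standard_normal((200_000, 3)) @ L.T           # rows ~ N(0, Sigma_w)
xp = np.pad(x, 1)                                     # zero padding
patches = np.stack([xp[p:p + 3] for p in range(8)])   # (8, 3)
z = patches @ w.T                                     # conv outputs, all samples
K_mc = z @ z.T / w.shape[0]
assert np.allclose(K_exact, K_mc, atol=0.3)
```

With independent weights, `Sigma_w` would be diagonal and the 2-d convolution reduces to the diagonal sums of the existing analyses.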
B.3 Implementation
We extend the neural-tangents (Novak et al., 2020) library with a convolution layer and a readout layer that admit a 4-dimensional covariance tensor for the weights. This allows interoperation with existing operators.
4-dimensional convolutions are uncommon in deep learning, so our implementation uses a sum over 3-dimensional convolutions, where is the spatial size of the convolutional filter. While this enables GPU acceleration, computing the kernel is a costly operation. Reproducing our results takes around 10 days using an NVIDIA RTX 2070 GPU. Access to computational resources limited our experiments to subsets of CIFAR-10.
Appendix C Proof that a CNN with correlations in the weights converges to a GP
In this section, we formally prove that a CNN with correlated weights converges to a Gaussian process in the limit of infinite width. Using the Netsor programming language due to Yang (2019), most of the work in the proof is done in one step: describing a CNN with correlated weights in Netsor.
For the reader’s convenience, we informally recall the Netsor programming language (Yang, 2019) and the key property of its programs (Corollary C.2). The outline of our presentation here also closely follows Yang (2019). Readers familiar with Netsor should skip to section C.3, where we show the program that proves Theorem C.3.
We write to mean the set .
c.1 The Netsor programming language
There are three types of variables: G-vars, A-vars, and H-vars. Each of these has one or two parameters, which are the widths we will take to infinity. For a given index in (or ), each of these variables is a scalar. To represent vectors that do not grow to infinity, we need to use collections of variables.

G-vars (Gaussian-vars) are index-wise approximately i.i.d. and Gaussian. By “index-wise (approximately) independent” we mean that there can be correlations between G-vars, but only within a single index . G-vars will converge in distribution to an index-wise independent, identically distributed Gaussian in the limit of , if all widths are .

A-vars represent matrices, like the weight matrices of a dense neural network. Their entries are always i.i.d. Gaussian with zero mean, even for finite instantiations of the program (finite ). There are no correlations between different A-vars, or between elements of the same A-var.

H-vars represent variables that become index-wise i.i.d. (not necessarily Gaussian) in the infinite limit. G-var is a subtype of H-var, so all G-vars are also H-vars.
We indicate the type of a variable, or of each variable in a collection, using “:”.
[Netsor program] A Netsor program consists of:

A set of input G-vars or A-vars.

New variables can be defined using the following rules:

MatMul: multiply an i.i.d. Gaussian matrix by an i.i.d. vector, which becomes a Gaussian vector in the limit .

LinComb: given constants and vars of type G, their linear combination is a G-var.

Nonlin: applying an elementwise nonlinear function , we map several G-vars to one H-var.


Output: a tuple of scalars . The variables are input G-vars used only in the output tuple of the program. It may be the case that they coincide for different . Each is an H-var.
c.2 The output of a Netsor program converges to a Gaussian process
Definition (Controlled function). A function is controlled if
for , where is the L2 norm.
Intuitively, this means that the function grows more slowly than the rate at which the tail of a Gaussian decays. Recall that the tail of a zero-mean, identity-covariance Gaussian decays as $e^{-\lVert x \rVert^2 / 2}$.
Assumption (controlled nonlinearities). All nonlinear functions in the Netsor program are controlled.
Assumption (distribution of A-var inputs). Each element in each input A-var is sampled from a zero-mean i.i.d. Gaussian.
Assumption (distribution of G-var inputs). Consider the input vector of all G-vars for each channel , that is the vector . It is drawn from a Gaussian, . The covariance may be singular. The G-vars that correspond to the output are sampled independently from all other G-vars, so they are excluded from each such input vector.
Yang (2019) goes on to prove the Netsor master theorem, from which the corollary of interest follows.
[Corollary 5.5, abridged, Yang (2019)] Fix a Netsor program with controlled nonlinearities, and draw its inputs according to the assumptions above. For simplicity, fix the widths of all the variables to . The program outputs are , where each is an H-var, and each is a G-var independent from all others with variance (there can be some repeated indices, ). Then, as , the output tuple converges in distribution to a Gaussian . The value of is given by the recursive rules in equation 2 of Yang (2019).
Informally, the rules consist of recursively calculating the covariances of the program’s G-vars and output, assuming at every step that the G-vars are index-wise i.i.d. and Gaussian. This is the approach we employ in appendix B.
C.3 Description in Netsor of a CNN with correlations in the weights
The canonical way to represent convolutional filters in Netsor (Yang, 2019, Netsor program 4) is to use one A-var for every spatial location of the weights. That is, if our convolutional patches have size , we define the input for all . But A-vars have to be independent of each other, so how can we add correlation in the weights? We apply the correlation separately, using a LinComb operation. For this, we will use the following well-known lemma, which expresses a correlated Gaussian random variable as a linear transformation of i.i.d. ones.
Lemma. Let $\mu$, $\Sigma$ be an arbitrary mean vector and covariance matrix, respectively. Let $\epsilon$ be a collection of i.i.d. standard Gaussian random variables. Then, there exists a lower-triangular square matrix $L$ such that $L L^\top = \Sigma$. Furthermore, the random vector $v = \mu + L \epsilon$ (equivalently, $v_i = \mu_i + \sum_j L_{ij} \epsilon_j$) has a Gaussian distribution, $v \sim \mathcal{N}(\mu, \Sigma)$.

Proof. $\Sigma$ is a covariance matrix, so it is positive semi-definite; thus a lower-triangular square matrix $L$ s.t. $L L^\top = \Sigma$ always exists. (If $\Sigma$ is singular, $L$ might have zeros on the diagonal.) The vector $v$ is Gaussian because it is a linear transformation of $\epsilon$. Calculating its moments finishes the proof.

Thus, to express a convolution with correlated weights in Netsor, we can use the following strategy. First, express several convolutions with uncorrelated weights. Then, combine the outputs of these convolutions using LinComb and the coefficients of the matrix $L$.
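The lemma's strategy can be sketched numerically (our own NumPy illustration, not Netsor itself): sample i.i.d. "uncorrelated filters" and mix them with the Cholesky factor, recovering the target covariance empirically:

```python
import numpy as np

# If L L^T = Sigma and eps ~ N(0, I), then L eps ~ N(0, Sigma).
rng = np.random.default_rng(6)
Sigma = np.exp(-np.abs(np.subtract.outer(np.arange(4), np.arange(4))))
L = np.linalg.cholesky(Sigma)
assert np.allclose(L @ L.T, Sigma)

eps = rng.standard_normal((4, 500_000))   # i.i.d. "uncorrelated filters"
w = L @ eps                               # LinComb with coefficients of L
emp_cov = w @ w.T / eps.shape[1]          # empirical covariance of w
assert np.allclose(emp_cov, Sigma, atol=0.02)
```

In the Netsor program, each row of `eps` plays the role of an independent A-var convolution output, and the multiplication by `L` is the LinComb step.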
If the correlated weights have a non-zero mean, we conjecture that one can add a mean coefficient and use it in the LinComb as well. Because we only use zero-mean weight priors in the main text, we will not consider this further.
Theorem (the convolution trick is correct). Consider the definitions in algorithm 1. Define the correlated convolution
where , for and , mirroring eqs. (8) and (9).
Then, conditioning on the value of
for all , and , and for any widths , the random variables and
have the same distribution for all , and .
The covariance is more involved. First we rewrite as a function of , by substituting the definition of into it and making the indices of the MatMul explicit
(13) 
Then we can write out the second moment:
(14)  