Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes
Abstract
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs), both with and without pooling layers, and achieve state-of-the-art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible.
Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance in finite-channel CNNs trained with stochastic gradient descent (SGD) has no corresponding property in the Bayesian treatment of the infinite-channel limit – a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.
Roman Novak*, Lechao Xiao*, Jaehoon Lee*, Yasaman Bahri*, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
Google Brain
{romann, xlc, jaehlee, yasamanb, danabo, jpennin, jaschasd}@google.com
*Google AI Residents (g.co/airesidency); equal contribution.
1 Introduction
Neural networks (NNs) demonstrate remarkable performance (He et al., 2016; Oord et al., 2016; Silver et al., 2017; Vaswani et al., 2017), but are still only poorly understood from a theoretical perspective (Goodfellow et al., 2015; Choromanska et al., 2015; Pascanu et al., 2014; Zhang et al., 2017). NN performance is often motivated in terms of model architectures, initializations, and training procedures together specifying biases, constraints, or implicit priors over the class of functions learned by a network. This induced structure in learned functions is believed to be well matched to structure inherent in many practical machine learning tasks, and in many realworld datasets. For instance, properties of NNs which are believed to make them well suited to modeling the world include: hierarchy and compositionality (Lin et al., 2017; Poggio et al., 2017), Markovian dynamics (Tiňo et al., 2004; 2007), and equivariances in time and space for RNNs (Werbos, 1988) and CNNs (Fukushima & Miyake, 1982; Rumelhart et al., 1985) respectively.
The recent discovery of an equivalence between deep neural networks and GPs (Lee et al., 2018; de G. Matthews et al., 2018) allows us to express an analytic form for the prior over functions encoded by deep NN architectures and initializations. This transforms an implicit prior over functions into an explicit prior, which can be analytically interrogated and easily reasoned about.
Previous work studying these Neural Network-equivalent Gaussian Processes (NNGPs) has established the correspondence only for fully connected networks (FCNs). Additionally, previous work has not used analysis of NNGPs to gain specific insights into the equivalent NNs.
In the present work, we extend the equivalence between NNs and NNGPs to deep Convolutional Neural Networks (CNNs), both with and without pooling. CNNs are a particularly interesting architecture for study, since they are frequently held forth as a success of motivating NN design based on invariances and equivariances of the physical world (Cohen & Welling, 2016) – specifically, designing a NN to respect translation equivariance (Fukushima & Miyake, 1982; Rumelhart et al., 1985). As we will see in this work, absent pooling, this quality can vanish in the Bayesian treatment of the infinite width limit.
The specific novel contributions of the present work are:

We show analytically that CNNs with many channels, trained in a fully Bayesian fashion, correspond to an NNGP (§2, §3). We show this for CNNs both with and without pooling, with arbitrary convolutional striding, and with both SAME (zero) and VALID (no) padding. We prove uniform convergence as the number of channels in the hidden layers goes to infinity (§A.5.3), strengthening and extending the result of de G. Matthews et al. (2018) under mild conditions on the derivative of the nonlinearity.

We show that in the absence of pooling, the NNGP for a CNN and a Locally Connected Network (LCN) are identical (§5.1). An LCN has the same local connectivity pattern as a CNN, but without weight sharing or translation equivariance.

We experimentally compare trained CNNs and LCNs and find that under certain conditions both perform similarly to the respective NNGP (Figure 4, b, c). Moreover, both architectures tend to perform better with increased channel count, suggesting that, similarly to FCNs (Neyshabur et al., 2015; Novak et al., 2018), CNNs benefit from overparameterization (Figure 4, a, b), corroborating a similar trend observed in Canziani et al. (2016, Figure 2). However, we also show that careful tuning of hyperparameters allows finite CNNs trained with SGD to outperform their corresponding NNGP by a significant margin. We experimentally disentangle and quantify the contributions stemming from local connectivity, equivariance, and invariance in a convolutional model in one such setting (Table 1).

We introduce a Monte Carlo method to compute NNGP kernels for situations (such as CNNs with pooling) where evaluating the NNGP is otherwise computationally infeasible (§4).
1.1 Related work
In early work on neural network priors, Neal (1994) demonstrated that, in a fully connected network with a single hidden layer, certain natural priors over network parameters give rise to a Gaussian process prior over functions when the number of hidden units is taken to be infinite. Follow-up work extended the conditions under which this correspondence applied (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015). An exactly analogous correspondence for infinite width, finite depth deep fully connected networks was developed recently in Lee et al. (2018); de G. Matthews et al. (2018).
The line of work examining signal propagation in random deep networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Hanin & Rolnick, 2018; Chen et al., 2018) is related to the construction of the GPs we consider. They apply a mean field approximation in which the preactivation signal is replaced with a Gaussian, and the derivation of the covariance function with depth is the same as for the kernel function of a corresponding GP. Recently, Xiao et al. (2018) extended this to convolutional architectures without pooling. Xiao et al. (2018) also analyzed properties of the convolutional kernel at large depths to construct a phase diagram which will be relevant to NNGP performance, as discussed in §A.2.
Compositional kernels coming from convolutional and fully connected layers also appeared outside of the GP context in Daniely et al. (2016). In that work, the authors prove approximation guarantees between a network and its corresponding kernel, and show that empirical kernels converge as the number of channels increases.
There is a line of work considering stacking of GPs, such as deep GPs (Lawrence & Moore, 2007; Damianou & Lawrence, 2013). These no longer correspond to GPs, though they can describe a rich class of probabilistic models beyond GPs. Alternatively, deep kernel learning (Wilson et al., 2016b; a; Bradshaw et al., 2017) utilizes GPs with base kernels which take in features produced by a deep neural network (often a CNN), and train the resulting model endtoend. Finally, van der Wilk et al. (2017) incorporates convolutional structure into GP kernels, with followup work stacking multiple such GPs (Kumar et al., 2018; Blomqvist et al., 2018; Anonymous, 2019) to produce a deep convolutional GP (which is no longer a GP). Our work differs from all of these in that our GP corresponds exactly to a fully Bayesian CNN in the infinite channel limit.
Borovykh (2018) analyzes the convergence of network outputs to a GP after marginalizing over all inputs in a dataset, in the case of a temporal CNN. Thus, while they also consider a GP limit, they do not address the dependence of network outputs on specific inputs, and their model is unable to generate test set predictions.
In concurrent work, Garriga-Alonso et al. (2018) derive an NNGP kernel equivalent to one of the kernels considered in our work. In addition to explicitly specifying kernels corresponding to pooling and vectorizing, we also compare the NNGP performance to finite-width SGD-trained CNNs and analyze the differences between the two models.
2 Manychannel Bayesian CNNs are Gaussian processes
2.1 Preliminaries
Consider a series of convolutional hidden layers, $l = 0, \ldots, L$. The parameters of the network are the convolutional filters and biases, $\omega^l_{ij,\beta}$ and $b^l_i$, respectively, with outgoing (incoming) channel index $i$ ($j$) and filter relative spatial location $\beta \in \{-k, \ldots, k\}$.¹ For notational simplicity, we treat the 1D case with spatial dimension $d$ in the text, but the single spatial index $\alpha$ can be extended to higher dimensions by replacing it with tuples. Similarly, our analysis straightforwardly generalizes to strided convolutions (§A.3). Assume a Gaussian prior on both the filter weights and biases, (¹We will use Roman letters to index channels and Greek letters for spatial location. We use letters $i$, $j$, etc. to denote channel indices, $\alpha$, $\alpha'$, etc. to denote spatial indices, and $\beta$, $\beta'$, etc. for filter indices.)
$\omega^l_{ij,\beta} \sim \mathcal{N}\left(0, \frac{\sigma^2_\omega}{n^l} v_\beta\right), \qquad b^l_i \sim \mathcal{N}\left(0, \sigma^2_b\right).$  (1)
The weight and bias variances are $\sigma^2_\omega$ and $\sigma^2_b$, respectively. $n^l$ is the number of channels (filters) in layer $l$, $2k+1$ is the filter size, and $v_\beta$ is the fraction of the receptive field variance at location $\beta$ (with $\sum_\beta v_\beta = 1$). In experiments we utilize uniform $v_\beta = 1/(2k+1)$, but non-uniform $v_\beta$ should enable kernel properties that are better suited for ultra-deep networks, as in Xiao et al. (2018).
Let $\mathcal{X}$ denote a set of input images (training set or validation set or both). The network has activations $y^l(x)$ and preactivations $z^l(x)$ for each input image $x$, with input channel count $n^0$ and number of pixels $d$, where
$y^l_{j,\alpha}(x) \equiv \begin{cases} x_{j,\alpha} & l = 0 \\ \phi\left(z^{l-1}_{j,\alpha}(x)\right) & l > 0 \end{cases}, \qquad z^l_{i,\alpha}(x) \equiv \sum_{j=1}^{n^l} \sum_{\beta=-k}^{k} \omega^l_{ij,\beta}\, y^l_{j,\alpha+\beta}(x) + b^l_i.$  (2)
We emphasize the dependence of $y^l$ and $z^l$ on the input $x$. $\phi$ is a pointwise nonlinearity. $y^l(x)$ is assumed to be zero padded so that the spatial size $d$ is constant throughout the network.
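The affine structure of Equation 2 can be made concrete in a short NumPy sketch (ours, not from the paper; for brevity it uses circular rather than the zero padding the text assumes, and all names are illustrative):

```python
import numpy as np

# One convolutional layer under the prior of Equation 1, applied to a
# single input, following the sum in Equation 2.
rng = np.random.default_rng(0)
n_in, n_out, d, k = 3, 8, 10, 3            # channels in/out, pixels, filter taps
sigma_w2, sigma_b2 = 1.0, 0.1
v = np.full(k, 1.0 / k)                    # uniform v_beta, sums to 1

# omega ~ N(0, sigma_w^2 * v_beta / n_in), b ~ N(0, sigma_b^2)
w = rng.normal(0.0, 1.0, (n_out, n_in, k)) * np.sqrt(sigma_w2 * v / n_in)
b = rng.normal(0.0, np.sqrt(sigma_b2), n_out)
y = rng.normal(0.0, 1.0, (n_in, d))        # previous-layer activations

half = k // 2
z = np.zeros((n_out, d))
for i in range(n_out):                     # outgoing channel
    for a in range(d):                     # spatial location
        for j in range(n_in):              # incoming channel
            for t, beta in enumerate(range(-half, half + 1)):
                z[i, a] += w[i, j, t] * y[j, (a + beta) % d]  # circular indexing
        z[i, a] += b[i]
```

Each output channel $i$ is an independent, identically distributed affine function of the previous layer's activations, which is the property the derivation below exploits.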
A recurring quantity in this work will be the empirical uncentered covariance tensor $K^l$ of the activations $y^l$, defined as
$K^l_{\alpha,\alpha'}\left(x, x'\right) \equiv \frac{1}{n^l} \sum_{i=1}^{n^l} y^l_{i,\alpha}(x)\, y^l_{i,\alpha'}\left(x'\right).$  (3)
$K^l$ is therefore a 4-dimensional random variable indexed by two inputs and two spatial locations (the dependence on layer widths and their weights and biases is implied and by default not stated explicitly). $K^0$, the empirical uncentered covariance of the inputs, is deterministic.
Whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension. E.g. $K^l$ is a tensor of shape $|\mathcal{X}| \times |\mathcal{X}| \times d \times d$, $K^l\left(x, x'\right)$ has the shape $d \times d$, $K^l_{\alpha,\alpha'}$ has the shape $|\mathcal{X}| \times |\mathcal{X}|$, etc.
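As a sanity check on the notation, Equation 3 is a single tensor contraction; the sketch below (ours, with an assumed activation layout `(inputs, channels, pixels)`) illustrates the shape and symmetry of $K^l$:

```python
import numpy as np

def empirical_covariance(y):
    """Equation 3 (sketch): y has shape (num_inputs, channels, pixels);
    returns K of shape (num_inputs, num_inputs, pixels, pixels) with
    K[x, x', a, a'] = (1/n) * sum_i y[x, i, a] * y[x', i, a']."""
    n = y.shape[1]
    return np.einsum('xia,yib->xyab', y, y) / n

rng = np.random.default_rng(1)
y = rng.normal(size=(4, 100, 6))   # 4 inputs, 100 channels, 6 pixels
K = empirical_covariance(y)
```

Note that $K$ is symmetric under the joint exchange $(x, \alpha) \leftrightarrow (x', \alpha')$, and its "diagonal" entries $K_{\alpha,\alpha}(x,x)$ are averages of squares, hence non-negative.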
2.2 Correspondence between Gaussian processes and Bayesian deep CNNs with infinitely many channels
We next consider the prior over functions computed by a CNN in the limit of infinitely many channels in the hidden (excluding input and output) layers, $n^l \to \infty$ for $l = 1, \ldots, L$, and derive its equivalence to a GP with a compositional kernel. The following section gives a proof which uses the empirical uncentered covariance tensors to characterize finite-width intermediate layers and relies on explicit Bayesian marginalization over these intermediate layers. In Appendix A.5 we give several alternative derivations of the correspondence.
2.2.1 A single convolutional layer is a GP conditioned on the uncentered covariance tensor of the previous layer’s activations
As can be seen in Equation 2, the preactivation tensor $z^l(x)$ is an affine transformation of the multivariate Gaussian $\left(\omega^l, b^l\right)$, specified by the previous layer's activations $y^l(x)$. An affine transformation of a multivariate Gaussian is itself Gaussian. Specifically,
$p\left(z^l \mid y^l\right) = \prod_{i} p\left(z^l_i \mid y^l\right) = \prod_{i} \mathcal{N}\left(z^l_i;\, 0,\, \mathcal{A}\left(K^l\right)\right),$  (4)
where the first equality in Equation 4 follows from the independence of the weights and biases for each channel $i$. The uncentered covariance tensor $\mathcal{A}\left(K^l\right)$ for the preactivations is derived in Xiao et al. (2018), where $\mathcal{A}$ is an affine transformation (a cross-correlation operator followed by a shifting operator) defined as follows:
$\left[\mathcal{A}(K)\right]_{\alpha,\alpha'}\left(x, x'\right) \equiv \sigma^2_b + \sigma^2_\omega \sum_{\beta} v_\beta\, K_{\alpha+\beta,\,\alpha'+\beta}\left(x, x'\right).$  (5)
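A minimal sketch of the operator $\mathcal{A}$ of Equation 5 (ours; it uses circular rather than zero-padded spatial indexing to stay short, and the function name and defaults are illustrative):

```python
import numpy as np

def A(K, sigma_w2=1.0, sigma_b2=0.1, v=None, k=3):
    """Sketch of Equation 5:
    [A(K)][x, x', a, a'] = sigma_b^2 + sigma_w^2 * sum_beta v_beta * K[x, x', a+beta, a'+beta].
    Spatial indices wrap around (circular padding) to keep the sketch short;
    the paper's zero padding would instead drop out-of-range terms."""
    if v is None:
        v = np.full(k, 1.0 / k)
    half = k // 2
    out = np.full(K.shape, float(sigma_b2))
    for t, beta in enumerate(range(-half, half + 1)):
        # roll so that entry (a, a') picks up K[..., a + beta, a' + beta]
        out += sigma_w2 * v[t] * np.roll(K, shift=(-beta, -beta), axis=(2, 3))
    return out

rng = np.random.default_rng(2)
y = rng.normal(size=(3, 20, 5))                  # 3 inputs, 20 channels, 5 pixels
K = np.einsum('xia,yib->xyab', y, y) / 20        # empirical covariance (Eq. 3)
AK = A(K)
```

Setting $\sigma^2_\omega = 0$ isolates the shifting part of the map: every entry collapses to the bias variance.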
2.2.2 Uncentered covariance tensor becomes deterministic with increasing channel count
The summands in Equation 3 are i.i.d., due to the independence of the weights and biases for each channel $i$. Subject to weak restrictions on the nonlinearity $\phi$, we can apply the law of large numbers and conclude that, as $n^l \to \infty$,
$K^{l+1} \xrightarrow{\;P\;} \mathcal{C}\left(K^l\right),$  (6)
where $\mathcal{C}$ is the deterministic map
$\left[\mathcal{C}(K)\right]_{\alpha,\alpha'}\left(x, x'\right) \equiv \mathbb{E}_{z \sim \mathcal{N}\left(0,\, \mathcal{A}(K)\right)}\left[\phi\left(z_\alpha(x)\right)\, \phi\left(z_{\alpha'}\left(x'\right)\right)\right].$  (7)
For nonlinearities $\phi$ such as ReLU (Nair & Hinton, 2010) and the error function (erf), $\mathcal{C}$ can be computed in closed form, as derived in Cho & Saul (2009) and Williams (1997) respectively.
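For ReLU, the closed form is the arccosine kernel of Cho & Saul (2009); a small self-contained implementation of the bivariate expectation (function name is ours; this computes one entry of the map $\mathcal{C}$ for $\phi = \mathrm{ReLU}$):

```python
import numpy as np

def relu_pair_expectation(k11, k22, k12):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]):
    sqrt(k11 * k22) * (sin(theta) + (pi - theta) * cos(theta)) / (2 * pi),
    where cos(theta) = k12 / sqrt(k11 * k22)."""
    s = np.sqrt(k11 * k22)
    cos_t = np.clip(k12 / s, -1.0, 1.0)   # clip guards against round-off
    theta = np.arccos(cos_t)
    return s * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)
```

For example, `relu_pair_expectation(1, 1, 1)` recovers $\mathbb{E}[\mathrm{relu}(u)^2] = 1/2$ for a standard normal $u$, and `relu_pair_expectation(1, 1, 0)` recovers $\mathbb{E}[\mathrm{relu}(u)]^2 = 1/(2\pi)$ for independent $u$, $v$.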
2.2.3 Bayesian marginalization over all hidden layers
The distribution over the CNN outputs $z^L$ can be evaluated by marginalizing over all intermediate-layer uncentered covariances in the network (see Figure 1):
$p\left(z^L \mid \mathcal{X}\right) = \int p\left(z^L, K^L, \ldots, K^1 \mid \mathcal{X}\right)\, dK^1 \cdots dK^L$  (8)
$= \int p\left(z^L \mid K^L\right)\, p\left(K^L \mid K^{L-1}\right) \cdots p\left(K^1 \mid K^0\right)\, dK^1 \cdots dK^L.$  (9)
In the limit of infinitely many channels in the hidden layers, $\min\left\{n^1, \ldots, n^L\right\} \to \infty$,² all the conditional distributions except for $p\left(z^L \mid K^L\right)$ converge weakly to delta functions and can be integrated out. Precisely, Equation 9 reduces to the expression in the following theorem. (²Unlike de G. Matthews et al. (2018), we do not require the widths $n^l$ to be strictly increasing.)
Theorem 2.1.
If $\phi$ is Lipschitz, then, as $\min\left\{n^1, \ldots, n^L\right\} \to \infty$, we have the following convergence in distribution:
$z^L \mid \mathcal{X} \;\xrightarrow{\;D\;}\; \mathcal{N}\left(0,\; \mathcal{A}\left(K^L_\infty\right)\right), \qquad K^L_\infty \equiv \left(\mathcal{C} \circ \cdots \circ \mathcal{C}\right)\left(K^0\right),$  (10)
i.e. $\mathcal{C}$ composed with itself $L$ times and applied to $K^0$.
In other words, $K^L_\infty$ is the (deterministic) covariance of the CNN activations in the limit of infinitely many (hence the subscript $\infty$) channels in each of the convolutional layers from $1$ to $L$. See §A.5.3 for the proof. Therefore Equation 10 states that the outputs $z^L_{i,\alpha}(x)$ for any set of input examples $x$ and pixel indices $\alpha$ are jointly Gaussian distributed – i.e. the output of a CNN with infinitely many channels in its hidden layers is described by a GP with the covariance function $\mathcal{A}\left(K^L_\infty\right)$.
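The theorem suggests a simple recipe for computing the GP covariance: iterate $\mathcal{C}$ on $K^0$ and apply $\mathcal{A}$ once at the end. The sketch below (ours, restricted to $1\times1$ filters so that $\mathcal{A}$ has no spatial mixing, with the ReLU closed form for the Gaussian expectation) illustrates the recursion; with $\sigma^2_\omega = 2$ and $\sigma^2_b = 0$, ReLU preserves the per-pixel variance layer to layer:

```python
import numpy as np

def A(K, sw2, sb2):
    # 1x1-filter special case of Eq. 5: no spatial mixing, v_0 = 1
    return sb2 + sw2 * K

def C(Kz):
    """ReLU Gaussian expectation (arccosine kernel) applied elementwise to a
    *pre-activation* covariance tensor Kz of shape (|X|, |X|, d, d)."""
    var = np.einsum('xxaa->xa', Kz)   # per-input, per-pixel variances
    s = np.sqrt(var[:, None, :, None] * var[None, :, None, :])
    cos_t = np.clip(Kz / s, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return s * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def gp_covariance(K0, L, sw2=2.0, sb2=0.0):
    """A((C o ... o C)(K0)) with L compositions, as in Theorem 2.1 (sketch)."""
    K = K0
    for _ in range(L):
        K = C(A(K, sw2, sb2))
    return A(K, sw2, sb2)

rng = np.random.default_rng(3)
y0 = rng.normal(size=(2, 5, 3))                 # 2 inputs, 5 channels, 3 pixels
K0 = np.einsum('xia,yib->xyab', y0, y0) / 5     # deterministic input covariance
KL = gp_covariance(K0, L=3)
```

The output inherits the symmetry of $K^0$, and the variance-preservation property is a useful numerical check on the recursion.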
3 Transforming a GP over spatial locations into a GP over classes
In §2.2 we have shown that in the infinite-channel limit a deep CNN is a GP indexed by input samples and spatial locations of the top layer. Further, its uncentered covariance tensor $K^L_\infty$ can be computed in closed form. Here we show that transformations to obtain class predictions that are common in CNN classifiers can be represented as either vectorization or projection (as long as we treat classification as regression, similarly to Lee et al. (2018)). Both of these operations preserve the GP equivalence and allow the computation of the covariance tensor of the respective GP (now indexed by input samples and target classes) as a simple transformation of $K^L_\infty$.
3.1 Vectorization
One common readout strategy is to vectorize (flatten) the output of the last convolutional layer into a vector and stack a fully connected layer on top:
$z^{L+1}_i(x) = \sum_{j=1}^{n^{L+1}} \sum_{\alpha=1}^{d} \omega^{L+1}_{ij,\alpha}\, y^{L+1}_{j,\alpha}(x) + b^{L+1}_i,$  (11)
where the weights and biases are i.i.d. Gaussian, $\omega^{L+1}_{ij,\alpha} \sim \mathcal{N}\left(0, \sigma^2_\omega / \left(d\, n^{L+1}\right)\right)$ and $b^{L+1}_i \sim \mathcal{N}\left(0, \sigma^2_b\right)$, and the output index $i$ ranges over the number of classes. The sample-sample kernel of the output (identical for each class $i$) of this particular GP, denoted by $\mathcal{K}^{\text{vec}}$, is
$\mathcal{K}^{\text{vec}}\left(x, x'\right) \equiv \lim_{n^1, \ldots, n^{L+1} \to \infty} \mathbb{E}\left[z^{L+1}_i(x)\, z^{L+1}_i\left(x'\right)\right]$  (12)
$= \sigma^2_b + \frac{\sigma^2_\omega}{d} \sum_{\alpha} \left[\mathcal{C}\left(K^L_\infty\right)\right]_{\alpha,\alpha}\left(x, x'\right),$  (13)
where the limit of infinite width is derived identically to §2.2. As observed in Xiao et al. (2018), to compute the diagonal ($\alpha = \alpha'$) terms of $\mathcal{C}\left(K^L_\infty\right)$, one needs only the corresponding diagonal terms of the preceding covariance tensors. Consequently, we only need to store $K_{\alpha,\alpha}$, and the memory cost is $\mathcal{O}\left(|\mathcal{X}|^2 d\right)$ (or $\mathcal{O}(d)$ per covariance entry in an iterative or distributed setting). Note that this approach ignores pixel-pixel covariances and produces a GP corresponding to a locally connected network (see §5.1).
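In code, the vectorized readout only ever touches the diagonal $\alpha = \alpha'$ entries of the covariance tensor, which is the source of the reduced memory cost (a sketch with assumed names and defaults):

```python
import numpy as np

def vec_readout_kernel(K, sw2=1.0, sb2=0.0):
    """Sample-sample kernel of the vectorized readout (sketch of Eq. 13):
    sb2 + (sw2 / d) * sum_a K[x, x', a, a]. Only the diagonal (a = a')
    pixel-pixel entries enter, so O(|X|^2 d) numbers suffice."""
    d = K.shape[-1]
    return sb2 + sw2 * np.einsum('xyaa->xy', K) / d

# Toy covariance with no pixel-pixel correlations: identity in (a, a')
K = np.tile(np.eye(4), (2, 2, 1, 1))
out = vec_readout_kernel(K)
```

On this toy input the normalized trace is 1 for every pair of samples, so the kernel is a constant matrix of ones.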
3.2 Projection
Another approach is a projection collapsing the spatial dimensions. Let $h \in \mathbb{R}^d$ be a deterministic projection vector, $\omega^{L+1}_{ij} \sim \mathcal{N}\left(0, \sigma^2_\omega / n^{L+1}\right)$, and let the biases be the same as above.
Define the output to be
$z^{L+1}_i(x) \equiv \sum_{j=1}^{n^{L+1}} \omega^{L+1}_{ij} \left(\sum_{\alpha=1}^{d} h_\alpha\, y^{L+1}_{j,\alpha}(x)\right) + b^{L+1}_i,$  (14)
$\mathcal{K}^{h}\left(x, x'\right) = \sigma^2_b + \sigma^2_\omega \sum_{\alpha,\alpha'} h_\alpha h_{\alpha'} \left[\mathcal{C}\left(K^L_\infty\right)\right]_{\alpha,\alpha'}\left(x, x'\right),$  (15)
where the limiting behavior is derived identically to Equation 12. Examples of this approach include:

Global average pooling: take $h = \left(\frac{1}{d}, \ldots, \frac{1}{d}\right)$ and denote this particular GP kernel as $\mathcal{K}^{\text{pool}}$. Then
$\mathcal{K}^{\text{pool}}\left(x, x'\right) = \sigma^2_b + \frac{\sigma^2_\omega}{d^2} \sum_{\alpha,\alpha'} \left[\mathcal{C}\left(K^L_\infty\right)\right]_{\alpha,\alpha'}\left(x, x'\right).$  (16) This approach corresponds to applying global average pooling right after the last convolutional layer.³ It takes all pixel-pixel covariances into consideration and makes the kernel translation invariant. However, it requires $\mathcal{O}\left(|\mathcal{X}|^2 d^2\right)$ memory to compute the sample-sample covariance of the GP (or $\mathcal{O}\left(d^2\right)$ per covariance entry in an iterative or distributed setting). It is impractical to use this method to analytically evaluate the GP, and we propose to use a Monte Carlo approach instead (see §4). (³Spatially local average pooling in intermediary layers can be constructed in a similar fashion (§A.3). We focus on global average pooling in this work to more effectively isolate the effects of pooling from other aspects of the model like local connectivity or equivariance.)

Subsampling one particular pixel: take $h = e_\alpha$, a one-hot vector selecting pixel $\alpha$,
$\mathcal{K}^{\alpha}\left(x, x'\right) = \sigma^2_b + \sigma^2_\omega \left[\mathcal{C}\left(K^L_\infty\right)\right]_{\alpha,\alpha}\left(x, x'\right).$  (17) This approach (denoted $\mathcal{K}^{\alpha}$) makes use of only one pixel-pixel covariance, and requires the same amount of memory to compute as $\mathcal{K}^{\text{vec}}$.
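The two projection readouts differ only in the choice of the vector $h$; a sketch (ours, with illustrative function names), reusing the toy uncorrelated-pixels covariance from before:

```python
import numpy as np

def pool_readout_kernel(K, sw2=1.0, sb2=0.0):
    """Global average pooling readout (sketch of Eq. 16): h = (1/d, ..., 1/d),
    so every pixel-pixel covariance enters - hence the O(|X|^2 d^2) memory."""
    d = K.shape[-1]
    return sb2 + sw2 * K.sum(axis=(2, 3)) / d**2

def subsample_readout_kernel(K, alpha, sw2=1.0, sb2=0.0):
    """Subsampling readout (sketch of Eq. 17): h is one-hot at pixel alpha,
    so a single pixel-pixel covariance enters."""
    return sb2 + sw2 * K[:, :, alpha, alpha]

K = np.tile(np.eye(4), (2, 2, 1, 1))   # toy covariance, uncorrelated pixels
```

On this toy input pooling averages away most of the signal ($4/16 = 0.25$ per entry), while subsampling keeps the full single-pixel covariance of $1$, illustrating how the two readouts weight pixel-pixel structure differently.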
4 Monte Carlo evaluation of intractable GP kernels
We introduce a Monte Carlo estimation method for NNGP kernels which are computationally impractical to compute analytically, or for which we do not know the analytic form. Similar in spirit to traditional random feature methods (Rahimi & Recht, 2007), the core idea is to instantiate many random finite-width networks and use the empirical uncentered covariances of their activations to estimate the Monte Carlo GP (MC-GP) kernel,
$\left[K^l_{n,M}\left(x, x'\right)\right]_{\alpha,\alpha'} \equiv \frac{1}{Mn} \sum_{m=1}^{M} \sum_{i=1}^{n} y^l_{i,\alpha}\left(x;\, \theta_m\right)\, y^l_{i,\alpha'}\left(x';\, \theta_m\right),$  (18)
where $\theta = \left\{\theta_1, \ldots, \theta_M\right\}$ consists of $M$ draws of the weights and biases from their prior distribution, $\theta_m \sim p(\theta)$, and $n$ is the width or number of channels in the hidden layers. The MC-GP kernel converges to the analytic kernel with increasing width, $K^l_{n,M} \to K^l_\infty$ in probability.
For finite-width networks, the uncertainty in $K^l_{n,M}$ is $\operatorname{Var}\left[K^l_{n,M}\right] = \operatorname{Var}\left[K^l_n\right] / M$. From Daniely et al. (2016), we know that $\operatorname{Var}\left[K^l_n\right] = \mathcal{O}\left(1/n\right)$, which leads to $\operatorname{Var}\left[K^l_{n,M}\right] = \mathcal{O}\left(1/(Mn)\right)$. For finite $n$, $K^l_{n,M}$ is also a biased estimate of $K^l_\infty$, where the bias depends solely on network width. We do not currently have an analytic form for this bias, but we can see in Figures 3 and 7 that for the hyperparameters we probe it is small relative to the variance. In particular, the estimate is nearly constant for constant $Mn$. We thus treat $Mn$ as the effective sample size for the Monte Carlo kernel estimate. Increasing $M$ and reducing $n$ can reduce memory cost, though potentially at the expense of increased compute time and bias.
In a non-distributed setting, the MC-GP reduces the memory requirement to compute $\mathcal{K}^{\text{pool}}$ from $\mathcal{O}\left(|\mathcal{X}|^2 d^2\right)$ to $\mathcal{O}\left(|\mathcal{X}|^2 + n d\, |\mathcal{X}|\right)$, making the evaluation of CNN-GPs with pooling practical.
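Equation 18 in code: instantiate $M$ random networks of width $n$ and average the empirical uncentered covariances of the top-layer activations. The sketch below (ours) is a toy instance with $1\times1$ "convolutions" and ReLU; all names and hyperparameters are illustrative, not the paper's experimental setup:

```python
import numpy as np

def mc_gp_kernel(X, n, M, sw2=2.0, sb2=0.0, depth=2, seed=0):
    """Monte Carlo estimate of the NNGP kernel (sketch of Eq. 18): average
    the empirical uncentered covariance of top-layer activations over M
    random ReLU networks of width n. For brevity this toy uses 1x1
    'convolutions' (no spatial mixing); X has shape (inputs, n0, pixels)."""
    rng = np.random.default_rng(seed)
    num_x, _, d = X.shape
    K = np.zeros((num_x, num_x, d, d))
    for _ in range(M):
        y = X
        for _ in range(depth):
            fan_in = y.shape[1]
            w = rng.normal(0.0, np.sqrt(sw2 / fan_in), (n, fan_in))
            b = rng.normal(0.0, np.sqrt(sb2), n) if sb2 > 0 else np.zeros(n)
            z = np.einsum('ij,xja->xia', w, y) + b[None, :, None]
            y = np.maximum(z, 0.0)                    # ReLU
        K += np.einsum('xia,yib->xyab', y, y) / n     # Eq. 3 for this draw
    return K / M

X = np.random.default_rng(1).normal(size=(1, 50, 1))
K = mc_gp_kernel(X, n=200, M=50, depth=3)
# with sw2 = 2, sb2 = 0, ReLU preserves the input second moment in the
# infinite-width limit, so K should approximate it
k0 = float((X[0, :, 0] ** 2).mean())
```

Here $Mn = 10{,}000$ plays the role of the effective sample size discussed above: halving $n$ while doubling $M$ leaves the estimator's variance roughly unchanged.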
[Figure 3: MC-CNN-GP]
5 Discussion
5.1 Bayesian CNNs with many channels are identical to locally connected networks, in the absence of pooling
Locally Connected Networks (LCNs) (Fukushima, 1975; Lecun, 1989) are CNNs without weight sharing between spatial locations. LCNs preserve the connectivity pattern, and thus topology, of a CNN. However, they do not possess the equivariance property of a CNN – if an input is translated, the latent representation in an LCN will be completely different, rather than also being translated.
The CNN-GP predictions without spatial pooling (§3.1 and the subsampling readout of §3.2) depend only on sample-sample covariances, and do not depend on pixel-pixel covariances. LCNs destroy pixel-pixel covariances: $K^{l,\text{LCN}}_{\alpha,\alpha'}\left(x, x'\right) = 0$ for $\alpha \neq \alpha'$, for all $l > 0$ and all $x$ and $x'$. However, LCNs preserve the covariances between input examples at every pixel: $K^{l,\text{LCN}}_{\alpha,\alpha}\left(x, x'\right) = K^l_{\alpha,\alpha}\left(x, x'\right)$. As a result, in the absence of pooling, LCN-GPs and CNN-GPs are identical. Moreover, LCN-GPs with pooling are identical to CNN-GPs with vectorization of the top layer (under suitable scaling of the readout weights). We confirm these findings experimentally in trained networks in the limit of large width in Figure 4 (b), as well as by demonstrating convergence of MC-GPs of the respective architectures to the same CNN-GP (modulo scaling of the readout weights) in Figures 3 and 7.
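These two covariance properties can be checked directly by Monte Carlo for a single layer (a toy sketch of ours with $1\times1$ filters; larger filters behave analogously tap-by-tap): a CNN shares one filter across pixels, an LCN draws an independent filter per pixel, and only the CNN's cross-pixel covariances survive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, d, M = 3, 4, 100000
x = rng.normal(size=(n_in, d))                         # one fixed input

W_cnn = rng.normal(size=(M, n_in)) / np.sqrt(n_in)     # shared filter per draw
W_lcn = rng.normal(size=(M, n_in, d)) / np.sqrt(n_in)  # per-pixel filters

Z_cnn = W_cnn @ x                                      # (M, d) pre-activations
Z_lcn = (W_lcn * x).sum(axis=1)                        # (M, d)

# pixel-pixel covariance of the pre-activations, estimated over M draws
cov_cnn = Z_cnn.T @ Z_cnn / M
cov_lcn = Z_lcn.T @ Z_lcn / M
```

The diagonals (per-pixel covariances) of `cov_cnn` and `cov_lcn` agree, while the off-diagonal (cross-pixel) entries of `cov_lcn` vanish up to Monte Carlo noise.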
5.2 Pooling leverages equivariance to provide invariance
The only kernel leveraging pixel-pixel covariances is that of the CNN-GP with pooling. This enables the predictions of this GP and the corresponding CNN to be invariant to translations (modulo edge effects) – a beneficial quality for an image classifier. We observe strong experimental evidence supporting the benefits of invariance throughout this work (Figures 2, 3, 4 (b); Tables 1, 2), in both CNNs and CNN-GPs.
5.3 Finite-channel SGD-trained CNNs can outperform infinite-channel Bayesian CNNs, in the absence of pooling
In the absence of pooling, the benefits of equivariance and weight sharing are more challenging to explain in terms of Bayesian priors on class predictions (since without pooling equivariance is not a property of the outputs, but only of intermediary representations). Indeed, in this work we find that the performance of finite-width SGD-trained CNNs often approaches that of their CNN-GP counterpart (Figure 4, b, c),⁴ suggesting that in those cases equivariance does not play a beneficial role in SGD-trained networks. (⁴This observation is conditioned on the respective NN fitting the training set to 100% accuracy. Underfitting breaks the correspondence to an NNGP, since train set predictions of such a network no longer correspond to the true training labels. Properly tuned underfitting often also leads to better generalization (Table 2).)
However, as can be seen in Tables 1 and 2 and Figure 4 (c), the best CNN overall outperforms the best CNN-GP by a significant margin – an observation specific to CNNs, and not FCNs or LCNs. We observe this gap in performance especially in the case of networks trained with a large learning rate. In Table 1 we demonstrate this large gap in performance by evaluating different models with equivalent architecture and hyperparameter settings, chosen for good SGD-trained CNN performance.
We conjecture that equivariance, a property lacking in LCNs and in the Bayesian treatment of the infinite-channel CNN limit, contributes to the performance of SGD-trained finite-channel CNNs with the correct settings of hyperparameters. Nonetheless, more work is needed to disentangle and quantify the separate contributions of stochastic optimization and finite-width effects to differences in performance between CNNs with weight sharing and their corresponding CNN-GPs.
[Figure 4: panels (a), (b), (c) – performance of LCNs and CNNs, with no pooling and with global average pooling, as a function of the number of channels.]
[Table 1: validation error of models possessing different qualities – compositionality, local connectivity, equivariance, invariance – comparing FCN, FCN-GP, LCN (w/ pooling), CNN-GP, CNN, and CNN w/ pooling at equivalent architecture and hyperparameter settings.]
[Table 2: best achieved validation accuracy on CIFAR10, MNIST, and Fashion-MNIST for: CNN with pooling; CNN with large learning rate; CNN-GP; CNN with small learning rate; CNN (any learning rate); Convolutional GP (van der Wilk et al., 2017); ResNet GP (Garriga-Alonso et al., 2018); Residual CNN-GP (Garriga-Alonso et al., 2018); CNN-GP (Garriga-Alonso et al., 2018); FCN-GP; FCN-GP (Lee et al., 2018); FCN.]
6 Conclusion
In this work we have derived a Gaussian process that corresponds to a deep fully Bayesian CNN with infinitely many channels. The covariance of this GP can be efficiently computed either in closed form or by using Monte Carlo sampling, depending on the architecture.
The CNN-GP achieves state-of-the-art results for GPs without trainable kernels on CIFAR10. It can perform competitively with CNNs (that fit the training set) of equivalent architecture and weight priors, which makes it an appealing choice for small datasets, as it eliminates all training-related hyperparameters. However, we found that the best overall performance is achieved by finite SGD-trained CNNs and not by their infinite Bayesian counterparts. We hope our work stimulates future research into disentangling the contributions of the two qualities (Bayesian treatment and infinite width) to the performance gap observed.
7 Acknowledgements
We thank Greg Yang, Sam Schoenholz, Vinay Rao, Daniel Freeman, and Qiang Zeng for frequent discussion and feedback on preliminary results.
References
 Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 Anonymous (2019) Anonymous. Deep convolutional gaussian process. In Submitted to International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyeUPi09Y7. under review.
 Blomqvist et al. (2018) Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional gaussian processes. arXiv preprint arXiv:1810.03052, 2018.
 Borovykh (2018) Anastasia Borovykh. A gaussian process perspective on convolutional neural networks. ResearchGate:325192731, 05 2018. URL https://www.researchgate.net/publication/325192731.
 Bradshaw et al. (2017) John Bradshaw, Alexander G de G Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.
 Canziani et al. (2016) Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
 Chen et al. (2018) Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 873–882, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/chen18i.html.
 Cho & Saul (2009) Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pp. 342–350, 2009.
 Choromanska et al. (2015) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.
 Cohen & Welling (2016) Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999, 2016.
 Damianou & Lawrence (2013) Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.
 Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pp. 2253–2261, 2016.
 de G. Matthews et al. (2018) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1nGgWC.
 Fukushima (1975) Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network. Biological cybernetics, 20(3-4):121–136, 1975.
 Fukushima & Miyake (1982) Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pp. 267–285. Springer, 1982.
 Garriga-Alonso et al. (2018) Adrià Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, aug 2018. URL https://arxiv.org/abs/1808.05587.
 Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017.
 Goodfellow et al. (2015) Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. International Conference on Learning Representations, 2015.
 Hanin & Rolnick (2018) Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. arXiv preprint arXiv:1803.01719, 2018.
 Hazan & Jaakkola (2015) Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 3rd International Conference for Learning Representations, 2015.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Kumar et al. (2018) Vinayak Kumar, Vaibhav Singh, PK Srijith, and Andreas Damianou. Deep gaussian processes with convolutional kernels. arXiv preprint arXiv:1806.01655, 2018.
 Lawrence & Moore (2007) Neil D Lawrence and Andrew J Moore. Hierarchical gaussian process latent variable models. In Proceedings of the 24th international conference on Machine learning, pp. 481–488. ACM, 2007.
 Le Roux & Bengio (2007) Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007.
 Lecun (1989) Yann Lecun. Generalization and network design strategies. In Connectionism in perspective. Elsevier, 1989.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Sam Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EAM0Z.
 Lin et al. (2017) Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
 Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10), pp. 807–814, 2010.
 Neal (1994) Radford M. Neal. Priors for infinite networks (Tech. Rep. No. CRG-TR-94-1). University of Toronto, 1994.
 Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. Proceedings of the International Conference on Learning Representations workshop track, abs/1412.6614, 2015.
 Novak et al. (2018) Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.
 Oord et al. (2016) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 Pascanu et al. (2014) Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.
 Poggio et al. (2017) Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519, Oct 2017. ISSN 1751-8520. doi: 10.1007/s11633-017-1054-2. URL https://doi.org/10.1007/s11633-017-1054-2.
 Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha SohlDickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pp. 3360–3368, 2016.
 QuiñoneroCandela & Rasmussen (2005) Joaquin QuiñoneroCandela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 Rahimi & Recht (2007) Ali Rahimi and Ben Recht. Random features for largescale kernel machines. In In Neural Infomration Processing Systems, 2007.
 Rasmussen & Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
 Rumelhart et al. (1985) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
 Schoenholz et al. (2017) Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ICLR, 2017.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Tiňo et al. (2004) Peter Tiňo, Michal Cernansky, and Lubica Benuskova. Markovian architectural bias of recurrent neural networks. IEEE Transactions on Neural Networks, 15(1):6–15, 2004.
 Tiňo et al. (2007) Peter Tiňo, Barbara Hammer, and Mikael Bodén. Markovian bias of neural-based architectures with feedback connections. In Perspectives of Neural-Symbolic Integration, pp. 95–133. Springer, 2007.
 van der Wilk et al. (2017) Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems 30, pp. 2849–2858, 2017.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
 Werbos (1988) Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
 Williams (1997) Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pp. 295–301, 1997.
 Wilson et al. (2016a) Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594, 2016a.
 Wilson et al. (2016b) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016b.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
 Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5393–5402, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/xiao18a.html.
 Yang & Schoenholz (2017) Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pp. 7103–7114, 2017.
 Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
Appendix A Appendix
a.1 Additional Figures
[Figure panels not recovered. Recoverable labels: rows "LCN" and "CNN"; columns "No Pooling" and "Global Average Pooling"; horizontal axis "#Channels"; legend entries "MC-CNN-GP with pooling", "MC-LCN-GP", "MC-LCN-GP with pooling", "MC-FCN-GP".]
a.2 Relationship to Deep Signal Propagation
The recurrence relation linking the GP kernel at layer $\ell+1$ to that of layer $\ell$ following from Equation 10 is precisely the covariance map examined in a series of related papers on signal propagation (Xiao et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee et al., 2018), modulo notational differences. In those works, the action of this map on hidden-state covariance matrices was interpreted as defining a dynamical system whose large-depth behavior informs aspects of trainability. In particular, as $\ell \to \infty$ the covariance $K^{\ell}$ approaches a fixed point $K^{*}$. The convergence to a fixed point is problematic for learning because the hidden states no longer contain information that can distinguish different pairs of inputs. It is similarly problematic for GPs, as the kernel becomes pathological as it approaches a fixed point. Precisely, in the chaotic regime outputs of the GP become asymptotically decorrelated and therefore independent, while in the ordered regime they approach perfect correlation of $1$. Either of these scenarios captures no information about the training data in the kernel and makes learning infeasible.
This problem can be ameliorated by judicious hyperparameter selection, which can reduce the rate of exponential convergence to the fixed point. For hyperparameters chosen on a critical line separating two untrainable phases, the convergence rate slows to polynomial, very deep networks can be trained, and inference with deep NNGP kernels can be performed – see Table 3.
[Table 3 not recovered. Recoverable labels: columns "Depth:" and "Phase boundary"; rows "CNN-GP" and "FCN-GP".]
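The fixed-point dynamics described above can be simulated directly. The sketch below is our own illustrative example (not code from this paper): it iterates the variance and correlation maps for a ReLU network using the closed-form arccosine kernel, with weight variance $\sigma_w^2 = 2$ and zero bias variance, which places the network on the ReLU critical line where the correlation approaches its fixed point $c^* = 1$ only slowly:

```python
import numpy as np

def relu_cov_map(q, c, w_var=2.0, b_var=0.0):
    """One step of the covariance map for a ReLU network (arccosine kernel).

    q: common variance of the two inputs' pre-activations; c: their correlation.
    """
    theta = np.arccos(np.clip(c, -1.0, 1.0))
    # E[relu(u)^2] = q / 2 for u ~ N(0, q)
    q_new = b_var + w_var * q / 2.0
    # E[relu(u) relu(v)] via the arccosine kernel (Cho & Saul form)
    cov = b_var + w_var * q / (2.0 * np.pi) * (
        np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return q_new, (cov / q_new if q_new > 0 else 0.0)

# At the ReLU critical point (w_var = 2, b_var = 0) the variance q is preserved,
# while the correlation c creeps monotonically toward the fixed point c* = 1.
q, c = 1.0, 0.2
for _ in range(50):
    q, c = relu_cov_map(q, c)
```

Off the critical line, the same iteration converges to its fixed point exponentially fast, which is the pathology discussed in the text.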
a.3 Strided convolutions and average pooling in intermediate layers
Our analysis in the main text can easily be extended to cover average pooling and strided convolutions (applied before the pointwise nonlinearity). Recall that, conditioned on $K^{\ell}$, each channel $z^{\ell}_i$ of the pre-activation is a mean-zero multivariate Gaussian. Let $B$ denote a deterministic linear operator acting on the spatial locations. Then $B z^{\ell}_i$ is mean-zero Gaussian, and the covariance is
(19)  $\mathbb{E}\left[\left(B z^{\ell}_i\right)\left(B z^{\ell}_j\right)^{\top}\right] = \delta_{ij}\, B K^{\ell} B^{\top}.$
One can easily see that the $\left\{B z^{\ell}_i\right\}_i$ are i.i.d. multivariate Gaussian.
Strided convolution. A strided convolution is equivalent to a non-strided convolution composed with subsampling. Let $s$ denote the size of the stride. Then the strided convolution is equivalent to choosing $B_{ij} = \delta_{is,\,j}$, i.e. row $i$ of $B$ selects the pixel at spatial location $is$.
Average pooling. Average pooling with stride $s$ and window size $w$ is equivalent to choosing $B_{ij} = 1/w$ for $is \le j \le is + w - 1$, and $B_{ij} = 0$ otherwise.
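As a concrete illustration, the operators above can be built explicitly and applied to a spatial covariance as $B K B^{\top}$. This is a minimal numpy sketch under our own index conventions for $B$ (the toy identity covariance stands in for the layer's NNGP kernel):

```python
import numpy as np

def subsample_operator(d, stride):
    # B selects every `stride`-th spatial location: B[i, j] = 1 iff j == i * stride
    rows = -(-d // stride)  # ceil(d / stride)
    B = np.zeros((rows, d))
    for i in range(rows):
        B[i, i * stride] = 1.0
    return B

def avg_pool_operator(d, window, stride):
    # B[i, j] = 1 / window for j in [i * stride, i * stride + window), else 0
    rows = (d - window) // stride + 1
    B = np.zeros((rows, d))
    for i in range(rows):
        B[i, i * stride : i * stride + window] = 1.0 / window
    return B

# Propagate a d x d spatial covariance K through the linear operator.
d = 8
K = np.eye(d)  # toy covariance; in the text K would be the layer's NNGP kernel
B = avg_pool_operator(d, window=2, stride=2)
K_pooled = B @ K @ B.T  # covariance of the pooled pre-activation
```

With the identity covariance and non-overlapping windows of size 2, each pooled location has variance $2 \cdot (1/2)^2 = 1/2$ and distinct locations are uncorrelated, matching Equation 19.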
a.4 Review of exact Bayesian regression with GPs
Our discussion in the paper has focused on model priors. A crucial benefit we derive by mapping to a GP is that Bayesian inference is straightforward to implement and can be done exactly for regression (Rasmussen & Williams, 2006, chapter 2), requiring only simple linear algebra. Let $\mathcal{X}$ denote the $n$ training inputs, $t$ the training targets, and $\mathcal{D}$ the training set collectively. The integral over the posterior can be evaluated analytically to give a posterior predictive distribution on a test point $x^{*}$ which is Normal, $\mathcal{N}\left(\mu^{*}, \left(\sigma^{*}\right)^{2}\right)$, with
(20)  $\mu^{*} = K\left(x^{*}, \mathcal{X}\right) \left(K\left(\mathcal{X}, \mathcal{X}\right) + \sigma_{\epsilon}^{2} \mathbb{I}_{n}\right)^{-1} t$
(21)  $\left(\sigma^{*}\right)^{2} = K\left(x^{*}, x^{*}\right) - K\left(x^{*}, \mathcal{X}\right) \left(K\left(\mathcal{X}, \mathcal{X}\right) + \sigma_{\epsilon}^{2} \mathbb{I}_{n}\right)^{-1} K\left(\mathcal{X}, x^{*}\right).$
We use the shorthand $K\left(\mathcal{X}, \mathcal{X}\right)$ to denote the $n \times n$ matrix formed by evaluating the GP covariance on the training inputs, and likewise $K\left(x^{*}, \mathcal{X}\right)$ is a length-$n$ vector formed from the covariance between the test input and the training inputs; $\sigma_{\epsilon}^{2}$ is the observation noise variance. Computationally, the costly step in GP posterior prediction is the matrix inversion, which in all experiments was carried out exactly, and which typically scales as $\mathcal{O}\left(n^{3}\right)$ (though asymptotically faster algorithms exist for sufficiently large matrices). Nonetheless, there is a broad literature on approximate Bayesian inference with GPs which can be utilized for efficient implementation (Rasmussen & Williams, 2006, chapter 8; Quiñonero-Candela & Rasmussen, 2005).
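A minimal numpy sketch of Equations 20 and 21, using a Cholesky factorization in place of an explicit inverse. The RBF kernel here is an illustrative stand-in for the NNGP kernel, and the noise level is an assumption of the example:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * lengthscale^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

def gp_predict(X_train, t_train, X_test, kernel, noise=1e-2):
    """Exact GP posterior mean and variance (Equations 20-21) via Cholesky."""
    K = kernel(X_train, X_train) + noise * np.eye(len(X_train))
    L = np.linalg.cholesky(K)                       # K + noise*I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t_train))
    K_star = kernel(X_test, X_train)                # shape (m, n)
    mean = K_star @ alpha                           # Equation 20
    v = np.linalg.solve(L, K_star.T)                # intermediate for Equation 21
    var = kernel(X_test, X_test).diagonal() - (v ** 2).sum(0)
    return mean, var
```

The Cholesky route is the standard numerically stable way to apply the $(K + \sigma_{\epsilon}^{2}\mathbb{I}_{n})^{-1}$ factor appearing in both equations.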
a.5 Kernel Convergence Proof
In this section, we present three different approaches to illustrating the weak convergence of neural networks to Gaussian processes as the number of channels goes to infinity. Although the first (§A.5.1) and second (§A.5.2) approaches (taking iterated limits) are less formal, they provide some intuition for the convergence of neural networks to GPs. The approach in §A.5.3 is more standard, and the proof is more involved. We only provide the arguments for convolutional neural networks; it is straightforward to extend them to locally connected or fully connected networks.
We will use the following wellknown theorem.
Theorem A.1 (Portmanteau Theorem).
Let $\left\{X_n\right\}_{n \ge 1}$ and $X$ be real-valued random variables. The following are equivalent:

1. $X_n \to X$ in distribution;

2. for every bounded continuous function $f$,
(22)  $\lim_{n \to \infty} \mathbb{E}\left[f\left(X_n\right)\right] = \mathbb{E}\left[f\left(X\right)\right];$

3. the characteristic functions of the $X_n$, i.e. $\mathbb{E}\left[e^{itX_n}\right]$, converge pointwise to that of $X$, i.e. for all $t \in \mathbb{R}$,
(23)  $\lim_{n \to \infty} \mathbb{E}\left[e^{itX_n}\right] = \mathbb{E}\left[e^{itX}\right].$
a.5.1 Forward Mode
We show that, taking the widths $n^{1} \to \infty, n^{2} \to \infty, \dots$ sequentially, a CNN converges to a GP in the following sense: the pre-activations $\left\{z^{\ell}_i\right\}_i$ of each layer $\ell$ converge in distribution to i.i.d. Gaussians. We proceed by induction. For the first layer, it is not difficult to see that the $\left\{z^{1}_i\right\}_i$ are pairwise independent (multivariate) Gaussians with identical distribution, and thus i.i.d. Gaussian. Assume the $\left\{z^{\ell-1}_i\right\}_i$ are i.i.d. Gaussian (unconditionally). We claim that so are the $\left\{z^{\ell}_i\right\}_i$. Indeed, since both the connection weights from layer $\ell-1$ to layer $\ell$ and the biases of different channels are independent, the $\left\{z^{\ell}_i\right\}_i$ are uncorrelated and have the same distribution. To prove that they are mutually independent, we only need to show that for each $i$, $z^{\ell}_i$ converges to a Gaussian in distribution as $n^{\ell-1} \to \infty$. Since the $\left\{z^{\ell-1}_j\right\}_j$ are i.i.d., the summands of the inner sum in Equation 2 are i.i.d. We can then apply a multivariate central limit theorem^{5} to conclude that $z^{\ell}_i$ converges to a Gaussian in distribution (note that we have applied the fact that the bias term is itself Gaussian). ^{5}Assuming the covariance of $\phi\left(z^{\ell-1}\right)$ is finite.
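This central limit behavior is easy to probe numerically: with i.i.d. Gaussian weights, an output pre-activation is a sum of $n$ i.i.d. terms, and its distribution approaches a mean-zero Gaussian whose variance matches the analytic kernel as $n$ grows. A small Monte Carlo sketch (a fully connected network for simplicity; the widths and variances are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # a fixed input of dimension d = 4
n = 1000                            # hidden width (number of "channels")
draws = []
for _ in range(2000):               # independent draws of the random network
    W1 = rng.normal(scale=1.0 / np.sqrt(x.size), size=(n, x.size))
    h = np.maximum(W1 @ x, 0.0)     # ReLU hidden layer
    w2 = rng.normal(scale=1.0 / np.sqrt(n), size=n)
    draws.append(w2 @ h)            # output pre-activation (a scalar)
draws = np.asarray(draws)

# Analytic GP variance: E[relu(u)^2] = s^2 / 2 for u ~ N(0, s^2), s^2 = ||x||^2 / d
K_theory = 0.5 * (x @ x) / x.size
```

The empirical mean of `draws` is near zero and the empirical variance concentrates around `K_theory`, as predicted by the central limit argument above.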
a.5.2 Reverse Mode
Conditioned on $z^{\ell-1}$, the empirical covariance $K^{\ell}_{n}$ is a random variable that converges to $K^{\ell}$ in probability as the number of channels $n^{\ell-1} \to \infty$ (the law of large numbers; see Equation 7).
It is clear that different channels of $z^{\ell}$ are uncorrelated and have the same distribution. We will show that for any channel index $i$, the random variable $z^{\ell}_i$ “converges” to the Gaussian
(24)  $\mathcal{N}\left(0, K^{\ell}\right)$
in the sense that its characteristic function converges pointwise to that of $\mathcal{N}\left(0, K^{\ell}\right)$, i.e. for each layer $\ell$ and all vectors $t$,
(25)  $\lim_{n^{1} \to \infty} \cdots \lim_{n^{\ell-1} \to \infty} \mathbb{E}\left[e^{i t^{\top} z^{\ell}_i}\right] = e^{-\frac{1}{2} t^{\top} K^{\ell} t}.$
Proof.
Applying Fubini’s Theorem and the formula for the characteristic function of a multivariate Gaussian,
(26–30)  $\mathbb{E}\left[e^{i t^{\top} z^{\ell}_i}\right] = \mathbb{E}_{z^{\ell-1}}\left[\mathbb{E}\left[e^{i t^{\top} z^{\ell}_i} \mid z^{\ell-1}\right]\right] = \mathbb{E}_{z^{\ell-1}}\left[e^{-\frac{1}{2} t^{\top} K^{\ell}_{n} t}\right].$
We now apply $n^{\ell-1} \to \infty$ and switch the order of the limit with the outer integral. The Lebesgue dominated convergence theorem allows us to do so because the integrand is bounded above by the constant function $1$, which is absolutely integrable w.r.t. the outer measure. We then apply Theorem A.1: since $\Sigma \mapsto e^{-\frac{1}{2} t^{\top} \Sigma t}$ is bounded and continuous and $K^{\ell}_{n} \to K^{\ell}$ in probability,
(31)  $\lim_{n^{\ell-1} \to \infty} \mathbb{E}\left[e^{i t^{\top} z^{\ell}_i}\right] = \mathbb{E}\left[e^{-\frac{1}{2} t^{\top} K^{\ell} t}\right],$
where $K^{\ell}$ still depends on the empirical covariances of the lower layers. Repeatedly applying the same argument^{6} gives
(32)  $\lim_{n^{1} \to \infty} \cdots \lim_{n^{\ell-1} \to \infty} \mathbb{E}\left[e^{i t^{\top} z^{\ell}_i}\right] = e^{-\frac{1}{2} t^{\top} K^{\ell} t}.$
^{6}Here we need $K^{\ell}$, as a function of the lower-layer covariance, to be continuous.
Note that the addition of various layers on top (as discussed in §3) does not change the proof in a qualitative way. ∎
a.5.3 Uniform Convergence Mode
In this section, we present a sufficient condition on the activation function for the neural network to converge to a Gaussian process as all the widths approach infinity uniformly. Precisely, we are interested in the limit $\min\left\{n^{1}, \dots, n^{L}\right\} \to \infty$, i.e.,
(33)  $\lim_{\min\left\{n^{1}, \dots, n^{L}\right\} \to \infty} \mathbb{E}\left[e^{i t^{\top} z^{L}_i}\right] = e^{-\frac{1}{2} t^{\top} K^{L} t}.$
Using Theorem A.1 and the arguments of the preceding section, it is not difficult to see that a sufficient condition is that the empirical covariance converges in probability to the analytic covariance.
Corollary A.1.1.
If $K^{L}_{n} \xrightarrow{P} K^{L}$, i.e. the empirical covariance converges to the analytic covariance in probability as $\min\left\{n^{1}, \dots, n^{L}\right\} \to \infty$, then
(34)  $z^{L}_i \xrightarrow{d} \mathcal{N}\left(0, K^{L}\right).$
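The hypothesis of Corollary A.1.1 can likewise be probed numerically: the empirical covariance $\frac{1}{n}\sum_j \phi(u_j)\phi(v_j)$ concentrates around the analytic covariance $\mathbb{E}\left[\phi(u)\phi(v)\right]$ as the number of channels $n$ grows. A small sketch of ours for ReLU, where the analytic value has the closed arccosine form (the specific covariance matrix is an illustrative choice):

```python
import numpy as np

def relu_analytic_cov(sigma):
    # Analytic E[relu(u) relu(v)] for (u, v) ~ N(0, sigma), via the arccosine kernel.
    q1, q2, c12 = sigma[0, 0], sigma[1, 1], sigma[0, 1]
    c = c12 / np.sqrt(q1 * q2)
    theta = np.arccos(np.clip(c, -1.0, 1.0))
    return np.sqrt(q1 * q2) / (2.0 * np.pi) * (
        np.sin(theta) + (np.pi - theta) * np.cos(theta))

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
L = np.linalg.cholesky(Sigma)

def empirical_cov(n):
    z = L @ rng.normal(size=(2, n))   # n joint samples of (u, v) ~ N(0, Sigma)
    u, v = np.maximum(z, 0.0)         # ReLU activations, one pair per "channel"
    return (u * v).mean()             # empirical covariance over n channels

# |empirical_cov(n) - relu_analytic_cov(Sigma)| shrinks like 1/sqrt(n)
```

Law-of-large-numbers concentration of this estimator, uniformly over a compact set of covariance matrices, is exactly property 3 of the function space defined below.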
In the remainder of this section, we provide a sufficient condition for the hypothesis of Corollary A.1.1 (i.e. convergence in probability of the empirical covariance), borrowing some ideas from Daniely et al. (2016).
Notation. Let $\mathrm{PSD}(2)$ denote the set of $2 \times 2$ positive semidefinite matrices and, for $r > 0$, define
(35)  $\mathrm{PSD}(2, r) \equiv \left\{\Sigma \in \mathrm{PSD}(2) : \left\|\Sigma\right\|_{\infty} \le r\right\}.$
Further, let $F_{\phi}$ and $X_{\Sigma}$ be a function and a random variable (induced by the activation $\phi$) given by
(36)  $F_{\phi}\left(\Sigma\right) \equiv \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Sigma)}\left[\phi(u)\phi(v)\right],$
(37)  $X_{\Sigma} \equiv \phi(u)\phi(v), \quad (u, v) \sim \mathcal{N}(0, \Sigma).$
Finally, let $\mathcal{F}$ denote the space of measurable functions $\phi$ with the following properties:

1. Uniformly Squared Integrable: for every $r > 0$, there exists a positive constant $M_{r}$ such that
(38)  $\sup_{\Sigma \in \mathrm{PSD}(2, r)} \mathbb{E}\left[\left|X_{\Sigma}\right|^{2}\right] \le M_{r};$

2. Lipschitz Continuity: for every $r > 0$, there exists $L_{r} > 0$ such that for all $\Sigma_{1}, \Sigma_{2} \in \mathrm{PSD}(2, r)$,
(39)  $\left|F_{\phi}\left(\Sigma_{1}\right) - F_{\phi}\left(\Sigma_{2}\right)\right| \le L_{r} \left\|\Sigma_{1} - \Sigma_{2}\right\|_{\infty};$

3. Uniform Convergence in Probability: for every $r > 0$ and every $\epsilon > 0$, with $\left\{X_{\Sigma, j}\right\}_{j}$ i.i.d. copies of $X_{\Sigma}$,
(40)  $\lim_{n \to \infty} \sup_{\Sigma \in \mathrm{PSD}(2, r)} \mathbb{P}\left(\left|\frac{1}{n}\sum_{j=1}^{n} X_{\Sigma, j} - F_{\phi}\left(\Sigma\right)\right| > \epsilon\right) = 0.$
We will also use $\mathcal{F}_{1}$, $\mathcal{F}_{2}$, and $\mathcal{F}_{3}$ to denote the spaces of functions satisfying property 1, property 2, and property 3, respectively. It is not difficult to see that for every $k$, $\mathcal{F}_{k}$ is a vector space, and so is $\mathcal{F}$.
Definition A.1.
We say $\phi$ is linearly bounded (exponentially bounded) if there exist $a, b > 0$ such that
(41)  $\left|\phi(x)\right| \le a + b\left|x\right| \quad \left(\text{respectively } \left|\phi(x)\right| \le a\, e^{b\left|x\right|}\right) \quad \text{for almost every } x \in \mathbb{R}.$
Note that the class of linearly bounded (exponentially bounded) functions is closed under addition and scalar multiplication. Moreover, exponentially bounded functions contain all polynomials, and are also closed under multiplication and under integration in the sense that for any constant $c$, the function
(42)  $x \mapsto \int_{0}^{x} \phi(t)\, dt + c$
is also exponentially bounded.
Lemma A.2.
The following are true:

1. $\mathcal{F}_{1}$ contains all exponentially bounded functions.

2. $\mathcal{F}_{2}$ contains all functions whose first derivatives are exponentially bounded.

3. $\mathcal{F}_{3}$ contains all linearly bounded functions.
Proof.
1. We prove the first statement. Assume $\left|\phi(x)\right| \le a e^{b|x|}$ and let $\Sigma \in \mathrm{PSD}(2, r)$ with $(u, v) \sim \mathcal{N}(0, \Sigma)$. Then
(43)  $\mathbb{E}\left[\left|X_{\Sigma}\right|^{2}\right] = \mathbb{E}\left[\phi(u)^{2}\phi(v)^{2}\right] \le a^{4}\, \mathbb{E}\left[e^{2b\left(|u| + |v|\right)}\right] \le a^{4}\, \mathbb{E}\left[e^{4b|u|}\right]^{1/2} \mathbb{E}\left[e^{4b|v|}\right]^{1/2}.$
In the last inequality, we applied the Cauchy–Schwarz inequality. Thus
(44)  $\sup_{\Sigma \in \mathrm{PSD}(2, r)} \mathbb{E}\left[\left|X_{\Sigma}\right|^{2}\right] \le a^{4} \sup_{\sigma^{2} \le r} \mathbb{E}_{u \sim \mathcal{N}(0, \sigma^{2})}\left[e^{4b|u|}\right] \equiv M_{r} < \infty.$
2. To prove the second statement, let $\Sigma_{1}, \Sigma_{2} \in \mathrm{PSD}(2, r)$ and define $A_{1}$ (similarly for $A_{2}$) via a matrix square root:
(45)  $A_{1} A_{1}^{\top} = \Sigma_{1}.$
Then $A_{1} z \sim \mathcal{N}(0, \Sigma_{1})$ (and $A_{2} z \sim \mathcal{N}(0, \Sigma_{2})$) for $z \sim \mathcal{N}(0, \mathbb{I}_{2})$. Let
(46)  $h(A) \equiv \phi\left((A z)_{1}\right)\phi\left((A z)_{2}\right)$
and
(47)  $g(z) \equiv \sup_{A : A A^{\top} \in \mathrm{PSD}(2, r)} \left\|\nabla_{A} h(A)\right\|_{\infty}.$
Since $\phi'$ is exponentially bounded, $g$ is also exponentially bounded. In addition, $p g$ is exponentially bounded for any polynomial $p$.
Applying the Mean Value Theorem (we use the notation $\lesssim$ to hide the dependence on $a$, $b$, and other absolute constants),
(48–51)  $\left|F_{\phi}\left(\Sigma_{1}\right) - F_{\phi}\left(\Sigma_{2}\right)\right| = \left|\mathbb{E}_{z}\left[h\left(A_{1}\right) - h\left(A_{2}\right)\right]\right| \le \mathbb{E}_{z}\left[g(z)\right] \left\|A_{1} - A_{2}\right\|_{\infty} \lesssim \left\|A_{1} - A_{2}\right\|_{\infty}.$
Note that the operator norm is bounded by the infinity norm (up to a multiplicative constant) and $g$ is exponentially bounded. There is a constant (hidden in $\lesssim$) such that the above is bounded by
(52–55)  $\lesssim \left\|\Sigma_{1} - \Sigma_{2}\right\|_{\infty}.$
Here we have applied the fact
(56)  that the square roots can be chosen so that $\left\|A_{1} - A_{2}\right\|_{\infty}$ is controlled by $\left\|\Sigma_{1} - \Sigma_{2}\right\|_{\infty}$ on $\mathrm{PSD}(2, r)$.
3. Assume $\phi$ is linearly bounded. We postpone the proof of the following lemma.
Lemma A.3.
Assume $\phi$ is linearly bounded. Then there is a constant $C$ such that for all $r > 0$ and all $\Sigma \in \mathrm{PSD}(2, r)$,