On Exact Computation with an Infinitely Wide Neural Net
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width”— namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper (Jacot et al., 2018) introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. A subsequent paper (Lee et al., 2019) gave heuristic Monte Carlo methods to estimate the NTK and its extension, Convolutional Neural Tangent Kernel (CNTK), and used this to try to understand the limiting behavior on datasets like CIFAR-10. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width.
The current paper gives the first efficient exact algorithm (based upon dynamic programming) for computing CNTK, as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for the performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in (Novak et al., 2019), and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off).
We give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK. Our experiments also demonstrate that the earlier Monte Carlo approximation can degrade the performance significantly, thus highlighting the power of our exact kernel computation, which we have applied even to the full CIFAR-10 dataset and 20-layer nets.
How well does a classic deep net architecture like AlexNet or VGG19 perform on a standard dataset such as CIFAR-10 when its “width”— namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Questions about these “infinite limits” of deep nets have naturally emerged in the ongoing effort to understand the power of deep learning. In mathematics it is often easier to study objects in the infinite limit. Furthermore, the infinite limit could conceivably make sense in deep learning, since over-parametrization seems to help optimization a lot and doesn’t hurt generalization much (Zhang et al., 2017): deep neural nets with millions of parameters work well even for datasets with 50k training data points. So why not imagine nets whose width goes to infinity?
Allowing width to go to infinity also connects deep learning in an interesting way with other areas of machine learning. A single hidden-layer neural network with i.i.d. random parameters, in the limit of infinite width, is a function drawn from a Gaussian Process (GP) (Neal, 1996). This model as well as analogous ones with multiple layers (Lee et al., 2018; Matthews et al., 2018) and convolutional filters (Novak et al., 2019; Garriga-Alonso et al., 2019) make up the Gaussian Process view of deep learning. These correspond to infinitely wide deep nets all of whose parameters are chosen randomly (with careful scaling), and only the top (classification) layer is trained.
From now on we will use weakly-trained nets to refer to nets whose layers receive random initialization and only the top layer is trained by gradient descent. We use fully-trained to refer to nets all of whose parameters are trained. It has long been known that weakly-trained convolutional nets have reasonable performance on MNIST and CIFAR-10. Weakly-trained nets that are fully-connected instead of convolutional can also be thought of as “multi-layer random kitchen sinks,” which also have a long history.
Weakly-trained nets — whether of finite or infinite width — also define interesting kernels. Specifically, if $f(\theta, x)$ denotes the output of the network on input $x$, where $\theta$ denotes the parameters in the network, and $\mathcal{W}$ is an initialization distribution over $\theta$ (usually Gaussian), then training just the top layer with an $\ell_2$ loss is equivalent to a kernel regression for the following kernel:
$$\ker(x, x') = \mathbb{E}_{\theta \sim \mathcal{W}}\left[\left\langle f(\theta, x), f(\theta, x')\right\rangle\right],$$
where $x, x'$ are two inputs. This kernel method makes sense even if the number of parameters goes to infinity.
The objects of interest in this paper are not weakly-trained nets, but fully-trained nets. In the finite case, analysis of optimization and generalization of fully-trained nets is of course an open problem. One may also ask:
Can we understand the power of fully-trained nets whose width goes to infinity?
A priori this question doesn’t seem any easier than the finite case, and empirical evaluation seems computationally infeasible due to the infinite limit. They also do not correspond to a kernel method in any obvious way.
Recent papers suggest that nets whose width greatly exceeds the number of training data points can rapidly reduce training error to zero via gradient descent, and under some conditions, the trained net also exhibits good generalization (Du et al., 2019, 2018b; Li and Liang, 2018; Allen-Zhu et al., 2018a, b; Zou et al., 2018; Arora et al., 2019). Extra-wideness plays a crucial role in the proof: it is shown that as width increases, training causes increasingly smaller changes (in a proportionate sense) in the parameters. This raises the possibility that as one increases the width to infinity, a certain limiting behavior can emerge even in the fully-trained net. A recent paper by Jacot et al. (2018) isolated a notion implicit in the above papers (Du et al., 2019, 2018b), which they called the Neural Tangent Kernel (NTK). They suggested — via a proof that is slightly heuristic — that this fixed kernel characterizes the behavior of fully-connected infinite width neural networks whose layers have been trained by gradient descent. The NTK is very different from the Gaussian Process kernels discussed earlier, and is defined using the gradient of the output of the randomly initialized net with respect to its parameters, i.e.,
$$\ker(x, x') = \mathbb{E}_{\theta \sim \mathcal{W}}\left[\left\langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, x')}{\partial \theta}\right\rangle\right].$$
Here, the gradient appears from considering gradient descent, as will be explained in Section 3. One may also generalize the NTK to convolutional neural nets, and we call the corresponding kernel Convolutional Neural Tangent Kernel (CNTK).
Though NTK and CNTK are defined via an infinite limit, a recent paper (Lee et al., 2019) attempted to understand their properties via a heuristic finite approximation of the infinite limit kernel by Monte Carlo methods. However, as will be shown in Section 5.2, such Monte Carlo methods can degrade the performance significantly. It thus remained open what the full power of the exact NTK and CNTK is on modern datasets. This question is especially challenging for CNTK: once the convolution operation is involved, exact computation of the kernels (for either the convolutional Gaussian process kernel or the CNTK) was believed to be infeasible for large datasets like CIFAR-10 (Novak et al., 2019).
We give an exact and efficient dynamic programming algorithm to compute the NTK and CNTK for ReLU activation (namely, to compute the kernel value $\Theta(x, x')$ given a pair of inputs $x$ and $x'$). Using this algorithm — as well as implementation tricks for GPUs — we can settle the question of the performance of fully-trained infinitely wide nets with a variety of architectures. For instance, we find that their performance on CIFAR-10 is within 6% of the performance of the same architectures in the finite case (note that the proper comparison in the finite case involves turning off batch norm, data augmentation, dropout, etc., in the optimization). In particular, the CNTK corresponding to an 11-layer convolutional net with global average pooling achieves 77.43% classification accuracy. This is 10% higher than the best reported performance of a Gaussian process with fixed kernel on CIFAR-10 (Novak et al., 2019). (We only consider fixed kernels defined without using the training data. We do not compare to methods that tune the kernels using training data (Van der Wilk et al., 2017) or use a neural network to extract features and then apply a kernel method on top of them (Mairal et al., 2014).)
We give a more rigorous, non-asymptotic proof that the NTK captures the behavior of a fully-trained wide neural net under weaker conditions than previous proofs. We also experimentally show that the Monte Carlo methods for approximating CNTK in the earlier work do not compute good approximations, which is clear from their much worse performance on CIFAR-10.
1.1 Notation
We use bold-faced letters for vectors, matrices and tensors. For a vector $\mathbf{a}$, let $[\mathbf{a}]_i$ be its $i$-th entry; for a matrix $\mathbf{A}$, let $[\mathbf{A}]_{i,j}$ be its $(i,j)$-th entry; for a 4th-order tensor $\mathbf{T}$, let $[\mathbf{T}]_{ij,i'j'}$ be its $(i,j,i',j')$-th entry. Let $\mathbf{I}$ be the identity matrix, and $[n] = \{1, 2, \ldots, n\}$. Let $\mathbf{e}_i$ be an indicator vector with $i$-th entry being $1$ and other entries being $0$, and let $\mathbf{1}$ denote the all-one vector. We use $\odot$ to denote the entry-wise product and $\otimes$ to denote the tensor product. We use $\mathrm{diag}(\cdot)$ to transform a vector to a diagonal matrix. We use $\sigma(\cdot)$ to denote the activation function, such as the rectified linear unit (ReLU) function: $\sigma(z) = \max\{z, 0\}$, and $\dot\sigma(\cdot)$ to denote the derivative of $\sigma(\cdot)$. Denote by $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ the Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.
1.2 Paper Organization
The rest of the paper is organized as follows. Section 2 gives an overview of related work. In Section 3, we review how the NTK arises from training an infinitely wide fully-connected neural network, and also rigorously establish the equivalence between a fully-trained wide neural net and kernel regression under the NTK. In Section 4, we derive the formulas for CNTK and describe our efficient computation of CNTK. In Section 5, we present our experimental results and discuss our findings. Lastly, we conclude in Section 6.
2 Related Work
From a Gaussian Process (GP) viewpoint, the correspondence between infinite neural networks and kernel machines was first noted by Neal (1996). Follow-up work extended this correspondence to more general shallow neural networks (Williams, 1997; Roux and Bengio, 2007; Hazan and Jaakkola, 2015). More recently, this was extended to deep and convolutional neural networks (Lee et al., 2018; Matthews et al., 2018; Novak et al., 2019; Garriga-Alonso et al., 2019). However, these kernels, as we discussed in Section 1, represent weakly-trained nets, instead of fully-trained nets.
Beyond GPs, the connection between neural networks and kernels is also studied in the compositional kernel literature. Cho and Saul (2009) derived a closed-form kernel formula for rectified polynomial activations, which include ReLU as a special case. Daniely et al. (2016) proposed a general framework to transform a neural network to a compositional kernel and later Daniely (2017) showed for sufficiently wide neural networks, stochastic gradient descent can learn functions that lie in the corresponding reproducing kernel Hilbert space. However, the kernels studied in these works still correspond to weakly-trained neural networks.
Wilson et al. (2016); Al-Shedivat et al. (2016) used a neural network as a feature extractor, applied a GP on top of these features, and then trained the resulting model end-to-end. Another line of work built probabilistic graphical models (PGMs) which use kernels as a component (Damianou and Lawrence, 2013; Lawrence and Moore, 2007; Van der Wilk et al., 2017; Kumar et al., 2018; Blomqvist et al., 2018). These are not pure kernel methods, so we do not compare with them in this paper. Nevertheless, CNTK may be combined with these techniques to generate more powerful predictors.
This paper is inspired by a line of recent work on over-parameterized neural networks (Du et al., 2019, 2018b; Du and Hu, 2019; Li and Liang, 2018; Allen-Zhu et al., 2018b, a; Zou et al., 2018). These papers established that for (convolutional) neural networks with large but finite width, (stochastic) gradient descent can achieve zero training error. A key component in these papers is showing that the weight matrix at each layer is close to its initialization. This observation implies that the kernel defined in Equation (2) is still close to its initialization. Arora et al. (2019) explicitly used this observation to derive generalization bounds for two-layer over-parameterized neural networks.
Jacot et al. (2018) derived the exact same kernel from kernel gradient descent. They showed that if the number of neurons per layer goes to infinity in a sequential order, then the kernel remains unchanged for a finite training time. They termed the derived kernel the Neural Tangent Kernel (NTK). We follow the same naming convention and name its convolutional extension the Convolutional Neural Tangent Kernel (CNTK). Later, Yang (2019) derived a formula for the CNTK. Compared with (Yang, 2019), our CNTK formula has a more explicit convolutional structure and results in an efficient GPU-friendly computation method. Recently, Lee et al. (2019) tried to empirically verify the theory in (Jacot et al., 2018) using a heuristic Monte Carlo method. They showed that in the first few iterations, the kernel is approximately unchanged. However, as will be shown in Section 5.2, such Monte Carlo methods can decrease the classification accuracy significantly even on a simple “CIFAR-2” (airplane vs. car) dataset. Therefore, exact kernel evaluation is important for studying the power of NTK and CNTK.
3 Neural Tangent Kernel
In this section we describe the fully-connected deep neural net architecture and its infinite width limit, and show how training it with respect to the $\ell_2$ loss gives rise to a kernel regression problem involving the neural tangent kernel (NTK).
We denote by $f(\theta, x)$ the output of a neural network, where $\theta \in \mathbb{R}^N$ is all the parameters in the network and $x \in \mathbb{R}^d$ is the input. (For simplicity, we only consider a single output here; the generalization to multiple outputs is straightforward.) Given a training dataset $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, consider training the neural network by minimizing the squared loss over training data:
$$\ell(\theta) = \frac{1}{2} \sum_{i=1}^n \left( f(\theta, x_i) - y_i \right)^2.$$
The proof of the following lemma uses simple differentiation and appears in Appendix A.
Lemma 3.1. Consider minimizing the squared loss $\ell(\theta)$ by gradient descent with infinitesimally small learning rate: $\frac{\mathrm{d}\theta(t)}{\mathrm{d}t} = -\nabla \ell(\theta(t))$. Let $u(t) = \left( f(\theta(t), x_i) \right)_{i \in [n]} \in \mathbb{R}^n$ be the network outputs on all $x_i$'s at time $t$, and $y = (y_i)_{i \in [n]}$ be the desired outputs. Then $u(t)$ follows the following evolution, where $H(t)$ is an $n \times n$ positive semidefinite matrix whose $(i,j)$-th entry is $\left\langle \frac{\partial f(\theta(t), x_i)}{\partial \theta}, \frac{\partial f(\theta(t), x_j)}{\partial \theta} \right\rangle$:
$$\frac{\mathrm{d}u(t)}{\mathrm{d}t} = -H(t) \cdot (u(t) - y).$$
The statement of Lemma 3.1 involves a matrix $H(t)$. Below we define a deep net architecture whose width is allowed to go to infinity, while fixing the training data as above. In the limit, it can be shown that the matrix $H(t)$ remains constant during training, i.e., equal to $H(0)$. Moreover, under a random initialization of parameters, the random matrix $H(0)$ converges in probability to a certain deterministic kernel matrix $H^*$ as the width goes to infinity, which is the Neural Tangent Kernel (Equation (2)) evaluated on the training data. If $H(t) = H^*$ for all $t$, then Equation (3) becomes
$$\frac{\mathrm{d}u(t)}{\mathrm{d}t} = -H^* \cdot (u(t) - y).$$
Note that the above dynamics is identical to the dynamics of kernel regression under gradient flow, for which at time $t \to \infty$ the final prediction function is (assuming $u(0) = 0$)
$$f^*(x) = \left( \ker(x, x_1), \ldots, \ker(x, x_n) \right) (H^*)^{-1} y.$$
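To make the limiting dynamics concrete, the following sketch simulates the gradient flow of Equations (3)–(5) with a generic PSD kernel standing in for the NTK (an RBF kernel, purely for illustration; the kernel choice, dimensions, step size, and iteration count are our own assumptions, not from the paper), and checks that the flow drives the training residual to zero and that the test prediction matches the kernel regression formula:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))                 # n = 10 training inputs
y = rng.normal(size=10)                      # training targets

def ker(a, b):                               # stand-in PSD kernel (NOT the NTK)
    return np.exp(-0.5 * np.sum((a - b) ** 2))

H = np.array([[ker(xi, xj) for xj in X] for xi in X])  # kernel matrix H*
x_te = rng.normal(size=5)
k = np.array([ker(x_te, xi) for xi in X])    # kernel values to the test point

# Discretized gradient flow: u' = -H (u - y), starting from u(0) = 0.
# The test prediction evolves alongside as f' = -k^T (u - y).
u, f_te = np.zeros(10), 0.0
eta = 0.1 / np.linalg.norm(H, 2)
for _ in range(5000):
    resid = u - y
    f_te -= eta * (k @ resid)
    u -= eta * (H @ resid)

assert np.allclose(u, y, atol=1e-4)                            # residual -> 0
assert np.isclose(f_te, k @ np.linalg.solve(H, y), atol=1e-4)  # matches formula
```

The second assertion is exactly the statement above: starting from zero initialization, gradient flow on kernel regression converges to the predictor $f^*(x) = (\ker(x, x_1), \ldots, \ker(x, x_n)) (H^*)^{-1} y$.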
3.1 Fully-Connected Deep Neural Net and Its Infinite Width Limit
Now we define the fully-connected deep neural net and review its Gaussian Process viewpoint in the infinite width limit. The fully-connected net is defined in the standard way: the $h$-th layer is computed by applying a linear transformation on the output of the $(h-1)$-th layer, followed by a coordinate-wise nonlinearity $\sigma$. Recall that standard initialization would pick each entry of the layer matrix from $\mathcal{N}(0, c)$, where $c$ is chosen such that the expected norm of each row in the matrix is $1$. Thus $c$ depends on the number of nodes in the row, and the initialization stays bounded as we let the number of nodes in each row go to infinity.
The limit behavior of gradient descent depends, however, on the details of initialization. In this paper we initialize parameters using standard Gaussians $\mathcal{N}(0, 1)$ and multiply a scaling factor $\sqrt{c_\sigma / d_h}$ on the outside (instead of using $\mathcal{N}(0, c_\sigma / d_h)$ to initialize). These two parameterizations can be made equivalent if one is allowed to set different learning rates for different layers, as discussed in (Lee et al., 2019). In fact the gradient from each layer defines a different kernel, and setting different learning rates for different layers amounts to summing over those individual kernels with different weights. Our choice allows setting equal learning rates for all the layers while giving rise to the NTK in the infinite width limit, which is the unweighted sum of those individual kernels. In contrast, using Kaiming initialization (He et al., 2015) at infinite width is equivalent to setting the learning rate of the first layer to 0 and the rest to be equal, thus resulting in a similar kernel to the NTK — the unweighted sum of kernels from the second layer to the last layer. The initialization method in (Daniely, 2017) gives the kernel corresponding to the last layer only, which corresponds to the Gaussian process.
Now we define a fully-connected neural net formally. Let $x \in \mathbb{R}^d$ be the input, and denote $g^{(0)}(x) = x$ and $d_0 = d$ for notational convenience. We define an $L$-hidden-layer fully-connected neural network recursively:
$$f^{(h)}(x) = W^{(h)} g^{(h-1)}(x) \in \mathbb{R}^{d_h}, \qquad g^{(h)}(x) = \sqrt{\frac{c_\sigma}{d_h}}\, \sigma\!\left( f^{(h)}(x) \right) \in \mathbb{R}^{d_h}, \qquad h = 1, 2, \ldots, L,$$
where $W^{(h)} \in \mathbb{R}^{d_h \times d_{h-1}}$ is the weight matrix in the $h$-th layer ($h \in [L]$), $\sigma$ is a coordinate-wise activation function, and $c_\sigma = \left( \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left[ \sigma(z)^2 \right] \right)^{-1}$. The last layer of the neural network is
$$f(\theta, x) = f^{(L+1)}(x) = W^{(L+1)} \cdot g^{(L)}(x),$$
where $W^{(L+1)} \in \mathbb{R}^{1 \times d_L}$ is the weight matrix in the final layer, and $\theta = \left( W^{(1)}, \ldots, W^{(L+1)} \right)$ represents all the parameters in the network.
We initialize all the weights to be i.i.d. $\mathcal{N}(0, 1)$ random variables, and consider the limit of large hidden widths: $d_1, d_2, \ldots, d_L \to \infty$. The scaling factor $\sqrt{c_\sigma / d_h}$ in Equation (6) ensures that the norm of $g^{(h)}(x)$ for each $h \in [L]$ is approximately preserved at initialization (see (Du et al., 2018b)). In particular, for ReLU activation, we have $\mathbb{E}\left[ \left\| g^{(h)}(x) \right\|^2 \right] = \|x\|^2$ ($\forall h \in [L]$).
Recall from (Lee et al., 2018) that in the infinite width limit, the pre-activations $f^{(h)}(x)$ at every hidden layer $h \in [L]$ have all their coordinates tending to i.i.d. centered Gaussian processes of covariance $\Sigma^{(h-1)}: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ defined recursively as:
$$\Sigma^{(0)}(x, x') = x^\top x',$$
$$\Lambda^{(h)}(x, x') = \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, x') \\ \Sigma^{(h-1)}(x', x) & \Sigma^{(h-1)}(x', x') \end{pmatrix} \in \mathbb{R}^{2 \times 2},$$
$$\Sigma^{(h)}(x, x') = c_\sigma\, \mathbb{E}_{(u, v) \sim \mathcal{N}\left( \mathbf{0}, \Lambda^{(h)}(x, x') \right)}\left[ \sigma(u)\, \sigma(v) \right],$$
for $h \in [L]$. The intuition is that $\left[ f^{(h)}(x) \right]_i$ is a centered Gaussian process conditioned on $f^{(h-1)}$ ($\forall i \in [d_h]$), with covariance
$$\mathbb{E}\left[ \left[ f^{(h)}(x) \right]_i \left[ f^{(h)}(x') \right]_i \,\middle|\, f^{(h-1)} \right] = \left\langle g^{(h-1)}(x), g^{(h-1)}(x') \right\rangle = \frac{c_\sigma}{d_{h-1}} \sum_{j=1}^{d_{h-1}} \sigma\!\left( \left[ f^{(h-1)}(x) \right]_j \right) \sigma\!\left( \left[ f^{(h-1)}(x') \right]_j \right),$$
which converges to $\Sigma^{(h)}(x, x')$ as $d_{h-1} \to \infty$, given that each $\left[ f^{(h-1)} \right]_j$ is a centered Gaussian process with covariance $\Sigma^{(h-1)}$. This yields the inductive definition in Equation (7).
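As a sanity check on this limit, the following sketch (our own illustration; the widths, dimensions, and sample counts are arbitrary choices) compares the inner product of first-layer post-activations of a wide random ReLU net against $\Sigma^{(1)}(x, x')$ estimated by Monte Carlo from its defining Gaussian expectation, with $c_\sigma = 2$ for ReLU:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, c_sigma = 10, 200000, 2.0          # c_sigma = 2 for ReLU
x  = rng.normal(size=d); x  /= np.linalg.norm(x)
xp = rng.normal(size=d); xp /= np.linalg.norm(xp)
relu = lambda z: np.maximum(z, 0.0)

# Empirical features of a wide random one-hidden-layer net:
# f^(1) = W x with W_ij ~ N(0, 1), g^(1) = sqrt(c_sigma / width) relu(f^(1)).
W = rng.normal(size=(width, d))
g, gp = [np.sqrt(c_sigma / width) * relu(W @ v) for v in (x, xp)]
empirical = g @ gp                           # <g^(1)(x), g^(1)(x')>

# Sigma^(1)(x, x') = c_sigma * E[relu(u) relu(v)], with (u, v) a centered
# Gaussian whose covariance Lambda^(1) is built from Sigma^(0)(x, x') = x^T x'.
Lam = np.array([[x @ x, x @ xp], [xp @ x, xp @ xp]])
uv = rng.multivariate_normal(np.zeros(2), Lam, size=1000000)
sigma1 = c_sigma * np.mean(relu(uv[:, 0]) * relu(uv[:, 1]))

assert abs(empirical - sigma1) < 0.05        # they agree up to sampling error
```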
3.2 Neural Tangent Kernel
We proceed to derive the NTK for the fully-connected neural net. Recall that we need to compute the value that $\left\langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, x')}{\partial \theta} \right\rangle$ converges to at random initialization in the infinite width limit. We can write the partial derivative with respect to a particular weight matrix $W^{(h)}$ in a compact form:
$$\frac{\partial f(\theta, x)}{\partial W^{(h)}} = b^{(h)}(x) \cdot \left( g^{(h-1)}(x) \right)^\top, \qquad h = 1, 2, \ldots, L+1,$$
where
$$b^{(h)}(x) = \begin{cases} 1 \in \mathbb{R}, & h = L+1, \\ \sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left( W^{(h+1)} \right)^\top b^{(h+1)}(x) \in \mathbb{R}^{d_h}, & h = 1, \ldots, L, \end{cases}$$
$$D^{(h)}(x) = \mathrm{diag}\!\left( \dot\sigma\!\left( f^{(h)}(x) \right) \right) \in \mathbb{R}^{d_h \times d_h}.$$
Then, for any $h \in [L+1]$, we can compute
$$\left\langle \frac{\partial f(\theta, x)}{\partial W^{(h)}}, \frac{\partial f(\theta, x')}{\partial W^{(h)}} \right\rangle = \left\langle g^{(h-1)}(x), g^{(h-1)}(x') \right\rangle \cdot \left\langle b^{(h)}(x), b^{(h)}(x') \right\rangle.$$
Note that we have established in Equation (8) that
$$\left\langle g^{(h-1)}(x), g^{(h-1)}(x') \right\rangle \to \Sigma^{(h-1)}(x, x').$$
For the other factor $\left\langle b^{(h)}(x), b^{(h)}(x') \right\rangle$, by definition (9) we get
$$\left\langle b^{(h)}(x), b^{(h)}(x') \right\rangle = \frac{c_\sigma}{d_h} \left\langle D^{(h)}(x) \left( W^{(h+1)} \right)^\top b^{(h+1)}(x),\; D^{(h)}(x') \left( W^{(h+1)} \right)^\top b^{(h+1)}(x') \right\rangle.$$
Although $b^{(h+1)}(x)$ and $W^{(h+1)}$ are dependent, the Gaussian initialization of $W^{(h+1)}$ allows us to replace $W^{(h+1)}$ with a fresh new sample $\widetilde{W}^{(h+1)}$ without changing its limit. This is made rigorous for ReLU activation in Theorem 3.1.
Applying this approximation inductively in Equation (11), we get
$$\left\langle b^{(h)}(x), b^{(h)}(x') \right\rangle \to \prod_{h'=h}^{L} \dot\Sigma^{(h')}(x, x'), \qquad \text{where} \quad \dot\Sigma^{(h)}(x, x') = c_\sigma\, \mathbb{E}_{(u, v) \sim \mathcal{N}\left( \mathbf{0}, \Lambda^{(h)}(x, x') \right)}\left[ \dot\sigma(u)\, \dot\sigma(v) \right].$$
Finally, since $\left\langle g^{(h-1)}(x), g^{(h-1)}(x') \right\rangle \to \Sigma^{(h-1)}(x, x')$, we obtain the final NTK expression for the fully-connected neural network:
$$\Theta^{(L)}(x, x') = \sum_{h=1}^{L+1} \left( \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot\Sigma^{(h')}(x, x') \right),$$
where we let $\dot\Sigma^{(L+1)}(x, x') = 1$ for convenience.
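For ReLU, the Gaussian expectations in this recursion admit closed forms (the arc-cosine expressions of Cho and Saul (2009), also given in Section 4.3), so the NTK between two inputs can be computed exactly in $O(L)$ time. A minimal sketch (the function name and interface are ours):

```python
import numpy as np

def relu_ntk(x, xp, L):
    """Theta^(L)(x, x') for an L-hidden-layer ReLU net, via Equation (13).
    For ReLU with c_sigma = 2, Sigma^(h)(x, x) = ||x||^2 for every h,
    so the diagonal terms stay fixed along the recursion."""
    sxx, spp = x @ x, xp @ xp
    sig = x @ xp                              # Sigma^(0)(x, x')
    sigmas, dots = [sig], []
    for _ in range(L):
        lam = np.clip(sig / np.sqrt(sxx * spp), -1.0, 1.0)
        # closed forms: c_sigma E[sigma(u) sigma(v)], c_sigma E[sigma'(u) sigma'(v)]
        sig = np.sqrt(sxx * spp) * (lam * (np.pi - np.arccos(lam))
                                    + np.sqrt(1.0 - lam ** 2)) / np.pi
        dots.append((np.pi - np.arccos(lam)) / np.pi)
        sigmas.append(sig)
    dots.append(1.0)                          # convention: dot-Sigma^(L+1) = 1
    # Theta^(L) = sum_{h=1}^{L+1} Sigma^(h-1) * prod_{h'=h}^{L+1} dot-Sigma^(h')
    return sum(sigmas[h] * np.prod(dots[h:]) for h in range(L + 1))
```

For a unit-norm input paired with itself, $\lambda = 1$ at every layer, so $\Theta^{(L)}(x, x) = (L+1)\|x\|^2$: each of the $L+1$ weight matrices contributes one kernel term.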
Rigorously, for ReLU activation, we have the following theorem that gives a concrete bound on the hidden widths sufficient for convergence to the NTK at initialization.
Theorem 3.1 (Convergence to the NTK at initialization). Fix $\epsilon > 0$ and $\delta \in (0, 1)$. Suppose $\sigma(z) = \max(z, 0)$ and $d_h \ge \Omega\!\left( \frac{L^6}{\epsilon^4} \log(L/\delta) \right)$ for all $h \in [L]$. Then for any inputs $x, x' \in \mathbb{R}^{d_0}$ such that $\|x\| \le 1$, $\|x'\| \le 1$, with probability at least $1 - \delta$ we have:
$$\left| \left\langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, x')}{\partial \theta} \right\rangle - \Theta^{(L)}(x, x') \right| \le (L + 1)\,\epsilon.$$
Previous results are asymptotic, i.e., they require the widths to go to infinity, while Theorem 3.1 gives a non-asymptotic bound on the required layer widths.
Equivalence between Wide Neural Net and Kernel Regression using NTK.
Built on Theorem 3.1, we can further incorporate the training process and show the equivalence between a fully-trained sufficiently wide neural net and the kernel regression solution using the NTK, as described in Lemma 3.1 and the discussion after it.
Recall that the training data are $\{(x_i, y_i)\}_{i=1}^n$, and $H^* \in \mathbb{R}^{n \times n}$ is the NTK evaluated on these training data, i.e., $[H^*]_{i,j} = \Theta^{(L)}(x_i, x_j)$. Denote $y = (y_1, \ldots, y_n)^\top$. For a testing point $x_{te}$, we let $\ker_{\mathrm{ntk}}(x_{te}, X) \in \mathbb{R}^n$ be the kernel values between this testing point and the training points, i.e., $\left[ \ker_{\mathrm{ntk}}(x_{te}, X) \right]_i = \Theta^{(L)}(x_{te}, x_i)$. The prediction of kernel regression using the NTK on this testing point is
$$f_{ntk}(x_{te}) = \ker_{\mathrm{ntk}}(x_{te}, X)^\top (H^*)^{-1} y.$$
Since the above solution corresponds to the linear dynamics in Equation (4) with zero initialization, in order to establish equivalence between neural network and kernel regression, we would like the initial output of the neural network to be small. Therefore, we apply a small multiplier $\kappa > 0$, and let the final output of the neural network be
$$f_{nn}(\theta, x) = \kappa f(\theta, x).$$
We let $f_{nn}(x_{te}) = \lim_{t \to \infty} f_{nn}(\theta(t), x_{te})$ be the prediction of the neural network at the end of training.
The following theorem establishes the equivalence between the fully-trained wide neural network and the kernel regression predictor using the NTK.
Theorem 3.2 (Main theorem).
Suppose $\sigma(z) = \max(z, 0)$, $1/\kappa = \mathrm{poly}(1/\epsilon, \log(n/\delta))$ and $d_1 = d_2 = \cdots = d_L = d$ with $d \ge \mathrm{poly}(1/\kappa, L, 1/\lambda_0, n, \log(1/\delta))$, where $\lambda_0 = \lambda_{\min}(H^*) > 0$. Then for any $x_{te} \in \mathbb{R}^d$ with $\|x_{te}\| = 1$, with probability at least $1 - \delta$ over the random initialization, we have
$$\left| f_{nn}(x_{te}) - f_{ntk}(x_{te}) \right| \le \epsilon.$$
Several comments are in order. Theorem 3.2 is, to our knowledge, the first result that rigorously shows the equivalence between a fully-trained neural net and a kernel predictor. Compared with (Jacot et al., 2018), our bound is non-asymptotic whereas (Jacot et al., 2018) only has an asymptotic result; furthermore, Jacot et al. (2018) required the width of every layer to go to infinity in a sequential order, while we can have the same number of neurons per layer, which is closer to practice. Compared with recent results on over-parameterized neural nets (Arora et al., 2019; Allen-Zhu et al., 2018b, a; Du et al., 2019, 2018b; Li and Liang, 2018; Zou et al., 2018), our theorem is a more precise characterization of the learned neural network: the prediction is essentially a kernel predictor. Therefore, to study the properties of these over-parameterized nets, such as their generalization power, it suffices to study the corresponding NTK.
While this theorem only gives a guarantee for a single point, using a union bound, we can show that this guarantee holds simultaneously for any finite set of (even exponentially many) testing points. Combining this with the standard analysis of a hold-out validation set, we can conclude that a fully-trained wide neural net enjoys the same generalization ability as its corresponding NTK.
For the proof of Theorem 3.2, we first use a generic argument to show that the perturbation on the prediction can be reduced to the perturbation on the kernel value at initialization and during training. Theorem 3.1 guarantees a small perturbation on the kernel value at initialization. For the perturbation during training, we use the high-level proof idea from Du et al. (2018b); Arora et al. (2019) to reduce the perturbation on the kernel value to the perturbation on the gradient of each prediction with respect to the weight matrices. Then we adopt technical lemmas from Allen-Zhu et al. (2018b) to obtain bounds on the perturbation of the gradient. See Appendix D for details. We remark that Jacot et al. (2018); Lee et al. (2019) provided proofs for the training part. However, both are asymptotic results and only apply to finite training time. In contrast, we give a finite-width perturbation bound, and our result applies to infinite training time.
4 Convolutional Neural Tangent Kernel
In this section we study convolutional neural networks (CNNs) and their corresponding CNTKs. We study two architectures, vanilla CNN and CNN with global average pooling (GAP). To formally define CNNs, we first introduce some notation. Throughout the paper, we let $P$ be the width and $Q$ be the height of the image. We use $q$ to denote the filter size; in practice, $q = 3$ or $5$. We use standard zero padding and set the stride size to be $1$ to make sure the input of each layer has the same size. For a convolutional filter $w \in \mathbb{R}^{q \times q}$ and an image $x \in \mathbb{R}^{P \times Q}$, the convolution operator is defined as
$$[w * x]_{ij} = \sum_{a = -\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{b = -\frac{q-1}{2}}^{\frac{q-1}{2}} [w]_{a + \frac{q+1}{2},\, b + \frac{q+1}{2}}\, [x]_{a + i,\, b + j}, \qquad i \in [P],\ j \in [Q],$$
where entries of $x$ outside the image are taken to be $0$.
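A direct transcription of this operator (our own sketch, using 0-indexed arrays; stride 1, zero padding, odd $q$ as stated above):

```python
import numpy as np

def conv_op(w, x):
    """[w * x]_{ij} per Equation (14): q x q filter w, P x Q image x,
    stride 1, zero padding so the output has the same size as x."""
    q = w.shape[0]
    assert q % 2 == 1 and w.shape == (q, q)
    r = (q - 1) // 2
    P, Q = x.shape
    xp = np.zeros((P + 2 * r, Q + 2 * r))
    xp[r:r + P, r:r + Q] = x                  # zero padding around the image
    out = np.zeros((P, Q))
    for i in range(P):
        for j in range(Q):
            # sum the filter against the q x q patch centered at (i, j)
            out[i, j] = np.sum(w * xp[i:i + q, j:j + q])
    return out
```

For example, a filter that is $1$ at its center and $0$ elsewhere reproduces the input image unchanged.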
Equation (14) shows that the $(i, j)$-th entry of the output depends on the $q \times q$ patch of $x$ centered at $(i, j)$. Our CNTK formula also relies on this dependency. For $(i, j) \in [P] \times [Q]$, define the index set of this patch
$$\mathcal{D}_{ij} = \left\{ (i + a,\, j + b) : a, b \in \left\{ -\tfrac{q-1}{2}, \ldots, \tfrac{q-1}{2} \right\} \right\}.$$
Lastly, for a 4th-order tensor $T \in \mathbb{R}^{P \times Q \times P \times Q}$, we denote by $[T]_{\mathcal{D}_{ij}, \mathcal{D}_{i'j'}}$ the sub-tensor indexed by the patches $\mathcal{D}_{ij}$ and $\mathcal{D}_{i'j'}$, and we let $\mathrm{tr}(T) = \sum_{i,j} [T]_{ij,ij}$.
4.1 Vanilla CNN
The first type of CNN is the vanilla CNN, which consists of convolution layers and one fully-connected layer, formally defined as follows.
Let $x = x^{(0)} \in \mathbb{R}^{P \times Q \times C^{(0)}}$ be the input image, where $C^{(0)}$ is the number of channels in the input image.
For $h \in [L]$ and $\beta \in [C^{(h)}]$, the intermediate outputs are defined as
where each $W^{(h)}_{(\alpha),(\beta)} \in \mathbb{R}^{q \times q}$ is a filter with standard Gaussian initialization.
The final output is defined as
where $W^{(L+1)}$ is a weight matrix with standard Gaussian initialization.
For this architecture, using the same reasoning as in Section 3.2, we obtain the following convolutional neural tangent kernel formula. The details are provided in Appendix E. We will use a dynamic programming approach to compute the kernel.
We let $x, x'$ be two input images.
For $(i, j), (i', j') \in [P] \times [Q]$, define
For $h \in [L]$, define
Define $\Sigma^{(h)}(x, x')$, for $h \in [L]$,
Define $\dot\Sigma^{(h)}(x, x')$, for $h \in [L]$,
Note that $\Sigma^{(h)}$ and $\dot\Sigma^{(h)}$ share structures similar to their NTK counterparts. The only difference is that we have one more step, taking the trace over patches. This step represents the convolution operation in the corresponding CNN. Next, we use a recursion to compute the final kernel value.
First, we define $\Theta^{(0)}(x, x') = \Sigma^{(0)}(x, x')$.
For $h \in [L]$ and $(i, j), (i', j') \in [P] \times [Q]$, we define
Lastly, the final kernel value is defined as the sum over the diagonal entries of the resulting tensor, $\sum_{i \in [P], j \in [Q]} \left[ \Theta(x, x') \right]_{ij,ij}$.
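The patch-trace step that distinguishes the CNTK recursion from the fully-connected NTK can be sketched as follows. This is our own illustration of the operation, not the paper's implementation: for a $P \times Q \times P \times Q$ tensor it sums entries over aligned offsets within the two $q \times q$ patches centered at $(i, j)$ and $(i', j')$ (the trace of the corresponding sub-tensor), treating entries outside the image as zero, and omits any normalization constants.

```python
import numpy as np

def trace_over_patches(K, q):
    """[out]_{ij,i'j'} = sum over offsets (a, b) in a q x q patch of
    K[(i+a)(j+b), (i'+a)(j'+b)], with zero contribution outside the image."""
    P, Q = K.shape[0], K.shape[1]
    r = (q - 1) // 2
    out = np.zeros_like(K)
    for i in range(P):
        for j in range(Q):
            for i2 in range(P):
                for j2 in range(Q):
                    s = 0.0
                    for a in range(-r, r + 1):
                        for b in range(-r, r + 1):
                            ia, jb = i + a, j + b
                            ia2, jb2 = i2 + a, j2 + b
                            if (0 <= ia < P and 0 <= jb < Q
                                    and 0 <= ia2 < P and 0 <= jb2 < Q):
                                s += K[ia, jb, ia2, jb2]
                    out[i, j, i2, j2] = s
    return out
```

With $q = 1$ the operator is the identity; applying it layer by layer to the $\Sigma$ and $\dot\Sigma$ tensors is what makes the per-pair cost scale with $(PQ)^2$.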
4.2 CNN with Global Average Pooling
We also consider another architecture, CNN with global average pooling.
Let $x = x^{(0)} \in \mathbb{R}^{P \times Q \times C^{(0)}}$ be the input image, where $C^{(0)}$ is the number of initial channels.
For $h \in [L]$ and $\beta \in [C^{(h)}]$, the intermediate outputs are defined as
The final output is defined as
where each final-layer weight $W^{(L+1)}_{(\beta)}$ is a scalar with standard Gaussian initialization.
Besides using global average pooling, another modification is that we do not train the first and the last layers. This is inspired by Du et al. (2018a), in which the authors showed that if one applies gradient flow, then at any training time $t$, the difference between the squared Frobenius norm of the weight matrix at time $t$ and that at initialization is the same for all layers. However, note that $W^{(1)}$ and $W^{(L+1)}$ are special because they are smaller matrices compared with the other intermediate weight matrices, so relatively, these two weight matrices change more than the intermediate matrices during the training process, and this may dramatically change the kernel. Therefore, we choose to fix $W^{(1)}$ and $W^{(L+1)}$ to make the over-parameterization theory closer to practice.
We let $x, x'$ be two input images. Note that because CNN with global average pooling and vanilla CNN share the same architecture except for the last layer, $\Sigma^{(h)}$, $\dot\Sigma^{(h)}$ and $\Lambda^{(h)}$ are the same for these two architectures; the only difference is in calculating the final kernel value. To compute the final kernel value, we use the following procedure.
First, we define $\Theta^{(0)}(x, x') = \mathbf{0}$. Note this is different from the CNTK for vanilla CNN, which uses $\Sigma^{(0)}(x, x')$ as the initial value, because here we do not train the first layer.
For $h \in [L]$ and $(i, j), (i', j') \in [P] \times [Q]$, we define
For the last layer $h = L + 1$, we define
Lastly, the final kernel value is defined as the mean over all entries of the resulting tensor, $\frac{1}{P^2 Q^2} \sum_{i, i' \in [P]} \sum_{j, j' \in [Q]} \left[ \Theta(x, x') \right]_{ij,i'j'}$.
Note that compared with the CNTK of the vanilla CNN, we drop the contribution of the last layer. This is because we do not train the last layer. The other difference is that we calculate the mean over all entries, instead of the summation over the diagonal ones. This is because we use global average pooling, so the cross-covariances between every two patches contribute to the kernel.
4.3 Fast Computation for ReLU-Activated CNTK
To compute the CNTK corresponding to a CNN with $L$ convolution layers and one fully-connected layer on $n$ samples, one needs $O(n^2)$ kernel evaluations, each of which runs a dynamic program over all $(PQ)^2$ pairs of pixel locations across the $L$ layers. Previous work assumed that directly computing a convolutional kernel exactly is computationally infeasible, and thus resorted to approximations like Monte Carlo sampling (Novak et al., 2019). In this section, we present a new approach to efficiently compute the CNTK for ReLU activation exactly. Most computation required by our new approach can be described as entry-wise operations over matrices and tensors, which allows efficient implementations on GPUs.
Following the formulas in Sections 4.1 and 4.2, the trickiest part is computing the expectation of the post-activation output, i.e., Equations (15) and (16). These two expectations depend on (the same) $2 \times 2$ covariance matrices $\Lambda^{(h)}_{ij,i'j'}$. To obtain faster implementations, our key observation is that if the diagonal entries of $\Lambda$ are all ones and the activation function is ReLU, there are closed-form formulas for the corresponding expectations. To see this, let us suppose for now that $\Lambda = \begin{pmatrix} 1 & \lambda \\ \lambda & 1 \end{pmatrix}$ for some $\lambda \in [-1, 1]$. When the activation function is ReLU, one can show that
$$\mathbb{E}_{(u, v) \sim \mathcal{N}(\mathbf{0}, \Lambda)}\left[ \sigma(u)\, \sigma(v) \right] = \frac{\lambda \left( \pi - \arccos(\lambda) \right) + \sqrt{1 - \lambda^2}}{2\pi}, \qquad \mathbb{E}_{(u, v) \sim \mathcal{N}(\mathbf{0}, \Lambda)}\left[ \dot\sigma(u)\, \dot\sigma(v) \right] = \frac{\pi - \arccos(\lambda)}{2\pi}.$$
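These are the standard arc-cosine-kernel identities (Cho and Saul, 2009). A quick Monte Carlo sanity check of both closed forms (our own illustration; the sample count and the value of $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.3
cov = np.array([[1.0, lam], [lam, 1.0]])     # unit diagonal, correlation lam
uv = rng.multivariate_normal(np.zeros(2), cov, size=4000000)
u, v = uv[:, 0], uv[:, 1]

# Monte Carlo estimates of E[relu(u) relu(v)] and E[relu'(u) relu'(v)]
mc_act = np.mean(np.maximum(u, 0) * np.maximum(v, 0))
mc_der = np.mean((u > 0) * (v > 0))

# Closed forms for ReLU with unit variances
cf_act = (lam * (np.pi - np.arccos(lam)) + np.sqrt(1 - lam ** 2)) / (2 * np.pi)
cf_der = (np.pi - np.arccos(lam)) / (2 * np.pi)

assert abs(mc_act - cf_act) < 5e-3
assert abs(mc_der - cf_der) < 5e-3
```

At the extremes the formulas reduce to familiar facts: $\lambda = 1$ gives $\mathbb{E}[\sigma(u)^2] = 1/2$, and $\lambda = 0$ gives $\mathbb{E}[\sigma(u)]^2 = 1/(2\pi)$.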
Now we let
Here, we interpret , ,