Kernel and Rich Regimes in Overparametrized Models


Abstract

A recent line of work studies overparametrized neural networks in the “kernel regime,” i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach [5], we show how the scale of the initialization controls the transition between the “kernel” (aka lazy) and “rich” (aka active) regimes and affects generalization properties in multilayer homogeneous models. We also highlight an interesting role for the width of a model in the case that the predictor is not identically zero at initialization. We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.

1 Introduction

A string of recent papers study neural networks trained with gradient descent in the “kernel regime”. They observe that, in a certain regime, networks trained with gradient descent behave as kernel methods [15, 7, 6]. This allows one to prove convergence to zero-error solutions in overparametrized settings [8, 9, 1]. It also implies that the learned function is the minimum norm solution in the corresponding RKHS [5, 3, 17], and more generally that such models inherit the inductive bias and generalization behavior of the RKHS. This suggests that, in this regime, deep models can be equivalently replaced by kernel methods with the “right” kernel: deep learning boils down to a kernel method with a fixed kernel determined by the architecture and initialization, and thus can only learn problems learnable by the appropriate kernel.

This contrasts with other recent results that show how in deep models, including infinitely overparametrized networks, training with gradient descent induces an inductive bias that cannot be represented as an RKHS norm. For example, analytic and/or empirical results suggest that gradient descent on deep linear convolutional networks implicitly biases toward minimizing the $\ell_p$ bridge penalty, for $p = 2/\mathrm{depth} \leq 1$, in the frequency domain [13]; that on an infinite width, single-input ReLU network, infinitesimal weight decay biases towards minimizing the second-order total variation of the learned function [19], and, further, it has been empirically observed that this bias is implicitly induced by gradient descent even without explicit weight decay [19, 21]; and that gradient descent on an overparametrized matrix factorization, which can be thought of as a two layer linear network, induces nuclear norm minimization of the learned matrix [12] and can ensure low rank matrix recovery [16]. All these natural inductive biases (the $\ell_p$ bridge penalty for $p \leq 1$, total variation norm, nuclear norm) are not Hilbert norms, and therefore cannot be captured by any kernel. This suggests that training deep models with gradient descent can behave very differently from kernel methods, and have richer inductive biases.

One might then ask whether the kernel approximation indeed captures the behavior of deep learning in a relevant and interesting regime, or whether the success of deep learning comes from escaping this regime and exploiting richer inductive biases that arise from the multilayer nature of neural networks. In order to understand this, we must first carefully understand when each of these regimes holds, and how the transition between the “kernel regime” and the “rich regime” happens.

Some investigations of the kernel regime emphasize the number of parameters (“width”) going to infinity as leading to this regime. However, Chizat and Bach [5] identified the scale of the model at initialization as a quantity controlling entry into the kernel regime. Their results suggest that for any number of parameters (any width), a homogeneous model can be approximated by a kernel method when its scale at initialization goes to infinity (see the discussion in Section 3). Considering models with increasing (or infinite) width, the relevant regime (kernel or rich) is determined by how the scale at initialization behaves as the width goes to infinity. In this paper we elaborate and expand on this view, carefully studying how the scale of initialization affects the model behaviour for $D$-homogeneous models.

Our Contributions

In Section 4 we analyze in detail a simple 2-homogeneous model for which we can exactly characterize the implicit bias of training with gradient descent as a function of the scale, $\alpha$, of the initialization. We show: (a) the implicit bias transitions from the $\ell_1$ norm in the limit $\alpha \to 0$ to the $\ell_2$ norm in the limit $\alpha \to \infty$; (b) consequently, for certain problems, e.g. high dimensional sparse regression, using a small initialization can be necessary for good generalization; and (c) the “shape” of the initialization, i.e. the relative scale of the parameters, affects the $\ell_2$ (kernel regime) bias but not the $\ell_1$ (rich limit) bias. In Section 5 we extend this analysis to analogous $D$-homogeneous models, showing that a higher order of homogeneity, or “depth” of the model, hastens the transition into the rich regime. In Section 6, we analyze asymmetric matrix factorization models, and show that the “width” (i.e. the inner dimension of the factorization) has an interesting role to play in controlling the transition between kernel and rich behavior which is distinct from the scale.

2 Setup and preliminaries

We consider models which map parameters $w \in \mathbb{R}^p$ and examples $x \in \mathcal{X}$ to predictions $f(w, x) \in \mathbb{R}$. We denote the predictor implemented by the parameters $w$ as $F(w)$, such that $F(w)(x) = f(w, x)$. Much of our focus will be on models, such as linear networks, which are linear in $x$ (but not in the parameters $w$!), in which case $F(w)$ is a linear functional in the dual space and can be represented as a vector $\beta_w$ with $f(w, x) = \langle \beta_w, x \rangle$. Such models are essentially alternate parametrizations of linear models, but as we shall see, the specific parametrization is crucial.

We focus on models that are $D$-positive homogeneous in the parameters $w$, for some integer $D \geq 1$, meaning that for any $c > 0$, $f(c \cdot w, x) = c^D f(w, x)$. We refer to such models simply as $D$-homogeneous. Many interesting model classes have this property, including multi-layer ReLU networks with fully connected and convolutional layers, layered linear networks, and matrix factorization, where $D$ corresponds to the depth of the network.

We use $L(w)$ to denote the squared loss of the model parametrized by $w$ over a training set $\{(x_n, y_n)\}_{n=1}^N$. We consider minimizing the loss using gradient descent with infinitesimally small stepsize, i.e. gradient flow dynamics

$\dot{w}(t) = -\nabla_w L\big(w(t)\big) \qquad (1)$

We are particularly interested in the scale of the initialization, and we capture it through a scalar parameter $\alpha > 0$. For scale $\alpha$, we will denote by $w_\alpha(t)$ the gradient flow path (1) with the initial condition $w_\alpha(0) = \alpha w_0$. We consider underdetermined/overparameterized models (typically $p \gg N$), where there are many global minimizers of $L$, all with $L(w) = 0$. In many cases, we expect the dynamics of gradient flow to converge to a global minimizer of $L$ which perfectly fits the data – this is often empirically observed in large neural network learning, though proving that this happens can be challenging and will not be our focus. Rather, our main focus is on which of the many minimizers gradient flow converges to. We want to characterize $w_\alpha(\infty) := \lim_{t \to \infty} w_\alpha(t)$ or, more importantly, the predictor $F(w_\alpha(\infty))$ reached by gradient flow, depending on the scale $\alpha$.
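As a point of reference for the simulations described later, the following minimal sketch (ours, not code from the paper) approximates the gradient flow path (1) by Euler steps with a small stepsize, started from the scaled initialization $w(0) = \alpha w_0$; here `grad_L` is a hypothetical function returning $\nabla_w L(w)$.

```python
import numpy as np

def gradient_flow(grad_L, w0, alpha, step=1e-3, max_steps=200_000, tol=1e-8):
    """Euler discretization of the gradient flow (1), started at w(0) = alpha * w0.
    `grad_L` is assumed to return the gradient of the training loss at w."""
    w = alpha * np.asarray(w0, dtype=float)
    for _ in range(max_steps):
        g = grad_L(w)
        if np.linalg.norm(g) < tol:      # (approximately) reached a stationary point
            break
        w = w - step * g
    return w
```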

3 The Kernel Regime

Locally, gradient descent/flow depends solely on the first-order approximation of the model with respect to the parameters $w$:

$f(w, x) \approx f(w_0, x) + \langle w - w_0, \nabla_w f(w_0, x) \rangle \qquad (2)$

That is, locally around any $w_0$, gradient flow operates on the model as if it were an affine model with feature map $\phi_{w_0}(x) = \nabla_w f(w_0, x)$, corresponding to the tangent kernel $K_{w_0}(x, x') = \langle \nabla_w f(w_0, x), \nabla_w f(w_0, x') \rangle$. Of particular interest is the tangent kernel at initialization, $K_{w(0)}$ [15, 23, 22].

Previous work uses “kernel regime” to describe a situation in which the tangent kernel does not change over the course of optimization or, less formally, where it does not change significantly, i.e. where $K_{w(t)} \approx K_{w(0)}$ throughout training. For $D$-homogeneous models with initialization $w_\alpha(0) = \alpha w_0$, the predictions at initialization are $F(w_\alpha(0)) = \alpha^D F_0$, where we denote $F_0 = F(w_0)$. Thus, in the kernel regime, training the model is exactly equivalent to training an affine model with kernelized gradient descent/flow with the kernel $K_{w_\alpha(0)}$ and a “bias term” of $\alpha^D F_0$. Minimizing the loss of this affine model using gradient flow reaches the zero-error solution nearest to the initialization, where distance is measured with respect to the RKHS norm determined by $K_{w_\alpha(0)}$; that is, $F_\alpha^\infty = \arg\min_{F : L(F) = 0} \|F - \alpha^D F_0\|_{K_{w_\alpha(0)}}$. To avoid handling this bias term, and in particular its large scale as $\alpha$ increases, Chizat and Bach [5] suggest using “unbiased” initializations $w_0$ such that $F(w_0) = 0$, so that the bias term vanishes. This is often achieved by replicating units with opposite signs at initialization (see, e.g., Section 4).
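As an illustration (our own sketch, not code from the paper), the tangent kernel at any parameter setting can be probed numerically by finite differences; for the depth-2 “diagonal” model introduced in Section 4 below, at the initialization $w_+ = w_- = \alpha \mathbf{1}$ this reduces to a rescaled inner product kernel, consistent with the discussion there.

```python
import numpy as np

def tangent_features(f, w0, x, eps=1e-6):
    """phi_{w0}(x) = grad_w f(w0, x), approximated by central finite differences."""
    w0 = np.asarray(w0, dtype=float)
    phi = np.zeros_like(w0)
    for j in range(w0.size):
        e = np.zeros_like(w0)
        e[j] = eps
        phi[j] = (f(w0 + e, x) - f(w0 - e, x)) / (2 * eps)
    return phi

def tangent_kernel(f, w0, x1, x2):
    """K_{w0}(x1, x2) = <grad_w f(w0, x1), grad_w f(w0, x2)>."""
    return tangent_features(f, w0, x1) @ tangent_features(f, w0, x2)

# Example: the squared-parameterization model of Section 4, f(w, x) = <w_+^2 - w_-^2, x>.
d, alpha = 3, 5.0
f_diag = lambda w, x: (w[:d] ** 2 - w[d:] ** 2) @ x
w_init = alpha * np.ones(2 * d)                     # w_+ = w_- = alpha * 1
x1, x2 = np.random.randn(d), np.random.randn(d)
# At this initialization the tangent kernel is a rescaled inner product: 8 * alpha^2 * <x1, x2>.
print(tangent_kernel(f_diag, w_init, x1, x2), 8 * alpha ** 2 * (x1 @ x2))
```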

But when does the kernel regime happen? Chizat and Bach [5] showed that for any homogeneous model satisfying some technical conditions, the kernel regime is reached as $\alpha \to \infty$. That is, as we increase the scale of initialization, the dynamics converge to the kernel gradient flow dynamics for the initial kernel $K_{w_\alpha(0)}$. In Sections 4 and 5, for our specific models, we prove this limit as a special case of our more general analysis for all $\alpha$, and we also demonstrate it empirically for matrix factorization and deep networks in Sections 6 and 7. In Section 6, we additionally show how increasing the “width” of certain asymmetric matrix factorization models can also lead to the kernel regime, even when the initial scale goes to zero at an appropriately slow rate.

In contrast to the kernel regime, and as we shall see in later sections, the small initialization limit $\alpha \to 0$ often leads to very different and rich inductive biases, e.g. inducing sparsity or low-rank structure [12, 16, 13], that allow for generalization in settings where kernel methods would not generalize. We will refer to the limit of this distinctly non-kernel behavior as the “rich limit.” This regime is also called the “active,” “adaptive,” or “feature-learning” regime, since the tangent kernel changes over the course of training, in a sense adapting to the data. We argue that this rich limit is the one that truly allows us to exploit the power of depth, and thus is the more relevant regime for understanding the success of deep learning.

4 Detailed Study of a Simple Depth-2 Model

Consider the class of linear functions over $\mathbb{R}^d$, with the following squared parameterization:

$f(w, x) = \langle \beta_w, x \rangle, \qquad \beta_w = w_+^2 - w_-^2, \qquad w = (w_+, w_-) \in \mathbb{R}^{2d} \qquad (3)$

where $u^2$ for $u \in \mathbb{R}^d$ denotes elementwise squaring. The model can be thought of as a “diagonal” linear neural network (i.e. one where the weight matrices have diagonal structure) with $2d$ units. A “standard” diagonal linear network would have $d$ units, with each unit $i$ connected to just a single input coordinate $x_i$ with weight $u_i$ and to the output with weight $v_i$, thus implementing the model $f\big((u, v), x\big) = \sum_i u_i v_i x_i$. But it can be shown from the gradient flow dynamics that if $|u_i| = |v_i|$ at initialization, their magnitudes will remain equal and their signs will not flip throughout training, and so we can equivalently replace both with a single weight $w_i$, yielding the model $f(w, x) = \langle w^2, x \rangle$.

The reason for using both $w_+$ and $w_-$ (i.e. $2d$ units) is two-fold. First, it ensures that the image of $w \mapsto \beta_w$ is all (signed) linear functions, and thus the model is truly equivalent to standard linear regression. Second, it allows for initialization with $\beta_{w(0)} = 0$ (by choosing $w_+(0) = w_-(0)$) without this being a saddle point from which gradient flow will never escape.

The model (3) is perhaps the simplest non-trivial $D$-homogeneous model for $D = 2$, and we chose it for studying the role of the scale of initialization because it already exhibits distinct and interesting kernel and rich behaviors. Furthermore, we can completely understand both the implicit regularization driving this model and the transition between the regimes analytically.

We study the underdetermined case $N < d$, where there are many possible solutions $\beta$ with $X\beta = y$ (here $X \in \mathbb{R}^{N \times d}$ denotes the data matrix and $y \in \mathbb{R}^N$ the targets). We will use $\beta_\alpha^\infty$ to denote the solution reached by gradient flow when initialized at $w_\alpha(0) = \alpha w_0$. To simplify the presentation, we will start by focusing on the special case $w_0 = \mathbf{1}$. In this case, the tangent kernel at initialization is simply a scaling of the standard inner product kernel, so the corresponding RKHS norm is proportional to the $\ell_2$ norm. Thus, in the kernel regime, $\beta_\alpha^\infty$ will be the minimum $\ell_2$ norm solution, $\beta_{\ell_2}^* := \arg\min_{X\beta = y} \|\beta\|_2$. Following Chizat and Bach [5] and the discussion in Section 3, we thus expect that $\beta_\alpha^\infty \to \beta_{\ell_2}^*$ as $\alpha \to \infty$.

In contrast, from the results of Gunasekar et al. [12] it follows that as $\alpha \to 0$, gradient flow leads instead to a rich limit corresponding to the minimum $\ell_1$ norm solution $\beta_{\ell_1}^* := \arg\min_{X\beta = y} \|\beta\|_1$. Comparing this with the kernel regime, we already see two distinct behaviors and, in high dimensions, two very different inductive biases. In particular, the rich limit bias is not an RKHS norm for any choice of kernel. We have now described the asymptotic regimes $\alpha \to 0$ and $\alpha \to \infty$, but can we characterize and understand the transition between the two regimes as $\alpha$ scales from very small to very large? The following theorem does just that.
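The transition can also be observed directly in simulation. The sketch below (ours, with hypothetical problem sizes and stepsizes) runs small-stepsize gradient descent on the model (3) for an underdetermined sparse regression problem and compares the resulting $\beta$ to the minimum $\ell_2$ and minimum $\ell_1$ norm interpolants; we expect the large-$\alpha$ run to track the former and the small-$\alpha$ run the latter.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, N = 40, 20                                        # underdetermined: N < d
beta_star = np.zeros(d)
beta_star[:3] = 1.0                                  # 3-sparse planted predictor
X = rng.standard_normal((N, d))
y = X @ beta_star

def run_flow(alpha, iters=300_000):
    """Small-stepsize GD on w = (w_+, w_-) with beta = w_+^2 - w_-^2, init w_+ = w_- = alpha*1."""
    wp, wm = alpha * np.ones(d), alpha * np.ones(d)
    step = 0.5 / (8 * max(alpha ** 2, 1.0) * np.linalg.norm(X, 2) ** 2)
    for _ in range(iters):
        r = X @ (wp ** 2 - wm ** 2) - y              # residual
        g = X.T @ r
        wp, wm = wp - step * 2 * wp * g, wm + step * 2 * wm * g
    return wp ** 2 - wm ** 2

beta_l2 = np.linalg.lstsq(X, y, rcond=None)[0]       # minimum l2-norm interpolant (kernel regime)
res = linprog(np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y,
              bounds=[(0, None)] * (2 * d), method="highs")
beta_l1 = res.x[:d] - res.x[d:]                      # minimum l1-norm interpolant (rich limit)

for alpha in [10.0, 1.0, 0.01]:
    b = run_flow(alpha)
    print(alpha, np.linalg.norm(b - beta_l2), np.linalg.norm(b - beta_l1))
```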

Theorem 1 (Special case: $w_0 = \mathbf{1}$).

For any $0 < \alpha < \infty$, if the gradient flow solution $\beta_\alpha^\infty$ for the squared parameterization model in eq. (3) satisfies $X\beta_\alpha^\infty = y$, then

$\beta_\alpha^\infty = \arg\min_{\beta : X\beta = y} Q_\alpha(\beta) \qquad (4)$

where $Q_\alpha(\beta) = \alpha^2 \sum_{i=1}^d q\big(\beta_i / \alpha^2\big)$ and $q(z) = 2 - \sqrt{4 + z^2} + z \operatorname{arcsinh}(z/2)$.

A General Approach for Deriving the Implicit Bias

Once given an expression for $Q_\alpha$, it is straightforward to analyze the dynamics of $\beta_\alpha(t)$ and show that its limit is the minimum-$Q_\alpha$ solution to $X\beta = y$. However, a key contribution of this work is in developing a method for determining what the implicit bias is when we do not already have a good guess. First, we analyze the gradient flow dynamics and show that if the limit $\beta_\alpha^\infty$ exists, then $\beta_\alpha^\infty = \alpha^2 g(X^\top \nu)$ for a certain function $g$ and some vector $\nu \in \mathbb{R}^N$. It is not necessary to be able to calculate $\nu$, which would be very difficult, even for our simple examples. Next, we suppose that there is some function $Q_\alpha$ such that (4) holds. The KKT optimality conditions for (4) are $X\beta = y$ and $\exists \nu$ s.t. $\nabla Q_\alpha(\beta) = X^\top \nu$. Therefore, if indeed $\beta_\alpha^\infty = \alpha^2 g(X^\top \nu)$ and $X\beta_\alpha^\infty = y$, then it suffices that $\nabla Q_\alpha\big(\alpha^2 g(u)\big) = u$. We solve this differential equation to yield $Q_\alpha$. Theorem 4 in Appendix B is proven using this method, and we hope that this approach may be useful for other problems.
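To make the recipe concrete, here is a sketch of the calculation for the depth-2 model (3), assuming the squared loss $L(w) = \tfrac{1}{2}\|X\beta_w - y\|^2$ (the constants below depend on this convention and on the choice of dual variable):

$\dot{w}_\pm(t) = \mp 2\, w_\pm(t) \odot X^\top r(t), \quad r(t) = X\beta(t) - y \;\;\Longrightarrow\;\; w_\pm(t) = \alpha \exp\!\big(\mp 2 X^\top \nu(t)\big), \quad \nu(t) = \int_0^t r(s)\, ds,$

$\beta(t) = w_+(t)^2 - w_-(t)^2 = -2\alpha^2 \sinh\!\big(4 X^\top \nu(t)\big).$

The KKT conditions for (4) require $\nabla Q_\alpha(\beta) = X^\top \mu$ for some $\mu \in \mathbb{R}^N$; taking $\mu = -4\nu$, this holds if $\nabla Q_\alpha(\beta) = \operatorname{arcsinh}\big(\beta/(2\alpha^2)\big)$ coordinatewise, and integrating gives

$Q_\alpha(\beta) = \sum_{i=1}^d \int_0^{\beta_i} \operatorname{arcsinh}\!\Big(\frac{s}{2\alpha^2}\Big)\, ds = \alpha^2 \sum_{i=1}^d q\Big(\frac{\beta_i}{\alpha^2}\Big), \qquad q(z) = 2 - \sqrt{4 + z^2} + z \operatorname{arcsinh}(z/2).$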

(a) Generalization
(b) Norms of solution
(c) Sample complexity
Figure 1: In (a), the population error of the gradient flow solution $\beta_\alpha^\infty$ vs. $\alpha$ in the sparse regression problem described in Section 4. In (b), we plot the $\ell_1$ norm (blue) and the $\ell_2$ norm (red) of $\beta_\alpha^\infty$ vs. $\alpha$. In (c), the largest $\alpha$ such that $\beta_\alpha^\infty$ achieves population error at most a fixed threshold is shown for each number of samples. The dashed line indicates the number of samples needed by the minimum $\ell_1$ norm solution $\beta_{\ell_1}^*$.

In light of Theorem 4, the function $Q_\alpha$ (referred to elsewhere as the “hypentropy” function [11]) can be understood as an implicit regularizer which biases the gradient flow solution towards one particular zero-error solution out of the many possibilities. As $\alpha$ ranges from $0$ to $\infty$, the regularizer interpolates between the $\ell_1$ and $\ell_2$ norms, as illustrated in Figure 3(a) (the line labelled $D = 2$ depicts the coordinate function $q$). As $\alpha \to \infty$, we have $\beta_i / \alpha^2 \to 0$, and so the behaviour of $Q_\alpha$ is governed by $q$ around $0$, where $q(z) \approx z^2/4$; thus $Q_\alpha(\beta) \propto \|\beta\|_2^2$. On the other hand, when $\alpha \to 0$, $Q_\alpha$ is determined by $q(z)$ for $|z| \to \infty$, where $q(z) \approx |z| \log |z|$; in this regime $Q_\alpha(\beta) \propto \|\beta\|_1$ up to a scaling that does not affect the minimizer. The following theorem, proven in Appendix C, quantifies the scale of $\alpha$ which guarantees that $\beta_\alpha^\infty$ approximates the minimum $\ell_2$ or $\ell_1$ norm solution: informally, under the setting of Theorem 4 with $w_0 = \mathbf{1}$, a polynomially large $\alpha$ suffices for the $\ell_2$ approximation, while an exponentially small $\alpha$ is needed for the $\ell_1$ approximation (see Appendix C for the precise statement).

Looking carefully at this quantitative result, we notice a certain asymmetry between reaching the kernel regime versus the rich limit: a polynomially large $\alpha$ suffices to approximate $\beta_{\ell_2}^*$ to a very high degree of accuracy, but an exponentially small $\alpha$ is needed to approximate $\beta_{\ell_1}^*$. This suggests an explanation for the difficulty of empirically demonstrating rich limit behavior in matrix factorization problems [12, 2]: since the initialization may need to be exceedingly small, conducting experiments in the truly rich limit may be infeasible for computational reasons.

Generalization

In order to understand the effect of the initialization on generalization, consider a simple sparse regression problem, where $x \sim \mathcal{N}(0, I_d)$ and $y = \langle \beta^*, x \rangle$ for a planted $\beta^*$ that is $r$-sparse with non-zero entries of equal magnitude. When $N < d$, gradient flow will generally reach a zero training error solution; however, not all of these solutions will generalize the same. In the rich limit, $N = \Omega(r \log d)$ samples suffice for $\beta_{\ell_1}^*$ to generalize well. On the other hand, even though we can fit the training data perfectly, the kernel regime solution would not generalize at all at this sample size ($N = \Omega(d)$ samples would be needed); see Figure 1(c). Thus, for this sparse learning problem, good generalization requires using a very small initialization, and generalization will tend to improve as $\alpha$ decreases. From an optimization perspective this is unfortunate, because $w = 0$ is a saddle point, so taking $\alpha \to 0$ will tend to increase the time needed to escape the vicinity of zero.

Thus, there seems to be a tension between generalization and optimization: a smaller $\alpha$ might improve generalization, but it makes optimization trickier. This suggests that one should operate just on the edge of the rich limit, using the largest $\alpha$ that still allows for good generalization. This is borne out by our experiments with deep, non-linear neural networks (see Section 7), where standard initializations correspond to being right on the edge of entering the kernel regime, where we expect models to both generalize well and avoid serious optimization difficulties. Given the extensive efforts put into designing good initialization schemes, this gives further credence to the idea that models will perform best when trained in the intermediate regime between rich and kernel behavior.

The tension between optimization and generalization can also be seen through a tradeoff between the sample size and the largest $\alpha$ we can use and still generalize. In Figure 1(c), for each sample size $N$, we plot the largest $\alpha$ for which the gradient flow solution achieves population risk below some threshold. As $N$ approaches the minimum number of samples for which $\beta_{\ell_1}^*$ generalizes (the vertical dashed line), $\alpha$ must become extremely small. However, generalization becomes much easier when the number of samples is even slightly larger, and a much larger $\alpha$ suffices.

The “Shape” of $w_0$ and the Implicit Bias

So far, we have discussed the implicit bias in the special case $w_0 = \mathbf{1}$, but we can also characterize it for non-uniform initialization shapes $w_0$:

Theorem 4 (General case). For any $0 < \alpha < \infty$ and any $w_0$ with no zero entries, if the gradient flow solution $\beta_{\alpha, w_0}^\infty$ satisfies $X\beta_{\alpha, w_0}^\infty = y$, then

$\beta_{\alpha, w_0}^\infty = \arg\min_{\beta : X\beta = y} Q_{\alpha, w_0}(\beta) \qquad (5)$

where $Q_{\alpha, w_0}(\beta) = \sum_{i=1}^d \alpha^2 w_{0,i}^2 \, q\big(\beta_i / (\alpha^2 w_{0,i}^2)\big)$ and $q$ is as in Theorem 1. Consider the asymptotic behavior of $Q_{\alpha, w_0}$. For small $z$, $q(z) \approx z^2/4$, so as $\alpha \to \infty$

$Q_{\alpha, w_0}(\beta) \approx \frac{1}{4} \sum_{i=1}^d \frac{\beta_i^2}{\alpha^2 w_{0,i}^2} \qquad (6)$

In other words, in the $\alpha \to \infty$ limit, $Q_{\alpha, w_0}$ is proportional to a quadratic norm weighted by $1/w_{0,i}^2$. On the other hand, for large $z$, $q(z) \approx |z| \log |z|$, so as $\alpha \to 0$

$Q_{\alpha, w_0}(\beta) \approx \log\!\big(1/\alpha^2\big)\, \|\beta\|_1 \qquad (7)$

So, in the $\alpha \to 0$ limit, $Q_{\alpha, w_0}$ is proportional to $\|\beta\|_1$ regardless of the shape of the initialization $w_0$! The specifics of the initialization, $w_0$, therefore affect the implicit bias in the kernel regime (and in the intermediate regime) but not in the rich limit.

For wide neural networks with i.i.d. initialized units, the analogue of the “shape” $w_0$ is the distribution used to initialize each unit, including the relative scale of the input weights, output weights, and biases. Indeed, as was explored by Williams et al. [21] and as we elaborate on in Section 7, changing the unit initialization distribution changes the tangent kernel at initialization and hence the kernel regime behavior. However, in Section 7 we also demonstrate empirically that changing the initialization distribution (the “shape”) does not change the rich regime behavior of such networks. These observations match the behavior of the model analyzed above.

Explicit Regularization

From the geometry of gradient descent, it is tempting to imagine that its implicit bias would correspond to selecting the parameters closest in Euclidean norm to the initialization:

$\tilde{w}_\alpha \in \arg\min_{w : L(w) = 0} \|w - \alpha w_0\|_2 \qquad (8)$
$\tilde{\beta}_\alpha := \beta_{\tilde{w}_\alpha} \qquad (9)$
Figure 2: $q$ versus $\tilde{q}$.

This is certainly the case for standard linear regression $f(w, x) = \langle w, x \rangle$, where a standard analysis shows that gradient flow converges to the Euclidean projection of the initialization onto the set of solutions, so the bias is captured by (8)–(9). But does this characterization fully explain the implicit bias for our 2-homogeneous model? Perhaps the behavior of $\beta_\alpha^\infty$ in terms of $\alpha$ can also be explained by $\tilde{\beta}_\alpha$? Focusing on the special case $w_0 = \mathbf{1}$, it is easy to verify that the limiting behaviors as $\alpha \to 0$ and $\alpha \to \infty$ of the two approaches match. We can also calculate $\tilde{Q}_\alpha$, the implicit regularizer corresponding to (8)–(9), which likewise decomposes over the coordinates: its coordinate function $\tilde{q}$ can be written in terms of the unique real root of a certain cubic polynomial.

This function $\tilde{q}$ is shown next to $q$ in Figure 2. They are similar but not the same: $\tilde{q}$ is algebraic (even radical), while $q$ is transcendental. Thus $q \neq \tilde{q}$, and they are not simple rescalings of each other either. Furthermore, while $\alpha$ needs to be exponentially small in order for $Q_\alpha$ to approximate the $\ell_1$ norm, the algebraic $\tilde{Q}_\alpha$ approaches the $\ell_1$ norm polynomially in the scale of $\alpha$. Therefore, the bias of gradient descent and the transition from the kernel regime to the rich limit are more complex and subtle than what is captured simply by distances in parameter space.
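Both per-coordinate penalties can be compared numerically. The sketch below (ours, with an arbitrary choice of $\alpha$) evaluates $q$ from Theorem 1 and the Euclidean-distance counterpart $\tilde{q}$, the latter by solving the per-coordinate problem $\min\{(w_+ - \alpha)^2 + (w_- - \alpha)^2 : w_+^2 - w_-^2 = \beta\}$; the printed ratios should not be constant, illustrating that the two biases differ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def q(z):
    # Coordinate function from Theorem 1 (up to the conventions discussed above).
    return 2.0 - np.sqrt(4.0 + z ** 2) + z * np.arcsinh(z / 2.0)

def q_tilde(beta, alpha):
    """Per-coordinate cost of the closest-in-Euclidean-norm parameters, for beta >= 0:
    min over w_- >= 0 of (sqrt(w_-^2 + beta) - alpha)^2 + (w_- - alpha)^2."""
    obj = lambda wm: (np.sqrt(wm ** 2 + beta) - alpha) ** 2 + (wm - alpha) ** 2
    res = minimize_scalar(obj, bounds=(0.0, alpha + np.sqrt(beta) + 1.0), method="bounded")
    return res.fun

alpha = 0.3
betas = np.linspace(0.0, 5.0, 6)
Q = np.array([alpha ** 2 * q(b / alpha ** 2) for b in betas])
Qt = np.array([q_tilde(b, alpha) for b in betas])
print(np.round(Q, 4))
print(np.round(Qt, 4))
print(np.round(Q[1:] / Qt[1:], 4))   # similar shapes, but not a constant ratio
```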

5 Higher Order Models

So far, we considered a 2-homogeneous model, corresponding to a simple depth-2 “diagonal” network. Deeper models correspond to higher orders of homogeneity (e.g. a depth-$D$ ReLU network is $D$-homogeneous), motivating us to understand the effect of the order of homogeneity on the transition between the regimes. We therefore generalize our model and consider:

$f(w, x) = \langle \beta_w, x \rangle, \qquad \beta_w = w_+^D - w_-^D \qquad (10)$

As before, this is just a linear regression model with an unconventional parametrization. It is equivalent to a depth-$D$ matrix factorization model with commutative measurement matrices, as studied by Arora et al. [2], and can be thought of as a depth-$D$ diagonal linear network. We can again study the effect of the scale $\alpha$ of the initialization on the implicit bias. Let $\beta_\alpha^{\infty, D}$ denote the limit of gradient flow on (10) when $w(0) = \alpha \mathbf{1}$. In Appendix D we prove:

Theorem (Higher order). For any $D \geq 3$ and $0 < \alpha < \infty$, if $X\beta_\alpha^{\infty, D} = y$, then $\beta_\alpha^{\infty, D} = \arg\min_{X\beta = y} Q_\alpha^D(\beta)$,

where $Q_\alpha^D(\beta) = \alpha^D \sum_{i=1}^d q_D\big(\beta_i / \alpha^D\big)$ and $q_D = \int h_D^{-1}$ is the antiderivative of the unique inverse of $h_D(z) = (1 - z)^{-\frac{D}{D-2}} - (1 + z)^{-\frac{D}{D-2}}$ on $(-1, 1)$. Furthermore, $Q_\alpha^D(\beta) \propto \|\beta\|_2^2$ as $\alpha \to \infty$ and $Q_\alpha^D(\beta) \propto \|\beta\|_1$ as $\alpha \to 0$. In the two extremes, we again get the minimum $\ell_2$ norm solution in the kernel regime and, more interestingly, for any depth $D$, we get the minimum $\ell_1$ norm solution in the rich limit, as has also been observed by Arora et al. [2]. The fact that the rich limit solution does not change with depth is perhaps surprising, and does not agree with what would be obtained with explicit regularization (regularizing $\|w\|_2^2$ is equivalent to $\ell_{2/D}$ regularization of $\beta$), nor with the implicit bias on e.g. the logistic loss [12].
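The faster transition with depth can be checked in simulation. The following sketch (ours; problem sizes and stepsizes are illustrative) runs small-stepsize gradient descent on the depth-$D$ model (10) at a fixed, moderate initialization scale and measures the distance to the minimum $\ell_1$ norm interpolant; we expect this distance to shrink as $D$ increases.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d, N = 20, 10
beta_star = np.zeros(d)
beta_star[:2] = 1.0
X = rng.standard_normal((N, d))
y = X @ beta_star

res = linprog(np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y,
              bounds=[(0, None)] * (2 * d), method="highs")
beta_l1 = res.x[:d] - res.x[d:]                      # minimum l1-norm interpolant

def run_flow(D, alpha, total_time=200.0):
    """Small-stepsize GD on beta_w = w_+^D - w_-^D, initialized at w_+ = w_- = alpha * 1."""
    wp, wm = alpha * np.ones(d), alpha * np.ones(d)
    step = 0.25 / (D ** 2 * max(alpha, 1.0) ** (2 * D - 2) * np.linalg.norm(X, 2) ** 2)
    for _ in range(int(total_time / step)):
        r = X @ (wp ** D - wm ** D) - y
        g = X.T @ r
        wp, wm = wp - step * D * wp ** (D - 1) * g, wm + step * D * wm ** (D - 1) * g
    return wp ** D - wm ** D

alpha = 0.5                                          # moderate scale, far from exponentially small
for D in [2, 3, 4]:
    b = run_flow(D, alpha)
    print(D, np.linalg.norm(b - beta_l1))
```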

(a) Regularizer
(b) Approximation ratio
(c) Sparse regression simulation
Figure 3: (a) The coordinate function $q_D$ for several values of $D$. (b) The ratio $Q_\alpha^D(e_1) / Q_\alpha^D(\mathbf{1}/\sqrt{d})$ as a function of $\alpha$, where $e_1$ is the first standard basis vector and $\mathbf{1}$ is the all ones vector in $\mathbb{R}^d$. This captures the transition between approximating the $\ell_2$ norm and approximating the $\ell_1$ norm. (c) A sparse regression simulation as in Figure 1, using models of different orders $D$. The y-axis is the largest initialization scale $\alpha^D$ that leads to recovery of the planted predictor to a fixed accuracy. The vertical dashed line indicates the number of samples needed in order for the minimum $\ell_1$ norm solution to approximate the planted predictor.

Although the two extreme limits do not change as we go beyond $D = 2$, what does change is the intermediate regime, and particularly the sharpness of the transition into the extreme regimes, as illustrated in Figures 3(a)-3(c). The most striking difference is that, already at order $D = 3$, the scale of $\alpha$ needed to approximate $\beta_{\ell_1}^*$ is polynomial rather than exponential, yielding a much quicker transition to the rich limit versus the $D = 2$ case above. This allows near-optimal sparse regression with reasonable initialization scales as soon as $D \geq 3$, and increasing $D$ further hastens the transition to the rich limit. This may offer an explanation for the empirical observations regarding the benefit of depth in deep matrix factorization [2].

6 The Effect of Width

The kernel regime was first discussed in the context of the high (or infinite) width of a network, but our treatment so far, following [5], identified the scale of the initialization as the crucial parameter for entering the kernel regime. So is the width a red herring? Actually, the width does play an important role and allows entering the kernel regime more naturally.

The fixed-width models considered so far only reach the kernel regime when the initial scale of the parameters goes to infinity. To keep the initial predictions (and hence the loss) from exploding, we used Chizat and Bach’s “unbiasing” trick. However, using unbiased models with $\alpha \to \infty$ conceals the unnatural nature of this regime: although the final output may not explode, the outputs of internal units do explode in the scaling leading to the kernel regime. Realistic models are not trained like this. We will now use a “wide” generalization of our simple linear model to illustrate how increasing the width can induce kernel regime behavior in a more natural setting, where both the initial output and the outputs of all internal units do not explode and can even vanish.

Consider an (asymmetric) matrix factorization model, i.e. a linear model over matrix-valued observations $X \in \mathbb{R}^{d \times d}$, described by $f\big((U, V), X\big) = \langle UV^\top, X \rangle$ where $U, V \in \mathbb{R}^{d \times k}$, and we refer to $k$ as the “width.” We are interested in understanding the behaviour as $k \to \infty$ while the scaling of the initialization of each individual parameter changes with $k$. Let $M = UV^\top$ denote the underlying linear predictor. We consider minimizing the squared loss on $N$ samples using gradient flow on the parameters $U$ and $V$. This formulation includes a number of special cases such as matrix completion, matrix sensing, and two layer linear neural networks.

We want to understand how the scale and width jointly affect the implicit bias. Since the number of parameters grows with $k$, it now makes less sense to capture the scale via the magnitude of individual parameters. Instead, we will capture the scale via $\alpha = \|M(0)\| = \|U(0)V(0)^\top\|$, i.e. the scale of the model itself at initialization. The initial predictions are then also of order $\alpha$, e.g. when $U(0)V(0)^\top$ is Gaussian-like and the measurement $X$ has unit Frobenius norm. We will now show that whether the model remains in the kernel regime depends on the relative scaling of $\alpha$ and $k$. Unlike the $D$-homogeneous models of Sections 4 and 5, the model can be in the kernel regime even when $\alpha$ remains bounded, or even when it goes to zero.

But does the scale of $M(0)$ indeed capture the relevant notion of parameter scale? For a symmetric matrix factorization model $f(W, X) = \langle WW^\top, X \rangle$, the scale of $M = WW^\top$ does capture the entire behaviour of the model, since the dynamics on $M$ induced by gradient flow on $W$ depend only on $M$ and not on $W$ itself [12].

For the asymmetric model $f\big((U, V), X\big) = \langle UV^\top, X \rangle$, this is no longer the case, and the dynamics do depend on the specific factorization and not only on the product $M = UV^\top$. Instead, we can consider an equivalent “lifted” symmetric problem defined by $\bar{W} = [U; V] \in \mathbb{R}^{2d \times k}$ and $\bar{M} = \bar{W}\bar{W}^\top$, with lifted measurements $\bar{X}$ chosen so that $\langle \bar{M}, \bar{X} \rangle = \langle UV^\top, X \rangle$. The dynamics of $\bar{M}$ – which on the off-diagonal blocks are equivalent to those of $M$ – are now fully determined by $\bar{M}$ itself; that is, by the combination of the “observed” part $UV^\top$ as well as the “unobserved” diagonal blocks $UU^\top$ and $VV^\top$. To see how this plays out in terms of the width, consider initializing $U$ and $V$ with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries. The off-diagonal entries of $\bar{M}(0)$, and thus $\alpha = \|M(0)\|$, will scale with $\sigma^2 \sqrt{k}$, while the diagonal entries of $\bar{M}(0)$ will scale with $\sigma^2 k$.

The relevant scale for the problem is thus that of the entire lifted matrix $\bar{M}(0)$, i.e. $\sigma^2 k$. We can infer that there is a transition around $\sigma^2 \propto 1/k$: (a) if $\sigma^2 k \to \infty$, then the scale of $\bar{M}(0)$ diverges and the model should remain in the kernel regime, even in cases where $\alpha = \|M(0)\| \to 0$; (b) on the other hand, if $\sigma^2 k \to 0$, then the scale of $\bar{M}(0)$ vanishes and the model should approach some rich limit; (c) at the transition, when $\sigma^2 k = \Theta(1)$, the scale of $\bar{M}(0)$ remains bounded and we are in an intermediate regime; if the limit of $\sigma^2 k$ exists we expect an implicit bias resembling $Q$ at the corresponding scale (see the discussion below). Geiger et al. [10] also argue for a transition between the kernel and rich extremes around this scaling using different arguments, but they focus on the two extremes and not on the transition itself. Here, we understand the scaling directly in terms of how the width affects the magnitude of the symmetrized model $\bar{M}(0)$. For the symmetric matrix factorization model, we can analyze the case where the scale of $\bar{M}(0)$ diverges and prove that it, unsurprisingly, leads to the kernel regime. We prove this in Appendix E (see the theorem and corollary there), closely following the approach of Chizat and Bach [5]. It would be more interesting to characterize the implicit bias across the full range of the intermediate regime, which perhaps corresponds to something resembling the “$Q$-Schatten-not-quite-norm,” i.e. the regularizer $Q$ applied to the singular values of $M$. However, even just the rich limit in this setting has defied generic analysis so far (q.v. the still unresolved conjecture of [12]), and analyzing the intermediate regime is even harder (in particular, the limit of the intermediate regime describes the rich limit). Nevertheless, we can give a precise treatment for the special case where the rich regime is known, namely where the measurements commute with each other. For simplicity, we will discuss specifically the case where the measurements and $M$ are diagonal, but the same argument applies to commuting measurements.
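The scaling of the two parts of the lifted matrix is easy to verify numerically (a sketch of our own, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 50, 0.1
for k in [10, 100, 1000, 10000]:
    U = sigma * rng.standard_normal((d, k))
    V = sigma * rng.standard_normal((d, k))
    W_bar = np.vstack([U, V])
    M_bar = W_bar @ W_bar.T                      # lifted matrix [[UU^T, UV^T], [VU^T, VV^T]]
    off = np.abs(M_bar[:d, d:]).mean()           # typical entry of the observed block UV^T
    diag = np.abs(np.diag(M_bar)).mean()         # typical diagonal entry of UU^T, VV^T
    print(k, off / (sigma ** 2 * np.sqrt(k)), diag / (sigma ** 2 * k))
# Both printed ratios stay O(1): the observed block scales like sigma^2 * sqrt(k),
# while the unobserved diagonal blocks scale like sigma^2 * k.
```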

Figure 4: A wide parallel network

Matrix Sensing with Diagonal/Commutative Measurements

Consider the special case where the measurements $X_n$, and hence the relevant parts of $M$, are all diagonal, or more generally commuting, matrices. The diagonal elements of $M = UV^\top$ (the only relevant part when the measurements are diagonal) are $M_{ii} = \langle u_i, v_i \rangle$, where $u_i, v_i \in \mathbb{R}^k$ denote the rows of $U$ and $V$, and so the diagonal case can be thought of as an (asymmetric) “wide” analogue of the 2-homogeneous model we considered in Section 4, i.e. a “wide parallel linear network” where each input unit has its own set of $k$ hidden units. This is depicted in Figure 4. In any case, we consider initializing $U$ and $V$ with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries, so $\|M(0)\|$ will be of magnitude $\sigma^2 \sqrt{k}$, and take $\sigma = \sigma(k)$, scaling as a function of $k$.

For such a model we can completely characterize the implicit bias. Consider $k \to \infty$ with $\sigma = \sigma(k)$, and let $\mu = \lim_{k \to \infty} \sigma^2(k)\, k$, where we assume this limit is either infinite or exists and is finite. We will show below that the matrix $M$ will converge to the zero-error solution minimizing $Q_\mu$ applied to its spectrum (the “Schatten-$Q_\mu$-not-quite-norm”). This corresponds to an implicit bias which approximates the trace norm for small $\mu$ and the Frobenius norm for large $\mu$. In the diagonal case (i.e. an “infinitely wide parallel network”), the small-$\mu$ limit is just the minimum $\ell_1$ norm solution, but unlike the width-1 model of Section 4, this is obtained without an “unbiasing” trick, and when the scale of the outputs of all units at initialization vanishes rather than explodes.

In the large-$k$ limit, $\bar{M}(0) \to \mu I_{2d}$, so all four submatrices of the lifted matrix $\bar{M}(0)$ have diagonal structure. The dynamics on $\bar{M}$ are a linear combination of terms of the form $\bar{X}_n \bar{M}$ and $\bar{M} \bar{X}_n$, and each of these terms shares this same structure (all four blocks diagonal), which is therefore maintained throughout the course of optimization. We thus restrict our attention to just the diagonals of the four blocks of $\bar{M}$; all other entries will remain zero. In fact, we only need to track $m_i = (UV^\top)_{ii}$ and $\bar{m}_i = (UU^\top)_{ii} = (VV^\top)_{ii}$, with the goal of understanding $m = \operatorname{diag}(M)$.

Since the dynamics of $\bar{M}$ depend only on the observations and on $\bar{M}$ itself, and not on the underlying parameters, we can understand the implicit bias by analyzing any initialization that gives the same $\bar{M}(0)$. A convenient choice is $U(0)$ and $V(0)$ such that $U(0)V(0)^\top = 0$ and $U(0)U(0)^\top = V(0)V(0)^\top = \mu I$, so that $m(0) = 0$ and $\bar{m}(0) = \mu \mathbf{1}$. Let $X \in \mathbb{R}^{N \times d}$ denote the matrix whose $n$th row is the diagonal of the $n$th measurement, and let $r(t) \in \mathbb{R}^N$ be the vector of residuals with $r_n(t) = \langle m(t), x_n \rangle - y_n$. A simple calculation then shows that the dynamics are given by $\dot{m}_i = -2 \bar{m}_i (X^\top r)_i$ and $\dot{\bar{m}}_i = -2 m_i (X^\top r)_i$, which have as a solution

$m_i(t) = \mu \sinh\!\big(-2 (X^\top \nu(t))_i\big), \qquad \bar{m}_i(t) = \mu \cosh\!\big(-2 (X^\top \nu(t))_i\big), \qquad \nu(t) = \int_0^t r(s)\, ds \qquad (11)$

This solution for $m(t)$ has a familiar form! In particular, if $m(t)$ indeed reaches a zero-error solution $m^\infty$, then using the same argument as for Theorem 4 we conclude that there exists $\nu$ satisfying the KKT conditions, so that $m^\infty = \arg\min_{X m = y} \sum_i \mu\, q(m_i / \mu)$; that is, the implicit bias is given by the same family of regularizers $Q$ as before, with the width-dependent quantity $\mu$ playing the role (up to a constant factor) of $\alpha^2$.
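The closed-form dynamics can be checked against a direct simulation of the wide parallel network; the sketch below (ours, with illustrative sizes and stepsizes) runs small-stepsize gradient descent on $U, V$ with diagonal measurements and a finite width $k$, and compares $\operatorname{diag}(UV^\top)$ at the end of training to eq. (11). The agreement is only approximate, with finite-width corrections that should vanish as $k$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, k = 10, 6, 400
mu = 0.05                                       # plays the role of lim sigma^2 * k
sigma = np.sqrt(mu / k)
X = rng.standard_normal((N, d))
y = X @ np.concatenate([np.ones(2), np.zeros(d - 2)])

U = sigma * rng.standard_normal((d, k))
V = sigma * rng.standard_normal((d, k))
step, iters = 1e-3, 200_000
nu = np.zeros(N)                                # accumulates nu(t) = integral of the residual
for _ in range(iters):
    m = np.einsum("ij,ij->i", U, V)             # diag(U V^T): the only observed part
    r = X @ m - y
    g = X.T @ r
    U, V = U - step * g[:, None] * V, V - step * g[:, None] * U
    nu += step * r

m_final = np.einsum("ij,ij->i", U, V)
m_closed_form = mu * np.sinh(-2 * X.T @ nu)     # eq. (11), exact only in the wide limit
print(np.max(np.abs(m_final - m_closed_form)))
```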

Low-Rank Matrix Completion

Matrix completion is a more realistic instance of our model, where the measurements $X_n = e_{i_n} e_{j_n}^\top$ are indicators of single entries of the matrix, and the labels $y_n$ therefore correspond to observed entries of an unknown matrix $M^*$. When $N < d^2$, there are many minimizers of the squared loss, which correspond to matching $M^*$ on all of the observed entries and imputing arbitrary values for the unobserved entries. When $M^*$ is rank-$r$ for $r \ll d$, the minimum nuclear norm zero-error solution will recover $M^*$ when the number of observed entries is roughly proportional to $dr$, up to logarithmic factors [4]. These measurements do not commute, so the analysis above does not apply. In fact, there is no existing analysis for the implicit bias of this model outside of the kernel regime. Nevertheless, our experiments in Figure 5 indicate that the rich limit does give implicit nuclear norm regularization, and that $M$ will indeed converge to $M^*$ at this sample size. On the other hand, the kernel regime corresponds to implicit Frobenius norm regularization, which does not recover $M^*$ unless nearly all of the $d^2$ entries are observed.
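A minimal version of this experiment can be sketched as follows (our own illustrative sizes and stepsizes; the paper's actual setup is described in the caption of Figure 5): we complete a rank-1 matrix from a random subset of entries with gradient descent on $U, V$ at different initialization scales $\sigma$, and track the excess nuclear norm and the error on the unobserved entries.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 30, 100
a, b = rng.standard_normal((d, 1)), rng.standard_normal((d, 1))
M_star = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)).T    # rank-1, unit Frobenius norm
mask = rng.random((d, d)) < 0.3                                 # ~30% of entries observed

def train(sigma, iters=50_000):
    U = sigma * rng.standard_normal((d, k))
    V = sigma * rng.standard_normal((d, k))
    step = 0.25 / (4 * sigma ** 2 * k + 4 * np.linalg.norm(M_star, 2))
    for _ in range(iters):
        R = (U @ V.T - M_star) * mask               # residual on observed entries only
        U, V = U - step * R @ V, V - step * R.T @ U
    return U @ V.T

for sigma in [1e-3, 3e-2, 3e-1]:
    M = train(sigma)
    nuc_excess = np.linalg.norm(M, "nuc") - np.linalg.norm(M_star, "nuc")
    unobs_rmse = np.sqrt((((M - M_star) * ~mask) ** 2).sum() / (~mask).sum())
    print(sigma, round(nuc_excess, 3), round(unobs_rmse, 3))
# Hedged expectation: small sigma (rich limit) gives near-zero excess nuclear norm and
# approximately recovers the unobserved entries; large sigma (kernel regime) does neither.
```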

Figure 5: Matrix Completion. We generate a rank-1 ground truth $M^* = u v^\top$ and observe a random subset of its entries. We minimize the squared loss on the observed entries of the model $M = UV^\top$ with width $k$ using gradient descent with a small stepsize, initializing $U$ and $V$ with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries. For the solution $M^\infty$ reached by gradient descent, the left heatmap depicts the excess nuclear norm $\|M^\infty\|_* - \|M^*\|_*$ (this is conjectured to be zero in the rich limit); the right heatmap depicts the root mean squared difference between the entries of $M^\infty$ and $M^*$ corresponding to unobserved entries of $M^*$ (in the kernel regime, the unobserved entries do not move). Both exhibit a phase transition in terms of $\sigma$ and $k$: when $\sigma^2 k$ is small, the excess nuclear norm is approximately zero, corresponding to the rich limit; when $\sigma^2 k$ is large, the unobserved entries do not change, which corresponds to the kernel regime. This phase transition appears to sharpen somewhat as $k$ increases.

7 Neural Network Experiments

In Sections 4 and 5, we intentionally focused on the simplest possible models in which a kernel-to-rich transition can be observed, in order to isolate this phenomenon and understand it in detail. In those simple models, we were able to obtain a complete analytic description of the transition. Obtaining such a precise description in more complex models is too optimistic at this point, but we can demonstrate the same phenomenon empirically for realistic non-linear neural networks.

Figures 6(a) and 6(b) indicate that non-linear ReLU networks remain in the kernel regime when the initialization is large, that they exit the kernel regime as the initialization becomes smaller, and that exiting the kernel regime allows for smaller test error on the synthetic data. On MNIST data, Figure 6(c) shows that previously published successes with training very wide depth-2 ReLU networks without explicit regularization [e.g. 18] rely on the initialization being small, i.e. on being outside of the kernel regime. In fact, the 2.4% test error reached for large initialization is no better than what can be achieved with a linear model over a random feature map. Turning to a more realistic network, Figure 6(d) shows similar behavior when training a VGG11-like network on CIFAR10.

Interestingly, in all experiments, when $\alpha = 1$ the models both achieve good test error and are just about to enter the kernel regime, which may be desirable due to the learning vs. optimization tradeoffs discussed in Section 4. Not coincidentally, $\alpha = 1$ corresponds to using the standard out-of-the-box uniform He initialization. Given the extensive efforts put into designing good initialization schemes, this gives further credence to the idea that models will perform best when trained just outside of the kernel regime.
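For concreteness, the following PyTorch sketch (ours, not the paper's training code) shows the $\alpha$-scaled uniform He initialization used in these experiments and one simple way to compute a "grad distance" between tangent features at initialization and after training; here the feature map is collapsed by summing the network outputs over a fixed probe batch, which is a simplification of the per-example feature map reported in Figure 6.

```python
import torch
import torch.nn as nn

def make_mlp(alpha, width=256, depth=2, d_in=784, d_out=10):
    """Fully connected ReLU net with uniform He ("kaiming") initialization scaled by alpha."""
    layers, d_prev = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d_prev, width), nn.ReLU()]
        d_prev = width
    layers += [nn.Linear(d_prev, d_out)]
    net = nn.Sequential(*layers)
    with torch.no_grad():
        for m in net.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)
                m.weight.mul_(alpha)             # scale the standard init by alpha
    return net

def tangent_features(net, x):
    """Gradient of the summed outputs on a probe batch x, flattened over all parameters."""
    out = net(x).sum()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def grad_distance(net_init, net_final, x):
    """Cosine distance between the (collapsed) tangent features before and after training."""
    a, b = tangent_features(net_init, x), tangent_features(net_final, x)
    return 1 - torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Usage sketch (hypothetical): keep a deepcopy of the network at initialization, train the
# other copy, then call grad_distance(net0, net1, x_probe) on a fixed random probe batch.
```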

(a) Test RMSE vs scale
(b) Grad distance vs scale
(c) MNIST test error vs scale
(d) CIFAR10 test error vs scale
Figure 6: Synthetic Data: We generated a small regression training set in $\mathbb{R}^2$ by sampling 10 points uniformly from the unit circle, and labelling them with a 1 hidden layer teacher network with 3 hidden units. We trained ReLU networks of depth 2, 3, and 5 with 30 units per layer with the squared loss, using full-batch gradient descent and a small stepsize. The weights of the network are set using the uniform He initialization, and then multiplied by $\alpha$. The model is trained until the training loss falls below a small threshold. Shown in (a) and (b) are the test error and the “grad distance” vs. the depth-adjusted scale of the initialization, $\alpha^D$. The grad distance is the cosine distance between the tangent kernel feature map at initialization versus at convergence. MNIST: We trained a depth-2, 5000 hidden unit ReLU network with the cross-entropy loss using SGD until it reached 100% training accuracy. The stepsizes were tuned w.r.t. validation error for each $\alpha$ individually. In (c), the dashed line shows the test error of the resulting network vs. $\alpha$ and the solid line shows the test error of the explicitly trained kernel predictor. CIFAR10: We trained a VGG11-like deep convolutional network with the cross-entropy loss using SGD and a small stepsize for 2000 epochs; all models reached 100% training accuracy. In (d), the dashed line shows the final test error vs. $\alpha$. The solid line shows the test error of the explicitly trained kernel predictor. See Appendix A for further details about all of the experiments.

7.1 Univariate 2-layer ReLU Networks

Figure 7: Each subplot shows the functions learned by a univariate ReLU network of width $k$ with initialization $w(0) = \alpha w_0$, for several values of $\alpha$ and a fixed shape $w_0$. In (a), the relative scale of the two layers in $w_0$ is fixed by a standard initialization scheme. In (b) and (c), the relative scaling of the layers in $w_0$ is changed without changing the scale of the network's output.

Consider a two layer, width-$k$ ReLU network with univariate input, given by $f(w, x) = \sum_{i=1}^k w^{(2)}_i \big[ w^{(1)}_i x + b_i \big]_+$, where $w^{(1)}, w^{(2)} \in \mathbb{R}^k$ and $b \in \mathbb{R}^k$ are the weight and bias parameters, respectively, of the two layers. This setting is the simplest non-linear model which has been explored in detail both theoretically and empirically [19, 21]. [19] show that for an infinite width, univariate ReLU network, the minimal parameter norm solution for a 1D regression problem, i.e. $\min \|w\|_2^2$ subject to fitting the data, is given by linear spline interpolation. We hypothesize that this bias corresponds to the rich limit in training univariate 2-layer networks. In contrast, the results of Williams et al. [21] show that the kernel limit corresponds to different cubic spline interpolations, where the exact form of the interpolation depends on the relative scaling of the weights across the layers. We explore the transition between the two regimes as the scale of the initialization changes. We again consider an unbiased model, as suggested by [5], to avoid large outputs at initialization for large $\alpha$.

In Figure 7, we fix the width of the network and empirically plot the functions learned with different initialization scales $\alpha$ for a fixed shape $w_0$. Additionally, we demonstrate the effect of changing $w_0$ by rescaling the layers relative to one another without changing the output, as shown in Figure 7(b,c). First, as we suspected, we see that the rich limit $\alpha \to 0$ indeed corresponds to linear spline interpolation and is indeed independent of the specific choice of $w_0$ as long as the outputs are unchanged. In contrast, as was also observed by [21], the kernel limit (large $\alpha$) does change as the relative scaling of the two layers changes, leading to what resemble different cubic splines.
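The 1-D experiment can be reproduced qualitatively with a short script; the sketch below (ours, with illustrative width, stepsizes, and iteration counts) trains an "unbiased" two-layer ReLU network on a handful of points at a small and a large $\alpha$. Plotting the two learned functions on a fine grid should show a roughly piecewise-linear interpolant for small $\alpha$ and a smoother, cubic-spline-like one for large $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(4)
xs = np.linspace(-1.0, 1.0, 6)                  # a few 1-D training points
ys = np.sin(2.5 * xs)

def train(alpha, k=200, iters=300_000):
    # "Unbiased" init: k units plus k sign-flipped copies, so f(x) = 0 at initialization.
    a0, b0, c0 = rng.standard_normal(k), rng.standard_normal(k), rng.standard_normal(k)
    a = alpha * np.concatenate([a0, -a0])       # output weights
    b = alpha * np.concatenate([b0, b0])        # input weights
    c = alpha * np.concatenate([c0, c0])        # biases
    step = 0.1 / (2 * k * max(alpha, 1.0) ** 2)
    for _ in range(iters):
        pre = np.outer(xs, b) + c               # (n_points, 2k) pre-activations
        act = np.maximum(pre, 0.0)
        r = act @ a - ys                        # residuals
        gate = (pre > 0) * a                    # a_i * 1[pre > 0]
        grad_a = act.T @ r
        grad_b = (gate * r[:, None] * xs[:, None]).sum(axis=0)
        grad_c = (gate * r[:, None]).sum(axis=0)
        a, b, c = a - step * grad_a, b - step * grad_b, c - step * grad_c
    return lambda x: np.maximum(np.outer(x, b) + c, 0.0) @ a

f_rich, f_kernel = train(alpha=0.05), train(alpha=4.0)
grid = np.linspace(-1.0, 1.0, 401)
for name, f in [("rich (alpha=0.05)", f_rich), ("kernel (alpha=4.0)", f_kernel)]:
    print(name, "max train error:", np.abs(f(xs) - ys).max())
# Plot f_rich(grid) and f_kernel(grid) against (xs, ys) to compare the two interpolants.
```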

Acknowledgements

This work was supported by NSF Grant 1764032. BW is supported by a Google PhD Research Fellowship.

Appendix A Neural Network Experiments

In this Appendix, we provide further details about the neural network experiments.

Synthetic Experiments

We construct a synthetic training set with 10 points drawn uniformly from the unit circle in $\mathbb{R}^2$ and labelled by a teacher model with 1 hidden layer of 3 units. We train fully connected ReLU networks with depths 2, 3, and 5 with 30 units per layer to minimize the squared loss, using full gradient descent with a constant stepsize until the training loss falls below a small threshold. We use the uniform He initialization for the weights and then multiply them by $\alpha$.

Here, we describe the details of the neural network implementations for the MNIST and CIFAR10 experiments.

MNIST

Since our theoretical results hold for the squared loss and gradient flow dynamics, here we empirically assess whether different regimes can be observed when training neural networks following standard practices.

We train a fully-connected neural network with a single hidden layer composed of 5000 units on the MNIST dataset, where the weights are initialized with the uniform He initialization, i.e. drawn uniformly from an interval whose width is determined by $n_{\mathrm{in}}$, the number of units in the previous layer, as suggested by He et al. [14], and then multiplied by $\alpha$. SGD with a fixed batch size is used to minimize the cross-entropy loss over the training points, and the error over the test samples is used as a measure of generalization. For each value of $\alpha$, we search over learning rates and use the one which results in the best generalization.

There is a visible phase transition in Figure 6(c) in terms of generalization (lower test error for small $\alpha$, and markedly higher test error for large $\alpha$), even though every network reached 100% training accuracy and less than