Kernel and Rich Regimes in Overparametrized Models

Blake Woodworth Toyota Technological Institute at Chicago    Suriya Gunasekar Toyota Technological Institute at Chicago    Pedro Savarese Toyota Technological Institute at Chicago    Edward Moroshko Technion

Itay Golan Technion    Jason Lee Princeton University    Daniel Soudry Technion    Nathan Srebro Toyota Technological Institute at Chicago
Abstract

A recent line of work studies overparametrized neural networks in the “kernel regime,” i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach (2018), we show how the scale of the initialization controls the transition between the “kernel” (aka lazy) and “rich” (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a simple two-layer model that already exhibits an interesting and meaningful transition between the kernel and rich regimes, and we demonstrate the transition for more complex matrix factorization models and multilayer non-linear networks.

1 Introduction

A string of recent papers study neural networks trained with gradient descent in the “kernel regime.” They observe that, in a certain regime, networks trained with gradient descent behave as kernel methods (Jacot et al., 2018; Daniely et al., 2016; Daniely, 2017). This allows one to prove convergence to zero-error solutions in overparametrized settings (Du et al., 2018, 2019; Allen-Zhu et al., 2018), and also implies that gradient descent converges to the minimum norm solution (in the corresponding RKHS) (Chizat and Bach, 2018; Arora et al., 2019b; Mei et al., 2019), and more generally that models inherit the inductive bias and generalization behavior of the RKHS. This suggests that, in this regime, deep learning boils down to a kernel method with a fixed kernel determined by the architecture and initialization, and thus it can only learn problems learnable by that kernel.

This contrasts with other recent results showing that in deep models, including infinitely overparametrized networks, training with gradient descent induces an inductive bias that cannot be represented as an RKHS norm. For example, analytic and/or empirical results suggest that gradient descent on depth-$L$ linear convolutional networks implicitly biases toward minimizing the $\ell_{2/L}$ bridge penalty in the frequency domain (Gunasekar et al., 2018b); weight decay on an infinite-width, single-input ReLU network implicitly biases towards minimizing the second-order total variation of the learned function (Savarese et al., 2019); and gradient descent on an overparametrized matrix factorization, which can be thought of as a two-layer linear network, induces nuclear norm minimization of the learned matrix (Gunasekar et al., 2017) and can ensure low-rank matrix recovery (Li et al., 2018). All of these natural inductive biases ($\ell_p$ bridge penalties with $p < 2$, total variation, the nuclear norm) are not Hilbert norms, and therefore cannot be captured by a kernel. This suggests that training deep models with gradient descent can behave very differently from kernel methods, and have much richer inductive biases.

One might then ask whether the kernel approximation indeed captures the behavior of deep learning in a relevant and interesting regime, or whether the success of deep learning comes precisely when learning escapes this regime. In order to understand this, we must first carefully understand when each of these regimes holds, and how the transition between the “kernel” regime and the “rich” regime happens.

Some investigations of the kernel regime emphasized the number of parameters (“width”) going to infinity as leading to this regime. However, Chizat and Bach (2018) identified the scale of the model as a quantity controlling entry into the kernel regime. Their results suggest that for any number of parameters (any width), a model can be approximated by a kernel method when its scale at initialization goes to infinity (see details in Section 3). Considering models with increasing (or infinite) width, the relevant regime (kernel or rich) is determined by how the scaling at initialization behaves as the width goes to infinity. In this paper we elaborate and expand on this view, carefully studying how the scale of the initialization affects model behaviour for $D$-homogeneous models.

In Section 4 we provide a complete and detailed study of a simple 2-homogeneous model that can be viewed as linear regression with a squared parametrization, or as a “diagonal” linear neural network. For this model we can exactly characterize the implicit bias of training with gradient descent, as a function of the scale of the initialization, and see how this implicit bias becomes the $\ell_2$ norm in the kernel regime but the $\ell_1$ norm in the rich regime. We can therefore understand how, e.g. for a high-dimensional sparse regression problem, where it is necessary to discover the relevant features, we can get good generalization when the initialization scale $\alpha$ is small, but not when it is large. Deeper networks correspond to higher orders of homogeneity, and so in Section 5 we extend our study to a $D$-homogeneous model, studying the effect of the order $D$. In Sections 6 and 7, we demonstrate similar transitions experimentally in matrix factorization and multilayer non-linear networks.

2 Setup and preliminaries

We consider models $f(\mathbf{w}, x)$ which map parameters $\mathbf{w} \in \mathbb{R}^p$ and examples $x \in \mathcal{X}$ to predictions $f(\mathbf{w}, x) \in \mathbb{R}$. We denote the predictor implemented by the parameters $\mathbf{w}$ by $F(\mathbf{w})$, such that $F(\mathbf{w})(x) = f(\mathbf{w}, x)$. Much of our focus will be on models, such as linear networks, which are linear in $x$ (but not in the parameters $\mathbf{w}$!), in which case $F(\mathbf{w})$ is a linear predictor and can be represented as a vector $\beta_{\mathbf{w}} \in \mathbb{R}^d$ with $f(\mathbf{w}, x) = \langle \beta_{\mathbf{w}}, x \rangle$. Such models are essentially alternate parametrizations of linear models, but, as we shall see, the change of parametrization is crucial.

In this paper, we consider models that are $D$-positive homogeneous in the parameters $\mathbf{w}$, for some integer $D \geq 1$, meaning that for any $c > 0$, $f(c\,\mathbf{w}, x) = c^D f(\mathbf{w}, x)$. We refer to such models simply as $D$-homogeneous. Many interesting model classes have this property, including multi-layer ReLU networks with fully connected and convolutional layers, layered linear neural networks, and matrix factorization, where $D$ corresponds to the depth of the network.

Consider a training set $\{(x_i, y_i)\}_{i=1}^n$ consisting of $n$ input-label pairs. For a given loss function $\ell$, the loss of the model parametrized by $\mathbf{w}$ is $L(\mathbf{w}) = \sum_{i=1}^n \ell\big(f(\mathbf{w}, x_i), y_i\big)$. We will focus mostly on the squared loss $\ell(\hat y, y) = \frac{1}{2}(\hat y - y)^2$. We slightly abuse notation and use $f(\mathbf{w}, X) \in \mathbb{R}^n$ to denote the vector of predictions on the training examples, so for the squared loss we can write $L(\mathbf{w}) = \frac{1}{2}\| f(\mathbf{w}, X) - y \|^2$, where $y \in \mathbb{R}^n$ is the vector of target labels.

Minimizing the loss using gradient descent amounts to iteratively updating the parameters

$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w}_t). \qquad (1)$

We consider gradient descent with an infinitesimally small stepsize $\eta \to 0$, i.e. the gradient flow dynamics

$\dot{\mathbf{w}}(t) = -\nabla_{\mathbf{w}} L\big(\mathbf{w}(t)\big). \qquad (2)$

We are particularly interested in the scale of the initialization, which we capture through a scalar parameter $\alpha > 0$. For scale $\alpha$, we denote by $\mathbf{w}_\alpha(t)$ the gradient flow path (2) with the initial condition $\mathbf{w}_\alpha(0) = \alpha \mathbf{w}_0$ for some fixed $\mathbf{w}_0$. We use $F_\alpha(t) := F(\mathbf{w}_\alpha(t))$, and for linear predictors $\beta_\alpha(t) := \beta_{\mathbf{w}_\alpha(t)}$, to denote the dynamics on the predictor induced by the gradient flow on $\mathbf{w}$.

In many cases, we expect the dynamics to converge to a minimizer of $L$, though proving that this happens will not be our main focus. Rather, we are interested in the underdetermined case, $n < d$, where there are generally many minimizers of $L$, all with $L(\mathbf{w}) = 0$ and $f(\mathbf{w}, X) = y$. Our main focus is on which of the many minimizers gradient flow converges to. That is, we want to characterize $\mathbf{w}_\alpha^\infty := \lim_{t\to\infty} \mathbf{w}_\alpha(t)$ or, more importantly, the predictor $F_\alpha^\infty$ (or $\beta_\alpha^\infty$) we converge to, and how these depend on the scale $\alpha$. In underdetermined problems, where there are many zero-error solutions, simply fitting the data using the model does not provide enough inductive bias to ensure generalization. But in many cases, the specific solution reached by gradient flow (or some other optimization procedure) has special structure, or minimizes some implicit regularizer, and this structure or regularizer provides the needed inductive bias (Gunasekar et al., 2018b, a; Soudry et al., 2018; Ji and Telgarsky, 2018).
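As a concrete companion to (2), the following Python sketch (ours, not part of the original paper; the helper name gradient_flow, the ODE tolerances, and the time horizon are arbitrary choices) numerically integrates gradient flow from a scaled initialization $\alpha \mathbf{w}_0$ and returns the parameters at a large time horizon as a stand-in for $\mathbf{w}_\alpha^\infty$.

```python
# Minimal sketch (not the paper's code): integrate gradient flow
#   dw/dt = -grad L(w),  w(0) = alpha * w0,
# and return the parameters at a large time horizon as a proxy for the limit.
import numpy as np
from scipy.integrate import solve_ivp

def gradient_flow(grad_L, w0, alpha, t_max=1e5):
    """grad_L: function mapping parameters w to the gradient of the training loss L(w)."""
    sol = solve_ivp(lambda t, w: -grad_L(w), (0.0, t_max),
                    alpha * np.asarray(w0, dtype=float), rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]

if __name__ == "__main__":
    # Example: ordinary least squares, where gradient flow converges to a minimizer of L.
    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((5, 3)), rng.standard_normal(5)
    w_inf = gradient_flow(lambda w: X.T @ (X @ w - y), np.ones(3), alpha=1.0)
    print(np.linalg.norm(X.T @ (X @ w_inf - y)))   # ~0: a stationary point of the loss
```

The only thing that changes between the regimes studied below is the scale $\alpha$ passed to such an integrator.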

3 The Kernel Regime

Gradient descent/flow considers only the first-order approximation of the model w.r.t. $\mathbf{w}$:

$f(\mathbf{w}, x) \approx f(\mathbf{w}_0, x) + \big\langle \mathbf{w} - \mathbf{w}_0,\, \nabla_{\mathbf{w}} f(\mathbf{w}_0, x) \big\rangle. \qquad (3)$

That is, locally around any $\mathbf{w}_0$, gradient flow operates on the model as if it were an affine model with feature map $\phi_{\mathbf{w}_0}(x) = \nabla_{\mathbf{w}} f(\mathbf{w}_0, x)$, corresponding to the tangent kernel $K_{\mathbf{w}_0}(x, x') = \langle \nabla_{\mathbf{w}} f(\mathbf{w}_0, x),\, \nabla_{\mathbf{w}} f(\mathbf{w}_0, x') \rangle$ (Jacot et al., 2018; Zou et al., 2018; Yang, 2019; Lee et al., 2019). Of particular interest is the tangent kernel at initialization, which we denote $K_0 := K_{\mathbf{w}_\alpha(0)}$.

The “kernel regime” refers to a situation in which the tangent kernel does not change over the course of optimization, and less formally to the regime where it does not change significantly, i.e. where $K_{\mathbf{w}(t)} \approx K_0$ throughout training. In this regime, training the model is exactly equivalent to training an affine model with kernelized gradient descent/flow with the kernel $K_0$ and a “bias term” of $f(\mathbf{w}_\alpha(0), x)$. To avoid handling this bias term, and in particular its scaling, Chizat and Bach (2018) suggest using “unbiased” initializations such that $f(\mathbf{w}_\alpha(0), x) = 0$, so that the bias term vanishes. This can often be achieved by replicating units or components with opposite signs at initialization, which is the approach we use here (see Sections 4-6 for examples and details).

For underdetermined problems with multiple zero-error solutions, unbiased kernel gradient flow (or gradient descent) converges to the minimum norm solution $\arg\min_{F(X)=y} \|F\|_{K_0}$, where $\|\cdot\|_{K_0}$ is the RKHS norm corresponding to the kernel $K_0$. And so, in the kernel regime, we will have that $F_\alpha^\infty = \arg\min_{F(X)=y} \|F\|_{K_0}$, and the implicit bias of training is precisely given by the kernel.
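To make the kernel-regime prediction concrete, here is a small numpy sketch (again ours; it assumes an unbiased initialization with $f(\mathbf{w}_0, \cdot) = 0$, uses finite differences in place of exact gradients, and adds a tiny ridge term purely for numerical stability) that computes the tangent features at initialization and the resulting minimum-RKHS-norm interpolant.

```python
# Sketch (not the paper's code): tangent kernel at initialization and the
# minimum-RKHS-norm interpolant it induces, assuming an unbiased initialization
# with f(w0, x) = 0 so that there is no bias term to handle.
# Here f(w, X) is expected to return the vector of predictions on the rows of X.
import numpy as np

def tangent_features(f, w0, X, eps=1e-6):
    """Rows are phi_0(x_i) = grad_w f(w0, x_i), via central finite differences."""
    w0 = np.asarray(w0, dtype=float)
    Phi = np.zeros((X.shape[0], w0.size))
    for j in range(w0.size):
        e = np.zeros_like(w0); e[j] = eps
        Phi[:, j] = (f(w0 + e, X) - f(w0 - e, X)) / (2 * eps)
    return Phi

def kernel_regime_predictions(f, w0, X_train, y_train, X_test, ridge=1e-10):
    Phi_tr, Phi_te = tangent_features(f, w0, X_train), tangent_features(f, w0, X_test)
    K = Phi_tr @ Phi_tr.T                                    # tangent kernel at init
    c = np.linalg.solve(K + ridge * np.eye(len(y_train)), y_train)
    return Phi_te @ (Phi_tr.T @ c)                           # min RKHS norm interpolant
```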

When does the “kernel regime” happen? Chizat and Bach (2018) showed that for any homogeneous model satisfying some technical conditions, the kernel regime is reached as $\alpha \to \infty$. (Chizat and Bach did not consider only homogeneous models; instead of studying the scale of the initialization they studied scaling the output of the model. For homogeneous models, the dynamics obtained by scaling the initialization are equivalent to those obtained by scaling the output, and so here we focus on homogeneous models and on scaling the initialization.) That is, as we increase the scale of the initialization, the dynamics converge to the kernel gradient flow dynamics with the kernel $K_0$, and we have $F_\alpha^\infty \to \arg\min_{F(X)=y}\|F\|_{K_0}$. In Sections 4 and 5 we prove this limit directly for our specific models, and we also demonstrate it empirically for matrix factorization and deep networks in Sections 6 and 7.

In contrast, and as we shall see in later sections, the small initialization limit often leads to a very different and rich inductive bias, e.g. inducing sparsity or low-rank structure (Gunasekar et al., 2017; Li et al., 2018; Gunasekar et al., 2018b), that allows for generalization in many settings where kernel methods would not. We refer to the limit reached as $\alpha \to 0$ as the “rich regime.” This regime is also referred to as the “active” or “adaptive” regime (Chizat and Bach, 2018), since the tangent kernel changes over the course of training, in a sense adapting to the data. We argue that this regime is the one that truly allows us to exploit the power of depth, and thus is the more relevant regime for understanding the success of deep learning.

4 Detailed Study of a Simple Depth-2 Model

We study in detail a simple 2-homogeneous model. Consider the class of linear functions over $\mathbb{R}^d$, with the following squared parametrization:

$f(\mathbf{w}, x) = \big\langle \mathbf{w}_+^2 - \mathbf{w}_-^2,\, x \big\rangle, \qquad \mathbf{w} = [\mathbf{w}_+, \mathbf{w}_-] \in \mathbb{R}^{2d}, \qquad (4)$

where we use the notation $\mathbf{z}^2$ for $\mathbf{z} \in \mathbb{R}^d$ to denote element-wise squaring. We consider initializing all weights equally, $\mathbf{w}_{+,\alpha}(0) = \mathbf{w}_{-,\alpha}(0) = \alpha\mathbf{1}$, so that $\beta_\alpha(0) = 0$.

This is nothing but a linear regression model, except with an unconventional parametrization. The model can also be thought of as a “diagonal” linear neural network (i.e. one where the weight matrices have diagonal structure) with $2d$ units. A standard diagonal linear network would have $d$ units, with unit $i$ connected to just the single input $x[i]$ with weight $u_i$ and to the output with weight $v_i$, thus implementing the model $f((\mathbf{u}, \mathbf{v}), x) = \sum_i u_i v_i\, x[i]$. But if $|u_i| = |v_i|$ at initialization, their magnitudes will remain equal and their signs will not flip throughout training, and so we can equivalently replace both with a single weight $w_i$, yielding the model $f(\mathbf{w}, x) = \langle \mathbf{w}^2, x \rangle$ (with the signs of the coefficients fixed at initialization).

The reason for using both $\mathbf{w}_+$ and $\mathbf{w}_-$ (i.e. $2d$ units) is two-fold. First, it ensures that the image of the parametrization is all (signed) linear functions, and thus the model is truly equivalent to standard linear regression. Second, it allows for initialization at $\beta_\alpha(0) = 0$ without this being a saddle point from which gradient flow will never escape. (Our results can be generalized to non-uniform initialization, “biased” initialization (i.e. where $\beta_\alpha(0) \neq 0$), or the asymmetric parametrization $f((\mathbf{u}, \mathbf{v}), x) = \sum_i u_i v_i\, x[i]$; however, this complicates the presentation without adding much insight.)

The model (4) is perhaps the simplest non-trivial example of a $D$-homogeneous model with $D > 1$, and we chose it for this reason, as it already exhibits distinct and interesting kernel and rich regimes. Furthermore, we can completely understand both the implicit regularization driving this model and the transition between the regimes analytically.

Consider the behavior of the limit $\beta_\alpha^\infty$ of gradient flow (2) as a function of the initialization scale $\alpha$, in the underdetermined case where there are many possible solutions to $X\beta = y$. The tangent kernel at initialization is proportional to $\alpha^2 \langle x, x' \rangle$, i.e. a scaling of the standard inner product kernel, so its RKHS norm is (a scaling of) the Euclidean norm. Thus, in the kernel regime, gradient flow leads to the minimum $\ell_2$ norm solution $\beta^*_{\ell_2} := \arg\min_{X\beta = y} \|\beta\|_2$. Following Chizat and Bach (2018) and the discussion in Section 3, we thus expect that $\beta_\alpha^\infty \to \beta^*_{\ell_2}$ as $\alpha \to \infty$, and we also show this below.

In contrast, Gunasekar et al. (2017) show that as $\alpha \to 0$, gradient flow leads instead to the minimum $\ell_1$ norm solution $\beta^*_{\ell_1} := \arg\min_{X\beta = y} \|\beta\|_1$. This is the “rich regime.” Comparing this with the kernel regime, we already see two very distinct behaviors and, in high dimensions, two very different inductive biases. In particular, the rich regime's $\ell_1$ bias is not an RKHS norm for any choice of kernel. Can we characterize and understand the transition between the two regimes as $\alpha$ ranges from very small to very large? The following theorem does just that.

Theorem 1.

For any $0 < \alpha < \infty$,

$\beta^\infty_\alpha = \arg\min_{\beta :\, X\beta = y} Q_\alpha(\beta), \qquad (5)$

where $Q_\alpha(\beta) = \alpha^2 \sum_{i=1}^d q\!\left(\beta_i / \alpha^2\right)$ and $q(z) = 2 - \sqrt{4 + z^2} + z \operatorname{arcsinh}(z/2)$.

Proof sketch

The proof in Appendix A proceeds by showing that the gradient flow dynamics on $\mathbf{w}$ lead to a solution of the form

$\beta^\infty_\alpha = 2\alpha^2 \sinh\!\big(X^\top \nu\big), \qquad (6)$

where $\nu \in \mathbb{R}^n$ is determined by the time-integral of the residuals along the optimization path (and $\sinh$ is applied element-wise). While evaluating this integral would be very difficult, the fact that

$\nabla Q_\alpha\big(\beta^\infty_\alpha\big) = \operatorname{arcsinh}\!\big(\beta^\infty_\alpha / (2\alpha^2)\big) = X^\top \nu \qquad (7)$

already provides a dual certificate for the KKT conditions of the minimization problem (5).

Figure 1: (a) Generalization: the population error of the gradient flow solution $\beta_\alpha^\infty$ vs. $\alpha$ in the sparse regression problem described in Section 4. (b) Norms of solution: the excess $\ell_1$ norm and excess $\ell_2$ norm of the gradient flow solution, i.e. $\|\beta_\alpha^\infty\|_1 - \|\beta^*_{\ell_1}\|_1$ and $\|\beta_\alpha^\infty\|_2 - \|\beta^*_{\ell_2}\|_2$ (shown in blue and red). (c) Sample complexity: for each number of samples, the largest $\alpha$ such that $\beta_\alpha^\infty$ achieves population error at most a fixed threshold. The dashed line indicates the number of samples needed by the minimum $\ell_1$ norm solution $\beta^*_{\ell_1}$.
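A rough simulation in the spirit of Figure 1 takes only a few lines (a sketch with our own problem sizes, tolerances, and time horizon, not the code behind the figure): integrate gradient flow on the model (4) for a sparse regression instance and compare the $\ell_1$ and $\ell_2$ norms of the resulting interpolants across scales.

```python
# Rough sketch (not the code behind Figure 1): gradient flow on
# beta(w) = w_plus**2 - w_minus**2 for a sparse regression problem,
# at several initialization scales alpha.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
n, d, k = 40, 100, 5                                   # illustrative sizes
X = rng.standard_normal((n, d))
beta_star = np.zeros(d); beta_star[:k] = 1.0           # k-sparse planted predictor
y = X @ beta_star

def ode(t, w):
    w_plus, w_minus = w[:d], w[d:]
    g = X.T @ (X @ (w_plus**2 - w_minus**2) - y)       # X^T r(t)
    return np.concatenate([-2 * w_plus * g, 2 * w_minus * g])

def beta_limit(alpha, t_max=1e5):
    sol = solve_ivp(ode, (0, t_max), alpha * np.ones(2 * d), rtol=1e-9, atol=1e-12)
    w_plus, w_minus = sol.y[:d, -1], sol.y[d:, -1]
    return w_plus**2 - w_minus**2

for alpha in [1e-3, 1e-1, 1e1]:
    b = beta_limit(alpha)
    # small alpha: ||b||_1 should approach ||beta_star||_1 = 5 (rich regime);
    # large alpha: b should approach the minimum l2 norm interpolant (kernel regime).
    print(f"alpha={alpha:g}  l1={np.linalg.norm(b, 1):.2f}  l2={np.linalg.norm(b, 2):.2f}")
```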

In light of Theorem 1, the function $Q_\alpha$ (referred to elsewhere as the “hypentropy” function (Ghai et al., 2019)) can be understood as an implicit regularizer which biases the gradient flow solution towards one particular zero-error solution out of the many possibilities. As $\alpha$ ranges from $0$ to $\infty$, the regularizer $Q_\alpha$ interpolates between the $\ell_1$ and $\ell_2$ norms, as illustrated in Figure 2(a), which shows the coordinate function $q$. As $\alpha \to \infty$ we have that $\beta_i/\alpha^2 \to 0$, and so the behaviour of $Q_\alpha$ is controlled by the behaviour of $q$ around $0$. In this regime $q$ is quadratic, and so $Q_\alpha(\beta)$ is proportional to $\|\beta\|_2^2$. On the other hand, when $\alpha \to 0$, $Q_\alpha$ is governed by the asymptotic behaviour of $q(z)$ as $|z| \to \infty$. In this regime $q(z) \approx |z|\log|z|$, and $Q_\alpha(\beta)$ approaches (a scaling of) $\|\beta\|_1$. For any initialization scale $\alpha$, the function $Q_\alpha$ describes exactly how training will interpolate between the kernel and rich regimes. The following theorems, proven in Appendix B, provide a quantitative statement of how the $\ell_1$ and $\ell_2$ norms are approached as $\alpha \to 0$ and $\alpha \to \infty$, respectively:

Theorem 2.

For any

Theorem 2 indicates a certain asymmetry between reaching the rich and kernel regimes: a polynomially large $\alpha$ suffices to approximate the minimum $\ell_2$ norm solution to a very high degree of accuracy. On the other hand, an exponentially small $\alpha$ is sufficient to approximate the minimum $\ell_1$ norm solution, and Lemma 2 in Appendix B proves that an exponentially small $\alpha$ is also necessary in order for $Q_\alpha$ to be proportional to the $\ell_1$ norm, which indicates that $\alpha$ must indeed be that small to approximate $\beta^*_{\ell_1}$ for certain problems.
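The source of this asymmetry can be seen directly in the two asymptotic regimes of $q$: near zero, $q(z) \approx z^2/4$, while for large $|z|$, $q(z) \approx |z|(\log|z| - 1)$, so for small $\alpha$ the regularizer behaves like $\sum_i |\beta_i| \log(|\beta_i|/\alpha^2)$ and the coordinate-dependent $\log|\beta_i|$ terms are only washed out once $\log(1/\alpha^2)$ dominates them, i.e. once $\alpha$ is exponentially small. A short numeric check of these asymptotics (ours, using the expression for $q$ from Theorem 1):

```python
# Sanity check (ours) of the two asymptotic regimes of
# q(z) = 2 - sqrt(4 + z**2) + z * arcsinh(z / 2):
#   q(z) ~ z**2 / 4          as z -> 0      (l2-like regularization, large alpha)
#   q(z) ~ |z| (log|z| - 1)  as |z| -> inf  (l1-like up to log factors, small alpha)
import numpy as np

def q(z):
    return 2 - np.sqrt(4 + z**2) + z * np.arcsinh(z / 2)

print(q(1e-3) / (1e-3**2 / 4))                     # ~ 1.0 in the quadratic regime
print(q(1e6) / (1e6 * (np.log(1e6) - 1)))          # ~ 1.0 in the ~|z| log|z| regime
```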

In order to understand the effects of initialization on generalization, consider a simple sparse regression problem, where $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = \langle \beta^*, x_i \rangle$ for a $k$-sparse $\beta^*$ whose non-zero entries have equal magnitude. When $n < d$, gradient flow will reach a zero training error solution; however, not all of these solutions will generalize equally well. With $n = \Omega(k \log d)$ samples, the rich regime, i.e. the minimum $\ell_1$ norm solution, will generalize well. However, even though we can fit the training data perfectly well, we should not expect any generalization in the kernel regime with this sample size ($n = \Omega(d)$ samples would be needed in that regime); see Figure 1(c). In this case, generalizing well may require using a very small initialization, and generalization will improve as we decrease $\alpha$. From an optimization perspective this is unfortunate because $\mathbf{w} = 0$ is a saddle point, so taking $\alpha \to 0$ drastically increases the time needed to escape the saddle point.
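The sample-size gap between the two limiting solutions is easy to reproduce directly, without simulating gradient flow at all (a sketch with arbitrary sizes; the LP formulation of basis pursuit and the scipy solvers are our own choices):

```python
# Sketch (ours): compare the generalization of the minimum l2 interpolator
# (the kernel-regime limit) and the minimum l1 interpolator (the rich-regime
# limit) on a sparse regression problem with n << d.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, k, n, n_test = 200, 5, 60, 2000
beta_star = np.zeros(d); beta_star[:k] = 1.0
X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, y_test = X @ beta_star, X_test @ beta_star

beta_l2 = np.linalg.lstsq(X, y, rcond=None)[0]      # minimum l2 norm interpolant

# minimum l1 norm interpolant: min 1^T (b+ + b-)  s.t.  X (b+ - b-) = y,  b+, b- >= 0
res = linprog(np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y, bounds=(0, None))
beta_l1 = res.x[:d] - res.x[d:]

for name, b in [("min l2", beta_l2), ("min l1", beta_l1)]:
    # min l1: near-zero test error at this n; min l2: error on the order of ||beta_star||^2
    print(name, float(np.mean((X_test @ b - y_test) ** 2)))
```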

Thus, there is a tension here between generalization and optimization: a smaller $\alpha$ might improve generalization, but it makes optimization trickier. This suggests that in practice we would want to compromise and operate just at the edge of the rich regime, using the largest $\alpha$ that still allows for generalization. This is borne out in our neural network experiments in Section 7, where standard initialization schemes correspond to being right on the edge of entering the kernel regime, where we expect models to both generalize well and avoid serious optimization difficulties.

The tension between optimization and generalization can also be seen through a tradeoff between the sample size $n$ and the largest $\alpha$ we can use and still generalize. In Figure 1(c), for each sample size $n$, we plot the largest $\alpha$ for which the gradient flow solution achieves population risk below some threshold. As $n$ approaches the number of samples needed for the minimum $\ell_1$ norm solution to generalize (the vertical dashed line), $\alpha$ must become extremely small. However, generalization is much easier when the number of samples is only slightly larger, and we can then use a much more moderate initialization.

The situation we describe here is similar to one studied by Mei et al. (2019), who considered one-pass stochastic gradient descent (i.e. SGD on the population objective) and so analyzed the number of steps, and thus also the number of samples, required for generalization. Mei et al. (2018) showed that even with a large initialization one can achieve generalization by optimizing with more one-pass SGD steps. Our analysis suggests that the issue here is not that of optimizing longer or more accurately, but rather of requiring a larger sample size; in studying one-pass SGD this distinction is blurred, but our analysis separates the two.

Figure 2: The true implicit regularizer $Q_\alpha$ compared with the explicit regularizer $R_\alpha$ of (8)-(10).

Explicit Regularization

It is tempting to imagine that the effect of implicit regularization through gradient descent corresponds to selecting the zero-error solution whose parameters are closest to the initialization in Euclidean norm:

$\tilde\beta_\alpha := \arg\min_{\beta :\, X\beta = y} R_\alpha(\beta), \qquad (8)$

where

$R_\alpha(\beta) := \min_{\mathbf{w} \,:\, \beta_{\mathbf{w}} = \beta} \big\| \mathbf{w} - \mathbf{w}_\alpha(0) \big\|_2^2. \qquad (9)$

This is certainly the case for standard linear regression $f(\mathbf{w}, x) = \langle \mathbf{w}, x \rangle$, where the implicit bias of gradient descent is fully captured by this view. Is the implicit bias also captured by this minimum Euclidean distance solution for our 2-homogeneous (depth-2) model, and perhaps more generally? Can the behavior discussed above also be explained by $\tilde\beta_\alpha$?

Indeed, it is easy to verify that for our squared parametrization the limiting behaviors of the two approaches match as $\alpha \to \infty$ and as $\alpha \to 0$, i.e. both yield the minimum $\ell_2$ norm solution in the former limit and the minimum $\ell_1$ norm solution in the latter. To check whether the complete behaviour and transition are also captured by (8), we can calculate $R_\alpha$, which decomposes over the coordinates, as follows (substituting the parametrization into (9) and setting the gradient with respect to the parameters to zero leads to a quadratic equation, whose solution can be substituted back to evaluate $R_\alpha$):

(10)

where the coordinate function appearing in (10) is defined through the unique real root of the corresponding polynomial equation. As depicted in Figure 2, $R_\alpha$ is quadratic around $0$ and asymptotically linear as the coordinates grow, yielding $\ell_2$-like regularization when $\alpha \to \infty$ and $\ell_1$-like regularization as $\alpha \to 0$, similarly to $Q_\alpha$. However, $Q_\alpha$ and $R_\alpha$ are not the same regularizer: $R_\alpha$ is algebraic (even radical), while $Q_\alpha$ is transcendental. This implies that $Q_\alpha$ and $R_\alpha$ cannot be simple rescalings of each other, and hence will lead to different sets of solutions $\beta_\alpha^\infty$ and $\tilde\beta_\alpha$. In particular, while $\alpha$ needed to be exponentially small in order for $Q_\alpha$ to approximate the $\ell_1$ norm, and so for the limit of the gradient flow path to approximate the minimum $\ell_1$ norm solution, $R_\alpha$, being algebraic, converges to the $\ell_1$ norm polynomially (that is, $\alpha$ only needs to scale polynomially with the accuracy). We see that the implicit regularization effect of gradient descent (or gradient flow), and the transition from the kernel to the rich regime, is more complex and subtle than what is captured simply by distances in parameter space.
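This difference is easy to observe numerically. The sketch below (ours; it assumes the form of $q$ from Theorem 1 and computes the per-coordinate minimum distance in (9) by a one-dimensional numerical minimization rather than the closed form (10)) prints the ratio of the per-coordinate implicit regularizer to the per-coordinate explicit regularizer: if the two were rescalings of each other the ratio would be constant, but it is not.

```python
# Numerical comparison (ours) of the per-coordinate implicit regularizer
#   q_alpha(b) = alpha^2 * q(b / alpha^2)            (from Theorem 1)
# and the per-coordinate explicit regularizer
#   r_alpha(b) = min { (w+ - a)^2 + (w- - a)^2 : w+^2 - w-^2 = b }   (from (9)).
# If the two were rescalings of each other, the ratio below would be constant in b.
import numpy as np
from scipy.optimize import minimize_scalar

alpha = 0.1

def q_alpha(b):
    z = b / alpha**2
    return alpha**2 * (2 - np.sqrt(4 + z**2) + z * np.arcsinh(z / 2))

def r_alpha(b):
    # eliminate w+ = sqrt(t**2 + b) (valid for b >= 0) and minimize over w- = t >= 0
    obj = lambda t: (np.sqrt(t**2 + b) - alpha) ** 2 + (t - alpha) ** 2
    return minimize_scalar(obj, bounds=(0.0, 10.0 + np.sqrt(b)), method="bounded").fun

for b in [0.01, 0.1, 1.0, 10.0]:
    print(b, q_alpha(b) / r_alpha(b))   # the ratio varies with b: not the same regularizer
```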

5 Higher Order Models

In the previous section, we considered a 2-homogeneous model, corresponding to a simple depth-2 “diagonal” network. Deeper models correspond to higher orders of homogeneity (a depth-$D$ ReLU or linear network is $D$-homogeneous), motivating us to understand the effect of the order of homogeneity on the transition between the regimes. We therefore generalize our model and consider:

$f(\mathbf{w}, x) = \big\langle \mathbf{w}_+^D - \mathbf{w}_-^D,\, x \big\rangle, \qquad \mathbf{w} = [\mathbf{w}_+, \mathbf{w}_-] \in \mathbb{R}^{2d}. \qquad (11)$

We again consider initializing all weights equally, so that $\mathbf{w}_\alpha(0) = \alpha\mathbf{1}$. As before, this is just a linear regression model with an unconventional parametrization. It is equivalent to a depth-$D$ matrix factorization model with commutative measurement matrices, as studied by Arora et al. (2019a), and can be thought of as a depth-$D$ diagonal linear network.

We can again study the effect of the scale of the initialization on the implicit bias. Let $\beta^\infty_{\alpha,D}$ denote the limit of gradient flow on (11) when initialized at $\mathbf{w}_\alpha(0) = \alpha\mathbf{1}$ for the $D$-homogeneous model. Using the same approach as in Section 4, in Appendix C we show:

Theorem 3.

For any $0 < \alpha < \infty$ and $D \geq 3$, if gradient flow reaches a solution $\beta^\infty_{\alpha,D}$ with $X\beta^\infty_{\alpha,D} = y$, then $\beta^\infty_{\alpha,D} = \arg\min_{X\beta = y} Q^D_\alpha(\beta)$,

where $Q^D_\alpha(\beta) = \alpha^D \sum_{i=1}^d q_D\!\left(\beta_i / \alpha^D\right)$ and $q_D$ is the antiderivative of the unique inverse of a $D$-dependent function given explicitly in Appendix C. Furthermore, $\lim_{\alpha\to\infty}\beta^\infty_{\alpha,D} = \beta^*_{\ell_2}$ and $\lim_{\alpha\to 0}\beta^\infty_{\alpha,D} = \beta^*_{\ell_1}$.

In the two extremes we see that we again get the minimum $\ell_2$ norm solution in the kernel regime and, more interestingly, for any depth $D$, the same minimum $\ell_1$ norm solution in the rich regime, as has also been observed by Arora et al. (2019a). The fact that the rich-regime solution does not change with depth is perhaps surprising, and does not agree with what is obtained with explicit regularization (regularizing $\|\mathbf{w}\|_2^2$ is equivalent to $\ell_{2/D}$ regularization of $\beta$), nor with the implicit regularization under logistic-type losses (Gunasekar et al., 2017).

Figure 3: (a) The coordinate regularizer $q_D$ for several values of $D$. (b) An approximation ratio as a function of $\alpha$, capturing the transition between approximating the $\ell_1$ norm and approximating the $\ell_2$ norm. (c) A sparse regression simulation as in Figure 1, using models of different order $D$. The y-axis is the scale $\alpha$ of the weights at initialization needed to recover the planted predictor to a fixed accuracy. The dashed line indicates the number of samples needed in order for the minimum $\ell_1$ norm solution to approximate the plant.

Although the two extremes do not change as we go beyond $D = 2$, what does change is the intermediate regime (each regularizer $Q^D_\alpha$ is distinct and cannot be obtained with any other order $D$), as well as the sharpness of the transition into the extreme regimes, as illustrated in Figures 3(a)-3(c). The most striking difference is that for orders $D \geq 3$ the scale of $\alpha$ needed to approximate the $\ell_1$ norm is polynomial rather than exponential, yielding a much quicker transition to the “rich regime” and allowing near-optimal sparse regression with reasonable initialization scales. Increasing $D$ further hastens the transition. This might also help explain some of the empirical observations about the benefit of depth in deep matrix factorization (Arora et al., 2019a).
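The faster transition at higher orders can be seen with the same kind of simulation used for the depth-2 model (again an illustrative sketch with our own sizes and tolerances, not the code behind Figure 3): at a fixed, moderate initialization scale, higher-order models typically land much closer to the minimum $\ell_1$ norm solution.

```python
# Sketch (ours) of gradient flow on the order-D model beta(w) = w_plus**D - w_minus**D
# from (11), at a fixed moderate initialization scale, for several orders D.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(1)
n, d, k = 40, 100, 5
X = rng.standard_normal((n, d))
beta_star = np.zeros(d); beta_star[:k] = 1.0
y = X @ beta_star

def beta_limit(alpha, D, t_max=1e5):
    def ode(t, w):
        wp, wm = w[:d], w[d:]
        g = X.T @ (X @ (wp**D - wm**D) - y)
        return np.concatenate([-D * wp**(D - 1) * g, D * wm**(D - 1) * g])
    sol = solve_ivp(ode, (0, t_max), alpha * np.ones(2 * d), rtol=1e-9, atol=1e-12)
    wp, wm = sol.y[:d, -1], sol.y[d:, -1]
    return wp**D - wm**D

for D in [2, 3, 4]:
    b = beta_limit(alpha=0.1, D=D)
    # the l1 norm typically moves toward ||beta_star||_1 = 5 as D grows,
    # reflecting the quicker transition to the rich regime at higher orders.
    print(D, np.linalg.norm(b, 1))
```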

6 Demonstration in Matrix Completion

We now turn to a more complex depth-two model, namely a matrix factorization model, and demonstrate similar transitions empirically. Specifically, we consider the model over matrix-valued inputs defined by $f\big((U,V), X\big) = \langle X,\, U V^\top \rangle$, where $U, V \in \mathbb{R}^{d \times k}$. This corresponds to linear predictors over matrix arguments specified by $M = UV^\top$. In the overparametrized regime with $k \geq d$, the parametrization itself does not introduce any explicit rank constraints. We consider here a random low-rank matrix completion problem, where each input represents a uniformly random observation of an entry of a planted low-rank matrix $M^*$ of rank $r$: $y_i = M^*[a_i, b_i]$ with $X_i = e_{a_i} e_{b_i}^\top$. For underdetermined problems where the number of observations is much smaller than $d^2$, there are many trivial global minimizers, most of which are not low rank and hence will not guarantee recovery. As was demonstrated empirically by Gunasekar et al. (2017) and also proven rigorously for Gaussian measurements by Li et al. (2018), as $\alpha \to 0$ gradient flow implicitly regularizes the nuclear norm of $M$, which for random measurements leads to recovery of the ground truth (Candès and Recht, 2009; Recht et al., 2010): these are very different and rich implicit biases that are not RKHS norms.

Crucially, the reconstruction results in Gunasekar et al. (2017) and Li et al. (2018) depend on an initialization with scale $\alpha \to 0$. Here we further explore the role of the initialization scale. Similar to Section 4, in order to get an unbiased initialization we double the factors with opposite signs, using an initialization of the form $U_0 = \alpha\,[\bar U, \bar U]$ and $V_0 = \alpha\,[\bar V, -\bar V]$, where $\bar U, \bar V \in \mathbb{R}^{d \times d}$, so that $U_0 V_0^\top = 0$. We study the implicit bias of gradient flow over the factorized parametrization with the above initialization.

For matrix completion problems with entry-wise observations, the tangent kernel at initialization is determined by $\bar U$ and $\bar V$. It reduces to (a scaling of) the trivial delta kernel over entries in two special cases: (a) $\bar U$ and $\bar V$ have orthogonal columns (e.g. $\bar U = \bar V = I$), or (b) $\bar U$ and $\bar V$ have independent Gaussian entries and $d \to \infty$. In these cases, minimizing the RKHS norm of the tangent kernel corresponds to returning a zero-imputed matrix (the minimum Frobenius norm solution). Figure 4 demonstrates the behaviour of the gradient flow updates in the “rich” regime (where, for small $\alpha$, gradient flow recovers the ground truth) and in the “kernel” regime (where, for large $\alpha$, there are essentially no updates to the unobserved entries).

Figure 4: Regimes in matrix completion. We generated a rank-one matrix completion problem with ground truth $M^*$ by drawing a random factor with i.i.d. Gaussian entries and observing a random subset of entries. We fit the observed entries by minimizing the squared loss on a matrix factorization model $M = UV^\top$. For different scalings $\alpha$, we examine the matrix reached by gradient flow (solved using python ODE solvers) and plot (i) the reconstruction error on the unobserved entries, and (ii) the amount by which the unobserved entries changed during optimization. In (a) we initialized the factors to (scaled) identity matrices, as in case (a) above. In (b), for varying dimension, we initialized the factors with i.i.d. Gaussian entries, as in case (b) above. For large $\alpha$, the tangent kernel converges to the kernel corresponding to the Frobenius norm.
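For readers who want to reproduce the qualitative behaviour without the exact setup of Figure 4, here is a simplified sketch (ours: it uses plain gradient descent rather than an ODE solver, a width-$2d$ sign-doubled Gaussian initialization, and arbitrary problem sizes and step sizes).

```python
# Simplified matrix-completion sketch (not the exact setup of Section 6 / Figure 4):
# gradient descent on M = U V^T with a sign-doubled, alpha-scaled initialization
# (so U0 V0^T = 0), fit to a subset of observed entries of a rank-1 ground truth.
import numpy as np

rng = np.random.default_rng(0)
d, n_obs = 20, 120
u = rng.standard_normal(d)
M_star = np.outer(u, u)                                       # rank-1 ground truth
mask = np.zeros(d * d, dtype=bool)
mask[rng.choice(d * d, size=n_obs, replace=False)] = True
mask = mask.reshape(d, d)                                     # observed entries

def unobserved_error(alpha, steps=50_000):
    A, B = rng.standard_normal((d, d)) / np.sqrt(d), rng.standard_normal((d, d)) / np.sqrt(d)
    U, V = alpha * np.hstack([A, A]), alpha * np.hstack([B, -B])   # U V^T = 0 at init
    lr = 2e-3 / (1 + alpha**2)                                # keep the step stable at large alpha
    for _ in range(steps):
        R = (U @ V.T - M_star) * mask                         # residual on observed entries only
        U, V = U - lr * R @ V, V - lr * R.T @ U
    M = U @ V.T
    return np.linalg.norm((M - M_star)[~mask]) / np.linalg.norm(M_star[~mask])

for alpha in [1e-2, 1e1]:
    # small alpha: small error on unobserved entries (rich regime, low-rank bias);
    # large alpha: error remains large, the unobserved entries barely move (kernel regime).
    print(alpha, unobserved_error(alpha))
```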

7 Neural Network Experiments

In the preceding sections, we intentionally focused on the simplest possible models in which a kernel-to-rich transition can be observed, in order to isolate this phenomenon and understand it in detail. In those simple models, we were able to obtain a complete analytic description of the transition. Obtaining such a precise description in more complex models is somewhat optimistic at this point, as we do not yet have a satisfying description of even just the rich regime. Instead, we now provide empirical evidence suggesting that also for non-linear and realistic networks, the scale of initialization induces a transition into and out of a “kernel” regime, and that to reach good generalization we must operate outside of the “kernel” regime. To track whether we are in the “kernel” regime, we track how much the gradient changes throughout training. In particular, we define the gradient distance to be the cosine distance between the tangent kernel feature map $x \mapsto \nabla_{\mathbf{w}} f(\mathbf{w}, x)$ at initialization and at the end of training, evaluated on the training set.
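As a rough illustration of this diagnostic (a toy numpy sketch with our own architecture, data, and hyperparameters, not the experimental code used for Figure 5), one can train a small two-layer ReLU network at several scales and measure the cosine distance between the stacked per-example gradients before and after training:

```python
# Toy sketch (ours) of the "gradient distance" diagnostic: cosine distance between
# the tangent-kernel feature maps (per-example gradients) at initialization and at
# the end of training, for several initialization scales alpha.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 2, 50
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Bt, at = rng.standard_normal((3, d)), rng.standard_normal(3)
y = np.maximum(X @ Bt.T, 0) @ at                   # labels from a small teacher network

def features(a, B):
    """Rows are grad_w f(w, x_i) for f(w, x) = sum_j a_j relu(b_j^T x)."""
    H = X @ B.T
    act, gate = np.maximum(H, 0), (H > 0).astype(float)
    dB = (gate * a)[:, :, None] * X[:, None, :]    # df/dB, shape (n, m, d)
    return np.hstack([act, dB.reshape(n, -1)])

def grad_distance(alpha, steps=50_000):
    a = alpha * rng.standard_normal(m) / np.sqrt(m)
    B = alpha * rng.standard_normal((m, d))
    phi0 = features(a, B).ravel()
    lr = 1e-3 / (1 + alpha**2)                     # keep full-batch GD stable at large alpha
    for _ in range(steps):
        H = X @ B.T
        act, gate = np.maximum(H, 0), (H > 0).astype(float)
        r = act @ a - y                            # residuals
        a, B = a - lr * act.T @ r, B - lr * ((gate * a) * r[:, None]).T @ X
    phi1 = features(a, B).ravel()
    return 1 - phi0 @ phi1 / (np.linalg.norm(phi0) * np.linalg.norm(phi1))

for alpha in [0.1, 1.0, 10.0]:
    print(alpha, grad_distance(alpha))             # the distance shrinks as alpha grows
```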

In Figures 5(a) and 5(b), we see that also for a non-linear ReLU network, we remain in the kernel regime when the initialization is large, and that exiting the kernel regime is necessary in order to achieve small test error on the synthetic data. Interestingly, when $\alpha = 1$, the models achieve good test error but have a smaller gradient distance, which, not coincidentally, corresponds to using the out-of-the-box Uniform He initialization. This lies on the boundary between the rich and kernel regimes, which is desirable due to the learning vs. optimization tradeoffs discussed in Section 4. On MNIST data, Figure 5(e) shows that previously published successes with training overly wide depth-2 ReLU networks without explicit regularization (e.g. Neyshabur et al., 2014) rely on the initialization being small, i.e. on being outside of the “kernel regime”. In fact, the 2.4% test error reached for large initialization is no better than what can be achieved with a linear model over a random feature map. Turning to a more realistic network, Figure 5(f) shows similar behavior when training a VGG11-like network on CIFAR10.

So far, we have attempted to use the best fixed stepsize for each initialization scale (i.e. the one achieving the best test error). But as demonstrated in Figures 5(c) and 5(d), the stepsize choice can also have a significant effect, with larger stepsizes allowing one to exit the kernel regime even at an initialization scale where a smaller stepsize would remain trapped in it. Further analytic and empirical studies are necessary in order to understand the joint behavior of the stepsize and initialization scale.

Figure 5: (a) Test RMSE vs. scale; (b) gradient distance vs. scale; (c) test RMSE vs. stepsize; (d) gradient distance vs. stepsize; (e) MNIST test error vs. scale; (f) CIFAR10 test error vs. scale. Synthetic data: we generated a small regression training set in $\mathbb{R}^2$ by sampling 10 points uniformly from the unit circle and labelling them with a 1-hidden-layer teacher network with 3 hidden units. We trained overparametrized depth-$D$ ReLU networks with 30 units per layer with the squared loss using full-batch gradient descent and a small stepsize. The weights of the network are set using the Uniform He initialization and then multiplied by $\alpha$. The model is trained until the training loss is essentially zero. Shown in (a) and (b) are the test error and gradient distance vs. the depth-adjusted scale of the initialization, when a small constant stepsize is used. For (c) and (d), we fix $\alpha$ near the transition into the kernel regime and show the test error and gradient distance vs. the stepsize. MNIST: we trained a depth-2 network with 5000 hidden units with the cross-entropy loss using SGD until it reached 100% training accuracy. The stepsizes were optimally tuned for each $\alpha$ individually. In (e), the dashed line shows the test error of the resulting network vs. $\alpha$. We repeated the experiment, but froze the bottom layer and trained only the output layer until convergence; the solid line shows the test error of this predictor vs. $\alpha$. CIFAR10: we trained a VGG11-like deep convolutional network with the cross-entropy loss using SGD and a small stepsize for 2000 epochs; all models reached 100% training accuracy. In (f), the dashed line shows the final test error vs. $\alpha$. We repeated the experiment freezing the bottom 10 layers and training only the output layer; the solid line shows this model's test error. See Appendix E for full details of all of the experiments.

8 Discussion

The main point of this paper is to emphasize the distinction between the “kernel” regime in training overparametrized multi-layered networks and the “rich” (active, adaptive) regime, to show how the scaling of the initialization can transition between them, and to understand this transition in detail. We argue that the rich inductive bias that enables generalization may arise in the rich regime, but that focusing on the kernel regime restricts us to only what can be done with an RKHS. By studying the transition we also see a tension between generalization and optimization, which suggests we would tend to operate just on the edge of the rich regime, and so understanding this transition, rather than just the extremes, is important. Furthermore, we see that at the edge of the rich regime, the implicit bias of gradient descent differs substantively from that of explicit regularization. Although in our theoretical study we focused on a simple model so that we could carry out a complete and exact analysis analytically, our experiments show that this is representative of the behaviour in other homogeneous models as well, and serves as a basis for a more general understanding.

Effect of Width

Our treatment focused on the effect of scale on the transition between the regimes, and we saw that, as pointed out by Chizat and Bach, we can observe a very meaningful transition between a kernel and a rich regime even for finite-width parametric models. The transition becomes even more interesting if the width of the model (the number of units per layer, and so also the number of parameters) increases towards infinity. In this case, we must be careful as to how the initialization of each individual unit scales when the total number of units increases, and which regime we fall into is controlled by the relative scaling of the width and the scale of individual units at initialization. This is demonstrated, for example, in Figure 6, which shows the regime change in matrix factorization problems, from minimum Frobenius norm recovery (the kernel regime) to minimum nuclear norm recovery (the rich regime), as a function of both the number of factors $k$ and the scale of the initialization of each factor. As expected, the scale at which we see the transition decreases as the model becomes wider, but further study is necessary to obtain a complete understanding of this scaling.

A particularly interesting aspect of infinite-width networks is that, unlike for fixed-width networks, it may be possible to scale $\alpha$ relative to the width such that in the infinite-width limit we would have an (asymptotically) unbiased predictor at initialization, or at least a non-exploding one, even with random initialization (without a doubling trick leading to an artificially unbiased initialization), while still being in the kernel regime. For two-layer networks with ReLU activations, Arora et al. (2019b) showed that with a sufficiently large polynomial width the gradient dynamics stay in the kernel regime forever.

Acknowledgements

BW is grateful to be supported by the Google PhD Fellowship Program and the NSF Graduate Research Fellowship Program under award 1754881. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, and the Sloan Research Fellowship. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. The work of DS was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub Foundation. NS is supported by NSF Medium (grant No. NSF-102:1764032) and by NSF BIGDATA (grant No. NSF-104:1546500).

References

Appendix A Proof of Theorem 1

Theorem 1 (restated).

Proof.

The proof involves relating the set of points reachable by gradient flow on $\mathbf{w}$ to the KKT conditions of the minimization problem (5). While it may not be obvious from the expression, $q$ is the integral of an increasing function and is thus convex, and $Q_\alpha$ is the sum of $q$ applied to (rescaled) individual coordinates of $\beta$, and is therefore also convex.

The linear predictor $\beta_\alpha^\infty$ is obtained by applying the map $\mathbf{w} \mapsto \mathbf{w}_+^2 - \mathbf{w}_-^2$ to the limit of the gradient flow dynamics on $\mathbf{w}$. Recalling that $\mathbf{w}_{+,\alpha}(0) = \mathbf{w}_{-,\alpha}(0) = \alpha\mathbf{1}$, the gradient flow (2) on the squared loss reads

$\dot{\mathbf{w}}_+(t) = -2\,\mathbf{w}_+(t) \circ X^\top r(t), \qquad \dot{\mathbf{w}}_-(t) = 2\,\mathbf{w}_-(t) \circ X^\top r(t), \qquad (12)$

where the residual $r(t) = X\beta_\alpha(t) - y$, and $\circ$ denotes the element-wise product of vectors. It is easily confirmed that these dynamics have a solution:

$\mathbf{w}_+(t) = \alpha \exp\!\big(-2 X^\top \bar r(t)\big), \qquad \mathbf{w}_-(t) = \alpha \exp\!\big(2 X^\top \bar r(t)\big), \qquad \bar r(t) := \int_0^t r(s)\, ds, \qquad (13)$

with $\exp$ applied element-wise. This immediately gives an expression for $\beta_\alpha(t) = \mathbf{w}_+(t)^2 - \mathbf{w}_-(t)^2$:

$\beta_\alpha(t) = \alpha^2\Big(\exp\!\big(-4 X^\top \bar r(t)\big) - \exp\!\big(4 X^\top \bar r(t)\big)\Big) \qquad (14)$
$\phantom{\beta_\alpha(t)} = -2\alpha^2 \sinh\!\big(4 X^\top \bar r(t)\big). \qquad (15)$

Understanding the limit $\beta_\alpha^\infty$ exactly requires calculating $\bar r(\infty) = \int_0^\infty r(s)\,ds$, which would be a difficult task. However, for our purposes, it is sufficient to know that there is some $\nu \in \mathbb{R}^n$ such that $\beta_\alpha^\infty = 2\alpha^2\sinh(X^\top\nu)$ (namely $\nu = -4\bar r(\infty)$). In other words, the vector $\beta_\alpha^\infty$ is contained in the non-linear manifold

$\mathcal{M}_\alpha = \big\{ 2\alpha^2 \sinh\!\big(X^\top \nu\big) \,:\, \nu \in \mathbb{R}^n \big\}. \qquad (16)$

Setting this aside for a moment, consider the KKT conditions of the convex program

$\min_{\beta} \; Q_\alpha(\beta) \quad \text{s.t.} \quad X\beta = y, \qquad (17)$

which are

$\nabla Q_\alpha(\beta) = X^\top \nu \ \text{ for some } \nu \in \mathbb{R}^n, \qquad (18)$
$X\beta = y. \qquad (19)$

Expanding $\nabla Q_\alpha$ in (18) coordinate-wise, there must exist $\nu \in \mathbb{R}^n$ such that, for every $i$,

$\big(\nabla Q_\alpha(\beta)\big)_i = q'\!\big(\beta_i/\alpha^2\big) = \operatorname{arcsinh}\!\big(\beta_i/(2\alpha^2)\big) \qquad (20)$
$\operatorname{arcsinh}\!\big(\beta_i/(2\alpha^2)\big) = \big(X^\top\nu\big)_i \qquad (21)$
$\Longleftrightarrow\quad \beta_i = 2\alpha^2 \sinh\!\big((X^\top\nu)_i\big). \qquad (22)$

Since we already know that the gradient flow solution $\beta_\alpha^\infty \in \mathcal{M}_\alpha$, there is some $\nu$ for which $\beta_\alpha^\infty$ satisfies the stationarity condition (18). Furthermore, this problem satisfies the strict saddle property [Ge et al., 2015; Zhao et al., 2019, Lemma 2.1], therefore gradient flow will converge to a zero-error solution, i.e. $X\beta_\alpha^\infty = y$, so (19) holds as well. Thus, we conclude that $\beta_\alpha^\infty$ is a solution to (17). ∎

Appendix B Proof of Theorem 2

Lemma 1.

For any ,

guarantees that

Proof.

First, we show that . Observe that is even because and are odd. Therefore,

(23)
(24)
(25)
(26)

Therefore, we can rewrite

(27)
(28)
(29)
(30)

Using the fact that

(31)

we can bound for

(32)
(33)
(34)
(35)

So, for any , then

(37)
(38)
(39)

On the other hand, using (30) and (31) again,

(40)
(41)

Using the inequality , this can be further lower bounded by

(42)
(43)

Therefore, for any then

(44)

We conclude that for that

(45)

Lemma 2.

Fix any and . Then for any , in the sense that there exist vectors such that