# Learning a Single Neuron with Gradient Methods

## Abstract

We consider the fundamental problem of learning a single neuron in a realizable setting, using standard gradient methods with random initialization, and under general families of input distributions and activations. On the one hand, we show that some assumptions on both the distribution and the activation function are necessary. On the other hand, we prove positive guarantees under mild assumptions, which go significantly beyond those studied in the literature so far. We also point out and study the challenges in further strengthening and generalizing our results.

## 1 Introduction

In recent years, much effort has been devoted to understanding why neural networks are successfully trained with simple, gradient-based methods, despite the inherent non-convexity of the learning problem. However, our understanding of this is still partial at best.

In this paper, we focus on the simplest possible nonlinear neural network, composed of a single neuron, of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}^\top\mathbf{x})$, where $\mathbf{w}$ is the parameter vector and $\sigma$ is some fixed non-linear activation function. Moreover, we consider a realizable setting, where the inputs $\mathbf{x}$ are sampled from some distribution $\mathcal{D}$, the target values are generated by some unknown target neuron $\mathbf{x}\mapsto\sigma(\mathbf{v}^\top\mathbf{x})$ (possibly corrupted by independent zero-mean noise, and where we generally assume $\|\mathbf{v}\|=1$ for simplicity), and we wish to train our neuron with respect to the squared loss. Mathematically, this boils down to minimizing the following objective function:

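$$F(\mathbf{w}) := \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\frac{1}{2}\left(\sigma(\mathbf{w}^\top\mathbf{x})-\sigma(\mathbf{v}^\top\mathbf{x})\right)^2\right]. \tag{1}$$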

For this problem, we are interested in the performance of gradient-based methods, which are the workhorse of modern machine learning systems. These methods initialize $\mathbf{w}$ randomly, and proceed by taking (generally stochastic) gradient steps w.r.t. $F$. If we hope to explain the success of such methods on complicated neural networks, it seems reasonable to expect a satisfying explanation for their convergence on single neurons.

Although the learning of single neurons was studied in a number of papers (see the related work section below for more details), the existing analyses all suffer from one or several limitations: either they apply to a specific distribution $\mathcal{D}$ that is convenient to analyze but not very practical (such as a standard Gaussian distribution); apply to gradient methods only with a specific initialization (rather than a standard random one); require technical conditions on the input distribution which are not generally easy to verify; or require smoothness and strict monotonicity conditions on the activation function (which excludes, for example, the common ReLU function $\sigma(z)=\max\{0,z\}$). However, a bit of experimentation strongly suggests that none of these restrictions is really necessary for standard gradient methods to succeed on this simple problem. Thus, our understanding of this problem is probably still incomplete.

The goal of this paper is to study to what extent the limitations above can be removed, with the following contributions:

• We begin by asking whether positive results are possible without any explicit assumptions on the distribution or the activation (other than, say, bounded support for the former and Lipschitz continuity for the latter). Although this seems reasonable at first glance, we show in Sec. 3 that unfortunately, this is not the case: Even for the ReLU activation function, there are bounded distributions on which gradient descent will fail to optimize Eq. (1) with probability exponentially close to $1$. Moreover, even for $\mathcal{D}$ which is a standard Gaussian, there are Lipschitz activation functions on which gradient methods will likely fail.

• Motivated by the above, we ask whether it is possible to prove positive results with mild and transparent assumptions on the distribution and activation function, which do not exclude common setups. In Sec. 4, we prove a key technical result, which implies that if the distribution is sufficiently “spread” and the activation function satisfies a weak monotonicity condition (satisfied by ReLU and all standard activation functions), then $\langle\nabla F(\mathbf{w}),\mathbf{w}-\mathbf{v}\rangle$ is positive in most of the domain. This implies that an exact gradient step with sufficiently small step size will bring us closer to $\mathbf{v}$ in “most” places. Building on this result, we prove in Sec. 5 a constant-probability convergence guarantee for several variants of gradient methods (gradient descent, stochastic gradient descent, and gradient flow) with random initialization.

• In Sec. 6, we consider more specifically the case where $\mathcal{D}$ is any spherically symmetric distribution (which includes the standard Gaussian as a special case) and the ReLU activation function. In this setting, we show that the convergence results can be made to hold with high probability, due to the fact that the angle between the parameter vector and the target vector monotonically decreases. As we discuss later on, the case of the ReLU function and a standard Gaussian distribution was also considered in [22, 15], but that analysis crucially relied on initialization at the origin and a Gaussian distribution, whereas our results apply to more generic initialization schemes and distributions.

• A natural question arising from these results is whether a high-probability result can be proved for non-spherically symmetric distributions. We study this empirically in Subsection 6.2, and show that perhaps surprisingly, the angle to the target function might increase rather than decrease, already when we consider unit-variance Gaussian distributions with a non-zero mean. This suggests that a fundamentally different approach would be required for a general high-probability guarantee.

Overall, we hope our work contributes to a better understanding of the dynamics of gradient methods on simple neural networks, and suggests some natural avenues for future research.

### 1.1 Related Work

First, we emphasize that learning a single target neuron is not an inherently difficult problem: Indeed, it can be efficiently performed with minimal assumptions, using the Isotron algorithm and its variants (Kalai and Sastry [14], Kakade et al. [13]). Also, other algorithms exist for even more complicated networks or more general settings, under certain assumptions (e.g., Goel et al. [9], Janzamin et al. [11]). However, these are non-standard algorithms, whereas our focus here is on standard, vanilla gradient methods.

For this setting, a positive result was provided in Mei et al. [16], showing that gradient descent on the empirical risk (computed over $n$ inputs sampled i.i.d. from $\mathcal{D}$, with $n$ sufficiently large) successfully yields a good approximation of $\mathbf{v}$. However, the analysis requires $\sigma$ to be strictly monotonic, and to have uniformly bounded derivatives up to the third order. This excludes standard activation functions such as the ReLU, which are neither strictly monotonic nor differentiable. Indeed, assuming that the activation is strictly monotonic makes the analysis much easier, as we show later on in Thm. 3.2. A related analysis under strict monotonicity conditions is provided in Oymak and Soltanolkotabi [17].

For the specific case of a ReLU activation function and a standard Gaussian input distribution, Tian [26] proved that with constant probability, gradient flow over Eq. (1) will asymptotically converge to the global minimum. Soltanolkotabi [22] and Kalan et al. [15] considered a similar setting, and proved a non-asymptotic convergence guarantee for gradient descent or stochastic gradient descent on the empirical risk function. However, that analysis crucially relied on initialization at precisely $\mathbf{w}=\mathbf{0}$, as well as a certain assumption on how the derivative of the ReLU function is computed at $0$. In more detail, we impose the convention that even though the ReLU function is not differentiable at $0$, we take $\sigma'(0)$ to be some fixed positive number, and the gradient of the population objective at $\mathbf{w}=\mathbf{0}$ to be

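$$\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\left(\sigma(0)-\sigma(\mathbf{v}^\top\mathbf{x})\right)\sigma'(0)\,\mathbf{x}\right] = -\sigma'(0)\cdot\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\sigma(\mathbf{v}^\top\mathbf{x})\,\mathbf{x}\right].$$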

Assuming $\sigma'(0)>0$, we get that the gradient is non-zero and proportional to $\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}[\sigma(\mathbf{v}^\top\mathbf{x})\,\mathbf{x}]$. For a Gaussian distribution (and more generally, spherically symmetric distributions), this turns out to be proportional to $\mathbf{v}$, so that an exact gradient step from $\mathbf{w}=\mathbf{0}$ will lead us precisely in the direction of the target parameter vector $\mathbf{v}$. As a result, if we calculate a sufficiently precise approximation of this direction from a random sample, we can get arbitrarily close to $\mathbf{v}$ in a single iteration (see Kalan et al. [15, Remark 1] for a discussion of this). Unfortunately, this unique behavior is specific to initialization at $\mathbf{0}$ with a certain convention about $\sigma'(0)$ (note that even locally around $\mathbf{0}$, the gradient may not approximate $\nabla F(\mathbf{0})$, since it is generally discontinuous around $\mathbf{0}$). Thus, although the analysis is important and insightful, it is difficult to apply more generally.
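The proportionality claim above can be checked numerically. The following is a minimal sketch, assuming a standard Gaussian input distribution, the convention $\sigma'(0)=1$, and a Monte Carlo estimate in place of the exact expectation; the function name `gradient_at_origin` and all parameter values are illustrative choices, not part of the cited analyses.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gradient_at_origin(v, n_samples=200_000, sigma_prime_zero=1.0, seed=0):
    """Monte Carlo estimate of -sigma'(0) * E[relu(v^T x) x] for x ~ N(0, I).

    sigma_prime_zero is the (assumed) convention for the ReLU derivative at 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, v.shape[0]))
    return -sigma_prime_zero * (relu(x @ v)[:, None] * x).mean(axis=0)

d = 10
v = np.ones(d) / np.sqrt(d)          # an arbitrary unit-norm target vector
g = gradient_at_origin(v)

# Cosine of the angle between the negative gradient and v.
cosine = float(-g @ v / np.linalg.norm(g))
print(f"cosine between -gradient and v: {cosine:.4f}")
```

With enough samples the cosine should be very close to 1, which is the sense in which a single step from the origin points toward the target.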

Du et al. [6] considered conditions under which a single ReLU convolutional filter is learnable with gradient methods, a special case of which is a single ReLU neuron. The paper is closely related to our work, in the sense that they were also motivated by finding general conditions under which positive results are attainable. Moreover, some of the techniques they employed share similarities with ours (e.g., considering the gradient correlation as in Sec. 4). However, our results differ in several aspects: First, they consider only the ReLU activation function, while we also consider general activations. Second, their results assume a technical condition on the eigenvalues of certain distribution-dependent matrices, with the convergence rate depending on these eigenvalues. However, the question of when this condition might hold (for general distributions) is left unclear. In contrast, our assumptions are more transparent and have a clear geometric intuition. Third, their results hold with constant probability, even for a standard Gaussian distribution, while we employ a different analysis to prove high probability guarantees for general spherically symmetric distributions. Finally, we also provide negative results, showing the necessity of assumptions on both the activation function and the input distribution, as well as suggesting which approaches might not work for further generalizing our results.

A line of recent works established the effectiveness of gradient methods in solving non-convex optimization problems with a strict saddle property, which implies that all near-stationary points with nearly positive definite Hessians are close to global minima (see Jin et al. [12], Ge et al. [7], Sun et al. [23]). A relevant example is phase retrieval, which actually fits our setting with $\sigma$ being the quadratic function $\sigma(z)=z^2$ (Sun et al. [24]). However, these results can only be applied to smooth problems, where the objective function is twice differentiable with Lipschitz-continuous Hessians (excluding, for example, problems involving the ReLU activation function). An interesting recent exception is the work of Tan and Vershynin [25], which considered the case $\sigma(z)=|z|$. However, their results are specific to that activation, and assume a specific input distribution (uniform on a scaled origin-centered sphere). In contrast, our focus here is on more general families of distributions and activations.

Brutzkus and Globerson [3] show that gradient descent learns a simple convolutional network with non-overlapping patches, when the inputs have a standard Gaussian distribution. Similar to the analysis in Sec. 6 in our paper, they rely on showing that the angle between the learned parameter vector and a target vector monotonically decreases with gradient methods. However, the network architecture studied is different from ours, and their proof heavily relies on the symmetry of the Gaussian distribution.

Less directly related to our setting, a popular line of recent works showed how gradient methods on highly over-parameterized neural networks can learn various target functions in polynomial time (e.g., Allen-Zhu et al. [1], Daniely [5], Arora et al. [2], Cao and Gu [4]). However, as pointed out in Yehudai and Shamir [27], this type of analysis cannot be used to explain learnability of single neurons.

## 2 Preliminaries

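Notation. We use bold-faced letters to denote vectors. For a vector $\mathbf{x}$, we let $x_i$ denote its $i$-th coordinate. We denote $[z]_+ := \max\{0,z\}$ to be the ReLU function. For a vector $\mathbf{w}$, we let $\bar{\mathbf{w}} := \mathbf{w}/\|\mathbf{w}\|$, and by $\mathbf{1}$ we denote the all-ones vector $(1,\ldots,1)$. Given vectors $\mathbf{w},\mathbf{v}$, we let $\theta(\mathbf{w},\mathbf{v}) := \arccos\left(\frac{\langle\mathbf{w},\mathbf{v}\rangle}{\|\mathbf{w}\|\,\|\mathbf{v}\|}\right)\in[0,\pi]$ denote the angle between $\mathbf{w}$ and $\mathbf{v}$. We use $\Pr$ to denote probability. $\mathbb{1}(\cdot)$ denotes the indicator function, for example $\mathbb{1}(x>5)$ equals $1$ if $x>5$ and $0$ otherwise.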

Target Neuron. Unless stated otherwise, we assume that the target vector $\mathbf{v}$ in Eq. (1) is unit norm, $\|\mathbf{v}\|=1$.

Gradients. When $\sigma$ is differentiable, the gradient of the objective function in Eq. (1) is

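$$\nabla F(\mathbf{w}) = \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\left(\sigma(\mathbf{w}^\top\mathbf{x})-\sigma(\mathbf{v}^\top\mathbf{x})\right)\cdot\sigma'(\mathbf{w}^\top\mathbf{x})\,\mathbf{x}\right]. \tag{2}$$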

When $\sigma$ is not differentiable, we will still assume that it is differentiable almost everywhere (up to a finite number of points), and that at every point of non-differentiability $z$, there are well-defined left and right derivatives. In that case, practical implementations of gradient methods fix $\sigma'(z)$ to be some number between its left and right derivatives (for example, for the ReLU function, $\sigma'(0)$ is defined as some number in $[0,1]$). Following that convention, the expected gradient used by these methods still corresponds to Eq. (2), and we will follow the same convention here.

Algorithms. In our paper, we focus on the following three standard gradient methods:

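• Gradient Descent: We initialize at some $\mathbf{w}_0\in\mathbb{R}^d$ and set a fixed learning rate $\eta>0$. At each iteration $t\ge 0$, we do a single step in the negative direction of the gradient: $\mathbf{w}_{t+1}=\mathbf{w}_t-\eta\nabla F(\mathbf{w}_t)$.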

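• Stochastic Gradient Descent (SGD): We initialize at some $\mathbf{w}_0\in\mathbb{R}^d$ and set a fixed learning rate $\eta>0$. At each iteration $t\ge 0$, we sample an input $\mathbf{x}_t\sim\mathcal{D}$, and calculate a stochastic gradient:

$$\mathbf{g}_t = \left(\sigma(\mathbf{w}_t^\top\mathbf{x}_t)-\sigma(\mathbf{v}^\top\mathbf{x}_t)\right)\cdot\sigma'(\mathbf{w}_t^\top\mathbf{x}_t)\,\mathbf{x}_t \tag{3}$$

and do a single step in the negative direction of the stochastic gradient: $\mathbf{w}_{t+1}=\mathbf{w}_t-\eta\,\mathbf{g}_t$. Note that here we consider SGD on the population loss, which is different from SGD on a fixed training set. We also note that our proof techniques easily extend to mini-batch SGD, where $\mathbf{g}_t$ is taken to be the average of several stochastic gradients w.r.t. inputs sampled i.i.d. from $\mathcal{D}$. However, for simplicity we will focus on the single-sample case.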

• Gradient Flow: We initialize at some $\mathbf{w}(0)\in\mathbb{R}^d$, and for every $t>0$, we set $\mathbf{w}(t)$ to be the solution of the differential equation $\dot{\mathbf{w}}(t)=-\nabla F(\mathbf{w}(t))$. This can be thought of as a continuous form of gradient descent, where we consider an infinitesimal learning rate. We note that strictly speaking, gradient flow is not an algorithm. However, it approximates the behavior of gradient descent in many cases, and has the advantage that its analysis is often simpler.
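The first two methods can be sketched in a few lines. This is an illustrative sketch only, under several assumptions that are not part of the paper: the activation is the ReLU with the convention $\sigma'(0)=0$, the inputs are standard Gaussian, and the exact population gradient used by gradient descent is approximated by an average over a large fixed pool of samples. All names (`neuron_gradient`, `x_pool`, and so on), step sizes, and the starting point are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Convention for the non-differentiable point: sigma'(0) = 0,
    # one admissible choice in [0, 1].
    return (z > 0).astype(float)

def neuron_gradient(w, v, x):
    """Average of the stochastic gradient in Eq. (3) over the rows of x."""
    pre_w = x @ w
    residual = relu(pre_w) - relu(x @ v)
    return ((residual * relu_grad(pre_w))[:, None] * x).mean(axis=0)

def gradient_descent(w0, v, x_pool, eta, steps):
    # The population gradient is approximated by the average over x_pool.
    w = w0.copy()
    for _ in range(steps):
        w = w - eta * neuron_gradient(w, v, x_pool)
    return w

def sgd(w0, v, eta, steps, rng):
    # A fresh input is drawn at every step (SGD on the population loss).
    w = w0.copy()
    for _ in range(steps):
        x_t = rng.standard_normal((1, w.shape[0]))
        w = w - eta * neuron_gradient(w, v, x_t)
    return w

rng = np.random.default_rng(0)
d = 5
v = np.zeros(d)
v[0] = 1.0                           # unit-norm target
w0 = np.zeros(d)
w0[0], w0[1] = 0.5, 0.3              # hand-picked start with ||w0 - v|| < 1
x_pool = rng.standard_normal((20_000, d))

w_gd = gradient_descent(w0, v, x_pool, eta=0.5, steps=200)
w_sgd = sgd(w0, v, eta=0.01, steps=5_000, rng=rng)
err_gd = np.linalg.norm(w_gd - v)
err_sgd = np.linalg.norm(w_sgd - v)
print(err_gd, err_sgd)
```

Gradient flow is not simulated here; gradient descent with a very small `eta` can serve as a rough discretization of it.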

## 3 Assumptions on the Distribution and Activation are Necessary

The main concern of this paper is under what assumptions a single neuron can be provably learned with gradient methods. In this section, we show that perhaps surprisingly, this is not possible unless we make non-trivial assumptions on both the input distribution and the activation function.

### 3.1 Assumptions on the Input Distribution are Necessary

We begin by asking whether Eq. (1) can be minimized by gradient methods in a distribution-free manner (with no assumptions beyond, say, bounded support), as in learning problems where the population objective is convex. Perhaps surprisingly, we show that the answer is negative, even if we consider specifically the ReLU activation, and a distribution supported on the unit Euclidean ball. This is based on the following key result:

###### Theorem 3.1.

Suppose that is the ReLU function (with the convention that ), and assume that is sampled from a product distribution (namely, each is sampled independently from some distribution ). Then there exists a distribution over the inputs, supported on , and with such that the following holds: With probability at least over the initialization point sampled from , if we run gradient flow, gradient descent or stochastic gradient descent, then for every we have (for gradient flow ).

###### Proof.

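For each distribution $\mathcal{D}_i$, let $p_i = \Pr_{w\sim\mathcal{D}_i}(w\le 0)$. We define the following dataset:

$$S=\{\mathbf{x}_i = b_i\mathbf{e}_i : i=1,\ldots,d\}$$

where $\mathbf{e}_i$ is the standard $i$-th unit vector, and $b_i=1$ if $p_i\ge\frac{1}{2}$ and $b_i=-1$ otherwise. Take $\mathcal{D}$ to be the uniform distribution on $S$.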

Informally, the proof idea is the following: With overwhelming probability, we will initialize at a point $\mathbf{w}$ such that for at least $d/4$ coordinates $i$, it holds that $\mathbf{w}^\top\mathbf{x}_i\le 0$, and as a result, $\nabla F(\mathbf{w})$ is zero on those coordinates. Based on this, we show that these coordinates will not change from their initialized values. However, a point with $d/4$ coordinates with this property is suboptimal by a fixed factor, so the algorithm does not converge to an optimal solution.

More formally, using Eq. (2) and the fact that $\sigma$ is the ReLU function, we get

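$$\nabla F(\mathbf{w}) = \frac{1}{d}\sum_{i=1}^{d}\left(\sigma(\mathbf{w}^\top\mathbf{x}_i)-\sigma(\mathbf{v}^\top\mathbf{x}_i)\right)\cdot\mathbb{1}(\mathbf{w}^\top\mathbf{x}_i>0)\,\mathbf{x}_i.$$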

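In particular, for every index $i$ for which $\mathbf{w}^\top\mathbf{x}_i\le 0$ we have that $(\nabla F(\mathbf{w}))_i=0$. Next, we define $\mathbf{v}$ with $v_i=\frac{b_i}{\sqrt{d}}$ (note that $\|\mathbf{v}\|=1$). For every $d/4$ indices $i_1,\ldots,i_{d/4}$ for which $\mathbf{w}^\top\mathbf{x}_{i_j}\le 0$ we have that: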

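$$\begin{aligned} F(\mathbf{w}) &= \frac{1}{2d}\sum_{i=1}^{d}\left(\sigma(\mathbf{w}^\top\mathbf{x}_i)-\sigma(\mathbf{v}^\top\mathbf{x}_i)\right)^2 \ge \frac{1}{2d}\sum_{i\in\{i_1,\ldots,i_{d/4}\}}\left(\sigma(\mathbf{w}^\top\mathbf{x}_i)-\sigma(\mathbf{v}^\top\mathbf{x}_i)\right)^2 \\ &= \frac{1}{2d}\sum_{i\in\{i_1,\ldots,i_{d/4}\}}\sigma(\mathbf{v}^\top\mathbf{x}_i)^2 = \frac{1}{2d}\sum_{i\in\{i_1,\ldots,i_{d/4}\}}\sigma\!\left(b_i^2\frac{1}{\sqrt{d}}\right)^2 = \frac{1}{8d} \end{aligned} \tag{4}$$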

Denote the random variable and (for gradient flow we denote ). It is easily verified that . We have that are independent, , and . Using Hoeffding’s inequality, we get that w.p it holds that , which means that there are at least indices such that . We condition on this event and let these indices be . We will now show that for every index , using gradient methods will not change the -th coordinate of ( for gradient flow) from its initial value. Let be such a coordinate.

For gradient descent, we will show by induction that for every iteration we have that . The base case is true, because we conditioned on this event. Assume for , then , which means that , and in particular . This proves that for every iteration , the -th coordinate of is zero, which mean that .

For stochastic gradient descent, at each iteration we sample , and define the stochastic gradient as in Eq. (3). If then hence , otherwise, if then by and by the same induction argument as in gradient descent we have that . In both cases the -th coordinate of the stochastic gradient is zero, hence .

For gradient flow, assume on the way of contradiction that for some that and let be the first time that this happen. Then for all we have that , and in particular . Hence for all running gradient flow we get , and in particular , a contradiction to the fact that is continuous. Thus for all we showed that , hence which shows that .

By the conditioned event, Eq. (4) applies at initialization. Since in all the gradient methods above the $i_j$-th coordinate of $\mathbf{w}$ did not change from its initial value for $j=1,\ldots,d/4$, we can apply Eq. (4) to get that for every iteration $t$ for gradient descent or SGD we have that $F(\mathbf{w}_t)\ge\frac{1}{8d}$ (and for gradient flow, for every time $t$, we have $F(\mathbf{w}(t))\ge\frac{1}{8d}$).

We end by noting that although the distribution defined here is discrete over a finite dataset, the same argument can also be made for a non-discrete distribution, by considering a mixture of smooth distributions concentrated around the support points of the discrete distribution above. ∎
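The construction in the proof can be simulated directly. The sketch below makes several assumptions for illustration: the initialization is a zero-mean Gaussian product distribution (so that each coordinate is non-positive with probability $1/2$ and the sign choice $b_i=1$ makes $\mathbf{w}_0^\top\mathbf{x}_i\le 0$ with probability $1/2$), the ReLU derivative at $0$ is taken to be $0$, and the dimension, step size and number of iterations are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

d = 200
rng = np.random.default_rng(0)
w0 = rng.standard_normal(d) / np.sqrt(d)   # assumed product initialization

b = np.ones(d)                 # sign choice suited to a symmetric initialization
X = np.diag(b)                 # row i is the data point x_i = b_i * e_i
v = b / np.sqrt(d)             # unit-norm target with v^T x_i = 1 / sqrt(d)

def objective(w):
    return 0.5 * np.mean((relu(X @ w) - relu(X @ v)) ** 2)

def gradient(w):
    pre = X @ w
    # The indicator (pre > 0) zeroes out every coordinate with w^T x_i <= 0.
    return ((relu(pre) - relu(X @ v)) * (pre > 0)) @ X / d

frozen = (X @ w0) <= 0         # coordinates that receive zero gradient at init
w = w0.copy()
for _ in range(100):
    w = w - 0.5 * d * gradient(w)

print("frozen coordinates:", int(frozen.sum()), "of", d)
print("final objective:", objective(w), " lower bound 1/(8d):", 1 / (8 * d))
```

The frozen coordinates stay at their initial values throughout the run, so the objective cannot drop below the value contributed by those coordinates.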

The theorem above applies to any product initialization scheme, which includes most standard initializations used in practice (e.g., the standard Xavier initialization [8]). The theorem implies that it is impossible to prove positive guarantees in our setting without distributional assumptions on the inputs. Inspecting the construction, the source of the problem (at least for the ReLU neuron) appears to be the fact that the distribution is supported on a small number of well-separated regions. Thus, in our positive results, we will assume that the distribution is sufficiently “spread”, as formalized later on in Sec. 4.

### 3.2 Assumptions on the Activation Function

We now turn to discuss the activation function, explaining why even if the activation is Lipschitz and the input distribution is a standard Gaussian, this is likely insufficient for positive guarantees in our setting.

In particular, let us consider the case that $\sigma$ is a Lipschitz periodic function. Then a result in [21] implies that for a large family of input distributions on $\mathbb{R}^d$ (including a standard Gaussian), if we assume that the vector $\mathbf{v}$ in the target neuron is a uniformly distributed unit vector, then for any fixed $\mathbf{w}$,

$$\mathrm{Var}_{\mathbf{v}}\left(\nabla F(\mathbf{w})\right) \le O(\exp(-d)).$$

This implies that the gradient at $\mathbf{w}$ is virtually independent of the underlying target vector $\mathbf{v}$: In fact, it is extremely concentrated around a fixed value which does not depend on $\mathbf{v}$. Theorem 4 from [21] goes further and shows that for any gradient method, even an exponentially small amount of noise will be enough to make its trajectory (over a bounded number of iterations) independent of $\mathbf{v}$, in which case it cannot possibly succeed in this setting. We note that their result is even more general, as they consider a more general class of target functions, so our setting can be seen as a special case.

When considering a standard Gaussian distribution, the above argument can be easily extended to activations which are periodic only in a segment of length around the origin. This can be seen by extending the activation to which is periodic on , applying the above argument to it, and noting that the probability mass outside of a ball of radius is exponentially small (for example, see [27] Proposition 4.2, where they consider an activation which is a finite sum of ReLU functions and periodic in a segment of length ).

The above discussion motivates us to impose some condition on the activation function which excludes periodic functions. One such mild assumption, which we will adopt in the rest of the paper (and corresponds to virtually all activations used in practice), is that the activation is monotonically non-decreasing. Before continuing, we remark that by assuming a slight strengthening of this assumption, namely that the function is strictly monotonically increasing, it is easy to prove a positive guarantee, as evidenced by Thm. 3.2. However, this excludes popular activations such as the ReLU function.

###### Theorem 3.2.

Assume for some , and the following for some :

• is positive definite with minimal eigenvalue

•  .

Then starting from any point , after doing iterations of gradient descent with learning rate , we have that:

$$\|\mathbf{w}_t-\mathbf{v}\|^2 \le \|\mathbf{w}_0-\mathbf{v}\|^2\left(1-\lambda\gamma^2\eta\right)^t.$$

The proof can be found in Appendix A, and can be easily generalized to apply also to gradient flow and SGD. The above shows that if we assume strict monotonicity of the activation, then under very mild assumptions on the data, $\mathbf{w}_t$ will converge exponentially fast to $\mathbf{v}$. In the rest of the paper, however, we focus on results which only require weak monotonicity.

## 4 Under Mild Assumptions, the Gradient Points in a Good Direction

Motivated by the results in Sec. 3, we use the following assumptions on the distribution and activation:

###### Assumption 4.1.

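The following holds for some fixed $\alpha,\beta,\gamma>0$:

1. The distribution $\mathcal{D}$ satisfies the following: For any vector $\mathbf{w}\neq\mathbf{v}$, let $\mathcal{D}_{\mathbf{w},\mathbf{v}}$ denote the marginal distribution of $\mathbf{x}$ on the subspace spanned by $\mathbf{w},\mathbf{v}$ (as a distribution over $\mathbb{R}^2$). Then any such distribution has a density function $p_{\mathbf{w},\mathbf{v}}(\mathbf{y})$ such that $\inf_{\mathbf{y}:\|\mathbf{y}\|\le\alpha}p_{\mathbf{w},\mathbf{v}}(\mathbf{y})\ge\beta$.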

2. $\sigma$ is monotonically non-decreasing, and satisfies $\inf_{0<z<\alpha}\sigma'(z)\ge\gamma$.

The distributional assumption is such that in every $2$-dimensional subspace, the marginal distribution is sufficiently “spread” in any direction close to the origin. For example, for a standard Gaussian distribution, this is true for constants $\alpha,\beta$ that do not depend on the dimension (as the marginal distribution of a standard Gaussian on the subspace is a standard $2$-dimensional Gaussian). Also, for any distribution, it can be made to hold by mixing it with a bit of a Gaussian or uniform distribution if possible. The assumption on the activation function is very mild, and covers most activations used in practice such as ReLU and ReLU-like functions (e.g. leaky-ReLU, Softplus), as well as standard sigmoidal activations (for which the derivative in any bounded interval is lower bounded by a positive constant).

With these assumptions, we prove the following key technical result, which implies that the gradient of the objective has a positive correlation with the direction of the global minimum (at $\mathbf{v}$), if the angle between $\mathbf{w}$ and $\mathbf{v}$ and the norm of $\mathbf{w}$ are not too large:

###### Theorem 4.2.

Under Assumption 4.1, for any $\mathbf{w}$ such that $\|\mathbf{w}\|\le 2$ and $\theta(\mathbf{w},\mathbf{v})\le\pi-\delta$ for some $\delta\in(0,\pi]$, it holds that

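$$\langle\nabla F(\mathbf{w}),\mathbf{w}-\mathbf{v}\rangle \ge \frac{\alpha^4\beta\gamma^2}{8\sqrt{2}}\sin^3\!\left(\frac{\delta}{4}\right)\|\mathbf{w}-\mathbf{v}\|^2.$$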

The theorem implies that for suitable values of $\mathbf{w}$, gradient methods (which move in the negative gradient direction) will decrease the distance from $\mathbf{v}$. When this behavior occurs, it is easy to show that gradient methods succeed in learning the target neuron, like in the previous Thm. 3.2 for the strictly monotonic case. The main challenge is to guarantee that the trajectory of the algorithm will indeed never violate the theorem’s conditions, in particular that the angle between $\mathbf{w}$ and $\mathbf{v}$ indeed remains bounded away from $\pi$ (and in fact, later on we will show that such a guarantee is not always possible).

The formal proof of the theorem can be found in Appendix B, but its intuition can be described as follows: we want to bound below the term

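$$\langle\nabla F(\mathbf{w}),\mathbf{w}-\mathbf{v}\rangle = \mathbb{E}_{\mathbf{x}}\left[\left(\sigma(\mathbf{w}^\top\mathbf{x})-\sigma(\mathbf{v}^\top\mathbf{x})\right)\cdot\sigma'(\mathbf{w}^\top\mathbf{x})\cdot\left(\mathbf{w}^\top\mathbf{x}-\mathbf{v}^\top\mathbf{x}\right)\right].$$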

Note that:

1. Using the assumption on , the term inside the above expectation is nonnegative for every . This is because , and for any monotonically non-decreasing function we have . Thus, viewing the expectation as an integral over a nonnegative function, we can lower bound it by taking the integral over the smaller set . Note that on this set, and .

2. The resulting integral depends only on dot products of $\mathbf{x}$ with $\mathbf{w}$ and $\mathbf{v}$. Thus, it is enough to consider the marginal distribution on the $2$-dimensional plane spanned by $\mathbf{w}$ and $\mathbf{v}$.

3. By the assumption on the distribution, the density function of this marginal distribution is always at least $\beta$ on any $\mathbf{y}$ such that $\|\mathbf{y}\|\le\alpha$. This means we can lower bound the integral above by integrating over $\{\mathbf{y}:\|\mathbf{y}\|\le\alpha\}$ with a uniform distribution on this set and multiplying by $\beta$.

In total, the expression above can be lower bounded by a certain $2$-dimensional integral (with uniform measure and with no $\sigma$ terms) on the set

$$\left\{\mathbf{y}\in\mathbb{R}^2 : \hat{\mathbf{w}}^\top\mathbf{y}>0,\ \hat{\mathbf{v}}^\top\mathbf{y}>0,\ \|\mathbf{y}\|\le\alpha\right\}$$

where $\hat{\mathbf{w}},\hat{\mathbf{v}}$ are the $2$-dimensional vectors representing $\mathbf{w},\mathbf{v}$ on the $2$-dimensional plane spanned by them. We lower bound this integral by a term that depends on the angle between them.
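The pointwise nonnegativity used in the first step of this argument is easy to observe numerically. The following sketch assumes the ReLU activation (with $\sigma'(0)=0$), standard Gaussian inputs, and an arbitrary vector $\mathbf{w}$ of norm $1.5$; the helper `correlation_terms` and the printed ratio are illustrative and say nothing about the constants in Thm. 4.2.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(float)

def correlation_terms(w, v, x):
    """Per-sample terms whose expectation is <grad F(w), w - v>."""
    pre_w, pre_v = x @ w, x @ v
    return (relu(pre_w) - relu(pre_v)) * relu_grad(pre_w) * (pre_w - pre_v)

rng = np.random.default_rng(0)
d = 20
v = np.zeros(d)
v[0] = 1.0
w = rng.standard_normal(d)
w *= 1.5 / np.linalg.norm(w)             # arbitrary point with norm 1.5

x = rng.standard_normal((50_000, d))
terms = correlation_terms(w, v, x)

# Empirical correlation, normalized by the squared distance to the target.
ratio = terms.mean() / np.linalg.norm(w - v) ** 2
print("smallest per-sample term:", terms.min())
print("estimated <grad F(w), w - v> / ||w - v||^2:", ratio)
```

Every per-sample term is nonnegative because the activation is monotonically non-decreasing, so the Monte Carlo average can only be pulled down to zero, never below it.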

###### Remark 4.3 (Implication on Optimization Landscape).

The proof of the theorem can be shown to imply that for the ReLU activation, under the theorem’s conditions, the only stationary point that is not the global minimum must be at the origin. In particular, the proof implies that any stationary point (with ) must be along the ray . For the ReLU activation (which satisfies for any and ), the gradient at such points equals

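$$\nabla F(-a\cdot\mathbf{v}) = \mathbb{E}_{\mathbf{x}}\left[\left(\sigma(-a\mathbf{v}^\top\mathbf{x})-\sigma(\mathbf{v}^\top\mathbf{x})\right)\sigma'(-a\mathbf{v}^\top\mathbf{x})\,\mathbf{x}\right] = \mathbb{E}_{\mathbf{x}}\left[(-a\mathbf{v}^\top\mathbf{x})\,\sigma'(-a\mathbf{v}^\top\mathbf{x})\,\mathbf{x}\right].$$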

In particular,

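$$\langle\nabla F(-a\cdot\mathbf{v}),\mathbf{v}\rangle = -a\cdot\mathbb{E}_{\mathbf{x}}\left[\sigma'(-a\mathbf{v}^\top\mathbf{x})\,(\mathbf{v}^\top\mathbf{x})^2\right].$$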

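This implies that $\nabla F(-a\cdot\mathbf{v})$ might be zero only if either $a=0$ (i.e., at the origin), or $\sigma'(-a\mathbf{v}^\top\mathbf{x})(\mathbf{v}^\top\mathbf{x})^2=0$ with probability $1$, which cannot happen according to Assumption 4.1.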

## 5 Convergence with Constant Probability Under Mild Assumptions

In this section, we use Thm. 4.2 in order to show that under some assumption on the initialization of $\mathbf{w}$, gradient methods will be able to learn a single neuron with probability at least (close to) $\frac{1}{2}$. Note that the loss surface of $F$ is not convex, and as explained in Remark 4.3, there may be a stationary point at $\mathbf{w}=\mathbf{0}$. This stationary point can cause difficulties, as it is not obvious how to control the angle between $\mathbf{w}$ and $\mathbf{v}$ close to the origin (which is required for Thm. 4.2 to apply). But, if we assume $\|\mathbf{w}-\mathbf{v}\|<1$ at initialization, then we are bounded away from the origin, and we can ensure that it will remain that way throughout the optimization process. One such initialization, which guarantees this with at least constant probability, is a zero-mean Gaussian initialization with small enough variance:

###### Lemma 5.1.

Assume . If we sample for then w.p we have that

In order to bound each gradient step we will need these additional assumptions:

###### Assumption 5.2.

The following holds for some positive :

1. almost surely over

2. for all

###### Theorem 5.3.

Under Assumptions 4.1 and 5.2 we have:

1. (Gradient Flow) Assume that . Running gradient flow, then for every time we have

$$\|\mathbf{w}(t)-\mathbf{v}\|^2 \le \|\mathbf{w}(0)-\mathbf{v}\|^2\exp(-t\lambda)$$

where .

2. (Gradient Descent) Assume that . Let for and . Running gradient descent with step size , we have that for every , after iterations:

$$\|\mathbf{w}_T-\mathbf{v}\|^2 \le \|\mathbf{w}_0-\mathbf{v}\|^2\left(1-\frac{\eta\lambda}{2}\right)^T$$

3. (Stochastic Gradient Descent) Let , and assume that . Let where and . Then w.p , after iterations we have that:

$$\|\mathbf{w}_T-\mathbf{v}\|^2 \le \epsilon^2$$

Combined with Lemma 5.1, Thm. 5.3 shows that with proper initialization, gradient flow, gradient descent as well as stochastic gradient descent successfully minimize Eq. (1) with probability (close to) , and for the first two algorithms, the distance to decays exponentially fast.
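To see how a gradient-correlation bound of the kind in Thm. 4.2 turns into a per-step contraction for gradient descent, the following worked step may help. It is a sketch under two assumptions that are introduced here only for illustration: that $\langle\nabla F(\mathbf{w}_t),\mathbf{w}_t-\mathbf{v}\rangle\ge\lambda\|\mathbf{w}_t-\mathbf{v}\|^2$ holds at the current iterate, and that $\|\nabla F(\mathbf{w}_t)\|\le G\|\mathbf{w}_t-\mathbf{v}\|$ for some hypothetical constant $G>0$. The constants in the actual theorem may differ.

```latex
\begin{aligned}
\|\mathbf{w}_{t+1}-\mathbf{v}\|^2
  &= \|\mathbf{w}_t-\eta\nabla F(\mathbf{w}_t)-\mathbf{v}\|^2 \\
  &= \|\mathbf{w}_t-\mathbf{v}\|^2
     - 2\eta\,\langle\nabla F(\mathbf{w}_t),\mathbf{w}_t-\mathbf{v}\rangle
     + \eta^2\|\nabla F(\mathbf{w}_t)\|^2 \\
  &\le \left(1-2\eta\lambda+\eta^2G^2\right)\|\mathbf{w}_t-\mathbf{v}\|^2 \\
  &\le \left(1-\eta\lambda\right)\|\mathbf{w}_t-\mathbf{v}\|^2
     \qquad\text{whenever } \eta\le\lambda/G^2 .
\end{aligned}
```

In particular the distance to $\mathbf{v}$ does not increase, so an iterate that starts close enough to $\mathbf{v}$ stays in that region, and unrolling the recursion over $T$ steps gives geometric decay.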

The full proof of the theorem can be found in Appendix C, and its intuition for gradient flow and gradient descent is as described above (namely, that if $\|\mathbf{w}-\mathbf{v}\|<1$, it will stay that way and $\|\mathbf{w}-\mathbf{v}\|$ will just continue to shrink over time, using Thm. 4.2). The proof for stochastic gradient descent is much more delicate. This is because the update at each iteration is noisy, so we need to ensure we remain in the region where Thm. 4.2 is applicable. Here we give a short proof intuition:

1. Assume we initialized with for some . In order for the analysis to work we need that throughout the algorithm’s run. Thus, we show (using a maximal version of Azuma’s inequality) that if is small enough (depending on ), and we take at most gradient steps then w.h.p for every :

2. The next step is to show that if , then for an appropriate . This is done using Thm. 4.2, as in the gradient descent case, but note that here this only holds in expectation over the sample selected at iteration .

3. Next, we use Azuma’s inequality again on iterations for a small enough , to show that w.h.p does not move too far away from where the expectation is taken over . Also, we show that after iterations for a constant smaller than . This shows that w.h.p., after a single epoch of iterations, shrinks by a constant factor.

4. We then repeat this analysis across epochs (each consisting of iterations), and use a union bound. Overall, we get that after sufficiently many iterations, with high probability, the iterates get as close as we want to zero.

We note the optimization analysis for stochastic gradient descent is inspired by the analysis in [20] for the different non-convex problem of principal component analysis (PCA), which also attempts to avoid a problematic stationary point. An interesting question for future research is to understand to what extent the polynomial dependencies in the problem parameters can be improved.

###### Remark 5.4.

Our assumption on the data that $\|\mathbf{x}\|$ is bounded almost surely is made for simplicity. For the gradient descent case, it is easy to verify that the proof only requires that the fourth moment of the data is bounded by some constant, which ensures that the gradients of the objective function used by the algorithm are bounded. For SGD it is enough to assume that the input distribution is sub-Gaussian. The proof proceeds in the same manner, by using a variant of Azuma’s inequality for martingales with sub-Gaussian tails, e.g. [19].

## 6 High-Probability Convergence

The results in the previous section hold under mild conditions, but unfortunately only guarantee a constant probability of success. In this section, we consider the possibility of proving guarantees which hold with high probability (arbitrarily close to $1$). On the one hand, in Subsection 6.1, we provide such a result for the ReLU activation, assuming the input distribution is spherically symmetric. On the other hand, in Subsection 6.2, we point out non-trivial obstacles to extending such a result to non-spherically symmetric distributions. Overall, we believe that getting high-probability convergence guarantees for non-spherically symmetric distributions is an interesting avenue for future research.

### 6.1 Convergence for Spherically Symmetric Distributions

In this subsection, we make the following assumptions:

###### Assumption 6.1.

Assume that:

1. has a spherically symmetric distribution. That is, for every orthogonal matrix :

2. The activation function is the standard ReLU function $\sigma(z) = \max\{0, z\}$.

These assumptions are significantly stronger than Assumption 4.1, but allow us to prove a stronger high-probability convergence result. Note that even with these assumptions the loss surface is still not convex, and may contain a spurious stationary point (see Remark 4.3). For simplicity, we will focus on proving the result for gradient flow. The result can then be extended to gradient descent and stochastic gradient descent, along similar lines as in the proof of Thm. 5.3.

The proof strategy in this case is quite different from that of the constant-probability guarantee, and relies on the following key technical result:

###### Lemma 6.2.

If , then

The lemma (which relies on the spherical symmetry of the distribution) implies that if we initialize at any point , then the angle between and is strictly less than , and will remain so as long as . As a result, we can apply Thm. 4.2 to prove that decays exponentially fast. The only potential difficulty is that may converge to the potential stationary point at the origin (at which the angle is not well-defined), but fortunately this cannot happen due to the following lemma:

###### Lemma 6.3.

Let and assume that . If then

The lemma can be shown to imply that as long as remains bounded away from , then cannot decrease below some positive number (as its derivative is positive close enough to zero, and is a continuous function of ). The proof idea of both lemmas is based on a technical calculation, where we project the spherically symmetric distribution on the -dimensional subspace spanned by and .

Using the lemmas above, we can get the following convergence guarantee:

###### Theorem 6.4.

Assume we initialize such that , for some and that Assumption 4.1(1) holds. Then running gradient flow, we have for all

$$\|w(t)-v\|^2 \le \|w(0)-v\| \exp(-\lambda t)$$

where .

We now note that the assumption of the theorem holds with exponentially high probability under standard initialization schemes. For example, if we use a Gaussian initialization , then by standard concentration of measure arguments, it holds w.p that is at most (say) , and w.p that . As a result, by Thm. 6.4, w.p over the initialization we have for all . The full proof of the theorem can be found in Appendix D.

###### Remark 6.5.

If we further assume that the distribution is a standard Gaussian, then it is possible to prove Lemma 6.2 and Lemma 6.3 in a much easier fashion. The reason is that specifically for a standard Gaussian distribution there is a closed-form expression (without the expectation) for the loss and the gradient, see [3], [18]. We provide the relevant versions of the lemmas, as well as their proofs, in Subsection D.1.
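As an illustration of the standard Gaussian case, the following is a minimal sketch of gradient descent on the population objective with ReLU activation. It assumes a closed-form gradient, $\nabla F(w) = \tfrac{1}{2}w - \tfrac{1}{2\pi}\big(\|v\|\sin\theta\,\tfrac{w}{\|w\|} + (\pi-\theta)\,v\big)$ with $\theta$ the angle between $w$ and $v$, which we derived from the degree-1 arc-cosine kernel identity rather than quoted from [3] or [18]. The target `v`, the initialization `w0`, the step size, and all function names are hypothetical choices for illustration only.

```python
import numpy as np

def angle(w, v):
    """Angle between w and v, computed with arctan2 for stability near 0."""
    v_hat = v / np.linalg.norm(v)
    par = w @ v_hat                          # component of w along v
    perp = np.linalg.norm(w - par * v_hat)   # component of w orthogonal to v
    return np.arctan2(perp, par)

def relu_gaussian_grad(w, v):
    """Assumed closed-form gradient of F(w) = E[0.5 (relu(w.x) - relu(v.x))^2]
    for x ~ N(0, I). Requires w != 0."""
    theta = angle(w, v)
    pull = np.linalg.norm(v) * np.sin(theta) * w / np.linalg.norm(w)
    return 0.5 * w - (pull + (np.pi - theta) * v) / (2.0 * np.pi)

def run_gd(w0, v, eta=0.5, steps=200):
    """Plain gradient descent; records the angle and the distance to v."""
    w = np.array(w0, dtype=float)
    angles = [angle(w, v)]
    errors = [np.linalg.norm(w - v)]
    for _ in range(steps):
        w = w - eta * relu_gaussian_grad(w, v)
        angles.append(angle(w, v))
        errors.append(np.linalg.norm(w - v))
    return w, np.array(angles), np.array(errors)

# Hypothetical instance: small initialization at an obtuse angle to v.
v = np.array([1.0, 0.0, 0.0])
w0 = 0.1 * np.array([-0.8, 0.6, 0.0])
w_final, angles, errors = run_gd(w0, v)
print("initial angle:", angles[0], "final distance to v:", errors[-1])
```

Under this assumed gradient, each step with $\eta < 2$ writes the new iterate as a non-negative combination of $w$ and $v$, so the recorded angle cannot increase. This is a discrete-time analogue of the behavior that Lemma 6.2 describes for gradient flow.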

### 6.2 Non-monotonic Angle Behavior

The results in the previous subsection crucially relied on the fact that at almost any point , the angle decreases. This type of analysis was also utilized in works on related settings (e.g., Brutzkus and Globerson [3]).

Based on this, it might be tempting to conjecture that this monotonically decreasing angle property (and as a result, high-probability guarantees) can be shown to hold more generally, not just for spherically symmetric distributions. Perhaps surprisingly, we show empirically that this may not be the case, already in the simple setting of a unit-variance Gaussian with a non-zero mean. We emphasize that this does not necessarily mean that gradient methods will not succeed, only that an analysis based on showing monotonic behavior of the relevant geometric quantity will not work in general.

In particular, in Figure 1 we report the result of running gradient descent (with constant step size ) on our objective function in , where the input distribution is a unit-variance Gaussian with mean at , and our target vector is . We initialize at three different locations: . Although the algorithm eventually reaches the global minimum , the angle between them is clearly non-monotonic, and actually is initially increasing rather than decreasing. Even worse, the angle appears to attain every value in , so it appears that any analysis using angle-based “safe regions” is bound to fail.

Overall, we conclude that proving a high-probability convergence guarantee for gradient methods appears to be an interesting open problem, already in the case of unit-variance, non-zero-mean Gaussian input distributions. We leave tackling this problem to future work.
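An experiment of this kind can be sketched as follows. This is not the configuration behind Figure 1: the mean, target vector, initial points, step size, and sample size below are placeholder values chosen for illustration, and the population gradient is replaced by an empirical average over a fixed sample. The sketch only records the angle trajectory and reports whether it is monotone; it makes no claim about what the outcome will be.

```python
import numpy as np

def angle(w, v):
    """Angle between w and v in [0, pi], computed with arctan2."""
    v_hat = v / np.linalg.norm(v)
    par = w @ v_hat
    perp = np.linalg.norm(w - par * v_hat)
    return np.arctan2(perp, par)

def empirical_grad(w, v, X):
    """Average gradient of 0.5 (relu(w.x) - relu(v.x))^2 over the rows of X."""
    pre_w = X @ w
    residual = np.maximum(pre_w, 0.0) - np.maximum(X @ v, 0.0)
    return ((residual * (pre_w > 0)) @ X) / X.shape[0]

def angle_trajectory(w0, v, X, eta=0.05, steps=300):
    """Run gradient descent from w0 and record the angle to v at every step."""
    w = np.array(w0, dtype=float)
    angles = [angle(w, v)]
    for _ in range(steps):
        w = w - eta * empirical_grad(w, v, X)
        angles.append(angle(w, v))
    return w, np.array(angles)

# Placeholder setup (hypothetical values, not taken from the paper).
rng = np.random.default_rng(0)
mean = np.array([0.0, 2.0])                  # non-zero mean, unit variance
X = rng.standard_normal((20000, 2)) + mean
v = np.array([1.0, 0.0])
inits = [(1.0, 1.0), (-1.0, 0.5), (0.2, -1.0)]

results = {}
for w0 in inits:
    w_end, angles = angle_trajectory(w0, v, X)
    monotone = bool(np.all(np.diff(angles) <= 1e-12))
    results[w0] = (angles, monotone)
    print(w0, "final distance:", np.linalg.norm(w_end - v), "monotone angle:", monotone)
```

Plotting each stored `angles` array against the iteration index gives a picture comparable in spirit to the one described above, for whatever configuration is substituted in.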

Acknowledgements. This research is supported in part by European Research Council (ERC) grant 754705.

## Appendix A Proofs from Sec. 3

###### Proof of Thm. 3.2.

We have that:

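$$\langle \nabla F(w), w-v\rangle \overset{(*)}{\ge} \mathbb{E}_x\left[\gamma\cdot\left(\sigma(w^\top x)-\sigma(v^\top x)\right)\left(w^\top x-v^\top x\right)\right]$$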

where is by monotonicity of (hence always), and is by the assumption that . Next, we bound the gradient :

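$$\begin{aligned}
\|\nabla F(w_t)\|^2 &\le \mathbb{E}_x\left[\left(\sigma(w_t^\top x)-\sigma(v^\top x)\right)^2\cdot\sigma'(w_t^\top x)^2\, x^\top x\right] \\
&\le c_2^4\,\mathbb{E}_x\left[\left(w_t^\top x-v^\top x\right)^2\cdot x^\top x\right] \\
&\le c_2^4\,\|w_t-v\|^2\,\mathbb{E}_x\left[\|x\|^2\cdot x^\top x\right] \le c_1^2 c_2^4\,\|w_t-v\|^2 .
\end{aligned}$$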

At iteration we have that:

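$$\begin{aligned}
\|w_{t+1}-v\|^2 &= \|w_t-\eta\nabla F(w_t)-v\|^2 \\
&= \|w_t-v\|^2-2\eta\langle\nabla F(w_t),w_t-v\rangle+\eta^2\|\nabla F(w_t)\|^2 \\
&\le \|w_t-v\|^2-2\gamma^2\lambda\eta\|w_t-v\|^2+\eta^2c_1^2c_2^4\|w_t-v\|^2 \\
&\le \|w_t-v\|^2\left(1-\gamma^2\lambda\eta\right).
\end{aligned}$$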

Using induction over the above proves the theorem.
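To make the last step of the proof explicit, the following short derivation spells out the step-size condition that the final inequality implicitly uses, and the bound obtained by unrolling the recursion. The constants are read as $c_1^2 c_2^4$; the condition on $\eta$ is inferred from the displayed inequalities and is not quoted from the statement of Thm. 3.2.

$$\begin{aligned}
-2\gamma^2\lambda\eta+\eta^2c_1^2c_2^4 \le -\gamma^2\lambda\eta
\quad&\Longleftrightarrow\quad
\eta \le \frac{\gamma^2\lambda}{c_1^2c_2^4}, \\
\|w_t-v\|^2 &\le \left(1-\gamma^2\lambda\eta\right)^{t}\|w_0-v\|^2 .
\end{aligned}$$

The second line follows by applying the one-step bound $t$ times, and for any $\eta$ satisfying the first line the factor $1-\gamma^2\lambda\eta$ is at most $1$.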

## Appendix B Proofs from Sec. 4

We will first need the following lemma:

###### Lemma B.1.

Fix some , and let be two vectors in such that for some . Then

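$$\inf_{u:\|u\|=1}\int \mathbb{1}_{a^\top y>0}\,\mathbb{1}_{b^\top y>0}\,\mathbb{1}_{\|y\|\le\alpha}\,(u^\top y)^2\,dy \;\ge\; \frac{\alpha^4}{8\sqrt{2}}\sin^3\!\left(\frac{\delta}{4}\right).$$

###### Proof.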

It is enough to lower bound

The inner infimum is attained at some such that . This is because does not depend on and , and the volume for which the indicator function inside the integral is non-zero is smallest when the angle is largest. Setting this and switching the order of the infima, we get

When , we note that the set is simply a “pie slice” of radial width out of a ball of radius . Since the expression is invariant to rotating the coordinates, we will consider without loss of generality the set , and the expression above reduces to

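$$\begin{aligned}
\inf_{u}\int_{y\in P}(\bar{u}^\top y)^2\,dy
&= \inf_{u_1,u_2:\,u_1^2+u_2^2=1}\; u_1^2\int_{y\in P}y_1^2\,dy+u_2^2\int_{y\in P}y_2^2\,dy \\
&= \min\left\{\int_{y\in P}y_1^2\,dy\,,\;\int_{y\in P}y_2^2\,dy\right\}
\;\ge\; \int_{y\in P}\min\{y_1^2,y_2^2\}\,dy\,,
\end{aligned}\tag{5}$$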

where is from the fact that is symmetric around the -axis (namely, if and only if ).

We now note that the set contains the two (disjoint and equally-sized) rectangular sets

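$$P_1' := \left[\frac{\alpha}{2}\cos\!\left(\frac{\delta}{4}\right),\,\alpha\cos\!\left(\frac{\delta}{4}\right)\right]\times\left[\frac{\alpha}{2}\sin\!\left(\frac{\delta}{4}\right),\,\alpha\sin\!\left(\frac{\delta}{4}\right)\right]$$

and

$$P_2' := \left[\frac{\alpha}{2}\cos\!\left(\frac{\delta}{4}\right),\,\alpha\cos\!\left(\frac{\delta}{4}\right)\right]\times\left[-\alpha\sin\!\left(\frac{\delta}{4}\right),\,-\frac{\alpha}{2}\sin\!\left(\frac{\delta}{4}\right)\right]$$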

(see Figure 2 for an illustration). Therefore, we can lower bound Eq. (5) by

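$$\begin{aligned}
\int_{y\in P_1'\cup P_2'}\min\{y_1^2,y_2^2\}\,dy
&\ge \left(\min_{y\in P_1'\cup P_2'}\min\{y_1^2,y_2^2\}\right)\int_{y\in P_1'\cup P_2'}1\,dy \\
&= \frac{\alpha^2}{4}\min\left\{\cos^2\!\left(\frac{\delta}{4}\right),\sin^2\!\left(\frac{\delta}{4}\right)\right\}\cdot\int_{y\in P_1'\cup P_2'}1\,dy \\
&= \frac{\alpha^2}{4}\sin^2\!\left(\frac{\delta}{4}\right)\cdot\int_{y\in P_1'\cup P_2'}1\,dy\,,
\end{aligned}$$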

where we used the fact that and therefore . The integral is simply the volume of , and since and are disjoint and equally sized rectangles, this equals twice the volume of , namely . Plugging into the above, we get

 α2