
# Approximation Schemes for ReLU Regression

## Abstract

We consider the fundamental problem of ReLU regression, where the goal is to output the best-fitting ReLU with respect to square loss given access to draws from some unknown distribution. We give the first efficient, constant-factor approximation algorithm for this problem assuming the underlying distribution satisfies some weak concentration and anti-concentration conditions (satisfied, for example, by all log-concave distributions). This solves the main open problem of Goel et al. (2019), who proved hardness results for any exact algorithm for ReLU regression (up to an additive $\epsilon$). Using more sophisticated techniques, we can improve our results and obtain a polynomial-time approximation scheme for any subgaussian distribution. Given the aforementioned hardness results, these guarantees cannot be substantially improved.

Our main insight is a new characterization of surrogate losses for nonconvex activations. While prior work had established the existence of convex surrogates for monotone activations, we show that properties of the underlying distribution actually induce strong convexity for the loss, allowing us to relate the global minimum to the activation’s Chow parameters.

## 1 Introduction

Finding the best-fitting ReLU with respect to square loss – also called “ReLU Regression” – is a fundamental primitive in the theory of neural networks. Many authors have recently studied the problem both in terms of finding algorithms that succeed under various assumptions and proving hardness results (Manurangsi and Reichman, 2018; Soltanolkotabi, 2017; Goel et al., 2019; Yehudai and Shamir, 2019; Goel et al., 2017). In this work, we consider the agnostic model of learning, where no assumptions are made on the noise.

Recall that the ReLU function parameterized by $\mathbf{w} \in \mathbb{R}^d$ is defined as $\mathrm{ReLU}_{\mathbf{w}}(\mathbf{x}) = \max(0, \langle\mathbf{w},\mathbf{x}\rangle)$ (for simplicity, let $\mathrm{ReLU}(t) = \max(0, t)$). Given samples $(\mathbf{x}, y)$ drawn from a distribution $D$ over $\mathbb{R}^d \times \mathbb{R}$, the objective of the learner is to find a hypothesis that has square loss at most $\mathrm{opt} + \epsilon$, where $\mathrm{opt}$ is defined to be the loss of the best-fitting ReLU, i.e.,

$$\mathrm{opt} := \min_{\mathbf{w}\in\mathbb{R}^d} \mathbb{E}_{D}\big[(\mathrm{ReLU}(\langle\mathbf{w},\mathbf{x}\rangle)-y)^2\big].$$

There are several hardness results known for this problem. A recent result shows that finding a hypothesis achieving a loss of $\mathrm{opt}+\epsilon$ is NP-hard when there are no distributional assumptions on $D_{\mathbf{x}}$, the marginal of $D$ on the examples (Manurangsi and Reichman, 2018). Recent work due to Goel et al. (2019) gives hardness results for achieving error $\mathrm{opt}+\epsilon$, even if the underlying distribution is the standard Gaussian. This work also provides an algorithm that achieves error $O(\mathrm{opt}^{2/3})+\epsilon$ under the assumption that $D_{\mathbf{x}}$ is log-concave. The main open problem posed by Goel et al. (2019) is the following:

###### Question 1.

For the problem of ReLU regression, is it possible to recover a hypothesis achieving error of $O(\mathrm{opt})+\epsilon$ in time $\mathrm{poly}(d, 1/\epsilon)$?

In this paper we answer this question in the affirmative. Specifically, we show that there is a fully polynomial-time algorithm which can recover a vector $\mathbf{w}$ such that the loss of the corresponding ReLU function, $\mathrm{ReLU}_{\mathbf{w}}$, is at most $O(\mathrm{opt})+\epsilon$. More formally, we prove the following:

###### Theorem 1.1.

If $D_{\mathbf{x}}$ is isotropic log-concave, there is an algorithm that takes $\widetilde{O}(d)\cdot\mathrm{poly}(1/\epsilon)$ samples, runs in $\mathrm{poly}(d, 1/\epsilon)$ time, and returns a vector $\mathbf{w}$ such that $\mathrm{ReLU}_{\mathbf{w}}$ has square loss $O(\mathrm{opt})+\epsilon$ with high probability.

The sample complexity of our algorithm is nearly linear in the problem dimension and hence information-theoretically optimal up to logarithmic factors. To establish this near-optimal sample complexity, we leverage intricate tools involving uniform one-sided concentration of empirical processes of log-concave distributions.

Additionally, we show that under stronger distributional assumptions, and if the algorithm is allowed to be improper, i.e., if the hypothesis need not be the ReLU of a linear function, then it is possible to return a hypothesis that achieves a loss of $(1+\eta)\,\mathrm{opt}+\epsilon$ in polynomial time for any constant $\eta > 0$, as long as the marginal distribution is subgaussian.

###### Theorem 1.2.

If $D_{\mathbf{x}}$ is $k$-subgaussian for some $k \ge 1$, then for any constant $\eta > 0$, there is an algorithm with polynomial sample complexity and running time that outputs a hypothesis $h$ whose square loss is at most $(1+\eta)\,\mathrm{opt}+\epsilon$ with high probability.

Given the hardness results of Goel et al. (2019), the aforementioned accuracy guarantees are essentially best-possible.

### 1.1 Our Approach

A major barrier to minimizing the square loss for the ReLU regression problem is that it is nonconvex. In such settings, gradient descent-based algorithms can potentially fail due to the presence of poor local minima. In the case of ReLU regression, the number of these bad local minima for the square loss can be as large as exponential in the dimension (Auer et al., 1996).

Despite this fact, for well-structured noise models, it is possible to learn a ReLU with respect to square loss by applying results on isotonic regression (Kalai and Sastry, 2009; Kakade et al., 2011; Klivans and Meka, 2017). These results show that if the noise is bounded and has zero mean, it is possible to learn conditional mean functions of the form $\mathbb{E}[y \mid \mathbf{x}] = \sigma(\langle\mathbf{w},\mathbf{x}\rangle)$, where $\sigma$ is a monotone and Lipschitz activation. This is proven via an analysis similar to that of the perceptron algorithm. It is not clear, however, how to extend these results to harder noise models.

In retrospect, one way to interpret the algorithms from Kalai and Sastry (2009) and Kakade et al. (2011) is to view them as implicitly minimizing a surrogate loss. The intuition is as follows: although a monotone and Lipschitz function need not be convex, it is not difficult to see that its integral is convex. This motivates the following definition of a surrogate loss:

$$L^{\mathrm{surr}}_{D}(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)\sim D}\left[\int_{0}^{\langle\mathbf{w},\mathbf{x}\rangle} (\sigma(a)-y)\,da\right].$$

Properties of this loss were explored early on in the work of Auer et al. (1996), who gave a formal proof that the loss is convex (a succinct write-up of properties of this loss can also be found in notes due to Kanade (2018)). Thus, we can efficiently minimize this loss using gradient descent. What is more subtle is the relationship of the minima of the surrogate loss to the minima of the original square loss.
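To make the contrast concrete, here is a minimal numerical sketch (Python/NumPy; the one-dimensional toy dataset is our own illustrative assumption, not from the paper) showing that the empirical square loss of a ReLU can violate convexity while the surrogate loss above does not:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def relu_antideriv(t):
    # antiderivative of ReLU: t^2/2 for t >= 0, else 0
    return np.maximum(t, 0.0) ** 2 / 2.0

# toy 1-d dataset on which the square loss is nonconvex in w
X = np.array([1.0, -1.0])
y = np.array([1.0, 1.0])

def square_loss(w):
    return np.mean((relu(w * X) - y) ** 2)

def surrogate_loss(w):
    # E[ tilde_sigma(<w, x>) - y <w, x> ]
    return np.mean(relu_antideriv(w * X) - y * (w * X))

# convexity check at the midpoint of the segment [-1, 1]
sq_mid = square_loss(0.0)
sq_avg = 0.5 * (square_loss(-1.0) + square_loss(1.0))
surr_mid = surrogate_loss(0.0)
surr_avg = 0.5 * (surrogate_loss(-1.0) + surrogate_loss(1.0))

print(sq_mid > sq_avg)       # square loss violates the convexity inequality here
print(surr_mid <= surr_avg)  # surrogate loss satisfies it
```

On this dataset the square loss at the midpoint $w = 0$ exceeds the average of its endpoint values, witnessing nonconvexity, whereas the surrogate loss behaves like a convex function, as the integral argument predicts.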

The main insight of the current work is that algorithms that directly minimize this surrogate loss have strong noise-tolerance properties if the underlying marginal distribution satisfies some mild conditions. As a consequence, we prove that the GLMtron algorithm of Kakade et al. (2011) (or equivalently projected gradient descent on the surrogate loss) achieves a constant-factor approximation for ReLU regression. The proof of this relies on three key structural observations:

• The first insight concerns the notion of the Chow parameters of a function. The Chow parameters of a function $f$ with respect to a distribution $D_{\mathbf{x}}$ are defined to be the degree-one moments of $f$ with respect to $D_{\mathbf{x}}$, i.e., $\mathbb{E}[f(\mathbf{x})\,\mathbf{x}]$. We show that the Chow parameters of a strictly monotone and Lipschitz activation function robustly characterize the function, i.e., two functions with approximately the same Chow parameters have approximately the same loss. More precisely, any $\mathbf{w}$ that satisfies $\|\chi^{\sigma_{\mathbf{w}}}_{D}-\chi_{D}\|_2^2 \le \mathrm{opt}+\epsilon$ induces a concept $\sigma_{\mathbf{w}}$ with square loss $O(\mathrm{opt}+\epsilon)$.

• The second observation is that the gradient of the surrogate loss at $\mathbf{w}$ is the difference between the Chow parameters of $\sigma_{\mathbf{w}}$ and the degree-one moments of the labels, $\chi_{D} = \mathbb{E}[y\,\mathbf{x}]$, i.e.,

$$\nabla_{\mathbf{w}} L^{\mathrm{surr}}_{D}(\mathbf{w}) = \chi^{\sigma_{\mathbf{w}}}_{D} - \chi_{D}.$$
• The third insight is that if the underlying distribution satisfies some concentration and anti-concentration properties (satisfied, for instance, by log-concave distributions), then the surrogate loss is strongly convex. In particular, this holds for any activation that is strictly monotone and $1$-Lipschitz, including ReLUs.

Any strongly convex function achieves its minimum at the unique point where the gradient is zero. The first two observations now imply that the point where the surrogate loss has zero gradient corresponds to a weight vector achieving a loss of $O(\mathrm{opt})$.

A naive analysis for the concentration of empirical gradients results in a sample complexity that is roughly quadratic in the dimension $d$. To achieve the near-linear sample complexity of $\widetilde{O}(d)$ in Theorem 1.1, we show that while the gradient is not uniformly concentrated in all directions, it does concentrate from below in the direction going from the current estimate to the minimizer of the loss.

Theorem 1.1 achieves a constant-factor approximation to the ReLU regression problem when the underlying distribution is log-concave. It is not clear how to show that minimizing the surrogate loss alone can go beyond a constant-factor approximation. Still, it turns out that under a slightly stronger distributional assumption on $D_{\mathbf{x}}$ (sub-gaussianity), we can give a polynomial-time approximation scheme (PTAS) for ReLU regression.

To achieve this, we build on the localization framework used to solve the problem of learning halfspaces under various noise models (Daniely, 2015; Awasthi et al., 2017). The problem of learning halfspaces, however, differs from the problem of ReLU regression. One crucial difference is that for the problem of learning halfspaces, the agnostic noise model is equivalent to the noise model where an $\mathrm{opt}$ fraction of the labels are corrupted. In the case of ReLU regression, every point’s label can potentially be corrupted.

Our approach broadly proceeds in two stages:

• First, we use our constant-factor approximation algorithm to recover a vector $\mathbf{w}$ satisfying $\|\mathbf{w}-\mathbf{w}^*\|_2 \le O(\sqrt{\mathrm{opt}})$, where $\mathbf{w}^*$ is the vector achieving an error of $\mathrm{opt}$. We use this to partition the space into three regions for a certain choice of a parameter $\gamma$. Our three regions are $T = \{\mathbf{u} : |\langle\mathbf{w},\mathbf{u}\rangle| \le \gamma\sqrt{\mathrm{opt}}\}$, $T_+ = \{\mathbf{u} : \langle\mathbf{w},\mathbf{u}\rangle > \gamma\sqrt{\mathrm{opt}}\}$, and $T_- = \{\mathbf{u} : \langle\mathbf{w},\mathbf{u}\rangle < -\gamma\sqrt{\mathrm{opt}}\}$.

• In each of these regions, we find functions whose loss competes with that of the best-fitting ReLU (i.e., $\mathrm{ReLU}_{\mathbf{w}^*}$).

Observe that $\mathrm{ReLU}_{\mathbf{w}^*}$ takes the value $\langle\mathbf{w}^*,\mathbf{x}\rangle$ for most of the region $T_+$. Intuitively, the best-fitting linear function must achieve a loss comparable to $\mathrm{ReLU}_{\mathbf{w}^*}$ on $T_+$. Similar reasoning shows that for the region $T_-$, the zero function is a good hypothesis. Using results from approximation theory, we show that the ReLU function in the region $T$ is closely approximated by a low-degree polynomial, with degree depending only on the accuracy parameter. To find a function which achieves a comparable loss to the concept, we perform polynomial regression to find the best-fitting polynomial of appropriate degree in this region. Finally, our algorithm returns the following hypothesis $h$:

$$h(\mathbf{x}) = \begin{cases} \langle\mathbf{w}_+,\mathbf{x}\rangle, & \mathbf{x}\in T_+ \\ P(\mathbf{x}), & \mathbf{x}\in T \\ 0, & \mathbf{x}\in T_-. \end{cases}$$

The paper by Daniely (2015) shows a comparable result only for the uniform distribution on the sphere, while our result works for all sub-gaussian distributions. The analysis of this algorithm is nontrivial. In particular, in addition to using tools from approximation theory to derive the polynomial approximation, the choice of the parameter $\gamma$ used to partition our space is delicate, and we need to calculate approximations with respect to complicated marginal distributions that do not have nice closed-form expressions.

### 1.2 Prior and Related Work

Here we provide an overview of the most relevant prior work. Goel et al. (2017) give an efficient algorithm for ReLU regression that succeeds with respect to any distribution supported on the unit sphere, but has sample complexity and running time exponential in $1/\epsilon$. Soltanolkotabi (2017) shows that SGD efficiently learns a ReLU in the realizable setting when the underlying distribution is assumed to be the standard Gaussian. Goel et al. (2018) give a learning algorithm for one convolutional layer of ReLUs for any symmetric distribution (including Gaussians). Goel et al. (2019) give an efficient algorithm for ReLU regression with an error guarantee of $O(\mathrm{opt}^{2/3})+\epsilon$.

Yehudai and Shamir (2019) show that it is hard to learn a single ReLU activation via stochastic gradient descent when the hypothesis used to learn the ReLU function is of the form $h(\mathbf{x}) = \sum_i c_i\,\phi_i(\mathbf{x})$, where the functions $\phi_i$ are random feature maps drawn from a fixed distribution. In particular, they show that any such $h$ which approximates a single ReLU up to a small constant square loss must either have some coefficient $c_i$ that is exponentially large in the dimension $d$, or have exponentially many random features in the sum. Their paper makes the point that regression using random features cannot learn the ReLU function in polynomial time. Our results use different techniques to learn the unknown ReLU function that are not captured by this model.

We note that Chow parameters have been previously used in the context of learning halfspaces under well-behaved distributions, see, e.g., O’Donnell and Servedio (2008); De et al. (2012); Diakonikolas et al. (2019) and references therein. The technique of localization has been used extensively in the context of learning halfspaces over various structured distributions. Specifically, Awasthi et al. (2017) use this technique to learn origin-centered halfspaces with respect to log-concave distributions in the presence of agnostic noise, obtaining an error guarantee of $O(\mathrm{opt})$. Subsequently, Daniely (2015) uses an adaptation of the localization technique in conjunction with the polynomial approximation technique from Kalai et al. (2005) to obtain a PTAS for the problem of agnostically learning origin-centered halfspaces under the uniform distribution over the sphere. More recently, Diakonikolas et al. (2018) obtain similar guarantees in the presence of nasty noise, where the halfspace need not be origin-centered.

While the problem of learning halfspaces is related to that of ReLU regression, we stress that for ReLU regression every label may be corrupted (possibly by arbitrarily large values), while in the context of learning halfspaces only an $\mathrm{opt}$ fraction of the labels are corrupted. This is because the loss for halfspace learning is the 0-1 loss instead of the square loss. Indeed, a black-box application of the results for halfspace learning in the context of ReLU regression results in the suboptimal guarantee of $O(\mathrm{opt}^{2/3})+\epsilon$ (Goel et al., 2019).

## 2 Preliminaries

#### Notation.

For $n \in \mathbb{N}$, we denote $[n] := \{1, \ldots, n\}$. We will use small boldface characters for vectors. For a vector $\mathbf{x} \in \mathbb{R}^d$ and $i \in [d]$, $x_i$ denotes the $i$-th coordinate of $\mathbf{x}$, and $\|\mathbf{x}\|_2$ denotes the $\ell_2$-norm of $\mathbf{x}$. We will use $\langle\mathbf{x},\mathbf{y}\rangle$ for the inner product between $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$. We will use $\mathbb{E}[X]$ for the expectation of a random variable $X$ and $\Pr[\mathcal{E}]$ for the probability of an event $\mathcal{E}$. For two functions $f, g$, let $f \lesssim g$ mean that there exists a constant $c > 0$ such that $f(x) \le c\,g(x)$ for all $x$, and define $\gtrsim$ analogously. $B(d, R)$ denotes the $d$-dimensional Euclidean ball at the origin with radius $R$, that is, $B(d, R) = \{\mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\|_2 \le R\}$. We say $f = \widetilde{O}(g)$ if $f = O(g)$ up to log factors of the input; we use $\widetilde{\Omega}$ analogously. We will use $\sigma'$ to denote a subgradient of $\sigma$ at a point.

#### Learning Models.

We start by reviewing the PAC learning model (Vapnik, 1982; Valiant, 1984). Let $\mathcal{C}$ be the target (concept) class of functions $\mathbb{R}^d \to \mathbb{R}$, $\mathcal{H}$ be a hypothesis class, and $\ell$ be a loss function. In the (distribution-specific) agnostic PAC model (Haussler, 1992; Kearns et al., 1994), we are given a multi-set of labeled examples $(\mathbf{x}, y)$ that are i.i.d. samples drawn from a distribution $D$ on $X \times Y$, where $X \subseteq \mathbb{R}^d$ and $Y \subseteq \mathbb{R}$. The marginal distribution $D_{\mathbf{x}}$ is assumed to lie in a family of well-behaved distributions. The goal is to find a hypothesis $h \in \mathcal{H}$ that approximately minimizes the expected loss $L_D(h) = \mathbb{E}_{(\mathbf{x},y)\sim D}[\ell(h(\mathbf{x}), y)]$, compared to $\mathrm{opt}_D(\mathcal{C}) = \min_{c\in\mathcal{C}} L_D(c)$. In this paper, we will have $X = \mathbb{R}^d$, $Y = \mathbb{R}$, and $\ell(h(\mathbf{x}), y) = (h(\mathbf{x})-y)^2$. We will focus on constant-factor approximation algorithms, that is, we will want a hypothesis $h$ which satisfies $L_D(h) \le C\cdot\mathrm{opt}_D(\mathcal{C}) + \epsilon$ for some universal constant $C \ge 1$ and $\epsilon > 0$. If the hypothesis $h \in \mathcal{C}$, then the learner is called proper; otherwise it is called improper.

#### Problem Setup.

We consider the concept class of Generalized Linear Models (GLMs), $\mathcal{C}_\sigma = \{\sigma_{\mathbf{w}} : \mathbf{x} \mapsto \sigma(\langle\mathbf{w},\mathbf{x}\rangle) \mid \mathbf{w} \in \mathbb{R}^d\}$, for activation functions $\sigma$ which are non-decreasing and $1$-Lipschitz. Common activations such as the ReLU and the sigmoid satisfy this assumption. We use the squared error as our loss function, i.e., $L_D(h) = \mathbb{E}_{(\mathbf{x},y)\sim D}[(h(\mathbf{x})-y)^2]$. We overload the definition by setting $L_D(f, g) = \mathbb{E}_{\mathbf{x}\sim D_{\mathbf{x}}}[(f(\mathbf{x})-g(\mathbf{x}))^2]$. Our goal is to design a proper constant-approximation PAC learner for the class $\mathcal{C}_\sigma$ in time and sample complexity polynomial in the input parameters.

In this paper, we focus primarily on the ReLU activation, that is, $\sigma(t) = \mathrm{ReLU}(t) = \max(0, t)$. We also restrict ourselves to isotropic distributions, that is, $\mathbb{E}[\mathbf{x}] = \mathbf{0}$ and $\mathbb{E}[\mathbf{x}\mathbf{x}^T] = I$. We also assume that the labels are bounded in absolute value by 1 for ease of presentation. For approximate learning guarantees, our results go through if we assume the distribution of labels is sub-exponential.

###### Definition 2.1 (Chow parameters).

Given a distribution $D$ over $\mathbb{R}^d\times\mathbb{R}$, for any function $f : \mathbb{R}^d \to \mathbb{R}$, define the (degree-one) Chow parameters of $f$ w.r.t. $D$ as $\chi^{f}_{D} := \mathbb{E}_{(\mathbf{x},y)\sim D}[f(\mathbf{x})\,\mathbf{x}]$.

For a sample $S$ drawn from $D$, we also define the corresponding empirical Chow parameters with respect to $S$ as $\widehat{\chi}^{f}_{S} := \frac{1}{|S|}\sum_{(\mathbf{x},y)\in S} f(\mathbf{x})\,\mathbf{x}$.

We overload notation by defining the true Chow parameters as $\chi_{D} := \mathbb{E}_{(\mathbf{x},y)\sim D}[y\,\mathbf{x}]$ and the corresponding empirical true Chow parameters w.r.t. $S$ as $\widehat{\chi}_{S} := \frac{1}{|S|}\sum_{(\mathbf{x},y)\in S} y\,\mathbf{x}$.
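As an illustration of these definitions, the empirical Chow parameters are just function- or label-weighted averages of the examples. The sketch below (hypothetical synthetic data, Python/NumPy) computes $\widehat{\chi}^{\mathrm{ReLU}_{\mathbf{w}}}_S$ and $\widehat{\chi}_S$ and checks that they nearly coincide when the labels are a noisy ReLU with zero-mean noise:

```python
import numpy as np

def empirical_chow(vals, X):
    """(1/|S|) * sum_i vals_i * x_i, an empirical Chow parameter vector."""
    return (vals[:, None] * X).mean(axis=0)

rng = np.random.default_rng(0)
n, d = 20000, 5
X = rng.standard_normal((n, d))                  # isotropic marginal
w = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
y = np.maximum(X @ w, 0.0) + 0.1 * rng.standard_normal(n)  # noisy ReLU labels

chow_fn = empirical_chow(np.maximum(X @ w, 0.0), X)  # hat{chi}^{ReLU_w}_S
chow_true = empirical_chow(y, X)                     # hat{chi}_S
gap = np.linalg.norm(chow_fn - chow_true)
print(gap)  # small: zero-mean noise washes out of the degree-one moments
```

The near-agreement of the two vectors here is exactly the zero-gradient condition for the surrogate loss discussed in Section 3.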

###### Definition 2.2 (Chow distance).

Given a distribution $D$ over $\mathbb{R}^d\times\mathbb{R}$, for any functions $f, g : \mathbb{R}^d \to \mathbb{R}$, define the Chow distance between $f$ and $g$ w.r.t. $D$ as $\|\chi^{f}_{D}-\chi^{g}_{D}\|_2$, that is, the Euclidean distance between the corresponding Chow parameters.

###### Lemma 2.3 (Chow distance to function distance).

Let $D$ be such that the marginal on $\mathbf{x}$ is isotropic. For any functions $f$ and $g$, $\|\chi^{f}_{D}-\chi^{g}_{D}\|_2 \le \sqrt{L_D(f,g)}$.

###### Proof.

We have

$$\begin{aligned} \|\chi^{f}_{D}-\chi^{g}_{D}\|_2 &= \big\|\mathbb{E}_{(\mathbf{x},y)\sim D}[(f(\mathbf{x})-g(\mathbf{x}))\,\mathbf{x}]\big\|_2 = \max_{\|\mathbf{u}\|_2=1}\mathbb{E}_{D}\big[(f(\mathbf{x})-g(\mathbf{x}))\langle\mathbf{u},\mathbf{x}\rangle\big] \\ &\le \sqrt{L_D(f,g)}\,\max_{\|\mathbf{u}\|_2\le 1}\sqrt{\mathbb{E}_{D}[\langle\mathbf{u},\mathbf{x}\rangle^2]} = \sqrt{L_D(f,g)}. \end{aligned}$$

Here the first equality follows from the variational form of the Euclidean norm; the inequality follows from applying the Cauchy–Schwarz inequality, and the final equality uses the isotropy of the underlying distribution on $\mathbf{x}$. ∎
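The inequality of Lemma 2.3 is easy to sanity-check by Monte Carlo. The following sketch (a hypothetical check of our own, not part of the paper) compares the empirical Chow distance of two ReLUs against the square root of their empirical squared distance under a standard Gaussian marginal, which is isotropic:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50000, 4
X = rng.standard_normal((n, d))   # isotropic marginal (standard Gaussian)

u = np.array([1.0, 0.5, -1.0, 2.0])
v = np.array([0.0, 1.0, 1.0, -0.5])
f = np.maximum(X @ u, 0.0)        # ReLU_u
g = np.maximum(X @ v, 0.0)        # ReLU_v

# ||chi^f - chi^g||_2 versus sqrt(L(f, g))
chow_gap = np.linalg.norm(((f - g)[:, None] * X).mean(axis=0))
rms_gap = np.sqrt(np.mean((f - g) ** 2))
print(chow_gap <= rms_gap)        # Lemma 2.3, up to sampling error
```

The Chow distance typically comes out well below the right-hand side, reflecting the slack in the Cauchy–Schwarz step of the proof.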

###### Corollary 2.4 (Chow-distance from true Chow vector).

Let $D$ be such that the marginal on $\mathbf{x}$ is isotropic. For any activation function $\sigma$ and vector $\mathbf{w}$, we have $\|\chi_{D}-\chi^{\sigma_{\mathbf{w}}}_{D}\|_2^2 \le L_D(\sigma_{\mathbf{w}})$.

###### Proof.

Letting $f(\mathbf{x}) = \mathbb{E}[y\mid\mathbf{x}]$ and $g = \sigma_{\mathbf{w}}$ in Lemma 2.3 (and noting that $\chi_D = \mathbb{E}[\mathbb{E}[y\mid\mathbf{x}]\,\mathbf{x}]$) gives us

$$\|\chi_{D}-\chi^{\sigma_{\mathbf{w}}}_{D}\|_2^2 \le \mathbb{E}_{D}\big[(\mathbb{E}[y\mid\mathbf{x}]-\sigma_{\mathbf{w}}(\mathbf{x}))^2\big] \le \mathbb{E}_{D}\big[(y-\sigma_{\mathbf{w}}(\mathbf{x}))^2\big] = L_D(\sigma_{\mathbf{w}}).$$

Here the last inequality follows from an application of Jensen’s inequality. ∎

#### Organization.

In Section 3, we give an algorithm to find a weight vector that matches the true Chow parameters for the class of GLMs. In Section 4, we show that under certain assumptions on the activation function, the weight vector so obtained in fact gives us the approximate learning guarantee. In Section 5, we show that, for isotropic log-concave distributions, the ReLU satisfies our assumptions, and combining the previous techniques gives us the desired approximate learning result. Finally, in Section 6 we give an algorithm that improves the approximation factor to $1+\eta$ for any constant $\eta > 0$ at the cost of improper learning.

## 3 Matching Chow Parameters via Projected Gradient Descent

In this section, we show that projected gradient descent on the surrogate loss outputs a hypothesis whose Chow parameters nearly match the true Chow parameters, $\chi_D$. More formally, we redefine the surrogate loss as follows:

$$L^{\mathrm{surr}}_{D}(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)\sim D}\left[\int_{0}^{\langle\mathbf{w},\mathbf{x}\rangle}(\sigma(a)-y)\,da\right] = \mathbb{E}_{(\mathbf{x},y)\sim D}\big[\widetilde{\sigma}(\langle\mathbf{w},\mathbf{x}\rangle)-y\langle\mathbf{w},\mathbf{x}\rangle\big].$$

Here $\widetilde{\sigma}$ is the anti-derivative of $\sigma$. For example, for the ReLU activation, we have $\widetilde{\sigma}(t) = t^2/2$ for all $t \ge 0$ and $\widetilde{\sigma}(t) = 0$ otherwise. We correspondingly define the empirical version of the surrogate loss over a sample set $S$ as $\widehat{L}^{\mathrm{surr}}_{S}(\mathbf{w}) = \frac{1}{|S|}\sum_{(\mathbf{x},y)\in S}\big[\widetilde{\sigma}(\langle\mathbf{w},\mathbf{x}\rangle)-y\langle\mathbf{w},\mathbf{x}\rangle\big]$.

We note that the gradient of $L^{\mathrm{surr}}_{D}$ is directly related to the Chow parameters as follows:

$$\nabla L^{\mathrm{surr}}_{D}(\mathbf{w}) = \mathbb{E}\big[\sigma(\langle\mathbf{w},\mathbf{x}\rangle)\,\mathbf{x}\big]-\chi_{D} = \chi^{\sigma_{\mathbf{w}}}_{D}-\chi_{D}.$$

Furthermore, the Hessian can be computed as

$$\nabla^2 L^{\mathrm{surr}}_{D}(\mathbf{w}) = \mathbb{E}\big[\sigma'(\langle\mathbf{w},\mathbf{x}\rangle)\,\mathbf{x}\mathbf{x}^T\big] \succeq 0,$$

where $\sigma'$ denotes a subgradient of $\sigma$. Here the last inequality follows from the non-decreasing property of $\sigma$. Thus, we have that $L^{\mathrm{surr}}_{D}$ is convex. Moreover, since $\sigma$ is $1$-Lipschitz and our distribution is isotropic, we have that $\nabla^2 L^{\mathrm{surr}}_{D}(\mathbf{w}) \preceq I$, implying that $L^{\mathrm{surr}}_{D}$ is $1$-smooth. Since minimizing the surrogate loss drives its gradient norm to zero, surrogate-loss minimization matches the Chow parameters of the GLM to the true Chow parameters.
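The discussion above can be turned into a short gradient-descent sketch. The code below (Python/NumPy; the synthetic data, step size, and iteration count are our own illustrative assumptions, and this is only a template for the Algorithm 1 referenced in the text) runs projected gradient descent on the empirical surrogate loss for the ReLU, whose gradient is exactly $\widehat{\chi}^{\mathrm{ReLU}_{\mathbf{w}}}_S - \widehat{\chi}_S$:

```python
import numpy as np

def surrogate_pgd(X, y, W, step=0.5, iters=200):
    """Projected gradient descent on the empirical surrogate loss
    for the ReLU activation (a sketch, not the paper's exact Algorithm 1)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        # gradient = empirical Chow params of ReLU_w minus empirical true Chow params
        grad = ((np.maximum(X @ w, 0.0) - y)[:, None] * X).mean(axis=0)
        w = w - step * grad
        norm = np.linalg.norm(w)
        if norm > W:                      # projection onto the ball B(d, W)
            w = w * (W / norm)
    return w

rng = np.random.default_rng(2)
n, d = 20000, 6
X = rng.standard_normal((n, d))           # isotropic log-concave marginal
w_star = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0])
y = np.maximum(X @ w_star, 0.0) + 0.05 * rng.standard_normal(n)

w_hat = surrogate_pgd(X, y, W=5.0)
print(np.linalg.norm(w_hat - w_star))     # small in this benign noise setting
```

The step size respects the $1$-smoothness noted above, and the iterate converges to the point where the empirical Chow parameters of the hypothesis match the empirical true Chow parameters.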

By standard Projected Gradient Descent analysis with approximate gradients, we have the following theorem, the proof of which is in Section D of the appendix.

###### Theorem 3.1.

Suppose $S$ is sufficiently large so that for all $\mathbf{w} \in B(d, W)$ we have

$$\|\nabla L^{\mathrm{surr}}_{D}(\mathbf{w})-\nabla\widehat{L}^{\mathrm{surr}}_{S}(\mathbf{w})\|_2 \le \epsilon.$$

Also suppose that the minimizer of $L^{\mathrm{surr}}_{D}$ lies in $B(d, W)$. Then Algorithm 1, when run on the samples $S$ from $D$ with weight bound $W$ and for sufficiently many iterations, has an iterate $\mathbf{w}^{(T')}$ such that

$$\big\|\chi^{\sigma_{\mathbf{w}^{(T')}}}_{D}-\chi_{D}\big\|_2^2 \le 8\epsilon W+2\epsilon^2.$$

Subsequently, we can evaluate the gradient on a fresh batch of samples and choose the iterate with the smallest empirical gradient norm. Assuming our distribution satisfies certain concentration properties, we can bound the number of samples needed by the above algorithm using the following lemma, whose proof we defer to Section C of the appendix.

###### Lemma 3.2.

If $D$ is a distribution such that, for every unit vector $\mathbf{u}$, the projection $\langle\mathbf{u},\mathbf{x}\rangle$ has a density bounded above by $C$ for some constant $C > 0$, then for $m$ sufficiently large (polynomial in $d$, $W$, $1/\epsilon$, and $\log(1/\delta)$), for all $\mathbf{w} \in B(d, W)$ we have that

$$\Pr_{S\sim D^m}\Big[\big\|\nabla L^{\mathrm{surr}}_{D}(\mathbf{w})-\nabla\widehat{L}^{\mathrm{surr}}_{S}(\mathbf{w})\big\|_2 \le \epsilon\Big] \ge 1-\delta.$$

#### Faster Rates under Strong Convexity

If we assume that $L^{\mathrm{surr}}_{D}$ is strongly convex and restrict to a distribution with bounded fourth moments, we can get much faster rates and improved sample complexity (in fact, linear in the dimension up to log factors).

###### Definition 3.3 (Strong-Convexity).

We say that the activation $\sigma$ satisfies $\mu$-strong convexity w.r.t. a distribution $D$ if there exists $\mu > 0$ such that for all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$,

$$\langle\chi^{\sigma_{\mathbf{u}}}_{D}-\chi^{\sigma_{\mathbf{v}}}_{D},\,\mathbf{u}-\mathbf{v}\rangle \ge \mu\,\|\mathbf{u}-\mathbf{v}\|_2^2.$$
###### Theorem 3.4.

Let $D$ be such that $D_{\mathbf{x}}$ is isotropic log-concave. Suppose that the minimizer of $L^{\mathrm{surr}}_{D}$ lies in $B(d, W)$. If $\sigma$ satisfies $\mu$-strong convexity w.r.t. $D$, then for Algorithm 1 (without the projection step) run with an appropriate step size and

$$m \ge \widetilde{\Omega}\left(\frac{\mu+1}{\mu^2\epsilon^2}\,d\log^4\!\Big(\frac{d}{\delta}\Big)(W+1)^2+\frac{d}{\mu^2}\log\!\Big(\frac{W+1}{\mu\delta}\Big)\right), \quad \text{where } 0 \le \epsilon \le W,$$

after sufficiently many iterations, the guarantee $\|\chi^{\sigma_{\mathbf{w}^{(T)}}}_{D}-\chi_{D}\|_2 \le \epsilon$ holds with probability at least $1-\delta$.

The proof of Theorem 3.4 is deferred to Section B in the Appendix.

## 4 Matching Chow Parameters Suffices for Approximate Learning

In this section, we show that under certain assumptions on the activation function, matching the Chow vectors implies small loss of the surrogate minimizer. We subsequently show that commonly used activation functions such as the ReLU satisfy this assumption.

###### Definition 4.1 (Chow Learnability).

We say that an activation function $\sigma$ satisfies $\beta$-Chow Learnability w.r.t. some distribution $D$ if for all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ and some fixed constant $\beta > 0$, we have that $L_D(\sigma_{\mathbf{u}}, \sigma_{\mathbf{v}}) \le \beta\,\|\chi^{\sigma_{\mathbf{u}}}_{D}-\chi^{\sigma_{\mathbf{v}}}_{D}\|_2^2$.

We will require the following lemma, proved in Section E.

###### Lemma 4.2.

If a $1$-Lipschitz activation satisfies $\mu$-strong convexity w.r.t. a distribution $D$ such that $D_{\mathbf{x}}$ is isotropic, then the activation also satisfies $\beta$-Chow Learnability with $\beta = 1/\mu^2$.

###### Remark 1.

Observe that Chow learnability may be a much weaker notion than strong convexity, since strong convexity requires parameter closeness. For activations with bounded range, such as the sigmoid, it is possible for the loss to be small and the Chow parameters to be close while the weight vectors themselves are far apart.

If the activation satisfies the Chow learnability condition, then we can show that a hypothesis nearly matching the Chow parameters attains small loss.

###### Theorem 4.3.

Let $\sigma$ be such that it satisfies $\beta$-Chow Learnability w.r.t. $D$ with $D_{\mathbf{x}}$ being isotropic. Suppose $\mathbf{w}$ is such that $\|\chi^{\sigma_{\mathbf{w}}}_{D}-\chi_{D}\|_2^2 \le \epsilon$. Then we have

$$L_D(\sigma_{\mathbf{w}}) \le 2\,\mathrm{opt}_D(\mathcal{C}_\sigma)(1+2\beta)+4\beta\epsilon.$$
###### Proof.

Let $\sigma_{\mathbf{w}^*}$ be the function attaining the loss $\mathrm{opt}_D(\mathcal{C}_\sigma)$. By the assumption on $\sigma$, we have

$$L_D(\sigma_{\mathbf{w}},\sigma_{\mathbf{w}^*}) \le \beta\,\|\chi^{\sigma_{\mathbf{w}}}_{D}-\chi^{\sigma_{\mathbf{w}^*}}_{D}\|_2^2 \le 2\beta\,\big(\|\chi^{\sigma_{\mathbf{w}}}_{D}-\chi_{D}\|_2^2+\|\chi^{\sigma_{\mathbf{w}^*}}_{D}-\chi_{D}\|_2^2\big) \le 2\beta\,\big(\epsilon+\mathrm{opt}_D(\mathcal{C}_\sigma)\big).$$

Here the last inequality follows by Corollary 2.4. Also, using the triangle inequality (together with $(a+b)^2 \le 2a^2+2b^2$),

$$L_D(\sigma_{\mathbf{w}}) \le 2\,\mathrm{opt}_D(\mathcal{C}_\sigma)+2\,L_D(\sigma_{\mathbf{w}},\sigma_{\mathbf{w}^*}).$$

Combining the above gives us the desired result. ∎

###### Remark 2.

In the above guarantee, we can replace $\mathrm{opt}_D(\mathcal{C}_\sigma)$ by $L_D(\mathbb{E}[y\mid\mathbf{x}], \sigma_{\mathbf{w}^*})$ (see the proof of Corollary 2.4). In the p-concept setting, where $\mathbb{E}[y\mid\mathbf{x}] = \sigma_{\mathbf{w}^*}(\mathbf{x})$, this is potentially a tighter guarantee. This is because $L_D(\mathbb{E}[y\mid\mathbf{x}], \sigma_{\mathbf{w}^*})$ is in fact 0, whereas $\mathrm{opt}_D(\mathcal{C}_\sigma)$ might be large. Since we are focused on the agnostic setting, we will stick to using $\mathrm{opt}_D(\mathcal{C}_\sigma)$ in our results.

## 5 Constant Factor Approximation for ReLU Regression

In this section, we present a constant factor approximation algorithm for ReLU regression over any isotropic log-concave distribution using the techniques developed in the previous sections.

###### Theorem 5.1.

Let $D$ be such that $D_{\mathbf{x}}$ is isotropic log-concave and assume the labels satisfy $|y| \le 1$. Let $\mathbf{w}^*$ achieve loss $\mathrm{opt}_D(\mathcal{C}_{\mathrm{ReLU}})$ and assume that $\|\mathbf{w}^*\|_2 \le W$. Then Algorithm 1 outputs a vector $\mathbf{w}$ such that

$$L_D(\mathrm{ReLU}_{\mathbf{w}}) \le O\big(\mathrm{opt}_D(\mathcal{C}_{\mathrm{ReLU}})\big)+\epsilon,$$

with probability $1-\delta$, using a number of samples that is nearly linear in $d$ (and polynomial in $W$, $1/\epsilon$, and $\log(1/\delta)$) and time polynomial in the same parameters.

Our main observation is that the ReLU activation satisfies the strong convexity condition w.r.t. any isotropic log-concave distribution.

###### Lemma 5.2 (Strong Convexity of ReLU).

Let $D$ be such that $D_{\mathbf{x}}$ is isotropic log-concave. Then there exists some fixed constant $\mu > 0$ such that the ReLU is $\mu$-strongly convex w.r.t. $D$.

#### Proof Sketch.

Since the ReLU is $1$-Lipschitz and non-decreasing, we have

$$\big(\chi^{\mathrm{ReLU}_{\mathbf{v}}}_{D}-\chi^{\mathrm{ReLU}_{\mathbf{u}}}_{D}\big)^T(\mathbf{v}-\mathbf{u}) = \mathbb{E}\big[(\mathrm{ReLU}(\langle\mathbf{v},\mathbf{x}\rangle)-\mathrm{ReLU}(\langle\mathbf{u},\mathbf{x}\rangle))\,\langle\mathbf{v}-\mathbf{u},\mathbf{x}\rangle\big] \ge \mathbb{E}\big[(\mathrm{ReLU}(\langle\mathbf{v},\mathbf{x}\rangle)-\mathrm{ReLU}(\langle\mathbf{u},\mathbf{x}\rangle))^2\big].$$

Now our goal is to bound from below the error between the two ReLUs by the distance between the corresponding vectors. Due to the anti-concentration properties of log-concave distributions, there is sufficient probability mass in a constant radius ball around the origin. This enables us to exploit the linear region of the corresponding ReLUs to establish the lower bound. We defer the full proof to Section F in the Appendix.
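For intuition, the strong-convexity quantity of Definition 3.3 can be estimated numerically. The following sketch (a hypothetical Monte Carlo check of our own, under the standard Gaussian, which is isotropic log-concave) estimates $\langle\chi^{\mathrm{ReLU}_{\mathbf{u}}}_{D}-\chi^{\mathrm{ReLU}_{\mathbf{v}}}_{D}, \mathbf{u}-\mathbf{v}\rangle / \|\mathbf{u}-\mathbf{v}\|_2^2$ and finds it bounded away from zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100000, 3
X = rng.standard_normal((n, d))   # standard Gaussian: isotropic log-concave

def chow_relu(w):
    """Empirical Chow parameters of ReLU_w."""
    return (np.maximum(X @ w, 0.0)[:, None] * X).mean(axis=0)

u = np.array([1.0, 2.0, -1.0])
v = np.array([-0.5, 1.0, 1.0])
lhs = np.dot(chow_relu(u) - chow_relu(v), u - v)
mu_hat = lhs / np.linalg.norm(u - v) ** 2
# For the Gaussian, Stein's lemma gives E[ReLU(a) a] = Var(a)/2 and
# E[ReLU(a) b] = Cov(a, b)/2, so the population value of mu_hat is exactly 1/2.
print(mu_hat)
```

The estimate is strictly positive (and at most 1, since the ReLU is $1$-Lipschitz), illustrating the strong-convexity constant that the anti-concentration argument establishes in general.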

#### Proof of Theorem 5.1

By Lemma 5.2, the ReLU activation satisfies $\mu$-strong convexity w.r.t. $D$ for some constant $\mu > 0$. This implies that $L^{\mathrm{surr}}_{D}$ is strongly convex, and therefore its minimizer (say $\bar{\mathbf{w}}$) satisfies $\chi^{\mathrm{ReLU}_{\bar{\mathbf{w}}}}_{D} = \chi_{D}$. Using Corollary 2.4 and the strong convexity of the ReLU, we have that

$$\|\mathbf{w}^*-\bar{\mathbf{w}}\|_2 \lesssim \|\chi^{\mathrm{ReLU}_{\mathbf{w}^*}}_{D}-\chi^{\mathrm{ReLU}_{\bar{\mathbf{w}}}}_{D}\|_2 \le \sqrt{L_D(\mathrm{ReLU}_{\mathbf{w}^*})} = \sqrt{\mathrm{opt}_D(\mathcal{C}_{\mathrm{ReLU}})}.$$

Therefore, $\|\bar{\mathbf{w}}\|_2 \le W+O\big(\sqrt{\mathrm{opt}_D(\mathcal{C}_{\mathrm{ReLU}})}\big)$. It is not hard to see that with bounded labels $\mathrm{opt}_D(\mathcal{C}_{\mathrm{ReLU}}) \le 1$ (the zero vector already achieves loss $\mathbb{E}[y^2] \le 1$), so the minimizer lies in $B(d, W+O(1))$. Therefore, we can now apply Theorem 3.4 to find a hypothesis whose squared Chow distance from $\chi_{D}$ is at most $\epsilon$. The result now follows directly from Theorem 4.3. ∎

## 6 A PTAS for ReLU Regression

In this section, we show that if the activation is the ReLU function, we can solve the problem of finding the best-fitting ReLU up to a factor-$(1+\eta)$ approximation when the underlying marginal over the input is sub-gaussian. We assume that $\|\mathbf{w}^*\|_2 \le W$ for some constant $W$.

We define sub-gaussian distributions here:

###### Definition 6.1.

A distribution $D_{\mathbf{x}}$ on $\mathbb{R}^d$ is called $k$-subgaussian, $k \ge 1$, if for any direction $\mathbf{v} \in \mathbb{R}^d$ with $\|\mathbf{v}\|_2 = 1$, the probability density function of $z = \langle\mathbf{v},\mathbf{x}\rangle$, where $\mathbf{x} \sim D_{\mathbf{x}}$, satisfies a subgaussian bound with parameter $k$, i.e., is bounded above by $C\,e^{-z^2/(2k^2)}$ for some absolute constant $C$.

Our algorithm (Algorithm 2) works by partitioning the domain into three parts $T$, $T_+$, and $T_-$, where

$$\begin{aligned} T &= \{\mathbf{u}\in\mathbb{R}^d : |\langle\mathbf{w},\mathbf{u}\rangle| \le \gamma\sqrt{\mathrm{opt}}\} \\ T_+ &= \{\mathbf{u}\in\mathbb{R}^d : \langle\mathbf{w},\mathbf{u}\rangle > \gamma\sqrt{\mathrm{opt}}\} \\ T_- &= \{\mathbf{u}\in\mathbb{R}^d : \langle\mathbf{w},\mathbf{u}\rangle < -\gamma\sqrt{\mathrm{opt}}\}. \end{aligned}$$

The hypothesis behaves as a different function in each of these parts. For $T_-$, the hypothesis is the zero function. For $T_+$, the hypothesis takes the value of $\langle\mathbf{w}_+,\mathbf{x}\rangle$, which is the best-fitting linear function over $T_+$. Finally, over $T$ the hypothesis outputs the value of $P(\mathbf{x})$, the best-fitting norm-bounded polynomial of appropriate degree. Our main theorem of this section is the following:
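A sketch of this three-region construction is below (Python/NumPy). The data, the stage-1 estimate, and the parameter choices are all illustrative assumptions of ours, and for simplicity we fit the polynomial as a univariate polynomial in $\langle\mathbf{w},\mathbf{x}\rangle$ rather than the norm-bounded multivariate polynomial regression used by Algorithm 2:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20000, 4
X = rng.standard_normal((n, d))
w_star = np.array([1.0, 0.5, -0.5, 0.0])
y = np.maximum(X @ w_star, 0.0) + 0.05 * rng.standard_normal(n)

w = w_star + 0.05 * rng.standard_normal(d)   # stand-in for the stage-1 estimate
opt_est, gamma = 0.05 ** 2, 20.0             # illustrative parameter choices
thresh = gamma * np.sqrt(opt_est)

margin = X @ w
Tp, Tm = margin > thresh, margin < -thresh
T = ~Tp & ~Tm

# best-fitting linear function on T_+ (ordinary least squares)
w_plus, *_ = np.linalg.lstsq(X[Tp], y[Tp], rcond=None)
# low-degree polynomial regression on T; here a univariate polynomial
# in <w, x>, a simplifying assumption for the sketch
coeffs = np.polyfit(margin[T], y[T], deg=4)

# the piecewise hypothesis h: linear on T_+, polynomial on T, zero on T_-
h = np.where(Tp, X @ w_plus, np.where(Tm, 0.0, np.polyval(coeffs, margin)))
print(np.mean((h - y) ** 2))  # small on this benign synthetic instance
```

Each branch mirrors one region of the displayed hypothesis: the linear fit competes with the ReLU where it is essentially linear, the zero function where it is essentially zero, and the polynomial handles the band around the kink.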

###### Theorem 6.2.

Let $D_{\mathbf{x}}$ be $k$-subgaussian for some $k \ge 1$, and suppose $\|\mathbf{w}^*\|_2 \le W$. Then there is an algorithm that takes a number of samples and an amount of time polynomial in $d$ and $1/\epsilon$ (for constant $\eta$ and $k$), and returns a hypothesis $h$ that with high probability satisfies

$$\mathbb{E}_{D}\big[(h(\mathbf{x})-y)^2\big] \le (1+\eta)\,\mathrm{opt}+\epsilon.$$
###### Remark 3.

We note that if the distribution is uniform over the unit sphere, then the sample complexity of our algorithm improves, since such a distribution is subgaussian with parameter $O(1/\sqrt{d})$ in every direction. That is, under the uniform distribution over the unit sphere, the sample complexity is independent of the dimension $d$.

The proof of Theorem 6.2 follows from a direct application of the following properties of Algorithm 2 with the specified parameters.

###### Lemma 6.3.

Let $D_{\mathbf{x}}$ be $k$-subgaussian for some $k \ge 1$ and let $S$ be a set of i.i.d. samples drawn from $D$. If $\|\mathbf{w}-\mathbf{w}^*\|_2 \le O(\sqrt{\mathrm{opt}})$, and the parameter $\gamma$ and the degree of the polynomial $P$ are chosen appropriately, then for the components $\mathbf{w}_+$ and $P$ computed by Algorithm 2, we have

1. $\mathbb{E}_{D}\big[(\langle\mathbf{w}_+,\mathbf{x}\rangle-y)^2\,\mathbb{1}_{T_+}(\mathbf{x})\big] \le \mathbb{E}_{D}\big[(\mathrm{ReLU}(\langle\mathbf{w}^*,\mathbf{x}\rangle)-y)^2\,\mathbb{1}_{T_+}(\mathbf{x})\big]+\frac{\eta}{3}\,\mathrm{opt}+\frac{\epsilon}{3}$.

2. $\mathbb{E}_{D}\big[y^2\,\mathbb{1}_{T_-}(\mathbf{x})\big] \le \mathbb{E}_{D}\big[(\mathrm{ReLU}(\langle\mathbf{w}^*,\mathbf{x}\rangle)-y)^2\,\mathbb{1}_{T_-}(\mathbf{x})\big]+\frac{\eta}{3}\,\mathrm{opt}+\frac{\epsilon}{3}$.

3. $\mathbb{E}_{D}\big[(P(\mathbf{x})-y)^2\,\mathbb{1}_{T}(\mathbf{x})\big] \le \mathbb{E}_{D}\big[(\mathrm{ReLU}(\langle\mathbf{w}^*,\mathbf{x}\rangle)-y)^2\,\mathbb{1}_{T}(\mathbf{x})\big]+\frac{\eta}{3}\,\mathrm{opt}+\frac{\epsilon}{3}$.

###### Proof of Theorem 6.2.

Using Lemma 6.3, we get

$$\begin{aligned} \mathbb{E}_{D}\big[(h(\mathbf{x})-y)^2\big] &= \mathbb{E}_{D}\big[(h(\mathbf{x})-y)^2\,(\mathbb{1}_{T_+}(\mathbf{x})+\mathbb{1}_{T}(\mathbf{x})+\mathbb{1}_{T_-}(\mathbf{x}))\big] \\ &\le \mathbb{E}_{D}\big[(\langle\mathbf{w}_+,\mathbf{x}\rangle-y)^2\,\mathbb{1}_{T_+}(\mathbf{x})\big]+\mathbb{E}_{D}\big[(0-y)^2\,\mathbb{1}_{T_-}(\mathbf{x})\big]+\mathbb{E}_{D}\big[(P(\mathbf{x})-y)^2\,\mathbb{1}_{T}(\mathbf{x})\big] \\ &\le \mathbb{E}_{D}\big[(\mathrm{ReLU}(\langle\mathbf{w}^*,\mathbf{x}\rangle)-y)^2\big]+\eta\cdot\mathrm{opt}+\epsilon = (1+\eta)\,\mathrm{opt}+\epsilon. \end{aligned}$$

∎

We now prove Lemma 6.3.

###### Proof of Lemma 6.3.

Let $\gamma$ be as defined in Algorithm 2. We first project down to two dimensions. Let $V$ be the 2-dimensional space spanned by $\mathbf{w}$ and $\mathbf{w}^*$, and let $P_V$ be the orthogonal projection onto $V$. If $\mathbf{x} \in T_+$, then $\langle\mathbf{w}, P_V(\mathbf{x})\rangle = \langle\mathbf{w},\mathbf{x}\rangle$ and $\langle\mathbf{w}, P_V(\mathbf{x})\rangle > \gamma\sqrt{\mathrm{opt}}$. Since $\mathbf{w}$ is a constant-factor approximation for a $k$-subgaussian distribution, with high probability we have $\|\mathbf{w}^*-\mathbf{w}\|_2 \le c_\chi\nu\sqrt{\mathrm{opt}}$ for some constant $c_\chi$. This is easy to check via the Cauchy–Schwarz inequality and using the structural lemmas from the previous subsections. Additionally,

$$-\langle\mathbf{w}^*,\mathbf{x}\rangle = -\langle\mathbf{w}^*,P_V(\mathbf{x})\rangle = -\langle\mathbf{w}^*-\mathbf{w},P_V(\mathbf{x})\rangle-\langle\mathbf{w},P_V(\mathbf{x})\rangle \le -\langle\mathbf{w}^*-\mathbf{w},P_V(\mathbf{x})\rangle \le \|\mathbf{w}^*-\mathbf{w}\|_2\,\|P_V(\mathbf{x})\|_2 \le c_\chi\nu\sqrt{\mathrm{opt}}\,\|P_V(\mathbf{x})\|_2.$$

A similar calculation for $T_-$ yields an analogous bound. We now bound from above the error in the region $T_+$. Since $\langle\mathbf{w}_+,\cdot\rangle$ is the best-fitting linear function over $T_+$, the loss of $\langle\mathbf{w}^*,\cdot\rangle$ is necessarily larger than that of $\langle\mathbf{w}_+,\cdot\rangle$. An application of Lemma G.6 in the first step implies

 =minw′∈B(d,W)ED[(⟨w′,x⟩−y)21T+(x)]+ϵPrD[T