
# Learning Halfspaces and Neural Networks with Random Initialization

Yuchen Zhang  Jason D. Lee  Martin J. Wainwright  Michael I. Jordan

Department of Electrical Engineering and Computer Science
University of California, Berkeley, CA 94709
{yuczhang,jasondlee88,wainwrig,jordan}@eecs.berkeley.edu
###### Abstract

We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are L-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk ϵ > 0. The time complexity is polynomial in the input dimension d and the sample size n, but exponential in the quantity L/ϵ. These algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin γ > 0, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin Ω(γ). As a consequence, the algorithm achieves arbitrary generalization error ϵ > 0 with poly(1/ϵ) sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability η < 1/2.

## 1 Introduction

The learning of a halfspace is the core problem solved by many machine learning methods, including the Perceptron (Rosenblatt, 1958), the Support Vector Machine (Vapnik, 1998), and AdaBoost (Freund and Schapire, 1997). More formally, for a given input space X ⊆ R^d, a halfspace is defined by a linear mapping f(x) = ⟨w, x⟩ from X to the real line. The sign of the function value f(x) determines whether x is located on the positive side or the negative side of the halfspace. A labeled data point consists of a pair (x, y) with y ∈ {−1, +1}, and given n such pairs {(x_i, y_i)}_{i=1}^n, the empirical prediction error is given by

 ℓ(f) := (1/n) ∑_{i=1}^n I[−y_i f(x_i) ≥ 0]. (1)

The loss function in equation (1) is also called the zero-one loss. In agnostic learning, there need not be any hyperplane that perfectly separates the data, in which case the goal is to find a mapping f that achieves a small zero-one loss. The method of choosing a function by minimizing the criterion (1) is known as empirical risk minimization (ERM).

It is known that finding a halfspace that approximately minimizes the zero-one loss is NP-hard. In particular, Guruswami and Raghavendra (2009) show that, for any ϵ > 0, given a set of pairs such that the optimal zero-one loss is bounded by ϵ, it is NP-hard to find a halfspace whose zero-one loss is bounded by 1/2 − ϵ. Many practical machine learning algorithms minimize convex surrogates of the zero-one loss, but the halfspaces obtained through the convex surrogate are not necessarily optimal. In fact, the result of Guruswami and Raghavendra (2009) shows that the approximation ratio of such procedures can be arbitrarily large.

In this paper, we study optimization problems of the form

 ℓ(f) := (1/n) ∑_{i=1}^n h(−y_i f(x_i)), (2)

where the function h : R → [0, 1] is L-Lipschitz continuous for some L > 0, but is otherwise arbitrary (and so can be non-convex). This family does not include the zero-one loss (since it is not Lipschitz), but does include functions that approximate it to arbitrary accuracy as L grows. For instance, the piecewise-linear function

 h(x) := 0 for x ≤ −1/(2L);  1 for x ≥ 1/(2L);  Lx + 1/2 otherwise, (3)

is L-Lipschitz, and converges to the step function x ↦ I[x ≥ 0] as the parameter L increases to infinity.
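For concreteness, the surrogate (3) can be written as a single clipping operation. This is our own illustrative sketch (the helper name `h_piecewise` is not from the paper):

```python
import numpy as np

def h_piecewise(x, L):
    """The surrogate of equation (3): 0 below -1/(2L), 1 above 1/(2L),
    and the line L*x + 1/2 in between; its Lipschitz constant is L."""
    return np.clip(L * x + 0.5, 0.0, 1.0)

xs = np.linspace(-1.0, 1.0, 5)
out_small = h_piecewise(xs, L=2.0)       # gentle ramp around zero
out_large = h_piecewise(xs, L=1000.0)    # nearly the step function I[x >= 0]
```

The clipped-line form makes the Lipschitz constant explicit: the only nonzero slope is L, attained on the interval (−1/(2L), 1/(2L)).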

Shalev-Shwartz et al. (2011) study the problem of minimizing the objective (2) with the function h defined by (3), and show that under a certain cryptographic assumption, there is no poly(L)-time algorithm for (approximate) minimization. Thus, it is reasonable to assume that the Lipschitz parameter L is a constant that does not grow with the dimension d or the sample size n. Moreover, when f is a linear mapping, scaling the input vector x or scaling the weight vector w is equivalent to scaling the Lipschitz constant L. Thus, we assume without loss of generality that the norms of x and w are bounded by one.

### 1.1 Our contributions

The first contribution of this paper is to present two poly(n, d)-time methods—namely, Algorithm 1 and Algorithm 2—for minimizing the cost function (2) for an arbitrary L-Lipschitz function h. We prove that for any given tolerance ϵ > 0, these algorithms achieve an excess risk of O(ϵL) by running multiple rounds of random initialization followed by a constant number of optimization rounds (e.g., using an algorithm such as stochastic gradient descent). The first algorithm is based on choosing the initial vector uniformly at random from a Euclidean sphere; despite the simplicity of this scheme, it still has non-trivial guarantees. The second algorithm makes use of a better initialization obtained by solving a least-squares problem, thereby leading to a stronger theoretical guarantee. Random initialization is a widely used heuristic in non-convex ERM; our analysis supports this usage but suggests that a careful theoretical treatment of the initialization step is necessary.

Our algorithms for learning halfspaces have running time that grows polynomially in the pair (n, d), as well as a term exponential in L/ϵ. Our next contribution is to show that under a standard complexity-theoretic assumption—namely, that RP ≠ NP—this exponential dependence cannot be avoided. More precisely, letting h be the piecewise-linear function from equation (3), Proposition 1 shows that there is no algorithm achieving arbitrary excess risk in time polynomial in (n, d, L). Thus, the random initialization scheme is unlikely to be substantially improved.

We then extend our approach to the learning of multi-layer neural networks, with a detailed analysis of the family of m-layer sigmoid-activated neural networks, under the assumption that the ℓ1-norm of the incoming weights of any neuron is bounded by a constant B. We specify a method (Algorithm 3) for training networks over this family, and in Theorem 3, we prove that its loss is at most an additive term, vanishing with ϵ, worse than that of the best neural network. The time complexity of the algorithm is polynomial in (n, d), multiplied by a constant that does not depend on the input dimension or the data size, but may depend exponentially on the triplet (L, B, 1/ϵ).

Due to the exponential dependence on 1/ϵ, this agnostic learning algorithm is too expensive to achieve a diminishing excess risk for a general data set. However, by analyzing data sets that are separable by some neural network with constant margin γ > 0, we obtain a stronger achievability result. In particular, we show in Theorem 4 that there is an efficient algorithm that correctly classifies all training points with margin Ω(γ) in polynomial time. As a consequence, the algorithm learns a neural network with generalization error bounded by ϵ using poly(1/ϵ) training points and polynomial time. This so-called BoostNet algorithm uses the AdaBoost approach (Freund and Schapire, 1997) to construct an m-layer neural network by taking (m−1)-layer networks as weak classifiers. The shallower networks are trained by the agnostic learning algorithms that we develop in this paper. We establish the same learnability result when the labels are randomly flipped with probability η < 1/2 (see Corollary 1). Although the time complexity of BoostNet is exponential in 1/γ, we demonstrate that our achievability result is unimprovable—in particular, by showing that a poly(1/γ) complexity is impossible under a certain cryptographic assumption (see Proposition 2).

Finally, we report experiments on learning parity functions with noise, which is a challenging problem in computational learning theory. We train two-layer neural networks using BoostNet, then compare them with the traditional backpropagation approach. The experiment shows that BoostNet learns the degree-5 parity function by constructing 50 hidden neurons, while the backpropagation algorithm fails to outperform random guessing.

### 1.2 Related Work

This section is devoted to discussion of some related work so as to put our contributions into broader context.

#### 1.2.1 Learning halfspaces

The problem of learning halfspaces is an important problem in theoretical computer science. It is known that for any constant approximation ratio, the problem of approximately minimizing the zero-one loss is computationally hard (Guruswami and Raghavendra, 2009; Daniely et al., 2014). Halfspaces can be efficiently learned if the data are drawn from certain special distributions, or if the labels are corrupted by particular forms of noise. Indeed, Blum et al. (1998) and Servedio and Valiant (2001) show that if the labels are corrupted by random noise, then the halfspace can be learned in polynomial time. The same conclusion was established by Awasthi et al. (2015) when the labels are corrupted by Massart noise, and the covariates are drawn from the uniform distribution on a unit sphere. When the label noise is adversarial, the halfspace can be learned if the data distribution is isotropic log-concave and the fraction of labels being corrupted is bounded by a small quantity (Kalai et al., 2008; Klivans et al., 2009; Awasthi et al., 2014). When no assumption is made on the noise, Kalai et al. (2008) show that if the data are drawn from the uniform distribution on a unit sphere, then there is an algorithm whose time complexity is polynomial in the input dimension, but exponential in 1/ϵ (where ϵ is the additive error). In this same setting, Klivans and Kothari (2014) prove that the exponential dependence on 1/ϵ is unavoidable.

Another line of work modifies the loss function to make it easier to minimize. Ben-David and Simon (2001) suggest comparing the zero-one loss of the learned halfspace to the optimal margin loss at some margin ρ > 0, which marks as misclassified every point whose classification margin is smaller than ρ. Under this metric, it was shown by Ben-David and Simon (2001) and Birnbaum and Shalev-Shwartz (2012) that the optimal margin loss can be achieved in polynomial time whenever ρ is a positive constant. Shalev-Shwartz et al. (2011) study the minimization of a continuous approximation to the zero-one loss, which is similar to our setup. They propose a kernel-based algorithm which performs as well as the best linear classifier. However, it is an improper learning method, in that the learned classifier cannot be represented by a halfspace.

#### 1.2.2 Learning neural networks

It is known that any smooth function can be approximated by a neural network with just one hidden layer (Barron, 1993), but that training such a network is NP-hard (Blum and Rivest, 1992). In practice, optimization algorithms such as stochastic gradient (SG) methods are used to train neural networks. Although strong theoretical results are available for SG in the setting of convex objective functions, there are few such results in the non-convex setting of neural networks.

Several recent papers address the challenge of establishing polynomial-time learnability results for neural networks. Arora et al. (2013) study the recovery of denoising auto-encoders. They assume that the top-layer values of the network are randomly generated and that all network weights are random. As a consequence, the bottom layer generates a sequence of random observations from which the algorithm can recover the network weights. The algorithm has polynomial-time complexity and is capable of learning random networks drawn from a specific distribution. However, in practice one wants to learn deterministic networks that encode data-dependent representations.

Sedghi and Anandkumar (2014) study the supervised learning of neural networks under the assumption that the score function of the data distribution is known. They show that if the input dimension is large enough and the network is sparse enough, then the first network layer can be learned by a polynomial-time algorithm. More recently, Janzamin et al. (2015) propose another algorithm relying on the score function that removes the restrictions of Sedghi and Anandkumar (2014). The assumption in this case is that the network weights satisfy a non-degeneracy condition; however, the algorithm is only capable of learning neural networks with one hidden layer. Our algorithm does not impose any assumption on the data distribution, and is able to learn multi-layer neural networks.

Another approach to the problem is via the improper learning framework. The goal in this case is to find a predictor that is not a neural network, but performs as well as the best possible neural network in terms of the generalization error. Livni et al. (2014) propose a polynomial-time algorithm to learn networks whose activation function is quadratic. Zhang et al. (2015) propose an algorithm for improper learning of sigmoidal neural networks. The algorithm runs in poly(n, d) time if the depth of the network is a constant and the ℓ1-norm of the incoming weights of any node is bounded by a constant. It outputs a kernel-based classifier, while our algorithm outputs a proper neural network. On the other hand, in the agnostic setting, the time complexity of our algorithm depends exponentially on 1/ϵ.

## 2 Preliminaries

In this section, we formalize the problem set-up and present several preliminary lemmas that are useful for the theoretical analysis. We first set up some notation so as to define a general empirical risk minimization problem. Let {(x_i, y_i)}_{i=1}^n be a dataset containing n points, where x_i ∈ R^d and y_i ∈ {−1, +1}. The goal is to learn a function f so that f(x_i) is as close to y_i as possible. We may write the loss function as

 ℓ(f) := ∑_{i=1}^n α_i h(−y_i f(x_i)), (4)

where h : R → [0, 1] is an L-Lipschitz continuous function, and the α_i are non-negative importance weights that sum to one. As concrete examples, the function h can be the piecewise-linear function defined by equation (3) or the sigmoid function h(x) = 1/(1 + e^{−x}). Figure 1 compares the step function with these continuous approximations. In the following sections, we study minimizing the loss function in equation (4) when f is either a linear mapping or a multi-layer neural network.

Let us introduce some useful shorthand notation. We use [n] to denote the set of indices {1, …, n}. For p ≥ 1, let ∥v∥_p denote the ℓ_p-norm of a vector v, given by ∥v∥_p = (∑_i |v_i|^p)^{1/p}, as well as ∥v∥_∞ = max_i |v_i|. If v is a d-dimensional vector and h is a function, we use h(v) as a convenient shorthand for the vector (h(v_1), …, h(v_d)). Given a class F of real-valued functions, we define the new function class h ∘ F := {h ∘ f : f ∈ F}.

Let (x′_1, y′_1), …, (x′_k, y′_k) be i.i.d. samples drawn from the dataset such that the probability of drawing (x_i, y_i) is proportional to α_i. We define the sample-based loss function:

 G(f) := (1/k) ∑_{j=1}^k h(−y′_j f(x′_j)). (5)

It is straightforward to verify that E[G(f)] = ℓ(f). For a given function class F, the Rademacher complexity of F with respect to these samples is defined as

 R_k(F) := E[sup_{f∈F} (1/k) ∑_{j=1}^k ε_j f(x′_j)], (6)

where the ε_j are independent Rademacher random variables, uniform on {−1, +1}.

###### Lemma 1.

Assume that F contains the constant zero function f ≡ 0. Then we have

 E[sup_{f∈F} |G(f) − ℓ(f)|] ≤ 4L R_k(F).

This lemma shows that the Rademacher complexity R_k(F) controls the distance between G and ℓ. For the function classes studied in this paper, we will have R_k(F) = O(1/√k). Thus, the function G will be a good approximation to ℓ if the sample size k is large enough. This lemma is based on a slight sharpening of the usual Ledoux-Talagrand contraction for Rademacher variables (Ledoux and Talagrand, 2013); see Appendix A for the proof.
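To make the role of the sample-based loss concrete, here is a small numpy simulation (our own illustration: the dataset, the predictor, and the choice L = 5 are arbitrary assumptions) checking that G(f) from equation (5) concentrates around the weighted loss ℓ(f) from equation (4) when points are resampled with probability proportional to α_i:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(z, L=5.0):                          # Lipschitz surrogate, equation (3)
    return np.clip(L * z + 0.5, 0.0, 1.0)

# A toy weighted dataset and a fixed linear predictor f(x) = <w, x>.
n, d = 200, 10
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
alpha = rng.random(n)
alpha /= alpha.sum()                      # importance weights summing to one
w = rng.standard_normal(d)

full_loss = float(alpha @ h(-y * (X @ w)))          # ℓ(f), equation (4)

# Sample-based loss G(f), equation (5): draw k indices with P(i) ∝ alpha_i.
k = 5000
idx = rng.choice(n, size=k, p=alpha)
sample_loss = float(np.mean(h(-y[idx] * (X[idx] @ w))))
```

Since h is bounded in [0, 1], the deviation |G(f) − ℓ(f)| for a fixed f scales as O(1/√k), matching the O(1/√k) Rademacher rate quoted above.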

##### Johnson-Lindenstrauss lemma:

The Johnson-Lindenstrauss lemma is a useful tool for dimension reduction. It gives a lower bound on an integer k such that, after projecting a collection of vectors from a high-dimensional space into a k-dimensional space, the pairwise distances between the vectors are approximately preserved.

###### Lemma 2.

For any ϵ ∈ (0, 1) and any positive integer n, consider any positive integer k = Ω(ϵ^{−2} log n). Let ϕ be the operator that projects a vector in R^d onto a random k-dimensional subspace of R^d, then scales the resulting vector by √(d/k). Then for any set of vectors u_1, …, u_n ∈ R^d, the bound

 |∥u_i − u_j∥₂² − ∥ϕ(u_i) − ϕ(u_j)∥₂²| ≤ ϵ ∥u_i − u_j∥₂²  for every i, j ∈ [n] (7)

holds with high probability.

See the paper by Dasgupta and Gupta (1999) for a simple proof.
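A quick numerical check of the lemma's content; note that we use a Gaussian random projection, a standard construction with the same distance-preservation property, rather than the random-subspace operator ϕ of Lemma 2, and the dimensions below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian random projection: a standard construction with the JL property.
# (Lemma 2 uses a scaled random-subspace projection; the guarantee is analogous.)
d, k, n_pts = 2000, 400, 20
U = rng.standard_normal((n_pts, d))                # n_pts points in R^d
P = rng.standard_normal((d, k)) / np.sqrt(k)       # scaling gives E||uP||^2 = ||u||^2
V = U @ P                                          # projected points in R^k

def pairwise_sq_dists(M):
    G = M @ M.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G

orig = pairwise_sq_dists(U)
proj = pairwise_sq_dists(V)
mask = ~np.eye(n_pts, dtype=bool)                  # ignore zero self-distances
max_distortion = float(np.max(np.abs(proj[mask] - orig[mask]) / orig[mask]))
```

With k = 400 the typical relative distortion is on the order of √(2/k) ≈ 0.07, far below 1, even though the ambient dimension dropped from 2000 to 400.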

##### Maurey-Barron-Jones lemma:

Letting G be a subset of a Hilbert space H, the Maurey-Barron-Jones lemma guarantees that any point in the convex hull of G can be approximated by a convex combination of a small number of points of G. More precisely, we have:

###### Lemma 3.

Consider any subset G of a Hilbert space such that ∥g∥ ≤ b for all g ∈ G. Then for any point f in the convex hull of G, there is a point f_s in the convex hull of s points of G such that ∥f − f_s∥² ≤ b²/s.

See the paper by Pisier (1980) for a proof. This lemma is useful in our analysis of neural network learning.
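The lemma has a simple probabilistic interpretation: sampling s atoms i.i.d. according to the convex weights and averaging them yields expected squared error at most b²/s. The following numpy sketch (our own illustration, with arbitrary dimensions) verifies this empirically:

```python
import numpy as np

rng = np.random.default_rng(2)

# Maurey's empirical-method argument: average s atoms sampled i.i.d. from the
# convex weights; the expected squared error is at most b^2 / s.
dim, n_atoms, s = 50, 30, 10
G = rng.standard_normal((n_atoms, dim))
b = float(np.max(np.linalg.norm(G, axis=1)))    # uniform norm bound on the atoms
weights = rng.random(n_atoms)
weights /= weights.sum()
f = weights @ G                                  # a point in the convex hull of G

trials = 2000
errs = []
for _ in range(trials):
    idx = rng.choice(n_atoms, size=s, p=weights)
    f_s = G[idx].mean(axis=0)                    # combination of at most s atoms
    errs.append(float(np.sum((f_s - f) ** 2)))
mean_sq_err = float(np.mean(errs))
bound = b * b / s
```

Since the expectation is below b²/s, at least one draw achieves the bound, which is exactly the existence claim of Lemma 3.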

## 3 Learning Halfspaces

In this section, we assume that the function f is a linear mapping f(x) = ⟨w, x⟩, so that our cost function can be written as

 ℓ(w) := ∑_{i=1}^n α_i h(−y_i ⟨w, x_i⟩). (8)

We present two polynomial-time algorithms to approximately minimize this cost function over certain types of ℓ_p-balls. Both algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. The first algorithm initializes the weight vector by drawing it uniformly from a sphere. We then present and analyze a second algorithm in which the initialization follows from the solution of a least-squares problem; it attains the stronger theoretical guarantees promised in the introduction.

### 3.1 Initialization by uniform sampling

We first analyze a very simple algorithm based on uniform sampling. Letting w* denote a minimizer of the objective function (8), by rescaling as necessary, we may assume that ∥w*∥₂ ≤ 1. As noted earlier, by redefining the function h as necessary, we may also assume that the points x_i all lie inside the Euclidean ball of radius one—that is, ∥x_i∥₂ ≤ 1 for all i ∈ [n].

Given this set-up, a simple way in which to estimate w* is to first draw a vector uniformly from the Euclidean unit sphere, and then apply an iterative scheme to minimize the loss function with the randomly drawn vector as the initial point. At an intuitive level, this algorithm will find the global optimum if the initial weight is drawn sufficiently close to w*, so that the iterative optimization method converges to the global minimum. However, by calculating volumes of spheres in high dimensions, it could require exponentially many (in the dimension d) rounds of random sampling before drawing a vector that is sufficiently close to w*, which makes the approach computationally intractable unless the dimension is small.

In order to remove this exponential dependence, Algorithm 1 draws the initial vector from a sphere of radius r, where r should be viewed as a hyper-parameter of the algorithm. One should choose a greater r for a faster algorithm, but with a less accurate solution. The following theorem characterizes the trade-off between the accuracy and the time complexity.

###### Theorem 1.

For given ϵ, δ ∈ (0, 1), with appropriate choices of the algorithm's parameters (specified in Appendix B), Algorithm 1 outputs, with probability at least 1 − δ, a vector ŵ which satisfies:

 ℓ(ŵ) ≤ ℓ(w*) + 6ϵL.

The time complexity is polynomial in (n, d), multiplied by a term that depends only on (L, ϵ, δ).

The proof of Theorem 1, provided in Appendix B, uses the Johnson-Lindenstrauss lemma. More specifically, suppose that we project the weight vector w* and all data points to a random subspace and properly scale the projected vectors. The Johnson-Lindenstrauss lemma then implies that with a constant probability, the inner products are almost invariant under the projection—that is,

 ⟨w*, x_i⟩ ≈ ⟨ϕ(w*), ϕ(x_i)⟩ = ⟨r ϕ(w*), x_i⟩  for every i ∈ [n],

where r is the scale factor of the projection. As a consequence, the vector r ϕ(w*) will approximately minimize the loss. If we draw a vector uniformly from the sphere of the random subspace and find that it is sufficiently close to ϕ(w*), then we call it a successful draw. The probability of a successful draw depends only on the dimension k of the subspace, independent of the original dimension d (see Lemma 2). If the draw is successful, then we use the scaled vector as the initialization, so that it approximately minimizes the loss. Note that drawing from the unit sphere of a random subspace is equivalent to directly drawing from the unit sphere of the original space R^d, so that there is no algorithmic need to explicitly construct the random subspace.
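The following numpy sketch illustrates the restart-then-optimize template in its simplest form: initial vectors drawn uniformly from the unit sphere, followed by projected gradient steps on the surrogate loss. It is a hypothetical stand-in for Algorithm 1, not the paper's procedure: the function names, step sizes, and round counts are our own choices, and we omit the random-subspace device:

```python
import numpy as np

rng = np.random.default_rng(3)
L = 5.0

def h(z):                                   # surrogate from equation (3)
    return np.clip(L * z + 0.5, 0.0, 1.0)

def h_grad(z):                              # slope L inside the linear band
    return np.where(np.abs(z) < 1.0 / (2 * L), L, 0.0)

def loss(w, X, y):
    return float(np.mean(h(-y * (X @ w))))

def random_restart_fit(X, y, rounds=200, steps=50, lr=0.1):
    """Draw each initial vector uniformly from the unit sphere, run a few
    projected gradient steps on the surrogate loss, and keep the best."""
    n, d = X.shape
    best_w, best_loss = np.zeros(d), np.inf
    for _ in range(rounds):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)              # uniform direction on the sphere
        for _ in range(steps):
            z = -y * (X @ w)
            g = (h_grad(z) * (-y)) @ X / n  # gradient of the surrogate loss
            w = w - lr * g
            nrm = np.linalg.norm(w)
            if nrm > 1.0:                   # project back onto the unit ball
                w /= nrm
        cur = loss(w, X, y)
        if cur < best_loss:
            best_w, best_loss = w, cur
    return best_w, best_loss

# Separable toy data on the unit sphere: restarts should find a low-loss w.
d = 5
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((300, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X @ w_star)
w_hat, final_loss = random_restart_fit(X, y)
```

The inner loop is where an arbitrary optimizer (e.g., stochastic gradient descent) would be plugged in; the outer loop is the random initialization whose success probability the theory controls.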

It is worthwhile to note some important deficiencies of Algorithm 1. First, the algorithm outputs a vector ŵ that is not guaranteed to lie within the Euclidean unit ball. Second, the ℓ₂-norm constraints on w and x cannot be generalized to other norms. Third, the complexity term has an exponent growing with L/ϵ. Our second algorithm overcomes these limitations.

### 3.2 Initialization by solving a least-square problem

We turn to a more general setting in which w and x are bounded in a general ℓ_p-norm for some p ≥ 1. Letting q denote the associated dual exponent (i.e., such that 1/p + 1/q = 1), we assume that ∥w*∥_p ≤ 1 and ∥x_i∥_q ≤ 1 for every i ∈ [n]. Note that this generalizes our previous set-up, which applied to the case p = q = 2. In this setting, Algorithm 2 is a procedure that outputs an approximate minimizer of the loss function. In each iteration, it draws k points from the data set, and then constructs a random least-squares problem based on these samples. The solution to this problem is used to initialize an optimization step.

The success of Algorithm 2 relies on the following observation: if we sample k points independently from the dataset, then Lemma 1 implies that the sample-based loss G will be sufficiently close to the original loss ℓ. Thus, it suffices to minimize the sample-based loss. Note that G(w) is uniquely determined by the inner products ⟨w, x′_1⟩, …, ⟨w, x′_k⟩. If there is a vector ŵ satisfying ⟨ŵ, x′_j⟩ ≈ ⟨w*, x′_j⟩ for every j ∈ [k], then its performance on the sample-based loss will be equivalent to that of w*. As a consequence, if we draw a vector u ∈ R^k that is sufficiently close to (⟨w*, x′_1⟩, …, ⟨w*, x′_k⟩) (called a successful u), then we can approximately minimize the sample-based loss by solving the system ⟨w, x′_j⟩ = u_j, or alternatively by minimizing ∑_{j=1}^k (⟨w, x′_j⟩ − u_j)². The latter problem can be solved by a convex program in polynomial time. The probability of drawing a successful u only depends on k, independent of the input dimension and the sample size. This allows the time complexity to be polynomial in (n, d). The trade-off between the target excess risk and the time complexity is characterized by the following theorem. For given ϵ > 0, it is based on running Algorithm 2 with the choices

 k := ⌈2 log d / ϵ²⌉ if p = 1, and k := ⌈(q − 1) / ϵ²⌉ if p > 1, (9)

along with a corresponding number of rounds T (see Appendix C).

###### Theorem 2.

For given ϵ, δ ∈ (0, 1), with the choices of k and T given above, Algorithm 2 outputs a vector ŵ such that

 ℓ(ŵ) ≤ ℓ(w*) + 11ϵL  with probability at least 1 − δ.

The time complexity is polynomial in (n, d), multiplied by a term that grows exponentially in k.

See Appendix C for the proof. Theorem 2 shows that the time complexity of the algorithm has polynomial dependence on (n, d) but exponential dependence on L/ϵ. Shalev-Shwartz et al. (2011) proved a similar complexity bound when the function h takes the piecewise-linear form (3), but our algorithm applies to arbitrary L-Lipschitz continuous functions. We note that the result is interesting only when k < d, since otherwise the same time complexity can be achieved by a grid search within the d-dimensional unit ball.
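In the same spirit as the sketch for Algorithm 1, here is a simplified illustration of the least-squares initialization idea behind Algorithm 2: guess the k inner products u, recover a candidate w by least squares, and keep the best candidate over many rounds. The sampling scheme for u and all parameter values are our own simplifications, and we omit the subsequent local-optimization step:

```python
import numpy as np

rng = np.random.default_rng(4)
L = 5.0

def h(z):
    return np.clip(L * z + 0.5, 0.0, 1.0)

def loss(w, X, y):
    return float(np.mean(h(-y * (X @ w))))

def least_squares_init_fit(X, y, k=8, rounds=300):
    """Each round: subsample k points, guess their inner products u, recover
    a candidate weight vector by least squares, keep the best candidate."""
    n, d = X.shape
    best_w, best_loss = np.zeros(d), 0.5    # w = 0 has surrogate loss 1/2
    for _ in range(rounds):
        idx = rng.choice(n, size=k, replace=True)
        A = X[idx]
        u = rng.standard_normal(k)          # a random guess for <w*, x'_j>
        u *= np.sqrt(k) * rng.random() / np.linalg.norm(u)
        w, *_ = np.linalg.lstsq(A, u, rcond=None)
        nrm = np.linalg.norm(w)
        if nrm > 1.0:
            w = w / nrm                     # project onto the unit ball
        cur = loss(w, X, y)
        if cur < best_loss:
            best_w, best_loss = w, cur
    return best_w, best_loss

# Separable toy data on the unit sphere.
d = 5
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((300, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X @ w_star)
w_hat, best = least_squares_init_fit(X, y)
```

The key structural point carried over from the analysis is that the guessed vector u lives in R^k, so the success probability of a round depends only on k, not on d or n.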

### 3.3 Hardness result

In Theorem 2, the time complexity has an exponential dependence on L/ϵ. Shalev-Shwartz et al. (2011) show that the time complexity cannot be polynomial in L even for improper learning. It is natural to wonder whether Algorithm 2 can be improved to have polynomial dependence on L/ϵ. In this section, we provide evidence that this is unlikely to be the case.

To prove the hardness result, we reduce from the MAX-2-SAT problem, which is known to be NP-hard. In particular, we show that if there is an algorithm solving the minimization problem (8), then it also solves the MAX-2-SAT problem. Let us recall the MAX-2-SAT problem:

###### Definition (Max-2-Sat).

Given literals z_1, …, z_d and clauses c_1, …, c_n, each clause is the conjunction of two arguments, each of which may be either a literal or the negation of a literal.***In the standard MAX-2-SAT setup, each clause is the disjunction of two literals. However, any disjunction clause can be reduced to three conjunction clauses. In particular, a clause z_i ∨ z_j is satisfied if and only if one of the following is satisfied: z_i ∧ z_j, ¬z_i ∧ z_j, z_i ∧ ¬z_j. The goal is to determine the maximum number of clauses that can be simultaneously satisfied by an assignment.

Since our interest is to prove a lower bound, it suffices to study a special case of the general minimization problem—namely, one in which the importance weights are uniform (α_i = 1/n). The following proposition shows that if h is the piecewise-linear function (3), then approximately minimizing the loss function is hard. See Appendix D for the proof.

###### Proposition 1.

Let h be the piecewise-linear function (3) with Lipschitz constant L. Unless RP = NP, where RP is the class of problems solvable by randomized polynomial-time algorithms, there is no randomized poly(n, d, L)-time algorithm computing a vector ŵ that achieves an arbitrarily small excess risk with constant probability.

Proposition 1 provides strong evidence that learning halfspaces with respect to a continuous sigmoidal loss cannot be done in poly(L) time. We note that Hush (1999) proved a similar hardness result, but without the unit-norm constraints on w and x. The non-convex ERM problem without a unit-norm constraint is notably harder than ours, so this particular hardness result does not apply to our problem setup.

## 4 Learning Neural Networks

Let us now turn to the case in which the function f represents a neural network. Given two numbers p and q such that 1/p + 1/q = 1, we assume that the input vector satisfies ∥x∥_q ≤ 1 for every x ∈ X. The class of m-layer neural networks is recursively defined in the following way. A one-layer neural network is a linear mapping from R^d to R, and we consider the set of mappings:

 N₁ := {x ↦ ⟨w, x⟩ : ∥w∥_p ≤ B}.

For m ≥ 2, an m-layer neural network is a linear combination of (m−1)-layer neural networks activated by a sigmoid function, and so we define:

 N_m := {x ↦ ∑_{j=1}^d w_j σ(f_j(x)) : d < ∞, f_j ∈ N_{m−1}, ∥w∥₁ ≤ B}.

In this definition, the activation function σ is an arbitrary 1-Lipschitz continuous function. At each hidden layer, we allow the number of neurons to be arbitrarily large, but the per-unit ℓ₁-norm must be bounded by a constant B. This regularization scheme has been studied by Bartlett (1998); Koltchinskii and Panchenko (2002); Bartlett and Mendelson (2003); Neyshabur et al. (2015).
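To make the recursive definition concrete, the following sketch builds a member of N₂ with tanh as one admissible 1-Lipschitz activation; the helper names and the bound B = 2 are our own choices:

```python
import numpy as np

rng = np.random.default_rng(5)
B = 2.0

def one_layer(w):
    """A member of N_1: x -> <w, x> (here with the l2-norm bound on w)."""
    return lambda x: float(np.dot(w, x))

def next_layer(fs, w):
    """Combine (m-1)-layer networks fs into an m-layer one; requires ||w||_1 <= B."""
    assert np.sum(np.abs(w)) <= B + 1e-12
    return lambda x: float(sum(wj * np.tanh(f(x)) for wj, f in zip(w, fs)))

d = 4
x = rng.standard_normal(d)

# Two-layer example: three hidden units, tanh as the 1-Lipschitz activation.
hidden = [one_layer(rng.standard_normal(d)) for _ in range(3)]
w_out = np.array([0.5, -1.0, 0.4])          # ||w_out||_1 = 1.9 <= B
net = next_layer(hidden, w_out)
value = net(x)
```

Because |tanh| ≤ 1, the output of any such combination is bounded by ∥w∥₁, which is the mechanism by which the ℓ₁ constraint controls the complexity of the class.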

Assuming a constant ℓ₁-norm bound might be restrictive for some applications, but without this norm constraint, the neural network class activated by any sigmoid-like or ReLU-like function is not efficiently learnable (Zhang et al., 2015, Theorem 3). On the other hand, the ℓ₁-regularization imposes sparsity on the neural network. It is observed in practice that sparse neural networks, such as convolutional neural networks, are capable of learning meaningful representations. Moreover, it has been argued that sparse connectivity is a natural constraint that can lead to improved performance in practice (see, e.g., Thom and Palm, 2013).

### 4.1 Agnostic learning

In the agnostic setting, it is not assumed that there exists a neural network separating the data. Instead, our goal is to compute a neural network that minimizes the loss function over the given network class. Letting f* ∈ N_m be the network that minimizes the empirical loss ℓ, we now present and analyze a method (see Algorithm 3) that computes a network whose loss is at most an additive term worse than that of f*. We first state our main guarantee for this algorithm, before providing intuition. More precisely, for any ϵ, δ ∈ (0, 1), the following theorem applies to Algorithm 3 with the choices:

 k := ⌈q/ϵ²⌉,  s := ⌈1/ϵ²⌉,  T := ⌈5 (4/ϵ)^{k(s^m − 1)/(s − 1)} log(1/δ)⌉. (11)
###### Theorem 3.

For given ϵ, δ ∈ (0, 1), with the choices of (k, s, T) given above, Algorithm 3 outputs a predictor f̂ such that

 ℓ(f̂) ≤ ℓ(f*) + (2m + 9) ϵ L B^m  with probability at least 1 − δ. (12)

The computational complexity is polynomial in (n, d) and linear in the number of rounds T.

We remark that if m = 1, then N₁ is a class of linear mappings. Thus, Algorithm 2 can be viewed as a special case of Algorithm 3 for learning one-layer neural networks. See Appendix E for the proof of Theorem 3.

The intuition underlying Algorithm 3 is similar to that of Algorithm 2. Each iteration involves resampling k independent points from the dataset. By the Rademacher generalization bound, minimizing the sample-based loss G will approximately minimize the original loss ℓ. The value of G(f) is uniquely determined by the vector (f(x′_1), …, f(x′_k)). As a consequence, if we draw a vector u ∈ R^k sufficiently close to (f*(x′_1), …, f*(x′_k)), then a nearly-optimal neural network will be obtained by approximately solving the system f(x′_j) = u_j, or equivalently minimizing ∑_{j=1}^k (f(x′_j) − u_j)².

In general, directly solving the equation f(x′_j) = u_j would be difficult even if the vector u were known. In particular, since our class N_m is highly non-linear, solving this equation cannot be reduced to solving a convex program. On the other hand, suppose that we write f* = ∑_{l=1}^s w_l σ(f*_l) for some functions f*_l ∈ N_{m−1}. Then the problem becomes much easier if the quantities σ(f*_l(x′_j)) are already known for every pair (l, j). With this perspective in mind, we can approximately solve the equation by minimizing

 min_{w ∈ R^s : ∥w∥₁ ≤ B}  ∑_{j=1}^k ( ∑_{l=1}^s w_l σ(f*_l(x′_j)) − u_j )². (13)

Accordingly, suppose that we draw vectors u^(1), …, u^(s) ∈ R^k such that each u^(l) is sufficiently close to (f*_l(x′_1), …, f*_l(x′_k))—any such draw is called successful. We may then recursively compute (m−1)-layer networks g_1, …, g_s by approximately solving the equations g_l(x′_j) = u^(l)_j, and then rewriting problem (13) as

 min_{w ∈ R^s : ∥w∥₁ ≤ B}  ∑_{j=1}^k ( ∑_{l=1}^s w_l σ(g_l(x′_j)) − u_j )².

This convex program matches the problem (10) in Algorithm 3. Note that the probability of a successful draw depends on the width s. Although there is no constraint on the number of neurons of f*, the Maurey-Barron-Jones lemma (Lemma 3) asserts that it suffices to choose s = ⌈1/ϵ²⌉ to compute an ϵ-accurate approximation. We refer the reader to Appendix E for the detailed proof.

### 4.2 Learning with separable data

We turn to the case in which the data are separable with a positive margin. Throughout this section, we assume that the activation function σ is an odd function (i.e., σ(−x) = −σ(x)). We say that a given data set is separable with margin γ, or γ-separable for short, if there is a network f ∈ N_m such that y_i f(x_i) ≥ γ for each i ∈ [n]. Given a distribution P over the space X × {−1, +1}, we say that it is γ-separable if there is a network f ∈ N_m such that y f(x) ≥ γ almost surely (with respect to P).

Algorithm 4 learns a neural network on separable data. It uses the AdaBoost approach (Freund and Schapire, 1997) to construct the network, and we refer to it as the BoostNet algorithm. In each iteration, it trains a weak classifier with an error rate slightly better than random guessing, then adds the weak classifier to the strong classifier to construct an m-layer network. The weak classifier is trained by Algorithm 3 (or by Algorithm 1 or Algorithm 2 if the weak classifiers are one-layer networks). The following theorem provides guarantees for its performance when it is run for

 T := ⌈16 B² log(n + 1) / γ²⌉

iterations. The running time depends on a quantity that is a constant for any choice of the triple (B, L, γ), but with exponential dependence on 1/γ.

###### Theorem 4.

With the above choice of T, the BoostNet algorithm achieves:

1. In-sample error: For any γ-separable dataset, Algorithm 4 outputs a neural network f̂ ∈ N_m such that,

 y_i f̂(x_i) ≥ γ/16  for every i ∈ [n],  with probability at least 1 − δ.

The time complexity is polynomial in (n, d), multiplied by a constant depending on (B, L, γ, δ).

2. Generalization error: Given a data set consisting of i.i.d. samples from any γ-separable distribution P, Algorithm 4 outputs a network f̂ ∈ N_m such that

 P[sign(f̂(x)) ≠ y] ≤ ϵ  with probability at least 1 − 2δ. (15)

Moreover, the required sample size is polynomial in 1/ϵ, and the time complexity is polynomial in (n, d).

See Appendix F for the proof. The most technical work is devoted to proving part 1. The generalization bound in part 2 follows by combining part 1 with bounds on the Rademacher complexity of the network class, which then allow us to translate the in-sample error bound to generalization error in the usual way. It is worth comparing the BoostNet algorithm with the general algorithm for agnostic learning: in order to bound the generalization error by ϵ, the time complexity of Algorithm 3 would be exponential in 1/ϵ.
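The boosting template underlying BoostNet can be sketched as classical AdaBoost whose weak hypotheses are halfspaces; structurally, the weighted vote is then a two-layer network with a sign activation. The weak learner below (best of several random halfspaces) is a hypothetical stand-in for the paper's Algorithms 1-3, and all parameter values are our own choices:

```python
import numpy as np

rng = np.random.default_rng(6)

def weak_learner(X, y, dist, tries=300):
    """Hypothetical stand-in for the paper's weak learners: the best of
    several random halfspaces on the current weighted sample."""
    best_w, best_err = None, 1.0
    for _ in range(tries):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        err = float(np.sum(dist * (np.sign(X @ w) != y)))
        if err > 0.5:                      # flipping the halfspace flips the error
            w, err = -w, 1.0 - err
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

def boostnet_sketch(X, y, T=50):
    """AdaBoost over halfspace weak hypotheses; the weighted vote is,
    structurally, a two-layer network with sign activations."""
    n = len(y)
    dist = np.full(n, 1.0 / n)
    units, betas = [], []
    for _ in range(T):
        w, err = weak_learner(X, y, dist)
        err = min(max(err, 1e-10), 0.5 - 1e-10)
        beta = 0.5 * np.log((1.0 - err) / err)
        pred = np.sign(X @ w)
        dist *= np.exp(-beta * y * pred)   # re-weight toward hard examples
        dist /= dist.sum()
        units.append(w)
        betas.append(beta)
    W, b = np.array(units), np.array(betas)
    return lambda Z: np.sign(np.sign(Z @ W.T) @ b)

# Linearly separable toy data: boosting should drive the training error down.
d = 6
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((400, d))
y = np.sign(X @ w_star)
F = boostnet_sketch(X, y)
train_err = float(np.mean(F(X) != y))
```

In the paper's construction the sign activation is replaced by the odd 1-Lipschitz activation σ and the weak learners are (m−1)-layer networks, which is what yields a proper m-layer network rather than a generic ensemble.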

The same learnability result can be established even if the labels are randomly corrupted. Formally, for every pair (x, y) sampled from a γ-separable distribution, suppose that the learning algorithm actually receives the corrupted pair (x, ỹ), where

 ỹ = y with probability 1 − η,  and  ỹ = −y with probability η.

Here the parameter η ∈ [0, 1/2) corresponds to the noise level. Since the labels are flipped, the BoostNet algorithm cannot be directly applied. However, we can use the improper learning algorithm of Zhang et al. (2015) to learn an improper classifier that matches the uncorrupted labels with high probability, and then apply the BoostNet algorithm taking the predictions of this classifier as input. Doing so yields the following guarantee:

###### Corollary 1.

Assume that and . For any constant , consider the neural network class activated by the erf function. (The erf function can be replaced by any function satisfying a polynomial expansion , such that for any finite .) Given a random dataset of size for any -separable distribution, there is a -time algorithm that outputs a network such that

$$\mathbb{P}\bigl(\mathrm{sign}(\hat{f}(x)) \neq y\bigr) \le \epsilon \quad \text{with probability at least } 1-\delta.$$

See Appendix G for the proof.
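The corruption model and the two-stage pipeline described above can be sketched as follows. This is an illustrative sketch: `improper_learn` and `boostnet` are hypothetical placeholders for the improper learning algorithm of Zhang et al. (2015) and for the BoostNet algorithm, respectively.

```python
import numpy as np

def corrupt_labels(y, eta, rng):
    """Random classification noise: each label is flipped
    independently with probability eta (the noise level)."""
    flips = rng.random(len(y)) < eta
    return np.where(flips, -y, y)

def denoise_then_boost(X, y_noisy, improper_learn, boostnet):
    """Two-stage pipeline sketched in the text: learn an improper
    classifier g on the noisy labels, then feed its predicted
    (denoised) labels to BoostNet."""
    g = improper_learn(X, y_noisy)   # placeholder: improper learner
    y_clean = np.sign(g(X))          # predicted labels
    return boostnet(X, y_clean)      # placeholder: BoostNet on denoised data
```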

### 4.3 Hardness result for γ-separable problems

Finally, we present a hardness result showing that the dependence on is hard to improve. Our proof relies on the hardness of standard (non-agnostic) PAC learning of intersections of halfspaces, established by Klivans et al. (2006). More precisely, consider the family of halfspace indicator functions mapping to , given by

Given a -tuple of functions belonging to , we define the intersection function

$$h(x) = \begin{cases} 1 & \text{if } h_1(x) = \cdots = h_T(x) = 1, \\ -1 & \text{otherwise}, \end{cases}$$

which represents the intersection of halfspaces. Let denote the set of all such functions. For any distribution on , we want an algorithm that takes a sequence of as input, where is a sample from and . It should output a function such that with probability at least . If there is such an algorithm whose sample complexity and time complexity scale as , then we say that is efficiently learnable. Klivans et al. (2006) show that if , then is not efficiently learnable under a certain cryptographic assumption. This hardness statement implies the hardness of learning neural networks with separable data.
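As a concrete illustration, the intersection function can be coded directly from its definition. The vector-threshold form of the halfspace indicator below is an assumed form (the precise definition appears in the display above):

```python
import numpy as np

def halfspace(w, b):
    """Halfspace indicator (assumed form): +1 if <w, x> >= b, else -1."""
    return lambda x: 1 if np.dot(w, x) >= b else -1

def intersection(hs):
    """h(x) = 1 iff every halfspace outputs +1, and -1 otherwise."""
    return lambda x: 1 if all(h(x) == 1 for h in hs) else -1
```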

###### Proposition 2.

Assume that is not efficiently learnable for . Consider the class of two-layer neural networks activated by the piecewise linear function or the ReLU function , with the norm constraint . Consider any algorithm that, when applied to any -separable data distribution, is guaranteed to output a neural network satisfying with probability at least . Then it cannot run in -time.

See Appendix H for the proof.

## 5 Simulation

In this section, we compare the BoostNet algorithm with the classical backpropagation method for training two-layer neural networks. The goal is to learn parity functions from noisy data, a challenging problem in computational learning theory (see, e.g., Blum et al., 2003). We construct a synthetic dataset with points. Each point is generated as follows: first, the vector is drawn uniformly from and concatenated with a constant as the -th coordinate. The label is then generated as follows: for some unknown subset of indices , we set

$$y = \begin{cases} x_{i_1} x_{i_2} \cdots x_{i_p} & \text{with probability } 0.9, \\ -x_{i_1} x_{i_2} \cdots x_{i_p} & \text{with probability } 0.1. \end{cases}$$

The goal is to learn a function such that predicts the value of . The optimal rate is achieved by the parity function , in which case the prediction error is . If the parity degree , then the optimal rate cannot be achieved by any linear classifier.
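A sketch of this data-generating process, assuming the coordinates are uniform over {-1, +1} and the appended constant coordinate equals 1 (both assumptions, since the exact values are elided above):

```python
import numpy as np

def noisy_parity_data(n, d, S, rng, flip_prob=0.1):
    """Synthetic noisy-parity dataset: coordinates uniform in
    {-1, +1} (assumed), constant last coordinate (assumed to be 1),
    label = parity over index set S, flipped with probability 0.1."""
    X = rng.choice([-1.0, 1.0], size=(n, d))
    X = np.hstack([X, np.ones((n, 1))])   # constant final coordinate
    y = np.prod(X[:, S], axis=1)          # parity x_{i1} * ... * x_{ip}
    flips = rng.random(n) < flip_prob     # label noise
    return X, np.where(flips, -y, y)
```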

We choose and . The activation function is chosen as . The training, validation, and test sets contain 25K, 5K, and 20K points, respectively. To train a two-layer BoostNet, we choose the hyper-parameter and select Algorithm 3 as the subroutine for training weak classifiers, with hyper-parameters . To train a classical two-layer neural network, we use the random initialization scheme of Nguyen and Widrow (1990) and the backpropagation algorithm of Møller (1993). For both methods, the algorithm is executed for ten independent rounds and the best solution is selected.

Figure 2 compares the prediction errors of BoostNet and backpropagation. Both methods generate the same two-layer network architecture, so we compare them with respect to the number of hidden nodes. Note that BoostNet constructs hidden nodes incrementally, while NeuralNet trains a predefined number of neurons. Figure 2 shows that both algorithms learn the degree-2 parity function with a few hidden nodes. In contrast, BoostNet learns the degree-5 parity function with fewer than 50 hidden nodes, while NeuralNet's performance is no better than random guessing. This suggests that the BoostNet algorithm is less likely to be trapped in a bad local optimum in this setting.

## 6 Conclusion

In this paper, we have proposed algorithms for learning halfspaces and neural networks with non-convex loss functions. We demonstrated that the time complexity is polynomial in the input dimension and in the sample size, but exponential in the excess risk. We also presented a hardness result concerning the necessity of this exponential dependence. The algorithms perform randomized initialization followed by optimization steps. This idea coincides with heuristics that are widely used in practice, but our theoretical analysis suggests that a careful treatment of the initialization step is necessary. We also proposed the BoostNet algorithm and showed that it can learn a neural network in polynomial time when the data are separable with a constant margin. We suspect that the theoretical results of this paper are conservative, in that, when applied to real data, the algorithms can be much more efficient than the bounds suggest.

### Acknowledgements:

MW and YZ were partially supported by grant CIF-31712-23800 from the National Science Foundation, grant AFOSR-FA9550-14-1-0016 from the Air Force Office of Scientific Research, and ONR MURI grant N00014-11-1-0688 from the Office of Naval Research. MJ and YZ were partially supported by the U.S. ARL and the U.S. ARO under contract/grant number W911NF-11-1-0391. We thank Sivaraman Balakrishnan for helpful comments on an earlier draft.

## References

• Arora et al. (2013) S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. arXiv:1310.6343, 2013.
• Awasthi et al. (2014) P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 449–458. ACM, 2014.
• Awasthi et al. (2015) P. Awasthi, M.-F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. arXiv:1503.03594, 2015.
• Barron (1993) A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
• Bartlett (1998) P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
• Bartlett and Mendelson (2003) P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
• Ben-David and Simon (2001) S. Ben-David and H. U. Simon. Efficient learning of linear perceptrons. In Advances in Neural Information Processing Systems, volume 13, page 189. MIT Press, 2001.
• Birnbaum and Shwartz (2012) A. Birnbaum and S. S. Shwartz. Learning halfspaces with the zero-one loss: time-accuracy tradeoffs. In Advances in Neural Information Processing Systems, volume 24, pages 926–934, 2012.
• Blum and Rivest (1992) A. Blum and R. L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
• Blum et al. (1998) A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1-2):35–52, 1998.
• Blum et al. (2003) A. Blum, A. Kalai, and H. Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM, 50(4):506–519, 2003.
• Daniely et al. (2014) A. Daniely, N. Linial, and S. Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 441–448. ACM, 2014.
• Dasgupta and Gupta (1999) S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report TR-99-006, International Computer Science Institute, 1999.
• Freund and Schapire (1997) Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
• Guruswami and Raghavendra (2009) V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.
• Hush (1999) D. R. Hush. Training a sigmoidal node is hard. Neural Computation, 11(5):1249–1260, 1999.
• Janzamin et al. (2015) M. Janzamin, H. Sedghi, and A. Anandkumar. Generalization bounds for neural networks through tensor factorization. arXiv:1506.08473, 2015.
• Kakade and Tewari (2008) S. Kakade and A. Tewari. Lecture note: Rademacher composition and linear prediction. 2008.
• Kakade et al. (2009) S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 21, pages 793–800, 2009.
• Kalai et al. (2008) A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
• Klivans and Kothari (2014) A. Klivans and P. Kothari. Embedding hard learning problems into gaussian space. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 28:793–809, 2014.
• Klivans et al. (2006) A. R. Klivans and A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In 47th Annual IEEE Symposium on Foundations of Computer Science, pages 553–562. IEEE, 2006.
• Klivans et al. (2009) A. R. Klivans, P. M. Long, and R. A. Servedio. Learning halfspaces with malicious noise. The Journal of Machine Learning Research, 10:2715–2740, 2009.
• Koltchinskii and Panchenko (2002) V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.
• Ledoux and Talagrand (2013) M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 2013.
• Livni et al. (2014) R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, volume 26, pages 855–863, 2014.
• Møller (1993) M. F. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525–533, 1993.
• Neyshabur et al. (2015) B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. arXiv preprint arXiv:1503.00036, 2015.
• Nguyen and Widrow (1990) D. Nguyen and B. Widrow. Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. In International Joint Conference on Neural Networks, pages 21–26, 1990.
• Pisier (1980) G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse Fonctionnelle, pages 1–12, 1980.
• Rosenblatt (1958) F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
• Schapire and Singer (1999) R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
• Sedghi and Anandkumar (2014) H. Sedghi and A. Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv:1412.2693, 2014.
• Servedio and Valiant (2001) R. Servedio and L. Valiant. Efficient algorithms in computational learning theory. Harvard University, Cambridge, MA, 2001.
• Shalev-Shwartz and Singer (2010) S. Shalev-Shwartz and Y. Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. Machine Learning, 80(2-3):141–163, 2010.
• Shalev-Shwartz et al. (2011) S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.
• Thom and Palm (2013) M. Thom and G. Palm. Sparse activity and sparse connectivity in supervised learning. The Journal of Machine Learning Research, 14(1):1091–1143, 2013.
• Vapnik (1998) V. N. Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.
• Zhang et al. (2015) Y. Zhang, J. D. Lee, and M. I. Jordan. -regularized neural networks are improperly learnable in polynomial time. arXiv:1510.03528, 2015.

## Appendix A Proof of Lemma 1

The following inequality always holds:

$$\sup_{f\in\mathcal{F}} |G(f)-\ell(f)| \le \max\Bigl\{ \sup_{f\in\mathcal{F}} \{G(f)-\ell(f)\},\; \sup_{f'\in\mathcal{F}} \{\ell(f')-G(f')\} \Bigr\}.$$

Since contains the constant zero function, both and are non-negative, which implies

$$\sup_{f\in\mathcal{F}} |G(f)-\ell(f)| \le \sup_{f\in\mathcal{F}} \{G(f)-\ell(f)\} + \sup_{f'\in\mathcal{F}} \{\ell(f')-G(f')\}.$$

To establish Lemma 1, it suffices to prove:

$$\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} \{G(f)-\ell(f)\}\Bigr] \le 2L\mathcal{R}_k(\mathcal{F}) \quad\text{and}\quad \mathbb{E}\Bigl[\sup_{f'\in\mathcal{F}} \{\ell(f')-G(f')\}\Bigr] \le 2L\mathcal{R}_k(\mathcal{F}).$$

For the rest of the proof, we will establish the first upper bound. The second bound can be established through an identical series of steps.
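For intuition, the empirical Rademacher complexity appearing in these bounds can be approximated by Monte Carlo when the function class is finite. The following sketch is illustrative only and is not part of the proof; it represents each function by its vector of sample losses.

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_trials, rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    a finite class: loss_matrix has one row per function, one column
    per sample.  Estimates E_eps[ sup_f (1/k) sum_j eps_j * loss_f(j) ]."""
    m, k = loss_matrix.shape
    total = 0.0
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=k)    # Rademacher signs
        total += np.max(loss_matrix @ eps) / k   # sup over the class
    return total / n_trials
```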

The inequality follows as a consequence of classical symmetrization techniques (e.g., Bartlett and Mendelson, 2003) and the Ledoux-Talagrand contraction inequality (e.g., Ledoux and Talagrand, 2013, Corollary 3.17). However, so as to keep the paper self-contained, we provide a detailed proof here. By the definitions of and , we have

where is an i.i.d. copy of . Applying Jensen’s inequality yields

$$\begin{aligned}
&\le \mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} \Bigl\{ \frac{1}{k} \sum_{j=1}^{k} \bigl( h(-y'_j f(x'_j)) - h(-y''_j f(x''_j)) \bigr) \Bigr\}\Bigr] \\
&= \mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} \Bigl\{ \frac{1}{k} \sum_{j=1}^{k} \varepsilon_j \bigl( h(-y'_j f(x'_j)) - h(-y''_j f(x''_j)) \bigr) \Bigr\}\Bigr] \\
&\le \mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} \Bigl\{ \frac{1}{k} \sum_{j=1}^{k} \varepsilon_j h(-y'_j f(x'_j)) \Bigr\} + \sup_{f\in\mathcal{F}} \Bigl\{ \frac{1}{k} \sum_{j=1}^{k} \varepsilon_j h(-y''_j f(x''_j)) \Bigr\}\Bigr] \\
&= 2\,\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} \Bigl\{ \frac{1}{k} \sum_{j=1}^{k} \varepsilon_j h(-y'_j f(x'_j)) \Bigr\}\Bigr]. \qquad (16)
\end{aligned}$$

We need to bound the right-hand side using the Rademacher complexity of the function class