# Learning Depth-Three Neural Networks in Polynomial Time

## Abstract

We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any smooth, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to any distribution on the unit ball in $n$ dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with more than one hidden layer.

Our algorithm, Alphatron, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean function learning via smooth relaxations of hard thresholds, sidestepping traditional hardness results from computational learning theory.

Along these lines, we give improved results for a number of longstanding problems related to Boolean concept learning, unifying a variety of different techniques. For example, we give the first polynomial-time algorithm for learning intersections of halfspaces with a margin (distribution-free) and the first generalization of DNF learning to the setting of probabilistic concepts (queries; uniform distribution). Finally, we give the first provably correct algorithms for common schemes in multiple-instance learning.

## 1 Introduction

Giving provably efficient algorithms for learning neural networks is a longstanding and fundamental challenge in the theory of machine learning. Despite the remarkable achievements obtained in practice from applying tools for learning neural networks, surprisingly little is known from a theoretical perspective. In fact, much theoretical work has led to negative results showing that, from a worst-case perspective, even learning the simplest architectures seems computationally intractable [31]. For example, there are known hardness results for agnostically learning a single halfspace (learning a halfspace with adversarial noise) [29].

As such, much work has focused on finding algorithms that succeed after making various restrictive assumptions on both the network’s architecture and the underlying marginal distribution. Recent work gives evidence that for gradient-based algorithms these types of assumptions are actually necessary [37]. In this paper, we focus on understanding the frontier of efficient neural network learning: what is the most expressive class of neural networks that can be learned, provably, in polynomial time without making any additional assumptions?

### 1.1 Our Results

We give a simple, iterative algorithm that efficiently learns neural networks with one layer of sigmoids feeding into any smooth, monotone activation function (for example, sigmoid or ReLU). Both the first hidden layer of sigmoids and the output activation function have corresponding hidden weight vectors. We assume nothing about these weight vectors other than the standard normalization that they have $\ell_2$-norm equal to one. The algorithm succeeds with respect to any distribution on the unit ball in $n$ dimensions. This is the first provably efficient, assumption-free result for learning neural networks with more than one hidden layer.

Our algorithm, which we call Alphatron, combines the expressive power of kernel methods with an additive update rule inspired by work from isotonic regression. Alphatron also outputs a hypothesis that gives efficient oracle access to interpretable features. That is, if the output activation function is $u$, Alphatron constructs a hypothesis of the form $u(f(\mathbf{x}))$ where $f$ is an implicit encoding of products of features from the instance space, and $f$ yields an efficient algorithm for random access to the coefficients of these products.

More specifically, we obtain the following new supervised learning results by choosing an appropriate kernel function in conjunction with Alphatron:

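• Let $c(x_1, \ldots, x_n)$ be any feedforward neural network with one hidden layer of sigmoids of size $k$ feeding into any activation function $u$ that is monotone and $L$-Lipschitz. Given independent draws $(\mathbf{x}, y)$ from $\mathbb{S}^{n-1} \times [0,1]$ with $E[y \mid \mathbf{x}] = c(\mathbf{x})$, we obtain an efficiently computable hypothesis $h$ such that $E[(h(\mathbf{x}) - c(\mathbf{x}))^2] \leq \epsilon$ with running time and sample complexity $\mathrm{poly}(n, k, 1/\epsilon, L)$ (the algorithm succeeds with high probability).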

• We obtain the first efficient PAC algorithm for learning intersections of polynomially many halfspaces (with a margin) with respect to any distribution on (prior work due to [27] gave a quasipolynomial-time algorithm). We show that this is a special case of a more general class of Boolean learning problems where the goal is to learn Lipschitz-bounded combinations of Boolean functions in the probabilistic concept model due to Kearns and Schapire [26]. In this framework, we can learn smooth combinations of halfspaces (with a margin) whose sample complexity is independent of the number of halfspaces.

• We give the first generalization of well-known results for PAC learning DNF formulas with respect to the uniform distribution (given query access to the unknown DNF) to the setting of probabilistic concepts. Concretely, we give a query algorithm, KMtron, that learns any random variable whose conditional mean is a smooth, monotone combination of functions of bounded $L_1$-norm with respect to the uniform distribution on the hypercube $\{-1,1\}^n$. It is easy to see this captures the function class of polynomial-size DNF formulas.

• We give the first provably efficient algorithms for nontrivial schemes in multiple instance learning (MIL). Consider an MIL scheme where a learner is given a set or bag of instances $\mathbf{x}_1, \ldots, \mathbf{x}_t$, and the learner is told only some function of their labels, namely $u(c(\mathbf{x}_1), \ldots, c(\mathbf{x}_t))$ for some unknown concept $c$ and combining function $u$. We give the first provably efficient algorithms for correctly labeling future bags even if the instances within each bag are not identically distributed. Our algorithms hold if the underlying concept is sigmoidal or a halfspace with a margin. If the combining function averages label values (a common case), we obtain bounds that are independent of the bag size.

Almost all of our results hold in the probabilistic concept model of learning due to Kearns and Schapire [26] and only require that the conditional mean of the label given instance is approximated by bounded-norm elements from a Reproducing Kernel Hilbert Space (RKHS). We learn specifically with respect to square-loss, though this will imply polynomial-time learnability for most commonly studied loss functions.

### 1.2 Relationship to Traditional Boolean Function Learning

PAC learning simple Boolean concept classes has proved challenging. For example, the best known distribution-free algorithm for learning an intersection of just two halfspaces runs in exponential time in the dimension. For learning intersections of polynomially many halfspaces, there are known hardness results based on cryptographic assumptions [29] or the hardness of constraint satisfaction problems [9]. A source of angst in machine learning theory is that these hardness results do not square with recent practical successes for learning expressive function classes.

A key conceptual aspect of this work is to shift from the PAC model to the probabilistic concept model. We hope to revive the probabilistic concept model as a fertile area for constructing supervised learning algorithms, as it can handle both Boolean and real-valued concepts. Additionally, this model has the following interesting benefit that we feel has been overlooked: it allows for Boolean learning problems where an output hypothesis can answer “don’t know” by giving the value $1/2$. As mentioned above, hardness results from computational learning theory indicate that simple Boolean function classes such as intersections of halfspaces can encode pseudorandom outputs. For these classes, PAC learning is hopeless. On the other hand, in the probabilistic concept model, we measure error with respect to square-loss. In this model, for subsets of inputs encoding pseudorandom labels, a hypothesis may simply output $1/2$, which is essentially the optimal strategy.
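The optimality of the "don't know" answer can be checked with a short calculation. This is a sketch under the assumption that, on the inputs in question, the label $y \in \{0,1\}$ behaves like a fair coin from the learner's point of view; the symbol $a$ (the constant value the hypothesis outputs there) is introduced only for this illustration.

```latex
\begin{align*}
E\big[(a - y)^2\big] &= \tfrac{1}{2}\,a^2 + \tfrac{1}{2}\,(1 - a)^2 \\
                     &= a^2 - a + \tfrac{1}{2} \\
                     &= \big(a - \tfrac{1}{2}\big)^2 + \tfrac{1}{4}.
\end{align*}
```

The expression is minimized at $a = 1/2$, where the square-loss is $1/4$, whereas committing to a Boolean guess ($a = 0$ or $a = 1$) gives square-loss $1/2$.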

The models we study in this paper still capture Boolean learning problems. More specifically, let $(\mathbf{x}, y)$ be a random draw from a distribution where $y \in \{0,1\}$ with $E[y \mid \mathbf{x}] = c(\mathbf{x})$. If $c$ always outputs $0$ or $1$, then we are in the typical PAC scenario. On the other hand, $c$ may be a real-valued function. Our approach is to consider Boolean learning problems where the conditional mean function is computed by a real-valued neural network. These networks can be viewed as relaxations of Boolean function classes.

For example, one natural relaxation of an AND of $n$ Boolean inputs would be a piecewise-linear combining function $u$ that is $0$ for all inputs in $[0, n-1]$, equal to $1$ on input $n$, and a line that interpolates between $u(n-1)$ and $u(n)$. Additionally, we could relax any halfspace with weight vector $\mathbf{w}$ defined on the unit sphere to $\sigma(\mathbf{w} \cdot \mathbf{x})$ where $\sigma(z) = \frac{1}{1+e^{-z}}$, a sigmoid. Although PAC learning an intersection of halfspaces seems out of reach, we can give fully polynomial-time algorithms for learning sums of polynomially many sigmoids feeding into $u$ as a probabilistic concept.

### 1.3 Our Approach

The high-level approach is to use algorithms for isotonic regression to learn monotone combinations of functions approximated by elements of a suitable RKHS. Our starting point is the Isotron algorithm, due to Kalai and Sastry [28], and a refinement due to Kakade, Kalai, Kanade and Shamir [21] called the GLMtron. These algorithms efficiently learn any generalized linear model (GLM): distributions on instance-label pairs $(\mathbf{x}, y)$ where the conditional mean of $y$ given $\mathbf{x}$ is equal to $u(\mathbf{w} \cdot \mathbf{x})$ for some (known) smooth, non-decreasing function $u$ and unknown weight vector $\mathbf{w}$. Their algorithms are simple and use an iterative update rule to minimize square-loss, a non-convex optimization problem in this setting. Both of their papers remark that their algorithms can be kernelized, but no concrete applications are given.
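To make the iterative update rule concrete, here is a minimal sketch of a GLMtron-style learner. It assumes the additive update $\mathbf{w}^{t+1} = \mathbf{w}^t + \frac{1}{m}\sum_i (y_i - u(\mathbf{w}^t \cdot \mathbf{x}_i))\,\mathbf{x}_i$ as recalled from Kakade et al. [21]; the function names, the choice of a sigmoid link, and the returned list of iterates are illustrative choices, not taken from this paper.

```python
import math


def sigmoid(z):
    """Known link function u; monotone and 1/4-Lipschitz."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def square_loss(w, samples, u):
    """Empirical square-loss of the hypothesis x -> u(w . x)."""
    return sum((u(dot(w, x)) - y) ** 2 for x, y in samples) / len(samples)


def glmtron(samples, u, iterations):
    """Run the additive update and return every iterate (w^0, ..., w^T).

    samples: list of (x, y) pairs, x a list of floats with norm at most 1.
    No gradient of u is used; the residual y - u(w . x) drives the update.
    """
    m = len(samples)
    n = len(samples[0][0])
    w = [0.0] * n
    iterates = [list(w)]
    for _ in range(iterations):
        residuals = [y - u(dot(w, x)) for x, y in samples]
        w = [
            w[j] + sum(r * x[j] for r, (x, _) in zip(residuals, samples)) / m
            for j in range(n)
        ]
        iterates.append(list(w))
    return iterates
```

In practice one would pick the iterate with the smallest error on held-out data, since the analysis of such update rules typically guarantees that some iterate is good rather than the last one.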

Around the same time, Shalev-Shwartz, Shamir, and Sridharan [42] used kernel methods and general solvers for convex programs to give algorithms for learning a halfspace under a distributional assumption corresponding to a margin in the non-realizable setting (agnostic learning). Their kernel was composed by Zhang et al. [47] to obtain results for learning sparse neural networks with certain smooth activations, and Goel et al. [16] used a similar approach in conjunction with general tools from approximation theory to obtain learning results for a large class of nonlinear activations including ReLU and Sigmoid.

### 1.4 Our Algorithm

We combine the two above approaches into an algorithm called Alphatron that inherits the best properties of both: it is a simple, iterative update rule that does not require regularization, and it learns broad classes of networks whose first layer can be approximated via an appropriate feature expansion into an RKHS. It is crucial that we work in the probabilistic concept model. Even learning a single ReLU in the non-realizable or agnostic setting seems computationally intractable [16].

One technical challenge is handling the approximation error induced from embedding into an RKHS. In some sense, we must learn a noisy GLM. For this, we use a learning rate and a slack variable to account for noise and follow the outline of the analysis of GLMtron (or Isotron). The resulting algorithm is similar to performing gradient descent on the support vectors of a target element in an RKHS. Our convergence bounds depend on the resulting choice of kernel, learning rate, and quality of RKHS embedding. We can then leverage several results from approximation theory and obtain general theorems for two different notions of RKHS approximation: 1) the function class can be uniformly approximated by low-norm elements of a suitable RKHS or 2) the function class is separable (similar to the notion of a margin) by low-norm elements of an RKHS.

For generalizing uniform-distribution DNF learning algorithms to the probabilistic concept setting, we re-interpret the KM algorithm for finding large Fourier coefficients [22] as a query algorithm that gives accurate estimates of sparse, high-dimensional gradients. For the case of square-loss with respect to the uniform distribution on the hypercube, we can combine these estimates with a projection operator to learn smooth, monotone combinations of $L_1$-bounded functions (it is easy to see that DNF formulas fall into this class).

For Multiple Instance Learning (MIL), we observe that the problem formulation is similar to learning neural networks with two hidden layers. We consider two different types of MIL: deterministic (the Boolean label is a deterministic function of the instance labels) and probabilistic (the Boolean label is a random variable whose mean is a function of the instance labels). We make use of some further kernel tricks, notably the mean-map kernel, to obtain a compositional feature map for taking averages of sets of instances. This allows us to prove efficient run-time and sample complexity bounds that are, in some cases, independent of the bag size.

### 1.5 Related Work

The literature on provably efficient algorithms for learning neural networks is extensive. In this work we focus on common nonlinear activation functions: sigmoid, ReLU, or threshold. For linear activations, neural networks compute an overall function that is linear and can be learned efficiently using any polynomial-time algorithm for solving linear regression. Livni et al. [31] observed that neural networks of constant depth with constant degree polynomial activations are equivalent to linear functions in a higher dimensional space (polynomials of degree are equivalent to linear functions over monomials). It is known, however, that any polynomial that computes or even -approximates a single ReLU requires degree [16]. Thus, linear methods alone do not suffice for obtaining our results.

The vast majority of work on learning neural networks takes strong assumptions on either the underlying marginal distribution, the structure of the network, or both. Works that fall into these categories include [24]. In terms of assumption-free learning results, Goel et al. [16] used kernel methods to give an efficient, agnostic learning algorithm for sums of sigmoids (i.e., one hidden layer of sigmoids) with respect to any distribution on the unit ball.

Another line of work related to learning neural networks focuses on when local minima found by gradient descent are actually close to global minima. In order to give polynomial-time guarantees for finding a global minimum, these works require assumptions on the underlying marginal or the structure of the network (or both) [7]. All of the problems we consider in this paper are non-convex optimization problems, as it is known that a single sigmoid with respect to square-loss has exponentially many bad local minima [2].

For classical generalization and VC dimension bounds for learning neural networks we refer the reader to Anthony and Bartlett [1] (the networks we consider in this paper over the unit ball have VC dimension where is the number of hidden units in the first layer).

### 1.6 Organization

In the preliminaries we define the learning models we use and review the core tools we need from kernel methods and approximation theory. We then present our main algorithm, Alphatron, and give a proof of its correctness. We then combine Alphatron with various RKHS embeddings to obtain our most general learning results. Using these general results, we subsequently describe how to obtain all of our applications, including our main results for learning depth-three neural networks.

## 2 Preliminaries

Notation.

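Vectors are denoted by bold-face and $\|\cdot\|$ denotes the standard 2-norm of the vector. We denote the space of inputs by $\mathcal{X}$ and the space of outputs by $\mathcal{Y}$. In our paper, $\mathcal{X}$ is usually the unit sphere/ball and $\mathcal{Y}$ is $[0,1]$ or $\{0,1\}$. Standard scalar (dot) products are denoted by $\mathbf{a} \cdot \mathbf{b}$ for vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$, while inner products in a Reproducing Kernel Hilbert Space (RKHS) are denoted by $\langle \mathbf{a}, \mathbf{b} \rangle$ for elements $\mathbf{a}, \mathbf{b}$ in the RKHS. We denote the standard composition of functions $f_1$ and $f_2$ by $f_1 \circ f_2$.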

#### Learning Models

We consider two learning models in our paper, the standard Probably Approximately Correct (PAC) learning model and a relaxation of the standard model, the Probabilistic Concept (p-concept) learning model. For completeness, we define the two models and refer the reader to [45] for a detailed explanation.

Here we focus on square loss for p-concept since an efficient algorithm for square-loss implies efficient algorithms for various other standard losses.

#### Generalization Bounds

The following standard generalization bound based on Rademacher complexity is useful for our analysis. For background on Rademacher complexity, we refer the reader to [6].

For a linear concept class, the Rademacher complexity can be bounded as follows.

The following result is useful for bounding the Rademacher complexity of a smooth function of a concept class.

### 2.1 Kernel Methods

We assume the reader has a basic working knowledge of kernel methods (for a good resource on kernel methods in machine learning we refer the reader to [40]). We denote a kernel function by $\mathcal{K}(\mathbf{x}, \mathbf{x}') = \langle \psi(\mathbf{x}), \psi(\mathbf{x}') \rangle$ where $\psi$ is the associated feature map and $\mathcal{H}$ is the corresponding reproducing kernel Hilbert space (RKHS).

Here we define two kernels and a few of their properties that we will use for our analysis. First, we define a variant of the polynomial kernel, the multinomial kernel due to Goel et al. [16]:

It is easy to see that the multinomial kernel is efficiently computable. A multivariate polynomial of degree can be represented as an element . Also, every can be interpreted as a multivariate polynomial of degree such that

where coefficient is as follows,

Here, is used to index the corresponding entry in .
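The displayed definition of the multinomial kernel did not survive in this text, so the following is a sketch under an assumption: that the kernel of degree $d$ is the sum of powers of the dot product, $\sum_{i=0}^{d} (\mathbf{x} \cdot \mathbf{x}')^i$, with a feature map that has one coordinate per ordered tuple of at most $d$ input indices (as recalled from Goel et al. [16]). The function names are hypothetical. The explicit feature map is exponentially large and is included only to illustrate why the closed form is efficiently computable.

```python
from itertools import product


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def multinomial_kernel(x, y, d):
    """Closed form (assumed definition): sum_{i=0}^{d} (x . y)^i."""
    s = dot(x, y)
    return sum(s ** i for i in range(d + 1))


def multinomial_features(x, d):
    """Explicit feature map: one entry per ordered index tuple of length 0..d.

    The entry for (k_1, ..., k_j) is the monomial x[k_1] * ... * x[k_j];
    the empty tuple contributes the constant 1.
    """
    features = []
    for j in range(d + 1):
        for index_tuple in product(range(len(x)), repeat=j):
            value = 1.0
            for k in index_tuple:
                value *= x[k]
            features.append(value)
    return features
```

The inner product of two explicit feature vectors agrees with the closed form because summing $\prod_l x_{k_l} y_{k_l}$ over all ordered tuples of length $j$ factors into $(\mathbf{x} \cdot \mathbf{y})^j$.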

The following lemma is due to [16], following an argument of Shalev-Shwartz et al. [42]:

Remark. Observe that we can normalize the multinomial feature map such that for bounded space . More formally, where , hence we can normalize using this value. Subsequently, in the above, will need to be multiplied by the same value. For , the scaling factor is [16]. Throughout the paper, we will assume the kernel to be normalized as discussed.

For our results on Multiple Instance Learning, we make use of the following known kernel defined over sets of vectors:

Fact. If then .
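The kernel over sets referred to here is, per the introduction, the mean-map kernel. A minimal sketch follows, assuming the standard definition in which the feature vector of a bag is the average of the feature vectors of its instances, so that the kernel between two bags is the average of all pairwise base-kernel values. The function names and the linear base kernel are illustrative assumptions.

```python
def linear_kernel(a, b):
    """Example base kernel; any kernel on single instances could be used."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_map_kernel(bag_s, bag_t, kernel):
    """Average of the base kernel over all pairs (s, t) with s in S, t in T.

    This equals the inner product of the averaged feature vectors of the
    two bags, so it never requires the explicit feature map.
    """
    total = sum(kernel(s, t) for s in bag_s for t in bag_t)
    return total / (len(bag_s) * len(bag_t))
```

Because the bag's feature vector is an average, its norm is at most the largest norm of any instance's feature vector, which is the kind of bound that lets guarantees avoid a dependence on bag size.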

### 2.2 Approximation Theory

We will make use of a variety of tools from approximation theory to obtain specific embeddings of function classes into an RKHS. The following lemma for approximating the Boolean function was given by [8]:

The above lemma assumes takes on values , but a simple linear transformation also works for .

A halfspace is a Boolean function defined by a vector and threshold , given input , where is 0 if else 1. Given and halfspace over , is said to have margin with respect to if . Let for with be the halfspaces and equals the intersection. The following lemma due to Klivans and Servedio [27] gives a construction of a polynomial whose sign always equals an intersection of halfspaces with a margin. We give the proof as it is useful for a subsequent lemma.

We have . From the properties of Chebyshev polynomials, we know that for and for . Hence,

• If , for each , hence , implying .

• If , then there exists such that , thus . Also observe that . This implies .
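The argument above leans on two properties of Chebyshev polynomials whose exact statements were lost in this text. As a hedged illustration, the sketch below evaluates $T_r$ by the standard three-term recurrence and can be used to check two well-known facts of this kind: $|T_r(a)| \leq 1$ on $[-1, 1]$, and $T_r(1 + \delta) \geq 1 + r^2 \delta$ for $\delta \geq 0$. Whether these are the precise inequalities used in the proof is an assumption; the function name is hypothetical.

```python
def chebyshev(r, a):
    """Evaluate the degree-r Chebyshev polynomial of the first kind at a.

    Uses T_0(a) = 1, T_1(a) = a, T_{j+1}(a) = 2 a T_j(a) - T_{j-1}(a).
    """
    previous, current = 1.0, a
    if r == 0:
        return previous
    for _ in range(r - 1):
        previous, current = current, 2.0 * a * current - previous
    return current
```

The contrast between boundedness inside $[-1, 1]$ and rapid growth just outside it is what lets a low-degree polynomial amplify a margin.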

We extend the above lemma to give a threshold function for OR of a fixed halfspace with margin over a set of vectors. Let be a halfspace with and represent the conjunction over any set of vectors. Let over which the halfspace has margin.

Similar to the previous proof, we have . From the properties of Chebyshev polynomials, we know that for and for . Hence,

• If , for each hence , implying .

• If , then there exists such that , thus . Also observe that . This implies since .

Finally we state the following lemmas that bound the sum of squares of coefficients of a univariate polynomial:

We have . It follows that is bounded above by

## 3 The Alphatron Algorithm

Here we present our main algorithm, Alphatron, and a proof of its correctness. In the next section we will use this algorithm to obtain our most general learning results.
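The algorithm listing itself is not reproduced in this text, so the following is only a sketch of how such a kernelized update could look. It assumes the rule $\alpha_i \leftarrow \alpha_i + \frac{\lambda}{m}\,(y_i - u(\sum_j \alpha_j \mathcal{K}(\mathbf{x}_j, \mathbf{x}_i)))$ starting from $\alpha = 0$, followed by selection of the iterate with the smallest square error on a held-out sample. All function and parameter names are hypothetical, and the multinomial kernel shown uses the assumed (unnormalized) form $\sum_{i \leq d} (\mathbf{x} \cdot \mathbf{x}')^i$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multinomial_kernel(x, y, d):
    # Assumed form: sum of powers of the dot product up to degree d.
    s = sum(a * b for a, b in zip(x, y))
    return sum(s ** i for i in range(d + 1))


def predict(alpha, support, x, u, kernel):
    """Hypothesis value u(sum_i alpha_i * K(x_i, x))."""
    return u(sum(a * kernel(xi, x) for a, xi in zip(alpha, support)))


def square_error(alpha, support, data, u, kernel):
    return sum(
        (predict(alpha, support, x, u, kernel) - y) ** 2 for x, y in data
    ) / len(data)


def alphatron(train, validation, u, kernel, learning_rate, iterations):
    """Return (alpha, validation_error) for the best iterate found."""
    m = len(train)
    support = [x for x, _ in train]
    alpha = [0.0] * m
    best_alpha = list(alpha)
    best_error = square_error(alpha, support, validation, u, kernel)
    for _ in range(iterations):
        # Residuals are computed from the current iterate before any update.
        predictions = [predict(alpha, support, x, u, kernel) for x in support]
        alpha = [
            a + learning_rate / m * (y - p)
            for a, (_, y), p in zip(alpha, train, predictions)
        ]
        error = square_error(alpha, support, validation, u, kernel)
        if error < best_error:
            best_alpha, best_error = list(alpha), error
    return best_alpha, best_error
```

Only kernel evaluations between training points and the query point are needed, so the feature map is never formed explicitly; the coefficients `alpha` play the role of dual variables on the training sample.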

Define implying . Let and . It is easy to see that . Let be the empirical versions of the same.

The following theorem generalizes Theorem 1 of [21] to the bounded noise setting in a high dimensional feature space. We follow the same outline, and their theorem can be recovered by setting and as the zero function.

Let and . We will first prove the following lemma and subsequently use it to prove the theorem.

Expanding the left hand side of the equation above, we have

Here (Equation 1) follows from substituting the expression of , ( ?) follows from bounding and, ( ?) follows from being monotone and -Lipschitz, that is, . ( ?) follows from observing that the first term equals , the second term is bounded in norm by since the range of is and using the assumption .

We now bound as follows.

Here (Equation 4) follows by expanding the square and ( ?) follows by applying Jensen’s inequality to show that for all and vectors for , and subsequently using the fact that . Combining ( ?) and ( ?) gives us the result.

By definition we have that are zero mean iid random variables with norm bounded by . Using Hoeffding’s inequality (and the fact that the ’s are independent draws), with probability we have

Now using the previous lemma with and , we have

Thus, for each iteration of Alphatron, one of the following two cases needs to be satisfied,

Case 1:
Case 2: (assuming that and )

Let be the first iteration where Case 2 holds. We need to show that such an iteration exists. Assume the contrary, that is, Case 2 fails for each iteration. Since , however, in at most iterations Case 1 will be violated and Case 2 will have to be true. If then there exists such that

We need to bound in terms of . Define , and . Using Theorem ? and ? we have . By definition of Rademacher complexity, we have

Here, are iid Rademacher variables hence and are drawn iid from .

Recall that is an element of as (case 1 is satisfied in iteration ) and . A direct application of Theorem ? on with loss function , gives us the following bound on with probability ,

The last step is to show that we can indeed find a hypothesis satisfying the above guarantee. Since for all , is up to constants equal to we can do so by choosing the hypothesis with the minimum using a fresh sample set of size . This holds as given the sample size, by Chernoff bound using the fact that is bounded in , each for will have empirical error within of the true error with probability and hence all will simultaneously satisfy this with probability ,

Setting will give us the required bound.

Alphatron runs in time where is the time required to compute the kernel function .

## 4 Some General Theorems Involving Alphatron

In this section we use Alphatron to give our most general learnability results for the p-concept model and PAC learning setting. We then state several applications in the next section. Here we show that if a function can be uniformly approximated by an element of an appropriate RKHS then it is p-concept learnable. Similarly, if a function is separable by an element in an appropriate RKHS then it is PAC learnable. We assume that the kernel function is efficiently computable, that is, computable in polynomial time in the input dimension. Formally, we define approximation and separation as follows:

Combining Alphatron and the approximation (separation) guarantees, we have the following general learning results:

Let be the RKHS corresponding to and be the feature vector. Since is -approximated by kernel function , we have for . This implies that for some function . Thus . Applying Theorem ?, we have that Alphatron outputs a hypothesis such that

for some constants . Also Alphatron requires at most iterations. Setting gives us the required result.

Let be the vector that -separates and . Consider as follows:

Note that is monotone and 1-Lipschitz. Observe that,

• If then since .

• If then since .

From above, we can see that the samples drawn from the distribution satisfy . Thus we can apply Theorem ? with , and (for sufficiently large constant ) to obtain output hypothesis with (with probability ).

Recall that may be real-valued as Alphatron learns with square loss. Let us define to equal 1 if and 0 otherwise. We will show that . Using Markov’s inequality, we have . For , suppose . Since then clearly . Thus, . Scaling appropriately, we get the required result.
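The rounding step in this argument can be illustrated in a few lines. The sketch assumes the cutoff is $1/2$ (the exact value is missing from this text) and Boolean labels in $\{0,1\}$; the function names are hypothetical. The key observation is that whenever the rounded prediction disagrees with the label, the real-valued prediction is at least $1/2$ away from it, so each mistake contributes at least $1/4$ to the square error.

```python
def to_boolean(value, cutoff=0.5):
    """Round a real-valued prediction to a Boolean label (assumed cutoff 1/2)."""
    return 1 if value >= cutoff else 0


def zero_one_error(values, labels):
    """Fraction of examples where the rounded prediction is wrong."""
    mistakes = sum(1 for v, y in zip(values, labels) if to_boolean(v) != y)
    return mistakes / len(labels)


def square_error(values, labels):
    return sum((v - y) ** 2 for v, y in zip(values, labels)) / len(labels)
```

Consequently the classification error of the rounded hypothesis is at most four times the square error of the real-valued one, which is the empirical counterpart of the Markov-inequality step.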

## 5 Main Applications

The general framework of the previous section can be used to give new learning results for well studied problems. In this section we give polynomial time learnability results for depth-three neural networks with sigmoidal activations in the p-concept model. We follow this by showing how to obtain the first polynomial-time algorithms for PAC learning intersections/majorities of polynomially many halfspaces with a margin.

### 5.1 Learning Depth-3 Neural Networks

Following standard convention (see for example [41]), we define a neural network with one hidden layer (depth-2) with $k$ units as follows:

for , for , . We subsequently define a neural network with two hidden layers (depth-3) with one unit in hidden layer 2 and units in hidden layer 1 as:

for , for , and .
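Since the displayed definitions are missing from this text, the following sketch shows one plausible reading of the two network classes: a depth-2 network computing $\sum_{i=1}^{k} b_i\,\sigma(\mathbf{a}_i \cdot \mathbf{x})$ with sigmoid hidden units, and a depth-3 network applying an outer activation $u$ to that sum. The names `hidden_weights`, `output_weights`, and `u` are illustrative assumptions; per the introduction, weight vectors are taken to have unit norm, which the code does not enforce.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def depth_two(x, hidden_weights, output_weights):
    """One hidden layer of k sigmoid units followed by a linear combination."""
    return sum(
        b * sigmoid(sum(ai * xi for ai, xi in zip(a, x)))
        for a, b in zip(hidden_weights, output_weights)
    )


def depth_three(x, hidden_weights, output_weights, u):
    """A single second-layer unit u applied to the depth-2 output."""
    return u(depth_two(x, hidden_weights, output_weights))
```

Viewed this way, the depth-3 network is a monotone link function applied to a function that a kernel expansion can approximate, which is the structure the learning results exploit.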

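Goel et al. [16] showed that the activation functions sigmoid, $\sigma_{sig}(a) = \frac{1}{1+e^{-a}}$, and ReLU, $\sigma_{relu}(a) = \max(0, a)$, can be $\epsilon$-approximated by the multinomial kernel of degree $d$ for $d$ dependent on $\epsilon$. More formally, they showed the following: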

The following lemma extends the approximation guarantees to linear combinations of function classes.

We have for each , for some such that . Consider . We have ,

Also . Thus satisfies the required approximation.

The following theorem is our main result for learning classes of depth-three neural networks in polynomial time:

Combining Lemmas ? and ? we have that for activation function is -approximated by some kernel with and sufficiently large constant . Thus by Theorem ?, we have that there exists an algorithm that outputs a hypothesis such that, with probability ,

for some constants . Setting and gives us the required result (the claimed bounds on running time also follow directly from Theorem ?).

We also obtain results for networks of ReLUs, but the dependence on the number of hidden units, $\epsilon$, and $L$ is exponential (the algorithm still runs in polynomial time in the dimension):

Although our algorithm does not recover the parameters of the network, it still outputs a hypothesis with interpretable features. More specifically, our learning algorithm outputs the hidden layer as a multivariate polynomial. Given inputs , the hypothesis output by our algorithm Alphatron is of the form where and is dependent on required approximation. As seen in the preliminaries, can be expressed as a polynomial and the coefficients can be computed as follows,

Here, we follow the notation from [16]; maps ordered tuple for to tuple such that and maps ordered tuple to the number of distinct orderings of the ’s for . The function can be computed from the multinomial theorem (cf. [46]). Thus, the coefficients of the polynomial can be efficiently indexed. Informally, each coefficient can be interpreted as the correlation between the target function and the product of features appearing in the coefficient’s monomial.
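The random access to coefficients described here can be sketched as follows. The sketch assumes the hidden layer is the kernel expansion $f(\mathbf{x}) = \sum_i \alpha_i \sum_{j=0}^{d} (\mathbf{x}_i \cdot \mathbf{x})^j$ (the assumed unnormalized multinomial kernel), so that by the multinomial theorem the coefficient of the monomial with exponent vector $(e_1, \ldots, e_n)$, $j = \sum_l e_l \leq d$, is $\sum_i \alpha_i \frac{j!}{e_1! \cdots e_n!} \prod_l x_{i,l}^{e_l}$. The function names and the exponent-vector indexing are illustrative and differ from the ordered-tuple notation of [16].

```python
from itertools import product
from math import factorial


def kernel_expansion(x, alpha, support, d):
    """f(x) = sum_i alpha_i * sum_{j=0}^{d} (x_i . x)^j."""
    total = 0.0
    for a, xi in zip(alpha, support):
        s = sum(u * v for u, v in zip(xi, x))
        total += a * sum(s ** j for j in range(d + 1))
    return total


def monomial_coefficient(exponents, alpha, support, d):
    """Coefficient of prod_l x_l^{e_l} in f, computed without expanding f."""
    degree = sum(exponents)
    if degree > d:
        return 0.0
    orderings = factorial(degree)
    for e in exponents:
        orderings //= factorial(e)  # number of distinct orderings
    total = 0.0
    for a, xi in zip(alpha, support):
        term = float(orderings)
        for e, v in zip(exponents, xi):
            term *= v ** e
        total += a * term
    return total


def evaluate_polynomial(x, alpha, support, d):
    """Sum coefficient * monomial over all exponent vectors of degree <= d."""
    total = 0.0
    for exponents in product(range(d + 1), repeat=len(x)):
        monomial = 1.0
        for e, v in zip(exponents, x):
            monomial *= v ** e
        total += monomial_coefficient(exponents, alpha, support, d) * monomial
    return total
```

Each coefficient costs time linear in the number of training points and the dimension, even though the full polynomial has exponentially many monomials in $d$; `evaluate_polynomial` enumerates them all only to confirm that the coefficients reproduce the kernel expansion.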

### 5.2 Learning Smooth Functions of Halfspaces with a Margin

In this section we consider the problem of learning a smooth combining function of halfspaces with a margin . We assume that all examples lie on the unit ball and that for each weight vector , . For simplicity we also assume each halfspace is origin-centered, i.e. (though our techniques easily handle the case of nonzero ).

We use Lemma ? to show the existence of polynomial of degree such that for , and for , .

Since for each , , we have such that is bounded in by . From Lemma ? and ?, we have that for each ,