Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks

# Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks

Jason M. Klusowski   Andrew R. Barron

Department of Statistics
Yale University
New Haven, CT, USA

Email: {jason.klusowski, andrew.barron}@yale.edu
###### Abstract

Let be a function on with an assumption of a spectral norm . For various noise settings, we show that , where is the sample size and is either a penalized least squares estimator or a greedily obtained version of such using linear combinations of sinusoidal, sigmoidal, ramp, ramp-squared or other smooth ridge functions. The candidate fits may be chosen from a continuum of functions, thus avoiding the rigidity of discretizations of the parameter space. On the other hand, if the candidate fits are chosen from a discretization, we show that .

This work bridges non-linear and non-parametric function estimation and includes single-hidden layer nets. Unlike past theory for such settings, our bound shows that the risk is small even when the input dimension of an infinite-dimensional parameterized dictionary is much larger than the available sample size. When the dimension is larger than the cube root of the sample size, this quantity is seen to improve the more familiar risk bound of , also investigated here.

## 1 Introduction

Functions in are approximated using linear combinations of ridge functions with one layer of nonlinearities. These approximations are employed via functions of the form

 fm(x)=fm(x,ζ)=m∑k=1ckϕ(ak⋅x+bk), (1)

which is parameterized by the vector , consisting of in , and in for , where is the number of nonlinear terms. Models of this type arise with considerable freedom in the choice of the activation function , ranging from general smooth functions of projection pursuit regression [17] to the unit step sigmoid and ramp functions of single-hidden layer neural nets [8, 6, 7, 13, 27].

Our focus in this paper is on the case that is a fixed Lipschitz function (such as a sigmoid or ramp or sinusoidal function), though some of our conclusions apply more generally. For these activation functions, we will obtain statistical risk conclusions using a penalized least squares criterion. We obtain generalization error bounds for these by balancing the approximation error and descriptive complexity. The most general form of our bounds hold for quite general non-linear infinite dictionaries. A hallmark of our conclusions is to lay bare how favorable risk behavior can be obtained as long as the logarithm of the number of parameters relative to sample size is small. This entails a slower rate of convergence through a rate that is smaller than what is cemented in traditional cases, but leads to better results than these earlier bounds would permit in certain very high-dimensional situations. From an applied perspective, good empirical performance of neural net (and neural net like) models has been reported as in [26] even when is much larger than , though theoretical understanding has been lacking. Returning to the case of a single layer of nonlinearly parameterized function, it is useful to view the representation (1) as

 ∑hβhh(x), (2)

where the are a selection of functions from the infinite library of functions of the form for real vector and the are coefficients of linear combination of in the library. These representations are single hidden-layer networks. Deep network approximations are not very well understood. Nevertheless our results generalize provided some of our arguments are slightly modified.

We can reduce (1) to (2) as follows. Suppose the library is symmetric and contains the zero function. Without loss of generality, we may assume that the or are non-negative by replacing the associated with , that by assumption also belongs to . One can assume the internal parameterization take the form by appending a coordinate of constant value to and a coordinate of value to the vector . Note that now and are -dimensional.

We will take advantage of smoothness of activation function (assumption that either is Lipschitz or that its first derivative is Lipschitz). Suppose is an arbitrary probability measure on . Let be the norm induced by the inner product . For a symmetric collection of dictionary elements containing the zero function, we let be the linear span of .

The variation of with respect to (or the atomic norm of with respect to ) is defined by

 limδ↓0inffδ∈F{∥β∥1:fδ=∑h∈Hβhhand∥fδ−f∥≤δ,βh∈R+},

where . For functions in , this variation picks out the smallest among representations . In the particular case that , we have . For functions in the closure of the linear span of , the variation is the smallest limit of such norms among functions approaching the target. The subspace of functions with finite is denoted . Such variation control provides for approximation (opportunity) for dimension independent rates of order with an term approximation.

It is fruitful to discuss spectral conditions for finite variation for various choices of . To this end, define , for . If has a bounded domain in and a Fourier representation with , it is possible to use approximating functions of the form (1) with a single activation function . Such activation functions can be be general bounded monotone functions. We use for vectors in and for scalars such as . As we have said, to obtain risk bounds in later sections, we will assume that either is bounded Lipschitz or that, additionally, its derivative is Lipschitz. These two assumptions are made precise in the following statements.

###### Assumption 1.

The activation function has norm at most one and satisfies

 |ϕ(z)−ϕ(~z)|≤L1|z−~z|,

for all in and for some positive constant .

###### Assumption 2.

The activation function has norm at most one and satisfies

 |ϕ(z)−ϕ(~z)|≤L1|z−~z|,

and

 |ϕ′(z)−ϕ′(~z)|≤L2|z−~z|,

for all in and for some positive constants and .

In particular, Assumption 2 implies that

 |ϕ(z)−ϕ(~z)−(z−~z)ϕ′(~z)|≤12(z−~z)2L2,

for all in .

A result from [7] provides a useful starting point for approximating general functions by linear combinations of such objects. Suppose is finite. Then by [7] the function has finite variation with respect to step functions and, consequently, there exists an artificial neural network of the form (1) with , , and such that, if a suitable constant correction is subtracted from , then

 ∥f⋆−fm∥2≤v2f⋆,1m.

In particular, minus a constant correction has variation less than .

If has right at left limits and , respectively, the fact that as allows one to use somewhat arbitrary activation functions as basis elements. For our results, it in undesirable to have unbounded weights. Accordingly, it is natural to impose a restriction on the size of the internal parameters and to also enjoy a certain degree of smoothness not offered by step functions. Although, it should be mentioned that classical empirical process theory allows one to obtain covering numbers for indicators of half-spaces (which are scale invariant in the size of the weights) by taking advantage of their combinatorial structure [3]. Nevertheless, we adopt the more modern approach of working with smoothly parameterized dictionaries. In this direction, we consider the result in [13], which allows one to approximate by linear combinations of ramp ridge functions (also known as first order ridge splines or hinging hyper-planes) , with , .

The ramp activation function (also called a lower-rectified linear unit or ReLU) is currently one of the most popular form of artificial neural network activation functions, particularly because it is continuous and Lipschitz. In particular, it satisfies the conditions of Assumption 1 with depending on the size of its domain. In Theorem 6, we refine a result from [13]. For an arbitrary target function with finite has finite variation with respect to the ramp functions and, consequently, there exists an approximation of the form (1) activated by ridge ramp functions with and such that if a suitable linear correction is subtracted from , then

 ∥f⋆−fm∥2≤16v2f⋆,2m. (3)

In particular, minus a linear correction has variation less than .

The second order spline , which may also be called ramp-squared, satisfies the conditions of Assumption 2 with constants and depending on the size of its domain. Likewise, in Theorem 6, we show that for an arbitrary target function with finite a quadratically corrected has finite variation with respect to the second order splines, there exists an approximation of the form (1) activated by second order ridge splines with and such that, if a suitable quadratic correction is subtracted from , then

 ∥f⋆−fm∥2≤16v2f⋆,3m. (4)

In particular, minus a quadratic correction has variation less than .

For integer , we define the infinite dictionary

 Hs={x↦±(α⋅x−t)s−1+:∥α∥1=1,|t|≤1}.

We then set to be the linear span of . With this notation, .

The condition ensures that (corrected by a -th degree ridge polynomial) belongs to and . Functions with moderate variation are particularly closely approximated. Nevertheless, even when is infinite, we express the trade-offs in approximation accuracy for consistently estimating functions in the closure of the linear span of .

In what follows, we assume that the internal parameters have norm at most . Likewise, we assume that so that . This control on the size of the internal parameters will be featured prominently throughout. In the case of spline activation functions, we are content with the assumption . Note that if one restricts the size of the domain and internal parameters (say, to handle polynomials), the functions are still bounded and Lipschitz but with possibly considerably worse constants.

Suppose data are independently drawn from the distribution of . To produce predictions of the real-valued response from its input , the target regression function is to be estimated. The function is assumed to be bounded in magnitude by a positive constant . We assume the noise has moments (conditioned on ) that satisfy a Bernstein condition with parameter . That is, we assume

 E(|ε|k|X)≤12k!ηk−2V(ε|X),k=3,4,…,

where . This assumption is equivalent to requiring that is uniformly bounded in for some . A stricter assumption is that is uniformly bounded in , which corresponds to an error distribution with sub-Gaussian tails. These two noise settings will give rise to different risk bounds, as we will see.

Because is bounded in magnitude by , it is useful to truncate an estimator at a level at least . Depending on the nature of the noise , we will see that will need to be at least plus a term of order or . We define the truncation operator that acts on function in by . Associated with the truncation operator is a tail quantity that appears in the following analysis and our risk bounds have a term, but this will be seen to be negligible when compared to the main terms. The behavior of is studied in Lemma 12.

The empirical mean squared error of a function as a candidate fit to the observed data is . Given the collection of functions , a penalty , , and data, a penalized least squares estimator arises by optimizing or approximately optimizing

 (1/n)n∑i=1(Yi−f(Xi))2+penn(f)/n. (5)

Our method of risk analysis proceeds as follows. Given a collection of candidate functions, we show that there is a countable approximating set of representations , variable-distortion, variable-complexity cover of , and a complexity function , with the property that for each in , there is an in such that is not less than a constant multiple of , where is a constant (depending on , , , and ) and is given as a suitable empirical measure of distortion (based on sums of squared errors). The variable-distortion, variable-complexity terminology has its origins in [10, 11, 15]. The task is to determine penalties such that an estimator approximately achieving the minimum of satisfies

 E∥T^f−f⋆∥2≤cinff∈F{∥f−f⋆∥2+Epenn(f)/n}, (6)

for some universal . Valid penalties take different forms depending on the size of the effective dimension relative to the sample size and smoothness assumption of .

• When is large compared to and if satisfies Assumption 1, a valid penalty divided by sample size is at least

 16vf(γnB2nv20log(d+1)n)1/4+8(γnB2nv20log(d+1)n)1/2+Tnn. (7)
• When the noise is zero and is large compared to and if satisfies Assumption 1, a valid penalty divided by sample size is at least

 16v4/3f(γnv20log(d+1)n)1/3+4(v4/3f+1)(γnv20log(d+1)n)2/3. (8)
• When is large compared to and if satisfies Assumption 2, a valid penalty divided by sample size is at least of order

 v4/3f(γnv20log(d+1)n)1/3+vf(1nn∑i=1|Yi|)(γnv20log(d+1)n)1/3+Tnn. (9)
• When is small compared to and if satisfies Assumption 1, a valid penalty divided by sample size is at least

 60vfv0(dγnlog(n/d+1)n)1/2+1/2(d+3)+1v20(dγnlog(n/d+1)n)1/2+1/2(d+3) +(dγnlog(n/d+1)n)1/2+3/2(d+3)+dγnlog(n/d+1)n+Tnn. (10)

Here and for some and .

Accordingly, if belongs to , then is not more than a constant multiple of the above penalties with replaced by .

In the single-hidden layer case, we have the previously indicated quantification of the error of approximation . Nevertheless, the general result (6) allows us to likewise say that the risk for multilayer networks will be at least as good as the deep network approximation capability will permit. The quantity

 inff∈F{∥f−f⋆∥2+Epenn(f)/n}.

is an index of resolvability of by functions with sample size . We shall take particular advantage of such risk bounds in the case that does not depend on . Our restriction of to is one way to allow the construction of such penalties.

The following table expresses the heart of our results in the case of penalty based on the norm of the outer layer coefficients of one-hidden layer networks expressible through (subject to constraints on the inner layer coefficients). These penalties also provide risk bounds for moderate and high-dimensional situations.

The results we wish to highlight are contained in the first two rows of Table 1. The penalties as stated are valid up to modest universal constants and negligible terms that do not depend on the candidate fit. The quantity is of order in the sub-exponential noise case, order in the sub-Gaussian noise case and of constant order in the zero noise case. This (as defined in Lemma 12) depends on the variance bound , Bernstein parameter , the upper bound of , and the noise tail level of the indicated order.

When belongs to , a resulting valid risk bound is a constant multiple of or , according to the indicated cases. In this way the expression provides a rate of convergence. Thus the columns of Table 1 provide valid risk bounds for these settings.

The classical risk bounds for mean squared error, involving to some power, are only useful when the sample size is much larger than the dimension. Here, in contrast, in the first two lines of Table 1, we see the dependence on dimension is logarithmic, permitting much smaller sample sizes. These results are akin to those obtained in [31] (where the role of the dimension there is the size of the dictionary) for high-dimensional linear regression. However, there is an important difference. Our dictionary of non-linear parameterized functions is infinite dimensional. For us, the role of is the input dimension, not the size of the dictionary. The richness of is largely determined by the sizes of and and more flexibly represents a larger class of functions.

The price we pay for the smaller dependence on input dimension is a deteriorated rate with exponent in general and under slightly stronger smoothness assumptions on , rather than the familiar exponents of .

The rate in the last row improves upon the familiar exponent of to . Note that when is large, this enhancement in the exponent is negligible. The rate in the first row is better than the third approximately for , the second is better than the third row approximately for , and both of these first two rows have risk tending to zero as long as .

For functions in , an upper bound of for the squared error loss is obtained in [8]. The squared error minimax rates for functions in [33], was determined to be between

 (1/n)1/2+1/(2(d+1))(logn)−(1+1/d)(1+2/d)(1+2/d)(2+1/d)5

and

 (logn/n)1/2+1/(2(2d+1)).

Using the truncated penalized least squares estimator (6), we obtain an improved rate of order , where is logarithmic in , using techniques that originate in [20] and [19], with some corrections here.

## 2 How far from optimal?

For positive , let

 Dv0=Dv0,ϕ={ϕ(θ⋅x−t),x∈B:∥θ∥1≤v0,t∈R} (11)

be the dictionary of all such inner layer ridge functions with parameter restricted to the ball of size and variables restricted to the cube . The choice of the norm on the inner parameters is natural as it corresponds to for . Let be the closure of the set of all linear combinations of functions in with norm of outer coefficients not more than . For any class of functions on , the minimax risk is

 Rn,d(F)=inf^fsupf∈FE∥f−^f∥2, (12)

Consider the model for , where and . It was determined in [25], that for , roughly corresponding to ,

 Rn,d(Fv0,v1,sine)≥C(v0v21log(1+d/v0)n)1/2, (13)

and for ,

 Rn,d(Fv0,v1,sine)≥C(dv21log(1+v0/d)n)1/2. (14)

These lower bounds are similar in form to the risk upper bounds that are implied from the penalties in Table 2. These quantities have the attractive feature that the rate (the power of ) remains at least as good as or even as the dimension grows. However, rates determined by (14) and the last line in Table 2 are only useful provided is small. In high dimensional settings, the available sample size might not be large enough to ensure this condition.

These results are all based on obtaining covering numbers for the library . If satisfies a Lipschitz condition, these numbers are equivalent to covering numbers of the internal parameters or of the Euclidean inner product of the data and the internal parameters. The factor of multiplying the reciprocal of the sample size is produced from the order log cardinality of the standard covering of the library . What enables us to circumvent this polynomial dependence on is to use an alternative cover of that has log cardinality of order . Misclassification errors for neural networks with bounded internal parameters have been analyzed in [12, 3, 27] (via Vapnik-Chervonenkis dimension and its implications for covering numbers). Unlike the setup considered here, past work [8, 5, 33, 20, 9, 11, 27, 14, 4, 34, 24] has not investigated the role of such restricted parameterized classes in the determination of suitable penalized least squares criterion for non-parametric function estimation. After submission of the original form of this work, our results have been put to use in [32] to give risk statements about multi-layer (deep) networks activated by ramp functions.

## 3 Computational aspects

From a computational point of view, the empirical risk minimization problem (5) is highly non-convex, and it is unclear why existing algorithms like gradient descent or back propagation are empirically successful at learning the representation (1). There are relatively few rigorous results that guarantee learning for regression models with latent variables, while keeping both the sampling and computational complexities polynomial in and . Here we catalogue some papers that make progress toward developing a provably good, computationally feasible estimation procedure. Most of them deal with parameter recovery and assume that has exactly the form (1). Using a theory of tensor decompositions from [1], [22] apply the method of moments via tensor factorization techniques to learn mixtures of sigmoids, but they require a special non-degeneracy condition on the activation function. It is assumed that the input distribution is known apriori. In [37], the authors use tensor initialization and resampling to learn the parameters in a representation of the form (1) with smooth that has sample complexity and computation complexity .

In [21], the authors estimate the gradient of the regression function (where is Gaussian and is the logistic sigmoid) at a set of random points, and then cluster the estimated gradients. They prove that the estimated gradients concentrate around the internal parameter vectors. However, unless the weights of the outer layer are positive and sum to , the complexity is exponential in . In [2], it was shown that for a randomly initialized neural network with sufficiently many hidden units, the generic gradient descent algorithm learns any low degree polynomial. Learning non-linear networks through multiple rounds of random initialization followed by arbitrary optimization steps was proposed in [35]. In [36], an efficiently learned kernel based estimator was shown to perform just as well as a class of deep neural networks. However, its ability to well-approximate general conditional mean regression functions is unclear.

The next section discusses an iterative procedure that reduces the complexity of finding the penalized least squares estimator (5).

## 4 Greedy algorithm

The main difficulty with constructing an estimator that satisfies (6) is that it involves a -dimensional optimization. Here, we outline a greedy approach that reduces the problem to performing -dimensional optimizations. This construction is based on the -penalized greedy pursuit (LPGP) in [20], with the modification that the penalty can be a convex function of the candidate function complexity. Greedy strategies for approximating functions in the closure of the linear span of a subset of a Hilbert space has its origins in [23] and many of its statistical implications were studied in [9] and [20].

Let be a function, not necessarily in . Initialize . For iteratively, given the terms of as and the coefficients of it as , we proceed as follows. Let , with the term in chosen to come within a constant factor of the maximum inner product with the residual ; that is

 ⟨hm,f⋆−fm−1⟩≥1csuph∈H⟨h,f⋆−fm−1⟩.

Define . Associated with this representation of is the norm of its coefficients . The coefficients and are chosen to minimize .

In the empirical setting, with , the high-dimensional optimization task is to find such that

 1nn∑i=1Riϕ(θm⋅Xi)≥1csupθ1nn∑i=1Riϕ(θ⋅Xi)

The fact that one does not need to find the exact maximizer of the above empirical inner product, but only come to within a constant multiple of it, has important consequences. For example, in adaptive annealing, one begins by sampling from an initial distribution and then iteratively samples from a distribution proportional to , evolving according to , where satisfies . The mean of is at least for sufficiently large .

###### Theorem 1.

Suppose is a real-valued non-negative convex function. If is chosen according to the greedy scheme described previously, then

 ∥f⋆−fm∥2+w(vm)≤inff∈F{∥f⋆−f∥2+w(cvf)+4bfm}, (15)

where . Furthermore, for all ,

 ∥f⋆−fm∥2+w(vm) ≤inff∈Finfδ>0⎧⎨⎩(1+δ)∥f⋆−f∥2+w(cvf)+4(1+δ)δ−1(c+1)2v2fm⎫⎬⎭, (16)

and hence with ,

 ∥f⋆−fm∥2+w(vm)≤inff∈F⎧⎨⎩(∥f⋆−f∥+2(c+1)vf√m)2+w(cvf)⎫⎬⎭.
###### Proof.

Fix any in the linear span , with the form , with non-negative and set

 em=∥f⋆−fm∥2−∥f⋆−f∥2+w(vm).

From the definition of and as minimizers of for each , and the convexity of ,

 em =∥f⋆−(1−αm)fm−1−βm,mhm∥2−∥f⋆−f∥2+ w((1−αm)vm−1+βm,m) ≤∥f⋆−(1−αm)fm−1−αmcvfhm∥2−∥f⋆−f∥2+ w((1−αm)vm−1+αmcvf) ≤∥f⋆−(1−αm)fm−1−αmcvfhm∥2−∥f⋆−f∥2+ (1−αm)w(vm−1)+αmw(cvf).

Now is equal to . Expanding this quantity leads to

 ∥f⋆−(1−αm)fm−1−αmcvfhm∥2 =(1−αm)2∥f⋆−fm−1∥2 −2αm(1−αm)⟨f⋆−fm−1,chmvf−f⋆⟩ +α2m∥f⋆−chmvf∥2.

Next we add to this expression to obtain

 em ≤(1−αm)em−1+α2m[∥f⋆−chmvf∥2−∥f⋆−f∥2]+αmw(cvf) −2αm(1−αm)⟨f⋆−fm−1,chmvf−f⟩ +αm(1−αm)[2⟨f⋆−fm−1,f⋆−f⟩−∥f⋆−fm−1∥2−∥f⋆−f∥2]. (17)

The expression in brackets in (17) is equal to and hence the entire quantity is further upper bounded by

 em ≤(1−αm)em−1+α2m[∥f⋆−chmvf∥2−∥f⋆−f∥2]+αmw(cvf) −2αm(1−αm)⟨f⋆−fm−1,chmvf−f⟩.

Consider a random variable that equals with probability having mean . Since a maximum is at least an average, the choice of implies that is at least . This shows that is no less than . Expanding the squares in and using the Cauchy-Schwarz inequality yields the bound . Since and , we find that is at most . Hence we have shown that

 e1≤bf+w(cvf)

and

 em≤(1−αm)em−1+α2mbf+αmw(cvf). (18)

Because is a minimizer of , it can replace it by any value in and the bound (18) holds verbatim. In particular, we can choose , and use an inductive argument to establish (15). The second statement (1) follows from similar arguments upon consideration of

 em=∥f⋆−fm∥2−(1+δ)∥f⋆−f∥2+w(vm),

together with the inequality . ∎

## 5 Risk bounds

### 5.1 Penalized estimators over a discretization of the parameter space

In the case that is an